How are parameters updated with the SGD optimizer?

So I have found a formula describing stochastic gradient descent (SGD):
θ = θ - η*∇L(θ;x,y)
where θ is a parameter, η is the learning rate and ∇L() is the gradient of the loss function. But what I don't get is how the parameter θ (which should be the weights and biases) can be updated mathematically. Is there a mathematical interpretation of the parameter θ?
Thanks for any answers.

That formula applies to both gradient descent and stochastic gradient descent (SGD). The difference between the two is that in SGD the loss is computed over a random subset of the training data (i.e. a mini-batch/batch) as opposed to computing the loss over all the training data as in traditional gradient descent. So in SGD x and y correspond to a subset of the training data and labels, whereas in gradient descent they correspond to all the training data and labels.
θ represents the parameters of the model. Mathematically this is usually modeled as a vector containing all the parameters of the model (all the weights, biases, etc...) arranged into a single vector. When you compute the gradient of the loss (a scalar) w.r.t. θ you get a vector containing the partial derivative of loss w.r.t. each element of θ. So ∇L(θ;x,y) is just a vector, the same size as θ. If we were to assume that the loss were a linear function of θ, then this gradient points in the direction in parameter space that would result in the maximal increase in loss with a magnitude that corresponds to the expected increase in loss if we took a step of size 1 in that direction. Since loss isn't actually a linear function and we actually want to decrease loss we instead take a smaller step in the opposite direction, hence the η and minus.
It's also worth pointing out that, mathematically, the form you've given is a bit problematic. We wouldn't usually write it like this, since assignment and equality aren't the same thing. The equation as written would seem to imply that the θ on the left-hand and right-hand sides of the equation is the same. It is not: the θ on the left side of the equals sign represents the value of the parameters after taking a step, and the θs on the right side correspond to the parameters before taking a step. We can be clearer by writing it with subscripts
θ_{t+1} = θ_{t} - η*∇L(θ_{t};x,y)
where θ_{t} is the parameter vector at step t and θ_{t+1} is the parameter vector one step later.
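To make the update concrete, here is a minimal MATLAB sketch of the training loop; sampleMiniBatch() and gradLoss() are hypothetical helpers standing in for your data pipeline and your gradient computation:
theta = randn(n, 1);    % all weights and biases flattened into one vector (n given)
eta = 0.01;             % learning rate
for t = 1:numSteps
    [xb, yb] = sampleMiniBatch(X, Y);  % hypothetical: draw a random subset of data and labels
    g = gradLoss(theta, xb, yb);       % gradient of the loss w.r.t. theta, same size as theta
    theta = theta - eta * g;           % the update rule from above
end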

Related

How to model scalar values with a neural network if besides direction the magnitude matters too

Say you want to predict temperature changes based on some input data. Temperature changes are positive or negative scalars with a mean of zero. If only the direction mattered, one could just use tanh as the activation function in the output layer. But say that for delta-temperatures, predicting the magnitude of the change is also important, not just the sign.
How would you model this output? Tanh doesn't seem to be a good choice, because it gives values between -1 and 1. And say temperature changes have a Gaussian, or some other weird, distribution, so hovering around the quasi-linear domain at the centre of tanh (near 0) would be difficult for a neural network to learn. I'm worried that the sign would be good but the magnitude output would be useless.
How about having the network output one-hot vectors of length N and treating the argmax of this output vector as a temperature change on a pre-defined window? Say the window is -30 to +30 degrees: using an N = 60 one-hot vector, argmax(output) = 45 means the prediction is about +15 degrees.
I was actually not sure how to search for this topic.
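For what it's worth, the binning arithmetic described above would look something like this, with output standing in for the network's N-element output vector:
N = 60;
window = [-30, 30];                           % degrees
binWidth = (window(2) - window(1)) / N;       % 1 degree per bin
[~, idx] = max(output);                       % argmax over the N outputs
deltaT = window(1) + (idx - 0.5) * binWidth;  % bin centre; idx = 45 gives about +15 degrees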

Stacked Sparse Autoencoder parameters

I work on Stacked Sparse Autoencoders using MATLAB.
Can anyone please suggest what values should be taken for
Stacked Sparse Autoencoder parameters:
L2 Weight Regularization (Lambda)
Sparsity Regularization (Beta)
Sparsity proportion (Rho).
It is important to realise that there are NO OBVIOUS VALUES for the hyperparameters. The optimal values will vary depending on the data you're modeling: you'll have to try them on your data.
From sparseAutoencoder: Lambda (λ) is the coefficient of the weight decay term, which discourages the weights from reaching large values, since large weights can lead to overfitting. The weight decay term (or weight regularization term) is part of the cost function, like the sparsity term explained below.
Rho (ρ) is the sparsity constraint, which controls the average activation of the hidden units. It is included so that the autoencoder still works even when the number of hidden units is large relative to the number of input units. For example, if the input size is 100 and the hidden size is 100 or larger (or even slightly smaller), the output can be reconstructed without any loss, since the hidden units can simply learn the identity function. Beta (β) is the coefficient of the sparsity term, which is also part of the cost function; it controls the relative importance of the sparsity term. Lambda and Beta specify the relative importance of their terms in the cost function.
Example: You can take a look at this example, where the parameter values are selected as follows.
sparsityParam = 0.1; % desired average activation of the hidden units.
% (This was denoted by the Greek letter rho, which looks like a lower-case "p",
% in the lecture notes).
lambda = 3e-3; % weight decay parameter
beta = 3; % weight of sparsity penalty term
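For intuition, here is a rough sketch of how these three hyperparameters typically enter the cost function, assuming a KL-divergence sparsity penalty as in the UFLDL notes; x, xhat, W, and rhoHat are placeholders for the input batch, its reconstruction, the weight matrix, and the mean hidden activations:
reconErr = 0.5 * mean(sum((xhat - x).^2, 1));  % reconstruction (data) term
weightDecay = (lambda/2) * sum(W(:).^2);       % L2 weight decay term, weighted by lambda
kl = rho*log(rho./rhoHat) + (1-rho)*log((1-rho)./(1-rhoHat)); % KL(rho || rhoHat) per hidden unit
sparsity = beta * sum(kl);                     % sparsity term, weighted by beta
cost = reconErr + weightDecay + sparsity;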
But once again, I want to remind you that there are NO OBVIOUS VALUES for the hyperparameters.

Mixture of 1D Gaussians fit to data in Matlab / Python

I have a discrete curve y = f(x). I know the locations and amplitudes of its peaks. I want to approximate the curve by fitting a Gaussian at each peak. How should I go about finding the optimized Gaussian parameters? I would like to know if there is any built-in function which will make my task simpler.
Edit
I have fixed the means of the Gaussians and tried to optimize over sigma using lsqcurvefit() in MATLAB. The MSE is small. However, I have an additional hard constraint: the value of the approximate curve should be equal to the original function at the peaks. This constraint is not satisfied by my model. I am pasting the current working code here. I would like a solution which obeys the hard constraint at the peaks and approximately fits the curve at the other points. The basic idea is that the approximate curve has fewer parameters but still closely resembles the original curve.
fun = @(x,xdata)myFun(x,xdata,pks,locs); % pks and locs are the peak amplitudes and locations already available
x0 = w(1:6)*0.25; % my initial guess based on domain knowledge
[sigma, resnorm] = lsqcurvefit(fun,x0,xdata,ydata); % xdata and ydata are the original curve data points
recons = myFun(sigma,xdata,pks,locs);
figure; plot(ydata,'r'); hold on; plot(recons);

function f = myFun(sigma,xdata,a,c)
% a holds the fixed amplitudes, c the fixed means of the individual Gaussians
f = zeros(size(xdata));
for i = 1:6 % use 6 Gaussians to approximate the function
    f = f + a(i) * exp(-(xdata-c(i)).^2 ./ (2*sigma(i)^2));
end
end
If you know your peak locations and amplitudes, then all you have left to do is find the width of each Gaussian. You can think of this as an optimization problem.
Say you have x and y, which are samples from the curve you want to approximate.
First, define a function g() that will construct the approximation for given values of the widths. g() takes a parameter vector sigma containing the width of each Gaussian. The locations and amplitudes of the Gaussians will be constrained to the values you already know. g() outputs the value of the sum-of-gaussians approximation at each point in x.
Now, define a loss function L(), which takes sigma as input. L(sigma) returns a scalar that measures the error, i.e. how badly the approximation given by sigma differs from the curve you're trying to approximate. The squared error is a common loss function for curve fitting:
L(sigma) = sum((y - g(sigma)) .^ 2)
The task now is to search over possible values of sigma, and find the choice that minimizes the error. This can be done using a variety of optimization routines.
If you have the MathWorks Optimization Toolbox, you can use the function lsqnonlin() (in this case you won't have to define L() yourself). The Curve Fitting Toolbox is probably an alternative. Otherwise, you can use an open-source optimization routine (check out cvxopt).
A couple of things to note. You need to impose the constraint that all values in sigma are greater than zero; you can tell the optimization algorithm about this constraint, as sketched below. Also, you'll need to specify an initial guess for the parameters (i.e. sigma). In this case, you could probably choose something reasonable by looking at the curve in the vicinity of each peak. It may be the case (when the loss function is nonconvex) that the final solution differs depending on the initial guess (i.e. you converge to a local minimum). There are many fancy techniques for dealing with this kind of situation, but a simple thing to do is just to try multiple different initial guesses and pick the best result.
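As a sketch, reusing fun and x0 from the question, lsqcurvefit's bound arguments are enough to keep every sigma positive:
lb = 1e-6 * ones(size(x0));   % keep every sigma strictly positive
ub = [];                      % no upper bound needed
sigma = lsqcurvefit(fun, x0, xdata, ydata, lb, ub);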
Edited to add:
In python, you can use optimization routines in the scipy.optimize module, e.g. curve_fit().
Edit 2 (response to edited question):
If your Gaussians have much overlap with each other, then taking their sum may cause the height of the peaks to differ from your known values. In this case, you could take a weighted sum, and treat the weights as another parameter to optimize.
If you want the peak heights to be exactly equal to some specified values, you can enforce this constraint in the optimization problem. lsqcurvefit() won't be able to do it because it only handles bound constraints on the parameters. Take a look at fmincon().
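A rough sketch of that formulation, reusing myFun, pks, locs, and x0 from the question; the deal() call just packs the (empty) inequality part and the equality part into the two outputs fmincon expects:
obj = @(s) sum((ydata - myFun(s, xdata, pks, locs)).^2);   % squared-error objective
ceq = @(s) myFun(s, locs, pks, locs) - pks;                % model at the peaks must equal pks
nonlcon = @(s) deal([], ceq(s));                           % fmincon expects [c, ceq]
lb = 1e-6 * ones(size(x0));                                % sigma > 0
sigma = fmincon(obj, x0, [], [], [], [], lb, [], nonlcon);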
You can use the Expectation-Maximization (EM) algorithm for fitting a mixture of Gaussians to your data. It doesn't care about the data dimension.
In the MATLAB documentation you can look up gmdistribution.fit or fitgmdist.
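A minimal sketch, assuming samples is a column vector of draws from the mixture (note that EM fits a distribution to samples, not to the curve y = f(x) directly):
gm = fitgmdist(samples, 6);          % fit a 6-component 1-D GMM by EM
disp([gm.mu, squeeze(gm.Sigma)]);    % component means and variances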

MATLAB - Finite element on nonuniform grid

I'm solving a second order differential equation in MATLAB using a finite element method, where I write the second order derivative of a function f as:
d^2f/dx^2 ≈ ((f_{i+1} - f_{i})/(x_{i+1} - x_{i}) - (f_{i} - f_{i-1})/(x_{i} - x_{i-1})) / ((x_{i+1} - x_{i-1})/2)
Now this operation on f can be translated into a matrix, for which I can then find the eigenvectors, which then are the solutions to the given differential equation.
All this works well for a uniform grid of x-values, i.e. with constant spacing. But when I try to do it for a nonuniform grid, I get oscillations that should not be there, because the values in the matrix are weighted differently depending on how close the neighbouring grid points are.
Is my approach wrong? Should I use some kind of weighting to take care of the nonuniformity?
I am not sure if I got it right, but the relation you wrote,
d^2f/dx^2 ≈ ((f_{i+1} - f_{i})/(x_{i+1} - x_{i}) - (f_{i} - f_{i-1})/(x_{i} - x_{i-1})) / ((x_{i+1} - x_{i-1})/2)
looks like a finite difference, NOT a finite element, discretization!
Otherwise, the finite element method does not care (much) about the change of element size from one point to the next in most mechanics problems.
If you are handling a finite difference problem, the method does not require a regular mesh, but the relations have to be written carefully in order to avoid errors creeping into the system matrices.
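For reference, here is a sketch of assembling the corrected stencil above into a matrix on a nonuniform grid (the coefficients below are the stencil's terms, just rearranged; boundary rows are left open since they depend on your boundary conditions):
x = sort([0; rand(48,1); 1]);  % example nonuniform grid on [0,1]
n = numel(x);
D2 = zeros(n);                 % dense for clarity; use sparse() for large n
for i = 2:n-1
    hm = x(i) - x(i-1);        % spacing to the left neighbour
    hp = x(i+1) - x(i);        % spacing to the right neighbour
    D2(i,i-1) =  2/(hm*(hm+hp));
    D2(i,i)   = -2/(hm*hp);
    D2(i,i+1) =  2/(hp*(hm+hp));
end
% Fill in the boundary rows according to your boundary conditions,
% then take the eigenvectors of the interior block:
[V, lambda] = eig(D2(2:n-1, 2:n-1));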

Creating a 1D Second derivative of gaussian Window

In MATLAB I need to generate a second derivative of a Gaussian window to apply to a vector representing the height of a curve. I need the second derivative in order to determine the locations of the inflection points and maxima along the curve. The vector representing the curve may be quite noisy, hence the use of the Gaussian window.
What is the best way to generate this window?
Is it best to use the gausswin function to generate the gaussian window then take the second derivative of that?
Or to generate the window manually using the equation for the second derivative of the gaussian?
Or even is it best to apply the gaussian window to the data, then take the second derivative of it all? (I know these last two are mathematically the same, however with the discrete data points I do not know which will be more accurate)
The maximum length of the height vector is going to be around 100-200 elements.
Thanks
Chris
I would create a linear filter composed of the weights generated by the second derivative of a Gaussian function and convolve this with your vector.
The weights of a second derivative of a Gaussian are given by
G''(t) = C * ((t - τ)^2/σ^4 - 1/σ^2) * exp(-(t - τ)^2/(2σ^2))
where:
Tau (τ) is the time shift for the filter. If you are generating weights for a discrete filter of length T with an odd number of samples, set tau to zero and let t vary over [-T/2, T/2].
Sigma (σ) varies the scale of your operator. Set sigma to a value of around T/6; if you are concerned about a long filter length, the ratio of filter length to sigma can be reduced so that sigma is closer to T/4, at the cost of truncating more of the tails.
C is the normalising factor. This can be derived algebraically, but in practice I always do it numerically after calculating the filter weights. For unity gain when smoothing periodic signals, I will set C = 1 / sum(G'').
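A minimal sketch of generating and applying such a filter, assuming tau = 0, an odd length T, and a simple absolute-sum normalisation standing in for the choice of C (heightVector is a placeholder for your data):
T = 21;                                       % odd filter length (assumption)
t = -(T-1)/2 : (T-1)/2;                       % tau = 0, so t spans [-T/2, T/2]
sigma = T/6;                                  % scale, as suggested above
g2 = ((t.^2)/sigma^4 - 1/sigma^2) .* exp(-t.^2/(2*sigma^2));
g2 = g2 - mean(g2);                           % force the weights to sum to zero
g2 = g2 / sum(abs(g2));                       % one possible numeric normalisation
yFiltered = conv(heightVector, g2, 'same');   % filtered curve, same length as the input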
Regarding your comment on the equivalence of smoothing first and taking the derivative later, I would say it is more involved than that: which derivative operator would you use in the second step? A simple central difference would not yield the same results.
You can get an equivalent (but approximate) response to a second derivative of a Gaussian by filtering the data with two Gaussians of different scales and then taking the point-wise differences between the two resulting vectors. See Difference of Gaussians for that approach.