Stacked Sparse Autoencoder parameters

Stacked Sparse Autoencoder parameters - matlab

I work on Stacked Sparse Autoencoders using MATLAB.
Can anyone please suggest what values should be taken for
Stacked Sparse Autoencoder parameters:
L2 Weight Regularization ( Lambda)
Sparsity Regularization (Beta)
Sparsity proportion (Rho).

It is important to realise that there are NO OBVIOUS VALUES for the hyperparameters. The optimal value will vary, depending on the data you're modeling: you'll have to try them on your data.
From sparseAutoencoder Lambda (λ) is coefficient of weight decay term which discourage weights to reach big values since it may overfit. Weight decay term (or weight regularization term) is a part of the cost function like sparsity term explained below.
Rho (ρ) is sparsity constraint which controls average number of activation on hidden layer. It is included to make autoencoder work even with relatively big number of hidden units with respect to input units. For example, if input size is 100 and hidden size is 100 or larger (even smaller but close to 100), the output can be constructed without any lost, since hidden units can learn identity function. Beta (β) is coeffecient of sparsity term which is a part of the cost function. It controls relative importance of sparsity term. Lambda and Beta specify the relative importance of their terms in cost function.
Example: You can take a loot at this example where parameter values are selected as follows.
sparsityParam = 0.1; % desired average activation of the hidden units.
% (This was denoted by the Greek alphabet rho, which looks like a lower-case "p",
% in the lecture notes).
lambda = 3e-3; % weight decay parameter
beta = 3; % weight of sparsity penalty term
But once again, i want to make you remember that there are NO OBVIOUS VALUES for the hyperparameters.

Related

How are parameters updated with SGD-Optimizer?

So I have found a formula describing the SGD-Descent
θ = θ-η*∇L(θ;x,y)
Where θ is a parameter, η is the learning rate and ∇L() is the gradient descent of the loss-function. But what I don't get is how the parameter θ (which should be weight and bias) can be updated mathematically? Is there a mathematical interpretation of the parameter θ?
Thanks for any answers.

That formula applies to both gradient descent and stochastic gradient descent (SGD). The difference between the two is that in SGD the loss is computed over a random subset of the training data (i.e. a mini-batch/batch) as opposed to computing the loss over all the training data as in traditional gradient descent. So in SGD x and y correspond to a subset of the training data and labels, whereas in gradient descent they correspond to all the training data and labels.
θ represents the parameters of the model. Mathematically this is usually modeled as a vector containing all the parameters of the model (all the weights, biases, etc...) arranged into a single vector. When you compute the gradient of the loss (a scalar) w.r.t. θ you get a vector containing the partial derivative of loss w.r.t. each element of θ. So ∇L(θ;x,y) is just a vector, the same size as θ. If we were to assume that the loss were a linear function of θ, then this gradient points in the direction in parameter space that would result in the maximal increase in loss with a magnitude that corresponds to the expected increase in loss if we took a step of size 1 in that direction. Since loss isn't actually a linear function and we actually want to decrease loss we instead take a smaller step in the opposite direction, hence the η and minus.
It's also worth pointing out that mathematically the form you've given is a bit problematic. We wouldn't usually write it like this since assignment and equal aren't the same thing. The equation you provided would seem to imply that the θ on the left-hand and right-hand side of the equation were the same. They are not. The θ on the left side of the equal sign represents the value of the parameters after taking a step and the θs on the right side correspond to the parameters before taking a step. We could be more clear by writing it with subscripts
where θ_{t} is the parameter vector at step t and θ_{t+1} is the parameter vector one step later.

Function approximation by ANN

So I have something like this,
y=l3*[sin(theta1)*cos(theta2)*cos(theta3)+cos(theta1)*sin(theta2)*cos(theta3)-sin(theta1)*sin(theta2)*sin(theta3)+cos(theta1)*cos(theta2)sin(theta3)]+l2[sin(theta1)*cos(theta2)+cos(theta1)*sin(theta2)]+l1*sin(theta1)+l0;
and something similar for x. Where thetai is angles from specified interval and li some coeficients. Task is approximate inversion of equation, so you set x and y and result will be appropriate theta. So I random generate thetas from specified intervals, compute x and y. Then I norm x and y between <-1,1> and thetas between <0,1>. This data I used as training set in such way, inputs of network are normalized x and y, outputs are normalized thetas.
I train the network, tried different configuration and absolute error of network was still around 24.9% after whole night of training. It's so much, so I don't know what to do.
Bigger training set?
Bigger network?
Experiment with learning rate?
Longer training?
Technical info
As training algorithm was used error back propagation. Neurons have sigmoid activation function, units are biased. I tried topology: [2 50 3], [2 100 50 3], training set has length 1000 and training duration was 1000 cycle(in one cycle I go through all dataset). Learning rate has value 0.2.
Error of approximation was computed as
sum of abs(desired_output - reached_output)/dataset_lenght.
Used optimizer is stochastic gradient descent.
Loss function,
1/2 (desired-reached)^2
Network was realized in my Matlab template for NN. I know that is weak point, but I'm sure my template is right because(successful solution of XOR problem, approximation of differential equations, approximation of state regulator). But I show this template, because this information may be useful.
Neuron class
Network class
EDIT:
I used 2500 unique data within theta ranges.
theta1<0, 180>, theta2<-130, 130>, theta3<-150, 150>
I also experiment with larger dataset, but accuracy doesn't improve.

Should I perform data centering before apply SVD?

I have to use SVD in Matlab to obtain a reduced version of my data.
I've read that the function svds(X,k) performs the SVD and returns the first k eigenvalues and eigenvectors. There is not mention in the documentation if the data have to be normalized.
With normalization I mean both substraction of the mean value and division by the standard deviation.
When I implemented PCA, I used to normalize in such way. But I know that it is not needed when using the matlab function pca() because it computes the covariance matrix by using cov() which implicitly performs the normalization.
So, the question is. I need the projection matrix useful to reduce my n-dim data to k-dim ones by SVD. Should I perform data normalization of the train data (and therefore, the same normalization to further projected new data) or not?
Thanks

Essentially, the answer is yes, you should typically perform normalization. The reason is that features can have very different scalings, and we typically do not want to take scaling into account when considering the uniqueness of features.
Suppose we have two features x and y, both with variance 1, but where x has a mean of 1 and y has a mean of 1000. Then the matrix of samples will look like
n = 500; % samples
x = 1 + randn(n,1);
y = 1000 + randn(n,1);
svd([x,y])
But the problem with this is that the scale of y (without normalizing) essentially washes out the small variations in x. Specifically, if we just examine the singular values of [x,y], we might be inclined to say that x is a linear factor of y (since one of the singular values is much smaller than the other). But actually, we know that that is not the case since x was generated independently.
In fact, you will often find that you only see the "real" data in a signal once we remove the mean. At the extremely end, you could image that we have some feature
z = 1e6 + sin(t)
Now if somebody just gave you those numbers, you might look at the sequence
z = 1000001.54, 1000001.2, 1000001.4,...
and just think, "that signal is boring, it basically is just 1e6 plus some round off terms...". But once we remove the mean, we see the signal for what it actually is... a very interesting and specific one indeed. So long story short, you should always remove the means and scale.

It really depends on what you want to do with your data. Centering and scaling can be helpful to obtain principial components that are representative of the shape of the variations in the data, irrespective of the scaling. I would say it is mostly needed if you want to further use the principal components itself, particularly, if you want to visualize them. It can also help during classification since your scores will then be normalized which may help your classifier. However, it depends on the application since in some applications the energy also carries useful information that one should not discard - there is no general answer!
Now you write that all you need is "the projection matrix useful to reduce my n-dim data to k-dim ones by SVD". In this case, no need to center or scale anything:
[U,~] = svd(TrainingData);
RecudedData = U(:,k)'*TestData;
will do the job. The svds may be worth considering when your TrainingData is huge (in both dimensions) so that svd is too slow (if it is huge in one dimension, just apply svd to the gram matrix).

It depends!!!
A common use in signal processing where it makes no sense to normalize is noise reduction via dimensionality reduction in correlated signals where all the fearures are contiminated with a random gaussian noise with the same variance. In that case if the magnitude of a certain feature is twice as large it's snr is also approximately twice as large so normalizing the features makes no sense since it would just make the parts with the worse snr larger and the parts with the good snr smaller. You also don't need to subtract the mean in that case (like in PCA), the mean (or dc) isn't different then any other frequency.

Mixture of 1D Gaussians fit to data in Matlab / Python

I have a discrete curve y=f(x). I know the locations and amplitudes of peaks. I want to approximate the curve by fitting a gaussian at each peak. How should I go about finding the optimized gaussian parameters ? I would like to know if there is any inbuilt function which will make my task simpler.
Edit
I have fixed mean of gaussians and tried to optimize on sigma using
lsqcurvefit() in matlab. MSE is less. However, I have an additional hard constraint that the value of approximate curve should be equal to the original function at the peaks. This constraint is not satisfied by my model. I am pasting current working code here. I would like to have a solution which obeys the hard constraint at peaks and approximately fits the curve at other points. The basic idea is that the approximate curve has fewer parameters but still closely resembles the original curve.
fun = #(x,xdata)myFun(x,xdata,pks,locs); %pks,locs are the peak locations and amplitudes already available
x0=w(1:6)*0.25; % my initial guess based on domain knowledge
[sigma resnorm] = lsqcurvefit(fun,x0,xdata,ydata); %xdata and ydata are the original curve data points
recons = myFun(sigma,xdata,pks,locs);
figure;plot(ydata,'r');hold on;plot(recons);
function f=myFun(sigma,xdata,a,c)
% a is constant , c is mean of individual gaussians
f=zeros(size(xdata));
for i = 1:6 %use 6 gaussians to approximate function
f = f + a(i) * exp(-(xdata-c(i)).^2 ./ (2*sigma(i)^2));
end
end

If you know your peak locations and amplitudes, then all you have left to do is find the width of each Gaussian. You can think of this as an optimization problem.
Say you have x and y, which are samples from the curve you want to approximate.
First, define a function g() that will construct the approximation for given values of the widths. g() takes a parameter vector sigma containing the width of each Gaussian. The locations and amplitudes of the Gaussians will be constrained to the values you already know. g() outputs the value of the sum-of-gaussians approximation at each point in x.
Now, define a loss function L(), which takes sigma as input. L(sigma) returns a scalar that measures the error--how badly the given approximation (using sigma) differs from the curve you're trying to approximate. The squared error is a common loss function for curve fitting:
L(sigma) = sum((y - g(sigma)) .^ 2)
The task now is to search over possible values of sigma, and find the choice that minimizes the error. This can be done using a variety of optimization routines.
If you have the Mathworks optimization toolbox, you can use the function lsqnonlin() (in this case you won't have to define L() yourself). The curve fitting toolbox is probably an alternative. Otherwise, you can use an open source optimization routine (check out cvxopt).
A couple things to note. You need to impose the constraint that all values in sigma are greater than zero. You can tell the optimization algorithm about this constraint. Also, you'll need to specify an initial guess for the parameters (i.e. sigma). In this case, you could probably choose something reasonable by looking at the curve in the vicinity of each peak. It may be the case (when the loss function is nonconvex) that the final solution is different, depending on the initial guess (i.e. you converge to a local minimum). There are many fancy techniques for dealing with this kind of situation, but a simple thing to do is to just try with multiple different initial guesses, and pick the best result.
Edited to add:
In python, you can use optimization routines in the scipy.optimize module, e.g. curve_fit().
Edit 2 (response to edited question):
If your Gaussians have much overlap with each other, then taking their sum may cause the height of the peaks to differ from your known values. In this case, you could take a weighted sum, and treat the weights as another parameter to optimize.
If you want the peak heights to be exactly equal to some specified values, you can enforce this constraint in the optimization problem. lsqcurvefit() won't be able to do it because it only handles bound constraints on the parameters. Take a look at fmincon().

you can use Expectation–Maximization algorithm for fitting Mixture of Gaussians on your data. it don't care about data dimension.
in documentation of MATLAB you can lookup gmdistribution.fit or fitgmdist.

Why ridge regression minimizes test cost when lambda is negative

I am processing a set of data using ridge regression. I found a very interesting phenomenon when apply the learned function to data. Namely, when the ridge parameter increases from zero, the test error keeps increasing. But if we penalize small coefficients(set the parameter <0), the test error can even be smaller.
This is my matlab code:
for i = 1:100
beta = ridgePolyRegression(ty_train,tX_train,lambda(i));
sqridge_train_cost(i) = computePolyCostMSE(ty_train,tX_train,beta);
sqridge_test_cost(i) = computePolyCostMSE(ty_valid,tX_valid,beta);
end
plot(lambda,sqridge_test_cost,'color','b');
lambda is the ridge parameter. ty_train is the output of the training data, tX_train is the input of training data. Also, we use a quadratic function regression here.
function [ beta ] = ridgePolyRegression( y,tX,lambda )
X = tX(:,2:size(tX,2));
tX2 = [tX,X.^2];
beta = (tX2'*tX2 + lambda * eye(size(tX2,2))) \ (tX2'*y);
end
The plotted picture is:
Why the error is minimal when lambda is negative? Is it a sign of under-fitting?

You should not use negative lambdas.
From (probabilistic) theoretic point of view, lambda relates to the inverse of variance of parameter prior distribution, and variance can't be negative.
From computational point of view, it can (given it's less that the smallest eigenvalue of the covariance matrix) turn your positive-definite form into an indefinite form, which means you'll have not a maximum, but a saddle point. It also means there are points where your target function is as small (or as big) as you want, so you can reduce loss indefinitely and no minimum / maximum exists at all.
Your optimization algorithm gives you just a stationary point, which will be a global maximum if and only if the form is positive definite.

Short Answer: When lambda is negative, you're actually overfitting your data. Hence, it's reasonable to get much less error.
Long Answer:
The regularization term (or the penalty term as described by many statisticians) aims to penalize the weights (or the betas as written in the coming Eq.) for going too high (overfitting) and going too low (underfitting). Giving you the power to control how your model behaves, and you usually aim the "right fitting" model.
For mathematical intuition, you can check the following Eq. (P. S. Equation is screenshotted from Elements of Statistical Learning by Trevor Hastie et. al)
When you decide to make your lambda negative, the penalty term is indeed turned into a utility term that helps to increase the weights (i.e., overfitting).
Overfitting is, simply, understanding your data along with the features more than you should, because you do not have the whole population yet; therefore, what you understood so far is possibly wrong on a different dataset.
So, you should never be using negative values of lambdas.