Binary Classification Cost Function, Neural Networks

I've been tweaking the Deep Learning tutorial to train the weights of a logistic regression model for a binary classification problem, and the tutorial uses the negative log-likelihood cost function below:
self.p_y_given_x = T.nnet.softmax(T.dot(input, self.W) + self.b)

def negative_log_likelihood(self, y):
    # mean negative log-probability of the correct class over the minibatch
    return -T.mean(T.log(self.p_y_given_x)[T.arange(y.shape[0]), y])
However, my weights don't seem to be converging properly: my validation error increases over successive epochs.
I was wondering if I'm using the proper cost function to converge on the proper weights. It might be useful to note that my two classes are very imbalanced and my predictors are already normalized.

A few reasons I can think of:
- Your learning rate is too high.
- For binary classification, try squared error or a cross-entropy error instead of the negative log-likelihood (see the sketch after this list).
- You are using just one layer; maybe your dataset requires more, so add more hidden layers.
- Play around with the number of layers and hidden units.
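If the class imbalance is what is hurting your validation error, one option (my own suggestion, not part of the tutorial) is to weight each example's log-likelihood by its class. A minimal Theano sketch, assuming self.p_y_given_x is the softmax output from the question, the rare class has label 1, and w_pos/w_neg are weights you would tune (e.g. inverse class frequencies):

import theano.tensor as T

def weighted_negative_log_likelihood(self, y, w_pos=10.0, w_neg=1.0):
    # per-example log-probability of the correct class, as in the tutorial
    log_prob = T.log(self.p_y_given_x)[T.arange(y.shape[0]), y]
    # up-weight the rare class (assumed here to be label 1)
    weights = T.switch(T.eq(y, 1), w_pos, w_neg)
    return -T.mean(weights * log_prob)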

Related

Neural network parameter selection

I am looking at (two-layer) feed-forward Neural Networks in Matlab. I am investigating parameters that can minimise the classification error.
A Google search reveals that these are some of them:
Number of neurons in the hidden layer
Learning Rate
Momentum
Training type
Epoch
Minimum Error
Any other suggestions?
I've varied the number of hidden neurons in Matlab, varying it from 1 to 10. I found that the classification error is close to 0% with 1 hidden neuron and then grows very slightly as the number of neurons increases. My question is: shouldn't a larger number of hidden neurons guarantee an equal or better answer, i.e. why might the classification error go up with more hidden neurons?
Also, how might I vary the Learning Rate, Momentum, Training type, Epoch and Minimum Error in Matlab?
Many thanks
Since you are considering a simple two-layer feed-forward network and have already pointed out six different things to consider for reducing classification error, I just want to add one more: the amount of training data. If you train a neural network with more data, it will work better. Training with a large amount of data is key to getting good results from neural networks, especially deep neural networks.
Why does the classification error go up with more hidden neurons?
The answer is simple: your model has over-fitted the training data, which results in poor generalization. If you increase the number of neurons in the hidden layers, the training error will decrease but the testing error will increase.
(Figure omitted: it showed training versus test error as the hidden layer size increases.)
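If you want to see the effect yourself, here is a rough sketch in Python/scikit-learn (not Matlab; the synthetic dataset and all settings below are made up) that sweeps the hidden layer size and prints training versus test accuracy:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Small synthetic problem so the over-fitting effect shows up quickly.
X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for hidden in (1, 2, 5, 10, 50, 200):
    net = MLPClassifier(hidden_layer_sizes=(hidden,), max_iter=2000,
                        random_state=0)
    net.fit(X_tr, y_tr)
    # Training accuracy keeps climbing with size, while test accuracy
    # typically peaks and then degrades as the network memorizes noise.
    print(hidden, net.score(X_tr, y_tr), net.score(X_te, y_te))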
How may I vary the Learning Rate, Momentum, Training type, Epoch and Minimum Error in Matlab?
I assume you have already seen the feed-forward neural network functions in Matlab. You just need to set the second argument of feedforwardnet(hiddenSizes,trainFcn), which is trainFcn, the training function.
For example, if you want to use gradient descent with momentum and adaptive learning rate backpropagation, then use traingdx as the training function. You can also use traingda if you want to use gradient descent with adaptive learning rate backpropagation.
You can change all the required parameters of the function as you like. For example, if you want to use traingda, you just need to follow these two steps.
Set net.trainFcn to traingda. This sets net.trainParam to traingda's default parameters.
Set net.trainParam properties to desired values.
Example
net = feedforwardnet(3,'traingda');
net.trainParam.lr = 0.05;      % set the learning rate to 0.05
net.trainParam.epochs = 2000;  % set the maximum number of epochs
Please see the Matlab documentation for traingda (gradient descent with adaptive learning rate backpropagation) and traingdx (gradient descent with momentum and adaptive learning rate backpropagation).

Label Normalization in Deep Regression Networks

In regression problems there is typically no reason for normalizing/rescaling the labels (targets) before performing the optimization.
In deep regression networks there would be in principle no need to rescale since the last activation function is linear and the cost function is the mean squared difference of the predictions from the targets.
On the other hand, for numerical stability and performance of the training process, the values of the input and hidden units are kept in the range [-1,1] via feature normalization. Doesn't that mean the labels should be rescaled to the range [-1,1] too?
Traditionally in regression problems you normalize the targets for training and then denormalize the outputs the network generates, so that predictions end up back in the original units.
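As a concrete illustration of that workflow (a Python/scikit-learn sketch with made-up data, not a deep network): normalize the targets before training and invert the scaling on the predictions afterwards.

import numpy as np
from sklearn.neural_network import MLPRegressor

# Hypothetical regression data with large-magnitude targets.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 1000.0 * X[:, 0] + 500.0 * X[:, 1] + rng.normal(scale=10.0, size=200)

# Normalize the targets for training ...
y_mean, y_std = y.mean(), y.std()
net = MLPRegressor(hidden_layer_sizes=(20,), max_iter=5000, random_state=0)
net.fit(X, (y - y_mean) / y_std)

# ... and denormalize the outputs so predictions are in the original units.
y_pred = net.predict(X) * y_std + y_mean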

Unsupervised Neural Network with Independent hidden Activations: How to implement entropy or kurtosis penalty?

In brief:
How do I implement an ICA-like entropy or kurtosis penalty term for the cost and back-propagation of my unsupervised neural network (I'm using stacked sparse autoencoders)?
In detail:
I've implemented a sparse autoencoder (AE) in Matlab based on Andrew Ng's UFLDL tutorials, with conjugate gradient descent and Wolfe line search. This achieves much cleaner results on my data than Independent Component Analysis (ICA) with infomax (I compare my AE's hidden layer activations with the ICA activations of eeglab's runica function).
However, maximal independence of the activations, as in ICA, is also an essential criterion for me. So I'd like to keep the higher-quality sparse-NN hidden activations while also making them independent. In other words, I'd like to benefit from both sparsity and independence.
ICA uses entropy or kurtosis penalty term to minimize mutual information across the components (activations).
I've tried simply applying Matlab's wentropy or kurtosis function: if a given hidden layer's activation matrix 'a' is (outputNeurons x hiddenNeurons), I'd do:
betaEntropy = 0.01;
entropyM = zeros(size(a,1), 1);   % preallocate the per-row entropy values
for m = 1:size(a,1)
    entropyM(m) = abs(wentropy(a(m,:), 'shannon'));
    % kurtosisM(m) = abs(kurtosis(a(m,:)));
end
entropyPenalty = betaEntropy * mean(entropyM);
cost = cost + Wdecay + sparsityJ + entropyPenalty;
This gave bad results and no independent activations. I think I should also compute the gradient of the penalty for the back-propagation step, but I wasn't sure how. I'm not sure my cost penalty above is correct either.
Any ideas, either to improve the above code or to point me in the right direction in simple mathematical terms?
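One way to make such a penalty back-propagatable (my own ICA-style suggestion, not something from the UFLDL tutorial) is to reward large excess kurtosis per hidden unit, since its gradient has a closed form. Below is a NumPy sketch (easy to port to Matlab) of the penalty, its gradient with respect to the activation matrix, and a finite-difference check; the (units x samples) layout and the beta value are assumptions:

import numpy as np

def kurtosis_penalty(a, beta=0.01):
    # a is assumed to be (hidden units x samples); each row is one unit's activations.
    # Penalty: -beta * mean over units of |excess kurtosis|, so minimizing the total
    # cost pushes each unit's activation distribution away from Gaussian (ICA-style).
    n = a.shape[1]
    c = a - a.mean(axis=1, keepdims=True)        # centered activations
    m2 = (c ** 2).mean(axis=1, keepdims=True)    # variance per unit
    m4 = (c ** 4).mean(axis=1, keepdims=True)    # 4th central moment per unit
    kurt = m4 / m2 ** 2
    excess = kurt - 3.0
    penalty = -beta * np.mean(np.abs(excess))
    # Closed-form gradient of the penalty w.r.t. a, for back-propagation.
    dkurt = (4.0 / n) * ((c ** 3 - (c ** 3).mean(axis=1, keepdims=True)) / m2 ** 2
                         - (m4 / m2 ** 3) * c)
    grad = -beta * np.sign(excess) * dkurt / a.shape[0]
    return penalty, grad

# Finite-difference check of the analytic gradient (worth doing before wiring
# any new penalty term into back-propagation).
rng = np.random.default_rng(0)
a = rng.normal(size=(5, 40))
p, g = kurtosis_penalty(a)
eps = 1e-6
a_pert = a.copy()
a_pert[2, 7] += eps
numeric = (kurtosis_penalty(a_pert)[0] - p) / eps
print(abs(numeric - g[2, 7]))   # should be tiny (around 1e-7 or smaller)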

Non-linear classification vs regression with FFANN

I am trying to differentiate between two classes of data for forecasting. Basically the dependent variables are features of a signal that I want to forecast. I want to predict whether the signal will have a positive or negative slope in the near future (1 time step ahead). I have tried different time-series analysis methods, such as Fourier analysis, fitting with neural networks, auto-regressive models, and classification with neural nets (using patternet in Matlab).
The function is continuous, so the most logical assumption is to use some regression analysis tool to determine what's going to happen. However, since I only care whether the slope is going to be positive or negative, I changed the signal to a binary signal (1 if the slope is positive, -1 if the slope is 0 or negative).
This is by far the best result I have gotten! However, for some unknown reason a neural net designed for classification did not work (the confusion matrix stated a precision of around 50%). So I decided to try with a regular feedforward neural net...
Since the neural network outputs continuous data, I didn't know what to do... But then I remembered logistic regression: since its transfer function is a logistic (sigmoid) function bounded between 0 and 1, its output can be interpreted as a probability. So I basically did the same, defined a threshold (e.g. above 0 is 1, below 0 is -1), and voila! The precision sky-rocketed! I am getting a precision of around 70-80%.
Since I am using a sigmoid transfer function, the neural network will have a continuous output just as logistic regression does (but in this case between -1 and 1), so I am assuming my approach is technically still regression and not classification. My question is: which is better? For my specific problem, where fitting did not give really good results and I had to convert this to a binary problem, which should give better results: classification or regression?
Should I try a different configuration of a neural net (with a different transfer function), should I try with support vector machine or any other classification algorithm? Or should I stick with regression but defining a threshold myself just as I would do with logistic regression?
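For reference, the thresholding step described above (turning a continuous network output into +/-1 labels and scoring precision) looks like this in a toy NumPy sketch; the outputs and labels are made up:

import numpy as np

# Hypothetical continuous outputs in [-1, 1] and the true +/-1 labels.
y_true = np.array([ 1, -1,  1,  1, -1, -1,  1, -1])
y_cont = np.array([ 0.7, -0.2, 0.1, 0.9, -0.8, 0.3, 0.4, -0.6])

y_pred = np.where(y_cont > 0, 1, -1)           # threshold at 0

tp = np.sum((y_pred == 1) & (y_true == 1))     # true positives
fp = np.sum((y_pred == 1) & (y_true == -1))    # false positives
print(tp / (tp + fp))                          # precision for the positive class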

SVM Classification with Cross Validation

I am new to using Matlab and am trying to follow the example in the Bioinformatics Toolbox documentation (SVM Classification with Cross Validation) to handle a classification problem.
However, I am not able to understand Step 9, which says:
Set up a function that takes an input z=[rbf_sigma,boxconstraint], and returns the cross-validation value of exp(z).
The reason to take exp(z) is twofold:
rbf_sigma and boxconstraint must be positive.
You should look at points spaced approximately exponentially apart.
This function handle computes the cross validation at parameters
exp([rbf_sigma,boxconstraint]):
minfn = @(z)crossval('mcr',cdata,grp,'Predfun', ...
    @(xtrain,ytrain,xtest)crossfun(xtrain,ytrain,...
    xtest,exp(z(1)),exp(z(2))),'partition',c);
What is the function that I should be implementing here? Is it exp or minfn? I would appreciate it if you could give me the code for this section. Thanks.
I would also like to know what it means when it says exp([rbf_sigma,boxconstraint]).
rbf_sigma: The SVM is using a Gaussian kernel, and rbf_sigma sets the standard deviation (~size) of that kernel. To understand how the kernels work: the SVM puts a kernel around every sample (so you have a Gaussian around every sample). Then the kernels are added up (summed) for the samples of each category/type. At each point, the type whose sum is higher is the "winner". For example, if type A has the higher sum of these kernels at point X, then a new datum at point X will be classified as type A. (There are other configuration parameters that may change the actual threshold at which one category is selected over another.)
Fig.: Analyze this figure from the webpage you linked. You can see how, by adding up the Gaussian kernels on the red samples ("sumA") and on the green samples ("sumB"), it is logical that sumA > sumB in the center part of the figure, and that sumB > sumA in the outer part of the image.
boxconstraint: it is a cost/penalty on misclassified data. During the training stage of the classifier, where you use the training data to adjust the SVM parameters, the training algorithm uses an error function to decide how to optimize the SVM parameters in an iterative fashion. The cost for a misclassified sample is proportional to how far it is from the boundary where it would have been classified correctly. In the figure I am attaching, that boundary is the inner blue circumference.
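As a toy NumPy sketch of that "add a Gaussian around every sample and compare the per-class sums" picture (all points and the sigma value are made up, and this is only the intuition, not the full SVM optimization):

import numpy as np

sigma = 0.5
red   = np.array([[0.0, 0.0], [0.2, 0.1]])     # class A samples
green = np.array([[2.0, 2.0], [2.1, 1.8]])     # class B samples
x_new = np.array([0.3, 0.2])                   # point to classify

def kernel_sum(samples, x):
    d2 = np.sum((samples - x) ** 2, axis=1)    # squared distances to x
    return np.sum(np.exp(-d2 / (2 * sigma ** 2)))

sumA, sumB = kernel_sum(red, x_new), kernel_sum(green, x_new)
print('A' if sumA > sumB else 'B')             # the class with the larger sum wins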
Taking into account BGreene's indications and what I understand of the tutorial:
In the tutorial they advise trying values for rbf_sigma and boxconstraint that are exponentially spaced. This means you should compare values like {0.2, 2, 20, ...} (note that this is {2*10^(i-2), i=1,2,3,...}), and NOT like {0.2, 0.3, 0.4, 0.5} (which would be linearly spaced). They advise this so that you first try a wide range of values; you can further optimize later FROM the first optimum you obtained.
The command "[searchmin fval] = fminsearch(minfn,randn(2,1),opts)" will give you back the optimum values for rbf_sigma and boxconstraint. You probably have to use exp(z) because it affects how fminsearch increments the values of z(1) and z(2) during the search for the optimum. I suppose that when you put exp(z(1)) in the definition of @minfn, fminsearch takes "exponentially" big steps.
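For intuition, here is a rough Python analogue of that exp(z) reparameterization, using scikit-learn's SVC and SciPy's fmin (a Nelder-Mead simplex search like fminsearch); the dataset, starting point, and the gamma = 1/(2*sigma^2) mapping are my own assumptions, not part of the Matlab example:

import numpy as np
from scipy.optimize import fmin
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

def cv_error(z):
    # z lives on a log scale, so exp(z) is always positive and a unit step in z
    # multiplies the parameter by e, i.e. the "exponentially spaced" moves above.
    sigma, box = np.exp(z[0]), np.exp(z[1])
    svm = SVC(kernel='rbf', gamma=1.0 / (2.0 * sigma ** 2), C=box)
    return 1.0 - cross_val_score(svm, X, y, cv=5).mean()

z_opt = fmin(cv_error, x0=np.zeros(2), disp=False)   # simplex search, like fminsearch
print(np.exp(z_opt))   # back-transform to recover rbf_sigma and boxconstraint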
In machine learning, always keep in mind that there are three subsets in your data: training data, cross-validation data, and test data. The training set is used to optimize the parameters of the SVM classifier for EACH value of rbf_sigma and boxconstraint. The cross-validation set is then used to select the optimum values of rbf_sigma and boxconstraint. And finally the test data is used to get an idea of the performance of your classifier (its efficiency is determined on the test set).
So, if you start with 10000 samples you might divide the data, for example, as training (50%), cross-validation (25%), test (25%): randomly sample 5000 samples for the training set, then 2500 of the remaining 5000 for the cross-validation set, and keep the remaining 2500 for the test set.
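A quick sketch of that 50% / 25% / 25% split in Python (the array names are placeholders):

import numpy as np

rng = np.random.default_rng(0)
idx = rng.permutation(10000)     # shuffle the indices of the 10000 samples
train_idx = idx[:5000]           # 5000 samples for training
cv_idx    = idx[5000:7500]       # 2500 samples for cross-validation
test_idx  = idx[7500:]           # the remaining 2500 samples for testing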
I hope that I could clarify your doubts. By the way, if you are interested in the optimization of the parameters of classifiers and machine learning algorithms I strongly suggest that you follow this free course -> www.ml-class.org (it is awesome, really).
You need to implement a function called crossfun (see example).
The function handle minfn is passed to fminsearch to be minimized.
exp([rbf_sigma,boxconstraint]) is the quantity being optimized to minimize classification error.
There are a number of functions nested within this function handle:
- crossval produces the classification error based on cross-validation using partition c
- crossfun classifies the data using an SVM
- fminsearch optimizes the SVM hyperparameters to minimize the classification error
Hope this helps