Say we have a training set Y:
1,0,1,0
0,1,1,0
0,0,1,1
0,0,1,0
And the sigmoid function is defined as: g(z) = 1 / (1 + e^(-z)).
As the sigmoid function outputs a value between 0 and 1, does this mean that the training data and the values we are trying to predict should also fall between 0 and 1?
Is it also correct to use the sigmoid function for making predictions when the training set values are not between 0 and 1? For example:
1,4,3,0
2,1,1,0
7,2,6,1
3,0,5,0
Yes, it is perfectly valid to have non-binary features.
The output falls between 0 and 1 because of the nature of the sigmoid function; nothing stops you from having a non-binary feature set.
Do the predictions have to be binary?
Yes, you can have multiclass logistic classification as well.
The simplest way of doing that is solving a one-vs-all classification problem, wherein you train one binary logistic classifier for each of the labels.
For example, if your prediction space spans (1, 2, 3, 4), you can have 4 logistic classifiers.
Given any point in the test set, you can give it the label corresponding to the classifier which is most confident (i.e. has the highest score for that test point).
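As a hedged illustration (my own sketch, not from the original answer), here is a minimal one-vs-all setup in MATLAB, assuming X and y hold the training features and labels, Xtest holds the test features, and glmfit/glmval come from the Statistics and Machine Learning Toolbox:
classes = 1:4;
scores = zeros(size(Xtest, 1), numel(classes));
for c = classes
    % Train one binary logistic classifier: class c versus everything else
    b = glmfit(X, double(y == c), 'binomial');
    % Score each test point with this classifier's predicted probability
    scores(:, c) = glmval(b, Xtest, 'logit');
end
% Assign each test point the label of the most confident classifier
[~, predictedLabels] = max(scores, [], 2);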
I have tried to train a neural network in MATLAB. First of all, I built the ANN as follows:
net = feedforwardnet([30 20 20]);
[net, tr] = train(net, XTRAIN, temp);
which produces an ANN with the following architecture:
Then I test my neural network as follows:
outputsOfTest = sim(net, XTEST);
The outputsOfTest is a vector representing the output of the neural network test. Usually some of the elements of outputsOfTest are negative values; for example, outputsOfTest might be something like [-.34 1.17 .17].
So how do I interpret this output? What do the negative values indicate? Which class does the testing data belong to based on this output?
Should I take the greatest value as an indicator of the class the testing data belongs to?
For example, if I have the output vector [-2 .5 1], the greatest value is 1, so the testing data belongs to class 3.
Or should I take the greatest value in magnitude (taking the absolute value)? For example, for the output vector [-2 .5 1], the element with the greatest magnitude is the first one, so the testing data would belong to class 1.
Note: sometimes the sum of the elements of outputsOfTest exceeds one; it may reach 2.5. Is this normal?
Your output layer seems to have a linear activation function, so your output vector's components are not restricted to values between 0 and 1. For classification you should use a softmax activation function:
softmax(z)_i = exp(z_i) / sum_j exp(z_j)
The use of softmax results in vector components which have values between 0 and 1 and which sum to 1 for each vector, so you basically get a probability distribution over your classes. The MATLAB help has an image showing the effect (left: input, right: after softmax):
There's more information about it in the UFLDL Tutorial.
From what I could find, the following code change might work in MATLAB:
net = feedforwardnet([30 20 20]);
net.layers{4}.transferFcn = 'softmax';  % layer 4 is the output layer of this network
[net, tr] = train(net, XTRAIN, temp);
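As a quick sanity check (my own toy example, not from the original answer), applying the softmax formula by hand to the question's output [-.34 1.17 .17]:
z = [-0.34; 1.17; 0.17];
p = exp(z) / sum(exp(z))       % roughly [0.14; 0.63; 0.23], non-negative and summing to 1
[~, predictedClass] = max(p)   % class 2 is the most likely
Since softmax is monotonic, taking the largest raw output picks the same class; the softmax values are just interpretable as probabilities.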
I'm trying to implement gradient checking for a simple feedforward neural network with a 2-unit input layer, a 2-unit hidden layer and a 1-unit output layer. What I do is the following:
For each weight w between all layers of the network, perform forward propagation using w + EPSILON and then w - EPSILON.
Compute the numerical gradient using the results of the two feedforward propagations.
What I don't understand is how exactly to perform the backpropagation. Normally, I compare the output of the network to the target data (in the case of classification) and then backpropagate the error derivative through the network. However, I think in this case some other value has to be backpropagated, since the results of the numerical gradient computation do not depend on the target data (only on the input), while the error backpropagation depends on the target data. So what is the value that should be used in the backpropagation part of the gradient check?
Backpropagation is performed after computing the gradients analytically; those formulas are then used during training. A neural network is essentially a multivariate function whose coefficients, or parameters, need to be found, i.e. trained.
The gradient of a function with respect to a specific variable is the rate of change of the function value. Therefore, as you mentioned, from the definition of the first derivative we can approximate the gradient of any function, including a neural network: f'(x) ≈ (f(x + h) - f(x - h)) / (2h).
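As a toy illustration of this central-difference approximation (my own example, not from the original answer):
f = @(x) x.^2;
h = 1e-4;
numGrad = (f(3 + h) - f(3 - h)) / (2*h)   % approximately 6, the true derivative of x^2 at x = 3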
To check whether the analytical gradient for your neural network is correct, it is good to verify it with the numerical method:
% MATLAB-style sketch; cost(W) is assumed to run a forward pass over the
% training data with the weights W and return the scalar cost.
eps = 1e-4;            % perturbation size
for l = 1:numel(W)     % W is a cell array of all weight matrices [w_0, ..., w_k]
    for i = 1:size(W{l}, 1)
        for j = 1:size(W{l}, 2)
            Wminus = W;  Wminus{l}(i,j) = Wminus{l}(i,j) - eps;  % change only this parameter
            Wplus  = W;  Wplus{l}(i,j)  = Wplus{l}(i,j)  + eps;  % change only this parameter
            costMinus = cost(Wminus);  % cost of the net with w_l replaced by w_l_minus
            costPlus  = cost(Wplus);   % cost of the net with w_l replaced by w_l_plus
            gradNum{l}(i,j) = (costPlus - costMinus) / (2*eps);
        end
    end
end
This process changes only one parameter at a time and computes the numerical gradient. I have used the central difference (f(x+h) - f(x-h)) / (2h), which seems to work better for me than the one-sided difference.
Note that you mentioned that "the results of the numerical gradient computation do not depend on the target data"; this is not true. When you compute costMinus and costPlus above, the cost is computed on the basis of:
The weights
The target classes
Therefore, the process of backpropagation should be independent of the gradient checking. Compute the numerical gradients before the backpropagation update, compute the gradients using backpropagation in the same epoch, then compare each component of the gradient vectors/matrices and check that they are close enough.
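A minimal comparison sketch (my own addition; gradBackprop and gradNum are assumed to hold the analytic and numerical gradients of one weight matrix):
relError = norm(gradBackprop(:) - gradNum(:)) / (norm(gradBackprop(:)) + norm(gradNum(:)));
if relError < 1e-7   % a common rule of thumb for double precision
    disp('gradient check passed');
end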
Whether you want to do some classification or have your network calculate a certain numerical function, you always have some target data. For example, let's say you wanted to train a network to calculate the function f(a, b) = a + b. In that case, this is the input and target data you want to train your network on:
a b Target
1 1 2
3 4 7
21 0 21
5 2 7
...
Just as with "normal" classification problems, the more input-target pairs, the better.
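For instance, a minimal MATLAB sketch of this setup (my own illustration; assumes the Neural Network Toolbox, with samples stored as columns, as train expects):
X = randi(20, 2, 200);    % 200 random (a, b) input pairs
T = sum(X, 1);            % targets: a + b for each pair
net = feedforwardnet(5);  % one hidden layer with 5 units
net = train(net, X, T);
sim(net, [3; 4])          % should output a value close to 7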
Even with a simple classifier like the nearest neighbour I cannot seem to judge its accuracy and thus cannot improve it.
For example with the code below:
IDX = knnsearch(train_image_feats, test_image_feats);   % index of the nearest training image for each test image
predicted_categories = cell([size(test_image_feats, 1), 1]);
for i = 1:size(IDX, 1)
    predicted_categories{i} = train_labels(IDX(i));      % assign the nearest neighbour's label
end
Here train_image_feats is a 300-by-256 matrix where each row represents an image; test_image_feats has the same structure. train_labels contains the label corresponding to each row of the training matrix.
The book I am following simply said that the above method achieves an accuracy of 19%.
How did the author come to this conclusion? Is there any way to judge the accuracy of my results, be it with this classifier or others?
The author then uses another method of feature extraction and says it improved accuracy by 30%.
How can I find the accuracy? Be it graphically or just via a simple percentage.
Accuracy in machine learning and classification is usually calculated by comparing the predicted outputs of your classifier against the ground truth. When you evaluate the classification accuracy of your classifier, you will already have created a predictive model using a training set with known inputs and outputs. At this point, you will have a test set with inputs and outputs that were not used to train the classifier. For the purposes of this post, let's call this the ground truth data set.
This ground truth data set helps assess the accuracy of your classifier when you provide it with inputs it has not seen before. You take the inputs from your test set and run them through your classifier. You get an output for each input, and we call the collection of these outputs the predicted values.
For each predicted value, you compare it to the associated ground truth value and see whether they are the same. Adding up all of the instances where the outputs match, and dividing by the total number of points in your test set, yields the fraction of instances where your model accurately predicted the result.
In MATLAB, this is really simple to calculate. Suppose your model's categories are enumerated from 1 to N, where N is the total number of labels you are classifying with. Let groundTruth be your vector of ground truth labels and predictedLabels the labels generated by your classifier. The accuracy is simply calculated by:
accuracy = sum(groundTruth == predictedLabels) / numel(groundTruth);
accuracyPercentage = 100*accuracy;
The first line of code calculates the accuracy of your model as a fraction. The second line calculates this as a percentage by multiplying the first line by 100. You can use either one when you want to assess accuracy: one is normalized between [0,1] while the other is a percentage from 0% to 100%. What groundTruth == predictedLabels does is compare each element between groundTruth and predictedLabels: if the ith value in groundTruth matches the ith value in predictedLabels, we output a 1; if not, we output a 0. This yields a vector of 0s and 1s, so we simply add up all of the values that are 1, which is eloquently encapsulated in the sum operation. We then divide by the total number of points in our test set to obtain the final accuracy of the classifier.
With a toy example, supposing I had 4 labels, and my groundTruth and predictedLabels vectors were this:
groundTruth = [1 2 3 2 3 4 1 1 2 3 3 4 1 2 3];
predictedLabels = [1 2 2 4 4 4 1 2 3 3 4 1 2 3 3];
The accuracy using the above vectors gives us:
>> accuracy
accuracy =
0.4000
>> accuracyPercentage
accuracyPercentage =
40
This means that we have a 40% accuracy, or an accuracy of 0.40. Using this example, the predictive model was only able to accurately classify 40% of the test set when each test set input was put through the classifier. This makes sense, because between our predicted outputs and the ground truth, only 40%, or 6 outputs, match up: the 1st, 2nd, 6th, 7th, 10th and 15th elements. There are other metrics for judging performance, like ROC curves, but when calculating accuracy in machine learning, this is what is usually done.
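As an aside (my own addition, not from the original answer), MATLAB's confusionmat from the Statistics and Machine Learning Toolbox breaks the same comparison down per class, which shows where the misclassifications land:
C = confusionmat(groundTruth, predictedLabels)   % rows are true classes, columns are predicted classes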
I'm using LIBSVM for MATLAB. When I use a regression SVM, the probability estimates it outputs are an empty matrix, whereas this feature works fine when using classification. Is this normal behavior? In the LIBSVM README it says:
-b probability_estimates: whether to train a SVC or SVR model for probability estimates,
0 or 1 (default 0)
[~, ~, P] = svmpredict(y, x, model, '-b 1');  % svmpredict takes the label vector first, then the instance matrix
The output P is the probability that y belongs to class 1 and class -1, respectively (an m-by-2 array), and it only makes sense for a classification problem.
For a regression problem, the pairwise probability information is included in your trained model as model.ProbA.
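A hedged sketch of how to get at it (my own reading of the LIBSVM README; for epsilon-SVR trained with -b 1, model.ProbA is reported to hold the scale of a Laplace noise model for the predictions):
model = svmtrain(y, x, '-s 3 -b 1');   % epsilon-SVR trained with probability information
sigma = model.ProbA;                   % Laplace scale: prediction errors modelled as Laplace(0, sigma)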
I am working through the XOR example with a three-layer backpropagation network. When the output layer has a sigmoid activation, an input of (1,0) might give 0.99 for a desired output of 1, and an input of (1,1) might give 0.01 for a desired output of 0.
But what if I want the output to be discrete, either 0 or 1? Do I simply set a threshold in between, at 0.5? Would this threshold need to be trained like any other weight?
Well, you can of course put a threshold after the output neuron that maps values above 0.5 to 1 and, vice versa, all outputs below 0.5 to 0. But I suggest not hiding the continuous output behind a discretization threshold, because an output of 0.4 is less "zero" than a value of 0.001, and this difference can give you useful information about your data.
Do the training without the threshold, i.e. compute the error on an example using what the neural network outputs, without thresholding it.
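A minimal sketch of that split (my own illustration; net and XTEST as in the earlier MATLAB examples):
y = sim(net, XTEST);         % continuous sigmoid outputs in (0, 1); compute training error on these
labels = double(y >= 0.5);   % fixed 0.5 cutoff applied only at decision time; it is not trained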
Another little detail: do you use a transfer function such as the sigmoid? The sigmoid function returns values in [0, 1], but 0 and 1 are asymptotes, i.e. the sigmoid can come close to those values but never reach them. A consequence of this is that your neural network can never output exactly 0 or 1! Multiplying the sigmoid by a factor slightly above 1 can correct this. This and some other practical aspects of backpropagation are discussed in http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf