My dataset has 100 examples, each with feature dimension 39, labeled 0 or 1. There are 50 examples belonging to class 1 and the remaining 50 belonging to class 0. The graphical output shows only one output instead of two. I expected two output nodes since there are two categories, and I am flabbergasted as to why this is happening. The following is the code. I shall be grateful for your help.
hiddenlayersize = 5;
net = patternnet(hiddenlayersize);
net = init(net);
net.performFcn = 'crossentropy';
[net] = train(net,x,t);
out = sim(net,x);
Below is the model:
Also, out is not binary. How do I get the predicted labels as binary values as well?
The classifier outputs its results in the form of probabilities, so your results are fine.
The default threshold for converting probabilities into two classes, say 0 and 1, is 0.5.
You can fine-tune the threshold by moving it up or down and analysing the resulting false positives, false negatives, precision-recall curves, etc., depending on what your objective is.
Hope this helps.
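As a minimal sketch of the thresholding step described above (the output values and true labels here are made up for illustration, not taken from the question), the probabilities can be converted to 0/1 labels like this:

```matlab
% Hypothetical network outputs for five examples (1 x N row of probabilities)
out = [0.08 0.93 0.51 0.49 0.70];
threshold = 0.5;                        % default cut-off; move it up or down as needed
predicted = double(out >= threshold);   % 1 where out >= threshold, 0 otherwise
disp(predicted)                         % 0 1 1 0 1

% With known labels, count the two error types at this threshold
t = [0 1 0 0 1];                        % hypothetical ground truth
falsePos = sum(predicted == 1 & t == 0);
falseNeg = sum(predicted == 0 & t == 1);
```

Re-running the last two lines for several values of threshold gives the numbers needed to pick a cut-off that suits the objective.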
I am tuning an SVM using a for loop to search over a range of the hyperparameter space. The learned SVM model contains the following fields:
SVMModel: [1×1 ClassificationSVM]
C: 2
FeaturesIdx: [4 6 8]
Score: 0.0142
Question 1) What is the meaning of the field 'Score' and what is it used for?
Question 2) I am tuning the BoxConstraint, i.e. the C value. Let the number of features be denoted by the variable featsize. The variable gridC will contain the search space, which can start from any value, say 2^-5, 2^-3, up to 2^15, etc. So, gridC = 2.^(-5:2:15). I cannot work out whether there is a principled way to select this range.
1. The score is documented here, which says:
Classification Score
The SVM classification score for classifying observation x is the signed distance from x to the decision boundary ranging from -∞ to +∞.
A positive score for a class indicates that x is predicted to be in
that class. A negative score indicates otherwise.
In the two-class case, suppose there are six observations and the predict function returns a score matrix called TestScore. We can then determine which class each observation is assigned to with:
TestScore=[-0.4497 0.4497;
-0.2602 0.2602;
-0.0746 0.0746;
0.1070 -0.1070;
0.2841 -0.2841;
0.4566 -0.4566;];
[~,Classes] = max(TestScore,[],2);
In two-class classification we can also use find(TestScore > 0) instead, and it is clear that the first three observations belong to the second class, and the 4th to 6th observations belong to the first class.
In multiclass cases there can be several scores > 0, but the code max(Scores,[],2) is still valid. For example, we could use the following code (from here, an example called Find Multiple Class Boundaries Using Binary SVM) to determine the classes of the samples to predict.
for j = 1:numel(classes);
[~,score] = predict(SVMModels{j},Samples);
Scores(:,j) = score(:,2); % Second column contains positive-class scores
end
[~,maxScore] = max(Scores,[],2);
Then maxScore holds, for each sample, the index of the class with the highest score, i.e. the predicted class.
2. BoxConstraint is the C parameter of the SVM model, so we can train SVMs with different values of it and select the best one with something like:
gridC = 2.^(-5:2:15);
for ii=1:length(gridC)
SVModel = fitcsvm(data3,theclass,'KernelFunction','rbf',...
'BoxConstraint',gridC(ii),'ClassNames',[-1,1]);
%if (%some constraints were met)
% %save the current SVModel
%end
end
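One possible way to fill in the "some constraints" placeholder is to keep the C with the lowest cross-validated error. The sketch below assumes that criterion (it is not stated in the answer above) and uses made-up toy data in place of data3 and theclass; it needs the Statistics and Machine Learning Toolbox.

```matlab
rng(0);                                    % reproducible toy data (hypothetical)
X = [randn(50,2) + 2; randn(50,2) - 2];    % two 2-D clusters
Y = [ones(50,1); -ones(50,1)];             % labels +1 / -1

gridC  = 2.^(-5:2:15);                     % search space for BoxConstraint
cvLoss = zeros(size(gridC));
for ii = 1:numel(gridC)
    mdl = fitcsvm(X, Y, 'KernelFunction', 'rbf', ...
        'BoxConstraint', gridC(ii), 'ClassNames', [-1, 1]);
    cvMdl = crossval(mdl, 'KFold', 5);     % 5-fold cross-validation
    cvLoss(ii) = kfoldLoss(cvMdl);         % misclassification rate on held-out folds
end
[bestLoss, bestIdx] = min(cvLoss);         % smallest cross-validated error
bestC = gridC(bestIdx);
```

If the best value lands on either end of gridC, that is usually a hint to widen the range in that direction and search again.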
Note: Another way to implement this is using libsvm, a fast and easy-to-use SVM toolbox, which has the interface of MATLAB.
I am currently reading "Introduction to machine learning" by Ethem Alpaydin, and I came across nearest centroid classifiers and tried to implement one. I guess I have implemented the classifier correctly, but I am getting only 68% accuracy. So, is the nearest centroid classifier itself inefficient, or is there some error in my implementation (below)?
The data set contains 1372 data points, each with 4 features, and there are 2 output classes.
My MATLAB implementation :
DATA = load("-ascii", "data.txt");
#DATA is 1372x5 matrix with 762 data points of class 0 and 610 data points of class 1
#there are 4 features of each data point
X = DATA(:,1:4); #matrix to store all features
X0 = DATA(1:762,1:4); #matrix to store the features of class 0
X1 = DATA(763:1372,1:4); #matrix to store the features of class 1
X0 = X0(1:610,:); #to make sure both datasets have same size for prior probability to be equal
Y = DATA(:,5); # to store outputs
mean0 = sum(X0)/610; #mean of features of class 0
mean1 = sum(X1)/610; #mean of features of class 1
count = 0;
for i = 1:1372
pre = 0;
cost1 = X(i,:)*(mean0'); #calculates the dot product of dataset with mean of features of both classes
cost2 = X(i,:)*(mean1');
if (cost1<cost2)
pre = 1;
end
if pre == Y(i)
count = count+1; #counts the number of correctly predicted values
end
end
disp("accuracy"); #calculates the accuracy
disp((count/1372)*100);
There are at least a few things here:
You are using the dot product to measure similarity in the input space, which is almost never valid. The only reason to use the dot product would be the assumption that all your data points have the same norm, or that the norm does not matter (which is nearly never true). Try using Euclidean distance instead: even though it is very naive, it should be significantly better.
Is it an inefficient classifier? That depends on the definition of efficiency. It is an extremely simple and fast one, but in terms of predictive power it is extremely bad. In fact, it is worse than Naive Bayes, which is already considered a "toy model".
There is something wrong with the code too:
X0 = DATA(1:762,1:4); #matrix to store the features of class 0
X1 = DATA(763:1372,1:4); #matrix to store the features of class 1
X0 = X0(1:610,:); #to make sure both datasets have same size for prior probability to be equal
Once you subsample X0, you have 1220 training samples, yet later during "testing" you test on both the training set and the "missing elements" of X0, which does not really make sense from a probabilistic perspective. First of all, you should never test accuracy on the training set (as it overestimates the true accuracy). Second, subsampling your training data does not equalize the priors, not in a method like this one; you are simply degrading the quality of your centroid estimates, nothing else. These kinds of techniques (sub-/over-sampling) equalize priors for models that do model priors. Your method does not (it is basically a generative model with an assumed prior of 1/2), so nothing good can come of it.
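To make the Euclidean-distance suggestion concrete, here is a minimal sketch on made-up 2-D toy data (the data and variable names are hypothetical, not the data set from the question). Since ||x - m||^2 = ||x||^2 - 2*x*m' + ||m||^2, comparing dot products alone drops the ||m||^2 term, which is why the two rules can disagree.

```matlab
% Hypothetical training data: rows are samples, columns are features
X0 = [0 0; 1 0; 0 1];          % class 0
X1 = [5 5; 6 5; 5 6];          % class 1
mean0 = mean(X0, 1);           % centroid of class 0
mean1 = mean(X1, 1);           % centroid of class 1

% Held-out points to classify (not used to estimate the centroids)
Xtest = [0.5 0.5; 5.5 5.5; 1 1];

% Squared Euclidean distance of every test point to each centroid
% (implicit expansion; needs MATLAB R2016b+ or Octave)
d0 = sum((Xtest - mean0).^2, 2);
d1 = sum((Xtest - mean1).^2, 2);

pred = double(d1 < d0);        % 1 if closer to the class-1 centroid, else 0
disp(pred')                    % 0 1 0
```

Accuracy should then be measured on points like Xtest that were kept out of the centroid computation.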
Best
I've a question about Neural networks in Matlab.
First of all, I have a small NN with 2 inputs, 1 hidden layer with 10 neurons, and one output, and this works fine. My question is: can I determine my own training data, validation data and test data?
I know that if I use e.g. net = feedforwardnet(10); I can divide my overall dataset into e.g. 70/100, 15/100 and 15/100. But I don't want to do this, because in this case I want to train my NN with 1000 data points, validate it with another set of data points, and test it with another independent data set of 1000 data points. In other words, I want to control these 3 independent data sets.
Thus, can someone help me?
Kind regards
Edit: I don't want to use a data set with 3000 data points and set the divideParam ratios to 1/3, 1/3 and 1/3.
Best myself
When you use a feedforwardnet, you can define your divide parameters:
net.divideParam.trainRatio = 1/3;
net.divideParam.valRatio = 1/3;
net.divideParam.testRatio = 1/3;
You know that your data will be divided into 3 pieces, but you (I) don't know which data points go where.
However, when you, and thus I, train the network with the following command:
[net,tr]=train(net,x,t);
then tr will contain all the necessary info, for example:
tr.trainInd 1x1000 double,
tr.valInd 1x1000 double,
tr.testInd 1x1000 double,
Thus tr.trainInd, for example, contains the indices of the data points that were used for training. In tr we can also see that tr.divideFcn is set to dividerand, which means that the indices are picked at random. It would therefore be logical that the indices can also be picked non-randomly. Combining both things, it should be possible to use a separate test set (net.divideParam.testRatio = 0) and two different training and validation sets (net.divideParam.trainRatio = 1/2 and net.divideParam.valRatio = 1/2), provided that divideFcn can be set to something chronological or sequential. If this is possible, then all that remains is to put the training and validation sets into one data set, etc.
Kind regards to myself
By default, random indices are used for the training, validation and test sets. This is set manually with the following, though since it is the default it is usually not needed:
net.divideFcn = 'dividerand';
and then you use the commands you noted above:
net.divideParam.trainRatio = 1/3;
net.divideParam.valRatio = 1/3;
net.divideParam.testRatio = 1/3;
To do what you want and set the indices of each set, you can do the following:
net.divideFcn = 'divideind';
net.divideParam.trainInd = 1:1000;
net.divideParam.valInd = 1001:2000;
net.divideParam.testInd = 2001:3000;
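Putting this together for three separately collected data sets could look roughly like the sketch below. The random data and the names xTrain, xVal, xTest (and their targets) are placeholders I made up; the idea is only that the sets are concatenated column-wise and divideind is told which columns belong to which set.

```matlab
% Hypothetical stand-ins for three independent data sets (2 inputs, 1 output)
xTrain = rand(2,1000);  tTrain = sum(xTrain,1);
xVal   = rand(2,1000);  tVal   = sum(xVal,1);
xTest  = rand(2,1000);  tTest  = sum(xTest,1);

% Concatenate column-wise; the column ranges identify each set
x = [xTrain xVal xTest];
t = [tTrain tVal tTest];
nTr = size(xTrain,2);  nVa = size(xVal,2);  nTe = size(xTest,2);

net = feedforwardnet(10);
net.divideFcn = 'divideind';                     % divide by explicit indices
net.divideParam.trainInd = 1:nTr;
net.divideParam.valInd   = nTr + (1:nVa);
net.divideParam.testInd  = nTr + nVa + (1:nTe);

[net, tr] = train(net, x, t);                    % tr records the indices actually used
```

Computing the index ranges from the set sizes avoids hard-coding 1000 if the three sets ever differ in length.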
I've got a problem implementing a multilayer perceptron with the MATLAB Neural Network Toolbox.
I am trying to implement a neural network that recognizes a single character stored as a binary image (size 40x50).
The image is transformed into a binary vector. The output is encoded in 6 bits. I use the simple newff function as follows (with 30 perceptrons in the hidden layer):
net = newff(P, [30, 6], {'tansig' 'tansig'}, 'traingd', 'learngdm', 'mse');
Then I train my network with a dozen characters in 3 different fonts, with the following training parameters:
net.trainParam.epochs=1000000;
net.trainParam.goal = 0.00001;
net.trainParam.lr = 0.01;
After training, the net recognized all characters from the training sets correctly, but...
It cannot recognize more than two characters from other fonts.
How could I improve that simple network?
You can try adding random elastic distortions to your training set (in order to expand it and make it more "generalizable").
You can see the details in this nice article from Microsoft Research:
http://research.microsoft.com/pubs/68920/icdar03.pdf
You have a very large number of input variables (2,000, if I understand your description). My first suggestion is to reduce this number if possible. Some possible techniques include subsampling the input variables or calculating informative features (such as row and column totals, which would reduce the input vector to 90 = 40 + 50 values).
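As a rough illustration of the row-and-column-totals idea, assuming the character is held in a 40x50 logical matrix (the random image below is just a placeholder):

```matlab
img = rand(40,50) > 0.5;             % hypothetical 40x50 binary character image
rowTotals = sum(img, 2)';            % 1 x 40: number of set pixels in each row
colTotals = sum(img, 1);             % 1 x 50: number of set pixels in each column
features  = [rowTotals colTotals]';  % 90 x 1 input vector instead of 2000 x 1
```

Each image then contributes one 90-element column to the network input matrix.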
Also, your output is coded as 6 bits, which provides 64 possible combined values, so I assume that you are using these to represent 26 letters? If so, then you may fare better with another output representation. Consider that various letters which look nothing alike will, for instance, share the value of 1 on bit 1, complicating the mapping from inputs to outputs. An output representation with 1 bit for each class would simplify things.
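A one-bit-per-class target matrix can be built from integer class indices along these lines (the labels vector is a made-up example, with 1 = A through 26 = Z):

```matlab
labels = [1 2 26 3];                      % hypothetical class indices: A, B, Z, C
numClasses = 26;
N = numel(labels);

T = zeros(numClasses, N);                 % one column per sample
T(sub2ind(size(T), labels, 1:N)) = 1;     % set row labels(k) of column k to 1
```

Every column of T then contains exactly one 1, in the row of the correct letter.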
You could use patternnet instead of newff; this creates a network more suitable for pattern recognition. As the target, use a 26-element vector with 1 in the right letter's position (0 elsewhere). The output of the recognition will be a vector of 26 real values between 0 and 1, with the recognized letter having the highest value.
Make sure to use data from all fonts for the training.
Give all data sets as input; train will automatically divide them into training, validation and test sets according to the specified percentages:
net.divideParam.trainRatio = .70;
net.divideParam.valRatio = .15;
net.divideParam.testRatio = .15;
(choose your own percentages).
Then test using only the test set. You can find its indices with:
[net, tr] = train(net,inputs,targets);
tr.testInd
I'm working on creating a 2-layer neural network with back-propagation. The NN is supposed to get its data from a 20001x17 matrix that holds the following information in each row:
-The first 16 cells hold integers ranging from 0 to 15 which act as variables to help us determine which one of the 26 letters of the alphabet we mean to express when seeing those variables. For example a series of 16 values as follows are meant to represent the letter A: [2 8 4 5 2 7 5 3 1 6 0 8 2 7 2 7].
-The 17th cell holds a number ranging from 1 to 26 representing the letter of the alphabet we want. 1 stands for A, 2 stands for B etc.
The output layer of the NN consists of 26 outputs. Every time the NN is fed an input like the one described above, it's supposed to output a 1x26 vector containing zeros in all but the one cell that corresponds to the letter that the input values were meant to represent. For example, the output [1 0 0 ... 0] would be the letter A, whereas [0 0 0 ... 1] would be the letter Z.
Some things that are important before I present the code: I need to use the traingdm function, and the number of hidden-layer neurons is fixed (for now) at 21.
To implement the above concept, I wrote the following MATLAB code:
%%%%%%%%
%Start of code%
%%%%%%%%
%
%Initialize the input and target vectors
%
p = zeros(16,20000); % the loop below fills 20000 columns (rows 2..20001 of data)
t = zeros(26,20000);
%
%Fill the input and training vectors from the dataset provided
%
for i=2:20001
for k=1:16
p(k,i-1) = data(i,k);
end
t(data(i,17),i-1) = 1;
end
net = newff(minmax(p),[21 26],{'logsig' 'logsig'},'traingdm');
y1 = sim(net,p);
net.trainParam.epochs = 200;
net.trainParam.show = 1;
net.trainParam.goal = 0.1;
net.trainParam.lr = 0.8;
net.trainParam.mc = 0.2;
net.divideFcn = 'dividerand';
net.divideParam.trainRatio = 0.7;
net.divideParam.testRatio = 0.2;
net.divideParam.valRatio = 0.1;
%[pn,ps] = mapminmax(p);
%[tn,ts] = mapminmax(t);
net = init(net);
[net,tr] = train(net,p,t);
y2 = sim(net,p); % pn is undefined while the mapminmax lines are commented out
%%%%%%%%
%End of code%
%%%%%%%%
Now to my problem: I want my outputs to be as described, namely each column of y2 should be a representation of a letter. My code doesn't do that, though. Instead it produces results that vary greatly between 0 and 1, with values from 0.1 to 0.9.
My question is: is there some conversion I need to be doing that I am not? Meaning, do I have to convert my input and/or output data to a form in which I can actually see whether my NN is learning correctly?
Any input would be appreciated.
This is normal. Your output layer is using a log-sigmoid transfer function, and that will always give you some intermediate output between 0 and 1.
What you would usually do would be to look for the output with the largest value -- in other words, the most likely character.
This would mean that, for every column in y2, you're looking for the index of the row that contains the largest value in that column. You can compute this as follows:
[dummy, I]=max(y2);
I is then a vector containing, for each column, the row index of the largest value.
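For instance, on a small made-up output matrix with 4 classes and 3 inputs (the values are hypothetical), the indices can be turned into letters, or back into a 0/1 matrix, like this:

```matlab
y2 = [0.20 0.10 0.70;
      0.50 0.10 0.10;
      0.15 0.60 0.10;
      0.15 0.20 0.10];            % hypothetical outputs: one column per input

[~, I] = max(y2, [], 1);          % row index of the largest value in each column
letters = char('A' + I - 1);      % map 1 -> 'A', 2 -> 'B', ...

% Optional: rebuild a 0/1 matrix with a single 1 per column
onehot = zeros(size(y2));
onehot(sub2ind(size(y2), I, 1:size(y2,2))) = 1;
```

Here I comes out as [2 3 1], so letters is 'BCA'.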
You can think of y2 as an output probability distribution for each input being one of the 26 alphabet characters, for example if one column of y2 says:
.2
.5
.15
.15
then there is a 50% probability that this character is B (if we assume only 4 possible outputs).
==REMARK==
The output layer of the NN consists of 26 outputs. Every time the NN is fed an input like the one described above it's supposed to output a 1x26 vector containing zeros in all but the one cell that corresponds to the letter that the input values were meant to represent. for example the output [1 0 0 ... 0] would be letter A, whereas [0 0 0 ... 1] would be the letter Z.
It is preferable to avoid using target values of 0,1 to encode the output of the network.
The reason for avoiding target values of 0 and 1 is that 'logsig' sigmoid transfer function cannot produce these output values given finite weights. If you attempt to train the network to fit target values of exactly 0 and 1, gradient descent will force the weights to grow without bound.
So instead of 0 and 1 values, try using values of 0.04 and 0.9 for example, so that [0.9,0.04,...,0.04] is the target output vector for the letter A.
Reference:
Thomas M. Mitchell, Machine Learning, McGraw-Hill Higher Education, 1997, p114-115
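A small sketch of that rescaling, assuming the targets start out as a 0/1 matrix (the 4-class identity matrix below is only a placeholder):

```matlab
t = eye(4);                        % hypothetical 0/1 targets: 4 classes, 4 samples
lowVal  = 0.04;                    % value used in place of 0
highVal = 0.9;                     % value used in place of 1

tSoft = lowVal + (highVal - lowVal) * t;   % maps 0 -> 0.04 and 1 -> 0.9
```

tSoft is then passed to train in place of t; picking the largest output per column still recovers the class.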
Use hardlin fcn in output layer.
Use trainlm or trainrp for training the network.
To train your network, use a for loop with a condition that compares the output and the target. When the result is good enough, use break to exit the training loop.
Use another way instead of mapminmax for pre-processing data set.
I don't know if this constitutes an actual answer or not: but here are some remarks.
I don't understand your coding scheme. How is an 'A' represented as that set of numbers? It looks like you're falling into a fairly common trap of using arbitrary numbers to code categorical values. Don't do this: for example if 'a' is 1, 'b' is 2 and 'c' is 3, then your coding has implicitly stated that 'a' is more like 'b' than 'c' (because the network has real-value inputs the ordinal properties matter). The way to do this properly is to have each letter represented as 26 binary valued inputs, where only one is ever active, representing the letter.
Your outputs are correct: the activation at the output layer will never be exactly 0 or 1, but real numbers. You could take the max as your activity function, but this is problematic because it's not differentiable, so you can't use back-prop. What you should do is couple the outputs with the softmax function, so that their sum is one. You can then treat the outputs as conditional probabilities given the inputs, if you so desire. While the network is not explicitly probabilistic, with the correct activity and activation functions it will be identical in structure to a log-linear model (possibly with latent variables corresponding to the hidden layer), and people do this all the time.
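For reference, a minimal softmax over the columns of a matrix of output-layer pre-activations might look like this (z is a made-up example; subtracting the column maximum is a common numerical-stability trick and does not change the result):

```matlab
z = [2.0 0.1;
     1.0 0.1;
     0.1 3.0];                         % hypothetical pre-activations: 3 classes x 2 inputs

zShift = z - max(z, [], 1);            % shift each column so its maximum is 0
p = exp(zShift) ./ sum(exp(zShift), 1);   % each column of p sums to 1
```

This relies on implicit expansion (MATLAB R2016b or later, or Octave); on older versions the same operations can be written with bsxfun.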
See David MacKay's textbook for a nice intro to neural nets which will make clear the probabilistic connection. Take a look at this paper from Geoff Hinton's group, which describes the task of predicting the next character given the context, for details on the correct representation and activation/activity functions (although beware: their method is non-trivial and uses a recurrent net with a different training method).