Neural Network - Train a MLP with multiple entries - neural-network

I implemented an MLP with backpropagation, and it works fine for a single entry: for example, if the input is 1 and 1, the outputs of the last layer will be 1 and 0.
Suppose that instead of a single entry (like 1,1) I have four entries (1,1; 1,0; 0,0; 0,1), each with a different expected answer.
I need to train this MLP so that it answers correctly for all entries.
I can't find a way to do this. Suppose I have 1000 epochs: would I need to train each entry for 250 epochs? Or train one epoch with one entry, then the next epoch with another entry?
How can I properly train an MLP to answer correctly for all entries?

At least for a Python implementation, you can simply use multidimensional training data:
# training a neural network to behave like an XOR gate
import numpy as np
def sigmoid(x):
    return 1 / (1 + np.exp(-x))
def sigmoidPrime(s):
    # derivative of the sigmoid, written in terms of its output s = sigmoid(x)
    return s * (1 - s)
X = np.array([[1,0],[0,1],[1,1],[0,0]]) # entries
y = np.array([[1],[1],[0],[0]]) # expected answers
INPUTS = X.shape[1]
HIDDEN = 12
OUTPUTS = y.shape[1]
w1 = np.random.randn(INPUTS, HIDDEN) * np.sqrt(2 / INPUTS)
w2 = np.random.randn(HIDDEN, OUTPUTS) * np.sqrt(2 / HIDDEN)
ALPHA = 0.5
EPOCHS = 1000
for e in range(EPOCHS):
    z1 = sigmoid(X.dot(w1)) # hidden activations for all entries at once
    o = sigmoid(z1.dot(w2)) # outputs for all entries at once
    o_error = o - y
    o_delta = o_error * sigmoidPrime(o)
    w2_error = o_delta.dot(w2.T) # backpropagate before w2 is updated
    w2_delta = w2_error * sigmoidPrime(z1)
    w2 -= z1.T.dot(o_delta) * ALPHA
    w1 -= X.T.dot(w2_delta) * ALPHA
print(np.mean(np.abs(o_error))) # prints the final mean absolute error of the NN
Such an approach might not work with some neural network libraries, but that shouldn't matter, because neural network libraries usually handle batching like this themselves.
The reason this works is that during the dot product between the input and the hidden layer, each training entry gets matrix-multiplied with the entire hidden layer individually, so the result is a matrix containing one row per sample forwarded through the hidden layer.
This process continues throughout the entire network, so what you are essentially doing is running multiple instances of the same neural network in parallel.
The number of training entries doesn't have to be four. It can be arbitrarily large, as long as each entry of X has the size of the input layer, each entry of y has the size of the output layer, and X and y have the same length (and you have enough RAM).
Also, nothing about the neural network architecture fundamentally changes compared with using single entries; only the data that is fed into it changes. So you don't have to scrap the code you've written, and most likely only need to make a few small changes.
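To make the "multiple instances in parallel" point concrete, here is a minimal sketch. The weights are random and the helper name forward is made up for illustration. It checks that one forward pass over the whole batch produces the same rows as forwarding each entry on its own:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(X, w1, w2):
    """Forward pass for a batch X of shape (n_samples, n_inputs)."""
    return sigmoid(sigmoid(X.dot(w1)).dot(w2))

rng = np.random.default_rng(0)        # fixed seed, for illustration only
w1 = rng.standard_normal((2, 12))     # hypothetical weights: 2 inputs -> 12 hidden
w2 = rng.standard_normal((12, 1))     # 12 hidden -> 1 output

X = np.array([[1, 0], [0, 1], [1, 1], [0, 0]], dtype=float)

batch_out = forward(X, w1, w2)        # all four entries at once
# the same four entries, forwarded one at a time and stacked back together
single_out = np.vstack([forward(x[None, :], w1, w2) for x in X])

print(batch_out.shape)                      # (4, 1)
print(np.allclose(batch_out, single_out))   # True
```

Because the rows are independent in the forward pass, the batched weight update in the loop above is just the sum of the per-entry updates.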

Related

NN model outputs one category for binary classification

My dataset has labels 0 and 1 and contains 100 examples, each with feature dimension 39. There are 50 examples belonging to class 1 and the remaining 50 belonging to class 0. The graphical output shows only one output instead of two. There should be two output nodes since there are two categories. I am flabbergasted as to why this is happening. The following is the code. I shall be grateful for your help.
hiddenlayersize = 5;
net = patternnet(hiddenlayersize);
net = init(net);
net.performFcn = 'crossentropy';
[net] = train(net,x,t);
out = sim(net,x);
Below is the model:
Also, out is not in binary. How do I get the predicted labels in binary as well?
The classifier outputs its results in the form of probabilities, so your results are fine.
The default threshold for converting probabilities into the two classes 0 and 1 is 0.5.
You can fine-tune the threshold by moving it up or down and analysing the outcomes (false positives, false negatives, precision-recall curves, etc.), depending on what the objective is.
Hope this helps.
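The thresholding step itself is independent of the toolbox. As a minimal sketch in Python (the function name to_labels and the probabilities are made up for illustration), converting probabilities into 0/1 labels and moving the threshold looks like this:

```python
def to_labels(probabilities, threshold=0.5):
    """Map each probability to 1 if it reaches the threshold, else 0."""
    return [1 if p >= threshold else 0 for p in probabilities]

probs = [0.08, 0.47, 0.51, 0.93]   # made-up network outputs

print(to_labels(probs))            # [0, 0, 1, 1]
print(to_labels(probs, 0.4))       # [0, 1, 1, 1]
```

Lowering the threshold turns more borderline cases into class 1, which typically trades false negatives for false positives.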

Loss Function & Its Inputs For Binary Classification PyTorch

I'm trying to write a neural network for binary classification in PyTorch and I'm confused about the loss function.
I see that BCELoss is a common function specifically geared for binary classification. I also see that an output layer of N outputs for N possible classes is standard for general classification. However, for binary classification it seems like it could be either 1 or 2 outputs.
So, should I have 2 outputs (1 for each label) and then convert my 0/1 training labels into [1,0] and [0,1] arrays, or use something like a sigmoid for a single-variable output?
Here are the relevant snippets of code so you can see:
self.outputs = nn.Linear(NETWORK_WIDTH, 2) # 1 or 2 dimensions?
def forward(self, x):
    # other layers omitted
    x = self.outputs(x)
    return F.log_softmax(x) # <<< softmax over multiple vars, sigmoid over one, or other?
criterion = nn.BCELoss() # <<< Is this the right function?
net_out = net(data)
loss = criterion(net_out, target) # <<< Should target be an integer label or 1-hot vector?
Thanks in advance.
For binary outputs you can use 1 output unit, so then:
self.outputs = nn.Linear(NETWORK_WIDTH, 1)
Then you use sigmoid activation to map the values of your output unit to a range between 0 and 1 (of course you need to arrange your training data this way too):
def forward(self, x):
    # other layers omitted
    x = self.outputs(x)
    return torch.sigmoid(x)
Finally you can use the torch.nn.BCELoss:
criterion = nn.BCELoss()
net_out = net(data)
loss = criterion(net_out, target)
This should work fine for you.
You can also use torch.nn.BCEWithLogitsLoss. This loss function already includes the sigmoid, so you can leave it out of your forward.
If you want to use 2 output units, this is also possible. But then you need to use torch.nn.CrossEntropyLoss instead of BCELoss. The softmax activation is already included in this loss function, so the network should return the raw outputs of the last layer, and the targets can be passed as integer class labels.
Edit: I just want to emphasize that there is a real difference in doing so. Using 2 output units gives you twice as many weights in the output layer as using 1 output unit, so these two alternatives are not equivalent.
Some theoretical additions:
For binary classification (say class 0 and class 1), the network should have only 1 output unit. Its target output is 1 (class 1 present, class 0 absent) or 0 (class 1 absent, class 0 present).
For the loss calculation, you should first pass the output through a sigmoid and then through binary cross-entropy (BCE). The sigmoid transforms the output of the network into a probability (between 0 and 1), and BCE then maximizes the likelihood of the desired output.
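To see how the one-unit and two-unit setups relate, here is a small pure-Python sketch. It is an illustration of the math, not PyTorch code, and the helper names bce and softmax_ce are made up. It compares sigmoid followed by BCE on a single logit z with softmax cross-entropy on the two logits [0, z]:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce(p, target):
    """Binary cross-entropy for one predicted probability p and a 0/1 target."""
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))

def softmax_ce(logits, target_index):
    """Cross-entropy of softmax(logits) against an integer class index."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum - logits[target_index]

z = 1.7  # arbitrary logit, chosen for illustration
for target in (0, 1):
    one_unit = bce(sigmoid(z), target)        # 1 output unit: sigmoid + BCE
    two_units = softmax_ce([0.0, z], target)  # 2 output units, first logit pinned to 0
    print(target, round(one_unit, 6), round(two_units, 6))
```

The two losses agree because softmax over [0, z] assigns class 1 the probability sigmoid(z). In a real two-unit network the first logit is not pinned to 0, which is where the extra weights mentioned above come from.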

Is nearest centroid classifier really inefficient?

I am currently reading "Introduction to Machine Learning" by Ethem Alpaydin. I came across nearest centroid classifiers and tried to implement one. I think I have implemented the classifier correctly, but I am getting only 68% accuracy. So, is the nearest centroid classifier itself inefficient, or is there some error in my implementation (below)?
The data set contains 1372 data points each having 4 features and there are 2 output classes
My MATLAB implementation :
DATA = load("-ascii", "data.txt");
#DATA is 1372x5 matrix with 762 data points of class 0 and 610 data points of class 1
#there are 4 features of each data point
X = DATA(:,1:4); #matrix to store all features
X0 = DATA(1:762,1:4); #matrix to store the features of class 0
X1 = DATA(763:1372,1:4); #matrix to store the features of class 1
X0 = X0(1:610,:); #to make sure both datasets have same size for prior probability to be equal
Y = DATA(:,5); # to store outputs
mean0 = sum(X0)/610; #mean of features of class 0
mean1 = sum(X1)/610; #mean of features of class 1
count = 0;
for i = 1:1372
    pre = 0;
    cost1 = X(i,:)*(mean0'); #calculates the dot product of dataset with mean of features of both classes
    cost2 = X(i,:)*(mean1');
    if (cost1<cost2)
        pre = 1;
    end
    if pre == Y(i)
        count = count+1; #counts the number of correctly predicted values
    end
end
disp("accuracy"); #calculates the accuracy
disp((count/1372)*100);
There are at least a few things here:
You are using the dot product to measure similarity in the input space, which is almost never valid. The only reason to use the dot product would be the assumption that all your data points have the same norm, or that the norm does not matter (nearly never true). Try using Euclidean distance instead; even though it is very naive, it should be significantly better.
Is it an inefficient classifier? That depends on the definition of efficiency. It is an extremely simple and fast one, but in terms of predictive power it is extremely bad. In fact, it is worse than Naive Bayes, which is already considered a "toy model".
There is something wrong with the code too:
X0 = DATA(1:762,1:4); #matrix to store the features of class 0
X1 = DATA(763:1372,1:4); #matrix to store the features of class 1
X0 = X0(1:610,:); #to make sure both datasets have same size for prior probability to be equal
Once you subsample X0, you have 1220 training samples, yet later, during "testing", you test on both the training data and the "missing elements of X0", which does not really make sense from a probabilistic perspective. First of all, you should never test accuracy on the training set (it overestimates the true accuracy). Second, subsampling your training data does not equalize the priors, not in a method like this one; you are simply degrading the quality of your centroid estimate, nothing else. These kinds of techniques (sub- and over-sampling) equalize priors for models that do model priors. Your method does not (it is basically a generative model with an assumed prior of 1/2), so nothing good can come of it.
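As a rough illustration of the first and third points, here is a minimal Python sketch on invented toy data (it is not the asker's dataset, and the helper names are made up). It computes the centroids from training points only, classifies held-out points by Euclidean distance, and shows what the dot-product rule does on the same points:

```python
import math

def centroid(points):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def predict(x, centroids):
    """Label of the centroid closest to x in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(x, centroids[label]))

def predict_dot(x, centroids):
    """Label of the centroid with the largest dot product (shown for comparison)."""
    return max(centroids,
               key=lambda label: sum(a * b for a, b in zip(x, centroids[label])))

# Toy 2-feature training data, invented for illustration
train = {
    0: [[1.0, 1.0], [1.5, 0.5], [0.5, 1.5]],
    1: [[4.0, 4.0], [4.5, 3.5], [3.5, 4.5]],
}
centroids = {label: centroid(pts) for label, pts in train.items()}

# Held-out points that were NOT used to compute the centroids
held_out = [([0.8, 1.2], 0), ([4.2, 3.9], 1), ([1.2, 0.9], 0)]

correct = sum(predict(x, centroids) == y for x, y in held_out)
print(correct / len(held_out))                          # 1.0
print([predict_dot(x, centroids) for x, _ in held_out]) # [1, 1, 1]
```

On this toy data the dot product always favours the centroid with the larger norm, so every point is assigned to class 1, while the Euclidean rule separates the classes.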

Matlab - neural networks - How to use different datasets for training, validation and testing?

Best
I have a question about neural networks in Matlab.
First of all, I have a small NN: 2 inputs, 1 hidden layer with 10 neurons, and one output, and this works fine. My question is: can I choose my training data, validation data and test data myself?
I know that if I use e.g. net = feedforwardnet(10); I can divide my overall dataset into e.g. 70/100, 15/100 and 15/100. But I don't want to do this, because in this case I want to train my NN with 1000 data points, validate it with another set of data points, and test it with another independent dataset of 1000 data points. In other words, I want to control these 3 independent datasets.
Thus, can someone help me?
Kind regards
Edit: I don't want to use a dataset with 3000 data points and set the divideParam ratios to 1/3, 1/3 and 1/3.
Best myself
When you use a feedforwardnet then you can define your divide parameters
net.divideParam.trainRatio = 1/3;
net.divideParam.valRatio = 1/3;
net.divideParam.testRatio = 1/3;
This tells you that your data will be divided into three parts,
but not which data points end up in which part.
However, when you train the network with the following command:
[net,tr]=train(net,x,t);
then tr will contain all the necessary information, for example:
tr.trainInd 1x1000 double,
tr.valInd 1x1000 double,
tr.testInd 1x1000 double,
Thus, for example, tr.trainInd contains the indices of all data points that were used for training. In tr we can also see that tr.divideFcn is set to dividerand, which means that the indices are picked at random. It seems logical, then, that there should also be a way to pick the indices non-randomly. If tr.divideFcn can be set to something chronological or sequential, the two things can be combined: use a separate test set (net.divideParam.testRatio = 0) and two different training and validation sets (net.divideParam.trainRatio = 1/2 and net.divideParam.valRatio = 1/2). If that is possible, all that remains is to put the training and validation sets into one dataset, and so on.
Kind regards to myself
By default, the indices for training, validation and testing are picked at random. This can be set explicitly with the following, although that is usually unnecessary since it is the default:
net.divideFcn = 'dividerand'
and then you use the commands you noted above:
net.divideParam.trainRatio = 1/3;
net.divideParam.valRatio = 1/3;
net.divideParam.testRatio = 1/3;
To do what you want and set the indices of each set yourself, you can do the following:
net.divideFcn = 'divideind'
net.divideParam.trainInd = [1:1000]
net.divideParam.valInd=[1001:2000]
net.divideParam.testInd=[2001:3000]
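The index ranges above assume that the three datasets have been concatenated, in the order training, validation, test, into a single x and t before calling train. As a small bookkeeping sketch in Python (the function name split_indices is made up, and the validation size of 1000 is an assumption taken from the example ranges), the 1-based ranges for datasets of arbitrary sizes can be computed like this:

```python
def split_indices(n_train, n_val, n_test):
    """1-based (MATLAB-style) index lists for three datasets stacked in order."""
    train = list(range(1, n_train + 1))
    val = list(range(n_train + 1, n_train + n_val + 1))
    test = list(range(n_train + n_val + 1, n_train + n_val + n_test + 1))
    return train, val, test

train_ind, val_ind, test_ind = split_indices(1000, 1000, 1000)

print(train_ind[0], train_ind[-1])   # 1 1000
print(val_ind[0], val_ind[-1])       # 1001 2000
print(test_ind[0], test_ind[-1])     # 2001 3000
```

The three sets do not have to be the same size; only the order in which they were stacked matters.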

Matlab neural networks - bad results

I've got a problem implementing a multilayer perceptron with the Matlab Neural Network Toolbox.
I am trying to implement a neural network that recognizes a single character stored as a binary image (size 40x50).
The image is transformed into a binary vector. The output is encoded in 6 bits. I use the simple newff function in this way (with 30 perceptrons in the hidden layer):
net = newff(P, [30, 6], {'tansig' 'tansig'}, 'traingd', 'learngdm', 'mse');
Then I train my network with a dozen characters in 3 different fonts, with the following training parameters:
net.trainParam.epochs=1000000;
net.trainParam.goal = 0.00001;
net.trainParam.lr = 0.01;
After training, the net recognized all characters from the training set correctly, but...
It cannot recognize more than two characters from other fonts.
How could I improve that simple network?
You can try adding random elastic distortions to your training set (to expand it and make it more "generalizable").
You can see the details in this nice article from Microsoft Research:
http://research.microsoft.com/pubs/68920/icdar03.pdf
You have a very large number of input variables (2,000, if I understand your description correctly). My first suggestion is to reduce this number if possible. Possible techniques include subsampling the input variables or calculating informative features (such as row and column totals, which would reduce the input vector to 90 = 40 + 50 values).
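As a minimal sketch of the row and column totals idea (in Python, with a made-up image containing a single vertical stroke; the function name is hypothetical), the 2,000 pixels collapse to 90 features like this:

```python
def row_column_totals(image):
    """Concatenate the row sums and column sums of a 2-D binary image."""
    row_totals = [sum(row) for row in image]
    col_totals = [sum(col) for col in zip(*image)]
    return row_totals + col_totals

# Hypothetical image with 50 rows and 40 columns: one vertical stroke in column 20
image = [[1 if c == 20 else 0 for c in range(40)] for _ in range(50)]

features = row_column_totals(image)
print(len(features))   # 90
```

These totals discard a lot of shape information, so they are only one candidate feature set to try, not a guaranteed improvement.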
Also, your output is coded as 6 bits, which provides 64 possible combined values, so I assume that you are using these to represent 26 letters? If so, then you may fare better with another output representation. Consider that various letters which look nothing alike will, for instance, share the value of 1 on bit 1, complicating the mapping from inputs to outputs. An output representation with 1 bit for each class would simplify things.
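To illustrate the bit-sharing problem, here is a small Python sketch. It assumes the letters A to Z are simply numbered 0 to 25 and written in binary; the asker's actual 6-bit coding is not given, so this is only a hypothetical example:

```python
def six_bit_code(index):
    """6-bit binary code (most significant bit first) for a class index."""
    return [(index >> shift) & 1 for shift in range(5, -1, -1)]

letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
codes = {ch: six_bit_code(i) for i, ch in enumerate(letters)}  # assumed coding

print(codes["B"])   # [0, 0, 0, 0, 0, 1]
print(codes["D"])   # [0, 0, 0, 0, 1, 1]

# Letters whose last output bit must be 1, regardless of how they look
share_last_bit = "".join(ch for ch in letters if codes[ch][-1] == 1)
print(share_last_bit)   # BDFHJLNPRTVXZ
```

Under this coding a single output unit has to fire for thirteen visually unrelated letters, which is the kind of arbitrary mapping the network is being asked to learn.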
You could use patternnet instead of newff; this creates a network more suitable for pattern recognition. As the target, use a 26-element vector with a 1 in the position of the correct letter (0 elsewhere). The output of the recognition will be a vector of 26 real values between 0 and 1, and the recognized letter is the one with the highest value.
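The target encoding and the decoding step can be sketched in a few lines of Python. The helper names one_hot and decode and the output values are made up for illustration; the same logic applies to the columns of the target and output matrices in Matlab:

```python
LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def one_hot(letter):
    """26-element target vector with a 1 at the letter's position."""
    target = [0] * len(LETTERS)
    target[LETTERS.index(letter)] = 1
    return target

def decode(output):
    """Letter whose output unit has the highest value."""
    best = max(range(len(output)), key=lambda i: output[i])
    return LETTERS[best]

# Made-up network output: every unit low except the one for "Q"
output = [0.02] * 26
output[LETTERS.index("Q")] = 0.81

print(decode(output))         # Q
print(decode(one_hot("K")))   # K
```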
Make sure to use data from all fonts for the training.
Give all data sets as input; train will automatically divide them into training, validation and test sets according to the specified percentages:
net.divideParam.trainRatio = .70;
net.divideParam.valRatio = .15;
net.divideParam.testRatio = .15;
(choose your own percentages).
Then test using only the test set; you can find its indices in
[net, tr] = train(net,inputs,targets);
tr.testInd