Deeplearning4j LSTM Example - neural-network
I am trying to understand LSTMs in Deeplearning4j. I am examining the source code for the example, but I can't understand this:
//Allocate space:
//Note the order here:
// dimension 0 = number of examples in minibatch
// dimension 1 = size of each vector (i.e., number of characters)
// dimension 2 = length of each time series/example
INDArray input = Nd4j.zeros(currMinibatchSize, validCharacters.length, exampleLength);
INDArray labels = Nd4j.zeros(currMinibatchSize, validCharacters.length, exampleLength);
Why do we store a 3D array, and what does it mean?
Good question, but this has nothing to do with how an LSTM works; it comes from the task itself. The task is to forecast the next character, and that forecast has two facets: classification and approximation.
If we were dealing with approximation only, a one-dimensional array would suffice. But since we deal with approximation and classification simultaneously, we can't feed the network just a normalized ASCII representation of each character. We need to transform each character into an array (a one-hot vector).
For example, a (lowercase a) will be represented this way:
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
b (lowercase) will be represented as:
0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
c will be represented as:
0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
Z (capital z!) will be represented as:
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
So each character gives us a one-hot vector. Stacking those vectors over a sequence gives a 2D array, and a minibatch of sequences gives a 3D array. How are all of those dimensions laid out? The code comment has the following explanation:
// dimension 0 = number of examples in minibatch
// dimension 1 = size of each vector (i.e., number of characters)
// dimension 2 = length of each time series/example
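A minimal MATLAB-style sketch of how such a 3D one-hot tensor could be filled (the variable names and the toy sequences are hypothetical; the actual DL4J example does the equivalent with Nd4j calls, and DL4J's dimensions 0/1/2 correspond to MATLAB dims 1/2/3):
validCharacters = ['a':'z' 'A':'Z'];   % assumed alphabet
seqs = {'abc'; 'cab'};                 % toy minibatch of 2 example sequences
miniBatchSize = numel(seqs);
exampleLength = 3;
input = zeros(miniBatchSize, numel(validCharacters), exampleLength);
for ex = 1:miniBatchSize
    for t = 1:exampleLength
        charIdx = find(validCharacters == seqs{ex}(t)); % position in the alphabet
        input(ex, charIdx, t) = 1;  % one-hot: example x character x time step
    end
end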
I sincerely commend your effort to understand how LSTMs work, but the code you pointed to applies to all kinds of neural networks: it explains how to feed text data into a network, not how an LSTM works. For that you need to look at a different part of the source code.
Related
How to decide the range for the hyperparameter space in SVM tuning? (MATLAB)
I am tuning an SVM using a for loop to search over the hyperparameter space. The learned SVM model contains the following fields:
SVMModel: [1×1 ClassificationSVM]
C: 2
FeaturesIdx: [4 6 8]
Score: 0.0142
Question 1) What is the meaning of the field 'score' and its utility?
Question 2) I am tuning the BoxConstraint, the C value. Let the number of features be denoted by the variable featsize. The variable gridC contains the search space, which can start from any value, say 2^-5, 2^-3, up to 2^15, etc., so gridC = 2.^(-5:2:15). I cannot understand if there is a way to select the range?
1. score is documented here, which says:
Classification Score: The SVM classification score for classifying observation x is the signed distance from x to the decision boundary, ranging from -∞ to +∞. A positive score for a class indicates that x is predicted to be in that class; a negative score indicates otherwise.
In a two-class case, if there are six observations and the predict function gave us some score matrix called TestScore, we could determine which class each observation is ascribed to by:
TestScore = [-0.4497  0.4497;
             -0.2602  0.2602;
             -0.0746  0.0746;
              0.1070 -0.1070;
              0.2841 -0.2841;
              0.4566 -0.4566];
[~,Classes] = max(TestScore,[],2);
In two-class classification we can also use find(TestScore > 0) instead, and it is clear that the first three observations belong to the second class and the 4th to 6th observations belong to the first class. In multiclass cases there can be several scores > 0, but the code max(Scores,[],2) is still valid. For example, we could use the following code (from here, an example called Find Multiple Class Boundaries Using Binary SVM) to determine the classes of the predicted Samples:
for j = 1:numel(classes)
    [~,score] = predict(SVMModels{j},Samples);
    Scores(:,j) = score(:,2); % Second column contains positive-class scores
end
[~,maxScore] = max(Scores,[],2);
Then maxScore will denote the predicted class of each sample.
2. BoxConstraint denotes C in the SVM model, so we can train SVMs with different hyperparameters and select the best one with something like:
gridC = 2.^(-5:2:15);
for ii = 1:length(gridC)
    SVModel = fitcsvm(data3,theclass,'KernelFunction','rbf',...
        'BoxConstraint',gridC(ii),'ClassNames',[-1,1]);
    %if (some constraint is met)
    %    save the current SVModel
    %end
end
Note: another way to implement this is using libsvm, a fast and easy-to-use SVM toolbox that has a MATLAB interface.
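The placeholder condition in that loop can be filled in with cross-validation; here is a minimal sketch assuming fitcsvm's built-in 'KFold' option, with data3 and theclass as above:
gridC = 2.^(-5:2:15);
cvLoss = zeros(size(gridC));
for ii = 1:length(gridC)
    CVModel = fitcsvm(data3,theclass,'KernelFunction','rbf',...
        'BoxConstraint',gridC(ii),'ClassNames',[-1,1],'KFold',5);
    cvLoss(ii) = kfoldLoss(CVModel);  % 5-fold cross-validated classification error
end
[~,best] = min(cvLoss);
bestC = gridC(best);                  % pick the C with the lowest CV error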
How to perform operations along a certain dimension of an array?
I have a 3D array containing five 3-by-4 slices, defined as follows:
rng(3372061);
M = randi(100,3,4,5);
I'd like to collect some statistics about the array:
The maximum value in every column.
The mean value in every row.
The standard deviation within each slice.
This is quite straightforward using loops:
sz = size(M);
colMax = zeros(1,4,5);
rowMean = zeros(3,1,5);
sliceSTD = zeros(1,1,5);
for indS = 1:sz(3)
    sl = M(:,:,indS);
    sliceSTD(indS) = std(sl(1:sz(1)*sz(2)));
    for indC = 1:sz(1)
        rowMean(indC,1,indS) = mean(sl(indC,:));
    end
    for indR = 1:sz(2)
        colMax(1,indR,indS) = max(sl(:,indR));
    end
end
But I'm not sure that this is the best way to approach the problem. A common pattern I noticed in the documentation of max, mean and std is that they allow specifying an additional dim input. For instance, in max:
M = max(A,[],dim) returns the largest elements along dimension dim. For example, if A is a matrix, then max(A,[],2) is a column vector containing the maximum value of each row.
How can I use this syntax to simplify my code?
Many functions in MATLAB allow the specification of a "dimension to operate over" when it matters for the result of the computation (several common examples are: min, max, sum, prod, mean, std, size, median, prctile, bounds), which is especially important for multidimensional inputs. When the dim input is not specified, MATLAB has a way of choosing the dimension on its own, as explained in the documentation; for example in max:
If A is a vector, then max(A) returns the maximum of A.
If A is a matrix, then max(A) is a row vector containing the maximum value of each column.
If A is a multidimensional array, then max(A) operates along the first array dimension whose size does not equal 1, treating the elements as vectors. The size of this dimension becomes 1 while the sizes of all other dimensions remain the same.
If A is an empty array whose first dimension has zero length, then max(A) returns an empty array with the same size as A.
Then, using the ...,dim) syntax we can rewrite the code as follows:
rng(3372061);
M = randi(100,3,4,5);
colMax = max(M,[],1);
rowMean = mean(M,2);
sliceSTD = std(reshape(M,1,[],5),0,2); % we use `reshape` to turn each slice into a vector
This has several advantages:
The code is easier to understand.
The code is potentially more robust, being able to handle inputs beyond those it was initially designed for.
The code is likely faster.
In conclusion: it is always a good idea to read the documentation of functions you're using, and to experiment with different syntaxes, so as not to miss similar opportunities to make your code more succinct.
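As a quick sanity check, the vectorized results can be compared against the loop version from the question (a small sketch assuming both versions have been run on the same M):
vMax  = max(M,[],1);
vMean = mean(M,2);
vSTD  = std(reshape(M,1,[],5),0,2);
disp(isequal(colMax, vMax) && isequal(rowMean, vMean)) % true: identical results
disp(max(abs(sliceSTD(:) - vSTD(:))))                  % 0 (or within eps)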
Using the SURF algorithm to match objects in MATLAB
The objective is to see whether two images, each capturing one object, match. One object/image is stored and will be used as a baseline: item1 (this is what is matched against in the code). The other object/image needs to be matched against the stored one: input (I need to see if this matches what is stored).
My method:
Convert images to grayscale.
Extract SURF interest points.
Obtain features.
Match features.
Get the 50 strongest features.
Match the number of strongest features with each image.
Take the ratio: number of features matched / number of strongest features (which is 50).
If I have two images of the same object (two images taken separately with a camera), ideally the ratio should be near 1, or near 100%. However, this is not the case; the best ratio I am getting is near 0.5 or even worse, 0.3. I am aware that SURF detectors and features can be used in neural networks, or in a statistics-based approach. I believe I have approached the statistics-based approach to some extent by using the 50 strongest features. Is there something I am missing? What do I add to this, or how do I improve it? Please give me a point to start from.
%Clearing the workspace and all variables
clc;
clear;
%ITEM 1
item1 = imread('Loreal.jpg'); %Retrieve item 1 and digitize it
item1Grey = rgb2gray(item1); %convert to grayscale, 2-dimensional matrix
item1KP = detectSURFFeatures(item1Grey,'MetricThreshold',600); %get SURF detectors (interest points)
strong1 = item1KP.selectStrongest(50);
[item1Features, item1Points] = extractFeatures(item1Grey, strong1,'SURFSize',128); %using SURFSize of 128
%INPUT: Acquire Image
input = imread('MakeUp1.jpg'); %Retrieve input and digitize it
inputGrey = rgb2gray(input); %convert to grayscale, 2-dimensional matrix
inputKP = detectSURFFeatures(inputGrey,'MetricThreshold',600); %get SURF detectors (interest points)
strongInput = inputKP.selectStrongest(50);
[inputFeatures, inputPoints] = extractFeatures(inputGrey, strongInput,'SURFSize',128); %using SURFSize of 128
pairs = matchFeatures(item1Features, inputFeatures, 'MaxRatio',1); %matching SURF features
totalFeatures = length(item1Features); %baseline number of features
numPairs = length(pairs); %the number of pairs
percentage = numPairs/50;
if percentage >= 0.49
    disp('We have this');
else
    disp('We do not have this');
    disp(percentage);
end
[figure: the baseline image]
[figure: the input image]
I would try not using selectStrongest and not setting MaxRatio. Just call matchFeatures with the default options and compare the number of resulting matches. The default behavior of matchFeatures is to use the ratio test to exclude ambiguous matches, so the number of matches it returns may be a good indicator of the presence or absence of the object in the scene. If you want to try something more sophisticated, take a look at this example.
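A minimal sketch of that suggestion, reusing the grayscale images from the question (the threshold on the match count is an assumption you would tune on your own data):
item1KP = detectSURFFeatures(item1Grey);
inputKP = detectSURFFeatures(inputGrey);
[item1Features, item1Points] = extractFeatures(item1Grey, item1KP);
[inputFeatures, inputPoints] = extractFeatures(inputGrey, inputKP);
pairs = matchFeatures(item1Features, inputFeatures); % default ratio test filters ambiguous matches
numMatches = size(pairs, 1);
if numMatches >= 20   % hypothetical threshold; tune on real data
    disp('We have this');
else
    disp('We do not have this');
end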
Matlab neural networks - bad results
I've got a problem implementing a multilayer perceptron with the MATLAB Neural Network Toolbox. I am trying to implement a neural network that will recognize a single character stored as a binary image (size 40x50). The image is transformed into a binary vector. The output is encoded in 6 bits. I use the simple newff function in this way (with 30 perceptrons in the hidden layer):
net = newff(P, [30, 6], {'tansig' 'tansig'}, 'traingd', 'learngdm', 'mse');
Then I train my network with a dozen characters in 3 different fonts, with the following training parameters:
net.trainParam.epochs = 1000000;
net.trainParam.goal = 0.00001;
net.trainParam.lr = 0.01;
After training, the net recognized all characters from the training sets correctly, but... it cannot recognize more than a couple of characters from other fonts. How could I improve this simple network?
You can try adding random elastic distortions to your training set (in order to expand it and make the network generalize better). You can see the details in this nice article from Microsoft Research: http://research.microsoft.com/pubs/68920/icdar03.pdf
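A rough sketch of such a distortion in MATLAB (the file name, smoothing sigma, and displacement scale are assumptions; the Simard et al. paper above discusses how to choose the parameters):
img = double(imread('charA.png') > 0);  % hypothetical 40x50 binary character image
[h, w] = size(img);
sigma = 4; alpha = 8;                   % assumed smoothing and displacement scale
dx = alpha * imgaussfilt(2*rand(h,w) - 1, sigma); % smoothed random displacement field
dy = alpha * imgaussfilt(2*rand(h,w) - 1, sigma);
[X, Y] = meshgrid(1:w, 1:h);
distorted = interp2(X, Y, img, X + dx, Y + dy, 'linear', 0); % warp the image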
You have a very large number of input variables (2,000, if I understand your description). My first suggestion is to reduce this number if possible. Some possible techniques include subsampling the input variables or calculating informative features (such as row and column totals, which would reduce the input vector to 90 = 40 + 50 values). Also, your output is coded in 6 bits, which provides 64 possible combined values; I assume you are using these to represent the 26 letters? If so, you may fare better with another output representation. Consider that various letters that look nothing alike will, for instance, share the value 1 on bit 1, complicating the mapping from inputs to outputs. An output representation with 1 bit per class would simplify things.
You could use patternnet instead of newff; this creates a network more suitable for pattern recognition. As the target, use a 26-element vector with 1 in the right letter's position (0 elsewhere). The output of the recognition will be a vector of 26 real values between 0 and 1, with the recognized letter having the highest value. Make sure to use data from all fonts for the training. Give all data sets as input; train will automatically divide them into train/validation/test sets according to the specified percentages:
net.divideParam.trainRatio = .70;
net.divideParam.valRatio = .15;
net.divideParam.testRatio = .15;
(choose your own percentages). Then test using only the test set; you can find its indices in tr.testInd after calling [net, tr] = train(net,inputs,targets);
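Putting that together, a minimal sketch (the variables inputs and labels are assumptions: a 2000-by-N binary matrix and a 1-by-N row vector of letter indices 1..26):
N = numel(labels);
targets = zeros(26, N);
targets(sub2ind([26 N], labels, 1:N)) = 1;  % one-hot targets, 26-by-N
net = patternnet(30);                       % 30 hidden units, as in the question
net.divideParam.trainRatio = .70;
net.divideParam.valRatio = .15;
net.divideParam.testRatio = .15;
[net, tr] = train(net, inputs, targets);
testOut = net(inputs(:, tr.testInd));       % evaluate on the held-out test set
[~, predicted] = max(testOut, [], 1);       % predicted letter index per test column
accuracy = mean(predicted == labels(tr.testInd));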
Matlab - Neural network training
I'm working on creating a 2-layer neural network with back-propagation. The NN is supposed to get its data from a 20001x17 vector that holds the following information in each row:
- The first 16 cells hold integers ranging from 0 to 15, which act as variables to help us determine which of the 26 letters of the alphabet we mean to express when seeing those variables. For example, a series of 16 values as follows is meant to represent the letter A: [2 8 4 5 2 7 5 3 1 6 0 8 2 7 2 7].
- The 17th cell holds a number ranging from 1 to 26, representing the letter of the alphabet we want. 1 stands for A, 2 stands for B, etc.
The output layer of the NN consists of 26 outputs. Every time the NN is fed an input like the one described above, it's supposed to output a 1x26 vector containing zeros in all but the one cell that corresponds to the letter that the input values were meant to represent. For example, the output [1 0 0 ... 0] would be the letter A, whereas [0 0 0 ... 1] would be the letter Z.
Some things that are important before I present the code: I need to use the traingdm function, and the hidden layer size is fixed (for now) at 21.
Trying to create the above concept, I wrote the following MATLAB code:
%%%%%%%%
%Start of code%
%%%%%%%%
%Initialize the input and target vectors
p = zeros(16,20000);
t = zeros(26,20000);
%Fill the input and training vectors from the dataset provided
for i=2:20001
    for k=1:16
        p(k,i-1) = data(i,k);
    end
    t(data(i,17),i-1) = 1;
end
net = newff(minmax(p),[21 26],{'logsig' 'logsig'},'traingdm');
y1 = sim(net,p);
net.trainParam.epochs = 200;
net.trainParam.show = 1;
net.trainParam.goal = 0.1;
net.trainParam.lr = 0.8;
net.trainParam.mc = 0.2;
net.divideFcn = 'dividerand';
net.divideParam.trainRatio = 0.7;
net.divideParam.testRatio = 0.2;
net.divideParam.valRatio = 0.1;
%[pn,ps] = mapminmax(p);
%[tn,ts] = mapminmax(t);
net = init(net);
[net,tr] = train(net,p,t);
y2 = sim(net,p);
%%%%%%%%
%End of code%
%%%%%%%%
Now to my problem: I want my outputs to be as described; namely, each column of the y2 vector, for example, should be a representation of a letter. My code doesn't do that, though. Instead it produces results that vary greatly between 0 and 1, values from 0.1 to 0.9. My question is: is there some conversion I need to be doing that I am not? Meaning, do I have to convert my input and/or output data to a form by which I can actually see if my NN is learning correctly? Any input would be appreciated.
This is normal. Your output layer is using a log-sigmoid transfer function, and that will always give you some intermediate output between 0 and 1. What you would usually do is look for the output with the largest value, in other words, the most likely character. This means that, for every column in y2, you're looking for the index of the row that contains the largest value in that column. You can compute this as follows:
[dummy, I] = max(y2);
I is then a vector containing the index of the largest value in each column.
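If the letters are coded 1 to 26 as in the question, those indices map straight back to characters (a small follow-up sketch):
[~, I] = max(y2);            % index of the most active output unit in each column
letters = char('A' + I - 1); % map indices 1..26 back to 'A'..'Z'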
You can think of y2 as an output probability distribution for each input being one of the 26 alphabet characters. For example, if one column of y2 says:
.2 .5 .15 .15
then there is a 50% probability that this character is B (if we assume only 4 possible outputs).
==REMARK==
The output layer of the NN consists of 26 outputs. Every time the NN is fed an input like the one described above, it's supposed to output a 1x26 vector containing zeros in all but the one cell that corresponds to the letter that the input values were meant to represent. For example, the output [1 0 0 ... 0] would be the letter A, whereas [0 0 0 ... 1] would be the letter Z.
It is preferable to avoid using target values of 0 and 1 to encode the output of the network. The reason is that the 'logsig' sigmoid transfer function cannot produce these output values given finite weights. If you attempt to train the network to fit target values of exactly 0 and 1, gradient descent will force the weights to grow without bound. So instead of 0 and 1 values, try using values of 0.04 and 0.9, for example, so that [0.9, 0.04, ..., 0.04] is the target output vector for the letter A.
Reference: Thomas M. Mitchell, Machine Learning, McGraw-Hill Higher Education, 1997, pp. 114-115
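A small sketch of softening the targets that way (assuming t is the 26-by-N one-hot target matrix built in the question):
softT = 0.04 * ones(size(t)); % background value instead of 0
softT(t == 1) = 0.9;          % peak value instead of 1
[net, tr] = train(net, p, softT);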
Use the hardlim transfer function in the output layer. Use trainlm or trainrp for training the network. To tune your network, run the training in a for loop with a condition that compares the output and the target; when the result is the best so far, break to exit from the learning loop. Use another method instead of mapminmax for pre-processing the data set.
I don't know if this constitutes an actual answer or not, but here are some remarks.
I don't understand your coding scheme. How is an 'A' represented by that set of numbers? It looks like you're falling into a fairly common trap of using arbitrary numbers to code categorical values. Don't do this: for example, if 'a' is 1, 'b' is 2 and 'c' is 3, then your coding has implicitly stated that 'a' is more like 'b' than 'c' (because the network has real-valued inputs, the ordinal properties matter). The way to do this properly is to have each letter represented as 26 binary-valued inputs, where only one is ever active, representing the letter.
Your outputs are correct: the activation at the output layer will never be exactly 0 or 1, but a real number. You could take the max as your activity function, but this is problematic because it's not differentiable, so you can't use back-prop. What you should do instead is couple the outputs with the softmax function, so that their sum is one. You can then treat the outputs as conditional probabilities given the inputs, if you so desire. While the network is not explicitly probabilistic, with the correct activity and activation functions it will be identical in structure to a log-linear model (possibly with latent variables corresponding to the hidden layer), and people do this all the time.
See David MacKay's textbook for a nice intro to neural nets that makes the probabilistic connection clear. Also take a look at this paper from Geoff Hinton's group, which describes the task of predicting the next character given the context, for details on the correct representation and activation/activity functions (though beware: their method is non-trivial and uses a recurrent net with a different training method).
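A minimal sketch of the softmax coupling described above (assuming y2 is the 26-by-N matrix of raw output activations; uses implicit expansion, R2016b+):
expY = exp(y2 - max(y2, [], 1)); % subtract the column max for numerical stability
probs = expY ./ sum(expY, 1);    % each column now sums to 1: class probabilities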