This question is a tough one: how can I feed a neural network an input of dynamic size?
Answering this question would certainly help advance modern AI using deep learning for applications other than computer vision and speech recognition.
I will explain the problem further for those who are new to neural networks.
Let's take this simple example for instance:
Say you need to know the probability of winning, losing or drawing in a game of "tic-tac-toe".
So my input could be a [3,3] matrix representing the state (1-You, 2-Enemy, 0-Empty):
[2. 1. 0.]
[0. 1. 0.]
[2. 2. 1.]
Let's assume we already have a previously trained hidden layer, a [3,1] matrix of weights:
[1.5]
[0.5]
[2.5]
So if we use a simple activation function that is basically a matrix multiplication between the two, y(x)=x*W, we get this [3,1] matrix as the output:
[2. 1. 0.] [1.5] [3.5]
[0. 1. 0.] * [0.5] = [0.5]
[2. 2. 1.] [2.5] [6.5]
Even without a softmax function you can tell that the highest probability is of having a draw.
But what if I want this same neural network to work for a 5x5 game of tic-tac-toe?
It has the same logic as the 3x3 game, it's just bigger. The neural network should be able to handle it.
We would have something like:
[2. 1. 0. 2. 0.]
[0. 2. 0. 1. 1.] [1.5] [?]
[2. 1. 0. 0. 1.] * [0.5] = [?] IMPOSSIBLE
[0. 0. 2. 2. 1.] [2.5] [?]
[2. 1. 0. 2. 0.]
But this multiplication is impossible to compute, because the 5 columns of the input do not match the 3 rows of the weight matrix. We would have to add more layers and/or change our previously trained one and RETRAIN it, because the untrained weights (initialized with 0 in this case) would cause the neural network to fail, like so:
input 1st Layer output1
[2. 1. 0. 2. 0.] [0. 0. 0.] [6.5 0. 0.]
[0. 2. 0. 1. 1.] [1.5 0. 0.] [5.5 0. 0.]
[2. 1. 0. 0. 1.] * [0.5 0. 0.] = [1.5 0. 0.]
[0. 0. 2. 2. 1.] [2.5 0. 0.] [6. 0. 0.]
[2. 1. 0. 2. 0.] [0. 0. 0.] [6.5 0. 0.]
2nd Layer output1 final output
[6.5 0. 0.]
[5.5 0. 0.]
[0. 0. 0. 0. 0.] * [1.5 0. 0.] = [0. 0. 0.] POSSIBLE
[6. 0. 0.]
[6.5 0. 0.]
Because we expanded the first layer and added a new layer of zero weights, our result is obviously inconclusive. If we apply a softmax function we will realize that the neural network is returning 33.3% chance for every possible outcome. We would need to train it again.
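To make the two numeric claims above concrete, here is a minimal sketch in plain Python. It reproduces the 3x3 product from the example and applies a softmax to both the trained scores and the all-zero scores. The helper names matvec and softmax are only illustrative, not part of any library.

```python
import math

def matvec(matrix, vector):
    # Multiply each row of the matrix by the column vector.
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def softmax(scores):
    # Subtract the maximum first for numerical stability.
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

board = [[2, 1, 0],
         [0, 1, 0],
         [2, 2, 1]]
weights = [1.5, 0.5, 2.5]

scores = matvec(board, weights)
print(scores)                      # [3.5, 0.5, 6.5]
print(softmax(scores))             # largest probability is the third entry
print(softmax([0.0, 0.0, 0.0]))    # three equal probabilities of 1/3
```

With all-zero scores every outcome gets the same probability, which is why the expanded, untrained network is inconclusive.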
Obviously we want to create generic neural networks that can adapt to different input sizes, but I haven't thought of a solution for this problem yet! So I thought maybe Stack Overflow could help. Thousands of heads think better than one. Any ideas?
There are solutions for Convolutional Neural Networks apart from just resizing the input to a fixed size.
Spatial Pyramid Pooling allows you to train and test CNNs with variable-sized images. It does this by introducing a dynamic pooling layer whose input can be of any size and whose output has a fixed size, which can then be fed to the fully connected layers.
The pooling is very simple: one defines a number of regions in each dimension (say 7x7), and the layer then splits each feature map into a 7x7 grid of non-overlapping regions and does max-pooling on each region, outputting a 49-element vector. This can also be applied at multiple scales.
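The pooling step described above can be sketched in plain Python as follows. This is only an illustration of the idea, not the implementation from the Spatial Pyramid Pooling paper: the function names, the way region boundaries are rounded, and the tiny pyramid levels (1 and 2 instead of 7) are assumptions chosen to keep the example small.

```python
def pool_to_fixed_grid(feature_map, bins):
    """Max-pool a 2-D feature map (list of rows) into bins x bins values."""
    height, width = len(feature_map), len(feature_map[0])
    if height < bins or width < bins:
        raise ValueError("feature map must be at least bins x bins")
    pooled = []
    for i in range(bins):
        # Integer boundaries give non-overlapping, non-empty row ranges.
        r0, r1 = i * height // bins, (i + 1) * height // bins
        for j in range(bins):
            c0, c1 = j * width // bins, (j + 1) * width // bins
            pooled.append(max(feature_map[r][c]
                              for r in range(r0, r1)
                              for c in range(c0, c1)))
    return pooled

def spatial_pyramid_pool(feature_map, levels=(1, 2)):
    """Concatenate the pooled grids of several scales into one vector."""
    output = []
    for bins in levels:
        output.extend(pool_to_fixed_grid(feature_map, bins))
    return output

small_map = [[r * 4 + c for c in range(4)] for r in range(4)]   # 4x4
large_map = [[r * 7 + c for c in range(7)] for r in range(5)]   # 5x7

print(spatial_pyramid_pool(small_map))        # [15, 5, 7, 13, 15]
print(len(spatial_pyramid_pool(large_map)))   # 5, same length as above
```

Both inputs produce a vector of the same length, so a fully connected layer placed after this step would not need to change with the input size. With a single 7x7 level the output would have 49 elements per feature map.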
Orange3 says that the cosine between vector No.1 [1, 0] and vector No.2 [0, 1] is 1.000, and between No.1 and vector No.7 [-1, 0] it is 2.000, in the Distance Matrix shown in the capture below. I believe these should be 0.000 and -1.000, because it is supposed to be a cosine. Or, if it is an angle in radians, they should be 1.5708 (pi/2) and 3.1415 (pi).
It seems that the range of cosine is 0.0 to 2.0 in Orange3, but I've never heard of this before.
Does anyone have any idea about these cosine results?
Thank you.
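What you describe is cosine similarity. Orange computes cosine distance, which is 1 minus the cosine similarity. That is why a similarity of 0 shows up as 1.000 and a similarity of -1 shows up as 2.000.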
The code is here: https://github.com/biolab/orange3/blob/master/Orange/distance/distance.py#L455.
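The relationship between the two quantities can be checked with a short sketch. It assumes the usual definition, distance = 1 - similarity, which matches the values reported in the question. It is not Orange's own code, and the function names are illustrative.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_distance(u, v):
    # Assumed definition: 1 minus the cosine similarity, so the range is 0 to 2.
    return 1.0 - cosine_similarity(u, v)

print(cosine_similarity([1, 0], [0, 1]), cosine_distance([1, 0], [0, 1]))    # 0.0 1.0
print(cosine_similarity([1, 0], [-1, 0]), cosine_distance([1, 0], [-1, 0]))  # -1.0 2.0
```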
Even with a simple classifier like the nearest neighbour I cannot seem to judge its accuracy and thus cannot improve it.
For example with the code below:
IDX = knnsearch(train_image_feats, test_image_feats);
predicted_categories = cell([size(test_image_feats, 1), 1]);
for i = 1:size(IDX, 1)
    predicted_categories{i} = train_labels(IDX(i));
end
Here train_image_feats is a 300 by 256 matrix where each row represents an image. test_image_feats has the same structure. train_labels holds the label corresponding to each row of the training matrix.
The book I am following simply said that the above method achieves an accuracy of 19%.
How did the author come to this conclusion? Is there any way to judge the accuracy of my results, be it with this classifier or another?
The author then uses another method of feature extraction and says it improved accuracy by 30%.
How can I find the accuracy? Be it graphically or just via a simple percentage.
Accuracy in machine learning and classification is usually calculated by comparing the predicted outputs of your classifier to the ground truth. When you're evaluating the classification accuracy of your classifier, you will have already created a predictive model using a training set with known inputs and outputs. At this point, you will have a test set with inputs and outputs that were not used to train the classifier. For the purposes of this post, let's call this the ground truth data set. This ground truth data set helps assess the accuracy of your classifier when you provide it with inputs that it has not seen before. You take the inputs from your test set and run them through your classifier. You get an output for each input, and we call the collection of these outputs the predicted values.
For each predicted value, you compare to the associated ground truth value and see if it is the same. You add up all of the instances where the outputs match up between the predicted and the ground truth. Adding all of these values up, and dividing by the total number of points in your test set yields the fraction of instances where your model accurately predicted the result in comparison to the ground truth.
In MATLAB, this is really simple to calculate. Suppose that the categories for your model are enumerated from 1 to N, where N is the total number of labels you are classifying with. Let groundTruth be your vector of labels that denote the ground truth, and let predictedLabels denote the labels generated by your classifier. The accuracy is simply calculated by:
accuracy = sum(groundTruth == predictedLabels) / numel(groundTruth);
accuracyPercentage = 100*accuracy;
The first line of code calculates the accuracy of your model as a fraction. The second line calculates this as a percentage, by simply multiplying the result of the first line by 100. You can use either one when you want to assess accuracy. One is normalized to [0,1] while the other is a percentage from 0% to 100%. What groundTruth == predictedLabels does is compare each element of groundTruth with the corresponding element of predictedLabels. If the ith value in groundTruth matches the ith value in predictedLabels, we output a 1. If not, we output a 0. The result is a vector of 0s and 1s, so we simply sum up all of the values that are 1, which is neatly encapsulated in the sum operation. We then divide by the total number of points in our test set to obtain the final accuracy of the classifier.
With a toy example, supposing I had 4 labels, and my groundTruth and predictedLabels vectors were this:
groundTruth = [1 2 3 2 3 4 1 1 2 3 3 4 1 2 3];
predictedLabels = [1 2 2 4 4 4 1 2 3 3 4 1 2 3 3];
The accuracy using the above vectors gives us:
>> accuracy
accuracy =
0.4000
>> accuracyPercentage
accuracyPercentage =
40
This means that we have a 40% accuracy, or an accuracy of 0.40. In this example, the predictive model accurately classified only 40% of the test set when each test input was put through the classifier. This makes sense, because between our predicted outputs and the ground truth, only 6 of the 15 outputs (40%) match up. These are the 1st, 2nd, 6th, 7th, 10th and 15th elements. There are other ways to evaluate a classifier, such as ROC curves, but when calculating accuracy in machine learning, this is what is usually done.
How to perform cross correlation when one discrete signal has negative samples?
For example, we have:
y=[1 2 3 4 5]
h0=[7 8 9]
but h0 starts at index -2.
Just do the regular cross correlation.
Afterwards set the lags vector accordingly.
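As a rough sketch of that advice in plain Python: compute the ordinary cross-correlation on the array positions, then shift the lag axis by the difference between the start indices. The convention used here is values[lag] = sum over n of x[n + lag] * y[n]. The function name, the parameter names, and the assumption that y starts at index 0 are mine, not from the question.

```python
def cross_correlate(x, y, x_start=0, y_start=0):
    """Return (lags, values) for two finite real-valued signals.

    x_start and y_start are the sample indices of the first elements.
    """
    num_lags = len(x) + len(y) - 1
    values = [0] * num_lags
    # Regular cross-correlation on array positions: the position lag
    # i - j runs from -(len(y) - 1) to len(x) - 1.
    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            values[i - j + len(y) - 1] += xi * yj
    # Set the lags vector according to where each signal really starts.
    first_lag = -(len(y) - 1) + (x_start - y_start)
    lags = [first_lag + k for k in range(num_lags)]
    return lags, values

y = [1, 2, 3, 4, 5]   # assumed to start at index 0
h0 = [7, 8, 9]        # starts at index -2

lags, values = cross_correlate(y, h0, x_start=0, y_start=-2)
print(lags)    # [0, 1, 2, 3, 4, 5, 6]
print(values)  # [9, 26, 50, 74, 98, 68, 35]
```

The correlation values are the same as they would be if both signals started at index 0. Only the lag axis moves, from -2..4 to 0..6.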
I have an input dataset (matrix 25x1575) which is normalized to values between 0 and 1.
I also have a binary formatted output matrix (9x1575) like 0 0 0 0 0 0 0 0 1, 1 0 0 1 1 1 0 0 1 ...
I imported both files in matlab nntool and it automatically created a network with 25 input and 9 output nodes as I wanted.
After I trained this network using feed-forward backpropagation, I tested the model on its training data, and each output node returns a decimal value like (-0.1978 0.45913 0.12748 0.25072 0.45199 0.59368 0.38359 0.31435 1.0604).
Why doesn't it return discrete values like 1 0 0 1 1 1 0 0 1?
Is there anything that I must set in nntool to get such values?
Depending on the nature of the neurons, the output can be anything. The most popular neurons are linear, sigmoidal (range [0, 1]) and hyperbolic tangent (range [-1, 1]). The first can output any value. The latter two can approximate a step function (i.e. binary behavior), but it is up to the end user (you) to define the cut-off value for that translation.
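As an illustration of applying such a cut-off, the sketch below converts the output values quoted in the question to 0/1. The cut-off of 0.5 and the function name to_binary are assumptions. The right cut-off depends on the neurons used and on how you want to interpret the output.

```python
def to_binary(outputs, cutoff=0.5):
    # Values at or above the cut-off become 1, everything else becomes 0.
    return [1 if value >= cutoff else 0 for value in outputs]

raw_outputs = [-0.1978, 0.45913, 0.12748, 0.25072, 0.45199,
               0.59368, 0.38359, 0.31435, 1.0604]

print(to_binary(raw_outputs))   # [0, 0, 0, 0, 0, 1, 0, 0, 1]
```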
You didn't say which neurons you use, but you should definitely read more on how neural networks are implemented and how they work. You may start with this video and then read Artificial Neural Networks for Beginners by C Gershenson.
UPDATE You say that you use tanh-sigmoid neurons and wonder how come you don't get values either very close to -1 or to 1.
The output of a tanh neuron is the hyperbolic tangent of the weighted sum of all its inputs. Every value between -1 and 1 is possible. What determines the "steepness" of the output (in other words, the proportion of intermediate values) is the output values of the preceding neurons and their weights. These in turn depend on the outputs of their preceding neurons and their weights, and so on. It is up to the learning algorithm to find the set of weights that minimizes a predefined scoring function, given a certain input. In a typical setup, a scoring function is a function that compares the neural network output to a set of desired results and returns a single number that indicates how different the actual and the desired outputs are.
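One common choice for such a scoring function is the mean squared error. The sketch below is only an example of the idea, with made-up output and target values. It is not necessarily the function that nntool uses.

```python
def mean_squared_error(outputs, targets):
    """Return one number measuring how far the outputs are from the targets."""
    if len(outputs) != len(targets):
        raise ValueError("outputs and targets must have the same length")
    return sum((o - t) ** 2 for o, t in zip(outputs, targets)) / len(outputs)

targets = [1, 0, 1]                                      # hypothetical desired outputs
good_fit = mean_squared_error([0.9, 0.1, 0.8], targets)  # outputs near the targets
poor_fit = mean_squared_error([0.4, 0.6, 0.5], targets)  # outputs far from the targets

print(good_fit < poor_fit)   # True: the lower score means the better match
```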
Before using NN you have to do some homework. At the minimum you have to decide what your goal is, how you interpret NN output and how you measure NN performance and how you update the weights.
I am working through the xor example with a three layer back propagation network. When the output layer has a sigmoid activation, an input of (1,0) might give 0.99 for a desired output of 1 and an input of (1,1) might give 0.01 for a desired output of 0.
But what if I want the output to be discrete, either 0 or 1? Do I simply set a threshold in between, at 0.5? Would this threshold need to be trained like any other weight?
Well, you can of course put a threshold after the output neuron that maps values above 0.5 to 1 and values below 0.5 to 0. However, I suggest not hiding the continuous output behind a discretization threshold, because an output of 0.4 is less "zero" than an output of 0.001, and this difference can give you useful information about your data.
Do the training without the threshold, i.e. compute the error on an example using what the neural network actually outputs, without thresholding it.
Another little detail: are you using a transfer function such as the sigmoid? The sigmoid function returns values between 0 and 1, but 0 and 1 are asymptotes, i.e. the sigmoid can come close to those values but never reaches them. A consequence is that your neural network cannot output exactly 0 or 1! Using a sigmoid multiplied by a factor a little above 1 can correct this. This and some other practical aspects of back propagation are discussed here: http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf
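A small numeric sketch of the asymptote issue follows. The scaled function 1.7159 * tanh(2x/3) is the form I recall being recommended in the linked paper. Treat the constants as an assumption and check them against the paper before relying on them.

```python
import math

def logistic(x):
    # Standard sigmoid: approaches 0 and 1 but does not reach them.
    return 1.0 / (1.0 + math.exp(-x))

def scaled_tanh(x):
    # Assumed constants: a sigmoid-shaped function scaled by a factor
    # above 1, so that the targets +1 and -1 lie inside its range.
    return 1.7159 * math.tanh(2.0 * x / 3.0)

for x in (2.0, 5.0, 10.0):
    print(x, logistic(x))       # gets closer to 1 but stays below it

print(scaled_tanh(1.0))         # approximately +1
print(scaled_tanh(-1.0))        # approximately -1
```

Because the scaled function reaches the target values at finite inputs, the weights do not have to grow without bound to match a target of exactly 1 or -1.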