Learning curves for neural networks - MATLAB

I am trying to find the optimal parameters of my neural network model, implemented in Octave. The model is used for binary classification, with 122 features (inputs) and 25 hidden units (one hidden layer). For this I have 4 matrices/vectors:
size(X_Train): 125973 x 122
size(Y_Train): 125973 x 1
size(X_Test): 22543 x 122
size(Y_test): 22543 x 1
I have used 20% of the training set to generate a validation set (XVal and YVal)
size(X): 100778 x 122
size(Y): 100778 x 1
size(XVal): 25195 x 122
size(YVal): 25195 x 1
size(X_Test): 22543 x 122
size(Y_test): 22543 x 1
The goal is to generate the learning curves of the NN. I have learned (the hard way xD) that this is very time consuming, because I used the full X and XVal for this.
I don't know if there is an alternative solution. I am thinking of reducing the size of the training matrix X (to around 5000 samples, for example), but I don't know if I can do that, or whether the results will be biased since I would only be using a portion of the training set.
Best,

The total number of parameters above is around 3k (122*25 + 25*1), which is not huge. Since the number of examples is large, you might want to use stochastic gradient descent or mini-batches instead of full-batch gradient descent.
Note that MATLAB and Octave are slow in general, especially with loops.
You need to write code that uses matrix operations rather than loops for the speed to be manageable in MATLAB/Octave.
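For illustration, here is a minimal vectorized mini-batch sketch in Octave. It uses placeholder data, sigmoid activations, the cross-entropy gradient and no regularization, and all variable names are made up, so treat it as a pattern rather than a drop-in implementation:

% Minimal sketch: vectorized mini-batch training of a 1-hidden-layer network.
% Placeholder data; sigmoid hidden and output units; cross-entropy gradient;
% no regularization. Adapt the shapes and sizes to your own X and Y.
m = 5000;                                      % illustrative subset size
X = randn(m, 122);  Y = double(rand(m, 1) > 0.5);

n_in = 122;  n_hid = 25;  lr = 0.1;  batch = 256;  epochs = 10;
W1 = 0.01*randn(n_hid, n_in);  b1 = zeros(n_hid, 1);
W2 = 0.01*randn(1, n_hid);     b2 = 0;
sigm = @(z) 1 ./ (1 + exp(-z));

for e = 1:epochs
  idx = randperm(m);
  for s = 1:batch:m
    b  = idx(s:min(s+batch-1, m));
    Xb = X(b, :)';                             % n_in x nb
    Yb = Y(b)';                                % 1 x nb
    nb = numel(b);

    A1 = sigm(W1*Xb + b1);                     % hidden activations
    A2 = sigm(W2*A1 + b2);                     % output probabilities

    d2 = A2 - Yb;                              % output delta (sigmoid + cross-entropy)
    d1 = (W2'*d2) .* A1 .* (1 - A1);           % hidden delta

    W2 = W2 - lr * (d2*A1')/nb;   b2 = b2 - lr * mean(d2);
    W1 = W1 - lr * (d1*Xb')/nb;   b1 = b1 - lr * mean(d1, 2);
  end
end

With the per-batch work expressed as matrix products, one pass over a few thousand examples is cheap enough to repeat for each training-set size on the learning curve.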

Related

Unable to make sense of the confusion matrix returned by SVM

I am trying to understand why the SVM classifier is not able to correctly classify my data. I have presented only 10 samples, XX, out of the 2000 samples of my original data. I cannot make sense of the confusion matrix returned by MATLAB. I used an SVM classifier. Is my code wrong, especially the way I did cross-validation?
XX is normalized to X, and Y is the label. Each feature vector is of length 8.
Question: Can somebody please help me figure out how to tackle this issue?
             pred 0    pred 1
actual 0        100         0
actual 1        100         0
Thank you
You have:
an unbalanced data set (7 and 3 samples),
an 8-dimensional feature space and only 7 and 3 samples, which are very much insufficient to fill it (see curse of dimensionality), and
you're only using half those samples to train, meaning you're even further away from filling the feature space.
Thus, I am not surprised that the generalization that the SVM came up with is to classify everything as "class 0".
Try using only one of the features (first column of XX), and use leave-one-out cross validation.
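As a concrete illustration of that suggestion, here is a minimal leave-one-out sketch. It assumes the libsvm MATLAB/Octave interface (its svmtrain/svmpredict) is on the path, that X and Y are as in your code, and that the labels are 0/1 as in your confusion matrix; the linear-kernel choice is just an example:

% Leave-one-out cross-validation on a single feature (libsvm interface assumed).
f = X(:, 1);                         % use only the first feature
n = numel(Y);
pred = zeros(n, 1);

for i = 1:n
  tr = [1:i-1, i+1:n];               % train on everything except sample i
  model   = svmtrain(Y(tr), f(tr), '-t 0 -q');      % linear kernel, quiet
  pred(i) = svmpredict(Y(i), f(i), model, '-q');    % predict the held-out sample
end

% 2x2 confusion matrix: rows = actual, columns = predicted
C = [sum(Y==0 & pred==0), sum(Y==0 & pred==1);
     sum(Y==1 & pred==0), sum(Y==1 & pred==1)]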

Number of parameters in GMM-HMM

I want to understand the use of Gaussian Mixture Models in Hidden Markov Models.
Suppose we have speech data and we are recognizing 5 speech sounds (which are the states of the HMM). For example, let 'X' be a speech sample and O = (s,u,h,b,a) (considering characters instead of phones just for simplicity) be the HMM states. Now, we use a Gaussian mixture model with 3 mixtures to estimate the Gaussian density for every state, using the following equation (sorry, I cannot upload an image because of reputation points):
P(X|O) = sum over i=1..3 of w(i) * P(X | mu(i), var(i))   (considering a univariate distribution)
So, we first learn the GMM parameters from the training data using EM algorithm.
Then we use these parameters to learn the HMM parameters, and once this is done, we use both of them on the test data.
In all, we are learning 3 * 3 * 5 = 45 parameters for the GMMs in this example (a weight, a mean and a variance for each of the 3 mixtures in each of the 5 states).
Is my understanding correct?
Your understanding is mostly correct; however, the number of parameters is usually much larger. The mean and variance are vectors, not scalars (the variance can even be a matrix in the rarer case of a full-covariance GMM). Each vector usually has 39 components: 13 cepstral coefficients + 13 deltas + 13 delta-deltas.
So for every mixture component you learn
39 (mean) + 39 (variance) + 1 (weight) = 79 parameters,
and with 3 mixtures per state that is 3 * 79 = 237 parameters per state, i.e.
237 * 5 = 1185
parameters for the 5 states. Moreover, a phone is usually modelled with 3 or so states, not a single state, so the GMM parameter count triples again, to around 3555. On top of that you need the transition matrix of the HMM. The number of parameters is large, and that is why training requires a lot of data.
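As a quick sanity check, the count can be parameterised; here is a tiny Octave sketch (diagonal covariances assumed, numbers as above):

% GMM parameter count, assuming diagonal covariances.
D = 39;                              % feature dimension (13 MFCC + 13 delta + 13 delta-delta)
M = 3;                               % mixture components per state
per_mixture = D + D + 1;             % mean + variance + weight = 79
per_state   = M * per_mixture;       % 237
states_per_phone = 3;
n_phones = 5;
total = per_state * states_per_phone * n_phones    % 3555, before HMM transition probabilities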

How to take the difference between the resulting and the correct bucket of a one hot vector into account?

Hi, I am using TensorFlow at my university to try to classify the steering angles of a simulation program using only the images the simulation produces.
The steering angles are values from -1 to 1, and I separated them into 50 "buckets". So the first value of my prediction vector means that the predicted steering angle is between -1 and -0.96.
The following shows the classification and optimization functions I am using.
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=prediction, labels=y))
optimizer = tf.train.AdamOptimizer(0.001).minimize(cost)
y is a vector with 49 zeros and a single 1 for the correct bucket. My question now is:
How do I take into account that if, e.g., the correct bucket is at index 25, a prediction of 26 is much better than a prediction of 48?
I didn't post the actual network since it is just a couple of conv2d and maxpool layers with a fully connected layer at the end.
Since you are applying cross-entropy (negative log-likelihood), you are penalizing the system based only on the score it predicts for the ground-truth class.
Say your system predicts scores for your 50 output classes and the highest one is class 25, but your ground truth is class 26. The system will take the value predicted for class 26 and adapt the parameters to produce a higher score on that output the next time it sees this input; how far bucket 25 is from bucket 26 plays no role in the loss.
You could do two basic things:
Change your y and prediction to be scalars in the range -1..1 and make the loss function (y - prediction)**2 or something similar. A very different model, but perhaps more reasonable than the one-hot.
Keep the one-hot target and loss, but use y = target*w, where w is a constant matrix: mostly zeros, 1s on the diagonal, and smaller values on the neighbouring diagonals (e.g. y(i) = 1.0*target(i) + 0.5*target(i-1) + 0.5*target(i+1) + ...). Kind of gross, but it should converge to something reasonable; see the sketch after this list.
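The smoothing in option 2 is framework-independent, so here is a tiny numeric sketch of it in Octave (the kernel values 1.0/0.5/0.25 and the bucket count are purely illustrative):

% Soften a one-hot target over neighbouring buckets (illustrative kernel).
n = 50;
kernel = [0.25 0.5 1.0 0.5 0.25];
half = floor(numel(kernel) / 2);

W = zeros(n);                            % band matrix: row i spreads mass around bucket i
for i = 1:n
  for k = -half:half
    j = i + k;
    if j >= 1 && j <= n
      W(i, j) = kernel(k + half + 1);
    end
  end
end

target = zeros(1, n);  target(26) = 1;   % one-hot example: correct bucket is 26
y_soft = target * W;                     % softened target to feed to the loss

If you stay with softmax_cross_entropy_with_logits, note that it accepts soft label distributions, so y_soft can be normalised to sum to 1 and used directly as the labels argument.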

How can I find the difference between two plots with a dimensional mismatch?

I have a question, and I don't know if there is a solution off the bat.
Here it goes,
I have two data sets, plotted on the same figure. I need to find their difference, simple so far...
The problem is that, say, matrix A has 1000 data points while the second matrix, B, has 580 data points. How will I be able to find the difference between the two graphs, since there is a dimensional mismatch between the two?
One way I thought of is to artificially inflate matrix B to 1000 data points while keeping the trend of the plot the same. Would this be possible, and if yes, how?
for example:
A = [1 45 33 4 1009];
B = [1 22 33 44 55 66 77 88 99 1010];
Ya = A.*20 + 4;
Yb = B./10 + 3;
C = abs(B - A)          % errors here: A and B have different lengths
plot(A, Ya, 'r', B, Yb)
xlim([-100 1000])
grid on
hold on
plot(length(B), C)
One way to do it is to resample the 580-element vector to 1000 samples. Use MATLAB's resample (requires the Signal Processing Toolbox, I believe) for this:
x = randn(580, 1);
y = randn(1000, 1);
xr = resample(x, 50, 29);   % 50/29 = 1000/580 is the resampling ratio
You should then be able to compare the two data vectors.
There are two ways that I can think of:
1- Matching the size:
Generate more data for the matrix with the lower number of elements (using interpolation, etc.; see the sketch after this list), or
remove some data from the matrix with the higher number of elements (e.g. outlier removal).
2- Comparing the matrices through their statistical properties:
For instance, you can calculate the mean and the covariance of one matrix and compare them to those of the other. Relevant functions include cov, mean, median, std, var, xcorr, and xcov.
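Here is a minimal Octave sketch of option 1 using interp1. It assumes both series cover the same x-range, which may or may not hold for your data; the placeholder vectors just mimic the 1000/580 sizes:

% Interpolate the shorter vector onto the longer one's grid, then subtract.
yA = randn(1, 1000);                        % placeholder for the 1000-point data
yB = randn(1, 580);                         % placeholder for the 580-point data

xA = linspace(0, 1, numel(yA));             % common normalised x-axis
xB = linspace(0, 1, numel(yB));

yB_on_A = interp1(xB, yB, xA, 'linear');    % B resampled to 1000 points
d = yA - yB_on_A;                           % pointwise difference

plot(xA, d); grid on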

Using SVMs for Regression

I need to use SVMs for regression.
I have y: a 261x1 vector and x: a 261x10 matrix.
I would like to calculate 10 weights such that the weighted combination of the 10 values of x at each of the 261 data points mimics the y value.
However, when I run this using the libsvm package, I am getting 261 weights and not the 10 I want.
From my understanding, libsvm requires x and y to have the same number of rows, and hence inputting the transposes of x and y will not work.
(Note: this is a portfolio optimization problem and 261 is the number of days, and 10 is the number of stocks)
I could not understand what 'weights' means here, but I suggest you use the libsvmwrite function to write your labels and feature vectors in the required format, and then use libsvmread to read the formatted data back in and pass it as input.
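If by 'weights' you mean the 10 coefficients of a linear model (one per stock), one possible route, assuming the libsvm MATLAB/Octave interface and a linear kernel, is to train an epsilon-SVR and recover the primal weight vector from the support vectors; the -c and -p values below are only illustrative:

% Linear epsilon-SVR with libsvm; y is 261x1 and x is 261x10, as in the question.
y = randn(261, 1);                    % placeholder data
x = randn(261, 10);

model = svmtrain(y, x, '-s 3 -t 0 -c 1 -p 0.1 -q');   % -s 3: epsilon-SVR, -t 0: linear

% With a linear kernel, the primal weights follow from the dual solution:
w = model.SVs' * model.sv_coef;       % 10x1 weight vector, one weight per stock
b = -model.rho;                       % bias term

y_hat = x * w + b;                    % weighted combination approximating y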