I want to understand the use of Gaussian Mixture Models in Hidden Markov Models.
Suppose we have speech data and want to recognize 5 speech sounds, which are the states of the HMM. For example, let 'X' be the speech sample and let O = (s, u, h, b, a) be the HMM states (using characters instead of phones just for simplicity). We use a Gaussian mixture model with 3 components to estimate the density for every state using the following equation (sorry, I cannot upload an image because of reputation points).
P(X|O) = sum (i=1->3) w(i) * P (X|mu(i), var(i)) (considering univariate distribution)
So, we first learn the GMM parameters from the training data using the EM algorithm.
Then we use these parameters to learn the HMM parameters, and once this is done, we use both of them on the test data.
In all, we are learning 3 * 3 * 5 = 45 parameters (weight, mean and variance for 3 mixtures and 5 states) for the GMM in this example.
Is my understanding correct?
Your understanding is mostly correct; however, the number of parameters is usually larger. The mean and variance are vectors, not scalars. The variance becomes a matrix in the rarer case of a full-covariance GMM. Each vector usually contains 39 components: 13 cepstral coefficients + 13 deltas + 13 delta-deltas.
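So for every Gaussian component you learn
39 + 39 + 1 = 79 parameters
(39 means, 39 variances and 1 weight). With a single component per state, the total for 5 states is
79 * 5 = 395
Also, a phone is usually modeled with 3 or so states, not a single state. So you have 395 * 3 = 1185 parameters just for the GMMs, or three times that (3555) with the 3 mixture components per state from your example. Then you need a transition matrix for the HMM. The number of parameters is large, which is why training requires a lot of data.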
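The arithmetic above can be checked with a few lines of Python. This is only a sketch: the function name `gmm_param_count` and its arguments are made up for illustration, and it ignores the sum-to-one constraint on the mixture weights, as the counts above do.

```python
def gmm_param_count(n_states, n_mix, dim, full_cov=False):
    """Count GMM emission parameters over all HMM states.

    Each mixture component has 1 weight, `dim` means, and either `dim`
    variances (diagonal covariance) or dim*(dim+1)/2 distinct covariance
    entries (full covariance).
    """
    cov = dim * (dim + 1) // 2 if full_cov else dim
    per_component = 1 + dim + cov
    return n_states * n_mix * per_component


print(gmm_param_count(5, 3, 1))    # the question's univariate setup
print(gmm_param_count(5, 1, 39))   # 39-dim features, one Gaussian per state
print(gmm_param_count(15, 1, 39))  # 5 phones with 3 states each
print(gmm_param_count(15, 3, 39))  # same, with 3 components per state
```

Setting `full_cov=True` shows how quickly the count grows when full covariance matrices are used instead of diagonal ones.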
I am trying to find optimal parameters for my neural network model implemented in Octave. The model is used for binary classification and has 122 features (inputs) and 25 hidden units (1 hidden layer). For this I have 4 matrices/vectors:
size(X_Train): 125973 x 122
size(Y_Train): 125973 x 1
size(X_Test): 22543 x 122
size(Y_test): 22543 x 1
I have used 20% of the training set to generate a validation set (XVal and YVal)
size(X): 100778 x 122
size(Y): 100778 x 1
size(XVal): 25195 x 122
size(YVal): 25195 x 1
size(X_Test): 22543 x 122
size(Y_test): 22543 x 1
The goal is to generate the learning curves of the NN. I have learned (the hard way xD) that this is very time-consuming because I used the full XVal and X for this.
I don't know if there is an alternative solution. I am thinking of reducing the size of the training set X (to 5000 samples, for example), but I don't know whether I can do that, or whether the results will be biased since I'll only use a portion of the training set.
Bests,
The total number of parameters above is around 3k (122*25 + 25*1 = 3075, ignoring bias terms), which is not huge. Since the number of examples is large, you might want to use stochastic gradient descent or mini-batches instead of full-batch gradient descent.
Note that Matlab and Octave are slow in general, especially with loops.
You need to write code that uses matrix operations rather than loops for the speed to be manageable in Matlab/Octave.
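To make the mini-batch idea concrete, here is a minimal Python sketch of how one shuffled pass over the training rows could be split into batches. The helper name `minibatch_indices`, the batch size of 256 and the fixed seed are assumptions chosen for illustration, not values from the question.

```python
import random


def minibatch_indices(n_samples, batch_size, seed=0):
    """Yield lists of row indices covering one shuffled pass over the data."""
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)  # a new seed per epoch gives a new order
    for start in range(0, n_samples, batch_size):
        yield order[start:start + batch_size]


# 100778 is the training-set size from the question.
batches = list(minibatch_indices(100778, 256))
print(len(batches), len(batches[0]), len(batches[-1]))
```

Each gradient step then touches only one batch of rows, so the cost per update no longer depends on the full training-set size.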
I have a question, and I don't know whether there is an easy solution.
Here it goes:
I have two data sets plotted on the same figure. I need to find their difference. Simple so far...
The problem is that matrix A has 1000 data points while the second (matrix B) has 580 data points. How can I find the difference between the two graphs when there is a dimension mismatch between the two data sets?
One way that I thought of is artificially inflating matrix B to 1000 data points while keeping the trend of the plot the same. Would this be possible? And if yes, how?
for example:
A=[1 45 33 4 1009 ];
B=[1 22 33 44 55 66 77 88 99 1010];
Ya=A.*20+4;
Yb=B./10+3;
C=abs(B - A) % errors here: A has 5 elements but B has 10
plot(A,Ya,'r',B,Yb)
xlim([-100 1000])
grid on
hold on
plot(length(B),C)
One way to do it is to resample the 580-element vector to 1000 samples. Use MATLAB's resample (requires the Signal Processing Toolbox, I believe) for this:
x = randn(580,1);
y = randn(1000,1);
xr = resample(x, 50, 29); % 50/29 = 1000/580 is the resampling ratio
You should then be able to compare the two data vectors.
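If the toolbox is not available, plain linear interpolation is a simpler substitute. The Python sketch below is not what resample does internally (resample applies an anti-aliasing filter); it assumes both series are uniformly sampled over the same range, and the helper name `resample_linear` and the stand-in data are made up for illustration.

```python
def resample_linear(values, new_len):
    """Resample `values` to `new_len` points by linear interpolation.

    The first and last samples are preserved; the other points lie on an
    evenly spaced grid over the original index range.
    """
    old_len = len(values)
    if old_len < 2 or new_len < 2:
        raise ValueError("need at least two points on both sides")
    out = []
    for k in range(new_len):
        pos = k * (old_len - 1) / (new_len - 1)  # fractional source index
        lo = min(int(pos), old_len - 2)          # left neighbour
        frac = pos - lo
        out.append(values[lo] * (1 - frac) + values[lo + 1] * frac)
    return out


b = [float(i % 7) for i in range(580)]   # stand-in for the 580-point series
a = [float(i % 5) for i in range(1000)]  # stand-in for the 1000-point series
b_up = resample_linear(b, 1000)
diff = [abs(x - y) for x, y in zip(a, b_up)]
print(len(b_up), len(diff))
```

In MATLAB itself, interp1 offers the same kind of interpolation without the Signal Processing Toolbox.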
There are two ways that I can think of:
1- Matching the size:
Generating more data for the matrix with the lower number of elements (using interpolation, etc.)
Removing some data from the matrix with the higher number of elements (e.g. outlier removal)
2- Comparing the matrices by their properties.
For instance, you can calculate the mean and the covariance of one matrix and compare them to those of the other. Other options include cov, mean, median, std, var, xcorr and xcov.
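The second approach needs no size matching at all, because each summary is a single number regardless of length. A minimal Python sketch, using the standard `statistics` module and made-up stand-in data:

```python
import statistics


def summarize(values):
    """Return a few length-independent summary statistics of a series."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "std": statistics.stdev(values),  # sample standard deviation
    }


a = [float(i % 5) for i in range(1000)]  # stand-in for the 1000-point series
b = [float(i % 7) for i in range(580)]   # stand-in for the 580-point series
stats_a, stats_b = summarize(a), summarize(b)
for name in stats_a:
    print(name, stats_a[name], stats_b[name])
```

Note that such summaries discard the shape of the curves, so two quite different plots can still produce similar numbers.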
I need to use SVMs for regression.
I have y, a 261x1 vector, and x, a 261x10 matrix.
I would like to calculate 10 weights such that the weighted combination of the 10 values of x at each of the 261 data points mimics the y value.
However, when I run this using the libsvm package, I get 261 weights and not the 10 I want.
From my understanding, libsvm requires x and y to have the same number of rows, and hence passing in the transposes of x and y will not work.
(Note: this is a portfolio optimization problem and 261 is the number of days, and 10 is the number of stocks)
I could not understand what 'weights' means here, but I suggest you use the libsvmwrite function to write your labels and feature vectors in the required format, and then use libsvmread to read the formatted data back and pass it as input.
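For reference, the text format that libsvmwrite produces has one sample per line: the label, followed by `index:value` pairs with 1-based feature indices, where zero-valued features may be left out. The Python sketch below only illustrates that layout; the helper name `to_libsvm_line` and the sample values are made up.

```python
def to_libsvm_line(label, features):
    """Format one sample as '<label> <index>:<value> ...' (1-based indices).

    Zero-valued features are skipped, which the sparse format allows.
    """
    parts = [str(label)]
    for idx, value in enumerate(features, start=1):
        if value != 0:
            parts.append(f"{idx}:{value}")
    return " ".join(parts)


# One day of the portfolio data: a target value and a few asset values.
print(to_libsvm_line(0.5, [1.0, 0.0, 2.5]))
```

With 261 days and 10 stocks, the file would have 261 such lines, each with up to 10 `index:value` pairs.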
I have audio recordings of 4 phonemes (a, e, o, u) from 11 people. I trained an ANN using the data from 10 people, and used the remaining person's data for testing. I used 14 LPC coefficients of the first period (20ms) of each recording as features.
The training matrix I has 14 rows and 10 columns for each phoneme, so it is 14*40. Since it is a supervised classification problem, I constructed a target matrix T which is 4*40. It contains ones and zeros, where a 1 indicates that the corresponding column in I is from that class.
The test data matrix contains four columns and 14 rows as it contains 4 phonemes from only one person. Let us call it S.
Here is the code:
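net = newff(I, T, 15);          % feed-forward net with 15 hidden neurons
net = init(net);                % reinitialize weights and biases
net.trainParam.epochs = 10000;  % maximum number of training epochs
net.trainParam.goal = 0.01;     % stop once the error reaches this goal
net = train(net, I, T);
y1 = sim(net, I);               % outputs on the training data
y2 = sim(net, S)                % outputs on the held-out speaker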
The results are not good even when I give the training data as test data (y1).
What is wrong here?
> I used 14 LPC coefficients of the first period (20ms) of records as features.
So did you ignore almost all the sound data except the first 20ms? That doesn't sound right. You should at least compute an average over all frames.
> What is wrong here?
You started coding without understanding the theory. You probably want to read some introduction first. At least this and ideally this
To understand why the ANN doesn't work, calculate how many parameters are required to map 14 features to 4 classes, then calculate how many training vectors you have for every parameter. Take into account that, as a rule of thumb, you need at least 10 samples per parameter for an initial estimate. That means your training data is not enough.
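The suggested calculation can be sketched in Python as follows. It assumes the 14-15-4 layout from the question (14 LPC inputs, 15 hidden neurons, 4 output classes) with one bias per neuron; the helper name `mlp_param_count` is made up for illustration.

```python
def mlp_param_count(layer_sizes):
    """Weights plus biases of a fully connected feed-forward network."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


n_params = mlp_param_count([14, 15, 4])  # inputs, hidden units, outputs
n_train = 40                             # 10 speakers x 4 phonemes
print(n_params)              # parameters to estimate
print(n_train / n_params)    # training vectors available per parameter
print(10 * n_params)         # vectors needed under the 10-per-parameter rule
```

With 40 training vectors there is far less than one vector per parameter, which is why the network cannot be estimated reliably from this data.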
I want to create a ROC curve in Matlab using the perfcurve function (it's for logistic regression, similar to what is illustrated in this example (bottom of page)). I have 150 data points (binary data), but they are neither positive nor negative classes; each one is the number of positive observations within that particular data point.
Example (random data to illustrate):
datapoint   +ve observations   total observations
1           23                 35
2           27                 41
3           23                 36
4           18                 29
5           19                 39
6           21                 41
7           24                 40
8           29                 36
9           38                 45
10          12                 32
The example illustrated on mathworks (bottom of page) only demonstrates how to create labels for data rows that correspond either solely to positive or negative classes.
For
[X,Y,T,AUC] = perfcurve(labels,scores,posclass)
how do I have to format my labels and posclass in order to make the ROC curve plot work?
Thank you very much in advance.
In order to create an ROC curve in Matlab using the perfcurve function, you need to have the score for each data point (which you pass to perfcurve using the scores argument). The score of a data point is given by your classifier and corresponds to the "probability" [1] that this data point belongs to the positive class (which is defined by the posclass argument). Given your data, you don't have enough information to use the perfcurve function.
[1] Some classifiers don't return strict probabilities, but a higher score indicates a higher probability, so it's all right. More information in Fawcett, Tom. "An introduction to ROC analysis." Pattern recognition letters 27.8 (2006): 861-874.
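To show what labels and scores mean in practice, here is a Python sketch that builds a ROC curve from one label and one score per observation. It assumes a score per data point is available from a fitted model; the scores below are made up, and the helpers `expand_counts`, `roc_points` and `auc` are hypothetical names, not part of perfcurve. Aggregated counts are expanded so that every observation in a data point shares that data point's score.

```python
def expand_counts(counts, scores):
    """Turn (positives, total) pairs into one label and score per observation."""
    labels, expanded = [], []
    for (pos, total), score in zip(counts, scores):
        labels += [1] * pos + [0] * (total - pos)
        expanded += [score] * total
    return labels, expanded


def roc_points(labels, scores):
    """Return (fpr, tpr) lists, sweeping the threshold from high to low."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    by_score = {}  # tied scores are grouped so they move together
    for y, s in zip(labels, scores):
        p, n = by_score.get(s, (0, 0))
        by_score[s] = (p + y, n + (1 - y))
    fpr, tpr, tp, fp = [0.0], [0.0], 0, 0
    for s in sorted(by_score, reverse=True):
        tp += by_score[s][0]
        fp += by_score[s][1]
        tpr.append(tp / n_pos)
        fpr.append(fp / n_neg)
    return fpr, tpr


def auc(fpr, tpr):
    """Area under the curve by the trapezoidal rule."""
    return sum((fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
               for i in range(1, len(fpr)))


counts = [(23, 35), (27, 41), (23, 36), (18, 29), (19, 39)]  # first table rows
model_scores = [0.66, 0.65, 0.64, 0.62, 0.49]                # hypothetical
labels, scores = expand_counts(counts, model_scores)
fpr, tpr = roc_points(labels, scores)
print(len(labels), auc(fpr, tpr))
```

The expanded `labels` and `scores` vectors have the shape perfcurve expects, with 1 as the positive class.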