Cross-validation function crossvalind - MATLAB

I have a question about cross-validation. As I understand it, cross-validation is used to find the best parameters.
But I don't understand the role of the function crossvalind ("Generate cross-validation indices"): it just takes a data set, without a model, as in this example:
load fisheriris
[g gn] = grp2idx(species);
[trainIdx testIdx] = crossvalind('HoldOut', species, 1/3);

The crossvalind() function splits your data into two groups: the training set and the cross-validation set.
In your example:
[trainIdx testIdx] = crossvalind('HoldOut', size(species,1), 1/3); means: split the data in species, with 2/3 going to the training set and 1/3 to the cross-validation set.
Supposing that your data is like:
species=[datarow1;datarow2;datarow3;datarow4;datarow5;datarow6] then
trainIdx would look like [1;1;0;1;1;0] and testIdx like [0;0;1;0;0;1], meaning that of the 6 elements in our set, crossvalind assigned 4 to the training set and 2 to the cross-validation set. Of course this is a random assignment, so the positions of the zeros and ones will vary every time you call the function, but the proportion between them is fixed, and trainIdx + testIdx will always equal ones(size(species,1),1).
crossvalind('LeaveMOut', size(species,1), 2) would be exactly the same as crossvalind('HoldOut', size(species,1), 1/3) in this particular case. In the 'HoldOut' form you provide a parameter P that takes values from 0 to 1 (like 1/3 in the example above), while with the 'LeaveMOut' option you provide an integer M, like 2 samples out of the 6 total, or 2000 samples out of the 10000 total samples in your dataset. In the case of 'Resubstitution', crossvalind('Resubstitution', size(species,1), [1/3,2/3]) would again be the same, but here you also have the option of, say, [1/3,3/4], meaning that some samples can be in both the training and cross-validation sets, or even [1,1], which means that all samples are used in both sets (trainIdx = testIdx = [1;1;1;1;1;1] in the above example). I strongly suggest you type help crossvalind and take a look at the help text, which is always a lot more detailed and helpful than I could ever be.
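To make the role of the indices concrete, here is a minimal sketch of how the two logical vectors are typically used once you have them; fitcknn is just an arbitrary classifier picked for illustration, not something the original question mentions:
load fisheriris                                              % meas (150x4 features), species (150x1 labels)
[trainIdx, testIdx] = crossvalind('HoldOut', species, 1/3);  % 2/3 train, 1/3 hold-out
mdl  = fitcknn(meas(trainIdx,:), species(trainIdx));         % fit any model on the training rows
pred = predict(mdl, meas(testIdx,:));                        % predict on the held-out rows
acc  = mean(strcmp(pred, species(testIdx)))                  % hold-out accuracy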

Related

How to decide the range for the hyperparameter space in SVM tuning? (MATLAB)

I am tuning an SVM using a for loop to search over the hyperparameter space. The learned SVM model contains the following fields:
SVMModel: [1×1 ClassificationSVM]
C: 2
FeaturesIdx: [4 6 8]
Score: 0.0142
Question 1) What is the meaning of the field 'Score' and what is it useful for?
Question 2) I am tuning the BoxConstraint, the C value. Let the number of features be denoted by the variable featsize. The variable gridC will contain the search space, which can start from any value, say 2^-5, 2^-3, up to 2^15, i.e. gridC = 2.^(-5:2:15). Is there a principled way to select this range?
1. score is documented here, which says:
Classification Score
The SVM classification score for classifying observation x is the signed distance from x to the decision boundary ranging from -∞ to +∞.
A positive score for a class indicates that x is predicted to be in
that class. A negative score indicates otherwise.
In the two-class case, if there are six observations and the predict function gives us a score matrix called TestScore, then we can determine which class each observation is assigned to by:
TestScore = [-0.4497  0.4497;
             -0.2602  0.2602;
             -0.0746  0.0746;
              0.1070 -0.1070;
              0.2841 -0.2841;
              0.4566 -0.4566];
[~,Classes] = max(TestScore,[],2);
In two-class classification we can also check the sign of one column, e.g. find(TestScore(:,2) > 0), instead; here it is clear that the first three observations belong to the second class and the 4th to 6th observations belong to the first class.
In multiclass cases there can be several scores > 0, but the code max(Scores,[],2) is still valid. For example, we could use the following code (from here, an example called Find Multiple Class Boundaries Using Binary SVM) to determine the classes of the samples in Samples.
Scores = zeros(size(Samples,1), numel(classes));   % one column of scores per class
for j = 1:numel(classes)
    [~,score] = predict(SVMModels{j}, Samples);
    Scores(:,j) = score(:,2);                      % second column contains positive-class scores
end
[~,maxScore] = max(Scores,[],2);                   % column index of the largest score per sample
Then maxScore will contain the index of the predicted class of each sample.
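If you need the class names rather than column indices, a one-line mapping is enough (assuming classes here is the same vector or cell array of class names used to build SVMModels, as in the linked example):
predictedLabels = classes(maxScore);   % map column indices back to the class names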
2. BoxConstraint corresponds to C in the SVM model, so we can train SVMs with different hyperparameter values and select the best one with something like:
gridC = 2.^(-5:2:15);
for ii = 1:length(gridC)
    SVModel = fitcsvm(data3, theclass, 'KernelFunction', 'rbf', ...
                      'BoxConstraint', gridC(ii), 'ClassNames', [-1,1]);
    % if (some selection criterion is met)
    %     save the current SVModel
    % end
end
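One common way to fill in that selection criterion (a sketch, not part of the original answer, assuming the same data3/theclass variables as above) is to score each candidate C by its cross-validation loss and keep the one with the lowest estimated error:
gridC  = 2.^(-5:2:15);
cvLoss = zeros(size(gridC));
for ii = 1:length(gridC)
    SVModel    = fitcsvm(data3, theclass, 'KernelFunction', 'rbf', ...
                         'BoxConstraint', gridC(ii), 'ClassNames', [-1,1]);
    CVModel    = crossval(SVModel);       % 10-fold cross-validation by default
    cvLoss(ii) = kfoldLoss(CVModel);      % estimated misclassification rate
end
[~, best] = min(cvLoss);
bestC = gridC(best)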
Note: another way to implement this is to use libsvm, a fast and easy-to-use SVM toolbox that has a MATLAB interface.

Unreasonable [positive] log-likelihood values from MATLAB "fitgmdist" function

I want to fit a data set with a Gaussian mixture model. The data set contains about 120k samples and each sample has about 130 dimensions. I use MATLAB to do it and run the following script (with 1000 clusters):
gm = fitgmdist(data, 1000, 'Options', statset('Display', 'iter'), 'RegularizationValue', 0.01);
I get the following outputs:
iter log-likelihood
1 -6.66298e+07
2 -1.87763e+07
3 -5.00384e+06
4 -1.11863e+06
5 299767
6 985834
7 1.39525e+06
8 1.70956e+06
9 1.94637e+06
The log-likelihood is bigger than 0! I think this is unreasonable, and I don't know why.
Could somebody help me?
First of all, it is not a problem of how large your dataset is.
Here is some code that produces similar results with a quite small dataset:
options = statset('Display', 'iter');
x = ones(5,2) + (rand(5,2)-0.5)/1000;
fitgmdist(x,1,'Options',options);
this produces
iter log-likelihood
1 64.4731
2 73.4987
3 73.4987
Of course you know that the log function (the natural logarithm) has a range from -inf to +inf. I guess your problem is that you think the input to the log (i.e. the a posteriori function) should be bounded by [0,1]. Well, the a posteriori function is a pdf, which means that its value can be very large for a very dense dataset.
PDFs must be positive (which is why we can use the log on them) and must integrate to 1. But they are not bounded by [0,1].
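As a quick illustration (a plain normal pdf, not specific to fitgmdist): for a very narrow Gaussian the density at the mean is far above 1, so its log is positive.
normpdf(0, 0, 0.01)        % about 39.89, a density value well above 1
log(normpdf(0, 0, 0.01))   % about 3.69, hence a positive log-likelihood contribution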
You can verify this by reducing the density in the fitgmdist example above:
x = ones(5,2) + (rand(5,2)-0.5)/1;
fitgmdist(x,1,'Options',options);
this produces
iter log-likelihood
1 -8.99083
2 -3.06465
3 -3.06465
So, I would rather assume that your dataset contains several duplicate (or very close) values.

Matlab - neural networks - How to use different datasets for training, validation and testing?

Best
I have a question about neural networks in MATLAB.
First of all, I have a small NN: 2 inputs, 1 hidden layer with 10 neurons and one output, and this works fine. But the question I have is: can I choose my training data, validation data and test data myself?
I know that if I use e.g. net = feedforwardnet(10); I can divide my overall dataset into e.g. 70/100, 15/100 and 15/100. But I don't want to do this, because I want to train my NN with 1000 data points, validate it with another 1000 data points and test it with yet another independent data set of 1000 data points. In other words, I want to control these 3 independent data sets.
Thus, can someone help me?
Kind regards
Edit: I don't want to use a single data set with 3000 data points and set the divideParam ratios to 1/3, 1/3 and 1/3.
Best myself
When you use a feedforwardnet you can define your divide parameters:
net.divideParam.trainRatio = 1/3;
net.divideParam.valRatio = 1/3;
net.divideParam.testRatio = 1/3;
You know that your data will be divided into 3 pieces.
But you (I) don't know which data goes where.
But when you (and thus I) train the network via the following command:
[net,tr]=train(net,x,t);
then tr will contain all the necessary info, for example:
tr.trainInd 1x1000 double,
tr.valInd 1x1000 double,
tr.testInd 1x1000 double,
Thus, e.g. tr.trainInd will contain all the indices of our data set that were used for training. We can also see in tr that tr.divideFcn is set to dividerand, which means the indices are picked at random. So it would be logical that there is a way to make those indices not be picked randomly. Combining both things, it should then be possible to use a separate test set (net.divideParam.testRatio = 0) and two different training and validation sets (net.divideParam.trainRatio = 1/2 and net.divideParam.valRatio = 1/2), if you can set the divide function to something chronological or sequential. Last but not least, if this is possible, then all that remains is to put the training and validation sets into one data set, etc.
Kind regards to myself
By default it will use a random index split for train, validation and test. This can be set manually with the following, though since it's the default it is usually not needed:
net.divideFcn = 'dividerand'
and then you use the the commands you noted above:
net.divideParam.trainRatio = 1/3;
net.divideParam.valRatio = 1/3;
net.divideParam.testRatio = 1/3;
To do what you want and set the indices of each set explicitly, you can do the following:
net.divideFcn = 'divideind';
net.divideParam.trainInd = 1:1000;
net.divideParam.valInd   = 1001:2000;
net.divideParam.testInd  = 2001:3000;
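Putting it together, a minimal sketch (assuming three separate 1000-sample sets xTrain/xVal/xTest with targets tTrain/tVal/tTest, each stored column-wise as the toolbox expects; these variable names are illustrative, not from the original posts):
x = [xTrain, xVal, xTest];        % inputs,  e.g. 2 x 3000
t = [tTrain, tVal, tTest];        % targets, e.g. 1 x 3000

net = feedforwardnet(10);
net.divideFcn = 'divideind';
net.divideParam.trainInd = 1:1000;
net.divideParam.valInd   = 1001:2000;
net.divideParam.testInd  = 2001:3000;

[net, tr] = train(net, x, t);     % tr.trainInd etc. now match the ranges above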

Interpretation of Probability Estimate for Multi-class classification in LibSVM for MATLAB

Problem: 3 class classification with labels 1,2,3.
Tool: LibSVM for MATLAB
svmModel = svmtrain(<TrainclassLabels>, <Trainfeatures>, '-b 1 -c <someCValue> -g <someGammaValue>');
[predLabels, classAccuracy, probEstimates] = svmpredict(<TestClassLabels>, <TestFeatures>, svmModel, '-b 1');
After this step, the first ten rows of probEstimates are:
0.9129 0.0749 0.0122
0.9059 0.0552 0.0389
0.8231 0.0183 0.1586
0.9077 0.0098 0.0825
0.9074 0.0668 0.0257
0.8685 0.0146 0.1169
0.8962 0.0664 0.0374
0.9074 0.0548 0.0377
0.9474 0.0054 0.0472
0.9178 0.0642 0.0180
but the first ten predicted labels are:
2
2
2
2
2
2
2
2
2
2
Questions:
My understanding was that the probability estimate is the probability that a particular item belongs to a particular class, given its feature vector. However, if that were true, then these items should belong to class 1 and not class 2. Does libsvm change the order of the classes, or am I missing something here? If I am wrong, can someone please explain what the real interpretation of the probability estimates is?
If I have to move the decision boundary to increase the precision of class 1 (have fewer items predicted as class 1, i.e. be more conservative about the decision boundary), which of these class probabilities do I have to work with, and how?
I came across the same problem recently.
The reason is related to the order of the training data.
If you want the columns of the probability-estimate matrix to correspond to the sorted labels, the training data should be sorted by label.
For example, if the label of the first data point is 4, then the first entry of the probability-estimate vector relates to data points labeled 4.
The order of the labels stored in the model may differ from what we think it should be. You can check it using svmModel.Label; the probability estimates are output according to this order.
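As a sketch of the check and a possible fix (the Label field is part of the libsvm MATLAB model struct; the reordering line is just one way to line the columns up with the sorted labels 1, 2, 3):
svmModel.Label                          % e.g. [2; 1; 3] if a class-2 sample appeared first in training
[~, order] = sort(svmModel.Label);
probSorted = probEstimates(:, order);   % columns now correspond to labels 1, 2, 3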

How to compare different distribution means with reference truth value in Matlab?

I have production (q) values from 4 different methods stored in the 4 matrices. Each of the 4 matrices contains q values from a different method as:
Matrix_1 = 1 row x 20 column
Matrix_2 = 100 rows x 20 columns
Matrix_3 = 100 rows x 20 columns
Matrix_4 = 100 rows x 20 columns
The number of columns indicates the number of years; one row contains the production values for the 20 years. The other 99 rows of matrices 2, 3 and 4 are different realizations (simulation runs), i.e. repeat cases, but not with exactly the same values because of random numbers.
Consider Matrix_1 as the reference truth (or base case ). Now I want to compare the other 3 matrices with Matrix_1 to see which one among those three matrices (each with 100 repeats) compares best, or closely imitates, with Matrix_1.
How can this be done in Matlab?
I know that, manually, one can use confidence intervals (CI): plot the mean of Matrix_1 and draw the distributions of the means of Matrix_2, Matrix_3 and Matrix_4. The largest CI among matrices 2, 3 and 4 that contains the reference truth (the mean of Matrix_1) will be the answer.
mean of Matrix_1 = (1 row x 1 column)
mean of Matrix_2 = (100 rows x 1 column)
mean of Matrix_3 = (100 rows x 1 column)
mean of Matrix_4 = (100 rows x 1 column)
I hope the question is clear and relevant to SO. Otherwise please feel free to edit/suggest anything in question. Thanks!
EDIT: My three methods I talked about are a1, a2 and a3 respectively. Here's my result:
ci_a1 =
1.0e+008 *
4.084733001497999
4.097677503988565
ci_a2 =
1.0e+008 *
5.424396063219890
5.586301025525149
ci_a3 =
1.0e+008 *
2.429145282593182
2.838897116739112
p_a1 =
8.094614835195452e-130
p_a2 =
2.824626709966993e-072
p_a3 =
3.054667629953656e-012
h_a1 = 1; h_a2 = 1; h_a3 = 1
None of the CIs from the three methods includes the reference mean (= 3.454992884900722e+008). So do we still use the p-value to choose the best result?
If I understand correctly, the calculation in MATLAB is pretty straightforward.
Steps 1-2 (mean calculation):
k1_mean = mean(k1);      % Matrix_1 is 1x20, so this is a scalar
k2_mean = mean(k2,2);    % mean over the 20 years for each of the 100 realizations -> 100x1
k3_mean = mean(k3,2);
k4_mean = mean(k4,2);
Step 3, use HIST to plot distribution histograms:
hist([k2_mean, k3_mean, k4_mean])
Step 4. You can do a t-test comparing your vectors 2, 3 and 4 against a normal distribution with mean k1_mean and unknown variance. See TTEST for details.
[h,p,ci] = ttest(k2_mean,k1_mean);
EDIT: I misinterpreted your question. See the answer of Yuk and the comments that follow. My answer is what you need if you want to compare the distributions of two vectors rather than a vector against a single value; apparently the latter is the case here.
Regarding your t-tests, you should keep in mind that they test against a "true" mean. Given the number of values for each matrix and the confidence intervals, it's not too difficult to guess the standard deviation of your results. This is a measure of the "spread" of your results. The standard error of your mean is that standard deviation divided by the square root of the number of observations, and the confidence interval is obtained by multiplying the standard error by approximately 2.
This confidence interval contains the true mean in 95% of the cases. So if the true mean lies exactly at the border of that interval, the p-value is 0.05; the further away the mean, the lower the p-value. The p-value can be interpreted as the chance that the values you have in matrix 2, 3 or 4 come from a population with a mean as in matrix 1. Looking at your p-values, these chances can be said to be non-existent.
So you see that when the number of values gets large, the confidence interval becomes small and the t-test becomes very sensitive. What this tells you is nothing more than that the three matrices differ significantly from the reference mean. If you have to choose one, I'd take a look at the distributions anyway. Otherwise the one with the closest mean seems a good guess. If you want to get deeper into this, you could also ask on stats.stackexchange.com.
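As a rough numerical sketch of that standard-error and confidence-interval arithmetic (assuming m2 holds the 100 per-realization means of one method, e.g. m2 = mean(Matrix_2,2); the variable name is illustrative):
m2 = mean(Matrix_2, 2);                            % 100 per-realization means
n  = numel(m2);
se = std(m2) / sqrt(n);                            % standard error of the mean
ci = mean(m2) + [-1, 1] * tinv(0.975, n-1) * se    % ~95% confidence interval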
Your question and your method aren't really clear:
Is the distribution equal in all columns? This is important, as two distributions can have the same mean but still differ significantly.
Is there a reason why you don't use the Central Limit Theorem? This seems to me like a very complex way of obtaining a result that can easily be found using the fact that the distribution of a mean approaches a normal distribution with sd(mean) = sd(observations)/sqrt(number of observations). That saves you quite some work (if the distributions are alike!).
Now if the question really is the comparison of distributions, you should consider looking at a qqplot for a general idea, and at a two-sample Kolmogorov-Smirnov test for formal testing. But please read up on this test, as you have to understand what it does in order to interpret the results correctly.
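A minimal sketch of both suggestions (assuming Matrix_1 is the 1x20 reference and Matrix_2 is one 100x20 candidate, pooled over all realizations; adapt to whichever comparison makes sense for your data):
qqplot(Matrix_1(:), Matrix_2(:))              % quantile-quantile plot of reference vs candidate
[h, p] = kstest2(Matrix_1(:), Matrix_2(:))    % two-sample Kolmogorov-Smirnov test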
On a side note: if you do this test on multiple cases, make sure you understand the problem of multiple comparisons and use the appropriate correction, e.g. Bonferroni or Dunn-Sidak.