I have been working with LIBSVM in MATLAB for a while to do prediction. I have a dataset, of which I use 75% for training, 15% for finding the best parameters, and the remainder for testing. The code is given below.
trainX and trainY are the input and output training instances
testValX and testValY are the validation dataset I use
for j = 1:100
    for jj = 1:10
        model(j,jj) = svmtrain(trainY,trainX,...
            ['-s 3 -t 2 -c ' num2str(j) ' -p 0.001 -g ' num2str(jj) '-v 5']);
        [predicted_label, ~, ~] = svmpredict(testValY,...
            testValX,model(j,jj));
        MSE(j,jj) = sum(((predicted_label-testValY).^2)/2);
    end
end
[min_val,min_indi] = min(MSE(:));
best_predicted_model_rbf(i) = model(min_indi);
My question is whether this is correct. I am creating a model matrix with different values of c and g. I use the -v option, which is the key point here. From the resulting models I predict on the validation dataset and thereby compute the mean squared error. Using this MSE I pick the best c and g. Since I am using -v, which returns the cross-validated output, is the procedure I follow correct?
First, I think there is a slight problem with the code shown: the string built with num2str(jj) '-v 5' has no space before the -v, which may cause that flag not to be read at all. In the other question you stated that this 'sometimes returns a model', which is what would happen if the flag were not read. When the '-v' flag is read, you should get only a number back, not a model.
Second, it looks like you are doing two different things here, either of which would be reasonable on its own. Calling svmtrain with '-v' runs cross-validation on the training set. That shouldn't return a model; it should just return an MSE estimate. You could use these estimates to determine which parameter setting is best, and then train one model with that setting on all of the training data.
Next, you call svmpredict(y,x,model) on the hold-out validation set, testValX; but having called svmtrain with '-v', model is just a scalar at this point. For this call to run correctly, you have to get the model from svmtrain without '-v', so that it is a struct. The rest of what you are doing makes sense for this case, in which you are doing hold-out validation using testValX.
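Here is a minimal, untested sketch of that hold-out variant, reusing your variable names (trainX, trainY, testValX, testValY); the names bestC, bestG and bestModel are just illustrative:
bestMSE = Inf;
for j = 1:100
    for jj = 1:10
        % train a real model (no '-v'), so svmtrain returns a struct
        opts = ['-s 3 -t 2 -p 0.001 -c ' num2str(j) ' -g ' num2str(jj)];
        mdl  = svmtrain(trainY, trainX, opts);
        % evaluate on the hold-out validation set
        pred = svmpredict(testValY, testValX, mdl);
        mse  = mean((pred - testValY).^2);
        if mse < bestMSE
            bestMSE = mse; bestC = j; bestG = jj;
        end
    end
end
% retrain once on all the training data with the best parameters
bestModel = svmtrain(trainY, trainX, ...
    ['-s 3 -t 2 -p 0.001 -c ' num2str(bestC) ' -g ' num2str(bestG)]);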
I am trying to do 10-fold cross-validation, without using built-in functions, to train and recognize the digits 0-9. I have a sample of 500 pictures (50 for each digit) to train and test.
I tried to implement the answer to MATLAB: 10 fold cross Validation without using existing functions and looked at other websites, but it didn't help much, mostly because I'm new to MATLAB and don't know how to tweak it.
This is the code I have so far.
c = zeros(10,size(x,2),size(x,3));
K = 10;
k = 10;
test = 1:50/K;
for fold = 1:K
    if (test(1) ~= 1)
        train = x(1:test(1)-1,:,:);
        if (test(5) ~= 50)
            train = [train ; x(test(end):50,:,:)];
        end
    else
        train = x(test(1):50,:,:);
    end
    test = test + ones(1,50/K)*50/K;
end
for i = 0:9
    test = test + 50/K*ones(1,5);
    c(i+1,:,:) = cal_likelihood(x(1+i*50:50+i*50,:,:),50/k*(k-1));
end
Variable explanation
x is a 500x28x28 double that holds all 500 digit pictures.
test is the test set.
train is the training set.
In order to do 10-fold cross-validation I need to change the training set like this:
1st fold: 1:5 for test, 6:50 for train
2nd fold: 6:10 for test, 1:5 and 11:50 for train, and so on
The problem is that I don't know how to shift the training set from one fold to the next, e.g. from 6:50 to 1:5 plus 11:50. Or can I write a better loop than this?
P.S. If whoever answers this doesn't mind: what does 500x28x28 double actually mean?
There are a few ways you could write this, some of which are easier to understand than others. MATLAB is quite nice to write in here: while expressions such as 1:3 evaluate to [1,2,3], the expression 1:0 evaluates to the empty set, so it is very straightforward to generate the index sets without having to use if statements.
I'd start off the loop as:
K = 10;                              % number of folds, as in your code
samples_per_digit = 50;
block_sze = samples_per_digit/K;     % samples per digit in each fold
for fold = 1:K
    test_ind  = 1+(fold-1)*block_sze : fold*block_sze;
    train_ind = [1:(fold-1)*block_sze, (fold*block_sze+1):samples_per_digit];
    for i = 0:9
        train = x(train_ind+i*samples_per_digit,:,:);
        test  = x(test_ind+i*samples_per_digit,:,:);
        % Perform training and validation in here for this fold of the digit i
    end
end
You can verify that test_ind and train_ind correspond to the subsets of blocks of training and validation that you need. It is only in the innermost loop that these translate to the matrices corresponding to the digit images, using the value of i to compute the offset. Of course, if you wish, you can swap the order of the loops, computing all of the folds for a single digit. It all depends on how you wish to store your results.
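For example, just to see the indexing (assuming K = 10 and samples_per_digit = 50 as above), the second fold gives:
fold = 2;
test_ind  = 1+(fold-1)*block_sze : fold*block_sze
% test_ind  ->  6  7  8  9  10
train_ind = [1:(fold-1)*block_sze, (fold*block_sze+1):samples_per_digit]
% train_ind ->  1  2  3  4  5  11  12 ... 50
which matches the '6:10 for test, 1:5 and 11:50 for train' split you describe for the second fold.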
I have been reading the documentation here and here, but it's still unclear to me and I don't see how to practically use crossval to do leave-one-out cross-validation.
vals = crossval(fun,X)
vals = crossval(fun,X,Y,...)
mse = crossval('mse',X,y,'Predfun',predfun)
mcr = crossval('mcr',X,y,'Predfun',predfun)
val = crossval(criterion,X1,X2,...,y,'Predfun',predfun)
vals = crossval(...,'name',value)
I really don't understand the fun part.
I have estimated the chlorophyll rate with different indices. Then I did a linear regression between those indices and the chlorophyll rate measured in the field. Now I want to validate them: one of my estimates is a column with 22 entries, so I want to use 21 of them for training and 1 as a test, and loop 22 times so that every data point has been used as the test once.
But I don't know where I should put the regression model. If my regression is Y = aX + b,
do I re-use the a and b calculated before for the training part, or do I fit a new linear regression on the training part and then see what the test gives with that?
I am not sure I have totally understood how to build a leave-one-out model.
Then I want to know the result of the test by calculating the RMSE (and maybe the R²).
How do I code that using crossval?
I saw the answer to the question here, but I don't have access to the crossvalind function with my license.
Well, I finally figured it out; this is my script.
First I loaded my data and defined the linear regression function:
X=indicesCha_without_Cloud(:,3);
y=Cha_g_m2t_without_Cloud(:,3);
testval = @(XTRAIN,ytrain,XTEST) Linear_regression_indices(XTRAIN,ytrain,XTEST);
where in my case fun (in the MathWorks help) is testval, and Linear_regression_indices is a very simple function:
function [ Linear_regression_indices ] = Linear_regression_indices(XTRAIN,ytrain,XTEST)
% crossval passes the training rows (XTRAIN, ytrain) and the test rows (XTEST)
% for each fold; the function must return the predictions for XTEST
Linear_regression_indices = polyval(polyfit(XTRAIN,ytrain,1),XTEST);
end
There are two ways to do it and they both give the same result:
one is to simply use the crossval function:
cvMse = crossval('mse',X,y,'predfun',testval,'leaveout',1);
This does as many folds as there are data points, each time using one of them as XTEST.
The second one uses cvpartition:
c = cvpartition(n,'LeaveOut') creates a random partition for leave-one-out cross-validation on n observations. Leave-one-out is a special case of 'KFold' in which the number of folds equals the number of observations (from the cvpartition documentation).
c = cvpartition(length(y),'LeaveOut');
cvMse2=crossval('mse',X,y,'predfun',testval,'partition',c);
then the RMSE can be easily calculated
RMSE=sqrt(cvMse);
RMSE2=sqrt(cvMse2);
and I simply get my answer; in my case RMSE = 0.3548.
I constructed a Gaussian Mixture Model in Matlab with a dataset:
model = gmdistribution.fit(data,M,'Replicates',5);
with M = 3 Gaussian components. I tested new data with:
[P, l] = posterior(model,new_data);
I ran the program several times and didn't get the same result. Each run produces different log-likelihood values. I use the log-likelihood to make decisions, and this value for the same data (new_data) differs for each run. What does it depend on? How can I resolve this problem?
First, assuming that you're using a newish version of Matlab, the gmdistribution.fit documentation indicates that the fit method is deprecated and that fitgmdist should be used. See here for an example.
Second, the documentation for gmdistribution.fit indicates that if the 'Replicates' option is larger than 1, the 'randSample' start method will be used to produce the initial parameters. This may be the cause (or at least one of the causes) of your observed variability.
Finally, you can also try calling rng before gmdistribution.fit to set the seed of the global random number stream (assuming the function doesn't use its own stream internally). Alternatively, you can try specifying an 'Options' parameter via statset:
seed = 1;
s = RandStream('mt19937ar','Seed',seed);   % fixed-seed Mersenne Twister stream
opts = statset('Streams',s);
model = gmdistribution.fit(data,M,'Replicates',5,'Options',opts);
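For completeness, the simpler rng route mentioned above would look something like this (a sketch; the seed value 1 is arbitrary):
rng(1);  % fix the global random number stream before fitting
model = gmdistribution.fit(data,M,'Replicates',5);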
I can't test this fully myself – see the gmdistribution class documentation for further details.
I have a question about an annoying issue. I'm using LIBSVM with MATLAB and I'm able to predict using:
predicted_label = svmpredict(Ylabel, Xlabel, model);
but every time I make a prediction, this appears:
Accuracy = X% (y/n) (classification)
I find this annoying because I repeat this procedure many times, and the printing to the screen also slows it down.
I think what I want is to stop svmpredict from being verbose.
Can anyone help me with this? Thanks in advance.
-Jessica
I found that a much better approach than editing the source code of the C library is to use MATLAB's evalc, which captures any console output into its first output argument:
[~, predicted_label] = evalc('svmpredict(Ylabel, Xlabel, model)');
Because the string to be evaluated is fixed, there should be no performance decrease.
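If you also need the accuracy and decision values, evalc forwards the remaining outputs after the captured text, so something like this should work (untested sketch):
[~, predicted_label, accuracy, dec_values] = evalc('svmpredict(Ylabel, Xlabel, model)');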
The simplest way is to pass the '-q' (quiet) option to svmpredict:
svmpredict(Ylabel, Xlabel, model, '-q');
From the manual:
Usage: [predicted_label, accuracy, decision_values/prob_estimates] = svmpredict(testing_label_vector, testing_instance_matrix, model, 'libsvm_options')
[predicted_label] = svmpredict(testing_label_vector, testing_instance_matrix, model, 'libsvm_options')
Parameters:
model: SVM model structure from svmtrain.
libsvm_options:
-b probability_estimates: whether to predict probability estimates, 0 or 1 (default 0); one-class SVM not supported yet
-q : quiet mode (no outputs)
If you are using MATLAB, just find the line of code that displays this information (usually via 'disp', 'sprintf', or 'fprintf') and comment it out with the comment operator %.
example:
disp(['Accuracy= ' num2str(x)]);
change it to:
% disp(['Accuracy= ' num2str(x)]);
If you are using the main LIBSVM library, then you need to modify it before building:
1- Open the file 'svmpredict.c'
2- find this line of code:
info("Accuracy = %g%% (%d/%d) (classification)\n",
(double)correct/total*100,correct,total);
3- just comment it out using // operator
4- save and close the file
5- make the project
I'm doing some cross-validation using a Matlab Weka Interface that I got from file exchange. My loop structure seems to work fine for Weka's Logistic classifier. However, when I try to do the exact same thing for AdaBoostM1, it throws the following error:
??? Java exception occurred: java.lang.ArrayIndexOutOfBoundsException
Error in ==> wekaClassify at 24 classProbs(t+1,:) = (classifier.distributionForInstance(testData.instance(t)))';
Error in ==> classifier_search at 225 [pred ~] = wekaClassify(matlab2weka('instance', featurelabels, tester), classifier);
I have determined through some testing that this only occurs when the number of instances in the training set is greater than the number of instances in the test set. I am sure you can see why that is a problem for me, since in most situations the training set is greater than the test set in size.
Is there something different about how I should format my inputs when using Adaboost rather than Logistic? Any information you can give regarding this problem would be so helpful.
I downloaded this code from this page: http://www.mathworks.com/matlabcentral/fileexchange/21204-matlab-weka-interface
Emails bounce from the account of the guy who made it, and he doesn't seem to respond to comments on the page - I'm hoping that maybe someone here has used this.
EDIT: Here is the code that I use to train and test the classifier:
classifier = trainWekaClassifier(matlab2weka('training', featurelabels, train), 'meta.AdaBoostM1', { strcat('-P 100 -S 1 -I ', num2str(r), '-W weka.classifiers.trees.DecisionStump')});
[pred ~] = wekaClassify(matlab2weka('instance', featurelabels, tester), classifier);
I haven't used this combination of software, so I can only take a guess at what could cause this.
Are your training/testing data matrices the right way round? They should be N-by-D (N instances, D features).
If you were passing in a D-by-N training matrix and a D-by-M testing matrix, then I would expect this kind of failure whenever M < N - which is what you describe - and even when it did run, it wouldn't give a meaningful result.
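If that turns out to be the problem, a quick sanity check before calling matlab2weka might look like this (a sketch that assumes your matrices should be N-by-D and that featurelabels holds one name per feature):
% each row should be one instance, each column one feature
assert(size(train, 2)  == numel(featurelabels), 'training data is not N-by-D');
assert(size(tester, 2) == numel(featurelabels), 'test data is not N-by-D');
% if the matrices arrived transposed, flip them before converting:
% train  = train';
% tester = tester';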