Issues related to plots in pattern recognition(Part1) - matlab

I cannot follow the crossval() and cvpartition() functions as described in the MATLAB documentation for crossval(). What goes in the parameters, and how does it help to compare the performance and accuracy of different classifiers? I would be obliged if a simpler version were provided here.

Let's work on Example 2 from CROSSVAL documentation.
load('fisheriris');
y = species;
X = meas;
Here we loaded the data from the example MAT-file and assigned the variables X and y. The meas matrix contains different measurements of iris flowers, and species holds the three classes of iris that we are trying to predict from the data.
Cross-validation trains and tests a classifier repeatedly on different splits of the same data set. At each iteration the data set is split into training and test data, and the proportion is determined by the number of folds k. For example, if k is 10, each iteration uses 90% of the data for training and the remaining 10% for testing, and there are 10 iterations, so every observation is used for testing exactly once. The CVPARTITION function creates this partition.
cp = cvpartition(y,'k',10); % Stratified cross-validation
You can explore cp object if you type cp. and press Tab. You will see different properties and methods. For example, find(cp.test(1)) will show indices of the test set for 1st iteration.
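To make the partitioning concrete, here is a minimal sketch that builds a plain (non-stratified) 10-fold split by hand in base MATLAB. The variable names (foldId, testMask, trainMask) are made up for illustration; cvpartition does this bookkeeping for you and, when given class labels, also keeps the class proportions similar in every fold.

```matlab
n = 150;                      % number of observations (as in fisheriris)
k = 10;                       % number of folds

perm = randperm(n);           % random order of the observations
foldId = zeros(n, 1);
foldId(perm) = mod(0:n-1, k) + 1;   % assign each observation to a fold 1..k

for i = 1:k
    testMask  = (foldId == i);      % fold i is the test set in iteration i
    trainMask = ~testMask;          % everything else is used for training
    fprintf('fold %d: %d train, %d test\n', i, sum(trainMask), sum(testMask));
end
```

With n = 150 and k = 10, every iteration trains on 135 observations and tests on 15.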
The next step is to prepare the prediction function. This is probably where you had the main problem. The statement below creates a function handle from an anonymous function. The @(XTRAIN, ytrain,XTEST) part declares that the function has 3 input arguments. The next part, (classify(XTEST,XTRAIN,ytrain)), defines the function body: it takes the training data XTRAIN with known classes ytrain and predicts classes for the XTEST data with the fitted model. (Those data come from cp, remember?)
classf = @(XTRAIN, ytrain,XTEST)(classify(XTEST,XTRAIN,ytrain));
Then we run the CROSSVAL function to estimate the misclassification rate (mcr), passing the complete data set, the prediction function handle and the partitioning object cp.
cvMCR = crossval('mcr',X,y,'predfun',classf,'partition',cp)
cvMCR =
0.0200
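To compare classifiers, as the question asks, one option is to pass different prediction functions to crossval while reusing the same partition, so that both are evaluated on identical splits. The sketch below assumes the Statistics and Machine Learning Toolbox is available; the handle names linearFun and quadFun are made up for illustration, and the resulting numbers depend on the random partition.

```matlab
load('fisheriris');
X = meas;
y = species;

cp = cvpartition(y, 'KFold', 10);   % one stratified partition shared by both classifiers

% Two prediction functions with the signature crossval expects
linearFun = @(XTRAIN, ytrain, XTEST) classify(XTEST, XTRAIN, ytrain, 'linear');
quadFun   = @(XTRAIN, ytrain, XTEST) classify(XTEST, XTRAIN, ytrain, 'quadratic');

% Misclassification rate of each classifier on the same folds
mcrLinear = crossval('mcr', X, y, 'predfun', linearFun, 'partition', cp);
mcrQuad   = crossval('mcr', X, y, 'predfun', quadFun,   'partition', cp);

fprintf('linear: %.4f, quadratic: %.4f\n', mcrLinear, mcrQuad);
```

The classifier with the lower cross-validated misclassification rate is the better candidate on this data, although small differences may just reflect the particular partition.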
Still have questions?

Related

matlab predict function error with fitrtree model

I am trying to do regression with fitrtree model. It works fine without the validation but with the validation the predict function returns an error.
%works fine
tree = fitrtree(trainingData,target,'MinLeafSize',2, 'Leaveout','off');
y_hat = predict(tree, xNew);
%Returns error
tree = fitrtree(trainingData,target,'MinLeafSize',2, 'Leaveout','on');
y_hat = predict(tree, xNew);
Error: Systems of classreg.learning.partition.RegressionPartitionedModel class cannot be used with the "predict"
command. Convert the system to an identified model first, such as by using the "idss" command.
Update: I figured out that when we use cross-validation of any sort, the model is in the Trained attribute of tree rather than in tree itself. What is this Trained attribute (tree.Trained{1}), and what information do we get from it?
If you choose a cross-validation method when calling fitrtree(), the output of the function is a RegressionPartitionedModel instead of a RegressionTree.
As you said, you can access the RegressionTree objects stored in tree.Trained in your case. The number and meaning of the trees you find under this attribute depend on the cross-validation method. In your case, using leave-one-out cross-validation (LOOCV), the Trained attribute contains N RegressionTree objects, where N is the number of data points in your training set. Each of these regression trees is obtained by training on all but one of your data points. The left-out data point is used for testing.
For example, if you want to access the first and last trees obtained from cross-validation, and use them for separate predictions, you can do:
%Returns RegressionPartitionedModel
cv_trees = fitrtree(trainingData,target,'MinLeafSize',2, 'Leaveout','on');
%This is the number of regression trees stored in cv_trees for LOOCV
[N, ~] = size(trainingData);
%Use one of the models from the cross-validation as a predictor
y_hat = predict(cv_trees.Trained{1}, xNew);
y_hat_2 = predict(cv_trees.Trained{N}, xNew);
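Rather than picking individual trees out of Trained, a partitioned model is usually queried through its own methods. The sketch below uses small synthetic data (an assumption, since trainingData and target are not shown) to illustrate kfoldLoss and kfoldPredict on the cross-validated model, and then fits a separate tree on all the data for predicting genuinely new points.

```matlab
rng(1);                                   % reproducible synthetic data (illustrative only)
X = rand(40, 2);
y = X(:, 1) + 0.1 * randn(40, 1);

% Cross-validated model: a RegressionPartitionedModel holding one tree per left-out point
cvTree = fitrtree(X, y, 'MinLeafSize', 2, 'Leaveout', 'on');

cvMSE = kfoldLoss(cvTree);                % cross-validated mean squared error
yOut  = kfoldPredict(cvTree);             % each point predicted by the tree that did not see it

% For new data, fit an ordinary RegressionTree on all observations
finalTree = fitrtree(X, y, 'MinLeafSize', 2);
yNew = predict(finalTree, [0.5 0.5]);
```

Here the cross-validated model is used only to estimate how well the settings generalize, while finalTree is the model that actually makes predictions.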

Matlab Machine Learning Train, Validate, Test Partitions

I'm using Matlab's Statistics and Machine Learning Toolbox to create decision trees, ensembles, Knn models, etc. I would like to separate my data into training/testing partitions, then have the models train and cross validate using the training data (essentially splitting the training data into training and validation data) while preserving my testing data for error metrics. It is important that the models not be trained in any way using the testing data. For my decision tree, I have something like the following code:
chess = csvread(filename);
predictors = chess(:,1:6);
class = chess(:,7);
cvpart = cvpartition(class,'holdout', 0.3);
Xtrain = predictors(training(cvpart),:);
Ytrain = class(training(cvpart),:);
Xtest = predictors(test(cvpart),:);
Ytest = class(test(cvpart),:);
% Fit the decision tree
tree = fitctree(Xtrain, Ytrain, 'CrossVal', 'on');
% Error Metrics
testingLoss = loss(tree,Xtest,Ytest,'Subtrees','all'); % Testing
resubcost = resubLoss(tree,'Subtrees','all'); % Training
[cost,secost,ntermnodes,bestlevel] = cvloss(tree,'Subtrees','all'); % Cross Val
However, this returns
Undefined function 'loss' for input arguments of
type 'classreg.learning.partition.ClassificationPartitionedModel'.
when attempting to find the testing error. I have tried several combinations of similar methods using different types of classification algorithms, but keep coming back to not being able to apply test data to a cross validated model due to partitioned data. How am I supposed to apply test data to a cross validated model?
When you use cross-validation in the call to fitctree, by default 10 model folds are constructed within the 70% of the data used to train the model. You can get the cross-validation loss (averaged over the folds by default) via:
modelLoss = kfoldLoss(tree);
Since the original call to fitctree constructed 10 model folds, there are 10 separately trained models. Each of the 10 models is stored in a cell array at tree.Trained. For example, you could use the first trained model to compute the loss on your held-out data via:
testingLoss = loss(tree.Trained{1},Xtest,Ytest,'Subtrees','all'); % Testing
You can use the kfoldLoss function to also get the CV loss for each fold and then choose the trained model that gives you the least CV loss in the following way:
modelLosses = kfoldLoss(tree,'mode','individual');
The above code will give you a vector of length 10 if you have done 10-fold cross-validation while learning. Assuming the trained model with least CV error is the 'k'th one, you would then use:
testSetPredictions = predict(tree.Trained{k}, testSetFeatures);
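An alternative to selecting one of the fold models is sketched below: use the cross-validation loss only to assess the tree settings, then refit a single tree on all of the training data and evaluate that tree once on the held-out test set. This is a suggestion under stated assumptions, not the only valid workflow; fisheriris stands in for the chess data, which is not available here.

```matlab
load('fisheriris');
rng(0);                                        % reproducible split (illustrative only)

% Hold out 30% of the data for final testing
cvpart = cvpartition(species, 'HoldOut', 0.3);
Xtrain = meas(training(cvpart), :);
Ytrain = species(training(cvpart));
Xtest  = meas(test(cvpart), :);
Ytest  = species(test(cvpart));

% 10-fold cross-validation inside the training data only
cvTree = fitctree(Xtrain, Ytrain, 'CrossVal', 'on');
cvLoss = kfoldLoss(cvTree);                    % validation estimate

% Final model: one ordinary tree fitted on all training data
finalTree = fitctree(Xtrain, Ytrain);
testLoss  = loss(finalTree, Xtest, Ytest);     % test data is used only here

fprintf('CV loss: %.4f, test loss: %.4f\n', cvLoss, testLoss);
```

The test data never influences training or model selection in this arrangement, which is the property the question asks for.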

initial seed for sparse GP regression

I use the sparse Gaussian process for regression from Rasmussen.
http://www.tsc.uc3m.es/~miguel/downloads.php
The syntax for predicting the mean is:
[~, mu_1, ~, ~, loghyper] = ssgpr_ui(Xtrain, Ytrain, Xtest, Ytest, m);
My question: the author states that the initial conditions for the hyperparameter search differ from run to run, so the model's results differ on every run. Is there any way to choose the best initialization or seed so that the hyperparameters are of good quality, giving the best predictions and reproducible results?
In order to obtain the same predictions every time, it is possible to set the seed by
stream = RandStream('mt19937ar','Seed',123456);
RandStream.setGlobalStream(stream);
However, there is no standard procedure for choosing the best seed. Picking the seed that gives the best results tends to overfit the model, because the initialization is then tuned to the training data, as noted by @mikkola.
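For reference, the same global-stream setup can also be written with rng, and its effect is easy to check. In the sketch below, rand stands in for whatever random initialization the regression code draws internally (an assumption, since that code is not shown here).

```matlab
seed = 123456;

rng(seed, 'twister');        % Mersenne Twister global stream with a fixed seed
firstRun = rand(1, 3);       % stand-in for a random hyperparameter initialization

rng(seed, 'twister');        % reset to the same state before the second run
secondRun = rand(1, 3);

disp(isequal(firstRun, secondRun));   % both runs draw identical numbers
```

Resetting the seed immediately before each call makes the runs repeatable without claiming that this particular seed is in any sense the best one.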

Using Linear Prediction Over Time Series to Determine Next K Points

I have a time series of N data points of sunspots and would like to predict based on a subset of these points the remaining points in the series and then compare the correctness.
I'm just getting introduced to linear prediction using Matlab and so have decided that I would go the route of using the following code segment within a loop so that every point outside of the training set until the end of the given data has a prediction:
%x is the data, training set is some subset of x starting from beginning
%'unknown' is the number of points to extend the prediction over starting from the
%end of the training set (i.e. difference in length of training set and data vectors)
%x_pred is set to x initially
p = length(training_set);
coeffs = lpc(training_set, p);
for i=1:unknown
nextValue = -coeffs(2:end) * x_pred(end-unknown-1+i:-1:end-unknown-1+i-p+1)';
x_pred(end-unknown+i) = nextValue;
end
error = norm(x - x_pred)
I have three questions regarding this:
1) Does this appropriately do what I have described? I ask because my error seems rather large (>100) when predicting over only the last 20 points of a dataset that has hundreds of points.
2) Am I interpreting the second argument of lpc correctly? Namely, that it means the 'order' or rather number of points that you want to use in predicting the next point?
3) Is there a more efficient, single-line function in Matlab that I can call to replace the loop and compute all the necessary predictions, given some subset of my overall data as a training set?
I tried looking through the lpc Matlab tutorial but it didn't seem to do the prediction as I have described my needs require. I have also been using How to use aryule() in Matlab to extend a number series? as a reference.
So after much deliberation and experimentation I have found the above approach to be correct and there does not appear to be any single Matlab function to do the above work. The large errors experienced are reasonable since I am using a linear prediction algorithm for a problem (i.e. sunspot prediction) that has inherent nonlinear behavior.
Hope this helps anyone else out there working on something similar.
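For anyone who wants to experiment with the recursive scheme without the Signal Processing Toolbox, here is a minimal sketch in base MATLAB. It swaps in an ordinary least-squares fit of the autoregressive coefficients instead of lpc, uses a synthetic noisy sinusoid instead of the sunspot data, and picks an order p = 10 arbitrarily; all three are assumptions for illustration. Note the sign convention: lpc returns [1 a2 ... ap+1] and the prediction is -coeffs(2:end) times the past values, whereas the coefficients fitted here are applied directly.

```matlab
N = 300;
t = (1:N)';
x = sin(2*pi*t/11) + 0.05*randn(N, 1);   % synthetic series (stand-in for real data)

unknown = 20;                            % number of points to predict at the end
p = 10;                                  % model order: past points used per prediction
train = x(1:N-unknown);
M = numel(train);

% Regression matrix: column k holds the series delayed by k samples
A = zeros(M-p, p);
for k = 1:p
    A(:, k) = train(p+1-k : M-k);
end
a = A \ train(p+1:M);                    % least-squares AR coefficients

% Recursive forecast: each prediction is fed back as input for the next one
xPred = [train; zeros(unknown, 1)];
for i = 1:unknown
    n = M + i;
    xPred(n) = xPred(n-1:-1:n-p)' * a;
end

err = norm(x(M+1:end) - xPred(M+1:end)); % error over the predicted points only
```

Because predictions are fed back into later predictions, errors accumulate with the forecast horizon, which is one reason the error over the last 20 points can be large.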

Naïve Bayes Classifier -- is normalization necessary?

We recently studied the Naïve Bayesian Classifier in our Machine Learning class and now I'm trying to implement it on the Fisher Iris dataset as a self-exercise. The concept is easy and straightforward, with some trickiness involved for continuous attributes. I read up several literature resources which recommended using a Gaussian approximation to compute probability of test data values, so I'm going with it in my code.
Now I'm trying to run it initially for 50% training and 50% test data samples, but something is missing. The current code is always predicting class 1 (I used integers to represent the classes) for all test samples, which is obviously wrong.
My guess is that the problem may be due to normalization being omitted by the code? Though I think adding normalization would still yield proportionate results, and so far my attempts to normalize have produced the same classification results.
Can someone please suggest if there is anything obvious missing here? Or if I'm not approaching this right? Since most of the code is 'mechanics', I have made prominent (****************) the 2 lines that are responsible for the calculations. Any help is appreciated, thanks!
nsamples=75; % 50% samples
% acquire training set and test set
[trainingSample,idx] = datasample(data,nsamples,'Replace',false);
testData = data(setdiff(1:150,idx),:);
% define Gaussian function
%***********************************************************%
Phi=@(mu,sig2,x) (1/sqrt(2*pi*sig2))*exp(-((x-mu)^2)/2*sig2);
%***********************************************************%
for c=1:3 % for 3 classes in training set
clear y x mu sig2;
index=1;
for i=1 : length(trainingSample)
if trainingSample(i,5)==c
y(index,:)=trainingSample(i,:); % filter current class samples
index=index+1; % for conditional probabilities
end
end
for j=1:size(testData,1) % iterate over test samples
clear pf p;
for i=1:4 % iterate over columns
x=testData(j,i); % representing attributes
mu=mean(y(:,i));
sig2=var(y(:,i));
pf(i) = Phi(mu,sig2,x); % calc conditional probability
end
% calc class likelihood; prior * posterior
%*****************************************************%
pc(j,c) = size(y,1)/nsamples * pf(1)*pf(2)*pf(3)*pf(4);
%*****************************************************%
end
end
% find the predicted class for each test sample
% by taking the max probability calculated
for i=1:size(pc,1)
[~,q]=max(pc(i,:));
predicted(i)=q;
actual(i)=testData(i,5);
end
Normalization shouldn't be necessary since the features are only compared to each other.
p(class|thing) ∝ p(class)p(thing|class)
= p(class)p(feature_1|class)p(feature_2|class)...p(feature_N|class)
So when you fit the parameters of the distribution feature_i|class (here mu and sigma2), rescaling a feature just rescales those parameters. The density values change by the same factor for every class, so the comparison between classes, and therefore the predicted class, remains the same.
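A small numerical check of this scale argument, using made-up feature values for two classes and an arbitrary scale factor c: the difference in Gaussian log-density between the classes is the same before and after rescaling the feature.

```matlab
x1 = [4.9 5.1 5.0 5.3]';     % hypothetical feature values, class 1
x2 = [6.0 6.4 6.2 6.6]';     % hypothetical feature values, class 2
xq = 5.4;                    % query point to classify
c  = 10;                     % rescaling factor, e.g. cm to mm

% Gaussian log-density with mean mu and variance sig2
logpdf = @(mu, sig2, x) -0.5*log(2*pi*sig2) - (x - mu).^2 ./ (2*sig2);

% Log-density difference between the classes, original scale
d = logpdf(mean(x1), var(x1), xq) - logpdf(mean(x2), var(x2), xq);

% Same difference after rescaling both the data and the query point
dScaled = logpdf(mean(c*x1), var(c*x1), c*xq) - logpdf(mean(c*x2), var(c*x2), c*xq);

fprintf('difference: %.6f (original), %.6f (rescaled)\n', d, dScaled);
```

Rescaling shifts both log-densities by the same amount, log(c), so the difference and the winning class are unchanged.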
It's hard to read the MATLAB code because of all the indexing and the splitting of training/testing data, which is a possible source of problems.
You should try something with a lot less unnecessary stuff around it (I would recommend Python with scikit-learn, for example, which has a lot of helpers for splitting data and such: http://scikit-learn.org/).
It's really important that you separate the training and test data, and only train the model with training data and test the trained model with the test data. (Is this done?)
The next step is to check the parameters, which is most easily done either by printing them out (sanity check) or
by rendering, for each feature, the fitted Gaussian bell next to a histogram of the data to see that they match (remember that each histogram bar must have height number_of_samples_within_range/total_number_of_samples, additionally divided by the bin width if the bins are not of width 1).
Visualising the data and the model is really important to know what is happening.
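One sanity check along these lines concerns the Gaussian function itself. In MATLAB, / and * have equal precedence and are evaluated left to right, so an exponent written as -((x-mu)^2)/2*sig2 divides by 2 and then multiplies by sig2 instead of dividing by 2*sig2. The sketch below is a suggested check rather than a confirmed diagnosis of the question's result: it writes the density with explicit parentheses and verifies that it integrates to about 1. The test values of mu and sig2 are arbitrary.

```matlab
% Gaussian density with explicit parentheses around the denominator 2*sig2
Phi = @(mu, sig2, x) (1 ./ sqrt(2*pi*sig2)) .* exp(-((x - mu).^2) ./ (2*sig2));

% Same expression with the precedence as written in the question's code
PhiAsWritten = @(mu, sig2, x) (1 ./ sqrt(2*pi*sig2)) .* exp(-((x - mu).^2) ./ 2 .* sig2);

mu = 1;
sig2 = 0.25;
xs = linspace(-10, 10, 20001);

areaFixed   = trapz(xs, Phi(mu, sig2, xs));           % should be close to 1
areaWritten = trapz(xs, PhiAsWritten(mu, sig2, xs));  % differs from 1 unless sig2 == 1

fprintf('area (parenthesized): %.4f, area (as written): %.4f\n', areaFixed, areaWritten);
```

The two versions agree only when sig2 equals 1, so with per-class variances far from 1 the unparenthesized form can distort the comparison between classes.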