Matlab cross validation and K-NN - matlab

I am trying to build a knn clasiffier with cross validation in Matlab. Because of my MATLAB version I have used knnclassify() in order to build the classifier (classKNN = knnclassify (sample_test, sample_training, training_label)).
I am not capable to use crossval() with that.
Thanks in advance.

There are two ways to perform the K-Nearest Neighbour in Matlab. The first one is by using knnclassify() as you did. However, this function will return the predicted labels and you cannot use crossval() with this. The cross-validation is performed on a model, not on its results. In Matlab, the model is described by an object.
crossval() only works with objects (classifier objects, be it K-NN, SVM and so on...). In order to create the so-called nearest-neighbor classification object you need to use the fitcknn() function. Given the Training Set and the Training Labels as input (in this order), such function will return your object, which you can give as input in crossval().
There's only one thing left though: how do I predict the labels for my validation set? In order to do this, you need to use the predict() function. Given the model (kNN object) and the Validation Set as input (again, in this order), such function will return (as in knnclassify()) the predicted labels vector.

Related

"No valid system or dataset was specified" while using predict() in MATLAB

I am trying to classify my input features into two classes using SVM. I want to use 10-fold cross-validation to train an SVM classifier. I am using MATLAB inbuilt functions. But while using predict() function along with crossval() function, I am getting an error:
No valid system or dataset was specified.
Does anyone knows how to resolve this issue?
Training_Features = X;
Training_Labels = Y;
SVMModel =
fitcsvm(Training_Features,Training_Labels,'KernelFunction','RBF');
CVSVMModel = crossval(SVMModel);
[Predict_Labels,Predict_Scores] =
predict(CVSVMModel,Training_Features);
I think you understood the cross validation functionality wrong. Your CVSVMModel is a so called ClassificationPartitionedModel which has no function predict() since Cross Validation is meant for testing the generalisation of your model BEFORE you train it with the WHOLE dataset (not cross-validated).
I suggest the following:
Call [Predict_Labels,Predict_Scores] = kfoldPredict(CVSVMModel); to see how well it does on each validation dataset during the cross-validation
If you're satisfied train a new SVMModel with the whole dataset and predict with that one.
EDIT:
A ClassificationPartitionedModel is a collection of models (in your case 10 different ones). You can access and even call predict() on them by for example:
[Predict_Labels,Predict_Scores] = predict(CVSVMModel.Trained{1, 1},X);

Parameter selection of SVM

I have a dataset which I use for classifcation with libSVM in Matlab. The dataset consists of 4 classes.
For parameter selection of SVM I can do nested cross-validation. The problem is that I also need the value of the best parameters in the end.
After having done the nested cross-validation and having the final accuracy I want the values of the best parameters. Then I will train a SVM for each class (one-vs-all) with the best parameters for selecting the most important features (according to heighest weight), i.e. feature importance map.
How can I do this? Should I just not do nested cross-validation and only looping over all parameters and doing cross-validation?
Second, if I use a linear SVM then using this weight vector w for assigning importance to features works, but does it also work for non-linear SVM (e.g. rbf kernel)?
To find the "best" parameters for your kernel of choice, you have to loop through all parameters to perform a so called "grid search". LIBSVM does not support a build-in grid-search mechanismn.
Regarding your second question, I would suggest to perform a feature selection (e.g. Information Gain, Mutual Information, ...) as a pre-processing step before the actual work with the SVM and in a second step take the weight vector
s into consideration (but I am not sure, if this will work with RBF or Gaußian Kernels...).

How to predict labels for new data (test set) by the PartitionedEnsemble model in Matlab?

I trained a ensemble model (RUSBoost) for a binary classification problem by the function fitensemble() in Matlab 2014a. The training by this function is performed 10-fold cross-validation through the input parameter "kfold" of the function fitensemble().
However, the output model trained by this function cannot be used to predict the labels of new data if I use the predict(model, Xtest). I checked the Matlab documents, which says we can use kfoldPredict() function to evaluate the trained model. But I did not find any input of the new data through this function. Also, I found the structure of the trained model with cross-validation is different from that model without cross-validation. So, could anyone please advise me how to use the model, which is trained with cross-validation, to predict labels of new data? Thanks!
kfoldPredict() needs a RegressionPartitionedModel or ClassificationPartitionedEnsemble object as input. This already contains the models and data for kfold cross validation.
The RegressionPartitionedModel object has a field Trained, in which the trained learners that are used for cross validation are stored.
You can take any of these learners and use it like predict(learner, Xdata).
Edit:
If k is too large, it is possible that there is too little meaningful data in one or more iteration, so the model for that iteration is less accurate.
There are no general rules for k, but k=10 like in the MATLAB default is a good starting point to play around with it.
Maybe this is also interesting for you: https://stats.stackexchange.com/questions/27730/choice-of-k-in-k-fold-cross-validation

How to see which Atribute (Feature) contribute most to the performance of the classification with PCA in Matlab?

I would like to perform classification on a small data set 65x9 using some of the Machine Learning Classification Methods (SVM, Decision Trees or any other).
So, before starting with the classification I would like to do attribute analyses with PCA in Matlab or Weka (preferred MatLab). I would like to obtain which Attribute contribute most to the performance of the classifier. So I can maybe reduce the number of some Attribute or/and include more in the future. Any example of PCA can find regarding this in MatLab or Weka?
Thanks
PCA is a unsupervised feature extraction method.
If your question is on selecting attributes to use with PCA, i don't know what your purpose is but it is unnecessary to do something like that to improve classification performance. Just use the whole attributes. PCA will give you best attributes in decreasing order for each instance.
If your question is on selecting attributes after PCA, you can chose a treshold (for example 0.95) and calculate #attributes enough for treshold beginning from the first attribute to last one. You can use the eigenvalues of covariance matrix to calculate and achive treshold in PCA.
After running PCA, we know that the first attribute is the best one, the second attribute is the best one after first etc...

Simple Sequential feature selection in Matlab

I have a 40X3249 noisy dataset and 40X1 resultset. I want to perform simple sequential feature selection on it, in Matlab. Matlab example is complicated and I can't follow it. Even a few examples on SoF didn't help. I want to use decision tree as classifier to perform feature selection. Can someone please explain in simple terms.
Also is it a problem that my dataset has very low number of observations compared to the number of features?
I am following this example: Sequential feature selection Matlab and I am getting error like this:
The pooled covariance matrix of TRAINING must be positive definite.
I've explained the error message you're getting in answers to your previous questions.
In general, it is a problem that you have many more variables than samples. This will prevent you using some techniques, such as the discriminant analysis you were attempting, but it's a problem anyway. The fact is that if you have that high a ratio of variables to samples, it is very likely that some combination of variables would perfectly classify your dataset even if they were all random numbers. That's true if you build a single decision tree model, and even more true if you are using a feature selection method to explicitly search through combinations of variables.
I would suggest you try some sort of dimensionality reduction method. If all of your variables are continuous, you could try PCA as suggested by #user1207217. Alternatively you could use a latent variable method for model-building, such as PLS (plsregress in MATLAB).
If you're still intent on using sequential feature selection with a decision tree on this dataset, then you should be able to modify the example in the question you linked to, replacing the call to classify with one to classregtree.
This error comes from the use of the classify function in that question, which is performing LDA. This error occurs when the data is rank deficient (or in other words, some features are almost exactly correlated). In order to overcome this, you should project the data down to a lower dimensional subspace. Principal component analysis can do this for you. See here for more details on how to use pca function within statistics toolbox of Matlab.
[basis, scores, ~] = pca(X); % Find the basis functions and their weighting, X is row vectors
indices = find(scores > eps(2*max(scores))); % This is to find irrelevant components up to machine precision of the biggest component .. with a litte extra tolerance (2x)
new_basis = basis(:, indices); % This gets us the relevant components, which are stored in variable "basis" as column vectors
X_new = X*new_basis; % inner products between the new basis functions spanning some subspace of the original, and the original feature vectors
This should get you automatic projections down into a relevant subspace. Note that your features won't have the same meaning as before, because they will be weighted combinations of the old features.
Extra note: If you don't want to change your feature representation, then instead of classify, you need to use something which works with rank deficient data. You could roll your own version of penalised discriminant analysis (which is quite simple), use support vector machines, or other classification functions which don't break with correlated features as LDA does (by virtue of requiring matrix inversion of the covariance estimate).
EDIT: P.S I haven't tested this, because I have rolled my own version of PCA in Matlab.