Using cross-validation to find the right value of k for the k-nearest-neighbor classifier - ipython

I am working on a UCI data set about wine quality. I have applied multiple classifiers and k-nearest neighbor is one of them. I was wondering if there is a way to find the exact value of k for nearest neighbor using 5-fold cross validation. And if yes, how do I apply that? And how can I get the depth of a decision tree using 5-fold CV?
Thanks!

I assume here that you mean the value of k that returns the lowest error in your wine quality model.
I find that a good k depends on your data. Sparse data might prefer a lower k, whereas larger datasets might work well with a larger k. In most of my work, a k between 5 and 10 has been quite good for problems with a large number of cases.
Trial and error can at times be the best tool here, and it shouldn't take too long to see a trend in the modelling error.
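The trial-and-error search described above can be sketched as a small 5-fold cross-validation loop. This is a dependency-free illustration, not the asker's setup: the data is a synthetic two-class stand-in for the wine data, and the names knn_predict, cv_error and candidates are made up for the example. In scikit-learn, GridSearchCV with cv=5 over the n_neighbors parameter of KNeighborsClassifier (or max_depth of DecisionTreeClassifier) performs the same kind of search.

```python
import random
from collections import Counter


def knn_predict(train_X, train_y, query, k):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Squared Euclidean distance gives the same neighbor ordering as Euclidean distance.
    order = sorted(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], query)),
    )
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def cv_error(X, y, k, n_folds=5, seed=0):
    """Mean misclassification rate of k-NN over n_folds cross-validation folds."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::n_folds] for f in range(n_folds)]
    fold_errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        train_X = [X[i] for i in train]
        train_y = [y[i] for i in train]
        wrong = sum(knn_predict(train_X, train_y, X[i], k) != y[i] for i in held_out)
        fold_errors.append(wrong / len(held_out))
    return sum(fold_errors) / n_folds


# Hypothetical stand-in data: two overlapping Gaussian classes in 2-D.
rng = random.Random(42)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(60)]
X += [[rng.gauss(1.5, 1), rng.gauss(1.5, 1)] for _ in range(60)]
y = [0] * 60 + [1] * 60

# Odd values of k avoid tied votes in a two-class problem.
candidates = [1, 3, 5, 7, 9, 11, 15]
errors = {k: cv_error(X, y, k) for k in candidates}
best_k = min(errors, key=errors.get)
print(errors)
print("best k:", best_k)
```

Every candidate is scored on the same folds (fixed seed), so the errors are directly comparable. If several values of k score almost the same, the trend across neighboring values matters more than the single minimum.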
Hope this Helps!

Related

K nearest neighbour validation performance

I am using kNN for classification on a telecom problem. I split my data into 70% training and 30% validation. While the kNN classifier captures over 80% in the top 2 deciles on the training data, its performance on the validation sample is no better than the random 45-degree line. I am surprised that kNN's performance on the training and validation data can be so different.
Any pointers?
Reasonable pointers are hardly possible without more details. The behavior of your kNN depends on several aspects:
The parameter K defining the number of neighbors. If it is set to K=1, for example, every training point is its own nearest neighbor, so you get no training error at all. Comparing training error with validation error is therefore not very informative.
The parameter K is often found using cross-validation. I suggest you do this as well.
The distance metric. Which function are you using? Do your features have different units or length scales?
The noise of your data, the size of your data ... -- there simply exist data sets which are hard to describe.
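The K=1 point can be checked with a minimal sketch. The data below is hypothetical: the labels are pure noise, chosen so that nothing can be learned, and the helper names nearest_label and error_rate are made up for the example. The 70/30 split mirrors the one in the question.

```python
import random


def nearest_label(train_X, train_y, query):
    """1-NN: return the label of the single closest training point."""
    best = min(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], query)),
    )
    return train_y[best]


def error_rate(train_X, train_y, test_X, test_y):
    """Fraction of test points whose 1-NN prediction differs from the true label."""
    wrong = sum(nearest_label(train_X, train_y, x) != t for x, t in zip(test_X, test_y))
    return wrong / len(test_X)


# Hypothetical data: labels are random, so no classifier can beat chance.
rng = random.Random(0)
X = [[rng.random(), rng.random()] for _ in range(200)]
y = [rng.randint(0, 1) for _ in range(200)]

split = 140  # 70% training, 30% validation
train_X, train_y = X[:split], y[:split]
valid_X, valid_y = X[split:], y[split:]

train_error = error_rate(train_X, train_y, train_X, train_y)
valid_error = error_rate(train_X, train_y, valid_X, valid_y)
print("training error:", train_error)    # zero: each point is its own nearest neighbor
print("validation error:", valid_error)  # expected to sit near chance for noise labels
```

A perfect training score with K=1 therefore says nothing about the model. Only the held-out error is meaningful.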
By the way: can you tell what kind of data you want to describe, and, if possible, also provide some examples or show some scatter plot (data and your result)?

How to define the maximum k of the kNN classifier?

I am trying to use a kNN classifier to perform some supervised learning. To find the best 'k' for kNN, I used cross-validation. For example, the following code loads a standard MATLAB data set and runs cross-validation to plot the cross-validation error for various values of k:
load ionosphere;
[N,D] = size(X)
resp = unique(Y)
rng(8000,'twister') % for reproducibility
K = round(logspace(0,log10(N),10)); % number of neighbors
cvloss = zeros(numel(K),1);
for k=1:numel(K)
    knn = ClassificationKNN.fit(X,Y,...
        'NumNeighbors',K(k),'CrossVal','On');
    cvloss(k) = kfoldLoss(knn);
end
figure; % Plot the cross-validation error versus k
plot(K,cvloss);
xlabel('Number of nearest neighbors');
ylabel('10 fold classification error');
title('k-NN classification');
The result looks like this (figure not shown: 10-fold classification error plotted against the number of nearest neighbors).
The best k in this case is k=2 (this is not an exhaustive search). From the figure, we can see that the cross-validation error goes up dramatically for k>50. It reaches a large error and becomes stable for k>100.
My question is what is the maximum k we should test in this kind of cross validation framework?
For example, there are two classes in the 'ionosphere' data, one labeled 'g' and one labeled 'b'. There are 351 instances in total: 225 'g' cases and 126 'b' cases.
In the code above, the largest k tested is 351. But should we only test from 1 to 126, or up to 225? Is there a relation between the number of cases and the maximum k? Thanks. A.
The best way to choose a parameter in a classification problem is to choose it using domain expertise, which is not what you are doing here. If your data set is small enough to run many classifications with different parameter values, you can do that, but to be rigorous you need to show that the parameter you chose is not an accident: you need to explain the behavior of the plot you have drawn.
In this case, the error curve is increasing, so you can say that 2 is the best choice.
In most cases you will not choose K larger than 20, but there is no proof of that, and you need to keep running the classification until you can justify your choice.
You don't want k to be too large (i.e. too close to the number of examples), because then the k-neighborhood of each query example covers a large fraction of the space, so the prediction depends less and less on the actual location of the query and more on the overall class statistics. This explains why the performance is poor for large k. Your classifier essentially always chooses 'g', and gets it wrong 126/351 ≈ 36% of the time, as you see in the plot.
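The plateau for large k can be reproduced from the class counts alone. This is a small sketch that uses only the counts given in the question (225 'g', 126 'b') and does not load the ionosphere data itself.

```python
from collections import Counter

# Class counts stated in the question for the ionosphere data.
labels = ["g"] * 225 + ["b"] * 126

counts = Counter(labels)
majority_label, majority_count = counts.most_common(1)[0]

# When k equals the number of examples, every query sees the same votes,
# so the prediction is always the majority label.
baseline_error = 1 - majority_count / len(labels)
print(majority_label, round(baseline_error, 3))  # g 0.359
```

Any k whose cross-validation error is close to this baseline is doing no better than ignoring the features entirely.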
Theory suggests that k needs to grow as the number of labeled examples grows, but sub-linearly.
When you have lots of training data, you want k to be large because you want a good estimate of the probability of each label for points near the query point. This lets the classifier approximate the maximum a posteriori decision rule (which is optimal, assuming you know the actual distribution).
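One standard way to write the growth condition, given here as background rather than taken from the answer: with n labeled examples and k_n neighbors, the classical consistency result for k-NN asks that k grow without bound, but more slowly than n.

```latex
k_n \to \infty
\quad\text{and}\quad
\frac{k_n}{n} \to 0
\quad\text{as } n \to \infty
```

For instance, k_n = \sqrt{n} satisfies both conditions, whereas a fixed fraction such as k_n = n/2 does not.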
So here are some practical tips:
Get more data if you can. Then run the experiment again.
Focus on small values of k. My bet is that k=3 is better than k=2. Usually for binary classification k is at least 3, and usually an odd number (to avoid ties).
The fact that k=2 comes out better than k=1 is suspicious. The only case in which k=1 gives a different prediction than k=2 is when the 2 nearest neighbors have different labels. However, in that case the decision is made either randomly or arbitrarily (e.g. always choose 'g'), depending on the implementation of the kNN algorithm. My guess is that in the algorithm you are using the decision is fixed, and that in case of a tie it chooses 'g', which just happens to be more likely overall. If you switch the roles of the labels you will probably see that k=1 is better than k=2.
It would be interesting to see the plot for small values of k (e.g. 1 to 20).
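The tie-breaking argument in the tips above can be made concrete with a short sketch. The fixed tie-break toward 'g' is the answer's guess about the implementation, not a documented behavior, and knn_vote is a made-up helper for illustration.

```python
from collections import Counter


def knn_vote(neighbor_labels, tie_break):
    """Majority vote over neighbor labels; a tie goes to the fixed `tie_break` label."""
    counts = Counter(neighbor_labels).most_common()
    top = counts[0][1]
    winners = [label for label, c in counts if c == top]
    return winners[0] if len(winners) == 1 else tie_break


# Labels of the nearest and second-nearest neighbors of four hypothetical queries.
neighbors = [("g", "g"), ("b", "b"), ("g", "b"), ("b", "g")]

for first, second in neighbors:
    k1 = knn_vote([first], tie_break="g")
    k2 = knn_vote([first, second], tie_break="g")
    print(first, second, "-> k=1:", k1, " k=2:", k2)
```

The two rules disagree only on the last query, where the nearest neighbor is 'b' and the second is 'g'. There k=2 falls back to 'g', so it gains accuracy only because 'g' is the more common class.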
Increasing the number of neighbors taken into account during classification pushes your classifier towards always predicting the majority class. You only need to check the ratio of your classes to see that the minority fraction equals the error rate at large k.
Since you are using cross-validation, the k that corresponds to the minimum of your error rate is the value you should select. In this case it is 3, if I am not mistaken.
Keep in mind that the number of cross-validation folds introduces bias in your selection of k. A more elaborate analysis is needed there, but your 10 folds should be fine for this case.

Choosing k for KNN in Matlab

I'm currently using Matlab's k nearest neighbors classifier (knnclassify) to train and test on binary attributes. The default value for k, if none is provided, is 1, and one can choose other values of k. I've done research online and on Stack Overflow, but nothing relevant came up to tell me which value of k would be best. Is there a built-in function that can tell me that for my particular data, or is it simply a matter of guessing and seeing what accuracy results? Any help will be greatly appreciated.
Here is the link to matlab's knnclassify documentation: knnclassify
What you have here is a typical model selection problem. What you want is to pick the k that gives you the lowest overall error on held-out data. Larger values of k give smoother decision boundaries that often generalize better, while small values tend to overfit; very large values underfit.
Hence, cross-validation is a good way to choose this parameter, and I found this article, which seems to describe a reasonable method.
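For binary attributes, one possible variant is leave-one-out cross-validation with Hamming distance. The sketch below is written in Python rather than MATLAB and does not use knnclassify; the data is synthetic, and the names hamming and loo_error are made up for illustration.

```python
import random
from collections import Counter


def hamming(a, b):
    """Number of positions at which two binary vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def loo_error(X, y, k):
    """Leave-one-out error: classify each point using all the other points."""
    wrong = 0
    for i, query in enumerate(X):
        others = [j for j in range(len(X)) if j != i]
        others.sort(key=lambda j: hamming(X[j], query))
        votes = Counter(y[j] for j in others[:k])
        wrong += votes.most_common(1)[0][0] != y[i]
    return wrong / len(X)


# Hypothetical data: 4 attributes that usually agree with the label, 4 pure-noise attributes.
rng = random.Random(1)
X, y = [], []
for _ in range(80):
    label = rng.randint(0, 1)
    informative = [label if rng.random() < 0.85 else 1 - label for _ in range(4)]
    noise = [rng.randint(0, 1) for _ in range(4)]
    X.append(informative + noise)
    y.append(label)

errors = {k: loo_error(X, y, k) for k in (1, 3, 5, 7, 9)}
best_k = min(errors, key=errors.get)
print(errors)
print("best k:", best_k)
```

Leave-one-out uses almost all of the data for every prediction, which helps on small data sets, at the cost of one classification per sample for each candidate k.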

Matlab: K-means clustering with predefined populations

I am trying to differentiate two populations. Each population is an NxM matrix in which N is the same for both and M varies in length (N=column-specific attributes of each run, M=run number). I have looked at PCA and K-means for differentiating the two, but I am curious about the best practice.
To my knowledge, in K-means, there is no initial 'calibration' in which the clusters are chosen such that known bimodal populations can be differentiated. It simply minimizes the distance and assigns the data to an arbitrary number of populations. I would like to tell the clustering algorithm that I want the best fit in which the two populations are separated. I can then use the fit I get from the initial clustering on future datasets. Any help, example code, or reading material would be appreciated.
-R
K-means and PCA are typically used in unsupervised learning problems, i.e. problems where you have a single batch of data and want to find some easier way to describe it. In principle, you could run K-means (with K=2) on your data, and then evaluate the degree to which your two classes of data match up with the data clusters found by this algorithm (note: you may want multiple starts).
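The procedure just described (K-means with K=2, several starts, then a comparison against the known populations) might look like the following sketch. It is plain Python rather than MATLAB, the two populations are synthetic Gaussians, and kmeans and sq_dist are illustrative helpers rather than a library API.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, rng, n_iter=100):
    """Lloyd's algorithm from one random start; returns (assignments, total squared distance)."""
    centers = rng.sample(points, k)
    for _ in range(n_iter):
        assign = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        new_centers = []
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            # Keep the old center if a cluster happens to become empty.
            new_centers.append(
                [sum(col) / len(members) for col in zip(*members)] if members else centers[c]
            )
        if new_centers == centers:
            break
        centers = new_centers
    # Recompute assignments so they always match the final centers.
    assign = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
    cost = sum(sq_dist(p, centers[a]) for p, a in zip(points, assign))
    return assign, cost


# Hypothetical populations: two 2-D Gaussian clouds with known membership.
rng = random.Random(7)
pop_a = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(50)]
pop_b = [[rng.gauss(4, 1), rng.gauss(4, 1)] for _ in range(50)]
points = pop_a + pop_b
known = [0] * 50 + [1] * 50

# Multiple starts: keep the run with the lowest within-cluster cost.
runs = [kmeans(points, 2, rng) for _ in range(10)]
assign, cost = min(runs, key=lambda r: r[1])

# Cluster ids are arbitrary, so score both possible labelings.
match = sum(a == t for a, t in zip(assign, known)) / len(known)
agreement = max(match, 1 - match)
print("agreement with known populations:", agreement)
```

High agreement suggests the two populations are separable without labels. Low agreement is a sign that a supervised method is the better fit.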
It sounds like you have a supervised learning problem: you have a training data set which has already been partitioned into two classes. In this case k-nearest neighbors (as mentioned by @amas) is probably the approach most like k-means; however, Support Vector Machines can also be an attractive approach.
I frequently refer to The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition (Springer Series in Statistics) by Trevor Hastie (Author), Robert Tibshirani (Author), Jerome Friedman (Author).
It really depends on the data. But just so you know, K-means does get stuck in local minima, so if you want to use it, try running it from different random starting points. PCA might also be useful; however, as with any spectral clustering method, you have much less control over the clustering procedure. I recommend that you cluster the data using k-means with multiple random starting points and see how it works. Then you can predict the cluster of each new sample with K-NN (I don't know whether that is useful in your case).
Check Lazy learners and K-NN for prediction.

How to find out which data sets destroy my data analysis using MATLAB?

I have 200 samples, each of which has 60 features. I use PCA to find the principal components. I use a neural network and have also tried k nearest neighbor. However, the classification results are not good. I don't mind taking out some samples, but how can I tell which samples destroy my classification results? I know I can try them one by one, but that would be very inefficient. Please help.
Instead of throwing out some samples you need to throw out some attributes.
PCA computes a matrix with d x d entries. At 60 attributes, this matrix has 3600 entries. You have only 200 samples to compute the contents of this matrix - no wonder that the result is pretty much random. You need fewer variables and more data.
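The arithmetic behind this can be spelled out. The only inputs are the 60 attributes and 200 samples from the question; the count of distinct entries relies on the covariance matrix being symmetric, a standard fact that the answer does not state.

```python
d = 60   # attributes, from the question
n = 200  # samples, from the question

entries = d * d                 # all entries of the d x d covariance matrix
free_params = d * (d + 1) // 2  # distinct entries, since the matrix is symmetric
print(entries, free_params, round(n / free_params, 2))  # 3600 1830 0.11
```

Even counting only the distinct entries, there is roughly one sample for every nine quantities to estimate.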
This is a classical machine learning problem. There is always a risk with such a high number of features (in your case 60) and only 200 samples. Please check whether you have features which are redundant. Let me give an example.
Imagine we have to predict housing prices from the following features:
1. Size in m²
2. Number of bedrooms
3. House age
4. Size in ft²
Please note that features 1 and 4 give the same information, so they are redundant. At first this does not look very harmful, but if you have data like that, it is better to remove one feature from each redundant pair.
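One simple way to spot this kind of redundancy is to look for feature pairs whose correlation is essentially ±1. The sketch below uses made-up housing values and a hand-written pearson helper. The 0.999 threshold is an arbitrary choice for the example, and correlation only catches linear redundancy.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical housing data for six houses.
features = {
    "size_m2": [50.0, 80.0, 120.0, 65.0, 200.0, 95.0],
    "bedrooms": [1, 3, 3, 2, 5, 2],
    "age_years": [30, 5, 12, 48, 3, 20],
}
# The same size expressed in square feet (1 m^2 is about 10.7639 ft^2).
features["size_ft2"] = [s * 10.7639 for s in features["size_m2"]]

names = list(features)
redundant = [
    (a, b)
    for i, a in enumerate(names)
    for b in names[i + 1:]
    if abs(pearson(features[a], features[b])) > 0.999
]
print(redundant)  # [('size_m2', 'size_ft2')]
```

Size and number of bedrooms are also correlated in this toy data, but not perfectly, so only the unit-converted duplicate is flagged.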
Therefore, I would recommend that you look first at your features and then at your data. For more details, have a look at the Machine Learning class (by Prof. Ng) from Stanford, available on Coursera.