How to define the maximum k of the kNN classifier?

I am trying to use a kNN classifier to perform some supervised learning. In order to find the best value of k for kNN, I used cross-validation. For example, the following code loads one of Matlab's standard data sets and runs cross-validation to plot the cross-validation error against various values of k:
load ionosphere;
[N,D] = size(X)
resp = unique(Y)
rng(8000,'twister') % for reproducibility
K = round(logspace(0,log10(N),10)); % number of neighbors
cvloss = zeros(numel(K),1);
for k = 1:numel(K)
    knn = ClassificationKNN.fit(X, Y, ...
        'NumNeighbors', K(k), 'CrossVal', 'On');
    cvloss(k) = kfoldLoss(knn);   % 10-fold cross-validation error
end
figure; % Plot the cross-validation error versus k
plot(K, cvloss);
xlabel('Number of nearest neighbors');
ylabel('10-fold classification error');
title('k-NN classification');
The result looks like this (a plot of cross-validation error versus k):
The best k in this case is k=2 (it is not an exhaustive search). From the figure, we can see that the cross-validation error goes up dramatically for k>50. The error becomes large and stabilizes for k>100.
My question is what is the maximum k we should test in this kind of cross validation framework?
For example, there are two classes in the 'ionosphere' data. One class labeled as 'g' and one labeled as 'b'. There are 351 instances in total. For 'g' there are 225 cases and for 'b' there are 126 cases.
In the code above, the largest k tested is 351. But should we only test from 1 up to 126, or up to 225? Is there a relation between the number of cases and the maximum k? Thanks. A.

The best way to choose a parameter in a classification problem is to choose it using domain expertise. What you are doing is certainly not that. If your data set is small enough to run many classifications with different parameter values, you can do that, but to be convincing you need to show that the parameter you chose was not picked arbitrarily; you need to explain the behavior of the plot you drew.
In this case, the error curve is increasing in k, so you can argue that 2 is the best choice.
In most cases you will not choose k larger than about 20, but there is no proof of this; you need to keep experimenting until you can justify your choice.

You don't want k to be too large (i.e. too close to the number of examples), because then the k-neighborhood of each query example covers a large fraction of the space, so the prediction depends less and less on the actual location of the query and more on the overall statistics. This explains why the performance is not good for large k: your classifier essentially always chooses 'g', and gets it wrong 126/351 ≈ 36% of the time, as you see in the plot.
Theory suggests that k needs to grow as the number of labeled examples grows, but sub-linearly.
When you have lots of training data, you want k to be large because you want a good estimate of the likelihood that a point near the query point has each label. This allows you to imitate the maximum a posteriori (MAP) decision rule, which is optimal assuming you know the actual distribution.
So here are some practical tips:
Get more data if you can. Then run the experiment again.
Focus on small values of k. My bet is that k=3 is better than k=2. Usually for binary classification k is at least 3, and usually an odd number (to avoid ties).
The fact that k=2 appears to be the best does not really make sense: the only case in which k=1 and k=2 can differ is when the two nearest neighbors have different labels, and in that case the decision is made either randomly or arbitrarily (e.g. always choose 'g'), depending on the implementation of the kNN algorithm. My guess is that in the implementation you are using the tie-breaking is fixed, and that in case of a tie it chooses 'g', which just happens to be more likely overall. If you switch the roles of the labels you will probably see that k=1 is better than k=2.
It would be interesting to see the plot for small values of k (e.g. 1-20).
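For example, a minimal sketch of such a fine-grained sweep (assuming the ionosphere data from the question is still loaded and the Statistics and Machine Learning Toolbox is available; fitcknn is the newer name for ClassificationKNN.fit, and the 'BreakTies' option is shown only to illustrate how tie handling can be made explicit):
Ksmall = 1:20;                          % small neighborhood sizes only
cvlossSmall = zeros(numel(Ksmall), 1);
for k = 1:numel(Ksmall)
    mdl = fitcknn(X, Y, 'NumNeighbors', Ksmall(k), ...
        'BreakTies', 'nearest', 'CrossVal', 'On');  % break ties by the nearest point
    cvlossSmall(k) = kfoldLoss(mdl);
end
figure;
plot(Ksmall, cvlossSmall, '-o');
xlabel('Number of nearest neighbors');
ylabel('10-fold classification error');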
References:
nearest neighbor classification

Increasing the number of neighbors taken into account during classification eventually turns your classifier into a majority-vote over essentially the whole data set. You only need to check the ratio of your classes to see that it equals the error rate at large k.
Since you are using cross-validation, the k that corresponds to the minimum of your error rate is the one you should select. In this case it is 3, if I am not mistaken.
Keep in mind that the number of cross-validation folds introduces bias into your selection of k. A more elaborate analysis would be needed there, but your 10 folds should be fine for this case.
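As a quick sanity check of the class-ratio claim above, here is a minimal sketch (assuming the ionosphere data from the question is loaded):
nB = sum(strcmp(Y, 'b'));        % 126 minority-class cases
minorityFrac = nB / numel(Y);    % 126/351, roughly 0.36
% For very large k the classifier essentially always predicts the majority
% class 'g', so its cross-validation error approaches minorityFrac,
% which matches the plateau seen in the plot.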

Related

Restricting output classes in multi-class classification in Tensorflow

I am building a bidirectional LSTM to do multi-class sentence classification.
I have 13 classes in total to choose from, and I am multiplying the output of my LSTM network by a matrix whose dimensionality is [2*num_hidden_unit, num_classes], then applying softmax to get the probability of the sentence falling into one of the 13 classes.
So if we consider output[-1] as the network output:
W_output = tf.Variable(tf.truncated_normal([2*num_hidden_unit,num_classes]))
result = tf.matmul(output[-1],W_output) + bias
and I get my [1, 13] matrix (assuming I am not working with batches for the moment).
Now, I also have information that a given sentence does not fall into a given class for sure and I want to restrict the number of classes considered for a given sentence. So let's say for instance that for a given sentence, I know it can fall only in 6 classes so the output should really be a matrix of dimensionality [1,6].
One option I was thinking of is to put a mask over the result matrix, where I multiply the rows corresponding to the classes that I want to keep by 1 and the ones I want to discard by 0, but in this way I will just lose some of the information instead of redirecting it.
Anyone has a clue on what to do in this case?
I think your best bet is, as you seem to have described, to use a weighted cross-entropy loss function where the weights for your "impossible" classes are 0 and the weights for the possible classes are 1. TensorFlow has a weighted cross-entropy loss function.
Another interesting but probably less effective method is to feed whatever information you have about which classes the sentence can or cannot fall into to the network at some point (probably towards the end).

dfittool results interpretation

Does anyone know how to tell the difference between distributions (ie their goodness of fit) using the dfittool in Matlab? In a class I took forever ago, we learned about the log likelihood parameter and how to compare a pdf fitted to Gaussian vs gamma, etc. But right now, all the matlab help files online are like "it means something." Any assistance would be appreciated. Basically, I need to interpret the "results" in "edit fit" of the dfittool. I want to be able to compare my dfits to each other from the results, so I can pick the best fit for my analysis. I don't know what the difference is between a log likelihood of -111 vs -105.
Example below:
Distribution: Normal
Log likelihood: -110.954
Domain: -Inf < y < Inf
Mean: 101.443
Variance: 436.332
Parameter   Estimate   Std. Err.
mu          101.443    4.17771
sigma       20.8886    3.04691
Estimated covariance of parameter estimates:
            mu            sigma
mu          17.4533       6.59643e-15
sigma       6.59643e-15   9.28366
Thank you!
(Log) likelihood is a measure of the fit of a distribution to data, so the simple answer is: the distribution with the largest likelihood is the one that fits best. However, what you get here as output is the maximized likelihood, i.e. the likelihood at the parameter values where it is maximal. Different families of distributions can be differently "flexible", so that it is generally easier to obtain a larger likelihood with one of them; this limits comparability. This holds especially if you compare families with different numbers of parameters. A fix for this is formal model comparison, e.g. using the Bayes factor (which is considerably more complex mathematically) or its approximation, the Bayesian information criterion (BIC).
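As a rough illustration, assuming your data vector is called y and the Statistics and Machine Learning Toolbox is available, the maximized log likelihoods and BIC values of two candidate families can be compared like this (a sketch, not a substitute for formal model comparison):
pdNormal = fitdist(y, 'Normal');
pdGamma  = fitdist(y, 'Gamma');         % only valid if y is strictly positive
nllNormal = negloglik(pdNormal);        % negative log likelihood at the MLE
nllGamma  = negloglik(pdGamma);
n = numel(y);
bicNormal = 2*nllNormal + 2*log(n);     % BIC = 2*NLL + p*log(n); p = 2 parameters here
bicGamma  = 2*nllGamma  + 2*log(n);
% The larger log likelihood (smaller NLL) fits better; the smaller BIC is
% preferred once the number of parameters is taken into account.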
More generally speaking, however, it is seldom a good idea to just randomly pick distributions and see how well they fit. It is better to have an at least partially theoretically motivated idea of why a distribution is a candidate. At the most basic level this means considering its definition range (support): the normal distribution is defined on the whole real line, the gamma distribution only for nonnegative real numbers. This alone may let you rule one of them out based on basic properties of your data.

Using cross-validation to find the right value of k for the k-nearest-neighbor classifier

I am working on a UCI data set about wine quality. I have applied multiple classifiers and k-nearest neighbor is one of them. I was wondering if there is a way to find the exact value of k for nearest neighbor using 5-fold cross validation. And if yes, how do I apply that? And how can I get the depth of a decision tree using 5-fold CV?
Thanks!
I assume here that you mean the value of k that returns the lowest error in your wine quality model.
I find that a good k depends on your data. Sparse data might prefer a lower k, whereas larger datasets might work well with a larger k. In most of my work, a k between 5 and 10 has been quite good for problems with a large number of cases.
Trial and error can at times be the best tool here, but it shouldn't take too long to see a trend in the modelling error.
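A minimal sketch of what that search could look like in MATLAB (assuming predictors X and labels Y from your wine-quality data, and the Statistics and Machine Learning Toolbox):
kGrid = 1:2:25;                         % odd k values avoid ties
cvErr = zeros(numel(kGrid), 1);
for i = 1:numel(kGrid)
    mdl = fitcknn(X, Y, 'NumNeighbors', kGrid(i), 'KFold', 5);
    cvErr(i) = kfoldLoss(mdl);          % 5-fold cross-validation error
end
[~, best] = min(cvErr);
bestK = kGrid(best);

% The same idea works for tree depth: limit the tree via 'MaxNumSplits'
% and keep the setting with the lowest 5-fold loss.
splitGrid = 2.^(1:8);                   % candidate split budgets
treeErr = arrayfun(@(s) kfoldLoss(fitctree(X, Y, 'MaxNumSplits', s, 'KFold', 5)), splitGrid);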
Hope this Helps!

K nearest neighbour validation performance

I am using kNN to do classification for a telecom problem. I split my data into 70% training and 30% validation. While the kNN classifier is able to capture over 80% in the top 2 deciles on the training set, its performance on the validation sample is no better than the random 45-degree line. I am surprised that the model's performance in training and validation is so different.
Any pointers ?
Reasonable pointers are hardly possible without more details. The behavior of your KNN depends on several aspects:
The parameter K defining the number of neighbors. If it is set to K=1, for example, you will get no training error at all, which shows that comparing training error to validation error may not be meaningful (a small sketch follows these points).
The parameter K is often chosen using cross-validation. I would suggest you do this as well.
The distance metric. Which distance function are you using? Are there different units, length scales, etc.?
The noise in your data, the size of your data ... -- there simply exist data sets which are hard to describe.
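To make the first point concrete, here is a minimal sketch (MATLAB, with X and Y standing for your telecom features and labels, Statistics and Machine Learning Toolbox assumed) comparing the optimistic training error with an honest cross-validated estimate:
mdl1 = fitcknn(X, Y, 'NumNeighbors', 1);
trainErr = resubLoss(mdl1);                        % close to 0 by construction for K = 1
cvErr    = kfoldLoss(crossval(mdl1, 'KFold', 5));  % honest estimate, usually much higher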
By the way: can you tell what kind of data you want to describe, and, if possible, also provide some examples or show some scatter plot (data and your result)?

One Class SVM using LibSVM in Matlab - Conceptual

Perhaps this is an easy question, but I want to make sure I understand the conceptual basis of the LibSVM implementation of one-class SVMs and if what I am doing is permissible.
I am using one class SVMs in this case for outlier detection and removal. This is used in the context of a greater time series prediction model as a data preprocessing step. That said, I have a Y vector (which is the quantity we are trying to predict and is continuous, not class labels) and an X matrix (continuous features used to predict). Since I want to detect outliers in the data early in the preprocessing step, I have yet to normalize or lag the X matrix for use in prediction, or for that matter detrend/remove noise/or otherwise process the Y vector (which is already scaled to within [-1,1]). My main question is whether it is correct to model the one class SVM like so (using libSVM):
svmod = svmtrain(ones(size(Y,1),1),Y,'-s 2 -t 2 -g 0.00001 -n 0.01');
[od,~,~] = svmpredict(ones(size(Y,1),1),Y,svmod);
The resulting model does yield performance somewhat in line with what I would expect (99% or so prediction accuracy, meaning 1% of the observations are outliers). The reason I ask is that in other questions regarding one-class SVMs, people appear to use their X matrices where I use Y. Thanks for your help.
What you are doing here is nothing more than a fancy range check. If you are not willing to use X to find outliers in Y (even though you really should), it would be a lot simpler and better to just check the distribution of Y to find outliers instead of this improvised SVM solution (for example remove the upper and lower 0.5-percentiles from Y).
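A minimal sketch of that simpler range check (assuming Y is the response vector from the question and the Statistics and Machine Learning Toolbox is available for prctile):
lo = prctile(Y, 0.5);               % 0.5th percentile
hi = prctile(Y, 99.5);              % 99.5th percentile
isOutlier = (Y < lo) | (Y > hi);    % flags roughly 1% of the observations
Yclean = Y(~isOutlier);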
In reality, this is probably not even close to what you really want to do. With this setup you are rejecting Y values as outliers without considering any context (e.g. X). Why are you using an RBF kernel, and how did you come up with that specific value for gamma? A kernel is total overkill for one-dimensional data.
Secondly, you are training and testing on the same data (Y). A kitten dies every time this happens. One-class SVM attempts to build a model which recognizes the training data, it should not be used on the same data it was built with. Please, think of the kittens.
Additionally, note that the nu parameter of one-class SVM controls the amount of outliers the classifier will accept. This is explained in the LIBSVM implementation document (page 4): "It is proved that nu is an upper bound on the fraction of training errors and a lower bound of the fraction of support vectors." In other words: your training options specifically state that up to 1% of the data can be rejected. For one-class SVM, replace "can" by "should".
So when you say that the resulting model "does yield performance somewhat in line with what I would expect" ... of course it does, by definition. Since you have set nu=0.01, 1% of the data is rejected by the model and thus flagged as outliers.
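A small sketch of that point, using the same libSVM MATLAB interface as in the question (the nu values are arbitrary illustrations): the fraction flagged as outliers tracks nu almost one-to-one.
for nu = [0.005 0.01 0.05]
    mdl  = svmtrain(ones(size(Y,1),1), Y, sprintf('-s 2 -t 2 -g 0.00001 -n %g', nu));
    pred = svmpredict(ones(size(Y,1),1), Y, mdl);
    % In one-class SVM, libSVM labels inliers +1 and outliers -1.
    fprintf('nu = %.3f -> %.1f%% flagged as outliers\n', nu, 100*mean(pred == -1));
end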