Why the NMI value is small while having higher clustering accuracy and Rand index in clustering - matlab

I am using https://www.mathworks.com/matlabcentral/fileexchange/32197-clustering-results-measurement to evaluate my clustering in MATLAB. It provides accuracy and rand_index, and both look as expected. However, when I use NMI as a metric, the score is extremely low. For NMI I am using this source code (https://www.mathworks.com/matlabcentral/fileexchange/29047-normalized-mutual-information).
I have two Nx1 vectors as inputs: one holds the actual labels and the other holds the cluster assignments. I checked every element of both and found that even with 82% accuracy, the NMI is only 0.3209. Below is an example on the Iris dataset (https://archive.ics.uci.edu/ml/datasets/iris) with MATLAB's built-in k-means.
data = iris(:,1:data_dim);
k = 3;
[result_label,centroid] = kmeans(data,k,'MaxIter',10000);
actual_label = iris(:,end);
NMI = nmi(actual_label,result_label);
[Acc,rand_index,match] = AccMeasure(actual_label',result_label');
The result:
Auto ACC: 0.820000
Rand_Index: 0.701818
NMI: 0.320912

The Rand index is not adjusted for chance: even two random clusterings agree on a large share of the point pairs, and that baseline moves towards 1 as the number of clusters grows, so you should never really expect to see small values of the Rand index.
At the same time, Accuracy can be high when all of your points fall into the same large cluster.
I have a feeling that the NMI is producing a more reliable comparison. To verify, try running a dimensionality reduction and plot the data points colored according to each of the two clusterings. Visual statistics are often the best way to develop an intuition about data.
If you want to explore more, a convenient Python package for clustering comparisons is CluSim.
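To see what the NMI figure measures, here is a minimal sketch that computes it from the contingency table of two label vectors. It assumes the normalization by the geometric mean of the two entropies; the File Exchange nmi function may normalize differently, and the two label vectors are made up for illustration.

```matlab
% Hypothetical label vectors: same grouping, different cluster numbers.
actual = [1 1 1 2 2 2 3 3 3]';
result = [2 2 2 3 3 3 1 1 1]';

% Map arbitrary label values to 1..K and build the contingency table.
[~, ~, ia] = unique(actual);
[~, ~, ib] = unique(result);
n = numel(ia);
C = accumarray([ia ib], 1);      % C(i,j) = points in class i and cluster j

Pab = C / n;                     % joint distribution of the two labelings
Pa  = sum(Pab, 2);               % marginal of the actual labels (column)
Pb  = sum(Pab, 1);               % marginal of the cluster labels (row)
E   = Pa * Pb;                   % joint distribution under independence

nz = Pab > 0;                    % skip empty cells to avoid log(0)
MI = sum(Pab(nz) .* log(Pab(nz) ./ E(nz)));
Ha = -sum(Pa .* log(Pa));        % entropy of the actual labels
Hb = -sum(Pb .* log(Pb));        % entropy of the cluster labels

% The ratio is undefined when either labeling has a single group (zero entropy).
nmi_value = MI / sqrt(Ha * Hb);
fprintf('NMI: %f\n', nmi_value);
```

Because the labels are only used through the contingency table, renumbering the clusters does not change the value, so NMI needs no matching step between cluster numbers and class labels.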


The performance of k-means evaluated by different metrics

I am trying to evaluate the clusters generated by K-means with different metrics, but I am not sure about whether the results are good or not.
I have 40 documents to cluster in 6 categories.
I first converted them into tf-idf vectors, then I clustered them by K-means (k = 6). Finally, I tried to evaluate the results by different metrics.
Because I have the real labels of the documents, I tried to calculate the F1 score and accuracy. But I also want to know the performance for the metrics that do not need real labels such as silhouette score.
For F1 score and accuracy, the results are about 0.65 and 0.88 respectively, while for the silhouette score, it is only about 0.05, which means I may have overlapping clusters.
In this case, can I say that the results are acceptable? Or should I handle the overlapping issue by trying other methods instead of tf-idf to represent the documents or other algorithms to cluster?
With such tiny data sets, you really need to use a measure that is adjusted for chance.
Do the following: label each document randomly with an integer 1..6.
What F1 score do you get? Now repeat this 100 times. What is the best result you get? A completely random result can score pretty well on such tiny data!
Because of this problem, the standard measure used in clustering is the adjusted Rand index (ARI). A similar adjustment also exists for NMI: adjusted mutual information (AMI). But AMI is much less common.
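The random-labeling experiment can be sketched as follows. This is only an illustration under stated assumptions: the ground-truth labels are randomly generated stand-ins for the 40 real document labels, and the Rand index and ARI are computed directly from the contingency table rather than with any library function.

```matlab
rng(0);                                   % fixed seed so the run is repeatable
n = 40; k = 6;
truth = randi(k, n, 1);                   % hypothetical ground-truth labels
comb2 = @(x) x .* (x - 1) / 2;            % number of pairs among x items

nTrials = 100;
ri  = zeros(nTrials, 1);                  % plain Rand index per trial
ari = zeros(nTrials, 1);                  % adjusted Rand index per trial
for t = 1:nTrials
    guess = randi(k, n, 1);               % completely random "clustering"
    C = accumarray([truth guess], 1, [k k]);
    sumC  = sum(comb2(C(:)));             % pairs grouped together in both
    sumA  = sum(comb2(sum(C, 2)));        % pairs grouped together in truth
    sumB  = sum(comb2(sum(C, 1)));        % pairs grouped together in guess
    total = comb2(n);
    ri(t) = (total + 2 * sumC - sumA - sumB) / total;
    expected = sumA * sumB / total;       % chance level of sumC
    ari(t) = (sumC - expected) / (0.5 * (sumA + sumB) - expected);
end
fprintf('Rand index: mean %.3f, best %.3f\n', mean(ri), max(ri));
fprintf('ARI:        mean %.3f, best %.3f\n', mean(ari), max(ari));
```

The point of the comparison is that the unadjusted score sits well above zero for random labels, while the adjusted score is centered near zero, so only the adjusted one tells you how far a result is from chance.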

Selecting the K value for Kmeans clustering [duplicate]

I am going to build a K-means clustering model for outlier detection. For that, I need to identify the best number of clusters.
For now, I have tried to do this using the Elbow Method. I plotted the sum of squared errors vs. the number of clusters (k), but I got a graph like the one below, which makes it hard to identify the elbow point.
I need to know why I get a graph like this and how to identify the optimal number of clusters.
K-means is not suitable for outlier detection. This keeps popping up here all the time.
K-means is conceptualized for "pure" data, with no false points. All measurements are supposed to come from the data, and only vary by some Gaussian measurement error. Occasionally this may yield some more extreme values, but even these are real measurements, from the real clusters, and should be explained not removed.
K-means itself is known not to work well on noisy data, where some data points do not belong to any cluster:
It tends to split large real clusters in two, and then points right in the middle of the real cluster will have a large distance to the k-means centers.
It tends to put outliers into their own clusters (because that reduces SSQ), and then the actual outliers will have a small distance, even 0.
Instead, use an actual outlier detection algorithm such as Local Outlier Factor, kNN, or LoOP, which were designed with noisy data in mind.
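As one illustration of the kNN approach mentioned above, the sketch below scores each point by the distance to its k-th nearest neighbor. It assumes the Statistics and Machine Learning Toolbox (for knnsearch); the data, the planted outlier, and the choice k = 5 are hypothetical.

```matlab
rng(0);
X = [randn(100, 2); 8 8];          % 100 inliers plus one planted outlier (row 101)
k = 5;                             % hypothetical neighborhood size

% Query with K = k + 1 because each point's nearest neighbor is itself.
[~, D] = knnsearch(X, X, 'K', k + 1);
score = D(:, end);                 % distance to the k-th nearest other point

[~, worst] = max(score);           % largest score = most isolated point
fprintf('Most outlying row: %d (score %.2f)\n', worst, score(worst));
```

Points in dense regions get small scores regardless of how far they are from any cluster center, which is the property k-means distances lack.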
Remember that the Elbow Method doesn't just 'give' the best value of k, since the best value of k is up to interpretation.
The theory behind the Elbow Method is that we in tandem both want to minimize some error function (i.e. sum of squared errors) while also picking a low value of k.
The Elbow Method thus suggests that a good value of k lies at a point on the plot that resembles an elbow. That is, the error is already small and no longer decreases drastically when k increases further.
In your plot you could argue that both k=3 and k=6 resemble elbows. By picking k=3 you'd have picked a small k, and we see that k=4 and k=5 don't do much better in minimizing the error. The same goes for k=6.
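For reference, an elbow curve of this kind can be produced with a short loop. This is a sketch on synthetic data with three well-separated groups (an assumption made so that the elbow is visible); it uses the third output of kmeans, the within-cluster sums of point-to-centroid distances.

```matlab
rng(0);
% Hypothetical data: three Gaussian blobs in 2-D.
X = [randn(50, 2); ...
     bsxfun(@plus, randn(50, 2), [6 6]); ...
     bsxfun(@plus, randn(50, 2), [12 0])];

maxK = 8;
sse = zeros(maxK, 1);
for k = 1:maxK
    % Several replicates reduce the chance of plotting a poor local minimum.
    [~, ~, sumd] = kmeans(X, k, 'Replicates', 5);
    sse(k) = sum(sumd);            % total within-cluster squared distance
end

plot(1:maxK, sse, '-o');
xlabel('number of clusters k');
ylabel('sum of squared errors');

% Relative improvement gained by each additional cluster.
drop = -diff(sse) ./ sse(1:end-1);
disp(drop');
```

Looking at the relative drop per added cluster, rather than only at the curve, can make it easier to argue for one candidate elbow over another.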

deterministic function in Matlab for clustering

I have been using Matlab built-in kmeans function to do clustering. Due to randomness used in the algorithm, the results are different if I set seeds differently. This is a little annoying. Is there a way to reduce the discrepancy of the clustering results? Alternatively, is there a deterministic function in Matlab for clustering?
If you have the image processing toolbox, there are tools which use Otsu's method, which is deterministic
If datain is your input data:
For 2 classes:
threshold = graythresh(datain);
threshold = the threshold value for splitting the data into 2 classes, normalized to [0,1]
For multiple classes:
thresholds = multithresh(datain,N);
N = number of thresholds
thresholds = 1xN vector of thresholds (not normalized)
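It is normal.
The k-means algorithm starts from randomly chosen centers and reassigns the points to clusters on each iteration, hence the results may differ between runs.
For example, suppose the algorithm has to separate apples from pears. The cluster numbers are arbitrary: one run may put the apples in cluster 1 and the pears in cluster 2, and another run may swap the two, so that all apples appear as "pears" and all pears as "apples" even though the grouping is the same.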
I came up with some methods to reduce the discrepancy of the clustering results.
Put 'OnlinePhase','on' in the arguments in the kmeans. This will lead to a local min which is often the global min.
Put 'Replicates', 5 in the arguments. Here 5 can be replaced with an even larger number. It asks Matlab to do kmeans 5 times and choose the best result.
Put 'MaxIter', 1000 in the arguments. This will increase the max number of iterations from the default 100 to 1000, which could improve the result, although that is unlikely.
As long as we aim for the best outcome from kmeans, we are more likely to get consistent results.
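The three options can be combined in one call, as in the sketch below. The data are synthetic and the seed value is arbitrary; the extra rng call is an addition not mentioned above, shown only because fixing the random number generator makes a given run exactly repeatable.

```matlab
rng(0);
% Hypothetical data: two Gaussian blobs in 2-D.
X = [randn(50, 2); bsxfun(@plus, randn(50, 2), [5 5])];

opts = {'Replicates', 10, 'OnlinePhase', 'on', 'MaxIter', 1000};

rng(42);                               % arbitrary fixed seed
[idx1, C1] = kmeans(X, 2, opts{:});

rng(42);                               % same seed again
[idx2, C2] = kmeans(X, 2, opts{:});

% With the same seed and the same options both runs give the same answer.
fprintf('Identical assignments: %d\n', isequal(idx1, idx2));
```

With a fixed seed the result is repeatable but still depends on that seed; the replicates are what make different seeds more likely to land on the same solution.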

3D SIFT for human activity classification in videos. NOT GETTING GOOD ACCURACY.

I am trying to classify human activities in videos(six classes and almost 100 videos per class, 6*100=600 videos). I am using 3D SIFT(both xy and t scale=1) from UCF.
for f = 1:20
    offset = 0;
    % Generate descriptors at locations given by subs matrix
    for i = 1:100
        reRun = 1;
        while reRun == 1
            loc = subs(i+offset,:);
            fprintf(1,'Calculating keypoint at location (%d, %d, %d)\n',loc);
            % Create a 3DSIFT descriptor at the given location
            [keys{i}, reRun] = Create_Descriptor(pix,1,1,loc(1),loc(2),loc(3));
            if reRun == 1
                offset = offset + 1;   % descriptor rejected: try the next location
            end
        end
    end
    fprintf(1,'\nFinished...\n%d points thrown out due to poor descriptive ability.\n',offset);
    for t1 = 1:20
        % (the rest of the loop was cut off in the post)
My approach is to first extract 50 descriptors (640 dimensions each) per video, and then build a bag of words from all descriptors (50*600 = 30000 descriptors). After running k-means (with k = 1000), I get an index vector of length 30000. I then create a histogram signature for each video from the cluster indices of its descriptors, and run svmtrain (SVM in MATLAB) on the signatures (dimension 600*1000).
Some potential problems:
1- I generate 300 random points in 3D and compute 50 descriptors at 50 of those 300 points.
2- The xy and time scale values are left at their default of "1".
3- The number of clusters: I am not sure that k=1000 is enough for 30000x640 data.
4- svmtrain: I am using this matlab library.
NOTE: Everything is on MATLAB.
Your basic setup seems correct, especially given that you are getting 85-95% accuracy. Now it's just a matter of tuning your procedure. Unfortunately, there is no way to do this other than testing a variety of parameters, examining the results, and repeating. I am going to break this answer into two parts: advice about bag of words features, and advice about SVM classifiers.
Tuning Bag of Words Features
You are using 50 3D SIFT Features per video from randomly selected points with a vocabulary of 1000 visual words. As you've already mentioned, the size of the vocabulary is one parameter you can adjust. So is the number of descriptors per video.
Let's say that each video is 60 frames long, (at 30 fps only 2 sec, but let's assume you are sampling at 1fps for a 1 minute video). That means you are capturing less than one descriptor per frame. That seems very low to me even with 3D descriptors especially if the locations are randomly chosen.
I would manually examine the points for which you are generating features. Do they appear to be well distributed in both space and time? Are you capturing too much background? Ask yourself: would I be able to distinguish between actions given these features?
If you find that many of the selected points are uninformative, increasing the number of points may help. The kmeans clustering can make a few groups for uninformative outliers, and more points means you hopefully capture a few more informative points. You can also try other methods for selecting points. For example, you could use corner points.
You can also manually examine the points that are clustered together. What sorts of structures do the groups have in common? Are the clusters too mixed? That's usually a sign that you need a larger vocabulary.
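When experimenting with the vocabulary size and the number of descriptors per video, it helps to have the signature step in a compact form. The sketch below builds one histogram per video from a cluster index vector. The sizes are deliberately small hypothetical stand-ins (the question uses 600 videos, 50 descriptors each, and 1000 words), and the random index vector replaces the real kmeans output.

```matlab
rng(0);
nVideos = 6; perVideo = 50; vocabSize = 10;   % hypothetical small sizes
nDesc = nVideos * perVideo;

% Stand-in for the index vector returned by kmeans on all descriptors.
idx = randi(vocabSize, nDesc, 1);

% Descriptors are assumed to be stacked video by video, perVideo rows each.
videoId = ceil((1:nDesc)' / perVideo);

% H(v, w) = number of descriptors of video v assigned to visual word w.
H = accumarray([videoId idx], 1, [nVideos vocabSize]);

% L1-normalize each row so videos with different descriptor counts compare.
H = bsxfun(@rdivide, H, sum(H, 2));
disp(size(H));
```

Normalizing the rows matters once you vary the number of descriptors per video, because otherwise the histogram magnitudes change along with the parameter you are tuning.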
Tuning SVMs
Using the Matlab SVM implementation or the Libsvm implementation should not make a difference. They are both the same method and have similar tuning options.
First off, you should really be using cross-validation to tune the SVM to avoid overfitting on your test set.
The most powerful parameter for the SVM is the kernel choice. In Matlab, there are five built in kernel options, and you can also define your own. The kernels also have parameters of their own. For example, the gaussian kernel has a scaling factor, sigma. Typically, you start off with a simple kernel and compare to more complex kernels. For example, start with linear, then test quadratic, cubic and gaussian. To compare, you can simply look at your mean cross-validation accuracy.
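A kernel comparison by mean cross-validation accuracy might look like the sketch below. It is written against the newer fitcecoc and templateSVM functions from the Statistics and Machine Learning Toolbox rather than the svmtrain function used in the question, and it runs on small synthetic data standing in for the real signature matrix; both choices are assumptions made for illustration.

```matlab
rng(0);
% Hypothetical stand-in for the signature matrix: 3 classes, 20 features.
nPerClass = 40; nFeat = 20;
X = [randn(nPerClass, nFeat); ...
     randn(nPerClass, nFeat) + 3; ...
     randn(nPerClass, nFeat) - 3];
Y = [ones(nPerClass, 1); 2 * ones(nPerClass, 1); 3 * ones(nPerClass, 1)];

kernels = {'linear', 'polynomial', 'gaussian'};
acc = zeros(numel(kernels), 1);
for i = 1:numel(kernels)
    % One binary SVM template per kernel, combined into a multiclass model.
    t = templateSVM('KernelFunction', kernels{i}, ...
                    'KernelScale', 'auto', 'Standardize', true);
    cv = fitcecoc(X, Y, 'Learners', t, 'KFold', 5);   % 5-fold cross-validation
    acc(i) = 1 - kfoldLoss(cv);                       % mean CV accuracy
    fprintf('%-10s mean CV accuracy: %.3f\n', kernels{i}, acc(i));
end
```

Keeping a separate held-out test set that is never touched during this loop is what prevents the kernel choice itself from overfitting.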
At this point, the last option is to look at individual instances that are misclassified and try to identify reasons that they may be more difficult than others. Are there commonalities such as occlusion? Also look directly at the visual words that were selected for these instances. You may find something you overlooked when you were tuning your features.
Good luck!

Principal component analysis

I have to write a classifier (Gaussian mixture model) that I use for human action recognition.
I have 4 video datasets. I chose 3 of them as the training set and 1 of them as the testing set.
Before I fit the GM model on the training set, I run PCA on it.
score = training_data * pca_coeff;
training_data = score(:,1:min(size(score,2),numDimension));
During the testing step, what should I do? Should I run a new princomp on the testing data
score = testing_data * new_pca_coeff;
testing_data = score(:,1:min(size(score,2),numDimension));
or should I use the pca_coeff that I computed for the training data?
score = testing_data * pca_coeff;
testing_data = score(:,1:min(size(score,2),numDimension));
The classifier is being trained on data in the space defined by the principal components of the training data. It doesn't make sense to evaluate it in a different space. Therefore, you should apply the same transformation to the testing data as you did to the training data, so don't compute a different pca_coeff.
Incidentally, if your testing data is drawn independently from the same distribution as the training data, then for large enough training and test sets, the principal components should be approximately the same.
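A minimal sketch of applying the training transformation to the test set is given below. It uses pca, the newer replacement for princomp, and synthetic data, both as assumptions. One detail worth noting: these functions center the data before computing the components, so the returned score equals the centered data times the coefficients, and the test data should be centered with the training mean as well.

```matlab
rng(0);
% Hypothetical data: train and test drawn from the same distribution.
A = randn(10, 10);
Xtrain = randn(200, 10) * A + 2;
Xtest  = randn(50, 10)  * A + 2;
numDimension = 3;

% Fit PCA on the training data only; mu is the training mean (row vector).
[pca_coeff, train_score, ~, ~, ~, mu] = pca(Xtrain);
train_reduced = train_score(:, 1:numDimension);

% Reuse the training mean and coefficients for the test data.
test_reduced = bsxfun(@minus, Xtest, mu) * pca_coeff(:, 1:numDimension);

% Check: the training scores are the centered training data times the coefficients.
recomputed = bsxfun(@minus, Xtrain, mu) * pca_coeff;
fprintf('Max difference from score: %g\n', max(abs(recomputed(:) - train_score(:))));
```

Multiplying the uncentered data by the coefficients, as in the question, shifts every projected point by the same constant offset, which is harmless only if exactly the same offset is applied to both training and testing data.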
One method for choosing how many principal components to use involves examining the eigenvalues from the PCA decomposition. You can get these from the princomp function like this:
[pca_coeff score eigenvalues] = princomp(data);
The eigenvalues variable will then be an array where each element describes the amount of variance accounted for by the corresponding principal component. If you plot this array (for example with plot(eigenvalues)),
you should see that the first eigenvalue is the largest and that the values decrease rapidly (this is called a "Scree Plot", and should look like this: http://www.ats.ucla.edu/stat/SPSS/output/spss_output_pca_5.gif, though yours may have up to 800 points instead of 12).
Principal components with small corresponding eigenvalues are unlikely to be useful, since the variance of the data in those dimensions is so small. Many people choose a threshold value, and then select all principal components whose eigenvalue is above that threshold. An informal way of picking the threshold is to look at the Scree plot and choose the threshold to be just after the line 'levels out'. In the image I linked earlier, a good value might be ~0.8, selecting 3 or 4 principal components.
IIRC, you could do something like:
proportion_of_variance = sum(eigenvalues(1:k)) ./ sum(eigenvalues);
to calculate "the proportion of variance described by the low dimensional data".
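Extending that one-liner, the sketch below computes the cumulative proportion of variance for every k at once and picks the smallest k that reaches a target. The synthetic data, the use of pca in place of princomp, and the 95% target are all assumptions for illustration.

```matlab
rng(0);
% Hypothetical data whose columns have very different variances.
data = randn(300, 8) * diag([5 4 3 1 0.5 0.3 0.2 0.1]);

% Third output: eigenvalues of the covariance matrix, largest first.
[~, ~, eigenvalues] = pca(data);

% cum_var(k) = proportion of variance described by the first k components.
cum_var = cumsum(eigenvalues) / sum(eigenvalues);

target = 0.95;                         % hypothetical threshold
k = find(cum_var >= target, 1);        % smallest k reaching the target
fprintf('Components needed for %.0f%% of the variance: %d\n', 100 * target, k);
```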
However, since you are using the principal components for a classification task, you can't really be sure that any particular number of PCs is optimal; the variance of a feature doesn't necessarily tell you anything about how useful it will be for classification. An alternative to choosing PCs with the Scree plot is just to try classification with various numbers of principal components and see what the best number is empirically.