Gaussian Mixture Model (GMM) giving only one cluster - pyspark

I have a dataset with 70 columns and 4.4 million rows, and I want to cluster it. I applied TF-IDF first and then clustered with k-means, bisecting k-means, and a Gaussian Mixture Model (GMM). While the other techniques give me the specified number of clusters, GMM returns only one cluster. For example, in the code below I ask for 20 clusters but get back only 1. Is this happening because I have many columns, or is it caused by the nature of the data?
from pyspark.ml.clustering import GaussianMixture

gmm = GaussianMixture(k=20, tol=0.000001, maxIter=10000, seed=1)
model = gmm.fit(rescaledData)
df1 = model.transform(rescaledData).select(['label', 'prediction'])
df1.groupBy('prediction').count().show()  # this returns 1 row
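For reference, a quick way to confirm that the mixture really has collapsed is to inspect the fitted model's weights and component parameters; a minimal sketch building on the snippet above:

# Inspect the fitted mixture (uses `model` from the snippet above).
# If the model has collapsed, one weight will be close to 1.0 and the rest near 0.
print(model.weights)
model.gaussiansDF.select('mean').show(3, truncate=False)  # per-component means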

In my opinion, the main reason for the poor clustering performance of PySpark's GMM here is the dimensionality of the data: the implementation has to store a covariance matrix per component whose size is quadratic in the number of features, and it copes poorly with high-dimensional inputs such as TF-IDF vectors.
Check the implementation here: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/clustering/GaussianMixture.scala
where the class documentation explicitly warns about the curse of dimensionality:
"This algorithm is limited in its number of features since it requires storing a covariance matrix which has size quadratic in the number of features. Even when the number of features does not exceed this limit, this algorithm may perform poorly on high-dimensional data. This is due to high-dimensional data (a) making it difficult to cluster at all (based on statistical/theoretical arguments) and (b) numerical issues with Gaussian distributions."
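If the TF-IDF vectors are very high-dimensional, one common workaround is to reduce the dimensionality before fitting the mixture. A minimal sketch, assuming the TF-IDF output lives in a column named "features" (the column names and the values of k are assumptions, not from the original post):

from pyspark.ml.feature import PCA
from pyspark.ml.clustering import GaussianMixture

# Project the TF-IDF vectors down to a small number of components before GMM.
# Spark's PCA itself assumes the feature dimension is not enormous.
pca = PCA(k=50, inputCol="features", outputCol="pcaFeatures")
reduced = pca.fit(rescaledData).transform(rescaledData)

gmm = GaussianMixture(k=20, tol=1e-6, maxIter=200, seed=1, featuresCol="pcaFeatures")
model = gmm.fit(reduced)
model.transform(reduced).groupBy("prediction").count().show()  # ideally more than 1 row now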

Related

Multi-label clustering

I have a question regarding a task I am trying to solve. The data I have are characterisation data, meaning I have a PASS/FAIL label for every single datapoint. So my data matrix has n rows and m columns, and the target variable is again an n-by-m matrix composed of binary values (0s and 1s).
My task is to apply clustering and partition all these datapoints into two clusters, one for PASS datapoints and the other for FAIL datapoints. I haven't been able to find an algorithm that can solve this type of 'multi-label' problem with clustering.
I tried algorithms like k-means, but when tuning the number of clusters the best value comes out as k = 6, which doesn't really make sense. Outliers have already been dropped from the data, and the data are normalised as well.
I have a large number of features in my data matrix (e.g. > 3000), so I tried dimensionality reduction methods like PCA to at least drop the features that are less relevant than the rest. But I am not sure whether this is applicable in my case, where the target variable is a binary matrix.
Is there a specific algorithm that can solve this type of problem, and if so, what preprocessing should I do before applying it?
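For concreteness, a minimal sketch of the pipeline described above (scikit-learn; X and Y are hypothetical placeholders for the real data and target matrices):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# X: (n_samples, n_features) normalised characterisation data, outliers removed
# Y: (n_samples, n_labels) binary PASS/FAIL matrix -- kept aside, not used for fitting
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3000))          # placeholder for the real data matrix
Y = rng.integers(0, 2, size=(500, 10))    # placeholder for the real target matrix

X_reduced = PCA(n_components=50, random_state=0).fit_transform(X)

# Tune the number of clusters as described: pick the k with the best silhouette.
scores = {}
for k in range(2, 10):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_reduced)
    scores[k] = silhouette_score(X_reduced, labels)
print(max(scores, key=scores.get), scores)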

Why the NMI value is small while having higher clustering accuracy and Rand index in clustering

I am using https://www.mathworks.com/matlabcentral/fileexchange/32197-clustering-results-measurement to evaluate my clustering accuracy in MATLAB. It provides accuracy and rand_index, and the performance is as expected. However, when I use NMI as a metric, the reported clustering performance is extremely low. I am using this source code: https://www.mathworks.com/matlabcentral/fileexchange/29047-normalized-mutual-information.
I have two Nx1 vectors as inputs, one holding the actual labels and the other the label assignments. I checked every element inside them, and found that even though I get an 82% rand_index, the NMI is only 0.3209. Below is an example on the Iris dataset (https://archive.ics.uci.edu/ml/datasets/iris) with MATLAB's built-in k-means.
% iris: numeric matrix whose first data_dim columns are features and whose last column is the class label
data = iris(:,1:data_dim);
k = 3;
[result_label,centroid] = kmeans(data,k,'MaxIter',10000);
actual_label = iris(:,end);
NMI = nmi(actual_label,result_label);
[Acc,rand_index,match] = AccMeasure(actual_label',result_label');
The result:
Auto ACC: 0.820000
Rand_Index: 0.701818
NMI: 0.320912
The Rand Index will tend towards 1 as the number of data points increases (even when comparing random clusterings) so you never really expect to see small values of Rand when you have a big data set.
At the same time, Accuracy can be high when all of your points fall into the same large cluster.
I have a feeling that the NMI is producing the more reliable comparison. To verify, try running a dimensionality reduction and plotting the data points, coloured according to each of the two clusterings. Visual checks are often the best way to develop an intuition about data.
If you want to explore more, a convenient Python package for clustering comparisons is CluSim.
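To see the same effect outside MATLAB, the metrics are easy to compare on the Iris data with scikit-learn (a rough Python equivalent of the snippet above, not the AccMeasure/nmi code itself; rand_score needs a reasonably recent scikit-learn):

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import rand_score, adjusted_rand_score, normalized_mutual_info_score

X, y = load_iris(return_X_y=True)
pred = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X)

# The plain Rand index is inflated by the many "agreeing" pairs of points,
# while NMI and the adjusted Rand index correct for chance agreement.
print("Rand index:", rand_score(y, pred))
print("Adjusted Rand:", adjusted_rand_score(y, pred))
print("NMI:", normalized_mutual_info_score(y, pred))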

How to compute whether representational similarity matrix values are significant

I am new to RSA analysis of fMRI images. I used SPM 12 for preprocessing and first-level analysis of my fMRI images, and used the RSA toolbox to compute RDMs (representational dissimilarity matrices) for my conditions in a specific region of the brain. Now I have the RDM matrix for every single subject, as well as the overall RDM across all subjects. However, the RSA toolbox doesn't report any p-value or significance test for the values in the RDM. How can I compute or determine which values in the RDM matrix are significant and which are not? I used Pearson's r to compute the RDMs. In particular, I would like an explanation of the mathematics that can be used to test the significance of these values.
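For reference, the RDM construction described above (1 minus the Pearson correlation between condition patterns) can be written out directly; a minimal sketch with a hypothetical patterns array of shape (n_conditions, n_voxels):

import numpy as np

def pearson_rdm(patterns):
    # patterns: (n_conditions, n_voxels), one activity pattern per condition.
    # Returns the (n_conditions, n_conditions) dissimilarity matrix 1 - Pearson r.
    return 1.0 - np.corrcoef(patterns)

rng = np.random.default_rng(0)
patterns = rng.normal(size=(8, 200))      # hypothetical: 8 conditions, 200 voxels in the ROI
rdm = pearson_rdm(patterns)
print(rdm.shape)                          # (8, 8), with zeros on the diagonal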

Comparing k-means clustering

I have 150 images, 15 each of 10 different people, so I know which images should belong together if clustered.
The images are represented as 73-dimensional feature vectors, and I clustered them into 10 clusters using the kmeans function in MATLAB.
Later, I processed these 150 data points, reduced their dimension from 73 to 3 for my work, and applied the same kmeans function to them.
I want to compare the results obtained on the two data sets (processed and unprocessed) using the same k-means function, and to know whether the processing that reduced the dimensionality improves the clustering or not.
I thought comparing the variance of each cluster could be one parameter for the comparison, but I am not sure I can directly compare the results (within-cluster sum of distances, etc.) since the two cases have different dimensions. Could anyone suggest a way to compare the k-means results, some way to normalise them, or any other comparison I can make?
I can think of three options, though I am unaware of any well-developed methodology for doing this specifically with k-means clustering.
1. Look at the confusion matrices between the two approaches (a sketch of this is given after this answer).
2. Compare the Mahalanobis distances between the clusters, and from items in each cluster to their nearest other clusters.
3. Look at the Voronoi cells and see how far your points are from the boundaries of the cells.
The problem with option 3 is that the distance metrics get skewed: 3-D and 73-D distances are not commensurate, so I'm not a fan of that approach. I'd recommend reading some books on k-means if you are set on that path; rank speculation is fun, but standing on the shoulders of giants is better.
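For the first option, the confusion matrix between the two clusterings (or between each clustering and the known person identities) is straightforward to compute; a minimal sketch in Python rather than MATLAB, with placeholder label vectors:

import numpy as np
from sklearn.metrics import confusion_matrix, adjusted_rand_score

# true_ids: known person identities; labels_73d / labels_3d: k-means assignments
# from the original 73-D features and the reduced 3-D features (placeholders here).
rng = np.random.default_rng(0)
true_ids = np.repeat(np.arange(10), 15)            # 10 people, 15 images each
labels_73d = rng.integers(0, 10, size=150)
labels_3d = rng.integers(0, 10, size=150)

print(confusion_matrix(labels_73d, labels_3d))     # how the two clusterings overlap
# A chance-corrected agreement score lets you compare both clusterings against the
# ground truth even though the feature spaces have different dimensionality.
print(adjusted_rand_score(true_ids, labels_73d))
print(adjusted_rand_score(true_ids, labels_3d))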

Principal component analysis

I have to write a classifier (a Gaussian mixture model) that I use for human action recognition.
I have 4 video datasets. I chose 3 of them as the training set and 1 as the testing set.
Before I apply the GMM to the training set, I run PCA on it.
pca_coeff = princomp(training_data);
score = training_data * pca_coeff;
training_data = score(:,1:min(size(score,2),numDimension));
During the testing step, what should I do? Should I run a new princomp on the testing data,
new_pca_coeff=princomp(testing_data);
score = testing_data * new_pca_coeff;
testing_data = score(:,1:min(size(score,2),numDimension));
or should I use the pca_coeff that I computed from the training data?
score = testing_data * pca_coeff;
testing_data = score(:,1:min(size(score,2),numDimension));
The classifier is being trained on data in the space defined by the principal components of the training data. It doesn't make sense to evaluate it in a different space; therefore, you should apply the same transformation to the testing data as you did to the training data, so don't compute a different pca_coeff.
Incidentally, if your testing data are drawn independently from the same distribution as the training data, then for large enough training and test sets the principal components should be approximately the same.
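In scikit-learn terms, the same idea looks like the sketch below (a Python illustration, not the MATLAB princomp code from the question): the projection is fitted once on the training data and then only reused on the test data.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
training_data = rng.normal(size=(300, 50))   # placeholders for the real feature matrices
testing_data = rng.normal(size=(100, 50))

num_dimensions = 10
pca = PCA(n_components=num_dimensions)
train_reduced = pca.fit_transform(training_data)  # fit the projection on training data only
test_reduced = pca.transform(testing_data)        # reuse the same projection; no re-fitting
print(train_reduced.shape, test_reduced.shape)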
One method for choosing how many principal components to use involves examining the eigenvalues from the PCA decomposition. You can get these from the princomp function like this:
[pca_coeff score eigenvalues] = princomp(data);
The eigenvalues variable will then be an array where each element describes the amount of variance accounted for by the corresponding principal component. If you do:
plot(eigenvalues);
you should see that the first eigenvalue is the largest and that they decrease rapidly (this is called a "scree plot"; an example is at http://www.ats.ucla.edu/stat/SPSS/output/spss_output_pca_5.gif, though yours may have up to 800 points instead of 12).
Principal components with small corresponding eigenvalues are unlikely to be useful, since the variance of the data in those dimensions is so small. Many people choose a threshold value and then select all principal components whose eigenvalue is above that threshold. An informal way of picking the threshold is to look at the scree plot and choose the threshold to be just after the point where the line 'levels out'; in the image linked above, a good value might be ~0.8, selecting 3 or 4 principal components.
IIRC, you could do something like:
proportion_of_variance = sum(eigenvalues(1:k)) ./ sum(eigenvalues);
to calculate "the proportion of variance described by the low dimensional data".
However, since you are using the principal components for a classification task, you can't really be sure that any particular number of PCs is optimal; the variance of a feature doesn't necessarily tell you anything about how useful it will be for classification. An alternative to choosing PCs from the scree plot is simply to try the classification with various numbers of principal components and see which number works best empirically.
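One way to make that empirical approach concrete is to cross-validate over several choices of the number of components; a rough scikit-learn sketch, with a stand-in classifier and dataset since the actual model in the question is a GMM on video features:

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = load_digits(return_X_y=True)       # placeholder dataset

pipe = Pipeline([
    ("pca", PCA()),
    ("clf", LogisticRegression(max_iter=1000)),
])
# Try a few numbers of principal components and keep the one with the best CV accuracy.
search = GridSearchCV(pipe, {"pca__n_components": [5, 10, 20, 40]}, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)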