Mahout binary data clustering - cluster-analysis

I have points with binary features:
id, feature 1, feature 2, ....
1, 0, 1, 0, 1, ...
2, 1, 1, 0, 1, ...
and the size of matrix is about 20k * 200k but it is sparse. I am using Mahout for clustering data by kmeans algorithm and have the following questions:
Is kmeans a good candidate for binary features?
Is there any way to reduce dimensions while keeping the concept of Manhattan distance measure (I need manhattan instead of Cosine or Tanimoto)
The memory usage of kmeans is high and needs 4GB memory for each Map/Reduce Task on (4Mb Blocks on 400Mb vector file for 3k clusterss). Considering that Vector object in Mahout uses double entries, is there any way to use just Boolean entries for points but double entries for centers?

k-means is a good candidate if you have a good distance metric. Manhattan distance could be fine; I like log-likelihood.
You can use any dimension reduction technique you like. I like alternating-least-squares; the SVD works well too. For this size matrix you can do it easily in memory with Commons Math rather than bother with Hadoop -- it is way way overkill.
(See also http://myrrix.com -- I have a very fast ALS implementation there you can reuse in the core/online modules. It can crunch this in a few seconds in tens of MB heap.)
You no longer have binary 0/1 values in your feature matrix. In the feature space, cosine distance should work well (1 - cosineSimilarity). Tanimoto/Jaccard is not appropriate.

k-means has one big requirement that is often overlooked: it needs to compute a sensible mean. This is much more important than people think.
If the mean does not reduce variance, it may not converge
(The arithmetic mean is optimal for Euclidean distance. For Manhattan, the median is said to be better. For very different metrics, I do not know)
The mean probably won't be as sparse anymore
The mean won't be a binary vector anymore, either
Furthermore, in particular for large data sets, which k do you want to use?
You really should look into other distance measures. Your data size is not that big; it should still suffice to use a single computer. Using a compact vector representation it will easily fit into main memory. Just don't use something that computes a n^2 similarity matrix first. Maybe try something with indexes for binary vector similarity.
k-means is fairly easy to implement, in particular if you don't do any advance seeding. To reduce memory usage, just implement it yourself for the representation that is optimal for your data. It could be a bitset, it could be a sorted list of dimensions that are non-zero. Manhattan distance then boils down to counting the number of dimensions where the vectors differ!

Related

Why the NMI value is small while having higher clustering accuracy and Rand index in clustering

I am using the https://www.mathworks.com/matlabcentral/fileexchange/32197-clustering-results-measurement for evaluating my clustering accuracy in MATLAB, it provides accuracy and rand_index, the performance is normal as expect. However, when I try to use NMI as a metric, the clustering performance is extremely low, I am using the source code (https://www.mathworks.com/matlabcentral/fileexchange/29047-normalized-mutual-information).
Actually I have two Nx1 vectors as inputs, one is the actual label while another is the label assignments. I basically check each of every element insides and I found that even I have 82% rand_index, the NMI is only 0.3209. Below is the example for Iris Dataset https://archive.ics.uci.edu/ml/datasets/iris with MATLAB built-in K-Means.
data = iris(:,1:data_dim);
k = 3;
[result_label,centroid] = kmeans(data,k,'MaxIter',10000);
actual_label = iris(:,end);
NMI = nmi(actual_label,result_label);
[Acc,rand_index,match] = AccMeasure(actual_label',result_label');
The result:
Auto ACC: 0.820000
Rand_Index: 0.701818
NMI: 0.320912
The Rand Index will tend towards 1 as the number of data points increases (even when comparing random clusterings) so you never really expect to see small values of Rand when you have a big data set.
At the same time, Accuracy can be high when all of your points fall into the same large cluster.
I have a feeling that the NMI is producing a more reliable comparison. To verify, trying running a dimensionality reduction and plot the data points with color based on the two clusterings. Visual statistics are often the best for developing an intuition about data.
If you want to explore more, a convenient python package for clustering comparisons is CluSim.

Clustering, Large dataset, learning large number vocabulary words

I am try to do clustering from a large dataset dim:
rows: 1.4 million
cols:900
expected number of clusters: 10,000 (10k)
Problem is : size of my dataset 10Gb, and I have RAM of 16Gb. I am trying to implement in Matlab. It will be big help for me if someone could response to it.
P.S. So far i have tried with hierarchical clustering. in one paper, tehy have suggested to go for "fixed radius incremental pre-clustering". But I didnt understand the procedure.
Thanks in advance.
Use some algorithm that does not require a distance matrix. Instead, choose one that can be index accelerated.
Anuthing with a distance matrix will exceed your memory. But even when not requiring this (e.g., SLINK uses only O(n) memory) it still may take too long. Indexes could reduce the runtime to O(n log n) although on your data, indexes may have problems.
Index accelerated algorithms are for example: OPTICS, DBSCAN.
Just don't use the really bad Matlab scripts for these algorithms.

Bag of feature: how to create the query histogram?

I'm trying to implement the Bag of Features model.
Given a descriptors matrix object (representing an image) belonging to the initial dataset, compute its histogram is easy, since we already know to which cluster each descriptor vector belongs to from k-means.
But what about if we want to compute the histogram of a query matrix? The only solution that crosses my mind is to compute the distance between each vector descriptor to each of the k cluster centroids.
This can be inefficient: supposing that k=100 (so 100 centroids), then we have an query image represented through 1000 SIFT descriptors, so a matrix 1000x100.
What we have to do now is computing 1000 * 100 eucledian distances in 128 dimensions. This seems really inefficient.
How to solve this problem?
NOTE: can you suggest me some implementations where this point is explained?
NOTE: I know LSH is a solution (since we are using high-dim vectors), but I don't think that actual implementations use it.
UPDATE:
I was talking with a collegue of mine: using a hierarchical cluster approach instead of classic k-means, should speed up the process so much! Is it correct to say that if we have k centroids, with an hierarchical cluster we have to do only log(k) comparisons in order to find the closest centroid instead of k comparisons?
For a bag of features approach, you indeed need to quantize the descriptors. Yes, if you have 10000 features and 100 features that 10000*100 distances (unless you use an index here).
Compare this to comparing each of the 10000 features to each of the 10000 features of each image in your database. Does it still sound that bad?

Multiplication of large sparse Matrices without null values in scala

I have two very sparse distributed matrixes of dimension 1,000,000,000 x 1,000,000,000 and I want to compute the matrix multiplication efficiently.
I tried to create a BlockMatrix from a CoordinateMatrix but it's a lot of memory (where in reality the non zero data are around ~500'000'000) and the time of computation is enormous.
So there is another way to create a sparse matrix and compute a multiplication efficiently in a distributed way in Spark? Or i have to compute it manually?
You must obviously use a storage format for sparse matrices that makes use of their sparsity.
Now, without knowing anything about how you handle matrices and which libraries you use, there's no helping you but to ask you to look at the linear algebra libraries of your choice and look for sparse storage formats; the "good old" Fortran-based libraries that underly a lot of modern math libs support them, and so chances are that you really have to do but a little googling with yourlibraryname + "sparse matrix".
second thoughts:
Sparse matrixes really don't lend themselves to distribution very well; think about the operations you'd have to do to coordinate distribution compared to the actual multiplications/additions.
Also, ~5e8 non-zero elements in a 1e18 element matrix are definitely a lot of memory, and since you don't specify how much you consider a lot to be, it's very possible there's nothing wrong with it. Assuming you're using the default double precision, that's 5e8 * 8B = 4GB of pure numbers, not counting the coordinates needed for sparse storage. So, if you've got ~10GB of memory, I wouldn't be surprised at all.
As there is no build-in method in Spark to perform a matrix multiplication with sparse matrixes. I resolved by reduce at best the sparsity of the matrices before perform the matrice multiplication with BlockMatrix (that not support sparse matrix).
Last edit: Even with the sparsity optimization I had a lot of problems with large dataset. Finally, I decided to implement it myself. Now is running very fast. I hope that a matrix implementation with sparse matrix will be implemented in Spark as I think there are a lot of application that can make use of this.

Data clustering algorithm

What is the most popular text clustering algorithm which deals with large dimensions and huge dataset and is fast?
I am getting confused after reading so many papers and so many approaches..now just want to know which one is used most, to have a good starting point for writing a clustering application for documents.
To deal with the curse of dimensionality you can try to determine the blind sources (ie topics) that generated your dataset. You could use Principal Component Analysis or Factor Analysis to reduce the dimensionality of your feature set and to compute useful indexes.
PCA is what is used in Latent Semantic Indexing, since SVD can be demonstrated to be PCA : )
Remember that you can lose interpretation when you obtain the principal components of your dataset or its factors, so you maybe wanna go the Non-Negative Matrix Factorization route. (And here is the punch! K-Means is a particular NNMF!) In NNMF the dataset can be explained just by its additive, non-negative components.
There is no one size fits all approach. Hierarchical clustering is an option always. If you want to have distinct groups formed out of the data, you can go with K-means clustering (it is also supposedly computationally less intensive).
The two most popular document clustering approaches, are hierarchical clustering and k-means. k-means is faster as it is linear in the number of documents, as opposed to hierarchical, which is quadratic, but is generally believed to give better results. Each document in the dataset is usually represented as an n-dimensional vector (n is the number of words), with the magnitude of the dimension corresponding to each word equal to its term frequency-inverse document frequency score. The tf-idf score reduces the importance of high-frequency words in similarity calculation. The cosine similarity is often used as a similarity measure.
A paper comparing experimental results between hierarchical and bisecting k-means, a cousin algorithm to k-means, can be found here.
The simplest approaches to dimensionality reduction in document clustering are: a) throw out all rare and highly frequent words (say occuring in less than 1% and more than 60% of documents: this is somewhat arbitrary, you need to try different ranges for each dataset to see impact on results), b) stopping: throw out all words in a stop list of common english words: lists can be found online, and c) stemming, or removing suffixes to leave only word roots. The most common stemmer is a stemmer designed by Martin Porter. Implementations in many languages can be found here. Usually, this will reduce the number of unique words in a dataset to a few hundred or low thousands, and further dimensionality reduction may not be required. Otherwise, techniques like PCA could be used.
I will stick with kmedoids, since you can compute the distance from any point to anypoint at the beggining of the algorithm, You only need to do this one time, and it saves you time, specially if there are many dimensions. This algorithm works by choosing as a center of a cluster the point that is nearer to it, not a centroid calculated in base of the averages of the points belonging to that cluster. Therefore you have all possible distance calculations already done for you in this algorithm.
In the case where you aren't looking for semantic text clustering (I can't tell if this is a requirement or not from your original question), try using Levenshtein distance and building a similarity matrix with it. From this, you can use k-medoids to cluster and subsequently validate your clustering through use of silhouette coefficients. Unfortunately, Levensthein can be quite slow, but there are ways to speed it up through uses of thresholds and other methods.
Another way to deal with the curse of dimensionality would be to find 'contrasting sets,', conjunctions of attribute-value pairs that are more prominent in one group than in the rest. You can then use those contrasting sets as dimensions either in lieu of the original attributes or with a restricted number of attributes.