Clustering, Large dataset, learning large number vocabulary words - matlab

I am trying to cluster a large dataset with the following dimensions:
rows: 1.4 million
expected number of clusters: 10,000 (10k)
The problem is that my dataset is 10 GB and I have 16 GB of RAM. I am trying to implement this in Matlab. It would be a big help if someone could respond.
P.S. So far I have tried hierarchical clustering. One paper suggested "fixed radius incremental pre-clustering", but I didn't understand the procedure.
Thanks in advance.

Use some algorithm that does not require a distance matrix. Instead, choose one that can be index accelerated.
Anything with a distance matrix will exceed your memory. But even without one (e.g., SLINK uses only O(n) memory), it may still take too long. Indexes could reduce the runtime to O(n log n), although indexes may have problems on your data.
Index accelerated algorithms are for example: OPTICS, DBSCAN.
Just don't use the really bad Matlab scripts for these algorithms.


K-means on PySpark takes a very long time to run

I have a NumPy array of 0s and 1s with 37k rows and 6k columns.
When I try to run Kmeans Clustering in Pyspark, it takes almost forever to load and I cannot get the output. Is there any way to reduce the processing time or any other tricks to solve this issue?
I think you may have too many columns; you may be running into the curse of dimensionality. From Wikipedia:
[...] The common theme of these problems is that when the dimensionality increases, the volume of the space increases so fast that the available data become sparse. This sparsity is problematic for any method that requires statistical significance. In order to obtain a statistically sound and reliable result, the amount of data needed to support the result often grows exponentially with the dimensionality. [...]
To address this, have you considered reducing the number of columns and keeping only the relevant ones? Again from Wikipedia:
[...] Feature projection transforms the data in the high-dimensional space to a space of fewer dimensions. The data transformation may be linear, as in principal component analysis (PCA), but many nonlinear dimensionality reduction techniques also exist. [...]

Hierarchical clustering with big vector

I'm working in MATLAB and I have a vector of 4 million values. I've tried to use the linkage function but I get this error:
Error using linkage (line 240) Requested 1x12072863584470 (40017.6GB) array exceeds maximum array size preference. Creation of arrays greater than this limit may take a long time and cause MATLAB to become unresponsive. See array size limit or preference panel for more information.
I've found that some people avoid this error by using the kmeans function; however, I would like to know if there is a way to avoid this error and still use the linkage function.
Most hierarchical clustering needs O(n²) memory. So no, you don't want to use these algorithms.
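To see why quadratic memory is prohibitive, here is a rough back-of-the-envelope sketch in Python. It is not Matlab's actual accounting; the helper name `condensed_distance_bytes` and the 8 bytes per stored distance are assumptions for illustration. It counts the n(n-1)/2 pairwise distances a condensed distance matrix would hold:

```python
def condensed_distance_bytes(n, bytes_per_value=8):
    """Bytes needed to store all n*(n-1)/2 pairwise distances.

    bytes_per_value=8 assumes double precision (an assumption, not
    necessarily what any particular linkage implementation uses).
    """
    pairs = n * (n - 1) // 2
    return pairs * bytes_per_value


n = 4_000_000
print(n * (n - 1) // 2)                    # number of pairwise distances
print(condensed_distance_bytes(n) / 1e12)  # size in terabytes (10**12 bytes)
```

For 4 million points this comes to roughly 8 x 10^12 distances, or about 64 TB at 8 bytes each, so no realistic preference setting or RAM upgrade will help.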
There are some exceptions, such as SLINK and CLINK. These can be implemented with just linear memory. I just don't know if Matlab has any good implementations.
Or you use kmeans or DBSCAN, which also need only linear memory.
Are you absolutely sure that these 4 million values are statistically correct? If yes, you're a lucky person; if not, pre-process the data first. You may well find that the 4 million values shrink drastically to a meaningful sample that fits easily into memory (RAM) for hierarchical clustering.

How to Sub-Sample Dataset

I'm going to implement an SVM (support vector machine) and various other classification algorithms.
But my training dataset is 10 GB. How can I sub-sample it?
This is a very basic question, but I'm a beginner.
Thanks for the help.
The first thing you should do is reduce the number of samples (rows). LibSVM provides a very useful python script for that. If your dataset has N samples and you want to downsample it to N - K samples, you can use the aforementioned script to: (1) randomly remove K samples from your data; (2) remove K samples from your data using stratified sampling. The last one is recommended.
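As a minimal sketch of what stratified sampling does (this is not the LibSVM script itself; the function name `stratified_sample` and its parameters are hypothetical), the idea is to draw the same fraction from every class so that the class proportions survive the downsampling:

```python
import random
from collections import defaultdict


def stratified_sample(labels, fraction, seed=0):
    """Return sorted row indices keeping about `fraction` of each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    chosen = []
    for indices in by_class.values():
        # Keep at least one row per class so that no class disappears.
        k = max(1, round(len(indices) * fraction))
        chosen.extend(rng.sample(indices, k))
    return sorted(chosen)


labels = [0] * 80 + [1] * 20
kept = stratified_sample(labels, fraction=0.25)
print(len(kept), sum(labels[i] for i in kept))
```

For a 10 GB file, one option under this sketch is to read only the label column first, pick the row indices as above, and then stream through the file once, writing out only the selected rows.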
Reducing the number of features (columns) is much more complicated. You can't (or at least shouldn't) remove them randomly. There are many algorithms for this, usually called dimensionality reduction algorithms. The most widely used one is PCA, but it is not as simple to use.
It depends on your data.
Since you're working on a basic-level question, I guess the best approach to start with is to cut down your sample size considerably. Once that is done, reduce the number of features to a nominated size.
Once the dataset is small and simple enough, you could then consider adding more attributes or samples as are fitting for the problem at hand.
Hope this Helps!

Mahout binary data clustering

I have points with binary features:
id, feature 1, feature 2, ....
1, 0, 1, 0, 1, ...
2, 1, 1, 0, 1, ...
and the size of matrix is about 20k * 200k but it is sparse. I am using Mahout for clustering data by kmeans algorithm and have the following questions:
Is kmeans a good candidate for binary features?
Is there any way to reduce dimensions while keeping the concept of the Manhattan distance measure? (I need Manhattan instead of Cosine or Tanimoto.)
The memory usage of kmeans is high: it needs 4 GB of memory for each Map/Reduce task (4 MB blocks on a 400 MB vector file for 3k clusters). Considering that the Vector object in Mahout uses double entries, is there any way to use just Boolean entries for points but double entries for centers?
k-means is a good candidate if you have a good distance metric. Manhattan distance could be fine; I like log-likelihood.
You can use any dimension reduction technique you like. I like alternating-least-squares; the SVD works well too. For this size matrix you can do it easily in memory with Commons Math rather than bother with Hadoop -- it is way way overkill.
(See also -- I have a very fast ALS implementation there you can reuse in the core/online modules. It can crunch this in a few seconds in tens of MB heap.)
After dimension reduction, you no longer have binary 0/1 values in your feature matrix. In the feature space, cosine distance should work well (1 - cosineSimilarity). Tanimoto/Jaccard is not appropriate.
k-means has one big requirement that is often overlooked: it needs to compute a sensible mean. This is much more important than people think.
If the mean does not reduce variance, it may not converge
(The arithmetic mean is optimal for Euclidean distance. For Manhattan, the median is said to be better. For very different metrics, I do not know)
The mean probably won't be as sparse anymore
The mean won't be a binary vector anymore, either
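The points above about the mean can be checked on a single binary column. The following is a small illustrative sketch (the helper names `l1_cost` and `l2_cost` are made up here): it compares the mean and the median as cluster centers under the absolute (Manhattan) cost and the squared (Euclidean) cost.

```python
from statistics import mean, median


def l1_cost(values, center):
    """Sum of absolute deviations (per-dimension Manhattan cost)."""
    return sum(abs(v - center) for v in values)


def l2_cost(values, center):
    """Sum of squared deviations (the cost k-means minimizes)."""
    return sum((v - center) ** 2 for v in values)


column = [0, 0, 1, 1, 1]          # one binary feature across five points
m, med = mean(column), median(column)

print("mean:", m, "median:", med)
print("L1 cost at mean vs median:", l1_cost(column, m), l1_cost(column, med))
print("L2 cost at mean vs median:", l2_cost(column, m), l2_cost(column, med))
```

On this column the median gives the lower absolute cost and the mean gives the lower squared cost. Note also that the median of a binary column with an odd number of points is simply the majority value, so a median-based center stays binary, whereas the mean does not.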
Furthermore, in particular for large data sets, which k do you want to use?
You really should look into other distance measures. Your data size is not that big; it should still suffice to use a single computer. Using a compact vector representation it will easily fit into main memory. Just don't use something that computes an n^2 similarity matrix first. Maybe try something with indexes for binary vector similarity.
k-means is fairly easy to implement, in particular if you don't do any advanced seeding. To reduce memory usage, just implement it yourself for the representation that is optimal for your data. It could be a bitset, or it could be a sorted list of the dimensions that are non-zero. For binary vectors, Manhattan distance then boils down to counting the number of dimensions where the vectors differ!
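A minimal Python sketch of that representation follows; it is illustrative only and not Mahout code, and the names `manhattan_binary` and `manhattan_to_center` are assumptions. Each point is stored as the set of its non-zero dimensions, and a center is stored as a sparse dict of doubles with values in [0, 1]:

```python
def manhattan_binary(a, b):
    """Manhattan distance between two binary vectors stored as sets of
    non-zero dimensions: the number of dimensions where they differ."""
    return len(a ^ b)


def manhattan_to_center(point, center, center_sum):
    """Manhattan distance from a binary point (set of non-zero dims) to a
    real-valued center (dict: dim -> value, assumed to lie in [0, 1]).

    center_sum must equal sum(center.values()); it is precomputed once per
    center so each distance costs only O(len(point)).
    Dimensions in the point contribute (1 - c); all others contribute c.
    """
    return center_sum + sum(1.0 - 2.0 * center.get(d, 0.0) for d in point)


p = {0, 2, 3}
q = {2, 4}
print(manhattan_binary(p, q))

center = {0: 0.5, 1: 0.25, 2: 1.0}
print(manhattan_to_center({0, 2}, center, sum(center.values())))
```

The design choice here is that points stay Boolean (just a set of indices) while centers keep double entries, which is the split the question asks about.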

Data clustering algorithm

What is the most popular text clustering algorithm that handles high dimensionality and huge datasets and is fast?
I am getting confused after reading so many papers, so I just want to know which one is used most, to have a good starting point for writing a clustering application for documents.
To deal with the curse of dimensionality you can try to determine the blind sources (i.e., topics) that generated your dataset. You could use Principal Component Analysis or Factor Analysis to reduce the dimensionality of your feature set and to compute useful indexes.
Latent Semantic Indexing relies on essentially the same machinery: it applies SVD to the term-document matrix, and PCA amounts to SVD of the mean-centered data : )
Remember that you can lose interpretability when you obtain the principal components of your dataset or its factors, so you may want to go the Non-Negative Matrix Factorization route. (And here is the punch line: K-Means can be seen as a particular, constrained NNMF!) In NNMF the dataset can be explained just by its additive, non-negative components.
There is no one size fits all approach. Hierarchical clustering is an option always. If you want to have distinct groups formed out of the data, you can go with K-means clustering (it is also supposedly computationally less intensive).
The two most popular document clustering approaches are hierarchical clustering and k-means. k-means is faster, as it is linear in the number of documents, whereas hierarchical clustering is quadratic but is generally believed to give better results. Each document in the dataset is usually represented as an n-dimensional vector (n is the number of words), with the magnitude of the dimension corresponding to each word equal to its term frequency-inverse document frequency (tf-idf) score. The tf-idf score reduces the importance of high-frequency words in the similarity calculation. Cosine similarity is often used as the similarity measure.
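The representation described above can be sketched in a few lines of Python. This is one common tf-idf variant (raw term count times log(N / document frequency)), chosen here as an assumption; libraries differ in smoothing and normalization, and the function names are illustrative.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one sparse dict (word -> weight)
    per document, using weight = tf * log(N / df)."""
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))      # count each word once per document
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        vectors.append({w: c * math.log(n / df[w]) for w, c in tf.items()})
    return vectors


def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = [["cat", "sat"], ["cat", "ran"], ["dog", "ran"]]
vecs = tfidf_vectors(docs)
print(cosine_similarity(vecs[0], vecs[1]))  # share "cat"
print(cosine_similarity(vecs[0], vecs[2]))  # share no words
```

Under this variant a word that appears in every document gets weight log(1) = 0, which is how tf-idf damps very common words.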
A paper comparing experimental results between hierarchical and bisecting k-means, a cousin algorithm to k-means, can be found here.
The simplest approaches to dimensionality reduction in document clustering are: a) throwing out all rare and highly frequent words (say, those occurring in less than 1% or more than 60% of documents; this is somewhat arbitrary, and you need to try different ranges for each dataset to see the impact on results), b) stopping: throwing out all words in a stop list of common English words (lists can be found online), and c) stemming: removing suffixes to leave only word roots. The most common stemmer is the one designed by Martin Porter. Implementations in many languages can be found here. Usually, this will reduce the number of unique words in a dataset to a few hundred or low thousands, and further dimensionality reduction may not be required. Otherwise, techniques like PCA could be used.
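Steps a) and b) can be sketched as follows. The function name `prune_vocabulary` and its parameters are hypothetical, the default thresholds mirror the 1% and 60% figures mentioned above, and stemming (step c) is deliberately left out because it needs an external stemmer.

```python
from collections import Counter


def prune_vocabulary(docs, min_df=0.01, max_df=0.60, stop_words=frozenset()):
    """docs: list of token lists.

    Keeps words whose document frequency (fraction of documents containing
    the word) lies in [min_df, max_df] and that are not in stop_words.
    Returns (pruned_docs, kept_vocabulary).
    """
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    keep = {
        w for w, c in df.items()
        if min_df <= c / n <= max_df and w not in stop_words
    }
    pruned = [[w for w in tokens if w in keep] for tokens in docs]
    return pruned, keep


docs = [["the", "cat"], ["the", "dog"], ["the", "cat", "emu"], ["the", "dog"]]
pruned, vocab = prune_vocabulary(docs, min_df=0.3, max_df=0.6)
print(sorted(vocab))
print(pruned)
```

In this toy example "the" is dropped for appearing in every document and "emu" for appearing in only one of four; the thresholds should be tuned per dataset, as noted above.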
I would stick with k-medoids, since you can compute the distance from every point to every other point at the beginning of the algorithm. You only need to do this once, and it saves time, especially if there are many dimensions. The algorithm chooses as the center of each cluster the actual data point that is closest to the rest of the cluster (the medoid), rather than a centroid computed by averaging the points belonging to that cluster. Therefore all the distance calculations you need have already been done up front.
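A minimal sketch of this idea, assuming a simple alternating (assign, then update) variant rather than the full PAM algorithm, and assuming the initial medoid indices are distinct; the function names are illustrative. Note that the precomputed matrix needs O(n^2) memory, so this only suits datasets small enough for that.

```python
def assign(dist, medoids):
    """Map each medoid index to the list of point indices nearest to it."""
    clusters = {m: [] for m in medoids}
    for i in range(len(dist)):
        nearest = min(medoids, key=lambda m: dist[i][m])
        clusters[nearest].append(i)
    return clusters


def k_medoids(dist, initial_medoids, max_iter=100):
    """dist: precomputed n x n distance matrix (list of lists).
    initial_medoids: distinct point indices used as starting medoids."""
    medoids = list(initial_medoids)
    for _ in range(max_iter):
        clusters = assign(dist, medoids)
        # New medoid = member with the smallest total distance to its cluster.
        updated = [
            min(members, key=lambda c: sum(dist[c][j] for j in members))
            if members else m
            for m, members in clusters.items()
        ]
        if updated == medoids:
            break
        medoids = updated
    return medoids, assign(dist, medoids)


points = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in points] for a in points]
medoids, clusters = k_medoids(dist, initial_medoids=[0, 3])
print(medoids, clusters)
```

Because only the matrix is consulted, any distance can be plugged in without changing the clustering code.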
If you aren't looking for semantic text clustering (I can't tell from your original question whether this is a requirement), try using Levenshtein distance and building a similarity matrix with it. From this, you can use k-medoids to cluster and then validate your clustering using silhouette coefficients. Unfortunately, Levenshtein can be quite slow, but there are ways to speed it up through thresholds and other methods.
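For reference, here is a minimal sketch of Levenshtein distance using the standard two-row dynamic program, plus one simple threshold trick: the length difference is a lower bound on the distance, so pairs that differ too much in length can be skipped without running the full computation. The helper name `within_threshold` is hypothetical.

```python
def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    if len(a) < len(b):
        a, b = b, a                      # keep the shorter string as the row
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def within_threshold(a, b, threshold):
    """True if the edit distance is at most `threshold`.
    The cheap length check skips hopeless pairs before the full DP."""
    if abs(len(a) - len(b)) > threshold:
        return False
    return levenshtein(a, b) <= threshold


print(levenshtein("kitten", "sitting"))
print(within_threshold("cat", "caterpillar", 2))
```

More aggressive speedups exist (for example, abandoning the dynamic program early once every entry in a row exceeds the threshold), but they are beyond this sketch.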
Another way to deal with the curse of dimensionality would be to find "contrasting sets": conjunctions of attribute-value pairs that are more prominent in one group than in the rest. You can then use those contrasting sets as dimensions, either in lieu of the original attributes or together with a restricted number of attributes.