K-means taking a very long time to run on Python Spark - pyspark

I have a NumPy array of 0s and 1s with 37k rows and 6k columns.
When I try to run k-means clustering in PySpark, it takes almost forever and I never get the output. Is there any way to reduce the processing time, or any other tricks to solve this issue?

I think that you may have too many columns; you could be facing the curse of dimensionality. Wikipedia link
[...] The common theme of these problems is that when the dimensionality increases, the volume of the space increases so fast that the available data become sparse. This sparsity is problematic for any method that requires statistical significance. In order to obtain a statistically sound and reliable result, the amount of data needed to support the result often grows exponentially with the dimensionality. [...]
In order to solve this problem, have you considered reducing your columns and using only the relevant ones? Check this Wikipedia link again:
[...] Feature projection transforms the data in the high-dimensional space to a space of fewer dimensions. The data transformation may be linear, as in principal component analysis (PCA), but many nonlinear dimensionality reduction techniques also exist. [...]
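As an illustrative sketch only (the number of components and clusters are placeholders, and `df` is assumed to be a DataFrame with a `features` vector column built from your array), you could run PySpark's built-in PCA before KMeans so the clustering works on a few hundred dimensions instead of 6k:

```python
from pyspark.ml.feature import PCA
from pyspark.ml.clustering import KMeans
from pyspark.ml.linalg import Vectors

# Assumed: df has a "features" column of vectors, e.g. built from the NumPy array with
# df = spark.createDataFrame([(Vectors.dense(row),) for row in nparray], ["features"])

# Project the 6k binary columns onto 100 principal components (100 is a guess to tune).
pca = PCA(k=100, inputCol="features", outputCol="pca_features")
reduced = pca.fit(df).transform(df)

# Cluster in the reduced space; k=10 is only a placeholder for the real number of clusters.
kmeans = KMeans(k=10, featuresCol="pca_features", seed=42)
model = kmeans.fit(reduced)
predictions = model.transform(reduced)
```

Reducing the width of the vectors usually has a much bigger effect on k-means runtime than tuning the algorithm itself.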

Related

Naive Gaussian Elimination - Sparse and Full matrices

I am currently playing with numerical methods in MATLAB. I am trying to understand how the time taken to solve sparse/full matrices of the same dimensions depends on the size n.
My understanding is that, in general, sparse matrices take less time to solve than full matrices. However, when I used the naive Gaussian elimination method, the sparse matrices took significantly longer to solve. I have been researching the reasons online, but to no avail.
Thus, I am here with this question in the hope that someone will be able to enlighten me. Thanks in advance!
(Plots of the solve times for the sparse and the full case were attached to the question.)
Modern computers have pretty large amounts of Random Access Memory available, and the CPUs are also pretty fast. In that situation, systems of linear equations with matrices of up to several thousand rows/columns are solved very quickly when treated directly as dense, regardless of their actual sparsity. The difference between "dense" and "sparse" algorithms only becomes obvious, in favour of the "sparse" ones, when the matrix sizes grow large, above 10,000 or so (it all depends on the quality of the particular "sparse" algorithm, as well as on the CPU and RAM of the user's computer). "Sparse" algorithms need special schemes to store the matrix, to provide access to its elements, to modify them, and so on. Those overheads can make the solution of not-so-large matrices slower than a straightforward dense implementation.
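The question is about MATLAB, but the effect is easy to reproduce anywhere; here is a small Python/SciPy sketch (the size and density values are arbitrary) that times a dense and a sparse solve of the same system, so you can see where the crossover happens on your own machine:

```python
import time
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

n = 2000          # modest size, where a dense solver often still wins
density = 0.01    # 1% non-zeros, chosen arbitrarily

# Random sparse system; adding the identity keeps the diagonal non-zero.
A_sparse = sparse.random(n, n, density=density, format="csr") + sparse.eye(n, format="csr")
A_dense = A_sparse.toarray()
b = np.random.rand(n)

t0 = time.perf_counter()
x_dense = np.linalg.solve(A_dense, b)
t1 = time.perf_counter()
x_sparse = spsolve(A_sparse.tocsc(), b)
t2 = time.perf_counter()

print(f"dense solve:  {t1 - t0:.3f} s")
print(f"sparse solve: {t2 - t1:.3f} s")
```

Re-running this while increasing `n` (and keeping the density low) is a quick way to watch the dense advantage disappear as the matrices get larger.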

Clustering, Large dataset, learning large number vocabulary words

I am trying to do clustering on a large dataset:
rows: 1.4 million
cols: 900
expected number of clusters: 10,000 (10k)
The problem is: my dataset is 10 GB in size and I have 16 GB of RAM. I am trying to implement this in MATLAB. It would be a big help if someone could respond.
P.S. So far I have tried hierarchical clustering. In one paper, they suggested going for "fixed radius incremental pre-clustering", but I didn't understand the procedure.
Thanks in advance.
Use an algorithm that does not require a distance matrix. Instead, choose one that can be index accelerated.
Anything that needs a distance matrix will exceed your memory. But even an algorithm that does not (e.g. SLINK, which uses only O(n) memory) may still take too long. Indexes can reduce the runtime to O(n log n), although on your data the indexes may have problems.
Index-accelerated algorithms are, for example, OPTICS and DBSCAN.
Just don't use the really bad MATLAB scripts for these algorithms.
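As a purely illustrative sketch (in Python rather than MATLAB, with made-up data and placeholder parameters), an index-backed DBSCAN run with scikit-learn looks like this:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Stand-in for the real 1.4M x 900 dataset; a small random block keeps the sketch runnable.
X = np.random.rand(10_000, 900).astype(np.float32)

# algorithm="ball_tree" uses a spatial index for the neighbour queries instead of a full
# pairwise distance matrix; eps and min_samples are guesses that need tuning, and with
# 900 dimensions the index itself may degrade (as noted above).
db = DBSCAN(eps=3.0, min_samples=10, algorithm="ball_tree", n_jobs=-1)
labels = db.fit_predict(X)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters found:", n_clusters)
```

The key point is the absence of any n x n matrix: memory stays linear in the number of points.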

Handling very big Matrix in Matlab

I have the dataset of the Epinions website and want to implement a recommendation system.
As the first step I have to change the structure of the dataset; it should become a matrix with 120000 rows and 780000 columns.
It is a really big matrix, and because of lack of memory it is not possible to build it.
In my work every user should have an M-dimensional vector, where M is the total number of items, i.e. 780000.
I can't use a sparse matrix because I need the indexes and it is too slow.
What can I do now? How can I hold this big dataset in MATLAB?
You can try different things to reduce the amount of data:
Take a random subset of your observations: 120,000 observations is quite a lot; try randomly splitting it into several smaller subsets and check the performance of the system on each.
Use PCA to reduce the dimensionality of your data: 780,000 dimensions is A LOT. You will probably get a drastic reduction of the number of dimensions with PCA.
If your data is mostly zero or constant, you can actually use sparse matrices. Sparse matrices keep track of the indexes of your non-zero entries, so don't worry about that.
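To illustrate the last point (a Python/SciPy sketch rather than MATLAB, with tiny stand-in sizes and made-up ratings), a sparse user-item matrix still supports ordinary indexing, and a PCA-like truncated SVD can be run directly on it:

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# Stand-in sizes; the real matrix would be 120000 x 780000.
n_users, n_items = 1_000, 5_000

# Build the matrix from (user, item, rating) triples; only non-zeros are stored.
users = np.array([0, 0, 1, 2])
items = np.array([5, 42, 5, 100])
ratings = np.array([4.0, 5.0, 3.0, 1.0])
R = sparse.csr_matrix((ratings, (users, items)), shape=(n_users, n_items))

# Ordinary index access still works on the sparse matrix.
print(R[0, 42])        # 5.0
print(R[1].nonzero())  # the items rated by user 1

# PCA-style reduction that accepts sparse input directly (TruncatedSVD, as used for LSA).
svd = TruncatedSVD(n_components=50, random_state=0)
user_factors = svd.fit_transform(R)
print(user_factors.shape)  # (1000, 50)
```

At the full 120000 x 780000 size the dense form would need roughly 750 GB in double precision, while the sparse form only grows with the number of ratings.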

How to Sub-Sample Dataset

I'm going to implement SVM (support vector machines) and various other classification algorithms.
But my training dataset is 10 GB. How can I sub-sample it?
This is a very basic question, but I'm a beginner.
Thanks for the help.
The first thing you should do is reduce the number of samples (rows). LibSVM provides a very useful Python script for that. If your dataset has N samples and you want to downsample it to N - K samples, you can use that script to either (1) randomly remove K samples from your data, or (2) remove K samples using stratified sampling. The latter is recommended.
It is much more complicated to reduce the number of features (columns). You can't (or rather shouldn't) remove them randomly. There are many algorithms for that, usually called data reduction algorithms. The most used one is PCA, but it is not as simple to use.
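If you would rather do the stratified downsampling in plain Python instead of the LibSVM script, a small scikit-learn sketch (the file name and the 10% fraction are placeholders) is:

```python
from sklearn.datasets import load_svmlight_file
from sklearn.model_selection import train_test_split

# Load a LibSVM/SVMlight-format file; "train.libsvm" is a placeholder path.
X, y = load_svmlight_file("train.libsvm")

# Keep 10% of the rows while preserving the class proportions (stratified sampling).
X_sub, _, y_sub, _ = train_test_split(
    X, y, train_size=0.1, stratify=y, random_state=0
)
print(X_sub.shape)
```

Stratification matters because a plain random subset of an imbalanced dataset can easily lose most of the minority class.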
It depends on your data.
Since you're asking a basic-level question, I guess the best approach to start with is to cut down your sample size considerably. Once that is done, reduce the number of features to a manageable size.
Once the dataset is small and simple enough, you can then consider adding more attributes or samples as fits the problem at hand.
Hope this helps!

Data clustering algorithm

What is the most popular text clustering algorithm that deals with large dimensions and huge datasets and is fast?
I am getting confused after reading so many papers and so many approaches; now I just want to know which one is used most, so that I have a good starting point for writing a document clustering application.
To deal with the curse of dimensionality you can try to determine the blind sources (i.e. topics) that generated your dataset. You could use Principal Component Analysis or Factor Analysis to reduce the dimensionality of your feature set and to compute useful indexes.
PCA is what is used in Latent Semantic Indexing, since SVD can be shown to be equivalent to PCA :)
Remember that you can lose interpretability when you obtain the principal components of your dataset or its factors, so you may want to go the Non-Negative Matrix Factorization route. (And here is the punch line: k-means is a particular case of NNMF!) In NNMF the dataset can be explained just by its additive, non-negative components.
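As an illustration of the two routes mentioned above (LSI via a truncated SVD of the tf-idf matrix, and NMF), here is a hedged scikit-learn sketch with a toy corpus and placeholder component counts:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF, TruncatedSVD

docs = [
    "the cat sat on the mat",
    "dogs chase cats around the garden",
    "stock markets fell sharply today",
    "investors sold shares as markets dropped",
]

tfidf = TfidfVectorizer().fit_transform(docs)

# LSI route: truncated SVD of the tf-idf matrix (works directly on sparse input).
lsi = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)

# NMF route: additive, non-negative document/topic factors.
nmf = NMF(n_components=2, init="nndsvd", max_iter=500, random_state=0)
doc_topics = nmf.fit_transform(tfidf)

print(lsi.shape, doc_topics.shape)  # (4, 2) (4, 2)
```

Clustering the rows of either `lsi` or `doc_topics` (e.g. with k-means) is then much cheaper than clustering the raw high-dimensional vectors.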
There is no one-size-fits-all approach. Hierarchical clustering is always an option. If you want distinct groups formed out of the data, you can go with k-means clustering (it is also supposedly less computationally intensive).
The two most popular document clustering approaches are hierarchical clustering and k-means. k-means is faster, as it is linear in the number of documents, whereas hierarchical clustering is quadratic, but hierarchical clustering is generally believed to give better results. Each document in the dataset is usually represented as an n-dimensional vector (n is the number of words), with the magnitude of the dimension corresponding to each word equal to its term frequency-inverse document frequency score. The tf-idf score reduces the importance of high-frequency words in the similarity calculation. Cosine similarity is often used as the similarity measure.
A paper comparing experimental results between hierarchical clustering and bisecting k-means, a cousin algorithm to k-means, can be found here.
The simplest approaches to dimensionality reduction in document clustering are: a) throwing out all rare and highly frequent words (say those occurring in less than 1% and more than 60% of documents; these thresholds are somewhat arbitrary, and you need to try different ranges on each dataset to see the impact on results), b) stopping: throwing out all words on a stop list of common English words (such lists can be found online), and c) stemming, i.e. removing suffixes to leave only word roots. The most common stemmer is the one designed by Martin Porter; implementations in many languages can be found here. Usually this will reduce the number of unique words in a dataset to a few hundred or a few thousand, and further dimensionality reduction may not be required. Otherwise, techniques like PCA can be used.
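A hedged scikit-learn sketch of that pipeline (the thresholds, stop-word list, tiny corpus, and cluster count are all placeholders; stemming is left out because it needs an extra library such as NLTK):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "stocks fell sharply on wall street today",
    "the market rallied after the earnings report",
    "the cat chased the mouse across the kitchen",
    "my dog loves chasing cats in the garden",
]

# a) min_df / max_df drop rare and very frequent words (raise min_df on a real corpus),
# b) stop_words="english" removes common English words.
vectorizer = TfidfVectorizer(min_df=1, max_df=0.6, stop_words="english")
X = vectorizer.fit_transform(docs)

# k-means on the tf-idf vectors; n_clusters=2 is a placeholder for the real topic count.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)
print(labels)
```

On a real corpus the min_df/max_df and stop-word filtering typically shrinks the vocabulary from tens of thousands of words to a few thousand, which is exactly the reduction described above.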
I would stick with k-medoids, since you can compute the distance from any point to any point at the beginning of the algorithm. You only need to do this once, and it saves you time, especially if there are many dimensions. This algorithm works by choosing as the center of a cluster the point that is nearest to it, rather than a centroid calculated from the averages of the points belonging to that cluster. Therefore all possible distance calculations are already done for you in this algorithm.
In the case where you aren't looking for semantic text clustering (I can't tell whether this is a requirement from your original question), try using Levenshtein distance and building a similarity matrix with it. From this, you can use k-medoids to cluster, and subsequently validate your clustering using silhouette coefficients. Unfortunately, Levenshtein can be quite slow, but there are ways to speed it up through thresholds and other methods.
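A minimal sketch of that idea in Python, assuming a small list of strings and that the scikit-learn-extra package (which provides KMedoids) is installed; the edit-distance function is written out by hand to keep the example self-contained:

```python
import numpy as np
from sklearn.metrics import silhouette_score
from sklearn_extra.cluster import KMedoids

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

texts = ["apple pie", "apple tart", "pecan pie", "stock market", "stock prices"]

# Pairwise distance matrix: O(n^2) Levenshtein calls, the slow part the answer warns about.
D = np.array([[levenshtein(a, b) for b in texts] for a in texts], dtype=float)

km = KMedoids(n_clusters=2, metric="precomputed", random_state=0).fit(D)
print(km.labels_)
print("silhouette:", silhouette_score(D, km.labels_, metric="precomputed"))
```

Because both KMedoids and the silhouette coefficient accept a precomputed matrix, the expensive Levenshtein computations really are done only once.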
Another way to deal with the curse of dimensionality would be to find 'contrasting sets': conjunctions of attribute-value pairs that are more prominent in one group than in the rest. You can then use those contrasting sets as dimensions, either in lieu of the original attributes or with a restricted number of attributes.