I need to run Principal Components Analysis and K-means clustering on a large-ish dataset (around 10 GB) which is spread out over many files. I want to use Apache Spark for this since it's known to be fast and distributed.
I know that Spark's MLlib supports PCA, and also PCA followed by K-means.
However, I haven't found an example which demonstrates how to do this with many files in a distributed manner.
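To make the question concrete, here is roughly the kind of pipeline I have in mind, sketched with spark.ml; the file path, the column handling, and the choices of 50 principal components and k = 10 are all placeholders I made up:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.{PCA, VectorAssembler}
import org.apache.spark.sql.SparkSession

object PcaKMeansSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("pca-kmeans").getOrCreate()

    // A glob reads the many files into a single distributed DataFrame;
    // the path and schema handling here are placeholders.
    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("hdfs:///data/mydataset/*.csv")

    // Assemble the numeric columns into a single feature vector.
    val assembler = new VectorAssembler()
      .setInputCols(df.columns)          // assumes all columns are numeric
      .setOutputCol("features")

    // Reduce dimensionality before clustering; 50 components is a guess.
    val pca = new PCA()
      .setInputCol("features")
      .setOutputCol("pcaFeatures")
      .setK(50)

    // Cluster the PCA-projected points; k = 10 is a placeholder.
    val kmeans = new KMeans()
      .setFeaturesCol("pcaFeatures")
      .setK(10)
      .setSeed(1L)

    val model = new Pipeline().setStages(Array(assembler, pca, kmeans)).fit(df)
    model.transform(df).select("prediction").show(5)

    spark.stop()
  }
}
```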
Related
I have been working on clustering a dataset in Scala using Spark 2.2.0. Now that I have built the clusters, I want to test/evaluate their quality. I have been able to compute the Within Set Sum of Squared Errors for each value of K, but I was hoping to run a silhouette test. Could anyone please point me to relevant functions or packages for doing this in Scala?
Silhouette is not scalable. It uses pairwise distances, which will always take O(n^2) time to compute.
Have you considered using the Within Set Sum of Squared Errors, which is already implemented in MLlib (http://spark.apache.org/docs/latest/ml-clustering.html#k-means)? It can also help with determining the number of clusters (see Cluster analysis in R: determine the optimal number of clusters).
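For example, a rough sketch in Scala: it assumes `data` is a DataFrame with a vector column named "features", and uses `computeCost`, which is the Spark 2.x API (later deprecated in favour of ClusteringEvaluator):

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.sql.DataFrame

// Fit k-means for several values of K and return the Within Set Sum of
// Squared Errors for each, so an "elbow" can be picked by eye.
def wssseByK(data: DataFrame, ks: Seq[Int]): Seq[(Int, Double)] =
  ks.map { k =>
    val model = new KMeans()
      .setK(k)
      .setSeed(1L)
      .setFeaturesCol("features")
      .fit(data)
    (k, model.computeCost(data))   // Spark 2.x API
  }

// wssseByK(data, 2 to 10).foreach { case (k, cost) => println(s"k=$k WSSSE=$cost") }
```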
I am currently looking for an algorithm in Apache Spark (Scala/Java) that is able to cluster data that has numeric and categorical features.
As far as I have seen, there is an implementation of k-medoids and k-prototypes for pyspark (https://github.com/ThinkBigAnalytics/pyspark-distributed-kmodes), but I could not find something similar for the Scala/Java version I am currently working with.
Is there another recommended algorithm to achieve similar things for Spark running Scala? Or am I overlooking something and could actually make use of the pyspark library in my Scala project?
If you need further information or clarification feel free to ask.
I think you first need to convert your categorical variables to numbers using OneHotEncoder; then you can apply a clustering algorithm from MLlib (e.g. k-means). I also recommend scaling or normalizing the features before clustering, since the algorithm is distance sensitive.
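A minimal sketch of such a pipeline in Scala, with made-up column names ("color" categorical, "height" and "weight" numeric); depending on your Spark version, OneHotEncoder is either a transformer (2.2) or an estimator (3.x), but both fit into a Pipeline the same way:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.{OneHotEncoder, StandardScaler, StringIndexer, VectorAssembler}

// Index the categorical column, then one-hot encode it.
val indexer = new StringIndexer()
  .setInputCol("color").setOutputCol("colorIdx")

val encoder = new OneHotEncoder()
  .setInputCol("colorIdx").setOutputCol("colorVec")

// Combine encoded and numeric columns into one feature vector.
val assembler = new VectorAssembler()
  .setInputCols(Array("colorVec", "height", "weight"))
  .setOutputCol("rawFeatures")

// Scale features so no single dimension dominates the Euclidean distance.
val scaler = new StandardScaler()
  .setInputCol("rawFeatures").setOutputCol("features")

val kmeans = new KMeans().setFeaturesCol("features").setK(5).setSeed(1L)

val pipeline = new Pipeline()
  .setStages(Array(indexer, encoder, assembler, scaler, kmeans))

// val clustered = pipeline.fit(df).transform(df)   // df is your input DataFrame
```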
I have 100,000 sentences that I've processed into TF-IDF vectors using scikit-learn's TfidfVectorizer with highly customized stopwords and NLP stemming. My goal is to cluster the sentences using DBSCAN or another density-based clustering algorithm to discover similar sentences.
In scikit-learn's dbscan implementation, I run out of memory when I cluster more than 40,000 sentences. I have seen suggestions to use ELKI's Java clustering GUI. I'd like to try clustering in Java, but I cannot find a method for moving my TF-IDF vectors from Python to ELKI. ELKI's documentation states that it can handle sparse vectors in a particular format or in .arff.
My most concrete question: can anyone suggest how to move the TF-IDF vectors from scikit-learn into a format that can be loaded into ELKI?
Will ELKI manage memory better than scikit-learn? Or is this pointless work?
Up to now I have used RapidMiner for some data/text mining tasks, but with an increasing amount of data there are huge performance issues. AFAIK the RapidMiner Parallel Processing Extension is only available for the enterprise version - unfortunately I am limited to the community version.
Now I want to transfer the tasks to a high-performance cluster by using MATLAB (academic license). I did not find any information that the Parallel Computing Toolbox supports e.g. SVM or KNN.
Does MATLAB or any additional toolbox support the parallelization of data mining algorithms?
Most data mining and machine learning functionality for MATLAB is contained within Statistics Toolbox (in recent versions, that's called Statistics and Machine Learning Toolbox). To enable parallelization, you'll also need Parallel Computing Toolbox, and to enable that parallelization to be carried out on an HPC cluster, you'll need to install MATLAB Distributed Computing Server on the cluster.
There are lots of ways that you might want to parallelize data mining tasks - for example, you might want to parallelize an individual learning task, or parallelize a cross-validation, or parallelize several learning tasks across multiple datasets.
The first is possible for some, but not all of the data mining algorithms in Statistics Toolbox. MathWorks are gradually introducing that piece by piece. For example, kmeans is parallelized, and there is a parallelized algorithm for bagged decision trees, but I believe SVM learning is currently not parallelized. You'll need to look into the documentation for Statistics Toolbox to find out if the algorithms you require are on the list.
The second two are possible. Functionality in Statistics Toolbox for cross-validation (and bootstrapping, jack-knifing) is parallelized, as are some feature selection algorithms. And in order to parallelize running several jobs over multiple datasets, you can use functionality from Parallel Computing Toolbox (such as a parfor or parallel for loop) to iterate over them.
In addition, the upcoming R2015b release of MATLAB (out in September) will include GPU-enabled statistics functionality, providing additional speedups.
What I want to achieve is simply to find out which input points are included in a given cluster.
I have a personal dataset containing documents that have been manually grouped into 12 clusters.
I know how to interpret the k-means result in Mahout 0.7 using the NamedVector class and one of the dumpers (like clusterDumper). After clustering with the k-means driver, a directory named clusteredPoints is created which contains the clustering result, and using clusterDumper you can see the created clusters and the points that belong to each one. The link below gives a good solution for this:
How to read Mahout clustering output
But, as I mentioned in the title, I want the same capability for interpreting the Streaming KMeans result, which is a new feature in Mahout 0.8.
This feature uses a Centroid class for holding the data points and each cluster's seed. The output of the StreamingKMeans algorithm is only a sequence file containing the centroid vectors together with the keys and weights of each cluster. This output carries no information about the input data points, so I cannot see how they are distributed among the clusters, and therefore I cannot get a sense of the clustering's accuracy.
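One workaround I have been considering, sketched below in plain Scala (it assumes the centroid vectors have already been extracted from the output sequence file into dense arrays; reading the sequence file itself is not shown), is to assign every input point to its nearest centroid myself:

```scala
// Given the centroid vectors extracted from the StreamingKMeans output,
// assign each input point to the index of its nearest centroid.
def squaredDistance(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

def assignToClusters(points: Seq[Array[Double]],
                     centroids: Seq[Array[Double]]): Seq[Int] =
  points.map { p =>
    centroids.indices.minBy(i => squaredDistance(p, centroids(i)))
  }

// Example: two centroids, three points.
// val centroids = Seq(Array(0.0, 0.0), Array(10.0, 10.0))
// val points    = Seq(Array(1.0, 1.0), Array(9.0, 8.0), Array(0.5, 0.0))
// assignToClusters(points, centroids)   // Seq(0, 1, 0)
```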
So, how can I get this information from the clustering output? Is it simply not implemented, or have I just failed to find and use an existing solution? How can I analyze the result of StreamingKMeans?
Thanks.