Distributed Online LDA is implemented by both Spark MLlib and Gensim, and I would like to choose one of them for my project.
At present, I already have a Hadoop cluster running with 7 worker nodes.
Could someone with experience of both give a recommendation and point out their pros and cons, e.g. the difficulty of cluster setup, the speed of the modelling process, etc.? Thanks.
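In case it helps frame the comparison, here is a minimal PySpark sketch of the Spark MLlib route (the HDFS path, vocabulary layout and parameter values are placeholders, not a tested recipe); it assumes the corpus has already been tokenized into term-count vectors:

from pyspark import SparkContext
from pyspark.mllib.clustering import LDA
from pyspark.mllib.linalg import Vectors

sc = SparkContext(appName="online-lda")

# Assumed input layout: one document per line, written as space-separated
# term counts over a fixed vocabulary (the path is a placeholder).
data = sc.textFile("hdfs:///corpus/term_counts/*.txt")
parsed = data.map(lambda line: Vectors.dense([float(x) for x in line.split()]))

# MLlib's LDA expects an RDD of [documentId, termCountVector] pairs
corpus = parsed.zipWithIndex().map(lambda x: [x[1], x[0]]).cache()

# optimizer="online" selects the online variational Bayes variant,
# which runs on the Spark executors across the cluster
model = LDA.train(corpus, k=20, maxIterations=50, optimizer="online")
print(model.topicsMatrix())

Gensim's LdaModel/LdaMulticore, by contrast, streams documents through online variational Bayes on a single machine (its distributed mode coordinates workers over Pyro rather than YARN), so the Spark route is usually the one that reuses an existing Hadoop cluster more directly.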
Does anyone know the reason why non-linear SVMs have not been implemented in Apache Spark?
I was reading this page:
https://issues.apache.org/jira/browse/SPARK-4638
Look at the last comment. It says:
"Commenting here b/c of the recent dev list thread: Non-linear kernels for SVMs in Spark would be great to have. The main barriers are:
Kernelized SVM training is hard to distribute. Naive methods require a lot of communication. To get this feature into Spark, we'd need to do proper background research and write up a good design.
Other ML algorithms are arguably more in demand and still need improvements (as of the date of this comment). Tree ensembles are first-and-foremost in my mind."
The question is: Why is the kernelized SVM hard to distribute?
It is well known that non-linear SVMs often achieve better accuracy than linear ones when the data is not linearly separable.
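For context, this is the standard soft-margin dual problem that kernelized training solves (textbook material, nothing Spark-specific):

$$\max_{\alpha}\; \sum_{i=1}^{n} \alpha_i \;-\; \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \quad \text{subject to } 0 \le \alpha_i \le C,\; \sum_{i=1}^{n} \alpha_i y_i = 0$$

The objective couples every pair of training points through the $n \times n$ Gram matrix $K(x_i, x_j)$, so once the data is partitioned across workers, updating any dual variable needs kernel values against points stored on other machines; that all-pairs coupling is the communication cost the comment refers to. A linear SVM avoids it because each worker only needs its own points and a shared weight vector $w$.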
I have 100,000 sentences that I've processed into TF-IDF vectors using scikit-learn's TfidfVectorizer with highly customized stopwords and NLP stemming. My goal is to cluster the sentences using DBSCAN or another density-based clustering algorithm to discover similar sentences.
With scikit-learn's DBSCAN implementation, I run out of memory when I cluster more than 40,000 sentences. I have seen suggestions to use ELKI's Java clustering GUI. I'd like to try clustering in Java, but I cannot find a way to move my TF-IDF vectors from Python to ELKI. ELKI's documentation states that it can handle sparse vectors in a particular format or in .arff.
My most concrete question: can anyone suggest how to move TF-IDF vectors from scikit-learn into a format that can be loaded into ELKI?
Will ELKI manage memory better than scikit-learn? Or is this pointless work?
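On the concrete export question, here is a minimal sketch (assuming the standard sparse ARFF syntax; the file name and the tiny example corpus are placeholders for your own pipeline):

from sklearn.feature_extraction.text import TfidfVectorizer

# stand-in corpus; your customized stopwords/stemming would plug in here
sentences = ["the cat sat on the mat", "dogs and cats", "a completely different sentence"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(sentences)  # scipy.sparse CSR matrix, shape (n_docs, n_terms)

# write the matrix in sparse ARFF syntax, i.e. one "{index value, ...}" line per document
with open("tfidf.arff", "w") as f:
    f.write("@RELATION tfidf\n")
    for j in range(X.shape[1]):
        f.write("@ATTRIBUTE term%d NUMERIC\n" % j)
    f.write("@DATA\n")
    for i in range(X.shape[0]):
        row = X.getrow(i).tocoo()
        pairs = ", ".join("%d %.6f" % (j, v) for j, v in sorted(zip(row.col, row.data)))
        f.write("{%s}\n" % pairs)

As for memory: whether ELKI does better largely depends on letting it use an index structure for the neighbourhood queries instead of materializing all pairwise distances, which is what tends to exhaust memory in a DBSCAN run at this scale, so the move is not necessarily pointless.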
Up to now I have used RapidMiner for some data/text mining tasks, but with an increasing amount of data there are huge performance issues. AFAIK the RapidMiner Parallel Processing Extension is only available in the enterprise edition - unfortunately I am limited to the community edition.
Now I want to move the tasks to a high-performance cluster using MATLAB (academic license), but I could not find any information on whether the Parallel Computing Toolbox supports e.g. SVM or KNN.
Does MATLAB or any additional toolbox support the parallelization of data mining algorithms?
Most data mining and machine learning functionality for MATLAB is contained within Statistics Toolbox (in recent versions, that's called Statistics and Machine Learning Toolbox). To enable parallelization, you'll also need Parallel Computing Toolbox, and to enable that parallelization to be carried out on an HPC cluster, you'll need to install MATLAB Distributed Computing Server on the cluster.
There are lots of ways that you might want to parallelize data mining tasks - for example, you might want to parallelize an individual learning task, or parallelize a cross-validation, or parallelize several learning tasks across multiple datasets.
The first is possible for some, but not all of the data mining algorithms in Statistics Toolbox. MathWorks are gradually introducing that piece by piece. For example, kmeans is parallelized, and there is a parallelized algorithm for bagged decision trees, but I believe SVM learning is currently not parallelized. You'll need to look into the documentation for Statistics Toolbox to find out if the algorithms you require are on the list.
The second two are possible. Functionality in Statistics Toolbox for cross-validation (and bootstrapping, jack-knifing) is parallelized, as are some feature selection algorithms. And to parallelize running several jobs over multiple datasets, you can use functionality from Parallel Computing Toolbox (such as a parfor, i.e. parallel for, loop) to iterate over them.
In addition, the upcoming R2015b release of MATLAB (out in September) will include GPU-enabled statistics functionality, providing additional speedups.
I need to run Principal Components Analysis and K-means clustering on a large-ish dataset (around 10 GB) which is spread out over many files. I want to use Apache Spark for this since it's known to be fast and distributed.
I know that Spark supports PCA and also PCA + Kmeans.
However, I haven't found an example which demonstrates how to do this with many files in a distributed manner.
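For what it's worth, here is a minimal PySpark sketch of that pipeline (the HDFS glob, the CSV layout and the parameter values are assumptions about your data, not a tested recipe). The "many files" part is handled by the input path itself, since sc.textFile accepts a directory or a glob and distributes the records across the cluster:

from pyspark import SparkContext
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.feature import PCA
from pyspark.mllib.clustering import KMeans

sc = SparkContext(appName="pca-kmeans")

# read all files in one go as a single distributed RDD of lines
lines = sc.textFile("hdfs:///data/features/*.csv")
vectors = lines.map(lambda line: Vectors.dense([float(x) for x in line.split(",")])).cache()

# project onto the first 10 principal components, then cluster the projected points
pca_model = PCA(10).fit(vectors)
projected = pca_model.transform(vectors)

kmeans_model = KMeans.train(projected, k=5, maxIterations=20)
print(kmeans_model.clusterCenters)

Since both steps only ever see RDDs, nothing changes conceptually between one file and thousands of files; the 10 GB are simply split into partitions across your workers.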
I was going through the K-means algorithm in Mahout and, while debugging, I noticed that when creating the first clusters it runs the following code:
// the policy encapsulates the k-means-specific behaviour (and its convergence delta)
ClusteringPolicy policy = new KMeansClusteringPolicy(convergenceDelta);
// the classifier wraps the initial (prior) clusters together with that policy
ClusterClassifier prior = new ClusterClassifier(clusters, policy);
// the prior clusters are persisted as Hadoop sequence files at priorClustersPath
prior.writeToSeqFiles(priorClustersPath);
I was reading the description of these classes, but it was not clear to me what the cluster classifier and the policy actually mean.
Are they related to hierarchical clustering, centroid-based clustering, distribution-based clustering, etc.?
I ask because I do not see the benefit of, or the reason for, using this cluster classifier and policy in Mahout's K-means implementation.
The implementation shares code with other variants of k-means and similar algorithms such as Canopy pre-clustering and GMM.
These classes encode only the difference between these algorithms.
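To make that concrete in a schematic way (this is an illustration of the classifier/policy split, not Mahout's actual classes): a generic driver loop stays the same for every algorithm, and only the policy object supplies the algorithm-specific decision, e.g. k-means' hard assignment to the nearest center.

# Schematic only - NOT Mahout's real code. The generic loop below is shared,
# and only the policy changes between k-means, canopy, GMM, ...

class KMeansPolicy:
    """Hard assignment to the nearest center - the k-means-specific part."""
    def assign(self, point, centers):
        return min(range(len(centers)),
                   key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centers[i])))

def train(points, centers, policy, iterations=10):
    """Generic clustering driver shared by all policies."""
    for _ in range(iterations):
        sums = [[0.0] * len(centers[0]) for _ in centers]
        counts = [0] * len(centers)
        for point in points:
            i = policy.assign(point, centers)              # the policy decides the assignment
            counts[i] += 1
            sums[i] = [s + x for s, x in zip(sums[i], point)]
        centers = [[s / counts[i] for s in sums[i]] if counts[i] else centers[i]
                   for i in range(len(centers))]           # recompute the centers
    return centers

print(train([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]],
            [[0.0, 0.0], [5.0, 5.0]], KMeansPolicy()))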
Mahout is not a good place to study the k-means algorithm; the implementation is quite a mess. It's also slow - really, really slow. Most of the time, a single-CPU implementation will outright beat Mahout on anything that fits into memory, and maybe even on data that only fits on the disk of a single machine, because of all the MapReduce overhead.