I am planning to evaluate Spark for machine learning algorithm implementations. The algorithms I implement are usually expressed as matrix operations.
I have two questions regarding that:
Should algorithms be expressed as matrix operations when implementing them in Scala/Spark?
If so, does Scala/Spark have good matrix libraries?
By matrix libraries I mean something as powerful as the C counterparts: BLAS, Armadillo, etc.
Thanks!
Ajay
This will be covered by the MLbase project and the MLI API, which will be integrated into Spark. It is still at an early stage, but you can find an example of linear regression here.
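For local (single-machine) matrix work in Scala, Breeze is also worth a look; it is typically backed by BLAS/LAPACK via netlib. A quick illustrative sketch of its API (my own example, not from the MLI code, assuming Breeze is on the classpath):

    import breeze.linalg.{DenseMatrix, DenseVector, inv}

    // Illustrative only: a few BLAS/Armadillo-style operations in Breeze.
    val A = DenseMatrix((1.0, 2.0), (3.0, 4.0))
    val x = DenseVector(1.0, 1.0)

    val y    = A * x      // matrix-vector product
    val B    = A.t * A    // transpose and matrix-matrix product
    val Binv = inv(B)     // inverse (LAPACK-backed)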
I'm a bit new to the Spark ML API. I'm trying to do multi-label classification for 160 labels by training 160 classifiers (logistic regression, random forest, etc.). Once I train on Dataset[LabeledPoint], I'm finding it hard to get an API that gives me the probability of each class for a single example. I've read on SO that you can use the pipeline API and get the probabilities, but for my use case this is going to be hard, because I'd have to replicate my evaluation-feature RDD 160 times, get the probability for each class, and then do a join to rank the classes by their probabilities.
Instead, I want to keep just one copy of the evaluation features, broadcast the 160 models, and then do the predictions inside the map function. I find myself having to implement this, but I wonder whether Spark has another convenience API that does the same for different classifiers like logistic regression/RF, converting a Vector of features into the probability of it belonging to a class. Please let me know if there's a better way to approach multi-label classification in Spark.
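Roughly, what I have in mind is the sketch below; the per-model scoring function is just a placeholder for whatever call is actually accessible for a given classifier:

    import org.apache.spark.ml.linalg.Vector
    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.SparkSession

    // Sketch only: broadcast the trained models once and score every example
    // in a single pass. `Vector => Double` stands in for whatever per-model
    // call returns P(label | features).
    def rankLabels(
        spark: SparkSession,
        models: Seq[(String, Vector => Double)],   // (labelName, scoring function)
        examples: RDD[(Long, Vector)]              // (exampleId, features)
    ): RDD[(Long, Seq[(String, Double)])] = {
      val bcModels = spark.sparkContext.broadcast(models)
      examples.mapValues { features =>
        bcModels.value
          .map { case (label, score) => (label, score(features)) }
          .sortBy { case (_, p) => -p }            // highest probability first
      }
    }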
EDIT: I tried to create a function that transforms a vector to a label for random forest, but it's super annoying because I now have to clone large pieces of the tree-traversal code in Spark, and almost everywhere I hit dead ends because some function or variable was private or protected. Correct me if I'm wrong, but if this use case is not already implemented, I think it is at least well justified, since scikit-learn already has APIs in place to do this.
Thanks
Found the culprit line in the Spark MLlib code: https://github.com/apache/spark/blob/5ad644a4cefc20e4f198d614c59b8b0f75a228ba/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala#L224
The predict method is marked as protected, but it should actually be public for such use cases to be supported.
This has been fixed in version 2.4 as seen here:
https://github.com/apache/spark/blob/branch-2.4/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala
So upgrading to version 2.4 should do the trick ... although I don't think 2.4 is out yet, so it's a matter of waiting.
EDIT: for people who are interested: apparently this is not only beneficial for multi-label prediction; a 3-4x latency improvement has also been observed for regular classification/regression on single-instance/small-batch predictions (see https://issues.apache.org/jira/browse/SPARK-16198 for details).
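For anyone wanting the single-instance call, it then looks roughly like this (a sketch, assuming a Spark version where the method is public as discussed above; the model path is a placeholder):

    import org.apache.spark.ml.classification.LogisticRegressionModel
    import org.apache.spark.ml.linalg.Vectors

    // Sketch: score one feature vector directly, without building a
    // one-row DataFrame (requires a Spark version where predict is public).
    val model = LogisticRegressionModel.load("/path/to/saved/model")  // placeholder path
    val features = Vectors.dense(0.1, 2.3, -0.7)
    val prediction: Double = model.predict(features)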
I am currently looking for an algorithm in Apache Spark (Scala/Java) that can cluster data with both numeric and categorical features.
As far as I have seen, there is an implementation of k-medoids and k-prototypes for PySpark (https://github.com/ThinkBigAnalytics/pyspark-distributed-kmodes), but I could not find anything similar for the Scala/Java version I am currently working with.
Is there another recommended algorithm that achieves something similar in Spark with Scala? Or am I overlooking something and could I actually make use of the PySpark library in my Scala project?
If you need further information or clarification feel free to ask.
I think you first need to convert your categorical variables to numbers using OneHotEncoder; then you can apply a clustering algorithm from MLlib (e.g. k-means). I also recommend scaling or normalizing the features before clustering, since the algorithm is distance-sensitive.
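As a rough sketch of what that pipeline could look like (Spark 3.x API names; the column names "category" and "amount" are just placeholders):

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.clustering.KMeans
    import org.apache.spark.ml.feature.{OneHotEncoder, StandardScaler, StringIndexer, VectorAssembler}

    // Index the categorical column, one-hot encode it, assemble all features,
    // scale them, and finally run k-means.
    val indexer   = new StringIndexer().setInputCol("category").setOutputCol("categoryIdx")
    val encoder   = new OneHotEncoder().setInputCols(Array("categoryIdx")).setOutputCols(Array("categoryVec"))
    val assembler = new VectorAssembler().setInputCols(Array("categoryVec", "amount")).setOutputCol("rawFeatures")
    val scaler    = new StandardScaler().setInputCol("rawFeatures").setOutputCol("features")
    val kmeans    = new KMeans().setK(5).setFeaturesCol("features")

    val pipeline = new Pipeline().setStages(Array(indexer, encoder, assembler, scaler, kmeans))
    // val model = pipeline.fit(df)  // df: a DataFrame with "category" and "amount" columns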
Does anyone know the reason why a non-linear SVM has not been implemented in Apache Spark?
I was reading this page:
https://issues.apache.org/jira/browse/SPARK-4638
Look at the last comment. It says:
"Commenting here b/c of the recent dev list thread: Non-linear kernels for SVMs in Spark would be great to have. The main barriers are:
Kernelized SVM training is hard to distribute. Naive methods require a lot of communication. To get this feature into Spark, we'd need to do proper background research and write up a good design.
Other ML algorithms are arguably more in demand and still need improvements (as of the date of this comment). Tree ensembles are first-and-foremost in my mind."
The question is: Why is the kernelized SVM hard to distribute?
Non-linear SVMs are generally expected to outperform linear ones when the data is not linearly separable.
I am programming a machine learning algorithm in Scala. I don't think I will need matrices for it, but I will need vectors. The vectors won't even need a dot product; only element-wise operations. I see two options:
use a linear algebra library like Breeze
implement my vectors as Scala collections like List or Vector and work on them in a functional manner
The advantage of using a linear algebra library should be less programming for me... though will it, once I account for the learning curve? I have already started learning and using Breeze, and it doesn't seem that straightforward (the documentation is so-so). Another disadvantage is the extra dependency. I don't have much experience structuring my own projects (so far I have programmed on the job, where library usage was dictated).
So, in my particular case, is a linear algebra library such as Breeze worth the learning curve and the extra dependency, compared to programming the needed functionality myself?
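For concreteness, here is roughly the kind of comparison I have in mind (my understanding of Breeze's element-wise operators; corrections welcome):

    import breeze.linalg.DenseVector

    // Element-wise operations with Breeze...
    val a = DenseVector(1.0, 2.0, 3.0)
    val b = DenseVector(4.0, 5.0, 6.0)
    val sum     = a + b       // element-wise addition
    val product = a *:* b     // element-wise multiplication
    val scaled  = a * 2.0     // scalar multiplication

    // ...and the same with plain Scala collections.
    val x = Vector(1.0, 2.0, 3.0)
    val y = Vector(4.0, 5.0, 6.0)
    val sum2     = x.zip(y).map { case (p, q) => p + q }
    val product2 = x.zip(y).map { case (p, q) => p * q }
    val scaled2  = x.map(_ * 2.0)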
I'm doing some numerical estimation and correction with the Kalman filter, and would like to better estimate my Q and R parameters, preferably dynamically.
http://en.wikipedia.org/wiki/Kalman_filter#Estimation_of_the_noise_covariances_Qk_and_Rk
That article mentions that GNU Octave is currently the best way of determining these parameters from data:
http://en.wikipedia.org/wiki/GNU_Octave#C.2B.2B_integration
Unfortunately it is written for Matlab, though there is supposedly a C++ implementation. I'm very weak in C++ and would not even know how to import a C++ library and link it properly in Xcode. All of my C++ libraries to date have been wrapped in third-party Objective-C classes.
Has anyone used the C++ implementation for scientific computing or engineering applications on the iPhone? I'd appreciate any pointers or tutorials on how to do this kind of analysis with Objective-C.
Additional keywords:
estimating covariance from data
Autocovariance Least-Squares (ALS) technique
noise covariance
Thank you!
I do not know of any such C++ library. If you fancy doing numerical analysis on iOS, the best way to go is the Accelerate framework, specifically (from this description):
Linear Algebra: LAPACK and BLAS
The Basic Linear Algebra Subprograms (BLAS) and Linear Algebra Package (LAPACK) libraries contain—as you would expect—functions to perform linear algebra computations such as solving simultaneous linear equations, least squares solutions of linear equations, and eigenvalue problems. The BLAS library serves as a building block for the LAPACK library. The BLAS and LAPACK libraries are widely distributed and industry standard computational libraries. They are available on a number of different platforms and architectures. So, if you are already using these libraries you should feel right at home, as the APIs are exactly the same on Mac OS X.
You'll need a fairly good grounding in C, pointers, arrays and such, though; there's no way around that, I feel. There is a detailed description of how to use these linear algebra primitives to implement Kalman filtering (although it uses R, so it's probably not of much use to you).
This is an SO post on Kalman filtering that expresses my opinion quite well. I'm afraid the chances of finding a magic Objective-C wrapper for Kalman filtering are fairly low, though I would be very happy to be proven wrong!