Advantages of ARIMA with Spark - scala

I'm new to spark and scala. I'm working on a project doing forecasting with ARIMA models. I see from the posts below that I can train ARIMA models with spark.
I'm wondering what's the advantage of using spark for ARIMA models?
How to do time-series simple forecast?
https://badrit.com/blog/2017/5/29/time-series-analysis-using-spark#.W9ONGBNKi7M

The advantage of Spark is a distributed processing engine. If you have a huge amount of data which is typically the case in real-life systems, we need such processing engines. It will benefit in terms of scalability and performance to use any algorithm not only ARIMA on platforms like Spark.

Related

Is it possible to simultaneously use and train a neural network?

Is it possible to use Tensorflow or some similar library to make a model that you can efficiently train and use at the same time.
An example/use case for this would be a chat bot that you give feedback to. Somewhat like how pets learn (i.e. replicating what they just did for a reward). Or being able to add new entries or new responses they can use.
I think what you are asking is whether a model can be trained continuously without having to retrain it from scratch each time new labelled data comes in.
Answer to that is - Online models
There are models that can be trained continuously on data without worrying about training them from scratch. As per Wikipedia definition
Online machine learning is a method of machine learning in which data becomes available in sequential order and is used to update the best predictor for future data at each step, as opposed to batch learning techniques which generate the best predictor by learning on the entire training data set at once.
Some examples of such algorithms are
BernoulliNB
GaussianNB
MiniBatchKMeans
MultinomialNB
PassiveAggressiveClassifier
PassiveAggressiveRegressor
Perceptron
SGDClassifier
SGDRegressor
DNNs

Non-linear SVM is not available in Apache Spark

Does avyone know the reason why the Non-Linear SVM has not been implemented in Apache Spark?
I was reading this page:
https://issues.apache.org/jira/browse/SPARK-4638
Look at the last comment. It says:
"Commenting here b/c of the recent dev list thread: Non-linear kernels for SVMs in Spark would be great to have. The main barriers are:
Kernelized SVM training is hard to distribute. Naive methods require a lot of communication. To get this feature into Spark, we'd need to do proper background research and write up a good design.
Other ML algorithms are arguably more in demand and still need improvements (as of the date of this comment). Tree ensembles are first-and-foremost in my mind."
The question is: Why is the kernelized SVM hard to distribute?
Everybody knows that the non-linear SVMs exhibit better performance than the linear ones.

Is there a runWithValidation feature for Gradient Boosted Trees (GBT) in Spark ml?

Wondering if there a runWithValidation feature for Gradient Boosted Trees (GBT) in Spark ml to prevent overfitting. It's there in mllib which works with RDDs. I am looking the same for dataframes.
Found a K-Fold Cross Validation support in Spark. It can be done using CrossValidation() with Estimators, Evaluators, ParamMap and number of folds. This helps in finding the best parameters for the model i.e model tuning.
Refer http://spark.apache.org/docs/latest/ml-tuning.html for more details.

Does MATLAB support the parallelization of supervised machine learning algorithms? Alternatives?

Up to now I have used RapidMiner for some data/text mining tasks, but with an increasing amount of data there are huge performance issues. AFAIK the RapidMiner Parallel Processing Extensions is only available for the enterprise version - unfortunately I am limited to the community version.
Now I want to transfer the tasks to a high performance cluster by using MATLAB (academic license). I did not find any information that the Parallel Computation Toolbox supports e.g. SVM or KNN.
Does MATLAB or any additional libraries support the paralleliization of data mining algorithms?
Most data mining and machine learning functionality for MATLAB is contained within Statistics Toolbox (in recent versions, that's called Statistics and Machine Learning Toolbox). To enable parallelization, you'll also need Parallel Computing Toolbox, and to enable that parallelization to be carried out on an HPC cluster, you'll need to install MATLAB Distributed Computing Server on the cluster.
There are lots of ways that you might want to parallelize data mining tasks - for example, you might want to parallelize an individual learning task, or parallelize a cross-validation, or parallelize several learning tasks across multiple datasets.
The first is possible for some, but not all of the data mining algorithms in Statistics Toolbox. MathWorks are gradually introducing that piece by piece. For example, kmeans is parallelized, and there is a parallelized algorithm for bagged decision trees, but I believe SVM learning is currently not parallelized. You'll need to look into the documentation for Statistics Toolbox to find out if the algorithms you require are on the list.
The second two are possible. Functionality in Statistics Toolbox for cross-validation (and bootstrapping, jack-knifing) is parallelized, as are some feature selection algorithms. And in order to parallelize running several jobs over multiple datasets, you can use functionality from Parallel Computing Toolbox (such as a parfor or parallel for loop) to iterate over them.
In addition, the upcoming R2015b release of MATLAB (out in September) will include GPU-enabled statistics functionality, providing additional speedups.

How to do distributed Principal Components Analysis + Kmeans using Apache Spark?

I need to run Principal Components Analysis and K-means clustering on a large-ish dataset (around 10 GB) which is spread out over many files. I want to use Apache Spark for this since it's known to be fast and distributed.
I know that Spark supports PCA and also PCA + Kmeans.
However, I haven't found an example which demonstrates how to do this with many files in a distributed manner.