I'm trying to find out whether it is possible to do incremental training of an ALS model on Kinesis streaming data using MLlib in Apache Spark.
I have real-time user interactions coming in from a Kinesis stream, but to get updated prediction results I need to train the model on the whole dataset, which takes some time.
I'm trying to figure out if I can do incremental training of the ALS model on the streaming data, but cannot find an answer.
Incremental training of ALS
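For context, MLlib's ALS estimator does not expose an incremental / partial-fit style update, so the usual workaround is to land the Kinesis interactions somewhere durable and refit the model on a schedule. Below is a minimal sketch of such a periodic full retrain; the S3 paths, column names, and ALS parameters are assumptions for illustration, not a definitive setup.

from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("als-periodic-retrain").getOrCreate()

# hypothetical locations: historical ratings plus a batch landed by the streaming job
historical = spark.read.parquet("s3://bucket/ratings/")
new_batch = spark.read.parquet("s3://bucket/ratings_from_kinesis/")
ratings = historical.unionByName(new_batch)

als = ALS(userCol="userId", itemCol="itemId", ratingCol="rating",
          rank=10, maxIter=10, coldStartStrategy="drop")
model = als.fit(ratings)  # full refit each cycle; ALS has no incremental update
model.write().overwrite().save("s3://bucket/models/als_latest")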
Related
I have trained a PySpark model and saved the pipeline and the model to storage.
Now I am trying to predict on future datasets by loading the pipeline.
from pyspark.ml import PipelineModel

# load the fitted pipeline and apply it to the new data
pipelineModel = PipelineModel.load("s3://data-production/pipelineModel_v1")
new_test = pipelineModel.transform(new_df1)
Here I am getting an error on my production dataset.
How can I make sure that the columns from the training dataset are also present in the production data in PySpark?
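One common workaround, shown as a rough sketch below, is to add any column the fitted pipeline expects but the new data lacks before calling transform(). The training_schema dict and its column names and types are hypothetical; in practice you would persist the training schema alongside the pipeline.

from pyspark.sql import functions as F

# hypothetical schema captured at training time
training_schema = {"age": "double", "income": "double", "country": "string"}

for col_name, dtype in training_schema.items():
    if col_name not in new_df1.columns:
        # fill in columns the pipeline expects but the production data lacks
        new_df1 = new_df1.withColumn(col_name, F.lit(None).cast(dtype))

new_test = pipelineModel.transform(new_df1)

Whether null placeholders are acceptable depends on the pipeline stages; some transformers will need an explicit default value instead.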
Is it possible to use TensorFlow or some similar library to make a model that you can efficiently train and use at the same time?
An example/use case for this would be a chatbot that you give feedback to, somewhat like how pets learn (i.e. replicating what they just did for a reward), or being able to add new entries or new responses it can use.
I think what you are asking is whether a model can be trained continuously without having to retrain it from scratch each time new labelled data comes in.
The answer to that is: online models.
There are models that can be trained continuously on new data without having to train them from scratch. Per the Wikipedia definition:
Online machine learning is a method of machine learning in which data becomes available in sequential order and is used to update the best predictor for future data at each step, as opposed to batch learning techniques which generate the best predictor by learning on the entire training data set at once.
Some examples of such algorithms (most of them scikit-learn estimators that expose a partial_fit method) are listed below; a minimal partial_fit sketch follows the list:
BernoulliNB
GaussianNB
MiniBatchKMeans
MultinomialNB
PassiveAggressiveClassifier
PassiveAggressiveRegressor
Perceptron
SGDClassifier
SGDRegressor
DNNs
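Here is a minimal sketch of online learning with scikit-learn's partial_fit API. The data is synthetic and the choice of SGDClassifier is just an example; in a real system each call would use the latest batch of labelled feedback.

import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier()
classes = np.array([0, 1])  # all classes must be declared on the first partial_fit call

for step in range(10):  # pretend these are successive batches of feedback
    X_batch = np.random.rand(32, 4)
    y_batch = np.random.randint(0, 2, size=32)
    clf.partial_fit(X_batch, y_batch, classes=classes)

print(clf.predict(np.random.rand(3, 4)))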
I read this link, which explains it: Anomaly detection with PCA in Spark
But what's the code to extract the PCA features from the training data and project the test data onto them?
From what I understand, we have to use the same learned components on the test data as on the training data.
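A minimal sketch of that pattern in pyspark.ml is below: fit PCA on the training set only, then transform both sets with the same fitted model. The feature column names and the choice of k are assumptions for illustration.

from pyspark.ml.feature import PCA, VectorAssembler

# hypothetical feature columns assembled into a single vector column
assembler = VectorAssembler(inputCols=["f1", "f2", "f3"], outputCol="features")
train_vec = assembler.transform(train_df)
test_vec = assembler.transform(test_df)

pca = PCA(k=2, inputCol="features", outputCol="pca_features")
pca_model = pca.fit(train_vec)  # components learned from training data only

train_pca = pca_model.transform(train_vec)
test_pca = pca_model.transform(test_vec)  # test data projected with the same components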
I'm new to Spark and Scala. I'm working on a project doing forecasting with ARIMA models. I see from the posts below that I can train ARIMA models with Spark.
I'm wondering: what's the advantage of using Spark for ARIMA models?
How to do time-series simple forecast?
https://badrit.com/blog/2017/5/29/time-series-analysis-using-spark#.W9ONGBNKi7M
The advantage of Spark is that it is a distributed processing engine. If you have a huge amount of data, which is typically the case in real-life systems, you need such a processing engine. Any algorithm, not only ARIMA, benefits in terms of scalability and performance when run on a platform like Spark.
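One concrete way this plays out, sketched below with assumed column names and an assumed statsmodels ARIMA order, is fitting many independent per-series models in parallel with a grouped pandas UDF: Spark schedules one fit per series across the cluster.

import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

def forecast_one_series(pdf: pd.DataFrame) -> pd.DataFrame:
    # fit an ARIMA(1, 1, 1) on a single series and forecast 7 steps ahead
    series = pdf.sort_values("ds")["y"]
    fitted = ARIMA(series, order=(1, 1, 1)).fit()
    fc = fitted.forecast(steps=7)
    return pd.DataFrame({"series_id": pdf["series_id"].iloc[0], "forecast": fc.values})

# sales_df is assumed to have columns: series_id, ds (date), y (value)
result = sales_df.groupBy("series_id").applyInPandas(
    forecast_one_series, schema="series_id string, forecast double")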
I built and trained a neural network using the FANN library. This is only the initial training; the majority of the data will be collected online.
When the online data becomes available I want to improve the network using this new data (not retrain it from scratch, but make the previous training more accurate).
How can I do this kind of incremental training with FANN?
Do the initial training from a file; then change the training algorithm to incremental:
set_training_algorithm(FANN_TRAIN_INCREMENTAL)
and subsequently train incrementally (online), feeding the network one input/output pattern at a time (fann_train() trains a single iteration on one pattern).
Otherwise, consult the FANN reference manual:
http://fann.sourceforge.net/fann.html