How do I deploy complex machine learning predictors?

So after I have spent a few days cleaning my data, preprocessing, and experimenting with a few different models (e.g. in RStudio), how do I realistically deploy the solution?
It's straightforward if the model is simple, e.g. a decision tree or logistic regression: the model is obvious and the R predictor can be deployed onto a commercial R server with HTTP endpoints, etc.
My question is, what about complex preprocessing (e.g. PCA transforms, RBF kernels, or random forests of 100 trees)? Just as in the validation phase, I presume I would have to deploy R scripts to my deployment server to do the preprocessing, apply the PCA or RBF transforms, etc.?
Does this mean that for RBF I have to host the entire original training data set alongside my SVM predictor, the RBF transform being a function of the training set, or at least of the support vectors?
And for a random forest, I assume I have to upload all 500 or so trees as part of a very big model.

First, export your R solution (the data pre-processing steps and the model) to the PMML data format, using either the combination of the pmml and pmmlTransformations packages or the r2pmml package. Second, deploy the PMML solution using the Openscoring REST web service, JPMML-Spark, or whatever other PMML integration fits your deployment needs.
PMML has no problem representing PCA, SVM with an RBF kernel, tree ensembles, etc.
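As an illustration of the workflow (the R packages above follow the same export-then-deploy pattern), here is a minimal sketch in Python using the sklearn2pmml package; it assumes that package and a Java runtime are installed, and the pipeline contents are purely hypothetical:

```python
# Sketch: export a fitted preprocessing + model pipeline to PMML.
# Assumes the sklearn2pmml package and a Java runtime are available;
# the R pmml/r2pmml workflow described above is analogous.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline

X, y = load_iris(return_X_y=True)

# The PCA preprocessing and the RBF-kernel SVM travel together in one pipeline,
# so the exported PMML file captures both the transform and the model.
pipeline = PMMLPipeline([
    ("pca", PCA(n_components=2)),
    ("svm", SVC(kernel="rbf", probability=True)),
])
pipeline.fit(X, y)

# The resulting .pmml file can then be served by Openscoring, JPMML-Spark, etc.
sklearn2pmml(pipeline, "iris_svm.pmml")
```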

A pure, vanilla R solution to this problem: most ensemble methods provide a utility to dump/save the learnt model. Learning is a time-consuming, iterative process that should be done once; after learning, save/dump your R object. In deployment, keep only the scoring code, which performs all the data transformations and then the scoring.
For ordinary preprocessing you can reuse the R code that was used in training. For complex processing such as PCA, again save the fitted object and simply run new data through the saved PCA R object. Finally, after preprocessing, score/run your data on the learnt model to get the final results.
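The same fit-once, save, score-only-in-deployment pattern, sketched in Python/scikit-learn purely for illustration (in R the equivalent would be saveRDS()/readRDS() on the fitted objects); the file names and model choices here are hypothetical:

```python
# Sketch of the "fit once, save, score in deployment" pattern described above.
import joblib
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

# --- training side (slow, done once) ---
X, y = load_iris(return_X_y=True)
model = Pipeline([
    ("pca", PCA(n_components=2)),            # the fitted transform is stored with the model
    ("forest", RandomForestClassifier(n_estimators=100, random_state=0)),
])
model.fit(X, y)
joblib.dump(model, "model.joblib")           # dump the learnt object

# --- deployment side (scoring only) ---
scorer = joblib.load("model.joblib")         # load the saved object
print(scorer.predict(X[:5]))                 # run new data through PCA + forest
```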

Related

Implementation of knowledge flow environment in Weka using training and test data set

I would like to compare the ROC curves built by various classifiers using the WEKA Knowledge Flow platform. I have a training data set and a test data set. I want to build the model using the training data set and then supply the test data set to build the ROC curves. As per my understanding, I have created a Knowledge Flow environment.
However, I am not sure about my implementation.

Is it possible to simultaneously use and train a neural network?

Is it possible to use TensorFlow or some similar library to make a model that you can efficiently train and use at the same time?
An example/use case for this would be a chat bot that you give feedback to, somewhat like how pets learn (i.e. replicating what they just did for a reward), or being able to add new entries or new responses it can use.
I think what you are asking is whether a model can be trained continuously without having to retrain it from scratch each time new labelled data comes in.
The answer to that is: online models.
There are models that can be trained continuously on incoming data without retraining them from scratch. As per the Wikipedia definition:
Online machine learning is a method of machine learning in which data becomes available in sequential order and is used to update the best predictor for future data at each step, as opposed to batch learning techniques which generate the best predictor by learning on the entire training data set at once.
Some examples of such algorithms are listed below (see the sketch after this list):
BernoulliNB
GaussianNB
MiniBatchKMeans
MultinomialNB
PassiveAggressiveClassifier
PassiveAggressiveRegressor
Perceptron
SGDClassifier
SGDRegressor
DNNs
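The names above (apart from DNNs) correspond to scikit-learn estimators that expose a partial_fit method for incremental updates. A minimal sketch, using SGDClassifier and synthetic mini-batches purely for illustration:

```python
# Minimal sketch of online (incremental) learning via scikit-learn's partial_fit.
import numpy as np
from sklearn.linear_model import SGDClassifier

classes = np.array([0, 1])                   # all classes must be declared up front
model = SGDClassifier()

rng = np.random.default_rng(0)
for step in range(10):
    # Pretend a new mini-batch of labelled feedback arrives at each step.
    X_batch = rng.normal(size=(32, 4))
    y_batch = (X_batch[:, 0] > 0).astype(int)
    model.partial_fit(X_batch, y_batch, classes=classes)

# The model can be used for predictions between (and during) updates.
print(model.predict(rng.normal(size=(3, 4))))
```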

Neural Networks and correlation between input and output

I am trying to fit some inputs to predict an output in MATLAB using fitnet neural networks, but as a preprocessing step, prior to training my neural network, I want to find which input candidate vector correlates most with the output.
In the figure below, the output in yellow has five input candidates from which I need to choose. What command should I use in MATLAB, and how should I prepare the data (repeated around 1000 times) so that I can get a clear correlation between each input candidate and the output?
To find the correlation between a given feature and the target variable you can use R = corrcoef(A,B), but... do not do it!
This process makes no sense and will probably be harmful to the whole pipeline. You would be removing part of the information from your data so that only features which have an independent, linear relation to the target variable persist. Then you would apply a highly non-linear model, which exploits co-occurrences and feature correlations. These two steps are completely incompatible.
The only case where this is consistent is when your data is very simple and can pretty much be modelled with a linear model; then a neural net will work as well, but then there is no point in using a neural net in the first place, so just apply linear regression.
Consequently: do not perform feature selection unless you have to. Try to build a good model without doing it, and if you do have to remove some features (maybe obtaining them is an expensive process?), use post-hoc analysis of the trained model to remove features which are not used by that model. Do not split your problem into multiple, independent processes if you do not have to (unless you can show that this decomposition does not harm the process; in the case of feature selection followed by a regressor this is not true, as you cannot construct a valid feature-selection supervision signal without a trained regressor).
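To make the post-hoc suggestion concrete, here is a hedged sketch (in Python/scikit-learn rather than MATLAB, purely for illustration, on synthetic data): fit the model on all candidate inputs first, then use permutation importance to see which inputs the fitted model actually relies on:

```python
# Sketch of post-hoc feature relevance: fit on everything, then measure
# how much each feature matters to the fitted model.
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))               # five candidate input signals
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2     # target depends on only two of them

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
model.fit(X_tr, y_tr)

# Permutation importance: shuffle one feature at a time and see how much the
# held-out score drops; features with ~zero drop are candidates for removal.
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
for i, imp in enumerate(result.importances_mean):
    print(f"feature {i}: importance {imp:.3f}")
```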

libSVM outputs "Line search fails in two-class probability estimates"

When I tried to train an SVM (trainsvm function) with an RBF kernel, the libSVM library output "Line search fails in two-class probability estimates" during training.
After training, the training accuracy of the model is just 20%.
I think I might be missing something, and that it is related to the message.
For more information about my project: I'm dealing with the PASCAL VOC action classification problem, and I'm trying to follow this method:
http://www.ifp.illinois.edu/~jyang29/papers/CVPR09-ScSPM.pdf
There are 1300 training images and 11 classes.
After making codebooks and sparse coding, the dimension of the feature vector is 2688 and the number of training examples is 1370.
You need to do a grid search, either using cross-validation or a separate validation data set, to get good values for C and gamma. Libsvm has a script called grid.py that is useful for this. I noticed you tagged this with matlab; using grid.py needs command-line tools and a Python installation (IMO this generally works out better than with MATLAB, especially if you have some big machines to run many jobs in parallel).
I recommend that you read the libsvm guide if you haven't already done so: http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf
I also suggest you initially use the same dataset as used for the paper as occasionally published algorithms only work well on the dataset chosen for the paper.
Lastly, you could contact the authors of the paper.
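The same C/gamma search can also be done without grid.py; below is a minimal sketch using scikit-learn's GridSearchCV (whose SVC wraps libsvm) over a log-spaced grid, with feature scaling included since unscaled features are another common cause of poor SVM accuracy. The dataset and grid values are placeholders, not the poster's setup:

```python
# Sketch: cross-validated grid search over C and gamma for an RBF-kernel SVM.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

param_grid = {
    "svc__C": np.logspace(-2, 3, 6),         # 0.01 ... 1000
    "svc__gamma": np.logspace(-4, 1, 6),     # 1e-4 ... 10
}
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
search = GridSearchCV(pipe, param_grid, cv=5, n_jobs=-1)
search.fit(X, y)

print("best params:", search.best_params_)
print("cross-validated accuracy:", search.best_score_)
```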
I asked the author of LIBSVM about this warning, and he replied that it can be ignored.

Why do we need training and test datasets in research?

I'm a newbie in the research area of data mining (text clustering) and I have a couple of questions regarding training and test datasets.
Does clustering need training and testing datasets?
Why do we need to separate data into training and test datasets?
Sorry for the rookie questions; I hope the experts in this group can help me.
As your question is on clustering:
In cluster analysis, there usually is no training or test data split.
You do cluster analysis when you do not have labels, so you cannot "train".
Training is a concept from machine learning, and train-test splitting is used to avoid overfitting.
But if you are not learning labels, you cannot overfit.
Properly used cluster analysis is a knowledge discovery method. You want to discover some new structure in your data, not rediscover something that is already labeled.
To train your model you need a set of relevant data that is similar, but not identical, to your testing data. For example, you could split up your data so that 0.7 of it is for training and the rest for testing. This allows your algorithm to get a feel for what it should be looking for. The remaining 0.3 of the data can then be used for testing, as it is a (hopefully) distinct set of information, which lets the algorithm be evaluated fairly.
Why split it up?
Well, if you train your algorithm on data A and then test it on data A, it will identify all the information correctly because that is exactly what it was trained on.
For example, if when learning addition you were given the sums 3+4, 4+5 and 6+9, which you correctly solved, it would be redundant to test your knowledge of addition using the same sums.
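A minimal sketch of that 0.7/0.3 split, shown with scikit-learn and a toy dataset purely for illustration:

```python
# Sketch: hold out 30% of the data so the model is evaluated on examples
# it has never seen during training.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 70% of the data fits the model; the held-out 30% evaluates it.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# Scoring on the training set is overly optimistic (the "same sums" case above);
# the test set gives an honest estimate of generalisation.
print("train accuracy:", model.score(X_train, y_train))
print("test accuracy:", model.score(X_test, y_test))
```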
Further information:
http://en.wikipedia.org/wiki/Natural_language_processing
http://www.nltk.org/book
Hope this helps.