Implementation of a Knowledge Flow environment in Weka using training and test data sets - classification

I would like to compare the ROC curves built by various classifiers using the WEKA Knowledge Flow platform. I have a training data set and a test data set. I want to build the model using the training data set and then supply the test data set to build the ROC curves. As per my understanding, I have created a knowledge flow environment.
However, I am not sure about my implementation.
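For reference, a rough programmatic equivalent of that flow (load the training data, build the classifier on it, evaluate on the separate test set, plot the ROC curve) is sketched below. This is only a sketch, assuming the python-weka-wrapper3 package and hypothetical ARFF file names and classifier choice that are not part of the original question.

```python
# Sketch only: assumes python-weka-wrapper3 and hypothetical file names.
import weka.core.jvm as jvm
from weka.core.converters import Loader
from weka.classifiers import Classifier, Evaluation
import weka.plot.classifiers as plcls  # plotting requires matplotlib

jvm.start()

loader = Loader(classname="weka.core.converters.ArffLoader")
train = loader.load_file("train.arff")   # hypothetical training set
train.class_is_last()                    # same role as the ClassAssigner step
test = loader.load_file("test.arff")     # hypothetical, separate test set
test.class_is_last()

cls = Classifier(classname="weka.classifiers.trees.J48")  # any classifier to compare
cls.build_classifier(train)              # model is built on the training set only

evl = Evaluation(train)
evl.test_model(cls, test)                # evaluated on the supplied test set
plcls.plot_roc(evl, class_index=[0, 1], wait=True)  # ROC curve(s) from the test-set evaluation

jvm.stop()
```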

Related

H2O Random Forest Impurity Measure for Classification

I currently use the DRF implementation of H2O-3 to create a binary classification model. However, I noticed that H2O only supports squared error as the impurity measure, which is usually used to construct regression trees, and no other measure such as Gini, which I think is better suited to classification tasks.
Within the documentation I was not able to identify the reasoning for applying this metric to classification tasks as well. Can anyone explain why this approach makes sense?
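Not part of the original question, but one standard observation may be relevant here (a general fact about impurity measures, not a statement from the H2O documentation): for a binary 0/1 target, the squared-error (variance) impurity of a node and the Gini impurity differ only by a constant factor, so they rank candidate splits identically. With $p$ the fraction of positives in a node:

```latex
% For a node with binary targets y_i \in \{0,1\} and positive fraction p:
\underbrace{\frac{1}{n}\sum_{i=1}^{n}\left(y_i-\bar{y}\right)^2 = p(1-p)}_{\text{squared error (variance)}}
\qquad
\underbrace{1 - p^2 - (1-p)^2 = 2p(1-p)}_{\text{Gini impurity}}
```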

Is it possible to simultaneously use and train a neural network?

Is it possible to use TensorFlow or some similar library to make a model that you can efficiently train and use at the same time?
An example/use case for this would be a chat bot that you give feedback to, somewhat like how pets learn (i.e. repeating what they just did for a reward), or being able to add new entries or new responses it can use.
I think what you are asking is whether a model can be trained continuously without having to retrain it from scratch each time new labelled data comes in.
The answer to that is: online models.
There are models that can be trained continuously on data without worrying about training them from scratch. As per the Wikipedia definition:
Online machine learning is a method of machine learning in which data becomes available in sequential order and is used to update the best predictor for future data at each step, as opposed to batch learning techniques which generate the best predictor by learning on the entire training data set at once.
Some examples of such algorithms are listed below (a minimal training sketch with scikit-learn's partial_fit follows the list):
BernoulliNB
GaussianNB
MiniBatchKMeans
MultinomialNB
PassiveAggressiveClassifier
PassiveAggressiveRegressor
Perceptron
SGDClassifier
SGDRegressor
DNNs
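As a minimal sketch (assuming scikit-learn and made-up toy data), incremental training with these estimators is exposed through the partial_fit method:

```python
# Minimal sketch of online/incremental learning with scikit-learn's partial_fit.
import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier()                 # linear model trained by stochastic gradient descent
classes = np.array([0, 1])            # all classes must be declared on the first call

# Initial batch of labelled data (toy data for illustration)
X0 = np.random.rand(100, 5)
y0 = np.random.randint(0, 2, size=100)
clf.partial_fit(X0, y0, classes=classes)

# Later, as new labelled feedback arrives, update the same model in place
X_new = np.random.rand(10, 5)
y_new = np.random.randint(0, 2, size=10)
clf.partial_fit(X_new, y_new)         # no retraining from scratch

print(clf.predict(X_new[:3]))         # the updated model can be used immediately
```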

How do I deploy Complex Machine Learning Predictors?

So after I have spent a few days cleaning my data, preprocessing, and experimenting with a few different models (e.g. in R Studio), how do I realistically deploy the solution?
It's straightforward if the model is a simple one, e.g. a decision tree or logistic regression: the model is obvious, and the R predictor can be deployed to a commercial R server with HTTP endpoints, etc.
My question is, what about complex preprocessing (e.g. PCA transforms, RBF kernels, or random forests of 100 trees)? Just as in the validation phase, I presume I would have to deploy R scripts that apply the preprocessing (PCA, RBF transformations, etc.) to my deployment server?
Does this mean that for the RBF I have to host the entire original training data set alongside my SVM predictor, the RBF transform being a function of the training set, or at least of the support vectors?
And for the random forest, I assume I have to upload all 500 or so trees as part of a very big model.
First, export your R solution (data pre-processing steps and the model) into the PMML data format using either the combination of the pmml and pmmlTransformations packages, or the r2pmml package. Second, deploy the PMML solution using the Openscoring REST web service, JPMML-Spark, or whatever other PMML integration fits your deployment needs.
PMML has no problem representing PCA, SVM with RBF kernel, tree ensembles, etc.
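The answer above uses R's pmml/r2pmml packages; purely as an illustrative analogue of the same export idea, a Python version with the sklearn2pmml package might look roughly like this (the pipeline contents, file name, and toy data are assumptions, not part of the answer):

```python
# Rough analogue of the PMML export workflow, using Python's sklearn2pmml package.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.pipeline import PMMLPipeline

X, y = make_classification(n_samples=200, n_features=10, random_state=0)  # toy data

# Pre-processing (PCA) and the model (RBF-kernel SVM) live in one pipeline,
# so both are exported together into the PMML document.
pipeline = PMMLPipeline([
    ("pca", PCA(n_components=5)),
    ("svm", SVC(kernel="rbf", probability=True)),
])
pipeline.fit(X, y)

sklearn2pmml(pipeline, "model.pmml")  # deployable via Openscoring, JPMML-Spark, etc.
```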
A pure, vanilla R solution to this problem: most of the ensemble methods provide a utility to dump/save the learnt model. Learning is a very time-consuming, iterative process that should be done once; after learning, save/dump your R object. The deployment then needs only scoring code, which performs all the data transformations and then the scoring.
For normal preprocessing you can reuse the R code that was used in training. For complex processing like PCA, again save the final model and simply score/run the data through the saved PCA R object. Finally, after preprocessing, score/run your data on the learnt model and get the final results.
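The same save-once / score-many pattern, shown as a sketch in Python with scikit-learn and joblib rather than R (an analogue with assumed file names and toy data): the fitted pipeline object already contains the PCA projection and the SVM's support vectors, so the raw training data does not need to be shipped to the deployment server.

```python
# Training side: fit once, then persist the fitted objects.
from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X_train, y_train = make_classification(n_samples=200, n_features=10, random_state=0)
model = Pipeline([("pca", PCA(n_components=5)), ("svm", SVC(kernel="rbf"))])
model.fit(X_train, y_train)
dump(model, "model.joblib")            # hypothetical path on the deployment server

# Scoring side: load the saved object and score new data; no retraining, no training data.
scorer = load("model.joblib")
X_new, _ = make_classification(n_samples=5, n_features=10, random_state=1)
print(scorer.predict(X_new))
```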

Self-organizing maps and learning vector quantization

Self-organizing maps are more suited to clustering (dimensionality reduction) than to classification, yet SOMs are used with learning vector quantization for fine-tuning. LVQ, however, is a supervised learning method, so to use a SOM with LVQ, the LVQ must be provided with a labelled training data set. Since a SOM only does clustering, not classification, and therefore cannot have labelled data, how can a SOM be used as an input for LVQ?
Does LVQ fine-tune the clusters in the SOM?
Before being used with LVQ, should the SOM be put through another classification algorithm so that it can classify the inputs, so that these labelled inputs may then be used in LVQ?
It must be clear that supervised learning differs from unsupervised learning in that, in the former, the target values are known.
Therefore, the output of a supervised model is a prediction.
The output of an unsupervised model, instead, is a label whose meaning we do not yet know. For this reason, after clustering, it is necessary to profile each of those new labels.
Having said that, you could label the dataset using an unsupervised learning technique such as a SOM. Then you should profile each class in order to be sure you understand the meaning of each class.
At this point, you can pursue two different paths, depending on your final objective:
1. use this new variable as a means of dimensionality reduction
2. use this new dataset, featuring the additional variable representing the class, as labelled data that you will try to predict using LVQ (a rough sketch of this option follows below)
Hope this can be useful!
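A rough sketch of option 2, assuming the minisom package for the SOM and, since scikit-learn ships no LVQ implementation, a nearest-centroid classifier as a prototype-based stand-in for the LVQ step (the data here is made up):

```python
# Sketch: label data with a SOM (unsupervised), then train a supervised prototype model on those labels.
import numpy as np
from minisom import MiniSom
from sklearn.neighbors import NearestCentroid

X = np.random.rand(300, 4)                      # toy, unlabelled data

# 1) Cluster with a small SOM; each sample's best-matching unit becomes its cluster label.
som = MiniSom(3, 3, input_len=4, sigma=1.0, learning_rate=0.5, random_seed=0)
som.train_random(X, 1000)
labels = np.array([np.ravel_multi_index(som.winner(x), (3, 3)) for x in X])

# 2) Profile each cluster (inspect its members) to attach a meaning to each label -- a manual step.

# 3) Use the now-labelled data set for a supervised, prototype-based model
#    (NearestCentroid here as a stand-in for LVQ, which would fine-tune the prototypes).
clf = NearestCentroid()
clf.fit(X, labels)
print(clf.predict(X[:5]))
```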

Testing with LIBSVM

I have been using libsvm in Matlab for our research, and the model's cross-validation accuracies are already quite good. However, I am having a small problem when trying to use the (already-trained) model for prediction. The svmpredict command requires us to specify the correct testing labels (used for measuring accuracy), but I want to deploy the model to classify new data (with unknown labels), so of course there are no labels for this testing data. Can I call svmpredict without specifying the (target) labels of the testing data? How do I deploy the models I have obtained in a production setting?
I don't know about Matlab, but in Python you have the same situation. It really doesn't matter: the labels are only used to compute the reported accuracy, so you can pass any placeholder value (e.g. all zeros) and predict with it; just ignore the accuracy value in the output.
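For example, with the official libsvm Python bindings (the Matlab interface behaves the same way), you can pass a dummy label vector and simply ignore the reported accuracy. The toy data and parameters below are assumptions for illustration only:

```python
# Sketch: predicting with libsvm when the true labels of the new data are unknown.
from libsvm.svmutil import svm_train, svm_predict   # in older installs: from svmutil import ...
import random

# Toy training data: 2 features, binary labels
X_train = [[random.random(), random.random()] for _ in range(100)]
y_train = [1 if x[0] + x[1] > 1 else -1 for x in X_train]
model = svm_train(y_train, X_train, "-t 2 -c 1")     # RBF kernel

# New, unlabelled data: pass dummy labels (zeros) just to satisfy the interface.
X_new = [[random.random(), random.random()] for _ in range(10)]
dummy = [0] * len(X_new)
pred_labels, accuracy, decision_values = svm_predict(dummy, X_new, model)

print(pred_labels)   # use these; the 'accuracy' computed against the dummy labels is meaningless
```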