Is there any way to reapply preprocessing done on a training dataset to a new dataset of experimental data, so that the transformed data can be submitted to the already trained classifier?
The preprocessor modifies the domain of the training data set. If you want to apply the same transformations to the testing (experimental) data, you apparently have to cast it into the same domain, as Orange's built-in predictors seem to do:
train = preprocess(train)
test = Table(train.domain, test)
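For a fuller picture, here is a minimal sketch of that pattern, assuming Orange 3 and a Normalize preprocessor as a stand-in for whatever preprocessing you actually use:

from Orange.data import Table
from Orange.preprocess import Normalize     # example preprocessor; substitute your own

data = Table("iris")                        # built-in dataset as stand-in data
train, test = data[:100], data[100:]        # pretend the last rows are new, experimental data

preprocess = Normalize()
train = preprocess(train)                   # preprocessing builds a new domain on the training data

# Casting the new data into that domain makes Orange recompute the transformed
# features from the originals, using the statistics stored with the training domain.
test = Table(train.domain, test)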
I am attempting to use KMeans clustering to create a feature for an XGBoost regression. The problem is, I am not sure whether this introduces data leakage. The data has a date column, so right now I am clustering on the first 70% of the rows sorted by date, and using that same 70% as my training set.
The target variable is included in the clustering. Using the cluster as a feature gives a huge boost to the test scores, so I worry that this is causing data leakage. However, the clusters used for the test scores are assigned to unseen rows in the test set.
Is this valid, or is it causing data leakage? Thank you
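For reference, here is a rough sketch of how the cluster feature can be built without leaking the target or the test rows into the clustering: fit KMeans on the training features only, then assign clusters to the test rows with the already fitted model. This assumes scikit-learn and the xgboost Python package, with stand-in data and placeholder column names:

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from xgboost import XGBRegressor

# Stand-in data with a date column; replace with your own frame.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "date": pd.date_range("2020-01-01", periods=500),
    "x1": rng.normal(size=500),
    "x2": rng.normal(size=500),
})
df["target"] = 2 * df["x1"] + rng.normal(size=500)

df = df.sort_values("date")
split = int(len(df) * 0.7)                       # first 70% by date is the training set
train, test = df.iloc[:split], df.iloc[split:]
feature_cols = ["x1", "x2"]                      # note: the target is not clustered on

# Fit the clustering on training features only, then only assign clusters to the test rows.
km = KMeans(n_clusters=8, n_init=10, random_state=0)
train_clusters = km.fit_predict(train[feature_cols])
test_clusters = km.predict(test[feature_cols])

model = XGBRegressor(n_estimators=200)
model.fit(train[feature_cols].assign(cluster=train_clusters), train["target"])
print(model.score(test[feature_cols].assign(cluster=test_clusters), test["target"]))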
So, after I have spent a few days cleaning my data, preprocessing it, and experimenting with a few different models (e.g. in RStudio), how do I realistically deploy the solution?
It's straightforward if the model is a simple one, e.g. a decision tree or logistic regression: the model itself is easy to express, and the R predictor can be deployed to a commercial R server with HTTP endpoints, etc.
My question is, what about complex preprocessing and models (e.g. PCA transforms, RBF kernels, or random forests of 100 trees)? Just as in the validation phase, I presume I would have to deploy R scripts to my deployment server to do the preprocessing, apply the PCA or RBF transforms, and so on?
Does this mean that for RBF I have to host the whole original training data set alongside my SVM predictor, the RBF transform being a function of the training set, or at least of the support vectors?
And for the random forest, I assume I have to upload all 500 or so trees as part of a very big model.
First, export your R solution (the data pre-processing steps and the model) into the PMML data format using either the combination of the pmml and pmmlTransformations packages, or the r2pmml package. Second, deploy the PMML solution using the Openscoring REST web service, JPMML-Spark, or whatever other PMML integration fits your deployment needs.
PMML has no problem representing PCA, SVM with RBF kernel, tree ensembles, etc.
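As an illustration of the scoring side, here is a rough Python sketch of sending one record to an Openscoring-style REST endpoint; the URL, model id, and field names are assumptions made up for the example, and the exact payload shape depends on your Openscoring version:

import requests

# Hypothetical deployment: an Openscoring server with a PMML model registered under the id "MyModel".
url = "http://localhost:8080/openscoring/model/MyModel"

record = {"arguments": {"x1": 1.5, "x2": "A"}}    # placeholder field names and values
response = requests.post(url, json=record)
response.raise_for_status()

print(response.json())                            # the server's response contains the prediction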
A pure, vanilla R solution for this problem: most of the ensemble methods provide a utility to dump/save the learnt model. Learning is a very time-consuming, iterative process that should be done once; after learning, save/dump your R object. The deployment should contain only scoring code, which does all the data transformations and then the scoring.
For ordinary preprocessing you can reuse the R code that was used in training. For complex processing like PCA, again save the fitted object and just score/run the new data through the saved PCA R object. Finally, after preprocessing, score/run your data through the learnt model and get the final results.
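The same save-once, score-many-times idea carries over to other stacks as well; for example, here is a rough Python/scikit-learn sketch of the workflow described above, with PCA and a random forest standing in for the complex preprocessing and model (data and file names are stand-ins):

import joblib
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline

# Stand-in training data; in practice this is your cleaned training set.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(200, 20)), rng.normal(size=200)

# Training time (done once): fit preprocessing + model together, then dump the fitted object.
pipeline = make_pipeline(PCA(n_components=10), RandomForestRegressor(n_estimators=500))
pipeline.fit(X_train, y_train)
joblib.dump(pipeline, "model.joblib")

# Deployment time: the scoring code only loads the saved object and runs new data through it.
scorer = joblib.load("model.joblib")
predictions = scorer.predict(rng.normal(size=(5, 20)))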
I trained an ensemble model (RUSBoost) for a binary classification problem with the function fitensemble() in Matlab 2014a. The training is performed with 10-fold cross-validation through the input parameter "kfold" of fitensemble().
However, the model output by this function cannot be used to predict the labels of new data with predict(model, Xtest). I checked the Matlab documentation, which says we can use the kfoldPredict() function to evaluate the trained model, but I did not find any way to pass new data to that function. Also, I found that the structure of a model trained with cross-validation is different from that of a model trained without cross-validation. So, could anyone please advise me how to use a model trained with cross-validation to predict labels for new data? Thanks!
kfoldPredict() needs a RegressionPartitionedModel or ClassificationPartitionedEnsemble object as input; this already contains the models and data for the k-fold cross-validation.
The RegressionPartitionedModel object has a field Trained, in which the learners trained for cross-validation are stored.
You can take any of these learners and use it like predict(learner, Xdata).
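For comparison, the same idea as a Python/scikit-learn analogue (not the MATLAB API): a cross-validation run can keep the model fitted in each fold, and any one of those fitted learners can then be used to predict on new data.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_validate

X, y = make_classification(n_samples=300, random_state=0)        # stand-in data

# 10-fold cross-validation; return_estimator keeps the learner fitted in each fold
# (the analogue of the Trained field of the partitioned MATLAB object).
results = cross_validate(AdaBoostClassifier(), X, y, cv=10, return_estimator=True)

fold_model = results["estimator"][0]                              # pick any of the 10 fitted learners
X_new = np.random.default_rng(0).normal(size=(5, X.shape[1]))     # stand-in new data
print(fold_model.predict(X_new))                                  # analogue of predict(learner, Xdata)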
Edit:
If k is too large, it is possible that there is too little meaningful data left in one or more folds, so the model for that iteration is less accurate.
There are no general rules for k, but k = 10, the MATLAB default, is a good starting point to play around with.
Maybe this is also interesting for you: https://stats.stackexchange.com/questions/27730/choice-of-k-in-k-fold-cross-validation
I built and trained a neural network using the FANN library. This is only an initial training; the majority of the data will be collected online.
When online data becomes available I want to improve the network using this new data (not re-train it from scratch, but make the previous training more accurate).
How can I do this kind of incremental training with FANN?
If you currently train from a file, change the training algorithm to incremental:
set_training_algorithm(FANN_TRAIN_INCREMENTAL)
and subsequently train incrementally (online), feeding new patterns to the network as they arrive (a sketch follows below).
Otherwise consult this:
http://fann.sourceforge.net/fann.html
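Here is a rough sketch of that online loop, assuming the fann2 Python bindings; the method and constant names mirror the C API with the fann_/FANN_ prefixes dropped, and the file names and input/target pattern are placeholders, so check your binding's documentation before relying on them:

from fann2 import libfann

ann = libfann.neural_net()
ann.create_from_file("initial_net.net")            # hypothetical network saved after the initial training

# Incremental training updates the weights after every single pattern.
ann.set_training_algorithm(libfann.TRAIN_INCREMENTAL)

# Whenever a new online observation arrives, run one training step on it.
new_input, new_target = [0.2, 0.7], [1.0]          # placeholder pattern; sizes must match the network
ann.train(new_input, new_target)

ann.save("updated_net.net")                        # persist the refined network for the next session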
I've developed a model using LIBSVM in Matlab. I chose the best parameters using cross-validation, and I obtained the model by training on the whole dataset. I use normalization to get better results:
maximum = max(TR) + 0.00001;
minimum = min(TR);
% scale each column with the training set's own max/min (maps values into roughly [-1, 0])
for i = 1:size(TR,2)
    training(1:size(TR,1), i) = double(TR(1:size(TR,1), i) - maximum(i)) / (maximum(i) - minimum(i));
end
Now, how can I use my model directly to obtain classifications for new data, i.e. for records that have no class label? Do I have to build functions manually from the model information?
Are you using libsvmtrain to train on your training data? If so, it returns a model structure as an output argument that you can use to classify test/future data: pass that structure to svmpredict along with the test data.
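The same two points carry over to other LIBSVM front-ends: keep the trained model object around, and scale new records with the maximum/minimum computed on the training set before predicting. Below is a rough Python/scikit-learn analogue of that workflow (stand-in data, with an RBF SVC standing in for the libsvmtrain call):

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
TR = rng.normal(size=(100, 4))                       # stand-in training features
labels = rng.integers(0, 2, size=100)                # stand-in class labels
new_records = rng.normal(size=(5, 4))                # new, unlabeled records

# Same scaling as in the question: the statistics come from the training set only.
maximum = TR.max(axis=0) + 0.00001
minimum = TR.min(axis=0)

def scale(X):
    return (X - maximum) / (maximum - minimum)

model = SVC(kernel="rbf", C=1.0, gamma="scale")      # stands in for libsvmtrain
model.fit(scale(TR), labels)

# New data is scaled with the training maximum/minimum, then passed to the trained model,
# just as the model structure would be passed to svmpredict in Matlab.
predicted_labels = model.predict(scale(new_records))
print(predicted_labels)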