How to create an Estimator that trains on new samples after it has already been fitted to an initial dataset? - scala

I'm trying to create my own Estimator following this example I found in the Spark source code, DeveloperApiExample.scala.
But in this example, every time I call the fit() method on the Estimator, it returns a new Model.
I want to be able to fit again, training only on new samples that have not been used yet.
I thought of creating a new method in the Model class to do so, but I'm not sure whether that makes sense.
It may help to know that my model doesn't need to process the whole dataset again to train on a new sample, and that we don't want to change the model structure.

The base class for a Spark ML Estimator is defined here. As you can see, the fit method is a plain call that trains the model on the input data.
Take a look at something like the LogisticRegression class, specifically the trainOnRows function, whose input is an RDD and, optionally, an initial coefficient matrix (the output of a trained model). This allows you to train a model iteratively on different data sets.
For what you need to achieve, remember that your algorithm of choice must support iterative updates. Examples are GLMs, neural networks, tree ensembles, etc.

If you know how to improve your model's training without retraining on the data already used, you still can't do it in a single class. You would need a Model that is also an Estimator, and sadly that is not directly possible: both are abstract classes and cannot be mixed into the same class.
As you say, you can give the model a method that returns an Estimator which continues the training.
import org.apache.spark.ml.{Estimator, Model}

class MyEstimator extends Estimator[MyModel] {
  ...
}
class MyModel extends Model[MyModel] {
  // Create an instance of my estimator that carries all the previous knowledge
  def retrain: MyEstimator = ???
}
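A language-agnostic sketch of this pattern, written in plain Python rather than against the Spark API. The names MeanEstimator, MeanModel, and retrain are hypothetical, and the "model" is just a running mean, chosen only because it can be updated from new samples without revisiting the old ones.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class MeanModel:
    """Fitted model: predicts the mean of every sample seen so far."""
    count: int
    total: float

    def predict(self) -> float:
        # Assumes at least one sample has been seen (count > 0).
        return self.total / self.count

    def retrain(self) -> "MeanEstimator":
        # Hand the accumulated state to a fresh estimator.
        return MeanEstimator(initial=self)


class MeanEstimator:
    """Estimator that can optionally start from a previously fitted model."""

    def __init__(self, initial: Optional[MeanModel] = None):
        self.initial = initial

    def fit(self, samples: Iterable[float]) -> MeanModel:
        # Start from the previous sufficient statistics, if any.
        count = self.initial.count if self.initial else 0
        total = self.initial.total if self.initial else 0.0
        for x in samples:
            count += 1
            total += x
        # fit() still returns a new model; the old one is left untouched.
        return MeanModel(count, total)


first = MeanEstimator().fit([1.0, 2.0, 3.0])
updated = first.retrain().fit([4.0, 5.0])  # only the new samples are processed
```

Here updated reflects all five samples even though the second fit only saw two of them. Whether a real algorithm can be seeded this way depends on it supporting incremental updates.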

You can use PipelineModels to save, load, and continue fitting models:
MLlib standardizes APIs for machine learning algorithms to make it easier to combine multiple algorithms into a single pipeline, or workflow. This section covers the key concepts introduced by the Pipelines API, where the pipeline concept is mostly inspired by the scikit-learn project.
Find exemplary code here.

Related

Spark ML API to convert a vector to a probability for multilabel classification

I'm a bit new to the Spark ML API. I'm trying to do multi-label classification for 160 labels by training 160 classifiers (logistic regression, random forest, etc.). Once I train on a Dataset[LabeledPoint], I'm finding it hard to find an API that gives me the probability of each class for a single example. I've read on SO that you can use the pipeline API to get the probabilities, but for my use case this is going to be hard, because I'll have to replicate 160 RDDs for my evaluation features, get the probability for each class, and then do a join to rank the classes by their probabilities. Instead, I want to keep just one copy of the evaluation features, broadcast the 160 models, and then do the predictions inside the map function. I find myself having to implement this, but I wonder if there's a convenience API in Spark that does the same for different classifiers such as logistic regression or random forest, converting a Vector of features into the probability of it belonging to a class. Please let me know if there's a better way to approach multi-label classification in Spark.
EDIT: I tried to create a function that transforms a vector to a label for random forest, but it's super annoying because I now have to clone large pieces of the tree traversal code in Spark, and almost everywhere I ran into dead ends because some function or variable was private or protected. Correct me if I'm wrong, but if this use case is not already implemented, I think it is at least well justified, because scikit-learn already has such APIs in place.
Thanks
Found the culprit line in Spark MLLib code: https://github.com/apache/spark/blob/5ad644a4cefc20e4f198d614c59b8b0f75a228ba/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala#L224
The predict method is marked as protected but it should actually be public for such use cases to be supported.
This has been fixed in version 2.4 as seen here:
https://github.com/apache/spark/blob/branch-2.4/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala
So upgrading to version 2.4 should do the trick ... although I don't think 2.4 is out yet, so it's a matter of waiting.
EDIT: for people who are interested, this is apparently beneficial not only for multi-label prediction: a 3-4x improvement in latency has also been observed for regular classification/regression on single-instance or small-batch predictions (see https://issues.apache.org/jira/browse/SPARK-16198 for details).
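The approach described in the question (keep one copy of the features, hold every per-label model in memory, and score all labels for an example inside one function) can be sketched in plain Python. This is not Spark code: each binary model is assumed to be a logistic regression reduced to a hypothetical (weights, intercept) pair, and rank_labels is an illustrative name.

```python
import math
from typing import Dict, List, Sequence, Tuple

# Hypothetical representation of one binary logistic model: (weights, intercept).
LinearModel = Tuple[Sequence[float], float]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def rank_labels(
    features: Sequence[float], models: Dict[str, LinearModel]
) -> List[Tuple[str, float]]:
    """Score one feature vector against every per-label model,
    then sort the labels by descending probability."""
    scored = []
    for label, (weights, intercept) in models.items():
        margin = sum(w * x for w, x in zip(weights, features)) + intercept
        scored.append((label, sigmoid(margin)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Three toy models standing in for the 160 classifiers.
models = {
    "a": ([1.0, 0.0], 0.0),
    "b": ([0.0, 1.0], 0.0),
    "c": ([-1.0, -1.0], 0.0),
}
ranking = rank_labels([2.0, 1.0], models)
```

In a Spark job the models dictionary would play the role of the broadcast variable, and rank_labels would be the body of the map function.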

How to add a custom layer and loss function into a pretrained CNN model by matconvnet?

I'm new to matconvnet. I'd like to try a new loss function in place of the existing one in a pretrained model, e.g., vgg-16, which usually uses a softmax loss layer. In addition, I want to use a new feature extractor layer instead of a pooling layer or max layer. I know there are two CNN wrappers in matconvnet, simpleNN and DagNN. Since I'm using vgg-16, a linear model made of a linear sequence of building blocks, I'm working with the simpleNN wrapper. How do I create a custom layer in simpleNN, especially the procedure and the relevant concepts? For example, do I need to remove the layers after the new feature extractor layer, or just leave them? I know how to compute the derivative of the loss function, so the details of the computation inside the layer are not that important in this question; I just want to see the procedure expressed in code. Could someone help me? I'd appreciate it a lot!
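You can remove the old error or objective layer, for example the last layer of a simpleNN model:
net.layers(end) = [];  % drop the existing loss layer
and you can add the new loss computation to the vl_nnloss() function file.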

How to predict labels for new data (test set) by the PartitionedEnsemble model in Matlab?

I trained an ensemble model (RUSBoost) for a binary classification problem with the function fitensemble() in Matlab 2014a. Training is performed with 10-fold cross-validation through the "kfold" input parameter of fitensemble().
However, the model returned by this function cannot be used to predict the labels of new data with predict(model, Xtest). I checked the Matlab documentation, which says we can use the kfoldPredict() function to evaluate the trained model, but I did not find any way to pass new data to this function. I also found that the structure of a model trained with cross-validation is different from that of a model trained without it. So, could anyone please advise me how to use a model trained with cross-validation to predict the labels of new data? Thanks!
kfoldPredict() needs a RegressionPartitionedModel or ClassificationPartitionedEnsemble object as input. This object already contains the models and the data for k-fold cross-validation, which is why the function takes no new data.
The partitioned model object has a property Trained, which stores the trained learners used for cross-validation, one per fold.
You can take any of these learners and use it like predict(learner, Xdata).
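To illustrate the idea outside MATLAB, here is a small Python sketch of a cross-validated model that keeps one learner per fold. A toy threshold learner stands in for the real ensemble, and all names are hypothetical. The majority_vote helper is an extra option not mentioned above: it combines all fold learners instead of picking one.

```python
from collections import Counter
from typing import Callable, List, Sequence

Learner = Callable[[float], int]


def fit_threshold(xs: Sequence[float], ys: Sequence[int]) -> Learner:
    """Toy learner: cut halfway between the two class means.
    Assumes both classes are present and class 1 lies above class 0."""
    zeros = [x for x, y in zip(xs, ys) if y == 0]
    ones = [x for x, y in zip(xs, ys) if y == 1]
    cut = (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2.0
    return lambda x: int(x > cut)


def cross_validated_learners(
    xs: Sequence[float], ys: Sequence[int], k: int
) -> List[Learner]:
    """Train k learners, each on the data outside one fold."""
    learners = []
    for fold in range(k):
        train = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i % k != fold]
        learners.append(
            fit_threshold([x for x, _ in train], [y for _, y in train])
        )
    return learners


def majority_vote(learners: Sequence[Learner], x: float) -> int:
    """Predict with every fold learner and return the most common label."""
    votes = Counter(learner(x) for learner in learners)
    return votes.most_common(1)[0][0]


xs = [0.0, 1.0, 2.0, 3.0, 7.0, 8.0, 9.0, 10.0]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
trained = cross_validated_learners(xs, ys, k=4)

single = trained[0](8.5)             # use one fold's learner on new data
voted = majority_vote(trained, 1.5)  # or combine all fold learners
```

Note that every fold learner has seen only part of the data. If the goal is a final model rather than an evaluation, training once on the full dataset without cross-validation is the usual alternative.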
Edit:
If k is too large, it is possible that there is too little meaningful data in one or more iterations, so the model for that iteration is less accurate.
There are no general rules for k, but k=10, as in the MATLAB default, is a good starting point to experiment from.
Maybe this is also interesting for you: https://stats.stackexchange.com/questions/27730/choice-of-k-in-k-fold-cross-validation

Combining labeled and unlabeled data in a single pipeline

I'm building an image classifier that uses a DBN for feature learning and logistic regression to fine-tune the resulting network. Normally, the most convenient way to implement such an architecture in scikit-learn is to use the Pipeline class. But in my case I have ~10K unlabeled images and only ~300 labeled ones. Naturally, I want to use all images to train the DBN and fit the logistic regression with only the labeled examples.
I can think of implementing my own Pipeline class that handles this case, but first I'd like to know whether something like it already exists. Does it?
The current scikit-learn Pipeline API is not well suited for supervised learning with unsupervised pre-training. Implementing your own wrapper class is probably the best way to go forward for that case.
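A minimal sketch of such a wrapper, assuming only the usual scikit-learn conventions (fit/transform for the feature learner, fit/predict for the classifier). SemiSupervisedPipeline is a hypothetical name, and the two tiny stand-in classes exist only to make the example self-contained without scikit-learn.

```python
class SemiSupervisedPipeline:
    """Fit the feature learner on all samples, the classifier on labeled ones."""

    def __init__(self, feature_learner, classifier):
        self.feature_learner = feature_learner
        self.classifier = classifier

    def fit(self, X_all, X_labeled, y_labeled):
        # Unsupervised step: uses labeled and unlabeled samples alike.
        self.feature_learner.fit(X_all)
        # Supervised step: uses only the labeled subset.
        features = self.feature_learner.transform(X_labeled)
        self.classifier.fit(features, y_labeled)
        return self

    def predict(self, X):
        return self.classifier.predict(self.feature_learner.transform(X))


class Centerer:
    """Stand-in feature learner: subtracts the mean of the data it was fit on."""

    def fit(self, X):
        self.mean_ = sum(X) / len(X)
        return self

    def transform(self, X):
        return [x - self.mean_ for x in X]


class NearestMeanClassifier:
    """Stand-in classifier: assigns the label whose class mean is closest."""

    def fit(self, X, y):
        self.means_ = {}
        for label in set(y):
            values = [x for x, t in zip(X, y) if t == label]
            self.means_[label] = sum(values) / len(values)
        return self

    def predict(self, X):
        return [
            min(self.means_, key=lambda c: abs(x - self.means_[c])) for x in X
        ]


unlabeled = [0.0, 1.0, 2.0, 8.0, 9.0, 10.0]
X_labeled, y_labeled = [1.0, 9.0], ["low", "high"]

model = SemiSupervisedPipeline(Centerer(), NearestMeanClassifier())
model.fit(unlabeled + X_labeled, X_labeled, y_labeled)
predictions = model.predict([2.0, 7.0])
```

With real data, the stand-ins could be swapped for scikit-learn estimators that follow the same conventions, for example BernoulliRBM as the feature learner and LogisticRegression as the classifier.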

Usage of Libsvm model

I've developed a model using Libsvm in Matlab. I chose the best parameters using CV and obtained the model by training on the whole dataset. I use normalization to get better results:
maximum=max(TR)+0.00001;
minimum=min(TR);
for i=1:size(TR,2)
training(1:size(TR,1),i)=double(TR(1:size(TR,1),i)-maximum(i))/(maximum(i)-minimum(i));
end
Now, how can I use my model directly to classify new data, i.e., records that have no class label? Do I have to manually build functions from the model information?
Are you using libsvmtrain to train on your training data? If so, it returns a model structure as its output. Pass that structure to svmpredict along with the test data to classify test or future data.
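One practical point when classifying new records: they need the same scaling as the training data, computed from the training-set minimum and maximum rather than from the new data itself. A minimal Python sketch of that bookkeeping follows. It assumes standard min-max scaling, which differs slightly from the formula in the question, and fit_min_max and apply_min_max are hypothetical helper names.

```python
from typing import List, Sequence, Tuple


def fit_min_max(rows: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Compute the per-column minimum and maximum of the training data."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def apply_min_max(
    rows: Sequence[Sequence[float]],
    minimum: Sequence[float],
    maximum: Sequence[float],
) -> List[List[float]]:
    """Scale rows with previously computed statistics.
    Assumes maximum > minimum in every column (no constant columns)."""
    return [
        [(x - lo) / (hi - lo) for x, lo, hi in zip(row, minimum, maximum)]
        for row in rows
    ]


training = [[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]]
minimum, maximum = fit_min_max(training)  # statistics come from training only

scaled_training = apply_min_max(training, minimum, maximum)
scaled_new = apply_min_max([[2.5, 25.0]], minimum, maximum)  # reuse them here
```

The scaled new records are what would then be passed to the prediction function together with the trained model.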