I'm just wondering roughly which decision tree algorithm Orange3 implements in the Tree widget.
What measure of purity does it use?
Tree is a simple algorithm that splits the data into nodes by class purity. It is a precursor to Random Forest.
https://docs.biolab.si//3/visual-programming/widgets/model/tree.html
This should be the related source code:
https://github.com/biolab/orange3/blob/master/Orange/modelling/randomforest.py
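For context, the two class-purity measures most commonly used by decision-tree splitters are Gini impurity and entropy. The sketch below shows the textbook definitions only; it is not Orange's code, and which measure the Tree widget actually uses is best confirmed in the linked source.

```java
// Standard class-purity measures for decision-tree splitting.
// Each takes the per-class counts at a node.
class Purity {
    // Gini impurity: 1 - sum(p_i^2) over class proportions p_i.
    // 0 means the node is pure; higher means more mixed.
    static double gini(int[] counts) {
        double n = 0;
        for (int c : counts) n += c;
        double sumSquares = 0;
        for (int c : counts) {
            double p = c / n;
            sumSquares += p * p;
        }
        return 1.0 - sumSquares;
    }

    // Entropy in bits: -sum(p_i * log2(p_i)), skipping empty classes.
    static double entropy(int[] counts) {
        double n = 0;
        for (int c : counts) n += c;
        double h = 0;
        for (int c : counts) {
            if (c == 0) continue;
            double p = c / n;
            h -= p * (Math.log(p) / Math.log(2));
        }
        return h;
    }
}
```

For a node with a 50/50 class split, Gini is 0.5 and entropy is 1 bit; both drop to 0 for a pure node.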
Related
I want to optimally find the clusters and the assignment of each subject to the correct cohort in a nonlinear mixed-effects framework. I came across an R package, lcmm, which calls this type of modeling a latent class mixture model. It handles clustering for the linear mixed-effects model in the hlme function. I am wondering whether there is a package that deals with latent-class clustering for nonlinear mixed-effects models. Any help is appreciated.
I'm trying to create my own Estimator following this example I found in the Spark source code: DeveloperApiExample.scala.
But in this example, every time I call the fit() method on the Estimator, it returns a new Model.
I want something like fitting again to train on additional samples that were not trained on yet.
I thought about creating a new method in the Model class to do so, but I'm not sure it makes sense.
It may be good to know that my model doesn't need to reprocess the whole dataset to train on a new sample, and I don't want to change the model structure.
The base class for a Spark ML Estimator is defined here. As you can see, the class method fit is a vanilla call to train the model using the input data.
You should look at something like the LogisticRegression class, specifically the trainOnRows function, where the input is an RDD and optionally an initial coefficient matrix (the output of a trained model). This allows you to iteratively train a model on different data sets.
For what you need to achieve, remember that your algorithm of choice must be able to support iterative updates; for example, GLMs, neural networks, tree ensembles, etc.
If you know how to improve the training of your model without retraining on the already-used data, you can't do it in the same class: you would want a Model that is also an Estimator, but sadly this is not directly possible, because both are abstract classes and can't be mixed into the same class.
As you say, you can provide a method on the model that returns an Estimator to improve/continue the training.
class MyEstimator extends Estimator[MyModel] {
...
}
class MyModel extends Model[MyModel] {
  // Create an instance of my estimator that carries all the previous knowledge
  def retrain: MyEstimator = ...
}
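The pattern above can be sketched in plain code (this is a toy, not the real Spark API; the class names and the word-count "training" are illustrative only). The key idea is that the model hands back an estimator seeded with its learned state, so fitting on new data continues rather than restarts.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model: its "knowledge" is just a map of accumulated word counts.
class ToyModel {
    final Map<String, Double> weights;
    ToyModel(Map<String, Double> weights) { this.weights = weights; }

    // Hand back an estimator initialized with this model's knowledge.
    ToyEstimator retrain() { return new ToyEstimator(weights); }
}

class ToyEstimator {
    private final Map<String, Double> initial;
    ToyEstimator() { this(new HashMap<>()); }
    ToyEstimator(Map<String, Double> initial) { this.initial = initial; }

    // "fit" merges counts from the new samples into the carried-over
    // state, so previously seen data never needs to be reprocessed.
    ToyModel fit(List<String> samples) {
        Map<String, Double> updated = new HashMap<>(initial);
        for (String s : samples) {
            updated.merge(s, 1.0, Double::sum);
        }
        return new ToyModel(updated);
    }
}
```

Calling `model.retrain().fit(newSamples)` then produces a model whose state includes both the old and the new samples.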
You can use PipelineModels to save, load, and continue fitting models:
MLlib standardizes APIs for machine learning algorithms to make it easier to combine multiple algorithms into a single pipeline, or workflow. This section covers the key concepts introduced by the Pipelines API, where the pipeline concept is mostly inspired by the scikit-learn project.
Find example code here.
I have calculated the following parameters after applying the following algorithms to a dataset from Kaggle:
[image: table of the computed results for each algorithm]
In the above case, the linear model gives the best results.
Are the above results correct, and can a linear model actually give better results than the other three in some cases?
Or am I missing something?
According to the AUC criterion, this classification is perfect (1 is the theoretical maximum). This means there is a clear separation in the data, so it makes little sense to talk about differences between the methods' results. Another point is that you can tune the methods' parameters (you will likely get slightly different results) and other methods may come out ahead, but the real difference will be indistinguishable. Sophisticated methods are invented for sophisticated data; this is not such a case.
"All models are wrong, but some are useful." - George Box
In terms of classification, a model is effective as long as it can fit the classification boundaries well.
In the binary case, supposing your data is perfectly linearly separable, a linear model will do the job; in fact the "best" job, since any more complicated model won't perform better.
If your +'s and -'s are somewhat scattered so that they cannot be separated by a line (more generally, a hyperplane), then a linear model can be beaten by a decision tree, simply because decision trees can produce classification boundaries of more complex shape (axis-aligned boxes).
A random forest may in turn beat a single decision tree, since the classification boundary of a random forest is more flexible.
However, as mentioned above, the linear model still has its place.
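A toy illustration of the first point, with hypothetical 1-D data: when the classes are perfectly linearly separable, a single linear threshold (the 1-D analogue of a separating hyperplane) already classifies perfectly, so no more flexible model can improve on it.

```java
// Toy linearly separable data: points 1..50 are class 0,
// points 51..100 are class 1.
class LinearToy {
    // "Linear model" in 1-D: a single threshold.
    static int linearClassifier(double x) {
        return x > 50.5 ? 1 : 0;
    }

    // Accuracy of the threshold classifier on the toy data.
    static double accuracy() {
        int correct = 0, total = 0;
        for (int i = 1; i <= 100; i++) {
            int label = (i <= 50) ? 0 : 1;
            if (linearClassifier(i) == label) correct++;
            total++;
        }
        return (double) correct / total;
    }
}
```

The threshold classifier reaches accuracy 1.0 here, which is exactly the AUC-1 situation described in the answer: any tree or forest can at best tie it.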
I'm trying to train a classifier (specifically, a decision forest) using the Matlab 'TreeBagger' class.
I notice from the online documentation for TreeBagger that there are a couple of methods/properties that can be used to see how important each feature is for distinguishing between classes of data points.
The two I found were the ComputeOOBVarImp property and the ClassificationTree.predictorImportance method. Using the latter on a decision forest/bagged ensemble of trees that I'd built, I found that many features had zero importance for the classifier.
Is there anything I can do with the TreeBagger class, or in conjunction with it, so that my trees use weak learners/splitting criteria that aren't just thresholds on single input features but linear combinations of features, in order to improve the information gain at each node split?
I suppose this comes down to dimensionality reduction of the data, which I have no experience with in Matlab.
Thanks.
I know we can use the Vote classifier to combine different classifiers.
Is there any way to combine classifiers with a different weight for each classifier? How would I be able to do that with Weka?
I have found through searching that we can add weights to attributes or instances, but I would like to know how to add weights to classifiers.
If a weighted vote is not possible, is there any other way I can do this? Thanks.
It does not appear possible to achieve weighted voting with Weka without modifying the Java classes yourself.
Source 1
Source 2
That being said, I believe you can achieve rudimentary weighting by providing multiple copies of the base classifiers to the Vote meta-classifier. This appears to be backed up by the source code.
For example:
Classifier 1: J48 decision tree
Classifier 2: J48 decision tree
Classifier 3: naive Bayes
This would allow the decision tree to vote twice and therefore carry more weight than naive Bayes.
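The duplication trick can be sketched in plain code (this is a toy majority vote, not the Weka API): listing the same classifier twice is equivalent to giving its vote a weight of 2.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy unweighted majority vote over a list of class predictions.
// Duplicating a classifier's prediction in the list doubles its weight.
class VoteToy {
    static int majorityVote(List<Integer> predictions) {
        Map<Integer, Integer> tally = new HashMap<>();
        for (int p : predictions) tally.merge(p, 1, Integer::sum);
        // Return the class with the most votes.
        return tally.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .get().getKey();
    }
}
```

With two J48 copies predicting class 1 and one naive Bayes predicting class 0, the vote is `majorityVote(List.of(1, 1, 0))`, and the duplicated tree outvotes naive Bayes.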