What is a discrete class variable in Orange?

I'd like to start trying to train stock price time series using Orange. I have a simple time series for Amazon that is attached to a Logistic Regression widget. The widget throws the error:
Discrete class variable expected.
Anyone know what this means or how to solve it?
Workspace: https://www.dropbox.com/s/e43ssam3higoqgb/stockprice_regression.ows?dl=0
Data file: https://www.dropbox.com/s/38ye3qm92dpbiov/amazon.csv?dl=0
-- EDIT --
Logistic regression has been replaced with linear regression. This moves things along a little.

To train a model in a supervised manner (logistic regression is such a model), you need to tell the model which variable in your data is the class variable. For logistic regression, which is a classification model, the class variable must be discrete (it represents a small number of classes in the data). Since your class variable is continuous (the class variable is the column marked as target in the File widget), I suggest you use one of the regression models instead, such as linear regression, which predicts values directly rather than classes.
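The same distinction shows up in Orange's Python API. A minimal sketch, assuming Orange3 and its bundled iris (discrete target) and housing (continuous target) datasets:

from Orange.data import Table
from Orange.classification import LogisticRegressionLearner
from Orange.regression import LinearRegressionLearner

iris = Table("iris")        # class_var is a DiscreteVariable
housing = Table("housing")  # class_var is a ContinuousVariable

clf = LogisticRegressionLearner()(iris)    # works: discrete class
reg = LinearRegressionLearner()(housing)   # works: continuous class
# LogisticRegressionLearner()(housing) would fail with an error like
# the "Discrete class variable expected." message the widget shows.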

Related

How to create an Estimator that trains on new samples after already being fitted to an initial dataset?

I'm trying to create my own Estimator following this example I found in the Spark source code DeveloperApiExample.scala.
But in this example, every time I call the fit() method on the Estimator, it returns a new Model.
I want something like fitting again to train on additional samples that have not been trained on yet.
I thought about creating a new method in the Model class to do so, but I'm not sure whether it makes sense.
It may be good to know that my model doesn't need to process the whole dataset again to train on a new sample, and we don't want to change the model structure.
The base class for a Spark ML Estimator is defined here. As you can see, the class method fit is a vanilla call to train the model using the input data.
You should look at something like the LogisticRegression class, specifically the trainOnRows function, where the input is an RDD and, optionally, an initial coefficient matrix (the output of a trained model). This will allow you to iteratively train a model on different data sets.
For what you need to achieve, remember that your algorithm of choice must support iterative updates; for example GLMs, neural networks, tree ensembles, etc.
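As an illustration, the RDD-based pyspark.mllib API exposes this directly: train() accepts an initialWeights argument, so a previous model's coefficients can warm-start training on new rows. A minimal sketch, where batch1 and batch2 are assumed RDDs of LabeledPoint:

from pyspark.mllib.classification import LogisticRegressionWithLBFGS

model = LogisticRegressionWithLBFGS.train(batch1)
# Continue training on new data, seeded with the weights learned so far:
model = LogisticRegressionWithLBFGS.train(batch2, initialWeights=model.weights)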
Even if you know how to improve the training of your model without retraining on the already-used data, you can't do it in the same class: you would want a Model that is also an Estimator, but sadly this is not directly possible, because both are abstract classes and cannot be mixed into the same class.
As you say, you can give the model a method that returns an Estimator to improve or continue the training.
class MyEstimator extends Estimator[MyModel] {
  ...
}

class MyModel extends Model[MyModel] {
  // Return an estimator that carries over all the previously learned state
  def retrain: MyEstimator = ???
}
You can use PipelineModels to save, load, and continue fitting models:
MLlib standardizes APIs for machine learning algorithms to make it easier to combine multiple algorithms into a single pipeline, or workflow. This section covers the key concepts introduced by the Pipelines API, where the pipeline concept is mostly inspired by the scikit-learn project.
Find exemplary code here.
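For reference, a minimal save/load round trip with the pyspark.ml Pipelines API (train_df and test_df are assumed DataFrames with features and label columns; the path is a placeholder):

from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.classification import LogisticRegression

pipeline = Pipeline(stages=[LogisticRegression(maxIter=10)])
fitted = pipeline.fit(train_df)
fitted.save("/tmp/lr_pipeline_model")             # persist the fitted stages
reloaded = PipelineModel.load("/tmp/lr_pipeline_model")
predictions = reloaded.transform(test_df)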

One-class learning to make predictions using MATLAB

I am using MATLAB to build a prediction model in which the target is binary.
The problem is that the negative observations in my training data may in fact be positives that simply were not detected.
I started with a logistic regression model, assuming the data is accurate, and the results are less than satisfactory. After some research, I moved to one-class learning, hoping that I can focus on only the part of the data (the positives) that I am certain about.
I looked up the related material in the MATLAB documentation and found that I can use fitcsvm to proceed.
My current problem is:
Am I on the right path? Can one-class learning solve my problem?
I tried to use fitcsvm to create a ClassificationSVM using all the positive observations that I have.
model = fitcsvm(Instance,Label,'KernelScale','auto','Standardize',true)
However, when I try to use the model to predict
[label,score] = predict(model,Test)
All the labels predicted for my Test cases are 1. I think I did something wrong. So should I feed the SVM only the positive observations that I have?
If not what should I do?

Online logistic regression

I wish to use online logistic regression training in Matlab, in which I train the model by presenting the first sample, evaluating the model, then adding the second sample, evaluating again, and so on.
I could do this by first creating a model on the first sample, evaluating it, and throwing the model away; then creating a model on samples one and two, evaluating it, and so on, but this is very inefficient. Is there a way I could do 'real' online training of the logistic regression model in Matlab?
Short answer: no, Matlab does not support it (at least not that I'm aware of). Therefore you need to create a whole new model every time you get new input data. Depending on the size of the task, this might still be the best choice.
Workaround: you can implement it yourself by creating a loss function which updates every time a new sample arrives. Take a look at this paper if you decide to go this way (it is about many kinds of loss functions, but you are interested in the logistic one):
http://arxiv.org/abs/1011.1576
Or you could go Bayesian and update your priors any time a new point comes in.
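To make the workaround concrete, here is the per-sample update sketched in Python/NumPy; the same arithmetic ports directly to MATLAB. stream is an assumed iterable of (feature vector, 0/1 label) pairs and n_features is assumed known:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def online_step(w, x, y, lr=0.1):
    # One stochastic gradient step on the logistic loss:
    # gradient = (sigmoid(w . x) - y) * x
    return w - lr * (sigmoid(w @ x) - y) * x

w = np.zeros(n_features)
for x, y in stream:
    # evaluate the current model on the new sample, then update it
    w = online_step(w, x, y)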

Self-organizing maps and learning vector quantization

Self-organizing maps are more suited to clustering (dimension reduction) than to classification. But SOMs are used in learning vector quantization for fine tuning. LVQ, however, is a supervised learning method, so to use SOMs with LVQ, the LVQ must be provided with a labelled training data set. Since SOMs only do clustering, not classification, and thus cannot have labelled data, how can a SOM be used as an input for LVQ?
Does LVQ fine-tune the clusters in the SOM?
Before being used in LVQ, should the SOM output be put through another classification algorithm, so that the inputs are labelled and can then be used in LVQ?
To be clear, supervised learning differs from unsupervised learning in that, in the former, the target values are known.
Therefore, the output of a supervised model is a prediction.
The output of an unsupervised model, by contrast, is a label whose meaning we don't yet know. For this reason, after clustering, it is necessary to profile each of those new labels.
That said, you could label the dataset using an unsupervised learning technique such as a SOM. Then you should profile each class in order to be sure you understand the meaning of each class.
At this point, you can pursue two different paths depending on your final objective:
1. use this new variable as a form of dimensionality reduction;
2. use the new dataset, with the additional variable representing the class, as labelled data that you then try to predict using LVQ, as sketched below.
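A minimal sketch of this two-step idea in Python, assuming the third-party minisom package for the SOM (the LVQ step is left abstract, since it would simply consume the labelled set produced here):

import numpy as np
from minisom import MiniSom

data = np.random.rand(200, 4)              # placeholder unlabelled data

som = MiniSom(3, 3, input_len=4)           # a 3x3 map -> up to 9 clusters
som.train_random(data, num_iteration=500)

# Treat each sample's winning node as its (unsupervised) label ...
labels = [som.winner(x) for x in data]
# ... profile each cluster to attach a meaning to it, then feed
# (data, labels) to a supervised learner such as LVQ.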
Hope this can be useful!

How to predict labels for new data (test set) by the PartitionedEnsemble model in Matlab?

I trained an ensemble model (RUSBoost) for a binary classification problem with the function fitensemble() in Matlab 2014a. The training was performed with 10-fold cross-validation through the input parameter "kfold" of fitensemble().
However, the output model trained by this function cannot be used to predict the labels of new data with predict(model, Xtest). I checked the Matlab documentation, which says we can use the kfoldPredict() function to evaluate the trained model, but I did not find any way to pass new data to this function. Also, I found that the structure of a model trained with cross-validation differs from that of a model trained without it. So, could anyone please advise me how to use a model trained with cross-validation to predict labels of new data? Thanks!
kfoldPredict() needs a RegressionPartitionedModel or ClassificationPartitionedEnsemble object as input. This object already contains the models and data for k-fold cross-validation.
The RegressionPartitionedModel object has a field Trained, in which the trained learners that are used for cross validation are stored.
You can take any of these learners and use it like predict(learner, Xdata).
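For comparison, the same pattern in scikit-learn terms (an assumed analogue, not MATLAB; X, y, and Xtest are placeholders, and AdaBoost stands in for RUSBoost): keep the per-fold estimators and use any of them on new data.

from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_validate

cv = cross_validate(AdaBoostClassifier(), X, y, cv=10, return_estimator=True)
fold_models = cv["estimator"]            # one fitted learner per fold
labels = fold_models[0].predict(Xtest)   # use any fold's learner on new data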
Edit:
If k is too large, it is possible that there is too little meaningful data in one or more iterations, so the model for that iteration is less accurate.
There are no general rules for k, but k=10, as in the MATLAB default, is a good starting point to play around with.
Maybe this is also interesting for you: https://stats.stackexchange.com/questions/27730/choice-of-k-in-k-fold-cross-validation