TreeBagger() (MATLAB) and different number of variables on train and test set - matlab

I am using the MATLAB function TreeBagger() for Random Forest classification, for an assignment. It gives an error when the number of variables in the test data is different from the number of variables in the training data.
I have been taught that variable selection should be done on the training data only, not on the test data, so that the evaluation on the test data is not biased. So after splitting the initial dataset (50 variables) into a training and a test set, I performed variable selection (chi-square test of independence) on the training set. The training set now consists of 37 variables, whereas the test set still has all 50 variables.
I used TreeBagger() to train a model on the training set and then used the test set for prediction (function predict()). I get an error because the number of variables in the test set is different from the number of variables the model was trained on.
Is it wrong to perform variable selection on the training set only? Is there a way I can perform the prediction using this function?

The selected variables are a part of your final model.
This means that whenever you want to use the model, it must be given only the variables that were selected on the training set.
Thus, before calling predict() with your TreeBagger model, filter the test set down to the selected variables and then apply the model to get your predictions.
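For example, if the column indices chosen by the chi-square selection are stored in a vector, the same filter is applied to both sets. A minimal sketch, assuming the data are numeric matrices and using hypothetical variable names (selectedVars, trainData, testData):
%script to train on the selected variables and predict with the same columns
Xtrain = trainData(:, selectedVars);            % the 37 selected columns
model  = TreeBagger(100, Xtrain, trainLabels);  % 100 trees, just as an example
Xtest  = testData(:, selectedVars);             % apply the SAME filter to the test set
predictedLabels = predict(model, Xtest);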

Related

Finding the correct model to implement a binary classification problem

I am working with a loan default dataset of 1,300,000 records and 160 independent variables, and my target variable is labelled 0 and 1. I have used feature selection (mutual information and chi-square for the categorical variables, ANOVA for the continuous ones) to reduce it to 11 continuous and 10 categorical variables.
My goal is to develop a prediction model with a mix of continuous and categorical variables as regressors. I am thinking of running an SVM classifier on the continuous variables, a Random Forest classifier on the categorical variables, and then combining them with an ensemble technique. Would that be the correct way to go about it?
Note: I am using Python for this exercise.
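One way to make this concrete is simple probability averaging: train one model per variable group and average their class-1 probabilities. A rough sketch of that pattern, shown in MATLAB with hypothetical variable names (Xcont, Xcat, y and their test-set counterparts); the same idea carries over to Python:
% Hypothetical inputs: Xcont (continuous predictors), Xcat (label-encoded
% categorical predictors), y (0/1 response), plus XcontTest and XcatTest.
svmMdl = fitcsvm(Xcont, y, 'KernelFunction', 'rbf', 'Standardize', true);
svmMdl = fitPosterior(svmMdl);             % map SVM scores to posterior probabilities
rfMdl  = TreeBagger(200, Xcat, y, 'Method', 'classification', ...
                    'CategoricalPredictors', 'all');
[~, pSvm] = predict(svmMdl, XcontTest);    % columns ordered by svmMdl.ClassNames
[~, pRf]  = predict(rfMdl,  XcatTest);     % columns ordered by rfMdl.ClassNames
pClass1 = (pSvm(:, 2) + pRf(:, 2)) / 2;    % average probability of the positive class
yPred   = pClass1 > 0.5;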

What is the significance of data set sampling during Bagging method for dynamic classifier selection method?

In Dynamic Classifier Selection, during the training stage we apply bagging to the training set to obtain a pool of classifiers. Bagging involves sampling the training data set into a number of subsets whose elements are drawn with replacement, and each of these subsets is then used to train a learner. The predicted values from the classifiers are then compared by a voting method, and the classifier with the highest accuracy is selected.
In this setting, why is there a need to sample the main training set when creating the data subsets (i.e. to divide the data set into a number of subsets)? Why can't we give the whole training data as input to each of the learners used for training?
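To make the sampling step concrete: each learner in the pool is trained on its own bootstrap sample, i.e. rows drawn with replacement from the training set. A minimal sketch of that step, with hypothetical Xtrain/Ytrain and a toy number of learners:
% Hypothetical data: Xtrain (N-by-p predictors), Ytrain (N-by-1 labels).
N = size(Xtrain, 1);
B = 10;                               % number of bagged learners (toy value)
learners = cell(B, 1);
for b = 1:B
    idx = randsample(N, N, true);     % N rows drawn WITH replacement
    learners{b} = fitctree(Xtrain(idx, :), Ytrain(idx));
end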

How to iterate over values of models in Simulink Matlab?

I have designed a model in Simulink. Generally, I generate a plot by setting the values of blocks (e.g. gain) in the model, simulating the model and opening the Scope block. But I need to generate different graphs corresponding to different values of those blocks (e.g. gain). Basically, for different values of the gain I want different curves, but all in the same plot. The different values I give to my gain should come from an array. This is my model:
I am using MATLAB for the first time, so please explain this in a beginner-friendly way.
Setting The Gain Value
The value of a gain block can be set as a variable rather than a constant, i.e. you can give a gain block the value K in its settings panel.
You can then create a script that gives K a value, e.g.:
%script to set gain and run model
K=2;
sim('Model Name Here');
This will set the value for your gain block and run the model.
Saving The Output
In the Sinks section of the Simulink Library Browser there is a block called To Workspace, which allows you to send any signal to the MATLAB workspace in a number of formats, under a name that you define.
Your Simulink model will now look something like this:
Now you can create a script that sets a gain value for your model, runs the model and saves the output to your workspace. With a couple of for loops you can produce an array of inputs and outputs for your system.
From here you should be able to plot the inputs and outputs on the same graph using the well-documented plot function.
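A minimal sketch of such a script, assuming the model is saved as myModel.slx, the gain block is set to the variable K, and the To Workspace block logs its signal to a variable called simout with the Timeseries save format (your names may differ):
gains = [1 2 5 10];                   % array of gain values to sweep
figure; hold on;
for i = 1:numel(gains)
    K = gains(i);                     % the gain block reads K from the workspace
    sim('myModel');                   % run the model; To Workspace writes simout
    % (on newer releases you may instead get a SimulationOutput:
    %  out = sim('myModel'); simout = out.simout;)
    plot(simout.Time, simout.Data);   % one curve per gain value, all on the same axes
end
hold off;
legend(arrayfun(@(k) sprintf('K = %g', k), gains, 'UniformOutput', false));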

Logistic Regression with variables that do not vary

A few questions around constant variables and logistic regression -
Let's say I have a continuous variable that has only one value across the whole data set. I know I should ideally eliminate the variable since it brings no predictive value. Instead of doing this manually for each feature, does logistic regression set the coefficient of such a variable to 0 automatically?
If I use such a variable (that has only one value) in Logistic Regression with L1 regularization, will the regularization force the coefficient to 0?
Along similar lines, suppose I have a categorical variable with 3 levels (the first level spans say 60% of the data set, the second 35% and the third 5%). If I split the data into training and test sets, there is a good chance that the third level does not end up in the test set, leading to a scenario where the variable takes a different set of values in the test set than in the training set. How do I handle such scenarios? Does regularization take care of things like this automatically?
Regarding question 3)
If you want to be sure that both the training and the test set contain samples from every level of the categorical variable, you can simply divide each subgroup into a training and a test part and then combine these again (a stratified split).
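In MATLAB, cvpartition does this stratification for you when it is given the grouping variable. A minimal sketch, assuming hypothetical names data (the full data set) and group (the 3-level categorical variable):
c        = cvpartition(group, 'HoldOut', 0.3);   % stratified 70/30 holdout
trainSet = data(training(c), :);
testSet  = data(test(c), :);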
Regarding questions 1) and 2)
The coefficient of a variable with zero variance should be zero, yes. However, whether such a coefficient is "automatically" set to zero, or the variable is excluded from the regression altogether, depends on the implementation.
If you implement logistic regression yourself, you can post the code and we can discuss it specifically.
I recommend finding an implemented version of logistic regression and testing it on toy data. Then you will have your answer as to whether the coefficient is set to zero (which I assume it is).
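In MATLAB, one such toy test could use lassoglm, which fits L1-penalised logistic regression; the hypothetical example below adds a constant column so you can inspect its coefficient directly:
% Toy data: x1 is informative, x2 is constant (zero variance).
rng(1);
n  = 200;
x1 = randn(n, 1);
x2 = 3 * ones(n, 1);
y  = double(rand(n, 1) < 1 ./ (1 + exp(-x1)));   % binary response driven by x1 only
[B, FitInfo] = lassoglm([x1 x2], y, 'binomial', ...
                        'Lambda', 0.05, 'Standardize', false);
disp(B)   % the second row is the coefficient of the constant column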

Concept of validation for a neural network

I have a problem with the concept of validation for a neural network. Suppose I have 100 sets of input variables (for example 8 inputs, X1,...,X8) and want to predict one target (Y). I see two ways to use the NN:
1- Use 70 of the data sets to train the NN, then use the trained NN to predict the targets of the other 30 sets, and plot output vs. target for these 30 sets as the validation plot.
2- Use all 100 sets to train the NN, then divide all the outputs into two parts (70% and 30%). Plot 70% of the outputs vs. their corresponding targets as the training plot, then plot the other 30% of the outputs vs. their corresponding targets as the validation plot.
Which one is correct?
Also, what is the difference between checking the NN with a new data set and with the validation data set?
Thanks
You cannot use data for validation if it has already been used for training, because the trained NN will already "know" your validation examples. The result of such a validation will be very biased. I would definitely use the first way.
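A minimal sketch of the first approach in MATLAB, assuming X is an 8-by-100 input matrix and Y a 1-by-100 target vector (Neural Network Toolbox convention: one column per sample):
% Hold out 30% of the samples before training; they are never shown to train().
N        = size(X, 2);
idx      = randperm(N);
trainIdx = idx(1:round(0.7 * N));
valIdx   = idx(round(0.7 * N) + 1:end);
net = feedforwardnet(10);             % 10 hidden neurons, just as an example
net.divideFcn = 'dividetrain';        % use every supplied sample for training
net = train(net, X(:, trainIdx), Y(trainIdx));
Yval = net(X(:, valIdx));             % predict the 30 held-out samples
plot(Y(valIdx), Yval, 'o');           % validation plot: output vs. target
xlabel('Target'); ylabel('Network output');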