What is the significance of data set sampling in the bagging method for dynamic classifier selection? - matlab

During dynamic classifier selection, in the training stage we apply bagging to the training set to obtain a pool of classifiers. Bagging involves sampling the training set into a number of data subsets whose elements are drawn with replacement; each of these subsets is then used to train a learner. The predictions from the classifiers are then compared by voting, and the classifier with the highest accuracy is selected.
Given this, why is there a need to sample the main training set during subset creation (i.e. to divide it into a number of data subsets)? Why can't we give the whole training set as input to each of the learners used for training?
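For reference, the bootstrap-sampling step being described looks roughly like this in MATLAB (a minimal sketch; X, Y, and the pool size B are assumed names, not from the question):

    N = size(X, 1);                            % number of training rows
    pool = cell(B, 1);
    for b = 1:B
        idx = randsample(N, N, true);          % draw N rows WITH replacement
        pool{b} = fitctree(X(idx, :), Y(idx)); % one learner per bootstrap subset
    end

Each subset has the same size as the original set but omits some rows and repeats others, which is what makes the learners in the pool differ from one another.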

Related

Classification using Weka did not give any results for precision, F-measure and MCC

I have a dataset with some categorical values and some discrete values, and it is imbalanced. I split it into 60% training data and 40% test data using the Resample filter available in Weka. To balance the dataset I use the SMOTE technique, and after that I classify it with Random Forest.
The result is shown below. [Weka classifier output not reproduced here; several of the statistics appear as "?".]
Now, what is the meaning of ? in the result? Secondly, why are there no values for False Positive and True Positive? Does that mean the dataset is still biased towards the No class even after applying SMOTE?
Note: I applied SMOTE only to the training data, not to the test data.
It would be helpful if someone could clarify my doubts.
That was asked on the Weka mailing list before (2019-07-26, How can I explain the tag "?" in the performance of the model). Here is Eibe's answer:
It means the statistic could not be computed. For example, precision for class “High” cannot be computed because the classifier did not assign any instances to that class. This means the denominator in the calculation for precision is zero.
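In MATLAB terms (a toy illustration with assumed numbers):

    TP = 0; FP = 0;               % no instances were predicted as class "High"
    precision = TP / (TP + FP)    % 0/0 evaluates to NaN; Weka prints it as "?"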

How to make a hybrid model (LSTM and Ensemble) in MATLAB

I am working on CO2 prediction in MATLAB. My data set has 3787 samples (including the test and validation sets), and I am trying to predict CO2 that has a standard deviation of 179.60. I have 15 predictors and 1 response. Among the predictors there are two types of data (1. sequential numeric data such as temperature and humidity, 2. conditions, i.e. yes/no), so I have decided to use two types of networks to train my model:
1) LSTM - For the sequential data
2) Ensemble or SVM - for the yes/no data
3) Combine the two models and predict the response variable.
How can I achieve this?
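One possible way to sketch this plan in MATLAB (a rough stacking outline; the names XseqTrain, XcatTrain, XseqNew, XcatNew, Ytrain, numSeqFeatures and the layer sizes are all illustrative assumptions, not a definitive implementation):

    % 1) LSTM for the sequential predictors (temperature, humidity, ...)
    layers = [ ...
        sequenceInputLayer(numSeqFeatures)
        lstmLayer(50, 'OutputMode', 'last')
        fullyConnectedLayer(1)
        regressionLayer];
    opts = trainingOptions('adam', 'MaxEpochs', 100);
    netLSTM = trainNetwork(XseqTrain, Ytrain, layers, opts);

    % 2) Bagged-tree ensemble for the yes/no predictors
    mdlEns = fitrensemble(XcatTrain, Ytrain, 'Method', 'Bag');

    % 3) Stack: feed both models' predictions into a simple linear meta-model
    p1 = predict(netLSTM, XseqTrain);
    p2 = predict(mdlEns, XcatTrain);
    meta = fitlm([p1 p2], Ytrain);

    % Prediction on new data
    yhat = predict(meta, [predict(netLSTM, XseqNew), predict(mdlEns, XcatNew)]);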

TreeBagger() (MATLAB) and different number of variables on train and test set

I am using the MATLAB function TreeBagger() for Random Forest classification, for an assignment. It gives an error when the number of variables in the test data differs from the number of variables in the training data.
I have been taught that variable selection should be done on the training data only, not on the test data, so that the test data stays unbiased. So after splitting the initial dataset (50 variables) into training and test sets, I performed variable selection (a chi-square test of independence) on the training set. As a result the training set consists of 37 variables, whereas the test set still has 50 variables.
I used TreeBagger() to train a model on the training set and then used the test set for prediction (with predict()). I get an error because the number of variables in the test set is different from the number of variables the model was trained on.
Is it wrong to perform variable selection on the training set only? Is there a way I can perform the prediction using this function?
The selected variables are part of your final model.
This means that whenever you apply the final model, it must use only the variables that were selected on the training set.
So, before applying your TreeBagger model, filter out the variables that were not selected and then run the prediction on your test set.
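A minimal sketch of that in MATLAB (selectedIdx is an assumed variable holding the 37 column indices chosen on the training set):

    % selectedIdx: indices of the 37 variables kept by the chi-square test
    mdl   = TreeBagger(100, Xtrain(:, selectedIdx), Ytrain, ...
                       'Method', 'classification');
    Ypred = predict(mdl, Xtest(:, selectedIdx));   % keep the SAME 37 columns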

EM soft clustering in LingPipe

In LingPipe's EM tutorial they say that it is possible to run the algorithm with no supervised data:
It is possible to train a classifier in a completely unsupervised fashion by having the initial classifier assign categories at random. Only the number of categories must be fixed. The algorithm is exactly the same, and the result after convergence or the maximum number of epochs is a classifier.
But their class, TradNaiveBayesClassifier, requires both a labeled and an unlabeled corpus to run. How can I modify it to run with no labeled data?
EM is a probabilistic maximum-likelihood optimization algorithm. In general, it is applied to unsupervised (clustering) models such as PLSA and Gaussian mixture models.
I think the LingPipe doc is saying that you can use a random initialization of all the data labels (a distribution over labels for each data point), feed that into NB to compute the ELBO (evidence lower bound), and then maximize it to update the parameters.
In short, you will need to use the NB model to write up the M step --- updating the model parameters.
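A generic sketch of that loop in MATLAB (a toy multinomial naive Bayes, not LingPipe code; the document-term count matrix X, the number of classes K, and maxEpochs are assumed):

    N = size(X, 1);                     % N documents over a V-term vocabulary
    R = rand(N, K);                     % random soft labels: the unsupervised init
    R = R ./ sum(R, 2);
    for epoch = 1:maxEpochs
        % M step: refit the NB parameters from the current soft labels
        prior = sum(R, 1) / N;          % class priors, 1-by-K
        wordP = R' * X + 1;             % soft word counts, add-one smoothing
        wordP = wordP ./ sum(wordP, 2); % per-class word distributions, K-by-V
        % E step: recompute the soft labels under the updated model
        logLik = X * log(wordP') + log(prior);  % N-by-K log-scores
        R = exp(logLik - max(logLik, [], 2));   % subtract max for stability
        R = R ./ sum(R, 2);
    end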

ANN training: what happens in each epoch?

I have a question about the training of ANNs.
I want to ask how the training is done for a set of input samples. Is there some relation between the size of the training input set and the number of training epochs, or are they totally independent?
For example, if my ANN has 4 inputs and I have 2000 training samples, I get an input matrix of size 4x2000. In each training epoch, is the whole matrix processed, or is just one sample (one column of the training matrix) loaded per epoch?
In each epoch of a NN, all the weight values of all the neurons are updated: an epoch is one full pass over the entire training set, so all 2000 samples are processed in every epoch. Usually, the more neurons, layers, and data you have, the more epochs you need to reach good weight values, but there is no equation relating the number of epochs to the number of neurons.
For the training, the backpropagation algorithm is usually used (check Wikipedia for a good example), which updates each weight once per pass. More epochs generally make your NN more accurate, up to the point where it starts to overfit. Usually you set two stopping criteria for training: a maximum number of epochs and a target accuracy, and you stop iterating when either one is reached.
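A schematic sketch of one epoch in MATLAB (backprop is a hypothetical helper standing in for the gradient computation; W, b, eta, T, and maxEpochs are assumed):

    for epoch = 1:maxEpochs
        for i = 1:size(X, 2)     % every column (sample) is visited in each epoch
            % hypothetical helper returning the backpropagated gradients
            [gradW, gradB] = backprop(W, b, X(:, i), T(:, i));
            W = W - eta * gradW; % gradient-descent weight update
            b = b - eta * gradB;
        end
    end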