Two situations:
1. I have a set of instances labelled "0", "1", or "?". I use only the labelled instances (1, 0) to train a multilayer perceptron (default settings, SMOTE correction, LOOCV to estimate performance), build a model, and then predict the "?" instances.
2. I use all instances (1, 0, ?) to train the network (default settings, SMOTE correction for the imbalance between the 1 and 0 instances as before, and LOOCV to estimate performance).
The scores assigned to each prediction differ between the two situations.
My question is: how does the neural network handle unlabelled instances in the training set? Is situation 1 a bad approach?
thanks
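The post sounds like it was done in Weka, but for reference, here is a minimal MATLAB sketch of situation 1 (omitting the SMOTE and LOOCV steps; patternnet and the variable names are illustrative):
% Situation 1: train only on the 0/1-labelled rows, then score the "?" rows.
% y holds 0, 1, or NaN for "?"; X is instances-by-features.
labelled = ~isnan(y);
net = patternnet(10);                     % small MLP with default training settings
net = train(net, X(labelled, :)', y(labelled)');
scores = net(X(~labelled, :)');           % predicted scores for the "?" instances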
Consider a dataset A which has examples for training a binary classification problem. I have used an SVM and applied the weighted method (in MATLAB), since the dataset is highly imbalanced. I applied weights inversely proportional to the frequency of data in each class. This is done during training using the command:
fitcsvm(trainA, trainTarg, ...
    'KernelFunction', 'RBF', 'KernelScale', 'auto', ...
    'BoxConstraint', C, 'Weights', weightTrain);
I have used 10-fold cross-validation for training and tuned the hyperparameters as well, so inside the CV the dataset A is split into a training set (trainA) and a validation set (valA). After training is over, and outside the CV loop, I get the following confusion matrix on A:
               predicted majority   predicted minority
true majority               80025                    1
true minority                   0                  140
There is only 1 false positive (FP), and all minority-class examples have been correctly classified, giving true positives (TP) = 140.
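For reference, a minimal sketch of the inverse-frequency weights described above (trainTarg is the training label vector; the names are illustrative):
% One weight per observation, inversely proportional to its class frequency.
classes = unique(trainTarg);
weightTrain = zeros(size(trainTarg));
for k = 1:numel(classes)
    idx = (trainTarg == classes(k));
    weightTrain(idx) = numel(trainTarg) / sum(idx);
end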
PROBLEM: I then run the trained model on a new test set B that was never seen during training. This is the confusion matrix for testing on B:
               predicted majority   predicted minority
true majority               50075                    0
true minority                 100                    0
As can be seen, the minority class has not been classified at all, so the purpose of the weights has failed: although there are no false positives, the SVM fails to capture any of the minority-class examples.
I have not applied any weights or balancing method such as sampling (SMOTE, RUSBoost, etc.) to B. What could be wrong, and how can I overcome this problem?
Class misclassification costs could be set instead of per-sample weights!
You can set the class costs based on the following example.
The misclassification cost of predicting the minority class B (m records) as the dominant class A (n records) can be set to n/m, so that missing a minority example is penalized heavily.
The misclassification cost of predicting class A as class B can be set to 1, or to m/n, based on the severity you want to impose on the learning.
% Cost(i,j) is the cost of classifying a point into class j when its true class is i.
% With class order [A; B], this penalizes B-misclassified-as-A more heavily:
c = [0 1; 2.2 0];
mdl = fitcsvm(X, Y, 'Cost', c);
According to the documentation:
"For two-class learning, if you specify a cost matrix, then the software updates the prior probabilities by incorporating the penalties described in the cost matrix. Consequently, the cost matrix resets to the default. For more details on the relationships and algorithmic behavior of BoxConstraint, Cost, Prior, Standardize, and Weights, see Algorithms."
The Area Under the Curve (AUC) is commonly used to measure the performance of models applied to imbalanced data. It is also good to plot the ROC curve to get more insight visually. Using only the confusion matrix for such models may lead to misinterpretation.
perfcurve from the Statistics and Machine Learning Toolbox provides both functionalities.
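A minimal sketch of both uses (assuming mdl is the trained SVM, testX/testY are the test data, and the minority class is the positive class; the variable names are illustrative):
[~, score] = predict(mdl, testX);                  % continuous scores, one column per class
[fpr, tpr, ~, auc] = perfcurve(testY, score(:, 2), minorityLabel);
plot(fpr, tpr); xlabel('False positive rate'); ylabel('True positive rate');
title(sprintf('ROC, AUC = %.3f', auc));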
I know that categorical data should be one-hot encoded before training a machine learning algorithm. I also know that for multivariate linear regression I need to exclude one of the encoded variables to avoid the so-called dummy variable trap.
Ex: if I have a categorical feature "size" with values "small", "medium", "large", then the one-hot encoding would look something like:
small   medium   large   other-feature
  0        1       0          2999
So, to avoid the dummy variable trap, I need to remove one of the three columns, for example the column "small".
Should I do the same when training a neural network? Or is this purely for multivariate regression?
Thanks.
As stated here, the dummy variable trap needs to be avoided (one category of each categorical feature removed after encoding, but before training) at the input of algorithms that consider all the predictors together, as a linear combination. Such algorithms are:
Linear/multilinear regression
Logistic regression
Discriminant analysis
Neural networks that don't employ weight decay
If you remove a category from the input of a neural network that does employ weight decay, the network will instead become biased in favour of the omitted category.
Even though no information is lost when omitting one category after encoding a feature, other algorithms have to infer the omitted category indirectly, through a combination of all the other categories, making them do more computation for the same result.
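A minimal sketch of the difference, using MATLAB's dummyvar (the variable names are illustrative):
sz = categorical({'small'; 'medium'; 'large'; 'small'});
D = dummyvar(sz);          % one 0/1 column per category
Xlinear = D(:, 2:end);     % drop one column for (multi)linear regression
Xnn = D;                   % keep all columns for a network trained with weight decay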
I am solving a classification problem. I train my unsupervised neural network on a set of entities (using the skip-gram architecture).
The way I evaluate it is to search for the k nearest neighbours of each point in the validation data among the training data. I take a weighted sum (weights based on distance) of the labels of the nearest neighbours and use that as the score for each point of the validation data.
Observation: as I increase the number of epochs (model 1: 600 epochs, model 2: 1400 epochs, model 3: 2000 epochs), my AUC improves at smaller values of k but saturates at similar values.
What could be a possible explanation of this behaviour?
[Reposted from CrossValidated]
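A minimal sketch of this evaluation, assuming knnsearch from the Statistics and Machine Learning Toolbox (trainEmb/valEmb are the learned embedding matrices and trainLabels the training labels; the names are illustrative):
k = 10;
[idx, d] = knnsearch(trainEmb, valEmb, 'K', k);    % k nearest training points per validation point
w = 1 ./ max(d, eps);                              % inverse-distance weights
w = w ./ sum(w, 2);                                % normalize per validation point
valScore = sum(w .* trainLabels(idx), 2);          % weighted sum of neighbour labels
% valScore can then be passed to perfcurve to compute the AUC.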
To cross-check whether imbalanced classes are the issue, try fitting an SVM model. If that gives a better classification (possible if your ANN is not very deep), it may be concluded that the classes should be balanced first.
Also, try some kernel functions to check whether such a transformation makes the data linearly separable; see the sketch below.
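For example, a quick comparison of kernels via cross-validated loss (a sketch using fitcsvm; X and Y are the training data):
for kern = {'linear', 'rbf', 'polynomial'}
    mdl = fitcsvm(X, Y, 'KernelFunction', kern{1}, 'KernelScale', 'auto');
    cv = crossval(mdl, 'KFold', 10);
    fprintf('%s kernel: 10-fold CV loss = %.4f\n', kern{1}, kfoldLoss(cv));
end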
I'm developing a project for university. I have to create a classifier for a disease. The dataset I have contains several inputs (symptoms), and each of them is associated with a multiplicative probability factor (e.g. if a patient has symptom A, he is twice as likely to have the disease).
So, how can I build this type of classifier? Is there any type of neural network or other tool for doing this?
Thanks in advance
You should specify how much labelled and unlabelled data you have.
Let's assume you have only labelled data. Then you could use neural networks but, IMHO, SVMs or random forests are the best techniques for a first try.
Note that if you use machine learning techniques, your prior information about the symptoms (the multiplicative coefficients) is not used, because the labels are used instead. If you want to use these coefficients directly, it is no longer machine learning.
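To illustrate that last point: applying the given coefficients directly is just a fixed scoring rule rather than a learned model, for example (baseRate and factors are assumed to be given, not learned):
% Probability score = base disease rate times the factor of each symptom present.
score = baseRate * prod(factors(symptoms == 1));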
A neural network can also be used for this purpose. Regarding your situation, tying symptom A to a higher chance of disease B is exactly what a neural network should be able to accomplish: it binds the connection weights from input A (symptom A) to the output for disease B. On your side, you can engrain such a classification rule provided you have enough examples in your training set. I also propose that you try two different approaches: 1. a single neural network with N outputs (N = the number of diseases to classify); 2. a separate neural network for each disease. A sketch of approach 1 follows.
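A minimal sketch of approach 1, one network with N outputs (assuming patternnet from the Deep Learning Toolbox; the names are illustrative):
% X: symptoms-by-patients inputs; T: diseases-by-patients one-hot targets.
net = patternnet(20);
net = train(net, X, T);
probs = net(X);                           % one score per disease per patient
[~, predictedDisease] = max(probs, [], 1);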
I'm training a neural network. My test-set correlation is decreasing while the training-set correlation increases.
What could be the problem?
This is expected behaviour: you are simply seeing your network overfit. It is trying "too hard" to model your training data and, as a result, it loses its generalization capability (its score on the test data). For this reason you should train neural networks neither "to some error on the training set" nor "for as long as you can"; instead, use more principled techniques, such as regularization (at least weight decay) and/or early stopping.
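In MATLAB's shallow-network API, for example, both techniques are available directly (a sketch; the specific values are illustrative):
net = patternnet(20);
net.performParam.regularization = 0.1;   % weight decay: mixes the error with the mean squared weights
net.divideParam.trainRatio = 0.70;       % hold out data so that training...
net.divideParam.valRatio = 0.15;         % ...stops early when the validation error starts rising
net.divideParam.testRatio = 0.15;
net = train(net, X, T);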