Vowpal Wabbit unbalanced classes - classification

I'm trying to fit the model for binary classification and predict the probability of values belonging to these classes.
My first problem is that I can't interpret the results. I have a training set in which the labels are 0 and 1 (not -1 and +1).
I run the model:
vw train.vw -f model.vw --link=logistic
Next:
vw test.vw -t -i model.vw -p pred.txt
Then I have a file pred.txt with these values:
0.5
0.5111
0.5002
0.5093
0.5
I don't understand what 0.5 means. All values in pred.txt are about 0.5. I wrote a script that subtracts 0.5 from each result, and I get these lines:
0
0.111
0.002
0.093
0
Is that my desired probability?
And here is my second problem: my target classes are unbalanced. I have 95% negative (0) and 5% positive (1) examples. How can I tell VW to account for the class imbalance, e.g. with weights like {class 0: 0.1, class 1: 0.9}?
Or should this be done when preparing the dataset?

For binary classification in VW, the labels need to be converted (from 0 and 1) to -1 and +1, e.g. with sed -e 's/^0/-1/'.
In addition to --link=logistic, you also need to use --loss_function=logistic if you want to interpret the predictions as probabilities.
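A minimal sketch of why the predictions in the question cluster around 0.5: the logistic link passes the raw linear score through 1/(1+exp(-x)), so a raw score of 0 becomes exactly 0.5. The helper names below are illustrative and are not part of VW.

```python
import math

def logistic_link(raw_score):
    """Map a raw linear score to (0, 1), as a logistic link does."""
    return 1.0 / (1.0 + math.exp(-raw_score))

def raw_from_prediction(p):
    """Invert the link: recover the raw score from a value in (0, 1)."""
    return math.log(p / (1.0 - p))

# Values taken from pred.txt in the question.
for p in [0.5, 0.5111, 0.5002, 0.5093]:
    print(p, round(raw_from_prediction(p), 4))
```

Under this reading, values close to 0.5 correspond to raw scores close to 0, so subtracting 0.5 does not turn them into probabilities.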
For unbalanced classes, you need to use importance weighting and tune the importance weight constant on a held-out set (or with cross-validation) using an external evaluation metric of your choice (e.g. AUC or F1).
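The label conversion and importance weighting can be sketched as a small preprocessing step. In VW's text format an optional importance weight follows the label, before the `|`. The function name, the feature names, and the default weight of 10 are assumptions for illustration; a weight of 19 would balance a 95/5 split exactly, but the value should be tuned as described above.

```python
def add_importance(line, pos_weight=10.0):
    """Rewrite one 0/1-labelled VW line to -1/+1, weighting the positives.

    Assumes the simple layout 'label | features' with no existing
    importance weight or tag on the line.
    """
    head, _, features = line.rstrip("\n").partition("|")
    label = head.split()[0]
    if label == "1":
        # Positive example: keep label 1 and attach the importance weight.
        return f"1 {pos_weight:g} |{features}"
    # Negative example: map 0 to -1 and leave the default weight of 1.
    return f"-1 |{features}"

# Hypothetical example lines.
print(add_importance("1 | price:0.3 clicks:2"))
print(add_importance("0 | price:0.9 clicks:0"))
```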
See also:
Calculating AUC when using Vowpal Wabbit
Vowpal Wabbit Logistic Regression
How to perform logistic regression using vowpal wabbit on very imbalanced dataset

Related

Why is softmax function necessary? Why not simple normalization?

I am not familiar with deep learning so this might be a beginner question.
In my understanding, softmax function in Multi Layer Perceptrons is in charge of normalization and distributing probability for each class.
If so, why don't we use the simple normalization?
Let's say, we get a vector x = (10 3 2 1)
applying softmax, output will be y = (0.9986 0.0009 0.0003 0.0001).
Applying simple normalization (dividing each element by the sum, 16),
the output will be y = (0.625 0.1875 0.125 0.0625).
It seems like simple normalization could also distribute the probabilities.
So, what is the advantage of using softmax function on the output layer?
Normalization does not always produce probabilities, for example, it doesn't work when you consider negative values. Or what if the sum of the values is zero?
But taking the exponential of the logits changes that: the exponential is always strictly positive, so the sum is never zero and the full range of the logits can be mapped into probabilities. So it is preferred because it actually works.
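The difference can be checked with a short sketch using the vector from the question plus one vector containing a negative value. The function names are illustrative; the softmax subtracts the maximum before exponentiating, a common trick for numerical stability that does not change the result.

```python
import math

def softmax(xs):
    """Exponentiate and normalize; every output is strictly positive."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def simple_normalize(xs):
    """Divide each element by the sum of all elements."""
    total = sum(xs)
    return [x / total for x in xs]

x = [10, 3, 2, 1]
print([round(p, 4) for p in softmax(x)])
print(simple_normalize(x))

# With a negative logit, simple normalization yields a negative "probability",
# and a vector such as [1, -1] would divide by zero.
print(simple_normalize([2, -1]))
print(softmax([2, -1]))
```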
This depends on the training loss function. Many models are trained with a log loss algorithm, so that the values you see in that vector estimate the log of each probability. Thus, SoftMax is merely converting back to linear values and normalizing.
The empirical reason is simple: SoftMax is used where it produces better results.

Multiclass classification and the sigmoid function

Say I have a training set Y:
1,0,1,0
0,1,1,0
0,0,1,1
0,0,1,0
And the sigmoid function is defined as: f(x) = 1 / (1 + e^(-x))
As the sigmoid function outputs a value between 0 and 1, does this mean that the training data and the values we are trying to predict should also fall between 0 and 1?
Is it also correct to use the sigmoid function for making predictions when the training set values are not between 0 and 1?
1,4,3,0
2,1,1,0
7,2,6,1
3,0,5,0
Yes, it is perfectly valid to have non-binary features.
The output falls between 0 and 1 because of the nature of the sigmoid function; nothing stops you from having a non-binary feature set.
Do the predictions have to be binary?
No, you can have multiclass logistic classification as well.
The simplest way of doing that is solving a one-vs-all classification problem, wherein you train one binary logistic classifier for each of the labels.
For example, if your prediction space spans (1, 2, 3, 4), you can have 4 logistic classifiers.
Given any point in the test set, you can give it the label corresponding to the classifier which is most confident (i.e. has the highest score for that test point).
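A minimal sketch of the prediction step in one-vs-all, assuming the four binary logistic classifiers have already been trained. The weights, biases, and the test point below are made-up values for illustration only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_one_vs_all(classifiers, x):
    """Score x with every binary classifier and return the most confident label."""
    scores = {
        label: sigmoid(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
        for label, (w, b) in classifiers.items()
    }
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical (weights, bias) for one logistic classifier per label.
classifiers = {
    1: ([1.0, 0.0], 0.0),
    2: ([0.0, 1.0], 0.0),
    3: ([-1.0, 0.0], 0.0),
    4: ([0.0, -1.0], 0.0),
}

label, scores = predict_one_vs_all(classifiers, [0.2, 2.0])
print(label, scores)
```

Note that the features of the test point are not restricted to [0, 1]; only each classifier's output is.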

LIBSVM - no probability estimates

I'm using LIBSVM for MATLAB. When I use a regression SVM, the probability estimates it outputs are an empty matrix, whereas this feature works fine when using classification. Is this normal behavior? The LIBSVM README says:
-b probability_estimates: whether to train a SVC or SVR model for probability estimates,
0 or 1 (default 0)
[~,~,P] = svmpredict(x,y,model,'-b 1');
The output P holds the probabilities that y belongs to class 1 and -1 respectively (an m*2 array), and it only makes sense for a classification problem.
For a regression problem, the pairwise probability information is included in your trained model as model.ProbA.

Labeling one class for cross validation in libsvm matlab

I want to use one-class classification using LibSVM in MATLAB.
I want to train data and use cross validation, but I don't know what I have to do to label the outliers.
If for example I have this data:
trainData = [1,1,1; 1,1,2; 1,1,1.5; 1,1.5,1; 20,2,3; 2,20,2; 2,20,5; 20,2,2];
labelTrainData = [-1 -1 -1 -1 0 0 0 0];
(The first four are examples of the 1 class, the other four are examples of outliers, just for the cross validation)
And I train the model using this:
model = svmtrain(labelTrainData, trainData , '-s 2 -t 0 -d 3 -g 2.0 -r 2.0 -n 0.5 -m 40.0 -c 0.0 -e 0.0010 -p 0.1 -v 2' );
I'm not sure which value to use to label the one-class data and which to use for the outliers. Does anyone know how to do this?
Thanks in advance.
-Jessica
According to http://www.joint-research.org/wp-content/uploads/2011/07/lukashevich2009Using-One-class-SVM-Outliers-Detection.pdf "Due to the lack of class labels in the one-class SVM, it is not possible to optimize the kernel parameters using cross-validation".
However, according to the LIBSVM FAQ that is not quite correct:
Q: How do I choose parameters for one-class SVM as training data are in only one class?
You have pre-specified true positive rate in mind and then search for parameters which achieve similar cross-validation accuracy.
Furthermore the README for the libsvm source says of the input data:
"For classification, label is an integer indicating the class label ... For one-class SVM, it's not used so can be any number."
I think your outliers should not be included in the training data, since libsvm will ignore the training labels. What you are trying to do is find a hypersphere that contains good data but not outliers. If you train with outliers in the data, LIBSVM will try to find a hypersphere that includes the outliers, which is exactly what you don't want. So you will need a training dataset without outliers, a validation dataset with outliers for choosing parameters, and a final test dataset to see whether your model generalizes.
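The data split and parameter search described above can be sketched as follows, using the points from the question. The "model" here is a deliberately simple stand-in (a sphere around the centroid of the training data with a tunable radius), not LIBSVM's one-class SVM; all function names are hypothetical. The point is only that the outliers appear in validation, never in training.

```python
import math

inliers = [[1, 1, 1], [1, 1, 2], [1, 1, 1.5], [1, 1.5, 1]]
outliers = [[20, 2, 3], [2, 20, 2], [2, 20, 5], [20, 2, 2]]

train = inliers[:3]    # training data: inliers only
val_in = inliers[3:]   # held-out inliers for validation
val_out = outliers     # outliers are used only for validation

def fit_centroid(points):
    """Stand-in for training: the mean of the training points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def make_predictor(center, radius):
    """Return +1 inside the sphere (inlier) and -1 outside (outlier)."""
    return lambda x: 1 if math.dist(x, center) <= radius else -1

def rates(predict, val_in, val_out):
    """True positive rate on inliers, true negative rate on outliers."""
    tpr = sum(predict(x) == 1 for x in val_in) / len(val_in)
    tnr = sum(predict(x) == -1 for x in val_out) / len(val_out)
    return tpr, tnr

center = fit_centroid(train)
results = {r: rates(make_predictor(center, r), val_in, val_out)
           for r in (0.5, 5.0, 50.0)}
for radius, (tpr, tnr) in results.items():
    print(radius, tpr, tnr)
```

With a real one-class SVM the radius would be replaced by the parameters being tuned (for example nu and the kernel width), and a separate test set would be kept for the final check.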

Neural Network with tanh wrong saturation with normalized data

I'm using a neural network made of 4 input neurons, 1 hidden layer made of 20 neurons and a 7 neuron output layer.
I'm trying to train it for a BCD to 7-segment algorithm. My data is normalized so that 0 is -1 and 1 is 1.
When the output error is evaluated, the neuron saturates at the wrong value. If the desired output is 1 and the real output is -1, the error is 1-(-1) = 2.
When I multiply it by the derivative of the activation function, error*(1-output)*(1+output), the result becomes almost 0 because 2*(1-(-1))*(1+(-1)) = 0.
How can I avoid this saturation error?
Saturation at the asymptotes of the activation function is a common problem with neural networks. If you look at a graph of the function, this is not surprising: the asymptotes are almost flat, meaning that the first derivative is (almost) 0. The network cannot learn any more.
A simple solution is to scale the activation function to avoid this problem. For example, with tanh() activation function (my favorite), it is recommended to use the following activation function when the desired output is in {-1, 1}:
f(x) = 1.7159 * tanh( 2/3 * x)
Consequently, the derivative is
f'(x) = 1.14393 * (1 - tanh(2/3 * x)^2)
This will force the gradients into the most non-linear value range and speed up the learning. For all the details I recommend reading Yann LeCun's great paper Efficient Back-Prop.
In the case of this scaled tanh() activation function, the error would be calculated as
error = 2/3 * (1.7159 - output^2 / 1.7159) * (teacher - output)
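A short sketch to check the scaled activation numerically. It uses the chain-rule derivative 1.7159 * 2/3 * (1 - tanh(2/3 * x)^2), rewritten in terms of the output as 2/3 * (1.7159 - output^2 / 1.7159). The function names are illustrative. It also reproduces the situation from the question (target 1, output -1) for plain tanh and for the scaled version.

```python
import math

A, S = 1.7159, 2.0 / 3.0

def f(x):
    """Scaled tanh: f(1) is approximately 1 and f(-1) approximately -1."""
    return A * math.tanh(S * x)

def f_prime(x):
    """Derivative of f with respect to x (chain rule)."""
    return A * S * (1.0 - math.tanh(S * x) ** 2)

def f_prime_from_output(out):
    """The same derivative expressed through the neuron's output."""
    return S * (A - out * out / A)

def scaled_delta(teacher, out):
    return f_prime_from_output(out) * (teacher - out)

def plain_tanh_delta(teacher, out):
    """Error term for unscaled tanh, as in the question."""
    return (1.0 - out) * (1.0 + out) * (teacher - out)

print(f(1.0), f(-1.0))
print(plain_tanh_delta(1.0, -1.0))  # vanishes at the target value -1
print(scaled_delta(1.0, -1.0))      # stays clearly non-zero
```

With the scaled function, the targets -1 and +1 no longer sit on the asymptotes (which are at -1.7159 and +1.7159), so the gradient at those outputs does not vanish.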
This is bound to happen no matter what function you use. The derivative, by definition, will be zero when the output reaches one of two extremes. It's been a while since I have worked with Artificial Neural Networks but if I remember correctly, this (among many other things) is one of the limitations of using the simple back-propagation algorithm.
You could add a Momentum factor to make sure there is some correction based off previous experience, even when the derivative is zero.
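A minimal sketch of a momentum update for a single weight, assuming the common form in which a velocity term accumulates past gradients. The learning rate and momentum coefficient are arbitrary illustrative values.

```python
def momentum_step(weight, velocity, grad, lr=0.1, mu=0.9):
    """One gradient-descent step with momentum for a single weight."""
    velocity = mu * velocity - lr * grad
    return weight + velocity, velocity

w, v = 0.0, 0.0
w, v = momentum_step(w, v, grad=1.0)  # ordinary step with a non-zero gradient
w, v = momentum_step(w, v, grad=0.0)  # saturated neuron: gradient is zero
print(w, v)  # the weight still moved on the second step
```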
You could also train it by epoch, where you accumulate the delta values for the weights before doing the actual update (compared to updating it every iteration). This also mitigates conditions where the delta values are oscillating between two values.
There may be more advanced methods, like second order methods for back propagation, that will mitigate this particular problem.
However, keep in mind that tanh reaches -1 or +1 at the infinities and the problem is purely theoretical.
Not totally sure if I am reading the question correctly, but if so, you should scale your inputs and targets between 0.9 and -0.9 which would help your derivatives be more sane.