I want to force a classifier to not come up with the same results all the time (unsupervised, so I have no targets):
max_indices = tf.argmax(result, 1)  # predicted class for each example
cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=result, labels=max_indices, name="cross_entropy_per_example")
cross_entropy_mean = tf.reduce_mean(cross_entropy, name="cross_entropy")
Where:
result is the tensor of logits returned from inference
max_indices are thus the predicted classes, one per example in the batch (size = batch_size)
cross_entropy as implemented here measures how strongly the predicted class is in fact predicted (essentially just the confidence of the prediction)
I then optimize to minimize that loss. Basically I want the net to predict a class as strongly as possible.
Obviously this converges to some random class and will then classify everything in that one class.
So what I want is to add a penalty that prevents all predictions in a batch from being the same. I checked the math and came up with the Shannon Diversity index as a good measure, but I cannot figure out how to implement it in TensorFlow. Any idea how to do this, either with the diversity measure stated or any substitute?
Thx
A good rule of thumb is to use a loss function that reflects what you actually want to optimize. If you want to increase diversity, it makes sense to have your loss function actually measure diversity.
While I'm sure there's a more correct way to do it, here's one heuristic that can get you closer to the Shannon Diversity you mention:
Let's assume that the output of the softmax is close to one for the predicted class and close to zero for all other classes.
Then the proportion of each class is the sum of outputs of the softmax over the batch divided by the batch size.
Then the loss function that approximates the Shannon Diversity would be something along the lines of:
sm = tf.nn.softmax(result)
proportions = tf.reduce_mean(sm, 0) # approximated proportion of each class over the batch
addends = proportions * tf.log(proportions) # each proportion multiplied by its own log
loss = tf.reduce_sum(addends) # add them up; this is the negative Shannon entropy, so minimizing it increases diversity
When I think about it more, this might break: instead of diversifying the classes, the network could just make very uncertain predictions (which breaks the original assumption that the softmax output is a good approximation of the one-hot encoding of the predicted class). To get around that, I would add together the loss I described above and your original loss from the question. The loss I described will optimize the approximated Shannon Diversity, while your original loss will prevent the softmax from becoming more and more uncertain.
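For concreteness, here is a minimal sketch of what that combined objective could look like, reusing result and cross_entropy_mean from your question (the trade-off factor diversity_weight is an assumption of mine and would need tuning):
import tensorflow as tf

sm = tf.nn.softmax(result)
proportions = tf.reduce_mean(sm, 0)  # approximate class proportions in the batch
diversity_loss = tf.reduce_sum(proportions * tf.log(proportions))  # negative Shannon entropy

diversity_weight = 1.0  # hypothetical weighting, tune on your data
total_loss = cross_entropy_mean + diversity_weight * diversity_loss  # confidence + diversity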
I am facing a peculiar problem and I was wondering if there is an explanation. I am trying to run a linear regression problem and test different optimization methods, and two of them give a strange outcome when compared to each other. I built a data set that satisfies y = 2x + 5 and added random noise to it.
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader

# training data: y = 2x + 5 plus Gaussian noise with std 2
xtrain = torch.tensor(np.arange(0, 50, 1).reshape(50, 1), dtype=torch.float32)
ytrain = 2 * xtrain + 5 + 2 * torch.randn(50, 1)

model1 = nn.Linear(1, 1)
opt1 = torch.optim.SGD(model1.parameters(), lr=1e-5, momentum=0.8)
opt2 = torch.optim.Rprop(model1.parameters(), lr=1e-5)
F_loss = F.mse_loss

train_d = TensorDataset(xtrain, ytrain)
train = DataLoader(train_d, 50, shuffle=True)

loss = F_loss(model1(xtrain), ytrain)  # initial loss before training

def fit(nepoch, model1, F_loss, opt):
    for epoch in range(nepoch):
        for i, j in train:
            predict = model1(i)
            loss = F_loss(predict, j)
            loss.backward()
            opt.step()
            opt.zero_grad()
When I compare the results of the following commands:
fit(500000, model1, F_loss, opt1)
fit(500000, model1, F_loss, opt2)
In the last epoch for opt1: loss = 2.86, weight = 2.02, bias = 4.46
In the last epoch for opt2: loss = 3.47, weight = 2.02, bias = 4.68
These results do not make sense to me: shouldn't opt2 have a smaller loss than opt1, since the weight and bias it finds are closer to the real values (2 and 5 respectively)? Am I doing something wrong?
This has to do with the fact that you are drawing the training samples themselves from a random distribution.
By doing so, you inherently randomized the ground truth to some extent. Sure, you will get values that are distributed around 2x+5, but you do not guarantee that 2x+5 will also be the best fit to this data distribution.
It could thus happen that you accidentally end up with values that deviate significantly from the original function, and, since you use a mean squared error, these deviations are weighted quite heavily.
In expectation (i.e., for the number of samples going towards infinity), you will likely get closer and closer to the expected parameters.
A way to verify this would be to plot your training samples against the parameter set, as well as the (ideal) underlying function.
Also note that linear regression has a closed-form solution - something that is very uncommon in machine learning - meaning you can directly calculate an optimal solution, e.g., with sklearn's LinearRegression.
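As a quick sanity check, here is a sketch using the same data-generation setup as in the question; the closed-form fit gives the best parameters for this particular noisy sample, which will generally not be exactly 2 and 5:
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.arange(0, 50, 1).reshape(50, 1)
y = 2 * x + 5 + np.random.normal(0, 2, (50, 1))

reg = LinearRegression().fit(x, y)  # closed-form least-squares solution
print(reg.coef_, reg.intercept_)    # best fit for this noisy sample, not necessarily 2 and 5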
I am working on a Classification problem with 2 labels : 0 and 1. My training dataset is a very imbalanced dataset (and so will be the test set considering my problem).
The class proportion of the imbalanced dataset is 1000:4, with label '0' appearing 250 times more often than label '1'. However, I have a lot of training samples: around 23 million. So I should get around 100,000 samples for label '1'.
Given the large number of training samples I have, I didn't consider SVM. I also read about SMOTE for Random Forests. However, I was wondering whether a NN could handle this kind of imbalance on such a large dataset?
Also, as I am using TensorFlow to design the model, which characteristics should/could I tune to handle this imbalanced situation?
Thanks for your help !
Paul
Update:
Considering the number of answers, and that they are quite similar, I will reply to all of them here as a common answer.
1) Over the weekend I tried the 1st option, increasing the cost for the positive label. With a less imbalanced proportion (like 1/10, on another dataset), this seems to help a bit to get a better result, or at least to 'bias' the precision/recall trade-off.
However, for my situation,
it seems to be very sensitive to the alpha parameter. With alpha = 250, which is the imbalance ratio of the dataset, I get a precision of 0.006 and a recall of 0.83, but the model predicts far more 1s than it should - around 50% of the predictions are label '1' ...
With alpha = 100, the model predicts only '0'. I guess I'll have to do some 'tuning' for this alpha parameter :/
I'll take a look at this TF function too, as I did it manually for now: tf.nn.weighted_cross_entropy_with_logits
2) I will try to rebalance the dataset, but I am afraid of losing a lot of information doing that, as I have millions of samples but only ~100k positive samples.
3) Using a smaller batch size seems indeed a good idea. I'll try it !
There are usually two common ways to handle an imbalanced dataset:
Online sampling, as mentioned above. In each iteration you sample a class-balanced batch from the training set (a sketch is given below).
Re-weight the cost of the two classes. You'd want to give the loss on the dominant class a smaller weight. For example, this is used in the paper Holistically-Nested Edge Detection.
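A minimal sketch of the online-sampling idea in plain NumPy (the labels array and the way the indices feed your input pipeline are assumptions; adapt to your setup):
import numpy as np

def balanced_batch_indices(labels, batch_size, rng):
    # draw half the batch from each class; sample the rare class with replacement
    pos = np.where(labels == 1)[0]
    neg = np.where(labels == 0)[0]
    half = batch_size // 2
    idx = np.concatenate([rng.choice(pos, half, replace=True),
                          rng.choice(neg, half, replace=False)])
    rng.shuffle(idx)
    return idx

rng = np.random.default_rng(0)
labels = np.array([0] * 1000 + [1] * 4)           # toy 1000:4 imbalance
batch = balanced_batch_indices(labels, 32, rng)   # 16 positives, 16 negatives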
I will expand a bit on chasep's answer.
If you are using a neural network followed by softmax + cross-entropy or hinge loss, you can, as chasep255 mentioned, make it more costly for the network to misclassify the examples that appear less often.
To do that, simply split the cost into two parts and put more weight on the class that has fewer examples.
For simplicity, say that the dominant class is labelled negative (neg) for the softmax and the other positive (pos) (for hinge loss you could do exactly the same):
L = L_{neg} + L_{pos}  =>  L = L_{neg} + \alpha * L_{pos}
With \alpha greater than 1.
Which, for cross-entropy where the positives are labelled [1, 0] and the negatives [0, 1], would translate in TensorFlow to something like:
cross_entropy_mean = -tf.reduce_mean(targets * tf.log(y_out) * tf.constant([alpha, 1.]))
What is more, by digging a bit into the TensorFlow API there seems to be a function, tf.nn.weighted_cross_entropy_with_logits, that implements this. I did not read the details, but it looks fairly straightforward.
Another way, if you train your algorithm with mini-batch SGD, would be to make batches with a fixed proportion of positives.
I would go with the first option as it is slightly easier to do with TF.
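For reference, a minimal sketch of the built-in weighted loss (TF1-style call; labels and logits are placeholders for your 0/1 targets and raw model outputs, and the pos_weight of 250 is just the imbalance ratio from the question and would need tuning):
# up-weight the rare positive class by pos_weight
per_example = tf.nn.weighted_cross_entropy_with_logits(targets=labels, logits=logits, pos_weight=250.0)
loss = tf.reduce_mean(per_example)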
One thing I might try is weighting the samples differently when calculating the cost. For instance, maybe divide the cost by 250 if the expected result is a 0 and leave it alone if the expected result is a 1. This way the rarer samples have more of an impact. You could also simply try training it without any changes and see if the nnet just happens to work. I would make sure to use a large batch size though, so you always get at least one of the rare samples in each batch.
Yes - a neural network could help in your case. There are at least two approaches to such a problem:
Leave your dataset unchanged but decrease the batch size and the number of epochs. Apparently this might help better than keeping the batch size big. From my experience - in the beginning the network adjusts its weights to assign the most probable class to every example, but after many epochs it will start to adjust itself to increase performance over the whole dataset. Using cross-entropy will give you additional information about the probability of assigning 1 to a given example (assuming your network has sufficient capacity).
Balance your dataset and adjust your scores during the evaluation phase using Bayes' rule: score_of_class_k ~ score_from_model_for_class_k / original_percentage_of_class_k.
You may reweight your classes in the cost function (as mentioned in one of the answers). The important thing then is to also reweight your scores in your final answer.
I'd suggest a slightly different approach. When it comes to image data, the deep learning community has already come up with a few ways to augment data. Similar to image augmentation, you could try to generate fake data to "balance" your dataset. The approach I tried was to use a Variational Autoencoder and then sample from the underlying distribution to generate fake data for the class you want. I tried it and the results are looking pretty cool: https://lschmiddey.github.io/fastpages_/2021/03/17/data-augmentation-tabular-data.html
I am currently trying to use Neural Network to make regression predictions.
However, I don't know what the best way to handle this is, as I read that there are 2 different ways to do regression predictions with a NN.
1) Some websites/articles suggest adding a final layer which is linear.
http://deeplearning4j.org/linear-regression.html
My final layers would, I think, look like:
layer1 = tanh(layer0*weight1 + bias1)
layer2 = identity(layer1*weight2+bias2)
I also noticed that when I use this solution, I usually get a prediction which is the mean of the batch predictions. And this is the case whether I use tanh or sigmoid as the penultimate layer.
2) Some other websites/articles suggest scaling the output to a [-1, 1] or [0, 1] range and using tanh or sigmoid as the final layer.
Are these 2 solutions acceptable? Which one should one prefer?
Thanks,
Paul
I would prefer the second case, in which we use normalization and a sigmoid function as the output activation and then scale the normalized output values back to their actual values. This is because, in the first case, to output large values (since actual values are large in most cases), the weights mapping from the penultimate layer to the output layer would have to be large. Thus, for faster convergence, the learning rate has to be made larger. But this may also cause the learning of the earlier layers to diverge, since we are using a larger learning rate. Hence, it is advised to work with normalized target values, so that the weights are small and they learn quickly.
Hence in short, the first method learns slowly or may diverge if a larger learning rate is used and on the other hand, the second method is comparatively safer to use and learns quickly.
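A minimal sketch of the second option (the use of scikit-learn's MinMaxScaler and the toy numbers are my own choice; any [0, 1] scaling works): the network is trained against scaled targets, and its sigmoid outputs are mapped back to the original units afterwards.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

y = np.array([[120.0], [340.0], [80.0], [510.0]])  # hypothetical raw regression targets
scaler = MinMaxScaler()                            # maps the targets to [0, 1]
y_scaled = scaler.fit_transform(y)                 # train the network against these values

# after training, sigmoid outputs (in [0, 1]) are scaled back to the original range:
y_pred_scaled = np.array([[0.25], [0.90]])         # made-up network outputs
y_pred = scaler.inverse_transform(y_pred_scaled)   # predictions in the original units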
I'm using Matlab (github code repository). The details of the network are:
Hidden units: 100 (variable)
Epochs : 500
Batch size: 100
The weights are being updated using Back propagation algorithm.
I've been able to recognize 0, 1, 2, 3, 4, 5, 6 and 8, which I have drawn in Photoshop.
However, 7 and 9 are not recognized, but upon running on the test set I get only 749/10000 wrong and it correctly classifies 9251/10000.
Any idea what might be wrong? Because it is learning, and based on the test set results it's learning correctly.
I don't see anything downright incorrect in your code, but there is a lot that can be improved:
You use this to set the initial weights:
hiddenWeights = rand(hiddenUnits,inputVectorSize);
outputWeights = rand(outputVectorSize,hiddenUnits);
hiddenWeights = hiddenWeights./size(hiddenWeights, 2);
outputWeights = outputWeights./size(outputWeights, 2);
This will make your weights very small I think. Not only that, but you will have no negative values, so you'll throw away half of the sigmoid's range of values. I suggest you try:
weights = 2*rand(x, y) - 1
Which will generate random numbers in [-1, 1]. You can then try dividing this interval to get smaller weights (try dividing by the sqrt of the size).
You use this as the output delta:
outputDelta = dactivation(outputActualInput).*(outputVector - targetVector) % (tk-yk)*f'(yin)
Multiplying by the derivative is done if you use the square loss function. For log loss (which is usually the one used in classification), you should have just outputVector - targetVector. It might not make that big of a difference, but you might want to try it.
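To make the difference concrete, here is a small sketch (NumPy just for illustration, with made-up values); with log loss the sigmoid derivative cancels out of the output delta:
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([0.3, -1.2, 2.0])        # pre-activations of the output layer (made-up)
y = sigmoid(z)                        # network outputs
t = np.array([1.0, 0.0, 1.0])         # targets

delta_square = (y - t) * y * (1 - y)  # square loss: delta includes the sigmoid derivative
delta_logloss = y - t                 # log loss: delta is just the difference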
You say in the comments that the network doesn't detect your own sevens and nines. This can suggest overfitting on the MNIST data. To address this, you'll need to add some form of regularization to your network: either weight decay or dropout.
You should try different learning rates as well, if you haven't already.
You don't seem to have any bias neurons. Each layer, except the output layer, should have a neuron that only returns the value 1 to the next layer. You can implement this by adding another feature to your input data that is always 1.
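For example, one simple way to add the bias input (NumPy notation for brevity; the Matlab equivalent is an analogous column concatenation):
import numpy as np

X = np.random.rand(100, 784)                       # hypothetical batch of flattened MNIST inputs
X_bias = np.hstack([X, np.ones((X.shape[0], 1))])  # append a constant-1 column as the bias neuron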
MNIST is a big data set for which better algorithms are still being researched. Your network is very basic, small, with no regularization, no bias neurons and no improvements to classic gradient descent. It's not surprising that it's not working too well: you'll likely need a more complex network for better results.
Nothing to do with neural nets or your code, but pictures of the KNN-nearest digits show that some MNIST digits are simply hard to recognize.
After we create a Naive Bayes classifier object nb (say, with a multivariate multinomial (mvmn) distribution), we can call the posterior function on test data using the nb object. This function has 3 output parameters:
[post,cpre,logp] = posterior(nb,test)
I understand how post is computed and what it means; cpre is the predicted class, based on the maximum over the posterior probabilities for each class.
The question is about logp. It is clear how it is computed (the logarithm of the PDF of each pattern in test), but I don't understand the meaning of this measure and how it can be used in the context of the Naive Bayes procedure. Any light on this is very much appreciated.
Thanks.
The logp you are referring to is the log likelihood, which is one way to measure how well a model fits. We use log probabilities to prevent computers from underflowing on very small floating-point numbers, and also because adding is faster than multiplying.
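A tiny NumPy sketch (with made-up numbers) of why the log is used: multiplying many small per-feature probabilities underflows to zero, while summing their logs stays well-behaved.
import numpy as np

p = np.full(1000, 1e-5)     # 1000 per-feature probabilities, each very small
print(np.prod(p))           # 0.0 - the product underflows in double precision
print(np.sum(np.log(p)))    # roughly -11513 - the log likelihood, no underflow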
If you learned your classifier several times with different starting points, you would get different results because the likelihood function is not log-concave, meaning there are local maxima that you would get stuck in. If you computed the likelihood of the posterior on your original data you would get the likelihood of the model. Although the likelihood gives you a good measure of how one set of parameters fits compared to another, you need to be careful that you're not overfitting.
In your case, you are computing the likelihood on some unobserved (test) data, which gives you an idea of how well your learned classifier is fitting on the data. If you were trying to learn this model based on the test set, you would pick the parameters based on the highest test likelihood; however in general when you're doing this it's better to use a validation set. What you are doing here is computing predictive likelihood.
Computing the log likelihood is not limited to Naive Bayes classifiers and can in fact be computed for any Bayesian model (Gaussian mixture, latent Dirichlet allocation, etc.).