PyTorch - how to undersample using WeightedRandomSampler - neural-network

I have an unbalanced dataset and would like to undersample the class that is overrepresented. How do I go about it? I would like to use WeightedRandomSampler, but I am also open to other suggestions.
So far I am assuming that my code will have to be structured kind of like the following, but I don't know exactly how to do it:
trainset = datasets.ImageFolder(path_train,transform=transform)
...
sampler = data.WeightedRandomSampler(weights=..., num_samples=..., replacement=...)
...
trainloader = data.DataLoader(trainset, batch_size=batchsize, sampler=sampler)
I hope someone can help. Thanks a lot

From my understanding, PyTorch's WeightedRandomSampler 'weights' argument is somewhat similar to numpy.random.choice's 'p' argument, which is the probability that a sample will get randomly selected. PyTorch uses weights to randomly sample training examples, and the docs state that the weights don't have to sum to 1, which is why it's not exactly like numpy's random choice. The larger the weight, the more likely that sample will get sampled.
When you have replacement=True, training examples can be drawn more than once, which means you can have copies of training examples in your train set that get used to train your model: oversampling. Conversely, samples whose weights are low compared to the other training samples' weights have a lower chance of being selected: undersampling.
I have no definitive answer on how the num_samples argument works when using it with the train loader, but I can warn you NOT to put your batch size there. I tried putting the batch size and it gave horrible results; my co-worker put the number of classes * 100 and his results were much better. I also tried putting the size of all my training data for num_samples, and it had better results but took forever to train. Either way, play around with it and see what works best for you. My guess is that the safe bet is to use the number of training examples for the num_samples argument.
Here's the example I saw somebody else use; I use it as well for binary classification, and it seems to work just fine. You take the inverse of the number of training examples for each class, then assign every training example of that class its respective weight.
A quick example using your trainset object:

import numpy as np
from torch.utils.data import WeightedRandomSampler

labels = np.array(trainset.samples)[:, 1]  # turn to array and take column index 1, which holds the labels
labels = labels.astype(int)                # cast the label strings to int

majority_weight = 1 / num_of_majority_class_training_examples
minority_weight = 1 / num_of_minority_class_training_examples

# This assumes that your minority class is the integer 1 in the labels array.
# If not, switch places so it's (minority_weight, majority_weight).
sample_weights = np.array([majority_weight, minority_weight])

# Index by the labels (0 and 1) so each training example gets the weight for its class
weights = sample_weights[labels]

# num_samples=len(weights) is the "safe bet" mentioned above: the number of training examples
sampler = WeightedRandomSampler(weights=weights, num_samples=len(weights), replacement=True)
trainloader = data.DataLoader(trainset, batch_size=batchsize, sampler=sampler)
Since the PyTorch docs say the weights don't have to sum to 1, I think you can also just use the ratio between the imbalanced classes. For example, if you had 100 training examples of the majority class and 50 training examples of the minority class, that would be a 2:1 ratio. To counterbalance this, I think you can just use a weight of 1.0 for each majority class training example and a weight of 2.0 for all minority class training examples, because you want the minority class to be twice as likely to be selected, which balances your classes during random selection.
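As a minimal sketch of that ratio idea (the class counts below are made up for illustration):

import numpy as np
from torch.utils.data import WeightedRandomSampler

# Hypothetical labels: 100 majority-class (0) and 50 minority-class (1) examples
labels = np.array([0] * 100 + [1] * 50)

class_weights = np.array([1.0, 2.0])  # 2:1 imbalance -> minority twice as likely
weights = class_weights[labels]       # one weight per training example

sampler = WeightedRandomSampler(weights=weights,
                                num_samples=len(weights),
                                replacement=True)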
I hope this helped a little bit. Sorry for the sloppy writing, I was in a huge rush and saw that nobody answered. I struggled through this myself without being able to find any help for it either. If it doesn't make sense just say so and I'll re-edit it and make it more clear when I get free time.

Based on torchdata (disclaimer: I'm the author), one can create a custom undersampler.
First, an _Equalizer base class which:
creates multiple RandomSubsetSamplers (one for each class)
behaves as an oversampler or an undersampler depending on the reducing function (the built-in max or min)
Code:
import builtins

import torch
from torch.utils.data import Sampler
# RandomSubsetSampler ships with torchdata; adjust the import if your version differs
from torchdata.samplers import RandomSubsetSampler


class _Equalizer(Sampler):
    def __init__(self, labels: torch.Tensor, function):
        if len(labels.shape) > 1:
            raise ValueError(
                "labels can only have a single dimension (N, ), got shape: {}".format(
                    labels.shape
                )
            )
        # Indices of the examples belonging to each class
        tensors = [
            torch.nonzero(labels == i, as_tuple=False).flatten()
            for i in torch.unique(labels)
        ]
        # min -> undersample to the smallest class, max -> oversample to the largest
        self.samples_per_label = getattr(builtins, function)(map(len, tensors))
        self.samplers = [
            iter(
                RandomSubsetSampler(
                    tensor,
                    replacement=len(tensor) < self.samples_per_label,
                    num_samples=self.samples_per_label
                    if len(tensor) < self.samples_per_label
                    else None,
                )
            )
            for tensor in tensors
        ]

    @property
    def num_samples(self):
        return self.samples_per_label * len(self.samplers)

    def __iter__(self):
        for _ in range(self.samples_per_label):
            # Interleave the classes in a fresh random order each round
            for index in torch.randperm(len(self.samplers)).tolist():
                yield next(self.samplers[index])

    def __len__(self):
        return self.num_samples
Now we can create the undersampler (and the oversampler too, as it is really short):
class RandomUnderSampler(_Equalizer):
    def __init__(self, labels: torch.Tensor):
        super().__init__(labels, "min")


class RandomOverSampler(_Equalizer):
    def __init__(self, labels: torch.Tensor):
        super().__init__(labels, "max")
Just pass your labels to __init__ (they have to be 1D, but can contain binary or multiple classes) and you can undersample or oversample your data.
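A minimal usage sketch (the label tensor below is made up for illustration; trainset is the ImageFolder dataset from the question):

import torch
from torch.utils.data import DataLoader

# Hypothetical imbalanced labels: 100 examples of class 0, 30 of class 1
labels = torch.tensor([0] * 100 + [1] * 30)

sampler = RandomUnderSampler(labels)  # yields 30 indices per class, 60 per epoch
trainloader = DataLoader(trainset, batch_size=32, sampler=sampler)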

Related

Compute softmax using breeze

I am constructing a deep neural network from scratch and I want to implement the softmax distribution function (http://neuralnetworksanddeeplearning.com/chap3.html#softmax).
I am using breeze for that, but it is not working as expected.
The documentation is also poor, with very few examples, so it is difficult for me to understand how I should use it.
Here is an example: I have an output array that contains 10 dimensions, and I have my label array as well. Z contains 10 rows with the weighted values. My label array also contains 10 rows, and one of them is set to 1 to specify which row is the expected result:
lab(0) = 1
lab(1 to 9) = 0
my code:
def ComputeZ(ActivationFunction: String, z: Array[Double], label: Array[Double]): Array[Double] = {
  ActivationFunction match {
    case "SoftMax" =>
      val t = softmax(z, label)
      t
  }
}
I was expecting to get a probability distribution with a total of 1 across the 10 rows, but it actually returns the same values as Z.
I don't know what I am doing wrong.
Thanks for your help.
Your question seems a little confusing to me. Creating a softmax from scratch has nothing to do with the label or the real output value: a softmax function is used to turn a neural network's output into a valid probability distribution, as used in multiclass classification problems. Since you have a one-hot vector as the label, it seems that what you actually want to implement is a cross-entropy criterion, or some error function that evaluates the divergence between the prediction distribution and the label distribution. That needs the output prediction probability distribution (your softmax applied to the output layer) and the one-hot vector of the output.
I looked at the code of the softmax function in breeze, but I don't see a layer implementation and it doesn't do what I was expecting. Keep in mind that you need both a forward and a backward function.
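For reference, this is what the expected softmax should compute, as a minimal Python sketch (the behavior is language-independent, and note that the label plays no role in it):

import numpy as np

def softmax(z):
    """Turn raw scores z into a probability distribution that sums to 1."""
    shifted = z - np.max(z)  # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / np.sum(exps)

z = np.array([1.0, 2.0, 3.0])
print(softmax(z))        # [0.09003057 0.24472847 0.66524096]
print(softmax(z).sum())  # 1.0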

Interclass and Intraclass classification structure of CNN

I am working on an inter-class and intra-class classification problem with one CNN: first there are two classes, cat and dog; then within cat there is a classification into three different breeds of cats, and within dog into five different breeds of dogs.
I haven't tried the coding yet; I am just working out whether the design is feasible.
My question is: what would be a feasible design for this kind of problem?
For training, I am thinking of a first network, CNN-1, that differentiates cat from dog across all the training images. After the separation of cat and dog, CNN-2 and CNN-3 would train further on those images for each breed of dog and cat. I am just not sure how the testing will work in this situation.
I have approached a similar problem previously in Python. Hopefully this is helpful and you can come up with an alternative implementation in Matlab if that is what you are using.
After all was said and done, I landed on a single model for all predictions. For your purpose you could have one binary output for dog vs. cat, another multi-class output for the dog breeds, and another multi-class output for the cat breeds.
Using Tensorflow, I created a mask for the irrelevant classes. For example, if the image was of a cat, then all of the dog breeds are irrelevant and they should not impact model training for that example. This required a customized TF Dataset (that converted 0's to -1 for the mask) and a customized loss function that returned 0 error when the mask was present for that example.
Finally, the training process: specific to your question, you will have to create custom accuracy functions that handle the mask values the way you want them to, but otherwise this part of the process should be standard. It is best practice to spread the classes evenly throughout the training data, but they can all be trained together.
If you google "Multi-Task Training" you can find additional resources for this problem.
Here are some code snips if you are interested:
For the customized TF Dataset that masks irrelevant labels...
# Replace 0's with -1 for the mask when there aren't any labels
def produce_mask(features):
    for filt, tensor in features.items():
        if "target" in filt:
            condition = tf.equal(tf.math.reduce_sum(tensor), 0)
            features[filt] = tf.where(condition, tf.ones_like(tensor) * -1, tensor)
    return features

def create_dataset(filepath, batch_size=10):
    ...
    # **** This is where the mask was applied to the dataset
    dataset = dataset.map(produce_mask, num_parallel_calls=cpu_count())
    ...
    return parsed_features
A custom loss function. I was using binary cross-entropy because my problem was multi-label; you will likely want to adapt this to categorical cross-entropy.
# Custom loss function: masked (-1) entries are zeroed out and contribute no error
def masked_binary_crossentropy(y_true, y_pred):
    mask = backend.cast(backend.not_equal(y_true, -1), backend.floatx())
    return backend.binary_crossentropy(y_true * mask, y_pred * mask)
Then for the custom accuracy metrics. I was using top-k accuracy; you may need to modify it for your purposes, but this gives the general idea. Unlike the loss function, which converts masked entries to 0 (which would over-inflate the accuracy), this function filters those values out entirely. That works because the outputs are measured individually, so each output (binary, cat breed, dog breed) gets its own accuracy measure, filtered only to the relevant examples.
backend is keras backend.
import tensorflow as tf
from tensorflow.keras import backend
from tensorflow.keras.metrics import top_k_categorical_accuracy

def top_5_acc(y_true, y_pred, k=5):
    # Keep only the rows where at least one label is unmasked
    mask = backend.cast(backend.not_equal(y_true, -1), tf.bool)
    mask = tf.math.reduce_any(mask, axis=1)
    masked_true = tf.boolean_mask(y_true, mask)
    masked_pred = tf.boolean_mask(y_pred, mask)
    return top_k_categorical_accuracy(masked_true, masked_pred, k)
Edit
No, in the scenario I described above there is only one model, and it is trained on all of the data together. There are 3 outputs from the single model. The mask is a major part of this, as it allows the network to only adjust weights that are relevant to the example: if the image is a cat, the dog breed prediction does not contribute to the loss.
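To make the single-model idea concrete, here is a rough sketch of how the three outputs might be wired up and compiled with the masked loss above (the backbone, layer sizes, and output names are all made up for illustration, not from the original answer):

import tensorflow as tf
from tensorflow.keras import layers, Model

def build_model(n_dog_breeds=5, n_cat_breeds=3):
    # Shared backbone (deliberately tiny; replace with a real CNN)
    inputs = layers.Input(shape=(224, 224, 3))
    x = layers.Conv2D(32, 3, activation="relu")(inputs)
    x = layers.GlobalAveragePooling2D()(x)

    # Three heads on the single model: dog-vs-cat, dog breeds, cat breeds
    animal = layers.Dense(1, activation="sigmoid", name="animal_target")(x)
    dog = layers.Dense(n_dog_breeds, activation="softmax", name="dog_target")(x)
    cat = layers.Dense(n_cat_breeds, activation="softmax", name="cat_target")(x)

    model = Model(inputs, [animal, dog, cat])
    # masked_binary_crossentropy as defined above; the breed heads would more
    # naturally use a masked categorical crossentropy built the same way
    model.compile(optimizer="adam", loss=masked_binary_crossentropy)
    return model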

How to use `crossval` in matlab for a Leave one Out Validation method

I have been reading the documentation (here and here), but it's really unclear to me and I don't see how to practically use crossval to do a leave-one-out cross-validation.
vals = crossval(fun,X)
vals = crossval(fun,X,Y,...)
mse = crossval('mse',X,y,'Predfun',predfun)
mcr = crossval('mcr',X,y,'Predfun',predfun)
val = crossval(criterion,X1,X2,...,y,'Predfun',predfun)
vals = crossval(...,'name',value)
I really don't understand the fun part.
I have estimated chlorophyll rate with different indices, and I have done a linear regression between those indices and the field-measured chlorophyll rate. Now I want to validate the models: one of my estimates is a column with 22 entries, so I want to use 21 of them for training and 1 as a test, and loop 22 times so that every entry gets used as the test once.
But I don't know where I should put the regression model. If my regression is Y = aX + b,
do I re-use the a and b calculated before for the training part, or do I fit a new linear regression on the training part and then see what the test gives with that?
I am not sure I have totally understood how to build a leave-one-out model.
Then I want to evaluate the test results by calculating the RMSE (and maybe the R²).
How do I code that using crossval?
I saw the answer to the question here, but I don't have access to the crossvalind function with my license.
Well, I finally figured it out, so this is my script.
First I loaded my data and defined the linear regression function:
X = indicesCha_without_Cloud(:,3);
y = Cha_g_m2t_without_Cloud(:,3);
testval = @(XTRAIN,ytrain,XTEST)Linear_regression_indices(XTRAIN,ytrain,XTEST);
where in my case fun (in the MathWorks help) is testval, and Linear_regression_indices is a very simple function:
function [ Linear_regression_indices ] = Linear_regression_indices(XTRAIN,ytrain,XTEST)
    Linear_regression_indices = polyval(polyfit(XTRAIN,ytrain,1),XTEST);
end
There are two ways to do it, and they both give the same result.
The first one simply uses the crossval function:
cvMse = crossval('mse',X,y,'predfun',testval,'leaveout',1);
This will do as many folds as there are observations, each time using a different one as the test point.
The second one uses cvpartition:
c = cvpartition(n,'LeaveOut') creates a random partition for leave-one-out cross validation on n observations. Leave-one-out is a special case of 'KFold', in which the number of folds equals the number of observations. link
c = cvpartition(y,'LeaveOut');
cvMse2=crossval('mse',X,y,'predfun',testval,'partition',c);
Then the RMSE can easily be calculated:
RMSE = sqrt(cvMse);
RMSE2 = sqrt(cvMse2);
And I simply got my answer: in my case, RMSE = 0.3548.

Scikit-Learn's DPGMM fitting: number of components?

I'm trying to fit a mixture of normals to some data using scikit-learn's DPGMM algorithm. One of the advantages advertised in [0] is that I don't need to specify the number of components; this is good, because I do not know the number of components in my data. The documentation states that I only need to specify an upper bound. However, it looks very much like that is not true:
>>> import numpy
>>> data = numpy.random.normal(loc = 0.0, scale = 1.0, size = 1000)
>>> from sklearn.mixture import DPGMM
>>> d = DPGMM(n_components=5)
>>> d.fit(data.reshape(-1,1))
DPGMM(alpha=1.0, covariance_type='diag', init_params='wmc', min_covar=None,
      n_components=5, n_iter=10, params='wmc', random_state=None, thresh=None,
      tol=0.001, verbose=0)
>>> d.n_components
5
>>> d.means_
array([[-0.02283383],
       [ 0.06259168],
       [ 0.00390097],
       [ 0.02934676],
       [-0.05533165]])
As you can see, the fitting reports five components (the upper bound) even for data clearly sampled from just one normal distribution.
Am I doing something wrong? Did I misunderstand something?
Thanks a lot in advance,
Lukas
[0] http://scikit-learn.org/stable/modules/mixture.html#dpgmm
I recently had similar doubts about the results of this DPGMM implementation. If you check the provided example, you'll notice that DPGMM always returns a model with n_components components; the trick is to remove the redundant ones afterwards, which can be done with the predict function.
Unfortunately this important piece is hidden in a comment in the code example:
# as the DP will not use every component it has access to
# unless it needs it, we shouldn't plot the redundant components
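Concretely, the pruning looks roughly like this (a sketch continuing the session from the question, where d and data are already defined):
>>> import numpy
>>> assignments = d.predict(data.reshape(-1, 1))  # hard-assign each sample to a component
>>> used = numpy.unique(assignments)              # components that are actually used
>>> d.means_[used]                                # means of the non-redundant components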
Perhaps look at using an improved sklearn solution for this kind of problem, namely a BayesianGaussianMixture. With this model, an upper bound on the number of components must still be given, but once trained the model assigns a weighting to each component, and these weightings essentially indicate each component's relevance. Here is a pretty cool visual demo of BGMM in action.
Once you have experimented with training a few BGMMs on your data, you can get a feel for a sensible estimate of the number of components for your given problem.
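A minimal sketch of that approach (hyperparameters are left at their defaults except where shown):

import numpy as np
from sklearn.mixture import BayesianGaussianMixture

data = np.random.normal(loc=0.0, scale=1.0, size=1000)

bgmm = BayesianGaussianMixture(n_components=5, max_iter=500)
bgmm.fit(data.reshape(-1, 1))

# The learned weights indicate each component's relevance: for data drawn
# from a single Gaussian, one weight should dominate and the rest shrink.
print(bgmm.weights_.round(3))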

Reporting log-likelihood / perplexity of spark LDA model (different in local vs distributed models?)

Given a training corpus docsWithFeatures, I've trained an LDA model in Spark (via the Scala API) like so:
import org.apache.spark.mllib.clustering.{LDA, DistributedLDAModel, LocalLDAModel}
val n_topics = 10;
val lda = new LDA().setK(n_topics).setMaxIterations(20)
val ldaModel = lda.run(docsWithFeatures)
val distLDAModel = ldaModel.asInstanceOf[DistributedLDAModel]
And now I want to report the log-likelihood and perplexity of the model.
I can get the log-likelihood like so:
scala> distLDAModel.logLikelihood
res11: Double = -2600097.2875547716
But this is where things get weird. I also wanted the perplexity, which is only implemented for a local model, so I run:
val localModel = distLDAModel.toLocal
Which lets me get the (log) perplexity like so:
scala> localModel.logPerplexity(docsWithFeatures)
res14: Double = 0.36729132682898674
But the local model also supports the log-likelihood calculation, which I run like this:
scala> localModel.logLikelihood(docsWithFeatures)
res15: Double = -3672913.268234148
So what's going on here? Shouldn't the two log-likelihood values be the same? The documentation for a distributed model says
"logLikelihood: log likelihood of the training corpus, given the inferred topics and document-topic distributions"
while for a local model it says:
"logLikelihood(documents): Calculates a lower bound on the provided documents given the inferred topics."
I guess these are different, but it's not clear to me how or why. Which one should I use? That is, which one is the "true" likelihood of the model, given the training documents?
To summarize, two main questions:
1 - How and why are the two log-likelihood values different, and which should I use?
2 - When reporting perplexity, am I correct in thinking that I should use the exponential of the logPerplexity result? (But why does the model give log perplexity instead of just plain perplexity? Am I missing something?)
1) These two log-likelihood values differ because they are computing the log-likelihood for two different models. DistributedLDAModel is effectively computing the log-likelihood w.r.t. a model where the parameters for the topics and the mixing weights for each of the documents are constants (as I mentioned in another post, the DistributedLDAModel is essentially regularized PLSI, though you need to use logPrior to also account for the regularization), while the LocalLDAModel takes the view that the topic parameters as well as the mixing weights for each document are random variables. So in the case of LocalLDAModel you have to integrate (marginalize) out the topic parameters and document mixing weights in order to compute the log-likelihood (and this is what makes the variational approximation/lower bound necessary, though even without the approximation the log-likelihoods would not be the same since the models are just different.)
As far as which one you should use, my suggestion (without knowing what you ultimately want to do) would be to go with the log-likelihood method attached to the class you originally trained (i.e. the DistributedLDAModel). As a side note, the primary (only?) reason I can see to convert a DistributedLDAModel into a LocalLDAModel via toLocal is to enable the computation of topic mixing weights for a new (out-of-training) set of documents (for more info see my post on this thread: Spark MLlib LDA, how to infer the topics distribution of a new unseen document?), an operation which is not (but could be) supported in DistributedLDAModel.
2) Log-perplexity is just the negative log-likelihood divided by the number of tokens in your corpus. If you divide the log-perplexity by math.log(2.0), the resulting value can also be interpreted as the approximate number of bits per token needed to encode your corpus (as a bag of words) given the model.
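To illustrate with the numbers from the question, here is a small Python check of those relationships (the token count is inferred, not given in the question):

import math

log_likelihood = -3672913.268234148   # localModel.logLikelihood(docsWithFeatures)
log_perplexity = 0.36729132682898674  # localModel.logPerplexity(docsWithFeatures)

# log-perplexity = -logLikelihood / tokenCount, so the corpus size falls out:
token_count = -log_likelihood / log_perplexity   # ~10,000,000 tokens

perplexity = math.exp(log_perplexity)            # plain perplexity, ~1.44
bits_per_token = log_perplexity / math.log(2.0)  # ~0.53 bits per token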