MiniBatches "there are no samples for class label" exception - classification

I was following the first example given in the Accord.NET framework's documentation here to train a multiclass SVM classifier on my own dataset, but during the training loop I got an error that says:
There are no samples for class label 3. Please make sure that class
labels are contiguous and there is at least one training sample for
each label.
The data I'm using has 27 classes, and 10 of the classes have fewer than 1000 samples, so the batches returned by the MiniBatches object might not contain samples from all of the available classes. Is there a way to resolve this issue without writing a custom object that samples from each class for each mini-batch?
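For reference, the custom per-class sampling the question hopes to avoid would look roughly like this sketch (plain NumPy rather than Accord.NET, and the helper name is made up): each mini-batch gets one guaranteed index per class before being topped up at random.

import numpy as np

# Sketch of stratified mini-batch index selection: guarantee one sample
# per class, then fill the remainder of the batch uniformly at random.
# Assumes batch_size >= number of classes.
def stratified_batch_indices(y, batch_size, rng=np.random.default_rng()):
    classes = np.unique(y)
    idx = [rng.choice(np.flatnonzero(y == c)) for c in classes]
    remaining = batch_size - len(idx)
    idx.extend(rng.choice(len(y), size=remaining, replace=False))
    return np.asarray(idx)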

Related

Image Processing MLP - Detecting Classes

I've implemented an MLP that can detect handwritten digits. So far the algorithm can identify the numbers 0 and 1, but when I added a new class, i.e. 2, the algorithm was unable to learn it. At first I thought I had made a mistake in implementing the new class, so I decided to swap the new class with a previous one that worked; in other words, if class0 was 0 and the new class was 2, now class0 is 2 and the new class is 0. Surprisingly, the new class was detected with almost no error, but class0 had a huge error, which means the new class is properly implemented.
The MLP has two layers with 20 hidden units each, both nonlinear with a sigmoid activation function.
If I understand your question correctly: when you add a new class to a model such as the neural network you trained here, the final layer changes, i.e. the number of neurons in the final layer must change because a new class was added.
This can be one of the reasons the new class is not detected.
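As a minimal sketch of that point (PyTorch here; the input size of 784 is an assumption for digit images, while the two 20-unit sigmoid layers come from the question): going from 2 to 3 classes means rebuilding, and retraining, the output layer.

import torch.nn as nn

# The output layer needs one neuron per class, so adding a class
# changes its size. Input size 784 is an assumption (MNIST-like digits).
n_classes = 3
mlp = nn.Sequential(
    nn.Linear(784, 20), nn.Sigmoid(),  # hidden layer 1 (from the question)
    nn.Linear(20, 20), nn.Sigmoid(),   # hidden layer 2 (from the question)
    nn.Linear(20, n_classes),          # output: one unit per class
)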

Impact of using data shuffling in Pytorch dataloader

I implemented an image classification network to classify a dataset of 100 classes, using AlexNet as a pretrained model and changing the final output layers.
I noticed that when I was loading my data like this:
trainloader = torch.utils.data.DataLoader(train_data, batch_size=32, shuffle=False)
I was getting around 2-3% accuracy on the validation dataset for around 10 epochs, but when I just changed to shuffle=True and retrained the network, the accuracy jumped to 70% in the first epoch itself.
I was wondering whether this happened because, in the first case, the network was shown examples from a single class continuously for stretches of training, resulting in poor generalization, or whether there is some other reason behind it.
Either way, I did not expect it to have such a drastic impact.
P.S: All the code and parameters were exactly the same for both the cases except changing the shuffle option.
Yes, it absolutely can affect the result! Shuffling the order of the data we use to fit the classifier is important because it ensures the batches between epochs do not look alike.
Checking the Data Loader Documentation it says:
"shuffle (bool, optional) – set to True to have the data reshuffled at every epoch"
In any case, it helps make the model more robust and avoid overfitting to the order of the data.
In your case, this dramatic increase in accuracy probably comes from how the dataset is "organised": for example, without shuffling, each batch may contain only a single category, and when every epoch presents the same single-category batches, the network generalizes very badly at test time.
PyTorch did many great things, and one of them is the DataLoader class.
The DataLoader class takes the dataset, sets the batch_size (how many samples per batch to load), and invokes a sampler from a list of sampler classes:
DistributedSampler
SequentialSampler
RandomSampler
SubsetRandomSampler
WeightedRandomSampler
BatchSampler
The key difference between samplers is how they implement the __iter__() method. In the case of SequentialSampler it looks like this:
def __iter__(self):
    # Yield indices 0, 1, 2, ... in order, i.e. no shuffling.
    return iter(range(len(self.data_source)))
This returns an iterator over the indices of data_source, in order.
When you set shuffle=True, the DataLoader uses RandomSampler instead of SequentialSampler, and this can dramatically improve learning, as you observed.
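For example, with the train_data from the question, shuffle=True is effectively shorthand for passing a RandomSampler explicitly:

from torch.utils.data import DataLoader, RandomSampler

# These two loaders behave essentially the same way: shuffle=True just
# swaps the default SequentialSampler for a RandomSampler internally.
loader_a = DataLoader(train_data, batch_size=32, shuffle=True)
loader_b = DataLoader(train_data, batch_size=32, sampler=RandomSampler(train_data))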

How to structure dask to stream to a random forest classifier

I have a number of h5py datasets in one file, where the class label is the dataset name; each dataset has shape (20000, 250000), dtype float64, compressed using gzip.
How would the community suggest I use dask to enable random forest training without needing to load the entire datasets into memory?
I'm working with a high-core, high-memory instance.
I should have mentioned I have 3 class labels.
Update:
My current thinking for loading the data was to create a dask array for each class label, each with shape (20000, 250000), then concatenate the 3 arrays together. If I did that, would I be able to use the distributed random forest mentioned in the comments to then create the smaller training and test data frames needed?
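A minimal sketch of that loading plan (the file and dataset names are made up for illustration), using h5py with dask.array so nothing is materialised in memory up front:

import h5py
import numpy as np
import dask.array as da

# Hypothetical file and dataset (class) names.
f = h5py.File("data.h5", "r")
class_names = ["class_a", "class_b", "class_c"]

arrays, labels = [], []
for i, name in enumerate(class_names):
    dset = f[name]  # shape (20000, 250000), float64, gzip-compressed
    # Chunk along rows so only small slices are read at a time.
    arrays.append(da.from_array(dset, chunks=(1000, dset.shape[1])))
    labels.append(da.full(dset.shape[0], i, chunks=1000, dtype=np.int8))

X = da.concatenate(arrays, axis=0)  # lazy (60000, 250000) feature array
y = da.concatenate(labels, axis=0)  # matching label vector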

Multiclass classification in SVM

I have been working on "Script identification from bilingual documents".
I want to classify pages/blocks as either English (class 1), Hindi (class 2), or Mixed, using libsvm in MATLAB. The problem is that the training data I have consists of samples corresponding to Hindi and English pages/blocks only, with no mixed pages.
The test data I want to feed it may also contain Mixed pages/blocks, and in that case I want them to be classified as "Mixed". I am planning to do this using confidence scores or probability values: if the probability value of class 1 is greater than a threshold (say 0.8) and the probability value of class 2 is less than a threshold (say 0.05), then it will be classified as class 1, and vice versa for class 2. If neither of those two conditions is satisfied, I want to classify it as "Mixed".
The third return value from svmpredict is prob_values, and I was planning to use these prob_values to decide whether the test data is Hindi, English, or Mixed, but in a few places I read that svmpredict does not produce the actual probability values.
Is there any way to classify the test data into 3 classes (Hindi, English, Mixed) using training data consisting of only 2 classes in an SVM?
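For concreteness, the thresholding rule described above amounts to something like this sketch (the 0.8 and 0.05 thresholds are the example values from the question; whether trustworthy per-class probabilities are even available is questioned in the answer below):

def classify_page(prob_class1, prob_class2, hi=0.8, lo=0.05):
    # Thresholds 0.8 and 0.05 are the example values from the question.
    if prob_class1 > hi and prob_class2 < lo:
        return "English"  # class 1
    if prob_class2 > hi and prob_class1 < lo:
        return "Hindi"    # class 2
    return "Mixed"        # neither condition holds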
This is not the modus operandi for SVMs.
There is no way an SVM can predict a given class without knowing it, i.e. without knowing how to separate that class from all the other classes.
The function svmpredict() in LibSVM does show probability estimates, and the greater the value, the more confident you can be in your prediction. But you cannot rely on such values, having trained on just two classes, to predict a third class: indeed, svmpredict() will return only as many decision values as there are classes.
You can go on with your thresholding system (which, again, is not SVM-based), but it will most likely fail or perform badly. Think about it: you have to set up two thresholds and combine them with a logical AND, so the chance of correctly classifying non-Mixed documents will drastically decrease.
My suggestion is: instead of wasting time tuning thresholds with a high chance of bad performance, join some of these texts together, or create some new files with some Hindi and some English lines, in order to add proper Mixed documents to your training data, and then build a standard 3-class SVM system.
To create such files you can use MATLAB as well, which has pretty decent file I/O functions such as fread(), fwrite(), fprintf(), fscanf(), importdata() and so on.
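If one were prototyping the suggested 3-class setup in Python rather than MATLAB, it might look like the following sketch (scikit-learn's SVC wraps LibSVM; the feature array and shapes are invented for illustration):

import numpy as np
from sklearn.svm import SVC

# Hypothetical features for pages/blocks; labels 0=English, 1=Hindi, 2=Mixed.
X = np.random.rand(300, 64)
y = np.repeat([0, 1, 2], 100)

clf = SVC(probability=True)  # enables Platt-scaled probability estimates
clf.fit(X, y)
probs = clf.predict_proba(X[:5])  # one probability column per class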

How to automatically optimize a classifier in Weka so that a given class contains only 100% sure data?

I have two (or three) classes, and each class can only possess one label.
I want to optimize (automatically, if possible) the parameters and thresholds of classifiers so that my first class contains only 100% sure data, even if it contains a small number of instances.
I don't mind if the remaining classes contain false alarms or correct rejections.
I don't mind having unclassified data.
I have already searched on Stack Overflow and on the Weka wiki, but maybe my lack of knowledge of Weka made me miss some keywords.
I also tried the task on the well-known "iris" dataset, but I think that in this case any class can be 100% sure.
So far I have only succeeded in testing multiple classifiers and tuning them manually, without achieving 100% correct for my first class. (I checked this result in the confusion matrix in Weka's report.)
Somehow I know it is possible for my class to contain 100% sure data, because I managed to do it in MATLAB with a simple threshold set manually. But I would like to try a bigger database, obtain a better threshold, and use the power of Weka.
Any suggestions would be helpful, thanks!
You probably need the CostSensitiveClassifier among the "meta" classifiers.
If you are working in the Explorer, it has its own configuration dialog.
Choose your base "classifier" (something beyond ZeroR :)).
Set your "cost matrix". For a 2-class problem this will be a 2x2 matrix.
By setting one off-diagonal component very large (>>1, say 1000), you ensure that misclassifying one class (your "first" class) is 1000 times more expensive than misclassifying the other. This should do the job.
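For illustration, such a cost matrix could look like the sketch below (the values are examples, and the row/column convention shown, rows as actual class and columns as predicted class, is an assumption to verify against Weka's CostMatrix documentation):

import numpy as np

# Example 2x2 cost matrix, assuming rows = actual class and
# columns = predicted class (check Weka's exact convention).
# Misclassifying the "first" class costs 1000; the reverse costs 1.
cost = np.array([
    [0, 1000],
    [1,    0],
])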