Why do I need to shuffle data, when I get the path to images? - neural-network

train_image_paths = [str(path) for path in list(train_path.glob('*/*.jpeg'))]
random.shuffle(train_image_paths)
Above is a sample code you can see.
I have the same question in this case too:
train_dataset = tf.data.Dataset.from_tensor_slices((x_train, y_train)).batch(64).shuffle(10000)
I don't understand why I need the shuffle in these cases.

In the obvious case, shuffling is helpful if your training data is sorted by class labels. By shuffling, you allow your model to "see" a wide range of data points each belonging to different classes in the context of classification. If the model goes through a sorted training data, your model runs the risk of overfitting to certain classes. In short, shuffling helps reduce variance and ensures that the train, test, and validation sets are representative of the true distribution.

Related

Impact of using data shuffling in Pytorch dataloader

I implemented an image classification network to classify a dataset of 100 classes by using Alexnet as a pretrained model and changing the final output layers.
I noticed when I was loading my data like
trainloader = torch.utils.data.DataLoader(train_data, batch_size=32, shuffle=False)
, I was getting accuracy on validation dataset around 2-3 % for around 10 epochs but when I just changed shuffle=True and retrained the network, the accuracy jumped to 70% in the first epoch itself.
I was wondering if it happened because in the first case the network was being shown one example after the other continuously for just one class for few instances resulting in network making poor generalizations during training or is there some other reason behind it?
But, I did not expect that to have such a drastic impact.
P.S: All the code and parameters were exactly the same for both the cases except changing the shuffle option.
Yes it totally can affect the result! Shuffling the order of the data that we use to fit the classifier is so important, as the batches between epochs do not look alike.
Checking the Data Loader Documentation it says:
"shuffle (bool, optional) – set to True to have the data reshuffled at every epoch"
In any case, it will make the model more robust and avoid over/underfitting.
In your case this heavy increase of accuracy (from the lack of awareness of the dataset) probably is due to how the dataset is "organised" as maybe, as an example, each category goes to a different batch, and in every epoch, a batch contains the same category, which derives to a very bad accuracy when you are testing.
PyTorch did many great things, and one of them is the DataLoader class.
DataLoader class takes the dataset (data), sets the batch_size (which is how many samples per batch to load), and invokes the sampler from a list of classes:
DistributedSampler
SequentialSampler
RandomSampler
SubsetRandomSampler
WeightedRandomSampler
BatchSampler
The key thing samplers do is how they implement the iter() method.
In case of SequentionalSampler it looks like this:
def __iter__(self):
return iter(range(len(self.data_source)))
This returns an iterator, for every item in the data_source.
When you set shuffle=True that would not use SequentionalSampler, but instead the RandomSampler.
And this may improve the learning process.

Multiclass classification in SVM

I have been working on "Script identification from bilingual documents".
I want to classify the pages/blocks as either Eng(class 1), Hindi (class 2) or Mixed using libsvm in matlab. but the problem is that the training data i have consists of samples corresponding to Hindi and english pages/blocks only but no mixed pages.
The test data i want to give may consists of Mixed pages/blocks also, in that case i want it to be classified as "Mixed". I am planning to do it using confidence score or probability values. like if the prob value of class 1 is greater than a threshold (say 0.8) and prob value of class 2 is less than a threshold say(0.05) then it will be classified as class 1, and class 2 vice-versa. but if aforementioned two conditions dont satisfy then i want to classify it as "Mixed".
The third return value from the "libsvmpredict" is prob_values and i was planning to go ahead with this prob_values to decide whether the testdata is Hindi, English or Mixed. but at few places i learnt that "libsvmpredict" does not produce the actual prob_values.
Is there any way which can help me to classify the test data into 3 classes( Hindi, English, Mixed) using training data consisting of only 2 classes in SVM.
This is not the modus operandi for SVMs.
In no way SVMs can predict a given class without knowing it, without knowing how to separate such class from all other classes.
The function svmpredict() in LibSVM actually shows the probability estimates and the greater this value is, the more confident you can be regarding your prediction. But you cannot rely on such values if you have just two classes in order to predict a third class: indeed svmpredict() will return as many decision values as there are classes.
You can go on with your thresholding system (which, again, is not SVM-based) but it most likely fail or give bad performances. Think about that: you have to set up two thresholds and use them in a logic AND manner. The chance of correctly classified non-Mixed documents will indeed drastically decrease.
My suggestion is: instead of wasting time setting up thresholds, with a high chance of bad performances, join some of these texts together or create some new files with some Hindi and some English lines in order to add to your training data some proper Mixed documents and perform a standard 3-classes SVM system.
In order to create such files you can as well use Matlab, which has a pretty decent file I/O functions such as fread(), fwrite(), fprintf(), fscanf(), importdata() and so on...

PyBrain: MemoryError: while loading training dataset

I am trying to train a feedforward neural network, for binary classification. My Dataset is 6.2M with 1.5M dimension. I am using PyBrain. I am unable to load even a single datapoint. I am getting MemoryError.
My Code snippet is:
Train_ds = SupervisedDataSet(FV_length, 1) #FV_length is a computed value. 150000
feature_vector = numpy.zeros((FV_length),dtype=numpy.int)
#activate feature values
for index in nonzero_index_list:
feature_vector[index] = 1
Train_ds.addSample(feature_vector,class_label) # both the arguments are tuples
It looks like your computer just does not have the memory to add your feature and class label arrays to the supervised data set Train_ds.
If there is no way for you to allocate more memory to your system it might be a good idea to random sample from your data set and train on the smaller sample.
This should still give accurate results assuming the sample is large enough to be representative.

Why the decision tree shows a correct classificationthe while some instances are being misclassified

I am using WEKA, 10-fold cross validation or split 66% to create training and testing sets .. I used c4.5 (J48) as a classifier ..
I get in my results that some instances are misclassified, but, when I visualize the tree I see that based on the tree the instances should have been classified correctly !!!
I don't see this when the testing set is the same training set .. if the classifier decided to create such a tree why some instances are not being classified based on this tree ???
Thanks in advance.
It sounds like you are after a completely unpruned tree such that the Training Data should return a 100% accuracy.
The options that are likely causing the undesirable results are outlined below:
Unpruned is used to minimise the number of rules in the decision tree and can lead to lower generalisation error
minNumObj is used to determine the minimum number of cases required to make a rule. If this is higher than one, you could get some errors on the training data.
I wouldn't generally recommend using these options for a given problem, but if you are trying to obtain a 100% result on the training data, this would be the place to start.
Hope this helps!

How to use KNN to classify data in MATLAB?

I'm having problems in understanding how K-NN classification works in MATLAB.´
Here's the problem, I have a large dataset (65 features for over 1500 subjects) and its respective classes' label (0 or 1).
According to what's been explained to me, I have to divide the data into training, test and validation subsets to perform supervised training on the data, and classify it via K-NN.
First of all, what's the best ratio to divide the 3 subgroups (1/3 of the size of the dataset each?).
I've looked into ClassificationKNN/fitcknn functions, as well as the crossval function (idealy to divide data), but I'm really not sure how to use them.
To sum up, I wanted to
- divide data into 3 groups
- "train" the KNN (I know it's not a method that requires training, but the equivalent to training) with the training subset
- classify the test subset and get it's classification error/performance
- what's the point of having a validation test?
I hope you can help me, thank you in advance
EDIT: I think I was able to do it, but, if that's not asking too much, could you see if I missed something? This is my code, for a random case:
nfeats=60;ninds=1000;
trainRatio=0.8;valRatio=.1;testRatio=.1;
kmax=100; %for instance...
data=randi(100,nfeats,ninds);
class=randi(2,1,ninds);
[trainInd,valInd,testInd] = dividerand(1000,trainRatio,valRatio,testRatio);
train=data(:,trainInd);
test=data(:,testInd);
val=data(:,valInd);
train_class=class(:,trainInd);
test_class=class(:,testInd);
val_class=class(:,valInd);
precisionmax=0;
koptimal=0;
for know=1:kmax
%is it the same thing use knnclassify or fitcknn+predict??
predicted_class = knnclassify(val', train', train_class',know);
mdl = fitcknn(train',train_class','NumNeighbors',know) ;
label = predict(mdl,val');
consistency=sum(label==val_class')/length(val_class);
if consistency>precisionmax
precisionmax=consistency;
koptimal=know;
end
end
mdl_final = fitcknn(train',train_class','NumNeighbors',know) ;
label_final = predict(mdl,test');
consistency_final=sum(label==test_class')/length(test_class);
Thank you very much for all your help
For your 1st question "what's the best ratio to divide the 3 subgroups" there are only rules of thumb:
The amount of training data is most important. The more the better.
Thus, make it as big as possible and definitely bigger than the test or validation data.
Test and validation data have a similar function, so it is convenient to assign them the same amount
of data. But it is important to have enough data to be able to recognize over-adaptation. So, they
should be picked from the data basis fully randomly.
Consequently, a 50/25/25 or 60/20/20 partitioning is quite common. But if your total amount of data is small in relation to the total number of weights of your chosen topology (e.g. 10 weights in your net and only 200 cases in the data), then 70/15/15 or even 80/10/10 might be better choices.
Concerning your 2nd question "what's the point of having a validation test?":
Typically, you train the chosen model on your training data and then estimate the "success" by applying the trained model to unseen data - the validation set.
If you now would completely stop your efforts to improve accuracy, you indeed don't need three partitions of your data. But typically, you feel that you can improve the success of your model by e.g. changing the number of weights or hidden layers or ... and now a big loops starts to run with many iterations:
1) change weights and topology, 2) train, 3) validate, not satisfied, goto 1)
The long-term effect of this loop is, that you increasingly adapt your model to the validation data, so the results get better not because you so intelligently improve your topology but because you unconsciously learn the properties of the validation set and how to cope with them.
Now, the final and only valid accuracy of your neural net is estimated on really unseen data: the test set. This is done only once and is also useful to reveal over-adaption. You are not allowed to start a second even bigger loop now to prohibit any adaption to the test set!