Many shot, medium shot, few shot? - loss

While reading a CVPR 2020 paper titled "Equalization Loss for Long-Tailed Object Recognition", I could not understand the terms "many shot", "medium shot", and "few shot". Could you give me some advice on how to understand these terms?

The following definition should cover these terms:
"""
...many-shot classes (classes each with over 100 training samples), medium-shot classes (classes each with 20∼100 training samples) and few-shot classes (classes under 20 training samples)
"""
source: https://arxiv.org/pdf/1904.05160.pdf
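As a quick illustration (not from either paper), here is a minimal sketch that buckets classes into those three groups given a vector of training labels; the label vector and the number of classes are made up:

% Minimal sketch: bucket classes by their number of training samples,
% using the thresholds quoted above. The labels are hypothetical.
labels = randi(50, 5000, 1);                  % fake class labels in 1..50
counts = accumarray(labels, 1);               % training samples per class

manyShot   = find(counts > 100);              % over 100 training samples
mediumShot = find(counts >= 20 & counts <= 100);
fewShot    = find(counts < 20);               % under 20 training samples

fprintf('many-shot: %d, medium-shot: %d, few-shot: %d classes\n', ...
        numel(manyShot), numel(mediumShot), numel(fewShot));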

Related

Classifiers assembled with identical training sets using IBM Watson NLU and IBM Watson NLC services yield different results

Everyone actively using the Natural Language Classifier service from IBM Watson has seen the following message while using the API:
"On 9 August 2021, IBM announced the deprecation of the Natural Language Classifier service. The service will no longer be available from 8 August 2022. As of 9 September 2021, you will not be able to create new instances. Existing instances will be supported until 8 August 2022. Any instance that still exists on that date will be deleted.
For more information, see IBM Cloud Docs"
IBM actively promotes migrating NLC models to IBM's Natural Language Understanding service. Today I migrated my first classification model from Natural Language Classifier to Natural Language Understanding. Since I did not dive into the technological background of either service, I wanted to compare the output of both. To do so, I followed the migration guidelines provided by IBM ( NLC --> NLU migration guidelines ). To recreate the NLC classifier in NLU, I downloaded the complete set of training data used to create the initial classifier built in the NLC service, so the data sets used to train the NLC and NLU classifiers are identical. Recreating the classifier in NLU was straightforward, and the classifier training took about the same time as in NLC.
To compare the performance, I then assembled a test set of 100 phrases that were not used for training in either service and passed them through both the NLC and the NLU classifier. To my surprise, the differences are substantial: 18 out of 100 results differ by more than 0.30 in confidence value, or 37 out of 100 when the threshold is lowered to 0.20.
In my opinion, this difference is too large to blindly migrate all NLC models to NLU without hesitation. The results I obtained so far justify further investigation, with a manual curation step by an SME to validate the returned analysis results. I am not too happy about this. I was wondering whether other users have seen this issue or made the same observation. Perhaps someone can shed some light on the differences in analysis results between the NLC and NLU services, and on how to close the gap between them.
Please find below an excerpt of the comparison results:
Title | NLC confidence | NLU confidence | Comparability
"Microbial Volatile Organic Compound (VOC)-Driven Dissolution and Surface Modification of Phosphorus-Containing Soil Minerals for Plant Nutrition: An Indirect Route for VOC-Based Plant-Microbe Communications" | 0.01 | 0.05 | comparable
"Valorization of kiwi agricultural waste and industry by-products by recovering bioactive compounds and applications as food additives: A circular economy model" | 0.01 | 0.05 | comparable
"Quantitatively unravelling the effect of altitude of cultivation on the volatiles fingerprint of wheat by a chemometric approach" | 0.70 | 0.39 | different
"Identification of volatile biomarkers for high-throughput sensing of soft rot and Pythium leak diseases in stored potatoes" | 0.01 | 0.33 | different
"Impact of Electrolyzed Water on the Microbial Spoilage Profile of Piedmontese Steak Tartare" | 0.08 | 0.50 | different
"Review on factors affecting Coffee Volatiles: From Seed to Cup" | 0.67 | 0.90 | different
"Chemometric analysis of the volatile profile in peduncles of cashew clones and its correlation with sensory attributes" | 0.79 | 0.98 | comparable
"Surface-enhanced Raman scattering sensors for biomedical and molecular detection applications in space" | 0.00 | 0.00 | comparable
"Understanding the flavor signature of the rice grown in different regions of China via metabolite profiling" | 0.26 | 0.70 | different
"Nutritional composition, antioxidant activity, volatile compounds, and stability properties of sweet potato residues fermented with selected lactic acid bacteria and bifidobacteria" | 0.77 | 0.87 | comparable
We have also been migrating our classifiers from NLC to NLU and doing analysis to explain the differences. We explored different possible factors to see what might have an influence (upper case/lower case, text length, …), but found no correlation in these cases.
We did, however, find some correlation between the difference in score between the 1st and 2nd class returned by NLU and the score drop from NLC. That is to say, we noticed that the closer the score of the second class returned, the lower the NLU score on the first class. We call this confusion. In the case of our data, there are times when the confusion is 'real' (i.e. an SME would also classify the test phrase as borderline between two classes), but there were also times when we realized we could improve our training data to make the classes more distinct.
Bottom line: we cannot explain the internals of NLU that generate the difference, and we do still see a drop in the scores between NLC and NLU, but it is across the board. We will move ahead to NLU despite the lower scores: it does not hinder our interpretation of the results.
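For anyone wanting to reproduce this kind of analysis, below is a hedged sketch of the 'confusion' check described above. The CSV file and column names (nlc_top1, nlu_top1, nlu_top2) are assumptions, not the actual export format of either service:

% Hypothetical sketch: compare the NLC->NLU score drop with the margin
% between the first and second class returned by NLU.
T = readtable('comparison.csv');        % assumed columns: nlc_top1, nlu_top1, nlu_top2

drop   = T.nlc_top1 - T.nlu_top1;       % how far the top confidence fell
margin = T.nlu_top1 - T.nlu_top2;       % small margin = high "confusion"

R = corrcoef(margin, drop);
fprintf('correlation(margin, drop) = %.2f\n', R(1,2));

% Flag phrases where the two services disagree substantially.
different = abs(T.nlc_top1 - T.nlu_top1) > 0.30;
fprintf('%d phrases differ by more than 0.30\n', nnz(different));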

How to Improve Classification Accuracy with Support Vector Machine

I have 7 classes of inputs that are related to brain signal activity (EEG).
When the number of classes is large, the performance of classification algorithms may be affected.
As you can see in the following code, I extracted features for them. In the first phase I trained my model with 70% of my data and got 100% accuracy, but in the testing phase with the remaining 30% I did not get more than 42.5% accuracy. What is your suggestion to improve the accuracy of my model?
% Feature extraction: a 2-D discrete wavelet transform per class segment,
% followed by simple statistics of the coefficients
for i = 1:7
    [A, D] = dwt2(segment_train(i).train, 'db1');
    wave_train(i).A = A;
    wave_train(i).D = D;
    f1 = mean(A);
    f2 = median(A);
    f3 = max(D);
    f4 = mean(abs(fft(D)));
    f5 = var(D);
    f6 = skewness(D);
    f7 = entropy(D);
    f8 = var(A);
    f9 = mean(D);
    f(i,:) = [f1 f2 f3 f4 f5 f6 f7 f8 f9];   % one feature row per class
end

% Classifier
nOfSamples = 7;
nOfClassInstance = 10;
Sample = f;
class = [1 2 3 4 5 6 7]';

% Multiclass SVM via error-correcting output codes
Model = fitcecoc(Sample, class);
predictt = predict(Model, Sample);            % predicting on the training samples
disp('class predict')
disp([class predictt])

% Accuracy (measured on the same samples used for training)
Accuracy = mean(class == predictt) * 100;
fprintf('\nAccuracy = %.1f%%\n', Accuracy)
The question is a tad broad. However, it's a good idea to explore the distribution of the class labels.
Perhaps the distribution of the classes is skewed? It may be the case that some classes show up a lot more often than others. There are various ways to counteract this, such as up/down-sampling, weighting the error of under-sampled classes with a larger factor, etc. It would be a good idea to explore this further.
That being said, it certainly sounds like you're overfitting the model: a perfect training score combined with a 42.5% test score is a classic sign. You may also want to explore regularisation to combat the low test score, and evaluate with a proper train/test split or cross-validation, as sketched below.
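As a concrete starting point, here is a minimal sketch of a held-out evaluation with a standardized SVM and cross-validation. It assumes a feature matrix X with one row per sample and a label vector y with one entry per sample (not one row per class, as in the snippet above):

% Minimal sketch, assuming X (nSamples x nFeatures) and y (nSamples x 1)
cv  = cvpartition(y, 'HoldOut', 0.30);        % stratified 70/30 split
Xtr = X(training(cv), :);  ytr = y(training(cv));
Xte = X(test(cv), :);      yte = y(test(cv));

% Standardize features and cross-validate on the training set to get an
% honest estimate of generalization before touching the test set.
t     = templateSVM('Standardize', true, 'KernelFunction', 'rbf');
Mdl   = fitcecoc(Xtr, ytr, 'Learners', t);
cvMdl = crossval(Mdl, 'KFold', 5);
fprintf('5-fold CV error: %.2f%%\n', 100 * kfoldLoss(cvMdl));

yhat = predict(Mdl, Xte);
fprintf('Test accuracy: %.2f%%\n', 100 * mean(yhat == yte));

With only a handful of samples per class, expect the cross-validation estimate to be noisy; gathering more labelled segments per class will help more than any classifier tweak.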

What kind of features are extracted with the AlexNet layers?

My question is regarding this method, which extracts features from the FC7 layer of AlexNet.
What kind of features is it actually extracting?
I used this method on images of paintings by two artists. The training set is about 150 images from each artist (so the features are stored in a 300x4096 matrix); the validation set is 40 images. This works really well: 85-90% correct classification. I would like to know why it works so well.
WHAT FEATURES?
FC8 is the classification layer; FC7 is the fully-connected layer just before it, where the activations of the earlier layers are combined into a single 4096-dimensional vector. These represent the abstract, top-level features that the model training has "discovered". To examine these features, try one of the many layer visualization tools available online (don't ask for references here; SO bans requests for resources).
The features vary from one training to another, depending on the kernel initialization (usually random) and very dependent on the training set. However, the features tend to be simple in the early layers, with greater variety and detail in the later ones. For instance, on the original AlexNet target (ILSVRC 2012, a.k.a. ImageNet data set), the FC7 features often include vehicle tires, animal faces, various types of flower petals, green leaves and stems, two-legged animal torsos, airplane sections, car/truck/bus grill work, etc.
Does that help?
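If you want to reproduce or inspect this kind of feature extraction yourself, a sketch along the following lines is common in MATLAB; the folder name is hypothetical, and the use of alexnet/activations from the Deep Learning Toolbox is an assumption about your setup rather than necessarily the exact method you linked:

% Sketch: use pretrained AlexNet as a fixed feature extractor (requires the
% Deep Learning Toolbox and the AlexNet support package).
net  = alexnet;
imds = imageDatastore('paintings', 'IncludeSubfolders', true, ...
                      'LabelSource', 'foldernames');
augimds = augmentedImageDatastore([227 227], imds);   % resize to AlexNet input size

% 'fc7' activations: one 4096-dimensional feature vector per image.
features = activations(net, augimds, 'fc7', 'OutputAs', 'rows');

% A simple linear multiclass classifier on top of the frozen features.
classifier = fitcecoc(features, imds.Labels);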
WHY DOES IT WORK SO WELL?
That depends on the data set and training parameters. How different are the images from the two artists? There are plenty of features to extract: choice of subject, palette, compositional complexity, hard/soft edges, even the direction of brush strokes. For instance, differentiating any two early Cubists could be a little tricky; telling Rembrandt from Jackson Pollock should hit 100%.

Why too few features are selected in this dataset by subset selection method

I have a classification dataset with 148 input features (20 of which are binary and the rest continuous on the range [0,1]). The dataset has 66,171 negative examples and only 71 positive examples.
The dataset (arff text file) can be downloaded from this dropbox link: https://dl.dropboxusercontent.com/u/26064635/SDataset.arff.
In the Weka suite, when I use CfsSubsetEval and GreedyStepwise (with setSearchBackwards() set to true and also to false), the selected feature set contains only 2 features (i.e. 79 and 140)! It is probably needless to say that the classification performance with these two features is terribly bad.
Using ConsistencySubsetEval (in Weka as well) leads to the selection of ZERO features! When feature ranking methods are used instead and the best (e.g. 12) features are selected, a much better classification performance is achieved.
I have two questions:
First, what is it about the dataset that leads to the selection of so few features? Is it because of the imbalance between the number of positive and negative examples?
Second, and more importantly, are there any other subset selection methods (in Matlab or otherwise) that I can try which may lead to the selection of more features?
Clearly, the class imbalance is not helping. You could try taking a subsample of the dataset for better diagnostics. The SpreadSubsample filter lets you do that, stating the maximum admissible class imbalance, like 10:1, 3:1, or whatever you find appropriate.
For selection methods, you could first try dimensionality reduction methods, like PCA, in Weka.
But if the algorithms are selecting those sets of features, they seem to be the most meaningful for your classification task.
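Since the question also asks about MATLAB, one option to try is sequential forward selection with sequentialfs. The sketch below is only illustrative; the variable names X and y are assumptions, and wrapping a decision tree with uniform priors as the selection criterion is one arbitrary choice meant to keep the rare positive class from being ignored:

% Sketch: wrapper-style sequential forward selection in MATLAB.
% X is the 148-column feature matrix, y the class labels (assumed names).
opts = statset('Display', 'iter');

% Criterion: held-out misclassifications of a cost-balanced decision tree,
% so the 71 positives are not drowned out by the 66,171 negatives.
critfun = @(Xtr, ytr, Xte, yte) ...
    sum(yte ~= predict(fitctree(Xtr, ytr, 'Prior', 'uniform'), Xte));

[selected, history] = sequentialfs(critfun, X, y, 'cv', 5, 'options', opts);
find(selected)     % indices of the selected features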

Depth of Artificial Neural Networks

According to this answer, one should never use more than two hidden layers of Neurons.
According to this answer, a middle layer should contain at most twice the amount of input or output neurons (so if you have 5 input neurons and 10 output neurons, one should use (at most) 20 middle neurons per layer).
Does that mean that all data will be modeled within that number of neurons?
So if, for example, one wants to do anything from modeling the weather (a million input nodes of data from different weather stations) to simple OCR (of scanned text at a resolution of 1000x1000 DPI), would one need the same number of nodes?
PS.
My last question was closed. Is there another SE site where these kinds of questions are on topic?
You will likely have overfitting of your data (a.k.a. high variance). Think of it like this: the more neurons and layers you have, the more parameters you have available to fit your data.
Remember that for a first-layer node the equation becomes Z1 = sigmoid(sum(W1*x)).
A second-layer node then becomes Z2 = sigmoid(sum(W2*Z1)).
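To make that concrete, here is a toy forward pass through two hidden layers; all layer sizes and weights below are made up purely for illustration:

% Toy sketch of a forward pass with two hidden layers (sizes are arbitrary).
sigmoid = @(z) 1 ./ (1 + exp(-z));

x  = rand(5, 1);                          % 5 input features
W1 = randn(10, 5);  b1 = zeros(10, 1);    % first hidden layer: 10 units
W2 = randn(10, 10); b2 = zeros(10, 1);    % second hidden layer: 10 units
W3 = randn(3, 10);  b3 = zeros(3, 1);     % output layer: 3 units

Z1   = sigmoid(W1 * x  + b1);             % first hidden layer activations
Z2   = sigmoid(W2 * Z1 + b2);             % second hidden layer activations
yhat = sigmoid(W3 * Z2 + b3);             % network output

% Every extra layer or unit adds weights (parameters), which is why very
% large networks can overfit small data sets.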
Look into the machine learning class taught at Stanford... it's a great online course and a good reference.
More than two hidden layers can be useful in certain architectures
such as cascade correlation (Fahlman and Lebiere 1990) and in special
applications, such as the two-spirals problem (Lang and Witbrock 1988)
and ZIP code recognition (Le Cun et al. 1989).
Fahlman, S.E. and Lebiere, C. (1990), "The Cascade-Correlation Learning Architecture," NIPS 2, 524-532.
Le Cun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W., and Jackel, L.D. (1989), "Backpropagation applied to handwritten ZIP code recognition," Neural Computation, 1, 541-551.
Check out the sections "How many hidden layers should I use?" and "How many hidden units should I use?" on comp.ai.neural-nets's FAQ for more information.