RapidMiner - neural net operator - output confidence - neural-network

I have a feed-forward neural network with six inputs, one hidden layer and two output nodes (1; 0). The network is trained on 0/1 values.
When I apply the model, it creates the variables confidence(0) and confidence(1), and the sum of these two numbers is 1 for each row.
My question is: what do these two numbers (confidence(0) and confidence(1)) mean exactly? Are they probabilities?
Thanks for your answers.

In general
The confidence values (or scores, as they are called in other programs) are a measure of how confident the model is that the presented example belongs to a certain class. They depend heavily on the general strategy and the properties of the algorithm.
Examples
The easiest example to illustrate is the majority classifier, which just assigns the same score to all observations based on the class proportions in the original training set.
Another example is the k-nearest-neighbour classifier, where the score for a class i is calculated by averaging the distance to those examples which both belong to the k nearest neighbours and have class i. The scores are then sum-normalized across all classes.
In the specific case of the neural network, I do not know how they are calculated without checking the code. My guess is that it is simply the value of each output node, sum-normalized across both classes.
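To make the sum-normalization concrete, here is a minimal sketch (Python/NumPy; the raw scores are invented and merely stand in for whatever the output nodes or distance averages produce):
import numpy as np

# Hypothetical raw per-class scores, e.g. the activations of the two output nodes
raw_scores = np.array([0.8, 0.3])

# Sum-normalize so the resulting "confidences" add up to 1 across the classes
confidences = raw_scores / raw_scores.sum()
print(confidences)  # [0.7272... 0.2727...]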
Do the confidences represent probabilities?
In general no. To illustrate what probabilities in this context mean: If an example has probability 0.3 for class "1", then 30% of all examples with similar feature/variable values should belong to class "1" and 70% should not.
As far as I know, this task is called "calibration". For this purpose some general methods exist (e.g. binning the scores and mapping them to the class fraction of the corresponding bin) and some classifier-dependent ones (e.g. Platt scaling, which was invented for SVMs). A good place to start is:
Bianca Zadrozny, Charles Elkan: Transforming Classifier Scores into Accurate Multiclass Probability Estimates
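To make the binning idea concrete, a hand-rolled sketch (Python/NumPy, not RapidMiner functionality; the scores and labels are invented):
import numpy as np

# Invented confidence scores for class "1" and true labels on a held-out validation set
scores = np.random.rand(1000)
labels = (np.random.rand(1000) < scores).astype(int)

# Histogram binning: map every score bin to the observed fraction of class "1" in that bin
edges = np.linspace(0.0, 1.0, 11)            # 10 equal-width bins
bin_ids = np.digitize(scores, edges[1:-1])   # bin index (0-9) for every score
calibrated = np.array([labels[bin_ids == b].mean() for b in range(10)])

# A new score is "calibrated" by looking up the class fraction of its bin
new_score = 0.42
print(calibrated[np.digitize(new_score, edges[1:-1])])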

The confidence measures correspond to the proportion of outputs 0 and 1 that are activated in the initial training dataset.
E.g. if 30% of your training set has outputs (1; 0) and the remaining 70% has outputs (0; 1), then confidence(0) = 30% and confidence(1) = 70%.

Related

Combining probabilities from different CNN models

Let's say I have 2 images of a car, but one is generated from the camera and the other is a depth image generated from a LiDAR point-cloud transformation.
I used the same CNN model on both images to predict the class (the output is a softmax, as there are other classes in my dataset: pedestrian, van, truck, cyclist, etc.).
How can I combine the two probability vectors in order to predict the class by taking both predictions into account?
I have tried methods like the average, maximum, minimum, and naive product applied to each score for each class, but I don't know whether they work.
Thank you in advance
EDIT:
Following this article: https://www.researchgate.net/publication/327744903_Multimodal_CNN_Pedestrian_Classification_a_Study_on_Combining_LIDAR_and_Camera_Data
we can see that they use a maximum or minimum rule to combine the outputs of the classifiers. So does this work for a multiclass problem?
As per MSalter's comment, the softmax output isn't a true probability vector. But if we choose to regard it as such, we can simply take the average of the two predictions. This is equivalent to having two people each classify a random sample of objects from a big pool and, assuming they both counted an equal amount, estimating the distribution of objects in the big pool by combining their observations. The sum of the 'probabilities' of the classes will still equal 1.
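A minimal sketch of that averaging (Python/NumPy; the two vectors are invented stand-ins for the camera and LiDAR softmax outputs):
import numpy as np

# Invented softmax outputs for the same object: classes = [car, pedestrian, van, truck, cyclist]
p_camera = np.array([0.70, 0.10, 0.05, 0.10, 0.05])
p_lidar = np.array([0.50, 0.05, 0.25, 0.15, 0.05])

# Simple average of the two "probability" vectors; the result still sums to 1
p_combined = (p_camera + p_lidar) / 2
print(p_combined.sum())     # 1.0
print(p_combined.argmax())  # index of the predicted class (0 = car here)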

Poor performance of SVM for unbalanced dataset - how to improve?

Consider a dataset A which has examples for training in a binary classification problem. I have used an SVM and applied the weighted method (in MATLAB), since the dataset is highly imbalanced. I have applied weights inversely proportional to the frequency of data in each class. This is done during training using the command
fitcsvm(trainA, trainTarg, ...
    'KernelFunction', 'RBF', 'KernelScale', 'auto', ...
    'BoxConstraint', C, 'Weights', weightTrain);  % observation weights inversely proportional to class frequency
I have used 10-fold cross-validation for training and tuned the hyperparameters as well. So, inside the CV, dataset A is split into a training set (trainA) and a validation set (valA). After training is over, and outside the CV loop, I get this confusion matrix on A:
80025 1
0 140
where the first row is for the majority class and the second row is for the minority class. There is only 1 false positive (FP) and all minority class examples have been correctly classified giving true positive (TP) = 140.
PROBLEM: Then I run the trained model on a new, unseen test dataset B which was never seen during training. This is the confusion matrix for testing on B:
50075 0
100 0
As can be seen, the minority class has not been classified at all, so the weighting has failed to serve its purpose. Although there are no false positives, the SVM fails to capture the minority class examples.
I have not applied any weights or balancing method such as sampling (SMOTE, RUSBoost, etc.) on B. What could be wrong, and how can I overcome this problem?
Class misclassification weights could be set instead of sample weights!
You can set the class weights based on the following example.
The misclassification cost for classifying class A (n records; dominant class) as class B (m records; minority class) can be set to n/m.
The misclassification cost for classifying class B as class A can be set to 1, or to m/n, depending on the severity you want to impose on the learning:
c = [0 2.2; 1 0];   % Cost(i,j): cost of classifying a point into class j when its true class is i
mdl = fitcsvm(X, Y, 'Cost', c);
According to documentation:
For two-class learning, if you specify a cost matrix, then the software updates the prior probabilities by incorporating the penalties described in the cost matrix. Consequently, the cost matrix resets to the default. For more details on the relationships and algorithmic behavior of BoxConstraint, Cost, Prior, Standardize, and Weights, see Algorithms.
The Area Under the Curve (AUC) is usually used to measure the performance of models applied to unbalanced data. It is also good to plot the ROC curve to get more visual insight. Using only the confusion matrix for such models may lead to misinterpretation.
perfcurve from the Statistics and Machine Learning Toolbox provides both functionalities.
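The same kind of evaluation can also be sketched in Python with scikit-learn instead of perfcurve (an assumption on my part that that toolchain is available; the labels and scores below are invented):
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Invented true labels (1 = minority class) and classifier decision scores
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.10, 0.20, 0.15, 0.30, 0.40, 0.35, 0.80, 0.60, 0.55, 0.90])

print(roc_auc_score(y_true, y_score))              # area under the ROC curve
fpr, tpr, thresholds = roc_curve(y_true, y_score)  # points for plotting the ROC curve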

Multiclass classification or regression?

I am trying to train a CNN model to classify images based on their aesthetic score. There are 200,000 images and every image is rated by more than 100 subjects. The mean score is calculated and the scores are normalized.
The distribution of the scores is approximately Gaussian, so I have decided to build a 10-class classification model after assigning an appropriate weight to each class, since the data is imbalanced.
My question:
For this problem the scores are continuous, i.e., 0 < 0.2 < 0.3 < 0.4 < 0.5 < ... < 1.
Does that mean this is a regression problem? If so, how do I balance the data for a regression problem, given that most of the data points lie between 0.4 and 0.6?
Thanks!
Since your labels are continuous, you could divide them into 10 equal quantiles using a function like pandas.qcut() and assign a label to each bin. This turns the regression problem into a classification problem.
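A minimal sketch of that binning (assuming pandas is available; the scores below are invented normalized mean scores):
import numpy as np
import pandas as pd

# Invented normalized mean aesthetic scores, mostly concentrated between 0.4 and 0.6
scores = pd.Series(np.clip(np.random.normal(0.5, 0.1, 200000), 0, 1))

# Split the continuous scores into 10 equal-sized quantile bins; the bin index is the class label
class_labels = pd.qcut(scores, q=10, labels=False)

# Because the bins are quantiles, each class holds roughly the same number of images,
# which also helps with the imbalance
print(class_labels.value_counts())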
And as far as the imbalance is concerned, you may want to try to oversample the minority data. This will ensure your model is not biased towards majority data.
Hope this helps.
I would recommend doing a histogram equalization over ALL of your participants' ratings first, so that the ratings are distributed equally.
Then, for each image in your training set, calculate the expected value (and, if you also want to, the variance). The expected value is just the mean of the votes. For the variance there are standard functions in (almost) every programming language that take an array of votes and output the variance.
Now take the expected value (and, if you want, also the variance) as the ground truth for your network.
EDIT: Histogram Equalization:
Histogram equalization is a method for using the given numerical range as efficiently as possible.
In the context of images, this changes the pixel values so that the darkest pixel becomes 0 and the lightest becomes 255. Furthermore, every grayscale value gets redistributed so that, on average, each value occurs as often as every other. For your dataset you want the same, even though your values do not run from 0 to 255 but from 0 to 10. Also, you don't need to (and shouldn't) round the resulting values to integers. In this way, votes that occur more often are spread out more, and votes that occur less often are contracted.
Maybe you should first calculate the expected value and then do the histogram equalization over the expected values of all images.
This way the CNN should be better able to differentiate those small differences.
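A rough sketch of the whole recipe (Python/NumPy; a rank-based equalization is used here as one way to spread the ratings evenly over 0-10, and the votes are invented):
import numpy as np

# Invented raw votes: rows = images, columns = subjects, ratings on a 0-10 scale
votes = np.random.normal(5.0, 1.5, size=(1000, 100))

# Histogram equalization over ALL votes: replace each vote by its rank, rescaled to 0-10,
# so that the equalized ratings are (approximately) uniformly distributed
flat = votes.ravel()
ranks = flat.argsort().argsort()                     # rank of every single vote
equalized = (ranks / (flat.size - 1) * 10.0).reshape(votes.shape)

# Per-image expected value (mean) and variance of the equalized votes -> regression targets
expected_value = equalized.mean(axis=1)
variance = equalized.var(axis=1)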

Restricting output classes in multi-class classification in TensorFlow

I am building a bidirectional LSTM to do multi-class sentence classification.
I have 13 classes in total, and I multiply the output of my LSTM network by a matrix of dimensionality [2*num_hidden_unit, num_classes] and then apply a softmax to get the probability of the sentence falling into one of the 13 classes.
So if we consider output[-1] as the network output:
W_output = tf.Variable(tf.truncated_normal([2*num_hidden_unit, num_classes]))
bias = tf.Variable(tf.zeros([num_classes]))  # bias for the output layer
result = tf.matmul(output[-1], W_output) + bias
and I get my [1, 13] matrix (assuming I am not working with batches for the moment).
Now, I also have information that a given sentence does not fall into certain classes for sure, and I want to restrict the number of classes considered for a given sentence. So let's say, for instance, that for a given sentence I know it can fall into only 6 of the classes, so the output should really be a matrix of dimensionality [1, 6].
One option I was thinking of is to put a mask over the result matrix, where I multiply the rows corresponding to the classes I want to keep by 1 and the ones I want to discard by 0, but in this way I will just lose some of the information instead of redirecting it.
Does anyone have a clue about what to do in this case?
I think your best bet is, as you seem to have described, using a weighted cross-entropy loss function where the weights for your "impossible" classes are 0 and the weights for the other, possible classes are 1. TensorFlow has a weighted cross-entropy loss function.
Another interesting, but probably less effective, method is to feed whatever information you have about which classes your sentence can or cannot fall into to the network at some point (probably towards the end).
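For the masking idea from the question, a rough sketch (TensorFlow 1.x style; the placeholder names are illustrative, not the asker's actual code) is to push the logits of the impossible classes to a large negative value before the softmax, so they receive (almost) zero probability while the remaining classes still sum to 1:
import tensorflow as tf

num_classes = 13

# Stand-in for the [1, num_classes] logits from tf.matmul(output[-1], W_output) + bias in the question
result = tf.placeholder(tf.float32, shape=[1, num_classes])
# 1 for classes the sentence may fall into, 0 for classes known to be impossible
allowed = tf.placeholder(tf.float32, shape=[1, num_classes])
labels_one_hot = tf.placeholder(tf.float32, shape=[1, num_classes])

# Add a very large negative number to the logits of the impossible classes
masked_logits = result + (1.0 - allowed) * -1e9

probs = tf.nn.softmax(masked_logits)  # impossible classes get ~0 probability
loss = tf.nn.softmax_cross_entropy_with_logits_v2(labels=labels_one_hot, logits=masked_logits)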

Training HMM - The amount of data required

I'm using HMMs for classification. I came across an example in the Wikipedia article on the Baum–Welch algorithm. I hope someone can help me.
The example is as follows: "Suppose we have a chicken from which we collect eggs at noon every day. Now, whether or not the chicken has laid eggs for collection depends on some unknown factors that are hidden. We can however (for simplicity) assume that there are only two states that determine whether the chicken lays eggs."
Note that we have 2 different observations (N and E) and 2 states (S1 and S2) in this example.
My question here is:
How many observations/observed sequences (i.e., how much training data) do we need to train the model well? Is there any way to estimate or test the amount of training data required?
For each variable (free parameter) of your HMM, you need about 10 samples. Using this rule of thumb, you can easily calculate how many samples you need to construct a reliable classifier.
In your example you have two states, which results in a 2x2 transition matrix A = [a_00, a_01; a_10, a_11], where a_ij is the transition probability from state S_i to S_j.
Moreover, each of these states generates observations with probabilities p_S1 and p_S2, i.e., if we are in state S1 the chicken will lay an egg with probability p_S1 and will not with probability 1 - p_S1.
In total you have 6 variables that need to be estimated: the 4 entries of the transition matrix plus the 2 emission probabilities. It is more or less obvious that these cannot be estimated accurately from only two observations. As mentioned before, it is conventional to assume that at least 10 samples per variable are needed in order to estimate that variable accurately.
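As a tiny illustration of the counting and the rule of thumb (plain Python; the 10-samples-per-variable figure is the heuristic from above, not a hard bound):
n_states = 2   # S1, S2
n_symbols = 2  # observations N and E

transition_params = n_states * n_states       # the 2x2 matrix A -> 4 entries
emission_params = n_states * (n_symbols - 1)  # p_S1 and p_S2 -> 2 entries
n_params = transition_params + emission_params

samples_needed = 10 * n_params
print(n_params, samples_needed)  # 6 60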