Combining probabilities from different CNN models - classification

Let's say I have 2 images of a car: one comes from the camera and the other is a depth image generated by transforming a Lidar point cloud.
I used the same CNN model on both images to predict the class (the output is a softmax, as there are other classes in my dataset: pedestrian, van, truck, cyclist, etc.).
How can I combine the two probability vectors in order to predict the class while taking both predictions into account?
I tried methods like the average, maximum, minimum, and naive product applied to the scores for each class, but I don't know whether they work.
Thank you in advance.
Following this article:
We can see that they use the maximum or minimum rule to combine the outputs of classifiers. So does it work for a multiclass problem?

As per MSalter's comment, the softmax output isn't a true probability vector. But if we choose to regard it as such, we can simply average the two predictions class by class. This is equivalent to having two people each classify a random sample of objects from a big pool and, assuming they both counted an equal number, estimating the distribution of objects in the pool by combining their observations. The sum of the 'probabilities' of the classes will still equal 1.
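As a minimal sketch of the combination rules mentioned in the question (average, product, maximum, minimum), assuming both models output softmax vectors over the same class order, something like the following could be used. The function name `combine` and the example vectors are hypothetical. Every rule except the average needs an explicit renormalization for the fused scores to sum to 1 again.

```python
import numpy as np

def combine(p_a, p_b, rule="mean"):
    """Fuse two class-score vectors and renormalize so the result sums to 1."""
    stacked = np.stack([np.asarray(p_a, dtype=float), np.asarray(p_b, dtype=float)])
    if rule == "mean":
        fused = stacked.mean(axis=0)
    elif rule == "product":
        fused = stacked.prod(axis=0)
    elif rule == "max":
        fused = stacked.max(axis=0)
    elif rule == "min":
        fused = stacked.min(axis=0)
    else:
        raise ValueError(f"unknown rule: {rule}")
    return fused / fused.sum()

# Hypothetical softmax outputs over [car, van, truck] for the two inputs
p_camera = [0.6, 0.3, 0.1]
p_lidar = [0.5, 0.1, 0.4]

for rule in ("mean", "product", "max", "min"):
    fused = combine(p_camera, p_lidar, rule)
    print(rule, fused.round(3), "-> class", int(fused.argmax()))
```

All four rules apply to any number of classes, since each one works element-wise on the class scores. Which rule predicts best is an empirical question that is probably easiest to settle on a held-out validation set.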



Multiclass classification or regression?

I am trying to train a CNN model to classify images based on their aesthetic score. There are 200,000 images and every image is rated by more than 100 subjects. The mean score is calculated and the scores are normalized.
The distribution of the scores is approximately Gaussian. So I have decided to build a 10-class classification model after assigning an appropriate weight to each class, as the data is imbalanced.
My question:
For this problem, the scores are continuous, i.e., 0<0.2<0.3<0.4<0.5<..<1.
Does that mean this is a regression problem? If so, how do I balance the data for a regression problem, given that most of the data points lie between 0.4 and 0.6?
Since your labels are continuous, you could divide them into 10 equal-sized quantile bins using a technique like pandas.qcut() and assign a label to each bin. This can turn a regression problem into a classification problem.
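A minimal sketch of the quantile binning, assuming the normalized mean scores are available as a pandas Series (the synthetic `scores` below is a hypothetical stand-in for the real data):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical normalized mean scores, roughly Gaussian around 0.5
scores = pd.Series(np.clip(rng.normal(0.5, 0.1, size=1000), 0.0, 1.0))

# labels=False returns integer bin ids 0..9 instead of interval objects
labels = pd.qcut(scores, q=10, labels=False)

# Each quantile bin holds (almost) the same number of images
print(labels.value_counts().sort_index())
```

Because the bins are defined by quantiles, the resulting classes are roughly balanced by construction. The price is that the bins have unequal widths: they are narrow around the mean and wide in the tails.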
As far as the imbalance is concerned, you may want to try oversampling the minority data. This helps keep your model from being biased towards the majority data.
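One simple way to oversample is to redraw rows of the smaller classes with replacement until every class is as large as the biggest one. The sketch below does this on row indices; `oversample_indices` and the example label counts are hypothetical, and it assumes the resampling is applied to the training split only.

```python
import numpy as np

def oversample_indices(labels, seed=0):
    """Return row indices in which every class appears as often as the largest class."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    target = counts.max()
    picked = []
    for cls, count in zip(classes, counts):
        idx = np.flatnonzero(labels == cls)
        # Keep every original row, then draw the missing rows with replacement
        extra = rng.choice(idx, size=target - count, replace=True)
        picked.append(np.concatenate([idx, extra]))
    return np.concatenate(picked)

labels = np.array([0] * 5 + [1] * 50 + [2] * 8)   # hypothetical imbalanced classes
idx = oversample_indices(labels)
print(np.bincount(labels[idx]))
```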
Hope this helps.
I would recommend doing a histogram equalization over ALL of your participants' data first, so that their ratings are distributed evenly.
Then, for each image in your training set, calculate the expected value (and, if you want, the variance). The expected value is just the mean of the votes. For the variance, there are standard functions in (almost) every programming language that take an array of votes and return the variance.
Now take the expected value (and, if you want, the variance) as the ground truth for your network.
EDIT: Histogram Equalization:
Histogram equalization is a method to use the given numerical range as efficiently as possible.
In the context of images, this changes the pixel values so that the darkest pixel gets the value 0 and the lightest pixel gets the value 255. Furthermore, the grayscale values are redistributed so that each occurs about equally often (on average). For your dataset you want the same, even though your values range from 0 to 10 rather than from 0 to 255. Furthermore, you don't need to (and shouldn't) round the resulting values to integers. In this way, frequently occurring votes are spread out and rarely occurring votes are compressed.
Maybe you should first calculate the expected value and then do the histogram equalization over the expected values of all images.
This way the CNN should be better able to differentiate those small differences.
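A minimal sketch of histogram equalization on continuous scores, assuming it is done through the empirical CDF (each score is replaced by its rank, stretched onto the 0 to 10 range). The function `equalize` and the synthetic `mean_scores` are hypothetical; tied scores are ordered by position here, which is one choice among several.

```python
import numpy as np

def equalize(scores, low=0.0, high=10.0):
    """Map scores onto [low, high] so they are spread approximately uniformly."""
    scores = np.asarray(scores, dtype=float)
    # Rank of each score (0 = smallest); ties are broken by position
    ranks = scores.argsort(kind="stable").argsort(kind="stable")
    # Empirical CDF value in [0, 1], stretched onto the target range
    cdf = ranks / (len(scores) - 1)
    return low + cdf * (high - low)

rng = np.random.default_rng(0)
mean_scores = rng.normal(5.0, 1.0, size=1000)   # hypothetical per-image mean votes
flat = equalize(mean_scores)

# Histogram before and after: bell-shaped counts versus flat counts
print(np.histogram(mean_scores, bins=5, range=(0, 10))[0])
print(np.histogram(flat, bins=5, range=(0, 10))[0])
```

The transform keeps the ordering of the images and only changes the spacing between their scores, so the crowded middle of the distribution is stretched out and the sparse tails are compressed.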

Restricting output classes in multi-class classification in Tensorflow

I am building a bidirectional LSTM to do multi-class sentence classification.
I have 13 classes in total to choose from. I multiply the output of my LSTM network by a matrix of dimensionality [2*num_hidden_unit,num_classes] and then apply softmax to get the probability of the sentence falling into each of the 13 classes.
So if we consider output[-1] as the network output:
    W_output = tf.Variable(tf.truncated_normal([2*num_hidden_unit,num_classes]))
    result = tf.matmul(output[-1],W_output) + bias
and I get my [1, 13] matrix (assuming I am not working with batches for the moment).
Now, I also have information that a given sentence does not fall into a given class for sure and I want to restrict the number of classes considered for a given sentence. So let's say for instance that for a given sentence, I know it can fall only in 6 classes so the output should really be a matrix of dimensionality [1,6].
One option I was thinking of is to put a mask over the result matrix, where I multiply the entries corresponding to the classes that I want to keep by 1 and the ones I want to discard by 0, but this way I will just lose some of the information instead of redirecting it.
Does anyone have a clue what to do in this case?
I think your best bet is, as you seem to have described, using a weighted cross entropy loss function where the weights are 0 for your "impossible" classes and 1 for the possible classes. TensorFlow has a weighted cross entropy loss function.
Another interesting but probably less effective method is to feed whatever information you have about which classes your sentence can or cannot fall into to the network at some point (probably towards the end).
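On the concern about losing probability mass instead of redirecting it: a different technique from the weighted loss above, shown here only as a sketch, is to mask the logits before the softmax rather than the probabilities after it. Setting the disallowed logits to negative infinity makes the softmax redistribute all mass over the allowed classes. The example uses NumPy; `masked_softmax`, the random logits, and the choice of allowed classes are all hypothetical.

```python
import numpy as np

def masked_softmax(logits, allowed):
    """Softmax over the allowed classes only; disallowed classes get probability 0."""
    logits = np.asarray(logits, dtype=float)
    allowed = np.asarray(allowed, dtype=bool)
    # A logit of -inf contributes exp(-inf) = 0 to the softmax
    masked = np.where(allowed, logits, -np.inf)
    shifted = masked - masked.max()          # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

rng = np.random.default_rng(0)
logits = rng.normal(size=13)                 # hypothetical network output for one sentence
allowed = np.zeros(13, dtype=bool)
allowed[[0, 2, 3, 7, 9, 12]] = True          # hypothetical: only these 6 classes are possible

probs = masked_softmax(logits, allowed)
print(probs.round(3), probs.sum())
```

The result is the same as multiplying the ordinary softmax output by the 0/1 mask and then dividing by the remaining sum, so the mask from the question works as long as the masked vector is renormalized. In TensorFlow the same idea could presumably be expressed with tf.where followed by tf.nn.softmax.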

h2o random forest calculating MSE for multinomial classification

Why is h2o.randomForest calculating MSE on the out-of-bag sample and during training for a multinomial classification problem?
I have also done binary classification using h2o.randomForest. There it calculated AUC on the out-of-bag sample and during training, but for multiclass classification the random forest is calculating MSE, which seems suspicious. Please see this screenshot.
My target variable was a factor containing 4 factor levels: model1, model2, model3 and model4. In the screenshot you can also see a confusion matrix for these factors.
Can someone please explain this behaviour?
Both binomial and multinomial classification display MSE, so you will see it in the Scoring History table for both models (highlighted training_MSE column).
H2O does not evaluate a multinomial AUC. A few evaluation methods exist, but there is not yet a single widely adopted one. The pROC package discusses the method of Hand and Till, but mentions that it cannot be plotted and that its results are rarely tested. Log loss and classification error are still available and are specific to classification, as each has a standard method of evaluation in a multinomial context.
There is a confusion matrix comparing your 4 factor levels, as you highlighted. Can you clarify what more you are expecting? If you were looking for four individual confusion matrices, the four-column table contains enough information that they could be computed.
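To illustrate how the individual confusion matrices can be computed from the four-column table, here is a minimal sketch. It assumes rows are actual classes and columns are predicted classes; the counts are made up, and `one_vs_rest` is a hypothetical helper, not an H2O function.

```python
import numpy as np

def one_vs_rest(cm):
    """Split a KxK confusion matrix (rows = actual, columns = predicted) into K 2x2 matrices."""
    cm = np.asarray(cm)
    total = cm.sum()
    result = {}
    for k in range(cm.shape[0]):
        tp = cm[k, k]
        fn = cm[k, :].sum() - tp     # actual k, predicted as something else
        fp = cm[:, k].sum() - tp     # predicted k, actually something else
        tn = total - tp - fn - fp
        result[k] = np.array([[tp, fn], [fp, tn]])
    return result

# Hypothetical counts for model1..model4
cm = np.array([
    [50,  3,  2,  0],
    [ 4, 40,  5,  1],
    [ 1,  6, 30,  3],
    [ 0,  2,  4, 20],
])

for k, m in one_vs_rest(cm).items():
    print(f"model{k + 1} vs rest:\n{m}")

# Overall classification error: share of off-diagonal counts
error = 1 - np.trace(cm) / cm.sum()
print(round(error, 4))
```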

Does sklearn support a cost matrix?

Is it possible to train classifiers in sklearn with a cost matrix with different costs for different mistakes? For example in a 2 class problem, the cost matrix would be a 2 by 2 square matrix. For example A_ij = cost of classifying i as j.
The main classifier I am using is a Random Forest.
The cost-sensitive framework you describe is not supported in scikit-learn, in any of the classifiers we have.
You could use a custom scoring function that accepts a matrix of per-class or per-instance costs. Here's an example of a scorer that calculates per-instance misclassification cost:
    import pandas as pd
    from sklearn.metrics import make_scorer

    def financial_loss_scorer(y, y_pred, **kwargs):
        # y is expected to be a pandas Series whose index matches totals
        totals = kwargs['totals']
        # Create an indicator - 0 if correct, 1 otherwise
        errors = pd.DataFrame((~(y == y_pred)).astype(int).rename('Result'))
        # Use the product totals dataset to create results
        results = errors.merge(totals, left_index=True, right_index=True, how='inner')
        # Calculate per-prediction loss
        loss = results.Result * results.SumNetAmount
        return loss.sum()
The scorer becomes:
    make_scorer(financial_loss_scorer, totals=totals_data, greater_is_better=False)
Where totals_data is a pandas.DataFrame with indexes that match the training set indexes.
You could always just look at your ROC curve. Each point on the ROC curve corresponds to a separate confusion matrix. So choosing a classifier threshold, and with it a confusion matrix, implies some sort of cost-weighting scheme. Then you just have to choose the confusion matrix that would imply the cost matrix you are looking for.
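The link between a cost matrix and a threshold can be made concrete with a minimum-expected-cost decision rule. This is a standard post-processing step applied to predicted probabilities, not something the classifier is trained with, and it is only a sketch: the cost values and probabilities are hypothetical, and `proba` is assumed to come from something like a classifier's predict_proba.

```python
import numpy as np

# cost[i, j] = cost of classifying a true class i as class j (hypothetical values)
cost = np.array([
    [0.0, 1.0],
    [5.0, 0.0],
])

def min_cost_predict(proba, cost):
    """Pick, per row, the class with the lowest expected cost under the predicted probabilities."""
    # expected[n, j] = sum_i proba[n, i] * cost[i, j]
    expected = np.asarray(proba) @ cost
    return expected.argmin(axis=1)

proba = np.array([[0.9, 0.1], [0.8, 0.2], [0.4, 0.6]])
print(min_cost_predict(proba, cost))   # cost-aware decisions
print(proba.argmax(axis=1))            # plain argmax, i.e. a 0.5 threshold
```

With these costs, class 1 is predicted whenever 5 * P(class 1) exceeds 1 * P(class 0), which moves the threshold on P(class 1) from 0.5 down to 1/6. That is one specific point on the ROC curve.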
On the other hand, if you really have your heart set on it and really want to "train" an algorithm using a cost matrix, you could "sort of" do it in sklearn.
Although it is impossible to directly train an algorithm to be cost-sensitive in sklearn, you could use a cost-matrix-like setup to tune your hyper-parameters. I've done something similar to this using a genetic algorithm. It really doesn't do a great job, but it should give a modest boost to performance.
One way to circumvent this limitation is to use under or oversampling. E.g., if you are doing binary classification with an imbalanced dataset, and want to make errors on the minority class more costly, you could oversample it. You may want to have a look at imbalanced-learn which is a package from scikit-learn-contrib.
This may not directly answer your question (since you are asking about Random Forest).
But for SVM (in sklearn), you can use the class_weight parameter to specify the weights of different classes. Essentially, you pass in a dictionary.
You might want to refer to this page to see an example of using class_weight.
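A minimal sketch of passing class_weight as a dictionary, on synthetic data with hypothetical weights. RandomForestClassifier accepts the same parameter, so it is included as well. Note that class weights penalize mistakes per true class; they do not express a full cost matrix with a separate cost for every (true, predicted) pair.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
# Hypothetical imbalanced binary target: class 1 is the minority
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 1.0).astype(int)

# Errors on class 1 are weighted 5 times as heavily as errors on class 0
weights = {0: 1, 1: 5}

svm = SVC(class_weight=weights).fit(X, y)
forest = RandomForestClassifier(class_weight=weights, random_state=0).fit(X, y)

print(svm.predict(X).sum(), forest.predict(X).sum(), y.sum())
```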

Rapidminer - neural net operator - output confidence

I have a feed-forward neural network with six inputs, 1 hidden layer and two output nodes (1; 0). This NN is trained on 0;1 values.
When the model is applied, the variables confidence(0) and confidence(1) are created, and the sum of these two numbers for each row is 1.
My question is: what exactly do these two numbers (confidence(0) and confidence(1)) mean? Are these two numbers probabilities?
Thanks for your answers.
In general
The confidence values (or scores, as they are called in other programs) represent a measure of how, well, confident the model is that the presented example belongs to a certain class. They are highly dependent on the general strategy and the properties of the algorithm.
The easiest example to illustrate this is the majority classifier, which just assigns the same score to all observations, based on the class proportions in the original test set.
Another example is the k-nearest-neighbor classifier, where the score for a class i is calculated by averaging the distance to those examples which both belong to the k nearest neighbors and have class i. Then the score is sum-normalized across all classes.
In the specific example of the NN, I do not know how they are calculated without checking the code. I guess it is just the value of the output node, sum-normalized across both classes.
Do the confidences represent probabilities?
In general no. To illustrate what probabilities in this context mean: If an example has probability 0.3 for class "1", then 30% of all examples with similar feature/variable values should belong to class "1" and 70% should not.
As far as I know, this task is called "calibration". Some general methods exist for this purpose (e.g. binning the scores and mapping them to the class fraction of the corresponding bin), as well as some classifier-dependent ones (e.g. Platt scaling, which was invented for SVMs). A good starting point is:
Bianca Zadrozny, Charles Elkan: Transforming Classifier Scores into Accurate Multiclass Probability Estimates
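The binning method mentioned above can be sketched as follows for the two-class case. This is an illustration under stated assumptions, not the procedure from the paper: `fit_binning` and `calibrate` are hypothetical helpers, the bins are equal-width, and the synthetic scores are deliberately miscalibrated (the true class-1 rate is the squared score).

```python
import numpy as np

def fit_binning(scores, labels, n_bins=10):
    """Learn, per equal-width score bin, the fraction of examples that are class 1."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bins = np.digitize(scores, edges[1:-1])   # bin ids 0..n_bins-1
    fractions = np.full(n_bins, np.nan)       # stays NaN for empty bins
    for b in range(n_bins):
        in_bin = bins == b
        if in_bin.any():
            fractions[b] = labels[in_bin].mean()
    return edges, fractions

def calibrate(scores, edges, fractions):
    """Replace each score by the class-1 fraction observed in its bin."""
    bins = np.digitize(np.asarray(scores, dtype=float), edges[1:-1])
    return fractions[bins]

rng = np.random.default_rng(0)
scores = rng.uniform(size=20000)                                # hypothetical confidence(1) values
labels = (rng.uniform(size=20000) < scores ** 2).astype(int)    # true rate is score**2

edges, fractions = fit_binning(scores, labels)
print(calibrate([0.25, 0.55, 0.95], edges, fractions).round(2))
```

The bin fractions should be estimated on data that was not used to train the classifier, otherwise the calibrated values are likely to be too optimistic.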
The confidence measures correspond to the proportions of outputs 0 and 1 that are activated in the initial training dataset.
E.g. if 30% of your training set has outputs (1; 0) and the remaining 70% has outputs (0; 1), then confidence(0) = 30% and confidence(1) = 70%.