Individual class accuracy calculation confusion - MATLAB

The total number of data points for which the following binary classification result is obtained is 1500. Of these, I have
1473 labelled as 0 and
the remaining 27 labelled as 1.
As can be seen from the confusion matrix, out of the 27 data points belonging to class 1, only 1 was misclassified as 0. So I calculated the accuracy for the individual classes and got 98.2% for the class labelled 0 and 1.7333% for the other. Is this calculation correct? I am not sure... I got a pretty good classification for the class labelled 1, so why is its accuracy so low?
The individual class accuracies should have been 100% for class 0 and around 96% for class 1.
Does one misclassification reduce the accuracy of class 1 by that much? This is how I calculated the individual class accuracies in MATLAB.
cmMatrix =
1473 0
1 26
acc_class0 = 100*(cmMatrix(1,1))/1500;
acc_class1= 100*(cmMatrix(2,2))/1500;

If everything had been classified correctly, your computation would report the accuracy for class 1 as 27/1500 = 0.018, which is obviously wrong. The overall accuracy is 1499/1500, but per-class accuracy cannot use 1500 as the denominator. For class 1, 27 is the maximum possible number of correctly classified elements, so 27 should be the denominator (and likewise 1473 for class 0).
acc_class0 = 100*cmMatrix(1,1)/sum(cmMatrix(1,:));   % 100*1473/1473 = 100%
acc_class1 = 100*cmMatrix(2,2)/sum(cmMatrix(2,:));   % 100*26/27 ≈ 96.3%
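If you later have more than two classes, the same idea generalises in one line; a minimal sketch, assuming the rows of cmMatrix are the actual classes and the columns the predicted ones:
acc_per_class = 100 * diag(cmMatrix) ./ sum(cmMatrix, 2);   % per-class accuracy (recall), one entry per class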

Undefined F1 scores in multiclass classifications when model does not predict one class

I am trying to use F1 scores for model selection in multiclass classification.
I am calculating them class-wise and averaging over them:
(F1(class1) + F1(class2) + F1(class3)) / 3 = F1(total)
However, in some cases I get NaN values for the F1 score. Here is an example:
Let true_label = [1 1 1 2 2 2 3 3 3] and pred_label = [2 2 2 2 2 2 3 3 3].
Then the confusion matrix looks like:
C =[0 3 0; 0 3 0; 0 0 3]
Which means that when I calculate the precision (needed for the F1 score) of the first class, I obtain 0/(0+0+0), which is undefined (NaN).
Firstly, am I making a mistake in calculating F1 scores or precisions here?
Secondly, how should I treat these cases in model selection? Should I ignore them, or should I just set the F1 score for this class to 0 (reducing the total F1 score for this model)?
Any help would be greatly appreciated!
You need to avoid the division by zero in the precision in order to report meaningful results. You might find this answer useful, in which you explicitly report the poor outcome. Additionally, this implementation suggests an alternative way to differentiate between good and poor outcomes in your reporting.
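As a concrete illustration of the 'set it to 0' convention you mention, here is a minimal MATLAB sketch (assuming a square confusion matrix C with rows = actual and columns = predicted, as in your example):
C = [0 3 0; 0 3 0; 0 0 3];
nc = size(C, 1);
f1 = zeros(nc, 1);
for k = 1:nc
    tp = C(k, k);
    p = tp / max(sum(C(:, k)), 1);   % precision; guard against an all-zero column
    r = tp / max(sum(C(k, :)), 1);   % recall; guard against an all-zero row
    if p + r > 0
        f1(k) = 2 * p * r / (p + r);
    end                              % otherwise F1 stays 0 instead of NaN
end
macroF1 = mean(f1)                   % macro-averaged F1 for model selection
With your example this yields per-class F1 scores of 0, 2/3 and 1, and a macro F1 of about 0.56, instead of NaN.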

How to interpret results from Sparkling Water's GBM algorithm on classification task

I'm new to Sparkling Water and machine learning.
I've built a GBM model with two datasets divided manually into train and test.
The task is classification with all numeric attributes (the response column is converted to enum type). The code is in Scala.
val gbmParams = new GBMParameters()
gbmParams._train = train
gbmParams._valid = test
gbmParams._response_column = "response"
gbmParams._ntrees = 50
gbmParams._max_depth = 6
val gbm = new GBM(gbmParams)
val gbmModel = gbm.trainModel.get
In the model summary I get four different confusion matrices - one on train data and one on test data from before the individual trees are built, and one of each from afterwards. In the first two, the predicted value is 1 in every case - this is the one for the test data:
CM: Confusion Matrix (vertical: actual; across: predicted):
0 1 Error Rate
0 0 500 1,0000 500 / 500
1 0 300 0,0000 0 / 300
Totals 0 800 0,6250 500 / 800
The second confusion matrix is similar, with the predicted value 1 in every case, for the train data. The third and fourth confusion matrices, produced after the trees are built, give normal results with values distributed across all cells of the matrix.
I need to interpret the first and second matrix. Why is Sparkling Water doing that? Can I work with these results, or is it just some intermediate step?
Thank you.
Interpreting the matrix given by:
CM: Confusion Matrix (vertical: actual; across: predicted):
0 1 Error Rate
0 0 500 1,0000 500 / 500
1 0 300 0,0000 0 / 300
Totals 0 800 0,6250 500 / 800
We can see from the Totals row that the model labelled all 800 observations as 1.
The rows give the actual classes: 500 observations of class 0 and 300 of class 1. Since everything was predicted as 1, all 500 class-0 observations are misclassified, which gives the overall error of 500/800 = 0.625, or 62.5%.
This tells us two things:
The model's predictions were completely skewed in favour of class 1.
The model did a pretty bad job
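As a quick arithmetic check of that reading (a hypothetical MATLAB sketch, to match the rest of the page; H2O prints these numbers for you):
cm = [0 500; 0 300];                               % rows = actual, columns = predicted
rowErr = 1 - diag(cm) ./ sum(cm, 2)                % per-class error: [1.0000; 0]
totalErr = (sum(cm(:)) - trace(cm)) / sum(cm(:))   % 500/800 = 0.6250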
Is it possible that the two initial matrices represent the summary of an untrained model, essentially picking classes at random, while the latter two represent the summary of the trained model?

Understanding the output of ovrpredict in LIBSVM

I'm implementing multiclass classification with LIBSVM, adopting a one-versus-all strategy. For this purpose, I used the ovrtrain and ovrpredict MATLAB functions:
model = ovrtrain(GroupTrain, TrainingSet, '-t 0');
[predicted_labels ac decv] = ovrpredict(testY, TestSet, model);
The output of ovrpredict is as follows
Accuracy = 90% (18/20) (classification)
Accuracy = 90% (18/20) (classification)
Accuracy = 90% (18/20) (classification)
Accuracy = 95% (19/20) (classification)
Accuracy = 90% (18/20) (classification)
Accuracy = 90% (18/20) (classification)
Accuracy = 90% (18/20) (classification)
Accuracy = 90% (18/20) (classification)
Accuracy = 90% (18/20) (classification)
Accuracy = 90% (18/20) (classification)
I have 10 classes, and I'm new to libsvm, so I guess that those accuracies correspond to the classification accuracy of each class. However, I don't understand the difference between this output and the value of the accuracy ac returned by ovrpredict, which is 60%.
ac =
0.6000
Thanks
The two values mean different things. Each Accuracy line is the output of the svmpredict() function and tells you how well your test data fits that specific one-vs-rest model, while ac gives you the accuracy of the input test class labels (testY in your case) with respect to the final predicted class labels.
Let's have a look inside the ovrpredict function and see how these accuracy values are generated.
function [pred, ac, decv] = ovrpredict(y, x, model)
From the definition, we can see that we have 3 input parameters:
y = class labels
x = test data set
model = a struct containing 10 models for 10 different classes
labelSet = model.labelSet;
This line extracts labelSet, the unique class labels. In your case you will have 10 unique labels, depending on how you set them while defining the 10 separate classes of training data.
labelSetSize = length(labelSet);
Here you get the number of classes (10 in your case).
models = model.models;
The models variable contains all the trained models (10 in your case).
decv = zeros(size(y, 1), labelSetSize);
Here, the decv matrix is created to hold the decision values of each test data point, one column per class.
for i=1:labelSetSize
[l,a,d] = svmpredict(double(y == labelSet(i)), x, models{i});
decv(:, i) = d * (2 * models{i}.Label(1) - 1);
end
Here, we pass our test data through the svmpredict function once for each generated model. In your case, this loop iterates 10 times and prints the classification accuracy of the test set for each specific one-vs-rest model. For example, Accuracy = 90% (18/20) (classification) indicates that 18 out of the 20 rows of your test data set were classified correctly by that specific model.
Please note that in multi-class SVM you can't make a decision based on these Accuracy values alone. You need the pred and ac values to make individual or overall estimates, respectively.
double(y == labelSet(i)) changes the multi-class labels to single-class labels by checking which labels in y belong to the specific class that the iterator i points to. It outputs 1 for matched and 0 for unmatched cases, so the resulting label vector contains only 0's and 1's and thus corresponds to a binary (one-vs-rest) SVM problem.
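For example, with a few hypothetical labels:
y = [3; 1; 2; 1];
double(y == 1)   % -> [0; 1; 0; 1]: class 1 becomes the positive class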
decv(:, i) = d * (2 * models{i}.Label(1) - 1) gives the decision values a consistent sign (positive for the class in question, negative for the rest), depending on the label ordering stored in the respective trained model. models{i}.Label(1) holds one of only 2 values, i.e. 0 (unmatched) or 1 (matched), so (2 * models{i}.Label(1) - 1) always evaluates to -1 or +1, flipping the sign of the decision values when necessary.
[tmp,pred] = max(decv, [], 2);
pred = labelSet(pred);
max returns two column vectors: the first (tmp) contains the maximum decision value in each row, and the second (pred) the corresponding column index, which is the class index. Since we are only interested in the class index, we discard tmp.
ac = sum(y==pred) / size(x, 1);
Finally, we calculate ac by counting how many predicted labels match the input test labels and dividing that sum by the number of test samples, size(x, 1).
In your case, ac = 0.6 means that 12 of the 20 test labels match the predicted labels, and the other 8 were predicted incorrectly.
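To see how pred and ac are used together in practice, a minimal sketch (assuming testY holds integer labels 1..10 and the Statistics Toolbox is available for confusionmat):
[pred, ac, decv] = ovrpredict(testY, TestSet, model);
overallAcc = 100 * mean(pred == testY)      % the same quantity as 100 * ac
C = confusionmat(testY, pred);              % 10-by-10 confusion matrix
perClassAcc = 100 * diag(C) ./ sum(C, 2)    % true per-class accuracies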
I hope this answers your question.

Find median value of the largest clump of similar values in an array in the most computationally efficient manner

Sorry for the long title, but that about sums it up.
I am looking to find the median value of the largest clump of similar values in an array in the most computationally efficient manner.
for example:
H = [99,100,101,102,103,180,181,182,5,250,17]
I would be looking for the 101.
The array is not sorted, I just typed it in the above order for easier understanding.
The array is of a constant length and you can always assume there will be at least one clump of similar values.
What I have been doing so far is basically computing the standard deviation with one of the values removed, finding the value whose removal gives the largest reduction in the standard deviation, and repeating that over the elements of the array, which is terribly inefficient.
% Greedy pruning: repeatedly discard the value whose removal
% shrinks the standard deviation of the rest the most.
for j = 1:7
    T = inf(1, numel(H));
    for i = 1:numel(H)
        if isnan(H(i)), continue; end  % skip values already discarded
        G = double(H);
        G(i) = NaN;                    % tentatively remove element i
        T(i) = nanstd(G);              % std of the remaining values
    end
    [~, best] = min(T);                % removal with the biggest std reduction
    H(best) = NaN;                     % discard it for good
end
x = find(H == max(H));
Any thoughts?
This approach bins your data and looks for the bin with the most elements. If your distribution consists of well-separated clusters, this should work reasonably well.
H = [99,100,101,102,103,180,181,182,5,250,17];
nbins = length(H);         % <-- set # of bins here
[v, bins] = hist(H, nbins);
[vm, im] = max(v);         % find the max in the histogram
bl = bins(2) - bins(1);    % bin size
bm = bins(im);             % position of the bin with the max count
ifb = find(abs(H - bm) < bl/2)   % elements within that bin
median(H(ifb))                   % median of those elements
Output:
ifb = 1 2 3 4 5
H(ifb) = 99 100 101 102 103
median = 101
The more challenging parameters to set are the number of bins and the size of the region to look at around the most populated bin. In the example you provided neither is critical: you could set the number of bins to 3 (instead of length(H)) and it would still work. Using length(H) as the number of bins is in fact a little extreme and probably not a good general choice; a better choice is somewhere between that number and the expected number of clusters.
It may help for certain distributions to change bl within the find expression to a value you judge better in advance.
I should also note that there are clustering methods (e.g. kmeans) that may work better, though perhaps less efficiently. For instance, this is the output of [H' kmeans(H',4)]:
99 2
100 2
101 2
102 2
103 2
180 3
181 3
182 3
5 4
250 3
17 1
In this case I decided in advance to attempt grouping into 4 clusters.
Using kmeans you can get an answer as follows:
nbin = 4;
km = kmeans(H', nbin);
[mv, iv] = max(histc(km, 1:nbin));   % iv is the label of the most populated cluster
H(km == iv)                          % members of that cluster
median(H(km == iv))
Notice however that kmeans does not necessarily return the same value every time it is run, so you might need to average over a few iterations.
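One way to reduce that run-to-run variability, assuming the Statistics Toolbox kmeans, is its 'Replicates' option, which reruns the clustering with different random initialisations and keeps the best solution:
km = kmeans(H', nbin, 'Replicates', 10);   % best of 10 random starts
[~, iv] = max(histc(km, 1:nbin));
median(H(km == iv))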
I timed the two methods and found that kmeans takes roughly 10× longer. However, it is more robust, since the bin sizes adapt to your problem and do not need to be set beforehand (only the number of bins does).

How to count matches in several matrices?

I am doing a dichotomous study and have to count how many times a condition occurs.
The study is based on two kinds of matrices: ones with forecasts and others with analysed data.
Whenever the condition is satisfied in both the forecast and the analysis matrices, we add 1 to a counter. This process is repeated for points distributed on a grid.
Are there any functions in MATLAB that help with this kind of counting, or any script that supports this procedure?
Thanks guys!
EDIT:
The case concerns registered and forecasted precipitation. When both exceed a threshold, I consider it a hit. I have Europe divided into several grid points, and I have to count how many times the forecast is correct. I also have 50 forecasts for each year, so the result (hit/no hit) must be accumulated over the forecasts.
I've been trying with the count and sum functions, but they reduce the spatial dimension of the matrices.
It's difficult to tell exactly what you are trying to do but the following may help.
forecasted = [40 10 50 0 15];
registered = [ 0 15 30 0 10];
mismatch = abs(forecasted - registered);   % element-wise forecast error
maxDelta = 10;                             % tolerance for calling a forecast correct
forecastCorrect = mismatch <= maxDelta     % logical hit/no-hit per point
totalCorrectForecasts = sum(forecastCorrect)
Results:
forecastCorrect =
0 1 0 1 1
totalCorrectForecasts =
3
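For the gridded, cumulative case described in the EDIT, the same idea extends by stacking the 50 forecasts along a third dimension and summing over it, which preserves the spatial grid. A minimal sketch, assuming hypothetical nLat-by-nLon-by-50 arrays F (forecasts) and A (analyses) and a precipitation threshold thr:
hitMask = (F > thr) & (A > thr);   % hit wherever both exceed the threshold
hitCount = sum(hitMask, 3);        % accumulate over forecasts; the grid stays intact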