How to interpret results from Sparkling Water's GBM algorithm on a classification task - scala

I'm new to Sparkling Water and machine learning.
I've built a GBM model with two datasets that I split manually into train and test.
The task is classification with all-numeric attributes (the response column is converted to enum type). The code is in Scala.
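// GBM builder and its parameter class from H2O
import _root_.hex.tree.gbm.GBM
import _root_.hex.tree.gbm.GBMModel.GBMParameters

val gbmParams = new GBMParameters()
gbmParams._train = train                  // training frame
gbmParams._valid = test                   // validation frame (the manual test split)
gbmParams._response_column = "response"   // enum column, so GBM does classification
gbmParams._ntrees = 50
gbmParams._max_depth = 6
val gbm = new GBM(gbmParams)
val gbmModel = gbm.trainModel.get         // blocks until training has finished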
In the model summary I get four different confusion matrices: one on the train data and one on the test data from before the individual trees are built, and one on each dataset after. The first two have 1 as the predicted value in every case. This is the one for the test data:
CM: Confusion Matrix (vertical: actual; across: predicted):
          0     1   Error       Rate
     0    0   500  1,0000  500 / 500
     1    0   300  0,0000    0 / 300
Totals    0   800  0,6250  500 / 800
The second confusion matrix is similar, with 1 as the predicted value in every case, but for the train data. The third and fourth confusion matrices, produced after the trees are built, give normal results with values distributed across all cells of the matrix.
I need to interpret the first and second matrices. Why is Sparkling Water producing them? Can I work with these results, or are they just an intermediate step?
Thank you.

Interpreting the matrix given by:
CM: Confusion Matrix (vertical: actual; across: predicted):
          0     1   Error       Rate
     0    0   500  1,0000  500 / 500
     1    0   300  0,0000    0 / 300
Totals    0   800  0,6250  500 / 800
The header says rows are actual classes and columns are predicted classes. The Totals row sums each column, so the model predicted 1 for all 800 observations and never predicted 0.
The rows show the actual labels: 500 observations are really 0 (all misclassified) and 300 are really 1 (all correct). That gives an overall error of 500 / 800 = 0.625, or 62.5%.
This tells us two things:
The model behind this matrix assigns every observation to class 1, regardless of its attributes.
The model did a pretty bad job: its error is simply the share of class 0 in the data.
Is it possible that the two initial matrices represent the summary of an untrained model, essentially predicting one class for everything? And the latter two matrices represent the summary of the trained model?
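To check the reading of such a matrix by hand, here is a minimal Python sketch. It is not H2O code, and the function name summarize is made up for illustration. It assumes, as the CM header states, that rows are actual classes and columns are predicted classes.

```python
# Confusion matrix from the question: rows = actual class, columns = predicted class.
cm = [
    [0, 500],  # actual 0: predicted as 0, predicted as 1
    [0, 300],  # actual 1: predicted as 0, predicted as 1
]


def summarize(cm):
    """Return per-class error, predicted totals per class, and overall error."""
    n_classes = len(cm)
    row_totals = [sum(row) for row in cm]  # number of actual members of each class
    predicted_totals = [
        sum(cm[r][c] for r in range(n_classes)) for c in range(n_classes)
    ]  # column sums: how often each class was predicted
    per_class_error = [
        (row_totals[k] - cm[k][k]) / row_totals[k] if row_totals[k] else 0.0
        for k in range(n_classes)
    ]
    correct = sum(cm[k][k] for k in range(n_classes))
    total = sum(row_totals)
    return per_class_error, predicted_totals, (total - correct) / total


per_class_error, predicted_totals, overall_error = summarize(cm)
print(per_class_error)   # [1.0, 0.0]
print(predicted_totals)  # [0, 800]
print(overall_error)     # 0.625
```

The column sums reproduce the Totals row (0 and 800), and the per-row errors reproduce the Error column.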

Related

Individual class accuracy calculation confusion

The binary classification result below was obtained on 1500 data points in total. Of these,
1473 are labelled 0 and
the remaining 27 are labelled 1.
As the confusion matrix shows, only 1 of the 27 data points belonging to class 1 was misclassified as 0. I then calculated the accuracy for the individual classes and got 98.2% for the class labelled 0 and 1.7333% for the class labelled 1. Is this calculation correct? I am not sure. I did get a pretty good classification for class 1, so why is its accuracy so low?
I expected the individual class accuracies to be 100% for class 0 and around 98% for class 1.
Does one misclassification really reduce the accuracy of class 1 by that much? This is how I calculated the individual class accuracies in MATLAB:
cmMatrix =
1473 0
1 26
acc_class0 = 100*(cmMatrix(1,1))/1500;
acc_class1= 100*(cmMatrix(2,2))/1500;
If everything had been classified correctly, your computation would give an accuracy for class 1 of 27/1500 = 0.018, i.e. 1.8%. This is obviously wrong. The overall accuracy is 1499/1500, but per-class accuracy cannot use 1500 as the denominator. For class 1, 27 is the maximum number of correctly classified elements, so it should be the denominator; in general, divide by the row sum of each class. This gives 100% for class 0 and 26/27, about 96.3%, for class 1:
acc_class0 = 100*cmMatrix(1,1)/sum(cmMatrix(1,:));
acc_class1 = 100*cmMatrix(2,2)/sum(cmMatrix(2,:));
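The same calculation as a minimal Python sketch, assuming rows of the matrix are the true classes; the function names are chosen for illustration only. It also shows the overall accuracy for comparison.

```python
# Confusion matrix from the question: rows = true class, columns = predicted class.
cm = [
    [1473, 0],
    [1, 26],
]


def per_class_accuracy(cm):
    """Percentage of each true class that was classified correctly (row-wise)."""
    return [100.0 * cm[k][k] / sum(cm[k]) for k in range(len(cm))]


def overall_accuracy(cm):
    """Percentage of all data points that were classified correctly."""
    correct = sum(cm[k][k] for k in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return 100.0 * correct / total


class_acc = per_class_accuracy(cm)
total_acc = overall_accuracy(cm)
print([round(a, 2) for a in class_acc])  # [100.0, 96.3]
print(round(total_acc, 2))               # 99.93
```

Dividing by the row sum rather than by 1500 is what keeps a small class from looking bad just because it is small.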

Undefined F1 scores in multiclass classifications when model does not predict one class

I am trying to use F1 scores for model selection in multiclass classification.
I am calculating them class-wise and averaging over the classes:
(F1(class1)+F1(class2)+F1(class3))/3 = F1(total)
However, in some cases I get NaN values for the F1 score. Here is an example:
Let true_label = [1 1 1 2 2 2 3 3 3] and pred_label = [2 2 2 2 2 2 3 3 3].
Then the confusion matrix looks like:
C =[0 3 0; 0 3 0; 0 0 3]
Which means when I calculate the precision (to calculate the F1 score) for the first class, I obtain: 0/(0+0+0), which is not defined or NaN.
Firstly, am I making a mistake in calculating F1 scores or precisions here?
Secondly, how should I treat these cases in model selection? Ignore them or should I just set the F1 scores for this class to 0 (reducing the total F1 score for this model).
Any help would be greatly appreciated!
You need to avoid the division by zero for the precision in order to report meaningful results. You might find this answer useful, in which you explicitly report a poor outcome. Additionally, this implementation suggests an alternate way to differentiate in your reporting between good and poor outcomes.
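One way to avoid the division by zero is sketched below in Python. It assumes the convention that a class with an undefined precision, recall or F1 gets a score of 0, so a model that never predicts a class is penalised in the average. That convention is a choice, not the only option, and the function name macro_f1 is made up for this example.

```python
def macro_f1(true_labels, pred_labels):
    """Macro-averaged F1; undefined precision, recall or F1 is counted as 0.0."""
    classes = sorted(set(true_labels) | set(pred_labels))
    pairs = list(zip(true_labels, pred_labels))
    scores = {}
    for c in classes:
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        # Guard each ratio: an empty denominator means "undefined", scored as 0.0 here.
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        denom = precision + recall
        scores[c] = 2 * precision * recall / denom if denom else 0.0
    return sum(scores.values()) / len(classes), scores


true_label = [1, 1, 1, 2, 2, 2, 3, 3, 3]
pred_label = [2, 2, 2, 2, 2, 2, 3, 3, 3]
f1_total, f1_by_class = macro_f1(true_label, pred_label)
print(f1_by_class)  # class 1 scores 0.0, class 2 scores 2/3, class 3 scores 1.0
print(f1_total)     # (0 + 2/3 + 1) / 3 = 5/9
```

With this convention the example from the question gets a finite total of 5/9 instead of NaN, so models can still be ranked.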

Neural Networks for integer values

I have approximately 5000 integer vectors (=SIZE) that look like:
[1 0 4 2 0 1 3 ...]
They all have the same length N=32, and their values range from 0 to 4, or more generally over [0 MAX].
I created a NN that takes these vectors as input and outputs a binary array that selects one of the desired outputs (number of possible outputs = M):
for instance [0 1 0 0 ...0] => 2nd output. array_length = M
I used a Multi Layer Perceptron in Neuroph with those integer values, but it did not converge.
So I am guessing the problem is either the integer values or the use of an MLP with 3 layers: input, hidden and output.
Can you advise me on the network structure? Which type of NN is suitable? Should I remodel the input and output to simplify the learning process? I have been thinking about Gray encoding for the integer inputs.

How to count matches in several matrices?

I am doing a dichotomous study and have to count how many times a condition takes place.
The study is based on two kinds of matrices: ones with forecasts and others with analyzed data.
In both the forecast and the analysis matrices, whenever a condition is satisfied we add 1 to a counter. This process is repeated for points distributed on a grid.
Are there any functions in MATLAB that help me with counting or any script that supports this procedure?
Thanks guys!
EDIT:
The case is about registered and forecasted precipitation. When both exceed a threshold I count it as a hit. I have Europe divided into several grid points, and I have to count how many times the forecast is correct. I also have 50 forecasts for each year, so the result (hit/no hit) must be accumulated.
I've been trying the count and sum functions, but they reduce the spatial dimension of the matrices.
It's difficult to tell exactly what you are trying to do, but the following may help. It counts a forecast as correct when it is within maxDelta of the registered value.
forecasted = [ 40 10 50 0 15];
registered = [ 0 15 30 0 10];
mismatch = abs( forecasted - registered );
maxDelta = 10;
forecastCorrect = mismatch <= maxDelta
totalCorrectForecasts = sum(forecastCorrect)
Results:
forecastCorrect =
0 1 0 1 1
totalCorrectForecasts =
3
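For the hit definition given in the EDIT (both forecast and registered value exceed a threshold), the count can be accumulated per grid point so the spatial dimensions are kept. Below is a minimal Python sketch of that idea. The threshold, the grid values and the function name count_hits are all hypothetical, and only two forecasts on a 2x3 grid are used instead of 50.

```python
def count_hits(forecasts, observations, threshold):
    """Accumulate, per grid point, how often forecast and observation both exceed threshold."""
    rows = len(forecasts[0])
    cols = len(forecasts[0][0])
    hits = [[0] * cols for _ in range(rows)]  # same shape as one grid
    for forecast, observed in zip(forecasts, observations):
        for i in range(rows):
            for j in range(cols):
                if forecast[i][j] > threshold and observed[i][j] > threshold:
                    hits[i][j] += 1
    return hits


# Hypothetical data: two forecast grids and the matching registered grids.
forecasts = [
    [[40, 10, 50], [0, 15, 12]],
    [[20, 30, 5], [11, 9, 25]],
]
observations = [
    [[0, 15, 30], [0, 10, 14]],
    [[25, 12, 8], [20, 9, 5]],
]
hits = count_hits(forecasts, observations, threshold=10)
print(hits)  # [[1, 1, 1], [1, 0, 1]]
```

The result has the shape of a single grid, with each cell holding the cumulative number of hits at that point.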

Training a Decision Tree in MATLAB over binary train data

I want to train a decision tree in MATLAB for binary data. Here is a sample of data I use.
traindata <87*239> [array of data with 239 features]
1 0 1 0 0 0 1 1 0 0 1 0 1 0 1 1 1 1 1 0 0 0 1 1 0 ... [till 239]
1 1 1 0 0 0 1 0 0 0 1 0 1 0 1 1 0 0 1 0 0 0 1 0 1 ... [till 239]
....
The thing is that this data corresponds to a form which has only yes/no options. The outcome of the form is also binary and indicates whether a patient has some medical disorder or not. We used a classification tree, and the classifier shows us double-valued thresholds. For example, it branches at the first node on whether x137 is bigger than 0.75 or not. Since 0.75 does not occur in our data and has no yes/no meaning, we want a decision tree that suits our data better: one trained on boolean variables rather than doubles, which understands that the data is not continuous and, instead of the representation above, shows whether x137 is yes or no (1 or 0). Can someone help me with this? I would also appreciate a way to map our data to double variables and features if a boolean decision tree is not applicable. I am currently using classregtree in MATLAB with <87*237> as train and <87*1> as results.
classregtree has an optional input parameter categorical. Using this option, you can pass in a vector indicating which of your input variables are categorical (in your case, this vector would be 1x239, all ones). The decision tree should then contain yes/no decisions rather than numerical thresholds.
From the help of classregtree:
t = classregtree(X,y) creates a decision tree t for predicting the response y as a function of the predictors in the columns of X. X is an n-by-m matrix of predictor values. If y is a vector of n response values, classregtree performs regression. If y is a categorical variable, character array, or cell array of strings, classregtree performs classification.
What's the type of y in your case? It seems that classregtree is doing regression in your case but you want classification. So, y should be a categorical variable.
EDIT: To make your y categorical, you can try "nominal(y)".