Calculating accuracy for multi-class classification - classification

Consider a three class classification problem with the following confusion matrix.
cm_matrix =
predict_class1 predict_class2 predict_class3
______________ ______________ ______________
Actual_class1 2000 0 0
Actual_class2 34 1966 0
Actual_class3 0 0 2000
Multi-Class Confusion Matrix Output
TruePositive FalsePositive FalseNegative TrueNegative
____________ _____________ _____________ ____________
Actual_class1 2000 34 0 3966
Actual_class2 1966 0 34 4000
Actual_class3 2000 0 0 4000
The formula that I have used are:
Accuracy Of Each class=(TP ./total instances of that class)
( formula based on an answer here: Individual class accuracy calculation confusion)
Sensitivity=TP./TP+FN ;
The implementation of it in Matlab is:
acc_1 = 100*(cm_matrix(1,1))/sum(cm_matrix(1,:)) = 100*(2000)/(2000+0+0) = 100
acc_2 = 100*(cm_matrix(2,2))/sum(cm_matrix(2,:)) = 100*(1966)/(34+1966+0) = 98.3
acc_3 = 100*(cm_matrix(3,3))/sum(cm_matrix(3,:)) = 100*(2000)/(0+0+2000) = 100
sensitivity_1 = 2000/(2000+0)=1 = acc_1
sensitivity_2 = 1966/(1966+34) = 98.3 = acc_2
sensitivity_3 = 2000/2000 = 1 = acc_3
Question1) Is my formula for Accuracy of each class correct? For calculating accuracy of each individual class, say for positive class I should take the TP in the numerator. Similarly, for accuracy of only the negative class, I should consider TN in the numerator in the formula for accuracy. Is the same formula applicable to binary classification? Is my implementation of it correct?
Question2) Is my formula for sensitivity correct? Then how come I am getting same answer as individual class accuracies?

Question1) Is my formula for Accuracy of each class correct?
No, the formula you're using is for the Sensitivity (Recall). See below.
For calculating accuracy of each individual class, say for positive class I should take the TP in the numerator. Similarly, for accuracy of only the negative class, I should consider TN in the numerator in the formula for accuracy. Is the same formula applicable to binary classification? Is my implementation of it correct?
Accuracy is the ratio of the number of correctly classified instances to the total number of instances. TN, or the number of instances correctly identified as not being in a class, are correctly classified instances, too. You cannot simply leave them out.
Accuracy is also normally only used for evaluating the entire classifier for all classes, not individual classes. You can, however, generalize the accuracy formula to handle individual classes, as done here for computing the average classification accuracy for a multiclass classifier. (See also the referenced article.)
The formula they use for each class is:
As you can see, it is identical to the usual formula for accuracy, but we only take into account the individual class's TP and TN scores (the denominator is still the total number of observations). Applying this to your data set, we get:
acc_1 = (2000+3966)/(2000+34+0+3966) = 0.99433
acc_2 = (1966+4000)/(1966+0+34+4000) = 0.99433
acc_3 = (2000+4000)/(2000+0+0+4000) = 1.00000
This at least makes more intuitive sense, since the first two classes had mis-classified instances and the third did not. Whether these measures are at all useful is another question.
Question2) Is my formula for sensitivity correct?
Yes, Sensitivity is given as:
TP / TP+FN
which is the ratio of the instances correctly identified as being in this class to the total number of instances in the class. In a binary classifier, you are by default calculating the sensitivity for the positive class. The sensitivity for the negative class is the error rate (also called the miss rate or false negative rate in the wikipedia article) and is simply:
FN / TP+FN === 1 - Sensitivity
FN is nothing more than the TP for the negative class! (The meaning of TP is likewise reversed.) So it is natural to extend this to all classes as you have done.
Then how come I am getting same answer as individual class accuracies?
Because you're using the same formula for both.
Look at your confusion matrix:
cm_matrix =
predict_class1 predict_class2 predict_class3
______________ ______________ ______________
Actual_class1 2000 0 0
Actual_class2 34 1966 0
Actual_class3 0 0 2000
TP for class 1 is obviously 2000
cm_matrix(1,1)
FN is the sum of the other two columns in that row. Therefore, TP+FN is the sum of row 1
sum(cm_matrix(1,:)
That's exactly the formula you used for the accuracy.
acc_1 = 100*(cm_matrix(1,1))/sum(cm_matrix(1,:)) = 100*(2000)/(2000+0+0) = 100

Answer to question 1. It seems that accuracy is used only in binary classification, check this link.
You refer to an answer on this site, but it concerns also a binary classification (i.e. classification into 2 classes only). You seem to have more than two classes, and in this case you should try something else, or a one-versus-all classification for each class (for each class, parse prediction for class_n and non_class_n).
Answer to question 2. Same issue, this measure is appropriate for binary classification which is not your case.
The formula for sensitivity is:
TP./(TP + FN)
The formula for accuracy is:
(TP)./(TP+FN+FP+TN)
See the documentation here.
UPDATE
And if you wish to use the confusion matrix, you have:
TP on the diagonal, at the level of the class
FN the sum of all the values in the column of the class. In the function getvalues start counting lines from the declaration of the function and check lines 30 and 31:
TP(i)=c_matrix(i,i);
FN(i)=sum(c_matrix(i,:))-c_matrix(i,i);
FP(i)=sum(c_matrix(:,i))-c_matrix(i,i);
TN(i)=sum(c_matrix(:))-TP(i)-FP(i)-FN(i);
If you apply the accuracy formula, you obtain, after calculating and simplifying :
accuracy = c_matrix(i,i) / sum(c_matrix(:))
For the sensitivity you obtain, after simplifying:
sensitivity = c_matrix(i,i) / sum(c_matrix(i,:))
If you want to understand better, just check the links I sent you.

Related

How can I extract the random effects information from lmm and lqmm models using multiple imputed data?

Continuing from this question: Is it possible to use lqmm with a mira object?
I have tried to get the random effects for the mixed models (lmm and lqmm), and it has been hard.
library(lqmm)
library(mice)
library(lme4)
library(mitml)
summary(airquality)
imputed<-mice(airquality,m=5)
summary(imputed)
fit1<-lqmm(Ozone~Solar.R+Wind+Temp+Day,random=~1,
tau=0.5, group= Month, data=airquality,na.action=na.omit)
fit1
summary(fit1)
fit2<-with(imputed, lqmm(Ozone~Solar.R+Wind+Temp+Day,random=~1,
tau=0.5, group= Month, na.action=na.omit))
#did not work because it does not recognize a data frame
fit2 <- with(imputed,
lqmm(Ozone ~ Solar.R + Wind + Temp + Day,
data = data.frame(mget(ls())),
random = ~1, tau = 0.5, group = Month, na.action = na.omit))
tidy.lqmm <- function(x, conf.int = FALSE, conf.level = 0.95, ...) {
broom:::as_tidy_tibble(data.frame(
estimate = coef(x),
std.error = sqrt(
diag(summary(x, covariance = TRUE,
R = 50)$Cov[names(coef(x)),
names(coef(x))]))))
}
glance.lqmm <- function(x, ...) {
broom:::as_glance_tibble(
logLik = as.numeric(stats::logLik(x)),
df.residual = summary(x)$rdf,
nobs = stats::nobs(x),
na_types = "rii")
}
pool(fit2)
summary(pool(fit2))
So far so good, but I want to build a table that resembles the sJPlot::tab_model function on lmer objects. For this, I need to extract the random effects from the pooled estimates. This is where I am lost. I have not been able to do it with the linear mixed model, much less with the linear quantile mixed model.
Extract the Random-effects from an LMM and LQMM.
##LMM
fit3 <- with(imputed,
lmer(Ozone ~ Solar.R + Wind + Temp + Day+ (1|Month)))
library(broom.mixed)
summary(pool(fit3))
library(sjPlot)
tab_model(fit3$analyses) #this will give me the results for each of the 5 lmer, but I need the pooled ones
pool(fit3)$glanced # also this will five me the random effects for each of the 5 lmer
stargazer(fit3$analyses,type="text")#this would give the AIC, LL, and BYC
#something like
tab_model(pool(fit3$analyses))
"Error ....
Could not access model information."
#or
tab_model(pool(fit3)) #A data frame is not a valid object for this function.
##LQMM
tab_model(fit2$analyses)#gives me less information
pool(fit2)$glanced #gives the 5 models logLik, df.residual and nobs individually
UPDATE: I could find a formula to extract the random effects of the LMM thanks to Can I pool imputed random effect model estimates using the mi package?. However, does not work for LQMM
testEstimates(as.mitml.result(fit3), extra.pars = T)$extra.pars
testEstimates(as.mitml.result(fit2), extra.pars = T)$extra.pars
Error in UseMethod("vcov") : no applicable method for 'vcov'
applied to an object of class "lqmm"
The Pooled Random Effects information I need for both lmer and lqmm with sjPlot::tab_model and stargazer:
sigma^2: Pooled Residual Variance.
tau_00Month: Pooled Variance
explained by the month (between month differences).
ICC: Pooled
sigma^2/ (sigma^2+tau_00Month).
N_Month: n. Months used in the
regression.
Marginal R2/ Conditional R2: The marginal R-squared
considers only the variance of the fixed effects, while the
conditional R-squared takes both the fixed and random effects into
account.
Residual Scale Parameter: also would appreciate it if somebody clarifies what this is calculating.
Log-Likelihood:
Akaike Inf. Crit.:

Small bug in MATLAB R2017B LogLikelihood after fitnlm?

Background: I am working on a problem similar to the nonlinear logistic regression described in the link [1] (my problem is more complicated, but link [1] is enough for the next sections of this post). Comparing my results with those obtained in parallel with a R package, I got similar results for the coefficients, but (very approximately) an opposite logLikelihood.
Hypothesis: The logLikelihood given by fitnlm in Matlab is in fact the negative LogLikelihood. (Note that this impairs consequently the BIC and AIC computation by Matlab)
Reasonning: in [1], the same problem is solved through two different approaches. ML-approach/ By defining the negative LogLikelihood and making an optimization with fminsearch. GLS-approach/ By using fitnlm.
The negative LogLikelihood after the ML-approach is:380
The negative LogLikelihood after the GLS-approach is:-406
I imagine the second one should be at least multiplied by (-1)?
Questions: Did I miss something? Is the (-1) coefficient enough, or would this simple correction not be enough?
Self-contained code:
%copy-pasting code from [1]
myf = #(beta,x) beta(1)*x./(beta(2) + x);
mymodelfun = #(beta,x) 1./(1 + exp(-myf(beta,x)));
rng(300,'twister');
x = linspace(-1,1,200)';
beta = [10;2];
beta0=[3;3];
mu = mymodelfun(beta,x);
n = 50;
z = binornd(n,mu);
y = z./n;
%ML Approach
mynegloglik = #(beta) -sum(log(binopdf(z,n,mymodelfun(beta,x))));
opts = optimset('fminsearch');
opts.MaxFunEvals = Inf;
opts.MaxIter = 10000;
betaHatML = fminsearch(mynegloglik,beta0,opts)
neglogLH_MLApproach = mynegloglik(betaHatML);
%GLS Approach
wfun = #(xx) n./(xx.*(1-xx));
nlm = fitnlm(x,y,mymodelfun,beta0,'Weights',wfun)
neglogLH_GLSApproach = - nlm.LogLikelihood;
Source:
[1] https://uk.mathworks.com/help/stats/examples/nonlinear-logistic-regression.html
This answer (now) only details which code is used. Please see Tom Lane's answer below for a substantive answer.
Basically, fitnlm.m is a call to NonLinearModel.fit.
When opening NonLinearModel.m, one gets in line 1209:
model.LogLikelihood = getlogLikelihood(model);
getlogLikelihood is itself described between lines 1234-1251.
For instance:
function L = getlogLikelihood(model)
(...)
L = -(model.DFE + model.NumObservations*log(2*pi) + (...) )/2;
(...)
Please also not that this notably impacts ModelCriterion.AIC and ModelCriterion.BIC, as they are computed using model.LogLikelihood ("thinking" it is the logLikelihood).
To get the corresponding formula for BIC/AIC/..., type:
edit classreg.regr.modelutils.modelcriterion
this is Tom from MathWorks. Take another look at the formula quoted:
L = -(model.DFE + model.NumObservations*log(2*pi) + (...) )/2;
Remember the normal distribution has a factor (1/sqrt(2*pi)), so taking logs of that gives us -log(2*pi)/2. So the minus sign comes from that and it is part of the log likelihood. The property value is not the negative log likelihood.
One reason for the difference in the two log likelihood values is that the "ML approach" value is computing something based on the discrete probabilities from the binomial distribution. Those are all between 0 and 1, and they add up to 1. The "GLS approach" is computing something based on the probability density of the continuous normal distribution. In this example, the standard deviation of the residuals is about 0.0462. That leads to density values that are much higher than 1 at the peak. So the two things are not really comparable. You would need to convert the normal values to probabilities on the same discrete intervals that correspond to individual outcomes from the binomial distribution.

How to setup fitensemble for binary imbalanced datasets?

I've been trying to test matlab's ensemble methods with randomly generated imbalance dataset and no matter what I set the prior/cost/weight parameters the method never predicts close to the label ratio.
Below is an example of the tests I did.
prob = 0.9; %set label ratio to 90% 1 and 10% 0
y = (rand(100,1) < prob);
X = rand(100,3); %generate random training data with three features
X_test = rand(100,3); %generate random test data
%A few parameter sets I've tested
B = TreeBagger(100,X,y);
B2 = TreeBagger(100,X,y,'Prior','Empirical');
B3 = TreeBagger(100,X,y,'Cost',[0,9;1,0]);
B4 = TreeBagger(100,X,y,'Cost',[0,1;9,0]);
B5 = fitensemble(X,y,'RUSBoost', 20, 'Tree', 'Prior', 'Empirical');
Here I tried to predict the trained classifiers on random test data. My assumption is that since the classifier is trained on random data, it should on average predict close to the dataset ratio (1/9) if it takes the prior into account. But each of the classifiers predicted 98-100% in favor of '1' instead of ~90% that I am looking for.
l1 = predict(B,X_test);
l2 = predict(B2,X_test);
l3 = predict(B3,X_test);
l4 = predict(B4,X_test);
l5 = predict(B5,X_test);
How do I get the ensemble method to take the prior into account? Or is there a fundamental misunderstanding on my part?
I don't think it can work like you think.
Thats because as i understood your training and test data is random. So how should your classifier find any relation between features and your label?
lets take the accuracy as a mesurement and make an example.
class A: 900 datarows.
class B: 100 datarows.
Classify 100% as A:
0.9*/(0.1+0.9) = 0.9
gets you 90% Accuracy.
if your classifier does something different, means trying to classify some datarows to B he will by chance get 9 times more wrongly classified A datarows
Lets say 20 B datarows are correctly classified you will get around 180 wrong a classified A datarows
B: 20 correct, 80 incorrect
A: 720 correct, 180 wrong
740/(740+260) = 0.74
Accuracy goes down to 74 %. And thats not something your classifying algorithms want.
Long story short: Your classifier will allways tend to classify allmost 100% class A if you dont get any information into your Data

How do I use eps in this situation where I am taking differences of cumulative sums

First, the data:
orig = reshape([0.0000000000000000 0.3480000000000000 0.7570000000000000 1.3009999999999999 2.8300000000000001 4.7519999999999998 5.2660000000000000 5.8120000000000003 14.3360000000000000 15.3390000000000000 ],[10 1])
change = reshape([0.0000000000000000 0.3480000000000000 0.0000000000000000 0.9530000000000000 1.5290000000000001 1.9219999999999997 0.5140000000000002 0.5460000000000003 0.0000000000000000 9.5270000000000010 ],[10 1])
change = cumsum(change)
orig is a vector of seconds elapsed. change is a vector derived by taking differences between (some) elements of orig. The cumulative sum of change has some elements actually equal to the corresponding element in orig.
However, due to precision issues:
diff = orig - change
gives
diff =
0
0
0.409
0
0
0
0
0
8.524
-1.77635683940025e-15
It seems that if I run the following command:
diff(abs(diff) <= eps(orig)) = 0
then this sets entries which should be zero, but are not due to precision issues, to be zero.
My question is, is this the correct way to do it? Why is the comparison <= instead of <? Should the statement be:
diff(abs(diff) < k*eps(orig)) = 0
for some k > 1 to give some tolerance? If so, how would one pick k?
In case it is necessary to know how change is derived from orig, the following alternate example also shows this behaviour:
orig = reshape([0.0000000000000000 0.3480000000000000 0.7570000000000000 1.3009999999999999 2.8300000000000001 4.7519999999999998 5.2660000000000000 5.8120000000000003 14.3360000000000000 15.3390000000000000 ],[10 1])
change = orig - [0; orig(1:end-1)]
change = cumsum(change)
diff = orig - change
The following statement will be true only if the "almost zero" happens because 1 bit is offseted.
abs(diff) <= eps(orig)
1 bit is a ridiculously high precision to ask, a precision that most likely you can not achieve. Generally, you need to define your treshold yourself, such as
abs(diff) <= 1e-12
You also ask how to choose this value. Answer: there is no way we can tell you that. Its algorithm, application, unit, computer, [...] specific.
You are computing distance between particles? Maybe you need a smaller tolerance. You are doing economic profit calculus? 1e-12 is then a decimal you won't get in cash, for sure. Use 1e-4 instead. Or are you using an algorithm that does numerical approximations? Then you need a higher tolerance. How much tolerance you are OK with is, and will always be, a user choice.
Note: you need to be aware of the types you are using to set this minimum threshold right. MATLAB uses double as default, but if you are using other types, them this threshold is too strict. As an alternative, you can use
abs(diff) <= 100*eps(class(diff))
If your data type is not fixed/known.

MATLAB: Naive Bayes with Univariate Gaussian

I am trying to implement Naive Bayes Classifier using a dataset published by UCI machine learning team. I am new to machine learning and trying to understand techniques to use for my work related problems, so I thought it's better to get the theory understood first.
I am using pima dataset (Link to Data - UCI-ML), and my goal is to build Naive Bayes Univariate Gaussian Classifier for K class problem (Data is only there for K=2). I have done splitting data, and calculate the mean for each class, standard deviation, priors for each class, but after this I am kind of stuck because I am not sure what and how I should be doing after this. I have a feeling that I should be calculating posterior probability,
Here is my code, I am using percent as a vector, because I want to see the behavior as I increase the training data size from 80:20 split. Basically if you pass [10 20 30 40] it will take that percentage from 80:20 split, and use 10% of 80% as training.
function[classMean] = naivebayes(file, iter, percent)
dm = load(file);
for i=1:iter
idx = randperm(size(dm.data,1))
%Using same idx for data and labels
shuffledMatrix_data = dm.data(idx,:);
shuffledMatrix_label = dm.labels(idx,:);
percent_data_80 = round((0.8) * length(shuffledMatrix_data));
%Doing 80-20 split
train = shuffledMatrix_data(1:percent_data_80,:);
test = shuffledMatrix_data(percent_data_80+1:length(shuffledMatrix_data),:);
train_labels = shuffledMatrix_label(1:percent_data_80,:)
test_labels = shuffledMatrix_data(percent_data_80+1:length(shuffledMatrix_data),:);
%Getting the array of percents
for pRows = 1:length(percent)
percentOfRows = round((percent(pRows)/100) * length(train));
new_train = train(1:percentOfRows,:)
new_trin_label = shuffledMatrix_label(1:percentOfRows)
%get unique labels in training
numClasses = size(unique(new_trin_label),1)
classMean = zeros(numClasses,size(new_train,2));
for kclass=1:numClasses
classMean(kclass,:) = mean(new_train(new_trin_label == kclass,:))
std(new_train(new_trin_label == kclass,:))
priorClassforK = length(new_train(new_trin_label == kclass))/length(new_train)
priorClassforK_1 = 1 - priorClassforK
end
end
end
end
First, compute the probability of evey class label based on frequency counts. For a given sample of data and a given class in your data set, you compute the probability of evey feature. After that, multiply the conditional probability for all features in the sample by each other and by the probability of the considered class label. Finally, compare values of all class labels and you choose the label of the class with the maximum probability (Bayes classification rule).
For computing conditonal probability, you can simply use the Normal distribution function.