Adaboost weka True positive vs False positive recognition issue - classification

I am using the AdaBoostM1 algorithm in the Weka Experiment Environment with the default setup:
Runs (1-10) -> 10 runs, to provide more statistically significant results
Random Split Result Producer
I use the train percentage to divide the training data from the evaluation data
Now, the problem is with the Weighted average TP and FP results.
I get this:
TP:0.8
FP:0.47
But as far as I am aware, if the TP rate is 0.8, shouldn't the FP rate be around 0.2?
I assume this has something to do with the 10 runs, but even if these values are averaged over the runs, shouldn't the FP rate still be much lower?
Sorry if this is too simple a question, but by my logic this looks like an error in the Weka toolkit. Or am I wrong? Thanks
EDIT:
To avoid asking a new question, and because this is related to the same problem: can anyone explain what the Weighted Avg. values displayed by Weka are?
I have included Atilla's example below: it can be seen that the Weighted Avg. values are not simple averages, e.g. AVG(0.933, 0.422) != 0.77, etc.
Can someone answer what these values actually are?
=== Detailed Accuracy By Class ===

               TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area  Class
               0.933    0.578    0.776      0.933   0.847      0.429  0.844     0.917     tested_negative
               0.422    0.067    0.745      0.422   0.538      0.429  0.844     0.696     tested_positive
Weighted Avg.  0.77     0.416    0.766      0.77    0.749      0.429  0.844     0.847

I ran AdaBoostM1 with default parameters on Weka's diabetes data set and got the following results.
=== Detailed Accuracy By Class ===

               TP Rate  FP Rate  Precision  Recall  F-Measure  MCC    ROC Area  PRC Area  Class
               0.933    0.578    0.776      0.933   0.847      0.429  0.844     0.917     tested_negative
               0.422    0.067    0.745      0.422   0.538      0.429  0.844     0.696     tested_positive
Weighted Avg.  0.77     0.416    0.766      0.77    0.749      0.429  0.844     0.847
Notice that the TP rate and FP rate are reported for each of your class values. Since the class feature in this data set has two (2) values, there are two (2) lines.
Also notice that:
0.933 + 0.067 = 1
0.578 + 0.422 = 1
As you correctly intuited, a TP rate and an FP rate should sum to one (1), but it is the TP rate of one class and the FP rate of the other class that do so. So in your example, I assume you have the following class variable:
target {A,B}
TP Rate FP Rate
0.8 0.47 ..... for A
0.53 0.2 ..... for B
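To make this concrete, and to address the EDIT about the Weighted Avg. row: Weka weighs each class's metric by the number of instances of that class. Here is a small R sketch; the confusion-matrix cell counts below are hypothetical, chosen only to be consistent with the rates in the table above:
cm <- matrix(c(166, 12,    # actual tested_negative (178 hypothetical instances)
                48, 35),   # actual tested_positive (83 hypothetical instances)
             nrow = 2, byrow = TRUE)
n_class <- rowSums(cm)                              # instances per actual class
tp_rate <- diag(cm) / n_class                       # 0.933, 0.422
fp_rate <- (colSums(cm) - diag(cm)) / rev(n_class)  # 0.578, 0.067
tp_rate[1] + fp_rate[2]            # = 1, the cross-class pairing shown above
weighted.mean(tp_rate, n_class)    # 0.77  -> the Weighted Avg. TP rate
weighted.mean(fp_rate, n_class)    # 0.416 -> the Weighted Avg. FP rate
So AVG(0.933, 0.422) is not expected to equal 0.77; the mean weighted by the class counts is what Weka reports.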

mgcv: Difference between s(x, by=cat) and s(cat, bs='re')

What is the difference between adding a by= parameter to a smooth and adding a random effect smooth?
I've tried both, and get different results. E.g.:
library(mgcv)
library(tibble)   # tibble() is not provided by mgcv

set.seed(26)
gam.df <- tibble(y   = rnorm(400),
                 x1  = rnorm(400),
                 cat = factor(rep(1:4, each = 100)))
gam0 <- gam(y ~ s(x1, by=cat), data=gam.df)
summary(gam0)
produces:
15:15:39> summary(gam0)
Family: gaussian
Link function: identity
Formula:
y ~ s(x1, by = cat)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.001275 0.049087 -0.026 0.979
Approximate significance of smooth terms:
edf Ref.df F p-value
s(x1):cat1 1 1 7.437 0.00667 **
s(x1):cat2 1 1 0.047 0.82935
s(x1):cat3 1 1 0.393 0.53099
s(x1):cat4 1 1 0.019 0.89015
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.00968 Deviance explained = 1.96%
GCV = 0.97413 Scale est. = 0.96195 n = 400
On the other hand:
gam1 <- gam(y ~ s(x1) + s(cat, bs='re'), data=gam.df)
summary(gam1)
produces:
15:16:33> summary(gam1)
Family: gaussian
Link function: identity
Formula:
y ~ s(x1) + s(cat, bs = "re")
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0001211 0.0572271 0.002 0.998
Approximate significance of smooth terms:
edf Ref.df F p-value
s(x1) 1.0000 1 2.359 0.125
s(cat) 0.7883 3 0.356 0.256
R-sq.(adj) = 0.00594 Deviance explained = 1.04%
GCV = 0.97236 Scale est. = 0.96558 n = 400
I understand that by= shows the summary by each factor level, but shouldn't the overall results of the model such as R^2 be the same?
The factor-by model, gam0, contains a separate smooth of x1 for each level of cat, but doesn't include anything specifically for the means of y in each group[*], so as written it is mis-specified. Compare this with gam1, which has a single smooth of x1 plus group means for the levels of cat.
Even though you generated random data without any smooth or group-level effects, the gam0 model is potentially a much more complex and flexible model: it contains 4 separate smooths, each using potentially 9 degrees of freedom. Your gam1 has a single smooth of x1 which uses up to 9 degrees of freedom, plus between 0 and 4 degrees of freedom for the random-effect smooth. gam0 is simply exploiting random variation in the data that can be explained a little bit by those extra potential degrees of freedom. You can see this in the adjusted R², which is only marginally higher for gam0 even though it explains roughly twice the deviance that gam1 does (not that either explains a good amount of deviance).
r$> library("gratia")
r$> smooths(gam0)
[1] "s(x1):cat1" "s(x1):cat2" "s(x1):cat3" "s(x1):cat4"
r$> smooths(gam1)
[1] "s(x1)" "s(cat)"
[*] Note that your by model should be
gam0 <- gam(y ~ cat + s(x1, by=cat), data=gam.df)
because the smooths created by s(x1, by=cat) are subject to an identifiability constraint (as there's a constant term, the intercept, in the model). This is a sum-to-zero constraint, which means the individual smooths do not contain the group means. Without the parametric cat term, the smooths are forced to model not only how y changes as a function of x1 in each group but also the magnitude of y in each group, yet the sum-to-zero constraint removes from the span of the basis exactly the constant functions that could capture such magnitude effects.
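As a quick sketch (same simulated data as above; gam0b is just an illustrative name, not from the original post), the corrected specification lets the parametric cat term absorb the group means, leaving the by-smooths to model only the x1 effects:
gam0b <- gam(y ~ cat + s(x1, by = cat), data = gam.df)
summary(gam0b)   # parametric coefficients now include contrasts for the levels of cat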

Calculating accuracy for multi-class classification

Consider a three-class classification problem with the following confusion matrix.
cm_matrix =
                 predict_class1    predict_class2    predict_class3
                 ______________    ______________    ______________
Actual_class1         2000                 0                 0
Actual_class2           34              1966                 0
Actual_class3            0                 0              2000
Multi-Class Confusion Matrix Output
                 TruePositive    FalsePositive    FalseNegative    TrueNegative
                 ____________    _____________    _____________    ____________
Actual_class1        2000              34                0             3966
Actual_class2        1966               0               34             4000
Actual_class3        2000               0                0             4000
The formulas that I have used are:
Accuracy of each class = TP ./ (total instances of that class)
(formula based on an answer here: Individual class accuracy calculation confusion)
Sensitivity = TP ./ (TP + FN)
The Matlab implementation is:
acc_1 = 100*(cm_matrix(1,1))/sum(cm_matrix(1,:)) = 100*(2000)/(2000+0+0) = 100
acc_2 = 100*(cm_matrix(2,2))/sum(cm_matrix(2,:)) = 100*(1966)/(34+1966+0) = 98.3
acc_3 = 100*(cm_matrix(3,3))/sum(cm_matrix(3,:)) = 100*(2000)/(0+0+2000) = 100
sensitivity_1 = 2000/(2000+0) = 1 (i.e. 100%) = acc_1
sensitivity_2 = 1966/(1966+34) = 0.983 (i.e. 98.3%) = acc_2
sensitivity_3 = 2000/2000 = 1 (i.e. 100%) = acc_3
Question 1) Is my formula for the accuracy of each class correct? For calculating the accuracy of an individual class, say the positive class, I should take the TP in the numerator. Similarly, for the accuracy of only the negative class, I should take the TN in the numerator of the accuracy formula. Is the same formula applicable to binary classification? Is my implementation correct?
Question 2) Is my formula for sensitivity correct? Then how come I am getting the same answers as the individual class accuracies?
Question 1) Is my formula for the accuracy of each class correct?
No, the formula you're using is for the Sensitivity (Recall). See below.
For calculating the accuracy of an individual class, say the positive class, I should take the TP in the numerator. Similarly, for the accuracy of only the negative class, I should take the TN in the numerator of the accuracy formula. Is the same formula applicable to binary classification? Is my implementation correct?
Accuracy is the ratio of the number of correctly classified instances to the total number of instances. TN, or the number of instances correctly identified as not being in a class, are correctly classified instances, too. You cannot simply leave them out.
Accuracy is also normally only used for evaluating the entire classifier for all classes, not individual classes. You can, however, generalize the accuracy formula to handle individual classes, as done here for computing the average classification accuracy for a multiclass classifier. (See also the referenced article.)
The formula they use for each class is:
acc_i = (TP_i + TN_i) / (TP_i + FP_i + FN_i + TN_i)
As you can see, it is identical to the usual formula for accuracy, but we only take into account the individual class's TP and TN scores (the denominator is still the total number of observations). Applying this to your data set, we get:
acc_1 = (2000+3966)/(2000+34+0+3966) = 0.99433
acc_2 = (1966+4000)/(1966+0+34+4000) = 0.99433
acc_3 = (2000+4000)/(2000+0+0+4000) = 1.00000
This at least makes more intuitive sense, since the first two classes had mis-classified instances and the third did not. Whether these measures are at all useful is another question.
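For reference, here is the same computation done programmatically; a small sketch in R (R just for illustration, since the question's code is Matlab):
cm <- matrix(c(2000,    0,    0,
                 34, 1966,    0,
                  0,    0, 2000), nrow = 3, byrow = TRUE)
N  <- sum(cm)
TP <- diag(cm)
FP <- colSums(cm) - TP   # other classes predicted as this class
FN <- rowSums(cm) - TP   # this class predicted as other classes
TN <- N - TP - FP - FN
(TP + TN) / N            # 0.99433 0.99433 1.00000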
Question 2) Is my formula for sensitivity correct?
Yes, sensitivity is given as:
TP / (TP + FN)
which is the ratio of the instances correctly identified as being in this class to the total number of instances in the class. In a binary classifier, you are by default calculating the sensitivity for the positive class. Its complement, FN / (TP + FN) = 1 - sensitivity, is the miss rate (also called the false negative rate in the Wikipedia article). The sensitivity of the negative class is the specificity, TN / (TN + FP): the TP for the negative class is nothing more than the TN of the positive class! (The meanings of the other cells are likewise reversed.) So it is natural to extend sensitivity to all classes as you have done.
Then how come I am getting the same answers as the individual class accuracies?
Because you're using the same formula for both.
Look at your confusion matrix:
cm_matrix =
                 predict_class1    predict_class2    predict_class3
                 ______________    ______________    ______________
Actual_class1         2000                 0                 0
Actual_class2           34              1966                 0
Actual_class3            0                 0              2000
TP for class 1 is obviously 2000:
cm_matrix(1,1)
FN is the sum of the other two columns in that row. Therefore, TP+FN is the sum of row 1:
sum(cm_matrix(1,:))
That's exactly the formula you used for the accuracy:
acc_1 = 100*(cm_matrix(1,1))/sum(cm_matrix(1,:)) = 100*(2000)/(2000+0+0) = 100
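In R, for instance (an illustration matching the Matlab above), the per-class sensitivity is the diagonal cell over its row sum, which is exactly the "accuracy" formula from the question:
cm <- matrix(c(2000,    0,    0,
                 34, 1966,    0,
                  0,    0, 2000), nrow = 3, byrow = TRUE)
sens <- diag(cm) / rowSums(cm)   # TP / (TP + FN) for each class
100 * sens                       # 100.0  98.3 100.0, i.e. acc_1, acc_2, acc_3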
Answer to question 1: It seems that accuracy is used only in binary classification; check this link.
The answer you refer to on this site also concerns binary classification (i.e. classification into 2 classes only). You seem to have more than two classes, and in this case you should try something else, such as one-versus-all classification for each class (for each class, score predictions as class_n versus non_class_n).
Answer to question 2: Same issue; this measure is appropriate for binary classification, which is not your case.
The formula for sensitivity is:
TP ./ (TP + FN)
The formula for accuracy is:
TP ./ (TP + FN + FP + TN)
See the documentation here.
UPDATE
And if you wish to use the confusion matrix, you have:
TP: on the diagonal, in the row and column of the class
FN: the sum of the other values in the class's row; FP: the sum of the other values in the class's column. In the function getvalues, start counting lines from the declaration of the function and check lines 30 and 31:
TP(i)=c_matrix(i,i);
FN(i)=sum(c_matrix(i,:))-c_matrix(i,i);
FP(i)=sum(c_matrix(:,i))-c_matrix(i,i);
TN(i)=sum(c_matrix(:))-TP(i)-FP(i)-FN(i);
If you apply the accuracy formula, you obtain, after calculating and simplifying:
accuracy = c_matrix(i,i) / sum(c_matrix(:))
For the sensitivity you obtain, after simplifying:
sensitivity = c_matrix(i,i) / sum(c_matrix(i,:))
If you want to understand better, just check the links I sent you.

How to enhance the accuracy of knn classifier?

My homework is to write Matlab code to calculate the accuracy of a kNN classifier, with data as follows. Training data:
data length 6 seconds, 3 channels, 768 samples/trial, 140 trials, fs = 128 Hz. Test data: 3 channels, 1152 samples/trial, 140 trials.
I have written part of the code, but I do not know where to use cross-validation, and the accuracy is very low (65%).
clear all
close all
clc

load('LabelTest.mat');
load('LabelTrain.mat');
load('TestData.mat');
load('TrainData.mat');

% Extract features from the training set
ndx1 = find(LabelTrain==1);
ndx2 = find(LabelTrain==2);
TrainClass1 = TrainData(:,:,ndx1);
TrainClass2 = TrainData(:,:,ndx2);
K1 = 1;   % channel used for the kurtosis and std features
K2 = 2;   % channel used for the sum feature
for i = 1:size(TrainClass1,3)
    FVclass1(i,:) = [kurtosis(TrainClass1(:,K1,i)) std(TrainClass1(:,K1,i)) sum(TrainClass1(:,K2,i))];
    FVclass2(i,:) = [kurtosis(TrainClass2(:,K1,i)) std(TrainClass2(:,K1,i)) sum(TrainClass2(:,K2,i))];
end
FVTrain = [FVclass1; FVclass2];

% Test data feature extraction
% (note: changed mean() to std() so the test features match the training features)
for j = 1:size(TestData,3)
    FVTest(j,:) = [kurtosis(TestData(:,K1,j)) std(TestData(:,K1,j)) sum(TestData(:,K2,j))];
end

TR_Label = [ones(1,size(TrainClass1,3)) 2*ones(1,size(TrainClass2,3))];
% knnclassify/classperf come from the old Bioinformatics Toolbox;
% newer MATLAB releases use fitcknn/predict instead
for k = 35:-1:1
    PredictedClass = knnclassify(FVTest, FVTrain, TR_Label, k); % kNN prediction
    PERF = classperf(LabelTest, PredictedClass);
    SD(k) = PERF.CorrectRate;   % accuracy for this k
end
figure
plot(1:35, SD);

Decide best 'k' in k-means algorithm in weka

I am using the k-means algorithm for clustering, but I am not sure how to decide the optimal value of k based on the results.
For example, I have applied k-means to a dataset with k=10:
kMeans
======
Number of iterations: 16
Within cluster sum of squared errors: 38.47923197081721
Missing values globally replaced with mean/mode
Cluster centroids:
Cluster#
Attribute Full Data 0 1 2 3 4 5 6 7 8 9
(214) (16) (9) (13) (23) (46) (12) (11) (40) (15) (29)
==============================================================================================================================================================================================================================================================
RI 1.5184 1.5181 1.5175 1.5189 1.5178 1.5172 1.519 1.5255 1.5175 1.5222 1.5171
Na 13.4079 12.9988 14.6467 12.8277 13.2148 13.1896 13.63 12.6318 13.0518 13.9107 14.4421
Mg 2.6845 3.4894 1.3056 0.7738 3.4261 3.4987 3.4917 0.2145 3.4958 3.8273 0.5383
Al 1.4449 1.1844 1.3667 2.0338 1.3552 1.4898 1.3308 1.1891 1.2617 0.716 2.1228
Si 72.6509 72.785 73.2067 72.3662 72.6526 72.6989 72.07 72.0709 72.9532 71.7467 72.9659
K 0.4971 0.4794 0 1.47 0.527 0.59 0.4108 0.2345 0.547 0.1007 0.3252
Ca 8.957 8.8069 9.3567 10.1238 8.5648 8.3041 8.87 13.1291 8.5035 9.5887 8.4914
Ba 0.175 0.015 0 0.1877 0.023 0.003 0.0667 0.2864 0 0 1.04
Fe 0.057 0.2238 0 0.0608 0.2013 0.0104 0.0167 0.1109 0.011 0.0313 0.0134
Type build wind non-float build wind float tableware containers build wind non-float build wind non-float build wind float build wind non-float build wind float build wind float headlamps
There are various methods for deciding the optimal value of "k" in the k-means algorithm: the thumb rule, the elbow method, the silhouette method, etc. In my work I followed the elbow method and succeeded with my results; I did all the analysis in the R language.
Here is the link with a description of those methods: link
Follow the sub-links of the given link, build the code for any one of the methods, and apply it to your data, for example as sketched below.
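For instance, a minimal elbow-method sketch in R (here df is a placeholder for your numeric data frame, not something from the original post): plot the total within-cluster sum of squares against k and look for the bend where the curve flattens.
set.seed(1)
wss <- sapply(1:10, function(k) kmeans(df, centers = k, nstart = 10)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "k", ylab = "Total within-cluster SS")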
I hope this will help you, if not I am sorry.
All the Best with your work.

How to find subset selection for linear regression model?

I am working with the mtcars dataset and using linear regression:
data(mtcars)
fit <- lm(mpg ~ ., data = mtcars)
summary(fit)
When I fit the model with lm, it shows this result:
Call:
lm(formula = mpg ~ ., data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-3.5087 -1.3584 -0.0948 0.7745 4.6251
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 23.87913 20.06582 1.190 0.2525
cyl6 -2.64870 3.04089 -0.871 0.3975
cyl8 -0.33616 7.15954 -0.047 0.9632
disp 0.03555 0.03190 1.114 0.2827
hp -0.07051 0.03943 -1.788 0.0939 .
drat 1.18283 2.48348 0.476 0.6407
wt -4.52978 2.53875 -1.784 0.0946 .
qsec 0.36784 0.93540 0.393 0.6997
vs1 1.93085 2.87126 0.672 0.5115
amManual 1.21212 3.21355 0.377 0.7113
gear4 1.11435 3.79952 0.293 0.7733
gear5 2.52840 3.73636 0.677 0.5089
carb2 -0.97935 2.31797 -0.423 0.6787
carb3 2.99964 4.29355 0.699 0.4955
carb4 1.09142 4.44962 0.245 0.8096
carb6 4.47757 6.38406 0.701 0.4938
carb8 7.25041 8.36057 0.867 0.3995
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.833 on 15 degrees of freedom
Multiple R-squared: 0.8931, Adjusted R-squared: 0.779
F-statistic: 7.83 on 16 and 15 DF, p-value: 0.000124
I found that none of the variables is marked as significant at the 0.05 significance level.
To find significant variables I want to do subset selection, to find the best subset of predictor variables for the response variable mpg.
The function regsubsets in the package leaps does best subset regression (see ?leaps). Adapting your code:
library(leaps)
regfit <- regsubsets(mpg ~ ., data = mtcars)
summary(regfit)
# or, for a more visual display
plot(regfit, scale = "Cp")
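As a possible follow-up (a sketch, not part of the original answer): pick the subset size that minimizes Mallows' Cp and inspect its coefficients.
reg.summary <- summary(regfit)
best <- which.min(reg.summary$cp)   # size of the best model by Cp
coef(regfit, best)                  # coefficients of that model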