R prediction intervals with predict()

I have a data set with 47 observations and 5 variables (male is coded as 0 and female as 1). I am trying to predict the average spending of males from status, income and verbal, with a 95% CI.
I fitted the model with mdl <- lm(spending ~ status + income + verbal + sex, data = teenspend) to obtain the averages.
I found my coefficients as:
mdl$coefficients
 (Intercept)    sexfemale       status       income       verbal
 22.55565063 -22.11833009   0.05223384   4.96197922  -2.95949350
predict(mdl, sex=0, interval='confidence', level=0.90)
A question: I used predict as above, but I get fitted values for all the observations. How do I find my single prediction?
fit lwr upr
1 -10.6507430 -21.4372267 0.1357407
2 -9.3711318 -21.9428731 3.2006095
3 -5.4630298 -15.0782882 4.1522286
4 24.7957487 12.5630143 37.0284831
Could someone clarify?

Check the documentation for predict.lm and you'll see that the argument sex=0 cannot be used here. The predict method ignores that argument and thus you get the fitted values plus confidence interval for ALL observations in your data. You can specify the prediction in the following way:
predict(mdl, newdata=teenspend[teenspend$sex==0,], interval="confidence")
If you indeed need a prediction interval--you use it in the title of your question--you should choose interval="prediction".
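To see why interval="prediction" is wider than interval="confidence", here is a sketch in plain NumPy (not R; synthetic data standing in for teenspend, all numbers illustrative): the prediction interval adds the residual variance of a new observation to the uncertainty of the fitted mean.

```python
# Sketch: confidence interval for the mean response vs. prediction
# interval for a new observation, for an OLS fit.
import numpy as np

rng = np.random.default_rng(0)
n = 47
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])  # intercept + 3 predictors
y = X @ np.array([22.6, 0.05, 5.0, -3.0]) + rng.normal(size=n)

beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta
s2 = resid @ resid / (n - X.shape[1])   # residual variance estimate
XtX_inv = np.linalg.inv(X.T @ X)

x0 = X[0]                               # point at which to predict
h = x0 @ XtX_inv @ x0                   # leverage term x0' (X'X)^-1 x0
se_conf = np.sqrt(s2 * h)               # SE for the mean response (confidence)
se_pred = np.sqrt(s2 * (1 + h))         # SE for a new observation (prediction)
```

Since se_pred includes the extra "+ 1" term, the prediction interval is always the wider of the two.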

Related

Calculating accuracy for multi-class classification

Consider a three class classification problem with the following confusion matrix.
cm_matrix =
                 predict_class1   predict_class2   predict_class3
                 ______________   ______________   ______________
Actual_class1         2000                0                0
Actual_class2           34             1966                0
Actual_class3            0                0             2000
Multi-Class Confusion Matrix Output
                 TruePositive   FalsePositive   FalseNegative   TrueNegative
                 ____________   _____________   _____________   ____________
Actual_class1        2000             34               0            3966
Actual_class2        1966              0              34            4000
Actual_class3        2000              0               0            4000
The formulas that I have used are:
Accuracy of each class = TP ./ (total instances of that class)
(formula based on an answer here: Individual class accuracy calculation confusion)
Sensitivity = TP ./ (TP + FN);
The implementation in MATLAB is:
acc_1 = 100*(cm_matrix(1,1))/sum(cm_matrix(1,:)) = 100*(2000)/(2000+0+0) = 100
acc_2 = 100*(cm_matrix(2,2))/sum(cm_matrix(2,:)) = 100*(1966)/(34+1966+0) = 98.3
acc_3 = 100*(cm_matrix(3,3))/sum(cm_matrix(3,:)) = 100*(2000)/(0+0+2000) = 100
sensitivity_1 = 2000/(2000+0) = 1 = acc_1 (as a fraction)
sensitivity_2 = 1966/(1966+34) = 0.983 = acc_2 (as a fraction)
sensitivity_3 = 2000/2000 = 1 = acc_3 (as a fraction)
Question1) Is my formula for Accuracy of each class correct? For calculating accuracy of each individual class, say for positive class I should take the TP in the numerator. Similarly, for accuracy of only the negative class, I should consider TN in the numerator in the formula for accuracy. Is the same formula applicable to binary classification? Is my implementation of it correct?
Question2) Is my formula for sensitivity correct? Then how come I am getting same answer as individual class accuracies?
Question1) Is my formula for Accuracy of each class correct?
No, the formula you're using is for the Sensitivity (Recall). See below.
For calculating accuracy of each individual class, say for positive class I should take the TP in the numerator. Similarly, for accuracy of only the negative class, I should consider TN in the numerator in the formula for accuracy. Is the same formula applicable to binary classification? Is my implementation of it correct?
Accuracy is the ratio of the number of correctly classified instances to the total number of instances. TN, or the number of instances correctly identified as not being in a class, are correctly classified instances, too. You cannot simply leave them out.
Accuracy is also normally only used for evaluating the entire classifier for all classes, not individual classes. You can, however, generalize the accuracy formula to handle individual classes, as done here for computing the average classification accuracy for a multiclass classifier. (See also the referenced article.)
The formula they use for each class is:
accuracy_i = (TP_i + TN_i) / (TP_i + TN_i + FP_i + FN_i)
As you can see, it is identical to the usual formula for accuracy, but we only take into account the individual class's TP and TN scores (the denominator is still the total number of observations). Applying this to your data set, we get:
acc_1 = (2000+3966)/(2000+34+0+3966) = 0.99433
acc_2 = (1966+4000)/(1966+0+34+4000) = 0.99433
acc_3 = (2000+4000)/(2000+0+0+4000) = 1.00000
This at least makes more intuitive sense, since the first two classes had mis-classified instances and the third did not. Whether these measures are at all useful is another question.
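A sketch of the same computation in NumPy, using the question's confusion matrix:

```python
import numpy as np

cm = np.array([[2000,    0,    0],
               [  34, 1966,    0],
               [   0,    0, 2000]])

total = cm.sum()
tp = np.diag(cm)                 # correctly classified instances of each class
fn = cm.sum(axis=1) - tp         # class instances predicted as something else
fp = cm.sum(axis=0) - tp         # other instances predicted as this class
tn = total - tp - fn - fp        # instances correctly NOT assigned to this class
per_class_acc = (tp + tn) / total
# per_class_acc -> [0.99433..., 0.99433..., 1.0]
```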
Question2) Is my formula for sensitivity correct?
Yes, Sensitivity is given as:
TP / (TP + FN)
which is the ratio of the instances correctly identified as being in this class to the total number of instances in the class. In a binary classifier, you are by default calculating the sensitivity for the positive class; the analogous quantity for the negative class is the specificity, TN / (TN + FP), since the TN of the positive class is nothing more than the TP of the negative class (the roles of the two labels are simply swapped). The complement of the sensitivity,
FN / (TP + FN) = 1 - Sensitivity
is the miss rate (the false negative rate in the Wikipedia article). So it is natural to extend sensitivity to all classes as you have done.
Then how come I am getting same answer as individual class accuracies?
Because you're using the same formula for both.
Look at your confusion matrix:
cm_matrix =
predict_class1 predict_class2 predict_class3
______________ ______________ ______________
Actual_class1 2000 0 0
Actual_class2 34 1966 0
Actual_class3 0 0 2000
TP for class 1 is obviously 2000
cm_matrix(1,1)
FN is the sum of the other two columns in that row. Therefore, TP+FN is the sum of row 1
sum(cm_matrix(1,:))
That's exactly the formula you used for the accuracy.
acc_1 = 100*(cm_matrix(1,1))/sum(cm_matrix(1,:)) = 100*(2000)/(2000+0+0) = 100
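A quick NumPy sketch makes the coincidence explicit: the diagonal over the row sums is both the "per-class accuracy" formula from the question and the sensitivity.

```python
import numpy as np

cm = np.array([[2000,    0,    0],
               [  34, 1966,    0],
               [   0,    0, 2000]])

# TP is the diagonal; TP + FN is the row sum, so:
sensitivity = np.diag(cm) / cm.sum(axis=1)
# -> [1.0, 0.983, 1.0], exactly the "accuracies" computed in the question
```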
Answer to question 1. Accuracy, as usually defined, is a measure for binary classification; check this link.
You refer to an answer on this site, but it also concerns binary classification (i.e. classification into two classes only). You have more than two classes, so you should either try something else or set up a one-versus-all classification for each class (for each class, treat the prediction as class_n versus non_class_n).
Answer to question 2. Same issue: this measure is appropriate for binary classification, which is not your case.
The formula for sensitivity is:
TP./(TP + FN)
The formula for accuracy is:
(TP)./(TP+FN+FP+TN)
See the documentation here.
UPDATE
And if you wish to use the confusion matrix, you have:
TP on the diagonal, at the level of the class;
FN is the sum of the other values in the class's row, and FP the sum of the other values in its column. In the function getvalues, start counting lines from the declaration of the function and check lines 30 and 31:
TP(i)=c_matrix(i,i);
FN(i)=sum(c_matrix(i,:))-c_matrix(i,i);
FP(i)=sum(c_matrix(:,i))-c_matrix(i,i);
TN(i)=sum(c_matrix(:))-TP(i)-FP(i)-FN(i);
If you apply the accuracy formula, you obtain, after calculating and simplifying:
accuracy = c_matrix(i,i) / sum(c_matrix(:))
For the sensitivity you obtain, after simplifying:
sensitivity = c_matrix(i,i) / sum(c_matrix(i,:))
If you want to understand better, just check the links I sent you.

How can I incorporate a for loop into my genetic algorithm?

I'm writing a genetic algorithm that attempts to find an optimized solution over the course of 100 generations. My code as it stands produces only 2 generations. I'm trying to find a way to add a for loop in order to repeat the process for the full 100 generations.
clc,clear
format shorte
k=80;
mu=50;
s=.05;
c1=k+(4/3)*mu;
c2=k-(2/3)*mu;
for index=1:50 %6 traits generated at random 50 times
    a=.005*rand-.0025;
    b=.005*rand-.0025;
    c=.005*rand-.0025;
    d=.005*rand-.0025;
    e=.005*rand-.0025;
    f=.005*rand-.0025;
    E=[c1,c2,c2,0,0,0;
       c2,c1,c2,0,0,0;
       c2,c2,c1,0,0,0;
       0,0,0,mu,0,0;
       0,0,0,0,mu,0;
       0,0,0,0,0,mu];
    S=[a;d;f;2*b;2*c;2*e];
    G=E*S;
    g=G(1);
    h=G(2);
    i=G(3);
    j=G(4);
    k=G(5);
    l=G(6);
    F=[(g-h)^2+(h-i)^2+(i-g)^2+6*(j^2+k^2+l^2)];
    PI=((F-(2*s^2))/(2*s^2))^2; %cost function, fitness assessed
    RP(index,:)=[a,b,c,d,e,f,PI]; %initial random population
end
Gen1=sortrows(RP,7,{'descend'}); %the initial population ranked
%for loop 1:100 would start here
children=zeros(10,6); %10 new children created from the top 20 parents
babysitter=1;
for parent=1:2:20
    theta=rand(1);
    traita=theta*Gen1(parent,1)+(1-theta)*Gen1(1+parent,1);
    theta=rand(1);
    traitb=theta*Gen1(parent,2)+(1-theta)*Gen1(1+parent,2);
    theta=rand(1);
    traitc=theta*Gen1(parent,3)+(1-theta)*Gen1(1+parent,3);
    theta=rand(1);
    traitd=theta*Gen1(parent,4)+(1-theta)*Gen1(1+parent,4);
    theta=rand(1);
    traite=theta*Gen1(parent,5)+(1-theta)*Gen1(1+parent,5);
    theta=rand(1);
    traitf=theta*Gen1(parent,6)+(1-theta)*Gen1(1+parent,6);
    children(babysitter,:)=[traita,traitb,traitc,traitd,traite,traitf];
    babysitter=babysitter+1;
end
top10parents=Gen1(1:10,1:6);
Gen1([11:50],:)=[]; %bottom 40 parents removed
for newindex=1:30 %6 new traits generated randomly 30 times
    newa=.005*rand-.0025;
    newb=.005*rand-.0025;
    newc=.005*rand-.0025;
    newd=.005*rand-.0025;
    newe=.005*rand-.0025;
    newf=.005*rand-.0025;
    newgenes(newindex,:)=[newa,newb,newc,newd,newe,newf];
end
nextgen=[top10parents;children;newgenes]; %top 10 parents, the 10 new children, and the new 30 random traits added into one new matrix
for new50=1:50
    newS=[nextgen(new50,1);nextgen(new50,4);nextgen(new50,6);2*nextgen(new50,2);2*nextgen(new50,3);2*nextgen(new50,5)];
    newG=E*newS;
    newg=newG(1);
    newh=newG(2);
    newi=newG(3);
    newj=newG(4);
    newk=newG(5);
    newl=newG(6);
    newF=[(newg-newh)^2+(newh-newi)^2+(newi-newg)^2+6*(newj^2+newk^2+newl^2)]; %von Mises criterion
    newPI=((newF-(2*s^2))/(2*s^2))^2; %fitness assessed for new generation
    PIcolumn(new50,:)=[newPI];
end
nextgenwPI=[nextgen,PIcolumn]; %pi column added to nextgen matrix
Gen2=sortrows(nextgenwPI,7,{'descend'}) %generation 2 ranked
So my question is: how can I get the generations to count themselves so that the for loop works? I've searched for an answer and read that having matrices count themselves is not a good idea. However, I'm not sure what else to do besides finding a way to make a genN matrix that counts upward in increments of 1 after the first generation. Any suggestions?
Thank you
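As a sketch of the loop structure the question is after (in Python rather than MATLAB; the fitness and crossover functions here are simple placeholders, not the von Mises cost from the question): keep a single population variable that the outer loop overwrites each generation, instead of a named Gen1/Gen2 matrix per generation.

```python
import random

def fitness(ind):
    # placeholder cost, standing in for the question's PI criterion
    return sum(x * x for x in ind)

def random_individual():
    # 6 traits drawn uniformly from [-0.0025, 0.0025), as in the question
    return [0.005 * random.random() - 0.0025 for _ in range(6)]

def crossover(p1, p2):
    theta = random.random()
    return [theta * a + (1 - theta) * b for a, b in zip(p1, p2)]

population = [random_individual() for _ in range(50)]
for generation in range(100):        # one loop counter, no GenN matrices
    population.sort(key=fitness)     # rank the current generation
    parents = population[:20]
    children = [crossover(parents[i], parents[i + 1]) for i in range(0, 20, 2)]
    newcomers = [random_individual() for _ in range(30)]
    population = parents[:10] + children + newcomers   # next generation, size 50
best = min(population, key=fitness)
```

The key design point: because population is reassigned at the bottom of the loop, no generation needs to "count itself"; the loop variable generation is the only counter.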

How to setup fitensemble for binary imbalanced datasets?

I've been trying to test MATLAB's ensemble methods on a randomly generated imbalanced dataset, and no matter what I set the prior/cost/weight parameters to, the method never predicts close to the label ratio.
Below is an example of the tests I did.
prob = 0.9; %set label ratio to 90% 1 and 10% 0
y = (rand(100,1) < prob);
X = rand(100,3); %generate random training data with three features
X_test = rand(100,3); %generate random test data
%A few parameter sets I've tested
B = TreeBagger(100,X,y);
B2 = TreeBagger(100,X,y,'Prior','Empirical');
B3 = TreeBagger(100,X,y,'Cost',[0,9;1,0]);
B4 = TreeBagger(100,X,y,'Cost',[0,1;9,0]);
B5 = fitensemble(X,y,'RUSBoost', 20, 'Tree', 'Prior', 'Empirical');
Here I used the trained classifiers to predict on random test data. My assumption is that since the classifier is trained on random data, it should on average predict close to the dataset label ratio (9:1) if it takes the prior into account. But each of the classifiers predicted 98-100% in favor of '1' instead of the ~90% that I am looking for.
l1 = predict(B,X_test);
l2 = predict(B2,X_test);
l3 = predict(B3,X_test);
l4 = predict(B4,X_test);
l5 = predict(B5,X_test);
How do I get the ensemble method to take the prior into account? Or is there a fundamental misunderstanding on my part?
I don't think it can work the way you think.
That's because, as I understand it, your training and test data are random. So how should your classifier find any relation between the features and your label?
Let's take accuracy as the measurement and make an example:
class A: 900 data rows.
class B: 100 data rows.
Classify 100% as A:
0.9/(0.1+0.9) = 0.9
gets you 90% accuracy.
If your classifier does something different, i.e. tries to classify some data rows as B, it will by chance get 9 times more wrongly classified A rows. Say 20 B rows are correctly classified; you will then get around 180 wrongly classified A rows:
B: 20 correct, 80 incorrect
A: 720 correct, 180 incorrect
740/(740+260) = 0.74
Accuracy goes down to 74%, and that's not something your classification algorithm wants.
Long story short: your classifier will always tend to classify almost 100% as class A if you don't get any information into your data.
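The arithmetic above can be checked in a few lines of Python (the 900/100 split and the 20/180 trade-off are the numbers from the example):

```python
# Numbers from the example: 900 class-A rows, 100 class-B rows.
n_a, n_b = 900, 100

# Classifier 1: predict A for everything.
acc_all_a = n_a / (n_a + n_b)            # 0.9

# Classifier 2: 20 B rows correct, at the cost of roughly 180 wrongly
# classified A rows (random features carry no real signal).
correct = (n_a - 180) + 20               # 720 A + 20 B
acc_mixed = correct / (n_a + n_b)        # 0.74
```

So an accuracy-driven learner on featureless data is pushed toward the majority-class-only solution.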

Sound classification using MFCC and dynamic time warping (DTW)

My objective is to classify non-speech signals, for which I am using MFCC and DTW in Java. However, I am stuck in the middle. I would appreciate any help.
I have evaluated 13 MFCC values for each frame; however, some values are negative, and I am confused about whether the process I am following is right or wrong. Currently I am using the code provided by jAudio. I have also tried other code, and it gives me negative values as well.
Secondly, I get 13 coefficients per frame; with 157 frames for a sample of a certain length, I get 157 sets of 13 MFCCs. I am having a hard time seeing how to use all the coefficients in DTW, because DTW only gives the closest distance between two time signals. I do have DTW code to compare two time signals; I am just not sure how to use all the MFCC values of a signal as features.
Is there some crucial step of classification I am missing? Please help me.
DTW, in your case, is used to compare 2 audio sequences. Thus, for the sequence to be verified you will have an M1xN matrix, and for the query an M2xN matrix. This implies that your cost matrix will be M1xM2.
To construct the cost matrix you apply a distance/cost measure between the sequences, as cost(i,j) = your_chosen_multidimensional_metric(M1[i,:], M2[j,:]).
The resulting cost matrix will be 2D, and you can easily apply DTW to it.
I made similar code for DTW based on MFCCs. Below is the Python implementation, which returns the DTW score; x and y are the MFCC matrices of the voice sequences, with dimensions M1xN and M2xN:
import numpy as np
from scipy.spatial.distance import cdist

def my_dtw(x, y):
    # pairwise distances between all frames of x and y -> (m, n) cost matrix
    cost_matrix = cdist(x, y, metric='seuclidean')
    m, n = np.shape(cost_matrix)
    for i in range(m):
        for j in range(n):
            if (i == 0) and (j == 0):
                cost_matrix[i, j] = cost_matrix[i, j]  # starting cell, unchanged
            elif i == 0:
                cost_matrix[i, j] = cost_matrix[i, j] + cost_matrix[i, j - 1]
            elif j == 0:
                cost_matrix[i, j] = cost_matrix[i, j] + cost_matrix[i - 1, j]
            else:
                # accumulate the cheapest of the three DTW predecessors
                min_local_dist = cost_matrix[i - 1, j]
                if min_local_dist > cost_matrix[i, j - 1]:
                    min_local_dist = cost_matrix[i, j - 1]
                if min_local_dist > cost_matrix[i - 1, j - 1]:
                    min_local_dist = cost_matrix[i - 1, j - 1]
                cost_matrix[i, j] = cost_matrix[i, j] + min_local_dist
    return cost_matrix[m - 1, n - 1]
Say you have N1 sets of 13 MFCCs each for the first signal and N2 sets for the second.
You should compute the distance between each set from the first signal and each set from the second (you can use the Euclidean distance between two 13-element arrays).
This leaves you with an N1xN2 two-dimensional array, to which you then apply DTW.
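Those steps can be sketched in plain NumPy (Euclidean distances between frames, then the standard DTW recursion; the frame counts 157 and 140 are just illustrative):

```python
import numpy as np

def dtw_distance(x, y):
    """x: (N1, 13) MFCC frames, y: (N2, 13). Returns the DTW score."""
    # pairwise Euclidean distances between frames -> (N1, N2) cost matrix
    cost = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=2)
    acc = np.full((len(x), len(y)), np.inf)
    acc[0, 0] = cost[0, 0]
    for i in range(len(x)):
        for j in range(len(y)):
            if i == 0 and j == 0:
                continue
            # cheapest of the three allowed predecessors
            best = min(acc[i - 1, j] if i else np.inf,
                       acc[i, j - 1] if j else np.inf,
                       acc[i - 1, j - 1] if i and j else np.inf)
            acc[i, j] = cost[i, j] + best
    return acc[-1, -1]

rng = np.random.default_rng(0)
a = rng.normal(size=(157, 13))   # 157 frames x 13 MFCCs
b = rng.normal(size=(140, 13))   # a second signal of different length
score = dtw_distance(a, b)
```

A signal compared with itself scores 0 (the warping path runs down the zero-cost diagonal), which is a handy sanity check for any DTW implementation.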
Check out: http://code.google.com/p/aquila/
Specifically: http://code.google.com/p/aquila/source/browse/trunk/examples/dtw_distance/main.cpp which has example code for DTW distance calculation.

Understanding T-SQL stdev, stdevp, var, and varp

I'm having a difficult time understanding what these statistics functions do and how they work. I'm having an even more difficult time understanding how stdev works vs. stdevp, and likewise for the var equivalents. Can someone please break these down into dumb terms for me?
In statistics, standard deviation and variance are measures of how much a metric in a population deviates from the mean (usually the arithmetic average).
The standard deviation is defined as the square root of the variance, and the variance is defined as the average of the squared differences from the mean, i.e.:
For a population of size n: x1, x2, ..., xn
with mean: xmean
Stdevp = sqrt( ((x1-xmean)^2 + (x2-xmean)^2 + ... + (xn-xmean)^2)/n )
When values for the whole population are not available (most of the time) it is customary to apply Bessel's correction to get a better estimate of the actual standard deviation for the whole population. Bessel's correction is merely dividing by n-1 instead of by n when computing the variance, i.e:
Stdev = sqrt( ((x1-xmean)^2 + (x2-xmean)^2 + ... + (xn-xmean)^2)/(n-1) )
Note that for large enough data-sets it won't really matter which function is used.
You can verify my answer by running the following T-SQL script:
-- temporary data set with values 2, 3, 4
declare @t table([val] int);
insert into @t values
(2),(3),(4);
select avg(val) as [avg], -- equals 3.0
    -- Estimate of the population standard deviation from a sample, with Bessel's correction:
    -- sqrt( ((x1-xmean)^2 + (x2-xmean)^2 + ... + (xn-xmean)^2)/(n-1) )
    stdev(val) as [stdev],
    sqrt( (square(2-3.0) + square(3-3) + square(4-3))/2 ) as [stdev calculated], -- calculated with values 2, 3, 4
    -- Population standard deviation:
    -- sqrt( ((x1-xmean)^2 + (x2-xmean)^2 + ... + (xn-xmean)^2)/n )
    stdevp(val) as [stdevp],
    sqrt( (square(2-3.0) + square(3-3) + square(4-3))/3 ) as [stdevp calculated] -- calculated with values 2, 3, 4
from @t;
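You can make the same cross-check with Python's statistics module, where pstdev divides by n (like stdevp) and stdev divides by n-1 (like stdev):

```python
import statistics

data = [2, 3, 4]
pop = statistics.pstdev(data)     # divide by n   -> sqrt(2/3) ~ 0.8165
sample = statistics.stdev(data)   # divide by n-1 -> sqrt(2/2) = 1.0
```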
Further reading: the Wikipedia articles on standard deviation and Bessel's correction.
STDEV is used to estimate, from a sample, the standard deviation of the population the sample was drawn from. STDEVP computes the standard deviation of a data set treated as the entire population.
If your input is the entire population, then the population standard deviation is computed with STDEVP. More typically, your data set is a sample of a much larger population. In this case the population formula applied to the sample would not represent the true standard deviation of the population, since it is usually biased too low. A better estimate for the standard deviation of the population based on a sample is obtained with STDEV, which applies Bessel's correction.
The situation with VAR and VARP is the same.
For a more thorough discussion of the topic, please see this Wikipedia article.