Multiclass Logistic Regression ROC Curves in MATLAB

I have 7 classes within my training examples (labeled 1-7). I'm running logistic regression and I want to create my ROC curve for each of my classes.
To train my model and make a prediction, I have the following code:
Theta = zeros(k, n+1); % initialize parameters
[Theta, costs] = gradientDescent(Theta, @(t)(CostFunc(t, X, Y, lambda)),...
    @(t)(DerivOfCostFunc(t, X, Y, lambda)), alpha, iter_num);
%Make prediction with trained model
[scores,prediction] = predict(Theta, X_test); %X_test is the design matrix (ones on the first col)
Within the predict script, I have
scores = g(X*all_theta'); % g is the sigmoid function; all_theta is the Theta passed in
[p_max, IndexOfMax]=max(scores, [], 2);
prediction = IndexOfMax;
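For completeness, a minimal sketch of the helper g referenced above, assuming it is the elementwise logistic sigmoid as the comment states:
function out = g(z)
% elementwise logistic sigmoid (assumed definition, not from the original post)
out = 1 ./ (1 + exp(-z));
end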
Note that scores is an m-by-k matrix, where m is the number of examples and k is the number of classes, and prediction is an m-by-1 vector of predicted class labels from 1 to 7.
To create the ROC curve, for class 3 for example,
classNum=3;
for i = 1:size(scores,1)
    temp = scores(i,:);
    diffscore(i,:) = temp(classNum) - max([temp(1:classNum-1), temp(classNum+1:end)]);
end
This last part I did because I read that I had to establish my class 3 as positive and the others as negative.
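Incidentally, the same diffscore can be computed without the loop; a vectorized sketch, assuming scores is m-by-k:
otherScores = scores;
otherScores(:, classNum) = -Inf;  % exclude the positive class from the max
diffscore = scores(:, classNum) - max(otherScores, [], 2);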
At last, I made my curve with the following code:
[xROC,yROC,~,auc] = perfcurve(y_test,diffscore,classNum);
%y_test contains my true labels, m by 1 column vector
However, when I run the ROC curve for each of my classes, I get the same plot for all of them, each with an AUC of 1. Based on some analysis, I know this is not correct, but I can't figure out where the code goes wrong. Is there additional code I should add, or do I need to modify my existing code?

How to plot the precision and recall curves of a CNN in MATLAB?

I have generated the scores from CNN and want to plot the precision-recall curve, but I am unable to get that.
I have calculated TP, TN, FP, and FN using:
idx = (ACTUAL == 1);                        % logical index of the positive samples
p = length(ACTUAL(idx));                    % number of positives
n = length(ACTUAL(~idx));                   % number of negatives
N = p + n;                                  % total number of samples
tp = sum(ACTUAL(idx) == PREDICTED(idx));    % true positives
tn = sum(ACTUAL(~idx) == PREDICTED(~idx));  % true negatives
fp = n - tn;                                % false positives
fn = p - tp;                                % false negatives
The formulas for precision and recall are
precision = tp/(tp+fp)
recall = tp/(tp+fn)
but with those, I am getting an undesired plot.
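For reference, those counts give only a single precision/recall operating point, computed as below; a full curve requires sweeping a threshold over the raw scores, which is what the perfcurve answer below does:
precision = tp / (tp + fp);  % fraction of predicted positives that are correct
recall = tp / (tp + fn);     % fraction of actual positives that are recovered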
I have obtained scores of the CNN using the following command:
[YTest,score]=classify(convnet,TestData)
MATLAB has a function for creating ROC curves and similar performance curves (such as precision-recall curves) in the Statistics and Machine Learning Toolbox: perfcurve.
By default, the ROC curve is calculated.
The function has the following syntax:
[X, Y] = perfcurve(labels, scores, posclass)
Here, labels is the true label for each sample, scores is the prediction of the CNN (or any other classifier), and posclass is the label of the class you assume to be "positive" - which appears to be 1 in your example. The outputs of the perfcurve function are the (x, y) coordinates of the ROC curve, so you can easily plot it using
plot(X, Y)
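As a minimal self-contained sketch (the labels and scores below are made-up illustrative data):
labels = [ones(50,1); zeros(50,1)];       % true classes
scores = [randn(50,1) + 1; randn(50,1)];  % higher score suggests class 1
[X, Y, ~, AUC] = perfcurve(labels, scores, 1);
plot(X, Y)
xlabel('False positive rate'), ylabel('True positive rate')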
To make perfcurve plot the precision-recall curve instead of the ROC curve, you have to set the optional 'XCrit' and 'YCrit' arguments of the function. As described in the documentation, different pre-defined criteria such as number of false positives ('fp'), true positive rate ('tpr'), accuracy ('accu') and many more, or even custom functions can be used.
By setting 'XCrit' to 'tpr' (Recall) and 'YCrit' to 'prec' (Precision), a precision-recall curve is created:
[X, Y] = perfcurve(labels, scores, posclass, 'XCrit', 'tpr', 'YCrit', 'prec');
plot(X, Y);
xlabel('Recall')
ylabel('Precision')
xlim([0, 1])
ylim([0, 1])
For example (using randomly generated data and an SVM; the original answer showed the resulting plot):
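A hedged reconstruction of that example (the data, seed, and SVM setup below are assumptions, not the original code):
rng(1);                                    % reproducible random data
Xdata = [randn(100,2) + 1; randn(100,2) - 1];
labels = [ones(100,1); zeros(100,1)];
mdl = fitcsvm(Xdata, labels);              % train a binary SVM
[~, score] = predict(mdl, Xdata);          % score(:,2) scores the positive class
[Xpr, Ypr] = perfcurve(labels, score(:,2), 1, 'XCrit', 'tpr', 'YCrit', 'prec');
plot(Xpr, Ypr), xlabel('Recall'), ylabel('Precision')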
hbaderts's answer is correct, but the code at the end of that answer gives a precision-recall curve, not an ROC curve. To generate the ROC curve explicitly, use:
[X,Y] = perfcurve(labels, scores, posclass, 'XCrit', 'fpr', 'YCrit', 'tpr');
Then the generated receiver operating characteristic (ROC) curve is correct.

GMModel - how do I use this to predict a label's data?

I've made a GMModel using fitgmdist. The idea is to produce two gaussian distributions on the data and use that to predict their labels. How can I determine if a future data point fits into one of those distributions? Am I misunderstanding the purpose of a GMModel?
clear;
load C:\Users\Daniel\Downloads\data1 data;
% Mixed Gaussian
GMModel = fitgmdist(data(:, 1:4),2)
Produces
GMModel =
Gaussian mixture distribution with 2 components in 4 dimensions
Component 1:
Mixing proportion: 0.509709
Mean: 2.3254 -2.5373 3.9288 0.4863
Component 2:
Mixing proportion: 0.490291
Mean: 2.5161 -2.6390 0.8930 0.4833
Edit:
clear;
load C:\Users\Daniel\Downloads\data1 data;
% Mixed Gaussian
GMModel = fitgmdist(data(:, 1:4),2);
P = posterior(GMModel, data(:, 1:4));
X = round(P)
blah = X(:, 1)
dah = data(:, 5)
Y = max(mean(blah == dah), mean(~blah == dah))
I don't understand why you round the posterior values. Here is what I would do after fitting a mixture model.
P = posterior(GMModel, data(:, 1:4));
[~,Y] = max(P,[],2);
Now Y contains the labels, i.e. the index of the Gaussian component each data point belongs to in terms of the maximum a posteriori (MAP) rule. The important thing is to align the labels before evaluating the classification error, since the component numbering is arbitrary: Gaussian component 1 in the true labeling might be component 2 in the clustering produced, and so on. That may be why you are getting accuracy varying from 51% to 95%, in addition to other subtle problems.
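As a concrete sketch of both the scoring and the alignment (this assumes the true labels in data(:,5) take the values 1 and 2; adjust the swap if yours are 0/1):
P = posterior(GMModel, data(:, 1:4));  % N-by-2 posterior probabilities
[~, Y] = max(P, [], 2);                % MAP component for each row
% try both component-to-label assignments and keep the better one
accuracy = max(mean(Y == data(:,5)), mean((3 - Y) == data(:,5)));
% a future observation is classified the same way (xNew is hypothetical)
xNew = [2.4, -2.6, 3.9, 0.5];
[~, labelNew] = max(posterior(GMModel, xNew), [], 2);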

Fitting Gaussian Mixture Model

I have six bivariate normal distributions and I want to combine them into a Gaussian mixture model. I calculated the mean and covariance matrices below. When I sample random data (mvnrnd) for the given distribution parameters, gmdistribution.fit gives different results for different sample sizes; in other words, sample sizes n=50 and n=1000 converge to different Gaussian distributions. My underlying data contains 30 samples for each cluster. So what is the best way to fit a Gaussian mixture model to my data? Any ideas?
mu1=[log(0.29090) log(0.0038)]
mu2=[log(0.4017) log(0.0053)]
mu3=[log(0.4477) log(0.0051)]
mu4=[log(0.5396) log(0.0072)]
mu5=[log(0.6881) log(0.0090)]
mu6=[log(0.8091) log(0.0099)]
cov1=[0.052 0.0011;0.0011 0.044]
cov2=[0.054 0.0010;0.0010 0.078]
cov3=[0.126 0.011;0.011 0.23]
cov4=[0.092 0.0061;0.0061 0.12]
cov5=[0.113 0.0092;0.0092 0.14]
cov6=[0.1047 0.0217;0.0217 0.35]
X = [mvnrnd(mu1,cov1,50);mvnrnd(mu2,cov2,50);mvnrnd(mu3,cov3,50);mvnrnd(mu4,cov4,50);mvnrnd(mu5,cov5,50);mvnrnd(mu6,cov6,50)];
scatter(X(:,1),X(:,2),'g')
options = statset('MaxIter',200,'Display','final','TolFun',1e-6)
obj = gmdistribution.fit(X,6,'Options',options)
hold on
ezcontour(@(x,y)pdf(obj,[x y]),[-2.5 1],[-7 -2.5],300);
hold off
ezsurfc(@(x,y) pdf(obj,[x y]))
x = -2.5:0.1:1.5; y = -7.0:0.1:-3;
n = length(x);
gaussPDF = zeros(n, n);  % preallocate (the original preallocated 'a' but filled gaussPDF)
for i = 1:n
    for j = 1:n
        gaussPDF(i,j) = pdf(obj, [x(i) y(j)]);
    end
end
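One hedged suggestion beyond the code above: fitgmdist (the newer name for gmdistribution.fit) accepts a 'Replicates' option, and several EM restarts with a fixed seed make the fit reproducible and less sensitive to initialization on small samples:
rng(0);  % fix the random seed for reproducibility
options = statset('MaxIter', 200, 'TolFun', 1e-6);
obj = fitgmdist(X, 6, 'Options', options, 'Replicates', 10);  % keep the best of 10 EM runs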

Hidden Markov model classifying a sequence in Matlab

I'm very new to machine learning. I've read about MATLAB's Statistics Toolbox support for hidden Markov models, and I want to classify a given sequence of signals with it. I have 3D coordinates in a matrix P, i.e. [501x3], and I want to train a model based on that. Every complete trajectory ends at a specific set of points, i.e. at (0,0,0), where it achieves its target.
What is the appropriate pseudocode/approach for my scenario?
My pseudocode:
the 501x3 matrix P is the emission matrix, where each coordinate is a state
random NxN transition matrix values (but I'm confused about this)
generate a test sequence using the function hmmgenerate
train using hmmtrain(sequence, old_transition, old_emission)
give the final transition and emission matrices to hmmdecode with an unknown sequence to get the probability (also confusing)
EDIT 1:
In a nutshell, I want to classify 10 classes of trajectories, each of size [501x3], with an HMM. I want to sample 50 rows, i.e. [50x3], from each trajectory in order to build the model. I also have murphyk's HMM toolbox for such random sequences.
Here is a general outline of the approach to classifying d-dimensional sequences using hidden Markov models:
1) Training:
For each class k:
prepare an HMM model. This includes initializing the following:
a transition matrix: Q-by-Q matrix, where Q is the number of states
a vector of prior probabilities: Q-by-1 vector
the emission model: in your case the observations are 3D points, so you could use a multivariate normal distribution (with a specified mean vector and covariance matrix) or a Gaussian mixture model (a bunch of MVN distributions combined using mixture coefficients)
after properly initializing the above parameters, you train the HMM model, feeding it the set of sequences belonging to this class (EM algorithm).
2) Prediction
Next, to classify a new sequence X:
you compute the log-likelihood of the sequence under each model: log P(X | model_k)
then you pick the class that gave the highest likelihood; this is the class prediction (a short sketch follows).
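A hedged sketch of that step using Murphy's toolbox conventions (the cell array models{k} holding each class's trained parameters is a hypothetical container; mhmm_logprob is the toolbox's likelihood function, used the same way as in the demo code below):
K = numel(models);  % number of classes
loglik = zeros(K, 1);
for k = 1:K
    m = models{k};
    loglik(k) = mhmm_logprob(X, m.prior, m.transmat, m.mu, m.Sigma, m.mixmat);
end
[~, predictedClass] = max(loglik);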
As I mentioned in the comments, the Statistics Toolbox only implements discrete-observation HMM models, so you will have to find another library or implement the code yourself. Kevin Murphy's toolboxes (HMM toolbox, BNT, PMTK3) are popular choices in this domain.
Here are some answers I posted in the past using Kevin Murphy's toolboxes:
Issue in training hidden markov model and usage for classification
Simple example/use-case for a BNT gaussian_CPD
The above answers are somewhat different from what you are trying to do here, but they are a good place to start.
The statement/case calls for building and training a hidden Markov model with the following components, specifically using murphyk's HMM toolbox:
O = size of the observation vector (number of coefficients)
Q = number of states
T = number of vectors in a sequence
nex = number of sequences
M = number of mixtures
Demo Code (from murphyk's toolbox):
O = 8; %Number of coefficients in a vector
T = 420; %Number of vectors in a sequence
nex = 1; %Number of sequences
M = 1; %Number of mixtures
Q = 6; %Number of states
data = randn(O,T,nex);
% initial guess of parameters
prior0 = normalise(rand(Q,1));
transmat0 = mk_stochastic(rand(Q,Q));
if 0
    Sigma0 = repmat(eye(O), [1 1 Q M]);
    % Initialize each mean to a random data point
    indices = randperm(T*nex);
    mu0 = reshape(data(:,indices(1:(Q*M))), [O Q M]);
    mixmat0 = mk_stochastic(rand(Q,M));
else
    [mu0, Sigma0] = mixgauss_init(Q*M, data, 'full');
    mu0 = reshape(mu0, [O Q M]);
    Sigma0 = reshape(Sigma0, [O O Q M]);
    mixmat0 = mk_stochastic(rand(Q,M));
end
[LL, prior1, transmat1, mu1, Sigma1, mixmat1] = ...
mhmm_em(data, prior0, transmat0, mu0, Sigma0, mixmat0, 'max_iter', 5);
loglik = mhmm_logprob(data, prior1, transmat1, mu1, Sigma1, mixmat1);

MATLAB - How to calculate 2D least squares regression based on both x and y (regression surface)

I have a set of data with independent variables x and y. Now I'm trying to build a two-dimensional regression model with a regression surface cutting through my data points. However, I couldn't find a way to achieve this. Can anyone give me some assistance?
You could use my favorite, polyfitn, for linear or polynomial models. If you would like a different model, please edit your question or add a comment. HTH!
EDIT
Also, take a look here under Multiple Regression, likely can help you as well.
EDIT AGAIN
Sorry, I'm having too much fun with this; here's an example of multivariate regression using least squares with stock MATLAB:
t = (1:10)';
x = t;
y = exp(-t);
A = [ y x ];       % design matrix containing the two regressors
z = 10*y + 0.5*x;  % synthetic response with known coefficients
A\z                % least-squares solve recovers them
ans =
10.0000
0.5000
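To include an intercept in the same backslash solve, append a column of ones (a small extension of the example above; the recovered intercept should be ~0 here):
A = [ y x ones(size(t)) ];  % add an intercept column
coeffs = A \ z              % least-squares solution, approximately [10; 0.5; 0]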
If you are performing linear regression, the best tool is the regress function. Note that if you are fitting a model of the form y(x1,x2) = b1*f(x1) + b2*g(x2) + b3, this is still a linear regression, as long as you know the functions f and g.
Nsamp = 100; %number of samples
X1 = randn(Nsamp,1); %regressor 1 (could also be some computed f(x1) )
X2 = randn(Nsamp,1); %regressor 2 (could also be some computed g(x2) )
Y = X1 + X2 + randn(Nsamp,1); %generate some data to be regressed
%now run the regression
[b,bint,r,rint,stats] = regress(Y,[X1 X2 ones(Nsamp,1)]);
% 'b' contains the coefficients b1, b2, b3 of the fit (can be used to plot the regression surface)
% 'r' contains residuals of the fit
% 'stats' contains the overall regression R^2, F stat, p-value and error variance
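As a hedged sketch of plotting that regression surface from b (the grid limits are arbitrary choices for the randn data above):
[x1g, x2g] = meshgrid(linspace(-3, 3, 30));
Yfit = b(1)*x1g + b(2)*x2g + b(3);      % fitted surface from the coefficients
surf(x1g, x2g, Yfit, 'FaceAlpha', 0.5)
hold on
plot3(X1, X2, Y, '.')                   % overlay the data points
hold off
xlabel('x1'), ylabel('x2'), zlabel('y')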