Inputs for the ROC curve - matlab

I have a 2 column matrix, where in each row are observations for healthy (column 1) and not healthy (2 column) patients. Also, I have 5 partition values which should be used to plot ROC curve.
Could you please help me to understand how to get the inputs from this data for the perfcurve function?
Thank you for any reply!

I've made a small script that shows the basics of a perfcurve given a two column matrix input. If you execute this in MATLAB and take a careful look at it then you should have no trouble using perfcurve
%Simulate your data as Gaussian data with 1000 measurements in each group.
%Lets give them a mean difference of 1 and a standard deviation of 1.
Data = zeros(1000,2);
Data(:,1) = normrnd(0,1,1000,1);
Data(:,2) = normrnd(1,1,1000,1);
%Now the data is reshaped to a vector (required for perfcurve) and I create the labels.
Data = reshape(Data,2000,1);
Labels = zeros(size(Data,1),1);
Labels(end/2+1:end) = 1;
%Your bottom half of the data (initially second column) is now group 1, the
%top half is group 0.
%Lets set the positive class to group 1.
PosClass = 1;
%Now we have all required variables to call perfcurve. We will give
%perfcurve the 'Xvals' input to define the values at which the ROC curve is
%calculated. This parameter can be left out to let matlab calculate the
%curve at all values.
[X Y] = perfcurve(Labels,Data,PosClass, 'Xvals', 0:0.25:1);
%Lets plot this
plot(X,Y)
%One limitation in scripting it like this is that you must have equal group
%sizes for healthy and sick. If you reshape your Data matrix to a vector
%and keep a seperate labels vector then you can also handle groups of
%different sizes.

Related

How to sort rows of a matrix based on the costrain of another matrix?

The 6 faces method is a very cheap and fast way to calibrate and accelerometer like my MPU6050, here a great description of the method.
I made 6 tests to calibrate the accelerometer based on the g vector.
After that i build up a matrix and in each row is stored the mean of each axis expressed in m/s^2, thanks to this question i automatically calculated the mean for each column in each file.
The tests were randomly performed, i tested all the six positions, but i didn't follow any path.
So i sorted manually the final matrix, based on the sort of the Y matrix, my reference matrix.
The Y elements are fixed.
The matrix manually sorted is the following
Here how i manually sorted the matrix
meanmatrix=[ax ay az];
mean1=meanmatrix(1,:);
mean2=meanmatrix(2,:);
mean3=meanmatrix(3,:);
mean4=meanmatrix(4,:);
mean5=meanmatrix(5,:);
mean6=meanmatrix(6,:);
meanmatrix= [mean1; mean3; mean2; mean4;mean6;mean5];
Based on the Y matrix constrain how can sort my matrix without knwowing "a priori" wich is the test stored in the row?
Assuming that the bias on the accelerometer is not huge, you can look at the rows of your matrix and see with which of the rows in your Y matrix matches.
sorted_meanmatrix = zeros(size(meanmatrix));
for rows = 1:length(Y)
% Calculates the square of distance and see which row has a nearest distcance
[~,index] = min(sum((meanmatrix - Y(rows,:)).^2, 2));
sorted_meanmatrix(rows,:) = meanmatrix(index,:);
end

Generate random samples from arbitrary discrete probability density function in Matlab

I've got an arbitrary probability density function discretized as a matrix in Matlab, that means that for every pair x,y the probability is stored in the matrix:
A(x,y) = probability
This is a 100x100 matrix, and I would like to be able to generate random samples of two dimensions (x,y) out of this matrix and also, if possible, to be able to calculate the mean and other moments of the PDF. I want to do this because after resampling, I want to fit the samples to an approximated Gaussian Mixture Model.
I've been looking everywhere but I haven't found anything as specific as this. I hope you may be able to help me.
Thank you.
If you really have a discrete probably density function defined by A (as opposed to a continuous probability density function that is merely described by A), you can "cheat" by turning your 2D problem into a 1D problem.
%define the possible values for the (x,y) pair
row_vals = [1:size(A,1)]'*ones(1,size(A,2)); %all x values
col_vals = ones(size(A,1),1)*[1:size(A,2)]; %all y values
%convert your 2D problem into a 1D problem
A = A(:);
row_vals = row_vals(:);
col_vals = col_vals(:);
%calculate your fake 1D CDF, assumes sum(A(:))==1
CDF = cumsum(A); %remember, first term out of of cumsum is not zero
%because of the operation we're doing below (interp1 followed by ceil)
%we need the CDF to start at zero
CDF = [0; CDF(:)];
%generate random values
N_vals = 1000; %give me 1000 values
rand_vals = rand(N_vals,1); %spans zero to one
%look into CDF to see which index the rand val corresponds to
out_val = interp1(CDF,[0:1/(length(CDF)-1):1],rand_vals); %spans zero to one
ind = ceil(out_val*length(A));
%using the inds, you can lookup each pair of values
xy_values = [row_vals(ind) col_vals(ind)];
I hope that this helps!
Chip
I don't believe matlab has built-in functionality for generating multivariate random variables with arbitrary distribution. As a matter of fact, the same is true for univariate random numbers. But while the latter can be easily generated based on the cumulative distribution function, the CDF does not exist for multivariate distributions, so generating such numbers is much more messy (the main problem is the fact that 2 or more variables have correlation). So this part of your question is far beyond the scope of this site.
Since half an answer is better than no answer, here's how you can compute the mean and higher moments numerically using matlab:
%generate some dummy input
xv=linspace(-50,50,101);
yv=linspace(-30,30,100);
[x y]=meshgrid(xv,yv);
%define a discretized two-hump Gaussian distribution
A=floor(15*exp(-((x-10).^2+y.^2)/100)+15*exp(-((x+25).^2+y.^2)/100));
A=A/sum(A(:)); %normalized to sum to 1
%plot it if you like
%figure;
%surf(x,y,A)
%actual half-answer starts here
%get normalized pdf
weight=trapz(xv,trapz(yv,A));
A=A/weight; %A normalized to 1 according to trapz^2
%mean
mean_x=trapz(xv,trapz(yv,A.*x));
mean_y=trapz(xv,trapz(yv,A.*y));
So, the point is that you can perform a double integral on a rectangular mesh using two consecutive calls to trapz. This allows you to compute the integral of any quantity that has the same shape as your mesh, but a drawback is that vector components have to be computed independently. If you only wish to compute things which can be parametrized with x and y (which are naturally the same size as you mesh), then you can get along without having to do any additional thinking.
You could also define a function for the integration:
function res=trapz2(xv,yv,A,arg)
if ~isscalar(arg) && any(size(arg)~=size(A))
error('Size of A and var must be the same!')
end
res=trapz(xv,trapz(yv,A.*arg));
end
This way you can compute stuff like
weight=trapz2(xv,yv,A,1);
mean_x=trapz2(xv,yv,A,x);
NOTE: the reason I used a 101x100 mesh in the example is that the double call to trapz should be performed in the proper order. If you interchange xv and yv in the calls, you get the wrong answer due to inconsistency with the definition of A, but this will not be evident if A is square. I suggest avoiding symmetric quantities during the development stage.

How to select first component and calculate percentage of variation in PCA?

I have a matrix M where the columns are data points and the rows are features. Now I want to do PCA and select only the first component which has highest variance.
I know that I can do it in Matlab with [coeff,score,latent] = pca(M'). First I think I have to transpose matrix M.
How can I select now the first component? I'm not sure about the three different output matrices.
Second, I also want to calculate the percentage of variance explained for each component. How can I do this?
Indeed, you should transpose your input to have rows as data points and columns as features:
[coeff, score, latent, ~, explained] = pca(M');
The principal components are given by the columns of coeff in order of descending variance, so the first column holds the most important component. The variances for each component are given in latent, and the percentage of total variance explained is given in explained.
firstCompCoeff = coeff(:,1);
firstCompVar = latent(1);
For more information: pca documentation.
Note that the pca function requires the Statistics Toolbox. If you don't have it, you can either search the internet for an alternative or implement it yourself using svd.
If your matrix has dimensions m x n, where m is cases and n is variables:
% First you might want to normalize the matrix...
M = normalize(M);
% means very close to zero
round(mean(M),10)
% standard deviations all one
round(std(M),10)
% Perform a singular value decomposition of the matrix
[U,S,V] = svd(M);
% First Principal Component is the first column of V
V(:,1)
% Calculate percentage of variation
(var(S) / sum(var(S))) * 100

Selecting values plotted on a scatter3 plot

I have a 3d matrix of 100x100x100. Each point of that matrix has assigned a value that corresponds to a certain signal strength. If I plot all the points the result is incomprehensible and requires horsepower to compute, due to the large amount of points that are painted.
The next picture examplify the problem (in that case the matrix was 50x50x50 for reducing the computation time):
[x,y,z] = meshgrid(1:50,1:50,1:50);
scatter3(x(:),y(:),z(:),5,strength(:),'filled')
I would like to plot only the highest values (for example, the top 10). How can I do it?
One simple solution that came up in my mind is to asign "nan" to the values higher than the treshold.
Even the results are nice I think that it must be a most elegant solution to fix it.
Reshape it into an nx1 vector. Sort that vector and take the first ten values.
num_of_rows = size(M,1)
V = reshape(M,num_of_rows,1);
sorted_V = sort(V,'descend');
ind = sorted_V(1:10)
I am assuming that M is your 3D matrix. This will give you your top ten values in your matrix and the respective index. The you can use ind2sub() to get the x,y,z.

Mahalanobis distance in matlab: pdist2() vs. mahal() function

I have two matrices X and Y. Both represent a number of positions in 3D-space. X is a 50*3 matrix, Y is a 60*3 matrix.
My question: why does applying the mean-function over the output of pdist2() in combination with 'Mahalanobis' not give the result obtained with mahal()?
More details on what I'm trying to do below, as well as the code I used to test this.
Let's suppose the 60 observations in matrix Y are obtained after an experimental manipulation of some kind. I'm trying to assess whether this manipulation had a significant effect on the positions observed in Y. Therefore, I used pdist2(X,X,'Mahalanobis') to compare X to X to obtain a baseline, and later, X to Y (with X the reference matrix: pdist2(X,Y,'Mahalanobis')), and I plotted both distributions to have a look at the overlap.
Subsequently, I calculated the mean Mahalanobis distance for both distributions and the 95% CI and did a t-test and Kolmogorov-Smirnoff test to asses if the difference between the distributions was significant. This seemed very intuitive to me, however, when testing with mahal(), I get different values, although the reference matrix is the same. I don't get what the difference between both ways of calculating mahalanobis distance is exactly.
Comment that is too long #3lectrologos:
You mean this: d(I) = (Y(I,:)-mu)inv(SIGMA)(Y(I,:)-mu)'? This is just the formula for calculating mahalanobis, so should be the same for pdist2() and mahal() functions. I think mu is a scalar and SIGMA is a matrix based on the reference distribution as a whole in both pdist2() and mahal(). Only in mahal you are comparing each point of your sample set to the points of the reference distribution, while in pdist2 you are making pairwise comparisons based on a reference distribution. Actually, with my purpose in my mind, I think I should go for mahal() instead of pdist2(). I can interpret a pairwise distance based on a reference distribution, but I don't think it's what I need here.
% test pdist2 vs. mahal in matlab
% the purpose of this script is to see whether the average over the rows of E equals the values in d...
% data
X = []; % 50*3 matrix, data omitted
Y = []; % 60*3 matrix, data omitted
% calculations
S = nancov(X);
% mahal()
d = mahal(Y,X); % gives an 60*1 matrix with a value for each Cartesian element in Y (second matrix is always the reference matrix)
% pairwise mahalanobis distance with pdist2()
E = pdist2(X,Y,'mahalanobis',S); % outputs an 50*60 matrix with each ij-th element the pairwise distance between element X(i,:) and Y(j,:) based on the covariance matrix of X: nancov(X)
%{
so this is harder to interpret than mahal(), as elements of Y are not just compared to the "mahalanobis-centroid" based on X,
% but to each individual element of X
% so the purpose of this script is to see whether the average over the rows of E equals the values in d...
%}
F = mean(E); % now I averaged over the rows, which means, over all values of X, the reference matrix
mean(d)
mean(E(:)) % not equal to mean(d)
d-F' % not zero
% plot output
figure(1)
plot(d,'bo'), hold on
plot(mean(E),'ro')
legend('mahal()','avaraged over all x values pdist2()')
ylabel('Mahalanobis distance')
figure(2)
plot(d,'bo'), hold on
plot(E','ro')
plot(d,'bo','MarkerFaceColor','b')
xlabel('values in matrix Y (Yi) ... or ... pairwise comparison Yi. (Yi vs. all Xi values)')
ylabel('Mahalanobis distance')
legend('mahal()','pdist2()')
One immediate difference between the two is that mahal subtracts the sample mean of X from each point in Y before computing distances.
Try something like E = pdist2(X,Y-mean(X),'mahalanobis',S); to see if that gives you the same results as mahal.
Note that
mahal(X,Y)
is equivalent to
pdist2(X,mean(Y),'mahalanobis',cov(Y)).^2
Well, I guess there are two different ways to calculate mahalanobis distance between two clusters of data like you explain above:
1) you compare each data point from your sample set to mu and sigma matrices calculated from your reference distribution (although labeling one cluster sample set and the other reference distribution may be arbitrary), thereby calculating the distance from each point to this so called mahalanobis-centroid of the reference distribution.
2) you compare each datapoint from matrix Y to each datapoint of matrix X, with, X the reference distribution (mu and sigma are calculated from X only)
The values of the distances will be different, but I guess the ordinal order of dissimilarity between clusters is preserved when using either method 1 or 2? I actually wonder when comparing 10 different clusters to a reference matrix X, or to each other, if the order of the dissimilarities would differ using method 1 or method 2? Also, I can't imagine a situation where one method would be wrong and the other method not. Although method 1 seems more intuitive in some situations, like mine.