Greetings,
How can I calculate how many distance calculations would need to be performed to classify the IRIS dataset using a Nearest Mean Classifier?
I know that IRIS dataset has 4 features and every record is classified according to 3 different labels.
According to some textbooks, the calculation can be carried out as follows:
However, I am lost on these different notations and on what this equation means. For example, what is s^2 in the equation?
The notation is standard with most machine learning textbooks. s in this case is the sample standard deviation for the training set. It is quite common to assume that each class has the same standard deviation, which is why every class is assigned the same value.
However, you shouldn't be paying attention to that. The most important point is the case where the priors are equal. This is a fair assumption, which means that you expect each class to be roughly equally represented in your dataset. With equal priors, the classifier simply boils down to finding the smallest distance from a sample x to each of the classes, which are represented by their mean vectors.
How you'd compute this is quite simple. In your training set, you have a set of training examples with each example belonging to a particular class. For the case of the iris dataset, you have three classes. You find the mean feature vector for each class, which would be stored as m1, m2 and m3 respectively. After, to classify a new feature vector, simply find the smallest distance from this vector to each of the mean vectors. Whichever one has the smallest distance is the class you'd assign.
Since you chose MATLAB as the language, allow me to demonstrate with the actual iris dataset.
load fisheriris; % Load iris dataset
[~,~,id] = unique(species); % Assign for each example a unique ID
means = zeros(3, 4); % Store the mean vectors for each class
for i = 1 : 3 % Find the mean vectors per class
    means(i,:) = mean(meas(id == i, :), 1); % Find the mean vector for class i
end
x = meas(10, :); % Choose a random row from the dataset
% Determine which class has the smallest distance and thus figure out the class
[~,c] = min(sum(bsxfun(@minus, x, means).^2, 2));
The code is fairly straightforward. Load in the dataset and since the labels are in a cell array, it's handy to create a new set of labels that are enumerated as 1, 2 and 3 so that it's easy to isolate out the training examples per class and compute their mean vectors. That's what's happening in the for loop. Once that's done, I choose a random data point from the training set then compute the distance from this point to each of the mean vectors. We choose the class that gives us the smallest distance.
If you wanted to do this for the entire dataset, you can but that will require some permutation of the dimensions to do so.
data = permute(meas, [1 3 2]);
means_p = permute(means, [3 1 2]);
P = sum(bsxfun(@minus, data, means_p).^2, 3);
[~,c] = min(P, [], 2);
data and means_p are the transformed features and mean vectors in a way that is a 3D matrix with a singleton dimension. The third line of code computes the distances vectorized so that it finally generates a 2D matrix with each row i calculating the distance from the training example i to each of the mean vectors. We finally find the class with the smallest distance for each example.
To get a sense of the accuracy, we can simply compute the fraction of the total number of times we classified correctly:
>> sum(c == id) / numel(id)
ans =
0.9267
With this simple nearest mean classifier, we have an accuracy of 92.67%... not bad, but you can do better. Finally, to answer your question, you would need K * d distance calculations, with K being the number of examples and d being the number of classes. For the iris dataset, that is 150 * 3 = 450. You can see why by examining the logic and code above: every example is compared once against every class mean.
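The count can be checked with a small sketch that uses explicit loops and a counter. It assumes the iris sizes (150 examples, 4 features, 3 classes) and uses placeholder numbers rather than the real measurements; all variable names are illustrative.

```matlab
% Sketch: count distance evaluations for a nearest mean classifier.
% Placeholder data with the iris dimensions, NOT the real iris values.
K = 150;                      % number of examples to classify
d = 3;                        % number of classes (one mean vector each)
X = reshape(1:K*4, K, 4);     % stand-in feature matrix
M = X([1 51 101], :);         % stand-in class means, one row per class

count = 0;
labels = zeros(K, 1);
for i = 1:K
    best = inf;
    for j = 1:d
        dist = sum((X(i,:) - M(j,:)).^2);   % one distance calculation
        count = count + 1;
        if dist < best
            best = dist;
            labels(i) = j;
        end
    end
end
disp(count)                   % K * d
```

Computing the class means beforehand is not counted here, since it involves averaging rather than distance calculations.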
I understand that GMM is not a classifier itself, but I am trying to follow the instructions of some users in this stack exchange post below to create a GMM-inspired classifier.
lejlot: Multiclass classification using Gaussian Mixture Models with scikit learn
"construct your own classifier where you fit one GMM per label and then use assigned probability to do actual classification. Then it is a proper classifier"
What is meant by "assigned probability" for GMM Matlab objects in the above quote and how can we input a new point to get our desired assigned probability? For a new point that we are trying to classify, my understanding is that we need to get the posterior probabilities that the new point belongs to either Gaussian and then compare these two probabilities.
It looks from the documentation https://www.mathworks.com/help/stats/gmdistribution.html
like we only have access to cluster center mu's and covariance matrices (sigma) but not an actual probability distribution that would take in a point and spit out a probability
podludek: Multiclass classification using Gaussian Mixture Models with scikit learn
"GMM is not a classifier, but generative model. You can use it to a classification problem by applying Bayes theorem.....You should use GMM as a posterior distribution, one GMM per each class." -
In the documentation in Matlab for posterior(gm,X), the tutorial shows us inputting X, which is already the data we used to create ("train") our GMM. But how can we get the posterior probability of being in a cluster for a new point?
https://www.mathworks.com/help/stats/gmdistribution.posterior.html
"P = posterior(gm,X) returns the posterior probability of each Gaussian mixture component in gm given each observation in X"
--> But the X used in the link above is the 'training' data used to create the GMM itself, not a new point. Also we have two gm objects, not one. How can we grab the probability a point belongs to a Gaussian?
The pseudocode below is how I envisioned a GMM-inspired classifier would go for a two-class example: I would fit GMMs to individual clusters as described by podludek. Then, I would use the posterior probabilities of a point being in each cluster and pick the bigger probability.
I'm aware there are issues with this conceptually (such as the two GMM objects having conflicting covariance matrices) but I've been assured by my mentor that there is a way to make a supervised version of GMM, and he wants me to make one, so here we go:
Pseudocode:
X % The training data matrix
% each new row is a new data point
% each column is new feature
% Ex: if you had 10,000 data points and 100 features for each, your matrix
% would be 10000 by 100
% Let's say we had 200 points of each class in our training data
% Grab subsets of X that corresponds to classes 1 and 2
X_only_class_2 = X(1:200,:)
X_only_class_1 = X(201:end,:)
gmfit_class_1 = fitgmdist(X_only_class_1,1,'RegularizationValue',0.1);
cov_matrix_1=gmfit_class_1.Sigma;
gmfit_class_2 = fitgmdist(X_only_class_2,1,'RegularizationValue',0.1);
cov_matrix_2=gmfit_class_2.Sigma;
% Now do some tests on data we already know the classification of to check if this is working as we would expect:
a = posterior(gmfit_class_1,X_only_class_1)
b = posterior(gmfit_class_1,X_only_class_2)
c = posterior(gmfit_class_2,X_only_class_1)
d = posterior(gmfit_class_2,X_only_class_2)
But unfortunately, computing these posteriors a, b, c, and d just result in column vectors of 1's. I'm aware these are degenerate cases (and pointless for actual classification since we already know the classifications of our training data) but I still wanted to test them to make sure the posterior method is working as I would expect.
Expected:
a = posterior(gmfit_class_1,X_only_class_1)
% ^ This produces a column vector of 1's, which I thought was fine. After all, the gmfit object was trained on those points
b = posterior(gmfit_class_1,X_only_class_2)
% ^ This one also produces a vector of 1's, which I thought was wrong. It should be a vector of low, but nonzero numbers
c = posterior(gmfit_class_2,X_only_class_1)
% ^ This one also produces a vector of 1's, which I thought was wrong. It should be a vector of low, but nonzero numbers
d = posterior(gmfit_class_2,X_only_class_2)
% ^ This produces a column vector of 1's, which I thought was fine. After all, the gmfit object was trained on those points
I have to think that somehow Matlab is being confused by how in both gmm fit models, there is only one cluster in each. Either that or I am not interpreting the posterior method correctly.
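One likely explanation for the columns of 1's: posterior(gm,X) reports, for each row of X, how the probability is split among the components inside that one gm object. With a single component there is nothing to split, so the value is 1 for any input. Comparing classes instead needs the density of the point under each class model, weighted by the class priors. The sketch below illustrates the idea with base MATLAB only, fitting one Gaussian per class by hand. The toy data, the helpers quadForm and logGauss, and the equal priors are all assumptions made for illustration.

```matlab
% Sketch: two-class classifier built from one Gaussian per class (Bayes' rule).
X1 = [0 0; 1 0; 0 1; 1 1; 0.5 0.2];        % class 1 training points (toy data)
X2 = [4 4; 5 4; 4 5; 5 5; 4.5 4.8];        % class 2 training points (toy data)

% Per-class mean and covariance; the small ridge plays the role of a
% regularization value and is an arbitrary choice here.
mu1 = mean(X1, 1);  S1 = cov(X1) + 0.1*eye(2);
mu2 = mean(X2, 1);  S2 = cov(X2) + 0.1*eye(2);

% Log-density of a multivariate Gaussian evaluated at each row of X
quadForm = @(D, S) sum((D / S) .* D, 2);
logGauss = @(X, mu, S) -0.5 * (quadForm(bsxfun(@minus, X, mu), S) ...
                               + log(det(S)) + size(X, 2) * log(2*pi));

xnew  = [0.8 0.6; 4.2 4.9];                % new points to classify
prior = [0.5 0.5];                         % assumed equal class priors

logp = [logGauss(xnew, mu1, S1) + log(prior(1)), ...
        logGauss(xnew, mu2, S2) + log(prior(2))];

% Normalise across classes to get class posteriors (each row sums to 1)
logp = bsxfun(@minus, logp, max(logp, [], 2));
post = bsxfun(@rdivide, exp(logp), sum(exp(logp), 2));
[~, label] = max(post, [], 2);
disp(label)
```

With the fitted objects from the question, the per-class density would presumably come from evaluating each model's pdf at the new point (for example pdf(gmfit_class_1, xnew)) rather than from posterior; treat that call as something to confirm against the gmdistribution documentation.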
Suppose there is a matrix B, where its size is a 500*1000 double(Here, 500 represents the number of observations and 1000 represents the number of features).
sigma is the covariance matrix of B, and D is a diagonal matrix whose diagonal elements are the eigenvalues of sigma. Assume A is the eigenvectors of the covariance matrix sigma.
I have the following questions:
I need to select the first k = 800 eigenvectors corresponding to the eigenvalues with the largest magnitude to rank the selected features. The final matrix named Aq. How can I do this in MATLAB?
What is the meaning of these selected eigenvectors?
It seems the size of the final matrix Aq is 1000*800 double once I calculate Aq. The time points/observation information of 500 has disappeared. For the final matrix Aq, what does the value 1000 in matrix Aq represent now? Also, what does the value 800 in matrix Aq represent now?
I'm assuming you determined the eigenvectors from the eig function. What I would recommend to you in the future is to use the eigs function. This not only computes the eigenvalues and eigenvectors for you, but it will compute the k largest eigenvalues with their associated eigenvectors for you. This may save computational overhead where you don't have to compute all of the eigenvalues and associated eigenvectors of your matrix as you only want a subset. You simply supply the covariance matrix of your data to eigs and it returns the k largest eigenvalues and eigenvectors for you.
Now, back to your problem, what you are describing is ultimately Principal Component Analysis. The mechanics behind this would be to compute the covariance matrix of your data and find the eigenvalues and eigenvectors of the computed result. This route is generally not recommended for large matrices, because forming the covariance matrix and then computing its eigenvalues and eigenvectors can be numerically unstable. The most canonical way to do this now is via Singular Value Decomposition of the mean-subtracted data. Concretely, the columns of the V matrix give you the eigenvectors of the covariance matrix, or the principal components, and the associated eigenvalues are the squares of the singular values on the diagonal of the matrix S, divided by n - 1, where n is the number of observations.
See this informative post on Cross Validated as to why this is preferred:
https://stats.stackexchange.com/questions/79043/why-pca-of-data-by-means-of-svd-of-the-data
I'll throw in another link as well that talks about the theory behind why the Singular Value Decomposition is used in Principal Component Analysis:
https://stats.stackexchange.com/questions/134282/relationship-between-svd-and-pca-how-to-use-svd-to-perform-pca
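As a quick numerical check of how the two routes relate, the sketch below compares the eigenvalues of cov(B) with the singular values of the mean-subtracted data on a small made-up matrix. For n observations, the eigenvalues of the covariance matrix equal the squared singular values divided by n - 1. The data and variable names are illustrative only.

```matlab
% Sketch: eigenvalues of cov(B) versus singular values of the centred data.
B = [2 0 1; 4 1 3; 6 1 2; 8 3 5; 10 2 9];      % toy data, 5 observations x 3 features
n = size(B, 1);
Bm = bsxfun(@minus, B, mean(B, 1));            % subtract each column's mean

[~, S, V] = svd(Bm, 0);                        % economy-size SVD
eigFromSvd = diag(S).^2 / (n - 1);             % svd returns these in descending order

eigFromCov = sort(eig(cov(B)), 'descend');     % eig output sorted explicitly

disp(max(abs(eigFromSvd - eigFromCov)))        % close to zero

% Columns of V satisfy the eigenvector equation for cov(B)
residual = norm(cov(B) * V - V * diag(eigFromSvd));
disp(residual)                                 % close to zero
```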
Now let's answer your question one at a time.
Question #1
MATLAB generates the eigenvalues and the corresponding ordering of the eigenvectors in such a way where they are unsorted. If you wish to select out the largest k eigenvalues and associated eigenvectors given the output of eig (800 in your example), you'll need to sort the eigenvalues in descending order, then rearrange the columns of the eigenvector matrix produced from eig then select out the first k values.
I should also note that using eigs will not guarantee sorted order, so you will have to explicitly sort these too when it comes down to it.
In MATLAB, doing what we described above would look something like this:
sigma = cov(B);
[A,D] = eig(sigma);
vals = diag(D);
[~,ind] = sort(abs(vals), 'descend');
Asort = A(:,ind);
It's a good thing to note that the sorting is done on the absolute value of the eigenvalues. For a general symmetric matrix, an eigenvalue of large magnitude marks an important component whatever its sign: if we had a component whose eigenvalue was, say, -10000 and we sorted purely on the signed numbers, it would get placed near the lower ranks. A covariance matrix is positive semidefinite, so its eigenvalues should be non-negative, and here abs only guards against tiny negative values caused by round-off.
The first line of code finds the covariance matrix of B, even though you said it's already stored in sigma, but let's make this reproducible. Next, we find the eigenvalues of your covariance matrix and the associated eigenvectors. Take note that each column of the eigenvector matrix A represents one eigenvector. Specifically, the ith column / eigenvector of A corresponds to the ith eigenvalue seen in D.
However, the eigenvalues are in a diagonal matrix, so we extract out the diagonals with the diag command, sort them and figure out their ordering, then rearrange A to respect this ordering. I use the second output of sort because it tells you the position of where each value in the unsorted result would appear in the sorted result. This is the ordering we need to rearrange the columns of the eigenvector matrix A. It's imperative that you choose 'descend' as the flag so that the largest eigenvalue and associated eigenvector appear first, just like we talked about before.
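A tiny example may make the role of the second output of sort clearer. The numbers below are made up, and the identity matrix stands in for an eigenvector matrix so the column shuffle is easy to see.

```matlab
% Sketch: reorder eigenvector columns using the index output of sort.
vals = [2; 9; 5];                 % unsorted "eigenvalues" (made-up numbers)
A = eye(3);                       % stand-in eigenvector matrix, one vector per column

[sorted, ind] = sort(abs(vals), 'descend');
Asort = A(:, ind);                % column order now follows the sorted eigenvalues

disp(sorted.')                    % 9 5 2
disp(ind.')                       % 2 3 1
```

Here ind says that the largest value came from position 2, the next from position 3 and the smallest from position 1, so Asort holds columns 2, 3 and 1 of A in that order.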
You can then pluck out the first k largest vectors and values via:
k = 800;
Aq = Asort(:,1:k);
Question #2
It's a well known fact that the eigenvectors of the covariance matrix are equal to the principal components. Concretely, the first principal component (i.e. the eigenvector associated with the largest eigenvalue) gives you the direction of the maximum variability in your data. Each principal component after that captures a decreasing amount of variability. It's also good to note that the principal components are mutually orthogonal.
Here's a good example from Wikipedia for two dimensional data:
I pulled the above image from the Wikipedia article on Principal Component Analysis, which I linked you to above. This is a scatter plot of samples that are distributed according to a bivariate Gaussian distribution centred at (1,3) with a standard deviation of 3 in roughly the (0.878, 0.478) direction and of 1 in the orthogonal direction. The component with a standard deviation of 3 is the first principal component while the one that is orthogonal is the second component. The vectors shown are the eigenvectors of the covariance matrix scaled by the square root of the corresponding eigenvalue, and shifted so their tails are at the mean.
Now let's get back to your question. The reason why we take a look at the k largest eigenvalues is a way of performing dimensionality reduction. Essentially, you would be performing a data compression where you would take your higher dimensional data and project them onto a lower dimensional space. The more principal components you include in your projection, the more it will resemble the original data. It actually begins to taper off at a certain point, but the first few principal components allow you to faithfully reconstruct your data for the most part.
A great visual example of performing PCA (or SVD rather) and data reconstruction is found by this great Quora post I stumbled upon in the past.
http://qr.ae/RAEU8a
Question #3
You would use this matrix to reproject your higher dimensional data onto a lower dimensional space. The number of rows being 1000 is still there, which means that there were originally 1000 features in your dataset. The 800 is what the reduced dimensionality of your data would be. Consider this matrix as a transformation from the original dimensionality of a feature (1000) down to its reduced dimensionality (800).
You would then use this matrix in conjunction with reconstructing what the original data was. Concretely, this would give you an approximation of what the original data looked like with the least amount of error. In this case, you don't need to use all of the principal components (i.e. just the k largest vectors) and you can create an approximation of your data with less information than what you had before.
How you reconstruct your data is very simple. Let's talk about the forward and reverse operations first with the full data. The forward operation is to take your original data and reproject it but instead of the lower dimensionality, we will use all of the components. You first need to have your original data but mean subtracted:
Bm = bsxfun(@minus, B, mean(B,1));
Bm will produce a matrix where each feature of every sample is mean subtracted. bsxfun allows the subtraction of two matrices of unequal dimensions provided that you can broadcast the dimensions so that they both match up. Specifically, what will happen in this case is that the mean of each column / feature of B will be computed and a temporary replicated matrix will be produced that is as large as B. When you subtract this replicated matrix from your original data, every data point has its respective feature means subtracted, thus centering your data so that the mean of each feature is 0.
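To see the broadcasting concretely, the sketch below applies the same call to a 3-by-2 matrix of made-up numbers and compares it with an explicit repmat version, which builds the replicated matrix by hand.

```matlab
% Sketch: mean subtraction with bsxfun versus an explicit replicated matrix.
B = [1 10; 2 20; 6 30];                          % toy data, 3 samples x 2 features
colMeans = mean(B, 1);                           % [3 20]

Bm  = bsxfun(@minus, B, colMeans);               % broadcast the row of means
Bm2 = B - repmat(colMeans, size(B, 1), 1);       % same thing with explicit replication

disp(Bm)                                         % [-2 -10; -1 0; 3 10]
disp(mean(Bm, 1))                                % each feature now has mean 0
```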
Once you do this, the operation to project is simply:
Bproject = Bm*Asort;
The above operation is quite simple. What you are doing is expressing each sample's features as a linear combination of principal components. For example, given the first sample or first row of the mean-subtracted data, the first sample's first feature in the projected domain is a dot product of the row vector that pertains to the entire sample and the first principal component, which is a column vector. The first sample's second feature in the projected domain is a weighted sum of the entire sample and the second component. You would repeat this for all samples and all principal components. In effect, you are reprojecting the data so that it is with respect to the principal components, which are orthogonal basis vectors that transform your data from one representation to another.
A better description of what I just talked about can be found here. Look at Amro's answer:
Matlab Principal Component Analysis (eigenvalues order)
Now to go backwards, you simply do the inverse operation, but a special property with the eigenvector matrix is that if you transpose this, you get the inverse. To get the original data back, you undo the operation above and add the means back to the problem:
out = bsxfun(@plus, Bproject*Asort.', mean(B, 1));
You want to get the original data back, so you're solving for Bm with respect to the previous operation that I did. However, the inverse of Asort is just the transpose here. What's happening after you perform this operation is that you are getting the original data back, but the data is still mean-subtracted. To get the original data back, you must add the means of each feature back into the data matrix to get the final result. That's why we're using another bsxfun call here so that you can do this for each sample's feature values.
You should be able to go back and forth from the original domain and projected domain with the above two lines of code. Now where the dimensionality reduction (or the approximation of the original data) comes into play is the reverse operation. What you need to do first is project the data onto the bases of the principal components (i.e. the forward operation), but now to go back to the original domain where we are trying to reconstruct the data with a reduced number of principal components, you simply replace Asort in the above code with Aq and also reduce the amount of features you're using in Bproject. Concretely:
out = bsxfun(@plus, Bproject(:,1:k)*Aq.', mean(B, 1));
Doing Bproject(:,1:k) selects out the k features in the projected domain of your data, corresponding to the k largest eigenvectors. Interestingly, if you just want the representation of the data with regards to a reduced dimensionality, you can just use Bproject(:,1:k) and that'll be enough. However, if you want to go forward and compute an approximation of the original data, we need to compute the reverse step. The above code is simply what we had before with the full dimensionality of your data, but we use Aq as well as selecting out the k features in Bproject. This will give you the original data that is represented by the k largest eigenvectors / eigenvalues in your matrix.
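The forward and reverse steps can be tried end to end on a small matrix. The sketch below uses made-up 5-by-3 data and measures the reconstruction error for k = 1, 2, 3; the error should shrink as k grows and vanish (up to round-off) when all components are used. Variable names follow the answer where possible; mu and err are illustrative additions.

```matlab
% Sketch: reconstruction error as a function of the number of components k.
B  = [2 0 1; 4 1 3; 6 1 2; 8 3 5; 10 2 9];   % toy data, 5 samples x 3 features
mu = mean(B, 1);
Bm = bsxfun(@minus, B, mu);                   % mean-subtracted data

[A, D] = eig(cov(B));
[~, ind] = sort(abs(diag(D)), 'descend');
Asort = A(:, ind);                            % eigenvectors, largest eigenvalue first

Bproject = Bm * Asort;                        % forward operation

err = zeros(1, 3);
for k = 1:3
    Aq  = Asort(:, 1:k);                                  % keep k components
    out = bsxfun(@plus, Bproject(:, 1:k) * Aq.', mu);     % reverse operation
    err(k) = norm(out - B, 'fro');                        % reconstruction error
end
disp(err)                                     % non-increasing, last entry near 0
```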
If you'd like to see an awesome example, I'll mimic the Quora post that I linked to you but using another image. Consider doing this with a grayscale image where each row is a "sample" and each column is a feature. Let's take the cameraman image that's part of the image processing toolbox:
im = imread('cameraman.tif');
imshow(im); %// Using the image processing toolbox
We get this image:
This is a 256 x 256 image, which means that we have 256 data points and each point has 256 features. What I'm going to do is convert the image to double for precision in computing the covariance matrix. Then I'm going to repeat the above code, but increasing k at each go through 3, 11, 15, 25, 45, 65, 125 and 155. Therefore, for each k, we are introducing more principal components and we should slowly start to get a reconstruction of our data.
Here's some runnable code that illustrates my point:
%%%%%%%// Pre-processing stage
clear all;
close all;
%// Read in image - make sure we cast to double
B = double(imread('cameraman.tif'));
%// Calculate covariance matrix
sigma = cov(B);
%// Find eigenvalues and eigenvectors of the covariance matrix
[A,D] = eig(sigma);
vals = diag(D);
%// Sort their eigenvalues
[~,ind] = sort(abs(vals), 'descend');
%// Rearrange eigenvectors
Asort = A(:,ind);
%// Find mean subtracted data
Bm = bsxfun(@minus, B, mean(B,1));
%// Reproject data onto principal components
Bproject = Bm*Asort;
%%%%%%%// Begin reconstruction logic
figure;
counter = 1;
for k = [3 11 15 25 45 65 125 155]
%// Extract out highest k eigenvectors
Aq = Asort(:,1:k);
%// Project back onto original domain
    out = bsxfun(@plus, Bproject(:,1:k)*Aq.', mean(B, 1));
%// Place projection onto right slot and show the image
subplot(4, 2, counter);
counter = counter + 1;
imshow(out,[]);
title(['k = ' num2str(k)]);
end
As you can see, the majority of the code is the same from what we have seen. What's different is that I loop over all values of k, project back onto the original space (i.e. computing the approximation) with the k highest eigenvectors, then show the image.
We get this nice figure:
As you can see, starting with k=3 doesn't really do us any favours... we can see some general structure, but it wouldn't hurt to add more in. As we start increasing the number of components, we start to get a clearer picture of what the original data looks like. At k=25, we actually can see what the cameraman looks like perfectly, and we don't need components 26 and beyond to see what's happening. This is what I was talking about with regards to data compression where you don't need to work on all of the principal components to get a clear picture of what's going on.
I'd like to end this note by referring you to Chris Taylor's wonderful exposition on the topic of Principal Components Analysis, with code, graphs and a great explanation to boot! This is where I got started on PCA, but the Quora post is what solidified my knowledge.
Matlab - PCA analysis and reconstruction of multi dimensional data
I've got an arbitrary probability density function discretized as a matrix in Matlab, that means that for every pair x,y the probability is stored in the matrix:
A(x,y) = probability
This is a 100x100 matrix, and I would like to be able to generate random samples of two dimensions (x,y) out of this matrix and also, if possible, to be able to calculate the mean and other moments of the PDF. I want to do this because after resampling, I want to fit the samples to an approximated Gaussian Mixture Model.
I've been looking everywhere but I haven't found anything as specific as this. I hope you may be able to help me.
Thank you.
If you really have a discrete probability density function defined by A (as opposed to a continuous probability density function that is merely described by A), you can "cheat" by turning your 2D problem into a 1D problem.
%define the possible values for the (x,y) pair
row_vals = [1:size(A,1)]'*ones(1,size(A,2)); %all x values
col_vals = ones(size(A,1),1)*[1:size(A,2)]; %all y values
%convert your 2D problem into a 1D problem
A = A(:);
row_vals = row_vals(:);
col_vals = col_vals(:);
%calculate your fake 1D CDF, assumes sum(A(:))==1
CDF = cumsum(A); %remember, first term out of cumsum is not zero
%because of the operation we're doing below (interp1 followed by ceil)
%we need the CDF to start at zero
CDF = [0; CDF(:)];
%generate random values
N_vals = 1000; %give me 1000 values
rand_vals = rand(N_vals,1); %spans zero to one
%look into CDF to see which index the rand val corresponds to
out_val = interp1(CDF,[0:1/(length(CDF)-1):1],rand_vals); %spans zero to one
ind = ceil(out_val*length(A));
%using the inds, you can lookup each pair of values
xy_values = [row_vals(ind) col_vals(ind)];
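One caveat with the interp1 lookup above: interp1 expects distinct sample points, so it may fail when A contains zero-probability cells, since those produce repeated values in the CDF. The sketch below shows an alternative lookup that tolerates zeros. It uses a toy 2-by-3 table P in place of A, and all variable names are illustrative.

```matlab
% Sketch: sample (row, column) index pairs from a discrete 2D table with zeros.
P = [0 0.1 0.2; 0.3 0 0.4];           % toy probabilities, sum to 1, two empty cells

cdf = cumsum(P(:));                   % 1D CDF over the column-major cells
cdf(end) = 1;                         % guard against round-off in the last entry

N = 1000;
r = rand(N, 1);                       % uniform random numbers
ind = zeros(N, 1);
for n = 1:N
    % first cell whose cumulative probability reaches r(n)
    ind(n) = find(cdf >= r(n), 1, 'first');
end

[xs, ys] = ind2sub(size(P), ind);     % back to 2D indices
xy_values = [xs ys];
```

Because the first matching cell is taken, a cell with zero probability shares its CDF value with the cell before it and is never selected.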
I hope that this helps!
Chip
I don't believe MATLAB has built-in functionality for generating multivariate random variables with an arbitrary distribution. As a matter of fact, the same is true for univariate random numbers. But while the latter can easily be generated by inverting the cumulative distribution function, that inverse-CDF approach does not carry over directly to multivariate distributions, since a joint CDF has no single inverse and the variables are generally correlated, so generating such numbers is much messier. So this part of your question is far beyond the scope of this site.
Since half an answer is better than no answer, here's how you can compute the mean and higher moments numerically using matlab:
%generate some dummy input
xv=linspace(-50,50,101);
yv=linspace(-30,30,100);
[x y]=meshgrid(xv,yv);
%define a discretized two-hump Gaussian distribution
A=floor(15*exp(-((x-10).^2+y.^2)/100)+15*exp(-((x+25).^2+y.^2)/100));
A=A/sum(A(:)); %normalized to sum to 1
%plot it if you like
%figure;
%surf(x,y,A)
%actual half-answer starts here
%get normalized pdf
weight=trapz(xv,trapz(yv,A));
A=A/weight; %A normalized to 1 according to trapz^2
%mean
mean_x=trapz(xv,trapz(yv,A.*x));
mean_y=trapz(xv,trapz(yv,A.*y));
So, the point is that you can perform a double integral on a rectangular mesh using two consecutive calls to trapz. This allows you to compute the integral of any quantity that has the same shape as your mesh, but a drawback is that vector components have to be computed independently. If you only wish to compute things which can be parametrized with x and y (which are naturally the same size as your mesh), then you can get along without having to do any additional thinking.
You could also define a function for the integration:
function res=trapz2(xv,yv,A,arg)
if ~isscalar(arg) && any(size(arg)~=size(A))
    error('Size of A and arg must be the same!')
end
res=trapz(xv,trapz(yv,A.*arg));
end
This way you can compute stuff like
weight=trapz2(xv,yv,A,1);
mean_x=trapz2(xv,yv,A,x);
NOTE: the reason I used a 101x100 mesh in the example is that the double call to trapz should be performed in the proper order. If you interchange xv and yv in the calls, you get the wrong answer due to inconsistency with the definition of A, but this will not be evident if A is square. I suggest avoiding symmetric quantities during the development stage.
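The same double call to trapz extends to second moments. The sketch below assumes a single Gaussian test density, centred at (10, -5) with standard deviations 5 and 3, chosen so the results are easy to check by eye. The helper int2 is a hypothetical name for the double integral, not a built-in.

```matlab
% Sketch: variances and covariance of a discretized 2D density via trapz.
xv = linspace(-50, 50, 101);
yv = linspace(-30, 30, 100);
[x, y] = meshgrid(xv, yv);                     % x and y are 100 x 101

% Test density (an assumption for this example): sigma_x = 5, sigma_y = 3
A = exp(-((x - 10).^2 / (2*5^2) + (y + 5).^2 / (2*3^2)));

% Double integral: inner trapz runs along yv (rows), outer along xv
int2 = @(F) trapz(xv, trapz(yv, F));

A = A / int2(A);                               % normalise so the density integrates to 1

mean_x = int2(A .* x);
mean_y = int2(A .* y);
var_x  = int2(A .* (x - mean_x).^2);           % should be close to 25
var_y  = int2(A .* (y - mean_y).^2);           % should be close to 9
cov_xy = int2(A .* (x - mean_x) .* (y - mean_y));   % close to 0 for this density
```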
I am trying to understand principal component analysis in Matlab,
There seem to be at least 3 different functions that do it.
I have some questions re the code below:
Am I creating approximate x values using only one eigenvector (the one corresponding to the largest eigenvalue) correctly? I think so??
Why are PC and V which are both meant to be the loadings for (x'x) presented differently? The column order is reversed because eig does not order the eigenvalues with the largest value first but why are they the negative of each other?
Why are the eig values not ordered, with the eigenvector corresponding to the largest eigenvalue in the first column?
Using the code below I get back to the input matrix x when using svd and eig, but the results from princomp seem to be totally different. What do I have to do to make princomp match the other two functions?
Code:
x=[1 2;3 4;5 6;7 8 ]
econFlag=0;
[U,sigma,V] = svd(x,econFlag);%[U,sigma,coeff] = svd(z,econFlag);
U1=U(:,1);
V1=V(:,1);
sigma_partial=sigma(1,1);
score1=U*sigma;
test1=score1*V';
score_partial=U1*sigma_partial;
test1_partial=score_partial*V1';
[PC, D] = eig(x'*x)
score2=x*PC;
test2=score2*PC';
PC1=PC(:,2);
score2_partial=x*PC1;
test2_partial=score2_partial*PC1';
[o1 o2 o3]=princomp(x);
Yes. According to the documentation of svd, diagonal elements of the output S are in decreasing order. There is no such guarantee for the output D of eig though.
Eigenvectors and singular vectors have no defined sign. If a is an eigenvector, so is -a.
I've often wondered the same. Laziness on the part of TMW? Optimization, because sorting would be an additional step and not everybody needs 'em sorted?
princomp centers the input data before computing the principal components. This makes sense as normally the PCA is computed with respect to the covariance matrix, and the eigenvectors of x' * x are only identical to those of the covariance matrix if x is mean-free.
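The effect of centering can be checked on the x from the question. The sketch below is illustrative only: it confirms that the scatter matrix of the centred data, divided by n - 1, reproduces cov(x), and that the leading eigenvector of the uncentred x' * x points in a slightly different direction.

```matlab
% Sketch: x' * x matches the covariance matrix only after centering.
x = [1 2; 3 4; 5 6; 7 8];
n = size(x, 1);

xc = bsxfun(@minus, x, mean(x, 1));        % centred data
C_centred = (xc' * xc) / (n - 1);          % scatter of centred data, scaled
C = cov(x);
disp(norm(C_centred - C))                  % close to zero

% Leading eigenvector of the uncentred scatter matrix
[V_raw, D_raw] = eig(x' * x);
[~, i_raw] = max(diag(D_raw));
v_raw = V_raw(:, i_raw);

% Leading eigenvector of the covariance matrix
[V_cov, D_cov] = eig(C);
[~, i_cov] = max(diag(D_cov));
v_cov = V_cov(:, i_cov);

% abs() removes the arbitrary sign; a value below 1 means different directions
alignment = abs(v_raw' * v_cov);
disp(alignment)
```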
I would compute the PCA by transforming to the basis of the eigenvectors of the covariance matrix (centered data), but apply this transform to the original (uncentered) data. This allows you to capture a maximum of variance with as few principal components as possible, while still recovering the original data from all of them:
[V, D] = eig(cov(x));
score = x * V;
test = score * V';
test is identical to x, up to numerical error.
In order to easily pick the components with the most variance, let's fix that lack of sorting ourselves:
[V, D] = eig(cov(x));
[D, ind] = sort(diag(D), 'descend');
V = V(:, ind);
score = x * V;
test = score * V';
Reconstruct the signal using the strongest principal component only:
test_partial = score(:, 1) * V(:, 1)';
In response to Amro's comments: It is of course also possible to first remove the means from the input data, and transform these "centered" data. In that case, for perfect reconstruction of the original data it would be necessary to add the means again. The way to compute the PCA given above is the one described by Neil H. Timm, Applied Multivariate Analysis, Springer 2002, page 446:
Given an observation vector Y with mean mu and covariance matrix Sigma of full rank p, the goal of PCA is to create a new set of variables called principal components (PCs) or principal variates. The principal components are linear combinations of the variables of the vector Y that are uncorrelated such that the variance of the jth component is maximal.
Timm later defines "standardized components" as those which have been computed from centered data and are then divided by the square root of the eigenvalues (i.e. variances), i.e. "standardized principal components" have mean 0 and variance 1.
I have two matrices X and Y. Both represent a number of positions in 3D-space. X is a 50*3 matrix, Y is a 60*3 matrix.
My question: why does applying the mean-function over the output of pdist2() in combination with 'Mahalanobis' not give the result obtained with mahal()?
More details on what I'm trying to do below, as well as the code I used to test this.
Let's suppose the 60 observations in matrix Y are obtained after an experimental manipulation of some kind. I'm trying to assess whether this manipulation had a significant effect on the positions observed in Y. Therefore, I used pdist2(X,X,'Mahalanobis') to compare X to X to obtain a baseline, and later, X to Y (with X the reference matrix: pdist2(X,Y,'Mahalanobis')), and I plotted both distributions to have a look at the overlap.
Subsequently, I calculated the mean Mahalanobis distance for both distributions and the 95% CI and did a t-test and Kolmogorov-Smirnov test to assess if the difference between the distributions was significant. This seemed very intuitive to me, however, when testing with mahal(), I get different values, although the reference matrix is the same. I don't get what the difference between both ways of calculating mahalanobis distance is exactly.
Comment that is too long @3lectrologos:
You mean this: d(I) = (Y(I,:)-mu)*inv(SIGMA)*(Y(I,:)-mu)'? This is just the formula for calculating mahalanobis, so should be the same for pdist2() and mahal() functions. I think mu is a scalar and SIGMA is a matrix based on the reference distribution as a whole in both pdist2() and mahal(). Only in mahal you are comparing each point of your sample set to the points of the reference distribution, while in pdist2 you are making pairwise comparisons based on a reference distribution. Actually, with my purpose in my mind, I think I should go for mahal() instead of pdist2(). I can interpret a pairwise distance based on a reference distribution, but I don't think it's what I need here.
% test pdist2 vs. mahal in matlab
% the purpose of this script is to see whether the average over the rows of E equals the values in d...
% data
X = []; % 50*3 matrix, data omitted
Y = []; % 60*3 matrix, data omitted
% calculations
S = nancov(X);
% mahal()
d = mahal(Y,X); % gives a 60*1 matrix with a value for each Cartesian element in Y (second matrix is always the reference matrix)
% pairwise mahalanobis distance with pdist2()
E = pdist2(X,Y,'mahalanobis',S); % outputs a 50*60 matrix with each ij-th element the pairwise distance between element X(i,:) and Y(j,:) based on the covariance matrix of X: nancov(X)
%{
so this is harder to interpret than mahal(), as elements of Y are not just compared to the "mahalanobis-centroid" based on X,
% but to each individual element of X
% so the purpose of this script is to see whether the average over the rows of E equals the values in d...
%}
F = mean(E); % now I averaged over the rows, which means, over all values of X, the reference matrix
mean(d)
mean(E(:)) % not equal to mean(d)
d-F' % not zero
% plot output
figure(1)
plot(d,'bo'), hold on
plot(mean(E),'ro')
legend('mahal()','averaged over all x values pdist2()')
ylabel('Mahalanobis distance')
figure(2)
plot(d,'bo'), hold on
plot(E','ro')
plot(d,'bo','MarkerFaceColor','b')
xlabel('values in matrix Y (Yi) ... or ... pairwise comparison Yi. (Yi vs. all Xi values)')
ylabel('Mahalanobis distance')
legend('mahal()','pdist2()')
One immediate difference between the two is that mahal subtracts the sample mean of X from each point in Y before computing distances.
Try something like E = pdist2(X,Y-mean(X),'mahalanobis',S); to see if that gives you the same results as mahal.
Note that
mahal(X,Y)
is equivalent to
pdist2(X,mean(Y),'mahalanobis',cov(Y)).^2
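The difference between the two quantities can be reproduced without any toolbox calls. The sketch below uses small made-up matrices and computes, by hand, the squared distance of each row of Y to the centroid of X (the centroid-based quantity, squared as in the equivalence above) and the pairwise distances between rows of X and rows of Y under the same covariance matrix. All names are illustrative.

```matlab
% Sketch: distance to the centroid versus averaged pairwise distances.
X = [0 0 1; 1 2 0; 2 1 3; 3 5 1; 4 3 4; 5 4 2];   % reference points (toy data)
Y = [1 1 1; 6 2 0];                               % sample points (toy data)

mu = mean(X, 1);
S  = cov(X);

% Squared Mahalanobis distance of each row of Y to the centroid of X
Yc = bsxfun(@minus, Y, mu);
d2_centroid = sum((Yc / S) .* Yc, 2);

% Pairwise (unsquared) Mahalanobis distances, rows of X versus rows of Y
E = zeros(size(X, 1), size(Y, 1));
for i = 1:size(X, 1)
    for j = 1:size(Y, 1)
        delta = X(i, :) - Y(j, :);
        E(i, j) = sqrt((delta / S) * delta');
    end
end
F = mean(E, 1)';                    % average over the reference points

disp([sqrt(d2_centroid) F])         % the two columns differ
```

Averaging distances is not the same as taking the distance to the average: the mean pairwise distance is never smaller than the distance to the centroid, and on top of that one quantity is squared while the other is not, which is why mean(E) and d do not match.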
Well, I guess there are two different ways to calculate mahalanobis distance between two clusters of data like you explain above:
1) you compare each data point from your sample set to mu and sigma matrices calculated from your reference distribution (although labeling one cluster sample set and the other reference distribution may be arbitrary), thereby calculating the distance from each point to this so called mahalanobis-centroid of the reference distribution.
2) you compare each datapoint from matrix Y to each datapoint of matrix X, with, X the reference distribution (mu and sigma are calculated from X only)
The values of the distances will be different, but I guess the ordinal order of dissimilarity between clusters is preserved when using either method 1 or 2? I actually wonder when comparing 10 different clusters to a reference matrix X, or to each other, if the order of the dissimilarities would differ using method 1 or method 2? Also, I can't imagine a situation where one method would be wrong and the other method not. Although method 1 seems more intuitive in some situations, like mine.