Reducing dimensionality on training data with PCA in Matlab

This is a follow up question to:
PCA Dimensionality Reduction
In order to classify the new 10-dimensional test data, do I have to reduce the training data down to 10 dimensions as well?
I tried:
X = bsxfun(@minus, trainingData, mean(trainingData,1));
covariancex = (X'*X)./(size(X,1)-1);
[V D] = eigs(covariancex, 10); % reduce to 10 dimension
Xtrain = bsxfun(@minus, trainingData, mean(trainingData,1));
pcatrain = Xtrain*V;
But using the classifier with this and the 10-dimensional testing data produces very unreliable results. Is there something that I am doing fundamentally wrong?
Edit:
X = bsxfun(@minus, trainingData, mean(trainingData,1));
covariancex = (X'*X)./(size(X,1)-1);
[V D] = eigs(covariancex, 10); % reduce to 10 dimension
Xtrain = bsxfun(@minus, trainingData, mean(trainingData,1));
pcatrain = Xtrain*V;
X = bsxfun(@minus, pcatrain, mean(pcatrain,1));
covariancex = (X'*X)./(size(X,1)-1);
[V D] = eigs(covariancex, 10); % reduce to 10 dimension
Xtest = bsxfun(@minus, test, mean(pcatrain,1));
pcatest = Xtest*V;

You have to reduce both the training and the test data, but both in the same way. Once you have obtained the projection matrix from PCA on the training data, you use that same matrix to reduce the dimensionality of the test data. In short, you need a single, constant transformation that is applied to both training and test elements.
Using your code
% first, 0-mean data
Xtrain = bsxfun(@minus, Xtrain, mean(Xtrain,1));
Xtest = bsxfun(@minus, Xtest, mean(Xtrain,1));
% Compute PCA
covariancex = (Xtrain'*Xtrain)./(size(Xtrain,1)-1);
[V D] = eigs(covariancex, 10); % reduce to 10 dimension
pcatrain = Xtrain*V;
% here you should train your classifier on pcatrain and ytrain (correct labels)
pcatest = Xtest*V;
% here you can test your classifier on pcatest using ytest (compare with correct labels)
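To make the last two comments concrete, here is a minimal sketch (not part of the original answer) of the train/test step, assuming label vectors ytrain and ytest with numeric class labels; k-NN is used purely as an example, any classifier works:
mdl = fitcknn(pcatrain, ytrain, 'NumNeighbors', 5); % train in the 10-dimensional PCA space
ypred = predict(mdl, pcatest);                      % classify the reduced test data
testErr = mean(ypred ~= ytest);                     % misclassification rate (numeric labels assumed)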

Related

How to plot decision boundary from linear SVM after PCA in Matlab?

I have trained a linear SVM on a large dataset; however, in order to reduce the number of dimensions I performed PCA, then trained the SVM on a subset of the component scores (the first 650 components, which explained 99.5% of the variance). Now I want to plot the decision boundary in the original variable space using the beta weights and bias from the SVM created in PCA space. But I can't figure out how to project the bias term from the SVM into the original variable space. I've written a demo using the Fisher iris data to illustrate:
clear; clc; close all
% load data
load fisheriris
inds = ~strcmp(species,'setosa');
X = meas(inds,3:4);
Y = species(inds);
mu = mean(X)
% perform the PCA
[eigenvectors, scores] = pca(X);
% train the svm
SVMModel = fitcsvm(scores,Y);
% plot the result
figure(1)
gscatter(scores(:,1),scores(:,2),Y,'rgb','osd')
title('PCA space')
% now plot the decision boundary
betas = SVMModel.Beta;
m = -betas(1)/betas(2); % my gradient
b = -SVMModel.Bias; % my y-intercept
f = @(x) m.*x + b; % my linear equation
hold on
fplot(f,'k')
hold off
axis equal
xlim([-1.5 2.5])
ylim([-2 2])
% inverse transform the PCA
Xhat = scores * eigenvectors';
Xhat = bsxfun(@plus, Xhat, mu);
% plot the result
figure(2)
hold on
gscatter(Xhat(:,1),Xhat(:,2),Y,'rgb','osd')
% and the decision boundary
betaHat = betas' * eigenvectors';
mHat = -betaHat(1)/betaHat(2);
bHat = b * eigenvectors';
bHat = bHat + mu; % I know I have to add mu somewhere...
bHat = bHat/betaHat(2);
bHat = sum(sum(bHat)); % sum to reduce the matrix to a single value
% the correct value of bHat should be 6.3962
f = @(x) mHat.*x + bHat;
fplot(f,'k')
hold off
axis equal
title('Recovered feature space')
xlim([3 7])
ylim([0 4])
Any guidance on how I'm calculating bHat incorrectly would be much appreciated.
Just in case anyone else comes across this problem, the solution is that the bias term can be used to find the y-intercept, b = -SVMModel.Bias/betas(2). The y-intercept is just another point in space, [0 b], which can be recovered (un-rotated) by inverse transforming it through the PCA. This new point can then be used to solve the linear equation y = mx + b (i.e., b = y - mx). So the code should be:
% and the decision boundary
betaHat = betas' * eigenvectors';
mHat = -betaHat(1)/betaHat(2);
yint = b/betas(2); % y-intercept in PCA space
yintHat = [0 yint] * eigenvectors'; % recover the intercept point in the original space
yintHat = yintHat + mu;
bHat = yintHat(2) - mHat*yintHat(1); % solve the linear equation
% the correct value of bHat is now 6.3962
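An equivalent way to see this (a sketch, not from the original answer, assuming the usual fitcsvm convention that the decision value is scores*Beta + Bias) is to map the whole hyperplane instead of a single point: since scores = (X - mu)*eigenvectors, the boundary becomes (eigenvectors*betas)'*(x - mu)' + Bias = 0 in the original coordinates:
wHat = eigenvectors * betas;       % hyperplane normal in the original feature space
b0 = SVMModel.Bias - mu * wHat;    % bias in the original feature space
mHat = -wHat(1)/wHat(2);           % slope of the boundary (same as above)
bHat = -b0/wHat(2);                % y-intercept; should agree with the value computed above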

How to reconstruct data from projections obtained with Linear Discriminant Analysis

I am trying to reconstruct data from the projections obtained with LDA. The idea is to evaluate the reconstruction errors obtained from reduced sets of LDA factors. In the following MATLAB code, the question is how to obtain the reconstruction using the projected data p and the eigenvectors LTrans.
L = fitcdiscr(data,class);
[LTrans,Lambda] = eig(L.BetweenSigma,L.Sigma,'chol');
[Lambda,sorted] = sort(diag(Lambda),'descend'); % sort by eigenvalues
LTrans = LTrans(:,sorted);
xc = L.X;
mu = mean(xc);
Xm = bsxfun(@minus, xc, mu);
for i_fact = 1:size(L.Sigma,2)
    z = Xm*LTrans(:,1:i_fact);
    p = z*LTrans(:,1:i_fact)';
    p = bsxfun(@plus, p, mu);
end
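One option for the reconstruction step (a sketch, not from the original post, under the assumption that a least-squares inverse is acceptable) is the pseudoinverse of the reduced eigenvector matrix: the generalized eigenvectors returned by eig(A, B, 'chol') are not orthonormal, so LTrans' is not their inverse.
recErr = zeros(1, size(L.Sigma,2));
for i_fact = 1:size(L.Sigma,2)
    W = LTrans(:,1:i_fact);                                    % first i_fact LDA directions
    z = Xm * W;                                                % reduced representation
    Xrec = z * pinv(W);                                        % least-squares map back to the original space
    recErr(i_fact) = norm(Xm - Xrec, 'fro') / norm(Xm, 'fro'); % relative reconstruction error
end
Xrec = bsxfun(@plus, Xrec, mu);                                % add the mean back for the final reconstruction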

Euclidean and Mahalanobis classifiers always return same error for each classifier?

I have a simple MATLAB script that generates some random data and then uses a Euclidean and a Mahalanobis classifier to classify it. The issue I am having is that the error results for the two classifiers are always the same; they both always misclassify the same vectors, even though the data is different each time.
The data is created in a simple way so the results are easy to check. Because we have three classes, all equiprobable, I just generate 333 random values for each class and add them all to X to be classified. Thus the result should be [class 1, class 2, class 3], 333 of each.
I can tell the classifiers work because the data created by mvnrnd is random each time and the error changes from run to run. But between the two classifiers the error does not change.
Can anyone tell why?
% Create some initial values, means, covariance matrix, etc
c = 3;
P = 1/c; % All 3 classes are equiprobable
N = 999;
m1 = [1, 1];
m2 = [12, 8];
m3 = [16, 1];
m = [m1; m2; m3];
S = [4 0; 0 4]; % All share the same covar matrix
% Generate random data for each class
X1 = mvnrnd(m1, S, N*P);
X2 = mvnrnd(m2, S, N*P);
X3 = mvnrnd(m3, S, N*P);
X = [X1; X2; X3];
% Create the solution array xEst to compare results to
xEst = ceil((3/999:3/999:3));
% Do the actual classification for mahalanobis and euclidean
zEuc = euc_mal_classifier(m', S, P, X', c, N, true);
zMal = euc_mal_classifier(m', S, P, X', c, N, false);
% Check the results
numEucErr = 0;
numMalErr = 0;
for i=1:N
    if(zEuc(i) ~= xEst(i))
        numEucErr = numEucErr + 1;
    end
    if(zMal(i) ~= xEst(i))
        numMalErr = numMalErr + 1;
    end
end
% Tell the user the results of the classification
strE = ['Euclidean classifier error percent: ', num2str((numEucErr/N) * 100)];
strM = ['Mahalanob classifier error percent: ', num2str((numMalErr/N) * 100)];
disp(strE);
disp(strM);
And the classifier:
function z = euc_mal_classifier(m, S, P, X, c, N, eOrM)
for i=1:N
    for j=1:c
        if(eOrM == true)
            t(j) = sqrt((X(:,i)-m(:,j))'*(X(:,i)-m(:,j)));
        else
            t(j) = sqrt((X(:,i)-m(:,j))'*inv(S)*(X(:,i)-m(:,j)));
        end
    end
    [num, z(i)] = min(t);
end
The reason why there is no difference in classification lies in your covariance matrix.
Assume the displacement of a point from the center of a class is [x, y].
For the Euclidean classifier the distance is:
sqrt(x*x + y*y)
For Mahalanobis, the inverse of the covariance matrix is:
inv([a,0;0,a]) = [1/a,0;0,1/a]
so the distance becomes:
sqrt(x*x*1/a + y*y*1/a) = 1/sqrt(a) * sqrt(x*x + y*y)
So the Mahalanobis distances are just the Euclidean distances scaled by a constant factor. Since that scale factor is the same for all classes and dimensions, you will not find a difference in your class assignments.
Test it with a different covariance matrix and you will find that your errors differ.
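A quick way to check this (a sketch reusing the question's own setup; the covariance below is only an illustrative choice) is to rerun the experiment with a covariance matrix that is not a multiple of the identity:
S = [4 3; 3 9];                                      % anisotropic, correlated covariance
X1 = mvnrnd(m1, S, N*P);
X2 = mvnrnd(m2, S, N*P);
X3 = mvnrnd(m3, S, N*P);
X = [X1; X2; X3];
zEuc = euc_mal_classifier(m', S, P, X', c, N, true);
zMal = euc_mal_classifier(m', S, P, X', c, N, false);
% counting the errors as before, numEucErr and numMalErr will now generally differ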
With this kind of data, where the covariance matrix is a multiple of the identity, all of these classifiers give almost the same performance. Let's look at data whose covariance matrix is not a multiple of the identity; there the three classifiers lead to different errors:
err_bayesian    = 0.0861
err_euclidean   = 0.1331
err_mahalanobis = 0.0871
close('all');clear;
% Generate and plot dataset X1
m1=[1, 1]'; m2=[10, 5]';m3=[11, 1]';
m=[m1 m2 m3];
S1 = [7 4 ; 4 5];
S(:,:,1)=S1;
S(:,:,2)=S1;
S(:,:,3)=S1;
P=[1/3 1/3 1/3];
N=1000;
randn('seed',0);
[X,y] =generate_gauss_classes(m,S,P,N);
plot_data(X,y,m,1);
randn('seed',200);
[X4,y1] =generate_gauss_classes(m,S,P,N);
% 2.5_b.1 Applying Bayesian classifier
z_bayesian=bayes_classifier(m,S,P,X4);
% 2.5_b.2 Apply ML estimates of the mean values and covariance matrix (common to all three
% classes) using function Gaussian_ML_estimate
class1_data=X(:,find(y==1));
[m1_hat, S1_hat]=Gaussian_ML_estimate(class1_data);
class2_data=X(:,find(y==2));
[m2_hat, S2_hat]=Gaussian_ML_estimate(class2_data);
class3_data=X(:,find(y==3));
[m3_hat, S3_hat]=Gaussian_ML_estimate(class3_data);
S_hat=(1/3)*(S1_hat+S2_hat+S3_hat);
m_hat=[m1_hat m2_hat m3_hat];
% Apply the Euclidean distance classifier, using the ML estimates of the means, in order to
% classify the data vectors of X1
z_euclidean=euclidean_classifier(m_hat,X4);
% 2.5_b.3 Similarly, for the Mahalanobis distance classifier, we have
z_mahalanobis=mahalanobis_classifier(m_hat,S_hat,X4);
% 2.5_c. Compute the error probability for each classifier
err_bayesian = (1-length(find(y1==z_bayesian))/length(y1))
err_euclidean = (1-length(find(y1==z_euclidean))/length(y1))
err_mahalanobis = (1-length(find(y1==z_mahalanobis))/length(y1))

Is my Matlab code correct for applying PCA to data?

I have following code for calculating PCA in Matlab:
train_out = train';
test_out = test';
% subtract off the mean for each dimension
mn = mean(train_out,2);
train_out = train_out - repmat(mn,1,train_size);
test_out = test_out - repmat(mn,1,test_size);
% calculate the covariance matrix
covariance = 1 / (train_size-1) * train_out * train_out';
% find the eigenvectors and eigenvalues
[PC, V] = eig(covariance);
% extract diagonal of matrix as vector
V = diag(V);
% sort the variances in decreasing order
[junk, rindices] = sort(-1*V);
V = V(rindices);
PC = PC(:,rindices);
% project the original data set
out = PC' * train_out;
train_out = out';
out = PC' * test_out;
test_out = out';
The train and test matrices have observations in rows and feature variables in columns. When I perform classification on the original data (without PCA) I get much better results than with PCA, even when I keep all dimensions. When I tried doing PCA directly on the whole dataset (train + test), I noticed that the correlations between these new principal components and the previous ones are either near 1 or near -1, which I find strange. I am probably doing something wrong but just can't figure it out.
The code is correct; however, using the princomp function may be easier:
train_out=train; % save original data
test_out=test;
mn = mean(train_out);
train_out = bsxfun(@minus,train_out,mn); % subtract mean
test_out = bsxfun(@minus,test_out,mn);
[coefs,scores,variances] = princomp(train_out,'econ'); % PCA
pervar = cumsum(variances) / sum(variances);
dims = max(find(pervar < var_frac)); % var_frac - e.g. 0.99 - fraction of variance explained
train_out = train_out*coefs(:,1:dims); % dims - keep this many dimensions
test_out = test_out*coefs(:,1:dims); % result is in train_out and test_out
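A note not from the original answer: princomp has since been removed from MATLAB, so on newer releases the princomp call and the lines after it can be written with the pca function instead; dims is chosen here as the smallest number of components whose cumulative variance reaches var_frac:
[coefs, scores, variances] = pca(train_out);   % train_out is already mean-centered above
pervar = cumsum(variances) / sum(variances);
dims = find(pervar >= var_frac, 1);            % e.g. var_frac = 0.99
train_out = train_out*coefs(:,1:dims);
test_out = test_out*coefs(:,1:dims);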

Finding principal components with maximum variance in matlab

I used the following code to compute PCA :
function [signals,PC,V] = pca2(data)
[M,N] = size(data);
% subtract off the mean for each dimension
mn = mean(data,2);
data = data - repmat(mn,1,N);
% construct the matrix Y
Y = data' / sqrt(N-1);
% SVD does it all
[u,S,PC] = svd(Y);
% calculate the variances
S = diag(S);
V = S .* S;
% project the original data
signals = PC' * data;
I want to keep the principal components with the maximum variance, say the first 10 principal components that contribute the most variance. How do I go about this?
function [signals,V] = pca2(data)
[M,N] = size(data);
% subtract off the mean for each dimension
mn = mean(data,2);
data = bsxfun(@minus, data, mn);
% covariance matrix of the row variables
Y = data*data' / (N-1);
[V D] = eigs(Y, 10); % keep only the 10 eigenvectors with the largest eigenvalues
% project the original data onto the top 10 components
signals = V' * data;
I guess svds can do the job for you.
In the doc, it says:
s = svds(A,k) computes the k largest singular values and associated
singular vectors of matrix A.
For the matrix Y defined above, the squared singular values are the eigenvalues of the covariance matrix and the right singular vectors are its eigenvectors, sorted in descending order of singular value.
So for 10 principal components, just use [u, S, PC] = svds(Y, 10);
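Putting this together with the question's pca2 (a sketch; Y here is the scaled, transposed data matrix built inside pca2):
[u, S, PC] = svds(Y, 10);     % top 10 singular triplets of Y = data'/sqrt(N-1)
V = diag(S).^2;               % variances of the 10 retained components
signals = PC' * data;         % 10 x N matrix of projected (mean-centered) data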