ICA - Statistical Independence & Eigenvalues of Covariance Matrix - matlab

I am currently creating different signals using Matlab, mixing them by multiplying them by a mixing matrix A, and then trying to get back the original signals using FastICA.
So far, the recovered signals are really bad when compared to the original ones, which was not what I expected.
I'm trying to see whether I'm doing anything wrong. The signals I'm generating are the following: (Amplitudes are in the range [0,1].)
s1 = (-x.^2 + 100*x + 500) / 3000; % quadratic
s2 = exp(-x / 10); % -ve exponential
s3 = (sin(x)+ 1) * 0.5; % sine
s4 = 0.5 + 0.1 * randn(size(x, 2), 1); % gaussian
s5 = (sawtooth(x, 0.75)+ 1) * 0.5; % sawtooth
One condition for ICA to be successful is that at most one signal is Gaussian, and I've observed this in my signal generation.
However, another condition is that all signals are statistically independent.
All I know is that this means that, given two signals A & B, knowing one signal does not give any information with regards to the other, i.e.: P(A|B) = P(A) where P is the probability.
Now my question is this: Are my signals statistically independent? Is there any way I can determine this? Perhaps some property that must be observed?
Another thing I've noticed is that when I calculate the eigenvalues of the covariance matrix (calculated for the matrix containing the mixed signals), the eigenspectrum seems to show that there is only one (main) principal component. What does this really mean? Shouldn't there be 5, since I have 5 (supposedly) independent signals?
For example, when using the following mixing matrix:
A =
0.2000 0.4267 0.2133 0.1067 0.0533
0.2909 0.2000 0.2909 0.1455 0.0727
0.1333 0.2667 0.2000 0.2667 0.1333
0.0727 0.1455 0.2909 0.2000 0.2909
0.0533 0.1067 0.2133 0.4267 0.2000
The eigenvalues are: 0.0000 0.0005 0.0022 0.0042 0.0345 (only 4!)
When using the identity matrix as the mixing matrix (i.e. the mixed signals are the same as the original ones), the eigenspectrum is: 0.0103 0.0199 0.0330 0.0811 0.1762. There still is one value much larger than the rest..
Thank you for your help.
I apologise if the answers to my questions are painfully obvious, but I'm really new to statistics, ICA and Matlab. Thanks again.
EDIT - I have 500 samples of each signal, in the range [0.2, 100], in steps of 0.2, i.e. x = 0:0.1:100.
EDIT - Given the ICA Model: X = As + n (I'm not adding any noise at the moment), but I am referring to the eigenspectrum of the transpose of X, i.e. eig(cov(X')).

Your signals are correlated (not independent). Right off the bat, the sawtooth and the sine are the same period. Tell me the value of one I'll tell you the value of the other, perfect correlation.
If you change up the period of one of them that'll make them more independent.
Also S1 and S2 are kinda correlated.
As for the eigenvalues, first of all your signals are not independent (see above).
Second of all, your filter matrix A is also not well conditioned, spreading out your eigenvalues further.
Even if you were to pipe in five fully independent (iid, yada yada) signals the covariance would be:
E[ A y y' A' ] = E[ A I A' ] = A A'
The eigenvalues of that are:
eig(A*A')
ans =
0.000167972216475
0.025688510850262
0.035666735304024
0.148813869149738
1.042451912479502
So you're really filtering/squishing all the signals down onto one basis function / degree of freedom and of course they'll be hard to recover, whatever method you use.

To find if the signals are mutually independent you could look at the techniques described here In general two random variables are independent if they are orthogonal. This means that: E{s1*s2} = 0 Meaning that the expectation of the random variable s1 multiplied by the random variable s2 is zero. This orthogonality condition is extremely important in statistics and probability and shows up everywhere. Unfortunately it applies to 2 variables at a time. There are multivariable techniques, but none that I would feel comfortable recommending. Another link I dug up was this one, not sure what your application is, but that paper is very well done.
When I calculate the covariance matrix I get:
cov(A) =
0.0619 -0.0284 -0.0002 -0.0028 -0.0010
-0.0284 0.0393 0.0049 0.0007 -0.0026
-0.0002 0.0049 0.1259 0.0001 -0.0682
-0.0028 0.0007 0.0001 0.0099 -0.0012
-0.0010 -0.0026 -0.0682 -0.0012 0.0831
With eigenvectors,V and values D:
[V,D] = eig(cov(A))
V =
-0.0871 0.5534 0.0268 -0.8279 0.0063
-0.0592 0.8264 -0.0007 0.5584 -0.0415
-0.0166 -0.0352 0.5914 -0.0087 -0.8054
-0.9937 -0.0973 -0.0400 0.0382 -0.0050
-0.0343 0.0033 0.8050 0.0364 0.5912
D =
0.0097 0 0 0 0
0 0.0200 0 0 0
0 0 0.0330 0 0
0 0 0 0.0812 0
0 0 0 0 0.1762
Here's my code:
x = transpose(0.2:0.2:100);
s1 = (-x.^2 + 100*x + 500) / 3000; % quadratic
s2 = exp(-x / 10); % -ve exponential
s3 = (sin(x)+ 1) * 0.5; % sine
s4 = 0.5 + 0.1 * randn(length(x), 1); % gaussian
s5 = (sawtooth(x, 0.75)+ 1) * 0.5; % sawtooth
A = [s1 s2 s3 s4 s5];
cov(A)
[V,D] = eig(cov(A))
Let me know if I can help any more, or if I misunderstood.
EDIT Properly referred to eigenvalues and vectors, used 0.2 sampling interval added code.

Related

Loop to build virtual disparity image in MATLAB

I'm trying to implement the homogeneous transformation from the disparity image to the virtual disparity image, following the paper of Suganuma et. al. "An Obstacle Extraction Method Using Virtual Disparity Image".
After doing the matrix computations described in the paper, I reach a global homogeneous transformation matrix that describes just a translation of -27.7 in the direction v, which makes sense.
Now, to make this transformation, I implemented a loop in MATLAB:
virtual_disparity=zeros(size(disparityMap));
%Homogeneous vector of a point of the disparityMap U=[u/d v/d 1/d 1]' (4x1)
U = zeros(4,1);
U_v = zeros(4,1);
for i=1:size(disparityMap,1) %Rows-->y
for j=1:size(disparityMap,2) %Cols-->x
d = disparityMap(i, j); % (i,j)-->(cols,rows)-->(y,x)
U = [j/d i/d 1/d 1]'; % [u/d v/d 1/d 1]'
U_v = B*U; % B is the whole homogeneous transform
U_v = U_v./U_v(4);
u_v_x = U_v(1); %u_v_j
u_v_y = U_v(2); %u_v_i
if((u_v_x>1) && (u_v_x<=size(virtual_disparity, 2)) && (u_v_y>1) && (u_v_y<=size(virtual_disparity, 1)))
virtual_disparity(round(u_v_y), round(u_v_x)) = disparityMap(i, j);
end
end
end
Now, the problem is that the virtual disparity that I get doesn't make any sense, since it doesn't even corresponds with the transformation described in B, which, as I said is:
1.0000 0 0.0000 0
0 1.0000 0.0000 -27.7003
0 0 1.0000 0
0 0 0 1.0000
These are the disparity and the virtual disparity respectively:
Disparity Map:
Virtual Disparity:
I've been rechecking all day long and I don't find the error.
Finally, a colleage helped me and we found whats wrong. I was suposing that the final coordinates U_v would be given in the form [u v 0 1]', which actually doesn't make much sense. Actually they were given, as the input coordinates in the form [u/d v/d 1/d 1]'. So, instead of normalizing them dividing by the element 4, as I was doing, I must divide them by the element 3 (1/d).
To sum up, it was just an error in the line:
U_v = U_v./U_v(4);
Which must be substituted by:
U_v = U_v./U_v(3);
Now the image, although is a little bit more sparse than I thought, it's similar to the one of the paper:

How to use Principle Component Analysis (PCA) for dimensionality reduction in matlab [duplicate]

I have a large dataset of multidimensional data(132 dimensions).
I am a beginner at performing data mining and I want to apply Principal Components Analysis by using Matlab. However, I have seen that there are a lot of functions explained on the web but I do not understand how should they be applied.
Basically, I want to apply PCA and to obtain the eigenvectors and their corresponding eigenvalues out of my data.
After this step I want to be able to do a reconstruction for my data based on a selection of the obtained eigenvectors.
I can do this manually, but I was wondering if there are any predefined functions which can do this because they should already be optimized.
My initial data is something like : size(x) = [33800 132]. So basically I have 132 features(dimensions) and 33800 data points. And I want to perform PCA on this data set.
Any help or hint would do.
Here's a quick walkthrough. First we create a matrix of your hidden variables (or "factors"). It has 100 observations and there are two independent factors.
>> factors = randn(100, 2);
Now create a loadings matrix. This is going to map the hidden variables onto your observed variables. Say your observed variables have four features. Then your loadings matrix needs to be 4 x 2
>> loadings = [
1 0
0 1
1 1
1 -1 ];
That tells you that the first observed variable loads on the first factor, the second loads on the second factor, the third variable loads on the sum of factors and the fourth variable loads on the difference of the factors.
Now create your observations:
>> observations = factors * loadings' + 0.1 * randn(100,4);
I added a small amount of random noise to simulate experimental error. Now we perform the PCA using the pca function from the stats toolbox:
>> [coeff, score, latent, tsquared, explained, mu] = pca(observations);
The variable score is the array of principal component scores. These will be orthogonal by construction, which you can check -
>> corr(score)
ans =
1.0000 0.0000 0.0000 0.0000
0.0000 1.0000 0.0000 0.0000
0.0000 0.0000 1.0000 0.0000
0.0000 0.0000 0.0000 1.0000
The combination score * coeff' will reproduce the centered version of your observations. The mean mu is subtracted prior to performing PCA. To reproduce your original observations you need to add it back in,
>> reconstructed = score * coeff' + repmat(mu, 100, 1);
>> sum((observations - reconstructed).^2)
ans =
1.0e-27 *
0.0311 0.0104 0.0440 0.3378
To get an approximation to your original data, you can start dropping columns from the computed principal components. To get an idea of which columns to drop, we examine the explained variable
>> explained
explained =
58.0639
41.6302
0.1693
0.1366
The entries tell you what percentage of the variance is explained by each of the principal components. We can clearly see that the first two components are more significant than the second two (they explain more than 99% of the variance between them). Using the first two components to reconstruct the observations gives the rank-2 approximation,
>> approximationRank2 = score(:,1:2) * coeff(:,1:2)' + repmat(mu, 100, 1);
We can now try plotting:
>> for k = 1:4
subplot(2, 2, k);
hold on;
grid on
plot(approximationRank2(:, k), observations(:, k), 'x');
plot([-4 4], [-4 4]);
xlim([-4 4]);
ylim([-4 4]);
title(sprintf('Variable %d', k));
end
We get an almost perfect reproduction of the original observations. If we wanted a coarser approximation, we could just use the first principal component:
>> approximationRank1 = score(:,1) * coeff(:,1)' + repmat(mu, 100, 1);
and plot it,
>> for k = 1:4
subplot(2, 2, k);
hold on;
grid on
plot(approximationRank1(:, k), observations(:, k), 'x');
plot([-4 4], [-4 4]);
xlim([-4 4]);
ylim([-4 4]);
title(sprintf('Variable %d', k));
end
This time the reconstruction isn't so good. That's because we deliberately constructed our data to have two factors, and we're only reconstructing it from one of them.
Note that despite the suggestive similarity between the way we constructed the original data and its reproduction,
>> observations = factors * loadings' + 0.1 * randn(100,4);
>> reconstructed = score * coeff' + repmat(mu, 100, 1);
there is not necessarily any correspondence between factors and score, or between loadings and coeff. The PCA algorithm doesn't know anything about the way your data is constructed - it merely tries to explain as much of the total variance as it can with each successive component.
User #Mari asked in the comments how she could plot the reconstruction error as a function of the number of principal components. Using the variable explained above this is quite easy. I'll generate some data with a more interesting factor structure to illustrate the effect -
>> factors = randn(100, 20);
>> loadings = chol(corr(factors * triu(ones(20))))';
>> observations = factors * loadings' + 0.1 * randn(100, 20);
Now all of the observations load on a significant common factor, with other factors of decreasing importance. We can get the PCA decomposition as before
>> [coeff, score, latent, tsquared, explained, mu] = pca(observations);
and plot the percentage of explained variance as follows,
>> cumexplained = cumsum(explained);
cumunexplained = 100 - cumexplained;
plot(1:20, cumunexplained, 'x-');
grid on;
xlabel('Number of factors');
ylabel('Unexplained variance')
You have a pretty good dimensionality reduction toolbox at http://homepage.tudelft.nl/19j49/Matlab_Toolbox_for_Dimensionality_Reduction.html
Besides PCA, this toolbox has a lot of other algorithms for dimensionality reduction.
Example of doing PCA:
Reduced = compute_mapping(Features, 'PCA', NumberOfDimension);

Labelling new data using trained Gaussian Mixture Model

I am not sure how to do the prediction for some new data using trained Gaussian Mixture Model (GMM). For example, I have got some labelled data drawn from 3 different classes (clusters). For each class of data points, I fit a GMM (gm1, gm2 and gm3). Suppose we know the number of Gaussian mixture for each class (e.g., k1=2, k2=1 and k3=3) or it can be estimated (optimised) using Akaike information criterion (AIC). Then when I have got some new dataset, how can I know if it is more likely belong to class 1, 2 or 3?
Some Matlab script shows what I mean:
clc; clf; clear all; close all;
%% Create some artificial training data
% 1. Cluster 1 with two mixture of Gaussian (k1 = 2)
rng default; % For reproducibility
mu1 = [1 2];
sigma1 = [3 .2; .2 2];
mu2 = [-1 -2];
sigma2 = [2 0; 0 1];
X1 = [mvnrnd(mu1,sigma1,200); mvnrnd(mu2,sigma2,100)];
options1 = statset('Display', 'final');
k1 = 2;
gm1 = fitgmdist(X1, k1, 'Options', options1);
% 2. Cluster 2 with one mixture of Gaussian (k2 = 1)
mu3 = [6 4];
sigma3 = [3 .1; .1 4];
X2 = mvnrnd(mu3,sigma3,300);
options2 = statset('Display', 'final');
k2 = 1;
gm2 = fitgmdist(X2, k2, 'Options', options2);
% 3. Cluster 3 with three mixture of Gaussian (k3 = 3)
mu4 = [-5 -6];
sigma4 = [1 .1; .1 1];
mu5 = [-5 -10];
sigma5 = [6 .1; .1 1];
mu6 = [-2 -15];
sigma6 = [8 .1; .1 4];
X3 = [mvnrnd(mu4,sigma4,200); mvnrnd(mu5,sigma5,300); mvnrnd(mu6,sigma6,100)];
options3 = statset('Display', 'final');
k3 = 3;
gm3 = fitgmdist(X3, k3, 'Options', options3);
% Display
figure,
scatter(X1(:,1),X1(:,2),10,'ko'); hold on;
ezcontour(#(x,y)pdf(gm1, [x y]), [-12 12], [-12 12]);
scatter(X2(:,1),X2(:,2),10,'ko');
ezcontour(#(x,y)pdf(gm2, [x y]), [-12 12], [-12 12]);
scatter(X3(:,1),X3(:,2),10,'ko');
ezcontour(#(x,y)pdf(gm3, [x y]), [-12 12], [-12 12]); hold off;
We can get the figure:
Then we have got some new testing data for example:
%% Create some artificial testing data
mut1 = [6.1 3.8];
sigmat1 = [3.1 .1; .1 4.2];
mut2 = [5.8 4.5];
sigmat2 = [2.8 .1; .1 3.8];
Xt1 = [mvnrnd(mut1,sigmat1,500); mvnrnd(mut2,sigmat2,100)];
figure,
scatter(Xt1(:,1),Xt1(:,2),10,'ko');
xlim([-12 12]); ylim([-12 12]);
I made the testing data similar to the Cluster 2 data on purpose. After we do the training using GMM, can we somehow predict the label of the new testing data? Is that possible to get some probabilities out like (p1 = 18%, p2 = 80% and p3 = 2%) for the prediction of each class. As we have got p2=80%, we can then have a hard classification that the new testing data is labelled as Cluster 2.
p.s.: I have found this post but it seems to theoretical to me (A similar post). If you can please put some simple Matlab script in your reply.
Thanks very much. A.
EDIT:
As Amro replied a solution for the problem, I have got more questions.
Amro created a new GMM using the entire dataset with some initialisation:
% initial parameters of the new GMM (combination of the previous three)
% (note PComponents is normalized according to proportion of data in each subset)
S = struct('mu',[gm1.mu; gm2.mu; gm3.mu], ...
'Sigma',cat(3, gm1.Sigma, gm2.Sigma, gm3.Sigma), ...
'PComponents',[gm1.PComponents*n1, gm2.PComponents*n2, gm3.PComponents*n3]./n);
% train the final model over all instances
opts = statset('MaxIter',1000, 'Display','final');
gmm = fitgmdist(X, k, 'Options',opts, 'Start',S);
What Amro got is something like below
which may not be suitable of my data because it separates my labelled cluster1, and cluster2 mixed with part of cluster1. This is what I am trying to avoid.
Here what I present is an artificial numerical example; however, in my real application, it deals with problem of image segmentation (for example, cluster1 is my background image and cluster2 is the object I want to separate). Then I try to somehow 'force' the separate GMM to fit separate classes. If two clusters are far away (for example, cluster1 and cluster 3 in this example), there is no problem to use Amro's method to combine all the data and then do a GMM fitting. However, when we do the training on the image data, it will never be perfect to separate background from object due to the limitation of the resolution (caused partial volume effect); therefore, it is very likely we have the case of cluster1 overlapped with cluster2 as shown. I think maybe mix all the data and then do the fitting will cause some problem for further prediction of the new data, am I right?
However, after a little bit of thinking, what I am trying to do now is:
% Combine the mixture of Gaussian and form a new gmdistribution
muAll = [gm1.mu; gm2.mu; gm3.mu];
sigmaAll = cat(3, gm1.Sigma, gm2.Sigma, gm3.Sigma);
gmAll = gmdistribution(muAll, sigmaAll);
pt1 = posterior(gmAll, Xt1);
What do you guys think? Or it is equivalent to Amro's method? If so, is there a method to force my trained GMM separated?
Also, I have question about the rationale of using the posterior function. Essentially, I want to estimate the likelihood of my testing data given the GMM fitting. Then why we calculate the posterior probability now? Or it is just a naming issue (in other words, the 'posterior probability'='likelihood')?
As far as I know, GMM has always been used as a unsupervised method. Someone even mentioned to me that GMM is a probability version of k-means clustering. Is it eligible to use it in such a 'supervised' style? Any recommended papers or references?
Thanks very much again for your reply!
A.
Effectively you have trained three GMM models not one, each being a mixture itself. Typically you would create one GMM with multiple components, where each component represents a cluster...
So what I would do in your case is create a new GMM model trained on the entire dataset (X1, X2, and X3) with the number of components equal to the total sum of all components from the three GMM (that is 2+1+3 = 6 Gaussian mixtures). This model would be initialized using the parameters of the individually trained ones.
Here the code to illustrate (I'm using the same variables you created in your example):
% number of instances in each data subset
n1 = size(X1,1);
n2 = size(X2,1);
n3 = size(X3,1);
% the entire dataset
X = [X1; X2; X3];
n = n1 + n2 + n3;
k = k1 + k2 + k3;
% initial parameters of the new GMM (combination of the previous three)
% (note PComponents is normalized according to proportion of data in each subset)
S = struct('mu',[gm1.mu; gm2.mu; gm3.mu], ...
'Sigma',cat(3, gm1.Sigma, gm2.Sigma, gm3.Sigma), ...
'PComponents',[gm1.PComponents*n1, gm2.PComponents*n2, gm3.PComponents*n3]./n);
% train the final model over all instances
opts = statset('MaxIter',1000, 'Display','final');
gmm = fitgmdist(X, k, 'Options',opts, 'Start',S);
% display GMM density function over training data
line(X(:,1), X(:,2), 'LineStyle','none', ...
'Marker','o', 'MarkerSize',1, 'Color','k')
hold on
ezcontour(#(x,y) pdf(gmm,[x y]), xlim(), ylim())
hold off
title(sprintf('GMM over %d training instances',n))
Now that we have trained one GMM model over the entire training dataset (with k=6 mixtures), we can use it to cluster new data instances:
cIdx = cluster(gmm, Xt1);
This is the same as manually computing the posterior probability of components, and taking the component with the largest probability as cluster index:
pr = posterior(gmm, Xt1);
[~,cIdx] = max(pr,[],2);
As expected almost 95% of the test data was clustered as belonging to the same component:
>> tabulate(cIdx)
Value Count Percent
1 27 4.50%
2 0 0.00%
3 573 95.50%
Here is the matching Guassian parameters:
>> idx = 3;
>> gmm.mu(idx,:)
ans =
5.7779 4.1731
>> gmm.Sigma(:,:,idx)
ans =
2.9504 0.0801
0.0801 4.0907
This indeed corresponds to the component in the upper-right side from the previous figure.
Similarly if you inspect the other component idx=1, it will be the one just on the left of the previous one, which explains how 27 out of the 600 test instances were "misclassified" if you will... Here's how confident the GMM was on those instances:
>> pr(cIdx==1,:)
ans =
0.9813 0.0001 0.0186 0.0000 0.0000 0.0000
0.6926 0.0000 0.3074 0.0000 0.0000 0.0000
0.5069 0.0000 0.4931 0.0000 0.0000 0.0000
0.6904 0.0018 0.3078 0.0000 0.0000 0.0000
0.6954 0.0000 0.3046 0.0000 0.0000 0.0000
<... output truncated ...>
0.5077 0.0000 0.4923 0.0000 0.0000 0.0000
0.6859 0.0001 0.3141 0.0000 0.0000 0.0000
0.8481 0.0000 0.1519 0.0000 0.0000 0.0000
Here are the test instances overlayed on top of the previous figure:
hold on
gscatter(Xt1(:,1), Xt1(:,2), cIdx)
hold off
title('clustered test instances')
EDIT:
My example above was meant to show how to use GMMs for clustering data (unsupervised learning). From what I understand now, what you want instead is to classify data using existing trained models (supervied learning). I guess I was confused by your use of "clusters" term :)
Anyway it should be easy now; just compute the class-conditional probability density function of the test data using each model, and pick the model with the highest likelihood as class label (no need to combine models into one).
So continuing on your initial code, that would simply be:
p = [pdf(gm1,Xt), pdf(gm2,Xt), pdf(gm3,Xt)]; % P(x|model_i)
[,cIdx] = max(p,[],2); % argmax_i P(x|model_i)
cIdx is the class prediction (1, 2, or 3) of each instance in the test data.

vector of n numbers around a specific number

I'm trying to do an algorithm in Matlab to try to calculate a received power in dBm of a logarithmic model of a wireless telecommunication system..
My algorithm calculate the received power for a number of distances in km that the user specified in the input and stores it in a vector
vector_distances = { 1, 5, 10, 50, 75 }
vector_Prx = { 131.5266 145.5060 151.5266 165.5060 169.0278 }
The thing is that I almost have everything that I need, but for graphics purposes I need to plot a graph in where on the x axys I have my vector of receiver power but on the y axys I want to show the same received power but with the most complete logarithmic model (the one that have also the noise - with Log-normal distribution on the formula - but for this thing in particular for every distance in my vector I need to choose 50 numbers with 0.5 distance between them (like a matrix) and then for every new point in the same distance calculate the logarithmic model to later plot in the same graph the two functions, one with the model with no noise (a straight line) and one with the noise.. like this picture
!http://imgur.com/gLSrKor
My question is, is there a way to choose 50 numbers with 0.5 distance between them for an existing number?
I know for example, if you have a vector
EDU>> m = zeros(1,5)
m =
0 0 0 0 0
EDU>> v = 5 %this is the starter distance%
v =
5
EDU>> m(1) = 5
m =
5 0 0 0 0
% I want to create a vector with 5 numbers with 0.5 distance between them %
EDU>> for i=2:5
m(i) = m(i-1) + 0.5
end
EDU>> m
m =
5.0000 5.5000 6.0000 6.5000 7.0000
But I have two problems, the firs one is, could this be more simplex? I am new on Matlab..and the other one, could I create a vector like this (with the initial number in the center)
EDU>> m
m =
4.0000 4.5000 **5.0000** 5.5000 6.0000
Sorry for my english, and thank you so much for helping me
In MATLAB, if you want to create a vector from a number n to a number m, you use the format
A = 5:10;
% A = [5,6,7,8,9,10]
You can also specify the step of the vector by including a third argument between the other two, like so:
A = 5:0.5:10;
% A = [5,5.5,6,6.5,7,7.5,8,8.5,9,9.5,10]
You can also use this to count backwards:
A = 10:-1:5
% A = [10,9,8,7,6,5]

Matlab - PCA analysis and reconstruction of multi dimensional data

I have a large dataset of multidimensional data(132 dimensions).
I am a beginner at performing data mining and I want to apply Principal Components Analysis by using Matlab. However, I have seen that there are a lot of functions explained on the web but I do not understand how should they be applied.
Basically, I want to apply PCA and to obtain the eigenvectors and their corresponding eigenvalues out of my data.
After this step I want to be able to do a reconstruction for my data based on a selection of the obtained eigenvectors.
I can do this manually, but I was wondering if there are any predefined functions which can do this because they should already be optimized.
My initial data is something like : size(x) = [33800 132]. So basically I have 132 features(dimensions) and 33800 data points. And I want to perform PCA on this data set.
Any help or hint would do.
Here's a quick walkthrough. First we create a matrix of your hidden variables (or "factors"). It has 100 observations and there are two independent factors.
>> factors = randn(100, 2);
Now create a loadings matrix. This is going to map the hidden variables onto your observed variables. Say your observed variables have four features. Then your loadings matrix needs to be 4 x 2
>> loadings = [
1 0
0 1
1 1
1 -1 ];
That tells you that the first observed variable loads on the first factor, the second loads on the second factor, the third variable loads on the sum of factors and the fourth variable loads on the difference of the factors.
Now create your observations:
>> observations = factors * loadings' + 0.1 * randn(100,4);
I added a small amount of random noise to simulate experimental error. Now we perform the PCA using the pca function from the stats toolbox:
>> [coeff, score, latent, tsquared, explained, mu] = pca(observations);
The variable score is the array of principal component scores. These will be orthogonal by construction, which you can check -
>> corr(score)
ans =
1.0000 0.0000 0.0000 0.0000
0.0000 1.0000 0.0000 0.0000
0.0000 0.0000 1.0000 0.0000
0.0000 0.0000 0.0000 1.0000
The combination score * coeff' will reproduce the centered version of your observations. The mean mu is subtracted prior to performing PCA. To reproduce your original observations you need to add it back in,
>> reconstructed = score * coeff' + repmat(mu, 100, 1);
>> sum((observations - reconstructed).^2)
ans =
1.0e-27 *
0.0311 0.0104 0.0440 0.3378
To get an approximation to your original data, you can start dropping columns from the computed principal components. To get an idea of which columns to drop, we examine the explained variable
>> explained
explained =
58.0639
41.6302
0.1693
0.1366
The entries tell you what percentage of the variance is explained by each of the principal components. We can clearly see that the first two components are more significant than the second two (they explain more than 99% of the variance between them). Using the first two components to reconstruct the observations gives the rank-2 approximation,
>> approximationRank2 = score(:,1:2) * coeff(:,1:2)' + repmat(mu, 100, 1);
We can now try plotting:
>> for k = 1:4
subplot(2, 2, k);
hold on;
grid on
plot(approximationRank2(:, k), observations(:, k), 'x');
plot([-4 4], [-4 4]);
xlim([-4 4]);
ylim([-4 4]);
title(sprintf('Variable %d', k));
end
We get an almost perfect reproduction of the original observations. If we wanted a coarser approximation, we could just use the first principal component:
>> approximationRank1 = score(:,1) * coeff(:,1)' + repmat(mu, 100, 1);
and plot it,
>> for k = 1:4
subplot(2, 2, k);
hold on;
grid on
plot(approximationRank1(:, k), observations(:, k), 'x');
plot([-4 4], [-4 4]);
xlim([-4 4]);
ylim([-4 4]);
title(sprintf('Variable %d', k));
end
This time the reconstruction isn't so good. That's because we deliberately constructed our data to have two factors, and we're only reconstructing it from one of them.
Note that despite the suggestive similarity between the way we constructed the original data and its reproduction,
>> observations = factors * loadings' + 0.1 * randn(100,4);
>> reconstructed = score * coeff' + repmat(mu, 100, 1);
there is not necessarily any correspondence between factors and score, or between loadings and coeff. The PCA algorithm doesn't know anything about the way your data is constructed - it merely tries to explain as much of the total variance as it can with each successive component.
User #Mari asked in the comments how she could plot the reconstruction error as a function of the number of principal components. Using the variable explained above this is quite easy. I'll generate some data with a more interesting factor structure to illustrate the effect -
>> factors = randn(100, 20);
>> loadings = chol(corr(factors * triu(ones(20))))';
>> observations = factors * loadings' + 0.1 * randn(100, 20);
Now all of the observations load on a significant common factor, with other factors of decreasing importance. We can get the PCA decomposition as before
>> [coeff, score, latent, tsquared, explained, mu] = pca(observations);
and plot the percentage of explained variance as follows,
>> cumexplained = cumsum(explained);
cumunexplained = 100 - cumexplained;
plot(1:20, cumunexplained, 'x-');
grid on;
xlabel('Number of factors');
ylabel('Unexplained variance')
You have a pretty good dimensionality reduction toolbox at http://homepage.tudelft.nl/19j49/Matlab_Toolbox_for_Dimensionality_Reduction.html
Besides PCA, this toolbox has a lot of other algorithms for dimensionality reduction.
Example of doing PCA:
Reduced = compute_mapping(Features, 'PCA', NumberOfDimension);