Labelling new data using trained Gaussian Mixture Model - matlab

I am not sure how to predict labels for new data using a trained Gaussian Mixture Model (GMM). For example, I have some labelled data drawn from 3 different classes (clusters). For each class of data points, I fit a GMM (gm1, gm2 and gm3). Suppose we know the number of Gaussian components for each class (e.g., k1=2, k2=1 and k3=3), or that it can be estimated (optimised) using the Akaike information criterion (AIC). Then, when I get a new dataset, how can I tell whether it more likely belongs to class 1, 2 or 3?
The following MATLAB script shows what I mean:
clc; clf; clear all; close all;
%% Create some artificial training data
% 1. Cluster 1 with two mixture of Gaussian (k1 = 2)
rng default; % For reproducibility
mu1 = [1 2];
sigma1 = [3 .2; .2 2];
mu2 = [-1 -2];
sigma2 = [2 0; 0 1];
X1 = [mvnrnd(mu1,sigma1,200); mvnrnd(mu2,sigma2,100)];
options1 = statset('Display', 'final');
k1 = 2;
gm1 = fitgmdist(X1, k1, 'Options', options1);
% 2. Cluster 2 with one mixture of Gaussian (k2 = 1)
mu3 = [6 4];
sigma3 = [3 .1; .1 4];
X2 = mvnrnd(mu3,sigma3,300);
options2 = statset('Display', 'final');
k2 = 1;
gm2 = fitgmdist(X2, k2, 'Options', options2);
% 3. Cluster 3 with three mixture of Gaussian (k3 = 3)
mu4 = [-5 -6];
sigma4 = [1 .1; .1 1];
mu5 = [-5 -10];
sigma5 = [6 .1; .1 1];
mu6 = [-2 -15];
sigma6 = [8 .1; .1 4];
X3 = [mvnrnd(mu4,sigma4,200); mvnrnd(mu5,sigma5,300); mvnrnd(mu6,sigma6,100)];
options3 = statset('Display', 'final');
k3 = 3;
gm3 = fitgmdist(X3, k3, 'Options', options3);
% Display
figure,
scatter(X1(:,1),X1(:,2),10,'ko'); hold on;
ezcontour(@(x,y)pdf(gm1, [x y]), [-12 12], [-12 12]);
scatter(X2(:,1),X2(:,2),10,'ko');
ezcontour(@(x,y)pdf(gm2, [x y]), [-12 12], [-12 12]);
scatter(X3(:,1),X3(:,2),10,'ko');
ezcontour(@(x,y)pdf(gm3, [x y]), [-12 12], [-12 12]); hold off;
We can get the figure:
Then we have got some new testing data for example:
%% Create some artificial testing data
mut1 = [6.1 3.8];
sigmat1 = [3.1 .1; .1 4.2];
mut2 = [5.8 4.5];
sigmat2 = [2.8 .1; .1 3.8];
Xt1 = [mvnrnd(mut1,sigmat1,500); mvnrnd(mut2,sigmat2,100)];
figure,
scatter(Xt1(:,1),Xt1(:,2),10,'ko');
xlim([-12 12]); ylim([-12 12]);
I made the testing data similar to the Cluster 2 data on purpose. After training the GMMs, can we somehow predict the label of the new testing data? Is it possible to get a probability for each class (e.g., p1 = 18%, p2 = 80% and p3 = 2%)? With p2 = 80%, we could then make a hard classification and label the new testing data as Cluster 2.
P.S.: I have found this post, but it seems too theoretical to me (A similar post). If you can, please include a simple MATLAB script in your reply.
Thanks very much. A.
EDIT:
Now that Amro has posted a solution to the problem, I have some more questions.
Amro created a new GMM using the entire dataset with some initialisation:
% initial parameters of the new GMM (combination of the previous three)
% (note PComponents is normalized according to proportion of data in each subset)
S = struct('mu',[gm1.mu; gm2.mu; gm3.mu], ...
'Sigma',cat(3, gm1.Sigma, gm2.Sigma, gm3.Sigma), ...
'PComponents',[gm1.PComponents*n1, gm2.PComponents*n2, gm3.PComponents*n3]./n);
% train the final model over all instances
opts = statset('MaxIter',1000, 'Display','final');
gmm = fitgmdist(X, k, 'Options',opts, 'Start',S);
What Amro got is something like the figure below,
which may not be suitable for my data, because it splits my labelled cluster 1 and mixes cluster 2 with part of cluster 1. This is what I am trying to avoid.
What I present here is an artificial numerical example; my real application, however, is image segmentation (for example, cluster 1 is my background image and cluster 2 is the object I want to separate). So I am trying to somehow 'force' separate GMMs to fit separate classes. If two clusters are far apart (for example, cluster 1 and cluster 3 in this example), there is no problem in using Amro's method to combine all the data and then fit a single GMM. However, when we train on image data, the separation of background from object will never be perfect because of the limited resolution (which causes the partial volume effect); therefore, it is very likely that cluster 1 overlaps with cluster 2, as shown. I think that mixing all the data and then fitting may cause problems when predicting labels for new data. Am I right?
However, after a little bit of thinking, what I am trying to do now is:
% Combine the mixture of Gaussian and form a new gmdistribution
muAll = [gm1.mu; gm2.mu; gm3.mu];
sigmaAll = cat(3, gm1.Sigma, gm2.Sigma, gm3.Sigma);
gmAll = gmdistribution(muAll, sigmaAll);
pt1 = posterior(gmAll, Xt1);
What do you think? Or is it equivalent to Amro's method? If so, is there a way to force my trained GMMs to stay separate?
Also, I have a question about the rationale for using the posterior function. Essentially, I want to estimate the likelihood of my testing data given the fitted GMM. Why do we calculate the posterior probability then? Or is it just a naming issue (in other words, 'posterior probability' = 'likelihood')?
As far as I know, GMMs have always been used as an unsupervised method. Someone even mentioned to me that a GMM is a probabilistic version of k-means clustering. Is it legitimate to use it in such a 'supervised' style? Are there any recommended papers or references?
Thanks very much again for your reply!
A.

Effectively you have trained three GMM models not one, each being a mixture itself. Typically you would create one GMM with multiple components, where each component represents a cluster...
So what I would do in your case is create a new GMM trained on the entire dataset (X1, X2, and X3), with the number of components equal to the total number of components in the three GMMs (that is, 2+1+3 = 6 Gaussian components). This model would be initialized using the parameters of the individually trained ones.
Here is the code to illustrate (I'm using the same variables you created in your example):
% number of instances in each data subset
n1 = size(X1,1);
n2 = size(X2,1);
n3 = size(X3,1);
% the entire dataset
X = [X1; X2; X3];
n = n1 + n2 + n3;
k = k1 + k2 + k3;
% initial parameters of the new GMM (combination of the previous three)
% (note PComponents is normalized according to proportion of data in each subset)
S = struct('mu',[gm1.mu; gm2.mu; gm3.mu], ...
'Sigma',cat(3, gm1.Sigma, gm2.Sigma, gm3.Sigma), ...
'PComponents',[gm1.PComponents*n1, gm2.PComponents*n2, gm3.PComponents*n3]./n);
% train the final model over all instances
opts = statset('MaxIter',1000, 'Display','final');
gmm = fitgmdist(X, k, 'Options',opts, 'Start',S);
% display GMM density function over training data
line(X(:,1), X(:,2), 'LineStyle','none', ...
'Marker','o', 'MarkerSize',1, 'Color','k')
hold on
ezcontour(@(x,y) pdf(gmm,[x y]), xlim(), ylim())
hold off
title(sprintf('GMM over %d training instances',n))
Now that we have trained one GMM over the entire training dataset (with k=6 components), we can use it to cluster new data instances:
cIdx = cluster(gmm, Xt1);
This is the same as manually computing the posterior probability of components, and taking the component with the largest probability as cluster index:
pr = posterior(gmm, Xt1);
[~,cIdx] = max(pr,[],2);
As expected, about 95% of the test data was clustered as belonging to the same component:
>> tabulate(cIdx)
  Value    Count   Percent
      1       27      4.50%
      2        0      0.00%
      3      573     95.50%
Here are the matching Gaussian parameters:
>> idx = 3;
>> gmm.mu(idx,:)
ans =
5.7779 4.1731
>> gmm.Sigma(:,:,idx)
ans =
2.9504 0.0801
0.0801 4.0907
This indeed corresponds to the component in the upper-right side from the previous figure.
Similarly if you inspect the other component idx=1, it will be the one just on the left of the previous one, which explains how 27 out of the 600 test instances were "misclassified" if you will... Here's how confident the GMM was on those instances:
>> pr(cIdx==1,:)
ans =
0.9813 0.0001 0.0186 0.0000 0.0000 0.0000
0.6926 0.0000 0.3074 0.0000 0.0000 0.0000
0.5069 0.0000 0.4931 0.0000 0.0000 0.0000
0.6904 0.0018 0.3078 0.0000 0.0000 0.0000
0.6954 0.0000 0.3046 0.0000 0.0000 0.0000
<... output truncated ...>
0.5077 0.0000 0.4923 0.0000 0.0000 0.0000
0.6859 0.0001 0.3141 0.0000 0.0000 0.0000
0.8481 0.0000 0.1519 0.0000 0.0000 0.0000
Here are the test instances overlaid on top of the previous figure:
hold on
gscatter(Xt1(:,1), Xt1(:,2), cIdx)
hold off
title('clustered test instances')
EDIT:
My example above was meant to show how to use GMMs for clustering data (unsupervised learning). From what I understand now, what you want instead is to classify data using existing trained models (supervised learning). I guess I was confused by your use of the term "clusters" :)
Anyway, it should be easy now; just compute the class-conditional probability density function of the test data using each model, and pick the model with the highest likelihood as the class label (no need to combine the models into one).
So, continuing from your initial code, that would simply be:
p = [pdf(gm1,Xt1), pdf(gm2,Xt1), pdf(gm3,Xt1)]; % P(x|model_i)
[~,cIdx] = max(p,[],2); % argmax_i P(x|model_i)
cIdx is the class prediction (1, 2, or 3) of each instance in the test data.
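If you also want per-class probabilities (the "p1 = 18%, p2 = 80%, p3 = 2%" kind of output), one option is to apply Bayes' rule on top of the class-conditional densities. This also shows the difference between likelihood and posterior: pdf gives the likelihood p(x|class), and the posterior P(class|x) additionally needs class priors. The following is only a sketch under my own assumptions: the priors are taken from the training-set sizes, the test set is a simplified stand-in, and the names lik, prior, post, share, logLik and bestClass are made up for illustration. It needs the Statistics Toolbox (mvnrnd, fitgmdist).

```matlab
rng default;  % for reproducibility

% Training data for the three classes, as in the question
X1 = [mvnrnd([1 2],[3 .2; .2 2],200); mvnrnd([-1 -2],[2 0; 0 1],100)];
X2 = mvnrnd([6 4],[3 .1; .1 4],300);
X3 = [mvnrnd([-5 -6],[1 .1; .1 1],200); ...
      mvnrnd([-5 -10],[6 .1; .1 1],300); ...
      mvnrnd([-2 -15],[8 .1; .1 4],100)];

% One GMM per class
opts = statset('MaxIter',1000);
gm1 = fitgmdist(X1,2,'Options',opts);
gm2 = fitgmdist(X2,1,'Options',opts);
gm3 = fitgmdist(X3,3,'Options',opts);

% Simplified test set, drawn close to class 2
Xt1 = mvnrnd([6.1 3.8],[3.1 .1; .1 4.2],500);

% Class-conditional densities p(x|class_i), one column per class
lik = [pdf(gm1,Xt1), pdf(gm2,Xt1), pdf(gm3,Xt1)];

% ASSUMPTION: class priors proportional to the training-set sizes
prior = [size(X1,1), size(X2,1), size(X3,1)];
prior = prior / sum(prior);

% Bayes' rule: P(class_i|x) = p(x|class_i)*P(class_i) / sum_j p(x|class_j)*P(class_j)
joint = bsxfun(@times, lik, prior);
post  = bsxfun(@rdivide, joint, sum(joint,2));

% Hard label per test point, and the share of points assigned to each class
[~,cIdx] = max(post,[],2);
share = accumarray(cIdx,1,[3 1])' / numel(cIdx);

% Treating the whole test set as one sample: total log-likelihood per class
logLik = sum(log(lik),1);
[~,bestClass] = max(logLik);
```

Each row of post sums to 1, so it can be read as the per-point class probabilities, while share summarises the hard decisions over the whole test set. With equal priors this reduces to the plain argmax over pdf values shown above.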

Related

How to use Principal Component Analysis (PCA) for dimensionality reduction in matlab [duplicate]

I have a large dataset of multidimensional data (132 dimensions).
I am a beginner at data mining and I want to apply Principal Component Analysis using MATLAB. However, I have seen that there are a lot of functions explained on the web, but I do not understand how they should be applied.
Basically, I want to apply PCA and to obtain the eigenvectors and their corresponding eigenvalues out of my data.
After this step I want to be able to do a reconstruction for my data based on a selection of the obtained eigenvectors.
I can do this manually, but I was wondering if there are any predefined functions which can do this because they should already be optimized.
My initial data is something like : size(x) = [33800 132]. So basically I have 132 features(dimensions) and 33800 data points. And I want to perform PCA on this data set.
Any help or hint would do.
Here's a quick walkthrough. First we create a matrix of your hidden variables (or "factors"). It has 100 observations and there are two independent factors.
>> factors = randn(100, 2);
Now create a loadings matrix. This is going to map the hidden variables onto your observed variables. Say your observed variables have four features. Then your loadings matrix needs to be 4 x 2
>> loadings = [
1 0
0 1
1 1
1 -1 ];
That tells you that the first observed variable loads on the first factor, the second loads on the second factor, the third variable loads on the sum of factors and the fourth variable loads on the difference of the factors.
Now create your observations:
>> observations = factors * loadings' + 0.1 * randn(100,4);
I added a small amount of random noise to simulate experimental error. Now we perform the PCA using the pca function from the stats toolbox:
>> [coeff, score, latent, tsquared, explained, mu] = pca(observations);
The variable score is the array of principal component scores. These will be orthogonal by construction, which you can check -
>> corr(score)
ans =
1.0000 0.0000 0.0000 0.0000
0.0000 1.0000 0.0000 0.0000
0.0000 0.0000 1.0000 0.0000
0.0000 0.0000 0.0000 1.0000
The combination score * coeff' will reproduce the centered version of your observations. The mean mu is subtracted prior to performing PCA. To reproduce your original observations you need to add it back in,
>> reconstructed = score * coeff' + repmat(mu, 100, 1);
>> sum((observations - reconstructed).^2)
ans =
1.0e-27 *
0.0311 0.0104 0.0440 0.3378
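Since the question asks specifically for eigenvectors and eigenvalues: as far as I know, the columns of coeff are the eigenvectors of the sample covariance matrix and latent holds the matching eigenvalues. Here is a small check of that relationship on the same kind of synthetic data. It is a sketch rather than a reference implementation; the names eigvals, order, maxEigDiff and maxVecDiff are mine.

```matlab
rng(0);  % for reproducibility

% Same synthetic setup as above: two hidden factors, four observed variables
factors = randn(100, 2);
loadings = [1 0; 0 1; 1 1; 1 -1];
observations = factors * loadings' + 0.1 * randn(100, 4);

[coeff, score, latent] = pca(observations);

% Eigen-decomposition of the sample covariance matrix
[V, D] = eig(cov(observations));

% eig does not return the eigenvalues in descending order, pca does
[eigvals, order] = sort(diag(D), 'descend');
V = V(:, order);

% latent should equal the sorted eigenvalues
maxEigDiff = max(abs(eigvals - latent));

% coeff should equal the eigenvectors up to the sign of each column
maxVecDiff = max(max(abs(abs(V) - abs(coeff))));
```

Both differences should come out at rounding-error level, so you can read the eigenvectors straight from coeff and the eigenvalues from latent.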
To get an approximation to your original data, you can start dropping columns from the computed principal components. To get an idea of which columns to drop, we examine the explained variable
>> explained
explained =
58.0639
41.6302
0.1693
0.1366
The entries tell you what percentage of the variance is explained by each of the principal components. We can clearly see that the first two components are more significant than the last two (they explain more than 99% of the variance between them). Using the first two components to reconstruct the observations gives the rank-2 approximation,
>> approximationRank2 = score(:,1:2) * coeff(:,1:2)' + repmat(mu, 100, 1);
We can now try plotting:
>> for k = 1:4
subplot(2, 2, k);
hold on;
grid on
plot(approximationRank2(:, k), observations(:, k), 'x');
plot([-4 4], [-4 4]);
xlim([-4 4]);
ylim([-4 4]);
title(sprintf('Variable %d', k));
end
We get an almost perfect reproduction of the original observations. If we wanted a coarser approximation, we could just use the first principal component:
>> approximationRank1 = score(:,1) * coeff(:,1)' + repmat(mu, 100, 1);
and plot it,
>> for k = 1:4
subplot(2, 2, k);
hold on;
grid on
plot(approximationRank1(:, k), observations(:, k), 'x');
plot([-4 4], [-4 4]);
xlim([-4 4]);
ylim([-4 4]);
title(sprintf('Variable %d', k));
end
This time the reconstruction isn't so good. That's because we deliberately constructed our data to have two factors, and we're only reconstructing it from one of them.
Note that despite the suggestive similarity between the way we constructed the original data and its reproduction,
>> observations = factors * loadings' + 0.1 * randn(100,4);
>> reconstructed = score * coeff' + repmat(mu, 100, 1);
there is not necessarily any correspondence between factors and score, or between loadings and coeff. The PCA algorithm doesn't know anything about the way your data is constructed - it merely tries to explain as much of the total variance as it can with each successive component.
User @Mari asked in the comments how she could plot the reconstruction error as a function of the number of principal components. Using the variable explained above this is quite easy. I'll generate some data with a more interesting factor structure to illustrate the effect -
>> factors = randn(100, 20);
>> loadings = chol(corr(factors * triu(ones(20))))';
>> observations = factors * loadings' + 0.1 * randn(100, 20);
Now all of the observations load on a significant common factor, with other factors of decreasing importance. We can get the PCA decomposition as before
>> [coeff, score, latent, tsquared, explained, mu] = pca(observations);
and plot the percentage of explained variance as follows,
>> cumexplained = cumsum(explained);
cumunexplained = 100 - cumexplained;
plot(1:20, cumunexplained, 'x-');
grid on;
xlabel('Number of factors');
ylabel('Unexplained variance')
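If you would rather measure the reconstruction error directly instead of reading it off explained, you can rebuild the data from the first r components and sum the squared residuals. The sketch below also compares this against (n-1) times the sum of the discarded eigenvalues in latent, which is what I would expect from the way pca scales its output. The names sse, predicted, approxR, n and p are illustrative.

```matlab
rng(0);  % for reproducibility

% Same construction as above: 20 factors with decreasing importance
factors = randn(100, 20);
loadings = chol(corr(factors * triu(ones(20))))';
observations = factors * loadings' + 0.1 * randn(100, 20);

[coeff, score, latent, ~, explained, mu] = pca(observations);
n = size(observations, 1);
p = size(observations, 2);

% Sum of squared reconstruction errors when keeping the first r components
sse = zeros(p, 1);
for r = 1:p
    approxR = score(:, 1:r) * coeff(:, 1:r)' + repmat(mu, n, 1);
    sse(r) = sum(sum((observations - approxR).^2));
end

% Expected value: (n-1) times the sum of the discarded eigenvalues
predicted = (n - 1) * (sum(latent) - cumsum(latent));

plot(1:p, sse, 'x-');
grid on;
xlabel('Number of components kept');
ylabel('Sum of squared reconstruction error');
```

Up to a constant scale factor this is the same curve as the unexplained-variance plot, so either one can be used to choose the number of components.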
You have a pretty good dimensionality reduction toolbox at http://homepage.tudelft.nl/19j49/Matlab_Toolbox_for_Dimensionality_Reduction.html
Besides PCA, this toolbox has a lot of other algorithms for dimensionality reduction.
Example of doing PCA:
Reduced = compute_mapping(Features, 'PCA', NumberOfDimension);

vector of n numbers around a specific number

I'm trying to write an algorithm in MATLAB to calculate the received power in dBm for a logarithmic model of a wireless telecommunication system.
My algorithm calculates the received power for a number of distances in km that the user specifies as input, and stores it in a vector:
vector_distances = [1, 5, 10, 50, 75]
vector_Prx = [131.5266 145.5060 151.5266 165.5060 169.0278]
I almost have everything I need, but for plotting purposes I need a graph with my vector of received power on the x axis and, on the y axis, the same received power computed with the more complete logarithmic model (the one that also includes noise with a log-normal distribution in the formula). For this, for every distance in my vector I need to choose 50 numbers spaced 0.5 apart (like a matrix), and then for every new point around the same distance evaluate the logarithmic model, so that I can plot the two functions on the same graph: one from the model with no noise (a straight line) and one with the noise, like this picture:
http://imgur.com/gLSrKor
My question is: is there a way to choose 50 numbers spaced 0.5 apart around an existing number?
I know for example, if you have a vector
EDU>> m = zeros(1,5)
m =
0 0 0 0 0
EDU>> v = 5 %this is the starter distance%
v =
5
EDU>> m(1) = 5
m =
5 0 0 0 0
% I want to create a vector with 5 numbers with 0.5 distance between them %
EDU>> for i=2:5
m(i) = m(i-1) + 0.5
end
EDU>> m
m =
5.0000 5.5000 6.0000 6.5000 7.0000
But I have two problems. The first one is: could this be simpler? I am new to MATLAB. The other one is: could I create a vector like this (with the initial number in the centre)?
EDU>> m
m =
4.0000 4.5000 **5.0000** 5.5000 6.0000
Sorry for my English, and thank you so much for helping me.
In MATLAB, if you want to create a vector from a number n to a number m, you use the format
A = 5:10;
% A = [5,6,7,8,9,10]
You can also specify the step of the vector by including a third argument between the other two, like so:
A = 5:0.5:10;
% A = [5,5.5,6,6.5,7,7.5,8,8.5,9,9.5,10]
You can also use this to count backwards:
A = 10:-1:5
% A = [10,9,8,7,6,5]
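For the second part of the question (putting the starting number in the centre), the same colon syntax works if you build a symmetric offset vector and add it to the centre value. A minimal sketch, with variable names (nSide, m50, mCentred) that are my own:

```matlab
v = 5;        % the existing number
step = 0.5;   % spacing between neighbours

% Five values centred on v: two on each side
nSide = 2;
m = v + step * (-nSide:nSide);    % [4 4.5 5 5.5 6]

% Fifty values starting at v and going up in steps of 0.5
m50 = v + step * (0:49);

% An odd count is needed for v to sit exactly in the middle,
% e.g. 51 values with 25 on each side
mCentred = v + step * (-25:25);
```

With an even count such as 50 there is no exact middle element, so you have to decide whether the extra point goes below or above v.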

By which measures should I set the size of my Gaussian filter in MATLAB?

I'm trying to learn image processing using MATLAB and I have read about filters on images. By considering this code:
gaussianFilter = fspecial('gaussian', [7, 7], 5) ,
this builds a Gaussian filter matrix of 7 rows and 7 columns, with standard deviation of 5. As such, the size of filter matrix is 7 x 7 .
How can the size of this matrix be effective on filtering? (What does this matrix do ?)
By which measures should I set the size of filter matrix in my code?
One of the most common heuristics for determining the size, and ultimately the standard deviation, of the Gaussian filter is what is known as the 3-sigma rule. If you recall from probability, the Gaussian distribution has most of its mass within [mu - 3*sigma, mu + 3*sigma], where mu is the mean of the distribution and sigma is its standard deviation. This interval covers about 99.7% of the distribution. A good diagram of this is shown below:
Source: Wikipedia
The interval [mu - 3*sigma, mu + 3*sigma] contains about 99.7% of the total area underneath the Gaussian distribution. As a side note, [mu - 2*sigma, mu + 2*sigma] covers roughly 95% of the total area and [mu - sigma, mu + sigma] covers roughly 68%.
As such, what people usually do is take a look at an image and figure out what the smallest feature is. They measure the width or height of the feature and ensure that the width / height / span of the feature fits within this 3-sigma interval. Measuring across gives us a total width of 6*sigma. However, because we are dealing with the discrete domain, we also need to accommodate the centre of the Gaussian. As such, you want the total width to be 2 * floor(3*sigma) + 1. Therefore, what you need to do is figure out the width you want. Once you do that, you can figure out what sigma is required to satisfy this width. As an example, let's say the width of our smallest feature was 19. You would then figure out your sigma by:
19 = 2*floor(3*sigma) + 1
19 = 6*sigma + 1
18 = 6*sigma
sigma = 3
Therefore, you would create your Gaussian kernel like so:
h = fspecial('gaussian', [19 19], 3);
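To see how much of the Gaussian a kernel of this width actually keeps, you can go the other way (from sigma to width) and compare the truncated samples against a much wider support. This sketch builds the kernel by hand so that it runs without the Image Processing Toolbox; the names half, captured, xWide and gWide are mine, and the hand-built h is only intended to mirror what fspecial produces for this size.

```matlab
sigma = 3;

% Width from the 3-sigma rule used above
width = 2 * floor(3 * sigma) + 1;    % 19 for sigma = 3
half = (width - 1) / 2;

% Samples of the (unnormalised) 1-D Gaussian on the kernel support
x = -half:half;
g = exp(-x.^2 / (2 * sigma^2));

% Same Gaussian on a much wider support, as a stand-in for "infinite"
L = ceil(20 * sigma);
xWide = -L:L;
gWide = exp(-xWide.^2 / (2 * sigma^2));

% Fraction of the total weight that the truncated kernel keeps
captured = sum(g) / sum(gWide);

% Normalised 2-D kernel as the outer product of the 1-D samples
h = (g' * g) / sum(g)^2;
```

If captured is close to 1, enlarging the window further for the same sigma changes the result very little and only costs computation time.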
If you want to play around with the mask size, simply use the above equation to solve for sigma each time. Now, to answer your question about size: this is a low-pass filter (LPF). As such, increasing the size of the matrix (together with sigma, as above) will increase the effect of the LPF. Your image will become progressively more blurred as you increase the size. Play around with the size and see what you get. If you don't have any particular image in mind when trying this out, you can use any built-in image in MATLAB instead. As such, try doing the following:
%// Read in the image - Part of MATLAB path
im = imread('cameraman.tif');
%// Determine widths and standard deviations
width1 = 3; sigma1 = (width1-1) / 6;
width2 = 7; sigma2 = (width2-1) / 6;
width3 = 13; sigma3 = (width3-1) / 6;
width4 = 19; sigma4 = (width4-1) / 6;
%// Create Gaussian kernels
h1 = fspecial('gaussian', [width1 width1], sigma1);
h2 = fspecial('gaussian', [width2 width2], sigma2);
h3 = fspecial('gaussian', [width3 width3], sigma3);
h4 = fspecial('gaussian', [width4 width4], sigma4);
%// Filter the image using each kernel
out1 = imfilter(im, h1, 'replicate');
out2 = imfilter(im, h2, 'replicate');
out3 = imfilter(im, h3, 'replicate');
out4 = imfilter(im, h4, 'replicate');
%// Display them all on a figure
figure;
subplot(2,2,1);
imshow(out1);
title(['Width = 3']);
subplot(2,2,2);
imshow(out2);
title(['Width = 7']);
subplot(2,2,3);
imshow(out3);
title(['Width = 13']);
subplot(2,2,4);
imshow(out4);
title(['Width = 19']);
You'll get the following output:
Theoretically, the Gaussian bell has infinite extent, but that would simply take too long to compute.
Take a look at this output:
>> fspecial('gaussian', [7, 7], 1)
ans =
0.0000 0.0002 0.0011 0.0018 0.0011 0.0002 0.0000
0.0002 0.0029 0.0131 0.0216 0.0131 0.0029 0.0002
0.0011 0.0131 0.0586 0.0966 0.0586 0.0131 0.0011
0.0018 0.0216 0.0966 0.1592 0.0966 0.0216 0.0018
0.0011 0.0131 0.0586 0.0966 0.0586 0.0131 0.0011
0.0002 0.0029 0.0131 0.0216 0.0131 0.0029 0.0002
0.0000 0.0002 0.0011 0.0018 0.0011 0.0002 0.0000
You can see that the outer columns/rows are filled with very small values, which contribute almost nothing to the result. For such a small standard deviation, you can use a smaller filter to save computation time. I would suggest applying different sizes to an image with sharp edges; if the size is small and the standard deviation is high, you will see artefacts.

ICA - Statistical Independence & Eigenvalues of Covariance Matrix

I am currently creating different signals using Matlab, mixing them by multiplying them by a mixing matrix A, and then trying to get back the original signals using FastICA.
So far, the recovered signals are really bad when compared to the original ones, which was not what I expected.
I'm trying to see whether I'm doing anything wrong. The signals I'm generating are the following: (Amplitudes are in the range [0,1].)
s1 = (-x.^2 + 100*x + 500) / 3000; % quadratic
s2 = exp(-x / 10); % -ve exponential
s3 = (sin(x)+ 1) * 0.5; % sine
s4 = 0.5 + 0.1 * randn(size(x, 2), 1); % gaussian
s5 = (sawtooth(x, 0.75)+ 1) * 0.5; % sawtooth
One condition for ICA to be successful is that at most one signal is Gaussian, and I've observed this in my signal generation.
However, another condition is that all signals are statistically independent.
All I know is that this means that, given two signals A & B, knowing one signal does not give any information with regards to the other, i.e.: P(A|B) = P(A) where P is the probability.
Now my question is this: Are my signals statistically independent? Is there any way I can determine this? Perhaps some property that must be observed?
Another thing I've noticed is that when I calculate the eigenvalues of the covariance matrix (calculated for the matrix containing the mixed signals), the eigenspectrum seems to show that there is only one (main) principal component. What does this really mean? Shouldn't there be 5, since I have 5 (supposedly) independent signals?
For example, when using the following mixing matrix:
A =
0.2000 0.4267 0.2133 0.1067 0.0533
0.2909 0.2000 0.2909 0.1455 0.0727
0.1333 0.2667 0.2000 0.2667 0.1333
0.0727 0.1455 0.2909 0.2000 0.2909
0.0533 0.1067 0.2133 0.4267 0.2000
The eigenvalues are: 0.0000 0.0005 0.0022 0.0042 0.0345 (only 4!)
When using the identity matrix as the mixing matrix (i.e. the mixed signals are the same as the original ones), the eigenspectrum is: 0.0103 0.0199 0.0330 0.0811 0.1762. There still is one value much larger than the rest..
Thank you for your help.
I apologise if the answers to my questions are painfully obvious, but I'm really new to statistics, ICA and Matlab. Thanks again.
EDIT - I have 500 samples of each signal, in the range [0.2, 100], in steps of 0.2, i.e. x = 0.2:0.2:100.
EDIT - Given the ICA Model: X = As + n (I'm not adding any noise at the moment), but I am referring to the eigenspectrum of the transpose of X, i.e. eig(cov(X')).
Your signals are not independent. Right off the bat, the sawtooth and the sine have the same period: tell me the value of one and I can largely tell you the value of the other, so they are strongly dependent.
If you change the period of one of them, that will make them more independent.
Also, s1 and s2 are somewhat correlated.
As for the eigenvalues, first of all your signals are not independent (see above).
Second, your mixing matrix A is also not well conditioned, which spreads out your eigenvalues further.
Even if you were to pipe in five fully independent (i.i.d., zero-mean, unit-variance) signals y, the covariance of the mixture would be:
E[ A y y' A' ] = A E[ y y' ] A' = A I A' = A A'
The eigenvalues of that are:
eig(A*A')
ans =
0.000167972216475
0.025688510850262
0.035666735304024
0.148813869149738
1.042451912479502
So you're really filtering/squishing all the signals down onto one basis function / degree of freedom and of course they'll be hard to recover, whatever method you use.
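You can check how badly the mixing matrix squashes things without generating any signals. The sketch below uses the matrix A printed in the question (rounded to four decimals, so the numbers will differ slightly from exact values); condA, ev and spread are names I made up.

```matlab
% Mixing matrix as printed in the question (rounded to 4 decimals)
A = [0.2000 0.4267 0.2133 0.1067 0.0533
     0.2909 0.2000 0.2909 0.1455 0.0727
     0.1333 0.2667 0.2000 0.2667 0.1333
     0.0727 0.1455 0.2909 0.2000 0.2909
     0.0533 0.1067 0.2133 0.4267 0.2000];

% 2-norm condition number: ratio of largest to smallest singular value
condA = cond(A);

% Eigenvalues of A*A', i.e. the covariance of the mixture when the
% sources are independent with unit variance
ev = sort(eig(A * A'));

% Ratio between the largest and smallest eigenvalue
spread = ev(end) / ev(1);
```

The eigenvalues of A*A' are the squared singular values of A, so spread equals condA^2. A large value means that at least one direction of the source space is almost wiped out by the mixing, which makes recovery hard for any method.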
To check whether the signals are mutually independent, you could look at the techniques described here. A necessary condition is that independent random variables are uncorrelated; for zero-mean signals this means E{s1*s2} = 0, i.e. the expectation of the random variable s1 multiplied by the random variable s2 is zero (the signals are orthogonal). The converse does not hold in general: uncorrelated variables can still be dependent, so this test can rule independence out but cannot prove it. This orthogonality condition is extremely important in statistics and probability and shows up everywhere. Unfortunately it applies to 2 variables at a time. There are multivariable techniques, but none that I would feel comfortable recommending. Another link I dug up was this one; I'm not sure what your application is, but that paper is very well done.
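A quick first check along these lines is the sample correlation matrix of the sources: off-diagonal entries far from zero rule independence out, while entries near zero are necessary but not sufficient for it. The sketch below rebuilds the five signals from the question. To avoid depending on the Signal Processing Toolbox, the sawtooth is replaced by a hand-rolled stand-in (saw) that is meant to mimic sawtooth(x, 0.75); that stand-in and the names t, w, S and R are my own.

```matlab
rng(0);  % for reproducibility of the Gaussian signal

x = (0.2:0.2:100)';

s1 = (-x.^2 + 100*x + 500) / 3000;      % quadratic
s2 = exp(-x / 10);                      % negative exponential
s3 = (sin(x) + 1) * 0.5;                % sine
s4 = 0.5 + 0.1 * randn(length(x), 1);   % Gaussian

% Hand-rolled stand-in for sawtooth(x, 0.75): period 2*pi, rising from
% -1 to 1 over the first 75% of each period, then falling back to -1
w = 0.75;
t = mod(x, 2*pi) / (2*pi);
saw = (t < w) .* (2*t/w - 1) + (t >= w) .* (1 - 2*(t - w)/(1 - w));
s5 = (saw + 1) * 0.5;                   % sawtooth scaled to [0, 1]

% Pairwise sample correlation coefficients between the five signals
S = [s1 s2 s3 s4 s5];
R = corrcoef(S);
```

Look at R(3,5) for the sine/sawtooth pair and R(1,2) for the quadratic/exponential pair. Keep in mind that this only measures linear dependence.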
When I calculate the covariance matrix I get:
cov(A) =
0.0619 -0.0284 -0.0002 -0.0028 -0.0010
-0.0284 0.0393 0.0049 0.0007 -0.0026
-0.0002 0.0049 0.1259 0.0001 -0.0682
-0.0028 0.0007 0.0001 0.0099 -0.0012
-0.0010 -0.0026 -0.0682 -0.0012 0.0831
With eigenvectors V and eigenvalues D:
[V,D] = eig(cov(A))
V =
-0.0871 0.5534 0.0268 -0.8279 0.0063
-0.0592 0.8264 -0.0007 0.5584 -0.0415
-0.0166 -0.0352 0.5914 -0.0087 -0.8054
-0.9937 -0.0973 -0.0400 0.0382 -0.0050
-0.0343 0.0033 0.8050 0.0364 0.5912
D =
0.0097 0 0 0 0
0 0.0200 0 0 0
0 0 0.0330 0 0
0 0 0 0.0812 0
0 0 0 0 0.1762
Here's my code:
x = transpose(0.2:0.2:100);
s1 = (-x.^2 + 100*x + 500) / 3000; % quadratic
s2 = exp(-x / 10); % -ve exponential
s3 = (sin(x)+ 1) * 0.5; % sine
s4 = 0.5 + 0.1 * randn(length(x), 1); % gaussian
s5 = (sawtooth(x, 0.75)+ 1) * 0.5; % sawtooth
A = [s1 s2 s3 s4 s5];
cov(A)
[V,D] = eig(cov(A))
Let me know if I can help any more, or if I misunderstood.
EDIT: Referred to eigenvalues and eigenvectors properly, used a 0.2 sampling interval, and added code.