I have a large dataset of multidimensional data (132 dimensions).
I am a beginner at data mining and I want to apply Principal Components Analysis using Matlab. However, I have seen a lot of functions explained on the web, but I do not understand how they should be applied.
Basically, I want to apply PCA and to obtain the eigenvectors and their corresponding eigenvalues out of my data.
After this step I want to be able to do a reconstruction for my data based on a selection of the obtained eigenvectors.
I can do this manually, but I was wondering if there are any predefined functions which can do this because they should already be optimized.
My initial data is something like: size(x) = [33800 132]. So basically I have 132 features (dimensions) and 33800 data points, and I want to perform PCA on this data set.
Any help or hint would do.
Here's a quick walkthrough. First we create a matrix of your hidden variables (or "factors"). It has 100 observations and there are two independent factors.
>> factors = randn(100, 2);
Now create a loadings matrix. This is going to map the hidden variables onto your observed variables. Say your observed variables have four features. Then your loadings matrix needs to be 4 x 2
>> loadings = [
1 0
0 1
1 1
1 -1 ];
That tells you that the first observed variable loads on the first factor, the second loads on the second factor, the third variable loads on the sum of factors and the fourth variable loads on the difference of the factors.
Now create your observations:
>> observations = factors * loadings' + 0.1 * randn(100,4);
I added a small amount of random noise to simulate experimental error. Now we perform the PCA using the pca function from the stats toolbox:
>> [coeff, score, latent, tsquared, explained, mu] = pca(observations);
The variable score is the array of principal component scores. These will be orthogonal by construction, which you can check -
>> corr(score)
ans =
1.0000 0.0000 0.0000 0.0000
0.0000 1.0000 0.0000 0.0000
0.0000 0.0000 1.0000 0.0000
0.0000 0.0000 0.0000 1.0000
The combination score * coeff' will reproduce the centered version of your observations. The mean mu is subtracted prior to performing PCA. To reproduce your original observations you need to add it back in,
>> reconstructed = score * coeff' + repmat(mu, 100, 1);
>> sum((observations - reconstructed).^2)
ans =
1.0e-27 *
0.0311 0.0104 0.0440 0.3378
To get an approximation to your original data, you can start dropping columns from the computed principal components. To get an idea of which columns to drop, we examine the explained variable
>> explained
explained =
58.0639
41.6302
0.1693
0.1366
The entries tell you what percentage of the variance is explained by each of the principal components. We can clearly see that the first two components are far more significant than the last two (together they explain more than 99% of the variance). Using the first two components to reconstruct the observations gives the rank-2 approximation,
>> approximationRank2 = score(:,1:2) * coeff(:,1:2)' + repmat(mu, 100, 1);
We can now try plotting:
>> for k = 1:4
subplot(2, 2, k);
hold on;
grid on
plot(approximationRank2(:, k), observations(:, k), 'x');
plot([-4 4], [-4 4]);
xlim([-4 4]);
ylim([-4 4]);
title(sprintf('Variable %d', k));
end
We get an almost perfect reproduction of the original observations. If we wanted a coarser approximation, we could just use the first principal component:
>> approximationRank1 = score(:,1) * coeff(:,1)' + repmat(mu, 100, 1);
and plot it,
>> for k = 1:4
subplot(2, 2, k);
hold on;
grid on
plot(approximationRank1(:, k), observations(:, k), 'x');
plot([-4 4], [-4 4]);
xlim([-4 4]);
ylim([-4 4]);
title(sprintf('Variable %d', k));
end
This time the reconstruction isn't so good. That's because we deliberately constructed our data to have two factors, and we're only reconstructing it from one of them.
Note that despite the suggestive similarity between the way we constructed the original data and its reproduction,
>> observations = factors * loadings' + 0.1 * randn(100,4);
>> reconstructed = score * coeff' + repmat(mu, 100, 1);
there is not necessarily any correspondence between factors and score, or between loadings and coeff. The PCA algorithm doesn't know anything about the way your data is constructed - it merely tries to explain as much of the total variance as it can with each successive component.
User @Mari asked in the comments how she could plot the reconstruction error as a function of the number of principal components. Using the explained variable from above, this is quite easy. I'll generate some data with a more interesting factor structure to illustrate the effect -
>> factors = randn(100, 20);
>> loadings = chol(corr(factors * triu(ones(20))))';
>> observations = factors * loadings' + 0.1 * randn(100, 20);
Now all of the observations load on a significant common factor, with other factors of decreasing importance. We can get the PCA decomposition as before
>> [coeff, score, latent, tsquared, explained, mu] = pca(observations);
and plot the percentage of explained variance as follows,
>> cumexplained = cumsum(explained);
cumunexplained = 100 - cumexplained;
plot(1:20, cumunexplained, 'x-');
grid on;
xlabel('Number of factors');
ylabel('Unexplained variance')
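Applied to the data in the question (the 33800-by-132 matrix x), the same workflow would look roughly like the sketch below; keeping enough components to explain 95% of the variance is only an illustrative choice of mine, not something from the question.
[coeff, score, latent, tsquared, explained, mu] = pca(x);
nComp = find(cumsum(explained) >= 95, 1);       % smallest number of components explaining >= 95% of the variance
reducedX = score(:, 1:nComp);                   % 33800-by-nComp reduced representation
reconstructedX = score(:, 1:nComp) * coeff(:, 1:nComp)' ...
    + repmat(mu, size(x, 1), 1);                % rank-nComp approximation of the original data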
There is a pretty good dimensionality reduction toolbox at http://homepage.tudelft.nl/19j49/Matlab_Toolbox_for_Dimensionality_Reduction.html
Besides PCA, this toolbox has a lot of other algorithms for dimensionality reduction.
Example of doing PCA:
Reduced = compute_mapping(Features, 'PCA', NumberOfDimension);
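For the dataset in the question, that call would presumably look something like the line below; the target dimensionality of 10 is an arbitrary illustrative choice, not a recommendation:
Reduced = compute_mapping(x, 'PCA', 10);   % x is the 33800-by-132 data matrix from the question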
I am not sure how to do prediction for some new data using a trained Gaussian Mixture Model (GMM). For example, I have got some labelled data drawn from 3 different classes (clusters). For each class of data points, I fit a GMM (gm1, gm2 and gm3). Suppose we know the number of Gaussian components for each class (e.g., k1=2, k2=1 and k3=3), or it can be estimated (optimised) using the Akaike information criterion (AIC). Then, when I have got some new dataset, how can I know if it is more likely to belong to class 1, 2 or 3?
Some Matlab script shows what I mean:
clc; clf; clear all; close all;
%% Create some artificial training data
% 1. Cluster 1 with two mixture of Gaussian (k1 = 2)
rng default; % For reproducibility
mu1 = [1 2];
sigma1 = [3 .2; .2 2];
mu2 = [-1 -2];
sigma2 = [2 0; 0 1];
X1 = [mvnrnd(mu1,sigma1,200); mvnrnd(mu2,sigma2,100)];
options1 = statset('Display', 'final');
k1 = 2;
gm1 = fitgmdist(X1, k1, 'Options', options1);
% 2. Cluster 2 with one mixture of Gaussian (k2 = 1)
mu3 = [6 4];
sigma3 = [3 .1; .1 4];
X2 = mvnrnd(mu3,sigma3,300);
options2 = statset('Display', 'final');
k2 = 1;
gm2 = fitgmdist(X2, k2, 'Options', options2);
% 3. Cluster 3 with three mixture of Gaussian (k3 = 3)
mu4 = [-5 -6];
sigma4 = [1 .1; .1 1];
mu5 = [-5 -10];
sigma5 = [6 .1; .1 1];
mu6 = [-2 -15];
sigma6 = [8 .1; .1 4];
X3 = [mvnrnd(mu4,sigma4,200); mvnrnd(mu5,sigma5,300); mvnrnd(mu6,sigma6,100)];
options3 = statset('Display', 'final');
k3 = 3;
gm3 = fitgmdist(X3, k3, 'Options', options3);
% Display
figure,
scatter(X1(:,1),X1(:,2),10,'ko'); hold on;
ezcontour(@(x,y)pdf(gm1, [x y]), [-12 12], [-12 12]);
scatter(X2(:,1),X2(:,2),10,'ko');
ezcontour(@(x,y)pdf(gm2, [x y]), [-12 12], [-12 12]);
scatter(X3(:,1),X3(:,2),10,'ko');
ezcontour(@(x,y)pdf(gm3, [x y]), [-12 12], [-12 12]); hold off;
This produces a figure showing the three fitted GMM density contours over the training data.
Then we have got some new testing data for example:
%% Create some artificial testing data
mut1 = [6.1 3.8];
sigmat1 = [3.1 .1; .1 4.2];
mut2 = [5.8 4.5];
sigmat2 = [2.8 .1; .1 3.8];
Xt1 = [mvnrnd(mut1,sigmat1,500); mvnrnd(mut2,sigmat2,100)];
figure,
scatter(Xt1(:,1),Xt1(:,2),10,'ko');
xlim([-12 12]); ylim([-12 12]);
I made the testing data similar to the Cluster 2 data on purpose. After we do the training using GMM, can we somehow predict the label of the new testing data? Is it possible to get probabilities out, like (p1 = 18%, p2 = 80% and p3 = 2%), for the prediction of each class? Since p2 = 80% would be the largest, we could then make a hard classification that the new testing data is labelled as Cluster 2.
p.s.: I have found this post but it seems too theoretical to me (a similar post). If you can, please put some simple Matlab script in your reply.
Thanks very much. A.
EDIT:
As Amro replied a solution for the problem, I have got more questions.
Amro created a new GMM using the entire dataset with some initialisation:
% initial parameters of the new GMM (combination of the previous three)
% (note PComponents is normalized according to proportion of data in each subset)
S = struct('mu',[gm1.mu; gm2.mu; gm3.mu], ...
'Sigma',cat(3, gm1.Sigma, gm2.Sigma, gm3.Sigma), ...
'PComponents',[gm1.PComponents*n1, gm2.PComponents*n2, gm3.PComponents*n3]./n);
% train the final model over all instances
opts = statset('MaxIter',1000, 'Display','final');
gmm = fitgmdist(X, k, 'Options',opts, 'Start',S);
What Amro got is something like the figure below, which may not be suitable for my data, because it splits my labelled cluster1 and mixes part of it into cluster2. This is what I am trying to avoid.
What I present here is an artificial numerical example; however, my real application is an image segmentation problem (for example, cluster1 is my background image and cluster2 is the object I want to separate). So I am trying to somehow 'force' the separate GMMs to fit separate classes. If two clusters are far apart (for example, cluster1 and cluster3 in this example), there is no problem using Amro's method to combine all the data and then do a GMM fitting. However, when we train on image data, the separation of background from object will never be perfect, due to the limited resolution (the partial volume effect); therefore, it is very likely that cluster1 overlaps with cluster2 as shown. I think mixing all the data and then fitting may cause problems for the further prediction of new data, am I right?
However, after a little bit of thinking, what I am trying to do now is:
% Combine the mixture of Gaussian and form a new gmdistribution
muAll = [gm1.mu; gm2.mu; gm3.mu];
sigmaAll = cat(3, gm1.Sigma, gm2.Sigma, gm3.Sigma);
gmAll = gmdistribution(muAll, sigmaAll);
pt1 = posterior(gmAll, Xt1);
What do you guys think? Or is it equivalent to Amro's method? If so, is there a way to force my trained GMMs to stay separated?
Also, I have a question about the rationale of using the posterior function. Essentially, I want to estimate the likelihood of my testing data given the GMM fitting. Then why do we calculate the posterior probability now? Or is it just a naming issue (in other words, is the 'posterior probability' the same as the 'likelihood')?
As far as I know, GMM has always been used as an unsupervised method. Someone even mentioned to me that GMM is a probabilistic version of k-means clustering. Is it legitimate to use it in such a 'supervised' style? Any recommended papers or references?
Thanks very much again for your reply!
A.
Effectively you have trained three GMM models, not one, each being a mixture itself. Typically you would create one GMM with multiple components, where each component represents a cluster...
So what I would do in your case is create a new GMM model trained on the entire dataset (X1, X2, and X3) with the number of components equal to the total sum of all components from the three GMMs (that is 2+1+3 = 6 Gaussian components). This model would be initialized using the parameters of the individually trained ones.
Here is the code to illustrate (I'm using the same variables you created in your example):
% number of instances in each data subset
n1 = size(X1,1);
n2 = size(X2,1);
n3 = size(X3,1);
% the entire dataset
X = [X1; X2; X3];
n = n1 + n2 + n3;
k = k1 + k2 + k3;
% initial parameters of the new GMM (combination of the previous three)
% (note PComponents is normalized according to proportion of data in each subset)
S = struct('mu',[gm1.mu; gm2.mu; gm3.mu], ...
'Sigma',cat(3, gm1.Sigma, gm2.Sigma, gm3.Sigma), ...
'PComponents',[gm1.PComponents*n1, gm2.PComponents*n2, gm3.PComponents*n3]./n);
% train the final model over all instances
opts = statset('MaxIter',1000, 'Display','final');
gmm = fitgmdist(X, k, 'Options',opts, 'Start',S);
% display GMM density function over training data
line(X(:,1), X(:,2), 'LineStyle','none', ...
'Marker','o', 'MarkerSize',1, 'Color','k')
hold on
ezcontour(@(x,y) pdf(gmm,[x y]), xlim(), ylim())
hold off
title(sprintf('GMM over %d training instances',n))
Now that we have trained one GMM model over the entire training dataset (with k=6 mixtures), we can use it to cluster new data instances:
cIdx = cluster(gmm, Xt1);
This is the same as manually computing the posterior probability of components, and taking the component with the largest probability as cluster index:
pr = posterior(gmm, Xt1);
[~,cIdx] = max(pr,[],2);
As expected, about 95% of the test data was clustered as belonging to the same component:
>> tabulate(cIdx)
Value Count Percent
1 27 4.50%
2 0 0.00%
3 573 95.50%
Here are the matching Gaussian parameters:
>> idx = 3;
>> gmm.mu(idx,:)
ans =
5.7779 4.1731
>> gmm.Sigma(:,:,idx)
ans =
2.9504 0.0801
0.0801 4.0907
This indeed corresponds to the component in the upper-right side from the previous figure.
Similarly if you inspect the other component idx=1, it will be the one just on the left of the previous one, which explains how 27 out of the 600 test instances were "misclassified" if you will... Here's how confident the GMM was on those instances:
>> pr(cIdx==1,:)
ans =
0.9813 0.0001 0.0186 0.0000 0.0000 0.0000
0.6926 0.0000 0.3074 0.0000 0.0000 0.0000
0.5069 0.0000 0.4931 0.0000 0.0000 0.0000
0.6904 0.0018 0.3078 0.0000 0.0000 0.0000
0.6954 0.0000 0.3046 0.0000 0.0000 0.0000
<... output truncated ...>
0.5077 0.0000 0.4923 0.0000 0.0000 0.0000
0.6859 0.0001 0.3141 0.0000 0.0000 0.0000
0.8481 0.0000 0.1519 0.0000 0.0000 0.0000
Here are the test instances overlayed on top of the previous figure:
hold on
gscatter(Xt1(:,1), Xt1(:,2), cIdx)
hold off
title('clustered test instances')
EDIT:
My example above was meant to show how to use GMMs for clustering data (unsupervised learning). From what I understand now, what you want instead is to classify data using existing trained models (supervised learning). I guess I was confused by your use of the term "clusters" :)
Anyway it should be easy now; just compute the class-conditional probability density function of the test data using each model, and pick the model with the highest likelihood as class label (no need to combine models into one).
So continuing on your initial code, that would simply be:
p = [pdf(gm1,Xt1), pdf(gm2,Xt1), pdf(gm3,Xt1)]; % P(x|model_i)
[~,cIdx] = max(p,[],2); % argmax_i P(x|model_i)
cIdx is the class prediction (1, 2, or 3) of each instance in the test data.
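If you also want percentage-style probabilities like the ones mentioned in the question (p1, p2, p3), one possibility, sketched here under my assumption of equal class priors (not something stated above), is to normalize the class-conditional densities in p:
% Sketch: turn class-conditional densities into class probabilities,
% assuming equal class priors; weight the columns by class proportions if they differ.
classProb = p ./ repmat(sum(p,2), 1, 3);   % rows sum to 1; columns correspond to gm1, gm2, gm3
Taking the maximum of each row of classProb gives the same hard labels as cIdx.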
I already have my data prepared in terms of:
p1=input1 % load of today current hour
p2=input2 % load of today past one hour
p3=input3 % load of today past two hours
a1=output % load of next day current hour
I have the following code below:
%Input Set 1 For Weekday Load(d+1,t)
%(d,t),(d,t-1), (d,t-2)
L=xlsread('input_set1_weekday.xlsx',1); %2011
k=1;
size(L,1);
for a=5:2:size(L,1)-48 % L load for 2011
P(1,k)= L(a,1);
P(2,k)= L(a-2,1);
P(3,k)= L(a-4,1);
P(4,k)= L(a+48,1);
k=k+1;
end
I have my data arranged in such a way that in every column, p1, p2, p3 are my predictor variables and a1 is my response variable.
How do I now fit a linear model to this set of data to check the performance of my predictions? By the way, it is an electrical load forecasting model.
My other doubt is that in the examples shown by most sources, they use the last column of the data as the response variable, and this is the part I'm struggling with.
fitlm will be able to do this for you quite nicely. You use fitlm to train a linear regression model, so you provide it the predictors as well as the responses. Once you do this, you can then use predict to predict the new responses based on new predictors that you put in.
The basic way for you to call this is:
lmModel = fitlm(X, y, 'linear', 'RobustOpts', 'on');
X is a data matrix where each column is a predictor and each row is an observation. Therefore, you would have to transpose your matrix before running this function. Basically, you would do P(1:3,:).' as you only want the first three rows (now columns) of your data. y would be your output values for each observation and this is a column vector that has the same number of rows as your observations. Regarding your comment about using the "last" column as the response vector, you don't have to do this at all. You specify your response vector in a completely separate input variable, which is y. As such, your a1 would serve here, while your predictors and observations would be stored in X. You can totally place your response vector as a column in your matrix; you would just have to subset it accordingly.
As such, y would be your a1 variable; make sure it's a column vector, which you can ensure by doing a1(:). The linear flag specifies linear regression, but that is the default flag anyway. RobustOpts is recommended so that you can perform robust linear regression. For your case, you would call fitlm this way:
lmModel = fitlm(P(1:3,:).', a1(:), 'linear', 'RobustOpts', 'on');
Now to predict new responses, you would do:
ypred = predict(lmModel, Xnew);
Xnew would be your new observations that follow the same style as X. You have to have the same number of columns as X, but you can have as many rows as you want. The output ypred will give you the predicted response for each observation of Xnew. As an example, let's use a dataset that is built into MATLAB, split the data into a training and a test set, fit a model with the training set, then use the test set and see what the predicted responses are. Let's split up the data so that it's a 75% / 25% ratio. We will use the carsmall dataset, which contains 100 observations for various cars and has descriptors such as Weight, Displacement, Model... typically used to describe cars. We will use Weight, Cylinders and Acceleration as the predictor variables, and try to predict the miles per gallon MPG as our outcome. Once this is done, we'll calculate the difference between the predicted values and the true values and compare them. As such:
load carsmall; %// Load in dataset
%// Build predictors and outcome
X = [Weight Cylinders Acceleration];
y = MPG;
%// Set seed for reproducibility
rng(1234);
%// Generate training and test data sets
%// Randomly select 75 observations for the training
%// dataset. First generate the indices to select the data
indTrain = randperm(100, 75);
%// The above may generate an error if you have anything below R2012a
%// As such, try this if the above doesn't work
%//indTrain = randperm(100);
%//indTrain = indTrain(1:75);
%// Get those indices that haven't been selected as the test dataset
indTest = 1 : 100;
indTest(indTrain) = [];
%// Now build our test and training data
trainX = X(indTrain, :);
trainy = y(indTrain);
testX = X(indTest, :);
testy = y(indTest);
%// Fit linear model
lmModel = fitlm(trainX, trainy, 'linear', 'RobustOpts', 'on');
%// Now predict
ypred = predict(lmModel, testX);
%// Show differences between predicted and true test output
diffPredict = abs(ypred - testy);
This is what happens when you echo out what the linear model looks like:
lmModel =
Linear regression model (robust fit):
y ~ 1 + x1 + x2 + x3
Estimated Coefficients:
Estimate SE tStat pValue
__________ _________ _______ __________
(Intercept) 52.495 3.7425 14.027 1.7839e-21
x1 -0.0047557 0.0011591 -4.1031 0.00011432
x2 -2.0326 0.60512 -3.359 0.0013029
x3 -0.26011 0.1666 -1.5613 0.12323
Number of observations: 70, Error degrees of freedom: 66
Root Mean Squared Error: 3.64
R-squared: 0.788, Adjusted R-Squared 0.778
F-statistic vs. constant model: 81.7, p-value = 3.54e-22
This all comes from statistical analysis, but for a novice, what matters are the p-values for each of our predictors. The smaller the p-value, the more suitable the predictor is for your model. You can see that the first two predictors, Weight and Cylinders, are good predictors for determining MPG. Acceleration... not so much. This means that this variable is not a meaningful predictor to use, so you should probably use something else. In fact, if you were to remove this predictor and retrain your model, you would most likely see that the predicted values closely match those obtained when Acceleration was included.
This is a truly bastardized version of interpreting p-values, so I defer you to an actual regression modelling or statistics course for more details.
This is what we have predicted the values to be, given our test set and beside it what the true values are:
>> [ypred testy]
ans =
17.0324 18.0000
12.9886 15.0000
13.1869 14.0000
14.1885 NaN
16.9899 14.0000
29.1824 24.0000
23.0753 18.0000
28.6148 28.0000
28.2572 25.0000
29.0365 26.0000
20.5819 22.0000
18.3324 20.0000
20.4845 17.5000
22.3334 19.0000
12.2569 16.5000
13.9280 13.0000
14.7350 13.0000
26.6757 27.0000
30.9686 36.0000
30.4179 31.0000
29.7588 36.0000
30.6631 38.0000
28.2995 26.0000
22.9933 22.0000
28.0751 32.0000
The fourth actual output value from the test data set is NaN, which denotes that the value is missing. However, when we run the observation corresponding to this output value through our linear model, it predicts a value anyway, which is to be expected. The model was trained on the other observations, so when it is asked for a prediction on this observation, it naturally draws on what it learned from them.
When we compute the difference between these two, we get:
diffPredict =
0.9676
2.0114
0.8131
NaN
2.9899
5.1824
5.0753
0.6148
3.2572
3.0365
1.4181
1.6676
2.9845
3.3334
4.2431
0.9280
1.7350
0.3243
5.0314
0.5821
6.2412
7.3369
2.2995
0.9933
3.9249
As you can see, there are some instances where the prediction was quite close, and others where the prediction was far from the truth... that's the crux of any prediction algorithm really. You'll have to play around with which predictors you want, as well as with the options for your training. Have a look at the fitlm documentation for more details on what you can play around with.
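As a small sketch of the earlier suggestion to drop Acceleration and retrain (continuing from the carsmall variables defined above; the exact numbers will depend on the random split, so treat this as illustrative):
%// Refit using only Weight and Cylinders (columns 1 and 2 of X)
lmModel2 = fitlm(trainX(:, 1:2), trainy, 'linear', 'RobustOpts', 'on');
ypred2 = predict(lmModel2, testX(:, 1:2));
diffPredict2 = abs(ypred2 - testy);   %// compare against diffPredict from the full model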
Edit - July 30th, 2014
As you don't have fitlm, you can easily use LinearModel.fit. You would call it with the same inputs as fitlm. As such:
lmModel = LinearModel.fit(trainX, trainy, 'linear', 'RobustOpts', 'on');
This should give you exactly the same results. predict should exist pre-R2014a, so that should be available to you.
Good luck!
I have a set of features which I wish to rank according to their correlation coefficient with each other, without accounting for the true label (that would be supervised feature selection, right?).
My objective is to select the first feature as the one most correlated with all the others, take it out, and so on.
The problem is how to test the correlation of a vector with a matrix (all the other vectors/features). Is it possible to do this, or am I going about it the right way?
PS: I'm using MATLAB 2013b
Thank you all
Say you had an n-by-d matrix X where the rows are instances and the columns are the features/dimensions, then you can compute the correlation coefficient matrix simply using the corr or corrcoef functions:
% Fisher Iris dataset, 150x4
>> load fisheriris
>> X = meas;
>> C = corr(X)
C =
1.0000 -0.1176 0.8718 0.8179
-0.1176 1.0000 -0.4284 -0.3661
0.8718 -0.4284 1.0000 0.9629
0.8179 -0.3661 0.9629 1.0000
The result is a d-by-d matrix containing correlation coefficients of each feature against every other feature. The diagonal is thus all ones (because corr(x,x) = 1), the matrix is also symmetric (because corr(x,y) = corr(y,x)). Values range from -1 to 1, where -1 means inverse correlation between two variables, 1 means positive correlation, and 0 means no linear correlation.
Now because you want to remove the feature which is on average the most correlated with other features, you have to summarize that matrix as one number per feature. One way to do that is to compute the mean:
% mean
>> mean_corr = mean(C)
mean_corr =
0.6430 0.0220 0.6015 0.6037
% most correlated feature on average
>> [~,idx] = max(mean_corr)
idx =
1
% drop that feature
>> X(:,idx) = [];
EDIT:
I probably should have taken the mean of the absolute value of C in the above code, because we don't care if two variables are positively or negatively correlated, only how strong the correlation is.
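A minimal sketch of the iterative procedure described in the question, assuming X is the n-by-d feature matrix and using the absolute correlations as per the edit above (the diagonal is zeroed out so a feature's perfect correlation with itself doesn't count):
keep = 1:size(X,2);
removalOrder = [];                        % original column indices, in removal order
while numel(keep) > 1
    C = abs(corr(X(:,keep)));             % absolute correlation matrix of remaining features
    C(logical(eye(numel(keep)))) = 0;     % ignore self-correlation
    [~, j] = max(mean(C));                % feature most correlated with the others, on average
    removalOrder(end+1) = keep(j);        %#ok<AGROW>
    keep(j) = [];                         % drop it and repeat
end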
I am currently creating different signals using Matlab, mixing them by multiplying them by a mixing matrix A, and then trying to get back the original signals using FastICA.
So far, the recovered signals are really bad when compared to the original ones, which was not what I expected.
I'm trying to see whether I'm doing anything wrong. The signals I'm generating are the following: (Amplitudes are in the range [0,1].)
s1 = (-x.^2 + 100*x + 500) / 3000; % quadratic
s2 = exp(-x / 10); % -ve exponential
s3 = (sin(x)+ 1) * 0.5; % sine
s4 = 0.5 + 0.1 * randn(size(x, 2), 1); % gaussian
s5 = (sawtooth(x, 0.75)+ 1) * 0.5; % sawtooth
One condition for ICA to be successful is that at most one signal is Gaussian, and I've observed this in my signal generation.
However, another condition is that all signals are statistically independent.
All I know is that this means that, given two signals A & B, knowing one signal does not give any information with regards to the other, i.e.: P(A|B) = P(A) where P is the probability.
Now my question is this: Are my signals statistically independent? Is there any way I can determine this? Perhaps some property that must be observed?
Another thing I've noticed is that when I calculate the eigenvalues of the covariance matrix (calculated for the matrix containing the mixed signals), the eigenspectrum seems to show that there is only one (main) principal component. What does this really mean? Shouldn't there be 5, since I have 5 (supposedly) independent signals?
For example, when using the following mixing matrix:
A =
0.2000 0.4267 0.2133 0.1067 0.0533
0.2909 0.2000 0.2909 0.1455 0.0727
0.1333 0.2667 0.2000 0.2667 0.1333
0.0727 0.1455 0.2909 0.2000 0.2909
0.0533 0.1067 0.2133 0.4267 0.2000
The eigenvalues are: 0.0000 0.0005 0.0022 0.0042 0.0345 (only 4!)
When using the identity matrix as the mixing matrix (i.e. the mixed signals are the same as the original ones), the eigenspectrum is: 0.0103 0.0199 0.0330 0.0811 0.1762. There still is one value much larger than the rest..
Thank you for your help.
I apologise if the answers to my questions are painfully obvious, but I'm really new to statistics, ICA and Matlab. Thanks again.
EDIT - I have 500 samples of each signal, in the range [0.2, 100], in steps of 0.2, i.e. x = 0.2:0.2:100.
EDIT - Given the ICA Model: X = As + n (I'm not adding any noise at the moment), but I am referring to the eigenspectrum of the transpose of X, i.e. eig(cov(X')).
Your signals are correlated (not independent). Right off the bat, the sawtooth and the sine have the same period. Tell me the value of one and I'll tell you the value of the other: perfect correlation.
If you change up the period of one of them that'll make them more independent.
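A small sketch of that suggestion (the new sawtooth frequency is an arbitrary choice of mine, purely for illustration):
x = transpose(0.2:0.2:100);
s3 = (sin(x) + 1) * 0.5;                    % sine, period 2*pi
s5 = (sawtooth(2.7 * x, 0.75) + 1) * 0.5;   % sawtooth, period 2*pi/2.7 instead of 2*pi
corr(s3, s5)                                % should now be much closer to zero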
Also S1 and S2 are kinda correlated.
As for the eigenvalues, first of all your signals are not independent (see above).
Second of all, your filter matrix A is also not well conditioned, spreading out your eigenvalues further.
Even if you were to pipe in five fully independent (iid, yada yada) signals, the covariance would be:
E[ A y y' A' ] = A E[ y y' ] A' = A I A' = A A'
The eigenvalues of that are:
eig(A*A')
ans =
0.000167972216475
0.025688510850262
0.035666735304024
0.148813869149738
1.042451912479502
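Equivalently, since the eigenvalues of A*A' are the squared singular values of the mixing matrix, this spread shows up as a large condition number; the number is just read off the eigenvalues above, roughly sqrt(1.0425/0.000168):
cond(A)   % 2-norm condition number; roughly 79 for this A, i.e. poorly conditioned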
So you're really filtering/squishing all the signals down onto one basis function / degree of freedom and of course they'll be hard to recover, whatever method you use.
To find out whether the signals are mutually independent you could look at the techniques described here. In general, two random variables are independent if they are orthogonal. This means that E{s1*s2} = 0, i.e. the expectation of the random variable s1 multiplied by the random variable s2 is zero. This orthogonality condition is extremely important in statistics and probability and shows up everywhere. Unfortunately it only applies to two variables at a time. There are multivariable techniques, but none that I would feel comfortable recommending. Another link I dug up was this one; I'm not sure what your application is, but that paper is very well done.
When I calculate the covariance matrix I get:
cov(A) =
0.0619 -0.0284 -0.0002 -0.0028 -0.0010
-0.0284 0.0393 0.0049 0.0007 -0.0026
-0.0002 0.0049 0.1259 0.0001 -0.0682
-0.0028 0.0007 0.0001 0.0099 -0.0012
-0.0010 -0.0026 -0.0682 -0.0012 0.0831
With eigenvectors V and eigenvalues D:
[V,D] = eig(cov(A))
V =
-0.0871 0.5534 0.0268 -0.8279 0.0063
-0.0592 0.8264 -0.0007 0.5584 -0.0415
-0.0166 -0.0352 0.5914 -0.0087 -0.8054
-0.9937 -0.0973 -0.0400 0.0382 -0.0050
-0.0343 0.0033 0.8050 0.0364 0.5912
D =
0.0097 0 0 0 0
0 0.0200 0 0 0
0 0 0.0330 0 0
0 0 0 0.0812 0
0 0 0 0 0.1762
Here's my code:
x = transpose(0.2:0.2:100);
s1 = (-x.^2 + 100*x + 500) / 3000; % quadratic
s2 = exp(-x / 10); % -ve exponential
s3 = (sin(x)+ 1) * 0.5; % sine
s4 = 0.5 + 0.1 * randn(length(x), 1); % gaussian
s5 = (sawtooth(x, 0.75)+ 1) * 0.5; % sawtooth
A = [s1 s2 s3 s4 s5];
cov(A)
[V,D] = eig(cov(A))
Let me know if I can help any more, or if I misunderstood.
EDIT: Properly referred to eigenvalues and vectors, used the 0.2 sampling interval, and added the code.