Understanding PCA in MATLAB

What is the difference between the following two functions?
prepTransform.m
function [mu trmx] = prepTransform(tvec, comp_count)
% Computes transformation matrix to PCA space
% tvec - training set (one row represents one sample)
% comp_count - count of principal components in the final space
% mu - mean value of the training set
% trmx - transformation matrix to comp_count-dimensional PCA space
% this is memory-hungry version
% commented out is the version proper for Win32 environment
tic;
mu = mean(tvec);
cmx = cov(tvec);
%cmx = zeros(size(tvec,2));
%f1 = zeros(size(tvec,1), 1);
%f2 = zeros(size(tvec,1), 1);
%for i=1:size(tvec,2)
% f1(:,1) = tvec(:,i) - repmat(mu(i), size(tvec,1), 1);
% cmx(i, i) = f1' * f1;
% for j=i+1:size(tvec,2)
% f2(:,1) = tvec(:,j) - repmat(mu(j), size(tvec,1), 1);
% cmx(i, j) = f1' * f2;
% cmx(j, i) = cmx(i, j);
% end
%end
%cmx = cmx / (size(tvec,1)-1);
toc
[evec eval] = eig(cmx);
eval = sum(eval);
[eval evid] = sort(eval, 'descend');
evec = evec(:, evid(1:size(eval,2)));
% save 'nist_mu.mat' mu
% save 'nist_cov.mat' evec
trmx = evec(:, 1:comp_count);
pcaTransform.m
function [pcaSet] = pcaTransform(tvec, mu, trmx)
% tvec - matrix containing vectors to be transformed
% mu - mean value of the training set
% trmx - pca transformation matrix
% pcaSet - output set transforrmed to PCA space
pcaSet = tvec - repmat(mu, size(tvec,1), 1);
%pcaSet = zeros(size(tvec));
%for i=1:size(tvec,1)
% pcaSet(i,:) = tvec(i,:) - mu;
%end
pcaSet = pcaSet * trmx;
Which one is actually doing PCA?
If one is doing PCA, what is the other one doing?

The first function, prepTransform, is the one actually doing PCA on your training data: it determines the new axes used to represent your data in a lower dimensional space. It finds the eigenvectors of the covariance matrix of your data and then orders them so that the eigenvector with the largest eigenvalue appears in the first column of the eigenvector matrix evec and the eigenvector with the smallest eigenvalue appears in the last column. What's important with this function is that you can define how many dimensions you want to reduce the data down to: keeping only the first N columns of evec lets you reduce your data down to N dimensions. Those first N columns, with the rest discarded, are what is stored in trmx in the code. N is set by the comp_count argument of prepTransform.
The second function, pcaTransform, transforms data that is defined within the same domain as your training data, but is not necessarily the training data itself (it could be if you wish), onto the lower dimensional space defined by the eigenvectors of the covariance matrix. To perform the reduction of dimensions, or dimensionality reduction as it is popularly known, you subtract the mean of each feature from your data and multiply the result by the matrix trmx. Note that prepTransform returning the mean of each feature in the vector mu is important, because you need it to mean subtract your data when you call pcaTransform.
How to use these functions
To use these functions effectively, first determine the trmx matrix, which contains the principal components of your data, together with the mean of each feature stored in mu, by defining how many dimensions you want to reduce your data down to:
N = 2; % Reduce down to two dimensions for example
[mu, trmx] = prepTransform(tvec, N);
Next you can finally perform dimensionality reduction on your data that is defined within the same domain as tvec (or even tvec if you wish, but it doesn't have to be) by:
pcaSet = pcaTransform(tvec, mu, trmx);
In terms of vocabulary, pcaSet contains what are known as the principal component scores of your data, which is the term used for your data after it has been transformed to the lower dimensional space.
If I can recommend something...
Computing PCA through the eigenvector approach can be numerically unstable. I highly recommend you use the Singular Value Decomposition via svd on the covariance matrix, where the V matrix of the result already gives you the eigenvectors, which correspond to your principal components, sorted by decreasing eigenvalue:
mu = mean(tvec, 1);
[~,~,V] = svd(cov(tvec));
Then perform the transformation by taking the mean subtracted data per feature and multiplying by the V matrix, once you subset and grab the first N columns of V:
N = 2;
X = bsxfun(@minus, tvec, mu);
pcaSet = X*V(:, 1:N);
X is the mean subtracted data, which is the same as doing pcaSet = tvec - repmat(mu, size(tvec,1), 1);, except that you are not explicitly replicating the mean vector over each training example but letting bsxfun do that for you internally. From MATLAB R2016b onwards, implicit expansion performs this replication without the explicit call to bsxfun:
X = tvec - mu;
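To see that the two routes agree, here is a minimal sketch that runs both on a small made-up data set. The data in tvec and the names scoresEig, scoresSvd and signDiff are only for illustration; eigenvectors are defined up to sign, so the scores are compared by absolute value.

```matlab
% Made-up, deterministic data: 200 samples (rows), 3 features (columns)
t = (1:200)';
tvec = [sin(t), cos(0.3*t), sin(0.7*t + 1)] * [2 1 0; 0 1 1; 1 0 3];
N = 2;                                   % number of components to keep

mu = mean(tvec, 1);
X = bsxfun(@minus, tvec, mu);            % mean subtracted data
C = cov(tvec);

% Eigenvector route: sort the eigenvalues in descending order
[evec, evalMat] = eig(C);
[evals, idx] = sort(diag(evalMat), 'descend');
trmx = evec(:, idx(1:N));
scoresEig = X * trmx;

% SVD route: the columns of V come back already sorted
[~, S, V] = svd(C);
scoresSvd = X * V(:, 1:N);

% Columns may differ by a sign flip, so compare magnitudes
signDiff = max(max(abs(abs(scoresEig) - abs(scoresSvd))));
% The variance of each score column equals the matching eigenvalue
varDiff = max(abs(var(scoresEig) - evals(1:N)'));
```

Both signDiff and varDiff should come out at round-off level, which is a quick sanity check that trmx and V(:, 1:N) describe the same axes.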
Further Reading
If you fully want to understand the code that was written and the theory behind what it's doing, I recommend the following two Stack Overflow posts that I have written that talk about the topic:
What does selecting the largest eigenvalues and eigenvectors in the covariance matrix mean in data analysis?
How to use eigenvectors obtained through PCA to reproject my data?
The first post brings the code you presented into light which performs PCA using the eigenvector approach. The second post touches base on how you'd do it using the SVD towards the end of the answer. This answer I've written here is a mix between the two posts above.

Related

Finite difference scheme in Matlab

I am trying to implement a finite difference scheme for KdV equation in MATLAB, and I have most of the code ready, except for approximation at the first level using initial condition. It was suggested I use Euler's method to obtain 'u' at m=1, and then use the scheme for m>=2.
How does one apply Euler's method in this context? Even just a general answer for approximation at the first level would be appreciated.
I am including my code for reference
close all
clear
clc
% Generating grid with n points, with the space between two points being
%(x2-x1)/(n-1)
x = linspace(-5,5,1001);
N=1001;
h=x(2)-x(1); % grid size
dt=0.05;
%Soliton initial condition
Am=8; %Amplitude
mu=sqrt(Am/2);
x0=-15;
c=1;
syms H(x)
H(x)=piecewise(x < 0,0,x > 0,1);
u= Am*(sech(mu*(x'-x0))).^2+c^2*H(x);
% Creating a matrix A - First order
A = diag(ones(N-1,1),1)-diag(ones(N-1,1),-1);
Cvector = zeros(N, 1);
Cvector(end) = 1;
u_ic = Cvector;
% First order finite difference scheme
diff_first=A*u/(2*h)+1/(2*h)*u_ic;
% Weighted average matrix for the term 'u'
A_w = diag(ones(N-1,1),1)+diag(ones(N-1,1),-1)+diag(ones(N,1));
diff_w=2*A_w*u +2*u_ic;
% Matrix multiplication of first derivative and weighted average for 6uu_x
diff_middle=diff_first.*diff_w;
% Creating a Third Order Matrix
r = zeros(1,N);
r(2:3) = [-2,1];
c = -r;
A_third = toeplitz(c,r);
% Difference scheme for third order term
diff_third=A_third*u/(h*h*h)-1/(h*h*h)*u_ic;
%Computing finite difference method
u = u - 2*dt*diff_middle-dt*diff_third;
plot(u)
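
One possible reading of the Euler suggestion, sketched under assumptions that are not in the question: the scheme is treated as a three-level (leapfrog) update u^(m+1) = u^(m-1) + 2*dt*F(u^m), so the very first level has no u^(m-1) and is taken with a single forward Euler step u^1 = u^0 + dt*F(u^0). Periodic boundaries via circshift, the grid, the step sizes and the names rhs, uPrev, uCurr and uNext are all illustrative choices, not the original boundary treatment.

```matlab
% Sketch: forward Euler start-up followed by a leapfrog (three-level) scheme
% for u_t + 6*u*u_x + u_xxx = 0. Periodic boundaries are an assumption here.
N  = 200;
L  = 20;                        % domain length (assumed)
h  = L / N;
x  = (-L/2 : h : L/2 - h)';     % column vector, periodic grid
dt = 1e-4;                      % small step; leapfrog needs roughly dt < 0.38*h^3
nSteps = 50;

u0 = 2 * sech(x).^2;            % one-soliton initial condition

% F(u) = -(6*u*u_x + u_xxx): three-point average of u times the central
% first difference, plus the central third difference.
up1 = @(u) circshift(u, -1);    % u(j+1)
um1 = @(u) circshift(u,  1);    % u(j-1)
rhs = @(u) -((up1(u) + u + um1(u)) .* (up1(u) - um1(u)) / h ...
    + (circshift(u, -2) - 2*up1(u) + 2*um1(u) - circshift(u, 2)) / (2*h^3));

uPrev = u0;                     % level m = 0
uCurr = u0 + dt * rhs(u0);      % level m = 1: one forward Euler step
for m = 2:nSteps                % levels m >= 2: leapfrog
    uNext = uPrev + 2*dt * rhs(uCurr);
    uPrev = uCurr;
    uCurr = uNext;
end
massDrift = abs(sum(uCurr) - sum(u0));   % sum(u) is conserved on a periodic grid
```

The point is only the bookkeeping: keep two time levels around, fill the second one with Euler, then switch to the three-level update.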

rewrite the seqneighjoin function in matlab

I have the task to rewrite the seqneighjoin function in matlab by adding the frequency of all the sequences. After searching, I understand that this function returns a phylogenetic tree object obtained by the neighbor-joining method for sequences, as described on the wiki: http://en.wikipedia.org/wiki/Neighbor_joining
Now, I have the following two questions.
(1): What is the data structure of the phytree object returned by this function? How is it expressed? For example, the similar linkage function also returns a tree, and the data structure is very clear there, i.e., it is a matrix with three columns, where the i-th row indicates which nodes are combined and the corresponding distance when they are combined. Thanks very much for your time and attention.
(2): Based on the wiki, how am I supposed to add frequency to the function seqneighjoin? I am totally confused.
Thanks so much for your time and attention. I truly appreciate that.
EDIT: the following is the code.
function z = seqneighjoin(D_all, freq)
n = size (D_all, 2);
m=(1+sqrt(8*n+1))/2;
z=zeros(m-1,3);
q=zeros(m,m);
str = zeros (m,m);
% initialize the distance matrix d
d=ones(m,m);
d(tril(d,-1)==1)=D_all;
d(triu(d,1)==1)=D_all;
d(eye(m,m)==1) = 1:m; % the diagonal entries of the matrix d is the indices of the clusters
% initialize the matrix str
for r=1:m
for c=1:m
str(r,c)=freq(r)*freq(c)*d(r,c);
str(c,r)=freq(r)*freq(c);
end
end
% loop through for m-1 times to create the matrix z
for k = 1:m-1
% initialize (for the first time) or update (for all other times)
% the matrix q
colSum = sum(d, 1);
rowSum=sum(d,2);
a=size(colSum, 2);
colSumM=colSum(ones(a,1),:);
rowSumM=rowSum(:,ones(1,a));
q=(a-2)*d-colSumM-rowSumM;
% find the minimum element in the matrix q
u=min(q);
v=min(u);
[i,j]=find(q==v);
r=i(1);
c=j(1);
% combine d(r,r) and d(c,c) to get a new node m+k
z(k,:)=[d(r,r), d(c,c), v];
% calculate the distance between the new node m+k and all other node
% which are not m+k
d(r,:) = (d(r,:) + d(c,:) - d(r,c) )/2;
d(r,r) = m+k;
d(c,:)=[]; d(:,c)=[];
end
Here, D_all is the vector representation of a distance matrix returned by the seqpdist function in matlab, and freq is the vector indicating the frequency of all the sequences.

Numerical derivative of a vector

I have a problem with the numerical derivative of a vector x (Nx1) with respect to another vector t (time) that is the same size as x.
I do the following (x is chosen to be sine function as an example):
t=t0:ts:tf;
x=sin(t);
xd=diff(x)/ts;
but the answer xd is (N-1)x1, and I figured out that it does not compute the derivative corresponding to the first element of x.
Is there any other way to compute this derivative?
You are looking for the numerical gradient I assume.
t0 = 0;
ts = pi/10;
tf = 2*pi;
t = t0:ts:tf;
x = sin(t);
dx = gradient(x)/ts
The purpose of this function is a different one (vector fields), but it offers what diff doesn't: input and output vector of equal length.
gradient calculates the central difference between data points. For an
array, matrix, or vector with N values in each row, the ith value is
defined by (A(i+1) - A(i-1))/2.
The gradient at the end points, where i=1 and i=N, is calculated with
a single-sided difference between the endpoint value and the next
adjacent value within the row. If two or more outputs are specified,
gradient also calculates central differences along other dimensions.
Unlike the diff function, gradient returns an array with the same
number of elements as the input.
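To make the quoted description concrete, here is a small sketch that rebuilds the same numbers by hand for unit spacing and compares them with gradient. The variable names g and manual are only for illustration.

```matlab
x = [1 4 9 16 25];                 % squares of 1..5, unit spacing

g = gradient(x);                   % built-in result

% Hand-built version of the rule described above
manual = zeros(size(x));
manual(1)       = x(2) - x(1);                   % single-sided at i = 1
manual(end)     = x(end) - x(end-1);             % single-sided at i = N
manual(2:end-1) = (x(3:end) - x(1:end-2)) / 2;   % central difference inside

maxDiff = max(abs(g - manual));
```

For this input both vectors are [3 4 6 8 9], the same length as x.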
I know I'm a little late to the game here, but you can also get an approximation of the numerical derivative by taking the derivative of the piecewise polynomial (cubic) spline that runs through your data:
function dy = splineDerivative(x,y)
% the spline has continuous first and second derivatives
pp = spline(x,y); % could also use pp = pchip(x,y);
[breaks,coefs,K,r,d] = unmkpp(pp); % K pieces, order r (4 for a cubic), dimension d
% pre-allocate the coefficient matrix
dCoeff = zeros(K,r-1);
% Columns are ordered from highest to lowest power. For vector data, both
% spline and pchip return K-by-4 matrices, ordered from 3rd to zeroth power.
% (Thanks to the anonymous person who suggested this edit).
dCoeff(:, 1) = 3 * coefs(:, 1); % d(ax^3)/dx = 3ax^2;
dCoeff(:, 2) = 2 * coefs(:, 2); % d(ax^2)/dx = 2ax;
dCoeff(:, 3) = 1 * coefs(:, 3); % d(ax^1)/dx = a;
dpp = mkpp(breaks,dCoeff,d);
dy = ppval(dpp,x);
The spline polynomial is always guaranteed to have continuous first and second derivatives at each point. I have not tested and compared this against using pchip instead of spline, but that might be another option as it too has continuous first derivatives (but not second derivatives) at every point.
The advantage of this is that there is no requirement that the step size be even.
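As a rough check of the uneven-step claim, here is a sketch that applies the same steps inline to sin sampled on a deliberately non-uniform grid and compares the result against cos. The grid, the test function and the names dCoefs and maxErr are made up for illustration.

```matlab
t = linspace(0, 1, 60);
x = 2*pi * t.^1.5;                 % uneven spacing: dense near 0, coarser near 2*pi
y = sin(x);

pp = spline(x, y);
[breaks, coefs, nPieces, order, dim] = unmkpp(pp);

% Differentiate each cubic piece: multiply the x^3, x^2, x^1 coefficients
% by 3, 2, 1 and drop the constant term.
dCoefs = bsxfun(@times, coefs(:, 1:order-1), (order-1):-1:1);
dpp = mkpp(breaks, dCoefs, dim);

dy = ppval(dpp, x);                % same length as x
maxErr = max(abs(dy - cos(x)));    % compare with the exact derivative
```

The result has one value per sample, including both end points, and maxErr gives a feel for the accuracy on this particular grid.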
There are some options to work around your issue.
First: you can make your domain larger. Instead of N, use N+1 gridpoints.
Second: depending on the end-point of interest, you can use
Forward difference: (F(x + dx) - F(x)) / dx
Backward difference: (F(x) - F(x - dx)) / dx
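The second option could look like the following sketch, which keeps the forward differences from diff and fills in the one missing end point with a backward difference. The names xdFwd, xdLast and xd are only for illustration.

```matlab
t0 = 0;
ts = pi/10;
tf = 2*pi;
t = t0:ts:tf;
x = sin(t);

xdFwd  = diff(x) / ts;                 % forward differences at t(1:end-1)
xdLast = (x(end) - x(end-1)) / ts;     % backward difference at t(end)
xd = [xdFwd, xdLast];                  % now the same length as x
```

Note that the last two entries of xd are identical by construction, since both use the final pair of samples.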

Matlab Vectorization of Multivariate Gaussian Basis Functions

I have the following code for calculating the result of a linear combination of Gaussian functions. What I'd really like to do is to vectorize this somehow so that it's far more performant in Matlab.
Note that y is a column vector (output), x is a matrix where each column corresponds to a data point and each row corresponds to a dimension (i.e. 2 rows = 2D), variance is a double, gaussians is a matrix where each column is a vector corresponding to the mean point of the gaussian and weights is a row vector of the weights in front of each gaussian. Note that the length of weights is 1 bigger than gaussians as weights(1) is the 0th order weight.
function [ y ] = CalcPrediction( gaussians, variance, weights, x )
basisFunctions = size(gaussians, 2);
xvalues = size(x, 2);
if length(weights) ~= basisFunctions + 1
ME = MException('TRAIN:CALC', 'The number of weights should be equal to the number of basis functions plus one');
throw(ME);
end
y = weights(1) * ones(xvalues, 1);
for xIdx = 1:xvalues
for i = 1:basisFunctions
diff = x(:, xIdx) - gaussians(:, i);
y(xIdx) = y(xIdx) + weights(i+1) * exp(-(diff')*diff/(2*variance));
end
end
end
You can see that at the moment I simply iterate over the x vectors and then the gaussians inside 2 for loops. I'm hoping that this can be improved - I've looked at meshgrid but that seems to only apply to vectors (and I have matrices)
Thanks.
Try this
diffx = bsxfun(@minus,x,permute(gaussians,[1,3,2])); % binary operation with singleton expansion
diffx2 = squeeze(sum(diffx.^2,1)); % dot product, shape is now [XVALUES,BASISFUNCTIONS]
weight_col = weights(:); % make sure weights is a column vector
y = weight_col(1) + exp(-diffx2/2/variance)*weight_col(2:end); % 0th order weight plus weighted basis functions, a column vector of length XVALUES
Note, I changed diff to diffx since diff is a builtin. I'm not sure this will improve performance, since allocating the large intermediate array may offset the gain from vectorization.
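A quick way to gain confidence in a vectorized version is to run it next to the double loop on small inputs. The sketch below does that with made-up values; the sizes, the data and the names yLoop, yVec and maxDiff are assumptions for illustration, and the 0th order weight is added explicitly in both versions.

```matlab
% Small made-up problem: 2 dimensions, 5 data points, 3 basis functions
x = reshape(1:10, 2, 5) / 10;          % each column is a data point
gaussians = [0 1 0.5; 0 0.5 1];        % each column is a Gaussian mean
variance = 0.8;
weights = [0.5 1 -2 3];                % weights(1) is the 0th order weight

nX = size(x, 2);
nB = size(gaussians, 2);

% Reference: the double loop from the question
yLoop = weights(1) * ones(nX, 1);
for xIdx = 1:nX
    for i = 1:nB
        d = x(:, xIdx) - gaussians(:, i);
        yLoop(xIdx) = yLoop(xIdx) + weights(i+1) * exp(-(d')*d / (2*variance));
    end
end

% Vectorized: squared distances for all (point, Gaussian) pairs at once
diffx  = bsxfun(@minus, x, permute(gaussians, [1, 3, 2]));   % [dims, nX, nB]
diffx2 = squeeze(sum(diffx.^2, 1));                          % [nX, nB]
w = weights(:);
yVec = w(1) + exp(-diffx2 / (2*variance)) * w(2:end);

maxDiff = max(abs(yLoop - yVec));
```

One caveat: squeeze only returns an [nX, nB] matrix when there is more than one data point, so a single-column x would need a reshape instead.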

Mixture of Gaussians (EM) how to calculate the responsibilities

I have an assignment to implement MoG with EM in matlab. The assignment:
My code at the moment:
clear
clc
load('data2')
%% INITIALIZE
K = 20
pi = 0.01:((1-0.01)/K):1;
for k=1:20
sigma{k} = eye(2);
mu(k,:) = [rand(1),rand(1)];
end
%% Posterior over the latent variables
addition = 0;
for k =1:20
addition = addition + (pi(k)*mvnpdf(x,mu(k,:), sigma{k}));
end
test = 0;
for k =1:20
gamma{k} = (pi(k)*mvnpdf(x,mu(k), sigma{k})) ./ addition;
end
data has 1000 rows and 2 columns (so 1000 datapoints). My question is now how do I calculate the responsibilities. When I try to calculate the covariance matrix I get a 1x1000 matrix, while I believe the covariance matrix should be 2x2.
Unfortunately, I don't speak Matlab, so I can't really see where your code is incorrect, but I can answer generally (and maybe someone who knows Matlab can see if your code can be salvaged). Each datapoint has a gamma associated with it, which is the expectation of an indicator variable for each component in the mixture. Calculating them is pretty simple: for the i-th datapoint and the k-th component, gamma_ik is just the density of the k-th component at the i-th point, multiplied by the k-th mixture coefficient (the prior probability that the point came from the k-th component, which is pi in your assignment), normalised by this quantity computed over all k. Thus for each datapoint, you have a vector of responsibilities (of length k) with a sum of one.
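The description above can be sketched in MATLAB roughly as follows. This is a minimal illustration on made-up data with two components, not a fix of the code in the question: the Gaussian density is written out by hand so that no toolbox is needed, and the names X, mixCoef, resp, muNew and SigmaNew are hypothetical (mixCoef is used instead of pi to avoid shadowing the built-in constant). The last loop shows the standard responsibility-weighted mean and covariance update, which gives a 2x2 matrix per component.

```matlab
% Made-up data: 5 points in 2-D, two components
X = [0 0; 0.1 -0.2; 3 3; 2.9 3.2; -1 4];
K = 2;
mixCoef = [0.4 0.6];               % mixture coefficients, sum to one
mu = [0 0; 3 3];                   % one mean per row
Sigma = {eye(2), eye(2)};          % one covariance matrix per component

[n, d] = size(X);
weighted = zeros(n, K);            % mixCoef(k) * density_k(x_i)
for k = 1:K
    Xc = bsxfun(@minus, X, mu(k, :));
    quad = sum((Xc / Sigma{k}) .* Xc, 2);     % (x-mu)' * inv(Sigma) * (x-mu) per row
    dens = exp(-0.5 * quad) / sqrt((2*pi)^d * det(Sigma{k}));
    weighted(:, k) = mixCoef(k) * dens;
end

% Responsibilities: normalise over the components for each datapoint
resp = bsxfun(@rdivide, weighted, sum(weighted, 2));   % n-by-K, rows sum to one

% Weighted mean and covariance update for each component
Nk = sum(resp, 1);
muNew = zeros(K, d);
SigmaNew = cell(1, K);
for k = 1:K
    muNew(k, :) = (resp(:, k)' * X) / Nk(k);
    Xc = bsxfun(@minus, X, muNew(k, :));
    SigmaNew{k} = (Xc' * bsxfun(@times, Xc, resp(:, k))) / Nk(k);   % d-by-d
end
```

Storing the responsibilities as one n-by-K matrix, rather than a cell per component, makes the normalisation and the weighted sums one-liners.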