Making sense of CCA (Matlab implementation) 2 - matlab

I am using CCA for my work and want to understand something.
This is my MATLAB code. I have only taken 100 samples to better understand the concepts of CCA.
clc;clear all;close all;
load carbig;
data = [Displacement Horsepower Weight Acceleration MPG];
data(isnan(data))=0;
X = data(1:100,1:3);
Y = data(1:100,4:5);
[wx,wy,~,U,V] = CCA(X,Y);
clear Acceleration Cylinders Displacement Horsepower MPG Mfg Model Model_Year Origin Weight when org
subplot(1,2,1),plot(U(:,1),V(:,1),'.');
subplot(1,2,2),plot(U(:,2),V(:,2),'.');
My plots are coming like this:
This points out that in the 1st figure (left), the transformed variables are highly correlated with little scatter around the central axis. While in the 2nd figure(right), the scatter around the central axis is much more.
As I understand from here that CCA maximizes the correlation between the data in the transformed space. So I tried to design a matching score which should return a minimum value if the vectors are maximally correlated. I tried to match each vector of U(i,:) with that of V(j,:) with i,j going from 1 to 100.
%% Finding the difference between the projected vectors
for i=1:size(U,1)
cost = repmat(U(i,:),size(U,1),1)- V;
for j=1:size(U,1)
c(i,j) = norm(cost(j,:),size(U,2));
end
[~,idx(i)] = min(c(i,:));
end
Ideally idx should be like this :
idx = 1 2 3 4 5 6 7 8 9 10 ....
as they are maximally correlated. However my output comes something like this :
idx = 80 5 3 1 4 7 17 17 17 10 68 78 78 75 9 10 5 1 6 17 .....
I dont understand why this happens.
Am I wrong somewhere ? Isnt the vectors supposed to be maximally correlated in the transformed CCA subspace?
If my above assumption is wrong, please point me out in the correct direction.
Thanks in advance.

First, Let me transpose your code in R2014b:
load carbig;
data = [Displacement Horsepower Weight Acceleration MPG];
% Truncate the data, to follow-up with your sample code
data = data(1:100,:);
nans = sum(isnan(data),2) > 0;
[wx, wy, r, U, V,] = canoncorr(X(~nans,1:3),X(~nans,4:5));
OK, now the trick is that the vectors which are maximally correlated in the CCA subspace are the column vectors U(:,1) with V(:,1) and U(:,2) with V(:,2), and not the row vectors U(i,:), as you are trying to compute. In the CCA subspace, vectors should be N-dimensional (here N=100), and not simple 2D vectors. That's the reason why visualization of CCA results is often quite complicated !
By the way, the correlations are given by the third output of canoncorr, that you (intentionally ?) choosed to skip in your code. If you check its content, you'll see that the correlations (i.e. the vectors) are well-ordered:
r =
0.9484 0.5991
It is hard to explain CCA better than the link you already provided. If you want to go further, you should probably invest in a book, like this one or this one.

Related

Calculating the root-mean-square-error between two matrices one of which contains NaN values

This is a part of a larger project so I will try to keep only the relevant parts (The variables and my attempt at the calculations)
I want to calculate the root mean squared error between Zi_cubic and Z_actual
RMSE formula
Given/already established variables
rng('default');
% Set up 2,000 random numbers between -1 & +1 as our x & y values
n=2000;
x = 2*(rand(n,1)-0.5);
y = 2*(rand(n,1)-0.5);
z = x.^5+y.^3;
% Interpolate to a regular grid
d = -1:0.01:1;
[Xi,Yi] = meshgrid(d,d);
Zi_cubic = griddata(x,y,z,Xi,Yi,'cubic');
Z_actual = Xi.^5+Yi.^3;
My attempt at a calculation
My approach is to
Arrange Zi_cubic and Z_actual as column vectors
Take the difference
Square each element in the difference
Sum up all the elements in 4 using nansum
Divide by the number of finite elements in 4
Take the square root
D1 = reshape(Zi_cubic,[numel(Zi_cubic),1]);
D2 = reshape(Z_actual,[numel(Z_actual),1]);
D3 = D1 - D2;
D4 = D3.^2;
D5 = nansum(D4)
d6 = sum(isfinite(D4))
D6 = D5/d6
D7 = sqrt(D6)
Apparently this is wrong. I'm either mis-applying the RMSE formula or I don't understand what I'm telling matlab to do.
Any help would be appreciated. Thanks in advance.
Your RMSE is fine (in my book). The only thing that seems possibly off is the meshgrid and griddata. Your inputs to griddata are vectors and you are asking for a matrix output. That is fine, but you're potentially undersampling your input space. In other words, you are giving n samples as inputs, but perhaps you are expected to give n^2 samples as inputs? Here's some sample code for a smaller n to demonstrate this effect more clearly:
rng('default');
% Set up 2,000 random numbers between -1 & +1 as our x & y values
n=100; %Reduced because scatter is slow to plot
x = 2*(rand(n,1)-0.5);
y = 2*(rand(n,1)-0.5);
z = x.^5+y.^3;
S = 100;
subplot(1,2,1)
scatter(x,y,S,z)
%More data, more accurate ...
[x2,y2] = meshgrid(x,y);
z2 = x2.^5+y2.^3;
subplot(1,2,2)
scatter(x2(:),y2(:),S,z2(:))
The second plot should be a lot cleaner and thus will likely provide a more accurate estimate of Z_actual later on.
I also thought you might be running into some issues with floating point numbers and calculating RMSE but that appears not to be the case. Here's some alternative code which is how I would write RMSE.
d = Zi_cubic(:) - Z_actual(:);
mask = ~isnan(d);
n_valid = sum(mask);
rmse = sqrt(sum(d(mask).^2)/n_valid);
Notice that (:) linearizes the matrix. Also it is useful to try and use better variable names than D1-D7.
In the end though these are just suggestions and your code looks fine.
PS - I'm assuming that you are supposed to be using cubic interpolation as that is another place you could perhaps deviate from what's expected ...

How to make a polynomial approximation in Scilab?

I've a set of measures, which I want to approximate. I know I can do that with a 4th degree polynomial, but I don't know how to find it's five coefficients using Scilab.
For now, I must use the user-friendly functions of Open office calc... So, to keep using only Scilab, I'd like to know if a built-in function exists, or if we can use a simple script.
There is no built-in polyfit function like in Matlab, but you can make your own:
function cf = polyfit(x,y,n)
A = ones(length(x),n+1)
for i=1:n
A(:,i+1) = x(:).^i
end
cf = lsq(A,y(:))
endfunction
This function accepts two vectors of equal size (they can be either row or column vectors; colon operator makes sure they are column-oriented in the computation) and the degree of polynomial.
It returns the column of coefficients, ordered from 0th to the nth degree.
The computational method is straightforward: set up the (generally, overdetermined) linear system that requires the polynomial to pass through every point. Then solve it in the sense of least squares with lsq (in practice, it seems that cf = A\y(:) performs identically, although the algorithm is a bit different there).
Example of usage:
x = [-3 -1 0 1 3 5 7]
y = [50 74 62 40 19 35 52]
cf = polyfit(x,y,4)
t = linspace(min(x),max(x))' // now use these coefficients to plot the polynomial
A = ones(length(t),n+1)
for i=1:n
A(:,i+1) = t.^i
end
plot(x,y,'r*')
plot(t,A*cf)
Output:
The Atom's toolbox "stixbox" has Matlab-compatible "polyfit" and "polyval" functions included.
// Scilab 6.x.x need:
atomsInstall(["stixbox";"makematrix";"distfun";"helptbx";linalg"]) // install toolboxes
// POLYNOMINAL CURVE_FITTING
// Need toolboxes above
x = [-3 -1 0 1 3 5 7];
y = [50 74 62 40 19 35 52];
plot(x,y,"."); // plot sample points only
pcoeff = polyfit(x,y,4); // calculate polynominal coefficients (4th-degree)
xp = linspace(-3,7,100); // generate a little more x-values for a smoother curve fitting
yp = polyval(pcoeff,xp); // calculate the y-values for the curve fitting
plot(xp, yp,"k"); // plot the curve fitting in black

Computing a moving average

I need to compute a moving average over a data series, within a for loop. I have to get the moving average over N=9 days. The array I'm computing in is 4 series of 365 values (M), which itself are mean values of another set of data. I want to plot the mean values of my data with the moving average in one plot.
I googled a bit about moving averages and the "conv" command and found something which i tried implementing in my code.:
hold on
for ii=1:4;
M=mean(C{ii},2)
wts = [1/24;repmat(1/12,11,1);1/24];
Ms=conv(M,wts,'valid')
plot(M)
plot(Ms,'r')
end
hold off
So basically, I compute my mean and plot it with a (wrong) moving average. I picked the "wts" value right off the mathworks site, so that is incorrect. (source: http://www.mathworks.nl/help/econ/moving-average-trend-estimation.html) My problem though, is that I do not understand what this "wts" is. Could anyone explain? If it has something to do with the weights of the values: that is invalid in this case. All values are weighted the same.
And if I am doing this entirely wrong, could I get some help with it?
My sincerest thanks.
There are two more alternatives:
1) filter
From the doc:
You can use filter to find a running average without using a for loop.
This example finds the running average of a 16-element vector, using a
window size of 5.
data = [1:0.2:4]'; %'
windowSize = 5;
filter(ones(1,windowSize)/windowSize,1,data)
2) smooth as part of the Curve Fitting Toolbox (which is available in most cases)
From the doc:
yy = smooth(y) smooths the data in the column vector y using a moving
average filter. Results are returned in the column vector yy. The
default span for the moving average is 5.
%// Create noisy data with outliers:
x = 15*rand(150,1);
y = sin(x) + 0.5*(rand(size(x))-0.5);
y(ceil(length(x)*rand(2,1))) = 3;
%// Smooth the data using the loess and rloess methods with a span of 10%:
yy1 = smooth(x,y,0.1,'loess');
yy2 = smooth(x,y,0.1,'rloess');
In 2016 MATLAB added the movmean function that calculates a moving average:
N = 9;
M_moving_average = movmean(M,N)
Using conv is an excellent way to implement a moving average. In the code you are using, wts is how much you are weighing each value (as you guessed). the sum of that vector should always be equal to one. If you wish to weight each value evenly and do a size N moving filter then you would want to do
N = 7;
wts = ones(N,1)/N;
sum(wts) % result = 1
Using the 'valid' argument in conv will result in having fewer values in Ms than you have in M. Use 'same' if you don't mind the effects of zero padding. If you have the signal processing toolbox you can use cconv if you want to try a circular moving average. Something like
N = 7;
wts = ones(N,1)/N;
cconv(x,wts,N);
should work.
You should read the conv and cconv documentation for more information if you haven't already.
I would use this:
% does moving average on signal x, window size is w
function y = movingAverage(x, w)
k = ones(1, w) / w
y = conv(x, k, 'same');
end
ripped straight from here.
To comment on your current implementation. wts is the weighting vector, which from the Mathworks, is a 13 point average, with special attention on the first and last point of weightings half of the rest.

Interpolation with MATLAB in EMG processing

I have 3 EMG recordings for 2 muscles, with a sampling rate of 1000Hz. In other words I have 3 matrices of EMG data; each has 2 rows (for 2 muscles).
However the number of samples (columns) in each isn't the same: the first one has 2600 samples, the second has 2500 samples and the third one has 2550 samples.
I want to make their lengths the same as each other, to get 3 matrices with the same number of rows and columns. I think it is foolish to cut the bigger ones and use just 2500 columns. However if I want to do so, I don't know whether I should cut from the start or end of them?
Is there a way in MATLAB to interpolate the data to get 3 matrices, each of size 3 x 2600?
All 3 matrices belong to the same movement, and I want to match the samples.
You most likely want to look at using interp1 in this situation. This performs an interpolation between your points so that you can sample at any position on the x-axis.
http://www.mathworks.com/help/matlab/ref/interp1.html
I have the following example which has some random sample data sample1, sample2 and sample3. These variables are of lengths 2600, 2500 and 2550 respectively.
sample1 = exp(2*linspace(0,1,2600)+rand(1, 2600));
sample2 = exp(linspace(0,1,2500)+rand(1, 2500));
sample3 = exp(3*linspace(0,1,2550)+rand(1, 2550));
I have a desired length (I am using a length which corresponds to your shortest sample size)
desiredlength = 2500;
You can then interpolate your data with the following code (note the default is a linear interpolation):
adjusted = zeros(3, desiredlength);
adjusted(1, :) = interp1(linspace(0,1,length(sample1)), sample1, linspace(0,1,desiredlength));
adjusted(2, :) = interp1(linspace(0,1,length(sample2)), sample2, linspace(0,1,desiredlength));
adjusted(3, :) = interp1(linspace(0,1,length(sample3)), sample3, linspace(0,1,desiredlength));
plot(adjusted')
linspace(a, b, n) is a function which gives you a vector of n points between a and b, and for sample1 I am transforming from linspace(0, 1, 2600) to linspace(0, 1, 2500)
I hope this helps.

How do I generate pair of random points in a circle using Matlab?

Let a circle of known radius be plotted in MATLAB.
Assume a pair of random points whose location has to be determined in terms of coordinates (x1,y1) (x2,y2)..(xn,yn). Pairs should be close to each other. For example T1 and R1 should be near.
As shown in figure, there are four random pairs (T1,R1)..(T4,R4).
There coordinates need to be determined wrt to center (0,0).
How can I generate this in MATLAB?
The simplest approach to pick a point from a uniform distribution over a circle with reduce R is using Gibbs sampling. Here is the code:
function [x y] = circular uniform (R)
while true
x = 2*R*rand() - R
y = 2*R*rand() - R
if (x*x + y*y) > R*R
return
end
end
The loop runs 4/π times on average.
(Complete edit after the question was edited).
To complete this task, I think that you need to combine the different approaches that have been mentioned before your edit:
To generate the centers T1,T2,T3,... in the green torus, use the polar coordinates. (Edit: this turned out to be wrong, rejection sampling must also be used here, otherwise, the distribution is not uniform!)
To generate the points R1,R2,R3,... in the circle around T1,T2,T3,... but still in the torus, use the rejection sampling.
With these ingredients, you should be able to do everything you need. Here is the code I wrote:
d=859.23;
D=1432.05;
R=100;
N=400;
% Generate the angle
theta = 2*pi*rand(N,1);
% Generate the radius
r = d + (D-d)*rand(N,1);
% Get the centers of the circles
Tx = r.*cos(theta);
Ty = r.*sin(theta);
% Generate the R points
Rx=zeros(N,1);
Ry=zeros(N,1);
for i=1:N
while true
% Try
alpha = 2*pi*rand();
rr = R*rand();
Rx(i) = Tx(i) + rr*cos(alpha);
Ry(i) = Ty(i) + rr*sin(alpha);
% Check if in the correct zone
if ( (Rx(i)*Rx(i) + Ry(i)*Ry(i) > d*d) && (Rx(i)*Rx(i) + Ry(i)*Ry(i) < D*D) )
break
end
end
end
% Display
figure(1);
clf;
angle=linspace(0,2*pi,1000);
plot( d*cos(angle), d*sin(angle),'-b');
hold on;
plot( D*cos(angle), D*sin(angle),'-b');
for i=1:N
plot(Tx(i),Ty(i),'gs');
plot(Rx(i),Ry(i),'rx');
plot([Tx(i) Rx(i)],[Ty(i) Ry(i)],'-k');
end
hold off;
let R be radious of (0;0) centered circle.
(x,y) : x^2+y^2<=R^2 (LE) to be inside the circle
x = rand()*2*R - R;
y should be in interval (-sqrt(R^2 - x^2);+sqrt(R^2 - x^2))
so, let it be
y = rand()*sqrt(R^2 - x^2)*2-sqrt(R^2 - x^2);
Hope, that's right, i have no matlab to test.
Hope, you'll manage to find close pairs your self.
Ok, i'll spend a bit more time for a hint.
To find a random number k in interval [a,b] use
k = rand()*(b-a)+a
Now it should really help if i still remember the matlab syntaxis. Good luck.
Here is a low quality solution that is very easy to use with uniformly distributed points. Assuming the number of points is small efficiency should not be a concern, if you want better quality you can use something more powerfull than nearest neighbor:
While you have less than n points: Generate a random point
If it is in the circle, store it else go to step 1
While there are unpaired points: check which point is nearest to the first unpaired point, make them a pair
As a result most pairs should be good, but some can be really really bad. I would recommend you to try it and perhaps add a step 4 with k-opt or some other local search if required. And if you really have little points (e.g. less than 20) you can of course just calculate all distances and find the optimum matching.
If you don't really care about the uniform distribution, here is an even easier solution:
While you have less than n points: Generate a random point
If it is in the circle, store it else go to step 1
For each of these points, generate a point near it
If it is in the circle, store it else go to step 3
Generate the random points as #PheuVerg suggested (with a slight vectorized tweak)
n = 8; %must be even!
x = rand(n, 1)*2*R - R;
y = rand(n, 1).*sqrt(R^2 - x.^2).*2-sqrt(R^2 - x.^2);
Then use kmeans clustering to get n/2 centers
[~ c] = kmeans([x y], n/2);
now you have to loop through each center and find it's distance to each point
dists = zeros(n, n/2);
for cc = 1:n/2
for pp = 1:n
dists(pp, cc) = sqrt((c(cc,1) - x(pp))^2 + (c(cc,2) - y(pp))^2);
end
end
now you must find the smallest 2 values for each columns of dists
[sorted, idx] = sort(dists);
so now the top two rows of each column are the two nearest points. But there could be clashes! i.e. points that are nearest to two different centers. So for repeated values you have to loop through and choose swap for the point that will give you the smallest extra distance.
Example data:
x =
0.7894
-0.7176
-0.5814
0.0708
0.5198
-0.2299
0.2245
-0.8941
y =
-0.0800
-0.3339
0.0012
0.9765
-0.4135
0.5733
-0.1867
0.2094
sorted =
0.1870 0 0 0.1555
0.2895 0.5030 0.5030 0.2931
0.3145 1.1733 0.6715 0.2989
1.0905 1.1733 0.7574 0.7929
1.1161 1.2326 0.8854 0.9666
1.2335 1.2778 1.0300 1.2955
1.2814 1.4608 1.2106 1.3051
1.4715 1.5293 1.2393 1.5209
idx =
5 4 6 3
7 6 4 2
1 3 3 8
6 7 8 6
3 8 7 7
2 1 2 4
4 5 1 5
8 2 5 1
So now it's clear that 5 and 7 are pairs, and that 3 and 2 are pairs. But 4 and 6 are both repeated. (in this case it is clear that they are pairs too I guess!) but what I would suggest is to leave point 4 with center 2 and point 6 with center 3. Then we start at column 2 and see the next available point is 8 with a distance of 1.2326. This would leave point 1 paired with point 6 but then it's distance from the center is 1.2106. Had we paired point 6 with 8 and point 4 with 1 we would have got distances of 0.7574 and 1.2778 respectively which is actually less total distance. So finding 'close' pairs is easy but finding the set of pairs with the globally smallest minimum is hard! This solutions gets you something decent quite easily but fi you need the global best then I'm afraid you have quite a bit of work to do still :(
Finally let me add some visualisation. First lets (manually) create a vector that shows which points are paired:
I = [1 2 2 1 3 4 3 4];
Remember that that will depend on your data! Now you can plot is nicely like this:
gscatter(x, y, I)
Hope this gets you close and that you can eliminate the manual pairing of mine at the end by yourself. It shouldn't be too hard to get a crude solution.