RMSE of two data sets of unequal lengths in MATLAB?

I want to calculate the RMSE between two data sets of unequal length.
Dataset 1 has dimensions 1067x1 and dataset 2 has dimensions 2227x1.
How do I calculate the RMSE?
Thanks

Hard to answer without knowing the data.
One option is to interpolate one vector to the length of the other. That gets more accurate if you have, e.g., timestamps for the two datasets.
v1 = rand(1067,1);
v2 = rand(2227,1);
% Stretch v1's 1067 samples across v2's index range, then compare pointwise
v1_int = interp1(1:size(v2,1)/size(v1,1):size(v2,1), v1, 1:size(v2,1), 'linear', 'extrap')';
sqrt(mean((v1_int-v2).^2))
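If you do have timestamps, resample directly on the time axis instead. A minimal sketch, where t1 and t2 are hypothetical time vectors assumed for illustration:
% Resample v1 onto v2's time base using (assumed) timestamps t1 and t2
t1 = linspace(0, 100, 1067)';   % hypothetical timestamps for v1
t2 = linspace(0, 100, 2227)';   % hypothetical timestamps for v2
v1 = rand(1067,1);
v2 = rand(2227,1);
v1_on_t2 = interp1(t1, v1, t2, 'linear', 'extrap');
rmse = sqrt(mean((v1_on_t2 - v2).^2))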

You can interpolate the shorter vector before proceeding with the RMSE calculation, as follows:
d1 = randn(1067,1);
d1_len = numel(d1);
d2 = randn(2227,1);
d2_len = numel(d2);
% Transpose the result so d1 stays a column vector like d2;
% otherwise d2 - d1 would expand into a 2227x2227 matrix.
d1 = interp1(1:(d2_len / d1_len):d2_len,d1,1:d2_len,'linear','extrap')';
plot(d2,'b');
hold on;
plot(d1,'r');
hold off;
Alternatively, the downsample and upsample functions can be used instead, but they require more attention to the final output data and its length. Once this is done, you can obtain the RMSE using this code:
RMSE = sqrt(mean(((d2 - d1) .^ 2)));
The RMSE is defined as the square root of the mean of the squared errors, where the errors are the differences between the observed values y and the predicted values ŷ, so choose carefully which one of your vectors represents the former and which one the latter.
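In symbols, for $n$ paired samples:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$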


How to use the reduced data - the output of principal component analysis

I am finding it hard to link the theory with the implementation, and I would appreciate help in seeing where my understanding is wrong.
Notation: matrices are in bold capital letters and vectors in bold lowercase letters.
$\mathbf{X} = \{\mathbf{x}_1,\dots,\mathbf{x}_n\}$ is a dataset of $n$ observations, each of $d$ variables. So, given these $n$ observed $d$-dimensional data vectors, the $q$-dimensional principal axes are $\mathbf{w}_1,\dots,\mathbf{w}_q$, for $q \in \{1,\dots,d\}$, where $q$ is the target dimension.
The principal components of the observed data matrix would be $\mathbf{Y} = \mathbf{W}^{\top}\mathbf{X}$, where $\mathbf{Y}$ is a $q \times n$ matrix, $\mathbf{W}$ a $d \times q$ matrix, and $\mathbf{X}$ a $d \times n$ matrix.
Columns of $\mathbf{W}$ form an orthogonal basis for the $q$ features, and the output $\mathbf{Y}$ is the principal component projection that minimizes the squared reconstruction error
$$\sum_{i=1}^{n} \lVert \mathbf{x}_i - \hat{\mathbf{x}}_i \rVert^2,$$
and the optimal reconstruction of $\mathbf{X}$ is given by $\hat{\mathbf{X}} = \mathbf{W}\mathbf{Y}$.
The data model is
X(i,j) = A(i,:)*S(:,j) + noise
where PCA should be done on X to recover the output S, i.e. the reduced data Y should equal S.
Problem 1: The reduced data Y is not equal to the S that is used in the model. Where is my understanding wrong?
Problem 2: How do I reconstruct S such that the error is minimal?
Please help. Thank you.
clear all
clc
n1 = 5; %d dimension
n2 = 500; % number of examples
ncomp = 2; % target reduced dimension
%Generating data according to the model
% X(i,j) = A(i,:)*S(:,j) + noise
Ar = orth(randn(n1,ncomp))*diag(ncomp:-1:1);
T = 1:n2;
%generating synthetic data from a dynamical model
S = [ exp(-T/150).*cos( 2*pi*T/50 )
exp(-T/150).*sin( 2*pi*T/50 ) ];
% Normalizing to zero mean and unit variance
S = ( S - repmat( mean(S,2), 1, n2 ) );
S = S ./ repmat( sqrt( mean( S.^2, 2 ) ), 1, n2 );
Xr = Ar * S;
Xrnoise = Xr + 0.2 * randn(n1,n2);
h1 = tsplot(S); % tsplot is a user-supplied plotting helper, not a MATLAB built-in
X = Xrnoise;
XX = X';
[pc, ~] = eigs(cov(XX), ncomp);
Y = XX*pc;
UPDATE [10 Aug]
Based on the Answer, here is the full code that I tried:
clear all
clc
n1 = 5; %d dimension
n2 = 500; % number of examples
ncomp = 2; % target reduced dimension
%Generating data according to the model
% X(i,j) = A(i,:)*S(:,j) + noise
Ar = orth(randn(n1,ncomp))*diag(ncomp:-1:1);
T = 1:n2;
%generating synthetic data from a dynamical model
S = [ exp(-T/150).*cos( 2*pi*T/50 )
exp(-T/150).*sin( 2*pi*T/50 ) ];
% Normalizing to zero mean and unit variance
S = ( S - repmat( mean(S,2), 1, n2 ) );
S = S ./ repmat( sqrt( mean( S.^2, 2 ) ), 1, n2 );
Xr = Ar * S;
Xrnoise = Xr + 0.2 * randn(n1,n2);
X = Xrnoise;
XX = X';
[pc, ~] = eigs(cov(XX), ncomp);
Y = XX*pc; %Y are the principal components of X'
%what you call pc is misleading, these are not the principal components
%These Y columns are orthogonal, and should span the same space
%as S approximatively indeed (not exactly, since you introduced noise).
%If you want to reconstruct
%the original data can be retrieved by projecting
%the principal components back on the original space like this:
Xrnoise_reconstructed = Y*pc';
%Then, you still need to project it through Ar'
%to the S space, if you want to reconstruct S
S_reconstruct = Ar'*Xrnoise_reconstructed';
plot(1:length(S_reconstruct),S_reconstruct,'r')
hold on
plot(1:length(S),S)
The resulting plot is very different from the one shown in the Answer. Only one component of S exactly matches that of S_reconstruct. Shouldn't the entire original 2-dimensional space of the source input S be reconstructed?
Even if I leave out the noise, still only one component of S is exactly reconstructed.
I see nobody answered your question, so here goes:
What you computed in Y are the principal components of X' (what you call pc is misleading; those are not the principal components). The columns of Y are orthogonal and should indeed approximately span the same space as S (not exactly, since you introduced noise).
If you want to reconstruct Xrnoise, you must look at the theory (e.g. here) and apply it correctly: the original data can be retrieved by projecting the principal components back on the original space like this:
Xrnoise_reconstructed = Y*pc'
Then, you still need to transform it through pinv(Ar)*Xrnoise_reconstructed, if you want to reconstruct S.
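In code, the corrected reconstruction reads as follows (a sketch reusing Y, pc, and Ar from the question):
Xrnoise_reconstructed = Y*pc';                   % back to the 5-D data space
S_reconstruct = pinv(Ar)*Xrnoise_reconstructed'; % then into the 2-D source space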
Matches nicely for me:
answer to UPDATE [10 Aug]: (EDITED 12 Aug)
Your Ar matrix does not define an orthonormal basis, and as such, the transpose Ar' is not the reverse transformation. The earlier answer I provided was thus wrong. The answer has been corrected above.
Your understanding is quite right. One of the reasons to use PCA is to reduce the dimensionality of the data. The first principal component has the largest sample variance among all normalized linear combinations of the columns of X. The second principal component has maximum variance subject to being orthogonal to the first, and so on.
One might then do a PCA on a dataset and decide to cut off the last principal component, or several of the last principal components, of the data. This is done to reduce the effect of the curse of dimensionality, a term that points out the fact that any group of vectors is sparse in a relatively high-dimensional space. Conversely, it means you would need an absurd amount of data to form any model on a fairly high-dimensional dataset, such as a word histogram of a text document with possibly tens of thousands of dimensions.
In effect, a dimensionality reduction by PCA removes components that are strongly correlated. For example, let's take a look at a picture:
As you can see, most of the values are almost the same and strongly correlated. You could merge some of these correlated pixels by removing the last principal components. This would reduce the dimensionality of the image and pack it, by removing some of the information in the image.
There is no magic way that I'm aware of to determine the best number of principal components or the best reconstruction.
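One common heuristic, though, is to keep enough components to explain a chosen fraction of the total variance. A sketch using the explained output of pca (Statistics Toolbox; XX is the observations-in-rows matrix from the question):
[coeff, score, ~, ~, explained] = pca(XX);      % explained: percent variance per component
ncomp_keep = find(cumsum(explained) >= 95, 1);  % smallest k explaining at least 95%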
Forgive me if I am not mathematically rigorous.
If we look at the equation X = A*S, we can say that we are taking some two-dimensional data and mapping it to a 2-dimensional subspace of a 5-dimensional space, where A is some basis for that 2-dimensional subspace.
When we solve the PCA problem for X and look at PC (the principal components), we see that the two big eigenvectors (which correspond to the two largest eigenvalues) span the same subspace that A did. (Multiply A'*PC and see that for the three small eigenvectors we get 0, which means those vectors are orthogonal to A, and only for the two largest do we get values different from 0.)
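That claim can be checked numerically; a sketch reusing Ar and Xrnoise from the question:
[pc_full, D] = eig(cov(Xrnoise'));   % all 5 eigenvectors of the sample covariance
[~, idx] = sort(diag(D), 'descend');
pc_full = pc_full(:, idx);           % largest-variance directions first
disp(Ar' * pc_full)                  % last 3 columns are ~0: orthogonal to A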
So I think the reason we get a different basis for this two-dimensional space is that X = A*S can be a product of some A1 and S1 and also of some other A2 and S2, and we will still get X = A1*S1 = A2*S2. What PCA gives us is a particular basis that maximizes the variance in each dimension.
So how do you solve your problem? I can see that you chose exponentials times sin and cos as the test data, so I think you are dealing with a specific kind of data. I am not an expert in signal processing, but take a look at the MUSIC algorithm.
You could use the pca function from the Statistics Toolbox:
coeff = pca(X)
From the documentation, each column of coeff contains the coefficients for one principal component, so you can project the observed data X onto the principal components by multiplying with coeff, e.g. X*coeff (note that pca centers the data, so subtract the column means from X first).
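A minimal sketch under the question's setup (Statistics Toolbox assumed; pca expects observations in rows, hence the transpose; the + mu step uses implicit expansion, R2016b+):
[coeff, score, ~, ~, ~, mu] = pca(X');   % X is n1 x n2, so pass X'
Y = score(:, 1:ncomp);                   % reduced data, n2 x ncomp
X_rec = Y*coeff(:, 1:ncomp)' + mu;       % reconstruction in the original data space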

How do I visualize n-dimensional features?

I have two matrices A and B. The size of A is 200*1000 double (here: 1000 represents 1000 different features). Matrix A belongs to group 1, where I use ones(200,1) as the label vector. The size of B is also 200*1000 double (here: 1000 also represents 1000 different features). Matrix B belongs to group 2, where I use -1*ones(200,1) as the label vector.
My question is how do I visualize matrices A and B so that I can clearly distinguish them based on the given groups?
I'm assuming each sample in your matrices A and B is a row of the corresponding matrix. If I understand you correctly, you want to draw a series of 1000-dimensional vectors, which is impossible: we can't physically visualize anything beyond three dimensions.
As such, what I suggest you do is perform a dimensionality reduction to reduce your data so that each input is reduced to either 2 or 3 dimensions. Once you reduce your data, you can plot them normally and assign a different marker to each point, depending on what group they belonged to.
If you want to achieve this in MATLAB, use Principal Components Analysis, specifically the pca function, which calculates the principal component coefficients and the samples reprojected onto a lower-dimensional space. I'm assuming you have the Statistics Toolbox; if you don't, then sorry, this won't work.
Specifically, given your matrices A and B, you would do this:
[coeffA, scoreA] = pca(A);
[coeffB, scoreB] = pca(B);
numDimensions = 2;
scoreAred = scoreA(:,1:numDimensions);
scoreBred = scoreB(:,1:numDimensions);
The second output of pca gives you the samples projected into principal component space, so you simply determine how many dimensions you want by extracting the first N columns, where N is the desired number of dimensions.
I chose 2 for now, and we can see what it looks like in 3 dimensions after. Once we have what we need for 2 dimensions, it's just a matter of plotting:
plot(scoreAred(:,1), scoreAred(:,2), 'rx', scoreBred(:,1), scoreBred(:,2), 'bo');
This will produce a plot where the samples from matrix A are shown with red crosses and the samples from matrix B with blue circles.
Here's a sample run given completely random data:
rng(123); %// Set seed for reproducibility
A = rand(200,1000); B = rand(200,1000); %// Generate random data
%// Code as before
[coeffA, scoreA] = pca(A);
[coeffB, scoreB] = pca(B);
numDimensions = 2;
scoreAred = scoreA(:,1:numDimensions);
scoreBred = scoreB(:,1:numDimensions);
%// Plot the data
plot(scoreAred(:,1), scoreAred(:,2), 'rx', scoreBred(:,1), scoreBred(:,2), 'bo');
We get this:
If you want three dimensions, simply change numDimensions = 3, then change the plot code to use plot3:
plot3(scoreAred(:,1), scoreAred(:,2), scoreAred(:,3), 'rx', scoreBred(:,1), scoreBred(:,2), scoreBred(:,3), 'bo');
grid;
With those changes, this is what we get:

Computing a moving average

I need to compute a moving average over a data series within a for loop. I have to get the moving average over N = 9 days. The array I'm computing on is 4 series of 365 values (M), which themselves are mean values of another set of data. I want to plot the mean values of my data together with the moving average in one plot.
I googled a bit about moving averages and the "conv" command and found something which I tried implementing in my code:
hold on
for ii = 1:4
    M = mean(C{ii},2);
    wts = [1/24;repmat(1/12,11,1);1/24];
    Ms = conv(M,wts,'valid');
    plot(M)
    plot(Ms,'r')
end
hold off
So basically, I compute my mean and plot it with a (wrong) moving average. I picked the "wts" value right off the MathWorks site (source: http://www.mathworks.nl/help/econ/moving-average-trend-estimation.html), so it is incorrect for my case. My problem, though, is that I do not understand what this "wts" is. Could anyone explain? If it has something to do with weighting the values, that is invalid in this case: all values should be weighted the same.
And if I am doing this entirely wrong, could I get some help with it?
My sincerest thanks.
There are two more alternatives:
1) filter
From the doc:
You can use filter to find a running average without using a for loop.
This example finds the running average of a 16-element vector, using a
window size of 5.
data = [1:0.2:4]';
windowSize = 5;
filter(ones(1,windowSize)/windowSize,1,data)
2) smooth as part of the Curve Fitting Toolbox (which is available in most cases)
From the doc:
yy = smooth(y) smooths the data in the column vector y using a moving
average filter. Results are returned in the column vector yy. The
default span for the moving average is 5.
%// Create noisy data with outliers:
x = 15*rand(150,1);
y = sin(x) + 0.5*(rand(size(x))-0.5);
y(ceil(length(x)*rand(2,1))) = 3;
%// Smooth the data using the loess and rloess methods with a span of 10%:
yy1 = smooth(x,y,0.1,'loess');
yy2 = smooth(x,y,0.1,'rloess');
In R2016a, MATLAB added the movmean function, which calculates a moving average:
N = 9;
M_moving_average = movmean(M,N);
Using conv is an excellent way to implement a moving average. In the code you are using, wts is how much you weight each value (as you guessed), and the sum of that vector should always equal one. If you wish to weight each value evenly and apply a size-N moving filter, then you would do:
N = 7;
wts = ones(N,1)/N;
sum(wts) % result = 1
Using the 'valid' argument in conv will result in fewer values in Ms than you have in M. Use 'same' if you don't mind the effects of zero padding. If you have the Signal Processing Toolbox, you can use cconv if you want to try a circular moving average. Something like
N = 7;
wts = ones(N,1)/N;
cconv(x,wts,length(x));
should work.
You should read the conv and cconv documentation for more information if you haven't already.
I would use this:
% does moving average on signal x, window size is w
function y = movingAverage(x, w)
k = ones(1, w) / w;
y = conv(x, k, 'same');
end
ripped straight from here.
To comment on your current implementation: wts is the weighting vector which, from the MathWorks example, implements a 13-point average in which the first and last points receive half the weight of the others.
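Putting those pieces together, a sketch of the asker's loop with an equal-weight 9-day window (C is assumed to be the cell array of raw data from the question):
N = 9;
wts = ones(N,1)/N;        % equal weights that sum to one
hold on
for ii = 1:4
    M = mean(C{ii},2);    % daily means, as in the question
    Ms = conv(M,wts,'same');
    plot(M)
    plot(Ms,'r')
end
hold off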

Matlab Average Two Fourier transforms with different frequency vectors

Suppose I have two power-spectrum vectors PS1 and PS2, which were created using fft, keeping only the positive-frequency values and squaring the fft values (multiplying by the complex conjugate, really).
Suppose also that the corresponding frequency values for PS1 and PS2 are different. E.g. PS1(10) might correspond to 10 Hz and PS2(10) might correspond to 10.5 Hz.
I want to have an average of these two (and more) power spectra. How would I best create such an average? It is fine if PS_ave is a longer vector than either of the original power spectra, so long as there is a corresponding frequency vector. So, it might be that PS_ave(11) corresponds to 10.25 Hz, and this value should probably be the average of PS1(10) and PS2(10). All ideas are welcome!
Thanks!
You might try using interp1, which can resample one spectrum at the frequency points of another. The following example illustrates this:
v1 = 1;
t1 = 0:0.1:10;
sig1 = sin(2*pi*t1*v1).*exp(-0.5*t1)/length(t1);
v2 = 0.5;
t2 = 0:0.2:10;
sig2 = cos(2*pi*t2*v2).*exp(-0.5*t2)/length(t2);
s1 = fft(sig1);
s1 = s1(1:end/2);   % keep the positive-frequency half
f1 = (0:length(s1)-1)*(1/max(t1));
s2 = fft(sig2);
s2 = s2(1:end/2);
f2 = (0:length(s2)-1)*(1/max(t2));
p1 = abs(s1);
p2 = abs(s2);
% Average using interpolation: evaluate the denser spectrum at the
% frequencies of the coarser one, then take the pointwise mean
p1_interp = interp1(f1,p1,f2);
power_avg = mean([p2; p1_interp],1);
hold on, plot(f2,power_avg,'r')
Here's the result (red = avg power):
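If you would rather have PS_ave on its own (possibly denser) frequency grid, as the question suggests, here is a sketch along the same lines (reusing f1, p1, f2, p2 from above):
% Average several spectra on a common grid spanning the overlapping band
f_common = linspace(0, min(max(f1),max(f2)), 200);
p1_c = interp1(f1, p1, f_common);
p2_c = interp1(f2, p2, f_common);
PS_ave = mean([p1_c; p2_c], 1);
plot(f_common, PS_ave)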

Generate random data given mean and standard deviation in MATLAB?

I have limited data RV for which I can find the mean mu and standard deviation sigma. Now I want to generate more data points keeping the same mu and sigma. How would I go about doing this in MATLAB? I did the following, however when I plot mean of the generated data (mu_2) it does not match mu...
N = 15
R = mean(RV) + std(RV)*randn(N, 1);
mu = mean(RV)*ones(N,1);
mu_2 = mean(R)*ones(N,1);
I think you should use the normrnd(mu,sigma) function. Go to the documentation for more details.
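For N new points with the sample statistics of RV, a minimal sketch (Statistics Toolbox):
R = normrnd(mean(RV), std(RV), [N 1]);   % N draws matching RV's mean and std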
That looks correct. For such a small sample size, it's unlikely that you'll get a very good match. Try a much bigger value of N.
If you want to force your dataset to a particular mean and stddev, then you could just generate a set of samples, then measure their mean and stddev, and then just adjust by scaling and scalar addition.
For example:
N = 1000;               % a larger N gives a closer match
mu_desired = mean(RV);  % target statistics, e.g. from your data RV
std_desired = std(RV);
R = randn(N,1);
% Measure
mu_tmp = mean(R);
std_tmp = std(R);
% Normalise and denormalise
R = (R - mu_tmp) / std_tmp;
R = (R * std_desired) + mu_desired;
You can also generate Gaussian mixtures using the Netlab library (it's free!):
mix = gmm(8, 3, 'spherical');
[Data, Label] = gmmsamp(mix, 1000);
The above generates a data set with 8 dimensions and three (spherical) centres over 1000 observations.