Calculating the root-mean-square-error between two matrices one of which contains NaN values - matlab

This is part of a larger project, so I will keep only the relevant parts (the variables and my attempt at the calculation).
I want to calculate the root mean squared error between Zi_cubic and Z_actual, using the RMSE formula:
RMSE = sqrt( (1/n) * sum( (estimated_i - actual_i)^2 ) )
Given/already established variables
rng('default');
% Set up 2,000 random numbers between -1 & +1 as our x & y values
n=2000;
x = 2*(rand(n,1)-0.5);
y = 2*(rand(n,1)-0.5);
z = x.^5+y.^3;
% Interpolate to a regular grid
d = -1:0.01:1;
[Xi,Yi] = meshgrid(d,d);
Zi_cubic = griddata(x,y,z,Xi,Yi,'cubic');
Z_actual = Xi.^5+Yi.^3;
My attempt at a calculation
My approach is to:
1. Arrange Zi_cubic and Z_actual as column vectors
2. Take the difference
3. Square each element of the difference
4. Sum all the elements from step 3 using nansum
5. Divide by the number of finite elements from step 3
6. Take the square root
D1 = reshape(Zi_cubic,[numel(Zi_cubic),1]); % step 1
D2 = reshape(Z_actual,[numel(Z_actual),1]); % step 1
D3 = D1 - D2;                               % step 2
D4 = D3.^2;                                 % step 3
D5 = nansum(D4)                             % step 4
d6 = sum(isfinite(D4))                      % step 5: number of finite elements
D6 = D5/d6                                  % step 5
D7 = sqrt(D6)                               % step 6: the RMSE
Apparently this is wrong. I'm either misapplying the RMSE formula or I don't understand what I'm telling MATLAB to do.
Any help would be appreciated. Thanks in advance.

Your RMSE is fine (in my book). The only thing that seems possibly off is the meshgrid and griddata step: your inputs to griddata are vectors, and you are asking for a matrix output. That is fine, but you are potentially undersampling your input space. In other words, you are giving n samples as inputs, but perhaps you are expected to give n^2 samples? Here's some sample code for a smaller n to demonstrate this effect more clearly:
rng('default');
% Set up 2,000 random numbers between -1 & +1 as our x & y values
n=100; %Reduced because scatter is slow to plot
x = 2*(rand(n,1)-0.5);
y = 2*(rand(n,1)-0.5);
z = x.^5+y.^3;
S = 100;
subplot(1,2,1)
scatter(x,y,S,z)
%More data, more accurate ...
[x2,y2] = meshgrid(x,y);
z2 = x2.^5+y2.^3;
subplot(1,2,2)
scatter(x2(:),y2(:),S,z2(:))
The second plot should be a lot cleaner and thus will likely provide a more accurate estimate of Z_actual later on.
I also thought you might be running into floating-point issues when calculating the RMSE, but that appears not to be the case. Here's some alternative code showing how I would write it:
d = Zi_cubic(:) - Z_actual(:);
mask = ~isnan(d);
n_valid = sum(mask);
rmse = sqrt(sum(d(mask).^2)/n_valid);
Notice that (:) linearizes a matrix into a column vector. It is also worth using more descriptive variable names than D1-D7.
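As a side note, on reasonably recent MATLAB versions (R2014b or newer, if I recall correctly) the 'omitnan' flag handles the NaNs directly, without needing nansum from the Statistics Toolbox:
d = Zi_cubic(:) - Z_actual(:);
rmse = sqrt(mean(d.^2, 'omitnan')); % mean over the non-NaN elements only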
In the end though these are just suggestions and your code looks fine.
PS - I'm assuming that you are supposed to be using cubic interpolation as that is another place you could perhaps deviate from what's expected ...

Related

find the line which best fits to the data

I'm trying to find the line which best fits the data. I use the code below, but now I want the data placed into an array, sorted so that the points closest to the line come first. How can I do this? Also, is polyfit the correct function to use for this?
x=[1,2,2.5,4,5];
y=[1,-1,-.9,-2,1.5];
n=1;
p = polyfit(x,y,n)
f = polyval(p,x);
plot(x,y,'o',x,f,'-')
PS: I'm using Octave 4.0 which is similar to Matlab
You can first compute the error between the real value y and the predicted value f
err = abs(y-f);
Then sort the error vector
[val, idx] = sort(err);
And use the sorted indexes to have your y values sorted
y2 = y(idx);
Now y2 has the same values as y, but with the ones closest to the fitted line first.
Do the same for x to compute x2, so you keep the correspondence between x2 and y2:
x2 = x(idx);
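Putting it all together, the whole procedure is just a few lines:
x = [1,2,2.5,4,5];
y = [1,-1,-.9,-2,1.5];
p = polyfit(x,y,1);          % first-degree fit (a line)
f = polyval(p,x);            % predicted values on the line
[~, idx] = sort(abs(y-f));   % order points by distance to the line
x2 = x(idx);
y2 = y(idx);                 % points closest to the line come first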
Sembei Norimaki did a good job of explaining your primary question, so I will address your secondary question: is polyfit the right function?
The least-squares best-fit line is the line that minimizes the sum of squared errors between the model and the data.
If it must be a "line", we can use polyfit, which fits a polynomial. A line is just a first-degree polynomial, and first-degree polynomials have properties that make them easy to work with. The first-order (linear) equation you are looking for has the form:
y = mx + b
where y is your dependent variable and x is your independent variable. So the challenge is this: find the m and b such that the modeled y is as close to the actual y as possible. As it turns out, the error associated with a linear fit is convex, meaning it has a single minimum. To calculate this minimum, it is simplest to combine the bias (ones) column and the x vector as follows:
Xcombined = [x.' ones(length(x),1)];
then apply the normal equation, derived from minimizing the squared error:
beta = inv(Xcombined.'*Xcombined)*(Xcombined.')*(y.')
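As an aside, MATLAB's backslash operator solves the same least-squares problem without forming an explicit inverse, which is numerically safer:
beta = Xcombined \ y.'; % equivalent to the normal equation, but better conditioned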
Great, now our line is defined as Y = Xcombined*beta. To draw the line, simply sample over some range of x, append the bias column, and multiply:
xr = (0:.1:5).';
Xplot = [xr ones(size(xr))];
Yplot = Xplot*beta;
plot(xr, Yplot); % plot against the x samples only, not the ones column
So why does polyfit work so poorly? Well, I can't say for sure, but my hypothesis is that you need to transpose your x and y matrices. I would guess that would give you a much more reasonable line.
x = x.';
y = y.';
then try
p = polyfit(x,y,n)
I hope this helps. A wise man once told me (and as I learn every day), don't trust an algorithm you do not understand!
Here's some test code that may help someone else dealing with linear regression and least squares
%https://youtu.be/m8FDX1nALSE matlab code
%https://youtu.be/1C3olrs1CUw good video to work out by hand if you want to test
function [a0, a1] = rtlinreg(x,y)
    x = x(:);
    y = y(:);
    n = length(x);
    a1 = (n*sum(x.*y) - sum(x)*sum(y))/(n*sum(x.^2) - (sum(x))^2); % a1 is the slope of the linear model
    a0 = mean(y) - a1*mean(x); % a0 is the y-intercept
end
x=[65,65,62,67,69,65,61,67]'
y=[105,125,110,120,140,135,95,130]'
[a0 a1] = rtlinreg(x,y); %a1 is the slope of linear model, a0 is the y-intercept
x_model =min(x):.001:max(x);
y_model = a0 + a1.*x_model; %y=-186.47 +4.70x
plot(x,y,'x',x_model,y_model)
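As a sanity check, the result should agree with polyfit:
p = polyfit(x,y,1); % p(1) should match a1 (slope), p(2) should match a0 (intercept)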

How to fit an exponential curve to damped harmonic oscillation data in MATLAB?

I'm trying to fit an exponential curve to data sets containing damped harmonic oscillations. The data is a bit complicated in the sense that the sinusoidal oscillations contain many frequencies, as seen below:
I need to find the rate of decay in the data. The method I am using can be found here. It works by taking the log of the y values above the steady-state value and then fitting them with:
lsqlin(A,y1(:),-A,-y1(:),[],[],[],[],[],optimset('algorithm','active-set','display','off'))
However, this results in the following data fits:
I tried using a linear regression fit, which obviously didn't work because it averaged over the oscillations. I also tried RANSAC, thinking that there is more data near the peaks. It worked a bit better than the linear regression, but the method is flawed: there are times when more points exist in the wrong regions.
Does anyone know of a good method to just fit the peaks for this data?
Currently, I'm thinking of dividing the 500 data points into regions of 10 points each and finding the largest value in each region. At the end, I should have 50 points that I can fit using any of the exponential fitting methods mentioned above. What do you think of this method?
Thought I'd give everyone an update on potential solutions that may work. As mentioned earlier, the data is complicated by the varying sinusoidal frequencies, so certain methods may not work because of this. The methods listed below can be good depending on the data and the frequencies involved.
First off, I assume that the data has the form:
y = average + b*e^(-c*x)
In my case, the average is 290, so we have:
y = 290 + b*e^(-c*x)
Taking the log of the points above the average linearizes this: log(y - 290) is linear in x, so a straight-line fit recovers b from the intercept and the decay rate from the slope. With that said, let's dive into the different methods I tried:
findpeaks() Method
This is the method that Alexander Büse suggested. It's a pretty good method for most data, but for my data, with its multiple sinusoidal frequencies, it picks up the wrong peaks. The red x's show the peaks.
% Find Peaks Method
avg = 290;       % steady-state value
ind = (y > avg); % keep only the points above the steady state
[max_num,max_ind] = findpeaks(y(ind));
plot(max_ind,max_num,'x','Color','r'); hold on;
x1 = max_ind;
y1 = log(max_num-290);
coeffs = polyfit(x1,y1,1)
b = exp(coeffs(2));
c = coeffs(1);
RANSAC
RANSAC is good if you have most of your data at the peaks. You can see that in my data, because of the multiple frequencies, more points exist near the top. However, not all my data sets are like this, so it only worked occasionally.
% RANSAC Method
ind = (y > avg);
x1 = x(ind);
y1 = log(y(ind) - avg);
iterNum = 300;
thDist = 0.5;
thInlrRatio = .1;
[t,r] = ransac([x1;y1'],iterNum,thDist,thInlrRatio);
k1 = -tan(t);
b1 = r/cos(t);
% plot(x1,k1*x1+b1,'r'); hold on;
b = exp(b1);
c = k1;
Lsqlin Method
This method is the one used here. It uses lsqlin to constrain the system. However, it seems to ignore the data in the middle. Depending on your data set, this could work really well, as it did for the person in the original post.
% Lsqlin Method
avg = 290;
ind = (y > avg);
x1 = x(ind);
y1 = log(y(ind) - avg);
A = [ones(numel(x1),1),x1(:)]*1.00;
coeffs = lsqlin(A,y1(:),-A,-y1(:),[],[],[],[],[],optimset('algorithm','active-set','display','off'));
b = exp(coeffs(1)); % intercept: the ones column is first in A
c = coeffs(2);      % slope
Find Peaks in Period
This is the method I mentioned in my post, where I take the peak in each region. It works pretty well, and from it I realized that my data may not actually have a perfect exponential fit: it is unable to match the large peaks at the beginning. I was able to improve this a bit by using only the first 150 data points and ignoring the steady-state points. Here I found the peak every 25 data points.
% Incremental Method 2 Unknowns
x1 = [];
y1 = [];
max_num = [];
max_ind = [];
incr = 25;
for i = 1:floor(size(y,1)/incr)
    [max_num(end+1),max_ind(end+1)] = max(y(1+incr*(i-1):incr*i));
    max_ind(end) = max_ind(end) + incr*(i-1); % convert window-local index to global
    if max_num(end) > avg
        x1(end+1) = max_ind(end);
        y1(end+1) = log(max_num(end)-290);
    end
end
plot(max_ind,max_num,'x','Color','r'); hold on;
coeffs = polyfit(x1,y1,1)
b = exp(coeffs(2));
c = coeffs(1);
Using all 500 data points:
Using the first 150 data points:
Find Peaks in Period With b Constrained
Since I want the fit to start at the first peak, I constrained the b value. I know the system is y = 290 + b*e^(-c*x), and I constrain it such that b = y(1) - 290. By doing so, I just need to solve for c, where c = (log(y-290) - log(b))/x; I can then take the mean or median of c. This method is quite good as well. It doesn't fit the values near the end as closely, but that isn't a big deal since the change there is minimal.
% Incremental Method 1 Unknown (b is constrained: b = y(1)-290)
b = y(1) - 290;
c = [];
max_num = [];
max_ind = [];
incr = 25;
for i = 1:floor(size(y,1)/incr)
    [max_num(end+1),max_ind(end+1)] = max(y(1+incr*(i-1):incr*i));
    max_ind(end) = max_ind(end) + incr*(i-1);
    if max_num(end) > avg
        c(end+1) = (log(max_num(end)-290)-log(b))/max_ind(end);
    end
end
c = mean(c); % Or median(c) works just as good
Here I take the peak for every 25 data points and then take the mean of c
Here I take the peak for every 25 data points and then take the median of c
Here I take the peak for every 10 data points and then take the mean of c
If the main goal is to extract the damping parameter, maybe you want to consider fitting a damped sine curve directly to your data. Something like this (created with the Curve Fitting Tool):
[xData, yData] = prepareCurveData( x, y );
ft = fittype( 'a + sin(b*x - c).*exp(d*x)', 'independent', 'x', 'dependent', 'y' );
opts = fitoptions( 'Method', 'NonlinearLeastSquares' );
opts.Display = 'Off';
opts.StartPoint = [1 0.285116122712545 0.805911873245316 0.63235924622541];
[fitresult, gof] = fit( xData, yData, ft, opts );
plot( fitresult, xData, yData );
Especially since some of your example data really don't have many data points in the interesting region (above the noise).
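Once the fit converges, the damping parameter and its confidence interval can be read off the fit object:
damping = fitresult.d;   % the d coefficient of the damped-sine model above
ci = confint(fitresult); % confidence intervals for all fitted coefficients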
If however, you really need to fit directly to maxima of the experimental data, you could use the findpeaks function to select only the maxima and then fit to them. You may want to play a bit with the MinPeakProminence parameter to adjust it to your needs.
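For example, a minimal sketch (the prominence threshold is a made-up value you would tune to your data):
[pks, locs] = findpeaks(y, 'MinPeakProminence', 5); % keep only prominent maxima
keep = pks > 290;            % the log below needs peaks above the steady state
x1 = x(locs(keep));
y1 = log(pks(keep) - 290);   % linearize via the y = 290 + b*e^(-c*x) model
coeffs = polyfit(x1, y1, 1); % slope and intercept give the decay rate and log(b)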

Mahalanobis distance in Matlab

I would like to calculate the Mahalanobis distance from an input feature vector Y (1x14) to all feature vectors in matrix X (18x14). Each group of 6 vectors in X represents one class (so I have 3 classes). Then, based on the Mahalanobis distances, I will choose the vector that is nearest to the input and classify the input into one of the three classes.
My problem is that when I use the following code, I get only one value. How can I get the Mahalanobis distance between the input Y and every vector in X, so that at the end I have 18 values and can choose the smallest one? Any help will be appreciated. Thank you.
Note: I know that the Mahalanobis distance is a measure of the distance between a point P and a distribution D, but I don't know how this could be applied in my situation.
Y = test1; % Y: 1x14 vector
S = cov(X); % X: 18x14 matrix
mu = mean(X,1);
d = ((Y-mu)/S)*(Y-mu)'
I also tried separating the matrix X into 3 parts, each representing the feature vectors of one class. This is the code, but it doesn't work properly: I get 3 distances, and some have negative values!
Y = test1;
X1 = Action1;
S1 = cov(X1);
mu1 = mean(X1,1);
d1 = ((Y-mu1)/S1)*(Y-mu1)'
X2 = Action2;
S2 = cov(X2);
mu2 = mean(X2,1);
d2 = ((Y-mu2)/S2)*(Y-mu2)'
X3= Action3;
S3 = cov(X3);
mu3 = mean(X3,1);
d3 = ((Y-mu3)/S3)*(Y-mu3)'
d= [d1,d2,d3];
MahalanobisDist= min(d)
One last thing: when I used the mahal function provided by MATLAB, I got this warning:
Warning: Matrix is close to singular or badly scaled. Results may be inaccurate.
If you have to implement the distance yourself (for a school assignment, for instance) this is of absolutely no use to you, but if you just need the distance as an intermediate step, I highly recommend d = pdist2(a,b,distance_measure); the documentation is on MATLAB's site.
It computes the pairwise distances between the rows of b and the rows of a and stores them in a matrix d, where the columns correspond to rows of b and the rows to rows of a; so d(i,j) is the distance between row i in a and row j in b (hope that made sense). It can even take parameters to find the k nearest neighbors; it's a great function.
In your case you would use the following code, and you'd end up with both the distance and the index:
% number of neighbors
K = 1;
% X is 18x14, Y is 1x14; with 'smallest',K the output is K-by-1
[dist, iidx] = pdist2(X,Y,'mahalanobis','smallest',K);
% to find the class, you can do something like this
num_samples_per_class = 6;
matching_class = ceil(iidx/num_samples_per_class);
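Regarding the singularity warning from mahal: with only 6 samples per class in 14 dimensions, the per-class covariance matrices cannot have full rank, which is also the likely source of the negative "distances" in the quadratic form above. A common workaround is to add a small ridge term (a sketch; lambda is an assumed tuning value):
lambda = 1e-3;                         % regularization strength (assumption)
S1 = cov(X1) + lambda*eye(size(X1,2)); % now positive definite, hence invertible
d1 = ((Y-mu1)/S1)*(Y-mu1)';            % a proper non-negative quadratic form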

Interpolation with matlab

I have a vector with different values.
Some of the values are zeros, and sometimes they even come one after another.
I need to plot this vector against another vector of the same size, but I can't have zeros in it.
What is the best way to do some kind of interpolation on my vector, and how do I do it?
I tried to read about interpolation in MATLAB, but I didn't understand it well enough to implement it.
If it's possible to explain it to me step by step, I would be grateful, since I'm new to this program.
Thanks
Starting from a dataset consisting of two equal length vectors x,y, where y values equal to zero are to be excluded, first pick the subset excluding zeros:
incld = y~=0;
Then you interpolate over that subset:
yn = interp1(x(incld),y(incld),x);
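To make this concrete, here is a small self-contained example with made-up data:
x = linspace(0,10,50);
y = sin(x);
y([5 12 13 30]) = 0;               % inject zeros, including two consecutive ones
incld = y~=0;
yn = interp1(x(incld),y(incld),x); % interpolate across the zero entries
plot(x,y,'g',x,yn,'r');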
Example result, plotting x against y (green) and x against yn (red):
edit
Notice that, by the definition of interpolation, if the terminal points are zero you will have to take care of them separately (interp1 returns NaN outside the range of the remaining points), for instance by running the following before the lines above:
if y(1)==0, y(1) = y(find(y~=0,1,'first'))/2; end
if y(end)==0, y(end) = y(find(y~=0,1,'last'))/2; end
edit #2
And this is the 2D version of the above, where arrays X and Y are coordinates corresponding to the entries in 2D array Z:
[nr, nc] = size(Z);
[X, Y] = meshgrid(1:nc, 1:nr);
X2 = X;
Y2 = Y;
Z2 = Z;
excld = Z==0;
X2(excld) = [];
Y2(excld) = [];
Z2(excld) = [];
ZN = griddata(X2,Y2,Z2,X,Y);
ZN contains the interpolated points.
In the figure below, zeros are shown by dark blue patches. Left is before interpolation, right is after:
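As a side note, newer MATLAB releases also provide scatteredInterpolant, which does the same job and can be reused efficiently for repeated queries:
F = scatteredInterpolant(X2(:), Y2(:), Z2(:)); % linear interpolation by default
ZN = F(X, Y);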

Multiply an arbitrary number of matrices an arbitrary number of times

I have found several questions/answers for vectorizing and speeding up routines for multiplying a matrix and a vector in a single loop, but I am trying to do something a little more general, namely multiplying an arbitrary number of matrices together, and then performing that operation an arbitrary number of times.
I am writing a general routine for calculating thin-film reflection from an arbitrary number of layers versus optical frequency. For each optical frequency W, each layer has an index of refraction N, an associated 2x2 transfer matrix L, and a 2x2 interface matrix I, which depend on the index of refraction and the thickness of the layer. If n is the number of layers and m is the number of frequencies, I can vectorize the indices into an n x m matrix, but in order to calculate the reflection at each frequency I still have to use nested loops. Since I am ultimately using this as part of a fitting routine, anything I can do to speed it up would be greatly appreciated.
This should provide a minimum working example:
W = 1260:0.1:1400; %frequency in cm^-1
N = rand(4,numel(W))+1i*rand(4,numel(W)); %dummy complex index of refraction
D = [0 0.1 0.2 0]/1e4; %thicknesses in cm
[n,m] = size(N);
r = zeros(size(W));
for x = 1:m % loop over frequencies
    C = eye(2); % first medium is air
    for y = 2:n % loop over layers
        na = N(y-1,x);
        nb = N(y,x);
        %I = InterfaceMatrix(na,nb); % calculate the 2x2 interface matrix
        I = [1 na*nb;na*nb 1]; % dummy matrix
        %L = TransferMatrix(nb) % calculate the 2x2 transfer matrix
        L = [exp(-1i*nb*W(x)*D(y)) 0; 0 exp(+1i*nb*W(x)*D(y))]; % dummy matrix
        C = C*I*L;
    end
    a = C(1,1);
    c = C(2,1);
    r(x) = c/a; % reflectivity, the answer I want
end
Running this twice for two different polarizations for a three-layer (air/stuff/substrate) problem with 2562 frequencies takes 0.952 seconds, while solving the exact same problem with the explicit (vectorized) formula for a three-layer system takes 0.0265 seconds. The problem is that beyond 3 layers the explicit formula rapidly becomes intractable, and I would need a different subroutine for each number of layers, whereas the above is completely general.
Is there hope for vectorizing this code or otherwise speeding it up?
(edited to add that I've left several things out of the code to shorten it, so please don't try to use this to actually calculate reflectivity)
Edit: To clarify, I and L are different for each layer and for each frequency, so they change on every loop iteration; simply taking the exponent will not work. For a real-world example, take the simplest case of a soap bubble in air. There are three layers (air/soap/air) and two interfaces. For a given frequency, the full transfer matrix C is:
C = L_air * I_air2soap * L_soap * I_soap2air * L_air;
and I_air2soap ~= I_soap2air. Thus, I start with L_air = eye(2) and then go down successive layers, computing I_(y-1,y) and L_y, multiplying them into the result from the previous iteration, and continuing until I reach the bottom of the stack. Then I grab the C(1,1) and C(2,1) entries, take their ratio, and that is the reflectivity at that frequency. Then I move on to the next frequency and do it all again.
I suspect that the answer is going to somehow involve a block-diagonal matrix for each layer as mentioned below.
I'm not next to a MATLAB right now, so this is only a starter.
Instead of the double loop, you can write na*nb as Nab = N(1:end-1,:).*N(2:end,:);
The term in the exponent, nb*W(x)*D(y), can be written as e = N(2:end,:)*W'*D;
The result of I*L is a 2x2 block matrix of this form:
M = [1, Nab; Nab, 1]*[e-, 0; 0, e+] = [e-, Nab*e+; Nab*e-, e+]
with e- as exp(-1i*e) and e+ as exp(1i*e).
See kron for how to get the block-matrix form; to vectorize the propagation C = C*I*L, just take M^n.
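For what it's worth, a concrete elementwise version of those two precomputations might look like this (a sketch, assuming R2016b+ implicit expansion; note the .* products rather than matrix products):
Nab = N(1:end-1,:) .* N(2:end,:);     % na*nb for every interface and frequency
e   = N(2:end,:) .* W .* D(2:end).';  % nb*W(x)*D(y) for every layer and frequency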
@Lama put me on the right path by suggesting block matrices, but the ultimate answer ended up being more complicated, so I put it here for posterity. Since the transfer and interface matrices are different for each layer, I keep the loop over the layers but construct a large sparse block matrix in which each 2x2 block represents a frequency.
W = 1260:0.1:1400; %frequency in cm^-1
N = rand(4,numel(W))+1i*rand(4,numel(W)); %dummy complex index of refraction
D = [0 0.1 0.2 0]/1e4; %thicknesses in cm
[n,m] = size(N);
r = zeros(size(W));
C = speye(2*m); % first medium is air
even = 2:2:2*m;
odd = 1:2:2*m-1;
for y = 2:n % loop over layers
    na = N(y-1,:);
    nb = N(y,:);
    % get the reflection and transmission coefficients from subroutines as a vector
    % of length m, one value for each frequency
    %t = Tab(na, nb);
    %r = Rab(na, nb);
    t = rand(size(W)); % dummy vector for MWE
    r = rand(size(W)); % dummy vector for MWE
    % create diagonal and off-diagonal elements. each block is [1 r;r 1]/t
    Id(even) = 1./t;
    Id(odd) = Id(even);
    Io(even) = 0;
    Io(odd) = r./t;
    It = [Io;Id/2].';
    I = spdiags(It,[-1 0],2*m,2*m);
    I = I + I.';
    b = 1i.*(2*pi*D(y).*nb).*W; % use the thickness of the current layer, D(y)
    B(even) = -b;
    B(odd) = b;
    L = spdiags(exp(B).',0,2*m,2*m);
    C = C*I*L;
end
a = spdiags(C,0);
a = a(odd).';
c = spdiags(C,-1);
c = c(odd).';
r = c./a; % reflectivity, the answer I want.
With the 3-layer system mentioned above, it isn't quite as fast as the explicit formula, but it's close and can probably get a little faster with some profiling. The full version of the original code clocks in at 0.97 seconds, the formula at 0.012 seconds, and the sparse diagonal version here at 0.065 seconds.