I'm trying to find the line which best fits my data. I use the code below, but now I want the data placed into an array, sorted so that the points closest to the line come first. How can I do this? Also, is polyfit the correct function to use for this?
x=[1,2,2.5,4,5];
y=[1,-1,-.9,-2,1.5];
n=1;
p = polyfit(x,y,n)
f = polyval(p,x);
plot(x,y,'o',x,f,'-')
PS: I'm using Octave 4.0, which is similar to MATLAB
You can first compute the error between the real value y and the predicted value f
err = abs(y-f);
Then sort the error vector
[val, idx] = sort(err);
And use the sorted indices to have your y values sorted:
y2 = y(idx);
Now y2 has the same values as y, but with the ones closest to the fitted line first.
Do the same for x to compute x2, so you keep the correspondence between x2 and y2:
x2 = x(idx);
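Putting it all together with the code from the question, a complete runnable sketch (nothing new, just the steps above in one block):
x = [1,2,2.5,4,5];
y = [1,-1,-.9,-2,1.5];
p = polyfit(x, y, 1);    % fit a first-degree polynomial (a line)
f = polyval(p, x);       % evaluate the fit at the original x values
err = abs(y - f);        % vertical distance of each point from the line
[~, idx] = sort(err);    % indices ordered from closest to farthest
x2 = x(idx);
y2 = y(idx);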
Sembei Norimaki did a good job of explaining your primary question, so I will look at your secondary question: is polyfit the right function?
The least-squares best-fit line is the line that minimizes the sum of the squared errors between the model and the data (its residuals then also average to zero).
If it must be a "line", we could use polyfit, which fits a polynomial; a line is just a first-degree polynomial, and first-degree polynomials have properties that make them easy to work with. The first-order (linear) equation you are looking for takes this form:
y = mx + b
where y is your dependent variable and x is your independent variable. So the challenge is this: find the m and b such that the modeled y is as close to the actual y as possible. As it turns out, the error associated with a linear fit is convex, meaning it has a single minimum. In order to calculate this minimum, it is simplest to combine the bias and the x vectors as follows:
Xcombined = [x.' ones(length(x),1)];
then use the normal equation, derived from minimizing the squared error:
beta = inv(Xcombined.'*Xcombined)*(Xcombined.')*(y.')
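As an aside (my note, not part of the original derivation): the backslash operator solves the same system more stably than an explicit inverse, and can even do the least-squares solve directly:
beta = (Xcombined.'*Xcombined) \ (Xcombined.'*y.');  % same normal equations, no explicit inverse
beta = Xcombined \ y.';                              % or let backslash do least squares directly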
Great, now our line is defined as y = Xcombined*beta. To draw the line, simply sample from some range of x and append the ones column for the b term:
Xplot = [[0:.1:5].' ones(length([0:.1:5].'),1)];
Yplot = Xplot*beta;
plot(Xplot(:,1), Yplot);
So why does polyfit seem to work so poorly? Well, I can't say for sure, but my hypothesis is that you need to transpose your x and y vectors. I would guess that would give you a much more reasonable line.
x = x.';
y = y.';
then try
p = polyfit(x,y,n)
I hope this helps. A wise man once told me (and as I learn every day), don't trust an algorithm you do not understand!
Here's some test code that may help someone else dealing with linear regression and least squares
%https://youtu.be/m8FDX1nALSE matlab code
%https://youtu.be/1C3olrs1CUw good video to work out by hand if you want to test
function [a0 a1] = rtlinreg(x,y)
x=x(:);
y=y(:);
n=length(x);
a1 = (n*sum(x.*y) - sum(x)*sum(y))/(n*sum(x.^2) - (sum(x))^2); % a1 is the slope of the linear model
a0 = mean(y) - a1*mean(x); % a0 is the y-intercept
end
x=[65,65,62,67,69,65,61,67]'
y=[105,125,110,120,140,135,95,130]'
[a0 a1] = rtlinreg(x,y); %a1 is the slope of linear model, a0 is the y-intercept
x_model = min(x):.001:max(x);
y_model = a0 + a1.*x_model; % fitted model: y = -186.47 + 4.70x
plot(x,y,'x',x_model,y_model)
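As a quick sanity check (my addition, not part of the original test code), polyfit of degree 1 should reproduce the same coefficients, just in the opposite order:
p = polyfit(x, y, 1);  % p(1) is the slope, p(2) the intercept
[p(1) a1; p(2) a0]     % each row should match up to rounding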
Working in MATLAB. I have the following equation:
S = aW + bX + cY + dZ
where S, W, X, Y, and Z are all known n x 1 vectors. I am trying to fit the data of S with a linear combination of the basis vectors W, X, Y, and Z, with the constraint that the constants (a, b, c, d) are greater than 0. I have managed to do this in Excel's Solver, and have attempted to figure it out in MATLAB, being directed towards functions like fmincon, but I am not overly familiar with MATLAB and feel I am misunderstanding the use of fmincon.
I am looking for help with understanding fmincon's use towards my problem, or redirection towards a more efficient method for solving.
Currently I have:
initials = [0.2 0.2 0.2 0.2];
fun = @(x)x(1)*W + x(2)*X + x(3)*Y + x(4)*Z;
lb = [0,0,0,0];
soln = fmincon(fun,initials,data,b,Aeq,beq,lb,ub);
I am receiving an error stating "A must have 4 column(s)", where A refers to my variable data, which corresponds to S in the above equation. I do not understand why it is expecting 4 columns. Note also that the variables in my snippet that are not explicitly defined are set to [], serving as placeholders.
Using fmincon here is huge overkill. It's like using a big, heavy microscope to crack nuts... or a Martian rover to tow a pushcart... anyway :) It may be OK if you don't have to process large sets of vectors, but if you need to fit hundreds of thousands of such vectors it can take hours. This classic solution will be faster by several orders of magnitude:
%first make a n x 4 matrix of your vectors;
P=[W,X,Y,Z];
% now your equation looks like this: S = P*m, where m is a 4 x 1 vector
% containing your coefficients ( m = [a;b;c;d] )
% so solution will be simply
m_1 = inv(P'*P)*P'*S;
% or you can use this form
m_2 = (P'*P)\P'*S;
% or even simpler
m_3 = (S'/P')';
% all three solutions should give exactly the same result
% the third solution is the neatest but may not work in every version of matlab
% Your modeled vector will be
Sm = P*m_3; %you can use any of m_1, m_2 or m_3;
% And your residual
R = S-Sm;
If you need to process many vectors, don't use a for loop. For loops are very slow in MATLAB and you should use matrices instead, if possible. S can also be an n x k matrix, where k is the number of vectors you want to process. In this case m will be a 4 x k matrix.
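A minimal sketch of that batched form, assuming S_many is an n x k matrix whose columns are the k vectors to fit (the name is mine, for illustration). Note that, like the single-vector solution above, this ignores the positivity constraint from the question; lsqnonneg handles the constrained case one column at a time.
P = [W,X,Y,Z];             % n x 4 basis matrix, as before
M = (P'*P)\(P'*S_many);    % M is 4 x k; column j holds [a;b;c;d] for S_many(:,j)
Sm_many = P*M;             % n x k matrix of modeled vectors
R_many = S_many - Sm_many; % residuals for every vector, no loop required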
What you are trying to do is similar to the answer I gave at "is there a python (or matlab) function that achieves the minimum MSE between given set of output vector and calculated set of vector?".
The procedure is similar to what you are doing in Excel with Solver. You create an objective function that takes (a, b, c, d) as the input parameters and outputs a measure of fit (mse), and then use fmincon or a similar solver to get the best (a, b, c, d) that minimizes this mse. See the code below (I have no MATLAB at hand to test it, but it should work).
function [a, b, c, d] = optimize_abcd(S, W, X, Y, Z)
%=========================================================
% S is passed in so the nested objective function can see it.
% Second argument to fmincon is the starting point; the second-to-last
% argument is the lower bound, which ensures the solutions are all positive
res = fmincon(@MSE, [0,0,0,0], [], [], [], [], [0,0,0,0], []);
a = res(1);
b = res(2);
c = res(3);
d = res(4);

    function mse = MSE(x_)
        a_ = x_(1);
        b_ = x_(2);
        c_ = x_(3);
        d_ = x_(4);
        S_ = a_*W + b_*X + c_*Y + d_*Z;
        mse = norm(S_ - S);  % 2-norm of the residual; minimizing it also minimizes the MSE
    end
end
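A hedged usage sketch with made-up data (note that S has been added to the argument list above, since the nested MSE function needs it in scope):
% fabricate a test case with known coefficients, then try to recover them
n = 50;
W = rand(n,1); X = rand(n,1); Y = rand(n,1); Z = rand(n,1);
S = 0.5*W + 1.2*X + 0.3*Y + 2.0*Z;
[a, b, c, d] = optimize_abcd(S, W, X, Y, Z)  % should come back close to 0.5, 1.2, 0.3, 2.0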
This is part of a larger project, so I will try to keep only the relevant parts (the variables and my attempt at the calculation).
I want to calculate the root mean squared error between Zi_cubic and Z_actual
RMSE formula: RMSE = sqrt( (1/n) * sum_i (predicted_i - actual_i)^2 )
Given/already established variables
rng('default');
% Set up 2,000 random numbers between -1 & +1 as our x & y values
n=2000;
x = 2*(rand(n,1)-0.5);
y = 2*(rand(n,1)-0.5);
z = x.^5+y.^3;
% Interpolate to a regular grid
d = -1:0.01:1;
[Xi,Yi] = meshgrid(d,d);
Zi_cubic = griddata(x,y,z,Xi,Yi,'cubic');
Z_actual = Xi.^5+Yi.^3;
My attempt at a calculation
My approach is to:
1. Arrange Zi_cubic and Z_actual as column vectors
2. Take the difference
3. Square each element of the difference
4. Sum up all the squared elements from step 3 using nansum
5. Divide by the number of finite elements in the squared difference
6. Take the square root
D1 = reshape(Zi_cubic,[numel(Zi_cubic),1]);
D2 = reshape(Z_actual,[numel(Z_actual),1]);
D3 = D1 - D2;
D4 = D3.^2;
D5 = nansum(D4)
d6 = sum(isfinite(D4))
D6 = D5/d6
D7 = sqrt(D6)
Apparently this is wrong. I'm either misapplying the RMSE formula, or I don't understand what I'm telling MATLAB to do.
Any help would be appreciated. Thanks in advance.
Your RMSE is fine (in my book). The only thing that seems possibly off is the meshgrid and griddata. Your inputs to griddata are vectors and you are asking for a matrix output. That is fine, but you're potentially undersampling your input space. In other words, you are giving n samples as inputs, but perhaps you are expected to give n^2 samples as inputs? Here's some sample code for a smaller n to demonstrate this effect more clearly:
rng('default');
% Set up random numbers between -1 & +1 as our x & y values
n=100; % reduced from 2,000 because scatter is slow to plot
x = 2*(rand(n,1)-0.5);
y = 2*(rand(n,1)-0.5);
z = x.^5+y.^3;
S = 100;
subplot(1,2,1)
scatter(x,y,S,z)
%More data, more accurate ...
[x2,y2] = meshgrid(x,y);
z2 = x2.^5+y2.^3;
subplot(1,2,2)
scatter(x2(:),y2(:),S,z2(:))
The second plot should be a lot cleaner and thus will likely provide a more accurate estimate of Z_actual later on.
I also thought you might be running into some issues with floating point numbers and calculating RMSE but that appears not to be the case. Here's some alternative code which is how I would write RMSE.
d = Zi_cubic(:) - Z_actual(:);
mask = ~isnan(d);
n_valid = sum(mask);
rmse = sqrt(sum(d(mask).^2)/n_valid);
Notice that (:) linearizes the matrix. Also it is useful to try and use better variable names than D1-D7.
In the end though these are just suggestions and your code looks fine.
PS - I'm assuming that you are supposed to be using cubic interpolation as that is another place you could perhaps deviate from what's expected ...
I am trying to understand code written by my predecessor. Rather than using xcorr in MATLAB, she did the following, and apparently it seems to be working. I would really appreciate it if someone could explain what is happening here. She claims the pattern is symmetric by calculating the variable sym in the code below.
close all hidden
t = 0:0.01:2*pi;
x = sin(t);
plot(x,'k')
mu = mean(x)
sigma = std(x)
y = (x-mu)/sigma;
hold on
plot(y,'r')
yrev = y(end:-1:1);  % y reversed
plot(yrev)
sym = sum(y.*yrev/length(y))
plot(y.*yrev/length(y),'r*')
sym is the normalised cross-correlation between y and the reverse of y.
If sym is close to one, y is a symmetric function.
If sym is close to zero, y is an asymmetric function.
If sym is close to minus one, y is an anti-symmetric function.
EDIT: relation with xcorr
You would obtain the same result if you calculate sym as follows:
sym = xcorr(y, yrev, 0, 'coeff')
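To see the three cases in action, a small sketch (my own): over a full period, cosine is symmetric about the middle of the sampled window and sine is anti-symmetric:
t = 0:0.01:2*pi;
ys = cos(t); ys = (ys - mean(ys))/std(ys);  % symmetric test signal
ya = sin(t); ya = (ya - mean(ya))/std(ya);  % anti-symmetric test signal
sum(ys.*ys(end:-1:1))/length(ys)            % close to +1
sum(ya.*ya(end:-1:1))/length(ya)            % close to -1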
I have some data points to which I need to fit an exponential curve of the form
y = B * exp(A/x)
(without the help of Curve Fitting Toolbox).
What I have tried so far is to linearize the model by applying the log, which results in
log(y/B) = A/x
log(y) = A/x + log(B)
I can then write it in the form
Y = AX + B
Now, if I neglect B, then I am able to solve it with
A = pseudoinverse (X) * Y
but I am stuck on the value of B...
Fitting a curve of the form
y = b * exp(a / x)
to some data points (xi, yi) in the least-squares sense is difficult. You cannot use linear least-squares for that, because the model parameters (a and b) do not appear in an affine manner in the equation. Unless you're ready to use some nonlinear-least-squares method, an alternative approach is to modify the optimization problem so that the modified problem can be solved using linear least squares (this process is sometimes called "data linearization"). Let's do that.
Under the assumption that b and the yi's are positive, you can apply the natural logarithm to both sides of the equation:
log(y) = log(b) + a / x
or
a / x + log(b) = log(y)
By introducing a new parameter b2, defined as log(b), it becomes evident that parameters a and b2 appear in a linear (affine, really) manner in the new equation:
a / x + b2 = log(y)
Therefore, you can compute the optimal values of those parameters using least squares; all you have left to do is construct the right linear system and then solve it using MATLAB's backslash operator:
A = [1 ./ x, ones(size(x))];
B = log(y);
params_ls = A \ B;
(I'm assuming x and y are column vectors, here.)
Then, the optimal values (in the least-squares sense) for the modified problem are given by:
a_ls = params_ls(1);
b_ls = exp(params_ls(2));
Although those values are not, in general, optimal for the original problem, they are often "good enough" in practice. If needed, you can always use them as initial guesses for some iterative nonlinear-least-squares method.
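For instance, a minimal sketch of such a refinement using fminsearch (no toolbox required), seeded with the data-linearization estimates a_ls and b_ls from above:
% true least-squares objective for the original model y = b * exp(a ./ x)
sse = @(p) sum((y - p(2)*exp(p(1)./x)).^2);
p_nls = fminsearch(sse, [a_ls, b_ls]);  % refine, starting from the linearized fit
a_nls = p_nls(1);
b_nls = p_nls(2);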
Doing the log transform then using linear regression should do it. Wikipedia has a nice section on how to do this:
http://en.wikipedia.org/wiki/Linear_least_squares_%28mathematics%29#The_general_problem
%MATLAB code for finding the best fit line using least squares method
x = input('enter the points as a matrix, one [x y] point per row: ')
a = [1,x(1,1);1,x(2,1);1,x(3,1)] % forming A of Ax=b (ones column for the intercept)
b = [x(1,2);x(2,2);x(3,2)] % forming b of Ax=b (the y values)
yy = inv(transpose(a)*a)*transpose(a)*b % solving the normal equations, giving [intercept; slope]
%plotting the best fit line
xx=linspace(1,10,50);
y=yy(1)+yy(2)*xx;
plot(xx,y)
%plotting the points(data) for which we found the best fit line
hold on
plot(x(1,1),x(1,2),'x')
plot(x(2,1),x(2,2),'x')
plot(x(3,1),x(3,2),'x')
hold off
I'm sure the code can be cleaned up, but that's the gist of it.
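For what it's worth, a cleaned-up sketch of the same idea that works for any number of points and uses backslash instead of an explicit inverse (x is still a matrix with one [x y] point per row):
a  = [ones(size(x,1),1), x(:,1)];  % design matrix for y = c0 + c1*x
b  = x(:,2);
yy = a \ b;                        % least-squares solution of a*yy = b
xx = linspace(min(x(:,1)), max(x(:,1)), 50);
plot(xx, yy(1) + yy(2)*xx, '-', x(:,1), x(:,2), 'x')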
I have a function V that is computed from two inputs (X,Y). Since the computation is quite demanding, I just perform it on a grid of points and would like to rely on 2d linear interpolation. I now want to invert that function for fixed Y. So basically my starting point is:
X = [1,2,3];
Y = [1,2,3];
V =[3,4,5;6,7,8;9,10,11];
It is of course easy to obtain V at any combination of (X,Y), for instance:
Vq = interp2(X,Y,V,1.8,2.5)
gives
Vq =
8.3000
But how would I find X for given V and Y using 2d linear interpolation? I will have to perform this task many times, so I need a solution that is quick and easy to implement.
Thank you for your help, your effort is highly appreciated.
P.
EDIT using additional info
If x and y do not both have to be found, but one of them is given, this problem reduces to finding a minimum in only one direction (i.e. the x-direction). A simple approach is to formulate this as a problem which can be minimized by an optimization routine such as fminsearch. We therefore define a function f which returns the difference between the value Vq and the result of the interpolation, and try to find the x which minimizes this difference after giving an initial guess x0. Depending on this initial guess, the result will be what we are looking for:
% Which x value to choose if yq and Vq are fixed?
xq = 1.8; % // <-- this one is to be found
yq = 2.5; % // (given)
Vq = interp2(X,Y,V,xq,yq); % // 8.3 (given)
% this function will be minimized (difference between Vq and the result
% of the interpolation)
f = @(x) abs(Vq-interp2(X, Y, V, x, yq));
x0 = 1; % initial guess
x_opt = fminsearch(f, x0) % // solution found: 1.8
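Since the question mentions having to do this many times, the same approach wraps directly in arrayfun; Vq_list and yq_list below are hypothetical vectors of query values (the initial guess is placed mid-grid because interp2 returns NaN outside the grid):
Vq_list = [8.3, 7.1, 5.2];  % hypothetical target values
yq_list = [2.5, 2.0, 1.5];  % corresponding fixed y values
invert_one = @(vq, yq) fminsearch(@(x) abs(vq - interp2(X, Y, V, x, yq)), 2);
x_opt = arrayfun(invert_one, Vq_list, yq_list)  % approximately [1.8, 2.1, 1.7]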
Nras, thank you very much. I did something else in the meantime:
function G_inv = G_inverse(lambda, U, grid_G_inverse, range_x, range_lambda)
    for t = 1:size(U,1)
        for i = 1:size(U,2)
            xf = linspace(range_x(1), range_x(end), 10000);
            [Xf,Yf] = meshgrid(xf, lambda);
            grid_fine = interp2(range_x, range_lambda, grid_G_inverse', Xf, Yf);
            % find the grid point with minimal distance and take its x index
            idx = find(abs(grid_fine-U(t,i)) == min(min(abs(grid_fine-U(t,i)))));
            G_inv(t,i) = xf(idx(1));
        end
    end
end
G_inv is supposed to contain x; U is yq in the above example, and grid_G_inverse contains Vq. range_x and range_lambda are the corresponding vectors for the grid axes. What do you think about this solution, also compared to yours? I would guess mine is faster but less accurate. Speed is, however, a major issue in my code.
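One speed note on this version (a sketch, untested against your data): xf, the meshgrid, and grid_fine do not depend on the loop indices, so they can be hoisted out of the double loop, which should remove most of the cost. Also, xf(idx(1)) indexes the 10000-element xf with a linear index into the full 2-D grid, which only works if grid_fine has a single row; ind2sub recovers the column (x) index explicitly:
function G_inv = G_inverse(lambda, U, grid_G_inverse, range_x, range_lambda)
    % build the fine grid once; it is the same for every (t,i)
    xf = linspace(range_x(1), range_x(end), 10000);
    [Xf,Yf] = meshgrid(xf, lambda);
    grid_fine = interp2(range_x, range_lambda, grid_G_inverse', Xf, Yf);
    G_inv = zeros(size(U));
    for t = 1:size(U,1)
        for i = 1:size(U,2)
            [~, idx] = min(abs(grid_fine(:) - U(t,i)));  % nearest grid value
            [~, col] = ind2sub(size(grid_fine), idx);    % its x (column) index
            G_inv(t,i) = xf(col);
        end
    end
end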