I am using the MATLAB curve-fitting tool cftool to fit my data. The issue is that the y values vary over a very large range (strongly decreasing) with respect to the x-axis. A sample is given below:
x y
0.1 237.98
1 25.836
10 3.785
30 1.740
100 0.804
300 0.431
1000 0.230
2000 0.180
The fitted form is y = a/x^b + c/x^d, with a, b, c, and d as constants. The fit from MATLAB is quite accurate for large y values (i.e., at the lower x range), with less than 0.1% deviation. However, at higher x values the accuracy of the fit is poor (around 11% deviation). I would also like to include the % deviation in the curve-fitting iteration to make sure the data is captured closely. A plot of the fit and the data is given for reference.
Can anyone suggest better ways to fit the data?
The most common way to fit a curve is to do a least squares fit, which minimizes the sum of the square differences between the data and the fit. This is why your fit is tighter when y is large: an 11% deviation on a value of 0.18 is only a squared error of 0.000392, while a 0.1% deviation on a value of 240 is a squared error of 0.0576, much more significant.
If what you care about is relative deviations rather than absolute (squared) errors, then you can either reformulate the fitting algorithm or transform your data in a clever way. The second option is a common and useful tool to know.
One way to do this in your case is to fit log(y) instead of y. This has the effect of making errors on the small y values count much more heavily:
data = [0.1 237.98
1 25.836
10 3.785
30 1.740
100 0.804
300 0.431
1000 0.230
2000 0.180];
x = data(:,1);
y = data(:,2);
% Set up fittype and options.
ft = fittype( 'a/x^b + c/x^d', 'independent', 'x', 'dependent', 'y' );
opts = fitoptions( 'Method', 'NonlinearLeastSquares' );
opts.Display = 'Off';
opts.StartPoint = [0.420712466925742 0.585539298834167 0.771799485946335 0.706046088019609];
%% Usual least-squares fit
[fitresult] = fit( x, y, ft, opts );
yhat = fitresult(x);
% Plot fit with data.
figure
semilogy( x, y );
hold on
semilogy( x, yhat);
deviation = abs(y - yhat) ./ y * 100   % percent deviation of the plain least-squares fit
%% Log-transformed fit: fit the same functional form to log(y), then undo the transform
[fitresult] = fit( x, log(y), ft, opts );
yhat = exp(fitresult(x));
% Plot fit with data.
figure
semilogy( x, y );
hold on
semilogy( x, yhat );
deviation = abs(y - yhat) ./ y * 100   % percent deviation of the log-transformed fit
One approach would be to fit to the lowest sum-of-squared relative error, rather than the lowest sum-of-squared absolute error. When I use the data posted in your question, fitting to the lowest sum-of-squared relative error yields roughly +/- 4 percent error, so this may be a useful option. To help you decide whether to consider this approach, here are the coefficients I determined from your posted data using this method:
a = 2.2254477037465399E+01
b = 1.0038013513610324E+00
c = 4.1544917994119190E+00
d = 4.2684956973959676E-01
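The answer above doesn't say which solver was used, but one simple way to get a relative-error objective out of MATLAB's own fit function is to weight each point by 1/y^2, since the weighted squared error then equals the squared relative error. A minimal sketch, reusing the x, y, and fittype from the earlier answer (the start point is just a rough guess near the coefficients quoted above):
% Weighted least squares: minimizing sum( w .* (y - yhat).^2 ) with w = 1./y.^2
% is the same as minimizing the sum of squared relative errors.
ft = fittype( 'a/x^b + c/x^d', 'independent', 'x', 'dependent', 'y' );
opts = fitoptions( 'Method', 'NonlinearLeastSquares' );
opts.StartPoint = [20 1 4 0.4];            % rough guess near the coefficients quoted above
opts.Weights = 1 ./ y.^2;                  % down-weight the large y values
fitresultRel = fit( x, y, ft, opts );
relDeviation = abs(y - fitresultRel(x)) ./ y * 100   % percent deviation at each point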
Related
I am batch processing thousands of data sets. Sometimes the peak positions and magnitudes change drastically, and the program struggles to find these peaks with a single start-point value. I have to divide my data into smaller batches to change the start-point values, which is time-consuming.
Is it possible to try various start point values and select the one with the best rsquare?
ft = fittype( 'y0 + a*exp(-((x-xa)/(wa))^2)', 'independent', 'x', 'dependent', 'y' );
opts = fitoptions( 'Method', 'NonlinearLeastSquares' );
opts.Display = 'Off';
opts.StartPoint = [10 10 10 0]; % this is a, wa, xa and y0 - from the equation
[fitresult, gof] = fit(xData, yData, ft, opts);
alpha = gof.rsquare; % extract goodness of fit
if alpha < 0.98 % if rsquare (goodness of fit) is not good enough
    for x = 100:10:500 % these numbers are not set in stone - can be any number
        for y = 10:1:50
            opts.StartPoint = [10+x 10 10+y 0]; % tweak the start point values for the fit
            [fitresult, gof] = fit(xData, yData, ft, opts); % fit again
        end
    end
end
Then select the start point with the best rsquare and plot the results.
% plot
f = figure('Name', 'Gauss','Pointer','crosshair');
h = plot(fitresult, xData, yData, '-o');
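A minimal sketch of one way to do this, looping over a grid of start points and keeping whichever fit gives the highest rsquare (the grid ranges and the 0.98 threshold below are placeholders, not values from the post):
% Grid search over start points; keep the fit with the best rsquare
bestRsquare = -inf;
for xa0 = 100:10:500                        % candidate peak positions - adjust to your data
    for wa0 = 10:10:50                      % candidate peak widths - adjust to your data
        opts.StartPoint = [10 wa0 xa0 0];   % [a wa xa y0], same order as above
        [thisFit, thisGof] = fit(xData, yData, ft, opts);
        if thisGof.rsquare > bestRsquare
            bestRsquare = thisGof.rsquare;
            fitresult = thisFit;
            gof = thisGof;
        end
    end
    if bestRsquare > 0.98                   % good enough - stop searching
        break
    end
end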
If there are difficulties in guessing, I suggest using a different method which is not iterative and doesn't need guessed parameter values to start the numerical calculation.
Since I have no representative data for your problem, I cannot check whether the method proposed below is suitable in your case. This depends on the scatter of the data and on the distribution of the points.
Try it and see. If the result is not correct, please let me know.
A numerical example with highly scattered data is shown below. With this example you can check if the method is correctly implemented.
NOTE: This method can be used to obtain approximate values of the parameters, which can then be used as "guessed" starting values in the usual non-linear regression software.
For information: the method is a linear regression with respect to an integral equation of which the Gaussian function is a solution.
For the general principle, see : https://fr.scribd.com/doc/14674814/Regressions-et-equations-integrales
Suppose I have a vector t = [0 0.1 0.9 1 1.4], and a vector x = [1 3 5 2 3]. How can I compute the derivative of x with respect to time that has the same length as the original vectors?
I should not use any symbolic operations. The command diff(x)./diff(t) does not produce a vector of the same length. Should I first interpolate the x(t) function and then take its derivative?
Different approaches exist to calculate the derivative at the same points as your initial data:
Finite differences: Use a central difference scheme at your inner points and a forward/backward scheme at your first/last point
or
Curve fitting: Fit a curve through your points, calculate the derivative of this fitted function and sample them at the same points as the original data. Typical fitting functions are polynomials or spline functions.
Note that the curve fitting approach gives better results, but needs more tuning options and is slower (~100x).
Demonstration
As an example, I will calculate the derivative of a sine function:
t = 0:0.1:1;
y = sin(t);
Its exact derivative is well known:
dy_dt_exact = cos(t);
The derivative can be calculated approximately as follows:
Finite differences:
dy_dt_approx = zeros(size(y));
dy_dt_approx(1) = (y(2) - y(1))/(t(2) - t(1)); % forward difference
dy_dt_approx(end) = (y(end) - y(end-1))/(t(end) - t(end-1)); % backward difference
dy_dt_approx(2:end-1) = (y(3:end) - y(1:end-2))./(t(3:end) - t(1:end-2)); % central difference
or
Polynomial fitting:
p = polyfit(t,y,5); % fit fifth order polynomial
dp = polyder(p); % calculate derivative of polynomial
The results can be visualised as follows:
figure('Name', 'Derivative')
hold on
plot(t, dy_dt_exact, 'DisplayName', 'exact');
plot(t, dy_dt_approx, 'DisplayName', 'finite difference');
plot(t, polyval(dp, t), 'DisplayName', 'polynomial');
legend show
figure('Name', 'Error')
hold on
plot(t, abs(dy_dt_approx - dy_dt_exact)/max(dy_dt_exact), 'DisplayName', 'finite difference');
plot(t, abs(polyval(dp, t) - dy_dt_exact)/max(dy_dt_exact), 'DisplayName', 'polynomial');
legend show
The first graph shows the derivatives themselves and the second plots the relative errors made by both methods.
Discussion
One clearly sees that the curve fitting method gives better results than the finite differences, but it is ~100x slower. The curve fitting method has a relative error of order 10^-5. Note that the finite-differences approach becomes better when your data is sampled more densely or you use a higher-order scheme. The disadvantage of the curve fitting approach is that one has to choose a good polynomial order. Spline functions may be better suited in general.
A dataset sampled 10x more densely, i.e. t = 0:0.01:1;, results in the following graphs:
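As a rough sketch of the spline alternative mentioned in the discussion (my addition, not part of the original answer): build a cubic spline with spline, differentiate its piecewise-polynomial coefficients, and evaluate at the original points.
pp = spline(t, y);                                % cubic spline in piecewise-polynomial form
[breaks, coefs, ~, order, dim] = unmkpp(pp);
dcoefs = coefs(:, 1:order-1) .* (order-1:-1:1);   % differentiate each cubic piece
dpp = mkpp(breaks, dcoefs, dim);
dy_dt_spline = ppval(dpp, t);                     % spline-based derivative at the original points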
I'm trying to fit an exponential curve to data sets containing damped harmonic oscillations. The data is a bit complicated in the sense that the sinusoidal oscillations contain many frequencies as seen below:
I need to find the rate of decay in the data. The method I am using can be found here. It works by taking the log of the y values above the steady-state value and then using:
lsqlin(A,y1(:),-A,-y1(:),[],[],[],[],[],optimset('algorithm','active-set','display','off'))
to fit it.
However, this results in the following data fits:
I tried using a linear regression fit, which obviously didn't work because it took the average. I also tried RANSAC, thinking that there is more data near the peaks. It worked a bit better than the linear regression, but the method is flawed, as there are times when more points exist in the wrong regions.
Does anyone know of a good method to just fit the peaks for this data?
Currently, I'm thinking of dividing the 500 data points into regions of 10 points each and finding the largest value in each region. At the end, I should have 50 points that I can fit using any of the exponential fitting methods mentioned above. What do you think of this method?
Thought I'd give everyone an update on potential solutions that may work. As mentioned earlier, the data is complicated by the varying sinusoidal frequencies, so certain methods may not work because of this. The methods listed below can be good depending on the data and the frequencies involved.
First off, I assume that the data has the form:
y = average + b*e^-(c*x)
In my case, the average is 290 so we have:
y = 290 + b*e^-(c*x)
With that being said, let's dive into the different methods that I tried:
findpeaks() Method
This is the method that Alexander Büse suggested. It's a pretty good method for most data, but for my data, since there are multiple sinusoidal frequencies, it gets the wrong peaks. The red x's show the peaks.
% Find Peaks Method
avg = 290;        % steady-state value (defined here so the snippet runs on its own)
ind = (y > avg);  % only consider points above the steady state
[max_num,max_ind] = findpeaks(y(ind));
plot(max_ind,max_num,'x','Color','r'); hold on;
x1 = max_ind;
y1 = log(max_num-290);
coeffs = polyfit(x1,y1,1)
b = exp(coeffs(2));
c = coeffs(1);
RANSAC
RANSAC is good if you have most of your data at the peaks. You see that in mine, because of the multiple frequencies, more peaks exist near the top. However, the problem with my data is that not all the data sets are like this. Hence, it occasionally worked.
% RANSAC Method
ind = (y > avg);
x1 = x(ind);
y1 = log(y(ind) - avg);
iterNum = 300;
thDist = 0.5;
thInlrRatio = .1;
[t,r] = ransac([x1;y1'],iterNum,thDist,thInlrRatio);
k1 = -tan(t);
b1 = r/cos(t);
% plot(x1,k1*x1+b1,'r'); hold on;
b = exp(b1);
c = k1;
Lsqlin Method
This method is the one used here. It uses lsqlin to constrain the system. However, it seems to ignore the data in the middle. Depending on your data set, this could work really well, as it did for the person in the original post.
% Lsqlin Method
avg = 290;
ind = (y > avg);
x1 = x(ind);
y1 = log(y(ind) - avg);
A = [ones(numel(x1),1),x1(:)]*1.00;
coeffs = lsqlin(A,y1(:),-A,-y1(:),[],[],[],[],[],optimset('algorithm','active-set','display','off'));
b = exp(coeffs(1)); % intercept term (the first column of A is the column of ones)
c = coeffs(2);      % slope term
Find Peaks in Period
This is the method I mentioned in my post, where I get the peak in each region. This method works pretty well, and from it I realized that my data may not actually have a perfect exponential fit. We see that it is unable to fit the large peaks at the beginning. I was able to make this a bit better by only using the first 150 data points and ignoring the steady-state data points. Here I found the peak every 25 data points.
% Incremental Method 2 Unknowns
x1 = [];
y1 = [];
max_num=[];
max_ind=[];
incr = 25;
for i=1:floor(size(y,1)/incr)
    [max_num(end+1),max_ind(end+1)] = max(y(1+incr*(i-1):incr*i));
    max_ind(end) = max_ind(end) + incr*(i-1);
    if max_num(end) > avg
        x1(end+1) = max_ind(end);
        y1(end+1) = log(max_num(end)-290);
    end
end
plot(max_ind,max_num,'x','Color','r'); hold on;
coeffs = polyfit(x1,y1,1)
b = exp(coeffs(2));
c = coeffs(1);
Using all 500 data points:
Using the first 150 data points:
Find Peaks in Period With b Constrained
Since I want it to start at the first peak, I constrained the b value. I know the system is y = 290 + b*e^(-c*x) and I constrain it such that b = y(1) - 290. By doing so, I just need to solve for c, where c = (log(y-290) - log(b))/x. I can then take the mean or median of c. This method is quite good as well; it doesn't fit the values near the end quite as well, but that isn't as big a deal since the change there is minimal.
% Incremental Method 1 Unknown (b is constrained y(1)-290 = b)
b = y(1) - 290;
c = [];
max_num=[];
max_ind=[];
incr = 25;
for i=1:floor(size(y,1)/incr)
    [max_num(end+1),max_ind(end+1)] = max(y(1+incr*(i-1):incr*i));
    max_ind(end) = max_ind(end) + incr*(i-1);
    if max_num(end) > avg
        c(end+1) = (log(max_num(end)-290)-log(b))/max_ind(end);
    end
end
c = mean(c); % Or median(c) works just as good
Here I take the peak for every 25 data points and then take the mean of c
Here I take the peak for every 25 data points and then take the median of c
Here I take the peak for every 10 data points and then take the mean of c
If the main goal is to extract the damping parameter from the fit, maybe you want to consider fitting directly a damped sine curve to your data. Something like this (created with the curve fitting tool):
[xData, yData] = prepareCurveData( x, y );
ft = fittype( 'a + sin(b*x - c).*exp(d*x)', 'independent', 'x', 'dependent', 'y' );
opts = fitoptions( 'Method', 'NonlinearLeastSquares' );
opts.Display = 'Off';
opts.StartPoint = [1 0.285116122712545 0.805911873245316 0.63235924622541];
[fitresult, gof] = fit( xData, yData, ft, opts );
plot( fitresult, xData, yData );
Especially since some of your example data really don't have many data points in the interesting region (above the noise).
If, however, you really need to fit directly to the maxima of the experimental data, you could use the findpeaks function to select only the maxima and then fit to them. You may want to play a bit with the MinPeakProminence parameter to adjust it to your needs.
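A minimal sketch of that second option (the MinPeakProminence value is a placeholder to tune, the 290 steady-state level is taken from the numbers quoted in the question, and locs here are sample indices rather than physical x values):
% Keep only sufficiently prominent peaks, then fit the decay through them in log space
[pks, locs] = findpeaks(y, 'MinPeakProminence', 5);
keep = pks > 290;                                       % only peaks above the steady state
coeffs = polyfit(locs(keep), log(pks(keep) - 290), 1);  % straight line: log(y-290) = log(b) - c*x
b = exp(coeffs(2));
c = -coeffs(1);                                         % decay rate in y = 290 + b*exp(-c*x)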
I wish to fit a polynomial curve (degree 4 or 5) to my data. I did it with Excel and I get coefficients around 10^-13 for the 5th-order term, 10^-9 for the 4th-order term and 10^-5 for the 3rd-order term...
I would like to constrain all the coefficients so that none is lower than 10^-2. The curve won't fit as well, but that is OK.
How can I do that with the polyfit function?
And then, from a mathematical point of view: does it make sense to constrain coefficients? Or is it useless, and am I better off keeping a second-degree polyfit (which has coefficients lower than 10^-2)?
The reason I'm asking: I'm doing some research, and from a physical point of view it is interesting to test the 5th-degree polyfit, but I can't use coefficients lower than 10^-2.
Thank you
Use fit rather than polyfit
%What is the degree of the polynomial (quartic)
polyDegree = 4;
%This sets up the options
opts = fitoptions( 'Method', 'LinearLeastSquares' );
%Constrain every coefficient, from x^polyDegree down to the constant term, to be at least 10^-2
opts.Lower = repmat(1E-2, 1, polyDegree + 1);
opts.Upper = inf(1, polyDegree + 1);
%Do the fit using the specified polynomial degree.
[fitresult, gof] = fit( x, y, ['poly', num2str(polyDegree)] , opts );
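Two usage notes (my addition): fit expects x and y to be column vectors, and the constrained coefficients can be read back from the fit object afterwards, for example:
constrainedCoeffs = coeffvalues(fitresult)   % [p1 ... p(polyDegree+1)], highest-order term first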
I am trying to use Matlab's nlinfit function to estimate the best fitting Gaussian for x,y paired data. In this case, x is a range of 2D orientations and y is the probability of a "yes" response.
I have copied @norm_funct from relevant posts and I'd like to return a smoothed, normal distribution that best approximates the observed data in y, and returns the magnitude, mean and SD of the best fitting pdf. At the moment, the fitted function appears to be incorrectly scaled and less than smooth - any help much appreciated!
x = -30:5:30;
y = [0,0.20,0.05,0.15,0.65,0.85,0.88,0.80,0.55,0.20,0.05,0,0];
% plot raw data
figure(1)
plot(x, y, ':rs');
axis([-35 35 0 1]);
% initial parameter guesses (based on plot)
initGuess(1) = max(y); % amplitude
initGuess(2) = 0; % mean centred on 0 degrees
initGuess(3) = 10; % SD in degrees
% equation for Gaussian distribution
norm_func = @(p,x) p(1) .* exp(-((x - p(2))/p(3)).^2);
% use nlinfit to fit Gaussian using Least Squares
[bestfit,resid]=nlinfit(y, x, norm_func, initGuess);
% plot function
xFine = linspace(-30,30,100);
figure(2)
plot(x, y, 'ro', x, norm_func(xFine, y), '-b');
Many thanks
If your data actually represent probability estimates which you expect come from normally distributed data, then fitting a curve is not the right way to estimate the parameters of that normal distribution. There are different methods of different sophistication; one of the simplest is the method of moments, which means you choose the parameters such that the moments of the theoretical distribution match those of your sample. In the case of the normal distribution, these moments are simply mean and variance (or standard deviation). Here's the code:
% normalize y to be a probability (sum = 1)
p = y / sum(y);
% compute weighted mean and standard deviation
m = sum(x .* p);
s = sqrt(sum((x - m) .^ 2 .* p));
% compute theoretical probabilities
xs = -30:0.5:30;
pth = normpdf(xs, m, s);
% plot data and theoretical distribution
plot(x, p, 'o', xs, pth * 5)
The result shows a decent fit:
You'll notice the factor 5 in the last line. This is due to the fact that you don't have probability (density) estimates for the full range of values, but only at points spaced 5 apart. In my treatment I assumed that they correspond to something like an integral over the probability density, e.g. over an interval [x - 2.5, x + 2.5], which can be roughly approximated by multiplying the density in the middle by the width of the interval. I don't know if this interpretation is correct for your data.
Your data follow a Gaussian curve and you describe them as probabilities. Are these numbers (y) your raw data – or did you generate them from e.g. a histogram over a larger data set? If the latter, the estimate of the distribution parameters could be improved by using the original full data.
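If the raw data are indeed available, a minimal sketch of that last suggestion (rawSamples below is a hypothetical vector of the original, unbinned measurements, not something from the question; requires the Statistics and Machine Learning Toolbox):
% rawSamples: hypothetical vector of the original (unbinned) measurements
[muHat, sigmaHat] = normfit(rawSamples);      % estimates of the mean and SD
xs = -30:0.5:30;
plot(xs, normpdf(xs, muHat, sigmaHat))        % fitted normal density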