How to fit a curve by a series of segmented lines in Matlab? - matlab

I have a simple loglog curve as above. Is there some function in Matlab which can fit this curve by segmented lines and show the starting and end points of these line segments ? I have checked the curve fitting toolbox in matlab. They seems to do curve fitting by either one line or some functions. I do not want to curve fitting by one line only.
If there is no direct function, any alternative to achieve the same goal is fine with me. My goal is to fit the curve by segmented lines and get locations of the end points of these segments .

First of all, your problem is not called curve fitting. Curve fitting is when you have data, and you find the best function that describes it, in some sense. You, on the other hand, want to create a piecewise linear approximation of your function.
I suggest the following strategy:
Split manually into sections. The section size should depend on the derivative, large derivative -> small section
Sample the function at the nodes between the sections
Find a linear interpolation that passes through the points mentioned above.
Here is an example of a code that does that. You can see that the red line (interpolation) is very close to the original function, despite the small amount of sections. This happens due to the adaptive section size.
function fitLogLog()
x = 2:1000;
y = log(log(x));
%# Find section sizes, by using an inverse of the approximation of the derivative
numOfSections = 20;
indexes = round(linspace(1,numel(y),numOfSections));
derivativeApprox = diff(y(indexes));
inverseDerivative = 1./derivativeApprox;
weightOfSection = inverseDerivative/sum(inverseDerivative);
totalRange = max(x(:))-min(x(:));
sectionSize = weightOfSection.* totalRange;
%# The relevant nodes
xNodes = x(1) + [ 0 cumsum(sectionSize)];
yNodes = log(log(xNodes));
figure;plot(x,y);
hold on;
plot (xNodes,yNodes,'r');
scatter (xNodes,yNodes,'r');
legend('log(log(x))','adaptive linear interpolation');
end

Andrey's adaptive solution provides a more accurate overall fit. If what you want is segments of a fixed length, however, then here is something that should work, using a method that also returns a complete set of all the fitted values. Could be vectorized if speed is needed.
Nsamp = 1000; %number of data samples on x-axis
x = [1:Nsamp]; %this is your x-axis
Nlines = 5; %number of lines to fit
fx = exp(-10*x/Nsamp); %generate something like your current data, f(x)
gx = NaN(size(fx)); %this will hold your fitted lines, g(x)
joins = round(linspace(1, Nsamp, Nlines+1)); %define equally spaced breaks along the x-axis
dx = diff(x(joins)); %x-change
df = diff(fx(joins)); %f(x)-change
m = df./dx; %gradient for each section
for i = 1:Nlines
x1 = joins(i); %start point
x2 = joins(i+1); %end point
gx(x1:x2) = fx(x1) + m(i)*(0:dx(i)); %compute line segment
end
subplot(2,1,1)
h(1,:) = plot(x, fx, 'b', x, gx, 'k', joins, gx(joins), 'ro');
title('Normal Plot')
subplot(2,1,2)
h(2,:) = loglog(x, fx, 'b', x, gx, 'k', joins, gx(joins), 'ro');
title('Log Log Plot')
for ip = 1:2
subplot(2,1,ip)
set(h(ip,:), 'LineWidth', 2)
legend('Data', 'Piecewise Linear', 'Location', 'NorthEastOutside')
legend boxoff
end

This is not an exact answer to this question, but since I arrived here based on a search, I'd like to answer the related question of how to create (not fit) a piecewise linear function that is intended to represent the mean (or median, or some other other function) of interval data in a scatter plot.
First, a related but more sophisticated alternative using regression, which apparently has some MATLAB code listed on the wikipedia page, is Multivariate adaptive regression splines.
The solution here is to just calculate the mean on overlapping intervals to get points
function [x, y] = intervalAggregate(Xdata, Ydata, aggFun, intStep, intOverlap)
% intOverlap in [0, 1); 0 for no overlap of intervals, etc.
% intStep this is the size of the interval being aggregated.
minX = min(Xdata);
maxX = max(Xdata);
minY = min(Ydata);
maxY = max(Ydata);
intInc = intOverlap*intStep; %How far we advance each iteraction.
if intOverlap <= 0
intInc = intStep;
end
nInt = ceil((maxX-minX)/intInc); %Number of aggregations
parfor i = 1:nInt
xStart = minX + (i-1)*intInc;
xEnd = xStart + intStep;
intervalIndices = find((Xdata >= xStart) & (Xdata <= xEnd));
x(i) = aggFun(Xdata(intervalIndices));
y(i) = aggFun(Ydata(intervalIndices));
end
For instance, to calculate the mean over some paired X and Y data I had handy with intervals of length 0.1 having roughly 1/3 overlap with each other (see scatter image):
[x,y] = intervalAggregate(Xdat, Ydat, #mean, 0.1, 0.333)
x =
Columns 1 through 8
0.0552 0.0868 0.1170 0.1475 0.1844 0.2173 0.2498 0.2834
Columns 9 through 15
0.3182 0.3561 0.3875 0.4178 0.4494 0.4671 0.4822
y =
Columns 1 through 8
0.9992 0.9983 0.9971 0.9955 0.9927 0.9905 0.9876 0.9846
Columns 9 through 15
0.9803 0.9750 0.9707 0.9653 0.9598 0.9560 0.9537
We see that as x increases, y tends to decrease slightly. From there, it is easy enough to draw line segments and/or perform some other kind of smoothing.
(Note that I did not attempt to vectorize this solution; a much faster version could be assumed if Xdata is sorted.)

Related

Matlab partial area under the curve

I want to plot the area above and below a particular value in x axis.
The problem i am facing is with discrete values. The code below for instance has an explicit X=10 so i have written it in such a way that i can find the index and calculate the values above and below that particular value but if i want to find the area under the curve above and below 4 this program will now work.
Though in the plot matlab does a spline fitting(or some sort of fitting for connecting discrete values) there is a value for y corresponding to x=4 that matlab computes i cant seem to store or access it.
%Example for Area under the curve and partial area under the curve using Trapezoidal rule of integration
clc;
close all;
clear all;
x=[0,5,10,15,20];% domain
y=[0,25,50,25,0];% Values
LP=log2(y);
plot(x,y);
full = trapz(x,y);% plot of the total area
I=find(x==10);% in our case will be the distance value up to which we want
half = trapz(x(1:I),y(1:I));%Plot of the partial area
How can we find the area under the curve for a value of ie x = 2 or 3 or 4 or 6 or 7 or ...
This is an elaboration of patrik's comment, "first interpolate and then integrate".
For the purpose of this answer I'll assume that the area in question is the area that can be seen in the plot, and since plot connects points by straight lines I assume that linear interpolation is adequate. Moreover, since the trapezoidal rule itself is based on linear interpolation, we only need interpolated values at the beginning and end of the interval.
Starting from the given points
x = [0, 5, 10, 15, 20];
y = [0, 25, 50, 25, 0];
and the integration interval limits, say
xa = 4;
xb = 20;
we first select the data points within the limits
ind = (x > xa) & (x < xb);
xw = x(ind);
yw = y(ind);
and then complete them by interpolation values at the edges:
ya = interp1(x, y, xa);
yb = interp1(x, y, xb);
xw = [xa, xw, xb];
yw = [ya, yw, yb];
Now we can simply apply trapezoidal integration:
area = trapz(xw, yw);
I think that you either need more samples, or to interpolate the data. Another alternative is to use a function handle. Then you need to know the function though. Example using linear interpolation follows.
x0 = [0;5;10;15;20];
y0 = [0,25,50,25,0];
x1 = 0:20;
y1 = interp1(x0,y0,x1,'linear');
xMax = 4;
partInt = trapz(x1(x1<=xMax),y1(x1<=xMax));
Some other kind of interpolation may be suitable, but that is hard to say without more information. Also, this interpolates from the beginning to x. However, I guess figuring out how to change the limits should be easy from here. This solution is different than the former, since it is less depending on the pyramid shape of the data. So to say, it is more general.

Scale Space for solving Sum of Gaussians

I'm attempting to use scale space implementation to fit n Gaussian curves to peaks in a noisy time series digital signal (measuring voltage).
To test it I created the following sample sum of three gaussians with noise (0.2*rand, sorry no picture, i'm new here)
amp = [2; 0.9; 1.3];
mu = [19; 23; 28];
sigma = [4.8; 1.3; 2.5];
x = linspace(1,50,1000);
for n=1:3, y(n,:) = A(n)*exp(-(x-B(n)).^2./(2*C(n)^2)); end
noisysignal = y(1,:) + y(2,:) + y(3,:) + 0.2*rand(1,numel(x))
I found this article http://www.engineering.wright.edu/~agoshtas/GMIP94.pdf posted by user355856 answer to thread "Peak decomposition"!
I believe my code generates the correct result for plotting the zero crossings as a function of the gaussian filter resolution sigma, but I have two issues. The first is that it seems yet another fitting routine would be needed to identify the approximate location of the arch intercepts for approximating the initial peak sigma and mu values. The second is that the edges of the scale space plot have substantial arches that definitely do not correspond to any peak. I'm not sure how to screen these out effectively. Last thing is that is used a spacing of 50 when calculating the second derivative central finite difference since too much more destroyed feature, and to much less results in a forest of zero crossings. Would there be a better way to filter that to control random zero crossings in the gaussian peak tails?
function [crossing] = scalespace(x, y, sigmalimit)
figure; hold on; ylim([0 sigmalimit]);
for sigma = 1:sigmalimit %
yconv = convkernel(sigma, y); %convolve with kernel
xconv = linspace(x(1), x(end), length(yconv));
yconvpp = d2centralfinite(xconv, yconv, 50); % 50 was empirically chosen
num = 0;
for i = 1 : length(yconvpp)-1
if sign(yconvpp(i)) ~= sign(yconvpp(i+1))
crossing(sigma, num+1) = xconv(i);
num = num+1;
end
end
plot(crossing(sigma, crossing(sigma, :) ~= 0),...
sigma*ones(1, numel(crossing(sigma, crossing(sigma, :) ~= 0))), '.');
end
function [yconv] = convkernel(sigma, y)
t = sigma^2;
C = 3; % for kernel truncation
M = C*round(sqrt(t))+1;
window = (-M) : (+M);
G = zeros(1, length(window));
G(:) = (1/(2*pi()*t))*exp(-(window.^2)./(2*t));
yconv = conv(G, y);
This is my first post and I apologize in advance for any issues in style. I'm fairly new to programming, so any advice regarding the programming style or information provided in this question would be much appreciated. I also read through Amro's answer about matlab's GMM function! if anyone feels that would be a more efficient approach to modeling multiple gaussians in a digital signal.
Thank you!

How to make a vector that follows a certain trend?

I have a set of data with over 4000 points. I want to exclude grooves from them, ideally from the point from which they start. The data look for example like this:
The problem with this is the noise I get at the top of the plateaus. I have an idea, in which I would take an average value of the most common within some boundaries (again, ideally sth like the red line here:
and then I would construct a temporary matrix, which would fill up one by one with Y if they are less than this average. If the Y(i) would rise above average, the matrix would find its minima and compare it with the global minima. If the temporary matrix's minima wouldn't be sth like 80% of the global minima, it would be discarded as noise.
I've tried using mean(Y), interpolating and fitting it in a polynomial (the green line) - none of those method would cut it to the point I would be satisfied.
I need this to be extremely robust and it doesn't need to be quick. The top and bottom values can vary a lot, as well as the shape of the plateaus. The groove width is more or less the same.
Do you have any ideas? Again, the point is to extract the values that would make the groove.
How about a median filter?
Let's define some noisy data similar to yours, and plot it in blue:
x = .2*sin((0:9999)/1000); %// signal
x(1000:1099) = x(1000:1099) + sin((0:99)/50*pi); %// noise: spike
x(5000:5199) = x(5000:5199) - sin((0:199)/100*pi); %// noise: wider spike
x = x + .05*sin((0:9999)/10); %// noise: high-freq ripple
plot(x)
Now apply the median filter (using medfilt2 from the Image Processing Toolbox) and plot in red. The parameter k controls the filter memory. It should chosen to be large compared to noise variations, and small compared to signal variations:
k = 500; %// filter memory. Choose as needed
y = medfilt2(x,[1 k]);
hold on
plot(y, 'r', 'linewidth', 2)
In case you don't have the image processing toolbox and can't use medfilt2 a method that's more manual. Skip the extreme values, and do a curve fit with sin1 as curve type. Note that this will only work if the signal is in fact a sine wave!
x = linspace(0,3*pi,1000);
y1 = sin(x) + rand()*sin(100*x).*(mod(round(10*x),5)<3);
y2 = 20*(mod(round(5*x),5) == 0).*sin(20*x);
y = y1 + y2; %// A messy sine-wave
yy = y; %// Store the messy sine-wave
[~, idx] = sort(y);
y(idx(1:round(0.15*end))) = y(idx(round(0.15*end))); %// Flatten out the smallest values
y(idx(round(0.85*end):end)) = y(idx(round(0.85*end)));%// Flatten out the largest values
[foo goodness output] = fit(x.',y.', 'sin1'); %// Do a curve fit
plot(foo,x,y) %// Plot it
hold on
plot(x,yy,'black')
Might not be perfect, but it's a step in the right direction.

Finding 2D area defined by contour lines in Matlab

I am having difficulty with calculating 2D area of contours produced from a Kernel Density Estimation (KDE) in Matlab. I have three variables:
X and Y = meshgrid which variable 'density' is computed over (256x256)
density = density computed from the KDE (256x256)
I run the code
contour(X,Y,density,10)
This produces the plot that is attached. For each of the 10 contour levels I would like to calculate the area. I have done this in some other platforms such as R but am having trouble figuring out the correct method / syntax in Matlab.
C = contourc(density)
I believe the above line would store all of the values of the contours allowing me to calculate the areas but I do not fully understand how these values are stored nor how to get them properly.
This little script will help you. Its general for contour. Probably working for contour3 and contourf as well, with adjustments of course.
[X,Y,Z] = peaks; %example data
% specify certain levels
clevels = [1 2 3];
C = contour(X,Y,Z,clevels);
xdata = C(1,:); %not really useful, in most cases delimters are not clear
ydata = C(2,:); %therefore further steps to determine the actual curves:
%find curves
n(1) = 1; %n: indices where the certain curves start
d(1) = ydata(1); %d: distance to the next index
ii = 1;
while true
n(ii+1) = n(ii)+d(ii)+1; %calculate index of next startpoint
if n(ii+1) > numel(xdata) %breaking condition
n(end) = []; %delete breaking point
break
end
d(ii+1) = ydata(n(ii+1)); %get next distance
ii = ii+1;
end
%which contourlevel to calculate?
value = 2; %must be member of clevels
sel = find(ismember(xdata(n),value));
idx = n(sel); %indices belonging to choice
L = ydata( n(sel) ); %length of curve array
% calculate area and plot all contours of the same level
for ii = 1:numel(idx)
x{ii} = xdata(idx(ii)+1:idx(ii)+L(ii));
y{ii} = ydata(idx(ii)+1:idx(ii)+L(ii));
figure(ii)
patch(x{ii},y{ii},'red'); %just for displaying purposes
%partial areas of all contours of the same plot
areas(ii) = polyarea(x{ii},y{ii});
end
% calculate total area of all contours of same level
totalarea = sum(areas)
Example: peaks (by Matlab)
Level value=2 are the green contours, the first loop gets all contour lines and the second loop calculates the area of all green polygons. Finally sum it up.
If you want to get all total areas of all levels I'd rather write some little functions, than using another loop. You could also consider, to plot just the level you want for each calculation. This way the contourmatrix would be much easier and you could simplify the process. If you don't have multiple shapes, I'd just specify the level with a scalar and use contour to get C for only this level, delete the first value of xdata and ydata and directly calculate the area with polyarea
Here is a similar question I posted regarding the usage of Matlab contour(...) function.
The main ideas is to properly manipulate the return variable. In your example
c = contour(X,Y,density,10)
the variable c can be returned and used for any calculation over the isolines, including area.

Matlab plot in histogram

Assume y is a vector with random numbers following the distribution f(x)=sqrt(4-x^2)/(2*pi). At the moment I use the command hist(y,30). How can I plot the distribution function f(x)=sqrt(4-x^2)/(2*pi) into the same histogram?
Instead of normalizing numerically, you could also do it by finding a theoretical scaling factor as follows.
nbins = 30;
nsamples = max(size(y));
binsize = (max(y)-min(y)) / nsamples
hist(y,nbins)
hold on
x1=linspace(min(y),max(y),100);
scalefactor = nsamples * binsize
y1=scalefactor * sqrt(4-x^2)/(2*pi)
plot(x1,y1)
Update: How it works.
For any dataset that is large enough to give a good approximation to the pdf (call it f(x)), the integral of f(x) over this domain will be approximately unity. However we know that the area under any histogram is precisely equal to the total number of samples times the bin-width.
So a very simple scale factor to bring the pdf into line with the histogram is Ns*Wb, the total number of sample point times the width of the bins.
Let's take an example of another distribution function, the standard normal. To do exactly what you say you want, you do this:
nRand = 10000;
y = randn(1,nRand);
[myHist, bins] = hist(y,30);
pdf = normpdf(bins);
figure, bar(bins, myHist,1); hold on; plot(bins,pdf,'rx-'); hold off;
This is probably NOT what you actually want though. Why? You'll notice that your density function looks like a thin line at the bottom of your histogram plot. This is because a histogram is counts of numbers in bins, while a density function is normalized to integrate to one. If you have hundreds of items in a bin, there is no way that the density function will match that in scale, so you have a scaling or normalization problem. Either you have to normalize the histogram, or plot a scaled distribution function. I prefer to scale the distribution function so that my counts are sensical when I look at the histogram:
normalizedpdf = pdf/sum(pdf)*sum(myHist);
figure, bar(bins, myHist,1); hold on; plot(bins,normalizedpdf,'rx-'); hold off;
Your case is the same, except you'll use the function f(x) you specified instead of the normpdf command.
Let me add another example to the mix:
%# some normally distributed random data
data = randn(1e3,1);
%# histogram
numbins = 30;
hist(data, numbins);
h(1) = get(gca,'Children');
set(h(1), 'FaceColor',[.8 .8 1])
%# figure out how to scale the pdf (with area = 1), to the area of the histogram
[bincounts,binpos] = hist(data, numbins);
binwidth = binpos(2) - binpos(1);
histarea = binwidth*sum(bincounts);
%# fit a gaussian
[muhat,sigmahat] = normfit(data);
x = linspace(binpos(1),binpos(end),100);
y = normpdf(x, muhat, sigmahat);
h(2) = line(x, y*histarea, 'Color','b', 'LineWidth',2);
%# kernel estimator
[f,x,u] = ksdensity( data );
h(3) = line(x, f*histarea, 'Color','r', 'LineWidth',2);
legend(h, {'freq hist','fitted Gaussian','kernel estimator'})