What algorithm can I use to recognize the line in this scatterplot? - scipy

I'm creating a program to compare audio files which uses a similar algorithm to the one described here http://www.ee.columbia.edu/~dpwe/papers/Wang03-shazam.pdf. I am plotting the times of matches between two songs being compared and finding the line of least squares for the plot. href=http://imgur.com/fGu7jhX&yOeMSK0 is an example plot of matching files. The plot is too messy and the least squares regression line does not produce a high correlation coefficient even though there is an obvious line in the graph. What other algorithm can I use to recognize this line?

This is an interesting question, but it's been pretty quiet. Maybe this answer
will trigger some more activity.
For identifying lines with arbitrary slopes and intercepts within a collection
of points, the Hough transform would be a good place to start. For your audio
application, however, it looks like the slope should always be 1, so you don't
need the full generality of the Hough transform.
Instead, you can think of the problem as one of clustering the differences x - y, where x and y are the vectors holding the x and y coordinates of the points.
One approach would be to compute a histogram of x - y. Points that are close to lying in the same line with slope 1 will have differences in the same bin in the histogram. The bin with the largest count corresponds to the largest collection of points that are approximately aligned. An issue to deal with in this approach is choosing the boundaries of the histogram bins. A bad choice could result in points that should be grouped together being split into neighboring bins.
A simple brute-force approach is to imagine a diagonal window with a given width, sliding left to right across the (x,y) plane. The best candidate for a line corresponds to the position of the window that contains the most points. This is similar to a histogram of x - y, but instead of having a collection of disjoint bins, there are overlapping bins, one for each point. All the bins have the same width, and each point determines the left edge of a bin.
The function count_diag_groups in the code below does that computation. For each point, it counts how many points are in the diagonal window when the left edge of the window is on that point. The best candidate for a line is the window with the most points. Here's the plot generated by the script. The top is the scatter plot of the data. The bottow is the same scatter plot, with the best candidate points highlighted.
A nice feature of this method is that there is only one parameter, the window width. A not-so-nice feature is that it has time complexity O(n**2), where n is the number of points. There are surely algorithms with better time complexity that could do something similar; the article that you link to discusses this. To judge the quality of an alternative, however, will require more concrete specifications of how "good" or robust the line identification must be.
import numpy as np
import matplotlib.pyplot as plt
def count_diag_groups(x, y, width):
"""
Returns a list of arrays. The length of the list is the same
as the length of x. The k-th array holds the indices into x
(and y) of a set of points that are in a "diagonal" window with
the given width whose left edge includes the point (x[k], y[k]).
"""
d = x - y
result = []
for i in range(d.size):
delta = d - d[i]
neighbors = np.where((delta >= 0) & (delta <= width))[0]
result.append(neighbors)
return result
def generate_demo_data():
# Generate some data.
np.random.seed(123)
xmin = 0
xmax = 100
ymin = 0
ymax = 25
nrnd = 175
xrnd = xmin + (xmax - xmin)*np.random.rand(nrnd)
yrnd = ymin + (ymax - ymin)*np.random.rand(nrnd)
n = 25
xx = xmin + 0.1*(xmax - xmin) + ymax*np.random.rand(n)
yy = (xx - xx.min()) + 0.2*np.random.randn(n)
x = np.concatenate((xrnd, xx))
y = np.concatenate((yrnd, yy))
return x, y
def plot_result(x, y, width, selection):
xmin = x.min()
xmax = x.max()
ymin = y.min()
ymax = y.max()
xsel = x[selection]
ysel = y[selection]
# Plot...
plt.figure(1)
plt.clf()
ax = plt.subplot(2,1,1)
plt.plot(x, y, 'o', mfc='b', mec='b', alpha=0.5)
plt.xlim(xmin - 1, xmax + 1)
plt.ylim(ymin - 1, ymax + 1)
plt.subplot(2,1,2, sharex=ax, sharey=ax)
plt.plot(x, y, 'o', mfc='b', mec='b', alpha=0.5)
plt.plot(xsel, ysel, 'o', mfc='w', mec='w')
plt.plot(xsel, ysel, 'o', mfc='r', mec='r', alpha=0.65)
xi = np.array([xmin, xmax])
d = x - y
yi1 = xi - d[imax]
yi2 = yi1 - width
plt.plot(xi, yi1, 'r-', alpha=0.25)
plt.plot(xi, yi2, 'r-', alpha=0.25)
plt.xlim(xmin - 1, xmax + 1)
plt.ylim(ymin - 1, ymax + 1)
plt.show()
if __name__ == "__main__":
x, y = generate_demo_data()
# Find a selection of points that are close to being aligned
# with a slope of 1.
width = 0.75
r = count_diag_groups(x, y, width)
# Find the largest group.
sz = np.array(list(len(f) for f in r))
imax = sz.argmax()
# k holds the indices of the selected points.
selection = r[imax]
plot_result(x, y, width, selection)

This looks like an excellent example of a task for Random Sampling Consensus (RANSAC).
The Wikipedia article even uses your problem as an example!
The rough outline is something like this.
Select 2 random points in your data, fit a line to them
For each other point, find the distance to that line. If the distance is below a threshold, it is part of the inlier set.
If the final inlier set for this particular line is larger than the previously best line, then keep the new line as the best candidate.
If the decided number of iterations is reached, return the best line found, else go back to 1 and choose new random points.
Check the Wikipedia article for more information.

Related

How to create random points alongside a complex polyline?

I would like to populate random points on a 2D plot, in such a way that the points fall in proximity of a "C" shaped polyline.
I managed to accomplish this for a rather simple square shaped "C":
This is how I did it:
% Marker color
c = 'k'; % Black
% Red "C" polyline
xl = [8,2,2,8];
yl = [8,8,2,2];
plot(xl,yl,'r','LineWidth',2);
hold on;
% Axis settings
axis equal;
axis([0,10,0,10]);
set(gca,'xtick',[],'ytick',[]);
step = 0.05; % Affects point quantity
coeff = 0.9; % Affects point density
% Top Horizontal segment
x = 2:step:9.5;
y = 8 + coeff*randn(size(x));
scatter(x,y,'filled','MarkerFaceColor',c);
% Vertical segment
y = 1.5:step:8.5;
x = 2 + coeff*randn(size(y));
scatter(x,y,'filled','MarkerFaceColor',c);
% Bottom Horizontal segment
x = 2:step:9.5;
y = 2 + coeff*randn(size(x));
scatter(x,y,'filled','MarkerFaceColor',c);
hold off;
As you can see in the code, for each segment of the polyline I generate the scatter point coordinates artificially using randn.
For the previous example, splitting the polyline into segments and generating the points manually is fine. However, what if I wanted to experiment with a more sophisticated "C" shape like this one:
Note that with my current approach, when the geometric complexity of the polyline increases so does the coding effort.
Before going any further, is there a better approach for this problem?
A simpler approach, which generalizes to any polyline, is to run a loop over the segments. For each segment, r is its length, and m is the number of points to be placed along that segment (it closely corresponds to the prescribed step size, with slight deviation in case the step size does not evenly divide the length). Note that both x and y are subject to random perturbation.
for n = 1:numel(xl)-1
r = norm([xl(n)-xl(n+1), yl(n)-yl(n+1)]);
m = round(r/step) + 1;
x = linspace(xl(n), xl(n+1), m) + coeff*randn(1,m);
y = linspace(yl(n), yl(n+1), m) + coeff*randn(1,m);
scatter(x,y,'filled','MarkerFaceColor',c);
end
Output:
A more complex example, using coeff = 0.4; and xl = [8,4,2,2,6,8];
yl = [8,6,8,2,4,2];
If you think this point cloud is too thin near the endpoints, you can artifically lengthen the first and last segments before running the loop. But I don't see the need: it makes sense that the fuzzied curve is thinning out at the extremities.
With your original approach, two places with the same distance to a line can sampled with a different probability, especially at the corners where two lines meet. I tried to fix this rephrasing the random experiment. The random experiment my code does is: "Pick a random point. Accept it with a probability of normpdf(d)<rand where d is the distance to the next line". This is a rejection sampling strategy.
xl = [8,4,2,2,6,8];
yl = [8,6,8,2,4,2];
resolution=50;
points_to_sample=200;
step=.5;
sigma=.4; %lower value to get points closer to the line.
xmax=(max(xl)+2);
ymax=(max(yl)+2);
dist=zeros(xmax*resolution+1,ymax*resolution+1);
x=[];
y=[];
for n = 1:numel(xl)-1
r = norm([xl(n)-xl(n+1), yl(n)-yl(n+1)]);
m = round(r/step) + 1;
x = [x,round(linspace(xl(n)*resolution+1, xl(n+1)*resolution+1, m*resolution))];
y = [y,round(linspace(yl(n)*resolution+1, yl(n+1)*resolution+1, m*resolution))];
end
%dist contains the lines:
dist(sub2ind(size(dist),x,y))=1;
%dist contains the normalized distance of each rastered pixel to the line.
dist=bwdist(dist)/resolution;
pseudo_pdf=normpdf(dist,0,sigma);
%scale up to have acceptance rate of 1 for most likely pixels.
pseudo_pdf=pseudo_pdf/max(pseudo_pdf(:));
sampled_points=zeros(0,2);
while size(sampled_points,1)<points_to_sample
%sample a random point
sx=rand*xmax;
sy=rand*ymax;
%accept it if criteria based on normal distribution matches.
if pseudo_pdf(round(sx*resolution)+1,round(sy*resolution)+1)>rand
sampled_points(end+1,:)=[sx,sy];
end
end
plot(xl,yl,'r','LineWidth',2);
hold on
scatter(sampled_points(:,1),sampled_points(:,2),'filled');

Plotting a linear decision boundary

I have a set of data points (40 x 2), and I've derived the formula for the decision boundary which ends up like this :
wk*X + w0 = 0
wk is a 1 x 2 vector and X is a 2 x 1 point from the data point set; essentially X = (xi,yi), where i = 1,2,...,40. I have the values for wk and w0.
I'm trying to plot the line wk*X + w0 = 0 but I have no idea how to plot the actual line. In the past, I've done this by finding the minimum and maximum of the data points and just connecting them together but that's definitely not the right approach.
wk*X is simply the dot product between two vectors, and so the equation becomes:
w1*x + w2*y + w0 = 0
... assuming a general point (x,y). If we rearrange this equation and solve for y, we get:
y = -(w1/w2)*x - (w0/w2)
As such, this defines an equation of the line where the slope is -(w1/w2) with an intercept -(w0/w2). All you have to do is define a bunch of linearly spaced points within a certain range, take each point and substitute this into the above equation and get an output. You'd plot all of these output points in the figure as well as the actual points themselves. You make the space or resolution between points small enough so that we are visualizing a line when we connect all of the points together.
To determine the range or limits of this line, figure out what the smallest and largest x value is in your data, define a set of linearly spaced points between these and plot your line using the equation of the line we just talked about.
Something like this could work assuming that you have a matrix of points stored in X as you have mentioned, and w1 and w2 are defined in the vector wk and w0 is defined separately:
x = linspace(min(X(:,1)), max(X(:,1)));
y = -(wk(1)/wk(2))*x - (w0/wk(2));
plot(X(:,1), X(:,2), 'b.', x, y);
linspace determines a linearly spaced array of points from a beginning to an end, and by default 100 points are generated. We then create the output values of the line given these points and we plot the individual points in blue as well as the line itself on top of these points.

"Frequency" shift in discrete FFT in MATLAB

(Disclaimer: I thought about posting this on math.statsexchange, but found similar questions there that were moved to SO, so here I am)
The context:
I'm using fft/ifft to determine probability distributions for sums of random variables.
So e.g. I'm having two uniform probability distributions - in the simplest case two uniform distributions on the interval [0,1].
So to get the probability distribution for the sum of two random variables sampled from these two distributions, one can calculate the product of the fourier-transformed of each probabilty density.
Doing the inverse fft on this product, you get back the probability density for the sum.
An example:
function usumdist_example()
x = linspace(-1, 2, 1e5);
dx = diff(x(1:2));
NFFT = 2^nextpow2(numel(x));
% take two uniform distributions on [0,0.5]
intervals = [0, 0.5;
0, 0.5];
figure();
hold all;
for i=1:size(intervals,1)
% construct the prob. dens. function
P_x = x >= intervals(i,1) & x <= intervals(i,2);
plot(x, P_x);
% for each pdf, get the characteristic function fft(pdf,NFFT)
% and form the product of all char. functions in Y
if i==1
Y = fft(P_x,NFFT) / NFFT;
else
Y = Y .* fft(P_x,NFFT) / NFFT;
end
end
y = ifft(Y, NFFT);
x_plot = x(1) + (0:dx:(NFFT-1)*dx);
plot(x_plot, y / max(y), '.');
end
My issue is, the shape of the resulting prob. dens. function is perfect.
However, the x-axis does not fit to the x I create in the beginning, but is shifted.
In the example, the peak is at 1.5, while it should be 0.5.
The shift changes if I e.g. add a third random variable or if I modify the range of x.
But I can't get figure how.
I'm afraid it might have to do with the fact that I'm having negative x values, while fourier transforms usually work in a time/frequency domain, where frequencies < 0 don't make sense.
I'm aware I could find e.g. the peak and shift it to its proper place, but seems nasty and error prone...
Glad about any ideas!
The problem is that your x origin is -1, not 0. You expect the center of the triangular pdf to be at .5, because that's twice the value of the center of the uniform pdf. However, the correct reasoning is: the center of the uniform pdf is 1.25 above your minimum x, and you get the center of the triangle at 2*1.25 = 2.5 above the minimum x (that is, at 1.5).
In other words: although your original x axis is (-1, 2), the convolution (or the FFT) behave as if it were (0, 3). In fact, the FFT knows nothing about your x axis; it only uses the y samples. Since your uniform is zero for the first samples, that zero interval of width 1 is amplified to twice its width when you do the convolution (or the FFT). I suggest drawing the convolution on paper to see this (draw original signal, reflected signal about y axis, displace the latter and see when both begin to overlap). So you need a correction in the x_plot line to compensate for this increased width of the zero interval: use
x_plot = 2*x(1) + (0:dx:(NFFT-1)*dx);
and then plot(x_plot, y / max(y), '.') will give the correct graph:

Matlab, generate and plot a point cloud distributed within a triangle

I'm trying to generate a cloud of 2D points (uniformly) distributed within a triangle. So far, I've achieved the following:
The code I've used is this:
N = 1000;
X = -10:0.1:10;
for i=1:N
j = ceil(rand() * length(X));
x_i = X(j);
y_i = (10 - abs(x_i)) * rand;
E(:, i) = [x_i y_i];
end
However, the points are not uniformly distributed, as clearly seen in the left and right corners. How can I improve that result? I've been trying to search for the different shapes too, with no luck.
You should first ask yourself what would make the points within a triangle distributed uniformly.
To make a long story short, given all three vertices of the triangle, you need to transform two uniformly distributed random values like so:
N = 1000; % # Number of points
V = [-10, 0; 0, 10; 10, 0]; % # Triangle vertices, pairs of (x, y)
t = sqrt(rand(N, 1));
s = rand(N, 1);
P = (1 - t) * V(1, :) + bsxfun(#times, ((1 - s) * V(2, :) + s * V(3, :)), t);
This will produce a set of points which are uniformly distributed inside the specified triangle:
scatter(P(:, 1), P(:, 2), '.')
Note that this solution does not involve repeated conditional manipulation of random numbers, so it cannot potentially fall into an endless loop.
For further reading, have a look at this article.
That concentration of points would be expected from the way you are building the points. Your points are equally distributed along the X axis. At the extremes of the triangle there is approximately the same amount of points present at the center of the triangle, but they are distributed along a much smaller region.
The first and best approach I can think of: brute force. Distribute the points equally around a bigger region, and then delete the ones that are outside the region you are interested in.
N = 1000;
points = zeros(N,2);
n = 0;
while (n < N)
n = n + 1;
x_i = 20*rand-10; % generate a number between -10 and 10
y_i = 10*rand; % generate a number between 0 and 10
if (y_i > 10 - abs(x_i)) % if the points are outside the triangle
n = n - 1; % decrease the counter to try to generate one more point
else % if the point is inside the triangle
points(n,:) = [x_i y_i]; % add it to a list of points
end
end
% plot the points generated
plot(points(:,1), points(:,2), '.');
title ('1000 points randomly distributed inside a triangle');
The result of the code I've posted:
one important disclaimer: Randomly distributed does not mean "uniformly" distributed! If you generate data randomly from an Uniform Distribution, that does not mean that it will be "evenly distributed" along the triangle. You will see, in fact, some clusters of points.
You can imagine that the triangle is split vertically into two halves, and move one half so that together with the other it makes a rectangle. Now you sample uniformly in the rectangle, which is easy, and then move the half triangle back.
Also, it's easier to work with unit lengths (the rectangle becomes a square) and then stretch the triangle to the desired dimensions.
x = [-10 10]; % //triangle base
y = [0 10]; % //triangle height
N = 1000; %// number of points
points = rand(N,2); %// sample uniformly in unit square
ind = points(:,2)>points(:,1); %// points to be unfolded
points(ind,:) = [2-points(ind,2) points(ind,1)]; %// unfold them
points(:,1) = x(1) + (x(2)-x(1))/2*points(:,1); %// stretch x as needed
points(:,2) = y(1) + (y(2)-y(1))*points(:,2); %// stretch y as needed
plot(points(:,1),points(:,2),'.')
We can generalize this case. If you want to sample points from some (n - 1)-dimensional simplex in Euclidean space UNIFORMLY (not necessarily a triangle - it can be any convex polytope), just sample a vector from a symmetric n-dimensional Dirichlet distribution with parameter 1 - these are the convex (or barycentric) coordinates relative to the vertices of the polytope.

How to fit a curve by a series of segmented lines in Matlab?

I have a simple loglog curve as above. Is there some function in Matlab which can fit this curve by segmented lines and show the starting and end points of these line segments ? I have checked the curve fitting toolbox in matlab. They seems to do curve fitting by either one line or some functions. I do not want to curve fitting by one line only.
If there is no direct function, any alternative to achieve the same goal is fine with me. My goal is to fit the curve by segmented lines and get locations of the end points of these segments .
First of all, your problem is not called curve fitting. Curve fitting is when you have data, and you find the best function that describes it, in some sense. You, on the other hand, want to create a piecewise linear approximation of your function.
I suggest the following strategy:
Split manually into sections. The section size should depend on the derivative, large derivative -> small section
Sample the function at the nodes between the sections
Find a linear interpolation that passes through the points mentioned above.
Here is an example of a code that does that. You can see that the red line (interpolation) is very close to the original function, despite the small amount of sections. This happens due to the adaptive section size.
function fitLogLog()
x = 2:1000;
y = log(log(x));
%# Find section sizes, by using an inverse of the approximation of the derivative
numOfSections = 20;
indexes = round(linspace(1,numel(y),numOfSections));
derivativeApprox = diff(y(indexes));
inverseDerivative = 1./derivativeApprox;
weightOfSection = inverseDerivative/sum(inverseDerivative);
totalRange = max(x(:))-min(x(:));
sectionSize = weightOfSection.* totalRange;
%# The relevant nodes
xNodes = x(1) + [ 0 cumsum(sectionSize)];
yNodes = log(log(xNodes));
figure;plot(x,y);
hold on;
plot (xNodes,yNodes,'r');
scatter (xNodes,yNodes,'r');
legend('log(log(x))','adaptive linear interpolation');
end
Andrey's adaptive solution provides a more accurate overall fit. If what you want is segments of a fixed length, however, then here is something that should work, using a method that also returns a complete set of all the fitted values. Could be vectorized if speed is needed.
Nsamp = 1000; %number of data samples on x-axis
x = [1:Nsamp]; %this is your x-axis
Nlines = 5; %number of lines to fit
fx = exp(-10*x/Nsamp); %generate something like your current data, f(x)
gx = NaN(size(fx)); %this will hold your fitted lines, g(x)
joins = round(linspace(1, Nsamp, Nlines+1)); %define equally spaced breaks along the x-axis
dx = diff(x(joins)); %x-change
df = diff(fx(joins)); %f(x)-change
m = df./dx; %gradient for each section
for i = 1:Nlines
x1 = joins(i); %start point
x2 = joins(i+1); %end point
gx(x1:x2) = fx(x1) + m(i)*(0:dx(i)); %compute line segment
end
subplot(2,1,1)
h(1,:) = plot(x, fx, 'b', x, gx, 'k', joins, gx(joins), 'ro');
title('Normal Plot')
subplot(2,1,2)
h(2,:) = loglog(x, fx, 'b', x, gx, 'k', joins, gx(joins), 'ro');
title('Log Log Plot')
for ip = 1:2
subplot(2,1,ip)
set(h(ip,:), 'LineWidth', 2)
legend('Data', 'Piecewise Linear', 'Location', 'NorthEastOutside')
legend boxoff
end
This is not an exact answer to this question, but since I arrived here based on a search, I'd like to answer the related question of how to create (not fit) a piecewise linear function that is intended to represent the mean (or median, or some other other function) of interval data in a scatter plot.
First, a related but more sophisticated alternative using regression, which apparently has some MATLAB code listed on the wikipedia page, is Multivariate adaptive regression splines.
The solution here is to just calculate the mean on overlapping intervals to get points
function [x, y] = intervalAggregate(Xdata, Ydata, aggFun, intStep, intOverlap)
% intOverlap in [0, 1); 0 for no overlap of intervals, etc.
% intStep this is the size of the interval being aggregated.
minX = min(Xdata);
maxX = max(Xdata);
minY = min(Ydata);
maxY = max(Ydata);
intInc = intOverlap*intStep; %How far we advance each iteraction.
if intOverlap <= 0
intInc = intStep;
end
nInt = ceil((maxX-minX)/intInc); %Number of aggregations
parfor i = 1:nInt
xStart = minX + (i-1)*intInc;
xEnd = xStart + intStep;
intervalIndices = find((Xdata >= xStart) & (Xdata <= xEnd));
x(i) = aggFun(Xdata(intervalIndices));
y(i) = aggFun(Ydata(intervalIndices));
end
For instance, to calculate the mean over some paired X and Y data I had handy with intervals of length 0.1 having roughly 1/3 overlap with each other (see scatter image):
[x,y] = intervalAggregate(Xdat, Ydat, #mean, 0.1, 0.333)
x =
Columns 1 through 8
0.0552 0.0868 0.1170 0.1475 0.1844 0.2173 0.2498 0.2834
Columns 9 through 15
0.3182 0.3561 0.3875 0.4178 0.4494 0.4671 0.4822
y =
Columns 1 through 8
0.9992 0.9983 0.9971 0.9955 0.9927 0.9905 0.9876 0.9846
Columns 9 through 15
0.9803 0.9750 0.9707 0.9653 0.9598 0.9560 0.9537
We see that as x increases, y tends to decrease slightly. From there, it is easy enough to draw line segments and/or perform some other kind of smoothing.
(Note that I did not attempt to vectorize this solution; a much faster version could be assumed if Xdata is sorted.)