how to find k-th nearest neighbor of a point in a set of point - matlab

I have a set of point (x,y) on a 2d plane. Given a point (x0,y0), and the number k, how to find the k-th nearest neighbor of (x0,x0) in the point set. In detail, the point set are represented by two array: x and y. The point (x0,y0) is given by the index i0. It means x0=x(i0) and y0=y(i0).
Is there any function or something in Matlab helps me this problem. If Matlab doesn't have such kind of function, can you suggest any other effective ways.
EDIT: I have to calculate this kind of distance for every point (x0,y0) in the set. The size of the set is about 1000. The value of k should be about sqrt(1500). The worst thing is that I do this many times. At each iteration, the set changes, and I calculate the distances again. So, the running time is a critical problem.

if you will do this check for many points you might want to construct a inter-point distance table first
squareform(pdist([x y]))

If you have the statistics toolbox, you can use the function knnsearch.

A brute force algorithm would be something like this:
array x[n] = ()
array y[n] = ()
array d[n] = ()
... populate x and y arrays with your n points ...
/* step over each point and calculate its distance from (x0, y0) */
for i = 1 to n
do
d[i] = distance((x0, y0), (x[i], y[i])
end
/* sort the distances in increasing order */
sort(d)
/* the k'th element of d, is the k'th nearest point to (x0, y0) */
return d[k]

The free and opensource VLFeat toolbox contains a kd-tree implementation, amongst other useful things.

The brute force approach looks something like this:
%Create some points
n = 10;
x = randn(n,1);
y = randn(n,1);
%Choose x0
ix0 = randi(n);
%Get distances
d = sqrt(...
(x - x(ix0) ).^2 + ...
(y - y(ix0) ).^2 );
%Sort distances
[sorted_Dstances, ixSort] = sort(d);
%Get kth point
k = 3;
kth = [x(ixSort(k+1)); y(ixSort(k+1))]; %+1 since the first element will always be the x0 element.

Related

Mahalanobis distance in Matlab

I would like to calculate the mahalanobis distance of input feature vector Y (1x14) to all feature vectors in matrix X (18x14). Each 6 vectors of X represent one class (So I have 3 classes). Then based on mahalanobis distances I will choose the vector that is the nearest to the input and classify it to one of the three classes as well.
My problem is when I use the following code I got only one value. How can I get mahalanobis distance between the input Y and every vector in X. So at the end I have 18 values and then I choose the smallest one. Any help will be appreciated. Thank you.
Note: I know that mahalanobis distance is a measure of the distance between a point P and a distribution D, but I don't how could this be applied in my situation.
Y = test1; % Y: 1x14 vector
S = cov(X); % X: 18x14 matrix
mu = mean(X,1);
d = ((Y-mu)/S)*(Y-mu)'
I also tried to separate the matrix X into 3; so each one represent the feature vectors of one class. This is the code, but it doesn't work properly and I got 3 distances and some have negative value!
Y = test1;
X1 = Action1;
S1 = cov(X1);
mu1 = mean(X1,1);
d1 = ((Y-mu1)/S1)*(Y-mu1)'
X2 = Action2;
S2 = cov(X2);
mu2 = mean(X2,1);
d2 = ((Y-mu2)/S2)*(Y-mu2)'
X3= Action3;
S3 = cov(X3);
mu3 = mean(X3,1);
d3 = ((Y-mu3)/S3)*(Y-mu3)'
d= [d1,d2,d3];
MahalanobisDist= min(d)
One last thing, when I used mahal function provided by Matlab I got this error:
Warning: Matrix is close to singular or badly scaled. Results may be inaccurate.
If you have to implement the distance yourself (school assignment for instance) this is of absolutely no use to you, but if you just need to calculate the distance as an intermediate step for other calculations I highly recommend d = Pdist2(a,b, distance_measure) the documentation is on matlabs site
It computes the pairwise distance between a vector (or even a matrix) b and all elements in a and stores them in vector d where the columns correspond to entries in b and the rows are entries from a. So d(i,j) is the distance between row j in b and row i in a (hope that made sense). If you want it could even parameters to find the k nearest neighbors, it's a great function.
in your case you would use the following code and you'd end up with the distance between elements, and the index as well
%number of neighbors
K = 1;
% X=18x14, Y=1x14, dist=18x1
[dist, iidx] = pdist2(X,Y,'mahalanobis','smallest',K);
%to find the class, you can do something like this
num_samples_per_class = 6;
matching_class = ceil(iidx/ num_samples_per_class);

Matlab calculating nearest neighbour distance for all (u, v) vectors in an array

I am trying to calculate the distance between nearest neighbours within a nx2 matrix like the one shown below
point_coordinates =
11.4179 103.1400
16.7710 10.6691
16.6068 119.7024
25.1379 74.3382
30.3651 23.2635
31.7231 105.9109
31.8653 36.9388
%for loop going from the top of the vector column to the bottom
for counter = 1:size(point_coordinates,1)
%current point defined selected
current_point = point_coordinates(counter,:);
%math to calculate distance between the current point and all the points
distance_search= point_coordinates-repmat(current_point,[size(point_coordinates,1) 1]);
dist_from_current_point = sqrt(distance_search(:,1).^2+distance_search(:,2).^2);
%line to omit self subtraction that gives zero
dist_from_current_point (dist_from_current_point <= 0)=[];
%gives the shortest distance calculated for a certain vector and current_point
nearest_dist=min(dist_from_current_point);
end
%final line to plot the u,v vectors and the corresponding nearest neighbour
%distances
matnndist = [point_coordinates nearest_dist]
I am not sure how to structure the 'for' loop/nearest_neighbour line to be able to get the nearest neighbour distance for each u,v vector.
I would like to have, for example ;
for the first vector you could have the coordinates and the corresponding shortest distance, for the second vector another its shortest distance, and this goes on till n
Hope someone can help.
Thanks
I understand you want to obtain the minimum distance between different points.
You can compute the distance for each pair of points with bsxfun; remove self-distances; minimize. It's more computationally efficient to work with squared distances, and take the square root only at the end.
n = size(point_coordinates,1);
dist = bsxfun(#minus, point_coordinates(:,1), point_coordinates(:,1).').^2 + ...
bsxfun(#minus, point_coordinates(:,2), point_coordinates(:,2).').^2;
dist(1:n+1:end) = inf; %// remove self-distances
min_dist = sqrt(min(dist(:)));
Alternatively, you could use pdist. This avoids computing each distance twice, and also avoids self-distances:
dist = pdist(point_coordinates);
min_dist = min(dist(:));
If I can suggest a built-in function, use knnsearch from the statistics toolbox. What you are essentially doing is a K-Nearest Neighbour (KNN) algorithm, but you are ignoring self-distances. The way you would call knnsearch is in the following way:
[idx,d] = knnsearch(X, Y, 'k', k);
In simple terms, the KNN algorithm returns the k closest points to your data set given a query point. Usually, the Euclidean distance is the distance metric that is used. For MATLAB's knnsearch, X is a 2D array that consists of your dataset where each row is an observation and each column is a variable. Y would be the query points. Y is also a 2D array where each row is a query point and you need to have the same number of columns as X. We would also specify the flag 'k' to denote how many closest points you want returned. By default, k = 1.
As such, idx would be a N x K matrix, where N is the total number of query points (number of rows of Y) and K would be those k closest points to the dataset for each query point we have. idx indicates the particular points in your dataset that were closest to each query. d is also a N x K matrix that returns the smallest distances for these corresponding closest points.
As such, what you want to do is find the closest point for your dataset to each of the other points, ignoring self-distances. Therefore, you would set both X and Y to be the same, and set k = 2, discarding the first column of both outputs to get the result you're looking for.
Therefore:
[idx,d] = knnsearch(point_coordinates, point_coordinates, 'k', 2)
idx = idx(:,2);
d = d(:,2);
We thus get for idx and d:
>> idx
idx =
3
5
1
1
7
3
5
>> d
d =
17.3562
18.5316
17.3562
31.9027
13.7573
20.4624
13.7573
As such, this tells us that for the first point in your data set, it matched with point #3 the best. This matched with the closest distance of 17.3562. For the second point in your data set, it matched with point #5 the best with the closest distance being 18.5316. You can continue on with the rest of the results in a similar pattern.
If you don't have access to the statistics toolbox, consider reading my StackOverflow post on how I compute KNN from first principles.
Finding K-nearest neighbors and its implementation
In fact, it is very similar to Luis Mendo's post to you earlier.
Good luck!

How to generate random cartesian coordinates given distance constraint in Matlab

I need to generate N random coordinates for a 2D plane. The distance between any two points are given (number of distance is N(N - 1) / 2). For example, say I need to generate 3 points i.e. A, B, C. I have the distance between pair of them i.e. distAB, distAC and distBC.
Is there any built-in function in MATLAB that can do this? Basically, I'm looking for something that is the reverse of pdist() function.
My initial idea was to choose a point (say A is the origin). Then, I can randomly find B and C being on two different circles with radii distAB and distAC. But then the distance between B and C might not satisfy distBC and I'm not sure how to proceed if this happens. And I think this approach will get very complicated if N is a large number.
Elaborating on Ansaris answer I produced the following. It assumes a valid distance matrix provided, calculates positions in 2D based on cmdscale, does a random rotation (random translation could be added also), and visualizes the results:
%Distance matrix
D = [0 2 3; ...
2 0 4; ...
3 4 0];
%Generate point coordinates based on distance matrix
Y = cmdscale(D);
[nPoints dim] = size(Y);
%Add random rotation
randTheta = 2*pi*rand(1);
Rot = [cos(randTheta) -sin(randTheta); sin(randTheta) cos(randTheta) ];
Y = Y*Rot;
%Visualization
figure(1);clf;
plot(Y(:,1),Y(:,2),'.','markersize',20)
hold on;t=0:.01:2*pi;
for r = 1 : nPoints - 1
for c = r+1 : nPoints
plot(Y(r,1)+D(r,c)*sin(t),Y(r,2)+D(r,c)*cos(t));
plot(Y(c,1)+D(r,c)*sin(t),Y(c,2)+D(r,c)*cos(t));
end
end
You want to use a technique called classical multidimensional scaling. It will work fine and losslessly if the distances you have correspond to distances between valid points in 2-D. Luckily there is a function in MATLAB that does exactly this: cmdscale. Once you run this function on your distance matrix, you can treat the first two columns in the first output argument as the points you need.

Alternative to using squareform (Matlab)

At the moment i am using the pdist function in Matlab, to calculate the euclidian distances between various points in a three dimensional cartesian system. I'm doing this because i want to know which point has the smallest average distance to all the other points (the medoid). The syntax for pdist looks like this:
% calculate distances between all points
distances = pdist(m);
But because pdist returns a one dimensional array of distances, there is no easy way to figure out which point has the smallest average distance (directly). Which is why i am using squareform and then calculating the smallest average distance, like so:
% convert found distances to matrix of distances
distanceMatrix = squareform(distances);
% find index of point with smallest average distance
[~,j] = min(mean(distanceMatrix,2));
The distances are averaged for each column, and the variable j is the index for the column (and the point) with the smallest average distance.
This works, but squareform takes a lot of time (this piece of code is repeated thousands of times), so i am looking for a way to optimise it. Does anyone know of a faster way to deduce the point with the smallest average distance from the results of pdist?
I think for your task using SQUAREFORM function is the best way from vectorization view point. If you look at the content of this function by
edit squareform
You will see that it performs a lot of checks that take time of course. Since you know your input to squareform and can be sure it will work, you can create your custom function with just the core of squareform.
[r, c] = size(m);
distanceMatrix = zeros(r);
distanceMatrix(tril(true(r),-1)) = distances;
distanceMatrix = distanceMatrix + distanceMatrix';
Then run the same code as you did to find the medioid.
Here's an implementation that doesn't require a call to squareform:
N1 = 10;
dim = 5;
% generate points
X = randn(N1, dim);
% find mean distance
for iter=N1:-1:1
d_mean(iter) = mean(pdist2(X(iter,:),X([1:(iter-1) (iter+1):end],:),'euclidean'));
% D(iter,:) = pdist2(X(iter,:),X([1:(iter-1) (iter+1):end],:),'euclidean');
end
[val ind] = min(d_mean);
But without knowing more about your problem, I have no idea if it would be faster.
If this is the lynchpin for your program's performance, you may need to consider other speedup options like mex.
Good luck.
When pdist computes distances between pairs of observations (1,2,...,n), the distances are arranged in the following order:
(2,1), (3,1), ..., (m,1), (3,2), ..., (m,2), ..., (m,m–1))
To demonstrate this, try the following:
> X = [.2 .1 .7 .5]';
> D = pdist(X)
.1 .5 .3 .6 .4 .2
In this example, X stores n=4 observations. The result, D, is a vector of distances between observations (2,1), (3,1), (4,1), (3,2), (4,2), (5,4). This arrangement corresponds with the entries of the lower triangular part of the following n-by-n matrix:
M=
0 0 0 0
.1 0 0 0
.5 .6 0 0
.3 .4 .2 0
Notice that D(1)=M(2,1), D(2)=(3,1) and so on. So, one way to get the pair of indices in M that correspond with D(k) would be to compute the linear index of D(k) in M. This could be done as follows:
% matrix size
n = 4;
% r(j) is the no. of elements in cols 1..j, belonging to the upper triangular part
r = cumsum(1:n-1);
% p(j) is the no. elements in cols 1..j, belonging to the lower triangular part
p = cumsum(n-1:-1:1);
% The linear index of value D(k)
q = find(p >= k, 1);
% The subscript indices of value D(k)
[i j] = ind2sub([n n], k + r(q));
Notice that n, r and p need to be set only once. From that point, you can find the index for any given k using the last two lines. Let's check this:
for k = 1:6
q = find(p >= k, 1);
[i, j] = ind2sub([n n], k + r(q));
fprintf('D(%d) is the distance between observations (%d %d)\n', k, i, j);
end
Here's the output:
D(1) is the distance between observations (2 1)
D(2) is the distance between observations (3 1)
D(3) is the distance between observations (4 1)
D(4) is the distance between observations (3 2)
D(5) is the distance between observations (4 2)
D(6) is the distance between observations (4 3)
There is no need to use squareform:
distances = pdist(m);
l=length(distances);
n=(1+sqrt(1+4*l))/2;
m=[];
for i=1:n
idx=[1+i:n:length(distances)];
m(i)=mean(distances(idx));
end
j=min(m);
I am not sure, but maybe this can be vectorised as well, but now it is late.

How do I create a simliarity matrix in MATLAB?

I am working towards comparing multiple images. I have these image data as column vectors of a matrix called "images." I want to assess the similarity of images by first computing their Eucledian distance. I then want to create a matrix over which I can execute multiple random walks. Right now, my code is as follows:
% clear
% clc
% close all
%
% load tea.mat;
images = Input.X;
M = zeros(size(images, 2), size (images, 2));
for i = 1:size(images, 2)
for j = 1:size(images, 2)
normImageTemp = sqrt((sum((images(:, i) - images(:, j))./256).^2));
%Need to accurately select the value of gamma_i
gamma_i = 1/10;
M(i, j) = exp(-gamma_i.*normImageTemp);
end
end
My matrix M however, ends up having a value of 1 along its main diagonal and zeros elsewhere. I'm expecting "large" values for the first few elements of each row and "small" values for elements with column index > 4. Could someone please explain what is wrong? Any advice is appreciated.
Since you're trying to compute a Euclidean distance, it looks like you have an error in where your parentheses are placed when you compute normImageTemp. You have this:
normImageTemp = sqrt((sum((...)./256).^2));
%# ^--- Note that this parenthesis...
But you actually want to do this:
normImageTemp = sqrt(sum(((...)./256).^2));
%# ^--- ...should be here
In other words, you need to perform the element-wise squaring, then the summation, then the square root. What you are doing now is summing elements first, then squaring and taking the square root of the summation, which essentially cancel each other out (or are actually the equivalent of just taking the absolute value).
Incidentally, you can actually use the function NORM to perform this operation for you, like so:
normImageTemp = norm((images(:, i) - images(:, j))./256);
The results you're getting seem reasonable. Recall the behavior of the exp(-x). When x is zero, exp(-x) is 1. When x is large exp(-x) is zero.
Perhaps if you make M(i,j) = normImageTemp; you'd see what you expect to see.
Consider this solution:
I = Input.X;
D = squareform( pdist(I') ); %'# euclidean distance between columns of I
M = exp(-(1/10) * D); %# similarity matrix between columns of I
PDIST and SQUAREFORM are functions from the Statistics Toolbox.
Otherwise consider this equivalent vectorized code (using only built-in functions):
%# we know that: ||u-v||^2 = ||u||^2 + ||v||^2 - 2*u.v
X = sum(I.^2,1);
D = real( sqrt(bsxfun(#plus,X,X')-2*(I'*I)) );
M = exp(-(1/10) * D);
As was explained in the other answers, D is the distance matrix, while exp(-D) is the similarity matrix (which is why you get ones on the diagonal)
there is an already implemented function pdist, if you have a matrix A, you can directly do
Sim= squareform(pdist(A))