Alternative to using squareform (Matlab)

At the moment I am using the pdist function in Matlab to calculate the Euclidean distances between various points in a three-dimensional Cartesian system. I'm doing this because I want to know which point has the smallest average distance to all the other points (the medoid). The syntax for pdist looks like this:
% calculate distances between all points
distances = pdist(m);
But because pdist returns a one-dimensional array of distances, there is no easy way to figure out directly which point has the smallest average distance. That is why I am using squareform and then calculating the smallest average distance, like so:
% convert found distances to matrix of distances
distanceMatrix = squareform(distances);
% find index of point with smallest average distance
[~,j] = min(mean(distanceMatrix,2));
The distances are averaged along each row (the matrix is symmetric, so averaging each column gives the same values), and the variable j is the index of the row (and the point) with the smallest average distance.
This works, but squareform takes a lot of time (this piece of code is repeated thousands of times), so I am looking for a way to optimise it. Does anyone know of a faster way to deduce the point with the smallest average distance from the results of pdist?

I think using the SQUAREFORM function is the best way to do this from a vectorization point of view. If you look at the contents of this function with
edit squareform
you will see that it performs a lot of checks, which of course take time. Since you know your input to squareform and can be sure it will work, you can create a custom function with just the core of squareform.
[r, c] = size(m);
distanceMatrix = zeros(r);
distanceMatrix(tril(true(r),-1)) = distances;
distanceMatrix = distanceMatrix + distanceMatrix';
Then run the same code as you did to find the medoid.
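Putting the pieces together, a minimal sketch of the stripped-down conversion followed by the medoid step might look like this. The point coordinates are made up for illustration, and pdist is assumed to be available (Statistics Toolbox).

```matlab
% Made-up example points, one 3-D point per row
m = [0 0 0; 1 0 0; 0 2 0; 5 5 5];
distances = pdist(m);                      % 1-by-6 vector of pairwise distances

% Core of squareform, without the input checks
r = size(m, 1);
distanceMatrix = zeros(r);
distanceMatrix(tril(true(r), -1)) = distances;       % fill the lower triangle
distanceMatrix = distanceMatrix + distanceMatrix.';  % mirror into the upper triangle

% Medoid: the row with the smallest mean distance
[~, j] = min(mean(distanceMatrix, 2));
```

The mean over a row includes the zero self-distance, so it is the distance sum divided by n rather than n-1. That scales every row by the same factor and does not change which row is smallest.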

Here's an implementation that doesn't require a call to squareform:
N1 = 10;
dim = 5;
% generate points
X = randn(N1, dim);
% find mean distance
for iter=N1:-1:1
d_mean(iter) = mean(pdist2(X(iter,:),X([1:(iter-1) (iter+1):end],:),'euclidean'));
% D(iter,:) = pdist2(X(iter,:),X([1:(iter-1) (iter+1):end],:),'euclidean');
end
[val ind] = min(d_mean);
But without knowing more about your problem, I have no idea if it would be faster.
If this is the lynchpin for your program's performance, you may need to consider other speedup options like mex.
Good luck.

When pdist computes distances between pairs of observations (1,2,...,n), the distances are arranged in the following order:
(2,1), (3,1), ..., (n,1), (3,2), ..., (n,2), ..., (n,n-1)
To demonstrate this, try the following:
> X = [.2 .1 .7 .5]';
> D = pdist(X)
.1 .5 .3 .6 .4 .2
In this example, X stores n=4 observations. The result, D, is a vector of distances between observations (2,1), (3,1), (4,1), (3,2), (4,2), (4,3). This arrangement corresponds to the entries of the lower triangular part of the following n-by-n matrix:
M=
0 0 0 0
.1 0 0 0
.5 .6 0 0
.3 .4 .2 0
Notice that D(1)=M(2,1), D(2)=M(3,1) and so on. So, one way to get the pair of indices in M that corresponds to D(k) is to compute the linear index of D(k) in M. This could be done as follows:
% matrix size
n = 4;
% r(j) is the no. of elements in cols 1..j, belonging to the upper triangular part
r = cumsum(1:n-1);
% p(j) is the no. elements in cols 1..j, belonging to the lower triangular part
p = cumsum(n-1:-1:1);
% q is the column of M that contains D(k)
q = find(p >= k, 1);
% k + r(q) is the linear index of D(k) in M; convert it to subscripts
[i, j] = ind2sub([n n], k + r(q));
Notice that n, r and p need to be set only once. From that point, you can find the index for any given k using the last two lines. Let's check this:
for k = 1:6
q = find(p >= k, 1);
[i, j] = ind2sub([n n], k + r(q));
fprintf('D(%d) is the distance between observations (%d %d)\n', k, i, j);
end
Here's the output:
D(1) is the distance between observations (2 1)
D(2) is the distance between observations (3 1)
D(3) is the distance between observations (4 1)
D(4) is the distance between observations (3 2)
D(5) is the distance between observations (4 2)
D(6) is the distance between observations (4 3)
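If the subscripts are needed for every k at once, one possible shortcut is to let find enumerate the strict lower triangle. It walks the matrix in column-major order, which is the same order pdist uses. This is only a sketch; the variable names rowIdx, colIdx and pairs are chosen here for illustration.

```matlab
n = 4;                                        % number of observations
% Subscripts of the strict lower triangle, in column-major order
[rowIdx, colIdx] = find(tril(true(n), -1));
pairs = [rowIdx, colIdx];                     % row k holds the pair for D(k)
disp(pairs)
```

For n = 4 this lists (2,1), (3,1), (4,1), (3,2), (4,2), (4,3), matching the loop output above.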

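There is no need to use squareform:
distances = pdist(m);
l = length(distances);
% number of points: solve l = n*(n-1)/2 for n
n = round((1 + sqrt(1 + 8*l))/2);
% subscripts of each pdist entry, listed in pdist's (column-major) order
[row, col] = find(tril(true(n), -1));
meanDist = zeros(n, 1);
for i = 1:n
% every distance that has point i as one of its end points
idx = (row == i) | (col == i);
meanDist(i) = mean(distances(idx));
end
[~, j] = min(meanDist);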
This can probably be vectorised as well, but it is late here.
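As a possible vectorised variant (a sketch under the assumption that pdist's output order is the column-major lower triangle, as described in another answer here), accumarray can add each distance to both of its end points without building the square matrix. The example points and the names totals and meanDist are made up for illustration.

```matlab
m = [0 0 0; 1 0 0; 0 2 0; 5 5 5];           % made-up example points
distances = pdist(m);                        % Statistics Toolbox
n = size(m, 1);

% End points of every pdist entry, in pdist's order
[rowIdx, colIdx] = find(tril(true(n), -1));

% Add each distance to both of its end points
totals = accumarray([rowIdx; colIdx], [distances(:); distances(:)], [n 1]);
meanDist = totals / (n - 1);                 % average over the other n-1 points

[~, j] = min(meanDist);                      % index of the medoid
```

Whether this is actually faster than squareform depends on the problem size, so it is worth timing both on real data.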

Related

How can this code be vectorized in MATLAB? Which kinds of code can be vectorized? [duplicate]

I have a matrix a and I want to calculate the distance from one point to all other points. So really the outcome matrix should have a zero (at the point I have chosen) and should appear as some sort of circle of numbers around that specific point.
This is what I have already, but I can't seem to get the correct outcome.
a = [1 2 3 4 5 6 7 8 9 10]
for i = 2:20
a(i,:) = a(i-1,:) + 1;
end
N = 10
for I = 1:N
for J = 1:N
dx = a(I,1)-a(J,1);
dy = a(I,2)-a(J,2);
distance(I,J) = sqrt(dx^2 + dy^2)
end
end
Your a matrix is a 1D vector and is incompatible with the nested loop, which computes distance in 2D space from each point to each other point. So the following answer applies to the problem of finding all pairwise distances in an N-by-D matrix, as your loop does for the case of D=2.
Option 1 - pdist
I think you are looking for pdist with the 'euclidean' distance option.
a = randn(10, 2); %// 2D, 10 samples
D = pdist(a,'euclidean'); %// euclidean distance
Follow that by squareform to get the square matrix with zero on the diagonal as you want it:
distances = squareform(D);
Option 2 - bsxfun
If you don't have pdist, which is in the Statistics Toolbox, you can do this easily with bsxfun:
da = bsxfun(@minus,a,permute(a,[3 2 1]));
distances = squeeze(sqrt(sum(da.^2,2)));
Option 3 - reformulated equation
You can also use an alternate form of Euclidean (2-norm) distance,
||A-B|| = sqrt ( ||A||^2 + ||B||^2 - 2*A.B )
Writing this in MATLAB for two data arrays u and v of size NxD,
dot(u-v,u-v,2) == dot(u,u,2) + dot(v,v,2) - 2*dot(u,v,2) % useful identity
%// there are actually small differences from floating point precision, but...
abs(dot(u-v,u-v,2) - (dot(u,u,2) + dot(v,v,2) - 2*dot(u,v,2))) < 1e-15
With the reformulated equation, the solution becomes:
aa = a*a';
a2 = sum(a.*a,2); % diag(aa)
a2 = bsxfun(@plus,a2,a2');
distances = sqrt(a2 - 2*aa);
You might use this method if Option 2 eats up too much memory.
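One caveat with the reformulated equation is round-off: the expression under the square root can come out as a tiny negative number, which makes sqrt return complex values. A hedged sketch of one way to guard against that is shown below. The clamping with max and the explicit zeroing of the diagonal are additions for illustration, and the data are made up.

```matlab
a = [0 0; 3 4; 6 8];                     % made-up points, one per row
aa = a*a.';                              % Gram matrix
a2 = sum(a.*a, 2);                       % squared norms, same as diag(aa)
sq = bsxfun(@plus, a2, a2.') - 2*aa;     % squared pairwise distances
sq = max(sq, 0);                         % clamp tiny negatives caused by round-off
sq(1:size(a,1)+1:end) = 0;               % force exact zeros on the diagonal
distances = sqrt(sq);
```

For these three points the result is a symmetric matrix with distances 5, 10 and 5 off the diagonal.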
Timings
For a random data matrix of size 1e3-by-3 (N-by-D), here are timings for 100 runs (Core 2 Quad, 4GB DDR2, R2013a).
Option 1 (pdist): 1.561150 sec (0.560947 sec in pdist)
Option 2 (bsxfun): 2.695059 sec
Option 3 (bsxfun alt): 1.334880 sec
Findings: (i) Do computations with bsxfun, use the alternate formula. (ii) the pdist+squareform option has comparable performance. (iii) The reason why squareform takes twice as much time as pdist is probably because pdist only computes the triangular matrix since the distance matrix is symmetric. If you can do without the square matrix, then you can avoid squareform and do your computations in about 40% of the time required to do it manually with bsxfun (0.5609/1.3348).
This is what I was looking for, but thanks for all the suggestions.
A = rand(5, 5);
select_cell = [3 3];
distance = zeros(size(A, 1), size(A, 2));
for i = 1:size(A, 1)
for j = 1:size(A, 2)
distance(i, j) = sqrt((i - select_cell(1))^2 + (j - select_cell(2))^2);
end
end
disp(distance)
Also you can improve it by using vectorisation:
distances = sqrt((x-xCenter).^2+(y-yCenter).^2);
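The one-line formula needs index grids for x and y. A minimal sketch that reproduces the loop result, assuming the grids are built with meshgrid (the names row and col are chosen here for illustration):

```matlab
A = rand(5, 5);
select_cell = [3 3];

% row(i,j) = i and col(i,j) = j for every cell of A
[col, row] = meshgrid(1:size(A, 2), 1:size(A, 1));

% Distance of every cell from the selected cell, without loops
distance = sqrt((row - select_cell(1)).^2 + (col - select_cell(2)).^2);
disp(distance)
```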
IMPORTANT: data_matrix is D X N, where D is number of dimensions and N is number of data points!
final_dist_pairs=data_matrix'*data_matrix;
norms = diag(final_dist_pairs);
final_dist_pairs = bsxfun(@plus, norms, norms') - 2 * final_dist_pairs; % squared distances; take sqrt for Euclidean distances
Hope it helps!
One more point: I would avoid MATLAB's pdist function where speed matters. It evaluates the pairs sequentially, much like nested for loops, which can take a lot of time. (Any all-pairs computation is O(N^2); the difference is in the constant factor.)

Matlab calculating nearest neighbour distance for all (u, v) vectors in an array

I am trying to calculate the distance between nearest neighbours within an n-by-2 matrix like the one shown below
point_coordinates =
11.4179 103.1400
16.7710 10.6691
16.6068 119.7024
25.1379 74.3382
30.3651 23.2635
31.7231 105.9109
31.8653 36.9388
%for loop going from the top of the vector column to the bottom
for counter = 1:size(point_coordinates,1)
%current point defined selected
current_point = point_coordinates(counter,:);
%math to calculate distance between the current point and all the points
distance_search= point_coordinates-repmat(current_point,[size(point_coordinates,1) 1]);
dist_from_current_point = sqrt(distance_search(:,1).^2+distance_search(:,2).^2);
%line to omit self subtraction that gives zero
dist_from_current_point (dist_from_current_point <= 0)=[];
%gives the shortest distance calculated for a certain vector and current_point
nearest_dist=min(dist_from_current_point);
end
%final line to plot the u,v vectors and the corresponding nearest neighbour
%distances
matnndist = [point_coordinates nearest_dist]
I am not sure how to structure the 'for' loop/nearest_neighbour line to be able to get the nearest neighbour distance for each u,v vector.
I would like to have, for example: for the first vector, its coordinates and the corresponding shortest distance; for the second vector, its coordinates and its shortest distance; and so on up to n.
Hope someone can help.
Thanks
I understand you want to obtain the minimum distance between different points.
You can compute the distance for each pair of points with bsxfun; remove self-distances; minimize. It's more computationally efficient to work with squared distances, and take the square root only at the end.
n = size(point_coordinates,1);
dist = bsxfun(@minus, point_coordinates(:,1), point_coordinates(:,1).').^2 + ...
bsxfun(@minus, point_coordinates(:,2), point_coordinates(:,2).').^2;
dist(1:n+1:end) = inf; %// remove self-distances
min_dist = sqrt(min(dist(:)));
Alternatively, you could use pdist. This avoids computing each distance twice, and also avoids self-distances:
dist = pdist(point_coordinates);
min_dist = min(dist(:));
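The snippets above return a single overall minimum. If the goal is the nearest-neighbour distance for each point, as the question describes, one possible adaptation is to take the minimum along each row instead. This is a sketch using the coordinates from the question; the names dist2, minSq, nearest_idx and nearest_dist are chosen here for illustration.

```matlab
point_coordinates = [11.4179 103.1400
                     16.7710  10.6691
                     16.6068 119.7024
                     25.1379  74.3382
                     30.3651  23.2635
                     31.7231 105.9109
                     31.8653  36.9388];
n = size(point_coordinates, 1);

% Squared distances between all pairs of points
dist2 = bsxfun(@minus, point_coordinates(:,1), point_coordinates(:,1).').^2 + ...
        bsxfun(@minus, point_coordinates(:,2), point_coordinates(:,2).').^2;
dist2(1:n+1:end) = inf;                  % remove self-distances

% Minimum along each row: one nearest neighbour per point
[minSq, nearest_idx] = min(dist2, [], 2);
nearest_dist = sqrt(minSq);

matnndist = [point_coordinates nearest_dist];
```

Each row of matnndist then holds a point's coordinates followed by the distance to its nearest neighbour, and nearest_idx gives the row number of that neighbour.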
If I can suggest a built-in function, use knnsearch from the statistics toolbox. What you are essentially doing is a K-Nearest Neighbour (KNN) algorithm, but you are ignoring self-distances. The way you would call knnsearch is in the following way:
[idx,d] = knnsearch(X, Y, 'k', k);
In simple terms, the KNN algorithm returns the k points in your data set that are closest to a given query point. Usually, the Euclidean distance is the distance metric that is used. For MATLAB's knnsearch, X is a 2D array that consists of your dataset where each row is an observation and each column is a variable. Y would be the query points. Y is also a 2D array where each row is a query point and you need to have the same number of columns as X. We would also specify the flag 'k' to denote how many closest points you want returned. By default, k = 1.
As such, idx would be an N x K matrix, where N is the total number of query points (number of rows of Y) and K is the number of closest points returned for each query point. idx indicates the particular points in your dataset that were closest to each query. d is also an N x K matrix that returns the distances to these corresponding closest points.
As such, what you want to do is find the closest point for your dataset to each of the other points, ignoring self-distances. Therefore, you would set both X and Y to be the same, and set k = 2, discarding the first column of both outputs to get the result you're looking for.
Therefore:
[idx,d] = knnsearch(point_coordinates, point_coordinates, 'k', 2)
idx = idx(:,2);
d = d(:,2);
We thus get for idx and d:
>> idx
idx =
3
5
1
1
7
3
5
>> d
d =
17.3562
18.5316
17.3562
31.9027
13.7573
20.4624
13.7573
As such, this tells us that for the first point in your data set, it matched with point #3 the best. This matched with the closest distance of 17.3562. For the second point in your data set, it matched with point #5 the best with the closest distance being 18.5316. You can continue on with the rest of the results in a similar pattern.
If you don't have access to the statistics toolbox, consider reading my StackOverflow post on how I compute KNN from first principles.
Finding K-nearest neighbors and its implementation
In fact, it is very similar to Luis Mendo's post to you earlier.
Good luck!

How to connect a 3D points with a distance threshold Matlab

I have a vector of 3D points, let's say A, as shown below:
A=[
-0.240265581092000 0.0500598627544876 1.20715641293013
-0.344503191645519 0.390376667574812 1.15887540716612
-0.0931248606994074 0.267137193112796 1.24244644549763
-0.183530493218807 0.384249186312578 1.14512014134276
-0.0201358671977785 0.404732019283683 1.21816745283019
-0.242108038906952 0.229873488902244 1.24229940627651
-0.391349107031230 0.262170158259873 1.23856838565023
]
what I want to do is to connect 3D points with lines which only have distance less than a specific threshold T. I want to get a list of pairs of points needed to be connected. Such as,
[
( -0.240265581092000 0.0500598627544876 1.20715641293013), (-0.344503191645519 0.390376667574812 1.15887540716612);
(-0.0931248606994074 0.267137193112796 1.24244644549763),(-0.183530493218807 0.384249186312578 1.14512014134276),.....
]
So as shown, I'll have a vector of pairs of points needed to be connected. So if anyone could please advise how this can be done in Matlab.
The following example demonstrates how to accomplish this.
%# Build an example matrix
A = [1 2 3; 0 0 0; 3 1 3; 2 0 2; 0 1 0];
Threshold = 3;
%# Calculate distance between all points
D = pdist2(A, A);
%# Discard any points with distance greater than threshold
D(D > Threshold) = nan;
If you wish to extract an index of all observation pairs that are linked by a distance less than (or equal to) Threshold, as well as the corresponding distance (your question didn't specify what form you wanted the output to take, so I am essentially guessing here), then instead use the following:
%# Obtain a list of linear indices of observations less than or equal to Threshold
I1 = find(D <= Threshold);
%#Extract the actual distances, as well as the corresponding observation indices from A
[Obs1Index, Obs2Index] = ind2sub(size(D), I1);
DList = [Obs1Index, Obs2Index, D(I1)];
Note, pdist2 uses Euclidean distance by default, but there are other options - see the documentation here.
UPDATE: Based on the OP's comments, the following code will express the output as a K*6 matrix, where K is the number of distance measures less than the threshold value, and the first three columns of each row is the first data point (3 dimensions) and the second three columns of each row is the connected data point.
DList2 = [A(Obs1Index, :), A(Obs2Index, :)];
SECOND UPDATE: I have not made any assumptions on the distance measure in this answer. That is, I'm deliberately using pdist2 in case your distance measure is not symmetric. However, if you are using a symmetric distance measure, then you could probably speed up the run-time by using pdist instead, although my indexing code would need to be adjusted accordingly.
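For the symmetric case, a rough sketch of how the indexing might be adjusted for pdist is given below. It assumes the Euclidean default and relies on pdist listing pairs in column-major lower-triangle order. The names n, keep and Dcol are introduced here for illustration.

```matlab
A = [1 2 3; 0 0 0; 3 1 3; 2 0 2; 0 1 0];
Threshold = 3;
n = size(A, 1);

D = pdist(A);                                    % each pair appears exactly once
% Observation indices of every pdist entry (Obs2Index > Obs1Index)
[Obs2Index, Obs1Index] = find(tril(true(n), -1));

Dcol = D(:);
keep = Dcol <= Threshold;                        % pairs within the threshold
DList = [Obs1Index(keep), Obs2Index(keep), Dcol(keep)];
DList2 = [A(Obs1Index(keep), :), A(Obs2Index(keep), :)];
```

Unlike the pdist2 version, this lists each connected pair once and never pairs a point with itself.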
Plot3 and pdist2 can be used to achieve what you want.
D=pdist2(A,A);
T=0.2;
for i=1:7
for j=i+1:7
if D(i,j)<T && D(i,j)~=0
i
j
plot3(A([i j],1),A([i j],2),A([i j],3));
hold on;
fprintf('line is plotted\n');
pause;
end
end
end

Matlab Vectorization of Multivariate Gaussian Basis Functions

I have the following code for calculating the result of a linear combination of Gaussian functions. What I'd really like to do is to vectorize this somehow so that it's far more performant in Matlab.
Note that y is a column vector (output), x is a matrix where each column corresponds to a data point and each row corresponds to a dimension (i.e. 2 rows = 2D), variance is a double, gaussians is a matrix where each column is a vector corresponding to the mean point of the gaussian and weights is a row vector of the weights in front of each gaussian. Note that the length of weights is 1 bigger than gaussians as weights(1) is the 0th order weight.
function [ y ] = CalcPrediction( gaussians, variance, weights, x )
basisFunctions = size(gaussians, 2);
xvalues = size(x, 2);
if length(weights) ~= basisFunctions + 1
ME = MException('TRAIN:CALC', 'The number of weights should be equal to the number of basis functions plus one');
throw(ME);
end
y = weights(1) * ones(xvalues, 1);
for xIdx = 1:xvalues
for i = 1:basisFunctions
diff = x(:, xIdx) - gaussians(:, i);
y(xIdx) = y(xIdx) + weights(i+1) * exp(-(diff')*diff/(2*variance));
end
end
end
You can see that at the moment I simply iterate over the x vectors and then the gaussians inside 2 for loops. I'm hoping that this can be improved - I've looked at meshgrid but that seems to only apply to vectors (and I have matrices)
Thanks.
Try this
diffx = bsxfun(@minus,x,permute(gaussians,[1,3,2])); % binary operation with singleton expansion
diffx2 = squeeze(sum(diffx.^2,1)); % squared distances, shape is now [XVALUES,BASISFUNCTIONS]
weight_col = weights(:); % make sure weights is a column vector
y = weight_col(1) + exp(-diffx2/2/variance)*weight_col(2:end); % a column vector of length XVALUES, including the 0th-order weight
Note, I changed diff to diffx since diff is a builtin. I'm not sure this will improve performance, as the cost of allocating the intermediate arrays may offset the gain from vectorization.
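A quick way to gain confidence in a vectorised rewrite like this is to compare it against the original double loop on a small input. The sketch below does that with made-up data. It uses permute instead of squeeze so that the shape stays [XVALUES, BASISFUNCTIONS] even when there is only one data point, which is a choice made here and not part of the answer above.

```matlab
% Made-up inputs: three 2-D data points, two Gaussian centres
x = [0 1 2; 0 1 0];
gaussians = [0 1; 0 1];
variance = 0.5;
weights = [1 2 3];                    % weights(1) is the 0th-order weight

% Reference: the double loop from the question
N = size(x, 2);
M = size(gaussians, 2);
yLoop = weights(1) * ones(N, 1);
for xIdx = 1:N
    for i = 1:M
        d = x(:, xIdx) - gaussians(:, i);
        yLoop(xIdx) = yLoop(xIdx) + weights(i+1) * exp(-(d.'*d) / (2*variance));
    end
end

% Vectorised version
diffx = bsxfun(@minus, x, permute(gaussians, [1 3 2]));   % D-by-N-by-M
diffx2 = permute(sum(diffx.^2, 1), [2 3 1]);              % N-by-M squared distances
yVec = weights(1) + exp(-diffx2 / (2*variance)) * reshape(weights(2:end), [], 1);

maxDifference = max(abs(yLoop - yVec));
disp(maxDifference)
```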

how to find k-th nearest neighbor of a point in a set of point

I have a set of points (x,y) on a 2D plane. Given a point (x0,y0) and a number k, how can I find the k-th nearest neighbor of (x0,y0) in the point set? In detail, the point set is represented by two arrays: x and y. The point (x0,y0) is given by the index i0, meaning x0=x(i0) and y0=y(i0).
Is there any function in Matlab that helps with this problem? If Matlab doesn't have such a function, can you suggest any other effective way?
EDIT: I have to calculate this kind of distance for every point (x0,y0) in the set. The size of the set is about 1000. The value of k should be about sqrt(1500). The worst thing is that I do this many times. At each iteration, the set changes, and I calculate the distances again. So, the running time is a critical problem.
If you will do this check for many points, you might want to construct an inter-point distance table first:
squareform(pdist([x y]))
If you have the statistics toolbox, you can use the function knnsearch.
A brute force algorithm would be something like this:
array x[n] = ()
array y[n] = ()
array d[n] = ()
... populate x and y arrays with your n points ...
/* step over each point and calculate its distance from (x0, y0) */
for i = 1 to n
do
d[i] = distance((x0, y0), (x[i], y[i]))
end
/* sort the distances in increasing order */
sort(d)
/* the k'th element of d, is the k'th nearest point to (x0, y0) */
return d[k]
The free and opensource VLFeat toolbox contains a kd-tree implementation, amongst other useful things.
The brute force approach looks something like this:
%Create some points
n = 10;
x = randn(n,1);
y = randn(n,1);
%Choose x0
ix0 = randi(n);
%Get distances
d = sqrt(...
(x - x(ix0) ).^2 + ...
(y - y(ix0) ).^2 );
%Sort distances
[sorted_Dstances, ixSort] = sort(d);
%Get kth point
k = 3;
kth = [x(ixSort(k+1)); y(ixSort(k+1))]; %+1 since the first element will always be the x0 element.
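Since the question's edit asks for the k-th neighbour of every point in the set, the same brute-force idea can be applied to all points at once by sorting each row of the full distance matrix. This is only a sketch with made-up points on a line; for about 1000 points the matrix has about a million entries, which is usually manageable, but whether it is fast enough needs to be timed.

```matlab
% Made-up points on a line, so the distances are easy to check by hand
x = [0; 1; 3; 6; 10];
y = zeros(5, 1);
k = 2;

% Full pairwise distance matrix without toolbox functions
D = sqrt(bsxfun(@minus, x, x.').^2 + bsxfun(@minus, y, y.').^2);

% Sort each row; column 1 is the point itself (distance 0)
[sortedD, order] = sort(D, 2);
kthDist = sortedD(:, k+1);      % distance to the k-th nearest neighbour
kthIdx  = order(:, k+1);        % index of the k-th nearest neighbour
```

With duplicate points the first column is not guaranteed to be the point itself, so that case would need separate handling.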