Distance matrix calculation - matlab

I hope you will have the right answer to this question, which is rather challanging.
I have two uni-dimensional vectors Y and Z, which hold the coordinates of N grid points located on a square grid. So
Ny points along Y
Nz points along Z
N = Ny*Nz
Y = Y[N]; (Y holds N entries)
Z = Z[N]; (Z holds N entries)
Now, the goal would be generating the distance matrix D, which holds N*N entries: so each row of matrix D is defined as the distance between the i-th point on the grid and the remnant (N - i) points.
Generally, to compute the whole matrix I would call
D = squareform(pdist([Y Z]));
or,equivalently,
D = pdist2([Y Z],[Y Z]).
But, since D is a symmetric matrix, I'd like to generate only the N(N + 1)/2 independent entries and store them into a row-ordered vector Dd.
So the question is: how to generate a row-ordered array Dd whose entries are defined by the lower triangular terms of matrix D? I'd, furthermore, like storing the entries in a column-major order.
I hope the explanation is clear enough.

As woodchips commented, it is simpler and faster to compute the whole matrix and extract the elements you care about. Here's a way to do it:
ndx = find(tril(true(size(D))));
Dd = D(ndx);
If you absolutely must compute elements of Dd without computing matrix D first, you probably need a double for loop.

Related

Calculate distance matrix for 3D points

I have the lists xA, yA, zA and the lists xB, yB, zB in Matlab. The contain the the x,y and z coordinates of points of type A and type B. There may be different numbers of type A and type B points.
I would like to calculate a matrix containing the Euclidean distances between points of type A and type B. Of course, I only need to calculate one half of the matrix, since the other half contains duplicate data.
What is the most efficient way to do that?
When I'm done with that, I want to find the points of type B that are closest to one point of type A. How do I then find the coordinates the closest, second closest, third closest and so on points of type B?
Given a matrix A of size [N,3] and a matrix B of size [M,3], you can use the pdist2 function to get a matrix of size [M,N] containing all the pairwise distances.
If you want to order the points from B by their distance to the rth point in A then you can sort the rth row of the pairwise distance matrix.
% generate some example data
N = 4
M = 7
A = randn(N,3)
B = randn(M,3)
% compute N x M matrix containing pairwise distances
D = pdist2(A, B, 'euclidean')
% sort points in B by their distance to the rth point in A
r = 3
[~, b_idx] = sort(D(r,:))
Then b_idx will contain the indices of the points in B sorted by their distance to the rth point in A. So the actual points in B ordered by b_idx can be obtained with B(b_idx,:), which has the same size as B.
If you want to do this for all r you could use
[~, B_idx] = sort(D, 1)
to sort all the rows of D at the same time. Then the rth row of B_idx will contain b_idx.
If your only objective is to find the closest k points in B for each point in A (for some positive integer k which is less than M), then we would not generally want to compute all the pairwise distances. This is because space-partitioning data structures like K-D-trees can be used to improve the efficiency of searching without explicitly computing all the pairwise distances.
Matlab provides a knnsearch function that uses K-D-trees for this exact purpose. For example, if we do
k = 2
B_kidx = knnsearch(B, A, 'K', k)
then B_kidx will be the first two columns of B_idx, i.e. for each point in A the indices of the nearest two points in B. Also, note that this is only going to be more efficient than the pdist2 method when k is relatively small. If k is too large then knnsearch will automatically use the explicit method from before instead of the K-D-tree approach.

knnsearch from Matlab to Julia

I am trying to run a nearest neighbour search in Julia using NearestNeighbors.jl package. The corresponding Matlab code is
X = rand(10);
Y = rand(100);
Z = zeros(size(Y));
Z = knnsearch(X, Y);
This generates Z, a vector of length 100, where the i-th element is the index of X whose element is nearest to the i-th element in Y, for all i=1:100.
Could really use some help converting the last line of the Matlab code above to Julia!
Use:
X = rand(1, 10)
Y = rand(1, 100)
nn(KDTree(X), Y)[1]
The storing the intermediate KDTree object would be useful if you wanted to reuse it in the future (as it will improve the efficiency of queries).
Now what is the crucial point of my example. The NearestNeighbors.jl accepst the following input data:
It can either be:
a matrix of size nd × np with the points to insert in the tree where nd is the dimensionality of the points and np is the number of points
a vector of vectors with fixed dimensionality, nd, which must be part of the type.
I have used the first approach. The point is that observations must be in columns (not in rows as in your original code). Remember that in Julia vectors are columnar, so rand(10) is considered to be 1 observation that has 10 dimensions by NearestNeighbors.jl, while rand(1, 10) is considered to be 10 observations with 1 dimension each.
However, for your original data since you want a nearest neighbor only and it is single-dimensional and is small it is enough to write (here I assume X and Y are original data you have stored in vectors):
[argmin(abs(v - y) for v in X) for y in Y]
without using any extra packages.
The NearestNeighbors.jl is very efficient for working with high-dimensional data that has very many elements.

Find the minimum difference between any pair of elements between two vectors

Which of the following statements will find the minimum difference between any pair of elements (a,b) where a is from the vector A and b is from the vector B.
A. [X,Y] = meshgrid(A,B);
min(abs(X-Y))
B. [X,Y] = meshgrid(A,B);
min(abs(min(Y-X)))
C. min(abs(A-B))
D. [X,Y] = meshgrid(A,B);
min(min(abs(X-Y)))
Can someone please explain to me?
By saying "minimum difference between any pair of elements(a,b)", I presume you mean that you are treating A and B as sets and you intend to find the absolute difference in any possible pair of elements from these two sets. So in this case you should use your option D
[X,Y] = meshgrid(A,B);
min(min(abs(X-Y)))
Explanation: Meshgrid turns a pair of 1-D vectors into 2-D grids. This link can explain what I mean to say:
http://www.mathworks.com/help/matlab/ref/meshgrid.html?s_tid=gn_loc_drop
Hence (X-Y) will give the difference in all possible pairs (a,b) such that a belongs to A and b belongs to B. Note that this will be a 2-D matrix.
abs(X-Y) would return the absolute values of all elements in this matrix (the absolute difference in each pair).
To find the smallest element in this matrix you will have to use min(min(abs(X-Y))). This is because if Z is a matrix, min(Z) treats the columns of Z as vectors, returning a row vector containing the minimum element from each column. So a single min command will give a row vector with each element being the min of the elements of that column. Using min for a second time returns the min of this row vector. This would be the smallest element in the entire matrix.
This can help:
http://www.mathworks.com/help/matlab/ref/min.html?searchHighlight=min
Options C is correct if you treat A and B as vectors and not sets. In this case you won't be considering all possible pairs. You'll end up finding the minimum of (a-b) where a,b are both in the same position in their corresponding vectors (pair-wise difference).
D. [X,Y] = meshgrid(A,B);
min(min(abs(X-Y)))
meshgrid will generate two grids - X and Y - from the vectors, which are arranged so that X-Y will generate all combinations of ax-bx where ax is in a and bx is in b.
The rest of the expression just gets the minimum absolute value from the array resulting from the subtraction, which is the value you want.
CORRECT ANSWER IS D
Let m = size(A) and n = size(B)
You want to subtract each pair of (a,b) such that a is from vector A and b is from vector B.
meshgrid(A,B) creates two matrices X Y both of size nxm where X have rows sames have vector A while Yhas columns same as vector B .
Hence , Z = X-Y will give you a matrix with n*m values corresponding to the difference between each pair of values taken from A and B . Now all you have to do is to find the absolute minimum among all values of Z.
You can do that by
req_min = min(min(abs(z)))
The whole code is
[X Y ] = meshgrid(A,B);
Z= X-Y;
Z = abs(Z);
req_min = min(min(Z));
You could also use bsxfun instead of meshgrid:
min(min(abs(bsxfun(#minus, A(:), B(:).'))))
Or use pdist2:
min(min(pdist2(A(:),B(:))))

Comparing two sets of vectors

I've got matrices A and B
size(A) = [n x]; size(B) = [n y];
Now I need to compare euclidian distance of each column vector of A from each column vector of B. I'm using dist method right now
Q = dist([A B]); Q = Q(1:x, x:end);
But it does also lot of needless work (like calculating distances between vectors of A and B separately).
What is the best way to calculate this?
You are looking for pdist2.
% Compute the ordinary Euclidean distance
D = pdist2(A.',B.','euclidean'); % euclidean distance
You should take the transpose of the matrices since pdist2 assumes the observations are in rows, not in columns.
An alternative solution to pdist2, if you don't have the Statistics Toolbox, is to compute this manually. For example, one way to do it is:
[X, Y] = meshgrid(1:size(A, 2), 1:size(B, 2)); %// or meshgrid(1:x, 1:y)
Q = sqrt(sum((A(:, X(:)) - B(:, Y(:))) .^ 2, 1));
The indices of the columns from A and B for each value in vector Q can be obtained by computing:
[X(:), Y(:)]
where each row contains a pair of indices: the first is the column index in matrix A, and the second is the column index in matrix B.
Another solution if you don't have pdist2 and which may also be faster for very large matrices is to vectorize the following mathematical fact:
||x-y||^2 = ||x||^2 + ||y||^2 - 2*dot(x,y)
where ||a|| is the L2-norm (euclidean norm) of a.
Comments:
C=-2*A'*B (this is a x by y matrix) is the vectorization of the dot products.
||x-y||^2 is the square of the euclidean distance which you are looking for.
Is that enough or do you need the explicit code?
The reason this may be faster asymptotically is that you avoid doing the metric calculation for all x*y comparisons, since you are instead making the bottleneck a matrix multiplication (matrix multiplication is highly optimized in matlab). You are taking advantage of the fact that this is the euclidean distance and not just some unknown metric.

Calculate distance from point p to high dimensional Gaussian (M, V)

I have a high dimensional Gaussian with mean M and covariance matrix V. I would like to calculate the distance from point p to M, taking V into consideration (I guess it's the distance in standard deviations of p from M?).
Phrased differentially, I take an ellipse one sigma away from M, and would like to check whether p is inside that ellipse.
If V is a valid covariance matrix of a gaussian, it then is symmetric positive definite and therefore defines a valid scalar product. By the way inv(V) also does.
Therefore, assuming that M and p are column vectors, you could define distances as:
d1 = sqrt((M-p)'*V*(M-p));
d2 = sqrt((M-p)'*inv(V)*(M-p));
the Matlab way one would rewrite d2as (probably some unnecessary parentheses):
d2 = sqrt((M-p)'*(V\(M-p)));
The nice thing is that when V is the unit matrix, then d1==d2and it correspond to the classical euclidian distance. To find wether you have to use d1 or d2is left as an exercise (sorry, part of my job is teaching). Write the multi-dimensional gaussian formula and compare it to the 1D case, since the multidimensional case is only a particular case of the 1D (or perform some numerical experiment).
NB: in very high dimensional spaces or for very many points to test, you might find a clever / faster way from the eigenvectors and eigenvalues of V (i.e. the principal axes of the ellipsoid and their corresponding variance).
Hope this helps.
A.
Consider computing the probability of the point given the normal distribution:
M = [1 -1]; %# mean vector
V = [.9 .4; .4 .3]; %# covariance matrix
p = [0.5 -1.5]; %# 2d-point
prob = mvnpdf(p,M,V); %# probability P(p|mu,cov)
The function MVNPDF is provided by the Statistics Toolbox
Maybe I'm totally off, but isn't this the same as just asking for each dimension: Am I inside the sigma?
PSEUDOCODE:
foreach(dimension d)
(M(d) - sigma(d) < p(d) < M(d) + sigma(d)) ?
Because you want to know if p is inside every dimension of your gaussian. So actually, this is just a space problem and your Gaussian hasn't have to do anything with it (except for M and sigma which are just distances).
In MATLAB you could try something like:
all(M - sigma < p < M + sigma)
A distance to that place could be, where I don't know the function for the Euclidean distance. Maybe dist works:
dist(M, p)
Because M is just a point in space and p as well. Just 2 vectors.
And now the final one. You want to know the distance in a form of sigma's:
% create a distance vector and divide it by sigma
M - p ./ sigma
I think that will do the trick.