Comparing two sets of vectors - matlab

I've got matrices A and B
size(A) = [n x]; size(B) = [n y];
Now I need to compare euclidian distance of each column vector of A from each column vector of B. I'm using dist method right now
Q = dist([A B]); Q = Q(1:x, x:end);
But it does also lot of needless work (like calculating distances between vectors of A and B separately).
What is the best way to calculate this?

You are looking for pdist2.
% Compute the ordinary Euclidean distance
D = pdist2(A.',B.','euclidean'); % euclidean distance
You should take the transpose of the matrices since pdist2 assumes the observations are in rows, not in columns.

An alternative solution to pdist2, if you don't have the Statistics Toolbox, is to compute this manually. For example, one way to do it is:
[X, Y] = meshgrid(1:size(A, 2), 1:size(B, 2)); %// or meshgrid(1:x, 1:y)
Q = sqrt(sum((A(:, X(:)) - B(:, Y(:))) .^ 2, 1));
The indices of the columns from A and B for each value in vector Q can be obtained by computing:
[X(:), Y(:)]
where each row contains a pair of indices: the first is the column index in matrix A, and the second is the column index in matrix B.

Another solution if you don't have pdist2 and which may also be faster for very large matrices is to vectorize the following mathematical fact:
||x-y||^2 = ||x||^2 + ||y||^2 - 2*dot(x,y)
where ||a|| is the L2-norm (euclidean norm) of a.
Comments:
C=-2*A'*B (this is a x by y matrix) is the vectorization of the dot products.
||x-y||^2 is the square of the euclidean distance which you are looking for.
Is that enough or do you need the explicit code?
The reason this may be faster asymptotically is that you avoid doing the metric calculation for all x*y comparisons, since you are instead making the bottleneck a matrix multiplication (matrix multiplication is highly optimized in matlab). You are taking advantage of the fact that this is the euclidean distance and not just some unknown metric.

Related

Calculate distance matrix for 3D points

I have the lists xA, yA, zA and the lists xB, yB, zB in Matlab. The contain the the x,y and z coordinates of points of type A and type B. There may be different numbers of type A and type B points.
I would like to calculate a matrix containing the Euclidean distances between points of type A and type B. Of course, I only need to calculate one half of the matrix, since the other half contains duplicate data.
What is the most efficient way to do that?
When I'm done with that, I want to find the points of type B that are closest to one point of type A. How do I then find the coordinates the closest, second closest, third closest and so on points of type B?
Given a matrix A of size [N,3] and a matrix B of size [M,3], you can use the pdist2 function to get a matrix of size [M,N] containing all the pairwise distances.
If you want to order the points from B by their distance to the rth point in A then you can sort the rth row of the pairwise distance matrix.
% generate some example data
N = 4
M = 7
A = randn(N,3)
B = randn(M,3)
% compute N x M matrix containing pairwise distances
D = pdist2(A, B, 'euclidean')
% sort points in B by their distance to the rth point in A
r = 3
[~, b_idx] = sort(D(r,:))
Then b_idx will contain the indices of the points in B sorted by their distance to the rth point in A. So the actual points in B ordered by b_idx can be obtained with B(b_idx,:), which has the same size as B.
If you want to do this for all r you could use
[~, B_idx] = sort(D, 1)
to sort all the rows of D at the same time. Then the rth row of B_idx will contain b_idx.
If your only objective is to find the closest k points in B for each point in A (for some positive integer k which is less than M), then we would not generally want to compute all the pairwise distances. This is because space-partitioning data structures like K-D-trees can be used to improve the efficiency of searching without explicitly computing all the pairwise distances.
Matlab provides a knnsearch function that uses K-D-trees for this exact purpose. For example, if we do
k = 2
B_kidx = knnsearch(B, A, 'K', k)
then B_kidx will be the first two columns of B_idx, i.e. for each point in A the indices of the nearest two points in B. Also, note that this is only going to be more efficient than the pdist2 method when k is relatively small. If k is too large then knnsearch will automatically use the explicit method from before instead of the K-D-tree approach.

Vectorize function that finds an array of nearest values

I am still wrapping my head around vectorization and I'm having a difficult time trying to resolve the following function I made...
for i = 1:size(X, 1)
min_n = inf;
for j=1:K
val = X(i,:)' - centroids(j,:)';
diff = val'*val;
if (diff < min_n)
idx(i) = j;
min_n = diff;
end
end
end
X is an array of (x,y) coordinates...
2 5
5 6
...
...
centroids in this example is limited to 3 rows. It is also in (x,y) format as shown above.
For every pair in X I am computing the closest pair of centroids. I then store the index of the centroid in idx.
So idx(i) = j means that I am storing the index j of the centroid at index i, where i corresponds to the index of X. This means the closest centroid to pair X(i, :) is at idx(i).
Can I possibly simplify this via vectorization? I struggle with just vectorizing the inner loop.
Here are three options. But please note that the disadvantage of vectorization, as compared to your double loops, is that it stores all the difference operation results at once, which means that if your matrices have many rows, you might run out of memory. On the other hand, the vectorized approach is probably much faster.
Option 1
If you have access to Statistics and Machine Learning Toolbox, you can use the function pdist2 to get all the pairwise distances between rows of two matrices. Then, the min function gives you the minimum of each column of the result. Its first returned value are the minimal values, and its second are the indices, which is what you need for idx:
diff = pdist2(centroids,X);
[~,idx] = min(diff);
Option 2
If you don't have access to the toolbox, you can use bsxfun. This will let you compute the difference operation between the two matrices even if their dimensions don't agree. All you need to do is to use shiftdim to reshape X' to have size [1,size(X,2),size(X,1)], and then reshapedX and and centroids are compatible with their dimensions (see documentation of bsxfun). This lets you take the difference between their values. The result is a three dimensional array, which you need to sum along the second dimension to get the norm of the differences between rows. At this point you can proceed as in option 1.
reshapedX = shiftdim(X',-1);
diff = bsxfun(#minus,centroids,reshapedX);
diff = squeeze(sum(diff.^2,2));
[~,idx] = min(diff);
Note: Starting in the Matlab version 2016b, the bsxfun is used implicitly and you do not need to call it anymore. So the line with bsxfun can be replaced with the simpler line diff = centroids-reshapedX.
Option 3
Use the function dsearchn, which performs exactly what you need:
idx = dsearchn(centroids,X);
it could be done using pdist2 - pairwise distances between rows of two matrices:
% random data
X = rand(500,2);
centroids = rand(3,2);
% pairwise distances
D = pdist2(X,centroids);
% closest centroid index for each X coordinates
[~,idx] = min(D,[],2)
% plot
scatter(centroids(:,1),centroids(:,2),300,(1:size(centroids,1))','filled');
hold on;
scatter(X(:,1),X(:,2),30,idx);
legend('Centroids','data');

Multiply two matrices in Matlab to obtain 3-dimensional matrix

I have two sparse matrices in Matlab, A and B,
and I want to compute a three-dimensional matrix C such that
C(i,j,k) = A(i,j) * B(j,k)
can I do this without a loop?
(Side question: Is there a name for this operation?)
Edit:
Seems my question has already been asked (just for full matrices):
Create a 3-dim matrix from two 2-dim matrices
For full matrices:
You can do it using bsxfun and shiftdim:
C = bsxfun(#times, A, shiftdim(B,-1))
Explanation: Let A be of size M x N and B of size N x P. Applying shiftdim(B,-1) gives a 1 x N x P array. bsxfun implicitly replicates A along the third dimension and shiftdim(B,-1) along the first to compute the desired element-wise product.
Another possibility, usually less efficient than bsxfun, is to repeat the arrays explicity along the desired dimensions, using repmat:
C = repmat(A, [1 1 size(B,2)]) .* repmat(shiftdim(B,-1), [size(A,1) 1 1])
For sparse matrices:
The result cannot be sparse, as sparse ND-arrays are not supported.. But you can do the computations with sparse A and B using linear indexing:
ind1 = repmat(1:numel(A),1,size(B,2));
ind2 = repmat(1:numel(B),size(A,1),1);
ind2 = ind2(:).';
C = NaN([size(A,1),size(A,2),size(B,2)]); %// preallocate with appropriate shape
C(:) = full(A(ind1).*B(ind2)); %// need to use full if C is to be 3D
Answer to your side question: the name for this operation is a hash join.

How to normalize a matrix of 3-D vectors

I have a 512x512x3 matrix that stores 512x512 there-dimensional vectors. What is the best way to normalize all those vectors, so that my result are 512x512 vectors with length that equals 1?
At the moment I use for loops, but I don't think that is the best way in MATLAB.
If the vectors are Euclidean, the length of each is the square root of the sum of the squares of its coordinates. To normalize each vector individually so that it has unit length, you need to divide its coordinates by its norm. For that purpose you can use bsxfun:
norm_A = sqrt(sum(A .^ 2, 3)_; %// Calculate Euclidean length
norm_A(norm_A < eps) == 1; %// Avoid division by zero
B = bsxfun(#rdivide, A, norm_A); %// Normalize
where A is your original 3-D vector matrix.
EDIT: Following Shai's comment, added a fix to avoid possible division by zero for null vectors.

Calculate distance from point p to high dimensional Gaussian (M, V)

I have a high dimensional Gaussian with mean M and covariance matrix V. I would like to calculate the distance from point p to M, taking V into consideration (I guess it's the distance in standard deviations of p from M?).
Phrased differentially, I take an ellipse one sigma away from M, and would like to check whether p is inside that ellipse.
If V is a valid covariance matrix of a gaussian, it then is symmetric positive definite and therefore defines a valid scalar product. By the way inv(V) also does.
Therefore, assuming that M and p are column vectors, you could define distances as:
d1 = sqrt((M-p)'*V*(M-p));
d2 = sqrt((M-p)'*inv(V)*(M-p));
the Matlab way one would rewrite d2as (probably some unnecessary parentheses):
d2 = sqrt((M-p)'*(V\(M-p)));
The nice thing is that when V is the unit matrix, then d1==d2and it correspond to the classical euclidian distance. To find wether you have to use d1 or d2is left as an exercise (sorry, part of my job is teaching). Write the multi-dimensional gaussian formula and compare it to the 1D case, since the multidimensional case is only a particular case of the 1D (or perform some numerical experiment).
NB: in very high dimensional spaces or for very many points to test, you might find a clever / faster way from the eigenvectors and eigenvalues of V (i.e. the principal axes of the ellipsoid and their corresponding variance).
Hope this helps.
A.
Consider computing the probability of the point given the normal distribution:
M = [1 -1]; %# mean vector
V = [.9 .4; .4 .3]; %# covariance matrix
p = [0.5 -1.5]; %# 2d-point
prob = mvnpdf(p,M,V); %# probability P(p|mu,cov)
The function MVNPDF is provided by the Statistics Toolbox
Maybe I'm totally off, but isn't this the same as just asking for each dimension: Am I inside the sigma?
PSEUDOCODE:
foreach(dimension d)
(M(d) - sigma(d) < p(d) < M(d) + sigma(d)) ?
Because you want to know if p is inside every dimension of your gaussian. So actually, this is just a space problem and your Gaussian hasn't have to do anything with it (except for M and sigma which are just distances).
In MATLAB you could try something like:
all(M - sigma < p < M + sigma)
A distance to that place could be, where I don't know the function for the Euclidean distance. Maybe dist works:
dist(M, p)
Because M is just a point in space and p as well. Just 2 vectors.
And now the final one. You want to know the distance in a form of sigma's:
% create a distance vector and divide it by sigma
M - p ./ sigma
I think that will do the trick.