Different behaviour for pdist and pdist2 - matlab

In a MATLAB code I am using the kullback_leibler_divergence dissimilarity function that can be found here.
I have a matrix A and I compute the dissimilarity matrix using the downloaded function.
In theory, if I calculate
clear
A = rand(132,6); % input matrix
diss_mat = pdist(A,@kullback_leibler_divergence); % calculate the dissimilarity
square_diss_mat = squareform(diss_mat); % put the dissimilarities in a square matrix
one_dist = pdist2(A(1,:),A,@kullback_leibler_divergence);
I should get the first row of square_diss_mat equal to one_dist, but I am not.
If I use the Euclidean distance I get it:
diss_mat = pdist(A);
square_diss_mat = squareform(diss_mat);
one_dist = pdist2(A(1,:),A);
Could you please tell me why?

The kullback_leibler_divergence is not symmetric, thus the order matters:
one_dist = pdist2(A, A(1,:), @kullback_leibler_divergence);
I don't see any practical application using a non-symmetric function with pdist or pdist2.
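To see the asymmetry concretely, here is a minimal sketch. The anonymous function kl is a hypothetical stand-in for the downloaded kullback_leibler_divergence (I am not assuming anything about that file's exact normalization), and p and q are made-up distributions. It only illustrates that swapping the arguments changes the value:

```matlab
% Hypothetical stand-in for a KL divergence between two discrete
% distributions p and q (row vectors that each sum to 1).
kl = @(p, q) sum(p .* log(p ./ q));

p = [0.5 0.5];
q = [0.9 0.1];

d_pq = kl(p, q);   % p is the first argument
d_qp = kl(q, p);   % same inputs, arguments swapped

% The two values differ, so which input goes first matters.
fprintf('kl(p,q) = %.4f, kl(q,p) = %.4f\n', d_pq, d_qp);
```

A Euclidean distance gives the same number either way, which is why the pdist/pdist2 comparison in the question works for it.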

Related

How to calculate cosine similarity between two frequency vectors in MATLAB?

I need to find the cosine similarity between two frequency vectors in MATLAB.
Example vectors:
a = [2,3,4,4,6,1]
b = [1,3,2,4,6,3]
How do I measure the cosine similarity between these vectors in MATLAB?
Take a quick look at the mathematical definition of Cosine similarity.
From the definition, you just need the dot product of the vectors divided by the product of the Euclidean norms of those vectors.
% MATLAB 2018b
a = [2,3,4,4,6,1];
b = [1,3,2,4,6,3];
cosSim = sum(a.*b)/sqrt(sum(a.^2)*sum(b.^2)); % 0.9436
Alternatively, you could use
cosSim = (a(:).'*b(:))/sqrt(sum(a.^2)*sum(b.^2)); % 0.9436
which gives the same result.
After reading this correct answer, to avoid sending you to another castle I've added another approach using MATLAB's built-in linear algebra functions, dot() and norm().
cosSim = dot(a,b)/(norm(a)*norm(b)); % 0.9436
See also the tag-wiki for cosine-similarity.
Performance by approach (plot omitted). The three approaches compared were:
sum(a.*b)/sqrt(sum(a.^2)*sum(b.^2))
(a(:).'*b(:))/sqrt(sum(a.^2)*sum(b.^2))
dot(a,b)/(norm(a)*norm(b))
Each point represents the geometric mean of the computation times for 10 randomly generated vectors.
If you have the Statistics toolbox, you can use the pdist2 function with the 'cosine' input flag, which gives 1 minus the cosine similarity:
a = [2,3,4,4,6,1];
b = [1,3,2,4,6,3];
result = 1-pdist2(a, b, 'cosine');

Vectorize function that finds an array of nearest values

I am still wrapping my head around vectorization and I'm having a difficult time trying to resolve the following function I made...
for i = 1:size(X, 1)
    min_n = inf;
    for j=1:K
        val = X(i,:)' - centroids(j,:)';
        diff = val'*val;
        if (diff < min_n)
            idx(i) = j;
            min_n = diff;
        end
    end
end
X is an array of (x,y) coordinates...
2 5
5 6
...
...
centroids in this example is limited to 3 rows. It is also in (x,y) format as shown above.
For every pair in X I am computing the closest pair of centroids. I then store the index of the centroid in idx.
So idx(i) = j means that I am storing the index j of the centroid at index i, where i corresponds to the index of X. This means the closest centroid to pair X(i, :) is at idx(i).
Can I possibly simplify this via vectorization? I struggle with just vectorizing the inner loop.
Here are three options. But please note that the disadvantage of vectorization, as compared to your double loops, is that it stores all the difference operation results at once, which means that if your matrices have many rows, you might run out of memory. On the other hand, the vectorized approach is probably much faster.
Option 1
If you have access to the Statistics and Machine Learning Toolbox, you can use the function pdist2 to get all the pairwise distances between rows of two matrices. The min function then gives you the minimum of each column of the result. Its first output holds the minimal values and its second holds their indices, which is what you need for idx:
diff = pdist2(centroids,X);
[~,idx] = min(diff);
Option 2
If you don't have access to the toolbox, you can use bsxfun. This lets you compute the difference between the two matrices even though their dimensions don't agree. All you need to do is use shiftdim to reshape X' to have size [1,size(X,2),size(X,1)], after which reshapedX and centroids have compatible dimensions (see the documentation of bsxfun). This lets you take the difference between their values. The result is a three-dimensional array, whose squares you sum along the second dimension to get the squared norm of the differences between rows. At this point you can proceed as in option 1.
reshapedX = shiftdim(X',-1);
diff = bsxfun(@minus,centroids,reshapedX);
diff = squeeze(sum(diff.^2,2));
[~,idx] = min(diff);
Note: Starting with MATLAB R2016b, implicit expansion does what bsxfun does here, so you do not need to call it anymore. The line with bsxfun can be replaced with the simpler line diff = centroids-reshapedX.
Option 3
Use the function dsearchn, which performs exactly what you need:
idx = dsearchn(centroids,X);
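As a toolbox-free sanity check, the sketch below runs the double loop from the question and a vectorized variant on a small made-up data set and compares the indices. The data and the names idx_loop, idx_vec, and sqDist are assumptions for illustration. Note that it arranges the distances as one row per point (N-by-K), so min is taken along the second dimension:

```matlab
% Made-up data: 5 points and 3 centroids in 2-D.
X = [2 5; 5 6; 1 1; 8 8; 4 4];
centroids = [1 2; 5 5; 9 9];
N = size(X, 1);
K = size(centroids, 1);

% Reference result from the double loop in the question.
idx_loop = zeros(N, 1);
for i = 1:N
    min_n = inf;
    for j = 1:K
        val = X(i, :) - centroids(j, :);
        d = val * val';              % squared Euclidean distance
        if d < min_n
            idx_loop(i) = j;
            min_n = d;
        end
    end
end

% Vectorized: move the centroids to a 1-by-2-by-K array so that the
% subtraction expands to an N-by-2-by-K array of differences.
diffs = bsxfun(@minus, X, permute(centroids, [3 2 1]));
sqDist = reshape(sum(diffs.^2, 2), N, K);   % N-by-K squared distances
[~, idx_vec] = min(sqDist, [], 2);          % closest centroid per point

disp(isequal(idx_loop, idx_vec));
```

The intermediate array diffs holds N*2*K values at once, which is the memory cost mentioned above.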
It could also be done using pdist2, which computes the pairwise distances between rows of two matrices:
% random data
X = rand(500,2);
centroids = rand(3,2);
% pairwise distances
D = pdist2(X,centroids);
% closest centroid index for each X coordinates
[~,idx] = min(D,[],2)
% plot
scatter(centroids(:,1),centroids(:,2),300,(1:size(centroids,1))','filled');
hold on;
scatter(X(:,1),X(:,2),30,idx);
legend('Centroids','data');

Defining an efficient distance function in matlab

I'm using kNN search function in matlab, but I'm calculating the distance between two objects of my own defined class, so I've written a new distance function. This is it:
function d = allRepDistance(obj1, obj2)
%calculates the min dist. between repr.
% obj2 is a vector, to fit kNN function requirements
n = size(obj2,1);
d = zeros(n,1);
for i=1:n
    M = dist(obj1.Repr, [obj2(i,:).Repr]');
    d(i) = min(min(M));
end
end
The difference is that obj.Repr may be a matrix, and I want to calculate the minimal distance between all the rows of each argument. But even if obj1.Repr is just a vector, which gives essentially the normal Euclidean distance between two vectors, the kNN function is slower by a factor of 200!
I've checked the performance of just the distance function (no kNN). I measured the time it takes to calculate the distance between a vector and the rows of a matrix (when they are in the object), and it is slower by a factor of 3 than the normal distance function.
Does that make any sense? Is there a solution?
You are using dist(), which corresponds to the Euclidean distance weight function. However, you are not weighting your data, i.e. you don't consider one dimension to be more important than the others. Thus, you can directly use the Euclidean distance via pdist():
function d = allRepDistance(obj1, obj2)
% calculates the min dist. between repr.
% obj2 is a vector, to fit kNN function requirements
n = size(obj2,1);
d = zeros(n,1);
for i=1:n
    X = [obj1.Repr, obj2(i,:).Repr'];
    M = pdist(X,'euclidean');
    d(i) = min(min(M));
end
end
BTW, I don't know your matrix dimensions, so you will need to deal with the concatenation of elements to create X correctly.
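For reference, here is a minimal toolbox-free sketch of the quantity the question describes: the smallest Euclidean distance between any row of one matrix and any row of another. The matrices A and B are made-up stand-ins for obj1.Repr and one obj2(i).Repr, and I am assuming each stores one representative per row:

```matlab
% Made-up stand-ins for two Repr matrices, one representative per row.
A = [0 0; 3 4];
B = [6 8; 3 5; 10 10];

% Squared distances between every row of A and every row of B, using
% ||a-b||^2 = ||a||^2 + ||b||^2 - 2*a.b
sqD = bsxfun(@plus, sum(A.^2, 2), sum(B.^2, 2).') - 2 * (A * B.');

% Clamp tiny negative round-off before the square root.
minDist = sqrt(max(min(sqD(:)), 0));
disp(minDist);
```

Only cross distances between A and B are computed here, so rows within the same matrix are never compared with each other.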

Calculating the covariance of a 1000 5x5 matrices in matlab

I have a 1000 5x5 matrices (Xm) like this:
Each $(x_{ij})_m$ is a point estimate drawn from a distribution. I'd like to calculate the covariance of each $x_{ij}$, where $i = 1..n$ and $j = 1..n$, in the direction of the red arrow.
For example, the variance of $X_m$ is var(X,0,3), which gives a 5x5 matrix of variances. Can I calculate the covariance in the same way?
Attempt at answer
So far I've done this:
for m=1:1000
Xm_new(m,:)=reshape(Xm(:,:,m)',25,1);
end
cov(Xm_new)
spy(Xm_new) gives me this unusual looking sparse matrix:
If you look at cov (edit cov in the command window) you might see why it doesn't support multi-dimensional arrays. It performs a transpose and a matrix multiplication of the input matrices: xc' * xc. Neither operation supports multi-dimensional arrays, and I guess whoever wrote the function decided not to do the work to generalize it (it still might be good to contact MathWorks and make a feature request).
In your case, if we take the basic code from cov and make a few assumptions, we can write a covariance function M-file that supports 3-D arrays:
function x = cov3d(x)
% Based on Matlab's cov, version 5.16.4.10
[m,n,p] = size(x);
if m == 1
    x = zeros(n,n,p,class(x));
else
    x = bsxfun(@minus,x,sum(x,1)/m);
    for i = 1:p
        xi = x(:,:,i);
        x(:,:,i) = xi'*xi;
    end
    x = x/(m-1);
end
Note that this simple code assumes that x is a series of 2-D matrices stacked up along the third dimension, and that the normalization flag is 0, the default in cov. It could be expanded to multiple dimensions like var with a bit of work. In my timings, it's over 10 times faster than a function that calls cov(x(:,:,i)) in a for loop.
Yes, I used a for loop. There may or may not be faster ways to do this, but in this case for loops are going to be faster than most schemes, especially when the size of your array is not known a priori.
The answer below also works for a rectangular matrix xi=x(:,:,i)
function xy = cov3d(x)
[m,n,p] = size(x);
xy = zeros(n,n,p,class(x)); % n-by-n result per slice; stays zero when m == 1
if m > 1
    xc = bsxfun(@minus,x,sum(x,1)/m); % remove the column means of each slice
    for i = 1:p
        xci = xc(:,:,i);
        xy(:,:,i) = xci'*xci;
    end
    xy = xy/(m-1);
end
My answer is very similar to horchler's; however, horchler's code does not work with rectangular matrices xi (whose dimensions differ from those of xi'*xi).
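A quick way to check the rectangular case is to compare the slice-wise result against the built-in cov. The sketch below inlines the same centering and product steps on made-up 3-by-2 slices; the data and the name maxErr are assumptions for illustration:

```matlab
% Made-up data: two 3-by-2 slices (m = 3 observations, n = 2 variables).
x = cat(3, [1 2; 3 5; 4 9], [2 0; 1 1; 7 3]);
[m, n, p] = size(x);

% Center each column of each slice, then form the n-by-n products.
xc = bsxfun(@minus, x, sum(x, 1) / m);
xy = zeros(n, n, p);
for i = 1:p
    xci = xc(:, :, i);
    xy(:, :, i) = (xci' * xci) / (m - 1);
end

% Largest absolute deviation from cov applied slice by slice.
maxErr = 0;
for i = 1:p
    maxErr = max(maxErr, max(max(abs(xy(:, :, i) - cov(x(:, :, i))))));
end
disp(maxErr);
```

Each output slice is n-by-n (here 2-by-2) even though each input slice is m-by-n, which is why the result needs its own array rather than being written back into x.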

How do I create a similarity matrix in MATLAB?

I am working towards comparing multiple images. I have these image data as column vectors of a matrix called "images." I want to assess the similarity of images by first computing their Euclidean distance. I then want to create a matrix over which I can execute multiple random walks. Right now, my code is as follows:
% clear
% clc
% close all
%
% load tea.mat;
images = Input.X;
M = zeros(size(images, 2), size (images, 2));
for i = 1:size(images, 2)
for j = 1:size(images, 2)
normImageTemp = sqrt((sum((images(:, i) - images(:, j))./256).^2));
%Need to accurately select the value of gamma_i
gamma_i = 1/10;
M(i, j) = exp(-gamma_i.*normImageTemp);
end
end
My matrix M however, ends up having a value of 1 along its main diagonal and zeros elsewhere. I'm expecting "large" values for the first few elements of each row and "small" values for elements with column index > 4. Could someone please explain what is wrong? Any advice is appreciated.
Since you're trying to compute a Euclidean distance, it looks like you have an error in where your parentheses are placed when you compute normImageTemp. You have this:
normImageTemp = sqrt((sum((...)./256).^2));
%#                   ^--- Note that this parenthesis...
But you actually want to do this:
normImageTemp = sqrt(sum(((...)./256).^2));
%#                       ^--- ...should be here
In other words, you need to perform the element-wise squaring, then the summation, then the square root. What you are doing now is summing elements first, then squaring and taking the square root of the summation, which essentially cancel each other out (or are actually the equivalent of just taking the absolute value).
Incidentally, you can actually use the function NORM to perform this operation for you, like so:
normImageTemp = norm((images(:, i) - images(:, j))./256);
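A tiny made-up example shows the difference between the two orderings (the vector v and the variable names are only for illustration):

```matlab
v = [3 -4];                        % made-up difference vector

summedFirst  = sqrt(sum(v).^2);    % sum, then square: abs(3 - 4)
squaredFirst = sqrt(sum(v.^2));    % square, then sum: sqrt(9 + 16)
viaNorm      = norm(v);            % built-in Euclidean norm

fprintf('%g %g %g\n', summedFirst, squaredFirst, viaNorm);
```

Positive and negative entries cancel in summedFirst, while squaredFirst and viaNorm agree.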
The results you're getting seem reasonable. Recall the behavior of the exp(-x). When x is zero, exp(-x) is 1. When x is large exp(-x) is zero.
Perhaps if you make M(i,j) = normImageTemp; you'd see what you expect to see.
Consider this solution:
I = Input.X;
D = squareform( pdist(I') ); %'# euclidean distance between columns of I
M = exp(-(1/10) * D); %# similarity matrix between columns of I
PDIST and SQUAREFORM are functions from the Statistics Toolbox.
Otherwise consider this equivalent vectorized code (using only built-in functions):
%# we know that: ||u-v||^2 = ||u||^2 + ||v||^2 - 2*u.v
X = sum(I.^2,1);
D = real( sqrt(bsxfun(@plus,X,X')-2*(I'*I)) );
M = exp(-(1/10) * D);
As was explained in the other answers, D is the distance matrix, while exp(-D) is the similarity matrix (which is why you get ones on the diagonal)
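To make the identity-based version concrete, here is a small sketch on made-up data with three two-element "images" stored as columns, so the expected distances are easy to verify by hand:

```matlab
% Made-up data: columns are the points (1,2), (0,0) and (3,4).
I = [1 0 3; 2 0 4];

% ||u-v||^2 = ||u||^2 + ||v||^2 - 2*u.v for every pair of columns
X = sum(I.^2, 1);
D = real(sqrt(bsxfun(@plus, X, X') - 2 * (I' * I)));

% Similarity: 1 on the diagonal, decaying with distance.
M = exp(-(1/10) * D);
disp(D);
disp(M);
```

The distance between (0,0) and (3,4) comes out as 5, and every diagonal entry of M is exp(0) = 1.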
There is an already implemented function, pdist. If you have a matrix A whose rows are the observations, you can directly compute the pairwise distance matrix:
Sim= squareform(pdist(A))