Finding the shortest distance between a 2D array of peaks - matlab

I'm trying to use MATLAB to help automate the analysis of microscopy data from our lab. We have images consisting of 2D arrays of points that correspond to atom positions. We would first like to fit these positions to Gaussians, then find the shortest distances between the positions and draw vectors between them.
I've been using the Fast 2D Peak Finder to locate the peak positions and it works quite well. However, I'm having trouble identifying the shortest distances and plotting vectors between them. Does anyone have an idea for how this might work? Thanks for your help!

Assuming that you identify n peaks and store their coordinates in an n-by-2 matrix X, you can calculate the pairwise distances between these peaks using D = pdist(X) (this function requires the Statistics Toolbox). By default, this computes the Euclidean distance between each pair of points.
The returned vector, D, is a list of pairwise distances; the pdist() documentation describes how these distances are ordered. I recommend following D = pdist(X) with D = squareform(D) to convert the vector into a pairwise distance matrix.
Then you simply need to identify the k shortest distances you are interested in and plot the corresponding pairs of points.
I have provided an example of one way to do this below.
X = rand(10,2); % Generate random 2D points
k = 3; % Number of closest pairs of points to choose
Y = [];
D = pdist(X); % Get vector of distances
B = sort(D,'ascend'); % Sort distances
D = tril(squareform(D)); % Convert distance vector to lower triangular matrix
% Find k pairs of rows of X corresponding to closest peaks
[Y(:,1),Y(:,2)] = find(ismember(D,B(1:k)));
% Plot results
figure; hold on;
plot(X(:,1),X(:,2),'b+'); % Plot "peaks"
for i = 1:k
    plot(X([Y(i,1),Y(i,2)],1),X([Y(i,1),Y(i,2)],2),'r'); % Draw a line between each pair of closest peaks
end
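As a side note, on newer releases (R2017b and later) the pair-selection step can be written without sort() and ismember() by using mink(); a minimal sketch, reusing X and k from above (variable names are illustrative):
Dm = squareform(pdist(X)); % full symmetric distance matrix
Dm(triu(true(size(Dm)))) = Inf; % mask the diagonal and upper triangle
[~,lin] = mink(Dm(:),k); % linear indices of the k smallest distances
[rows,cols] = ind2sub(size(Dm),lin); % row pairs of X, playing the role of Y above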

Related

matlab: upsampling a zero contour

I am using the undocumented contours function in MATLAB to obtain the zero contour of a function mapping R^2 to R. The function contours() calls contourc() to obtain the zero contour in terms of array indices, and contours() then performs linear interpolation to convert from (M,N) array indices to (Y,X) data coordinates. This works well and gives a zero contour which is essentially accurate to machine precision. However, attempting to interpolate along this zero contour to obtain additional points fails, apparently because the interpolated points deviate to a small extent from the true zero contour; the resulting error is too large for the intended application.
As a simple example, one can compute a zero contour of the peaks function, compute the function values along the contour, and then compute the function values at new points interpolated along the contour. The maximum error is roughly eps(max(Z(:))) for the points found by contours/contourc and about ten orders of magnitude higher for the interpolated points.
% compute the zero contours; keep part of one contour
[X,Y,Z] = peaks(999);
XY0 = contours(X, Y, Z, [0 0]);
XY0 = XY0(:,140:(XY0(2,1)+1)); % keep only ascending values in the first contour for interpolation
% interpolate points along the zero contour
x2y = griddedInterpolant(XY0(1,:), XY0(2,:), 'linear','none');
X0i = linspace(XY0(1,1), XY0(1,end), 1e4);
Y0i = x2y(X0i);
% compute values of the function along the zero contour
Zi = interp2(X,Y,Z, XY0(1,:), XY0(2,:), 'linear', NaN);
Z0i = interp2(X,Y,Z, X0i, Y0i, 'linear', NaN);
% plot results
figure;
subplot(1,3,1); plot(XY0(1,:), XY0(2,:), '.'); hold on; plot(X0i, Y0i, 'LineWidth',1);
xlabel('X_0'); ylabel('Y_0'); title('(X_0,Y_0), (X_{0i},Y_{0i})');
subplot(1,3,2); plot(XY0(1,:), Zi, '.');
xlabel('X_0'); ylabel('Z_0'); title('f(X_0,Y_0)');
subplot(1,3,3); plot(XY0(1,:), Zi, '.'); hold on; plot(X0i, Z0i, '.');
xlabel('X_0'); ylabel('Z_0'); title('f(X_{0i},Y_{0i})');
The error is almost certainly because the interpolant does not truly follow the zero contour (i.e., having the same tangent, curvature, and indeed all higher derivatives at each point). Each interpolated point thus deviates to a small degree from the path of the zero contour. This is true for all tested methods of interpolation. If the functional form of the zero contour were known, one could fit a function to the zero contour, which would likely give satisfactory results. However, for the application motivating this question, an analytical solution of the zero contour is not possible.
Given the inadequacy of interpolation to this task, how can new points along the zero contour be obtained? The points should be such that the value of the function is as close to zero as for the set of points returned by contourc. Increasing the grid resolution globally is not an option since the memory and computational expense increase with the square of the resolution. Locally sampling at high resolution in an iterative manner over consecutive short intervals of the zero contour is an option, but would take time to implement, and I'm not aware of existing code (e.g., in the FEX) to perform such a task. Also, since contourc is closed-source, adapting it to this task is not an option.
I solved this problem by taking the interpolated (X,Y) values as an approximation of the true zero contour, and then reinterpolating the Y coordinate of each pair against the Z values of the surface flanking the approximate zero, to locate the Y coordinate at which Z=0 for each (X,Y) pair. In so doing, the uniform spacing of the interpolated X values is preserved.
Specifically, for each (iX,iY) pair, where iX and iY denote the (fractional) array indices of the X and Y coordinate grids for the points interpolated along the zero contour, the approximate iY value is reinterpolated with respect to the values of Z at the positions flanking the approximate zero in the Y dimension, i.e., at (iX,floor(iY)) and (iX,ceil(iY)). The Z values at these flanking points are themselves obtained by bilinear interpolation of Z.
After reinterpolation, the magnitude of the error is the same as that of the points originally returned by contourc/contours.
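A minimal sketch of that reinterpolation step, reusing X, Y, Z, X0i and Y0i from the snippet above; it assumes a uniform grid in Y, that Y0i contains no NaNs, and the variable names are illustrative:
yv = Y(:,1); % grid vector in Y (meshgrid layout)
dy = yv(2) - yv(1); % uniform Y spacing
iY = (Y0i - yv(1))/dy + 1; % fractional row index of each interpolated point
yLo = yv(floor(iY)).'; % Y coordinates of the flanking grid rows
yHi = yv(ceil(iY)).';
zLo = interp2(X, Y, Z, X0i, yLo, 'linear'); % Z on the flanking rows at each X0i
zHi = interp2(X, Y, Z, X0i, yHi, 'linear');
Y0r = yLo - zLo.*(yHi - yLo)./(zHi - zLo); % linear zero crossing in Y
Y0r(zHi == zLo) = Y0i(zHi == zLo); % keep points that already sit on a grid row
Z0r = interp2(X, Y, Z, X0i, Y0r, 'linear'); % residual; should be near eps(max(Z(:)))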

How to order one-dimensional matrices based on values

I want to determine a point in space by geometry, and I have math computations that give me several theta values. After evaluating the theta values, I get N matrices of dimension 1 x 3, where N is the number of thetas evaluated. Since I have my target point, I only need to decide which of the matrices is closest to the target, with adequate weight given to all three coordinates (x,y,z).
Take a look at the analysis in the figure below:
Fig 1: Determining Closest Point with all points having minimal error
In Fig 1 it can easily be seen that the third matrix is closest, using sum(abs(Matrix[x,y,z])). However, if the same method is applied to the situation in the figure below, the result is obviously wrong.
Fig 2: One point has the closest values on two of the reference point's axes
Looking at point B, it is closer to the reference point on the y- and z-axes, but it strays greatly on the x-axis.
So how can I evaluate the matrices and select the one closest to the reference point, with adequate emphasis on the error differences in all coordinates (x,y,z)?
If your results are in terms of (x,y,z), why not evaluate the Euclidean distance of each candidate point from the reference point?
Something like this in MATLAB:
Ref_point = [48.98, 20.56, -1.44]; % target point
Curr_point = [x,y,z]; % one candidate point
Xd = (Curr_point(1)-Ref_point(1))^2;
Yd = (Curr_point(2)-Ref_point(2))^2;
Zd = (Curr_point(3)-Ref_point(3))^2;
distance = sqrt(Xd + Yd + Zd);
% repeat for each candidate and find the minimum distance
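A vectorized sketch of the same idea, where P is a hypothetical N-by-3 matrix holding the candidate points obtained from the theta values (random data here just for illustration):
Ref_point = [48.98, 20.56, -1.44];
P = rand(5,3)*50; % hypothetical N-by-3 candidate points, one row per theta
dists = sqrt(sum((P - Ref_point).^2, 2)); % Euclidean distance of each row (implicit expansion, R2016b+)
[minDist, bestIdx] = min(dists); % bestIdx is the row number of the closest candidate
closest = P(bestIdx,:);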

MATLAB: Help in finding minimum distance

I am trying to find the point that is at a minimum distance from the candidate set. Z is a matrix whose rows correspond to dimensions and whose columns are the points. I compute the inter-point distances, then record the point with the minimum distance, along with that distance. Below is the code snippet. The code works fine for small dimensions and a small set of points, but it takes a long time for large data sets (N = 1 million data points, and the dimension is also high). Is there a more efficient way?
I suggest that you use pdist to do the heavy lifting for you. This function computes the pairwise distance between every two points in your array. The resulting vector has to be put into matrix form using squareform in order to find the minimal value for each point:
N = 100;
Z = rand(2,N); % each column is a 2-dimensional point
% pdist assumes that the second index corresponds to dimensions
% so we need to transpose inside pdist()
distmatrix = squareform(pdist(Z.','euclidean')); % output is [N, N] in size
% set diagonal values to infinity to avoid getting 0 self-distance as minimum
distmatrix = distmatrix + diag(inf(1,size(distmatrix,1)));
mindists = min(distmatrix,[],2); % find the minimum for each row
sum_dist = sum(mindists); % sum of minimal distance between each pair of points
This computes every pair twice, but I believe the same is true of your original implementation.
The idea is that pdist computes the pairwise distance between the columns of its input, so we pass the transpose of Z into pdist. Since the full output is always a symmetric matrix with a zero diagonal, pdist is implemented such that it only returns the values above the diagonal, in a vector. So a call to squareform is needed to get the proper distance matrix. Then, the row-wise minimum of this matrix has to be found, but first we have to exclude the zeros on the diagonal. I was lazy, so I put inf on the diagonal to make sure the minimum is found elsewhere. In the end we just sum up the minimal distances.
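For N on the order of a million points, the N-by-N matrix produced by squareform will not fit in memory. A sketch of a more scalable alternative, assuming the Statistics Toolbox is available: knnsearch with 'K',2 returns, for each point, itself and its nearest neighbour, without ever forming the full distance matrix (it uses a kd-tree in low dimensions and falls back to exhaustive search in high dimensions):
Z = rand(2,1e6); % columns are points, as above
[idx,d] = knnsearch(Z.',Z.','K',2); % two nearest neighbours of each point, including itself
mindists = d(:,2); % column 1 holds the zero self-distance
sum_dist = sum(mindists); % same quantity as in the pdist version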

Mahalanobis distance in matlab: pdist2() vs. mahal() function

I have two matrices X and Y. Both represent a number of positions in 3D-space. X is a 50*3 matrix, Y is a 60*3 matrix.
My question: why does applying the mean-function over the output of pdist2() in combination with 'Mahalanobis' not give the result obtained with mahal()?
More details on what I'm trying to do below, as well as the code I used to test this.
Let's suppose the 60 observations in matrix Y are obtained after an experimental manipulation of some kind. I'm trying to assess whether this manipulation had a significant effect on the positions observed in Y. Therefore, I used pdist2(X,X,'Mahalanobis') to compare X to X to obtain a baseline, and later, X to Y (with X the reference matrix: pdist2(X,Y,'Mahalanobis')), and I plotted both distributions to have a look at the overlap.
Subsequently, I calculated the mean Mahalanobis distance for both distributions and the 95% CI, and did a t-test and Kolmogorov-Smirnov test to assess whether the difference between the distributions was significant. This seemed very intuitive to me; however, when testing with mahal(), I get different values, although the reference matrix is the same. I don't get what the difference between the two ways of calculating Mahalanobis distance is, exactly.
A comment that was too long, in reply to @3lectrologos:
You mean this: d(I) = (Y(I,:)-mu)*inv(SIGMA)*(Y(I,:)-mu)'? This is just the formula for calculating the Mahalanobis distance, so it should be the same for the pdist2() and mahal() functions. I think mu is the mean vector and SIGMA the covariance matrix based on the reference distribution as a whole, in both pdist2() and mahal(). Only in mahal() are you comparing each point of your sample set to the reference distribution as a whole, while in pdist2() you are making pairwise comparisons between individual points, based on a reference covariance. Actually, with my purpose in mind, I think I should go for mahal() instead of pdist2(). I can interpret a pairwise distance based on a reference distribution, but I don't think it's what I need here.
% test pdist2 vs. mahal in matlab
% the purpose of this script is to see whether the average over the rows of E equals the values in d...
% data
X = []; % 50*3 matrix, data omitted
Y = []; % 60*3 matrix, data omitted
% calculations
S = nancov(X);
% mahal()
d = mahal(Y,X); % gives a 60-by-1 vector with a value for each row (point) of Y (the second matrix is always the reference matrix)
% pairwise mahalanobis distance with pdist2()
E = pdist2(X,Y,'mahalanobis',S); % outputs a 50-by-60 matrix whose ij-th element is the pairwise distance between X(i,:) and Y(j,:), based on the covariance matrix of X: nancov(X)
%{
so this is harder to interpret than mahal(), as elements of Y are not just
compared to the "mahalanobis-centroid" based on X, but to each individual
element of X; the purpose of this script is to see whether the average over
the rows of E equals the values in d...
%}
F = mean(E); % now I averaged over the rows, which means, over all values of X, the reference matrix
mean(d)
mean(E(:)) % not equal to mean(d)
d-F' % not zero
% plot output
figure(1)
plot(d,'bo'), hold on
plot(mean(E),'ro')
legend('mahal()','averaged over all X values, pdist2()')
ylabel('Mahalanobis distance')
figure(2)
plot(d,'bo'), hold on
plot(E','ro')
plot(d,'bo','MarkerFaceColor','b')
xlabel('values in matrix Y (Yi) ... or ... pairwise comparison Yi. (Yi vs. all Xi values)')
ylabel('Mahalanobis distance')
legend('mahal()','pdist2()')
One immediate difference between the two is that mahal subtracts the sample mean of X from each point in Y before computing distances.
Try something like E = pdist2(X,Y-mean(X),'mahalanobis',S); to see if that gives you the same results as mahal.
Note that
mahal(X,Y)
is equivalent to
pdist2(X,mean(Y),'mahalanobis',cov(Y)).^2
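A quick check of that equivalence on random data; this is an illustrative sketch, not from the original post (the .^2 is needed because mahal returns squared distances):
X = randn(50,3);
Y = randn(60,3);
d1 = mahal(X,Y); % 50-by-1 squared Mahalanobis distances of rows of X from the sample in Y
d2 = pdist2(X,mean(Y),'mahalanobis',cov(Y)).^2; % distance of each row of X to mean(Y)
max(abs(d1 - d2)) % should be at round-off level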
Well, I guess there are two different ways to calculate the Mahalanobis distance between two clusters of data, as you explain above:
1) You compare each data point from your sample set to the mu and sigma calculated from your reference distribution (although labeling one cluster the sample set and the other the reference distribution may be arbitrary), thereby calculating the distance from each point to this so-called mahalanobis-centroid of the reference distribution.
2) You compare each data point from matrix Y to each data point of matrix X, with X the reference distribution (mu and sigma are calculated from X only).
The values of the distances will be different, but I guess the ordinal order of dissimilarity between clusters is preserved when using either method 1 or 2. I actually wonder, when comparing 10 different clusters to a reference matrix X, or to each other, whether the order of the dissimilarities would differ between method 1 and method 2. Also, I can't imagine a situation where one method would be wrong and the other would not, although method 1 seems more intuitive in some situations, like mine.

Getting the index of the closest data point to the centroids in k-means clustering in MATLAB

I am doing some clustering using K-means in MATLAB. As you might know the usage is as below:
[IDX,C] = kmeans(X,k)
where IDX gives the cluster number for each data point in X, and C gives the centroids of each cluster. I need to get the index (row number in the actual data set X) of the data point closest to each centroid. Does anyone know how I can do that?
Thanks
The "brute-force approach", as mentioned by #Dima would go as follows
% loop through all clusters
for iCluster = 1:max(IDX)
    % find the points that are part of the current cluster
    currentPointIdx = find(IDX==iCluster);
    % find the index (among points in the cluster) of the point that has
    % the smallest Euclidean distance from the centroid:
    % bsxfun subtracts the coordinates, then we sum the squares of
    % the difference vectors and take the minimum
    [~,minIdx] = min(sum(bsxfun(@minus,X(currentPointIdx,:),C(iCluster,:)).^2,2));
    % store the index into X (among all the points)
    closestIdx(iCluster) = currentPointIdx(minIdx);
end
To get the coordinates of the point that is closest to the cluster center k, use
X(closestIdx(k),:)
The brute-force approach would be to run k-means, then compare each data point in a cluster to that cluster's centroid and find the one closest to it. This is easy to do in MATLAB.
On the other hand, you may want to try the k-medoids clustering algorithm, which gives you a data point as the "center" of each cluster. Here is a MATLAB implementation.
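For reference, recent MATLAB releases (R2014b and later) ship a built-in kmedoids in the Statistics Toolbox; a minimal sketch: because the medoids are actual data points, their row indices in X answer the question directly.
[IDX,C,~,~,midx] = kmedoids(X,k); % midx holds the row indices in X of the k medoids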
Actually, kmeans already gives you the answer, if I understand you right:
[IDX,C,~,D] = kmeans(X,k); % D is the distance of each data point to each of the centroids
[minD, indMinD] = min(D); % indMinD(i) is the index (in X) of the point closest to the i-th centroid