Finding a matching row in two separate datasets in matlab - matlab

I have two correlated Nx3 datasets (one is xyz points, the other is the normal vector for those points). I have a point in my first dataset and now I want to find the matching row in the second dataset. What's the best way to do this? I was thinking print out the row number but not sure exactly what the code is to do that?

Given that you have a point in your one dataset that is size 1 x 3, there are two possible ways that you can do this.
Method #1 - Using knnsearch
The easiest way would be to use knnsearch from the Statistics Toolbox.
knnsearch stands for K-Nearest Neighbour search. Given an input query point, knnsearch finds the k closest points to your dataset given the input query point. In your case, k=1. Also, the distance metric is the Euclidean distance, but seeing how your points are in 3D Cartesian space, I don't see this being a problem.
Therefore, assuming your xyz points are stored in X and the query point (normal vector) is in y, just do this:
IDX = knnsearch(X, y);
The above defaults to k=1. If you'd like more than 1 point returned, you'd do this:
IDX = knnsearch(X, y, 'K', n);
n is the number of points you want returned or the n closest points given the query y. IDX contains the index of which point in X is closest to y. I would also like to point out that X is arranged such that each row is a point and each column is a variable.
Therefore, the closest point using IDX would be:
closest_point = X(IDX,:);
Method #2 - Using bsxfun
If you don't have the Statistics Toolbox, you can very easily achieve the same thing using bsxfun. Bear in mind that the code I will write is only for returning the closest point, or k=1:
dists = sqrt(sum(bsxfun(#minus, X, y).^2, 2));
[~,IDX] = min(dists);
The bsxfun call first determines the component-wise distance between y and every point in X. Once we do this, we square each component, add up all of the components together then take the square root. This essentially finds the Euclidean distance with y and all of the points in X. This gives us N distances where N is the total number of points in the dataset. We then find the minimum distance with min and determine the index of where the closest matching point is, which corresponds to the closest point between y and the dataset.
If you'd like to extend this to more than one point, you'd sort the distances in ascending order, then retrieve those number of points with the smallest distances. Remember, smaller Euclidean distances mean that the points are similar, which is why we sort in ascending order. Something like this:
dists = sqrt(sum(bsxfun(#minus, X, y).^2, 2));
[~,ind] = sort(dists);
IDX = ind(1:n);
Just a small step upwards from what we had before. Instead of using min, you'd use sort and get the second output of sort to determine the locations of the minimum distances. We'd then index into ind to get the n closest indices and finally index into X to get our actual points.
You would again do the same thing to retrieve the actual points that are closest:
closest_point = X(IDX,:);
Some Bonus Material
If you'd like to read more about how K-Nearest Neighbour works, I encourage you to read my post about it here:
Finding K-nearest neighbors and its implementation
Good luck!

Related

How to order one dimensional matrices base on values

I want to determine a point in space by geometry and I have math computations that gives me several theta values. After evaluating the theta values, I could get N 1 x 3 dimension matrix where N is the number of theta evaluated. Since I have my targeted point, I only need to decide which of the matrices is closest to the target with adequate focus on the three coordinates (x,y,z).
Take a view of the analysis in the figure below:
Fig 1: Determining Closest Point with all points having minimal error
It can easily be seen that the third matrix is closest using sum(abs(Matrix[x,y,z])). However, if the method is applied on another figure given below, obviously, the result is wrong.
Fig 2: One Point has closest values with 2-axes of the reference point
Looking at point B, it is closer to the reference point on y-,z- axes but just that it strayed greatly on x-axis.
So how can I evaluate the matrices and select the closest one to point of reference and adequate emphasis will be on error differences in all coordinates (x,y,z)?
If your results is in terms of (x,y,z), why don't evaluate the euclidean distance of each matrix you have obtained from the reference point?
Sort of matlab code:
Ref_point = [48.98, 20.56, -1.44];
Curr_point = [x,y,z];
Xd = (x-Ref_point(1))^2 ;
Yd = (y-Ref_point(2))^2 ;
Zd = (z-Ref_point(3))^2 ;
distance = sqrt(Xd + Yd + Zd);
%find the minimum distance

Matlab : Help in finding minimum distance

I am trying to find the point that is at a minimum distance from the candidate set. Z is a matrix where the rows are the dimension and columns indicate points. Computing the inter-point distances, and then recording the point with minimum distance and its distance as well. Below is the code snippet. The code works fine for a small dimension and small set of points. But, it takes a long time for large data set (N = 1 million data points and dimension is also high). Is there an efficient way?
I suggest that you use pdist to do the heavy lifting for you. This function will compute the pairwise distance between every two points in your array. The resulting vector has to be put into matrix form using squareform in order to find the minimal value for each pair:
N = 100;
Z = rand(2,N); % each column is a 2-dimensional point
% pdist assumes that the second index corresponds to dimensions
% so we need to transpose inside pdist()
distmatrix = squareform(pdist(Z.','euclidean')); % output is [N, N] in size
% set diagonal values to infinity to avoid getting 0 self-distance as minimum
distmatrix = distmatrix + diag(inf(1,size(distmatrix,1)));
mindists = min(distmatrix,[],2); % find the minimum for each row
sum_dist = sum(mindists); % sum of minimal distance between each pair of points
This computes every pair twice, but I think this is true for your original implementation.
The idea is that pdist computes the pairwise distance between the columns of its input. So we put the transpose of Z into pdist. Since the full output is always a square matrix with zero diagonal, pdist is implemented such that it only returns the values above the diagonal, in a vector. So a call to squareform is needed to get the proper distance matrix. Then, the row-wise minimum of this matrix have to be found, but first we have to exclude the zero in the diagonals. I was lazy so I put inf into the diagonals, to make sure that the minimum is elsewhere. In the end we just have to sum up the minimal distances.

How to understand the Matlab build in function "kmeans"?

Suppose I have a matrix A, the size of which is 2000*1000 double. Then I apply
Matlab build in function "kmeans"to the matrix A.
k = 8;
[idx,C] = kmeans(A, k, 'Distance', 'cosine');
I get C = 8*1000 double; idx = 2000*1 double, with values from 1 to 8;
According to the documentation, C returns the k cluster centroid locations in the k-by-p (8 by 1000) matrix. And idx returns an n-by-1 vector containing cluster indices of each observation.
My question is:
1) I do not know how to understand the C, the centroid locations. Locations should be represented as (x,y), right? How to understand the matrix C correctly?
2) What are the final centers c1, c2,...,ck? Are they just values or locations?
3) For each cluster, if I only want to get the vector closest to the center of this cluster, how to calculate and get it?
Thanks!
Before I answer the three parts, I'll just explain the syntax that is used in MATLAB's explanation of k-means (http://www.mathworks.com/help/stats/kmeans.html).
A is your data matrix (it's represented as X in the link). There are n rows (in this case, 2000), which represent the number of observations/data points that you have. There are also p columns (in this case, 1000), which represent the number of "features" that each data points has. For example, if your data consisted of 2D points, then p would equal 2.
k is the number of clusters that you want to group the data into. Based on the dimensions of C that you gave, k must be 8.
Now I will answer the three parts:
The C matrix has dimensions k x p. Each row represents a centroid. Centroid locations DO NOT have to be (x, y) at all. The dimensions of the centroid locations are equal to p. In other words, if you have 2D points, you could graph the centroids as (x, y). If you have 3D points, you could graph the centroids as (x, y, z). Since each data point in A has 1000 features, your centroids therefore have 1000 dimensions.
This is sort of difficult to explain without knowing what your data is exactly. Centroids are certainly not just values, and they may not necessarily be locations. If your data A were coordinate points, you could certainly represent the centroids as locations. However, we can view it more generally. If you had a cluster centroid i and the data points v that are grouped with that centroid, the centroid would represent the data point that is most similar to those in its cluster. Hopefully, that makes sense, and I can give a clearer explanation if necessary.
The k-means method actually gives us a good way to accomplish this. The function actually has 4 possible outputs, but I will focus on the 4th, which I will call D:
[idx,C,sumd,D] = kmeans(A, k, 'Distance', 'cosine');
D has dimensions n x k. For a data point i, the row i in the D matrix gives the distance from that point to every centroid. Therefore, for each centroid, you simply need to find the data point closest to this, and return that corresponding data point. I can supply the short code for this if you need it.
Also, just a tip. You should probably use kmeans++ method of initializing the centroids. It's faster and generally better. You can call it using this:
[idx,C,sumd,D] = kmeans(A, k, 'Distance', 'cosine', 'Start', 'plus');
Edit:
Here is the code necessary for part 3:
[~, min_idxs] = min(D, [], 1);
closest_vecs = A(min_idxs, :);
Each row i of closest_vecs is the vector that is closest to centroid i.
OK, before we actually get into the details, let's give a brief overview on what K-means clustering is first.
k-means clustering works such that for some data that you have, you want to group them into k groups. You initially choose k random points in your data, and these will have labels from 1,2,...,k. These are what we call the centroids. Then, you determine how close the rest of the data are to each of these points. You then group those points so that whichever points are closest to any of these k points, you assign those points to belong to that particular group (1,2,...,k). After, for all of the points for each group, you update the centroids, which actually is defined as the representative point for each group. For each group, you compute the average of all of the points in each of the k groups. These become the new centroids for the next iteration. In the next iteration, you determine how close each point in your data is to each of the centroids. You keep iterating and repeating this behaviour until the centroids don't move anymore, or they move very little.
How you use the kmeans function in MATLAB is that assuming you have a data matrix (A in your case), it is arranged such that each row is a sample and each column is a feature / dimension of a sample. For example, we could have N x 2 or N x 3 arrays of Cartesian coordinates, either in 2D or 3D. In colour images, we could have N x 3 arrays where each column is a colour component in an image - red, green or blue.
How you invoke kmeans in MATLAB is the following way:
[IDX, C] = kmeans(X, K);
X is the data matrix we talked about, K is the total number of clusters / groups you would like to see and the outputs IDX and C are respectively an index and centroid matrix. IDX is a N x 1 array where N is the total number of samples that you have put into the function. Each value in IDX tells you which centroid the sample / row in X best matched with a particular centroid. You can also override the distance measure used to measure the distance between points. By default, this is the Euclidean distance, but you used the cosine distance in your invocation.
C has K rows where each row is a centroid. Therefore, for the case of Cartesian coordinates, this would be a K x 2 or K x 3 array. Therefore, you would interpret IDX as telling which group / centroid that the point is closest to when computing k-means. As such, if we got a value of IDX=1 for a point, this means that the point best matched with the first centroid, which is the first row of C. Similarly, if we got a value of IDX=1 for a point, this means that the point best matched with the third centroid, which is the third row of C.
Now to answer your questions:
We just talked about C and IDX so this should be clear.
The final centres are stored in C. Each row gives you a centroid / centre that is representative of a group.
It sounds like you want to find the closest point to each cluster in the data, besides the actual centroid itself. That's easy to do if you use knnsearch which performs K-Nearest Neighbour search by giving a set of points and it outputs the K closest points within your data that are close to a query point. As such, you supply the clusters as the input and your data as the output, then use K=2 and skip the first point. The first point will have a distance of 0 as this will be equal to the centroid itself and the second point will give you the closest point that is closest to the cluster.
You can do that by the following, assuming you already ran kmeans:
out = knnsearch(A, C, 'k', 2);
out = out(:,2);
You run knnsearch, then toss out the closest point as it would essentially have a distance of 0. The second column is what you're after, which gives you the closest point to the cluster excluding the actual centroid. out will give you which points in your data matrix A that was closest to each centroid. To get the actual points, do this:
pts = A(out,:);
Hope this helps!

Mahalanobis distance in matlab: pdist2() vs. mahal() function

I have two matrices X and Y. Both represent a number of positions in 3D-space. X is a 50*3 matrix, Y is a 60*3 matrix.
My question: why does applying the mean-function over the output of pdist2() in combination with 'Mahalanobis' not give the result obtained with mahal()?
More details on what I'm trying to do below, as well as the code I used to test this.
Let's suppose the 60 observations in matrix Y are obtained after an experimental manipulation of some kind. I'm trying to assess whether this manipulation had a significant effect on the positions observed in Y. Therefore, I used pdist2(X,X,'Mahalanobis') to compare X to X to obtain a baseline, and later, X to Y (with X the reference matrix: pdist2(X,Y,'Mahalanobis')), and I plotted both distributions to have a look at the overlap.
Subsequently, I calculated the mean Mahalanobis distance for both distributions and the 95% CI and did a t-test and Kolmogorov-Smirnoff test to asses if the difference between the distributions was significant. This seemed very intuitive to me, however, when testing with mahal(), I get different values, although the reference matrix is the same. I don't get what the difference between both ways of calculating mahalanobis distance is exactly.
Comment that is too long #3lectrologos:
You mean this: d(I) = (Y(I,:)-mu)inv(SIGMA)(Y(I,:)-mu)'? This is just the formula for calculating mahalanobis, so should be the same for pdist2() and mahal() functions. I think mu is a scalar and SIGMA is a matrix based on the reference distribution as a whole in both pdist2() and mahal(). Only in mahal you are comparing each point of your sample set to the points of the reference distribution, while in pdist2 you are making pairwise comparisons based on a reference distribution. Actually, with my purpose in my mind, I think I should go for mahal() instead of pdist2(). I can interpret a pairwise distance based on a reference distribution, but I don't think it's what I need here.
% test pdist2 vs. mahal in matlab
% the purpose of this script is to see whether the average over the rows of E equals the values in d...
% data
X = []; % 50*3 matrix, data omitted
Y = []; % 60*3 matrix, data omitted
% calculations
S = nancov(X);
% mahal()
d = mahal(Y,X); % gives an 60*1 matrix with a value for each Cartesian element in Y (second matrix is always the reference matrix)
% pairwise mahalanobis distance with pdist2()
E = pdist2(X,Y,'mahalanobis',S); % outputs an 50*60 matrix with each ij-th element the pairwise distance between element X(i,:) and Y(j,:) based on the covariance matrix of X: nancov(X)
%{
so this is harder to interpret than mahal(), as elements of Y are not just compared to the "mahalanobis-centroid" based on X,
% but to each individual element of X
% so the purpose of this script is to see whether the average over the rows of E equals the values in d...
%}
F = mean(E); % now I averaged over the rows, which means, over all values of X, the reference matrix
mean(d)
mean(E(:)) % not equal to mean(d)
d-F' % not zero
% plot output
figure(1)
plot(d,'bo'), hold on
plot(mean(E),'ro')
legend('mahal()','avaraged over all x values pdist2()')
ylabel('Mahalanobis distance')
figure(2)
plot(d,'bo'), hold on
plot(E','ro')
plot(d,'bo','MarkerFaceColor','b')
xlabel('values in matrix Y (Yi) ... or ... pairwise comparison Yi. (Yi vs. all Xi values)')
ylabel('Mahalanobis distance')
legend('mahal()','pdist2()')
One immediate difference between the two is that mahal subtracts the sample mean of X from each point in Y before computing distances.
Try something like E = pdist2(X,Y-mean(X),'mahalanobis',S); to see if that gives you the same results as mahal.
Note that
mahal(X,Y)
is equivalent to
pdist2(X,mean(Y),'mahalanobis',cov(Y)).^2
Well, I guess there are two different ways to calculate mahalanobis distance between two clusters of data like you explain above:
1) you compare each data point from your sample set to mu and sigma matrices calculated from your reference distribution (although labeling one cluster sample set and the other reference distribution may be arbitrary), thereby calculating the distance from each point to this so called mahalanobis-centroid of the reference distribution.
2) you compare each datapoint from matrix Y to each datapoint of matrix X, with, X the reference distribution (mu and sigma are calculated from X only)
The values of the distances will be different, but I guess the ordinal order of dissimilarity between clusters is preserved when using either method 1 or 2? I actually wonder when comparing 10 different clusters to a reference matrix X, or to each other, if the order of the dissimilarities would differ using method 1 or method 2? Also, I can't imagine a situation where one method would be wrong and the other method not. Although method 1 seems more intuitive in some situations, like mine.

Selecting data based on the distance from a query point in Matlab

I have a data-set that has four columns [X Y Z C]. I would like to find all the C values that are in a given sphere centered at [X, Y, Z] with a radius r. What is the best approach to address this problem? Should I use the clusterdata command?
Here is one solution that uses naively euclidean distance:
say V = [X Y Z C] is your dataset, Center = [x,y,z] is the center of the sphere, then
dist = bsxfun(#minus,V(:,1:3),Center); % // finds the distance vectors
% // between the points and the center
dist = sum(dist.^2,2); % // evaluate the squares of the euclidean distances (scalars)
idx = (dist < r^2); % // Find the indexes of the matching points
The good C values are
good = V(idx,4); % // here I kept just the C column
This is not "cluster analysis": You do not attempt to discover structure in your data.
Instead, what you are doing, is commonly called a "range query" or "radius query". In classic database terms, a SELECT, with a distance selector.
You probably want to define your sphere using euclidean distance. For computational purposes, it actually is beneficial to instead of squared Euclidean, by simply taking the square of your radius.
I don't use matlab, but there must be tons of examples of how to compute the distance of each instance in a data set from a query point, and then selecting those objects where the distance is small enough.
I don't know if there is any good index structures package for Matlab. But in general, at 3D, this can be well accelerated with index structures. Computing all distances is O(n), but with an index structure only O(log n).