How to sort rows of a matrix based on the costrain of another matrix? - matlab

The 6 faces method is a very cheap and fast way to calibrate and accelerometer like my MPU6050, here a great description of the method.
I made 6 tests to calibrate the accelerometer based on the g vector.
After that i build up a matrix and in each row is stored the mean of each axis expressed in m/s^2, thanks to this question i automatically calculated the mean for each column in each file.
The tests were randomly performed, i tested all the six positions, but i didn't follow any path.
So i sorted manually the final matrix, based on the sort of the Y matrix, my reference matrix.
The Y elements are fixed.
The matrix manually sorted is the following
Here how i manually sorted the matrix
meanmatrix=[ax ay az];
mean1=meanmatrix(1,:);
mean2=meanmatrix(2,:);
mean3=meanmatrix(3,:);
mean4=meanmatrix(4,:);
mean5=meanmatrix(5,:);
mean6=meanmatrix(6,:);
meanmatrix= [mean1; mean3; mean2; mean4;mean6;mean5];
Based on the Y matrix constrain how can sort my matrix without knwowing "a priori" wich is the test stored in the row?

Assuming that the bias on the accelerometer is not huge, you can look at the rows of your matrix and see with which of the rows in your Y matrix matches.
sorted_meanmatrix = zeros(size(meanmatrix));
for rows = 1:length(Y)
% Calculates the square of distance and see which row has a nearest distcance
[~,index] = min(sum((meanmatrix - Y(rows,:)).^2, 2));
sorted_meanmatrix(rows,:) = meanmatrix(index,:);
end

Related

Calculating the most similar pair of column vectors using cosine distance in a matrix

I have a 943x1682 matrix in which I want to calculate the two most similar vectors in this matrix. So I want see the cosine distance of each vector in the matrix to each vector in the matrix, of course not including the vector with itself, if one cannot do that I can just ignore those.
I made this loop to try to calculate this, so I can get a 1682x1682 matrix, with each cell corresponding to the similarity between i and j. However when I run this, it takes forever to run, and when I try to open the resulting matrix in my workspace, it says:
Cannot display summaries of variables with more than 524288 elements.
Is there an easier way to do this or am I doing something wrong?
Cross posted on MATLAB Answers. Repeating answer here:
Use a standard matrix multiply to get the dot products. MATLAB is very fast at standard matrix multiplies. And then normalize the result. E.g.,
AA = A' * A; % the column dot products via a standard matrix multiply
Anorm = sqrt(diag(AA)); % the norms of the columns
Adist = AA ./ (Anorm .* Anorm.'); % normalize the column dot products into cosine distances
Then pick off the maximum value for your answer, disregarding the diagonal. E.g.,
n = size(A,2); % the number of columns
Adist(1:n+1:end) = -inf; % disregard the diagonal (column compared to itself)
[~,x] = max(Adist(:)); % find the max cosine distance linear index
[col1,col2] = ind2sub(size(Adist),x); % convert linear index into the original columns
Then col1 and col2 are the column numbers of the most similiar columns using cosine distance as a measure.
You can normalise the columns of the matrix first, then the cosine similarity equation simplifies to a matrix multiplication:
aNorm = normc(A);
cosSim = aNorm' * aNorm;
Generally, matrix multiplication is more performant than looping. In a quick test, with N = 1000, the looping code takes ~7 seconds and the matrix multiplication code ~0.5 seconds.
The resultant matrix may still be too large to open in your workspace, you could copy any individual rows or columns into a temporary and view those, or do a contour plot (heat-map) of the matrix to get a visual representation.

Matlab : Help in finding minimum distance

I am trying to find the point that is at a minimum distance from the candidate set. Z is a matrix where the rows are the dimension and columns indicate points. Computing the inter-point distances, and then recording the point with minimum distance and its distance as well. Below is the code snippet. The code works fine for a small dimension and small set of points. But, it takes a long time for large data set (N = 1 million data points and dimension is also high). Is there an efficient way?
I suggest that you use pdist to do the heavy lifting for you. This function will compute the pairwise distance between every two points in your array. The resulting vector has to be put into matrix form using squareform in order to find the minimal value for each pair:
N = 100;
Z = rand(2,N); % each column is a 2-dimensional point
% pdist assumes that the second index corresponds to dimensions
% so we need to transpose inside pdist()
distmatrix = squareform(pdist(Z.','euclidean')); % output is [N, N] in size
% set diagonal values to infinity to avoid getting 0 self-distance as minimum
distmatrix = distmatrix + diag(inf(1,size(distmatrix,1)));
mindists = min(distmatrix,[],2); % find the minimum for each row
sum_dist = sum(mindists); % sum of minimal distance between each pair of points
This computes every pair twice, but I think this is true for your original implementation.
The idea is that pdist computes the pairwise distance between the columns of its input. So we put the transpose of Z into pdist. Since the full output is always a square matrix with zero diagonal, pdist is implemented such that it only returns the values above the diagonal, in a vector. So a call to squareform is needed to get the proper distance matrix. Then, the row-wise minimum of this matrix have to be found, but first we have to exclude the zero in the diagonals. I was lazy so I put inf into the diagonals, to make sure that the minimum is elsewhere. In the end we just have to sum up the minimal distances.

Find the minimum absolute values along the third dimension in a 3D matrix and ensuring the sign is maintained

I have a m X n X k matrix and I want to find the elements that have minimal absolute value along the third dimension for each unique 2D spatial coordinate. An additional constraint is that once I find these minimum values, the sign of these values (i.e. before I took the absolute value) must be maintained.
The code I wrote to accomplish this is shown below.
tmp = abs(dist); %the size(dist)=[m,n,k]
[v,ind] = min(tmp,[],3); %find the index of minimal absolute value in the 3rd dimension
ind = reshape(ind,m*n,1);
[col,row]=meshgrid(1:n,1:m); row = reshape(row,m*n,1); col = reshape(col,m*n,1);
ind2 = sub2ind(size(dist),row,col,ind); % row, col, ind are sub
dm = dist(ind2); %take the signed value from dist
dm = reshape(dm,m,n);
The resulting matrix dm which is m X n corresponds to the matrix that is subject to the constraints that I have mentioned previously. However, this code sounds a little bit inefficient since I have to generate linear indices. Is there any way to improve this?
If I'm interpreting your problem statement correctly, you wish to find the minimum absolute value along the third dimension for each unique 2D spatial coordinate in your 3D matrix. This is already being done by the first two lines of your code.
However, a small caveat is that once you find these minimum values, you must ensure that the original sign of these values (i.e. before taking the absolute value) are respected. That is the purpose of the rest of the code.
If you want to select the original values, you don't have a choice but to generate the correct linear indices and sample from the original matrix. However, a lot of code is rather superfluous. There is no need to perform any kind of reshaping.
We can simplify your method by using ndgrid to generate the correct spatial coordinates to sample from the 3D matrix then use ind from your code to reference the third dimension. After, use this to sample dist and complete your code:
%// From your code
[v,ind] = min(abs(dist),[],3);
%// New code
[row,col] = ndgrid(1:size(dist,1), 1:size(dist,2));
%// Output
dm = dist(sub2ind(size(dist), row, col, ind));

Selecting values plotted on a scatter3 plot

I have a 3d matrix of 100x100x100. Each point of that matrix has assigned a value that corresponds to a certain signal strength. If I plot all the points the result is incomprehensible and requires horsepower to compute, due to the large amount of points that are painted.
The next picture examplify the problem (in that case the matrix was 50x50x50 for reducing the computation time):
[x,y,z] = meshgrid(1:50,1:50,1:50);
scatter3(x(:),y(:),z(:),5,strength(:),'filled')
I would like to plot only the highest values (for example, the top 10). How can I do it?
One simple solution that came up in my mind is to asign "nan" to the values higher than the treshold.
Even the results are nice I think that it must be a most elegant solution to fix it.
Reshape it into an nx1 vector. Sort that vector and take the first ten values.
num_of_rows = size(M,1)
V = reshape(M,num_of_rows,1);
sorted_V = sort(V,'descend');
ind = sorted_V(1:10)
I am assuming that M is your 3D matrix. This will give you your top ten values in your matrix and the respective index. The you can use ind2sub() to get the x,y,z.

Mahalanobis distance in matlab: pdist2() vs. mahal() function

I have two matrices X and Y. Both represent a number of positions in 3D-space. X is a 50*3 matrix, Y is a 60*3 matrix.
My question: why does applying the mean-function over the output of pdist2() in combination with 'Mahalanobis' not give the result obtained with mahal()?
More details on what I'm trying to do below, as well as the code I used to test this.
Let's suppose the 60 observations in matrix Y are obtained after an experimental manipulation of some kind. I'm trying to assess whether this manipulation had a significant effect on the positions observed in Y. Therefore, I used pdist2(X,X,'Mahalanobis') to compare X to X to obtain a baseline, and later, X to Y (with X the reference matrix: pdist2(X,Y,'Mahalanobis')), and I plotted both distributions to have a look at the overlap.
Subsequently, I calculated the mean Mahalanobis distance for both distributions and the 95% CI and did a t-test and Kolmogorov-Smirnoff test to asses if the difference between the distributions was significant. This seemed very intuitive to me, however, when testing with mahal(), I get different values, although the reference matrix is the same. I don't get what the difference between both ways of calculating mahalanobis distance is exactly.
Comment that is too long #3lectrologos:
You mean this: d(I) = (Y(I,:)-mu)inv(SIGMA)(Y(I,:)-mu)'? This is just the formula for calculating mahalanobis, so should be the same for pdist2() and mahal() functions. I think mu is a scalar and SIGMA is a matrix based on the reference distribution as a whole in both pdist2() and mahal(). Only in mahal you are comparing each point of your sample set to the points of the reference distribution, while in pdist2 you are making pairwise comparisons based on a reference distribution. Actually, with my purpose in my mind, I think I should go for mahal() instead of pdist2(). I can interpret a pairwise distance based on a reference distribution, but I don't think it's what I need here.
% test pdist2 vs. mahal in matlab
% the purpose of this script is to see whether the average over the rows of E equals the values in d...
% data
X = []; % 50*3 matrix, data omitted
Y = []; % 60*3 matrix, data omitted
% calculations
S = nancov(X);
% mahal()
d = mahal(Y,X); % gives an 60*1 matrix with a value for each Cartesian element in Y (second matrix is always the reference matrix)
% pairwise mahalanobis distance with pdist2()
E = pdist2(X,Y,'mahalanobis',S); % outputs an 50*60 matrix with each ij-th element the pairwise distance between element X(i,:) and Y(j,:) based on the covariance matrix of X: nancov(X)
%{
so this is harder to interpret than mahal(), as elements of Y are not just compared to the "mahalanobis-centroid" based on X,
% but to each individual element of X
% so the purpose of this script is to see whether the average over the rows of E equals the values in d...
%}
F = mean(E); % now I averaged over the rows, which means, over all values of X, the reference matrix
mean(d)
mean(E(:)) % not equal to mean(d)
d-F' % not zero
% plot output
figure(1)
plot(d,'bo'), hold on
plot(mean(E),'ro')
legend('mahal()','avaraged over all x values pdist2()')
ylabel('Mahalanobis distance')
figure(2)
plot(d,'bo'), hold on
plot(E','ro')
plot(d,'bo','MarkerFaceColor','b')
xlabel('values in matrix Y (Yi) ... or ... pairwise comparison Yi. (Yi vs. all Xi values)')
ylabel('Mahalanobis distance')
legend('mahal()','pdist2()')
One immediate difference between the two is that mahal subtracts the sample mean of X from each point in Y before computing distances.
Try something like E = pdist2(X,Y-mean(X),'mahalanobis',S); to see if that gives you the same results as mahal.
Note that
mahal(X,Y)
is equivalent to
pdist2(X,mean(Y),'mahalanobis',cov(Y)).^2
Well, I guess there are two different ways to calculate mahalanobis distance between two clusters of data like you explain above:
1) you compare each data point from your sample set to mu and sigma matrices calculated from your reference distribution (although labeling one cluster sample set and the other reference distribution may be arbitrary), thereby calculating the distance from each point to this so called mahalanobis-centroid of the reference distribution.
2) you compare each datapoint from matrix Y to each datapoint of matrix X, with, X the reference distribution (mu and sigma are calculated from X only)
The values of the distances will be different, but I guess the ordinal order of dissimilarity between clusters is preserved when using either method 1 or 2? I actually wonder when comparing 10 different clusters to a reference matrix X, or to each other, if the order of the dissimilarities would differ using method 1 or method 2? Also, I can't imagine a situation where one method would be wrong and the other method not. Although method 1 seems more intuitive in some situations, like mine.