How do I make a Mahalanobis distance matrix in MATLAB?

I have a data set with 5 repeats for each sample and 25 variables.
I am trying to make a Mahalanobis distance matrix between all of the samples using these parameters. I used the "mahal" function, but this gives a vector of all of the distances for each repeat. How can I make a matrix of distances between samples (38*38) and not a vector (1*190)?

For some test data:
X = rand(38,25); % some random test data with 38 observations and 25 variables
X = repmat(X,5,1); % 5 duplicates of each observation
You could use:
X = unique(X,'rows','stable'); % remove duplicate observations; 'stable' keeps the original row order (plain unique sorts the rows)
D = pdist(X,'mahalanobis'); % distance between all remaining observations
Z = squareform(D); % to square matrix format
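If the five repeats are noisy replicates rather than exact duplicates, unique will not collapse them. Here is a minimal sketch of one alternative. It assumes the rows are stacked as in the repmat test data (repeat 1 of every sample, then repeat 2, and so on), and that you want distances between per-sample means scaled by the covariance of all observations. The noise level and the names nS, nR, nV, M and C are made up for illustration:

```matlab
rng(0);                                        % reproducible test data
nS = 38; nR = 5; nV = 25;                      % samples, repeats, variables
base = rand(nS,nV);
X = repmat(base,nR,1) + 0.01*randn(nS*nR,nV);  % noisy repeats (assumed row layout)
M = squeeze(mean(reshape(X,nS,nR,nV),2));      % 38x25 matrix of per-sample means
C = cov(X);                                    % assumption: covariance of all 190 rows
Z = squareform(pdist(M,'mahalanobis',C));      % 38x38 distance matrix between samples
```

Whether the covariance should come from all rows, from the means, or from a pooled within-sample estimate depends on your application, so treat C here as a placeholder choice.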

Related

Calculate Euclidean distance for every row with every other row in a NxM matrix?

I have a matrix that I generate from a CSV file as follows:
X = xlsread('filename.csv');
I am looping through the matrix based on the number of records and I need to find the Euclidean distance for each of the rows of this matrix:
for i = 1:length(X)
j = X(:, [2:5])
end
The resulting matrix is 150 x 4. What would be the best way to calculate the Euclidean distance of each row (with 4 columns as the data points) to every other row, and to get the average of those distances?
In order to find the Euclidean distance between any pair of rows, you could use the function pdist.
X = randn(6, 4);
D = pdist(X,'euclidean');
res=mean(D);
The average is stored in res.
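If what you want is one average per row (the mean distance from each row to all the others) rather than a single overall number, a possible sketch is to expand D with squareform first. The name rowMean is made up for illustration:

```matlab
rng(0);                            % reproducible test data
X = randn(6, 4);
D = pdist(X,'euclidean');          % all pairwise distances as a vector
Z = squareform(D);                 % 6x6 symmetric matrix with zero diagonal
rowMean = sum(Z,2)/(size(Z,1)-1);  % mean distance from each row to the other rows
res = mean(D);                     % overall average, as above
```

The mean of rowMean equals res, since every pair is counted twice in Z.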

Get Matrix of minimum coordinate distance to point set

I have a set of points or coordinates like {(3,3), (3,4), (4,5), ...} and want to build a matrix with the minimum distance to this point set. Let me illustrate using a runnable example:
width = 10;
height = 10;
% Get min distance to those points
pts = [3 3; 3 4; 3 5; 2 4];
sumSPts = length(pts);
% Helper to determine element coordinates
[cols, rows] = meshgrid(1:width, 1:height);
PtCoords = cat(3, rows, cols);
AllDistances = zeros(height, width,sumSPts);
% To get Roh_I of every pt
for k = 1:sumSPts
% Get coordinates of current Scribble Point
currPt = pts(k,:);
% Get Row and Col diffs
RowDiff = PtCoords(:,:,1) - currPt(1);
ColDiff = PtCoords(:,:,2) - currPt(2);
AllDistances(:,:,k) = sqrt(RowDiff.^2 + ColDiff.^2);
end
MinDistances = min(AllDistances, [], 3);
This code runs perfectly fine, but I have to deal with matrix sizes of about 700 million entries (height = 700, width = 500, sumSPts = 2k) and this slows down the calculation. Is there a better algorithm to speed things up?
As stated in the comments, you don't necessarily have to put everything into one huge matrix. You can:
1. Slice the pts matrix into reasonably small slices (say of length 100)
2. Loop over the slices and compute the minimum distances for each slice
3. Take the global min
tic
Mindistances=[];
width = 500;
height = 700;
Np=2000;
pts = [randi(width,Np,1) randi(height,Np,1)];
SliceSize=100;
[Xcoords,Ycoords]=meshgrid(1:width,1:height);
% Compute the minima for the slices from 1 to floor(Np/SliceSize)
for i=1:floor(Np/SliceSize)
% Calculate indexes of the next slice
SliceIndexes=((i-1)*SliceSize+1):i*SliceSize;
% Get the corresponding points and reshape them to a vector along the 3rd dim.
Xpts=reshape(pts(SliceIndexes,1),1,1,[]);
Ypts=reshape(pts(SliceIndexes,2),1,1,[]);
% Do all the diffs between your coordinates and your points using bsxfun singleton expansion
Xdiffs=bsxfun(@minus,Xcoords,Xpts);
Ydiffs=bsxfun(@minus,Ycoords,Ypts);
% Calculate all the distances of the slice in one call
Alldistances=bsxfun(@hypot,Xdiffs,Ydiffs);
% Concatenate the mindistances
Mindistances=cat(3,Mindistances,min(Alldistances,[],3));
end
% Check if last slice needed
if mod(Np,SliceSize)~=0
% Get the corresponding points and reshape them to a vector along the 3rd dim.
Xpts=reshape(pts(floor(Np/SliceSize)*SliceSize+1:end,1),1,1,[]);
Ypts=reshape(pts(floor(Np/SliceSize)*SliceSize+1:end,2),1,1,[]);
% Do all the diffs between your coordinates and your points using bsxfun singleton expansion
Xdiffs=bsxfun(@minus,Xcoords,Xpts);
Ydiffs=bsxfun(@minus,Ycoords,Ypts);
% Calculate all the distances of the slice in one call
Alldistances=bsxfun(@hypot,Xdiffs,Ydiffs);
% Concatenate the mindistances
Mindistances=cat(3,Mindistances,min(Alldistances,[],3));
end
% Get global minimum
Mindistances=min(Mindistances,[],3);
toc
Elapsed time is 9.830051 seconds.
Note:
You won't end up doing fewer calculations, but this is far less demanding on memory (700M doubles take about 5.6 GB), which speeds up the process, helped by vectorization as well.
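A possible variation on the slicing idea, sketched under my own assumptions: keep a running minimum instead of concatenating the per-slice minima, which also handles a last partial slice without a separate branch. The sizes are reduced here so it runs quickly, and the names MinD and idx are made up for illustration:

```matlab
rng(1);                                   % reproducible test data
width = 50; height = 70; Np = 230; SliceSize = 64;
pts = [randi(width,Np,1) randi(height,Np,1)];
[Xcoords,Ycoords] = meshgrid(1:width,1:height);
MinD = inf(height,width);                 % running minimum over all slices
for s = 1:SliceSize:Np
    idx = s:min(s+SliceSize-1,Np);        % last slice may be shorter
    Xpts = reshape(pts(idx,1),1,1,[]);
    Ypts = reshape(pts(idx,2),1,1,[]);
    D = hypot(bsxfun(@minus,Xcoords,Xpts), bsxfun(@minus,Ycoords,Ypts));
    MinD = min(MinD, min(D,[],3));        % fold this slice into the running minimum
end
```

Only one slice of distances lives in memory at a time, so peak memory is set by SliceSize rather than by Np.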
About bsxfun singleton expansion
One of the great strengths of bsxfun is that you don't have to feed it matrices whose values are along the same dimensions.
For example :
Say I have two vectors X and Y defined as :
X=[1 2]; % row vector X
Y=[1;2]; % Column vector Y
And that I want a 2x2 matrix Z built as Z(i,j)=Y(i)+X(j) for 1<=i<=2 and 1<=j<=2.
Suppose you don't know about the existence of meshgrid (the example is a bit too simple); then you'll have to do:
Xs=repmat(X,2,1);
Ys=repmat(Y,1,2);
Z=Xs+Ys;
While with bsxfun you can just do :
Z=bsxfun(@plus,X,Y);
To calculate the value of Z(2,2) for example, bsxfun will automatically fetch the second value of X and Y and compute. This has the advantage of saving a lot of memory space (No need to define Xs and Ys in this example) and being faster with big matrices.
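As a quick check of the example above, here is a small sketch comparing the repmat and bsxfun versions. It also shows that in MATLAB R2016b and later the arithmetic operators perform the same singleton expansion implicitly, so the plain X + Y form works there too. The names Zr, Zb and Zi are made up for illustration:

```matlab
X = [1 2];                            % row vector X
Y = [1;2];                            % column vector Y
Zr = repmat(X,2,1) + repmat(Y,1,2);   % explicit replication
Zb = bsxfun(@plus,X,Y);               % singleton expansion via bsxfun
Zi = X + Y;                           % implicit expansion (R2016b and later)
```

All three give the same 2x2 matrix, with Z(i,j) = Y(i) + X(j).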
Bsxfun Vs Repmat
If you're interested in comparing the computational time between bsxfun and repmat, here are two excellent (word is not even strong enough) SO posts by Divakar:
Comparing BSXFUN and REPMAT
BSXFUN on memory efficiency with relational operations

Distance between any combination of two points

I have 100 coordinates in a variable x in MATLAB . How can I make sure that distance between all combinations of two points is greater than 1?
You can do this in just one simple line, with the functions all and pdist:
if all(pdist(x)>1)
...
end
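A small worked sketch of the same check on made-up coordinates (the four corner points below are only an illustration), which also reports the smallest separation so you can see how close the check is to failing:

```matlab
x = [0 0; 2 0; 0 2; 2 2];   % made-up example: corners of a 2x2 square
d = pdist(x);               % Euclidean distance for every pair of rows
isSeparated = all(d > 1);   % true when every pair is more than 1 apart
minSep = min(d);            % smallest pairwise distance
```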
First you'll need to generate a matrix that gives you all possible pairs of coordinates. This post can serve as inspiration:
Generate a matrix containing all combinations of elements taken from n vectors
I'm going to assume that your coordinates are stored such that the columns denote the dimensionality and the rows denote how many points you have. As such, for 2D, you would have a 100 x 2 matrix, and in 3D you would have a 100 x 3 matrix and so on.
Once you generate all possible combinations, you simply compute the distance between the points of each pair (which I will assume to be Euclidean here) and ensure that all of the distances are greater than 1.
As such:
%// Taken from the linked post
vectors = { 1:100, 1:100 }; %// input data: cell array of vectors
n = numel(vectors); %// number of vectors
combs = cell(1,n); %// pre-define to generate comma-separated list
[combs{end:-1:1}] = ndgrid(vectors{end:-1:1}); %// the reverse order in these two
%// comma-separated lists is needed to produce the rows of the result matrix in
%// lexicographical order
combs = cat(n+1, combs{:}); %// concat the n n-dim arrays along dimension n+1
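combs = reshape(combs,[],n); %// reshape to obtain desired matrix
combs = combs(combs(:,1) < combs(:,2), :); %// keep each pair once and drop self-pairs, whose distance of 0 would always fail the check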
%// Index into your coordinates array
source_points = x(combs(:,1), :);
end_points = x(combs(:,2), :);
%// Checks to see if all distances are greater than 1
is_separated = all(sqrt(sum((source_points - end_points).^2, 2)) > 1);
is_separated will contain 1 if every pair of points is separated by a distance greater than 1, and 0 otherwise. If we dissect the last line of code, it's a three-step procedure:
1. sum((source_points - end_points).^2, 2) computes the component-wise differences for each pair of points, squares them and sums them along each row.
2. sqrt(...(1)...) computes the square root, which gives us the Euclidean distance.
3. all(...(2)... > 1) then checks whether all of the distances computed in Step #2 are greater than 1, and our result follows.
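The three steps can also be sketched without the ndgrid construction by letting nchoosek list each unordered pair of row indices once. This is my own shortened variant rather than the code from the linked post, and the three points are made up so the distances are easy to verify by hand:

```matlab
x = [0 0; 3 0; 0 4];                      % made-up points forming a 3-4-5 triangle
pairs = nchoosek(1:size(x,1), 2);         % each unordered pair of row indices once
diffs = x(pairs(:,1),:) - x(pairs(:,2),:);
d = sqrt(sum(diffs.^2, 2));               % steps 1 and 2: Euclidean distance per pair
is_separated = all(d > 1);                % step 3
```

Because nchoosek never pairs a point with itself, no zero self-distance enters the check.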

Generating a random list of (x, y) points that satisfy a condition?

So I need to generate a matrix of x and y points given that they meet the condition that at these (x,y) points concentration is greater than 10. Note that I first run a code that gives me concentration at each location, and now I need Matlab to "randomly" pick (x,y) points with the above condition.
Would appreciate any suggestions on how to go about this.
Assuming your data looks something like this:
data= [... x y concentration
1, 1, 1; ...
2, 1, 11; ...
1, 2, 12; ...
2, 2, 1 ...
]
You could find all concentrations bigger than 10 with:
data_cbigger10=data(data(:,3)>10,:) % using logical indexing
and choose a random point from there with:
randomPoint=data_cbigger10(randi(size(data_cbigger10,1)),:) % pick a random row index (size(...,1) counts the rows)
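If you need several distinct points rather than one, a possible sketch uses randperm to draw row indices without replacement. The count k and the names cand and pick are made up for illustration:

```matlab
data = [1, 1, 1; ...
        2, 1, 11; ...
        1, 2, 12; ...
        2, 2, 1];                             % columns: x y concentration
cand = data(data(:,3) > 10, :);               % rows with concentration above 10
k = 2;                                        % hypothetical number of points to draw
pick = cand(randperm(size(cand,1), k), :);    % k distinct rows in random order
```

randperm errors out if k exceeds the number of candidate rows, which is a useful guard here.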
If the dimensions are as follows:
the dimension of concentration is 52x61x61 as concentration is c(x,y,time), that of x is 1x61 and 1x52 for y. @PetrH – s2015
this should do the trick:
This is your data; I just make something up, with the concentration stored as con(x,y,time):
x=linspace(0,1,61);
y=linspace(0,1,52);
time=linspace(0,1,61);
con=20*rand(61,52,61);
Now I find all positions in con which are bigger than 10. This results in a logical array. By multiplying it with a random array of the same size, I get an array with random values where con is bigger than 10 and zeros everywhere else.
data_cbigger10=rand(size(con)).*(con>10);
By finding the max value, a random point is chosen:
for n=1:1:10
data_cbigger10=rand(size(con)).*(con>10);
[vals,xind]=max(data_cbigger10);
xind=squeeze(xind);
[vals,yind]=max(squeeze(vals));
[~,time_ind]=max(squeeze(vals));
yind=yind(time_ind);
xind=xind(yind,time_ind);
x_res(n)=x(xind);
y_res(n)=y(yind);
time_res(n)=time(time_ind);
con_res(n)=con(xind,yind,time_ind);
con(xind,yind,time_ind)=0; % set the chosen point to zero, so it will not be chosen again
end
Hope this works now for you.
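A shorter route to the same result, sketched under my own assumptions: take the linear indices of all qualifying entries, draw some of them without replacement, and convert back to subscripts with ind2sub. This assumes con is indexed as con(x,y,time) and that a time vector exists. The data, the count of 10 and the names idx and sel are made up for illustration:

```matlab
rng(0);                                  % reproducible made-up data
x = linspace(0,1,61);
y = linspace(0,1,52);
time = linspace(0,1,61);
con = 20*rand(61,52,61);                 % assumed layout: con(x,y,time)
idx = find(con > 10);                    % linear indices of qualifying entries
sel = idx(randperm(numel(idx), 10));     % 10 distinct entries, chosen at random
[ix,iy,it] = ind2sub(size(con), sel);    % back to (x,y,time) subscripts
x_res = x(ix); y_res = y(iy); time_res = time(it);
con_res = con(sel);                      % concentrations at the chosen points
```

Nothing in con is overwritten, so the original data stays intact.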
Assuming you have the concentration for each point (x,y) stored in an array concentration you can use the find() and randsample() functions like so:
conGT10 = find(concentration>10); % find where concentration is greater than 10 (gives you indices)
randomPoints = randsample(conGT10,nn); % choose nn random numbers from those that satisfy the condition
x1 = x(randomPoints); % given the randomly drawn indices pull the corresponding numbers for x and y
y1 = y(randomPoints);
EDIT:
The above assumes that arrays x, y, and concentration are 1d and of the same length. Apparently this is not true for your problem.
You have a grid of points on a (x,y) plane and you measure concentration on this grid in different time periods. So the length of x is nx, the length of y is ny and the size of concentration is nx by ny by nt. For simplicity I will assume that you measure concentration only once, i.e. nt=1 and concentration is only 2d array.
The modified version of my previous answer would then be as follows:
[rows,cols] = find(concentration>10); % find where concentration is greater than 10 (gives you indices)
randomIndices = randsample(length(rows),nn); % choose nn random integers from 1 to n, where n is the number of observations that satisfy the condition 'concentration>10'
randomX = x(rows(randomIndices));
randomY = y(cols(randomIndices));

Finding maximum/minimum distance of two rows in a matrix using MATLAB

Say we have a matrix m x n where the number of rows of the matrix is very big. If we assume each row is a vector, then how could one find the maximum/minimum distance between vectors in this matrix?
My suggestion would be to use pdist. This computes pairs of Euclidean distances between unique combinations of observations like @seb has suggested, but this is already built into MATLAB. Your matrix is already formatted nicely for pdist where each row is an observation and each column is a variable.
Once you apply pdist, apply squareform so that you can display the pairwise distances in a more pleasant matrix form. The (i,j) entry of this matrix tells you the distance between the ith and jth rows. This matrix is symmetric, and the distances along the diagonal are 0, as any vector's distance to itself must be zero. If the minimum distance between two different vectors were also zero, a search of this matrix could report a self-distance instead of the actual pair. As such, you should set the diagonal of this matrix to NaN to avoid outputting these.
As such, assuming your matrix is A, all you have to do is this:
distValues = pdist(A); %// Compute pairwise distances
minDist = min(distValues); %// Find minimum distance
maxDist = max(distValues); %// Find maximum distance
distMatrix = squareform(distValues); %// Prettify
distMatrix(logical(eye(size(distMatrix)))) = NaN; %// Ignore self-distances
[minI,minJ] = find(distMatrix == minDist, 1); %// Find the two vectors with min. distance
[maxI,maxJ] = find(distMatrix == maxDist, 1); %// Find the two vectors with max. distance
minI, minJ, maxI, maxJ will return the two rows of A that produced the smallest distance and the largest distance respectively. Note that with the find statement, I have made the second parameter 1 so that it only returns one pair of vectors that have this minimum / maximum distance between each other. However, if you omit this parameter, then it will return all possible pairs of rows that share this same distance, but you will get duplicate entries as the squareform is symmetric. If you want to escape the duplication, set either the upper triangular half, or lower triangular half of your squareform matrix to NaN to tell MATLAB to skip searching in these duplicated areas. You can use MATLAB's tril or triu commands to do that. Take note that either of these methods by default will include the diagonal of the matrix and so there won't be any extra work here. As such, try something like:
distValues = pdist(A); %// Compute pairwise distances
minDist = min(distValues); %// Find minimum distance
maxDist = max(distValues); %// Find maximum distance
distMatrix = squareform(distValues); %// Prettify
distMatrix(triu(true(size(distMatrix)))) = NaN; %// To avoid searching for duplicates
[minI,minJ] = find(distMatrix == minDist); %// Find pairs of vectors with min. distance
[maxI,maxJ] = find(distMatrix == maxDist); %// Find pairs of vectors with max. distance
Judging from your application, you just want to find one such occurrence only, so let's leave it at that, but I'll put that here for you in case you need it.
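If you would rather not search with ==, a possible variant takes the index straight from min and max and converts it with ind2sub. This relies on min and max skipping NaN entries by default. The four rows of A are made up so the answer is easy to check by eye:

```matlab
A = [0 0; 1 0; 5 5; 0 0.5];                         % made-up example rows
distMatrix = squareform(pdist(A));                  % full pairwise distance matrix
distMatrix(logical(eye(size(distMatrix)))) = NaN;   % ignore self-distances
[minDist, kMin] = min(distMatrix(:));               % NaN entries are skipped
[maxDist, kMax] = max(distMatrix(:));
[minI, minJ] = ind2sub(size(distMatrix), kMin);     % rows with the smallest distance
[maxI, maxJ] = ind2sub(size(distMatrix), kMax);     % rows with the largest distance
```

Here rows 1 and 4 are closest (distance 0.5) and rows 1 and 3 are farthest apart.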
You mean the max/min distance between any 2 rows? If so, you can try this:
numRows = 6;
A = randn(numRows, 100); %// Example of input matrix
%// Compute distances between each combination of 2 rows
T = nchoosek(1:numRows,2); %// pairs of indexes for all combinations of 2 rows
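d = zeros(size(T,1),1); %// preallocate one distance per pair
for k=1:size(T,1) %// size(T,1), since length(T) would return 2 when there is only one pair
d(k) = norm(A(T(k,1),:)-A(T(k,2),:));
end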
%// Find min/max distance
[~, minIndex] = min(d);
[~, maxIndex] = max(d);
T(minIndex,:) %// Displays indexes of the 2 rows with minimum distance
T(maxIndex,:) %// Displays indexes of the 2 rows with maximum distance