I have a set of points or coordinates like {(3,3), (3,4), (4,5), ...} and want to build a matrix with the minimum distance to this point set. Let me illustrate using a runnable example:
width = 10;
height = 10;
% Get min distance to those points
pts = [3 3; 3 4; 3 5; 2 4];
sumSPts = size(pts, 1); % number of points (rows); length() would return 2 for a single point
% Helper to determine element coordinates
[cols, rows] = meshgrid(1:width, 1:height);
PtCoords = cat(3, rows, cols);
AllDistances = zeros(height, width,sumSPts);
% To get Roh_I of every pt
for k = 1:sumSPts
% Get coordinates of current Scribble Point
currPt = pts(k,:);
% Get Row and Col diffs
RowDiff = PtCoords(:,:,1) - currPt(1);
ColDiff = PtCoords(:,:,2) - currPt(2);
AllDistances(:,:,k) = sqrt(RowDiff.^2 + ColDiff.^2);
end
MinDistances = min(AllDistances, [], 3);
This code runs perfectly fine, but I have to deal with arrays of about 700 million entries (height = 700, width = 500, sumSPts = 2k), and this slows down the calculation. Is there a better algorithm to speed things up?
As stated in the comments, you don't necessarily have to put everything into one gigantic matrix. You can:
1. Slice the pts matrix into reasonably small slices (say of length 100)
2. Loop over the slices and compute the minimum distances over the points of each slice
3. Take the global min
tic
Mindistances=[];
width = 500;
height = 700;
Np=2000;
pts = [randi(width,Np,1) randi(height,Np,1)];
SliceSize=100;
[Xcoords,Ycoords]=meshgrid(1:width,1:height);
% Compute the minima for the slices from 1 to floor(Np/SliceSize)
for i=1:floor(Np/SliceSize)
% Calculate indexes of the next slice
SliceIndexes=((i-1)*SliceSize+1):i*SliceSize;
% Get the corresponding points and reshape them to a vector along the 3rd dim.
Xpts=reshape(pts(SliceIndexes,1),1,1,[]);
Ypts=reshape(pts(SliceIndexes,2),1,1,[]);
% Do all the diffs between your coordinates and your points using bsxfun singleton expansion
Xdiffs=bsxfun(@minus,Xcoords,Xpts);
Ydiffs=bsxfun(@minus,Ycoords,Ypts);
% Calculate all the distances of the slice in one call
Alldistances=bsxfun(@hypot,Xdiffs,Ydiffs);
% Concatenate the mindistances
Mindistances=cat(3,Mindistances,min(Alldistances,[],3));
end
% Check if last slice needed
if mod(Np,SliceSize)~=0
% Get the corresponding points and reshape them to a vector along the 3rd dim.
Xpts=reshape(pts(floor(Np/SliceSize)*SliceSize+1:end,1),1,1,[]);
Ypts=reshape(pts(floor(Np/SliceSize)*SliceSize+1:end,2),1,1,[]);
% Do all the diffs between your coordinates and your points using bsxfun singleton expansion
Xdiffs=bsxfun(@minus,Xcoords,Xpts);
Ydiffs=bsxfun(@minus,Ycoords,Ypts);
% Calculate all the distances of the slice in one call
Alldistances=bsxfun(@hypot,Xdiffs,Ydiffs);
% Concatenate the mindistances
Mindistances=cat(3,Mindistances,min(Alldistances,[],3));
end
% Get global minimum
Mindistances=min(Mindistances,[],3);
toc
Elapsed time is 9.830051 seconds.
Note :
You won't end up doing fewer calculations, but it will be a lot less demanding on memory (700M doubles at 8 bytes each take about 5.6 GB), which speeds up the process (with the help of vectorization as well).
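The same slicing idea can also be written with a running minimum, so that nothing is concatenated and the shorter last slice needs no separate branch. This is only a sketch under assumptions: the sizes are shrunk so it runs quickly, and the names first, last and SliceMin are introduced here for illustration.

```matlab
% Sketch: sliced minimum-distance map that keeps a running minimum.
width = 50; height = 70;                        % small sizes for a quick run
Np = 230; SliceSize = 100;                      % Np is deliberately not a multiple of SliceSize
pts = [randi(width,Np,1) randi(height,Np,1)];   % one [x y] pair per row, as in the answer above
[Xcoords,Ycoords] = meshgrid(1:width,1:height);
Mindistances = inf(height,width);               % running minimum, stays height-by-width
for first = 1:SliceSize:Np
    last = min(first+SliceSize-1, Np);          % the last slice may be shorter
    Xpts = reshape(pts(first:last,1),1,1,[]);   % points along the 3rd dimension
    Ypts = reshape(pts(first:last,2),1,1,[]);
    Xdiffs = bsxfun(@minus,Xcoords,Xpts);
    Ydiffs = bsxfun(@minus,Ycoords,Ypts);
    SliceMin = min(hypot(Xdiffs,Ydiffs),[],3);  % minimum over the points of this slice
    Mindistances = min(Mindistances,SliceMin);  % element-wise update of the running minimum
end
```

Peak memory is then governed by SliceSize alone, since only one slice of differences exists at a time.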
About bsxfun singleton expansion
One of the great strengths of bsxfun is that you don't have to feed it arrays whose values lie along the same dimensions.
For example :
Say I have two vectors X and Y defined as :
X=[1 2]; % row vector X
Y=[1;2]; % Column vector Y
And that I want a 2x2 matrix Z built as Z(i,j)=X(j)+Y(i) for 1<=i<=2 and 1<=j<=2.
Suppose you don't know about the existence of meshgrid (The example is a bit too simple), then you'll have to do :
Xs=repmat(X,2,1);
Ys=repmat(Y,1,2);
Z=Xs+Ys;
While with bsxfun you can just do :
Z=bsxfun(@plus,X,Y);
To calculate the value of Z(2,2) for example, bsxfun will automatically fetch the second value of X and Y and compute. This has the advantage of saving a lot of memory space (No need to define Xs and Ys in this example) and being faster with big matrices.
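To see which index goes where, here is a small check with vectors of different lengths (the values are made up for illustration). The row vector runs along the columns of the result and the column vector along its rows:

```matlab
% Sketch: singleton expansion with a 1x3 row vector and a 2x1 column vector.
X = [10 20 30];                           % row vector, expanded down the rows
Y = [1; 2];                               % column vector, expanded across the columns
Zrepmat = repmat(X,2,1) + repmat(Y,1,3);  % explicit copies
Zbsxfun = bsxfun(@plus,X,Y);              % no copies needed
% Both are the 2x3 matrix [11 21 31; 12 22 32], i.e. Z(i,j) = X(j) + Y(i)
```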
Bsxfun Vs Repmat
If you're interested in comparing the computational time of bsxfun and repmat, here are two excellent (the word is not even strong enough) SO posts by Divakar:
Comparing BSXFUN and REPMAT
BSXFUN on memory efficiency with relational operations
Related
I have a materials matrix where the values indicate the type of material (value between 1 and 8). Each value below 5 indicates an "interesting" material. Now at a certain point, i want to sum up the amount of non-interesting neighbor materials. So in a 3D-matrix the result at one point can be value between 0 and 6. One of the problems is that the "current" point is at the edge of the 3D matrix. I can solve this using 3 very expensive for-loops:
materials; % given 3D matrix i.e. 97*87*100
matrixSize = size(materials);
n = matrixSize(1)*matrixSize(2)*matrixSize(3); % total number of points
materialsFlattened = reshape(materials, [n 1]); % flattened materials matrix from a 3D matrix to a 1D matrix
pageSize = matrixSize(1)*matrixSize(2); % size of a page in z-direction
interestingMaterials = materialsFlattened(:) < 5; % logical vector indicating if the materials are interesting
n_bc = zeros(n, 1); % amount of neighbour non-interesting materials
for l = 1:matrixSize(3) % loop over all z
for k = 1:matrixSize(2) % loop over all y
for j = 1:matrixSize(1) % loop over all x
n_bc(sub2ind(matrixSize,j,k,l)) = ...
~interestingMaterials(sub2ind(matrixSize,j,k,max(1, l-1)))...
+ ~interestingMaterials(sub2ind(matrixSize,j,max(1,k-1),l))...
+ ~interestingMaterials(sub2ind(matrixSize,max(1, j-1),k,l))...
+ ~interestingMaterials(sub2ind(matrixSize,min(matrixSize(1),j+1),k,l))...
+ ~interestingMaterials(sub2ind(matrixSize,j,min(matrixSize(2),k+1),l))...
+ ~interestingMaterials(sub2ind(matrixSize,j,k,min(matrixSize(3),l+1)));
end
end
end
Note that I first flatten the matrix to a 1D vector using reshape. The min and max operators ensure that I do not go out of the bounds of the matrix; instead I take the value of the material where I currently am. For my application, speed is of the essence, and I was hoping I could get rid of this ugly loop-in-loop structure. Often that is possible in MATLAB, as the element-wise indexing is amazing and sometimes kind of magic.
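One way the clamping with min and max might be expressed without loops is through clamped index vectors: [1 1:nx-1] plays the role of max(1,j-1) for all j at once, and [2:nx nx] the role of min(nx,j+1). This is a sketch rather than a tuned solution; the random test volume and the name notInteresting are assumptions made for illustration.

```matlab
% Sketch: count non-interesting neighbours with clamped index vectors.
materials = randi(8,[9 7 5]);             % hypothetical small 3D test volume
notInteresting = materials >= 5;          % complement of "value below 5"
[nx,ny,nz] = size(notInteresting);
n_bc = notInteresting([1 1:nx-1],:,:) + notInteresting([2:nx nx],:,:) ...
     + notInteresting(:,[1 1:ny-1],:) + notInteresting(:,[2:ny ny],:) ...
     + notInteresting(:,:,[1 1:nz-1]) + notInteresting(:,:,[2:nz nz]);
n_bc = n_bc(:);                           % column vector, same layout as the loop version
```

At an edge the clamped index points back at the current element, which matches the behaviour of the loop.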
I am still wrapping my head around vectorization and I'm having a difficult time trying to resolve the following function I made...
for i = 1:size(X, 1)
min_n = inf;
for j=1:K
val = X(i,:)' - centroids(j,:)';
diff = val'*val;
if (diff < min_n)
idx(i) = j;
min_n = diff;
end
end
end
X is an array of (x,y) coordinates...
2 5
5 6
...
...
centroids in this example is limited to 3 rows. It is also in (x,y) format as shown above.
For every pair in X I am computing the closest centroid. I then store the index of that centroid in idx.
So idx(i) = j means that I am storing the index j of the centroid at index i, where i corresponds to the index of X. This means the closest centroid to pair X(i, :) is at idx(i).
Can I possibly simplify this via vectorization? I struggle with just vectorizing the inner loop.
Here are three options. But please note that the disadvantage of vectorization, as compared to your double loops, is that it stores all the difference operation results at once, which means that if your matrices have many rows, you might run out of memory. On the other hand, the vectorized approach is probably much faster.
Option 1
If you have access to the Statistics and Machine Learning Toolbox, you can use the function pdist2 to get all the pairwise distances between the rows of two matrices. The min function then gives you the minimum of each column of the result. Its first output holds the minimal values and its second the indices, which is what you need for idx:
diff = pdist2(centroids,X);
[~,idx] = min(diff);
Option 2
If you don't have access to the toolbox, you can use bsxfun. This lets you compute the difference between the two matrices even if their dimensions don't agree. All you need to do is use shiftdim to reshape X' to size [1,size(X,2),size(X,1)]; reshapedX and centroids then have compatible dimensions (see the documentation of bsxfun), so you can take the difference between their values. The result is a three-dimensional array, which you square and sum along the second dimension to get the squared norm of the differences between rows. At this point you can proceed as in option 1.
reshapedX = shiftdim(X',-1);
diff = bsxfun(@minus,centroids,reshapedX);
diff = squeeze(sum(diff.^2,2));
[~,idx] = min(diff);
Note: Starting with MATLAB R2016b, singleton expansion is applied implicitly and you do not need to call bsxfun anymore. So the line with bsxfun can be replaced with the simpler line diff = centroids-reshapedX.
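As a quick sanity check of Option 2, here is a sketch on a handful of made-up points and centroids (the values, and the name sqDist used instead of diff, are chosen for illustration only):

```matlab
% Sketch: nearest-centroid index via bsxfun, on a tiny hand-checkable example.
X = [2 5; 5 6; 9 1; 0 0];                 % hypothetical points, one (x,y) per row
centroids = [1 1; 5 5; 8 2];              % hypothetical centroids
reshapedX = shiftdim(X',-1);              % size [1, 2, 4]
sqDist = squeeze(sum(bsxfun(@minus,centroids,reshapedX).^2,2));  % size [3, 4]
[~,idx] = min(sqDist);                    % column-wise minimum: one index per point
% idx is [2 2 3 1]
```

Note that with a single centroid squeeze would return a column vector, so the column-wise min would no longer give one index per point; that case would need separate handling.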
Option 3
Use the function dsearchn, which performs exactly what you need:
idx = dsearchn(centroids,X);
It can be done using pdist2, which computes the pairwise distances between the rows of two matrices:
% random data
X = rand(500,2);
centroids = rand(3,2);
% pairwise distances
D = pdist2(X,centroids);
% closest centroid index for each X coordinates
[~,idx] = min(D,[],2);
% plot
scatter(centroids(:,1),centroids(:,2),300,(1:size(centroids,1))','filled');
hold on;
scatter(X(:,1),X(:,2),30,idx);
legend('Centroids','data');
I have the following code for calculating the result of a linear combination of Gaussian functions. What I'd really like to do is to vectorize this somehow so that it's far more performant in Matlab.
Note that y is a column vector (output), x is a matrix where each column corresponds to a data point and each row corresponds to a dimension (i.e. 2 rows = 2D), variance is a double, gaussians is a matrix where each column is a vector corresponding to the mean point of the gaussian and weights is a row vector of the weights in front of each gaussian. Note that the length of weights is 1 bigger than gaussians as weights(1) is the 0th order weight.
function [ y ] = CalcPrediction( gaussians, variance, weights, x )
basisFunctions = size(gaussians, 2);
xvalues = size(x, 2);
if length(weights) ~= basisFunctions + 1
ME = MException('TRAIN:CALC', 'The number of weights should be equal to the number of basis functions plus one');
throw(ME);
end
y = weights(1) * ones(xvalues, 1);
for xIdx = 1:xvalues
for i = 1:basisFunctions
diff = x(:, xIdx) - gaussians(:, i);
y(xIdx) = y(xIdx) + weights(i+1) * exp(-(diff')*diff/(2*variance));
end
end
end
You can see that at the moment I simply iterate over the x vectors and then the gaussians inside 2 for loops. I'm hoping that this can be improved - I've looked at meshgrid but that seems to only apply to vectors (and I have matrices)
Thanks.
Try this
diffx = bsxfun(@minus,x,permute(gaussians,[1,3,2])); % binary operation with singleton expansion
diffx2 = squeeze(sum(diffx.^2,1)); % dot product, shape is now [XVALUES,BASISFUNCTIONS]
weight_col = weights(:); % make sure weights is a column vector
y = weight_col(1) + exp(-diffx2/(2*variance))*weight_col(2:end); % a column vector of length XVALUES, including the 0th order weight
Note that I changed diff to diffx, since diff is a builtin. I'm not sure this will improve performance, as the cost of allocating the intermediate arrays may offset the gain from vectorization.
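To check a vectorized version against the original double loop, something along these lines could be used. It is a sketch with made-up inputs; it adds the constant term weights(1) explicitly, as the loop does, and uses reshape in place of squeeze so that a single data point or a single basis function keeps the [points, basis] layout.

```matlab
% Sketch: compare the double loop with a vectorized evaluation.
gaussians = [0 1 2; 0 1 0];               % hypothetical 2-D means, one per column
x = [0 1 2 3; 0 0 1 1];                   % four hypothetical 2-D data points, one per column
variance = 0.5;
weights = [0.1 1 2 3];                    % weights(1) is the 0th order weight
% Reference: the double loop from the question
yLoop = weights(1)*ones(size(x,2),1);
for xIdx = 1:size(x,2)
    for i = 1:size(gaussians,2)
        d = x(:,xIdx) - gaussians(:,i);
        yLoop(xIdx) = yLoop(xIdx) + weights(i+1)*exp(-(d'*d)/(2*variance));
    end
end
% Vectorized version
diffx = bsxfun(@minus,x,permute(gaussians,[1,3,2]));   % [dims, points, basis]
diffx2 = reshape(sum(diffx.^2,1),size(x,2),[]);        % [points, basis]
weight_col = weights(:);
yVec = weight_col(1) + exp(-diffx2/(2*variance))*weight_col(2:end);
```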
Imagine a set of data with given x-values (as a column vector) and several y-values combined in a matrix (row vector of column vectors). Some of the values in the matrix are not available:
%% Create the test data
N = 1e2; % Number of x-values
x = 2*sort(rand(N, 1))-1;
Y = [x.^2, x.^3, x.^4, x.^5, x.^6]; % Example values
Y(50:80, 4) = NaN(31, 1); % Some values are not available
Now i have a column vector of new x-values for interpolation.
K = 1e2; % Number of interpolation values
x_i = rand(K, 1);
My goal is to find a fast way to interpolate all y-values for the given x_i values. If there are NaN values in the y-values, I want to use the y-value which is before the missing data. In the example case this would be the data in Y(49, :).
If I use interp1, I get NaN-values and the execution is slow for large x and x_i:
starttime = cputime;
Y_i1 = interp1(x, Y, x_i);
executiontime1 = cputime - starttime
An alternative is interp1q, which is about two times faster.
What is a very fast way which allows my modifications?
Possible ideas:
Do postprocessing of Y_i1 to eliminate NaN-values.
Use a combination of a loop and the find-command to always use the neighbour without interpolation.
Using interp1 with spline interpolation (spline) ignores NaN's.
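One possible variant of these ideas is to fill the gaps before interpolating instead of afterwards. The sketch below assumes fillmissing is available (R2016b or later) and uses small made-up data; whether linear interpolation between the filled value and the next valid sample is acceptable at the end of a gap depends on the application.

```matlab
% Sketch: carry the last valid value forward, then interpolate.
x = (0:0.1:1)';                           % hypothetical sorted sample points
Y = [x.^2, x.^3];
Y(5:7,2) = NaN;                           % a gap in the second column
Yfilled = fillmissing(Y,'previous');      % each NaN takes the last valid value above it in its column
x_i = [0.05; 0.45; 0.95];                 % query points inside the range of x
Y_i = interp1(x,Yfilled,x_i);             % linear interpolation on the filled data
```

A NaN in the very first row has no previous value and would stay NaN, so that case would need its own rule.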
Background:
Basically I'm using a dynamic time warping algorithm, like the one used in speech recognition, to try to warp geological data (filter out noise from environmental conditions). The main difference between these two problems is that DTW produces a warping function that allows both input vectors to be warped, whereas for the problem I'm trying to solve I need to keep one reference vector constant while stretching and shrinking the test vector to fit.
here is dtw in matlab:
function [Dist,D,k,w]=dtw()
%Dynamic Time Warping Algorithm
%Dist is unnormalized distance between t and r
%D is the accumulated distance matrix
%k is the normalizing factor
%w is the optimal path
%t is the vector you are testing against
%r is the vector you are testing
[t,r,x1,x2]=randomtestdata();
[rows,N]=size(t);
[rows,M]=size(r);
%for n=1:N
% for m=1:M
% d(n,m)=(t(n)-r(m))^2;
% end
%end
d=(repmat(t(:),1,M)-repmat(r(:)',N,1)).^2; %this replaces the nested for loops from above Thanks Georg Schmitz
D=zeros(size(d));
D(1,1)=d(1,1);
for n=2:N
D(n,1)=d(n,1)+D(n-1,1);
end
for m=2:M
D(1,m)=d(1,m)+D(1,m-1);
end
for n=2:N
for m=2:M
D(n,m)=d(n,m)+min([D(n-1,m),D(n-1,m-1),D(n,m-1)]);
end
end
Dist=D(N,M);
n=N;
m=M;
k=1;
w=[];
w(1,:)=[N,M];
while ((n+m)~=2)
if (n-1)==0
m=m-1;
elseif (m-1)==0
n=n-1;
else
[values,number]=min([D(n-1,m),D(n,m-1),D(n-1,m-1)]);
switch number
case 1
n=n-1;
case 2
m=m-1;
case 3
n=n-1;
m=m-1;
end
end
k=k+1;
w=cat(1,w,[n,m]);
end
w=flipud(w)
%w is a matrix that looks like this:
% 1 1
% 1 2
% 2 2
% 3 3
% 3 4
% 3 5
% 4 5
% 5 6
% 6 6
What this says is that both the first and second points of the second vector should be mapped to the first point of the first vector, i.e. the rows
1 1
1 2
and that the fifth and sixth points of the first vector should be mapped to point six of the second vector, and so on. So w contains the x coordinates of the warped data.
Normally I would be able to say
X1=w(:,1);
X2=w(:,2);
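for i=1:numel(X1) % X1 and X2 have one entry per step of the warping path
Y1(i)=referenceVector(X1(i)); % referenceVector: placeholder name for the reference data
Y2(i)=testVector(X2(i)); % testVector: placeholder name for the test data
end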
but I need not to stretch the reference vector so I need to use the repeats in X1 to know how to shrink Y2 and the repeats in X2 to know how to stretch Y2 rather than using repeats in X1 to stretch Y1 and repeats in X2 to stretch Y2.
I tried using a find method to find the repeats in both X1 and X2 and then average(shrink) or interpolate linearly(stretch) as needed but the code became very complicated and difficult to debug.
Was this really unclear? I had a hard time explaining this problem, but I just need to know how to take w and create a Y2 that is stretched and shrunk accordingly.
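To make the goal concrete, here is one way it might look with accumarray, using the example path w from the comments in the code above. The test values and the names testVector and numRef are made up for illustration. Where several test points map to one reference index they are averaged (shrinking); where one test point maps to several reference indices it is simply repeated (stretching), with no linear interpolation.

```matlab
% Sketch: resample the test vector onto the reference index using the path w.
w = [1 1; 1 2; 2 2; 3 3; 3 4; 3 5; 4 5; 5 6; 6 6];   % example path from above
testVector = [10 20 30 40 50 60]';        % hypothetical test data (column vector)
numRef = max(w(:,1));                     % length of the reference vector
% Group the test samples by the reference index they are matched to and average each group
Y2 = accumarray(w(:,1), testVector(w(:,2)), [numRef 1], @mean);
% Y2 is [15; 20; 40; 50; 60; 60]
```

Y2 then has exactly one value per reference sample, so the reference vector itself is left untouched.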
First, here's DTW in Matlab translated from the pseudocode on wikipedia:
t = 0:.1:2*pi;
x0 = sin(t) + rand(size(t)) * .1;
x1 = sin(.9*t) + rand(size(t)) * .1;
figure
plot(t, x0, t, x1);
hold on
DTW = zeros(length(x0), length(x1));
DTW(1,:) = inf;
DTW(:,1) = inf;
DTW(1,1) = 0;
for i0 = 2:length(x0)
for i1 = 2:length(x1)
cost = abs(x0(i0) - x1(i1));
DTW(i0, i1) = cost + min( [DTW(i0-1, i1) DTW(i0, i1-1) DTW(i0-1, i1-1)] );
end
end
Whether you are warping x_0 onto x_1, x_1 onto x_0, or warping them onto each other, you can get your answer out of the matrix DTW. In your case:
[cost, path] = min(DTW, [], 2);
plot(t, x1(path));
legend({'x_0', 'x_1', 'x_1 warped to x_0'});
I don't have an answer, but I have been playing with the code of @tokkot implemented from the pseudocode in the Wikipedia article. It works, but I think it lacks three requirements of DTW:
The first and last points of both sequences must be a match, with the use of min(), some (or many) of the first and ending points of one of the sequences are lost.
The output sequence is not monotonically increasing. I have used x1(sort(path)) instead, but I don't believe it is the real minimum distance.
Additionally, for a reason I haven't found yet, some intermediate points of the warped sequences are lost, which I believe is not compatible with DTW.
I'm still searching for an algorithm like DTW in which one of the sequences is fixed (not warped). I need to compare a time series of equally spaced temperature measurements with another sequence. The first one cannot be time shifted, it does not make sense.
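Regarding the points above, here is a sketch of how the endpoint and monotonicity requirements might be met: it assumes a cost matrix padded by one row and one column (as in the Wikipedia pseudocode) and an explicit backtracking step instead of a row-wise min. The two sequences are made up, and this does not by itself keep one sequence fixed; it only produces a path w that starts at (1,1), ends at (N,M) and never steps backwards.

```matlab
% Sketch: DTW with a padded cost matrix and explicit backtracking.
x0 = [0 1 2 3 2 1 0];                     % hypothetical first sequence
x1 = [0 0 1 2 3 3 2 1 0];                 % hypothetical second sequence
N = numel(x0); M = numel(x1);
D = inf(N+1,M+1);                         % D(n+1,m+1) holds the cost of aligning x0(1:n) with x1(1:m)
D(1,1) = 0;
for n = 1:N
    for m = 1:M
        cost = abs(x0(n) - x1(m));
        D(n+1,m+1) = cost + min([D(n,m+1), D(n+1,m), D(n,m)]);
    end
end
% Backtrack from the last pair (N,M) to the first pair (1,1)
n = N; m = M;
w = [n m];
while n > 1 || m > 1
    % Predecessors of (n,m) in padded indices: diagonal, step in x0 only, step in x1 only
    [~,step] = min([D(n,m), D(n,m+1), D(n+1,m)]);
    switch step
        case 1
            n = n-1; m = m-1;
        case 2
            n = n-1;
        case 3
            m = m-1;
    end
    w = [n m; w];                         % prepend so the path runs from start to end
end
```

The Inf entries in the padding keep the backtracking inside the valid index range, so no special cases for n == 1 or m == 1 are needed.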