pdist2 equivalent in MATLAB version 7 - matlab

I need to calculate the euclidean distance between 2 matrices in matlab. Currently I am using bsxfun and calculating the distance as below( i am attaching a snippet of the code ):
for i=1:4754
test_data=fea_test(i,:);
d=sqrt(sum(bsxfun(#minus, test_data, fea_train).^2, 2));
end
Size of fea_test is 4754x1024 and fea_train is 6800x1024 , using his for loop is causing the execution of the for to take approximately 12 minutes which I think is too high.
Is there a way to calculate the euclidean distance between both the matrices faster?
I was told that by removing unnecessary for loops I can reduce the execution time. I also know that pdist2 can help reduce the time for calculation but since I am using version 7. of matlab I do not have the pdist2 function. Upgrade is not an option.
Any help.
Regards,
Bhavya

Here is vectorized implementation for computing the euclidean distance that is much faster than what you have (even significantly faster than PDIST2 on my machine):
D = sqrt( bsxfun(#plus,sum(A.^2,2),sum(B.^2,2)') - 2*(A*B') );
It is based on the fact that: ||u-v||^2 = ||u||^2 + ||v||^2 - 2*u.v
Consider below a crude comparison between the two methods:
A = rand(4754,1024);
B = rand(6800,1024);
tic
D = pdist2(A,B,'euclidean');
toc
tic
DD = sqrt( bsxfun(#plus,sum(A.^2,2),sum(B.^2,2)') - 2*(A*B') );
toc
On my WinXP laptop running R2011b, we can see a 10x times improvement in time:
Elapsed time is 70.939146 seconds. %# PDIST2
Elapsed time is 7.879438 seconds. %# vectorized solution
You should be aware that it does not give exactly the same results as PDIST2 down to the smallest precision.. By comparing the results, you will see small differences (usually close to eps the floating-point relative accuracy):
>> max( abs(D(:)-DD(:)) )
ans =
1.0658e-013
On a side note, I've collected around 10 different implementations (some are just small variations of each other) for this distance computation, and have been comparing them. You would be surprised how fast simple loops can be (thanks to the JIT), compared to other vectorized solutions...

You could fully vectorize the calculation by repeating the rows of fea_test 6800 times, and of fea_train 4754 times, like this:
rA = size(fea_test,1);
rB = size(fea_train,1);
[I,J]=ndgrid(1:rA,1:rB);
d = zeros(rA,rB);
d(:) = sqrt(sum(fea_test(J(:),:)-fea_train(I(:),:)).^2,2));
However, this would lead to intermediary arrays of size 6800x4754x1024 (*8 bytes for doubles), which will take up ~250GB of RAM. Thus, the full vectorization won't work.
You can, however, reduce the time of the distance calculation by preallocation, and by not calculating the square root before it's necessary:
rA = size(fea_test,1);
rB = size(fea_train,1);
d = zeros(rA,rB);
for i = 1:rA
test_data=fea_test(i,:);
d(i,:)=sum( (test_data(ones(nB,1),:) - fea_train).^2, 2))';
end
d = sqrt(d);

Try this vectorized version, it should be pretty efficient. Edit: just noticed that my answer is similar to #Amro's.
function K = calculateEuclideanDist(P,Q)
% Vectorized method to compute pairwise Euclidean distance
% Returns K(i,j) = sqrt((P(i,:) - Q(j,:))'*(P(i,:) - Q(j,:)))
[nP, d] = size(P);
[nQ, d] = size(Q);
pmag = sum(P .* P, 2);
qmag = sum(Q .* Q, 2);
K = sqrt(ones(nP,1)*qmag' + pmag*ones(1,nQ) - 2*P*Q');
end

Related

Vectorizing a matlab loop with internal functions

I have a 3D Mesh grid, X, Y, Z. I want to create a new 3D array that is a function of X, Y, & Z. That function comprises the sum of several 3D Gaussians located at different points. Currently, I have a for loop that runs over the different points where I have my gaussians, and I have an array of center locations r0(nGauss, 1:3)
[X,Y,Z]=meshgrid(-10:.1:10);
Psi=0*X;
for index = 1:nGauss
Psi = Psi + Gauss3D(X,Y,Z,[r0(index,1),r0(index,2),r0(index,3)]);
end
where my 3D gaussian function is
function output=Gauss3D(X,Y,Z,r0)
output=exp(-(X-r0(1)).^2 + (Y-r0(2)).^2 + (Z-r0(3)).^2);
end
I'm happy to redesign the function, which is the slowest part of my code and has to happen many many time, but I can't figure out how to vectorize this so that it will run faster. Any suggestions would be appreciated
*****NB the Original function had a square root in it, and has been modified to make it an actual gaussian***
NOTE! I've modified your code to create a Gaussian, which was:
output=exp(-sqrt((X-r0(1)).^2 + (Y-r0(2)).^2 + (Z-r0(3)).^2));
That does not make a Gaussian. I changed this to:
output = exp(-((X-r0(1)).^2 + (Y-r0(2)).^2 + (Z-r0(3)).^2));
(note no sqrt). This is a Gaussian with sigma = sqrt(1/2).
If this is not what you want, then this answer might not be very useful to you, because your function does not go to 0 as fast as the Gaussian, and therefore is harder to truncate, and it is not separable.
Vectorizing this code is pointless, as the other answers attest. MATLAB's JIT is perfectly capable of running this as fast as it'll go. But you can reduce the amount of computation significantly by noting that the Gaussian goes to almost zero very quickly, and is separable:
Most of the exp evaluations you're doing here yield a very tiny number. You don't need to compute those, just fill in 0.
exp(-x.^2-y.^2) is the same as exp(-x.^2).*exp(-y.^2), which is much cheaper to compute.
Let's put these two things to the test. Here is the test code:
function gaussian_test
N = 100;
r0 = rand(N,3)*20 - 10;
% Original
tic
[X,Y,Z] = meshgrid(-10:.1:10);
Psi1 = zeros(size(X));
for index = 1:N
Psi1 = Psi1 + Gauss3D(X,Y,Z,r0(index,:));
end
t = toc;
fprintf('original, time = %f\n',t)
% Fast, large truncation
tic
[X,Y,Z] = deal(-10:.1:10);
Psi2 = zeros(numel(X),numel(Y),numel(Z));
for index = 1:N
Psi2 = Gauss3D_fast(Psi2,X,Y,Z,r0(index,:),5);
end
t = toc;
fprintf('tuncation = 5, time = %f\n',t)
fprintf('mean abs error = %f\n',mean(reshape(abs(Psi2-Psi1),[],1)))
fprintf('mean square error = %f\n',mean(reshape((Psi2-Psi1).^2,[],1)))
fprintf('max abs error = %f\n',max(reshape(abs(Psi2-Psi1),[],1)))
% Fast, smaller truncation
tic
[X,Y,Z] = deal(-10:.1:10);
Psi3 = zeros(numel(X),numel(Y),numel(Z));
for index = 1:N
Psi3 = Gauss3D_fast(Psi3,X,Y,Z,r0(index,:),3);
end
t = toc;
fprintf('tuncation = 3, time = %f\n',t)
fprintf('mean abs error = %f\n',mean(reshape(abs(Psi3-Psi1),[],1)))
fprintf('mean square error = %f\n',mean(reshape((Psi3-Psi1).^2,[],1)))
fprintf('max abs error = %f\n',max(reshape(abs(Psi3-Psi1),[],1)))
% DIPimage, same smaller truncation
tic
Psi4 = newim(201,201,201);
coords = (r0+10) * 10;
Psi4 = gaussianblob(Psi4,coords,10*sqrt(1/2),(pi*100).^(3/2));
t = toc;
fprintf('DIPimage, time = %f\n',t)
fprintf('mean abs error = %f\n',mean(reshape(abs(Psi4-Psi1),[],1)))
fprintf('mean square error = %f\n',mean(reshape((Psi4-Psi1).^2,[],1)))
fprintf('max abs error = %f\n',max(reshape(abs(Psi4-Psi1),[],1)))
end % of function gaussian_test
function output = Gauss3D(X,Y,Z,r0)
output = exp(-((X-r0(1)).^2 + (Y-r0(2)).^2 + (Z-r0(3)).^2));
end
function Psi = Gauss3D_fast(Psi,X,Y,Z,r0,trunc)
% sigma = sqrt(1/2)
x = X-r0(1);
y = Y-r0(2);
z = Z-r0(3);
mx = abs(x) < trunc*sqrt(1/2);
my = abs(y) < trunc*sqrt(1/2);
mz = abs(z) < trunc*sqrt(1/2);
Psi(my,mx,mz) = Psi(my,mx,mz) + exp(-x(mx).^2) .* reshape(exp(-y(my).^2),[],1) .* reshape(exp(-z(mz).^2),1,1,[]);
% Note! the line above uses implicit singleton expansion. For older MATLABs use bsxfun
end
This is the output on my machine, reordered for readability (I'm still on MATLAB R2017a):
| time(s) | mean abs | mean sq. | max abs
--------------+----------+----------+----------+----------
original | 5.035762 | | |
tuncation = 5 | 0.169807 | 0.000000 | 0.000000 | 0.000005
tuncation = 3 | 0.054737 | 0.000452 | 0.000002 | 0.024378
DIPimage | 0.044099 | 0.000452 | 0.000002 | 0.024378
As you can see, using these two properties of the Gaussian we can reduce time from 5.0 s to 0.17 s, a 30x speedup, with hardly noticeable differences (truncating at 5*sigma). A further 3x speedup can be gained by allowing a small error. The smallest the truncation value, the faster this will go, but the larger the error will be.
I added that last method, the gaussianblob function from DIPimage (I'm an author), just to show that option in case you need to squeeze that bit of extra time from your code. That function is implemented in C++. This version that I used you will need to compile yourself. Our current official release implements this function still in M-file code, and is not as fast.
Further chance of improvement is if the fractional part of the coordinates is always the same (w.r.t. the pixel grid). In this case, you can draw the Gaussian once, and shift it over to each of the centroids.
Another alternative involves computing the Gaussian once, at a somewhat larger scale, and interpolating into it to generate each of the 1D Gaussians needed to generate the output. I did not implement this, I have no idea if it will be faster or if the time difference will be significant. In the old days, exp was expensive, I'm not sure this is still the case.
So, I am building off of the answer above me #Durkee. I enjoy these kinds of problems, so I thought a little about how to make each of the expansions implicit, and I have the one-line function below. Using this function I shaved .11 seconds off of the call, which is completely negligible. It looks like yours is pretty decent. The only advantage of mine might be how the code scales on a finer mesh.
xLin = [-10:.1:10]';
tic
psi2 = sum(exp(-sqrt((permute(xLin-r0(:,1)',[3 1 4 2])).^2 ...
+ (permute(xLin-r0(:,2)',[1 3 4 2])).^2 ...
+ (permute(xLin-r0(:,3)',[3 4 1 2])).^2)),4);
toc
The relative run times on my computer were (all things kept the same):
Original - 1.234085
Other - 2.445375
Mine - 1.120701
So this is a bit of an unusual problem where on my computer the unvectorized code actually works better than the vectorized code, here is my script
clear
[X,Y,Z]=meshgrid(-10:.1:10);
Psi=0*X;
nGauss = 20; %Sample nGauss as you didn't specify
r0 = rand(nGauss,3); % Just make this up as it doesn't really matter in this case
% Your original code
tic
for index = 1:nGauss
Psi = Psi + Gauss3D(X,Y,Z,[r0(index,1),r0(index,2),r0(index,3)]);
end
toc
% Vectorize these functions so we can use implicit broadcasting
X1 = X(:);
Y1 = Y(:);
Z1 = Z(:);
tic
val = [X1 Y1 Z1];
% Change the dimensions so that r0 operates on the right elements
r0_temp = permute(r0,[3 2 1]);
% Perform the gaussian combination
out = sum(exp(-sqrt(sum((val-r0_temp).^2,2))),3);
toc
% Check to make sure both functions match
sum(abs(vec(Psi)-vec(out)))
function output=Gauss3D(X,Y,Z,r0)
output=exp(-sqrt((X-r0(1)).^2 + (Y-r0(2)).^2 + (Z-r0(3)).^2));
end
function out = vec(in)
out = in(:);
end
As you can see, this is probably about as vectorized as you can get. The whole function is done using broadcasting and vectorized operations which normally improve performance ten-one hundredfold. However, in this case, this is not what we see
Elapsed time is 1.876460 seconds.
Elapsed time is 2.909152 seconds.
This actually shows the unvectorized version as being faster.
There could be a few reasons for this of which I am by no means an expert.
MATLAB uses a JIT compiler now which means that for loops are no longer inefficient.
Your code is already reasonably vectorized, you are operating at 8 million elements at once
Unless nGauss is 1000 or something, you're not looping through that much, and at that point, vectorization means you will run out of memory
I could be hitting some memory threshold where I am using too much memory and that is making my code inefficient, I noticed that when I lowered the resolution on the meshgrid the vectorized version worked better
As an aside, I tested this on my GTX 1060 GPU with single precision(single precision is 10x faster than double precision on most GPUs)
Elapsed time is 0.087405 seconds.
Elapsed time is 0.241456 seconds.
Once again the unvectorized version is faster, sorry I couldn't help you out but it seems that your code is about as good as you are going to get unless you lower the tolerances on your meshgrid.

Vectorizing the solution of a linear equation system in MATLAB

Summary: This question deals with the improvement of an algorithm for the computation of linear regression.
I have a 3D (dlMAT) array representing monochrome photographs of the same scene taken at different exposure times (the vector IT) . Mathematically, every vector along the 3rd dimension of dlMAT represents a separate linear regression problem that needs to be solved. The equation whose coefficients need to be estimated is of the form:
DL = R*IT^P, where DL and IT are obtained experimentally and R and P must be estimated.
The above equation can be transformed into a simple linear model after applying a logarithm:
log(DL) = log(R) + P*log(IT) => y = a + b*x
Presented below is the most "naive" way to solve this system of equations, which essentially involves iterating over all "3rd dimension vectors" and fitting a polynomial of order 1 to (IT,DL(ind1,ind2,:):
%// Define some nominal values:
R = 0.3;
IT = 600:600:3000;
P = 0.97;
%// Impose some believable spatial variations:
pMAT = 0.01*randn(3)+P;
rMAT = 0.1*randn(3)+R;
%// Generate "fake" observation data:
dlMAT = bsxfun(#times,rMAT,bsxfun(#power,permute(IT,[3,1,2]),pMAT));
%// Regression:
sol = cell(size(rMAT)); %// preallocation
for ind1 = 1:size(dlMAT,1)
for ind2 = 1:size(dlMAT,2)
sol{ind1,ind2} = polyfit(log(IT(:)),log(squeeze(dlMAT(ind1,ind2,:))),1);
end
end
fittedP = cellfun(#(x)x(1),sol); %// Estimate of pMAT
fittedR = cellfun(#(x)exp(x(2)),sol); %// Estimate of rMAT
The above approach seems like a good candidate for vectorization, since it does not utilize MATLAB's main strength that is MATrix operations. For this reason, it does not scale very well and takes much longer to execute than I think it should.
There exist alternative ways to perform this computation based on matrix division, as demonstrated here and here, which involve something like this:
sol = [ones(size(x)),log(x)]\log(y);
That is, appending a vector of 1s to the observations, followed by mldivide to solve the equation system.
The main challenge I'm facing is how to adapt my data to the algorithm (or vice versa).
Question #1: How can the matrix-division-based solution be extended to solve the problem presented above (and potentially replace the loops I am using)?
Question #2 (bonus): What is the principle behind this matrix-division-based solution?
The secret ingredient behind the solution that includes matrix division is the Vandermonde matrix. The question discusses a linear problem (linear regression), and those can always be formulated as a matrix problem, which \ (mldivide) can solve in a mean-square error senseā€”. Such an algorithm, solving a similar problem, is demonstrated and explained in this answer.
Below is benchmarking code that compares the original solution with two alternatives suggested in chat1, 2 :
function regressionBenchmark(numEl)
clc
if nargin<1, numEl=10; end
%// Define some nominal values:
R = 5;
IT = 600:600:3000;
P = 0.97;
%// Impose some believable spatial variations:
pMAT = 0.01*randn(numEl)+P;
rMAT = 0.1*randn(numEl)+R;
%// Generate "fake" measurement data using the relation "DL = R*IT.^P"
dlMAT = bsxfun(#times,rMAT,bsxfun(#power,permute(IT,[3,1,2]),pMAT));
%% // Method1: loops + polyval
disp('-------------------------------Method 1: loops + polyval')
tic; [fR,fP] = method1(IT,dlMAT); toc;
fprintf(1,'Regression performance:\nR: %d\nP: %d\n',norm(fR-rMAT,1),norm(fP-pMAT,1));
%% // Method2: loops + Vandermonde
disp('-------------------------------Method 2: loops + Vandermonde')
tic; [fR,fP] = method2(IT,dlMAT); toc;
fprintf(1,'Regression performance:\nR: %d\nP: %d\n',norm(fR-rMAT,1),norm(fP-pMAT,1));
%% // Method3: vectorized Vandermonde
disp('-------------------------------Method 3: vectorized Vandermonde')
tic; [fR,fP] = method3(IT,dlMAT); toc;
fprintf(1,'Regression performance:\nR: %d\nP: %d\n',norm(fR-rMAT,1),norm(fP-pMAT,1));
function [fittedR,fittedP] = method1(IT,dlMAT)
sol = cell(size(dlMAT,1),size(dlMAT,2));
for ind1 = 1:size(dlMAT,1)
for ind2 = 1:size(dlMAT,2)
sol{ind1,ind2} = polyfit(log(IT(:)),log(squeeze(dlMAT(ind1,ind2,:))),1);
end
end
fittedR = cellfun(#(x)exp(x(2)),sol);
fittedP = cellfun(#(x)x(1),sol);
function [fittedR,fittedP] = method2(IT,dlMAT)
sol = cell(size(dlMAT,1),size(dlMAT,2));
for ind1 = 1:size(dlMAT,1)
for ind2 = 1:size(dlMAT,2)
sol{ind1,ind2} = flipud([ones(numel(IT),1) log(IT(:))]\log(squeeze(dlMAT(ind1,ind2,:)))).'; %'
end
end
fittedR = cellfun(#(x)exp(x(2)),sol);
fittedP = cellfun(#(x)x(1),sol);
function [fittedR,fittedP] = method3(IT,dlMAT)
N = 1; %// Degree of polynomial
VM = bsxfun(#power, log(IT(:)), 0:N); %// Vandermonde matrix
result = fliplr((VM\log(reshape(dlMAT,[],size(dlMAT,3)).')).');
%// Compressed version:
%// result = fliplr(([ones(numel(IT),1) log(IT(:))]\log(reshape(dlMAT,[],size(dlMAT,3)).')).');
fittedR = exp(real(reshape(result(:,2),size(dlMAT,1),size(dlMAT,2))));
fittedP = real(reshape(result(:,1),size(dlMAT,1),size(dlMAT,2)));
The reason why method 2 can be vectorized into method 3 is essentially that matrix multiplication can be separated by the columns of the second matrix. If A*B produces matrix X, then by definition A*B(:,n) gives X(:,n) for any n. Moving A to the right-hand side with mldivide, this means that the divisions A\X(:,n) can be done in one go for all n with A\X. The same holds for an overdetermined system (linear regression problem), in which there is no exact solution in general, and mldivide finds the matrix that minimizes the mean-square error. In this case too, the operations A\X(:,n) (method 2) can be done in one go for all n with A\X (method 3).
The implications of improving the algorithm when increasing the size of dlMAT can be seen below:
For the case of 500*500 (or 2.5E5) elements, the speedup from Method 1 to Method 3 is about x3500!
It is also interesting to observe the output of profile (here, for the case of 500*500):
Method 1
Method 2
Method 3
From the above it is seen that rearranging the elements via squeeze and flipud takes up about half (!) of the runtime of Method 2. It is also seen that some time is lost on the conversion of the solution from cells to matrices.
Since the 3rd solution avoids all of these pitfalls, as well as the loops altogether (which mostly means re-evaluation of the script on every iteration) - it unsurprisingly results in a considerable speedup.
Notes:
There was very little difference between the "compressed" and the "explicit" versions of Method 3 in favor of the "explicit" version. For this reason it was not included in the comparison.
A solution was attempted where the inputs to Method 3 were gpuArray-ed. This did not provide improved performance (and even somewhat degradaed them), possibly due to wrong implementation, or the overhead associated with copying matrices back and forth between RAM and VRAM.

How do I efficiently replace a function with a lookup?

I am trying to increase the speed of code that operates on large datasets. I need to perform the function out = sinc(x), where x is a 2048-by-37499 matrix of doubles. This is very expensive and is the bottleneck of my program (even when computed on the GPU).
I am looking for any solution which improves the speed of this operation.
I expect that this might be achieved by pre-computing a vector LookUp = sinc(y) where y is the vector y = min(min(x)):dy:max(max(x)), i.e. a vector spanning the whole range of expected x elements.
How can I efficiently generate an approximation of sinc(x) from this LookUp vector?
I need to avoid generating a three dimensional array, since this would consume more memory than I have available.
Here is a test for the interp1 solution:
a = -15;
b = 15;
rands = (b-a).*rand(1024,37499) + a;
sincx = -15:0.000005:15;
sincy = sinc(sincx);
tic
res1 = interp1(sincx,sincy,rands);
toc
tic
res2 = sinc(rands);
toc'
sincx = gpuArray(sincx);
sincy = gpuArray(sincy);
r = gpuArray(rands);
tic
r = interp1(sincx,sincy,r);
toc
r = gpuArray(rands);
tic
r = sinc(r);
toc
Elapsed time is 0.426091 seconds.
Elapsed time is 0.472551 seconds.
Elapsed time is 0.004311 seconds.
Elapsed time is 0.130904 seconds.
Corresponding to CPU interp1, CPU sinc, GPU interp1, GPU sinc respectively
Not sure I understood completely your problem.
But once you have LookUp = sinc(y) you can use the Matlab function interp1
out = interp1(y,LookUp,x)
where x can be a matrix of any size
I came to the conclusion, that your code can not be improved significantly. The fastest possible lookup table is based on simple indexing. For a performance test, lets just perform the test based on random data:
%test data:
x=rand(2048,37499);
%relevant code:
out = sinc(x);
Now the lookup based on integer indices:
a=min(x(:));
b=max(x(:));
n=1000;
x2=round((x-a)/(b-a)*(n-1)+1);
lookup=sinc(1:n);
out2=lookup(x2);
Regardless of the size of the lookup table or the input data, the last lines in both code blocks take roughly the same time. Having sinc evaluate roughly as fast as a indexing operation, I can only assume that it is already implemented using a lookup table.
I found a faster way (if you have a NVIDIA GPU on your PC) , however this will return NaN for x=0, but if, for any reason, you can deal with having NaN or you know it will never be zero then:
if you define r = gpuArray(rands); and actually evaluate the sinc function by yourself in the GPU as:
tic
r=rdivide(sin(pi*r),pi*r);
toc
This generally is giving me about 3.2x the speed than the interp1 version in the GPU, and its more accurate (tested using your code above, iterating 100 times with different random data, having both methods similar std).
This works because sin and elementwise division rdivide are also GPU implemented (while for some reason sinc isn't) . See: http://uk.mathworks.com/help/distcomp/run-built-in-functions-on-a-gpu.html
m = min(x(:));
y = m:dy:max(x(:));
LookUp = sinc(y);
now sinc(n) should equal
LookUp((n-m)/dy + 1)
assuming n is an integer multiple of dy and lies within the range m and max(x(:)). To get to the LookUp index (i.e. an integer between 1 and numel(y), we first shift n but the minimum m, then scale it by dy and finally add 1 because MATLAB indexes from 1 instead of 0.
I don't know what that wll do for you efficiency though but give it a try.
Also you can put this into an anonymous function to help readability:
sinc_lookup = #(n)(LookUp((n-m)/dy + 1))
and now you can just call
sinc_lookup(n)

Apply function to rolling window

Say I have a long list A of values (say of length 1000) for which I want to compute the std in pairs of 100, i.e. I want to compute std(A(1:100)), std(A(2:101)), std(A(3:102)), ..., std(A(901:1000)).
In Excel/VBA one can easily accomplish this by writing e.g. =STDEV(A1:A100) in one cell and then filling down in one go. Now my question is, how could one accomplish this efficiently in Matlab without having to use any expensive for-loops.
edit: Is it also possible to do this for a list of time series, e.g. when A has dimensions 1000 x 4 (i.e. 4 time series of length 1000)? The output matrix should then have dimensions 901 x 4.
Note: For the fastest solution see Luis Mendo's answer
So firstly using a for loop for this (especially if those are your actual dimensions) really isn't going to be expensive. Unless you're using a very old version of Matlab, the JIT compiler (together with pre-allocation of course) makes for loops inexpensive.
Secondly - have you tried for loops yet? Because you should really try out the naive implementation first before you start optimizing prematurely.
Thirdly - arrayfun can make this a one liner but it is basically just a for loop with extra overhead and very likely to be slower than a for loop if speed really is your concern.
Finally some code:
n = 1000;
A = rand(n,1);
l = 100;
for loop (hardly bulky, likely to be efficient):
S = zeros(n-l+1,1); %//Pre-allocation of memory like this is essential for efficiency!
for t = 1:(n-l+1)
S(t) = std(A(t:(t+l-1)));
end
A vectorized (memory in-efficient!) solution:
[X,Y] = meshgrid(1:l)
S = std(A(X+Y-1))
A probably better vectorized solution (and a one-liner) but still memory in-efficient:
S = std(A(bsxfun(#plus, 0:l-1, (1:l)')))
Note that with all these methods you can replace std with any function so long as it is applies itself to the columns of the matrix (which is the standard in Matlab)
Going 2D:
To go 2D we need to go 3D
n = 1000;
k = 4;
A = rand(n,k);
l = 100;
ind = bsxfun(#plus, permute(o:n:(k-1)*n, [3,1,2]), bsxfun(#plus, 0:l-1, (1:l)')); %'
S = squeeze(std(A(ind)));
M = squeeze(mean(A(ind)));
%// etc...
OR
[X,Y,Z] = meshgrid(1:l, 1:l, o:n:(k-1)*n);
ind = X+Y+Z-1;
S = squeeze(std(A(ind)))
M = squeeze(mean(A(ind)))
%// etc...
OR
ind = bsxfun(#plus, 0:l-1, (1:l)'); %'
for t = 1:k
S = std(A(ind));
M = mean(A(ind));
%// etc...
end
OR (taken from Luis Mendo's answer - note in his answer he shows a faster alternative to this simple loop)
S = zeros(n-l+1,k);
M = zeros(n-l+1,k);
for t = 1:(n-l+1)
S(t,:) = std(A(k:(k+l-1),:));
M(t,:) = mean(A(k:(k+l-1),:));
%// etc...
end
What you're doing is basically a filter operation.
If you have access to the image processing toolbox,
stdfilt(A,ones(101,1)) %# assumes that data series are in columns
will do the trick (no matter the dimensionality of A). Note that if you also have access to the parallel computing toolbox, you can let filter operations like these run on a GPU, although your problem might be too small to generate noticeable speedups.
To minimize number of operations, you can exploit the fact that the standard deviation can be computed as a difference involving second and first moments,
and moments over a rolling window are obtained efficiently with a cumulative sum (using cumsum):
A = randn(1000,4); %// random data
N = 100; %// window size
c = size(A,2);
A1 = [zeros(1,c); cumsum(A)];
A2 = [zeros(1,c); cumsum(A.^2)];
S = sqrt( (A2(1+N:end,:)-A2(1:end-N,:) ...
- (A1(1+N:end,:)-A1(1:end-N,:)).^2/N) / (N-1) ); %// result
Benchmarking
Here's a comparison against a loop based solution, using timeit. The loop approach is as in Dan's solution but adapted to the 2D case, exploting the fact that std works along each column in a vectorized manner.
%// File loop_approach.m
function S = loop_approach(A,N);
[n, p] = size(A);
S = zeros(n-N+1,p);
for k = 1:(n-N+1)
S(k,:) = std(A(k:(k+N-1),:));
end
%// File bsxfun_approach.m
function S = bsxfun_approach(A,N);
[n, p] = size(A);
ind = bsxfun(#plus, permute(0:n:(p-1)*n, [3,1,2]), bsxfun(#plus, 0:n-N, (1:N).')); %'
S = squeeze(std(A(ind)));
%// File cumsum_approach.m
function S = cumsum_approach(A,N);
c = size(A,2);
A1 = [zeros(1,c); cumsum(A)];
A2 = [zeros(1,c); cumsum(A.^2)];
S = sqrt( (A2(1+N:end,:)-A2(1:end-N,:) ...
- (A1(1+N:end,:)-A1(1:end-N,:)).^2/N) / (N-1) );
%// Benchmarking code
clear all
A = randn(1000,4); %// Or A = randn(1000,1);
N = 100;
t_loop = timeit(#() loop_approach(A,N));
t_bsxfun = timeit(#() bsxfun_approach(A,N));
t_cumsum = timeit(#() cumsum_approach(A,N));
disp(' ')
disp(['loop approach: ' num2str(t_loop)])
disp(['bsxfun approach: ' num2str(t_bsxfun)])
disp(['cumsum approach: ' num2str(t_cumsum)])
disp(' ')
disp(['bsxfun/loop gain factor: ' num2str(t_loop/t_bsxfun)])
disp(['cumsum/loop gain factor: ' num2str(t_loop/t_cumsum)])
Results
I'm using Matlab R2014b, Windows 7 64 bits, dual core processor, 4 GB RAM:
4-column case:
loop approach: 0.092035
bsxfun approach: 0.023535
cumsum approach: 0.0002338
bsxfun/loop gain factor: 3.9106
cumsum/loop gain factor: 393.6526
Single-column case:
loop approach: 0.085618
bsxfun approach: 0.0040495
cumsum approach: 8.3642e-05
bsxfun/loop gain factor: 21.1431
cumsum/loop gain factor: 1023.6236
So the cumsum-based approach seems to be the fastest: about 400 times faster than the loop in the 4-column case, and 1000 times faster in the single-column case.
Several functions can do the job efficiently in Matlab.
On one side, you can use functions such as colfilt or nlfilter, which performs computations on sliding blocks. colfilt is way more efficient than nlfilter, but can be used only if the order of the elements inside a block does not matter. Here is how to use it on your data:
S = colfilt(A, [100,1], 'sliding', #std);
or
S = nlfilter(A, [100,1], #std);
On your example, you can clearly see the difference of performance. But there is a trick : both functions pad the input array so that the output vector has the same size as the input array. To get only the relevant part of the output vector, you need to skip the first floor((100-1)/2) = 49 first elements, and take 1000-100+1 values.
S(50:end-50)
But there is also another solution, close to colfilt, more efficient. colfilt calls col2im to reshape the input vector into a matrix on which it applies the given function on each distinct column. This transforms your input vector of size [1000,1] into a matrix of size [100,901]. But colfilt pads the input array with 0 or 1, and you don't need it. So you can run colfilt without the padding step, then apply std on each column and this is easy because std applied on a matrix returns a row vector of the stds of the columns. Finally, transpose it to get a column vector if you want. In brief and in one line:
S = std(im2col(X,[100 1],'sliding')).';
Remark: if you want to apply a more complex function, see the code of colfilt, line 144 and 147 (for v2013b).
If your concern is speed of the for loop, you can greatly reduce the number of loop iteration by folding your vector into an array (using reshape) with the columns having the number of element you want to apply your function on.
This will let Matlab and the JIT perform the optimization (and in most case they do that way better than us) by calculating your function on each column of your array.
You then reshape an offseted version of your array and do the same. You will still need a loop but the number of iteration will only be l (so 100 in your example case), instead of n-l+1=901 in a classic for loop (one window at a time).
When you're done, you reshape the array of result in a vector, then you still need to calculate manually the last window, but overall it is still much faster.
Taking the same input notation than Dan:
n = 1000;
A = rand(n,1);
l = 100;
It will take this shape:
width = (n/l)-1 ; %// width of each line in the temporary result array
tmp = zeros( l , width ) ; %// preallocation never hurts
for k = 1:l
tmp(k,:) = std( reshape( A(k:end-l+k-1) , l , [] ) ) ; %// calculate your stat on the array (reshaped vector)
end
S2 = [tmp(:) ; std( A(end-l+1:end) ) ] ; %// "unfold" your results then add the last window calculation
If I tic ... toc the complete loop version and the folded one, I obtain this averaged results:
Elapsed time is 0.057190 seconds. %// windows by window FOR loop
Elapsed time is 0.016345 seconds. %// "Folded" FOR loop
I know tic/toc is not the way to go for perfect timing but I don't have the timeit function on my matlab version. Besides, the difference is significant enough to show that there is an improvement (albeit not precisely quantifiable by this method). I removed the first run of course and I checked that the results are consistent with different matrix sizes.
Now regarding your "one liner" request, I suggest your wrap this code into a function like so:
function out = foldfunction( func , vec , nPts )
n = length( vec ) ;
width = (n/nPts)-1 ;
tmp = zeros( nPts , width ) ;
for k = 1:nPts
tmp(k,:) = func( reshape( vec(k:end-nPts+k-1) , nPts , [] ) ) ;
end
out = [tmp(:) ; func( vec(end-nPts+1:end) ) ] ;
Which in your main code allows you to call it in one line:
S = foldfunction( #std , A , l ) ;
The other great benefit of this format, is that you can use the very same sub function for other statistical function. For example, if you want the "mean" of your windows, you call the same just changing the func argument:
S = foldfunction( #mean , A , l ) ;
Only restriction, as it is it only works for vector as input, but with a bit of rework it could be made to take arrays as input too.

MATLab Bootstrap without for loop

yesterday I implemented my first bootstrap in MATLab. (and yes, I know, for loops are evil.):
%data is an mxn matrix where the data should be sampled per column but there
can be a NaNs Elements
%from the array (a column of data) n values are sampled nReps times
function result = bootstrap_std(data, n, nReps,quantil)
result = zeros(1,size(data,2));
for i=1:size(data,2)
bootstrap_data = zeros(n,nReps);
values = find(~isnan(data(:,i)));
if isempty(values)
bootstrap_data(:,:) = NaN;
else
for k=1:nReps
bootstrap_data(:,k) = datasample(data(values,i),n);
end
end
stat = zeros(1,nReps);
for k=1:nReps
stat(k) = nanstd(bootstrap_data(:,k));
end
sort(stat);
result(i) = quantile(stat,quantil);
end
end
As one can see, this version works columnwise. The algorithm does what it should but is really slow when the data size increaes. My question is now: Is it possible to implement this logic without using for loops? My problem is here that I could not find a version of datasample which does the sampling columnwise. Or is there a better function to use?
I am happy for any hint or idea how I can speed up this implementation.
Thanks and best regards!
stephan
The bottlenecks in your implementation are
The function spends a lot of time inside nanstd which is unnecessary since you exclude NaN values from your sample anyway.
There are a lot of functions that operate column-wise, but you spend time looping over the columns and calling them many times.
You make many calls to datasample which is a relatively slow function. It's much faster to create a random vector of indices using randi and use that instead.
Here's how I would write the function (actually I probably wouldn't put in this many comments, and I wouldn't use so many temp variables, but I'm doing it now so you can see what all the steps of the computation are).
function result = bootstrap_std_new(data, n, nRep, quantil)
result = zeros(1, size(data,2));
for i = 1:size(data,2)
isbad = isnan(data(:,i)); %// Vector of NaN values
if all(isbad)
result(i) = NaN;
else
data0 = data(~isbad, i); %// Temp copy of this column for indexing
index = randi(size(data0,1), n, nRep); %// Create the indexing vector
bootstrapdata = data0(index); %// Sample the data
stdevs = std(bootstrapdata); %// Stdev of sampled data
result(i) = quantile(stdevs, quantil); %// Find the correct quantile
end
end
end
Here are some timings
>> data = randn(100,10);
>> data(randi(1000, 50, 1)) = NaN;
>> tic, bootstrap_std(data, 50, 1000, 0.5); toc
Elapsed time is 1.359529 seconds.
>> tic, bootstrap_std_new(data, 50, 1000, 0.5); toc
Elapsed time is 0.038558 seconds.
So this gives you about a 35x speedup.
Your main issue seems to be that you may have varying numbers/positions of NaN in each column, so can't work on the full matrix unless you're okay with also sampling NaNs. However, some of the inner loops could be simplified.
for k=1:nReps
bootstrap_data(:,k) = datasample(data(values,i),n);
end
Since you're sampling with replacement, you should be able to just do:
bootstrap_data = datasample(data(values,i), n*nReps);
bootstrap_data = reshape(bootstrap_data, [n nReps]);
Also nanstd can work on a full matrix so no need to loop:
stat = nanstd(bootstrap_data); % or nanstd(x,0,2) to change dimension
It would also be worth just looking over your code with profile to see where the bottlenecks are.