How can a loop involving recursive multiplication and accumulation be vectorized? - matlab

In matlab, I have a loop of the form:
a=1;
for (i = 1:N)
a = a * b(i) + c(i);
end
Can this loop be vectorized, or partially unrolled?

For b and c of length 4 each, when the loop is unrolled, you would have -
output = b1b2b3b4 + c1b2b3b4 + c2b3b4 + c3b4 + c4
So, the generic formula would be :
output = b1b2b3...bN + c1b2b3..bN + c2b3..bN + c3b4..bN + ... cN-1bN + cN
The cummulative product of b could be calculated with cumprod with elements being flipped or "reversed". Rest is all about elementwise multiplication with c elements that are 1 place shifted and then including the remaining scalar elements from the cummulative product and c and finally summing all those up to get us the final scalar output.
So, the coded version would look something like this -
cumb = cumprod(b,'reverse');
a = sum(cumb(2:end).*c(1:end-1)) + cumb(1) + c(end);
Benchmarking
Let's compare the loopy approach from the question against the vectorized one as proposed earlier in this post.
Here are the approaches as functions -
function a = loopy(b,c)
N = numel(b);
a = 1;
for i = 1:N
a = a * b(i) + c(i);
end
return;
function a = vectorized(b,c)
cumb = cumprod(b,'reverse');
a = sum(cumb(2:end).*c(1:end-1)) + cumb(1) + c(end);
return;
Here's the code to benchmark those two approaches -
datasizes = 10.^(1:8);
Nd = numel(datasizes);
time_loopy = zeros(1,Nd);
time_vectorized = zeros(1,Nd);
for k1 = 1:numel(datasizes)
N = datasizes(k1);
b = rand(1,N);
c = rand(1,N);
func1 = #() loopy(b,c);
func2 = #() vectorized(b,c);
time_loopy(k1) = timeit(func1);
time_vectorized(k1) = timeit(func2);
end
figure,
loglog(datasizes,time_loopy,'-rx'), hold on
loglog(datasizes,time_vectorized,'-b+'),
set(gca,'xgrid','on'),set(gca,'ygrid','on'),
xlabel('Datasize (# elements)'), ylabel('Runtime (s)')
legend({'Loop','Vectorized'}),title('Runtime Plot')
figure,
semilogx(datasizes,time_loopy./time_vectorized,'-k.')
set(gca,'xgrid','on'),set(gca,'ygrid','on'),
xlabel('Datasize (# elements)'), ylabel('Speedup (x)')
legend({'Speedup with vectorized method over loopy one'}),title('Speedup Plot')
Here's the runtime and speedup plots -
Few observations
Stage #1: From starting datasize until 1000 elements, loopy approach has the upper hand, as the vectorized approach isn't getting enough elements to benefit from everything-in-one-go approach.
Stage #2: From 1000 elements until 1000,0000 elements, vectorized method seems to be the better one, as its getting enough elements to work with.
Stage #3: For the bigger datasize cases, it seems the memory bandwidth requirement of storing and working with millions of elements with vectorized approach as opposed to using just a scalar value with the loopy approach
might be pegging back the vectorized approach.
Conclusions: If performance is the most important criteria, one can go with vectorized method or stay with the original loopy code based on the datasizes.

Related

How can I exploit parallelism when defining values in a sparse matrix?

The following MATLAB code loops through all elements of a matrix with size 2IJ x 2IJ.
for i=1:(I-2)
for j=1:(J-2)
ij1 = i*J+j+1; % row
ij2 = i*J+j+1 + I*J; % col
D1(ij1,ij1) = 2;
D1(ij1,ij2) = -1;
end
end
Is there any way I can parallelize it use MATLAB's parfor command? You can assume any element not defined is 0. So this matrix ends up being sparse (mostly 0s).
Before using parfor it is recommended to read the guidelines related to decide when to use parfor. Specially this:
Generally, if you want to make code run faster, first try to vectorize it.
Here vectorization can be used effectively to compute indices of the nonzero elements. Those indices are used in function sparse. For it you need to define one of i or j to be a column vector and another a row vector. Implicit expansion takes effect and indices are computed.
I = 300;
J = 300;
i = (1:I-2).';
j = 1:J-2;
ij1 = i*J+j+1;
ij2 = i*J+j+1 + I*J;
D1 = sparse(ij1, ij1, 2, 2*I*J, 2*I*J) + sparse(ij1, ij2, -1, 2*I*J, 2*I*J);
However for the comparison this can be a way of using parfor (not tested):
D1 = sparse (2*I*J, 2*I*J);
parfor i=1:(I-2)
for j=1:(J-2)
ij1 = i*J+j+1;
ij2 = i*J+j+1 + I*J;
D1 = D1 + sparse([ij1;ij1], [ij1;ij2], [2;-1], 2*I*J, 2*I*J) ;
end
end
Here D1 used as reduction variable.

vectorize two nested for-loops in MATLAB

I have two nested for-loops that are used to format data that I load it. The loops have the following construction:
data = magic(20000);
data = data(:,1:3);
for i=0:10
for j=0:10
data_tmp = data((1:100)+100*j+100*10*i,:);
vx(:, i+1,j+1) = data_tmp(:,1);
vy(:, i+1,j+1) = data_tmp(:,2);
vz(:, i+1,j+1) = data_tmp(:,3);
end
end
Arrays vx, vy and vz I do pre-allocate to their desired size. However, is there a way to vectorize the for-loops to increase the efficiency? I'm not convinced it is the case due to the first line in the second loop, data((1:100)+100*j+100*10*i,:), is there a better way to do this?
It turns out that you have repeated index in loop
at i=k, j=10 and i=k+1, j=0 for k<10
for example, you read 1:100 + 100*10 + 100*10*0 and then read 1:100 + 100*0 + 100*10*1 which are identical.
Reshape w/ Repeated index
If this was what you intended to do, then vectorization needs one more step (index generation).
Following is my suggestion (N=100, M=10 where N is the length of data_tmp and M is the maximum loop variable)
index = bsxfun(#plus,bsxfun(#plus,(1:N)',reshape(N*(0:M),1,1,M+1)),M*N*(0:M)); %index generation
vx = data(index);
vy = data(index + size(data,1));
vz = data(index + size(data,1)*2);
This is not that desirable, but it will work.
When I tested on my laptop, it is twice faster than your original code with pre-allocation. As I increase the size of data, the gap gets smaller and smaller.
Reshape w/o Repeated index
If not i.e., you want to reshape each column in the direction of 3rd dimension first, 2nd dimension last), then following would work.
Firstly, this is how I interpreted your code
data = magic(20000);
data = data(:,1:3);
N = 100; M = 10;
for i=0:(M-1)
for j=0:(M-1)
data_tmp = data((1:N)+M*j+N*M*i,:);
vx(:, i+1,j+1) = data_tmp(:,1);
vy(:, i+1,j+1) = data_tmp(:,2);
vz(:, i+1,j+1) = data_tmp(:,3);
end
end
Note that loop ended at (M-1).
Following is my suggestion.
vx = permute(reshape(dat(1:N*M*M,1), N, M, M),[1,3,2]);
vy = permute(reshape(dat(1:N*M*M,2), N, M, M),[1,3,2]);
vz = permute(reshape(dat(1:N*M*M,3), N, M, M),[1,3,2]);
In my laptop, it is 4 times faster than original code. As I increase the size, the gap approaches to 2.
Just in case you want to stick with the loop, here is a much faster way to do this:
data = randi(100,20000,3);
[vx,vy,vz] = deal(zeros(100,11,11));
[J,I] = ndgrid(1:11,1:11);
c = 1;
for k = 0:100:11000
vx(:,I(c),J(c)) = data((1:100)+k,1);
vy(:,I(c),J(c)) = data((1:100)+k,2);
vz(:,I(c),J(c)) = data((1:100)+k,3);
c = c+1;
end
My guess is that reshape from #Dohyun answer is what you looking for (and it's x10 faster than this, and x10000 faster than your code), but for next time you use loops, this may be useful.
And here is another option to do this without reshape, in a similar time to the reshape version:
[vx,vy,vz] = deal(zeros(100,10,11));
vx(:) = data(1:11000,1);
vy(:) = data(1:11000,2);
vz(:) = data(1:11000,3);
vx = permute(vx,[1 3 2]);
vy = permute(vy,[1 3 2]);
vz = permute(vz,[1 3 2]);
The idea is that you define the shape of [vx,vy,vz] while allocating them.

Apply function to rolling window

Say I have a long list A of values (say of length 1000) for which I want to compute the std in pairs of 100, i.e. I want to compute std(A(1:100)), std(A(2:101)), std(A(3:102)), ..., std(A(901:1000)).
In Excel/VBA one can easily accomplish this by writing e.g. =STDEV(A1:A100) in one cell and then filling down in one go. Now my question is, how could one accomplish this efficiently in Matlab without having to use any expensive for-loops.
edit: Is it also possible to do this for a list of time series, e.g. when A has dimensions 1000 x 4 (i.e. 4 time series of length 1000)? The output matrix should then have dimensions 901 x 4.
Note: For the fastest solution see Luis Mendo's answer
So firstly using a for loop for this (especially if those are your actual dimensions) really isn't going to be expensive. Unless you're using a very old version of Matlab, the JIT compiler (together with pre-allocation of course) makes for loops inexpensive.
Secondly - have you tried for loops yet? Because you should really try out the naive implementation first before you start optimizing prematurely.
Thirdly - arrayfun can make this a one liner but it is basically just a for loop with extra overhead and very likely to be slower than a for loop if speed really is your concern.
Finally some code:
n = 1000;
A = rand(n,1);
l = 100;
for loop (hardly bulky, likely to be efficient):
S = zeros(n-l+1,1); %//Pre-allocation of memory like this is essential for efficiency!
for t = 1:(n-l+1)
S(t) = std(A(t:(t+l-1)));
end
A vectorized (memory in-efficient!) solution:
[X,Y] = meshgrid(1:l)
S = std(A(X+Y-1))
A probably better vectorized solution (and a one-liner) but still memory in-efficient:
S = std(A(bsxfun(#plus, 0:l-1, (1:l)')))
Note that with all these methods you can replace std with any function so long as it is applies itself to the columns of the matrix (which is the standard in Matlab)
Going 2D:
To go 2D we need to go 3D
n = 1000;
k = 4;
A = rand(n,k);
l = 100;
ind = bsxfun(#plus, permute(o:n:(k-1)*n, [3,1,2]), bsxfun(#plus, 0:l-1, (1:l)')); %'
S = squeeze(std(A(ind)));
M = squeeze(mean(A(ind)));
%// etc...
OR
[X,Y,Z] = meshgrid(1:l, 1:l, o:n:(k-1)*n);
ind = X+Y+Z-1;
S = squeeze(std(A(ind)))
M = squeeze(mean(A(ind)))
%// etc...
OR
ind = bsxfun(#plus, 0:l-1, (1:l)'); %'
for t = 1:k
S = std(A(ind));
M = mean(A(ind));
%// etc...
end
OR (taken from Luis Mendo's answer - note in his answer he shows a faster alternative to this simple loop)
S = zeros(n-l+1,k);
M = zeros(n-l+1,k);
for t = 1:(n-l+1)
S(t,:) = std(A(k:(k+l-1),:));
M(t,:) = mean(A(k:(k+l-1),:));
%// etc...
end
What you're doing is basically a filter operation.
If you have access to the image processing toolbox,
stdfilt(A,ones(101,1)) %# assumes that data series are in columns
will do the trick (no matter the dimensionality of A). Note that if you also have access to the parallel computing toolbox, you can let filter operations like these run on a GPU, although your problem might be too small to generate noticeable speedups.
To minimize number of operations, you can exploit the fact that the standard deviation can be computed as a difference involving second and first moments,
and moments over a rolling window are obtained efficiently with a cumulative sum (using cumsum):
A = randn(1000,4); %// random data
N = 100; %// window size
c = size(A,2);
A1 = [zeros(1,c); cumsum(A)];
A2 = [zeros(1,c); cumsum(A.^2)];
S = sqrt( (A2(1+N:end,:)-A2(1:end-N,:) ...
- (A1(1+N:end,:)-A1(1:end-N,:)).^2/N) / (N-1) ); %// result
Benchmarking
Here's a comparison against a loop based solution, using timeit. The loop approach is as in Dan's solution but adapted to the 2D case, exploting the fact that std works along each column in a vectorized manner.
%// File loop_approach.m
function S = loop_approach(A,N);
[n, p] = size(A);
S = zeros(n-N+1,p);
for k = 1:(n-N+1)
S(k,:) = std(A(k:(k+N-1),:));
end
%// File bsxfun_approach.m
function S = bsxfun_approach(A,N);
[n, p] = size(A);
ind = bsxfun(#plus, permute(0:n:(p-1)*n, [3,1,2]), bsxfun(#plus, 0:n-N, (1:N).')); %'
S = squeeze(std(A(ind)));
%// File cumsum_approach.m
function S = cumsum_approach(A,N);
c = size(A,2);
A1 = [zeros(1,c); cumsum(A)];
A2 = [zeros(1,c); cumsum(A.^2)];
S = sqrt( (A2(1+N:end,:)-A2(1:end-N,:) ...
- (A1(1+N:end,:)-A1(1:end-N,:)).^2/N) / (N-1) );
%// Benchmarking code
clear all
A = randn(1000,4); %// Or A = randn(1000,1);
N = 100;
t_loop = timeit(#() loop_approach(A,N));
t_bsxfun = timeit(#() bsxfun_approach(A,N));
t_cumsum = timeit(#() cumsum_approach(A,N));
disp(' ')
disp(['loop approach: ' num2str(t_loop)])
disp(['bsxfun approach: ' num2str(t_bsxfun)])
disp(['cumsum approach: ' num2str(t_cumsum)])
disp(' ')
disp(['bsxfun/loop gain factor: ' num2str(t_loop/t_bsxfun)])
disp(['cumsum/loop gain factor: ' num2str(t_loop/t_cumsum)])
Results
I'm using Matlab R2014b, Windows 7 64 bits, dual core processor, 4 GB RAM:
4-column case:
loop approach: 0.092035
bsxfun approach: 0.023535
cumsum approach: 0.0002338
bsxfun/loop gain factor: 3.9106
cumsum/loop gain factor: 393.6526
Single-column case:
loop approach: 0.085618
bsxfun approach: 0.0040495
cumsum approach: 8.3642e-05
bsxfun/loop gain factor: 21.1431
cumsum/loop gain factor: 1023.6236
So the cumsum-based approach seems to be the fastest: about 400 times faster than the loop in the 4-column case, and 1000 times faster in the single-column case.
Several functions can do the job efficiently in Matlab.
On one side, you can use functions such as colfilt or nlfilter, which performs computations on sliding blocks. colfilt is way more efficient than nlfilter, but can be used only if the order of the elements inside a block does not matter. Here is how to use it on your data:
S = colfilt(A, [100,1], 'sliding', #std);
or
S = nlfilter(A, [100,1], #std);
On your example, you can clearly see the difference of performance. But there is a trick : both functions pad the input array so that the output vector has the same size as the input array. To get only the relevant part of the output vector, you need to skip the first floor((100-1)/2) = 49 first elements, and take 1000-100+1 values.
S(50:end-50)
But there is also another solution, close to colfilt, more efficient. colfilt calls col2im to reshape the input vector into a matrix on which it applies the given function on each distinct column. This transforms your input vector of size [1000,1] into a matrix of size [100,901]. But colfilt pads the input array with 0 or 1, and you don't need it. So you can run colfilt without the padding step, then apply std on each column and this is easy because std applied on a matrix returns a row vector of the stds of the columns. Finally, transpose it to get a column vector if you want. In brief and in one line:
S = std(im2col(X,[100 1],'sliding')).';
Remark: if you want to apply a more complex function, see the code of colfilt, line 144 and 147 (for v2013b).
If your concern is speed of the for loop, you can greatly reduce the number of loop iteration by folding your vector into an array (using reshape) with the columns having the number of element you want to apply your function on.
This will let Matlab and the JIT perform the optimization (and in most case they do that way better than us) by calculating your function on each column of your array.
You then reshape an offseted version of your array and do the same. You will still need a loop but the number of iteration will only be l (so 100 in your example case), instead of n-l+1=901 in a classic for loop (one window at a time).
When you're done, you reshape the array of result in a vector, then you still need to calculate manually the last window, but overall it is still much faster.
Taking the same input notation than Dan:
n = 1000;
A = rand(n,1);
l = 100;
It will take this shape:
width = (n/l)-1 ; %// width of each line in the temporary result array
tmp = zeros( l , width ) ; %// preallocation never hurts
for k = 1:l
tmp(k,:) = std( reshape( A(k:end-l+k-1) , l , [] ) ) ; %// calculate your stat on the array (reshaped vector)
end
S2 = [tmp(:) ; std( A(end-l+1:end) ) ] ; %// "unfold" your results then add the last window calculation
If I tic ... toc the complete loop version and the folded one, I obtain this averaged results:
Elapsed time is 0.057190 seconds. %// windows by window FOR loop
Elapsed time is 0.016345 seconds. %// "Folded" FOR loop
I know tic/toc is not the way to go for perfect timing but I don't have the timeit function on my matlab version. Besides, the difference is significant enough to show that there is an improvement (albeit not precisely quantifiable by this method). I removed the first run of course and I checked that the results are consistent with different matrix sizes.
Now regarding your "one liner" request, I suggest your wrap this code into a function like so:
function out = foldfunction( func , vec , nPts )
n = length( vec ) ;
width = (n/nPts)-1 ;
tmp = zeros( nPts , width ) ;
for k = 1:nPts
tmp(k,:) = func( reshape( vec(k:end-nPts+k-1) , nPts , [] ) ) ;
end
out = [tmp(:) ; func( vec(end-nPts+1:end) ) ] ;
Which in your main code allows you to call it in one line:
S = foldfunction( #std , A , l ) ;
The other great benefit of this format, is that you can use the very same sub function for other statistical function. For example, if you want the "mean" of your windows, you call the same just changing the func argument:
S = foldfunction( #mean , A , l ) ;
Only restriction, as it is it only works for vector as input, but with a bit of rework it could be made to take arrays as input too.

Matlab Vectorize

I have a probability matrix(glcm) of size 256x256x20. I have reshaped the matrix to
65536x20, so that I can eliminate one loop (along the 3rd dimension).
I want to do the following calculation.
for y = 1:256
for x = 1:256
if (ismember((x + y),(2:2*256)))
p_xplusy((x+y),:) = p_xplusy((x+y),:) + glcm(((y-1)*256+x),:);
end
end
end
So the p_xplusy will be a 511x20 matrix which each element is the sum of the diagonal of nxn sub matrix (where n belongs to 1:256) of the original 256x256x20 matrix.
This code block is making my program inefficient and I want to vectorize this loop. Any help would be appreciated.
Since your if statement is just checking whether x+y is less than or equal to 256, just force it to always be, and remove excess loops:
for y = 1:256
for x = 1:256-y
p_xplusy((x+y),:) = p_xplusy((x+y),:) + glcm(((y-1)*256+x),:);
end
end
This should provide a noticeable speed-up to your code.
You can reduce the complexity from O(n^2) to O(2*n) and thus improve runtime efficiency -
N = 256;
for k1 = 1:N
idx_glcm = k1:N-1:N*(k1-1)+1;
p_xplusy(k1+1,:) = p_xplusy(k1+1,:) + sum(glcm(idx_glcm,:),1);
end
for k1 = 2:N
idx_glcm = k1*N:N-1:N*(N-1)+k1;
p_xplusy(N+k1,:) = p_xplusy(N+k1,:) + sum(glcm(idx_glcm,:),1);
end
Some quick runtime tests seem to confirm our efficiency theory too.

Non Local Means Filter Optimization in MATLAB

I'm trying to write a Non-Local Means filter for an assignment. I've written the code in two ways, but the method I'd expect to be quicker is much slower than the other method.
Method 1: (This method is slower)
for i = 1:size(I,1)
tic
sprintf('%d/%d',i,size(I,1))
for j = 1:size(I,2)
w = exp((-abs(I-I(i,j))^2)/(h^2));
Z = sum(sum(w));
w = w/Z;
sumV = w .* I;
NL(i,j) = sum(sum(sumV));
end
toc
end
Method 2: (This method is faster)
for i = 1:size(I,1)
tic
sprintf('%d/%d',i,size(I,1))
for j = 1:size(I,2)
Z = 0;
for k = 1:size(I,1)
for l = 1:size(I,2)
w = exp((-abs(I(i,j)-I(k,l))^2)/(h^2));
Z = Z + w;
end
end
sumV = 0;
for k = 1:size(I,1)
for l = 1:size(I,2)
w = exp((-abs(I(i,j)-I(k,l))^2)/(h^2));
w = w/Z;
sumV = sumV + w * I(k,l);
end
end
NL(i,j) = sumV;
end
toc
end
I really thought that MATLAB would be optimized for Matrix operations. Is there reason it isn't in this code? The difference is pretty large. For a 512x512 image, with h = 0.05, one iteration of the outer loop takes 24-28 seconds for Method 1 and 10-12 seconds for Method 2.
The two methods are not doing the same thing. In Method 2, the term abs(I(i,j)-I(k,l)) in the w= expression is being squared, which is fine because the term is just a single numeric value.
However, in Method 1, the term abs(I-I(i,j)) is actually a matrix (The numeric value I(i,j) is being subtracted from every element in the matrix I, returning a matrix again). So, when this term is squared with the ^ operator, matrix multiplication is happening. My guess, based on Method 2, is that this is not what you intended. If instead, you want to square each element in that matrix, then use the .^ operator, as in abs(I-I(i,j)).^2
Matrix multiplication is a much more computation intensive operation, which is likely why Method 1 takes so much longer.
My guess is that you have not preassigned NL, that both methods are in the same function (or are scripts and you didn't clear NL between function runs). This would have slowed the first method by quite a bit.
Try the following: Create a function for both methods. Run each method once. Then use the profiler to see where each function spends most of its time.
A much faster implementation (Vectorized) could be achieved using im2col:
Create a Vector out of each neighborhood.
Using predefined indices calculate the distance between each patch.
Sum over the values and the weights using sum function.
This method will work with no loop at all.