How can I express this large number of computations without for loops? - matlab

I work primarily in MATLAB but I think the answer should not be too hard to carry over from one language to another.
I have a multi-dimensional array X with dimensions [n, p, 3].
I would like to calculate the following multi-dimensional array.
T = zeros(p, p, p)
for i = 1:p
for j = 1:p
for k = 1:p
T(i, j, k) = sum(X(:, i, 1) .* X(:, j, 2) .* X(:, k, 3));
end
end
end
The sum is of the elements of a length-n vector. Any help is appreciated!

You only need some permuting of dimensions and multiplication with singleton expansion:
T = sum(bsxfun(#times, bsxfun(#times, permute(X(:,:,1), [2 4 5 3 1]), permute(X(:,:,2), [4 2 5 3 1])), permute(X(:,:,3), [4 5 2 3 1])), 5);
From R2016b onwards, this can be written more easily as
T = sum(permute(X(:,:,1), [2 4 5 3 1]) .* permute(X(:,:,2), [4 2 5 3 1]) .* permute(X(:,:,3), [4 5 2 3 1]), 5);

As I mentioned in a comment, vectorization is not always a huge advantage any more. Therefore there are vectorization methods that slow down the code rather than speed it up. You must always time your solutions. Vectorization often involves the creation of large temporary arrays, or the copy of large amounts of data, which are avoided in loop code. It depends on the architecture, the size of the input, and many other factors if such a solution is going to be faster.
Nonetheless, in this case it seems vectorization approaches can yield a large speedup.
The first thing to notice about the original code is that X(:, i, 1) .* X(:, j, 2) gets re-computed in the inner loop, though it is a constant value there. Rewriting the inner loop as this will save time:
Y = X(:, i, 1) .* X(:, j, 2);
for k = 1:p
T(i, j, k) = sum(Y .* X(:, k, 3));
end
Now we notice that the inner loop is a dot product, and can be written as follows:
Y = X(:, i, 1) .* X(:, j, 2);
T(i, j, :) = Y.' * X(:, :, 3);
The .' transpose on Y does not copy the data, as Y is a vector. Next, we notice that X(:, :, 3) is indexed repeatedly. Let's move this out of the outer loop. Now I'm left with the following code:
T = zeros(p, p, p);
X1 = X(:, :, 1);
X2 = X(:, :, 2);
X3 = X(:, :, 3);
for i = 1:p
for j = 1:p
Y = X1(:, i) .* X2(:, j);
T(i, j, :) = Y.' * X3;
end
end
It is likely that removing the loop over j is equally easy, which would leave a single loop over i. But this is where I stop.
This is the timings I see (R2017a, 3-year old iMac with 4 cores). For n=10, p=20:
original: 0.0206
moving Y out the inner loop: 0.0100
removing inner loop: 0.0016
moving indexing out of loops: 7.6294e-04
Luis' answer: 1.9196e-04
For a larger array with n=50, p=100:
original: 2.9107
moving Y out the inner loop: 1.3488
removing inner loop: 0.0910
moving indexing out of loops: 0.0361
Luis' answer: 0.1417
"Luis' answer" is this one. It is by far fastest for small arrays, but for larger arrays it shows the cost of the permutation. Moving the computation of the first product out of the inner loop saves a bit over half the computation cost. But removing the inner loop reduces the cost quite dramatically (which I hadn't expected, I presume the single matrix product can use parallelism better than the many small element-wise products). We then get a further time reduction by reducing the amount of indexing operations within the loop.
This is the timing code:
function so()
n = 10; p = 20;
%n = 50; p = 100;
X = randn(n,p,3);
T1 = method1(X);
T2 = method2(X);
T3 = method3(X);
T4 = method4(X);
T5 = method5(X);
assert(max(abs(T1(:)-T2(:)))<1e-13)
assert(max(abs(T1(:)-T3(:)))<1e-13)
assert(max(abs(T1(:)-T4(:)))<1e-13)
assert(max(abs(T1(:)-T5(:)))<1e-13)
timeit(#()method1(X))
timeit(#()method2(X))
timeit(#()method3(X))
timeit(#()method4(X))
timeit(#()method5(X))
function T = method1(X)
p = size(X,2);
T = zeros(p, p, p);
for i = 1:p
for j = 1:p
for k = 1:p
T(i, j, k) = sum(X(:, i, 1) .* X(:, j, 2) .* X(:, k, 3));
end
end
end
function T = method2(X)
p = size(X,2);
T = zeros(p, p, p);
for i = 1:p
for j = 1:p
Y = X(:, i, 1) .* X(:, j, 2);
for k = 1:p
T(i, j, k) = sum(Y .* X(:, k, 3));
end
end
end
function T = method3(X)
p = size(X,2);
T = zeros(p, p, p);
for i = 1:p
for j = 1:p
Y = X(:, i, 1) .* X(:, j, 2);
T(i, j, :) = Y.' * X(:, :, 3);
end
end
function T = method4(X)
p = size(X,2);
T = zeros(p, p, p);
X1 = X(:, :, 1);
X2 = X(:, :, 2);
X3 = X(:, :, 3);
for i = 1:p
for j = 1:p
Y = X1(:, i) .* X2(:, j);
T(i, j, :) = Y.' * X3;
end
end
function T = method5(X)
T = sum(permute(X(:,:,1), [2 4 5 3 1]) .* permute(X(:,:,2), [4 2 5 3 1]) .* permute(X(:,:,3), [4 5 2 3 1]), 5);

You have mentioned you are open to other languages and NumPy by its syntax is very close to MATLAB, so we will try to have a NumPy based solution on this.
Now, these tensor related sum-reductions, specially matrix multiplications ones are easily expressed as einstein-notation and NumPy luckily has one function on the same as np.einsum. Under the hoods, it's implemented in C and is pretty efficient. Recently it's been optimized further to leverage BLAS based matrix-multiplication implementations.
So, a translation of the stated code onto NumPy territory keeping in mind that it follows 0-based indexing and the axes are visuallized differently than the dimensions with MATLAB, would be -
import numpy as np
# X is a NumPy array of shape : (n,p,3). So, a random one could be
# generated with : `X = np.random.rand(n,p,3)`.
T = np.zeros((p, p, p))
for i in range(p):
for j in range(p):
for k in range(p):
T[i, j, k] = np.sum(X[:, i, 0] * X[:, j, 1] * X[:, k, 2])
The einsum way to solve it would be -
np.einsum('ia,ib,ic->abc',X[...,0],X[...,1],X[...,2])
To leverage matrix-multiplication, use optimize flag -
np.einsum('ia,ib,ic->abc',X[...,0],X[...,1],X[...,2],optimize=True)
Timings (with large sizes)
In [27]: n,p = 100,100
...: X = np.random.rand(n,p,3)
In [28]: %%timeit
...: T = np.zeros((p, p, p))
...: for i in range(p):
...: for j in range(p):
...: for k in range(p):
...: T[i, j, k] = np.sum(X[:, i, 0] * X[:, j, 1] * X[:, k, 2])
1 loop, best of 3: 6.23 s per loop
In [29]: %timeit np.einsum('ia,ib,ic->abc',X[...,0],X[...,1],X[...,2])
1 loop, best of 3: 353 ms per loop
In [31]: %timeit np.einsum('ia,ib,ic->abc',X[...,0],X[...,1],X[...,2],optimize=True)
100 loops, best of 3: 10.5 ms per loop
In [32]: 6230.0/10.5
Out[32]: 593.3333333333334
Around 600x speedup there!

Related

MATLAB find the average time using tic toc

Construct an experiment to study the performance of the Cramer rule (with two implementations
determinants) in relation to Gauss's algorithm.
In each iteration 10 random arrays A (NxN), and vectors b (Nx1) will be created.
The 10 linear systems will be solved using the Cramer rule ("cramer.m") using
of rec_det (A) and using det (A), and the Gaussian algorithm
(“GaussianElimination.m”), and the time for each technique will be the average of 10 values.
Repeat the above for N = 2 to 10 and make a graph of the average time
in relation to the dimension N.
This is my task. I dont know if the way that I calculate the average time is correct and the graphic is not displayed.
T1=0;
T2=0;
T3=0;
for N=2:10
for i=1:10
A=rand(N,N);
b=rand(N,1);
t1=[1,i];
t2=[1,i];
t3=[1,i];
tic;
crammer(A,b);
t1(i)=toc;
tic
crammer_rec(A,b);
t2(i)=toc;
tic
gaussianElimination(A,b);
t3(i)=toc;
T1=T1+t1(i);
T2=T2+t2(i);
T3=T3+t3(i);
end
avT1=T1/10;
avT2=T2/10;
avT3=T3/10;
end
plot(2:10 , avT1 , 2:10 , avT2 , 2:10 , avT3);
function x = cramer(A, b)
n = length(b);
d = det(A);
% d = rec_det(A);
x = zeros(n, 1);
for j = 1:n
x(j) = det([A(:,1:j-1) b A(:,j+1:end)]) / d;
% x(j) = rec_det([A(:,1:j-1) b A(:,j+1:end)]) / d;
end
end
function x = cramer(A, b)
n = length(b);
d = rec_det(A);
x = zeros(n, 1);
for j = 1:n
x(j) = rec_det([A(:,1:j-1) b A(:,j+1:end)]) / d;
end
end
function deta = rec_det(R)
if size(R,1)~=size(R,2)
error('Error.Matrix must be square.')
else
n = size(R,1);
if ( n == 2 )
deta=(R(1,1)*R(2,2))-(R(1,2)*R(2,1));
else
for i=1:n
deta_temp=R;
deta_temp(1,:)=[ ];
deta_temp(:,i)=[ ];
if i==1
deta=(R(1,i)*((-1)^(i+1))*rec_det(deta_temp));
else
deta=deta+(R(1,i)*((-1)^(i+1))*rec_det(deta_temp));
end
end
end
end
end
function x = gaussianElimination(A, b)
[m, n] = size(A);
if m ~= n
error('Matrix A must be square!');
end
n1 = length(b);
if n1 ~= n
error('Vector b should be equal to the number of rows and columns of A!');
end
Aug = [A b]; % build the augmented matrix
C = zeros(1, n + 1);
% elimination phase
for k = 1:n - 1
% ensure that the pivoting point is the largest in its column
[pivot, j] = max(abs(Aug(k:n, k)));
C = Aug(k, :);
Aug(k, :) = Aug(j + k - 1, :);
Aug(j + k - 1, :) = C;
if Aug(k, k) == 0
error('Matrix A is singular');
end
for i = k + 1:n
r = Aug(i, k) / Aug(k, k);
Aug(i, k:n + 1) = Aug(i, k:n + 1) - r * Aug(k, k: n + 1);
end
end
% back substitution phase
x = zeros(n, 1);
x(n) = Aug(n, n + 1) / Aug(n, n);
for k = n - 1:-1:1
x(k) = (Aug(k, n + 1) - Aug(k, k + 1:n) * x(k + 1:n)) / Aug(k, k);
end
end
I think the easiest way to do this is by creating a 9 * 3 dimensional matrix to contain all the total times, and then take the average at the end.
allTimes = zeros(9, 3);
for N=2:10
for ii=1:10
A=rand(N,N);
b=rand(N,1);
tic;
crammer(A,b);
temp = toc;
allTimes(N-1,1) = allTimes(N-1,1) + temp;
tic
crammer_rec(A,b);
temp = toc;
allTimes(N-1,2) = allTimes(N-1,2) + temp;
tic
gaussianElimination(A,b);
temp = toc;
allTimes(N-1,3) = allTimes(N-1,3) + temp;
end
end
allTimes = allTimes/10;
figure; plot(2:10, allTimes);
You can use this approach because the numbers are quite straightforward and simple. If you had a more complicated setup, the way to store the times/calculate the averages would have to be tweaked.
If you had more functions you could also use function handles and create a third inner loop, but this is a little more advanced.

Element-wise matrix multiplication for multi-dimensional array

I want to realize component-wise matrix multiplication in MATLAB, which can be done using numpy.einsum in Python as below:
import numpy as np
M = 2
N = 4
I = 2000
J = 300
A = np.random.randn(M, M, I)
B = np.random.randn(M, M, N, J, I)
C = np.random.randn(M, J, I)
# using einsum
D = np.einsum('mki, klnji, lji -> mnji', A, B, C)
# naive for-loop
E = np.zeros(M, N, J, I)
for i in range(I):
for j in range(J):
for n in range(N):
E[:,n,j,i] = B[:,:,i] # A[:,:,n,j,i] # C[:,j,i]
print(np.sum(np.abs(D-E))) # expected small enough
So far I use for-loop of i, j, and n, but I don't want to, at least for-loop of n.
Option 1: Calling numpy from MATLAB
Assuming your system is set up according to the documentation, and you have the numpy package installed, you could do (in MATLAB):
np = py.importlib.import_module('numpy');
M = 2;
N = 4;
I = 2000;
J = 300;
A = matpy.mat2nparray( randn(M, M, I) );
B = matpy.mat2nparray( randn(M, M, N, J, I) );
C = matpy.mat2nparray( randn(M, J, I) );
D = matpy.nparray2mat( np.einsum('mki, klnji, lji -> mnji', A, B, C) );
Where matpy can be found here.
Option 2: Native MATLAB
Here the most important part is to get the permutations right, so we need to keep track of our dimensions. We'll be using the following order:
I(1) J(2) K(3) L(4) M(5) N(6)
Now, I'll explain how I got the correct permute order (let's take the example of A): einsum expects the dimension order to be mki, which according to our numbering is 5 3 1. This tells us that the 1st dimension of A needs to be the 5th, the 2nd needs to be 3rd and the 3rd needs to be 1st (in short 1->5, 2->3, 3->1). This also means that the "sourceless dimensions" (meaning those that have no original dimensions becoming them; in this case 2 4 6) should be singleton. Using ipermute this is really simple to write:
pA = ipermute(A, [5,3,1,2,4,6]);
In the above example, 1->5 means we write 5 first, and the same goes for the other two dimensions (yielding [5,3,1]). Then we just add the singletons (2,4,6) at the end to get [5,3,1,2,4,6]. Finally:
A = randn(M, M, I);
B = randn(M, M, N, J, I);
C = randn(M, J, I);
% Reference dim order: I(1) J(2) K(3) L(4) M(5) N(6)
pA = ipermute(A, [5,3,1,2,4,6]); % 1->5, 2->3, 3->1; 2nd, 4th & 6th are singletons
pB = ipermute(B, [3,4,6,2,1,5]); % 1->3, 2->4, 3->6, 4->2, 5->1; 5th is singleton
pC = ipermute(C, [4,2,1,3,5,6]); % 1->4, 2->2, 3->1; 3rd, 5th & 6th are singletons
pD = sum( ...
permute(pA .* pB .* pC, [5,6,2,1,3,4]), ... 1->5, 2->6, 3->2, 4->1; 3rd & 4th are singletons
[5,6]);
(see note regarding sum at the bottom of the post.)
Another way to do it in MATLAB, as mentioned by #AndrasDeak, is the following:
rD = squeeze(sum(reshape(A, [M, M, 1, 1, 1, I]) .* ...
reshape(B, [1, M, M, N, J, I]) .* ...
... % same as: reshape(B, [1, size(B)]) .* ...
... % same as: shiftdim(B,-1) .* ...
reshape(C, [1, 1, M, 1, J, I]), [2, 3]));
See also: squeeze, reshape, permute, ipermute, shiftdim.
Here's a full example that shows that tests whether these methods are equivalent:
function q55913093
M = 2;
N = 4;
I = 2000;
J = 300;
mA = randn(M, M, I);
mB = randn(M, M, N, J, I);
mC = randn(M, J, I);
%% Option 1 - using numpy:
np = py.importlib.import_module('numpy');
A = matpy.mat2nparray( mA );
B = matpy.mat2nparray( mB );
C = matpy.mat2nparray( mC );
D = matpy.nparray2mat( np.einsum('mki, klnji, lji -> mnji', A, B, C) );
%% Option 2 - native MATLAB:
%%% Reference dim order: I(1) J(2) K(3) L(4) M(5) N(6)
pA = ipermute(mA, [5,3,1,2,4,6]); % 1->5, 2->3, 3->1; 2nd, 4th & 6th are singletons
pB = ipermute(mB, [3,4,6,2,1,5]); % 1->3, 2->4, 3->6, 4->2, 5->1; 5th is singleton
pC = ipermute(mC, [4,2,1,3,5,6]); % 1->4, 2->2, 3->1; 3rd, 5th & 6th are singletons
pD = sum( permute( ...
pA .* pB .* pC, [5,6,2,1,3,4]), ... % 1->5, 2->6, 3->2, 4->1; 3rd & 4th are singletons
[5,6]);
rD = squeeze(sum(reshape(mA, [M, M, 1, 1, 1, I]) .* ...
reshape(mB, [1, M, M, N, J, I]) .* ...
reshape(mC, [1, 1, M, 1, J, I]), [2, 3]));
%% Comparisons:
sum(abs(pD-D), 'all')
isequal(pD,rD)
Running the above we get that the results are indeed equivalent:
>> q55913093
ans =
2.1816e-10
ans =
logical
1
Note that these two methods of calling sum were introduced in recent releases, so you might need to replace them if your MATLAB is relatively old:
S = sum(A,'all') % can be replaced by ` sum(A(:)) `
S = sum(A,vecdim) % can be replaced by ` sum( sum(A, dim1), dim2) `
As requested in the comments, here's a benchmark comparing the methods:
function t = q55913093_benchmark(M,N,I,J)
if nargin == 0
M = 2;
N = 4;
I = 2000;
J = 300;
end
% Define the arrays in MATLAB
mA = randn(M, M, I);
mB = randn(M, M, N, J, I);
mC = randn(M, J, I);
% Define the arrays in numpy
np = py.importlib.import_module('numpy');
pA = matpy.mat2nparray( mA );
pB = matpy.mat2nparray( mB );
pC = matpy.mat2nparray( mC );
% Test for equivalence
D = cat(5, M1(), M2(), M3());
assert( sum(abs(D(:,:,:,:,1) - D(:,:,:,:,2)), 'all') < 1E-8 );
assert( isequal (D(:,:,:,:,2), D(:,:,:,:,3)));
% Time
t = [ timeit(#M1,1), timeit(#M2,1), timeit(#M3,1)];
function out = M1()
out = matpy.nparray2mat( np.einsum('mki, klnji, lji -> mnji', pA, pB, pC) );
end
function out = M2()
out = permute( ...
sum( ...
ipermute(mA, [5,3,1,2,4,6]) .* ...
ipermute(mB, [3,4,6,2,1,5]) .* ...
ipermute(mC, [4,2,1,3,5,6]), [3,4]...
), [5,6,2,1,3,4]...
);
end
function out = M3()
out = squeeze(sum(reshape(mA, [M, M, 1, 1, 1, I]) .* ...
reshape(mB, [1, M, M, N, J, I]) .* ...
reshape(mC, [1, 1, M, 1, J, I]), [2, 3]));
end
end
On my system this results in:
>> q55913093_benchmark
ans =
1.3964 0.1864 0.2428
Which means that the 2nd method is preferable (at least for the default input sizes).

manipulating indices of matrix in parallel in matlab

Suppose I have a m-by-n-by-p matrix "A", each indices stores a real number, now I want to create another matrix "B" and B(i, j, k) = f(A(i, j, k), i, j, k, otherVars), is there a faster way to do it in matlab rather than looping through all the elements? (notice the function requires the index number (i, j, k))
An example is as follows(The actual function f could be more complex):
A = rand(3, 4, 5);
B = zeros(size(A));
C = 10;
for x = 1:size(A, 1)
for y = 1:size(A, 2)
for z = 1:size(A, 3)
B(x, y, z) = A(x,y,z) + x - y * z + C;
end
end
end
I've tried creating a cell "B", and
B{i, j, k} = [A(i, j, k), i, j, k];
I then applied cellfun() to do the parallel computing, but it's even slower than a for-loop over each elements in A.
In my real implementation, function f is much more complex than B = A + X - Y.*Z + C; it takes four scaler values and I don't want to modify it since it's a function written in an external package. Any suggestions?
Vectorize it by building an ndgrid of the appropriate values:
[X,Y,Z] = ndgrid(1:size(A,1), 1:size(A,2), 1:size(A,3));
B = A + X - Y.*Z + C;

Generating arrays using bsxfun with anonymous function and for elementwise subtractions - MATLAB

I have the following code:
n = 10000;
s = 100;
Z = rand(n, 2);
x = rand(s, 1);
y = rand(s, 1);
fun = #(a) exp(a);
In principle, the anonymous function f can have a different form. I need to create two arrays.
First, I need to create an array of size n x s x s with generic elements
fun(Z(i, 1) - x(j)) * fun(Z(i, 2) - y(k))
where i=1,...n while j,k=1,...,s. What I can easily do, is to construct matrices using bsxfun, e.g.
bsxfun(#(x, y) fun(x - y), Z(:, 1), x');
bsxfun(#(x, y) fun(x - y), Z(:, 2), y');
But then I would need to combine them into 3D array by multiplying element-wise each column of those two matrices.
In the second step, I need to create an array of size n x 3 x s x s, which would look from one side as the following matrix
[ones(n, 1), Z(:, 1) - x(i), Z(:, 2) - y(j);]
where i=1,...s, j=1,...s. I could loop over the two extra dimensions with something like
A = [ones(n, 1), Z(:, 1) - x(1), Z(:, 2) - y(1)];
for i = 1:s
for j = 1:s
A(:, :, i, j) = [ones(n, 1), Z(:, 1) - x(i), Z(:, 2) - y(j);];
end
end
Is there a way to avoid loops?
In the third step, suppose that after obtaining array out1 (output from first step), I want to create a new array out3 of dimension n x n x s x s, which contains the original array out1 on the main diagonal, i.e. out3(i,i,s,s) = out1(i, s, s) and out3(i,j,s,s)=0 for all i~=j. Is there some kind of alternative of diag for creating "diagonal arrays"? Alternatively, if I create n x n x s x s array of zeros, is there a way to put out1 on the main diagonal?
Code
exp_Z_x = exp(bsxfun(#minus,Z(:,1),x.')); %//'
exp_Z_y = exp(bsxfun(#minus,Z(:,2),y.')); %//'
out1 = bsxfun(#times,exp_Z_x,permute(exp_Z_y,[1 3 2]));
Z1 = [ones(n,1) Z(:,1) Z(:,2)];
X1 = permute([ zeros(s,1) x zeros(s,1)],[3 2 1]);
Y1 = permute([ zeros(s,1) zeros(s,1) y],[4 2 3 1]);
out2 = bsxfun(#minus,bsxfun(#minus,Z1,X1),Y1);
out3 = zeros(n,n,s,s); %// out3(n,n,s,s) = 0; could be used for performance
out3(bsxfun(#plus,[1:n+1:n*n]',[0:s*s-1]*n*n)) = out1; %//'
%// out1, out2 and out3 are the desired outputs

Vectorizing a nested for loop which fills a dynamic programming table

I was wondering if there was a way to vectorize the nested for loop in this function which is filling up the entries of the 2D dynamic programming table DP. I believe that at the very least the inner loop could be vectorized as each row only depends on the previous row. I'm not sure how to do it though. Note this function is called on large 2D arrays (images) so the nested for loop really doesn't cut it.
function [cols] = compute_seam(energy)
[r, c, ~] = size(energy);
cols = zeros(r);
DP = padarray(energy, [0, 1], Inf);
BP = zeros(r, c);
for i = 2 : r
for j = 1 : c
[x, l] = min([DP(i - 1, j), DP(i - 1, j + 1), DP(i - 1, j + 2)]);
DP(i, j + 1) = DP(i, j + 1) + x;
BP(i, j) = j + (l - 2);
end
end
[~, j] = min(DP(r, :));
j = j - 1;
for i = r : -1 : 1
cols(i) = j;
j = BP(i, j);
end
end
Vectorization of the innermost nested loop
You were right in postulating that at least the inner loop is vectorizable. Here's the modified code for the nested loops part -
rows_DP = size(DP,1); %// rows in DP
%// Get first row linear indices for a group of neighboring three columns,
%// which would be incremented as we move between rows with the row iterator
start_ind1 = bsxfun(#plus,[1:rows_DP:2*rows_DP+1]',[0:c-1]*rows_DP); %//'
for i = 2 : r
ind1 = start_ind1 + i-2; %// setup linear indices for the row of this iteration
[x,l] = min(DP(ind1),[],1); %// get x and l values in one go
DP(i,2:c+1) = DP(i,2:c+1) + x; %// set DP values of a row in one go
BP(i,1:c) = [1:c] + l-2; %// set BP values of a row in one go
end
Benchmarking
Benchmarking Code -
N = 3000; %// Datasize
energy = rand(N);
[r, c, ~] = size(energy);
disp('------------------------------------- With Original Code')
DP = padarray(energy, [0, 1], Inf);
BP = zeros(r, c);
tic
for i = 2 : r
for j = 1 : c
[x, l] = min([DP(i - 1, j), DP(i - 1, j + 1), DP(i - 1, j + 2)]);
DP(i, j + 1) = DP(i, j + 1) + x;
BP(i, j) = j + (l - 2);
end
end
toc,clear DP BP x l
disp('------------------------------------- With Vectorized Code')
DP = padarray(energy, [0, 1], Inf);
BP = zeros(r, c);
tic
rows_DP = size(DP,1); %// rows in DP
start_ind1 = bsxfun(#plus,[1:rows_DP:2*rows_DP+1]',[0:c-1]*rows_DP); %//'
for i = 2 : r
ind1 = start_ind1 + i-2; %// setup linear indices for the row of this iteration
[x,l] = min(DP(ind1),[],1); %// get x and l values in one go
DP(i,2:c+1) = DP(i,2:c+1) + x; %// set DP values of a row in one go
BP(i,1:c) = [1:c] + l-2; %// set BP values of a row in one go
end
toc
Results -
------------------------------------- With Original Code
Elapsed time is 44.200746 seconds.
------------------------------------- With Vectorized Code
Elapsed time is 1.694288 seconds.
Thus, you might enjoy a good 26x speedup improvement in performance with that little vectorization tweak.
More tweaks
Few more optimization tweaks could be tried into your code for performance -
cols = zeros(r) could be replaced with col(r,r) = 0.
DP = padarray(energy, [0, 1], Inf) could be replaced with
DP(1:size(energy,1),1:size(energy,2)+2)=Inf;
DP(:,2:end-1) = energy;
BP = zeros(r, c) could be replaced with BP(r, c) = 0.
The pre-allocation tweaks used here are inspired by this blog post.