Parallel implementation for Jacobi algorithm takes too much time - matlab

I implemented a parallel version of Jacobi's method for solving a linear system. In my tests, the parallel function takes far longer to execute than the sequential one. This is strange, because Jacobi's method should be faster when run in parallel.
I think I'm doing something wrong in the code:
function [x,niter,resrel] = Parallel_Jacobi(A,b,TOL,MAXITER)
[n, m] = size(A);
D = 1./spdiags(A,0);
B = speye(n)-A./spdiags(A,0);
C= D.*b;
x0=sparse(zeros(length(A),1));
spmd
cod_vett=codistributor1d(1,codistributor1d.unsetPartition,[n,1]);
cod_mat=codistributor1d(1,codistributor1d.unsetPartition,[n,m]);
B= codistributed(B,cod_mat);
C= codistributed(C,cod_vett);
x= codistributed(B*x0 + C,cod_vett);
Niter = 1;
TOLX = TOL;
while(norm(x-x0,Inf) > norm(x0,Inf)*TOLX && Niter < MAXITER)
if(TOL*norm(x,Inf) > realmin)
TOLX = norm(x,Inf)*TOL;
else
TOLX = realmin;
end
x0 = x;
x = B*x0 + C;
Niter=Niter+1;
end
end
Niter=Niter{1};
x=gather(x);
end
Here are the tests:
%sequential Jacobi
format long;
A = gallery('poisson',20);
tic;
x= jacobi(A,ones(400,1),1e-6,2000000);
toc;
Elapsed time is 0.009054 seconds.
%parallel Jacobi
format long;
A = gallery('poisson',20);
tic;
x= Parallel_Jacobi(A,ones(400,1),1e-6,2000000);
toc;
Elapsed time is 11.484130 seconds.
I timed the parpool function with 1,2,3 and 4 workers (I have a quad core processor) and the result is the following:
%Test
format long;
A = gallery('poisson',20);
delete(gcp('nocreate'));
tic
% parpool(1/2/3/4) is shorthand for four separate runs that differ only in
% the argument: first parpool(1), then parpool(2), and so on.
parpool(1/2/3/4);
toc
tic;
x= Parallel_Jacobi(A,ones(400,1),1e-6,2000000);
toc;
4 workers: parpool=13.322899 seconds, function=23.772271
3 workers: parpool=10.911769 seconds, function=16.402633
2 workers: parpool=9.371729 seconds, function=12.945154
1 worker: parpool=8.460357 seconds, function=7.982958 .
The fewer the workers, the better the time. As @Adriaan said, this is likely due to overhead.
Does this mean that, in this case, the sequential function is always faster than the parallel function? Or is there a better way to implement the parallel one?
In this question it is said that parallel performance is better when the number of iterations is high. In my case, with this test, there are only 32 iterations.
The sequential implementation of Jacobi's method is this:
function [x,niter,resrel] = jacobi(A,b,TOL,MAXITER)
n = size(A,1);
D = 1./spdiags(A,0);
B = speye(n)-A./spdiags(A,0);
C= D.*b;
x0=sparse(zeros(length(A),1));
x = B*x0 + C;
Niter = 1;
TOLX = TOL;
while(norm(x-x0,Inf) > norm(x0,Inf)*TOLX && Niter < MAXITER)
if(TOL*norm(x,Inf) > realmin)
TOLX = norm(x,Inf)*TOL;
else
TOLX = realmin;
end
x0 = x;
x = B*x0 + C;
Niter=Niter+1;
end
end
I also timed the code with the timeit function (same inputs as before), and the results, in seconds, are:
4 workers: 11.693473075964102
3 workers: 9.221281335264003
2 workers: 9.150417240778545
1 worker: 6.047181992020434
sequential: 0.002893932969688
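For reference, both functions above run the fixed-point iteration x = B*x + C with B = I - D^(-1)*A and C = D^(-1)*b, where D is the diagonal of A. The following is a minimal dense sketch of that iteration, not the code from the question: the 3-by-3 system, the tolerance, and the iteration cap are illustrative assumptions.

```matlab
% Minimal dense Jacobi sketch (illustrative values, not from the question).
A = [4 -1 0; -1 4 -1; 0 -1 4];   % small diagonally dominant matrix (assumed)
b = [1; 2; 3];
d = diag(A);                     % diagonal entries of A
B = eye(3) - diag(1./d) * A;     % iteration matrix I - D^(-1)*A
C = b ./ d;                      % constant term D^(-1)*b
x = zeros(3, 1);
for niter = 1:200                % iteration cap (assumed)
    xNew = B*x + C;
    done = norm(xNew - x, Inf) <= 1e-12 * norm(xNew, Inf);
    x = xNew;
    if done
        break;
    end
end
residual = norm(A*x - b, Inf);   % should be close to zero
```

Each iteration is one matrix-vector product, so for a 400-by-400 sparse system the work per iteration is tiny compared with the cost of starting workers and moving data between them.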

Related

How to speed up iterative function call in MatLab?

In MatLab I have to call the cdf of the t distribution (tcdf) iteratively (since the next input value depends on the previous output of tcdf), which unfortunately slows down my code massively.
tic
z = NaN(1e5,1);
z(1) = 1;
x = 2;
for ii = 2:1e5
x = tcdf(z(ii-1),x);
z(ii) = z(ii-1)*x;
end
toc
Elapsed time is 4.717087 seconds.
Is there a way to speed this up somehow?
For comparison:
tic
z = randn(1e5,1);
tcdf(z,5);
toc
Elapsed time is 0.091353 seconds.
Move the random number generation outside the loop, as in the following:
numVal = 1e5;
z = randn(numVal,1);
for ii = 2:numVal
z(ii) = z(ii-1) + z(ii);
end
tcdf(z,5);
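Note that the loop in this snippet is just a running sum, so it could be written with cumsum. A small sketch comparing the two (the variable names zLoop and zVec are mine); this applies only to the snippet above, not to the tcdf recurrence in the question, where each input depends nonlinearly on the previous output:

```matlab
numVal = 1e5;
z = randn(numVal, 1);

% Loop version, as in the snippet above
zLoop = z;
for ii = 2:numVal
    zLoop(ii) = zLoop(ii-1) + zLoop(ii);
end

% Vectorized equivalent: cumulative sum of the original samples
zVec = cumsum(z);

maxDiff = max(abs(zLoop - zVec));   % expected to be zero or round-off level
```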

Vectorizing for loop on square matrix in matlab

I have a big square matrix (refConnect) with approx 500000 elements.
I need to perform this operation:
tmp = find(referenceCluster == 67);
for j=1:length(tmp)
refConnect(tmp(j),tmp)=1;
end
I wonder if there is a simple way to vectorise this so I can avoid the for loop which is taking forever.
Thanks for any help.
Cheers
It seems you can't significantly decrease the execution time.
Try measuring it with this test function:
function test_speed(N, M)
if nargin < 1
N = 707;
end
if nargin < 2
M = 2;
end
refConnectIn = rand(N, N);
referenceCluster = randi(M, N, 1);
refConnectA = refConnectIn;
tic
tmpA = find(referenceCluster == 1);
for j=1:length(tmpA)
refConnectA(tmpA(j),tmpA) = 1;
end
toc
refConnectB = refConnectIn;
tic
tmpB = referenceCluster == 1;
refConnectB(tmpB, tmpB) = 1;
toc
if isequal(refConnectA, refConnectB)
disp('Result are equals');
else
disp('Result are UNEQUALS!');
end
With default parameters you get:
>> test_speed
Elapsed time is 0.002865 seconds.
Elapsed time is 0.001575 seconds.
Result are equals
Note that the vectorized code (case B) can be slower for large M:
>> test_speed(707,1000)
Elapsed time is 0.001623 seconds.
Elapsed time is 0.002219 seconds.
Result are equals
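To see what the logical-index assignment does, here is a tiny worked example; the 4-element cluster vector is made up for illustration:

```matlab
referenceCluster = [67; 12; 67; 5];   % illustrative labels
refConnect = zeros(4);

mask = (referenceCluster == 67);      % logical mask: elements 1 and 3
refConnect(mask, mask) = 1;           % sets every (row, col) pair within the cluster

expected = [1 0 1 0;
            0 0 0 0;
            1 0 1 0;
            0 0 0 0];
same = isequal(refConnect, expected);
```

The single assignment touches the same entries as the loop over tmp, without needing find.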

Computational complexity of expanding a vector using its index in MATLAB

Suppose I have an n-by-1 vector A and an m-by-1 vector b of integers, where max(b)<=n and min(b)>0. Can anyone tell me the computational complexity (in big-O notation) of the command A(b) in MATLAB?
I tested A(b) with n = 100, 1000, 10^4, ..., 10^7 and m = 100, with A and b generated randomly. On the same machine, the average time is about the same for every n (approximately 5.00679e-05 seconds for A(b)). Part of the code is shown below (for more accurate numbers, you should run each case 100 times and take the average, and also compute the variance to check that the values do not differ much):
A = randi(100,100,1);
b = randi(100,100,1);
tic; A(b); toc
> Elapsed time is 5.38826e-05 seconds.
A = randi(100,1000,1);
b = randi(100,100,1);
tic; A(b); toc
> Elapsed time is 6.31809e-05 seconds.
A = randi(100,10000,1);
b = randi(100,100,1);
tic; A(b); toc
> Elapsed time is 4.88758e-05 seconds.
...
% Here the range of b also grows with the size of A,
% but the timing still does not change in scale.
A = randi(100,1000000,1);
b = randi(1000000,100,1);
tic; A(b); toc
> Elapsed time is 6.60419e-05 seconds.
I also ran the test with n = 100 and m = 10^4, ..., 10^7; the times are on the order of 10^-5, 10^-4, and so on. Repeating this for different n gives the same result. As in the previous part, the runs look like this:
A = randi(100,100,1);
b = randi(100,10000,1);
tic;A(b);toc
> Elapsed time is 7.00951e-05 seconds.
A = randi(100,100,1);
b = randi(100,100000,1);
tic;A(b);toc
> Elapsed time is 0.000529051 seconds.
A = randi(100,100,1);
b = randi(100,1000000,1);
tic;A(b);toc
> Elapsed time is 0.00533104 seconds.
...
To be more accurate, you can run the code like this:
a = zeros(100,1);
for idx = 1:100
A = randi(100,100,1);b = randi(100,100000,1);
tic; A(b);a(idx) = toc;
end
var(a)
> ans = 4.4092e-10 % very little
mean(a)
> ans = 3.6702e-04 % 10^-4 scale
As you can see, the variance is very small and the mean is on the 10^-4 scale. You can verify the other cases by the same method.
Therefore, based on the above analysis and on how indexing is commonly implemented, we can say the time does not depend on n but does depend on the size of b. In sum, the time complexity is O(m).
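A minimal sketch of that measurement for several values of m at once, assuming plain tic/toc averaging over 100 repetitions (the names nReps, mValues and avgTime are mine). The actual numbers depend on the machine, so none are quoted here:

```matlab
nReps = 100;                       % repetitions per case (assumed)
mValues = [1e4 1e5 1e6];           % sizes of the index vector b
A = randi(100, 100, 1);            % n = 100, as in the runs above
avgTime = zeros(size(mValues));

for p = 1:numel(mValues)
    b = randi(100, mValues(p), 1);
    t = zeros(nReps, 1);
    for r = 1:nReps
        tic; y = A(b); t(r) = toc; % time only the indexing operation
    end
    avgTime(p) = mean(t);
end

% If the cost is O(m), successive ratios should be roughly 10
ratio = avgTime(2:end) ./ avgTime(1:end-1);
```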
The complexity is 'unusual'
I came up with a short test script and ran it a few times. If the size of b increases, there is a clear increase in the required time. However, the results get confusing if you look at the size of A. (Note that in this script m is the size of A and n is the size of b.)
R = zeros(3,4);
mc = 0; % row counter; must be initialized before the loop
for m = 10.^[4 5 6]
mc = mc+1;
nc =0;
for n = 10.^[1 2 3 4]
nc = nc+1;
A = rand(m,1);
b = randi(m, n, 1);
tic; A(b); R(mc,nc) = toc*10^5;
end
end
R
The above script gave the following results (also similar results when repeated). Note that this was in Octave Online, rather than Matlab.
R =
3.2902 1.7881 2.3127 9.3937
5.3167 2.0027 2.8133 8.2970
21.6007 3.3140 4.9829 17.3092

Vectorization when mapping between indices in an assignment is not injective

Suppose that c is a scalar value, T and W are M-by-N matrices, k is another M-by-N matrix containing values from 1 to M (and there are at least two pairs (i1, j1), (i2, j2) such that k(i1, j1)==k(i2, j2)) and a is a 1-by-M vector. I want to vectorize the following code (hoping that this will speed it up):
T = zeros(M,N);
for j = 1:N
for i = 1:M
T(k(i,j),j) = T(k(i,j),j) + c*W(i,j)/a(i);
end
end
Do you have any tips so that I can vectorize this code (or make it faster in general)?
Thanks in advance!
Since k only ever affects how values are aggregated within a column, but not between columns, you can achieve a slight speedup by reducing the problem to a single loop over columns and using accumarray like so:
T = zeros(M, N);
for col = 1:N
T(:, col) = accumarray(k(:,col), c*W(:, col)./a(:), [M 1]); % a(:) forces a column, so a row-vector a also works
end
I tested each of the solutions (the loop in your question, rahnema's, Divakar's, and mine) by taking the average of 100 iterations using input values initialized as in Divakar's answer. Here's what I got (running Windows 7 x64, 16 GB RAM, MATLAB R2016b):
solution | avg. time (s) | max(abs(err))
---------+---------------+---------------
loop | 0.12461 | 0
rahnema | 0.84518 | 0
divakar | 0.12381 | 1.819e-12
gnovice | 0.09477 | 0
The take-away: loops actually aren't so bad, but if you can simplify them into one it can save you a little time.
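For readers unfamiliar with accumarray, here is a tiny sketch of how it sums values that share a subscript, which is exactly the non-injective case in the question (the numbers are made up):

```matlab
subs = [1; 1; 3];                    % two values map to row 1, none to row 2
vals = [10; 20; 30];
T = accumarray(subs, vals, [3 1]);   % T = [30; 0; 30]
```

Rows that receive no value stay at zero, matching the T = zeros(M,N) initialization of the original loop.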
Here's an approach with a combination of bsxfun and accumarray -
% Create 2D array of unique IDs along each col to be used as flattened subs
id = bsxfun(@plus,k,M*(0:N-1));
% Compute "c*W(i,j)/a(i)" for all i's and j's
cWa = c*bsxfun(@rdivide,W,a);
% Accumulate final result for all cols
out = reshape(accumarray(id(:),reshape(cWa,[],1),[M*N 1]),[M,N]);
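As a sanity check, here is a small sketch comparing this flattened-index approach against the double loop from the question. The sizes and random inputs are illustrative assumptions, and a is taken as an M-by-1 column so that bsxfun divides each row i of W by a(i):

```matlab
M = 4; N = 3; c = 2.5;               % small illustrative sizes
W = rand(M, N);
a = rand(M, 1) + 1;                  % column vector, kept away from zero
k = randi(M, M, N);                  % row subscripts, duplicates allowed

% Reference: double loop from the question
Tloop = zeros(M, N);
for j = 1:N
    for i = 1:M
        Tloop(k(i,j), j) = Tloop(k(i,j), j) + c*W(i,j)/a(i);
    end
end

% Flattened version: add column offsets to k to get linear indices into an M-by-N array
id = bsxfun(@plus, k, M*(0:N-1));
cWa = c * bsxfun(@rdivide, W, a);
Tvec = reshape(accumarray(id(:), cWa(:), [M*N 1]), [M N]);

err = max(abs(Tloop(:) - Tvec(:)));  % expected at round-off level
```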
Benchmarking
Approaches as functions -
function out = func1(W,a,c,k,M,N)
id = bsxfun(@plus,k,M*(0:N-1));
cWa = c*bsxfun(@rdivide,W,a);
out = reshape(accumarray(id(:),reshape(cWa,[],1),[M*N 1]),[M,N]);
function T = func2(W,a,c,k,M,N) % @rahnema1's solution
[I J] = meshgrid(1:M,1:N);
idx1 = sub2ind([M N], I ,J);
R = c.* W(idx1) ./ a(I);
T = accumarray([k(idx1(:)) ,J(:)], R(:),[M N]);
function T = func3(W,a,c,k,M,N) % Original approach
T = zeros(M,N);
for j = 1:N
for i = 1:M
T(k(i,j),j) = T(k(i,j),j) + c*W(i,j)/a(i);
end
end
function T = func4(W,a,c,k,M,N) % @gnovice's solution
T = zeros(M, N);
for col = 1:N
T(:, col) = accumarray(k(:,col), c*W(:, col)./a, [M 1]);
end
Machine setup : Kubuntu 16.04, MATLAB 2012a, 4GB RAM.
Timing code -
% Setup inputs
M = 3000;
N = 3000;
W = rand(M,N);
a = rand(M,1);
c = 2.34;
k = randi([1,M],[M,N]);
disp('------------------ With func1')
tic,out = func1(W,a,c,k,M,N);toc
clear out
disp('------------------ With func2')
tic,out = func2(W,a,c,k,M,N);toc
clear out
disp('------------------ With func3')
tic,out = func3(W,a,c,k,M,N);toc
clear out
disp('------------------ With func4')
tic,out = func4(W,a,c,k,M,N);toc
Timing code run -
------------------ With func1
Elapsed time is 0.215591 seconds.
------------------ With func2
Elapsed time is 1.555373 seconds.
------------------ With func3
Elapsed time is 0.572668 seconds.
------------------ With func4
Elapsed time is 0.291552 seconds.
Possible improvements in proposed approach
1] In c*bsxfun(@rdivide,W,a) we use two stages of broadcasting: one in bsxfun(@rdivide,W,a), where a is broadcast, and a second when c is broadcast against the 2D output of bsxfun(@rdivide,W,a) (though we don't need bsxfun for that one). A possible improvement is to divide c by a first, so that c is broadcast only to 1D instead of 2D; the second level of broadcasting then goes from 1D (c/a) to 2D (W), just like before. This minor improvement can be timed:
>> tic, c*bsxfun(@rdivide,W,a); toc
Elapsed time is 0.073244 seconds.
>> tic, bsxfun(@times,W,c/a); toc
Elapsed time is 0.041745 seconds.
But in cases where c and a differ greatly in magnitude, the scaling factor c/a could affect the final result appreciably, so one needs to be careful with this suggestion.
A possible solution:
[I J] = meshgrid(1:M,1:N);
idx1 = sub2ind([M N], I ,J);
R = c.* W(idx1) ./ a(I);
T = accumarray([k(idx1(:)) ,J(:)], R(:),[M N]);
Comparison of different methods in Octave without jit:
------------------ Divakar
Elapsed time is 0.282008 seconds.
------------------ rahnema1
Elapsed time is 1.08827 seconds.
------------------ gnovice
Elapsed time is 0.418701 seconds.
------------------ loop
doesn't complete in 15 seconds.

How to efficiently construct a matrix in matlab that depends on indices

In my MATLAB program there are several places where I need to create a matrix whose entries depend on their indices, and then perform matrix-vector operations with it. I wonder how to implement this most efficiently.
For example, I need to speed up:
N = 1e4;
x = rand(N,1);
% Option 1
tic
I = 1:N;
J = 1:N;
S = zeros(N,N);
for i = 1:N
for j = 1:N
S(i,j) = (i+j)/(abs(i-j)+1);
end
end
a = x'*S*x
fprintf('Option 1 takes %.4f sec\n',toc)
clearvars -except x N
To speed this up, I have tried the following options:
% Option 2
tic
I = 1:N;
J = 1:N;
Sx = zeros(N,1);
for i = 1:N
Srow_i = (i+J)./(abs(i-J)+1);
Sx(i)= Srow_i*x;
end
a = x'*Sx
fprintf('Option 2 takes %.4f sec\n',toc)
clearvars -except x N
and
% Option 3
tic
I = 1:N;
J = 1:N;
S = bsxfun(@plus,I',J)./(abs(bsxfun(@minus,I',J))+1);
a = x'*S*x
fprintf('Option 3 takes %.4f sec\n',toc)
clearvars -except x N
and (thanks to one of the answers)
% Option 4
tic
[I , J] = meshgrid(1:N,1:N);
S = (I+J) ./ (abs(I-J) + 1);
a = x' * S * x;
fprintf('Option 4 takes %.4f sec\n',toc)
clearvars -except x N
Option 2 is the most efficient. Is there a faster way of doing this operation?
Update:
I have also tried the option by Abhinav:
% Option 5 using Tony's Trick
tic
i = 1:N;
j = (1:N)';
I = i(ones(N,1),:);
J = j(:,ones(N,1));
S = (I+J)./(abs(I-J)+1);
a = x'*S*x;
fprintf('Option 5 takes %.4f sec\n',toc)
clearvars -except x N
It seems that the most efficient procedure depends on the size of N. For different N I get the following output:
N = 100:
Option 1 takes 0.00233 sec
Option 2 takes 0.00276 sec
Option 3 takes 0.00183 sec
Option 4 takes 0.00145 sec
Option 5 takes 0.00185 sec
N = 10000:
Option 1 takes 3.29824 sec
Option 2 takes 0.41597 sec
Option 3 takes 0.72224 sec
Option 4 takes 1.23450 sec
Option 5 takes 1.27717 sec
So, for small N, option 2 is the slowest but it becomes the most efficient for larger N. Maybe because of memory? Could somebody explain this?
You can create the indices using meshgrid, with no need to loop:
N = 1e4;
[I , J] = meshgrid(1:N,1:N);
x = rand(N,1);
S = (I+J) ./ (abs(I-J) + 1);
a = x' * S * x;
Update:
Since @Optimist has shown that this code performs worse than Option 2 and Option 3, I decided to slightly improve Option 2:
N = 1e4;
x = rand(N,1);
Sx = zeros(N,1);
for i = 1:N
Srow_i = (i+1:i+N)./[i:-1:2,1:N-i+1] ;
Sx(i)= Srow_i*x;
end
a = x'*Sx;
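The row expression here avoids the abs call by writing the numerator and denominator of row i directly. A small sketch checking that it matches the original formula (i+J)./(abs(i-J)+1) for every row, using a small N chosen for illustration:

```matlab
N = 7;                                   % small size, for checking only
J = 1:N;
ok = true;
for i = 1:N
    rowDirect = (i+J) ./ (abs(i-J)+1);                 % formula from Option 2
    rowClosed = (i+1:i+N) ./ [i:-1:2, 1:N-i+1];        % rewritten form above
    ok = ok && isequal(rowDirect, rowClosed);
end
```

The denominator counts down from i to 2 for the columns left of the diagonal, then up from 1 to N-i+1 from the diagonal onward.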
You should try Tony's Trick, which is a fast way to do vector stacking/tiling in MATLAB. I have answered a similar question here. Here is the Tony's Trick option:
% Option using Tony's Trick
tic
i = 1:N;
j = (1:N)';
I = i(ones(N,1),:);
J = j(:,ones(N,1));
S = (I+J)./(abs(I-J)+1);
a = x'*S*x
fprintf('Option 1 takes %.4f sec\n',toc)
Edit 1: I ran a few tests and found the following. Up to N=1000, the Tony's Trick option is slightly faster than Option 2. Beyond that, Option 2 catches up and becomes faster.
Possible reason:
As long as the arrays fit in the cache, the fully vectorized code (the Tony's Trick option) is faster. But once the arrays grow (N>1000), they spill into memory levels farther from the CPU, and MATLAB presumably then breaks the Tony's Trick code down into piecemeal operations, so it no longer enjoys the full benefit of vectorization.
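A rough back-of-the-envelope estimate of the memory involved, assuming double precision (8 bytes per element). Whether memory traffic fully explains the timings is an assumption, but the size gap between the two approaches is easy to compute:

```matlab
N = 1e4;
bytesPerDouble = 8;                          % double precision (assumed)
bytesFullMatrix = N^2 * bytesPerDouble;      % one full N-by-N array such as S
bytesOneRow = N * bytesPerDouble;            % one row, as used by Option 2

% The meshgrid and Tony's Trick options hold I, J and S at the same time
bytesThreeMatrices = 3 * bytesFullMatrix;

fprintf('Full S: %.0f MB, one row: %.0f kB\n', ...
        bytesFullMatrix/1e6, bytesOneRow/1e3);   % prints "Full S: 800 MB, one row: 80 kB"
```

So for N = 1e4 the fully vectorized options work on arrays of about 800 MB each, while Option 2 only ever touches an 80 kB row plus the vector x.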