I've used spmd to run two pieces of code simultaneously. The computer I'm using has a processor with 8 cores, which means the communication overhead should be close to zero!
I compared the running time of this spmd block with the same code outside of spmd, using tic & toc.
When I run the code, the parallel version takes more time than the sequential form.
Any idea why that is?
Here is a sample of the code I'm talking about:
tic;
spmd
    if labindex == 1
        gamma = (alpha*beta);
    end
    if labindex == 2
        for t = 1:T
            for i1 = 1:n
                for j1 = 1:n
                    kesi(i1,j1,t) = (alpha(i1,t) + phi(j1,t));
                end
            end
        end
    end
end
t_spmd = toc;
tic;
gamma2 = (alpha * beta);
for t = 1:T
    for i1 = 1:n
        for j1 = 1:n
            kesi2(i1,j1,t) = (alpha(i1,t) + phi(j1,t));
        end
    end
end
t_seq = toc;
disp('t spmd : ');disp(t_spmd);
disp('t seq : ');disp(t_seq);
There are two reasons here. Firstly, your use of if labindex == 2 means that the main body of the spmd block is being executed by only a single worker - there's no parallelism here.
Secondly, it's important to remember that (by default) parallel pool workers run in single computational thread mode. Therefore, when using local workers, you can only expect speedup when the body of your parallel construct cannot be implicitly multi-threaded by MATLAB.
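You can check this for yourself: maxNumCompThreads reports the number of computational threads in the calling context, so running it on the client and inside spmd shows the difference (a minimal check, assuming a local pool is open):
maxNumCompThreads      % on the client: typically the number of physical cores
spmd
    maxNumCompThreads  % on each worker: typically 1 by default
end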
Finally, in this particular case, you're much better off using bsxfun (or implicit expansion in R2016b or later), like so:
T = 10;
n = 7;
alpha = rand(n, T);
phi = rand(n, T);
alpha_r = reshape(alpha, n, 1, T);
phi_r = reshape(phi, 1, n, T);
% In R2016b or later:
kesi = alpha_r + phi_r;
% In R2016a or earlier:
kesi = bsxfun(@plus, alpha_r, phi_r);
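As a quick sanity check, you can rebuild kesi with the original triple loop over the same alpha and phi and confirm the results match exactly (the same element-wise additions are performed, so the match is bitwise):
kesi_loop = zeros(n, n, T);
for t = 1:T
    for i1 = 1:n
        for j1 = 1:n
            kesi_loop(i1,j1,t) = alpha(i1,t) + phi(j1,t);
        end
    end
end
isequal(kesi, kesi_loop)   % returns logical 1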
I'm trying to speed up my computation by using gpuArray. However, it isn't working out for my code below.
for i = 1:10
    calltest;
end

function [t1,t2] = calltest
N = 10;
tic
u = gpuArray(rand(1,N).^(1./(N:-1:1)));
t1 = toc
tic
u2 = rand(1,N).^(1./(N:-1:1));
t2 = toc
end
where I get
t1 =
4.8445e-05
t2 =
1.4369e-05
I have an Nvidia GTX850M graphic card. Am I using gpuArray incorrectly? This code is wrapped inside a function, and the function is called by a loop thousands of times.
Why? Because there is both a) a small problem-scale and b) not a "mathematically-dense" GPU-kernel.
The method of comparison is blurring the root cause of the problem.
Step 0: separate the data-set ( a vector ) creation from the section-under-test:
N = 10;
R = rand( 1, N );
tic; <a-gpu-based-computing-section>; GPU_t = toc
tic; c = R.^( 1 ./ (N:-1:1) ); CPU_t = toc
Step 1: test the scaling:
Trying just 10 elements will not make the observation clear, as an overhead-naive formulation of Amdahl's Law does not explicitly account for the added time spent on the CPU-side GPU-kernel assembly & transport plus the ( CPU-to-GPU + GPU-to-CPU ) data-handling phases. These add-on phases may become negligibly small only when compared to
a) an indeed large-scale vector / matrix GPU-kernel processing, which N ~ 10 obviously is not,
or
b) an indeed "mathematically-dense" GPU-kernel processing, which R.^() obviously is not.
So do not blame GPU-computing for having acquired a must-do part ( the overheads ): it cannot get working without these prior add-ons in time ( and the CPU may, during the same amount of time, already produce the final result - Q.E.D. )
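To make the scaling visible on your own machine, one can sweep the problem size ( a sketch, assuming a working GPU is present; timeit / gputimeit also force MATLAB's lazy GPU evaluation to finish before the clock stops ):
for N = [ 10, 1e4, 1e6, 1e7 ]
    R_cpu = rand( 1, N );
    e_cpu = 1 ./ ( N:-1:1 );
    R_gpu = gpuArray( R_cpu );               % explicit host-to-device transfer
    e_gpu = gpuArray( e_cpu );
    CPU_t = timeit(    @() R_cpu.^e_cpu );   % robust CPU timing
    GPU_t = gputimeit( @() R_gpu.^e_gpu );   % robust GPU timing, incl. sync
    fprintf( 'N = %8d : CPU %.3g [s] | GPU %.3g [s]\n', N, CPU_t, GPU_t );
end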
Fine-grain measurement, per each of the CPU-GPU-CPU-workflow sections:
N = 10; %% 100, 1000, 10000, 100000, ..
tic; CPU_hosted = rand( N, 'single' ); %% 'double'
CPU_gen_RAND = toc
tic; GPU_hosted_IN1 = gpuArray( CPU_hosted );
GPU_xfer_h2d = toc
tic; GPU_hosted_IN2 = rand( N, 'gpuArray' );
GPU_gen__h2d = toc
tic; <kernel-generation-with-might-be-lazy-eval-deferred-xfer-setup>;
GPU_kernel_AssyExec = toc
tic; CPU_hosted_RES = gather( GPU_hosted_RES );
GPU_xfer_d2h = toc
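One caveat worth adding here ( not part of the skeleton above ): gpuArray operations are launched asynchronously, so a bare toc may stop before the kernel has actually finished. A sketch filling the kernel placeholder with a hypothetical element-wise stand-in and forcing completion; GPU_hosted_RES is the name the skeleton already assumes:
tic; GPU_hosted_RES = GPU_hosted_IN1 .^ 2;   % hypothetical stand-in kernel
wait( gpuDevice );                           % block until the GPU has finished
GPU_kernel_AssyExec = toc
% or, more robustly, let gputimeit handle synchronisation and repetition:
GPU_kernel_AssyExec = gputimeit( @() GPU_hosted_IN1 .^ 2 )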
This post builds on my post about quickly evaluating analytic Jacobian in Matlab:
fast evaluation of analytical jacobian in MATLAB
The key difference is that now I am working with the Hessian, and I have to evaluate close to 700 matlabFunctions (instead of 1 matlabFunction, as I did for the Jacobian) each time the Hessian is evaluated. So there is an opportunity to do things a little differently.
I have tried to do this two ways so far and I am thinking about implementing a third and was wondering if anyone has any other suggestions. I will go through each method with a toy example, but first some preprocessing to generate these matlabFunctions:
PreProcessing:
% This part of the code is calculated once, it is not the issue
dvs = 5;
X=sym('X',[dvs,1]);
num = dvs - 1; % number of constraints
% multiple functions
for k = 1:num
f1(X(k+1),X(k)) = (X(k+1)^3 - X(k)^2*k^2);
c(k) = f1;
end
gradc = jacobian(c,X).'; % .' performs transpose
parfor k = 1:num
hessc{k} = jacobian(gradc(:,k),X);
end
parfor k = 1:num
hess_name = strcat('hessian_',num2str(k));
matlabFunction(hessc{k},'file',hess_name,'vars',X);
end
METHOD #1: Evaluate functions in series
%% Now we use the functions to run an "optimization." Just as an example, the "optimization" is a for loop
fprintf('This is test A, where the functions are evaluated in series!\n');
tic
for q = 1:10
x_dv = rand(dvs,1); % these are the design variables
lambda = rand(num,1); % these are the lagrange multipliers
x_dv_cell = num2cell(x_dv); % for passing large design variables
for k = 1:num
hess_name = strcat('hessian_',num2str(k));
function_handle = str2func(hess_name);
H_temp(:,:,k) = lambda(k)*function_handle(x_dv_cell{:});
end
H = sum(H_temp,3);
end
fprintf('The time for test A was:\n')
toc
METHOD #2: Evaluate functions in parallel
%% Try to run a parfor loop
fprintf('This is test B, where the functions are evaluated in parallel!\n');
tic
for q = 1:10
x_dv = rand(dvs,1); % these are the design variables
lambda = rand(num,1); % these are the lagrange multipliers
x_dv_cell = num2cell(x_dv); % for passing large design variables
parfor k = 1:num
hess_name = strcat('hessian_',num2str(k));
function_handle = str2func(hess_name);
H_temp(:,:,k) = lambda(k)*function_handle(x_dv_cell{:});
end
H = sum(H_temp,3);
end
fprintf('The time for test B was:\n')
toc
RESULTS:
METHOD #1 = 0.008691 seconds
METHOD #2 = 0.464786 seconds
DISCUSSION of RESULTS
This result makes sense because the functions evaluate very quickly, and running them in parallel wastes a lot of time setting up and sending out the jobs to the different MATLAB workers (and then getting the data back from them). I see the same result on my actual problem.
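One incidental cost worth removing in both methods (my observation, separate from the parallel overhead discussed above): str2func is called inside the inner loop on every outer iteration. Hoisting the handle lookup out of the loops avoids that repeated work, along these lines:
% Build the function handles once, before the "optimization" loop:
handles = cell(1,num);
for k = 1:num
    handles{k} = str2func(strcat('hessian_',num2str(k)));
end
% ...then inside the loop body simply use:
%   H_temp(:,:,k) = lambda(k)*handles{k}(x_dv_cell{:});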
METHOD #3: Evaluating the functions using the GPU
I have not tried this yet, but I am interested to see what the performance difference is. I am not yet familiar with doing this in Matlab and will add it once I am done.
Any other thoughts? Comments? Thanks!
In Matlab I need to accumulate overlapping diagonal blocks of a large matrix. The sample code is given below.
Since this piece of code needs to run several times, it consumes a lot of resources. The process is used in array signal processing for a so-called subarray smoothing or spatial smoothing. Is there any way to do this faster?
% some values for parameters
M = 1000; % size of array
m = 400; % size of subarray
n = M-m+1; % number of subarrays
R = randn(M)+1i*rand(M);
% main code
S = R(1:m,1:m);
for i = 2:n
S = S + R(i:m+i-1,i:m+i-1);
end
ATTEMPTS:
1) I tried the following alternative vectorized version, but unfortunately it became much slower!
[X,Y] = meshgrid(1:m);
inds1 = sub2ind([M,M],Y(:),X(:));
steps = (0:n-1)*(M+1);
inds = repmat(inds1,1,n) + repmat(steps,m^2,1);
RR = sum(R(inds),2);
S = reshape(RR,m,m);
2) I used Matlab coder to create a MEX file and it became much slower!
I've personally had to speed up some portions of my code lately. Not being an expert at all, I would recommend trying the following:
1) Vectorize:
Getting rid of the for-loop
S = R(1:m,1:m);
for i = 2:n
S = S + R(i:m+i-1,i:m+i-1);
end
and replacing it with an alternative based on cumsum should be the way to go here.
Note: I will try and work on this approach in a future edit; a possible sketch is given below.
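In the meantime, here is how such a cumsum-based version could look (a sketch of my own, untested on your full problem): every entry S(p,q) is a length-n windowed sum along one diagonal of R, and cumsum turns each windowed sum into a single subtraction, giving roughly O(M*m) work instead of O(n*m^2):
S2 = zeros(m);
for d = -(m-1):(m-1)                  % diagonal offset q - p of S
    v  = diag(R, d);                  % full diagonal of R at offset d
    cs = [0; cumsum(v)];
    w  = cs(n+1:end) - cs(1:end-n);   % all length-n window sums at once
    S2 = S2 + diag(w, d);             % place them on diagonal d of S2
end
% sanity check against the loop version: max(abs(S2(:) - S(:)))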
2) Generating a MEX-file:
In some instances, you could simply fire up the Matlab Coder app (given that you have it in your current Matlab version).
This should generate a .mex file for you, which you can then call as if it were the function you are trying to replace.
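For reference, the app workflow also has a command-line equivalent (a sketch; the entry-point name and the argument template are placeholders you'd replace with your own):
% Generate a MEX from my_function for a 1000x1000 complex double input:
codegen my_function -args { complex(zeros(1000)) } -o my_function_mex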
Regardless of your choice (1) or 2)), you should profile each candidate with tic; my_function(); toc; over a fair number of function calls, and compare the averages against your current implementation:
my_time = zeros(1,10000);
for count = 1:10000
tic;
my_function();
my_time(count) = toc;
end
mean(my_time)
Can anyone explain to me why the following gives an error for u but not for h?
max_X = 100;
max_Y = 100;
h = ones(max_Y,max_X);
u = zeros(max_Y,max_X);
parfor l = 1:max_X*max_Y
    i = mod(l-1,max_X) + 1;
    j = floor((l-1)/max_Y) + 1;
    for k = 1:9
        m = i + floor((k-1)/3) - 1;
        n = j + mod(k,-3) + 1;
        h_average(k) = sqrt(h(i,j)*h(m,n));
        u_average(k) = (u(i,j)*sqrt(h(i,j)) + u(m,n)*sqrt(h(m,n)))/(sqrt(h(i,j)) + sqrt(h(m,n)));
    end
end
I can substitute (i,j) with (l), but even if I compute the corresponding linear index, let's call it p, from (m,n) and write u(p) instead of u(m,n), it gives me an error message.
It only underlines u(m,n) (resp. u(p)), but not h(m,n).
MATLAB says:
Explanation:
For MATLAB to execute parfor loops efficiently, the amount of data sent to the MATLAB workers must be minimal. One of the ways MATLAB achieves this is by restricting the way variables can be indexed in parfor iterations. The indicated variable is indexed in a way that is incompatible with parfor.
Suggested Action
Fix the indexing. For a description of the indexing restrictions, see “Sliced Variables” in the Parallel Computing Toolbox documentation
Any idea, what's wrong here?
The problems with u and h are that they are both being sent as broadcast variables to the PARFOR loop. This is not an error - it's just a warning indicating that more data than might otherwise be necessary is being sent.
The PARFOR loop cannot run because you're indexing, but not slicing, u_average and h_average. It's not clear what outputs you want from this loop, since you're overwriting u_average and h_average on each iteration; as written, the PARFOR loop is pointless.
I have been writing MATLAB scripts for some time and I still do not understand how MATLAB works "under the hood". Consider the following script, which does some computation using (big) vectors in three different ways:
1. MATLAB vector operations;
2. a simple for cycle that does the same computation component-wise;
3. an optimized cycle that is supposed to be faster than 2., since it avoids some allocations and assignments.
Here is the code:
N = 10000000;
A = linspace(0,100,N);
B = linspace(-100,100,N);
C = linspace(0,200,N);
D = linspace(100,200,N);
% 1. MATLAB Operations
tic
C_ = C./A;
D_ = D./B;
G_ = (A+B)/2;
H_ = (C_+D_)/2;
I_ = (C_.^2+D_.^2)/2;
X = G_ .* H_;
Y = G_ .* H_.^2 + I_;
toc
tic
X;
Y;
toc
% 2. Simple cycle
tic
C_ = zeros(1,N);
D_ = zeros(1,N);
G_ = zeros(1,N);
H_ = zeros(1,N);
I_ = zeros(1,N);
X = zeros(1,N);
Y = zeros(1,N);
for i = 1:N,
C_(i) = C(i)/A(i);
D_(i) = D(i)/B(i);
G_(i) = (A(i)+B(i))/2;
H_(i) = (C_(i)+D_(i))/2;
I_(i) = (C_(i)^2+D_(i)^2)/2;
X(i) = G_(i) * H_(i);
Y(i) = G_(i) * H_(i)^2 + I_(i);
end
toc
tic
X;
Y;
toc
% 3. Optimized cycle
tic
X = zeros(1,N);
Y = zeros(1,N);
for i = 1:N,
X(i) = (A(i)+B(i))/2 * (( C(i)/A(i) + D(i)/B(i) ) /2);
Y(i) = (A(i)+B(i))/2 * (( C(i)/A(i) + D(i)/B(i) ) /2)^2 + ( (C(i)/A(i))^2 + (D(i)/B(i))^2 ) / 2;
end
toc
tic
X;
Y;
toc
I know that one should always try to vectorize computations, MATLAB being built over matrices/vectors (thus, nowadays, it is not always the best choice), so I am expecting that something like:
C = A .* B;
is faster than:
for i = 1:N
C(i) = A(i) * B(i);
end
What I am not expecting is that the vectorized version is actually faster even in the above script, despite the second and third methods going through only one cycle, whereas the first method performs many vector operations (each of which is, theoretically, a "for" cycle in itself). This forces me to conclude that MATLAB has some magic that permits (for example):
C = A .* B;
D = C .* C;
to run faster than a single "for" cycle with the same operations inside it.
So:
what is the magic that allows the 1st part to execute so fast?
when you write "D = A .* B", does MATLAB actually do a component-wise computation with a "for" cycle, or does it simply keep track that D contains some multiplication of "bla" and "bla"?
EDIT
Suppose I want to implement the same computation in C++ (maybe using some library). Will the first MATLAB method be faster even than the third one implemented in C++? (I'll answer this question myself, just give me some time.)
EDIT 2
As requested, here are the experiment runtimes:
Part 1: 0.237143
Part 2: 4.440132
of which 0.195154 for allocation
Part 3: 2.280640
of which 0.057500 for allocation
and without JIT:
Part 1: 0.337259
Part 2: 149.602017
of which 0.033886 for allocation
Part 3: 82.167713
of which 0.010852 for allocation
The first one is the fastest because vectorized code is easily translated into a small number of optimized C++ library calls. MATLAB could also optimize it at a higher level, for example replacing G*H+I with an optimized mul_add(G,H,I) instead of add(mul(G,H),I) in its core.
The second one can't be converted to C++ calls easily. It has to be interpreted or compiled. The most modern approach for scripting languages is JIT compilation. MATLAB's JIT compiler is not very good, but that doesn't mean it has to stay that way; I don't know why MathWorks doesn't improve it. Thus #2 performs so slowly that #1 is faster even though it performs more "mathematical" operations.
The Julia language was invented as a compromise between MATLAB expressiveness and C++ speed. The same non-vectorized code (Julia vs MATLAB) runs very fast there because its JIT compilation is very good.
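As an aside, tic/toc on a single run is noisy and mixes in JIT warm-up; timeit runs each candidate many times and reports a robust average. A minimal sketch comparing the two styles (loopMul is a hypothetical helper, not from the script above):
N = 1e7;
A = linspace(0,100,N);
B = linspace(-100,100,N);
t_vec  = timeit(@() A .* B)         % vectorized candidate
t_loop = timeit(@() loopMul(A,B))   % explicit cycle, defined below

% local function (in a script this requires R2016b or later):
function C = loopMul(A,B)
    C = zeros(size(A));
    for i = 1:numel(A)
        C(i) = A(i) * B(i);
    end
end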
Regarding performance optimization, I follow #memyself's suggestion of using the profiler for both approaches, as mentioned in 'for' loop vs vectorization in MATLAB.
For educational purposes it makes sense to experiment with numerical algorithms; for anything else I would go with well-proven libraries.