What is the reason behind parfeval's time overhead compared to a serial implementation?

I am trying to parallelize some code used in the Gauss-Seidel algorithm to approximate the solution of a Linear Equation System.
In brief, for an NxN matrix, during one iteration, I am doing sqrt(N) sessions of parallel computation, one by one. During one session of parallel computation, I distribute the task of calculating sqrt(N) values from a vector among the available workers.
The code involved in a parallel computation session is this:
future_results(1:num_workers) = parallel.FevalFuture;
for i = 1:num_workers
    start_itv = buck_bound+1 + (i - 1) * worker_length;
    end_itv = min(buck_bound+1 + i * worker_length - 1, ends_of_buckets(current_bucket));
    future_results(i) = parfeval(p, @hybrid_parallel_function, 3, A, b, x, x_last, buck_bound, n, start_itv, end_itv);
end
for i = 1:num_workers
    [~, arr, start_itv, end_itv] = fetchNext(future_results(i));
    x(start_itv:end_itv) = arr;
end
The function called by parfeval is this:
function [x_par, start_itv, end_itv] = hybrid_parallel_function (A, b, x, x_last, buck_bound, n, start_itv, end_itv)
    x_par = zeros(end_itv - start_itv + 1, 1);
    for i = start_itv:end_itv
        x_par(i-start_itv+1) = b(i);
        x_par(i-start_itv+1) = x_par(i-start_itv+1) - A(i, 1:buck_bound) * x(1:buck_bound);
        x_par(i-start_itv+1) = x_par(i-start_itv+1) - A(i, buck_bound+1:i-1) * x_last(buck_bound+1:i-1);
        x_par(i-start_itv+1) = x_par(i-start_itv+1) - A(i, i+1:n) * x_last(i+1:n);
        x_par(i-start_itv+1) = x_par(i-start_itv+1) / A(i, i);
    end
end
The entire code can be found here: https://pastebin.com/hRQ5Ugqz
The MATLAB profiler output for a 1000x1000 matrix shows the parallel code running between 20 and 135 times slower than its serial counterpart, depending on the chosen coefficient matrix (though still much faster than spmd).
Might the parfeval computation be lazily split between lines 50 and 57 of the pastebin code? Still, I cannot explain this major overhead to myself. It seems to have something to do with the number of times parfeval is called: I did lower the execution time by lowering the number of parfeval calls.
Is there something that can be further optimized? Do I have to resort to writing the code in C++?
Please help. Thank you very much!

There are a few possibilities here. Most important is the simple fact that if you're using the 'local' cluster type, the workers run single-threaded. In situations where the "serial" code actually takes advantage of MATLAB's intrinsic multi-threading, you're already making full use of the available CPU hardware, and using parallel workers cannot gain you anything. It's not certain that this is the case for you, but I'd strongly suspect it given the code.
There are overheads to running in parallel, and as you've observed, running fewer parfeval calls lowers these overheads. Your code as written copies the whole of the A matrix to each worker multiple times. You don't need to change A, so you could use parallel.pool.Constant to avoid those repeated copies, along the lines of the sketch below.
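Here is a minimal sketch of that change, assuming an open pool p and the variables from the question (the Constant Ac and the anonymous wrapper are mine, not part of the original code):
Ac = parallel.pool.Constant(A); % copy A to each worker once; Ac.Value reads it back
for i = 1:num_workers
    start_itv = buck_bound+1 + (i - 1) * worker_length;
    end_itv = min(buck_bound+1 + i * worker_length - 1, ends_of_buckets(current_bucket));
    % the wrapper unpacks the Constant on the worker before calling the original function
    future_results(i) = parfeval(p, ...
        @(C, varargin) hybrid_parallel_function(C.Value, varargin{:}), 3, ...
        Ac, b, x, x_last, buck_bound, n, start_itv, end_itv);
end
b, x and x_last still travel with every call; if they are large too, the same trick applies, although x and x_last change between sessions, so their Constants would need re-creating.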
While parfeval is more flexible, it tends to be less efficient than parfor in cases where parfor can be applied; a parfor version of one of your sessions is sketched below.
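As a hedged sketch (variable names taken from the question; untested against the full algorithm), one session could look like this:
lo = buck_bound + 1;
hi = ends_of_buckets(current_bucket);
offset = lo - 1;
x_new = zeros(hi - lo + 1, 1);
parfor i = lo:hi
    % same row update as hybrid_parallel_function, one row per iteration
    xi = b(i) - A(i, 1:buck_bound) * x(1:buck_bound) ...
              - A(i, lo:i-1) * x_last(lo:i-1) ...
              - A(i, i+1:n) * x_last(i+1:n);
    x_new(i - offset) = xi / A(i, i);
end
x(lo:hi) = x_new;
This removes the manual chunking and the fetchNext bookkeeping; parfor schedules the rows across workers itself.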
Yes, you can expect the workers to start working as soon as the first parfeval call has completed.
(Sorry, this isn't really a proper "answer", so some kind soul will probably come along and delete this soon, but there's much too much to fit into a comment).

Related

why does a*b*a take longer than (a'*(a*b)')' when using gpuArray in Matlab scripts?

The code below performs the same operation on gpuArrays a and b in two different ways. The first part computes (a'*(a*b)')', while the second part computes a*b*a. The results are then verified to be the same.
%function test
clear
rng('default');rng(1);
a=sprand(3000,3000,0.1);
b=rand(3000,3000);
a=gpuArray(a);
b=gpuArray(b);
tic;
c1=gather(transpose(transpose(a)*transpose(a*b)));
disp(['time for (a''*(a*b)'')'': ' , num2str(toc),'s'])
clearvars -except c1
rng('default');
rng(1)
a=sprand(3000,3000,0.1);
b=rand(3000,3000);
a=gpuArray(a);
b=gpuArray(b);
tic;
c2=gather(a*b*a);
disp(['time for a*b*a: ' , num2str(toc),'s'])
disp(['error = ',num2str(max(max(abs(c1-c2))))])
%end
However, computing (a'*(a*b)')' is roughly 4 times faster than computing a*b*a. Here is the output of the above script in R2018a on an Nvidia K20 (I've tried different versions and different GPUs with similar behaviour).
>> test
time for (a'*(a*b)')': 0.43234s
time for a*b*a: 1.7175s
error = 2.0009e-11
Even more strangely, if the first and last lines of the above script are uncommented (to turn it into a function), then both take the longer amount of time (~1.7s instead of ~0.4s). Below is the output for this case:
>> test
time for (a'*(a*b)')': 1.717s
time for a*b*a: 1.7153s
error = 1.0914e-11
I'd like to know what is causing this behaviour, and how to perform a*b*a or (a'*(a*b)')' or both in the shorter amount of time (i.e. ~0.4s rather than ~1.7s) inside a matlab function rather than inside a script.
There seems to be an issue with multiplying two sparse matrices on the GPU: a sparse-by-full multiplication is more than 1000 times faster than sparse-by-sparse. A simple example:
str = {'sparse*sparse','sparse*full'};
for ii = 1:2
    rng(1);
    a = sprand(3000,3000,0.1);
    b = sprand(3000,3000,0.1);
    if ii == 2
        b = full(b);
    end
    a = gpuArray(a);
    b = gpuArray(b);
    tic
    c = a*b;
    disp(['time for ',str{ii},': ' , num2str(toc),'s'])
end
In your context, it is the last multiplication that does it. To demonstrate, I replace the final a with a duplicate c and multiply by it twice, once as a sparse and once as a full matrix.
str = {'a*b*a','a*b*full(a)'};
for ii = 1:2
    %rng('default');
    rng(1)
    a = sprand(3000,3000,0.1);
    b = rand(3000,3000);
    rng(1)
    c = sprand(3000,3000,0.1);
    if ii == 2
        c = full(c);
    end
    a = gpuArray(a);
    b = gpuArray(b);
    c = gpuArray(c);
    tic;
    c1{ii} = a*b*c;
    disp(['time for ',str{ii},': ' , num2str(toc),'s'])
end
disp(['error = ',num2str(max(max(abs(c1{1}-c1{2}))))])
I may be wrong, but my conclusion is that a*b*a involves a multiplication of two sparse matrices (a and a again), which is not handled well, while the transpose() approach splits the process into two multiplications, neither of which involves two sparse matrices.
I got in touch with MathWorks tech support and Rylan finally shed some light on this issue. (Thanks, Rylan!) His full response is below. The function vs. script issue appears to be related to certain optimizations that MATLAB applies automatically to functions (but not scripts) not working as expected.
Rylan's response:
Thank you for your patience on this issue. I have consulted with the MATLAB GPU computing developers to understand this better.
This issue is caused by internal optimizations done by MATLAB when encountering some specific operations like matrix-matrix multiplication and transpose. Some of these optimizations may be enabled specifically when executing a MATLAB function (or anonymous function) rather than a script.
When your initial code was being executed from a script, a particular matrix transpose optimization is not performed, which results in the 'res2' expression being faster than the 'res1' expression:
n = 2000;
a=gpuArray(sprand(n,n,0.01));
b=gpuArray(rand(n));
tic;res1=a*b*a;wait(gpuDevice);toc % Elapsed time is 0.884099 seconds.
tic;res2=transpose(transpose(a)*transpose(a*b));wait(gpuDevice);toc % Elapsed time is 0.068855 seconds.
However when the above code is placed in a MATLAB function file, an additional matrix transpose-times optimization is done which causes the 'res2' expression to go through a different code path (and different CUDA library function call) compared to the same line being called from a script. Therefore this optimization generates slower results for the 'res2' line when called from a function file.
To avoid this issue from occurring in a function file, the transpose and multiply operations would need to be split in a manner that stops MATLAB from applying this optimization. Separating each clause within the 'res2' statement seems to be sufficient for this:
tic;i1=transpose(a);i2=transpose(a*b);res3=transpose(i1*i2);wait(gpuDevice);toc % Elapsed time is 0.066446 seconds.
In the above line, 'res3' is being generated from two intermediate matrices: 'i1' and 'i2'. The performance (on my system) seems to be on par with that of the 'res2' expression when executed from a script; in addition the 'res3' expression also shows similar performance when executed from a MATLAB function file. Note however that additional memory may be used to store the transposed copy of the initial array. Please let me know if you see different performance behavior on your system, and I can investigate this further.
Additionally, the 'res3' operation shows faster performance when measured with the 'gputimeit' function too. Please refer to the attached 'testscript2.m' file for more information on this. I have also attached 'test_v2.m' which is a modification of the 'test.m' function in your Stack Overflow post.
Thank you for reporting this issue to me. I would like to apologize for any inconvenience caused by this issue. I have created an internal bug report to notify the MATLAB developers about this behavior. They may provide a fix for this in a future release of MATLAB.
Since you had an additional question about comparing the performance of GPU code using 'gputimeit' vs. using 'tic' and 'toc', I just wanted to provide one suggestion which the MATLAB GPU computing developers had mentioned earlier. It is generally good to also call 'wait(gpuDevice)' before the 'tic' statements to ensure that GPU operations from the previous lines don't overlap in the measurement for the next line. For example, in the following lines:
b=gpuArray(rand(n));
tic; res1=a*b*a; wait(gpuDevice); toc
if the 'wait(gpuDevice)' is not called before the 'tic', some of the time taken to construct the 'b' array from the previous line may overlap and get counted in the time taken to execute the 'res1' expression. This would be preferred instead:
b=gpuArray(rand(n));
wait(gpuDevice); tic; res1=a*b*a; wait(gpuDevice); toc
Apart from this, I am not seeing any specific issues in the way that you are using the 'tic' and 'toc' functions. However note that using 'gputimeit' is generally recommended over using 'tic' and 'toc' directly for GPU-related profiling.
I will go ahead and close this case for now, but please let me know if you have any further questions about this.
%testscript2.m
n = 2000;
a = gpuArray(sprand(n, n, 0.01));
b = gpuArray(rand(n));
gputimeit(@()transpose_mult_fun(a, b))
gputimeit(@()transpose_mult_fun_2(a, b))
function out = transpose_mult_fun(in1, in2)
i1 = transpose(in1);
i2 = transpose(in1*in2);
out = transpose(i1*i2);
end
function out = transpose_mult_fun_2(in1, in2)
out = transpose(transpose(in1)*transpose(in1*in2));
end
%test_v2.m
function test_v2
clear
%% transposed expression
n = 2000;
rng('default');rng(1);
a = sprand(n, n, 0.1);
b = rand(n, n);
a = gpuArray(a);
b = gpuArray(b);
tic;
c1 = gather(transpose( transpose(a) * transpose(a * b) ));
disp(['time for (a''*(a*b)'')'': ' , num2str(toc),'s'])
clearvars -except c1
%% non-transposed expression
rng('default');
rng(1)
n = 2000;
a = sprand(n, n, 0.1);
b = rand(n, n);
a = gpuArray(a);
b = gpuArray(b);
tic;
c2 = gather(a * b * a);
disp(['time for a*b*a: ' , num2str(toc),'s'])
disp(['error = ',num2str(max(max(abs(c1-c2))))])
%% sliced equivalent
rng('default');
rng(1)
n = 2000;
a = sprand(n, n, 0.1);
b = rand(n, n);
a = gpuArray(a);
b = gpuArray(b);
tic;
intermediate1 = transpose(a);
intermediate2 = transpose(a * b);
c3 = gather(transpose( intermediate1 * intermediate2 ));
disp(['time for split equivalent: ' , num2str(toc),'s'])
disp(['error = ',num2str(max(max(abs(c1-c3))))])
end
EDIT 2: I might have been right; see this other answer.
EDIT: They use MAGMA, which is column major. My answer does not hold; however, I will leave it here for a while in case it can help crack this strange behavior.
The answer below is wrong.
This is my guess; I cannot tell you 100% without knowing the code under MATLAB's hood.
Hypothesis: MATLAB's parallel computing code uses CUDA libraries, not its own.
Important information
MATLAB is column major and CUDA is row major.
There is no such thing as a 2D matrix in memory, only a 1D array addressed with 2 indices.
Why does this matter? Because CUDA libraries are highly optimized code that uses the memory layout to maximize cache hits per kernel (the slowest operation on GPUs is reading memory). This means a standard CUDA matrix multiplication kernel will exploit the order of memory reads to make sure they are adjacent. However, what is adjacent memory in row-major order is not adjacent in column-major order.
So, as someone writing software, there are 2 solutions to this:
Write your own column-major algebra libraries in CUDA
Take every input/output from MATLAB and transpose it (i.e. convert from column-major to row major)
They have done point 2, and assuming that there is a smart JIT compiler in the MATLAB Parallel Computing Toolbox (a reasonable assumption), in the second case it takes a and b, transposes them, does the maths, and transposes the output when you gather.
In the first case, however, you do not need to transpose the output at all: it is internally already transposed and the JIT catches this, so instead of calling gather(transpose( XX )) it simply skips the output transposition. The same goes for transpose(a*b). Note that transpose(a*b) = transpose(b)*transpose(a), so suddenly no transposes are needed (they are all internally skipped). A transposition is a costly operation.
Indeed there is a weird thing here: making the code a function suddenly makes it slow. My best guess is that because the JIT behaves differently in different situations, it doesn't catch all this transpose business inside a function and just performs all the operations anyway, losing the speed-up.
Interesting observation: on my PC, a*b*a takes the same time on the CPU as on the GPU.

Standard arrays seem faster than gpuArray on conv net feed forward

I am implementing convolutional networks in MATLAB, and I have added support for GPUs (I am using gpuArrays). I implemented the feed-forward part. When I run it with standard arrays (already sitting in my workspace), it takes 0.15 s. However, when I run the EXACT same thing with the arrays being gpuArrays (all in my workspace prior to running the feed-forward script), it takes ~1.39 s. Can someone explain what's going on here? Thanks.
UPDATE: I tested the running time, and everything suggests that the main bottleneck is the convolution part, so I am pasting that part of the code here:
pad = (size(layers_W{layerNum}, 1)-1) / 2;
for imageNum = 1:options.minibatchSize
    for filterNum = 1:size(layers_W{layerNum}, 4)
        for filterD = 1:size(layers_W{layerNum}, 3)
            c = conv2(convInput(:, :, filterD, imageNum), ...
                rot90(layers_W{layerNum}(:, :, filterD, filterNum), 2), 'valid');
            layers_activations{layerNum}(pad+1:end-pad, pad+1:end-pad, filterNum, imageNum) = ...
                layers_activations{layerNum}(pad+1:end-pad, pad+1:end-pad, filterNum, imageNum) + ...
                c;
        end
        layers_activations{layerNum}(pad+1:end-pad, pad+1:end-pad, filterNum, imageNum) = ...
            layers_activations{layerNum}(pad+1:end-pad, pad+1:end-pad, filterNum, imageNum) + ...
            layers_b{layerNum}(filterNum);
    end
end
if strcmp(options.activation, 'relu') == 1
    layers_activations{layerNum} = max(0, layers_activations{layerNum});
elseif strcmp(options.activation, 'sigmoid') == 1
    layers_activations{layerNum} = 1 ./ (1 + exp(-layers_activations{layerNum}));
end
This exact piece of code is ~52 times slower on GPU than on CPU. Any ideas?
UPDATE2: I separately tested the line that does the 2-D convolution (~10 times slower on GPU) and the line below it that adds two matrices (~100 times slower on GPU). I am completely confused about why this is happening.
This isn't at all a surprise. The GPU is efficient at doing convolutions on large images (HD, 4K) but not particularly at images 227x227 or smaller, such as are typical in CNNs. You need to at least be running a 3-D convolution so you can apply all the filters over each input activation in one call, rather than looping over all the filters and all the images. Try replacing the inner loop with a call to convn.
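A hedged sketch of that convn suggestion, reusing the question's variable names (untested; it assumes layers_W{layerNum} is kh-by-kw-by-D-by-numFilters and convInput is H-by-W-by-D-by-batch):
for imageNum = 1:options.minibatchSize
    for filterNum = 1:size(layers_W{layerNum}, 4)
        % flip all three dimensions: the spatial flips reproduce the rot90
        % correlation, and the depth flip makes convn's 'valid' third
        % dimension sum over the input channels in a single call
        k = flip(flip(flip(layers_W{layerNum}(:, :, :, filterNum), 1), 2), 3);
        c = convn(convInput(:, :, :, imageNum), k, 'valid');
        % assumes the target slice starts at zero, as the original accumulation does
        layers_activations{layerNum}(pad+1:end-pad, pad+1:end-pad, filterNum, imageNum) = ...
            c + layers_b{layerNum}(filterNum);
    end
end
(flip needs R2013b or later; older releases spell it flipdim.)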
Smart GPU implementations of convolution in this context, such as that used by the Neural Network Toolbox in MATLAB, use custom kernels and multi-threading to take advantage of spatial parallelism and parallelism in the batch dimensions of filters and inputs. Your implementation throws away all the batch parallelism.

How to `parfor` nested loops?

The original code is like this:
for i = 1 : size(H, 1)
    for j = 1 : size(H, 2)
        H{i,j} blabla
and I tried to adapt it into parallel code like this:
parfor ind = 1 : numel(H)
    [i, j] = ind2sub(size(H), ind);
    H{i,j} blabla
which generates an error saying parfor cannot run due to H{i,j}.
Then what's the error here? And how can I adapt the nested loop into parfor?
One possible solution is
for i = 1 : size(H, 1)
    parfor j = 1 : size(H, 2)
        H{i,j} blabla
But I suspect that using a parfor inside another loop will multiply the overhead of parfor, resulting in additional computation time.
I think the parfor error arises because MATLAB cannot detect that [i,j] is unique across the loop, since it is the result of a function call. As far as the engine is concerned, you may therefore access H{i,j} multiple times, and the iterations cannot be proven independent of each other.
Edit: as mentioned by patrik, you have to be sure that there is no dependence between iterations; that is, H{i,j} must not depend on H{k,l} for i~=k or j~=l, and no value computed in one iteration may be used in another. This is the basic requirement for a parfor, apart from reduction assignments.
Besides that point, if you want to run independent computations in parallel, and if it is worth it, always choose to parfor the outermost loop. Also remember that MATLAB does not allow nested parfor loops; if you want to parallelize an inner for-loop, you have to put its parfor in a function of its own, as sketched below. Parallelizing inner loops may not bring a speed-up, though (it depends on how many workers there are in the parpool).
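A minimal sketch of that pattern (hypothetical names; some_work stands in for the real computation on each cell):
function H = process_all(H)
% outer loop stays serial; each row is handed to a helper whose body is the parfor
for i = 1:size(H, 1)
    H(i, :) = process_row(H(i, :));
end
end

function row = process_row(row)
parfor j = 1:numel(row)
    row{j} = some_work(row{j}); % placeholder for the real work on H{i,j}
end
end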
In my experience, running parallel inner loops is not recommended. As an example (outside MATLAB), I would cite LibSVM, which recommends parallelizing only the outermost loop with OpenMP if you want to speed up the computation, never the inner loops.
The reason for this recommendation is that you have a limited pool of workers, and workers may be viewed as threads; past a certain point, adding threads slows the computation down because of the cost of switching between them. MATLAB may manage this part very well, but the point remains that your pool of workers is limited in size. If each outermost iteration takes a lot of time and there are many iterations, you gain nothing by parallelizing inner loops, because each worker will be busy running a whole outer iteration (including its inner loops).
Nevertheless, it's always a good idea to test each option; some may, counter-intuitively, be better suited to your problem!
Why not simply use the linear index to assign into H? For example:
H = cell(4, 4);
parfor idx = 1:16
    [i, j] = ind2sub([4, 4], idx);
    H{idx} = rand(i, j); % or whatever
end
Otherwise, it's always best to make the outermost loop the PARFOR loop. The following also works:
H = cell(4, 4);
parfor r = 1:4
    for c = 1:4
        H{r, c} = rand(r, c);
    end
end

vectorizing loops in Matlab - performance issues

This question is related to these two:
Introduction to vectorizing in MATLAB - any good tutorials?
filter that uses elements from two arrays at the same time
Based on the tutorials I read, I tried to vectorize a procedure that takes a really long time.
I've rewritten this:
function B = bfltGray(A,w,sigma_r)
dim = size(A);
B = zeros(dim);
for i = 1:dim(1)
    for j = 1:dim(2)
        % Extract local region.
        iMin = max(i-w,1);
        iMax = min(i+w,dim(1));
        jMin = max(j-w,1);
        jMax = min(j+w,dim(2));
        I = A(iMin:iMax,jMin:jMax);
        % Compute Gaussian intensity weights.
        F = exp(-0.5*(abs(I-A(i,j))/sigma_r).^2);
        B(i,j) = sum(F(:).*I(:))/sum(F(:));
    end
end
into this:
function B = rngVect(A, w, sigma)
W = 2*w+1;
I = padarray(A, [w,w],'symmetric');
I = im2col(I, [W,W]);
H = exp(-0.5*(abs(I-repmat(A(:)', size(I,1),1))/sigma).^2);
B = reshape(sum(H.*I,1)./sum(H,1), size(A, 1), []);
Where:
A is a 512x512 matrix,
w is half of the window size, usually equal to 5,
sigma is a parameter in the range [0 1] (usually one of 0.1, 0.2 or 0.3).
So the I matrix would have 512x512x121 = 31719424 elements
But this version seems to be as slow as the first one, and in addition it uses a lot of memory and sometimes causes memory problems. I suppose I've done something wrong, probably some logic mistake in the vectorization. In fact I'm not surprised - this method creates really big matrices, and the computations are probably proportionally longer.
I have also tried to write it using nlfilter (similar to the second solution given by Jonas) but it seems to be hard since I use Matlab 6.5 (R13) (there are no sophisticated function handles available).
So once again, I'm asking not for a ready solution but for ideas that would help me solve this in reasonable time. Maybe you can point out what I did wrong.
Edit:
As Mikhail suggested, the profiling results are as follows:
65% of the time was spent in the line H = exp(...)
25% of the time was used by im2col
How big are I and H (i.e. numel(I)*8 bytes)? If you start paging, then the performance of your second solution is going to be affected very badly.
To test whether you really have a problem due to too-large arrays, you can measure the speed of the calculation using tic and toc for arrays A of increasing size. If the execution time increases faster than the square of the size of A, or if it jumps at some size of A, you can try to split the padded I into a number of sub-arrays and perform the calculations in chunks, as in the sketch below.
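A hedged sketch of one such split (rngVectChunked and chunkSize are names I introduced; only R13-era functions are used): instead of forming the full weight matrix H at once, process the im2col output in column blocks:
function B = rngVectChunked(A, w, sigma, chunkSize)
% same computation as rngVect, but H is built chunkSize pixels at a time
W = 2*w + 1;
I = padarray(A, [w, w], 'symmetric');
I = im2col(I, [W, W]); % W^2 rows, one column per pixel
Avec = A(:)'; % centre intensities
B = zeros(1, numel(A));
for c0 = 1:chunkSize:numel(A)
    c1 = min(c0 + chunkSize - 1, numel(A));
    Ic = I(:, c0:c1);
    Hc = exp(-0.5*(abs(Ic - repmat(Avec(c0:c1), size(Ic,1), 1))/sigma).^2);
    B(c0:c1) = sum(Hc.*Ic, 1) ./ sum(Hc, 1);
end
B = reshape(B, size(A));
Note that I is still materialized in full; if that alone is too large, the im2col call itself would have to be applied per block of rows of A.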
Otherwise, I don't see any obvious places where you could be losing lots of time. Well, maybe you could skip the reshape, by replacing B with A in your function (saves a little memory as well), and writing
A(:) = sum(H.*I,1)./sum(H,1);
You may also want to look into upgrading to a more recent version of Matlab - they've worked hard on improving performance.

Parallelize or vectorize all-against-all operation on a large number of matrices?

I have approximately 5,000 matrices with the same number of rows and varying numbers of columns (20 x ~200). Each of these matrices must be compared against every other in a dynamic programming algorithm.
In this question, I asked how to perform the comparison quickly and was given an excellent answer involving a 2D convolution. Serially, iteratively applying that method, like so
list = who('data_matrix_prefix*');
H = cell(numel(list),numel(list));
for i = 1:numel(list)
    for j = 1:numel(list)
        if i ~= j
            eval([ 'H{i,j} = compare(' char(list(i)) ',' char(list(j)) ');']);
        end
    end
end
is fast for small subsets of the data (e.g. for 9 matrices, 9*9 - 9 = 72 calls are made in ~1 s, 870 calls in ~2.5 s).
However, operating on all the data requires almost 25 million calls.
I have also tried using deal() to make a cell array composed entirely of the next element in data, so I could use cellfun() in a single loop:
% who(), load() and struct2cell() calls place k data matrices in a 1D cell array called data.
nextData = cell(k,1);
for i = 1:k
    [nextData{:}] = deal(data{i});
    H(:,i) = cellfun(@compare, data, nextData, 'UniformOutput', false);
end
Unfortunately, this is not really any faster, because all the time is in compare(). Both of these code examples seem ill-suited for parallelization. I'm having trouble figuring out how to make my variables sliced.
compare() is totally vectorized; it uses matrix multiplication and conv2() exclusively (I am under the impression that all of these operations, including the cellfun(), should be multithreaded in MATLAB?).
Does anyone see a (explicitly) parallelized solution or better vectorization of the problem?
Note
I realize both my examples are inefficient - the first would be twice as fast if it calculated a triangular cell array, and the second is still calculating the self comparisons, as well. But the time savings for a good parallelization are more like a factor of 16 (or 72 if I install MATLAB on everyone's machines).
Aside
There is also a memory issue. I used a couple of evals to append each column of H into a file, with names like H1, H2, etc. and then clear Hi. Unfortunately, the saves are very slow...
Does compare(a,b) == compare(b,a) hold, and does compare(a,a) == 1? If so, change your loop
for i = 1:numel(list)
    for j = 1:numel(list)
        ...
    end
end
to
for i = 1:numel(list)
    for j = i+1:numel(list)
        ...
    end
end
and deal with the symmetry and identity case. This will cut your calculation time by half.
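A minimal sketch of filling in the remaining entries afterwards (assuming the symmetry and identity properties above do hold):
for i = 1:numel(list)
    H{i,i} = 1; % identity case
    for j = i+1:numel(list)
        H{j,i} = H{i,j}; % symmetry
    end
end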
The second example can easily be sliced for use with the Parallel Computing Toolbox. This toolbox distributes iterations of your code among up to 8 local workers. If you want to run the code on a cluster, you also need the Distributed Computing Toolbox.
%# who(), load() and struct2cell() calls place k data matrices in a 1D cell array called data.
parfor i = 1:k-1 %# this will run the loop in parallel with the parallel computing toolbox
    %# only make the necessary comparisons
    H(i+1:k,i) = cellfun(@compare, data(i+1:k), repmat(data(i),k-i,1), 'UniformOutput', false);

    %# if the above doesn't work, try this
    hSlice = cell(k,1);
    hSlice(i+1:k) = cellfun(@compare, data(i+1:k), repmat(data(i),k-i,1), 'UniformOutput', false);
    H(:,i) = hSlice;
end
If I understand correctly, you have to perform 5000^2 matrix comparisons? Rather than trying to parallelise the compare function, perhaps you should think of your problem as being composed of 5000^2 tasks. The MATLAB Parallel Computing Toolbox supports task-based parallelism. Unfortunately, my experience with PCT is with the parallelisation of large linear-algebra-type problems, so I can't tell you much more than that. The documentation will undoubtedly help you more, but a rough sketch of the jobs/tasks API is below.
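A hedged sketch (modern function names shown; releases of that era used findResource and getAllOutputArguments instead of parcluster and fetchOutputs, and in practice you would batch many comparisons per task rather than create 25 million single-comparison tasks):
c = parcluster('local');
job = createJob(c);
for i = 1:numel(data)
    for j = i+1:numel(data) % upper triangle only
        createTask(job, @compare, 1, {data{i}, data{j}});
    end
end
submit(job);
wait(job);
results = fetchOutputs(job); % one row per task, in creation order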