Intel Parallel Studio 2011 - summing in parallel

I have serial code that looks something like this:
sum = a;
sum += b;
sum += c;
sum += d;
I would like to parallelize it into something like this:
temp1 = a + b and, at the same time, temp2 = c + d
sum = temp1 + temp2
How do I do this using the Intel Parallel Studio tools?
Thanks!!!

Assuming that all variables are of integral or floating-point types, it makes no sense to parallelize this code across threads or cores: the overhead would far exceed any benefit. The parallelism that applies to this example is at the level of multiple computation units and/or vectorization on a single CPU. Modern optimizing compilers are sophisticated enough to exploit this automatically, without code changes; however, if you wish, you can use explicit temporary variables, as in the second part of the question.
If you are asking just out of curiosity: Intel Parallel Studio provides several ways to parallelize code. For example, you can use Cilk keywords together with C++11 lambda functions:
#include <cilk/cilk.h>
...
temp = cilk_spawn [=]{ return a+b; }();
sum = c+d;
cilk_sync;
sum += temp;
Don't expect any performance gain from this (see above) unless you use classes with a computationally heavy overloaded operator+.
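For the single-core route mentioned above, here is a minimal sketch of what "explicit temporaries" can look like in plain C++. The names sum4 and sum_two_accumulators are made up for illustration. The point is only that the two partial sums do not depend on each other, so the CPU is free to overlap them. For floating-point types a compiler will generally not reorder the additions by itself unless relaxed floating-point options are enabled, because the rounding can change, so writing the temporaries out can make a difference there.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: sum four values via two independent partial sums.
double sum4(double a, double b, double c, double d) {
    double temp1 = a + b;  // does not depend on temp2
    double temp2 = c + d;  // does not depend on temp1
    return temp1 + temp2;
}

// Hypothetical helper: the same idea for an array. Two accumulators break
// the single dependency chain of a naive running sum.
double sum_two_accumulators(const std::vector<double>& v) {
    double s0 = 0.0;
    double s1 = 0.0;
    std::size_t i = 0;
    for (; i + 1 < v.size(); i += 2) {
        s0 += v[i];
        s1 += v[i + 1];
    }
    if (i < v.size()) {
        s0 += v[i];  // leftover element when the size is odd
    }
    return s0 + s1;
}
```

Whether this is any faster than the straightforward version depends on the compiler and flags, so measure before keeping it.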

Related

What is the reason behind parfeval's time overhead compared to a serial implementation?

I am trying to parallelize some code used in the Gauss-Seidel algorithm to approximate the solution of a Linear Equation System.
In brief, for an NxN matrix, during one iteration, I am doing sqrt(N) sessions of parallel computation, one by one. During one session of parallel computation, I distribute the task of calculating sqrt(N) values from a vector among the available workers.
The code involved in a parallel computation session is this:
future_results(1:num_workers) = parallel.FevalFuture;
for i = 1:num_workers
start_itv = buck_bound+1 + (i - 1) * worker_length;
end_itv = min(buck_bound+1 + i * worker_length - 1, ends_of_buckets(current_bucket));
future_results(i) = parfeval(p, @hybrid_parallel_function, 3, A, b, x, x_last, buck_bound, n, start_itv, end_itv);
end
for i = 1:num_workers
[~, arr, start_itv, end_itv] = fetchNext(future_results(i));
x(start_itv:end_itv) = arr;
end
The function called by parfeval is this:
function [x_par, start_itv, end_itv] = hybrid_parallel_function (A, b, x, x_last, buck_bound, n, start_itv, end_itv)
x_par = zeros(end_itv - start_itv + 1, 1);
for i = start_itv:end_itv
x_par(i-start_itv+1) = b(i);
x_par(i-start_itv+1) = x_par(i-start_itv+1) - A(i, 1:buck_bound) * x(1:buck_bound);
x_par(i-start_itv+1) = x_par(i-start_itv+1) - A(i, buck_bound+1:i-1) * x_last(buck_bound+1:i-1);
x_par(i-start_itv+1) = x_par(i-start_itv+1) - A(i, i+1:n) * x_last(i+1:n);
x_par(i-start_itv+1) = x_par(i-start_itv+1) / A(i, i);
end
end
The entire code can be found here: https://pastebin.com/hRQ5Ugqz
Profiling with the MATLAB profiler for a 1000x1000 matrix: the parallel code is between 20 and 135 times slower than its serial counterpart, depending on the chosen coefficient matrix (though still much faster than spmd).
Could the parfeval computation be lazily split between lines 50 and 57? Even so, I cannot explain why the overhead is this large. It seems to be related to the number of times parfeval is called: I did lower the execution time by reducing the number of parfeval calls.
Is there something that can be further optimized? Do I have to resort to writing the code in C++?
Please help. Thank you very much!
There are a few possibilities here. Most important is the simple fact that if you're using the 'local' cluster type, then the workers run in single-threaded mode. In situations where the "serial" code is actually taking advantage of MATLAB's intrinsic multi-threading, you're already making full use of the available CPU hardware, and using parallel workers cannot gain you anything. It's not certain that this is the case for you, but I'd strongly suspect it given the code.
There are overheads to running in parallel, and as you've observed, making fewer parfeval calls lowers these overheads. Your code as written copies the whole of the A matrix to each worker multiple times. You don't need to change A, so you could use parallel.pool.Constant to avoid those repeated copies.
While parfeval is more flexible, it tends to be less efficient than parfor in cases where parfor can be applied.
Yes, you can expect the workers to start working as soon as the first parfeval call has completed.
(Sorry, this isn't really a proper "answer", so some kind soul will probably come along and delete this soon, but there's much too much to fit into a comment).

How to make this code simpler so it runs faster

I have a 2800x4800 matrix. There is data only in the first column. I want to add data to the rest of the columns as well. The values in a row should continue like this: n = (n-1) + 0.005. I wrote code with a loop and it works, but it takes too long. How can I write this without a loop?
for j=2:size(Time,2)
Time(:,j) = Time(:,(j-1)) + (1/(Fs*1000));
end
You can remove the for loop by computing every column directly from the first one, since column j equals the first column plus (j-1) steps. Note that this assumes Fs is a constant:
m = size(Time,2);
Time(:,2:m) = repmat(Time(:,1), 1, m-1) + repmat((1:m-1) / (Fs*1000), size(Time,1), 1);
It's possible to get the same results as your sample code in just one line by writing
Time(:,2:end) = bsxfun(@plus,Time(:,1), (1/(Fs*1000)) .* (1:size(Time,2)-1));
If you have a newer version of Matlab (>= R2016b) you can use Matlab's implicit expansion and simply write
Time(:,2:end) = Time(:,1) + (1/(Fs*1000)) .* (1:size(Time,2)-1);
But at least on my computer I do not really see any performance improvement by using this vectorization instead of your loop. The JIT compilation has gotten quite a bit better over time, so it would be interesting to know which Matlab version you use.

Summation of dense and sparse vectors

I need to add a dense vector and a sparse vector in Matlab, and the naive way of doing it, that is:
w = rand(1e7,1);
d = sprand(1e7,1,.001);
tic
for k = 1 : 100
w = w + d;
end
toc
takes about 3.5 seconds, which is about 20 times slower than the way I'd expect Matlab to implement it under the hood:
for k = 1 : 100
ind = find(d);
w(ind) = w(ind) + d(ind);
end
(of course the timing of this faster version depends on the sparsity).
So, why doesn't Matlab do it 'the fast way'? My experience with Matlab till now suggests that it is quite good at utilizing sparsity.
More important, are there other 'sparse' operations I should suspect as being not efficient?
I don't know the answer for sure, but I will give you my guess about what is happening. I don't know Fortran, but from a C++ perspective, what you are showing makes sense, we just need to deconstruct that statement.
The pseudo-code translation of a = b + c where a,b are full and c is sparse would look something like a.operator= ( b.operator+ (c) ).
In all likelihood, full matrix containers in Matlab should have specialised arithmetic operators to deal with sparse inputs, ie something like full full::operator+ ( const sparse& ). The main thing to notice here is that the result of a mixed full/sparse arithmetic operation has to be full. So we're going to need to create a new full container to store the result, even though there are few updated values.
[ Note: the returned full container is a temporary, and so the assignment a.operator= ( ... ) may avoid a full additional copy, eg with full& full::operator= ( full&& ). ]
Unfortunately, there is no way around returning a new full container because there are no compound arithmetic operators in Matlab (i.e. operator +=). This is why Matlab cannot exploit the fact that in your example a is the same as b (try timing your loop with x = w + d instead; there's no runtime difference), and this is where the overhead comes from IMO.
[ Note: even when there is no explicit assignment, eg b+c;, the generic answer variable ans is assigned. ]
Interestingly, there seems to be a notable difference between full full::operator+ ( const sparse& ) and full sparse::operator+ ( const full& ), that is between a = b + c and a = c + b; I can't say more about why that is, but the latter seems to be faster.
In any case, my short answer is "because Matlab doesn't have arithmetic compound operators", which is unfortunate. If you know you are going to do these operations a lot though, it shouldn't be too hard to implement ad hoc optimized versions like the one you proposed. Hope this helps!
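To make the guess above concrete, here is a small C++ sketch of the two code paths. It is only an illustration of the argument, not how Matlab is actually implemented. SparseVec, add and add_in_place are hypothetical names. The out-of-place version has to build a whole new dense result, so it costs O(n) no matter how sparse d is, while the in-place version only touches the nonzeros.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical minimal sparse vector: parallel arrays of indices and values.
struct SparseVec {
    std::size_t size;                // logical length of the vector
    std::vector<std::size_t> index;  // 0-based positions of the nonzeros
    std::vector<double> value;       // the nonzero values
};

// Out-of-place "a = b + c": the result is a new dense vector, so every
// element of w is copied even though only a few of them change.
std::vector<double> add(const std::vector<double>& w, const SparseVec& d) {
    assert(w.size() == d.size);
    std::vector<double> result(w);   // O(n) copy of the dense operand
    for (std::size_t k = 0; k < d.index.size(); ++k) {
        result[d.index[k]] += d.value[k];
    }
    return result;
}

// In-place "w += d": only the nonzero positions are touched, O(nnz).
void add_in_place(std::vector<double>& w, const SparseVec& d) {
    assert(w.size() == d.size);
    for (std::size_t k = 0; k < d.index.size(); ++k) {
        w[d.index[k]] += d.value[k];
    }
}
```

The w(ind) = w(ind) + d(ind) version from the question corresponds to add_in_place here, which is consistent with it being so much faster for very sparse d.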

Replacement for repmat in MATLAB

I have a function which does the following loop many, many times:
for cluster=1:max(bins), % bins is a list in the same format as kmeans() IDX output
select=bins==cluster; % find group of values
means(select,:)=repmat_fast_spec(meanOneIn(x(select,:)),sum(select),1);
% (*, above) for each point, write the mean of all points in x that
% share its label in bins to the equivalent row of means
delta_x(select,:)=x(select,:)-(means(select,:));
%subtract out the mean from each point
end
Noting that repmat_fast_spec and meanOneIn are stripped-down versions of repmat() and mean(), respectively, I'm wondering if there's a way to do the assignment in the line labeled (*) that avoids repmat entirely.
Any other thoughts on how to squeeze performance out of this thing would also be welcome.
Here is a possible improvement to avoid REPMAT:
x = rand(20,4);
bins = randi(3,[20 1]);
d = zeros(size(x));
for i=1:max(bins)
idx = (bins==i);
d(idx,:) = bsxfun(@minus, x(idx,:), mean(x(idx,:)));
end
Another possibility:
x = rand(20,4);
bins = randi(3,[20 1]);
m = zeros(max(bins),size(x,2));
for i=1:max(bins)
m(i,:) = mean( x(bins==i,:) );
end
dd = x - m(bins,:);
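For anyone who ends up moving this into compiled code, the second variant above (one mean per cluster, then a lookup through bins) can be sketched in C++ roughly like this. It is an illustration under assumptions, not a drop-in MEX file: Matrix and center_by_cluster are made-up names, the data is stored as one row per point, and the labels are 0-based rather than MATLAB's 1-based.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical storage: one inner vector per point (row).
using Matrix = std::vector<std::vector<double>>;

// Subtract from each row the mean of all rows sharing its cluster label.
// bins[i] is the 0-based label of row i (MATLAB labels would be 1-based).
Matrix center_by_cluster(const Matrix& x,
                         const std::vector<std::size_t>& bins,
                         std::size_t num_clusters) {
    assert(x.size() == bins.size());
    const std::size_t dims = x.empty() ? 0 : x[0].size();

    // Pass 1: accumulate per-cluster sums and counts, then turn sums into means.
    Matrix mean(num_clusters, std::vector<double>(dims, 0.0));
    std::vector<std::size_t> count(num_clusters, 0);
    for (std::size_t i = 0; i < x.size(); ++i) {
        assert(bins[i] < num_clusters);
        ++count[bins[i]];
        for (std::size_t j = 0; j < dims; ++j) {
            mean[bins[i]][j] += x[i][j];
        }
    }
    for (std::size_t c = 0; c < num_clusters; ++c) {
        if (count[c] == 0) continue;  // empty cluster: leave its mean at zero
        for (std::size_t j = 0; j < dims; ++j) {
            mean[c][j] /= static_cast<double>(count[c]);
        }
    }

    // Pass 2: look up each row's cluster mean and subtract it (no repmat needed).
    Matrix delta(x);
    for (std::size_t i = 0; i < x.size(); ++i) {
        for (std::size_t j = 0; j < dims; ++j) {
            delta[i][j] -= mean[bins[i]][j];
        }
    }
    return delta;
}
```

This makes two passes over the data regardless of the number of clusters, instead of one logical-index scan per cluster as in the original loop.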
One obvious way to speed up calculation in MATLAB is to make a MEX file. You can compile C code and perform any operations you want. If you're searching for the fastest-possible performance, turning the operation into a custom MEX file would likely be the way to go.
You may be able to get some improvement by using ACCUMARRAY.
%# gather array sizes
[nPts,nDims] = size(x);
nBins = max(bins);
%# calculate means. Not sure whether it might be faster to loop over nDims
meansCell = accumarray(bins,1:nPts,[nBins,1],@(idx){mean(x(idx,:),1)},{NaN(1,nDims)});
means = cell2mat(meansCell);
%# subtract cluster means from x - this is how you can avoid repmat in your code, btw.
%# all you need is the array with cluster means.
delta_x = x - means(bins,:);
First of all: format your code properly, surround any operator or assignment by whitespace. I find your code very hard to comprehend as it looks like a big blob of characters.
Next, you could follow the other responses and convert the code to C (mex) or Java, automatically or manually, but in my humble opinion this is a last resort. You should only do such things when your performance falls short by a small margin. On the other hand, your algorithm doesn't show obvious flaws.
But the first thing you should do when trying to improve performance: profile. Use the MATLAB profiler to determine which part of your code is causing your problems. How much would you need to improve this to meet your expectations? If you don't know, first determine this boundary; otherwise you will be looking for a needle in a haystack which might not even be in there in the first place. MATLAB will never be the fastest kid on the block with respect to runtime, but it might be the fastest with respect to development time for certain kinds of operations. In that respect, it might prove useful to sacrifice the clarity of MATLAB for the execution speed of other languages (C or even Java). But in the same respect, you might as well code everything in assembler to squeeze all of the performance out of the code.
Another obvious way to speed up calculation in MATLAB is to make a Java library (similar to @aardvarkk's answer) since MATLAB is built on Java and has very good integration with user Java libraries.
Java's easier to interface and compile than C. It might be slower than C in some cases, but the just-in-time (JIT) compiler in the Java virtual machine generally speeds things up very well.

Methods to speed up for loop in MATLAB

I have just profiled my MATLAB code and there is a bottleneck in this for loop:
for vert=-down:up
for horz=-lhs:rhs
y = y + x(k+vert.*length+horz).*DM(abs(vert).*nu+abs(horz)+1);
end
end
where y, x and DM are vectors I have already defined. I vectorised the loop by writing,
B=(-down:up)'*ones(1,lhs+rhs+1);
C=ones(up+down+1,1)*(-lhs:rhs);
y = sum(sum(x(k+length.*B+C).*DM(abs(B).*nu+abs(C)+1)));
But this ended up being significantly slower.
Are there any suggestions on how I can speed up this for loop?
Thanks in advance.
What you've done is not really vectorization. It's very difficult, if not impossible, to write proper vectorization procedures for image processing (I assume that's what you're doing) in Matlab. When we use the term vectorized, we really mean "vectorized with no additional computation". For example, this code
a = 1:1000000;
n = 0;
for i = a
n = n+i;
end
would run much slower than this code
a = 1:1000000;
sum(a)
Update: code above has been modified, thanks to @Rasman's keen suggestion. The reason is that Matlab does not compile your code into machine language before running it, and that's what causes it to be slower. Built-in functions like sum, mean and the .* operator run pre-compiled C code behind the scenes. For loops are a great example of code that runs slowly when not optimized for your CPU's registers.
What you have done, and please ignore my first comment, is rewriting your procedure with a vector operation and some additional operations. Those are the operations that take extra CPU simply because you're telling your computer to do more computations, even though each computation separately may (or may not) take less time.
If you are really after speeding up your code, take a look at MEX files. They allow you to write C and C++ code, compile it and run it as Matlab functions, just like those fast built-in ones. In any case, Matlab is not meant to be a fast general-purpose programming platform, but rather a computer simulation environment, though this has been changing in recent years. My advice (from experience) is that if you do image processing, you will write for loops, and there's rarely a way around it. Vector operations were written for a more intuitive approach to linear algebra problems, and we rarely treat digital images as regular rectangular matrices in terms of what we do with them.
I hope this helps.
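To give an idea of what the compiled version of the loop from the question might look like, here is a rough C++ sketch of just the computational kernel, without any of the MEX gateway code. The function name weighted_neighbourhood_sum is made up, and indices are 0-based, so the "+1" from the MATLAB expression for the DM index is dropped and k is a 0-based position into x.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Hypothetical kernel: accumulate x at offsets around position k, weighted by
// DM, mirroring the double loop from the question with 0-based indexing.
double weighted_neighbourhood_sum(const std::vector<double>& x,
                                  const std::vector<double>& DM,
                                  long k, long length, long nu,
                                  long down, long up, long lhs, long rhs) {
    double y = 0.0;
    for (long vert = -down; vert <= up; ++vert) {
        for (long horz = -lhs; horz <= rhs; ++horz) {
            const long xi = k + vert * length + horz;                // index into x
            const long di = std::labs(vert) * nu + std::labs(horz);  // index into DM
            assert(xi >= 0 && static_cast<std::size_t>(xi) < x.size());
            assert(static_cast<std::size_t>(di) < DM.size());
            y += x[static_cast<std::size_t>(xi)] * DM[static_cast<std::size_t>(di)];
        }
    }
    return y;
}
```

For example, with x holding the values 1 to 9 as a 3x3 grid (length = 3), k = 4 (the centre), nu = 2 and DM = {4, 2, 2, 1}, a one-pixel neighbourhood in every direction gives 80.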
I would use matrices when handling images... you could then try to extract submatrices like so:
X = reshape(x,height,length);
kx = mod(k,length);
ky = floor(k/length);
xstamp = X( [kx-down:kx+up], [ky-lhs:ky+rhs]);
xstamp = xstamp .* getDMMask(width, height, nu);
y = sum(xstamp(:));
...
function mask = getDMMask(width, height, nu)
% I don't get what you're doing there .. return an appropriate sized mask here.
mask = ones(height, width); % placeholder so the function runs; replace with the real weights
end