Matlab's fftn gets slower with multithreading? - matlab

I have access to a 12 core machine and some matlab code that relies heavily on fftn. I would like to speed up my code.
Since the fft can be parallelized I would think that more cores would help but I'm seeing the opposite.
Here's an example:
X = peaks(1028);
ncores = feature('numcores');
ntrials = 20;
mtx_power_times = zeros(ncores,ntrials);
fft_times = zeros(ncores, ntrials);
for i=1:ncores
for j=1:ntrials
maxNumCompThreads(i);
tic;
X^2;
mtx_power_times(i,j) = toc;
tic
fftn(X);
fft_times(i,j) = toc;
end
end
subplot(1,2,1);
plot(mtx_power_times,'x-')
title('mtx power time vs number of cores');
subplot(1,2,2);
plot(fft_times,'x-');
title('fftn time vs num of cores');
Which gives me this:
The speedup for matrix multiplication is great but it looks like my ffts go almost 3x slower when I use all my cores. What's going on?
For reference my version is 7.12.0.635 (R2011a)
Edit: On large 2D arrays taking 1D transforms I get the same problem:
Edit: The problem appears to be that fftw is not seeing the thread limiting that maxNumCompThreads enforces. I'm getting all the cpus going full speed no matter what I set maxNumCompThreads at.
So... is there a way I can specify how many processors I want to use for an fft in Matlab?
Edit: Looks like I can't do this without some careful work in .mex files. http://www.mathworks.com/matlabcentral/answers/35088-how-to-control-number-of-threads-in-fft has an answer. It would be nice if someone has an easy fix...

Looks like I can't do this without some careful work in .mex files. http://www.mathworks.com/matlabcentral/answers/35088-how-to-control-number-of-threads-in-fft has an answer. It would be nice if someone has an easy fix...

To use different cores, you should use the Parallel Computing Toolbox. For instance, you could use a parfor loop, and you have to pass the functions as a list of handles:
function x = f(n, i)
...
end
m = ones(8);
parfor i=1:8
m(i,:) = f(m(i,:), i);
end
More info is available at:
High performance computing
Multithreaded computation
Multithreading

Related

Optimizing Fourier Series Fitting Function Matlab

I am trying to iterate through a set of samples that seems to show periodic changes. I need continuously apply the fit function to get the fourier series coefficients, the regression has to be n samples in the past (in my case, around 30). The problem is, my code is extremely slow! It will take like 1 hour to do this for a set of 50,000 samples. Is there any way to optimize this? What am I doing wrong?
Here's my code:
function[coefnames,coef] = fourier_regression(vect_waves,n)
j = 1;
coef = zeros(length(vect_waves)-n,10);
for i=n+1:length(vect_waves)
take_fourier = vect_waves(i-n+1:i);
x = 1:n;
f = fit(x,take_fourier,'fourier4');
current_coef = coeffvalues(f);
coef(j,1:length(current_coef)) = current_coef;
j = j + 1;
end
coefnames = coeffnames(f);
end
When I call [coefnames,coef] = fourier_regression(VECTOR,30); This takes forever to compute. Is there any way to fix it? What's wrong with my code?
Note: I have a intel i7 5500 U cpu, 16GB RAM, and using Matlab 2015a.
As I am not familiar with your application, I am not sure whether it is possible to vectorize the code to improve performance. However, I have a couple of other tips.
One thing you should consider is preallocation of arrays. In this case, you should preallocate at least the array coef since I believe you do know its size before starting the loop.
Another thing I suggest is to profile your code. This will provide information on what parts of your code are consuming the most time, helping you focus your effort on improving those parts' performance.

Matlab CUDA basic experiment

(correctly and instructively asnwered, see below)
I'm beginning to do experiments with matlab and gpu (nvidia gtx660).
Now, I wrote this simple monte carlo algorithm to calculate PI. The following is the CPU version:
function pig = mc1vecnocuda(n)
countr=0;
A=rand(n,2);
for i=1:n
if norm(A(i,:))<1
countr=countr+1;
end
end
pig=(countr/n)*4;
end
This takes very little time to be executed on CPU with 100000 points "thrown" into the unit circle:
>> tic; mc1vecnocuda(100000);toc;
Elapsed time is 0.092473 seconds.
See, instead, what happens with gpu-ized version of the algorithm:
function pig = mc1veccuda(n)
countr=0;
gpucountr=gpuArray(countr);
A=gpuArray.rand(n,2);
parfor (i=1:n,1024)
if norm(A(i,:))<1
gpucountr=gpucountr+1;
end
end
pig=(gpucountr/n)*4;
end
Now, this takes a LONG time to be executed:
>> tic; mc1veccuda(100000);toc;
Elapsed time is 21.137954 seconds.
I don't understand why. I used parfor loop with 1024 workers, because querying my nvidia card with gpuDevice, 1024 is the maximum number of simultaneous threads allowed on the gtx660.
Can someone help me? Thanks.
Edit: this is the updated version that avoids IF:
function pig = mc2veccuda(n)
countr=0;
gpucountr=gpuArray(countr);
A=gpuArray.rand(n,2);
parfor (i=1:n,1024)
gpucountr = gpucountr+nnz(norm(A(i,:))<1);
end
pig=(gpucountr/n)*4;
end
And this is the code written following Bichoy's guidelines (the
right code to achieve result):
function pig = mc3veccuda(n)
countr=0;
gpucountr=gpuArray(countr);
A=gpuArray.rand(n,2);
Asq = A.^2;
Asqsum_big_column = Asq(:,1)+Asq(:,2);
Anorms=Asqsum_big_column.^(1/2);
gpucountr=gpucountr+nnz(Anorms<1);
pig=(gpucountr/n)*4;
end
Please note execution time with n=10 millions:
>> tic; mc3veccuda(10000000); toc;
Elapsed time is 0.131348 seconds.
>> tic; mc1vecnocuda(10000000); toc;
Elapsed time is 8.108907 seconds.
I didn't test my original cuda version (for/parfor), for its execution would require hours with n=10000000.
Great Bichoy! ;)
I guess the problem is with parfor!
parfor is supposed to run on MATLAB workers, that is your host not the GPU!
I guess what is actually happening is that you are starting 1024 threads on your host (not on your GPU) and each of them is trying to call the GPU. This result in the tremendous time your code is taking.
Try to re-write your code to use matrix and array operations, not for-loops! This will show some speed-up. Also, remember that you should have much more calculations to do in the GPU otherwise, memory transfer will just dominate your code.
Code:
This is the final code after including all corrections and suggestions from several people:
function pig = mc2veccuda(n)
A=gpuArray.rand(n,2); % An nx2 random matrix
Asq = A.^2; % Get the square value of each element
Anormsq = Asq(:,1)+Asq(:,2); % Get the norm squared of each point
gpucountr = nnz(Anorm<1); % Check the number of elements < 1
pig=(gpucountr/n)*4;
Many reasons like:
Movement of data between host & device
Computation within each loop is very small
Call to rand on GPU may not be parallel
if condition within the loop can cause divergence
Accumulation to a common variable may run in serial, with overhead
It is difficult to profile Matlab+CUDA code. You should probably try in native C++/CUDA and use parallel Nsight to find the bottleneck.
As Bichoy said, CUDA code should always be done vectorized. In MATLAB, unless you're writing a CUDA Kernal, the only large speedup that you're getting is that the vectorized operations are called on the GPU which has thousands of (slow) cores. If you don't have large vectors and vectorized code, it won't help.
Another thing that hasn't been mentioned is that for highly parallel architectures like GPUs you want to use different random number generating algorithms than the "standard" ones. So to add to Bichoy's answer, adding the parameter 'Threefry4x64' (64-bit) or 'Philox4x32-10' (32-bit and a lot faster! Super fast!) can lead to large speedups in CUDA code. MATLAB explains this here: http://www.mathworks.com/help/distcomp/examples/generating-random-numbers-on-a-gpu.html

Methods to speed up for loop in MATLAB

I have just profiled my MATLAB code and there is a bottle-neck in this for loop:
for vert=-down:up
for horz=-lhs:rhs
y = y + x(k+vert.*length+horz).*DM(abs(vert).*nu+abs(horz)+1);
end
end
where y, x and DM are vectors I have already defined. I vectorised the loop by writing,
B=(-down:up)'*ones(1,lhs+rhs+1);
C=ones(up+down+1,1)*(-lhs:rhs);
y = sum(sum(x(k+length.*B+C).*DM(abs(B).*nu+abs(C)+1)));
But this ended up being sufficiently slower.
Are there any suggestions on how I can speed up this for loop?
Thanks in advance.
What you've done is not really vectorization. It's very difficult, if not impossible, to write proper vectorization procedures for image processing (I assume that's what you're doing) in Matlab. When we use the term vectorized, we really mean "vectorized with no additional computation". For example, this code
a = 1:1000000;
for i = a
n = n+i;
end
would run much slower then this code
a = 1:1000000;
sum(a)
Update: code above has been modified, thanks to #Rasman's keen suggestion. The reason is that Matlab does not compile your code into machine language before running it, and that's what causes it to be slower. Built-in functions like sum, mean and the .* operator run pre-compiled C code behind the scenes. For loops are a great example of code that runs slowly when not optimized for you CPU's registers.
What you have done, and please ignore my first comment, is rewriting your procedure with a vector operation and some additional operations. Those are the operations that take extra CPU simply because you're telling your computer to do more computations, even though each computation separately may (or may not) take less time.
If you are really after speeding up you code, take a look at MEX files. They allow you to write and compile C and C++ code, compile it and run as Matlab functions, just like those fast built-in ones. In any case, Matlab is not meant to be a fast general-purpose programming platform, but rather a computer simulation environment, though this approach has been changing in the recent years. My advise (from experience) is that if you do image processing, you will write for loops, and there's rarely a way around it. Vector operations were written for a more intuitive approach to linear algebra problems, and we rarely treat digital images as regular rectangular matrices in terms of what we do with them.
I hope this helps.
I would use matrices when handling images... you could then try to extract submatrices like so:
X = reshape(x,height,length);
kx = mod(k,length);
ky = floor(k/length);
xstamp = X( [kx-down:kx+up], [ky-lhs:ky+rhs]);
xstamp = xstamp.*getDMMMask(width, height);
y = sum(xstamp);
...
function mask = getDMMask(width, height, nu)
% I don't get what you're doing there .. return an appropriate sized mask here.
return mask;
end

MATLAB matrix multiplication vs for loop for each column

When multiplying two matrices, I tried the following two options:
1)
res = X*A;
2)
for i = 1:size(A,2)
res(:,i) = X*A(:,i);
end
I preallocated memory for res in both. And surprisingly, I found option 2 to be faster.
Can someone explain how this is so?
edit:
I tried
K=10000;
clear t1 t2
t1=zeros(K,1);
t2=zeros(K,1);
for k=1:K
clear res
x = rand(100,100);
a = rand(100,100);
tic
res = x*a;
t1(k) = toc;
end
for k=1:K
clear res2
res2 = zeros(100,100);
x = rand(100,100);
a = rand(100,100);
tic
for i = 1:100
res2(:,i) = x*a(:,i);
end
t2(k) = toc;
end
I run both codes in a loop 1000 times. In average (but not always) the first vectorized code was 3-4 times faster. I cleared the result variables and preallocated before starting timer.
x = rand(100,100);
a = rand(100,100);
K=1000;
clear t1 t2
t1=zeros(K,1);
t2=zeros(K,1);
for k=1:K
clear res
tic
res = x*a;
t1(k) = toc;
end
for k=1:K
clear res2
res2 = zeros(100,100);
tic
for i = 1:100
res2(:,i) = x*a(:,i);
end
t2(k) = toc;
end
So, never make a timing conclusion based on a single run.
I believe I can chime in on the variation in timings between the two methods, as well as why people are getting different relative speeds.
Before Matlab version 2008a (or a version near that release), for loops took a major hit in any Matlab code because the interpreter (a layer between the very readable script and a lower level implementation of the code) would have to re-interpret the code each time through the for loop.
Since that release, the interpreter has gotten progressively better so, when running a modern version of Matlab, the interpreter can look at your code and say "Ah ha! I know what he is doing, let me optimize it just a bit" and avoid the hit it would otherwise take by reinterpreting the code.
I would expect the two ways of performing matrix multiplies to evaluate in the same amount of time, why the for loop implementation runs faster is because of some detail in the optimizations of the interpreter that us mere mortals are not privy to know.
One broad lesson we should take from this, is not all versions are equal. I do work on a couple of bleeding edge cases using two Matlab add ons, the SimBiology and the Parallel Computing Toolboxes, both of which (especially if you want them to work together) are version dependent in speed of execution, and from time to time other stability issues. As such, I keep the three most recent releases of Matlab, will test that I get the same answers out of each version, and I'll occasionally roll back to an earlier version if I find issues with some features. This is probably overkill for most people, but gives you an idea of version differences.
Hope this helps.
Edits:
To clarify, code vectorization is still important. But given a script like:
x_slow = zeros(1,1e5);
x_fast = zeros(1,1e5);
tic;
for i=1:1e5
x_slow(i) = log(i);
end
time_slow = toc; % evaluates for me in .0132 seconds
tic;
x_fast = log(1:1e5);
time_fast = toc; % evaluates for me in .0055 seconds
The disparity between time_slow and time_fast has reduced in the past several versions based on improvements in the interpreter. The example I saw I believe was on 2000a vs. 2008b, but that's subject to my recollection.
There is something else that might be going on that was addressed by Oli and Yuk. There is often a difference between the time_1 and time_2 in:
tic; x = log(1:1e5); time_1 = toc
tic; x = log(1:1e5); time_2 = toc
So the test of one million evaluations vs. one evaluation is valuable, depending on where in memory x is (in cache or no).
Hope this helps again.
This may well be an effect of caching. a is already in the cache by the time you do the second version, so it has an advantage. Try creating an independent set of inputs to make it fair. Also, it's probably better to measure the time of e.g. 1 million iterations of this, in order to eliminate typical variations due to outside effects.
It looks to me that you are not multiplying matrix properly, you need to sum all the products from ith row of X matrix and jth column of A matrix, that might be a reason.
Look here to see how it's done.

vectorizing loops in Matlab - performance issues

This question is related to these two:
Introduction to vectorizing in MATLAB - any good tutorials?
filter that uses elements from two arrays at the same time
Basing on the tutorials I read, I was trying to vectorize some procedure that takes really a lot of time.
I've rewritten this:
function B = bfltGray(A,w,sigma_r)
dim = size(A);
B = zeros(dim);
for i = 1:dim(1)
for j = 1:dim(2)
% Extract local region.
iMin = max(i-w,1);
iMax = min(i+w,dim(1));
jMin = max(j-w,1);
jMax = min(j+w,dim(2));
I = A(iMin:iMax,jMin:jMax);
% Compute Gaussian intensity weights.
F = exp(-0.5*(abs(I-A(i,j))/sigma_r).^2);
B(i,j) = sum(F(:).*I(:))/sum(F(:));
end
end
into this:
function B = rngVect(A, w, sigma)
W = 2*w+1;
I = padarray(A, [w,w],'symmetric');
I = im2col(I, [W,W]);
H = exp(-0.5*(abs(I-repmat(A(:)', size(I,1),1))/sigma).^2);
B = reshape(sum(H.*I,1)./sum(H,1), size(A, 1), []);
Where
A is a matrix 512x512
w is half of the window size, usually equal 5
sigma is a parameter in range [0 1] (usually one of: 0.1, 0.2 or 0.3)
So the I matrix would have 512x512x121 = 31719424 elements
But this version seems to be as slow as the first one, but in addition it uses a lot of memory and sometimes causes memory problems.
I suppose I've made something wrong. Probably some logic mistake regarding vectorizing. Well, in fact I'm not surprised - this method creates really big matrices and probably the computations are proportionally longer.
I have also tried to write it using nlfilter (similar to the second solution given by Jonas) but it seems to be hard since I use Matlab 6.5 (R13) (there are no sophisticated function handles available).
So once again, I'm asking not for ready solution, but for some ideas that would help me to solve this in reasonable time. Maybe you will point me what I did wrong.
Edit:
As Mikhail suggested, the results of profiling are as follows:
65% of time was spent in the line H= exp(...)
25% of time was used by im2col
How big are I and H (i.e. numel(I)*8 bytes)? If you start paging, then the performance of your second solution is going to be affected very badly.
To test whether you really have a problem due to too large arrays, you can try and measure the speed of the calculation using tic and toc for arrays A of increasing size. If the execution time increases faster than by the square of the size of A, or if the execution time jumps at some size of A, you can try and split the padded I into a number of sub-arrays and perform the calculations like that.
Otherwise, I don't see any obvious places where you could be losing lots of time. Well, maybe you could skip the reshape, by replacing B with A in your function (saves a little memory as well), and writing
A(:) = sum(H.*I,1)./sum(H,1);
You may also want to look into upgrading to a more recent version of Matlab - they've worked hard on improving performance.