maxNumCompThreads is deprecated, but does it still work in R2014a?
I tried to force a script to use a single computational thread, but it uses 2 logical cores:
maxNumCompThreads(1); % limit MATLAB to a single computational thread.
signal = rand(1, 1000000);
for i=1:100
cwt(signal,1:10,'sym2');
i
end
Any idea why?
Setting the -singleCompThread option when starting MATLAB does work fine (the script then uses one core only).
Note that my computer has hyperthreading, so 2 logical cores is actually only 1 physical core, but MATLAB usually counts logical cores, not physical ones (e.g. when setting the number of workers in a parallel pool).
I am writing MATLAB code which does some operations on a large matrix. First I create three 3D arrays:
dw2 = 0.001;
W2 = [0:dw2:1];
dp = 0.001;
P1 = [dp:dp:1];
dI = 0.001;
I = [1:-dI:0];
[II,p1,ww2] = ndgrid(I,P1,W2);
Then my code basically does the following
G = [0:0.1:10];
Y = zeros(length(G),1);
for i = 1:1:length(G)
    g = G(i);
    Y(i) = myfunction(II,p1,ww2,g);
end
This code takes roughly 100 s, with each iteration being nearly 10 s.
However, after I start a parallel pool and switch the loop to parfor:
ProcessPool with properties:
Connected: true
NumWorkers: 48
Cluster: local
AttachedFiles: {}
AutoAddClientPath: true
IdleTimeout: 30 minutes (30 minutes remaining)
SpmdEnabled: true
Then it seems to run forever. The maximum number of workers is 48. I've also tried 2, 5, and 10. All of these are slower than non-parallel computing. Is that because MATLAB copies II, p1, ww2 48 times, and that causes the problem? Also, myfunction involves a lot of vectorization; I have already optimized it. Will that lead to slow performance of parfor? Is there a way to utilize (some of) the 48 workers to speed up the code? Any comments are highly appreciated. I need to run millions of cases, so I really hope that I can utilize the 48 workers in some way.
It seems that you have large data, and a lot of cores. It is likely that you simply run out of memory, which is why things get so slow.
I would suggest that you set up your workers to be threads, not separate processes.
You can do this with parpool('threads'). Your code must conform to some limitations; not all code can be run this way, see here.
In thread-based parallelism, you have shared memory (arrays are not copied). In process-based parallelism, you have 48 copies of MATLAB running on your computer at the same time, each needing its own copy of your data. The latter system was originally designed to work on a compute cluster, and was later retrofitted to work on a single machine with two or four cores. I don't think it was ever meant for 48 cores.
If you cannot use threads with your code, configure your parallel pool to have fewer workers, for example parpool('local',8).
For more information, see this documentation page.
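For the code in the question above, a minimal sketch of this change could look as follows (it reuses II, p1, ww2 and myfunction from the question; the delete/parpool lines are only there to make the pool type explicit):

% Thread-based workers share memory with the client, so the large
% arrays II, p1, ww2 are not copied to every worker.
delete(gcp('nocreate'));   % close any existing process-based pool
parpool('threads');
G = 0:0.1:10;
Y = zeros(length(G),1);
parfor i = 1:length(G)
    Y(i) = myfunction(II, p1, ww2, G(i));
end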
I'm using the TreeBagger class provided by MATLAB (R2014a and b), in conjunction with the Parallel Computing Toolbox. I have a local cluster running, with 30 workers, on a Windows 7 machine with 40 cores.
I call the TreeBagger constructor to generate a regression forest (an ensemble containing 32 trees), passing an options structure with 'UseParallel' set to 'always'.
However, TreeBagger seems to only make use of 8 or so workers, out of the 30 available (judging by CPU usage per process, observed using the Task Manager). When I try to test the pool with a simple parfor loop:
parfor i=1:30
a = fft(rand(20000));
end
Then all 30 workers are engaged.
My question is:
(How) can I force TreeBagger to use all available resources?
Based on the documentation for the TreeBagger class, it would appear that the operations required are quite memory intensive. Without knowing more about the internal scheduling system used by MATLAB, it seems likely that distributing the workload across fewer workers, with more memory for each worker, is what the scheduler believes will be the most efficient way to solve the problem.
The number of workers used/available may also depend on the number of physical cores on the system (which is different from the number of hyperthreaded logical cores), as well as the resources MATLAB is allowed to consume.
Splitting memory-intensive tasks across fewer than the maximum number of workers is a common technique in HPC for some types of problems.
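For reference, a sketch of pinning the pool size explicitly before training (X and Y are placeholders for your predictor and response data; whether TreeBagger then saturates all workers is still up to its internal scheduling):

% Size the pool explicitly, then hand it to TreeBagger.
delete(gcp('nocreate'));
parpool('local', 30);                % request all 30 workers
opts = statset('UseParallel', true); % older releases used 'always'
forest = TreeBagger(32, X, Y, 'Method', 'regression', 'Options', opts);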
The code I'm dealing with has loops like the following:
bistar = zeros(numdims,numcases);
parfor hh=1:nt
    bistar = bistar + A(:,:,hh)*data(:,:,hh+1)';
end
where nt is small (around 10).
After timing it, it is actually 100 times slower than using the regular loop! I know that parfor can do parallel sums, so I'm not sure why this isn't working.
I run
matlabpool
with the out-of-the-box configurations before running my code.
I'm relatively new to MATLAB, and just started to use the parallel features, so please don't assume that I'm not doing something stupid.
Thanks!
PS: I'm running the code on a quad-core, so I would expect to see some improvement.
Partitioning the work and regrouping the results (the overhead of dividing the work across several threads/cores and gathering their results) is high relative to the computation for small values of nt. This is normal; you would not partition data for easy tasks that can be performed quickly in a simple loop.
Always perform something challenging inside the loop that is worth the partitioning overhead. Here is a nice introduction to parallel programming.
The threads come from a thread pool, so the overhead of creating them should not be an issue. But to produce the partial results, n matrices of the same size as bistar must be allocated, all the partial results computed, and then all of them added together (recombined). In a straight loop, this is most likely done in place, and no extra allocations take place.
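Conceptually (this is a sketch of the idea, not MATLAB's actual implementation), the parfor reduction above behaves like this, with one partial matrix per worker:

P = 4;                  % number of workers
partial = cell(P,1);
for w = 1:P
    partial{w} = zeros(numdims,numcases); % one extra allocation per worker
    for hh = w:P:nt                       % this worker's share of iterations
        partial{w} = partial{w} + A(:,:,hh)*data(:,:,hh+1)';
    end
end
bistar = partial{1};
for w = 2:P
    bistar = bistar + partial{w};         % recombination step
end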
The complete statement in the help (thanks for the link below) is:
If the time to compute f, g, and h is large, parfor will be significantly faster than the corresponding for statement, even if n is relatively small.
So they mean exactly the same thing as I do: for small values of n, the overhead is only worth the effort if what you do inside the loop is complex/time-consuming enough.
parfor comes with a bit of overhead. Thus, if nt is really small, and if the computation in the loop is done very quickly (like an addition), the parfor solution is slower. Furthermore, if you run parfor on a quad-core, the speed gain will be close to linear for 1-3 cores, but smaller if you use all 4 cores, since the last core also needs to run system processes.
For example, if parfor comes with 100 ms of overhead, the computation in the loop takes 5 ms per iteration, and we assume that the speed gain is linear up to 4 cores with a coefficient of 1 (i.e. using 4 cores makes the computation 4 times faster), nt needs to be about 30 for you to achieve a speed gain with parfor (150 ms with for, about 138 ms with parfor). If you were to run only 10 iterations, parfor would be slower (50 ms with for, about 113 ms with parfor). In general, parfor wins once nt*t > overhead + nt*t/4, i.e. here nt > 100/(5*3/4), which is about 27.
You can calculate the overhead on your machine by comparing execution time with 1 worker vs. 0 workers, and you can estimate the speed gain by making a linear fit through the execution times with 1 to 4 workers. Then you'll know when it's useful to use parfor.
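A sketch of such a measurement, with made-up matrix sizes (matlabpool was the command at the time; current releases use parpool):

nt = 10; numdims = 500; numcases = 500;
A = rand(numdims, numcases, nt);
data = rand(numcases, numcases, nt+1);

tic
bistar = zeros(numdims, numcases);
for hh = 1:nt
    bistar = bistar + A(:,:,hh)*data(:,:,hh+1)';
end
tFor = toc;

tParfor = zeros(1,4);
for nWorkers = 1:4
    delete(gcp('nocreate'));      % restart the pool at the new size
    parpool('local', nWorkers);
    tic
    bistar = zeros(numdims, numcases);
    parfor hh = 1:nt
        bistar = bistar + A(:,:,hh)*data(:,:,hh+1)';
    end
    tParfor(nWorkers) = toc;
end
fprintf('for: %.3f s, parfor: %s s\n', tFor, mat2str(tParfor,3));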
Besides the bad performance because of the communication overhead (see the other answers), there is another reason not to use parfor in this case: everything done within the parfor here uses built-in multithreading. Assuming all workers are running on the same PC, there is no advantage, because a single call already uses all the cores of your processor.
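A quick way to see this effect (maxNumCompThreads is deprecated, as noted in the first question, but still works for this kind of experiment):

% The matrix product in the loop body is already multithreaded,
% so a single call can keep all cores busy by itself.
A = rand(2000); B = rand(2000);
nOld = maxNumCompThreads(1);   % temporarily force a single thread
tSingle = timeit(@() A*B);
maxNumCompThreads(nOld);       % restore the previous setting
tMulti = timeit(@() A*B);
fprintf('1 thread: %.3f s, all threads: %.3f s\n', tSingle, tMulti);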
I have a script which solves a system of differential equations for many parameters in a for loop. (The iterations are completely independent, but at the end of each iteration a large matrix (mat) is modified according to the results of the computation.) Here is the code (B is a matrix containing parameters):
mat=zeros(20000,1);
for n=1:20000
    prop=B(n,:); % B is a (20000 * 2) matrix that contains U and V parameters
    U=prop(1);
    V=prop(2);
    options=odeset('RelTol',1e-6,'AbsTol',1e-20);
    [T,X]=ode45(@acceleration,tspan,x0,options);
    rad=X(:,1);
    if max(rad)<radius % radius is a constant
        mat(n)=1;
    end
end

function xprime=acceleration(T,X)
.
.
.
end
First I tried to use parfor, but because the acceleration function (the ode45 input) was defined as an inline function (to achieve better performance), I couldn't do that.
Can I open 4 MATLAB sessions (my CPU has 4 cores) and run the code separately in each session, instead of modifying the code to implement acceleration as a separate function and therefore using parfor? Does it give 4x the performance of running in one session? (Or does it give the same performance as the parallelized code? In parallel code I can't define inline functions.)
(on Windows)
If you're prepared to do the work of separating out the problem to run separately in 4 sessions, then reassembling the results, sure, you can do that. In my experience (on Windows) it actually runs faster to run code in four separate sessions than to have a parfor loop with 4 workers. Not quite 4x the performance of a single session, because the operating system will have other work to do... so, for example, if you have no other processor-heavy applications running, the OS itself might take up 25% of one core, leaving you with maybe 3.75x the performance of a single session. However, this assumes you have enough memory for that not to be the limiting factor.
If you wanted to do this regularly you might need to create some file-based signalling/data passing system.
This is obviously not as elegant as parfor, but is workable for your situation, or if you can't afford the license fee for the parallel toolbox.
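A sketch of the file-based approach, assuming B has been saved to params.mat and using a hypothetical helper runChunk.m:

function runChunk(k)
% runChunk.m (hypothetical): process one quarter of the 20000 cases in
% its own MATLAB session and save the partial result to disk.
load('params.mat', 'B');       % B as in the question
rows = (k-1)*5000 + (1:5000);  % this session's share of the cases
mat = zeros(numel(rows), 1);
for j = 1:numel(rows)
    n = rows(j);
    % ... solve the ODE for B(n,:) exactly as in the question,
    %     setting mat(j) = 1 when max(rad) < radius ...
end
save(sprintf('chunk%d.mat', k), 'mat');
end

Each session is then started from its own Windows command prompt, e.g. matlab -nodesktop -r "runChunk(1); exit", and the four chunk*.mat files are loaded and concatenated once they have all appeared.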
I was wondering: when we run spmd blocks and create individual lab workers, how much memory is allocated to each of them?
I have an 8-core machine and I used 8 lab workers.
Thanks.
When you launch workers using the matlabpool command in Parallel Computing Toolbox, each worker process starts out the same: they're essentially ordinary MATLAB processes, but with no desktop visible. They consume memory as and when you create arrays on them. For example, in the following case, each worker uses the same amount of memory to store x:
spmd
x = zeros(1000);
end
But in the following case, each worker consumes a different amount of memory to store its copy of x:
spmd
x = zeros(100 * labindex);
end
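To see the difference from inside the pool, a small sketch (whos reports the bytes each worker uses for its copy of x):

spmd
    x = zeros(100 * labindex);
    info = whos('x');
    fprintf('worker %d: x occupies %d bytes\n', labindex, info.bytes);
end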