Fast merge of very many small matlab matrices - matlab

I need to merge around 50 million small MATLAB matrices. A script like the one below never seems to finish. Is there a faster way? I'd be willing to try a non-MATLAB route if that were faster.
main_data_a = zeros(10000000, 3);
main_data_b = zeros(10000000, 3);
main_data_c = ones(10000000, 1);
for i=1:1:10000000
try
to_load=sprintf('data/output%d.mat',i);
load(to_load);
catch
end;
if sum(a) ~= 0
main_data_a(i,:) = a;
main_data_b(i,:) = b;
main_data_c(i,:) = c;
end;
end;

Here is a complete example showing "parfor" usage.
disp('Using for-loop')
tmpData = zeros(100000,1);
tic
for i = 1:length(tmpData)
tmpData(i) = max(max(eig(i*ones(100,100)))); % Some operation
end
toc
disp('Using parfor-loop')
tic
parfor (i = 1:length(tmpData))
tmpData(i) = max(max(eig(i*ones(100,100)))); % Some operation
end
toc
The above code resulted in the following timings on my machine.
Using for-loop
Elapsed time is 32.792182 seconds.
Using parfor-loop
Elapsed time is 7.673821 seconds.
However, if only simple calculations are carried out inside the loop body, the plain for-loop runs faster than parfor, because the parallel overhead outweighs the work done per iteration. For example, if you replace tmpData(i) = max(max(eig(i*ones(100,100)))); with tmpData(i) = i; you will see that the for-loop performs better.
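Applied to the question itself, the same idea might look like the sketch below. It is not a measured solution: it assumes each data/output<i>.mat file holds variables a (1-by-3), b (1-by-3) and c (scalar), and n and dataDir are placeholder names. Calling load with an output argument returns the variables in a struct, which parfor needs because loading straight into the workspace is not allowed inside the loop body. Since the job is mostly file I/O, the speed-up may be much smaller than in the eig example.

```matlab
% Sketch only: file layout and variable names are assumptions taken from the question.
dataDir = 'data';               % hypothetical folder holding output1.mat, output2.mat, ...
n = 1000;                       % placeholder; the question has millions of files

main_data_a = zeros(n, 3);
main_data_b = zeros(n, 3);
main_data_c = ones(n, 1);

parfor i = 1:n
    fname = fullfile(dataDir, sprintf('output%d.mat', i));
    if exist(fname, 'file') == 2
        % Struct output: nothing is injected into the workspace, so parfor accepts it
        s = load(fname, 'a', 'b', 'c');
        if sum(s.a) ~= 0
            main_data_a(i, :) = s.a;    % sliced output variables, one row per iteration
            main_data_b(i, :) = s.b;
            main_data_c(i, :) = s.c;
        end
    end
end
```

Skipping missing files with an explicit existence check also avoids the empty try/catch of the original script, which would silently reuse a, b and c from the previous iteration when a load fails.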

Related

Create array with function and size

Recently I wrote this statement
v = arrayfun(@(x) sum(randn(1, 4).^2), zeros(1, 1000000));
to create a new vector and now I'm asking if there exists a function in Matlab to avoid the creation of the unnecessary second vector zeros(1, 1000000). I'm looking for something like
v = FUN(@someInitFunction, [rows, cols]);
without loops, recursion and unnecessary allocation where someInitFunction is given and can't be changed. Does Matlab provide such a function FUN? A simple "No, it doesn't exist" would be a valid answer for me.
To summarize the function FUN: I want to create a new array by calling a function someInitFunction for each element of this new array. The array should be equivalent to
[
someInitFunction() someInitFunction() ...;
someInitFunction() someInitFunction() ...;
.
.
.
someInitFunction() someInitFunction() ...
]
As far as I know there is no builtin function for that. It's relatively easy to create your own however.
You asked for a solution without a loop, but your current solution (arrayfun) uses a loop under the hood, and coding the same thing in a properly organised loop is generally faster than arrayfun.
For your case, the function GenArrayFun.m :
function out = GenArrayFun(initFunction , arraySize)
out = zeros(arraySize) ;
for k=1:numel(out)
out(k) = initFunction() ;
end
It has a loop, but no more than arrayfun, and it seems to run about twice as fast (at least on my installation, R2016a, Win10):
initFunction = @() sum(randn(1, 4).^2) ;
tic
v = arrayfun(@(x) sum(randn(1, 4).^2), zeros(1, 1000000));
toc
tic
out = GenArrayFun( initFunction , [1,1000000] );
toc
Sorry, I did not take the time to build a proper timeit benchmark for such a small example, but I think the results are significant enough to show a difference:
Elapsed time is 6.815043 seconds. % arrayfun
Elapsed time is 3.060161 seconds. % GenArrayFun
And just to make sure it evaluates initFunction for every element:
>> out = GenArrayFun( initFunction , [2,3] )
out =
6.25676106665387 6.52758807745462 2.99236122767462
0.386750258201569 0.566092999842791 2.21158011908878

how to assign one value to a list of Objects in an efficient way in Matlab?

I want to assign one value to a list of objects in Matlab without using a for-loop (In order to increase efficiency)
Basically this works:
for i=1:Nr_of_Objects
Objectlist(i,1).weight=0.2
end
But I would like something like this:
Objectlist(:,1).weight=0.2
Which is not working. I get this error:
Expected one output from a curly brace or dot indexing expression, but there were 5 results.
Writing an array to the right hand side is also not working.
I'm not very familiar with object-oriented programming in MATLAB, so I would be happy if someone could help me.
You're looking for the deal function:
S(1,1).a = 1
S(2,1).a = 2
S(1,2).a = 3
[S(:,1).a] = deal(4)
Now S(1,1).a and S(2,1).a are both equal to 4.
In MATLAB, square brackets on the left-hand side collect several outputs, and deal(X) copies its single input to all the requested outputs.
So in your case:
[Objectlist(:,1).weight] = deal(0.2)
Should work.
Note that I'm not sure it will be faster than the for loop, since I don't know how the deal function is implemented.
EDIT: Benchmark
n = 1000000;
[S(1:n,1).a] = deal(1);
tic
for ii=1:n
S(ii,1).a = 2;
end
toc
% Elapsed time is 3.481088 seconds
tic
[S(1:n,1).a] = deal(2);
toc
% Elapsed time is 0.472028 seconds
Or with timeit
n = 1000000;
[S(1:n,1).a] = deal(1);
g = @() func1(S,n);
h = @() func2(S,n);
timeit(g)
% ans = 3.67
timeit(h)
% ans = 0.41
function func1(S,n)
for ii=1:n
S(ii,1).a = 2;
end
end
function func2(S,n)
[S(1:n,1).a] = deal(2);
end
So it seems that using the deal function reduces the computation time.
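The question also mentions that putting an array on the right-hand side does not work. A minimal sketch of one way to assign a different value to each element, assuming a plain struct array (S and w are illustrative names, not from the question): convert the values to a cell array with num2cell and expand it as a comma-separated list.

```matlab
% Sketch: distribute distinct values over a struct array without a loop.
S = struct('a', cell(3, 1));    % 3-by-1 struct array, field a initially empty
w = [0.1; 0.2; 0.3];            % one value per element (illustrative data)

c = num2cell(w);                % 3-by-1 cell array, one value per cell
[S.a] = c{:};                   % c{:} expands to three separate right-hand sides

disp([S.a])                     % the three values gathered back into a row vector
```

The same pattern should carry over to object arrays whose property accepts this kind of assignment, but that has not been checked against the Objectlist class from the question.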

Avoiding race conditions when using parfor in MATLAB

I'm looping in parallel and changing a variable if a condition is met. Super idiomatic code that I'm sure everyone has written a hundred times:
trials = 100;
greatest_so_far = 0;
best_result = 0;
for trial_i = 1:trials
[amount, result] = do_work();
if amount > greatest_so_far
greatest_so_far = amount;
best_result = result;
end
end
If I wanted to replace for by parfor, how can I ensure that there aren't race conditions when checking whether we should replace greatest_so_far? Is there a way to lock this variable outside of the check? Perhaps like:
trials = 100;
greatest_so_far = 0;
best_result = 0;
parfor trial_i = 1:trials
[amount, result] = do_work();
somehow_lock(greatest_so_far);
if amount > greatest_so_far
greatest_so_far = amount;
best_result = result;
end
somehow_unlock(greatest_so_far);
end
Skewed answer. It does not exactly solve your problem, but it might help you avoid it.
If you can afford the memory to store the outputs of your do_work() in some vectors, then you could simply run your parfor on this function only, store the result, then do your scoring at the end (outside of the loop):
amount = zeros( trials , 1 ) ;
result = zeros( trials , 1 ) ;
parfor trial_i = 1:trials
[amount(trial_i), result(trial_i)] = do_work();
end
[ greatest_of_all , greatest_index ] = max(amount) ;
best_result = result(greatest_index) ;
Edit/comment : (wanted to put that in comment of your question but it was too long, sorry).
I am familiar with .NET and completely understand your lock/unlock request. I have made many attempts myself to implement some kind of progress indicator for a very long parfor loop ... to no avail.
If I understand MATLAB's classification of variables correctly, the mere fact that you assign greatest_so_far (in greatest_so_far=amount) makes MATLAB treat it as a temporary variable, which is cleared and reinitialized at the beginning of every loop iteration (hence unusable for your purpose).
So an easy locked variable may not be a concept we can implement simply at the moment. Some convoluted class event or file writing/checking may do the trick, but I am afraid the timing would suffer greatly. If each iteration takes a long time to execute, the overhead might be worth it, but if you use parfor to accelerate a large number of short iterations, then the convoluted solutions would slow you down more than they help ...
You can have a look at this stack exchange question, you may find something of interest for your case: Semaphores and locks in MATLAB
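Another option that avoids locking altogether is a parfor reduction variable combined through a user-defined function. The following is only a sketch under assumptions: do_work is taken to return a scalar amount and a scalar result, and doWorkStandIn and keepLarger are hypothetical helper names introduced for illustration. The reduction statement has the form X = f(X, expr), which parfor accepts as long as f is associative.

```matlab
function best = parforBestSketch(trials)
% Sketch: keep the [amount, result] pair with the largest amount using a
% parfor reduction instead of a lock. Save as parforBestSketch.m.
best = [-Inf, NaN];                               % [amount, result]
parfor trial_i = 1:trials
    [amount, result] = doWorkStandIn(trial_i);
    best = keepLarger(best, [amount, result]);    % reduction: X = f(X, expr)
end
end

function out = keepLarger(x, y)
% Return whichever pair has the larger first element (the amount).
if y(1) > x(1)
    out = y;
else
    out = x;
end
end

function [amount, result] = doWorkStandIn(k)
% Hypothetical stand-in for do_work(); deterministic so the sketch is checkable.
amount = mod(7 * k, 11);
result = k;
end
```

If two trials tie on amount, which of them is kept may depend on the order in which MATLAB combines the partial results, so the approach fits best when ties do not matter.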
The solution from Hoki is the right way to solve the problem as stated. However, as you asked about race conditions and preventing them when loop iterations depend on each other you might want to investigate spmd and the various lab* functions.
You need to use SPMD to do this - SPMD allows communication between the workers. Something like this:
bestResult = -Inf;
bestIndex = NaN;
N = 97;
spmd
% we need to round up the loop range to ensure that each
% worker executes the same number of iterations
loopRange = numlabs * ceil(N / numlabs);
for idx = 1:numlabs:loopRange
if idx <= N
local_result = rand(); % obviously replace this with your actual function
else
local_result = -Inf;
end
% Work out which index has the best result - use a really simple approach
% by concatenating all the results this time from each worker
% allResultsThisTime will be 2-by-numlabs where the first row is all the
% the results this time, and the second row is all the values of idx from this time
allResultsThisTime = gcat([local_result; idx]);
% The best result this time - consider the first row
[bestResultThisTime, labOfBestResult] = max(allResultsThisTime(1, :));
if bestResultThisTime > bestResult
bestResult = bestResultThisTime;
bestIndex = allResultsThisTime(2, labOfBestResult);
end
end
end
disp(bestResult{1})
disp(bestIndex{1})

MATLAB: Nested for-loop takes longer every successive iteration

/edit: The loop doesn't become slower. I didn't take the time correctly. See Rasman's answer.
I'm looping over 3 parameters for a somewhat long and complicated function and I noticed two things that I don't understand:
The execution gets slower with each successive iteration, although the function only returns one struct (of which I only need one field) that I overwrite with each iteration.
The profiler shows that the end statement for the innermost for takes a quite long time.
Consider the following example (I'm aware that this can easily be vectorized, but as far as I understand the function I call can't):
function stuff = doSomething( x, y, z )
stuff.one = x+y+z;
stuff.two = x-y-z;
end
and how I execute the function
n = 50;
i = 0;
currenttoc = 0;
output = zeros(n^3,4);
tic
for x = 1:n
for y = 1:n
for z = 1:n
i = i + 1;
output(i,1) = x;
output(i,2) = y;
output(i,3) = z;
stuff = doSomething(x,y,z);
output(i,4) = stuff.one;
if mod(i,1e4) == 0 % only for demonstration, not in final script
currenttoc = toc - currenttoc;
fprintf(1,'time for last 10000 iterations: %f \n',currenttoc)
end
end
end
end
How can I speed this up? Why does every iteration take longer than the one before? I'm pretty sure this is horrible programming, sorry for that.
When I replace the call to doSomething with output(i,4)=toc;, and I plot diff(output(:,4)), I see that it's the call to fprintf that takes longer and longer every time, apparently.
Removing the if-clause returns to every iteration taking about the same amount of time.
So, the problem gets largely eliminated when I replace the if statement with:
if mod(i,1e4) == 0 % only for demonstration, not in final script
fprintf(1,'time for last 10000 iterations: %f \n',toc); tic;
end
I think the operation on toc may be causing the problem
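To spell out why the original measurement misleads: toc returns the time since the single tic before the loops, so currenttoc = toc - currenttoc subtracts the previous difference rather than the previous timestamp. A minimal sketch of per-block timing with an explicit timer handle follows; blockTimer and blockTimes are illustrative names, and the loop body is a trivial stand-in for doSomething.

```matlab
% Sketch: time each block of 1000 iterations separately with a tic handle.
n = 20;                          % 20^3 = 8000 iterations, i.e. 8 blocks
i = 0;
blockTimes = [];                 % elapsed seconds per block
blockTimer = tic;                % handle for the current block only
for x = 1:n
    for y = 1:n
        for z = 1:n
            i = i + 1;
            s = x + y + z;       % stand-in for the real work
            if mod(i, 1000) == 0
                blockTimes(end+1) = toc(blockTimer); %#ok<SAGROW>
                blockTimer = tic;                    % restart for the next block
            end
        end
    end
end
fprintf('time for 1000 iterations: %f s\n', blockTimes);
```

Using toc(blockTimer) instead of a bare toc also keeps the measurement independent of any other tic calls made elsewhere in the code.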
It's MUCH faster if doSomething returns multiple output variables rather than a struct
function [out1,out2] = doSomething( x, y, z )
out1 = x+y+z;
out2 = x-y-z;
end
The fact that it gets slower on each subsequent iteration is strange and I have no explanation for it, but hopefully that gives you some speed-up at least.

How do I know how many iterations are left in a parfor loop in Matlab?

I am running a parfor loop in Matlab that takes a lot of time and I would like to know how many iterations are left. How can I get that info?
I don't believe you can get that information directly from MATLAB, short of printing something with each iteration and counting these lines by hand.
To see why, recall that each parfor iteration executes in its own workspace: while incrementing a counter within the loop is legal, accessing its "current" value is not (because this value does not really exist until completion of the loop). Furthermore, the parfor construct does not guarantee any particular execution order, so printing the iterator value isn't helpful.
cnt = 0;
parfor i=1:n
cnt = cnt + 1; % legal
disp(cnt); % illegal
disp(i); % legal ofc. but out of order
end
Maybe someone does have a clever workaround, but I think that the independent nature of the parfor iterations precludes taking a reliable count. The restrictions mentioned above, plus those on using evalin, etc. support this conclusion.
As @Jonas suggested, you could obtain the iteration count via side effects occurring outside of MATLAB, e.g. creating empty files in a certain directory and counting them. This you can do in MATLAB of course:
fid = fopen(['countingDir/f' num2str(i)],'w');
fclose(fid);
numel(dir('countingDir/f*')) % count only the marker files, not the '.' and '..' entries
Try this FEX file: http://www.mathworks.com/matlabcentral/fileexchange/32101-progress-monitor--progress-bar--that-works-with-parfor
You can easily modify it to return the iteration number instead of displaying a progress bar.
Something like a progress bar could be done similar to this...
Before the parfor loop :
fprintf('Progress:\n');
fprintf(['\n' repmat('.',1,m) '\n\n']);
And during the loop:
fprintf('\b|\n');
Here m is the total number of iterations: the row of . characters shows the total, and each | marks one completed iteration. The \n makes sure the characters are printed in the parfor loop.
With MATLAB R2017a or later you can use a data queue or a pollable data queue to achieve this. Here's the MathWorks documentation example of how to do a progress bar:
function a = parforWaitbar
D = parallel.pool.DataQueue;
h = waitbar(0, 'Please wait ...');
afterEach(D, @nUpdateWaitbar);
N = 200;
p = 1;
parfor i = 1:N
a(i) = max(abs(eig(rand(400))));
send(D, i);
end
function nUpdateWaitbar(~)
waitbar(p/N, h);
p = p + 1;
end
end
If you just want to know approximately how much time is left, you can run the program once, record the maximum time (tMax, in seconds), and then do this:
tStart = tic;
parfor i=1:n
tElapsed = toc(tStart);
disp(['Time left in min ~ ', num2str( ( tMax - tElapsed ) / 60 ) ]);
...
end
I created a utility to do this:
http://www.mathworks.com/matlabcentral/fileexchange/48705-drdan14-parforprogress