MATLAB parallel reduction using parfeval

MATLAB parallel reduction using parfeval - matlab

I want to compute the sum of many large matrices in parallel. However, using parfeval, the matrices are getting buffered which yields a very high memory usage.
This is my current approach:
N = 50;
p = gcp('nocreate');
if isempty(p)
p = parpool('local', 4);
end
A_total = zeros(500, 500, 500);
for i = 1:N
% do things
pause(rand() * 0.05);
f(i) = parfeval(p, #get_matrix, 1);
end
for i = 1:N
[completed_index, A] = fetchNext(f);
f(completed_index) = []; % drop future to reduce memory
A_total = A_total + A;
end
function A = get_matrix()
A = rand(500, 500, 500);
end
Even though I limit the number of parallel executions, the memory usage is still very high as the matrices are stored temporarily until they are fetched using fetchNext(). Is it possible to use afterEach(f) to add the generated matrix to A_total immediately after the execution was completed?
Another approach would be to check the number of unread futures in the first for loop and if this number exceeds a threshold, the data must be fetched first to limit the memory usage.

Related

On histogram equalization

So I'd like to do it without histeq, but my code seems to get out a rather peculiar, really whited out image, and doesn't seem all too much improved from the original picture. Is there a better way to apply the proper histogram?
Cumlative=zeros(256,1);
CumHisty=uint8(zeros(ROWS,COLS));
% First we need to find the probabilities and the frequencies
freq = zeros(256,1);
probab = zeros(256,1);
for i=1:ROWS
for j=1:COLS
value=I1(i,j);
freq(value+1)=freq(value+1)+1;
probab(value+1)=freq(value+1)/(ROWS*COLS);
end
end
count=0;
cumprobab=zeros(256,1);
distrib=zeros(256,1);
for i=1:size(probab)
count=count+freq(i);
Cumlative(i)=count;
cumprobab(i)=Cumlative(i)/(ROWS*COLS);
distrib(i)=round(cumprobab(i)*(ROWS*COLS));
end
for i=1:ROWS
for j=1:COLS
CumHisty(i,j)=distrib(I1(i,j)+1);
end

You probably want to do:
distrib(i) = round(cumprobab(i)*255);
EDIT:
Here is a version of your code without the redundant computations, and simplified looping:
freq = zeros(256,1);
for i = 1:numel(I1)
index = I1(i) + 1;
freq(index) = freq(index)+1;
end
count = 0;
distrib = zeros(256,1);
for i = 1:length(freq)
count = count + freq(i);
cumprobab = count/numel(I1);
distrib(i) = round(cumprobab*255);
end
CumHisty = zeros(size(I1),'uint8');
for i = 1:numel(I1)
CumHisty(i) = distrib(I1(i)+1);
end
I use linear indexing above, it's simpler (one loop instead of 2) and automatically helps you access the pixels in the same order that they are stored in. The way you looped (over rows in the outer loop, and columns in the inner loop) means that you are not accessing pixels in the optimal order, since arrays are stored column-wise (column-major order). Accessing data in the order in which it is stored in memory allows for an optimal cache usage (i.e. is faster).
The above can also be written as:
freq = histcounts(I1,0:256);
distrib = round(cumsum(freq)*(255/numel(I1)));
distrib = uint8(distrib);
CumHisty = distrib(I1+1);
This is faster than the loop code, but within the same order of magnitude. Recent versions of MATLAB are no longer terribly slow doing loops.
I clocked your code at 40 ms, with simplified loops at 19.5 ms, and without loops at 5.8 ms, using an image of size 1280x1024.

Why does `minmax` take longer than a consecutive `min` and `max`?

Well basically the question says it all, my intuition tells me that a call to minmax should take less time than calling a min and then a max.
Is there some optimization I prevent Matlab carrying out in the following code?
minmax:
function minmax_vals = minmaxtest()
buffSize = 1000000;
A = rand(128,buffSize);
windowSize = 100;
minmax_vals = zeros(128,buffSize/windowSize*2);
for i=1:(buffSize/windowSize)
minmax_vals(:,(2*i-1):(2*i)) = minmax(A(:,((i-1)*windowSize+1):(i*windowSize)));
end
end
separate min-max:
function minmax_vals = minmaxtest()
buffSize = 1000000;
A = rand(128,buffSize);
windowSize = 100;
minmax_vals = zeros(128,buffSize/windowSize*2);
for i=1:(buffSize/windowSize)
minmax_vals(:,(2*i-1)) = min(A(:,((i-1)*windowSize+1):(i*windowSize)),[],2);
minmax_vals(:,(2*i)) = max(A(:,((i-1)*windowSize+1):(i*windowSize)),[],2);
end
end

Summary
You can see the overhead because minmax isn't completely obfuscated. Simply type
edit minmax
And you will see the function!
It appears that there is a data-type conversion to nntype.data('format',x,'Data');, which will not be the case for min or max and could be costly. This is for use with MATLAB's neural networking (nn) tools as minmax belongs to that toolbox.
In short, min and max are lower-level, compiled functions (hence they are fully obfuscated), which don't require functionality from the nn toolbox.
Benchmark
Here is a slightly more isolated benchmark, without your windowing and using timeit instead of the profiler. I've also included timings for just the data conversion used in minmax! The test gets the min and max of each row in a large matrix, see here the output plot and code below...
It appears that there is a linear relationship between number of rows and time taken (as expected for a linear operator), but the coefficient is much greater for the combined minmax relationship, with the separate operations being approximately 10x quicker. Also you can clearly see that data conversion takes more time that the min then max version alone!
function benchie()
K = zeros(10, 3);
for k = 1:10
n = 2^k;
A = rand(n, 200);
Arow = zeros(1,200);
m = zeros(n,2);
f1 = #()minmaxtest(A,m);
K(k,1) = timeit(f1);
f2 = #()minthenmaxtest(A,m);
K(k,2) = timeit(f2);
f3 = #()dataconversiontest(A, Arow);
K(k,3) = timeit(f3);
end
figure; hold on; plot(2.^(1:10), K(:,1)); plot(2.^(1:10), K(:,2)); plot(2.^(1:10), K(:,3));
end
function minmaxtest(A,m)
for ii = 1:size(A,1)
m(ii, 1:2) = minmax(A(ii,:));
end
end
function dataconversiontest(A, Arow)
for ii = 1:size(A,1)
Arow = nntype.data('format', A(ii,:), 'Data');;
end
end
function minthenmaxtest(A,m)
for ii = 1:size(A,1)
m(ii, 1) = min(A(ii,:));
m(ii, 2) = max(A(ii,:));
end
end

Performance/Low level meaning of " a(idx) = [] " vs "a = a(~idx)"

As the title says, I wish to know what does Matlab do differently between the two options. For the sake of argument, let's say that matrix a and idx are sufficiently large to be dealing with memory issues, and define:
Case A: a(idx) = []
Case B: a = a(~idx)
My intuition says that in Case A performs a value reassignment, which then the CPU needs to deal with indexed copies from original positions to the new ordered ones, while keeping track what is the current "head" of the same matrix, and later trimming the excess memory.
On the other hand, Case B would perform an indexed bulk copy to a newly allocation memory space.
So probably Case A is slower but less memory demanding than Case B. Am I assuming right? I don't know, immediately after writing this I feel like Case B needs to perform Case A first... Any ideas?
Thanks in advance

It's interesting, so I decided to take a measure:
I am using Windows (64 bits) version of Matlab R2016a.
CPU: Core i5-3550 at 3.3GHz.
Memory: 8GB DDR3 1333 (Dual channel).
len = 100000000; %Number of elements in array (make it large enouth to be outsize of cache memory).
idx = zeros(len, 1, 'logical'); %Fill idx with ones.
idx(1:10:end) = 1; %Put 1 in every 10'th element of idx.
a = ones(len, 1); %Fill arrary a with ones.
disp('Measure: a(idx) = [];')
tic
a(idx) = [];
toc
a = ones(len, 1);
disp(' ');disp('Measure: a = a(~idx);')
tic
a = a(~idx);
toc
disp(' ');disp('Measure: not_idx = ~idx;')
tic
not_idx = ~idx;
toc
a = ones(len, 1);
disp(' ');disp('Measure: a = a(not_idx);')
tic
a = a(not_idx);
toc
Result:
Measure: a(idx) = [];
Elapsed time is 1.647617 seconds.
Measure: a = a(~idx);
Elapsed time is 0.732233 seconds.
Measure: not_idx = ~idx;
Elapsed time is 0.032649 seconds.
Measure: a = a(not_idx);
Elapsed time is 0.686351 seconds.
Conclusions:
a = a(~idx) is about twice faster than a(idx) = [].
Total time of a = a(~idx) equals sum of not_idx = ~idx plus a = a(not_idx)
Matlab is probably calculating ~idx separately, so it consumes more memory.
Memory consumption meters only when physical RAM is in full consumption.
I think it's negligible (~idx memory consumption is temporary).
Both solutions are not optimized.
I estimate, fully optimized implementation (in C) to be 10 times faster.

How to implement parallel-for in a 4 level nested for loop block

I have to calculate the std and mean of a large data set with respect to quite a few models. The final loop block is nested to four levels.
This is what it looks like:
count = 1;
alpha = 0.5;
%%%Below if each individual block is to be posterior'd and then average taken
c = 1;
for i = 1:numel(writers) %no. of writers
for j = 1: numel(test_feats{i}) %no. of images
for k = 1: numel(gmm) %no. of models
for n = 1: size(test_feats{i}{j},1)
[~, scores(c)] = posterior(gmm{k}, test_feats{i}{j}(n,:));
c = c + 1;
end
c = 1;
index_kek=find(abs(scores-mean(scores))>alpha*std(scores));
avg = mean(scores(index_kek)); %using std instead of mean... beacause of ..reasons
NLL(count) = avg;
count = count + 1;
end
count = 1; %reset count
NLL_scores{i}(j,:) = NLL;
end
fprintf('***score for model_%d done***\n', i)
end
It works and gives the desired result but it takes 3 days to give me the final calculation, even on my i7 processor. During processing the task manager tells me that only 20% of the cpu is being used, so I would rather put more load on the cpu to get the result faster.
Going by the official help here if I suppose want to make the outer most loop a parfor while keeping the rest normal for all I have to do is to insert integer limits rather than function calls such as size or numel.
So making these changes the above code will become:
count = 1;
alpha = 0.5;
%%%Below if each individual block is to be posterior'd and then average taken
c = 1;
num_writers = numel(writers);
num_images = numel(test_feats{1});
num_models = numel(gmm);
num_feats = size(test_feats{1}{1},1);
parfor i = 1:num_writers %no. of writers
for j = 1: num_images %no. of images
for k = 1: num_models %no. of models
for n = 1: num_feats
[~, scores(c)] = posterior(gmm{k}, test_feats{i}{j}(n,:));
c = c + 1;
end
c = 1;
index_kek=find(abs(scores-mean(scores))>alpha*std(scores));
avg = mean(scores(index_kek)); %using std instead of mean... beacause of ..reasons
NLL(count) = avg;
count = count + 1;
end
count = 1; %reset count
NLL_scores{i}(j,:) = NLL;
end
fprintf('***score for model_%d done***\n', i)
end
Is this the most optimum way to implement parfor in my case? Can it be improved or optimized further?

I couldn't test in Matlab for now but it should be close to a working solution. It has a reduced number of loops and changes a few implementation details but overall it might perform just as fast (or even slower) as your earlier code.
If gmm and test_feats take lots of memory then it is important that parfor is able to determine which peaces of data need to be delivered to which workers. The IDE should warn you if inefficient memory access is detected. This modification is especially useful if num_writers is much less than the number of cores in your CPU, or if it is only slightly larger (like 5 writers for 4 cores would take about as long as 8 writers).
[i_writer i_image i_model] = ndgrid(1:num_writers, 1:num_images, 1:num_models);
idx_combined = [i_writer(:) i_image(:) i_model(:)];
n_combined = size(idx_combined, 1);
NLL_scores = zeros(n_combined, 1);
parfor i_for = 1:n_combined
i = idx_combined(i_for, 1)
j = idx_combined(i_for, 2)
k = idx_combined(i_for, 3)
% pre-allocate
scores = zeros(num_feats, 1)
for i_feat = 1:num_feats
[~, scores(i_feat)] = posterior(gmm{k}, test_feats{i}{j}(i_feat,:));
end
% "find" is redundant here and performs a bit slower, might be insignificant though
index_kek = abs(scores - mean(scores)) > alpha * std(scores);
NLL_scores(i_for) = mean(scores(index_kek));
end

Writing a matlab program for combinations of large numbers

Because for combinations of large numbers at times matlab replies NaN, the assignment is to write a program to compute combinations of 200 objects taken 90 at a time. Once this works we are to make it into a function y = comb(n,k).
This is what I have so far based on an example we were given of the probability that 2 people in a class have the same birthday.
This is the example:
nMax = 70; %maximum number of people in classroom
nArray = 1:nMax;
prevPnot = 1; %initialize probability
for iN = 1:nMax
Pnot = prevPnot*(365-iN+1)/365; %probability that no birthdays are the same
P(iN) = 1-Pnot; %probability that at least two birthdays are the same
prevPnot = Pnot;
end
plot(nArray, P, '.-')
xlabel('nb. of people')
ylabel('prob. that at least two have same birthday')
grid on
At this point I'm having trouble because I'm more familiar with java. This is what I have so far, and it isn't coming out at all.
k = 90;
n = 200;
nArray = 1:k;
prevPnot = 1;
for counter = 1:k
Pnot = (n-counter+1)/(prevPnot*(n-k-counter+1);
P(iN) = Pnot;
prevPnot = Pnot;
end
The point of the loop I wrote is to separate out each term
i.e. n/k*(n-k), times (n-counter)/(k-counter)*(n-k-counter), and so forth.
I'm also not entirely sure how to save a loop as a function in matlab.

To compute the number of combinations of n objects taken k at a time, you can use gammaln to compute the logarithm of the factorials in order to avoid overflow:
result = exp(gammaln(n+1)-gammaln(k+1)-gammaln(n-k+1));
Another approach is to remove terms that will cancel and then compute the result:
result = prod((n-k+1:n)./(1:k));

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

MATLAB parallel reduction using parfeval - matlab

Related

On histogram equalization

Why does `minmax` take longer than a consecutive `min` and `max`?

Performance/Low level meaning of " a(idx) = [] " vs "a = a(~idx)"

How to implement parallel-for in a 4 level nested for loop block

Writing a matlab program for combinations of large numbers

Categories

Resources