Parallelize or vectorize all-against-all operation on a large number of matrices?

I have approximately 5,000 matrices with the same number of rows and varying numbers of columns (20 x ~200). Each of these matrices must be compared against every other in a dynamic programming algorithm.
In this question, I asked how to perform the comparison quickly and was given an excellent answer involving a 2D convolution. Applying that method serially, like so
list = who('data_matrix_prefix*');
H = cell(numel(list), numel(list));
for i = 1:numel(list)
    for j = 1:numel(list)
        if i ~= j
            eval(['H{i,j} = compare(' char(list(i)) ',' char(list(j)) ');']);
        end
    end
end
is fast for small subsets of the data (e.g. 9 matrices take 9*9 - 9 = 72 calls in ~1 s; 30 matrices take 870 calls in ~2.5 s).
However, operating on all the data requires almost 25 million calls.
I have also tried using deal() to make a cell array composed entirely of the next element in data, so I could use cellfun() in a single loop:
% who(), load() and struct2cell() calls place k data matrices in a 1D cell array called data.
nextData = cell(k,1);
for i = 1:k
    [nextData{:}] = deal(data{i});
    H(:,i) = cellfun(@compare, data, nextData, 'UniformOutput', false);
end
Unfortunately, this is not really any faster, because all the time is spent in compare(). Both of these code examples seem ill-suited for parallelization, and I'm having trouble figuring out how to make my variables sliced.
compare() is totally vectorized; it uses matrix multiplication and conv2() exclusively (I am under the impression that all of these operations, including the cellfun(), should be multithreaded in MATLAB?).
Does anyone see a (explicitly) parallelized solution or better vectorization of the problem?
Note
I realize both my examples are inefficient - the first would be twice as fast if it calculated a triangular cell array, and the second still calculates the self-comparisons as well. But the time savings of a good parallelization would be more like a factor of 16 (or 72 if I install MATLAB on everyone's machines).
Aside
There is also a memory issue. I used a couple of evals to save each column of H to a file, under names like H1, H2, etc., and then clear Hi. Unfortunately, the saves are very slow...
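The evals in that save step can at least be avoided with dynamic struct field names and save's -struct flag. A minimal sketch (this doesn't speed up the disk writes themselves; the H<i>.mat file naming just mirrors the scheme described above):
% Save column i of H under the name "Hi" without eval (sketch).
S = struct();
S.(sprintf('H%d', i)) = H(:, i);              % dynamic field name, e.g. S.H3
save(sprintf('H%d.mat', i), '-struct', 'S');  % writes a variable named Hi
H(:, i) = {[]};                               % release the memory held by that column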

Does
compare(a,b) == compare(b,a)
and
compare(a,a) == 1
hold? If so, change your loop
for i = 1:numel(list)
    for j = 1:numel(list)
        ...
    end
end
to
for i = 1:numel(list)
    for j = i+1:numel(list)
        ...
    end
end
and deal with the symmetry and identity cases separately. This will cut your calculation time in half.
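Concretely, the triangular loop plus that bookkeeping could look like this (a sketch; data is assumed to be the 1D cell array of matrices mentioned in the question, and filling 1 on the diagonal is only valid if compare(a,a) really is constant):
n = numel(data);
H = cell(n, n);
for i = 1:n
    H{i,i} = 1;                          % identity case, if compare(a,a) == 1
    for j = i+1:n
        H{i,j} = compare(data{i}, data{j});
        H{j,i} = H{i,j};                 % symmetry: compare(b,a) == compare(a,b)
    end
end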

The second example can easily be sliced for use with the Parallel Computing Toolbox. This toolbox distributes iterations of your code among up to 8 local workers. If you want to run the code on a cluster, you also need the MATLAB Distributed Computing Server.
% who(), load() and struct2cell() calls place k data matrices in a 1D cell array called data.
parfor i = 1:k-1 % this will run the loop in parallel with the Parallel Computing Toolbox
    % only make the necessary comparisons
    H(i+1:k, i) = cellfun(@compare, data(i+1:k), repmat(data(i), k-i, 1), 'UniformOutput', false);
    % if the above doesn't slice properly, try this instead
    hSlice = cell(k,1);
    hSlice(i+1:k) = cellfun(@compare, data(i+1:k), repmat(data(i), k-i, 1), 'UniformOutput', false);
    H(:, i) = hSlice;
end
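As a usage note, parfor only runs in parallel once a worker pool is open. A minimal end-to-end sketch (matlabpool is the command from releases of that era; on R2013b and later use parpool and delete(gcp)):
matlabpool('open', 4);   % on R2013b+: parpool(4)
k = numel(data);
H = cell(k, k);
parfor i = 1:k-1
    hSlice = cell(k, 1);
    hSlice(i+1:k) = cellfun(@compare, data(i+1:k), repmat(data(i), k-i, 1), ...
        'UniformOutput', false);
    H(:, i) = hSlice;    % sliced output: full range in dim 1, loop index in dim 2
end
matlabpool('close');     % on R2013b+: delete(gcp('nocreate'))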

If I understand correctly, you have to perform 5000^2 matrix comparisons? Rather than trying to parallelise the compare function, perhaps you should think of your problem as being composed of 5000^2 tasks. The MATLAB Parallel Computing Toolbox supports task-based parallelism. Unfortunately, my experience with PCT is with the parallelisation of large linear-algebra problems, so I can't really tell you much more than that. The documentation will undoubtedly help you more.
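For what it's worth, the job/task API of that era looked roughly like the sketch below (findResource, waitForState and getAllOutputArguments are the pre-R2012 names; newer releases use parcluster, wait and fetchOutputs). One task per comparison would mean 25 million tasks, so in practice you would batch many pairs into each task:
sched = findResource('scheduler', 'type', 'local');
job = createJob(sched);
for i = 1:k-1
    for j = i+1:k
        % In a real run, batch whole blocks of pairs into each task;
        % per-pair tasks have far too much scheduling overhead.
        createTask(job, @compare, 1, {data{i}, data{j}});
    end
end
submit(job);
waitForState(job, 'finished');
out = getAllOutputArguments(job);  % one row per task, in creation order
destroy(job);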

Related

Matlab: need some help for a seemingly simple vectorization of an operation

I would like to optimize this piece of Matlab code but so far I have failed. I have tried different combinations of repmat and sums and cumsums, but all my attempts seem not to give the correct result. I would appreciate some expert guidance on this tough problem.
S = 1000; T = 10;
X = rand(T,S);
X = sort(X, 1, 'ascend');
Result = zeros(S,1);
for c = 1:T-1
    for cc = c+1:T
        d = (X(cc,:) - X(c,:)) - (cc-c)/T;
        Result = Result + abs(d');
    end
end
Basically I create 1000 vectors of 10 sorted random numbers, and for each vector I calculate, for each pair of values (say the mth and the nth), the difference between them minus (n-m)/T. I sum over all possible pairs and return the result for every vector.
I hope this explanation is clear,
Thanks a lot in advance.
It is at least easy to vectorize your inner loop:
Result = zeros(S,1);
for c = 1:T-1
    d = (X(c+1:T,:) - X(c,:)) - ((c+1:T)' - c)./T;
    Result = Result + sum(abs(d), 1)';
end
Here, I'm using the new automatic singleton expansion. If you have an older version of MATLAB you'll need to use bsxfun for two of the subtraction operations. For example, X(c+1:T,:)-X(c,:) is the same as bsxfun(@minus,X(c+1:T,:),X(c,:)).
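For reference, the whole loop body spelled out with bsxfun, for pre-R2016b releases (just the mechanical translation of the line above):
Result = zeros(S, 1);
for c = 1:T-1
    d = bsxfun(@minus, bsxfun(@minus, X(c+1:T,:), X(c,:)), ((c+1:T)' - c)./T);
    Result = Result + sum(abs(d), 1)';
end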
What is happening in the vectorized loop is that instead of looping over cc=c+1:T, we take all of those indices at once; I simply replaced cc with c+1:T. d is then a matrix with multiple rows (9 in the first iteration, and one fewer in each subsequent iteration).
Surprisingly, this is slower than the double loop, and similar in speed to Jodag's answer.
Next, we can try to improve indexing. Note that the code above extracts data row-wise from the matrix. MATLAB stores data column-wise. So it's more efficient to extract a column than a row from a matrix. Let's transpose X:
X = X';
Result = zeros(S,1);
for c = 1:T-1
    d = (X(:,c+1:T) - X(:,c)) - ((c+1:T) - c)./T;
    Result = Result + sum(abs(d), 2);
end
This is more than twice as fast as the code that indexes row-wise.
But of course the same trick can be applied to the code in the question, speeding it up by about 50%:
X = X';
Result = zeros(S,1);
for c = 1:T-1
    for cc = c+1:T
        d = (X(:,cc) - X(:,c)) - (cc-c)/T;
        Result = Result + abs(d);
    end
end
My takeaway message from this exercise is that MATLAB's JIT compiler has improved things a lot. Back in the day, any sort of loop would grind code to a halt. Today a loop is not necessarily the worst approach, especially if all you do inside it is call built-in functions.
The nchoosek(v,k) function generates all combinations of the elements in v taken k at a time. We can use it to generate all possible pairs of indices and then vectorize both loops. It appears that in this case the vectorization doesn't actually improve performance (at least on my machine with 2017a). Maybe someone will come up with a more efficient approach.
idx = nchoosek(1:T, 2);
d = bsxfun(@minus, X(idx(:,2),:) - X(idx(:,1),:), (idx(:,2) - idx(:,1))/T);
Result = sum(abs(d), 1)';
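The update below reports timings but not the harness itself; a minimal sketch of how such a comparison could be set up with timeit (the three wrapper functions are hypothetical names for the variants shown above):
S = 1000; T = 10;
X = sort(rand(T, S), 1, 'ascend');
t1 = timeit(@() doubleLoopVersion(X));  % original loops from the question
t2 = timeit(@() innerVectorized(X));    % row-indexed vectorized loop
t3 = timeit(@() columnIndexed(X.'));    % transposed, column-indexed loop
fprintf('loop %.3g s, vectorized %.3g s, transposed %.3g s\n', t1, t2, t3);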
Update: I measured the running times of the different proposals (10^5 trials). It looks like the transposition of the matrix is the most effective intervention, and my original double-loop implementation is, amazingly, the best when compared against the vectorized versions. However, in my hands (2017a) the improvement is only 16.6% over the original using the mean (18.2% using the median).
Maybe there is still room for improvement?

Replacement for repmat in MATLAB

I have a function which does the following loop many, many times:
for cluster = 1:max(bins) % bins is a list in the same format as kmeans() IDX output
    select = (bins == cluster); % find the group of values with this label
    means(select,:) = repmat_fast_spec(meanOneIn(x(select,:)), sum(select), 1);
    % (*, above) for each point, write the mean of all points in x that
    % share its label in bins to the equivalent row of means
    delta_x(select,:) = x(select,:) - means(select,:);
    % subtract out the mean from each point
end
Noting that repmat_fast_spec and meanOneIn are stripped-down versions of repmat() and mean(), respectively, I'm wondering if there's a way to do the assignment in the line labeled (*) that avoids repmat entirely.
Any other thoughts on how to squeeze performance out of this thing would also be welcome.
Here is a possible improvement to avoid REPMAT:
x = rand(20,4);
bins = randi(3,[20 1]);
d = zeros(size(x));
for i = 1:max(bins)
    idx = (bins == i);
    d(idx,:) = bsxfun(@minus, x(idx,:), mean(x(idx,:)));
end
Another possibility:
x = rand(20,4);
bins = randi(3,[20 1]);
m = zeros(max(bins), size(x,2));
for i = 1:max(bins)
    m(i,:) = mean( x(bins==i,:) );
end
dd = x - m(bins,:);
One obvious way to speed up calculation in MATLAB is to make a MEX file. You can compile C code and perform any operations you want. If you're after the fastest possible performance, turning the operation into a custom MEX file would likely be the way to go.
You may be able to get some improvement by using ACCUMARRAY.
% gather array sizes
[nPts, nDims] = size(x);
nBins = max(bins);
% calculate means. Not sure whether it might be faster to loop over nDims
meansCell = accumarray(bins, (1:nPts)', [nBins,1], @(idx){mean(x(idx,:),1)}, {NaN(1,nDims)});
means = cell2mat(meansCell);
% subtract cluster means from x - this is how you can avoid repmat in your code, btw.
% all you need is the array with cluster means.
delta_x = x - means(bins,:);
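Regarding the comment above about looping over nDims: that variant skips the cell packing entirely and, as a sketch (not benchmarked), would look like:
% Per-column accumarray: one pass per data dimension, no cell array needed.
means = NaN(nBins, nDims);
for d = 1:nDims
    means(:, d) = accumarray(bins, x(:, d), [nBins, 1], @mean, NaN);
end
delta_x = x - means(bins, :);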
First of all: format your code properly and surround operators and assignments with whitespace. As posted, your code is very hard to comprehend; it looks like one big blob of characters.
Next, you could follow the other responses and convert the code to C (MEX) or Java, automatically or manually, but in my humble opinion that is a last resort. You should only do such things when your performance falls short by a small margin. On the other hand, your algorithm doesn't show obvious flaws.
But the first thing you should do when trying to improve performance is profile. Use the MATLAB profiler to determine which part of your code is causing the problems. How much would you need to improve it to meet your expectations? If you don't know, determine that bound first; otherwise you will be looking for a needle in a haystack that might not even be there. MATLAB will never be the fastest kid on the block with respect to runtime, but it can be the fastest with respect to development time for certain kinds of operations. In that respect, it can be worth trading the execution speed of other languages (C, or even Java) for the clarity of MATLAB. But by the same logic, you might as well code everything in assembler to squeeze all of the performance out of the code.
Another obvious way to speed up calculation in MATLAB is to make a Java library (similar to @aardvarkk's answer), since MATLAB runs on top of a Java virtual machine and has very good integration with user Java libraries.
Java is easier to interface with and compile than C. It might be slower than C in some cases, but the just-in-time (JIT) compiler in the Java virtual machine generally speeds things up very well.
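As a trivial illustration of how little glue the Java route needs (stock JDK classes work with no setup; a custom jar would first go on the path with javaaddpath, and com.example.FastCompare below is purely hypothetical):
list = java.util.ArrayList();   % any JDK class is callable directly from MATLAB
list.add(int32(42));
n = list.size();                % n == 1
% javaaddpath('fastcompare.jar');          % hypothetical custom library
% y = com.example.FastCompare.run(a, b);   % hypothetical static method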

4 dimensional matrix

I need to use a 4-dimensional matrix as an accumulator for voting over 4 parameters. Every parameter varies in the range 1~300. For that, I define Acc = zeros(300,300,300,300) in MATLAB, and somewhere, for example, I use:
Acc(4,10,120,78)=Acc(4,10,120,78)+1
however, MATLAB throws an error because of the memory limitation:
??? Error using ==> zeros
Out of memory. Type HELP MEMORY for your options.
Below you can see part of my code:
I = imread('image.bmp'); % I is a logical 300x300 image.
Acc = zeros(100,100,100,100);
for i = 1:300
    for j = 1:300
        if I(i,j) == 1
            for x0 = 3:3:300
                for y0 = 3:3:300
                    for a = 3:3:300
                        b = abs(j-y0)/sqrt(1 - ((i-x0)^2)/(a^2));
                        b1 = floor(b/3);
                        if b1 == 0
                            b1 = 1;
                        end
                        a1 = ceil(a/3);
                        Acc(x0/3, y0/3, a1, b1) = Acc(x0/3, y0/3, a1, b1) + 1;
                    end
                end
            end
        end
    end
end
As @Rasman mentioned, you probably want to use a sparse representation of the matrix Acc.
Unfortunately, the sparse function is geared toward 2D matrices, not arbitrary n-D.
But that's ok, because we can take advantage of sub2ind and linear indexing to go back and forth to 4D.
Dims = [300, 300, 300, 300]; % it will be a 300 by 300 by 300 by 300 matrix
Acc = sparse([], [], [], prod(Dims), 1, ExpectedNumElts);
Here ExpectedNumElts should be some number like 30 or 9000 or however many non-zero elements you expect for the matrix Acc to have. We notionally think of Acc as a matrix, but actually it will be a vector. But that's okay, we can use sub2ind to convert 4D coordinates into linear indices into the vector:
ind = sub2ind(Dims, 4, 10, 120, 78);
Acc(ind) = Acc(ind) + 1;
You may also find the functions find, nnz, spy, and spfun helpful.
edit: see lambdageek for the exact same answer with a bit more elegance.
The other answers are guiding you to use a sparse matrix instead of your current dense solution. This is made a little more difficult because current MATLAB doesn't support N-dimensional sparse arrays. One way to implement it is to
replace
zeros(100,100,100,100)
with
sparse(100*100*100*100,1)
This will store all your counts in a sparse array; as long as most remain zero, you will be OK for memory.
Then to access this data, instead of:
Acc(h,i,j,k) = Acc(h,i,j,k) + 1
use:
index = h + 100*(i-1) + 100^2*(j-1) + 100^3*(k-1); % 1-based linear index, i.e. what sub2ind would compute
Acc(index,1) = Acc(index,1) + 1;
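Once the voting is done, the linear index can be mapped back to the four parameters with ind2sub; a short sketch (Dims matching the sparse vector's notional shape):
Dims = [100, 100, 100, 100];
[maxVotes, linIdx] = max(Acc);        % strongest bin in the sparse accumulator
[h, i, j, k] = ind2sub(Dims, linIdx); % recover the four parameter subscripts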
See Avoiding 'Out of Memory' Errors
Your statement would require about 65 GB of RAM: 300^4 elements at 8 bytes per double.
Solutions to 'Out of Memory' problems fall into two main categories:
Maximizing the memory available to MATLAB (i.e., removing or increasing limits) on your system via operating system selection and system configuration. These usually have the greatest overall applicability but are potentially the most disruptive (e.g. using a different operating system). These techniques are covered in the first two sections of this document.
Minimizing the memory used by MATLAB by making your code more memory efficient. These are all algorithm and application specific and therefore are less broadly applicable. These techniques are covered in later sections of this document.
In your case the latter seems to be the solution - try reducing the amount of memory used / required.

Methods to speed up for loop in MATLAB

I have just profiled my MATLAB code and there is a bottleneck in this for loop:
for vert = -down:up
    for horz = -lhs:rhs
        y = y + x(k + vert.*length + horz) .* DM(abs(vert).*nu + abs(horz) + 1);
    end
end
where y, x and DM are vectors I have already defined. I vectorized the loop by writing
B = (-down:up)' * ones(1, lhs+rhs+1);
C = ones(up+down+1, 1) * (-lhs:rhs);
y = sum(sum( x(k + length.*B + C) .* DM(abs(B).*nu + abs(C) + 1) ));
But this ended up being significantly slower.
Are there any suggestions on how I can speed up this for loop?
Thanks in advance.
What you've done is not really vectorization. It's very difficult, if not impossible, to write proper vectorization procedures for image processing (I assume that's what you're doing) in Matlab. When we use the term vectorized, we really mean "vectorized with no additional computation". For example, this code
a = 1:1000000;
n = 0;
for i = a
    n = n + i;
end
would run much slower than this code
a = 1:1000000;
sum(a)
Update: the code above has been modified, thanks to @Rasman's keen suggestion. The reason is that MATLAB does not compile your code into machine language before running it, and that's what causes it to be slower. Built-in functions like sum and mean and the .* operator run pre-compiled C code behind the scenes. For loops are a great example of code that runs slowly when not optimized for your CPU's registers.
What you have done, and please ignore my first comment, is rewrite your procedure with a vector operation plus some additional operations. Those additional operations take extra CPU time simply because you're telling your computer to do more computations, even though each computation separately may (or may not) take less time.
If you are really after speeding up your code, take a look at MEX files. They allow you to write C or C++ code, compile it, and run it as MATLAB functions, just like those fast built-in ones. In any case, MATLAB is not meant to be a fast general-purpose programming platform, but rather a computer-simulation environment, though this has been changing in recent years. My advice (from experience) is that if you do image processing, you will write for loops, and there's rarely a way around it. Vector operations were written for a more intuitive approach to linear algebra problems, and we rarely treat digital images as regular rectangular matrices in terms of what we do with them.
I hope this helps.
I would use matrices when handling images... you could then try to extract submatrices like so:
X = reshape(x, height, length);
kx = mod(k, length);
ky = floor(k/length);
xstamp = X(kx-down : kx+up, ky-lhs : ky+rhs);
xstamp = xstamp .* getDMMask(width, height, nu);
y = sum(xstamp(:));
...
function mask = getDMMask(width, height, nu)
    % I don't get what you're doing there... build and return an appropriately sized mask here.
end

vectorizing loops in Matlab - performance issues

This question is related to these two:
Introduction to vectorizing in MATLAB - any good tutorials?
filter that uses elements from two arrays at the same time
Based on the tutorials I read, I was trying to vectorize a procedure that takes a really long time.
I've rewritten this:
function B = bfltGray(A, w, sigma_r)
dim = size(A);
B = zeros(dim);
for i = 1:dim(1)
    for j = 1:dim(2)
        % Extract local region.
        iMin = max(i-w, 1);
        iMax = min(i+w, dim(1));
        jMin = max(j-w, 1);
        jMax = min(j+w, dim(2));
        I = A(iMin:iMax, jMin:jMax);
        % Compute Gaussian intensity weights.
        F = exp(-0.5*(abs(I - A(i,j))/sigma_r).^2);
        B(i,j) = sum(F(:).*I(:)) / sum(F(:));
    end
end
into this:
function B = rngVect(A, w, sigma)
W = 2*w + 1;
I = padarray(A, [w,w], 'symmetric');
I = im2col(I, [W,W]);
H = exp(-0.5*(abs(I - repmat(A(:)', size(I,1), 1))/sigma).^2);
B = reshape(sum(H.*I,1)./sum(H,1), size(A,1), []);
Where
A is a 512x512 matrix,
w is half of the window size, usually equal to 5,
sigma is a parameter in the range [0 1] (usually one of 0.1, 0.2 or 0.3).
So the I matrix would have 512*512*121 = 31,719,424 elements.
But this version seems to be as slow as the first one, and in addition it uses a lot of memory and sometimes causes memory problems.
I suppose I've done something wrong - probably some logic mistake in the vectorization. Well, in fact I'm not surprised - this method creates really big matrices, and the computations are probably proportionally longer.
I have also tried to write it using nlfilter (similar to the second solution given by Jonas) but it seems to be hard since I use Matlab 6.5 (R13) (there are no sophisticated function handles available).
So once again, I'm asking not for ready solution, but for some ideas that would help me to solve this in reasonable time. Maybe you will point me what I did wrong.
Edit:
As Mikhail suggested, the results of profiling are as follows:
65% of the time was spent in the line H = exp(...),
25% of the time was used by im2col.
How big are I and H (i.e. numel(I)*8 bytes)? If you start paging, then the performance of your second solution is going to be affected very badly.
To test whether you really have a problem due to too-large arrays, you can measure the speed of the calculation using tic and toc for arrays A of increasing size. If the execution time increases faster than the square of the size of A, or if it jumps at some size of A, you can try splitting the padded I into a number of sub-arrays and performing the calculations that way.
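A sketch of that splitting idea, processing A in vertical strips so the im2col intermediate stays small (blockCols is a tuning knob, not from the original code; A, w, W = 2*w+1 and sigma are as in rngVect):
Ipad = padarray(A, [w, w], 'symmetric');
B = zeros(size(A));
blockCols = 64;
for j0 = 1:blockCols:size(A, 2)
    j1 = min(j0 + blockCols - 1, size(A, 2));
    strip = Ipad(:, j0 : j1 + 2*w);        % strip plus w columns of context each side
    cols = im2col(strip, [W, W]);          % windows centred on A(:, j0:j1)
    Astrip = A(:, j0:j1);
    Hs = exp(-0.5*(abs(cols - repmat(Astrip(:)', size(cols, 1), 1))/sigma).^2);
    B(:, j0:j1) = reshape(sum(Hs.*cols, 1)./sum(Hs, 1), size(A, 1), []);
end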
Otherwise, I don't see any obvious places where you could be losing lots of time. Well, maybe you could skip the reshape, by replacing B with A in your function (saves a little memory as well), and writing
A(:) = sum(H.*I,1)./sum(H,1);
You may also want to look into upgrading to a more recent version of Matlab - they've worked hard on improving performance.