vectorization of matlab for-loop - matlab

I am looking for proper vectorization of following matlab function to eliminate for-loop and gain speed by multithreading.
size(A) = N-by-N, where 30 <= N <= 60
1e4 <= numIter <= 1e6
function val=permApproxStochSquare(A,numIter)
%// A ... input square non-negative matrix
%// numIter ... number of interations
N=size(A,1);
alpha=zeros(numIter,1);
for curIter=1:numIter
U=randn(N,N);
B=U.*sqrt(A);
alpha(curIter)=det(B)^2;
end
val=mean(alpha);
end

To summarize the discussion in the comment to two versions of the code which slightly improve the performance:
Using multiple ideas from the comments, the code needs roughly 1/3 less time:
N=size(A,1);
%precompute sqrt(A)
sA=sqrt(A);
alpha=zeros(numIter,1);
parfor curIter=1:numIter
%vectorizing rand did not improve the performance because it increased communitcation when combined with parfor
U=randn(N,N);
B=U.*sA;
alpha(curIter)=det(B);
end
%moved calculation out of the loop to vectorize
val=mean(alpha.^2);
Another approach, vectorize as far as possible using a for loop only did small improvemens to the perfrmance:
N=size(A,1);
%precompute sqrt(A)
sA=sqrt(A);
alpha=zeros(numIter,1);
%using a for, a vectorized rand outside the loop is faster.
U=randn(N,N,numIter);
B=bsxfun(#times,U,sA);
for curIter=1:numIter
alpha(curIter)=det(B(:,:,curIter));
end
val=mean(alpha.^2);

Related

How to find argmin/best fit/optimize for an overdetermined quadratic system for multiple variables in Matlab

I have 100 equations with 5 variables. Is there a function in Matlab which I can use to find the optimal solution of these equations?
My problem is to find argmin ||(a-ic)^2 + (b-jd)^2 + e - h(i,j)|| over all i, j from -10 to 10. ie.
%% Note: not Matlab code. Just showing the Math.
for i = -10:10
for j = -10:10
(a-ic)^2 + (b-jd)^2 + e = h(i,j)
known: h(i,j) is a 10*10 matrix,and i,j are indexes
expected: the optimal result of a,b,c,d,e
You can try using lsqnonlin as follows.
%% define a helper function in your .m file
function f = fun(x)
a=x(1); b=x(2); c=x(3); d=x(4); e=x(5); % Using variable names from your question. In other situations, be careful when overwriting e.
f=zeros(21*21,0); % size(f) is taken from your question. You should make this a variable for good practice.
for i = -10:10
for j = -10:10
f(10*(i+10+1)+(j+10+1)) = (a-i*c)^2 + (b-j*d)^2 + e - h(i,j); % 10 is taken from your question.
end
end
end
(Aside, why is your h(i,j) taking negative indices??)
In your main function you can simply write
function out=myproblem(x0)
out=lsqnonlin(#fun,x0);
end
In your cmd, you can call with specific initial try such as
myproblem([0,0,0,0,0])
Helper function over anonymous because in my experience helpers get sped up by JIT while anonymous do not. I also opted to reshape in the loops as an opposed to actually call reshape after because I expect reshape to cost significant extra time. Remember that O(1) in fun is not O(1) in lsqnonlin.
(As always, a solution to a nonlinear problem is not guaranteed.)

performance difference between subscript indexing and linear indexing

I have a 2D matrix in MATLAB and I use two different ways to access its elements. One is based on subscript indexing and the other is based on linear indexing. I test both methods by following code:
N = 512; it = 400; im = zeros(N);
%// linear indexing
[ind_x,ind_y] = ndgrid(1:2:N,1:2:N);
index = sub2ind(size(im),ind_x,ind_y);
tic
for i=1:it
im(index) = im(index) + 1;
end
toc %//cost 0.45 seconds on my machine (MATLAB2015b, Thinkpad T410)
%// subscript indexing
x = 1:2:N;
y = 1:2:N;
tic
for i=1:it
im(x,y) = im(x,y) +1;
end
toc %// cost 0.12 seconds on my machine(MATLAB2015b, Thinkpad T410)
%//someone pointed that double or uint32 might an issue, so we turn both into uint32
%//uint32 for linear indexing
index = uint32(index);
tic
for i=1:it
im(index) = im(index) +1;
end
toc%// cost 0.25 seconds on my machine(MATLAB2015b, Thinkpad T410)
%//uint32 for the subscript indexing
x = uint32(1:2:N);
y = uint32(1:2:N);
tic
for i=1:it
im(x,y) = im(x,y) +1;
end
toc%// cost 0.11 seconds on my machine(MATLAB2015b, Thinkpad T410)
%% /*********************comparison with others*****************/
%//third way of indexing, loops
tic
for i=1:it
for j=1:2:N
for k=1:2:N
im(j,k) = im(j,k)+1;
end
end
end
toc%// cost 0.74 seconds on my machine(MATLAB2015b, Thinkpad T410)
It seems that directly using subscript indexing is faster than the linear indexing obtained from sub2ind. Does anyone know why? I thought they were almost the same.
The intuition
As Daniel mentioned in his answer, the linear index takes up more space in RAM while the subscripts are much smaller.
For the subscripted indexing, internally, Matlab will not create the linear index, but it will use a (double) compiled loop to cycle through all elements.
The subscripted version on the other hand will have to loop through all the linear indices passed from outside, which will require more reads from memory, thus will take longer.
Claims
Linear indexing is faster
...as long as the total number of indices is the same
Timings
From the timings we see a direct confirmation for the first claim and we can infer the second with some additional testing (below).
LOOPED
subs assignment: 0.2878s
linear assignment: 0.0812s
VECTORIZED
subs assignment: 0.0302s
linear assignment: 0.0862s
First claim
We can test it with loops. The number of subref operations is the same but the linear index points directly to the element of interest while subscripts, internally, need to be converted.
The functions of interest:
function B = subscriptedIndexing(A,row,col)
n = numel(row);
B = zeros(n);
for r = 1:n
for c = 1:n
B(r,c) = A(row(r),col(c));
end
end
end
function B = linearIndexing(A,index)
B = zeros(size(index));
for ii = 1:numel(index)
B(ii) = A(index(ii));
end
end
Second claim
This claim is an inference from the observed difference in speed when using the vectorized approach.
First, the vectorized approach (as opposed to the looped) speeds up the subscripted assignment while linear indexing is slightly slower (probably not statistically significant).
Second, the only difference in the two indexing methods comes from the size of the indices/subscripts. We want to isolate this as the only possible cause of the difference in the timings. One other major player could be JIT optimization.
The testing functions:
function B = subscriptedIndexingVect(A,row,col)
n = numel(row);
B = zeros(n);
B = A(row,col);
end
function B = linearIndexingVect(A,index)
B = zeros(size(index));
B = A(index);
end
NOTE: I keep the superfluous preallocation of B, to keep the vectorized and looped approaches comparable. In other words, differences in timings should only come from indexing and the internal implementation of the loops.
All tests are run with:
function testFun(N)
A = magic(N);
row = 1:2:N;
col = 1:2:N;
[ind_x,ind_y] = ndgrid(row,col);
index = sub2ind(size(A),ind_x,ind_y);
% isequal(linearIndexing(A,index), subscriptedIndexing(A,row,col))
% isequal(linearIndexingVect(A,index), subscriptedIndexingVect(A,row,col))
fprintf('<strong>LOOPED</strong>\n')
fprintf(' subs assignment: %.4fs\n', timeit(#()subscriptedIndexing(A,row,col)))
fprintf(' linear assignment: %.4fs\n\n',timeit(#()linearIndexing(A,index)))
fprintf('<strong>VECTORIZED</strong>\n')
fprintf(' subs assignment: %.4fs\n', timeit(#()subscriptedIndexingVect(A,row,col)))
fprintf(' linear assignment: %.4fs\n', timeit(#()linearIndexingVect(A,index)))
end
Turning JIT on/off has NO impact:
feature accel off
testFun(5e3)
...
VECTORIZED
subs assignment: 0.0303s
linear assignment: 0.0873s
feature accel on
testFun(5e3)
...
VECTORIZED
subs assignment: 0.0303s
linear assignment: 0.0871s
This excludes that subscripted assignment's superior speed comes from JIT optimization which leaves us with the only plausible cause, number of RAM accesses. It is true that the final matrix has the same number of elements. However, the linear assignment has to retrieve all elements of the index in order to fetch the numbers.
SETUP
Tested on Win7 64 with MATLAB R2015b. Prior versions of Matlab will provide different results due to recent changes in Matlab's execution engine
In fact, turning JIT off in Matlab R2014a affects timings, but only for the loops (expected result):
feature accel off
testFun(5e3)
LOOPED
subs assignment: 7.8915s
linear assignment: 6.4418s
VECTORIZED
subs assignment: 0.0295s
linear assignment: 0.0878s
This again confirms that the difference in timings between linear and sibscripted assignment should come from the number of RAM accesses, since JIT does not play a role in the vectorized approach.
It does not really surprise me that the subscript indexing is much faster here. If you take a look at your input data, the index is much smaller in this case. For the subscript indexing case you have 512 elements while for the linear indexing case you have 65536 elements.
When you apply your example to a vector instead, you will notice that there is no difference between both methods.
Here is the slightly modified code I used to evaluate different matrix sizes:
it = 400; im = zeros(512*512,1);
x = 1:2:size(im,1);
y = 1:2:size(im,2);
%// linear indexing
[ind_x,ind_y] = ndgrid(x,y);
index = sub2ind(size(im),ind_x,ind_y);
tic
for i=1:it
im(index) = im(index) + 1;
end
toc
%// subscript indexing
tic
for i=1:it
im(x,y) = im(x,y) +1;
end
toc
A very good question. Right ahead, I don't know the correct answer, however, you can analyze the behavior. Save the first toc into t1 and the second one into t2. At the end calculate t1/t2. You will recognize, changing the number of iterations or the size of your matrix does (almost) not change the factor.
I propose:
The amount of iterations only improves the quality of the tictoc. (obvious?)
The size of the matrix has no influcence, i.e. there must be a time delay in the syntax.
I imagine, that there is simply an internal check or transformation from linear index to subscript indexing, i.e. the internal addition (operation) you perform is exactly the same. It appears to be more natural to use subscript indexing instead of linear indexing, so maybe mathworks simply optimized the first.
UPDATE:
You can also simply access an element in your matrix, you will see, that using subscript index is faster than using linear index. That supports the theory, that there is a slow conversion done internally from linear to subscript.
DISCLAIMER: I don't have a MATLAB license at the moment, so the code I provide below is admittedly untested. However, if anyone decides to test, please comment on this answer accordingly.
Depending on your release of MATLAB (are you using R2015b?), there is a possibility that you may not have paid the full upfront cost of preallocation when invoking "zeros". There is a possibility that you are paying for allocation on the first get/set of im, which is causing additional but hidden overhead when you first access the values inside im.
See: http://undocumentedmatlab.com/blog/preallocation-performance
As an initial test, I suggest switching the order that you are profiling the code:
N = 512; it = 400; im = zeros(N);
%// subscript indexing
x = 1:2:N;
y = 1:2:N;
tic
for i=1:it
im(x,y) = im(x,y) +1;
end
toc %// What's the cost now?
%// linear indexing
[ind_x,ind_y] = ndgrid(1:2:N,1:2:N);
index = sub2ind(size(im),ind_x,ind_y);
tic
for i=1:it
im(index) = im(index) + 1;
end
toc %// What's the cost now?
To profile perhaps more fairly against subscript vs. linear indexing, I suggest one of two possible methods:
Make sure you incur allocation costs on both methods by creating two separate im matrices, im1 and im2, both initially set to zeros(N), and use each matrix for a separate indexing method.
Run a full get/set on each element of im before actually profiling between subscript vs. linear indexing.
Method 1:
N = 512; it = 400; im1 = zeros(N); im2 = zeros(N);
%// subscript indexing
x = 1:2:N;
y = 1:2:N;
tic
for i=1:it
im1(x,y) = im1(x,y) + 1;
end
toc %// What's the cost now?
%// linear indexing
[ind_x,ind_y] = ndgrid(1:2:N,1:2:N);
index = sub2ind(size(im2),ind_x,ind_y);
tic
for i=1:it
im2(index) = im2(index) + 1;
end
toc %// What's the cost now?
Method 2:
N = 512; it = 400; im = zeros(N);
%// Run a full get/set on each element to force allocation
tic
for i=1:N^2
im(i) = im(i) +1;
end
toc
%// subscript indexing
x = 1:2:N;
y = 1:2:N;
tic
for i=1:it
im(x,y) = im(x,y) +1;
end
toc %// What's the cost now?
%// linear indexing
[ind_x,ind_y] = ndgrid(1:2:N,1:2:N);
index = sub2ind(size(im),ind_x,ind_y);
tic
for i=1:it
im(index) = im(index) + 1;
end
toc %// What's the cost now?
I have a second hypothesis, which is that you incur some additional overhead when you explicitly declare each and every single element to be accessed, versus if you have MATLAB infer the elements for you. excasa's "duplicate post" reference (not exactly a duplicate in my humble opinion) has the same general insight, but uses different datapoints to come to this conclusion. I won't write examples of this here, but basically, creating a straight up giant array index compared to the smaller subscript indices x and y gives MATLAB less room for internal optimizations. I don't know what inside MATLAB would perform these specific optimizations, but perhaps they come from the black magic that you may know as MATLAB's JIT/LXE. If you honestly want to check if JIT is the culprit here (and are working in 2014b or prior), then you can try disabling it and then running the code above.
There are several ways to disable the JIT:
Use undocumented feature methods.
Copy/paste the commands to the command prompt, as opposed running them straight from the script editor.
Unfortunately, I do not know of a way to turn of LXE in R2015a and later, and trying to diagnose if LXE is the culprit may be a bit of an uphill battle. If this is where you are stuck, perhaps you can delve even further via MathWorks' technical support or MathWorks Central. You may be surprised to find some astounding experts from either source.

Is there a way to speed up concatenation in MATLAB?

I want to concatenate along the third dimension
z = cat(3,A,B,C);
Many many times. I if I was doing that along the second dimension then
z = [A,B,C];
Would be faster than
z = cat(2,A,B,C);
Can a similar thing be done along the third dimension or is there any other way to speed this up?
There are some indexing options to get a slightly better performance than cat(3,...).
Both solutions use U(30,30,3)=0; instead of zeros(30,30,3) to preallocate, but it is unsave as it will result in a subscript dimension missmatch when U is already a variable of a larger size.
The first option is to assign the different slices individually.
%fast but unsafe preallocation
U(30,30,3)=0;
%robust alternative:
%U=zeros(30,30,3)
U(:,:,3)=C;
U(:,:,1)=A;
U(:,:,2)=B;
The second option is to use linear indexing. For z1 = cat(3,A,B,C); and z2=[A;B;C] it is true that z1(:)==z2(:)
%fast but unsafe preallocation
U(30,30,3)=0;
%robust alternative:
%U=zeros(30,30,3)
U(:)=[A,B,C];
I benchmarked the solutions, comparing it to cat(3,A,B,C) and [A,B,C]. The linear indexing solution is only slightly slower than [A,B,C].
0.392289 s for 2D CAT
0.476525 s for Assign slices
0.588346 s for cat(3...)
0.392703 s for linear indexing
Code for benchmarking:
N=30;
A=randn(N,N);
B=randn(N,N);
C=randn(N,N);
T=containers.Map;
cycles=10^5;
tic;
for i=1:cycles
W=[A;B;C];
X=W+1;
end
T('2D CAT')=toc;
tic;
for i=1:cycles
W=cat(3,A,B,C);
X=W+1;
end
T('cat(3...)')=toc;
U=zeros(N,N,3);
tic;
for i=1:cycles
U(N,N,3)=0;
U(:,:,3)=C;
U(:,:,1)=A;
U(:,:,2)=B;
V=U+1;
end
T('Assign slices')=toc;
tic;
for i=1:cycles
U(N,N,3)=0;
U(:)=[A,B,C];
V=U+1;
end
T('linear indexing')=toc;
for X=T.keys
fprintf('%f s for %s\n',T(X{1}),X{1})
end

Gauss-Seidel method doesn't work for large sparse arrays?

Once again I have a problem with the Gauss-Seidel Method in Matlab. Here it is:
function [x] = ex1_3(A,b)
format long
sizeA=size(A,1);
x=zeros(sizeA,1);
%Just a check for the conditions of the Gauss-Seidel Method (if it has dominant diagonal)
for i=1:sizeA
sum=0;
for j=1:sizeA
if i~=j
sum=sum+abs(A(i,j));
end
end
if abs(A(i,i))<sum
fprintf('\nGauss-Seidel''s conditions not met!\n');
return
end
end
%Actual Gauss-Seidel Method
max_temp=10^(-6); %Pass first iteration
while max_temp>(0.5*10^(-6))
xprevious=x;
for i=1:sizeA
x(i,1)=b(i,1);
for j=1:sizeA
if i~=j
x(i,1)=x(i,1)-A(i,j)*x(j,1);
end
end
x(i,1)=x(i,1)/A(i,i);
end
x
%Calculating infinite norm of vector x-xprevious
temp=x-xprevious;
max_temp=temp(1,1);
for i=2:sizeA
if abs(temp(i,1))>max_temp
max_temp=abs(temp(i,1));
end
end
end
It actually works fine for a 100x100 matrix or smaller. However, my tutor wants it to work for 100000x100000 matrices. At first it was difficult to even create the matrix itself, but I managed to do it with a little help from here:
Matlab Help Center
Now, I call the ex1_3 function with A as a parameter, but it goes really slow. Actually it never ends. How can I make it work?
Here's my code for creating the specific matrix my tutor wanted:
The important part is just that it meets these conditions:
A(i; i) = 3, A(i - 1; i) = A(i; i + 1) = -1 n=100000
b=ones(100000,1);
b(1,1)=2;
b(100000,1)=2;
i=zeros(299998,1); %Matrix with the lines that we want to put nonzero elements
j=zeros(299998,1); %Matrix with the columns that we want to put nonzero elements
s=zeros(299998,1); %Matrix with the nonzero elements.
number=1;
previousNumberJ=0;
numberJ=0;
for k=1:299998 %Our index in i and j matrices
if mod((k-1),3)==0
s(k,1)=3;
else
s(k,1)=-1;
end
if k==1 || k==2
i(k,1)=1;
j(k,1)=k;
elseif k==299997 || k==299998
i(k,1)=100000;
j(k,1)=(k-200000)+2;
else
if mod(k,3)==0
number=number+1;
numberJ=previousNumberJ+1;
previousNumberJ=numberJ;
end
i(k,1)=number;
j(k,1)=numberJ;
numberJ=numberJ+1;
end
end
A=sparse(i,j,s); %Creating the sparse array
x=ex1_3(A,b);
the for loop works very slowly in Matlab, perhaps you may want to try the matrix form of the iteration:
function x=gseidel(A,b)
max_temp=10^(-6); %Pass first iteration
x=b;
Q=tril(A);
r=b-A*x;
for i=1:100
dx=Q\r;
x=x+1*dx;
r=b-A*x;
% convergence check
if all(abs(r)<max_temp) && all(abs(dx)<max_temp), return; end
end
For your A and b, it only takes 16 steps to converge.
tril extracts the lower triangular part of A, you can also obtain this Q when you build up the matrix. Since Q is already the triangular matrix, you can solve the equation Q*dx=r very easily if you are not allowed to use \ function.

pdist2 equivalent in MATLAB version 7

I need to calculate the euclidean distance between 2 matrices in matlab. Currently I am using bsxfun and calculating the distance as below( i am attaching a snippet of the code ):
for i=1:4754
test_data=fea_test(i,:);
d=sqrt(sum(bsxfun(#minus, test_data, fea_train).^2, 2));
end
Size of fea_test is 4754x1024 and fea_train is 6800x1024 , using his for loop is causing the execution of the for to take approximately 12 minutes which I think is too high.
Is there a way to calculate the euclidean distance between both the matrices faster?
I was told that by removing unnecessary for loops I can reduce the execution time. I also know that pdist2 can help reduce the time for calculation but since I am using version 7. of matlab I do not have the pdist2 function. Upgrade is not an option.
Any help.
Regards,
Bhavya
Here is vectorized implementation for computing the euclidean distance that is much faster than what you have (even significantly faster than PDIST2 on my machine):
D = sqrt( bsxfun(#plus,sum(A.^2,2),sum(B.^2,2)') - 2*(A*B') );
It is based on the fact that: ||u-v||^2 = ||u||^2 + ||v||^2 - 2*u.v
Consider below a crude comparison between the two methods:
A = rand(4754,1024);
B = rand(6800,1024);
tic
D = pdist2(A,B,'euclidean');
toc
tic
DD = sqrt( bsxfun(#plus,sum(A.^2,2),sum(B.^2,2)') - 2*(A*B') );
toc
On my WinXP laptop running R2011b, we can see a 10x times improvement in time:
Elapsed time is 70.939146 seconds. %# PDIST2
Elapsed time is 7.879438 seconds. %# vectorized solution
You should be aware that it does not give exactly the same results as PDIST2 down to the smallest precision.. By comparing the results, you will see small differences (usually close to eps the floating-point relative accuracy):
>> max( abs(D(:)-DD(:)) )
ans =
1.0658e-013
On a side note, I've collected around 10 different implementations (some are just small variations of each other) for this distance computation, and have been comparing them. You would be surprised how fast simple loops can be (thanks to the JIT), compared to other vectorized solutions...
You could fully vectorize the calculation by repeating the rows of fea_test 6800 times, and of fea_train 4754 times, like this:
rA = size(fea_test,1);
rB = size(fea_train,1);
[I,J]=ndgrid(1:rA,1:rB);
d = zeros(rA,rB);
d(:) = sqrt(sum(fea_test(J(:),:)-fea_train(I(:),:)).^2,2));
However, this would lead to intermediary arrays of size 6800x4754x1024 (*8 bytes for doubles), which will take up ~250GB of RAM. Thus, the full vectorization won't work.
You can, however, reduce the time of the distance calculation by preallocation, and by not calculating the square root before it's necessary:
rA = size(fea_test,1);
rB = size(fea_train,1);
d = zeros(rA,rB);
for i = 1:rA
test_data=fea_test(i,:);
d(i,:)=sum( (test_data(ones(nB,1),:) - fea_train).^2, 2))';
end
d = sqrt(d);
Try this vectorized version, it should be pretty efficient. Edit: just noticed that my answer is similar to #Amro's.
function K = calculateEuclideanDist(P,Q)
% Vectorized method to compute pairwise Euclidean distance
% Returns K(i,j) = sqrt((P(i,:) - Q(j,:))'*(P(i,:) - Q(j,:)))
[nP, d] = size(P);
[nQ, d] = size(Q);
pmag = sum(P .* P, 2);
qmag = sum(Q .* Q, 2);
K = sqrt(ones(nP,1)*qmag' + pmag*ones(1,nQ) - 2*P*Q');
end