Efficient method for finding elements in MATLAB matrix

I would like to know how the bottleneck in the following piece of code can be treated.
%% points is an Nx3 matrix holding the coordinates of N points, where N ~ 10^6
Z = points(:,3);
listZ = (Z >= a & Z < b); % Bottleneck
np = sum(listZ);          % for later use
slice = points(listZ,:);
Currently, for N ~ 10^6, np ~ 1000, and about 1000 calls to this piece of code, the bottleneck statement takes around 10 seconds in total, which is a big chunk of time compared to the rest of my code.

If the equality on one side is not important, you can reformulate the test as a one-sided comparison, and it gets one order of magnitude faster:
Z = rand(1e6,3);
a=0.5; b=0.6;
c=(a+b)/2;
d=abs(a-b)/2;
tic
for k=1:100,
listZ1 = (Z >= a & Z < b); % Bottleneck
end
toc
tic
for k=1:100,
listZ2 = (abs(Z-c)<d);
end
toc
isequal(listZ1, listZ2)
returns
Elapsed time is 5.567460 seconds.
Elapsed time is 0.625646 seconds.
ans =
1

Assuming the worst case:
element-wise & is not short-circuited internally
the comparisons are single-threaded
You're doing 2*1e6*1e3 = 2e9 comparisons in ~10 seconds. That's ~200 million comparisons per second (~200 MFLOPS).
Considering you can do some 1.7 GFLOPS on a single core, this indeed seems rather low.
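To gauge what your machine actually sustains on this statement, a rough benchmark along these lines (a sketch mimicking the question's sizes) can help:
Z = rand(1e6,3);
a = 0.5; b = 0.6;
tic
for k = 1:100
    listZ = (Z >= a & Z < b);
end
t = toc;
% two comparisons per element, 100 repetitions
fprintf('~%.0f million comparisons per second\n', 2*numel(Z)*100/t/1e6)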
Are you running Windows 7? If so, have you checked your power settings? You are on a mobile processor, so I expect that, by default, some low-power scheme is in effect. This allows Windows to scale down the processing speed, so check that.
Other than that....I really have no clue.

Try doing something like this:
for i = 1:1000
x = (a >= 0.5);
x(x) = (a(x) < 0.6); % keep only the elements that also pass the second test
end
I found it to be faster than:
for i = 1:1000
x = (a >= 0.5 & a < 0.6);
end
by about 4 seconds:
Elapsed time is 0.985001 seconds. (first one)
Elapsed time is 4.888243 seconds. (second one)
I think the reason for the slowdown is the element-wise & operation, which evaluates both comparisons over the entire array.


Vectorizing a matlab loop with internal functions

I have a 3D mesh grid X, Y, Z, and I want to create a new 3D array that is a function of X, Y, and Z. That function is the sum of several 3D Gaussians located at different points. Currently, I have a for loop that runs over the different points where I have my Gaussians, and I have an array of center locations r0(nGauss, 1:3):
[X,Y,Z]=meshgrid(-10:.1:10);
Psi=0*X;
for index = 1:nGauss
Psi = Psi + Gauss3D(X,Y,Z,[r0(index,1),r0(index,2),r0(index,3)]);
end
where my 3D gaussian function is
function output = Gauss3D(X,Y,Z,r0)
output = exp(-((X-r0(1)).^2 + (Y-r0(2)).^2 + (Z-r0(3)).^2));
end
I'm happy to redesign the function, which is the slowest part of my code and has to run many, many times, but I can't figure out how to vectorize it so that it will run faster. Any suggestions would be appreciated.
NB: the original function had a square root in it, and has been modified above to make it an actual Gaussian.
NOTE! I've modified your code to create a Gaussian, which was:
output=exp(-sqrt((X-r0(1)).^2 + (Y-r0(2)).^2 + (Z-r0(3)).^2));
That does not make a Gaussian. I changed this to:
output = exp(-((X-r0(1)).^2 + (Y-r0(2)).^2 + (Z-r0(3)).^2));
(note no sqrt). This is a Gaussian with sigma = sqrt(1/2).
If this is not what you want, then this answer might not be very useful to you, because your function does not go to 0 as fast as the Gaussian, and therefore is harder to truncate, and it is not separable.
Vectorizing this code is pointless, as the other answers attest. MATLAB's JIT is perfectly capable of running this as fast as it'll go. But you can reduce the amount of computation significantly by noting that the Gaussian goes to almost zero very quickly, and is separable:
Most of the exp evaluations you're doing here yield a very tiny number. You don't need to compute those, just fill in 0.
exp(-x.^2-y.^2) is the same as exp(-x.^2).*exp(-y.^2), which is much cheaper to compute.
Let's put these two things to the test. Here is the test code:
function gaussian_test
N = 100;
r0 = rand(N,3)*20 - 10;
% Original
tic
[X,Y,Z] = meshgrid(-10:.1:10);
Psi1 = zeros(size(X));
for index = 1:N
Psi1 = Psi1 + Gauss3D(X,Y,Z,r0(index,:));
end
t = toc;
fprintf('original, time = %f\n',t)
% Fast, large truncation
tic
[X,Y,Z] = deal(-10:.1:10);
Psi2 = zeros(numel(X),numel(Y),numel(Z));
for index = 1:N
Psi2 = Gauss3D_fast(Psi2,X,Y,Z,r0(index,:),5);
end
t = toc;
fprintf('truncation = 5, time = %f\n',t)
fprintf('mean abs error = %f\n',mean(reshape(abs(Psi2-Psi1),[],1)))
fprintf('mean square error = %f\n',mean(reshape((Psi2-Psi1).^2,[],1)))
fprintf('max abs error = %f\n',max(reshape(abs(Psi2-Psi1),[],1)))
% Fast, smaller truncation
tic
[X,Y,Z] = deal(-10:.1:10);
Psi3 = zeros(numel(X),numel(Y),numel(Z));
for index = 1:N
Psi3 = Gauss3D_fast(Psi3,X,Y,Z,r0(index,:),3);
end
t = toc;
fprintf('truncation = 3, time = %f\n',t)
fprintf('mean abs error = %f\n',mean(reshape(abs(Psi3-Psi1),[],1)))
fprintf('mean square error = %f\n',mean(reshape((Psi3-Psi1).^2,[],1)))
fprintf('max abs error = %f\n',max(reshape(abs(Psi3-Psi1),[],1)))
% DIPimage, same smaller truncation
tic
Psi4 = newim(201,201,201);
coords = (r0+10) * 10;
Psi4 = gaussianblob(Psi4,coords,10*sqrt(1/2),(pi*100).^(3/2));
t = toc;
fprintf('DIPimage, time = %f\n',t)
fprintf('mean abs error = %f\n',mean(reshape(abs(Psi4-Psi1),[],1)))
fprintf('mean square error = %f\n',mean(reshape((Psi4-Psi1).^2,[],1)))
fprintf('max abs error = %f\n',max(reshape(abs(Psi4-Psi1),[],1)))
end % of function gaussian_test
function output = Gauss3D(X,Y,Z,r0)
output = exp(-((X-r0(1)).^2 + (Y-r0(2)).^2 + (Z-r0(3)).^2));
end
function Psi = Gauss3D_fast(Psi,X,Y,Z,r0,trunc)
% sigma = sqrt(1/2)
x = X-r0(1);
y = Y-r0(2);
z = Z-r0(3);
mx = abs(x) < trunc*sqrt(1/2);
my = abs(y) < trunc*sqrt(1/2);
mz = abs(z) < trunc*sqrt(1/2);
Psi(my,mx,mz) = Psi(my,mx,mz) + exp(-x(mx).^2) .* reshape(exp(-y(my).^2),[],1) .* reshape(exp(-z(mz).^2),1,1,[]);
% Note! the line above uses implicit singleton expansion. For older MATLABs use bsxfun
end
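For older MATLAB versions without implicit expansion (pre-R2016b), that update line could be written with bsxfun, something like this sketch:
% pre-R2016b equivalent of the implicit-expansion update (sketch)
Psi(my,mx,mz) = Psi(my,mx,mz) + bsxfun(@times, ...
    bsxfun(@times, reshape(exp(-y(my).^2),[],1), exp(-x(mx).^2)), ...
    reshape(exp(-z(mz).^2),1,1,[]));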
This is the output on my machine, reordered for readability (I'm still on MATLAB R2017a):
               | time (s) | mean abs | mean sq. | max abs
---------------+----------+----------+----------+----------
original       | 5.035762 |          |          |
truncation = 5 | 0.169807 | 0.000000 | 0.000000 | 0.000005
truncation = 3 | 0.054737 | 0.000452 | 0.000002 | 0.024378
DIPimage       | 0.044099 | 0.000452 | 0.000002 | 0.024378
As you can see, using these two properties of the Gaussian reduces the time from 5.0 s to 0.17 s, a 30x speedup, with hardly noticeable differences (truncating at 5*sigma). A further 3x speedup can be gained by allowing a small error: the smaller the truncation value, the faster this will go, but the larger the error will be.
I added that last method, the gaussianblob function from DIPimage (I'm an author), just to show that option in case you need to squeeze that bit of extra time from your code. That function is implemented in C++ in the version I used, which you would need to compile yourself; our current official release still implements this function in M-file code and is not as fast.
A further chance for improvement exists if the fractional part of the coordinates (w.r.t. the pixel grid) is always the same: in that case, you can draw the Gaussian once and shift it onto each of the centroids.
Another alternative involves computing the Gaussian once, at a somewhat larger scale, and interpolating into it to generate each of the 1D Gaussians needed for the output. I did not implement this, so I have no idea whether it will be faster or whether the time difference will be significant. In the old days exp was expensive; I'm not sure that is still the case.
So, I am building off of the answer above by @Durkee. I enjoy these kinds of problems, so I thought a little about how to make each of the expansions implicit, and I have the one-line function below. Using this function I shaved 0.11 seconds off of the call, which is completely negligible; yours looks pretty decent. The only advantage of mine might be how the code scales on a finer mesh.
xLin = [-10:.1:10]';
tic
psi2 = sum(exp(-sqrt((permute(xLin-r0(:,1)',[3 1 4 2])).^2 ...
+ (permute(xLin-r0(:,2)',[1 3 4 2])).^2 ...
+ (permute(xLin-r0(:,3)',[3 4 1 2])).^2)),4);
toc
The relative run times on my computer were (all things kept the same):
Original - 1.234085
Other - 2.445375
Mine - 1.120701
So this is a bit of an unusual problem where, on my computer, the unvectorized code actually works better than the vectorized code. Here is my script:
clear
[X,Y,Z]=meshgrid(-10:.1:10);
Psi=0*X;
nGauss = 20; %Sample nGauss as you didn't specify
r0 = rand(nGauss,3); % Just make this up as it doesn't really matter in this case
% Your original code
tic
for index = 1:nGauss
Psi = Psi + Gauss3D(X,Y,Z,[r0(index,1),r0(index,2),r0(index,3)]);
end
toc
% Vectorize these functions so we can use implicit broadcasting
X1 = X(:);
Y1 = Y(:);
Z1 = Z(:);
tic
val = [X1 Y1 Z1];
% Change the dimensions so that r0 operates on the right elements
r0_temp = permute(r0,[3 2 1]);
% Perform the gaussian combination
out = sum(exp(-sqrt(sum((val-r0_temp).^2,2))),3);
toc
% Check to make sure both functions match
sum(abs(vec(Psi)-vec(out)))
function output=Gauss3D(X,Y,Z,r0)
output=exp(-sqrt((X-r0(1)).^2 + (Y-r0(2)).^2 + (Z-r0(3)).^2));
end
function out = vec(in)
out = in(:);
end
As you can see, this is probably about as vectorized as you can get. The whole function is done using broadcasting and vectorized operations, which normally improve performance ten- to a hundredfold. However, in this case, that is not what we see:
Elapsed time is 1.876460 seconds.
Elapsed time is 2.909152 seconds.
This actually shows the unvectorized version as being faster.
There could be a few reasons for this, though I am by no means an expert:
MATLAB uses a JIT compiler now which means that for loops are no longer inefficient.
Your code is already reasonably vectorized, you are operating at 8 million elements at once
Unless nGauss is 1000 or something, you're not looping through that many iterations, and at that point vectorization means you will run out of memory
I could be hitting some memory threshold where using too much memory makes my code inefficient; I noticed that when I lowered the resolution of the meshgrid, the vectorized version worked better
As an aside, I tested this on my GTX 1060 GPU with single precision (single precision is 10x faster than double precision on most GPUs):
Elapsed time is 0.087405 seconds.
Elapsed time is 0.241456 seconds.
Once again the unvectorized version is faster. Sorry I couldn't help you out, but it seems that your code is about as good as it is going to get unless you lower the resolution of your meshgrid.

There is a bug in my code and I don't know where!!! [MATLAB]

I am trying to reproduce the results from an article, but so far I have not been successful. Here is the code I wrote:
EDIT: Based on the initial comments of Zizy Archer, the code has been revised.
clear;
Nmax = 30; % number of rounds
M = 10000; % number of simulations
beta0 = 5*10^-6; % relative clock offset in micro seconds
alpha0 = 1.01; % relative clock skew
for simN = 1:M
for N = 1:Nmax
mean_dly = randi([20 50],N,1).*10^-6; % micro seconds
stdd_dly = randi([1 5],N,1).*10^-6; % micro seconds
XpropDly = normrnd(mean_dly,stdd_dly,N,1); % micro seconds
YpropDly = normrnd(mean_dly,stdd_dly,N,1); % micro seconds
prcssTme = randi([100 500],N,1).*10^-6; % micro seconds
T_1 = (1:N)'*10^-3; % milli seconds
T_2 = T_1 + XpropDly; % milli seconds
T_3 = T_2 + prcssTme; % milli seconds
T_4 = T_3 + YpropDly; % milli seconds
% actual time
T_2act = (T_1 + XpropDly).*alpha0 + beta0;
T_3act = (T_4 - YpropDly).*alpha0 + beta0;
% equation 13
A = sum(T_2act(1:N) + T_3act(1:N));
B = sum(T_1(1:N) + T_4(1:N));
C = sum((T_2act(1:N) + T_3act(1:N)).^2);
D = sum((T_2act(1:N) + T_3act(1:N)).*(T_1(1:N) + T_4(1:N)));
% equation 16
alpha0est(simN,N) = (A.^2-C.*Nmax)./(A.*B-D.*Nmax);
beta0est(simN,N) = (B.*C-A.*D)./(2.*(A.*B-D.*Nmax));
end
timestamps = [T_1 T_2 T_3 T_4];
clear T_*;
end
% equation 29 and 30
MSE_alpha = sum((alpha0est - alpha0).^2)/M;
MSE_beta = sum((beta0est - beta0).^2)/M;
figure %2(a)
semilogy(1:Nmax,MSE_beta.*10^12)
xlabel('N');ylabel('MSE of the estimated offset \beta_{0}')
figure %2(b)
semilogy(1:Nmax,MSE_alpha)
xlabel('N');ylabel('MSE of the estimated skew \alpha_{0}')
But the results I get do not match the article's. (EDIT2: the result plots were removed.)
Can anyone tell me what is wrong with my code?
Thanking you all in advance.
Next time at least try some rudimentary debugging yourself to figure out what could be wrong.
To debug, perhaps print out some variables or plot stuff. Put some conditionals to check if the values are somewhat expected or make no sense. It isn't that hard if you know something is wrong in such a small piece of code (however, if there was no paper you were trying to hit, this bug might have been lurking for a while).
Well, to the step-by-step solution in this case:
What you immediately notice is that if you plot alpha0est or beta0est, your estimate for alpha is systematically too high, at 1.015 instead of 1.01 for the single-round case, and similarly for beta.
Now, what could it be? It obviously isn't noise from the signal processing or the delays; that shows up as all the hairy stuff around the mean, and you can set all delays to 0 to verify it. So, it has to be something else.
Looking further, you can notice that this systematic error decreases as you increase the number of rounds performed, and is gone at the full 30 rounds.
So, it has to be something to do with the number of rounds you are doing. Now try setting Nmax = 10 instead of 30: whoa, now the 10-round case is fine. And there you have your bug. In equation 13 of the paper, N elements are summed; equation 16 similarly multiplies by N. These obviously have to be the same number, but in your code they aren't: your equation 13 sums over the rounds performed so far (the loop variable N, anywhere from 1 to 30), while your equation 16 multiplies by Nmax (always 30).
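In code terms, the fix is to use the current round count N rather than Nmax in equation 16 (a sketch using the question's variable names):
% equation 16, fixed: scale by the current round count N, not Nmax
alpha0est(simN,N) = (A.^2 - C.*N) ./ (A.*B - D.*N);
beta0est(simN,N)  = (B.*C - A.*D) ./ (2.*(A.*B - D.*N));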
All this could easily be avoided with saner variable names. If you used N for the number of rounds performed, and maxN for the maximum number of rounds you can try, you would easily get it right.

Matlab: for loop on vector. Weird speed behaviour?

I was playing a bit with the for loop in MATLAB. I know that it is often possible to avoid loops, and in that case the code is much, much faster. But since I really want to go through all the elements of the vector V, I did this little test:
n=50000000;
V =1:n;
s1 = 0;
tic
for x=V
s1 = s1+x;
end
toc
s2 = 0;
tic
for ind=1:numel(V)
s2 = s2+V(ind);
end
toc
s1 and s2 are equal (as expected), but the first loop takes 24.63 s and the second one only 0.48 s.
I was a bit surprised by these numbers. Is this a known behaviour? Does anyone have an explanation?
This is probably caused by memory allocation. The statement in case 1,
for x=V
has to create a copy of V. How do we know? If you modify V within the loop, x won't be affected: it will still run through the original values of V.
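A minimal sketch demonstrates this:
V = 1:3;
for x = V
    V = zeros(1,3); % overwriting V has no effect on the iteration
    disp(x)         % still prints 1, 2, 3
end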
On the other hand, the statement in case 2,
for ind=1:numel(V)
doesn't actually create the vector 1:numel(V). From help for,
Long loops are more memory efficient when the colon expression appears in the for statement since the index vector is never created.
The fact that no memory needs to be allocated probably accounts for the increase of speed, at least in part.
To test this, let's change for ind=1:numel(V) to for ind=[1:numel(V)]. This will force creation of the vector 1:numel(V). Then the running time should be similar to case 1, or indeed a little larger because we still need to index into V with V(ind).
These are the running times on my computer:
% Case 1
n=50000000;
V =1:n;
s1 = 0;
tic
for x=V
s1 = s1+x;
end
toc
% Case 2
s2 = 0;
tic
for ind=1:numel(V)
s2 = s2+V(ind);
end
toc
% Case 3
s3 = 0;
tic
for ind=[1:numel(V)]
s3 = s3+V(ind);
end
toc
Results:
Elapsed time is 0.610825 seconds.
Elapsed time is 0.182983 seconds.
Elapsed time is 0.831321 seconds.

Preallocation and Vectorization Speedup

I am trying to improve the speed of a script I am running.
Here is the code: (my machine = 4 core win 7)
clear y;
n=100;
x=linspace(0,1,n);
% no y pre-allocation using zeros
start_time=tic;
for k=1:n,
y(k) = (1-(3/5)*x(k)+(3/20)*x(k)^2 -(x(k)^3/60)) / (1+(2/5)*x(k)-(1/20)*x(k)^2);
end
elapsed_time1 = toc(start_time);
fprintf('Computational time for serialized solution: %f\n',elapsed_time1);
Above code gives 0.013654 elapsed time.
On the other hand, I tried to use pre-allocation by adding y = zeros(1,n); in the above code where the comment is, but the running time stayed about the same, around 0.01 s. Any ideas why? I was told it would improve by a factor of 2. Am I missing something?
Lastly, is there any type of vectorization in MATLAB that will allow me to forget about the for loop in the above code?
Thanks,
In your code: try with n=10000 and you'll see more of a difference (a factor of almost 10 on my machine).
These allocation effects are most noticeable when the variable is large: in that case it is more costly for MATLAB to repeatedly grow the array dynamically.
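For example, a quick comparison at a larger size (a sketch; exact timings will vary by machine):
n = 100000;
x = linspace(0,1,n);
clear y
tic
for k = 1:n
    y(k) = x(k)^2; % y grows on every iteration
end
toc
y2 = zeros(1,n); % preallocated once
tic
for k = 1:n
    y2(k) = x(k)^2;
end
toc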
To reduce the number of operations, vectorize and reuse intermediate results to avoid explicit powers (Horner's scheme):
y = (1 + x.*(-3/5 + x.*(3/20 - x/60))) ./ (1 + x.*(2/5 - x/20));
Benchmarking:
With n=100:
Parag's / venergiac's solution:
>> tic
for count = 1:100
y=(1-(3/5)*x+(3/20)*x.^2 -(x.^3/60))./(1+(2/5)*x-(1/20)*x.^2);
end
toc
Elapsed time is 0.010769 seconds.
My solution:
>> tic
for count = 1:100
y = (1 + x.*(-3/5 + x.*(3/20 - x/60))) ./ (1 + x.*(2/5 - x/20));
end
toc
Elapsed time is 0.006186 seconds.
You don't need a for loop. Replace the for loop with the following and MATLAB will handle it.
y=(1-(3/5)*x+(3/20)*x.^2 -(x.^3/60))./(1+(2/5)*x-(1/20)*x.^2);
This may give a computational advantage when vectors become larger in size. Smaller size is the reason why you cannot see the effect of pre-allocation. Read this page for additional tips on how to improve the performance.
Edit: I observed that at larger sizes, n >= 10^6, I get a consistent performance improvement when I try the following:
x=0:1/n:1;
instead of using linspace. At n=10^7, I gain 0.05 seconds (0.03 vs 0.08) by NOT using linspace.
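A quick way to check this on your own machine (a sketch; note the colon version produces n+1 points rather than n):
n = 1e7;
tic; x1 = linspace(0,1,n); toc % linspace
tic; x2 = 0:1/n:1;         toc % colon operator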
Try element-wise operations (.*, ./, .^):
clear y;
n=50000;
x=linspace(0,1,n);
% no y pre-allocation using zeros
start_time=tic;
for k=1:n,
y(k) = (1-(3/5)*x(k)+(3/20)*x(k)^2 -(x(k)^3/60)) / (1+(2/5)*x(k)-(1/20)*x(k)^2);
end
elapsed_time1 = toc(start_time);
fprintf('Computational time for serialized solution: %f\n',elapsed_time1);
start_time=tic;
y = (1-(3/5)*x+(3/20)*x.^2 -(x.^3/60)) ./ (1+(2/5)*x-(1/20)*x.^2);
elapsed_time1 = toc(start_time);
fprintf('Computational time for product solution: %f\n',elapsed_time1);
my data:
Computational time for serialized solution: 2.578290
Computational time for product solution: 0.010060

pdist2 equivalent in MATLAB version 7

I need to calculate the Euclidean distance between 2 matrices in MATLAB. Currently I am using bsxfun and calculating the distance as below (I am attaching a snippet of the code):
for i=1:4754
test_data=fea_test(i,:);
d=sqrt(sum(bsxfun(@minus, test_data, fea_train).^2, 2));
end
The size of fea_test is 4754x1024 and of fea_train is 6800x1024. Using this for loop causes the execution to take approximately 12 minutes, which I think is too long.
Is there a way to calculate the euclidean distance between both the matrices faster?
I was told that by removing unnecessary for loops I can reduce the execution time. I also know that pdist2 can help reduce the computation time, but since I am using version 7 of MATLAB I do not have the pdist2 function. Upgrading is not an option.
Any help.
Regards,
Bhavya
Here is a vectorized implementation for computing the Euclidean distance that is much faster than what you have (even significantly faster than PDIST2 on my machine):
D = sqrt( bsxfun(@plus,sum(A.^2,2),sum(B.^2,2)') - 2*(A*B') );
It is based on the fact that: ||u-v||^2 = ||u||^2 + ||v||^2 - 2*u.v
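You can sanity-check that identity numerically with a throwaway sketch:
u = rand(1,5); v = rand(1,5);
lhs = norm(u-v)^2;
rhs = norm(u)^2 + norm(v)^2 - 2*dot(u,v);
abs(lhs - rhs) % should be on the order of eps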
Consider below a crude comparison between the two methods:
A = rand(4754,1024);
B = rand(6800,1024);
tic
D = pdist2(A,B,'euclidean');
toc
tic
DD = sqrt( bsxfun(@plus,sum(A.^2,2),sum(B.^2,2)') - 2*(A*B') );
toc
On my WinXP laptop running R2011b, we can see a 10x improvement in time:
Elapsed time is 70.939146 seconds. %# PDIST2
Elapsed time is 7.879438 seconds. %# vectorized solution
You should be aware that it does not give exactly the same results as PDIST2 down to the smallest precision. By comparing the results, you will see small differences (usually close to eps, the floating-point relative accuracy):
>> max( abs(D(:)-DD(:)) )
ans =
1.0658e-013
On a side note, I've collected around 10 different implementations (some are just small variations of each other) for this distance computation, and have been comparing them. You would be surprised how fast simple loops can be (thanks to the JIT), compared to other vectorized solutions...
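For instance, a plain double loop over the rows (a sketch, not one of the exact implementations I compared) can hold its own against some vectorized variants on recent MATLAB versions:
% naive double loop; the JIT keeps this surprisingly competitive
D2 = zeros(size(A,1), size(B,1));
for i = 1:size(A,1)
    for j = 1:size(B,1)
        D2(i,j) = sqrt(sum((A(i,:) - B(j,:)).^2));
    end
end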
You could fully vectorize the calculation by repeating the rows of fea_test 6800 times, and of fea_train 4754 times, like this:
rA = size(fea_test,1);
rB = size(fea_train,1);
[I,J]=ndgrid(1:rA,1:rB);
d = zeros(rA,rB);
d(:) = sqrt(sum((fea_test(I(:),:)-fea_train(J(:),:)).^2,2));
However, this would lead to intermediate arrays of size 6800x4754x1024 (times 8 bytes for doubles), which would take up ~250 GB of RAM. Thus, the full vectorization won't work.
You can, however, reduce the time of the distance calculation by preallocation, and by not calculating the square root before it's necessary:
rA = size(fea_test,1);
rB = size(fea_train,1);
d = zeros(rA,rB);
for i = 1:rA
test_data=fea_test(i,:);
d(i,:) = sum((test_data(ones(rB,1),:) - fea_train).^2, 2)';
end
d = sqrt(d);
Try this vectorized version; it should be pretty efficient. Edit: just noticed that my answer is similar to @Amro's.
function K = calculateEuclideanDist(P,Q)
% Vectorized method to compute pairwise Euclidean distance
% Returns K(i,j) = sqrt((P(i,:) - Q(j,:))*(P(i,:) - Q(j,:))')
nP = size(P,1);
nQ = size(Q,1);
pmag = sum(P .* P, 2); % squared row norms of P
qmag = sum(Q .* Q, 2); % squared row norms of Q
% max(...,0) guards against tiny negative values due to round-off
K = sqrt(max(ones(nP,1)*qmag' + pmag*ones(1,nQ) - 2*P*Q', 0));
end
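Usage would look something like this (a sketch using the question's matrix sizes):
A = rand(4754,1024); % stands in for fea_test
B = rand(6800,1024); % stands in for fea_train
K = calculateEuclideanDist(A,B); % K is 4754-by-6800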