MATLAB fast (componentwise) vector operations are...really fast - matlab

I am writing MATLAB scripts since some time and, still, I do not understand how it works "under the hood". Consider the following script, that do some computation using (big) vectors in three different ways:
MATLAB vector operations;
Simple for cycle that do the same computation component-wise;
An optimized cycle that is supposed to be faster than 2. since avoid some allocation and some assignment.
Here is the code:
N = 10000000;
A = linspace(0,100,N);
B = linspace(-100,100,N);
C = linspace(0,200,N);
D = linspace(100,200,N);
% 1. MATLAB Operations
tic
C_ = C./A;
D_ = D./B;
G_ = (A+B)/2;
H_ = (C_+D_)/2;
I_ = (C_.^2+D_.^2)/2;
X = G_ .* H_;
Y = G_ .* H_.^2 + I_;
toc
tic
X;
Y;
toc
% 2. Simple cycle
tic
C_ = zeros(1,N);
D_ = zeros(1,N);
G_ = zeros(1,N);
H_ = zeros(1,N);
I_ = zeros(1,N);
X = zeros(1,N);
Y = zeros(1,N);
for i = 1:N,
C_(i) = C(i)/A(i);
D_(i) = D(i)/B(i);
G_(i) = (A(i)+B(i))/2;
H_(i) = (C_(i)+D_(i))/2;
I_(i) = (C_(i)^2+D_(i)^2)/2;
X(i) = G_(i) * H_(i);
Y(i) = G_(i) * H_(i)^2 + I_(i);
end
toc
tic
X;
Y;
toc
% 3. Opzimized cycle
tic
X = zeros(1,N);
Y = zeros(1,N);
for i = 1:N,
X(i) = (A(i)+B(i))/2 * (( C(i)/A(i) + D(i)/B(i) ) /2);
Y(i) = (A(i)+B(i))/2 * (( C(i)/A(i) + D(i)/B(i) ) /2)^2 + ( (C(i)/A(i))^2 + (D(i)/B(i))^2 ) / 2;
end
toc
tic
X;
Y;
toc
I know that one shall always try to vectorize computations, being MATLAB build over matrices/vectors (thus, nowadays, it is not always the best choice), so I am expecting that something like:
C = A .* B;
is faster than:
for i in 1:N,
C(i) = A(i) * B(i);
end
What I am not expecting is that it is actually faster even in the above script, despite the second and the third methods I am using go through only one cycle, whereas the first method performs many vector operations (which, theoretically, are a "for" cycle every time). This force me to conclude that MATLAB has some magic that permit (for example) to:
C = A .* B;
D = C .* C;
to be run faster than a single "for" cycle with some operation inside it.
So:
what is the magic that avoid the 1st part to be executed so fast?
when you write "D= A .* B" does MATLAB actually do a component wise computation with a "for" cycle, or simply keeps track that D contains some multiplication of "bla" and "bla"?
EDIT
suppose I want to implement the same computation using C++ (using maybe some library). Will be the first method of MATLAB be faster even than the third one implemented in C++? (I'll answer to this question myself, just give me some time.)
EDIT 2
As requested, here there are the experiment runtimes:
Part 1: 0.237143
Part 2: 4.440132
of which 0.195154 for allocation
Part 3: 2.280640
of which 0.057500 for allocation
and without JIT:
Part 1: 0.337259
Part 2: 149.602017
of which 0.033886 for allocation
Part 3: 82.167713
of which 0.010852 for allocation

The first one is the fastest because vectorized code can be easily interpreted to a small number of optimized C++ library calls. Matlab could also optimize it at more high level, for example, replace G*H+I with an optimized mul_add(G,H,I) instead of add(mul(G,H),I) in its core.
The second one can't be converted to C++ calls easily. It has to be interpreted or compiled. The most modern approach for scripting languages is JIT-compilation. The Matlab JIT compiler is not very good but it doesn't mean it has to be so. I don't know why MathWorks don't improve it. Thus #2 performs so slow that #1 is faster even it makes more "mathematical" operations.
Julia language was invented to be a compromise between Matlab expression and C++ speed. The same non-vectorized code (julia vs matlab) works very fast because JIT-compilation is very good.

Regarding performance optimization I follow #memyself suggestion using the profiler for both approaches as mentioned in 'for' loop vs vectorization in MATLAB.
For educational purposes it does make sense to experiment with numerical algorithms, for anything else I would go with well proven libraries.

Related

Vectorizing a matlab loop with internal functions

I have a 3D Mesh grid, X, Y, Z. I want to create a new 3D array that is a function of X, Y, & Z. That function comprises the sum of several 3D Gaussians located at different points. Currently, I have a for loop that runs over the different points where I have my gaussians, and I have an array of center locations r0(nGauss, 1:3)
[X,Y,Z]=meshgrid(-10:.1:10);
Psi=0*X;
for index = 1:nGauss
Psi = Psi + Gauss3D(X,Y,Z,[r0(index,1),r0(index,2),r0(index,3)]);
end
where my 3D gaussian function is
function output=Gauss3D(X,Y,Z,r0)
output=exp(-(X-r0(1)).^2 + (Y-r0(2)).^2 + (Z-r0(3)).^2);
end
I'm happy to redesign the function, which is the slowest part of my code and has to happen many many time, but I can't figure out how to vectorize this so that it will run faster. Any suggestions would be appreciated
*****NB the Original function had a square root in it, and has been modified to make it an actual gaussian***
NOTE! I've modified your code to create a Gaussian, which was:
output=exp(-sqrt((X-r0(1)).^2 + (Y-r0(2)).^2 + (Z-r0(3)).^2));
That does not make a Gaussian. I changed this to:
output = exp(-((X-r0(1)).^2 + (Y-r0(2)).^2 + (Z-r0(3)).^2));
(note no sqrt). This is a Gaussian with sigma = sqrt(1/2).
If this is not what you want, then this answer might not be very useful to you, because your function does not go to 0 as fast as the Gaussian, and therefore is harder to truncate, and it is not separable.
Vectorizing this code is pointless, as the other answers attest. MATLAB's JIT is perfectly capable of running this as fast as it'll go. But you can reduce the amount of computation significantly by noting that the Gaussian goes to almost zero very quickly, and is separable:
Most of the exp evaluations you're doing here yield a very tiny number. You don't need to compute those, just fill in 0.
exp(-x.^2-y.^2) is the same as exp(-x.^2).*exp(-y.^2), which is much cheaper to compute.
Let's put these two things to the test. Here is the test code:
function gaussian_test
N = 100;
r0 = rand(N,3)*20 - 10;
% Original
tic
[X,Y,Z] = meshgrid(-10:.1:10);
Psi1 = zeros(size(X));
for index = 1:N
Psi1 = Psi1 + Gauss3D(X,Y,Z,r0(index,:));
end
t = toc;
fprintf('original, time = %f\n',t)
% Fast, large truncation
tic
[X,Y,Z] = deal(-10:.1:10);
Psi2 = zeros(numel(X),numel(Y),numel(Z));
for index = 1:N
Psi2 = Gauss3D_fast(Psi2,X,Y,Z,r0(index,:),5);
end
t = toc;
fprintf('tuncation = 5, time = %f\n',t)
fprintf('mean abs error = %f\n',mean(reshape(abs(Psi2-Psi1),[],1)))
fprintf('mean square error = %f\n',mean(reshape((Psi2-Psi1).^2,[],1)))
fprintf('max abs error = %f\n',max(reshape(abs(Psi2-Psi1),[],1)))
% Fast, smaller truncation
tic
[X,Y,Z] = deal(-10:.1:10);
Psi3 = zeros(numel(X),numel(Y),numel(Z));
for index = 1:N
Psi3 = Gauss3D_fast(Psi3,X,Y,Z,r0(index,:),3);
end
t = toc;
fprintf('tuncation = 3, time = %f\n',t)
fprintf('mean abs error = %f\n',mean(reshape(abs(Psi3-Psi1),[],1)))
fprintf('mean square error = %f\n',mean(reshape((Psi3-Psi1).^2,[],1)))
fprintf('max abs error = %f\n',max(reshape(abs(Psi3-Psi1),[],1)))
% DIPimage, same smaller truncation
tic
Psi4 = newim(201,201,201);
coords = (r0+10) * 10;
Psi4 = gaussianblob(Psi4,coords,10*sqrt(1/2),(pi*100).^(3/2));
t = toc;
fprintf('DIPimage, time = %f\n',t)
fprintf('mean abs error = %f\n',mean(reshape(abs(Psi4-Psi1),[],1)))
fprintf('mean square error = %f\n',mean(reshape((Psi4-Psi1).^2,[],1)))
fprintf('max abs error = %f\n',max(reshape(abs(Psi4-Psi1),[],1)))
end % of function gaussian_test
function output = Gauss3D(X,Y,Z,r0)
output = exp(-((X-r0(1)).^2 + (Y-r0(2)).^2 + (Z-r0(3)).^2));
end
function Psi = Gauss3D_fast(Psi,X,Y,Z,r0,trunc)
% sigma = sqrt(1/2)
x = X-r0(1);
y = Y-r0(2);
z = Z-r0(3);
mx = abs(x) < trunc*sqrt(1/2);
my = abs(y) < trunc*sqrt(1/2);
mz = abs(z) < trunc*sqrt(1/2);
Psi(my,mx,mz) = Psi(my,mx,mz) + exp(-x(mx).^2) .* reshape(exp(-y(my).^2),[],1) .* reshape(exp(-z(mz).^2),1,1,[]);
% Note! the line above uses implicit singleton expansion. For older MATLABs use bsxfun
end
This is the output on my machine, reordered for readability (I'm still on MATLAB R2017a):
| time(s) | mean abs | mean sq. | max abs
--------------+----------+----------+----------+----------
original | 5.035762 | | |
tuncation = 5 | 0.169807 | 0.000000 | 0.000000 | 0.000005
tuncation = 3 | 0.054737 | 0.000452 | 0.000002 | 0.024378
DIPimage | 0.044099 | 0.000452 | 0.000002 | 0.024378
As you can see, using these two properties of the Gaussian we can reduce time from 5.0 s to 0.17 s, a 30x speedup, with hardly noticeable differences (truncating at 5*sigma). A further 3x speedup can be gained by allowing a small error. The smallest the truncation value, the faster this will go, but the larger the error will be.
I added that last method, the gaussianblob function from DIPimage (I'm an author), just to show that option in case you need to squeeze that bit of extra time from your code. That function is implemented in C++. This version that I used you will need to compile yourself. Our current official release implements this function still in M-file code, and is not as fast.
Further chance of improvement is if the fractional part of the coordinates is always the same (w.r.t. the pixel grid). In this case, you can draw the Gaussian once, and shift it over to each of the centroids.
Another alternative involves computing the Gaussian once, at a somewhat larger scale, and interpolating into it to generate each of the 1D Gaussians needed to generate the output. I did not implement this, I have no idea if it will be faster or if the time difference will be significant. In the old days, exp was expensive, I'm not sure this is still the case.
So, I am building off of the answer above me #Durkee. I enjoy these kinds of problems, so I thought a little about how to make each of the expansions implicit, and I have the one-line function below. Using this function I shaved .11 seconds off of the call, which is completely negligible. It looks like yours is pretty decent. The only advantage of mine might be how the code scales on a finer mesh.
xLin = [-10:.1:10]';
tic
psi2 = sum(exp(-sqrt((permute(xLin-r0(:,1)',[3 1 4 2])).^2 ...
+ (permute(xLin-r0(:,2)',[1 3 4 2])).^2 ...
+ (permute(xLin-r0(:,3)',[3 4 1 2])).^2)),4);
toc
The relative run times on my computer were (all things kept the same):
Original - 1.234085
Other - 2.445375
Mine - 1.120701
So this is a bit of an unusual problem where on my computer the unvectorized code actually works better than the vectorized code, here is my script
clear
[X,Y,Z]=meshgrid(-10:.1:10);
Psi=0*X;
nGauss = 20; %Sample nGauss as you didn't specify
r0 = rand(nGauss,3); % Just make this up as it doesn't really matter in this case
% Your original code
tic
for index = 1:nGauss
Psi = Psi + Gauss3D(X,Y,Z,[r0(index,1),r0(index,2),r0(index,3)]);
end
toc
% Vectorize these functions so we can use implicit broadcasting
X1 = X(:);
Y1 = Y(:);
Z1 = Z(:);
tic
val = [X1 Y1 Z1];
% Change the dimensions so that r0 operates on the right elements
r0_temp = permute(r0,[3 2 1]);
% Perform the gaussian combination
out = sum(exp(-sqrt(sum((val-r0_temp).^2,2))),3);
toc
% Check to make sure both functions match
sum(abs(vec(Psi)-vec(out)))
function output=Gauss3D(X,Y,Z,r0)
output=exp(-sqrt((X-r0(1)).^2 + (Y-r0(2)).^2 + (Z-r0(3)).^2));
end
function out = vec(in)
out = in(:);
end
As you can see, this is probably about as vectorized as you can get. The whole function is done using broadcasting and vectorized operations which normally improve performance ten-one hundredfold. However, in this case, this is not what we see
Elapsed time is 1.876460 seconds.
Elapsed time is 2.909152 seconds.
This actually shows the unvectorized version as being faster.
There could be a few reasons for this of which I am by no means an expert.
MATLAB uses a JIT compiler now which means that for loops are no longer inefficient.
Your code is already reasonably vectorized, you are operating at 8 million elements at once
Unless nGauss is 1000 or something, you're not looping through that much, and at that point, vectorization means you will run out of memory
I could be hitting some memory threshold where I am using too much memory and that is making my code inefficient, I noticed that when I lowered the resolution on the meshgrid the vectorized version worked better
As an aside, I tested this on my GTX 1060 GPU with single precision(single precision is 10x faster than double precision on most GPUs)
Elapsed time is 0.087405 seconds.
Elapsed time is 0.241456 seconds.
Once again the unvectorized version is faster, sorry I couldn't help you out but it seems that your code is about as good as you are going to get unless you lower the tolerances on your meshgrid.

Quickly Evaluating MANY matlabFunctions

This post builds on my post about quickly evaluating analytic Jacobian in Matlab:
fast evaluation of analytical jacobian in MATLAB
The key difference is that now, I am working with the Hessian and I have to evaluate close to 700 matlabFunctions (instead of 1 matlabFunction, like I did for the Jacobian) each time the hessian is evaluated. So there is an opportunity to do things a little differently.
I have tried to do this two ways so far and I am thinking about implementing a third and was wondering if anyone has any other suggestions. I will go through each method with a toy example, but first some preprocessing to generate these matlabFunctions:
PreProcessing:
% This part of the code is calculated once, it is not the issue
dvs = 5;
X=sym('X',[dvs,1]);
num = dvs - 1; % number of constraints
% multiple functions
for k = 1:num
f1(X(k+1),X(k)) = (X(k+1)^3 - X(k)^2*k^2);
c(k) = f1;
end
gradc = jacobian(c,X).'; % .' performs transpose
parfor k = 1:num
hessc{k} = jacobian(gradc(:,k),X);
end
parfor k = 1:num
hess_name = strcat('hessian_',num2str(k));
matlabFunction(hessc{k},'file',hess_name,'vars',X);
end
METHOD #1 : Evaluate functions in series
%% Now we use the functions to run an "optimization." Just for an example the "optimization" is just a for loop
fprintf('This is test A, where the functions are evaluated in series!\n');
tic
for q = 1:10
x_dv = rand(dvs,1); % these are the design variables
lambda = rand(num,1); % these are the lagrange multipliers
x_dv_cell = num2cell(x_dv); % for passing large design variables
for k = 1:num
hess_name = strcat('hessian_',num2str(k));
function_handle = str2func(hess_name);
H_temp(:,:,k) = lambda(k)*function_handle(x_dv_cell{:});
end
H = sum(H_temp,3);
end
fprintf('The time for test A was:\n')
toc
METHOD # 2: Evaluate functions in parallel
%% Try to run a parfor loop
fprintf('This is test B, where the functions are evaluated in parallel!\n');
tic
for q = 1:10
x_dv = rand(dvs,1); % these are the design variables
lambda = rand(num,1); % these are the lagrange multipliers
x_dv_cell = num2cell(x_dv); % for passing large design variables
parfor k = 1:num
hess_name = strcat('hessian_',num2str(k));
function_handle = str2func(hess_name);
H_temp(:,:,k) = lambda(k)*function_handle(x_dv_cell{:});
end
H = sum(H_temp,3);
end
fprintf('The time for test B was:\n')
toc
RESULTS:
METHOD #1 = 0.008691 seconds
METHOD #2 = 0.464786 seconds
DISCUSSION of RESULTS
This result makes sense because, the functions evaluate very quickly and running them in parallel waists a lot of time setting up and sending out the jobs to the different Matlabs ( and then getting the data back from them). I see the same result on my actual problem.
METHOD # 3: Evaluating the functions using the GPU
I have not tried this yet, but I am interested to see what the performance difference is. I am not yet familiar with doing this in Matlab and will add it once I am done.
Any other thoughts? Comments? Thanks!

How to accumulate submatrices without looping (subarray smoothing)?

In Matlab I need to accumulate overlapping diagonal blocks of a large matrix. The sample code is given below.
Since this piece of code needs to run several times, it consumes a lot of resources. The process is used in array signal processing for a so-called subarray smoothing or spatial smoothing. Is there any way to do this faster?
% some values for parameters
M = 1000; % size of array
m = 400; % size of subarray
n = M-m+1; % number of subarrays
R = randn(M)+1i*rand(M);
% main code
S = R(1:m,1:m);
for i = 2:n
S = S + R(i:m+i-1,i:m+i-1);
end
ATTEMPTS:
1) I tried the following alternative vectorized version, but unfortunately it became much slower!
[X,Y] = meshgrid(1:m);
inds1 = sub2ind([M,M],Y(:),X(:));
steps = (0:n-1)*(M+1);
inds = repmat(inds1,1,n) + repmat(steps,m^2,1);
RR = sum(R(inds),2);
S = reshape(RR,m,m);
2) I used Matlab coder to create a MEX file and it became much slower!
I've personally had to fasten up some portions of my code lately. Being not an expert at all, I would recommend trying the following:
1) Vectorize:
Getting rid of the for-loop
S = R(1:m,1:m);
for i = 2:n
S = S + R(i:m+i-1,i:m+i-1)
end
and replacing it for an alternative based on cumsum should be the way to go here.
Note: will try and work on this approach on a future Edit
2) Generating a MEX-file:
In some instances, you could simply fire up the Matlab Coder app (given that you have it in your current Matlab version).
This should generate a .mex file for you, that you can call as it was the function that you are trying to replace.
Regardless of your choice (1) or 2)), you should profile your current implementation with tic; my_function(); toc; for a fair number of function calls, and compare it with your current implementation:
my_time = zeros(1,10000);
for count = 1:10000
tic;
my_function();
my_time(count) = toc;
end
mean(my_time)

MATLAB - Vectorize a double loop containing a distance measure

I am trying to optimize my code and am not sure how and if I would be able to vectorize this particular section??
for base_num = 1:base_length
for sub_num = 1:base_length
dist{base_num}(sub_num) = sqrt((x(base_num) - x(sub_num))^2 + (y(base_num) - y(sub_num))^2);
end
end
The following example provides one method of vectorization:
%# Set example parameters
N = 10;
X = randn(N, 1);
Y = randn(N, 1);
%# Your loop based solution
Dist1 = cell(N, 1);
for n = 1:N
for m = 1:N
Dist1{n}(m) = sqrt((X(n) - X(m))^2 + (Y(n) - Y(m))^2);
end
end
%# My vectorized solution
Dist2 = sqrt(bsxfun(#minus, X, X').^2 + bsxfun(#minus, Y, Y').^2);
Dist2Cell = num2cell(Dist2, 2);
A quick speed test at N = 1000 has the vectorized solution running two orders of magnitude faster than the loop solution.
Note: I've used a second line in my vectorized solution to mimic your cell array output structure. Up to you whether you want to include it or two combine it into one line etc.
By the way, +1 for posting code in the question. However, two small suggestions for the future: 1) When posting to SO, use simple variable names - especially for loop subscripts - such as I have in my answer. 2) It is nice when we can copy and paste example code straight into a script and run it without having to do any changes or additions (again such as in my answer). This allows us to converge on a solution more rapidly.

Matlab: Optimizing speed of function call and cosine sign-finding in a loop

The code in question is here:
function k = whileloop(odefun,args)
...
while (sign(costheta) == originalsign)
y=y(:) + odefun(0,y(:),vars,param)*(dt); % Line 4
costheta = dot(y-normpt,normvec);
k = k + 1;
end
...
end
and to clarify, odefun is F1.m, an m-file of mine. I pass it into the function that contains this while-loop. It's something like whileloop(#F1,args). Line 4 in the code-block above is the Euler method.
The reason I'm using a while-loop is because I want to trigger upon the vector "y" crossing a plane defined by a point, "normpt", and the vector normal to the plane, "normvec".
Is there an easy change to this code that will speed it up dramatically? Should I attempt learning how to make mex files instead (for a speed increase)?
Edit:
Here is a rushed attempt at an example of what one could try to test with. I have not debugged this. It is to give you an idea:
%Save the following 3 lines in an m-file named "F1.m"
function ydot = F1(placeholder1,y,placeholder2,placeholder3)
ydot = y/10;
end
%Run the following:
dt = 1.5e-12 %I do not know about this. You will have to experiment.
y0 = [.1,.1,.1];
normpt = [3,3,3];
normvec = [1,1,1];
originalsign = sign(dot(y0-normpt,normvec));
costheta = originalsign;
y = y0;
k = 0;
while (sign(costheta) == originalsign)
y=y(:) + F1(0,y(:),0,0)*(dt); % Line 4
costheta = dot(y-normpt,normvec);
k = k + 1;
end
disp(k);
dt should be sufficiently small that it takes hundreds of thousands of iterations to trigger.
Assume I must use the Euler method. I have a stochastic differential equation with state-dependent noise if you are curious as to why I tell you to take such an assumption.
I would focus on your actual ODE integration. The fewer steps you have to take, the faster the loop will run. I would only worry about the speed of the sign check after you've optimized the actual integration method.
It looks like you're using the first-order explicit Euler method. Have you tried a higher-order integrator or an implicit method? Often you can increase the time step significantly.