I have a 3D Mesh grid, X, Y, Z. I want to create a new 3D array that is a function of X, Y, & Z. That function comprises the sum of several 3D Gaussians located at different points. Currently, I have a for loop that runs over the different points where I have my gaussians, and I have an array of center locations r0(nGauss, 1:3)
[X,Y,Z]=meshgrid(-10:.1:10);
Psi=0*X;
for index = 1:nGauss
Psi = Psi + Gauss3D(X,Y,Z,[r0(index,1),r0(index,2),r0(index,3)]);
end
where my 3D gaussian function is
function output=Gauss3D(X,Y,Z,r0)
output=exp(-(X-r0(1)).^2 + (Y-r0(2)).^2 + (Z-r0(3)).^2);
end
I'm happy to redesign the function, which is the slowest part of my code and has to happen many many time, but I can't figure out how to vectorize this so that it will run faster. Any suggestions would be appreciated
*****NB the Original function had a square root in it, and has been modified to make it an actual gaussian***
NOTE! I've modified your code to create a Gaussian, which was:
output=exp(-sqrt((X-r0(1)).^2 + (Y-r0(2)).^2 + (Z-r0(3)).^2));
That does not make a Gaussian. I changed this to:
output = exp(-((X-r0(1)).^2 + (Y-r0(2)).^2 + (Z-r0(3)).^2));
(note no sqrt). This is a Gaussian with sigma = sqrt(1/2).
If this is not what you want, then this answer might not be very useful to you, because your function does not go to 0 as fast as the Gaussian, and therefore is harder to truncate, and it is not separable.
Vectorizing this code is pointless, as the other answers attest. MATLAB's JIT is perfectly capable of running this as fast as it'll go. But you can reduce the amount of computation significantly by noting that the Gaussian goes to almost zero very quickly, and is separable:
Most of the exp evaluations you're doing here yield a very tiny number. You don't need to compute those, just fill in 0.
exp(-x.^2-y.^2) is the same as exp(-x.^2).*exp(-y.^2), which is much cheaper to compute.
Let's put these two things to the test. Here is the test code:
function gaussian_test
N = 100;
r0 = rand(N,3)*20 - 10;
% Original
tic
[X,Y,Z] = meshgrid(-10:.1:10);
Psi1 = zeros(size(X));
for index = 1:N
Psi1 = Psi1 + Gauss3D(X,Y,Z,r0(index,:));
end
t = toc;
fprintf('original, time = %f\n',t)
% Fast, large truncation
tic
[X,Y,Z] = deal(-10:.1:10);
Psi2 = zeros(numel(X),numel(Y),numel(Z));
for index = 1:N
Psi2 = Gauss3D_fast(Psi2,X,Y,Z,r0(index,:),5);
end
t = toc;
fprintf('tuncation = 5, time = %f\n',t)
fprintf('mean abs error = %f\n',mean(reshape(abs(Psi2-Psi1),[],1)))
fprintf('mean square error = %f\n',mean(reshape((Psi2-Psi1).^2,[],1)))
fprintf('max abs error = %f\n',max(reshape(abs(Psi2-Psi1),[],1)))
% Fast, smaller truncation
tic
[X,Y,Z] = deal(-10:.1:10);
Psi3 = zeros(numel(X),numel(Y),numel(Z));
for index = 1:N
Psi3 = Gauss3D_fast(Psi3,X,Y,Z,r0(index,:),3);
end
t = toc;
fprintf('tuncation = 3, time = %f\n',t)
fprintf('mean abs error = %f\n',mean(reshape(abs(Psi3-Psi1),[],1)))
fprintf('mean square error = %f\n',mean(reshape((Psi3-Psi1).^2,[],1)))
fprintf('max abs error = %f\n',max(reshape(abs(Psi3-Psi1),[],1)))
% DIPimage, same smaller truncation
tic
Psi4 = newim(201,201,201);
coords = (r0+10) * 10;
Psi4 = gaussianblob(Psi4,coords,10*sqrt(1/2),(pi*100).^(3/2));
t = toc;
fprintf('DIPimage, time = %f\n',t)
fprintf('mean abs error = %f\n',mean(reshape(abs(Psi4-Psi1),[],1)))
fprintf('mean square error = %f\n',mean(reshape((Psi4-Psi1).^2,[],1)))
fprintf('max abs error = %f\n',max(reshape(abs(Psi4-Psi1),[],1)))
end % of function gaussian_test
function output = Gauss3D(X,Y,Z,r0)
output = exp(-((X-r0(1)).^2 + (Y-r0(2)).^2 + (Z-r0(3)).^2));
end
function Psi = Gauss3D_fast(Psi,X,Y,Z,r0,trunc)
% sigma = sqrt(1/2)
x = X-r0(1);
y = Y-r0(2);
z = Z-r0(3);
mx = abs(x) < trunc*sqrt(1/2);
my = abs(y) < trunc*sqrt(1/2);
mz = abs(z) < trunc*sqrt(1/2);
Psi(my,mx,mz) = Psi(my,mx,mz) + exp(-x(mx).^2) .* reshape(exp(-y(my).^2),[],1) .* reshape(exp(-z(mz).^2),1,1,[]);
% Note! the line above uses implicit singleton expansion. For older MATLABs use bsxfun
end
This is the output on my machine, reordered for readability (I'm still on MATLAB R2017a):
| time(s) | mean abs | mean sq. | max abs
--------------+----------+----------+----------+----------
original | 5.035762 | | |
tuncation = 5 | 0.169807 | 0.000000 | 0.000000 | 0.000005
tuncation = 3 | 0.054737 | 0.000452 | 0.000002 | 0.024378
DIPimage | 0.044099 | 0.000452 | 0.000002 | 0.024378
As you can see, using these two properties of the Gaussian we can reduce time from 5.0 s to 0.17 s, a 30x speedup, with hardly noticeable differences (truncating at 5*sigma). A further 3x speedup can be gained by allowing a small error. The smallest the truncation value, the faster this will go, but the larger the error will be.
I added that last method, the gaussianblob function from DIPimage (I'm an author), just to show that option in case you need to squeeze that bit of extra time from your code. That function is implemented in C++. This version that I used you will need to compile yourself. Our current official release implements this function still in M-file code, and is not as fast.
Further chance of improvement is if the fractional part of the coordinates is always the same (w.r.t. the pixel grid). In this case, you can draw the Gaussian once, and shift it over to each of the centroids.
Another alternative involves computing the Gaussian once, at a somewhat larger scale, and interpolating into it to generate each of the 1D Gaussians needed to generate the output. I did not implement this, I have no idea if it will be faster or if the time difference will be significant. In the old days, exp was expensive, I'm not sure this is still the case.
So, I am building off of the answer above me #Durkee. I enjoy these kinds of problems, so I thought a little about how to make each of the expansions implicit, and I have the one-line function below. Using this function I shaved .11 seconds off of the call, which is completely negligible. It looks like yours is pretty decent. The only advantage of mine might be how the code scales on a finer mesh.
xLin = [-10:.1:10]';
tic
psi2 = sum(exp(-sqrt((permute(xLin-r0(:,1)',[3 1 4 2])).^2 ...
+ (permute(xLin-r0(:,2)',[1 3 4 2])).^2 ...
+ (permute(xLin-r0(:,3)',[3 4 1 2])).^2)),4);
toc
The relative run times on my computer were (all things kept the same):
Original - 1.234085
Other - 2.445375
Mine - 1.120701
So this is a bit of an unusual problem where on my computer the unvectorized code actually works better than the vectorized code, here is my script
clear
[X,Y,Z]=meshgrid(-10:.1:10);
Psi=0*X;
nGauss = 20; %Sample nGauss as you didn't specify
r0 = rand(nGauss,3); % Just make this up as it doesn't really matter in this case
% Your original code
tic
for index = 1:nGauss
Psi = Psi + Gauss3D(X,Y,Z,[r0(index,1),r0(index,2),r0(index,3)]);
end
toc
% Vectorize these functions so we can use implicit broadcasting
X1 = X(:);
Y1 = Y(:);
Z1 = Z(:);
tic
val = [X1 Y1 Z1];
% Change the dimensions so that r0 operates on the right elements
r0_temp = permute(r0,[3 2 1]);
% Perform the gaussian combination
out = sum(exp(-sqrt(sum((val-r0_temp).^2,2))),3);
toc
% Check to make sure both functions match
sum(abs(vec(Psi)-vec(out)))
function output=Gauss3D(X,Y,Z,r0)
output=exp(-sqrt((X-r0(1)).^2 + (Y-r0(2)).^2 + (Z-r0(3)).^2));
end
function out = vec(in)
out = in(:);
end
As you can see, this is probably about as vectorized as you can get. The whole function is done using broadcasting and vectorized operations which normally improve performance ten-one hundredfold. However, in this case, this is not what we see
Elapsed time is 1.876460 seconds.
Elapsed time is 2.909152 seconds.
This actually shows the unvectorized version as being faster.
There could be a few reasons for this of which I am by no means an expert.
MATLAB uses a JIT compiler now which means that for loops are no longer inefficient.
Your code is already reasonably vectorized, you are operating at 8 million elements at once
Unless nGauss is 1000 or something, you're not looping through that much, and at that point, vectorization means you will run out of memory
I could be hitting some memory threshold where I am using too much memory and that is making my code inefficient, I noticed that when I lowered the resolution on the meshgrid the vectorized version worked better
As an aside, I tested this on my GTX 1060 GPU with single precision(single precision is 10x faster than double precision on most GPUs)
Elapsed time is 0.087405 seconds.
Elapsed time is 0.241456 seconds.
Once again the unvectorized version is faster, sorry I couldn't help you out but it seems that your code is about as good as you are going to get unless you lower the tolerances on your meshgrid.
This post builds on my post about quickly evaluating analytic Jacobian in Matlab:
fast evaluation of analytical jacobian in MATLAB
The key difference is that now, I am working with the Hessian and I have to evaluate close to 700 matlabFunctions (instead of 1 matlabFunction, like I did for the Jacobian) each time the hessian is evaluated. So there is an opportunity to do things a little differently.
I have tried to do this two ways so far and I am thinking about implementing a third and was wondering if anyone has any other suggestions. I will go through each method with a toy example, but first some preprocessing to generate these matlabFunctions:
PreProcessing:
% This part of the code is calculated once, it is not the issue
dvs = 5;
X=sym('X',[dvs,1]);
num = dvs - 1; % number of constraints
% multiple functions
for k = 1:num
f1(X(k+1),X(k)) = (X(k+1)^3 - X(k)^2*k^2);
c(k) = f1;
end
gradc = jacobian(c,X).'; % .' performs transpose
parfor k = 1:num
hessc{k} = jacobian(gradc(:,k),X);
end
parfor k = 1:num
hess_name = strcat('hessian_',num2str(k));
matlabFunction(hessc{k},'file',hess_name,'vars',X);
end
METHOD #1 : Evaluate functions in series
%% Now we use the functions to run an "optimization." Just for an example the "optimization" is just a for loop
fprintf('This is test A, where the functions are evaluated in series!\n');
tic
for q = 1:10
x_dv = rand(dvs,1); % these are the design variables
lambda = rand(num,1); % these are the lagrange multipliers
x_dv_cell = num2cell(x_dv); % for passing large design variables
for k = 1:num
hess_name = strcat('hessian_',num2str(k));
function_handle = str2func(hess_name);
H_temp(:,:,k) = lambda(k)*function_handle(x_dv_cell{:});
end
H = sum(H_temp,3);
end
fprintf('The time for test A was:\n')
toc
METHOD # 2: Evaluate functions in parallel
%% Try to run a parfor loop
fprintf('This is test B, where the functions are evaluated in parallel!\n');
tic
for q = 1:10
x_dv = rand(dvs,1); % these are the design variables
lambda = rand(num,1); % these are the lagrange multipliers
x_dv_cell = num2cell(x_dv); % for passing large design variables
parfor k = 1:num
hess_name = strcat('hessian_',num2str(k));
function_handle = str2func(hess_name);
H_temp(:,:,k) = lambda(k)*function_handle(x_dv_cell{:});
end
H = sum(H_temp,3);
end
fprintf('The time for test B was:\n')
toc
RESULTS:
METHOD #1 = 0.008691 seconds
METHOD #2 = 0.464786 seconds
DISCUSSION of RESULTS
This result makes sense because, the functions evaluate very quickly and running them in parallel waists a lot of time setting up and sending out the jobs to the different Matlabs ( and then getting the data back from them). I see the same result on my actual problem.
METHOD # 3: Evaluating the functions using the GPU
I have not tried this yet, but I am interested to see what the performance difference is. I am not yet familiar with doing this in Matlab and will add it once I am done.
Any other thoughts? Comments? Thanks!
In Matlab I need to accumulate overlapping diagonal blocks of a large matrix. The sample code is given below.
Since this piece of code needs to run several times, it consumes a lot of resources. The process is used in array signal processing for a so-called subarray smoothing or spatial smoothing. Is there any way to do this faster?
% some values for parameters
M = 1000; % size of array
m = 400; % size of subarray
n = M-m+1; % number of subarrays
R = randn(M)+1i*rand(M);
% main code
S = R(1:m,1:m);
for i = 2:n
S = S + R(i:m+i-1,i:m+i-1);
end
ATTEMPTS:
1) I tried the following alternative vectorized version, but unfortunately it became much slower!
[X,Y] = meshgrid(1:m);
inds1 = sub2ind([M,M],Y(:),X(:));
steps = (0:n-1)*(M+1);
inds = repmat(inds1,1,n) + repmat(steps,m^2,1);
RR = sum(R(inds),2);
S = reshape(RR,m,m);
2) I used Matlab coder to create a MEX file and it became much slower!
I've personally had to fasten up some portions of my code lately. Being not an expert at all, I would recommend trying the following:
1) Vectorize:
Getting rid of the for-loop
S = R(1:m,1:m);
for i = 2:n
S = S + R(i:m+i-1,i:m+i-1)
end
and replacing it for an alternative based on cumsum should be the way to go here.
Note: will try and work on this approach on a future Edit
2) Generating a MEX-file:
In some instances, you could simply fire up the Matlab Coder app (given that you have it in your current Matlab version).
This should generate a .mex file for you, that you can call as it was the function that you are trying to replace.
Regardless of your choice (1) or 2)), you should profile your current implementation with tic; my_function(); toc; for a fair number of function calls, and compare it with your current implementation:
my_time = zeros(1,10000);
for count = 1:10000
tic;
my_function();
my_time(count) = toc;
end
mean(my_time)
I am trying to optimize my code and am not sure how and if I would be able to vectorize this particular section??
for base_num = 1:base_length
for sub_num = 1:base_length
dist{base_num}(sub_num) = sqrt((x(base_num) - x(sub_num))^2 + (y(base_num) - y(sub_num))^2);
end
end
The following example provides one method of vectorization:
%# Set example parameters
N = 10;
X = randn(N, 1);
Y = randn(N, 1);
%# Your loop based solution
Dist1 = cell(N, 1);
for n = 1:N
for m = 1:N
Dist1{n}(m) = sqrt((X(n) - X(m))^2 + (Y(n) - Y(m))^2);
end
end
%# My vectorized solution
Dist2 = sqrt(bsxfun(#minus, X, X').^2 + bsxfun(#minus, Y, Y').^2);
Dist2Cell = num2cell(Dist2, 2);
A quick speed test at N = 1000 has the vectorized solution running two orders of magnitude faster than the loop solution.
Note: I've used a second line in my vectorized solution to mimic your cell array output structure. Up to you whether you want to include it or two combine it into one line etc.
By the way, +1 for posting code in the question. However, two small suggestions for the future: 1) When posting to SO, use simple variable names - especially for loop subscripts - such as I have in my answer. 2) It is nice when we can copy and paste example code straight into a script and run it without having to do any changes or additions (again such as in my answer). This allows us to converge on a solution more rapidly.
The code in question is here:
function k = whileloop(odefun,args)
...
while (sign(costheta) == originalsign)
y=y(:) + odefun(0,y(:),vars,param)*(dt); % Line 4
costheta = dot(y-normpt,normvec);
k = k + 1;
end
...
end
and to clarify, odefun is F1.m, an m-file of mine. I pass it into the function that contains this while-loop. It's something like whileloop(#F1,args). Line 4 in the code-block above is the Euler method.
The reason I'm using a while-loop is because I want to trigger upon the vector "y" crossing a plane defined by a point, "normpt", and the vector normal to the plane, "normvec".
Is there an easy change to this code that will speed it up dramatically? Should I attempt learning how to make mex files instead (for a speed increase)?
Edit:
Here is a rushed attempt at an example of what one could try to test with. I have not debugged this. It is to give you an idea:
%Save the following 3 lines in an m-file named "F1.m"
function ydot = F1(placeholder1,y,placeholder2,placeholder3)
ydot = y/10;
end
%Run the following:
dt = 1.5e-12 %I do not know about this. You will have to experiment.
y0 = [.1,.1,.1];
normpt = [3,3,3];
normvec = [1,1,1];
originalsign = sign(dot(y0-normpt,normvec));
costheta = originalsign;
y = y0;
k = 0;
while (sign(costheta) == originalsign)
y=y(:) + F1(0,y(:),0,0)*(dt); % Line 4
costheta = dot(y-normpt,normvec);
k = k + 1;
end
disp(k);
dt should be sufficiently small that it takes hundreds of thousands of iterations to trigger.
Assume I must use the Euler method. I have a stochastic differential equation with state-dependent noise if you are curious as to why I tell you to take such an assumption.
I would focus on your actual ODE integration. The fewer steps you have to take, the faster the loop will run. I would only worry about the speed of the sign check after you've optimized the actual integration method.
It looks like you're using the first-order explicit Euler method. Have you tried a higher-order integrator or an implicit method? Often you can increase the time step significantly.