Speed-up mex files in Matlab

Speed-up mex files in Matlab - matlab

I have a big loop that I need to compute many times in my code and I thought that it would be possible to speed it up using MATLABcoder. However, the mex file version of my code (the code for the function is attached below) ends up being slower than the m-file. I compiled the code using MINGW64 compiler.
I tried different compiler settings (played with coder.config) and different memory specifications but without any success.
function v_d = test_code(q,v_exp,wealth,b_grid_choice,k_grid_choice,nz,nk,nb)
%#codegen
% Inputs
% q is an array of dimension [nz nk nb]
% v_exp is an array of dimension [nz nk nb]
% wealth is an array of dimension [nz nk nb]
% b_grid_choice is an array of dimension [nk nb]
% k_grid_choice is an array of dimension [nk nb]
% Typically, nz/nk/nb is an integer (currently up to 200)
v_d = coder.nullcopy(zeros(nz,nk,nb));
parfor ii = 1:nz
q_ini_ii = reshape(q(ii,:,:),[],nz);
v_ini_exp_ii = reshape(v_exp(ii,:,:),[],nz);
choice = q_ini_ii.*b_grid_choice - k_grid_choice;
for jj = 1:nk
for kk = 1:nb
% dividends at time t
d = wealth(ii,jj,kk) + choice;
% choosing optimal consumption
vec_d = d + v_ini_exp_ii.*(d>0);
v_d(ii,jj,kk) = max(vec_d,[],'all');
end
end
end
When I run this code when nz=nk=nb=100, the m-file takes 2.5s while the generated mex file takes 7.5s. I tried using both MATLABcoder app as well as the codegen command
codegen -O enable:OpenMP test_code -args {q,v_exp,wealth,b_grid_choice,k_grid_choice,nz,nk,nb}
I also played with codegen.config('mex') but with little impact on the performance.
When I test the speed of the above code I simply generate these matrices using randn command and then pass them into the code. Interestingly, when I generate these matrices inside the function then this has no effect on the speed of the m-file but speeds-up the mex file almost 10 times (so three times faster than the m-file)! This suggests to me that there is a way to speed up that code, but despite my effort I couldn't figure it out. I thought that memory allocation could be behind it, but my attempts with playing with dynamic memory specifications did not lead to any gains either...

Related

Separate definition of constant values and dependent parameters in Matlab

In my code, I have a lot of constant values and parameters that take significant space in my code.
For example in C++, I would make a header and a separate file where I would define these parameters as, e.g., "const-type" and share the header with main or other .cpp files.
How do you keep such structuring in MATLAB, is it worth it?
An example: Coefficients.m looks as follows:
classdef coefficients
properties(Constant)
% NIST data
A_N = 28.98641;
end
end
Another file: Gas.m where I would like to use A_N looks as follows:
function Gas()
clear all
clc
import Coefficients.* % Does not work
% A simple print
Values.A_N % Does not work
coefficients.A_N % Does not work
Constant.A_N % Does not work
end

Ok so assuming the class coefficients defined as:
classdef coefficients
properties(Constant)
% NIST data
A_N = 28.98641;
end
end
This code must be saved in a file named coefficients.m (The class name and file name have to match to avoid weird effect sometimes).
Then assuming the following Gas.m file:
function Gas()
% Usage with "disp()"
disp('The A_N coefficient taken from NIST data is:')
disp(coefficients.A_N)
% Usage with fprintf
fprintf('\nFrom NIST data, the coefficient A_N = %f\n',coefficients.A_N)
% Usage in calculation (just use it as if it was a variable/constant name
AN2 = coefficients.A_N^2 ;
fprintf('\nA_N coefficient squared = %.2f\n',AN2)
% If you want a shorter notation, you can copy the coefficient value in
% a variable with a shorter name, then use that variable later in code
A_N = coefficients.A_N ;
fprintf('\nA_N coefficient cubed = %.2f\n',A_N^3)
end
Then running this file (calling it from the command line) yields:
>> Gas
The A_N coefficient taken from NIST data is:
28.98641
From NIST data, the coefficient A_N = 28.986410
A_N coefficient squared = 840.21
A_N coefficient cubed = 24354.73
Or if you simply need to access the coefficient in the Matlab console:
>> coefficients.A_N
ans =
28.98641
Now all these examples assume that the class file coefficient.m is visible in the current Matlab scope. For Matlab, it means the file must be in the MATLAB search path (or the current folder works too).
For more info about what is the Matlab search path and how it works you can read:
What Is the MATLAB Search
Path?
Files and Folders that MATLAB
Accesses
In your case, I would make a folder containing all these sorts of classes, then add this folder to the Matlab path, so you never have to worry again about individual script, function or program calling for it.

See this link for tips on using a class definition for this. Highlights of the tips are:
Properties can be accessed directly to get the value
Properties can be added that give the units of the constant (highly advised!)
Comments can be added that act as help text for the constants
The doc command automatically creates a reference page for this classdef
E.g.,
classdef myconstants
properties(Constant)
% g is the gravity of Earth
% (https://en.wikipedia.org/wiki/Gravity_of_Earth)
g = 9.8;
g_units = 'm/s/s';
% c is the speed of light in a vacuum
c = 299792458;
c_units = 'm/s';
end
end
>> help myconstants.g

why does aba take longer than (a'(ab)')' when using gpuArray in Matlab scripts?

The code below performs the operation the same operation on gpuArrays a and b in two different ways. The first part computes (a'*(a*b)')' , while the second part computes a*b*a. The results are then verified to be the same.
%function test
clear
rng('default');rng(1);
a=sprand(3000,3000,0.1);
b=rand(3000,3000);
a=gpuArray(a);
b=gpuArray(b);
tic;
c1=gather(transpose(transpose(a)*transpose(a*b)));
disp(['time for (a''*(a*b)'')'': ' , num2str(toc),'s'])
clearvars -except c1
rng('default');
rng(1)
a=sprand(3000,3000,0.1);
b=rand(3000,3000);
a=gpuArray(a);
b=gpuArray(b);
tic;
c2=gather(a*b*a);
disp(['time for a*b*a: ' , num2str(toc),'s'])
disp(['error = ',num2str(max(max(abs(c1-c2))))])
%end
However, computing (a'*(a*b)')' is roughly 4 times faster than computing a*b*a. Here is the output of the above script in R2018a on an Nvidia K20 (I've tried different versions and different GPUs with the similar behaviour).
>> test
time for (a'*(a*b)')': 0.43234s
time for a*b*a: 1.7175s
error = 2.0009e-11
Even more strangely, if the first and last lines of the above script are uncommented (to turn it into a function), then both take the longer amount of time (~1.7s instead of ~0.4s). Below is the output for this case:
>> test
time for (a'*(a*b)')': 1.717s
time for a*b*a: 1.7153s
error = 1.0914e-11
I'd like to know what is causing this behaviour, and how to perform a*b*a or (a'*(a*b)')' or both in the shorter amount of time (i.e. ~0.4s rather than ~1.7s) inside a matlab function rather than inside a script.

There seem to be an issue with multiplication of two sparse matrices on GPU. time for sparse by full matrix is more than 1000 times faster than sparse by sparse. A simple example:
str={'sparse*sparse','sparse*full'};
for ii=1:2
rng(1);
a=sprand(3000,3000,0.1);
b=sprand(3000,3000,0.1);
if ii==2
b=full(b);
end
a=gpuArray(a);
b=gpuArray(b);
tic
c=a*b;
disp(['time for ',str{ii},': ' , num2str(toc),'s'])
end
In your context, it is the last multiplication which does it. to demonstrate I replace a with a duplicate c, and multiply by it twice, once as sparse and once as full matrix.
str={'a*b*a','a*b*full(a)'};
for ii=1:2
%rng('default');
rng(1)
a=sprand(3000,3000,0.1);
b=rand(3000,3000);
rng(1)
c=sprand(3000,3000,0.1);
if ii==2
c=full(c);
end
a=gpuArray(a);
b=gpuArray(b);
c=gpuArray(c);
tic;
c1{ii}=a*b*c;
disp(['time for ',str{ii},': ' , num2str(toc),'s'])
end
disp(['error = ',num2str(max(max(abs(c1{1}-c1{2}))))])
I may be wrong, but my conclusion is that a * b * a involves multiplication of two sparse matrices (a and a again) and is not treated well, while using transpose() approach divides the process to two stage multiplication, in none of which there are two sparse matrices.

I got in touch with Mathworks tech support and Rylan finally shed some light on this issue. (Thanks Rylan!) His full response is below. The function vs script issue appears to be related to certain optimizations matlab applies automatically to functions (but not scripts) not working as expected.
Rylan's response:
Thank you for your patience on this issue. I have consulted with the MATLAB GPU computing developers to understand this better.
This issue is caused by internal optimizations done by MATLAB when encountering some specific operations like matrix-matrix multiplication and transpose. Some of these optimizations may be enabled specifically when executing a MATLAB function (or anonymous function) rather than a script.
When your initial code was being executed from a script, a particular matrix transpose optimization is not performed, which results in the 'res2' expression being faster than the 'res1' expression:
n = 2000;
a=gpuArray(sprand(n,n,0.01));
b=gpuArray(rand(n));
tic;res1=a*b*a;wait(gpuDevice);toc % Elapsed time is 0.884099 seconds.
tic;res2=transpose(transpose(a)*transpose(a*b));wait(gpuDevice);toc % Elapsed time is 0.068855 seconds.
However when the above code is placed in a MATLAB function file, an additional matrix transpose-times optimization is done which causes the 'res2' expression to go through a different code path (and different CUDA library function call) compared to the same line being called from a script. Therefore this optimization generates slower results for the 'res2' line when called from a function file.
To avoid this issue from occurring in a function file, the transpose and multiply operations would need to be split in a manner that stops MATLAB from applying this optimization. Separating each clause within the 'res2' statement seems to be sufficient for this:
tic;i1=transpose(a);i2=transpose(a*b);res3=transpose(i1*i2);wait(gpuDevice);toc % Elapsed time is 0.066446 seconds.
In the above line, 'res3' is being generated from two intermediate matrices: 'i1' and 'i2'. The performance (on my system) seems to be on par with that of the 'res2' expression when executed from a script; in addition the 'res3' expression also shows similar performance when executed from a MATLAB function file. Note however that additional memory may be used to store the transposed copy of the initial array. Please let me know if you see different performance behavior on your system, and I can investigate this further.
Additionally, the 'res3' operation shows faster performance when measured with the 'gputimeit' function too. Please refer to the attached 'testscript2.m' file for more information on this. I have also attached 'test_v2.m' which is a modification of the 'test.m' function in your Stack Overflow post.
Thank you for reporting this issue to me. I would like to apologize for any inconvenience caused by this issue. I have created an internal bug report to notify the MATLAB developers about this behavior. They may provide a fix for this in a future release of MATLAB.
Since you had an additional question about comparing the performance of GPU code using 'gputimeit' vs. using 'tic' and 'toc', I just wanted to provide one suggestion which the MATLAB GPU computing developers had mentioned earlier. It is generally good to also call 'wait(gpuDevice)' before the 'tic' statements to ensure that GPU operations from the previous lines don't overlap in the measurement for the next line. For example, in the following lines:
b=gpuArray(rand(n));
tic; res1=a*b*a; wait(gpuDevice); toc
if the 'wait(gpuDevice)' is not called before the 'tic', some of the time taken to construct the 'b' array from the previous line may overlap and get counted in the time taken to execute the 'res1' expression. This would be preferred instead:
b=gpuArray(rand(n));
wait(gpuDevice); tic; res1=a*b*a; wait(gpuDevice); toc
Apart from this, I am not seeing any specific issues in the way that you are using the 'tic' and 'toc' functions. However note that using 'gputimeit' is generally recommended over using 'tic' and 'toc' directly for GPU-related profiling.
I will go ahead and close this case for now, but please let me know if you have any further questions about this.
%testscript2.m
n = 2000;
a = gpuArray(sprand(n, n, 0.01));
b = gpuArray(rand(n));
gputimeit(#()transpose_mult_fun(a, b))
gputimeit(#()transpose_mult_fun_2(a, b))
function out = transpose_mult_fun(in1, in2)
i1 = transpose(in1);
i2 = transpose(in1*in2);
out = transpose(i1*i2);
end
function out = transpose_mult_fun_2(in1, in2)
out = transpose(transpose(in1)*transpose(in1*in2));
end
.
function test_v2
clear
%% transposed expression
n = 2000;
rng('default');rng(1);
a = sprand(n, n, 0.1);
b = rand(n, n);
a = gpuArray(a);
b = gpuArray(b);
tic;
c1 = gather(transpose( transpose(a) * transpose(a * b) ));
disp(['time for (a''*(a*b)'')'': ' , num2str(toc),'s'])
clearvars -except c1
%% non-transposed expression
rng('default');
rng(1)
n = 2000;
a = sprand(n, n, 0.1);
b = rand(n, n);
a = gpuArray(a);
b = gpuArray(b);
tic;
c2 = gather(a * b * a);
disp(['time for a*b*a: ' , num2str(toc),'s'])
disp(['error = ',num2str(max(max(abs(c1-c2))))])
%% sliced equivalent
rng('default');
rng(1)
n = 2000;
a = sprand(n, n, 0.1);
b = rand(n, n);
a = gpuArray(a);
b = gpuArray(b);
tic;
intermediate1 = transpose(a);
intermediate2 = transpose(a * b);
c3 = gather(transpose( intermediate1 * intermediate2 ));
disp(['time for split equivalent: ' , num2str(toc),'s'])
disp(['error = ',num2str(max(max(abs(c1-c3))))])
end

EDIT 2 I might have been right, see this other answer
EDIT: They use MAGMA, which is column major. My answer does not hold, however I will leave it here for a while in case it can help crack this strange behavior.
The below answer is wrong
This is my guess, I can not 100% tell you without knowing the code under MATLAB's hood.
Hypothesis: MATLABs parallel computing code uses CUDA libraries, not their own.
Important information
MATLAB is column major and CUDA is row major.
There is no such things as 2D matrices, only 1D matrices with 2 indices
Why does this matter? Well because CUDA is highly optimized code that uses memory structure to maximize cache hits per kernel (the slowest operation on GPUs is reading memory). This means a standard CUDA matrix multiplication code will exploit the order of memory reads to make sure they are adjacent. However, what is adjacent memory in row-major is not in column-major.
So, there are 2 solutions to this as someone writing software
Write your own column-major algebra libraries in CUDA
Take every input/output from MATLAB and transpose it (i.e. convert from column-major to row major)
They have done point 2, and assuming that there is a smart JIT compiler for MATLAB parallel processing toolbox (reasonable assumption), for the second case, it takes a and b, transposes them, does the maths, and transposes the output when you gather.
In the first case however, you already do not need to transpose the output, as it is internally already transposed and the JIT catches this, so instead of calling gather(transpose( XX )) it just skips the output transposition is side. The same with transpose(a*b). Note that transpose(a*b)=transpose(b)*transpose(a), so suddenly no transposes are needed (they are all internally skipped). A transposition is a costly operation.
Indeed there is a weird thing here: making the code a function suddenly makes it slow. My best guess is that because the JIT behaves differently in different situations, it doesn't catch all this transpose stuff inside and just does all the operations anyway, losing the speed up.
Interesting observation: It takes the same time in CPU than GPU to do a*b*a in my PC.

How can I read a bunch of TIFF files faster in Matlab?

I have to read in hundreds of TIFF files, perform some mathematical operation, and output a few things. This is being done for thousands of instances. And the biggest bottleneck is imread. Using PixelRegion, I read in only parts of the file, but it is still very slow.
Currently, the reading part is here.
Can you suggest how I can speed it up?
for m = 1:length(pfile)
if ~exist(pfile{m}, 'file')
continue;
end
pConus = imread(pfile{m}, 'PixelRegion',{[min(r1),max(r1)],[min(c1),max(c1)]});
pEvent(:,m) = pConus(tselect);
end

General Speedup
The pixel region does not appear to change at each iteration. I'm not entirely sure if Matlab will optimize the min and max calls (though I'm pretty sure it won't). If you don't change them at each iteration, move them outside the for loop and calculate them once.
Parfor
The following solution assumes you have access to the parallel computing toolbox. I tested it with 10,840 tiffs, each image was 1000x1000 originally, but I only read in a 300x300 section of them. I am not sure how many big pConus(tselect) is, so I just stored the whole 300x300 image.
P.S. Sorry about the formatting. It refuses to format it as a block of code.
Results based on my 2.3 GHz i7 w/ 16GB of ram
for: 130s
parfor: 26s + time to start pool
% Setup
clear;clc;
n = 12000;
% Would be faster to preallocate this, but negligeble compared to the
% time it takes imread to complete.
fileNames = {};
for i = 1:n
name = sprintf('in_%i.tiff', i);
% I do the exist check here, assuming that the file won't be touched in
% until the program advances a files lines.
if exist(name, 'file')
fileNames{end+1} = name;
end
end
rows = [200, 499];
cols = [200, 499];
pics = cell(1, length(fileNames));
tic;
parfor i = 1:length(fileNames)
% I don't know why using the temp variable is faster, but it is
temp = imread(fileNames{i}, 'PixelRegion', {rows, cols});
pics{i} = temp;
end
toc;

Large workspace variables affect multiplication runtime of unrelated variables

Old Title: *Small matrix multiplication much slower in R2016b than R2016a*
(update below)
I find that multiplication of small matrices seems much smaller in R2016b than R2016a. Here's a minimal example:
r = rand(50,100);
s = rand(100,100);
tic; r * s; toc
This takes about 0.0012s in R2016a and 0.018s R2016b.
Creating an artificial loop to make sure this isn't just some initial overhead or something leads to the same loss factor:
tic; for i = 1:1000, a = r*s; end, toc
This takes about 0.18s in R2016a and 2.1s R2016b.
Once I make the matrices much bigger, say r = rand(500,1000); and s = rand(1000,1000), the version behave similarly (R2016b even seems to be ~15% faster). Anyone have any insight as to why this is, or can verify this behavior on another system?
I wonder if it has to do with the new arithmetic expansions implementation (if this feature has some cost for small matrix multiplication): http://blogs.mathworks.com/loren/2016/10/24/matlab-arithmetic-expands-in-r2016b/
update
After many tests, I discovered that this difference was not between MATLAB versions (my apologies). Instead, it seems to be a difference of what's in my base workspace... and worse, the type of variable that's in the base workspace.
I cleared a huge workspace (which had many large cell arrays with many small, differently sized matrix entries). If I clear the variables and do the timing of r*s, I get much faster runtime (x10-x100) than before the workspace was loaded.
So the question is, why does having variables in the workspace affect the matrix multiplication of two small variables? And even more, why does having certain types of variables slow down the workspace dramatically.
Here's an example where a large variable in cell form in the workspace affects the runtime of the matrix multiplication or two unrelated matrices. If I collapse this cell to a matrix, the effect goes away.
clear;
ticReps = 10000;
nCells = 100;
aa = rand(50,100);
bb = rand(100, 100);
% test original timing
tic; for i = 1:ticReps, aa * bb; end
fprintf('original: %3.3f\n', toc);
% make some matrices inside a large number of cells
q = cell(nCells, nCells);
for i = 1:nCells * nCells
q{i} = sprand(10000,10000, 0.0001);
end
% the timing again
tic; for i = 1:ticReps, aa * bb; end
fprintf('after large q cell: %3.3f\n', toc);
% make q into a matrix
q = cat(2, q{:});
% the timing again
tic; for i = 1:ticReps, aa * bb; end
fprintf('after large q matrix: %3.3f\n', toc);
clear q
% the timing again
tic; for i = 1:ticReps, aa * bb; end
fprintf('after clear q: %3.3f\n', toc);
In both staged, q takes up about 2Gb. Result:
original: 0.183
after large q cell: 0.320
after large q matrix: 0.175
after clear q: 0.184

I've received an update from mathworks.
As far as I understand it, they say that this is the fault of the Windows memory manager, which allots memory to large cell arrays in a fairly fragmented manner. Since the (unrelated) multiplication needs memory (for the output), getting this piece of memory now takes longer time due to the memory fragmentation caused by the cell. Linux (as tested) does not have this issue.

Evaluate a changing function in loop

I am writing a code that generates a function f in a loop. This function f changes in every loop, for example from f = x + 2x to f = 3x^2 + 1 (randomly), and I want to evaluate f at different points in every loop. I have tried using subs, eval, matlabFunction etc but it is still running slowly. How would you tackle a problem like this in the most efficient way?
This is as fast as I have been able to do it. ****matlabFunction and subs go slower than this.
The code below is my solution and it is one loop. In my larger code the function f and point x0 change in every loop so you can imagine why I want this to go as fast as possible. I would greatly appreciate it if someone could go through this, and give me any pointers. If my coding is crap feel free to tell me :D
x = sym('x',[2,1]);
f = [x(1)-x(1)cos(x(2)), x(2)-3x(2)^2*cos(x(1))];
J = jacobian(f,x);
x0 = [2,1];
N=length(x0); % Number of equations
%% Transform into string
fstr = map2mat(char(f));
Jstr = map2mat(char(J));
% replace every occurence of 'xi' with 'x(i)'
Jstr = addPar(Jstr,N);
fstr = addPar(fstr,N);
x = x0;
phi0 = eval(fstr)
J = eval(Jstr)
function str = addPar(str,N)
% pstr = addPar(str,N)
% Transforms every occurence of xi in str into x(i)
% N is the maximum value of i
% replace every occurence of xi with x(i)
% note that we do this backwards to avoid x10 being
% replaced with x(1)0
for i=N:-1:1
is = num2str(i);
xis = ['x' is];
xpis = ['x(' is ')'];
str = strrep(str,xis,xpis);
end
function r = map2mat(r)
% MAP2MAT Maple to MATLAB string conversion.
% Lifted from the symbolic toolbox source code
% MAP2MAT(r) converts the Maple string r containing
% matrix, vector, or array to a valid MATLAB string.
%
% Examples: map2mat(matrix([[a,b], [c,d]]) returns
% [a,b;c,d]
% map2mat(array([[a,b], [c,d]]) returns
% [a,b;c,d]
% map2mat(vector([[a,b,c,d]]) returns
% [a,b,c,d]
% Deblank.
r(findstr(r,' ')) = [];
% Special case of the empty matrix or vector
if strcmp(r,'vector([])') | strcmp(r,'matrix([])') | ...
strcmp(r,'array([])')
r = [];
else
% Remove matrix, vector, or array from the string.
r = strrep(r,'matrix([[','['); r = strrep(r,'array([[','[');
r = strrep(r,'vector([','['); r = strrep(r,'],[',';');
r = strrep(r,']])',']'); r = strrep(r,'])',']');
end

There are several ways to get huge boosts in speed for this sort of problem:
The java GUI front end slows everything down. Go back to version 2010a or earlier. Go back to when it was based on C or fortran. The MATLAB script runs as fast as if you had put it into the MATLAB "compiler".
If you have MatLab compiler (or builder, I forget which) but not the coder, then you can process your code and have it run a few times faster without modifying the code.
write it to a file, then call it as a function. I have done this for changing finite-element expressions, so large ugly math that makes $y = 3x^2 +1$ look simple. In that it gave me solid speed increase.
vectorize, vectorize, vectorize. It used to reliably give 10x to 12x speed increase. Pull it out of loops. The java, I think, obscures this some by making everything slower.
have you "profiled" your function to make sure that "eval" or such are the problem? If you fix "eval" and your bottleneck is elsewhere then you will have problems.
If you have the choice between eval and subs, stick with eval. subs gives you a symbolic solution, not a numeric one.
If there is a clean way to have multiple instances of MatLab running, especially if you have a decently core-rich cpu that MatLab does not fully utilize, then get several of them going. If you are at an educational institution you might try running several different versions (2010a, 2010b, 2009a,...) on the same system. I (fuzzily) recall they didn't collide when I did it. Running more than about 8 started slowing things down more than it improved them. Make sure they aren't colliding on file access if you are using files to share control.
You could write your program in LabVIEW (not MathScript, not MatLab) and because it is a compiled language, there are times that code can run 1000x faster.
You could go all numeric and make it a matrix activity. This depends on your code, but if you could randomly populate the columns in the matrix then matrix multiply it to a matrix $ \left[ 1, x, x^{2}, ...\right] $, that would likely be several hundreds or thousands of times faster than your current level of equation handling and still in MatLab.
About your coding:
don't redeclare "x" as a symbol every loop, that is expensive.
what is this "map2mat" then "addPar" stuff?
the string handling functions are horrible for runtime. Stick to one language. The symbolic toolbox IS maple, and you don't have to get goofy hand-made parsing to make it work with the rest of MatLab.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse