Why does Matlab run faster after a script is "warmed up"? - matlab

I have noticed that the first time I run a script, it takes considerably more time than the second and third time1. The "warm-up" is mentioned in this question without an explanation.
Why does the code run faster after it is "warmed up"?
I don't clear all between calls2, but the input parameters change for every function call. Does anyone know why this is?
1. I have my license locally, so it's not a problem related to license checking.
2. Actually, the behavior doesn't change if I clear all.

One reason why it would run faster after the first time is that many things are initialized once, and their results are cached and reused the next time. For example in the M-side, variables can be defined as persistent in functions that can be locked. This can also occur on the MEX-side of things.
In addition many dependencies are loaded after the first time and remain so in memory to be re-used. This include M-functions, OOP classes, Java classes, MEX-functions, and so on. This applies to both builtin and user-defined ones.
For example issue the following command before and after running your script for the first run, then compare:
[M,X,C] = inmem('-completenames')
Note that clear all does not necessarily clear all of the above, not to mention locked functions...
Finally let us not forget the role of the accelerator. Instead of interpreting the M-code every time a function is invoked, it gets compiled into machine code instructions during runtime. JIT compilation occurs only for the first invocation, so ideally the efficiency of running object code the following times will overcome the overhead of re-interpreting the program every time it runs.

Matlab is interpreted. If you don't warm up the code, you will be losing a lot of time due to interpretation instead of the actual algorithm. This can skew results of timings significantly.
Running the code at least once will enable Matlab to actually compile appropriate code segments.

Besides Matlab-specific reasons like JIT-compilation, modern CPUs have large caches, branch predictors, and so on. Warming these up is an issue for benchmarking even in assembly language.
Also, more importantly, modern CPUs usually idle at low clock speed, and only jump to full speed after several milliseconds of sustained load.
Intel's Turbo feature gets even more funky: when power and thermal limits allow, the CPU can run faster than its sustainable max frequency. So the first ~20 seconds to 1 minute of your benchmark may run faster than the rest of it, if you aren't careful to control for these factors.

Another issue not mensioned by Amro and Marc is memory (pre)allocation.
If your script does not pre-allocate its memory it's first run would be very slow due to memory allocation. Once it completed its first iteration all memory is allocated, so consecutive invokations of the script would be more efficient.
An illustrative example
for ii = 1:1000
vec(ii) = ii; %// vec grows inside loop the first time this code is executed only
end

Related

GPU MATLAB gives different elapsed time between first and second execution

When i execute my code using Matlab parallel toolbox it gives me two different time execution between first and second time.
In fact the first time is very slow (more than CPU version) however the second time is faster and logical, and subsequent runs are the same as the second time. Why does this happen?
That is correct, and expected.
When you call it for the first time, it needs to initialize the GPU ("turn it on" in some sense), set up the CUDA contexts, etc etc. The second time you run it the GPU is ready to take anything you throw at it.
On top of that, depends on how you wrote the code, maybe the first time it requires to move some data to the GPU, and perhaps in the second time the memory is already there.
Often doing gpuDevice(1) will initialize the context enough, but otherwise just throw a small matrix multiplication to it to initialize.
All this is somehow true for other parallel computing paradigms in MATLAB, e.g. if you want to use parfor you need to initialize the parallel pool or it will take very long the first time.

How to avoid using the To Workspace block

I ran the profiler on my Simulink model and realized that the "To Workspace" block is using 20% of the total simulation time. Because this model is ran more than one time, I'm looking for a way to increase its performance.
Hence, is there an alternate solution to using the "To Workspace" block that would increase my model global performance?
Yes, you can use Signal Logging. The various approaches to logging simulation results are discussed in the documentation under Export Simulation Data. Finally, see also View Simulation Results for alternative approaches. My personal recommendation would be signal logging or a To File block.
According to my general understanding of memory management, reserving a fixed memory block takes less time than expanding it in each timestep. So, it might be useful to limit the number of data points to be logged, that the memory space reserved for your data set will not be dynamically increased at every timestep. Of course this would be only valid if you would know the number of data points and therefore number of steps in simulation prior to start of the simulation, which can be achieved by a fixed step size solver (if applicable by your simulation system setuo). Thus pre-allocation of the workspace array might save you some time in terms of not reaching the memory management system at each timestep.

Faster way to run simulink simulation repeatedly for a large number of time

I want to run a simulation which includes SimEvent blocks (thus only Normal option is available for sim run) for a large number of times, like at least 1000. When I use sim it compiles the program every time and I wonder if there is any other solution which just run the simulation repeatedly in a faster way. I disabled Rebuild option from Configuration Parameter and it does make it faster but still takes ages to run for around 100 times.
And single simulation time is not long at all.
Thank you!
It's difficult to say why the model compiles every time without actually seeing the model and what's inside it. However, the Parallel Computing Toolbox provides you with the ability to distribute the iterations of your model across several cores, or even several machines (with the MATLAB Distributed Computing Server). See Run Parallel Simulations in the documentation for more details.

How to utilise parallel processing in Matlab

I am working on a time series based calculation. Each iteration of the calculation is independent. Could anyone share some tips / online primers on using utilising parallel processing in Matlab? How can this be specified inside the actual code?
Since you have access to the Parallel toolbox, I suggest that you first check whether you can do it the easy way.
Basically, instead of writing
for i=1:lots
out(:,i)=do(something);
end
You write
parfor i=1:lots
out(:,i)=do(something);
end
Then, you use matlabpool to create a number of workers (you can have a maximum of 8 on your local machine with the toolbox, and tons on a remote cluster if you also have a Distributed Computing Server license), and you run the code, and see nice speed gains when your iterations are run by 8 cores instead of one.
Even though the parfor route is the easiest, it may not work right out of the box, since you might do your indexing wrong, or you may be referencing an array in a problematic way etc. Look at the mlint warnings in the editor, read the documentation, and rely on good old trial and error, and you should figure it out reasonably fast. If you have nested loops, it's often best parallelize only the innermost one and ensure it does tons of iterations - this is not only good design, it also reduces the amount of code that could give you trouble.
Note that especially if you run the code on a local machine, you may run into memory issues (which might manifest in really slow execution in parallel mode because you're paging): Every worker gets a copy of the workspace, so if your calculation involves creating a 500MB array, 8 workers will need a total 4GB of RAM - and then you haven't even started counting the RAM of the parent process! In addition, it can be good to only use N-1 cores on your machine, so that there is still one core left for other processes that may run on the computer (such as a mandatory antivirus...).
Mathworks offers its own parallel computing toolbox. If you do not want to purchase that, there a few options
You could write your own mex file and use pthreads or OpenMP.
However make sure you do not call any Mex api in the parallel part of the code, because they arent thread safe
If you want coarser grained parallelism via MPI you can try pmatlab
Same with parmatlab
Edit: Adding link Parallel MATLAB with openmp mex files
I have only tried the first.
Don't forget that many Matlab functions are already multithreaded. By careful programming you may be able to take advantage of them -- check the documentation for your version as the Mathworks seem to be increasing the range and number of multithreaded functions with each new release. For example, it seems that 2010a has multithreaded ffts which may be useful for time series processing.
If the intrinsic multithreading is not what you need, then as #srean suggests, the Parallel Computing Toolbox is available. For my money (or rather, my employers' money) it's the way to go, allowing you to program in parallel in Matlab, rather than having to bolt things on. I have to admit, too, that I'm quite impressed by the toolbox and the facilities it offers.

What's the best way to measure and track performance over various calls at runtime?

I'm trying to optimize the performance of my code, but I'm not familiar with xcode's debuggers or debuggers in general. Is it possible to track the execution time and frequency of calls being made at runtime?
Imagine a chain of events with some recursive calls over a fraction of a second. What's the best way to track where the CPU spends most of its time?
Many thanks.
Edit: Maybe this is better asked by saying, how do I use the xcode debug tools to do a stack trace?
You want to use the built-in performance tools called 'Instruments', check out Apples guide to Instruments. Specifically you probably want the System Instruments. There's also the Tuning Guide which could be useful to you and Shark.
Imagine a chain of events with some
recursive calls over a fraction of a
second. What's the best way to track
where the CPU spends most of its time?
Short version of previous answer.
Learn an IDE or debugger. Make sure it has a "pause" button or you can interrupt it when your program is running and taking too long.
If your code runs too quickly to be manually paused, wrap a temporary loop of 10 to 1000 times around it.
When you pause it, make a copy of the call stack, into some text editor. Repeat several times.
Your answer will be in those stacks. If the CPU is spending most of its time in a statement, that statement will be at the bottom of most of the stack samples. If there is some function call that causes most of the time to be used, that function call will be on most of the stacks. It doesn't matter if it's recursive - that just means it shows up more than once on a stack.
Don't think about measuring microseconds, or counting calls. Think about "percent of time active". That's what stack samples tell you, and that's roughly what you'll save if you fix it.
It's that simple.
BTW, when you fix that problem, you will get a speedup factor. Then, other issues in your code will be magnified by that factor, so they will be easier to find. This way, you can keep going until you've squeezed every cycle out of it.
The first thing I tell people is to recognize the difference between
1) timing routines and counting how many times they are called, and
2) finding code that you can fruitfully optimize.
For (1) there are instrumenting profilers.
To be really successful at (2) you need a rare type of profiler.
You need a sampling profiler that
samples the entire call stack, not just the program counter
samples at random wall clock times, not just CPU, so as to capture possible I/O problems
samples when you want it to (not when waiting for user input)
for output, gives you, for each line of code that appears on stack samples, the percent of samples containing that line. That is a direct measure of the total time that could be saved if that line were not there.
(I actually do it by hand, interrupting the program under the debugger.)
Don't get sidetracked by problems you don't have, such as
accuracy of measurement. If a line of code appears on 30% of call stack samples, it's actual cost could be anywhere in a range around 30%. If you can find a way to eliminate it or invoke it a lot less, you will save what it costs, even if you don't know in advance exactly what its cost is.
efficiency of sampling. Since you don't need accuracy of time measurement, you don't need a large number of samples. Even if you get a large number of samples, they don't skew the results significantly, because they don't fail to spot the costly lines of code.
call graphs. They make nice graphics, but are not what you need to know. An arc on a call graph corresponds to a line of code in the best case, usually multiple lines, so knowing cost of an arc only tells the cost of a line in the best case. Call graphs concentrate on functions, when what you need to find is lines of code. Call graphs get wrapped up in the issue of recursion, which is irrelevant.
It's important to understand what to expect. Many programmers, using traditional profilers, can get a 20% improvement, consider that terrific, count the profiler a winner, and stop there. Others, working with large programs, can often get speedup factors of 20 times.
This is done by fixing a series of problems, each one giving a multiplicative speedup factor. As soon as the profiler fails to find the next problem, the process stops. That's why "good enough" isn't good enough.
Here is a brief explanation of the method.