What's the best way to measure and track performance over various calls at runtime? - iPhone

I'm trying to optimize the performance of my code, but I'm not familiar with Xcode's debuggers or debuggers in general. Is it possible to track the execution time and frequency of calls being made at runtime?
Imagine a chain of events with some recursive calls over a fraction of a second. What's the best way to track where the CPU spends most of its time?
Many thanks.
Edit: Maybe this is better asked by saying, how do I use the Xcode debug tools to do a stack trace?

You want to use the built-in performance tools called Instruments; check out Apple's guide to Instruments. Specifically, you probably want the System Instruments. There's also the Tuning Guide, which could be useful to you, as well as Shark.

Imagine a chain of events with some recursive calls over a fraction of a second. What's the best way to track where the CPU spends most of its time?
Short version of previous answer.
Learn an IDE or debugger. Make sure it has a "pause" button, or that you can otherwise interrupt your program while it is running and taking too long.
If your code runs too quickly to be paused by hand, wrap a temporary loop of 10 to 1000 iterations around it.
When you pause it, copy the call stack into a text editor. Repeat several times.
Your answer will be in those stacks. If the CPU is spending most of its time in a statement, that statement will be at the bottom of most of the stack samples. If there is some function call that causes most of the time to be used, that function call will be on most of the stacks. It doesn't matter if it's recursive - that just means it shows up more than once on a stack.
Don't think about measuring microseconds, or counting calls. Think about "percent of time active". That's what stack samples tell you, and that's roughly what you'll save if you fix it.
It's that simple.
BTW, when you fix that problem, you will get a speedup factor. Then, other issues in your code will be magnified by that factor, so they will be easier to find. This way, you can keep going until you've squeezed every cycle out of it.

The first thing I tell people is to recognize the difference between
1) timing routines and counting how many times they are called, and
2) finding code that you can fruitfully optimize.
For (1) there are instrumenting profilers.
To be really successful at (2) you need a rare type of profiler.
You need a sampling profiler that
samples the entire call stack, not just the program counter
samples at random wall clock times, not just CPU, so as to capture possible I/O problems
samples when you want it to (not when waiting for user input)
for output, gives you, for each line of code that appears on stack samples, the percent of samples containing that line. That is a direct measure of the total time that could be saved if that line were not there.
(I actually do it by hand, interrupting the program under the debugger.)
Don't get sidetracked by problems you don't have, such as
accuracy of measurement. If a line of code appears on 30% of call stack samples, its actual cost could be anywhere in a range around 30%. If you can find a way to eliminate it or invoke it a lot less, you will save what it costs, even if you don't know in advance exactly what its cost is.
efficiency of sampling. Since you don't need accuracy of time measurement, you don't need a large number of samples. Even a small number of samples doesn't skew the results significantly, because it doesn't fail to spot the costly lines of code.
call graphs. They make nice graphics, but are not what you need to know. An arc on a call graph corresponds to a line of code in the best case, usually multiple lines, so knowing cost of an arc only tells the cost of a line in the best case. Call graphs concentrate on functions, when what you need to find is lines of code. Call graphs get wrapped up in the issue of recursion, which is irrelevant.
It's important to understand what to expect. Many programmers, using traditional profilers, can get a 20% improvement, consider that terrific, count the profiler a winner, and stop there. Others, working with large programs, can often get speedup factors of 20 times.
This is done by fixing a series of problems, each one giving a multiplicative speedup factor. As soon as the profiler fails to find the next problem, the process stops. That's why "good enough" isn't good enough.
Here is a brief explanation of the method.

Related

High-precision millisecond timer in MATLAB

I'm trying to implement a high-precision, millisecond-timescale timer in MATLAB. Every T seconds, I want to query a camera linked to MATLAB, and if there is an image in the memory, I want to pull it out. The actual connection to the camera is straightforward - but a problem arises because images are coming in every ~60ms and all need to be pulled off before another image enters the camera buffer. This essentially means that I need to be checking the camera buffer at least every ~30ms, and ideally every ~5ms.
While MATLAB's built-in timer function ostensibly allows millisecond timing, it suffers from poor precision. In >95% of executions the built-in MATLAB timer will indeed pause for ~5ms between runs, but in ~5% of cases it hovers around ~30ms, and in ~1% of cases it takes >100ms between executions, which unacceptably kills the performance. I should clarify in MATLAB's defense that there are simultaneously two other timers running (both with 1 s periods), as well as a number of figure windows open, so even though my machine is beefy (16-core, 64GB RAM), there is certainly a lot going on all at once. I have tried timers based on .NET timers (System.Timers.Timer(period)) as well as on the Java sleep function (java.lang.Thread.sleep(period)), both of which should theoretically be more precise, and while both are better than the MATLAB timer (at the cost of being more unwieldy), neither is able to consistently achieve <60ms execution delay across thousands of iterations.
Maybe I'm asking for something which is not implementable - but I hope that there is some way to implement a high-precision timer in MATLAB which will continue executing at a ms time-scale even when there are other figures/timers/commands being executed semi-simultaneously. I should maybe clarify that when running just a timer with no other timers/figures open I am able to consistently achieve <60ms execution (and really, consistent <10ms execution for a 5ms timer period). This is possible even when all those timers/figures are open in a different instance of MATLAB, so it seems the problem is to somehow separate the timer from the rest of MATLAB. Any advice or guidance would be appreciated in this regard.
Depending on what exactly you are doing, the timing system of Psychtoolbox may help you.
Specifically, check out the WaitSecs function and its documentation. It is supposed to be more precise than timer, and the documentation contains some tips about achieving high precision timing in general.
Also related is the GetSecs function.
It might however happen that switching to WaitSecs will not help you. In that case you can be quite sure that your machine is just too loaded to do what you are trying to do.
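For illustration, here is a minimal sketch of such a polling loop built on GetSecs and WaitSecs('UntilTime', ...); checkCameraBuffer and keepPolling are placeholders for your own camera-query call and stop condition, not Psychtoolbox functions:
period = 0.005;                    % target 5 ms polling interval
tNext  = GetSecs + period;         % absolute deadline for the next poll
while keepPolling                  % placeholder stop condition
    checkCameraBuffer();           % placeholder: pull any pending frame off the camera
    WaitSecs('UntilTime', tNext);  % sleep until the absolute deadline
    tNext = tNext + period;        % absolute deadlines avoid accumulating drift
end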

A way/tool to estimate the execution time of an app/task

I'm trying to run a real-life experiment for an application on a Raspberry Pi, and I need to estimate or predict the execution time of the application. In other words, before running the task I need to know roughly how long it will take to get the result back. I have identified several techniques and works that have been done before, but most of them are simulation work, which doesn't carry over to a real-life experiment. Can anyone help me with any idea or technique (no code)? Thank you in advance.
Estimating the execution time of an application or function is going to be difficult in any context. You might want to look up the halting problem for some insight into why. It's impossible to determine whether a given program will finish executing, and therefore you can't really tell how long a given program will take to finish executing.
For general computing, the varying hardware capabilities of any given system will always have an effect on the execution time of a program. A Raspberry Pi is a more fixed platform than that, and therefore more predictable in that sense, but its specifications are not consistent across the various versions. That adds to the complexity of determining a run time.
Practically, the most reliable way to determine how long a process will take would be to just run it and time it. If you absolutely need predicted times for something, you might be able to do a bit of a composite estimate - time the smaller chunks of the application separately, and then use those to determine how long you expect the application as a whole to run. For most situations, though, it would be much faster to just run the program itself rather than trying to predict it.
Store the time before and after the execution? Then you would know the execution time.

MATLAB timer object pitfalls and poor usage

I created a timer that executes every 0.1 seconds. It calls a function that reads data and then updates the property of an object. When I start the timer, MATLAB displays the "Busy" signal at the bottom of the command window. MATLAB becomes unresponsive and I cannot halt the timer using the stop() function. My only recourse is to use Ctrl-C.
I determined the problem to be that the processing time of the timer callback function was longer than the calling period, and I presume no other MATLAB code could squeak into the stack/queue. This makes me somewhat worried about relying upon timers. My goal is to continuously make measurements from several devices and store them in an object, while MATLAB does other things in between these measurements. Also, I cannot afford to miss a measurement.
I am creating an app that responds to user input and provides the user with real-time information, so I chose a fast period, thinking it would produce a snappy UX. Since I am committed to using MATLAB, I could not think of a better way of implementing this functionality than using a timer object. So the first question is: do timer objects seem like the right tool for the job I describe above?
Second, if I am to use timer objects would someone please share their experience about common mistakes or pitfalls using timers? Or has someone any advice on how to best implement timer objects? Is there a practical limit to the number of timer objects that can be used simultaneously? What is the best way to determine the optimal frequency of a timer object?
Thanks!
I would think that 0.1 seconds is pushing it for reliability, especially if you have multiple timers going at once, and especially if you want a user interface to be responsive at the same time as well.
MATLAB is basically single-threaded. There are exceptions, for example the lower level math routines call BLAS in a multithreaded way that speeds them up a lot. You can also write MEX code in C that is multithreaded, and call that from MATLAB. But there's basically one true thread that all your code runs on.
Timer objects are also an exception in a way. When you create a timer object, underneath that there's a Java timer object, which is running on a separate Java thread. But when any of its callbacks fire, it calls back into MATLAB to execute them, which happens on the one true thread.
If you cannot afford to miss a measurement, you'll need to set the BusyMode property of your timers to 'queue' rather than the default 'drop', which will mean that if any of the callbacks take longer than 0.1 seconds to execute, they will back up - and you have to fit in any user interface actions as well.
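As a rough sketch of that configuration (takeMeasurement is a placeholder for your own acquisition routine):
t = timer('ExecutionMode', 'fixedRate', ...           % fire repeatedly at a fixed rate
          'Period',        0.1, ...                    % every 0.1 s
          'BusyMode',      'queue', ...                % queue callbacks instead of dropping them
          'TasksToExecute', Inf, ...                   % run until stopped
          'TimerFcn',      @(~,~) takeMeasurement());  % placeholder acquisition function
start(t);
% ... acquisition runs here ...
stop(t);
delete(t);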
In addition, MATLAB doesn't (and couldn't) make any real guarantees about the precision of how fast or regularly the timer callbacks will execute. If Windows suddenly decides to run a virus scan, or update itself etc, MATLAB will lose priority and that will mess things up. If you were firing timers every 10 seconds, or even every 1 second, it's in all likelihood going to be roughly accurate. But if you're firing them every millisecond, you can't expect it to be reliable - you'd need a proper real time environment for that. 0.1 seconds seems borderline to me, and I'd expect its reliability to depend on what exactly you were doing in those 0.1 seconds, and what else was going on as well (and what computer you're running, as well).
To answer your last questions (max number of timers, optimal timer frequency etc) - there's no general answer, just try it and work out what happens with a range of values in your particular situation.
If it turns out that it's not reliable enough, you could either try:
Doing the data acquisition in another faster way (e.g. in C), maybe in a separate process, then calling that from MATLAB via MEX, perhaps with some sort of buffering to smooth things out.
Turning to some of the other MathWorks products (e.g. Simulink, Simulink Coder, some of the System Toolboxes) that are designed for developing properly real-time systems.
Hope that helps!

Why does Matlab run faster after a script is "warmed up"?

I have noticed that the first time I run a script, it takes considerably more time than the second and third times [1]. The "warm-up" is mentioned in this question without an explanation.
Why does the code run faster after it is "warmed up"?
I don't clear all between calls [2], but the input parameters change for every function call. Does anyone know why this is?
[1] I have my license locally, so it's not a problem related to license checking.
[2] Actually, the behavior doesn't change if I clear all.
One reason why it would run faster after the first time is that many things are initialized once, and their results are cached and reused the next time. For example, on the M-side, variables can be defined as persistent in functions that can be locked. This can also occur on the MEX-side of things.
In addition, many dependencies are loaded on the first run and remain in memory to be re-used. This includes M-functions, OOP classes, Java classes, MEX-functions, and so on, both built-in and user-defined.
For example, issue the following command before and after running your script for the first time, then compare the results:
[M,X,C] = inmem('-completenames')
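% M lists the M-files, X the MEX-files, and C the classes currently loaded in memory; '-completenames' returns full paths for the files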
Note that clear all does not necessarily clear all of the above, not to mention locked functions...
Finally let us not forget the role of the accelerator. Instead of interpreting the M-code every time a function is invoked, it gets compiled into machine code instructions during runtime. JIT compilation occurs only for the first invocation, so ideally the efficiency of running object code the following times will overcome the overhead of re-interpreting the program every time it runs.
Matlab is interpreted. If you don't warm up the code, you will be losing a lot of time to interpretation rather than to the actual algorithm. This can skew timing results significantly.
Running the code at least once will enable Matlab to actually compile appropriate code segments.
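A quick way to see the effect for yourself (myScript stands in for whatever script or function you are benchmarking):
tic; myScript; tFirst  = toc;  % first run: pays for loading dependencies and JIT compilation
tic; myScript; tSecond = toc;  % later runs: dependencies cached, code already compiled
fprintf('first run: %.3f s, second run: %.3f s\n', tFirst, tSecond);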
Besides Matlab-specific reasons like JIT-compilation, modern CPUs have large caches, branch predictors, and so on. Warming these up is an issue for benchmarking even in assembly language.
Also, more importantly, modern CPUs usually idle at low clock speed, and only jump to full speed after several milliseconds of sustained load.
Intel's Turbo feature gets even more funky: when power and thermal limits allow, the CPU can run faster than its sustainable max frequency. So the first ~20 seconds to 1 minute of your benchmark may run faster than the rest of it, if you aren't careful to control for these factors.
Another issue not mentioned by Amro and Marc is memory (pre)allocation.
If your script does not pre-allocate its memory, its first run will be very slow due to memory allocation. Once it completes its first run, all the memory is allocated, so consecutive invocations of the script will be more efficient.
An illustrative example:
for ii = 1:1000
    vec(ii) = ii; %// vec grows inside the loop only the first time this code is executed
end
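For comparison, a minimal pre-allocated counterpart, which pays the allocation cost once up front so even the first run avoids the repeated reallocation:
vec = zeros(1, 1000);  %// allocate the full array once
for ii = 1:1000
    vec(ii) = ii;      %// no reallocation inside the loop
end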

How to find the time value of operation to optimize new algorithm design?

My question is specific to iPhone, iPod, and iPad, since I am assuming that the architecture makes a big difference. I'm hoping there is either a specification somewhere (for the various chips perhaps), or a reliable way to measure T for each specific instruction. I know I can use any number of tools to measure aggregate processor time used, memory used, etc. I want to quantify at a lower level.
So, I'm able to figure out how many times I go through the main part of the algorithm. For example, I iterate n * (n-1) times in a naive implementation, and between n (best case) and n + n * (n-1) (worst case) in another. I can also make a reasonable count of the total number of instructions (+ - = % * /, and logic statements), and I can compare those counts, but that's assuming the weight of each operation is the same. Also, I don't have any idea how to weight the actual time value of a logic statement (if, else, for, while) vs a mathematical operator... is "if" as much work as "+" each time I use it? I would love to know where to find this information.
So, for clarity, my goal is to discover how much processor time I am demanding of the CPU (or GPU or any U) so that I can design an optimal algorithm around processor time. Can someone give me an idea of where to start for iOS hardware?
Edit: This link to ClockServices.c and SIMD stuff in the developer portal might be a good start for people interested in this. A few more cups of coffee tonight and I might get through it ;)
On a modern platform, processor time isn't the only limiting factor. Often, memory access is.
Still, processor time:
Your basic approach to estimating the processor load is sensible, though: make a rough estimate of the cost based on your knowledge of typical platforms.
In this article, Table 1 shows the times for typical primitive operations in .NET. While your platform may vary, the relative time is usually very similar. Maybe you can find - or even make - one for iStuff.
(I haven't come across one so thorough for other platforms, except processor / instruction set manuals, but they deal with assembly instructions)
memory locality:
A cache miss can cost you hundreds of cycles, a disk access a thousand times as much. So controlling your memory access patterns (i.e. reducing the working set, restructuring and accessing data in a cache-friendly way) is an important part of evaluating an algorithm.
Xcode has Instruments to measure the performance of each function/operation; you can simply use them.