How to use the cachegrind output to optimize the application - daemon

I need to improve the throughput of the system.
The usual cycle of optimization has been done and we have already achieved 1.5X better throughput.
I am now beginning to wonder if I can utilize the cachegrind output to improve the system's throughput.
Can somebody point me to how to begin on this?
What I understand is that the most frequently used data should be kept small enough to remain in the L1 cache, and the next set of data should fit in L2.
Is this the right direction to take?

It's true that the cachegrind output by itself does not tell you much about how to go about optimizing code. One needs to know how to interpret it, and what you are saying about data fitting into L1 and L2 is indeed the right direction.
To fully understand how memory access patterns influence performance, I recommend reading an excellent paper "What Every Programmer Should Know About Memory" by Ulrich Drepper, the GNU libc maintainer.

If you're having trouble parsing the cachegrind output, look into KCacheGrind (it should be available in your distro of choice). I use it and find it quite helpful.

According to the Cachegrind documentation, what cachegrind gives you is the number of cache misses for a given part of your code. You need to know how caches work on the architecture you are targeting in order to know how to fix the code. In practice this means making data smaller, or changing the access pattern of some data so that data already in the cache gets reused. However, you need to understand your program's data and its access patterns before you can act on the information. As the manual says,
In short, Cachegrind can tell you where some of the bottlenecks in your code are, but it can't tell you how to fix them. You have to work that out for yourself. But at least you have the information!
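To make the access-pattern point concrete, here is a minimal C sketch (my own illustration, not taken from the Cachegrind manual). Summing a matrix column by column jumps N*sizeof(double) bytes between consecutive accesses, so it misses the data cache far more often than the row-by-row version, and cachegrind will attribute the extra D1/LL read misses to the first loop:

#include <stdio.h>

#define N 1024
static double m[N][N];

/* Column-major traversal: consecutive accesses are far apart in memory,
   so nearly every access pulls in a new cache line. */
static double sum_by_columns(void) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += m[i][j];
    return s;
}

/* Row-major traversal: consecutive accesses are adjacent in memory,
   so each cache line is fully used before it is evicted. */
static double sum_by_rows(void) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += m[i][j];
    return s;
}

int main(void) {
    printf("%f %f\n", sum_by_columns(), sum_by_rows());
    return 0;
}

Run it under valgrind --tool=cachegrind ./a.out and then cg_annotate cachegrind.out.<pid> to see the per-function miss counts; fixing this kind of pattern is exactly the "change the access pattern so cached data gets reused" advice above.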

1.5x is a nice speedup. It means you found something that took a third of the run time and got rid of it. I bet you can do more, even before you get down to low-level issues like the data memory cache. This is an example of how. Basically, you could have additional performance problems (and opportunities for speedup) that were not large before, say 25%. With the 1.5x speedup, that 25% is now 37.5% of the remaining run time, so it is "worth more" than it was. Often such a problem takes the form of some mid-stack function call that requests work which, once you know what it costs, you may decide isn't entirely necessary. Since kcachegrind does not really pinpoint these calls, you may not realize they are a problem.
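For reference, the arithmetic behind that 37.5% figure: if a secondary problem took 0.25 T of the original run time T, and the 1.5x speedup cut the total to T / 1.5, then the unchanged 0.25 T is now 0.25 T / (T / 1.5) = 0.25 * 1.5 = 0.375, i.e. 37.5% of the remaining time.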

Do cats and scalaz create performance overhead on application?

I know this may be a naive question, but due to my inexperience with programming it came to my mind.
Cats and Scalaz are used so that we can write Scala in a pure functional style, similar to Haskell. But to achieve this we have to add those libraries as extra dependencies to our projects, and to use them we end up wrapping our code in their objects and functions. That means extra code and extra dependencies.
I don't know whether these create larger objects in memory.
This is what makes me wonder. So my question: will I face any performance issues, such as higher memory consumption, if I use Cats/Scalaz?
Or should I avoid these if my application needs performance?
Do cats and scalaz create performance overhead on application?
Absolutely.
The same way any line of code adds performance overhead.
So, if that is your concern, then don't write any code (well, actually the world might be simpler if we had never tried all this).
Now, snarky answer aside, the proper question you should be asking is: "Is the overhead of library X harmful to my software?" Remember this applies to any library, and actually to any code you write, any algorithm you pick, etc.
And, in order to answer that question, we need a few things first.
Define the SLAs that the software you are writing must meet. Without those, any performance question or observation you make is pointless. It doesn't matter whether something is faster or slower if you don't know whether that is meaningful for you and your clients.
Once you have SLAs you need to perform stress tests to verify if your current version of the software satisfies those. Because, if your current code is performant enough, then you should worry about other things like maintainability, testing, adding more features, etc.
PS: Remember that those SLAs should not be raw numbers but be expressed in terms of percentiles, the same goes for the results of the tests.
When you find that you are failing your SLAs, you need to do proper benchmarking and debugging to identify the bottlenecks in your project. As you saw, caring about performance on every single line of code is a lot of work that usually doesn't produce any relevant output. Thus, instead of evaluating the performance of everything, we find the bottlenecks first: those small pieces of code that contribute the most to the overall performance of your software (remember the Pareto principle).
Remember that in this step we have to look at the whole system; the network matters too (and you will see that this last one is usually the biggest slowdown; thus, you would usually rather look for architectural solutions, like using Fibers instead of Threads, than try to optimize small functions. Also, sometimes the easier and cheaper solution is better infrastructure).
When you find the bottleneck, you need to formulate some alternatives, implement them, and not only benchmark them but do statistical hypothesis testing to validate whether the proposed changes are worth it. And, of course, validate whether they were enough to satisfy the SLAs.
Thus, as you can see, performance is an art and a lot of work. So, unless you are committed to doing all this, stop worrying about something you will not measure and optimize properly.
Rather, focus on increasing the maintainability of your code. This actually also helps performance, because when you find that you need to change something you would be grateful that the code is as clean as possible and that the whole architecture of the code allows for an easy change.
And, believe me when I say that using tools like cats, cats-effect, fs2, etc. will help in that regard. Also, they are actually pretty well optimized at their core, so you should be fine for a lot of use cases.
Now, the big exception is that if you know that the work you are doing will be very CPU and memory bound then yeah, you pretty much can be sure all those abstractions will be harmful. In those cases, you may even want to stay away from the JVM and rather write pretty low-level code in a language like Rust which will provide you with proper tools for that kind of problem and still be way safer than plain old C.

Performance of calling gameObject.SetActive(true) when game object is already active?

Is there any performance concern around calling MyAlreadyActiveGameObject.SetActive(true) a ton, e.g., once per frame?
Put another way, is it ever worth pulling a gameObject.active check upwards? Or caching/checking an _alreadyActive?
If you are concerned about this, you could always do a basic if-check on the same line.
if(!myObject.activeSelf) myObject.SetActive(true);
When in doubt, try the simple thing first and see if it works. Sometimes it's worth focusing on optimization, but you should check if it's necessary effort. Unity's built-in profiling tools can give you an estimate on how long each operation is taking in each frame.
Is there any performance concern around calling MyAlreadyActiveGameObject.SetActive(true) a ton, e.g., once per frame?
It does take time, but not much. If you're doing this hundreds or thousands of times per frame, it might be worth worrying about. Up until then, the concern should be negligible.
If you are worried about performance for a large number of objects, you can look into pausing expensive behaviors for objects which are far away, out of view, etc.

Any performance gain/loss with having several function calls rather than a single large one?

I am currently making a game for the iPad and iPhone using cocos2d, Box2D and Objective-C.
A lot of stuff is happening every update, and a lot has to be resolved.
I recently refactored a lot of my code to several small methods, instead of having hundreds of lines of code inside the same method.
Is there any performance loss doing this?
Will fewer method calls increase performance?
Each function call results in a constant-time (O(1)) delay because of the stack frame adjustments and branching. However, you won't feel that delay unless the calls are made inside a time-critical loop a million times.
The best approach would be, I think, writing the cleanest code possible and then optimizing it -- with the help of a profiler -- as needed.
You may also want to check out this answer: https://stackoverflow.com/a/4816703/252687 Inline functions may reduce the aforementioned overhead a bit without compromising the modularity.
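As a small illustration of the inlining point (my own sketch in plain C, which applies directly in an Objective-C project; the helper names are made up): when a tiny helper is visible at the call site and marked inline, an optimizing compiler will usually eliminate the call entirely, so the cleaner factored-out code costs nothing in the hot loop.

/* Small helper: with optimization enabled (e.g. -O2), the compiler will
   typically inline this, so the loop below pays no call overhead. */
static inline float clamp01(float x) {
    if (x < 0.0f) return 0.0f;
    if (x > 1.0f) return 1.0f;
    return x;
}

void normalize(float *values, int n) {
    for (int i = 0; i < n; i++)
        values[i] = clamp01(values[i]);
}

Note that Objective-C message sends (dynamic method dispatches) are much harder for the compiler to inline, which is why dispatches inside inner loops, mentioned in the last answer below, are the main case to watch.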
I have seen cases where multiple smaller functions resulted in significantly better-performing code, since the compiler was better able to optimize registers. Highly dependent on the compiler and style of programming, though.
But in general, on modern systems (other than really low-level microprocessors) optimizing performance at this level is counter-productive. Better to well-structure the code (which generally implies a fair number of subroutines) so that it's more reliable, easier to maintain, and easier to spot and fix more global performance issues.
Of course there is a performance cost to more method calls. However, that is not a reason to use fewer; that would be premature optimization at the expense of cleaner code.
Personally, I go for the cleanest, clearest code, let the compiler optimize, and in the end profile for the real bottlenecks.
I was once hired on the basis of my answer to a single question: that I would profile before optimizing. :-)
After the compiler optimizes your code, you probably won't notice any reliable performance difference, unless you are trying to use method dispatches inside the inner loops of a CPU intensive computation routine, such as DSP or pixel level image processing.

Is it a good idea to warm up a cache in the BEGIN block in Perl?

Is it a good idea to warm up a cache in the BEGIN block, when it gets used?
You didn't really provide any information on what kind of environment you're talking about, which I think is important. In most cases the answer is probably "no", but I can think of one case where it's a definite yes: preforking servers, such as web applications. In that case, any work that you can do "before the fork" not only saves the cost of having the children recompute the same values individually, it also saves memory, since the pages containing the results can be shared across all of the child processes by the OS's copy-on-write (COW) mechanism.
If you're talking about a module you're writing and not an application, then I'd say no, don't lift things to compilation time without the user's permission unless they're things that have to be done for the module to work. Instead, provide a preheat_cache class method, and if your caller has a reason to need a hot cache at compile time they can put the call into a BEGIN block themselves. You could also use a :preheat_cache import tag but that's unnecessarily fancy in my book.
If it's a choice between preloading your cache at compile time, or preloading your cache as the first thing you do at run time, there's virtually no difference.
If your cache is large enough that loading it will trigger a lot of page swaps, that's an argument for waiting until run time. That way, all your module loading and other compile time code can be done while your system is under a lighter load.
I'm going to go with "no", even though I could be wrong. Reasoning goes like this: keep the code, and data it uses, small, so that it takes up less space in any caches (I am presuming you mean CPU cache, not programmatic hashes with common query results or some such thing).
Unless you see some sort of bad access pattern, trying to second guess what needs to be prefetched is probably useless at best. In fact such code or initialization data is likely to displace something you (or another process on the system) were actually using. Think about what you can do in the actual work part of the code to maximize locality of reference, to try to stay within smaller memory regions at any one time.
I used to use "top" to detect when processes were swapping between memory and disk. I don't know of any good tools yet to tell how often a process is getting cache misses and going to plain old slow mo'board memory. There must be such tools, I just don't know what they are yet (software tools, rather than some custom In Circuit Emulator type hardware). Perhaps some thought on this earlier in the day...
By "warm up" I assume you mean using a BEGIN block to guarantee the cache is preloaded before anything else in your script executes?
If you need the cache for your program to run properly, then yes, I think it would be a good idea.

Do software metrics work both ways?

I just started working for a large company. In a recent internal audit measuring metrics such as cyclomatic complexity and file sizes, it turned out that several modules, including the one owned by my team, have a very high index. So for the last week we have all been concentrating on lowering these indexes for our code, by removing decision points and splitting files.
Maybe I am missing something, being the new guy, but how will this make our software better? I know that software metrics can measure how good your code is, but does it work the other way around? Will our code become better just because, for example, we split a 10,000-line file into four 2,500-line files?
The purpose of metrics is to have more control over your project. They are not a goal on their own, but can help to increase the overall quality and/or to spot design disharmonies. Cyclomatic complexity is just one of them.
Test coverage is another one. It is, however, well known that you can get high test coverage and still have a poor test suite, or the opposite: a great test suite that focuses on one part of the code. The same happens with cyclomatic complexity. Consider the context of each metric and whether there is something to improve.
You should try to avoid accidental complexity, but if the processing has essential complexity, your code will be more complicated anyway. Try then to write maintainable code with a fair balance between the number of methods and their size.
A great book to look at is "Object-oriented metrics in practice".
It depends how you define "better". Smaller files and lower cyclomatic complexity generally make code easier to maintain. Of course the code itself could still be wrong, and unit tests and other test methods will help with that. It's just one part of making code more maintainable.
Code is easier to understand and manage in smaller chunks.
It is a good idea to group related bits of code in their own functional areas for improved readability and cohesiveness.
Having a whole large program all in a single file will make your project very difficult to debug, extend, and maintain. I think this is quite obvious.
The particular metric is really only a rule of thumb and should not be followed religiously, but it may indicate something is not as nice as it could be.
Whether legacy working code should be touched and refactored is something that needs to be evaluated. If you decide to do so, you should consider writing tests for it first, that way you'll quickly know whether your changes broke any required behavior.
Have you never opened one of your own projects again after several months? The larger and more complex the individual components are, the more one asks oneself what genius wrote that code and why the heck he wrote it that way.
And there's never too much, or even enough, documentation. So if the components themselves are smaller and less complex, it's easier to re-understand them.
This is a bit subjective. The idea of assigning a maximum cyclomatic complexity index is to improve the maintainability and readability of the code.
For example, from the perspective of unit testing, it is really convenient to have smaller "units", and avoiding long functions helps the reader understand the code. You cannot ensure that the original developer will work on the code forever, so from the company's perspective it is fair to set such criteria to keep the code "simple".
It is easy to write code that a computer can understand. It is much harder to write code that a human can understand.
how will this make our software better?
Excerpt from the article Fighting Fabricated Complexity, about NDepend, a tool for .NET developers. NDepend is good at helping teams manage large and complex code bases. The idea is that code metrics are good at reducing fabricated complexity in the code implementation:
During my interview with Scott Hanselman on software metrics, Scott had a particularly relevant remark.
Basically, while I was explaining that long and complex methods are killing quality and should be split into smaller methods, Scott asked me:
Looking at this big, too-complicated method, if I break it up into smaller methods, the complexity of the business problem is still there. Looking at my application I can say this is no longer complex from the method perspective, but the software itself, the way it is coupled with other bits of code, may indicate other problems…
Software complexity is a subjective measure relative to human cognitive capacity. Something is complex when it requires effort to be understood by a human. The fact is that software complexity is a two-dimensional measure. To understand a piece of code one must understand both:
what this piece of code is supposed to do at run time (the behavior of the code); this is the business problem complexity
how the actual implementation achieves that behavior (what the developer's mental state was while writing the code); this is the implementation complexity.
Business problem complexity lies in the specification of the program, and reducing it means working on the behavior of the code itself. On the other hand, we speak of fabricated complexity when it comes to the complexity of the implementation: it is fabricated in the sense that it can be reduced without altering the behavior of the code.
how will this make our software better?
It can be a trigger for refactoring, but improving one metric doesn't guarantee that all other quality metrics stay the same. And tools are only able to track a very few metrics; you can't measure the degree to which code is understandable.
Will our code become better just because, for example, we are splitting a 10,000-line file into four 2,500-line files?
Not necessarily. Sometimes the larger file can be more understandable, better structured, and have fewer bugs.
Most design patterns, for example, "improve" your code by making it more general and maintainable, but often at the cost of added source lines.