Performance of calling gameObject.SetActive(true) when game object is already active? - unity3d

Is there any performance concern around calling MyAlreadyActiveGameObject.SetActive(true) a ton, e.g., once per frame?
Put another way, is it ever worth pulling a gameObject.active check upwards? Or caching/checking an _alreadyActive?

If you are concerned about this, you could always do a basic if-check on the same line.
if(!myObject.activeSelf) myObject.SetActive(true);

When in doubt, try the simple thing first and see if it works. Sometimes it's worth focusing on optimization, but you should check if it's necessary effort. Unity's built-in profiling tools can give you an estimate on how long each operation is taking in each frame.
Is there any performance concern around calling MyAlreadyActiveGameObject.SetActive(true) a ton, e.g., once per frame?
It does take time, but not much. If you're doing this hundreds or thousands of times per frame, it might be worth worrying about. Up until then, the concern should be negligible.
If you are worried about performance for a large number of objects, you can look into pausing expensive behaviors for objects which are far away, out of view, etc.

Related

Do cats and scalaz create performance overhead on application?

I know it is totally a nonsense question but due to my illiteracy on programming skill this question came to my mind.
Cats and scalaz are used so that we can code in Scala similar to Haskell/in pure functional programming way. But for achieving this we need to add those libraries additionally with our projects. Eventually for using these we need to wrap our codes with their objects and functions. It is something adding extra codes and dependencies.
I don't know whether these create larger objects in memory.
These is making me think about. So my question: will I face any performance issue like more memory consumption if I use cats/scalaz ?
Or should I avoid these if my application needs performance?
Do cats and scalaz create performance overhead on application?
Absolutely.
The same way any line of code adds performance overhead.
So, if that is your concern, then don't write any code (well, actually the world may be simpler if we would have never tried all this).
Now, dick answer outside. The proper question you should be asking is: "Does the overhead of X library is harmful to my software?"; remember this applies to any library, actually to any code you write, to any algorithm you pick, etc.
And, in order to answer that question, we need some things before.
Define the SLAs the software you are writing must hold. Without those, any performance question / observation you made is pointless. It doesn't matter if something is faster / slower if you don't know if that is meaningful for you and your clients.
Once you have SLAs you need to perform stress tests to verify if your current version of the software satisfies those. Because, if your current code is performant enough, then you should worry about other things like maintainability, testing, adding more features, etc.
PS: Remember that those SLAs should not be raw numbers but be expressed in terms of percentiles, the same goes for the results of the tests.
When you found that you are falling your SLAs then you need to do proper benchmarking and debugging to identify the bottlenecks of your project. As you saw, caring about performance must be done on each line of code, but that is a lot of work that usually doesn't produce any relevant output. Thus, instead of evaluating the performance of everything, we find the bottlenecks first, those small pieces of code that have the biggest contributions to the overall performance of your software (remember the Pareto principle).
Remember that in this step, we have to be integral, network matters too. (and you will see this last one is usually the biggest slowdown; thus, usually you would rather search for architectural solutions like using Fibers instead of Threads rather than trying to optimize small functions. Also, sometimes the easier and cheaper solution is better infrastructure).
When you find the bottleneck, then you need to formulate some alternatives, implement those and not only benchmark them but do Statistical hypothesis testing to validate if the proposed changes are worth it or not. And, of course, validate if they were enough to satisfy the SLAs.
Thus, as you can see, performance is an art and a lot of work. So, unless you are committed to doing all this then stop worrying about something you will not measure and optimize properly.
Rather, focus on increasing the maintainability of your code. This actually also helps performance, because when you find that you need to change something you would be grateful that the code is as clean as possible and that the whole architecture of the code allows for an easy change.
And, believe me when I say that, using tools like cats, cats-effect, fs2, etc will help with that regard. Also, they actually pretty optimized on their core so you should be good for a lot of use cases.
Now, the big exception is that if you know that the work you are doing will be very CPU and memory bound then yeah, you pretty much can be sure all those abstractions will be harmful. In those cases, you may even want to stay away from the JVM and rather write pretty low-level code in a language like Rust which will provide you with proper tools for that kind of problem and still be way safer than plain old C.

What happens when an AppendStructuredBuffer overflows in a compute shader?

I have a Unity project in which I'm writing to an AppendStructuredBuffer<Triangle> via Append(triangle) in a compute shader.
In this instance, I know the theoretical limit to the number of triangles that could exist, so the obvious correct approach is to size the buffer accordingly. As a hack, though, I'm experimenting with allocating drastically smaller buffers so that they can be more efficiently processed by other parts of the system (in particular, reading back to CPU). One could imagine other situations in which a specific limit may not be known, or could be wrongly assumed.
Clearly, this is potentially hazardous. I'm sure there are more robust approaches that could be used for my current system (or more generally) without sacrificing performance, but I'm not (particularly) asking for advice on that.
What I want to know is what the expected behaviour is when a program calls Append() beyond the capacity of such a buffer. I imagine that it is undefined, and potentially liable to corrupt other areas of VRAM, to an extent dependent on GPU drivers / DirectX version etc. It may be that it is more formally specified, but I haven't been able to find that out.
Of course, even if the behaviour is specified, it seems somewhat reckless to deliberately risk. Still, I'd like to know:
Whether it is possible to detect that such a buffer is full in the context of a kernel function (given the highly threaded nature this is likely impractical).
What the performance implications of that are if it is possible.
What the consequences of overflowing are (in this instance I'm specifically anticipating it, but bugs happen).
How all of the above might be expected to differ for different hardware vendors, APIs, etc.
Perhaps it is 'safe' to the extent that excess data will simply be lost to the void without cost. In any case the system can - for example - periodically check fullness of buffers and do any extra housework that may be necessary... leaving the question of how severe any mistakes in the tuning of such a system might be.
Under many circumstances, at least in DirectX, out of bounds access is defined as returning 0. I'm still not totally sure about writes, but think there is reason to believe they should be generally safe in current implementations.
I would still be very wary of relying on this, especially when using other APIs.
According to this specification,
5.3.10.2 Using Unordered Count and Append Buffers
...
The counter behind imm_atomic_alloc and imm_atomic_consume has no
overflow or underflow clamping, and there is no feedback given to the
shader as to whether overflow/underflow happened (wrapping of the
counter). The only thing the counter really accomplishes is a way of
generating unique addresses that is conveniently bundled with the UAV.
Further, https://microsoft.github.io/DirectX-Specs/d3d/archive/D3D11_3_FunctionalSpec.htm#inst_IMM_ATOMIC_ALLOC
There is no clamping of the count, so it wraps on overflow.
I don't think I'm wrong in interpretting 'wrapping' as being to the length of the buffer in these instances.
So, the answer as I understand it is that on Append() the internal counter will wrap, and subsequent invocations will end up overwriting earlier data. As it happens, I am currently rendering my buffer without reference to such a counter (because I do another pass on the 'triangles' to turn them into vertices for rendering, which I currently do on a non-AppendBuffer). I should experiment with passing a buffer with a count to that draw call, which should allow me to verify whether most of my model suddenly disappears when I overflow.
In any case, it seems that the operation should be safe in terms of not corrupting other parts of the system, but that referring to the counter may be the wrong way to detect problems.

Any performance gain/loss with having several function calls rather than a single large one?

I am currently making a game for the iPad and iPhone using cocos2d, Box2D and Objective-C.
A lot of stuff is happening every update, and a lot has to be resolved.
I recently refactored a lot of my code to several small methods, instead of having hundreds of lines of code inside the same method.
Is there any performance loss doing this?
Will fewer method calls increase performance?
Each function call results in a constant-time (O(1)) delay because of the stack frame adjustments and branching. However, you won't feel that delay unless the calls are made inside a time-critical loop a million times.
The best approach would be, I think, writing the cleanest code possible and then optimizing it -- with the help of a profiler -- as needed.
You may also want to check out this answer: https://stackoverflow.com/a/4816703/252687 Inline functions may reduce the aforementioned overhead a bit without compromising the modularity.
I have seen cases where multiple smaller functions resulted in significantly better-performing code, since the compiler was better able to optimize registers. Highly dependent on the compiler and style of programming, though.
But in general, on modern systems (other than really low-level microprocessors) optimizing performance at this level is counter-productive. Better to well-structure the code (which generally implies a fair number of subroutines) so that it's more reliable, easier to maintain, and easier to spot and fix more global performance issues.
Of course there is a performance decline with more method calls. However that is not a reason to use fewer, that would be pre-mature optimization at the expense of cleaner code.
Personally I go for the cleanest most clear code, let the compiler optimize and in the end profile for the real bottlenecks.
I was once hired on the basis of an answer to single question, that was I would profile before optimizing. :-)
After the compiler optimizes your code, you probably won't notice any reliable performance difference, unless you are trying to use method dispatches inside the inner loops of a CPU intensive computation routine, such as DSP or pixel level image processing.

Is it a good idea to warm up a cache in the BEGIN block in Perl?

Is it a good idea to warm up cache in the BEGIN block, when it gets used?
You didn't really provide any information on what kind of environment you're talking about, which I think is important. In most cases the answer is probably "no", but I can think of one case where it's a definite yes, which is preforking servers -- web applications and the like. In that case, any work that you can do "before the fork" not only saves the cost of having the children recompute the same values individually, it alo saves memory, since the pages containing the results can be shared across all of the child processes by the OS's COW mechanism.
If you're talking about a module you're writing and not an application, then I'd say no, don't lift things to compilation time without the user's permission unless they're things that have to be done for the module to work. Instead, provide a preheat_cache class method, and if your caller has a reason to need a hot cache at compile time they can put the call into a BEGIN block themselves. You could also use a :preheat_cache import tag but that's unnecessarily fancy in my book.
If it's a choice between preloading your cache at compile time, or preloading your cache as the first thing you do at run time, there's virtually no difference.
If your cache is large enough that loading it will trigger a lot of page swaps, that's an argument for waiting until run time. That way, all your module loading and other compile time code can be done while your system is under a lighter load.
I'm going to go with "no", even though I could be wrong. Reasoning goes like this: keep the code, and data it uses, small, so that it takes up less space in any caches (I am presuming you mean CPU cache, not programmatic hashes with common query results or some such thing).
Unless you see some sort of bad access pattern, trying to second guess what needs to be prefetched is probably useless at best. In fact such code or initialization data is likely to displace something you (or another process on the system) were actually using. Think about what you can do in the actual work part of the code to maximize locality of reference, to try to stay within smaller memory regions at any one time.
I used to use "top" to detect when processes were swapping between memory and disk. I don't know of any good tools yet to tell how often a process is getting cache misses and going to plain old slow mo'board memory. There must be such tools, I just don't know what they are yet (software tools, rather than some custom In Circuit Emulator type hardware). Perhaps some thought on this earlier in the day...
by warm up I assume you mean use BEGIN() to guarantee the cache is preloaded before anything else in your script executes?
If you need the cache for your program to run properly, then yes, I think it would be a good idea.

How to use the cachegrind output to optimize the application

I need to improve the throughput of the system.
The usual cycle of optimization has been done and we have already achieved 1.5X better throughput.
I am now beginning to wonder if I can utilize the cachegrind output to improve the system's throughput.
Can somebody point me to how to begin on this?
What I understand is we need to ensure most frequently used data should be kept small enough so that it remains in L1 cache and the next set of data should fit in the L2.
Is this the right direction I am taking?
It`s true that cachegrind output in itself does not give too much information how to go about optimizing code. One needs to know how to interpret it and what you are saying about data fitting into L1 and L2 is indeed the right direction.
To fully understand how memory access patterns influence performance, I recommend reading an excellent paper "What Every Programmer Should Know About Memory" by Ulrich Drepper, the GNU libc maintainer.
If you're having trouble parsing the cachegrind output, look into KCacheGrind (it should be available in your distro of choice). I use it and find it quite helpful.
According to the Cachegrind documentation, the details given to you by cachegrind are the number of cache misses for a given part of your code. You need to know about how caches work on the architecture you are targetting so that you know how to fix the code. In practice this means making data smaller or changing the access pattern of some data so that cached data is still in the cache. However you need to understand your program's data and data access before you can act on the information. As it says in the manual,
In short, Cachegrind can tell you where some of the bottlenecks in your code are, but it can't tell you how to fix them. You have to work that out for yourself. But at least you have the information!
1.5x is a nice speedup. It means you found something that took 33% of the time that you could get rid of. I bet you can do more, even before you get down to low-level issues like data memory cache. This is an example of how. Basically, you could have additional performance problems (and opportunities for speedup) that were not large before, like 25% say. Well, with the 1.5x speedup, that 25% is now 37.5%, so it is "worth more" than it was. Often such a problem is in the form of some mid-stack function call that is requesting work that, once you know how much it costs, you may decide isn't completely necessary. Since kcachegrind does not really pinpoint these, you may not realize it is a problem.