Sibling calls don’t appear in stack trace? - compiler-optimization

I just stumbled upon a line in the Wikipedia article on stack traces.
It says:
Sibling calls do not appear in a stack trace.
What exactly does this mean? I thought all stack frames appeared in a stack trace. From my understanding, even with a tail call, new frames are still pushed onto the stack, and are thus traceable. Is there an example where I can see this in action, where sibling calls are not shown in the stack trace?

From my understanding, even with a tail call, new frames are still pushed onto the stack, and are thus traceable
You misunderstand.
From Wikipedia:
Tail calls can be implemented without adding a new stack frame to the call stack. [emphasis mine] Most of the frame of the current procedure is no longer needed, and can be replaced by the frame of the tail call, modified as appropriate (similar to overlay for processes, but for function calls). The program can then jump to the called subroutine. Producing such code instead of a standard call sequence is called tail call elimination.
As "sibling calls" are just a special case of tail calls, they can be optimized in the same way. You should be able to see examples of this in any scenario where the compiler would optimize other tail calls, as well as in those specific examples such as described in the above-referenced Wikipedia article.

Sibling calls are, according to the page that the link under that term leads to, "tail calls to functions which take and return the same types as the caller".
Such a call can be compiled as a jump that reuses the current stack frame, instead of a normal call that sets up a new one.
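Here is one minimal way to see it in action, as a sketch assuming GCC or Clang on Linux with glibc's backtrace() (the function names are made up). Build it once with -O0 and once with -O2 and compare the traces printed from leaf():

#include <stdio.h>
#include <execinfo.h>

/* Print the current call stack using glibc's backtrace(). */
static void show_stack(void)
{
    void *frames[16];
    int n = backtrace(frames, 16);
    backtrace_symbols_fd(frames, n, 1);   /* fd 1 = stdout */
}

/* noinline keeps the functions separate, so only the call itself can be
 * optimized rather than the whole body being inlined away. */
__attribute__((noinline)) static void leaf(int x)
{
    show_stack();                         /* not in tail position */
    printf("leaf(%d)\n", x);
}

/* The call to leaf() is the last thing middle() does, and both functions
 * take and return the same types, so with -O2 (-foptimize-sibling-calls)
 * the compiler can turn the call into a jump: middle's frame is reused
 * and never shows up in the backtrace printed by leaf(). */
__attribute__((noinline)) static void middle(int x)
{
    leaf(x + 1);
}

int main(void)
{
    middle(41);
    return 0;
}

/* gcc -O0 -rdynamic sibling.c && ./a.out   -> the trace includes middle()
 * gcc -O2 -rdynamic sibling.c && ./a.out   -> middle() is gone           */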

Related

How do I make a dtrace probe only execute when a certain function is in the stack

OK, this is "fun." I'm trying to figure out a weird thread timeout in a program, and need to look at what is happening when pthread_cond_timedwait() returns on Solaris... But... Only when a certain wrapper function or functions are being called (call them foo_lock and foo_unlock)...
I don't think I can use this answer because the function I'm looking at is at least 2-3 hops up in the stack: dtrace execute action only when the function returns to a specific module
One of the weird behaviors is that I see the right parent in tracing entry, but not exit, when I use a return probe... This could be a buffering issue though. Have to dig...

Is it possible to tail call eBPF programs that use different modes?

Is it possible to tail call eBPF programs that use different modes?
For example, if I wrote a program that does printk("hello world") using a kprobe, would I be able to tail call an XDP program afterwards, or vice versa?
I wrote an eBPF program that uses a socket buffer, and it seems that when I try to tail call another program that uses a kprobe, the program doesn't load.
I wanted to tail call a program that returns XDP_PASS after using BPF.SOCKET_FILTER mode, but the tail call doesn't seem to work.
I've been trying to figure this out, but I can't find any documentation about tail calling programs that use different modes :P
Thanks in advance!
No, it is not.
Have a look at kernel commit 04fd61ab36ec, which introduced tail calls: in its first piece of code (in the internal kernel header bpf.h), the definition of struct bpf_array gains an owner_prog_type member, explained by the following comment:
/* 'ownership' of prog_array is claimed by the first program that
* is going to use this map or by the first program which FD is stored
* in the map to make sure that all callers and callees have the same
* prog_type and JITed flag
*/
So once the program type associated with a BPF program array, used for tail calls, has been defined, it is not possible to use that array with other program types. This makes sense: different program types work with different contexts (packet data vs. traced function context vs. ...), can use different helpers, have return values with different meanings, and require different checks from the verifier, so it's hard to see how jumping from one type to another could work. How could you start by processing a network packet, and all of a sudden jump to a piece of code that is supposed to trace some internals of the kernel? :)
Note that it is also impossible to mix JIT-ed and non-JIT-ed programs, as indicated by the owner_jited member of the struct.
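For reference, here is a minimal sketch of what a legal setup looks like, assuming libbpf-style BPF C (the map and section names are my own): both the program whose file descriptor is stored in the program array and the program performing the tail call must have the same type, socket filters in this example.

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Program array used as the jump table for tail calls. The first socket
 * filter program associated with it claims ownership (owner_prog_type),
 * so only socket filter programs can be stored in it afterwards. */
struct {
    __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
    __uint(max_entries, 1);
    __uint(key_size, sizeof(__u32));
    __uint(value_size, sizeof(__u32));
} jmp_table SEC(".maps");

/* Callee: must have the same program type as the caller. */
SEC("socket")
int callee(struct __sk_buff *skb)
{
    return skb->len;                /* socket filter: bytes of the packet to keep */
}

/* Caller: jumps to whatever socket filter program sits in slot 0. */
SEC("socket")
int caller(struct __sk_buff *skb)
{
    bpf_tail_call(skb, &jmp_table, 0);
    return 0;                       /* reached only if the tail call fails */
}

char LICENSE[] SEC("license") = "GPL";

The user-space step that stores callee's file descriptor into jmp_table is omitted here; the point is that trying to store, say, an XDP or kprobe program in that slot instead would be rejected once the array is owned by the socket filter type.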

Why is ILSpy adding variables on the stack instead of instructions?

Why is ILSpy adding variables on the stack instead of instructions? I mean, when pushing to or popping from the stack, it adds Ldloc and Stloc instructions. Can anyone explain why it has this behaviour? Thanks!
Because a stack slot acts like a variable: it can be used multiple times (e.g. on both branches of an if), but the effect of the instruction only happens once, when the value is pushed on the stack.
A decompiler that uses a stack of instructions would effectively cause the side effects of the instruction to instead happen at the point where the value is popped from the stack. This would be a program reordering that could subtly change program behavior -> incorrect decompilation.
In principle, using a stack of instructions would be possible within basic blocks; but when there's control flow (either outgoing or incoming) or a dup instruction, the whole stack of instructions would have to be converted to a stack of variables.
Currently the ILSpy ILReader uses a single pass (as specified in the Ecma-335 spec), so it doesn't know about incoming control flow during the ILReader run, so it has to always use a stack of variables to be safe.
It turns out that this is not how the .NET framework reads IL bytecodes, and some obfuscators are exploiting the difference. So in the future, we may rewrite the ILReader to work more like the .NET bytecode importer, at which point we might move to the mixed stack of variables+stack of instructions model. ILSpy issue #901
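To make the reordering concern concrete, here is a small analogy in C (the helper names are hypothetical, not ILSpy output):

/* Hypothetical helpers standing in for IL instructions with side effects. */
int produce(void);
int check(void);
void consume_a(int v);
void consume_b(int v);

/* ILSpy's model (stack slot as a variable): produce()'s side effect
 * happens exactly once, at the point where the value was pushed. */
void stack_slot_as_variable(void)
{
    int value = produce();
    if (check())
        consume_a(value);
    else
        consume_b(value);
}

/* A "stack of instructions" model would substitute produce() at each pop
 * site instead, moving its side effect to after the call to check(), a
 * reordering that can subtly change behavior, i.e. incorrect decompilation. */
void stack_of_instructions(void)
{
    if (check())
        consume_a(produce());
    else
        consume_b(produce());
}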

Understanding Chrome Dev Tools timeline

I'm trying to understand why I have several Long Frames reported by Chrome Dev Tools.
The first row (top of the call stack) in the flame chart consists mostly of Timer Fired events, triggered by jQuery.Deferred()s executing a bunch of $(function(){ }); ready functions.
If I dig into the jQuery source and replace its use of setTimeout with requestAnimationFrame, the flame chart doesn't change much: I still get many of the rAF callbacks firing within a single frame (as reported by DevTools), producing long frames. I'd have expected the pseudocode below:
window.requestAnimationFrame(function() {
    // do stuff
});
window.requestAnimationFrame(function() {
    // do more stuff
});
to be executed on two different animation frames. Is this not the case?
All of the JS that is executing is necessary, but what should I do to execute it as "micro tasks" (as hinted at, but not explained, here: https://developers.google.com/web/fundamentals/performance/rendering/optimize-javascript-execution) when setTimeout and rAF don't seem to achieve this?
Update
Here's a zoomed in shot of one of the long frames that doesn't seem to have any reflows (forced or otherwise) in it. Why are all the rAF callbacks here being executed in one frame?
Long frames are usually caused by forced synchronous layouts, which is when you (unintentionally) force a layout operation to happen early.
When you write to the DOM, the layout needs to be reflowed because it has been invalidated by the write operation. This usually happens at the next frame. However, if you try to read from the DOM, the layout happens early, in the current frame, in order to make sure that the correct value gets returned. When forced layout occurs, it causes long frames, leading to jank.
To prevent this from happening, you should only perform the write operations inside your requestAnimationFrame function. The read operations should be done outside of this, so as to avoid the browser doing an early layout.
Diagnose Forced Synchronous Layouts is a nicely explained article with a simple demo showing how to detect forced reflow in DevTools and how to resolve it.
It might also be worth checking out FastDom, a library for batching your DOM reads and writes. It is basically a queuing system, and is more scalable.
Additional Source:
What forces layout / reflow, by Paul Irish, contains a comprehensive list of properties and methods that will force layout/reflow.
Update: As for the assumption that multiple requestAnimationFrame calls will execute callbacks on separate frames, this is not the case. When you have consecutive calls, the browser adds the callbacks to a document list of animation callbacks. When the browser goes to run the next frame, it traverses the document list and executes each of the callbacks, in the order they were added.
See Animation Frames from the HTML spec for more of the implementation details.
This means you should avoid making such consecutive calls, especially where the combined execution times of the callbacks exceed your frame budget. I think this would explain the long frames that aren't caused by reflow.

Does MATLAB perform tail call optimization?

I've recently learned Haskell, and am trying to carry the pure functional style over to my other code when possible. An important aspect of this is treating all variables as immutable, i.e. constants. In order to do so, many computations that would be implemented using loops in an imperative style have to be performed using recursion, which typically incurs a memory penalty due to the allocation of a new stack frame for each function call. In the special case of a tail call (where the return value of the called function is immediately returned as the calling function's own result), however, this penalty can be bypassed by a process called tail call optimization (in one method, this can be done by essentially replacing a call with a jmp after setting up the stack properly). Does MATLAB perform TCO by default, or is there a way to tell it to?
If I define a simple tail-recursive function:
function tailtest(n)
    if n==0; feature memstats; return; end
    tailtest(n-1);
end
and call it so that it will recurse quite deeply:
set(0,'RecursionLimit',10000);
tailtest(1000);
then it doesn't look as if stack frames are eating a lot of memory. However, if I make it recurse much deeper:
set(0,'RecursionLimit',10000);
tailtest(5000);
then (on my machine, today) MATLAB simply crashes: the process unceremoniously dies.
I don't think this is consistent with MATLAB doing any TCO; the case where a function tail-calls itself, only in one place, with no local variables other than a single argument, is just about as simple as anyone could hope for.
So: No, it appears that MATLAB does not do TCO at all, at least by default. I haven't (so far) looked for options that might enable it. I'd be surprised if there were any.
In cases where we don't blow out the stack, how much does recursion cost? See my comment to Bill Cheatham's answer: it looks like the time overhead is nontrivial but not insane.
... Except that Bill Cheatham deleted his answer after I left that comment. OK. So, I took a simple iterative implementation of the Fibonacci function and a simple tail-recursive one, doing essentially the same computation in both, and timed them both on fib(60). The recursive implementation took about 2.5 times longer to run than the iterative one. Of course the relative overhead will be smaller for functions that do more work than one addition and one subtraction per iteration.
(I also agree with delnan's sentiment: highly-recursive code of the sort that feels natural in Haskell is typically likely to be unidiomatic in MATLAB.)
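For contrast, here is roughly what the same experiment looks like in a language whose compilers do perform TCO: a C sketch, assuming GCC or Clang. At -O2 the tail call is turned into a jump, so the deep recursion runs in constant stack space; at -O0 the same program typically overflows the stack.

#include <stdio.h>

/* Direct analogue of tailtest: the recursive call is the last thing the
 * function does, so a compiler performing TCO can replace it with a jump
 * and reuse the current stack frame. */
static long countdown(long n, long depth)
{
    if (n == 0)
        return depth;
    return countdown(n - 1, depth + 1);   /* tail call */
}

int main(void)
{
    /* gcc -O2: compiled into a loop, constant stack usage.
     * gcc -O0: one frame per level; 100 million frames blow the stack. */
    printf("%ld\n", countdown(100000000L, 0));
    return 0;
}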
There is a simple way to check this. Create this function tail_recursion_check:
function r = tail_recursion_check(n)
    if n > 1
        r = tail_recursion_check(n - 1);
    else
        error('error');
    end
end
and run tail_recursion_check(10), for example. You will see a long stack trace with 10 entries, most of them pointing at the recursive call on line 3. If there were tail call optimization, you would see only one.