LuaJIT FFI callback performance

The LuaJIT FFI docs mention that calling from C back into Lua code is relatively slow and recommend avoiding it where possible:
Do not use callbacks for performance-sensitive work: e.g. consider a numerical integration routine which takes a user-defined function to integrate over. It's a bad idea to call a user-defined Lua function from C code millions of times. The callback overhead will be absolutely detrimental for performance.
For new designs avoid push-style APIs (C function repeatedly calling a callback for each result). Instead use pull-style APIs (call a C function repeatedly to get a new result). Calls from Lua to C via the FFI are much faster than the other way round. Most well-designed libraries already use pull-style APIs (read/write, get/put).
However, they don't give any sense of how much slower callbacks from C are. If I have some code that I want to speed up that uses callbacks, roughly how much of a speedup could I expect if I rewrote it to use a pull-style API? Does anyone have any benchmarks comparing implementations of equivalent functionality using each style of API?
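(To make the two styles concrete, here is a minimal sketch of the API shapes being compared; the names are hypothetical, not taken from any real library:)

/* Push style: C drives the loop and calls back into Lua once per result. */
typedef void (*result_cb)(void *userdata, double value);
void scan_push(result_cb cb, void *userdata);   /* invokes cb for every result */

/* Pull style: Lua drives the loop and calls into C to fetch each result. */
typedef struct scanner scanner;                 /* opaque iterator state */
scanner *scan_open(void);
int scan_next(scanner *s, double *out);         /* returns 0 when exhausted */
void scan_close(scanner *s);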

On my computer, a function call from LuaJIT into C has an overhead of 5 clock cycles (notably, just as fast as calling a function via a function pointer in plain C), whereas calling from C back into Lua has a 135-cycle overhead, 27x slower. That said, a program that required a million calls from C into Lua would only add ~100 ms of overhead to its runtime; while it might be worth avoiding FFI callbacks in a tight loop that operates on mostly in-cache data, the overhead of callbacks that are invoked, say, once per I/O operation is probably not going to be noticeable compared to the overhead of the I/O itself.
$ luajit-2.0.0-beta10 callback-bench.lua
C into C 3.344 nsec/call
Lua into C 3.345 nsec/call
C into Lua 75.386 nsec/call
Lua into Lua 0.557 nsec/call
C empty loop 0.557 nsec/call
Lua empty loop 0.557 nsec/call
$ sysctl -n machdep.cpu.brand_string
Intel(R) Core(TM) i5-3427U CPU @ 1.80GHz
Benchmark code: https://gist.github.com/3726661
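(For context, the "C into Lua" row boils down to a C driver like the sketch below repeatedly invoking a function pointer that happens to be a LuaJIT FFI callback; this is a simplified illustration, not the exact code from the gist:)

typedef double (*callback)(double);

/* Each iteration pays the C->Lua transition cost measured above when cb
   is an FFI callback backed by a Lua function. */
double drive(callback cb, long reps)
{
    double acc = 0.0;
    for (long i = 0; i < reps; i++)
        acc += cb((double)i);
    return acc;
}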

Since this issue (and LJ in general) has been the source of great pain for me, I'd like to toss some extra information into the ring, in hopes that it may assist someone out there in the future.
'Callbacks' are Not Always Slow
When the LuaJIT FFI documentation says that 'callbacks are slow,' it is referring very specifically to a callback created by LuaJIT and passed through the FFI to a C function that expects a function pointer. This is completely different from other callback mechanisms; in particular, it has entirely different performance characteristics from calling a standard lua_CFunction that uses the Lua C API to invoke a callback.
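To make that distinction concrete, here is a rough sketch of the two mechanisms (the numeric callback signature and the function names are hypothetical, not from any particular library):

#include <lua.h>
#include <lauxlib.h>

/* 1) FFI-callback path (the case the docs warn about): the C library only
      sees a plain function pointer; LuaJIT synthesizes a C-callable stub
      for the Lua function behind it. */
typedef double (*sample_fn)(double);
void integrate(sample_fn f, double a, double b, int steps);

/* 2) lua_CFunction path: C code written against the Lua C API that invokes
      a Lua callback it keeps alive via luaL_ref. */
static double call_lua_cb(lua_State *L, int cb_ref, double x)
{
    lua_rawgeti(L, LUA_REGISTRYINDEX, cb_ref);  /* push the function stored with luaL_ref */
    lua_pushnumber(L, x);
    lua_call(L, 1, 1);                          /* ordinary C API call, not an FFI transition */
    double r = lua_tonumber(L, -1);
    lua_pop(L, 1);
    return r;
}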
With that said, the real question is then: when do we use the Lua C API to implement logic that involves pcall et al, vs. keeping everything in Lua? As always with performance, but especially in the case of a tracing JIT, one must profile (-jp) to know the answer. Period.
I have seen situations that looked similar yet fell on opposite ends of the performance spectrum; that is, I have encountered code (not toy code, but rather production code in the context of writing a high-perf game engine) that performs better when structured as Lua-only, as well as code (that seems structurally-similar) that performs better upon introducing a language boundary via calling a lua_CFunction that uses luaL_ref to maintain handles to callbacks and callback arguments.
Optimizing for LuaJIT without Measurement is a Fool's Errand
Tracing JITs are already hard to reason about, even if you're an expert in static language perf analysis. They take everything you thought you knew about performance and shatter it to pieces. If the concept of compiling recorded IR rather than compiling functions doesn't already annihilate one's ability to reason about LuaJIT performance, then the fact that calling into C via the FFI is more-or-less free when successfully JITed, yet potentially an order-of-magnitude more expensive than an equivalent lua_CFunction call when interpreted...well, this for sure pushes the situation over the edge.
Concretely, a system that you wrote last week that vastly out-performed a C equivalent may tank this week because you introduced an NYI in trace-proximity to said system, which may well have come from a seemingly-orthogonal region of code, and now your system is falling back and obliterating performance. Even worse, perhaps you're well-aware of what is and isn't an NYI, but you added just enough code to the trace proximity that it exceeded the JIT's max recorded IR instructions, max virtual registers, call depth, unroll factor, side trace limit...etc.
Also, note that, while 'empty' benchmarks can sometimes give a very general insight, it is even more important with LJ (for the aforementioned reasons) that code be profiled in context. It is very, very difficult to write representative performance benchmarks for LuaJIT, since traces are, by their nature, non-local. When using LJ in a large application, these non-local interactions become tremendously impactful.
TL;DR
There is exactly one person on this planet who really and truly understands the behavior of LuaJIT. His name is Mike Pall.
If you are not Mike Pall, do not assume anything about LJ behavior and performance. Use -jv (verbose; watch for NYIs and fallbacks), -jp (profiler! Combine with jit.zone for custom annotations; use -jp=vf to see what % of your time is being spent in the interpreter due to fallbacks), and, when you really need to know what's going on, -jdump (trace IR & ASM). Measure, measure, measure. Take generalizations about LJ performance characteristics with a grain of salt unless they come from the man himself or you've measured them in your specific usage case (in which case, after all, it's not a generalization). And remember, the right solution might be all in Lua, it might be all in C, it might be Lua -> C through FFI, it might be Lua -> lua_CFunction -> Lua, ...you get the idea.
Coming from someone who has been fooled time-and-time-again into thinking that he has understood LuaJIT, only to be proven wrong the following week, I sincerely hope this information helps someone out there :) Personally, I simply no longer make 'educated guesses' about LuaJIT. My engine outputs jv and jp logs for every run, and they are the 'word of God' for me with respect to optimization.

Two years later, I redid the benchmarks from Miles' answer, for the following reasons:
To see whether the results improved with the advancements since then (in both CPU and LuaJIT)
To add tests for functions with parameters and returns. The callback documentation mentions that apart from the call overhead, parameter marshalling also matters:
[...] the C to Lua transition itself has an unavoidable cost, similar to a lua_call() or lua_pcall(). Argument and result marshalling add to that cost [...]
To check the difference between PUSH-style and PULL-style APIs.
My results, on Intel(R) Core(TM) i7 CPU 920 @ 2.67GHz:
operation reps time(s) nsec/call
C into Lua set_v 10000000 0.498 49.817
C into Lua set_i 10000000 0.662 66.249
C into Lua set_d 10000000 0.681 68.143
C into Lua get_i 10000000 0.633 63.272
C into Lua get_d 10000000 0.650 64.990
Lua into C call(void) 100000000 0.381 3.807
Lua into C call(int) 100000000 0.381 3.815
Lua into C call(double) 100000000 0.415 4.154
Lua into Lua 100000000 0.104 1.039
C empty loop 1000000000 0.695 0.695
Lua empty loop 1000000000 0.693 0.693
PUSH style 1000000 0.158 158.256
PULL style 1000000 0.207 207.297
The code for these results is here.
Conclusion: C callbacks into Lua have a really big overhead when used with parameters (which is what you almost always do), so they really shouldn't be used in performance-critical spots. You can use them for IO or user input, though.
I am a bit surprised there is so little difference between PUSH/PULL styles, but maybe my implementation is not among the best.

There is a significant performance difference, as shown by these results:
LuaJIT 2.0.0-beta10 (Windows x64)
JIT: ON CMOV SSE2 SSE3 SSE4.1 fold cse dce fwd dse narrow loop abc sink fuse
n Push Time Pull Time Push Mem Pull Mem
256 0.000333 0 68 64
4096 0.002999 0.001333 188 124
65536 0.037999 0.017333 2108 1084
1048576 0.588333 0.255 32828 16444
16777216 9.535666 4.282999 524348 262204
The code for this benchmark can be found here.


What does the "world" mean in functional programming world?

I've been diving into functional programming for more than 3 years and I've been reading and understanding many articles and aspects of functional programming.
But I have often stumbled onto articles about the "world" in side-effect computations, and about carrying and copying the "world" in IO monad examples. What does the "world" mean in this context? Is this the same "world" in all side-effect computation contexts, or does it only apply to IO monads?
Also the documentation and other articles about Haskell mention the "world" many times.
Some references about this "world":
http://channel9.msdn.com/Shows/Going+Deep/Erik-Meijer-Functional-Programming
and this:
http://www.infoq.com/presentations/Taming-Effect-Simon-Peyton-Jones
I expect a sample, not just an explanation of the world concept. I welcome sample code in Haskell, F#, Scala, or Scheme.
The "world" is just an abstract concept that captures "the state of the world", i.e. the state of everything outside the current computation.
Take this I/O function, for example:
write : Filename -> String -> ()
This is non-functional, as it changes the file (whose content is part of the state of the world) by side effect. If, however, we modelled the world as an explicit object, we could provide this function:
write : World -> Filename -> String -> World
This takes the current world and functionally produces a "new" one, with the file modified, which you can then pass to consecutive calls. The World itself is just an abstract type, there is no way to peek at it directly, except through corresponding functions like read.
Now, there is one problem with the above interface: without further restrictions, it would allow a program to "duplicate" the world. For example:
w1 = write w "file" "yes"
w2 = write w "file" "no"
You've used the same world w twice, producing two different future worlds. Obviously, this makes no sense as a model for physical I/O. To prevent examples like that, a more fancy type system is needed that makes sure that the world is handled linearly, i.e., never used twice. The language Clean is based on a variation of this idea.
Alternatively, you can encapsulate the world such that it never becomes explicit and thereby cannot be duplicated by construction. That is what the I/O monad achieves -- it can be thought of as a state monad whose state is the world, which it threads through the monadic actions implicitly.
The "world" is a concept involved in one kind of embedding of imperative programming into a purely functional language.
As you most certainly know, purely functional programming requires the result of a function to depend exclusively on the values of the arguments. So suppose we want to express a typical getLine operation as a pure function. There are two evident problems:
getLine can produce a different result each time it's called with the same arguments (no arguments, in this case).
getLine has the side effect of consuming some portion of a stream. If your program uses getLine, then (a) each invocation of it must consume a different part of the input, (b) each part of the program's input must be consumed by some invocation. (You can't have two calls to getLine reading the same input line twice unless that line occurs twice in the input; you can't have the program randomly skip a line of input either.)
So getLine just can't be a function, right? Well, not so fast, there are some tricks we could do:
Multiple calls to getLine can return different results. To make that compatible with purely functional behavior, this means that a purely functional getLine could take an argument: getLine :: W -> String. Then we can reconcile the idea of different results on each call by stipulating that each call must be made with a different value for the W argument. You could imagine that W represents the state of the input stream.
Multiple calls to getLine must be executed in some definite order, and each must consume the input that was left over from the previous call. Change: give getLine the type W -> (String, W), and forbid programs from using a W value more than once (something that we can check at compilation). Now to use getLine more than once in your program you must take care to feed the earlier call's W result to the succeeding call.
As long as you can guarantee that Ws are not reused, you can use this sort of technique to translate any (single-threaded) imperative program into a purely functional one. You don't even need to have any actual in-memory objects for the W type—you just type-check your program and analyze it to prove that each W is only used once, then emit code that doesn't refer to anything of the sort.
So the "world" is just this idea, but generalized to cover all imperative operations, not just getLine.
Now having explained all that, you may be wondering if you're better off knowing this. My opinion is no, you aren't. See, IMO, the whole "passing the world around" idea is one of those things like monad tutorials, where too many Haskell programmers have chosen to be "helpful" in ways that actually aren't.
"Passing the world around" is routinely offered as an "explanation" to help newbies understand Haskell IO. But the problem is that (a) it's a really exotic concept for many people to wrap their heads around ("what do you mean I'm going to pass the state of the whole world around?"), (b) very abstract (a lot of people can't wrap their head around the idea that nearly every function your program will have an unused dummy parameter that neither appears in the source code nor the object code), and (c) not the easiest, most practical explanation anyway.
The easiest, most practical explanation of Haskell I/O, IMHO, goes like this:
Haskell is purely functional, so things like getLine can't be functions.
But Haskell has things like getLine. This means those things are something else that's not a function. We call them actions.
Haskell allows you to treat actions as values. You can have functions that produce actions (e.g., putStrLn :: String -> IO ()), functions that accept actions as arguments (e.g., (>>) :: IO a -> IO b -> IO b), etc.
Haskell however has no function that executes an action. There can't be an execute :: IO a -> a because it would not be a true function.
Haskell has built-in functions to compose actions: make compound actions out of simple actions. Using basic actions and action combinators, you can describe any imperative program as an action.
Haskell compilers know how to translate actions into executable native code. So you write an executable Haskell program by writing a main :: IO () action in terms of subactions.
Passing around values that represent "the world" is one way to make a pure model for doing IO (and other side effects) in pure declarative programming.
The "problem" with pure declarative (not just functional) programming is obvious. Pure declarative programming provides a model of computation. These models can express any possible computation, but in the real world we use programs to have computers do things that aren't computation in a theoretical sense: taking input, rendering to displays, reading and writing storage, using networks, controlling robots, etc, etc. You can directly model almost all of such programs as computation (e.g. what output should be written to a file given this input is a computation), but the actual interactions with things outside the program just isn't part of the pure model.
That's actually true of imperative programming too. The "model" of computation that is the C programming language provides no way to write to files, read from keyboards, or anything. But the solution in imperative programming is trivial. Performing a computation in the imperative model is executing a sequence of instructions, and what each instruction actually does depends on the whole environment of the program at the time it is executed. So you can just provide "magic" instructions that carry out your IO actions when they are executed. And since imperative programmers are used to thinking about their programs operationally[1], this fits very naturally with what they're already doing.
But in all pure models of computation, what a given unit of computation (function, predicate, etc) will do should only depend on its inputs, not on some arbitrary environment that can be different every time. So not only performing IO actions but also implementing computations which depend on the universe outside the program is impossible.
The idea for the solution is fairly simple though. You build a model for how IO actions work within the whole pure model of computation. Then all the principles and theories that apply to the pure model in general will also apply to the part of it that models IO. Then, within the language or library implementation (because it's not expressible in the language itself), you hook up manipulations of the IO model to actual IO actions.
This brings us to passing around a value that represents the world. For example, a "hello world" program in Mercury looks like this:
:- pred main(io::di, io::uo) is det.
main(InitialWorld, FinalWorld) :-
print("Hello world!", InitialWorld, TmpWorld),
nl(TmpWorld, FinalWorld).
The program is given InitialWorld, a value in the type io which represents the entire universe outside the program. It passes this world to print, which gives it back TmpWorld, the world that is like InitialWorld but in which "Hello world!" has been printed to the terminal, and whatever else has happened in the meantime since InitialWorld was passed to main is also incorporated. It then passes TmpWorld to nl, which gives back FinalWorld (a world that is very like TmpWorld but it incorporates the printing of the newline, plus any other effects that happened in the meantime). FinalWorld is the final state of the world passed out of main back to the operating system.
Of course, we're not really passing around the entire universe as a value in the program. In the underlying implementation there usually isn't a value of type io at all, because there's no information that's useful to actually pass around; it all exists outside the program. But using the model where we pass around io values allows us to program as if the entire universe was an input and output of every operation that is affected by it (and consequently see that any operation that doesn't take an input and output io argument can't be affected by the external world).
And in fact, usually you wouldn't actually even think of programs that do IO as if they're passing around the universe. In real Mercury code you'd use the "state variable" syntactic sugar, and write the above program like this:
:- pred main(io::di, io::uo) is det.
main(!IO) :-
print("Hello world!", !IO),
nl(!IO).
The exclamation point syntax signifies that !IO really stands for two arguments, IO_X and IO_Y, where the X and Y parts are automatically filled in by the compiler such that the state variable is "threaded" through the goals in the order in which they are written. This is not just useful in the context of IO, by the way; state variables are really handy syntactic sugar to have in Mercury.
So the programmer actually tends to think of this as a sequence of steps (depending on and affecting external state) that are executed in the order in which they are written. !IO almost becomes a magic tag that just marks the calls to which this applies.
In Haskell, the pure model for IO is a monad, and a "hello world" program looks like this:
main :: IO ()
main = putStrLn "Hello world!"
One way to interpret the IO monad is similarly to the State monad; it's automatically threading a state value through, and every value in the monad can depend on or affect this state. Only in the case of IO the state being threaded is the entire universe, as in the Mercury program. With Mercury's state variables and Haskell's do notation, the two approaches end up looking quite similar, with the "world" automatically threaded through in a way that respects the order in which the calls were written in the source code, but still having IO actions explicitly marked.
As explained quite well in sacundim's answer, another way to interpret Haskell's IO monad as a model for IO-y computations is to imagine that putStrLn "Hello world!" isn't in fact a computation through which "the universe" needs to be threaded, but rather that putStrLn "Hello World!" is itself a data structure describing an IO action that could be taken. On this understanding, what programs in the IO monad are doing is using pure Haskell programs to generate, at runtime, an imperative program. In pure Haskell there's no way to actually execute that program, but since main is of type IO (), main itself evaluates to such a program, and we just know operationally that the Haskell runtime will execute the main program.
Since we're hooking up these pure models of IO to actual interactions with the outside world, we need to be a little careful. We're programming as if the entire universe was a value we can pass around the same as other values. But other values can be passed into multiple different calls, stored in polymorphic containers, and many other things that don't make any sense in terms of the actual universe. So we need some restrictions that prevent us from doing anything with "the world" in the model that doesn't correspond to anything that can actually be done to the real world.
The approach taken in Mercury is to use unique modes to enforce that the io value remains unique. That's why the input and output world were declared as io::di and io::uo respectively; it's a shorthand for declaring that the type of the first parameter is io and its mode is di (short for "destructive input"), while the type of the second parameter is io and its mode is uo (short for "unique output"). Since io is an abstract type, there's no way to construct new ones, so the only way to meet the uniqueness requirement is to always pass the io value to at most one call, which must also give you back a unique io value, and then to output the final io value from the last thing you call.
The approach taken in Haskell is to use the monad interface to allow values in the IO monad to be constructed from pure data and from other IO values, but not expose any functions on IO values that would allow you to "extract" pure data from the IO monad. This means that only the IO values incorporated into main will ever do anything, and those actions must be correctly sequenced.
I mentioned before that programmers doing IO in a pure language still tend to think operationally about most of their IO. So why go to all this trouble to come up with a pure model for IO if we're only going to think about it the same way imperative programmers do? The big advantage is that now all the theories/code/whatever that apply to all of the language apply to IO code as well.
For example, in Mercury the equivalent of fold processes a list element-by-element to build up an accumulator value, which means fold takes an input/output pair of variables of some arbitrary type as the accumulator (this is a very common pattern in the Mercury standard library, and is why I said state variable syntax often turns out to be very handy in other contexts than IO). Since "the world" appears in Mercury programs explicitly as a value in the type io, it's possible to use io values as the accumulator! Printing a list of strings in Mercury is as simple as foldl(print, MyStrings, !IO). Similarly in Haskell, generic monad/functor code works just fine on IO values. We get a whole lot of "higher-order" IO operations that would have to be implemented anew specialised to IO in a language that handles IO by some completely special mechanism.
Also, since we avoid breaking the pure model with IO, theories that are true of the computational model remain true even in the presence of IO. This means that neither the programmer nor program-analysis tools need to consider whether IO might be involved when reasoning about code. In languages like Scala, for example, even though much "normal" code is in fact pure, optimizations and implementation techniques that work on pure code are generally inapplicable, because the compiler has to presume that every single call might contain IO or other effects.
[1] Thinking about programs operationally means understanding them in terms of the operations the computer will carry out when executing them.
I think the first thing we should read about this subject is Tackling the Awkward Squad. (I didn't do so and I regret it.)
The author actually describes GHC's internal representation of IO as world -> (a,world) as "a bit of a hack".
I think this "hack" is meant to be a kind of innocent lie. I think there are two kinds of lie here:
GHC pretends the 'world' is representable by some variable.
The type world -> (a,world) basically says that if we could in some way instantiate the world, then the "next state" of our world is functionally determined by some small program running on a computer. Since this is clearly not realizable, the primitives are (of course) implemented as functions with side effects, ignoring the meaningless "world" parameter, just like in most other languages.
The author defends this "hack" on two bases:
By treating the IO as a thin wrapper of the type world -> (a,world), GHC can reuse many optimizations for the IO code, so this design is very practical and economical.
The operational semantics of the IO computation implemented as above can be proved sound provided the compiler satisfies certain properties. This paper is cited for the proof of this.
The problem (that I wanted to ask here, but you asked it first, so forgive me for writing it here) is that in the presence of the standard 'lazy IO' functions, I'm no longer sure that GHC's operational semantics remains sound.
The standard 'lazy IO' functions such as hGetContents internally call unsafeInterleaveIO, which in turn is equivalent to unsafeDupableInterleaveIO for single-threaded programs.
unsafeDupableInterleaveIO :: IO a -> IO a
unsafeDupableInterleaveIO (IO m)
  = IO ( \ s -> let r = case m s of (# _, res #) -> res
                in (# s, r #))
Pretending that equational reasoning still works for this kind of program (note that m is an impure function) and ignoring the IO constructor, we have
unsafeDupableInterleaveIO m >>= f ==> \world -> f (snd (m world)) world, which semantically would have the same effect that Andreas Rossberg described above: it "duplicates" the world. Since our world cannot be duplicated this way, and the precise evaluation order of a Haskell program is virtually unpredictable -- what we get is an almost unconstrained and unsynchronized concurrency racing for some precious system resources such as file handles. This kind of operation is of course never considered in Ariola & Sabry. So I disagree with Andreas in this respect -- the IO monad doesn't really thread the world properly even if we restrict ourselves to the standard library (and this is why some people say lazy IO is bad).
The world means just that - the physical, real world. (There is only one, mind you.)
By neglecting physical processes that are confined to the CPU and memory, one can classify every function:
Those that do not have effects in the physical world (except for ephemeral, mostly unobservable effects in the CPU and RAM)
Those that do have observable effects, for example: print something on the printer, send electrons through network cables, launch rockets, or move disk heads.
The distinction is a bit artificial, insofar as running even the purest Haskell program in reality does have observable effects, like: your CPU getting hotter, which causes the fan to turn on.
Basically every program you write can be divided into two parts (in the FP world, that is; in the imperative/OO world there is no such distinction).
Core/pure part: This is the actual logic/algorithm of the application, used to solve the problem for which you built the application. (95% of applications today lack this part, as they are just a mess of API calls with if/else sprinkled in, and people still call themselves programmers.) For example: in an image manipulation tool, the algorithm that applies the various effects to the image belongs to this core part. In FP you build this core part using FP concepts like purity: you build functions that take input and return results, and there is no mutation whatsoever in this part of your application.
The outer-layer part: Now let's say you have completed the core part of the image manipulation tool and have tested the algorithms by calling the functions with various inputs and checking the outputs. But this isn't something you can ship; how is the user supposed to use this core part? There is no face to it, it is just a bunch of functions. To make this core usable from the end user's point of view, you need to build some sort of UI, a way to read files from disk, maybe an embedded database to store user preferences, and the list goes on. This interaction with all the other stuff, which is not the core concept of your application but is still required to make it usable, is called the world in FP.
Exercise: Think about any application you have built earlier and try to divide it into the two parts mentioned above; hopefully that will make things clearer.
The world refers to interacting with the real world, i.e. having side effects. For example:
fprintf file "hello world"
which has a side effect - the file has had "hello world" added to it.
This is opposed to purely functional code like
let add a b = a + b
which has no side effects

How can I call Unix system calls interactively?

I'd like to play with Unix system calls, ideally from Ruby. How can I do so?
I've heard about Fiddle, but I don't know where to begin, or which C library I should attach it to.
I assume by "interactively" you mean via irb.
A high-level language like Ruby is going to provide wrappers for most kernel syscalls, of varying thickness.
Occasionally these wrappers will be very thin, as with sysread() and syswrite(). These are more or less equivalent to read(2) and write(2), respectively.
Other syscalls will be hidden behind thicker layers, such as with the socket I/O stuff. I don't know if calling UNIXSocket.recv() counts as "calling a syscall" precisely. At some level, that's exactly what happens, but who knows how much Ruby and C code stands between you and the actual system call.
Then there are those syscalls that aren't in the standard Ruby API at all, most likely because they don't make a great deal of sense there, like mmap(2). That syscall is all about raw pointers to memory, something you've chosen to avoid by using a language like Ruby in the first place. There happens to be a third-party Ruby mmap module, but it's really not going to give you all the power you can tap from C.
The syscall() interface Mat pointed out in the comment above is a similar story: in theory, it lets you call any system call in the kernel. But, if you don't have the ability to deal with pointers, lay out data precisely in memory for structures, etc., your ability to make useful calls is going to be quite limited.
If you want to play with system calls, learn C. There is no shortcut.
Eric Wong started a mailing list for system-level programming in Ruby. It isn't terribly active now, but you can get to it at http://librelist.com/browser/usp.ruby/.

Looking for the best equivalents of prefetch instructions for ia32, ia64, amd64, and powerpc

I'm looking at some slightly confused code that's attempted a platform abstraction of prefetch instructions, using various compiler builtins. It appears to be based on powerpc semantics initially, with Read and Write prefetch variations using dcbt and dcbtst respectively (both of these passing TH=0 in the new optional stream opcode).
On ia64 platforms we've got for read:
__lfetch(__lfhint_nt1, pTouch)
whereas for write:
__lfetch_excl(__lfhint_nt1, pTouch)
This (read vs. write prefetching) appears to match the powerpc semantics fairly well (with the exception that ia64 allows for a temporal hint).
Somewhat curiously the ia32/amd64 code in question is using
prefetchnta
Not
prefetchnt1
as it would if that code were to be consistent with the ia64 implementations (#ifdef variations of that in our code for our (still live) hpipf port and our now dead windows and linux ia64 ports).
Since we are building with the Intel compiler, I should be able to make many of our ia32/amd64 platforms consistent by switching to the xmmintrin.h builtins:
_mm_prefetch( (char *)pTouch, _MM_HINT_NTA )
_mm_prefetch( (char *)pTouch, _MM_HINT_T1 )
... provided I can figure out what temporal hint should be used.
Questions:
Are there read vs. write ia32/amd64 prefetch instructions? I don't see any in the instruction set reference.
Would one of the nt1, nt2, nta temporal variations be preferred for read vs. write prefetching?
Any idea if there would have been a good reason to use the NTA temporal hint on ia32/amd64, yet T1 on ia64?
Are there read vs. write ia32/amd64 prefetch instructions? I don't see any in the instruction set reference.
Some systems support the prefetchw instruction for writes.
Would one of the nt1, nt2, nta temporal variations be preferred for read vs. write prefetching?
If the line is used exclusively by the calling thread, it shouldn't matter how you bring the line in; both reads and writes would be able to use it. The benefit of prefetchw mentioned above is that it will bring the line and give you ownership of it, which may take a while if the line was also in use by another core. The hint level, on the other hand, is orthogonal to the MESI states and only affects how long the prefetched line survives. This matters if you prefetch long ahead of the actual access and don't want the prefetch to get lost in that time, or, alternatively, if you prefetch right before the access and don't want the prefetches to thrash your cache too much.
Any idea if there would have been a good reason to use the NTA temporal hint on ia32/amd64, yet T1 on ia64?
Just speculating: perhaps the larger caches and aggressive memory BW are more vulnerable to bad prefetching, so you'd want to reduce the impact through the non-temporal hint. If your prefetcher were suddenly set loose to fetch anything it can, you'd end up swamped in junk prefetches that throw away lots of useful cache lines. The NTA hint makes them overrun each other, leaving the rest of the cache undamaged.
Of course this may also be just a bug, I can't tell for sure, only whoever developed the compiler, but it might make sense for the reason above.
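For reference, here is a hedged sketch of how the read/write distinction looks at the source level, using the _mm_prefetch intrinsic from the question plus GCC/Clang's __builtin_prefetch (the latter is an assumption on my part, not something the code under discussion uses):

#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_* */

void touch(const char *p)
{
    _mm_prefetch(p, _MM_HINT_T1);    /* read prefetch, temporal hint (prefetcht1) */
    _mm_prefetch(p, _MM_HINT_NTA);   /* read prefetch, non-temporal (prefetchnta) */

    /* Portable builtin: 2nd argument 0 = read, 1 = write; 3rd = locality 0..3.
       On targets with PREFETCHW the write form can emit that instruction;
       otherwise the compiler degrades it to an ordinary prefetch or a no-op. */
    __builtin_prefetch(p, 0, 2);     /* read,  moderate locality */
    __builtin_prefetch(p, 1, 2);     /* write, moderate locality */
}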
The best resource I could find on x86 prefetching hint types was the good ol' article What Every Programmer Should Know About Memory.
For the most part on x86 there aren't different instructions for read and write prefetches. The exceptions seem to be those that are non-temporal aligned, where a write can bypass the cache but as far as I can tell, a read will always get cached.
It's going to be hard to backtrack through why the earlier code owners used one hint and not the other on a certain architecture. They could be making assumptions about how much cache is available on processors in that family, typical working set sizes for binaries there, long term control flow patterns, etc... and there's no telling how much any of those assumptions were backed up with good reasoning or data. From the limited background here I think you'd be justified in taking the approach that makes the most sense for the platform you're developing on now, regardless what was done on other platforms. This is especially true when you consider articles like this one, which is not the only context where I've heard that it's really, really hard to get any performance gain at all with software prefetches.
Are there any more details known up front, like typical cache miss ratios when using this code, or how much prefetches are expected to help?

Is Objective C fast enough for DSP/audio programming

I've been making some progress with audio programming for iPhone. Now I'm doing some performance tuning, trying to see if I can squeeze more out of this little machine. Running Shark, I see that a significant part of my cpu power (16%) is getting eaten up by objc_msgSend. I understand I can speed this up somewhat by storing pointers to functions (IMP) rather than calling them using [object message] notation. But if I'm going to go through all this trouble, I wonder if I might just be better off using C++.
Any thoughts on this?
Objective C is absolutely fast enough for DSP/audio programming, because Objective C is a superset of C. You don't need to (and shouldn't) make everything a message. Where performance is critical, use plain C function calls (or use inline assembly, if there are hardware features you can leverage that way). Where performance isn't critical, and your application can benefit from the features of message indirection, use the square brackets.
The Accelerate framework on OS X, for example, is a great high-performance Objective C library. It only uses standard C99 function calls, and you can call them from Objective C code without any wrapping or indirection.
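As a hedged illustration of both the IMP-caching idea from the question and the "plain C in the hot path" advice, here is a sketch using only the C-level Objective-C runtime API; the filter object and its processSample: selector are hypothetical:

#include <objc/runtime.h>

/* Resolve the IMP once, outside the loop, then call it as a plain C function
   pointer inside the hot loop, so no objc_msgSend dispatch happens per sample. */
static void run_filter(id filter, float *buf, int n)
{
    SEL sel = sel_registerName("processSample:");
    typedef float (*ProcessFn)(id, SEL, float);
    ProcessFn process =
        (ProcessFn)class_getMethodImplementation(object_getClass(filter), sel);

    for (int i = 0; i < n; i++)
        buf[i] = process(filter, sel, buf[i]);
}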
The problem with Objective-C for tasks like DSP is not speed per se but rather the uncertainty of when the inevitable bottlenecks will occur.
All languages have bottlenecks, but in statically linked languages like C++ you can better predict when and where in the code they will occur. In the case of Objective-C's runtime coupling, the time it takes to find the appropriate object and send it a message is not necessarily long, but it is variable and unpredictable. Objective-C's flexibility in UI, data management and reuse works against it in the case of tightly timed tasks.
Most audio processing in the Apple API is done in C or C++ because of the need to nail down the time it takes code to execute. However, it's easy to mix Objective-C, C and C++ in the same app. This allows you to pick the best language for the immediate task at hand.
Is Objective C fast enough for DSP/audio programming
Real Time Rendering
Definitely Not. The Objective-C runtime and its libraries are simply not designed for the demands of real-time audio rendering. The fact is, it's virtually impossible to guarantee that using the ObjC runtime or libraries such as Foundation (or even CoreFoundation) will not result in your renderer missing its deadline.
The common case is a lock -- even a simple heap allocation (malloc, new/new[], [[NSObject alloc] init]) will likely require a lock.
To use ObjC is to utilize libraries and a runtime which assume locks are acceptable at any point within their execution. The lock can suspend execution of your render thread (e.g. during your render callback) while waiting to acquire the lock. Then you can miss your render deadline because your render thread is held up, ultimately resulting in dropouts/glitches.
Ask a pro audio plugin developer: they will tell you that blocking within the realtime render domain is forbidden. You cannot e.g. run to the filesystem or create heap allocations because you have no practical upper bound regarding the time it will take to finish.
Here's a nice introduction: http://www.rossbencina.com/code/real-time-audio-programming-101-time-waits-for-nothing
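A minimal sketch of the discipline this implies on the render thread (everything preallocated, no locks, no ObjC messaging); the RenderState type and its fields are hypothetical:

#include <stdatomic.h>

/* Parameters the UI thread updates and the render callback only reads.
   The audio thread does bounded, allocation-free work in plain C. */
typedef struct {
    _Atomic float gain;   /* written by the UI thread, read by the renderer */
    float *wavetable;     /* preallocated before rendering starts */
} RenderState;

void render(RenderState *s, float *out, int nframes)
{
    /* Assumes nframes <= wavetable length. */
    float g = atomic_load_explicit(&s->gain, memory_order_relaxed);
    for (int i = 0; i < nframes; i++)
        out[i] = s->wavetable[i] * g;   /* pure arithmetic: no locks, no malloc, no ObjC */
}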
Offline Rendering
Yes, it would be acceptably fast in most scenarios for high-level messaging. At the lower levels, I recommend against using ObjC because it would be wasteful -- it could take many, many times longer to render if ObjC messaging is used at that level (compared to a C or C++ implementation).
See also: Will my iPhone app take a performance hit if I use Objective-C for low level code?
objc_msgSend is just a utility.
The cost of sending a message is not just the cost of sending the message.
It is the cost of doing everything that the message initiates.
(Just like the true cost of a function call is its inclusive cost, including I/O if there is any.)
What you need to know is where the time-dominant messages are coming from and going to, and why.
Stack samples will tell you which routines / methods are being called so often that you should figure out how to call them more efficiently.
You may find that you're calling them more than you have to.
Especially if you find that many of the calls are for creating and deleting data structures, you can probably find better ways to do that.

Speed improvements for Perl's chameneos-redux in the Computer Language Benchmarks Game

Ever looked at the Computer Language Benchmarks Game (formerly known as the Great Language Shootout)?
Perl has some pretty healthy competition there at the moment. It also occurs to me that there's probably some places that Perl's scores could be improved. The biggest one is in the chameneos-redux script right now—the Perl version runs the worst out of any language: 1,626 times slower than the C baseline solution!
There are some restrictions on how the programs can be made and optimized, and there is Perl's interpreted runtime penalty, but 1,626 times? There's got to be something that can get the runtime of this program way down.
Taking a look at the source code and the challenge, how can the speed be improved?
I ran the source code through the Devel::SmallProf profiler. The profile output is a little too verbose to post here, but you can see the results yourself using $ perl -d:SmallProf chameneos.pl 10000 (no need to run it for 6000000 meetings unless you really want to!). See perlperf for more details on some profiling tools in Perl.
It turns out that using semaphores is the major bottleneck. The lion's share of total CPU time is spent on checking whether a semaphore is locked or not. Although I haven't had enough time to look at why the source code uses semaphores, it may be that you can work around having to use semaphores altogether. That's probably your best shot at improving the code's performance.
As Zaid posted, Thread::Semaphore is rather slow. One optimization could be to use the implicit locks on shared variables instead of semaphores. It should be faster, though I suspect it won't be faster by much.
In general, Perl's threading implementation sucks for any kind of usage that requires a lot of interthread communication. It's very suitable for tasks with little communication (as unlike CPython's threads and CRuby's threads they are actually preemptive).
It may be possible to improve that situation, we need better primitives.
I have a version based on another version from Jesse Millikian, which I think was never published.
I think it may run ~ 7x faster than the current entry, and uses standard modules all around. I'm not sure if it actually complies with all the rules though.
I've tried the forks module on it, but I think it slows it down a bit.
Anyone tried s/threads/forks/ on the Perl entry? Or Coro / Coro::MP, though the latter would probably trigger the 'interesting alternative implementations' clause.