Speed improvements for Perl's chameneos-redux in the Computer Language Benchmarks Game

Ever looked at the Computer Language Benchmarks Game (formerly known as the Great Language Shootout)?
Perl has some pretty healthy competition there at the moment, and it occurs to me that there are probably places where Perl's scores could be improved. The biggest one right now is the chameneos-redux script, where the Perl version runs the worst of any language: 1,626 times slower than the C baseline solution!
There are some restrictions on how the programs can be written and optimized, and there is Perl's interpreted-runtime penalty, but 1,626 times? There's got to be something that can get this program's runtime way down.
Taking a look at the source code and the challenge, how can the speed be improved?

I ran the source code through the Devel::SmallProf profiler. The profile output is a little too verbose to post here, but you can see the results yourself with $ perl -d:SmallProf chameneos.pl 10000 (no need to run it for 6,000,000 meetings unless you really want to!). See perlperf for more details on Perl's profiling tools.
It turns out that the semaphores are the major bottleneck: the lion's share of total CPU time is spent checking whether a semaphore is locked or not. Although I haven't had time to look at why the source code uses semaphores, you may be able to work around using them altogether. That's probably your best shot at improving the code's performance.

As Zaid posted, Thread::Semaphore is rather slow. One optimization could be to use the implicit locks on shared variables instead of semaphores (see the sketch below). It should be faster, though I suspect it won't be faster by much.
In general, Perl's threading implementation sucks for any kind of usage that requires a lot of inter-thread communication. It's very suitable for tasks with little communication, since, unlike CPython's and CRuby's threads, Perl's are actually preemptive.
It may be possible to improve that situation; we need better primitives.
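For illustration, here's a minimal sketch of the idea (not the actual benchmark entry; the counter and thread count are made up): guard a shared counter with lock() on the shared variable itself instead of a Thread::Semaphore.

    use threads;
    use threads::shared;

    my $meetings_left :shared = 10_000;

    sub try_meet {
        # lock() takes the implicit lock on the shared variable and
        # releases it automatically at the end of the enclosing scope,
        # replacing an explicit Thread::Semaphore down()/up() pair.
        lock($meetings_left);
        return 0 if $meetings_left <= 0;
        $meetings_left--;
        return 1;
    }

    my @creatures = map { threads->create(sub { 1 while try_meet() }) } 1 .. 4;
    $_->join for @creatures;

This avoids the method-call overhead of Thread::Semaphore for the simple mutual-exclusion case, which is where the profiler said the time was going.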

I have a version based on one from Jesse Millikian, which I think was never published.
I think it may run ~7x faster than the current entry, and it uses standard modules all around. I'm not sure whether it actually complies with all the rules, though.
I've tried the forks module on it, but I think that slows it down a bit.

Has anyone tried s/threads/forks/ on the Perl entry? Or Coro / Coro::MP, though the latter would probably trigger the 'interesting alternative implementations' clause.

Related

How can I call Unix system calls interactively?

I'd like to play with Unix system calls, ideally from Ruby. How can I do so?
I've heard about Fiddle, but I don't know where to begin or which C library to attach it to.
I assume by "interactively" you mean via irb.
A high-level language like Ruby is going to provide wrappers for most kernel syscalls, of varying thickness.
Occasionally these wrappers will be very thin, as with sysread() and syswrite(). These are more or less equivalent to read(2) and write(2), respectively.
Other syscalls will be hidden behind thicker layers, such as with the socket I/O stuff. I don't know if calling UNIXSocket.recv() counts as "calling a syscall" precisely. At some level, that's exactly what happens, but who knows how much Ruby and C code stands between you and the actual system call.
Then there are those syscalls that aren't in the standard Ruby API at all, most likely because they don't make much sense there, like mmap(2). That syscall is all about raw pointers to memory, something you've chosen to avoid by using a language like Ruby in the first place. There happens to be a third-party Ruby mmap module, but it's really not going to give you all the power you can tap from C.
The syscall() interface Mat pointed out in the comment above is a similar story: in theory, it lets you call any system call in the kernel. But, if you don't have the ability to deal with pointers, lay out data precisely in memory for structures, etc., your ability to make useful calls is going to be quite limited.
If you want to play with system calls, learn C. There is no shortcut.
Eric Wong started a mailing list for system-level programming in Ruby. It isn't terribly active now, but you can get to it at http://librelist.com/browser/usp.ruby/.

Who is affected when bypassing Perl safe signals?

Do the risks of bypassing Perl's safe signals (for example, as shown in the second timeout example in the DBI documentation) affect only the code that does the bypassing?
The code in that example works hard to localize the change to just that section of code, or any code called from it.
There is no 100% guarantee that code outside the bypassing section will be unaffected, because the signals are no longer safe. In the example, the call being timed out is a DBI->connect. For most DBDs this is implemented mostly in C, and unless that C code can handle being aborted and retried, you might find that data structures internal to the DBD, or to the libraries it uses, are left in an inconsistent state.
The chances of the example code going wrong are probably incredibly tiny. My personal anecdote on the issue: I had used traditional Perl signal handling for years before safe signals were introduced, and for a long time I never had a problem, even though I wasn't very cautious about what I did in my signal handlers. Then we hit a data set that actually did trigger memory corruption in about 1 out of every 100 runs. Just modifying the signal handlers to use better practices, similar to those in the example, eliminated our issues.
What does that even mean? By using unsafe signals, you can corrupt Perl's internals and Perl variables. It can also cause problems if a non-reentrant C library call is interrupted.
This can lead to SEGFAULTs and other problems, and those may only manifest themselves outside the block where the timeout is in effect.
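For reference, a condensed sketch of the pattern that second DBI example uses: install the handler via POSIX::sigaction() so delivery bypasses Perl's deferred safe signals, and carefully restore the previous handler afterwards. The connection parameters here are placeholders.

    use DBI;
    use POSIX qw(SIGALRM);

    my ($dsn, $user, $pass) = ('dbi:Oracle:mydb', 'user', 'secret');  # placeholders

    # POSIX::sigaction() installs a handler that fires immediately,
    # bypassing Perl's deferred "safe" signal delivery.
    my $old_action = POSIX::SigAction->new;
    POSIX::sigaction(SIGALRM,
                     POSIX::SigAction->new(sub { die "connect timeout\n" }),
                     $old_action);

    my $dbh = eval {
        alarm(10);                                  # arm the timeout
        my $h = DBI->connect($dsn, $user, $pass);   # the call being timed out
        alarm(0);                                   # disarm on success
        $h;
    };
    alarm(0);                                       # make sure it is disarmed
    POSIX::sigaction(SIGALRM, $old_action);         # restore the old handler
    die $@ if $@ && $@ ne "connect timeout\n";

The restore step is what localizes the damage: only the DBI->connect call runs with unsafe delivery, which is exactly the risk window described above.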

Looking for the best equivalents of prefetch instructions for ia32, ia64, amd64, and powerpc

I'm looking at some slightly confused code that attempts a platform abstraction of prefetch instructions, using various compiler builtins. It appears to be based on powerpc semantics initially, with read and write prefetch variations using dcbt and dcbtst respectively (both of these passing TH=0 in the new optional stream opcode).
On ia64 platforms we've got for read:
__lfetch(__lfhint_nt1, pTouch)
whereas for write:
__lfetch_excl(__lfhint_nt1, pTouch)
This (read vs. write prefetching) appears to match the powerpc semantics fairly well (with the exception that ia64 allows for a temporal hint).
Somewhat curiously, the ia32/amd64 code in question is using
prefetchnta
not
prefetcht1
as it would if that code were consistent with the ia64 implementations (we have #ifdef variations of that code for our still-live hpipf port and our now-dead Windows and Linux ia64 ports).
Since we are building with the Intel compiler, I should be able to make many of our ia32/amd64 platforms consistent by switching to the xmmintrin.h builtins:
_mm_prefetch( (char *)pTouch, _MM_HINT_NTA )
_mm_prefetch( (char *)pTouch, _MM_HINT_T1 )
... provided I can figure out what temporal hint should be used.
Questions:
Are there read vs. write ia32/amd64 prefetch instructions? I don't see any in the instruction set reference.
Would one of the t1, t2, nta temporal variations be preferred for read vs. write prefetching?
Any idea if there would have been a good reason to use the NTA temporal hint on ia32/amd64, yet T1 on ia64?
Are there read vs. write ia32/amd64 prefetch instructions? I don't see any in the instruction set reference.
Some systems support the prefetchw instruction for write prefetches.
Would one of the t1, t2, nta temporal variations be preferred for read vs. write prefetching?
If the line is used exclusively by the calling thread, it shouldn't matter how you bring the line in; both reads and writes would be able to use it. The benefit of prefetchw mentioned above is that it brings the line and gives you ownership of it, which may take a while if the line was also in use by another core. The hint level, on the other hand, is orthogonal to the MESI states and only affects how long the prefetched line survives. This matters if you prefetch long ahead of the actual access and don't want the prefetch to get lost in that duration, or, alternatively, if you prefetch right before the access and don't want the prefetches to thrash your cache too much.
Any idea if there would have been a good reason to use the NTA temporal hint on ia32/amd64, yet T1 on ia64?
Just speculating: perhaps the larger caches and aggressive memory bandwidth are more vulnerable to bad prefetching, and you'd want to reduce the impact through the non-temporal hint. If your prefetcher were suddenly set loose to fetch anything it could, you'd end up swamped in junk prefetches that would throw away lots of useful cache lines. The NTA hint makes those prefetches overrun each other instead, leaving the rest of the cache undamaged.
Of course this may also just be a bug; I can't tell for sure, only whoever developed the code could, but it might make sense for the reason above.
The best resource I could find on x86 prefetching hint types was the good ol' article What Every Programmer Should Know About Memory.
For the most part on x86 there aren't different instructions for read and write prefetches. The exceptions seem to be the non-temporal ones, where a write can bypass the cache but, as far as I can tell, a read will always get cached.
It's going to be hard to backtrack through why the earlier code owners used one hint and not the other on a certain architecture. They could have been making assumptions about how much cache is available on processors in that family, typical working set sizes for binaries there, long-term control flow patterns, etc., and there's no telling how much any of those assumptions were backed up with good reasoning or data. From the limited background here, I think you'd be justified in taking the approach that makes the most sense for the platform you're developing on now, regardless of what was done on other platforms. This is especially true when you consider articles like this one, which is not the only context where I've heard that it's really, really hard to get any performance gain at all with software prefetches.
Are any more details known up front, like typical cache miss ratios when using this code, or how much the prefetches are expected to help?

How do I time my SML code?

Can anyone tell me how I can time my SML code?
I have implemented several different versions of the same algorithm and would like to time them, and perhaps even measure their memory usage.
The Timer module is what you want. It can give you either CPU time (broken down into user, sys, and GC times) or wall-clock time.
For an example of how to use it, see the Benchmark module of MyLib.
With respect to finding out how much memory your algorithms are using, you might find the profiling feature of MLton handy. Note, however, that I have never actually used it, but its documentation states that:
you can profile your program to find out how many bytes each function allocates.

How can I find memory leaks in long-running Perl program?

Perl uses reference counting for GC, and it's quite easy to create a circular reference by accident. I see that my program seems to be using more and more memory, and it will probably run out after a few days.
Is there any way to debug memory leaks in Perl? Attaching to a program and getting numbers of objects of various types would be a good start. If I knew which objects are much more numerous than expected I could check all references to them and hopefully fix the leak.
It may be relevant that Perl never gives memory back to the system by itself: It's all up to malloc() and all the rules associated with that.
Knowing how malloc() allocates memory is important to answering the greater question, and it varies from system to system, but in general most malloc() implementations are optimized for programs that allocate and deallocate in stack-like orders. Perl uses reference counting to track memory, which means that (unlike in a GC-based language that uses malloc() underneath) it is actually not all that difficult to tell where deallocation will occur, and in what order.
It may be that you can reorganize your program to take advantage of this fact, by calling undef($old_object) explicitly, and in the right order, much the way C programmers say free(old_object);
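A contrived sketch of what that reorganization can look like; the loader and worker subs are hypothetical stand-ins:

    # Drop large structures explicitly, newest first, so deallocation
    # happens in the stack-like order malloc() implementations favor.
    sub load_table  { return { rows => [ 1 .. 1_000_000 ] } }    # stand-in loader
    sub build_index { my ($t) = @_; return { n => scalar @{ $t->{rows} } } }
    sub process     { }                                          # stand-in work

    my $big_table = load_table();
    my $index     = build_index($big_table);
    process($index);

    undef $index;       # free the newer allocation first...
    undef $big_table;   # ...then the older one, like free() in reverse order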
For long-running programs (days, months, etc.) where I have loads of load/copy/dump cycles, I garbage-collect using exit() and exec(). Where that's otherwise unfeasible, I simply pack up my data structures (using Storable) and file descriptors (using $^F) and exec($0), usually with an environment variable set like $ENV{EXEC_GC_MODE}. You may need something similar even if you don't have any leaks of your own, simply because Perl is leaking small chunks that your system's malloc() can't figure out how to give back.
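A rough sketch of that exec()-based 'garbage collection'; the file name, environment variable, and memory check are made up for the example:

    use Storable qw(store retrieve);

    my $state;
    if ($ENV{EXEC_GC_MODE}) {
        # We were re-exec'ed: thaw the state saved before the exec().
        $state = retrieve('/tmp/myapp.state');
    } else {
        $state = { jobs_done => 0 };
    }

    sub memory_grown_too_much { 0 }    # stub; e.g. parse /proc/self/status

    # ... inside the main loop ...
    if (memory_grown_too_much()) {
        store($state, '/tmp/myapp.state');         # pack up the data structures
        $ENV{EXEC_GC_MODE} = 1;
        exec($0, @ARGV) or die "exec failed: $!";  # fresh interpreter, leaks gone
    }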
Of course, if you do have leaks in your code, then the rest of my advice is somewhat more relevant. It was originally posted to another question on this subject, but it didn't explicitly cover long-running programs.
All Perl program memory leaks will be either an XS module holding onto a reference or a circular data structure. Devel::Cycle is a great tool for finding circular references, if you know which structures are likely to contain the loops. Devel::Peek can be used to find objects with a higher-than-expected reference count.
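A tiny demonstration of both tools on a deliberately leaked structure:

    use Devel::Cycle;                # exports find_cycle()
    use Devel::Peek qw(Dump);
    use Scalar::Util qw(weaken);

    my $node = { name => 'leaky' };
    $node->{self} = $node;           # accidental circular reference

    find_cycle($node);               # reports the reference loop
    Dump($node);                     # REFCNT is higher than you'd expect

    weaken($node->{self});           # one common fix: weaken the back-reference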
If you don't know where else to look, Devel::LeakTrace::Fast could be a good first place, but you'll need a perl built for debugging.
If you suspect the leak is inside XS-space, it's much harder, and Valgrind will probably be your best bet. Test::Valgrind may help you reduce the amount of code you need to search, but it won't work on Windows, so you'd have to port (at least the leaky portion) to Linux to do this.
Devel::Gladiator is another useful tool in this space.
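It can answer the original 'counts of objects by type' question fairly directly. A sketch, assuming Devel::Gladiator's arena_ref_counts() interface:

    use Devel::Gladiator qw(arena_ref_counts);

    # Walk Perl's arena and tally every live SV by type/class.
    my $counts = arena_ref_counts();
    for my $type (sort { $counts->{$b} <=> $counts->{$a} } keys %$counts) {
        printf "%8d  %s\n", $counts->{$type}, $type;
    }

Snapshot this periodically from the running program; types whose counts only ever grow are good leak suspects.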
It seems like the CPAN module Devel::Cycle is what you are looking for. It requires making some changes to your code, but it should help you find your references without too many problems.
Valgrind is a great Linux tool that locates memory leaks in running code. If your Perl code runs on Linux, you should check it out.
In addition to the other comments, you may find my Perl Memory Use talk at LPW2013 useful. I'd recommend watching the screencast as it explains the slides and has some cute visuals and some Q&A at the end.
I'd also suggest looking at Paul Evans's Devel::MAT module, which I mention in the talk.