how can I call Unix system calls interactively? - system-calls

I'd like to play with Unix system calls, ideally from Ruby. How can I do so?
I've heard about Fiddle, but I don't know where to begin / which C library should I attach it to?

I assume by "interactively" you mean via irb.
A high-level language like Ruby is going to provide wrappers for most kernel syscalls, of varying thickness.
Occasionally these wrappers will be very thin, as with sysread() and syswrite(). These are more or less equivalent to read(2) and write(2), respectively.
Other syscalls will be hidden behind thicker layers, such as with the socket I/O stuff. I don't know if calling UNIXSocket.recv() counts as "calling a syscall" precisely. At some level, that's exactly what happens, but who knows how much Ruby and C code stands between you and the actual system call.
Then there are those syscalls that aren't in the standard Ruby API at all, most likely because they don't make a great amount of sense to be, like mmap(2). That syscall is all about raw pointers to memory, something you've chosen to avoid by using a language like Ruby in the first place. There happens to be a third party Ruby mmap module, but it's really not going to give you all the power you can tap from C.
The syscall() interface Mat pointed out in the comment above is a similar story: in theory, it lets you call any system call in the kernel. But, if you don't have the ability to deal with pointers, lay out data precisely in memory for structures, etc., your ability to make useful calls is going to be quite limited.
If you want to play with system calls, learn C. There is no shortcut.

Eric Wong started a mailing list for system-level programming in Ruby. It isn't terribly active now, but you can get to it at http://librelist.com/browser/usp.ruby/.

Related

System call or function call - performance-wise

In Linux, when you can choose between a system call or a function call to do a task, which option is the better one due to a better performance?
We should note that in most of the cases we do not directly use system call. We use the interface provided by glibc.
http://www.kernel.org/doc/man-pages/online/pages/man2/syscalls.2.html
http://www.gnu.org/software/libc/manual/html_node/System-Calls.html
Now in cases like File Mangement/IPC/ process management etc which are the core resource management activities of the Operating System the only option is system call and not library functions.
In these cases, typically we use Library function which works as a wrapper over a system call. That is say for reading a file, we have many library functions like
fgetc/fgets/fscanf/fread - all should invoke read system call.
So shall we use read system call? or the other library functions?
This should depend on the particular application.If we are using read, then we again need to change the code to run this, on some other operating system where read is not available.
We are losing some flexibilty. It may be useful when we are sure of the platform and we can do some optimisations by using read only or may be the application must use only file descriptors and not file pointer etc.
Now in cases where we need to consider only say user level operations and invoke
no service from operating system , like say copying a string.(strcpy).
In this case definitely we shall not use any system call unnecessarily, if at
all something is there, since it should be an extra overhead due to operating
system intervention, which is not needed in this case.
So I feel choosing between a system call and a library function only occurs for cases where we have a library function built on top of a system call.
(like adding to examples above we can have say malloc which calls system call brk).
Here the choice will depend on the particular type of software, the platform on which it should run, the precise non functional requirements like speed (Though you cannot say with certainty that your code will run faster if you are using brk instead of malloc), portability etc.

Looking for the best equivalents of prefetch instructions for ia32, ia64, amd64, and powerpc

I'm looking at some slightly confused code that's attempted a platform abstraction of prefetch instructions, using various compiler builtins. It appears to be based on powerpc semantics initially, with Read and Write prefetch variations using dcbt and dcbtst respectively (both of these passing TH=0 in the new optional stream opcode).
On ia64 platforms we've got for read:
__lfetch(__lfhint_nt1, pTouch)
wherease for write:
__lfetch_excl(__lfhint_nt1, pTouch)
This (read vs. write prefetching) appears to match the powerpc semantics fairly well (with the exception that ia64 allows for a temporal hint).
Somewhat curiously the ia32/amd64 code in question is using
prefetchnta
Not
prefetchnt1
as it would if that code were to be consistent with the ia64 implementations (#ifdef variations of that in our code for our (still live) hpipf port and our now dead windows and linux ia64 ports).
Since we are building with the intel compiler I should be able to many of our ia32/amd64 platforms consistent by switching to the xmmintrin.h builtins:
_mm_prefetch( (char *)pTouch, _MM_HINT_NTA )
_mm_prefetch( (char *)pTouch, _MM_HINT_T1 )
... provided I can figure out what temporal hint should be used.
Questions:
Are there read vs. write ia32/amd64 prefetch instructions? I don't see any in the instruction set reference.
Would one of the nt1, nt2, nta temporal variations be preferred for read vs. write prefetching?
Any idea if there would have been a good reason to use the NTA temporal hint on ia32/amd64, yet T1 on ia64?
Are there read vs. write ia32/amd64 prefetch instructions? I don't see any in the instruction set reference.
Some systems support the prefetchw instructions for writes
Would one of the nt1, nt2, nta temporal variations be preferred for read vs. write prefetching?
If the line is exclusively used by the calling thread, it shouldn't matter how you bring the line, both reads and writes would be able to use it. The benefit for prefetchw mentioned above is that it will bring the line and give you ownership on it, which may take a while if the line was also used by another core. The hint level on the other hand is orthogonal with the MESI states, and only affects how long would the prefetched line survive. This matters if you prefetch long ahead of the actual access and don't want to prefetch to get lost in that duration, or alternatively - prefetch right before the access, and don't want the prefetches to thrash your cache too much.
Any idea if there would have been a good reason to use the NTA temporal hint on ia32/amd64, yet T1 on ia64?
Just speculating - perhaps the larger caches and aggressive memory BW are more vulnerable to bad prefetching and you'd want to reduce the impact through the non-temporal hint. Consider that your prefetcher is suddenly set loose to fetch anything it can, you'd end up swamped in junk prefetches that would through away lots of useful cachelines. The NTA hint makes them overrun each other, leaving the rest undamaged.
Of course this may also be just a bug, I can't tell for sure, only whoever developed the compiler, but it might make sense for the reason above.
The best resource I could find on x86 prefetching hint types was the good ol' article What Every Programmer Should Know About Memory.
For the most part on x86 there aren't different instructions for read and write prefetches. The exceptions seem to be those that are non-temporal aligned, where a write can bypass the cache but as far as I can tell, a read will always get cached.
It's going to be hard to backtrack through why the earlier code owners used one hint and not the other on a certain architecture. They could be making assumptions about how much cache is available on processors in that family, typical working set sizes for binaries there, long term control flow patterns, etc... and there's no telling how much any of those assumptions were backed up with good reasoning or data. From the limited background here I think you'd be justified in taking the approach that makes the most sense for the platform you're developing on now, regardless what was done on other platforms. This is especially true when you consider articles like this one, which is not the only context where I've heard that it's really, really hard to get any performance gain at all with software prefetches.
Are there any more details known up front, like typical cache miss ratios when using this code, or how much prefetches are expected to help?

Is Objective C fast enough for DSP/audio programming

I've been making some progress with audio programming for iPhone. Now I'm doing some performance tuning, trying to see if I can squeeze more out of this little machine. Running Shark, I see that a significant part of my cpu power (16%) is getting eaten up by objc_msgSend. I understand I can speed this up somewhat by storing pointers to functions (IMP) rather than calling them using [object message] notation. But if I'm going to go through all this trouble, I wonder if I might just be better off using C++.
Any thoughts on this?
Objective C is absolutely fast enough for DSP/audio programming, because Objective C is a superset of C. You don't need to (and shouldn't) make everything a message. Where performance is critical, use plain C function calls (or use inline assembly, if there are hardware features you can leverage that way). Where performance isn't critical, and your application can benefit from the features of message indirection, use the square brackets.
The Accelerate framework on OS X, for example, is a great high-performance Objective C library. It only uses standard C99 function calls, and you can call them from Objective C code without any wrapping or indirection.
The problem with Objective-C and functions like DSP is not speed per se but rather the uncertainty of when the inevitable bottlenecks will occur.
All languages have bottlenecks but in static linked languages like C++ you can better predict when and where in the code they will occur. In the case of Objective-C's runtime coupling, the time it takes to find the appropriate object, the time it takes to send a message is not necessary slow but it is variable and unpredictable. Objective-C's flexibility in UI, data management and reuse work against it in the case of tightly timed task.
Most audio processing in the Apple API is done in C or C++ because of the need to nail down the time it takes code to execute. However, its easy to mix Objective-C, C and C++ in the same app. This allows you to pick the best language for the immediate task at hand.
Is Objective C fast enough for DSP/audio programming
Real Time Rendering
Definitely Not. The Objective-C runtime and its libraries are simply not designed for the demands of real time audio rendering. The fact is, it's virtually impossible to guarantee that using ObjC runtime or libraries such as Foundation (or even CoreFoundation) will not result your renderer missing its deadline.
The common case is a lock -- even a simple heap allocation (malloc, new/new[], [[NSObject alloc] init]) will likely require a lock.
To use ObjC is to utilize libraries and a runtime which assume locks are acceptable at any point within their execution. The lock can suspend execution of your render thread (e.g. during your render callback) while waiting to acquire the lock. Then you can miss your render deadline because your render thread is held up, ultimately resulting in dropouts/glitches.
Ask a pro audio plugin developer: they will tell you that blocking within the realtime render domain is forbidden. You cannot e.g. run to the filesystem or create heap allocations because you have no practical upper bound regarding the time it will take to finish.
Here's a nice introduction: http://www.rossbencina.com/code/real-time-audio-programming-101-time-waits-for-nothing
Offline Rendering
Yes, it would be acceptably fast in most scenarios for high level messaging. At the lower levels, I recommend against using ObjC because it would be wasteful -- it could take many, many times longer to render if ObjC messaging used at that level (compared to a C or C++ implementation).
See also: Will my iPhone app take a performance hit if I use Objective-C for low level code?
objc_msgSend is just a utility.
The cost of sending a message is not just the cost of sending the message.
It is the cost of doing everything that the message initiates.
(Just like the true cost of a function call is its inclusive cost, including I/O if there is any.)
What you need to know is where are the time-dominant messages coming from and going to and why.
Stack samples will tell you which routines / methods are being called so often that you should figure out how to call them more efficiently.
You may find that you're calling them more than you have to.
Especially if you find that many of the calls are for creating and deleting data structure, you can probably find better ways to do that.

Speed improvements for Perl's chameneos-redux in the Computer Language Benchmarks Game

Ever looked at the Computer Language Benchmarks Game (formerly known as the Great Language Shootout)?
Perl has some pretty healthy competition there at the moment. It also occurs to me that there's probably some places that Perl's scores could be improved. The biggest one is in the chameneos-redux script right now—the Perl version runs the worst out of any language: 1,626 times slower than the C baseline solution!
There are some restrictions on how the programs can be made and optimized, and there is Perl's interpreted runtime penalty, but 1,626 times? There's got to be something that can get the runtime of this program way down.
Taking a look at the source code and the challenge, how can the speed be improved?
I ran the source code through the Devel::SmallProf profiler. The profile output is a little too verbose to post here, but you can see the results yourself using $ perl -d:SmallProf chameneos.pl 10000 (no need to run it for 6000000 meetings unless you really want to!) See perlperf for more details on some profiling tools in Perl.
It turns out that using semaphores is the major bottleneck. The lion's share of total CPU time is spent on checking whether a semaphore is locked or not. Although I haven't had enough time to look at why the source code uses semaphores, it may be that you can work around having to use semaphores altogether. That's probably your best shot at improving the code's performance.
As Zaid posted, Thread::Semaphore is rather slow. One optimization could be to use the implicit locks on shared variables instead of them. It should be faster, though I suspect it won't be faster by much.
In general, Perl's threading implementation sucks for any kind of usage that requires a lot of interthread communication. It's very suitable for tasks with little communication (as unlike CPython's threads and CRuby's threads they are actually preemptive).
It may be possible to improve that situation, we need better primitives.
I have a version based on another version from Jesse Millikian, which I think was never published.
I think it may run ~ 7x faster than the current entry, and uses standard modules all around. I'm not sure if it actually complies with all the rules though.
I've tried the forks module on it, but I think it slows it down a bit.
Anyone tried s/threads/forks/ on the Perl entry? Or Coro / Coro::MP, though the latter would probably trigger the 'interesting alternative implementations' clause.

How can I find memory leaks in long-running Perl program?

Perl uses reference counting for GC, and it's quite easy to make a circular reference by accident. I see that my program seems to be using more and more memory, and it will probably overflow after a few days.
Is there any way to debug memory leaks in Perl? Attaching to a program and getting numbers of objects of various types would be a good start. If I knew which objects are much more numerous than expected I could check all references to them and hopefully fix the leak.
It may be relevant that Perl never gives memory back to the system by itself: It's all up to malloc() and all the rules associated with that.
Knowing how malloc() allocates memory is important to answering the greater question, and it varies from system to system, but in general most malloc() implementations are optimized for programs allocating and deallocating in stack-like orders. Perl uses reference-counting for tracking memory which means that deallocations which means (unlike a GC-based language which uses malloc() underneath) it is actually not all that difficult to tell where deallocation is going to occur, and in what order.
It may be that you can reorganize your program to take advantage of this fact- by calling undef($old_object) explicitly - and in the right order, in a manner similar to the way C-programmers say free(old_object);
For long-running programs (days, months, etc), where I have loads of load/copy/dump cycles, I garbage-collect using exit() and exec(), and where it's otherwide unfeasible, I simply pack up my data structures (using Storable) and file descriptors (using $^F) and exec($0) - usually with an environment variable set like $ENV{EXEC_GC_MODE}, and you may need something similar even if you don't have any leaks of your own simply because Perl is leaking small chunks that your system's malloc() can't figure out how to give back.
Of course, if you do have leaks in your code, then the rest of my advice is somewhat more relevant. It was originally posted to another question on this subject, but it didn't explicitly cover long-running programs.
All perl program memory leaks will either be an XS holding onto a reference, or a circular data structure. Devel::Cycle is a great tool for finding circular references, if you know what structures are likely to contain the loops. Devel::Peek can be used to find objects with a higher-than-expected reference count.
If you don't know where else to look, Devel::LeakTrace::Fast could be a good first place, but you'll need a perl built for debugging.
If you suspect the leak is inside XS-space, it's much harder, and Valgrind will probably be your best bet. Test::Valgrind may help you lower the amount of code you need to search, but this won't work on Windows, so you'd have to port (at least the leaky portion) to Linux in order to do this.
Devel::Gladiator is another useful tool in this space.
Seems like the cpan module Devel::Cycle is what you are looking for. It requires making some changes to your code, but it should help you find your references without too many problems.
valgrind is a great linux application, which locates memory leaks in running code. If your Perl code runs on linux, you should check it out.
In addition to the other comments, you may find my Perl Memory Use talk at LPW2013 useful. I'd recommend watching the screencast as it explains the slides and has some cute visuals and some Q&A at the end.
I'd also suggest looking at Paul Evans Devel::MAT module which I mention in the talk.