Instruction Pipelining - Architecture Simulator and Pipeline Visualizer

Instruction Pipelining - Architecture Simulator and Pipeline Visualizer - compiler-optimization

I am working on a compiler, and had written an optimization which actually made my code slower! On investigating it, I found that there the code generator had decided to use a Handle (a double reference in case of our compiler) when my optimization was off, and a pointer to the Handle when my optimization was on! This resulted in one more de-reference instruction in the second case whenever the array was accessed.
But this single instruction caused a surprising 32% slowdown in the running time of the code. I am suspecting that this has to do with instruction pipelining as this extra de-reference causes 3 dependent instructions which might explain the slowdown.
I need to demonstrate the same and am trying to get more info on pipelining and it would be great if someone could suggest some good materials on instruction pipelining, useful architecture simulators and pipeline visualizers.

When you need to dereference a pointer, you must first load the pointer from memory, and only then can you load the value that the pointer points to. If you have a pointer to a pointer to a value, then you need to do three loads that are consecutive. This is called pointer chasing. If those pointers aren't in the cache, the performance impact can be huge. Pipelining doesn't help much. The standard book on computer architecture is hennessy & patterson. There are several architectural simulators out there. http://gem5.org is pretty popular (full disclosure, I'm a committer), but they almost always have a steep learning curve.

Related

Looking for the best equivalents of prefetch instructions for ia32, ia64, amd64, and powerpc

I'm looking at some slightly confused code that's attempted a platform abstraction of prefetch instructions, using various compiler builtins. It appears to be based on powerpc semantics initially, with Read and Write prefetch variations using dcbt and dcbtst respectively (both of these passing TH=0 in the new optional stream opcode).
On ia64 platforms we've got for read:
__lfetch(__lfhint_nt1, pTouch)
wherease for write:
__lfetch_excl(__lfhint_nt1, pTouch)
This (read vs. write prefetching) appears to match the powerpc semantics fairly well (with the exception that ia64 allows for a temporal hint).
Somewhat curiously the ia32/amd64 code in question is using
prefetchnta
Not
prefetchnt1
as it would if that code were to be consistent with the ia64 implementations (#ifdef variations of that in our code for our (still live) hpipf port and our now dead windows and linux ia64 ports).
Since we are building with the intel compiler I should be able to many of our ia32/amd64 platforms consistent by switching to the xmmintrin.h builtins:
_mm_prefetch( (char *)pTouch, _MM_HINT_NTA )
_mm_prefetch( (char *)pTouch, _MM_HINT_T1 )
... provided I can figure out what temporal hint should be used.
Questions:
Are there read vs. write ia32/amd64 prefetch instructions? I don't see any in the instruction set reference.
Would one of the nt1, nt2, nta temporal variations be preferred for read vs. write prefetching?
Any idea if there would have been a good reason to use the NTA temporal hint on ia32/amd64, yet T1 on ia64?

Are there read vs. write ia32/amd64 prefetch instructions? I don't see any in the instruction set reference.
Some systems support the prefetchw instructions for writes
Would one of the nt1, nt2, nta temporal variations be preferred for read vs. write prefetching?
If the line is exclusively used by the calling thread, it shouldn't matter how you bring the line, both reads and writes would be able to use it. The benefit for prefetchw mentioned above is that it will bring the line and give you ownership on it, which may take a while if the line was also used by another core. The hint level on the other hand is orthogonal with the MESI states, and only affects how long would the prefetched line survive. This matters if you prefetch long ahead of the actual access and don't want to prefetch to get lost in that duration, or alternatively - prefetch right before the access, and don't want the prefetches to thrash your cache too much.
Any idea if there would have been a good reason to use the NTA temporal hint on ia32/amd64, yet T1 on ia64?
Just speculating - perhaps the larger caches and aggressive memory BW are more vulnerable to bad prefetching and you'd want to reduce the impact through the non-temporal hint. Consider that your prefetcher is suddenly set loose to fetch anything it can, you'd end up swamped in junk prefetches that would through away lots of useful cachelines. The NTA hint makes them overrun each other, leaving the rest undamaged.
Of course this may also be just a bug, I can't tell for sure, only whoever developed the compiler, but it might make sense for the reason above.

The best resource I could find on x86 prefetching hint types was the good ol' article What Every Programmer Should Know About Memory.
For the most part on x86 there aren't different instructions for read and write prefetches. The exceptions seem to be those that are non-temporal aligned, where a write can bypass the cache but as far as I can tell, a read will always get cached.
It's going to be hard to backtrack through why the earlier code owners used one hint and not the other on a certain architecture. They could be making assumptions about how much cache is available on processors in that family, typical working set sizes for binaries there, long term control flow patterns, etc... and there's no telling how much any of those assumptions were backed up with good reasoning or data. From the limited background here I think you'd be justified in taking the approach that makes the most sense for the platform you're developing on now, regardless what was done on other platforms. This is especially true when you consider articles like this one, which is not the only context where I've heard that it's really, really hard to get any performance gain at all with software prefetches.
Are there any more details known up front, like typical cache miss ratios when using this code, or how much prefetches are expected to help?

Questions around memory utilization in Perl

SO community,
I have been scratching my head lately around two memory issues I am running into with some of my perl scripts and I am hoping I am finding some help/pointers here to better understand what is going on.
Questionable observation #1:
I am running the same perl script on different server instances (local laptop macosx, dedicated server hardware, virtual server hardware) and am getting significantly varying results in the traced memory consumption. Just after script initialization one instance would report be a memory consumption of the script of 210 MB compared to 330 MB on another box which is a fluctuation of over 60%. I understand that the malloc() function in charge of "garbage collection" for Perl is OS specific but are there deviations normal or should I be looking more closely at what is going on?
Questionable observation #2:
One script that is having memory leaks is relatively trivial:
foreach(#dataSamples) {
#memorycheck_1
my $string = subRoutine($_);
print FILE $string;
#memorycheck_2
}
All variables in the subRoutine are kept local and should be out of scope once the subroutine finishes. Yet when checking memory usage at #memorycheck_1 and #memorycheck_1 there is a significant memory leak.
Is there any explanation for that? Using Devel::Leak it seems there are leaked pointers which I have a hard time understanding where they would be coming from. Is there an easy way to translate the response of Devel::Leak into something that can actually give me pointers from where those leaked references origin?
Thanks

You have two different questions:
1) Why is the memory footprint not the same across various environments?
Well, are all the OS involved 64 bit? Or is there a mix? If one OS is 32 bit and the other 64 bit, the variation is to be expected. Or, as #hobbs notes in the comments, is one of the perls compiled with threads support whereas another is not?
2) Why does the memory footprint change between check #1 and check #2?
That does not necessarily mean there is a memory leak. Perl won't give back memory to the OS. The memory footprint of your program will be the largest footprint it reaches and will not go down.
Neither of these points is Perl specific. For more detail, you'll need to show more detail.
See also Question 7.25 in the C FAQ and further reading mentioned in that FAQ entry.

The most common reason for a memory leak in Perl is circular references. The simplest form would be something along the lines of:
sub subRoutine {
my( $this, $that );
$this = \$that;
$that = \$this;
return $_[0];
}
Now of course people reading that are probably saying, "Why would anyone do that?" And one generally wouldn't. But more complex data structures can contain circular references pretty easily, and we don't even blink an eye at them. Consider double-linked lists where each node refers to the node to its left and its right. It's important to not let the last explicit reference to such a list pass out of scope without first breaking the circular references contained in each of its nodes, or you'll get a structure that is inaccessible but can't be garbage collected because the reference count to each node never falls to zero.
Per Eric Strom's excellent suggestion, the core module Scalar::Util has a function called weaken. A reference that has been weakened won't hold a reference count to the entity it refers to. This can be helpful for preventing circular references. Another strategy is to implement your circular-reference-wielding datastructure within a class where an object method explicitly breaks the circular reference. Either way, such data structures do require careful handling.
Another source of trouble is poorly written XS modules (nothing against XS authors; it's just really tricky to write XS modules well). What goes on behind the closed doors of an XS module may be a memory leak.
Until we see what's happening inside of subRoutine we can only guess whether or not there's actually an issue, and what the source of the issue may be.

CCsprite.m am getting analyser error in Cocos2d

When am using build and analyse method the following error occurs:- /Users/ghost/demo/libs/cocos2d/CCSprite.m:476:2 Assigned value is garbage or undefined
in:- -(void)updateTransform method am getting above error
here is my screenshot for this error:-
Is it my fault that the program is leaking memory or in cocos2d libraries leaking memory.
recently i asked question regarding this same issue refer the link :-
memory leakage in system libraries
how to rectify this issue:-

Assigned value is garbage or undefined
None of that indicates anything about leaking memory. The analyzer checks for much more than just memory abuse.
The analyzer has identified a code path that, if followed, will lead to an undefined/uninitialized value being used. May happen, may not, but it is worthy of a bug against cocos2d!

This looks like a problem in the Cocos code. The matrix is initialized in two conditionals, therefore it might not be initialized before it’s used. The conditionals might be written so that the matrix is always initialized, the analyzer does not know. I’d simply initialize the matrix with identity transform, that certainly does not hurt. And yes, bbum is right, this is worthy of a bug report – no library should throw analyzer results unless there’s no other way.

Is Objective C fast enough for DSP/audio programming

I've been making some progress with audio programming for iPhone. Now I'm doing some performance tuning, trying to see if I can squeeze more out of this little machine. Running Shark, I see that a significant part of my cpu power (16%) is getting eaten up by objc_msgSend. I understand I can speed this up somewhat by storing pointers to functions (IMP) rather than calling them using [object message] notation. But if I'm going to go through all this trouble, I wonder if I might just be better off using C++.
Any thoughts on this?

Objective C is absolutely fast enough for DSP/audio programming, because Objective C is a superset of C. You don't need to (and shouldn't) make everything a message. Where performance is critical, use plain C function calls (or use inline assembly, if there are hardware features you can leverage that way). Where performance isn't critical, and your application can benefit from the features of message indirection, use the square brackets.
The Accelerate framework on OS X, for example, is a great high-performance Objective C library. It only uses standard C99 function calls, and you can call them from Objective C code without any wrapping or indirection.

The problem with Objective-C and functions like DSP is not speed per se but rather the uncertainty of when the inevitable bottlenecks will occur.
All languages have bottlenecks but in static linked languages like C++ you can better predict when and where in the code they will occur. In the case of Objective-C's runtime coupling, the time it takes to find the appropriate object, the time it takes to send a message is not necessary slow but it is variable and unpredictable. Objective-C's flexibility in UI, data management and reuse work against it in the case of tightly timed task.
Most audio processing in the Apple API is done in C or C++ because of the need to nail down the time it takes code to execute. However, its easy to mix Objective-C, C and C++ in the same app. This allows you to pick the best language for the immediate task at hand.

Is Objective C fast enough for DSP/audio programming
Real Time Rendering
Definitely Not. The Objective-C runtime and its libraries are simply not designed for the demands of real time audio rendering. The fact is, it's virtually impossible to guarantee that using ObjC runtime or libraries such as Foundation (or even CoreFoundation) will not result your renderer missing its deadline.
The common case is a lock -- even a simple heap allocation (malloc, new/new[], [[NSObject alloc] init]) will likely require a lock.
To use ObjC is to utilize libraries and a runtime which assume locks are acceptable at any point within their execution. The lock can suspend execution of your render thread (e.g. during your render callback) while waiting to acquire the lock. Then you can miss your render deadline because your render thread is held up, ultimately resulting in dropouts/glitches.
Ask a pro audio plugin developer: they will tell you that blocking within the realtime render domain is forbidden. You cannot e.g. run to the filesystem or create heap allocations because you have no practical upper bound regarding the time it will take to finish.
Here's a nice introduction: http://www.rossbencina.com/code/real-time-audio-programming-101-time-waits-for-nothing
Offline Rendering
Yes, it would be acceptably fast in most scenarios for high level messaging. At the lower levels, I recommend against using ObjC because it would be wasteful -- it could take many, many times longer to render if ObjC messaging used at that level (compared to a C or C++ implementation).
See also: Will my iPhone app take a performance hit if I use Objective-C for low level code?

objc_msgSend is just a utility.
The cost of sending a message is not just the cost of sending the message.
It is the cost of doing everything that the message initiates.
(Just like the true cost of a function call is its inclusive cost, including I/O if there is any.)
What you need to know is where are the time-dominant messages coming from and going to and why.
Stack samples will tell you which routines / methods are being called so often that you should figure out how to call them more efficiently.
You may find that you're calling them more than you have to.
Especially if you find that many of the calls are for creating and deleting data structure, you can probably find better ways to do that.

How can I find memory leaks in long-running Perl program?

Perl uses reference counting for GC, and it's quite easy to make a circular reference by accident. I see that my program seems to be using more and more memory, and it will probably overflow after a few days.
Is there any way to debug memory leaks in Perl? Attaching to a program and getting numbers of objects of various types would be a good start. If I knew which objects are much more numerous than expected I could check all references to them and hopefully fix the leak.

It may be relevant that Perl never gives memory back to the system by itself: It's all up to malloc() and all the rules associated with that.
Knowing how malloc() allocates memory is important to answering the greater question, and it varies from system to system, but in general most malloc() implementations are optimized for programs allocating and deallocating in stack-like orders. Perl uses reference-counting for tracking memory which means that deallocations which means (unlike a GC-based language which uses malloc() underneath) it is actually not all that difficult to tell where deallocation is going to occur, and in what order.
It may be that you can reorganize your program to take advantage of this fact- by calling undef($old_object) explicitly - and in the right order, in a manner similar to the way C-programmers say free(old_object);
For long-running programs (days, months, etc), where I have loads of load/copy/dump cycles, I garbage-collect using exit() and exec(), and where it's otherwide unfeasible, I simply pack up my data structures (using Storable) and file descriptors (using $^F) and exec($0) - usually with an environment variable set like $ENV{EXEC_GC_MODE}, and you may need something similar even if you don't have any leaks of your own simply because Perl is leaking small chunks that your system's malloc() can't figure out how to give back.
Of course, if you do have leaks in your code, then the rest of my advice is somewhat more relevant. It was originally posted to another question on this subject, but it didn't explicitly cover long-running programs.
All perl program memory leaks will either be an XS holding onto a reference, or a circular data structure. Devel::Cycle is a great tool for finding circular references, if you know what structures are likely to contain the loops. Devel::Peek can be used to find objects with a higher-than-expected reference count.
If you don't know where else to look, Devel::LeakTrace::Fast could be a good first place, but you'll need a perl built for debugging.
If you suspect the leak is inside XS-space, it's much harder, and Valgrind will probably be your best bet. Test::Valgrind may help you lower the amount of code you need to search, but this won't work on Windows, so you'd have to port (at least the leaky portion) to Linux in order to do this.

Devel::Gladiator is another useful tool in this space.

Seems like the cpan module Devel::Cycle is what you are looking for. It requires making some changes to your code, but it should help you find your references without too many problems.

valgrind is a great linux application, which locates memory leaks in running code. If your Perl code runs on linux, you should check it out.

In addition to the other comments, you may find my Perl Memory Use talk at LPW2013 useful. I'd recommend watching the screencast as it explains the slides and has some cute visuals and some Q&A at the end.
I'd also suggest looking at Paul Evans Devel::MAT module which I mention in the talk.