How can I judge where I should put memory barriers in the code? - linux-device-driver

While reading ldd3 I came across the concept of a memory barrier: code execution can be reordered, for reasons like caching and compiler optimizations. I understand that statements with no dependencies can be reordered for better performance, and that I/O port registers must not be optimized that way, because they need to contain consistent data. But I cannot understand the code below. Are there any rules to follow for where to insert functions like smp_mb(), mb(), and barrier()?
For example, take this code from the short example driver in ldd3:
/*
 * Atomically increment an index into short_buffer
 */
static inline void short_incr_bp(volatile unsigned long *index, int delta)
{
    unsigned long new = *index + delta;
    barrier(); /* Don't optimize these two together */
    *index = (new >= (short_buffer + PAGE_SIZE)) ? short_buffer : new;
}
How could the line before the barrier and the line after it be reordered? I would think the latter depends on the former executing first, in order to get the value of new.
This really confuses me.

Memory barriers are used instead of locking to gain better performance. There are several standard patterns in which memory barriers provide the needed synchronization; you can read, e.g., Documentation/memory-barriers.txt in the kernel sources.
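For instance, one of the standard pairings described in Documentation/memory-barriers.txt is the producer/consumer pattern. Here is a minimal sketch (the shared variables data and ready are hypothetical, not part of the ldd3 example):

int data;
int ready;

/* CPU 0: producer */
void publish(void)
{
    data = 42;    /* write the payload ...                  */
    smp_wmb();    /* ... and order it before the flag store */
    ready = 1;
}

/* CPU 1: consumer */
int consume(void)
{
    while (!ready)   /* in real code, read the flag with ACCESS_ONCE() */
        ;
    smp_rmb();       /* order the flag read before the payload read */
    return data;
}

The write barrier on the producer side pairs with the read barrier on the consumer side; neither one is sufficient on its own.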
In the given example from ldd3, the barrier usage is trickier than usual. In terms of the current kernel (as opposed to the 2.6-era kernels that ldd3 describes), the same intention can be expressed using the ACCESS_ONCE() macro:
unsigned long new = *index + delta;
ACCESS_ONCE(*index) = (new >= (short_buffer + PAGE_SIZE)) ? short_buffer : new;
Without barriers, the compiler may assign to *index twice:
*index = *index + delta;
if (*index >= (short_buffer + PAGE_SIZE))
    *index = short_buffer;
Because *index is used by multiple threads as an unprotected invariant (it shows which buffer region is available), writing the intermediate value *index + delta into it makes the invariant, as seen by another thread, incorrect. This is prevented by the ACCESS_ONCE() macro, which forces the compiler to generate an access (here, a write) to the variable only where it is explicitly requested.
Actually, ACCESS_ONCE() (and the barrier in your code) is redundant for a variable with the volatile modifier.
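Putting it together, a sketch of the same helper written with ACCESS_ONCE() instead of barrier() could look like this (note that recent kernels have since split ACCESS_ONCE() into READ_ONCE()/WRITE_ONCE(), so treat this as illustrative rather than canonical):

static inline void short_incr_bp(volatile unsigned long *index, int delta)
{
    unsigned long new = *index + delta;

    /* Exactly one store to *index, with the final wrapped value. */
    ACCESS_ONCE(*index) =
        (new >= (short_buffer + PAGE_SIZE)) ? short_buffer : new;
}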

Translating snippet to functional from imperative

I have the following Scala snippet. In order to solve my given problem, I "cheat" a little and use a var -- essentially a non-final, mutable data type. Its value is updated at each iteration through the loop. I've spent quite a bit of time trying to figure out how to do this using only recursion, immutable data types, and lists.
Original snippet:
def countChange_sort(money: Int, coins: List[Int]): Int = {
  if (coins.isEmpty || money < 0) {
    0
  } else if (coins.tail.isEmpty && money % coins.head != 0) {
    0
  } else if (coins.tail.isEmpty && money % coins.head == 0 || money == 0) {
    1
  } else {
    -- redacted --
  }
}
Essentially, are there any basic techniques I can use to eliminate the i and especially the accumulating cnt variables?
Thanks!!
There are lots of different ways to solve problems in functional style. Often you start by analysing the problem in a different way than you would when designing an imperative algorithm, so writing an imperative algorithm and then "converting" it to a functional one doesn't produce very natural functional algorithms (and you often miss out on lots of the potential benefits of functional style). But when you're an experienced imperative programmer just starting out with functional programming, that's all you've got, and it is a good way to begin getting your head around the new concepts. So here's how you can approach "converting" such a function as the one you wrote to functional style in a fairly uncreative way (i.e. not coming up with a different algorithm).
Let's just consider the else expression, since the rest is fine.
Functional style has no loops, so if you need to run a block of code a number of times (the body of the imperative loop), that block of code must be a function. Often the function is a simple non-recursive one, and you call a higher-order function such as map or fold to do the actual recursion, but I'm going to presume you need the practice thinking recursively and want to see it explicitly. The loop condition is calculated from the quantities you have at hand in the loop body, so we just have the loop-replacement function recursively invoke itself depending on exactly the same condition:
} else {
  var cnt = 0
  var i = 0
  def loop(????) : ??? = {
    if (money - (i * coins.head) > 0) {
      cnt += countChange_sort(money - (i * coins.head), coins.tail)
      i = i + 1
      loop(????)
    }
  }
  loop(????)
  cnt
}
Information is only communicated to a function through its input arguments or through its definition, and communicated from a function through its return value.
The information that enters a function through its definition is constant when the function is created (either at compile time, or at runtime when the closure is created). Doesn't sound very useful for the information contained in cnt and i, which needs to be different on each call. So they obviously need to be passed in as arguments:
} else {
  var cnt = 0
  var i = 0
  def loop(cnt : Int, i : Int) : ??? = {
    if (money - (i * coins.head) > 0) {
      cnt += countChange_sort(money - (i * coins.head), coins.tail)
      i = i + 1
      loop(cnt, i)
    }
  }
  loop(cnt, i)
  cnt
}
But we want to use the final value of cnt after the function call. If information is only communicated from loop through its return value, then we can only get the last value of cnt by having loop return it. That's pretty easy:
} else {
  var cnt = 0
  var i = 0
  def loop(cnt : Int, i : Int) : Int = {
    if (money - (i * coins.head) > 0) {
      cnt += countChange_sort(money - (i * coins.head), coins.tail)
      i = i + 1
      loop(cnt, i)
    } else {
      cnt
    }
  }
  cnt = loop(cnt, i)
  cnt
}
coins, money, and countChange_sort are examples of information "entering a function through its definition". coins and money are even "variable", but they're constant at the point when loop is defined. If you wanted to move loop out of the body of countChange_sort to become a stand-alone function, you would have to make coins and money additional arguments; they would be passed in from the top-level call in countChange_sort, and then passed down unmodified in each recursive call inside loop. That would still make loop dependent on countChange_sort itself though (as well as the arithmetic operators * and -!), so you never really get away from having the function know about external things that don't come into it through its arguments.
Looking pretty good. But we're still using assignment statements inside loop, which isn't right. However all we do is assign new values to cnt and i and then pass them to a recursive invocation of loop, so those assignments can be easily removed:
} else {
  var cnt = 0
  var i = 0
  def loop(cnt : Int, i : Int) : Int = {
    if (money - (i * coins.head) > 0) {
      loop(cnt + countChange_sort(money - (i * coins.head), coins.tail), i + 1)
    } else {
      cnt
    }
  }
  cnt = loop(cnt, i)
  cnt
}
Now there are some obvious simplifications, because we're not really doing anything at all with the mutable cnt and i other than initialising them, and then passing their initial value, assigning to cnt once and then immediately returning it. So we can (finally) get rid of the mutable cnt and i entirely:
} else {
  def loop(cnt : Int, i : Int) : Int = {
    if (money - (i * coins.head) > 0) {
      loop(cnt + countChange_sort(money - (i * coins.head), coins.tail), i + 1)
    } else {
      cnt
    }
  }
  loop(0, 0)
}
And we're done! No side effects in sight!
Note that I haven't thought much at all about what your algorithm actually does (I have made no attempt to even figure out whether it's correct, though I presume it is). All I've done is straightforwardly apply the general principle that information only enters a function through its arguments and leaves through its return values; all mutable state accessible to an expression is really a set of extra hidden inputs and hidden outputs of the expression. Making them immutable explicit inputs and outputs then allows you to prune away the unneeded ones. For example, i doesn't need to be included in the return value from loop because it's not actually needed by anything, so the conversion to functional style has made it clear that it's purely internal to loop, whereas you had to actually read the code of the imperative version to deduce this.
cnt and i are what is known as accumulators. Accumulators aren't anything special, they're just ordinary arguments; the term only refers to how they are used. Basically, if your algorithm needs to keep track of some data as it goes, you can introduce an accumulator parameter so that each recursive call can "pass forward" the data from what has been done so far. They often fill the role that local temporary mutable variables fill in imperative algorithms.
It's quite a common pattern for the return value of a recursive function to be the value of an accumulator parameter once it is determined that there's no more work left to do, as happens with cnt in this case.
Note that these sort of techniques don't necessarily produce good functional code, but it's very easy to convert functions implemented using "local" mutable state to functional style using this technique. Pervasive non-local use of mutability, such as is typical of most traditional OO programs, is harder to convert like this; you can do it, but you tend to have to modify the entire program at once, and the resulting functions have large numbers of extra arguments (explicitly exposing all the hidden data-flow that was present in original program).
I don't have any basic techniques to change the code you have specifically. However, here is a general tip for solving recursion problems:
Can you break the problem into sub-problems? In the coin-change example, if you are trying to get to $10 with a $5, that's similar to the question of getting to $5 with a $5 (having already chosen the $5 once). Try to draw it out and make rules. You'll be surprised at how much more obviously correct your solution is. A sketch of that decomposition follows below.
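Here is a hedged sketch of that decomposition (written in C for consistency with the other examples added on this page; the same shape translates directly to Scala, with coins as an array of length n):

#include <stdio.h>

/* Ways to make `money` from coins[0..n-1]: either use coins[0]
 * at least once, or skip coins[0] entirely. */
int count_change(int money, const int *coins, int n) {
    if (money == 0) return 1;          /* exact change reached  */
    if (money < 0 || n == 0) return 0; /* overshot, or no coins */
    return count_change(money - coins[0], coins, n)   /* use coins[0]  */
         + count_change(money, coins + 1, n - 1);     /* skip coins[0] */
}

int main(void) {
    int coins[] = { 5, 2, 1 };
    printf("%d\n", count_change(10, coins, 3)); /* ways to make 10 */
    return 0;
}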
Since nobody has answered your question yet, I will try to give you some hints.
What is a loop?
Traversing each element of a collection, stopping when a condition is met.
What can you do with recursion?
Traverse each element of a collection, stopping when a condition is met.
Start simple: write a method without vars which prints each element of a collection.
Then the rest becomes simple: look at your loop and what you are doing.
Instead of manipulating the variables directly (like i = i + 1), simply pass i + 1 to the recursive call of your method, as in the sketch below.
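A tiny sketch of that idea (in C for consistency with the other examples on this page; the function name print_all is made up, and the same shape carries over directly to a Scala method):

#include <stdio.h>

/* Print each element recursively: instead of mutating i (i = i + 1),
 * pass i + 1 to the next call. The stop condition replaces the loop test. */
void print_all(const int *xs, int i, int len) {
    if (i >= len)
        return;
    printf("%d\n", xs[i]);
    print_all(xs, i + 1, len);
}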
HTH

Optimizing NSNumber numberWithInt:

I am profiling an iPhone app and I noticed a strange pattern. In a certain block of code that's called quite frequently...
[item setQuadrant:[NSNumber numberWithInt:a]];
[item setIndex:[NSNumber numberWithInt:b]];
[item setTimestamp:[NSNumber numberWithInt:c]];
[item setState:[NSNumber numberWithInt:d]];
[item setCompletionPercentage:[NSNumber numberWithInt:e]];
[item setId_:[NSNumber numberWithInt:f]];
...the first call to [NSNumber numberWithInt:] takes an inordinate amount of time, in the order of 10-15x that of the remaining calls. I've verified that the results are consistent if I shuffle the lines (the first line is always the slow one, by the same ratio). Is there something going on that I'm not aware of?
Perhaps this happens because this block is inside a try/catch?
If I had to guess, NSNumber performs some code in its +load implementation, which slows down the initial call to the class. Also note that NSNumber caches its return values, so future calls to +numberWithInt: with the same value are faster than the first; that could possibly be part of the issue.
Maybe the first value is much larger? Apart from CoreFoundation's CFNumber caching, the "new" runtime uses tagged pointers, allowing integers that fit in 24 bits to be encoded directly into the pointer. The runtime then figures out that it is dealing with a tagged pointer by looking at the last bit, that it is a CFNumber by looking at the 3 bits before the last bit, and the target number size - 8, 16, 32, or 64 bit - using the next 4 bits before those.
On a 32-bit system (current iPhones), that means that for "small" negative 32-bit numbers or large positive ones, CoreFoundation will allocate an object. For everything else, it uses the following code path, which is way faster:
case kCFNumberSInt32Type: {
    int32_t value = *(int32_t *)valuePtr; // this loads the actual numerical value passed to CFNumberCreate()
#if !__LP64__
    // We don't bother allowing the min 24-bit integer -2^23 to also be fast-pathed;
    // tell anybody that complains about that to go ... hang.
    int32_t limit = (1L << 23);
    if (value <= -limit || limit <= value) break;
#endif
    uintptr_t ptr_val = ((uintptr_t)((intptr_t)value << 8) | (2 << 6) | kCFTaggedObjectID_Integer);
    return (CFNumberRef)ptr_val;
}
(note that !__LP64__ is true for 32-bit systems)
Taken from: http://www.opensource.apple.com/source/CF/CF-744.12/CFNumber.c
Also, there is a caching mechanism that prevents a range of numbers from being re-created multiple times, just search for "__CFNumberCache" in the same source file.
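Both mechanisms can be observed from the public API. Here is a small sketch using CFNumberCreate() (pointer identity is an implementation detail of the tagged-pointer/cache fast paths, so the exact output is illustrative only):

#include <CoreFoundation/CoreFoundation.h>
#include <stdio.h>

int main(void) {
    int32_t small = 5, big = 1 << 30;

    // Two independently created small numbers: on builds that take the
    // tagged-pointer (or __CFNumberCache) fast path, no heap object is
    // allocated, and both creations can yield the same pointer bits.
    CFNumberRef a = CFNumberCreate(NULL, kCFNumberSInt32Type, &small);
    CFNumberRef b = CFNumberCreate(NULL, kCFNumberSInt32Type, &small);

    // A value outside the 24-bit range forces a real allocation,
    // so two creations yield two distinct objects.
    CFNumberRef c = CFNumberCreate(NULL, kCFNumberSInt32Type, &big);
    CFNumberRef d = CFNumberCreate(NULL, kCFNumberSInt32Type, &big);

    printf("small: %p %p\n", (void *)a, (void *)b); // often identical
    printf("big:   %p %p\n", (void *)c, (void *)d); // distinct

    CFRelease(a); CFRelease(b); CFRelease(c); CFRelease(d);
    return 0;
}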

Help with live-updating sound on the iPhone

My question is a little tricky, and I'm not exactly experienced (I might get some terms wrong), so here goes.
I'm declaring an instance of an object called "Singer". The instance is called "singer1". "singer1" produces an audio signal. Now, the following is the code where the specifics of the audio signal are determined:
OSStatus playbackCallback(void *inRefCon,
                          AudioUnitRenderActionFlags *ioActionFlags,
                          const AudioTimeStamp *inTimeStamp,
                          UInt32 inBusNumber,
                          UInt32 inNumberFrames,
                          AudioBufferList *ioData) {
    //Singer *me = (Singer *)inRefCon;
    static int phase = 0;
    for(UInt32 i = 0; i < ioData->mNumberBuffers; i++) {
        int samples = ioData->mBuffers[i].mDataByteSize / sizeof(SInt16);
        SInt16 values[samples];
        float waves;
        float volume = .5;
        for(int j = 0; j < samples; j++) {
            waves = 0;
            waves += sin(kWaveform * 600 * phase) * volume;
            waves += sin(kWaveform * 400 * phase) * volume;
            waves += sin(kWaveform * 200 * phase) * volume;
            waves += sin(kWaveform * 100 * phase) * volume;
            waves *= 32500 / 4; // <--------- make sure to divide by how many waves you're stacking
            values[j] = (SInt16)waves;
            values[j] += values[j] << 16;
            phase++;
        }
        memcpy(ioData->mBuffers[i].mData, values, samples * sizeof(SInt16));
    }
    return noErr;
}
99% of this is borrowed code, so I only have a basic understanding of how it works (I don't know about the OSStatus class or method or whatever this is). However, you see those 4 lines with 600, 400, 200 and 100 in them? Those determine the frequency. Now, what I want to do (for now) is insert my own variable in there in place of a constant, which I can change on a whim. This variable is called "fr1". "fr1" is declared in the header file, but if I try to compile I get an error about "fr1" being undeclared. Currently, my technique to fix this is the following: right beneath where I #import stuff, I add the line
fr1 = 0.0; // any number will work properly
This sort of works, as the code will compile and singer1.fr1 will actually change values if I tell it to. The problems are now these:
A) Even though this compiles and the tone specified will play (0.0 is no tone), I get the warnings "Data definition has no type or storage class" and "Type defaults to 'int' in declaration of 'fr1'". I bet this is because for some reason it's not seeing my previous declaration in the header file (as a float). However, again, if I leave this line out the code won't compile because "fr1 is undeclared".
B) Just because I change the value of fr1 doesn't mean that singer1 will update the value stored inside the playbackCallback function, or whatever is in charge of updating the output buffers. Perhaps this can be fixed by coding differently?
C) Even if this did work, there is still a noticeable "gap" when pausing/playing the audio, which I need to eliminate. This might mean a complete overhaul of the code so that I can "dynamically" insert new values without disrupting anything.
However, the reason I'm going through all this effort to post is because this method does exactly what I want (I can compute a value mathematically and it goes straight to the DAC, which means I can use it in the future to make triangle, square, etc. waves easily). I have uploaded Singer.h and .m to pastebin for your viewing pleasure; perhaps they will help. Sorry, I can't post 2 HTML tags so here are the full links.
(http://pastebin.com/ewhKW2Tk)
(http://pastebin.com/CNAT4gFv)
So, TL;DR, all I really want to do is be able to define the current equation/value of the 4 waves and re-define them very often without a gap in the sound.
Thanks. (And sorry if the post was confusing or got off track, which I'm pretty sure it did.)
My understanding is that your callback function is called every time the buffer needs to be re-filled. So changing fr1..fr4 will alter the waveform, but only when the buffer updates. You shouldn't need to stop and re-start the sound to get a change, but you will notice an abrupt shift in the timbre if you change your fr values. In order to get a smooth transition in timbre, you'd have to implement something that smoothly changes the fr values over time. Tweaking the buffer size will give you some control over how responsive the sound is to your changing fr values.
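One common way to get that smooth transition (a sketch of my own, not taken from your linked code) is a one-pole smoother: each sample, move the working frequency a small fraction of the way toward the target value that the rest of the app writes to:

/* Hypothetical smoothing of a frequency parameter inside the render
 * callback: fr_target is what the UI thread writes to, and
 * fr_current is what the synthesis loop actually uses. */
static volatile float fr_target = 440.0f;
static float fr_current = 440.0f;

static inline float smoothed_fr(void) {
    fr_current += 0.001f * (fr_target - fr_current); /* small step per sample */
    return fr_current;
}

Calling smoothed_fr() once per sample inside the j loop removes the audible step when fr_target changes.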
Your issue with fr being undefined is due to your callback being a straight C function. Your fr variables are declared as Objective-C instance variables, as part of your Singer object, so they are not accessible from the callback by default.
Take a look at this project, and see how he implements access to his instance variables from within his callback. Basically he passes a reference to his instance to the callback function, and then accesses instance variables through that.
https://github.com/youpy/dowoscillator
notice:
Sinewave *sineObject = inRefCon;
float freq = sineObject.frequency * 2 * M_PI / samplingRate;
and:
AURenderCallbackStruct input;
input.inputProc = RenderCallback;
input.inputProcRefCon = self;
Also, you'll want to move your callback function outside of your @implementation block, because it's not actually part of your Singer object.
You can see this all in action here: https://github.com/coryalder/SineWaver

Which costs more while looping: assignment or an if-statement?

Consider the following 2 scenarios:
boolean b = false;
int i = 0;
while(i++ < 5) {
    b = true;
}
OR
boolean b = false;
int i = 0;
while(i++ < 5) {
    if(!b) {
        b = true;
    }
}
Which is more "costly" to do? If the answer depends on the language or compiler used, please say which.
Please do not ask why I would want to do either; they're just barebones examples that point out the relevant question: should a variable be set to the same value in a loop over and over again, or should it be tested on every iteration to check whether it holds a value that needs to change?
Please do not forget the rules of Optimization Club.
The first rule of Optimization Club is, you do not Optimize.
The second rule of Optimization Club is, you do not Optimize without measuring.
If your app is running faster than the underlying transport protocol, the optimization is over.
One factor at a time.
No marketroids, no marketroid schedules.
Testing will go on as long as it has to.
If this is your first night at Optimization Club, you have to write a test case.
It seems that you have broken rule 2. You have no measurement. If you really want to know, you'll answer the question yourself by setting up a test that runs scenario A against scenario B and finds the answer. There are so many differences between different environments, we can't answer.
Have you tested this? Working on a Linux system, I put your first example in a file called LoopTestNoIf.java and your second in a file called LoopTestWithIf.java, wrapped a main function and class around each of them, compiled, and then ran with this bash script:
#!/bin/bash
function run_test {
    iter=0
    while [ $iter -lt 100 ]
    do
        java $1
        let iter=iter+1
    done
}
time run_test LoopTestNoIf
time run_test LoopTestWithIf
The results were:
LoopTestNoIf:
real 0m10.358s
user 0m4.349s
sys 0m1.159s
LoopTestWithIf:
real 0m10.339s
user 0m4.299s
sys 0m1.178s
Showing that having the if makes it slightly faster on my system.
Are you trying to find out if doing the assignment each loop is faster in total run time than doing a check each loop and only assigning once on satisfaction of the test condition?
In the above example I would guess that the first is faster: you perform 5 assignments. In the latter you perform 5 tests and then one assignment.
But you'll need to up the iteration count and throw in some stopwatch timers to know for sure.
Actually, this is the question I was interested in. (I hoped I'd find the answer somewhere and avoid testing it myself. Well, I didn't…)
To be sure that your (my) test is valid, you (I) have to do enough iterations to get enough data. Each iteration must be "long" enough (on the time scale) to show the true difference. I found out that even one billion iterations did not add up to a time interval long enough to measure reliably, so I wrote this test:
for (int k = 0; k < 1000; ++k)
{
    {
        long stopwatch = System.nanoTime();
        boolean b = false;
        int i = 0, j = 0;
        while (i++ < 1000000)
            while (j++ < 1000000)
            {
                int a = i * j; // to slow down a bit
                b = true;
                a /= 2; // to slow down a bit more
            }
        long time = System.nanoTime() - stopwatch;
        System.out.println("\tasgn\t" + time);
    }
    {
        long stopwatch = System.nanoTime();
        boolean b = false;
        int i = 0, j = 0;
        while (i++ < 1000000)
            while (j++ < 1000000)
            {
                int a = i * j; // the same thing as above
                if (!b)
                {
                    b = true;
                }
                a /= 2;
            }
        long time = System.nanoTime() - stopwatch;
        System.out.println("\tif\t" + time);
    }
}
I ran the test three times, storing the data in Excel; then I swapped the first ('asgn') and second ('if') cases and ran it three times again. And the result? The 'if' case "won" four times and the 'asgn' case appeared better twice, which shows how sensitive the execution can be. But in general, I hope this has also shown that the 'if' case is the better choice.
Thanks, anyway…
Any compiler (except, perhaps, in debug) will optimize both these statements to
bool b = true;
But generally, the relative speed of assignment versus a branch depends on the processor architecture, not the compiler. A modern, superscalar processor performs horribly on branches (especially mispredicted ones); a simple microcontroller uses roughly the same number of cycles for any instruction.
Relative to your barebones example (and perhaps your real application):
boolean b = false;
// .. other stuff, might change b
int i = 0;
// .. other stuff, might change i
b |= i < 5;
while(i++ < 5) {
    // .. stuff with i, possibly stuff with b, but no assignment to b
}
problem solved?
But really - it's going to be a question of the cost of your test (generally more than just if (boolean)) and the cost of your assignment (generally more than just primitive = x). If the test/assignment is expensive or your loop is long enough or you have high enough performance demands, you might want to break it into two parts - but all of those criteria require that you test how things perform. Of course, if your requirements are more demanding (say, b can flip back and forth), you might require a more complex solution.
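As a sketch of what "break it into two parts" can mean here (in C, with a hypothetical expensive_test standing in for "the cost of your test"): loop with the test only until the flag flips, then finish the remaining iterations without it:

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical per-element test. */
static bool expensive_test(int i) {
    return i == 3;
}

static void run(int n) {
    bool b = false;
    int i = 0;

    while (i < n && !b) {   /* part 1: still testing the condition  */
        b = expensive_test(i);
        i++;
    }
    while (i < n) {         /* part 2: b is settled, no test needed */
        /* work that no longer needs to check b */
        i++;
    }
    printf("b = %d after %d iterations\n", b, n);
}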

In what circumstances can a compiler change the execution order of program statements?

If this is not a real question then feel free to close ;)
It isn't only the compiler that can reorder execution (mostly for optimization); most modern processors do so, too. Read more about execution reordering and memory barriers.
The compiler can change the execution order of statements when it sees fit for optimization purposes, and when such changes wouldn't alter the observable behavior of the code.
A very simple example -
int func (int value)
{
    int result = value*2;
    if (value > 10)
    {
        return result;
    }
    else
    {
        return 0;
    }
}
A naive compiler could generate code for this in exactly the sequence shown: first calculate "result", and return it only if the original value is larger than 10 (if it isn't, "result" would have been calculated needlessly and ignored).
A sane compiler, though, would see that the calculation of "result" is only needed when "value" is larger than 10, so it may easily move the calculation "value*2" inside the first branch and perform it only when "value" is actually larger than 10 (needless to say, the compiler doesn't really look at the C code when optimizing - it works at lower levels).
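In source terms, the optimized function would behave as if it had been written like this:

int func (int value)
{
    if (value > 10)
    {
        return value*2; /* computed only when actually needed */
    }
    else
    {
        return 0;
    }
}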
This is only a simple example. Much more complicated examples can be created. It is very possible that a C function would end up looking almost nothing like its C representation in compiled form, with aggressive enough optimizations.
Many compilers use something called "common subexpression elimination", together with the closely related loop-invariant code motion. For example, if you had the following code:
for(int i=0; i<100; i++) {
    x += y * i * 15;
}
the compiler would notice that y * 15 is invariant (its value doesn't change). So it would compute y * 15, stick the result in a register and change the loop statement to "x += r0 * i". This is kind of a contrived example, but you often see expressions like this when working with array indexes or any other base + offset type of situation.
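In source terms, the hoisted version would look something like this (r0 stands in for whatever register the compiler picked):

int r0 = y * 15;            /* invariant part, computed once          */
for(int i=0; i<100; i++) {
    x += r0 * i;            /* one multiplication saved per iteration */
}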