Volatile vars and multi-core thread synchronization!

I have several threads executing concurrently and checking the value of a field in their own object. The field is set by the launch thread like this:
for (i = 0; i < ThreadCount; i++)
{
    ThreadContext[i].MyField = 1;
}
Each thread then checks the value of this field:
if (MyField == 1)
{
    ... // do something
}
However, I noticed that on a 4-core CPU, some of the (4) running threads need several milliseconds or even longer to see the changed MyField. MyField is a single char field. What appears to be happening is that when the memory bus is maxed out by the first thread that detects the change, all other threads may stall for almost the entire duration of the first thread's run (assuming there is enough memory pressure). Only when the first thread eases off memory (and does more work in registers) do the other threads also get to see the new value.
I checked the asm and there is no compiler optimization in the way here; the accesses go directly to memory. How can this be fixed?
Thanks!
Jam

I got feedback from Intel: Yes, that's how it works (no easy fix).
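For what it's worth, the usual portable way to publish a flag like this is an atomic with explicit memory ordering rather than a plain (or volatile) field. That guarantees the store becomes visible to the other threads without relying on the compiler, although it does nothing about the bus-contention stall described above. A minimal sketch in Swift, assuming the swift-atomics package is available (the type and names are illustrative, not from the original code):

import Atomics   // swift-atomics package (assumed dependency)

// Each worker polls an atomic flag instead of a plain field.
final class WorkerContext {
    let myField = ManagedAtomic<Int>(0)
}

let contexts = (0..<4).map { _ in WorkerContext() }

// Launch thread: publish the value with release ordering.
for context in contexts {
    context.myField.store(1, ordering: .releasing)
}

// Inside each worker thread: read it back with acquire ordering.
func workerStep(_ context: WorkerContext) {
    if context.myField.load(ordering: .acquiring) == 1 {
        // ... do something
    }
}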

MTLSharedEventListener block called before command buffer scheduling and not in-flight

I am using a MTLSharedEvent to occasionally relay new information from the CPU to the GPU by writing into a MTLBuffer with storage mode .storageModeManaged within a block registered by the shared event (using the notify(_:atValue:block:) method of MTLSharedEvent, with a MTLSharedEventListener configured to be notified on a background dispatch queue). The process looks something like this:
let device = MTLCreateSystemDefaultDevice()!
let synchronizationQueue = DispatchQueue(label: "com.myproject.synchronization")
let sharedEvent = device.makeSharedEvent()!
let sharedEventListener = MTLSharedEventListener(dispatchQueue: synchronizationQueue)

// Updated only occasionally on the CPU (on user interaction). Mostly written to
// on the GPU
let managedBuffer = device.makeBuffer(length: 10, options: .storageModeManaged)!

var doExtraWork = true

func computeSomething(commandBuffer: MTLCommandBuffer) {
    // Do work on the GPU every frame

    // After writing to the buffer on the GPU, synchronize the buffer (required)
    let blitToSynchronize = commandBuffer.makeBlitCommandEncoder()!
    blitToSynchronize.synchronize(resource: managedBuffer)
    blitToSynchronize.endEncoding()

    // Occasionally, add extra information on the GPU
    if doExtraWork {
        // Register a block to write into the buffer
        sharedEvent.notify(sharedEventListener, atValue: 1) { event, value in
            // Safely write into the buffer. Make sure we call `didModifyRange(_:)` after
            // Update the counter
            event.signaledValue = 2
        }
        commandBuffer.encodeSignalEvent(sharedEvent, value: 1)
        commandBuffer.encodeWaitForEvent(sharedEvent, value: 2)
    }

    // Commit the work
    commandBuffer.commit()
}
The expected behavior is as follows:
The GPU does some work with the managed buffer
Occasionally, the information needs to be updated with new information on the CPU. In this frame, we register a block of work to be executed. We do so in a dedicated block because we cannot guarantee that by the time execution on the main thread reaches this point the GPU is not simultaneously reading from or writing to the managed buffer. Hence, it is unsafe to simply write to it right away, and we must make sure the GPU is not doing anything with this data.
When the GPU schedules this command buffer to be executed, commands encoded before the encodeSignalEvent(_:value:) call are executed, and then execution on the GPU stops until the block increments the signaledValue property of the event passed into the block.
When execution reaches the block, we can safely write into the managed buffer because we know the CPU has exclusive access to the resource. Once we've done so, we resume execution of the GPU.
The issue is that it seems Metal is not calling the block when the GPU is executing the command, but rather before the command buffer is even scheduled. Worse, the system seems to "work" with the initial command buffer (the very first command buffer, before any other are scheduled).
I first noticed this issue when I looked at a GPU frame capture after my scene would vanish following a CPU update, which is where I saw that the GPU had NaNs all over the place. I then ran into this strange situation when I purposely waited on the background dispatch queue with a sleep(_:) call. Quite correctly, my shared resource semaphore (not shown, signaled in a completion block of the command buffer and waited on in the main thread) reached a value of -1 after committing three command buffers to the command queue (three being the number of recycled shared MTLBuffers holding scene uniform data etc.). This suggests that the first command buffer has not finished executing by the time the CPU is more than three frames ahead, which is consistent with the sleep(_:) behavior. Again, what isn't consistent is the ordering: Metal seems to call the block before even scheduling the buffer. Further, in subsequent frames, it doesn't seem that Metal cares that the sharedEventListener block is taking so long; it schedules the command buffer for execution even while the block is still running, and the block finishes dozens of frames later.
This behavior is completely inconsistent with what I expect. What is going on here?
P.S.
There is probably a better way to periodically update a managed buffer whose contents are mostly modified on the GPU, but I have not yet found a way to do so. Any advice on this subject is appreciated as well. Of course, a triple-buffer system could work, but it would waste a lot of memory, as the managed buffer is quite large (whereas the shared buffers managed by the semaphore are quite small).
I think I have the answer for you, but I'm not sure.
From the MTLSharedEvent documentation:
Commands waiting on the event are allowed to run if the new value is equal to or greater than the value for which they are waiting. Similarly, setting the event's value triggers notifications if the value is equal to or greater than the value for which they are waiting.
This means that if you pass the values 1 and 2 as you show in your snippet, it will only work a single time; after that, the event won't be waited on and listeners won't be notified.
You have to make sure the value you wait on and then signal is monotonically increasing every time, so you have to bump it up by 1 or more.
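As a rough sketch of what that could look like, reusing the names from the snippet above (the counter and the function are illustrative, not from the original code), you can derive both values from a single monotonically increasing counter each time you encode a CPU update:

// Hypothetical sketch: one ever-increasing counter drives both the value the
// listener fires at and the value the GPU waits for, so the signal/wait pair
// keeps working on every CPU update instead of only the first one.
var nextEventValue: UInt64 = 0

func encodeCPUUpdate(into commandBuffer: MTLCommandBuffer) {
    nextEventValue += 1
    let writeValue = nextEventValue    // GPU signals this; the listener block runs
    nextEventValue += 1
    let resumeValue = nextEventValue   // CPU signals this; the GPU resumes

    sharedEvent.notify(sharedEventListener, atValue: writeValue) { event, _ in
        // ... safely update managedBuffer here and call didModifyRange(_:) ...
        event.signaledValue = resumeValue
    }
    commandBuffer.encodeSignalEvent(sharedEvent, value: writeValue)
    commandBuffer.encodeWaitForEvent(sharedEvent, value: resumeValue)
}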

What happens if a thread is in the critical section or entering the critical section?

I am trying to better understand a chapter and have been confused about what happens if a thread is in the critical section or is entering the critical section. Could someone explain or give me an idea of what the thread undergoes in such circumstances? Thank you.
For an example, let's assume that you have an array, and multiple threads that read and write to the array; and if different threads are reading and writing to the array at the same time they'd see inconsistent data and it'd cause problems. To prevent those problems you protect the array with some kind of lock - before doing anything with the array a thread acquires the array's lock, and when it's finished using the array the thread releases the array's lock.
For example:
acquire_array_lock();
/** Critical section (code that does something with the array) **/
release_array_lock();
There's nothing special about the code in the critical section. It does whatever it was designed to do (maybe sorting the array, maybe adding up all the numbers in the array, maybe displaying the array, etc) using code that's no different to code that you might use to do the same thing in a single-threaded system without locks.
The only special parts are the code to acquire and release the lock.
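For instance, here is that same acquire/critical-section/release shape with a real lock, sketched in Swift using Foundation's NSLock (the array and function names are just illustrative):

import Foundation

let arrayLock = NSLock()
var sharedNumbers: [Int] = []

func addNumber(_ value: Int) {
    arrayLock.lock()              // acquire the array's lock
    sharedNumbers.append(value)   // critical section: ordinary array code
    arrayLock.unlock()            // release it so other threads can continue
}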
There are many types of locks (spinlocks, mutexes, semaphores), but they all have the same fundamental principle - when acquiring it you have something (e.g. a variable) to determine if a thread can/can't continue, then either (if the thread can't continue) some kind of waiting or (if the thread can continue) some kind of change to let others know they need to wait; and when releasing you have something to let others know they can stop waiting.
The main difference between different kinds of locks is the implementation details - what kind of data is used to determine if a thread can/can't continue, and how a thread waits.
For the simplest kind of lock (a spinlock) you might just have a single "yes/no" flag, a little bit like this (but not literally like this):
acquire_lock(void) {
    while(myLock != 0) {
        // do nothing then retry
    }
    myLock = 1;
}

release_lock(void) {
    myLock = 0;
}
However this won't work because two or more threads can see that myLock == 0 at the same time and think they can both continue (and then do the myLock = 1 after it's too late). To fix this you need assembly language or special language support for atomic operations (e.g. a special function for "test and set" or "compare and exchange").
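As a sketch of that fix (here in Swift, assuming the swift-atomics package as a dependency), the check and the set become one indivisible compare-and-exchange, so two threads can no longer both see "unlocked" and proceed:

import Atomics   // swift-atomics package (assumed dependency)

final class SpinLock {
    private let locked = ManagedAtomic<Bool>(false)

    func acquire() {
        // Keep retrying until this thread is the one that flips false -> true.
        while !locked.compareExchange(expected: false,
                                      desired: true,
                                      ordering: .acquiring).exchanged {
            // do nothing then retry (spin)
        }
    }

    func release() {
        locked.store(false, ordering: .releasing)
    }
}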
The reason this is called a "spinlock" is that (if a thread needs to wait) it wastes CPU time continually checking ("spinning") to see if it can continue. Instead of doing this (to avoid wasting CPU time), a thread could tell the scheduler not to give it any CPU time until the lock is released; and this is how a mutex works.

Modification to "Implementing an N process barrier using semaphores"

Recently I came across this problem, which is pretty similar to the first readers/writers problem.
Implementing an N process barrier using semaphores
I am trying to modify it to make sure that it can be reused and works correctly.
n = the number of threads
count = 0
mutex = Semaphore(1)
barrier = Semaphore(0)

mutex.wait()
count = count + 1
if (count == n) { barrier.signal() }
mutex.signal()

barrier.wait()

mutex.wait()
count = count - 1
barrier.signal()
if (count == 0) { barrier.wait() }
mutex.signal()
Is this correct?
I'm wondering if there exist some mistakes I didn't detect.
Your pseudocode correctly returns the barrier to its initial state. An insignificant suggestion: replace
barrier.signal()
if(count==0){ barrier.wait()}
with the (IMHO) more readable
if(count!=0){ barrier.signal()} //if anyone left pending barrier, release it
However, there may be pitfalls in the way the barrier is reused. The described barrier has two states:
Stop each thread until all of them are pending.
Resume each thread until all of them are running.
There is no protection against mixing them: some threads are being resumed, while others have already hit the first stage again. For example, you have a bunch of threads which do some stuff and then sync up on the barrier. Each thread body would be:
while (!exit_condition) {
    do_some_stuff();
    pend_barrier(); // implementation from current question
}
The programmer expects that the number of calls to do_some_stuff() will be the same for all threads. What may (or may not) happen, depending on timing: the first thread released from the barrier finishes do_some_stuff() before all threads have left the barrier, so it re-enters the barrier for a second time. As a result it will be released along with the other threads in the current barrier-release iteration and will end up with (at least) one extra call to do_some_stuff().
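One common way around that reuse hazard is a two-phase barrier with two turnstiles, so no thread can start the next iteration until every thread has left the current one. A minimal Swift sketch using DispatchSemaphore (the class and method names are illustrative):

import Foundation

final class ReusableBarrier {
    private let n: Int
    private var count = 0
    private let mutex = DispatchSemaphore(value: 1)
    private let turnstile1 = DispatchSemaphore(value: 0)
    private let turnstile2 = DispatchSemaphore(value: 0)

    init(threadCount: Int) { self.n = threadCount }

    func arrive() {
        // Phase 1: block until all n threads have arrived.
        mutex.wait()
        count += 1
        if count == n {
            for _ in 0..<n { turnstile1.signal() }   // release everyone at once
        }
        mutex.signal()
        turnstile1.wait()

        // Phase 2: block again until all n threads have left phase 1,
        // so nobody can lap the barrier and re-enter it early.
        mutex.wait()
        count -= 1
        if count == 0 {
            for _ in 0..<n { turnstile2.signal() }
        }
        mutex.signal()
        turnstile2.wait()
    }
}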

iOS Threads Wait for Action

I have a processing thread that I use to fill a data buffer. Elsewhere a piece of hardware triggers a callback which reads from this data buffer. The processing thread then kicks in and refills the buffer.
When the buffer fills up I am currently telling the thread to wait by:
while( [self FreeWriteSpace] < mProcessBufferSize && InActive) {
[NSThread sleepForTimeInterval:.0001];
}
However, when I profile I am getting a lot of CPU time spent in sleep. Is there a better way to wait? Do I even care if the profile says time is spent in sleep?
Time spent in sleep is effectively free. In Instruments, look at "running samples" rather than "all samples." But this still isn't an ideal solution.
First, your sleep interval is crazy. Do you really need 0.1 ms granularity? The system almost certainly isn't giving you that, because the processor isn't that fast. I have to believe you could raise this to 0.1 or 0.01 seconds. But that's still busy-waiting, which is not ideal if you can help it.
The better solution is to use an NSCondition. In this thread, wait on the condition, and in your processing thread, trigger the condition when there's room to write.
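A minimal sketch of that pattern (the property and function names here are illustrative, not taken from the original code):

import Foundation

let bufferCondition = NSCondition()
var freeWriteSpace = 0

// Reader side (hardware callback): after consuming data, report the freed space.
func didRead(bytes: Int) {
    bufferCondition.lock()
    freeWriteSpace += bytes
    bufferCondition.signal()          // wake the processing thread
    bufferCondition.unlock()
}

// Processing thread: block (without burning CPU) until there is room to write.
func waitForWriteSpace(atLeast needed: Int) {
    bufferCondition.lock()
    while freeWriteSpace < needed {   // re-check in a loop; wakeups can be spurious
        bufferCondition.wait()
    }
    bufferCondition.unlock()
}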
Do be careful with your naming. Do not name methods with leading caps (that indicates a class name). And avoid accessing ivars directly (InActive) like this. "InActive" is also a very confusing name: does it mean the system is active ("In Active") or not active ("inactive")? Naming in Objective-C is extremely important. The compiler will not protect you the way it does in C# and C++. Good naming is how you keep your programs working, and many parts of ObjC rely on it.
You may also want to investigate Grand Central Dispatch, which is particularly designed for these kinds of problems. Look at dispatch_async() to run things when new data comes in.
However, when I profile I am getting a lot of CPU time spent in sleep. Is there a better way to wait? Do I even care if the profile says time is spent in sleep?
Yes -- never, never poll. Polling eats CPU, makes your app less responsive, eats battery, and is an all around waste.
Notify instead.
The easiest way is to use one of the variants of "perform selector on main thread" (see NSThread's documentation). Or dispatch to a queue (including something like dispatch_async(dispatch_get_main_queue(), ^{ ... yo, data be ready ...});).
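For example, a Swift sketch of the dispatch approach (the queue label and the stand-in work are illustrative):

import Foundation

let processingQueue = DispatchQueue(label: "com.example.processing")

func refillAndNotify() {
    processingQueue.async {
        // Refill the buffer off the main thread (stand-in work here).
        let buffer = [UInt8](repeating: 0, count: 1024)
        DispatchQueue.main.async {
            // Hop to the main queue only once the data is actually ready.
            print("buffer ready: \(buffer.count) bytes")
        }
    }
}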

Force the iPhone to simulate doing CPU-intensive tasks?

For a normal app, you'd never want to do this.
But ... I'm making an educational app to show people exactly what happens with the different threading models on different iPhone hardware and OS level. OS 4 has radically changed the different models (IME: lots of existing code DOES NOT WORK when run on OS 4).
I'm writing an interactive test app that lets you fire off threads for different models (selector main thread, selector background, nsoperationqueue, etc), and see what happens to the GUI + main app while it happens.
But one of the common use-cases I want to reproduce is: "a thread that does a background download, then does a CPU-intensive parse of the results". We see this a lot in real-world apps.
It's not entirely trivial; the manner of "being busy" matters.
So ... how can I simulate this? I'm looking for something that is guaranteed not to be thrown-away by an optimizing compiler (either now, or with a better compiler), and is enough to force a thread to run at max CPU for about 5 seconds.
NB: in my real-world apps, I've noticed there are some strange things that happen when an iPhone thread gets busy - e.g. background threads will starve the main thread EVEN WHEN set at lower priority. Although this is clearly a bug in Apple's thread scheduler, I'd like to make a busy-loop that demonstrates this - and/or an alternate busy-loop that shows what happens when you DON'T trigger that behaviour in the scheduler.
UPDATE:
For instance, the following can have different effects:
for( int i=0; i<1000; i++ )
    for( int k=0; k<1000; k++ )
        CC_MD5( cStr, strlen(cStr), result );

for( int i=0; i<1000000; i++ )
    CC_MD5( cStr, strlen(cStr), result );
...sometimes, at least, the compiler seems to optimize away the latter (and I have no idea what the compiler voodoo for that is - some builds showed no difference, some did :()
UPDATE 2:
25 threads, on a first gen iPhone, doing a million MD5's each ... and there's almost no perceptible effect on the GUI.
Whereas 5 threads parsing XML using the bundled SAX-based parser will usually grind the GUI to a halt.
It seems that MD5 hashing doesn't trigger the problems in the iPhone's buggy thread scheduler :(. I'm going to investigate memory allocations instead and see if that has a different effect.
You can avoid the compiler optimising things away by making sure the compiler can't easily infer what you're trying to do at compile time.
For example, this:
for( int i=0; i<1000000; i++ )
CC_MD5( cStr, strlen(cStr), result );
has an invariant input, so the compiler could realise that it's going to get the same result every time, or that the result isn't getting used, so it doesn't need to calculate it at all.
You could avoid both these problems like so:
for( int i=0; i<1000000; i++ )
{
    CC_MD5( cStr, strlen(cStr), result );
    sprintf(cStr, "%02x%02x", result[0], result[1]);
}
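The same feed-the-result-back idea sketched in Swift, in case the test app is written there (CryptoKit's Insecure.MD5 is an assumption on my part; any digest with a data dependency between iterations works):

import CryptoKit
import Foundation

// Burn CPU in a way the optimizer can't discard: each iteration's input
// depends on the previous iteration's digest, and the final value is returned.
func burnCPU(iterations: Int) -> String {
    var input = Data("seed".utf8)
    for _ in 0..<iterations {
        let digest = Insecure.MD5.hash(data: input)
        input = Data(digest)                       // feed the result back in
    }
    return input.map { String(format: "%02x", $0) }.joined()
}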
If you're seeing the problem with SAX, then I'd start with getting the threads in your simulation app doing SAX and check you do see the same problems outside of your main app.
If the problem is not related to pure processor power or memory allocations, other areas you could look at would be disk I/O (depending where your xml input is coming from), mutexes or calling selectors/delegates.
Good luck, and do report back how you get on!
Apple actually provides sample code at developer.apple.com that does something similar to what you are looking for, with the intent of highlighting the performance differences between using LibXML (SAX) and CocoaXML. The focus is not on CPU performance, but assuming you can actually measure processor utilization, you could likely just scale up (repeat within your XML) the dataset that the sample downloads.