MTLSharedEventListener block called before command buffer scheduling and not in-flight - swift

I am using a MTLSharedEvent to occasionally relay new information from the CPU to the GPU. I do this by writing into a MTLBuffer with storage mode .storageModeManaged inside a block registered via the shared event's notify(_:atValue:block:) method, with a MTLSharedEventListener configured to be notified on a background dispatch queue. The process looks something like this:
let device = MTLCreateSystemDefaultDevice()!
let synchronizationQueue = DispatchQueue(label: "com.myproject.synchronization")

let sharedEvent = device.makeSharedEvent()!
let sharedEventListener = MTLSharedEventListener(dispatchQueue: synchronizationQueue)

// Updated only occasionally on the CPU (on user interaction). Mostly written to
// on the GPU
let managedBuffer = device.makeBuffer(length: 10, options: .storageModeManaged)!

var doExtraWork = true

func computeSomething(commandBuffer: MTLCommandBuffer) {
    // Do work on the GPU every frame

    // After writing to the buffer on the GPU, synchronize the buffer (required)
    let blitToSynchronize = commandBuffer.makeBlitCommandEncoder()!
    blitToSynchronize.synchronize(resource: managedBuffer)
    blitToSynchronize.endEncoding()

    // Occasionally, add extra information on the GPU
    if doExtraWork {
        // Register a block to write into the buffer
        sharedEvent.notify(sharedEventListener, atValue: 1) { event, value in
            // Safely write into the buffer. Make sure we call `didModifyRange(_:)` after

            // Update the counter
            event.signaledValue = 2
        }
        commandBuffer.encodeSignalEvent(sharedEvent, value: 1)
        commandBuffer.encodeWaitForEvent(sharedEvent, value: 2)
    }

    // Commit the work
    commandBuffer.commit()
}
The expected behavior is as follows:
The GPU does some work with the managed buffer.
Occasionally, the buffer needs to be updated with new information from the CPU. In such a frame, we register a block of work to be executed. We do so in a dedicated block because we cannot guarantee that, by the time execution on the main thread reaches this point, the GPU is not simultaneously reading from or writing to the managed buffer. Hence, it is unsafe to simply write to it right away; we must make sure the GPU is not doing anything with this data.
When the GPU schedules this command buffer for execution, commands encoded before the encodeSignalEvent(_:value:) call are executed, and then execution on the GPU stops until the block increments the signaledValue property of the event passed into the block.
When execution reaches the block, we can safely write into the managed buffer because we know the CPU has exclusive access to the resource. Once we've done so, we let the GPU resume execution.
The issue is that Metal does not seem to call the block while the GPU is executing the command, but rather before the command buffer is even scheduled. Worse, the system seems to "work" only with the initial command buffer (the very first command buffer, before any others are scheduled).
I first noticed this issue when I looked at a GPU frame capture after my scene would vanish following a CPU update, which is where I saw that the GPU had NaNs all over the place. I then ran into this strange situation when I purposely waited on the background dispatch queue with a sleep(_:) call. Quite correctly, my shared resource semaphore (not shown, signaled in a completion block of the command buffer and waited on in the main thread) reached a value of -1 after committing three command buffers to the command queue (three being the number of recycled shared MTLBuffers holding scene uniform data etc.). This suggests that the first command buffer has not finished executing by the time the CPU is more than three frames ahead, which is consistent with the sleep(_:) behavior. Again, what isn't consistent is the ordering: Metal seems to call the block before even scheduling the command buffer. Further, in subsequent frames, Metal doesn't seem to care that the sharedEventListener block is taking so long, and it schedules the command buffer for execution even while the block is running, which finishes dozens of frames later.
This behavior is completely inconsistent with what I expect. What is going on here?
P.S.
There is probably a better way to periodically update a managed buffer whose contents are mostly modified on the GPU, but I have not yet found a way to do so. Any advice on this subject is appreciated as well. Of course, a triple buffer system could work, but it would waste a lot of memory, as the managed buffer is quite large (whereas the shared buffers managed by the semaphore are quite small).

I think I have the answer for you, but I'm not sure.
From the MTLSharedEvent documentation:
Commands waiting on the event are allowed to run if the new value is equal to or greater than the value for which they are waiting. Similarly, setting the event's value triggers notifications if the value is equal to or greater than the value for which they are waiting.
This means that if you are passing the values 1 and 2 as you show in your snippet, it will only work a single time; after that, the event won't be waited on and the listeners won't be notified, because signaledValue has already reached 2.
You have to make sure the value you wait on and then signal is monotonically increasing, so you have to bump it up by 1 or more every time.
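For example, here is a minimal sketch of how the snippet from the question could keep the values increasing each time the extra work is encoded. It reuses sharedEvent, sharedEventListener and managedBuffer from the question's code; the counter nextEventValue and the function name are my own, not something from the original post:

var nextEventValue: UInt64 = 1

func encodeCPUUpdate(into commandBuffer: MTLCommandBuffer) {
    // The GPU signals `gpuReachedValue` when it reaches this point in the
    // command stream; the CPU block then signals `cpuFinishedValue` to let
    // the GPU continue past the wait.
    let gpuReachedValue = nextEventValue
    let cpuFinishedValue = nextEventValue + 1
    nextEventValue += 2

    sharedEvent.notify(sharedEventListener, atValue: gpuReachedValue) { event, _ in
        // Safe to write into managedBuffer here; call didModifyRange(_:) afterwards
        event.signaledValue = cpuFinishedValue
    }
    commandBuffer.encodeSignalEvent(sharedEvent, value: gpuReachedValue)
    commandBuffer.encodeWaitForEvent(sharedEvent, value: cpuFinishedValue)
}

Because the values never repeat, the notification registered for gpuReachedValue cannot fire early simply because signaledValue already passed an older value in a previous frame.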

Related

Monitors: "At each point in time, at most one thread may be executing any of its methods."

I've recently been taught the concept of monitors. My professor said "only one thread at a time can be in the monitor". I wasn't sure I understood this, which is why I did my research. Wikipedia states what I wrote as the title. My book states, though, that inside a monitor there are queues for threads that are waiting until some defined condition is met. What really confused me is a piece of pseudocode we were given as a solution to the bounded buffer problem with monitors.
My question is: if a process is not stopped by a wait() inside the monitor, does the monitor structure guarantee that it will be permitted to execute the whole method without being interrupted by a context switch, or just that, while it is executing the method, no other producer or consumer is executing their corresponding method?
Because in this slide:
It seems like we only wake up a consumer if the buffer was empty and we just produced an item.
Every time a producer reaches that part of the code, it has produced an item. Why don't we signal every time? I supposed that we (may) consider that if the buffer wasn't empty, then there may be "active" consumers waiting to be resumed because they were interrupted by a context switch, but then I thought to myself: is this possible? Is it possible to be interrupted inside a method (not because you called wait()) but by a context switch?

Modification to "Implementing an N process barrier using semaphores"

Recently I came across this problem, which is pretty similar to the first readers/writers problem.
Implementing an N process barrier using semaphores
I am trying to modify it to make sure that it can be reused and still works correctly.
n = the number of threads
count = 0
mutex = Semaphore(1)
barrier = Semaphore(0)
mutex.wait()
count = count + 1
if (count == n){ barrier.signal()}
mutex.signal()
barrier.wait()
mutex.wait()
count=count-1
barrier.signal()
if(count==0){ barrier.wait()}
mutex.signal()
Is this correct?
I'm wondering if there are some mistakes I didn't detect.
Your pseudocode correctly returns barrier back to its initial state. A minor suggestion: replace
barrier.signal()
if(count==0){ barrier.wait()}
with the (IMHO) more readable
if(count!=0){ barrier.signal()} // if anyone is still pending on the barrier, release it
However, there may be pitfalls in the way the barrier is reused. The described barrier has two phases:
Stop each thread until all of them are pending.
Resume each thread until all of them are running.
There is no protection against mixing them: some threads are being resumed while others have already hit the first phase again. For example, say you have a bunch of threads which do some work and then sync up on the barrier. Each thread body would be:
while (!exit_condition) {
    do_some_stuff();
    pend_barrier(); // implementation from current question
}
The programmer expects that the number of calls to do_some_stuff() will be the same for all threads. Depending on timing, the following may (or may not) happen: the first thread released from the barrier finishes its do_some_stuff() calculations before all threads have left the barrier, so it re-enters the barrier and pends for a second time. As a result, it will be released along with the other threads in the current barrier-release iteration and will end up with (at least) one extra call to do_some_stuff(). A common fix is to split the barrier into two phases, as sketched below.
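Here is a minimal sketch of such a two-phase ("double turnstile") barrier, written with DispatchSemaphore to match the Swift used earlier in the thread; the class and method names are mine, not from the linked question:

import Dispatch

final class ReusableBarrier {
    private let n: Int
    private var count = 0
    private let mutex = DispatchSemaphore(value: 1)
    private let turnstile1 = DispatchSemaphore(value: 0)
    private let turnstile2 = DispatchSemaphore(value: 0)

    init(threadCount: Int) { self.n = threadCount }

    func arrive() {
        // Phase 1: block until all n threads have arrived.
        mutex.wait()
        count += 1
        if count == n {
            for _ in 0..<n { turnstile1.signal() }   // release everyone at once
        }
        mutex.signal()
        turnstile1.wait()

        // Phase 2: block until all n threads have left phase 1, so a fast
        // thread cannot loop around and mix two barrier iterations.
        mutex.wait()
        count -= 1
        if count == 0 {
            for _ in 0..<n { turnstile2.signal() }
        }
        mutex.signal()
        turnstile2.wait()
    }
}

Because a thread can only pass turnstile1 after every thread has entered phase 1 of the current iteration, no thread can be released twice within the same barrier-release cycle.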

Context switch by random system call

I know that an interrupt causes the OS to switch a CPU from its current task and run a kernel routine. In this case, the system has to save the current context of the process running on the CPU.
However, I would like to know whether or not a context switch occurs when any random process makes a system call.
I would like to know whether or not a context switch occurs when any random process makes a system call.
Not precisely. Recall that a process can only make a system call if it's currently running -- there's no need to make a context switch to a process that's already running.
If a process makes a blocking system call (e.g., sleep()), there will be a context switch to the next runnable process, since the current process is now sleeping. But that's another matter.
There are generally two ways to cause a context switch: (1) a timer interrupt invokes the scheduler, which forcibly makes a context switch, or (2) the process yields. Most operating systems have a number of system services that will cause the process to yield the CPU.
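As a small, self-contained illustration of the blocking case (my own sketch, not from the answer above): a read(2) on an empty pipe blocks the calling thread, giving the kernel a natural point to switch to another runnable thread until data arrives.

#if canImport(Darwin)
import Darwin
#else
import Glibc
#endif
import Dispatch

var fds = [Int32](repeating: 0, count: 2)
pipe(&fds)   // fds[0] = read end, fds[1] = write end

// Another thread writes a byte one second from now.
DispatchQueue.global().asyncAfter(deadline: .now() + 1) {
    var byte: UInt8 = 42
    write(fds[1], &byte, 1)
}

var buffer = [UInt8](repeating: 0, count: 1)
// The pipe is empty, so this blocking system call deschedules the calling
// thread (a context switch) instead of spinning; it resumes once the byte
// written above becomes available.
let bytesRead = read(fds[0], &buffer, 1)
print("read \(bytesRead) byte(s): \(buffer[0])")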
Well, I got your point. First, let's clear up a very basic idea about system calls.
When a process/program makes a syscall, it traps into the kernel to invoke the syscall handler: the TSS loads up the kernel stack and the CPU jumps through the syscall function table.
It's actually the same as running a different part of that program itself; the only major change is that the kernel plays a role here and that piece of code is executed in ring 0.
Now to your question: "what will happen if a context switch happens when a random process is making a syscall?"
Well, nothing special will happen. Things will work the same way as they were working earlier; it's just that, instead of a normal address, the TSS of that random process will hold an address pointing to the kernel stack and the syscall function table.

Interrupt masking: why?

I was reading up on interrupts. It is possible to suspend non-critical interrupts via a special interrupt mask; this is called interrupt masking. What I don't know is when/why you might want to, or need to, temporarily suspend interrupts. Possibly for semaphores, or for programming in a multi-processor environment?
The OS does that when it prepares to run its own "let's orchestrate the world" code.
For example, at some point the OS thread scheduler has control. It prepares the processor registers and everything else that needs to be done before it lets a thread run so that the environment for that process and thread is set up. Then, before letting that thread run, it sets a timer interrupt to be raised after the time it intends to let the thread have on the CPU elapses.
After that time period (quantum) has elapsed, the interrupt is raised and the OS scheduler takes control again. It has to figure out what needs to be done next. To do that, it needs to save the state of the CPU registers so that it knows how to undo the side effects of the code it executes. If another interrupt is raised for any reason (e.g. some async I/O completes) while state is being saved, this would leave the OS in a situation where its world is not in a valid state (in effect, saving the state needs to be an atomic operation).
To avoid being caught in that situation, the OS kernel therefore disables interrupts while any such operations that need to be atomic are performed. After it has done whatever needs doing and the system is in a known state again, it reenables interrupts.
I used to program on an ARM board that had about 10 interrupts that could occur. Each particular program that I wrote was never interested in more than 4 of them. For instance there were 2 timers on the board, but my programs only used 1. I would mask the 2nd timer's interrupt. If I didn't mask that timer, it might have been enabled and continued making interrupts which would slow down my code.
Another example was that I would use the UART receive REGISTER full interrupt and so would never need the UART receive BUFFER full interrupt to occur.
I hope this gives you some insight as to why you might want to disable interrupts.
In addition to answers already given, there's an element of priority to it. There are some interrupts you need or want to be able to respond to as quickly as possible and others you'd like to know about but only when you're not so busy. The most obvious example might be refilling the write buffer on a DVD writer (where, if you don't do so in time, some hardware will simply write the DVD incorrectly) versus processing a new packet from the network. You'd disable the interrupt for the latter upon receiving the interrupt for the former, and keep it disabled for the duration of filling the buffer.
In practice, quite a lot of CPUs have interrupt priority built directly into the hardware. When an interrupt occurs, the disable flags for lower-priority interrupts (and often for that interrupt itself) are set at the same time as the interrupt vector is read and the CPU jumps to the relevant address. Dictating that receipt of an interrupt also implicitly masks that interrupt until the end of the interrupt handler has the nice side effect of loosening restrictions on the interrupting hardware. E.g. you can simply say that a high signal triggers the interrupt and leave the external hardware to decide how long it wants to hold the line high, without worrying about inadvertently triggering multiple interrupts.
In many antiquated systems (including the z80 and 6502) there tends to be only two levels of interrupt — maskable and non-maskable, which I think is where the language of enabling or disabling interrupts comes from. But even as far back as the original 68000 you've got eight levels of interrupt and a current priority level in the CPU that dictates which levels of incoming interrupt will actually be allowed to take effect.
Imagine your CPU is in the "int3" handler and at that moment "int2" happens, where "int2" has a lower priority than "int3". How do we handle this situation?
One way is that while handling "int3", we mask out the other, lower-priority interrupts. That is, we see that "int2" is signaling the CPU, but the CPU is not interrupted by it. After we finish handling "int3", we return from "int3" and unmask the lower-priority interrupts.
The place we return to can be:
1. Another process (in a preemptive system)
2. The process that was interrupted by "int3" (in a non-preemptive or preemptive system)
3. An interrupt handler that was itself interrupted by "int3", say "int1"'s handler.
In cases 1 and 2, because we unmasked the lower-priority interrupts and "int2" is still signaling the CPU ("hi, there is something for you to handle immediately"), the CPU will be interrupted again, while it is executing instructions from a process, to handle "int2".
In case 3, if the priority of "int2" is higher than that of "int1", then the CPU will be interrupted again, while it is executing instructions from "int1"'s handler, to handle "int2".
Otherwise, "int1"'s handler is executed without interruption (because we are also masking out the interrupts with priority lower than "int1"), the CPU returns to a process after handling "int1", and it unmasks. At that time "int2" is handled.
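To make the int3/int2 scenario above more concrete, here is a toy Swift model of priority-based masking. It is entirely my own sketch, not how real hardware or any real kernel is programmed, and all the names are made up:

final class InterruptController {
    private var currentLevel = 0                      // 0 = ordinary process code
    private var pending: [(level: Int, handler: () -> Void)] = []

    // "Hardware" raises an interrupt at `level` whose service routine is `handler`.
    func raise(_ level: Int, handler: @escaping () -> Void) {
        guard level > currentLevel else {
            // Masked: the priority is not above the current level, so it must wait.
            print("int\(level) pended (current level is \(currentLevel))")
            pending.append((level, handler))
            return
        }
        let saved = currentLevel
        currentLevel = level                          // entering the handler masks levels <= level
        handler()
        currentLevel = saved                          // return from interrupt: restore the mask
        deliverPending()
    }

    private func deliverPending() {
        // Deliver the highest-priority pended interrupt that is now unmasked, if any.
        while let index = pending.indices.max(by: { pending[$0].level < pending[$1].level }),
              pending[index].level > currentLevel {
            let next = pending.remove(at: index)
            raise(next.level, handler: next.handler)
        }
    }
}

let controller = InterruptController()
controller.raise(3) {
    print("in int3 handler")
    // int2 arrives while int3 is running: it has lower priority, so it is masked.
    controller.raise(2) { print("in int2 handler") }
}
// Prints: "in int3 handler", "int2 pended (current level is 3)", "in int2 handler".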

An IOCP documentation interpretation question - buffer ownership ambiguity

Since I'm not a native English speaker I might be missing something so maybe someone here knows better than me.
Taken from WSASend's documentation at MSDN:
lpBuffers [in]
A pointer to an array of WSABUF structures. Each WSABUF structure contains a pointer to a buffer and the length, in bytes, of the buffer. For a Winsock application, once the WSASend function is called, the system owns these buffers and the application may not access them. This array must remain valid for the duration of the send operation.
OK, can you see the part about the system owning these buffers and the array having to remain valid? That's the unclear spot!
I can think of two interpretations of this passage (it might be something else, you name it):
Interpretation 1 - "buffers" refers to the OVERLAPPED structure that I pass to this function when calling it. I may reuse that object only after getting a completion notification for it.
Interpretation 2 - "buffers" refers to the actual buffers, the ones holding the data I'm sending. If the WSABUF object points to one buffer, then I cannot touch that buffer until the operation is complete.
Can anyone tell me which interpretation is the right one?
And... if the answer is the second one, how would you deal with it?
Because to me it implies that for each and every buffer I send I must retain a copy of it on the sender side - thus having MANY "pending" buffers (of different sizes) in a high-traffic application, which is really going to hurt scalability.
Statement 1:
In addition to the above paragraph (the "And..."), I thought that IOCP copies the data to be sent into its own buffer and sends from there, unless you set SO_SNDBUF to zero.
Statement 2:
I use stack-allocated buffers (you know, something like char cBuff[1024]; in the function body). If the answer to the main question is the second interpretation (i.e. buffers must stay as they are until the send is complete), then... that really screws things up big-time! Can you think of a way to resolve it? (I know, I asked it in other words above.)
The answer is that the overlapped structure and the data buffer itself cannot be reused or released until the completion for the operation occurs.
This is because the operation is completed asynchronously, so even if the data is eventually copied into operating-system-owned buffers in the TCP/IP stack, that may not occur until some time in the future, and you're notified of when it happens by the write completion occurring. Note that write completions may be delayed for a surprising amount of time if you're sending without explicit flow control and relying on the TCP stack to do flow control for you (see here: some OVERLAPS using WSASend not returning in a timely manner using GetQueuedCompletionStatus?).
You can't use stack allocated buffers unless you place an event in the overlapped structure and block on it until the async operation completes; there's not a lot of point in doing that as you add complexity over a normal blocking call and you don't gain a great deal by issuing the call async and then waiting on it.
In my IOCP server framework (which you can get for free from here) I use dynamically allocated buffers which include the OVERLAPPED structure and which are reference counted. This means that the cleanup (in my case they're returned to a pool for reuse) happens when the completion occurs and the reference is released. It also means that you can choose to continue to use the buffer after the operation and the cleanup is still simple.
See also here: I/O Completion Port, How to free Per Socket Context and Per I/O Context?
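For what it's worth, here is a rough Swift sketch of the reference-counted, pooled-buffer idea described above. The names are mine, and the real framework is C++ and also embeds the OVERLAPPED structure inside the buffer, which is omitted here; this only shows the reference-counting and recycling part:

import Foundation

final class SendBuffer {
    let storage: UnsafeMutableRawBufferPointer   // must stay valid for the whole send
    private var refCount = 0
    private unowned let pool: BufferPool
    private let lock = NSLock()

    init(capacity: Int, pool: BufferPool) {
        storage = .allocate(byteCount: capacity, alignment: MemoryLayout<UInt8>.alignment)
        self.pool = pool
    }
    deinit { storage.deallocate() }

    func retain() {
        lock.lock(); refCount += 1; lock.unlock()
    }

    func release() {
        lock.lock(); refCount -= 1; let unused = (refCount == 0); lock.unlock()
        if unused { pool.recycle(self) }         // nothing references it any more
    }
}

final class BufferPool {
    private var free: [SendBuffer] = []
    private let lock = NSLock()

    func acquire(capacity: Int = 4096) -> SendBuffer {
        lock.lock(); defer { lock.unlock() }
        let buffer = free.popLast() ?? SendBuffer(capacity: capacity, pool: self)
        buffer.retain()                          // the caller holds one reference
        return buffer
    }

    func recycle(_ buffer: SendBuffer) {
        lock.lock(); free.append(buffer); lock.unlock()
    }
}

// Usage pattern: fill buffer.storage, call retain() once more on behalf of the
// pending send, start the asynchronous send, and call release() from the I/O
// completion handler. Only when the last reference is released does the buffer
// go back to the pool, so it is never reused while the send is still outstanding.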