MUTEX LOCK and MUTEX UNLOCK - mutex

What happens in the extremely unlucky case that two processes arrive at exactly the same moment? It's not likely, but it can happen from time to time. To make matters worse, let's assume the lock is OPEN, so both processes would find it available. What happens?

Mutexes are implemented using atomic operations. Different processor architectures implement these differently, but regardless of what the processor does, at the lowest level there is bus-arbiter hardware that has to pick an order for all simultaneous memory accesses.
So even if two processors access the same mutex at exactly the same moment in time, the bus arbiter will decide which one goes first and which goes second.
In the end, nothing happens at exactly the same moment in time - everything is ordered.
You can read more about how memory access works in Fixing Gap in knowledge about C/C++ and register access.
In short, a processor does not access memory directly; instead it asks the memory controller to do so. When two processors ask the memory device to do something at the same time, the controller has to pick one of them to go first.

Locking a mutex is an atomic process, so even if two threads manage to request the mutex at exactly the same time, one of them will succeed and the other will fail--that is to say, one will lock the mutex, and the other won't.
Any other result means the mutex is utterly and irrevocably broken--i.e., it's not really a mutex at all.
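
To make that concrete, here is a minimal sketch of a lock built on a C11 atomic exchange (a simplified illustration, not a production mutex - no fairness, no sleeping, just a spinlock):

#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_bool locked; } spinlock_t;

void spin_lock(spinlock_t *l) {
    // atomic_exchange is indivisible: it sets the flag and returns the
    // previous value in one step, so of two simultaneous callers,
    // exactly one sees "false" and wins the lock.
    while (atomic_exchange_explicit(&l->locked, true, memory_order_acquire)) {
        // busy-wait until the holder releases the lock
    }
}

void spin_unlock(spinlock_t *l) {
    atomic_store_explicit(&l->locked, false, memory_order_release);
}

However perfectly the two callers race, the hardware serializes the two exchanges, so one thread enters and the other spins.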

Related

What happens when a core writes to its L1 cache while another core holds the same line in its L1 too?

Let's say we have an Intel Skylake CPU.
How does the cache system preserve consistency? Does it update in real time, or does it stop one of the cores?
What's the performance cost of continuously writing to the same cache line from two cores?
In general, modern CPUs use some variant[1] of the MESI protocol to preserve cache coherency.
In your scenario of an L1 write, the details depend on the existing state of the cache lines: is the cache line already in the cache of the writing core? In the other core, in what state is the cache line, e.g., has it been modified?
Let's take the simple case where the line isn't already in the writing core (C1), and it is in the "exclusive" state in the other core (C2). At the point where the address for the write is known, C1 will issue an RFO (request for ownership) transaction onto the "bus" with the address of the line, and the other cores will snoop the bus and notice the transaction. The other core that has the line will then transition its line from the exclusive to the invalid state, and the value of the line will be provided to the requesting core, which will have it in the modified state, at which point the write can proceed.
Note that at this point, further writes to that line from the writing core proceed quickly, since it is in the M state which means no bus transaction needs to take place. That will be the case until the line is evicted or some other core requests access.
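As a toy model of that transition, here is the state change in code form (deliberately simplified; real implementations add transient states, write-backs, and many race conditions):

// Simplified MESI states for one cache line in one core's cache.
typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_state;

// The RFO scenario from above: the writing core c1 requests ownership
// of a line currently held by another core c2.
void rfo_write(mesi_state *c1, mesi_state *c2) {
    if (*c2 != INVALID) {
        // c2 snoops the RFO and gives up its copy; a MODIFIED line
        // would also be forwarded or written back at this point.
        *c2 = INVALID;
    }
    *c1 = MODIFIED;  // c1 now owns the line; the write can proceed
}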
Now, there are a lot of additional details in actual implementations which aren't covered above or even in the wikipedia description of the protocol.
For example, the basic model involves a single private cache per CPU, and shared main memory. In this model, core C2 would usually provide the value of the shared line onto the bus, even though it has not modified it, since that would be much faster than waiting to read the value from main memory. In all recent x86 implementations, however, there is a shared last-level L3 cache which sits between all the private L2 and L1 caches and main memory. This cache has typically been inclusive so it can provide the value directly to C1 without needing to do a cache-to-cache transfer from C2. Furthermore, having this shared cache means that each CPU may not actually need to snoop the "bus" since the L3 cache can be consulted first to determine which, if any, cores actually have the line. Only the cores that have the line will then be asked to make a state transition. Kind of a push model rather than pull.
Despite all these implementation details, the basics are the same: each cache line has some "per core" state (even though this state may be stored or duplicated in some central place like the LLC), and this state atomically undergoes logical transitions that ensure that the cache line remains consistent at all times.
Given that background, here are some specific answers to your final two sub-questions:
Does it update in real time, or does it stop one of the cores?
Any modern core is going to do this in real time, and also in parallel for different cache lines. That doesn't mean it is free! For example, in the description above, the write by C1 is stalled until the cache coherence protocol is complete, which is likely dozens of cycles. Contrast that with a normal write which takes only a couple cycles. There are also possible bandwidth issues: the requests and responses used to implement the protocol use shared resources that may have a maximum throughput; if the rate of coherence transactions passes some limit, all requests may slow down even if they are independent.
In the past, when there was truly a shared bus, there may have been some partial "stop the world" behavior in some cases. For example, the lock prefix for x86 atomic instructions is apparently named based on the lock signal that a CPU would assert on the bus while it was doing an atomic transaction. During that entire period other CPUs are not able to fully use the bus (but presumably they could still proceed with CPU-local instructions).
What's the performance cost of continuously writing to the same cache line from two cores?
The cost is very high because the line will continuously ping-pong between the two cores as described above (at the end of the process described, just reverse the roles of C1 and C2 and restart). The exact details vary a lot by CPU and even by platform (e.g., a 2-socket configuration will change this behavior a lot), but basically you are probably looking at a penalty of tens of cycles per write, versus one write per cycle when the line is not shared.
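You can see the ping-pong effect yourself with a small benchmark along these lines (a sketch using POSIX threads, assuming 64-byte cache lines and at least two free cores; the exact ratio will vary a lot by machine):

#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define ITERS 50000000UL

// Two counters in the same cache line vs. padded onto separate lines.
static struct { volatile unsigned long a, b; } same_line;
static struct { volatile unsigned long a; char pad[64]; volatile unsigned long b; } padded;

static void *bump(void *p) {
    volatile unsigned long *c = p;
    for (unsigned long i = 0; i < ITERS; i++)
        (*c)++;
    return NULL;
}

static double time_pair(volatile unsigned long *x, volatile unsigned long *y) {
    struct timespec t0, t1;
    pthread_t ta, tb;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    pthread_create(&ta, NULL, bump, (void *)x);
    pthread_create(&tb, NULL, bump, (void *)y);
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void) {
    printf("same line:      %.2f s\n", time_pair(&same_line.a, &same_line.b));
    printf("separate lines: %.2f s\n", time_pair(&padded.a, &padded.b));
    return 0;
}

On typical hardware the shared-line version is several times slower, even though the two threads never touch the same variable.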
You can find some specific numbers in the answers to this question, which covers both the "two threads on the same physical core" case and the "separate cores" case.
If you want more details about specific performance scenarios, you should probably ask a separate specific question that lays out the particular behavior you are interested in.
[1] The variations on MESI often introduce new states, such as the "owned" state in MOESI or the "forwarded" state in MESIF. The idea is usually to make certain transitions or usage patterns more efficient than the plain MESI protocol, but the basic idea is largely the same.

What makes RTOS behaviour predictable?

How can you ensure that interrupt latency will not exceed a certain value when there may be other variables and factors involved, like the hardware?
Hardware latency is predictable. It doesn't have to be constant, but it definitely is bounded - for example interrupt entry is usually 12 cycles, but sometimes it may take 15 cycles.
RTOS latency is predictable. It also is not constant, but, for example, you can be certain that the RTOS does not block interrupts for longer than 1000 cycles at any time. Usually it will block them for much shorter periods, but never longer than stated.
As long as your application doesn't do something strange (like a while (1); in the thread with the highest priority), the latency of the whole system will be the sum of the hardware latency and the RTOS latency.
The important fact here is that using a real-time operating system is not by itself enough to make your application real-time. In your application you have to ensure that the real-time constraints are not violated. The main job of the RTOS is NOT to get in the way of that, which means it must not introduce random or unpredictable delays.
Generally the most important of the "predictable" things in an RTOS is that the highest-priority thread that is not blocked is the one executing. Period. In a GPOS (like the one on your desktop computer, tablet or smartphone) this is not true, because the scheduler actively prevents low-priority threads from starving by letting them run for some time, even if more important things are ready to run right now. This makes the behaviour of the application unpredictable: one day it may react within 10us, while another day it may react within 10s, because the scheduler decided it was a great moment to save the logs to the hard drive or do some garbage collection.
Alternatively, you can think of it this way: for an RTOS the worst-case latency is in the range of microseconds, maybe single milliseconds; for a GPOS the maximum latency could easily be dozens of seconds.
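
On a POSIX system you can get a taste of this fixed-priority, run-until-blocked behaviour with the SCHED_FIFO policy (a minimal sketch, assuming Linux and sufficient privileges; a real RTOS has its own, stricter API):

#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

// Under SCHED_FIFO, the highest-priority runnable thread runs until it
// blocks or yields; the scheduler never throttles it to feed lower
// priorities the way a GPOS fair scheduler does.
static void *worker(void *arg) {
    (void)arg;
    puts("high-priority thread running");
    return NULL;
}

int main(void) {
    pthread_attr_t attr;
    struct sched_param param = { .sched_priority = 50 };

    pthread_attr_init(&attr);
    pthread_attr_setschedpolicy(&attr, SCHED_FIFO);
    pthread_attr_setschedparam(&attr, &param);
    pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);

    pthread_t t;
    int err = pthread_create(&t, &attr, worker, NULL);
    if (err != 0)
        fprintf(stderr, "pthread_create: %s (real-time priority usually needs privileges)\n", strerror(err));
    else
        pthread_join(t, NULL);
    pthread_attr_destroy(&attr);
    return 0;
}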

how resource integrity is maintained using Semaphores

I am new to computer science, and this may sound stupid to some of you. I have searched for related questions, but this scenario is stuck in my mind.
I understand that mutexes provide a lock for a section or resource. Consider the example of a buffer (an array of size 10, say) into which a thread puts some value. We lock the mutex before using the buffer and release it after; this whole process is done by the same thread.
Now, if I have to do this same process with semaphores, I can limit the number of threads that can enter the critical section. But how is the integrity of the buffer maintained
(when read and write operations on the buffer are handled by different threads)?
Semaphores are a higher-level abstraction: rather than granting exclusive ownership the way a mutex does, a semaphore counts how many threads may proceed at once. In a nutshell.
The usual solution with semaphores is to allow multiple simultaneous readers or one writer. See the Wikipedia article Readers-writers problem for examples.
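For the buffer scenario in the question, the classic pattern is the bounded buffer (producer-consumer): counting semaphores track free and used slots, and a mutex protects the buffer indices themselves. A minimal sketch with POSIX semaphores (names are illustrative):

#include <pthread.h>
#include <semaphore.h>

#define N 10

static int buffer[N];
static int in_pos, out_pos;
static sem_t empty_slots, full_slots;  // counting semaphores
static pthread_mutex_t buf_mutex = PTHREAD_MUTEX_INITIALIZER;

// Call once before starting any threads:
//   sem_init(&empty_slots, 0, N);  // N free slots
//   sem_init(&full_slots, 0, 0);   // no items yet

void buffer_put(int value) {
    sem_wait(&empty_slots);          // block while the buffer is full
    pthread_mutex_lock(&buf_mutex);  // integrity: one thread inside at a time
    buffer[in_pos] = value;
    in_pos = (in_pos + 1) % N;
    pthread_mutex_unlock(&buf_mutex);
    sem_post(&full_slots);           // wake a waiting consumer
}

int buffer_get(void) {
    sem_wait(&full_slots);           // block while the buffer is empty
    pthread_mutex_lock(&buf_mutex);
    int value = buffer[out_pos];
    out_pos = (out_pos + 1) % N;
    pthread_mutex_unlock(&buf_mutex);
    sem_post(&empty_slots);          // free a slot for producers
    return value;
}

The semaphores limit how many threads may proceed, while the mutex still guards the shared indices; the buffer's integrity comes from using both together.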

Context Switch questions: What part of the OS is involved in managing the Context Switch?

I was asked to answer these questions about the OS context switch; the question is pretty tricky and I cannot find any answer in my textbook:
How many PCBs exist in a system at a particular time?
What are two situations that could cause a Context Switch to occur? (I think they are an interrupt and the termination of a process, but I am not sure.)
Hardware support can make a difference in the amount of time it takes to do the switch. What are two different approaches?
What part of the OS is involved in managing the Context Switch?
1. There can be any number of PCBs in the system at a given moment. Each PCB is linked to a process.
2. Timer interrupts in preemptive kernels, or a process voluntarily yielding the processor in cooperative kernels. And, of course, process termination and blocking on I/O operations.
3. I don't know the answer here, but see Marko's answer.
4. One of the schedulers from the kernel.
3: A whole number of possible hardware optimisations
Small register sets (therefore less to save and restore on context switch)
'Dirty' flags for floating point/vector processor register set - allows the kernel to avoid saving the context if nothing has happened to it since it was switched in. FP/VP contexts are usually very large and a great many threads never use them. Some RTOSs provide an API to tell the kernel that a thread never uses FP/VP at all eliminating even more context restores and some saves - particularly when a thread handling an ISR pre-empts another, and then quickly completes, with the kernel immediately rescheduling the original thread.
Shadow register banks: Seen on small embedded CPUs with on-board single-cycle SRAM. CPU registers are memory-backed, so switching banks is merely a case of switching the base address of the registers. This is usually achieved in a few instructions and is very cheap. The number of contexts is usually severely limited on these systems.
Shadow interrupt registers: Shadow register banks for use in ISRs. An example is ARM CPUs, which have a shadow bank of about 6 or 7 registers for the fast interrupt handler and slightly fewer shadowed for the regular one.
Whilst not strictly a performance increase for context switching itself, this can help with the cost of context switching on the back of an ISR.
Physically rather than virtually mapped caches. A virtually mapped cache has to be flushed on context switch if the MMU is changed - which it will be in any multi-process environment with memory protection. However, a physically mapped cache means that virtual-physical address translation is a critical-path activity on load and store operations, and a lot of gates are expended on caching to improve performance. Virtually mapped caches were therefore a design choice on some CPUs designed for embedded systems.
The scheduler is the part of the operating system that manages context switching; it performs a context switch under one of the following conditions:
1. Multitasking
2. Interrupt handling
3. User and kernel mode switching
and each process has its own PCB.
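To make the PCB idea concrete, here is a toy sketch of what such a record might contain (field names are illustrative; real kernels store much more, such as credentials, open files and accounting data):

// A toy process control block: on a context switch the scheduler saves
// the outgoing process's CPU state here and restores the incoming one's.
typedef struct pcb {
    int pid;                       // process identifier
    enum { READY, RUNNING, BLOCKED } state;
    unsigned long registers[16];   // saved general-purpose registers
    void *stack_pointer;           // saved kernel stack pointer
    void *page_table_root;         // address space root (e.g. CR3 on x86)
    struct pcb *next;              // linkage for the scheduler's queues
} pcb_t;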

Number of threads with NSOperationQueueDefaultMaxConcurrentOperationCount

I'm looking for any concrete info related to the number of background threads an NSOperationQueue will create given the NSOperationQueueDefaultMaxConcurrentOperationCount maximum concurrency setting.
I had assumed that some sort of load monitoring is employed to determine the most appropriate number of threads to spawn, plus this setting is recommended in the docs. What I'm finding is that the queue spawns roughly 100 background threads and my app (running on iPad 3 with iOS 5.1.1) crashes with SIGABRT. I've reduced this to a more acceptable number like 3 and everything is working fine.
Any comments or insight would be appreciated.
My experience matches yours (though not to 100 threads; do put in some instrumentation to make sure that you really have that many running simultaneously, as I've never seen it go quite that high). Unless you manually manage the number of concurrent operations, NSOperationQueue will tend to generate too many concurrent operations. (I have yet to see anyone refute this with testable code rather than inferences from the documentation.) For anything that may generate a large number of potentially concurrent operations, I recommend setMaxConcurrentOperationCount:. While not ideal, I often wind up using a function like the one below to assist (it of course doesn't help you balance between queues, so it is very sub-optimal):
#include <stddef.h>
#include <sys/sysctl.h>

// Number of logical CPUs, queried via the Darwin/BSD "hw.ncpu" sysctl.
unsigned int countOfCores(void) {
    unsigned int ncpu;
    size_t len = sizeof(ncpu);
    if (sysctlbyname("hw.ncpu", &ncpu, &len, NULL, 0) != 0)
        return 1; // fall back to a single core if the query fails
    return ncpu;
}
I eagerly await anyone posting real code demonstrating NSOperationQueue automatically performing correct load balancing for CPU-bound operations. I've posted a sample gist demonstrating what I'm talking about. Without calling setMaxConcurrentOperationCount:, it will spawn about 6 parallel operations on a 2-core iPad 3. In this very simplistic case, with no contention or shared resources, that adds about 10%-15% overhead. In more complicated code with contention (and particularly if operations might be cancelled), it can slow things down by an order of magnitude.
Assuming your threads are busy working, 100 active threads in one process on a dual-core iPad is unreasonable. Each thread consumes a good amount of time and memory, and having that many busy threads is going to slow things down on a dual-core device.
Regardless of whether you're doing something silly (like sleeping them all, adding run loops, or just giving them nothing to do), this would be a bug.
From the documentation:
The default maximum number of operations is determined dynamically by the NSOperationQueue object based on current system conditions.
The iPad 3 has a powerful processor and 1 GB of RAM. Since NSOperationQueue calculates the number of threads based on system conditions, it is very likely that it decided it could run a large number of NSOperations given the power available on that device. The reason it crashed may have less to do with the number of threads running simultaneously than with the code being executed inside those threads. Check the backtrace and see if there is some condition or resource being shared among these threads.