Mutex unlock using add

While looking at various mutex unlock implementations, I found some that add 1 to the mutex variable instead of setting it to 1 directly. Are there any pros or cons to this?
Thanks

I wonder if you are referring to the difference between mutex and semaphore resource access control.
Edit
That's all about the CPU cycles needed for those two ops. To my knowledge, add uses fewer CPU cycles than mov, but then again it is very architecture-dependent and questionable. Also bear in mind that the compiler's choice of how to encode a higher-level language statement depends heavily on the surrounding instructions.

It's just important that whatever operation is used, it is atomic. It makes the most sense to me to do a set rather than an add, particularly if there's a test-and-set instruction or implementation.
I found this implementation of a TestAndSet function for the x86 architecture. It uses a set (mov) instruction, but it could also have used add or inc; that would require eax to be 0, with the xchg instruction providing the atomicity. I suppose that requiring eax to be zero could be a con.
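For illustration, here is a minimal sketch of both choices in GCC inline assembly for x86. The function names are mine, and the convention (1 = free, 0 = held) matches the question; treat it as a sketch rather than a reference implementation.

/* Acquire: atomically swap 0 ("held") into the mutex; xchg with a
 * memory operand is implicitly locked, so no lock prefix is needed. */
static inline int try_lock(volatile int *mutex)
{
    int old = 0;
    __asm__ volatile("xchgl %0, %1"
                     : "+r"(old), "+m"(*mutex)
                     : : "memory");
    return old;                /* 1 means the lock was free and is now ours */
}

/* Release by setting the mutex to 1 directly ... */
static inline void unlock_set(volatile int *mutex)
{
    __asm__ volatile("movl $1, %0" : "=m"(*mutex) : : "memory");
}

/* ... or by adding 1, which is only equivalent if the held value is
 * known to be 0; note the lock prefix that an atomic add requires. */
static inline void unlock_add(volatile int *mutex)
{
    __asm__ volatile("lock addl $1, %0" : "+m"(*mutex) : : "memory");
}

One visible difference: the atomic add variant is a locked read-modify-write, whereas the set variant is a plain store, which is one practical argument for the set.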

How to decide the registers to be preserved for OS task switching?

When task switch happens in an OS, how to decide which registers should be preserved?
Is this purely decided by the hardware architecture, or does it also involve the OS implementation?
I once did a naïve implementation on the ARM architecture that preserved all of the R1–R15 registers (if I remember correctly). But that seems like too much.
I also tried the x86 hardware task-switching support; the TSS covers a lot of registers, and it doesn't perform well either.
I guess the design philosophy of an OS, and especially the implementation of a task's state, should decide this. But I am not sure whether there are any best practices or conventions, or other factors.
When task switch happens in an OS, how to decide which registers should be preserved?
Normally most of a scheduler is written in a higher-level language (e.g. C), and the low-level task-switch code is written as a small assembly language function (and NOT inline assembly), because there's no sane way to predict what a compiler might do with the stack and local variables.
Because of this, which registers the low-level assembly function needs to save/restore depends on the ABI ("calling convention") the compiler is using. For example, the System V AMD64 ABI says the callee must preserve RBX, RSP, RBP, and R12 to R15 (and can trash RAX, RCX, RDX, and R8 to R11 if they aren't used for return values).
This does depend on the nature of the OS though. E.g. it's possible to design an OS where the kernel runs like a separate task and anything that causes a switch from user-space to kernel-space acts like a task switch and has to save everything before any higher level kernel code is executed.
There is a lot of theoretical wiggle room in which registers an OS chooses to preserve. For a "safe" implementation, an OS would save all registers that are accessible to a user and/or kernel thread. We typically think of the R0, R1, Rx, ... (ARM, MIPS, etc.) or RAX, RBX, ... (x86) registers as needing to be preserved. However, hardware floating-point and vector registers (x86 AVX) may also need to be preserved.
This is often where the implementation of the OS has wiggle room. One could simply play it safe and preserve all floating-point and vector registers. However, if a thread is not using these registers, saving off unused registers slows down context switching. Not to mention that families of processors may share the same core instructions and registers but have optional floating-point or vector extensions. Thus some operating systems record a flag in each thread indicating whether it uses floating-point or vector instructions, so the OS knows which additional registers to preserve, as sketched below.
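As a sketch of that flagging idea in C (x86 FXSAVE; thread_t and its fields are made-up names for illustration, not any particular OS's API):

#include <stdint.h>

typedef struct thread {
    uint8_t fpu_state[512] __attribute__((aligned(16)));  /* FXSAVE area */
    int     uses_fpu;   /* set the first time this thread touches FP/SSE */
    /* ... saved integer registers, kernel stack pointer, etc. ... */
} thread_t;

static void save_extended_state(thread_t *t)
{
    if (t->uses_fpu)    /* skip the expensive save for integer-only threads */
        __asm__ volatile("fxsave %0" : "=m"(t->fpu_state));
}

static void restore_extended_state(const thread_t *t)
{
    if (t->uses_fpu)
        __asm__ volatile("fxrstor %0" : : "m"(t->fpu_state));
}

A common refinement is to set uses_fpu lazily from the device-not-available (#NM) trap rather than at thread creation.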

CPUs with instructions with more than two branch destinations

Processors usually come with jump instructions that continue execution at a different fixed location, possibly depending on some condition. So the out-degree is at most two.
Are there any processors out there that have a single instruction that branches to one of three or more fixed locations?
There are a lot of reasons to assume / guess no, but I'm not familiar with enough ISAs to give a definite no. Especially if we include historical early computers from the 50s and 60s; they often have very odd stuff compared to modern systems.
Normally you just use an indirect branch (target address in a register or from memory, or looked up from a compressed table with ARM tbb) if you need anything other than taken vs. fall-through, so there's very little benefit to spending an opcode on a funky direct branch instruction with 2 non-fallthrough destinations.
Also, you'd need space in the instruction encoding for either 2 separate targets, or else some special rule like fall-through, PC + offset, PC + offset*2 (i.e. jump twice as far forwards or backwards). Using it would require laying out code with targets at specific offsets. You do sometimes make a table of fixed-size blocks of instructions and compute an offset into it (instead of looking up an address from a table of addresses), but having an instruction that forced you to do that sounds unlikely.
The condition itself could be a register being - / 0 / + as a 3-way condition, or FLAGS being less-than, equal, or greater-than. Or something else.
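For contrast, here is what the usual indirect-branch alternative looks like in GNU C (the computed-goto extension; sign3 is just an illustrative name). The compiler emits a single indirect jump whose three fixed targets live in a small table:

#include <stdio.h>

static int sign3(int x)
{
    static void *const targets[3] = { &&neg, &&zero, &&pos };
    goto *targets[(x > 0) - (x < 0) + 1];   /* maps - / 0 / + to index 0, 1, 2 */
neg:  return -1;
zero: return 0;
pos:  return 1;
}

int main(void)
{
    printf("%d %d %d\n", sign3(-5), sign3(0), sign3(7));
    return 0;
}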
So it sounds very unlikely, and a complication to branch-prediction (unless you just treat it as indirect, in which case why bother).
But I wouldn't be shocked if there's some combination of conditions that make it make sense on some ISA. Maybe if there's a special-case handler address in some special register, and the normal case involves taken or fall-through?
But if we allow one of the target addresses to come from a register or other internal state, any branch that can fault would count. Consider a hypothetical ISA with a compare-and-branch on memory, like Intel with macro-fused cmp [rdi], eax / jne rel32 which decodes to a single internal uop.
Then the possible targets are:
fall-through to RIP
taken to RIP+rel32
#PF fault to the page-fault handler address (loaded from the IDT in memory on x86-64).

From where is the code for dealing with critical section originated?

While learning about operating systems, I came across the topic of critical sections. To solve this problem, certain methods are provided, like semaphores, certain software solutions, and so on. But my question is: where does the code implementing these solutions originate? Programmers are never found writing such code in their programs. Suppose I write a simple program calling printf in C; I never write any code for the critical-section problem, yet the code is converted into low-level instructions and executed by the OS, which behaves as our obedient servant. So where does the code dealing with critical sections originate and fit in? Let a resource like the frame buffer be the shared resource in question.
The OS kernel supplies such inter-thread synchronization mechanisms: mutexes, semaphores, events, critical sections, condition variables, etc. It has to, because the kernel needs to block threads that cannot proceed. Many languages provide convenient wrappers around such calls.
Your app accesses them, directly or indirectly, via system calls, i.e. interrupts that enter kernel state and ask for such services.
In some cases, a short-term user-space spinlock may get plastered on top, but such code should defer to a system call if the spinner is not quickly satisfied.
In the case of C's printf, the relevant library (usually stdio) will make the calls to lock/unlock the I/O stream (assuming you have linked in a multithreaded version of the library).
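As a rough sketch of where that locking lives, assuming pthreads and an illustrative frame-buffer resource (on Linux the mutex bottoms out in a futex system call, which is what actually blocks the thread):

#include <pthread.h>

static pthread_mutex_t fb_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned int framebuffer[1024];

void draw_pixel(int i, unsigned int color)
{
    pthread_mutex_lock(&fb_lock);    /* may enter the kernel and block */
    framebuffer[i] = color;          /* the critical section */
    pthread_mutex_unlock(&fb_lock);  /* may wake a blocked thread */
}

The stdio lock around printf is the same pattern, just written for you inside the library.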

Looking for the best equivalents of prefetch instructions for ia32, ia64, amd64, and powerpc

I'm looking at some slightly confused code that's attempted a platform abstraction of prefetch instructions, using various compiler builtins. It appears to be based on powerpc semantics initially, with Read and Write prefetch variations using dcbt and dcbtst respectively (both of these passing TH=0 in the new optional stream opcode).
On ia64 platforms we've got for read:
__lfetch(__lfhint_nt1, pTouch)
whereas for write:
__lfetch_excl(__lfhint_nt1, pTouch)
This (read vs. write prefetching) appears to match the powerpc semantics fairly well (with the exception that ia64 allows for a temporal hint).
Somewhat curiously, the ia32/amd64 code in question is using
prefetchnta
not
prefetcht1
as it would if that code were consistent with the ia64 implementations (#ifdef variations of that exist in our code for our (still live) hpipf port and our now-dead Windows and Linux ia64 ports).
Since we are building with the Intel compiler, I should be able to make many of our ia32/amd64 platforms consistent by switching to the xmmintrin.h builtins:
_mm_prefetch( (char *)pTouch, _MM_HINT_NTA )
_mm_prefetch( (char *)pTouch, _MM_HINT_T1 )
... provided I can figure out what temporal hint should be used.
Questions:
Are there read vs. write ia32/amd64 prefetch instructions? I don't see any in the instruction set reference.
Would one of the nt1, nt2, nta temporal variations be preferred for read vs. write prefetching?
Any idea if there would have been a good reason to use the NTA temporal hint on ia32/amd64, yet T1 on ia64?
Are there read vs. write ia32/amd64 prefetch instructions? I don't see any in the instruction set reference.
Some systems support the prefetchw instruction for writes.
Would one of the nt1, nt2, nta temporal variations be preferred for read vs. write prefetching?
If the line is used exclusively by the calling thread, it shouldn't matter how you bring it in; both reads and writes would be able to use it. The benefit of prefetchw, mentioned above, is that it will bring the line and give you ownership of it, which may take a while if the line was also in use by another core. The hint level, on the other hand, is orthogonal to the MESI states and only affects how long the prefetched line survives. This matters if you prefetch long before the actual access and don't want the prefetch to get lost in that time, or alternatively, if you prefetch right before the access and don't want the prefetches to thrash your cache too much.
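As a hedged illustration of both knobs, assuming GCC or Clang: __builtin_prefetch(addr, rw, locality) expresses read vs. write prefetches and temporal hints portably, and the compiler lowers it to prefetcht0/t1/t2/nta (and prefetchw where supported):

void warm(const char *p, char *q)
{
    __builtin_prefetch(p, 0, 3);   /* read, high locality  -> roughly prefetcht0 */
    __builtin_prefetch(p, 0, 0);   /* read, non-temporal   -> roughly prefetchnta */
    __builtin_prefetch(q, 1, 3);   /* write -> prefetchw on CPUs that support it */
}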
Any idea if there would have been a good reason to use the NTA temporal hint on ia32/amd64, yet T1 on ia64?
Just speculating: perhaps the larger caches and aggressive memory bandwidth are more vulnerable to bad prefetching, and you'd want to reduce the impact with the non-temporal hint. Consider that if your prefetcher were suddenly set loose to fetch anything it could, you'd end up swamped in junk prefetches that would throw away lots of useful cache lines. The NTA hint makes those prefetches overrun each other, leaving the rest of the cache undamaged.
Of course this may also be just a bug; I can't tell for sure (only whoever developed the compiler can), but it might make sense for the reason above.
The best resource I could find on x86 prefetching hint types was the good ol' article What Every Programmer Should Know About Memory.
For the most part, on x86 there aren't different instructions for read and write prefetches. The exception seems to be the non-temporal (streaming) stores, where a write can bypass the cache, but as far as I can tell a read will always get cached.
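A small sketch of that asymmetry, assuming SSE2 intrinsics: streaming stores let a write bypass the cache, whereas reads, as noted above, end up cached:

#include <emmintrin.h>

void stream_fill(int *dst, int n, int v)
{
    for (int i = 0; i < n; i++)
        _mm_stream_si32(dst + i, v);   /* non-temporal store, bypasses the cache */
    _mm_sfence();                      /* order the streaming stores before later stores */
}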
It's going to be hard to backtrack through why the earlier code owners used one hint and not the other on a certain architecture. They could have been making assumptions about how much cache is available on processors in that family, typical working-set sizes for binaries there, long-term control-flow patterns, etc., and there's no telling how much any of those assumptions were backed up with good reasoning or data. From the limited background here, I think you'd be justified in taking the approach that makes the most sense for the platform you're developing on now, regardless of what was done on other platforms. This is especially true when you consider articles like this one, which is not the only context where I've heard that it's really, really hard to get any performance gain at all from software prefetches.
Are there any more details known up front, like typical cache miss ratios when using this code, or how much prefetches are expected to help?

Is it possible to avoid a wakeup-waiting race using only POSIX semaphores? Is it benign?

I'd like to use POSIX semaphores to manage atomic get and put operations on a file representing a queue. I want the flexibility of having something named in the filesystem, so that completely unrelated processes can share a queue. I think this plan rules out pthreads. Named POSIX semaphores are great for putting something in the filesystem that any process can see, but I can't find the standard CondWait primitive:
... decide we have to wait ....
CondWait(sem, cond);
When CondWait is called by a process it atomically posts to sem and waits on cond. When some other process posts to cond, the waiting process wakes up only if it can atomically decrement sem as well. The alternative of
... decide we have to wait ....
sem_post(sem);
sem_wait(cond);
sem_wait(sem);
is subject to a race condition in which some other process signals cond just before this process waits on it.
I hardly ever do any concurrent programming, so I thought I would ask SO: if I use a standard POSIX counting semaphore for the condition variable, is it possible that this race is benign?
Just in case anybody wants the larger context, I am building get and put operations for an atomic queue that can be called from shell scripts.
Since there are no other answers I will follow up with what I've learned:
Pthreads will not work with my application because I have processes without a common ancestor which need to share an atomic queue.
POSIX semaphores are subject to the wakeup-waiting race, but because, unlike classic condition variables, they are counting semaphores, the race is benign. I don't have a proof of this claim, but I have had a system running for two days now and working well. (Completely meaningless, I know, but at least it means I got the job done.)
Named Posix semaphores are difficult to garbage-collect from the filesystem.
To summarize, named Posix semaphores turned out to be a good basis for implementing an atomic queue abstraction to be shared among unrelated processes.
I would like to have a proof or a validated SPIN model, but as my need for the application is limited, it seems unlikely that I will write one. I hope this helps someone else who may want to use Posix semaphores.
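For concreteness, a sketch of the get/put idea (the names and the queue-file handling are illustrative). The intuition for "benign" is that a post to a counting semaphore is remembered in the count, so a wakeup that arrives before the corresponding wait is not lost the way a condition-variable signal can be:

#include <fcntl.h>
#include <semaphore.h>

static sem_t *items, *guard;

void queue_open(void)
{
    guard = sem_open("/q.guard", O_CREAT, 0600, 1);  /* binary: protects the file */
    items = sem_open("/q.items", O_CREAT, 0600, 0);  /* counts queued records */
}

void put(void)
{
    sem_wait(guard);
    /* ... append one record to the queue file ... */
    sem_post(guard);
    sem_post(items);   /* counted even if no reader is waiting yet */
}

void get(void)
{
    sem_wait(items);   /* an earlier post is not lost; the count persists */
    sem_wait(guard);
    /* ... remove one record from the queue file ... */
    sem_post(guard);
}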
According to the POSIX standard, the set of semaphore routines is:
sem_close()
sem_destroy()
sem_getvalue()
sem_init()
sem_open()
sem_post()
sem_timedwait()
sem_trywait()
sem_unlink()
sem_wait()
The sem_trywait() and sem_timedwait() functions might be what you are looking for.
I know this question is old, but the obvious solution would be to just use process-shared mutexes and condition variables located in a file you can mmap.
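A minimal sketch of that approach (POSIX, error handling omitted; the struct layout and names are illustrative). The key step is marking both the mutex and the condition variable PTHREAD_PROCESS_SHARED before placing them in the mapped file:

#include <fcntl.h>
#include <pthread.h>
#include <sys/mman.h>
#include <unistd.h>

typedef struct {
    pthread_mutex_t mtx;
    pthread_cond_t  cond;
    int             count;            /* queue state guarded by mtx */
} shared_sync;

shared_sync *sync_map(const char *path, int create)
{
    int fd = open(path, O_RDWR | (create ? O_CREAT : 0), 0600);
    ftruncate(fd, sizeof(shared_sync));
    shared_sync *s = mmap(NULL, sizeof *s, PROT_READ | PROT_WRITE,
                          MAP_SHARED, fd, 0);
    close(fd);
    if (create) {
        pthread_mutexattr_t ma;
        pthread_condattr_t  ca;
        pthread_mutexattr_init(&ma);
        pthread_mutexattr_setpshared(&ma, PTHREAD_PROCESS_SHARED);
        pthread_mutex_init(&s->mtx, &ma);
        pthread_condattr_init(&ca);
        pthread_condattr_setpshared(&ca, PTHREAD_PROCESS_SHARED);
        pthread_cond_init(&s->cond, &ca);
    }
    return s;
}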
You are looking for: pthread_cond_wait, pthread_cond_signal, I think.
That is, if you are using POSIX threads, the pthread methods would supply the functionality of CondWait and Signal.
Look here for source code on multiprocess pthreads via shared memory.
http://linux.die.net/man/3/pthread_mutexattr_init
That's for Linux, but the documents are posix. They're similar to Solaris, but you'll want to peruse the man pages on your OS.