Are atomic operations enough for a mutex?

Are atomic operations alone enough to implement a mutex on x86? I am asking this in relation to out-of-order execution. Apart from atomic access to the integer that records whether the mutex is locked, are there any additional actions that must happen, and why?

The answer depends upon what you mean by "just atomic operations". Properly aligned reads/writes on x86 are atomic, and both Dekker's and Peterson's algorithms for a mutex use only reads and writes. But neither algorithm works correctly without also using (possibly implicit) memory fences. The problem is that both algorithms assume a stronger memory consistency model than x86 has. Specifically, x86 allows a load that follows a store in program order to take effect before that store, provided the two accesses are not to the same address. See here for a detailed example.
If by "just atomic operations", you include LOCK-prefixed instructions and XCHG (which has an implicit LOCK prefix), then the answer is yes, since said instructions have an implicit memory fence. For example, an XCHG instruction can be used to perform the sequentially consistent store required by Dekker's and Peterson's algorithms, or used to implement the usual test-and-set approach.
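As a rough illustration of the test-and-set approach mentioned above, here is a minimal C++ sketch. The names SpinLock and run_spinlock_demo are made up for this example, and it assumes a C++11 compiler; on x86 the exchange call is typically compiled to an XCHG instruction, which carries the implicit fence.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical test-and-set spinlock (illustrative name, not from any library).
class SpinLock {
public:
    void lock() {
        // exchange() is an atomic read-modify-write: it stores true and
        // returns the previous value. We own the lock only if it was false.
        while (locked_.exchange(true, std::memory_order_acquire)) {
            // While someone else holds the lock, spin on a plain load so we
            // do not keep writing to the contended cache line.
            while (locked_.load(std::memory_order_relaxed)) {
                std::this_thread::yield();
            }
        }
    }

    void unlock() {
        // Release store: writes made inside the critical section become
        // visible before the lock is seen as free.
        locked_.store(false, std::memory_order_release);
    }

private:
    std::atomic<bool> locked_{false};
};

// Starts `threads` workers that each increment a plain (non-atomic) counter
// `iters` times while holding the lock, and returns the final count.
long run_spinlock_demo(int threads, int iters) {
    assert(threads > 0 && iters >= 0);
    SpinLock lock;
    long counter = 0;  // protected only by the lock
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < iters; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```

If the lock works, run_spinlock_demo(4, 10000) returns 40000, since no increment can be lost inside the critical section.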

What purpose does a queue serve in System Verilog?

They are not used for RTL but rather verification, correct? They would not be synthesizable.
Do they have better memory management features, in turn optimizing program time? If I recall correctly, System Verilog has an automatic garbage collector, so there is no need to deallocate memory.
The official IEEE documentation does a great job of explaining how they work. I am just wondering in what scenarios I would use one vs an array. One guess would be that they have associated methods that allow for easier data manipulation?
Thank you in advance for your knowledge and expertise.
A queue can be synthesisable if it has a bounded maximum size. Only a few synthesis tools support it, probably none of the FPGA synthesis tools.
The key advantage of a queue is the efficiency of adding or removing a single element, especially at the head or tail. A dynamic array may require reallocating and copying the entire array whenever its size changes. The penalty for a queue is the extra time it takes to access elements in the middle, and the extra space it uses compared with a dynamic array holding the same number of elements.
I hope that answers the question.

Atomicity of small PCIE TLP writes

Are there any guarantees about how card to host writes from a PCIe device targeting regular memory are implemented from a software process' perspective, where a single TLP write is fully contained within a single CPU cache-line?
I'm wondering about a case where my device may write some number of words of data followed by a byte to indicate that the structure is now valid (for example an event completion), for example:
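    // Completion record written by the device; one record per cache line.
    struct alignas(SYSTEM_CACHE_LINE_SIZE) PCIE_COMPLETION_T {
        uint64_t data_a;
        uint64_t data_b;
        uint64_t data_c;
        uint64_t data_d;
        uint8_t valid;  // set to 1 by the device once the data is written
    };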
Can I use a single TLP to write this structure, such that when software sees the valid member change to 1 (having been previously cleared to zero by software), the other data members will also reflect the values that I had written and not a previous value?
Currently I'm performing 2 writes, first writing the data and secondly marking it as valid, which doesn't have any apparent race conditions but does of course add unwanted overhead.
The most relevant question I can see on this site seems to be Are writes on the PCIe bus atomic? although this appears to relate to the relative ordering of TLPs.
Perusing the PCIe 3.0 specification, I didn't find anything that seemed to explicitly cover my concerns, and I don't think that I need AtomicOps in particular. Given that I'm only concerned about interactions with x86-64 systems, I also dug through the Intel architecture guide, but came away no clearer.
Instinctively it seems that it should be possible for such a write to be perceived atomically -- especially as it is said to be a transaction -- but equally I can't find much in the way of documentation explicitly confirming that view (nor am I quite sure what I'd need to look at, probably the CPU vendor?). I also wonder if such a scheme can be extended over multiple cachelines -- i.e. if the valid flag sits on a second cacheline written by the same TLP, can I be assured that the first will be perceived no later than the second?
The write may be broken into smaller units, as small as dwords, but if it is, they must be observed in increasing address order.
PCIe revision 4, section 2.4.3:
If a single write transaction containing multiple DWs and the Relaxed
Ordering bit Clear is accepted by a Completer, the observed ordering
of the updates to locations within the Completer's data buffer must be
in increasing address order. This semantic is required in case a PCI
or PCI-X Bridge along the path combines multiple write transactions
into the single one. However, the observed granularity of the updates
to the Completer's data buffer is outside the scope of this
specification.
While not required by this specification, it is
strongly recommended that host platforms guarantee that when a PCI
Express write updates host memory, the update granularity observed by
a host CPU will not be smaller than a DW.
As an example of update
ordering and granularity, if a Requester writes a QW to host memory,
in some cases a host CPU reading that QW from host memory could
observe the first DW updated and the second DW containing the old
value.
I don't have a copy of revision 3, but I suspect this language is in that revision as well. To help you find it, Section 2.4 is "Transaction Ordering" and section 2.4.3 is "Update Ordering and Granularity Provided by a Write Transaction".
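To make the consequence for the host side concrete, here is a hedged C++ sketch of how software might poll such a completion slot. It assumes a 64-byte cache line, the Relaxed Ordering bit clear, and a platform that follows the recommendation quoted above. The std::atomic wrappers, Payload, try_consume and simulate_device_write are all hypothetical stand-ins invented for this example; a real driver would read DMA memory through whatever mechanism its platform provides.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Host-side view of the completion slot. Field names follow the question;
// the atomic wrappers and the 64-byte line size are assumptions.
struct alignas(64) Completion {
    std::atomic<std::uint64_t> data_a{0};
    std::atomic<std::uint64_t> data_b{0};
    std::atomic<std::uint64_t> data_c{0};
    std::atomic<std::uint64_t> data_d{0};
    std::atomic<std::uint8_t>  valid{0};  // highest address, so updated last
};
static_assert(sizeof(Completion) == 64, "slot should occupy one cache line");

struct Payload {
    std::uint64_t a, b, c, d;
};

// Copies the data into `out` and clears the flag if the slot is valid.
// Returns false, leaving `out` untouched, when nothing has arrived yet.
bool try_consume(Completion& slot, Payload& out) {
    // Acquire load: the data reads below cannot be moved ahead of this check.
    if (slot.valid.load(std::memory_order_acquire) == 0) {
        return false;
    }
    out.a = slot.data_a.load(std::memory_order_relaxed);
    out.b = slot.data_b.load(std::memory_order_relaxed);
    out.c = slot.data_c.load(std::memory_order_relaxed);
    out.d = slot.data_d.load(std::memory_order_relaxed);
    // Hand the slot back so the next completion can be detected.
    slot.valid.store(0, std::memory_order_release);
    return true;
}

// Stand-in for the device: updates the slot in increasing address order,
// mirroring the ordering rule in section 2.4.3, with the flag written last.
void simulate_device_write(Completion& slot, const Payload& p) {
    assert(slot.valid.load(std::memory_order_acquire) == 0);  // slot must be free
    slot.data_a.store(p.a, std::memory_order_relaxed);
    slot.data_b.store(p.b, std::memory_order_relaxed);
    slot.data_c.store(p.c, std::memory_order_relaxed);
    slot.data_d.store(p.d, std::memory_order_relaxed);
    slot.valid.store(1, std::memory_order_release);
}
```

The point of the layout is that valid sits at the highest address in the record, so under the increasing-address-order rule it is the last location to be updated, and the reader checks it before touching any data field.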

atomic operations and atomic transactions

Can someone explain to me what the difference is between atomic operations and atomic transactions? It seems to me that these two are the same thing. Is that correct?
The concept of Atomicity is common between atomic transactions and atomic operations, but they are usually related to different domains.
Atomic Transactions are associated with Database operations where a set of actions must ALL complete or else NONE of them complete. For example, if someone is booking a flight, you want to both get payment AND reserve the seat OR do neither. If either one were allowed to succeed without the other also succeeding, the database would be inconsistent.
Atomic Operations on the other hand are usually associated with low-level programming with regards to multi-processing or multi-threading applications and are similar to Critical Sections.
For example, if two threads both access and modify the same variable, each thread goes through the following steps:
Read the variable from storage into local memory.
Modify the value in local memory.
Write the modified value back to the original storage location.
But in a multi-threaded system an interrupt or other context switch might happen after the first thread has read the value but before it has written it back. The second thread (or interrupt handler) will then read and modify the OLD value and write its modified value back to storage. When the first thread resumes, it doesn't know that anything has changed, so it writes back its own modification of the original value. Hence the change that the second thread made to the variable is lost.
If an operation is atomic, it is guaranteed to complete without being interrupted once it begins. This is usually accomplished using hardware-level primitives like Test-and-Set or Compare-and-Swap.
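The read-modify-write steps above can be made safe with a Compare-and-Swap retry loop. The following is a small C++ sketch under that assumption; cas_increment and run_cas_demo are names invented for this example, not part of any library.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Increments `value` by one using a compare-and-swap retry loop.
void cas_increment(std::atomic<long>& value) {
    // Step 1: read the current value into a local copy.
    long seen = value.load(std::memory_order_relaxed);
    // Steps 2 and 3: compute seen + 1 and write it back, but only if the
    // variable still holds `seen`. If another thread changed it in between,
    // compare_exchange_weak fails, reloads `seen`, and we try again.
    while (!value.compare_exchange_weak(seen, seen + 1)) {
        // retry with the refreshed `seen`
    }
}

// Starts `threads` workers that each call cas_increment `iters` times on a
// shared counter, and returns the final value.
long run_cas_demo(int threads, int iters) {
    assert(threads > 0 && iters >= 0);
    std::atomic<long> counter{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < iters; ++i) {
                cas_increment(counter);
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter.load();
}
```

With a plain long and ++counter in place of cas_increment, some updates could be lost exactly as described above; with the retry loop, run_cas_demo(4, 10000) returns 40000.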
To get a wider picture, you can take a look at:
MySQL Transactions and Atomic Operations
Atomicity (database systems)
Atomicity (Programming)
Some quotes from the above-cited resources:
About databases:
In an atomic transaction, a series of database operations either all
occur, or nothing occurs. A guarantee of atomicity prevents updates to
the database occurring only partially, which can cause greater
problems than rejecting the whole series outright. In other words,
atomicity means indivisibility and irreducibility.
About programming:
In concurrent programming, an operation (or set of operations) is
atomic, linearizable, indivisible or uninterruptible if it appears to
the rest of the system to occur instantaneously. Atomicity is a
guarantee of isolation from concurrent processes. Additionally, atomic
operations commonly have a succeed-or-fail definition — they either
successfully change the state of the system, or have no apparent
effect.
I have seen the word transaction used more often for databases and operation in programming, especially in kernel-level programming.
In a statement:
an atomic transaction is the smallest set of operations needed to perform the required steps.
Either all of those operations happen (successfully) or the atomic transaction fails.
An atomic operation usually has nothing in common with transactions. To my knowledge the term comes from hardware-level programming, where a set of operations (or a single one) appears to complete instantaneously.

why does memcached not support "multi set"

Can anyone explain why the memcached folks decided to support multi get but not multi set?
By multi I mean an operation involving more than one key (see protocol at http://code.google.com/p/memcached/wiki/NewCommands).
So you can get multiple keys in one shot (the basic advantage being the usual saving from fewer round trips), but why can't you do bulk sets?
My theory is that sets were expected to be less frequent and to be done individually (e.g. on a cache read and miss). But I still do not see how multi-set really conflicts with the general philosophy of memcached.
I looked at the client features at http://code.google.com/p/memcached/wiki/NewCommonFeatures and it seems that some clients potentially do support "Multi-Set" (why only in binary protocol?). I am using Java spy memcached, btw.
It's not supported in the text protocol because it'd be very, very complicated to express, no clients would support it, and it would provide very little that you can't already do from the text protocol.
It's supported in the binary protocol because it's a trivial use case of binary operations.
spymemcached supports it implicitly -- just do a bunch of sets and magic happens:
http://dustin.github.com/2009/09/23/spymemcached-optimizations.html
I don't know a lot about memcached internals, but I assume writes have to be blocking, atomic operations. I assume that by allowing multiple set operations to be batched, you could block all reads for a long time (or risk a get occurring while only half of a write had been applied). Forcing writes to be done individually allows them to be interleaved fairly with gets.
I would imagine that the restriction against using multi sets is to avoid collisions when writing cached values to the memcache.
As an object cache, I can't foresee an example of when you would need transactional type writes. This use case seems less suited for a caching layer, but better suited for the underlying database.
If sets come in interleaved from different clients, it is most likely the case that for one key, the last one wins, or is at least close enough, until the cache is invalidated and a newer value is written.
As Gian mentions, there don't seem to be any good reasons to block reads from the cache while several or many writes to the cache happen.

Garbage-collectors for multi-core llvm?

I've been looking at LLVM for quite some time as a new back-end for the language I'm currently implementing. It seems to have good performance, rather high-level generation APIs, and enough low-level support to implement exotic optimizations. In addition, and although I haven't checked it myself, Apple seems to have successfully demonstrated the use of LLVM for garbage-collected multi-core programs.
So far, so good. As I'm interested in both garbage-collection and multi-core, the next step would be to choose a multi-core-capable garbage-collector for LLVM. Which brings me to the question: what is available? I'm aware of Jon Harrop's HLVM work, but that's about it.
Note that I need cross-platform, so Apple's GC is probably not what I'm looking for (unless there's a cross-platform version). Also note that I have nothing against stop-the-world garbage-collectors.
Thanks in advance,
Yoric
LLVM docs say that it does not support multi-threaded collectors yet.
As the matrix indicates, LLVM's garbage collection infrastructure is already suitable for a wide variety of collectors, but does not currently extend to multithreaded programs. This will be added in the future as there is interest.
The docs do say that to do multi-threaded garbage collection you need to stop the world and that this is a non-portable thing:
Threaded
Denotes a multithreaded mutator; the collector must still stop the mutator ("stop the world") before beginning reachability analysis. Stopping a multithreaded mutator is a complicated problem. It generally requires highly platform specific code in the runtime, and the production of carefully designed machine code at safe points.
However, shared state between threads is a nasty scaling issue. If your language communicates solely through message passing between 'tasks', so that there is no shared state between worker threads, then could you use a per-thread collector for each per-thread heap?
The quotes that Will gave are about LLVM's intrinsic support for GC, where you augment LLVM with C++ code telling it how to walk the stack, interpret stack frames, inject read and write barriers and so on. The primary goal of my HLVM project is to become useful with minimal effort and risk so I chose to use the shadow stack for an "uncooperative environment" in order to avoid hacking on immature internals of LLVM. Consequently, those statements about LLVM's intrinsic support for GC do not apply to HLVM's garbage collector because it does not use that infrastructure at all. My results are extremely compelling: you can achieve excellent performance with minimal effort (serial performance and parallel performance).
I believe HLVM already runs out-of-the-box across Unixes, including Mac OS X, because it requires only POSIX threads. I strongly disagree with the claim that writing a stop-the-world GC is difficult: it took me 5 days to write a 100-line multicore garbage collector and I barely know anything about computers. I cannot believe it would be difficult to port to Windows either.