The use of locks and mutexes is illegal in hard real-time callbacks. Lock-free variables can be read and written from different threads. In C, the language definition may or may not be broken, but most compilers emit usable assembly code when a variable is declared volatile (the reader thread treats the variable as a hardware register and thus actually issues a load instruction before each use of the variable, which works well enough on most cache-coherent multiprocessor systems.)
Can this type of variable access be stated in Swift? Or do inline assembly language or data-cache flush/invalidate hints need to be added to the Swift language instead?
Added: Will calls to OSMemoryBarrier() (from OSAtomic.h) before and after each use or update of any potentially inter-thread variables (such as "lock-free" FIFO/buffer status counters, etc.) in Swift enforce sufficiently ordered memory load and store instructions (even on ARM processors)?
As you already mentioned, volatile only guarantees that the variable will not get cached into registers (it will itself get treated as a register). That alone does not make it lock-free for reads and writes. It doesn't even guarantee atomicity, at least not in a consistent, cross-platform way.
Why? Instruction pipelining and oversizing (e.g. using Float64 on a platform that has 32-bit, or smaller, floating-point registers) come to mind first.
That being said, have you considered using OSAtomic?
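For example, a minimal sketch of what that could look like from Swift, assuming the libkern OSAtomic calls are imported via Darwin (the variable and function names here are illustrative):

    import Darwin

    // Shared status counter for a "lock-free" FIFO; Int32 so the
    // 32-bit barrier variants of the OSAtomic calls apply.
    var fifoCount: Int32 = 0

    // Producer thread: increment with a full barrier on both sides.
    func producerDidEnqueue() {
        OSAtomicIncrement32Barrier(&fifoCount)
    }

    // Consumer thread: read behind an explicit barrier so earlier
    // stores from the producer are visible (even on ARM).
    func consumerSnapshot() -> Int32 {
        OSMemoryBarrier()
        return fifoCount
    }

The Barrier variants combine the atomic operation with OSMemoryBarrier-style ordering, which speaks to the "Added" question above.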
In COBOL, you may package a subroutine as a nested program or as a stand-alone module. I want to know the differences between the two approaches in terms of speed of execution and memory usage, and whether both methods are allowed in CICS. Any references would be great. The run environment is z/OS.
Thanks.
Both methods are allowed in CICS.
The difference in memory usage, if any, is going to be negligible. The compiler will generate reentrant code and thus your Working-Storage will be dynamically allocated on first execution per CICS transaction and your Local-Storage dynamically allocated per execution. The Language Environment memory allocation algorithm is designed to be speedy. Only one copy of your executable code will be present in the CICS region.
Packaging your subroutine as a nested program or statically linking your modules together at bind time avoids the overhead of the LOAD when the subroutine is called.
Packaging your subroutine as a nested program prevents it from being called by other programs unless you place the nested program in a copybook and use the COPY compiler directive to bring it into your program. This technique can lead to interesting issues: changes to the nested program copybook will probably require recompilation of all programs using the copybook in order to pick up the new version, though this depends on your source code management system. Static linking of the subroutine has similar issues.
If you package your subroutine as a separate module, you have the option of executing it via EXEC CICS LINK or COBOL dynamic CALL. The former causes the creation of a new Language Environment enclave, and thus the latter is more efficient, particularly on the second and subsequent CALLs and if you specify the Language Environment runtime option CBLPSHPOP(OFF).
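As a rough sketch (the program and data names are hypothetical), the two invocation styles look like this:

      *> Assumed declarations:
      *>   01  WS-PGM-NAME   PIC X(8).
      *>   01  WS-COMMAREA   PIC X(100).

      *> (1) EXEC CICS LINK: creates a new Language Environment enclave.
           EXEC CICS LINK PROGRAM('SUBPGM')
                          COMMAREA(WS-COMMAREA)
           END-EXEC

      *> (2) COBOL dynamic CALL: cheaper on the second and subsequent
      *>     calls, especially with CBLPSHPOP(OFF).
           MOVE 'SUBPGM' TO WS-PGM-NAME
           CALL WS-PGM-NAME USING WS-COMMAREA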
Much of the above was gleaned from SHARE presentations over the years.
Some tuning information is available in a SHARE presentation from 2002, S8213TR.PDF (the information is still valid). Note that there are many tuning opportunities among the Language Environment runtime options related to storage allocation. There exist a number of different mechanisms to set Language Environment options. Your CICS Systems Programmer likely has an opinion on the matter, and there may be shop standards regarding Language Environment runtime options.
Generally speaking, mainframe CICS COBOL application tuning has more to do with using efficient algorithms, variable definitions, compile options, and Language Environment runtime options than it does with application packaging.
In addition to the things mentioned by cschneid...
A contained program can reference items declared with the GLOBAL attribute in the Data Division of the containing program. The contained program does not need to declare the GLOBAL items in order to reference them.
Contained programs cannot be declared with the RECURSIVE attribute.
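A minimal sketch (names are illustrative) of a GLOBAL item shared with a nested program:

       IDENTIFICATION DIVISION.
       PROGRAM-ID. OUTERPGM.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
      *> GLOBAL makes this item visible to contained programs.
       01  WS-SHARED-COUNT  PIC 9(4) COMP GLOBAL.
       PROCEDURE DIVISION.
           MOVE 0 TO WS-SHARED-COUNT
           CALL 'INNERPGM'
           DISPLAY 'COUNT = ' WS-SHARED-COUNT
           GOBACK.

       IDENTIFICATION DIVISION.
       PROGRAM-ID. INNERPGM.
       PROCEDURE DIVISION.
      *> No local declaration needed; the GLOBAL item is in scope.
           ADD 1 TO WS-SHARED-COUNT
           GOBACK.
       END PROGRAM INNERPGM.
       END PROGRAM OUTERPGM.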
C programs can use global variables to share memory between functions executed in a parent and a child thread, but a Java program with several classes of objects doesn’t have such global variables. How can two threads share memory declared as variables in an object?
The practical answer to this depends upon the language in which you are working.
In theory, a process is an address space containing one or more threads, and a thread is a stream of execution within a process's address space.
Because all the threads in the process share the same address space they can access each other's variables with no restrictions (for good or bad). Absolutely everything is shared.
Some programming languages, such as Ada, have wonderful support for threads (tasks in Ada). Java has minimal support for threads. Classic C and C++ have no language support at all.
In a language like Ada, with real thread support, there are protected mechanisms for exchanging data among tasks. (But your Ada task could call an assembly language routine that can circumvent all that protection.) In C/C++ you create a task train wreck unless you explicitly plan to avoid one.
In Java, you can use static members (including static member functions) to simulate a global variable that you can unsafely access.
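A minimal sketch (the class and field names are illustrative) of two threads sharing such a static field:

    public class SharedState {
        // A static field acts like a C global: every thread in the JVM
        // process sees the same storage. volatile gives visibility only;
        // without it (or other synchronization) access is unsafe.
        static volatile int shared = 0;

        public static void main(String[] args) throws InterruptedException {
            Thread writer = new Thread(() -> shared = 42);
            writer.start();
            writer.join();              // join() also orders the write before the read
            System.out.println(shared); // prints 42
        }
    }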
From the «Learning Concurrent Programming in Scala» book:
In current versions of Scala (2.11.1), however, certain collections that are deemed immutable, such as List and Vector, cannot be shared without synchronization. Although their external API does not allow you to modify them, they contain non-final fields.
Could anyone demonstrate this with a small example? And does this still apply to 2.11.7?
The behavior of changes made in one thread when viewed from another is governed by the Java Memory Model. In particular, these rules are extremely weak when it comes to something like building a collection and then passing the built-and-now-immutable collection to another thread. The JMM does not guarantee that the other thread won't see an earlier view where the collection was not fully built!
Since synchronized blocks enforce an ordering, they can be used to get a consistent view if they're used on every single operation.
In practice, though, this is rarely actually necessary. On the CPU side, there is typically a memory barrier operation that can be used to enforce memory consistency (i.e. if you write the tail of your list and then pass a memory barrier, no other thread can see the tail un-set). And in practice, JVMs usually have to implement synchronized by using memory barriers. So one could hope that you could just pass the created list within a synchronized block, trusting that a memory barrier would be issued, and everything thereafter would be fine.
Unfortunately, the JMM doesn't require that it be implemented in this way (and you can't assume that the memory-barrier-like behavior of object creation will actually be a full memory barrier that applies to everything in that thread as opposed to simply the final fields of that object), which is both why the recommendation is what it is, and why it's not fixed (yet, anyway) in the library.
For what it's worth, on x86 architectures, I've never observed a problem if you hand off the immutable object within a synchronized block. I have observed problems if you try to do it with CAS (e.g. by using the java.util.concurrent.atomic classes).
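A minimal sketch (names are illustrative) of such a synchronized handoff:

    // Build a List in one thread and publish it under a shared lock; a
    // reader taking the same lock is guaranteed by the JMM to see the
    // fully built list (or the initial Nil, never a half-built one).
    object Handoff {
      private val lock = new Object
      private var shared: List[Int] = Nil

      def publish(): Unit = {
        val built = List(1, 2, 3)            // building outside the lock is fine...
        lock.synchronized { shared = built } // ...as long as the write is inside it
      }

      def read(): List[Int] = lock.synchronized { shared }
    }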
As an addition to the excellent answer from Rex Kerr:
it should be noted that most common use cases of immutable collections in a multithreading context are not affected by this problem. The only situation where this might affect you is when you do something that you probably should not do in the first place.
E.g. you have a variable var x: Vector[Int], which you write from one thread A and read from another thread B.
If you mark x with @volatile, there will be no problem, since the volatile write introduces a memory barrier. So you will never be able to observe the Vector in an inconsistent state. The same is true when using a synchronized { } block when writing and reading, or when using java.util.concurrent.atomic.AtomicReference.
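A minimal sketch of the safe variant (using an explicit Runnable for Scala 2.11 compatibility; names are illustrative):

    object VolatilePublish {
      @volatile private var x: Vector[Int] = Vector.empty

      def main(args: Array[String]): Unit = {
        val a = new Thread(new Runnable {
          def run(): Unit = { x = Vector(1, 2, 3) } // volatile write: full barrier
        })
        a.start()
        // Thread B (here: main) can never observe a half-built Vector,
        // because the volatile read synchronizes with the volatile write.
        while (x.isEmpty) Thread.`yield`()
        println(x.sum) // 6
      }
    }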
If you don't mark x with @volatile, you might observe the vector in an inconsistent state (not just wrong elements, but internally inconsistent!). But in that case your code is arguably broken to begin with. It is completely undefined when you will see the changes from A in B.
You might see them
immediately
after there is a memory barrier somewhere else in your program
not at all
depending on the architecture you're running on, the phase of the moon, whatever. So as Viktor Klang put it: "Unsafe publication is unsafe..."
Note that if you use a higher-level concurrency framework such as Akka actors, it is also guaranteed that receivers of messages cannot see immutable collections in an inconsistent state.
I need to update some code I used for the Aho-Corasick algorithm in order to implement the algorithm using the GPU. However, the code heavily relies on an object-oriented programming model. My question is: is it possible to pass objects to parallel_for_each? If not, is there any workable way around it that would exempt me from rewriting the entire code once again? My apologies if this seems like a naive question. C++ AMP is the first language I have used for GPU programming, hence my experience in this field is quite limited.
The answer to your question is yes, in that you can pass classes or structs to a lambda marked restrict(amp). Note that parallel_for_each itself is not AMP-restricted; its lambda is.
However, you are limited to using the types that are supported by the GPU. This is more a limitation of current GPU hardware than of C++ AMP.
A C++ AMP-compatible function or lambda can only use C++ AMP-compatible types, which include the following:
int
unsigned int
float
double
C-style arrays of int, unsigned int, float, or double
concurrency::array_view or references to concurrency::array
structs containing only C++ AMP-compatible types
This means that some data types are forbidden:
bool (can be used for local variables in the lambda)
char
short
long long
unsigned versions of the above
References and pointers (to a compatible type) may be used locally but cannot be captured by a lambda. Function pointers, pointer-to-pointer, and the like are not allowed; neither are static or global variables.
Classes must meet more rules if you wish to use instances of them. They must have no virtual functions or virtual inheritance. Constructors, destructors, and other nonvirtual functions are allowed. The member variables must all be of compatible types, which could of course include instances of other classes as long as those classes meet the same rules.
... From the C++ AMP book, Ch. 3.
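So, as a minimal sketch (the types and names are illustrative, not from your code), capturing a struct instance by value in the kernel lambda looks like this:

    #include <amp.h>
    #include <vector>
    using namespace concurrency;

    // POD struct containing only AMP-compatible types (float), with no
    // virtual functions, so instances can be captured by the lambda.
    struct affine { float scale; float offset; };

    int main() {
        std::vector<float> data(1024, 1.0f);
        array_view<float, 1> av(static_cast<int>(data.size()), data);
        affine t = { 2.0f, 0.5f };           // captured by value below

        parallel_for_each(av.extent, [=](index<1> idx) restrict(amp) {
            av[idx] = av[idx] * t.scale + t.offset;
        });
        av.synchronize();                    // copy results back to 'data'
        return 0;
    }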
So while you can do this, it may not be the best solution for performance reasons. CPU and GPU caches are somewhat different, which makes arrays of structs a better choice for CPU implementations, whereas GPUs often perform better if structs of arrays are used.
GPU hardware is designed to provide the best performance when all threads within a warp are accessing consecutive memory and performing the same operations on that data. Consequently, it should come as no surprise that GPU memory is designed to be most efficient when accessed in this way. In fact, load and store operations to the same transfer line by different threads in a warp are coalesced into as little as a single transaction. The size of a transfer line is hardware-dependent, but in general, your code does not have to account for this if you focus on making memory accesses as contiguous as possible.
... Ch. 7.
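A minimal sketch of the struct-of-arrays layout (names are illustrative): instead of one array of structs, each field gets its own contiguous buffer, so adjacent threads touch adjacent floats and the loads coalesce.

    #include <amp.h>
    #include <vector>
    using namespace concurrency;

    int main() {
        const int n = 1024;
        // One buffer per field rather than a single vector of structs.
        std::vector<float> pos(n, 0.0f), vel(n, 1.0f);
        array_view<float, 1> p(n, pos), v(n, vel);

        parallel_for_each(p.extent, [=](index<1> i) restrict(amp) {
            p[i] += v[i];   // contiguous access within the warp
        });
        p.synchronize();
        return 0;
    }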
If you take a look at the CPU and GPU implementations of my n-body example, you'll see implementations of both approaches for CPU and GPU.
The above does not mean that your algorithm will not run faster when you move the implementation to C++ AMP. It just means that you may be leaving some additional performance on the table. I would recommend doing the simplest port possible and then consider if you want to invest more time optimizing the code, possibly rewriting it to take better advantage of the GPU's architecture.
I want this statement (within the body of the if statement) to be atomic:
if(I2C1STATbits.P || cmd_buffer_ptr >= CMD_BUFFER_SIZE - 1)
cmd_buff_full = 1; // should be atomic
My processor (dsPIC33F) supports atomic bit set and clear. It also supports atomic writes for 16-bit registers and memory locations; these are single cycle. How can I be sure the operation will be implemented in an atomic fashion - is there a way to force the compiler to do this? In my case I'm fairly sure it will compile to be atomic, but I don't want it to change in future if I, for example, change some other code and it reorganises things, or if I update the compiler. For example, is there an atomic keyword?
I'm working with GCC v3.23 - more specifically, MPLAB C30, a modified closed-source version of GCC. I am working on a microcontroller which has only interrupts; there is no concept of threads. The only possible problem with atomicity is that an interrupt may be triggered in the middle of a write that spans two cycles, if that is even possible.
Depending on what other competing operations you want the assignment to be atomic with respect to, you could use sig_atomic_t. Strictly speaking, this protects it only in the presence of signals. In practice, it also provides atomicity with respect to multithreading.
Edit: if the objective is to guarantee that the store operation is not compiled into two assembler instructions, it will be necessary to use inline assembly - C makes no guarantees in that respect. If the objective is to prevent an interrupt from interfering with the store operation, an alternative is to disable interrupts before the store and enable them afterwards.
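A minimal sketch of the disable-interrupts approach (assuming a dsPIC33F target; the DISI cycle count should be checked against the compiler's actual output):

    #include <stdint.h>

    volatile uint16_t cmd_buff_full;  /* a 16-bit store is single-cycle on dsPIC33F */

    void set_buffer_full(void)
    {
        /* DISI holds off priority 1-6 interrupts for the next N+1 cycles,
           so the following store cannot be split by an interrupt even if
           the compiler emits more than one instruction for it. */
        __asm__ volatile ("disi #2");
        cmd_buff_full = 1;
    }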
Not in C, but perhaps there's a proprietary library call in the libraries that come with your processor. For example, on Windows there are InterlockedIncrement() and InterlockedDecrement() (to inc/dec longs), which are guaranteed to be atomic without a lock.