Object-oriented programming with C++ AMP

I need to update some code I used for the Aho-Corasick algorithm in order to implement the algorithm on the GPU. However, the code relies heavily on the object-oriented programming model. My question is: is it possible to pass objects to parallel_for_each? If not, is there any workaround that would spare me from rewriting the entire code? My apologies if this seems like a naive question. C++ AMP is the first language I've used for GPU programming, so my experience in this field is quite limited.

The answer to your question is yes, in that you can pass classes or structs to a lambda marked restrict(amp). Note that parallel_for_each itself is not AMP-restricted; its lambda is.
However, you are limited to the types supported by the GPU. This is more a limitation of current GPU hardware than of C++ AMP.
A C++ AMP-compatible function or lambda can only use C++ AMP-compatible types, which include the following:
int
unsigned int
float
double
C-style arrays of int, unsigned int, float, or double
concurrency::array_view or references to concurrency::array
structs containing only C++ AMP-compatible types
This means that some data types are forbidden:
bool (can be used for local variables in the lambda)
char
short
long long
unsigned versions of the above
References and pointers (to a compatible type) may be used locally but
cannot be captured by a lambda. Function pointers, pointer-to-pointer,
and the like are not allowed; neither are static or global variables.
Classes must meet more rules if you wish to use instances of them.
They must have no virtual functions or virtual inheritance.
Constructors, destructors, and other nonvirtual functions are allowed.
The member variables must all be of compatible types, which could of
course include instances of other classes as long as those classes
meet the same rules.
... From the C++ AMP book, Ch. 3.
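For illustration, a struct that follows those rules might look like this. The Particle and Renderable types and their members are made up for the sketch, and the AMP headers are omitted so it stays self-contained; in real AMP code the member function would also carry restrict(amp).

```cpp
#include <type_traits>

// A hypothetical type that follows the C++ AMP rules quoted above:
// no virtual functions, no virtual inheritance, and only AMP-compatible
// member types (int, unsigned int, float, double, or structs of these).
struct Particle {
    float x, y, z;      // position
    float vx, vy, vz;   // velocity
    int   id;

    // Non-virtual member functions are allowed.
    float speed_squared() const {
        return vx * vx + vy * vy + vz * vz;
    }
};

// By contrast, a class with virtual functions would be rejected.
struct Renderable {
    virtual void draw() = 0;  // virtual => NOT AMP-compatible
};

static_assert(!std::is_polymorphic<Particle>::value,
              "Particle has no vtable, as AMP requires");
static_assert(std::is_polymorphic<Renderable>::value,
              "Renderable has a vtable and could not be captured");
```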
So while you can do this, it may not be the best solution for performance reasons. CPU and GPU caches are somewhat different. This makes arrays of structs a better choice for CPU implementations, whereas GPUs often perform better if structs of arrays are used.
GPU hardware is designed to provide the best performance when all
threads within a warp are accessing consecutive memory and
performing the same operations on that data. Consequently, it should
come as no surprise that GPU memory is designed to be most efficient
when accessed in this way. In fact, load and store operations to the
same transfer line by different threads in a warp are coalesced into
as little as a single transaction. The size of a transfer line is
hardware-dependent, but in general, your code does not have to account
for this if you focus on making memory accesses as contiguous as
possible.
... Ch. 7.
If you take a look at the CPU and GPU implementations of my n-body example you'll see implementations of both approaches for CPU and GPU.
The above does not mean that your algorithm will not run faster when you move the implementation to C++ AMP. It just means that you may be leaving some additional performance on the table. I would recommend doing the simplest port possible and then consider if you want to invest more time optimizing the code, possibly rewriting it to take better advantage of the GPU's architecture.
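For concreteness, the array-of-structs versus struct-of-arrays distinction looks roughly like this in plain C++ (a layout sketch only, with no AMP code; the type and function names are made up):

```cpp
#include <vector>

// Array of structs (AoS): natural for OO code on the CPU; each body's
// fields sit together, which suits CPU cache lines.
struct BodyAoS { float x, y, m; };

// Struct of arrays (SoA): each field is contiguous across all bodies,
// so consecutive GPU threads reading m[i], m[i+1], ... get coalesced loads.
struct BodiesSoA {
    std::vector<float> x, y, m;
};

// Same computation over both layouts; only the access pattern differs.
float total_mass_aos(const std::vector<BodyAoS>& bodies) {
    float sum = 0.0f;
    for (const auto& b : bodies) sum += b.m;  // strided access to m
    return sum;
}

float total_mass_soa(const BodiesSoA& bodies) {
    float sum = 0.0f;
    for (float m : bodies.m) sum += m;        // contiguous access to m
    return sum;
}
```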

Related

In multi-stage compilation, should we use a standard serialisation method to ship objects through stages?

This question is formulated in Scala 3/Dotty but should generalise to any language NOT in the MetaML family.
The Scala 3 macro tutorial:
https://docs.scala-lang.org/scala3/reference/metaprogramming/macros.html
starts with the Phase Consistency Principle, which explicitly states that free variables defined in one compilation stage CANNOT be used by the next stage, because their bound objects cannot be persisted to a different compiler process:
... Hence, the result of the program will need to persist the program state itself as one of its parts. We don’t want to do this, hence this situation should be made illegal
This should be a solved problem, given that many distributed computing frameworks demand a similar capability to persist objects across multiple computers. The most common kind of solution (as seen in Apache Spark) uses standard serialisation/pickling to create snapshots of the bound objects (Java standard serialization, Twitter Kryo/Chill), which can be saved to disk/off-heap memory or sent over the network.
The tutorial itself also suggests this possibility twice:
One difference is that MetaML does not have an equivalent of the PCP - quoted code in MetaML can access variables in its immediately enclosing environment, with some restrictions and caveats since such accesses involve serialization. However, this does not constitute a fundamental gain in expressiveness.
In the end, ToExpr resembles very much a serialization framework
Instead, both Scala 2 and Scala 3 (and their respective ecosystems) largely ignore these out-of-the-box solutions, and only provide default methods for primitive types (Liftable in Scala 2, ToExpr in Scala 3). In addition, existing libraries that use macros rely heavily on manually defined quasiquotes/quotes for this trivial task, making sources much longer and harder to maintain, while not making anything faster (JVM object serialisation is a highly optimised language component).
What's the cause of this status quo? How do we improve it?

HIS-Metric "calling"

I do not understand the reason for this metric/rule:
A function should not be called from more than 5 different functions.
All calls within the same function are counted as 1. The rule is
limited to translation unit scope.
This seems completely counterintuitive to me, because it contradicts code reuse and the approach of splitting code into frequently used functions instead of duplicating code.
Can someone explain the rationale?
The first thing to say is that Metric-based quality approaches are by their nature a little subjective and approximate. There are no absolutes in following a metric approach to delivering good quality code.
There are two factors to consider in software complexity. One is the internal complexity, expressed by decision complexity within each function (best exemplified by the Cyclomatic Complexity measure) and dependency complexity between functions within the container (Translation Unit or Class). The other is interface complexity, measuring the level of dependency, including cyclic ones, between collaborating and hierarchical components or classes. In the C/C++ world, this is across multiple TUs. In Structure101 terms, the internal form of complexity is called “Fat” and the external form called “Tangles”.
Back to your question: this Hersteller Initiative Software ‘CALLING’ metric targets internal complexity (Fat). Their argument appears to be that if you have more than 5 points of reference to a single function, there may be too much implementation logic in that C++ class or C implementation file, and it is therefore perhaps time to break it into separate modules or components. It seems like a peculiarly narrow view of software design and structure, and the list of exceptions may be as long as the list of areas where such a judgement might apply.

Is there a possibility to create a memory-efficient sequence of bits in the JVM?

I've got a piece of code that takes into account a given number of features, where each feature is Boolean. I'm looking for the most efficient way to store such a feature set. My initial thought was to store them as a BitSet. But then I realized that this implementation is meant for storing numbers in bit format rather than manipulating individual bits, which is something I'd like to do (to see the effect of switching any feature on and off). I then thought of using a Boolean array, but apparently the JVM uses much more memory for each Boolean element than the one bit it actually needs.
I'm therefore left with the question: What is the most efficient way to store a set of bits that I'd like to treat as independent bits rather than the building blocks of some number?
Please refer to this question: boolean[] vs. BitSet: Which is more efficient?
According to Peter Lawrey's answer there, boolean[] (not Boolean[]) is the way to go, since its values can be manipulated directly and it takes only one byte of memory per element. Note that there is no way for a JVM application to store one bit in just one bit of memory and still manipulate it directly (array-style), because addressing requires a pointer and the smallest addressable unit is a byte.
The site you referenced already states that Scala's mutable BitSet is the same as java.util.BitSet; there is nothing you can do in Java that you can't do in Scala. But since you are using Scala, you probably want a safe implementation, perhaps even a thread-safe one. Mutable data types are not suitable for that, so I would simply use an immutable BitSet and accept the memory cost.
However, BitSets have their limits (deriving from the maximum value of an int, since bits are addressed by int indices). If you need larger data sizes, you may use LongBitSets, which are basically a Map<Long, BitSet>. If you need even more space, you may nest them in another map, Map<Long, LongBitSet>, but then you need two or more identifiers (longs).

Lock-free shared variable in Swift? (functioning volatile)

The use of locks and mutexes is illegal in hard real-time callbacks. Lock-free variables can be read and written from different threads. In C, the language definition may or may not be broken, but most compilers emit usable assembly when a variable is declared volatile (the reader thread treats the variable as a hardware register and thus actually issues load instructions before each use, which works well enough on most cache-coherent multiprocessor systems).
Can this type of variable access be stated in Swift? Or does in-line assembly language or data cache flush/invalidate hints need to be added to the Swift language instead?
Added: Will the use of calls to OSMemoryBarrier() (from OSAtomic.h) before and after and each use or update of any potentially inter-thread variables (such as "lock-free" fifo/buffer status counters, etc.) in Swift enforce sufficiently ordered memory load and store instructions (even on ARM processors)?
As you already mentioned, volatile only guarantees that the variable will not be cached in a register (each access goes back to memory). That alone does not make reads and writes lock-free, and it does not even guarantee atomicity, at least not in a consistent, cross-platform way.
Why? Instruction pipelining and oversized types (e.g. using a Float64 on a platform whose floating-point registers are 32 bits or narrower) come to mind first.
That being said, have you considered using OSAtomic?
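For reference, in C and C++ the portable way to get what volatile is often (mis)used for is an atomic with explicit memory ordering. A minimal sketch of a lock-free shared counter (the names write_count, producer, and snapshot are illustrative):

```cpp
#include <atomic>
#include <thread>

// Unlike `volatile`, std::atomic guarantees atomicity and lets you state
// the memory ordering explicitly, which is what a lock-free FIFO's
// status counters actually need.
std::atomic<int> write_count{0};

void producer(int n) {
    for (int i = 0; i < n; ++i) {
        // release: writes made before this store become visible to a
        // reader that observes the new value with an acquire load.
        write_count.fetch_add(1, std::memory_order_release);
    }
}

int snapshot() {
    // acquire pairs with the release in producer()
    return write_count.load(std::memory_order_acquire);
}
```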

Disadvantages of Immutable objects

I know that immutable objects offer several advantages over mutable ones: they are easier to reason about, they do not have complex state spaces that change over time, we can pass them around freely, they make safe hash table keys, and so on. So my question is: what are the disadvantages of immutable objects?
Quoting from Effective Java:
The only real disadvantage of immutable classes is that they require a
separate object for each distinct value. Creating these objects can be
costly, especially if they are large. For example, suppose that you
have a million-bit BigInteger and you want to change its low-order
bit:
BigInteger moby = ...;
moby = moby.flipBit(0);
The flipBit method
creates a new BigInteger instance, also a million bits long, that
differs from the original in only one bit. The operation requires time
and space proportional to the size of the BigInteger. Contrast this to
java.util.BitSet. Like BigInteger, BitSet represents an arbitrarily
long sequence of bits, but unlike BigInteger, BitSet is mutable. The
BitSet class provides a method that allows you to change the state of
a single bit of a million-bit instance in constant time.
Read the full discussion in Item 15: Minimize mutability.
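The copy-versus-mutate cost in that quote is not Java-specific. A small C++ sketch of the same contrast, using std::bitset in place of BigInteger/BitSet (the function names are made up):

```cpp
#include <bitset>
#include <cstddef>

constexpr std::size_t N = 1'000'000;

// Immutable style: returns a new value and leaves the input untouched.
// The copy costs time and space proportional to N, which is exactly the
// BigInteger.flipBit situation described above.
std::bitset<N> flip_bit_immutable(const std::bitset<N>& bits, std::size_t i) {
    std::bitset<N> copy = bits;  // O(N) copy
    copy.flip(i);
    return copy;
}

// Mutable style: flips a single bit of a million-bit value in O(1),
// like java.util.BitSet.
void flip_bit_in_place(std::bitset<N>& bits, std::size_t i) {
    bits.flip(i);
}
```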
Apart from possible performance drawbacks (possible! because with the complexity of GC and HotSpot optimisations, immutable structures are not necessarily slower) - one drawback can be that state must now be threaded through your whole application. For simple applications or tiny scripts the effort to maintain state this way might be too high to buy you concurrency safety.
For example, think of a GUI framework like Swing. It would definitely be possible to write a GUI framework entirely using immutable structures and one main "unsafe" outer loop, and I guess this has been done in Haskell. Some of the problems of maintaining nested immutable state can be addressed with lenses, for example. But managing all the interactions (registering listeners etc.) may get quite involved, so you might instead want to introduce new abstractions such as functional-reactive or hybrid-reactive GUIs.
Basically you lose some of OO's encapsulation by going all immutable, and when this becomes a problem there are alternative approaches such as actors or STM.
I work with Scala on a daily basis. Immutability has certain key advantages as we all know. However sometimes it's just plain easier to allow mutable content in some situations. Here's a contrived example:
var counter = 0
something.map { e =>
  ...
  counter += 1
}
Of course I could just have the map return a tuple with the payload and count, or use collection.size if available. But in this case the mutable counter is arguably clearer. In general I prefer immutability but also allow myself to make exceptions.
To answer this question I would quote Programming in Scala, Second Edition, chapter "Next Steps in Scala", item 11, by Lex Spoon, Bill Venners and Martin Odersky:
The Scala perspective, however, is that val and var are just two different tools in your toolbox, both useful, neither inherently evil. Scala encourages you to lean towards vals, but ultimately reach for the best tool given the job at hand.
So I would say that, just as with programming languages, val and var solve different problems: there is no "disadvantage/advantage" without context; there is just a problem to solve, and val and var address it differently.
Hope it helps, even if it does not provide a concrete list of pros and cons!