Efficient cast of integer array to integer domain in Chapel - distributed-computing

I noticed that this works in Chapel: I can turn an integer array into a set by casting it to domain(int):
var x: [1..4] int = (4,5,6,6);
var d = x : domain(int);
writeln(d);
> {4,6,5}
This is extremely helpful, but I'm wondering if there are instances when it will fail, like in a distributed context.
Another feature I'm relying on is the de-duplication that happens when values are collected into a domain:
var y = {11,13,15,15};
writeln(y);
> {15,11, 13}
Is there a more efficient way to do this, or is this a preferred method? I was not able to time it since I don't have access to a large enough cluster at the moment.

I'm wondering if there are instances when it will fail, like in a distributed context.
It shouldn't. Well-behaved Chapel programs like this should be functionally equivalent in shared- and distributed-memory execution contexts. Of course, performance is likely to differ (for better or worse, depending on how your data and computation are distributed).
Is there a more efficient way to do this or is this a preferred method?
I doubt that Chapel defines a preferred method, but this should be O(n) (linear in the number of elements), so I think it should be reasonable, asymptotically speaking. I don't have any direct experience trying to optimize this idiom in Chapel or other languages, so I'm not aware of a faster alternative.

Related

Monitoring runtime use of concrete collections

Background:
Our Scala software consists of various components, developed by different teams, that pass Scala collections back and forth. The APIs usually use abstract collections such as Seq[T] and Set[T], and developers are currently essentially free to choose any implementation they like: e.g. when creating new instances, some go with List() or Vector(), others with Seq.empty.
Problem:
Different implementations have different performance characteristics, e.g. List might have been a good choice locally (for one component) because the collection is only sequentially iterated over or modified at the head, but it could have been a poor choice globally, because another component performs loads of random accesses.
Question:
Are there any tools (ideally Scala-specific, but JVM-general would also be OK) that can monitor runtime use of collections and record the information necessary to detect and report undesirable access/usage patterns?
My feeling is that runtime monitoring would be more fruitful than static analyses (including simple linting) because (i) statically detecting usage patterns in hot code is virtually impossible, and (ii) static analysis would most likely miss collections that are created internally, e.g. when performing complex filter/map/fold/etc. operations on immutable collections.
Edits/Clarifications:
Changing the interfaces to enforce specific types such as List isn't an option; it would also not prevent purely internal use of "wrong" collections/usage patterns.
The goal is identifying a globally optimal (over many runs of the software) collection type rather than locally optimising for each applied algorithm.
You don't need linting for this, let alone runtime monitoring. This is exactly what having a strictly-typed language does for you out of the box. If you want to ensure a particular collection type is passed to the API, just declare that the API accepts that collection type (e.g., def foo(x: Stream[Bar]), not def foo(x: Seq[Bar]), etc.).
Alternatively, when practical, just convert to the desired type as part of implementation: def foo(x: List[Bar]) = { val y = x.toArray ; lotsOfRandomAccess(y); }
Collections that are "internally created" are typically the same type as the parent object: List(1,2,3).map(_ + 1) returns a List etc.
Again, if you want to ensure you are using a particular type, just say so:
val mapped: List[Int] = List(1,2,3).map(_ + 1)
You can actually change the type this way if there is a need for that:
val mappedStream: Stream[Int] = List(1,2,3).map(_ + 1)(breakOut)
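For reference, here is a self-contained version of those two snippets as a sketch. It assumes Scala 2.12 or earlier, where breakOut is still available (it was removed in 2.13, where an explicit conversion plays the same role):

import scala.collection.breakOut

val mapped: List[Int] = List(1, 2, 3).map(_ + 1)                   // stays a List
val mappedStream: Stream[Int] = List(1, 2, 3).map(_ + 1)(breakOut) // built directly as a Stream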
As discussed in the comments, this is a problem that needs to be solved at a local level rather than via global optimisation.
Each algorithm in the system will work best with a particular data type, so using a single global structure will never be optimal. Instead, each algorithm should ensure that the incoming data is in a format that can be processed efficiently. If it is not in the right format, the data should be converted to a better format as the first part of the process. Since the algorithm works better on the right format, this conversion is always a performance improvement.
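To make the "convert at the boundary" idea concrete, here is a minimal Scala sketch; the name lotsOfRandomAccess and the work done on the middle element are purely illustrative:

def lotsOfRandomAccess(xs: Seq[Int]): Int = {
  // toIndexedSeq is close to a no-op if xs already supports efficient
  // indexing, and a one-off O(n) copy otherwise; either way the algorithm
  // below runs against a structure it can index cheaply.
  val v: IndexedSeq[Int] = xs.toIndexedSeq
  v(v.length / 2) * 2
}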
The output data format is more of a problem if the system does not know which algorithm will be used next. The solution is to use the most efficient output format for the algorithm in question, and rely on other algorithms to re-format the data if required.
If you do want to monitor the whole system, it would be better to track the algorithms rather than the collections. If you monitor which algorithms are called and in which order you can create multiple traces through the code. You can then play back those traces with different algorithms and data structures to see which is the most efficient configuration.
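If you do go down that route, a very small, hypothetical sketch of recording which algorithms run and in what order might look like this (AlgoTrace and the algorithm names are made up for illustration):

object AlgoTrace {
  private val calls = scala.collection.mutable.ArrayBuffer.empty[String]
  // Wrap an algorithm invocation, remember its name, and pass the result through.
  def record[A](name: String)(body: => A): A = {
    calls += name
    body
  }
  def trace: List[String] = calls.toList
}

// Usage: val sorted = AlgoTrace.record("sortUsers")(users.sorted)
// Later, AlgoTrace.trace yields e.g. List("sortUsers", "dedupeOrders", ...)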

Which one is more performant: using Swift's Array.contains function or checking with if (.. || .. || ..)?

In our Swift code we often write this when we want to check whether foo is one of a few particular values:
if [kConstantOne, kConstant2, kConstant3].contains(foo) {...}
I was wondering how this compared, performance wise, to a normal if statement where you compare the values with ||.
if kConstantOne == foo || kConstant2 == foo || kConstant3 == foo {...}
Does the compiler optimise this statement, or is an array effectively allocated, instantiated and looped over to check for equality?
It's an easier, slightly fancier way to write these simple if statements, but if there were a significant performance impact, which I seriously doubt, we should try to avoid it.
EDIT: A single isolated use of this would not have any impact, but I'm more interested in what happens when it is part of a larger algorithm. What happens when your code hits this statement, or other similar ones, a few thousand times, allocating and initialising an array and using the contains function for equality checking?
If you are concerned about the performance impact, the only way to explore this is to benchmark it in code that you believe would cause the performance impact. While we can do some reasoning about current compiler implementation details, there is no way to translate those into an answer about "significant performance impact", and in many cases your intuition will be wrong (map is very slightly faster than a simple for loop in most cases, which is counter-intuitive until you read the implementation of map; but it's a very tiny difference).
Write the code clearly to say what you mean. The first does that.
It is possible that the if statement is very slightly faster than the contains and allows some compiler optimizations (that may or may not actually occur) that contains does not. It definitely does not create a temporary array or anything like that. However, this is going to be nearly unmeasurable over such a tiny array either way. If this is part of an inner loop that is called a few tens of millions of times, I have some approaches I would explore to optimize it (which wouldn't look like either of these; I'd focus first on getting rid of the == if these aren't integers). If this is called fewer than a million times, then you're more likely to accidentally hurt performance than help it by micro-optimizing like this. You're definitely likely to hurt maintainability.

Is there a possibility to create a memory-efficient sequence of bits in the JVM?

I've got a piece of code that takes into account a given number of features, where each feature is Boolean. I'm looking for the most efficient way to store a set of such features. My initial thought was to try to store these as a BitSet. But then I realized that this implementation is meant to store numbers in bit format rather than to manipulate individual bits, which is something I'd like to do (to see the effect of switching any feature on and off). I then thought of using a Boolean array, but apparently the JVM uses much more memory for each Boolean element than the one bit it actually needs.
I'm therefore left with the question: What is the most efficient way to store a set of bits that I'd like to treat as independent bits rather than the building blocks of some number?
Please refer to this question: boolean[] vs. BitSet: Which is more efficient?
According to Peter Lawrey's answer, boolean[] (not Boolean[]) is the way to go, since its values can be manipulated individually and it uses only one byte of memory per element. Consider that there is no way for a JVM application to store one bit in only one bit of memory and still manipulate it directly (array-like), because each element needs its own address and the smallest addressable unit is a byte.
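As a minimal sketch (in Scala, since that is what this thread is using): Array[Boolean] gives you exactly this on the JVM, as it compiles to a boolean[], typically one byte per element on HotSpot, and every feature can be flipped independently. The 64-feature size is hypothetical.

val features = new Array[Boolean](64) // 64 hypothetical features, all off
features(3) = true                    // switch feature 3 on
features(3) = !features(3)            // toggle it back off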
The site you referenced already states that Scala's mutable BitSet is the same as java.util.BitSet. There is nothing you can do in Java that you can't do in Scala. But since you are using Scala, you probably want a safe implementation, possibly even one that can be used from multiple threads. Mutable datatypes are not suitable for that. Therefore, I would simply use an immutable BitSet and accept the memory cost.
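A minimal sketch of that approach, with hypothetical feature indices; every update returns a new set, which is what makes it safe to share:

import scala.collection.immutable.BitSet

val features = BitSet(0, 2, 5)    // features 0, 2 and 5 are on
val withSeven = features + 7      // switch feature 7 on (new BitSet, original untouched)
val withoutTwo = withSeven - 2    // switch feature 2 off (again a new BitSet)
val featureFiveOn = withoutTwo(5) // query a single feature: true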
However, BitSets have their limits (deriving from the maximum value of an Int, since bits are indexed by Int). If you need larger data sizes, you may use LongBitSets, which are basically a Map<Long, BitSet>. If you need even more space, you may nest them in another map, Map<Long, LongBitSet>, but in that case you need to use two or more identifiers (longs).

Disadvantages of Immutable objects

I know that immutable objects offer several advantages over mutable objects: they are easier to reason about, they do not have complex state spaces that change over time, we can pass them around freely, they make safe hash table keys, and so on. So my question is: what are the disadvantages of immutable objects?
Quoting from Effective Java:
The only real disadvantage of immutable classes is that they require a separate object for each distinct value. Creating these objects can be costly, especially if they are large. For example, suppose that you have a million-bit BigInteger and you want to change its low-order bit:
BigInteger moby = ...;
moby = moby.flipBit(0);
The flipBit method creates a new BigInteger instance, also a million bits long, that differs from the original in only one bit. The operation requires time and space proportional to the size of the BigInteger. Contrast this to java.util.BitSet. Like BigInteger, BitSet represents an arbitrarily long sequence of bits, but unlike BigInteger, BitSet is mutable. The BitSet class provides a method that allows you to change the state of a single bit of a million-bit instance in constant time.
Read the full discussion in Item 15: Minimize mutability.
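The same contrast can be sketched with Scala's bit sets, since Scala comes up in the other answers here; this is only an illustration, and the million-element size is for flavour rather than a measurement:

import scala.collection.{immutable, mutable}

val bits = immutable.BitSet.empty ++ (0 until 1000000) // roughly a million set bits
val flipped = bits - 0            // builds a whole new BitSet: time and space grow with its size
val inPlace = mutable.BitSet(0 until 1000000: _*)
inPlace -= 0                      // clears the bit in place: effectively constant time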
Apart from possible performance drawbacks (possible! because with the complexity of GC and HotSpot optimisations, immutable structures are not necessarily slower) - one drawback can be that state must now be threaded through your whole application. For simple applications or tiny scripts the effort to maintain state this way might be too high to buy you concurrency safety.
For example, think of a GUI framework like Swing. It would definitely be possible to write a GUI framework entirely using immutable structures and one main "unsafe" outer loop, and I guess this has been done in Haskell. Some of the problems of maintaining nested immutable state can be addressed, for example, with lenses. But managing all the interactions (registering listeners etc.) may get quite involved, so you might instead want to introduce new abstractions such as functional-reactive or hybrid-reactive GUIs.
Basically you lose some of OO's encapsulation by going all immutable, and when this becomes a problem there are alternative approaches such as actors or STM.
I work with Scala on a daily basis. Immutability has certain key advantages as we all know. However sometimes it's just plain easier to allow mutable content in some situations. Here's a contrived example:
var counter = 0
something.map { e =>
  ...
  counter += 1
}
Of course I could just have the map return a tuple with the payload and count, or use the collection's size if available. But in this case the mutable counter is arguably clearer. In general I prefer immutability, but I also allow myself exceptions.
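For completeness, here is one immutable way to get the same payload-plus-count, threading the counter through a fold instead of closing over a var; xs and the doubling step are stand-ins for something and the elided body above:

val xs = List(10, 20, 30)
val (results, counter) = xs.foldLeft((Vector.empty[Int], 0)) {
  case ((acc, n), e) => (acc :+ (e * 2), n + 1)
}
// results == Vector(20, 40, 60), counter == 3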
To answer this question I would quote Programming in Scala, Second Edition, chapter "Next Steps in Scala", item 11, by Lex Spoon, Bill Venners and Martin Odersky:
The Scala perspective, however, is that val and var are just two different tools in your toolbox, both useful, neither inherently evil. Scala encourages you to lean towards vals, but ultimately reach for the best tool given the job at hand.
So I would say that, just as with programming languages themselves, val and var solve different problems: there is no "disadvantage / advantage" without context; there is just a problem to solve, and val and var each address it differently.
Hope it helps, even if it does not provide a concrete list of pros and cons!

Object oriented programming with C++-AMP

I need to update some code I used for the Aho-Corasick algorithm in order to implement the algorithm on the GPU. However, the code relies heavily on an object-oriented programming model. My question is: is it possible to pass objects to parallel_for_each? If not, is there any workable way around it that would spare me from rewriting the entire code once again? My apologies if this seems like a naive question; C++ AMP is the first framework I've used for GPU programming, so my experience in this field is quite limited.
The answer to your question is yes, in that you can pass classes or structs to a lambda marked restrict(amp). Note that parallel_for_each itself is not AMP-restricted; only its lambda is.
However, you are limited to using the types that are supported by the GPU. This is more a limitation of current GPU hardware than of C++ AMP.
A C++ AMP-compatible function or lambda can only use C++ AMP-compatible types, which include the following:
int
unsigned int
float
double
C-style arrays of int, unsigned int, float, or double
concurrency::array_view or references to concurrency::array
structs containing only C++ AMP-compatible types
This means that some data types are forbidden:
bool (can be used for local variables in the lambda)
char
short
long long
unsigned versions of the above
References and pointers (to a compatible type) may be used locally but cannot be captured by a lambda. Function pointers, pointer-to-pointer, and the like are not allowed; neither are static or global variables.
Classes must meet more rules if you wish to use instances of them. They must have no virtual functions or virtual inheritance. Constructors, destructors, and other nonvirtual functions are allowed. The member variables must all be of compatible types, which could of course include instances of other classes as long as those classes meet the same rules.
... From the C++ AMP book, Ch. 3.
So while you can do this, it may not be the best solution for performance reasons. CPU and GPU caches are somewhat different. This makes arrays of structs a better choice for CPU implementations, whereas GPUs often perform better if structs of arrays are used.
GPU hardware is designed to provide the best performance when all threads within a warp are accessing consecutive memory and performing the same operations on that data. Consequently, it should come as no surprise that GPU memory is designed to be most efficient when accessed in this way. In fact, load and store operations to the same transfer line by different threads in a warp are coalesced into as little as a single transaction. The size of a transfer line is hardware-dependent, but in general, your code does not have to account for this if you focus on making memory accesses as contiguous as possible.
... Ch. 7.
If you take a look at the CPU and GPU implementations of my n-body example, you'll see both approaches in use.
The above does not mean that your algorithm will not run faster when you move the implementation to C++ AMP. It just means that you may be leaving some additional performance on the table. I would recommend doing the simplest port possible and then considering whether you want to invest more time optimizing the code, possibly rewriting it to take better advantage of the GPU's architecture.