Garbage-collectors for multi-core llvm? - multicore

I've been looking at LLVM for quite some time as a new back-end for the language I'm currently implementing. It seems to have good performance, rather high-level generation APIs, enough low-level support to optimize exotic optimizations. In addition, and although I haven't checked it myself, Apple seems to have successfully demonstrated the use of LLVM for garbage-collected multi-core programs.
So far, so good. As I'm interested in both garbage-collection and multi-core, the next step would be to choose a LLVM multi-core-able garbage-collector. Which brings me to the question: what is available? I'm aware of Jon Harrop's HLVM work, but that's about it.
Note that I need cross-platform, so Apple's GC is probably not what I'm looking for (unless there's a cross-platform version). Also note that I have nothing against stop-the-world garbage-collectors.
Thanks in advance,
Yoric

LLVM docs say that it does not support multi-threaded collectors yet.
As the matrix indicates, LLVM's
garbage collection infrastructure is
already suitable for a wide variety of
collectors, but does not currently
extend to multithreaded programs. This
will be added in the future as there
is interest.
The docs do say that to do multi-threaded garbage collection you need to stop the world and that this is a non-portable thing:
Threaded
Denotes a multithreaded mutator; the collector must still stop the
mutator ("stop the world") before
beginning reachability analysis.
Stopping a multithreaded mutator is a
complicated problem. It generally
requires highly platform specific code
in the runtime, and the production of
carefully designed machine code at
safe points.
However, shared state between threads is a nasty scaling issue. If your language communicates solely through message passing between 'tasks', and therefore there was no shared state between worker threads, then you could use a per-thread collector for the per-thread heap?

The quotes that Will gave are about LLVM's intrinsic support for GC, where you augment LLVM with C++ code telling it how to walk the stack, interpret stack frames, inject read and write barriers and so on. The primary goal of my HLVM project is to become useful with minimal effort and risk so I chose to use the shadow stack for an "uncooperative environment" in order to avoid hacking on immature internals of LLVM. Consequently, those statements about LLVM's intrinsic support for GC do not apply to HLVM's garbage collector because it does not use that infrastructure at all. My results are extremely compelling: you can achieve excellent performance with minimal effort (serial performance and parallel performance).
I believe HLVM already runs out-of-the-box across Unixs including Mac OS X because it requires only POSIX threads. I strongly disagree with the claim that writing a stop-the-world GC is difficult: it took me 5 days to write a 100-line multicore garbage collector and I barely know anything about computers. I cannot believe it would be difficult to port to Windows either.

Related

Do cats and scalaz create performance overhead on application?

I know it is totally a nonsense question but due to my illiteracy on programming skill this question came to my mind.
Cats and scalaz are used so that we can code in Scala similar to Haskell/in pure functional programming way. But for achieving this we need to add those libraries additionally with our projects. Eventually for using these we need to wrap our codes with their objects and functions. It is something adding extra codes and dependencies.
I don't know whether these create larger objects in memory.
These is making me think about. So my question: will I face any performance issue like more memory consumption if I use cats/scalaz ?
Or should I avoid these if my application needs performance?
Do cats and scalaz create performance overhead on application?
Absolutely.
The same way any line of code adds performance overhead.
So, if that is your concern, then don't write any code (well, actually the world may be simpler if we would have never tried all this).
Now, dick answer outside. The proper question you should be asking is: "Does the overhead of X library is harmful to my software?"; remember this applies to any library, actually to any code you write, to any algorithm you pick, etc.
And, in order to answer that question, we need some things before.
Define the SLAs the software you are writing must hold. Without those, any performance question / observation you made is pointless. It doesn't matter if something is faster / slower if you don't know if that is meaningful for you and your clients.
Once you have SLAs you need to perform stress tests to verify if your current version of the software satisfies those. Because, if your current code is performant enough, then you should worry about other things like maintainability, testing, adding more features, etc.
PS: Remember that those SLAs should not be raw numbers but be expressed in terms of percentiles, the same goes for the results of the tests.
When you found that you are falling your SLAs then you need to do proper benchmarking and debugging to identify the bottlenecks of your project. As you saw, caring about performance must be done on each line of code, but that is a lot of work that usually doesn't produce any relevant output. Thus, instead of evaluating the performance of everything, we find the bottlenecks first, those small pieces of code that have the biggest contributions to the overall performance of your software (remember the Pareto principle).
Remember that in this step, we have to be integral, network matters too. (and you will see this last one is usually the biggest slowdown; thus, usually you would rather search for architectural solutions like using Fibers instead of Threads rather than trying to optimize small functions. Also, sometimes the easier and cheaper solution is better infrastructure).
When you find the bottleneck, then you need to formulate some alternatives, implement those and not only benchmark them but do Statistical hypothesis testing to validate if the proposed changes are worth it or not. And, of course, validate if they were enough to satisfy the SLAs.
Thus, as you can see, performance is an art and a lot of work. So, unless you are committed to doing all this then stop worrying about something you will not measure and optimize properly.
Rather, focus on increasing the maintainability of your code. This actually also helps performance, because when you find that you need to change something you would be grateful that the code is as clean as possible and that the whole architecture of the code allows for an easy change.
And, believe me when I say that, using tools like cats, cats-effect, fs2, etc will help with that regard. Also, they actually pretty optimized on their core so you should be good for a lot of use cases.
Now, the big exception is that if you know that the work you are doing will be very CPU and memory bound then yeah, you pretty much can be sure all those abstractions will be harmful. In those cases, you may even want to stay away from the JVM and rather write pretty low-level code in a language like Rust which will provide you with proper tools for that kind of problem and still be way safer than plain old C.

Threading vs Forking (with explanation of what I want to do)

So, I've reviewed a ton of articles and forums before posting this, but I keep reading conflicting answers. Firstly, OS is not an issue, I can use either Windows or Unix, whatever would be best for my problem. I have a ton of data that I need to use for read-only purposes (not sure why this would matter, but, in case it does, the data structure that I'm going to have to go through is an array of arrays of arrays of hashes whose values are also arrays). I'm essentially comparing a "query" to a ton of different "sentences" and computing their relative similarities. From these quantities (several million), I want to take the top x% and do something with them. I need to parallelize this process. There's just no good way for me to decrease the space--I need to compare over everything to get good results and it will just take too long with some sort of threading/forking. Again, I've seen many conflicting answers and don't know which one to do.
Any help would be appreciated. Thanks in advance.
EDIT: I don't think the amount of memory usage will be an issue, but I don't know (8 GB RAM)
Without more detail on your problem, there's not much help that can be given. You want to parallelize a process. Threads and forks in Perl have advantages and disadvantages.
One of the key things that makes Perl threads different from other threads is that data is not shared by default. This makes threads much easier and safer to work with, you don't have to worry about thread safety of libraries or most of your code, just the threaded bit. However it can be a performance drag and memory hungry as Perl must put a copy of the interpreter and all loaded modules into each thread.
When it comes to forking I will only be talking about Unix. Perl emulates fork on Windows using threads, it works but it can be slow and buggy.
Forking Advantages
Very fast to create a fork
Very robust
Forking Disadvantages
Communicating between the processes can be slow and awkward
Thread Advantages
Thread coordination and data interchange is fairly easy
Threads are fairly easy to use
Thread Disadvantages
Each thread takes a lot of memory
Threads can be slow to start
Threads can be buggy (better the more recent your perl)
Database connections are not shared across threads
That last one is a bit of a doozy if the documentation is up to date. If you're going to be doing a lot of SQL, don't use threads.
In general, to get good performance out of Perl threads it's best to start a pool of threads and reuse them. Forks can more easily be created, used and discarded.
Really what it comes down to is what fits your way of thinking and your particular problem.
For either case, you're likely going to want something to manage your pool of workers. For forking you're going to want to use Parallel::ForkManager or Child. Child is particularly nice as it has built in inter-process communication.
For threads you're going to want to use threads::shared, Thread::Queue and read perlthrtut.
When reading articles about Perl threads, keep in mind they were a bit crap when they were introduced in 5.8.0 in 2002, and only serviceable by 5.10.1. After that they've firmed up considerably. Information and opinions about their efficiency and robustness tends to fall rapidly out of date.
Threading can be more difficult to get correct, but won't utilize as much memory.
Forking can be simpler to implement but use a significant amount of memory.
If you don't have experience with either, I would start by implemented a forking version & go from there.

Is Moose really this slow?

I recently downloaded Moose. Experimentally, I rewrote an existing module in Moose. It seems to be convenient way to avoid writing lots of repetitive code. I ran the tests of the module, and I noticed it was a bit delayed. I profiled the code with -d:DProf and it seems that just including the line
no Moose;
in the code increases the running time by about 0.25 seconds (on my computer). Is this typical? Am I doing something wrong, did I misinstall it, or should we really expect this much delay?
Yes, there's a bit of a penalty to using Moose. However, it's only a startup penalty, not at runtime; if you wrote everything properly, then things will be quite fast at runtime.
Did you also include this line:
__PACKAGE__->meta->make_immutable;
in all your classes when you no Moose;? Calling this method will make it (runtime) faster (at the expense of startup time). In particular, object construction and destruction are effectively "inlined" in your class, and no longer invoke the meta API. It is strongly recommended that you make your classes immutable. It makes your code much faster, with a small compile-time cost. This will be especially noticeable when creating many objects.1
2
However, sometimes this cost is still too much.
If you're using Moose inside a script, or in some other way where the compilation time is a significant fraction of your overall usage time, try doing s/Moose/Moo/g -- if you don't use MooseX modules, you can likely switch to Moo, whose goal is to be faster (at startup) while retaining 90% of the flexibility of Moose.
Since you are working with a web application, have you considered using Plack/PSGI?
1From the docs of make_immutable, in Moose::Cookbook::Basics::Recipe7
2See also Stevan Little's article: Why make_immutable is recommended for Moose classes
See Moose::Cookbook::FAQ:
I heard Moose is slow, is this true?
Again, this one is tricky, so Yes and No.
Firstly, nothing in life is free, and some Moose features do cost more than others. It is also the policy of Moose to only charge you for the features you use, and to do our absolute best to not place any extra burdens on the execution of your code for features you are not using. Of course using Moose itself does involve some overhead, but it is mostly compile time. At this point we do have some options available for getting the speed you need.
Currently we provide the option of making your classes immutable as a means of boosting speed. This will mean a slightly larger compile time cost, but the runtime speed increase (especially in object construction) is pretty significant. This can be done with the following code:
MyClass->meta->make_immutable();
We are regularly converting the hotspots of Class::MOP to XS. Florian Ragwitz and Yuval Kogman are currently working on a way to compile your accessors and instances directly into C, so that everyone can enjoy blazing fast OO.
On the other hand, I am working on a web application which using Dancer and Moose. Because the application is running as an HTTPD daemon, none of this is really relevant once the server is initialized. Performance seems more than adequate for my requirements on limited hardware or virtual servers.
Using Moose and Dancer for this project has had the added benefit that my small demo application shrank from about 5,000 lines to less than 1,000 lines.
How much stuff you want your app to depend on is one of those trade-offs you have to consider. CGI apps are made more responsive by limiting dependencies.
Your question is a little deceptive. Yes, Moose has a measurable startup cost, but isn't slow after that. If the startup cost is prohibitive, you can always daemonize your application.

Multi-Core Programming. Boost's MPI, OpenMP, TBB, or something else?

I am totally a novice in Multi-Core Programming, but I do know how to program C++.
Now, I am looking around for Multi-Core Programming library. I just want to give it a try, just for fun, and right now, I found 3 APIs, but I am not sure which one should I stick with. Right now, I see Boost's MPI, OpenMP and TBB.
For anyone who have experienced with any of these 3 API (or any other API), could you please tell me the difference between these? Are there any factor to consider, like AMD or Intel architecture?
As a starting point I'd suggest OpenMP. With this you can very simply do three basic types of parallelism: loops, sections, and tasks.
Parallel loops
These allow you to split loop iterations over multiple threads. For instance:
#pragma omp parallel for
for (int i=0; i<N; i++) {...}
If you were using two threads, then the first thread would perform the first half of the iteration. The second thread would perform the second half.
Sections
These allow you to statically partition the work over multiple threads. This is useful when there is obvious work that can be performed in parallel. However, it's not a very flexible approach.
#pragma omp parallel sections
{
#pragma omp section
{...}
#pragma omp section
{...}
}
Tasks
Tasks are the most flexible approach. These are created dynamically and their execution is performed asynchronously, either by the thread that created them, or by another thread.
#pragma omp task
{...}
Advantages
OpenMP has several things going for it.
Directive-based: the compiler does the work of creating and synchronizing the threads.
Incremental parallelism: you can focus on just the region of code that you need to parallelise.
One source base for serial and parallel code: The OpenMP directives are only recognized by the compiler when you run it with a flag (-fopenmp for gcc). So you can use the same source base to generate both serial and parallel code. This means you can turn off the flag to see if you get the same result from the serial version of the code or not. That way you can isolate parallelism errors from errors in the algorithm.
You can find the entire OpenMP spec at http://www.openmp.org/
Under the hood OpenMP is multi-threaded programming but at a higher level of abstraction than TBB and its ilk. The choice between the two, for parallel programming on a multi-core computer, is approximately the same as the choice between any higher and lower level software within the same domain: there is a trade off between expressivity and controllability.
Intel vs AMD is irrelevant I think.
And your choice ought to depend on what you are trying to achieve; for example, if you want to learn TBB then TBB is definitely the way to go. But if you want to parallelise an existing C++ program in easy steps, then OpenMP is probably a better first choice; TBB will still be around later for you to tackle. I'd probably steer clear of MPI at first unless I was certain that I would be transferring from shared-memory programming (which is mostly what you do on a multi-core) to distributed-memory programming (on clusters or networks). As ever , the technology you choose ought to depend on your requirements.
I'd suggest you to play with MapReduce for sometime. You can install several virtual machines instances on the same machine, each of which runs a Hadoop instance (Hadoop is a Yahoo! open source implementation of MapReduce). There are a lot of tutorials online for setting up Hadoop.
btw, MPI and OpenMP are not the same thing. OpenMP is for shared memory programming, which generally means, multi-core programming, not parallel programming on several machines.
Depends on your focus. If you are mainly interested in multi threaded programming go with TBB. If you are more interested in process level concurrency then MPI is the way to go.
Another interesting library is OpenCL. It basically allows you to use all your hardware (CPU, GPU, DSP, ...) in the best way.
It has some interesting features, like the possibility to create hundreds of threads without performance penalties.

Any Real-World Experience Using Software Transactional Memory? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 7 years ago.
Improve this question
It seems that there has been a recent rising interest in STM (software transactional memory) frameworks and language extensions. Clojure in particular has an excellent implementation which uses MVCC (multi-version concurrency control) rather than a rolling commit log. GHC Haskell also has an extremely elegant STM monad which also allows transaction composition. Finally, so as to toot my own horn just a bit, I've recently implemented an STM framework for Scala which statically enforces reference restrictions.
All of these are interesting experiments, but they seem to be confined to that sphere alone (experimentation). So my question is: have any of you seen or used STM in the real world? If so, why? What sort of benefits did it bring? What about performance? (there seems to be a great deal of conflicting information on this point) Would you use STM again or would you prefer to use some other concurrency abstraction like actors?
I participated in the hobbyist development of the BitTorrent client in Haskell (named conjure). It uses STM quite heavily to coordinate different threads (1 per peer + 1 for storage management + 1 for overall management).
Benefits: less locks, readable code.
Speed was not an issue, at least not due to STM usage.
Hope this helps
We use it pretty routinely for high concurrency apps at Galois (in Haskell). It works, its used widely in the Haskell world, and it doesn't deadlock (though of course you can have too much contention). Sometimes we rewrite things to use MVars, if we've got the design right -- as they're faster.
Just use it. It's no big deal. As far as I'm concerned, STM in Haskell is "solved". There's no further work to do. So we use it.
The article "Software Transactional Memory: Why is it Only a Research Toy?" (Călin Caşcaval et al., Communications of the ACM, Nov. 2008), fails to look at the Haskell implementation, which is a really big omission. The problem for STM, as the article points out, is that implementations must chose between either making all variable accesses transactional unless the compiler can prove them safe (which kills performance) or letting the programmer indicate which ones are to be transactional (which kills simplicity and reliability). However the Haskell implementation uses the purity of Haskell to avoid the need to make most variable uses transactional, while the type system provides a simple model together with effective enforcement for the transactional mutation operations. Thus a Haskell program can use STM for those variables that are truly shared between threads whilst guaranteeing that non-transactional memory use is kept safe.
We, factis research GmbH, are using Haskell STM with GHC in production. Our server receives a stream of messages about new and modified "objects" from a clincal "data server", it transforms this event stream on the fly (by generating new objects, modifying objects, aggregating things, etc) and calculates which of these new objects should be synchronized to connected iPads. It also receives form inputs from iPads which are processed, merged with the "main stream" and also synchronized to the other iPads. We're using STM for all channels and mutable data structures that need to be shared between threads. Threads are very lightweight in Haskell so we can have lots of them without impacting performance (at the moment 5 per iPad connection). Building a large application is always a challenge and there were many lessons to be learned but we never had any problems with STM. It always worked as you'd naively expect. We had to do some serious performance tuning but STM was never a problem. (80% of the time we were trying to reduce short-lived allocations and overall memory usage.)
STM is one area where Haskell and the GHC runtime really shines. It's not just an experiment and not for toy programs only.
We're building a different component of our clincal system in Scala and have been using Actors so far, but we're really missing STM. If anybody has experience of what it's like to use one of the Scala STM implementations in production I'd love to hear from you. :-)
We have implemented our entire system (in-memory database and runtime) on top of our own STM implementation in C. Prior to this, we had some log and lock based mechanism to deal with concurrency, but this was a pain to maintain. We are very happy with STM since we can treat every operation the same way. Almost all locks could be removed. We use STM now for almost anything at any size, we even have a memory manager implement on top.
The performance is fine but to speed things up we now developed a custom operating system in collaboration with ETH Zurich. The system natively supports transactional memory.
But there are some challenges caused by STM as well. Especially with larger transactions and hotspots that cause unnecessary transaction conflicts. If for example two transactions put an item into a linked list, an unnecessary conflict will occur that could have been avoided using a lock free data structure.
I'm currently using Akka in some PGAS systems research. Akka is a Scala library for developing scalable concurrent systems using Actors, STM, and built-in fault tolerance capabilities modeled after Erlang's "Let It Fail/Crash/Crater/ROFL" philosophy. Akka's STM implementation is supposedly built around a Scala port of Clojure's STM implementation. An overview of Akka's STM module can be found here.