I wrote a simulation of the ring network topology in Scala (Scala 2.8 RC7; source here) and Clojure (Clojure 1.1; source here) to compare Actors and Agents.
While the Scala version shows an almost constant message exchange rate as I increase the number of nodes in the network from 100 to 1,000,000, the Clojure version shows message rates that decrease as the number of nodes increases. Also, during a single run, the message rate in the Clojure version decreases as time passes.
So I am curious: how do Scala's Actors compare to Clojure's Agents? Are Agents inherently less concurrent than Actors, or is the code written inefficiently (autoboxing?)?
PS: I noticed that memory usage in the Scala version grows considerably with the number of nodes (> 500 MB for 1 million nodes), while the Clojure version uses much less memory (~100 MB for 1 million nodes).
Edit:
Both versions run on the same JVM with all JVM args and Actor/Agent configuration parameters left at their defaults. On my machine, the Scala version gives a message rate of around 5,000 messages/sec consistently from 100 to 1 million nodes, whereas the Clojure version starts at 60,000 messages/sec for 100 nodes and drops to 200 messages/sec for 1 million nodes.
Edit 2:
It turns out my Clojure version was written inefficiently. I changed the type of the nodes collection from a list to a vector and now it shows consistent behaviour: 100,000 messages/sec for 100 nodes and 80,000 messages/sec for 100,000 nodes. So Clojure Agents seem to be faster than Scala Actors. I have updated the linked sources too.
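For reference, a ring node in the Scala version looks roughly like the following. This is an illustrative sketch using scala.actors, not the linked source; the node count and hop count are arbitrary.

```scala
// Illustrative sketch only (not the linked source): a ring node built on
// the old scala.actors library, forwarding a decrementing hop counter.
import scala.actors.Actor
import scala.actors.Actor.loop

class Node(id: Int) extends Actor {
  var next: Node = _                        // wired up after all nodes exist
  def act(): Unit = loop {
    react {
      case 0         => exit()              // message has used up its hops
      case hops: Int => next ! (hops - 1)   // pass it along the ring
    }
  }
}

// build the ring, close the loop, start every node, then inject a message
val nodes = Vector.tabulate(1000)(new Node(_))
nodes.zip(nodes.tail :+ nodes.head).foreach { case (n, nxt) => n.next = nxt }
nodes.foreach(_.start())
nodes.head ! 1000000
```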
[Disclaimer: I'm on the Akka team]
A Clojure Agent is a different beast from a Scala actor, most notably in who controls the behavior. With Agents the behavior is defined outside and pushed to the Agent; with Actors the behavior is defined inside the Actor.
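To make the distinction concrete, here is a rough Scala sketch (illustrative only; the Agent class below is a hypothetical minimal stand-in, not Clojure's implementation): an actor fixes its behavior in its own receive block, whereas an agent only holds state and has functions pushed to it by callers.

```scala
// Actor style: the behavior lives *inside* the actor definition.
import akka.actor.Actor

class Counter extends Actor {
  var count = 0
  def receive = {
    case "inc" => count += 1          // behavior decided when the actor was written
    case "get" => sender() ! count
  }
}

// Agent style: the agent only serializes updates; callers *push* behavior to it.
// (Hypothetical minimal agent for illustration, not Clojure's implementation.)
import java.util.concurrent.Executors
import java.util.concurrent.atomic.AtomicReference

class Agent[A](init: A) {
  private val state    = new AtomicReference(init)
  private val executor = Executors.newSingleThreadExecutor()
  def send(f: A => A): Unit = executor.execute(() => state.set(f(state.get())))
  def get: A = state.get()
}

// val a = new Agent(0)
// a.send(_ + 1)   // the update function is supplied by the caller, not the agent
```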
Without knowing anything about your code I really cannot say much. Are you using the same JVM parameters, warming things up the same way, and sensible settings for Actors vs. sensible settings for Agents, or are they tuned separately?
As a side note:
Akka has an implementation of the ring bench located here: http://github.com/jboner/akka-bench/tree/master/ring/
It would be interesting to see how its results compare to your Clojure test on your machine.
Related
I realize the exact number depends on a whole lot of things, so I'm really looking for an order of magnitude on, say, a MacBook Pro.
Is it 100s of 1000s? Millions? More?
For example, I've calculated I can run about 1M goroutines on this machine, and I'm trying to get a sense of whether ZIO fibers would be about the same or more…
The primary resource consumption from a fiber is going to be the heap memory it consumes, plus (arguably) the memory consumed by the closure capturing its state. JVMs (and even different GC algorithms within a JVM) differ in how many bytes a given object takes up in memory, and this can even depend on runtime settings (e.g. if the heap is 32 GiB or smaller, object references can be encoded in 32 bits, while a larger heap requires more space for each object reference), so there is no single exact number.
On "typical" JVMs, the memory overhead of fibers is in the low hundreds of bytes. This is also approximately the overhead of an Akka actor (which can, like a goroutine, a ZIO fiber, a Cats Effect fiber, or a Scala future, be considered a means of modeling a process in a more efficient way than a thread (this ignores the substantial philosophical differences in the particulars of the respective models)), and it's well-established that substantially greater-than a million actors can be created per GiB of heap, so it's reasonable to expect that multiple millions of fibers can be created per GiB of heap.
It should be noted that no more fibers can be consuming CPU at any point in time than you have cores/threads, so if you have far more fibers/goroutines/actors ready to consume CPU than cores, you may see a substantial latency effect from fibers waiting to be scheduled (so-called "thread starvation").
When building our current project, the GWT compile takes a large share of the overall time (currently ~25 min overall, two thirds of it GWT compile). We researched how to optimize that (e.g. here),
but in the end we decided to buy a new build server. GWT compiling is quite a CPU-intensive task, so we did some tests to analyze the improvement per core:
1 cores = 197s
2 cores = 165s
3 cores = 149s
4 cores = 157s (possibly the last core was busy with other tasks)
Judging from those numbers, it seems that adding more cores doesn't necessarily improve performance, since the numbers appear to flatten out.
1.)
So now I would be interested whether any of you can confirm or disprove that: do 8 or 12 cores not necessarily make a difference, while the individual CPU speed (MHz) does?
2.)
After seeing some benchmarks, our sales department tends to buy Intel Xeon - any experience with AMD? (I am more of an AMD guy, but currently it seems hard to disregard the benchmarks.)
3.) Any other suggestions regarding memory, IO etc are welcome
Update: When we get the new server I'll post the updated numbers...
We are now using an AMD FX-8350 (@ 4.00 GHz) with a Samsung 830 Pro SSD, and we've set localWorkers=4 as well as -Xmx2048m. Previously we used an Intel Xeon E5-2609 (@ 2.40 GHz). The switch reduced compilation time from ~440s down to ~310s.
So we also found that raw CPU speed matters most for a single compilation process (with localWorkers=4). When multiple compilation processes run at the same time on the machine, an SSD reduces the IO wait time, which grows with the number of concurrent compilation processes.
Our current hardware supports up to 4 Maven builds at the same time (each one with localWorkers=4) and then uses up to 20 GB of RAM. Build time increases with the number of concurrent builds, but not linearly, so we try to reduce the idle time in the phases where a single Maven process doesn't use all resources (Java class compilation, tests, ...).
When we compared hardware prices, we decided to buy a consumer PC to use as a slave in our Jenkins build farm. The overall price is much lower than server hardware, and the machine can easily be replaced with a new one in case of a hardware failure.
I am taking a course on distributed systems and we have to make our project using Scala. Our instructor told us that Scala is good in the sense that it uses multiple cores to do the computation and uses parallelism to solve problems while being integrated with the actor model.
This is a theoretical question. I have learned some basics about the actor model using Akka, and my question is: while programming, does the user have to tell the compiler how to place the various actors on multiple cores, or does Scala take care of that and use multiple cores for the various actors?
In a nutshell: when we declare multiple actors using the Akka libraries in Scala, does the Scala compiler automatically use the multi-core CPU power to distribute the actors among cores, or does the programmer have to provide some input to do this?
TL;DR: With the default configuration in Akka you need do nothing to get pretty good parallelism for most use cases.
Longer Answer: Actors in Akka run on a Dispatcher, and that Dispatcher has an ExecutorService, which is typically a pool of Threads. The number of Threads is configured by the developer, but by default it is 3 times the number of CPU cores on the machine (see default-dispatcher.parallelism-factor here in the reference configuration).
At any point in time each CPU core can be running an Actor using one of these threads, so provided the number of threads in your Dispatcher's ExecutorService is at least equal to the number of cores on your CPU, you will be able to take advantage of all your cores. The reason this is set to three times the number of cores in the default configuration is to compensate for blocking IO.
IO is slow, and blocking calls hog threads while you are doing IO rather than using the CPU. So the key to getting the best level of parallelism is configuring this thread pool:
If you are doing only non-blocking IO, you can set it to the number of CPU cores you have and feel confident you are taking full advantage of your CPU.
The more blocking IO you do, the more threads you will need to keep getting good parallelism, but be warned - the more Threads you use, the more memory you will use and Threads are not the most lightweight things in the world.
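As a sketch of where that knob lives (Akka classic; the key names follow the reference configuration, but double-check them against your Akka version), you can override the default dispatcher's pool size when creating the ActorSystem:

```scala
import akka.actor.ActorSystem
import com.typesafe.config.ConfigFactory

// Override the default dispatcher's thread pool before creating the system.
val tuning = ConfigFactory.parseString(
  """
  akka.actor.default-dispatcher {
    fork-join-executor {
      # pool size = cores * parallelism-factor, clamped to [min, max]
      parallelism-factor = 1.0   # CPU-bound, non-blocking workloads
      parallelism-min    = 2
      parallelism-max    = 64    # raise this if you must make blocking calls
    }
  }
  """)

val system = ActorSystem("tuned", tuning.withFallback(ConfigFactory.load()))
```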
theon's answer is pretty good, but I would just like to point out that actors are not the only way to achieve parallelism in Scala. If you do not need to manage state, Futures are generally a simpler way to perform computation in parallel. You just wrap each snippet of code that can run independently of others in a call to the Future factory function, and you can then compose/transform the results of each snippet (also in parallel) using calls to map, flatMap, fold, etc., or with for comprehensions. All you need to configure is an ExecutionContext as an implicit val, and if you are already using Akka, you can use the same one that your actors use, or you can use the preconfigured global default.
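For example, a minimal sketch (expensiveComputation is a placeholder for your own work):

```scala
// Two independent computations started in parallel and then combined.
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global // preconfigured global pool

def expensiveComputation(x: Int): Int = x * x   // stand-in for real work

val partA = Future(expensiveComputation(1))     // runs on a pool thread
val partB = Future(expensiveComputation(2))     // runs concurrently with partA

// compose the results; the combining step is also asynchronous
val combined: Future[Int] = for {
  a <- partA
  b <- partB
} yield a + b

println(Await.result(combined, 10.seconds))     // block only at the very end
```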
My question concerns the extent to which a JVM application can exploit the NUMA layout of a host.
I have an Akka application in which actors concurrently process requests by combining incoming data with 'common' data already loaded into an immutable (Scala) object. The application scales well in the cloud, using many dual-core VMs, but performs poorly on a single 64-core machine. I presume this is because the common data object resides in one NUMA cell, and many threads concurrently accessing it from other cells is too much for the interconnects.
If I run 64 separate JVM applications, each containing 1 actor, then performance is good again. A more moderate approach might be to run as many JVM applications as there are NUMA cells (8 in my case), giving the host OS a chance to keep the threads and memory together?
But is there a smarter way to achieve the same effect within a single JVM? E.g. if I replaced my common data object with several instances of a case class, would the JVM have the capability to place them on the optimal NUMA cell?
Update:
I'm using Oracle JDK 1.7.0_05, and Akka 2.1.4
I've now tried the UseNUMA and UseParallelGC JVM options. Neither seemed to have any significant impact on the slow performance when using one or a few JVMs. I've also tried using a PinnedDispatcher and the thread-pool-executor, with no effect. I'm not sure the configuration is having an effect, though, since there seems to be nothing different in the startup logs.
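For what it's worth, the way I'm wiring up the pinned dispatcher looks roughly like the sketch below (Worker stands in for my actor class), in case someone spots a reason the configuration might not be taking effect:

```scala
// Each actor created with this dispatcher gets its own dedicated thread.
import akka.actor.{Actor, ActorSystem, Props}
import com.typesafe.config.ConfigFactory

class Worker extends Actor {
  def receive = { case msg => /* combine msg with the shared immutable data */ () }
}

val config = ConfigFactory.parseString(
  """
  pinned-dispatcher {
    type     = PinnedDispatcher
    executor = "thread-pool-executor"
  }
  """)

val system = ActorSystem("numa-test", config.withFallback(ConfigFactory.load()))
val worker = system.actorOf(Props(new Worker).withDispatcher("pinned-dispatcher"))
```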
The biggest improvement remains when I use a single JVM per worker (~50). However, the problem with this appears to be that there is a long delay (up to a couple of minutes) before the FailureDetector registers the successful exchange of 'first heartbeat' between Akka cluster JVMs. I suspect there is some other issue here that I've not yet uncovered. I already had to increase ulimit -u since I was hitting the default maximum number of processes (1024).
Just to clarify, I'm not trying to achieve large numbers of messages, just trying to have lots of separate actors concurrently access an immutable object.
I think that if you are sure the problem is not in your message-processing algorithms, then you should take into account not only the NUMA option but the whole environment configuration, starting with the JVM version (the latest is better, and Oracle JDK also mostly performs better than OpenJDK), then the JVM options (including GC, memory and concurrency options, etc.), then the Scala and Akka versions (the latest release candidates and milestones can be much better), and also the Akka configuration.
From here you can borrow all the settings that matter to get 50M messages per second of total throughput for Akka actors on contemporary laptops.
I've never had a chance to run these benchmarks on a 64-core server, so any feedback will be greatly appreciated.
One finding that may help: current implementations of ForkJoinPool increase message-send latency as the number of threads in the pool grows. This is most noticeable when the rate of request-response calls between actors is high; e.g. on my laptop, increasing the pool size from 4 to 64 grows the message-send latency of Akka actors by up to 2-3x for most executor services (Scala's ForkJoinPool, the JDK's ForkJoinPool, ThreadPoolExecutor).
You can check if there are any differences by running mvnAll.sh with the benchmark.parallelism system variable set to different values.
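If you want a quick sanity check outside the benchmark suite, a tiny (unscientific) probe like the one below times request-response round trips; running it with different pool sizes configured should make the latency growth visible. The names here are mine, not from the benchmark project.

```scala
// Minimal request-response latency probe. Run it twice with different
// dispatcher pool sizes configured and compare the averages.
import akka.actor.{Actor, ActorSystem, Props}
import akka.pattern.ask
import akka.util.Timeout
import scala.concurrent.Await
import scala.concurrent.duration._

class Echo extends Actor { def receive = { case m => sender() ! m } }

object LatencyProbe extends App {
  implicit val timeout: Timeout = Timeout(5.seconds)
  val system = ActorSystem("probe")          // pool size comes from your config
  val echo   = system.actorOf(Props(new Echo))
  val n      = 100000

  val start = System.nanoTime()
  (1 to n).foreach(_ => Await.result(echo ? "ping", 5.seconds))
  println(s"avg round trip: ${(System.nanoTime() - start) / n} ns")
  system.terminate()
}
```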
I recently successfully experimented with Scala futures. I'm pleased as punch with the gains I'm seeing from the parallelism, but I'm only seeing 4 worker threads.
I've been looking all over for how I can crank up the number of threads to 11, but no luck. How can I do this?
Check out the source at http://www.scala-lang.org/api/current/index.html#scala.actors.scheduler.ResizableThreadPoolScheduler
There are some system properties: actors.corePoolSize and actors.maxPoolSize. It looks like you can adjust those, and the scheduler used for daemon-like things (including futures) will get a larger pool. By default it picks up the number of processors, I guess.
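Something like this sketch should do it (assuming the old scala.actors-based futures the question is using; verify the property names against your Scala version), as long as the properties are set before any futures are created. Passing them as -D JVM flags works too.

```scala
// Raise the scheduler's pool bounds before the first future is created.
System.setProperty("actors.corePoolSize", "11")
System.setProperty("actors.maxPoolSize", "11")

import scala.actors.Futures._

def work(i: Int): Int = i * i               // stand-in for the real computation

val tasks   = (1 to 11).map(i => future(work(i)))
val results = tasks.map(_.apply())          // applying a future blocks for its result
println(results.sum)
```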