NUMA awareness of the JVM - Scala

My question concerns the extent to which a JVM application can exploit the NUMA layout of a host.
I have an Akka application in which actors concurrently process requests by combining incoming data with 'common' data already loaded into an immutable (Scala) object. The application scales well in the cloud, using many dual-core VMs, but performs poorly on a single 64-core machine. I presume this is because the common data object resides in one NUMA cell and the many threads concurrently accessing it from other cells are too much for the interconnects.
If I run 64 separate JVM applications, each containing 1 actor, then performance is good again. A more moderate approach might be to run as many JVM applications as there are NUMA cells (8 in my case), giving the host OS a chance to keep the threads and memory together?
But is there a smarter way to achieve the same effect within a single JVM? E.g. if I replaced my common data object with several instances of a case class, would the JVM have the capability to place them on the optimal NUMA cell?
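For illustration, this is roughly what I have in mind, as a rough sketch (the partition count and the shape of CommonData are placeholders for my real application); whether the copies actually end up on distinct NUMA nodes is up to the allocator and GC, so presumably it only helps together with something like -XX:+UseNUMA:

    import akka.actor.{Actor, ActorSystem, Props}

    // The 'common' data, duplicated once per partition instead of shared globally.
    case class CommonData(table: Vector[Int])

    class Worker(data: CommonData) extends Actor {
      def receive = {
        case i: Int => sender ! data.table(i % data.table.size)
      }
    }

    object PartitionedApp extends App {
      val system = ActorSystem("numa-demo")
      val partitions = 8          // one partition per NUMA cell, in my case
      val workersPerPartition = 8 // 64 workers in total

      for (p <- 0 until partitions) {
        // A fresh copy per partition; with first-touch allocation the threads
        // that initialize it influence which NUMA node its pages land on.
        val copy = CommonData(Vector.tabulate(1000000)(identity))
        for (w <- 0 until workersPerPartition)
          system.actorOf(Props(new Worker(copy)), "worker-" + p + "-" + w)
      }
    }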
Update:
I'm using Oracle JDK 1.7.0_05, and Akka 2.1.4
I've now tried the UseNUMA and UseParallelGC JVM options. Neither seemed to have any significant impact on the slow performance when using one or a few JVMs. I've also tried using a PinnedDispatcher and the thread-pool-executor, with no effect. I'm not sure the configuration is having any effect though, since there seems to be nothing different in the startup logs.
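To check whether the options are being picked up at all, a trivial thing I can run (just a sketch; as far as I understand, -XX:+UseNUMA only takes effect with the parallel collector, so it should be paired with -XX:+UseParallelGC):

    import java.lang.management.ManagementFactory
    import scala.collection.JavaConverters._

    object FlagCheck extends App {
      // Prints the -XX options the JVM was actually started with, e.g. to
      // confirm that -XX:+UseNUMA and -XX:+UseParallelGC were really applied.
      ManagementFactory.getRuntimeMXBean.getInputArguments.asScala.foreach(println)
    }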
The biggest improvement remains when I use a single JVM per worker (~50). However, the problem with this appears to be that there is a long delay (up to a couple of minutes) before the FailureDetector registers the successful exchange of 'first heartbeat' between Akka cluster JVMs. I suspect there is some other issue here that I've not yet uncovered. I already had to increase ulimit -u since I was hitting the default maximum number of processes (1024).
Just to clarify: I'm not trying to achieve a high message rate, just to have lots of separate actors concurrently access an immutable object.

I think that if you are sure the problem is not in your message-processing algorithms, then you should take into account not only the NUMA option but the whole environment configuration: starting from the JVM version (latest is better; Oracle JDK also mostly performs better than OpenJDK), then the JVM options (including GC, memory, and concurrency options, etc.), then the Scala and Akka versions (the latest release candidates and milestones can be much better), and also the Akka configuration.
From here you can borrow all the settings that matter to get 50M messages per second of total throughput for Akka actors on contemporary laptops.
Never had a chance to run these benchmarks on a 64-core server, so any feedback will be greatly appreciated.
From my findings, which may help: current implementations of ForkJoinPool increase message-send latency as the number of threads in the pool grows. It is most noticeable when the rate of request-response calls between actors is high; e.g. on my laptop, increasing the pool size from 4 to 64 grows the message-send latency of Akka actors by up to 2-3x for most executor services (Scala's ForkJoinPool, JDK's ForkJoinPool, ThreadPoolExecutor).
You can check whether there are any differences by running mvnAll.sh with the benchmark.parallelism system property set to different values.
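If you want a rough local signal before running the full suite, a crude probe like this (an unscientific sketch, not the linked benchmark) shows how task hand-off cost behaves as the pool grows:

    import java.util.concurrent.{CountDownLatch, ForkJoinPool}

    object PoolSizeProbe extends App {
      def run(parallelism: Int, tasks: Int = 1000000): Long = {
        val pool  = new ForkJoinPool(parallelism)
        val latch = new CountDownLatch(tasks)
        val start = System.nanoTime()
        var i = 0
        while (i < tasks) {
          pool.execute(new Runnable { def run() = latch.countDown() })
          i += 1
        }
        latch.await()          // wait until every tiny task has run
        pool.shutdown()
        (System.nanoTime() - start) / 1000000
      }
      Seq(4, 8, 16, 32, 64).foreach(p => println("parallelism=" + p + ": " + run(p) + " ms"))
    }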

Related

Why do we need more executors than the number of machines in Spark?

What's the logic behind requesting more executors than machines available in your cluster?
In the ideal situation, we would like to have 1 executor (= 1 JVM) on each of our machines, and not a few on each machine.
If not, then why?
Thanks in advance
In the ideal situation, we would like to have 1 executor (= 1 JVM) on each of our machines, and not a few on each machine.
Not necessarily. Depending on the amount of available memory and the JVM implementation, separate virtual machines can be a much better option, in particular to:
Improve memory management on large machines - see for example Why 35GB Heap is Less Than 32GB – Java JVM Memory Oddities (and the small probe after this list).
Improve fault tolerance with unstable workloads - if one JVM fails, you lose the work of all its threads, so keeping each instance smaller keeps things under control.
Minimize the effort required for GC tuning - very large heaps can be extremely painful to tune.
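On the compressed-oops point in the first item, a tiny probe makes the cliff easy to see for yourself (a sketch; the ~32 GB threshold is HotSpot-specific, so verify on your own JVM):

    object HeapProbe extends App {
      // Run once with -Xmx31g and once with -Xmx34g: just past ~32 GB HotSpot
      // can no longer use compressed 32-bit object pointers, so every object
      // reference doubles in size and the nominally bigger heap may hold less.
      println("usable max heap: " + Runtime.getRuntime.maxMemory / (1024 * 1024) + " MB")
    }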

What does it mean practically "An ActorSystem is a heavyweight structure that will allocate 1...N Threads, so create one per logical application"?

What does it mean practically, "create one per logical application"? I have an enterprise application in Scala with 5 modules that will be deployed independently. I have used ActorSystem.create("...") to create some 4 or 5 actor systems in each module, like Messaging, Financial, Sales, Workflow, Security.
Do I have to call ActorSystem.create("...") only once for my enterprise application with the 5 modules above?
Or am I doing it correctly?
It practically means that if you can reuse the same thread pools, Akka system configuration, dead letters, actor namespace, and event buses, it's better to use one actor system.
So, in your case, a module is the logical application. Some frameworks like OSGi allow several logical modules to live inside one JVM (the physical application), which is probably why the term "logical application" was used. However, in most cases (including, I suppose, yours) they are equal, so I would recommend one ActorSystem per module.
More generally, the case of several logical applications inside one physical application is some meta-container (like a servlet container) that runs inside one JVM but manages several independent applications (like several deployed .wars) living in the same JVM.
Btw, if you want to manage JVM resources correctly, you can just assign different dispatchers (and maybe thread pools) to different logical groups of actors and still use one actor system (see the sketch below). So the rule is: if you can use one ActorSystem, just use one. Entities must not be multiplied beyond necessity.
P.S. You should also be aware of the lookup problem when using multiple actor systems in one physical application. If the solution proposed there seems like a workaround for your architecture, that is also a sign to merge the systems together.
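To make the dispatcher point concrete, a minimal sketch (the dispatcher names and pool sizes are made up; tune them per module):

    import akka.actor.{Actor, ActorSystem, Props}
    import com.typesafe.config.ConfigFactory

    class FinancialActor extends Actor { def receive = { case msg => /* handle */ } }
    class MessagingActor extends Actor { def receive = { case msg => /* handle */ } }

    object OneSystemManyDispatchers extends App {
      // One ActorSystem for the whole JVM; each logical module gets its own
      // dispatcher, so a busy module cannot starve the others of threads.
      val config = ConfigFactory.parseString("""
        financial-dispatcher {
          type = Dispatcher
          executor = "fork-join-executor"
          fork-join-executor { parallelism-min = 2, parallelism-max = 8 }
        }
        messaging-dispatcher {
          type = Dispatcher
          executor = "thread-pool-executor"
          thread-pool-executor { core-pool-size-min = 4, core-pool-size-max = 16 }
        }
      """)
      val system = ActorSystem("enterprise", config)

      system.actorOf(Props[FinancialActor].withDispatcher("financial-dispatcher"), "financial")
      system.actorOf(Props[MessagingActor].withDispatcher("messaging-dispatcher"), "messaging")
    }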
There is no right or wrong size here, or a magic formula to do it right.
It depends on what you want your ActorSystem(s) to achieve and how the application parts relate to each other.
You should separate ActorSystems when they have largely differing performance and reliability needs, or when parts of the system behave differently (blocking vs. non-blocking, for example).
A good example would be a typical web application with a database: the request handling could be non-blocking (like, for example, Play), while the database driver could be blocking (like Slick in the old times).
So here it would be a good idea to use separate ActorSystems, so you can still handle requests and inform the user when the database communication is down.
As with everything, each ActorSystem comes with a cost, so you should only split if you need to.
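A sketch of that split (the system names and pool sizes are illustrative only):

    import akka.actor.ActorSystem
    import com.typesafe.config.ConfigFactory

    object SplitSystems extends App {
      // Non-blocking request handling keeps the default dispatcher tuning...
      val web = ActorSystem("web-system")

      // ...while the blocking database layer lives in its own system with a
      // small bounded pool, so stuck calls cannot freeze the web side and the
      // web system can still answer requests with "the database is down".
      val db = ActorSystem("db-system", ConfigFactory.parseString("""
        akka.actor.default-dispatcher {
          executor = "thread-pool-executor"
          thread-pool-executor { core-pool-size-min = 4, core-pool-size-max = 4 }
        }
      """))
    }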
As @dk14 and @Andreas have already said, an ActorSystem allows you to share resources (thread pools, Akka system configuration, dead letters, actor namespace, event buses).
From a sharing perspective it makes sense to have one ActorSystem per JVM and different dispatchers per logical module. To get the most out of your Akka actors, it is critical to tune your dispatcher settings to match 1) your application workload and 2) your hardware (number of cores). For example, actors doing network IO should have their own dedicated dispatcher.
You should also consider carefully how many JVMs you want to run on a physical node. For example, if you have a host with 256/512 GB of RAM, running a single JVM may not be the best configuration. On the other hand, a physical machine or VM with 64 GB of RAM will do fine with just one JVM instance.

Can memcached make full use of multi-core?

Is memcached capable of making full use of multi-core? Or is there any way tuning this?
memcached has a "-t" option:
-t <threads>
    Number of threads to use to process incoming requests. This option is only meaningful if memcached was compiled with thread support enabled. It is typically not useful to set this higher than the number of CPU cores on the memcached server. The default is 4.
so, I believe it can use all your CPU cores, provided it was compiled with the corresponding option.
memcached is multi-threaded by default and has no problem saturating many cores. It's a bit harder to saturate all cores on more massively parallel boxes (e.g. a 256-core CMT box) just because it gets harder to get the data in and out of the network.
If you find areas where some sort of contention is preventing you from saturating cores, file a bug or start a discussion.
Based on this research by Intel, memcached v1.6 beta cannot scale well on a multicore system. Their experiments show that as the core count increases from 1 to 8, maximum throughput (with a median RTT < 1 ms SLA) only doubles.
CAREFUL. This terminology is quite confusing. The memcached man page says the -t option is only useful up to the number of cores, and the distinction matters because threads and processes are scheduled differently. Purely user-level (green) threads are multiplexed onto a single core and cannot exceed 100% of one CPU, whereas kernel-level threads, like processes, can run simultaneously on multiple cores. Threads share their process's memory and are distinguished mainly by their own instruction pointer and stack; processes share nothing unless it is explicitly declared shared ahead of time, with sharing mediated by the OS.
Overall, I want MORE CLARITY from the memcached people about whether their app uses kernel-level threads (multi-core capable) or user-level threads, and thus whether it can use more than 100% of one CPU.

Can a shared ready queue limit the scalability of a multiprocessor system?

Can a shared ready queue limit the scalability of a multiprocessor system?
Simply put, most definitely. Read on for some discussion.
Tuning a service is an art form, or requires benchmarking (and the space of configurations you would need to benchmark is huge). I believe it depends on factors such as the following (this list is not exhaustive):
how much time an item picked up from the ready queue takes to process;
how many worker threads there are;
how many producers there are, and how often they produce;
what kind of waiting you use: spin-locks or kernel waits (the latter being slower).
So, if items are produced often, the number of threads is large, and the processing time is low, the data structure could be locked for large windows, thus causing thrashing.
Other factors may include the data structure used and how long it stays locked - e.g., if you use a linked list to manage such a queue, the add and remove operations take constant time, while a priority queue (heap) takes a few more operations on average when items are added.
If your system is for business processing, you could take this question out of the picture entirely by using:
a process-based architecture, spawning multiple producer/consumer processes and using the file system for communication;
a non-preemptive, cooperative threading programming language such as Stackless Python, Lua, or Erlang.
Also note: synchronization primitives cause inter-processor cache-coherence traffic, which is not good and should therefore be used sparingly.
The discussion could go on to fill a Ph.D. dissertation :D
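As a toy illustration of the shared-queue bottleneck (a sketch, not a rigorous benchmark; absolute numbers will vary wildly by machine):

    import java.util.concurrent.{Executors, LinkedBlockingQueue, TimeUnit}

    object SharedQueueDemo extends App {
      // All workers drain one shared queue, so every poll contends on the
      // same internal lock; contrast with giving each worker its own queue.
      val queue = new LinkedBlockingQueue[Integer]()
      (1 to 4000000).foreach(i => queue.put(i))

      val workers = 8
      val pool = Executors.newFixedThreadPool(workers)
      val start = System.nanoTime()
      (1 to workers).foreach { _ =>
        pool.execute(new Runnable {
          def run(): Unit = while (queue.poll() != null) () // fight over the head
        })
      }
      pool.shutdown()
      pool.awaitTermination(5, TimeUnit.MINUTES)
      println("drained by " + workers + " workers in " + (System.nanoTime() - start) / 1000000 + " ms")
    }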
A per-CPU ready queue is the natural choice of data structure. This is because most operating systems try to keep a process on the same CPU, for many reasons you can google for. What does that imply? If a thread is ready and another CPU is idling, the OS will not quickly migrate the thread to that CPU; load balancing only kicks in over the long run.
Had the situation been different, that is, if keeping thread-CPU affinity were not a design goal and thread migration were frequent, then keeping separate per-CPU run queues would be costly.

Benefits of multiple memcached instances

Is there any difference between running four 0.5 GB memcached servers and one 2 GB instance?
Does running multiple instances offer any benefits?
If one instance fails, you still get the advantages of using the cache. This is especially true if you are using consistent hashing, which brings the same key to the same instance rather than spreading new reads/writes among the machines that are still up.
You may also elect to run servers on 32-bit operating systems, which cannot address much more than around 3 GB of memory per process.
Check the FAQ: http://www.socialtext.net/memcached/ and http://www.danga.com/memcached/
High availability is nice, and memcached will automatically distribute your cache across the 4 servers. If one of those servers dies for some reason, you can handle that error by continuing as if the cache were blank, redirecting to a different server, or any sort of custom error handling you want. If your single 2 GB server dies, your options are pretty limited.
The important thing to remember is that you do not have 4 copies of your cache; it is 1 cache, split amongst the 4 servers.
The only downside is that it's easier to run out of memory in one of four 0.5 GB instances than in a single 2 GB one.
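To see why it is one cache split four ways, a toy sketch of client-side key mapping (plain modulo hashing here; real clients use consistent hashing so that adding or removing a server remaps fewer keys):

    object ShardDemo extends App {
      val servers = Vector("cache1:11211", "cache2:11211", "cache3:11211", "cache4:11211")

      // Each key deterministically maps to exactly one server, so the cluster
      // is one logical cache split across four instances, not four copies.
      def serverFor(key: String): String =
        servers((key.hashCode & Int.MaxValue) % servers.size)

      Seq("user:42", "session:abc", "page:home").foreach { k =>
        println(k + " -> " + serverFor(k))
      }
    }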
I would also add that, in theory, several machines might gain you some performance: if you have a lot of frontends doing a lot of heavy reads, it's much better to split them across different machines, since the network capacity and processing power of a single machine can become your upper bound.
This advantage is highly dependent on memcached utilization, however (sometimes it can be much faster to fetch everything from one machine).