How does Scala attain parallelism?

I am taking a course on distributed systems and we have to make our project using Scala. Our instructor told us that Scala is good in the sense that it uses multiple cores to do the computation and uses parallelism to solve problems while being integrated with the actor model.
This is a theoretical question. I have learned some basics about the actor model using Akka and my question is that, while programming, does the user have to provide the details to the compiler so that various actors work on multiple cores, or does Scala take care of that and use multiple cores for various actors?
In a nutshell my question is: when we declare multiple actors using the Akka libraries in Scala, does Scala compiler automatically use the multi-core CPU power to distribute various actors among cores, or does the programmer have to provide some input to do this?

TL;DR: With the default configuration in Akka you need do nothing to get pretty good parallelism for most use cases.
Longer Answer: Actors in Akka run on a Dispatcher and that Dispatcher has an ExecutorService which is typically a pool of Threads. The number of Threads is configured by the developer, but by default is 3 times the number of CPU cores on the machine (see default-dispatcher.parallelism-factor in the reference configuration).
At any point in time each CPU core can be running an Actor using one of these threads, so provided you have a number of threads in your Dispatcher's ExecutorService that is equal to the number of cores on your CPU, you will be able to take advantage of all your cores. The reason that this is set to three times the number of cores in the default configuration is to compensate for blocking IO.
IO is slow, and blocking calls hog threads at times you are doing IO rather than using the CPU. So the key to getting the best level of parallelism is configuring this thread pool:
If you are doing only non-blocking IO, you can set it to the number of CPU cores you have and feel confident you are taking full advantage of your CPU.
The more blocking IO you do, the more threads you will need to keep getting good parallelism, but be warned - the more Threads you use, the more memory you will use and Threads are not the most lightweight things in the world.
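The sizing rule above can be sketched in plain Scala without Akka. This is only an illustration under assumptions: poolSize is a hypothetical helper that multiplies the core count by a factor, not Akka's own formula (Akka additionally clamps the result between configured minimum and maximum values), and the pool is an ordinary java.util.concurrent thread pool rather than a Dispatcher.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object PoolSizing {
  // Hypothetical helper: threads = cores * factor, rounded up, never below 1.
  def poolSize(cores: Int, factor: Double): Int =
    math.max(1, math.ceil(cores * factor).toInt)

  def main(args: Array[String]): Unit = {
    // Number of logical cores visible to the JVM.
    val cores = Runtime.getRuntime.availableProcessors
    // A factor of 1.0 suits purely non-blocking work; 3.0 leaves headroom for blocking IO.
    val pool = Executors.newFixedThreadPool(poolSize(cores, 3.0))
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
    try {
      // 100 small tasks spread over the pool's threads.
      val work = Future.sequence((1 to 100).map(i => Future(i * i)))
      println(Await.result(work, 10.seconds).sum)
    } finally pool.shutdown()
  }
}
```

With a factor of 1.0 the pool has one thread per core; raising the factor trades memory for the ability to keep cores busy while some threads are blocked.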

theon's answer is pretty good, but I would just like to point out that actors are not the only way to achieve parallelism in Scala. If you do not need to manage state, Futures are generally a simpler way to perform computation in parallel. You just wrap each snippet of code that can run independently of others in a call to the Future factory function, and you can then compose/transform the results of each snippet (also in parallel) using calls to map, flatMap, fold, etc., or with for comprehensions. All you need to configure is an ExecutionContext as an implicit val, and if you are already using Akka, you can use the same one that your actors use, or you can use the preconfigured global default.
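A minimal sketch of that approach, assuming the preconfigured global ExecutionContext. The factorial and gcd functions are just stand-ins for any two computations that do not depend on each other.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object FutureDemo {
  // Stand-in computations; any independent snippets would do.
  def factorial(n: Int): BigInt = (1 to n).foldLeft(BigInt(1))(_ * _)
  def gcd(a: Int, b: Int): Int = if (b == 0) a else gcd(b, a % b)

  def combined(): Future[BigInt] = {
    // Create both Futures before the for comprehension so they can run in parallel.
    val f = Future(factorial(10))
    val g = Future(gcd(48, 18))
    // Compose the two results once both are available.
    for {
      a <- f
      b <- g
    } yield a + b
  }

  def main(args: Array[String]): Unit =
    println(Await.result(combined(), 10.seconds))
}
```

If the two Future(...) calls were written inside the for comprehension instead, the second would only start after the first completed, so the order of creation matters here.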

Related

Multiple Isolates vs one Isolate

How isolates are distributed across CPU cores
In Dart, you can run multiple isolates at the same time, and I haven't been able to find a guideline or best practice for using isolates.
My question is how will overall CPU usage and performance be affected by the numbers of isolates running at the same time, and is it better to use a small number of isolates (or even just one) or not.
One isolate per one thread
One isolate takes one platform thread - you can observe threads created per each isolate in the Call Stack pane of VSCode when debugging the Dart/Flutter app with multiple isolates. If the workload of interest allows parallelism you can get great performance gains via isolates.
Note that Dart explicitly abstracts away the implementation detail and docs avoid the specifics of scheduling of isolates and their intrinsics.
Number of isolates ≈ number of CPU cores
In determining the number of isolates/threads, as a rule of thumb you can take the number of cores as the initial value. You can import 'dart:io'; and use the Platform.numberOfProcessors property to determine the number of cores. To fine-tune this, though, experimentation is required to see which number makes more sense. There are many factors that can influence the optimal number of threads:
Presence of Simultaneous MultiThreading (SMT) in CPU, such as Intel HyperThreading
Instruction level parallelism (ILP) and specific machine code produced for your code
CPU architecture
Mobile/smartphone scenarios vs desktop - e.g. desktop Intel CPUs have identical cores and less tendency to throttle. Smartphones have efficiency and high-performance cores and are prone to throttling; creating a myriad of threads can lead to worse results due to the OS slowing down your code.
E.g. for one of my Flutter apps which uses multiple isolates to parallelize file processing I empirically came to the following piece of code determining the number of isolates to be created:
var numberOfIsolates = max(Platform.numberOfProcessors - 2, 2);
Isolate is not a thread
The model offered by isolate is way more restricting than what the standard threaded model suggests.
Isolates do not share memory vs Threads can read each other's vars. There're technical exceptions, e.g. since around Flutter 2.5.0 isolates use one heap, there're exceptions for immutable types sharing across isolates, such as strings - though they are an implementation detail and don't change the concept.
Isolates communicate only via messages vs numerous synchronization primitives in threads (critical sections, locks, semaphores, mutexes etc.).
The clear tradeoff is that Isolates are not prone to multi-threaded programming horrors (tricky bugs, debugging, development complexity) yet provide fewer capabilities for implementing parallelism.
In Dart/Flutter there're only 2 ways to work with Isolates:
Low level, Dart style - using the Isolate class to spawn individual isolates, set-up send/receive ports for messaging, code entry points.
Higher level compute helper function in Flutter - it gets input params, creates a new isolate with a defined entry point, processes the inputs and provides a single result - no back and forth communication, streams of events etc., just a request-response pattern.
Note that in the Dart/Flutter SDK there are no parallelism APIs such as the Task Parallel Library (TPL) in .NET, which provides multi-core CPU optimized APIs to process data on multiple threads, e.g. sorting a collection in parallel. A huge number of algorithms can benefit from parallelism using threads but are not feasible with the Isolates model, where there's no shared memory. Also there's no Isolate pool, a set of isolates up and running and waiting for incoming tasks (I had to create one by myself https://pub.dev/packages/isolate_pool_2).
P.S.: the influence of SMT, ILP and other stuff on the performance of multiple threads can be observed via the following CPU benchmark (https://play.google.com/store/apps/details?id=xcom.saplin.xOPS) - e.g. one can see that there's typically a sweet spot in terms of the number of threads doing computations. It is greater than the number of cores. E.g. on my Intel i7 8th gen MacBook with 6 cores and 12 threads per CPU the best performance was observed with the number of threads at about 4 times the number of cores.
The distribution of isolates across CPU cores is done by the OS. But each isolate corresponds to a thread. The number of isolates to use will depend on the number of CPU cores physically available.
This is illustrated by a short article available here:
https://martin-robert-fink.medium.com/dart-is-indeed-multi-threaded-94e75f66aa1e

What is the relation between threads and concurrency?

Concurrency means the ability to have more than one task in progress at a time
But where does threading fit in it?
What's the relation between threading and concurrency?
What is the important link between these two which will fully clear all the confusion?
Threads are one way to achieve concurrency. Concurrency can be achieved at many levels and in many ways. Here are some of them from low to high level to give you a rough idea:
CPU pipelines: at a hardware level, multiple instructions are executed in parallel (each instruction is at a different stage in the pipeline)
Duplication of ALU and FPU CPU units. There are more arithmetic-logic units and floating point units in a processor that can execute instructions in parallel.
vectorized instructions. Instructions which execute for multiple data.
hyperthreading/SMT. Duplication of the process context.
threads. Streams of instructions which can be executed in parallel.
processes. You run both a browser and a word processor on your system.
tasks. Higher abstraction over threads and async work.
multiple computers. Run your program on multiple computers
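To make the "threads" level of that list concrete, here is a small sketch in Scala (chosen only because it runs on the JVM; the question itself is language-agnostic). The countWithThreads helper is hypothetical: it starts several threads that increment one shared atomic counter, then waits for all of them.

```scala
import java.util.concurrent.atomic.AtomicInteger

object ThreadDemo {
  // Hypothetical helper: run `threads` workers concurrently, each doing
  // `incrementsPerThread` atomic increments on a shared counter.
  def countWithThreads(threads: Int, incrementsPerThread: Int): Int = {
    val counter = new AtomicInteger(0)
    val workers = (1 to threads).map { _ =>
      new Thread(new Runnable {
        def run(): Unit = {
          var i = 0
          while (i < incrementsPerThread) {
            counter.incrementAndGet() // atomic, so no updates are lost
            i += 1
          }
        }
      })
    }
    workers.foreach(_.start()) // all workers are now eligible to run at once
    workers.foreach(_.join())  // wait until every worker has finished
    counter.get()
  }

  def main(args: Array[String]): Unit =
    println(countWithThreads(4, 1000))
}
```

Whether the workers actually execute at the same instant depends on how the OS schedules them onto cores; the program is concurrent either way.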
I'm new here but I don't really understand the down votes? Could someone explain it to me? Is it just because this question has (likely) been answered or because it's considered obvious?
Now that that's out of the way...
Nothing being executed on the CPU is from a "process" or anything else. They're all threads, scheduled and entirely managed by the kernel using a variety of algorithms to reach expected performance for any given application. The CPU only allows n threads, where n equals (cores * hyperthreads). In most cases hyperthreads will be 2 so you have double the core count to get logical CPU count. What this really means is that instead of 4 (for example) threads being run at once, it can support up to 8. Now the OS may have hundreds of threads at any given time, how is that possible? Well the kernel uses a variety of checks such as how frequently and long the thread sleeps to assign it a priority. Whenever the CPU triggers a timer interrupt the OS will swap out threads appropriately if they've reached their allotted time slice based on the OS determination of its priority.

How often should we really use futures and actors? Any alternatives? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 4 years ago.
I've set out to create a physics engine with scala using concurrency.
I know a lot of number crunching can be parallelized (which is important for optimization).
However I know abstractions of parallelism using actors or futures come with a lot of overhead. How often should I really use futures and actors? I would imagine making every numeric statement (like factorial(4) or gcd(5,10)) a future might make things more inefficient as now you're paying for that overhead on multiple levels and quite often.
Maybe there are better ways of parallelizing execution on a lower level in scala? What are your opinions on the frequency of futures and actors being used?
I think it is worth distinguishing between true parallelism and broader concurrency.
Parallelism is a way to make your code run faster by performing multiple computations at the same time. This can range from SIMD instructions on a CPU through to distributed applications on multiple servers. The choice of parallelism will depend on the nature of the problem.
Concurrency is a way to implement separation of concerns by separating code into different sections which appear to execute at the same time. The primary goal is to allow the application to juggle multiple tasks even if it is executing on a single thread.
Actors and Futures are primarily used to implement concurrency. An actor takes a specific role in the overall system and appears to operate independently and asynchronously, making it easier to reason about the behaviour of that part of the system. Futures are a way to get away from strict linear execution by saying that an operation will happen at some indeterminate time in the future.
Scala does not really support SIMD instructions (perhaps because of the JVM foundations) but there are libraries for GPU acceleration which you should definitely look at if you are doing heavyweight calculations.
Simple task-level parallelism can be done with the parallel collection classes in Scala which will potentially use multiple threads in an efficient way.
Futures can be used to spawn tasks for parallel execution, but they lack control over scheduling (they start immediately) so it is better to use one of the Scala task libraries.
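If plain Futures are used anyway, the overhead concern from the question can be reduced by making each Future coarse-grained. The following is a sketch under assumptions, not a recommendation over a task library: parallelSumOfSquares is a hypothetical function that splits a range into a few chunks and spawns one Future per chunk rather than one per arithmetic operation.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ChunkedSum {
  // Hypothetical example: sum of squares of 1..n, computed in `chunks` parallel pieces.
  def parallelSumOfSquares(n: Int, chunks: Int): Long = {
    // Round up so that every element lands in some chunk.
    val chunkSize = math.max(1, (n + chunks - 1) / chunks)
    val partials = (1 to n).grouped(chunkSize).map { chunk =>
      // One Future per chunk: the scheduling cost is paid once per chunk,
      // not once per multiplication.
      Future(chunk.foldLeft(0L)((acc, x) => acc + x.toLong * x))
    }.toList // force the iterator so all Futures start now
    Await.result(Future.sequence(partials), 30.seconds).sum
  }

  def main(args: Array[String]): Unit =
    println(parallelSumOfSquares(1000, 8))
}
```

The number of chunks would normally be chosen close to the number of cores; far more chunks than cores mostly adds scheduling overhead.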
You can use actors for thread-level parallelism where the same computation is going to be performed multiple times during the execution of the application. The computation is wrapped in an actor and then that actor is replicated on multiple threads, cores or processors. Each computation is triggered by a message to the actor and the results are returned in to the main application using a second message. This is useful for long-running computations with small amounts of input and output data, but if there is too much data then the cost of moving it between processes may become a significant overhead.
And, of course, there are many ways of distributing code across multiple servers, and actors are a strong candidate for this approach. Moving the data around becomes the key concern at this point.

What does it mean practically "An ActorSystem is a heavyweight structure that will allocate 1...N Threads, so create one per logical application"?

What does it mean practically "create one per logical application"? I have an enterprise application in Scala with 5 modules that will be deployed independently. I have used ActorSystem.create("...") to create some 4 or 5 system Actors in each module like Messaging, Financial, Sales, Workflow, Security.
Do I have to do ActorSystem.create("...") only once? for my enterprise application with 5 modules as above.
Or am I doing it correctly?
It practically means that if you can reuse same thread-pools, akka-system configuration, dead-letters, namespace for actors, event buses - it's better to use one actor system.
So, in your case, module - is the logical application. Some frameworks like OSGi may allow several logical modules to live inside one JVM (physical application), that's probably why "logical application" term was used. However, in most cases (like, I suppose, yours) they are equal - I would recommend you to use one ActorSystem per module.
More generally, the case of several logical applications inside one physical one is some meta-container (like a servlet container) that runs inside one JVM but manages several independent applications (like several deployed .wars) living in the same JVM.
Btw, if you want to manage JVM resources correctly - you can just assign different dispatchers (and maybe thread pools) into different logical groups of actors, and still use one actor-system. So the rule is - if you can use one ActorSystem - just use one. Entities must not be multiplied beyond necessity
P.S. You should also be aware of lookup problem when using multiple actor-systems in one physical application. So if solution proposed there seems like workaround for your architecture - it's also a sign to merge systems together.
There is no right or wrong size here, or a magic formula to do it right.
It depends on the things you want your ActorSystem(s) to achieve and how the application parts relate to each other.
You should separate ActorSystems when they have largely different performance and reliability needs and when the systems behave differently (blocking / non-blocking for example).
A good example would be a typical WebApplication with a Database: The application handling requests could be non blocking (like for example play), the database driver could be blocking (like slick in the old times).
So here it would be a good idea to use separate ActorSystems, to still be able to handle requests and inform the user that the database communication is down.
As with everything, each ActorSystem comes with a cost, so you should only do it if you need it.
As #dk14 and #Andreas have already said an ActorSystem allows you to share resources ( thread-pools, akka-system configuration, dead-letters, namespace for actors, event buses).
From a sharing perspective it makes sense to have one ActorSystem per JVM and have different dispatchers per logical module. To get the most out of your Akka actors it is critical that you tune your dispatcher settings to match 1) your application workload 2) your hardware settings (# of cores). For example, if you have some actors doing network IO they should have their own dedicated dispatchers.
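The idea of giving blocking work its own thread pool can be illustrated without Akka. This is a plain-Scala analogue, not Akka dispatcher configuration: namedPool is a hypothetical helper, and the pool names and the size of 16 are arbitrary choices for the sketch.

```scala
import java.util.concurrent.{Executors, ThreadFactory}
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.{Await, ExecutionContext, ExecutionContextExecutorService, Future}
import scala.concurrent.duration._

object SeparatePools {
  // Hypothetical helper: a fixed-size pool whose threads carry a recognisable name prefix.
  def namedPool(prefix: String, size: Int): ExecutionContextExecutorService = {
    val counter = new AtomicInteger(0)
    val factory = new ThreadFactory {
      def newThread(r: Runnable): Thread = {
        val t = new Thread(r, s"$prefix-${counter.incrementAndGet()}")
        t.setDaemon(true)
        t
      }
    }
    ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(size, factory))
  }

  // Returns the names of the threads that ran the CPU task and the blocking task.
  def run(): (String, String) = {
    val cpuPool = namedPool("cpu", Runtime.getRuntime.availableProcessors)
    val ioPool = namedPool("blocking-io", 16) // arbitrary size for the sketch
    try {
      // The sleep stands in for a blocking call; it occupies only an ioPool thread.
      val io = Future { Thread.sleep(50); Thread.currentThread().getName }(ioPool)
      // CPU-bound work keeps its own threads, unaffected by blocked IO threads.
      val cpu = Future { Thread.currentThread().getName }(cpuPool)
      (Await.result(cpu, 5.seconds), Await.result(io, 5.seconds))
    } finally {
      cpuPool.shutdown()
      ioPool.shutdown()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```

In Akka the same separation is expressed by defining additional dispatchers in configuration and assigning them to the relevant actors, all within one ActorSystem.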
You should also consider carefully how many JVMs you want to run on a physical node. For example, if you have a host with 256/512 GB of RAM running a single JVM may not be the best configuration. On the other hand, a physical/VM having 64 GB of RAM will do fine with just one JVM instance

NUMA awareness of JVM

My question concerns the extent to which a JVM application can exploit the NUMA layout of a host.
I have an Akka application in which actors concurrently process requests by combining incoming data with 'common' data already loaded into an immutable (Scala) object. The application scales well in the cloud, using many dual core VMs, but performs poorly on a single 64 core machine. I presume this is because the common data object resides in one NUMA cell and many threads concurrently accessing from other cells is too much for the interconnects.
If I run 64 separate JVM applications each containing 1 actor then performance is good again. A more moderate approach might be to run as many JVM applications as there are NUMA cells (8 in my case), giving the host OS a chance to keep the threads and memory together?
But is there a smarter way to achieve the same effect within a single JVM? E.g. if I replaced my common data object with several instances of a case class, would the JVM have the capability to place them on the optimal NUMA cell?
Update:
I'm using Oracle JDK 1.7.0_05, and Akka 2.1.4
I've now tried with the UseNUMA and UseParallelGC JVM options. Neither seemed to have any significant impact on slow performance when using one or few JVMs. I've also tried using a PinnedDispatcher and the thread-pool-executor with no effect. I'm not sure if the configuration is having an effect though, since there seems nothing different in the startup logs.
The biggest improvement remains when I use a single JVM per worker (~50). However, the problem with this appears to be that there is a long delay (up to a couple of min) before the FailureDetector registers the successful exchange of 'first heartbeat' between Akka cluster JVMs. I suspect there is some other issue here that I've not yet uncovered. I already had to increase the ulimit -u since I was hitting the default maximum number of processes (1024).
Just to clarify, I'm not trying to achieve large numbers of messages, just trying to have lots of separate actors concurrently access an immutable object.
I think that if you are sure the problem is not in the message processing algorithms, then you should take into account not only the NUMA option but the whole environment configuration: starting with the JVM version (latest is better; Oracle JDK also mostly performs better than OpenJDK), then JVM options (including GC, memory, concurrency options, etc.), then the Scala and Akka versions (the latest release candidates and milestones can be much better), and also the Akka configuration.
From here you can borrow all the things that matter to get 50M messages per second of total throughput for Akka actors on contemporary laptops.
Never had chance to run these benchmarks on 64-core server - so any feedback will be greatly appreciated.
From my findings, which may help: current implementations of ForkJoinPool increase message send latency as the number of threads in the pool increases. It is especially noticeable when the rate of request-response calls between actors is high, e.g. on my laptop, when increasing the pool size from 4 to 64, the message send latency of Akka actors in such cases grows by up to 2-3 times for most executor services (Scala's ForkJoinPool, JDK's ForkJoinPool, ThreadPoolExecutor).
You can check if there are any differences by running mvnAll.sh with the benchmark.parallelism system variable set to different values.