Why is Akka good for scaling "up" and "out"? - scala

If you Google "what does Akka do", the typical sales pitch you get is that it helps your program scale "up" and/or scale "out". But just like the buzzword "cloud" does nothing to explain the virtualization technologies that comprise a cloud service, I see "scale up/out" as equally vague buzzwords that probably don't do Akka any real justice.
So let's say I've got a batch processing system full of 100 different types of tasks. Tasks 1-100 are kicking off all day long, doing their thing, whatever it is that they do. How exactly might Akka help my batch system scale "up"? How might it help my system scale "out"?

It scales "out" because it allows you to design and organize a cluster of servers. Being message-passing based, it is pretty much a one-to-one representation of the actual world (machines connected via the network, sending messages to each other). No magic here; it's just that the paradigm of the framework makes it easier to reason about your infrastructure.
It scales "up" because if you buy better hardware it will transparently take advantage of the newly added cores/CPUs without you having to change anything.
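As a rough illustration (a sketch only, using Akka's classic actor API; the actor and message names are made up, not from your system), the same worker code can be fanned out over however many cores the box has just by sizing a router pool:

import akka.actor.{Actor, ActorSystem, Props}
import akka.routing.RoundRobinPool

// Hypothetical worker for "Task 1"; the real processing logic would go here.
class Task1Actor extends Actor {
  def receive = {
    case job: String => () // process the job
  }
}

object ScaleUpSketch extends App {
  val system = ActorSystem("batch")
  // A pool of 32 identical workers sharing the default dispatcher. On a machine
  // with more cores you only grow this number (or read it from configuration);
  // the sending code below does not change at all.
  val task1Pool = system.actorOf(RoundRobinPool(32).props(Props[Task1Actor]), "task1Pool")
  (1 to 1000).foreach(i => task1Pool ! s"job-$i")
}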
(When it comes to the Typesafe stack, get used to the buzzword! :) )
Edit after first comment:
You could organize your cluster the way you want :)
Dividing by type/responsibility seems like a good option, yes. You could have VM1 with Task1Actor instances and VM2 with Task2Actor instances, and if you notice that task 1 is the bottleneck, start VM1-bis to add more instances, for example.
Since Akka abstracts the whole process of sending/receiving messages, you can have several JVMs on the same machine, several VMs on the same physical machine, several actual machines, or several actual machines with several VMs with several JVMs. You get the idea.
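To sketch what that looks like in code (assuming Akka's classic remoting is enabled in the configuration; the system name, host, and actor path below are made up), sending to an actor on another VM is just a matter of addressing it:

import akka.actor.ActorSystem

object RemoteSendSketch extends App {
  val system = ActorSystem("batch")
  // Hypothetical address of a Task1Actor instance running in the JVM on VM1.
  val task1OnVm1 = system.actorSelection("akka.tcp://batch@vm1:2552/user/task1Actor")
  task1OnVm1 ! "job-42"
  // The exact same line works if the actor later moves to another JVM, another
  // VM, or another physical machine; only the address in the configuration changes.
}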
For the Typesafe stack: http://typesafe.com/platform

Related

What is meant by Distributed System?

I am reading about distributed systems and getting confused about what it really means.
I understand at a high level that it means a set of different machines that work together to achieve a single goal.
But this definition seems too broad and loose. I would like to give some points to explain the reasons for my confusion:
I see a lot of people referring to microservices as a distributed system, where functionalities like Order, Payment, etc. are distributed across different services, whereas others refer to multiple instances of the Order service that are trying to serve customers and possibly use some consensus algorithm to agree on shared state (e.g. the current inventory level).
When talking about distributed databases, I see a lot of people talk about different nodes that each store/serve a part of the user requests, like records with primary keys from 'A-C' on the first node, 'D-F' on the second node, etc. At a high level it looks like sharding.
When talking about distributed rate limiting, some refer to multiple application nodes (so-called distributed application nodes) using a single rate limiter, while others mention that the rate limiter itself has multiple nodes with a shared cache (like Redis).
It feels like people use "distributed systems" to refer to microservices architecture, horizontal scaling, partitioning (sharding), and anything in between.
I am reading about distributed systems and getting confused about what it really means.
As commented by @ReinhardMänner, a good general definition of a distributed system (DS) is at https://en.wikipedia.org/wiki/Distributed_computing
A distributed system is a system whose components are located on different networked computers, which communicate and coordinate their actions by passing messages to one another from any system. The components interact with one another in order to achieve a common goal.
Anything that fits the above definition can be referred to as a DS. All the mentioned examples, such as microservices, distributed databases, etc., are specific applications of the concept or implementation details.
The statement "X is a distributed system" does not inherently imply any of these details; they must be explicitly specified for each DS. E.g., a distributed database does not necessarily mean the use of sharding.
I'll also draw from Wikipedia, but I think that the second part of the quote is more important:
A distributed system is a system whose components are located on different networked computers, which communicate and coordinate their actions by passing messages to one another from any system. The components interact with one another in order to achieve a common goal. Three significant challenges of distributed systems are: maintaining concurrency of components, overcoming the lack of a global clock, and managing the independent failure of components. When a component of one system fails, the entire system does not fail.
A system that constantly has to overcome these problems, even if all services are on the same node, or if they communicate via pipes/streams/files, is effectively a distributed system.
Now, trying to clear up your confusion:
Horizontal scaling was there with monoliths before microservices. Horizontal scaling is basically achieved by division of compute resources.
Division of compute requires dealing with synchronization, node failure, and multiple clocks. But it is still cheaper than scaling vertically. That's where you might turn to consensus, either by implementing it in the application, by using a dedicated service such as ZooKeeper, or by abusing a DB table for that purpose.
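To make "abusing a DB table" concrete (a minimal sketch only, with a hypothetical locks table and Postgres-flavoured SQL, not a production recipe): the usual trick is an atomic UPDATE that acts as a lease, so exactly one node wins at a time:

import java.sql.DriverManager

// Assumed table (hypothetical): locks(name VARCHAR PRIMARY KEY, owner VARCHAR, expires_at TIMESTAMP).
// Every node periodically tries to grab the lease; the row update is atomic, so
// at most one node sees an update count of 1 and acts as leader until expiry.
def tryAcquireLease(jdbcUrl: String, nodeId: String): Boolean = {
  val conn = DriverManager.getConnection(jdbcUrl)
  try {
    val stmt = conn.prepareStatement(
      "UPDATE locks SET owner = ?, expires_at = now() + interval '30 seconds' " +
      "WHERE name = 'batch-leader' AND (owner = ? OR owner IS NULL OR expires_at < now())")
    stmt.setString(1, nodeId)
    stmt.setString(2, nodeId)
    stmt.executeUpdate() == 1 // 1 row updated means we hold (or renewed) the lease
  } finally conn.close()
}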
Monoliths present 2 problems that microservices solve: address-space dependency (i.e. someone's component may crash the whole process and thus your component) and long startup times.
While microservices solve these problems, these problems aren't what makes them into a "distributed system". It doesn't matter if the different processes/nodes run the same software (monolith) or not (microservices), it matters that they are different processes that can't easily communicate directly (e.g. via function calls that promise not to fail).
In databases, scaling horizontally is also cheaper than scaling vertically. The two components of horizontal DB scaling are division of compute - effectively, a distributed system - and division of storage - sharding - as you mentioned, e.g. A-C, D-F, etc.
Sharding of storage does not define distributed systems - a single compute node can handle multiple storage nodes. It's just that it's much more useful for a database that divides compute to also shard its storage, so you often see them together.
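A toy sketch of that kind of key-range routing (the shard boundaries here are purely illustrative):

// Route a record to a storage node by the first letter of its primary key,
// mirroring the A-C / D-F style ranges mentioned above.
def shardFor(primaryKey: String): Int = primaryKey.head.toUpper match {
  case c if c <= 'C' => 0 // node 0 holds keys A-C
  case c if c <= 'F' => 1 // node 1 holds keys D-F
  case _             => 2 // everything else lives on node 2
}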
Distributed rate limiting falls under "maintaining concurrency of components". If every node does its own rate limiting, and they don't communicate, then the system-wide rate cannot be enforced. If they wait for each other to coordinate enforcement, they aren't concurrent.
Usually the solution is "approximate" rate limiting where components synchronize "occasionally".
If your components can't easily (i.e. with no latency) agree on a global rate limit, that's usually because they can't easily agree on a global anything. In that case, you're effectively dealing with a distributed system, even if all the components are just threads in the same process.
(that could happen e.g. if you plan to scale out but haven't done so yet, so you don't allow your threads to communicate directly.)
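A minimal sketch of that "approximate" idea (names and numbers are made up): each node spends a local slice of the global budget and only occasionally re-syncs with the others, e.g. via a shared store.

import java.util.concurrent.atomic.AtomicInteger

// Each node keeps a local budget and decides locally, with no coordination on the
// hot path. Some background task would occasionally reconcile with the other
// nodes (or a shared cache such as Redis) and call refresh() with a new slice.
class ApproximateRateLimiter(initialBudget: Int) {
  private val budget = new AtomicInteger(initialBudget)

  def tryAcquire(): Boolean = budget.getAndDecrement() > 0

  def refresh(newBudget: Int): Unit = budget.set(newBudget)
}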

What does it mean practically "An ActorSystem is a heavyweight structure that will allocate 1...N Threads, so create one per logical application"?

What does "create one per logical application" mean in practice? I have an enterprise application in Scala with 5 modules that will be deployed independently. I have used ActorSystem.create("...") to create some 4 or 5 system actors in each module, like Messaging, Financial, Sales, Workflow, Security.
Do I have to call ActorSystem.create("...") only once for my enterprise application with the 5 modules above?
Or am I doing it correctly?
Practically, it means that if you can reuse the same thread pools, Akka system configuration, dead letters, actor namespace, and event buses, then it's better to use one actor system.
So, in your case, the module is the logical application. Some frameworks like OSGi allow several logical modules to live inside one JVM (the physical application), which is probably why the term "logical application" was used. However, in most cases (including, I suppose, yours) they are equal, so I would recommend using one ActorSystem per module.
More generally, the case of several logical applications inside one physical application is some meta-container (like a servlet container) that runs inside one JVM but manages several independent applications (like several deployed .wars) living in the same JVM.
Btw, if you want to manage JVM resources correctly, you can just assign different dispatchers (and maybe thread pools) to different logical groups of actors and still use one actor system. So the rule is: if you can use one ActorSystem, just use one. Entities must not be multiplied beyond necessity.
P.S. You should also be aware of the lookup problem when using multiple actor systems in one physical application. If the solution proposed there seems like a workaround for your architecture, that's also a sign to merge the systems together.
There is no right or wrong size here, or a magic formula to do it right.
It depends on what you want your ActorSystem(s) to achieve and how the application parts relate to each other.
You should separate ActorSystems when they have largely differing performance and reliability needs, and when the systems behave differently (blocking vs. non-blocking, for example).
A good example would be a typical web application with a database: the part handling requests could be non-blocking (like, for example, Play), while the database driver could be blocking (like Slick in the old days).
So here it would be a good idea to use separate ActorSystems, so you can still handle requests and inform the user that the database communication is down.
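Roughly (a sketch of the idea, not a recommendation for your exact modules):

import akka.actor.ActorSystem

object SeparateSystemsSketch extends App {
  // One system for the non-blocking, request-handling side of the application...
  val webSystem = ActorSystem("web")
  // ...and a separate one whose actors are allowed to block on the database driver.
  // If the database side grinds to a halt, webSystem keeps serving requests and can
  // tell the user that database communication is down.
  val dbSystem = ActorSystem("database")
}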
As with everything, each ActorSystem comes with a cost, so you should only do it if you need it.
As @dk14 and @Andreas have already said, an ActorSystem allows you to share resources (thread pools, Akka system configuration, dead letters, actor namespace, event buses).
From a sharing perspective it makes sense to have one ActorSystem per JVM and different dispatchers per logical module. To get the most out of your Akka actors it is critical that you tune your dispatcher settings to match 1) your application workload and 2) your hardware (number of cores). For example, if you have some actors doing network IO, they should have their own dedicated dispatchers.
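For example (a rough sketch with made-up dispatcher names and pool sizes, not tuned values), a dedicated thread-pool dispatcher for blocking IO actors can be declared in the configuration and assigned per actor:

import akka.actor.{Actor, ActorSystem, Props}
import com.typesafe.config.ConfigFactory

class WorkerActor extends Actor { def receive = { case _ => () } } // CPU-bound work
class IoActor extends Actor { def receive = { case _ => () } }     // blocking network IO

object DispatcherSketch extends App {
  // Hypothetical dispatcher for blocking IO, kept separate from the default one.
  val config = ConfigFactory.parseString(
    """io-dispatcher {
      |  type = Dispatcher
      |  executor = "thread-pool-executor"
      |  thread-pool-executor.fixed-pool-size = 16
      |  throughput = 1
      |}""".stripMargin).withFallback(ConfigFactory.load())

  val system = ActorSystem("app", config)
  val worker = system.actorOf(Props[WorkerActor], "worker")                         // default dispatcher
  val io     = system.actorOf(Props[IoActor].withDispatcher("io-dispatcher"), "io") // dedicated dispatcher
}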
You should also consider carefully how many JVMs you want to run on a physical node. For example, if you have a host with 256/512 GB of RAM, running a single JVM may not be the best configuration. On the other hand, a physical machine/VM with 64 GB of RAM will do fine with just one JVM instance.

NUMA awareness of JVM

My question concerns the extent to which a JVM application can exploit the NUMA layout of a host.
I have an Akka application in which actors concurrently process requests by combining incoming data with 'common' data already loaded into an immutable (Scala) object. The application scales well in the cloud, using many dual-core VMs, but performs poorly on a single 64-core machine. I presume this is because the common data object resides in one NUMA cell and many threads concurrently accessing it from other cells is too much for the interconnects.
If I run 64 separate JVM applications, each containing 1 actor, then performance is good again. A more moderate approach might be to run as many JVM applications as there are NUMA cells (8 in my case), giving the host OS a chance to keep the threads and memory together?
But is there a smarter way to achieve the same effect within a single JVM? E.g. if I replaced my common data object with several instances of a case class, would the JVM have the capability to place them on the optimal NUMA cell?
Update:
I'm using Oracle JDK 1.7.0_05, and Akka 2.1.4
I've now tried the UseNUMA and UseParallelGC JVM options. Neither seemed to have any significant impact on the slow performance when using one or a few JVMs. I've also tried using a PinnedDispatcher and the thread-pool-executor, with no effect. I'm not sure the configuration is having an effect though, since there seems to be nothing different in the startup logs.
The biggest improvement remains when I use a single JVM per worker (~50). However, the problem with this appears to be that there is a long delay (up to a couple of minutes) before the FailureDetector registers the successful exchange of 'first heartbeat' between Akka cluster JVMs. I suspect there is some other issue here that I've not yet uncovered. I already had to increase ulimit -u since I was hitting the default maximum number of processes (1024).
Just to clarify, I'm not trying to achieve large numbers of messages, just trying to have lots of separate actors concurrently access an immutable object.
I think if you are sure the problem is not in your message-processing algorithms, then you should take into account not only the NUMA option but the whole environment configuration, starting from the JVM version (the latest is better, and Oracle JDK also mostly performs better than OpenJDK), then JVM options (including GC, memory, and concurrency options, etc.), then Scala and Akka versions (the latest release candidates and milestones can be much better), and also the Akka configuration.
From here you can borrow all the things that matter to get 50M messages per second of total throughput for Akka actors on contemporary laptops.
I never had a chance to run these benchmarks on a 64-core server, so any feedback will be greatly appreciated.
One finding that may help: current implementations of ForkJoinPool increase message-send latency as the number of threads in the pool increases. This is greatly noticeable in cases where the rate of request-response calls between actors is high, e.g. on my laptop, when increasing the pool size from 4 to 64, the message-send latency of Akka actors for such cases grows up to 2-3x for most executor services (Scala's ForkJoinPool, JDK's ForkJoinPool, ThreadPoolExecutor).
You can check if there are any differences by running mvnAll.sh with the benchmark.parallelism system variable set to different values.

Perl Distributed parallel computing

I would like to know if there are any Perl modules available to enable distributed parallel computation similar to Apache Hadoop.
Example,
A Perl script to be executed on many machines in parallel when submitted to a client node.
I'm the author of the Many-core Engine for Perl.
During the next several weekends, I will take MCE for a spin with Gearman::XS. MCE is good at maximizing available cores on a given node. Gearman is good at job distribution and includes a lot of features such as load balancing. Combining the two together was my thought for scaling MCE horizontally across many nodes. :) I did not share this news with anybody until just now.
Why are the two modules a good fit (my humble opinion):
For distribution, one needs some sort of chunking engine. MCE is a chunking engine -- so breaking up input is natural to MCE. Essentially MCE can be used at both sides, the job submission host as well as on the worker node for maximizing available cores.
For worker nodes, MCE follows a bank-queuing model when processing input data. This helps guarantee that all CPUs remain busy from the start of the job till the very end. As workers begin to idle down, the remaining "workers" are processing their very last chunk.
One's imagination is the limit here -- there are so many possibilities with these 2 modules working together. When writing MCE, I first focused on the node side. Job distribution was the obvious next step, and after a search I came across Gearman::XS. The 2 modules can chunk happily together :) -- bigger chunks on the job distribution side, smaller chunks once on the node. All the networking stuff is handled by Gearman.
Basically, there's no need for me to write the job distribution aspect when Gearman::XS is already quite nice. This has been my plan. I will write about Gearman::XS + MCE soon.
BTW: Folks can do similar things with GRID-Machine + MCE I imagine. MCE's beauty is on maximizing all available cores on any given node.
Another magical thing about MCE is that one may not want 200 nodes * 16 workers all reading/writing from/to the NFS server, for example. That would impact the NFS server greatly. BTW: RHEL 6.4 will include pNFS (parallel NFS). With MCE, workers can call the "do" method to serialize writes/reads from NFS. So instead of 200 * 16 = 3200 workers hitting NFS, it becomes just 200 requests maximum against the NFS server at any given time (1 per physical node).
MCE was written so that it can be applied gracefully to many scenarios. I need to add more wikis to MCE's home page (MCE at code.google.com). In addition, MCE eats really big log files for breakfast :) Check out egrep.pl and wc.pl under the examples dir. It even beats the wide finder project with sequential IO (powerful slurp IO among many workers).
Check out the images included with the MCE distribution. Oh, do not forget to check out the main Gearman site as well.
What's left after this? Humm, the web piece. One idea which comes to mind is to use Mojo. There are many options. This is just one:
Gearman::XS + MCE + Mojolicious
Again, one can use GRID-Machine instead of Gearman::XS if wanting to communicate through SSH.
Anyway, that was my plan to use an already available job distribution module. For MCE, my focus was on maximizing performance on a single node -- to include chunking, serializing, bank-queuing model, user tasks (allows for many roles), number sequencing among workers, and sequential slurp IO.
-- mario
You might look into something as simple as a message queue like ZeroMQ. I'm sure a CPAN search could help with some other suggestions.
Recently there has been some talk of the Many-core Engine (MCE) module, which you might want to investigate. I don't know for sure that it lets you parallelize off the host computer, but it seems like it wouldn't be a big step given its stated purpose.
Argon may provide what you are looking for (disclaimer - I'm the author). It allows you to set up an arbitrary network of workers, each of which runs a process pool (using Coro::ProcessPool).
Creating a task is pretty simple:
use Argon::Client;

my $client = Argon::Client->new(host => "somehost", port => 8000);
my $result = $client->queue(sub {
    use My::Work::Module qw(do_work);
    my $task_id = shift;
    do_work($task_id);
});
The GRID module on CPAN is designed for working with distributed computing.
https://metacpan.org/pod/distribution/GRID-Machine/lib/GRID/Machine/perlparintro.pod

Basics difference between distributed computing and interprocess communication?

I know the theoretical definitions of distributed computing and interprocess communication.
But in practice I am not able to come to a conclusion about when to go for distributed computing versus interprocess communication.
Tell me some scenarios, with examples, where we would go for distributed computing or interprocess communication.
1) Interprocess communication basically means communication between processes.
Mostly this concept comes up when studying parallel programming and the workings of operating systems.
This topic is too huge to explain here; it's a full subject. Try googling "interprocess communication" and read the basic definitions.
2) Distributed computing:
My initial understanding is this:
Imagine an office. Why does it have several employees in one department? Because many brains and much manpower are needed to bring one task to completion. One man could do the job, but it might take days, and what if he gets sick? So: distributed...
Now, how do the processes/people doing their independent parts of the job communicate, across different computers, different CPUs of the same computer, or different cabins of the same office building?
"Shout!! Hey, I have done my work, take the result and send more?? Who is in charge here!!"
No, right?
So here comes the INTERPROCESS COMMUNICATION subject.
Note: please remember that I am also still learning :-) so do not take the above as correct without doing your own googling; I am not responsible for any errors.
Interprocess communication is typically defined as communication between multiple processes on a single machine. Distributed computing is multiple processes being distributed across a network and executed on desired host boxes. To me it makes sense to implement a desired interprocess communication in the same fashion as the distributed processes transmit their results back to the distributor/ host. That way a weaker machine continues to be able to process data while a more powerful box runs a greater load.