How to limit the number of threads available to a scala/akka app - scala

I wrote an application that uses scala parallel collections and akka actors, and I would now like to study its "strong scaling" properties, i.e. how the running time for a given problem instance changes, as a function of the number of cores/threads available.
What would be a proper way of going about this? How can I tell the application to use only up to n cores/threads?

Related

How to use DataFrames within SparkListener?

I've written a CustomListener (deriving from SparkListener, etc...) and it works fine, I can intercept the metrics.
The question is about using the DataFrames within the listener itself, as that assumes the usage of the same Spark Context, however as of 2.1.x only 1 context per JVM.
Suppose I want to write to disk some metrics in json. Doing it at ApplicationEnd is not possible, only at the last jobEnd (if you have several jobs, the last one).
Is that possible/feasible???
I'm trying to measure the perfomance of jobs/stages/tasks, record that and then analyze programmatically. May be that is not the best way?! Web UI is good - but I need to make things presentable
I can force the creation of dataframes upon endJob event, however there are a few errors thrown (basically they refer to not able to propogate events to the listener) and in general I would like to avoid unnecessary manipulations. I want to have a clean set of measurements that I can record and write to disk
SparkListeners should be as fast as possible as a slow SparkListener would block others to receive events. You could use separate threads to release the main event dispatcher thread, but you're still bound to the limitation of having a single SparkContext per JVM.
That limitation is however easily to overcome since you could ask for the current SparkContext using SparkContext.getOrCreate.
I'd however not recommend the architecture. That puts too much pressure on the driver's JVM that should rather "focus" on the application processing (not collecting events that probably it already does for web UI and/or Spark History Server).
I'd rather use Kafka or Cassandra or some other persistence storage to store events to and have some other processing application to consume them (just like Spark History Server works).

Concurrency, how to create an efficient actor setup?

Alright so I have never done intense concurrent operations like this before, theres three main parts to this algorithm.
This all starts with a Vector of around 1 Million items.
Each item gets processed in 3 main stages.
Task 1: Make an HTTP Request, Convert received data into a map of around 50 entries.
Task 2: Receive the map and do some computations to generate a class instance based off the info found in the map.
Task 3: Receive the class and generate/add to multiple output files.
I initially started out by concurrently running task 1 with 64K entries across 64 threads (1024 entries per thread.). Generating threads in a for loop.
This worked well and was relatively fast, but I keep hearing about actors and how they are heaps better than basic Java threads/Thread pools. I've created a few actors etc. But don't know where to go from here.
Basically:
1. Are actors the right way to achieve fast concurrency for this specific set of tasks. Or is there another way I should go about it.
2. How do you know how many threads/actors are too many, specifically in task one, how do you know what the limit is on number of simultaneous connections is (Im on mac). Is there a golden rue to follow? How many threads vs how large per thread pool? And the actor equivalents?
3. Is there any code I can look at that implements actors for a similar fashion? All the code Im seeing is either getting an actor to print hello world, or super complex stuff.
1) Actors are a good choice to design complex interactions between components since they resemble "real life" a lot. You can see them as different people sending each other requests, it is very natural to model interactions. However, they are most powerful when you want to manage changing state in your application, which does not seem to be the case for you. You can achieve fast concurrency without actors. Up to you.
2) If none of your operations is blocking the best rule is amount of threads = amount of CPUs. If you use a non blocking HTTP client, and NIO when writing your output files then you should be fully non-blocking on IOs and can just safely set the thread count for your app to the CPU count on your machine.
3) The documentation on http://akka.io is very very good and comprehensive. If you have no clue how to use the actor model I would recommend getting a book - not necessarily about Akka.
1) It sounds like most of your steps aren't stateful, in which case actors add complication for no real benefit. If you need to coordinate multiple tasks in a mutable way (e.g. for generating the output files) then actors are a good fit for that piece. But the HTTP fetches should probably just be calls to some nonblocking HTTP library (e.g. spray-client - which will in fact use actors "under the hood", but in a way that doesn't expose the statefulness to you).
2) With blocking threads you pretty much have to experiment and see how many you can run without consuming too many resources. Worry about how many simultaneous connections the remote system can handle rather than hitting any "connection limits" on your own machine (it's possible you'll hit the file descriptor limit but if so best practice is just to increase it). Once you figure that out, there's no value in having more threads than the number of simultaneous connections you want to make.
As others have said, with nonblocking everything you should probably just have a number of threads similar to the number of CPU cores (I've also heard "2x number of CPUs + 1", on the grounds that that ensures there will always be a thread available whenever a CPU is idle).
With actors I wouldn't worry about having too many. They're very lightweight.
If you have really no expierience in Akka try to start with something simple like doing a one-to-one actor-thread rewriting of your code. This will be easier to grasp how things work in akka.
Spin two actors at the begining one for receiving requests and one for writting to the output file. Then when request is received create an actor in request-receiver actor that will do the computation and send the result to the writting actor.

Is Communicating Sequential Processes [CSP] an alternative to the actor model in Scala?

In a 1978 Paper by Hoare we have an idea called Communicating Sequential Processes. This is used by Go, Occam, and in Clojure in core.async.
Is it possible to use CSP as an alternative to the Actor Model in Scala? (I'm seeing JCSP but I'm wondering if this is the only option, if it is mature, and if anyone uses it).
EDIT - I'm also seeing Communicating Scala Objects as an alternative to JCSP in Scala. But those of these seem to be tied to real threads - which seems to miss one of the benefits of CSP, being to get away from the memory resource cost of keeping large numbers of threads always active.
You should consult this document, but in general there are a few differences:
Channels are anonymous while actors have identities
In CSP, you use channels to transmit messages, but actors can directly contact each other.
In CSP communication is done in the form of rendezvous (i.e., it is synchronous). Actors support asynchronous message passing.
And yes, it is possible to use CSP as an alternative to the Actor model if these differences are acceptable in your position. I don't have any experience with JCSP but I wouldn't recommend using that specific library (the reason is as I see there aren't any activity in the project since 2011).

Using Akka Actors/Remoting to distribute graph algorithms across a cluster

So I'm currently working on distributing my Scala code across multiple machines for a large graph ("part 1" of the question) and am currently working with the Akka framework in the hopes of using Actors and Remoting.
I read the document here and it seems the way the example was done can be extended to do what I want to do, but I have a few concerns regarding this method...
1) How do we decide how many instances of Actors we should create? Do we have to do a trial/error thing to see which is the best, or is there some more intuitive way to go about it?
2) I am thinking of doing my task similarly to how the example was done - with a Master that spawns several Workers and communicate using case classes as messages. What I want to do is to find some metric between pairs of vertices (random walk), for all-pairs. I have a graph class that implements a method to calculate the metric given two vertices.
I will give each Worker two vertices 'u' and 'v' to calculate the metric for, and have workers return the value.
When the Master sends messages to Workers to calculate the metric, the Worker needs the graph structure - do I just do this by including the graph structure (i.e. adjacency list that is a HashMap) in the message? Will this cause any overhead by copying the graph structure each time, or do all workers just share that graph, or is there a better way to go about this?
3) Does the algorithm for calculating the metric between pairs of vertices need to be re-implemented to the extended Actor class, or is there a way for individual Actors to access the same graph structure to call the method (I guess this is similar to the question above about passing the entire graph structure as part of the message)?
Thanks!
Regards,
-kstruct
1) How do we decide how many instances of Actors we should create?
Although actors abstract the underlying threading management, creating less actors than available CPU cores is wasting the computational power. If you have 10 servers, 8 cores each, create at least 80 actors, 8 per machine.
If the algorithm is CPU intensive, creating more won't give you a performance boost - extra workers will simply wait for available core.
[...] Worker needs the graph structure - do I just do this by including the graph structure (i.e. adjacency list that is a HashMap) in the message? [...]
There is no overhead if all your actors live in the same JVM - you are simply passing a reference to the graph structure in a message. However in distributed environment this will cause the graph to be serialized and sent over wire - probably a lot of data.
Consider sharing this data structure by all actors.
I don't understand question 3.

Process work in parallel with non-threadsafe function in scala

I have a lot of work (thousands of jobs) for a Scala application to process. Each piece of work is the file name of a 100 MB file. To process each file, I need to use an extractor object that is not thread safe (I can have multiple copies, but copies are expensive, and I should not make one per job). What is the best way to complete this work in parallel in Scala?
You can wrap your extractor in an Actor and send each file name to the actor as a message. Since an instance of an actor will process only one message at a time, thread safety won't be an issue. If you want to use multiple extractors, just start multiple instances of the actor and balance between them (you could write another actor to act as a load balancer).
The extractor actor(s) can then send extracted files to other actors to do the rest of the processing in parallel.
Don't make 1000 jobs, but make 4x250 jobs (targeting 4 threads) and give one extractor to each batch. Inside each batch, work sequentially. This might not be optimal parallel-wise, since one batch might finish earlier but it is very easy to implement.
Probably the correct (but more complicated) solution would be to make a pool of extractors, where jobs take extractors from and put them back after finishing.
I would make a thread pool, where each thread has an instance of the extractor class, and instantiate just as many of these threads as it takes to saturate the system (based on CPU usage, IO bandwidth, memory bandwidth, network bandwidth, contention for other shared resources, etc.). Then use a thread-safe work queue that these threads can pull tasks from, process them, and iterate until the container is empty.
Mind you, there should be one or several libraries in just about any modern language that implements exactly this. In C++, it would be Intel's Threading Building Blocks. In Objective-C, it would be Grand Central Dispatch.
It depends: what's the relative amount of CPU consumed by the extractor for each job ?
If it is very small, you have a classic single-producer/multiple-consumer problem for which you can find lots of solution in different languages. For Scala, if you are reluctant to start using actors, you can still use the Java API (Runnable, Executors and BlockingQueue, are quite good).
If it is a substantial amount (more than 10%), you app will never scale with a multithread model (see Amdhal law). You may prefer to run several process (several JVM) to obtain thread safety, and thus eliminate the non-sequential part.
First question: how quick does the work need to be completed?
Second question: would this work be isolated to a single physical box or what are your upper bounds on computational resource.
Third question: does the work that needs doing to each individual "job" require blocking and is it serialised or could be partitioned into parallel packets of work?
Maybe think about a distributed model whereby you scale through designing with a mind to pushing out across multiple nodes from the first instance, actors, remoteref all that crap first...try and keep your logic simple and easy - so serialised. Don't just think in terms of a single box.
Most answers here seem to dwell on the intricacies of spawning thread pools and executors and all that stuff - which is fine, but be sure you have a handle on the real problem first, before you start complicating your life with lots of thinking around how you manage the synchronisation logic.
If a problem can be decomposed, then decompose it. Don't overcomplicate it for the sake of doing so - it leads to better engineered code and less sleepless nights.