How to control the number of parallel processes with .par in Scala?

I am using the par.map expression to execute processes in parallel in Scala (SBT).
Consider list.par.map(function(_)) (I am preparing an MWE). This means that function(_) should be applied to all the elements of the list in a parallel fashion. In my example, list has 3 elements. But Scala executes only function(list(1)) and function(list(2)) in parallel, and only afterwards function(list(3)).
Is there a reason for this behaviour? Is it related to the fact that the programme is executed on a two-core processor? And how could you force all three calls to run in parallel?

This question has been asked before:
scala parallel collections degree of parallelism
How to set the number of threads to use for par
and is well documented:
http://docs.scala-lang.org/overviews/parallel-collections/configuration.html
What you want is something like this (parlevel being the parallelism level you want, e.g. 3):
import scala.collection.parallel.ForkJoinTaskSupport

val parallelList = list.par
parallelList.tasksupport = new ForkJoinTaskSupport(
  new scala.concurrent.forkjoin.ForkJoinPool(parlevel))
parallelList.map(function(_))
That said, if you're running on a 2-core processor you only have two hardware threads (unless the cores are hyper-threaded, of course), meaning you can't have more than 2 operations actually executing at the same instant.
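As a quick sanity check, the JVM can report how many hardware threads it sees; the default parallelism level of a .par collection's pool is derived from this number. A minimal snippet:
// Hardware threads visible to the JVM; the default .par task support
// sizes its parallelism level from this value.
println(s"Available processors: ${Runtime.getRuntime.availableProcessors}")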

Related

How to run two spark jobs in parallel in standalone mode [duplicate]

This question already has answers here: How to run multiple Spark jobs in parallel?
I have a Spark job in which I process a file and then do the following steps:
1. Load the file into DataFrame
2. Push the DataFrame to elasticsearch
3. Run some aggregations on dataframe and save to cassandra
I have written a Spark job for this with the following function calls:
writeToES(df)
writeToCassandra(df)
Currently these two operations run one after the other, although they could run in parallel.
How can I do this in a single Spark job?
I could make two Spark jobs, one for writing to ES and one to Cassandra, but they would use multiple ports, which I want to avoid.
You cannot run these two actions as a single Spark job, since each action triggers its own job. What you're really looking for is running those two jobs in parallel within the same application.
As the documentation says, you can run multiple jobs in parallel in the same application if those jobs are submitted from different threads:
Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads. By “job”, in this section, we mean a Spark action (e.g. save, collect) and any tasks that need to run to evaluate that action. Spark’s scheduler is fully thread-safe and supports this use case to enable applications that serve multiple requests (e.g. queries for multiple users).
In other words, the following should run both actions in parallel (this example uses the CompletableFuture API, but you could use any async execution or multithreading mechanism):
CompletableFuture.runAsync(() -> writeToES(df));
CompletableFuture.runAsync(() -> writeToCassandra(df));
You can then join on one or both of these two to wait for completion. As noted in the documentation, you need to pay attention to the configured scheduler mode. Using the FAIR scheduler allows you to run the above in parallel:
conf.set("spark.scheduler.mode", "FAIR")

Are two transformations on the same RDD executed in parallel in Apache Spark?

Let's say we have the following Scala program:
val inputRDD = sc.textFile("log.txt")
inputRDD.persist()
val errorsRDD = inputRDD.filter(x => x.contains("error"))
val warningsRDD = inputRDD.filter(x => x.contains("warning"))
println("Errors: " + errorsRDD.count() + ", Warnings: " + warningsRDD.count())
We create a simple RDD, persist it, perform two transformations on the RDD and finally have an action which uses the RDDs.
When the print is called, the two transformations are executed; each transformation is of course parallelised internally, depending on the cluster configuration.
My main question is: are the two actions (and their transformations) executed in parallel or in sequence? That is, does errorsRDD.count() execute first and warningsRDD.count() only afterwards?
I'm also wondering if there is any point in using persist in this example.
All standard RDD methods are blocking (with the exception of AsyncRDDActions), so the actions will be evaluated sequentially. It is possible to execute multiple actions concurrently using non-blocking submission (threads, Futures) with a correct configuration of the in-application scheduler, or with explicitly limited resources for each action.
Regarding the cache, it is impossible to answer without knowing the context. Depending on the cluster configuration, storage, and data locality it might be cheaper to load the data from disk again, especially when resources are limited and subsequent actions might trigger the cache cleaner.
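As one concrete illustration of the non-blocking route, Spark's AsyncRDDActions (available on any RDD through an implicit conversion) return futures, so the two counts can be submitted without blocking the driver. A sketch, assuming the RDDs from the question:
import scala.concurrent.Await
import scala.concurrent.duration.Duration

// countAsync submits each count as its own Spark job without blocking the driver
val errorsCountF = errorsRDD.countAsync()
val warningsCountF = warningsRDD.countAsync()

// Wait for both results; whether the jobs actually overlap depends on the
// scheduler mode and the resources available to the application
val errors = Await.result(errorsCountF, Duration.Inf)
val warnings = Await.result(warningsCountF, Duration.Inf)
println(s"Errors: $errors, Warnings: $warnings")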
This will execute errorsRDD.count() first then warningsRDD.count().
The point of using persist here is that when the first count is executed, inputRDD will be kept in memory.
For the second count, Spark won't need to re-read the whole content of the file from storage again, so that count executes much faster than the first.

What happens if I use scala parallel collections within a spark job?

What happens if I use Scala parallel collections within a Spark job (which typically spawn tasks to process partitions of the collection on multiple threads)? Or, for that matter, a job that potentially starts sub-threads?
Does Spark's JVM limit execution to a single core, or can it sensibly distribute the work across many cores (presumably on the same node)?
We use Scala parallel collections extensively inside Spark's rdd.mapPartitions(...) function. It works perfectly for us; we are able to scale IO-intensive jobs very well (calling Redis/HBase/etc.).
BIG WARNING: Scala parallel collections are not lazy! When you construct the parallel collection it actually brings all rows from Iterator[Row] into memory. We use it mostly in a Spark Streaming context, so it's not an issue for us, but it is a problem when we want to process, for example, a huge HBase table with Spark:
private def doStuff(rows: Iterator[Row]): Iterator[Row] = {
  // Materialises the whole partition, then processes it in parallel
  val pit = rows.toIterable.par
  pit.tasksupport = new ExecutionContextTaskSupport(
    ExecutionContext.fromExecutor(....))
  pit.map(row => transform(row)).toIterator
}
rdd.mapPartitions(doStuff)
We use ExecutionContextTaskSupport to put all the computations into a dedicated thread pool instead of using the default JVM-level ForkJoin pool.
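The executor passed to ExecutionContext.fromExecutor is elided above; one possible way to build such a dedicated pool (the pool size and names here are purely illustrative) is:
import java.util.concurrent.Executors
import scala.collection.parallel.ExecutionContextTaskSupport
import scala.concurrent.ExecutionContext

// Illustrative only: a dedicated 16-thread pool for the IO-bound calls,
// kept separate from the JVM-wide default ForkJoin pool
val ioPool = Executors.newFixedThreadPool(16)
val ioTaskSupport = new ExecutionContextTaskSupport(
  ExecutionContext.fromExecutor(ioPool))
This task support can then be assigned to pit.tasksupport exactly as in the snippet above.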

Scala 2.11.x concurrency: pool of workers doing something similar to map-reduce?

What is the idiomatic way to implement a pool of workers in Scala, such that work units coming from some source can be allocated to the next free worker and processed asynchronously? Each worker would produce a result and eventually, all the results would need to get combined to produce the overall result.
We do not know the number of work units on which we need to run a worker in advance and we do not know in advance the optimal number of workers, because that will depend on the system we run on.
So roughly what should happen is this:
for each work unit, eventually start a worker to process it
for each finished worker, combine its result into the global result
return the global result after all the worker results have been combined
Should this be done exclusively by futures, no matter how many work units and how many workers there will be? What if the results can only be combined when they are ALL available?
Most examples of futures I have seen have a fixed number of futures and then use a for comprehension to combine them, but what if the number of futures is not known and I just have, e.g., a collection of an arbitrary number of futures? What if there are billions of small work units to process that way, versus just a few dozen long-running ones?
Are there other, better ways to do this, e.g. with Actors instead?
How would the design ideally change when the results of the workers do not need to be combined and each worker is completely independent of the others?
There are too many questions in your question to address them all.
Basically, Futures will do what you want; you can create an ExecutionContext that best fits your needs. To combine the results: Future.sequence.
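A minimal sketch of that suggestion, assuming a hypothetical process function as the worker and summation as the combining step; the number of workers running at once is bounded by the ExecutionContext's thread pool:
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// Hypothetical worker: turns one work unit into one partial result
def process(unit: String): Int = unit.length

// Pool size chosen only for illustration; in practice derive it from the
// machine, e.g. Runtime.getRuntime.availableProcessors
implicit val ec: ExecutionContext =
  ExecutionContext.fromExecutor(Executors.newFixedThreadPool(8))

val workUnits: Seq[String] = Seq("a", "bb", "ccc") // arbitrary, unknown size

// One Future per work unit; Future.traverse completes only when ALL results are available
val allResults: Future[Seq[Int]] =
  Future.traverse(workUnits)(unit => Future(process(unit)))

// Combine into the global result once everything has finished
val globalResult: Int = Await.result(allResults, Duration.Inf).sum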

What is the benefit of using Futures over parallel collections in scala?

Is there a good reason for the added complexity of Futures (vs parallel collections) when processing a list of items in parallel?
List(...).par.foreach(x=>longRunningAction(x))
vs
Future.traverse(List(...)) (x=>Future(longRunningAction(x)))
I think the main advantage would be that you can access the results of each future as soon as it is computed, while you would have to wait for the whole computation to be done with a parallel collection. A disadvantage might be that you end up creating lots of futures. If you later end up calling Future.sequence, there really is no advantage.
Parallel collections will kill off some threads as we get closer to processing all the items, so the last few items might be processed by a single thread.
Please see my question for more details on this behaviour: Using ThreadPoolTaskSupport as tasksupport for parallel collections in scala
Futures do no such thing, and all your threads are in use until all items are processed. Hence, unless your tasks are so small that you don't care about the loss of parallelism for the last few tasks, and you are using a huge number of threads that have to be killed off as soon as possible, Futures are better.
Futures become useful as soon as you want to compose your deferred / concurrent computations. Futures (the good kind, anyway, such as Akka's) are monadic and hence allow you to build arbitrarily complex computational structures with all the concurrency and synchronization handled properly by the Futures library.
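To make the difference concrete, a small sketch (assuming a stand-in longRunningAction): with Futures each result can be handled as soon as it arrives, and the futures compose monadically into further asynchronous steps, which parallel collections do not offer:
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Stand-in for longRunningAction(x) from the snippets above
def longRunningAction(x: Int): Int = { Thread.sleep(100); x * x }

val futures = List(1, 2, 3).map(x => Future(longRunningAction(x)))

// React to each result as soon as it is computed...
futures.foreach(_.foreach(r => println(s"done: $r")))

// ...and compose the futures monadically into a dependent async step
val combined: Future[Int] = for {
  results <- Future.sequence(futures)
  total   <- Future(results.sum)
} yield total

println(Await.result(combined, Duration.Inf))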