What happens if I use Scala parallel collections within a Spark job?

What happens if I use Scala parallel collections within a Spark job? (These typically spawn tasks to process chunks of the collection on multiple threads.) Or, for that matter, a job that potentially starts sub-threads?
Does Spark's JVM limit execution to a single core, or can it sensibly distribute the work across many cores (presumably on the same node)?

We use Scala parallel collections extensively in Spark's rdd.mapPartitions(...) function. It works perfectly for us; we are able to scale IO-intensive jobs very well (calling Redis/HBase/etc...).
BIG WARNING: Scala parallel collections are not lazy! When you construct the parallel iterator it actually brings all rows from the Iterator[Row] into memory. We use it mostly in a Spark Streaming context, so it's not an issue for us, but it is a problem when we want, for example, to process a huge HBase table with Spark.
private def doStuff(rows: Iterator[Row]): Iterator[Row] = {
  val pit = rows.toIterable.par
  pit.tasksupport = new ExecutionContextTaskSupport(ExecutionContext.fromExecutor(....))
  pit.map(row => transform(row)).toIterator
}
rdd.mapPartitions(doStuff)
We use ExecutionContextTaskSupport to run all the computations on a dedicated thread pool instead of the default JVM-level ForkJoin pool.
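For completeness, here is a minimal sketch with the elided executor filled in. The fixed pool of 16 threads is only an illustrative size for IO-bound work, and transform stands in for the caller's own per-row function:
import java.util.concurrent.Executors
import scala.concurrent.ExecutionContext
import scala.collection.parallel.ExecutionContextTaskSupport
import org.apache.spark.sql.Row
// Dedicated pool for the blocking IO calls; 16 threads is an arbitrary, illustrative size.
val ioContext = ExecutionContext.fromExecutor(Executors.newFixedThreadPool(16))
private def doStuff(rows: Iterator[Row]): Iterator[Row] = {
  val pit = rows.toIterable.par // note: this materializes the whole partition in memory
  pit.tasksupport = new ExecutionContextTaskSupport(ioContext)
  pit.map(row => transform(row)).toIterator // transform(...) is the caller's own function
}
rdd.mapPartitions(doStuff)
Note that on Scala 2.13+ the parallel collections live in the separate scala-parallel-collections module, and .par additionally requires import scala.collection.parallel.CollectionConverters._.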


In Spark, how objects and variables are kept in memory and across different executors?
I am using:
Spark 3.0.0
Scala 2.12
I am working on writing a Spark Structured Streaming job with a custom stream source. Before the execution of the Spark query, I create a bunch of metadata which is used by my Spark Streaming job.
I am trying to understand how this metadata is kept in memory across the different executors.
Example Code:
case class JobConfig(fieldName: String, displayName: String, castTo: String)
val jobConfigs: List[JobConfig] = build() // build the job configs
val query = spark
  .readStream
  .format("custom-streaming")
  .load
query
  .writeStream
  .trigger(Trigger.ProcessingTime(2, TimeUnit.MINUTES))
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    CustomJobExecutor.start(jobConfigs) // CustomJobExecutor does DataFrame transformations and saves the data in PostgreSQL.
  }
  .outputMode(OutputMode.Append())
  .start()
  .awaitTermination()
I need help understanding the following:
In the sample code, how will Spark keep "jobConfigs" in memory across different executors?
Is there any added advantage to broadcasting it?
What is the efficient way of keeping variables which can't be deserialized?
Local variables are copied for each task, whereas broadcast variables are copied only once per executor. From the docs:
Spark actions are executed through a set of stages, separated by distributed “shuffle” operations. Spark automatically broadcasts the common data needed by tasks within each stage. The data broadcasted this way is cached in serialized form and deserialized before running each task. This means that explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data or when caching the data in deserialized form is important.
This means that if your jobConfigs is large enough and the number of tasks and stages using the variable is significantly larger than the number of executors, or if deserialization is time-consuming, broadcast variables can make a difference. In other cases, they don't.
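As a rough sketch of the explicit-broadcast variant, reusing the JobConfig case class and jobConfigs value from the question (the commented access pattern is only illustrative):
// Broadcast once from the driver; each executor then keeps a single deserialized copy.
val jobConfigsBc = spark.sparkContext.broadcast(jobConfigs)
// In executor-side code (e.g. a mapPartitions closure or a UDF), read it via .value
// instead of capturing jobConfigs directly, so it is not shipped with every task:
// val configs = jobConfigsBc.value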

Are two transformations on the same RDD executed in parallel in Apache Spark?

Let's say we have the following Scala program:
val inputRDD = sc.textFile("log.txt")
inputRDD.persist()
val errorsRDD = inputRDD.filter(x => x.contains("error"))
val warningsRDD = inputRDD.filter(x => x.contains("warning"))
println("Errors: " + errorsRDD.count() + ", Warnings: " + warningsRDD.count())
We create a simple RDD, persist it, perform two transformations on the RDD and finally have an action which uses the RDDs.
When the print is called, the transformations are executed; each transformation is of course executed in parallel, depending on the cluster manager.
My main question is: are the two actions and their transformations executed in parallel or in sequence? Or does errorsRDD.count() execute first and then warningsRDD.count(), one after the other?
I'm also wondering if there is any point in using persist in this example.
All standard RDD methods are blocking (with the exception of AsyncRDDActions), so the actions will be evaluated sequentially. It is possible to execute multiple actions concurrently using non-blocking submission (threads, Futures) with correct configuration of the in-application scheduler, or with explicitly limited resources for each action.
Regarding cache, it is impossible to answer without knowing the context. Depending on the cluster configuration, storage, and data locality, it might be cheaper to load the data from disk again, especially when resources are limited and subsequent actions might trigger the cache cleaner.
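For example, a minimal sketch of the Futures approach, reusing the RDDs from the question (the timeouts are arbitrary):
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
// Each count is submitted from its own thread, so Spark can schedule both jobs concurrently.
val errorsCountF = Future { errorsRDD.count() }
val warningsCountF = Future { warningsRDD.count() }
println("Errors: " + Await.result(errorsCountF, 10.minutes) +
  ", Warnings: " + Await.result(warningsCountF, 10.minutes))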
This will execute errorsRDD.count() first, then warningsRDD.count().
The point of using persist here is that when the first count is executed, inputRDD will be kept in memory.
For the second count, Spark won't need to re-read the "whole" content of the file from storage again, so the execution time of this count will be much faster than the first.

Can we make the different transformation functions for Spark Streaming run on different servers?

val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
For the above example, we know there are two transformation functions. Both of them must run in the same process/server; however, I want to make the second transformation run on a different server from the first one to achieve scalability. Is that possible?
To clear things up: a Spark transformation is not an actual execution. Transformations in Spark are lazy, which means nothing gets executed until you call an action (e.g. save, collect). An action is a job in Spark.
So, based on the above, you can control jobs but you cannot control transformations. A Spark job is distributed across multiple executors by splitting the processed data (RDD) among them. Each executor applies the job (multiple transformations) to its split, and then the results are collected again. This significantly reduces network usage.
If you could do what you're asking about, the intermediate results (which you actually don't care about) would have to be transferred over the network, which in turn would add a great deal of network overhead.
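As a small illustration (assuming lines is an RDD[String]; for a DStream the same reasoning applies to the RDD of each batch):
val words = lines.flatMap(_.split(" ")) // lazy: only records the transformation
val pairs = words.map(word => (word, 1)) // lazy: pipelined with flatMap into the same tasks
val total = pairs.count() // action: one job; each task runs flatMap and map together on its own split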

How to control the number of parallel processes with .par in Scala?

I am using the par.map expression to execute processes in parallel in Scala (SBT).
Consider list.par.map(function(_)) (I am preparing an MWE). This means that function(_) should be applied to all the elements of the list in parallel. In my example, list has 3 elements, but Scala executes only function(list(1)) and function(list(2)) in parallel, and only afterwards function(list(3)).
Is there a reason for this behaviour? Is it related to the fact that the program is executed on a two-core processor? And how could you force all three calls to execute in parallel?
This question has been asked before:
scala parallel collections degree of parallelism
How to set the number of threads to use for par
and is well documented:
http://docs.scala-lang.org/overviews/parallel-collections/configuration.html
What you want is something like:
val parallelList = list.par
parallelList.tasksupport = new ForkJoinTaskSupport(
  new scala.concurrent.forkjoin.ForkJoinPool(parlevel))
parallelList.map(function(_))
That said, if you're running on a 2-core processor you only have two hardware threads (unless the cores are hyper-threaded, of course), meaning you can't have more than 2 operations truly running in parallel at once.
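Put together, a self-contained sketch for the three-element case might look like this. Here function is a hypothetical stand-in for the real work, and the pool class is java.util.concurrent.ForkJoinPool on Scala 2.12+ (on 2.11 use the scala.concurrent.forkjoin.ForkJoinPool shown above):
import java.util.concurrent.ForkJoinPool
import scala.collection.parallel.ForkJoinTaskSupport
def function(x: Int): Int = x * 2 // hypothetical placeholder for the real work
val list = List(1, 2, 3)
val parallelList = list.par
parallelList.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(3)) // parallelism level of 3
val results = parallelList.map(function(_))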

How to read and write multiple tables in parallel in Spark?

In my Spark application, I am trying to read multiple tables from an RDBMS, do some data processing, and then write multiple tables to another RDBMS, as follows (in Scala):
val reading1 = sqlContext.load("jdbc", Map("url" -> myurl1, "dbtable" -> mytable1))
val reading2 = sqlContext.load("jdbc", Map("url" -> myurl1, "dbtable" -> mytable2))
val reading3 = sqlContext.load("jdbc", Map("url" -> myurl1, "dbtable" -> mytable3))
// data processing
// ..............
myDF1.write.mode("append").jdbc(myurl2, outtable1, new java.util.Properties)
myDF2.write.mode("append").jdbc(myurl2, outtable2, new java.util.Properties)
myDF3.write.mode("append").jdbc(myurl2, outtable3, new java.util.Properties)
I understand that reading from one table can be parallelized using partitions. However, the read operations of reading1, reading2, and reading3 seem sequential, as do the write operations of myDF1, myDF2, and myDF3.
How can I read from the multiple tables (mytable1, mytable2, mytable3) in parallel, and also write to multiple tables in parallel (I assume the same logic applies)?
You can set the scheduler mode to FAIR and, as the docs below describe, submit the jobs from separate threads; they should then run in parallel.
https://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application
Scheduling Within an Application
Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads. By “job”, in this section, we mean a Spark action (e.g. save, collect) and any tasks that need to run to evaluate that action. Spark’s scheduler is fully thread-safe and supports this use case to enable applications that serve multiple requests (e.g. queries for multiple users).
By default, Spark’s scheduler runs jobs in FIFO fashion. Each job is divided into “stages” (e.g. map and reduce phases), and the first job gets priority on all available resources while its stages have tasks to launch, then the second job gets priority, etc. If the jobs at the head of the queue don’t need to use the whole cluster, later jobs can start to run right away, but if the jobs at the head of the queue are large, then later jobs may be delayed significantly.
Starting in Spark 0.8, it is also possible to configure fair sharing between jobs. Under fair sharing, Spark assigns tasks between jobs in a “round robin” fashion, so that all jobs get a roughly equal share of cluster resources. This means that short jobs submitted while a long job is running can start receiving resources right away and still get good response times, without waiting for the long job to finish. This mode is best for multi-user settings.
val conf = new SparkConf().setMaster(...).setAppName(...)
conf.set("spark.scheduler.mode", "FAIR")
val sc = new SparkContext(conf)
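With that setting in place, a minimal sketch of submitting the writes from separate threads, reusing myDF1/myDF2/myDF3 and the URLs above (the timeout is arbitrary):
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
// Each write is its own Spark job; submitting them from separate threads lets the FAIR
// scheduler run them concurrently (the JDBC reads are triggered lazily as part of each job).
val writes = Seq(
  Future { myDF1.write.mode("append").jdbc(myurl2, outtable1, new java.util.Properties) },
  Future { myDF2.write.mode("append").jdbc(myurl2, outtable2, new java.util.Properties) },
  Future { myDF3.write.mode("append").jdbc(myurl2, outtable3, new java.util.Properties) }
)
Await.result(Future.sequence(writes), 1.hour)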