How to read and write multiple tables in parallel in Spark? - scala

In my Spark application, I am trying to read multiple tables from an RDBMS, do some data processing, and then write multiple tables to another RDBMS, as follows (in Scala):
val reading1 = sqlContext.load("jdbc", Map("url" -> myurl1, "dbtable" -> mytable1))
val reading2 = sqlContext.load("jdbc", Map("url" -> myurl1, "dbtable" -> mytable2))
val reading3 = sqlContext.load("jdbc", Map("url" -> myurl1, "dbtable" -> mytable3))
// data processing
// ..............
myDF1.write.mode("append").jdbc(myurl2, outtable1, new java.util.Properties)
myDF2.write.mode("append").jdbc(myurl2, outtable2, new java.util.Properties)
myDF3.write.mode("append").jdbc(myurl2, outtable3, new java.util.Properties)
I understand that reading from one table can be parallelized using partitions. However, the read operations for reading1, reading2, and reading3 seem to run sequentially, as do the write operations for myDF1, myDF2, and myDF3.
How can I read from multiple tables (mytable1, mytable2, mytable3) in parallel, and also write to multiple tables in parallel (I assume the same logic applies)?

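You can set the scheduler mode to FAIR. Note that, as the documentation quoted below states, jobs only run concurrently within an application if they are submitted from separate threads.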
https://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application
Scheduling Within an Application
Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads. By “job”, in this section, we mean a Spark action (e.g. save, collect) and any tasks that need to run to evaluate that action. Spark’s scheduler is fully thread-safe and supports this use case to enable applications that serve multiple requests (e.g. queries for multiple users).
By default, Spark’s scheduler runs jobs in FIFO fashion. Each job is divided into “stages” (e.g. map and reduce phases), and the first job gets priority on all available resources while its stages have tasks to launch, then the second job gets priority, etc. If the jobs at the head of the queue don’t need to use the whole cluster, later jobs can start to run right away, but if the jobs at the head of the queue are large, then later jobs may be delayed significantly.
Starting in Spark 0.8, it is also possible to configure fair sharing between jobs. Under fair sharing, Spark assigns tasks between jobs in a “round robin” fashion, so that all jobs get a roughly equal share of cluster resources. This means that short jobs submitted while a long job is running can start receiving resources right away and still get good response times, without waiting for the long job to finish. This mode is best for multi-user settings.
val conf = new SparkConf().setMaster(...).setAppName(...)
conf.set("spark.scheduler.mode", "FAIR")
val sc = new SparkContext(conf)
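Setting the scheduler mode alone does not make the reads or writes concurrent; as the quoted documentation says, the actions have to be submitted from separate threads. A minimal sketch of that idea using Scala Futures follows. The helper name `runConcurrently`, the pool size, and the stand-in tasks are assumptions for illustration; in the question's code each task would wrap one `myDFn.write.mode("append").jdbc(...)` call.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

object ParallelActions {
  // Hypothetical helper: runs each task on a thread of a dedicated pool
  // and blocks until all of them have finished.
  def runConcurrently[A](tasks: Seq[() => A], threads: Int): Seq[A] = {
    val pool = Executors.newFixedThreadPool(threads)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)
    try {
      // Each Future runs its task on a pool thread, so the actions inside
      // the tasks are submitted from separate threads.
      val futures = tasks.map(task => Future(task()))
      Await.result(Future.sequence(futures), Duration.Inf)
    } finally pool.shutdown()
  }

  def main(args: Array[String]): Unit = {
    // Stand-ins for Spark actions such as
    //   myDF1.write.mode("append").jdbc(myurl2, outtable1, new java.util.Properties)
    val tasks: Seq[() => String] = Seq("outtable1", "outtable2", "outtable3").map { name =>
      () => { Thread.sleep(100); s"wrote $name" }
    }
    runConcurrently(tasks, threads = 3).foreach(println)
  }
}
```

Because DataFrame reads are lazy, each table is generally read as part of whichever action consumes it, so running the writes concurrently should also overlap the corresponding reads.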

Related

How to run two spark jobs in parallel in standalone mode [duplicate]

I have a Spark job in which I process a file and then perform the following steps:
1. Load the file into a DataFrame
2. Push the DataFrame to Elasticsearch
3. Run some aggregations on the DataFrame and save the results to Cassandra
I have written a Spark job for this which makes the following function calls:
writeToES(df)
writeToCassandra(df)
Currently these two operations run one after the other, although they could run in parallel.
How can I do this in a single Spark job?
I could create two separate Spark jobs, one writing to ES and one to Cassandra, but they would use multiple ports, which I want to avoid.
You cannot run these two actions as the same Spark job, since each action is its own job. What you're presumably looking for is running the two jobs in parallel within the same application.
As the documentation says, you can run multiple jobs in parallel in the same application if those jobs are submitted from different threads:
Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads. By “job”, in this section, we mean a Spark action (e.g. save, collect) and any tasks that need to run to evaluate that action. Spark’s scheduler is fully thread-safe and supports this use case to enable applications that serve multiple requests (e.g. queries for multiple users).
In other words, this should run both actions in parallel (using Java's CompletableFuture API here, but you can use any async execution or multithreading mechanism):
CompletableFuture.runAsync(() -> writeToES(df));
CompletableFuture.runAsync(() -> writeToCassandra(df));
You can then join on one or both of these futures to wait for completion. As noted in the documentation, you need to pay attention to the configured scheduler mode: under the default FIFO scheduler a large first job can delay the second, whereas the FAIR scheduler lets both jobs share the cluster's resources:
conf.set("spark.scheduler.mode", "FAIR")
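From Scala, the same submission with an explicit wait on both futures might look like the sketch below. `writeToES` and `writeToCassandra` are stand-ins here (the originals are not shown in the question), the `String` argument is a placeholder for the DataFrame, and the `done` queue exists only so the example has something observable.

```scala
import java.util.concurrent.{CompletableFuture, ConcurrentLinkedQueue}

object TwoWrites {
  // Records which stand-in writes completed; for demonstration only.
  val done = new ConcurrentLinkedQueue[String]()

  // Stand-ins for the question's functions, each of which would trigger a Spark action.
  def writeToES(df: String): Unit = { Thread.sleep(50); done.add(s"es:$df"); () }
  def writeToCassandra(df: String): Unit = { Thread.sleep(50); done.add(s"cassandra:$df"); () }

  def writeBoth(df: String): Unit = {
    // Runnable is spelled out so the code does not depend on lambda-to-Runnable conversion.
    val es = CompletableFuture.runAsync(new Runnable { def run(): Unit = writeToES(df) })
    val cassandra = CompletableFuture.runAsync(new Runnable { def run(): Unit = writeToCassandra(df) })
    // Block until both have finished; join() rethrows a failure as CompletionException.
    CompletableFuture.allOf(es, cassandra).join()
  }

  def main(args: Array[String]): Unit = {
    writeBoth("df")
    println(done)
  }
}
```

The order in which the two entries appear in `done` is not deterministic, since the two writes run on different threads.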

Are two transformations on the same RDD executed in parallel in Apache Spark?

Let's say we have the following Scala program:
val inputRDD = sc.textFile("log.txt")
inputRDD.persist()
val errorsRDD = inputRDD.filter(_.contains("error"))
val warningsRDD = inputRDD.filter(_.contains("warning"))
println("Errors: " + errorsRDD.count() + ", Warnings: " + warningsRDD.count())
We create a simple RDD, persist it, apply two transformations to it, and finally run an action that uses the resulting RDDs.
When the print is called, the transformations are executed; each transformation is of course parallelized, depending on the cluster manager.
My main question is: are the two actions and transformations executed in parallel or in sequence? In other words, does errorsRDD.count() execute first, followed by warningsRDD.count()?
I'm also wondering whether there is any point in using persist in this example.
All standard RDD methods are blocking (with the exception of AsyncRDDActions), so the actions will be evaluated sequentially. It is possible to execute multiple actions concurrently using non-blocking submission (threads, Futures), combined with a correctly configured in-application scheduler or explicitly limited resources for each action.
Regarding the cache, it is impossible to answer without knowing the context. Depending on the cluster configuration, storage, and data locality, it might be cheaper to load the data from disk again, especially when resources are limited and subsequent actions might trigger the cache cleaner.
This will execute errorsRDD.count() first, then warningsRDD.count().
The point of using persist here is that once the first count has executed, inputRDD will be in memory.
For the second count, Spark won't need to re-read the whole content of the file from storage, so it should run much faster than the first.

Can different transformation functions in Spark Streaming run on different servers?

val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
In the above example there are two transformation functions. Both of them must run in the same process/server; however, I want the second transformation to run on a different server from the first to achieve scalability. Is that possible?
To clear things up: a Spark transformation is not an actual execution. Transformations in Spark are lazy, which means nothing gets executed until you call an action (e.g. save, collect). Each action triggers a job in Spark.
So, based on the above, you can control jobs but you cannot control individual transformations. A Spark job is distributed across multiple executors by splitting the processed data (RDD) among them. Each executor applies the job's chain of transformations to its split, and the results are then collected. This significantly reduces network usage.
If you could do what you're asking about, the intermediate results (which you don't actually care about) would have to be transferred over the network, which in turn would add significant network overhead.
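The laziness and per-split pipelining described above can be imitated with a plain Scala `Iterator`. This is a loose analogy rather than Spark code: Spark computes each partition as an iterator, so chained transformations are applied record by record without materializing the intermediate collection. The input data and the `trace` buffer are made up for illustration.

```scala
import scala.collection.mutable.ArrayBuffer

object LazyPipeline {
  // Records the order in which the two functions are actually invoked.
  val trace = ArrayBuffer.empty[String]

  // Same two steps as in the question, applied to one "partition".
  def pipeline(partition: Iterator[String]): Iterator[(String, Int)] = {
    val words = partition.flatMap { line => trace += s"split($line)"; line.split(" ") }
    val pairs = words.map { word => trace += s"pair($word)"; (word, 1) }
    pairs // nothing has run yet: both steps are lazy
  }

  def main(args: Array[String]): Unit = {
    val pairs = pipeline(Iterator("a b", "c"))
    println(trace.isEmpty)        // checks that no work happened before consumption
    println(pairs.toList)         // consuming the iterator plays the role of an action
    println(trace.mkString(", ")) // shows the two steps interleaved per record
  }
}
```

The trace shows the split and pair steps interleaved rather than one step finishing before the other starts, which is why moving the second step to another server would only add a network hop for every intermediate record.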

What happens if I use scala parallel collections within a spark job?

What happens if I use Scala parallel collections within a Spark job (which typically spawn tasks to process partitions of the collection on multiple threads)? Or, for that matter, a job that potentially starts sub-threads?
Does Spark's JVM limit execution to a single core, or can it sensibly distribute the work across many cores (presumably on the same node)?
We use Scala parallel collections extensively inside Spark's rdd.mapPartitions(...) function. It works perfectly for us; we are able to scale IO-intensive jobs very well (calling Redis, HBase, etc.).
BIG WARNING: Scala parallel collections are not lazy! When you construct the parallel collection, it brings all rows from the Iterator[Row] into memory. We use it mostly in a Spark Streaming context, so it's not an issue for us. But it is a problem when we want, for example, to process a huge HBase table with Spark.
private def doStuff(rows: Iterator[Row]): Iterator[Row] = {
  // Converting to a parallel collection brings the whole partition into memory
  val pit = rows.toIterable.par
  pit.tasksupport = new ExecutionContextTaskSupport(ExecutionContext.fromExecutor(....))
  pit.map(row => transform(row)).toIterator
}
rdd.mapPartitions(doStuff)
We use ExecutionContextTaskSupport to run all computations in a dedicated thread pool instead of the default JVM-level ForkJoin pool.
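One way to keep the parallelism while limiting how much of a partition is held in memory is to process the iterator in fixed-size batches. This is a different technique from the `.par` approach above, sketched here with plain Scala Futures. The name `transformInBatches`, the batch size, and the pool size are assumptions, and `Row` and `transform` are replaced by a generic type and function.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

object BatchedParallel {
  // Hypothetical helper: only `batchSize` rows are materialized at a time,
  // and the rows of each batch are transformed concurrently on `ec`.
  def transformInBatches[A, B](rows: Iterator[A], batchSize: Int)(transform: A => B)(
      implicit ec: ExecutionContext): Iterator[B] =
    rows.grouped(batchSize).flatMap { batch =>
      val futures = batch.map(row => Future(transform(row)))
      // Wait for the whole batch before pulling more rows from the iterator.
      Await.result(Future.sequence(futures), Duration.Inf)
    }

  def main(args: Array[String]): Unit = {
    // Dedicated pool, in the same spirit as the ExecutionContextTaskSupport above.
    val pool = Executors.newFixedThreadPool(4)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)
    try println(transformInBatches(Iterator.range(0, 10), batchSize = 3)(_ * 2).toList)
    finally pool.shutdown()
  }
}
```

The result is still a lazy iterator, so it could be returned from `mapPartitions`. The trade-off is that each batch waits for its slowest row before the next batch starts.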

Apache-Spark Internal Job Scheduling

I came across the feature in Spark that allows you to schedule different tasks within a Spark context.
I want to use this feature in a program where I map my input RDD (from a text source) into a key-value RDD [K,V], then build a composite-key RDD [(K1,K2),V] and a filtered RDD containing some specific values.
The rest of the pipeline involves calling some statistical methods from MLlib on both RDDs, followed by a join operation and writing the result to disk.
I am trying to understand how Spark's internal fair scheduler will handle these operations. I tried reading the job scheduling documentation but got more confused by the concepts of pools, users, and tasks.
What exactly are pools? Are they certain 'tasks' that can be grouped together, or are they Linux users pooled into a group?
What are users in this context? Do they refer to threads, or to something like SQL context queries?
I guess it relates to how tasks are scheduled within a Spark context, but the documentation makes it seem like we are dealing with multiple applications with different clients and user groups.
Can someone please clarify this?
The whole pipelined procedure you described in your second paragraph:
map -> map -> map -> filter
will be handled in a single stage, just like a map() in MapReduce, if that is familiar to you. This is because there is no need to repartition or shuffle the data, since you place no requirements on the correlation between records. Spark chains as many transformations as possible into the same stage before creating a new one, because that is much more lightweight. More information on stage separation can be found in the RDD paper: Resilient Distributed Datasets, Section 5.1, Job Scheduling.
When the stage is executed, it forms one task set (the same task running in different threads, one per partition), and from Spark's perspective these tasks are scheduled simultaneously.
The fair scheduler is meant for scheduling unrelated task sets against each other, so it does not apply here.