How to run two Spark jobs in parallel in standalone mode [duplicate] - scala

This question already has answers here: How to run multiple Spark jobs in parallel? (3 answers). Closed 4 years ago.
I have a Spark job in which I process a file and then do the following steps:
1. Load the file into a DataFrame
2. Push the DataFrame to Elasticsearch
3. Run some aggregations on the DataFrame and save the results to Cassandra
I have written a single Spark job for this, containing the following function calls:
writeToES(df)
writeToCassandra(df)
Right now these two operations run one after the other, yet they could run in parallel. How can I do this in a single Spark job?
I could make two separate Spark jobs, one for writing to ES and one for Cassandra, but they would use multiple ports, which I want to avoid.

You cannot run these two actions through the same Spark job. What you are surely looking for is to run these two jobs in parallel within the same application.
As the documentation says, you can run multiple jobs in parallel in the same application if those jobs are submitted from different threads:
Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads. By “job”, in this section, we mean a Spark action (e.g. save, collect) and any tasks that need to run to evaluate that action. Spark’s scheduler is fully thread-safe and supports this use case to enable applications that serve multiple requests (e.g. queries for multiple users).
In other words, the following should run both actions in parallel (using the CompletableFuture API here, but you can use any async execution or multithreading mechanism):
import java.util.concurrent.CompletableFuture;

CompletableFuture<Void> esFuture = CompletableFuture.runAsync(() -> writeToES(df));
CompletableFuture<Void> cassandraFuture = CompletableFuture.runAsync(() -> writeToCassandra(df));
You can then join() on one or both of these futures to wait for completion. As noted in the documentation, you also need to pay attention to the configured scheduler mode; with the FAIR scheduler the two jobs share resources instead of one queuing behind the other:
conf.set("spark.scheduler.mode", "FAIR")

Related

Force Apache Flink to execute at a given point

It is my understanding that Apache Flink does not actually run the operations that you ask for until the result of those operations is needed for something. This makes it difficult to time exactly how long each operation takes, which is exactly what I am trying to do in order to compare its efficiency to Apache Spark. Is there a way to force it to run the operations when I want it to?
When running a Flink program, you define the topology of operators to be executed on a cluster. You trigger the job execution by calling env.execute, where env is either an ExecutionEnvironment or a StreamExecutionEnvironment. For batch jobs there is one exception: the API calls collect and print trigger an eager execution.
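For example, a minimal batch sketch (the sink path and job name are placeholders) in which nothing runs until env.execute, so that is the call to measure:

import org.apache.flink.api.scala._

object TimedJob {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    // Building the topology is lazy; nothing has executed yet.
    val doubled = env.fromElements(1, 2, 3, 4, 5).map(_ * 2)
    doubled.writeAsText("/tmp/flink-out") // placeholder sink path

    // Execution happens only here, so this is the span to time.
    val start = System.nanoTime()
    env.execute("timed-job")
    println(s"Job ran in ${(System.nanoTime() - start) / 1e6} ms")
  }
}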
You could also use the web UI to extract the runtime of the different operators: for each operator you can see when it was deployed and when it finished executing.

How to control the number of parallel processes with .par in Scala?

I am using the par.map expression to execute processes in parallel in Scala (SBT).
Consider list.par.map(function(_)) (I am preparing an MWE). This means that function(_) should be applied to all elements of the list in parallel. In my example, list has 3 elements, but Scala executes only function(list(1)) and function(list(2)) in parallel, and only afterwards function(list(3)).
Is there a reason for this behaviour? Is it related to the fact that the program runs on a two-core processor? And how could you force all three calls to execute in parallel?
This question has been asked before:
scala parallel collections degree of parallelism
How to set the number of threads to use for par
and is well documented:
http://docs.scala-lang.org/overviews/parallel-collections/configuration.html
What you want is something like:

import scala.collection.parallel.ForkJoinTaskSupport

val parallelList = list.par
parallelList.tasksupport = new ForkJoinTaskSupport(
  new scala.concurrent.forkjoin.ForkJoinPool(parlevel)) // parlevel = the desired degree of parallelism
parallelList.map(function(_))
That said, if you're running on a two-core processor you only have two hardware threads (unless the cores are hyper-threaded, of course), meaning no more than two operations can truly execute at the same time.
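To see the effect, a quick sketch (the sleep only makes the overlap visible; with a pool of three threads, all three elements should print "start" before any prints "end"):

import scala.collection.parallel.ForkJoinTaskSupport

val demo = List(1, 2, 3).par
demo.tasksupport = new ForkJoinTaskSupport(
  new scala.concurrent.forkjoin.ForkJoinPool(3))
demo.foreach { i =>
  println(s"start $i on ${Thread.currentThread.getName}")
  Thread.sleep(1000) // simulate work so the overlap is observable
  println(s"end $i")
}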

Total number of jobs in a Spark App

I already saw the question How to implement custom job listener/tracker in Spark? and checked the source code to find out how to get the number of stages per job, but is there any way to programmatically track the percentage of jobs that have completed in a Spark app?
I can probably get the number of finished jobs with the listeners, but I'm missing the total number of jobs that will be run.
I want to track the progress of the whole app, and it creates quite a few jobs, but I can't find that total anywhere.
Edit: I know there's a REST endpoint for getting all the jobs in an app, but:
I would prefer not to use REST but to get the count in the app itself (Spark running on AWS EMR/YARN; getting the address is probably doable, but I'd prefer not to), and
that REST endpoint seems to return only jobs that are running/finished/failed, not the total number of jobs.
After going through the source code a bit, I guess there's no way to see upfront how many jobs there will be, since I couldn't find any place where Spark does such an analysis upfront (as jobs are submitted by each action independently, Spark doesn't have a big picture of all the jobs from the start).
This kind of makes sense because of how Spark divides work into:
jobs - which are started whenever code running on the driver node encounters an action (e.g. collect(), take()) and are supposed to compute a value and return it to the driver
stages - which are composed of sequences of tasks between which no data shuffling is required
tasks - computations of the same type which can run in parallel on worker nodes
So Spark does need to know the stages and tasks upfront for a single job in order to create its DAG, but it doesn't need to create a DAG of jobs; it can just create them "as it goes".
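For the "finished jobs" half of the problem, a listener sketch is straightforward (it counts jobs as they start and finish; as discussed above, there is still no upfront total to divide by):

import java.util.concurrent.atomic.AtomicInteger
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerJobStart}

class JobCountListener extends SparkListener {
  val started   = new AtomicInteger(0)
  val completed = new AtomicInteger(0)

  override def onJobStart(jobStart: SparkListenerJobStart): Unit =
    started.incrementAndGet()

  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
    completed.incrementAndGet()
}

// Register it on the SparkContext before running any actions:
// sc.addSparkListener(new JobCountListener)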

How to read and write multiple tables in parallel in Spark?

In my Spark application, I am trying to read multiple tables from an RDBMS, do some data processing, and then write multiple tables to another RDBMS, as follows (in Scala):
val reading1 = sqlContext.load("jdbc", Map("url" -> myurl1, "dbtable" -> mytable1))
val reading2 = sqlContext.load("jdbc", Map("url" -> myurl1, "dbtable" -> mytable2))
val reading3 = sqlContext.load("jdbc", Map("url" -> myurl1, "dbtable" -> mytable3))
// data processing
// ..............
myDF1.write.mode("append").jdbc(myurl2, outtable1, new java.util.Properties)
myDF2.write.mode("append").jdbc(myurl2, outtable2, new java.util.Properties)
myDF3.write.mode("append").jdbc(myurl2, outtable3, new java.util.Properties)
I understand that reading from one table can be parallelized using partitions. However, the read operations of reading1, reading2, reading3 seem to run sequentially, as do the write operations of myDF1, myDF2, myDF3.
How can I read from the multiple tables (mytable1, mytable2, mytable3) in parallel, and also write to multiple tables in parallel (I assume the same logic applies)?
You can set the scheduler mode to FAIR; it should then run the tasks in parallel.
https://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application
Scheduling Within an Application
Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads. By “job”, in this section, we mean a Spark action (e.g. save, collect) and any tasks that need to run to evaluate that action. Spark’s scheduler is fully thread-safe and supports this use case to enable applications that serve multiple requests (e.g. queries for multiple users).
By default, Spark’s scheduler runs jobs in FIFO fashion. Each job is divided into “stages” (e.g. map and reduce phases), and the first job gets priority on all available resources while its stages have tasks to launch, then the second job gets priority, etc. If the jobs at the head of the queue don’t need to use the whole cluster, later jobs can start to run right away, but if the jobs at the head of the queue are large, then later jobs may be delayed significantly.
Starting in Spark 0.8, it is also possible to configure fair sharing between jobs. Under fair sharing, Spark assigns tasks between jobs in a “round robin” fashion, so that all jobs get a roughly equal share of cluster resources. This means that short jobs submitted while a long job is running can start receiving resources right away and still get good response times, without waiting for the long job to finish. This mode is best for multi-user settings.
val conf = new SparkConf().setMaster(...).setAppName(...)
conf.set("spark.scheduler.mode", "FAIR")
val sc = new SparkContext(conf)
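Note that, per the quoted documentation, the jobs still have to be submitted from separate threads for this to help. A sketch applying that to the question's reads and writes (reusing the question's own table and URL names):

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Kick off the three JDBC reads from separate threads.
val readFutures = Seq(mytable1, mytable2, mytable3).map { table =>
  Future { sqlContext.load("jdbc", Map("url" -> myurl1, "dbtable" -> table)) }
}
val Seq(reading1, reading2, reading3) = readFutures.map(Await.result(_, Duration.Inf))

// ... data processing ...

// Submit the three writes concurrently as well, then wait for all of them.
val writeFutures = Seq(
  Future { myDF1.write.mode("append").jdbc(myurl2, outtable1, new java.util.Properties) },
  Future { myDF2.write.mode("append").jdbc(myurl2, outtable2, new java.util.Properties) },
  Future { myDF3.write.mode("append").jdbc(myurl2, outtable3, new java.util.Properties) }
)
Await.result(Future.sequence(writeFutures), Duration.Inf)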

Apache-Spark Internal Job Scheduling

I came across the feature in Spark that allows you to schedule different tasks within a Spark context.
I want to implement this feature in a program where I map my input RDD (from a text source) into a key-value RDD [K,V], subsequently make a composite-key RDD [(K1,K2),V] and a filtered RDD containing some specific values.
The pipeline then involves calling some statistical methods from MLlib on both RDDs, followed by a join operation, and finally externalizing the result to disk.
I am trying to understand how Spark's internal fair scheduler will handle these operations. I tried reading the job-scheduling documentation but got more confused by the concepts of pools, users and tasks.
What exactly are the pools? Are they certain 'tasks' that can be grouped together, or are they Linux users pooled into a group?
What are users in this context? Do they refer to threads, or is it something like SQL context queries?
I guess it relates to how tasks are scheduled within a Spark context, but reading the documentation makes it seem like we are dealing with multiple applications with different clients and user groups.
Can someone please clarify this?
The whole pipelined procedure you described in paragraph 2:
map -> map -> map -> filter
will be handled in a single stage, much like a single map() in MapReduce if that is familiar to you. This is because there is no need to repartition or shuffle your data: you place no requirements on the correlation between records, so Spark chains as many transformations as possible into the same stage before creating a new one, which is much more lightweight. More information on stage separation can be found in the Resilient Distributed Datasets paper, Section 5.1 (Job Scheduling).
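A sketch of that chaining on an RDD (the input path and field layout are made up):

val pairs = sc.textFile("hdfs:///input.txt")  // hypothetical text source
  .map(_.split(","))                          // narrow transformation
  .map(a => ((a(0), a(1)), a(2).toDouble))    // composite key, still narrow
  .filter { case (_, v) => v > 0.0 }          // narrow as well

// All of the above is pipelined into one stage; a stage boundary only
// appears at a shuffle, e.g.:
val aggregated = pairs.reduceByKey(_ + _)     // a new stage starts here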
When the stage gets executed, it becomes one task set (the same task running on different partitions in different threads), and its tasks are scheduled simultaneously from Spark's perspective.
The fair scheduler, by contrast, is for arbitrating between unrelated task sets, so it is not really applicable here.
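For reference on the pool question: a pool is just a named scheduling group that a submitting thread opts into through a local property ("users" in the documentation simply means whoever submits jobs from those threads). The pool name below is a placeholder:

// In the thread submitting one group of jobs:
sc.setLocalProperty("spark.scheduler.pool", "statsPool") // placeholder pool name
// ... run actions here; their jobs are scheduled in "statsPool" ...
sc.setLocalProperty("spark.scheduler.pool", null)        // revert to the default pool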