I want to run multiple spark SQL parallel in a spark cluster, so that I can utilize the complete resource cluster wide. I'm using sqlContext.sql(query).
I saw some sample code here like follows,
val parallelism = 10
val executor = Executors.newFixedThreadPool(parallelism)
val ec: ExecutionContext = ExecutionContext.fromExecutor(executor)
val tasks: Seq[String] = ???
val results: Seq[Future[Int]] = tasks.map(query => {
Future{
//spark stuff here
0
}(ec)
})
val allDone: Future[Seq[Int]] = Future.sequence(results)
//wait for results
Await.result(allDone, scala.concurrent.duration.Duration.Inf)
executor.shutdown //otherwise jvm will probably not exit
As I understood, the ExecutionContext compute the available cores in the machine(using ForkJoinPool) and do the parallelism accordingly. But what happens if we consider the spark cluster other-than the single machine and How can it guarantee the complete cluster resource utilization.?
eg: If I have a 10 node cluster with each 4 cores, then how can the above code guarantees that the 40 cores will be utilized.
EDITS:-
Lets say there are 2 sql to be executed, we have 2 way to do this,
submit the queries sequentially, so that second query will be completed only after the execution of the first. (because sqlContext.sql(query) is a synchronous call)
Submit both the queries parallel using Futures, so that both the queries will executed independently and parallel in the cluster
assuming there are enough resources (in both cases).
I think the second one is better because it uses the maximum resources available in the cluster and if the first query fully utilized the resources the scheduler will wait for the completion of the job(depending upon the policy) which is fair in this case.
But as user9613318 mentioned 'increasing pool size will saturate the driver'
Then how can I efficiently control the threads for better resource utilization.
Parallelism will have a minimal impact here, and additional cluster resources don't really affect the approach. Futures (or Threads) are use not to parallelize execution, but to avoid blocking execution. Increasing pool size can only saturate the driver.
What you really should be looking at is Spark in-application scheduling pools and tuning of the number of partitions for narrow (How to change partition size in Spark SQL, Whats meaning of partitionColumn, lowerBound, upperBound, numPartitions parameters?) and wide (What should be the optimal value for spark.sql.shuffle.partitions or how do we increase partitions when using Spark SQL?) transformations.
If jobs are completely independent (the code structure suggests that) it could be preferred to submit each one separately, with its own set of allocated resources, and configure cluster scheduling pools accordingly.
Related
I'm naively testing for concurrency in local mode, with the following spark context
SparkSession
.builder
.appName("local-mode-spark")
.master("local[*]")
.config("spark.executor.instances", 4)
.config("spark.executor.cores", 2)
.config("spark.network.timeout", "10000001") // to avoid shutdown during debug, avoid otherwise
.config("spark.executor.heartbeatInterval", "10000000") // to avoid shutdown during debug, avoid otherwise
.getOrCreate()
and a mapPartitions API call like follows:
import spark.implicits._
val inputDF : DataFrame = spark.read.parquet(inputFile)
val resultDF : DataFrame =
inputDF.as[T].mapPartitions(sparkIterator => new MyIterator)).toDF
On the surface of it, this did surface one concurrency bug in my code contained in MyIterator (not a bug in Spark's code). However, I'd like to see that my application will crunch all available machine resources both in production, and also during this testing so that the chances of spotting additional concurrency bugs will improve.
That is clearly not the case for me so far: my machine is only at very low CPU utilization throughout the heavy processing of the inputDF, while there's plenty of free RAM and the JVM Xmx poses no real limitation.
How would you recommend testing for concurrency using your local machine? the objective being to test that in production, Spark will not bump into thread-safety or other concurrency issues in my code applied by spark from within MyIterator?
Or can it even in spark local mode, process separate partitions of my input dataframe in parallel? Can I get spark to work concurrently on the same dataframe on a single machine, preferably in local mode?
Max parallelism
You are already running spark in local mode using .master("local[*]").
local[*] uses as many threads as the number of processors available to the Java virtual machine (it uses Runtime.getRuntime.availableProcessors() to know the number).
Max memory available to all executors/threads
I see that you are not setting the driver memory explicitly. By default the driver memory is 512M. If your local machine can spare more than this, set this explicitly. You can do that by either:
setting it in the properties file (default is spark-defaults.conf),
spark.driver.memory 5g
or by supplying configuration setting at runtime
$ ./bin/spark-shell --driver-memory 5g
Note that this cannot be achieved by setting it in the application, because it is already too late by then, the process has already started with some amount of memory.
Nature of Job
Check number of partitions in your dataframe. That will essentially determine how much max parallelism you can use.
inputDF.rdd.partitions.size
If the output of this is 1, that means your dataframe has only 1 partition and so you won't get concurrency when you do operations on this dataframe. In that case, you might have to tweak some config to create more number of partitions so that you can concurrently run tasks.
Running local mode cannot simulate a production environment for the following reasons.
There are lots of code which gets bypassed when code is run in local mode, which would normally run with any other cluster manager. Amongst various issues, few things that i could think
a. Inability to detect bugs from the way shuffle get handled.(Shuffle data is handled in a completely different way in local mode.)
b. We will not be able to detect serialization related issues, since all code is available to the driver and task runs in the driver itself, and hence we would not result in any serialization issues.
c. No speculative tasks(especially for write operations)
d. Networking related issues, all tasks are executed in same JVM. One would not be able detect issues like communication between driver/executor, codegen related issues.
Concurrency in local mode
a. Max concurrency than can be attained will be equal to the number of cores in your local machine.(Link to code)
b. The Job, Stage, Task metrics shown in Spark UI are not accurate since it will incur the overhead of running in the JVM where the driver is also running.
c: As for CPU/Memoryutilization, it depends on operation being performed. Is the operation CPU/memory intensive?
When to use local mode
a. Testing of code that will run only on driver
b. Basic sanity testing of the code that will get executed on the executors
c. Unit testing
tl; dr The concurrency bugs that occur in local mode might not even be present in other cluster resource managers, since there are lot of special handling in Spark code for local mode(There are lots of code which checks isLocal in code and control goes to a different code flow altogether)
Yes!
Achieving parallelism in local mode is quite possible.
Check the amount of memory and cpu available in your local machine and supply values to the driver-memory and driver-cores conf while submitting your spark job.
Increasing executor-memory and executor-cores will not make a difference in this mode.
Once the application is running, open up the SPARK UI for the job. You can now go to the EXECUTORS tab to actually check the amount of resources your spark job is utilizing.
You can monitor various tasks that get generated and the number of tasks that your job runs concurrently using the JOBS and STAGES tab.
In order to process data which is way larger than the resources available, ensure that you break your data into smaller partitions using repartition. This should allow your job to complete successfully.
Increase the default shuffle partitions in case your job has aggregations or joins. Also, ensure sufficient space on the local file system since spark creates intermediate shuffle files and writes them to disk.
Hope this helps!
Assume I have 36 cores per executor, one executor per node, and 3 nodes each with 48 cores available. The basic gist of what I've noticed is, when I set each task to use 1 core (the default), my CPU utilization over the workers is about 70% and 36 tasks will execute simultaneously per executor (as I would have expected). However, when I change my configuration to have 6 cores per task (--conf spark.task.cpus=6), I get the drop to 6 tasks at a time per executor (as expected), but my CPU utilization also drops below 10% utilization (unexpected). I would have assumed that Spark would know how to parallelize the workload over the 6 cores.
The implementation details that are important are that I am running a UDF function on a column of a DataFrame and appending the results as a new column on that dataframe. This UDF function uses a #transient object that provides a machine learning algorithm that I'm using. This UDF function is not part of an aggregation or coalesce operation, it is just a map operation over the column implemented like so:
def myUdf = udf { ... }
val resultSet = myUdf(dataFrame.col("originalCol"))
val dataFrameWithResults = dataFrame.withColumn("originalColMetric", resultSet)
I would have expected that Spark would execute 6 myUdf to process 6 records at a time, one for each core, but this doesn't appear to be the case. Is there a way to fix this (without submitting a PR to the Spark project), or at least, can someone explain why this might be happening?
Anticipating the question, I'm experimenting with increasing the number of cores per task in order to reduce the amount of RAM required per executor. Executing too many tasks at once exponentially increases the RAM usage, in this instance.
spark.task.cpus is a number of cores to allocate for each task. It is used to allocate multiple cores to a single task, in case when user code is multi-threaded. If your udf doesn't use multiple (doesn't spawn multiple threads in a single function call) threads then the cores are just wasted.
to process 6 records at a time
allocate 6 cores, with spark.task.cpus set to 1. If you want to limit number of tasks on node, then reduce number of cores offered by each node.
Essentially Spark can determine on its own how to split out mapping a UDF over multiple records concurrently by splitting the records up among each of the Tasks (according to the partitioning) and determining how many simultaneous Tasks each Executor can handle. However, Spark can NOT automatically split the work per Core per Task. To utilize multiple cores per task, the code in the UDF, which would get executed over one record at a time (sequentially) per Task, would need to be written to parallelize the computation in that UDF over a single record.
In the famous word count example for spark streaming, the spark configuration object is initialized as follows:
/* Create a local StreamingContext with two working thread and batch interval of 1 second.
The master requires 2 cores to prevent from a starvation scenario. */
val sparkConf = new SparkConf().
setMaster("local[2]").setAppName("WordCount")
Here if I change the master from local[2] to local or does not set the Master, I do not get the expected output and in fact word counting doesn't happen at all.
The comment says:
"The master requires 2 cores to prevent from a starvation scenario" that's why they have done setMaster("local[2]").
Can somebody explain me why it requires 2 cores and what is starvation scenario ?
From the documentation:
[...] note that a Spark worker/executor is a long-running task, hence it occupies one of the cores allocated to the Spark Streaming application. Therefore, it is important to remember that a Spark Streaming application needs to be allocated enough cores (or threads, if running locally) to process the received data, as well as to run the receiver(s).
In other words, one thread will be used to run the receiver and at least one more is necessary for processing the received data. For a cluster, the number of allocated cores must be more than the number of receivers, otherwise the system can not process the data.
Hence, when running locally, you need at least 2 threads and when using a cluster at least 2 cores need to be allocated to your system.
Starvation scenario refers to this type of problem, where some threads are not able to execute at all while others make progress.
There are two classical problems where starvation is well known:
Dining philosophers
Readers-writer problem, here it's possible to synchronize the threads so the readers or writers starve. It's also possible to make sure that no starvation occurs.
I recently discovered that adding parallel computing (e.g. using parallel-collections) inside UDFs increases performance considerable even when running spark in local[1] mode or using Yarn with 1 executor and 1 core.
E.g. in local[1] mode, the Spark-Jobs consumes as much CPU as possible (i.e. 800% if I have 8 cores, measured using top).
This seems strange because I thought Spark (or yarn) limits the CPU usage per Spark application?
So I wonder why that is and whether it's recommended to use parallel-processing/mutli-threading in spark or should I stick to sparks parallelizing pattern?
Here an example to play with (times measured in yarn client-mode with 1 instance and 1 core)
case class MyRow(id:Int,data:Seq[Double])
// create dataFrame
val rows = 10
val points = 10000
import scala.util.Random.nextDouble
val data = {1 to rows}.map{i => MyRow(i, Stream.continually(nextDouble()).take(points))}
val df = sc.parallelize(data).toDF().repartition($"id").cache()
df.show() // trigger computation and caching
// some expensive dummy-computation for each array-element
val expensive = (d:Double) => (1 to 10000).foldLeft(0.0){case(a,b) => a*b}*d
val serialUDF = udf((in:Seq[Double]) => in.map{expensive}.sum)
val parallelUDF = udf((in:Seq[Double]) => in.par.map{expensive}.sum)
df.withColumn("sum",serialUDF($"data")).show() // takes ~ 10 seconds
df.withColumn("sum",parallelUDF($"data")).show() // takes ~ 2.5 seconds
Spark does not limit CPU directly, instead it defines the number of concurrent threads spark creates. So for local[1] it would basically run one task at a time in parallel. When you are doing in.par.map{expensive} you are creating threads which spark does not manage and therefore are not handled by this limit. i.e. you told spark to limit itself to a single thread and then created other threads without spark knowing it.
In general, it is not a good idea to do parallel threads inside of a spark operation. Instead, it would be better to tell spark how many threads it can work with and make sure you have enough partitions for parallelism.
Spark are configuration of CPU usage
examople
val conf = new SparkConf()
.setMaster("local[2]")
.setAppName("CountingSheep")
val sc = new SparkContext(conf)
change the local[*] it will utilization all of your CPU cores.
I am trying to write a (wordcount) program to simulate a use case where the network traffic would be very high because of spark's shuffle process. I have a 3 node apache spark cluster (2 cores each, 8GB RAM each) configured with 1 master and 2 workers. I processed a 5GB file for wordcount and was able to see a network traffic between 2 worker nodes raise up to 1GB in 10-15 mins. I am looking for a way where i could increase the traffic between nodes raise up to atleast 1GB within 30s-60s. The inefficiency of the program or best practices doesn't matter in my current use case as I am just trying to simulate traffic.
This is the program i have written
val sc = new SparkContext(new SparkConf().setAppName("LetterCount-GroupBy-Map"))
val x = sc.textFile(args(0)).flatMap(t => t.split(" "))
val y = x.map(w => (w.charAt(0),w))
val z = y.groupByKey().mapValues(n => n.size)
z.collect().foreach(println)
More shuffled data can be generated by doing operations which do not combine data very well on each node. For eg: In the code you have written, the groupby will combine the common keys ( or do groupby locally). Instead choose a high cardinality of keys (in above example its 26 only). In addition, the size of values after the map operation can be increased. In your case, its the text line. You might want to put a very long string of values for each key.
Apart from this, if you take 2 different files/tables and apply join on some parameter, it will also cause shuffling.
Note: Am assuming the contents does not matter. You are only interested in generating highly shuffled data.