How are multiple executors managed on the worker nodes with a Spark standalone cluster? - scala

Until now, I have only used Spark on a Hadoop cluster with YARN as the resource manager. In that type of cluster, I know exactly how many executors to run and how the resource management works. However, now that I am trying to use a standalone Spark cluster, I have gotten a little bit confused. Correct me where I am wrong.
From this article, by default, a worker node uses all the memory of the node minus 1 GB. But I understand that by using SPARK_WORKER_MEMORY, we can make it use less memory. For example, if the total memory of the node is 32 GB but I specify 16 GB, the Spark worker is not going to use any more than 16 GB on that node?
But what about executors? Let us say I want to run 2 executors per node; can I do that by specifying the executor memory during spark-submit to be half of SPARK_WORKER_MEMORY? And if I want to run 4 executors per node, by specifying the executor memory to be a quarter of SPARK_WORKER_MEMORY?
If so, besides executor memory, I would also have to specify executor cores correctly, I think. For example, if I want to run 4 executors on a worker, I would have to specify executor cores to be a quarter of SPARK_WORKER_CORES. What happens if I specify a bigger number than that? I mean, if I specify executor memory to be a quarter of SPARK_WORKER_MEMORY, but executor cores to be only half of SPARK_WORKER_CORES, would I get 2 or 4 executors running on that node in that case?
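For concreteness, the arithmetic I am assuming for the 4-executors-per-worker case looks like this (just an illustration of my assumption, not something I have verified):
// with SPARK_WORKER_MEMORY=16g and SPARK_WORKER_CORES=16 on each worker
val workerMemoryGb = 16
val workerCores = 16
val executorsPerWorker = 4
val executorMemoryGb = workerMemoryGb / executorsPerWorker // 4, i.e. --executor-memory 4G
val executorCores = workerCores / executorsPerWorker       // 4, i.e. --executor-cores 4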

This is the best way to control the number of executors, cores, and memory, in my experience.
Cores: you can set the total number of cores across all executors and the number of cores per executor.
Memory: you can set the executor memory individually.
--total-executor-cores 12 --executor-cores 2 --executor-memory 6G
This would give you 6 executors with 2 cores and 6G each, so in total you are looking at 12 cores and 36G.
You can set driver memory using
--driver-memory 2G
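If you prefer to set these programmatically, the same flags map onto SparkConf properties; here is a minimal sketch (the master URL and app name are placeholders):
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("spark://master-host:7077")   // placeholder standalone master URL
  .setAppName("MyApp")                     // placeholder app name
  .set("spark.cores.max", "12")            // equivalent of --total-executor-cores 12
  .set("spark.executor.cores", "2")        // equivalent of --executor-cores 2
  .set("spark.executor.memory", "6g")      // equivalent of --executor-memory 6G
val sc = new SparkContext(conf)
Driver memory is the exception: set it via --driver-memory on spark-submit, since the driver JVM is already running by the time SparkConf is read.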

So, I experimented with the Spark Standalone cluster myself a bit, and this is what I noticed.
My intuition that multiple executors can be run inside a worker by tuning executor cores was indeed correct. Let us say your worker has 16 cores. Now if you specify 8 cores per executor, Spark would run 2 executors per worker.
How many executors run inside a worker also depends on the executor memory you specify. For example, if the worker memory is 24 GB and you want to run 2 executors per worker, you cannot specify the executor memory to be more than 12 GB.
A worker's memory can be limited when starting it by specifying a value for the optional --memory parameter or by changing the value of SPARK_WORKER_MEMORY. The same goes for the number of cores (--cores / SPARK_WORKER_CORES).
If you want to be able to run multiple jobs on the Standalone Spark cluster, you could use the spark.cores.max configuration property while doing spark-submit. For example, like this.
spark-submit <other parameters> --conf="spark.cores.max=16" <other parameters>
So, if your Standalone Spark Cluster allows 64 cores in total, and you give only 16 cores to your program, other Spark jobs could use the remaining 48 cores.
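To make the experiment concrete, here is a minimal sketch of a configuration that should give 2 executors on each 16-core / 24 GB worker (the master URL and app name are placeholders):
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("spark://master-host:7077")  // placeholder standalone master URL
  .setAppName("TwoExecutorsPerWorker")    // placeholder app name
  .set("spark.executor.cores", "8")       // 16 worker cores / 8 cores per executor = 2 executors per worker
  .set("spark.executor.memory", "12g")    // at most 24 GB / 2, so both executors fit in worker memory
val sc = new SparkContext(conf)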

Related

Kafka Streams - Relation between "Stream threads" vs "Tasks" running on 1 C4.XLarge Machine

I have a Kafka Streams topology which has 5 processors & 1 source. The source topic for this topology has 200 partitions. My understanding is that 200 tasks get created to match the number of partitions of the input topic.
This Kafka Streams app is running on C4.XLarge & these 200 tasks run on single stream thread which means this streams thread should be using up all the CPU Cores (8) & memory.
I know Kafka Streams parallelism/scalability is controlled by the number of stream threads. I can increase num.stream.threads to 10, but how would that improve performance if all of them run on a single EC2 instance? How would it differ from running all tasks on a single stream thread on a single EC2 instance?
If you have an 8-core machine, you might want to run 8 StreamThreads.
This Kafka Streams app is running on C4.XLarge & these 200 tasks run on single stream thread which means this streams thread should be using up all the CPU Cores (8) & memory.
This does not sound correct. A single thread cannot utilize multiple cores. While configuring a single StreamThread implies that some other background threads are started (consumer heartbeat thread, producer sender thread), I would assume that you cannot fully utilize all 8 cores with this setting.
If 8 StreamThreads do not fully utilize your 8 cores, you might consider configuring 16 threads. However, note that all threads will share the same network, and thus, if the network is the actual limiting factor, running more threads won't give you higher throughput (or higher CPU utilization). For that case, you need to scale out using multiple EC2 instances.
Given that you have 200 tasks, you could conceptually run up to 200 StreamThreads, but you probably won't need 200 threads.
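For reference, the thread count is just a Kafka Streams config value; a minimal sketch of setting it to match the 8 cores (the application id and bootstrap servers are placeholders):
import java.util.Properties
import org.apache.kafka.streams.StreamsConfig

val props = new Properties()
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app")   // placeholder
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092")   // placeholder
props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, "8")            // one StreamThread per core
// the 200 tasks are then spread across the 8 threads, roughly 25 tasks per thread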

Kafka Streams: Threads vs CPU cores

If the machine has 16 cores and we define 6 threads in the config, would Kafka Streams utilize 6 cores, or would all the threads run on just a single core, or is there no control over the cores?
It is not right to think of it that way; multiple factors are involved here.
If we define 6 tasks, it means we have 6 partitions for that topic, which will be consumed in parallel by the Kafka consumer or connector.
If you have 16 cores and no other processes running, then chances are it will be executed as you expected.
That is not a normal production scenario, where we have multiple topics (each having more than 1 partition), which invalidates that theory.
You should size the tasks based on the consumers, and the machine should run only the worker.
Once the above condition is satisfied, we can perform a performance test on that data:
How much time does it take to process 50k records?
What is our expected time?
We can upgrade our system based on the above basic parameters.
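As a back-of-the-envelope sketch of that sizing exercise (every number below is hypothetical and would come from your own performance test):
// hypothetical measurements
val recordsToProcess = 50000              // the 50k records mentioned above
val recordsPerSecPerThread = 2500.0       // measured throughput of one stream thread (hypothetical)
val targetSeconds = 5.0                   // hypothetical target processing time

// threads needed to hit the target; cap at the partition count and the machine's core count
val threadsNeeded =
  math.ceil(recordsToProcess / (recordsPerSecPerThread * targetSeconds)).toInt   // 4 in this example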

Partitioning of Apache Spark

I have a cluster consisting of 1 master and 10 worker nodes. When I set the number of partitions to 3, I wonder whether the master node uses only 3 worker nodes or all of them, because it shows that all of them are used.
The question is not very clear about what you are asking; however, the following things might help.
When you start the job with 10 executors, the Spark application master gets all the resources from YARN, so all the executors are already associated with the Spark job.
However, if your data has fewer partitions than the number of executors available, the rest of the executors will sit idle. Hence it is not a good idea to keep the number of partitions less than the executor count.
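If the partition count is what you want to fix, here is a minimal sketch (run e.g. in spark-shell, so sc already exists):
val data = sc.parallelize(1 to 1000000, numSlices = 3)   // 3 partitions, as in the question
println(data.partitions.length)                          // 3: at most 3 tasks run at a time
val spread = data.repartition(sc.defaultParallelism)     // one partition per core granted to the app
println(spread.partitions.length)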

Spark: master local[*] is a lot slower than master local

I have an EC2 instance set up with r3.8xlarge (32 cores, 244G RAM).
In my Spark application, I am reading two CSV files from S3 using Spark-CSV from Databricks; each CSV has about 5 million rows. I unionAll the two DataFrames and run dropDuplicates on the combined DataFrame.
But when I have,
val conf = new SparkConf()
.setMaster("local[32]")
.setAppName("Raw Ingestion On Apache Spark")
.set("spark.sql.shuffle.partitions", "32")
Spark is slower than .setMaster("local")
Wouldn't it be faster with 32 cores?
Well, Spark is not a Windows operating system that works at maximum possible capacity from the start; you need to tune it for your usage.
Right now you have just bluntly told Spark to start and process your stuff on one node with 32 cores. That is not what Spark is best at. It is a distributed system meant to be run on a multi-node cluster; that is where it works best.
The reason is simple: even if you are using 32 cores, what about the IO?
If, say, 32 tasks are running at once, that is 32 processes reading from the same disk.
You specified 32 cores; what about executor memory?
Did both machines you were testing on have the same RAM?
You have now specified that you want 32 partitions; if the data is very small, that is a lot of overhead. Ideally you shouldn't specify the number of partitions until you know specifically what you are doing, or you are doing a repetitive task and you know the data is going to be exactly similar every time.
If you tune it correctly, Spark with 32 cores will indeed work faster than "local", which is basically running on one core.
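What "tuning it for your usage" can look like in this local[32] case, as a minimal sketch (the shuffle-partition value is a hypothetical starting point, not a recommendation):
import org.apache.spark.SparkConf

// note: in local mode everything runs in the driver JVM, so driver memory is what matters;
// set it when launching (e.g. spark-submit --driver-memory ...), since setting
// spark.driver.memory in SparkConf here would be too late to take effect
val conf = new SparkConf()
  .setMaster("local[32]")
  .setAppName("Raw Ingestion On Apache Spark")
  .set("spark.sql.shuffle.partitions", "64")   // hypothetical starting point; measure and adjust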

spark.default.parallelism for Parallelize RDD defaults to 2 for spark submit

A Spark standalone cluster with a master and 2 worker nodes, 4 CPU cores on each worker: 8 cores in total across all workers.
When running the following via spark-submit (spark.default.parallelism is not set)
val myRDD = sc.parallelize(1 to 100000)
println("Partititon size - " + myRDD.partitions.size)
val totl = myRDD.reduce((x, y) => x + y)
println("Sum - " + totl)
It returns a value of 2 for the partition size.
When using spark-shell connected to the Spark standalone cluster, the same code returns the correct partition size of 8.
What can be the reason?
Thanks.
spark.default.parallelism defaults to the number of all cores on all machines. The parallelize API has no parent RDD from which to determine the number of partitions, so it uses spark.default.parallelism.
When running spark-submit, you're probably running it locally. Try launching spark-submit with the same startup configs as you use for spark-shell.
Pulled this from the documentation:
spark.default.parallelism
For distributed shuffle operations like reduceByKey and join, the largest number of partitions in a parent RDD. For operations like parallelize with no parent RDDs, it depends on the cluster manager:
Local mode: number of cores on the local machine
Mesos fine grained mode: 8
Others: total number of cores on all executor nodes or 2, whichever is larger
Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by user.
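A minimal sketch of forcing the expected parallelism, either per call or for the whole application (the master URL is a placeholder; 8 matches the 8 worker cores above):
// option 1: be explicit at the call site
val myRDD = sc.parallelize(1 to 100000, numSlices = 8)
println("Partition size - " + myRDD.partitions.size)   // 8

// option 2: set the default for the whole application when submitting, e.g.
//   spark-submit --master spark://master-host:7077 --conf spark.default.parallelism=8 ...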