I am using a cluster of 12 virtual machines, each of which has 16 GB memory and 6 cores(except master node with only 2 cores). To each worker node, 12GB memory and 4 cores were assigned.
When I submit a spark application to yarn, I set the number of executors to 10(1 as master manager, 1 as application master), and to maximize the parallelism of my application, most of my RDDs have 40 partitions as same as the number of cores of all executors.
The following is the problem I encountered: in some random stages, some tasks need to be processed extremely longer than others, which results in poor parallelism. As we can see in the first picture, executor 9 executed its tasks over 30s while other tasks could be finished with 1s. Furthermore, the reason for much time consumed is also randomized, sometimes just because of computation, but sometimes scheduler delay, deserialization or shuffle read. As we can see, the reason for second picture is different from first picture.
I am guessing the reason for this occurs is once some task got assigned to a specific slot, there is not enough resources on the corresponding machine, so jvm was waiting for cpus. Is my guess correct? And how to set the configuration of my cluster to avoid this situation?
computing
scheduler delay & deserialization
To get a specific answer you need to share more about what you're doing but most likely the partitions you get in one or more of your stages are unbalanced - i.e. some are much bigger than others. The result is slowdown since these partitions are handled by a specific task. One way to solve it is to increase the number of partitions or change the partitioning logic
When a big task finishes shipping the data to other tasks would take longer as well so that's why other tasks may take long
Related
We have a Druid Cluster with the following specs
3X Coordinators & Overlords - m5.2xlarge
6X Middle Managers(Ingest nodes with 5 slots) - m5d.4xlarge
8X Historical - i3.4xlarge
2X Router & Broker - m5.2xlarge
Cluster often goes into Restricted mode
All the calls to the Cluster gets rejected with a 502 error.
Even with 30 available slots for the index-parallel tasks, cluster only runs 10 at time and the other tasks are going into waiting state.
Loader Task submission time has been increasing monotonically from 1s,2s,..,6s,..10s(We submit a job to load the data in S3), after
recycling the cluster submission time decreases and increases again
over a period of time
We submit around 100 jobs per minute but we need to scale it to 300 to catchup with our current incoming load
Cloud someone help us with our queries
Tune the specs of the cluster
What parameters to be optimized to run maximum number of tasks in parallel without increasing the load on the master nodes
Why is the loader task submission time increasing, what are the parameters to be monitored here
At 100 jobs per minute, this is probably why the overlord is being overloaded.
The overlord initiates a job by communicating with the middle managers across the cluster. It defines the tasks that each middle manager will need to complete and it monitors the task progress until completion. The startup of each job has some overhead, so that many jobs would likely keep the overlord busy and prevent it from processing the other jobs you are requesting. This might explain why the time for job submissions increases over time. You could increase the resources on the overlord, but this sounds like there may be a better way to ingest the data.
The recommendation would be to use a lot less jobs and have each job do more work.
If the flow of data is so continuous as you describe, perhaps a kafka queue would be the best target followed up with a Druid kafka ingestion job which is fully scalable.
If you need to do batch, perhaps a single index_parallel job that reads many files would be much more efficient than many jobs.
Also consider that each task in an ingestion job creates a set of segments. By running a lot of really small jobs, you create a lot of very small segments which is not ideal for query performance. Here some info around how to think about segment size optimization which I think might help.
I think I'm familiar with Spark Cluster Mode components which described here and one of its major disadvantage - shuffling. hence, IMHO a best practice would be to have one executor for each worker node, having more than that will unnecessarily increase the shuffling (between the executors).
What am I missing here, in which cases I would prefer to have more then one executor in each worker?
In which case would we like to have more than one executor in each worker?
Whenever possible: if jobs require less resources for one executor than what a worker node has, then spark should try to start other executors on the same worker to use all its available resources.
But that's the role of spark, not our call. When deploying spark apps, it is up to spark to decide how many executors (jvm process) are started on each worker node (machine). And it depends on the executor resources (core and memory) required by the spark jobs (the spark.executor.* configs). We often don't know what resources are available per worker. A cluster is usually shared by multiple apps/people. So we configure executor number and required resources and let spark decide to run them on the same worker or not.
Now, your question is maybe "should we have less executors with lots of cores and memory, or distribute it in several little executors?"
Having less but bigger executors reduce shuffling, clearly. But there are several reasons to also prefer distribution:
It is easier to start little executors
Having big executor means the cluster need all required resources on one worker
it is specially useful when using dynamic allocation, that will start and kill executor in function of runtime usage
Several little executors improve resilience: if our code is unstable and might sometime kill the executor, everything is lost and restarted.
I met a case where the code used in the executor wasn't thread safe. That's a bad thing, but it wasn't done on purpose. So until, or instead :\ , this is fixed, we distributed the code on many 1-core executors.
I've faced with situation when running cluster on AWS EMR, that one stage remained 'running' when execution plan continue to progress. Look at screen from Spark UI (job 4 has running tasks, however job 7 in progress). My question is how to debug such situation, if there are any tips that I can find at DAG?
My thought that it could be some memory issue because data is tough, and there are a lot of spills to disk. However I am wondering why spark stays idle for hour. Is it related to driver memory issues?
UPD1:
Based on Ravi requests:
(1) check the time since they are running and the GC time also. If GC
time is >20% of the execution time it means that u r throttled by the
memory.
No, it is not an issue.
(2) check number of active tasks in the same page.
That's really weird, there are executors with more active tasks than cores capacity (3x time more for some of executors), however I do not see any executors failures.
(3) see if all executors are equally spending time in running the job
Not an issue
(4) what u showed above is the job what abt stages etc? are they also
paused for ever?
I have a java spring application that submits topologies to a storm (1.1.2) nimbus based on a DTO which creates the structure of the topology.
This is working great except for very large windows. I am testing it with several varying sliding and tumbling windows. None are giving me any issue besides a 24 hour sliding window which advances every 15 minutes. The topology will receive ~250 messages/s from Kafka and simply windows them using a simple timestamp extractor with a 3 second lag (much like all the other topologies I am testing).
I have played with the workers and memory allowances greatly to try and figure this out but my default configuration is 1 worker with a 2048mb heap size. I've also tried reducing the lag which had minimal effects.
I think that it's possible the window size is getting too large and the worker is running out of memory which delays the heartbeats or zookeeper connection check-in which in turn cause Nimbus to kill the worker.
What happens is every so often (~11 window advances) the Nimbus logs report that the Executor for that topology is "not alive" and the worker logs for that topology show either a KeeperException where the topology can't communicate with Zookeeper or a java.lang.ExceptionInInitializerError:null with a nest PrivelegedActionException.
When the topology is assigned a new worker, the aggregation I was doing is lost. I assume this is happening because the window is holding at least 250*60*15*11 (messagesPerSecond*secondsPerMinute*15mins*windowAdvancesBeforeCrash) messages which are around 84 bytes each. To complete the entire window it will end up being 250*60*15*97 messages (messagesPerSecond*secondsPerMinute*15mins*15minIncrementsIn24HoursPlusAnExpiredWindow). This is ~1.8gbs if my math is right so I feel like the worker memory should be covering the window or at least more than 11 window advances worth.
I could increase the memory slightly but not much. I could also decrease the amount of memory/worker and increase the number of workers/topology but I was wondering if there is something I'm missing? Could I just increase the amount of time the heartbeat for the worker is so that there is more time for the executor to check-in before being killed or would that be bad for some reason? If I changed the heartbeat if would be in the Config map for the topology. Thanks!
This was caused by the workers running out of memory. From looking at Storm code. it looks like Storm keeps around every message in a window as a Tuple (which is a fairly big object). With a high rate of messages and a 24 hour window, that's a lot of memory.
I fixed this by using a preliminary bucketing bolt that would bucket all the tuples in an initial 1 minute window which reduced the load on the main window significantly because it was now receiving one tuple per minute. The bucketing window doesn't run out of memory since it only has one minute of tuples at a time in its window.
In the famous word count example for spark streaming, the spark configuration object is initialized as follows:
/* Create a local StreamingContext with two working thread and batch interval of 1 second.
The master requires 2 cores to prevent from a starvation scenario. */
val sparkConf = new SparkConf().
setMaster("local[2]").setAppName("WordCount")
Here if I change the master from local[2] to local or does not set the Master, I do not get the expected output and in fact word counting doesn't happen at all.
The comment says:
"The master requires 2 cores to prevent from a starvation scenario" that's why they have done setMaster("local[2]").
Can somebody explain me why it requires 2 cores and what is starvation scenario ?
From the documentation:
[...] note that a Spark worker/executor is a long-running task, hence it occupies one of the cores allocated to the Spark Streaming application. Therefore, it is important to remember that a Spark Streaming application needs to be allocated enough cores (or threads, if running locally) to process the received data, as well as to run the receiver(s).
In other words, one thread will be used to run the receiver and at least one more is necessary for processing the received data. For a cluster, the number of allocated cores must be more than the number of receivers, otherwise the system can not process the data.
Hence, when running locally, you need at least 2 threads and when using a cluster at least 2 cores need to be allocated to your system.
Starvation scenario refers to this type of problem, where some threads are not able to execute at all while others make progress.
There are two classical problems where starvation is well known:
Dining philosophers
Readers-writer problem, here it's possible to synchronize the threads so the readers or writers starve. It's also possible to make sure that no starvation occurs.