I am running a series of steps in PySpark, and the whole job takes about 56 minutes to complete. But when I look at the Spark UI, the breakdown only accounts for 9-12 minutes in one stage, and all the other stages finish in milliseconds. Is there any way I can reduce the unexplained wait time and bring the total run time down to those 9-12 minutes?
I highly appreciate your time. Thanks.
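One way to see where the unaccounted-for time goes is to label and time each action yourself, so every job shows up in the Spark UI under a recognizable name. This is only a diagnostic sketch; the step names and paths are placeholders:

```python
import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("step-timing").getOrCreate()
sc = spark.sparkContext

# Placeholder input and steps -- substitute your own transformations/actions.
df = spark.read.parquet("s3://my-bucket/input/")

for name, action in [
    ("step1_count", lambda: df.count()),
    ("step2_write", lambda: df.write.mode("overwrite").parquet("s3://my-bucket/out/")),
]:
    sc.setJobDescription(name)   # makes the job easy to find in the Spark UI
    start = time.time()
    action()
    print(f"{name}: {time.time() - start:.1f}s wall-clock")
```

If the per-action wall clock adds up to roughly 56 minutes while the stage timings in the UI only add up to 9-12, the gap is often driver-side work (query planning, listing input files, collecting results) rather than executor compute, so that is where to look first.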
I've faced a situation, running a cluster on AWS EMR, where one stage remained 'running' while the execution plan continued to progress. Look at the screenshot from the Spark UI (job 4 still has running tasks, yet job 7 is already in progress). My question is how to debug such a situation, and whether there are any hints I can find in the DAG.
My thought is that it could be some memory issue, because the data is heavy and there are a lot of spills to disk. However, I am wondering why Spark stays idle for an hour. Could it be related to driver memory issues?
UPD1:
Based on Ravi's requests:
(1) Check how long they have been running and also the GC time. If GC
time is >20% of the execution time, it means that you are throttled by
memory.
No, it is not an issue (a way to double-check GC time outside the UI is sketched after item 4 below).
(2) Check the number of active tasks on the same page.
That's really weird: there are executors with more active tasks than their core capacity (3x more for some executors), yet I do not see any executor failures.
(3) See if all executors are spending roughly equal time running the job.
Not an issue
(4) What you showed above is the job; what about the stages etc.? Are they
also paused forever?
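Regarding the GC check in (1): besides the Executors tab, a hedged way to double-check is to enable verbose GC logging on the executors when the session is created. The property name is the standard Spark config; the JVM flags shown assume a Java 8 runtime:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("gc-diagnostics")
    # Print GC activity to the executor stdout logs (visible via the Spark UI / YARN logs).
    .config("spark.executor.extraJavaOptions",
            "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
    .getOrCreate()
)
```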
I created a simple Dataflow pipeline that reads byte arrays from Pub/Sub, windows them, and writes them to a text file in GCS. I found that with lower-traffic topics this worked perfectly; however, when I ran it on a topic that does about 2.4 GB per minute, some problems started to arise.
When kicking off the pipeline I hadn't set the number of workers (as I'd imagined it would auto-scale as necessary). When ingesting this volume of data the number of workers stayed at 1, but TextIO.write() was taking 15+ minutes to write a 2-minute window. This would keep backing up until it ran out of memory. Is there a good reason why Dataflow doesn't auto-scale when this step gets so backed up?
When I increased the number of workers to 6, the time to write the files started at around 4 minutes for a 5-minute window, then dropped to as little as 20 seconds.
Also, when using 6 workers, it seems like there might be an issue with how wall time is calculated? Mine never seems to go down even when the pipeline has caught up, and after running for 4 hours my summary for the write step looked like this:
Step summary
Step name: Write to output
System lag: 3 min 30 sec
Data watermark: Max watermark
Wall time: 1 day 6 hr 26 min 22 sec
Input collections: PT5M Windows/Window.Assign.out0
Elements added: 860,893
Estimated size: 582.11 GB
Job ID: 2019-03-13_19_22_25-14107024023503564121
So for each of your questions:
Is there a good reason why Dataflow doesn't auto scale when this step gets so backed up?
Streaming autoscaling is a beta feature, and must be explicitly enabled for it to work per the documentation here.
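If the pipeline were built with the Python SDK, the opt-in would look roughly like the sketch below (these are the standard Beam worker options; the values are placeholders). For the Java SDK, the equivalent flags are --autoscalingAlgorithm=THROUGHPUT_BASED and --maxNumWorkers.

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder values -- tune max_num_workers to your expected peak load.
options = PipelineOptions(
    streaming=True,
    autoscaling_algorithm="THROUGHPUT_BASED",  # opt in to streaming autoscaling
    max_num_workers=10,                        # ceiling for scale-up
)
```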
When using 6 workers, it seems like there might be an issue for calculating wall time?
My guess would be that you ran your 6-worker pipeline for about 5 hours and 4 minutes, so the "Wall time" presented is workers × hours, i.e. compute time summed across all workers rather than elapsed time.
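A quick back-of-the-envelope check of that reading (pure arithmetic, assuming the 6-worker figure above):

```python
# "Wall time" in the step summary appears to be summed across workers.
workers = 6
elapsed_hours = 5 + 4 / 60            # ~5 hr 4 min of actual pipeline runtime
total_hours = workers * elapsed_hours # ~30.4 hr
print(f"{total_hours:.1f} hours")     # close to the 1 day 6 hr 26 min in the summary
```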
Resizing operation seems very, very slow
We have a 3-node ds2.xlarge cluster and decided to scale it down to 2 nodes. The resize has been running for the last 28 hours, but completion is only at 48% (screenshot attached). So:
Do we need to wait 30+ more hours for it to finish, and will the cluster stay in read-only mode until then?
Based on this, should we conclude that a resize usually takes 60+ hours?
What if I want to terminate the process?
Please advise.
60+ hours is anomalous; as per the documentation, it should take less than 48 hours:
(resizing) ... can take anywhere from a couple of hours to a couple of days.
You can't stop it from the console, but you can contact AWS support to stop it for you.
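While you wait for support, the resize progress can also be polled outside the console; a minimal sketch using boto3 (the cluster identifier is a placeholder):

```python
import boto3

redshift = boto3.client("redshift")

# Poll resize progress for the cluster (identifier is a placeholder).
resp = redshift.describe_resize(ClusterIdentifier="my-ds2-cluster")
print("Status:", resp.get("Status"))
print("Progress (MB):", resp.get("ProgressInMegaBytes"),
      "/", resp.get("TotalResizeDataInMegaBytes"))
print("ETA (s):", resp.get("EstimatedTimeToCompletionInSeconds"))
```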
I have a sample (100 rows) and three steps in my recipe. When I run the job to load the data into a table in BigQuery, it takes 6 minutes to create the table. That elapsed time is too long for a simple process like the one I am testing. I am trying to understand whether there is a way to speed up the job: change some settings, increase the size of the machine, run the job at a specific time, etc.
If you look in Google Cloud Platform -> Dataflow -> Your Dataprep Job, you will see a workflow diagram containing the computation steps and their computation times. For complex flows, you can identify there which operations take longer and so know what to improve.
For small jobs there is not much room for improvement, since setting up the environment takes about 4 minutes. On the right side you can see the "Elapsed time" (real time) and a time graph illustrating how long starting and stopping workers takes.
In ADF we can define a concurrency limit up to a maximum of 10. So, assuming we set it to 10 and slices are waiting to run (not waiting on a dataset dependency, etc.), is there always a guarantee that 10 slices will be running in parallel at any given time? I have noticed that even after setting it to 10, sometimes only a couple of them are in progress, though I'm not sure whether the UI just isn't showing it properly. Is it subject to the resources available? But in the end it's the cloud, where resources are virtually infinite. Has anyone noticed anything like this?
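For reference, the setting I'm talking about is the activity-level concurrency policy in the ADF (v1) pipeline definition; a rough sketch of the relevant fragment, expressed here as a Python dict with placeholder names:

```python
# Fragment of an ADF v1 pipeline activity (placeholder names), showing where
# the concurrency cap lives; "concurrency": 10 should allow 10 slices at once.
activity = {
    "name": "CopySliceActivity",
    "type": "Copy",
    "inputs": [{"name": "InputDataset"}],
    "outputs": [{"name": "OutputDataset"}],
    "policy": {
        "concurrency": 10,   # max slices processed in parallel
        "timeout": "01:00:00",
        "retry": 3,
    },
}
```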
If there are 10 slices ready to run in parallel and all of their dependencies have been met, then 10 slices will run in parallel. Do raise an Azure support ticket if you do not see this happening and we will look into it. There may be a small delay in kicking all 10 off, but 10 should run in parallel.
Thanks, Harish