CPU profiling not covering all the vCPU time of Apache Beam pipeline on Dataflow - apache-beam

Our pipeline is developed based on the Apache Beam Go SDK. I'm trying to profile the CPU of all workers by setting the flag --cpu_profiling=gs://gs_location: https://github.com/apache/beam/blob/master/sdks/go/pkg/beam/runners/dataflow/dataflow.go
The job finished with spending 16.636 vCPU hr and a maximum number of 104 workers:
As a result in the specified GCS location, a bunch of files are recorded with name "profprocess_bundle-*":
Then I downloaded these files, unzipped them all, and visualize the results with pprof (https://github.com/google/pprof):
So here are my questions:
How is the total time in the profiling result collected? The sampled time (1.06 hrs) is way shorter than the vCPU time (16.626 hrs) reported by Dataflow.
What is the the number in the file name "profprocess_bundle-*"? I was thinking it may correspond to the index of a worker. But the maximum of the number is larger than the worker number, and the number is not continuous. The maximum number is 122, but there are only 66 files.

when you set --cpu_profiling=true, the profiling starts when the SDK worker starts processing a bundle (a batch of input elements will go through a subgraph of your pipeline DAG, sometimes also referred as work item) and ends when the processing finishes. A job can contain many bundles. That's why the total vCPU time will be larger than the sample period.
As mentioned the number in profprocess_bundle-* is representing the bundle id being profiled.

Related

Druid Cluster going into Restricted Mode

We have a Druid Cluster with the following specs
3X Coordinators & Overlords - m5.2xlarge
6X Middle Managers(Ingest nodes with 5 slots) - m5d.4xlarge
8X Historical - i3.4xlarge
2X Router & Broker - m5.2xlarge
Cluster often goes into Restricted mode
All the calls to the Cluster gets rejected with a 502 error.
Even with 30 available slots for the index-parallel tasks, cluster only runs 10 at time and the other tasks are going into waiting state.
Loader Task submission time has been increasing monotonically from 1s,2s,..,6s,..10s(We submit a job to load the data in S3), after
recycling the cluster submission time decreases and increases again
over a period of time
We submit around 100 jobs per minute but we need to scale it to 300 to catchup with our current incoming load
Cloud someone help us with our queries
Tune the specs of the cluster
What parameters to be optimized to run maximum number of tasks in parallel without increasing the load on the master nodes
Why is the loader task submission time increasing, what are the parameters to be monitored here
At 100 jobs per minute, this is probably why the overlord is being overloaded.
The overlord initiates a job by communicating with the middle managers across the cluster. It defines the tasks that each middle manager will need to complete and it monitors the task progress until completion. The startup of each job has some overhead, so that many jobs would likely keep the overlord busy and prevent it from processing the other jobs you are requesting. This might explain why the time for job submissions increases over time. You could increase the resources on the overlord, but this sounds like there may be a better way to ingest the data.
The recommendation would be to use a lot less jobs and have each job do more work.
If the flow of data is so continuous as you describe, perhaps a kafka queue would be the best target followed up with a Druid kafka ingestion job which is fully scalable.
If you need to do batch, perhaps a single index_parallel job that reads many files would be much more efficient than many jobs.
Also consider that each task in an ingestion job creates a set of segments. By running a lot of really small jobs, you create a lot of very small segments which is not ideal for query performance. Here some info around how to think about segment size optimization which I think might help.

Unusually long time in writing parquet files to Google Cloud

I am using pyspark dataframe on dataproc cluster to generate features and writing parquet files as output to the Google Cloud Storage. There are two problems I am facing-
I have provided 22 executers, 3 cores per exec and ~13G RAM per executer. However only 10 executers are fired when I submit the jobs. The dataproc cluster contains 10 worker nodes and 8 cores per node and 30 GB ram per node.
When I write the individual feature files and record the total time, it is significantly lower then the time taken to write all the features together in a single file. I have tried changing the partitions but doesn't help either.
This is how I write the parquet file:
df.select([feature_lst]).write.parquet(gcs_path+outfile,mode='overwrite')
data size - 20M+ records, 30+ numerical features
Spark UI image:
The current stage is when I write all features together- significantly higher than all of previous stages combined.
If someone can provide any insight into the above two issues I will be grateful.

Dataflow TextIO.write issues with scaling

I created a simple dataflow pipeline that reads byte arrays from pubsub, windows them, and writes to a text file in GCS. I found that with lower traffic topics this worked perfectly, however I ran it on a topic that does about 2.4GB per minute and some problems started to arise.
When kicking off the pipeline I hadn't set the number of workers (as I'd imagined that it would auto-scale as necessary). When ingesting this volume of data the number of workers stayed at 1, but the TextIO.write() was taking 15+ minutes to write a 2 minute window. This would continue to be backed up till it ran out of memory. Is there a good reason why Dataflow doesn't auto scale when this step gets so backed up?
When I increased the the number of workers to 6, the time to write the files started at around 4 mins for a 5 minute window, then moved down to as little as 20 seconds.
Also, when using 6 workers, it seems like there might be an issue for calculating wall time? Mine never seems to go down even when the dataflow has caught up and after running for 4 hours my summary for the write step looked like this:
Step summary
Step name: Write to output
System lag: 3 min 30 sec
Data watermark: Max watermark
Wall time: 1 day 6 hr 26 min 22 sec
Input collections: PT5M Windows/Window.Assign.out0
Elements added: 860,893
Estimated size: 582.11 GB
Job ID: 2019-03-13_19_22_25-14107024023503564121
So for each of your questions:
Is there a good reason why Dataflow doesn't auto scale when this step gets so backed up?
Streaming autoscaling is a beta feature, and must be explicitly enabled for it to work per the documentation here.
When using 6 workers, it seems like there might be an issue for calculating wall time?
My guess would be you ran your 6 worker pipeline for about 5 hours and 4 minutes, thus the "Wall time" presented is Workers*Hours.

Spark over Yarn some tasks are extremely slower

I am using a cluster of 12 virtual machines, each of which has 16 GB memory and 6 cores(except master node with only 2 cores). To each worker node, 12GB memory and 4 cores were assigned.
When I submit a spark application to yarn, I set the number of executors to 10(1 as master manager, 1 as application master), and to maximize the parallelism of my application, most of my RDDs have 40 partitions as same as the number of cores of all executors.
The following is the problem I encountered: in some random stages, some tasks need to be processed extremely longer than others, which results in poor parallelism. As we can see in the first picture, executor 9 executed its tasks over 30s while other tasks could be finished with 1s. Furthermore, the reason for much time consumed is also randomized, sometimes just because of computation, but sometimes scheduler delay, deserialization or shuffle read. As we can see, the reason for second picture is different from first picture.
I am guessing the reason for this occurs is once some task got assigned to a specific slot, there is not enough resources on the corresponding machine, so jvm was waiting for cpus. Is my guess correct? And how to set the configuration of my cluster to avoid this situation?
computing
scheduler delay & deserialization
To get a specific answer you need to share more about what you're doing but most likely the partitions you get in one or more of your stages are unbalanced - i.e. some are much bigger than others. The result is slowdown since these partitions are handled by a specific task. One way to solve it is to increase the number of partitions or change the partitioning logic
When a big task finishes shipping the data to other tasks would take longer as well so that's why other tasks may take long

Spark: sc.WholeTextFiles takes a long time to execute

I have a cluster and I execute wholeTextFiles which should pull about a million text files who sum up to approximately 10GB total
I have one NameNode and two DataNode with 30GB of RAM each, 4 cores each. The data is stored in HDFS.
I don't run any special parameters and the job takes 5 hours to just read the data. Is that expected? are there any parameters that should speed up the read (spark configuration or partition, number of executors?)
I'm just starting and I've never had the need to optimize a job before
EDIT: Additionally, can someone explain exactly how the wholeTextFiles function works? (not how to use it, but how it was programmed). I'm very interested in understand the partition parameter, etc.
EDIT 2: benchmark assessment
So I tried repartition after the wholeTextFile, the problem is the same because the first read is still using the pre-defined number of partitions, so there are no performance improvements. Once the data is loaded the cluster performs really well... I have the following warning message when dealing with the data (for 200k files), on the wholeTextFile:
15/01/19 03:52:48 WARN scheduler.TaskSetManager: Stage 0 contains a task of very large size (15795 KB). The maximum recommended task size is 100 KB.
Would that be a reason of the bad performance? How do I hedge that?
Additionally, when doing a saveAsTextFile, my speed according to Ambari console is 19MB/s. When doing a read with wholeTextFiles, I am at 300kb/s.....
It seems that by increase the number of partitions in wholeTextFile(path,partitions), I am getting better performance. But still only 8 tasks are running at the same time (my number of CPUs). I'm benchmarking to observe the limit...
To summarize my recommendations from the comments:
HDFS is not a good fit for storing many small files. First of all, NameNode stores metadata in memory so the amount of files and blocks you might have is limited (~100m blocks is a max for typical server). Next, each time you read file you first query NameNode for block locations, then connect to the DataNode storing the file. Overhead of this connections and responses is really huge.
Default settings should always be reviewed. By default Spark starts on YARN with 2 executors (--num-executors) with 1 thread each (--executor-cores) and 512m of RAM (--executor-memory), giving you only 2 threads with 512MB RAM each, which is really small for the real-world tasks
So my recommendation is:
Start Spark with --num-executors 4 --executor-memory 12g --executor-cores 4 which would give you more parallelism - 16 threads in this particular case, which means 16 tasks running in parallel
Use sc.wholeTextFiles to read the files and then dump them into compressed sequence file (for instance, with Snappy block level compression), here's an example of how this can be done: http://0x0fff.com/spark-hdfs-integration/. This will greatly reduce the time needed to read them with the next iteration