Dataflow TextIO.write issues with scaling - google-cloud-storage

I created a simple Dataflow pipeline that reads byte arrays from Pub/Sub, windows them, and writes them to a text file in GCS. With lower-traffic topics this worked perfectly; however, when I ran it on a topic that does about 2.4 GB per minute, some problems started to arise.
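The pipeline is essentially the following (a minimal sketch in the Beam Java SDK; the topic, bucket, and shard count here are placeholders rather than the real job's values):

    import java.nio.charset.StandardCharsets;

    import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
    import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.MapElements;
    import org.apache.beam.sdk.transforms.windowing.FixedWindows;
    import org.apache.beam.sdk.transforms.windowing.Window;
    import org.apache.beam.sdk.values.TypeDescriptors;
    import org.joda.time.Duration;

    public class PubsubToGcs {
      public static void main(String[] args) {
        DataflowPipelineOptions options =
            PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
        options.setStreaming(true);

        Pipeline p = Pipeline.create(options);
        p.apply("Read from Pub/Sub",
                PubsubIO.readMessages().fromTopic("projects/my-project/topics/my-topic"))
         // Turn the raw Pub/Sub payload bytes into strings for TextIO.
         .apply("Decode payload", MapElements.into(TypeDescriptors.strings())
                .via((PubsubMessage msg) -> new String(msg.getPayload(), StandardCharsets.UTF_8)))
         // Fixed (non-overlapping) windows; the runs described below used 2- and 5-minute windows.
         .apply("PT5M Windows", Window.<String>into(FixedWindows.of(Duration.standardMinutes(5))))
         // Windowed writes with an explicit shard count are required for unbounded input.
         .apply("Write to output", TextIO.write()
                .to("gs://my-bucket/output/part")
                .withWindowedWrites()
                .withNumShards(10));
        p.run();
      }
    }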
When kicking off the pipeline I hadn't set the number of workers (I'd imagined it would auto-scale as necessary). While ingesting this volume of data the number of workers stayed at 1, but TextIO.write() was taking 15+ minutes to write a 2-minute window. It kept getting further backed up until it ran out of memory. Is there a good reason why Dataflow doesn't auto-scale when this step gets so backed up?
When I increased the number of workers to 6, the time to write the files started at around 4 minutes for a 5-minute window, then dropped to as little as 20 seconds.
Also, when using 6 workers, it seems like there might be an issue with how wall time is calculated? Mine never seems to go down, even after the pipeline has caught up, and after running for 4 hours my summary for the write step looked like this:
Step summary
Step name: Write to output
System lag: 3 min 30 sec
Data watermark: Max watermark
Wall time: 1 day 6 hr 26 min 22 sec
Input collections: PT5M Windows/Window.Assign.out0
Elements added: 860,893
Estimated size: 582.11 GB
Job ID: 2019-03-13_19_22_25-14107024023503564121

So for each of your questions:
Is there a good reason why Dataflow doesn't auto-scale when this step gets so backed up?
Streaming autoscaling is a beta feature and must be explicitly enabled for it to work, per the documentation here.
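Concretely, enabling it means setting the autoscaling algorithm and a maximum worker count on the pipeline options, for example (a sketch; the ceiling of 10 workers is just illustrative):

    import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
    import org.apache.beam.runners.dataflow.options.DataflowPipelineWorkerPoolOptions.AutoscalingAlgorithmType;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    // Equivalent to passing --autoscalingAlgorithm=THROUGHPUT_BASED --maxNumWorkers=10
    // on the command line when launching the job.
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
    options.setAutoscalingAlgorithm(AutoscalingAlgorithmType.THROUGHPUT_BASED);
    options.setMaxNumWorkers(10);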
When using 6 workers, it seems like there might be an issue with how wall time is calculated?
My guess is that you ran your 6-worker pipeline for about 5 hours and 4 minutes; the "Wall time" shown is workers × elapsed time, and 6 × ~5 h 4 min ≈ 30 h 26 min, which matches the 1 day 6 hr 26 min in your summary.

Related

Spark job running fast after count

I have a Scala job that processes a small set of CDC records (~10K records). As part of a merge job it does member matching against another Hive table with 4 million members, provider matching against another Hive table with 1 million providers, and a bunch of other logic (recycle logic/rejection logic), and it usually takes around 30 minutes to complete.
With the same volume, I added an action (count(*)) after every module to understand which one takes more time; after adding the actions, the job completed in 6 minutes. Best practice says we should not use actions frequently, so I don't understand what's making the job run faster. Any link with an explanation would be helpful.
Is it that, because resources may get released after every module's execution due to the action, the overall job runs faster?
My cluster has 6 nodes, each with 13 cores and 64 GB of memory; however, other processes are also running on it, so it is usually over-utilized.
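For illustration, the pattern I used looks roughly like this (sketched here with the Java Dataset API and made-up table/column names; the real job is in Scala, but the idea is the same):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;

    // Hypothetical module boundaries and join keys, purely for illustration.
    static void runMergeWithCounts(Dataset<Row> cdcRecords,
                                   Dataset<Row> members,
                                   Dataset<Row> providers) {
        Dataset<Row> matchedMembers = cdcRecords.join(members, "member_id");
        // count() is an action: it forces the member-matching module to execute here,
        // so its runtime shows up as its own job in the Spark UI.
        System.out.println("members matched: " + matchedMembers.count());

        Dataset<Row> matchedProviders = matchedMembers.join(providers, "provider_id");
        // Same idea for the provider-matching module.
        System.out.println("providers matched: " + matchedProviders.count());
    }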

Slow reads on MongoDB from Spark - weird task allocation

I have a MongoDB 4.2 cluster with 15 shards; the database stores a sharded collection of 6GB (i.e., about 400MB per machine).
I'm trying to read the whole collection from Apache Spark, which runs on the same machine. The Spark application runs with --num-executors 8 and --executor-cores 6; the connection is made through the spark-connector by configuring the MongoShardedPartitioner.
Besides the reading being very slow (about 1.5 minutes; but, as far as I understand, full scans are generally bad on MongoDB), I'm experiencing this weird behavior in Spark's task allocation:
The issues are the following:
For some reason, only one of the executors starts reading from the database, while all the others wait 25 seconds to begin their readings. The red bars correspond to "Task Deserialization Time", but my understanding is that they are simply idle (if there are concurrent stages, these executors work on something else and then come back to this stage only after the 25 seconds).
For some other reason, after some time the concurrent allocation of tasks is suspended and then resumes all at once (at about 55 seconds from the start of the job); you can see it in the middle of the picture, where a whole bunch of tasks starts at the same time.
Overall, the full scan could be completed in far less time if tasks were allocated properly.
What is the reason for these behaviors and who is responsible (is it Spark, the spark-connector, or MongoDB)? Is there some configuration parameter that could cause these problems?
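For reference, the read setup is roughly the following (a sketch assuming the 2.x mongo-spark-connector, with placeholder host/database/collection names):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    SparkSession spark = SparkSession.builder()
        .appName("mongo-full-scan")
        // Placeholder URI pointing the connector at a mongos router.
        .config("spark.mongodb.input.uri", "mongodb://mongos-host:27017/mydb.mycollection")
        // MongoShardedPartitioner uses the sharded collection's chunks to build Spark partitions.
        .config("spark.mongodb.input.partitioner", "MongoShardedPartitioner")
        .getOrCreate();

    Dataset<Row> collection = spark.read()
        .format("com.mongodb.spark.sql.DefaultSource")
        .load();
    System.out.println("documents: " + collection.count());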

Storm large window size causing executor to be killed by Nimbus

I have a Java Spring application that submits topologies to a Storm (1.1.2) Nimbus based on a DTO that defines the structure of the topology.
This works great except for very large windows. I am testing it with several different sliding and tumbling windows. None give me any issues except a 24-hour sliding window that advances every 15 minutes. The topology receives ~250 messages/s from Kafka and simply windows them using a simple timestamp extractor with a 3-second lag (much like all the other topologies I am testing).
I have played with the workers and memory allowances a lot to try to figure this out, but my default configuration is 1 worker with a 2048 MB heap. I've also tried reducing the lag, which had minimal effect.
I think it's possible the window is getting too large and the worker is running out of memory, which delays the heartbeats or the ZooKeeper connection check-in, which in turn causes Nimbus to kill the worker.
What happens is that every so often (~11 window advances) the Nimbus logs report that the executor for that topology is "not alive", and the worker logs for that topology show either a KeeperException where the topology can't communicate with ZooKeeper, or a java.lang.ExceptionInInitializerError:null with a nested PrivilegedActionException.
When the topology is assigned a new worker, the aggregation I was doing is lost. I assume this is happening because the window is holding at least 250*60*15*11 (messagesPerSecond*secondsPerMinute*15mins*windowAdvancesBeforeCrash) messages, which are around 84 bytes each. To complete the entire window it will end up being 250*60*15*97 messages (messagesPerSecond*secondsPerMinute*15mins*15minIncrementsIn24HoursPlusAnExpiredWindow), which is about 21.8 million messages, or ~1.8 GB, if my math is right. So I feel the worker memory should cover the window, or at least more than 11 window advances' worth (~0.2 GB).
I could increase the memory slightly, but not by much. I could also decrease the memory per worker and increase the number of workers per topology, but I'm wondering if there is something I'm missing. Could I just increase the worker heartbeat timeout so that there is more time for the executor to check in before being killed, or would that be bad for some reason? If I changed the heartbeat, it would be in the Config map for the topology. Thanks!
This was caused by the workers running out of memory. From looking at the Storm code, it looks like Storm keeps every message in a window around as a Tuple (which is a fairly large object). With a high message rate and a 24-hour window, that's a lot of memory.
I fixed this by adding a preliminary bucketing bolt that buckets all the tuples in an initial 1-minute window, which reduces the load on the main window significantly because it now receives one tuple per minute. The bucketing window doesn't run out of memory since it only holds one minute of tuples at a time.
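A rough sketch of that two-stage layout (the bolt classes, the spout variable, and the "ts" field name are made up for illustration; the window and lag settings mirror the ones described above):

    import java.util.concurrent.TimeUnit;
    import org.apache.storm.topology.TopologyBuilder;
    import org.apache.storm.topology.base.BaseWindowedBolt.Duration;

    // kafkaSpout, MinuteBucketBolt, and DailyAggregateBolt are hypothetical; the windowed
    // bolts extend BaseWindowedBolt.
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("kafka-spout", kafkaSpout);

    // Stage 1: a 1-minute tumbling window that collapses raw tuples into one bucket tuple per minute.
    builder.setBolt("minute-bucket",
            new MinuteBucketBolt()
                .withTumblingWindow(new Duration(1, TimeUnit.MINUTES))
                .withTimestampField("ts")
                .withLag(new Duration(3, TimeUnit.SECONDS)))
        .shuffleGrouping("kafka-spout");

    // Stage 2: the original 24-hour window sliding every 15 minutes, now holding ~1440
    // bucket tuples instead of ~21 million raw messages.
    builder.setBolt("daily-window",
            new DailyAggregateBolt()
                .withWindow(new Duration(24, TimeUnit.HOURS), new Duration(15, TimeUnit.MINUTES))
                .withTimestampField("ts")
                .withLag(new Duration(3, TimeUnit.SECONDS)))
        .shuffleGrouping("minute-bucket");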

How to stop the AWS Redshift resize activity?

The resize operation seems very, very slow.
We have a 3-node ds2.xlarge cluster and decided to scale it down to 2 nodes. The resize has been running for the last 28 hours, but completion is only at 48% (screenshot attached). So:
Do we need to wait 30+ more hours for it to finish, and will the cluster be in read-only mode until then?
Based on this, should we conclude that a resize usually takes 60+ hours?
What if I want to terminate the process?
Please advise.
60+ hours is anomalous; per the documentation it should take less than 48 hours:
(resizing) ... can take anywhere from a couple of hours to a couple of days.
You can't stop it from the console, but you can contact AWS support to stop it for you.
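If you want to watch the progress outside the console, the resize status can also be polled through the DescribeResize API; here is a rough sketch with the AWS SDK for Java (v1), using a placeholder cluster identifier:

    import com.amazonaws.services.redshift.AmazonRedshift;
    import com.amazonaws.services.redshift.AmazonRedshiftClientBuilder;
    import com.amazonaws.services.redshift.model.DescribeResizeRequest;
    import com.amazonaws.services.redshift.model.DescribeResizeResult;

    // Roughly the same data as `aws redshift describe-resize --cluster-identifier my-cluster`.
    AmazonRedshift redshift = AmazonRedshiftClientBuilder.defaultClient();
    DescribeResizeResult resize = redshift.describeResize(
        new DescribeResizeRequest().withClusterIdentifier("my-cluster"));

    System.out.printf("status=%s, %d of %d MB transferred%n",
        resize.getStatus(),
        resize.getProgressInMegaBytes(),
        resize.getTotalResizeDataInMegaBytes());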

Does "concurrency" limit of 10 guarantee 10 parallel slice runs?

In ADF we can define a concurrency limit up to a maximum of 10. So, assuming we set it to 10 and slices are waiting to run (not waiting on a dataset dependency, etc.), is it always guaranteed that 10 slices will be running in parallel at any given time? I have noticed that even after setting it to 10, sometimes only a couple of them are in progress, or maybe the UI just doesn't show it properly. Is it subject to available resources? But then, it's the cloud; resources are virtually infinite. Has anyone noticed anything like this?
If there are 10 slices to run in parallel and all of their dependencies have been met, then 10 slices will run in parallel. Do raise an Azure support ticket if you do not see this happening and we will look into it. There may be a small delay in kicking all 10 off, but 10 should run in parallel.
Thanks, Harish