Slow reads on MongoDB from Spark - weird task allocation

I have a MongoDB 4.2 cluster with 15 shards; the database stores a sharded collection of 6GB (i.e., about 400MB per machine).
I'm trying to read the whole collection from Apache Spark, which runs on the same machine. Spark's application runs with --num-executors 8 and --executor-cores 6; the connection is made through the spark-connector by configuring the MongoShardedPartitioner.
Besides the reading being very slow (about 1.5 minutes; but, as far as I understand, full scans are generally bad on MongoDB), I'm experiencing this weird behavior in Spark's task allocation:
The issues are the following:
For some reason, only one of the executors starts reading from the database, while all the others wait 25 seconds to begin their readings. The red bars correspond to "Task Deserialization Time", but my understanding is that they are simply idle (if there are concurrent stages, these executors work on something else and then come back to this stage only after the 25 seconds).
For some other reason, after some time the concurrent allocation of tasks is suspended and then resumes all at once (at about 55 seconds from the start of the job); you can see it in the middle of the picture, where a whole bunch of tasks is started at the same time.
Overall, the full scan could be completed in far less time if tasks were allocated properly.
What is the reason for these behaviors and who is responsible (is it Spark, the spark-connector, or MongoDB)? Is there some configuration parameter that could cause these problems?
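As a rough way to reason about how many tasks this read should produce, here is a minimal sketch in Python. It assumes the partitioner creates one partition per chunk and that the cluster uses a 64 MB chunk size (an assumption about this deployment, since chunk size is configurable); `expected_task_waves` is a hypothetical helper, not part of Spark or the connector.

```python
import math

GB = 1024 ** 3
MB = 1024 ** 2


def expected_task_waves(collection_bytes, chunk_bytes, num_executors, executor_cores):
    """Estimate partitions, task slots, and scheduling waves for a chunk-based read."""
    # Assumption: one Spark partition (and so one task) per chunk.
    partitions = math.ceil(collection_bytes / chunk_bytes)
    # Each executor core can run one task at a time.
    slots = num_executors * executor_cores
    # Number of rounds needed if every slot is kept busy.
    waves = math.ceil(partitions / slots)
    return partitions, slots, waves


# Figures from the question: 6 GB collection, 8 executors with 6 cores each.
# The 64 MB chunk size is assumed, not taken from the question.
partitions, slots, waves = expected_task_waves(6 * GB, 64 * MB, 8, 6)
print(partitions, slots, waves)
```

Under these assumptions the job should need only a couple of waves over the available slots, so long idle gaps between waves would point at scheduling or connection setup rather than at the amount of data.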

Related

Spark unfinished stages. Spark app is idle

I've run into a situation, when running a cluster on AWS EMR, where one stage remained 'running' while the execution plan continued to progress. Look at the screenshot from the Spark UI (job 4 has running tasks, yet job 7 is in progress). My question is how to debug such a situation, and whether there are any hints I can find in the DAG.
My thought is that it could be a memory issue, because the data is heavy and there are a lot of spills to disk. However, I am wondering why Spark stays idle for an hour. Is it related to driver memory issues?
UPD1:
Based on Ravi's requests:
(1) check the time since they are running and the GC time also. If GC time is >20% of the execution time it means that u r throttled by the memory.
No, it is not an issue.
(2) check number of active tasks in the same page.
That's really weird: there are executors with more active tasks than they have cores (3x more for some executors), yet I do not see any executor failures.
(3) see if all executors are equally spending time in running the job
Not an issue
(4) what u showed above is the job what abt stages etc? are they also paused for ever?
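Checks (1) and (2) above can be automated. The following is a minimal sketch, assuming per-executor records shaped like the executor summaries exposed by Spark's monitoring REST API (fields `totalDuration`, `totalGCTime`, `activeTasks`, `totalCores`); `flag_executors` is a hypothetical helper and the sample records are made up for illustration.

```python
def flag_executors(executors, gc_threshold=0.20):
    """Return (executor id, reasons) for executors that look memory-bound or oversubscribed."""
    flagged = []
    for ex in executors:
        reasons = []
        duration = ex.get("totalDuration", 0)
        # Check (1): GC time above the threshold share of total task time.
        if duration > 0 and ex.get("totalGCTime", 0) / duration > gc_threshold:
            reasons.append("gc")
        # Check (2): more active tasks than cores on the executor.
        if ex.get("activeTasks", 0) > ex.get("totalCores", 0):
            reasons.append("oversubscribed")
        if reasons:
            flagged.append((ex["id"], reasons))
    return flagged


# Illustrative records only; real values would come from the Spark UI or REST API.
sample = [
    {"id": "1", "totalDuration": 100_000, "totalGCTime": 5_000, "activeTasks": 4, "totalCores": 4},
    {"id": "2", "totalDuration": 100_000, "totalGCTime": 30_000, "activeTasks": 12, "totalCores": 4},
]
print(flag_executors(sample))
```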

Storm large window size causing executor to be killed by Nimbus

I have a Java Spring application that submits topologies to a Storm (1.1.2) Nimbus based on a DTO which creates the structure of the topology.
This is working great except for very large windows. I am testing it with several varying sliding and tumbling windows. None are giving me any issue besides a 24 hour sliding window which advances every 15 minutes. The topology will receive ~250 messages/s from Kafka and simply windows them using a simple timestamp extractor with a 3 second lag (much like all the other topologies I am testing).
I have played with the workers and memory allowances greatly to try and figure this out but my default configuration is 1 worker with a 2048mb heap size. I've also tried reducing the lag which had minimal effects.
I think it's possible that the window is getting too large and the worker is running out of memory, which delays the heartbeats or the ZooKeeper connection check-in, which in turn causes Nimbus to kill the worker.
What happens is that every so often (~11 window advances) the Nimbus logs report that the Executor for that topology is "not alive", and the worker logs for that topology show either a KeeperException where the topology can't communicate with ZooKeeper or a java.lang.ExceptionInInitializerError:null with a nested PrivilegedActionException.
When the topology is assigned a new worker, the aggregation I was doing is lost. I assume this is happening because the window is holding at least 250*60*15*11 (messagesPerSecond*secondsPerMinute*15mins*windowAdvancesBeforeCrash) messages, which are around 84 bytes each. To complete the entire window it will end up holding 250*60*15*97 messages (messagesPerSecond*secondsPerMinute*15mins*15minIncrementsIn24HoursPlusAnExpiredWindow). This is ~1.8 GB if my math is right, so I feel like the worker memory should cover the window, or at least more than 11 window advances' worth.
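The arithmetic above can be checked with a short Python sketch. It uses only the figures from the post (250 messages/s, 15-minute advances, 84 bytes per message) and counts raw message bytes only; any per-object overhead inside the JVM would come on top and is not modeled here. The constant and function names are hypothetical.

```python
MESSAGES_PER_SECOND = 250
SECONDS_PER_ADVANCE = 15 * 60  # the window advances every 15 minutes
BYTES_PER_MESSAGE = 84


def window_payload_bytes(advances):
    """Raw payload bytes held after the given number of 15-minute advances."""
    messages = MESSAGES_PER_SECOND * SECONDS_PER_ADVANCE * advances
    return messages * BYTES_PER_MESSAGE


at_crash = window_payload_bytes(11)     # roughly when the worker is killed
full_window = window_payload_bytes(97)  # 24 hours plus one expired slice

print(f"at crash:    {at_crash / 1e6:.1f} MB")
print(f"full window: {full_window / 1e9:.2f} GB")
```

So the raw payload at the point of the crash is only around 0.2 GB, and the ~1.8 GB figure applies to the full window.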
I could increase the memory slightly but not much. I could also decrease the amount of memory/worker and increase the number of workers/topology, but I was wondering if there is something I'm missing? Could I just increase the heartbeat timeout for the worker so that there is more time for the executor to check in before being killed, or would that be bad for some reason? If I changed the heartbeat, it would be in the Config map for the topology. Thanks!
This was caused by the workers running out of memory. From looking at the Storm code, it looks like Storm keeps around every message in a window as a Tuple (which is a fairly big object). With a high rate of messages and a 24 hour window, that's a lot of memory.
I fixed this by using a preliminary bucketing bolt that would bucket all the tuples in an initial 1 minute window which reduced the load on the main window significantly because it was now receiving one tuple per minute. The bucketing window doesn't run out of memory since it only has one minute of tuples at a time in its window.
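The bucketing idea can be illustrated outside Storm with a minimal Python sketch. It is not Storm code: `bucket_by_minute` is a hypothetical function, and the (count, total) summary stands in for whatever aggregate the real topology computes. It only shows how per-minute pre-aggregation shrinks what the large window has to hold.

```python
from collections import defaultdict


def bucket_by_minute(events):
    """Collapse (timestamp_seconds, value) events into one (count, total) summary per minute."""
    buckets = defaultdict(lambda: [0, 0.0])
    for ts, value in events:
        minute = int(ts // 60)
        buckets[minute][0] += 1
        buckets[minute][1] += value
    return {minute: (count, total) for minute, (count, total) in sorted(buckets.items())}


# Simulate 250 events per second for 3 minutes, each carrying the value 1.0.
events = [(second + i / 250, 1.0) for second in range(180) for i in range(250)]
summaries = bucket_by_minute(events)
print(len(events), "events ->", len(summaries), "summaries")
```

With one summary per minute, a 24 hour window holds about 24 * 60 = 1440 entries rather than one entry per message.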

Spark over Yarn some tasks are extremely slower

I am using a cluster of 12 virtual machines, each of which has 16 GB of memory and 6 cores (except the master node, which has only 2 cores). Each worker node was assigned 12 GB of memory and 4 cores.
When I submit a Spark application to YARN, I set the number of executors to 10 (1 as master manager, 1 as application master), and to maximize the parallelism of my application, most of my RDDs have 40 partitions, the same as the total number of cores across all executors.
The following is the problem I encountered: in some random stages, some tasks take far longer to process than others, which results in poor parallelism. As we can see in the first picture, executor 9 took over 30s to execute its tasks, while other tasks finished within 1s. Furthermore, the cause of the extra time also varies: sometimes it is just computation, but sometimes it is scheduler delay, deserialization or shuffle read. As we can see, the cause in the second picture is different from that in the first picture.
My guess is that once a task is assigned to a specific slot, there are not enough resources on the corresponding machine, so the JVM is waiting for CPUs. Is my guess correct? And how should I configure my cluster to avoid this situation?
computing
scheduler delay & deserialization
To get a specific answer you need to share more about what you're doing, but most likely the partitions you get in one or more of your stages are unbalanced, i.e. some are much bigger than others. The result is a slowdown, since each of these partitions is handled by a single task. One way to solve it is to increase the number of partitions or change the partitioning logic.
When a big task finishes, shipping its data to other tasks takes longer as well, which is why other tasks may also take a long time.
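To check whether partitions are unbalanced, one option is to compare the largest partition with the median. The sketch below is plain Python over a list of per-partition record counts; in PySpark such a list could be collected with `rdd.glom().map(len).collect()`. `skew_report` is a hypothetical helper and the sample sizes are invented for illustration.

```python
from statistics import median


def skew_report(partition_sizes):
    """Summarize how far the largest partition is from the typical one."""
    typical = median(partition_sizes)
    largest = max(partition_sizes)
    # A ratio far above 1 means one task has much more work than the rest.
    ratio = largest / typical if typical else float("inf")
    return {"median": typical, "max": largest, "ratio": ratio}


# Illustrative only: 39 even partitions and one that is 30x larger.
sizes = [1000] * 39 + [30000]
print(skew_report(sizes))
```

A ratio close to 1 would suggest the slow tasks have another cause, such as contention on the host, rather than data skew.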

MongoDB is giving inconsistent write times

I am using Scala, Reactive Mongo 0.10.5 and Mongo 2.6.4 running on Ubuntu. I have tested on a few machine configurations but right now I am working with 15 GB of memory, 2 cores and 60 GB of SSD storage (AWS).
I have just set up a test mongo instance and have been using it to benchmark a few things, however I am seeing some inconsistency that I can't explain.
I am writing a consistent amount of data using 10 separate threads to a single collection. Each write consists of a document containing an array which contains 1000 elements. Each element is a complex document consisting of several fields and nested fields. I have tested with arrays of 1000, 10000 and 100 and have seen the same behavior with all. Each write is unique (i.e. I never write to the same document twice)
The write speed tends to be around 100-200ms per write with the current hardware I am using. I would like better but that isn't my main issue.
My main issue is that sometimes the write times will spike. When they do, it can take a single write several seconds to complete. They do eventually complete but it takes a while. I have timeouts built into the app doing the writing (10 seconds) and when the spikes happen it will frequently hit that timeout. I have increased the timeout and verified that the write does eventually complete but it can take a long time (30+ seconds).
I have worked with Mongo before using the Mongo Java Driver in Scala and have not noticed this problem. However it is unclear whether the issue is a result of the driver, or my Mongo setup.
I have looked at the logs and while they report when the query is taking longer, they don't actually provide any information about why it is taking longer. I have done the same with profiling and again they report a long query but don't say why it is long.
I have run mongostat while running the test, and when the writes start taking a long time I notice a similar slowdown in mongostat, i.e., mongostat will pause for several seconds before continuing.
The mongo machine itself is bored while this is happening. Load averages are minimal as are CPU and memory usage. It does not appear to be going into swap.
I suspect I just have something configured incorrectly in the Mongo but I haven't been able to find anything that indicates what.
Has anyone seen this behavior before? Is it something in my configuration or perhaps something with the Reactive Mongo driver?
UPDATE:
Using iostat I was able to determine that the normal writes/second is hitting around 1Mb/second. However during the slow periods it spikes to 6-7Mb/second.
I also found the following in the mongo logs.
[DataFileSync] flushing mmaps took 15621ms for 35 files
[DataFileSync] flushing mmaps took 14816ms for 22 files
In at least one case this log statement corresponds exactly with one of the slowdowns.
This definitely seems to be a disk flush problem based on these observations.
Does this imply that I am pushing more data than the current Mongo configuration can handle? Or is there some other configuration that can be done to reduce the impact of those flushes?
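To line up flushes with slow writes more systematically, the flush lines can be pulled out of the log. This is a minimal sketch, assuming the log lines have exactly the shape quoted above; `slow_flushes` and its 1000 ms threshold are hypothetical choices, not MongoDB tooling.

```python
import re

# Matches the message format quoted from the mongo logs above.
FLUSH_RE = re.compile(r"\[DataFileSync\] flushing mmaps took (\d+)ms for (\d+) files")


def slow_flushes(lines, threshold_ms=1000):
    """Return (duration_ms, file_count) for every flush at or above the threshold."""
    hits = []
    for line in lines:
        match = FLUSH_RE.search(line)
        if match and int(match.group(1)) >= threshold_ms:
            hits.append((int(match.group(1)), int(match.group(2))))
    return hits


log = [
    "[DataFileSync] flushing mmaps took 15621ms for 35 files",
    "[DataFileSync] flushing mmaps took 14816ms for 22 files",
]
print(slow_flushes(log))
```

If the real log lines carry timestamps, the same loop could keep them so each flush can be matched against the time of a slow write.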
It appears that in this case the problem may actually have been related to thread locking within the application itself. Once I resolved the issues with thread locking these other issues seemed to go away.
To be honest I don't know why thread locking would result in the observed behavior in Mongo, but if the problem is gone I am not going to complain.

MongoDB upsert operation blocks inconsistently (with syncdelay set to 0)

There is a database with 9 million rows, with 3 million distinct entities. This database is loaded every day into MongoDB using the Perl driver. It runs smoothly on the first load. However, from the second load onward the process slows down considerably. It blocks for long periods every now and then.
I initially thought that this was because of the automatic flushing to disk every 60 seconds, so I tried setting syncdelay to 0 and I tried the nojournal option. I have indexed the fields that are used for the upsert. I have also observed that the blocking is inconsistent and does not always occur at the same time or for the same line.
I have 17 GB of RAM and enough hard disk space. I am replicating on two servers with one arbiter. I do not have any significant processes running in the background. Is there a solution/explanation for such blocking?
UPDATE: The mongostat tool says in the 'res' column that around 3.6 GB is used.