I have a Scala job that processes a small set of CDC records (about 10K). As part of a merge job it does member matching against another Hive table with 4 million members, provider matching against another Hive table with 1 million providers, and a bunch of other steps (recycle logic/rejection logic). It usually takes around 30 minutes to complete.
With the same volume, I added an action (count(*)) after every module to understand which one takes more time; after adding the actions, the job completed in 6 minutes. Best practice says not to call actions frequently, so I don't understand what is making the job run faster. Any link with an explanation would be helpful.
Is it that resources get released after each module executes because of the action, and that is what makes the overall job faster?
My cluster has 6 nodes, each with 13 cores and 64 GB of memory, but other processes also run on it, so it is usually over-utilized.
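To make the setup concrete, the pattern looks roughly like the sketch below; the module names, join keys, and column names are simplified placeholders, not the real code:

```scala
import org.apache.spark.sql.DataFrame

// Simplified stand-ins for the real modules; join columns are placeholders.
def matchMembers(cdc: DataFrame, members: DataFrame): DataFrame =
  cdc.join(members, Seq("member_id"), "left")

def matchProviders(df: DataFrame, providers: DataFrame): DataFrame =
  df.join(providers, Seq("provider_id"), "left")

def runMerge(cdc: DataFrame, members: DataFrame, providers: DataFrame): DataFrame = {
  val memberMatched = matchMembers(cdc, members)
  println(s"after member matching: ${memberMatched.count()}")     // action added for timing

  val providerMatched = matchProviders(memberMatched, providers)
  println(s"after provider matching: ${providerMatched.count()}") // action added for timing

  // Note: without persist()/cache() before each count(), later modules still
  // re-evaluate the full lineage; each count() only forces execution up to that point.
  providerMatched
}
```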
I have upgraded from Version 1 of the DynamoDB SDK to Version 2.
In Version 1, the DynamoDBMapper.batchSave() method has no batch size limitation, as far as I can tell: even if I pass 100+ records, it runs successfully.
In Version 2, I'm using DynamoDbEnhancedClient.batchWriteItem(), which has a batch size limitation: only up to 25 records are processed per batch. So, to process 100+ records, I'm doing iterations.
Reference documentation on Batch Size limitations:
https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_BatchWriteItem.html
Does anyone have an idea why Version 1 handles this automatically, while in Version 2 we have to do separate iterations?
Are there any other, more efficient ways to do batch operations in Version 2?
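For reference, the iteration I ended up with looks roughly like this; it's sketched in Scala against the v2 enhanced client, with the item class and table handle passed in by the caller, and without the retry loop for unprocessed items that production code would need:

```scala
import software.amazon.awssdk.enhanced.dynamodb.{DynamoDbEnhancedClient, DynamoDbTable}
import software.amazon.awssdk.enhanced.dynamodb.model.{BatchWriteItemEnhancedRequest, WriteBatch}

// Chunk the input into groups of 25 (the BatchWriteItem limit) and issue
// one batchWriteItem call per chunk.
def batchSaveAll[T](enhanced: DynamoDbEnhancedClient,
                    table: DynamoDbTable[T],
                    itemClass: Class[T],
                    items: Seq[T]): Unit =
  items.grouped(25).foreach { chunk =>
    val writeBatch = chunk
      .foldLeft(WriteBatch.builder(itemClass).mappedTableResource(table)) {
        (builder, item) => builder.addPutItem(item)
      }
      .build()

    val result = enhanced.batchWriteItem(
      BatchWriteItemEnhancedRequest.builder().writeBatches(writeBatch).build()
    )

    // Production code should retry these; the sketch just surfaces them.
    val unprocessed = result.unprocessedPutItemsForTable(table)
    if (!unprocessed.isEmpty)
      println(s"unprocessed items: ${unprocessed.size()}")
  }
```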
I have a MongoDB 4.2 cluster with 15 shards; the database stores a sharded collection of 6GB (i.e., about 400MB per machine).
I'm trying to read the whole collection from Apache Spark, which runs on the same machine. Spark's application runs with --num-executors 8 and --executor-cores 6; the connection is made through the spark-connector by configuring the MongoShardedPartitioner.
Besides the reading being very slow (about 1.5 minutes; but, as far as I understand, full scans are generally bad on MongoDB), I'm experiencing this weird behavior in Spark's task allocation:
The issues are the following:
1. For some reason, only one of the executors starts reading from the database, while all the others wait 25 seconds before beginning their reads. The red bars correspond to "Task Deserialization Time", but my understanding is that the executors are simply idle (if there are concurrent stages, they work on something else and only come back to this stage after the 25 seconds).
2. For some other reason, after a while the concurrent allocation of tasks is suspended and then resumes all at once (at about 55 seconds from the start of the job); you can see it in the middle of the picture, where a whole bunch of tasks start at the same time.
Overall, the full scan could be completed in far less time if tasks were allocated properly.
What is the reason for these behaviors and who is responsible (is it Spark, the spark-connector, or MongoDB)? Is there some configuration parameter that could cause these problems?
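For completeness, the read is configured roughly as below (2.x connector API); the URI, namespace, and shard key are placeholders for the real values:

```scala
import com.mongodb.spark.MongoSpark
import com.mongodb.spark.config.ReadConfig
import org.apache.spark.sql.SparkSession

// Placeholder URI and namespace; the real job points at a mongos router.
val spark = SparkSession.builder()
  .appName("mongo-full-scan")
  .config("spark.mongodb.input.uri", "mongodb://mongos-host:27017/mydb.mycoll")
  .getOrCreate()

// MongoShardedPartitioner creates Spark partitions from the collection's chunks,
// so the parallelism Spark sees depends on how chunks are spread across shards.
val readConfig = ReadConfig(
  Map(
    "partitioner" -> "MongoShardedPartitioner",
    "partitionerOptions.shardkey" -> "_id" // placeholder; the real shard key goes here
  ),
  Some(ReadConfig(spark))
)

val df = MongoSpark.load(spark, readConfig)
println(df.count()) // forces the full collection scan
```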
I have a Redshift cluster with 3 nodes. Every now and then, with users running queries against it, we end up in an unpleasant situation where some queries run for far longer than expected (even simple ones exceeding 15 minutes), and the cluster's storage use starts increasing, to the point that if you don't terminate the long-running queries it reaches 100% of storage.
I wonder why this may happen. My experience varies: sometimes it's been a single query doing this, and sometimes it's been several different queries running concurrently.
One specific scenario where we saw this happen involved LISTAGG. The result type of LISTAGG is VARCHAR(65535), and while Redshift optimizes away the implicit trailing blanks when the value is stored to disk, the full width is required in memory during processing.
If you have a query that returns a million rows, you end up with 1,000,000 rows times 65,535 bytes per LISTAGG, which is about 65 GB. That can quickly get you into a situation like the one you describe, with queries taking unexpectedly long or failing with "Disk Full" errors.
My team discussed this a bit more on our team blog the other day.
This typically happens when a poorly constructed query spills too much data to disk. For instance, the user accidentally specifies a Cartesian product (every row of tblA joined to every row of tblB).
If this happens regularly, you can implement a QMR (query monitoring rule) that limits the amount of disk spill allowed before a query is aborted.
QMR Documentation: https://docs.aws.amazon.com/redshift/latest/dg/cm-c-wlm-query-monitoring-rules.html
QMR Rule Candidates query: https://github.com/awslabs/amazon-redshift-utils/blob/master/src/AdminScripts/wlm_qmr_rule_candidates.sql
I have a crawler which consists of 6+ processes. Half of the processes are masters that crawl the web; when they find jobs, they put them into a jobs collection. Most of the time the masters save 100 jobs at once (that is, they get 100 jobs and save each one separately, as fast as possible).
The other half of the processes are slaves that constantly check whether new jobs of a given type are available for them; if so, a slave marks the job as in_process (done with findOneAndUpdate), processes it, and saves the result in another collection.
Moreover, from time to time the master processes have to read a lot of data from the jobs collection to synchronize.
So, to sum up, there are a lot of read and write operations on the database. When the database was small it worked fine, but now that I have ~700k job records (each job document is small, with 8 fields, and has proper indexes / compound indexes), the database lags. I can observe it when displaying the "stats" page, which basically executes ~16 count operations with some conditions (on indexed fields).
When the master/slave processes are not running, the stats page displays after 2 seconds. When they are running, the same page takes around 1 minute and sometimes doesn't display at all (timeout).
So how can I make my database handle more requests per second? Do I have to replicate it?
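The claim step in the slaves looks roughly like this sketch (written here with the MongoDB Scala driver; the connection string, field names, and status values are placeholders for the real ones):

```scala
import org.mongodb.scala.MongoClient
import org.mongodb.scala.model.{Filters, Updates}

// Placeholder connection string, database, and field/status names.
val jobs = MongoClient("mongodb://localhost:27017")
  .getDatabase("crawler")
  .getCollection("jobs")

// Atomically claim one unprocessed job of the given type; a compound index
// on (type, status) keeps this lookup cheap as the collection grows.
// The returned Observable is subscribed to (or awaited) by the slave before processing.
def claimJob(jobType: String) =
  jobs.findOneAndUpdate(
    Filters.and(Filters.equal("type", jobType), Filters.equal("status", "new")),
    Updates.set("status", "in_process")
  )
```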
There is a database with 9 million rows covering 3 million distinct entities. This database is loaded into MongoDB every day using the Perl driver. It runs smoothly on the first load; however, from the second, third, etc. load onward, the process slows down considerably and blocks for long stretches every now and then.
I initially realised that this was because of the automatic flushing to disk every 60 seconds, so I tried setting syncdelay to 0 and I tried the nojournal option. I have indexed the fields that are used for the upsert. I have also observed that the blocking is inconsistent and does not always happen at the same line.
I have 17 GB of RAM and enough hard disk space. I am replicating across two servers with one arbiter. I do not have any significant processes running in the background. Is there a solution/explanation for this blocking?
UPDATE: The mongostat tool shows around 3.6 GB in the 'res' column.
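For illustration of the upsert path being discussed (the actual job uses the Perl driver), here is the same pattern sketched with the MongoDB Scala driver, batching upserts into unordered bulkWrite calls; the namespace, the entity_id key, and the batching itself are assumptions:

```scala
import scala.concurrent.Await
import scala.concurrent.duration._

import com.mongodb.client.model.{BulkWriteOptions, ReplaceOneModel, ReplaceOptions}
import org.mongodb.scala._
import org.mongodb.scala.model.Filters

// Placeholder connection string and namespace.
val coll = MongoClient("mongodb://localhost:27017")
  .getDatabase("etl")
  .getCollection("entities")

// Upsert a batch of rows keyed on a (hypothetical) entity_id field.
// An unordered bulkWrite cuts round trips compared to issuing one upsert per row.
def upsertBatch(rows: Seq[Document]): Unit = {
  val ops = rows.map { doc =>
    new ReplaceOneModel(
      Filters.equal("entity_id", doc("entity_id")),
      doc,
      new ReplaceOptions().upsert(true)
    )
  }
  Await.result(
    coll.bulkWrite(ops, new BulkWriteOptions().ordered(false)).toFuture(),
    5.minutes
  )
}
```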