Unstable insert rate in MongoDB

I have a process that can generate 20,000 records per second (record size ~30 KB). I am trying to insert them as fast as possible into a single instance of MongoDB, but I am getting only ~1500 inserts per second, with an unstable rate ranging from 1000 to 2000 inserts per second. The question is: what is the reason, and how do I fix it? :) Here is the data from mongostat for 2.5 hours:
Setup
I am running an instance in the cloud with 8 cores, 16 GB RAM, 150 GB HDD, Ubuntu 18.04, and the official MongoDB 4.0 docker image. On the same instance run 2 workers that each generate 10,000 records per second and insert_many them into MongoDB in chunks of 100 records (a sketch of the worker loop follows the sample records below). Each record is split between 2 collections, cases and docs; docs uses zlib compression. A cases record is ~1 KB on average. A random record as an example:
{'info': {'judge': 'Орлова Олеся Викторовна', 'decision': 'Отменено с возвращением на новое рассмотрение', 'entry_date': datetime.datetime(2017, 1, 1, 0, 0), 'number': '12-48/2017 (12-413/2016;)', 'decision_date': datetime.datetime(2017, 2, 9, 0, 0)}, 'acts': [{'doc': ObjectId('5c3c76543d495a000c97243b'), 'type': 'Решение'}], '_id': ObjectId('5c3c76543d495a000c97243a'), 'sides': [{'name': 'Кузнецов П. В.', 'articles': 'КоАП: ст. 5.27.1 ч.4'}], 'history': [{'timestamp': datetime.datetime(2017, 1, 1, 15, 6), 'type': 'Материалы переданы в производство судье'}, {'timestamp': datetime.datetime(2017, 2, 9, 16, 0), 'type': 'Судебное заседание', 'decision': 'Отменено с возвращением на новое рассмотрение'}, {'timestamp': datetime.datetime(2017, 2, 17, 15, 6), 'type': 'Дело сдано в отдел судебного делопроизводства'}, {'timestamp': datetime.datetime(2017, 2, 17, 15, 7), 'type': 'Вручение копии решения (определения) в соотв. с чч. 2, 2.1, 2.2 ст. 30.8 КоАП РФ'}, {'timestamp': datetime.datetime(2017, 3, 13, 16, 6), 'type': 'Вступило в законную силу'}, {'timestamp': datetime.datetime(2017, 3, 14, 16, 6), 'type': 'Дело оформлено'}, {'timestamp': datetime.datetime(2017, 3, 29, 14, 33), 'type': 'Дело передано в архив'}], 'source': {'date': datetime.datetime(2017, 1, 1, 0, 0), 'engine': 'v1', 'instance': 'appeal', 'host': 'bratsky.irk.sudrf.ru', 'process': 'adm_nar', 'crawled': datetime.datetime(2018, 12, 22, 8, 15, 7), 'url': 'https://bratsky--irk.sudrf.ru/modules.php?name=sud_delo&srv_num=1&name_op=case&case_id=53033119&case_uid=A84C1A34-846D-4912-8242-C7657985873B&delo_id=1502001'}, 'id': '53033119_A84C1A34-846D-4912-8242-C7657985873B_1_'}
A docs record is ~30 KB on average:
{'_id': ObjectId('5c3c76543d495a000c97243b'), 'data': 'PEhUTUw+PEhFQUQ+DQo8TUVUQSBodHRwLWVxdWl2PUNvbnRlbnQtVHlwZSBjb250ZW50PSJ0ZXh0L2h0bWw7IGNoYXJzZXQ9V2luZG93cy0xMjUxIj4NCjxTVFlMRSB0eXBlPXRleHQvY3NzPjwvU1RZTEU+DQo8L0hFQUQ+DQo8Qk9EWT48U1BBTiBzdHlsZT0iVEVYVC1BTElHTjoganVzdGlmeSI+DQo8UCBzdHlsZT0iVEVYVC1JTkRFTlQ6IDAuNWluOyBURVhULUFMSUdOOiBjZW50ZXIiPtCgINCVINCoINCVINCdINCYINCVPC9QPg0KPFAgc3R5bGU9IlRFWFQtSU5ERU5UOiAwLjVpbjsgVEVYVC1BTElHTjoganVzdGlmeSI+0LMuINCR0YDQsNGC0YHQuiAwOSDRhNC10LLRgNCw0LvRjyAyMDE3INCz0L7QtNCwPC9QPg0KPFAgc3R5bGU9IlRFWFQtSU5ERU5UOiAwLjVpbjsgVEVYVC1BTElHTjoganVzdGlmeSI+0KHRg9C00YzRjyDQkdGA0LDRgtGB0LrQvtCz0L4g0LPQvtGA0L7QtNGB0LrQvtCz0L4g0YHRg9C00LAg0JjRgNC60YPRgtGB0LrQvtC5INC+0LHQu9Cw0YHRgtC4INCe0YDQu9C+0LLQsCDQni7Qki4sINGA0LDRgdGB0LzQvtGC0YDQtdCyINCw0LTQvNC40L3QuNGB0YLRgNCw0YLQuNCy0L3QvtC1INC00LXQu9C+IOKEliAxMi00OC8yMDE3INC/0L4g0LbQsNC70L7QsdC1INC40L3QtNC40LLQuNC00YPQsNC70YzQvdC+0LPQviDQv9GA0LXQtNC/0YDQuNC90LjQvNCw0YLQtdC70Y8g0JrRg9C30L3QtdGG0L7QstCwIDxTUE.....TlQ6IDAuNWluOyBURVhULUFMSUdOOiBqdXN0aWZ5Ij7QoNC10YjQtdC90LjQtSDQvNC+0LbQtdGCINCx0YvRgtGMINC+0LHQttCw0LvQvtCy0LDQvdC+INCyINCY0YDQutGD0YLRgdC60LjQuSDQvtCx0LvQsNGB0YLQvdC+0Lkg0YHRg9C0INCyINGC0LXRh9C10L3QuNC1IDEwINGB0YPRgtC+0Log0YEg0LzQvtC80LXQvdGC0LAg0L/QvtC70YPRh9C10L3QuNGPINC10LPQviDQutC+0L/QuNC4LjwvUD4NCjxQIHN0eWxlPSJURVhULUlOREVOVDogMC41aW47IFRFWFQtQUxJR046IGp1c3RpZnkiPtCh0YPQtNGM0Y8g0J4u0JIuINCe0YDQu9C+0LLQsDwvUD48L1NQQU4+PC9CT0RZPjwvSFRNTD4=', 'extension': '.html'}
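For reference, a minimal sketch of what one ingestion worker does, assuming pymongo; the connection string, database name, record layout, and the generate_chunks() helper are placeholders for illustration, not my actual code:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # placeholder URI
db = client["courts"]                               # placeholder database name

def insert_chunk(records):
    # each generated record is split between the two collections described above;
    # ordered=False lets the server keep inserting past an individual failure
    db.cases.insert_many([r["case"] for r in records], ordered=False)
    db.docs.insert_many([r["doc"] for r in records], ordered=False)

for chunk in generate_chunks(size=100):             # hypothetical generator, 100 records per chunk
    insert_chunk(chunk)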
Analysis
To figure out what is going on, I use docker stats and mongostat. Key metrics are highlighted:
I collect metrics for 2.5 hours during data insertion and plot CPU %, insert, and dirty from the pictures above:
One can see that the insert rate drops when dirty peaks at 20% and goes back up to ~2000 when dirty is lower than 20%:
Dirty goes down when the CPU is active. One can see that when CPU is at ~300%, dirty starts to go down (the plots are a bit out of sync since docker stats and mongostat run separately); when CPU is at ~200%, dirty grows back to 20% and inserts slow down:
Question
Is my analysis correct? It is my first time using MongoDB, so I may be wrong.
If the analysis is correct, why doesn't MongoDB always use 300%+ CPU (the instance has 8 cores) to keep dirty low and the insert rate high? Is it possible to force it to do so, and is that the right way to solve my issue?
Update
Maybe HDD IO is an issue?
I did not log IO utilisation, but
I remember looking at cloud.mongodb.com/freemonitoring during the insertion process; there is a plot called "Disk Utilisation", and it was 50% max.
Currently my problem is insert rate instability. I am OK with the current maximum of 2000 inserts per second. That means the current HDD can handle it, right? What I do not understand is why the insert rate periodically drops to 1000.
On sharding
Currently I am trying to reach maximum performance on a single machine.
Solution
Just change the HDD to an SSD.
Before:
After:
With the same ~1500 inserts per second, dirty is stable at ~5%. Inserts and CPU usage are now stable. This is the behaviour I expected to see. The SSD solves the problem from the title of this question, "Unstable insert rate in MongoDB".

Using a better disk will definitely improve performance. There are other metrics you can monitor:
The percentage of dirty bytes indicates data that has been modified in the WiredTiger cache but not yet persisted to disk. Monitor your disk IOPS to check whether it has reached your provisioned limit; use the iostat command, or get the numbers from MongoDB FTDC data (a client-side sketch follows these suggestions).
When your CPU spikes, check whether the CPU time is spent in iowait. If the iowait % is high, you have I/O blocking, i.e. a faster disk or more IOPS will help.
Monitor the qrw (queued read and write requests) and arw (active read and write requests) columns of the mongostat output. If these numbers remain low, as in your sample output, especially qrw, mongo is able to serve your requests without queuing them up.
Avoid resource competition by moving the ingestion workers to other instances.
You can further optimize by using separate disk partitions for the mongo data path and the journal location.
Client (ingestion worker) performance is usually overlooked. The CPU spikes might come from your workers, resulting in lower throughput. Monitor client performance using the top command or equivalent.
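If you want to track the dirty percentage from the client side as well, here is a minimal sketch using pymongo and the serverStatus command; the connection string and polling interval are placeholders, and the metric names are the WiredTiger cache counters reported by serverStatus in MongoDB 4.x:

import time
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # placeholder URI

while True:
    cache = client.admin.command("serverStatus")["wiredTiger"]["cache"]
    dirty = cache["tracked dirty bytes in the cache"]
    limit = cache["maximum bytes configured"]
    print("dirty cache: %.1f%%" % (100.0 * dirty / limit))
    time.sleep(5)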
Hope the above helps.

Related

MongoDB degrading write performance over time

I am importing a lot of data (18 GB, 3 million documents) over time; almost all of the data is indexed, so there is a lot of indexing going on. The system consists of a single client (a single process on a separate machine) establishing a single connection (using pymongo) and doing insert_many in batches of 1000 docs.
MongoDB setup:
single instance,
journaling enabled,
WiredTiger with default cache,
RHEL 7,
version 4.2.1
192GB RAM, 16 CPUs
1.5 TB SSD,
cloud machine.
When I start the server (after a full reboot) and insert the collection, it takes 1.5 hours. If the server has been running for a while inserting some other data (from a single client) and finishes inserting it, and I then delete the collection and insert the same data again, it takes 6 hours (there is still sufficient disk space, more than 60% free, and nothing else is connecting to the db). It feels like the server performance degrades over time, maybe in an OS-specific way. Any similar experience or ideas?
I had faced a similar issue; the problem was RAM.
After a full restart the server had all of its RAM free, but after the insertions the RAM was full. Deleting the collection and inserting the same data again might take longer because some RAM was still utilised and less was free for mongo.
Try freeing up the RAM and cache after you drop the collection, and check whether the same behaviour persists.
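For example, on Linux the OS page cache can be dropped after you drop the collection; a sketch (it needs root, and everything will have to be re-read from disk afterwards):

import subprocess

subprocess.run(["sync"], check=True)                # flush dirty pages first
with open("/proc/sys/vm/drop_caches", "w") as f:
    f.write("3\n")                                  # 3 = drop page cache plus dentries and inodes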
As you haven't provided any specific details, I would recommend that you enable profiling; this will allow you to examine performance bottlenecks. In the mongo shell, run:
db.setProfilingLevel(2)
Then run:
db.system.profile.find( { "millis": { "$gt": 10 } }, { "millis": 1, "command": 1 }) // find operations over 10 milliseconds
Once done, reset the profiling level:
db.setProfilingLevel(0)
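Since your client uses pymongo, the same steps can also be driven from Python; a sketch, assuming db is a pymongo handle to the target database:

db.command("profile", 2)                            # enable profiling for all operations
for op in db["system.profile"].find(
        {"millis": {"$gt": 10}},                    # operations slower than 10 milliseconds
        {"millis": 1, "command": 1}):
    print(op)
db.command("profile", 0)                            # reset the profiling level when done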

PostgreSQL autovacuum causing significant performance degradation

Our Postgres DB (hosted on Google Cloud SQL with 1 CPU, 3.7 GB of RAM, see below) consists mostly of one big ~90 GB table with ~60 million rows. The usage pattern consists almost exclusively of appends and a few indexed reads near the end of the table. From time to time a few users get deleted, deleting a small percentage of rows scattered across the table.
This all works fine, but every few months an autovacuum gets triggered on that table, which significantly impacts our service's performance for ~8 hours:
Storage usage increases by ~1GB for the duration of the autovacuum (several hours), then slowly returns to the previous value (might eventually drop below it, due to the autovacuum freeing pages)
Database CPU utilization jumps from <10% to ~20%
Disk Read/Write Ops increases from near zero to ~50/second
Database Memory increases slightly, but stays below 2GB
Transaction/sec and ingress/egress bytes are also fairly unaffected, as would be expected
This has the effect of increasing our service's 95th latency percentile from ~100ms to ~0.5-1s during the autovacuum, which in turn triggers our monitoring. The service serves around ten requests per second, with each request consisting of a few simple DB reads/writes that normally have a latency of 2-3ms each.
Here are some monitoring screenshots illustrating the issue:
The DB configuration is fairly vanilla:
The log entry documenting this autovacuum process reads as follows:
automatic vacuum of table "XXX": index scans: 1
    pages: 0 removed, 6482261 remain, 0 skipped due to pins, 0 skipped frozen
    tuples: 5959839 removed, 57732135 remain, 4574 are dead but not yet removable
    buffer usage: 8480213 hits, 12117505 misses, 10930449 dirtied
    avg read rate: 2.491 MB/s, avg write rate: 2.247 MB/s
    system usage: CPU 470.10s/358.74u sec elapsed 38004.58 sec
Any suggestions on what we could tune to reduce the impact of future autovacuums on our service? Or are we doing something wrong?
If you increase autovacuum_vacuum_cost_delay, the autovacuum will run slower and be less invasive.
However, the better solution is usually the opposite: set autovacuum_vacuum_cost_limit to 2000 or so, so that the autovacuum finishes faster (example below).
You could also try to schedule VACUUMs of the table yourself at times when it hurts least.
But frankly, if a single innocuous autovacuum is enough to disturb your operation, you need more I/O bandwidth.
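For example, both parameters can be set per table as storage parameters; a sketch using psycopg2, where the table name and the exact values are illustrative rather than a tuned recommendation:

import psycopg2

conn = psycopg2.connect("dbname=mydb")              # placeholder connection string
with conn, conn.cursor() as cur:
    # raise the cost limit so the autovacuum finishes sooner; alternatively,
    # raise autovacuum_vacuum_cost_delay instead for a slower, less invasive run
    cur.execute("""
        ALTER TABLE big_table
        SET (autovacuum_vacuum_cost_limit = 2000)
    """)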

Uniformly partition an RDD in Spark

I have a text file in HDFS with about 10 million records. I am trying to read the file and do some transformations on that data, and I want to partition the data uniformly before doing the processing. Here is the sample code:
var myRDD = sc.textFile("input file location")
myRDD = myRDD.repartition(10000)
And when I do my transformations on this repartitioned data, I see that one partition has an abnormally large number of records while the others have very little data (image of the distribution).
So the load is high on only one executor.
I also tried the following and got the same result:
myRDD.coalesce(10000, shuffle = true)
Is there a way to uniformly distribute records among partitions?
Attached is the shuffle read size / number of records on that particular executor.
The circled one has a lot more records to process than the others.
Any help is appreciated, thank you.
To deal with the skew, you can repartition your data using distribute by (or using repartition, as you did). For the expression to partition by, choose something that you know will distribute the data evenly.
You can even use the primary key of the DataFrame (RDD).
Even this approach will not guarantee that data is distributed evenly between partitions. It all depends on the hash of the expression by which we distribute.
Related question: Spark : how can evenly distribute my records in all partition
Salting can also be used; it involves adding a new "fake" key and using it alongside the current key for a better distribution of the data (see the sketch below).
(here is a link for salting)
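A sketch of what salting can look like with the DataFrame API, assuming a DataFrame df with a skewed column "key"; the column names, bucket count, and partition count are illustrative:

from pyspark.sql import functions as F

num_buckets = 100
salted = df.withColumn("salt", (F.rand() * num_buckets).cast("int"))
# repartitioning on (key, salt) spreads rows that share a hot key across
# many partitions instead of piling them into a single one
evenly = salted.repartition(200, "key", "salt")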
For small data I have found that I need to enforce uniform partitioning myself. In pyspark the difference is easily reproducible. In this simple example I'm just trying to parallelize a list of 100 elements into 10 even partitions. I would expect each partition to hold 10 elements. Instead, I get an uneven distribution with partition sizes anywhere from 4 to 22:
my_list = list(range(100))
rdd = spark.sparkContext.parallelize(my_list).repartition(10)
rdd.glom().map(len).collect()
# Outputs: [10, 4, 14, 6, 22, 6, 8, 10, 4, 16]
Here is the workaround I use, which is to index the data myself and then mod the index to find which partition to place the row in:
my_list = list(range(100))
number_of_partitions = 10
rdd = (
    spark.sparkContext
    .parallelize(zip(range(len(my_list)), my_list))
    .partitionBy(number_of_partitions, lambda idx: idx % number_of_partitions)
)
rdd.glom().map(len).collect()
# Outputs: [10, 10, 10, 10, 10, 10, 10, 10, 10, 10]

Sublinear behavior (MongoDB cluster)

I have the following setup:
Import a CSV file (20 GB) with 90 million rows -> data takes 9 GB in MongoDB -> index on "2d" column -> additional integer-column index for sharding -> distribute the data across 1, 2, 4, 6, 8, and 16 shards.
Each shard machine in the cluster has 20 GB of disk space and 2 GB of RAM.
I generated a random query and benchmarked the execution time for each cluster configuration (see attachment).
Now my question:
Using 1, 2, 4, 6, and 8 shards I see a more or less linear decrease of runtime, as expected. With 8 shards I would assume that on each shard my data fits into memory. Therefore I thought there would be no improvement from 8 shards to 16 shards.
But from my benchmarks I observe a very strong sublinear decrease of runtime.
Do you have an idea how this behavior might be explained? Any suggestions or references to the manual are much appreciated!
Thanks in advance,
Lydia

Why is MongoDB storage constantly increasing?

I have a single-host database which grew to 95% of disk space while I was not watching. To remedy the situation, I created a process that automatically removes old records from the biggest collection, so data usage fell to about 40% of disk space. I figured I was safe as long as the data size didn't grow near the size of the preallocated files, but after a week I was proven wrong:
Wed Jan 23 18:19:22 [FileAllocator] allocating new datafile /var/lib/mongodb/xxx.101, filling with zeroes...
Wed Jan 23 18:25:11 [FileAllocator] done allocating datafile /var/lib/mongodb/xxx.101, size: 2047MB, took 347.8 secs
Wed Jan 23 18:25:14 [conn4243] serverStatus was very slow: { after basic: 0, middle of mem: 590, after mem: 590, after connections: 590, after extra info: 970, after counters: 970, after repl: 970, after asserts: 970, after dur: 1800, at end: 1800 }
This is the output of db.stats() (note that the numbers are in MB because of the scale argument):
> db.stats(1024*1024)
{
    "db" : "xxx",
    "collections" : 47,
    "objects" : 189307130,
    "avgObjSize" : 509.94713418348266,
    "dataSize" : 92064,
    "storageSize" : 131763,
    "numExtents" : 257,
    "indexes" : 78,
    "indexSize" : 29078,
    "fileSize" : 200543,
    "nsSizeMB" : 16,
    "ok" : 1
}
Question: What can I do to stop MongoDB from allocating new datafiles?
Running repair is difficult because I would have to install a new disk. Would running compact help? If yes, should I be running it regularly, and how can I tell when I should run it?
UPDATE: I guess I am missing something fundamental here... Could someone please elaborate on the connection between data files, extents, collections, and databases, and on how space is allocated when needed?
Upgrade to 2.2.2; 2.2.0 has an idempotency bug in replication and is no longer recommended for production.
See here for general info: http://docs.mongodb.org/manual/faq/storage/#faq-disk-size
The only way to recover space back from mongodb is either to sync a new node over the network, in which case the documents are copied over to the new file system and stored anew without fragmentation, or to use the repair command, but for that you need double the disk space that you are currently using. The data files are copied, defragmented, compacted, and copied back over the original. The compact command is badly named and only defragments; it doesn't recover disk space back from mongo.
Going forward, use the usePowerOf2Sizes option of the collMod command (new in 2.2.x): http://docs.mongodb.org/manual/reference/command/collMod/
If you use that command and allocate, say, an 800-byte document, 1024 bytes will be allocated on disk. If you then delete that doc and insert a new one of, say, 900 bytes, that doc can fit in the 1024-byte space. Without this option enabled, the 800-byte doc might only have 850 bytes on disk, so when it's deleted and the 900-byte doc is inserted, new space has to be allocated. And if that is then deleted, you end up with two free spaces of 850 bytes and 950 bytes which are never joined (unless compact or repair is used), so to insert a 1000-byte doc you then need to allocate another chunk of disk. usePowerOf2Sizes helps this situation a lot by using standard bucket sizes.
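For reference, both commands can be issued from the mongo shell or from a driver; a pymongo sketch with a placeholder collection name (keep in mind that compact only defragments and does not return disk space to the OS):

db.command("collMod", "mycollection", usePowerOf2Sizes=True)    # round record allocations up to power-of-two bucket sizes
db.command("compact", "mycollection")                           # defragment in place; blocks the database while it runs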