How to improve the performance of feed system using mongodb - mongodb

I have a feed system using fan-out on write. I keep a list of feed IDs in a Redis sorted set and store the feed content in MongoDB, so every time I read 30 feeds I have to run 30 queries against MongoDB. Is there any way to improve this?

It depends on your database setup. MongoDB has extensive documentation on increasing concurrent reads and writes (MongoDB concurrency).
If you need many writes with low latency, start using sharding (Deployment Sharding).
If you need to increase the number of reads, deploy each shard as a replica set and route your read queries to secondary nodes (Read Preferences).
Also, each query should be covered by an index (Better indexing). You can check a query by simply adding explain() after a find(); it shows the query plan, and in MongoDB 3.0 and later explain("executionStats") also reports the execution time:
db.collection.find({a:"abcd"}).explain()
Make sure you have enough RAM for your data set to fit in memory. At the very least your indexes should fit in RAM, because every fetch from disk is at least an order of magnitude slower than a fetch from RAM.
Check your server status by running mongostat. It reports database performance figures such as page faults, locks, query operations, and many other details.
Also measure your hardware performance with a program like iostat and make sure I/O wait is low (less than 1%).
A few good links on deploying MongoDB and performance tuning:
1. Production deployment of mongodb
2. Performance tuning of mongodb By 10gen
3. Using redis before mongodb to cache query and result object
4. Example of redis and mongo
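One way to cut the 30 round trips described in the question is to fetch all the documents with a single `$in` query and then restore the order taken from the Redis sorted set, since `$in` does not guarantee result order. A minimal sketch, assuming the feed documents are keyed by `_id` and that `collection` is a pymongo collection; the function name and the choice of `_id` are illustrative, not from the original post:

```python
def fetch_feeds(collection, feed_ids):
    """Fetch all feeds in one query and return them in feed_ids order."""
    # One round trip instead of len(feed_ids) separate find_one calls.
    cursor = collection.find({"_id": {"$in": list(feed_ids)}})
    # $in does not preserve the order of the ID list, so index results by _id.
    by_id = {doc["_id"]: doc for doc in cursor}
    # IDs that no longer exist in MongoDB are silently skipped.
    return [by_id[fid] for fid in feed_ids if fid in by_id]
```

Called with the 30 IDs read from Redis, this returns the feed documents in the same order as the sorted set.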

Related

MongoDB poor write speed for collection with 500K documents with pymongo

System Information:
OS: Ubuntu 20.04 LTS
System: 80 GB RAM, 1 TB SSD, i7-12700k
The documents in this collection are on average 16KB, and there are 500K documents in this collection. I noticed that as the collection grows larger, the time taken to insert documents also grows larger.
In what ways could I improve the speed of writes?
It is taking 10 Hours to insert 150k documents. Which is around what the graph predicted when we integrate the line:
def f(num):
    return 0.0004*num+0.9594
sum=0
for i in range(500,650):
    sum+=f(i*1000)
>> sum/3600
>> 9.61497
Potential upgrades in my mind:
Use the C++ mongo engine for writes
Allocate more RAM to Mongod
Logs
iotop showing mongod using < 1% of the IO capacity with write speeds around 10-20 KB/s
htop showing that mongod is only using ~16 GB of RAM
Disks showing that some 300GB of SSD is free
EDIT:
Pseudocode:
from datetime import datetime

docs = [...]
for doc in docs:
    doc["last_updated"] = str(datetime.now())
    doc_from_db = collection.find_one({"key": doc["key"]})
    new_dict = minify(doc)
    if doc_from_db is None:
        collection.insert_one(new_dict)
    else:
        collection.replace_one({"key": doc["key"]}, new_dict, upsert=True)
When it comes to writes there are a few things to consider. The most impactful one, which I'm assuming is the issue here, is index size, index complexity, and unique indexes.
It's hard to give exact advice without more information, so I'll detail the most common write bottlenecks from my experience.
1. Indexes. Too many indexes, unique indexes, or indexes on very large arrays (when the documents you insert have large arrays) all heavily impact insert performance. This also correlates with the graph you provided, as inserts get slower the larger the index gets. There is no "real" solution to this issue; you should reconsider which indexes you need and work out which ones cause the bottleneck (focus on unique and array indexes). For example, if you have an index that enforces uniqueness, drop it and enforce uniqueness at the application level instead.
2. Write concern and replication lag. If you are using a replica set and require a majority write concern, this can definitely cause issues because of the sync lag that builds up. Usually this is a side effect of a different issue: for example, because of #1 (large indexes) the insert takes too long, which causes sync lag, which delays the write concern even further.
3. Unoptimized hardware (assuming you're hosted in the cloud). You'd be surprised how much you can improve write performance just by changing the disk type and increasing IOPS. This gives an immediate performance gain, obviously at the cost of $$$.
4. Code. No code was provided, so I would also check that. If it's a for loop, then obviously you can parallelize the logic.
I recommend you test the same insert logic on an indexless collection to pinpoint the problem. I'd be glad to help think of other issues and solutions once you can provide more information.
EDIT:
Here is an example of how to avoid the for loop issue by using bulk_write in Python with pymongo.
from pprint import pprint
from pymongo import ReplaceOne
from pymongo.errors import BulkWriteError

docs = [...]  # input documents, assumed to carry a "docId" field
requests = []
for doc in docs:
    # Queue one upsert per document instead of one round trip per document
    requests.append(ReplaceOne({"docId": doc["docId"]}, doc, upsert=True))
try:
    # db is assumed to be an existing pymongo Database handle
    db.docs.bulk_write(requests, ordered=False)
except BulkWriteError as bwe:
    pprint(bwe.details)
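If the input list is very large, it may help to send the operations in fixed-size chunks rather than as one huge request list. A minimal sketch using only the standard library; the `chunked` helper and the chunk size of 1000 are illustrative assumptions, not part of the original answer:

```python
from itertools import islice

def chunked(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

# Hypothetical usage with the request list built above:
# for batch in chunked(requests, 1000):
#     db.docs.bulk_write(batch, ordered=False)
```

Each chunk then becomes one bulk_write call, so a failure affects only that chunk and memory use stays bounded.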
You can enable profiling in the database, but judging by the previous comments and your code, profiling just the Python code may be enough. For example, can you show the output of something similar to this example?
https://github.com/Tornike-Skhulukhia/cprofiler_python_example/blob/main/demo.py
But before that, please check that you have an index on the field you search on with find_one in your current code. Otherwise the database may need to do a full collection scan just to find one document, which means the time per lookup grows as the collection grows.

MongoDB concurrency - reduces the performance

I understand that mongo db does locking on read and write operations.
My Use case:
Only read operations. No write operations.
I have a collection about 10million documents. Storage engine is wiredTiger.
Mongo version is 3.4.
I made a request that should return 30k documents; it took 650 ms on average.
When I made the same request 100 times concurrently, it took anywhere from a few seconds to 2 minutes for all requests to be handled.
I have single node to serve the data.
How do I access the data:
Each document contains 25 to 40 fields. I indexed few fields. I query based on one index field.
API will return all the matching documents in json form.
Other information: the API is written using Spring Boot.
Concurrency tested through JMeter shell script from command line on remote machine.
So,
My question:
Am I missing any optimizations? [storage engine level, version]
Can't I get all read requests served in under a second?
If so, what SLA can I commit to for this use case?
Any suggestions?
Edit:
I enabled database profiler in mongodb with level 2.
My single query internally converted to 4 queries:
Initial read
getMore
getMore
getMore
These are the queries found through profiler.
In total, they take less than 100 ms. Is that really accurate?
My concurrent queries:
Now, when I send 100 requests, nearly 150 operations take more than 100 ms, 100 operations take more than 200 ms, and 90 operations take more than 300 ms.
As per my single query analysis, 100 requests are converted to 400 queries internally. This is a fixed pattern, which I verified by checking the query tag in the profiler output.
I suspect this is what affects my request performance.
My single query internally converted to 4 queries:
Initial read
getMore
getMore
getMore
It's the way mongo cursors work. The documents are transferred from the db to the app in batches. IIRC the first batch is around 100 documents + cursor Id, then consecutive getMore calls retrieve next batches by cursor Id.
You can define batch size (number of documents in the batch) from the application. The batch cannot exceed 16MB, e.g. if you set batch size 30,000 it will fit into single batch only if document size is less than 500B.
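The arithmetic behind that estimate can be checked directly. A small sketch, assuming the 16MB limit means 16 × 1024 × 1024 bytes; the function name is illustrative:

```python
MAX_BATCH_BYTES = 16 * 1024 * 1024  # assumed: the 16MB batch limit in bytes

def max_avg_doc_size(docs_per_batch, limit=MAX_BATCH_BYTES):
    """Largest average document size (bytes) that lets docs_per_batch fit in one batch."""
    return limit // docs_per_batch

print(max_avg_doc_size(30_000))  # 559
```

So 30,000 documents fit in one batch only if they average about 559 bytes or less, which is consistent with the rounded 500B figure above. With pymongo, the batch size itself is set with `cursor.batch_size(n)`.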
Your investigation clearly shows performance degradation under load. There are many possible factors, and I believe locking is not one of them. WiredTiger takes exclusive locks at the document level for regular write operations, and you are doing only reads during your tests, aren't you? If in doubt, you can compare the results of db.serverStatus().locks before and after the tests to see how many write locks were acquired. You can also run db.serverStatus().globalLock during the tests to check the queue. More details about locking and concurrency are here: https://docs.mongodb.com/manual/faq/concurrency/#for-wiredtiger
The bottleneck is likely somewhere else. There are few generic things to check:
Query optimisation. Ensure you use indexes. The profiler should have no "COLLSCAN" stage in execStats field.
System load. If your database shares system resources with the application, this may affect database performance. E.g. BSON to JSON conversion in your API is quite CPU hungry and may affect the performance of the queries. Check the system's load average with top or htop on *nix systems.
Mongodb resources. Use mongostat and mongotop to check whether the server has enough RAM, IO, file descriptors, connections etc.
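The "no COLLSCAN" check from the first item can be automated. Query plans, such as `queryPlanner.winningPlan` in explain() output, are nested documents in which each stage holds its child under `inputStage` or its children under `inputStages`. A minimal sketch that walks such a tree; the sample plans are made up for illustration, and the helper names are not part of any MongoDB API:

```python
def plan_stages(plan):
    """Return every stage name found in a (nested) query plan document."""
    stages = []
    if "stage" in plan:
        stages.append(plan["stage"])
    # A stage has either a single child or a list of children.
    children = list(plan.get("inputStages", []))
    if "inputStage" in plan:
        children.append(plan["inputStage"])
    for child in children:
        stages.extend(plan_stages(child))
    return stages

def uses_collection_scan(plan):
    """True if any stage in the plan is a full collection scan."""
    return "COLLSCAN" in plan_stages(plan)

# Made-up sample plans, for illustration only.
indexed = {"stage": "FETCH", "inputStage": {"stage": "IXSCAN", "indexName": "userId_1"}}
unindexed = {"stage": "COLLSCAN", "direction": "forward"}
```

Here `uses_collection_scan(indexed)` is False and `uses_collection_scan(unindexed)` is True.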
If you cannot spot anything obvious, I'd recommend seeking professional help. I find the simplest way to get it is to export the data to Atlas and run your tests against that cluster. Then you can ask the support team whether they can advise any improvements to the queries.

Correct way to run an ETL on a live production MongoDB database

We have the following environment:
3 servers + 3 replica set servers
Each server has 3 shards
Our main collection has around 40,000,000 documents that average at around 6kb
Our shard key is hashed(_id) - _id being pure BsonID
We peak daily at around 25,000 I/OPS and low point at around 10,000
We want to run an ETL that loads all of the documents on the main collection, do some in memory calculation (in our application tier) and then dumps it into an external DB.
We took a very poor and naive approach and simply loaded the documents without a query, using limit, skip and batchSize. This was a complete failure: it severely hurt our service level, even though we set the readPreference to secondary.
db.Collection.find().skip(i * 5000).limit(5000).readPref("secondary")
Where i is the current iteration we're going through, which runs on multiple threads to speed up the process.
I was wondering what would be the best approach to be able to load all of our documents without hurting the performance of our database.
The data can be a bit stale (a few seconds delay from the actual data on the primary is fine).
I've posted this question on DB admins but it doesn't seem to attract many answers, so I'm posting it here as well. Sorry if it's against the forum rules.
Thanks!
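One commonly suggested alternative to skip/limit is range-based (keyset) pagination: skip(n) makes the server walk past n documents on every call, so later pages get steadily more expensive, while a range query on `_id` can resume from the `_id` index. A minimal single-threaded sketch, assuming `collection` is a pymongo collection; the function name is illustrative, and read preference would be configured separately on the client or collection. Note that with a hashed shard key, each range query on `_id` is still sent to all shards, so this lowers per-query work rather than fan-out:

```python
def iter_batches(collection, batch_size=5000):
    """Yield lists of documents in ascending _id order, batch_size at a time."""
    last_id = None
    while True:
        # Resume strictly after the last _id seen; the first pass has no filter.
        query = {} if last_id is None else {"_id": {"$gt": last_id}}
        batch = list(collection.find(query).sort("_id", 1).limit(batch_size))
        if not batch:
            return
        yield batch
        last_id = batch[-1]["_id"]
```

Each yielded batch can be processed in the application tier and written to the external DB before the next one is fetched, which also acts as a natural throttle on the database.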

Latency on aws (m1.large) with MongoDB 64b 2.x

I have deployed mongodb 64 bit 2.x version on aws m1.large instance.
I am trying to find the best performance that MongoDB can give us on AWS in light of http://www.snailinaturtleneck.com/blog/tag/mongodb/ (and mongodb read/write performance and mongo hosting in the cloud)
I have created one db with one collection, i.e. user, and inserted 100,000 records/JSON objects (each JSON object is 4KB), using a random number as the suffix to "user-". I also created an index on user id.
Further, I set the db profiler to log slow queries taking 20ms or more. I ran a Java program with 10 threads. Each thread generates a user id with a random number and looks it up in the user collection in an infinite loop. Under this load I have observed query/read latency of up to 60ms.
I also observed that when I run fewer threads, say 3 or 4 (a query load of 5K finds per second on the user collection), I see no latency, or less than 2ms latency.
I fail to understand why increasing the load of user lookups on the collection causes latency. I believe MongoDB can handle many more concurrent reads than I am attempting, and that this should not affect performance as such.
One possibility is that MongoDB has performance issues when a large number of queries run against a single collection, as in our case; I expect to have 10K to 20K queries per second on a single collection.
We would appreciate your thoughts and suggestions.
Some information is missing - what is your disk configuration? The EBS may contribute to the latency if everything is persisted to disk.
Amazon has released a white paper with best practices on how to install MongoDB on EC2: MongoDB on AWS. Here's its description:
This whitepaper provides an overview of general best practices that apply to all major NoSQL systems and highlights one of popular NoSQL systems - MongoDB - and discusses how to best run it on the AWS cloud. It further examines different MongoDB configurations so you can optimize it for performance, durability, and security.

mongodb got slow when the document count went around 100,000. Any performance optimization?

I run a single MongoDB instance that receives logs from an app server. The current insert rate in production is 10 inserts per second, and it's a capped collection. I DON'T USE ANY INDEXES. Queries ran faster when there were fewer records. Only one collection has that amount of data, yet even querying a collection that has very few rows has become very slow. Is there any way to improve the performance?
-Avinash
This is a very difficult question to answer because we don't know much about your configuration or your document structure.
One thing that immediately pops into my head is that you are running out of memory. 10 inserts per second doesn't mean much because we do not know how big the inserted documents are.
If you are inserting larger documents at 10 per second, you could be eating up memory, causing the operating system to push some of your records to disk.
When you query without using an index, you are forced to scan every document. If your documents have been pushed to disk by the OS, you will begin having page faults. Mongo will need to fetch pages of data off the hard disk, and load them into memory so that they can be scanned. Before doing this, the operating system will need to make room for that data in memory by flushing other parts of memory out to disk.
It sounds like you are I/O bound, and the two biggest things you can do to fix this are:
Add more memory to the machine running mongod
Start using indexes so that the database does not need to do full collection scans
Use proper indexes, though that will have some effect on the efficiency of insertion in a capped collection.
It would be better if you can share the collection structure and the query you are using.