Efficiently process records in MongoDB on a single machine

I have a big dataset in MongoDB to process, and sometimes we only have one machine.
For example:
1. Collection A holds 1 million records; the data size is about 60 GB.
2. There is only one machine (4 cores, 24 GB RAM), and MongoDB runs on this machine too.
3. Only an HDD, no SSD.
4. Every record contains HTML that needs parsing and regex matching.
5. The results need to be inserted into another collection, B.
Because of point 4, MongoDB's built-in mapReduce is not up to the job, and I would also prefer to use Python.
I did some testing and found:
1. If skip is large, the query is very slow:
cursor = collection.find({'result.type':'detail'}, skip=start, limit=start+PRODUCER_CHUNKS)
2. As the number of records in collection B grows, insert performance slows down a lot.
Is there any setting that can improve the above?
Besides, would it be faster to deploy a Docker Spark cluster on this machine to do the same job?
Because there is only one machine, I think the most efficient way is to use multiprocessing. I implemented this, but the speed is 25 records per second, which I think is still slow.
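One common way around the slow skip described above is range-based ("keyset") pagination: remember the last _id seen and ask for documents after it, so the server does not have to walk past all the skipped records. The following is a minimal sketch, not the asker's code. The helper names next_page_filter and iter_chunks are hypothetical, and it assumes the collection object behaves like a pymongo collection and that _id values sort in a stable order.

```python
def next_page_filter(base_filter, last_id=None):
    """Build a filter that resumes after last_id instead of using skip."""
    query = dict(base_filter)  # copy so the caller's filter is not mutated
    if last_id is not None:
        query["_id"] = {"$gt": last_id}
    return query


def iter_chunks(collection, base_filter, chunk_size):
    """Yield lists of documents in _id order, chunk_size at a time.

    `collection` is assumed to be a pymongo-style collection.
    """
    last_id = None
    while True:
        cursor = (
            collection.find(next_page_filter(base_filter, last_id))
            .sort("_id", 1)
            .limit(chunk_size)
        )
        docs = list(cursor)
        if not docs:
            return
        yield docs
        last_id = docs[-1]["_id"]  # resume point for the next chunk
```

A producer could then loop over iter_chunks(collection, {'result.type': 'detail'}, PRODUCER_CHUNKS) and push each chunk onto its queue. Whether this helps in a given setup depends on the indexes available, so it is worth measuring.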
Update
My thinking so far:
HDD write speed is over 120 MB/s, which I think should be enough for this.
The size per record is 60*1024*1024 KB / 10^6 ≈ 63 KB, and 25 records/s * 63 KB / 1024 ≈ 1.5 MB/s, far lower than 120 MB/s.
There is one producer and four workers (non-blocking, using asyncio), each running in a separate process. The CPU is an E3-1231 v3.
The queue size is 500, and it fills up easily once collection B becomes large (because the workers are responsible for writing to collection B, and insertion keeps slowing down during processing).
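The back-of-the-envelope estimate above can be checked in a few lines. This is only a sketch of the arithmetic using the figures quoted in the question. Note that 120 MB/s is a sequential rate; random reads and writes on a spinning disk are typically far slower, so the comparison is optimistic.

```python
DATA_SIZE_GB = 60        # total size of collection A
RECORD_COUNT = 10 ** 6   # number of records
RECORDS_PER_S = 25       # observed processing rate
HDD_MB_PER_S = 120       # quoted sequential HDD speed

# Average record size in KB, then the implied throughput in MB/s.
record_kb = DATA_SIZE_GB * 1024 * 1024 / RECORD_COUNT
throughput_mb_s = RECORDS_PER_S * record_kb / 1024
share_of_disk = throughput_mb_s / HDD_MB_PER_S

print(f"{record_kb:.1f} KB per record")
print(f"{throughput_mb_s:.3f} MB/s at {RECORDS_PER_S} records/s")
print(f"{share_of_disk:.2%} of the quoted sequential speed")
```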
If memory is the limitation, I think there may be a way to speed things up, for example:
Have the database split my data into parts, maybe 60 GB / 5 GB = 12 parts.
Have the database use a 5 GB RAM cache for reads.
Have the database use a 5 GB RAM cache for writes (like Redis, flushing to disk once per interval; MongoDB has an in-memory mode, but it does not persist).
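The "cache writes and flush once per interval" idea can also be approximated in the application by buffering results and writing them in batches. The class below is a hypothetical sketch, not a MongoDB feature: WriteBuffer and its max_docs parameter are names invented for illustration, and the flush callable is whatever actually performs the write.

```python
class WriteBuffer:
    """Collect documents and hand them to `flush` in batches."""

    def __init__(self, flush, max_docs=1000):
        self._flush = flush          # callable that receives a list of docs
        self._max_docs = max_docs    # batch size that triggers a write
        self._docs = []

    def add(self, doc):
        self._docs.append(doc)
        if len(self._docs) >= self._max_docs:
            self.flush()

    def flush(self):
        """Write out whatever is buffered; safe to call when empty."""
        if self._docs:
            self._flush(self._docs)
            self._docs = []  # start a fresh list for the next batch
```

With pymongo, the flush callable could be something like lambda docs: collection_b.insert_many(docs, ordered=False), where collection_b is assumed to be the target collection. Remember to call flush() once at the end so the last partial batch is written.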

Related

MongoDB poor write speed for collection with 500K documents with pymongo

System Information:
OS: Ubuntu 20.04 LTS
System: 80 GB RAM, 1 TB SSD, i7-12700k
The documents in this collection are on average 16KB, and there are 500K documents in this collection. I noticed that as the collection grows larger, the time taken to insert documents also grows larger.
In what ways could I improve the speed of writes?
It is taking 10 hours to insert 150k documents, which is around what the graph predicted when we integrate the line:
def f(num):
    # Fitted insert time in seconds as a function of collection size
    return 0.0004 * num + 0.9594
total = 0
for i in range(500, 650):
    total += f(i * 1000)
print(total / 3600)  # about 9.61497 (hours)
Potential upgrades I have in mind:
Use the C++ mongo engine for writes
Allocate more RAM to mongod
Logs
iotop shows mongod using < 1% of the I/O capacity, with write speeds around 10-20 KB/s
htop shows mongod using only ~16 GB of RAM
Disks shows that some 300 GB of the SSD is free
EDIT:
Pseudocode:
from datetime import datetime
docs = [...]
for doc in docs:
    doc["last_updated"] = str(datetime.now())
    # Look up the existing document by key
    doc_from_db = collection.find_one({"key": doc["key"]})
    new_dict = minify(doc)
    if doc_from_db is None:
        collection.insert_one(new_dict)
    else:
        collection.replace_one({"key": doc["key"]}, new_dict, upsert=True)
When it comes to writes there are a few things to consider. The most impactful one, which I'm assuming is the issue here, is index size, index complexity, and unique indexes.
It's hard to give exact advice without more information, so I'll detail the most common write bottlenecks from my experience.
1. Indexes. Too many indexes, unique indexes, or indexes on very large arrays (when the documents you insert have large arrays) all heavily impact insert performance. This also correlates with the graph you provided, as inserting becomes worse and worse the larger the index gets. There is no "real" solution to this issue; you should reconsider which indexes you need and work out which ones cause the bottleneck (focus on unique and array indexes). For example, if you have an index that enforces uniqueness, you could drop it and enforce uniqueness at the application level instead.
2. Write concern and replication lag. If you are using a replica set and require a majority write concern, this can definitely cause issues because of the sync lag that builds up. Usually this is a side effect of a different issue: for example, because of #1 (large indexes) the insert takes too long, which causes sync lag, which in turn delays the write concern even further.
3. Unoptimized hardware (assuming you're hosted in the cloud). You'd be surprised how much you can improve write performance just by changing the disk type and increasing IOPS. This gives an immediate improvement, obviously at the cost of $$$.
4. Application code. No code was provided, so I would also check that; if it's a for loop, then you can parallelize the logic.
I recommend you test the same insert logic on an indexless collection to pinpoint the problem. I'd be glad to help think of other issues and solutions once you can provide more information.
EDIT:
Here is an example of how to avoid the for loop issue by using bulk_write in Python with pymongo.
from pprint import pprint
from pymongo import ReplaceOne
from pymongo.errors import BulkWriteError
docs = [... input documents ]
requests = []
for doc in docs:
    # One upsert per document, matched on docId
    requests.append(ReplaceOne({"docId": doc["docId"]}, doc, upsert=True))
try:
    # ordered=False lets the remaining operations run even if one fails
    db.docs.bulk_write(requests, ordered=False)
except BulkWriteError as bwe:
    pprint(bwe.details)
You can enable profiling in the database, but judging by the previous comments and your code, profiling the Python code alone may be enough. For example, can you show the output of something similar to this example?
https://github.com/Tornike-Skhulukhia/cprofiler_python_example/blob/main/demo.py
But before that, please check that you have an index on the field you search on with find_one in the current code. Otherwise the database may need to do a full collection scan just to find one document, so this time will also grow a lot as the collection grows.
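For reference, profiling the Python side needs only the standard library. This is a minimal sketch and not the content of the linked demo: profile_call is a hypothetical helper, and process_docs is a stand-in for the real insert loop.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, **kwargs):
    """Run func under cProfile and return (result, report_text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Show the ten entries with the highest cumulative time.
    stats.sort_stats("cumulative").print_stats(10)
    return result, stream.getvalue()


def process_docs(docs):
    """Hypothetical stand-in for the loop that writes to MongoDB."""
    return [len(str(doc)) for doc in docs]


result, report = profile_call(process_docs, [{"key": i} for i in range(1000)])
print(report)
```

If most of the cumulative time lands in find_one, the missing-index explanation becomes more likely; if it lands in the application's own functions, the database is probably not the bottleneck.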

MongoDB can not find() in 1 million documents

I just started to deal with MongoDB.
I created 10 thousand JSON documents and ran a search:
db.mycollection.find({"somenode1.somenode2.somenode3.somenode4.Value": "9999"}).count()
It returns the correct result. Operating time: 34 ms. Everything is OK.
Next I created a database with 1 million of the same documents. The total size of the database is 34 GB, which MongoDB split into 2 GB files. I repeated the query above to count the matching documents. I waited about 2 hours for the result, with all 16 GB of memory occupied. Finally I shut MongoDB down.
System: Windows 7 x64, 16 GB RAM.
Please tell me what I'm doing wrong. A production database will be much bigger.
In your particular case, it appears you simply do not have enough RAM. At a minimum, an index on "somenode4" would improve the query performance. Keep in mind that the indexes are going to want to be in RAM as well, so you may need more RAM anyhow. Are you on a virtual machine? If so, I recommend you increase the size of the machine to account for the size of the working set.
As one of the other commenters stated, that nesting is a bit ugly, but I understand it is what you were dealt. So other than RAM, indexing appears to be your best bet.
As part of your indexing effort, you may also want to experiment with pre-heating the indexes to ensure they are in RAM before that find and count(). Try executing a query that looks for something that does not exist. This should force the indexes and data into RAM ahead of the real query. Depending on how often your data changes, you may want to do this once a day or more. You are essentially front-loading the slow operations.

mongodb java driver readFully is slow

The db.collection.find calls in my app, which uses the MongoDB Java driver (latest), are super slow. I investigated one of them as follows:
// about 300 ids at a time (I've tried lower and higher numbers, with no impact)
db.users.find({_id : {$in : [1,2,3,4,5,6....]}})
Once I get the cursor I call cursor.toArray() and then iterate over the results.
The toArray operation is extremely slow; on average it takes about a minute. IMPORTANT: my database is under very heavy load at all times. This particular collection has over 50 million entries.
I've narrowed the issue down in the Mongo Java driver to com.mongodb.Response, specifically to these lines:
final byte [] b = new byte[36];
Bits.readFully(in, b);
Incredibly, readFully of just 36 bytes sometimes takes over a minute!
When I bring down the load on the databases, the improvements are drastic: from about a minute to 5-6 seconds. Taking 5-6 seconds to get 300 documents is still super slow, but definitely better than 1 minute.
What can I do to troubleshoot this further? Are there settings on MongoDB that I need to look at?
What happens
You are loading all of the 300 user documents.
What happens is that the _id index is searched and the respective documents are sent in full to your app. So MongoDB accesses its data files, reads the first document and sends it to you, then jumps to the next document and sends it to you, and so forth. If you iterated over the cursor instead, you could start processing the returned documents as soon as a number of documents equal to your defined cursor batch size had been returned, as the others would be lazily loaded from the cursor on the server on demand. (A bit of a simplification, but sufficient for answering this question.) What you do instead is explicitly wait until the index has been scanned and the documents have been located, sent back to your app, and received down to the last byte of the last document. As @wdberkeley (who works for 10gen) correctly pointed out, this is a Very Bad Idea™.
What might cause or intensify the problem
Under heavy load, two things might happen. The more likely one is that your _id index isn't in RAM any more, causing thousands, if not millions, of reads from disk, which is slower by several orders of magnitude than if the indices are kept in RAM. So it is not the code snippet you mentioned but the response time of MongoDB that causes the delay. Another option under heavy load is that your disk I/O throughput is simply too low or (more likely) the random read latency is too high. I assume you are using spinning disks and do not have enough RAM for a database of that size.
What to do to find the cause
Try to find out your index size using db.users.stats(). I am pretty sure that your combined index sizes exceed your available RAM.
Measure the disk I/O and latency. If you use a GNU/Linux OS, you might want to find out how high your IOwait percentage is. A high percentage shows that your disk latency is too high for the load put on the server. It might even be that you are reaching the disk's I/O limits.
Run your queries in a mongo shell. If they are fast there, you can be pretty sure that your toArray call is the cause of the problem.
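The first check in the list above comes down to comparing one number from the collection stats with the machine's RAM. A minimal sketch follows; it assumes the stats document exposes totalIndexSize in bytes (the default scale), index_to_ram_ratio is a hypothetical helper name, and the figures are made up for illustration.

```python
def index_to_ram_ratio(coll_stats, ram_bytes):
    """Return the combined index size divided by the available RAM."""
    return coll_stats["totalIndexSize"] / ram_bytes


# Made-up figures for illustration: 48 GiB of indexes on a 32 GiB machine.
GIB = 1024 ** 3
example_stats = {"totalIndexSize": 48 * GIB}
ratio = index_to_ram_ratio(example_stats, 32 * GIB)
print(f"indexes are {ratio:.2f}x the size of RAM")
```

From Python, the stats document could be fetched with pymongo's db.command("collstats", "users"). A ratio well above 1 would support the suspicion that the index no longer fits in memory.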
What to do to resolve the problem
If you do not have enough RAM, either scale up or scale out.
If your disk latency is too high or your throughput too low, either scale out or (better and cheaper in most cases) use SSDs for storing MongoDB's data.
Use a cursor object to iterate over the documents. This is a better solution in almost every use case I can think of.
After upgrading the MongoDB driver to 3.6.4, the data is fetched in no time.
We have around 2 million documents in our collection. With the previous version it took around 3 minutes, but after upgrading to 3.6.4 it took only 5-7 seconds. So my feeling is that there is some issue with the old version of the MongoDB driver.

How to improve the performance of feed system using mongodb

I have a feed system using fan-out on write. I keep a list of feed ids in a Redis sorted set and save the feed content in MongoDB, so every time I read 30 feeds I have to run 30 queries against MongoDB. Is there any way to improve this?
It depends on your database setup. MongoDB has extensive documentation on how to increase simultaneous reads and writes (MongoDB concurrency).
If you need many low-latency writes, start using sharding (Deployment Sharding).
If you need to increase the number of reads, deploy each shard as a replica set and route your read queries to secondary nodes (Read Preferences).
Each query should also be covered by an index (Better indexing). You can check a query by adding explain after a find; it will show the timing and other details:
db.collection.find({a:"abcd"}).explain()
Make sure you have enough RAM for your data set to fit in memory. At the very least your indexes should fit in RAM, because fetching data from disk is much slower than reading it from RAM.
Check your server status by running mongostat; it reports database performance figures such as page faults, locks, query operations, and many other details.
Also measure your hardware performance with a program like iostat, and make sure I/O wait is low (less than 1%).
A few good links on MongoDB deployment and performance tuning:
1. Production deployment of mongodb
2. Performance tuning of mongodb By 10gen
3. Using redis before mongodb to cache query and result object
4. Example of redis and mongo
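For the original question of running 30 queries to read 30 feeds, one option not covered above is to fetch all of them in a single query with $in and then restore the Redis ordering in the application, since $in does not guarantee the order of results. This is a hedged sketch: in_filter and order_like are hypothetical helper names, and it assumes the feed ids stored in Redis are the documents' _id values.

```python
def in_filter(feed_ids):
    """Build one filter matching every requested feed id."""
    return {"_id": {"$in": list(feed_ids)}}


def order_like(feed_ids, docs):
    """Return docs in the order of feed_ids, skipping ids that were not found."""
    by_id = {doc["_id"]: doc for doc in docs}
    return [by_id[feed_id] for feed_id in feed_ids if feed_id in by_id]


# Example with in-memory data standing in for the query result.
feed_ids = [3, 1, 2]
fetched = [{"_id": 1, "text": "a"}, {"_id": 2, "text": "b"}, {"_id": 3, "text": "c"}]
ordered = order_like(feed_ids, fetched)
print([doc["_id"] for doc in ordered])
```

With pymongo this would be used as order_like(feed_ids, collection.find(in_filter(feed_ids))), where collection is assumed to be the feed collection, turning 30 round trips into one.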

MongoDB got slow when the document count reached around 100,000. Any performance optimization?

I run a single MongoDB instance that receives logs from an app server. The current insert rate in production is 10 inserts per second, and it is a capped collection. I don't use any indexes. Queries ran faster when there were only a small number of records. Only one collection has that amount of data, yet even querying a collection that has very few rows has become very slow. Is there any way to improve the performance?
-Avinash
This is a very difficult question to answer because we don't know much about your configuration or your document structure.
One thing that immediately comes to mind is that you are running out of memory. 10 inserts per second doesn't mean much because we do not know how big the inserted documents are.
If you are inserting larger documents at 10 per second, you could be eating up memory, causing the operating system to push some of your records to disk.
When you query without using an index, you are forced to scan every document. If your documents have been pushed to disk by the OS, you will begin having page faults. Mongo will need to fetch pages of data off the hard disk and load them into memory so that they can be scanned. Before doing this, the operating system will need to make room for that data in memory by flushing other parts of memory out to disk.
It sounds like you are I/O bound, and the two biggest things you can do to fix this are:
Add more memory to the machine running mongod
Start using indexes so that the database does not need to do full collection scans
Use proper indexes, though that will have some effect on the efficiency of insertion into a capped collection.
It would be better if you could share the collection structure and the query you are using.