MongoDB can not find() in 1 million documents - mongodb

I just started to deal with MongoDB.
I created 10 thousand JSON documents and ran this search:
db.mycollection.find({"somenode1.somenode2.somenode3.somenode4.Value": "9999"}).count()
It returns the correct result in about 34 ms. Everything is OK.
Then I created a database with 1 million of the same documents. The total size of the database is 34 GB, and MongoDB split it into 2 GB files. I repeated the query above to count the matching documents. I waited about 2 hours for a result while all the memory (16 GB) was consumed, and finally I shut Mongo down.
System: Windows 7 x64, 16 GB RAM.
Please tell me what I'm doing wrong. A production db will be much bigger.

In your particular case, it appears you simply do not have enough RAM. At minimum, an index on "somenode4" would improve the query performance. Keep in mind that the indexes will want to be in RAM as well, so you may need more RAM anyhow. Are you on a virtual machine? If so, I recommend you increase the size of the machine to account for the size of the working set.
As one of the other commenters stated, that nesting is a bit ugly but I understand it is what you were dealt. So other than RAM, indexing appears to be your best bet.
As part of your indexing effort, you may also want to experiment with pre-heating the indexes to ensure they are in RAM prior to that find and count(). Try executing a query that looks for something that does not exist. This should force the indexes and data into RAM before the real query runs. Depending on how often your data changes, you may want to do this once a day or more. You are essentially front-loading the slow operations.
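For concreteness, here is a minimal pymongo sketch of both suggestions (the index and the warm-up query). The nested path and collection name come from the question; the connection string and database name are assumptions.
from pymongo import ASCENDING, MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumption: local mongod on the default port
coll = client["mydb"]["mycollection"]              # "mydb" is a placeholder database name

# Index the deeply nested field so the count query can use an index scan
# instead of examining all 1 million documents.
coll.create_index([("somenode1.somenode2.somenode3.somenode4.Value", ASCENDING)])

# "Pre-heat" the index: a query for a value that cannot exist still walks the
# index and pulls its pages into RAM before the real workload arrives.
coll.count_documents({"somenode1.somenode2.somenode3.somenode4.Value": "___warmup___"})

# The original query should now be answered from the index.
print(coll.count_documents({"somenode1.somenode2.somenode3.somenode4.Value": "9999"}))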

Related

MongoDB poor write speed for collection with 500K documents with pymongo

System Information:
OS: Ubuntu 20.04 LTS
System: 80 GB RAM, 1 TB SSD, i7-12700k
The documents in this collection are on average 16KB, and there are 500K documents in this collection. I noticed that as the collection grows larger, the time taken to insert documents also grows larger.
In what ways could I improve the speed of writes?
It is taking 10 hours to insert 150k documents, which is roughly what the graph predicts when we integrate the fitted line:
def f(num):
    # fitted line from the graph (time as a function of document count)
    return 0.0004 * num + 0.9594

total = 0
for i in range(500, 650):      # extrapolate from 500k to 650k documents, in steps of 1,000
    total += f(i * 1000)

print(total / 3600)            # -> ~9.61 hours
Potential upgrades in my mind:
Use the C++ mongo engine for writes
Allocate more RAM to Mongod
Logs
iotop showing mongod using < 1% of the IO capacity with write speeds around 10-20 KB/s
htop showing that mongod is only using ~16 GB of RAM
Disks showing that some 300GB of SSD is free
EDIT:
Pseudocode:
docs = [...]
for doc in docs:
    doc["last_updated"] = str(datetime.now())
    doc_from_db = collection.find_one({"key": doc["key"]})
    new_dict = minify(doc)
    if doc_from_db is None:
        collection.insert_one(new_dict)
    else:
        collection.replace_one({"key": doc["key"]}, new_dict, upsert=True)
When it comes to writes there are a few things to consider; the most impactful one, which I'm assuming is the issue here, is index size / index complexity / unique indexes.
It's hard to give exact advice without more information, so I'll detail the most common write bottlenecks from my experience:
1. Indexes, as mentioned: too many indexes, unique indexes, or indexes on very large arrays (when the documents you insert have large arrays) all heavily impact insert performance. This behaviour also matches the graph you provided, since inserting gets worse and worse as the index grows. There is no "real" solution to this; reconsider which indexes you need and which ones cause the bottleneck (focus on unique and array indexes). For example, if you have an index that enforces uniqueness, drop it and enforce uniqueness at the application level instead.
2. Write concern and replication lag: if you are using a replica set and require a majority write concern, this can definitely cause issues due to the sync lag that builds up. Usually this is a side effect of another issue; for example, because of #1 (large indexes) the insert takes too long, which causes sync lag, which delays the write concern even further.
3. Unoptimized hardware (assuming you're hosted in the cloud): you'd be surprised how much you can improve write performance just by changing the disk type and increasing IOPS. This gives an immediate boost, obviously at the cost of $$$.
4. No code was provided, so I would also check that; if it's a plain for loop then you can obviously parallelize the logic.
I recommend you test the same insert logic on an index-free collection to pinpoint the problem (a rough sketch of such a test follows); I'd be glad to help think of other issues/solutions once you can provide more information.
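For example, a sketch of such a test, assuming a local mongod and throwaway names for the database and collection (adjust to your own setup):
import time
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
baseline = client["mydb"]["insert_benchmark_no_indexes"]  # only the default _id index
baseline.drop()

# ~16 KB documents, roughly matching the question's average document size.
docs = [{"key": i, "payload": "x" * 16_000} for i in range(10_000)]

start = time.monotonic()
baseline.insert_many(docs, ordered=False)
print(f"{len(docs)} inserts took {time.monotonic() - start:.2f}s with no secondary indexes")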
EDIT:
Here is an example of how to avoid the for-loop issue by using bulk_write instead, in Python with pymongo.
from pprint import pprint

from pymongo import InsertOne, DeleteOne, ReplaceOne
from pymongo.errors import BulkWriteError

docs = [...]  # input documents

requests = []
for doc in docs:
    requests.append(ReplaceOne({"docId": doc["docId"]}, doc, upsert=True))

try:
    db.docs.bulk_write(requests, ordered=False)
except BulkWriteError as bwe:
    pprint(bwe.details)
You can enable profiling in the database, but based on the previous comments and your code, plain Python-level profiling may be enough. For example, can you show the output of a run similar to this one?
https://github.com/Tornike-Skhulukhia/cprofiler_python_example/blob/main/demo.py
But before that, please check that you have an index on the field you search against with find_one in the current code; otherwise the database may need to do a full collection scan just to find one document, which means this time will also grow a lot as you add more documents. A minimal sketch of creating such an index follows.
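A minimal pymongo sketch, assuming the lookup field is the "key" field from the pseudocode above; the connection string, database, and collection names are placeholders.
from pymongo import ASCENDING, MongoClient

client = MongoClient("mongodb://localhost:27017")
collection = client["mydb"]["docs"]

# With this index, the find_one({"key": ...}) lookup before every upsert becomes
# an index seek instead of a full collection scan.
collection.create_index([("key", ASCENDING)], name="key_1")

# Verify the index is being used (the winning plan should be IXSCAN, not COLLSCAN).
print(collection.find({"key": "some-value"}).explain()["queryPlanner"]["winningPlan"])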

mongodb java driver readFully is slow

db.collection.find calls in my app, which uses the MongoDB Java driver (latest), are super slow. I investigated one of them as follows:
// about 300 ids at a time (I've tried lower and higher numbers - no impact)
db.users.find({_id : {$in : [1,2,3,4,5,6....]}})
Once I get the cursor I do cursor.toArray() and then iterate over the results.
The toArray operation is extremely slow. On average it takes about a minute. IMPORTANT: my database is under very heavy load at all times. This particular collection has over 50 million entries.
I've narrowed down the issue in mongo java driver to com.mongodb.Response - specifically to this line:
final byte [] b = new byte[36];
Bits.readFully(in, b);
Incredibly, readFully of just 36 bytes sometimes takes over a minute!
When I bring down the load on the database, the improvement is drastic: from about a minute to 5-6 seconds. I mean, 5-6 seconds to get 300 documents is still super slow, but definitely better than 1 minute.
What can I do to troubleshoot this further? Are there settings on MongoDB that I need to look at?
What happens
You are loading all of the 300 user documents.
What happens is that the _id index is searched and the matching documents are sent to your app in their entirety. MongoDB will access its data files, read the first document and send it to you, then jump to the next document and send that, and so forth. If you used the cursor, you could start iterating over the returned documents as soon as a number of documents equal to your defined cursor size had been returned, as the rest are lazily loaded from the cursor on the server on demand. (Bit of a simplification, but sufficient for answering this question.) What you do instead is explicitly wait until the index is scanned, the documents are located, sent back to your app, and received down to the last byte of the last document. As @wdberkeley (who works for 10gen) correctly pointed out, this is a Very Bad Idea™.
What might cause or intensify the problem
Under heavy load, two things might happen. The more likely is that your _id index isn't in RAM any more, causing thousands, if not millions, of reads from disk - which is slow, much slower (by several orders of magnitude) than when the indices are kept in RAM. So it is not the code snippet you mentioned, but the response time of MongoDB, that causes the delay. The other possibility under heavy load is that your disk IO is simply too low or (more likely) the random read latency is too high. I assume you are using spinning disks and do not have enough RAM for a database of that size.
What to do to find the cause
Try to find out your index size using db.users.stats() (a small pymongo sketch of this check follows this list). I am pretty sure that your index sizes combined exceed your available RAM.
Measure the disk IO and latency. If you use a GNU/Linux OS, you might want to find out how high your IOwait percentage is. A high percentage shows that your disk latency is too high for the load put on the server. It might even be that you are reaching the disk's IO limits.
Do your queries on a mongo shell. In case they are fast, you can be pretty sure that your toArray call is the cause of the problem.
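A rough pymongo equivalent of the db.users.stats() check; the database name is a placeholder, the collection name comes from the question.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
stats = client["mydb"].command("collStats", "users")

print("total index size (bytes):", stats["totalIndexSize"])
for name, size in stats["indexSizes"].items():
    print(f"  {name}: {size} bytes")
# If totalIndexSize (plus the working set) exceeds physical RAM, reads will
# constantly fault index pages in from disk.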
What to do to resolve the problem
If you do not have enough RAM, either scale up or scale out.
If your disk latency or throughput is too high, either scale out or ( better and cheaper in most cases ) use SSDs for storing MongoDB's data.
Use a cursor object to iterate over the documents (see the sketch below). This is a better solution in almost every use case I can think of.
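The question uses the Java driver, but the same idea can be sketched with pymongo; the connection string, database name, id list, and per-document handling below are all illustrative.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
users = client["mydb"]["users"]
ids = [1, 2, 3, 4, 5, 6]

# Anti-pattern: equivalent to toArray() - blocks until every document has arrived.
all_docs = list(users.find({"_id": {"$in": ids}}))

# Preferred: start processing as soon as the first batch is back; later batches
# are fetched from the server on demand as you iterate.
for doc in users.find({"_id": {"$in": ids}}):
    print(doc)  # replace with the real per-document processing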
Upgrading the MongoDB driver to 3.6.4 fetched the data in no time.
We have around 2 million documents in our collection; with the previous version it was taking around ~3 minutes, but after upgrading to 3.6.4 it took only 5-7 seconds. So what I feel is that there is some issue with the old version of the MongoDB driver.

MongoDB sharding scalability - performance of queries hitting a single chunk?

In doing some preliminary tests of MongoDB sharding, I hoped and expected that the time to execute queries that hit only a single chunk of data on one shard/machine would remain relatively constant as more data was loaded. But I found a significant slowdown.
Some details:
For my simple test, I used two machines to shard and tried queries on similar collections with 2 million rows and 7 million rows. These are obviously very small collections that don’t even require sharding, yet I was surprised to already see a significant consistent slowdown for queries hitting only a single chunk. Queries included the sharding key, were for result sets ranging from 10s to 100000s of rows, and I measured the total time required to scroll through the entire result sets. One other thing: since my application will actually require much more data than can fit into RAM, all queries were timed based on a cold cache.
Any idea why this would be? Has anyone else observed the same or contradictory results?
Further details (prompted by Theo):
For this test, the rows were small (5 columns including _id), and the key was not based on _id, but rather on a many-valued text column that almost always appears in queries.
The command db.printShardingStatus() shows how many chunks there are as well as the exact key values used to split ranges for chunks. The average chunk contains well over 100,000 rows for this dataset and inspection of key value splits verifies that the test queries are hitting a single chunk.
For the purpose of this test, I was measuring only reads. There were no inserts or updates.
Update:
Upon some additional research, I believe I determined the reason for the slowdown: MongoDB chunks are purely logical, and the data within them is NOT physically located together (source: "Scaling MongoDB" by Kristina Chodorow). This is in contrast to partitioning in traditional databases like Oracle and MySQL. This seems like a significant limitation, as sharding will scale horizontally with the addition of shards/machines, but less well in the vertical dimension as data is added to a collection with a fixed number of shards.
If I understand this correctly, if I have 1 collection with a billion rows sharded across 10 shards/machines, even a query that hits only one shard/machine is still querying from a large collection of 100 million rows. If values for the sharding key happen to be located contiguously on disk, then that might be OK. But if not and I'm fetching more than a few rows (e.g. 1000s), then this seems likely to lead to lots of I/O problems.
So my new question is: why not organize chunks in MongoDB physically to enable vertical as well as horizontal scalability?
What makes you say the queries only touched a single chunk? If the results ranged up to 100,000 rows, that sounds unlikely. A chunk is at most 64 MB, and unless your objects are tiny, that many won't fit. Mongo has most likely split your chunks and distributed them.
I think you need to tell us more about what you're doing and the shape of your data. Were you querying and loading at the same time? Do you mean shard when you say chunk? Is your shard key something else than _id? Do you do any updates while you query your data?
There are two major factors when it comes to performance in Mongo: the global write lock and its use of memory-mapped files. Memory-mapped files mean you really have to think about your usage patterns, and the global write lock makes page faults hurt really badly.
If you query for things that are all over the place, the OS will struggle to page things in and out; this hurts especially badly if your objects are tiny, because whole pages have to be loaded just to access a small piece, and lots of RAM is wasted. If you're doing lots of writes, they will lock reads (though usually not that badly, since writes happen fairly sequentially) - but if you're doing updates you can forget about any kind of performance; updates block the whole database server for significant amounts of time.
Run mongostat while you're running your tests; it can tell you a lot (run mongostat --discover | grep -v SEC to see the metrics for all your shard masters, and don't forget to include --port if your mongos is not running on 27017).
Addressing the questions in your update: it would be really nice if Mongo did keep chunks physically together, but that is not the case. One of the reasons is that sharding is a layer on top of mongod, and mongod is not fully aware of being a shard. It's the config servers and mongos processes that know of shard keys and which chunks exist. Therefore, in the current architecture, mongod doesn't even have the information that would be required to keep chunks together on disk. The problem is even deeper: Mongo's disk format isn't very advanced. It still (as of v2.0) does not have online compaction (although compaction got better in v2.0); it can't compact a fragmented database and still serve queries. Mongo has a long way to go before it's capable of what you're suggesting, sadly.
The best you can do at this point is to make sure you write the data in order so that chunks are written sequentially. It probably helps to create all chunks beforehand too, so that data will not be moved around by the balancer (a rough sketch of pre-splitting follows). Of course this is only possible if you have all your data in advance, and that seems unlikely.
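For what it's worth, here is a rough pymongo sketch of pre-splitting, assuming you connect through mongos and shard an illustrative collection on an illustrative key; none of these names come from the question.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumption: this is a mongos, not a bare mongod
admin = client.admin

admin.command("enableSharding", "mydb")
admin.command("shardCollection", "mydb.pages", key={"shard_key": 1})

# Pre-split at known key boundaries so the balancer does not have to move data
# around while you bulk load in shard-key order.
for boundary in ["g", "n", "t"]:
    admin.command("split", "mydb.pages", middle={"shard_key": boundary})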
Disclaimer: I work at Tokutek
So my new question is: why not organize chunks in MongoDB physically to enable vertical as well as horizontal scalability?
This is exactly what is done in TokuMX, a replacement server for MongoDB. TokuMX uses Fractal Tree indexes which have high write throughput and compression, so instead of storing data in a heap, data is clustered with the index. By default, the shard key is clustered, so it does exactly what you suggest, it organizes the chunks physically, by ensuring all documents are ordered by the shard key on disk. This makes range queries on the shard key fast, just like on any clustered index.

mongodb got slow when the document count went around 100, 000 . Any performance optimization?

I run a single MongoDB instance that receives log inserts from an app server. The current insert rate in production is 10 inserts per second, and it's a capped collection. I DON'T USE ANY INDEXES. Queries were running faster when there was a small number of records. Only one collection has that amount of data, yet even querying a collection with very few rows has become very slow. Is there any way to improve the performance?
-Avinash
This is a very difficult question to answer because we don't know much about your configuration or your document structure.
One thing that immediately pops into my head is that you are running out of memory. 10 inserts per second doesn't mean much because we do not know how big the inserted documents are.
If you are inserting larger documents at 10 per second, you could be eating up memory, causing the operating system to push some of your records to disk.
When you query without using an index, you are forced to scan every document. If your documents have been pushed to disk by the OS, you will begin having page faults. Mongo will need to fetch pages of data off the hard disk, and load them into memory so that they can be scanned. Before doing this, the operating system will need to make room for that data in memory by flushing other parts of memory out to disk.
It sounds like you are I/O bound, and the two biggest things you can do to fix this are:
Add more memory to the machine running mongod
Start using indexes so that the database does not need to do full collection scans (a minimal pymongo sketch follows this list)
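As a sketch of the second point, assuming the app queries its capped log collection by a timestamp field; the database, collection, and field names below are placeholders, not from the question.
from pymongo import ASCENDING, MongoClient

client = MongoClient("mongodb://localhost:27017")
logs = client["logs_db"]["app_logs"]

# Indexes are allowed on capped collections; they slow inserts slightly but
# turn full collection scans into index seeks for reads.
logs.create_index([("timestamp", ASCENDING)])

# Confirm the planner now uses the index rather than a COLLSCAN.
plan = logs.find({"timestamp": {"$gte": 0}}).explain()["queryPlanner"]["winningPlan"]
print(plan)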
Use proper indexes, though that will have some effect on the efficiency of insertion in a capped collection.
It would be better if you can share the collection structure and the query you are using.

MongoDB: Sharding on single machine. Does it make sense?

I created a collection in MongoDB consisting of 11,446,615 documents.
Each document has the following form:
{
"_id" : ObjectId("4e03dec7c3c365f574820835"),
"httpReferer" : "http://www.somewebsite.pl/art.php?id=13321&b=1",
"words" : ["SEX", "DRUGS", "ROCKNROLL", "WHATEVER"],
"howMany" : 3
}
httpReferer: just a URL
words: words parsed from the URL above; the size of the list is between 15 and 90.
I am planning to use this database to obtain list of webpages which have similar content.
I'll be querying this collection using the words field, so I created (or rather started creating) an index on this field:
db.my_coll.ensureIndex({words: 1})
Creating this collection takes a very long time. I tried two approaches (the tests below were done on my laptop):
Inserting, then indexing: inserting took 5.5 hours, mainly due to CPU-intensive preprocessing of the data. Indexing took 30 hours.
Indexing before inserting: it would take a few days to insert all the data into the collection.
My main focus is to decrease the time needed to generate the collection. I don't need replication (at least for now), and querying doesn't have to be lightning-fast.
Now, time for a question:
I have only one machine with one disk where I can run my app. Does it make sense to run more than one instance of the database and split my data between them?
Yes, it does make sense to shard on a single server.
At this time, MongoDB still uses a global lock per mongod server. Running multiple servers frees each one from the others' locks. If you run a multi-core machine with separate NUMA nodes, this can also increase performance.
If your load increases too much for your server, initial sharding makes for easier horizontal scaling in the future. You might as well do it now.
Machines vary. I suggest writing your own bulk-insertion benchmark program and spinning up varying numbers of MongoDB shards (a minimal sketch follows). I have a 16-core RAIDed machine and I've found that 3-4 shards seem to be ideal for my write-heavy database. I'm finding that my two NUMA nodes are my bottleneck.
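A minimal sketch of such a benchmark in pymongo; the URIs, database and collection names, and document shape are all placeholders to adapt to your own setup.
import time
from pymongo import MongoClient

def benchmark(uri, n_docs=100_000, batch_size=5_000):
    # Point the URI at a mongos (sharded) or a plain mongod (unsharded) and compare timings.
    coll = MongoClient(uri)["bench"]["docs"]
    coll.drop()
    start = time.monotonic()
    for offset in range(0, n_docs, batch_size):
        batch = [{"n": i, "words": ["SEX", "DRUGS", "ROCKNROLL"]}
                 for i in range(offset, offset + batch_size)]
        coll.insert_many(batch, ordered=False)
    return time.monotonic() - start

print("unsharded:", benchmark("mongodb://localhost:27017"))
# print("sharded  :", benchmark("mongodb://localhost:27020"))  # hypothetical mongos port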
In the modern day (2015), with MongoDB v3.0.x, there is collection-level locking with MMAPv1, which increases write throughput slightly (assuming you're writing to multiple collections), and if you use the WiredTiger engine there is document-level locking, which gives much higher write throughput. This removes the need for sharding across a single machine. You can technically still increase mapReduce performance by sharding on a single machine, but in that case you'd be better off just using the aggregation framework, which can exploit multiple cores. If you heavily rely on map-reduce algorithms it might make the most sense to just use something like Hadoop.
The only reason to shard MongoDB is to scale horizontally, so sharding becomes beneficial when a single machine cannot house enough disk space, memory, or (rarely) CPU power. I think it's quite seldom that someone has enough data that they need to shard, even at a large business, especially since WiredTiger added compression support that can reduce disk usage by over 80%. It's also infrequent that someone uses MongoDB to perform really CPU-heavy queries at large scale, because there are much better technologies for that. In most cases IO is the most important factor in performance; not many queries are CPU-intensive unless you're running a lot of complex aggregations, and even geo-spatial data is indexed upon insertion.
The most likely reason you'd need to shard is if you have a lot of indexes that consume a large amount of RAM; WiredTiger reduces this, but it's still the most common reason to shard. Whereas sharding across a single machine is likely just going to cause undesired overhead, with very little or possibly no benefit.
This doesn't have to be a mongo question, it's a general operating system question. There are three possible bottlenecks for your database use.
network (i.e. you're on a gigabit line, you're using most of it at peak times, but your database isn't really loaded down)
CPU (your CPU is near 100% but disk and network are barely ticking over)
disk
In the case of network, rewrite your network protocol if possible, otherwise shard to other machines. In the case of CPU, if you're 100% on a few cores but others are free, sharding on the same machine will improve performance. If disk is fully utilized add more disks and shard across them -- way cheaper than adding more machines.
No, it does not make sense to shard on a single server.
There are a few exceptional cases but they mostly come down to concurrency issues related to things like running map/reduce or javascript.
This is answered in the first paragraph of the Replica set tutorial
http://www.mongodb.org/display/DOCS/Replica+Set+Tutorial