I am using mongodb via the mongo shell to query a large collection. For some reason after 90 seconds the mongo shell seems to be stopping my query and nothing is returned.
I have tried the following two commands but neither will return anything. After 90 seconds it just gives me a new line that I can type in another command.
db.cards.find("Field":"Something").maxTimeMS(9999999)
db.cards.find("Field":"Something").addOption(DBQuery.Option.tailable)
db.cards.find() return results, but anything with parameters is timing out at exactly 90 seconds and nothing is being returned.
Any help would be greatly appreciated.
Given the level of detail in your question, I am going to focus on 'query a large collection' and guess that your are using the MMAPv1 storage engine, with no index coverage on your query.
Are you disk bound?
Given the above assumptions, you could be cycling data between RAM and disk. Mongo has a default 100MB RAM limit, so if your query has to examine a lot of documents (no index coverage), paging data from disk to RAM could be the culprit. I have heard of mongo shell acting as you describe or locking/terminating when memory constraints are exceeded.
32bit systems can also impose severe memory limits for large collections.
You could look at your OS specific disk activity monitor to get a clue into whether this is your problem.
Just how large is your collection?
Next, how big is your collection? You can show collections and see the physical size of the collection and also db.cards.count() to see your record count. This helps quantify "large collection".
NOTE: you might need the mongo-hacker extensions to see collection disk use in show collections.
Mongo shell investigation
Within the mongo shell, you have a couple more places to look.
By default, mongo will log slow queries (> 100ms). After your 90 sec timeout:
db.adminCommand({getLog: "global" })
and look for slow query log entries.
Next look at your winning query plan.
var e = db.cards.explain()
e.find("Field":"Something")
I am guessing you will see
"stage": "COLLSCAN",
Which means you are doing a full collection scan and you need index coverage for your query (good idea for queries and sorts).
Suggestions
You should have at least partial index coverage on any production query. A proper index should solve your problem (assuming you don't have documents > 16MB).
Another approach (that I don't recommend - indexing is better) is to use a cursor instead
var cursor = db.cards.find("Field":"Something")
while (cursor.hasNext()) {
print(tojson(cursor.next()));
}
Depending on the root cause, this may work for you.
Related
System Information:
OS: Ubuntu 20.04 LTS
System: 80 GB RAM, 1 TB SSD, i7-12700k
The documents in this collection are on average 16KB, and there are 500K documents in this collection. I noticed that as the collection grows larger, the time taken to insert documents also grows larger.
In what ways could I improve the speed of writes?
It is taking 10 Hours to insert 150k documents. Which is around what the graph predicted when we integrate the line:
def f(num):
return 0.0004*num+0.9594
sum=0
for i in range(500,650):
sum+=f(i*1000)
>> sum/3600
>> 9.61497
Potential upgrades in my mind:
Use the C++ mongo engine for writes
Allocate more RAM to Mongod
Logs
iotop showing mongod using < 1% of the IO capacity with write speeds around 10-20 KB/s
htop showing the mongod is only using ~ 16GB of RAM \
Disks showing that some 300GB of SSD is free
EDIT:
Psudo code:
docs=[...]
for doc in docs:
doc["last_updated"]=str(datetime.now())
doc_from_db = collection.find_one({"key":doc["key"]})
new_dict = minify(doc)
if doc_from_db is None:
collection.insert_one(new_dict)
else:
collection.replace_one({"key":doc["key"]},new_dict,upsert=true)
When it comes to writes there are a few things to consider, the most impactful one which I'm assuming is the issue here is index size / index complexity / unique indexes.
It's hard to give exact advice without more information so I'll detail the most common bottlenecks when it comes to writes from my experience.
As mentioned indexes, if you have too many indexes. unique indexes. or indexes on very large arrays (and the document you insert have large arrays) these all heavily impact insert performance. This behavior also correlates with the graph you provided as inserting becomes worse and worse the larger the index gets. There is no "real" solution to this issue, you should reconsider which indexes and which indexes cause the bottleneck (focus on unique /array indexes). For example if you have an index that enforces uniqueness then drop it and enforce uniqueness at the application level instead.
write concern and replication lag, if you are using a replica set and you require a majority write concern this can definitely cause issues due to the sync lag that happens and grows, usually this is a side affect of a different issues, for example because of #1 (large indexes) the insert takes too long which causes sync lag which delays even further the write concern.
unoptimized hardware (Assuming you're hosted on cloud), you'd be surprised how much you can optimize write performance by just changing the disk type and increasing IOPS. this will give immediate performance. obviously at the cost of $$$.
no code was provided so I would also check that, if it's a for loop then obviously you can parallelize the logic.
I recommend you test the same insert logic on an indexless collection to pinpoint the problem, i'd be glad to help think of other issues/solutions once you can provide more information.
EDIT:
Here is an example of how to avoid the for loop issue by using bulkWrite instead in python using pymongo.
from pymongo import InsertOne, DeleteOne, ReplaceOne
from pymongo.errors import BulkWriteError
docs = [... input documents ]
requests = []
for doc in docs:
requests.append({
ReplaceOne({"docId": doc["docID"]}, doc, { upsert: True})
})
try:
db.docs.bulk_write(requests, ordered=False)
except BulkWriteError as bwe:
pprint(bwe.details)
You can enable profiling in Database, but according to previous comments and your code, just python code profiling may be enough, for example can you show the output of similar example?
https://github.com/Tornike-Skhulukhia/cprofiler_python_example/blob/main/demo.py
But before that, please check that you have index on field that you are doing searches against using find_one command in current code, otherwise database may need to do full collection scan to just find 1 document, meaning if you have more documents, this time will also increase a lot.
I run MongoDB 4.0 on WiredTiger under Ubuntu Server 16.04 to store complex documents. There is an issue with one of the collections: the documents have many images written as strings in base64. I understand this is a bad practice, but I need some time to fix it.
Because of this some find operations fail, but only those which have a non-empty filter or skip. For example db.collection('collection').find({}) runs OK while db.collection('collection').find({category: 1}) just closes connection after a timeout. It doesn't matter how many documents should be returned: if there's a filter, the error will pop every time (even if it should return 0 docs), while an empty query always executes well until skip is too big.
UPD: some skip values make queries to fail. db.collection('collection').find({}).skip(5000).limit(1) runs well, db.collection('collection').find({}).skip(9000).limit(1) takes way much time but executes too, while db.collection('collection').find({}).skip(10000).limit(1) fails every time. Looks like there's some kind of buffer where the DB stores query related data and on the 10000 docs it runs out of the resources. The collection itself has ~10500 docs. Also, searching by _id runs OK. Unfortunately, I have no opportunity to make new indexes because the operation fails just like read.
What temporary solution I may use before removing base64 images from the collection?
This happens because such a problematic data scheme causes huge RAM usage. The more entities there are in the collection, the more RAM is needed not only to perform well but even to run find.
Increasing MongoDB default RAM usage with storage.wiredTiger.engineConfig.cacheSizeGB config option allowed all the operations to run fine.
db.collection.find in my app that uses mongodb java driver (latest) are super slow. I investigated one of them as follows
// about 300 hundred ids at a time (i've tried lower and higher numbers - no impact
db.users.find({_id : {$in : [1,2,3,4,5,6....]}})
Once I get the cursor I do: cursor.toArray() and then iterate of the results
The toArray operation is extremely slow. On average they take about a minute. IMPORTANT: my database is under very heavy load at all times. This particular collection has over 50mm entries.
I've narrowed down the issue in mongo java driver to com.mongodb.Response - specifically to this line:
final byte [] b = new byte[36];
Bits.readFully(in, b);
Incredibly readFully of just 36 bytes takes over a minute some times!
When I bring own the load on the databases, the improvements are drastic. From about a minute to 5-6 seconds. I mean 5-6 seconds to get 300 documents is still super slow, but definitely better then 1 minute.
What can I do to troubleshoot this further? Are there settings on MondoDB that I need to look at?
What happens
You are loading all of the 300 user documents.
What happens is that the _id index is searched and the respective documents are sent completely to your app. So mongoDB will access it's data files, read the first document and send it to you, then it jumps to the next document and sends it to you and so forth. If you used the cursor, you could start iterating over the returned documents as soon as a number of documents equalling your defined cursor size have been returned, as others will be lazily loaded from the cursor on the server on demand. (Bit of a simplification, but sufficient for answering this question). What you do is to explicitly wait until the index is scanned, the documents are located, sent back to your app and have reached it down to the last byte of the last document. As #wdberkeley (who works for 10gen) correctly pointed out, this is a Very Bad Idea™.
What might cause or intensify the problem
Under heavy load, two things might happen. The more likely is that your _id index isn't in RAM any more, causing thousands, if not millions of reads from disk - which is slow. Much slower than if the indices are kept in RAM (by several orders of magnitude). So it is not the code snippet you mentioned, but the response time of MongoDB which causes the delay. Another option under heavy load is that your disk IO is simply too low or (more likely) the random file read latency is too high. I assume you are using spinning disks plus not enough RAM for a database that size.
What to do to find the cause
Try to find out your index size using the db.users.stats(). I am pretty sure that your index size(s combined) exceed your available RAM.
Measure the disk IO and latency. If you use a GNU/Linux OS, you might want to find out how high your IOwait percentage is. A high percentage shows that your disk latency is too high for the load put on the server. It might even be that your are reaching the disk's IO limits.
Do your queries on a mongo shell. In case they are fast, you can be pretty sure that your toArray call is the cause of the problem.
What to do to resolve the problem
If you have not enough RAM, either scale up or scale out.
If your disk latency or throughput is too high, either scale out or ( better and cheaper in most cases ) use SSDs for storing MongoDB's data.
Use a cursor object to iterate over the documents. This is a better solution in almost every use case I can think of.
Upgrading MongoDB driver to 3.6.4 will fetch the data in no-time.
We have around 2 million documents in our collection and with previous version it was taking around ~3 minutes but after after upgrading to 3.6.4 it took only 5-7 sec.So what I feel is that there is some issue with the old version of mongoDB driver.
I run a single mongodb instance which is getting inserted with logs from an app server. the current rate of insert in production is 10 inserts per second. And its a capped collection. i DONT USE ANY INDEXES . Queries were running faster when there were small number of records. only one collection has that amount of data. even querying from collection that has very few rows has become very slow. IS there any means to improve the performance.
-Avinash
This is a very difficult question to answer because we dont know much about your configuration or your document structure.
One thing that immediately pops into my head is that you are running out of memory. 10 inserts per second doesn't mean much because we do not know how big the inserted documents are.
If you are inserting larger documents at 10 per second, you could be eating up memory, causing the operating system to push some of your records to disk.
When you query without using an index, you are forced to scan every document. If your documents have been pushed to disk by the OS, you will begin having page faults. Mongo will need to fetch pages of data off the hard disk, and load them into memory so that they can be scanned. Before doing this, the operating system will need to make room for that data in memory by flushing other parts of memory out to disk.
It sounds like you are are I/O bound and the two biggest things you can do to fix this are
Add more memory to the machine running mongod
Start using indexes so that the database does not need to do full collection scans
Use proper indexes, though that will have some effect on the efficiency of insertion in a capped collection.
It would be better if you can share the collection structure and the query you are using.
I'm thinking about trying MongoDB to use for storing our stats but have some general questions about whether I'm understanding it correctly before I actually start learning it.
I understand the concept of using documents, what I'm not too clear about is how much data can be stored inside each document. The following diagram explains the layout I'm thinking of:
Website (document)
- some keys/values about the particular document
- statistics (tree)
- millions of rows where each record is inserted from a pageview (key/value array containing data such as timestamp, ip, browser, etc)
What got me excited about mongodb was the grouping functions such as:
http://www.mongodb.org/display/DOCS/Aggregation
db.test.group(
{ cond: {"invoked_at.d": {$gte: "2009-11", $lt: "2009-12"}}
, key: {http_action: true}
, initial: {count: 0, total_time:0}
, reduce: function(doc, out){ out.count++; out.total_time+=doc.response_time }
, finalize: function(out){ out.avg_time = out.total_time / out.count }
} );
But my main concern is how hard would that command for example be on the server if there is say 10's of millions of records across dozens of documents on a 512-1gb ram server on rackspace for example? Would it still run low load?
Is there any limit to the number of documents MongoDB can have (seperate databases)? Also, is there any limit to the number of records in a tree I explained above? Also, does that query I showed above run instantly or is it some sort of map/reduce query? Not very sure if I can execute that upon page load in our control panel to get those stats instantly.
Thanks!
Every document has a size limit of 4MB (which in text is A LOT).
It's recommended to run MongoDB in replication mode or to use sharding as you otherwise will have problems with single-server durability. Single-server durability is not given because MongoDB only fsync's to the disk every 60 seconds, so if your server goes down between two fsync's the data that got inserted/updated in that time will be lost.
There is no limit of documents other than your disk space in mongodb.
You should try to import a dataset that matches your data (or generate some test data) to MongoDB and analyse how fast your query executes. Remember to set indexes on those fields that you use heavily in your queries. Your above query should work pretty well even with a lot of data.
In order to analyze the speed of your query use the database profiler MongoDB comes with. On the mongo shell do:
db.setProfilingLevel(2); // to set the profiling level
[your query]
db.system.profile.find(); // to see the results
Remember to turn off profiling once you're finished (log will get pretty huge otherwise).
Regarding your database layout I suggest to change the "schema" (yeah yeah, schema less..) to:
website (collection):
- some keys/values about the particular document
statistics (collection)
- millions of rows where each record is inserted from a pageview (key/value array containing data such as timestamp, ip, browser, etc)
+ DBRef to website
See Database References
Documents in MongoDB are limited to a size of 4MB. Let's say a single page view results in 32 bytes being stored. Then you'll be able to store about 130,000 page views in a single document.
Basically the amount of page views a page can generate is infinite, and you indicated that you expect millions of them, so I suggest you store the log entries as separate documents. Each log entry should contain the _id of the parent document.
The number of documents in a database is limited to 2GB of total space on 32-bit systems. 64-bit systems don't have this limitation.
The group() function is a map-reduce query under the hood. The documentation recommends you use a map-reduce query instead of group(), because it has some limitations with large datasets and sharded environments.