Is db.inventory.find().limit(10) faster than db.inventory.find()?
I have millions of records in MongoDB, and I want to get the top 10 records in some order.
Using limit() you inform the server that you will not retrieve more than k documents, allowing some optimizations to reduce bandwidth consumption and speed up sorts. Finally, with a limit clause the server can make better use of the 32 MB maximum available when sorting in RAM (i.e., when the sort order cannot be obtained from an index).
Now, the long story: find() returns a cursor. By default, the cursor transfers the results to the client in batches. From the documentation:
For most queries, the first batch returns 101 documents or just enough documents to exceed 1 megabyte. Subsequent batch size is 4 megabytes.
Using limit(), the cursor will not need to retrieve more documents than necessary, reducing bandwidth consumption and latency.
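As a rough sketch in the mongo shell (using the inventory collection from the question), the limited cursor stops requesting batches once 10 documents have been returned:

```javascript
// Unlimited: the cursor keeps fetching batches until the
// result set is exhausted, even if the client stops reading early.
var all = db.inventory.find();

// Limited: the server returns at most 10 documents, so only the
// first batch is ever transferred over the network.
var top10 = db.inventory.find().limit(10);
top10.forEach(printjson);
```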
Please notice that, given your use case, you will probably use a sort() operation as well. From the same documentation as above:
For queries that include a sort operation without an index, the server must load all the documents in memory to perform the sort before returning any results.
And the sort() documentation page explains further:
If MongoDB cannot obtain the sort order via an index scan, then MongoDB uses a top-k sort algorithm. This algorithm buffers the first k results (or last, depending on the sort order) seen so far by the underlying index or collection access. If at any point the memory footprint of these k results exceeds 32 megabytes, the query will fail.¹
¹ That 32 MB limitation is not specific to sorts using a limit() clause. Any sort whose order cannot be obtained from an index suffers from the same limitation. However, with a plain sort the server needs to hold all documents in memory to sort them. With a limited sort, it only has to store k documents in memory at the same time.
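To illustrate the top-k behaviour, a minimal sketch (the field name quantity is an assumption): with a limit, the in-memory sort only ever buffers k documents, so it stays far below the 32 MB ceiling even on a collection with millions of records:

```javascript
// Top-k sort: only 10 documents are buffered in memory at any time
// during the sort, regardless of collection size.
db.inventory.find().sort({ quantity: -1 }).limit(10)

// A plain sort with no limit must buffer every document and will
// fail once the buffer exceeds 32 MB (unless an index provides the order).
db.inventory.find().sort({ quantity: -1 })
```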
If you need the results in order, then of course the DB will first sort them based on the criteria and then return the top 10 records; by using limit() you are also saving network bandwidth. E.g., here I am sorting by name and then taking the top 10 records: it has to scan the whole collection, sort it, and then pick the top ones. (Notice it is doing a COLLSCAN, i.e. a full collection scan, since I don't have an index for this example; the point is that it does a full scan of all the records, sorts them, and then picks the top ones.)
> db.t1.find().sort({name:1}).limit(10).explain()
{
    "queryPlanner" : {
        "plannerVersion" : 1,
        "namespace" : "test.t1",
        "indexFilterSet" : false,
        "parsedQuery" : {
            "$and" : [ ]
        },
        "winningPlan" : {
            "stage" : "SORT",
            "sortPattern" : {
                "name" : 1
            },
            "limitAmount" : 10,
            "inputStage" : {
                "stage" : "COLLSCAN",
                "filter" : {
                    "$and" : [ ]
                },
                "direction" : "forward"
            }
        },
        "rejectedPlans" : [ ]
    },
    "serverInfo" : {
        "host" : "Sachin-Mac.local",
        "port" : 27017,
        "version" : "3.0.2",
        "gitVersion" : "6201872043ecbbc0a4cc169b5482dcf385fc464f"
    },
    "ok" : 1
}
I'm running a standalone MongoDB 3.6 Docker container. I have a collection of very small documents and a super simple index on the "Date" field, sorted descending:
> db.collection.getIndexes()
[
    {
        "v" : 2,
        "key" : {
            "_id" : 1
        },
        "name" : "_id_",
        "ns" : "myApp.collection"
    },
    {
        "v" : 2,
        "key" : {
            "Date" : -1
        },
        "name" : "Date_-1",
        "ns" : "myApp.collection",
        "sparse" : true
    }
]
I'm using the MongoCSharpDriver to perform a query where I get the cursor and I'm getting the following error:
Command find failed: Executor error during find command :: caused by :: errmsg: "Sort operation used more than the maximum 33554432 bytes of RAM. Add an index, or specify a smaller limit."
I'm specifying a BatchSize of 100 documents; however, I'm not setting a limit on the number of records returned, since I think that will be handled by the cursor itself (so both Skip and Limit are set to zero).
My question is: could it be that the actual index is already greater than 32 MB? If so, do I have to extend the RAM allocated for this? Otherwise, how do you solve this kind of issue? Note that I currently have 46132 documents, each approximately 2.52 KB in size.
You don't need to extend the RAM. Set allowDiskUse: true instead; rather than aborting, the operation will continue using temporary disk storage instead of RAM.
db.getCollection('movies').aggregate( [
{ $sort : { year : 1} }
],
{ allowDiskUse: true }
)
Also, I can't see your query, but you said you have an index: before sorting, you should always create an index that matches the sort order.
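One caveat worth noting: the existing Date index is sparse, and MongoDB may refuse to use a sparse index to satisfy a sort, because it does not contain every document. A sketch of a non-sparse index that would let the server return the sorted results from the index and avoid the 32 MB in-memory sort entirely (collection name taken from the question):

```javascript
// A non-sparse descending index on Date can satisfy sort({ Date: -1 })
// via an index scan, so no in-memory sort (and no 32 MB limit) applies.
db.collection.createIndex({ Date: -1 })

// This sort can now walk the index instead of buffering documents in RAM.
db.collection.find().sort({ Date: -1 })
```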
I have a question with regards to collection sizes and query performance –
There are 2 dbs– DB1 & DB2. DB1 has 1 collection, and here’s the output from stats() on this collection –
{
    …
    "count" : 2085217,
    "size" : 17048734192,
    "avgObjSize" : 8176,
    "capped" : false,
    "nindexes" : 3,
    "indexDetails" : {},
    "totalIndexSize" : 606299456,
    "indexSizes" : {
        "_id_" : 67664576,
        "id_1" : 284165056,
        "id_2" : 254469824
    },
    …
}
A query on this collection, using index id_1 comes back in 0.012 secs. Here’s the output from explain() -
"executionStats" : {
    "executionSuccess" : true,
    "nReturned" : 1,
    "executionTimeMillis" : 0,
    "totalKeysExamined" : 1,
    "totalDocsExamined" : 1,
    ….
    "indexName" : "id_1",
}
In DB2, I have 4 collections, and here’s the output from stats on DB2 –
{
    …
    "collections" : 4,
    "objects" : 152655935,
    "avgObjSize" : 8175.998307514215,
    "dataSize" : 1248114666192,
    "storageSize" : 1257144933456,
    "indexes" : 12,
    "indexSize" : 19757688272,
    "fileSize" : 1283502112768,
    …
}
A query on any collection in DB2, using the index, which I confirmed via explain(), takes at least double the time that it does for the previous query against DB1.
Since Mongo should scale well, why is there this difference? I read that MongoDB loads all the indexes into memory, and since DB2 has a higher volume than DB1, is that why it's taking much longer?
Any insights would be greatly helpful. Thanks.
Edit 1:
Adding more info re. collection definition, indexes definitions and queries executed...
All collections (in both DBs) contain the same fields; only the values and the size of documents differ between them.
And, here's the relevant index -
"1" : {
    "v" : 1,
    "unique" : true,
    "key" : {
        "id" : 1
    },
    "name" : "id_1",
    "ns" : "ns.coll1"
}
And this is what the id field looks like:
"_id" : ObjectId("55f9b6548aefbce6b2fa2fac"),
"id" : {
    "pid" : {
        "f1" : "val1",
        "f2" : "val2"
    }
},
And, here's a sample query -
db.coll1.find({id:{pid:{f1:"val1",f2:"val2"}}})
Edit 2:
Here's some more info on the hard disk & RAM -
$ free -m
              total        used        free      shared  buff/cache   available
Mem:         386758        3750        1947       25283      381060      355675
Swap:        131071        3194      127877
The hard disk is around 3.5 TB, of which 2.9 TB is already used.
Scaling
MongoDB scales very well. The thing is, it is designed to scale horizontally, not vertically. This means that if your DBs are holding a lot of data, you should shard the collections in order to achieve better parallelization.
Benchmark results
Regarding the difference in query time, I don't think your profiling is conclusive. The DBs are possibly on different machines (with different specs). Even supposing the hardware is the same, DB2 apparently holds more documents in its collections, and the sizes of the documents are not the same in both DBs. The same query can return data sets of different sizes, which will inevitably have an impact on data serialization and other low-level aspects. Unless you profile the queries in a more controlled setup, I think your results are pretty much expected.
Suggestions
Take care if you are using DBRefs in your documents. It's possible Mongo will automatically dereference them; that means more data to serialize and more overhead.
Try running the same queries with a limit specification. You have defined the index to be unique, but I don't know if that automatically makes Mongo stop index traversal once it has found a value. Check whether db.coll1.find({id:{pid:{f1:"val1",f2:"val2"}}}) and db.coll1.find({id:{pid:{f1:"val1",f2:"val2"}}}).limit(1) run in the same time.
Take a look at Indexes on embedded fields and Indexes on embedded documents. Embedded documents seem to impose extra overhead.
Finally, if your document has no embedded documents, only embedded fields (which seems to be the case), then define your index more specifically. Create this index
db.coll1.createIndex({"id.pid.f1": 1, "id.pid.f2": 1}, {unique: true})
and run the query again. If this index doesn't improve performance, then I believe you have done everything properly and it may be time to start sharding.
My problem relates to MongoDB's query optimizer and how it picks the best index to use. I realized that under some conditions the optimizer doesn't pick the best existing index and instead keeps using one that is close enough.
Consider having a simple dataset like:
{ "_id" : 1, "item" : "f1", "type" : "food", "quantity" : 500 }
{ "_id" : 2, "item" : "f2", "type" : "food", "quantity" : 100 }
{ "_id" : 3, "item" : "p1", "type" : "paper", "quantity" : 200 }
{ "_id" : 4, "item" : "p2", "type" : "paper", "quantity" : 150 }
{ "_id" : 5, "item" : "f3", "type" : "food", "quantity" : 300 }
{ "_id" : 6, "item" : "t1", "type" : "toys", "quantity" : 500 }
{ "_id" : 7, "item" : "a1", "type" : "apparel", "quantity" : 250 }
{ "_id" : 8, "item" : "a2", "type" : "apparel", "quantity" : 400 }
{ "_id" : 9, "item" : "t2", "type" : "toys", "quantity" : 50 }
{ "_id" : 10, "item" : "f4", "type" : "food", "quantity" : 75 }
and then want to issue a query as follows:
db.inventory.find({"type": "food","quantity": {$gt: 50}})
I go ahead and create the following index:
db.inventory.ensureIndex({"quantity" : 1, "type" : 1})
The statistics from cursor.explain() confirm this index's performance: ("n" : 4, "nscannedObjects" : 4, "nscanned" : 9). It scanned more index keys than the number of matching documents. Considering that "type" is a more selective attribute with an equality match, it is surely better to create the following index instead:
db.inventory.ensureIndex({ "type" : 1, "quantity" : 1})
The statistics also confirm that this index performs better: ("n" : 4, "nscannedObjects" : 4, "nscanned" : 4), meaning the second index scans exactly as many index keys as there are matched documents.
However, I observed that if I don't delete the first index, the query optimizer continues using it, even though the better index has been created.
According to the documentation, every time a new index is created the query optimizer considers it when building the query plan, but I don't see that happening here.
Can anyone explain how the query optimizer really works?
Considering the fact that "type" is a higher selective attribute
Index selectivity is a very important aspect, but in this case, note that you're using an equality query on type and a range query on quantity, which is the more compelling reason to swap the order of the index fields, even if selectivity were lower.
However, I observed if I don't delete the first index, the query optimizer continues using the first index, although the better index is got created. [...]
The MongoDB query optimizer is largely statistical. Unlike most SQL engines, MongoDB doesn't attempt to reason what could be a more or less efficient index. Instead, it simply runs different queries in parallel from time to time and remembers which one was faster. The faster strategy will then be used. From time to time, MongoDB will perform parallel queries again and re-evaluate the strategy.
One problem of this approach (and maybe the cause of the confusion) is that there's probably not a big difference with such a tiny dataset - it's often better to simply scan elements than to use any kind of index or search strategy if the data isn't large compared to the prefetch / page size / cache size and pipeline length. As a rule of thumb, simple lists of up to maybe 100 or even 1,000 elements often don't benefit from indexing at all.
Like anything else of consequence, designing indexes requires some forward thinking. The goals are:
Efficiency - fast read/write operations
Selectivity - minimize the number of records scanned
Other requirements - e.g. how are sorts handled?
Selectivity is the primary factor that determines how efficiently an index can be used. Ideally, the index enables us to select only those records required to complete the result set, without the need to scan a substantially larger number of index keys (or documents) in order to complete the query. Selectivity determines how many records any subsequent operations must work with. Fewer records means less execution time.
Think about what queries will be used most frequently by the application. Use explain command and specifically see the executionStats:
nReturned
totalKeysExamined - is the number of keys examined much larger than the number of returned documents? If so, we need an index to reduce it.
Look at queryPlanner and rejectedPlans. Look at the winningPlan, which shows the keyPattern, i.e. which keys need to be indexed. Whenever we see stage: SORT, it means the sort key is not part of the index, or the database was not able to return documents in the sort order specified by the query, and had to perform an in-memory sort. If we add the key on which the sort happens to the index, we will see that the winningPlan's stage changes from SORT to FETCH. The keys in the index need to be ordered based on the range (cardinality) of their data; e.g. class will have fewer distinct values than student. This involves a trade-off: executionTimeMillis will drop considerably, but docsExamined and keysExamined may be somewhat larger. That trade-off is usually worth making.
There is also a way to force queries to use a particular index, but this is not recommended for deployment. The command in question is .hint(), which can be chained after find() or sort(). It requires the actual index name or the shape of the index.
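A minimal sketch of hint() (the collection and index names here are made up for illustration):

```javascript
// Force a specific index by its key pattern (shape)...
db.orders.find({ status: "A" }).hint({ status: 1, date: -1 })

// ...or by its name as reported by db.orders.getIndexes().
db.orders.find({ status: "A" }).hint("status_1_date_-1")
```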
In general, when building compound indexes for:
- equality field: field on which queries will perform an equality test
- sort field: field on which queries will specify a sort
- range field: field on which queries perform a range test
We should keep the following rules of thumb in mind:
Equality fields before range fields
Sort fields before range fields
Equality fields before sort fields
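Putting those three rules together, a sketch (collection and field names are assumptions): for a query with an equality test on status, a sort on date, and a range test on qty, the compound index lists the fields in exactly that order:

```javascript
// Equality field first, then sort field, then range field.
db.orders.createIndex({ status: 1, date: -1, qty: 1 })

// This query can now use the index for the match, the sort, and the range.
db.orders.find({ status: "A", qty: { $gt: 10 } }).sort({ date: -1 })
```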
I am doing many upserts to a collection that also receives a lot of find queries. My upserts have write concern unacknowledged. Many of these upserts appear in the mongo log with runtimes above 800ms and yields above 20. The number of inprog operations on the server seems stable around 20 with peaks around 40.
The collection contains ~15 million documents.
Do these long query times indicate that the mongo server cannot keep up with the incoming data, or is it just postponing the unacknowledged writes in a controlled manner?
The documents in the collection look like this:
{
"_id" : ObjectId("53c65f9f995bce51e4d84ecb"),
"items" : [
"53216cf7e4b04d3fa854a4d0",
"53218be4e4b0a79ba7fee19a"
],
"score" : 1,
"other" : [
"b09b2c99-e4f3-48a2-990d-4b2090cc9666",
"b09b2c99-e4f3-48a2-990d-4b2090cc9666"
]
}
I have the following indexes
[
{
"v" : 1,
"key" : {
"_id" : 1
},
"name" : "_id_",
"ns" : "dbname.stuff"
},
{
"v" : 1,
"key" : {
"items" : 1,
"score" : -1
},
"name" : "items_1_score_-1",
"ns" : "dbname.stuff",
"background" : true
}
]
The slow upserts look like this in the log
update dbname.stuff query: { items: [ "52ea4da1e4b035b15423f8f5", "53c7cf43e4b007135ca60114" ] } update: { $inc: { score: 6 }, $setOnInsert: { others: [ "64a7e6b1-2a0a-4374-ac9c-fbf2de7cbb48", "b9e07cda-14c8-45e4-95cc-f0f4c5bc410c" ] } } nscanned:0 nscannedObjects:0 nMatched:1 nModified:0 fastmodinsert:1 upsert:1 keyUpdates:0 numYields:16 locks(micros) w:46899 1752ms
Acknowledgement of writes, or "write concern", does not affect overall query performance, just the time the client may spend waiting for the acknowledgement. So if you are more or less in "fire and forget" mode, your client is not held up, but the write operations can still take a while.
In this case, it seems your working set is actually quite large. It is also worth considering that the working set is "server wide" and not constrained to one collection or even one database. The general problem here is that you do not have enough RAM for what you are trying to load, and you are running into paging. See the "yields" counter.
Upserts need to "find" the matching document in the index, so even if there is no match you still need to "scan" to determine whether the item exists. This means loading the index into memory. As such you have a few choices:
Remodel to make these write "insert only", and aggregate "counter" type values in background processes, but basically not in real time.
Add more RAM within your means.
Time to shard so you can have several "shards" in your cluster that have a capable amount of RAM to deal with the working set sizes.
Nothing is easy here, and depending on what your application actually requires, each of these offers a different level of solution. Indeed, if you are prepared to live without "write acknowledgements" in general, then you may need to make the rest of your application tolerate "eventual consistency" of those writes actually being available to be read.
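For the sharding option, a minimal sketch (database and collection names taken from the log line; the hashed _id shard key is an assumption, since the multikey items field cannot serve as a shard key):

```javascript
// Shard on a hashed _id so inserts and upserts distribute evenly,
// letting each shard keep its share of the index in RAM.
sh.enableSharding("dbname")
sh.shardCollection("dbname.stuff", { _id: "hashed" })
```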
We are trying to build a notification application for our users with Mongo. We created one MongoDB instance on a Xen VM with 10 GB RAM, a 150 GB 15K RPM SAS HDD, and a 4-core 2.9 GHz Intel Xeon.
DB schema:
{
    "_id" : ObjectId("5178c458e4b0e2f3cee77d47"),
    "userId" : NumberLong(1574631),
    "type" : 2,
    "text" : "a user connected to B",
    "status" : 0,
    "createdDate" : ISODate("2013-04-25T05:51:19.995Z"),
    "modifiedDate" : ISODate("2013-04-25T05:51:19.995Z"),
    "metadata" : "{\"INVITEE_NAME\":\"2344\",\"INVITEE\":1232143,\"INVITE_SENDER\":1574476,\"INVITE_SENDER_NAME\":\"123213\"}",
    "opType" : 1,
    "actorId" : NumberLong(1574630),
    "actorName" : "2344"
}
DB stats:
db.stats()
{
    "db" : "UserNotificationDev2",
    "collections" : 3,
    "objects" : 78597973,
    "avgObjSize" : 489.00035699393925,
    "dataSize" : 38434436856,
    "storageSize" : 41501835008,
    "numExtents" : 42,
    "indexes" : 2,
    "indexSize" : 4272393328,
    "fileSize" : 49301946368,
    "nsSizeMB" : 16,
    "dataFileVersion" : {
        "major" : 4,
        "minor" : 5
    },
    "ok" : 1
}
Indexes: userId and _id
We are trying to select the latest 21 notifications for one user.
db.userNotification.find({ "userId" : 53 }).limit(21).sort({ "_id" : -1 });
but this query is taking too much time.
Fri Apr 26 05:39:55.563 [conn156] query UserNotificationDev2.userNotification query: { query: { userId: 53 }, orderby: { _id: -1 } } cursorid:225321382318166794 ntoreturn:21 ntoskip:0 nscanned:266025 keyUpdates:0 numYields: 2 locks(micros) r:4224498 nreturned:21 reslen:10295 2581ms
Even count is taking a huge amount of time.
Fri Apr 26 05:47:46.005 [conn159] command UserNotificationDev2.$cmd command: { count: "userNotification", query: { userId: 53 } } ntoreturn:1 keyUpdates:0 numYields: 11 locks(micros) r:9753890 reslen:48 5022ms
Are we doing something wrong in the query?
Please help!
Also, suggest whether our schema is incorrect for storing user notifications. We tried embedding notifications (a user document with its notifications nested under it), but the document size limit restricted us to storing only ~50k notifications per user, so we changed to this.
You are querying by userId but not indexing it anywhere. My suggestion is to create an index on { "userId" : 1, "_id" : -1 }. This creates an index tree that starts with userId, then _id, which is almost exactly what your query does. This is the simplest and most flexible way to speed up your query.
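A minimal sketch of that suggestion, using the collection name from the question (ensureIndex was the shell method of that MongoDB era):

```javascript
// Compound index: equality on userId first, then _id descending,
// so the sort({ _id: -1 }) is served directly from the index.
db.userNotification.ensureIndex({ userId: 1, _id: -1 })

// The original query now uses the index for both the match and the sort.
db.userNotification.find({ userId: 53 }).sort({ _id: -1 }).limit(21)
```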
Another, more memory-efficient approach is to store your userId and a timestamp as a string in _id, like _id : "USER_ID:DATETIME". Ex:
{_id : "12345:20120501123000"}
{_id : "15897:20120501124000"}
{_id : "15897:20120501125000"}
Notice that _id is a string, not an ObjectId. Then your query above becomes a regex:
db.userNotification.find({ "_id" : /^53:/ }).limit(21).sort({ "_id" : -1 });
As expected, this will return all notifications for userId 53 in descending order. The memory efficiency is two-fold:
You only need one index field. (Indexes compete with data for memory and are often several gigs in size)
If your queries mostly fetch newer data, right-balanced indexes keep your working set in memory even when the indexes are too large to fit in their entirety.
Re: count. Count does take time because it scans through the entire collection.
Re: your schema. I'm guessing for your data set this is the best way to utilize your memory. When objects get large and your queries scan across multiple objects, they need to be loaded into memory in their entirety (I've had the OOM killer kill my mongod instance when I sorted 2000 2 MB objects on a 2 GB RAM machine). With large objects your RAM usage will fluctuate greatly (not to mention they are limited up to a point). With your current schema, Mongo will have a much easier time loading only the data you're querying, resulting in less swapping and more consistent memory usage patterns.
One option is to try sharding; then you can distribute notifications evenly between shards, so when you need to select you will scan a smaller subset of data. You need to decide, however, what your shard key will be. To me it looks like operationType or userName, but I do not know your data well enough. Another thing: why do you sort by _id?
I have just tried to replicate your problem: I created 140,000,000 inserts in userNotifications.
Without an index on userId I got response times of 3-4 seconds. After I created an index on userId, the time dropped to almost instant responses.
db.userNotifications.getIndexes()
[
    {
        "v" : 1,
        "key" : {
            "_id" : 1
        },
        "ns" : "test.userNotifications",
        "name" : "id"
    },
    {
        "v" : 1,
        "key" : {
            "userId" : 1
        },
        "ns" : "test.userNotifications",
        "name" : "userId_1"
    }
]
Another thing: while your selects are happening, is the system constantly writing to the userNotification collection? Mongo locks the whole collection if that happens. If that is the case, I would split reads and writes between master and slave (see replication) and also do some sharding. By the way, what language do you use for your app?
The most important thing is that you currently don't seem to have an index to support the query for user's latest notifications.
You need a compound index on userId, _id. This will support queries that only filter by userId, but it is also used by queries that filter by userId and sort/limit by _id.
When you add the {userId: 1, _id: -1} index, don't forget to drop the index on just userId, as it becomes redundant.
As far as count() goes, make sure you are using 2.4.3 (the latest version at the time of writing); there were significant improvements in how count() uses indexes, which resulted in much better performance.
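A sketch of that index housekeeping in the shell (collection name from the question):

```javascript
// Add the compound index, then drop the now-redundant single-field index.
db.userNotification.ensureIndex({ userId: 1, _id: -1 })
db.userNotification.dropIndex({ userId: 1 })
```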