Query performing faster without the index - mongodb

Below is a simplified version of a document in my database:
{
    _id: 1,
    main_data: 100,
    sub_docs: [
        { _id: "a", data: 22 },
        { _id: "b", data: 859 },
        { _id: "c", data: 151 },
        ... snip ...
        { _id: "m", data: 721 },
        { _id: "n", data: 111 }
    ]
}
So imagine I have a million of these documents with varied data values (say 0 - 1000). Currently my query is something like:
db.myDb.find(
{ sub_docs: { $elemMatch: { data: { $gte: 110, $lt: 160 } } } }
)
Also say the query above will only match around 0.001% of the data (so around 10 documents are returned in total).
And I have an index set using:
db.myDb.ensureIndex({ "sub_docs.data": 1 })
Performing a timed test on this data seems to show it's quicker without any index set on sub_docs.data.
I'm using Mongo 3.2.8.
Edit - Additional information:
My timed test is a Perl script which queries the server and then pulls back the relevant data. I ran this test first with the index enabled; however, the slow query times forced me to do a bit of digging. I wanted to see how much worse the query times would get if I dropped the index; instead, dropping it improved the response time!
I went a bit further and plotted the query response time against the total number of documents in the DB. Both graphs show a linear increase in query time, but the query with the index increases at a much faster rate.
Throughout the testing I've been keeping an eye on server memory usage (which is low), as my first thought was that the index doesn't fit in memory.
So overall my question is: why does this particular query perform better without an index?
And is there any way to improve the speed of this query with a better index?
Update
Ok so it's been a while and I've narrowed it down to the index not constraining both sides of the query search parameters.
The query above will show an index bound of:
[-inf, 160]
Rather than 110 to 160.
I can resolve this problem by using the index min and max functions as follows:
db.myDb.find(
    { sub_docs: { $elemMatch: { data: { $gte: 110, $lt: 160 } } } }
).min({ 'sub_docs.data': 110 }).max({ 'sub_docs.data': 160 })
However (if possible) I would prefer a different way of doing this, as I would like to make use of the aggregation framework (which doesn't seem to support the min/max index functions).

Ok so I managed to sort this in the end. For whatever reason the index doesn't limit the query as I expected.
Running this:
db.myDb.find({ sub_docs: { $elemMatch: { data: { $gte: 110, $lt: 160 } } } }).explain()
Snippet of what the index is doing is below:
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"sub_docs.data" : 1
},
"indexName" : "sub_docs.data_1",
"isMultiKey" : true,
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 1,
"direction" : "forward",
"indexBounds" : {
"sub_docs.data" : [
"[-inf.0, 160.0)"
]
}
}
Instead of limiting the index scan to between 110 and 160, it's scanning every index key less than 160.
I've not included it, but the other (rejected) plan was an index scan from 110 to +inf.
You can work around this issue with the min/max limits I mention above; however, that means you can't use the aggregation framework, which sucks.
So the solution I found was to pull out all the data I wanted to index on into an array:
{
    _id: 1,
    main_data: 100,
    index_values: [ 22, 859, 151, ...snip..., 721, 111 ],
    sub_docs: [
        { _id: "a", data: 22 },
        { _id: "b", data: 859 },
        { _id: "c", data: 151 },
        ... snip ...
        { _id: "m", data: 721 },
        { _id: "n", data: 111 }
    ]
}
And then I create the index:
db.myDb.ensureIndex({index_values : 1})
And then query on that instead:
db.myDb.find({ index_values : { $elemMatch: { $gte: 110, $lt: 160 } } }).explain()
Which produces:
"indexBounds" : {
"index_values" : [
"[110.0, 160.0]"
]
}
So there are far fewer documents to check now!
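One practical note on this workaround (my own addition, not part of the original fix): index_values duplicates the data fields of sub_docs, so both arrays have to be kept in sync on every write. A minimal sketch, assuming the document shape above (the new sub-document "o" and the value 305 are made up for illustration):

// Push a new sub-document and mirror its data value into index_values
// within the same update, so the { index_values: 1 } index stays consistent.
db.myDb.updateOne(
    { _id: 1 },
    {
        $push: {
            sub_docs: { _id: "o", data: 305 },
            index_values: 305
        }
    }
)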

Related

How to improve aggregate pipeline

I have this pipeline:
[
    { '$match': { templateId: ObjectId('blabla') } },
    { "$sort": { "_id": 1 } },
    {
        "$facet": {
            "paginatedResult": [
                { "$skip": 0 },
                { "$limit": 100 }
            ],
            "totalCount": [
                { "$count": "count" }
            ]
        }
    }
]
Index:
"key" : {
"templateId" : 1,
"_id" : 1
}
The collection has 10.6M documents; 500k of them have the needed templateId.
The aggregation uses the index:
"planSummary" : "IXSCAN { templateId: 1, _id: 1 }",
But the request takes 16 seconds. What did I do wrong? How can I speed it up?
For a start, you should get rid of the $sort stage. The documents are already sorted by _id, since the { templateId: 1, _id: 1 } index guarantees that order within a given templateId; the $sort just re-sorts 500k documents that are already in order.
Next, you shouldn't use the $skip approach. For high page numbers you will skip large numbers of documents, up to almost 500k (index entries rather than documents, strictly speaking, but still).
I suggest an alternative approach:
For the first page, calculate an id that you know for sure falls before everything in the collection. Say, if you know you don't have entries dated 2019 or earlier, you can compute a starting value like this:
var pageStart = ObjectId.fromDate(new Date("2020/01/01"))
Then, your match operator should look like this:
{'$match' : {templateId:ObjectId('blabla'), _id: {$gt: pageStart}}}
For the next pages, keep track of the last document of the previous page: if the last document's _id on a given page is x, then pageStart should be x for the next page (a sketch of this loop follows after the pipeline below).
So your pipeline may look like this:
[
    { '$match': { templateId: ObjectId('blabla'), _id: { $gt: pageStart } } },
    {
        "$facet": {
            "paginatedResult": [
                { "$limit": 100 }
            ]
        }
    }
]
Note that the $skip stage is now missing from the $facet as well.
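To make the paging loop concrete, here is a rough sketch of how pages could be fetched from the shell. The collection name collection and the ObjectId('blabla') placeholder are carried over from the question, so substitute your own; this only illustrates the keyset approach:

// Start from an _id known to be older than any real document.
var pageStart = ObjectId.fromDate(new Date("2020/01/01"));

// Fetch one page of 100 documents.
var page = db.collection.aggregate([
    { "$match": { templateId: ObjectId('blabla'), _id: { $gt: pageStart } } },
    { "$facet": { "paginatedResult": [ { "$limit": 100 } ] } }
]).toArray()[0].paginatedResult;

if (page.length > 0) {
    // The last _id of this page becomes the starting point for the next page.
    pageStart = page[page.length - 1]._id;
}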

Is searching by _id in mongoDB more efficient?

In my use case, I want to search a document by a given unique string in MongoDB. However, I want my queries to be fast and searching by _id will add some overhead. I want to know if there are any benefits in MongoDB to search a document by _id over any other unique value?
To my knowledge, ObjectIds are similar to any other unique value in a document [point made for the case of searching only].
As for the overhead, you can assume I am caching the string-to-ObjectId mapping and the cache is very small and in memory [almost negligible], though the DB is large.
Analyzing your query performance
I advise you to use .explain(), provided by MongoDB, to analyze your query performance.
Let's say we are trying to execute this query
db.inventory.find( { quantity: { $gte: 100, $lte: 200 } } )
This would be the result of the query execution
{ "_id" : 2, "item" : "f2", "type" : "food", "quantity" : 100 }
{ "_id" : 3, "item" : "p1", "type" : "paper", "quantity" : 200 }
{ "_id" : 4, "item" : "p2", "type" : "paper", "quantity" : 150 }
If we call .explain() this way
db.inventory.find(
{ quantity: { $gte: 100, $lte: 200 } }
).explain("executionStats")
It will return the following result:
{
    "queryPlanner" : {
        "plannerVersion" : 1,
        ...
        "winningPlan" : {
            "stage" : "COLLSCAN",
            ...
        }
    },
    "executionStats" : {
        "executionSuccess" : true,
        "nReturned" : 3,
        "executionTimeMillis" : 0,
        "totalKeysExamined" : 0,
        "totalDocsExamined" : 10,
        "executionStages" : {
            "stage" : "COLLSCAN",
            ...
        },
        ...
    },
    ...
}
More details about this can be found in the MongoDB documentation on explain results.
How efficient is search by _id and indexes
To answer your question: using indexes is almost always more efficient. Indexes are special data structures that store a small portion of the collection's data set in an easy-to-traverse form. Since _id is indexed by default in MongoDB, searching by _id gets this benefit automatically.
Without indexes, MongoDB must perform a collection scan, i.e. scan every document in a collection, to select those documents that match the query statement.
So, YES, using indexes like _id is better!
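As a quick sanity check of that (a sketch using the inventory example above), an _id lookup should report an index-based plan rather than a COLLSCAN:

// A lookup by _id uses the built-in unique _id index; explain() shows
// an index-based winning plan (e.g. IDHACK or IXSCAN) instead of COLLSCAN.
db.inventory.find({ _id: 2 }).explain("executionStats")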
You can also create your own indexes by using createIndex()
db.collection.createIndex( <key and index type specification>, <options> )
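For the inventory query above, a minimal sketch might look like this (assuming the quantity field from the earlier example):

// Index quantity so the range query can use an IXSCAN instead of a COLLSCAN.
db.inventory.createIndex({ quantity: 1 })

// Re-running explain should now show an IXSCAN input stage with
// totalKeysExamined and totalDocsExamined close to nReturned.
db.inventory.find({ quantity: { $gte: 100, $lte: 200 } }).explain("executionStats")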
Optimize your MongoDB query
In case you want to optimize your query, there are multiple ways to do that.
Create custom indexes to support your queries
Limit the Number of Query Results to Reduce Network Demand
db.posts.find().sort( { timestamp : -1 } ).limit(10)
Use Projections to Return Only Necessary Data
db.posts.find( {}, { timestamp : 1 , title : 1 , author : 1 , abstract : 1} ).sort( { timestamp : -1 } )
Use $hint to Select a Particular Index
db.users.find().hint( { age: 1 } )
Short answer: yes, _id is the primary key and it's indexed by default, so of course it's fast.
But you can create indexes on other fields too and get more efficient queries on those fields as well.

Mongo multi-field filter query and sort - optimization

I have a records collection which has primary_id (unique), secondary_id, and status fields, among others. The ids are alphanumeric (e.g. 'ABCD0000') and the status is numeric (1-5).
One of the queries that would be frequently used is to filter by id (equality or range) and status.
examples:
records where primary_id between 'ABCD0000' - 'ABCN0000' and status is 2 or 3, sort by primary_id.
records where secondary_id between 'ABCD0000' - 'ABCD0000' and status is 2 or 3, sort by primary_id (or secondary_id if that would help).
The status in the filter will mostly be (status in (2,3)).
Initially we had a single index on each of the fields, but the query times out when the range is large. I have tried adding multiple indexes (single and compound) and different ways of writing the filter, but couldn't get decent performance. These are the indexes I have now:
[
    { primary_id: 1 },
    { secondary_id: 1 },
    { status: 1 },
    { primary_id: 1, status: 1 },
    { status: 1, primary_id: 1 },
    { status: 1, secondary_id: 1 }
]
This query (with or without sort on primary_id)
{ $and: [
{ primary_id: { $gte: 'ABCD0000' } },
{ primary_id: { $lte: 'ABCN0000' } },
{status: { $in: [2,3] } }
] }
uses the following plan:
...
"winningPlan" : {
    "stage" : "FETCH",
    "filter" : {
        "status" : {
            "$in" : [ 2, 3 ]
        }
    },
    "inputStage" : {
        "stage" : "IXSCAN",
        "keyPattern" : {
            "primary_id" : 1
        },
        "indexName" : "primary_idx",
        "isMultiKey" : false,
        "multiKeyPaths" : {
            "primary_id" : [ ]
        },
        "isUnique" : true,
        "isSparse" : false,
        "isPartial" : false,
        "indexVersion" : 2,
        "direction" : "forward",
        "indexBounds" : {
            "primary_id" : [
                "[\"ABCD0000\", \"ABCN0000\"]"
            ]
        }
    }
},
So it seems that the FETCH stage takes a long time when the number of returned documents is large. Surprisingly, while running initial tests the { status: 1, primary_id: 1 } compound index was sometimes picked as the winning plan, and that was super fast (a few seconds). But for some reason it's not been picked by Mongo anymore. I guess that when the query needs to sort by primary_id this compound index won't be picked, as I understood from the Mongo docs:
If the query does not specify an equality condition on an index prefix that precedes or overlaps with the sort specification, the operation will not efficiently use the index.
I tried to change the query as below, but that is still not optimized:
{$or: [
{ $and: [ { primary_id: { $gte: 'ABCD0000' } }, { primary_id: { $lte: 'ABCN0000' } }, { status: 2 } ]},
{ $and: [ { primary_id: { $gte: 'ABCD0000' } }, { primary_id: { $lte: 'ABCN0000' } }, { status: 3 } ]}
]}
Any suggestions on what would be a better indexing or query strategy?
I would try with two compound indexes: { primary_id: 1, status: 1 } and { secondary_id: 1, status: 1 }, as sketched below.
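In shell terms that would be something like the following (the collection name records is an assumption based on the question):

// Range scan on the id field, then filter on status from the same index
// entries before fetching the documents.
db.records.createIndex({ primary_id: 1, status: 1 })
db.records.createIndex({ secondary_id: 1, status: 1 })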
If the timeout still occurs, can you increase the query timeout value, given the large data set you are reading from?
If those indexes don't help and a good response time is expected, then you should look at hardware constraints: is your hardware good enough (read up on MongoDB's working set size)? Either scale up the server/hardware, or look at sharding if performance is really a concern and your data size is going to grow.
Or store status 2 and 3 documents in separate collections to reduce the working set size when querying for them.

Mongodb - Multi-key index on array of sub documents produces strange explain plan

If I insert the following document:
db.test.insertOne({ main_data: 100, sub_docs: [{ data: 22 }, { data: 859 }, { data: 151 }] })
And create an index on it using:
db.test.createIndex({"sub_docs.data" : 1})
When I perform a query to try and match the data using:
db.test.find({ sub_docs: { $elemMatch: { data: { $gte: 110, $lt: 160 }}}})
Why does the explain plan show the index starting from either inf.0 or -inf.0 up to one of the bounds of the $elemMatch? For example:
"indexBounds" : {
"sub_docs.data" : [
"[110.0, inf.0]"
]
}
Why aren't the bounds "[110.0, 160.0]"?
This is fixed in MongoDB 3.4:
https://jira.mongodb.org/browse/SERVER-15086

Solving "BSONObj size: 17582686 (0x10C4A5E) is invalid" when doing aggregation in MongoDB?

I'm trying to remove duplicate documents in MongoDB in a large collection according to the approach described here:
db.events.aggregate([
    { "$group": {
        "_id": { "firstId": "$firstId", "secondId": "$secondId" },
        "dups": { "$push": "$_id" },
        "count": { "$sum": 1 }
    }},
    { "$match": { "count": { "$gt": 1 } } }
], { allowDiskUse: true, cursor: { batchSize: 100 } }).forEach(function(doc) {
    doc.dups.shift();
    db.events.remove({ "_id": { "$in": doc.dups } });
});
I.e. I want to remove events that have the same "firstId - secondId" combination. However, after a while MongoDB responds with this error:
2016-11-30T14:13:57.403+0000 E QUERY [thread1] Error: getMore command failed: {
"ok" : 0,
"errmsg" : "BSONObj size: 17582686 (0x10C4A5E) is invalid. Size must be between 0 and 16793600(16MB)",
"code" : 10334
}
Is there any way to get around this? I'm using MongoDB 3.2.6.
The error message indicates that some part of the process is attempting to create a document that is larger than the 16 MB document size limit in MongoDB.
Without knowing your data set, I would guess that the size of the collection is sufficiently large that the number of unique firstId / secondId combinations is growing the result set past the document size limit.
If the size of the collection prevents finding all duplicate values in one operation, you may want to break the work up by iterating through the collection and querying for duplicate values:
db.events.find({}, { "_id" : 0, "firstId" : 1, "secondId" : 1 }).forEach(function(doc) {
    cnt = db.events.find(
        { "firstId" : doc.firstId, "secondId" : doc.secondId },
        { "_id" : 0, "firstId" : 1, "secondId" : 1 } // explicitly select only the key fields so the index can cover the query
    ).count()
    if (cnt > 1)
        print('Dupe Keys: firstId: ' + doc.firstId + ', secondId: ' + doc.secondId)
})
It's probably not the most efficient implementation, but you get the idea.
Note that this approach relies heavily upon the existence of the index { 'firstId' : 1, 'secondId' : 1 }.
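For completeness, a minimal sketch of creating that supporting index (field names taken from the snippet above):

// With this compound index the find() in the loop above projects only
// firstId and secondId, so the query can be covered by the index.
db.events.createIndex({ firstId: 1, secondId: 1 })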