I found that a count query with two parameters was taking longer than expected on our production database. I added an index (which took a few hours, this collection has over 100 million documents) that had both fields, which improved the results of my .explain() call from IXSCAN to COUNT_SCAN.
Looking at the logs now, I see that there are still tons of IXSCAN planSummaries for this count query:
2019-07-17T13:02:34.561+0000 I COMMAND [conn25293] command
DatabaseName.CollectionName command: count { count: "CollectionName",
query: { userId: "5a4f82d4e4b09d5e0cdbae15", status: "FINISHED" } }
planSummary: IXSCAN { userId: 1 } keysExamined:299 docsExamined:299
numYields:7 reslen:44 locks:{ Global: { acquireCount: { r: 16 } },
Database: { acquireCount: { r: 8 } }, Collection: { acquireCount: { r:
8 } } } protocol:op_query 124ms
There is an index on the userId field, but I don't understand why this count query isn't hitting my new index. Here's the explain results:
db.CollectionName.explain().count({ userId: "59adb07de4b00782f7620c11", status: "FINISHED" })
/* 1 */
{
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "DatabaseName.CollectionName",
"indexFilterSet" : false,
"parsedQuery" : {
"$and" : [
{
"status" : {
"$eq" : "FINISHED"
}
},
{
"userId" : {
"$eq" : "59adb07de4b00782f7620c11"
}
}
]
},
"winningPlan" : {
"stage" : "COUNT",
"inputStage" : {
"stage" : "COUNT_SCAN",
"keyPattern" : {
"userId" : 1.0,
"status" : 1.0
},
"indexName" : "idx_userId_status",
"isMultiKey" : false,
"multiKeyPaths" : {
"userId" : [],
"status" : []
},
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 1,
"indexBounds" : {
"startKey" : {
"userId" : "59adb07de4b00782f7620c11",
"status" : "FINISHED"
},
"startKeyInclusive" : true,
"endKey" : {
"userId" : "59adb07de4b00782f7620c11",
"status" : "FINISHED"
},
"endKeyInclusive" : true
}
}
},
"rejectedPlans" : []
},
"serverInfo" : {
"host" : "ip-10-114-1-8",
"port" : 27017,
"version" : "3.4.16",
"gitVersion" : "0d6a9242c11b99ddadcfb6e86a850b6ba487530a"
},
"ok" : 1.0
}
Checking the index stats, I see that it has been used quite a bit:
{
"name" : "idx_userId_status",
"key" : {
"userId" : 1.0,
"status" : 1.0
},
"host" : "ip-address:27017",
"accesses" : {
"ops" : NumberLong(541337),
"since" : ISODate("2019-07-16T14:34:25.281Z")
}
}
I'm assuming that this means it is used sometimes, but for some reason not used at other times. Why would that be the case?
As I understand query planning in MongoDB, the database keeps a cache of query plans so that it can choose a plan for a given query shape without re-evaluating every candidate each time.
My guess is that, in the IXSCAN case, the planner judged that using one index or the other would not make much of a difference.
That being said, you can still use explain(true) (or, more precisely, explain("allPlansExecution")), which executes all candidate plans. If you then compare the executionTimeMillis figures, you might see a difference that explains the planner's choice.
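To make that comparison concrete, here is a minimal sketch that tabulates the per-plan figures from an explain("allPlansExecution") result. It is plain JavaScript rather than a shell session: `summarizePlans` and `leafStage` are hypothetical helpers, the `sampleExplain` object is made up for illustration (its numbers and the index name `userId_1` are assumptions, not taken from the question), and it assumes the output exposes `executionStats.allPlansExecution` with `totalKeysExamined`, `totalDocsExamined` and `executionStages` for each candidate plan.

```javascript
// Hypothetical helper: follow inputStage links down to the leaf stage of a plan.
function leafStage(stage) {
  while (stage.inputStage) stage = stage.inputStage;
  return stage;
}

// Hypothetical helper: one summary row per candidate plan that was executed.
function summarizePlans(explainOutput) {
  const plans = explainOutput.executionStats.allPlansExecution || [];
  return plans.map((p) => {
    const leaf = leafStage(p.executionStages);
    return {
      stage: leaf.stage,
      index: leaf.indexName || null,
      keysExamined: p.totalKeysExamined,
      docsExamined: p.totalDocsExamined,
    };
  });
}

// Made-up sample shaped like an "allPlansExecution" explain result.
const sampleExplain = {
  executionStats: {
    allPlansExecution: [
      {
        totalKeysExamined: 120,
        totalDocsExamined: 0,
        executionStages: {
          stage: "COUNT",
          inputStage: { stage: "COUNT_SCAN", indexName: "idx_userId_status" },
        },
      },
      {
        totalKeysExamined: 299,
        totalDocsExamined: 299,
        executionStages: {
          stage: "COUNT",
          inputStage: {
            stage: "FETCH",
            inputStage: { stage: "IXSCAN", indexName: "userId_1" }, // assumed name
          },
        },
      },
    ],
  },
};

const summary = summarizePlans(sampleExplain);
console.log(summary);
```

In the shell, the input would come from something like db.CollectionName.explain("allPlansExecution").count({ ... }). If the planner only considered a single plan, the allPlansExecution array may be empty, in which case the helper simply returns an empty list.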
Hope this helps :)
Related
I have an aggregation query in MongoDB:
[{
$group: {
_id: '$status',
status: {
$sum: 1
}
}
}]
It is running on a collection that has ~80 million documents. The status field is indexed, yet the query is very slow and runs for around 60 seconds or more.
I did an explain() on the query, but still got almost nowhere:
{
"explainVersion" : "1",
"stages" : [
{
"$cursor" : {
"queryPlanner" : {
"namespace" : "loa.document",
"indexFilterSet" : false,
"parsedQuery" : {
},
"queryHash" : "B9878693",
"planCacheKey" : "8EAA28C6",
"maxIndexedOrSolutionsReached" : false,
"maxIndexedAndSolutionsReached" : false,
"maxScansToExplodeReached" : false,
"winningPlan" : {
"stage" : "PROJECTION_SIMPLE",
"transformBy" : {
"status" : 1,
"_id" : 0
},
"inputStage" : {
"stage" : "COLLSCAN",
"direction" : "forward"
}
},
"rejectedPlans" : [ ]
}
}
},
{
"$group" : {
"_id" : "$status",
"status" : {
"$sum" : {
"$const" : 1
}
}
}
}
],
"serverInfo" : {
"host" : "rack-compute-2",
"port" : 27017,
"version" : "5.0.6",
"gitVersion" : "212a8dbb47f07427dae194a9c75baec1d81d9259"
},
"serverParameters" : {
"internalQueryFacetBufferSizeBytes" : 104857600,
"internalQueryFacetMaxOutputDocSizeBytes" : 104857600,
"internalLookupStageIntermediateDocumentMaxSizeBytes" : 104857600,
"internalDocumentSourceGroupMaxMemoryBytes" : 104857600,
"internalQueryMaxBlockingSortMemoryUsageBytes" : 104857600,
"internalQueryProhibitBlockingMergeOnMongoS" : 0,
"internalQueryMaxAddToSetBytes" : 104857600,
"internalDocumentSourceSetWindowFieldsMaxMemoryBytes" : 104857600
},
"command" : {
"aggregate" : "document",
"pipeline" : [
{
"$group" : {
"_id" : "$status",
"status" : {
"$sum" : 1
}
}
}
],
"explain" : true,
"cursor" : {
},
"lsid" : {
"id" : UUID("a07e17fe-65ff-4d38-966f-7517b7a5d3f2")
},
"$db" : "loa"
},
"ok" : 1
}
I see that it does a full COLLSCAN, I just can't understand why.
I plan on supporting a couple hundred million (or even a billion) documents in that collection, but this problem hijacks my plans for seemingly no reason.
You can advise the query planner to use the index as follows:
db.test.explain("executionStats").aggregate(
[
{$group:{ _id:"$status" ,status:{$sum:1} }}
],
{hint:"status_1"}
)
Make sure the index name in the hint is the same as the one that was created.
(db.test.getIndexes() will show you the exact index name.)
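As a small illustration of that lookup, the sketch below picks the index name to pass as the hint from a list shaped like the output of getIndexes(). The `findIndexName` helper and the sample `indexes` array are hypothetical; only the `key` and `name` fields are assumed to be present.

```javascript
// Hypothetical helper: return the name of the index whose key pattern matches
// exactly (field order matters for index key patterns, so a string compare
// of the serialized key is sufficient for this sketch).
function findIndexName(indexes, keyPattern) {
  const wanted = JSON.stringify(keyPattern);
  const match = indexes.find((ix) => JSON.stringify(ix.key) === wanted);
  return match ? match.name : null;
}

// Illustrative data shaped like db.test.getIndexes() output.
const indexes = [
  { v: 2, key: { _id: 1 }, name: "_id_" },
  { v: 2, key: { status: 1 }, name: "status_1" },
];

const hintName = findIndexName(indexes, { status: 1 });
console.log(hintName); // status_1
```

The hint option also accepts the key pattern document itself (for example { hint: { status: 1 } }), which avoids depending on the generated name.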
I have a MongoDB 3.4 replica set with a collection "page" where all documents have a "site" field (which is an ObjectId). The "site" field has only 100 possible values. I have created an index on this field via db.page.createIndex({site:1}). There are about 3.6 million documents in the "page" collection.
Now, I see logs like this in the mongod.log file
command db.page command: count { count: "page", query: { site: { $in:
[ ObjectId('A'), ObjectId('B'), ObjectId('C'), ObjectId('D'),
ObjectId('E'), ObjectId('F'), ObjectId('G'), ObjectId('H'),
ObjectId('I'), ObjectId('J'), ObjectId('K'), ObjectId('L') ] } } }
planSummary: IXSCAN { site: 1 } keysExamined:221888
docsExamined:221881 numYields:1786 reslen:44...
I don't understand the "keysExamined:221888" -> there are only 100 possible values, so my understanding would be that I would see keysExamined:100 at most, and here I would actually expect to see "keysExamined:12". What am I missing? For info, here is an explain on the request:
PRIMARY> db.page.explain().count({ site: { $in: [ ObjectId('A'), ObjectId('F'), ObjectId('H'), ObjectId('G'), ObjectId('I'), ObjectId('B'), ObjectId('C'), ObjectId('J'), ObjectId('K'), ObjectId('D'), ObjectId('E'), ObjectId('L') ] } } )
{
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "db.page",
"indexFilterSet" : false,
"parsedQuery" : {
"site" : {
"$in" : [
ObjectId("B"),
ObjectId("C"),
ObjectId("D"),
ObjectId("E"),
ObjectId("F"),
ObjectId("A"),
ObjectId("G"),
ObjectId("H"),
ObjectId("I"),
ObjectId("J"),
ObjectId("K"),
ObjectId("L")
]
}
},
"winningPlan" : {
"stage" : "COUNT",
"inputStage" : {
"stage" : "FETCH",
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"site" : 1
},
"indexName" : "site_1",
"isMultiKey" : false,
"multiKeyPaths" : {
"site" : [ ]
},
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 2,
"direction" : "forward",
"indexBounds" : {
"site" : [
"[ObjectId('B'), ObjectId('B')]",
"[ObjectId('C'), ObjectId('C')]",
"[ObjectId('D'), ObjectId('D')]",
"[ObjectId('E'), ObjectId('E')]",
"[ObjectId('F'), ObjectId('F')]",
"[ObjectId('A'), ObjectId('A')]",
"[ObjectId('G'), ObjectId('G')]",
"[ObjectId('H'), ObjectId('H')]",
"[ObjectId('I'), ObjectId('I')]",
"[ObjectId('J'), ObjectId('J')]",
"[ObjectId('K'), ObjectId('K')]",
"[ObjectId('L'), ObjectId('L')]"
]
}
}
}
},
"rejectedPlans" : [ ]
},
"serverInfo" : {
"host" : "9a18351b5211",
"port" : 27017,
"version" : "3.4.18",
"gitVersion" : "4410706bef6463369ea2f42399e9843903b31923"
},
"ok" : 1
}
PRIMARY>
I know we are on a fairly old MongoDB version and we are planning to upgrade soon to 5.0.X (via incremental upgrade to 3.6 / 4.0 / 4.2 / 4.4). Is there a fix in the next versions to your knowledge?
After checking, I realized I had been expecting MongoDB to use counted B-trees for its indexes, but that is not the case, so MongoDB does have to walk every index entry that falls within the bounds. Details in jira.mongodb.org/plugins/servlet/mobile#issue/server-7745
Hence, at the moment, a count request runs in O(N) for N matching documents, even when an index is used.
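The effect can be sketched with a toy model in plain JavaScript. This is not MongoDB's storage code: the index is modeled as a sorted array of keys with no per-subtree counts, and `lowerBound` and `countWithIndex` are hypothetical names. The point is only that the number of keys examined tracks the number of matching entries, not the number of distinct values in the $in list.

```javascript
// Binary search: first position whose key is >= target.
function lowerBound(keys, target) {
  let lo = 0;
  let hi = keys.length;
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (keys[mid] < target) lo = mid + 1;
    else hi = mid;
  }
  return lo;
}

// Count entries equal to any of the given values by seeking to each value
// and then walking forward one key at a time (no stored counts to consult).
function countWithIndex(keys, values) {
  let count = 0;
  let keysExamined = 0;
  for (const v of values) {
    let i = lowerBound(keys, v);
    while (i < keys.length) {
      keysExamined++;
      if (keys[i] !== v) break; // walked past the end of this value's range
      count++;
      i++;
    }
  }
  return { count, keysExamined };
}

// Toy index: 1000 entries for site "A", 500 for "B", 10 for "C".
const keys = Array(1000).fill("A").concat(Array(500).fill("B"), Array(10).fill("C"));

const result = countWithIndex(keys, ["A", "B"]);
console.log(result); // { count: 1500, keysExamined: 1502 }
```

Two values are requested, yet roughly 1500 keys are examined. This has the same shape as the log line in the question, where keysExamined is slightly higher than the number of matching documents.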
I have a collection with ~2.5m documents, the collection size is 14,1GB, storage size 4.2GB and average object size 5,8KB. I created two separate indexes on two of the fields dataSourceName and version (text fields) and tried to make an aggregate query to list their 'grouped by' values.
(Trying to achieve this: select dsn, v from collection group by dsn, v).
db.getCollection("the-collection").aggregate(
[
{
"$group" : {
"_id" : {
"dataSourceName" : "$dataSourceName",
"version" : "$version"
}
}
}
],
{
"allowDiskUse" : false
}
);
Even though MongoDB eats ~10GB RAM on the server, the fields are indexed and nothing else is running at all, the aggregation takes ~40 seconds.
I tried to make a new index, which contains both fields in order, but still, the query does not seem to use the index:
{
"stages" : [
{
"$cursor" : {
"query" : {
},
"fields" : {
"dataSourceName" : NumberInt(1),
"version" : NumberInt(1),
"_id" : NumberInt(0)
},
"queryPlanner" : {
"plannerVersion" : NumberInt(1),
"namespace" : "db.the-collection",
"indexFilterSet" : false,
"parsedQuery" : {
},
"winningPlan" : {
"stage" : "COLLSCAN",
"direction" : "forward"
},
"rejectedPlans" : [
]
}
}
},
{
"$group" : {
"_id" : {
"dataSourceName" : "$dataSourceName",
"version" : "$version"
}
}
}
],
"ok" : 1.0
}
I am using MongoDB 3.6.5 64bit on Windows, so it should use the indexes: https://docs.mongodb.com/master/core/aggregation-pipeline/#pipeline-operators-and-indexes
As @Alex-Blex suggested, I tried it with sorting, but I get an OOM error:
The following error occurred while attempting to execute the aggregate query
Mongo Server error (MongoCommandException): Command failed with error 16819: 'Sort exceeded memory limit of 104857600 bytes, but did not opt in to external sorting. Aborting operation. Pass allowDiskUse:true to opt in.' on server server-address:port.
The full response is:
{
"ok" : 0.0,
"errmsg" : "Sort exceeded memory limit of 104857600 bytes, but did not opt in to external sorting. Aborting operation. Pass allowDiskUse:true to opt in.",
"code" : NumberInt(16819),
"codeName" : "Location16819"
}
My bad, I tried it on the wrong collection... Adding the same sort as the index works, and now it is using the index. It is still not fast, though: it took ~10 seconds to give me the results.
The new explain:
{
"stages" : [
{
"$cursor" : {
"query" : {
},
"sort" : {
"dataSourceName" : NumberInt(1),
"version" : NumberInt(1)
},
"fields" : {
"dataSourceName" : NumberInt(1),
"version" : NumberInt(1),
"_id" : NumberInt(0)
},
"queryPlanner" : {
"plannerVersion" : NumberInt(1),
"namespace" : "....",
"indexFilterSet" : false,
"parsedQuery" : {
},
"winningPlan" : {
"stage" : "PROJECTION",
"transformBy" : {
"dataSourceName" : NumberInt(1),
"version" : NumberInt(1),
"_id" : NumberInt(0)
},
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"dataSourceName" : NumberInt(1),
"version" : NumberInt(1)
},
"indexName" : "dataSourceName_1_version_1",
"isMultiKey" : false,
"multiKeyPaths" : {
"dataSourceName" : [
],
"version" : [
]
},
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : NumberInt(2),
"direction" : "forward",
"indexBounds" : {
"dataSourceName" : [
"[MinKey, MaxKey]"
],
"version" : [
"[MinKey, MaxKey]"
]
}
}
},
"rejectedPlans" : [
]
}
}
},
{
"$group" : {
"_id" : {
"dataSourceName" : "$dataSourceName",
"version" : "$version"
}
}
}
],
"ok" : 1.0
}
The page you are referring to says exactly the opposite:
The $match and $sort pipeline operators can take advantage of an index
Your first stage is $group, which is neither $match nor $sort.
Try to sort it on the first stage to trigger use of the index:
db.getCollection("the-collection").aggregate(
[
{ $sort: { dataSourceName:1, version:1 } },
{
"$group" : {
"_id" : {
"dataSourceName" : "$dataSourceName",
"version" : "$version"
}
}
}
],
{
"allowDiskUse" : false
}
);
Please note, it should be a single compound index with the same fields and sorting:
db.getCollection("the-collection").createIndex({ dataSourceName:1, version:1 })
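To check whether the change took effect, it is enough to look at the leaf of the winning plan inside the $cursor stage of the aggregate explain. The following is a minimal sketch in plain JavaScript; `winningLeafStage` is a hypothetical helper, and the two sample objects are trimmed-down versions of the explain outputs shown in the question.

```javascript
// Hypothetical helper: return the leaf stage name of the winning plan from an
// aggregate explain result, or null if there is no $cursor stage.
function winningLeafStage(aggExplain) {
  const cursorStage = aggExplain.stages.find((s) => s.$cursor);
  if (!cursorStage) return null;
  let plan = cursorStage.$cursor.queryPlanner.winningPlan;
  while (plan.inputStage) plan = plan.inputStage;
  return plan.stage;
}

// Trimmed explain without the $sort stage (collection scan).
const withoutSort = {
  stages: [
    { $cursor: { queryPlanner: { winningPlan: { stage: "COLLSCAN", direction: "forward" } } } },
    { $group: { _id: { dataSourceName: "$dataSourceName", version: "$version" } } },
  ],
};

// Trimmed explain with the $sort stage matching the compound index.
const withSort = {
  stages: [
    {
      $cursor: {
        queryPlanner: {
          winningPlan: {
            stage: "PROJECTION",
            inputStage: { stage: "IXSCAN", indexName: "dataSourceName_1_version_1" },
          },
        },
      },
    },
    { $group: { _id: { dataSourceName: "$dataSourceName", version: "$version" } } },
  ],
};

console.log(winningLeafStage(withoutSort)); // COLLSCAN
console.log(winningLeafStage(withSort));    // IXSCAN
```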
This is the aggregation query I have tried so far:
db.getCollection('storage').aggregate([
{
"$match": {
"user_id": 2
}
},
{
"$project": {
"formattedDate": {
"$dateToString": { "format": "%Y-%m", "date": "$created_on" }
},
"size": "$size"
}
},
{ "$group": {
"_id" : "$formattedDate",
"size" : { "$sum": "$size" }
} }
])
This is the result:
/* 1 */
{
"_id" : "2018-02",
"size" : NumberLong(10860595386)
}
/* 2 */
{
"_id" : "2017-12",
"size" : NumberLong(524288)
}
/* 3 */
{
"_id" : "2018-01",
"size" : NumberLong(21587971)
}
And this is the document structure:
{
"_id" : ObjectId("5a59efedd006b9036159e708"),
"user_id" : NumberLong(2),
"is_transferred" : false,
"is_active" : false,
"process_id" : NumberLong(0),
"ratio" : 0.000125759169459343,
"type_id" : 201,
"size" : NumberLong(1687911),
"is_processed" : false,
"created_on" : ISODate("2018-01-13T11:39:25.000Z"),
"processed_on" : ISODate("1970-01-01T00:00:00.000Z")
}
And last, the explain result:
/* 1 */
{
"stages" : [
{
"$cursor" : {
"query" : {
"user_id" : 2.0
},
"fields" : {
"created_on" : 1,
"size" : 1,
"_id" : 1
},
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "data.storage",
"indexFilterSet" : false,
"parsedQuery" : {
"user_id" : {
"$eq" : 2.0
}
},
"winningPlan" : {
"stage" : "FETCH",
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"user_id" : 1
},
"indexName" : "user_id",
"isMultiKey" : false,
"multiKeyPaths" : {
"user_id" : []
},
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 2,
"direction" : "forward",
"indexBounds" : {
"user_id" : [
"[2.0, 2.0]"
]
}
}
},
"rejectedPlans" : []
}
}
},
{
"$project" : {
"_id" : true,
"formattedDate" : {
"$dateToString" : {
"format" : "%Y-%m",
"date" : "$created_on"
}
},
"size" : "$size"
}
},
{
"$group" : {
"_id" : "$formattedDate",
"size" : {
"$sum" : "$size"
}
}
}
],
"ok" : 1.0
}
The problem:
I can navigate and get all results almost instantly, in about 0.002 s. However, when I filter by user_id and sum the sizes grouped by month, the result takes between 0.300 s and 0.560 s. I do several similar tasks in one request, and together they take more than a second to finish.
What I tried so far:
I've added an index for user_id
I've added an index for created_on
I used more $match conditions. However, this made it even worse.
This collection currently has almost 200,000 documents, and approximately 150,000 of them belong to user_id = 2.
How can I minimize the response time for this query?
Note: MongoDB 3.4.10 used.
Pratha,
try adding a sort on the "created_on" and "size" fields as the first stage of the aggregation pipeline.
db.getCollection('storage').aggregate([
{
"$sort": {
"created_on": 1, "size": 1
}
}, ....
Before that, add a compound index:
db.getCollection('storage').createIndex({created_on:1,size:1})
If you sort data before the $group stage, it will improve the efficiency of accumulation of the totals.
Note about sort aggregation stage:
The $sort stage has a limit of 100 megabytes of RAM. By default, if the stage exceeds this limit, $sort will produce an error. To allow for the handling of large datasets, set the allowDiskUse option to true to enable $sort operations to write to temporary files.
P.S
Get rid of the $match stage on user_id to test performance, or add user_id to the compound index as well.
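For the second option in the P.S., a possible shape is sketched below as plain JavaScript objects. The index key order (equality field first, then the sort fields) and the placement of $sort after $match are assumptions made for illustration, not something tested against the collection in the question; whether it actually helps should be confirmed with explain.

```javascript
// Assumed compound index: the equality field first, then the sort fields.
const indexKey = { user_id: 1, created_on: 1, size: 1 };

// The original pipeline with a $sort placed after the $match, so that a
// single index could serve both the filter and the sort.
const pipeline = [
  { $match: { user_id: 2 } },
  { $sort: { created_on: 1, size: 1 } },
  {
    $project: {
      formattedDate: { $dateToString: { format: "%Y-%m", date: "$created_on" } },
      size: "$size",
    },
  },
  { $group: { _id: "$formattedDate", size: { $sum: "$size" } } },
];

// In the mongo shell this would be used as (not executed here):
//   db.getCollection('storage').createIndex(indexKey)
//   db.getCollection('storage').aggregate(pipeline, { explain: true })
const stageNames = pipeline.map((stage) => Object.keys(stage)[0]);
console.log(stageNames.join(" -> ")); // $match -> $sort -> $project -> $group
```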
I am new to Mongo and am trying to get a distinct count of users. The fields Id and Status are not indexed individually, but there is a composite index on both fields. My current query looks something like this, where the match conditions change depending on the requirements.
DBQuery.shellBatchSize = 1000000;
db.getCollection('username').aggregate([
{$match:
{ Status: "A"
} },
{"$group" : {_id:"$Id", count:{$sum:1}}}
]);
Is there any way we can optimize this query further, or run it in parallel on the collection, so that we get results faster?
Regards
You can tune your aggregation pipelines by passing the option explain: true to the aggregate method.
db.getCollection('username').aggregate([
{$match: { Status: "A" } },
{"$group" : {_id:"$Id", count:{$sum:1}}}],
{ explain: true });
This will then output the following to work with:
{
"stages" : [
{
"$cursor" : {
"query" : {
"Status" : "A"
},
"fields" : {
"Id" : 1,
"_id" : 0
},
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "test.usernames",
"indexFilterSet" : false,
"parsedQuery" : {
"Status" : {
"$eq" : "A"
}
},
"winningPlan" : {
"stage" : "EOF"
},
"rejectedPlans" : [ ]
}
}
},
{
"$group" : {
"_id" : "$Id",
"count" : {
"$sum" : {
"$const" : 1
}
}
}
}
],
"ok" : 1
}
So, to speed up our query, we need an index to help the $match part of the pipeline. Let's create an index on Status:
> db.usernames.createIndex({Status:1})
{
"createdCollectionAutomatically" : true,
"numIndexesBefore" : 1,
"numIndexesAfter" : 2,
"ok" : 1
}
If we now run the explain again, we get the following results:
{
"stages" : [
{
"$cursor" : {
"query" : {
"Status" : "A"
},
"fields" : {
"Id" : 1,
"_id" : 0
},
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "test.usernames",
"indexFilterSet" : false,
"parsedQuery" : {
"Status" : {
"$eq" : "A"
}
},
"winningPlan" : {
"stage" : "FETCH",
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"Status" : 1
},
"indexName" : "Status_1",
"isMultiKey" : false,
"multiKeyPaths" : {
"Status" : [ ]
},
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 2,
"direction" : "forward",
"indexBounds" : {
"Status" : [
"[\"A\", \"A\"]"
]
}
}
},
"rejectedPlans" : [ ]
}
}
},
{
"$group" : {
"_id" : "$Id",
"count" : {
"$sum" : {
"$const" : 1
}
}
}
}
],
"ok" : 1
}
We can now see straight away that this is using an index.
https://docs.mongodb.com/manual/reference/explain-results/