Execution plan of aggregate query in MongoTempate with Spring data mongodb - mongodb

I am using spring data mongodb, in that want aggregation query to implement that I am using MongoTemplate with aggregation method. When I trace the log it shows the query as follows:
find: track.$cmd { "aggregate" : "stayRecord" , "pipeline" : [ { "$match" : { "vehicleId" : { "$all" : [ 10]}}} , { "$match" : { "stayTime" : { "$gte" : { "$date" : "2016-06-20T18:30:00.000Z"}}}} , { "$match" : { "stayTime" : { "$lt" : { "$date" : "2016-06-21T18:30:00.000Z"}}}} , { "$group" : { "_id" : "$stayTime" , "count" : { "$sum" : 1}}}}
I want to know execution plan for this query.
How can I find out if my indexes are used during that query?

Please note that in order to follow the below steps, you need the working aggregate query which mongo shell can understand.
Follow the below steps:-
1) Go to mongo shell
2) Execute the use command to switch to your database
use <database name>
3) Execute the below query. I hope the aggregate query mentioned in the thread is syntactically correct. Also, please change the collection name accordingly in the below syntax.
db.yourCollectionName.explain().aggregate({ "stayRecord" , "pipeline" : [ { "$match" : { "vehicleId" : { "$all" : [ 10]}}} , { "$match" : { "stayTime" : { "$gte" : { "$date" : "2016-06-20T18:30:00.000Z"}}}} , { "$match" : { "stayTime" : { "$lt" : { "$date" : "2016-06-21T18:30:00.000Z"}}}} , { "$group" : { "_id" : "$stayTime" , "count" : { "$sum" : 1}}}});
2) In the output, please find the "winningPlan" element. In the input stage ("inputStage") attribute, if the query used index it will show you the value as "IXSCAN" and the index name if the query used index. Otherwise, it would show "COLLSCAN" which means the query used the collection scan (i.e. index is not used).
"winningPlan" : {
"stage" : "LIMIT",
"limitAmount" : 0,
"inputStage" : {
"stage" : "SKIP",
"skipAmount" : 0,
"inputStage" : {
"stage" : "FETCH",
"filter" : {
"user.followers_count" : {
"$gt" : 1000
}
},
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"created_at" : -1
},
"indexName" : "created_at_-1",
"isMultiKey" : false,
"direction" : "backward",
"indexBounds" : {
"created_at" : [
"[MinKey, MaxKey]"
]
}
}
}
}
}

Related

MongoDB Aggregation is slow on Indexed fields

I have a collection with ~2.5m documents, the collection size is 14,1GB, storage size 4.2GB and average object size 5,8KB. I created two separate indexes on two of the fields dataSourceName and version (text fields) and tried to make an aggregate query to list their 'grouped by' values.
(Trying to achieve this: select dsn, v from collection group by dsn, v).
db.getCollection("the-collection").aggregate(
[
{
"$group" : {
"_id" : {
"dataSourceName" : "$dataSourceName",
"version" : "$version"
}
}
}
],
{
"allowDiskUse" : false
}
);
Even though MongoDB eats ~10GB RAM on the server, the fields are indexed and nothing else is running at all, the aggregation takes ~40 seconds.
I tried to make a new index, which contains both fields in order, but still, the query does not seem to use the index:
{
"stages" : [
{
"$cursor" : {
"query" : {
},
"fields" : {
"dataSourceName" : NumberInt(1),
"version" : NumberInt(1),
"_id" : NumberInt(0)
},
"queryPlanner" : {
"plannerVersion" : NumberInt(1),
"namespace" : "db.the-collection",
"indexFilterSet" : false,
"parsedQuery" : {
},
"winningPlan" : {
"stage" : "COLLSCAN",
"direction" : "forward"
},
"rejectedPlans" : [
]
}
}
},
{
"$group" : {
"_id" : {
"dataSourceName" : "$dataSourceName",
"version" : "$version"
}
}
}
],
"ok" : 1.0
}
I am using MongoDB 3.6.5 64bit on Windows, so it should use the indexes: https://docs.mongodb.com/master/core/aggregation-pipeline/#pipeline-operators-and-indexes
As #Alex-Blex suggested, I tried it with sorting, but I an get OOM error:
The following error occurred while attempting to execute the aggregate query
Mongo Server error (MongoCommandException): Command failed with error 16819: 'Sort exceeded memory limit of 104857600 bytes, but did not opt in to external sorting. Aborting operation. Pass allowDiskUse:true to opt in.' on server server-address:port.
The full response is:
{
"ok" : 0.0,
"errmsg" : "Sort exceeded memory limit of 104857600 bytes, but did not opt in to external sorting. Aborting operation. Pass allowDiskUse:true to opt in.",
"code" : NumberInt(16819),
"codeName" : "Location16819"
}
My bad, I tried it on the wrong collection... Adding the same sort as the index works, now it is using the index. Still not fast thought, took ~10 seconds to give me the results.
The new exaplain:
{
"stages" : [
{
"$cursor" : {
"query" : {
},
"sort" : {
"dataSourceName" : NumberInt(1),
"version" : NumberInt(1)
},
"fields" : {
"dataSourceName" : NumberInt(1),
"version" : NumberInt(1),
"_id" : NumberInt(0)
},
"queryPlanner" : {
"plannerVersion" : NumberInt(1),
"namespace" : "....",
"indexFilterSet" : false,
"parsedQuery" : {
},
"winningPlan" : {
"stage" : "PROJECTION",
"transformBy" : {
"dataSourceName" : NumberInt(1),
"version" : NumberInt(1),
"_id" : NumberInt(0)
},
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"dataSourceName" : NumberInt(1),
"version" : NumberInt(1)
},
"indexName" : "dataSourceName_1_version_1",
"isMultiKey" : false,
"multiKeyPaths" : {
"dataSourceName" : [
],
"version" : [
]
},
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : NumberInt(2),
"direction" : "forward",
"indexBounds" : {
"dataSourceName" : [
"[MinKey, MaxKey]"
],
"version" : [
"[MinKey, MaxKey]"
]
}
}
},
"rejectedPlans" : [
]
}
}
},
{
"$group" : {
"_id" : {
"dataSourceName" : "$dataSourceName",
"version" : "$version"
}
}
}
],
"ok" : 1.0
}
The page you are referring to says exactly opposite:
The $match and $sort pipeline operators can take advantage of an index
Your first stage is $group, which is neither $match nor $sort.
Try to sort it on the first stage to trigger use of the index:
db.getCollection("the-collection").aggregate(
[
{ $sort: { dataSourceName:1, version:1 } },
{
"$group" : {
"_id" : {
"dataSourceName" : "$dataSourceName",
"version" : "$version"
}
}
}
],
{
"allowDiskUse" : false
}
);
Please note, it should be a single compound index with the same fields and sorting:
db.getCollection("the-collection").createIndex({ dataSourceName:1, version:1 })

Optimise MongoDB aggregate query

I have a collection with millions of documents, each document represent an event: {_id, product, timestamp}
In my query, I need to group by product and take the top 10 for example.
"aggregate" : "product_events",
"pipeline" : [
{
"$match" : {
"timeEvent" : {
"$gt" : ISODate("2017-07-17T00:00:00Z")
}
}
},
{
"$group" : {
"_id" : "$product",
"count" : {
"$sum" : 1
}
}
},
{
"$sort" : {
"count" : -1
}
},
{
"$limit" : 10
}
]
My query is very slow now (10 seconds), I am wondering if there is a way to store data differently to optimise this query?
db.product_events.explain("executionStats").aggregate([ {"$match" :
{"timeEvent" : {"$gt" : ISODate("2017-07-17T00:00:00Z")}}},{"$group" :
{"_id" : "$product","count" : {"$sum" : 1}}}, {"$project": {"_id": 1,
"count": 1}} , {"$sort" : {"count" : -1}},{"$limit" : 500}],
{"allowDiskUse": true})
{
"stages" : [
{
"$cursor" : {
"query" : {
"timeEvent" : {
"$gt" : ISODate("2017-07-17T00:00:00Z")
}
},
"fields" : {
"product" : 1,
"_id" : 0
},
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "mydb.product_events",
"indexFilterSet" : false,
"parsedQuery" : {
"timeEvent" : {
"$gt" : ISODate("2017-07-17T00:00:00Z")
}
},
"winningPlan" : {
"stage" : "COLLSCAN",
"filter" : {
"timeEvent" : {
"$gt" : ISODate("2017-07-17T00:00:00Z")
}
},
"direction" : "forward"
},
"rejectedPlans" : [ ]
},
"executionStats" : {
"executionSuccess" : true,
"nReturned" : 2127315,
"executionTimeMillis" : 940,
"totalKeysExamined" : 0,
"totalDocsExamined" : 2127315,
"executionStages" : {
"stage" : "COLLSCAN",
"filter" : {
"timeEvent" : {
"$gt" : ISODate("2017-07-17T00:00:00Z")
}
},
"nReturned" : 2127315,
"executionTimeMillisEstimate" : 810,
"works" : 2127317,
"advanced" : 2127315,
"needTime" : 1,
"needYield" : 0,
"saveState" : 16620,
"restoreState" : 16620,
"isEOF" : 1,
"invalidates" : 0,
"direction" : "forward",
"docsExamined" : 2127315
}
}
}
},
{
"$group" : {
"_id" : "$product",
"count" : {
"$sum" : {
"$const" : 1
}
}
}
},
{
"$project" : {
"_id" : true,
"count" : true
}
},
{
"$sort" : {
"sortKey" : {
"count" : -1
},
"limit" : NumberLong(500)
}
}
],
"ok" : 1
}
Below my indexes
db.product_events.getIndexes()
[
{
"v" : 2,
"key" : {
"_id" : 1
},
"name" : "_id_",
"ns" : "mydb.product_events"
},
{
"v" : 2,
"key" : {
"product" : 1,
"timeEvent" : -1
},
"name" : "product_1_timeEvent_-1",
"ns" : "mydb.product_events"
}
]
Creating indexes on fields of a collection aids into optimising process of data retrieval from database collections.
Indexes are generally created on fields into which data are filtered according to specific criteria.
Data contained into indexed fields are sorted in specific order and while fetching data once match is found ,scanning of other document stops which makes process of fetching data faster.
According to description as mentioned into above question to optimise performance of aggregate query please try creating an index on timeEvent field as timeEvent field is used as a filter expression into $match stage of aggregation pipeline.
The documentation on compound indexes states the following.
db.products.createIndex( { "item": 1, "stock": 1 } )
The order of the fields listed in a compound index is important. The
index will contain references to documents sorted first by the values
of the item field and, within each value of the item field, sorted by
values of the stock field.
In addition to supporting queries that match on all the index fields,
compound indexes can support queries that match on the prefix of the
index fields. That is, the index supports queries on the item field as
well as both item and stock fields.
Your product_1_timeEvent_-1 index looks like this:
{
"product" : 1,
"timeEvent" : -1
}
which is why it cannot be used to support a query that only filters on timeEvent.
Options you have to get that sorted:
Flip the order of the fields in your index
Remove the product field from your index
Create an additional index with only the timeEvent field in it.
(Include some additional filter on the product field so the existing index gets used)
And keep in mind that any creation/deletion/modification of an index may impact other queries, too. So make sure you test your changes properly.

MongoDB performs slow query on sum() based on monthly groups

This is What I tried so far on aggregated query:
db.getCollection('storage').aggregate([
{
"$match": {
"user_id": 2
}
},
{
"$project": {
"formattedDate": {
"$dateToString": { "format": "%Y-%m", "date": "$created_on" }
},
"size": "$size"
}
},
{ "$group": {
"_id" : "$formattedDate",
"size" : { "$sum": "$size" }
} }
])
This is the result:
/* 1 */
{
"_id" : "2018-02",
"size" : NumberLong(10860595386)
}
/* 2 */
{
"_id" : "2017-12",
"size" : NumberLong(524288)
}
/* 3 */
{
"_id" : "2018-01",
"size" : NumberLong(21587971)
}
And this is the document structure:
{
"_id" : ObjectId("5a59efedd006b9036159e708"),
"user_id" : NumberLong(2),
"is_transferred" : false,
"is_active" : false,
"process_id" : NumberLong(0),
"ratio" : 0.000125759169459343,
"type_id" : 201,
"size" : NumberLong(1687911),
"is_processed" : false,
"created_on" : ISODate("2018-01-13T11:39:25.000Z"),
"processed_on" : ISODate("1970-01-01T00:00:00.000Z")
}
And last, the explain result:
/* 1 */
{
"stages" : [
{
"$cursor" : {
"query" : {
"user_id" : 2.0
},
"fields" : {
"created_on" : 1,
"size" : 1,
"_id" : 1
},
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "data.storage",
"indexFilterSet" : false,
"parsedQuery" : {
"user_id" : {
"$eq" : 2.0
}
},
"winningPlan" : {
"stage" : "FETCH",
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"user_id" : 1
},
"indexName" : "user_id",
"isMultiKey" : false,
"multiKeyPaths" : {
"user_id" : []
},
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 2,
"direction" : "forward",
"indexBounds" : {
"user_id" : [
"[2.0, 2.0]"
]
}
}
},
"rejectedPlans" : []
}
}
},
{
"$project" : {
"_id" : true,
"formattedDate" : {
"$dateToString" : {
"format" : "%Y-%m",
"date" : "$created_on"
}
},
"size" : "$size"
}
},
{
"$group" : {
"_id" : "$formattedDate",
"size" : {
"$sum" : "$size"
}
}
}
],
"ok" : 1.0
}
The problem:
I can navigate and get all results in almost instantly like in 0,002sec. However, when I specify user_id and sum them by grouping on each month, My result came in between 0,300s to 0,560s. I do similar tasks in one request and it becaomes more than a second to finish.
What I tried so far:
I've added an index for user_id
I've added an index for created_on
I used more $match conditions. However, This makes even worse.
This collection have almost 200,000 documents in it currently and approximately 150,000 of them are belongs to user_id = 2
How can I minimize the response time for this query?
Note: MongoDB 3.4.10 used.
Pratha,
try to add sort on "created_on" and "size" fields as the first stage in aggregation pipeline.
db.getCollection('storage').aggregate([
{
"$sort": {
"created_on": 1, "size": 1
}
}, ....
Before that, add compound key index:
db.getCollection('storage').createIndex({created_on:1,size:1})
If you sort data before the $group stage, it will improve the efficiency of accumulation of the totals.
Note about sort aggregation stage:
The $sort stage has a limit of 100 megabytes of RAM. By default, if the stage exceeds this limit, $sort will produce an error. To allow for the handling of large datasets, set the allowDiskUse option to true to enable $sort operations to write to temporary files.
P.S
get rid of match stage by userID to test performance, or add userID to compound key also.

Should mongodb insert data ordered according with their indexes?

I would like to know if mongodb should re-order data after insert data according with the indexes configured previously, for instance:
After insert data according with sequence bellow:
db.test.insert(_id:1)
db.test.insert(_id:5)
db.test.insert(_id:2)
And executing the following search:
db.test.find();
We can see the result:
{ "_id" : 1 }
{ "_id" : 5 }
{ "_id" : 3 }
As we know, the field _id by default has a index, the question here is why after executing search the results are not return in sequence as presented bellow?
{ "_id" : 1 }
{ "_id" : 3 }
{ "_id" : 5 }
MongoDB indexes are separate data structures that contain the portion of the collection being indexed. The index does not dictate the order of documents in storage during insert.
Indexes are only used for sorting when a sort order is specified in the search:
db.test.find().sort({ _id: 1 })
{ "_id" : 1 }
{ "_id" : 2 }
{ "_id" : 5 }
You can verify the index usage using explain:
db.test.explain().find().sort({_id:1})
{
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "so.test",
"indexFilterSet" : false,
"parsedQuery" : {
"$and" : [ ]
},
"winningPlan" : {
"stage" : "FETCH",
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"_id" : 1
},
"indexName" : "_id_",
"isMultiKey" : false,
"direction" : "forward",
"indexBounds" : {
"_id" : [
"[MinKey, MaxKey]"
]
}
}
},
"rejectedPlans" : [ ]
},
"serverInfo" : {
"host" : "KSA303096TX2",
"port" : 27017,
"version" : "3.0.11",
"gitVersion" : "48f8b49dc30cc2485c6c1f3db31b723258fcbf39 modules: enterprise"
},
"ok" : 1
}

execStats is always empty in MongoDB "aggregate" commands profiling results

I am trying to profile the performance of an aggregation pipeline, specifically checking whether indices are used, how many objects are scanned, etc.
I'm setting the DB to full profiling:
db.setProfilingLevel(2)
But then in the db's 'system.profile' collection, in the result record for the aggregation command, the execStats is always empty.
Here is the full result for the command:
{
"op" : "command",
"ns" : "mydb.$cmd",
"command" : {
"aggregate" : "mycolection",
"pipeline" : [{
"$match" : {
"date" : {
"$gte" : "2013-11-26"
}
}
}, {
"$sort" : {
"user_id" : 1
}
}, {
"$project" : {
"user_id" : 1,
"_id" : 0
}
}, {
"$group" : {
"_id" : "$user_id",
"agg_val" : {
"$sum" : 1
}
}
}],
"allowDiskUse" : true
},
"keyUpdates" : 0,
"numYield" : 16,
"lockStats" : {
"timeLockedMicros" : {
"r" : NumberLong(3143653),
"w" : NumberLong(0)
},
"timeAcquiringMicros" : {
"r" : NumberLong(140),
"w" : NumberLong(3)
}
},
"responseLength" : 4990,
"millis" : 3237,
"execStats" : { },
"ts" : ISODate("2014-11-26T16:20:59.576Z"),
"client" : "127.0.0.1",
"allUsers" : [],
"user" : ""
}
Support execStats for aggregation command was added in mongo 3.4.