documentdb aggregate query not using index - mongodb

I am trying to find the max of a value in a range of dates. The aggregate query I use has a $match on the indexed field _id, but the query takes too long, and the explain plan tells me it is doing a COLLSCAN and not an index scan. Can you please suggest why it won't make use of the index on _id?
Would it help if I created another index on colId?
db.collection.aggregate([   // opening call restored; the actual collection name is not given in the question
  { $match: { _id: { $regex: 'regex' } } },
  { $match: { $and: [{ "colId": 'DATA' }] } },
  { $unwind: "$data" },
  { $match: { $and: [{ "data.time": { $gte: ISODate("xyz"), $lte: ISODate("zyx") } }] } },
  { $match: { $and: [{ "data.col": { $exists: true } }] } },
  { $group: { _id: "$data.time", maxCol: { $max: "$data.col" } } },
  { $sort: { "maxCol": -1, _id: -1 } },
  { $limit: 1 }
])
Explain plan snippet:
"winningPlan" : {
"stage" : "LIMIT_SKIP",
"inputStage" : {
"stage" : "SORT",
"sortPattern" : {
"_id" : -1,
"maxCol" : -1
},
"inputStage" : {
"stage" : "SUBSCAN",
"inputStage" : {
"stage" : "HASH_AGGREGATE",
"inputStage" : {
"stage" : "SUBSCAN",
"inputStage" : {
"stage" : "PROJECTION",
"inputStage" : {
"stage" : "COLLSCAN"
}
}
}
}
}
}
This is on Amazon DocumentDB (MongoDB 4.0 compatibility).

I think the regular expression is the problem: a non-anchored $regex generally cannot use an index (only a left-anchored regex such as /^prefix/ can, and DocumentDB's regex index support may be more limited still). $match also works on array fields, so try this one:
db.collection.aggregate([
{
$match: {
"colId": 'DATA',
"data.time": { $gte: ISODate("xyz"), $lte: ISODate("zyx") },
"data.col": { $exists: true }
}
},
{ $match: { _id: { $regex: 'regex' } } },
{ $unwind: "$data" },
{ $group: { _id: "$data.time", maxCol: { $max: "$data.col" } } },
{ $sort: { "maxCol": -1, _id: -1 } },
{ $limit: 1 }
])
As a consequence, put an index on {colId: 1, "data.time": 1} or {colId: 1, "data.time": 1, "data.col": 1}.
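For reference, a minimal sketch of creating either suggested index from the shell (the collection name is a placeholder; use your own):
db.collection.createIndex({ colId: 1, "data.time": 1 })
// or the wider variant that also includes the data.col field:
db.collection.createIndex({ colId: 1, "data.time": 1, "data.col": 1 })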

Related

Getting the N documents in MongoDB before a Document ID from a Sorted Result

I have a collection in MongoDB, like the one below (shared as a Mongo Playground link in the original post).
I have sorted the collection on overview and _id:
{ $sort: { overview: 1, _id: 1 } }
which results in a sorted collection (output omitted here).
When I filter the collection to show only the documents after "subject 13.", it works as expected.
{ $match: { _id: { $gt: ObjectId('605db89d208db95eb4878556') } } }
However, when I try to get the documents before "subject 13", that is "Subject 6", with the following query, it doesn't work as I expect.
{ $match: { _id: { $lt: ObjectId('605db89d208db95eb4878556') } } }
Instead of getting just "Subject 6" in the result, I get the following (output omitted here).
I suspect this is happening because MongoDB always filters the documents before sorting, regardless of the order of the stages in the aggregation pipeline.
Please suggest a way to get the documents before a particular "_id" in MongoDB.
I have 600 documents in the collection; this is a sample dataset. My full aggregate query is below.
[
{
'$sort': {
'overview': 1,
'_id': 1
}
}, {
'$match': {
'_id': {
'$lt': new ObjectId('605db89d208db95eb4878556')
}
}
}
]
MongoDB optimizes query performance by reordering the pipeline in your case: since you have a $sort followed by a $match, the $match is moved in front of the $sort.
https://docs.mongodb.com/manual/core/aggregation-pipeline-optimization/#sort-match-sequence-optimization
When you have a sequence with $sort followed by a $match, the $match moves before the $sort to minimize the number of objects to sort. For example, if the pipeline consists of the following stages:
[
{ '$sort': { 'overview': 1, '_id': 1 } },
{ '$match': { '_id': { '$lt': new ObjectId('605db89d208db95eb4878556') } } }
]
During the optimization phase, the optimizer transforms the sequence to the following:
[
{ '$match': { '_id': { '$lt': new ObjectId('605db89d208db95eb4878556') } } },
{ '$sort': { 'overview': 1, '_id': 1 } }
]
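You can confirm the reordering yourself by running the pipeline through explain(); a minimal sketch of the call (the collection name is taken from the namespace shown in the output below):
db.data3.explain().aggregate([
  { '$sort': { 'overview': 1, '_id': 1 } },
  { '$match': { '_id': { '$lt': ObjectId('605db89d208db95eb4878556') } } }
])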
Query planner result:
We can see that the first stage is the $match query; after that, the sort is performed.
{
"stages" : [
{
"$cursor" : {
"query" : {
"_id" : {
"$lt" : ObjectId("605db89d208db95eb4878556")
}
},
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "video.data3",
"indexFilterSet" : false,
"parsedQuery" : {
"_id" : {
"$lt" : ObjectId("605db89d208db95eb4878556")
}
},
"winningPlan" : {
"stage" : "FETCH",
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"_id" : 1
},
"indexName" : "_id_",
"isMultiKey" : false,
"multiKeyPaths" : {
"_id" : []
},
"isUnique" : true,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 2,
"direction" : "forward",
"indexBounds" : {
"_id" : [
"[ObjectId('000000000000000000000000'), ObjectId('605db89d208db95eb4878556'))"
]
}
}
},
"rejectedPlans" : []
}
}
},
{
"$sort" : {
"sortKey" : {
"overview" : 1,
"_id" : 1
}
}
}
],
"ok" : 1.0
}

MongoDB Aggregation Query Optimization : match -> unwind -> match vs unwind->match

Input Data
{
"_id" : ObjectId("5dc7ac6e720a2772c7b76671"),
"idList" : [
{
"queueUpdateTimeStamp" : "2019-12-12T07:16:47.577Z",
"displayId" : "H14",
"currentQueue" : "10",
"isRejected" : true,
"isDispacthed" : true
},
{
"queueUpdateTimeStamp" : "2019-12-12T07:16:47.577Z",
"displayId" : "H14",
"currentQueue" : "10",
"isRejected" : true,
"isDispacthed" : false
}
],
"poDetailsId" : ObjectId("5dc7ac15720a2772c7b7666f"),
"processtype" : 1
}
Output Data
{
"_id" : ObjectId("5dc7ac6e720a2772c7b76671"),
"idList":
{
"queueUpdateTimeStamp" : "2019-12-12T07:16:47.577Z",
"displayId" : "H14",
"currentQueue" : "10",
"isRejected" : true,
"isDispacthed" : true
},
"poDetailsId" : ObjectId("5dc7ac15720a2772c7b7666f"),
"processtype" : 1
}
Query 1 (unwind then match)
aggregate([
{
$unwind: { path: "$idList" }
},
{
$match: { 'idList.isDispacthed': isDispatched }
}
])
Query 2 (match then unwind then match)
aggregate([
{
$match: { 'idList.isDispacthed': isDispatched }
},
{
$unwind: { path: "$idList" }
},
{
$match: { 'idList.isDispacthed': isDispatched }
}
])
My Question / My Concern
(Suppose I have a large number of documents (50k+) in this collection, and assume I have other lookups and projections after this query in the same pipeline.)
match -> unwind -> match vs. unwind -> match:
Is there any performance difference between these two queries?
Is there any other (better) way to write this query?
It all depends on the MongoDB aggregation pipeline optimizer:
Aggregation pipeline operations have an optimization phase which attempts to reshape the pipeline for improved performance.
To see how the optimizer transforms a particular aggregation pipeline, include the explain option in the db.collection.aggregate() method.
https://docs.mongodb.com/manual/core/aggregation-pipeline-optimization/
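The comparison below assumes an index on poDetailsId; a minimal sketch of creating it (the collection name simply mirrors the getCollection call used in the examples):
db.getCollection('collection').createIndex({ poDetailsId: 1 })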
With that index in place, run this query with explain():
db.getCollection('collection').explain().aggregate([
{
$unwind: "$idList"
},
{
$match: {
'idList.isDispacthed': true,
"poDetailsId" : ObjectId("5dc7ac15720a2772c7b7666f")
}
}
])
{
"stages" : [
{
"$cursor" : {
"query" : {
"poDetailsId" : {
"$eq" : ObjectId("5dc7ac15720a2772c7b7666f")
}
},
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "test.collection",
"indexFilterSet" : false,
"parsedQuery" : {
"poDetailsId" : {
"$eq" : ObjectId("5dc7ac15720a2772c7b7666f")
}
},
"queryHash" : "2CF7E390",
"planCacheKey" : "A8739F51",
"winningPlan" : {
"stage" : "FETCH",
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"poDetailsId" : 1.0
},
"indexName" : "poDetailsId_1",
"isMultiKey" : false,
"multiKeyPaths" : {
"poDetailsId" : []
},
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 2,
"direction" : "forward",
"indexBounds" : {
"poDetailsId" : [
"[ObjectId('5dc7ac15720a2772c7b7666f'), ObjectId('5dc7ac15720a2772c7b7666f')]"
]
}
}
},
"rejectedPlans" : []
}
}
},
{
"$unwind" : {
"path" : "$idList"
}
},
{
"$match" : {
"idList.isDispacthed" : {
"$eq" : true
}
}
}
],
"ok" : 1.0
}
As you can see, MongoDB will transform this aggregation into:
db.getCollection('collection').aggregate([
{
$match: { "poDetailsId" : ObjectId("5dc7ac15720a2772c7b7666f") }
},
{
$unwind: "$idList"
},
{
$match: { 'idList.isDispacthed': true }
}
])
Logically, $match -> $unwind -> $match is better, since you filter (via the index) a subset of records instead of doing a full scan (working with 100 matched documents ≠ working with all documents).
If your aggregation operation requires only a subset of the data in a collection, use the $match, $limit, and $skip stages to restrict the documents that enter at the beginning of the pipeline. When placed at the beginning of a pipeline, $match operations use suitable indexes to scan only the matching documents in a collection.
https://docs.mongodb.com/manual/core/aggregation-pipeline/#early-filtering
Once a stage reshapes your documents (such as $unwind), MongoDB can no longer apply indexes to the stages that follow.
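To illustrate that last point, here is a hedged sketch (the index and values are illustrative, not taken from the original answer): a $match placed before $unwind filters whole documents and can use a multikey index on the array field, while the same predicate placed after $unwind is evaluated in memory on the reshaped documents.
// illustrative multikey index on the embedded array field
db.getCollection('collection').createIndex({ "idList.isDispacthed": 1 })
db.getCollection('collection').aggregate([
  // can use the multikey index: filters whole documents before any reshaping
  { $match: { "idList.isDispacthed": true } },
  { $unwind: "$idList" },
  // runs in memory: documents have been reshaped by $unwind, so no index applies
  { $match: { "idList.isDispacthed": true } }
])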

MongoDB aggregate using scan for join collection?

I have two collections: one is 'image' and the other is 'map', which holds the mapping info between images and artists.
The image collection has a single index on idx, and the map collection has a compound index on (m_idx, i_idx).
I know MongoDB is not an RDBMS, but I am joining these two collections in an aggregation pipeline, and I am worried about whether, when the image collection is aggregated, MongoDB fetches the image documents by querying with the index, or selects them only after scanning the whole image collection.
I heard that aggregate joins the rows after fetching the whole collection, but I'm not sure.
Here are my query and the explain result.
Query
db.map.aggregate(
[
{
$match: {m_idx: 1111}
},
{
$lookup: {
from: "image",
localField: "i_idx",
foreignField: "idx",
as: "image_aggre"
}
},
{
$match: {
'image': {
$ne: []
}
}
},
{
$sort: {
seq: -1
}
},
{
$skip: (pageNum) * pageSize
},
{
$limit: pageSize
}
], {
explain: true
}
)
Explain Result
{
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "db.map",
"winningPlan" : {
"stage" : "LIMIT_SKIP",
"inputStage" : {
"stage" : "SORT",
"sortPattern" : {
"seq" : -1
},
"inputStage" : {
"stage" : "SUBSCAN",
"inputStage" : {
"stage" : "HASH_AGGREGATE",
"inputStage" : {
"stage" : "HASH_LOOKUP",
"inputStages" : [
{
"stage" : "COLLSCAN"
},
{
"stage" : "HASH",
"inputStage" : {
"stage" : "IXSCAN",
"indexName" : "m_idx_-1_i_idx_-1",
"direction" : "forward"
}
}
]
}
}
}
}
}
},
I'm using AWS DocumentDB, so the result can be a little different from stock MongoDB.

MongoDB aggregation explain provides data only about first stages

I'm running the following aggregation query on a test database
db.restaurants.explain().aggregate([
{$match: {"address.zipcode": {$in: ["10314", "11208", "11219"]}}},
{$match: {"grades": {$elemMatch: {score: {$gte: 1}}}}},
{$group: {_id: "$borough", count: {$sum: 1} }},
{$sort: {count: -1} }
]);
And as per the MongoDB documentation, it should return a cursor that I can iterate to see data about all pipeline stages:
The operation returns a cursor with the document that contains detailed information regarding the processing of the aggregation pipeline.
However, the aggregation command returns explain info only about the first two $match stages:
{
"stages" : [
{
"$cursor" : {
"query" : {
"$and" : [
{
"address.zipcode" : {
"$in" : [
"10314",
"11208",
"11219"
]
}
},
{
"grades" : {
"$elemMatch" : {
"score" : {
"$gte" : 1.0
}
}
}
}
]
},
"fields" : {
"borough" : 1,
"_id" : 0
},
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "test.restaurants",
"indexFilterSet" : false,
"parsedQuery" : {
"$and" : [
{
"grades" : {
"$elemMatch" : {
"score" : {
"$gte" : 1.0
}
}
}
},
{
"address.zipcode" : {
"$in" : [
"10314",
"11208",
"11219"
]
}
}
]
},
"winningPlan" : {
"stage" : "COLLSCAN",
"filter" : {
"$and" : [
{
"grades" : {
"$elemMatch" : {
"score" : {
"$gte" : 1.0
}
}
}
},
{
"address.zipcode" : {
"$in" : [
"10314",
"11208",
"11219"
]
}
}
]
},
"direction" : "forward"
},
"rejectedPlans" : []
}
}
},
{
"$group" : {
"_id" : "$borough",
"count" : {
"$sum" : {
"$const" : 1.0
}
}
}
},
{
"$sort" : {
"sortKey" : {
"count" : -1
}
}
}
],
"ok" : 1.0
}
And the object returned does not seem like a cursor at all.
If I save the aggregation result to a variable and then try to iterate through it using cursor methods (hasNext(), next(), etc.), I get the following:
TypeError: result.next is not a function : #(shell):1:1
How can I see info on all pipeline steps?
Thanks
1. Explain info
explain() returns the winning plan of a query, i.e. how the database fetches the documents before processing them in the pipeline.
Here, because address.zipcode and grades aren't indexed, the database performs a COLLSCAN, i.e. it iterates over all documents in the collection and checks whether they match.
After that, you group the documents and sort the results. Those operations are done "in memory", on the previously fetched documents. The fields aren't indexed, so no special plan can be used here.
More info here: explain results
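If you want the initial $cursor stage to use an IXSCAN instead of a COLLSCAN, you could index the matched fields; a hedged sketch (these index shapes are illustrative, not part of the original answer; the planner will typically pick one of them for the first stage):
db.restaurants.createIndex({ "address.zipcode": 1 })
db.restaurants.createIndex({ "grades.score": 1 })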
2. Explain() on aggregation query does not return a cursor
For some reason, explain() on an aggregation query does not return a cursor, but a BSON document directly (unlike explain() on a find() query).
It might be a bug, but there's nothing about this in the docs.
Anyway, you can do:
var explain = db.restaurants.explain().aggregate([
{$match: {"address.zipcode": {$in: ["10314", "11208", "11219"]}}},
{$match: {"grades": {$elemMatch: {score: {$gte: 1}}}}},
{$group: {_id: "$borough", count: {$sum: 1} }},
{$sort: {count: -1} }
]);
printjson(explain)
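Depending on the server version, you can also request a more verbose explain; a sketch (on older servers, aggregation explain may ignore the verbosity and return only the queryPlanner information):
var explain = db.restaurants.explain("executionStats").aggregate([
  {$match: {"address.zipcode": {$in: ["10314", "11208", "11219"]}}},
  {$match: {"grades": {$elemMatch: {score: {$gte: 1}}}}},
  {$group: {_id: "$borough", count: {$sum: 1} }},
  {$sort: {count: -1} }
])
printjson(explain)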

filter in aggregation and groupby to get values from given time interval

In the aggregation query below, I want to add a filter with $gt and $lt.
The Mongo document:
{
"_id" : ObjectId("599eb4fae0f86361c36b1c91"),
"device_id" : ObjectId("5993df1b9a5fea3183064e49"),
"updatedAt" : ISODate("2017-08-24T11:38:12.135Z"),
"power_data" : [
{
"timestamp" : ISODate("2017-08-24T11:14:04.256Z"),
"_id" : ObjectId("599eb4fdeea8c69622751de3"),
"pfactor" : 1,
"rpower" : 0,
"voltage" : 0,
"current" : 0,
"energy" : 0,
"power" : 0
},
{
"timestamp" : ISODate("2017-08-24T11:14:04.256Z"),
"_id" : ObjectId("599eb507eea8c69622751de4"),
"pfactor" : 1,
"rpower" : 0,
"voltage" : 0,
"current" : 0,
"energy" : 0,
"power" : 0
},
{
"timestamp" : ISODate("2017-08-24T11:14:04.256Z"),
"_id" : ObjectId("599eb511eea8c69622751de5"),
"pfactor" : 1,
"rpower" : 0,
"voltage" : 0,
"current" : 0,
"energy" : 0,
"power" : 0
},
],
"__v" : 0,
"createdAt" : ISODate("2017-08-24T11:14:05.946Z")
}
And here is the query:
aggregate([{
$match: { "device_id": { $in: [ObjectId("5993df1b9a5fea3183064e49")] } }
}, {
$project: {
"power_data": 1
}
}, {
$unwind: "$power_data"
},
{
$project: {
"power": "$power_data.power", "hours": { $dayOfMonth: "$power_data.timestamp" }
}
}, {
$group: {
"_id": "$hours", "avg_power": { $avg: "$power" }
}
}])
All I want is that, by passing some date/timestamp range, I get the data for that time interval only; currently it is calculating over everything.
Thanks for your valuable time!
I believe you just need to add an additional $match stage after the $unwind.
Here's the full pipeline:
{
$match: { "device_id": { $in: [ObjectId("5993df1b9a5fea3183064e49")] } }
},
{
$project: { "power_data": 1 }
},
{
$unwind: "$power_data"
},
{
$match: {
"power_data.timestamp" : { $gt : ISODate("2017-08-23T12:00:00.000Z"), $lt: ISODate("2017-08-25T12:00:00.000Z")}
}
},
{
$project: { "power": "$power_data.power", "hours": { $dayOfMonth: "$power_data.timestamp" }
}
},
{
$group: { "_id": "$hours", "avg_power": { $avg: "$power" } }
}
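For completeness, a sketch of running the full pipeline with a concrete date range from the shell (the collection name is a placeholder, and the ISODate values are the ones used in the $match above; adjust both to your data):
var from = ISODate("2017-08-23T12:00:00.000Z")
var to = ISODate("2017-08-25T12:00:00.000Z")
db.devices.aggregate([
  { $match: { "device_id": { $in: [ObjectId("5993df1b9a5fea3183064e49")] } } },
  { $project: { "power_data": 1 } },
  { $unwind: "$power_data" },
  // keep only readings inside the requested interval
  { $match: { "power_data.timestamp": { $gt: from, $lt: to } } },
  { $project: { "power": "$power_data.power", "hours": { $dayOfMonth: "$power_data.timestamp" } } },
  { $group: { "_id": "$hours", "avg_power": { $avg: "$power" } } }
])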