MongoDB aggregate using a collection scan for a $lookup join?

I have two collections: one is 'image' and the other is 'map', which holds mapping info between images and artists.
The image collection has a single index (idx), and the map collection has a compound index (m_idx, i_idx).
I know MongoDB is not an RDBMS, but I join these two collections with an aggregation pipeline, and I'm worried about whether, during the $lookup on the image collection, MongoDB fetches each image document via its index or scans the entire image collection first.
I heard that aggregate joins rows after reading the whole collection, but I'm not sure.
I've added my query and the explain result below.
Query
db.map.aggregate(
    [
        { $match: { m_idx: 1111 } },
        {
            $lookup: {
                from: "image",
                localField: "i_idx",
                foreignField: "idx",
                as: "image_aggre"
            }
        },
        { $match: { image_aggre: { $ne: [] } } },
        { $sort: { seq: -1 } },
        { $skip: pageNum * pageSize },
        { $limit: pageSize }
    ],
    { explain: true }
)
Explain Result
{
    "queryPlanner" : {
        "plannerVersion" : 1,
        "namespace" : "db.map",
        "winningPlan" : {
            "stage" : "LIMIT_SKIP",
            "inputStage" : {
                "stage" : "SORT",
                "sortPattern" : { "seq" : -1 },
                "inputStage" : {
                    "stage" : "SUBSCAN",
                    "inputStage" : {
                        "stage" : "HASH_AGGREGATE",
                        "inputStage" : {
                            "stage" : "HASH_LOOKUP",
                            "inputStages" : [
                                { "stage" : "COLLSCAN" },
                                {
                                    "stage" : "HASH",
                                    "inputStage" : {
                                        "stage" : "IXSCAN",
                                        "indexName" : "m_idx_-1_i_idx_-1",
                                        "direction" : "forward"
                                    }
                                }
                            ]
                        }
                    }
                }
            }
        }
    }
}
I'm using AWS DocumentDB, so the result can differ a little from stock MongoDB.
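For intuition only (this is plain JavaScript, not DocumentDB internals): the COLLSCAN under HASH_LOOKUP in the plan means every image document is read to build the join, whereas index access would touch only the matching keys. A minimal sketch with hypothetical data:

```javascript
// Sketch: collection scan vs. index-style key access for the join side.
// All data below is hypothetical.
const image = [{ idx: 1 }, { idx: 2 }, { idx: 3 }, { idx: 4 }];

// COLLSCAN: every document is examined to find idx === 3.
let examined = 0;
const viaScan = image.filter(d => { examined++; return d.idx === 3; });

// "Index": a prebuilt map from idx -> document, one probe per key.
const byIdx = new Map(image.map(d => [d.idx, d]));
const viaIndex = byIdx.get(3);

console.log(examined, viaScan[0].idx, viaIndex.idx); // 4 3 3
```

Even with an index present on image.idx, DocumentDB's planner may still prefer a hash join over the whole foreign collection, which is what the plan above appears to show.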

Related

documentdb aggregate query not using index

I am trying to find the max of a value in a range of dates. The aggregate query I use has a $match on the indexed column _id, but the query takes too long, and the explain plan tells me it's doing a COLLSCAN and not an index scan. Can you please suggest why it won't make use of the index on _id?
Would it help if I created another index on colId?
db.collection.aggregate([
    { $match: { _id: { $regex: 'regex' } } },
    { $match: { $and: [{ "colId": 'DATA' }] } },
    { $unwind: "$data" },
    { $match: { $and: [{ "data.time": { $gte: ISODate("xyz"), $lte: ISODate("zyx") } }] } },
    { $match: { $and: [{ "data.col": { $exists: true } }] } },
    { $group: { _id: "$data.time", maxCol: { $max: "$data.col" } } },
    { $sort: { "maxCol": -1, _id: -1 } },
    { $limit: 1 }
])
Explain plan snippet:
"winningPlan" : {
"stage" : "LIMIT_SKIP",
"inputStage" : {
"stage" : "SORT",
"sortPattern" : {
"_id" : -1,
"maxCol" : -1
},
"inputStage" : {
"stage" : "SUBSCAN",
"inputStage" : {
"stage" : "HASH_AGGREGATE",
"inputStage" : {
"stage" : "SUBSCAN",
"inputStage" : {
"stage" : "PROJECTION",
"inputStage" : {
"stage" : "COLLSCAN"
}
}
}
}
}
}
This is on DocumentDB (mongo4)
A non-anchored regular expression generally cannot use an index; only a case-sensitive prefix expression (e.g. /^DATA/) can use index bounds. $match also works directly on array fields, so try this one:
db.collection.aggregate([
    {
        $match: {
            "colId": 'DATA',
            "data.time": { $gte: ISODate("xyz"), $lte: ISODate("zyx") },
            "data.col": { $exists: true }
        }
    },
    { $match: { _id: { $regex: 'regex' } } },
    { $unwind: "$data" },
    { $group: { _id: "$data.time", maxCol: { $max: "$data.col" } } },
    { $sort: { "maxCol": -1, _id: -1 } },
    { $limit: 1 }
])
As a consequence, create an index on {colId: 1, "data.time": 1} or {colId: 1, "data.time": 1, "data.col": 1}.
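To see why the $regex stage resists index use: a B-tree index stores keys in sorted order, so only a pattern anchored to the start of the string maps to a contiguous key range, while an unanchored pattern forces every key to be tested. A small JavaScript sketch (hypothetical keys, with a plain range filter standing in for index bounds):

```javascript
// Sketch: sorted index keys; anchored prefix = range, unanchored = full pass.
const keys = ["ALPHA", "DATA-1", "DATA-2", "ZETA"].sort();

// Anchored prefix /^DATA/: equivalent to the key range ["DATA", "DATB").
const prefixHits = keys.filter(k => k >= "DATA" && k < "DATB");

// Unanchored /DATA/: no usable bounds; every key must be tested.
const unanchoredHits = keys.filter(k => /DATA/.test(k));

console.log(prefixHits.length, unanchoredHits.length); // 2 2
```

Both return the same hits here, but the anchored form examines only the keys inside the range, which is what lets MongoDB use an IXSCAN for prefix regexes.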

Getting the N documents in MongoDB before a Document ID from a Sorted Result

I have a collection in MongoDB, like the one below.
-> Mongo Playground link
I have sorted the collection on overview and _id.
$sort: { overview: 1, _id: 1 }
which results in a collection like this.
When I filter the collection to show only the documents after "subject 13.", it works as expected.
$match: { _id: { $gt: ObjectId('605db89d208db95eb4878556') } }
however, when I try to get the documents before "subject 13", that is "Subject 6", with the following query, it doesn't work as I expect.
$match: { _id: { $lt: ObjectId('605db89d208db95eb4878556') } }
Instead of getting just "Subject 6" in the result, I get the following.
I suspect this is happening because MongoDB always filters documents before sorting, regardless of the stage order in the aggregation pipeline.
Please suggest a way to get the documents that come before a particular "_id" in the sorted result.
I have 600 documents in the collection; this is a sample dataset. My full aggregate query is below.
[
    { '$sort': { 'overview': 1, '_id': 1 } },
    { '$match': { '_id': { '$lt': new ObjectId('605db89d208db95eb4878556') } } }
]
MongoDB optimizes query performance by moving the $sort to the end in your case, because you have $sort followed by $match:
https://docs.mongodb.com/manual/core/aggregation-pipeline-optimization/#sort-match-sequence-optimization
When you have a sequence with $sort followed by a $match, the $match moves before the $sort to minimize the number of objects to sort. For example, if the pipeline consists of the following stages:
[
    { '$sort': { 'overview': 1, '_id': 1 } },
    { '$match': { '_id': { '$lt': new ObjectId('605db89d208db95eb4878556') } } }
]
During the optimization phase, the optimizer transforms the sequence to the following:
[
    { '$match': { '_id': { '$lt': new ObjectId('605db89d208db95eb4878556') } } },
    { '$sort': { 'overview': 1, '_id': 1 } }
]
Query planner result:
We can see that the first stage is the match query; the sort is performed after that.
{
    "stages" : [
        {
            "$cursor" : {
                "query" : {
                    "_id" : { "$lt" : ObjectId("605db89d208db95eb4878556") }
                },
                "queryPlanner" : {
                    "plannerVersion" : 1,
                    "namespace" : "video.data3",
                    "indexFilterSet" : false,
                    "parsedQuery" : {
                        "_id" : { "$lt" : ObjectId("605db89d208db95eb4878556") }
                    },
                    "winningPlan" : {
                        "stage" : "FETCH",
                        "inputStage" : {
                            "stage" : "IXSCAN",
                            "keyPattern" : { "_id" : 1 },
                            "indexName" : "_id_",
                            "isMultiKey" : false,
                            "multiKeyPaths" : { "_id" : [] },
                            "isUnique" : true,
                            "isSparse" : false,
                            "isPartial" : false,
                            "indexVersion" : 2,
                            "direction" : "forward",
                            "indexBounds" : {
                                "_id" : [
                                    "[ObjectId('000000000000000000000000'), ObjectId('605db89d208db95eb4878556'))"
                                ]
                            }
                        }
                    },
                    "rejectedPlans" : []
                }
            }
        },
        {
            "$sort" : {
                "sortKey" : { "overview" : 1, "_id" : 1 }
            }
        }
    ],
    "ok" : 1.0
}

MongoDB inconsistent aggregate call between queries

I have two collections, videos and youtubes. I want to do a $lookup from videos.youtube to youtubes._id and then $match on a youtubes field. This is working fine, but there are some huge inconsistencies between queries that should be identical in nature, or at the very least close to it.
Query 1: returns 8261 documents. Takes [40, 50]ms to execute
db.getCollection('videos').aggregate([
    { '$sort': { date: -1 } },
    {
        '$lookup': {
            from: 'youtubes',
            localField: 'youtube',
            foreignField: '_id',
            as: 'youtube'
        }
    },
    { '$match': { 'youtube.talent': true } }
])
Query 2: returns 760 documents. Takes [470, 500]ms to execute
db.getCollection('videos').aggregate([
    { '$sort': { date: -1 } },
    {
        '$lookup': {
            from: 'youtubes',
            localField: 'youtube',
            foreignField: '_id',
            as: 'youtube'
        }
    },
    { '$match': { 'youtube.id': 7 } }
])
Query 3: returns 760 documents. Takes [90, 100]ms to execute
db.getCollection('videos').aggregate([
    // { '$sort': { date: -1 } },
    {
        '$lookup': {
            from: 'youtubes',
            localField: 'youtube',
            foreignField: '_id',
            as: 'youtube'
        }
    },
    { '$match': { 'youtube.id': 7 } }
])
All fields used in the queries are indexed. What stands out is that the $sort stage in Query 2 apparently takes roughly 400 ms to execute, yet Query 1 uses the same $sort stage in the same position in the pipeline and only takes [40, 50] ms.
I've used the { explain: true } option to look for differences between Query 1 and Query 2 that could explain the speed differences, but they are identical except for the $match portion.
Any solution/suggestions for bringing Query 2 up to speed with Query 1? Or at the very least an explanation for the huge differences in speed?
Another weird thing discovered while making this post
Query 4: returns 9378 documents. Takes [25, 35]ms to execute
db.getCollection('videos').aggregate([
    { '$sort': { date: -1 } },
    {
        '$lookup': {
            from: 'youtubes',
            localField: 'youtube',
            foreignField: '_id',
            as: 'youtube'
        }
    },
    { '$match': { 'youtube.clipper': true } }
])
Query 5: returns 9378 documents. Takes [600, 680]ms to execute
db.getCollection('videos').aggregate([
    // { '$sort': { date: -1 } },
    {
        '$lookup': {
            from: 'youtubes',
            localField: 'youtube',
            foreignField: '_id',
            as: 'youtube'
        }
    },
    { '$match': { 'youtube.clipper': true } }
])
At this point I'm stumped as to what is happening. Originally I thought it had to do with Number vs. Boolean, but as Query 4 and Query 5 show, that clearly has no impact. And it seems random.
Indexes, just in case (for youtubes):
[
    { "v" : 2, "key" : { "_id" : 1 }, "name" : "_id_", "ns" : "holo-watcher.youtubes" },
    { "v" : 2, "unique" : true, "key" : { "id" : 1 }, "name" : "id_1", "ns" : "holo-watcher.youtubes", "background" : true },
    { "v" : 2, "key" : { "name" : 1 }, "name" : "name_1", "ns" : "holo-watcher.youtubes", "background" : true },
    { "v" : 2, "unique" : true, "key" : { "channelId" : 1 }, "name" : "channelId_1", "ns" : "holo-watcher.youtubes", "background" : true },
    { "v" : 2, "key" : { "clipper" : 1 }, "name" : "clipper_1", "ns" : "holo-watcher.youtubes", "background" : true },
    { "v" : 2, "key" : { "talent" : 1 }, "name" : "talent_1", "ns" : "holo-watcher.youtubes", "background" : true },
    { "v" : 2, "key" : { "debut" : 1 }, "name" : "debut_1", "ns" : "holo-watcher.youtubes", "background" : true }
]
Indexes (for videos):
[
    { "v" : 2, "key" : { "_id" : 1 }, "name" : "_id_", "ns" : "holo-watcher.videos" },
    { "v" : 2, "unique" : true, "key" : { "videoId" : 1 }, "name" : "videoId_1", "ns" : "holo-watcher.videos", "background" : true },
    { "v" : 2, "key" : { "title" : 1 }, "name" : "title_1", "ns" : "holo-watcher.videos", "background" : true },
    { "v" : 2, "key" : { "date" : 1 }, "name" : "date_1", "ns" : "holo-watcher.videos", "background" : true }
]
{ explain: true } output for Query 5 (nearly identical to Query 1 and Query 2):
{
    "stages" : [
        {
            "$cursor" : {
                "query" : {},
                "sort" : { "date" : -1 },
                "queryPlanner" : {
                    "plannerVersion" : 1,
                    "namespace" : "holo-watcher.videos",
                    "indexFilterSet" : false,
                    "parsedQuery" : {},
                    "winningPlan" : {
                        "stage" : "FETCH",
                        "inputStage" : {
                            "stage" : "IXSCAN",
                            "keyPattern" : { "date" : 1 },
                            "indexName" : "date_1",
                            "isMultiKey" : false,
                            "multiKeyPaths" : { "date" : [] },
                            "isUnique" : false,
                            "isSparse" : false,
                            "isPartial" : false,
                            "indexVersion" : 2,
                            "direction" : "backward",
                            "indexBounds" : {
                                "date" : [ "[MaxKey, MinKey]" ]
                            }
                        }
                    },
                    "rejectedPlans" : []
                }
            }
        },
        {
            "$lookup" : {
                "from" : "youtubes",
                "as" : "youtube",
                "localField" : "youtube",
                "foreignField" : "_id"
            }
        },
        {
            "$match" : {
                "youtube.clipper" : { "$eq" : true }
            }
        }
    ],
    "ok" : 1.0
}

MongoDB Aggregation Query Optimization: match -> unwind -> match vs. unwind -> match

Input Data
{
    "_id" : ObjectId("5dc7ac6e720a2772c7b76671"),
    "idList" : [
        {
            "queueUpdateTimeStamp" : "2019-12-12T07:16:47.577Z",
            "displayId" : "H14",
            "currentQueue" : "10",
            "isRejected" : true,
            "isDispacthed" : true
        },
        {
            "queueUpdateTimeStamp" : "2019-12-12T07:16:47.577Z",
            "displayId" : "H14",
            "currentQueue" : "10",
            "isRejected" : true,
            "isDispacthed" : false
        }
    ],
    "poDetailsId" : ObjectId("5dc7ac15720a2772c7b7666f"),
    "processtype" : 1
}
Output Data
{
    "_id" : ObjectId("5dc7ac6e720a2772c7b76671"),
    "idList" : {
        "queueUpdateTimeStamp" : "2019-12-12T07:16:47.577Z",
        "displayId" : "H14",
        "currentQueue" : "10",
        "isRejected" : true,
        "isDispacthed" : true
    },
    "poDetailsId" : ObjectId("5dc7ac15720a2772c7b7666f"),
    "processtype" : 1
}
Query 1 (unwind then match)
aggregate([
    { $unwind: { path: "$idList" } },
    { $match: { 'idList.isDispacthed': isDispatched } }
])
Query 2 (match then unwind then match)
aggregate([
    { $match: { 'idList.isDispacthed': isDispatched } },
    { $unwind: { path: "$idList" } },
    { $match: { 'idList.isDispacthed': isDispatched } }
])
My Question / My Concern
(Suppose I have a large number of documents (50k+) in this collection, and assume I have other lookups and projections after this query in the same pipeline.)
match -> unwind -> match vs. unwind -> match
Is there any performance difference between these two queries?
Is there any other (better) way to write this query?
It all depends on the MongoDB query planner optimizer:
Aggregation pipeline operations have an optimization phase which attempts to reshape the pipeline for improved performance.
To see how the optimizer transforms a particular aggregation pipeline, include the explain option in the db.collection.aggregate() method.
https://docs.mongodb.com/manual/core/aggregation-pipeline-optimization/
Create an index on poDetailsId and run this query:
db.getCollection('collection').explain().aggregate([
    { $unwind: "$idList" },
    {
        $match: {
            'idList.isDispacthed': true,
            "poDetailsId" : ObjectId("5dc7ac15720a2772c7b7666f")
        }
    }
])
{
    "stages" : [
        {
            "$cursor" : {
                "query" : {
                    "poDetailsId" : { "$eq" : ObjectId("5dc7ac15720a2772c7b7666f") }
                },
                "queryPlanner" : {
                    "plannerVersion" : 1,
                    "namespace" : "test.collection",
                    "indexFilterSet" : false,
                    "parsedQuery" : {
                        "poDetailsId" : { "$eq" : ObjectId("5dc7ac15720a2772c7b7666f") }
                    },
                    "queryHash" : "2CF7E390",
                    "planCacheKey" : "A8739F51",
                    "winningPlan" : {
                        "stage" : "FETCH",
                        "inputStage" : {
                            "stage" : "IXSCAN",
                            "keyPattern" : { "poDetailsId" : 1.0 },
                            "indexName" : "poDetailsId_1",
                            "isMultiKey" : false,
                            "multiKeyPaths" : { "poDetailsId" : [] },
                            "isUnique" : false,
                            "isSparse" : false,
                            "isPartial" : false,
                            "indexVersion" : 2,
                            "direction" : "forward",
                            "indexBounds" : {
                                "poDetailsId" : [
                                    "[ObjectId('5dc7ac15720a2772c7b7666f'), ObjectId('5dc7ac15720a2772c7b7666f')]"
                                ]
                            }
                        }
                    },
                    "rejectedPlans" : []
                }
            }
        },
        {
            "$unwind" : { "path" : "$idList" }
        },
        {
            "$match" : {
                "idList.isDispacthed" : { "$eq" : true }
            }
        }
    ],
    "ok" : 1.0
}
As you can see, MongoDB rewrites this aggregation to:
db.getCollection('collection').aggregate([
    { $match: { "poDetailsId" : ObjectId("5dc7ac15720a2772c7b7666f") } },
    { $unwind: "$idList" },
    { $match: { 'idList.isDispacthed': true } }
])
Logically, $match -> $unwind -> $match is better, since you filter (by index) down to a subset of records instead of doing a full scan (working with 100 matched documents ≠ working with all documents).
If your aggregation operation requires only a subset of the data in a collection, use the $match, $limit, and $skip stages to restrict the documents that enter at the beginning of the pipeline. When placed at the beginning of a pipeline, $match operations use suitable indexes to scan only the matching documents in a collection.
https://docs.mongodb.com/manual/core/aggregation-pipeline/#early-filtering
Once you manipulate your documents (e.g. with $unwind), MongoDB can no longer apply indexes to the later stages.
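The effect of early filtering can be sketched in plain JavaScript (hypothetical data; `unwind` here mimics $unwind on idList): both orderings return the same rows, but the match-first version unwinds fewer documents.

```javascript
// Sketch: match -> unwind -> match vs. unwind -> match on toy data.
const docs = [
  { poDetailsId: "A", idList: [{ isDispacthed: true }, { isDispacthed: false }] },
  { poDetailsId: "B", idList: [{ isDispacthed: false }] },
];
const unwind = ds => ds.flatMap(d => d.idList.map(item => ({ ...d, idList: item })));

// unwind -> match: every document is unwound first.
const unwound1 = unwind(docs);
const plan1 = unwound1.filter(r => r.idList.isDispacthed);

// match -> unwind -> match: only candidate documents (whose array
// contains a matching element) are unwound.
const candidates = docs.filter(d => d.idList.some(i => i.isDispacthed));
const unwound2 = unwind(candidates);
const plan2 = unwound2.filter(r => r.idList.isDispacthed);

console.log(plan1.length, unwound2.length, unwound1.length); // 1 2 3
```

The same final row comes out of both plans; the win is that the first $match (index-eligible) shrinks the set entering $unwind.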

MongoDB aggregate count is too much slow

I have around 60 thousand documents in the users collection, and the following query:
db.getCollection('users').aggregate([
    { "$match": { "userType": "employer" } },
    { "$lookup": { "from": "companies", "localField": "_id", "foreignField": "owner.id", "as": "company" } },
    { "$unwind": "$company" },
    { "$lookup": { "from": "companytypes", "localField": "company.type.id", "foreignField": "_id", "as": "companyType" } },
    { "$unwind": "$companyType" },
    { "$group": { "_id": null, "count": { "$sum": 1 } } }
])
It takes around 12 seconds to count. I even call the count function before the list function, but my list function with limit: 10 responds faster than the count.
The following is the explain result:
{
    "stages" : [
        {
            "$cursor" : {
                "query" : { "userType" : "employer" },
                "fields" : { "company" : 1, "_id" : 1 },
                "queryPlanner" : {
                    "plannerVersion" : 1,
                    "namespace" : "jobs.users",
                    "indexFilterSet" : false,
                    "parsedQuery" : {
                        "userType" : { "$eq" : "employer" }
                    },
                    "winningPlan" : {
                        "stage" : "COLLSCAN",
                        "filter" : {
                            "userType" : { "$eq" : "employer" }
                        },
                        "direction" : "forward"
                    },
                    "rejectedPlans" : []
                }
            }
        },
        {
            "$lookup" : {
                "from" : "companies",
                "as" : "company",
                "localField" : "_id",
                "foreignField" : "owner.id",
                "unwinding" : { "preserveNullAndEmptyArrays" : false }
            }
        },
        {
            "$match" : {
                "$nor" : [ { "company" : { "$eq" : [] } } ]
            }
        },
        {
            "$group" : {
                "_id" : { "$const" : null },
                "total" : { "$sum" : { "$const" : 1 } }
            }
        },
        {
            "$project" : { "_id" : false, "total" : true }
        }
    ],
    "ok" : 1.0
}
$lookup operations are slow since they mimic left-join behavior. From the docs:
$lookup performs an equality match on the localField to the foreignField from the documents of the from collection.
Hence, if there are no indexes on the fields used to join the collections, MongoDB is forced to do a collection scan.
Adding an index on the foreignField attributes should prevent a collection scan and can improve performance by an order of magnitude.
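The explain output above shows a COLLSCAN on users filtered by userType, and the first $lookup joins on companies.owner.id. Assuming those fields are unindexed (field names are taken from the question; this is a sketch, not a verified fix), the corresponding index definitions in the shell would look like:

```javascript
// Sketch: indexes suggested by the explain output above.
// Field names come from the question; verify against your schema first.
db.users.createIndex({ userType: 1 });        // supports the initial $match
db.companies.createIndex({ "owner.id": 1 });  // supports the first $lookup's foreignField
// companytypes is joined on _id, which already has its default index.
```

After creating them, re-run the aggregation with { explain: true } and check that the COLLSCAN stages have been replaced by IXSCAN.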