Best way of doing lists that can be indexed efficiently in Mongo

I have a collection with documents that can either look like this (option A):
{
  "my_list": [
    { "id": "A", "other_data": 123 },
    { "id": "B", "other_data": 456 },
    { "id": "C", "other_data": 789 }
  ]
}
or like this (option B):
{
  "my_list": {
    "A": 123,
    "B": 456,
    "C": 789
  }
}
The question is: which one is more efficient for queries such as "fetch all documents that have id 'B' in 'my_list'"?
Also, for the better option, how do you tell MongoDB to create the relevant index?

Definitely the first one.
{
  "my_list": [
    { "id": "A", "other_data": 123 },
    { "id": "B", "other_data": 456 },
    { "id": "C", "other_data": 789 }
  ]
}
MongoDB uses multikey indexes to index the content stored in arrays. If you index a field that holds an array value, MongoDB creates separate index entries for every element of the array. These multikey indexes allow queries to select documents that contain arrays by matching on element or elements of the arrays. MongoDB automatically determines whether to create a multikey index if the indexed field contains an array value; you do not need to explicitly specify the multikey type.
https://docs.mongodb.com/manual/indexes/#multikey-index
The second option is an embedded object, not an array. To index its values you would need a single field or compound index on each key (my_list.A, my_list.B, ...), which does not work when the keys are arbitrary.
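To answer the second part of the question: for option A you create the multikey index with createIndex (a minimal sketch; "collection" is a placeholder name, and the key pattern matches the my_list.id_1 index seen in the explain output below):

db.collection.createIndex({ "my_list.id": 1 })

// The query from the question then uses this multikey index:
db.collection.find({ "my_list.id": "B" })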
Transform arrays into key:value store
MongoDB also lets you transform the indexed array into a key:value object during aggregation, e.g.:
db.collection.aggregate([
  {
    $match: { "my_list.id" : "A" }
  },
  {
    $project: {
      my_list: {
        $arrayToObject: {
          $map: {
            input: "$my_list",
            in: {
              k: "$$this.id",
              v: "$$this.other_data"
            }
          }
        }
      }
    }
  }
])
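For the sample document above, the pipeline would return something like this (a sketch; the _id is whatever the matched document has):

{ "_id" : ObjectId("..."), "my_list" : { "A" : 123, "B" : 456, "C" : 789 } }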
If we take a look at the explain output, we can see MongoDB uses the index for efficient execution of the pipeline:
{
  "stages" : [
    {
      "$cursor" : {
        "query" : {
          "my_list.id" : "A"
        },
        "fields" : {
          "my_list" : 1,
          "_id" : 1
        },
        "queryPlanner" : {
          "plannerVersion" : 1,
          "namespace" : "test.collection",
          "indexFilterSet" : false,
          "parsedQuery" : {
            "my_list.id" : { "$eq" : "A" }
          },
          "queryHash" : "599B2BF4",
          "planCacheKey" : "48B2FCB0",
          "winningPlan" : {
            "stage" : "FETCH",
            "inputStage" : {
              "stage" : "IXSCAN",
              "keyPattern" : { "my_list.id" : 1.0 },
              "indexName" : "my_list.id_1",
              "isMultiKey" : true,
              "multiKeyPaths" : {
                "my_list.id" : [ "my_list" ]
              },
              "isUnique" : false,
              "isSparse" : false,
              "isPartial" : false,
              "indexVersion" : 2,
              "direction" : "forward",
              "indexBounds" : {
                "my_list.id" : [ "[\"A\", \"A\"]" ]
              }
            }
          },
          "rejectedPlans" : []
        }
      }
    },
    {
      "$project" : {
        "_id" : true,
        "my_list" : {
          "$arrayToObject" : [
            {
              "$map" : {
                "input" : "$my_list",
                "as" : "this",
                "in" : {
                  "k" : "$$this.id",
                  "v" : "$$this.other_data"
                }
              }
            }
          ]
        }
      }
    }
  ],
  "ok" : 1.0
}

Related

Getting the N documents in MongoDB before a Document ID from a Sorted Result

I have a collection in MongoDB, like the one below.
-> Mongo Playground link
I have sorted the collection on overview and _id:
$sort: { overview: 1, _id: 1 }
which sorts the collection as expected.
When I filter the collection to show only the documents after "subject 13.", it works as expected.
$match: {
  _id: { $gt: ObjectId('605db89d208db95eb4878556') }
}
However, when I try to get the documents before "subject 13", that is "Subject 6", with the following query, it doesn't work as I expect.
$match: {
  _id: { $lt: ObjectId('605db89d208db95eb4878556') }
}
Instead of getting just "Subject 6" in the result, I get other documents as well.
I suspect this is happening because MongoDB always filters the documents before sorting, regardless of the stage order in the aggregation pipeline.
Please suggest a way to get the documents before a particular _id in MongoDB.
I have 600 documents in the collection; this is a sample dataset. My full aggregate query is below.
[
  {
    '$sort': {
      'overview': 1,
      '_id': 1
    }
  },
  {
    '$match': {
      '_id': {
        '$lt': new ObjectId('605db89d208db95eb4878556')
      }
    }
  }
]
MongoDB optimizes query performance by moving the $match ahead of the $sort in your case, since you have $sort followed by $match:
https://docs.mongodb.com/manual/core/aggregation-pipeline-optimization/#sort-match-sequence-optimization
When you have a sequence with $sort followed by a $match, the $match moves before the $sort to minimize the number of objects to sort. For example, if the pipeline consists of the following stages:
[
  { '$sort': { 'overview': 1, '_id': 1 } },
  { '$match': { '_id': { '$lt': new ObjectId('605db89d208db95eb4878556') } } }
]
During the optimization phase, the optimizer transforms the sequence to the following:
[
  { '$match': { '_id': { '$lt': new ObjectId('605db89d208db95eb4878556') } } },
  { '$sort': { 'overview': 1, '_id': 1 } }
]
Query planner result:
We can see that the first stage is the match query; the sort is performed after it.
{
  "stages" : [
    {
      "$cursor" : {
        "query" : {
          "_id" : { "$lt" : ObjectId("605db89d208db95eb4878556") }
        },
        "queryPlanner" : {
          "plannerVersion" : 1,
          "namespace" : "video.data3",
          "indexFilterSet" : false,
          "parsedQuery" : {
            "_id" : { "$lt" : ObjectId("605db89d208db95eb4878556") }
          },
          "winningPlan" : {
            "stage" : "FETCH",
            "inputStage" : {
              "stage" : "IXSCAN",
              "keyPattern" : { "_id" : 1 },
              "indexName" : "_id_",
              "isMultiKey" : false,
              "multiKeyPaths" : { "_id" : [] },
              "isUnique" : true,
              "isSparse" : false,
              "isPartial" : false,
              "indexVersion" : 2,
              "direction" : "forward",
              "indexBounds" : {
                "_id" : [
                  "[ObjectId('000000000000000000000000'), ObjectId('605db89d208db95eb4878556'))"
                ]
              }
            }
          },
          "rejectedPlans" : []
        }
      }
    },
    {
      "$sort" : {
        "sortKey" : {
          "overview" : 1,
          "_id" : 1
        }
      }
    }
  ],
  "ok" : 1.0
}
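To reproduce this yourself, pass the pipeline through explain() (a sketch; the collection name data3 is taken from the namespace video.data3 in the output above):

db.data3.explain().aggregate([
  { '$sort': { 'overview': 1, '_id': 1 } },
  { '$match': { '_id': { '$lt': ObjectId('605db89d208db95eb4878556') } } }
])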

MongoDB Aggregation Query Optimization: match -> unwind -> match vs unwind -> match

Input Data
{
  "_id" : ObjectId("5dc7ac6e720a2772c7b76671"),
  "idList" : [
    {
      "queueUpdateTimeStamp" : "2019-12-12T07:16:47.577Z",
      "displayId" : "H14",
      "currentQueue" : "10",
      "isRejected" : true,
      "isDispacthed" : true
    },
    {
      "queueUpdateTimeStamp" : "2019-12-12T07:16:47.577Z",
      "displayId" : "H14",
      "currentQueue" : "10",
      "isRejected" : true,
      "isDispacthed" : false
    }
  ],
  "poDetailsId" : ObjectId("5dc7ac15720a2772c7b7666f"),
  "processtype" : 1
}
Output Data
{
  "_id" : ObjectId("5dc7ac6e720a2772c7b76671"),
  "idList" : {
    "queueUpdateTimeStamp" : "2019-12-12T07:16:47.577Z",
    "displayId" : "H14",
    "currentQueue" : "10",
    "isRejected" : true,
    "isDispacthed" : true
  },
  "poDetailsId" : ObjectId("5dc7ac15720a2772c7b7666f"),
  "processtype" : 1
}
Query 1 (unwind then match)
aggregate([
  { $unwind: { path: "$idList" } },
  { $match: { 'idList.isDispacthed': isDispatched } }
])
Query 2 (match then unwind then match)
aggregate([
  { $match: { 'idList.isDispacthed': isDispatched } },
  { $unwind: { path: "$idList" } },
  { $match: { 'idList.isDispacthed': isDispatched } }
])
My Question / My Concern
(Suppose I have a large number of documents (50k+) in this collection, and assume there are other lookups and projections after this query in the same pipeline.)
match -> unwind -> match vs. unwind -> match:
Is there any performance difference between these two queries?
Is there any other (better) way to write this query?
It all depends on the MongoDB query planner optimizer:
Aggregation pipeline operations have an optimization phase which attempts to reshape the pipeline for improved performance.
To see how the optimizer transforms a particular aggregation pipeline, include the explain option in the db.collection.aggregate() method.
https://docs.mongodb.com/manual/core/aggregation-pipeline-optimization/
First, create an index on poDetailsId:
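A minimal sketch of the index creation (the key pattern matches the poDetailsId_1 index name in the explain output below):

db.getCollection('collection').createIndex({ poDetailsId: 1 })

Then run the aggregation with explain() to see how the optimizer reshapes it: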
db.getCollection('collection').explain().aggregate([
  { $unwind: "$idList" },
  {
    $match: {
      'idList.isDispacthed': true,
      "poDetailsId" : ObjectId("5dc7ac15720a2772c7b7666f")
    }
  }
])
{
  "stages" : [
    {
      "$cursor" : {
        "query" : {
          "poDetailsId" : { "$eq" : ObjectId("5dc7ac15720a2772c7b7666f") }
        },
        "queryPlanner" : {
          "plannerVersion" : 1,
          "namespace" : "test.collection",
          "indexFilterSet" : false,
          "parsedQuery" : {
            "poDetailsId" : { "$eq" : ObjectId("5dc7ac15720a2772c7b7666f") }
          },
          "queryHash" : "2CF7E390",
          "planCacheKey" : "A8739F51",
          "winningPlan" : {
            "stage" : "FETCH",
            "inputStage" : {
              "stage" : "IXSCAN",
              "keyPattern" : { "poDetailsId" : 1.0 },
              "indexName" : "poDetailsId_1",
              "isMultiKey" : false,
              "multiKeyPaths" : { "poDetailsId" : [] },
              "isUnique" : false,
              "isSparse" : false,
              "isPartial" : false,
              "indexVersion" : 2,
              "direction" : "forward",
              "indexBounds" : {
                "poDetailsId" : [
                  "[ObjectId('5dc7ac15720a2772c7b7666f'), ObjectId('5dc7ac15720a2772c7b7666f')]"
                ]
              }
            }
          },
          "rejectedPlans" : []
        }
      }
    },
    {
      "$unwind" : { "path" : "$idList" }
    },
    {
      "$match" : {
        "idList.isDispacthed" : { "$eq" : true }
      }
    }
  ],
  "ok" : 1.0
}
As you can see, MongoDB rewrites this aggregation to:
db.getCollection('collection').aggregate([
  { $match: { "poDetailsId" : ObjectId("5dc7ac15720a2772c7b7666f") } },
  { $unwind: "$idList" },
  { $match: { 'idList.isDispacthed': true } }
])
Logically, $match -> $unwind -> $match is better, since you filter (by index) down to a subset of records instead of doing a full scan (working with 100 matched documents ≠ working with all documents).
If your aggregation operation requires only a subset of the data in a collection, use the $match, $limit, and $skip stages to restrict the documents that enter at the beginning of the pipeline. When placed at the beginning of a pipeline, $match operations use suitable indexes to scan only the matching documents in a collection.
https://docs.mongodb.com/manual/core/aggregation-pipeline/#early-filtering
Once a stage reshapes your documents (such as $unwind), the following stages can no longer use indexes.
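As for a better way to write the query: since the desired output keeps only one matching element of idList (see the output data above), one alternative worth benchmarking is to replace $unwind + $match with $filter, which avoids multiplying documents mid-pipeline. A sketch, assuming you want the first dispatched element and the indexed poDetailsId match from above:

db.getCollection('collection').aggregate([
  // Indexed match first, so the pipeline starts from a small subset
  { $match: { "poDetailsId": ObjectId("5dc7ac15720a2772c7b7666f"), "idList.isDispacthed": true } },
  {
    $addFields: {
      idList: {
        $arrayElemAt: [
          // Keep only the dispatched elements, then take the first one
          { $filter: { input: "$idList", cond: { $eq: ["$$this.isDispacthed", true] } } },
          0
        ]
      }
    }
  }
])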

Slow MongoDB query, how can it happen?

I have the following query in MongoDB:
db.getCollection('message').aggregate([
  {
    "$match": {
      "who" : { "$in" : ["manager", "woker"] },
      "sendTo": { "$in": ["userId:243369", "userId:160921"] },
      "exceptSendTo": { "$nin": ["userId:37355"] },
      "msgTime": { "$lt": 1559716155 },
      "isInvalid": { "$exists": false }
    }
  },
  {
    "$sort": { "msgTime": 1, "who": 1, "sendTo": 1 }
  },
  {
    "$group": { "_id": "$who", "doc": { "$first": "$type" } }
  }
], { allowDiskUse: true })
Forget about the field meanings. I have these indexes:
[
  {
    "v" : 1,
    "key" : { "_id" : 1 },
    "name" : "_id_",
    "ns" : "db.message"
  },
  {
    "v" : 1,
    "key" : { "who" : 1.0, "sendTo" : 1.0 },
    "name" : "who_sendTo",
    "ns" : "db.message"
  },
  {
    "v" : 1,
    "key" : { "msgTime" : 1.0 },
    "name" : "msgTime_1",
    "ns" : "db.message"
  },
  {
    "v" : 1,
    "key" : { "msgTime" : 1.0, "who" : 1.0, "sendTo" : 1.0 },
    "name" : "msgTime_1.0_who_1.0_sendTo_1.0",
    "ns" : "db.message",
    "background" : true
  }
]
Running the query above takes 1.52s, and explain shows that it indeed uses the msgTime_1.0_who_1.0_sendTo_1.0 index.
Why is the query still slow when the index is used? And is there any way to fix the slowness, such as changing the index?
I don't think you are using the sort the way you intend to.
$first only gives a meaningful result when the input documents are sorted appropriately:
https://docs.mongodb.com/manual/reference/operator/aggregation/first/
You need to sort on the key you want the first element of.
Or you could use $$ROOT, which returns the whole first document.
I think you should modify it to something like:
{"$sort": {"who": 1, "msgTime": 1, "sendTo": 1}},
{"$group": {"_id": "$who", "doc": {"$first": "$$root"}}},
In this case the $group operator can find the result for each group "instantly" since they are all next to each other.
If you are only interested in the type, add a projection:
{ '$project': { 'doc.type': 1 } }
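Putting it all together, the revised pipeline would look something like this (a sketch; the $match stage is copied unchanged from the question):

db.getCollection('message').aggregate([
  {
    "$match": {
      "who" : { "$in" : ["manager", "woker"] },
      "sendTo": { "$in": ["userId:243369", "userId:160921"] },
      "exceptSendTo": { "$nin": ["userId:37355"] },
      "msgTime": { "$lt": 1559716155 },
      "isInvalid": { "$exists": false }
    }
  },
  // Sort by the group key first so $first picks a deterministic document per group
  { "$sort": { "who": 1, "msgTime": 1, "sendTo": 1 } },
  { "$group": { "_id": "$who", "doc": { "$first": "$$ROOT" } } },
  { "$project": { "doc.type": 1 } }
], { allowDiskUse: true })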

When using $elemMatch, why is MongoDB performing a FETCH stage if index is covering the filter?

I have the following collection:
> db.foo.find()
{ "_id" : 1, "k" : [ { "a" : 50, "b" : 10 } ] }
{ "_id" : 2, "k" : [ { "a" : 90, "b" : 80 } ] }
With a compound index on the k field:
"key" : {
  "k.a" : 1,
  "k.b" : 1
},
"name" : "k.a_1_k.b_1"
If I run the following query:
db.foo.aggregate([
  { $match: { "k.a": 50 } },
  { $project: { _id: 0, "dummy": { $literal: "" } } }
])
The index is used (makes sense) and there is no need for a FETCH stage:
"winningPlan" : {
"stage" : "COUNT_SCAN",
"keyPattern" : {
"k.a" : 1,
"k.b" : 1
},
"indexName" : "k.a_1_k.b_1",
"isMultiKey" : false,
"multiKeyPaths" : {
"k.a" : [ ],
"k.b" : [ ]
},
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 2,
"indexBounds" : {
"startKey" : {
"k.a" : 50,
"k.b" : { "$minKey" : 1 }
},
"startKeyInclusive" : true,
"endKey" : {
"k.a" : 50,
"k.b" : { "$maxKey" : 1 }
},
"endKeyInclusive" : true
}
}
However, if I run the following query that uses $elemMatch:
db.foo.aggregate([
  { $match: { k: { $elemMatch: { a: 50, b: { $in: [5, 6, 10] } } } } },
  { $project: { _id: 0, "dummy": { $literal: "" } } }
])
There is a FETCH stage (which AFAIK is not necessary):
"winningPlan" : {
"stage" : "FETCH",
"filter" : {
"k" : {
"$elemMatch" : {
"$and" : [
{ "a" : { "$eq" : 50 } },
{ "b" : { "$in" : [ 5, 6, 10 ] } }
]
}
}
},
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"k.a" : 1,
"k.b" : 1
},
"indexName" : "k.a_1_k.b_1",
"isMultiKey" : false,
"multiKeyPaths" : {
"k.a" : [ ],
"k.b" : [ ]
},
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 2,
"direction" : "forward",
"indexBounds" : {
"k.a" : [
"[50.0, 50.0]"
],
"k.b" : [
"[5.0, 5.0]",
"[6.0, 6.0]",
"[10.0, 10.0]"
]
}
}
},
I am using MongoDB 3.4.
I ask this because I have a DB with a lot of documents, and there is a query that uses aggregate() and $elemMatch (it performs more useful things than projecting nothing as in this question, of course, but in theory it does not require a FETCH stage). I found out that the main reason the query is slow is the FETCH stage, which AFAIK is not needed.
Is there some logic that forces MongoDB to use FETCH when $elemMatch is used, or is it just a missing optimization?
It looks like even a single $elemMatch forces Mongo to do FETCH -> COUNT instead of COUNT_SCAN. I opened a ticket in their Jira with simple steps to reproduce: https://jira.mongodb.org/browse/SERVER-35223
TLDR: this is the expected behaviour of a multikey index combined with an $elemMatch.
From the covered query section of the multikey index documents:
Multikey indexes cannot cover queries over array field(s).
Meaning the multikey index does not contain all the information about a sub-document.
Let's imagine the following scenario:
// doc1
{
  "k" : [ { "a" : 50, "b" : 10 } ]
}
// doc2
{
  "k" : { "a" : 50, "b" : 10 }
}
Because Mongo "flattens" the array it indexes, once the index is built Mongo cannot differentiate between these 2 documents, $elemMatch specifically requires an array object to match (i.e doc2 will never match an $elemMatch query).
Meaning Mongo is forced to FETCH the documents to determine which docs will match, this is the premise causing the behaviour you see.
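A quick way to see the difference in the shell (a hypothetical session against the two documents above):

// $elemMatch requires an actual array, so it matches doc1 only:
db.foo.find({ k: { $elemMatch: { a: 50, b: 10 } } })
// Dotted-path queries match both the array form and the embedded-document form,
// which is exactly the ambiguity present in the flattened index entries:
db.foo.find({ "k.a": 50, "k.b": 10 })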

Skipping certain fields in subdocument when doing a multikey index in MongoDB

I tried doing a multikey index in MongoDB (arrays containing subdocuments), but ended up going over the byte limit on many of the keys.
All of the subdocuments contain the same fields - is there any way to do a multikey index, skipping over the larger fields?
Something like:
db.foo.createIndex({"bar":1}, except for baz, bundy)
As long as all the fields are in a single array (see Compound Multikey Indexes), it is perfectly legal to create an index on just the sub-properties you want:
db.foo.insert({ "bar": [{ "foo": 1, "baz": 1, "buz": 1, "bat": 1 }] })
db.foo.ensureIndex({ "bar.foo": 1, "bar.baz": 1, "bar.buz": 1 })
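Note that ensureIndex is a deprecated alias on modern MongoDB versions; the equivalent call is:

db.foo.createIndex({ "bar.foo": 1, "bar.baz": 1, "bar.buz": 1 })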
And queries will use the index normally:
db.foo.find({ "bar": { "$elemMatch": { "foo": 1, "bar": 1, "buz": 1 } } }).explain()
{
  "queryPlanner" : {
    "plannerVersion" : 1,
    "namespace" : "test.foo",
    "indexFilterSet" : false,
    "parsedQuery" : {
      "bar" : {
        "$elemMatch" : {
          "$and" : [
            { "baz" : { "$eq" : 1 } },
            { "buz" : { "$eq" : 1 } },
            { "foo" : { "$eq" : 1 } }
          ]
        }
      }
    },
    "winningPlan" : {
      "stage" : "FETCH",
      "filter" : {
        "bar" : {
          "$elemMatch" : {
            "$and" : [
              { "foo" : { "$eq" : 1 } },
              { "baz" : { "$eq" : 1 } },
              { "buz" : { "$eq" : 1 } }
            ]
          }
        }
      },
      "inputStage" : {
        "stage" : "IXSCAN",
        "keyPattern" : {
          "bar.foo" : 1,
          "bar.baz" : 1,
          "bar.buz" : 1
        },
        "indexName" : "bar.foo_1_bar.baz_1_bar.buz_1",
        "isMultiKey" : false,
        "direction" : "forward",
        "indexBounds" : {
          "bar.foo" : [ "[1.0, 1.0]" ],
          "bar.baz" : [ "[1.0, 1.0]" ],
          "bar.buz" : [ "[1.0, 1.0]" ]
        }
      }
    }
  }
}
So the "indexName" there shows the index selected.
But of course no form of index creation allows you to "exclude" fields: you either index a path or you don't ("all or nothing"), and the same byte limit applies to the total size of the keys you include.