Why is a query with indexes not a covered query?

You would like to perform a covered query on the example collection. You have the following indexes:
{ name : 1, dob : 1 }
{ _id : 1 }
{ hair : 1, name : 1 }
Why is the below query not a covered query?
db.example.find( { name : { $in : [ "Bart", "Homer" ] } }, {_id : 0, hair : 1, name : 1} )
While this one is:
db.example.find( { name : { $in : [ "Bart", "Homer" ] } }, {_id : 0, dob : 1, name : 1} )

According to the documentation on index prefixes, the query
db.example.find( { name : { $in : [ "Bart", "Homer" ] } } );
is supported by
db.example.createIndex({ "name": 1, "dob": 1 });
but not by
db.example.createIndex({ "hair": 1, "name": 1 });
since { "name": 1 } is not a prefix of { "hair": 1, "name": 1 }.
Sample
{ "_id": 1, "name": "Bart", "hair": "triangles", "dob": "1985-01-01" }
{ "_id": 2, "name": "Homer", "hair": "two", "dob": "1960-01-01" }
Query 1
> db.example.find(
> { name: { $in: [ "Bart", "Homer" ] } },
> { _id: 0, hair: 1, name: 1 }
> ).explain("executionStats");
...
"executionStats": {
"totalKeysExamined": 2,
"totalDocsExamined": 2,
"executionStages": {
"stage": "PROJECTION",
"inputStage": {
"stage": "FETCH",
"inputStage": {
"stage": "IXSCAN",
"indexName": "name_1_dob_1",
...
As you can see, the name_1_dob_1 index was used (since { "name": 1 } is a prefix of { "name": 1, "dob": 1 }). Two keys were examined in the index ("totalKeysExamined": 2), and then two documents were fetched from the collection ("totalDocsExamined": 2), because the name_1_dob_1 index carries no information about hair, which the projection needs to return.
Query 2
> db.example.find(
> { name: { $in: [ "Bart", "Homer" ] } },
> { _id: 0, dob: 1, name: 1 }
> ).explain("executionStats");
...
"executionStats": {
"totalKeysExamined": 2,
"totalDocsExamined": 0,
"executionStages": {
"stage": "PROJECTION",
"inputStage": {
"stage": "IXSCAN",
"indexName": "name_1_dob_1",
...
As in Query 1, the name_1_dob_1 index was used and two keys were examined in the index ("totalKeysExamined": 2), but there was no call to the collection at all ("totalDocsExamined": 0): the name_1_dob_1 index contains both dob and name, so there is no need to fetch anything more from the collection.
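To make Query 1 covered as well, one option is an additional compound index that keeps name as the prefix (so the $in filter can use it) and also carries hair, letting the projection be satisfied from the index alone. A minimal sketch, assuming you are free to add an index (this is not part of the original answer):
// name is the prefix, hair is stored in the index entries,
// so no FETCH stage is needed for this projection.
db.example.createIndex({ name: 1, hair: 1 })
// The same query should now report "totalDocsExamined": 0:
db.example.find(
  { name: { $in: [ "Bart", "Homer" ] } },
  { _id: 0, hair: 1, name: 1 }
).explain("executionStats")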

Related

Summarize documents by day of the week

For each student in a collection, I have an array of absences. I want to summarize the data by displaying the number of absences for each day of the week.
Given the following input:
{
  "_id" : 9373,
  "absences" : [
    {
      "code" : "U",
      "date" : ISODate("2021-01-17T00:00:00.000+0000"),
      "full_day" : false,
      "remote" : false,
      "dayNumber" : 1,
      "dayName" : "Sunday"
    }
  ]
}
{
  "_id" : 9406,
  "absences" : [
    {
      "code" : "E",
      "date" : ISODate("2020-12-09T00:00:00.000+0000"),
      "full_day" : false,
      "remote" : false,
      "dayNumber" : 4,
      "dayName" : "Wednesday"
    },
    {
      "code" : "U",
      "date" : ISODate("2021-05-27T00:00:00.000+0000"),
      "full_day" : false,
      "remote" : false,
      "dayNumber" : 5,
      "dayName" : "Thursday"
    }
  ]
}
How can I achieve the following output:
[
  {
    "_id": 9373,
    "days": [
      {
        "dayNumber": 1,
        "dayName": "Sunday",
        "count": 1
      }
    ]
  },
  {
    "_id": 9406,
    "days": [
      {
        "dayNumber": 4,
        "dayName": "Wednesday",
        "count": 1
      },
      {
        "dayNumber": 5,
        "dayName": "Thursday",
        "count": 1
      }
    ]
  }
]
I've pushed all the required fields to this stage of the pipeline. I'm just not clear how to roll up the data in the nested absences array.
$unwind to deconstruct the absences array
$group by _id and dayNumber, counting the grouped documents
$group by _id again to reconstruct the days array
db.collection.aggregate([
  { $unwind: "$absences" },
  {
    $group: {
      _id: {
        _id: "$_id",
        dayNumber: "$absences.dayNumber"
      },
      dayName: { $first: "$absences.dayName" },
      count: { $sum: 1 }
    }
  },
  {
    $group: {
      _id: "$_id._id",
      days: {
        $push: {
          dayName: "$dayName",
          dayNumber: "$_id.dayNumber",
          count: "$count"
        }
      }
    }
  }
])
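As a side note, the same result can be computed without $unwind by deduplicating the dayNumber values and counting matches per day with array expression operators. A sketch, not from the original answer, requiring MongoDB 3.4+ for $indexOfArray; note that $setUnion does not guarantee element order:
db.collection.aggregate([
  {
    $project: {
      days: {
        $map: {
          // One entry per distinct dayNumber in this student's absences.
          input: { $setUnion: ["$absences.dayNumber", []] },
          as: "d",
          in: {
            dayNumber: "$$d",
            // Take the dayName from the first absence with this dayNumber.
            dayName: {
              $arrayElemAt: [
                "$absences.dayName",
                { $indexOfArray: ["$absences.dayNumber", "$$d"] }
              ]
            },
            // Count how many absences fall on this day.
            count: {
              $size: {
                $filter: {
                  input: "$absences",
                  as: "a",
                  cond: { $eq: ["$$a.dayNumber", "$$d"] }
                }
              }
            }
          }
        }
      }
    }
  }
])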

Sort exceeds limit on text search?

I have defined an index like this:
db.imageProperties.createIndex(
  {
    "imageProperties.cameraMaker": "text",
    "imageProperties.cameraModel": "text",
    "imageProperties.dateTimeOriginal": -1
  },
  { name: "TextIndex" }
)
But, when I try to run a query with a sort like this:
db.imageProperties.find( { $text: { $search: "nikon" } }, {"imagePath" : 1, _id: 0 } ).sort( { "imageProperties.dateTimeOriginal": -1 } )
I get this error:
Error: error: {
  "ok" : 0,
  "errmsg" : "Executor error during find command :: caused by :: Sort operation used more than the maximum 33554432 bytes of RAM. Add an index, or specify a smaller limit.",
  "code" : 96,
  "codeName" : "OperationFailed"
}
It is my understanding from reading the documentation that it would be possible to combine text search with sorting by creating a combined index as I have done.
This is the output from .explain() on the above query:
> db.imageProperties.find( { $text: { $search: "nikon" } }, {"imagePath" : 1, _id: 0 } ).sort( { "imageProperties.dateTimeOriginal": -1 } ).explain()
{
  "queryPlanner": {
    "plannerVersion": 1,
    "namespace": "olavt-images.imageProperties",
    "indexFilterSet": false,
    "parsedQuery": {
      "$text": {
        "$search": "nikon",
        "$language": "english",
        "$caseSensitive": false,
        "$diacriticSensitive": false
      }
    },
    "queryHash": "1DCFCE0B",
    "planCacheKey": "650B3A8E",
    "winningPlan": {
      "stage": "PROJECTION_SIMPLE",
      "transformBy": {
        "imagePath": 1,
        "_id": 0
      },
      "inputStage": {
        "stage": "SORT",
        "sortPattern": {
          "imageProperties.dateTimeOriginal": -1
        },
        "inputStage": {
          "stage": "SORT_KEY_GENERATOR",
          "inputStage": {
            "stage": "TEXT",
            "indexPrefix": {},
            "indexName": "TextIndex",
            "parsedTextQuery": {
              "terms": [
                "nikon"
              ],
              "negatedTerms": [],
              "phrases": [],
              "negatedPhrases": []
            },
            "textIndexVersion": 3,
            "inputStage": {
              "stage": "TEXT_MATCH",
              "inputStage": {
                "stage": "FETCH",
                "inputStage": {
                  "stage": "OR",
                  "inputStage": {
                    "stage": "IXSCAN",
                    "keyPattern": {
                      "_fts": "text",
                      "_ftsx": 1,
                      "imageProperties.dateTimeOriginal": -1
                    },
                    "indexName": "TextIndex",
                    "isMultiKey": true,
                    "isUnique": false,
                    "isSparse": false,
                    "isPartial": false,
                    "indexVersion": 2,
                    "direction": "backward",
                    "indexBounds": {}
                  }
                }
              }
            }
          }
        }
      }
    },
    "rejectedPlans": []
  },
  "serverInfo": {
    "host": "4794df1ed9c4",
    "port": 27017,
    "version": "4.2.5",
    "gitVersion": "2261279b51ea13df08ae708ff278f0679c59dc32"
  },
  "ok": 1
}
How can I get the desired behavior?
The error indicates that the in-memory sort requires more memory than the configured limit (33554432 bytes, i.e. 32 MB).
The field imagePath that you want to project is not covered by the TextIndex; try adding a new index that includes it:
db.imageProperties.createIndex(
{
"imageProperties.cameraMaker": "text",
"imageProperties.cameraModel": "text",
"imageProperties.dateTimeOriginal": -1,
"imagePath": 1
}
)
Then try the following steps:
Check that the indexes are created successfully by running:
db.imageProperties.getIndexes()
Check whether the correct index is being used:
db.imageProperties.find( { $text: { $search: "nikon" } }, {"imagePath" : 1, _id: 0 } )
.sort( { "imageProperties.dateTimeOriginal": -1 } ).explain()
If you only need a limited number of results, also add a limit:
db.imageProperties.find( { $text: { $search: "nikon" } }, {"imagePath" : 1, _id: 0 } )
.sort( { "imageProperties.dateTimeOriginal": -1 } ).limit(100)
You can also allow disk use by running the query through the aggregation framework with allowDiskUse:
db.imageProperties.aggregate([
  { $match: { $text: { $search: "nikon" } } },
  { $sort: { "imageProperties.dateTimeOriginal": -1 } },
  { $project: { imagePath: 1 } }
], {
  allowDiskUse: true
})
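On MongoDB 4.4 and later, allowDiskUse can also be applied directly to a find cursor, avoiding the rewrite into an aggregation. This assumes an upgrade is possible; the explain output above shows server version 4.2.5, where the cursor method is not available:
// Assumes a 4.4+ server.
db.imageProperties.find(
  { $text: { $search: "nikon" } },
  { "imagePath": 1, _id: 0 }
).sort({ "imageProperties.dateTimeOriginal": -1 }).allowDiskUse()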

Aggregate array of subdocuments into single document

My document looks like the following (ignore timepoints for this question):
{
  "_id": "xyz-800",
  "site": "xyz",
  "user": 800,
  "timepoints": [
    { "timepoint": 0, "a": 1500, "b": 700 },
    { "timepoint": 2, "a": 1000, "b": 200 },
    { "timepoint": 4, "a": 3500, "b": 1500 }
  ],
  "groupings": [
    { "type": "MNO", "group": "<10%", "raw": "1" },
    { "type": "IJK", "group": "Moderate", "raw": "23" }
  ]
}
Can I flatten (maybe not the right term) the groupings so their values become top-level fields in a single document? I would like the result to look like:
{
  "id": "xyz-800",
  "site": "xyz",
  "user": 800,
  "mnoGroup": "<10%",
  "mnoRaw": "1",
  "ijkGroup": "Moderate",
  "ijkRaw": "23"
}
In reality I would like the mnoGroup and mnoRaw attributes to be created whether or not a groupings entry with type "MNO" exists. Same with the ijk attributes.
You can use $arrayElemAt to read the groupings array by index in the first $project stage and $ifNull to supply default values in the final $project stage. A little verbose, but let's see what we can do.
db.groupmore.aggregate({
  "$project": {
    _id: 1,
    site: 1,
    user: 1,
    mnoGroup: { $arrayElemAt: ["$groupings", 0] },
    ijkGroup: { $arrayElemAt: ["$groupings", -1] }
  }
}, {
  "$project": {
    _id: 1,
    site: 1,
    user: 1,
    mnoGroup: { $ifNull: ["$mnoGroup.group", "Unspecified"] },
    mnoRaw: { $ifNull: ["$mnoGroup.raw", "Unspecified"] },
    ijkGroup: { $ifNull: ["$ijkGroup.group", "Unspecified"] },
    ijkRaw: { $ifNull: ["$ijkGroup.raw", "Unspecified"] }
  }
})
Sample Output
{ "_id" : "xyz-800", "site" : "xyz", "user" : 800, "mnoGroup" : "<10%", "mnoRaw" : "1", "ijkGroup" : "Moderate", "ijkRaw" : "23" }
{ "_id" : "ert-600", "site" : "ert", "user" : 8600, "mnoGroup" : "Unspecified", "mnoRaw" : "Unspecified", "ijkGroup" : "Unspecified", "ijkRaw" : "Unspecified" }

What is the correct way to do a HAVING in a MongoDB GROUP BY?

For what would be this query in SQL (to find duplicates):
SELECT userId, name FROM col GROUP BY userId, name HAVING COUNT(*)>1
I performed this simple query in MongoDB:
res = db.col.group({
  key: { userId: true, name: true },
  reduce: function(obj, prev) { prev.count++; },
  initial: { count: 0 }
})
I've added a simple JavaScript loop to go over the result set and filter out the fields with a count > 1, like so:
for (i in res) {if (res[i].count>1) printjson(res[i])};
Is there a better way to do this other than using javascript code in the client?
If this is the best/simplest way, say that it is, and this question will help someone :)
New answer using Mongo aggregation framework
After this question was asked and answered, 10gen released MongoDB version 2.2 with the aggregation framework. The new best way to do this query is:
db.col.aggregate([
  { $group: { _id: { userId: "$userId", name: "$name" },
              count: { $sum: 1 } } },
  { $match: { count: { $gt: 1 } } },
  { $project: { _id: 0,
                userId: "$_id.userId",
                name: "$_id.name",
                count: 1 } }
])
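One practical note: on large collections the $group stage can exceed the aggregation pipeline's 100 MB per-stage memory limit. Passing allowDiskUse, a general aggregation option rather than something specific to this answer, lets the stage spill to disk:
db.col.aggregate([
  { $group: { _id: { userId: "$userId", name: "$name" },
              count: { $sum: 1 } } },
  { $match: { count: { $gt: 1 } } },
  { $project: { _id: 0,
                userId: "$_id.userId",
                name: "$_id.name",
                count: 1 } }
], {
  allowDiskUse: true // let $group spill to disk past the 100 MB stage limit
})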
10gen has a handy SQL to Mongo Aggregation conversion chart worth bookmarking.
The answer already given is apt, to be honest, and the use of projection makes it even better thanks to the implicit optimisation working under the hood. I have made a small change, and I will explain the benefit below.
The original command
db.getCollection('so').explain(1).aggregate([
  { $group: { _id: { userId: "$userId", name: "$name" },
              count: { $sum: 1 } } },
  { $match: { count: { $gt: 1 } } },
  { $project: { _id: 0,
                userId: "$_id.userId",
                name: "$_id.name",
                count: 1 } }
])
Parts from the explain plan
{
  "stages" : [
    {
      "$cursor" : {
        "queryPlanner" : {
          "plannerVersion" : 1,
          "namespace" : "5fa42c8b8778717d277f67c4_test.so",
          "indexFilterSet" : false,
          "parsedQuery" : {},
          "queryHash" : "F301762B",
          "planCacheKey" : "F301762B",
          "winningPlan" : {
            "stage" : "PROJECTION_SIMPLE",
            "transformBy" : {
              "name" : 1,
              "userId" : 1,
              "_id" : 0
            },
            "inputStage" : {
              "stage" : "COLLSCAN",
              "direction" : "forward"
            }
          },
          "rejectedPlans" : []
        },
        "executionStats" : {
          "executionSuccess" : true,
          "nReturned" : 6000,
          "executionTimeMillis" : 8,
          "totalKeysExamined" : 0,
          "totalDocsExamined" : 6000,
The sample set is pretty small, just 6000 documents.
This query operates on data in the WiredTiger internal cache, so if the collection is huge, all of it must be brought into the cache for the execution to take place. The WT cache is important, and if this command takes up that much cache space, the cache will have to be sized larger to accommodate other operations.
Now for a small hack: the addition of an index.
db.getCollection('so').createIndex({userId : 1, name : 1})
New Command
db.getCollection('so').explain(1).aggregate([
  { $match: { name: { "$ne": null }, userId: { "$ne": null } } },
  { $group: { _id: { userId: "$userId", name: "$name" },
              count: { $sum: 1 } } },
  { $match: { count: { $gt: 1 } } },
  { $project: { _id: 0,
                userId: "$_id.userId",
                name: "$_id.name",
                count: 1 } }
])
Explain Plan
{
  "stages": [{
    "$cursor": {
      "queryPlanner": {
        "plannerVersion": 1,
        "namespace": "5fa42c8b8778717d277f67c4_test.so",
        "indexFilterSet": false,
        "parsedQuery": {
          "$and": [{
            "name": {
              "$not": { "$eq": null }
            }
          }, {
            "userId": {
              "$not": { "$eq": null }
            }
          }]
        },
        "queryHash": "4EF9C4D5",
        "planCacheKey": "3898FC0A",
        "winningPlan": {
          "stage": "PROJECTION_COVERED",
          "transformBy": {
            "name": 1,
            "userId": 1,
            "_id": 0
          },
          "inputStage": {
            "stage": "IXSCAN",
            "keyPattern": {
              "userId": 1.0,
              "name": 1.0
            },
            "indexName": "userId_1_name_1",
            "isMultiKey": false,
            "multiKeyPaths": {
              "userId": [],
              "name": []
            },
            "isUnique": false,
            "isSparse": false,
            "isPartial": false,
            "indexVersion": 2,
            "direction": "forward",
            "indexBounds": {
              "userId": [
                "[MinKey, undefined)",
                "(null, MaxKey]"
              ],
              "name": [
                "[MinKey, undefined)",
                "(null, MaxKey]"
              ]
            }
          }
        },
        "rejectedPlans": [{
          "stage": "PROJECTION_SIMPLE",
          "transformBy": {
            "name": 1,
            "userId": 1,
            "_id": 0
          },
          "inputStage": {
            "stage": "FETCH",
            "filter": {
              "userId": {
                "$not": { "$eq": null }
              }
            },
            "inputStage": {
              "stage": "IXSCAN",
              "keyPattern": {
                "name": 1.0
              },
              "indexName": "name_1",
              "isMultiKey": false,
              "multiKeyPaths": {
                "name": []
              },
              "isUnique": false,
              "isSparse": false,
              "isPartial": false,
              "indexVersion": 2,
              "direction": "forward",
              "indexBounds": {
                "name": [
                  "[MinKey, undefined)",
                  "(null, MaxKey]"
                ]
              }
            }
          }
        }]
      },
      "executionStats": {
        "executionSuccess": true,
        "nReturned": 6000,
        "executionTimeMillis": 9,
        "totalKeysExamined": 6000,
        "totalDocsExamined": 0,
        "executionStages": {
          "stage": "PROJECTION_COVERED",
          "nReturned": 6000,
Check the PROJECTION_COVERED part: this command is now a covered query, relying entirely on data in the indexes.
This command does not need to pull documents into the WT internal cache because it never touches the collection at all. Check "totalDocsExamined": it is 0. Since the data it needs lives in the index, the index alone drives execution, which is a big positive for a system where the WT cache is already under pressure from other operations.
If the requirement is ever to search for specific names rather than scan the whole collection, this index becomes useful for that too.
The disadvantage here is the addition of an index. If the index is utilised by other operations as well there is no real downside, but if it is an extra addition it will take more space for the index in the cache, and writes are impacted marginally.
On the performance front, for 6000 records the time shown is 1 ms more, but for a larger dataset this may vary. Note that the sample documents I inserted have just three fields: the two used here plus the default _id. If the collection has larger documents, the execution time of the original command will increase, and so will the volume it occupies in the cache.