Sort exceeds limit on text search? - mongodb

I have defined an index like this:
db.imageProperties.createIndex(
  {
    "imageProperties.cameraMaker": "text",
    "imageProperties.cameraModel": "text",
    "imageProperties.dateTimeOriginal": -1
  },
  { name: "TextIndex" }
)
But, when I try to run a query with a sort like this:
db.imageProperties.find( { $text: { $search: "nikon" } }, {"imagePath" : 1, _id: 0 } ).sort( { "imageProperties.dateTimeOriginal": -1 } )
I get this error:
Error: error: {
  "ok" : 0,
  "errmsg" : "Executor error during find command :: caused by :: Sort operation used more than the maximum 33554432 bytes of RAM. Add an index, or specify a smaller limit.",
  "code" : 96,
  "codeName" : "OperationFailed"
}
It is my understanding from reading the documentation that it would be possible to combine text search with sorting by creating a combined index as I have done.
This is the output from .explain() on the above query:
> db.imageProperties.find( { $text: { $search: "nikon" } }, {"imagePath" : 1, _id: 0 } ).sort( { "imageProperties.dateTimeOriginal": -1 } ).explain()
{
"queryPlanner": {
"plannerVersion": 1,
"namespace": "olavt-images.imageProperties",
"indexFilterSet": false,
"parsedQuery": {
"$text": {
"$search": "nikon",
"$language": "english",
"$caseSensitive": false,
"$diacriticSensitive": false
}
},
"queryHash": "1DCFCE0B",
"planCacheKey": "650B3A8E",
"winningPlan": {
"stage": "PROJECTION_SIMPLE",
"transformBy": {
"imagePath": 1,
"_id": 0
},
"inputStage": {
"stage": "SORT",
"sortPattern": {
"imageProperties.dateTimeOriginal": -1
},
"inputStage": {
"stage": "SORT_KEY_GENERATOR",
"inputStage": {
"stage": "TEXT",
"indexPrefix": {
},
"indexName": "TextIndex",
"parsedTextQuery": {
"terms": [
"nikon"
],
"negatedTerms": [],
"phrases": [],
"negatedPhrases": []
},
"textIndexVersion": 3,
"inputStage": {
"stage": "TEXT_MATCH",
"inputStage": {
"stage": "FETCH",
"inputStage": {
"stage": "OR",
"inputStage": {
"stage": "IXSCAN",
"keyPattern": {
"_fts": "text",
"_ftsx": 1,
"imageProperties.dateTimeOriginal": -1
},
"indexName": "TextIndex",
"isMultiKey": true,
"isUnique": false,
"isSparse": false,
"isPartial": false,
"indexVersion": 2,
"direction": "backward",
"indexBounds": {
}
}
}
}
}
}
}
}
},
"rejectedPlans": []
},
"serverInfo": {
"host": "4794df1ed9c4",
"port": 27017,
"version": "4.2.5",
"gitVersion": "2261279b51ea13df08ae708ff278f0679c59dc32"
},
"ok": 1
}
How can I get the desired behavior?

The error says that sorting the result set requires more memory than the server is configured to allow.
The imagePath field that you want to project is not covered by TextIndex; try adding a new index:
db.imageProperties.createIndex(
  {
    "imageProperties.cameraMaker": "text",
    "imageProperties.cameraModel": "text",
    "imageProperties.dateTimeOriginal": -1,
    "imagePath": 1
  }
)
Then try the following steps:
Check that the indexes are created successfully by running:
db.imageProperties.getIndexes()
Check whether the correct index is being used:
db.imageProperties.find( { $text: { $search: "nikon" } }, {"imagePath" : 1, _id: 0 } )
.sort( { "imageProperties.dateTimeOriginal": -1 } ).explain()
If you only need a limited number of results, add a limit as well:
db.imageProperties.find( { $text: { $search: "nikon" } }, {"imagePath" : 1, _id: 0 } )
.sort( { "imageProperties.dateTimeOriginal": -1 } ).limit(100)
You can also allow the sort to spill to disk by using the aggregation framework with allowDiskUse:
db.imageProperties.aggregate([{
$match: { $text: { $search: "nikon" } }
}, {
$sort: { "imageProperties.dateTimeOriginal": -1 }
} , {
$project: { imagePath: 1 }
}], {
allowDiskUse: true
})
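As a last resort, assuming you administer the server, you can raise the blocking-sort memory limit itself. On this server version (4.2) the parameter is internalQueryExecMaxBlockingSortBytes (renamed internalQueryMaxBlockingSortMemoryUsageBytes in 4.4); the value below is an illustrative 10x bump of the 32 MB default, not a recommendation:

```js
// Raise the blocking-sort limit from 32 MB to ~320 MB (MongoDB 4.2 parameter name)
db.adminCommand({ setParameter: 1, internalQueryExecMaxBlockingSortBytes: 335544320 })
```

Prefer the index or allowDiskUse approaches above; raising the limit only moves the ceiling.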

Related

Can I write an index for an $or query in mongodb?

I have a few different but similar mongodb queries that I need to write an index for. Is this the correct way to do it?
First Query:
{
  $or: [
    {_id: "..."},
    {linkedListingID: "..."}
  ]
}
Second Query:
{
  $or: [
    {_id: "..."},
    {linkedListingID: {$in: ["...", "..."]}}
  ]
}
Third Query:
{
  $or: [
    {
      _id: {$in: ["...", "..."]},
      linkedListingID: {$exists: false}
    },
    {
      linkedListingID: {$in: ["...", "..."]}
    }
  ]
}
Index:
Listing.index(
{
_id: 1,
linkedListingID: 1
},
{name: "index_name"}
);
As per the $or documentation:
When using indexes with $or queries, each clause of an $or can use its own index.
That means it will not use a compound index! To support your query, rather than a compound index, create one index on _id and another on linkedListingID:
Listing.index({ _id: 1 });
Listing.index({ linkedListingID: 1 });
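To see why separate indexes work, here is a minimal in-memory sketch in plain Node.js; the data and the Map-based "indexes" are hypothetical stand-ins for B-tree lookups:

```javascript
// Two "indexes": field value -> array of matching _id values
// (roughly what one B-tree lookup per $or clause yields)
const idIndex = new Map([["a1", ["a1"]], ["a2", ["a2"]]]);
const linkedIndex = new Map([["x9", ["a2", "a3"]]]);

// $or: serve each clause from its own index, then de-duplicate by _id
function orQuery(idValue, linkedValue) {
  const hits = new Set();
  for (const id of idIndex.get(idValue) ?? []) hits.add(id);
  for (const id of linkedIndex.get(linkedValue) ?? []) hits.add(id);
  return [...hits]; // the OR stage in the explain output performs this union
}

console.log(orQuery("a1", "x9")); // ["a1", "a2", "a3"]
```

A single compound index cannot serve the second clause alone, which is why the planner falls back to COLLSCAN with it.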
For more detail, run explain() on your query and check executionStats > executionStages > inputStage.
Explain output with the compound index:
Listing.index({ _id: 1, linkedListingID: 1 }, {name: "index_name"});
Result: this scans the full collection ("stage": "COLLSCAN")!
"winningPlan": {
"inputStage": {
"direction": "forward",
"filter": {
"$or": [
{
"_id": {
"$eq": "abc"
}
},
{
"linkedListingID": {
"$eq": "bca"
}
}
]
},
"stage": "COLLSCAN"
},
"stage": "SUBPLAN"
}
Explain output with single indexes: _id does not need a separate index here, because _id has a unique index by default.
Listing.index({ linkedListingID: 1 });
Result: each clause now uses an index individually ("stage": "IXSCAN")!
"winningPlan": {
"inputStage": {
"inputStage": {
"inputStages": [
{
"direction": "forward",
"indexBounds": {
"_id": [
"[\"abc\", \"abc\"]"
]
},
"indexName": "_id_",
"indexVersion": 2,
"isMultiKey": false,
"isPartial": false,
"isSparse": false,
"isUnique": true,
"keyPattern": {
"_id": 1
},
"multiKeyPaths": {
"_id": []
},
"stage": "IXSCAN"
},
{
"direction": "forward",
"indexBounds": {
"linkedListingID": [
"[\"bca\", \"bca\"]"
]
},
"indexName": "_linkedListingID",
"indexVersion": 2,
"isMultiKey": false,
"isPartial": false,
"isSparse": false,
"isUnique": false,
"keyPattern": {
"linkedListingID": 1
},
"multiKeyPaths": {
"linkedListingID": []
},
"stage": "IXSCAN"
}
],
"stage": "OR"
},
"stage": "FETCH"
},
"stage": "SUBPLAN"
}
By default, MongoDB creates a unique index on the _id field during the creation of a collection.
Your query could be as simple as:
{_id: "..."}
MongoDB ensures that unique-indexed fields do not store duplicate values; for that reason, a query on _id will always retrieve a single document.

How can I put $group condition to $match? [duplicate]

For what would be this query in SQL (to find duplicates):
SELECT userId, name FROM col GROUP BY userId, name HAVING COUNT(*)>1
I performed this simple query in MongoDB:
res = db.col.group({key:{userId:true,name:true},
reduce: function(obj,prev) {prev.count++;},
initial: {count:0}})
I've added a simple Javascript loop to go over the result set, and performed a filter to find all the fields with a count > 1 there, like so:
for (i in res) {if (res[i].count>1) printjson(res[i])};
Is there a better way to do this other than using javascript code in the client?
If this is the best/simplest way, say that it is, and this question will help someone :)
New answer using Mongo aggregation framework
After this question was asked and answered, 10gen released MongoDB 2.2 with the aggregation framework. The new best way to run this query is:
db.col.aggregate( [
{ $group: { _id: { userId: "$userId", name: "$name" },
count: { $sum: 1 } } },
{ $match: { count: { $gt: 1 } } },
{ $project: { _id: 0,
userId: "$_id.userId",
name: "$_id.name",
count: 1}}
] )
10gen has a handy SQL to Mongo Aggregation conversion chart worth bookmarking.
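The pipeline's grouping logic (group by userId and name, count, keep count > 1) can be sketched in plain Node.js; this is an in-memory illustration, not how the server executes it:

```javascript
// In-memory equivalent of $group + $match(count > 1) + $project
function findDuplicates(docs) {
  const counts = new Map(); // "$group": (userId, name) key -> count
  for (const d of docs) {
    const key = JSON.stringify([d.userId, d.name]);
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  const out = [];
  for (const [key, count] of counts) {
    if (count > 1) { // "$match": HAVING COUNT(*) > 1
      const [userId, name] = JSON.parse(key);
      out.push({ userId, name, count }); // "$project"
    }
  }
  return out;
}

const docs = [
  { userId: 1, name: "a" }, { userId: 1, name: "a" }, { userId: 2, name: "b" }
];
console.log(findDuplicates(docs)); // [{ userId: 1, name: "a", count: 2 }]
```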
The answer already given is apt, to be honest, and the use of projection makes it even better thanks to implicit optimisation under the hood. I have made a small change, and I will explain the upside below.
The original command
db.getCollection('so').explain(1).aggregate( [
{ $group: { _id: { userId: "$userId", name: "$name" },
count: { $sum: 1 } } },
{ $match: { count: { $gt: 1 } } },
{ $project: { _id: 0,
userId: "$_id.userId",
name: "$_id.name",
count: 1}}
] )
Parts from the explain plan
{
"stages" : [
{
"$cursor" : {
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "5fa42c8b8778717d277f67c4_test.so",
"indexFilterSet" : false,
"parsedQuery" : {},
"queryHash" : "F301762B",
"planCacheKey" : "F301762B",
"winningPlan" : {
"stage" : "PROJECTION_SIMPLE",
"transformBy" : {
"name" : 1,
"userId" : 1,
"_id" : 0
},
"inputStage" : {
"stage" : "COLLSCAN",
"direction" : "forward"
}
},
"rejectedPlans" : []
},
"executionStats" : {
"executionSuccess" : true,
"nReturned" : 6000,
"executionTimeMillis" : 8,
"totalKeysExamined" : 0,
"totalDocsExamined" : 6000,
The sample set is pretty small, just 6000 documents.
This query works against data in the WiredTiger internal cache, so if the collection is huge, all of it has to be held in the internal cache for the execution to take place. The WT cache is precious, and if this command takes up that much cache space, the cache will have to be sized larger to accommodate other operations.
Now for a small hack: the addition of an index.
db.getCollection('so').createIndex({userId : 1, name : 1})
New Command
db.getCollection('so').explain(1).aggregate( [
{$match : {name :{ "$ne" : null }, userId : { "$ne" : null } }},
{ $group: { _id: { userId: "$userId", name: "$name" },
count: { $sum: 1 } } },
{ $match: { count: { $gt: 1 } } },
{ $project: { _id: 0,
userId: "$_id.userId",
name: "$_id.name",
count: 1}}
] )
Explain Plan
{
"stages": [{
"$cursor": {
"queryPlanner": {
"plannerVersion": 1,
"namespace": "5fa42c8b8778717d277f67c4_test.so",
"indexFilterSet": false,
"parsedQuery": {
"$and": [{
"name": {
"$not": {
"$eq": null
}
}
},
{
"userId": {
"$not": {
"$eq": null
}
}
}
]
},
"queryHash": "4EF9C4D5",
"planCacheKey": "3898FC0A",
"winningPlan": {
"stage": "PROJECTION_COVERED",
"transformBy": {
"name": 1,
"userId": 1,
"_id": 0
},
"inputStage": {
"stage": "IXSCAN",
"keyPattern": {
"userId": 1.0,
"name": 1.0
},
"indexName": "userId_1_name_1",
"isMultiKey": false,
"multiKeyPaths": {
"userId": [],
"name": []
},
"isUnique": false,
"isSparse": false,
"isPartial": false,
"indexVersion": 2,
"direction": "forward",
"indexBounds": {
"userId": [
"[MinKey, undefined)",
"(null, MaxKey]"
],
"name": [
"[MinKey, undefined)",
"(null, MaxKey]"
]
}
}
},
"rejectedPlans": [{
"stage": "PROJECTION_SIMPLE",
"transformBy": {
"name": 1,
"userId": 1,
"_id": 0
},
"inputStage": {
"stage": "FETCH",
"filter": {
"userId": {
"$not": {
"$eq": null
}
}
},
"inputStage": {
"stage": "IXSCAN",
"keyPattern": {
"name": 1.0
},
"indexName": "name_1",
"isMultiKey": false,
"multiKeyPaths": {
"name": []
},
"isUnique": false,
"isSparse": false,
"isPartial": false,
"indexVersion": 2,
"direction": "forward",
"indexBounds": {
"name": [
"[MinKey, undefined)",
"(null, MaxKey]"
]
}
}
}
}]
},
"executionStats": {
"executionSuccess": true,
"nReturned": 6000,
"executionTimeMillis": 9,
"totalKeysExamined": 6000,
"totalDocsExamined": 0,
"executionStages": {
"stage": "PROJECTION_COVERED",
"nReturned": 6000,
Check the PROJECTION_COVERED part: this command is a covered query, relying only on data in the indexes.
This command won't need to pull documents into the WT internal cache because it never touches them: totalDocsExamined is 0. Since the data it needs is in the index, it uses the index for execution, which is a big positive for a system where the WT cache is already under pressure from other operations.
If the requirement is to search for specific names rather than scan the whole collection, this becomes even more useful :D
The disadvantage here is the additional index: if the index is utilised by other operations as well, there is no real downside, but if it is an extra addition it takes more space in the cache, and writes are marginally impacted.
*On the performance front, for 6000 records the time shown is 1 ms more, but for a larger dataset this may vary. Note that the sample documents I inserted have just three fields: the two used here plus the default _id. With larger documents, the execution time of the original command will increase, as will the volume it occupies in the cache.

Why is a query with indexes not a covered query?

You would like to perform a covered query on the example collection. You have the following indexes:
{ name : 1, dob : 1 }
{ _id : 1 }
{ hair : 1, name : 1 }
Why is the below query not a covered query?
db.example.find( { name : { $in : [ "Bart", "Homer" ] } }, {_id : 0, hair : 1, name : 1} )
While this one is:
db.example.find( { name : { $in : [ "Bart", "Homer" ] } }, {_id : 0, dob : 1, name : 1} )
According to the documentation on index prefixes, the query
db.example.find( { name : { $in : [ "Bart", "Homer" ] } } );
will be covered by
db.example.createIndex({ "name": 1, "dob": 1 });
but not by
db.example.createIndex({ "hair": 1, "name": 1 });
since { "name": 1 } is not a prefix of { "hair": 1, "name": 1 }.
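The prefix rule can be stated mechanically: the query's fields must form a leading prefix of the index's key pattern for the index to be used, and every projected field must also be in the index for the query to be covered. A deliberately simplified sketch (hypothetical helper, equality predicates only, not a MongoDB API):

```javascript
// Does `index` (ordered list of key fields) support and cover the query?
// queryFields: fields filtered on; projFields: fields returned (_id excluded)
function isCovered(index, queryFields, projFields) {
  // prefix rule: every query field must occupy the leading index positions
  const prefix = index.slice(0, queryFields.length);
  const usesIndex = queryFields.every(f => prefix.includes(f));
  // covered: everything projected must already live in the index entries
  const covered = projFields.every(f => index.includes(f));
  return usesIndex && covered;
}

console.log(isCovered(["name", "dob"], ["name"], ["dob", "name"]));  // true
console.log(isCovered(["name", "dob"], ["name"], ["hair", "name"])); // false: hair not in index
console.log(isCovered(["hair", "name"], ["name"], ["hair", "name"])); // false: name is not a prefix
```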
Sample
{ "_id": 1, "name": "Bart", "hair": "triangles", "dob": "1985-01-01" }
{ "_id": 2, "name": "Homer", "hair": "two", "dob": "1960-01-01" }
Query 1
> db.example.find(
> { name: { $in: [ "Bart", "Homer" ] } },
> { _id: 0, hair: 1, name: 1 }
> ).explain("executionStats");
...
"executionStats": {
"totalKeysExamined": 2,
"totalDocsExamined": 2,
"executionStages": {
"stage": "PROJECTION",
"inputStage": {
"stage": "FETCH",
"inputStage": {
"stage": "IXSCAN",
"indexName": "name_1_dob_1",
...
As you can see, the name_1_dob_1 index was used (since { "name": 1 } is the prefix of { "name": 1, "dob": 1 }): 2 entries were examined in the index ("totalKeysExamined": 2), and then 2 documents were fetched from the collection ("totalDocsExamined": 2), because the name_1_dob_1 index does not contain the hair field that the projection needs.
Query 2
> db.example.find(
> { name: { $in: [ "Bart", "Homer" ] } },
> { _id: 0, dob: 1, name: 1 }
> ).explain("executionStats");
...
"executionStats": {
"totalKeysExamined": 2,
"totalDocsExamined": 0,
"executionStages": {
"stage": "PROJECTION",
"inputStage": {
"stage": "IXSCAN",
"indexName": "name_1_dob_1",
...
As in Query 1, the name_1_dob_1 index was used and 2 entries were examined in the index ("totalKeysExamined": 2), but there was no call to the collection ("totalDocsExamined": 0), since the name_1_dob_1 index contains both dob and name and there is no need to fetch anything more from the collection.

How to merge/join mongodb aggregate?

Given this dataset:
db.calls.insert([{
"agent": 2,
"isFromOutside": true,
"duration": 304
}, {
"agent": 1,
"isFromOutside": false,
"duration": 811
}, {
"agent": 0,
"isFromOutside": true,
"duration": 753
}, {
"agent": 1,
"isFromOutside": false,
"duration": 593
}, {
"agent": 3,
"isFromOutside": true,
"duration": 263
}, {
"agent": 0,
"isFromOutside": true,
"duration": 995
}, {
"agent": 0,
"isFromOutside": false,
"duration": 210
}, {
"agent": 1,
"isFromOutside": false,
"duration": 737
}, {
"agent": 2,
"isFromOutside": false,
"duration": 170
}, {
"agent": 0,
"isFromOutside": false,
"duration": 487
}])
I have two aggregate queries that give the total duration for each agent and the count of outgoing calls for each agent:
get outGoingCalls table:
db.calls.aggregate([
{ $match: { duration :{ $gt: 0 }, isFromOutside: false } },
{ $group: { _id: "$agent", outGoingCalls: { $sum: 1 } } },
{ $sort: { outGoingCalls: -1 } }
])
get totalDuration table:
db.calls.aggregate([
{ $group: { _id: "$agent", totalDuration: { $sum: "$duration" } } },
{ $sort: {totalDuration: -1 } }
])
How to merge/join these tables (or do only one aggregation) to have something like this:
[
{_id: 0, totalDuration: ..., outGoingCalls: ...},
{_id: 1, totalDuration: ..., outGoingCalls: ...},
{_id: 2, totalDuration: ..., outGoingCalls: ...},
...
]
Try the following aggregation pipeline:
db.calls.aggregate([
{
"$group": {
"_id": "$agent",
"outGoingCalls": {
"$sum": {
"$cond": [
{
"$and": [
{"$gt": ["$duration", 0 ]},
{"$eq": ["$isFromOutside", false ]}
]
},
1,
0
]
}
},
"totalDuration": { "$sum": "$duration" }
}
},
{
"$sort": {
"totalDuration": -1,
"outGoingCalls": -1
}
}
])
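As a sanity check, the $cond logic can be replayed in plain Node.js over the question's dataset; this sketch folds the $cond into an if and reproduces the output above:

```javascript
// The $group with $cond, in plain JavaScript over the question's dataset
const calls = [
  { agent: 2, isFromOutside: true,  duration: 304 },
  { agent: 1, isFromOutside: false, duration: 811 },
  { agent: 0, isFromOutside: true,  duration: 753 },
  { agent: 1, isFromOutside: false, duration: 593 },
  { agent: 3, isFromOutside: true,  duration: 263 },
  { agent: 0, isFromOutside: true,  duration: 995 },
  { agent: 0, isFromOutside: false, duration: 210 },
  { agent: 1, isFromOutside: false, duration: 737 },
  { agent: 2, isFromOutside: false, duration: 170 },
  { agent: 0, isFromOutside: false, duration: 487 },
];

const byAgent = new Map();
for (const c of calls) {
  const g = byAgent.get(c.agent) ?? { _id: c.agent, outGoingCalls: 0, totalDuration: 0 };
  // $cond: count the call only when duration > 0 AND isFromOutside is false
  if (c.duration > 0 && c.isFromOutside === false) g.outGoingCalls += 1;
  g.totalDuration += c.duration; // unconditional $sum
  byAgent.set(c.agent, g);
}
// $sort on totalDuration descending
const result = [...byAgent.values()].sort((a, b) => b.totalDuration - a.totalDuration);
console.log(result);
```

Grouping once and pushing the filter into a $cond is what lets a single pipeline replace the two separate aggregations.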
Output:
/* 0 */
{
"result" : [
{
"_id" : 0,
"outGoingCalls" : 2,
"totalDuration" : 2445
},
{
"_id" : 1,
"outGoingCalls" : 3,
"totalDuration" : 2141
},
{
"_id" : 2,
"outGoingCalls" : 1,
"totalDuration" : 474
},
{
"_id" : 3,
"outGoingCalls" : 0,
"totalDuration" : 263
}
],
"ok" : 1
}
