Why are any objects being scanned here? - mongodb

I have an index:
{"indices.textLc":1, group:1, lc:1, wordCount:1, pattern:1, clExists:1}
and Morphia generates queries like:
{
$and: [{
lc: "eng"
},
{
$or: [{
group: "cn"
},
{
group: "all"
}]
},
{
"indices.textLc": {
$in: ["media strengthening", "strengthening", "media"]
}
},
{
wordCount: {
$gte: 1
}
},
{
wordCount: {
$lte: 2
}
}]
}
and explain gives:
{
"cursor" : "BtreeCursor indices.textLc_1_group_1_lc_1_wordCount_1_pattern_1_clExists_1 multi",
"nscanned" : 20287,
"nscannedObjects" : 20272,
"n" : 22,
"millis" : 677,
"nYields" : 0,
"nChunkSkips" : 0,
"isMultiKey" : true,
"indexOnly" : false,
"indexBounds" : {
"indices.textLc" : [
[
"media",
"media"
],
[
"media strengthening",
"media strengthening"
],
[
"strengthening",
"strengthening"
]
],
"group" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
],
"lc" : [
[
"eng",
"eng"
]
],
"wordCount" : [
[
1,
1.7976931348623157e+308
]
],
"pattern" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
],
"clExists" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
]
}
Firstly, I don't understand why any scanning is required since everything is available in the index. More specifically, why does the wordCount part of the indexBounds not look like:
"wordCount" : [
[
1,
2
]
],
Update 2012-03-20: If it helps explain anything, I'm running MongoDB 2.0.3

Just because every field in your query appears in your compound index does not mean the index can serve every clause of the query. There are a few things to consider:
With the exception of top-level $or clauses, which can use one index per clause, every MongoDB query can use at most one index.
Compound indexes only work if the index fields can be used in order, meaning your query allows for filtering on the first index field first, the second next, and so on. So if you have an index {a:1, b:1}, a query {b:"Hi!"} would not use the index even though the field is in the compound index.
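The prefix rule above can be illustrated with a deliberately simplified model in plain JavaScript. This is only a sketch of the rule as stated in this answer: `usableIndexPrefix` is a hypothetical helper, it looks only at which fields a query constrains, and the real query planner is more nuanced (operators, sort order, and multikey indexes all matter).

```javascript
// Simplified model: which leading fields of a compound index can a query use?
// Assumption: only the set of constrained fields matters; operators, sorting
// and multikey behaviour are ignored.
function usableIndexPrefix(indexFields, queryFields) {
  const constrained = new Set(queryFields);
  const prefix = [];
  for (const field of indexFields) {
    if (!constrained.has(field)) break; // the first gap ends the usable prefix
    prefix.push(field);
  }
  return prefix;
}

console.log(usableIndexPrefix(["a", "b"], ["a", "b"])); // both fields are usable
console.log(usableIndexPrefix(["a", "b"], ["a"]));      // only the leading field
console.log(usableIndexPrefix(["a", "b"], ["b"]));      // no usable prefix
```

Under this model, a query on b alone gets an empty prefix for the index {a:1, b:1}, which corresponds to the {b:"Hi!"} example.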
Now, the reason your query requires a scan is that your index can only optimize the query execution plan for the "indices.textLc" field (your first index field) and, in this particular case, "lc", because it's a separate clause in your $and.
The "wordCount" part of the explain should actually read:
"wordCount" : [
[
1,
2
]
]
I just tested it and it does on my machine so I think something's going wrong with your Morphia/mapping solution there.
Compound indexes and complicated queries such as yours are a tricky subject. I don't have time now to look at your query and index and see if it can be optimized. I'll revisit tonight and help you out if I can.

Related

Mongo doesn't optimize $or query by combining two IXSCANs

I have an orders collection with the following index, among others:
{location: 1, completedDate: 1, estimatedProductionDate: 1, estimatedCompletionDate: 1}
I'm performing the following query:
db.orders.find({
status: {$in: [1, 2, 3]},
location: "PA",
$or: [
{completedDate: {$lt: ISODate("2017-08-22T04:59:59.999Z")}},
{
completedDate: null,
estimatedProductionDate: {$lt: ISODate("2017-08-22T04:59:59.999Z")}
}
]
}).explain()
I was hoping this would perform an efficient IXSCAN for each branch of the $or, and then combine the results:
{completedDate: {$lt: ISODate("2017-08-22T04:59:59.999Z")}}
"indexBounds" : {
"location" : [
"[\"TX\", \"TX\"]"
],
"completedDate" : [
"[MinKey, ISODate("2017-08-22T04:59:59.999Z")]"
],
"estimatedProductionDate" : [
"[MinKey, MaxKey]"
],
"estimatedCompletionDate" : [
"[MinKey, MaxKey]"
]
}
{
completedDate: null,
estimatedProductionDate: {$lt: ISODate("2017-08-22T04:59:59.999Z")}
}
"indexBounds" : {
"location" : [
"[\"TX\", \"TX\"]"
],
"completedDate" : [
"[null, null]"
],
"estimatedProductionDate" : [
"[MinKey, ISODate("2017-08-22T04:59:59.999Z")]"
],
"estimatedCompletionDate" : [
"[MinKey, MaxKey]"
]
}
Instead, it only bounds the location in the IXSCAN, and does the rest of the filtering during FETCH. Is there any way to optimize this query without splitting it into two separate queries?
"winningPlan" : {
"stage" : "FETCH",
"filter" : {
"$and" : [
{
"$or" : [
{
"$and" : [
{
"completedDate" : {
"$eq" : null
}
},
{
"estimatedProductionDate" : {
"$lt" : "2017-08-22T04:59:59.999Z"
}
}
]
},
{
"completedDate" : {
"$lt" : "2017-08-22T04:59:59.999Z"
}
}
]
},
{
"status" : {
"$in" : [
1,
2,
3
]
}
}
]
},
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"location" : 1,
"completedDate" : 1,
"estimatedProductionDate" : 1,
"estimatedCompletionDate" : 1
},
"indexName" : "location_1_completedDate_1_estimatedProductionDate_1_estimatedCompletionDate_1",
"isMultiKey" : false,
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 1,
"direction" : "forward",
"indexBounds" : {
"location" : [
"[\"TX\", \"TX\"]"
],
"completedDate" : [
"[MinKey, MaxKey]"
],
"estimatedProductionDate" : [
"[MinKey, MaxKey]"
],
"estimatedCompletionDate" : [
"[MinKey, MaxKey]"
]
}
}
},
There are three issues that are immediately apparent:
Your index
I'm not sure about the other indexes you have, but your query is of the shape:
{
status:1,
location:1,
$or: [
{completedDate:1},
{completedDate:1, estimatedProductionDate:1}
]
}
However, your index does not contain the status field. You would need it in the index to maximize index use.
Your $or query
To paraphrase the page $or Clauses and Indexes:
... for MongoDB to use indexes to evaluate an $or expression, all the clauses in the $or expression must be supported by indexes. Otherwise, MongoDB will perform a collection scan.
To put it simply, efficient $or queries in MongoDB would require the $or term to be the top-level term, with each part of the term supported by an index.
For example, you may find the performance of the following index and query to be a bit better:
db.orders.createIndex({
status:1,
location:1,
completedDate:1,
estimatedProductionDate:1
})
db.orders.explain().find({
$or: [
{
status: {$in: [1, 2, 3]},
location: "PA",
completedDate: {$lt: ISODate("2017-08-22T04:59:59.999Z")}},
{
status: {$in: [1, 2, 3]},
location: "PA",
completedDate: null,
estimatedProductionDate: {$lt: ISODate("2017-08-22T04:59:59.999Z")}
}
]
})
The reason is that MongoDB treats each term in an $or query as a separate query. Thus, each term can use its own index.
Note that the order of fields in the index I proposed above follows the order of the fields in the query.
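The rewrite used above (moving the shared predicates into every $or branch so that $or is the only top-level term) can be done mechanically. The following is a minimal sketch in plain JavaScript; `hoistOr` is a hypothetical helper, not a MongoDB or driver function, and it assumes that no field appears both at the top level and inside a branch.

```javascript
// Push the shared top-level predicates into every $or branch so that $or
// becomes the only top-level term of the query.
// Assumption: no field is present both at the top level and inside a branch;
// if one were, the branch value would silently win here.
function hoistOr(query) {
  const { $or: branches, ...shared } = query;
  if (!Array.isArray(branches)) return query; // nothing to rewrite
  return { $or: branches.map((branch) => ({ ...shared, ...branch })) };
}

const cutoff = new Date("2017-08-22T04:59:59.999Z");
const original = {
  status: { $in: [1, 2, 3] },
  location: "PA",
  $or: [
    { completedDate: { $lt: cutoff } },
    { completedDate: null, estimatedProductionDate: { $lt: cutoff } },
  ],
};

const rewritten = hoistOr(original);
console.log(JSON.stringify(rewritten, null, 2));
```

The resulting object has the same shape as the hand-written query above and could be passed to find() as-is.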
However, this is still not optimal, because MongoDB has to perform a fetch with filter: {completedDate: {$eq: null}} after the index scan for a query with completedDate: null. The reason for this is subtle and best explained here:
The document {} generates the index key {"": null} for the index with key pattern {"a.b": 1}.
The document {a: []} also generates the index key {"": null} for the index with key pattern {"a.b": 1}.
The document {} matches the query {"a.b": null}.
The document {a: []} does not match the query {"a.b": null}.
Therefore, a query {"a.b": null} that is answered by an index with key
pattern {"a.b": 1} must fetch the document and re-check the predicate,
in order to ensure that the document {} is included in the result set
and that the document {a: []} is not included in the result set.
To maximize index use, you may be better off assigning an actual value to the completedDate field instead of setting it to null.

mongo db how to write a function in an query maybe aggregation?

The question is: calculate the average age of the users who have more than 3 strengths listed.
One of the documents looks like this:
{
"_id" : 1.0,
"user_id" : "jshaw0",
"first_name" : "Judy",
"last_name" : "Shaw",
"email" : "jshaw0#merriam-webster.com",
"age" : 39.0,
"status" : "disabled",
"join_date" : "2016-09-05",
"last_login_date" : "2016-09-30 23:59:36 -0400",
"address" : {
"city" : "Deskle",
"province" : "PEI"
},
"strengths" : [
"star schema",
"dw planning",
"sql",
"mongo queries"
],
"courses" : [
{
"code" : "CSIS2300",
"total_questions" : 118.0,
"correct_answers" : 107.0,
"incorect_answers" : 11.0
},
{
"code" : "CSIS3300",
"total_questions" : 101.0,
"correct_answers" : 34.0,
"incorect_answers" : 67.0
}
]
}
I know I need to count how many strengths each document has, compare that count with $gt, and then calculate the average age.
However, I don't know how to combine the two operations, count and average, in one query. Do I need to use aggregation? If so, how?
Thanks so much
Use $redact to filter on the array size and $group to calculate the average:
db.collection.aggregate([{
"$redact": {
"$cond": [
{ "$gt": [{ "$size": "$strengths" }, 3] },
"$$KEEP",
"$$PRUNE"
]
}
}, {
$group: {
_id: 1,
average: { $avg: "$age" }
}
}])
The $redact stage checks whether the strengths array has more than 3 elements: it will $$KEEP the documents that match this condition and $$PRUNE the ones that don't. See the $redact documentation.
The $group stage then simply computes the average with $avg.
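To make clear what the two stages compute, here is a plain JavaScript equivalent that runs without a database. It is only an illustration: `averageAgeWithStrengths` is a hypothetical helper, and apart from "jshaw0" the sample users are made up.

```javascript
// Plain JavaScript equivalent of the pipeline: keep the users with more than
// `minExclusive` strengths (the $redact step), then average their ages (the
// $group/$avg step).
function averageAgeWithStrengths(users, minExclusive = 3) {
  const kept = users.filter(
    (u) => Array.isArray(u.strengths) && u.strengths.length > minExclusive
  );
  if (kept.length === 0) return null; // no matching users, nothing to average
  const total = kept.reduce((sum, u) => sum + u.age, 0);
  return total / kept.length;
}

// Hypothetical sample data; only "jshaw0" is taken from the question.
const users = [
  { user_id: "jshaw0", age: 39, strengths: ["star schema", "dw planning", "sql", "mongo queries"] },
  { user_id: "example1", age: 25, strengths: ["sql"] },
  { user_id: "example2", age: 41, strengths: ["sql", "etl", "olap", "nosql", "python"] },
];

console.log(averageAgeWithStrengths(users)); // (39 + 41) / 2 = 40
```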

Redact in mongodb seems obscure to me

I'm struggling with $redact right now and I'm not sure I understand it.
I just read the docs and tried to use $redact on a collection called grades (it comes from the MongoDB online training).
A document in the "grades" collection looks like this:
{
"_id" : ObjectId("50b59cd75bed76f46522c34e"),
"student_id" : 0,
"class_id" : 2,
"scores" : [
{
"type" : "exam",
"score" : 57.92947112575566
},
{
"type" : "quiz",
"score" : 21.24542588206755
},
{
"type" : "homework",
"score" : 68.19567810587429
},
{
"type" : "homework",
"score" : 67.95019716560351
},
{
"type" : "homework",
"score" : 18.81037253352722
}
]
}
I use the following query :
db.grades.aggregate([
{ $match: { student_id: 0 } },
{
$redact: {
$cond: {
if: { $eq: [ "$type" , "exam" ] },
then: "$$PRUNE",
else: "$$DESCEND"
}
}
}
]
);
With this query, each time an exam is found, that sub-document should be excluded. And it works; the result is:
{
"_id" : ObjectId("50b59cd75bed76f46522c34e"),
"student_id" : 0,
"class_id" : 2,
"scores" : [
{
"type" : "quiz",
"score" : 21.24542588206755
},
{
"type" : "homework",
"score" : 68.19567810587429
},
{
"type" : "homework",
"score" : 67.95019716560351
},
{
"type" : "homework",
"score" : 18.81037253352722
}
]
}
But if I invert the condition, I expect only the exams to be kept in the result:
if: { $eq: [ "$type" , "exam" ] },
then: "$$DESCEND",
else: "$$PRUNE"
However, the result is empty.
I don't understand why the sub-documents of type "exam" are not included.
The $redact stage starts at the root document and its fields, and only when that document fulfills the condition to $$DESCEND does it examine the sub-documents included in that document. That means the first thing $redact does with your document is examine this:
{
"_id" : ObjectId("50b59cd75bed76f46522c34e"),
"student_id" : 0,
"class_id" : 2,
"scores" : [] // Some array. I will look at this later.
}
It doesn't even find a type field here, so $eq: [ "$type" , "exam" ] is false. What did you tell $redact to do when the condition is false? else: "$$PRUNE", so the whole document is pruned before the sub-documents are examined.
As a workaround, test if $type is either "exam" or doesn't exist. You didn't explicitly ask for a working solution, so I will leave it as an exercise to you to figure out how to do this.
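The evaluation order described above can be modelled in a few lines of plain JavaScript. This is a toy sketch, not MongoDB's implementation: `redact`, `isPlainDoc`, `PRUNED`, and the two decision functions are hypothetical names, and the scores are rounded from the question's document. It shows that the strict condition prunes the root, while a condition that also accepts a missing type lets the root through.

```javascript
const PRUNED = Symbol("pruned"); // marker for "this level was removed"

// True only for plain embedded documents (not arrays, dates, or other objects).
function isPlainDoc(value) {
  return value !== null && typeof value === "object" &&
    Object.getPrototypeOf(value) === Object.prototype;
}

// `decide` plays the role of the $cond expression and returns
// "KEEP", "PRUNE" or "DESCEND" for the document level it is given.
function redact(doc, decide) {
  const action = decide(doc);
  if (action === "PRUNE") return PRUNED; // drop this level and everything below
  if (action === "KEEP") return doc;     // keep this level without further checks
  // DESCEND: keep scalar fields, re-apply the decision to embedded documents.
  const out = {};
  for (const [key, value] of Object.entries(doc)) {
    if (Array.isArray(value)) {
      out[key] = value
        .map((item) => (isPlainDoc(item) ? redact(item, decide) : item))
        .filter((item) => item !== PRUNED);
    } else if (isPlainDoc(value)) {
      const child = redact(value, decide);
      if (child !== PRUNED) out[key] = child;
    } else {
      out[key] = value;
    }
  }
  return out;
}

const grade = {
  student_id: 0,
  class_id: 2,
  scores: [
    { type: "exam", score: 57.9 },
    { type: "quiz", score: 21.2 },
    { type: "homework", score: 68.2 },
  ],
};

// The inverted condition from the question: the root has no `type`, so it is pruned.
const strictExam = (d) => (d.type === "exam" ? "DESCEND" : "PRUNE");
// The workaround: also descend when `type` does not exist.
const examOrMissing = (d) =>
  d.type === undefined || d.type === "exam" ? "DESCEND" : "PRUNE";

console.log(redact(grade, strictExam) === PRUNED); // whole document removed
console.log(redact(grade, examOrMissing));         // root kept, only the exam remains
```

In an actual pipeline, one way to express "missing or exam" would presumably be to wrap the field in $ifNull, for example $eq: [{ $ifNull: ["$type", "exam"] }, "exam"]; that expression has not been run against the asker's data.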

Aggregate and select only top records with $last

I have following collection in MongoDB:
{
"_id" : ObjectId("..."),
"assetId" : "...",
"date" : ISODate("..."),
...
}
I need to do quite a simple thing: find the latest record for each device/asset. I have the following query:
db.collection.aggregate([
{ "$match" : { "assetId" : { "$in" : [ up_to_80_ids ]} } },
{ "$group" :{ "_id" : "$assetId" , "date" : { "$last" : "$date"}}}
])
The whole table is around 20 GB. When I run this query it takes around 8 seconds, which does not make any sense, since I specified that only the $last record should be selected. Both assetId and date are indexed. Adding { $sort : { date : 1 } } before the group does not change anything.
Basically, the result of my query should NOT depend on the data size. The only thing I need is the top record for each device/asset. If I instead run 80 separate queries, it takes a few milliseconds.
Is there any way to make MongoDB NOT go through the whole table? It looks like the database does not reduce but processes everything. I understand that there should be a good reason for this behaviour, but I cannot find anything in the documentation or on the forums.
UPDATE:
Eventually found the right syntax for the explain query in 2.4.6:
db.runCommand( { aggregate: "collection", pipeline : [...] , explain : true })
Result:
{
"serverPipeline" : [
{
"query" : {
"assetId" : {
"$in" : [
"52744d5722f8cb9b4f94d321",
"52791fe322f8014b320dae41",
"52740f5222f8cb9b4f94d306",
... must remove some because of SO limitations
"52744d1722f8cb9b4f94d31d",
"52744b1d22f8cb9b4f94d308",
"52744ccd22f8cb9b4f94d319"
]
}
},
"projection" : {
"assetId" : 1,
"date" : 1,
"_id" : 0
},
"cursor" : {
"cursor" : "BtreeCursor assetId_1 multi",
"isMultiKey" : false,
"n" : 960881,
"nscannedObjects" : 960881,
"nscanned" : 960894,
"nscannedObjectsAllPlans" : 960881,
"nscannedAllPlans" : 960894,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 9,
"nChunkSkips" : 0,
"millis" : 6264,
"indexBounds" : {
"assetId" : [
[
"52740baa22f8cb9b4f94d2e8",
"52740baa22f8cb9b4f94d2e8"
],
[
"52740bed22f8cb9b4f94d2e9",
"52740bed22f8cb9b4f94d2e9"
],
[
"52740c3222f8cb9b4f94d2ea",
"52740c3222f8cb9b4f94d2ea"
],
....
[
"5297770a22f82f9bdafce322",
"5297770a22f82f9bdafce322"
],
[
"529df5f622f82f9bdafce429",
"529df5f622f82f9bdafce429"
],
[
"529f6a6722f89deaabbf9881",
"529f6a6722f89deaabbf9881"
],
[
"52a6e35122f89ce6e2cf4267",
"52a6e35122f89ce6e2cf4267"
]
]
},
"allPlans" : [
{
"cursor" : "BtreeCursor assetId_1 multi",
"n" : 960881,
"nscannedObjects" : 960881,
"nscanned" : 960894,
"indexBounds" : {
"assetId" : [
[
"52740baa22f8cb9b4f94d2e8",
"52740baa22f8cb9b4f94d2e8"
],
[
"52740bed22f8cb9b4f94d2e9",
"52740bed22f8cb9b4f94d2e9"
],
[
"52740c3222f8cb9b4f94d2ea",
"52740c3222f8cb9b4f94d2ea"
],
.......
[
"529df5f622f82f9bdafce429",
"529df5f622f82f9bdafce429"
],
[
"529f6a6722f89deaabbf9881",
"529f6a6722f89deaabbf9881"
],
[
"52a6e35122f89ce6e2cf4267",
"52a6e35122f89ce6e2cf4267"
]
]
}
}
],
"oldPlan" : {
"cursor" : "BtreeCursor assetId_1 multi",
"indexBounds" : {
"assetId" : [
[
"52740baa22f8cb9b4f94d2e8",
"52740baa22f8cb9b4f94d2e8"
],
[
"52740bed22f8cb9b4f94d2e9",
"52740bed22f8cb9b4f94d2e9"
],
[
"52740c3222f8cb9b4f94d2ea",
"52740c3222f8cb9b4f94d2ea"
],
........
[
"529df5f622f82f9bdafce429",
"529df5f622f82f9bdafce429"
],
[
"529f6a6722f89deaabbf9881",
"529f6a6722f89deaabbf9881"
],
[
"52a6e35122f89ce6e2cf4267",
"52a6e35122f89ce6e2cf4267"
]
]
}
},
"server" : "351bcc56-1a25-61b7-a435-c14e06887015.local:27017"
}
},
{
"$group" : {
"_id" : "$assetId",
"date" : {
"$last" : "$date"
}
}
}
],
"ok" : 1
}
Your explain output indicates there are 960,881 items matching the assetIds in your $match stage. MongoDB finds all of them using the index on assetId, and streams them all through the $group stage. This is expensive. At the moment MongoDB does not make very many whole-pipeline optimizations to the aggregation pipeline, so what you write is what you get, pretty much.
MongoDB could optimize this pipeline by sorting by assetId ascending and date descending, then applying the optimization suggested in SERVER-9507 but this is not yet implemented.
For the moment, your best course of action is to do this for each assetId:
db.collection.find({assetId: THE_ID}).sort({date: -1}).limit(1)
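Why one small query per assetId is cheap can be seen in a toy model of an index. The sketch below assumes a compound index on { assetId: 1, date: -1 }, which is an assumption (the question only says that both fields are indexed); `buildIndex` and `latestFor` are hypothetical helpers and the sample records are made up. Each lookup is a seek to the first entry of an asset rather than a scan over all of its records.

```javascript
// Toy model of an index on { assetId: 1, date: -1 }: entries sorted by assetId
// ascending, then by date descending.
function buildIndex(records) {
  return [...records].sort((x, y) =>
    x.assetId < y.assetId ? -1 : x.assetId > y.assetId ? 1 : y.date - x.date
  );
}

// Binary search for the first entry of an asset. Because dates are descending
// within an asset, that first entry is the asset's latest record.
function latestFor(index, assetId) {
  let lo = 0;
  let hi = index.length;
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (index[mid].assetId < assetId) lo = mid + 1;
    else hi = mid;
  }
  const hit = index[lo];
  return hit !== undefined && hit.assetId === assetId ? hit : null;
}

// Hypothetical sample records.
const index = buildIndex([
  { assetId: "a1", date: new Date("2013-12-01") },
  { assetId: "a1", date: new Date("2013-12-03") },
  { assetId: "a2", date: new Date("2013-11-20") },
  { assetId: "a1", date: new Date("2013-12-02") },
]);

// One cheap lookup per id, analogous to one find().sort().limit(1) per asset.
const latest = ["a1", "a2", "missing"].map((id) => latestFor(index, id));
console.log(latest);
```

The cost of each lookup grows with the logarithm of the index size, not with the number of records per asset, which matches the observation that 80 separate queries finish in a few milliseconds.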
I am not sure, but the documentation on the MongoDB site has a note about this:
Only use $last when the $group follows a $sort operation. Otherwise, the result of this operation is unpredictable.
I have the same problem in my program. I have tried MongoDB MapReduce, the aggregation framework, and other approaches, but in the end I settled on scanning the collection using indexes and building the result on the client. Now the collections are too big to do that, so I think I will use many small queries, as you mentioned in your question. It is not so beautiful, but it will be the fastest solution IMHO.
Only the first stage in your pipeline uses an index. The second stage takes the output of the first, which is big and not indexed. But as mentioned in Pipeline Operators and Indexes, your query can use a compound index, so it is not so clear.
I have an idea: you can try to use many $or clauses instead of one $in operator, like this:
{ "$match": { "$or": [{"assetId": <id1>}, {"assetId": <id2>...}] } }. As far as I know, $or clauses can be executed in parallel and each clause can use an index, so it would be interesting to test this solution.
P.S. I would be really happy if a solution to this problem were found.

MongoDB indexes for $elemMatch

The indexes help page at http://www.mongodb.org/display/DOCS/Indexes doesn't mention $elemMatch and since it says to add an index on my 2M+ object collection I thought I'd ask this:
I am doing a query like:
{ lc: "eng", group: "xyz", indices: { $elemMatch: { text: "as", pos: { $gt: 1 } } } }
If I add an index
{lc:1, group:1, "indices.text":1, "indices.pos":1}
will this query with the $elemMatch component be able to be fully run through the index?
Based on your query, I imagine that your documents look something like this:
{
"_id" : 1,
"lc" : "eng",
"group" : "xyz",
"indices" : [
{
"text" : "as",
"pos" : 2
},
{
"text" : "text",
"pos" : 4
}
]
}
I created a test collection with documents of this format, created the index, and ran the query that you posted with the .explain() option.
The index is being used as expected:
> db.test.ensureIndex({"lc":1, "group":1, "indices.text":1, "indices.pos":1})
> db.test.find({ lc: "eng", group: "xyz", indices: { $elemMatch: { text: "as", pos: { $gt: 1 } } } }).explain()
{
"cursor" : "BtreeCursor lc_1_group_1_indices.text_1_indices.pos_1",
"isMultiKey" : true,
"n" : NumberLong(1),
"nscannedObjects" : NumberLong(1),
"nscanned" : NumberLong(1),
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 0,
"nChunkSkips" : NumberLong(0),
"millis" : 0,
"indexBounds" : {
"lc" : [
[
"eng",
"eng"
]
],
"group" : [
[
"xyz",
"xyz"
]
],
"indices.text" : [
[
"as",
"as"
]
],
"indices.pos" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
]
},
"server" : "Marcs-MacBook-Pro.local:27017"
}
The documentation on the .explain() feature may be found here: http://www.mongodb.org/display/DOCS/Explain
.explain() may be used to display information about a query, including which (if any) index is used.