Explain Aggregate framework - mongodb

I just read the linked question "Mongodb Explain for Aggregation framework", but it does not solve my problem.
I want to retrieve execution information about an aggregation, the way db.coll.find({bla:foo}).explain() does for a query.
I tried:
db.coll.aggregate([
my-op
],
{
explain: true
})
The result is not explain output, but the result of the query itself.
I have also tried:
db.runCommand({
aggregate: "mycoll",
pipeline: [
my-op
],
explain: true
})
I retrieved some information with this command, but it does not include millis, nscannedObjects, etc.
I use MongoDB 2.6.2.

Aggregations don't run like traditional queries, and you can't run explain on them in the same way. They are classified as commands and, although they make use of indexes, you can't readily find out how they are being executed in real time.
Your best bet is to take the $match portion of your aggregation and run it as a query with explain, to see how the indexes are performing and to get an idea of nscanned.
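A minimal sketch of that workaround, assuming the pipeline opens with one or more $match stages. The helper name leadingMatchFilter and the sample pipeline are hypothetical, made up only for illustration; the helper just builds the filter you would then pass to find().explain() in the shell.

```javascript
// Collect the filters of the $match stages that open a pipeline.
// Stops at the first stage that is not a $match.
function leadingMatchFilter(pipeline) {
  const filters = [];
  for (const stage of pipeline) {
    if (!Object.prototype.hasOwnProperty.call(stage, "$match")) break;
    filters.push(stage.$match);
  }
  if (filters.length === 0) return {};       // no leading $match: nothing to filter on
  if (filters.length === 1) return filters[0];
  return { $and: filters };                  // consecutive $match stages combine with AND
}

// Hypothetical pipeline, for illustration only.
const pipeline = [
  { $match: { status: "A" } },
  { $match: { qty: { $gt: 5 } } },
  { $group: { _id: "$item", total: { $sum: "$qty" } } },
];
const filter = leadingMatchFilter(pipeline);
console.log(JSON.stringify(filter));
// In the mongo shell: db.coll.find(filter).explain()
```

Only the leading $match stages are considered, because those are the ones the server can serve from an index before any document is reshaped.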

I am not sure why you failed to get explain information. In 2.6.x this information is available, and you can explain your aggregation:
db.orders.aggregate([
// put your whole aggregation pipeline here
], {
explain: true
})
which gives me something like:
{
"stages" : [
{
"$cursor" : {
"query" : {
"a" : 1
},
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "test.a",
"indexFilterSet" : false,
"parsedQuery" : {
"a" : {
"$eq" : 1
}
},
"winningPlan" : {
"stage" : "COLLSCAN",
"filter" : {
"a" : {
"$eq" : 1
}
},
"direction" : "forward"
},
"rejectedPlans" : [ ]
}
}
}
],
"ok" : 1
}

Related

Problem in using indexes in aggregation pipeline

I have a query like this
db.UserPosts.aggregate([
{ "$match" : { "Posts.DateTime" : { "$gte" : ISODate("2018-09-04T11:50:58Z"), "$lte" : ISODate("2018-09-05T11:50:58Z") } } },
{ "$match" : { "UserId" : { "$in" : [NUUID("aaaaaaaa-cccc-dddd-dddd-5369b183cccc"), NUUID("vvvvvvvv-bbbb-ffff-cccc-e0af0c8acccc")] } } },
{ "$project" : { "_id" : 0, "UserId" : 1, "Posts" : 1 } },
{ "$unwind" : "$Posts" },
{ "$unwind" : "$Posts.Comments" },
{ "$sort" : {"Posts.DateTime" : -1} },
{ "$skip" : 0 }, { "$limit" : 20 },
{ "$project" : { "_id" : 0, "UserId" : 1, "DateTime" : "$Posts.DateTime", "Title" : "$Posts.Title", "Type" : "$Posts.Comments.Type", "Comment" : "$Posts.Comments.Description" } },
],{allowDiskUse:true})
I have a compound index
{
"Posts.DateTime" : -1,
"UserId" : 1
}
Posts and Comments are arrays of objects.
I've tried different types of indexes, but the problem is that the $sort stage does not use my index. I moved the $sort stage to different positions without success. The index seems to be used for $match but not for $sort. I also tried two single-field indexes on those fields, and a combination of the two single-field indexes with the compound index, but none of them works.
I also read related documents in MongoDB website for
Compound Indexes
Use Indexes to Sort Query Results
Index Intersection
Aggregation Pipeline Optimization
Could somebody please help me to find the solution?
I solved this problem by changing my data model and moving DateTime to a higher level of the document.

Best match Mongodb in complex objects

I have a database with about 50k records about candidates, like the example below:
[
{
"_id":{
"$oid":"5744eff20ca7832b5c7452321"
},
"name":"Candidate 1",
"characteristics":[
{
"name":"personal skills",
"info":[
"Great speaker",
"Very friendly",
"Born to be a leader"
]
},
{
"name":"education background",
"info":[
"Studied Mechanical Engineering",
"Best of his class 2001"
]
}
]
},
... thousands more objects with same structure
]
And given some personal skills I would like to search the best matches for that input:
Example of input:
["speaker", "leader"]
Expected output:
A list of candidates (whole objects) in descending order, starting with the best match.
I basically need to search only the "personal skills" field.
What would be a good approach to this problem using MongoDB? Or is there another database that fits this problem better?
The query below uses regular expressions to return the records that match both speaker and leader.
db.collection_name.find(
{ $and :
[
{"characteristics.info": /.*speaker.*/},
{"characteristics.info": /.*leader.*/}
]
}
)
For better performance we can create a text index as shown below, but please note that only one text index is allowed per collection:
db.collection_name.createIndex({"characteristics":"text"});
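After the text index has been created, we can use explain to check whether the search uses it. Note that the plan below reports COLLSCAN: a regex filter does not use a text index, which only serves queries written with the $text operator.
Using explain to view the query plan: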
db.collection_name.find({ $and: [{"characteristics.info": /.*speaker.*/}, {"characteristics.info": /.*leader.*/}]}).explain()
Mongo shell output with the query plan explained:
{
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "test.a",
"indexFilterSet" : false,
"parsedQuery" : {
"$and" : [
{
"characteristics.info" : {
"$regex" : ".*speaker.*"
}
},
{
"characteristics.info" : {
"$regex" : ".*leader.*"
}
}
]
},
"winningPlan" : {
"stage" : "COLLSCAN",
"filter" : {
"$and" : [
{
"characteristics.info" : {
"$regex" : ".*speaker.*"
}
},
{
"characteristics.info" : {
"$regex" : ".*leader.*"
}
}
]
},
"direction" : "forward"
},
"rejectedPlans" : [ ]
},
"serverInfo" : {
"host" : "PC369236",
"port" : 27017,
"version" : "3.6.1",
"gitVersion" : "025d4f4fe61efd1fb6f0005be20cb45a004093d1"
},
"ok" : 1
}
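Since the plan above is a COLLSCAN, here is a sketch of how a ranked search might look with the $text operator instead of regular expressions. It assumes a text index that covers the searched strings, for example one created on "characteristics.info" (an assumption, not the index from the answer). $text, $search and { $meta: "textScore" } are standard MongoDB operators; the helper name textSearchArgs is hypothetical. Note that this matches documents containing any of the terms and searches every characteristic, not only "personal skills".

```javascript
// Build the arguments for a ranked full-text search.
// Assumes a text index covering the searched field already exists.
function textSearchArgs(terms) {
  const score = { $meta: "textScore" };
  return {
    filter: { $text: { $search: terms.join(" ") } }, // space-separated terms are OR-ed
    projection: { score: score },                     // expose the relevance score
    sort: { score: score },                           // most relevant documents first
  };
}

const args = textSearchArgs(["speaker", "leader"]);
console.log(JSON.stringify(args.filter));
// In the mongo shell:
// db.collection_name.find(args.filter, args.projection).sort(args.sort)
```

Sorting on the textScore metadata is what gives the "best match first" ordering the question asks for, which the regex query cannot provide.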

MongoDB WinningPlan IDHACK

For the below query:
db.restaurants.find({"_id" : ObjectId("5aabce4f827d70999ae5f5f7")}).explain()
I'm getting the below query plan:
/* 1 */
{
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "test.restaurants",
"indexFilterSet" : false,
"parsedQuery" : {
"_id" : {
"$eq" : ObjectId("5aabce4f827d70999ae5f5f7")
}
},
"winningPlan" : {
"stage" : "IDHACK"
},
"rejectedPlans" : []
},
"serverInfo" : {
"host" : "CHNMCT136701L",
"port" : 27017,
"version" : "3.6.3",
"gitVersion" : "9586e557d54ef70f9ca4b43c26892cd55257e1a5"
},
"ok" : 1.0
}
Now I have some questions on my mind.
What is meant by stage: IDHACK? How is it different from COLLSCAN? Does it have anything to do with performance optimization? If yes, in which scenarios does MongoDB choose this winning plan? If I create an index on _id, will IDHACK be replaced by the corresponding index plan?
Can anybody clarify this?
https://jira.mongodb.org/browse/SERVER-16891
The query idhack path (a performance optimization to reduce planning/execution overhead for operations with query predicates of the form {_id: <value>})
COLLSCAN doesn't use an index. IDHACK uses the _id index.
Yes, it does have to do with performance optimization.
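To compare plans quickly, the stage names can be pulled out of the explain output. The following is a sketch only: planStages is a hypothetical helper, and it assumes the "inputStage" / "inputStages" nesting that 3.x explain output uses. The second sample plan (with its index name) is invented for illustration.

```javascript
// List the stage names of a winning plan, outermost stage first.
// Follows the "inputStage" / "inputStages" links found in explain output.
function planStages(plan) {
  if (!plan) return [];
  const children = plan.inputStages || (plan.inputStage ? [plan.inputStage] : []);
  return [plan.stage].concat(...children.map(planStages));
}

// Trimmed explain outputs: the first mirrors the question, the second is a
// hypothetical plan for a query on an indexed field other than _id.
const byId = { queryPlanner: { winningPlan: { stage: "IDHACK" } } };
const byIndex = {
  queryPlanner: {
    winningPlan: { stage: "FETCH", inputStage: { stage: "IXSCAN", indexName: "name_1" } },
  },
};
console.log(planStages(byId.queryPlanner.winningPlan));
console.log(planStages(byIndex.queryPlanner.winningPlan));
```

A plan whose list contains COLLSCAN reads every document; IDHACK and IXSCAN both read from an index.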

Mongodb: Indexing for Aggregate sort limit query?

I am in the process of moving from mysql to mongodb. Started learning mongodb yesterday.
I have a big mysql table (over 4 million rows, with over 300 fields each) which I am moving to mongodb.
Let's assume the products table has the following fields:
_id, category, and 300+ other fields.
To find the top 5 categories in the products along with their count, I have the following mysql query
Select category, count(_id) as N from products group by category order by N DESC limit 5;
I have an index on category field and this query takes around 4.4 sec in mysql.
Now, I have successfully moved this table to mongodb and this is my corresponding query for finding top 5 categories with their counts.
db.products.aggregate([{$group : {_id:"$category", N:{$sum:1}}},{$sort:{N: -1}},{$limit:5}]);
I again have an index on category but the query doesn't seem to be using it (explain : true says so) and it is also taking around 13.5 sec for this query.
Having read more about mongodb aggregation pipeline optimization, I found out that we need to use sort prior to aggregation for index to work but I am sorting on the derived field from aggregation so can't bring it before the aggregate function.
How do I optimize queries like these in mongodb?
=========================================================================
Output of explain
db.products.aggregate([{$group : {_id:"$category",N:{$sum:1}}},{$sort:{N: -1}},{$limit:5}], { explain: true });
{
"waitedMS" : NumberLong(0),
"stages" : [
{
"$cursor" : {
"query" : {
},
"fields" : {
"category" : 1,
"_id" : 0
},
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "mydb.products",
"indexFilterSet" : false,
"parsedQuery" : {
"$and" : [ ]
},
"winningPlan" : {
"stage" : "COLLSCAN",
"filter" : {
"$and" : [ ]
},
"direction" : "forward"
},
"rejectedPlans" : [ ]
}
}
},
{
"$group" : {
"_id" : "$category",
"N" : {
"$sum" : {
"$const" : 1
}
}
}
},
{
"$sort" : {
"sortKey" : {
"N" : -1
},
"limit" : NumberLong(5)
}
}
],
"ok" : 1
}
There are currently some limitations in what the aggregation framework can do to improve performance in your use case. However, you should be able to speed up the query by sorting on category first. This makes the query use the index you have added and should speed up the $group stage in the second part of your pipeline:
db.products.aggregate([
{ "$sort" : { "category" : 1 } },
{$group : {_id:"$category",N:{$sum:1}}},
{$sort:{N: -1}},{$limit:5}]);

Aggregate and select only top records with $last

I have following collection in MongoDB:
{
"_id" : ObjectId("..."),
"assetId" : "...",
"date" : ISODate("..."),
...
}
I need to do quite a simple thing: find the latest record for each device/asset. I have the following query:
db.collection.aggregate([
{ "$match" : { "assetId" : { "$in" : [ up_to_80_ids ]} } },
{ "$group" :{ "_id" : "$assetId" , "date" : { "$last" : "$date"}}}
])
The whole collection is around 20 GB. This query takes around 8 seconds, which does not make sense to me, since I specified that only the $last record should be selected. Both assetId and date are indexed. Adding { $sort : { date : 1 } } before the group does not change anything.
Basically, the result of my query should NOT depend on the data size. The only thing I need is the top record for each device/asset. If I run 80 separate queries instead, it takes a few milliseconds.
Is there any way to keep MongoDB from going through the whole collection? It looks like the database does not reduce anything but processes every document. I understand there may be a good reason for this behaviour, but I cannot find anything in the documentation or on the forums.
UPDATE:
Eventually I found the right syntax of the explain query for 2.4.6:
db.runCommand( { aggregate: "collection", pipeline : [...] , explain : true })
Result:
{
"serverPipeline" : [
{
"query" : {
"assetId" : {
"$in" : [
"52744d5722f8cb9b4f94d321",
"52791fe322f8014b320dae41",
"52740f5222f8cb9b4f94d306",
... must remove some because of SO limitations
"52744d1722f8cb9b4f94d31d",
"52744b1d22f8cb9b4f94d308",
"52744ccd22f8cb9b4f94d319"
]
}
},
"projection" : {
"assetId" : 1,
"date" : 1,
"_id" : 0
},
"cursor" : {
"cursor" : "BtreeCursor assetId_1 multi",
"isMultiKey" : false,
"n" : 960881,
"nscannedObjects" : 960881,
"nscanned" : 960894,
"nscannedObjectsAllPlans" : 960881,
"nscannedAllPlans" : 960894,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 9,
"nChunkSkips" : 0,
"millis" : 6264,
"indexBounds" : {
"assetId" : [
[
"52740baa22f8cb9b4f94d2e8",
"52740baa22f8cb9b4f94d2e8"
],
[
"52740bed22f8cb9b4f94d2e9",
"52740bed22f8cb9b4f94d2e9"
],
[
"52740c3222f8cb9b4f94d2ea",
"52740c3222f8cb9b4f94d2ea"
],
....
[
"5297770a22f82f9bdafce322",
"5297770a22f82f9bdafce322"
],
[
"529df5f622f82f9bdafce429",
"529df5f622f82f9bdafce429"
],
[
"529f6a6722f89deaabbf9881",
"529f6a6722f89deaabbf9881"
],
[
"52a6e35122f89ce6e2cf4267",
"52a6e35122f89ce6e2cf4267"
]
]
},
"allPlans" : [
{
"cursor" : "BtreeCursor assetId_1 multi",
"n" : 960881,
"nscannedObjects" : 960881,
"nscanned" : 960894,
"indexBounds" : {
"assetId" : [
[
"52740baa22f8cb9b4f94d2e8",
"52740baa22f8cb9b4f94d2e8"
],
[
"52740bed22f8cb9b4f94d2e9",
"52740bed22f8cb9b4f94d2e9"
],
[
"52740c3222f8cb9b4f94d2ea",
"52740c3222f8cb9b4f94d2ea"
],
.......
[
"529df5f622f82f9bdafce429",
"529df5f622f82f9bdafce429"
],
[
"529f6a6722f89deaabbf9881",
"529f6a6722f89deaabbf9881"
],
[
"52a6e35122f89ce6e2cf4267",
"52a6e35122f89ce6e2cf4267"
]
]
}
}
],
"oldPlan" : {
"cursor" : "BtreeCursor assetId_1 multi",
"indexBounds" : {
"assetId" : [
[
"52740baa22f8cb9b4f94d2e8",
"52740baa22f8cb9b4f94d2e8"
],
[
"52740bed22f8cb9b4f94d2e9",
"52740bed22f8cb9b4f94d2e9"
],
[
"52740c3222f8cb9b4f94d2ea",
"52740c3222f8cb9b4f94d2ea"
],
........
[
"529df5f622f82f9bdafce429",
"529df5f622f82f9bdafce429"
],
[
"529f6a6722f89deaabbf9881",
"529f6a6722f89deaabbf9881"
],
[
"52a6e35122f89ce6e2cf4267",
"52a6e35122f89ce6e2cf4267"
]
]
}
},
"server" : "351bcc56-1a25-61b7-a435-c14e06887015.local:27017"
}
},
{
"$group" : {
"_id" : "$assetId",
"date" : {
"$last" : "$date"
}
}
}
],
"ok" : 1
}
Your explain output indicates there are 960,881 items matching the assetIds in your $match stage. MongoDB finds all of them using the index on assetId, and streams them all through the $group stage. This is expensive. At the moment MongoDB does not make very many whole-pipeline optimizations to the aggregation pipeline, so what you write is what you get, pretty much.
MongoDB could optimize this pipeline by sorting by assetId ascending and date descending, then applying the optimization suggested in SERVER-9507 but this is not yet implemented.
For the moment, your best course of action is to do this for each assetId:
db.collection.find({assetId: THE_ID}).sort({date: -1}).limit(1)
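A sketch of that per-asset loop, assuming the mongo shell, where toArray() returns synchronously. latestPerAsset is a hypothetical helper, and fakeCollection is only an in-memory stand-in with numeric dates so that the sketch runs outside the shell. Each query is likely to be cheapest with a compound index on { assetId: 1, date: -1 }; that index is an assumption, since the question only says both fields are indexed.

```javascript
// Fetch the newest document for each asset with one small query per id.
// `coll` is any object exposing the shell's find().sort().limit().toArray() chain.
function latestPerAsset(coll, assetIds) {
  const latest = {};
  for (const id of assetIds) {
    const docs = coll.find({ assetId: id }).sort({ date: -1 }).limit(1).toArray();
    if (docs.length > 0) latest[id] = docs[0]; // ids without documents are skipped
  }
  return latest;
}

// Minimal in-memory stand-in for a collection (illustration only).
function fakeCollection(docs) {
  return {
    find(filter) {
      let rows = docs.filter((d) => d.assetId === filter.assetId);
      const cursor = {
        sort(spec) {
          rows = rows.slice().sort((a, b) => spec.date * (a.date - b.date));
          return cursor;
        },
        limit(n) { rows = rows.slice(0, n); return cursor; },
        toArray() { return rows; },
      };
      return cursor;
    },
  };
}

const coll = fakeCollection([
  { assetId: "a", date: 1 },
  { assetId: "a", date: 3 },
  { assetId: "b", date: 2 },
]);
const latest = latestPerAsset(coll, ["a", "b", "c"]);
console.log(latest);
// In the mongo shell: latestPerAsset(db.collection, [ /* asset ids */ ])
```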
I am not sure, but the linked page on the MongoDB site has a note:
Only use $last when the $group follows a $sort operation. Otherwise, the result of this operation is unpredictable.
I have the same problem in my program. I tried MongoDB MapReduce, the aggregation framework and other approaches, but finally settled on scanning the collection using indexes and building the result on the client. Now the collections are too big for that, so I think I will use many small queries, as you mentioned in your question. It is not very elegant, but it will be the fastest solution, IMHO.
Only the first stage in your pipeline uses an index. The second stage receives the output of the first, which is big and not indexed. However, as mentioned in "Pipeline Operators and Indexes", your query can use a compound index, so it is not entirely clear.
I have an idea: you can try to use many $or clauses instead of one $in operator, like this:
{ "$match": { "$or": [{"assetId": <id1>}, {"assetId": <id2>...}] } }
As far as I know, the $or operator can be executed in parallel and each clause can use an index. So it would be interesting to test this solution.
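A sketch of building that $or stage from a list of ids, so the two variants can be compared with explain. orMatchStage is a hypothetical helper, and whether this form is actually faster than $in is untested here; the server may plan both the same way.

```javascript
// Turn a list of ids into a $match stage with one $or clause per id.
function orMatchStage(field, ids) {
  return { $match: { $or: ids.map((id) => ({ [field]: id })) } };
}

// Two of the ids from the explain output in the question.
const stage = orMatchStage("assetId", [
  "52744d5722f8cb9b4f94d321",
  "52791fe322f8014b320dae41",
]);
console.log(JSON.stringify(stage));
```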
P.S. I will be really happy if a solution to this problem is found.