MongoDB keysExamined:returned ratio

In MongoDB, we use aggregate with $group when we want to count some action. Suppose we have to get the total number of deposits users have made in the last 2 months: we use $group and then $sum to get the count. Now the MongoDB Atlas profiler shows this as a very time-consuming and resource-intensive operation, because it scans the keys of 2 months' worth of data and returns only 1 document (the count). So is this a good way to get a count or not?

If your query is "get the total number of deposits for a set of users between two dates, where that set can be all users", and deposits is a collection with a field user or similar, then you do not need $group. Simply count() the filtered set:
db.deposits.find({$and: [
    {"user": {$in: [ list ]}},
    {"depositDate": {$gte: startDate}},
    {"depositDate": {$lt: endDate}}
]}).count();
or with the aggregation pipeline:
db.deposits.aggregate([
    {$match: {$and: [
        {"user": {$in: [ list ]}},
        {"depositDate": {$gte: startDate}},
        {"depositDate": {$lt: endDate}}
    ]}},
    {$count: "n"}
]);
Note you should be using a real datetime type for 'depositDate' -- not a string, not even ISO 8601 -- to facilitate more sophisticated queries involving dates.
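For example, a deposit stored with a real Date can be range-filtered and bucketed by day directly; a sketch with made-up field values:
// depositDate is a BSON Date (ISODate in the shell), not the string "2023-04-01"
db.deposits.insertOne({
    user: 42,
    amount: 100,
    depositDate: ISODate("2023-04-01T10:15:00Z")
});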

Related

MongoDB optimal query

My use case is different; I am mapping it to users and orders for easier understanding.
I have to get the following for a user:
For each department
    For each order type
        delivered count
        unique orders
Unique order count means the user might have ordered the same product more than once, but it should still be counted as 1. I have the background logic for this and identify duplicates via duplicate order ids.
db.getCollection('user_orders').aggregate([
    {$match: {"user_id": 123}},
    {$group: {"_id": {"department": "$department", "order_type": "$order_type"},
        "del_count": {$sum: "$del_count"},
        "unique_order": {$addToSet: {"unique_order": "$unique_order"}}}},
    {$project: {"_id": 0,
        "department": "$_id.department",
        "order_type_name": "$_id.order_type",
        "unique_order_count": {$size: "$unique_order"},
        "del_count": "$del_count"
    }},
    {$group: {"_id": "$department",
        order_types: {$addToSet:
            {"order_type_name": "$order_type_name",
            "unique_order_count": "$unique_order_count",
            "del_count": "$del_count"
        }}}}
])
Sorry for my query formatting.
This query is working absolutely fine. I added the second grouping to bring the responses together for all order types of the same department.
Can I do the same with fewer pipeline stages -- in a more efficient way?
The $project stage appears to be redundant, though removing it is more of a refactoring than a performance improvement. Your simplified pipeline could look like this:
db.getCollection('user_orders').aggregate([
    {$match: {"user_id": 123}},
    {$group: {"_id": {"department": "$department", "order_type": "$order_type"},
        "del_count": {$sum: "$del_count"},
        "unique_order": {$addToSet: {"unique_order": "$unique_order"}}}},
    {$group: {"_id": "$_id.department",
        order_types: {$addToSet:
            {"order_type_name": "$_id.order_type",
            "unique_order_count": {$size: "$unique_order"},
            "del_count": "$del_count"
        }}}}
])
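Either way, only the leading $match can take advantage of an index, so it is worth making sure one exists on the filtered field. A sketch, assuming the filter field really is user_id as in the example above:
db.getCollection('user_orders').createIndex({"user_id": 1});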

Can an index on a subfield cover queries on projections of that field?

Imagine you have a schema like:
[{
name: "Bob",
naps: [{
time: 2019-05-01T15:35:00,
location: "sofa"
}, ...]
}, ...
]
So lots of people, each with a few dozen naps. You want to find out 'what days do people take the most naps?', so you index naps.time, and then query with:
aggregate([
    {$unwind: "$naps"},
    // there is no $day operator; use $dayOfMonth (or $dayOfWeek / $dayOfYear, depending on what "day" means here)
    {$group: {_id: {$dayOfMonth: "$naps.time"}, napsOnDay: {$sum: 1}}}
])
But when running explain(), MongoDB tells me no index was used in this query, even though the index on the time Date field clearly could have been. Why is this? How can I get MongoDB to use the index for a more optimal query?
Indexes store pointers to the actual documents and can only be used when working with a material document (i.e. the document that is actually stored on disk).
$match and $sort do not mutate the actual documents, and thus indexes can be used in these stages.
In contrast, $unwind, $group, and any other stage that changes the actual document representation basically loses the connection between the index and the material documents.
Additionally, when those stages are run without a $match, you are basically saying that you want to process the whole collection, and there is no point in using an index if you are going to process the whole collection anyway.
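In practice, the usual workaround is to put an indexed $match at the very start of the pipeline so the index narrows the input before $unwind and $group take over. A sketch along those lines, where the collection name people and the date range are assumptions:
db.people.aggregate([
    // this $match can use the index on naps.time
    {$match: {"naps.time": {$gte: ISODate("2019-05-01"), $lt: ISODate("2019-06-01")}}},
    {$unwind: "$naps"},
    // re-filter the unwound elements, since the first $match only filters whole documents
    {$match: {"naps.time": {$gte: ISODate("2019-05-01"), $lt: ISODate("2019-06-01")}}},
    {$group: {_id: {$dayOfMonth: "$naps.time"}, napsOnDay: {$sum: 1}}}
])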

MongoDB Aggregate of 120M documents

I have a system that records entries by action. There are more than 120M of them and I want to group them with aggregate by id_entry. The structure is as follows:
entry
{
    id_entry: ObjectId(...),
    created_at: Date(...),
    action: {object},
}
When I try to aggregate by id_entry and group its actions, it takes more than 3h to finish:
db.entry.aggregate([
    { '$match': {'created_at': { $gte: ISODate("2016-02-02"), $lt: ISODate("2016-02-03") }}},
    { '$group': {
        '_id': {'id_entry': '$id_entry'},
        actions: {
            $push: '$action'
        }
    }}
])
But in that range of days there are only around ~4M documents. (id_entry and created_at have indexes.)
What am I doing wrong in the aggregate? How can I group 3-4M documents by id_entry in less than 3h?
Thanks
To speed up your particular query, you need an index on the created_at field.
However, the overall performance of the aggregation will also depend on your hardware specification (among other things).
If you find the query's performance to be less than what you require, you can either:
Create a pre-aggregated report (essentially a document that contains the aggregated data you require, updated every time new data is inserted), or
Utilize sharding to spread your data to more servers.
If you need to run this aggregation query all the time, a pre-aggregated report allows you to have an extremely up-to-date aggregated report of your data that is accessible using a simple find() query.
The tradeoff is that for every insertion, you will also need to update the pre-aggregated document to reflect the current state of your data. However, this is a relatively small tradeoff compared to running a long/complex aggregation query that could interfere with your day-to-day operations.
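As a rough sketch of the pre-aggregated approach (the summary collection name, its fields, and the newEntry variable are made up, modelled on your schema), every insert into entry would also bump a summary document that a plain find() can read back:
// on every new entry, also update the per-id_entry daily summary (upsert creates it if missing)
db.entry_daily_summary.updateOne(
    { id_entry: newEntry.id_entry, day: "2016-02-02" },
    { $inc: { count: 1 }, $push: { actions: newEntry.action } },
    { upsert: true }
);
// reading the report is then a cheap indexed find instead of a multi-hour aggregation
db.entry_daily_summary.find({ day: "2016-02-02" });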
One caveat with the aggregation framework: once the pipeline encounters a $group or a $project stage, no index can be used. This is because MongoDB indexes are tied to how the documents are stored physically, and grouping and projecting transform the documents into a state where they no longer have a physical representation on disk.

Keep the result of subset from $match aggregation in cache in mongoDB

I am building a website to explore MongoDB data. In my database I store GPS measurements captured from smartphones, and I use various queries to explore them. One query groups by day and counts the measurements; another counts the number of measurements for each kind of smartphone (iOS, Android, etc.); and so on.
All these queries share the same $match parameters in their aggregation pipeline. In this pipeline I filter the measurements in order to focus on an interval of time and on a geographical area.
Is there a way to keep the subset obtained in the $match in a cache, so that the database does not need to apply this filter every time?
I want to optimize the response time of my queries.
Sample of one of the queries:
cursor = db.myCollection.aggregate(
    [
        {
            "$match": {
                "$and": [{"t": {"$gt": tMin, "$lt": tMax}, "location": {"$geoWithin": {"$geometry": square}}}]
            }
        },
        {
            "$group": {
                "_id": {"hourGroup": "$tHour"},
                "count": {"$sum": 1}
            }
        }
    ]
)
I want to keep the result of this in a cache:
"$match": {
    "$and": [{"t": {"$gt": tMin, "$lt": tMax}, "location": {"$geoWithin": {"$geometry": square}}}]
}
One way to do this is to create a new collection using the $out pipeline stage.
Then, as you run your batch of queries, the first query creates the matched output and the subsequent queries can use its results.
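A minimal sketch of that idea (the collection name matched_subset is made up): run the shared $match once, materialize it with $out, then point every follow-up aggregation at the materialized collection:
db.myCollection.aggregate([
    { "$match": { "t": {"$gt": tMin, "$lt": tMax}, "location": {"$geoWithin": {"$geometry": square}} } },
    { "$out": "matched_subset" }
]);
// subsequent queries reuse the cached subset without re-applying the filter
db.matched_subset.aggregate([
    { "$group": { "_id": {"hourGroup": "$tHour"}, "count": {"$sum": 1} } }
]);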
There is also a newer pipeline stage called $facet, which lets you execute the match once and then feed the result into multiple aggregation paths (it shipped in MongoDB 3.4).
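With $facet, a sketch along the lines of your queries could look like this (the facet names and the platform field are assumptions):
db.myCollection.aggregate([
    { "$match": { "t": {"$gt": tMin, "$lt": tMax}, "location": {"$geoWithin": {"$geometry": square}} } },
    { "$facet": {
        "byHour":     [ { "$group": { "_id": {"hourGroup": "$tHour"}, "count": {"$sum": 1} } } ],
        "byPlatform": [ { "$group": { "_id": "$platform", "count": {"$sum": 1} } } ]
    }}
]);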
Any comments welcome!

Aggregate framework can't use indexes

I run this command:
db.ads_view.aggregate({$group: {_id : "$campaign", "action" : {$sum: 1} }});
ads_view : 500 000 documents.
This query takes 1.8s. This is its profile: https://gist.github.com/afecec63a994f8f7fd8a
Indexed: db.ads_view.ensureIndex({campaign: 1});
But MongoDB doesn't use the index. Does anyone know whether the aggregation framework can use indexes, and how to index this query?
This is a late answer, but since $group in MongoDB as of version 4.0 still won't make use of indexes, it may be helpful for others.
To speed up your aggregation significantly, perform a $sort before the $group.
So your query would become:
db.ads_view.aggregate([{$sort: {"campaign": 1}}, {$group: {_id: "$campaign", "action": {$sum: 1}}}]);
This assumes an index on campaign, which should have been created according to your question. In MongoDB 4.0, create the index with db.ads_view.createIndex({campaign: 1}).
I tested this on a collection containing 5.5+ million documents. Without $sort, the aggregation would not have finished even after several hours; with $sort preceding $group, the aggregation takes a couple of seconds.
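To check that the index is actually being picked up, you can wrap the pipeline in explain() and look for an IXSCAN on campaign in the winning plan rather than a collection scan (the exact output shape varies by server version, so treat this as a sketch):
db.ads_view.explain().aggregate([
    {$sort: {"campaign": 1}},
    {$group: {_id: "$campaign", "action": {$sum: 1}}}
]);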
The $group operator is not one of the ones that will use an index currently. The list of operators that do (as of 2.2) is:
$match
$sort
$limit
$skip
From here:
http://docs.mongodb.org/manual/applications/aggregation/#pipeline-operators-and-indexes
Based on the number of yields going on in the gist, I would assume you either have a very active instance or that a lot of this data is not in memory when you are doing the group (it will usually yield on page faults too), hence the 1.8s.
Note that even if $group could use an index, and your index covered everything being grouped, it would still involve a full scan of the index to do the grouping, and would likely not be terribly fast anyway.
$group doesn't use an index because it doesn't have to. When you $group your items, you are essentially indexing all documents passing through the $group stage of the pipeline using your $group's _id. If you used an index that matched the $group's _id, you'd still have to pass through all the docs in the index, so it's the same amount of work.