Working with the MongoDB aggregation framework, it is clear that the $group stage is the bottleneck. By using explain() on some find queries, I'm able to tailor my indexes to reduce table scans significantly. However, $group does not seem to take into account any $sort that happens before it, even when I sort by the same fields I then group by.
Besides simply reducing the result set, are there any practical ways to improve the performance of the $group stage? I'm almost tempted to take advantage of the sort and just do the grouping in my own application, but there must be an elegant and performant solution using the framework.
I'm noticing that as the result set from the $match increases, the $group time also increases.
My document is basically like this
a: (String)
b: (String)
with a pipeline that looks something like
$match :{ a : 'frank'}
$sort : { b : 1 }
$group : { _id : { b : '$b' } }
It is surprising to me, because I assume by the time it gets to the group, the data is loaded into memory, and since the fields are indexed, a few thousand records shouldn't take that much time to load into memory. Is this not the case?
It just seems that the $sort has no effect on the overall performance. Is there a way to use indexes, as well as the previous stages of the pipeline, to improve the performance of the $group? Also, does $group stay within the result set from the previous stages, or does it go back to an entire table scan? (I'm pretty sure, or at least hopeful, that it doesn't.)
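To make the "do the grouping in my own application" idea concrete, here is a minimal sketch in plain JavaScript. It assumes the records arrive already sorted by the grouping field, so each group can be closed as soon as the key changes. The helper name groupSortedCounts, the sample records, and the per-group count are illustrative assumptions, not part of the original pipeline.

```javascript
// Hypothetical helper: groups records by `key` in a single pass.
// It assumes `records` is already sorted by `key` (for example by a
// preceding sort on b); unsorted input would yield repeated groups.
function groupSortedCounts(records, key) {
  const groups = [];
  let current = null;
  for (const rec of records) {
    if (current === null || current._id !== rec[key]) {
      // Key changed: the previous group is complete, start a new one.
      current = { _id: rec[key], count: 0 };
      groups.push(current);
    }
    current.count += 1;
  }
  return groups;
}

// Sample data standing in for the output of $match followed by $sort.
const sorted = [
  { a: 'frank', b: 'x' },
  { a: 'frank', b: 'x' },
  { a: 'frank', b: 'y' },
];
const grouped = groupSortedCounts(sorted, 'b');
```

Because the input is sorted, only one group is open at a time, so memory use stays constant per group instead of growing with the number of distinct keys.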
I have a data schema consisting of many updates (hundreds of thousands+ per entity) that are assigned to entities. I'm representing this with a single top-level document for each of the entities and an array of updates under each of them. The schema for those top-level documents looks like this:
"entity_id": "uuid",
"updates": [
{ "timestamp": Date(...), "value": 10 },
{ "timestamp": Date(...), "value": 11 }
I'm trying to create a query that returns the number of entities that have received an update within the past n hours. All updates in the updates array are guaranteed to be sorted by virtue of the manner in which they're updated by my application. I've created the following aggregation to do this:
{"$project": {last_update: {"$arrayElemAt": ["$updates", -1]}}},
{"$replaceRoot": {newRoot: "$last_update"}},
{"$match": {timestamp: {"$gte": new Date(...)}}},
{"$count": "count"}
For some reason that I don't understand, the query I just pasted takes an absurd amount of time to complete. It exhausts the 15-second timeout on the client I use, as a matter of fact.
From a time complexity point of view, this query looks incredibly cheap (which is part of why I designed this schema the way I did). It looks to be linear with respect to the total number of top-level documents in the collection, which are then filtered down, and there are fewer than 10,000 of them.
The confusing part is that it doesn't seem to be the $project step which is expensive. If I run that one alone, the query completes in under 2 seconds. However, just adding the $match step makes it time out and shows large amounts of CPU and IO usage on the server the database is running on. My best guess is that it's doing some operations on the full update array for some reason, which makes no sense since the first step explicitly limits it to only the last element.
Is there any way I can improve the performance of this aggregation? Does having all of the updates in a single array like this somehow cause Mongo to not be able to create optimal queries even if the array access patterns are efficient themselves?
Would it be better to do what I was doing previously and store each update as a top-level document tagged with the id of its parent entity? This is what I was doing previously, but performance was quite bad and I figured I'd try this schema instead in an effort to improve it. So far, the experience has been the opposite of what I was expecting/hoping for.
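For reference, the logic this aggregation is meant to implement can be sketched in plain JavaScript over in-memory data. This only illustrates the intended semantics (take the last update of each entity, keep it if it is recent enough, count the survivors); it says nothing about how the server executes the pipeline. The sample documents and the function name countRecentlyUpdated are made up for illustration.

```javascript
// Hypothetical in-memory stand-in for the collection.
const entities = [
  { entity_id: 'e1', updates: [
    { timestamp: new Date('2020-01-01T00:00:00Z'), value: 10 },
    { timestamp: new Date('2020-01-02T12:00:00Z'), value: 11 },
  ] },
  { entity_id: 'e2', updates: [
    { timestamp: new Date('2020-01-01T06:00:00Z'), value: 7 },
  ] },
  { entity_id: 'e3', updates: [] },
];

// Counts entities whose most recent update is at or after `since`.
function countRecentlyUpdated(docs, since) {
  let count = 0;
  for (const doc of docs) {
    // Equivalent in spirit to $arrayElemAt: ["$updates", -1].
    const last = doc.updates[doc.updates.length - 1];
    if (last !== undefined && last.timestamp >= since) {
      count += 1;
    }
  }
  return count;
}

const recentCount = countRecentlyUpdated(entities, new Date('2020-01-02T00:00:00Z'));
```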
Use indexing; it will improve the performance of your query.
To do that, use MongoDB Compass to check which indexes are used most, then add indexes one at a time and measure the effect on performance.
After that, fetch only the fields you need in the end by using a projection in the aggregation.
I hope this solves your issue, but I would suggest starting with indexing. It's a huge plus when fetching large amounts of data.
You need to support your query with an index and simplify it as much as possible.
You're querying against the timestamp field of the first element of the updates field, so add an index for that:
db.updates.createIndex({'updates.0.timestamp': 1})
You're just looking for a count, so get that directly:
db.updates.count({'updates.0.timestamp': {$gte: new Date(...)}})
I have a MongoDB collection with millions of documents and I'm trying to optimize my queries. I'm currently using the aggregation framework to retrieve data and group it the way I want. My typical aggregation query looks something like: $match > $group > $group > $project
However, I noticed that the last parts only take a few ms, the beginning is the slowest.
I tried to perform a query with only the $match filter, and then to perform the same query with collection.find. The aggregation query takes ~80ms while the find query takes 0 or 1ms.
I have indexes on pretty much each field so I guess this isn't the problem. Any idea on what could go wrong ? Or is it just a "normal" drawback of the aggregation framework ?
I could use find queries instead of aggregation queries, however I would have to perform a lot of processing after the request and this process can be done quickly with $group etc. so I would rather keep the aggregation framework.
Here are my criteria:
{
    "action" : "click",
    "timestamp" : {
        "$gt" : ISODate("2015-01-01T00:00:00Z"),
        "$lt" : ISODate("2015-02-01T00:00:00Z")
    },
    "itemId" : "5"
}
The main purpose of the aggregation framework is to make it easier to query a large number of entries and produce a small number of results that hold value to you.
As you have said, you can also use multiple find queries, but remember that you can not create new fields with find queries. On the other hand, the $group stage allows you to define your new fields.
If you would like to achieve the functionality of the aggregation framework, you would most likely have to run an initial find (or chain several ones), pull that information and further manipulate it with a programming language.
The aggregation pipeline might seem to take longer, but at least you know you only have to take into account the performance of one system - MongoDB engine.
Whereas, when it comes to manipulating the data returned from a find query, you would most likely have to further manipulate the data with a programming language, thus increasing the complexity depending on the intricacies of the programming language of choice.
Have you tried using explain() on your find queries? It will give you a good idea of how long the find() query actually takes. You can do the same for the $match stage by running the aggregation with the explain option, and see whether there is any difference in index usage and other parameters.
Also, the $group part of the aggregation framework doesn't use indexes, so it has to process all the records returned by the $match stage. To better understand how your query works, look at the result set it returns and whether it fits into memory to be processed by MongoDB.
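When comparing explain output for the find and for the aggregation, it can help to reduce each plan to its list of stage names. The following is a rough sketch that walks a plan object by following its inputStage / inputStages links. The helper name collectStages and the sample plan are illustrative: the plan is hand-written to resemble queryPlanner.winningPlan, it is not real server output, and the exact shape of explain output varies between server versions.

```javascript
// Collects stage names from a winningPlan-shaped object by following
// inputStage (single child) or inputStages (several children).
function collectStages(plan) {
  if (!plan) return [];
  const children = plan.inputStages || (plan.inputStage ? [plan.inputStage] : []);
  return [plan.stage, ...children.flatMap((child) => collectStages(child))];
}

// Hand-written example shaped like queryPlanner.winningPlan.
// The index name is an assumption made up for this sketch.
const samplePlan = {
  stage: 'FETCH',
  inputStage: { stage: 'IXSCAN', indexName: 'action_1_itemId_1_timestamp_1' },
};

const stages = collectStages(samplePlan);
// An index scan without a collection scan suggests the filter used an index.
const usesIndex = stages.includes('IXSCAN') && !stages.includes('COLLSCAN');
```

If the find plan shows IXSCAN while the aggregation's $match shows COLLSCAN, the two queries are not using the same index, which would be one explanation for the timing difference.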
If you are concerned with performance, then aggregation undoubtedly takes more time than find.
When you are fetching records on multiple conditions, with lookups, grouping, and a limited (paginated) result, aggregate is the best approach. A find query, meanwhile, is fast when you have to fetch a very large data set. If you only need some population and projection and no pagination, I suggest using a find query, which is faster.
I started exploring MongoDB a couple of weeks ago, and I have the following scenario. I have a collection which has 3 million records.
I want to perform an aggregation on the collection based on two keys (I also need to use a match condition). I used the aggregation framework for this. I learned that an aggregation fails if the resulting document (array) exceeds 16 MB.
I ran into that issue when I tried it, so I am trying to use map reduce now. I would like some guidance on implementing it. How can I overcome the 16 MB size limit by using map reduce?
I also learned that I could split the collection into multiple collections and run the aggregation on each of them. It would be great if anyone could point me in the right direction.
Even without code there are basic answers to your questions.
The limitation on the BSON document 16MB output size is for "inline" responses. That means a response from your operations that does not write the individual "documents" from your response to a collection.
So with mapReduce a statement much like this:
{ "out": { "inline": 1 } }
Has the problem that the "array" in the response needs to be under 16MB. But if you change this to output to a collection:
{ "out": { "replace": "newcollection" } }
Then you no longer have this limitation.
The same applies to the aggregate method from versions 2.6 and upwards using the $out pipeline stage:
// lots of pipeline
{ "$out": "newcollection" }
This overcomes the limitation by the same means, by outputting to a collection.
Actually with the aggregate statement, again from version 2.6 and upwards this returns a cursor, just like the .find() method, and is also not subject to this limitation.
In Mongo, suppose I have a collection mycollection that has fields a, b, and huge. I very frequently want to perform queries, mapreduce, updates, etc. on a and b, and very occasionally want to return huge in query results as well.
I know that db.mycollection.find() will scan the entire collection and result in Mongo attempting to add the whole collection to the working set, which may exceed the amount of RAM I have available.
If I instead call db.mycollection.find({}, { a : 1, b : 1 }), will this still result in the whole collection being added to the working set or only the terms of my projection?
MongoDB can use something called covered queries: these allow you to load all the values from the index rather than the disk, or memory, if those documents are in memory at the time.
Be warned that you cannot use covered queries on a full table scan; the condition, projection and sort must all be within the index, i.e.:
db.col.find({a:1}, {_id:0, a:1, b:1})(.sort({b:1}));
Would work (the sort is in brackets because it is not totally needed). You can add _id to your index if you intend to return that too.
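The rule described here can be written down as a small check in plain JavaScript. This is a deliberately simplified sketch: it only compares field names against the index keys and ignores other conditions that can prevent covering (array fields, for example). The function name couldBeCovered is hypothetical, and the index is assumed to be on a and b.

```javascript
// Simplified check of the rule above: a query can only be covered if it
// has a condition, and every filtered, projected, and sorted field is a
// key of the index. `indexKeys` is the list of field names in the index.
function couldBeCovered(indexKeys, filter, projection, sort = {}) {
  const inIndex = (field) => indexKeys.includes(field);
  // Per the text above, a query with no condition is treated as a full scan.
  if (Object.keys(filter).length === 0) return false;
  const included = Object.keys(projection).filter((f) => projection[f]);
  // Without an inclusion projection the whole document is returned.
  if (included.length === 0) return false;
  // _id is returned by default unless it is explicitly excluded.
  const returnsId = projection._id !== 0 && projection._id !== false;
  if (returnsId && !inIndex('_id')) return false;
  return [...Object.keys(filter), ...included, ...Object.keys(sort)].every(inIndex);
}

// Assumed index for this sketch: { a: 1, b: 1 }
const indexKeys = ['a', 'b'];
const covered = couldBeCovered(indexKeys, { a: 1 }, { _id: 0, a: 1, b: 1 }, { b: 1 });
const notCoveredId = couldBeCovered(indexKeys, { a: 1 }, { a: 1, b: 1 });
const notCoveredHuge = couldBeCovered(indexKeys, { a: 1 }, { _id: 0, a: 1, huge: 1 });
```

Here covered is true, while the other two are false: one returns _id, which is not in the index, and the other projects huge, which forces a fetch of the document.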
Map Reduce does not support covered queries; there is no way to project only certain fields into the MR, as far as I know, though maybe there is some hack I do not know of. Map Reduce only supports a $match-like operator as its input query, with a separate parameter for the sort of the incoming query.
Note that for updates I believe only atomic operations (excluding findAndModify) do not load the document into your working set; however, "believe" is the keyword there.
Considering you need to do both MR and normal find and update on these records I would strongly recommend you look into checking why you are paging in so much data and whether you really do need to do it that often. It seems like you are trying to do too much processing in a short and frequent amount of time.
On the other hand, if this is a script which runs every night or something then I would not worry too much about its excessive working set (i.e. score board recalc script).
Is it slow/poor form to use the $in operator in MongoDB with a large array of possibilities?
author : {
$in : ['friend1','friend2','friend3'....'friend40']
App Engine, for example, won't let you use more than 30 because they translate directly to one query per item in the IN array, and so instead force you into using their method for handling fan out. While that's probably the most efficient method in Mongo too, the code for it is significantly more complex so I'd prefer to just use this generic method.
Will Mongo execute these $in queries efficiently for reasonable-sized datasets?
It can be fairly efficient with small lists (hard to say what small is, but at least into the tens/hundreds) for $in. It does not work like app-engine since mongodb has actual btree indexes and isn't a column store like bigtable.
With $in it will skip around in the index to find the matching documents, or walk through the whole collection if there isn't an index to use.
Assuming you have created an index on the author field, from an algorithmic point of view the time complexity of the $in operation is O(N*log(M)), where N is the length of the input array and M is the size of the collection.
The time complexity of the $in operation will not change unless you change databases (though I don't think any database can beat O(N*log(M))).
However, from an engineering point of view, if N becomes large, it is better to let your business logic server simulate the $in operation, either in batches or one by one.
This is simply because: memory in database servers is way more valuable than the memory in business logic servers.
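As a rough illustration of the batching idea, the application can split a long list into fixed-size chunks and build one $in filter per chunk, then run the resulting queries and merge the results itself. The helper name buildBatchedInFilters and the batch size of 15 are arbitrary choices for this sketch; only the filter objects are built here, and no database call is made.

```javascript
// Hypothetical helper: splits `values` into chunks of `batchSize` and
// returns one { field: { $in: [...] } } filter per chunk.
function buildBatchedInFilters(field, values, batchSize) {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError('batchSize must be a positive integer');
  }
  const filters = [];
  for (let i = 0; i < values.length; i += batchSize) {
    filters.push({ [field]: { $in: values.slice(i, i + batchSize) } });
  }
  return filters;
}

// 40 author names, as in the question: friend1 ... friend40.
const friends = Array.from({ length: 40 }, (_, i) => `friend${i + 1}`);
const filters = buildBatchedInFilters('author', friends, 15);
```

Each filter could then be passed to a separate find() call. Whether this is actually faster than a single large $in depends on the data and the index, so it is worth measuring with explain() before adopting it.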
If you build an index (ensureIndex, which is called createIndex in current versions) on the list element, it should be pretty quick.
Have you tried using explain()? It's a good, built-in way to profile your queries: