Efficient pagination of MongoDB aggregation? - mongodb

For efficiency, the Mongo documentation recommends that limit statements immediately follow sort statements, thus ending up with the somewhat nonsensical:
collection.find(f).sort(s).limit(l).skip(p)
I say this is somewhat nonsensical because it seems to say take the first l items, and then drop the first p of those l. Since p is usually larger than l, you'd think you'd end up with no results, but in practice you end up with l results.
Aggregation works more as you'd expect:
collection.aggregate({$unwind: u}, {$group: g},{$match: f}, {$sort: s}, {$limit: l}, {$skip: p})
returns 0 results if p>=l.
collection.aggregate({$unwind: u}, {$group: g}, {$match: f}, {$sort: s}, {$skip: p}, {$limit: l})
works, but the documentation seems to imply that this will fail if the match returns a result set that's larger than working memory. Is this true? If so, is there a better way to perform pagination on a result set returned through aggregation?
Source: the "Changed in version 2.4" comment at the end of this page: http://docs.mongodb.org/manual/reference/operator/aggregation/sort/

With MongoDB cursor methods (i.e. when using find()), modifiers like limit, sort, and skip can be applied in any order: the order does not matter. A find() returns a cursor to which the modifications are applied. Sort is always performed before limit, and skip is performed before limit as well. So, in other words, the effective order is: sort -> skip -> limit.
The aggregation framework does not return a DB cursor. Instead it returns a document with the results of the aggregation. It works by producing intermediate results at each step of the pipeline, so the order of operations really matters.
I guess MongoDB does not respect the order of cursor modifier methods because of the way it's implemented internally.
You can't paginate the result of the aggregation framework, because there is only a single document with the results. You can still paginate a regular query by using skip and limit, but a better practice is to use a range query, since it can take advantage of an index.
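For illustration, a minimal range-query pagination sketch in the mongo shell (the posts collection and pageSize are hypothetical placeholders, not from the question):
// First page: take the first pageSize documents in _id order.
var pageSize = 10;
var page = db.posts.find().sort({_id: 1}).limit(pageSize).toArray();
// Remember the last _id on this page...
var lastId = page[page.length - 1]._id;
// ...and resume after it for the next page, instead of skipping documents.
db.posts.find({_id: {$gt: lastId}}).sort({_id: 1}).limit(pageSize);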
UPDATE:
Since v2.6 the Mongo aggregation framework returns a cursor instead of a single document. Compare: v2.4 and v2.6.

The documentation seems to imply that this (aggregation) will fail if the match returns a result set that's larger than working memory. Is this true?
No. You can, for example, aggregate on a collection that is larger than physical memory without even using the $match operator. It might be slow, but it should work. There is no problem if $match returns something that is larger than RAM.
Here are the actual pipeline limits.
http://docs.mongodb.org/manual/core/aggregation-pipeline-limits/
The $match operator alone does not cause memory problems. As stated in the documentation, $group and $sort are the usual villains. They are cumulative, and might require access to the entire input set before they can produce any output. If they load too much data into physical memory, they will fail.
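If you do hit those per-stage limits on MongoDB 2.6 or later, here is a minimal, hedged sketch of the documented allowDiskUse option, reusing the question's f and s placeholders:
db.collection.aggregate(
    [
        {$match: f},
        {$sort: s}
    ],
    // Lets cumulative stages like $sort and $group spill to temporary
    // files instead of failing when they exceed the in-memory limit.
    {allowDiskUse: true}
);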
If so, is there a better way to perform pagination on a result set returned through aggregation?
It has been correctly said that you cannot "paginate" (apply $skip and $limit) on the result of the aggregation, because it is simply a MongoDB document. But you can "paginate" on the intermediate results of the aggregation pipeline.
Using $limit on the pipeline will help on keeping the result set within the 16 MB bounds, the maximum BSON document size. Even if the collection grows, you should be safe.
Problems could arise with $group and, especially, $sort. You can create "sort friendly" indexes to deal with them if they actually happen. Have a look at the documentation on indexing strategies.
http://docs.mongodb.org/manual/tutorial/sort-results-with-indexes/
Finally, be aware that $skip does not help with performance. On the contrary, it tends to slow down the application, since it forces MongoDB to scan every skipped document to reach the desired point in the collection.
http://docs.mongodb.org/manual/reference/method/cursor.skip/

MongoDB's recommendation that $limit immediately follow $sort is sound: when that happens, the sort only needs to keep the top n results in memory rather than sorting the whole set.
It's just that the solution you propose doesn't fit your use case, which is pagination.
You can modify your query to get the benefit of this optimization.
collection.aggregate([
    {$unwind: u},
    {$group: g},
    {$match: f},
    {$sort: s},
    {$limit: l + p},
    {$skip: p}
]);
or for find query
collection.find(f).sort(s).limit(l+p).skip(p)
Though, as you can see, with deep pagination the memory use will still grow more and more, even with this optimization.

Related

How to make distinct operation more quickly in mongodb

There are 30,000,000 records in one collection.
When I use the distinct command on this collection from Java, it takes about 4 minutes, and the result's count is about 40,000.
Is mongodb's distinct operation really so inefficient?
and how can I make it more efficient?
Is mongodb's distinct operation really so inefficient?
At 30m records? I would say 4 minutes is actually quite good; that's about as fast as, maybe a little faster than, SQL would do it.
I would probably test this in other databases before saying it is inefficient.
However, one way of looking at performance is to check whether the field is indexed first, and whether that index is in RAM or can be loaded without page thrashing. distinct() can use an index, so long as the field is indexed.
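For reference, a minimal sketch of that: index the field first so distinct() can walk the index (the field name myField is hypothetical):
// Index the field so distinct() can read values from the index
// rather than scanning every document (use createIndex on 2.6+ shells).
db.collection.ensureIndex({myField: 1});
db.collection.distinct("myField");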
and how can I make it more efficient?
You could use a couple of methods:
Incremental map reduce to distinct the main collection into a unique collection once every, say, 5 minutes
Pre-aggregate the unique values on save, by writing to two collections, one detail and one unique (see the sketch after this list)
Those are the two most viable methods of getting around this performantly.
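As a rough, hedged sketch of the pre-aggregation idea in the mongo shell (the collection names detail and unique, and the field values, are hypothetical):
// On every save, write the full document to the detail collection...
db.detail.insert({field: "someValue", payload: 1});
// ...and upsert a per-value marker into the unique collection, so the
// distinct values can later be read with a cheap find() on "unique".
db.unique.update({_id: "someValue"}, {$inc: {count: 1}}, {upsert: true});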
Edit
Distinct() is not outdated and if it fits your needs is actually more performant than $group since it can use an index.
The .distinct() operation is an old one, as is .group(). In general these have been superseded by .aggregate() which should be generally used in preference to these actions:
db.collection.aggregate([
    { "$group": {
        "_id": "$field",
        "count": { "$sum": 1 }
    }}
])
Substituting "$field" with whatever field you wish to get a distinct count from. The $ prefixes the field name to assign the value.
Look at the documentation and especially $group for more information.

Iterating over distinct items in one field in MongoDB

I have a very large collection (~7M items) in MongoDB, primarily consisting of documents with three fields.
I'd like to be able to iterate over all the unique values for one of the fields, in an expedient manner.
Currently, I'm querying for just that field, and then processing the returned results by iterating on the cursor for uniqueness. This works, but it's rather slow, and I suspect there must be a better way.
I know mongo has the db.collection.distinct() function, but this is limited by the maximum BSON size (16 MB), which my dataset exceeds.
Is there any way to iterate over something similar to the db.collection.distinct(), but using a cursor or some other method, so the record-size limit isn't as much of an issue?
I think maybe something like the map/reduce functionality would possibly be suited for this kind of thing, but I don't really understand the map-reduce paradigm in the first place, so I have no idea what I'm doing. The project I'm working on is partially to learn about working with different database tools, so I'm rather inexperienced.
I'm using PyMongo if it's relevant (I don't think it is). This should be mostly dependent on MongoDB alone.
Example:
For this dataset:
{"basePath" : "foo", "internalPath" : "Neque", "itemhash": "49f4c6804be2523e2a5e74b1ffbf7e05"}
{"basePath" : "foo", "internalPath" : "porro", "itemhash": "ffc8fd5ef8a4515a0b743d5f52b444bf"}
{"basePath" : "bar", "internalPath" : "quisquam", "itemhash": "cf34a8047defea9a51b4a75e9c28f9e7"}
{"basePath" : "baz", "internalPath" : "est", "itemhash": "c07bc6f51234205efcdeedb7153fdb04"}
{"basePath" : "foo", "internalPath" : "qui", "itemhash": "5aa8cfe2f0fe08ee8b796e70662bfb42"}
What I'd like to do is iterate over just the basePath field. For the above dataset, this means I'd iterate over foo, bar, and baz just once each.
I'm not sure if it's relevant, but the DB I have is structured so that while each field is not unique, the aggregate of all three is unique (this is enforced with an index).
The query and filter operation I'm currently using (note: I'm restricting the query to a subset of the items to reduce processing time):
self.log.info("Running path query")
itemCursor = self.dbInt.coll.find({"basePath": pathRE}, fields={'_id': False, 'internalPath': False, 'itemhash': False}, exhaust=True)
self.log.info("Query complete. Processing")
self.log.info("Query returned %d items", itemCursor.count())
self.log.info("Filtering returned items to require uniqueness.")
items = set()
for item in itemCursor:
# print item
items.add(item["basePath"])
self.log.info("total unique items = %s", len(items))
Running the same query with self.dbInt.coll.distinct("basePath") results in OperationFailure: command SON([('distinct', u'deduper_collection'), ('key', 'basePath')]) failed: exception: distinct too big, 16mb cap
Ok, here is the solution I wound up using. I'd add it as an answer, but I don't want to detract from the actual answers that got me here.
reStr = "^%s" % fqPathBase
pathRE = re.compile(reStr)
self.log.info("Running path query")
pipeline = [
{ "$match" :
{
"basePath" : pathRE
}
},
# Group the keys
{"$group":
{
"_id": "$basePath"
}
},
# Output to a collection "tmp_unique_coll"
{"$out": "tmp_unique_coll"}
]
itemCursor = self.dbInt.coll.aggregate(pipeline, allowDiskUse=True)
itemCursor = self.dbInt.db.tmp_unique_coll.find(exhaust=True)
self.log.info("Query complete. Processing")
self.log.info("Query returned %d items", itemCursor.count())
self.log.info("Filtering returned items to require uniqueness.")
items = set()
retItems = 0
for item in itemCursor:
retItems += 1
items.add(item["_id"])
self.log.info("Recieved items = %d", retItems)
self.log.info("total unique items = %s", len(items))
General performance compared to my previous solution is about 2X in terms of wall-clock time. On a query that returns 834273 items, with 11467 uniques:
Original method (retrieve, stuff into a python set to enforce uniqueness):
real 0m22.538s
user 0m17.136s
sys 0m0.324s
Aggregate pipeline method :
real 0m9.881s
user 0m0.548s
sys 0m0.096s
So while the overall execution time is only ~2X better, the aggregation pipeline is massively more performant in terms of actual CPU time.
Update:
I revisited this project recently, and rewrote the DB layer to use a SQL database, and everything was much easier. A complex processing pipeline is now a simple SELECT DISTINCT(colName) WHERE xxx operation.
Realistically, MongoDB and NoSQL databases in general are very much the wrong database type for what I'm trying to do here.
From the discussion points so far I'm going to take a stab at this. And I'm also noting that as of writing, the 2.6 release for MongoDB should be just around the corner, good weather permitting, so I am going to make some references there.
Oh and the FYI that didn't come up in chat, .distinct() is an entirely different animal that pre-dates the methods used in the responses here, and as such is subject to many limitations.
And this solution is ultimately a solution for 2.6 and up, or any current dev release over 2.5.3.
The alternative for now is to use mapReduce, because its only restriction is the output size.
Without going into the inner workings of distinct, I'm going to go on the presumption that aggregate is doing this more efficiently [and even more so in the upcoming release].
db.collection.aggregate([
// Group the key and increment the count per match
{$group: { _id: "$basePath", count: {$sum: 1} }},
// Hey you can even sort it without breaking things
{$sort: { count: 1 }},
// Output to a collection "output"
{$out: "output"}
])
So we are using the $out pipeline stage to get the final result, which is over 16MB, into a collection of its own. There you can do what you want with it.
As 2.6 is "just around the corner" there is one more tweak that can be added.
Use allowDiskUse from the runCommand form, where each stage can use disk and not be subject to memory restrictions.
The main point here is that this is nearly production-ready, and the performance will be better than the same operation in mapReduce. So go ahead and play. Install 2.5.5 for your own use now.
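A hedged sketch of that runCommand form, reusing the pipeline above (2.6 / 2.5.x-dev syntax; the collection name is a placeholder):
db.runCommand({
    aggregate: "collection",
    pipeline: [
        {$group: {_id: "$basePath", count: {$sum: 1}}},
        {$sort: {count: 1}},
        {$out: "output"}
    ],
    // Allow cumulative stages ($group, $sort) to spill to temporary files on disk.
    allowDiskUse: true
})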
A MapReduce, in the current version of Mongo would avoid the problems of the results exceeding 16MB.
map = function() {
if(this['basePath']) {
emit(this['basePath'], 1);
}
// if basePath always exists you can just call emit directly:
// emit(this.basePath, 1);
};
reduce = function(key, values) {
return Array.sum(values);
};
For each document, the basePath is emitted with the value 1, representing one occurrence of that value. The reduce simply sums all the values. The resulting collection will contain all unique values of basePath along with the total number of occurrences of each.
And, as you'll need to store the results to prevent an error, use the out option, which specifies a destination collection.
db.yourCollectionName.mapReduce(
map,
reduce,
{ out: "distinctMR" }
)
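Once the job has run, the distinct values can simply be read back from that output collection:
// Each _id is a distinct basePath; value holds its occurrence count.
db.distinctMR.find()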
@Neil Lunn's answer could be simplified:
field = 'basePath' # Field I want
db.collection.aggregate( [{'$project': {field: 1, '_id': 0}}])
$project filters fields for you. In particular, '_id': 0 filters out the _id field.
Result still too large? Batch it with $limit and $skip:
field = 'basePath' # Field I want
db.collection.aggregate( [{'$project': {field: 1, '_id': 0}}, {'$skip': Y}, {'$limit': X}])
I think the most scalable solution is to perform a query for each unique value. The queries must be executed one after the other, and each query gives you the "next" unique value based on the previous query's result. The idea is that each query returns a single document containing the unique value that you are looking for. If you use the proper projection, mongo will just use the index loaded into memory without having to read from disk.
You can implement this strategy using the $gt operator in mongo, but you must take into account values like null or empty strings, and potentially discard them using the $ne or $nin operator. You can also extend this strategy to multiple keys, using operators like $gte for one key and $gt for the other.
This strategy should give you the distinct values of a string field in alphabetical order, or distinct numerical values sorted ascendingly.
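A hedged sketch of that strategy in the mongo shell, assuming an ascending index on basePath and using the collection/field names from the question as placeholders:
// Walk the distinct basePath values one covered query at a time.
var last = "";   // assumes all basePath values sort after the empty string
while (true) {
    // Project only the indexed field so the index alone can answer the query.
    var cur = db.collection.find({basePath: {$gt: last}}, {basePath: 1, _id: 0})
                           .sort({basePath: 1}).limit(1);
    if (!cur.hasNext()) break;
    last = cur.next().basePath;
    print(last);   // one distinct value per iteration
}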

difference between aggregate ($match) and find, in MongoDB?

What is the difference between the $match operator used inside the aggregate function and the regular find in Mongodb?
Why doesn't the find function allow renaming the field names like the aggregate function?
e.g. In aggregate we can pass the following string:
{ "$project" : { "OrderNumber" : "$PurchaseOrder.OrderNumber" , "ShipDate" : "$PurchaseOrder.ShipDate"}}
Whereas, find does not allow this.
Why doesn't the aggregate output return as a DBCursor or a List? And also, why can't we get a count of the documents that are returned?
Thank you.
Why doesn't the aggregate output return as a DBCursor or a List?
The aggregation framework was created to solve easy problems that otherwise would require map-reduce.
This framework is commonly used to compute data that requires the full db as input and a few documents as output.
What is the difference between the $match operator used inside the aggregate function and the regular find in Mongodb?
One of the differences, as you stated, is the return type: find operations return a DBCursor.
Other differences:
Aggregation result must be under 16MB. If you are using shards, the full data must be collected in a single point after the first $group or $sort.
$match's purpose is not just to add to the aggregation's expressive power; it has other uses too, such as improving the aggregation's performance by filtering documents early in the pipeline.
and also why can't we get a count of the documents that are returned?
You can. Just count the number of elements in the resulting array or add the following command to the end of the pipe:
{$group: {_id: null, count: {$sum: 1}}}
Why doesn't the find function allow renaming the field names like the aggregate function?
MongoDB is young and features are still coming. Maybe in a future version we'll be able to do that. Renaming fields is more critical in aggregation than in find.
EDIT (2014/02/26):
MongoDB 2.6 aggregation operations will return a cursor.
EDIT (2014/04/09):
MongoDB 2.6 was released with the predicted aggregation changes.
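For illustration, a hedged sketch of the 2.6 shell form that exposes the cursor (collection, field, and pipeline are placeholders):
// In 2.6+ the aggregate helper returns a cursor you can iterate lazily.
var cur = db.collection.aggregate(
    [{$group: {_id: "$field", count: {$sum: 1}}}],
    {cursor: {batchSize: 100}}   // request cursor-based results from the server
);
cur.forEach(printjson);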
I investigated a few things about the aggregation and find call:
I did this with a descending sort in a table of 160k documents and limited my output to a few documents.
The Aggregation command is slower than the find command.
If you access the data, e.g. via ToList(), the aggregation command is faster than find.
If you look at the total times (point 1 + 2) the commands seem to be equal.
Maybe the aggregation automatically calls ToList() and does not have to call it again. If you don't call ToList() afterwards, the find() call will be much faster.
7 [ms] vs 50 [ms] (5 documents)

Time Complexity of $addToset vs $push when element does not exist in the Array

Given: Connection is Safe=True so Update's return will contain update information.
Say I have documents that look like:
[{'a': [1]}, {'a': [2]}, {'a': [1,2]}]
And I issue:
coll.update({}, {'$addToSet': {'a':1}}, multi=True)
The result would be:
{u'connectionId': 28,
u'err': None,
u'n': 3,
u'ok': 1.0,
u'updatedExisting': True
}
Even when some documents already have that value. To avoid this I could issue the command:
coll.update({'a': {'$ne': 1}}, {'$push': {'a':1}}, multi=True)
What's the Time Complexity Comparison for $addToSet vs. $push with a $ne check ?
Looks like $addToSet is doing the same thing as your command: $push with a $ne check. Both would be O(N)
https://github.com/mongodb/mongo/blob/master/src/mongo/db/ops/update_internal.cpp
if speed is really important then why not use a hash:
instead of:
{'$addToSet': {'a':1}}
{'$addToSet': {'a':10}}
use:
{$set: {'a.1': 1}}
{$set: {'a.10': 1}}
Edit
OK, since I read your question wrong all along, it turns out that you are actually comparing two different queries and judging the time complexity between them.
The first query being:
coll.update({}, {'$addToSet': {'a':1}}, multi=True)
And the second being:
coll.update({'a': {'$ne': 1}}, {'$push': {'a':1}}, multi=True)
The first problem that springs to mind here: no indexes. $addToSet, being an update modifier, does not (I believe) use an index, so you are doing a full table scan to accomplish what you need.
In reality you are looking for all documents that do not already have 1 in a, and looking to $push the value 1 onto that a array.
So 2 points to the second query even before we get into time complexity here because the first query:
Does not use indexes
Would be a full table scan
Would then do a full array scan (with no index) to $addToSet
So I have pretty much made my mind up here that the second query is what you're looking for, before any of the Big O notation stuff.
There is a problem to using big O notation to explain the time complexity of each query here:
I am unsure of what perspective you want, whether it is per document or for the whole collection.
I am unsure about indexes as such. Using an index on a would actually give a logarithmic algorithm, whereas not using one does not.
However the first query would look something like: O(n) per document since:
The $addToSet would need to iterate over each element
The $addToSet would then need to do an O(1) op to insert the set if it does not exist. I should note I am unsure whether the O(1) is cancelled out or not (light reading suggests my version), I have cancelled it out here.
Per collection, without the index, it would be O(2n²), since the complexity of iterating a will increase exponentially with every new document.
The second query, without indexes, would look something like O(2n²) (O(n) per document), I believe, since $ne would have the same problems as $addToSet without indexes. However, with indexes I believe this would actually be O(log n log n) (O(log n) per document), since it would first find all documents with a in them, then all documents without 1 in their set, based upon the b-tree.
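As a side note, a minimal sketch of creating the index assumed in that estimate (the collection name coll is hypothetical):
// A multikey index on the array field gives the {'a': {$ne: 1}} filter
// an index to work with, as assumed in the O(log n) estimate above.
db.coll.ensureIndex({a: 1});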
So based upon time complexity and the notes at the beginning I would say query 2 is better.
If I am honest I am not used to explaining in "Big O" Notation so this is experimental.
Hope it helps,
Adding my observation on the difference between $addToSet and $push from a bulk update of 100k documents.
When you are doing a bulk update, $addToSet will be executed separately.
For example,
bulkInsert.find({x:y}).upsert().update({"$set":{..},"$push":{ "a":"b" } , "$setOnInsert": {} })
will first insert and $set the document, and then it executes the $addToSet query.
I saw a clear difference of 10k between
db.collection_name.count() #gives around 40k
db.collection_name.count({"a":{$in:["b"]}}) # it gives only around 30k
But when I replaced $addToSet with $push, both count queries returned the same value.
Note: when you're not concerned about duplicate entries in the array, you can go with $push.

Apply function and sort in MongoDB without MapReduce

I have an interesting problem. I have a working M/R version of this but it's not really a viable solution in a small-scale environment since it's too slow and the query needs to be executed real-time.
I would like to iterate over each element in a collection and score it, sort by descending, limit to top 10 and return the results to the applications.
Here is the function I'd like applied to each document in pseudo code.
var score = 0;
document.Tags.forEach(function (tag) {
    score += someMap[tag];
});
return score;
Since your someMap is changing each time, I don't see any alternative other than to score all the documents and return the highest-scoring ones. Whatever method you adopt for this type of operation, you'll have to consider all the documents in the collection, which is going to be slow, and will become more and more costly as the collection you're scanning grows.
One issue with map reduce is that each mongod instance can only run one concurrent map reduce. This is a limitation of the javascript engine, which is single-threaded. Multiple map reduces will be interleaved, but they cannot run concurrently with one another. This means that if you're relying on map reduce for "real-time" uses, that is, if your web page has to run a map reduce to render, you'll eventually hit a limit where page load times become unacceptably slow.
You can work around this by querying all the documents into your application, and doing the scoring, sorting, and limiting in your application code. Queries in MongoDB can run concurrently, unlike map reduce, though of course this means that your application servers will have to do a lot of work.
Finally, if you are willing to wait for MongoDB 2.2 to be released (which should be within a few months), you can use the new aggregation framework in place of map reduce. You'll have to massage the someMap to generate the correct pipeline steps. Here's an example of what this might look like if someMap were {"a": 5, "b": 2}:
db.runCommand({aggregate: "foo",
pipeline: [
{$unwind: "$tags"},
{$project: {
tag1score: {$cond: [{$eq: ["$tags", "a"]}, 5, 0]},
tag2score: {$cond: [{$eq: ["$tags", "b"]}, 2, 0]}}
},
{$project: {score: {$add: ["$tag1score", "$tag2score"]}}},
{$group: {_id: "$_id", score: {$sum: "$score"}}},
{$sort: {score: -1}},
{$limit: 10}
]})
This is a little complicated, and bears explaining:
First, we "unwind" the tags array, so that the following steps in the pipeline process documents where "tags" is a scalar -- the value of the tag from the array -- and all the other document fields (notably _id) are duplicated for each unwound element.
We use a projection operator to convert from tags to named score fields. The $cond/$eq expression for each roughly means (for the tag1score example) "if the value in the document in the 'tags' field is equal to 'a', then return 5 and assign that value to a new field tag1score, else return 0 and assign that". This expression would be repeated for each tag/score combination in your someMap. At this point in the pipeline, each document will have N tagNscore fields, but at most one of them will have a non-zero value.
Next we use another projection operator to create a score field whose value is the sum of the tagNscore fields in the document.
Next we group the documents by their _id, and sum up the value of the score field from the previous step across all documents in each group.
We sort by score, descending (i.e. greatest scores first)
We limit to only the top 10 scores.
I'll leave it as an exercise to the reader how to convert someMap into the correct set of projections in step 2, and the correct set of fields to add in step 3.
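For what it's worth, a rough sketch (mine, not the author's) of how someMap could be turned into those pipeline stages in shell JavaScript; the foo collection and tags field follow the example above:
var someMap = {"a": 5, "b": 2};

// Step 2: one $cond per tag, producing tag1score, tag2score, ...
var condProject = {};
// Step 3: the list of tagNscore fields to add together.
var scoreFields = [];
var i = 1;
for (var tag in someMap) {
    var field = "tag" + i + "score";
    condProject[field] = {$cond: [{$eq: ["$tags", tag]}, someMap[tag], 0]};
    scoreFields.push("$" + field);
    i++;
}

var pipeline = [
    {$unwind: "$tags"},
    {$project: condProject},
    {$project: {score: {$add: scoreFields}}},
    {$group: {_id: "$_id", score: {$sum: "$score"}}},
    {$sort: {score: -1}},
    {$limit: 10}
];
db.foo.aggregate(pipeline);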
This is essentially the same set of steps that your application code or map reduce would go through, but it has the following distinct advantages: unlike map reduce, the aggregation framework is fully implemented in C++ and is faster and more concurrent; and unlike pulling all the documents into your application, the aggregation framework works with the data on the server side, saving network load. But like the other two approaches, this will still have to consider every document, and can only limit the result set once the score has been calculated for all of them.