I am familiar with the best practice of range-based pagination on large MongoDB collections, but I am struggling to figure out how to paginate a collection where the sort value is on a non-unique field.
For example, I have a large collection of users, and there is a field for the number of times they have done something. This field is definitely non-unique and could have large groups of documents with the same value.
I would like to return results sorted by that 'numTimesDoneSomething' field.
Here is a sample data set:
{_id: ObjectId("50c480d81ff137e805000003"), numTimesDoneSomething: 12}
{_id: ObjectId("50c480d81ff137e805000005"), numTimesDoneSomething: 9}
{_id: ObjectId("50c480d81ff137e805000006"), numTimesDoneSomething: 7}
{_id: ObjectId("50c480d81ff137e805000007"), numTimesDoneSomething: 1}
{_id: ObjectId("50c480d81ff137e805000002"), numTimesDoneSomething: 15}
{_id: ObjectId("50c480d81ff137e805000008"), numTimesDoneSomething: 1}
{_id: ObjectId("50c480d81ff137e805000009"), numTimesDoneSomething: 1}
{_id: ObjectId("50c480d81ff137e805000004"), numTimesDoneSomething: 12}
{_id: ObjectId("50c480d81ff137e805000010"), numTimesDoneSomething: 1}
{_id: ObjectId("50c480d81ff137e805000011"), numTimesDoneSomething: 1}
How would I return this data set sorted by 'numTimesDoneSomething' with 2 records per page?
@cubbuk shows a good example using offset (skip), but you can also mould the query he shows for ranged pagination:
db.collection.find().sort({numTimesDoneSomething:-1, _id:1})
Since the _id here will be unique and you are sorting on it secondarily, you can range on it to break ties: even between two records that both have a numTimesDoneSomething of 12, the results stay consistent as to whether a record should be on one page or the next.
So, keeping the numTimesDoneSomething and _id of the last document on the previous page (last_value and last_id below), something as simple as

var q = db.collection.find({
    $or: [
        {numTimesDoneSomething: {$lt: last_value}},
        {numTimesDoneSomething: last_value, _id: {$gt: last_id}}
    ]
}).sort({numTimesDoneSomething: -1, _id: 1}).limit(2)

should work quite well for ranged pagination. (Note that ranging on _id alone is not enough: a document with a lower numTimesDoneSomething but an _id smaller than last_id would never be returned.)
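For completeness, a quick shell sketch of fetching the first page and capturing last_value and last_id from it (the variable names are just local placeholders):

// Page 1: plain sorted query, no range filter yet.
var page = db.collection.find()
    .sort({numTimesDoneSomething: -1, _id: 1})
    .limit(2)
    .toArray();

// Remember where this page ended; the next query ranges from here.
var last = page[page.length - 1];
var last_value = last.numTimesDoneSomething;
var last_id = last._id;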
You can sort on multiple fields, in this case on numTimesDoneSomething and the _id field. Since the _id field already ascends with the insertion timestamp, you will be able to paginate through the collection without iterating over duplicate data, unless new data is inserted during the iteration.
db.collection.find().sort({numTimesDoneSomething:-1, _id:1}).skip(index).limit(2)
{
    field_1: [
        {A: 1, ...},
        {A: 10, ...},
        {A: 2, ...}
    ]
}
What if field_1 has tens of thousands of subdocuments in its array?
What kind of query performance would be negatively affected? And why?
What if the subdocuments are covered by a multikey index?
If field_1 has a million subdocuments, you will most likely hit MongoDB's 16MB document size limit before anything else. Even below that limit, very large arrays hurt performance: a multikey index creates one index entry per array element, so writes to the array mean maintaining thousands of index entries; every $push grows the document, which can force expensive rewrites; and reading the document transfers the entire array over the wire unless you project only part of it.
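For the read side, a $slice projection (a minimal sketch using the field name from the question) returns only part of the array instead of the whole thing:

// Return only the first 10 elements of field_1 with each document.
db.collection.find({}, {field_1: {$slice: 10}})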
Say I have a Foo document like the following:
{
_id: 1,
bar: [{_id: 1, ...bar props}, {_id: 2, ...bar props}, {_id: 3, ...bar props}],
... other foo props
}
How do I query the database for a single Bar, such that my result looks like:
{_id: 2, ...bar props}
Something like:
db.collection('foo').findOne({_id: 1, 'bar._id': 2}, {bar: 1, _id: 0})
Matching and projection are separate operations in MongoDB and you should also keep them separate when you are thinking (and asking) about queries.
You cannot "query for a single Bar". Queries always match documents. What you can do is find a document which contains a Bar which matches conditions, or you can find a document which contains exactly one Bar which also matches conditions, etc. In all of these cases you still get the top-level document(s) as a result.
To retrieve (only) one, several, or all of the Bars in whichever documents matched your query conditions, instead of those documents themselves, use projection (either the second argument to find or a $project aggregation pipeline stage).
When you are using the aggregation pipeline, you can mix $match and $project stages so that, for example, you $match to filter down documents, then $project to reduce them to some of their fields, then $match to further filter the results, and so on. Still, matching and projection remain separate operations.
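For example, a sketch in the mongo shell (assuming a collection named foo holding the document shape from the question; $replaceRoot requires MongoDB 3.4+) that returns just the matching Bar:

db.foo.aggregate([
    // Match: find the parent document that contains the Bar we want.
    {$match: {_id: 1, "bar._id": 2}},
    // Project: keep only the array elements that match.
    {$project: {
        _id: 0,
        bar: {$filter: {input: "$bar", as: "b", cond: {$eq: ["$$b._id", 2]}}}
    }},
    // Unwrap the single remaining element so the result is the Bar itself.
    {$unwind: "$bar"},
    {$replaceRoot: {newRoot: "$bar"}}
])

This yields {_id: 2, ...bar props} rather than the enclosing Foo document.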
I want to execute a mongodb query that would fetch documents until the sum of a field of those documents exceeds a value. For example, if I have the following documents
{id: 1, qty: 40}
{id: 2, qty: 50}
{id: 3, qty: 30}
and I have a set quantity of 80, I would want to retrieve id1 and id2, because 40+50 is 90, which exceeds 80. If I wanted a quantity of 90, I would also retrieve id1 and id2. Does anyone have insight into how to query in this manner? (I'm using Go, btw, but any general mongo query advice would help tremendously.)
Since you're keeping a running sum of a field, the easiest way to do this is to run a Find operation, get a cursor, and iterate it while keeping the sum yourself until the required total is reached, then close the cursor and return:
cursor, err := coll.Find(context.Background(), query)
if err != nil {
    return err
}
defer cursor.Close(context.Background())

sum := 0
var data struct {
    Qty int `bson:"qty"`
}
for cursor.Next(context.Background()) {
    // Decode the current document and add its qty to the running sum.
    if err := cursor.Decode(&data); err != nil {
        return err
    }
    sum += data.Qty
    if sum >= 80 {
        break
    }
}
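One thing to watch: "fetch until the sum exceeds X" only has a well-defined answer for a given document order, so you will usually want to pass a sort option to Find. A sketch assuming the official mongo-go-driver (with its bson and options packages imported; the id field name comes from the question's example documents):

// Iterate in a deterministic order; the running sum depends on it.
opts := options.Find().SetSort(bson.D{{Key: "id", Value: 1}})
cursor, err := coll.Find(context.Background(), query, opts)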
Ranged pagination is cut and dried when you're paginating on a single unique field, but how does it work, if at all, with non-unique fields, perhaps several of them at a time?
TL;DR: Is it reasonable or possible to paginate and sort an "advanced search" type query using range-based pagination? This means querying on, and sorting on, user-selected, perhaps non-unique fields.
For example say I wanted to paginate a search for played word docs in a word game. Let's say each doc has a score and a word and I'd like to let users filter and sort on those fields. Neither field is unique. Assume a sorted index on the fields in question.
Starting simple, say the user wants to see all words with a score of 10:
// page 1
db.words.find({score: 10}).limit(pp)
// page 2, all words with the score, ranged on a unique _id, easy enough!
db.words.find({score: 10, _id: {$gt: last_id}}).limit(pp)
But what if the user wanted to get all words with a score less than 10?
// page 1
db.words.find({score: {$lt: 10}}).limit(pp)
// page 2, getting ugly...
db.words.find({
    // OR because we need everything lt the last score, but also docs with
    // the *same* score as the last score we haven't seen yet
    $or: [
        {score: last_score, _id: {$gt: last_id}},
        {score: {$lt: last_score}}
    ]
}).limit(pp)
Now what if the user wanted words with a score less than 10, and an alphabetic value greater than "FOO"? The query quickly escalates in complexity, and this is for just one variation of the search form with the default sort.
// page 1
db.words.find({score: {$lt: 10}, word: {$gt: "FOO"}}).limit(pp)
// page 2, officially ugly.
db.words.find({
    // the original search conditions still apply on every page
    score: {$lt: 10}, word: {$gt: "FOO"},
    // triple OR because now we need docs with the *same* score and word but
    // a later _id, OR the *same* score but a higher word, OR a lower score
    $or: [
        {score: last_score, word: last_word, _id: {$gt: last_id}},
        {score: last_score, word: {$gt: last_word}},
        {score: {$lt: last_score}}
    ]
}).limit(pp)
I suppose writing a query builder for this sort of pattern would be doable, but it seems terribly messy and error-prone. I'm leaning toward falling back to skip pagination with a capped result size, but I'd like to use ranged pagination if possible. Am I completely wrong in my thinking of how this would have to work? Is there a better way?
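For what it's worth, such a builder can be only a few lines if you generalize the pattern above (a hedged sketch; afterFilter and its arguments are illustrative names, not an existing API):

// Build a "documents after lastDoc" filter for an ordered list of
// [field, direction] sort keys: each clause pins every earlier key with
// equality and ranges strictly on the current one.
function afterFilter(sortSpec, lastDoc) {
    var clauses = [];
    for (var i = 0; i < sortSpec.length; i++) {
        var clause = {};
        for (var j = 0; j < i; j++) {
            clause[sortSpec[j][0]] = lastDoc[sortSpec[j][0]];
        }
        var field = sortSpec[i][0];
        clause[field] = sortSpec[i][1] === 1
            ? {$gt: lastDoc[field]}
            : {$lt: lastDoc[field]};
        clauses.push(clause);
    }
    return {$or: clauses};
}

// afterFilter([["score", -1], ["word", 1], ["_id", 1]], last) produces the
// same triple-OR shape as the page-2 query above; the base search
// conditions still need to be ANDed in.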
Edit: For the record...
With no viable alternatives thus far, I'm actually just using skip-based pagination with a limited result set, keeping the skip manageable. For my purposes this is sufficient, as there's no real need to search and then paginate into the thousands.
You can get ranged pagination by sorting on a unique field and saving the value of that field for the last result. For example:
// first page
var page = db.words.find({
score:{$lt:10},
word:{$gt:"FOO"}
}).sort({"_id":1}).limit(pp);
// Get the _id from the last result
var page_results = page.toArray();
var last_id = page_results[page_results.length-1]._id;
// Use last_id to get your next page
var next_page = db.words.find({
score:{$lt:10},
word:{$gt:"FOO"},
_id:{$gt:last_id}
}).sort({"_id":1}).limit(pp);
I run a lot of find requests on a collection, like this:
{'$and': [{'time': {'$lt': 1375214400}},
{'time': {'$gte': 1375128000}},
{'$or': [{'uuid': 'test'},{'uuid': 'test2'}]}
]}
Which index should I create: a compound index, two single-field indexes, or both?
uuid - name of data collector.
time - timestamp
I want to retrieve data collected by one or a few collectors in a specified time interval.
Your query would be better written without the $and and using $in instead of $or:
{
'time': {'$lt': 1375214400, '$gte': 1375128000},
'uuid': {'$in': ['test', 'test2']}
}
Then it's pretty clear you need a compound index that covers both time and uuid for best query performance. But it's important to always confirm your index is being used as you expect by using explain().
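For example (a sketch; putting the equality field first is a common guideline for compound indexes, not something specific to this collection):

// Compound index: equality field (uuid) first, then the range field (time).
db.collection.createIndex({uuid: 1, time: 1})

// Confirm the planner uses it: look for an IXSCAN stage on this index.
db.collection.find({
    time: {$lt: 1375214400, $gte: 1375128000},
    uuid: {$in: ['test', 'test2']}
}).explain("executionStats")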