how to do efficient paging in mongodb [duplicate] - mongodb

This question already has answers here:
Implementing pagination in mongodb
(2 answers)
Closed 5 years ago.
I want to sort all docs by field x (multiple docs can have the same value in this field). Then every time the user presses "next", it loads 10 more docs.
If multiple docs have the same value, they can be displayed in any order among themselves; it doesn't matter.
Since skip() is inefficient on large datasets, how do I do this efficiently? No page numbers are needed, only infinite scroll.

If you don't require page numbers, then you can just utilise a monotonically increasing unique id value, such as _id with ObjectId().
Using your example:
/* _id of the last record from the previous scroll, to be updated on every iteration */
var current_id;
var scroll_size = 10;
db.collection.find({_id: {$lt: current_id}})
    .limit(scroll_size)
    .sort({
        _id: -1,
        x: 1 // Depending on your use case
    });
The example above will give you the most recent records. Depending on your use case you would have to decide what to do with newly inserted documents.
If you are using a field other than _id to scroll through, make sure you add an appropriate index on that field.
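In practice the client keeps the _id of the last document it received and passes it back on the next scroll; a minimal sketch of that loop (the collection name is just the placeholder used above):

// Sketch: fetch one batch of an infinite scroll, newest first.
// current_id is the _id of the last document from the previous batch,
// or null/undefined on the very first load.
function nextBatch(current_id, scroll_size) {
    var filter = current_id ? { _id: { $lt: current_id } } : {};
    var docs = db.collection.find(filter)
                 .sort({ _id: -1 })
                 .limit(scroll_size)
                 .toArray();
    if (docs.length > 0) {
        // Remember where we stopped so the next call continues from here.
        current_id = docs[docs.length - 1]._id;
    }
    return { docs: docs, current_id: current_id };
}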

Related

How to update a value and increment another in a single query using mongodb [duplicate]

This question already has answers here:
How to Model a "likes" voting system with MongoDB
(1 answer)
Can you specify a key for $addToSet in Mongo?
(2 answers)
Closed 4 years ago.
Is there a way to achieve the following:
MyDocument {
    String docId;
    Integer bal;
    Set<String> actions;
}
I need to update the field actions with some value of a user action, say from
[a1, a2]
to
[a1, a2, a3]
At the same time, this added action can increment bal by some value, so I need to ensure that an $inc operation is also applied to the bal field.
One way of achieving this that I could think of was incrementing bal using one update query:
Bson searchFilter = Filters.in("docId", <someId>);
// couldn't find a cleaner way of incrementing in mongo-java
BasicDBObject modifiedObject = new BasicDBObject();
modifiedObject.put("$inc", new BasicDBObject().append("bal", 2)); // any value to increment
mongoCollection.updateOne(searchFilter, modifiedObject);
I can thereafter execute another update query for the actions set.
Another way could possibly have been updating the document with both fields in one go. For that:
I would have to find the document first, read the current bal, perform the compute at application.
Then update the document with both the computed fields.
In both the above approaches, I couldn't see a way of avoiding querying the DB twice. Is there any other workaround to achieve this?
Also, I am concerned about atomicity with the above combination of [UPDATE + UPDATE] or [GET + UPDATE]. Given that multiple such concurrent queries can occur, is the concern even justified?
Do let me know if any further detail is required.
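For what it's worth, MongoDB lets you combine several update operators in one update document, and an update of a single document is atomic, so both fields can be changed in one round trip without a prior read. A minimal mongo-shell sketch (the collection name is a placeholder and the increment value is arbitrary):

// Sketch: add an action and increment the balance in one atomic update.
// Restricting the filter with { actions: { $ne: "a3" } } ensures bal is only
// incremented when "a3" was not already present in the set.
db.myDocuments.updateOne(
    { docId: "<someId>", actions: { $ne: "a3" } },
    {
        $addToSet: { actions: "a3" },  // add the new action if missing
        $inc: { bal: 2 }               // increment the balance
    }
);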

Possible to retrieve multiple random, non-sequential documents from MongoDB?

I'd like to retrieve a random set of documents from a MongoDB database. So far after lots of Googling, I've only seen ways to retrieve one random document OR a set of documents starting at a random skip position but where the documents are still sequential.
I've tried mongoose-simple-random, and unfortunately it doesn't retrieve a "true" random set. What it does is skip to a random position and then retrieve n documents from that position.
Instead, I'd like to retrieve a random set like MySQL does using one query (or a minimal amount of queries), and I need this list to be random every time. I need this to be efficient -- relatively on par with such a query with MySQL. I want to reproduce the following but in MongoDB:
SELECT * FROM products ORDER BY rand() LIMIT 50;
Is this possible? I'm using Mongoose, but an example with any adapter -- or even a straight MongoDB query -- is cool.
I've seen one method of adding a field to each document, generating a random value for that field, and using {rand: {$gte: rand()}} on each query we want randomized. But my concern is that two queries could theoretically return the same set.
You can do it with two requests, but in an efficient way:
Your first request just gets the list of all "_id" values of the documents in your collection. Be sure to use a mongo projection: db.products.find({}, { '_id': 1 }).
Once you have the list of "_id" values, just pick N of them at random.
Do a second query using the $in operator.
What is especially important is that your first query is fully supported by an index (because it's "_id"). This index is likely fully in memory (else you'd probably have performance problems). So, only the index is read while running the first query, and it's incredibly fast.
Although the second query means reading actual documents, the index will help a lot.
If you can do things this way, you should try.
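A rough sketch of that two-query approach in the mongo shell (the collection name and sample size are only illustrative):

// Sketch: pick N random documents using two indexed queries.
var N = 50;

// 1) Covered query: only _id values are read, straight from the _id index.
var ids = db.products.find({}, { _id: 1 }).toArray().map(function (d) { return d._id; });

// 2) Pick N distinct _ids at random.
var picked = [];
while (picked.length < N && ids.length > 0) {
    var idx = Math.floor(Math.random() * ids.length);
    picked.push(ids.splice(idx, 1)[0]);
}

// 3) Fetch the chosen documents in a single query.
var randomDocs = db.products.find({ _id: { $in: picked } }).toArray();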
I don't think MySQL ORDER BY rand() is particularly efficient - as I understand it, it essentially assigns a random number to each row, then sorts the table on this random number column and returns the top N results.
If you're willing to accept some overhead on your inserts to the collection, you can reduce the problem to generating N random integers in a range. Add a counter field to each document: each document will be assigned a unique positive integer, sequentially. It doesn't matter which document gets which number, as long as the assignment is unique and the numbers are sequential, and you either don't delete documents or you complicate the counter scheme to handle holes.

You can do this by making your inserts two-step. In a separate counter collection, keep a document with the first number that hasn't been used for the counter. When an insert occurs, first findAndModify the counter document to retrieve the next counter value and increment it atomically. Then insert the new document with that counter value.

To find N random values, find the max counter value, generate N distinct random numbers in the range defined by the max counter, then use $in to retrieve the documents. Most languages have random libraries that will handle generating the N random integers in a range.
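A rough sketch of that counter scheme in the mongo shell (the collection and field names are made up for illustration):

// Sketch: two-step insert that assigns a sequential counter to each document.
function insertWithCounter(doc) {
    // Atomically claim the next counter value.
    var prev = db.counters.findAndModify({
        query: { _id: "products" },
        update: { $inc: { next: 1 } },
        new: false,     // return the document as it was *before* the increment
        upsert: true
    });
    doc.counter = prev ? prev.next : 0;   // very first insert gets 0
    db.products.insert(doc);
}

// Sketch: sample N documents by generating N distinct random counter values.
function sampleRandom(N) {
    var max = db.products.find().sort({ counter: -1 }).limit(1).next().counter;
    var picks = {};
    while (Object.keys(picks).length < N) {
        picks[Math.floor(Math.random() * (max + 1))] = true;
    }
    return db.products.find({ counter: { $in: Object.keys(picks).map(Number) } }).toArray();
}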

Efficient way to traverse mongoDB's indexes with an iterator/pointer?

We are developing a C++ application development tool that uses MongoDB as the underlying database. Say the user has developed a patient collection with fields _id (OID) and patient#, with a unique ascending index on patient#. Say there are 10 patients with patient#s 1, 5, 7, 13, 14, 20, 21, 22, 23, 25. Patient# 20 is currently displayed, fetched with limit(1). The user presses PageDown and patient# 21 is displayed -- easy with $gt on patient# = 20.
When the user presses PageUp, patient# 14 should be displayed. My (hopefully wrong) solution is to create a parallel descending index on patient# and $lt. That implies every collection a user creates requires both indexes on the primary key fields to get bidirectional movement. That would apply also to secondary indexes such as name.
Additionally, the user presses F5 and is prompted "# of records to move"; they enter 3, and patient# 23 should be displayed. Or they enter -3, and patient# 7 should be displayed. My idea: first use a covered query to return the following 3 patient#s from the index, then fetch the 3rd document. This isn't at all ideal when a less simplified application has hundreds of thousands of documents and the user wants to traverse by tens of thousands of records. And, again, to achieve the backward movement, I believe I would need that second descending index.
Finally, I need a way to have the user navigate to the last index entry (patient #25). Going to the beginning is trivial. Again, second index?
My question: Is there a way to traverse the ascending index using (say) an iterator or pointer to the current index element, and then use iterator/pointer arithmetic to achieve what I want? I.e., +1 would get me the "next" index element from which I could fetch the "next" document; -1 the "previous"; +3 the third following; -3 the third previous; index size the last. Or is there another solution without so much overhead (multiple indexes, large covered queries)?
The way to achieve what you want is to have an index on the relevant fields and then do simple queries to get you the records you need when you need them.
It's important to not over-think the problem. Rather than trying to break down how the query optimizer would traverse the index and trying to "reduce" somehow the work it does, just use the queries you need to get the job done.
That means, in your case, querying for the records you need and, when the user wants to jump to a particular record, querying for that record. If someone is looking at record 27 and they want to go to the next one, you can query for the smallest record greater than 27 via an ascending sort and a limit(1) qualifier on your find; for the previous one, the largest record smaller than 27 via a descending sort and limit(1).
I would encourage you to revisit your choice to have basically two primary keys - instead of a separate patientID field with a unique index, you can store the patientId in the _id field and use the already existing unique index on _id that MongoDB requires in every collection.
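For illustration, a rough sketch of the PageDown/PageUp queries against a single ascending index (the field name patientNo is just a stand-in for patient#). MongoDB can walk a single-field index in either direction, so a separate descending index isn't needed for the backward case:

// Sketch: bidirectional record-at-a-time paging on one ascending index.
// Next record after the currently displayed one (PageDown):
db.patients.find({ patientNo: { $gt: currentNo } }).sort({ patientNo: 1 }).limit(1);

// Previous record (PageUp): same index, walked in reverse.
db.patients.find({ patientNo: { $lt: currentNo } }).sort({ patientNo: -1 }).limit(1);

// Jump to the last record:
db.patients.find().sort({ patientNo: -1 }).limit(1);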
A more concrete description of what I am trying to do can be found here:
How to get the previous mongoDB document from a compound index
The answer is: it can't be done in the current version. The second JIRA ticket is a possible future fix.
See: SERVER-9540
and
SERVER-9547

Maintaining order of mongodb collection

I have a collection that will have many documents (maybe millions). When a user inserts a new document, I would like to have a field that maintains the "order" of the data and that I can index. For example, if one field is time, in the format "1352392957.46516", and I have three documents, the first with time 1352392957.46516, the second with time 1352392957.48516 (20ms later) and the third with 1352392957.49516 (10ms later), I would like to have another field where the first document would have 0, the second 1, the third 2, and so on.
The reason I want this is so that I can index that field, and then when I do a find I can do an efficient $mod operation to downsample the data. For example, if I have a million docs and only want 1000 of them evenly spaced, I could do a $mod [1000, 0] on the integer field.
The reason I could not do that on the time field is that the times may not be perfectly spaced, or might be all even or all odd, so the mod would not work. The separate integer field would keep the order increasing linearly.
Also, you should be able to insert documents anywhere in the collection, so the counter values of all subsequent documents would need to be updated.
Is there a way to do this automatically? Or would I have to implement this? Or is there a more efficient way of doing what I am describing?
It is well beyond "slower inserts" if you are updating several million documents for a single insert - this approach makes your entire collection the active working set. Similarly, in order to do the $mod comparison with a key value, you will have to compare every key value in the index.
Given your requirement for a sorted sampling order, I'm not sure there is a more efficient preaggregation approach you can take.
I would use skip() and limit() to fetch each sampled document. The skip() command will scan from the beginning of the index to skip over unwanted documents each time, but if you have enough RAM to keep the index in memory the performance should be acceptable:
// Add an index on the time field
db.data.ensureIndex({'time': 1})
// Count the number of documents
var dc = db.data.count()
// Iterate and sample every 1000 docs
var i = 0; var sampleSize = 1000; var results = [];
while (i < dc) {
    results.push(db.data.find().sort({time: 1}).skip(i).limit(1)[0]);
    i += sampleSize;
}
// Resulting array of sampled docs
printjson(results);

Slow pagination over tons of records in mongodb

I have over 300k records in one collection in Mongo.
When I run this very simple query:
db.myCollection.find().limit(5);
It takes only a few milliseconds.
But when I use skip in the query:
db.myCollection.find().skip(200000).limit(5)
It won't return anything... it runs for minutes and returns nothing.
How can I make it better?
One approach to this problem, if you have large quantities of documents and you are displaying them in sorted order (I'm not sure how useful skip is if you're not) would be to use the key you're sorting on to select the next page of results.
So if you start with
db.myCollection.find().limit(100).sort({created_date: 1});
and then extract the created date of the last document returned by the cursor into a variable max_created_date_from_last_result, you can get the next page with the far more efficient (presuming you have an index on created_date) query
db.myCollection.find({created_date: { $gt: max_created_date_from_last_result } }).limit(100).sort({created_date: 1});
From MongoDB documentation:
Paging Costs
Unfortunately skip can be (very) costly and requires the server to walk from the beginning of the collection, or index, to get to the offset/skip position before it can start returning the page of data (limit). As the page number increases skip will become slower and more cpu intensive, and possibly IO bound, with larger collections.
Range based paging provides better use of indexes but does not allow you to easily jump to a specific page.
You have to ask yourself: how often do you need the 40,000th page? Also see this article.
I found it performant to combine the two concepts together (both a skip+limit and a find+limit). The problem with skip+limit is poor performance when you have a lot of docs (especially larger docs). The problem with find+limit is you can't jump to an arbitrary page. I want to be able to paginate without doing it sequentially.
The steps I take are:
Create an index based on how you want to sort your docs, or just use the default _id index (which is what I used)
Know the starting value, page size and the page you want to jump to
Project + skip + limit the value you should start from
Find + limit the page's results
It looks roughly like this if I want to get page 5432 of 16 records (in javascript):
let page = 5432;
let page_size = 16;
let skip_size = page * page_size;
let retval = await db.collection(...).find().sort({ "_id": 1 }).project({ "_id": 1 }).skip(skip_size).limit(1).toArray();
let start_id = retval[0]._id; // the projection returned only _id, so read that field
retval = await db.collection(...).find({ "_id": { "$gte": new mongo.ObjectID(start_id) } }).sort({ "_id": 1 }).project(...).limit(page_size).toArray();
This works because a skip on a projected index is very fast even if you are skipping millions of records (which is what I'm doing). If you run explain("executionStats"), it still has a large number for totalDocsExamined, but because of the projection on an index it's extremely fast (essentially, the data blobs are never examined). Then, with the value for the start of the page in hand, you can fetch the next page very quickly.
I combined two answers.
The problem is that when you use skip and limit without a sort, it just paginates in the order the data was written to the table, so the engine first needs to build a temporary index. It is better to use the existing _id index: sort by _id. Then it is very quick even with large tables, like:
db.myCollection.find().skip(4000000).limit(1).sort({ "_id": 1 });
In PHP it will be
$manager = new \MongoDB\Driver\Manager("mongodb://localhost:27017", []);
$options = [
    'sort'  => array('_id' => 1),
    'limit' => $limit,
    'skip'  => $skip,
];
$where = [];
$query = new \MongoDB\Driver\Query($where, $options);
$get = $manager->executeQuery("namedb.namecollection", $query);
I'm going to suggest a more radical approach. Combine skip/limit (as an edge case really) with sort-range-based buckets, and base the pages not on a fixed number of documents but on a range of time (or whatever your sort key is). So you have top-level pages that each cover a range of time, and sub-pages within that range if you still need skip/limit, but I suspect the buckets can be made small enough not to need skip/limit at all. By using the sort index, this avoids the cursor traversing the entire inventory to reach the final page.
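A rough sketch of that bucket idea, assuming the sort key is created_date and the buckets are one day wide (all names here are illustrative):

// Sketch: page within a time bucket instead of skipping across the whole collection.
var bucketStart  = ISODate("2020-06-01T00:00:00Z");
var bucketEnd    = ISODate("2020-06-02T00:00:00Z");
var pageInBucket = 3;     // sub-page inside this one-day bucket
var pageSize     = 50;

db.myCollection.find({ created_date: { $gte: bucketStart, $lt: bucketEnd } })
    .sort({ created_date: 1 })
    .skip(pageInBucket * pageSize)   // small skip, bounded by the bucket size
    .limit(pageSize);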
My collection has around 1.3M documents (not that big), properly indexed, but it still takes a big performance hit from this issue.
After reading the other answers, the way forward is clear: the paginated collection must be sorted by a counting integer, similar to SQL's auto-increment value, instead of a time-based value.
The problem is with skip; there is no other way around it: if you use skip, you are bound to hit this issue when your collection grows.
Using a counting integer with an index allows you to jump using the index instead of skip. This won't work with a time-based value because you can't calculate where to jump based on time, so skipping is the only option in that case.
On the other hand, assigning a counting number to each document means the write performance takes a hit, because all documents must be inserted sequentially. This is fine for my use case, but I know the solution is not for everyone.
The most upvoted answer doesn't seem applicable to my situation, but this one does. (I need to be able to seek forward to an arbitrary page number, not just one page at a time.)
Plus, it is also hard if you are dealing with deletes, but still possible, because MongoDB supports $inc with a negative value for batch updating. Luckily I don't have to deal with deletion in the app I am maintaining.
Just writing this down as a note to my future self. It is probably too much hassle to fix this issue in the current application I am dealing with, but next time I'll build a better one if I encounter a similar situation.
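In other words, with an indexed sequential counter field (call it seq, assigned at insert time; the field and collection names are only illustrative), an arbitrary page becomes a plain indexed range query instead of a skip; a rough sketch:

// Sketch: jump straight to an arbitrary page using a sequential counter field.
var page = 40000;
var pageSize = 10;

db.myCollection.find({ seq: { $gte: page * pageSize, $lt: (page + 1) * pageSize } })
    .sort({ seq: 1 });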
If you have MongoDB's default _id, which is an ObjectId, use it instead. This is probably the most viable option for most projects anyway.
As stated in the official mongo docs:
The skip() method requires the server to scan from the beginning of the input results set before beginning to return results. As the offset increases, skip() will become slower.
Range queries can use indexes to avoid scanning unwanted documents, typically yielding better performance as the offset grows compared to using skip() for pagination.
Descending order (example):
function printStudents(startValue, nPerPage) {
    let endValue = null;
    db.students.find( { _id: { $lt: startValue } } )
        .sort( { _id: -1 } )
        .limit( nPerPage )
        .forEach( student => {
            print( student.name );
            endValue = student._id;
        } );
    return endValue;
}
Ascending order example here.
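A mirrored sketch of the ascending version (same idea, using $gt and an ascending sort):

function printStudentsAscending(startValue, nPerPage) {
    let endValue = null;
    db.students.find( { _id: { $gt: startValue } } )
        .sort( { _id: 1 } )
        .limit( nPerPage )
        .forEach( student => {
            print( student.name );
            endValue = student._id;
        } );
    return endValue;
}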
If you know the ID of the element from which you want to continue:
db.myCollection.find({_id: {$gt: id}}).limit(5)
This is a little genius solution which works like a charm.
For faster pagination, don't use the skip() function. Use limit() and find(), where you query on the last id of the previous page.
Here is an example where I'm paging through tons of documents using Spring Boot:
Long totalElements = mongockTemplate.count(new Query(), "product");
int page = 0;
Long pageSize = 20L;
String lastId = "5f71a7fe1b961449094a30aa"; // this is the last id of the previous page
for (int i = 0; i < (totalElements / pageSize); i++) {
    page += 1;
    Aggregation aggregation = Aggregation.newAggregation(
        Aggregation.match(Criteria.where("_id").gt(new ObjectId(lastId))),
        Aggregation.sort(Sort.Direction.ASC, "_id"),
        new CustomAggregationOperation(queryOffersByProduct),
        Aggregation.limit((long) pageSize)
    );
    List<ProductGroupedOfferDTO> productGroupedOfferDTOS =
        mongockTemplate.aggregate(aggregation, "product", ProductGroupedOfferDTO.class).getMappedResults();
    lastId = productGroupedOfferDTOS.get(productGroupedOfferDTOS.size() - 1).getId();
}