On the MongoDB docs it says:
(source)
Unfortunately skip can be (very) costly and requires the server to
walk from the beginning of the collection, or index, to get to the
offset/skip position before it can start returning the page of data
(limit). As the page number increases skip will become slower and more
cpu intensive, and possibly IO bound, with larger collections. Range
based paging provides better use of indexes but does not allow you to
easily jump to a specific page.
What is range based paging and where is the documentation for it?
The basic idea is to encode the paging position into the query predicate.
For example, if you list forum posts by date and you want to show the next page, use the date of the last post on the current page as a predicate. MongoDB can then use the index built on the date field.
//older posts
db.forum_posts.find({date: {$lt: ..last_post_date..} }).sort({date: -1}).limit(20);
Of course this gets a little more complicated if the field you are using for sorting is not unique.
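If the sort field can repeat, a common fix is to add a unique tie-breaker such as _id to both the sort and the predicate; a compound index on { date: -1, _id: -1 } supports this. A minimal sketch, where last_post_date and last_post_id stand for values taken from the last item on the current page:
//older posts, with _id as a tie-breaker for posts sharing the same date
db.forum_posts.find({
  $or: [
    { date: { $lt: last_post_date } },
    { date: last_post_date, _id: { $lt: last_post_id } }
  ]
}).sort({ date: -1, _id: -1 }).limit(20);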
Related
A non-distributed database has many posts; posts have zero or more user-defined tags. Most posts have the most_posts_have_this tag, and few posts have the few_posts_have_this tag.
When querying {'tags': {'$all': ['most_posts_have_this', 'few_posts_have_this']}}, the query is slow; it seems to be iterating through posts with the most_posts_have_this tag.
Is there some way to hint to MongoDB that it should be iterating through posts with the few_posts_have_this tag instead?
Is there some way to hint to MongoDB that it should be iterating through posts with the few_posts_have_this tag instead?
The short answer is no; this is due to how Mongo builds an index on an array:
To index a field that holds an array value, MongoDB creates an index key for each element in the array
So when you query the tags field, imagine that Mongo queries each tag separately and then does an intersection.
If you run explain() you will see that after the index scan stage Mongo executes a document fetch stage. In theory that stage should be redundant for a pure index scan, which shows this is not the case: Mongo fetches ALL documents that have either of the tags, and only then applies the "$all" logic in the filtering stage.
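You can observe this with executionStats; a minimal sketch, assuming a posts collection with a multikey index on tags:
db.posts.find({ tags: { $all: ['most_posts_have_this', 'few_posts_have_this'] } })
  .explain('executionStats')
// compare executionStats.totalDocsExamined with nReturned: a large gap means
// many documents were fetched only to be discarded by the $all filter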
So what can you do?
If you have prior knowledge of which tag is sparser, you could query that tag first and only then filter on the larger tag. I'm assuming this is not really the case, but it is worth considering if possible. If your tags are somewhat static, you could even precalculate this.
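A minimal sketch of the sparse-tag-first idea, assuming a posts collection with a multikey index on tags; whether the planner actually uses the sparse tag for the index bounds should be verified with explain():
db.posts.find({
  $and: [
    { tags: 'few_posts_have_this' },   // the sparse tag: a small index range
    { tags: 'most_posts_have_this' }   // the common tag: applied as a filter
  ]
})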
Otherwise you will have to consider a restructuring that allows better index usage for this use case, though I will say that for most access patterns your current structure is better.
The new structure can be an object like so:
tags2: {
  tagname1: 1,
  tagname2: 2,
  ...
}
Now if you build an index on the keys of tags2 (for example a wildcard index on "tags2.$**"), each key of the object will be indexed separately. This lets Mongo skip the "fetch" stage, as the index contains all the information needed to execute the following query:
{"tags2.most_posts_have_this" :{$exists: true}, "tags2.few_posts_have_this": {$exists: true}}
I understand both solutions are underwhelming, to say the least, but sadly Mongo does not excel at this specific use case. I can think of more "hacky" approaches, but I would say these two are the more reasonable ones to actually consider implementing, depending on performance requirements.
Is there some way to hint to MongoDB that it should be iterating through posts with the few_posts_have_this tag instead?
Not really. When Mongo runs an $all it is going to get all records with both tags first. You could try using two $in queries in an aggregation instead, selecting the less frequent tag first. I'm not sure whether this would actually be faster (it depends on how Mongo optimizes things), but it could be worth a try.
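A minimal sketch of that aggregation, assuming a posts collection; note that the pipeline optimizer may merge adjacent $match stages, so whether the order survives into the plan is worth checking with explain():
db.posts.aggregate([
  { $match: { tags: { $in: ['few_posts_have_this'] } } },   // the rarer tag first
  { $match: { tags: { $in: ['most_posts_have_this'] } } }   // then the common one
])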
The best you can do:
Make sure you have an index on the tags field. I see in the comments you have done this.
Mongo may be using the wrong index for this query. You can see which one it is using with cursor.explain(), and you can force it to use your tags index with hint() (see the sketch after this list). First use db.collection.getIndexes() to make sure your tags index shows up as expected in the list of indexes.
Using projections to return only the fields you need might speed things up. For example, depending on your use case, you might return just post IDs and then query the full text for a smaller subset of the returned posts (also shown in the sketch below). This could speed things up because Mongo doesn't have to manage as much intermediate data.
You could also consider periodically sorting the tags array field by frequency. If the least frequent tags are first, Mongo may be able to skip further scanning for that document. It will still fetch all the matching documents, but if your tag lists are very large it could save time by skipping the later tags. See The ESR (Equality, Sort, Range) Rule for more details on optimizing your indexed fields.
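A minimal sketch of the hint() and projection suggestions above, assuming a posts collection with an index created as { tags: 1 }:
db.posts.getIndexes()   // confirm the tags index shows up as expected
db.posts.find(
  { tags: { $all: ['most_posts_have_this', 'few_posts_have_this'] } },
  { _id: 1 }            // projection: return only the post IDs
).hint({ tags: 1 })     // force the tags index
  .explain('executionStats')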
If all that's still not fast enough and the performance of these queries is critical, you'll need to do something more drastic:
Upgrade your machine (ensure it has enough RAM to store your whole dataset, or at least your indexes, in memory)
Try sharding
Revisit your data model. The fastest possible result will be if you can turn this query into a covered query. This may or may not be possible on an array field.
See Mongo's optimizing query performance for more detail, but again, it is unlikely to help with this use case.
I have a collection of ~500M documents.
Every time when I execute a query, I receive one or more documents from this collection. Let's say I have a counter for each document, and I increase this counter by 1 whenever this document is returned from the query. After a few months of running the system in production, I discover that the counter of only 5% of the documents is greater than 0 (zero). Meaning, 95% of the documents are not used.
My question is: Is there an efficient way to arrange these documents to speedup the query execution time, based on the fact that 95% of the documents are not used?
What is the best practice in this case?
If, for example, I add another boolean field to each document named "consumed" and index this field, can I improve the query execution time somehow?
"~500M documents": that is quite a solid figure, good job if that's true. So here is how I see the solution to the problem:
If you are willing to rewrite/refactor and rebuild the DB of the app, you could use the versioning pattern.
What does it look like?
Imagine you have two collections (or even two databases, if you are using a microservice architecture):
Relevant docs / Irrelevant docs.
Basically, you run find() only on the relevant-docs collection (which stores the 5% of your docs that are actually used), and only if there is nothing there do you fall back to Irrelevant.find(). This pattern also allows you to store old/historical data and manage it via a TTL index or a capped collection.
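A minimal sketch of the fallback, assuming collections named relevant_docs and irrelevant_docs (hypothetical names) and some query object:
// check the small, hot collection first
let doc = db.relevant_docs.findOne(query);
if (doc === null) {
  // fall back to the rarely used 95%
  doc = db.irrelevant_docs.findOne(query);
}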
You could also add some Redis caching to it (which uses precisely the same check-the-fast-store-first logic).
This article can also be helpful (as many others, like this SO question)
But don't try to replace Mongo with Redis, team them up instead.
Using Indexes and .explain()
If, for example, I add another boolean field to each document named "consumed" and index this field, can I improve the query execution time somehow?
Yes, that will deal with your problem. To take a look, download MongoDB Compass, create this boolean field in your schema (don't forget to add a default value), index the field, and then use the Explain module with some query. But don't forget about compound indexes! If you create an index on just this one field, measure the performance by querying only this one field.
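For reference, the shell equivalent is a minimal sketch like this, assuming a docs collection and the "consumed" flag from the question:
db.docs.createIndex({ consumed: 1 })
db.docs.find({ consumed: true }).explain('executionStats')
// compare totalDocsExamined with and without the index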
If your index is getting usage (and actually speeding queries up), Compass will show you that. To measure the performance of the queries (with and without indexing), use the Explain tab.
Actually, all of this can be done without Compass, via explain() and the index commands, but Compass has better visuals for this process, so it's better to use it, especially since it has become completely free.
While going through the cursor.skip() MongoDB docs, I read that this is an expensive approach, and I totally understand why: the cursor has to walk from the start to execute the skip. In the paragraph below that, they wrote:
Consider using range-based pagination for these kinds of tasks. That is, query for a range of objects, using logic within the application to determine the pagination rather than the database itself. This approach features better index utilization, if you do not need to easily jump to a specific page.
I don't understand this part: how does this overcome the expensiveness of the skip() operation?
When using cursor.skip(N) the server finds all the matching data and then skips over the first N matching documents.
When using range-based pagination (i.e. with a date range) the server will only find and return the matching documents. If the property you base your pagination on is indexed, the index will also be used.
The difference is the amount of data the server has to read in the two situations.
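A minimal sketch of the difference, assuming a posts collection with an index on date; last_seen_date stands for the boundary value taken from the previous page:
// skip-based: the server still walks the first 10000 entries before returning 20
db.posts.find().sort({ date: -1 }).skip(10000).limit(20)
// range-based: the index seek jumps straight to the boundary
db.posts.find({ date: { $lt: last_seen_date } }).sort({ date: -1 }).limit(20)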
I'm building an app that calls data from MongoDB. For purposes of this question, pretend that the user searches my app for a certain query, and MongoDB has 4,000 results to spit out that match the query.
After reading around a bit, I see that it's possible to paginate using the .skip() method, but MongoDB itself suggests against using this, as it requires the cursor to iterate through all the records up to the one you're skipping to, which gets more and more expensive the higher in the list you go.
I've seen a few tutorials that rely on the _id property of the results being sequential, but this doesn't apply here: my database has tens of thousands of records, each with a unique id, and the 4,000 results that apply to the user's query are definitely not going to be sequential.
Can anyone think of a way to do this, or is skip() the only option here?
Other considerations:
The pagination will work based on the position on the page. For instance, the first query should return 20 records to my app. When the user scrolls to the bottom of the page, I could potentially get the _id of the 20th element on the page and pass that to my query, find it in the list of 4,000 results, find the subsequent result, and start the next set of 20 from there. Is that sort of thing possible, and would it be less CPU-intensive than skip()?
Your trick in "other considerations" works only if you add a sort on _id; otherwise you can't guarantee the order for follow-up queries. If you want to sort on a different field, you need to index that field. I would also suggest you query for 21 elements so that you don't have to go back and find the next one after the 20th element (of course, you can still show only the first 20 elements).
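A minimal sketch of that scheme, assuming a posts collection and a query object for the user's search (and that the query itself does not constrain _id):
// first page: ask for 21, show 20, keep the 21st as a "has more" signal
const first = db.posts.find(query).sort({ _id: 1 }).limit(21).toArray();
const pageDocs = first.slice(0, 20);
const hasNext = first.length === 21;
const lastId = pageDocs[pageDocs.length - 1]._id;
// next page: everything after the last _id that was shown
const next = db.posts
  .find(Object.assign({}, query, { _id: { $gt: lastId } }))
  .sort({ _id: 1 })
  .limit(21)
  .toArray();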
MongoDB ranged pagination has a good example as well.
I know there are already some patterns for pagination with Mongo (skip() on few documents, ranged queries on many), but in my situation I need live sorting.
Update:
For clarity, I'll change the point of the question. Can I make a query like this:
db.collection.find().sort({key: 1}).limit(n).sort({key: -1}).limit(1)
The main point is to sort the query in the "usual" order, limit the returned set of data, and then reverse it with a second sort to get the last index of the paginated data. I tried this approach, but it seems that Mongo somehow optimises the query and ignores the first sort() operator.
I am having a huge problem attempting to grasp your question.
From what I can tell when a user refreshes the page, say 6 hours later, it should show not only the results that were there but also the results that are there now.
As @JohnnyHK says, MongoDB does "live" sorting naturally, whereby this would be the case: for your queries, MongoDB would give you back the right results.
Now, I think one problem you might be trying to get at here (the question needs clarification, massively) is that, due to the data changing, the last _id you saw might no longer truly represent the page numbers etc., or even the diversity of the information; i.e. the last _id you saw is now in fact halfway through page 13.
You would probably spend more time and performance trying to solve these sorts of things than you would gain over just letting the user understand that they have been AFK for a long time.
Edit
Aha, I think I see what you're trying to do now: you're trying to be sneaky by getting both the page and the last item in the list at the same time. Unfortunately, just like in SQL, this is not possible. Even if sort worked like that, the sort would not function as it should, since you can only sort one way on a single field.
However, for future reference: sort() is just a function on a cursor, and until you actually open the cursor by starting to iterate it, calling sort() multiple times will simply overwrite the cursor's sort property.
I am afraid that this has to be done with two queries: you get your page first, and then either scroll through the records client-side (I think you're looking for the max of that page) to find the last _id, or just do a second query to get the last _id. It should be super duper fast.
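A minimal sketch of both options, assuming a collection sorted on key and a page size n:
// option 1: take the last _id client side from the page you already have
const page = db.collection.find().sort({ key: 1 }).limit(n).toArray();
const lastId = page[page.length - 1]._id;
// option 2: a second query that fetches only the page's last document
db.collection.find().sort({ key: 1 }).skip(n - 1).limit(1)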