I know there are already some patterns for pagination with Mongo (skip() on few documents, ranged queries on many), but in my situation I need live sorting.
Update:
For clarity, I'll change the point of the question. Can I make a query like this:
db.collection.find().sort({key: 1}).limit(n).sort({key: -1}).limit(1)
The main point is to sort the query in the "usual" order, limit the returned set of data, and then reverse it with a second sort to get the last index of the paginated data. I tried this approach, but it seems that Mongo somehow optimises the query and ignores the first sort() operator.
I am having a huge problem attempting to grasp your question.
From what I can tell when a user refreshes the page, say 6 hours later, it should show not only the results that were there but also the results that are there now.
As @JohnnyHK says, MongoDB does "live" sorting naturally, so this would already be the case: MongoDB would give you back the right results for your queries.
Now I think one problem you might be trying to get at here (the question needs clarification, massively) is that, because the data changes, the last _id you saw might no longer truly represent the page numbers or even the diversity of the information, i.e. the last _id you saw is now in fact halfway through page 13.
You would probably spend more time and performance trying to solve these sorts of things than it is worth; it is simpler to let the user understand that they have been AFK for a long time.
Edit
Aha, I think I see what you're trying to do now: you're trying to be sneaky by getting both the page and the last item in the list at the same time. Unfortunately, just like in SQL, this is not possible. Even if sort() worked like that, it would not behave as you want, since you can only sort one way on a single field.
However, for future reference: sort() is just a setting on the cursor, and until you actually open the cursor by starting to iterate it, calling sort() multiple times will simply overwrite that cursor property.
I am afraid that this has to be done with two queries: get your page first, and then either scroll through the records client side to find the last _id (I think you're looking for the max of that page) or do a second query to get the last _id. It should be super duper fast.
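To make the overwrite behaviour concrete, here is a toy model in plain JavaScript. It is not MongoDB's implementation: `ToyCursor`, its fields and the sample documents are all made up for illustration, and the assumption that limit() is overwritten the same way as sort() is mine.

```javascript
// Toy stand-in for a lazy cursor. Hypothetical: this is NOT the MongoDB driver.
class ToyCursor {
  constructor(docs) {
    this.docs = docs;
    this.sortSpec = null; // a single property, so the last sort() call wins
    this.limitN = null;   // assumption: limit() is overwritten the same way
  }
  sort(spec) { this.sortSpec = spec; return this; }
  limit(n) { this.limitN = n; return this; }
  // Nothing runs until the cursor is actually read.
  toArray() {
    const out = [...this.docs];
    if (this.sortSpec) {
      const [field, dir] = Object.entries(this.sortSpec)[0];
      out.sort((a, b) => (a[field] - b[field]) * dir);
    }
    return this.limitN === null ? out : out.slice(0, this.limitN);
  }
}

const docs = [{ key: 3 }, { key: 1 }, { key: 5 }, { key: 2 }, { key: 4 }];

// The chained form from the question: only the last sort and limit survive.
const chained = new ToyCursor(docs)
  .sort({ key: 1 }).limit(3).sort({ key: -1 }).limit(1).toArray();

// What works instead: fetch the page, then read its last element client side.
const page = new ToyCursor(docs).sort({ key: 1 }).limit(3).toArray();
const lastOfPage = page[page.length - 1];

console.log(chained);    // the overall maximum, not the end of the page
console.log(lastOfPage); // the last document of the first page
```

In this model the chained call returns the document with the largest key overall, which matches the "first sort() is ignored" symptom from the question.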
How do I get the last page of any Firestore query without needing to send the cursor value for the last document in the query?
Obviously the first/last value is not always known unless I manually keep track of those values. I have queries that order by many different fields so having to store first/last cursor values for each of those seems like a lot of unnecessary work.
Getting the entire query result without limit is obviously expensive and impractical.
There currently isn't a way to get the last list. The common way to do this is to:
1. Reverse the order of the query.
2. Request the first page of results.
3. Reverse the results client-side again to get them in the right order.
Note though that there is talk (and even work being done) to add a limitToLast() operation to Firestore's query mechanism, which would allow you to get precisely the result you're looking for. It just isn't available yet. The biggest difference would be that you could skip steps 1 and 3 of the workaround.
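The three-step workaround can be sketched without any Firestore calls. `queryOrdered` below is a hypothetical in-memory stand-in for an ordered, limited query, and the `score` field and sample values are invented for the example.

```javascript
// Hypothetical in-memory stand-in for an ordered query with a limit.
function queryOrdered(docs, field, direction, limit) {
  const sign = direction === "desc" ? -1 : 1;
  return [...docs].sort((a, b) => (a[field] - b[field]) * sign).slice(0, limit);
}

function lastPage(docs, field, pageSize) {
  // Steps 1 and 2: reverse the order and request the first page.
  const reversed = queryOrdered(docs, field, "desc", pageSize);
  // Step 3: reverse client-side to restore the original order.
  return reversed.reverse();
}

const scores = [10, 40, 20, 50, 30].map((score) => ({ score }));
const last = lastPage(scores, "score", 2);
console.log(last.map((d) => d.score));
```

Note that this returns the final pageSize documents, which lines up with a "last page" boundary only when the total count is a multiple of the page size.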
I'm building an app that calls data from MongoDB. For purposes of this question, pretend that the user searches my app for a certain query, and MongoDB has 4,000 results to spit out that match the query.
After reading around a bit, I see that it's possible to paginate using the .skip() method, but MongoDB themselves advise against using this, as it requires the cursor to iterate through all the records up to the one you're skipping to, which gets more and more expensive the further down the list you go.
I've seen a few tutorials that rely on the _id property of the results to be sequential, but this doesn't apply here - my database has tens of thousands of records, and each has a unique id, and the 4000 results that apply to the user's query are definitely not going to be sequential.
Can anyone think of a way to do this, or is skip() the only option here?
Other considerations:
The pagination will work based on the position on the page. For instance, the first query should spit out 20 records to my app. When the user scrolls to the bottom of the page, I could potentially get the _id of the 20th element on the page and pass that to my query, find it in the list of 4,000 results, find the subsequent result and start the next set of 20 from there. Is that sort of thing possible, and would it be less CPU intensive than skip()?
Your trick in "other considerations" works only if you add a sort on _id; otherwise you can't guarantee the order for follow-up queries. If you want to sort on a different field, you need to index that field. I would also suggest you query for 21 elements so that you don't have to go back and find the next one after the 20th element (of course, you can still show only the first 20 elements).
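A rough sketch of that idea, assuming numeric, non-sequential _id values. `nextPageFilter` builds the kind of filter document you would pass to find() together with a sort on _id and a limit of 21; `fetchPage` is a hypothetical in-memory stand-in for running it. Using the 21st row only as a "is there another page" signal is my reading of the suggestion.

```javascript
const PAGE_SIZE = 20;

// The user's own criteria plus "_id greater than the last _id already shown".
function nextPageFilter(baseFilter, lastId) {
  return lastId === null
    ? { ...baseFilter }
    : { ...baseFilter, _id: { $gt: lastId } };
}

// In-memory stand-in for find(filter).sort({ _id: 1 }).limit(PAGE_SIZE + 1).
function fetchPage(docs, lastId) {
  const rows = docs
    .filter((d) => lastId === null || d._id > lastId)
    .sort((a, b) => a._id - b._id)
    .slice(0, PAGE_SIZE + 1);             // ask for 21
  const hasMore = rows.length > PAGE_SIZE; // the 21st row is only a peek
  const items = rows.slice(0, PAGE_SIZE);  // show only 20
  const lastShown = items.length ? items[items.length - 1]._id : lastId;
  return { items, hasMore, lastShown };
}

// 45 matching documents with non-sequential ids (3, 10, 17, ...).
const matching = Array.from({ length: 45 }, (_, i) => ({ _id: i * 7 + 3 }));

const page1 = fetchPage(matching, null);
const page2 = fetchPage(matching, page1.lastShown);
const page3 = fetchPage(matching, page2.lastShown);

console.log(JSON.stringify(nextPageFilter({ tag: "x" }, page1.lastShown)));
console.log(page1.items.length, page2.items.length, page3.items.length, page3.hasMore);
```

Each page costs the same regardless of how deep the user has scrolled, because the query starts from the last _id rather than counting past skipped documents.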
MongoDB ranged pagination has a good example as well.
Is there a way to use the cursor returned by the previous query as a value for $in in the next query? For example, something like this:
var users = db.user.find({state:1})
var offers = db.offer.find({user:{$in:users}})
I think this can reduce the traffic between MongoDB and the client in case the client doesn't need the user information at all, just the offers. Am I wrong?
Basically you want to do a join between two collections, which Mongo doesn't support. You can reduce the amount of data being transferred from the server by limiting the fields returned from the first query to only the unique user information (i.e. the _id) that you need to get data from the offers collection.
If you really just want to make one query then you should store more information in the offers collection. For example, if you're trying to find offers for active users then you would store the active state of the user in the offers collection.
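The two-query approach can be sketched like this. The sample arrays stand in for the user and offer collections and are invented; in the shell, the first step would correspond to something like a find() with a projection on _id only, and the second to a find() with the `$in` filter built here.

```javascript
// Hypothetical sample data standing in for the two collections.
const users = [
  { _id: 1, state: 1, name: "ann" },
  { _id: 2, state: 0, name: "bob" },
  { _id: 3, state: 1, name: "cy" },
];
const offers = [
  { _id: "a", user: 1 },
  { _id: "b", user: 2 },
  { _id: "c", user: 3 },
  { _id: "d", user: 1 },
];

// Query 1: keep only the _id of each matching user, as a projection would.
const userIds = users.filter((u) => u.state === 1).map((u) => u._id);

// Query 2: the filter document that would be sent for the offers.
const offerFilter = { user: { $in: userIds } };

// In-memory equivalent of applying that $in filter.
const matched = offers.filter((o) => offerFilter.user.$in.includes(o.user));
console.log(matched.map((o) => o._id));
```

Only the list of ids crosses the wire between the two steps, which is the saving the answer describes.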
To work from your comment:
Yes, that's why I used tag 'join' in a question. The idea is that I
can make a first query more complex using a bunch of fields and
regexes without storing user data in other collections except
references. In these cases I always have to perform two consecutive
queries, but transferring the results of the first query is not
necessary either for me or for MongoDB itself. I just want to
understand whether it could be done now, whether it will be possible
in the future, or whether it cannot be implemented for some technical reasons.
As far as I understand it, there is no immediate hurry to make this possible. Also, given the way it is coded atm, this would be quite a big change to the way cursors work and are defined: a change big enough to possibly break other people's implementations. It is much like the question of whether to make safe the default for inserts and updates in all future drivers: it is recognised that safe should be the default, but that would break implementations for people who expect it the other way around.
It is rather inefficient if you don't require the results of the first query at all. However, since most networks are prepped with high traffic in mind and the traffic is cheap, there hasn't been a demand for chained queries executed server side in the cursor.
However, subselects (which this basically is: selecting a set of rows based upon a sub-selection of previous rows) have come up on mongodb-user a couple of times and there might even be a JIRA for it somewhere; if not, it might be useful to create one.
As for doing it right now: there is no way.
I’m playing around with MongoDB for the moment to see what nice features it has. I’ve created a small test suite representing a simple blog system with posts, authors and comments, very basic.
I’ve experimented with a search function which uses the MongoRegEx class (PHP driver), where I’m just searching through all post content and post titles for the sentence ‘lorem ipsum’, with case-insensitive matching via the “/i” flag.
My code looks like this:
$regex = new MongoRegEx('/lorem ipsum/i');
$query = array('post' => $regex, 'post_title' => $regex);
But I’m confused and stunned about what happens. I check every query for running time (set microtime before and after the query and get the time with 15 decimals).
For my first test I’ve added 110.000 blog documents and 5000 authors, everything randomly generated. When I do my search, it finds 6824 posts with the sentence “lorem ipsum” and it takes 0.000057935714722 seconds to do the search. And this is after I’ve restarted the MongoDB service (using Windows), and it is without any index other than the default on _id.
MongoDB uses a B-tree index, which most definitely isn’t very efficient for a full text search. If I create an index on my post content attribute, the same query as above runs in 0.000150918960571 seconds, which funnily enough is slower than without any index (slower by 0.000092983245849 seconds). Now this can happen for several reasons because it uses a B-tree cursor.
But I’ve tried to search for an explanation as to how it can query it so fast. I guess that it probably keeps everything in my RAM (I’ve got 4GB and the database is about 500MB). This is why I try to restart the mongodb service to get a full result.
Can anyone with experience with MongoDB help me understand what is going on with this kind of full text search with or without index and definitely without an inverted index?
Sincerely
- Mestika
I think you simply didn't iterate over the results? With just a find(), the driver will not send a query to the server. You need to fetch at least one result for that. I don't believe MongoDB is this fast, and I believe your error to be in your benchmark.
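To illustrate why timing the find() call alone measures almost nothing, here is a small lazy-evaluation sketch. `lazyFind` is a made-up generator, not the PHP or MongoDB driver; it only shows that no matching happens until something iterates the result.

```javascript
let predicateCalls = 0;

// Hypothetical lazy "find": a generator does no work until it is iterated.
function* lazyFind(docs, pattern) {
  for (const doc of docs) {
    predicateCalls += 1;
    if (pattern.test(doc.post)) yield doc;
  }
}

// 1000 invented posts, every tenth one containing the search phrase.
const posts = Array.from({ length: 1000 }, (_, i) => ({
  post: i % 10 === 0 ? "Lorem ipsum dolor" : "nothing to see",
}));

const cursor = lazyFind(posts, /lorem ipsum/i);
const callsBeforeIterating = predicateCalls; // still 0: nothing has run yet

const results = [...cursor]; // iterating is what does the work
console.log(callsBeforeIterating, predicateCalls, results.length);
```

A benchmark that stops its timer before the iteration step would be timing only the creation of the cursor object.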
As a second thing, for regular expression search that is not anchored at the beginning of the field's value with an ^, no index is used at all. You should play with explain() to see what is actually happening.
I'm not sure that there is a good way to do this with the facilities CouchDB provides, but I'd like to somehow extract the relative complement of the sets of two different document types over a particular key.
For example, let's say that I have documents representing users and posts, both of which have a (unique) username field. There's a validation in place ensuring that a user document exists for the username in every post, but there may be any number of post documents with a given username, including none. It's trivial to create a view which counts the number of posts per username. The view can even include zero-counts by emitting zero post-counts for the user documents in the view map function. What I want to do, though, is retrieve just the list of users who have zero associated posts.
It's possible to build the view I described above and filter client-side for zero-value results, but in my actual situation the number of results could be very, very large, and the interesting results a relatively small proportion of the total. Is there a way to do this server-side and retrieve just the interesting results?
I would write a map function to iterate through the documents and emit the users (or just usernames) with 0 posts.
Then I would write a list function to iterate through the map function results and format them however you want (JSON, csv, etc).
(I would NOT use a reduce function to format the results, even if a reduce function appears to work OK in development. That is just my own experience from lessons learned the hard way.)
Personally I would filter on the client-side until I had performance issues. Next I would probably use Teddy's _filter technique—all pretty standard CouchDB stuff.
However, I stumbled across (IMO) an elegant way to find set complements. I described it when exploring how to find documents missing a field.
The basic idea
Finding non-members of your view obviously can't be done with a simple query (and a straightforward index scan). However, it can be done in constant memory and linear time by iterating through two query results simultaneously.
One query is for all possible document ids. The other query is for matching documents (those you don't want). Importantly, CouchDB sorts query results, therefore you can calculate the complement efficiently.
See my details in the previous question. The basic idea is that you iterate through both (sorted) lists simultaneously, and when you can say "hey, this document id is listed in the full set but it's missing in the sub-set", that is a hit.
(You don't have to query _all_docs, you just need two queries to CouchDB: one returning all possible values, and the other returning values not to be counted.)
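A minimal sketch of that simultaneous walk, assuming both result lists have already been fetched and are sorted under the same ordering that JavaScript's `<` uses for these keys (CouchDB's own collation may differ for other strings). The function name and sample usernames are invented.

```javascript
// Both inputs must be sorted ascending under the same ordering.
function complement(allIds, excludedIds) {
  const hits = [];
  let j = 0;
  for (const id of allIds) {
    // Advance the second list past anything smaller than the current id.
    while (j < excludedIds.length && excludedIds[j] < id) j += 1;
    // Present in the full set but missing from the subset: a hit.
    if (j >= excludedIds.length || excludedIds[j] !== id) hits.push(id);
  }
  return hits;
}

const allUsers = ["alice", "bob", "carol", "dave", "erin"];
const usersWithPosts = ["alice", "carol", "dave"];
console.log(complement(allUsers, usersWithPosts));
```

This version collects the hits in an array for convenience; to keep memory constant you would emit each hit as it is found while streaming both query results.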