How to paginate query results in NoSQL databases when there are no unique fields included in the projection? - mongodb

I've heard using MongoDB's skip() to batch query results is a bad idea because it can lead to the server becoming IO bound as it has to 'walk through' all the results. I want to only return a maximum of 200 documents at a time, and then the user will be able to fetch the next 200 if they want (assuming they haven't limited it to less).
Initially I read up on paginating results and most things I read said the easiest way in MongoDB at least is to modify the query criteria to simulate skipping through.
For example, if a field called accNumber on the last document is 28022004, then the next query should have "accNumber > 28022004" in its criteria. But what if there are no unique fields included in the projection? What if the user wants to sort the records by a non-unique field?
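For reference, here's roughly what I understand that criteria-based approach to look like when the sort field is not unique, using _id as a tie-breaker (the collection and field names are just for illustration):

// first page: sort on the non-unique field plus _id as a tie-breaker
db.accounts.find({})
  .sort({ accNumber: 1, _id: 1 })
  .limit(200);

// next page: resume after the last document of the previous page
// (lastAccNumber and lastId come from the 200th document returned above)
db.accounts.find({
  $or: [
    { accNumber: { $gt: lastAccNumber } },
    { accNumber: lastAccNumber, _id: { $gt: lastId } }
  ]
})
  .sort({ accNumber: 1, _id: 1 })
  .limit(200);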

Related

Mongoose skip and limit (MongoDB Pagination)

I am using skip and limit with Mongoose in MongoDB, and it returns the documents in what looks like a random order.
Sometimes it returns the same documents on different pages.
I am trying to fetch a limited number of documents, and after skipping that batch I want to fetch the next batch of the same size.
I guess you are trying to use the pagination concept here. In both SQL and NoSQL, the data must be sorted in a deterministic order to achieve pagination without the data getting jumbled on each db call.
for example:
// sort ascending by createdDate (the sort value must be 1 or -1, not 0), then page
await Event.find({}).sort({ createdDate: 1 }).skip(10).limit(10);
In the case above this fetches the 11th through 20th documents, sorted by createdDate. So the data won't be shuffled between fetches, unless you insert or delete documents in the meantime.
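If createdDate is not unique (several events can share the same timestamp), one way to make the order fully deterministic is to add _id as a secondary sort key; a minimal sketch, assuming the same Event model:

// _id breaks ties between documents with equal createdDate values,
// so every page sees the same total order
await Event.find({})
  .sort({ createdDate: 1, _id: 1 })
  .skip(10)
  .limit(10);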
I hope this answers your question

Projection in MongoDb

I am learning MongoDb and a question came to my mind regarding projection.
When we do a projection for some fields, what does MongoDB do?
Does it read the whole document, drop some fields and then return the result, or does it avoid reading the excluded fields altogether and return only the fields mentioned in the query?
For e.g. If I have a document with 4 fields and 3 arrays(each of size ~10) and I just want the 4 fields and not the arrays.
Would MongoDB read the whole document and drop the array or would just read the 4 fields?
If it's the first case how the execution time or latency would differ if the array becomes big in the document?
The document is compressed on storage, so MongoDB needs to read the whole document first, uncompress it, and only then extract the fields specified in the projection.
The trick here is that when you search by some of the fields, you should index them so the search happens in memory and MongoDB avoids reading all the documents one by one to check the searched field.
And if you need fast access to only those fields, it is best to put all of them in a compound index and query them via a so-called "covered query"; then both the search and the fetch are served entirely from the index in memory, without accessing storage, which is much faster.
Also, in many cases the same documents are searched multiple times, so MongoDB keeps those recently accessed documents cached in memory, where they can be served faster.
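A rough sketch of such a covered query (the collection and field names are assumed for illustration):

// compound index covering both the search field and the returned fields
db.accounts.createIndex({ accNumber: 1, name: 1, status: 1 });

// filter and projection use only indexed fields, and _id is excluded,
// so the query can be answered from the index alone (a covered query)
db.accounts.find(
  { accNumber: { $gt: 28022004 } },
  { _id: 0, accNumber: 1, name: 1, status: 1 }
);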

Mongo DB update query performance

I would like to understand which of the approaches below would be faster for updates in MongoDB. I want to update a few thousand records in one stretch.
Accumulating the object ids of those records and firing them using $in or using bulk update?
Using one or two fields in the collection which are common for those few thousand records - akin to "where" in sql and firing an update using those fields. These fields might or might not be indexed.
I know that query will be much smaller in the 2nd case as every single "_id" (oid) is not accumulated. Does accumulating _ids and using those to update documents offer any practical performance advantages?
Does accumulating _ids and using those to update documents offer any practical performance advantages?
Yes because MongoDB will certainly use the _id index (idhack).
In the second method - as you observed - you can't tell whether or not an index will be used for a certain field.
So the answer will be: it depends.
If your collection has millions of documents or more, and/or the number of search fields is quite large, you should prefer the first method, especially if the id list is not small and/or the id values are adjacent.
If your collection is pretty small and you can tolerate a full scan you may prefer the second approach.
In any case, you should verify both methods using explain().
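A rough sketch of the two approaches (the collection, field names, and idList are assumed for illustration), so you can compare their plans with explain():

// method 1: accumulate the _ids and update them with $in (can use the _id index)
db.records.updateMany(
  { _id: { $in: idList } },
  { $set: { processed: true } }
);

// method 2: filter on the fields those records have in common
// (whether an index is used depends on your indexes)
db.records.updateMany(
  { batchId: 42, region: "EU" },
  { $set: { processed: true } }
);

// inspect which plan each filter would use
db.records.find({ _id: { $in: idList } }).explain("executionStats");
db.records.find({ batchId: 42, region: "EU" }).explain("executionStats");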

Paginating results in MongoDB without relying on .skip()

I'm building an app that calls data from MongoDB. For purposes of this question, pretend that the user searches my app for a certain query, and MongoDB has 4,000 results to spit out that match the query.
After reading around a bit, I see that it's possible to paginate using the .skip() method, but MongoDB itself suggests against using this as it requires the cursor to iterate through all the records up to the one you're skipping to, which gets more and more expensive the further down the list you go.
I've seen a few tutorials that rely on the _id property of the results to be sequential, but this doesn't apply here - my database has tens of thousands of records, and each has a unique id, and the 4000 results that apply to the user's query are definitely not going to be sequential.
Can anyone think of a way to do this, or is skip() the only option here?
Other considerations:
The pagination will work based on the position on the page. For instance, the first query should spit out 20 records to my app. When the user scrolls to the bottom of the page, I could potentially get the _id of the 20th element on the page and pass that to my query, find it in the list of 4,000 results, find the subsequent result and start the next set of 20 from there. Is that sort of thing possible, and would it be less CPU intensive than skip()?
Your trick in "other considerations" works only if you add a sort on _id, otherwise you can't guarantee order for follow up queries. If you want to sort on a different field, you need to index that field. I would also suggest you query for 21 elements so that you don't have to go back and find the next one after the 20th element (of course, you can still show only the first 20 elements).
MongoDB ranged pagination has a good example as well.
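A minimal sketch of that idea, assuming pages of 20 sorted by _id (the results collection name and query are placeholders):

// fetch 21 so you already know whether a next page exists
var page = db.results.find(query)
  .sort({ _id: 1 })
  .limit(21)
  .toArray();

// display page.slice(0, 20); a 21st element means there is a next page
var lastId = page[19]._id;  // _id of the 20th (last displayed) element

// next page: continue after the last displayed _id instead of using skip()
db.results.find({ $and: [query, { _id: { $gt: lastId } }] })
  .sort({ _id: 1 })
  .limit(21);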

Querying directly on results from MongoDB mapreduce versus updating original collection

I have a mapreduce job that runs on a collection of posts and calculates a popularity for each post. The mapreduce outputs a collection with the post_id and popularity for each post. The application needs to be able to get posts sorted by popularity. There are millions of posts, and these popularities are updated every 10 minutes. Two methods I can think of:
Method 1
Keep an index on the posts table popularity field
Run mapreduce on the posts table (this will replace any previous mapreduce results)
Loop through each row in the mapreduce results collection and individually update the popularity of its corresponding post in the posts table
Query directly on the posts table to get posts sorted by popularity
Method 2
Run mapreduce on the posts table (this will replace the previous mapreduce results)
Add an index to the popularity field in the resulting mapreduce collection
When the application needs posts, first query the mapreduce results collection to get the sorted post_ids, then query the posts collection to get the actual post data
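For example, the read path for Method 2 would look roughly like this (collection names made up, and assuming the mapreduce emits { popularity: ... } as its value):

// step 1: get the top post_ids from the mapreduce output, sorted by popularity
var top = db.post_popularity.find({})
  .sort({ "value.popularity": -1 })
  .limit(20)
  .toArray();

// step 2: fetch the actual post documents for those ids
// ($in does not preserve order, so re-order in the application if needed)
var ids = top.map(function (doc) { return doc._id; });
db.posts.find({ _id: { $in: ids } });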
Questions
Method 1 would need to maintain an index on the popularity field in the posts table. It'll also need to update millions of popularities individually (the posts table has millions of rows) every 10 or so minutes. It'll only update those posts whose popularity has changed, but it's still a lot of updates on a collection with a couple of indexes. There will be a significant number of reads on this collection as well. Is this scalable?
For method 2, is it possible to mapreduce the posts collection to create a new popularities collection, immediately create an index on it, and query it?
Are there any concurrency issues for question #2, assuming the application will be querying that popularities collection as it's being updated by the map reduce and re-indexed.
If the mapreduce replaces the popularities collection, do I need to manually create a new index every time, or will Mongo know to keep an index on the popularity field? Basically, how do indexes work with mapreduce result collections?
Is there some tweak or other method I could use for this?
Thanks for any help!
The generic advice concerning Map Reduce is to have your application perform a little extra computation on each insert, and avoid doing a processor-intensive map reduce job whenever possible.
Is it possible to add a "popularity" field to each "post" document and have your application increment it each time each post is viewed, clicked on, voted for, or however you measure popularity? You could then index the popularity field, and searches for posts by popularity would be lightning-fast.
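A minimal sketch of that idea (the popularity field name is assumed):

// index so that sorting posts by popularity stays fast
db.posts.createIndex({ popularity: -1 });

// whenever a post is viewed, clicked, or voted for, bump its counter
db.posts.updateOne(
  { _id: postId },
  { $inc: { popularity: 1 } }
);

// fetching the most popular posts is then a simple indexed query
db.posts.find({}).sort({ popularity: -1 }).limit(20);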
If simply incrementing a "popularity" field is not an option, and a MapReduce operation must be performed, try to prevent it from paging through all of the documents in the collection. You will find that this becomes prohibitively slow as your collection grows. It sounds as though your collection is already pretty large.
It is possible to perform an incremental map reduce, where the results of the latest map reduce are integrated with the results of the previous one, instead of merely being overwritten. You can also provide a query to the mapReduce function, so not all documents will be read. Perhaps add a query that matches only posts that have been viewed, voted for, or added since the last map reduce.
The documentation on incremental mapReduce operations is here:
http://www.mongodb.org/display/DOCS/MapReduce#MapReduce-IncrementalMapreduce
Integrating the new results with the old ones is explained in the "Output options" section.
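A rough sketch of what such an incremental run could look like (the map/reduce functions, the updatedAt field, and lastRunTime are placeholders for whatever your job actually uses):

db.posts.mapReduce(
  mapFunction,     // your existing map function
  reduceFunction,  // your existing reduce function
  {
    // only read posts that have changed since the previous run
    query: { updatedAt: { $gt: lastRunTime } },
    // merge the new results into the existing output collection instead of replacing it
    out: { reduce: "post_popularity" }
  }
);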
I realize that my advice has been pretty general so far, so I will attempt to address your questions now:
1) As discussed above, if your MapReduce operation has to read every single document, this will not scale well.
2) The MapReduce operation only outputs a collection. Creating an index and querying that collection will have to be done programmatically.
3) If one process is querying a collection at the same time that another is updating it, then it is possible for the query to return a document before it has been updated. So the short answer is "yes".
4) If the collection is dropped, then indexes will have to be rebuilt. If the documents in the collection are deleted but the collection itself is not dropped, then the index(es) will persist. In the case of a MapReduce run with the {out:{replace:"output"}} option, the index(es) will persist and won't have to be recreated.
5) As stated above, if possible it would be preferable to add another field to your "posts" collection, and update that, instead of performing so many MapReduce operations.
Hopefully I have been able to provide you with some additional factors to consider when building your application. Ultimately, it is important to remember that each application is unique, and so for the ultimate proof of which way is "best", you will have to experiment with all of the different options and decide for yourself which way is most efficient. Good Luck!