Get nth item from a collection - mongodb

I'm in the learning phase of mongodb.
I have a test website project where each step of a story is a domain.com/step
for instance, step 14 is accessed through domain.com/14
In other words, for the above case, I will need to access 14th document in my collection to serve it.
I've been using find().skip(n).limit(1) method for this so far to return nth document however it becomes extremely slow when there are too many documents to skip. So I need a more efficient way to get the nth document in my collection.
Any ideas are appreciated.

Add a field to your documents which tells you which step it is, add an index to that field and query by it.
Document:
{
step:14
text:"text",
date:date,
imageurl:"imageurl"
}
Index:
db.collection.createIndex({step:1});
Query:
db.collection.find({step:14});
Relying on natural order in the collection is not just slow (as you found out), it is also unreliable. When you start a new collection and insert a bunch of documents, you will usually find them in the order you inserted them. But when you change documents after they were inserted, it can happen that the order gets messed up in unpredictable ways. So never rely on insertion order being consistent.
Exception: Capped Collections guarantee that insertion order stays consistent. But there are very few use-cases where these are useful, and I don't think you have such a case here.

Related

MongoDB $all optimization of tag-based query

A non-distributed database has many posts, posts have zero or more user-defined tags, most posts have the most_posts_have_this tag, few posts have the few_posts_have_this tag.
When querying {'tags': {'$all': ['most_posts_have_this', 'few_posts_have_this']}} the query is slow, it seems to be iterating through posts with the most_posts_have_this tag.
Is there some way to hint to MongoDB that it should be iterating through posts with the few_posts_have_this tag instead?
Is there some way to hint to MongoDB that it should be iterating through posts with the few_posts_have_this tag instead?
Short answer is no, this is due to how Mongo builds an index on an array:
To index a field that holds an array value, MongoDB creates an index key for each element in the array
So when you when you query the tags field imagine mongo queries each tag separately then it does an intersection.
If you run "explain" you will be able to see that after the index scan phase Mongo executes a fetch document phase, this phase in theory should be redundant for an pure index scan which shows this is not the case. So basically Mongo fetches ALL documents that have either of the tags, only then it performs the "$all" logic in the filtering phase.
So what can you do?
if you have prior knowledge on which tag is sparser you could first query that and only then filter based on the larger tag, I'm assuming this is not really the case but worth considering if possible. If your tags are somewhat static maybe you can precalculate this even.
Otherwise you will have to reconsider a restructuring that will allow better index usage for this usecase, I will say for most access patterns your structure is better.
The new structure can be an object like so:
tags2: {
tagname1: 1,
tagname2: 2,
...
}
Now if you built an index on tags2 each key of the object will be indexed separately, this will make mongo skip the "fetch" phase as the index contains all the information needed to execute the following query:
{"tags2.most_posts_have_this" :{$exists: true}, "tags2.few_posts_have_this": {$exists: true}}
I understand both solutions are underwhelming to say the least, but sadly Mongo does not excel in this specific use case.. I can think of more "hacky" approaches but I would say these 2 are the more reasonable ones to actually consider implementing depending on performance requirments.
Is there some way to hint to MongoDB that it should be iterating through posts with the few_posts_have_this tag instead?
Not really. When Mongo runs an $all it is going to get all records with both tags first. You could try using two $in queries in an aggregation instead, selecting the less frequent tag first. I'm not sure if this would actually be faster (depends on how Mongo optimizes things) but could be worth a try.
The best you can do:
Make sure you have an an index on the tags field. I see in the comments you have done this.
Mongo may be using the wrong index for this query. You can see which it is using with cursor.explain(). You can force it to use your tags index with hint(). First use db.collection.getIndexes() to make sure your tags index shows up as expected in the list of indexes.
Using projections to return only the fields you need might speed things up. For example, depending on your use case, you might return just post IDs and then query full text for a smaller subset of the returned posts. This could speed things up because Mongo doesn't have to manage as much intermediate data.
You could also consider periodically sorting the tags array field by frequency. If the least frequent tags are first, Mongo may be able to skip further scanning for that document. It will still fetch all the matching documents, but if your tag lists are very large it could save time by skipping the later tags. See The ESR (Equality, Sort, Range) Rule for more details on optimizing your indexed fields.
If all that's still not fast enough and the performance of these queries is critical, you'll need to do something more drastic:
Upgrade your machine (ensure it has enough RAM to store your whole dataset, or at least your indexes, in memory)
Try sharding
Revisit your data model. The fastest possible result will be if you can turn this query into a covered query. This may or may not be possible on an array field.
See Mongo's optimizing query performance for more detail, but again, it is unlikely to help with this use case.

How to do global search in MongoDB?

What I mean as global search is searching for documents in specified collections, for example, searching for a name in both User and Organization collections and will return both user and organization documents that match the criteria.
Is it possible to simply copy the documents in User and Organization into another collection and do a search in it?
No, it is not possible to do a multi-collection search automatically. There's no reason however that you couldn't perform the same query on multiple collections and combine the results.
While you could duplicate the data into another collection for query purposes, if you need to be guaranteed that the source collection's values matches identically with the "index" collection, you'll need to implement your own multi-phase transaction (example) as MongoDb doesn't have a multi-collection atomic commit. Or, you can accept the fact that the "index" table may be out of sync. Of course, it could be periodically updated through custom code. Further, it means your working set has increased as you're double storing data. Also, if you then need to grab data from individual collections (to grab more of the source document), you've likely not gained anything and made things worse when compared to doing multiple queries in the first place.
You could store related documents in the same collection and take advantage of the built-in indexing offered. Of course, this comes with the caveat that if your documents are now typed, you may find it more challenging to build MongoDb indexes that are efficient. Every changing/new document must go through the indexing pipeline, which may introduce significant overhead.
If it's only a few collections, I'd just do multiple searches without understanding more deeply your requirements. If not, the second best would be to combine documents into a single collection. Last choice would be to copy the data.

Paginating results in MongoDB without relying on .skip()

I'm building an app that calls data from MongoDB. For purposes of this question, pretend that the user searches my app for a certain query, and MongoDB has 4,000 results to spit out that match the query.
After reading around a bit, I see that it's possible to paginate using the .skip() method, but MongoDB themselves suggest against using this as it requires the curser to iterate through all the records up until the one you're skipping to, which gets more and more expensive the higher in the list you go.
I've seen a few tutorials that rely on the _id property of the results to be sequential, but this doesn't apply here - my database has tens of thousands of records, and each has a unique id, and the 4000 results that apply to the user's query are definitely not going to be sequential.
Can anyone think of a way to do this, or is skip() the only option here?
Other considerations:
The pagination will work based on the position on the page. For instance, the first query should spit out 20 records to my app. When the user scrolls to the bottom of the page, I could potentially get the _id of the 20th element on the page and pass that to my query, find it in the list of 4,000 results, find the subsequent result and start the next set of 20 from there. Is that sort of thing possible, and would it be less CPU intensive than skip()?
Your trick in "other considerations" works only if you add a sort on _id, otherwise you can't guarantee order for follow up queries. If you want to sort on a different field, you need to index that field. I would also suggest you query for 21 elements so that you don't have to go back and find the next one after the 20th element (of course, you can still show only the first 20 elements).
MongoDB ranged pagination has a good example as well.

How to ensure that cursor does not return duplicate document in MongoDB?

In MongoDB, read operation on the collection returns cursor.
If the read operation is accessing most of the documents in the collection and it may be possible that it may interleave with other update operation.
In that case, may it be possible that cursor will have duplicate documents ?
How to make sure that cursor will avoid duplicates ?
The distinct method will not be of much help here. This is not a problem that the function can solve, not only that but it is a fraction of the speed of a normal cursor.
If the read operation is accessing most of the documents in the collection and it may be possible that it may interleave with other update operation.
It is possible if the documents move in such a manner that with the sort of the cursor they get read again.
Whether this is a problem or not depends, if you are sorting by something that won't be updated, for example _id, then you don't really need to worry, however, if you are sorting by something that will be updated and could shift then yes; you will have a problem.
One method of solving this is to look at the last _id in that iteration of the cursor, filling the cursor into batchs of 1000 in an array or something. After you have the last _id in that batch you range, taking everything greater than that _id.
Another method could be to do snapshot queries: http://docs.mongodb.org/manual/reference/operator/snapshot/ however this function has quite a few limitations, for example it cannot be used with sharded collections.

Querying directly on results from MongoDB mapreduce versus updating original collection

I have a mapreduce job that runs on a collection of posts and calculates a popularity for each post. The mapreduce outputs a collection with the post_id and popularity for each post. The application needs to be able to get posts sorted by popularity. There are millions of posts, and these popularities are updated every 10 minutes. Two methods I can think of:
Method 1
Keep an index on the posts table popularity field
Run mapreduce on the posts table (this will replace any previous mapreduce results)
Loop through each row in the mapreduce results collection and individually update the popularity of its corresponding post in the posts table
Query directly on the posts table to get posts sorted by popularity
Method 2
Run mapreduce on the posts table (this will replace the previous mapreduce results)
Add an index to the popularity field in the resulting mapreduce collection
When the application needs posts, first query the mapreduce results collection to get the sorted post_ids, then query the posts collection to get the actual post data
Questions
Method 1 would need to maintain an index on the popularity in the posts table. It'll also need to update millions (the post table has millions of rows) of popularities individually every 10 or so minutes. It'll only update those posts that have changed popularity, but it's still a lot of updates on a collection with a couple of indexes. There will be a significant # of reads on this collection as well. Is this scalable?
For method 2, is it possible to mapreduce the posts collection to create a new popularities collection, immediately create an index on it, and query it?
Are there any concurrency issues for question #2, assuming the application will be querying that popularities collection as it's being updated by the map reduce and re-indexed.
If the mapreduce replaces the popularities collection do I need to manually create a new index every time or will mongo know to keep an index on the popularity field. Basically, how do indexes work with mapreduce result collections.
Is there some tweak or other method I could use for this??
Thanks for any help!
The generic advice concerning Map Reduce is to have your application perform a little extra computation on each insert, and avoid doing a processor-intensive map reduce job whenever possible.
Is it possible to add a "popularity" field to each "post" document and have your application increment it each time each post is viewed, clicked on, voted for, or however you measure popularity? You could then index the popularity field, and searches for posts by popularity would be lightning-fast.
If simply incrementing a "popularity" field is not an option, and a MapReduce operation must be performed, try to prevent it from paging through all of the documents in the collection. You will find that this becomes prohibitively slow as your collection grows. It sounds as though your collection is already pretty large.
It is possible to perform an incremental map reduce, where the results of the latest map reduce are integrated with the results of the previous one, instead of merely being overwritten. You can also provide a query to the mapReduce function, so not all documents will be read. Perhaps add a query that matches only posts that have been viewed, voted for, or added since the last map reduce.
The documentation on incremental mapReduce operations is here:
http://www.mongodb.org/display/DOCS/MapReduce#MapReduce-IncrementalMapreduce
Integrating the new results with the old ones is explained in the "Output options" section.
I realize that my advice has been pretty general so far, so I will attempt to address your questions now:
1) As discussed above, if your MapReduce operation has to read every single document, this will not scale well.
2) The MapReduce operation only outputs a collection. Creating an index and querying that collection will have to be done programmatically.
3) If there is one process that is querying a collection at the same time that another is updating it, then it is possible for the query to return a document before it has been updated. The short answer is, "yes"
4) If the collection is dropped then indexes will have to be rebuilt. If the documents in the collection are deleted, but the collection itself is not dropped then the index(es) will persist. In the case of a MapReduce run with the {out:{replace:"output"}} option, the index(ex) will persist, and won't have to be recreated.
5) As stated above, if possible it would be preferable to add another field to your "posts" collection, and update that, instead of performing so many MapReduce operations.
Hopefully I have been able to provide you with some additional factors to consider when building your application. Ultimately, it is important to remember that each application is unique, and so for the ultimate proof of which way is "best", you will have to experiment with all of the different options and decide for yourself which way is most efficient. Good Luck!