Can MongoDB run the same operation on many documents without querying each one? - mongodb

I am looking for a way to update every document in a collection called "posts".
Posts get updated periodically with a popularity (sitewide popularity) and a strength (the estimated relevance to that particular user), each from different sources. What I need to do is multiply popularity and strength on each post to get a third field, relevance. Relevance is used for sorting the posts.
class Post
  include Mongoid::Document
  field :popularity
  field :strength
  field :relevance
  ...
The current implementation is as follows:
1) I map/reduce down to a separate collection, which stores the post id and calculated relevance.
2) I update every post individually from the map reduce results.
This is a huge amount of individual update queries, and it seems silly to map each post to its own result (1-to-1), only to update the post again. Is it possible to multiply in place, or do some sort of in-place map?

Is it possible to multiply in place, or do some sort of in-place map?
Nope.
The ideal here would be to have the Map/Reduce update the Post directly when it is complete. Unfortunately, M/R does not have that capability. In theory, you could issue updates from the "finalize" stage, but this will collapse in a sharded environment.
However, if all you are doing is a simple multiplication, then you don't really need M/R at all. You can just run a big for loop, or you can hook up the save event to update :relevance when :popularity or :strength are updated.
MongoDB doesn't have triggers, so it can't do this automatically. But you're using a business layer, which is exactly the place to put this kind of logic.
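A minimal sketch of the save-hook idea, written in plain Ruby so it runs without a database. The PostSketch class and its save method are hypothetical stand-ins for the Mongoid model; with Mongoid, the same recalculation would normally sit in a before_save callback on Post.

```ruby
# Plain-Ruby stand-in for a Mongoid model (hypothetical; no database involved).
class PostSketch
  attr_accessor :popularity, :strength
  attr_reader :relevance

  def initialize(popularity:, strength:)
    @popularity = popularity
    @strength = strength
    @relevance = nil
  end

  # Stand-in for the persistence call. In Mongoid, the recalculation below
  # would live in a before_save callback rather than inside save itself.
  def save
    recalculate_relevance
    true
  end

  private

  # relevance = popularity * strength, as described in the question.
  def recalculate_relevance
    @relevance = popularity * strength
  end
end

post = PostSketch.new(popularity: 10, strength: 0.5)
post.save
puts post.relevance

post.popularity = 20 # a new sitewide popularity arrives
post.save
puts post.relevance
```

Because relevance is refreshed on every save, the stored value stays in step with whichever of the two inputs changed last, and no separate batch pass is needed.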

Related

How to efficiently query MongoDB for documents when I know that 95% are not used

I have a collection of ~500M documents.
Every time when I execute a query, I receive one or more documents from this collection. Let's say I have a counter for each document, and I increase this counter by 1 whenever this document is returned from the query. After a few months of running the system in production, I discover that the counter of only 5% of the documents is greater than 0 (zero). Meaning, 95% of the documents are not used.
My question is: Is there an efficient way to arrange these documents to speedup the query execution time, based on the fact that 95% of the documents are not used?
What is the best practice in this case?
If - for example - I will add another boolean field for each document named "consumed" and index this field. Can I improve the query execution time somehow?
"~500M documents": that is quite a solid figure, good job if that's true. So here is how I see the solution to the problem:
If you are willing to rewrite/refactor and rebuild the app's DB, you could use the versioning pattern.
What does it look like?
Imagine you have two collections (or even two databases, if you are using a microservice architecture):
Relevant docs / Irrelevant docs.
Basically, you run find only on the relevant docs collection (which stores the 5% of your docs that are actually used), and if nothing comes back, you fall back to Irrelevant.find(). This pattern also lets you keep old/historical data and manage it via a TTL index or a capped collection.
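A rough sketch of the two-collection lookup, using plain Ruby hashes as hypothetical stand-ins for the two collections. Promoting a document into the relevant set after a fallback hit is my own assumption about how the 5% set would be kept current; it is not something the pattern requires.

```ruby
# Hypothetical in-memory stand-ins for the two collections, keyed by document id.
relevant   = { 1 => { name: "hot doc" } }
irrelevant = { 2 => { name: "cold doc" }, 3 => { name: "another cold doc" } }

# Look in the small "relevant" collection first and fall back to the large
# one only on a miss. A document found in the cold set is promoted (an
# assumption of this sketch) so the next lookup stays in the small collection.
def find_doc(id, relevant, irrelevant)
  return relevant[id] if relevant.key?(id)

  doc = irrelevant.delete(id)
  relevant[id] = doc if doc
  doc
end

p find_doc(1, relevant, irrelevant)  # served from the small collection
p find_doc(2, relevant, irrelevant)  # fallback hit, then promoted
p find_doc(99, relevant, irrelevant) # not present in either collection
```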
You could also add some Redis magic to it (which uses precisely the same logic).
This article can also be helpful (as many others, like this SO question)
But don't try to replace Mongo with Redis, team them up instead.
Using Indexes and .explain()
If - for example - I will add another boolean field for each document named "consumed" and index this field. Can I improve the query execution time somehow?
Yes, it will deal with your problem. To take a look, download MongoDB Compass, create this boolean field in your schema (don't forget to add a default value), index the field, and then use the Explain module with some query. But don't forget about compound indexes! If you create an index on just this one field, measure the performance by querying only that field.
The result should look like this:
If your index is used (and actually speeds things up), Compass will show it.
To measure the performance of the queries (with and without indexing), use the Explain tab.
Actually, all of this can be done without Compass, via .explain and .index queries. But Compass visualizes the process better, so it's better to use it, especially since it has become completely free for everyone.

Paginating results in MongoDB without relying on .skip()

I'm building an app that calls data from MongoDB. For purposes of this question, pretend that the user searches my app for a certain query, and MongoDB has 4,000 results to spit out that match the query.
After reading around a bit, I see that it's possible to paginate using the .skip() method, but MongoDB themselves advise against it, as it requires the cursor to iterate through all the records up to the one you're skipping to, which gets more and more expensive the further down the list you go.
I've seen a few tutorials that rely on the _id property of the results to be sequential, but this doesn't apply here - my database has tens of thousands of records, and each has a unique id, and the 4000 results that apply to the user's query are definitely not going to be sequential.
Can anyone think of a way to do this, or is skip() the only option here?
Other considerations:
The pagination will work based on the position on the page. For instance, the first query should spit out 20 records to my app. When the user scrolls to the bottom of the page, I could potentially get the _id of the 20th element on the page and pass that to my query, find it in the list of 4,000 results, find the subsequent result and start the next set of 20 from there. Is that sort of thing possible, and would it be less CPU intensive than skip()?
Your trick in "other considerations" works only if you add a sort on _id, otherwise you can't guarantee order for follow up queries. If you want to sort on a different field, you need to index that field. I would also suggest you query for 21 elements so that you don't have to go back and find the next one after the 20th element (of course, you can still show only the first 20 elements).
MongoDB ranged pagination has a good example as well.
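A minimal sketch of ranged pagination on _id, run against an in-memory array so it is self-contained. The docs array and the fetch_page helper are hypothetical; against MongoDB the same step would be a query for _id greater than the last seen id, sorted ascending on _id and limited to page_size + 1.

```ruby
# Hypothetical stand-in for the matching documents; ids are unique but not consecutive.
docs = [3, 8, 15, 16, 23, 42, 57, 61, 77, 90].map { |id| { _id: id } }

# Return one page plus the cursor value for the next request.
# One extra row is fetched: it only signals whether another page exists
# and is never shown to the user.
def fetch_page(docs, page_size, last_id = nil)
  sorted = docs.sort_by { |d| d[:_id] }
  candidates = last_id ? sorted.select { |d| d[:_id] > last_id } : sorted
  rows = candidates.first(page_size + 1)
  page = rows.first(page_size)
  {
    page: page,
    has_more: rows.size > page_size,
    last_id: page.empty? ? nil : page.last[:_id]
  }
end

first_page = fetch_page(docs, 4)
second_page = fetch_page(docs, 4, first_page[:last_id])
third_page = fetch_page(docs, 4, second_page[:last_id])

p first_page[:page].map { |d| d[:_id] }
p second_page[:page].map { |d| d[:_id] }
p third_page[:page].map { |d| d[:_id] }
```

The client only has to remember the last _id it displayed, so the cost of a page does not grow with how deep into the results the user has scrolled.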

Create aggregated user stats with MongoDB

I am building a MongoDB database that will work with an Android app. I have a user collection and a records collection. The records documents consist of GPS tracks, with fields such as start and end coordinates, total time, top speed and distance. The user document has a user id, first name, last name and so forth.
I want to have aggregate stats for each user that summarizes total distance, total time, total average speed and top speed to date.
I am unsure whether I should do a map reduce and create an aggregate collection for users, or whether I should add these stats to the user document with some kind of cron-job-style solution. I have read many guides about map reduce and aggregation for MongoDB but can't figure this out.
Thanks!
It sounds like your aggregate indicator values are per-user, in which case I would simply calculate them and push them directly into the user object at the same time as you update the current coordinates, speed etc. They would be nice and easy (and fast) to query, and you could aggregate them further if you wished.
When I say pre-calculate, I don't mean MapReduce, which you would use as a batch process, I simply mean calculate on update of the user object.
If your aggregate stats are compiled across users, then you could still pre-calculate them on update, but if you also need to be able to query those aggregate stats against some other condition or filter, such as, "tell me what the total distance travelled for all users within x region", then depending on the number of combinations you may not be able to cover all those with pre-calculation.
So, if your aggregate stats ARE across users, AND need some sort of filter applying, then they'll need to be calculated from some snapshot of data. The two approaches here are:
the aggregation framework in 2.2
MapReduce
You would need to use MapReduce say, if you've a LOT of historical data that you want to crunch and you can pre-calculate the results for fast reading later. By my definition, that data isn't changing frequently, but even if it did, you can also use incremental MR to add new results to an existing calculation.
The aggregation framework in 2.2 will allow you to do a lot of this on demand. It won't be as quick as pre-calculated values, of course, but it is way quicker than MR when executed on demand. It can't cope with the high-volume result sets that MR can handle, but it's better suited to queries where you don't know the parameter values in advance.
By way of example, if you wanted to calculate the aggregate sums of users stats within a particular lat/long, you couldn't use MR because there are just too many combinations of that filter, so you'd need to do that on the fly.
If however, you wanted it by city, well you could conceivably use MR there because you could stick to a finite set of cities and just pre-calculate them all.
But to wrap up, if your aggregate indicator values are per-user alone, then I'd start by calculating and storing the values inside the user object when I update the user object as I said in the first paragraph. Yes, you're storing the value as well as the inputs, but that's the model that saves you having to calculate on the fly.
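As a sketch of "calculate on update", the function below folds one new GPS record into the running totals kept on the user. It is plain Ruby with hashes standing in for documents, and the field names (distance_km, time_h, top_speed_kmh and so on) are hypothetical.

```ruby
# Fold one finished GPS record into the running totals stored on the user.
def apply_record(stats, record)
  total_distance = stats[:total_distance_km] + record[:distance_km]
  total_time = stats[:total_time_h] + record[:time_h]
  {
    total_distance_km: total_distance,
    total_time_h: total_time,
    # Overall average speed is total distance over total time,
    # not the mean of the per-track averages.
    avg_speed_kmh: total_time.zero? ? 0.0 : total_distance / total_time,
    top_speed_kmh: [stats[:top_speed_kmh], record[:top_speed_kmh]].max
  }
end

empty_stats = { total_distance_km: 0.0, total_time_h: 0.0, avg_speed_kmh: 0.0, top_speed_kmh: 0.0 }

records = [
  { distance_km: 10.0, time_h: 0.5, top_speed_kmh: 32.0 },
  { distance_km: 30.0, time_h: 1.5, top_speed_kmh: 41.0 }
]

# Each incoming record updates the stored stats once; nothing is recomputed from history.
user_stats = records.reduce(empty_stats) { |stats, record| apply_record(stats, record) }
p user_stats
```

Storing the totals and deriving the average from them keeps each update cheap, which is the trade-off described above: a little redundant data in exchange for not calculating on the fly.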

Querying directly on results from MongoDB mapreduce versus updating original collection

I have a mapreduce job that runs on a collection of posts and calculates a popularity for each post. The mapreduce outputs a collection with the post_id and popularity for each post. The application needs to be able to get posts sorted by popularity. There are millions of posts, and these popularities are updated every 10 minutes. Two methods I can think of:
Method 1
Keep an index on the posts table popularity field
Run mapreduce on the posts table (this will replace any previous mapreduce results)
Loop through each row in the mapreduce results collection and individually update the popularity of its corresponding post in the posts table
Query directly on the posts table to get posts sorted by popularity
Method 2
Run mapreduce on the posts table (this will replace the previous mapreduce results)
Add an index to the popularity field in the resulting mapreduce collection
When the application needs posts, first query the mapreduce results collection to get the sorted post_ids, then query the posts collection to get the actual post data
Questions
Method 1 would need to maintain an index on the popularity in the posts table. It'll also need to update millions (the post table has millions of rows) of popularities individually every 10 or so minutes. It'll only update those posts that have changed popularity, but it's still a lot of updates on a collection with a couple of indexes. There will be a significant # of reads on this collection as well. Is this scalable?
For method 2, is it possible to mapreduce the posts collection to create a new popularities collection, immediately create an index on it, and query it?
Are there any concurrency issues for question #2, assuming the application will be querying that popularities collection as it's being updated by the map reduce and re-indexed?
If the mapreduce replaces the popularities collection, do I need to manually create a new index every time, or will mongo know to keep an index on the popularity field? Basically, how do indexes work with mapreduce result collections?
Is there some tweak or other method I could use for this?
Thanks for any help!
The generic advice concerning Map Reduce is to have your application perform a little extra computation on each insert, and avoid doing a processor-intensive map reduce job whenever possible.
Is it possible to add a "popularity" field to each "post" document and have your application increment it each time each post is viewed, clicked on, voted for, or however you measure popularity? You could then index the popularity field, and searches for posts by popularity would be lightning-fast.
If simply incrementing a "popularity" field is not an option, and a MapReduce operation must be performed, try to prevent it from paging through all of the documents in the collection. You will find that this becomes prohibitively slow as your collection grows. It sounds as though your collection is already pretty large.
It is possible to perform an incremental map reduce, where the results of the latest map reduce are integrated with the results of the previous one, instead of merely being overwritten. You can also provide a query to the mapReduce function, so not all documents will be read. Perhaps add a query that matches only posts that have been viewed, voted for, or added since the last map reduce.
The documentation on incremental mapReduce operations is here:
http://www.mongodb.org/display/DOCS/MapReduce#MapReduce-IncrementalMapreduce
Integrating the new results with the old ones is explained in the "Output options" section.
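To illustrate what integrating new results with old ones means, here is a small plain-Ruby sketch. It is not MongoDB code: the event shape ({ post_id:, votes: }) and the helper names are hypothetical, and popularity is assumed to be a simple sum so that reducing already-reduced values is safe.

```ruby
# Combine all values emitted for one key; here popularity is a sum of votes.
def reduce_values(values)
  values.sum
end

# Map + reduce over one batch of events.
def map_reduce(events)
  events.group_by { |e| e[:post_id] }
        .transform_values { |group| reduce_values(group.map { |e| e[:votes] }) }
end

# Fold the fresh results into the previous output. Keys present in both are
# re-reduced, keys present in only one side are carried over unchanged.
def merge_incremental(previous, fresh)
  previous.merge(fresh) { |_key, old, added| reduce_values([old, added]) }
end

previous = map_reduce([{ post_id: "a", votes: 3 }, { post_id: "b", votes: 1 }])

# Only events since the last run are processed, which corresponds to passing
# a query to the map reduce so that not all documents are read.
new_events = [{ post_id: "a", votes: 2 }, { post_id: "c", votes: 5 }]
current = merge_incremental(previous, map_reduce(new_events))
p current
```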
I realize that my advice has been pretty general so far, so I will attempt to address your questions now:
1) As discussed above, if your MapReduce operation has to read every single document, this will not scale well.
2) The MapReduce operation only outputs a collection. Creating an index and querying that collection will have to be done programmatically.
3) If one process is querying a collection at the same time that another is updating it, then it is possible for the query to return a document before it has been updated. The short answer is "yes".
4) If the collection is dropped, then indexes will have to be rebuilt. If the documents in the collection are deleted, but the collection itself is not dropped, then the index(es) will persist. In the case of a MapReduce run with the {out:{replace:"output"}} option, the index(es) will persist, and won't have to be recreated.
5) As stated above, if possible it would be preferable to add another field to your "posts" collection, and update that, instead of performing so many MapReduce operations.
Hopefully I have been able to provide you with some additional factors to consider when building your application. Ultimately, it is important to remember that each application is unique, and so for the ultimate proof of which way is "best", you will have to experiment with all of the different options and decide for yourself which way is most efficient. Good Luck!

Query for set complement in CouchDB

I'm not sure that there is a good way to do this with the facilities CouchDB provides, but I'd like to somehow extract the relative complement of the sets of two different document types over a particular key.
For example, let's say that I have documents representing users and posts, both of which have a username field (unique among users). There's a validation in place ensuring that a user document exists for the username in every post, but there may be any number of post documents with a given username, including none. It's trivial to create a view which counts the number of posts per username. The view can even include zero-counts by emitting zero post-counts for the user documents in the view map function. What I want to do, though, is retrieve just the list of users who have zero associated posts.
It's possible to build the view I described above and filter client-side for zero-value results, but in my actual situation the number of results could be very, very large, and the interesting results a relatively small proportion of the total. Is there a way to do this server-side and get back just the interesting results?
I would write a map function to iterate through the documents and emit the users (or just usernames) with 0 posts.
Then I would write a list function to iterate through the map function results and format them however you want (JSON, csv, etc).
(I would NOT use a reduce function to format the results, even if a reduce function appears to work OK in development. That is just my own experience from lessons learned the hard way.)
Personally I would filter on the client-side until I had performance issues. Next I would probably use Teddy's _filter technique—all pretty standard CouchDB stuff.
However, I stumbled across (IMO) an elegant way to find set complements. I described it when exploring how to find documents missing a field.
The basic idea
Finding non-members of your view obviously can't be done with a simple query (and a straightforward index scan). However, it can be done in constant memory and linear time by iterating through two query results simultaneously.
One query is for all possible document ids. The other query is for matching documents (those you don't want). Importantly, CouchDB sorts query results, therefore you can calculate the complement efficiently.
See my details in the previous question. The basic idea is that you iterate through both (sorted) lists simultaneously, and whenever you can say "hey, this document id is listed in the full set but it's missing in the sub-set", that is a hit.
(You don't have to query _all_docs, you just need two queries to CouchDB: one returning all possible values, and the other returning values not to be counted.)
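The simultaneous walk over two sorted lists can be sketched as follows. This is plain Ruby over arrays rather than CouchDB query results, and it assumes both lists are sorted ascending, contain no duplicates, and that every entry of the sub-set also appears in the full set; the function name and sample data are hypothetical.

```ruby
# Walk two sorted lists in step and collect the items of `all`
# that are missing from `subset`.
def sorted_complement(all, subset)
  missing = []
  j = 0 # position in the subset list
  all.each do |item|
    if j < subset.size && subset[j] == item
      j += 1 # present in both lists: not a hit
    else
      missing << item # listed in the full set only: a hit
    end
  end
  missing
end

all_users = %w[alice bob carol dave erin] # e.g. usernames from user documents
users_with_posts = %w[bob dave]           # e.g. usernames that have at least one post

p sorted_complement(all_users, users_with_posts)
```

Each list is read once from front to back, so with streamed query results only the current position in each list needs to be held in memory.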