Strategies for keeping Map Reduce results around for subsequent queries

Strategies for keeping Map Reduce results around for subsequent queries - mongodb

I'm using Map Reduce with MongoDB. Simplified scenario: There are users, items and things. Items include any number of things. Each user can rate things. Map reduce is used to calculate the aggregate rating for each user on each item. It's a complex formula using the ratings for each thing in the item and the time of day - it's not something you could ever index on and thus map-reduce is an ideal approach to calculating it.
The question is: having calculated the results using Map Reduce what strategies do people use to maintain these per-user results collections in their NOSQL databases?
1) On demand with automatic deletion: Keep them around for some set period of time and then delete them; regenerate them as necessary when the user makes a new request?
2) On demand never delete: Keep them around indefinitely. When the user makes a request and the collection is past it's use-by date, regenerate it.
3) Scheduled: Regular process running to update all results collections for all users?
4) Other?

The best strategy depends on the nature of your map-reduce job.
If you're using a separate map-reduce call for each individual user, I would go with the first or second strategy. The advantage of the second strategy over the first strategy is that you always have a result ready. So when the user makes a request and the result is outdated, you can still present the old result to the user, while running a new map-reduce in the background to generate a fresh result for the next requests. This has the following advantages:
The user doesn't have to wait for the map-reduce to complete, which is important if the map-reduce may take a while to complete. The exception is of course the very first map-reduce call; at this point there is no old result available.
You're automatically running map-reduce only for the active users, reducing the load on the database.
If you're using a single, application-wide map-reduce call for all users, the third strategy is the best approach. You can easily achieve this by specifying an output collection. The advantages of this approach:
You can easily control the freshness of the result. If you need more up-to-date results, or need to reduce the load on the database, you only have to adjust the schedule.
Your application code isn't responsible for managing the map-reduce calls, which simplifies your application.
If a user can only see his or her own ratings, I'd go with strategy one or two, or include a lastActivity timestamp in user profiles and run an application-wide scheduled map-reduce job on the active subset of the users (strategy 3). If a user can see any other user's ratings, I'd go with strategy 3 as well, as this greatly reduces the complexity of the application.

Related

Is dynamically creating and dropping collections in MongoDB going to create scalability issues?

I have an application (built in Meteor) that provides some ad hoc reporting capabilities to the end user. I have built up that functionality by using the aggregation pipeline to produce the results for a given query. This makes it extremely fast and I was using $out to push the results right into a results table.
The results table included a queryID, which the client used to figure out which were the correct results.
Unfortunately, as you may know (and I discovered), that doesn't work so well once you have more than one user running reports at a time because $out deletes the whole results table before pushing the new query in.
I see three possible workarounds:
Run the aggregation, but manually push the results into the results collection
$out the results into a temporary collection (dynamically named to avoid conflicts) and then manually copy the results from there into results collection, immediately dropping the temporary one. This made some sense when I thought I could use copyTo(), but that doesn't appear possible within Meteor, so I think this option doesn't make much sense relative to #1 in this case.
$out the results into a temporary collection (dynamically named to avoid conflicts) and have the client pull its results directly from there. I would then periodically drop the extra collections after say 24 hours (like I do with specific query results in the main collection today).
#3 would be the fastest by far - the time it takes to manually copy rows dwarfs the time it takes the queries to run. But I'm concerned about the impact of creating and dropping so many collections.
We're not talking millions of users here, but if an average of 500 users a day were each running 10-20 reports, there could be an additional 5-10k collections in the database at any one time. That seems like a lot. Perhaps I could be smarter about cleaning them up somehow, though I can't just immediately remove them because a user might want to have multiple tabs open with different reports. Even still, we're potentially talking about hundreds to thousands of collections.
Is that going to be a problem?
Are there other approaches I should consider instead?
Other recommendations?
Thanks!

Dropping a collection in mongoDB is very efficient operation, anyway much more efficient than deleting some documents in a larger collection.
Maximum number of collections is quite high, only limited by namespace namespace in MMAPv1 while no hard limit exists in wiretiger engine.
So I would favor your solution #3.
Some improvements/alternatives you can think:
Consider creating the collections in a separated database (say per day) then you can drop the entire database in a single operation without having to drop individual collections.
Use an endpoint for the result set, cash the results then drop the $out collection. Let cache handle user requirements and only rerun the aggregation if cache has expired or something.

This kind of activity is done very easily in relational databases such as mysql or pgsql. You might consider synchronising your data to a separate relational database for the purposes of reporting.
There is a package https://github.com/perak/mysql-shadow which claims to provide synchronisation. I played with it and it didn't work perfectly, although doing just one way sync is more likely to succeed.
The other option is to use Graphql over a mongo/mysql hybrid database which can be done with the Apollo stack http://www.apollodata.com/

MongoDB Schema Suggestion

I am trying to pick MongoDB as my preferred database. I need help on the design of my table.
App background - analytics app where contacts push their own events and related custom data. Contact can have many events. Eg: contact did this, did that etc.
event_type, custom_data (json), epoch_time
eg:
event 1: event_type: page_visited, custom-data: {url: pricing, referrer: google}, current_time
event 2: event_type: video_watched, custom-data: {url: video_link}, current_time
event 3: event_type: paid, custom_data: {plan:lite, price:35}
These events are custom and are defined by the user. Scalability is a concern.
These are the common use cases:
give me a list of users who have come to pricing page in the last 7 days
give me a list of users who watched the video and paid more than 50
give me a list of users who have visited pricing, watched video but NOT paid at least 20
What's the best way to design my table?
Is it a good idea to use embedded events in this case?

In Mongo they are called collections and not tables, since the data is not rows/columns :)
(1) I'd make an Event collection and a Users collections
(2) I'd do 1 document per Event which has a userId in it.
(3) If you need realtime data you will want an index on what you want to query by (i.e. never do a scan over the whole collection).
(4) if there are things which are needed for reporting only, I'd recommend making a reporting node (i.e. a different mongo instance) and using replication to copy data to that mongo instance. You can put additional indexes for reporting on that node. That way the additional indexes and any expensive queries will not affect production performance.
Notes on sharding
If your events collection is going to become large - you may need to consider sharding. Perhaps sharding by user Id. However, I'd recommend that may be a longer term solution and not to dive into that until you need it.
One thing to note, is that mongo has currently (2.6) a database level write locking implementation. Which means you can only perform 1 write at a time. It allows many reads. Which means that if you want a high write system AND have a lot of users, you will need to look into sharding at some point. However, in my experience so far, administratively 1 primary node with a secondary (and reporting node) is easier to setup. We currently can handle around 10,000 operations per second with that setup.
However, we have had issues with spikes in users coming to the system. You'll want to make sure you have enough memory for your indexes. And SSD's would be recommended to. as a surge in users can result in cache misses (i.e. index not in memory) which causes it to be read off the hard disk.
One final note - there are a lot of NoSQL DB's and they all have their pros and cons. I personally found that high write, low read, and realtime anaysis of lots of data is not really mongo's strength. So it does depend on what you are doing. It sounds like you are still learning the fundamentals. It might be worth a read of all the available types to pick the right tool for the right job.

Are there any advantages to using a custom _id for documents in MongoDB?

Let's say I have a collection called Articles. If I were to insert a new document into that collection without providing a value for the _id field, MongoDB will generate one for me that is specific to the machine and the time of the operation (e.g. sdf4sd89fds78hj).
However, I do have the ability to pass a value for MongoDB to use as the value of the _id key (e.g. 1).
My question is, are there any advantages to using my own custom _ids, or is it best to just let Mongo do its thing? In what scenarios would I need to assign a custom _id?
Update
For anyone else that may find this. The general idea (as I understand it) is that there's nothing wrong with assigning your own _ids, but it forces you to maintain unique values within your application layer, which is a PITA, and requires an extra query before every insert to make sure you don't accidentally duplicate a value.
Sammaye provides an excellent answer here:
Is it bad to change _id type in MongoDB to integer?

Advantages with generating your own _ids:
You can make them more human-friendly, by assigning incrementing numbers: 1, 2, 3, ...
Or you can make them more human-friendly, using random strings: t3oSKd9q
(That doesn't take up too much space on screen, could be picked out from a list, and could potentially be copied manually if needed. However you do need to make it long enough to prevent collisions.)
If you use randomly generated strings they will have an approximately even sharding distribution, unlike the standard mongo ObjectIds, which tends to group records created around the same time onto the same shard. (Whether that is helpful or not really depends on your sharding strategy.)
Or you may like to generate your own custom _ids that will group related objects onto one shard, e.g. by owner, or geographical region, or a combination. (Again, whether that is desirable or not depends on how you intend to query the data, and/or how rapidly you are producing and storing it. You can also do this by specifying a shard key, rather than the _id itself. See the discussion below.)
Advantages to using ObjectIds:
ObjectIds are very good at avoiding collisions. If you generate your own _ids randomly or concurrently, then you need to manage the collision risk yourself.
ObjectIds contain their creation time within them. That can be a cheap and easy way to retain the creation date of a document, and to sort documents chronologically. (On the other hand, if you don't want to expose/leak the creation date of a document, then you must not expose its ObjectId!)
The nanoid module can help you to generate short random ids. They also provide a calculator which can help you choose a good id length, depending on how many documents/ids you are generating each hour.
Alternatively, I wrote mongoose-generate-unique-key for generating very short random ids (provided you are using the mongoose library).
Sharding strategies
Note: Sharding is only needed if you have a huge number of documents (or very heavy documents) that cannot be managed by one server. It takes quite a bit of effort to set up, so I would not recommend worrying about it until you are sure you actually need it.
I won't claim to be an expert on how best to shard data, but here are some situations we might consider:
An astronomical observatory or particle accelerator handles gigabytes of data per second. When an interesting event is detected, they may want to store a huge amount of data in only a few seconds. In this case, they probably want an even distribution of documents across the shards, so that each shard will be working equally hard to store the data, and no one shard will be overwhelmed.
You have a huge amount of data and you sometimes need to process all of it at once. In this case (but depending on the algorithm) an even distribution might again be desirable, so that all shards can work equally hard on processing their chunk of the data, before combining the results at the end. (Although in this scenario, we may be able to rely on MongoDB's balancer, rather than our shard key, for the even distribution. The balancer runs in the background after data has been stored. After collecting a lot of data, you may need to leave it to redistribute the chunks overnight.)
You have a social media app with a large amount of data, but this time many different users are making many light queries related mainly to their own data, or their specific friends or topics. In this case, it doesn't make sense to involve every shard whenever a user makes a little query. It might make sense to shard by userId (or by topic or by geographical region) so that all documents belonging to one user will be stored on one shard, and when that user makes a query, only one shard needs to do work. This should leave the other shards free to process queries for other users, so many users can be served at once.
Sharding documents by creation time (which the default ObjectIds will give you) might be desirable if you have lots of light queries looking at data for similar time periods. For example many different users querying different historical charts.
But it might not be so desirable if most of your users are querying only the most recent documents (a common situation on social media platforms) because that would mean one or two shards would be getting most of the work. Distributing by topic or perhaps by region might provide a flatter overall distribution, whilst also allowing related documents to clump together on a single shard.
You may like to read the official docs on this subject:
https://docs.mongodb.com/manual/sharding/#shard-key-strategy
https://docs.mongodb.com/manual/core/sharding-choose-a-shard-key/

I can think of one good reason to generate your own ID up front. That is for idempotency. For example so that it is possible to tell if something worked or not after a crash. This method works well when using re-try logic.
Let me explain. The reason people might consider re-try logic:
Inter-app communication can sometimes fail for different reasons, (especially in a microservice architecture). The app would be more resilient and self-healing by codifying the app to re-try and not give up right away. This rides over odd blips that might occur without the consumer ever being affected.
For example when dealing with mongo, a request is sent to the DB to store some object, the DB saves it, but just as it is trying to respond to the client to say everything worked fine, there is a network blip for whatever reason and the “OK” is never received. The app assumes it didn't work and so the app may end up re-trying the same data and storing it twice, or worse it just blows up.
Creating the ID up front is an easy, low overhead way to help deal with re-try logic. Of course one could think of other schemes too.
Although this sort of resiliency may be overkill in some types of projects, it really just depends.

I have used custom ids a couple of times and it was quite useful.
In particular I had a collection where I would store stats by date, so the _id was actually a date in a specific format. I did that mostly because I would always query by date. Keep in mind that using this approach can simplify your indexes as no extra index is needed, the basic cursor is sufficient.

Sometimes the ID is something more meaningful than a randomly generated one. For example, a user collection may use the email address as the _id instead. In my project I generate IDs that are much shorter than the ones Mongodb uses so that the ID shown in the URL is much shorter.

I'll use an example , i created a property management tool and it had multiple collections. For simplicity some fields would be duplicated for example the payment. And when i needed to update these record it had to happen simultaneously across all collections it appeared in so i would assign them a custom payment id so when the delete/query action is performed it changes all instances of it database wide

Can MongoDB run the same operation on many documents without querying each one?

I am looking for a way to update every document in a collection called "posts".
Posts get updated periodically with a popularity (sitewide popularity) and a strength (the estimated relevance to that particular user), each from different sources. What I need to do is multiply popularity and strength on each post to get a third field, relevance. Relevance is used for sorting the posts.
class Post
include Mongoid::Document
field :popularity
field :strength
field :relevance
...
The current implementation is as follows:
1) I map/reduce down to a separate collection, which stores the post id and calculated relevance.
2) I update every post individually from the map reduce results.
This is a huge amount of individual update queries, and it seems silly to map each post to its own result (1-to-1), only to update the post again. Is it possible to multiply in place, or do some sort of in-place map?

Is it possible to multiply in place, or do some sort of in-place map?
Nope.
The ideal here would be to have the Map/Reduce update the Post directly when it is complete. Unfortunately, M/R does not have that capability. In theory, you could issue updates from the "finalize" stage, but this will collapse in a sharded environment.
However, if all you are doing is a simple multiplication, then you don't really need M/R at all. You can just run a big for loop, or you can hook up the save event to update :relevance when :popularity or :strength are updated.
MongoDB doesn't have triggers, so it can't do this automatically. But you're using a business layer which is the exact place to put this kind of logic.

Is moving documents between collections a good way to represent state changes in MongoDB?

I have two collections, one (A) containing items to be processed (relatively small) and one (B) with those already processed (fairly large, with extra result fields).
Items are read from A, get processed and save()'d to B, then remove()'d from A.
The rationale is that indices can be different across these, and that the "incoming" collection can be kept very small and fast this way.
I've run into two issues with this:
if either remove() or save() time out or otherwise fail under load, I lose the item completely, or process it twice
if both fail, the side effects happen but there is no record of that
I can sidestep the double-failure case with findAndModify locks (not needed otherwise, we have a process-level lock) but then we have stale lock issues and partial failures can still happen. There's no way to atomically remove+save to different collections, as far as I can tell (maybe by design?)
Is there a Best Practice for this situation?

There's no way to atomically remove+save to different collections, as far as I can tell (maybe by design?)
Yes this is by design. MongoDB explicitly does not provides joins or transactions. Remove + Save is a form of transaction.
Is there a Best Practice for this situation?
You really have two low-complexity options here, both involve findAndModify.
Option #1: a single collection
Based on your description, you are basically building a queue with some extra features. If you leverage a single collection then you use findAndModify to update the status of each item as it is processing.
Unfortunately, that means you will lose this: ...that the "incoming" collection can be kept very small and fast this way.
Option #2: two collections
The other option is basically a two phase commit, leveraging findAndModify.
Take a look at the docs for this here.
Once an item is processed in A you set a field to flag it for deletion. You then copy that item over to B. Once copied to B you can then remove the item from A.

I've not tried this myself yet but the new book 50 Tips and Tricks for MongoDB Developers mentions a few times about using cron jobs (or services/scheduler) to clean up data like this. You could leave the documents in Collection A flagged for deletion and run daily job to clear them out, reducing the overall scope of the original transaction.
From what I've learned so far, I'd never leave the database in a state where I rely on the next database action succeeding unless it is the last action (journalling will resend the last db action upon recovery). For example, I have a three phase account registration process where I create a user in CollectionA and then add another related document to CollectionB. When I create the user I embed the details of the CollectionB document in CollectionA in case the second write fails. Later I will write a process that removes the embedded data from CollectionA if the document in CollectionB exists
Not having transactions does cause pain points like this, but I think in some cases there are new ways of thinking about it. In my case, time will tell as I progress with my app

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse