How to mapreduce on key from another collection - mongodb

Say I have a collection of users like this:-
{
    "_id" : "1234",
    "Name" : "John",
    "OS" : "5.1",
    "Groups" : [{
        "_id" : "A",
        "Name" : "Group A"
    }, {
        "_id" : "C",
        "Name" : "Group C"
    }]
}
And I have a collection of events like this:-
{
    "_id" : "15342",
    "Event" : "VIEW",
    "UserId" : "1234"
}
I'm able to use mapreduce to work out the count of events per user, since I can just emit the "UserId" and count from that. However, what I want to do now is count events by group.
If I had a "Groups" array in my event document then this would be easy. However, I don't, and this is only an example; the actual application is much more complicated and I don't want to replicate all that data into the event document.
I've seen an example at http://tebros.com/2011/07/using-mongodb-mapreduce-to-join-2-collections/ but I can't see how that applies in this situation, as it is aggregating values from two places... all I really want to do is perform a lookup.
In SQL I would simply JOIN my flattened UserGroup table to the event table and just GROUP BY UserGroup.GroupName
I'd be happy with multiple passes of mapreduce... a first pass to count by UserId into something like { "_id" : "1234", "count" : 9 }, but I get stuck on the next pass... how to include the group id?
Some potential approaches I've considered:-
Include group info in the event document (not feasible)
Work out how to "join" the user collection or look-up the users groups from within the map function so I can emit the group id's as well (don't know how to do this)
Work out how to "join" the event and user collections into a third collection I can run mapreduce over
What is possible and what are the benefits/issues with each approach?

Your third approach is the way to go:
Work out how to "join" the event and user collections into a third collection I can run mapreduce over
To do this you'll need to create a new collection J with the "joined" data you need for map-reduce. There are several strategies you can use for this:
Update your application to insert/update J in the normal course of business. This is best in the case where you need to run MR very frequently and with up-to-date data. It can add substantially to code complexity. From an implementation standpoint, you can do this either directly (by writing to J) or indirectly (by writing changes to a log collection L and then applying the "new" changes to J). If you choose the log collection approach you'll need a strategy for determining what's changed. There are two common ones: high-watermark (based on _id or a timestamp) and using the log collection as a queue with the findAndModify command.
Create/update J in batch mode. This is the way to go in the case of high-performance systems where the multiple updates from the above strategy would affect performance. This is also the way to go if you do not need to run the MR very frequently and/or you do not have to guarantee up-to-the-second data accuracy.
If you go with (2) you will have to iterate over documents in the collections you need to join--as you've figured out, Mongo map-reduce won't help you here. There are many possible ways to do this (a shell sketch of the overall batch approach follows the list below):
If you don't have many documents and if they are small, you can iterate outside of the DB with a direct connection to the DB.
If you cannot do (1) you can iterate inside the DB using db.eval(). If the number of documents is not small, make sure to use nolock: true as db.eval is blocking by default. This is typically the strategy I choose as I tend to deal with very large document sets and I cannot afford to move them over the network.
If you cannot do (1) and do not want to do (2) you can clone the collections to another node with a temporary DB. Mongo has a convenient cloneCollection command for this. Note that this does not work if the DB requires authentication (don't ask why; it's a strange 10gen design choice). In that case you can use mongodump and mongorestore. Once you have the data local to a new DB you can party on it as you see fit. Once you complete the MR you can update the result collection in your production DB. I use this strategy for one-off map-reduce operations with heavy pre-processing so as to not load the production replica sets.
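For illustration, here is a minimal shell sketch of strategy (2) using the collections from the question; the joined collection name events_joined and the output name events_per_group are made up for the example:
// Batch-build a joined collection by looking up each event's user and
// copying the user's Groups onto the event.
db.events.find().forEach(function (event) {
    var user = db.users.findOne({ _id: event.UserId });
    if (user) {
        db.events_joined.save({
            _id: event._id,
            Event: event.Event,
            UserId: event.UserId,
            Groups: user.Groups
        });
    }
});
// Map-reduce over the joined collection, emitting one count per group id.
db.events_joined.mapReduce(
    function () {
        this.Groups.forEach(function (g) { emit(g._id, 1); });
    },
    function (key, values) { return Array.sum(values); },
    { out: "events_per_group" }
);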
Good luck!

Related

mongodb - combining multiple collections, but each individual query might not need all columns

Here is the scenario :
We have 2 tables (issues, anomalies) in BigQuery, which we plan to combine into a single document in MongoDB, since the 2 collections (issues, anomalies) are data about a particular site.
[
    {
        "site": "abc",
        "issues": {
            --- issues data --
        },
        "anomalies": {
            -- anomalies data --
        }
    }
]
There are some queries which require the 'issues' data, while others require 'anomalies' data.
In the future, we might need to show 'issues' & 'anomalies' data together, which is the reason why I'm planning to combine the two in a single document.
Questions on the approach above, wrt performance/volume of data read:
When we read the combined document, is there a way to read only specific columns (so the data volume read is not huge) ?
Or does this mean that when we read the document, the entire document is loaded in memory ?
Pls let me know.
tia!
UPDATE :
Going over the MongoDB docs, we can use projections to pull only the required data from MongoDB documents.
Also, in this case the data that is transferred over the network is only the specific fields that are read.
However, the MongoDB server will still have to load the document and select the specific fields from it.
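For reference, a projection in the shell might look like this; the collection name sites is just an assumption, while site/issues come from the example document above:
// Return only the "issues" sub-document for one site; _id is excluded explicitly.
db.sites.find(
    { site: "abc" },
    { issues: 1, _id: 0 }
)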

Mongodb - expire subset of data in collection

I am wondering what is the best way to expire only a subset of a collection.
In one collection I store conversion data and click data.
The click data I would like to store for, let's say, a week,
and the conversion data for a year.
In my collection "customers" I store something like:
{ "_id" : ObjectId("53f5c0cfeXXXXXd"), "appid" : 2, "action" : "conversion", "uid" : "2_b2f5XXXXXX3ea3", "iid" : "2_2905040001", "t" : ISODate("2014-07-18T15:01:00.001Z") }
And
{ "_id" : ObjectId("53f5c0cfe4b0d9cd24847b7d"), "appid" : 2, "action" : "view", "uid" : "2_b2f58679e6f73ea3", "iid" : "2_2905040001", "t" : ISODate("2014-07-18T15:01:00.001Z") }
for the click data
So should I execute an ensureIndex or something like a cronjob?
Thank you in advance
There are a couple of built in techniques you can use. The most obvious is a TTL collection which will automatically remove documents based on a date/time field. The caveat here is that for that convenience, you lose some control. You will be automatically doing deletes all the time that you have no control over, and deletes are not free - they require a write lock, they need to be flushed to disk etc. Basically you will want to test to see if your system can handle the level of deletes you will be doing and how it impacts your performance.
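For example, a TTL index on the t field of the customers collection from the question might look like this (the seven-day value is just an example; note that a plain TTL index applies the same expiry to every document in the collection, which is part of the loss of control described above):
// Documents are removed roughly 7 days after their "t" timestamp.
db.customers.ensureIndex({ t: 1 }, { expireAfterSeconds: 60 * 60 * 24 * 7 })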
Another option is a capped collection - capped collections are pre-allocated on disk and don't grow (except for indexes), they don't have the same overheads as TTL deletes do (though again, not free). If you have a consistent insert rate and document size, then you can work out how much space corresponds to the time frame you wish to keep data. Perhaps 20GiB is 5 days, so to be safe you allocate 30GiB and make sure to monitor from time to time to make sure your data size has not changed.
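A capped collection has to be created explicitly with a size in bytes, along these lines (the 30 GiB figure mirrors the example above; the collection name is illustrative):
// Pre-allocate ~30 GiB; once full, the oldest documents are overwritten automatically.
db.createCollection("clicks", { capped: true, size: 30 * 1024 * 1024 * 1024 })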
After that you are into more manual options. For example, you could simply have a field that marks a document as expired or not, perhaps a boolean - that would mean that expiring a document would be an in-place update and about as efficient as you can get in terms of a MongoDB operation. You could then do a batch delete of your expired documents at a quiet time for your system when the deletes and their effect on performance are less of a concern.
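A rough sketch of that flag-and-sweep pattern, assuming an added boolean field named expired (not part of the original documents) and treating the "view" action from the question as the click data:
// In-place update: flag click documents older than a week as expired.
var cutoff = new Date(Date.now() - 7 * 24 * 60 * 60 * 1000);
db.customers.update(
    { action: "view", t: { $lt: cutoff } },
    { $set: { expired: true } },
    false,   // upsert
    true     // multi
);
// Later, at a quiet time for the system, sweep the flagged documents in one batch.
db.customers.remove({ expired: true });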
Another alternative: you could start writing to a new database every X days in a predictable pattern so that your application knows what the name of the current database is and can determine the names of the previous 2. When you create your new database, you delete the one older than the previous two and essentially always just have 3 (sub in numbers as appropriate). This sounds like a lot of work, but the benefit is that the removal of the old data is just a drop database command, which just unlinks/deletes the data files at the OS level and is far more efficient from an IO perspective than randomly removing documents from within a series of large files. This model also allows for a very clean backup model - mongodump the old database, compress and archive, then drop etc.
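A minimal illustration of the rotation idea, with all database and collection names made up; the 7-day period and the retention of the two previous databases follow the description above:
// Name the database after the current 7-day period and drop the one that
// has aged out (older than the previous two periods).
var period = Math.floor(Date.now() / (7 * 24 * 60 * 60 * 1000));
var current = db.getSiblingDB("clicks_" + period);
db.getSiblingDB("clicks_" + (period - 3)).dropDatabase();
current.events.insert({ action: "view", t: new Date() });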
As you can see, there are a lot of trade offs here - you can go for convenience, IO efficiency, database efficiency, or something in between - it all depends on what your requirements are and what fits best for your particular use case and system.

Using Large number of collections in MongoDB

I am considering MongoDB to hold our campaign log data:
{
    "domain" : "",
    "log_time" : "",
    "email" : "",
    "event_type" : "",
    "data" : {
        "campaign_id" : "",
        "campaign_name" : "",
        "message" : "",
        "subscriber_id" : ""
    }
}
The above is our event structure. Each event is associated with one domain; a domain can contain any number of events, and there is no relationship between one domain and another.
Most of our queries are specific to a single domain at a time.
For quick query responses I'm planning to create one collection per domain, so that I can query that domain's collection instead of querying the whole data set containing every domain's data.
We will have at least 100k+ domains in the future, so I would need to create 100k+ collections.
We are expecting 1 million+ documents per collection.
Our main intention is to index only the required collections rather than the whole data set, which is why we are planning one collection per domain.
Which approach is better for my case:
1. Storing all domains' events in one collection, or
2. Each domain's events in a separate collection?
I have seen some questions about the maximum number of collections MongoDB can support, but I didn't get clarity on the topic. As far as I know the default limit of 24k namespaces can be extended, but if I create 100k+ collections, how will performance be affected?
Is this solution (using a very large number of collections) the right approach for my case?
Please advise on my approach, thanks in advance.
Without some hard numbers, this question would probably be just opinion-based.
However, if you do some calculations with the numbers you provided, you will get to a solution.
So your total document count is:
100K collections × 1M documents = 100G (100,000,000,000) documents.
From your document structure, I'm going to do a rough estimate and say that the average size of each document will be 240 bytes (it may well be higher).
Multiplying those two numbers you get ~21.82 TB of data. You can't store that amount of data on just one server, so you will have to split your data across multiple servers.
With this amount of data, your problem isn't anymore one collection vs multiple collections, but rather, how do I store all of this data in MongoDB on multiple servers, so I can efficiently do my queries.
If you have 100K collections, you can probably do some manual work and store e.g. 10 K collections per MongoDB server. But there's a better way.
You can use sharding and let the MongoDB do the hard work of splitting your data across servers. With sharding, you will have one collection for all domains and then shard that collection across multiple servers.
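As a rough sketch (the database name campaigns is an assumption; domain and log_time come from the event structure in the question, and the compound key is only one possible choice):
// Shard a single events collection across the cluster, keyed primarily by domain.
sh.enableSharding("campaigns")
db.getSiblingDB("campaigns").events.ensureIndex({ domain: 1, log_time: 1 })
sh.shardCollection("campaigns.events", { domain: 1, log_time: 1 })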
I would strongly recommend reading all of the documentation on sharding before trying to deploy a system of this size.

Design pattern for directed acyclic graphs in MongoDB

The problem
As usual, the problem is to store a directed acyclic graph in a database. The database choices I had were a relational database like MySQL, or MongoDB. I chose MongoDB because DAGs in relational databases are a mess, but if there is a trick I just didn't find, please tell me.
The goal is to map a DAG into one or multiple MongoDB documents. Because nodes can have multiple children and multiple parents, subdocuments were not an option. I came across multiple design patterns but am not sure which one is the best to go with.
Tree-structure with Ancestors Array
The ancestors array is suggested by the MongoDB docs and is quite easy to understand. As I understand it, my documents would look like this:
{
    "_id" : "root",
    "ancestors" : [ null ],
    "left": 1
}
{
    "_id" : "child1",
    "ancestors" : [ "root" ],
    "left": 2
}
{
    "_id" : "child2",
    "ancestors" : [ "root", "child1" ],
    "left": 1
}
This allows me to find all children of an element like this:
db.Tree.find({ancestors: 'root'}).sort({left: -1})
and all parents like this:
db.Tree.findOne({_id: 'child1'}).ancestors
DBRefs instead of Strings
My second approach would be to replace the string-keys with DBRefs. But except for longer database records I don't see many advantages over the ancestors array.
String-based array with children and parents
The last idea is to store not only the children of each document but its parents as well. This would give me all the features I want. The downside is the massive overhead created by storing every relation twice. Furthermore, I am worried about the amount of administration involved: e.g. if a document gets deleted, I have to check all the others for references in multiple fields.
My Questions
Is MongoDb the right choice over a relational database for this purpose?
Are there any up-/downsides to any of my patterns that I missed?
Which pattern would you suggest and why? Do you maybe have experience with one of them?
Why don't you use a graph database? Check out ArangoDB: you can use documents like in MongoDB, and also graphs. MongoDB is a great database, but it is not built for storing graph-oriented documents; ArangoDB is.
https://www.arangodb.com/

heterogeneous bulk update in mongodb

I know that we can bulk update documents in mongodb with
db.collection.update( criteria, objNew, upsert, multi )
in one db call, but it's homogeneous, i.e. all the affected documents match one set of criteria. What I'd like to do is something like
db.collection.update([{criteria1, objNew1}, {criteria2, objNew2}, ...])
to send multiple update requests that could update completely different documents, or classes of documents, in a single db call.
What I want to do in my app is insert/update a bunch of objects with a compound primary key: if the key already exists, update the document; otherwise insert it.
Can I do all of this in one combined call in MongoDB?
That's two separate questions. To the first one: there is no native MongoDB mechanism to bulk-send criteria/update pairs, although technically doing that in a loop yourself is bound to be about as efficient as any native bulk support would be.
Checking for the existence of a document based on an embedded document (what you refer to as a compound key; in the interest of correct terminology, and to avoid confusion, it's better to use the Mongo name in this case) and inserting/updating depending on that existence check can be done with an upsert:
document A :
{
    _id: ObjectId(...),
    key: {
        name: "Will",
        age: 20
    }
}
db.users.update({name: "Will", age: 20}, {$set: {age: 21}}, true, false)
This upsert (an update that inserts if no document matches the criteria) will do one of two things depending on the existence of document A:
Exists: performs the update $set:{age:21} on the existing document.
Doesn't exist: creates a new document with fields "name" and "age" set to "Will" and 20 respectively (basically the criteria are copied into the new document) and then applies the update ($set:{age:21}). The end result is a document with "name"="Will" and "age"=21.
Hope that helps
We are seeing some benefit from the $in clause.
Our use case was to update the 'status' field in a large number of documents.
In our first cut we were doing a for loop and updating documents one by one, but then we switched to using an $in clause and that made a huge improvement.
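Something along these lines (field and value names are illustrative, apart from status, which is mentioned above):
// One multi-update using $in instead of N single updates in a loop.
var ids = ["a1", "b2", "c3"];   // the batch of _ids to update
db.collection.update(
    { _id: { $in: ids } },
    { $set: { status: "processed" } },
    { multi: true }
);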
There is no real benefit from doing updates the way you suggest.
The reason that there is a bulk insert API and that it is faster is that Mongo can write all the new documents sequentially to memory, and update indexes and other bookkeeping in one operation.
A similar thing happens with updates that affect more than one document: the update will traverse the index only once and update objects as they are found.
Sending multiple criteria with multiple update documents cannot benefit from any of these optimizations. Each criteria/update pair means a separate query, just as if you issued each update separately. The only possible benefit would be sending slightly fewer bytes over the connection. The database would still have to do each query separately and update each document separately.
All that would happen would be that Mongo would queue the updates internally and execute them sequentially (because only one update can happen at any one time), this is exactly the same as if all the updates were sent separately.
It's unlikely that the overhead of sending the queries separately would be significant; Mongo's global write lock will be the limiting factor anyway.