MongoDb - Embedding or reference? - mongodb

I come from SQL relation world, and have a question that involves mongodb schema design.
The reality that i need to represent contains:
Users, and monthly reports (with multiple daily reports).
I want to figure out if in mongodb, is better to embedded reports objects into the Users collection, or to have 2 separate collection referenced by id.
Embedded solution:
User:{
name:
surname:
monthlyReports: [{
month: "January 2014"
dailyReport: [{
day: 1
singleReport: [
{ report1}, {report2}, ...
]
}, {
day: 2
singleReport: [
{ report1}, {report2}, ...
]
}
]
},
{
/*
february 2014
Day 1
Day 2 ...
*/
} ...
]
}
Referenced solution:
Users:{
name:
surname:
monthlyReports: [
id_reportMonth1, id_reportMonth2, ...
]
}
MonthlyReport: {
id:
month:
dailyReport: [{
day: 1
singleReport: [
{ report1 }, { report2 } ...
]
},
{
} ....
]
}
For single user, i need to retrive single daily report, monthly report, and total report.
I think that in embedded solution, querying is more simple, but create large object in a long period.
Another possibility:
Create 3 referenced collection:
User, monthlyReport, dailyReport.
What is the better way to do it?
Someone has suggested?

Mongo wrote an excellent 3 parts blog post about it: http://blog.mongodb.org/post/87200945828/6-rules-of-thumb-for-mongodb-schema-design-part-1
Bottom line: it depends on the use.
You need to consider two factors (ref: mongodb blog post) :
Will the entities on the “N” side of the One-to-N ever need to stand alone?
What is the cardinality of the relationship: is it one-to-few; one-to-many; or one-to-squillions?
Based on these factors, you can pick one of the three basic One-to-N schema designs:
Embed the N side if the cardinality is one-to-few and there is no
need to access the embedded object outside the context of the parent
object
Use an array of references to the N-side objects if the cardinality
is one-to-many or if the N-side objects should stand alone for any
reasons
Use a reference to the One-side in the N-side objects if the
cardinality is one-to-squillions
Hope this helps.

Related

MongoDB , how to choose the right design?

I have to create a collection of document and I have a doubt about the right design.
Each document is an "identity"; each Identity has a list of "partner Data"; each partner data are defined by an ID and a set of Data.
One approach can be (1):
{
_id: ...
partners: [
{
id: partner1,
data: {
}
},
{
id: partner2,
data: {
}
},
]
}
Another approach can be (2)
{
_id: ...
partners: {
partner1: {
data: {
}
},
partner2: {
data: {
}
},
]
}
I prefer the first one, but considering that I could have million of these identities, which could be the most performed schema?
A typical query can be: "how many identities have partner with ID N".
With the second example, a query can be:
db.identities.find({partner.partnerName: {$exists:true}})
With first approach, how can I get this count?
The second solution is more easy to handle Server Side; each document will have a list where each KEY is the partner ID, so instead of scan all document, I can simply get partner data by key...
What do you think about these solutions? I prefer the first one but the second I think that is more "usable"...
Thanks
I prefer the first one, but considering that I could have million of these identities, wich could be the most performed schema?
If you going to have millions of identities, then both approaches
are not really scalable .
Each document in mongo has a size limit (16MB) (read about it here)
In case you are going to have really lot's of identities ,
the scalable approach would be to create a different collection,
only for the relations and partnership data.
Now , I also want you to consider how you treat "partnership",
if I'm a user and I got you on my partners list , will you see me as a partner on your list ?
In case we both see each other as partners , then mongo-db may not be the best solution. graph db's are more appropriate for dealing with relations of this type.
All solutions within mongo for two-ways relations will be built on double updates (Your id on my partner's list , My id on your partner's list).
(In SQL you could add an extra condition for joining but not in mongo) ,
so you don't need to save twice partnerships . (me and you , you and me)
just you and me.
Do you see where is going this way ?
If you need to go only in one way,
Then just create a second collection , "partnerships" ,
{
_id: should be uniqe,
user_id: 'your_id',
partner_id: 'his_id'
data: {} or just flatten the fields into the root object.
}
Please notice that you create a row for each partnership !
Then you could use $lookup in order to query for a user with all of his
partners .
something like:
db.getCollection('partners').aggregate([
{
$lookup: {
from: 'parterships',
localField: '_id',
foreignField: 'user_id',
as: 'partners'
}
},
{
$project: {
name: 1,
partners: 1,
num_partners: { $size: "$partners" }
}
}
])
Read more about the aggregation stages here.
In case you are not going to have lot's of partnership's Then please continue with
your first approach which is good .
The second approach will make most queries to this collection pretty weird and you will always have to write code in order to query this table .
It won't be "straight forward" mongo queries .

aggregating and sorting based on a Mongodb Relationship

I'm trying to figure out if what I want to do is even possible in Mongodb. I'm open to all sorts of suggestions regarding more appropriate ways to achieve what I need.
Currently, I have 2 collections:
vehicles (Contains vehicle data such as make and model. This data can be highly unstructured, which is why I turned to Mongodb for this)
views (Simply contains an IP, a date/time that the vehicle was viewed and the vehicle_id. There could be thousands of views)
I need to return a list of vehicles that have views between 2 dates. The list should include the number of views. I need to be able to sort by the number of views in addition to any of the usual vehicle fields. So, to be clear, if a vehicle has had 1000 views, but only 500 of those between the given dates, the count should return 500.
I'm pretty sure I could perform this query without any issues in MySQL - however, trying to store the vehicle data in MySQL has been a real headache in the past and it has been great moving to Mongo where I can add new data fields with ease and not worry about the structure of my database.
What do you all think?? TIA!
As it turns out, it's totally possible. It took me a long while to get my head around this, so I'm posting it up for future google searches...
db.statistics.aggregate({
$match: {
branch_id: { $in: [14] }
}
}, {
$lookup: {
from: 'vehicles', localField: 'vehicle_id', foreignField: '_id', as: 'vehicle'
}
}, {
$group: {
_id: "$vehicle_id",
count: { $sum: 1 },
vehicleObject: { $first: "$vehicle" }
}
}, { $unwind: "$vehicleObject" }, {
$project: {
daysInStock: { $subtract: [ new Date(), "$vehicleObject.date_assigned" ] },
vehicleObject: 1,
count: 1
}
}, { $sort: { count: -1 } }, { $limit: 10 });
To explain the above:
The Mongodb aggregate framework is the way forward for complex queries like this. Firstly, I run a $match to filter the records. Then, we use $lookup to grab the vehicle record. Worth mentioning here that this is a Many to One relationship here (lots of stats, each having a single vehicle). I can then group on the vehicle_id field, which will enable me to return one record per vehicle with a count of the number of stats in the group. As it is a group, we technically have lots of copies of that same vehicle document now in each group, so I then add just the first one into the vehicleObject variable. This would be fine, but $first tends to return an array with a single entry (pointless in my opinion), so I added the $unwind stage to pull the actual vehicle out. I then added a $project stage to calculate an additional field, sorted by the count descending and limited the results to 10.
And take a breath :)
I hope that helps someone. If you know of a better way to do what I did, then I'm open to suggestions to improve.

Mongo error 16996 during aggregation - too large document produced

I am parsing Wikipedia dumps in order to play with the link-oriented metadata. One of the collections is named articles and it is in the following form:
{
_id : "Tree",
id: "18955875",
linksFrom: " [
{
name: "Forest",
count: 6
},
[...]
],
categories: [
"Trees",
"Forest_ecology"
[...]
]
}
The linksFrom field stores all articles this article points to, and how many times that happens. Next, I want to create another field linksTo with all the articles that point to this article. In the beginning, I went through the whole collection and updated every article, but since there's lots of them it takes too much time. I switched to aggregation for performance purposes and tried it on a smaller set - works like a charm and is super fast in comparison with the older method. The aggregation pipeline is as follows:
db.runCommand(
{
aggregate: "articles",
pipeline : [
{
$unwind: "$linksFrom"
},
{
$sort: { "linksFrom.count": -1 }
},
{
$project:
{
name: "$_id",
linksFrom: "$linksFrom"
}
},
{
$group:
{
_id: "$linksFrom.name",
linksTo: { $push: { name: "$name", count: { $sum : "$linksFrom.count" } } },
}
},
{
$out: "TEMPORARY"
}
] ,
allowDiskUse: true
}
)
However, on a large dataset being the english Wikipedia I get the following error after a few minutes:
{
"ok" : 0,
"errmsg" : "insert for $out failed: { connectionId: 24, err: \"BSONObj size: 24535193 (0x1766099) is invalid. Size must be between 0 and 16793600(16MB) First element: _id: \"United_States\"\", code: 10334, n: 0, ok: 1.0 }",
"code" : 16996
}
I understand that there are too many articles, which link to United_States article and the corresponding document's size grows above 16MB, currently almost 24MB. Unfortunately, I cannot even check if that's the case (error messages sometimes tend to lie)... Because of that, I'm trying to change the model so that the relationship between articles is stored with IDs rather than long names but I'm afraid that might not be enough - especially because my plan is to merge the two collections for every article later...
The question is: does anyone have a better idea? I don't want to try to increase the limit, I'm rather thinking about a different approach of storing this data in the database.
UPDATE after comment by Markus
Markus is correct, I am using a SAX parser and, as a matter of fact, I'm already storing all the links in a similar way. Apart from articles I have three more collections - one with links and two others, labels and stemmed-labels. The first one stores all links that occur in the dump in the following way:
{
_id : "tree",
stemmedName: "tree",
targetArticle: "Christmas_tree"
}
_id stores the text that is used to represent a given link, stemmedName represents stemmed _id and targetArticle marks what article this text pointed to. I'm in the middle of adding sourceArticle to this one, because it's obviously a good idea.
The second collection labels contains documents as follows:
{
_id : "tree",
targetArticles: [
{
name: "Christmas_tree",
count: 1
},
{
name: "Tree",
count: 166
}
[...]
]
}
The third stemmed-labels is analogous to the labels with its _id being a stemmed version of the root label.
So far, the first collection links serves as a baseline for the two other collections. I group the labels together by their name so that I only do one lookup for every phrase and then I can immiedately get all target articles with one query. Then I use the articles and labels collections in order to:
Look for label with a given name.
Get all articles it might
point to.
Compare the incoming and outcoming links for these
articles.
This is where the main question comes. I thought that it's better if I store all possible articles for a given phrase in one document rather than leave them scattered in the links collection. Only now did it occur to me, that - as long as the lookups are indexed - the overall performance might be the same for one big document or many smaller ones! Is this a correct assumption?
I think your data model is wrong. It may well be (albeit a bit theoretical) that individual articles (let's stick with the wikipedia example) are linked more often than you could store in a document. Embedding only works with One-To(-Very)-Few™ relationships.
So basically, I think you should change your model. I will show you how I would do it.
I will use the mongoshell and JavaScript in this example, since it is the lingua franca. You might need to translate accordingly.
The questions
Lets begin with the questions you want to have answered:
For a given article, which other articles link to that article?
For a given article, to which other articles does that article link to?
For a given article, how many articles link to it?
Optional: For a given article, to how many articles does it link to?
The crawling
What I would do basically is to implement a SAX parser on the articles, creating a new document for each article link you encounter. The document itself should be rather simple:
{
"_id": new ObjectId(),
// optional, for recrawling or pointing out a given state
"date": new ISODate(),
"article": wikiUrl,
"linksTo": otherWikiUrl
}
Note that you should not do an insert, but an upsert. The reason for this is that we do not want to document the number of links, but the articles linked to. If we did an insert, the same combination of article and linksTocould occur multiple times.
So our statement when encountering a link would look like this for example:
db.links.update(
{ "article":"HMS_Warrior_(1860)", "linksTo":"Royal_Navy" },
{ "date": new ISODate(), "article":"HMS_Warrior_(1860)", "linksTo":"Royal_Navy" },
{ upsert:true }
)
Answering the questions
As you might already guess, answering the questions becomes pretty straightforward now. I have use the following statements for creating a few documents:
db.links.update(
{ "article":"HMS_Warrior_(1860)", "linksTo":"Royal_Navy" },
{ "date": new ISODate(), "article":"HMS_Warrior_(1860)", "linksTo":"Royal_Navy" },
{ upsert:true }
)
db.links.update(
{ "article":"Royal_Navy", "linksTo":"Mutiny_on_the_Bounty" },
{ "date":new ISODate(), "article":"Royal_Navy", "linksTo":"Mutiny_on_the_Bounty" },
{ upsert:true }
)
db.links.update(
{ "article":"Mutiny_on_the_Bounty", "linksTo":"Royal_Navy"},
{ "date":new ISODate(), "article":"Mutiny_on_the_Bounty", "linksTo":"Royal_Navy" },
{ upsert:true }
)
For a given article, which other articles link to that article?
We found out that we should not use an aggregation, since that might exceed the size limit. But we don't have to. We simply use a cursor and gather the results:
var toLinks =[]
var cursor = db.links.find({"linksTo":"Royal_Navy"},{"_id":0,"article":1})
cursor.forEach(
function(doc){
toLinks.push(doc.article);
}
)
printjson(toLinks)
// Output: [ "HMS_Warrior_(1860)", "Mutiny_on_the_Bounty" ]
For a given article, to which other articles does that article link to?
This works pretty much like the first question – we basically only change the query:
var fromLinks = []
var cursor = db.links.find({"article":"Royal_Navy"},{"_id":0,"linksTo":1})
cursor.forEach(
function(doc){
fromLinks.push(doc.linksTo)
}
)
printjson(fromLinks)
// Output: [ "Mutiny_on_the_Bounty" ]
For a given article, how many articles link to it?
It should be obvious that in case you already have answered question 1, you could simply check toLinks.length. But let's assume you haven't. There are two other ways of doing this
Using .count()
You can use this method on replica sets. On sharded clusters, this doesn't work well. But it is easy:
db.links.find({ "linksTo":"Royal_Navy" }).count()
// Output: 2
Using an aggregation
This works on any environment and isn't much more complicated:
db.links.aggregate([
{ "$match":{ "linksTo":"Royal_Navy" }},
{ "$group":{ "_id":"$linksTo", "isLinkedFrom":{ "$sum":1 }}}
])
// Output: { "_id" : "Royal_Navy", "isLinkedFrom" : 2 }
Optional: For a given article, to how many articles does it link to?
Again, you can answer this question by reading the length of the array from question 2 of use the .count()method. The aggregation again is simple
db.links.aggregate([
{ "$match":{ "article":"Royal_Navy" }},
{ "$group":{ "_id":"$article", "linksTo":{ "$sum":1 }}}
])
// Output: { "_id" : "Royal_Navy", "linksTo" : 1 }
Indices
As for the indices, I haven't really checked them, but individual indices on the fields is probably what you want:
db.links.createIndex({"article":1})
db.links.createIndex({"linksTo":1})
A compound index will not help much, since order matters and we do no always ask for the first field. So this is probably as optimized as it can get.
Conclusion
We are using an extremely simple, scalable model and rather simple queries and aggregations to get the questions answered you have to the data.

how do I do 'not-in' operation in mongodb?

I have two collections - shoppers (everyone in shop on a given day) and beach-goers (everyone on beach on a given day). There are entries for each day, and person can be on a beach, or shopping or doing both, or doing neither on any day. I want to now do query - all shoppers in last 7 days who did not go to beach.
I am new to Mongo, so it might be that my schema design is not appropriate for nosql DBs. I saw similar questions around join and in most cases it was suggested to denormalize. So one solution, I could think of is to create collection - activity, index on date, embed actions of user. So something like
{
user_id
date
actions {
[action_type, ..]
}
}
Insertion now becomes costly, as now I will have to query before insert.
A few of suggestions.
Figure out all the queries you'll be running, and all the types of data you will need to store. For example, do you expect to add activities in the future or will beach and shop be all?
Consider how many writes vs. reads you will have and which has to be faster.
Determine how your documents will grow over time to make sure your schema is scalable in the long term.
Here is one possible approach, if you will only have these two activities ever. One record per user per day.
{ user: "user1",
date: "2012-12-01",
shopped: 0,
beached: 1
}
Now your query becomes even simpler, whether you have two or ten activities.
When new activity comes in you always have to update the correct record based on it.
If you were thinking you could just append a record to your collection indicating user, date, activity then your inserts are much faster but your queries now have to do a LOT of work querying for both users, dates and activities.
With proposed schema, here is the insert/update statement:
db.coll.update({"user":"username", "date": "somedate"}, {"shopped":{$inc:1}}, true)
What that's saying is: "for username on somedate increment their shopped attribute by 1 and create it if it doesn't exist aka "upsert" (that's the last 'true' argument).
Here is the query for all users on a particular day who did activity1 more than once but didn't do any of activity2.
db.coll.find({"date":"somedate","shopped":0,"danced":{$gt:1}})
Be wary of picking a schema where a single document can have continuous and unbounded growth.
For example, storing everything in a users collection where the array of dates and activities keeps growing will run into this problem. See the highlighted section here for explanation of this - and keep in mind that large documents will keep getting into your working data set and if they are huge and have a lot of useless (old) data in them, that will hurt the performance of your application, as will fragmentation of data on disk.
Remember, you don't have to put all the data into a single collection. It may be best to have a users collection with a fixed set of attributes of that user where you track how many friends they have or other semi-stable information about them and also have a user_activity collection where you add records for each day per user what activities they did. The amount or normalizing or denormalizing of your data is very tightly coupled to the types of queries you will be running on it, which is why figure out what those are is the first suggestion I made.
Insertion now becomes costly, as now I will have to query before insert.
Keep in mind that even with RDBMS, insertion can be (relatively) costly when there are indices in place on the table (ie, usually). I don't think using embedded documents in Mongo is much different in this respect.
For the query, as Asya Kamsky suggest you can use the $nin operator to find everyone who didn't go to the beach. Eg:
db.people.find({
actions: { $nin: ["beach"] }
});
Using embedded documents probably isn't the best approach in this case though. I think the best would be to have a "flat" activities collection with documents like this:
{
user_id
date
action
}
Then you could run a query like this:
var start = new Date(2012, 6, 3);
var end = new Date(2012, 5, 27);
db.activities.find({
date: {$gte: start, $lt: end },
action: { $in: ["beach", "shopping" ] }
});
The last step would be on your client driver, to find user ids where records exist for "shopping", but not for "beach" activities.
One possible structure is to use an embedded array of documents (a users collection):
{
user_id: 1234,
actions: [
{ action_type: "beach", date: "6/1/2012" },
{ action_type: "shopping", date: "6/2/2012" }
]
},
{ another user }
Then you can do a query like this, using $elemMatch to find users matching certain criteria (in this case, people who went shopping in the last three days:
var start = new Date(2012, 6, 1);
db.people.find( {
actions : {
$elemMatch : {
action_type : { $in: ["shopping"] },
date : { $gt : start }
}
}
});
Expanding on this, you can use the $and operator to find all people went shopping, but did not go to the beach in the past three days:
var start = new Date(2012, 6, 1);
db.people.find( {
$and: [
actions : {
$elemMatch : {
action_type : { $in: ["shopping"] },
date : { $gt : start }
}
},
actions : {
$not: {
$elemMatch : {
action_type : { $in: ["beach"] },
date : { $gt : start }
}
}
}
]
});

MongoDB many-to-many search

this is my collections: (many-to-many)
actors:
{
_id: 1,
name: "Name 1"
}
movies:
{
_id: 1,
name: "The Terminator",
production_year: 1984,
actors: [
{
actors_id: 1,
role_id : 1
},
{
actors_id: 2,
role_id : 1
}
]
}
I can't get a list of actors for some movie
it is not a problem when I have this:
{
_id: 1,
name: "The Terminator",
production_year: 1984,
actors: [1,2,3,4,5] (actors id's)
}
var a = db.movies.findOne(name:"The Terminator").actors
db.actors.find({"_id":{$in:a}})
but, how can I make it with this above structure:
if, I do this var a = db.movies.findOne(name:"The Terminator").actors
it returns me this:
[
{
actors_id: 1,
role_id : 1
},
{
actors_id: 2,
role_id : 1
}
]
How do I get only this in array [1,2] (actors_id) to get the names of actors (with $in)
Thanks,
Zoran
You don't. Within MongoDB you always query for documents so you have to make sure your schema is such that you can get all the information you need by querying for specific documents. There is no join/view like functionality in MongoDB.
Denormalization is usually the most appropriate choice in such cases. Your schema looks like it's designed for a traditional relational database and you will have to try and let go of some of the schema design principles that come with relational data.
Specifically for your example you could add the actor name to the embedded array so you have that information after querying for the movie.
Finally, consider if you're using the right tool for what you need to do. Too often people think of MongoDB is a "fast MySQL" which is entirely wrong. Document databases are very different to RDBMS and even k/v stores. If you have a lot of related data use an RDBMS.
variable a in db.movies.findOne(name:"The Terminator").actors is an array of documents, so you'd have to make it an array of integers (ids)