I playing with the best way to model mongodb documents
I am modelling a school.
A Student has many subjects.
Student{
subjects:[ {name:'',
level:'',
short name:''
},
{...},
{...}]
}
Decided to denormalise and embed subjects into students for performance.
There are rare cases where a subject needs to be queried and updated.
subjects.all
subject1.short_name = 'something new'
I know I will have to iterate through every student to update every subject reocrd.
However whast the best way to return all unique subjects?
Can you do a unique search of student.subjects names for example?
Or is it better to have another collection which is
Subjects{
name:'',
level:'',
short name:''
}
I still keep the denormalised Student.subject. But this is simply there for quering all the subjects on offer.
An updated would update this + every embeded Student.subject?
Any suggestions/recommendations?
However whast the best way to return all unique subjects?
This is a short fall of your schema here. You traded the ability to do this kind of thing easily in return for other speed benefits that you would do more often.
Currently the only real way is to either use the distinct() command ( http://docs.mongodb.org/manual/reference/method/db.collection.distinct/ ):
db.students.distinct('subjects.name');
or the aggregation framework:
db.students.aggregate([
{$unwind:'$subjects'},
{$group:{_id:'$subjects.name'}}
])
Like so.
As for schema recommendation, if you intend to make this kind of query often then I would factor out subjects into a separate collection.
Related
Say I have two documents
/*1*/
{
"id":1,
"name":"natty",
"subject_enrolled":"english"
}
/*2*/
{
"id":2,
"name":"natty",
"subject_enrolled":"science"
}
Ideally, it should have been same document, with subject_enrolled being an array having both subjects. But for some reason, I maintained my data flat like this.
Now, I want to write a query which will retrieve all students who have enrolled for both "english" and "science".
I tried the below query:
db.students.find({"subject_enrolled":{"$in":["science", "english"]}})
But that is wrong, coz if any student who registered for only science will also be in the result. I cannot use "$all", as both science and english are in two different documents.
Is there a way to achieve this easily and effectively?
All I could think of now is to use an aggregate.
db.students.aggregate([{"$group":{"_id":"$name", "subject_enrolled":{"$addToSet":"$subject_enrolled"}}}, {"$match":{"subject_enrolled":{"$all":["english", "science"]}}}])
This satisfies my condition perfectly.
But I am worried about performance only. If I have say 10,000 documents, worst case I have all documents are for individual students and the querying Param is also a single value, then I will be aggregating for no good!
Mongo experts, please share your views on my situation.
I've read a lot of documentation and examples here in Stackoverflow but I'm not really sure about my conclusions so this is why I'm askingfor help.
Imagine we have a collection Films and a collection Users and we want to know, which users have seen a film, and which films has seen an user.
One way to design this in MongoDb is:
User:
{
"name":"User1",
"films":[filmId1, filmId2, filmId3, filmId4] //ObjectIds from Films
}
Film:
{
"name": "The incredible MongoDb Developer",
"watched_by": [userId1, userId2, userId3] //ObjectsIds from User
}
Ok, this may work if the amount of users/films is low, but for example if we expect that one film will have a 800k users the size of the array will be near to: 800k * 12 bytes ~ 9.5MB which is nearly to the 16MB max for a BSON file.
In this case, there are other approach than the typical relational-world way that is create an intermediate collection for the relations?
Also I don't know if read and parse a JSON about 10MB will have a better performance in comparison with the classic relational way.
Thank you
For films, if you include the viewers, you might eventually hit the 16MB size limit of BSON documents, as you correctly stated.
Putting the films a user has seen into an array is a viable way, depending on your use cases. Especially if you want to have relations with attributes (say date and place of viewing), doing updates and statistical analysis becomes less performant (you would need to $unwind your docs first, subsequent $matches become more costly and whatnot).
If your relations have or may have attributes, I'd go with what you describe as the classical relational way, since it answers your most likely use cases as good as embedding and allow for higher performance from my experience:
Given a collection with a structure like
{
_id: someObjectId,
date: ISODate("2016-05-05T03:42:00Z"),
movie: "nameOfMovie",
user: "username"
}
You have everything at hand to answer the following sample questions easily:
For a given user, which movies has he seen in the last 3 month, in descending order of date?
db.views.aggregate([
{$match:{user:userName, date:{$gte:threeMonthAgo}}},
{$sort:{date:-1}},
{$group:{_id:"$user",viewed:{$push:{movie:"$movie",date:"$date"}}}}
])
or, if you are ok with an iterator, even easier with:
db.views.find({user:username, date:{$get:threeMonthAgo}}).sort({date:-1})
For a given movie, how many users have seen it on May 30th this year?
db.views.aggregate([
{$match:{
movie:movieName,
date{
$gte:ISODate("2016-05-30T00:00:00"),
$lt:ISODate("2016-05-31T00:00:00")}
}},
{$group:{
_id: "$movie",
views: {$sum:1}
}}
])
The reason why I use an aggregation here instead of a .count() on the result is SERVER-3645
For a given movie, show all users which have seen it.
db.views.find({movie:movieName},{_id:0,user:1})
There is a thing to note: Since we used the usernames and movie names, respectively, we do not need a JOIN (or something similar), which should give us good performance. Plus we do not have to do rather costly update operations when adding entries. Instead of an update, we simply insert the data.
I wonder if there is a way to count a collection that exists in the model I want to query. I tried this:
Event.find({ limit: { '>': attenders.length }}).limit(5).populateAll().exec(function(err, events) {
});
Because I just want to get events where the number of attending persons is less than the limit. This does not work, but is there a similar way to solve the problem?
There isn't a way to do this directly with Waterline, since association information is not stored in the same collection as the model. You'll have to count the attenders in a separate query, if possible.
A better way around this would be to maintain an attender count in the Event model itself and update it when someone joins or leaves an Event.
Here's the deal. Let's suppose we have the following data schema in MongoDB:
items: a collection with large documents that hold some data (it's absolutely irrelevant what it actually is).
item_groups: a collection with documents that contain a list of items._id called item_groups.items plus some extra data.
So, these two are tied together with a Many-to-Many relationship. But there's one tricky thing: for a certain reason I cannot store items within item groups, so -- just as the title says -- embedding is not the answer.
The query I'm really worried about is intended to find some particular groups that contain some particular items (i.e. I've got a set of criteria for each collection). In fact it also has to say how much items within each found group fitted the criteria (no items means group is not found).
The only viable solution I came up with this far is to use a Map/Reduce approach with a dummy reduce function:
function map () {
// imagine that item_criteria came from the scope.
// it's a mongodb query object.
item_criteria._id = {$in: this.items};
var group_size = db.items.count(item_criteria);
// this group holds no relevant items, skip it
if (group_size == 0) return;
var key = this._id.str;
var value = {size: group_size, ...};
emit(key, value);
}
function reduce (key, values) {
// since the map function emits each group just once,
// values will always be a list with length=1
return values[0];
}
db.runCommand({
mapreduce: item_groups,
map: map,
reduce: reduce,
query: item_groups_criteria,
scope: {item_criteria: item_criteria},
});
The problem line is:
item_criteria._id = {$in: this.items};
What if this.items.length == 5000 or even more? My RDBMS background cries out loud:
SELECT ... FROM ... WHERE whatever_id IN (over 9000 comma-separated IDs)
is definitely not a good way to go.
Thank you sooo much for your time, guys!
I hope the best answer will be something like "you're stupid, stop thinking in RDBMS style, use $its_a_kind_of_magicSphere from the latest release of MongoDB" :)
I think you are struggling with the separation of domain/object modeling from database schema modeling. I too struggled with this when trying out MongoDb.
For the sake of semantics and clarity, I'm going to substitute Groups with the word Categories
Essentially your theoretical model is a "many to many" relationship in that each Item can belong Categories, and each Category can then possess many Items.
This is best handled in your domain object modeling, not in DB schema, especially when implementing a document database (NoSQL). In your MongoDb schema you "fake" a "many to many" relationship, by using a combination of top-level document models, and embedding.
Embedding is hard to swallow for folks coming from SQL persistence back-ends, but it is an essential part of the answer. The trick is deciding whether or not it is shallow or deep, one-way or two-way, etc.
Top Level Document Models
Because your Category documents contain some data of their own and are heavily referenced by a vast number of Items, I agree with you that fully embedding them inside each Item is unwise.
Instead, treat both Item and Category objects as top-level documents. Ensure that your MongoDb schema allots a table for each one so that each document has its own ObjectId.
The next step is to decide where and how much to embed... there is no right answer as it all depends on how you use it and what your scaling ambitions are...
Embedding Decisions
1. Items
At minimum, your Item objects should have a collection property for its categories. At the very least this collection should contain the ObjectId for each Category.
My suggestion would be to add to this collection, the data you use when interacting with the Item most often...
For example, if I want to list a bunch of items on my web page in a grid, and show the names of the categories they are part of. It is obvious that I don't need to know everything about the Category, but if I only have the ObjectId embedded, a second query would be necessary to get any detail about it at all.
Instead what would make most sense is to embed the Category's Name property in the collection along with the ObjectId, so that pulling back an Item can now display its category names without another query.
The biggest thing to remember is that the key/value objects embedded in your Item that "represent" a Category do not have to match the real Category document model... It is not OOP or relational database modeling.
2. Categories
In reverse you might choose to leave embedding one-way, and not have any Item info in your Category documents... or you might choose to add a collection for Item data much like above (ObjectId, or ObjectId + Name)...
In this direction, I would personally lean toward having nothing embedded... more than likely if I want Item information for my category, i want lots of it, more than just a name... and deep-embedding a top-level document (Item) makes no sense. I would simply resign myself to querying the database for an Items collection where each one possesed the ObjectId of my Category in its collection of Categories.
Phew... confusing for sure. The point is, you will have some data duplication and you will have to tweak your models to your usage for best performance. The good news is that that is what MongoDb and other document databases are good at...
Why don't use the opposite design ?
You are storing items and item_groups. If your first idea to store items in item_group entries then maybe the opposite is not a bad idea :-)
Let me explain:
in each item you store the groups it belongs to. (You are in NOSql, data duplication is ok!)
for example, let's say you store in item entries a list called groups and your items look like :
{ _id : ....
, name : ....
, groups : [ ObjectId(...), ObjectId(...),ObjectId(...)]
}
Then the idea of map reduce takes a lot of power :
map = function() {
this.groups.forEach( function(groupKey) {
emit(groupKey, new Array(this))
}
}
reduce = function(key,values) {
return Array.concat(values);
}
db.runCommand({
mapreduce : items,
map : map,
reduce : reduce,
query : {_id : {$in : [...,....,.....] }}//put here you item ids
})
You can add some parameters (finalize for instance to modify the output of the map reduce) but this might help you.
Of course you need to have another collection where you store the details of item_groups if you need to have it but in some case (if this informations about item_groups doe not exist, or don't change, or you don't care that you don't have the most updated version of it) you don't need them at all !
Does that give you a hint about a solution to your problem ?
I've a collection named Events. Each Eventdocument have a collection of Participants as embbeded documents.
Now is my question.. is there a way to query an Event and get all Participants thats ex. Age > 18?
When you query a collection in MongoDB, by default it returns the entire document which matches the query. You could slice it and retrieve a single subdocument if you want.
If all you want is the Participants who are older than 18, it would probably be best to do one of two things:
Store them in a subdocument inside of the event document called "Over18" or something. Insert them into that document (and possibly the other if you want) and then when you query the collection, you can instruct the database to only return the "Over18" subdocument. The downside to this is that you store your participants in two different subdocuments and you will have to figure out their age before inserting. This may or may not be feasible depending on your application. If you need to be able to check on arbitrary ages (i.e. sometimes its 18 but sometimes its 21 or 25, etc) then this will not work.
Query the collection and retreive the Participants subdocument and then filter it in your application code. Despite what some people may believe, this isnt terrible because you dont want your database to be doing too much work all the time. Offloading the computations to your application could actually benefit your database because it now can spend more time querying and less time filtering. It leads to better scalability in the long run.
Short answer: no. I tried to do the same a couple of months back, but mongoDB does not support it (at least in version <= 1.8). The same question has been asked in their Google Group for sure. You can either store the participants as a separate collection or get the whole documents and then filter them on the client. Far from ideal, I know. I'm still trying to figure out the best way around this limitation.
For future reference: This will be possible in MongoDB 2.2 using the new aggregation framework, by aggregating like this:
db.events.aggregate(
{ $unwind: '$participants' },
{ $match: {'age': {$gte: 18}}},
{ $project: {participants: 1}
)
This will return a list of n documents where n is the number of participants > 18 where each entry looks like this (note that the "participants" array field now holds a single entry instead):
{
_id: objectIdOfTheEvent,
participants: { firstName: 'only one', lastName: 'participant'}
}
It could probably even be flattened on the server to return a list of participants. See the officcial documentation for more information.