Let's take an example of a chat room.
Should I create 2 collections: Room and Messages, and store the room details (title, description) separately from Messages (body/date/author)? The Messages collection would have a field called "Room" that links to the ObjectId of Room.
OR
Should I create 1 collection, called Room, and store the messages as an array inside each Room document?
What is the best practice? What would you do?
I would lean toward your first choice. Aside from the fact that the 16MB document size limit may be too small (I've seen some pretty busy chat rooms in my days), storing the messages separately allows for greater flexibility on your part. The room doesn't even need to know what messages are associated with it - just create it once and query for messages by your Room id as needed.
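To make the "query for messages by your Room id" idea concrete, here is a minimal sketch using plain in-memory arrays as stand-ins for the two collections. The field names (room, author, body, date) and the helper messagesForRoom are assumptions for illustration, not anything prescribed by MongoDB.

```javascript
// In-memory stand-ins for the two collections; all names are illustrative.
const rooms = [
  { _id: "room1", title: "General", description: "Anything goes" },
  { _id: "room2", title: "Support", description: "Ask for help" },
];

const messages = [
  { _id: "m1", room: "room1", author: "ann", body: "hello", date: new Date("2020-01-01T10:00:00Z") },
  { _id: "m2", room: "room2", author: "bob", body: "help?", date: new Date("2020-01-01T10:01:00Z") },
  { _id: "m3", room: "room1", author: "bob", body: "hi ann", date: new Date("2020-01-01T10:02:00Z") },
];

// The room document never lists its messages; they are found via the room id.
function messagesForRoom(roomId) {
  return messages
    .filter((m) => m.room === roomId)
    .sort((a, b) => a.date - b.date); // oldest first
}

const room1Bodies = messagesForRoom("room1").map((m) => m.body);
console.log(room1Bodies);
```

In a real database the filter would be a query on the room field, which you would probably want to index.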
Schema design questions have been asked numerous times here. Please research them, since the solutions are always the same - you just have to think for a moment and apply them to your own use case:
MongoDB Schema Design - Real-time Chat
Mongodb schema design
MongoDB Schema Design - Many small documents or fewer large documents?
It really should not be that hard replacing X in the answer with Y of your own problem.
In addition: the standard documentation applies as well (and is pretty much explicit on your problem):
http://www.mongodb.org/display/DOCS/Schema+Design
I'd like to use MongoDB to store chat messages as part of a chat application. The database will be used to display chat history to users joining a channel.
I am trying to determine the best way to model this data in the database. The application is a simple chat app which contains numerous channels that users can chat in. Here are a few options I've considered:
A Messages Collection containing a document for every message. This is easy to implement, however with any significant usage many documents would be created.
A Channels Collection containing a document for every channel. This would result in fewer documents. Messages would be stored as an array on a channel document.
Which of these options is preferred, and why? Is there a better option not listed here?
There are many, many ways to go about modeling something like this. There is no generic "best way," as it really depends on how you plan on using the data, how the app is going to function, etc. However, there are a few things to consider with your approach.
First, having a lot of documents is not an issue. That's what Mongo does - it's great at storing lots of documents. I am a strong advocate of modularity, as it makes things more flexible. I reflect that mindset in my database by separating data as much as possible, and then using references to populate as needed.
This means you have to do more population, but in the end it prevents you from pigeonholing yourself into having to do things a certain way.
So for your example in particular, a good way would be to combine what you've mentioned above: Have a Messages collection which creates a document for every message. Then have a Channels collection which stores an array of Message IDs (not the message itself).
Why is this useful? I'm assuming you will want to load a Channel, but not all 2,000 messages that are in it. You probably want to load the first 50, and then load more via infinite scroll or something.
This allows you to fetch a Channel document and then populate the first 50 messages. Then you can incrementally fetch 50 more messages at a time if needed.
If you store all of the messages in that array, your Channel document is going to get very, very large.
This also allows a user to edit their Message without editing the Channel document in any way - this is very important!
Having a separate Message schema also allows you to do things like fetch all the messages from a single user. You'll probably want to have a reference in the Message to a User ID.
There is a lot to consider when modeling data like this, but the important things to think about are "How am I going to need to fetch this data?" and "How will I need to modify this data?" Then figure out if your current format makes one of those things difficult.
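A rough sketch of the "Channel holds Message IDs, populate 50 at a time" approach described above, again with in-memory stand-ins. The names messageIds and populateMessages and the sample data are hypothetical; a real app would resolve the IDs with a query against the Messages collection.

```javascript
// Hypothetical in-memory data: 120 messages, all belonging to one channel.
const messagesById = new Map();
const channel = { _id: "chan1", name: "general", messageIds: [] };
for (let i = 1; i <= 120; i++) {
  const id = "msg" + i;
  messagesById.set(id, { _id: id, author: "user" + (i % 3), body: "message " + i });
  channel.messageIds.push(id); // the channel stores only the ID, not the message
}

// Resolve one batch of message IDs into full message documents.
function populateMessages(chan, offset, batchSize) {
  return chan.messageIds
    .slice(offset, offset + batchSize)
    .map((id) => messagesById.get(id));
}

const firstBatch = populateMessages(channel, 0, 50);   // initial load
const thirdBatch = populateMessages(channel, 100, 50); // only 20 messages remain

// Editing a message touches the message document only, never the channel.
messagesById.get("msg1").body = "edited";
```

Note that the array of IDs still grows with every message, so this only delays the point at which the Channel document becomes large.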
I started reading up on MongoDB (which got me very excited). As I understand it, one of its flaws is the lack of relations, especially when it comes to many-to-many relationships that are large or ever-growing on both sides.
And, as I read around, the best way to avoid ever-growing arrays inside a document is either to create buckets of documents and reference the buckets (which does not guarantee total prevention of overgrowth), or to have both documents reference a third, many-to-many document.
Since I could not find a final answer to this dilemma, or at least one that wouldn't be a few years old, could someone explain whether this is a dead end (in case the project uses a few big, ever-growing many-to-many relationships) and I should switch to an RDBMS?
It depends on your use case.
The main question is: do you actually know why you want to use MongoDB in the first place? Hopefully, the reason is not because of the trend. RDBMSs are still relevant and have their own use cases. For some applications an RDBMS is the way to go; for others it isn't.
Now back to your original question about many-to-many relations. As you have already researched, there are ways to model those relationships in MongoDB, so that doesn't disqualify MongoDB as a database on its own. For example, do you need transactionality or referential integrity checks when you insert or delete records for those many-to-many relationships? If the answer to that is yes, then MongoDB may not be the perfect fit for your case.
When I first started working on MongoDB this exact question crossed my mind, and while searching for the answer I read something very interesting (I wish I had the link for you, but unfortunately I don't).
Think of a real-world problem where you have a many-to-many relation that just keeps on growing. Such cases are likely to be very exceptional.
Let's say many students are registered for many courses. A course may have 100 students registered, but a student surely won't register for 100 courses, so you can simply keep an array field of registered course IDs in the student collection.
Let's dive deeper and say there are a bunch of super brilliant students who actually registered for 100 courses. In such a scenario an array field may not be a viable solution. Then what? How about a collection that just has student_id and course_id? This exists in the RDBMS world too.
So the workarounds available should be enough to find and design an optimized solution for probably even the most complex scenarios.
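As a sketch of that last idea, a collection holding only student_id and course_id pairs can be modeled like this. The array enrollments and the helpers coursesOf, studentsIn and enroll are made-up names for illustration; in MongoDB each helper would be a query or insert on the join collection.

```javascript
// Hypothetical join collection: one small document per (student, course) pair.
const enrollments = [
  { student_id: "s1", course_id: "c1" },
  { student_id: "s1", course_id: "c2" },
  { student_id: "s2", course_id: "c1" },
];

// Both directions of the relationship are answered from the same collection.
function coursesOf(studentId) {
  return enrollments.filter((e) => e.student_id === studentId).map((e) => e.course_id);
}

function studentsIn(courseId) {
  return enrollments.filter((e) => e.course_id === courseId).map((e) => e.student_id);
}

// Adding a registration is one new document; no existing document grows.
function enroll(studentId, courseId) {
  const exists = enrollments.some(
    (e) => e.student_id === studentId && e.course_id === courseId
  );
  if (!exists) enrollments.push({ student_id: studentId, course_id: courseId });
}

enroll("s2", "c2");
enroll("s2", "c2"); // duplicate registration is ignored
```

Neither side ever holds an unbounded array, which is the point of the workaround.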
After learning about performance and schema design in MongoDB, I still can't figure out how I would design the schema for an application where performance is a must.
Let's imagine we had to make YouTube work with MongoDB as its database. How would you design the schema?
OPTION 1: two collections (videos collection and comments collection)
Pros: adding, deleting and editing comments affects only the comments collection, therefore these operations would be more efficient.
Cons: Retrieving videos and comments would be 2 different queries to the database, one for videos and one for comments.
OPTION 2: single collection (videos collection with the comments embedded)
Pros: You retrieve a video and its comments with a single query.
Cons: Adding, deleting and editing comments affect the video Document, therefore these operations would be less efficient.
So what do you think? Are my guesses true?
As a voice crying in the wilderness, I have to say that embedding should only be used under very special circumstances:
The relation is a "One-To(-Very)-Few" and it is absolutely sure that no document will ever exceed this limit. A good example would be the relation between "users" and "email addresses" – a user is unlikely to have millions of them, and there isn't even a problem with artificial limits: setting the maximum number of addresses a user can have to, say, 50 would hardly cause a problem. It may be unlikely that a video gets millions of comments, but you don't want to impose an artificial limit on it, right?
Updates do not happen very often. If documents increase in size beyond a certain threshold, they might be moved, since documents are guaranteed never to be fragmented. However, document migrations are expensive and you want to prevent them.
Basically, all operations on comments become more complicated and hence more expensive - a bad choice. KISS!
I have written an article about the above, which describes the respective problems in greater detail.
And furthermore, I do not see any advantage in having the comments with the videos. The questions to answer would be
For a given user, what are the videos?
What are the newest videos (with certain tags)?
For a given video, what are the comments?
Note that the only connection between videos and comments here is about a given video, so you already have the _id or something else to positively identify the video. Furthermore, you don't want to load all comments at once, especially if you have a lot of them, since this would decrease UX because of long load times.
Let's say it is the _id. So, with it, you'd be able to have paged comments easily:
db.comments.find({"video_id": idToFind})
    .skip( (page-1) * pageSize )
    .limit( pageSize )
hth
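To spell out the skip/limit arithmetic used in that query, here is a small sketch against an in-memory array. The helpers pageWindow and paginate and the sample comments are assumptions for illustration; pages are numbered from 1, as in the query above.

```javascript
// Translate a 1-based page number into skip/limit values.
function pageWindow(page, pageSize) {
  return { skip: (page - 1) * pageSize, limit: pageSize };
}

// Apply the same window to a plain array, standing in for a query result.
function paginate(items, page, pageSize) {
  const { skip, limit } = pageWindow(page, pageSize);
  return items.slice(skip, skip + limit);
}

// 45 hypothetical comments on one video, numbered 1..45.
const comments = Array.from({ length: 45 }, (_, i) => ({ video_id: "v1", n: i + 1 }));

const page3 = paginate(comments, 3, 20); // skips 40, so only comments 41..45 remain
console.log(pageWindow(3, 20), page3.length);
```

A real query would likely also want an explicit sort so that pages stay stable between requests, and it is worth keeping in mind that large skip values make the server walk past all the skipped documents.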
As usual, the answer is: it depends. As a rule of thumb, you should favour embedding unless you need to regularly query the embedded objects on their own, or the embedded array is likely to get too large (more than roughly 100 records). Using this guideline, there are a few questions you need to ask regarding your application.
How is your application going to access the data? Are you only ever going to show the comments on the same page as the associated video? Or do you want to provide the option to show all comments for a given user across all movies? The first scenario favours embedding (one collection), whereas you would probably be better off with two collections in the second scenario.
Secondly, how many comments do you expect for each video? Taking the analogy of IMDB, you could easily expect more than 100 comments for a popular video, which means you are better off creating two separate collections, as the embedded array of comments would grow large quite quickly. I wouldn't be too concerned about the overhead of an application-side join; it is generally comparable in speed to a server-side join in a relational database, provided your collections are properly indexed.
Finally, how often are users going to update their comments after their initial post? If you lock the comments after 5 minutes, like on StackOverflow, users may not update their comments very often. In that case the overhead of updating or deleting comments in the video collection will be negligible and may even be outweighed by the cost of performing a second query in a separate comments collection.
You should use embedding for better performance, since it needs fewer I/O operations. In the worst case, it might take a bit longer to persist the document in the DB, but it won't take much time to retrieve it.
You have to trade write (persistence) performance against read performance, or vice versa, depending on your application's needs.
Hence it is important to choose your db wisely.
I've been getting into Mongo, but coming from an RDBMS background I'm facing the probably obvious questions with regard to denormalisation and general data modelling.
If I have a document type with an array of sub docs, each sub doc has a status code.
In the relational world I would add a foreign key to the record, StatusId, simple.
In MongoDB, would you denormalise the key pieces of data from the "status" (e.g. code and desc) into the sub doc and also hold an ObjectId referencing another collection of proper statuses? I guess the next question is one of design: if the status doc is modified, I'd then need to modify the denormalised data?
Another question on the same theme is how you would model a transaction table. Say I have events and people; the events could be quite granular, say time sheets, which over time may lead to many records. Based on what I've seen, this would seem like a good candidate for a child / sub array of docs, which of course could be indexed for speed.
Therefore, is it possible to query / find just the sub array or part of it? And given the 16mb limit for doc size, am I just limited in the transaction history of the person? Or should the transaction history be a separate collection with an ObjectId referencing the person?
Thanks for any input
Sam
Or should the transaction history be a separate collection with an ObjectId referencing the person?
Probably, I think this S/O question may help you understand why.
if the status doc is modified I'd then need to modify the denormalised data?
Yes, this is a standard trade-off in MongoDB. You will encounter this question a lot. You may need to leverage a queue structure to ensure that data remains consistent across multiple collections.
Therefore is it possible to query / find just the sub array or part of it?
This is a tough one specific to MongoDB. With the basic query syntax, you have only limited support for dealing with arrays of objects. The new "Aggregation Framework" is actually much better here, but at the time of writing it is not available in a stable build.
All your "how to model this or that" questions can't really be answered, because good schema design depends on so many factors (access patterns, hardware characteristics, whether a cluster is used, etc.).
if the status doc is modified I'd then need to modify the denormalised data?
Usually yes, that's the drawback of denormalisation. But sometimes you don't have to (some social network sites store the user name with a photo tag and don't update it when the user changes their name).
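To illustrate that drawback, here is a minimal in-memory sketch of keeping denormalised status copies in sync. The document shapes, the field names (items, status.id, code, desc) and the helper updateStatus are assumptions for illustration only; in MongoDB the loop would be a multi-document update, possibly driven by a queue as suggested above.

```javascript
// Canonical status documents (the "proper status" collection).
const statuses = new Map([
  ["st1", { _id: "st1", code: "OPEN", desc: "Open" }],
  ["st2", { _id: "st2", code: "DONE", desc: "Completed" }],
]);

// Parent documents whose sub docs carry a denormalised copy of the status.
const docs = [
  {
    _id: "d1",
    items: [
      { name: "a", status: { id: "st1", code: "OPEN", desc: "Open" } },
      { name: "b", status: { id: "st2", code: "DONE", desc: "Completed" } },
    ],
  },
  { _id: "d2", items: [{ name: "c", status: { id: "st1", code: "OPEN", desc: "Open" } }] },
];

// Change the canonical status, then rewrite every denormalised copy of it.
function updateStatus(statusId, changes) {
  const status = statuses.get(statusId);
  Object.assign(status, changes);
  let touched = 0;
  for (const doc of docs) {
    for (const item of doc.items) {
      if (item.status.id === statusId) {
        item.status.code = status.code;
        item.status.desc = status.desc;
        touched++;
      }
    }
  }
  return touched; // number of sub docs that had to be rewritten
}

const touched = updateStatus("st1", { desc: "Open (awaiting review)" });
```

The returned count is the cost of the denormalisation: one status edit fans out to every sub doc that copied it.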
to query / find just the sub array or part of it?
It is not currently possible to fetch only a part of an array (unless using map/reduce, of course).
And given the 4mb limit
Where did you get this from? It's 16mb at the moment.
While it's true that schema design does take into account many factors, the need to denormalize data usually comes up somewhere. I tend to take advantage of denormalization in my apps that use MongoDB because I feel it lends itself well to storing denormalized data:
no additional column maintenance
support for hashes and arrays as field types (perfect for storing denormalized fields)
speedy, non-blocking writes make syncing data less expensive
document size growth only marginally affects performance up to limits (for the most part)
There are a few gems that help you manage denormalized data, including setting it up and keeping it in sync. If you're using Mongoid, you can try mongoid_alize. DISCLAIMER: I am the author and maintainer of mongoid_alize.
I have a chatroom system, and I want to use MongoDB as the backend database. The following are the entities:
Room - a chatroom (room_id)
User - a chatting user in chatroom (room_id, user_name)
Msg - a message in chatroom (room_id, user_name, message)
For designing the schema, I have some ideas: First, 3 collections - room, user and msgs - and there is a parent reference in user and msg documents.
Another idea is to create collections for each room. Such as
db.chatroom.victor
db.chatroom.victor.users
db.chatroom.victor.msgs
db.chatroom.john
db.chatroom.john.users
db.chatroom.john.msgs
db.chatroom.tom
db.chatroom.tom.users
db.chatroom.tom.msgs
...
I think if I can divide the documents into different collections, it would be much more efficient to query. Also, I can use capped collections to limit the count of messages in each room. However, I am not familiar with MongoDB. I'm not sure if there is any side effect to doing that, or whether there is any performance problem with creating lots of collections. Is there any guideline for designing a MongoDB schema?
Thanks.
You should always design your schema by answering 2 questions:
What data should I store? (temporarily/permanently)
How will I access that data? (lots of reads on this, lots of writes on that, random read/write here)
You don't want to embed data with a high access rate into a document (like chat messages that are accessed by every user every second or so); it's better to have it as a separate collection.
On the other hand, the collection of users in a chat room changes rather rarely, so you can definitely embed that.
Just design using common sense and you'll be fine.
You definitely do not want to embed messages inside of other documents. They need to be stored as individual documents.
I say this because MongoDB allocates a certain amount of space for every document it writes. When it writes a document, it takes its current size and adds some empty space (padding) to the document so that if it is actually 1k large, it may become 1.5k large to leave space for the document to grow in size.
Chat messages will almost definitely each be larger than the allocated free space. Multiple messages will absolutely be larger than the free space.
The problem is that when a document doesn't fit in its current location on disk or in memory when you try to embed another document inside of it (via an update), the database must read that document from disk or memory and rewrite the entire thing at the tail end of the data file.
This causes a lot of disk activity that should otherwise not exist - that added I/O will destroy the performance of the database.
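The padding argument can be made concrete with a toy model. This is only a back-of-the-envelope simulation under stated assumptions: a fixed padding factor of 1.5 (taken from the 1k to 1.5k example above), sizes in bytes, and a made-up helper simulateEmbedding. The real allocation strategy is more involved and depends on the storage engine and version.

```javascript
// Toy model only: a fixed padding factor, as in the 1k -> 1.5k example.
const PADDING_FACTOR = 1.5;

// Count how often a growing document outgrows its allocated slot.
function simulateEmbedding(initialSize, messageSize, messageCount) {
  let size = initialSize;
  let allocated = Math.ceil(size * PADDING_FACTOR);
  let moves = 0;
  for (let i = 0; i < messageCount; i++) {
    size += messageSize; // embed one more message via an update
    if (size > allocated) {
      // The document no longer fits: it is rewritten elsewhere with new padding.
      moves++;
      allocated = Math.ceil(size * PADDING_FACTOR);
    }
  }
  return { size, allocated, moves };
}

// A 1 KiB room document receiving ten embedded messages of 200 bytes each.
const result = simulateEmbedding(1024, 200, 10);
console.log(result);
```

In this model the document is relocated twice within the first ten messages, and each relocation rewrites the whole document, which is where the extra I/O comes from.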
Think about all of the use cases. What kind of queries do you want to execute?
"Show me the rooms where a user is chatting". This won't work with the current schema, and you have to add a list of rooms to the user or in a separate collection to make it work.
"Show me all the messages a user sent in all of the rooms". This, again, won't work.
"Delete all the idle users from all the rooms". With the proposed schema you have to run this query on every room.users collection.
Also, do some approximation for the size of your collections. If you have 100 rooms with max 1000 users, that's 100000 entries for a collection where you store all these mappings. With an index on room and user that shouldn't be an issue, you don't need separate collections.
With 100 users you can even embed this to the room object as an array. Just make sure there is an index.
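As a sketch of the single-collection alternative, the room/user mappings can live in one place and still answer the queries listed above. The array memberships, the lastSeen field and the helpers roomsOf and removeIdle are hypothetical names used only for this example.

```javascript
// One hypothetical "memberships" collection instead of one users collection per room.
const memberships = [
  { room: "victor", user: "ann", lastSeen: 100 },
  { room: "victor", user: "bob", lastSeen: 20 },
  { room: "john", user: "ann", lastSeen: 90 },
  { room: "tom", user: "cid", lastSeen: 10 },
];

// "Show me the rooms where a user is chatting": one lookup across all rooms.
function roomsOf(user) {
  return memberships.filter((m) => m.user === user).map((m) => m.room);
}

// "Delete all the idle users from all the rooms": one pass, not one per room.
function removeIdle(list, cutoff) {
  return list.filter((m) => m.lastSeen >= cutoff);
}

// Size estimate from the answer: 100 rooms with at most 1000 users each.
const estimatedEntries = 100 * 1000;

const active = removeIdle(memberships, 50);
console.log(roomsOf("ann"), active.length, estimatedEntries);
```

With an index on room and user, each of these filters would become an indexed query rather than a scan.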