I'd like to use MongoDB to store chat messages as part of a chat application. The database will be used to display chat history to users joining a channel.
I am trying to determine the best way to model this data in the database. The application is a simple chat app which contains numerous channels that users can chat in. Here are a few options I've considered:
A Messages Collection containing a document for every message. This is easy to implement, however with any significant usage many documents would be created.
A Channels Collection containing a document for every channel. This would result in fewer documents. Messages would be stored as an array on a channel document.
Which of these options is preferred, and why? Is there a better option not listed here?
There are many, many ways to go about modeling something like this. There is no generic "best way," as it really depends on how you plan on using the data, how the app is going to function, etc. However, there are a few things to consider with your approach.
First, having a lot of documents is not an issue. That's what Mongo does - it's great at storing lots of documents. I am a strong advocate of modularity, as it makes things more flexible. I reflect that mindset in my database by separating data as much as possible, and then using references to populate as needed.
This means you have to do more population, but in the end it prevents you from pigeonholing yourself into having to do things a certain way.
So for your example in particular, a good way would be to combine what you've mentioned above: Have a Messages collection which creates a document for every message. Then have a Channels collection which stores an array of Message IDs (not the message itself).
Why is this useful? I'm assuming you will want to load a Channel, but not all 2,000 messages that are in it. You probably want to load the first 50, and then load more via infinite scroll or something.
This allows you to fetch a Channel document and then populate the first 50 messages. Then you can incrementally fetch 50 more messages at a time if needed.
If you store all of the messages in that array, your Channel document is going to get very, very large.
This also allows a user to edit their Message without editing the Channel document in any way - this is very important!
Having a separate Message schema also allows you to do things like fetch all the messages from a single user. You'll probably want to have a reference in the Message to a User ID.
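As a rough sketch of the referencing approach described above, here is an in-memory Python model (not a real MongoDB driver call). The names `Message`, `Channel`, `post`, `load_page` and `messages_by_user` are made up for illustration, and it assumes "the first 50" means the 50 most recent messages.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    id: int
    user_id: str  # reference to a User, as suggested above
    text: str


@dataclass
class Channel:
    name: str
    # Only message IDs are stored on the channel, oldest first.
    message_ids: list = field(default_factory=list)


# Stand-in for the Messages collection, keyed by message ID.
messages = {}


def post(channel, user_id, text):
    """Create a message document and append its ID to the channel."""
    msg = Message(id=len(messages) + 1, user_id=user_id, text=text)
    messages[msg.id] = msg
    channel.message_ids.append(msg.id)
    return msg


def load_page(channel, page, page_size=50):
    """Return one page of messages, newest first; page 0 is the most recent."""
    newest_first = channel.message_ids[::-1]
    ids = newest_first[page * page_size:(page + 1) * page_size]
    return [messages[i] for i in ids]


def messages_by_user(user_id):
    """Fetch every message written by one user, across all channels."""
    return [m for m in messages.values() if m.user_id == user_id]
```

Editing a message only touches its entry in `messages`; the channel's `message_ids` list stays the same, which is the point made above about edits.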
There is a lot to consider when modeling data like this, but the important things to think about are "How am I going to need to fetch this data?" and "How will I need to modify this data?" Then figure out if your current format makes one of those things difficult.
Currently, our system is not entirely normalized, and we use meteor-publish-composite to obtain the normalized data in mongodb. Some models have very few dependencies, but others have arrays of objects (i.e. sub-documents) with few foreign keys that we are subscribing to when fetching each model.
An example would be a Post containing a list of Comment sub-documents, where each comment has a userId field.
My question is, while I know it would be faster to use collection hooks and update the collection with data denormalization, how does Meteor handle multiple subscriptions on the same collection?
Would a hundred subscriptions on the same collection affect the application speed (significantly)? What about a thousand? etc.
This may not fully answer your question, however after spending countless hours tuning the performance of a large meteor app, I thought I would share some of the things that I have learned.
In Meteor, when you define a publication, you are setting up a reactive query that continues to push data to subscribed clients when changes to the underlying mongo data causes the result of the query to change. In other words, it sets up a query that will continually push data to clients as the data is inserted, updated, or removed. The mechanism by which it does this is by creating an observer on the query.
When an observer is initialized (e.g. when a publication is subscribed to), it will query mongodb for the initial dataset to send down and then use the oplog to detect changes going forward. Fortunately, meteor is able to re-use an existing observer for a new subscription if the query is for the same collection, same selectors, and same options.
This means that you could create hundreds of subscriptions against many different publications, but if they are hitting against the same collection and using the same query selectors and options then you effectively only have one observer in play. For more details, I highly recommend reading this article from kadira.io (from which I acquired the information I used in this answer).
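To make the re-use rule concrete, here is a toy Python model of it. This is not Meteor's actual implementation; `ObserverRegistry` and its methods are hypothetical names, and the only thing it illustrates is that subscriptions with the same collection, selector and options collapse onto one observer.

```python
import json


class ObserverRegistry:
    """Toy model: one observer per distinct (collection, selector, options)."""

    def __init__(self):
        # Maps a query key to the number of subscriptions sharing it.
        self._observers = {}

    @staticmethod
    def _key(collection, selector, options):
        # Serialize with sorted keys so that equal queries yield equal keys.
        return (
            collection,
            json.dumps(selector, sort_keys=True),
            json.dumps(options, sort_keys=True),
        )

    def subscribe(self, collection, selector, options=None):
        """Register a subscription, re-using an existing observer if possible."""
        key = self._key(collection, selector, options or {})
        self._observers[key] = self._observers.get(key, 0) + 1
        return key

    def observer_count(self):
        """Number of distinct observers currently in play."""
        return len(self._observers)
```

In this model, a hundred calls to `subscribe("posts", {"published": True})` leave `observer_count()` at one, while changing the selector or the options creates a second observer.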
In addition to this, Meteor is also able to deal with multiple publications publishing the same document, and when this occurs, the documents will be merged into one. See this for more detail.
Lastly, because of Meteor's MergeBox component, it will minimize the data being sent over the wire across all your subscriptions by keeping track of what data changed vs. what is already on the client.
Therefore, in your specific example, it sounds like you will be running several different subscriptions on effectively the same query (since you are just trying to de-normalize your data) and dataset. Because of all the optimizations that I described above, I would guess that you won't be plagued by performance issues by taking this approach.
I have done similar things in one of my apps and have never had an issue.
I'm developing a chat app with node.js, redis, socket.io and mongodb. MongoDB comes last and is used for persisting the messages.
My question is what would be the best approach for this last step?
I'm afraid a collection with all the messages like
{
id,
from,
to,
datetime,
message
}
can get too big too soon, and is going to get very slow for reading purposes, what do you think?
Is there a better approach you already worked with?
In MongoDB, you store your data in the format you will want to read them later.
If what you read from the database is a list of messages filtered on the 'to' field and with a dynamic datetime filter, then this schema is the perfect fit.
Don't forget to add an index on the fields you will be querying on; then it will be reasonably fast to query them, even over millions of records.
If you would, for example, always show a full history of a full day, you would store all messages for a single day in one document. If both types of queries occur a lot, you would even store your messages in both formats.
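To illustrate the "one document per day" format, here is a small Python sketch that regroups per-message documents into daily documents. The function name `bucket_by_day` and the `day` and `messages` fields are assumptions made for the example; the `to`, `datetime` and `message` fields come from the schema in the question.

```python
from collections import defaultdict
from datetime import datetime


def bucket_by_day(messages):
    """Group per-message documents into one document per (recipient, day)."""
    buckets = defaultdict(list)
    for msg in messages:
        day = msg["datetime"].date().isoformat()
        buckets[(msg["to"], day)].append(msg)
    # One output document per bucket, with its messages in time order.
    return [
        {
            "to": to,
            "day": day,
            "messages": sorted(items, key=lambda m: m["datetime"]),
        }
        for (to, day), items in sorted(buckets.items())
    ]
```

Showing a full day of history then means reading a single document, at the cost of rewriting that document for every new message.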
If storage is an issue, you could also use a capped collection, which automatically discards the oldest messages once it reaches a fixed size, or a TTL index, which deletes messages older than e.g. 1 year.
I think the db structure is fine, the way you mentioned in your question.
You may assign a unique id to the chat between each pair of users and store it in every record of that chat. Retrieve based on that id when you want to show the chat.
For example, if 12 is the unique id for the chat between A and B, retrieve records by 12 when you want to show the chat for A and B.
So your db structure can be like:-
{
id,
from,
to,
datetime,
message,
uid
}
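One possible way to get such an id without a lookup table is to derive it from the two user ids, so that both directions of the chat map to the same value. This is only a sketch: `conversation_id` and `make_record` are hypothetical helpers, and the id is a string such as "A:B" instead of a number like 12. It assumes user ids never contain a colon.

```python
from datetime import datetime


def conversation_id(user_a, user_b):
    """Return the same id for (A, B) and (B, A) by sorting the pair first."""
    first, second = sorted((user_a, user_b))
    # Assumes ":" never appears inside a user id.
    return f"{first}:{second}"


def make_record(sender, recipient, message, when):
    """Build a chat record carrying the shared conversation id in `uid`."""
    return {
        "from": sender,
        "to": recipient,
        "datetime": when,
        "message": message,
        "uid": conversation_id(sender, recipient),
    }
```

With an index on `uid`, the whole chat between two users can be read with a single equality filter, whichever of the two sent each message.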
Remember, you can optimize retrieval by setting a limit (say, 100 at a time). If the user scrolls beyond the first 100, retrieve 100 more. This avoids a lot of unnecessary retrieval.
When using limit, retrieve based on date created and use sort with find query as well.
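The limit-plus-sort idea can be sketched in plain Python as follows. It works on an in-memory list and is not a driver call; `fetch_page` and its `before` parameter are names invented for the example. With a real MongoDB driver the same shape would be a find on `uid` with a date filter, a descending sort on `datetime` and a limit.

```python
from datetime import datetime
from typing import Optional


def fetch_page(records, uid, before: Optional[datetime] = None, limit=100):
    """Return up to `limit` records of one chat, newest first.

    `before` is the datetime of the oldest record already shown. Passing it
    fetches the next, older page. The strict comparison means records that
    share exactly that datetime would be skipped.
    """
    matching = [
        r for r in records
        if r["uid"] == uid and (before is None or r["datetime"] < before)
    ]
    matching.sort(key=lambda r: r["datetime"], reverse=True)
    return matching[:limit]
```

The first call omits `before`; each later call passes the `datetime` of the last record from the previous page, so no page has to skip over rows that were already shown.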
Just a thought here, are the messages plain text or are you allowed to share images and videos as well ?
If it's the latter, then storing all the chats for a single day in one document might not work out.
Actually, if sharing images and videos is allowed, then you need to take the 16 MB document size limit into account as well.
I have a chatroom system, and I want to use MongoDB as the backend database. The following are the entities:
Room - a chatroom (room_id)
User - a chatting user in chatroom (room_id, user_name)
Msg - a message in chatroom (room_id, user_name, message)
For designing the schema, I have some ideas: First, 3 collections - room, user and msgs - and there is a parent reference in user and msg documents.
Another idea is to create collections for each room. Such as
db.chatroom.victor
db.chatroom.victor.users
db.chatroom.victor.msgs
db.chatroom.john
db.chatroom.john.users
db.chatroom.john.msgs
db.chatroom.tom
db.chatroom.tom.users
db.chatroom.tom.msgs
...
I think if I can divide the documents into different collections, it would be much more efficient to query. Also, I can use capped collections to limit the count of messages in each room. However, I am not familiar with MongoDB. I'm not sure if there is any side effect to doing that, or is there any performance problem to create lots of collections? Is there any guideline for designing a MongoDB schema?
Thanks.
you should always design your schema by answering 2 questions:
what data should i store? (temporary/permanently)
how will i access that data? (lots of reads on this, lots of writes on that, random rw here)
you don't want to embed high access rate data into a document (like chat messages that are accessed by every user every second or so); it's better to keep it in a separate collection.
on the other hand - collection of users in chat room changes rather rarely - you can definitely embed that.
just design using common sense and you'll be fine
You definitely do not want to embed messages inside of other documents. They need to be stored as individual documents.
I say this because MongoDB allocates a certain amount of space for every document it writes. When it writes a document, it takes its current size and adds some empty space (padding) to the document so that if it is actually 1k large, it may become 1.5k large to leave space for the document to grow in size.
Chat messages will almost definitely each be larger than the allocated free space. Multiple messages will absolutely be larger than the free space.
The problem is that when a document doesn't fit in its current location on disk or in memory when you try to embed another document inside of it (via an update), the database must read that document and rewrite the entire thing at the tail end of the data file.
This causes a lot of disk activity that should otherwise not exist - that added I/O will destroy the performance of the database.
Think about all of the use cases. What kind of queries do you want to execute?
"Show me the rooms where a user is chatting". This won't work with the current schema, and you have to add a list of rooms to the user or in a separate collection to make it work.
"Show me all the messages a user sent in all of the rooms". This, again, won't work.
"Delete all the idle users from all the rooms". With the proposed schema you have to run this query on every room.users collection.
Also, do some approximation for the size of your collections. If you have 100 rooms with max 1000 users, that's 100000 entries for a collection where you store all these mappings. With an index on room and user that shouldn't be an issue, you don't need separate collections.
With 100 users you can even embed this in the room object as an array. Just make sure there is an index.
I have this schema for support of in-site messaging:
When I send a message to another member, the message is saved to Message table; a record is added to MessageSent table and a record per recipient is added to MessageInbox table. MessageCount is being used to keep track of number of messages in the inbox/send folders and is filled using insert/delete triggers on MessageInbox/MessageSent - this way I can always know how many messages a member has without making an expensive "select count(*)" query.
Also, when I query member's messages, I join to Member table to get member's FirstName/LastName.
Now, I will be moving the application to MongoDB, and I'm not quite sure what the collection schema should be. Because there are no joins available in MongoDB, I have to completely denormalize it, so I would have MessageInbox, MessageDraft and MessageSent collections with full message information, right?
Then I'm not sure about following:
What if a user changes his First/LastName? It will be stored denormalized as sender in some messages, as a part of Recipients in other messages - how do I update it in optimal ways?
How do I get message counts? There will be tons of requests at the same time, so it has to be performing well.
Any ideas, comments and suggestions are highly appreciated!
I can offer you some insight as to what I have done to simulate JOINs in MongoDB.
In cases like this, I store the ID of a corresponding user (or multiple users) in a given object, such as your message object in the messages collection.
(I'm not suggesting this be your schema, just using it as an example of my approach)
{
_id: "msg1234",
from: "user1234",
to: "user5678",
subject: "This is the subject",
body: "This is the body"
}
I would query the database to get all the messages I need then in my application I would iterate the results and build an array of user IDs. I would filter this array to be unique and then query the database a second time using the $in operator to find any user in the given array.
Then in my application, I would join the results back to the object.
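Here is a minimal Python sketch of that two-step pattern. It runs against in-memory data, so `USERS`, `find_users_by_ids` and `join_users` are stand-ins invented for the example; in a real application `find_users_by_ids` would be the second database query using the `$in` operator on `_id`.

```python
# Stand-in for the users collection, keyed by _id (sample data, hypothetical).
USERS = {
    "user1234": {"_id": "user1234", "name": "Alice"},
    "user5678": {"_id": "user5678", "name": "Bob"},
}


def find_users_by_ids(ids):
    """Stand-in for one query that matches users whose _id is in `ids`."""
    return [USERS[i] for i in ids if i in USERS]


def join_users(messages, fetch_users=find_users_by_ids):
    """Attach sender and recipient user documents to each message."""
    # Step 1: collect the unique user ids referenced by the messages.
    user_ids = {m[key] for m in messages for key in ("from", "to")}
    # Step 2: fetch all of those users in a single lookup.
    users = {u["_id"]: u for u in fetch_users(sorted(user_ids))}
    # Step 3: join the results back onto the messages in application code.
    return [
        {**m, "from_user": users.get(m["from"]), "to_user": users.get(m["to"])}
        for m in messages
    ]
```

However many messages come back from the first query, the users are fetched once, so the total stays at two round trips.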
It requires two queries to the database (or potentially more if you want to join other collections) but this illustrates something that many people have been advocating for a long time: Do your JOINs in your application layer. Let the database spend its time querying data, not processing it. You can probably scale your application servers quicker and cheaper than your database anyway.
I am using this pattern to create real time activity feeds in my application and it works flawlessly and fast. I prefer this to denormalizing things that could change like user information because when writing to the database, MongoDB may need to re-write the entire object if the new data doesn't fit in the old data's place. If I needed to rewrite hundreds (or thousands) of activity items in my database, then it would be a disaster.
Additionally, writes on MongoDB are blocking so if a scenario like I've just described were to happen, all reads and writes would be blocked until the write operation is complete. I believe this is scheduled to be addressed in some capacity for the 2.x series but it's still not going to be perfect.
Indexed queries, on the other hand, are super fast, even if you need to do two of them to get the data.
Let's take an example of a chat room.
Should I create 2 collections: Room and Messages, and store the room details (title, description) separately from Messages (body/date/author)? The Messages collection would have a field called "Room" that links to the ObjectId of Room.
OR
Should I create 1 collection, called Room. And then inside Room, there is an array of messages?
What is the best practice? What would you do?
I would lean toward your first choice. Aside from the fact that 16MB may be too small (I've seen some pretty busy chat rooms in my days), storing the messages separately allows for some greater flexibility on your part. The room doesn't even need to know what messages are associated with it - just create it once and query for messages by your Room id as needed.
Schema design questions have been asked numerous times here. Please research, since the solutions are always the same - you just have to think for a moment and apply them to your own use case:
MongoDB Schema Design - Real-time Chat
Mongodb schema design
MongoDB Schema Design - Many small documents or fewer large documents?
It really should not be that hard replacing X in the answer with Y of your own problem.
In addition: the standard documentation applies as well (and is pretty much explicit on your problem):
http://www.mongodb.org/display/DOCS/Schema+Design