How should I structure the data schema of MongoDB?

I have a chatroom system, and I want to use MongoDB as the backend database. The following are the entities:
Room - a chatroom (room_id)
User - a chatting user in chatroom (room_id, user_name)
Msg - a message in chatroom (room_id, user_name, message)
For designing the schema, I have some ideas. First: three collections - room, user and msg - with a parent reference in each user and msg document.
Another idea is to create collections for each room. Such as
db.chatroom.victor
db.chatroom.victor.users
db.chatroom.victor.msgs
db.chatroom.john
db.chatroom.john.users
db.chatroom.john.msgs
db.chatroom.tom
db.chatroom.tom.users
db.chatroom.tom.msgs
...
I think that if I divide the documents into different collections, queries will be much more efficient. I could also use capped collections to limit the number of messages in each room. However, I am not familiar with MongoDB, so I'm not sure whether there are any side effects to doing that, or any performance problems with creating lots of collections. Is there any guideline for designing a MongoDB schema?
Thanks.

You should always design your schema by answering two questions:
What data should I store? (temporarily/permanently)
How will I access that data? (lots of reads on this, lots of writes on that, random r/w here)
You don't want to embed high-access-rate data in a document (like chat messages that are accessed by every user every second or so); it's better to keep it as a separate collection.
On the other hand, the collection of users in a chat room changes rather rarely - you can definitely embed that.
Just design using common sense and you'll be fine.
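For illustration, here is a minimal mongo shell sketch of that split; the collection and field names are my assumptions, not something from the question:

// Each room keeps its rarely-changing user list embedded
db.rooms.insertOne({
    _id: "victor",
    users: [{ user_name: "john" }, { user_name: "tom" }]
})

// High-traffic messages get their own collection, each document
// carrying a parent reference back to its room
db.msgs.insertOne({
    room_id: "victor",
    user_name: "john",
    message: "hello",
    ts: new Date()
})

// Index the parent reference so per-room message queries stay cheap
db.msgs.createIndex({ room_id: 1, ts: 1 })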

You definitely do not want to embed messages inside of other documents. They need to be stored as individual documents.
I say this because MongoDB allocates a certain amount of space for every document it writes. When it writes a document, it takes its current size and adds some empty space (padding), so a document that is actually 1k large may become 1.5k large, leaving room for it to grow.
Chat messages will almost certainly each be larger than the allocated free space, and multiple messages absolutely will be.
The problem is that when a document no longer fits in its current location on disk/in memory and you try to embed another document inside it (via an update), the database must read the document from disk/memory and rewrite the entire thing at the tail end of the data file.
This causes a lot of disk activity that should otherwise not exist, and that added I/O will destroy the performance of the database.
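To make that concrete, here is a sketch of the two approaches; the $push variant is the one to avoid for high-volume messages (names are illustrative):

// Avoid: growing the room document with every message;
// once the padding runs out, MongoDB has to relocate the whole document
db.rooms.updateOne(
    { _id: "victor" },
    { $push: { msgs: { user_name: "john", message: "hello" } } }
)

// Better: each message is its own document and never grows
db.msgs.insertOne({ room_id: "victor", user_name: "john", message: "hello" })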

Think about all of the use cases. What kind of queries do you want to execute?
"Show me the rooms where a user is chatting". This won't work with the current schema and you have to add a list of rooms to the user or a in a separate collection to make it work.
"Show me all the messages a user sent in all of the rooms". This, again, won't work.
"Delete all the idle users from all the rooms". With the proposed schema you have to run this query on every room.users collection.
Also, do some back-of-the-envelope estimation of the size of your collections. If you have 100 rooms with at most 1,000 users each, that's 100,000 entries for a collection that stores all these room-user mappings. With an index on room and user that shouldn't be an issue; you don't need separate collections.
With 100 users you can even embed this in the room object as an array. Just make sure there is an index (see the sketch below).
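As a sketch of that mapping collection (collection and field names are my assumptions):

// One document per room-user pair
db.roomUsers.insertOne({ room: "victor", user: "john" })

// The compound index serves "who is in room X?" (and prefix queries on room)
db.roomUsers.createIndex({ room: 1, user: 1 })
// A second index serves "which rooms is user Y in?"
db.roomUsers.createIndex({ user: 1 })

// "Show me the rooms where a user is chatting"
db.roomUsers.find({ user: "john" })

// "Delete all the idle users from all the rooms" becomes one statement
// (idleUserIds computed elsewhere)
db.roomUsers.deleteMany({ user: { $in: idleUserIds } })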

Related

How to design my Mongo database

I've got a collection Users, holding name, password, email, etc.
I've also got a collection Groups; every group has its members - an array of users.
How should I design my database? I clearly see two ways of doing so:
Way 1 (MySQL-like): every user has an _id, so I just put it into the members array and that's it.
Way 2: copy the whole user document inside, plus add some fields.
On the MongoDB site they say that duplicate data is nothing to worry about because of the low price of storage. They also say that we should avoid JOINs on data reads.
duplicate data is nothing to worry about
This is something to worry about when it comes to updating. Suppose you have user details nested and duplicated in every document. What happens when a user changes their name? You'll have to update every instance of that user in every document.
Be careful to differentiate between data and entities. A user is an entity, think carefully before duplicating entities as fixing it later could be hard work.
Personally, I'd split them unless you find yourself in a situation where performance is too slow to do the joining in real time. Then, and only then, consider merging.
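To put a number on that update cost, here is a sketch (field names, and the arrayFilters option which needs MongoDB 3.6+, are my assumptions):

// If user details are embedded in every group's members array,
// a simple name change fans out to every matching group document
db.groups.updateMany(
    { "members._id": "user1234" },
    { $set: { "members.$[m].name": "New Name" } },
    { arrayFilters: [{ "m._id": "user1234" }] }
)

// With references instead, the same change touches exactly one document
db.users.updateOne({ _id: "user1234" }, { $set: { name: "New Name" } })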
Actually, the answer to this question depends on what kind of screens you are designing and what kinds of queries you are going to make to fetch data. Let's go through the pros and cons of each option, which will help you weigh them.
Way 1 :- Putting array of user_ids in group collection
Pros
1) If you have a screen which shows the details of a particular group and the list of all members (user_ids) belonging to that group, then one query can fetch all the details needed for this screen, and it will be fast.
Cons
1) If, on the group detail screen, you have to show details of users along with group details, then since MongoDB does not provide joins, you would be fetching user details in a separate query and joining the two on the client side. This can impact performance.
2) If you have a screen which shows user details and all the groups he/she belongs to, then you will be searching for the user_id inside the members array of the group collection. If you expect the number of members in a group to be very high (millions), searching inside the array can have a huge performance impact (see the sketch below).
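A minimal shell sketch of Way 1 (names are illustrative); note that a multikey index makes the array membership query indexable, which mitigates con 2 up to a point:

// Way 1: groups reference users by _id only
db.groups.insertOne({
    _id: "group1",
    name: "book club",
    members: ["user1234", "user5678"]
})

// A multikey index keeps membership lookups from scanning every array
db.groups.createIndex({ members: 1 })

// "Show all the groups this user belongs to"
db.groups.find({ members: "user1234" })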
Way 2 :- Copy the user document inside the group collection
Duplicating data is not a problem in MongoDB, but you should have a really good reason for it. The rule of thumb is to duplicate data when the relationship is 1:few, not 1:many.
Pros
1) This approach saves you from joining the group and user collections on the client side, as one query can fetch all the details of a group along with its users.
Cons
1) Suppose you have a million groups and user_id_1 belongs to 100,000 of them; then whenever user_id_1 is updated, you will have to update 100,000 documents. This can again have a huge performance impact.
2) Also, if a large number of users subscribe to one group, the document size of that group keeps increasing. In MongoDB the maximum BSON document size is 16 megabytes, which means you cannot add users to a group indefinitely. This will limit your functionality.
Way 3 :- Embed group details in user collection
Pros
1) One query can fetch user details along with all the details of all the groups this user belongs to.
2) If you expect each user to belong to only a few groups, the embedded group array in a user document stays small and will not exceed the 16MB limit.
Cons
1) If you expect that a user can subscribe to very many groups (millions), the user document may exceed the 16MB limit.
2) Also, if group details are updated frequently, you will have to apply the same update in many user documents.
You can also go through the following link to get more details about data model design :-
https://docs.mongodb.org/manual/core/data-model-design/
It depends on how you will use the data in your application.
If you have more than two groups and you will have to search for a user across all of them, embedding the user document within the group (Way 2) is not a good idea, so in this case I suggest using Way 1.
If you have only two groups, or the user's group is known to your application before doing the query, then use Way 2.
I would guess that separating the data is the way to go, since it makes it easier to update, get, and delete user data directly.

Querying a large mongodb collection in real-time

We have a service that allows people to open a room and play YouTube songs while others listen in real time.
Among other collections in our MongoDB database, we have one to store the songs users add to rooms' playlists; it's called userSong.
This collection holds a record for every user-room-song combination.
The code makes frequent queries to the collection in those major operations:
Loading current playlist (regular find with a trivial condition)
Loading random song for a room (using Mongo aggregation FW)
Loading room top songs (using Mongo aggregation FW)
Now this collection has become big (1m+ records) and things have started to slow down: AWS has started sending us CPU utilization notifications more often, and according to mongotop the userSong collection accounts for most of the CPU consumption, mostly in READ operations.
We made some modifications to the collection's indexes and that helped a bit, but it's still not a solution; we need to find some other way to arrange the data, because it is growing exponentially.
We thought about splitting the userSong data into a lower-level segmentation: instead of one user-room-song collection, a user-song collection for each room in the system. That would shorten the time it takes to fetch data from the DB. Now we need to decide how to do it:
Make a new collection for each room (roomUserSong) that will hold all user-song records for a particular room. This might be good for quick fetching, but it will create an unbounded number of new collections in the database (roomusersong-1, roomusersong-2, ..., roomusersong-n), and we don't know whether that is good practice or whether there are other Mongo limitations with that kind of solution.
Create just one more collection in the DB with the following fields:
{room: <roomId>, userSongs: [{userSong1, userSong2, ..., userSongN}]}, so each room will have its own document, and inside it a sub-document (an array) that holds all the user-song records for this room. This solves the previous issue (creating unlimited collections), but it will be very hard to work with later through Mongoose (our ODM), because (as far as I know) we cannot define a schema in advance for such a data structure. It may also run us into the document size limitation, which is 16MB as far as I understand.
It would be nice to hear some advice from people who have Mongo experience with these kinds of situations:
Is 1m+ records really considered big, and should it be causing these CPU utilization issues? (We're using an AWS m3.medium, one core.)
Which of the approaches introduced above is the better solution?
Any other ideas for adding smart caching without changing the code too much?
Thanks in advance!

Multi room chat database with mongoDB (mongoose)

I need to design a schema for a multi-room chat which uses MongoDB for storage. I'm currently using Mongoose v2, and I've thought of the following methods:
Method 1
Each chat log (each room) has its own mongo collection
Each chat log collection is populated by message documents (schema) with from, to, message and time.
There is the user collection
The user collection is populated by documents (schema) user (with info regarding the user)
Doubts:
1. How exactly can I retrieve the documents from a specific collection (the chat room)?
Method 2
There is a collection (chat_logs)
The collection chat_logs is populated by message documents (schema) with from, to (chatroom), user, etc.
There is the user collection, as above.
Doubts:
1. Is there a max size for collections?
Any advice is welcome... thank you for your help.
There is no good reason to have a separate collection per chatroom. The size of collections is unlimited and MongoDB offers no way to query data from more than one collection. So when you distribute your data over multiple collections, you won't be able to analyze data across more than one chatroom.
There isn't a limit on normal collections. However, do you really want to save every word ever written in any chat forever? Most likely not. You would like to save, say, the last 1,000 or 10,000 messages written. Or even 100,000. Let's make it 1,000,000. Given the average size of a chat message, this shouldn't be more than 20MB. So let's make it really safe and multiply that by 10.
What I would do is use a capped collection per chat room and tailable cursors. You don't need to worry about having too many connections; the average mongod server can take a couple of hundred of them. The queries can be made tailable quite easily, as shown in the Mongoose docs.
This approach has some advantages:
Capped collections are fast - they are basically FIFO buffers for BSON data.
The data is returned in insertion order for free - no sorting, no complex queries, no extra indices.
There is no need to maintain the individual rooms. Simply set a cap on creation and MongoDB will take care of the rest.
As for how to do it: simply create a connection per chat room and save them at the application level in an associative array with the chat room as the key. Use the connections to create a new tailable cursor per request. Use XHR to request the chat data, respond with a stream, and process accordingly. A minimal sketch follows.
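A sketch of the two pieces; the model name, sizes and fields are my assumptions, and .tailable()/.cursor() are from recent Mongoose versions rather than the v2 mentioned in the question:

const mongoose = require('mongoose');

// Mongoose can create the capped collection itself via schema options
const messageSchema = new mongoose.Schema(
    { from: String, message: String, time: Date },
    { capped: { size: 10 * 1024 * 1024, max: 100000 } }
);
const Message = mongoose.model('Message', messageSchema);

// A tailable cursor stays open and emits documents as they are inserted,
// in insertion order - a natural fit for streaming chat
const stream = Message.find().tailable().cursor();
stream.on('data', (msg) => {
    // push msg out to the connected clients for this room
});
stream.on('error', (err) => console.error(err));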

mongo db design of following and feeds, where should I embed?

I have a basic question about where I should embed a collection of followers/following in a MongoDB database. It makes sense to have an embedded collection of following in a user object, but does it also make sense to embed the converse followers collection as well? That would mean I would have to update the embedded list in the profile records of both:
the following embedded list of the follower
and the followers embedded list of the followee
I can't ensure atomicity on that unless I also somehow keep a transaction or update status somewhere. Is it worth embedding in both entities, or should I just do #1 - embed following in the follower's profile - and put an index on it so that I can query for the converse (followers) across all profiles? Is the performance hit of that too much?
Or is this a candidate for a collection that should not be embedded? Should I just have a collection of edges, storing following in its own collection with followerId and followedById?
Now, if I also have to update a feed for both users when they follow or are followed, how should I organize that?
As for the use case: users will see the people they are following when viewing their feeds, which happens quite often, and will also see the followers of a profile when they view anyone's profile detail, which also happens often, but not quite as much as the first case. In both cases, the totals of following and followers show up on every profile page.
In general, it's a bad idea to embed following/followed-by relationships into user documents, for several reasons:
(1) there is a maximum document size limit of 16MB, and it's plausible that a popular user of a well-subscribed site might end up with hundreds of thousands of followers, which would approach that maximum document size,
(2) followership relationships change frequently, so a user gaining a lot of followers translates into repeated document growth if you're embedding followers. Frequent document growth will significantly hinder MongoDB performance and so should be avoided (occasional document growth, especially if documents tend to reach a stable final size, is less of a performance penalty).
So, yes, it is best to split the following/followed-by relationship out into a separate collection of records, each having two fields, e.g., { _id : <follower-id>, oid : <followee-id> }, with indexes on _id (for the "who am I following?" query) and oid (for the "who's following me?" query). Any individual state change is modeled by a single document addition or removal, though if you're also displaying things like follower counts, you should probably keep separate counters that you update after any edge insertion/deletion.
(Of course, this supposes your business requirements allow you some flexibility on the consistency details: in general, if your display code tells a user he's got 304 followers and then proceeds to enumerate them, only the most fussy user will check that the followers enumerated tally up to 304. If business requirements necessitate absolute consistency, you'll either need a database that isolates transactions for you, or else you'll have to do the counting yourself as part of displaying all user identities.)
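A minimal shell sketch of that edge collection; I've used explicit follower/followee field names rather than overloading _id, since _id values must be unique per document:

// One document per "follower follows followee" edge
db.follows.insertOne({ follower: "alice", followee: "bob" })

// Index each side of the edge
db.follows.createIndex({ follower: 1 })   // "who am I following?"
db.follows.createIndex({ followee: 1 })   // "who's following me?"

db.follows.find({ follower: "alice" })
db.follows.find({ followee: "alice" })

// Follower count for a profile page (or maintain a separate counter)
db.follows.countDocuments({ followee: "alice" })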
You can embed them all but create a new document when you reach a certain limit. For example, you can cap a document at an array of 500 elements, then create a new one. Also, if it is about the feed: once publications have been viewed, you don't have to keep them; you can replace them with new ones, so you don't have to create new documents just for additional publication storage.
To maintain performance, I'd advise you to make a collection that can be used with $graphLookup aggregation, where you store whom each user follows. A user can reach millions of followers, so you should store whom people follow instead of who follows them.
I hope this helps.
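As an illustration of that $graphLookup idea (MongoDB 3.4+; collection and field names are assumptions carried over from the sketch above), starting from the people alice follows:

db.follows.aggregate([
    { $match: { follower: "alice" } },
    { $graphLookup: {
        from: "follows",             // walk the same edge collection
        startWith: "$followee",      // begin from the people alice follows
        connectFromField: "followee",
        connectToField: "follower",
        as: "network",
        maxDepth: 0                  // 0 = one hop: who alice's followees follow
    } }
])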

Moving messaging schema to MongoDB

I have this schema for support of in-site messaging:
When I send a message to another member, the message is saved to the Message table; a record is added to the MessageSent table and a record per recipient is added to the MessageInbox table. MessageCount is used to keep track of the number of messages in the inbox/sent folders and is filled using insert/delete triggers on MessageInbox/MessageSent - this way I always know how many messages a member has without making an expensive "select count(*)" query.
Also, when I query a member's messages, I join to the Member table to get the member's FirstName/LastName.
Now, I will be moving the application to MongoDB, and I'm not quite sure what the collection schema should be. Because there are no joins available in MongoDB, I have to completely denormalize it, so I would have MessageInbox, MessageDraft and MessageSent collections with full message information, right?
Then I'm not sure about following:
What if a user changes his First/LastName? It will be stored denormalized as sender in some messages, as a part of Recipients in other messages - how do I update it in optimal ways?
How do I get message counts? There will be tons of requests at the same time, so it has to be performing well.
Any ideas, comments and suggestions are highly appreciated!
I can offer you some insight as to what I have done to simulate JOINs in MongoDB.
In cases like this, I store the ID of a corresponding user (or multiple users) in a given object, such as your message object in the messages collection.
(I'm not suggesting this be your schema; I'm just using it as an example of my approach.)
{
    _id: "msg1234",
    from: "user1234",
    to: "user5678",
    subject: "This is the subject",
    body: "This is the body"
}
I would query the database to get all the messages I need, then in my application I would iterate over the results and build an array of user IDs. I would filter this array down to unique values and then query the database a second time, using the $in operator, to find every user in that array.
Then in my application, I would join the results back to the message objects.
It requires two queries to the database (or potentially more if you want to join other collections), but this illustrates something many people have been advocating for a long time: do your JOINs in your application layer. Let the database spend its time querying data, not processing it. You can probably scale your application servers more quickly and cheaply than your database anyway.
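A sketch of that two-query join using the Node.js driver (collection names and the surrounding async context are my assumptions):

// Inside an async function, with `db` an open MongoDB database handle

// 1) Fetch the messages
const messages = await db.collection('messages')
    .find({ to: 'user5678' })
    .toArray();

// 2) Collect the unique user IDs referenced by those messages
const userIds = [...new Set(messages.flatMap(m => [m.from, m.to]))];

// 3) Fetch all referenced users in one shot with $in
const users = await db.collection('users')
    .find({ _id: { $in: userIds } })
    .toArray();

// 4) Join in the application layer
const byId = new Map(users.map(u => [u._id, u]));
const joined = messages.map(m => ({
    ...m,
    fromUser: byId.get(m.from),
    toUser: byId.get(m.to),
}));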
I am using this pattern to create real-time activity feeds in my application, and it works flawlessly and fast. I prefer this to denormalizing things that could change, like user information, because when writing to the database, MongoDB may need to rewrite the entire object if the new data doesn't fit in the old data's place. If I needed to rewrite hundreds (or thousands) of activity items in my database, it would be a disaster.
Additionally, writes in MongoDB are blocking, so if a scenario like the one I've just described were to happen, all reads and writes would be blocked until the write operation completed. I believe this is scheduled to be addressed in some capacity in the 2.x series, but it's still not going to be perfect.
Indexed queries, on the other hand, are super fast, even if you need to do two of them to get the data.
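For completeness, the indexes that keep those two queries fast (a sketch; the 'to' field is from the example above):

// The 'to' field drives the first query
db.messages.createIndex({ to: 1 })
// The users lookup matches on _id, which MongoDB indexes automatically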