Is it a good idea to store chat messages in a MongoDB collection?

I'm developing a chat app with Node.js, Redis, Socket.IO, and MongoDB. MongoDB comes last and is used for persisting the messages.
My question is what would be the best approach for this last step?
I'm afraid a collection with all the messages like
{
  id,
  from,
  to,
  datetime,
  message
}
can get too big too soon and become very slow to read. What do you think?
Is there a better approach you already worked with?

In MongoDB, you store your data in the format you will want to read it later.
If what you read from the database is a list of messages filtered on the 'to' field and with a dynamic datetime filter, then this schema is the perfect fit.
Don't forget to add an index on the fields you will be querying on; then it will be reasonably fast to query them, even over millions of records.
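A minimal sketch of such an index and query in the mongo shell (the collection name and concrete values are assumptions; the field names follow the schema above):

// Compound index matching the query pattern: filter on 'to', range/sort on 'datetime'.
db.messages.createIndex({ to: 1, datetime: -1 });

// The most recent messages sent to a given user within a dynamic time window.
db.messages.find({
  to: "user_b",
  datetime: { $gte: ISODate("2024-01-01") }
}).sort({ datetime: -1 }).limit(50);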
If, for example, you always show the full history for a single day, you could instead store all messages for that day in one document. If both types of queries occur a lot, you could even store your messages in both formats.
If storage is an issue, you could also use a capped collection (which discards the oldest messages once a fixed size is reached) or a TTL index, which will automatically delete messages older than, e.g., one year.
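A rough sketch of that TTL index in the mongo shell, assuming datetime is stored as a BSON date:

// Documents are removed automatically once datetime is older than ~1 year.
db.messages.createIndex({ datetime: 1 }, { expireAfterSeconds: 60 * 60 * 24 * 365 });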

I think the db structure is fine, the way you mentioned in your question.
You may assign a unique id to the chat between each pair of users and keep it in each chat record. Retrieve based on that id when you want to show the conversation.
Say 12 is the unique id for the chat between A and B; retrieval should then be based on 12 when you want to show the chat between A and B.
So your db structure can be like:
{
  id,
  from,
  to,
  datetime,
  message,
  uid
}
Remember, you can optimize retrieval by using a limit (say, 100 messages at a time). If the user scrolls beyond those 100, retrieve the next 100. This avoids a lot of unnecessary reads.
When using a limit, retrieve based on the creation date and use sort with the find query as well.
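A rough sketch of that paged retrieval in the mongo shell (the uid value and page size are assumptions; field names follow the schema above):

// First page: the 100 most recent messages of the conversation with uid 12.
db.messages.find({ uid: 12 }).sort({ datetime: -1 }).limit(100);

// Next page: continue from the oldest datetime already shown
// (range-based paging stays fast where large skips would not).
db.messages.find({ uid: 12, datetime: { $lt: oldestShownDatetime } })
  .sort({ datetime: -1 })
  .limit(100);

// Index supporting both queries.
db.messages.createIndex({ uid: 1, datetime: -1 });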

Just a thought here: are the messages plain text, or are you allowed to share images and videos as well?
If it's the latter, then storing all the chats for a single day in one document might not work out.
Actually, if image and video sharing is allowed, then you need to take the 16 MB document size restriction into account as well.

Related

Firestore array-not-contains alternative solution

TL;DR
I have created a Flutter Firestore posts application. I want to present to the user only new posts, which they haven't read yet.
How do I achieve that using a Firestore query?
The problem
Each time a user sees a post, their id is added to the post's views field.
The next time the user opens the app, I want to present only the posts they haven't read yet.
The problem is that an array-not-contains query is not supported. How do I achieve that functionality?
You're going to have a real hard time with this because Firestore can only give you documents where you know something about the contents of that document. That's how indexes work - by recording data present in the document. Indexes don't track data not present in a document because that's basically an infinite amount of data.
If you are trying to track documents seen by the user, you would think to mark the document as "seen" using a boolean per user, or to track the document ID somewhere. But as you can see, you can only query for documents that the user has seen, because that's the data present in the system.
What you can do is query for all documents, then query for all the documents the user has seen, then subtract the seen documents from all documents in order to get the unseen documents. But this probably doesn't scale in a way you'd like. (It's essentially the same problem with Firestore indexes not being able to surface documents without some known data present. Firestore won't do the equivalent of a SQL table scan, since that would be a lot of reads you'd have to pay for.)
You can kind of fake it by making sure there is a creation timestamp in each document, and record for each user the timestamp of the most recent seen document. If you require that the user must view the documents in chronological order, then you can simply query for documents with a creation timestamp greater than the timestamp of the latest document seen by the user. This is really as good as it's going to get with Firestore, since you can't query for the absence of data.
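A rough sketch of that approach with the Firestore JavaScript SDK (the Flutter/Dart API is analogous; createdAt and lastSeenTimestamp are assumed names):

// Assumes each post stores a createdAt timestamp and the user's profile
// stores the createdAt of the newest post they have already seen.
const snapshot = await db.collection('posts')
  .where('createdAt', '>', lastSeenTimestamp)
  .orderBy('createdAt')
  .get();
const unseenPosts = snapshot.docs.map(doc => doc.data());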

Modeling data in MongoDB for a chat application

I'd like to use MongoDB to store chat messages as part of a chat application. The database will be used to display chat history to users joining a channel.
I am trying to determine the best way to model this data in the database. The application is a simple chat app which contains numerous channels that users can chat in. Here are a few options I've considered:
A Messages Collection containing a document for every message. This is easy to implement; however, with any significant usage, many documents would be created.
A Channels Collection containing a document for every channel. This would result in fewer documents. Messages would be stored as an array on a channel document.
Which of these options is preferred, and why? Is there a better option not listed here?
There are many, many ways to go about modeling something like this. There is no generic "best way," as it really depends on how you plan on using the data, how the app is going to function, etc. However, there are a few things to consider with your approach.
First, having a lot of documents is not an issue. That's what Mongo does - it's great at storing lots of documents. I am a strong advocate of modularity, as it makes things more flexible. I reflect that mindset in my database by separating data as much as possible, and then using references to populate as needed.
This means you have to do more population, but in the end it prevents you from pigeonholing yourself into having to do things a certain way.
So for your example in particular, a good way would be to combine what you've mentioned above: have a Messages collection with a document for every message, and then a Channels collection which stores an array of Message IDs (not the messages themselves).
Why is this useful? I'm assuming you will want to load a Channel, but not all 2,000 messages that are in it. You probably want to load the first 50, and then load more via infinite scroll or something.
This allows you to fetch a Channel document and then populate the first 50 messages. Then you can incrementally fetch 50 more messages at a time if needed.
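A sketch of that incremental load in mongo shell style (channelId, messageIds, and the page size of 50 are assumptions, not a fixed schema):

// Fetch the channel, then resolve only the 50 most recent message IDs.
const channel = db.channels.findOne({ _id: channelId });
const pageIds = channel.messageIds.slice(-50);

const messages = db.messages.find({ _id: { $in: pageIds } })
  .sort({ datetime: 1 })
  .toArray();
// For infinite scroll, take the next 50 IDs from the array and repeat.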
If you store all of the messages in that array, your Channel document is going to get very, very large.
This also allows a user to edit their Message without editing the Channel document in any way - this is very important!
Having a separate Message schema also allows you to do things like fetch all the messages from a single user. You'll probably want to have a reference in the Message to a User ID.
There is a lot to consider when modeling data like this, but the important things to think about are "How am I going to need to fetch this data?" and "How will I need to modify this data?" Then figure out if your current format makes one of those things difficult.

Twitter exercise with MongoDB and lack of transactions?

I was trying to figure out whether MongoDB needs transactions and why you wouldn't have everything in a single document. I also know Twitter uses HBase, which does have transactions, so I thought about a tweet and its watchers.
If I post a tweet, it will be inserted with no problem. But how would I or anyone else find my tweet? I heard MongoDB has indexes, so maybe I can index the author and find my tweet; however, I can't imagine that being efficient if everyone does that. Also, time has to be indexed.
So from what I understand (I think I saw some slides Twitter released), Twitter has a 'timeline': every time a person tweets, Twitter inserts the tweet id into every follower's timeline, which is indexed by date, and when a given user browses, it grabs the available tweets sorted by time.
How would that be done in MongoDB? The only solution I can think of is having a field in the tweet document like {SendOut: DateStamp} which is removed once the fan-out is completed. If it didn't complete on the first attempt (checking the timestamp to guess whether it should have completed by now), I would need to check all the watchers to see who hasn't received it and insert it for those who didn't. But since there are no transactions, I guess I would also need to index the SendOut field? Would this solution work? How would I efficiently insert a tweet and deliver it to everyone watching the user (if this solution would not work)?
It sounds like you're describing a model similar to pub/sub. Couldn't you instead just track, on each user object, the date of the last post the user read? Users would request tweets the same way, using various indexes including time.
I'm not sure what you need transactions for, but Mongo does support atomic operations.
[Updated]
So in other words, each user's object stores the dateTime of the last tweet read/delivered. Obviously you would also need the list of subscribed author IDs. To fetch new tweets, you would ask for tweets indexed by both the author_id and time properties and then sort by time.
By using the last read date from the user object and using it as the secondary index into your tweets collection, I don't believe you need either pub/sub or transactions to do it.
I might be missing something though.
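A minimal sketch of that pull-based query (collection and field names, and where followedAuthorIds/lastReadTime come from, are assumptions):

// followedAuthorIds and lastReadTime are read from the requesting user's document.
db.tweets.find({
  author_id: { $in: followedAuthorIds },
  time: { $gt: lastReadTime }
}).sort({ time: 1 });

// Compound index supporting the query.
db.tweets.createIndex({ author_id: 1, time: 1 });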

How should I structure the data schema of MongoDB?

I have a chatroom system, and I want to use MongoDB as the backend database. The following are the entities:
Room - a chatroom (room_id)
User - a chatting user in chatroom (room_id, user_name)
Msg - a message in chatroom (room_id, user_name, message)
For designing the schema, I have some ideas. First: three collections - room, user and msgs - with a parent reference in the user and msg documents.
Another idea is to create collections for each room, such as:
db.chatroom.victor
db.chatroom.victor.users
db.chatroom.victor.msgs
db.chatroom.john
db.chatroom.john.users
db.chatroom.john.msgs
db.chatroom.tom
db.chatroom.tom.users
db.chatroom.tom.msgs
...
I think if I can divide the documents into different collections, it would be much more efficient to query. Also, I can use capped collections to limit the count of messages in each room. However, I am not familiar with MongoDB, and I'm not sure whether there are any side effects to doing that, or whether there is a performance problem with creating lots of collections. Is there any guideline for designing a MongoDB schema?
Thanks.
You should always design your schema by answering two questions:
What data should I store? (temporarily/permanently)
How will I access that data? (lots of reads on this, lots of writes on that, random reads/writes here)
You don't want to embed high-access-rate data in a document (like chat messages that are accessed by every user every second or so); it's better to keep it as a separate collection.
On the other hand, the collection of users in a chat room changes rather rarely - you can definitely embed that (see the sketch below).
Just design using common sense and you'll be fine.
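A small sketch of that split, with the rarely-changing user list embedded in the room document and the high-churn messages in their own collection (all names and values here are assumptions, not a fixed schema):

// Room document with its user list embedded.
db.rooms.insertOne({ _id: "victor", users: ["alice", "bob"] });

// Messages live in a separate collection, keyed by room.
db.messages.insertOne({
  room_id: "victor",
  user_name: "alice",
  message: "hi",
  createdAt: new Date()
});
db.messages.createIndex({ room_id: 1, createdAt: -1 });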
You definitely do not want to embed messages inside of other documents. They need to be stored as individual documents.
I say this because MongoDB allocates a certain amount of space for every document it writes. When it writes a document, it takes its current size and adds some empty space (padding) to the document so that if it is actually 1k large, it may become 1.5k large to leave space for the document to grow in size.
Chat messages will almost definitely each be larger than the allocated free space. Multiple messages will absolutely be larger than the free space.
The problem is that when a document no longer fits in its current location on disk/in memory because you tried to embed another document inside of it (via an update), the database must read that document from disk/memory and rewrite the entire thing at the tail end of the data file.
This causes a lot of disk activity that should otherwise not exist - that added I/O will destroy the performance of the database.
Think about all of the use cases. What kind of queries do you want to execute?
"Show me the rooms where a user is chatting". This won't work with the current schema and you have to add a list of rooms to the user or a in a separate collection to make it work.
"Show me all the messages a user sent in all of the rooms". This, again, won't work.
"Delete all the idle users from all the rooms". With the proposed schema you have to run this query on every room.users collection.
Also, do some approximation of the size of your collections. If you have 100 rooms with at most 1,000 users each, that's 100,000 entries for a collection where you store all these room/user mappings. With an index on room and user, that shouldn't be an issue; you don't need separate collections.
With 100 users per room you can even embed this in the room object as an array. Just make sure there is an index.
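A rough sketch of that room/user mapping collection (collection and field names are assumptions):

// One document per (room, user) membership.
db.roomUsers.insertOne({ room_id: "victor", user_name: "alice", joinedAt: new Date() });

// Indexes so that "who is in this room" and "which rooms is this user in"
// are both cheap lookups.
db.roomUsers.createIndex({ room_id: 1, user_name: 1 });
db.roomUsers.createIndex({ user_name: 1 });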

Moving messaging schema to MongoDB

I have this schema for support of in-site messaging:
When I send a message to another member, the message is saved to the Message table; a record is added to the MessageSent table and a record per recipient is added to the MessageInbox table. MessageCount is used to keep track of the number of messages in the inbox/sent folders and is maintained by insert/delete triggers on MessageInbox/MessageSent - this way I can always know how many messages a member has without making an expensive "select count(*)" query.
Also, when I query member's messages, I join to Member table to get member's FirstName/LastName.
Now, I will be moving the application to MongoDB, and I'm not quite sure what the collection schema should be. Because there are no joins available in MongoDB, I have to completely denormalize it, so I would have MessageInbox, MessageDraft and MessageSent collections with full message information, right?
Then I'm not sure about following:
What if a user changes his First/LastName? It will be stored denormalized as the sender in some messages and as part of the Recipients in other messages - how do I update it in an optimal way?
How do I get message counts? There will be tons of requests at the same time, so it has to be performing well.
Any ideas, comments and suggestions are highly appreciated!
I can offer you some insight as to what I have done to simulate JOINs in MongoDB.
In cases like this, I store the ID of a corresponding user (or multiple users) in a given object, such as your message object in the messages collection.
(I'm not suggesting this be your schema; I'm just using it as an example of my approach.)
{
  _id: "msg1234",
  from: "user1234",
  to: "user5678",
  subject: "This is the subject",
  body: "This is the body"
}
I would query the database to get all the messages I need; then, in my application, I would iterate over the results and build an array of user IDs. I would filter this array down to unique values and then query the database a second time, using the $in operator, to find every user in the given array.
Then in my application, I would join the results back to the object.
It requires two queries to the database (or potentially more if you want to join other collections) but this illustrates something that many people have been advocating for a long time: Do your JOINs in your application layer. Let the database spend its time querying data, not processing it. You can probably scale your application servers quicker and cheaper than your database anyway.
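A sketch of that two-query, application-level join with the Node.js driver (collection names follow the example above; userId and the page size are assumptions):

// 1) Fetch a page of messages addressed to this user.
const messages = await db.collection('messages')
  .find({ to: userId })
  .sort({ _id: -1 })
  .limit(50)
  .toArray();

// 2) Collect the unique sender IDs and fetch those users in one query.
const senderIds = [...new Set(messages.map(m => m.from))];
const users = await db.collection('users')
  .find({ _id: { $in: senderIds } })
  .toArray();

// 3) Join the results back together in the application layer.
const usersById = new Map(users.map(u => [u._id, u]));
const inbox = messages.map(m => ({ ...m, sender: usersById.get(m.from) }));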
I am using this pattern to create real-time activity feeds in my application and it works flawlessly and fast. I prefer this to denormalizing things that could change, like user information, because when writing to the database, MongoDB may need to re-write the entire object if the new data doesn't fit in the old data's place. If I needed to rewrite hundreds (or thousands) of activity items in my database, it would be a disaster.
Additionally, writes in MongoDB are blocking, so if a scenario like I've just described were to happen, all reads and writes would be blocked until the write operation is complete. I believe this is scheduled to be addressed in some capacity for the 2.x series, but it's still not going to be perfect.
Indexed queries, on the other hand, are super fast, even if you need to do two of them to get the data.