Moving messaging schema to MongoDB

I have this schema for support of in-site messaging:
When I send a message to another member, the message is saved to the Message table; a record is added to the MessageSent table, and a record per recipient is added to the MessageInbox table. MessageCount is used to keep track of the number of messages in the inbox/sent folders and is maintained by insert/delete triggers on MessageInbox/MessageSent - this way I always know how many messages a member has without running an expensive "select count(*)" query.
Also, when I query a member's messages, I join to the Member table to get the member's FirstName/LastName.
Now, I will be moving the application to MongoDB, and I'm not quite sure what the collection schema should be. Because there are no joins available in MongoDB, I have to completely denormalize it, so I would have MessageInbox, MessageDraft and MessageSent collections with full message information, right?
Then I'm not sure about the following:
What if a user changes his First/LastName? It will be stored denormalized as the sender in some messages and as part of Recipients in other messages - how do I update it efficiently?
How do I get message counts? There will be tons of requests at the same time, so it has to perform well.
Any ideas, comments and suggestions are highly appreciated!

I can offer you some insight as to what I have done to simulate JOINs in MongoDB.
In cases like this, I store the ID of a corresponding user (or multiple users) in a given object, such as your message object in the messages collection.
(I'm not suggesting this be your schema, just using it as an example of my approach)
{
    _id: "msg1234",
    from: "user1234",
    to: "user5678",
    subject: "This is the subject",
    body: "This is the body"
}
I would query the database to get all the messages I need then in my application I would iterate the results and build an array of user IDs. I would filter this array to be unique and then query the database a second time using the $in operator to find any user in the given array.
Then in my application, I would join the results back to the object.
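Here is a minimal sketch of that two-query pattern in the mongo shell, assuming the example schema above plus a "users" collection; all collection and field names are illustrative, not prescriptive:

var messages = db.messages.find({ to: "user5678" }).toArray();

// Build a unique array of sender IDs from the first result set.
var userIds = [];
messages.forEach(function (msg) {
    if (userIds.indexOf(msg.from) === -1) userIds.push(msg.from);
});

// Second query: fetch every referenced user in one shot with $in.
var users = db.users.find({ _id: { $in: userIds } }).toArray();

// "Join" in the application layer by attaching each sender document.
var usersById = {};
users.forEach(function (u) { usersById[u._id] = u; });
messages.forEach(function (msg) { msg.sender = usersById[msg.from]; });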
It requires two queries to the database (or potentially more if you want to join other collections), but this illustrates something many people have been advocating for a long time: do your JOINs in your application layer. Let the database spend its time querying data, not processing it. You can probably scale your application servers more quickly and cheaply than your database anyway.
I am using this pattern to create real-time activity feeds in my application and it works flawlessly and fast. I prefer this to denormalizing things that could change, like user information, because when writing to the database, MongoDB may need to rewrite the entire object if the new data doesn't fit in the old data's place. If I needed to rewrite hundreds (or thousands) of activity items in my database, it would be a disaster.
Additionally, writes in MongoDB are blocking, so if a scenario like the one I've just described were to happen, all reads and writes would be blocked until the write operation completed. I believe this is scheduled to be addressed in some capacity in the 2.x series, but it's still not going to be perfect.
Indexed queries, on the other hand, are super fast, even if you need to do two of them to get the data.

Related

Modeling data in MongoDB for a chat application

I'd like to use MongoDB to store chat messages as part of a chat application. The database will be used to display chat history to users joining a channel.
I am trying to determine the best way to model this data in the database. The application is a simple chat app which contains numerous channels that users can chat in. Here are a few options I've considered:
A Messages Collection containing a document for every message. This is easy to implement, however with any significant usage many documents would be created.
A Channels Collection containing a document for every channel. This would result in fewer documents. Messages would be stored as an array on a channel document.
Which of these options is preferred, and why? Is there a better option not listed here?
There are many, many ways to go about modeling something like this. There is no generic "best way," as it really depends on how you plan on using the data, how the app is going to function, etc. However, there are a few things to consider with your approach.
First, having a lot of documents is not an issue. That's what Mongo does - it's great at storing lots of documents. I am a strong advocate of modularity, as it makes things more flexible. I reflect that mindset in my database by separating data as much as possible, and then using references to populate as needed.
This means you have to do more population, but in the end it prevents you from pigeonholing yourself into having to do things a certain way.
So for your example in particular, a good way would be to combine what you've mentioned above: Have a Messages collection which creates a document for every message. Then have a Channels collection which stores an array of Message IDs (not the message itself).
Why is this useful? I'm assuming you will want to load a Channel, but not all 2,000 messages that are in it. You probably want to load the first 50, and then load more via infinite scroll or something.
This allows you to fetch a Channel document and then populate the first 50 messages. Then you can incrementally fetch 50 more messages at a time if needed.
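Here is a rough sketch of that pattern; the collection and field names (channels, messages, messageIds, createdAt) are assumptions for illustration:

var channel = db.channels.findOne({ _id: "channel123" });

// First page: the 50 most recent message IDs stored on the channel.
var page = channel.messageIds.slice(-50);
var messages = db.messages.find({ _id: { $in: page } })
    .sort({ createdAt: 1 })
    .toArray();

// For infinite scroll, slice the next window (-100 to -50) and repeat.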
If you store all of the messages in that array, your Channel document is going to get very, very large.
This also allows a user to edit their Message without editing the Channel document in any way - this is very important!
Having a separate Message schema also allows you to do things like fetch all the messages from a single user. You'll probably want to have a reference in the Message to a User ID.
There is a lot to consider when modeling data like this, but the important things to think about are "How am I going to need to fetch this data?" and "How will I need to modify this data?" Then figure out if your current format makes one of those things difficult.

Is it ok to use Meteor publish composite for dozens of subscriptions?

Currently, our system is not entirely normalized, and we use meteor-publish-composite to obtain the normalized data in MongoDB. Some models have very few dependencies, but others have arrays of objects (i.e. sub-documents) with a few foreign keys that we subscribe to when fetching each model.
An example would be a Post containing a list of Comment sub-documents, where each comment has a userId field.
My question is, while I know it would be faster to use collection hooks and update the collection with data denormalization, how does Meteor handle multiple subscriptions on the same collection?
Would a hundred subscriptions on the same collection affect the application speed (significantly)? What about a thousand, etc.?
This may not fully answer your question; however, after spending countless hours tuning the performance of a large Meteor app, I thought I would share some of the things I have learned.
In Meteor, when you define a publication, you are setting up a reactive query that continues to push data to subscribed clients when changes to the underlying mongo data causes the result of the query to change. In other words, it sets up a query that will continually push data to clients as the data is inserted, updated, or removed. The mechanism by which it does this is by creating an observer on the query.
When an observer is initialized (e.g. when a publication is first subscribed to), it will query MongoDB for the initial dataset to send down and then use the oplog to detect changes going forward. Fortunately, Meteor is able to re-use an existing observer for a new subscription if the query is for the same collection, same selectors, and same options.
This means that you could create hundreds of subscriptions against many different publications, but if they are hitting the same collection and using the same query selectors, then you effectively have only one observer in play. For more details, I highly recommend reading this article from kadira.io (from which I acquired the information I used in this answer).
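As a hedged sketch (the names here are illustrative, not from the question), two clients subscribing with the same collection, selector, and options will share a single server-side observer:

var Comments = new Mongo.Collection("comments");

Meteor.publish("commentsForPost", function (postId) {
    // Identical collection + selector + options across subscriptions
    // lets Meteor re-use one observer under the hood.
    return Comments.find({ postId: postId }, { sort: { createdAt: -1 } });
});

// On the client, many components can subscribe without creating
// additional observers on the server:
Meteor.subscribe("commentsForPost", "post123");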
In addition to this, Meteor is also able to deal with multiple publications publishing the same document, and when this occurs, the documents will be merged into one. See this for more detail.
Lastly, because of Meteor's MergeBox component, it will minimize the data being sent over the wire across all your subscriptions by keeping track of what data changed vs. what is already on the client.
Therefore, in your specific example, it sounds like you will be running several different subscriptions over effectively the same query (since you are just trying to denormalize your data) and dataset. Because of all the optimizations described above, I would guess that you won't be plagued by performance issues with this approach.
I have done similar things in one of my apps and have never had an issue.

Is dynamically creating and dropping collections in MongoDB going to create scalability issues?

I have an application (built in Meteor) that provides some ad hoc reporting capabilities to the end user. I have built up that functionality by using the aggregation pipeline to produce the results for a given query. This makes it extremely fast and I was using $out to push the results right into a results table.
The results table included a queryID, which the client used to figure out which were the correct results.
Unfortunately, as you may know (and I discovered), that doesn't work so well once you have more than one user running reports at a time, because $out deletes the whole results table before pushing in the results of the new query.
I see three possible workarounds:
Run the aggregation, but manually push the results into the results collection
$out the results into a temporary collection (dynamically named to avoid conflicts) and then manually copy the results from there into results collection, immediately dropping the temporary one. This made some sense when I thought I could use copyTo(), but that doesn't appear possible within Meteor, so I think this option doesn't make much sense relative to #1 in this case.
$out the results into a temporary collection (dynamically named to avoid conflicts) and have the client pull its results directly from there. I would then periodically drop the extra collections after say 24 hours (like I do with specific query results in the main collection today).
#3 would be the fastest by far - the time it takes to manually copy rows dwarfs the time it takes the queries to run. But I'm concerned about the impact of creating and dropping so many collections.
We're not talking millions of users here, but if an average of 500 users a day were each running 10-20 reports, there could be an additional 5-10k collections in the database at any one time. That seems like a lot. Perhaps I could be smarter about cleaning them up somehow, though I can't just immediately remove them because a user might want to have multiple tabs open with different reports. Even still, we're potentially talking about hundreds to thousands of collections.
Is that going to be a problem?
Are there other approaches I should consider instead?
Other recommendations?
Thanks!
Dropping a collection in MongoDB is a very efficient operation, and in any case much more efficient than deleting some documents from a larger collection.
The maximum number of collections is quite high: it is limited only by namespace size in MMAPv1, while no hard limit exists in the WiredTiger engine.
So I would favor your solution #3.
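A rough sketch of what #3 can look like in the mongo shell; the "events" collection, the pipeline stages, and the naming scheme are all illustrative assumptions:

var queryId = ObjectId().str;
var resultName = "results_" + queryId;

db.events.aggregate([
    { $match: { type: "signup" } },                    // report-specific filters
    { $group: { _id: "$day", count: { $sum: 1 } } },
    { $out: resultName }                               // replaces only this collection
]);

// The client reads db.getCollection(resultName); a periodic cleanup job
// drops it after the retention window (e.g. 24 hours):
db.getCollection(resultName).drop();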
Some improvements/alternatives you can think of:
Consider creating the collections in a separate database (say, one per day); then you can drop the entire database in a single operation without having to drop individual collections.
Use an endpoint for the result set, cache the results, then drop the $out collection. Let the cache handle user requirements and only rerun the aggregation if the cache has expired.
This kind of activity is done very easily in relational databases such as MySQL or PostgreSQL. You might consider synchronising your data to a separate relational database for the purposes of reporting.
There is a package, https://github.com/perak/mysql-shadow, which claims to provide synchronisation. I played with it and it didn't work perfectly, although doing just a one-way sync is more likely to succeed.
The other option is to use GraphQL over a Mongo/MySQL hybrid database, which can be done with the Apollo stack: http://www.apollodata.com/

Is a good idea to store chat messages in a mongodb collection?

I'm developing a chat app with Node.js, Redis, Socket.IO and MongoDB. MongoDB comes last, for persisting the messages.
My question is what would be the best approach for this last step?
I'm afraid a collection with all the messages like
{
    id,
    from,
    to,
    datetime,
    message
}
can get too big too soon and become very slow to read. What do you think?
Is there a better approach you already worked with?
In MongoDB, you store your data in the format in which you will want to read it later.
If what you read from the database is a list of messages filtered on the 'to' field and with a dynamic datetime filter, then this schema is the perfect fit.
Don't forget to add an index on the fields you will be querying on; then it will be reasonably fast to query them, even over millions of records.
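For example, a compound index matching the access pattern described here (filter on "to", range filter and sort on "datetime") might look like this sketch; the field names come from the schema in the question:

db.messages.createIndex({ to: 1, datetime: -1 });

// An example query this index serves:
db.messages.find({ to: "user5678", datetime: { $gte: ISODate("2024-01-01") } })
    .sort({ datetime: -1 });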
If, for example, you always show the full history of a whole day, you would store all messages for a single day in one document. If both types of queries occur a lot, you could even store your messages in both formats.
If storage is an issue, you could also use a capped collection, which discards the oldest documents once a size limit is reached, or a TTL index, which automatically deletes messages older than, e.g., one year.
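A minimal sketch of the age-based cleanup, using a TTL index on the "datetime" field from the question's schema (the one-year window is just an example):

db.messages.createIndex({ datetime: 1 }, { expireAfterSeconds: 60 * 60 * 24 * 365 });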
I think the db structure is fine the way you mentioned it in your question.
You may assign a unique id to the chat between each pair of users and keep it in each chat record, then retrieve based on that id when you want to show the conversation.
Say 12 is the unique id for the chat between A and B; retrieval should then be based on 12 whenever you want to show the chat between A and B.
So your db structure can look like:
{
    id,
    from,
    to,
    datetime,
    message,
    uid
}
Remember, you can optimize retrieval by applying a limit (say, 100 messages at a time). If the user scrolls beyond 100, retrieve the next 100 chats. This avoids a lot of unnecessary reads.
When using a limit, retrieve based on creation date and use sort with the find query as well.
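A hedged sketch of that retrieval; "uid" is the per-pair chat id from this answer, and the other names are assumptions:

var page = 0;  // increment as the user scrolls back through history
db.messages.find({ uid: 12 })
    .sort({ datetime: -1 })
    .skip(page * 100)
    .limit(100);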
Just a thought here: are the messages plain text, or are you allowed to share images and videos as well?
If it's the latter, then storing all the chats for a single day in one document might not work out.
If image and video shares are allowed, you also need to take the 16 MB document size restriction into account.

How should I structure the data schema of MongoDB?

I have a chatroom system, and I want to use MongoDB as the backend database. The following are the entities:
Room - a chatroom (room_id)
User - a chatting user in chatroom (room_id, user_name)
Msg - a message in chatroom (room_id, user_name, message)
For designing the schema, I have some ideas: First, 3 collections - room, user and msgs - and there is a parent reference in user and msg documents.
Another idea is to create collections for each room. Such as
db.chatroom.victor
db.chatroom.victor.users
db.chatroom.victor.msgs
db.chatroom.john
db.chatroom.john.users
db.chatroom.john.msgs
db.chatroom.tom
db.chatroom.tom.users
db.chatroom.tom.msgs
...
I think that if I can divide the documents into different collections, it will be much more efficient to query. Also, I can use capped collections to limit the number of messages in each room. However, I am not familiar with MongoDB, and I'm not sure if there are any side effects to doing that, or whether there are performance problems with creating lots of collections. Is there any guideline for designing a MongoDB schema?
Thanks.
You should always design your schema by answering two questions:
What data should I store? (temporarily/permanently)
How will I access that data? (lots of reads on this, lots of writes on that, random reads/writes here)
You don't want to embed high-access-rate data in a document (like chat messages that are accessed by every user every second or so); it's better to have it as a separate collection.
On the other hand, the collection of users in a chat room changes rather rarely - you can definitely embed that.
Just design using common sense and you'll be fine.
You definitely do not want to embed messages inside of other documents. They need to be stored as individual documents.
I say this because MongoDB allocates a certain amount of space for every document it writes. When it writes a document, it takes its current size and adds some empty space (padding), so a document that is actually 1k large may become 1.5k large, leaving room for it to grow.
Chat messages will almost definitely each be larger than the allocated free space. Multiple messages will absolutely be larger than the free space.
The problem is that when a document no longer fits in its current location on disk/in memory because you embedded another document inside it (via an update), the database must read that document and rewrite the entire thing at the tail end of the data file.
This causes a lot of disk activity that should otherwise not exist - that added I/O will destroy the performance of the database.
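To illustrate the contrast (collection and field names follow the question's entities; the rest is assumed): pushing each message into an embedded array keeps growing the room document, while inserting messages as separate documents never rewrites the room:

// Growing an embedded array - may relocate and rewrite the whole document:
db.rooms.update(
    { room_id: "victor" },
    { $push: { msgs: { user_name: "tom", message: "hi" } } }
);

// One document per message - the room document is never touched:
db.msgs.insert({ room_id: "victor", user_name: "tom", message: "hi" });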
Think about all of the use cases. What kind of queries do you want to execute?
"Show me the rooms where a user is chatting". This won't work with the current schema and you have to add a list of rooms to the user or a in a separate collection to make it work.
"Show me all the messages a user sent in all of the rooms". This, again, won't work.
"Delete all the idle users from all the rooms". With the proposed schema you have to run this query on every room.users collection.
Also, do some approximation of the size of your collections. If you have 100 rooms with a maximum of 1000 users each, that's 100,000 entries for a collection storing all these mappings. With an index on room and user, that shouldn't be an issue, and you don't need separate collections.
With 100 users you can even embed this in the room object as an array. Just make sure there is an index.
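A small sketch of that mapping collection and index; the collection name "room_users" is an assumption:

db.room_users.createIndex({ room_id: 1, user_name: 1 });

// "Show me the rooms where a user is chatting" then becomes:
db.room_users.find({ user_name: "victor" });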