Modeling a Facebook-like chat/discussion between two users using MongoDB - mongodb

I am trying to model a Facebook-like chat/discussion between two users using MongoDB.
I came up with the following BSON structure for a message:
{
  _id: ObjectId(...),
  discussionId: ObjectId(...),
  from: {
    id: ObjectId(...),
    nickname: 'Joe',
    thumbnail: 'xoopp7788ee….jpg'
  },
  to: {
    id: ObjectId(...),
    nickname: 'Jane',
    thumbnail: 'rtolkj96547cc….jpg'
  },
  text: 'Hello Jane, How are you today?',
  posted: ISODate(...),
  viewed: ISODate(...),
  next: ObjectId(...),
  previous: ObjectId(...)
}
I have read the relevant MongoDB documentation, but I still want to ask my question because my model/problem is slightly different.
First, I am not sure whether the next and previous fields are necessary. Can I use just the posted field to thread the messages?
Second, can having just one document per message pose a performance issue? Would I be better off with several messages per document?
I am looking forward to reading your comments and suggestions.
Edit 1: the discussionId would be a unique ID for a given pair of users.

IMHO:
thumbnail: if it references an avatar, it belongs in the user profile. If it is a smiley, it should sit at the same level as text.
nickname: if from.id is a user ID, you don't need to repeat the nickname in every message, which saves space. It can be "injected" in the UI, or at the time the message is sent out, if needed. Duplicating data is sometimes useful, but here the benefit is dubious.
from and to: the discussion collection might already store from and to for each discussion. In that case you don't have to repeat both user IDs in each message. You could keep only from.
Group chat: if more than two people can take part in a discussion, a single from and to pair does not work. You might have to turn from and to into arrays. It depends on the type of chat you want to support.
next and previous: I would usually not have them. You can find a discussion thread by discussionId (or by from.id and to.id) and sort it by posted. Unless your chat works in some very unusual way, you can delete these two fields.
posted: storing a date is fine, but if you want to save more space, the creation time is also stored in _id (an ObjectId embeds a timestamp with one-second precision, and _id is always indexed), so you can avoid storing an extra field and creating an extra index.
Finally, if you decide to remove nickname and thumbnail, you can move from.id up a level as fromId.
Overall it looks good. I'm just nitpicking for performance purposes and small improvements.
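To illustrate the point about _id carrying the creation time, here is a minimal sketch in plain JavaScript that reads the timestamp out of an ObjectId's hex string and threads messages by it instead of by next/previous pointers. The sample ids and the helper names objectIdToSeconds and threadMessages are made up for illustration; in the mongo shell, ObjectId's own getTimestamp() gives the same information.

```javascript
// The first 4 bytes (8 hex characters) of an ObjectId are the creation time
// in seconds since the Unix epoch.
function objectIdToSeconds(hexId) {
  if (!/^[0-9a-fA-F]{24}$/.test(hexId)) {
    throw new Error("expected a 24-character hex ObjectId string");
  }
  return parseInt(hexId.slice(0, 8), 16);
}

// Hypothetical messages of one discussion, stored out of order.
const messages = [
  { _id: "507f1f79bcf86cd799439013", text: "Fine, thanks!" },
  { _id: "507f1f77bcf86cd799439011", text: "Hello Jane, How are you today?" },
];

// Thread the discussion by creation time; ties within the same second
// fall back to comparing the full id string.
function threadMessages(msgs) {
  return [...msgs].sort(
    (a, b) =>
      objectIdToSeconds(a._id) - objectIdToSeconds(b._id) ||
      a._id.localeCompare(b._id)
  );
}

console.log(threadMessages(messages).map((m) => m.text));
```

Because the embedded timestamp only has one-second resolution, keeping posted is still reasonable if messages sent within the same second must be ordered exactly.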

Related

Nosql database design - MongoDB

I am trying to build an app where I just have these 3 models:
topic (has just a title (max 100 chars.))
comment (has text (may be very long), author_id, topic_id, createdDate)
author (has just a username)
Actually a very simple db structure. A Topic may have many comments, which are created by authors. And an author may have many comments.
I am still trying to figure out the best way of designing the database structure (documents). At first I thought of giving everything its own schema as above, i.e. three kinds of documents. But since this is a NoSQL database, I should try to eliminate the need for joins. Now I am thinking of putting everything into a single document, which also sounds crazy.
These are my actual queries from the UI:
Homepage query: Listing all the topics, which have received the most comments today (will run very often)
Auto suggestion list for search field: Listing all the topics, whose title contains string "X"
Main page of a topic query: Listing all the comments of a topic, with their authors' username.
Since most of my queries need data from at least 2 documents, should I really just put it all together in a single document like this:
Comment (text, username, topic_title, createdDate)
This way I will not need any joins, but I will store, for example, the title of a topic multiple times, once in every comment.
I just could not decide.
I appreciate any help.
You can use the second design you suggested, but it all comes down to how you want to use the data. I assume you're going to use it for a website.
If you want the comments to be clickable, so that clicking the topic name redirects to the topic's page, or clicking the username redirects to the user's page where you can see all of his comments, I suggest you keep them as IDs. You can then use .populate("field1 field2") (in Mongoose) and select the fields you would like to get from each ID.
Alternatively, you can store both the topic_name and username and their IDs in the same document to reduce queries, but you would end up storing more redundant data.
Revised design:
The three queries (in the question post) are likely to be like this (pseudo-code):
select all topics from comments, where date is today, group by topic and count comments, order by count (desc)
select topics from comments, where topic matches search, group by topic.
select all from comments, where topic matches topic_param, order by comment_date (desc).
So, as you had intended (in your question post) it is likely there will be one main collection, comments.
comments:
date
author
text
topic
The user and topic collections, with one field each, are optional; they would only serve to maintain uniqueness.
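Note that the group-by queries will be aggregation queries. For example, the main (homepage) query will look like this:
db.comments.aggregate( [
  // Match the whole day; an equality match on the date would only hit
  // comments stamped exactly at midnight (assuming date holds a full timestamp).
  { $match: { date: { $gte: ISODate("2019-11-15"), $lt: ISODate("2019-11-16") } } },
  { $group: { _id: "$topic", count: { $sum: 1 } } },
  { $sort: { count: -1 } }
] )
This returns every topic that received comments on that day, together with its comment count, with the most-commented topics first.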
You could also take a slightly different approach. Storing information redundantly is not always a bad thing.
1. Homepage query: Listing all the topics, which have received the most comments today (will run very often)
You could implement this with two extra fields in your Topic entity: one holding the date the last comment was added, and one counting the comments added on that day. That way you do not need a join, and the query only has to look at the Topic collection.
You could also store these statistics independently of the other data and update them when required. Think of this as a document that describes the current state of your database (at least the parts relevant to you).
This adds a time penalty when writing, but it improves read times.
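The two-extra-fields idea can be sketched as follows. This is plain in-memory JavaScript, not a definitive implementation: the field names lastCommentDay and commentsThatDay and the helper names are assumptions chosen for illustration, and days are taken in UTC.

```javascript
// Day key such as "2019-11-15" (UTC).
function dayKey(date) {
  return date.toISOString().slice(0, 10);
}

// Called whenever a comment is written: pay a small cost on write.
function recordComment(topic, when) {
  const day = dayKey(when);
  if (topic.lastCommentDay !== day) {
    // First comment of a new day: restart the counter.
    topic.lastCommentDay = day;
    topic.commentsThatDay = 0;
  }
  topic.commentsThatDay += 1;
  return topic;
}

// Homepage query: only topics are inspected, no comments are read.
function mostCommentedToday(topics, now) {
  const today = dayKey(now);
  return topics
    .filter((t) => t.lastCommentDay === today)
    .sort((a, b) => b.commentsThatDay - a.commentsThatDay);
}
```

With a real collection, the same bookkeeping would typically happen in the update that accompanies each new comment, so the homepage read stays a simple filter and sort on the topic collection.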
2. Auto suggestion list for search field: Listing all the topics, whose title contains string "X"
As far as I understand, for this one you only need the topic titles, which means you can query the database once and retrieve all titles. If the collection grows so big that this becomes slow, you could instead re-run a retrieval query that returns only a subset (a user is not likely to go through 100 possible topics).
3. Main page of a topic query: Listing all the comments of a topic, with their authors' username.
This is actually the tricky one. If this is really what you want to do, then you are most likely best off storing all the data in one document. However, I would ask: what is the problem with making more than one query? I doubt you will show all comments at once when there are thousands (as you say). Instead of storing each comment in a separate document or throwing them all into one document, you could bucket them and retrieve only the 20 most recent ones (if you create buckets of size 20), then update the ones shown when required. This is known as the bucket pattern.
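A minimal sketch of bucketing, assuming buckets of 20 comments per document as suggested above. It works on an in-memory array of bucket documents; the shape { topicId, count, comments } and the function names are illustrative assumptions, not a prescribed schema.

```javascript
const BUCKET_SIZE = 20;

// buckets: bucket documents for one topic, oldest bucket first.
function addComment(buckets, topicId, comment) {
  let last = buckets[buckets.length - 1];
  if (!last || last.count >= BUCKET_SIZE) {
    // The newest bucket is full (or none exists yet): start a new one.
    last = { topicId, count: 0, comments: [] };
    buckets.push(last);
  }
  last.comments.push(comment);
  last.count += 1;
  return buckets;
}

// Walk backwards from the newest bucket until n comments are collected.
function latestComments(buckets, n = BUCKET_SIZE) {
  const out = [];
  for (let i = buckets.length - 1; i >= 0 && out.length < n; i--) {
    const c = buckets[i].comments;
    for (let j = c.length - 1; j >= 0 && out.length < n; j--) {
      out.push(c[j]);
    }
  }
  return out; // newest first
}
```

Showing the latest page of a topic then touches at most two bucket documents instead of 20 separate comment documents.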
You said:
"Since most of my queries need data from at least 2 documents, should I really just use them all together in a single document like this..."
I'll make an argument from a 'domain-driven design' point of view.
Given that all your data exists within the same bounded context (business domain), it is acceptable to encapsulate it all within the same document!

How to avoid inconsistent embedded documents

Having a bit of trouble understanding when and why to use embedded documents in a mongo database.
Imagine we have three collections: users, rooms and bookings.
I have a few questions about a situation like this:
1) How would you update the embedded document? Would it be the responsibility of the application developer to find all instances of kevin as an embedded document and update them?
2) If the solution is to use document references, is that as heavy as a relational db join? Is this just a case of the example not being a good fit for Mongo?
As always let me know if I'm being a complete idiot.
IMHO, you overdid it. Judging from your question, your use cases are:
For a given reservation, what room is booked by which user?
For a given user, what are his or her details?
How many beds does a given room provide?
I would go with the following model for rooms
{
  _id: 1001,
  beds: 2
}
for users
{
  _id: new ObjectId(),
  username: "Kevin",
  mobile: "12345678"
}
and for reservations
{
  _id: new ObjectId(),
  date: new ISODate(),
  user: "Kevin",
  room: 1001
}
Now in a reservation overview, you have all the relevant information ("who", "when" and "which") by simply querying reservations, without any overhead, which answers the first question from your use cases. In a reservation details view, admittedly, you would have to do two more queries, but they are lightning fast with proper indexing and, depending on your technology, can be done asynchronously, too. Note that I saved an index by using the room number as the _id. How to answer the remaining questions should be obvious.
So as per your original question: embedding is not necessary here, imho.
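As a rough illustration of that model, the sketch below uses in-memory Maps as stand-ins for the rooms and users collections. The sample reservation date and the function names overview and details are assumptions for the example; users are looked up by username because that is what the reservation stores.

```javascript
// Hypothetical in-memory stand-ins for the three collections.
const rooms = new Map([[1001, { _id: 1001, beds: 2 }]]);
const users = new Map([["Kevin", { username: "Kevin", mobile: "12345678" }]]);
const reservations = [
  { date: new Date("2015-04-17T00:00:00Z"), user: "Kevin", room: 1001 },
];

// Overview: answered from reservations alone, no extra lookups.
function overview(list) {
  return list.map(
    (r) => `${r.user} booked room ${r.room} on ${r.date.toISOString().slice(0, 10)}`
  );
}

// Details view: two extra lookups, one by username and one by room number.
function details(r) {
  return { date: r.date, user: users.get(r.user), room: rooms.get(r.room) };
}

console.log(overview(reservations));
console.log(details(reservations[0]));
```

In a real deployment the lookup by username would want an index on username, whereas the room lookup uses the _id index that exists anyway.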

Mongodb schema design for swipe card style application

What would be a good approach to designing the following swipe-card-style app with skip functionality?
The core functionality of the app I'm working on is as follows.
On the main page, a user first makes a query for the list of posts.
The list should be sorted by date in reverse chronological order, or by some kind of internal score that determines the active posts (those with a large number of votes or comments, etc.).
Each post is shown to the user one by one in the form of a card, like a Tinder- or Jelly-style feed.
For each card, the user can either skip or vote for it.
When the user has consumed all the fetched cards and queries again for the next items, cards already skipped or voted on by the current user should not appear again.
Here, the point is that a user could have a huge number of skipped or voted posts, since a user can only skip or vote for a post on the main page. (The user can browse these already processed items on his/her profile.)
The approaches I have thought about are:
1. Store the list of skipped or voted post IDs for each user somewhere and use them in the query with the $nin operator.
db.posts.find({ _id: {$nin: [postid1,...,postid999]} }).sort({ date: -1 })
2. Embed the userIds of all users who voted on or skipped the post in an array and query using the $ne operator.
{
  _id: 'postid',
  skipOrVoteUser: ['user1', 'user2' ...... 'user999'],
  date: 1429286816366
}
db.posts.find({ skipOrVoteUser: {$ne: 'user1'} }).sort({ date: -1 })
3. Maintain a feedCache for each user and fan out on write.
FeedCache
{
  userId: 'user1',
  posts: [{id:1, data: {..}}, {id:2, data: {...}},.... {id:3, data: {...}}]
}
Operations:
- When a user creates a post, write a copy of the post to the feed cache of every user in the system.
- Fetch posts from the user's feed cache.
- When the user votes on or skips a post, delete the post from his/her feed cache.
But since the list of posts that a user has skipped or voted on is ever-growing and could become really large over time, I'm concerned that the $nin query in approach 1 would be too slow with a large list.
With approach 2, since every user on the system (or many, depending on the filtering) could vote on or skip a post, the embedded user array of each post could also become really large (at most the number of all users), and the $ne query would perform poorly.
With approach 3, every created post triggers too many write operations, which won't be efficient.
What would be a good approach to designing a schema that supports this kind of functionality? I've tried to come up with a good solution and could not think of a better one. Please help me solve this problem. Thanks!
On a relational database I would use approach 1. It's an obvious choice, as you have good SQL operators for the task and you can easily optimize the query.
With document databases I would choose approach 2. In this case there is a good chance that the vote/skip list remains relatively small as the system grows.
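To make the behaviour of approach 2 concrete, here is an in-memory sketch of what the $ne query selects. On an array field, { $ne: value } matches only documents whose array does not contain the value. The sample posts, the limit parameter, and the helper names are assumptions for illustration; markProcessed mirrors what an update with $addToSet would do.

```javascript
// Hypothetical posts using the embedded-array schema from approach 2.
const posts = [
  { _id: "p1", skipOrVoteUser: ["user1", "user2"], date: 1429286816366 },
  { _id: "p2", skipOrVoteUser: ["user2"], date: 1429286900000 },
  { _id: "p3", skipOrVoteUser: [], date: 1429286700000 },
];

// In-memory equivalent of
//   find({ skipOrVoteUser: { $ne: userId } }).sort({ date: -1 }).limit(limit)
function nextCards(allPosts, userId, limit = 10) {
  return allPosts
    .filter((p) => !p.skipOrVoteUser.includes(userId))
    .sort((a, b) => b.date - a.date)
    .slice(0, limit);
}

// In-memory equivalent of adding the user with $addToSet (no duplicates).
function markProcessed(post, userId) {
  if (!post.skipOrVoteUser.includes(userId)) {
    post.skipOrVoteUser.push(userId);
  }
  return post;
}

console.log(nextCards(posts, "user1").map((p) => p._id));
```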

MongoDB schema design to support editing subdocuments within an array in multi-user environment?

Let's suppose I have a basic blog web app using the following document schema for a blog post.
{
  _id: ObjectId(...),
  title: "Blog Post #1",
  text: "<p>This is my blog post!</p>",
  comments: [
    {
      user: "username1",
      time: Date(...),
      text: "This is a great blog post!"
    },
    {
      user: "username2",
      time: Date(...),
      text: "This is even better than sliced bread!"
    }
  ]
}
That's all well and good, but now let's suppose that a user can edit or delete his comment. On top of that, it's a web app, so multiple people could be editing or deleting their comments at the same time. Now suppose I am logged in as "username2" and try to edit my comment, which is the 2nd item in the comments array (index position 1). Just before I click "save", username1 logs in and deletes his comment, which is the 1st item in the array. If my code tries to update username2's comment by index position, it will fail because there are no longer 2 items in the array.
Two ideas came to mind, but I'm not crazy about either one.
create some sort of id on each comment
create a "lastModified" timestamp on the parent document, and only save the edit if nothing has changed on the document.
What is the best way to handle this type of situation? If I really need an id on each comment, will I have to generate it myself? What data type should it be? Or would it be best to use both of my ideas together? Or is there another option I'm not even thinking about?
Having different writers is a key downside of embedding documents, in my opinion. You might want to take a look at this discussion that presents different solutions. I'd try to avoid different writers to one document and use a separate Comments collection instead, where each comment is owned by its author. You can fetch all comments on a post by an indexed postId field reasonably fast. Then the comments simply have a regular _id field. It makes sense to use an ObjectId because it automatically stores the time the comment was created, its values increase roughly monotonically, and _id is indexed by default.
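A small sketch of why per-comment ids remove the index-position problem from the question. The comments live in their own list (standing in for a separate collection) and are addressed by _id plus owner; the ids "c1" and "c2" and the function names are made up for the example.

```javascript
// Hypothetical standalone comments, each with its own id and a postId reference.
const comments = [
  { _id: "c1", postId: "post1", user: "username1", text: "This is a great blog post!" },
  { _id: "c2", postId: "post1", user: "username2", text: "This is even better than sliced bread!" },
];

// Delete a comment by id; only its author may delete it.
function deleteComment(list, id, user) {
  const i = list.findIndex((c) => c._id === id && c.user === user);
  if (i === -1) return false;
  list.splice(i, 1);
  return true;
}

// Edit a comment by id; its position in the list is irrelevant.
function editComment(list, id, user, newText) {
  const target = list.find((c) => c._id === id && c.user === user);
  if (!target) return false;
  target.text = newText;
  return true;
}

// username1 deletes first, then username2 saves an edit: both succeed.
console.log(deleteComment(comments, "c1", "username1"));
console.log(editComment(comments, "c2", "username2", "Edited comment"));
```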
create a "lastModified" timestamp on the parent document, and only save the edit if nothing has changed on the document.
This is called 'optimistic locking', and it's generally not a good fit if there is a high probability of concurrent operations. In the case of blog posts, it's likely that newer posts receive a lot more comments than older ones, so I'd say the collision probability is fairly high.
There's yet another nasty side effect: let's say the blog post author wants to modify the text, but someone adds or removes a comment in the meantime. Now even the blog author wouldn't be able to change the text unless you use the atomic $set operation on the text and bypass the version check.
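The side effect described above can be reproduced with a minimal in-memory sketch of optimistic locking. It uses a numeric version field in place of a lastModified timestamp, which is an assumption for simplicity; the function names are hypothetical. In MongoDB the checked write would correspond to an update whose filter includes the expected version, and the unchecked one to a plain $set on a single field.

```javascript
// Save only if nobody else has written since we read the document.
function saveIfUnchanged(store, id, expectedVersion, changes) {
  const doc = store.get(id);
  if (!doc || doc.version !== expectedVersion) {
    return false; // someone else wrote first; the caller must re-read and retry
  }
  store.set(id, { ...doc, ...changes, version: doc.version + 1 });
  return true;
}

// Targeted write that bypasses the version check, like $set on one field.
function setField(store, id, field, value) {
  const doc = store.get(id);
  if (!doc) return false;
  store.set(id, { ...doc, [field]: value });
  return true;
}
```

If the author and a commenter both read version 1, whoever saves second is rejected by saveIfUnchanged, even though they touched different fields; setField lets the author's text change go through regardless.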

Opinion on my case MongoDB schema design

This is a pretty common question on MongoDB: When to embed and when to reference.
However, in my case this is something of a dilemma. I have a document that holds a reference where I could embed the data instead, but embedding costs disk space, while keeping a reference costs performance.
Here's an example, say I have this Member with the 'detail' as my problem:
Member: {
  _id: "asdf",
  detail: {
    name: "Stack Overflow",
    website: "www.stackoverflow.com"
  }
}
I want this member's detail to appear in every Blog that member "asdf" creates, because every displayed blog shows the member details. So there are two options for my Blog document:
First, make a reference by storing only the member's _id:
Blog: {
  _id: 123,
  memberId: "asdf" // used as a reference to query the specific member
}
Or second, embed the member into Blog instead:
Blog: {
  _id: 123,
  member: {
    _id: "asdf",
    detail: {
      name: "Stack Overflow",
      website: "www.stackoverflow.com"
    }
  }
}
So the first option requires another query for the member, which is a performance cost. The second option is faster because I only need to query once, but my disk usage grows with the redundant embedded 'member' data as the number of Blogs grows.
PS: As you can see in this example, the Member-to-Blog relationship is one-to-many, so a member can have many blogs, but the member's detail fields, 'name' and 'website', stay the same.
Any opinion on which is better in this case? It would be great if you also had a third solution. Thanks in advance.
I think it is fine to keep the member details separate, like a forum signature. That way when a member updates their details all the posts will show their current information without your application having to update duplicate data in every previous post.
From your description it sounds like you may only be displaying this on blog posts the users create, rather than on every comment they make on a page.
If you are worried about the performance cost of an extra query per user, you could always cache that user data (or the generated page output) instead of relying on fetching all the blog info in a single DB query. I would see how the application performs in actual usage before trying to optimize for a use case that may not be a problem.
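As one possible shape for the caching idea, here is a small sketch of a time-limited in-process cache in front of the member lookup. The fetchMember argument is a hypothetical stand-in for the real database query, and the 60-second default lifetime is an arbitrary assumption.

```javascript
// Wrap a member lookup with a cache so that rendering many blogs by the same
// member costs a single fetch until the entry expires.
function makeMemberCache(fetchMember, ttlMs = 60000, now = () => Date.now()) {
  const cache = new Map(); // memberId -> { member, expiresAt }
  return function getMember(memberId) {
    const hit = cache.get(memberId);
    if (hit && hit.expiresAt > now()) {
      return hit.member; // still fresh: no database round trip
    }
    const member = fetchMember(memberId);
    cache.set(memberId, { member, expiresAt: now() + ttlMs });
    return member;
  };
}
```

The trade-off is staleness: after a member updates their details, blogs may show the old values for up to ttlMs, which is usually acceptable for a name and a website.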
Another approach would be to show the extra user details only in an Ajax hover (similar to how SO shows more information for an established user).