Questions about better solution to keep comments and votes in one document - mongodb

I'm looking for a better way to handle this. I designed my documents to store all comments and votes (a vote is like a confirmation that also stores picture and text information) in arrays inside the document. My concern is the document size limit (currently 16 MB): if a document keeps a lot of comments, and especially votes, in internal arrays, it will very probably hit that limit and break. On the other hand, keeping this strategy gives me faster queries.
What should I do? Should I do as in a relational DB and keep this kind of information in different collections and documents? That would decrease search speed, but it would keep the documents safe and unbroken.

It all depends on how you plan on using the data. If you get a huge number of comments on a single blog post would you really want to query for the post and get all the comments back?
No real life webpage that shows you a blog post actually does that. They show you the first few comments and then fetch more either as you scroll through those or when you click "show more". That's probably the best (hybrid) model. Store what you need when you first display a blog post in the blog post document but keep everything else that you can query for later in a separate collection where every comment references the post that it belongs to. Then you can get the additional comments with a single indexed read (probably index it on post_id and date posted?). You can also use the "bucketing" technique and store comments grouped by post and chunk of time so that you can fetch entire "next page of comments" document.
If you architect this correctly rather than reducing your search speed it will likely increase your search and reading speed for base documents and save you a lot of network bandwidth too.
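A minimal sketch of the hybrid model described above, using plain arrays as stand-ins for the two collections so it runs without a database. The field names (`commentCount`, `recentComments`, `postedAt`) and `PREVIEW_SIZE` are assumptions for illustration; the MongoDB operations that would play the same role are named in the comments.

```javascript
// In-memory stand-ins for two collections (plain arrays, no database needed).
const posts = [{ _id: "p1", title: "Hello", commentCount: 0, recentComments: [] }];
const comments = [];
const PREVIEW_SIZE = 3; // hypothetical: how many comments the first page view shows

function addComment(postId, author, text, postedAt) {
  const comment = { post_id: postId, author, text, postedAt };
  // The full history goes to the separate collection:
  // db.comments.insertOne(comment), indexed on { post_id: 1, postedAt: 1 }.
  comments.push(comment);
  // The post keeps only a bounded preview. In MongoDB this could be one update:
  // { $inc: { commentCount: 1 },
  //   $push: { recentComments: { $each: [comment], $slice: -3 } } }
  const post = posts.find((p) => p._id === postId);
  post.commentCount += 1;
  post.recentComments = [...post.recentComments, comment].slice(-PREVIEW_SIZE);
}

// "Show more": stands in for
// db.comments.find({ post_id: postId, postedAt: { $lt: before } })
//   .sort({ postedAt: -1 }).limit(limit)
function olderComments(postId, before, limit) {
  return comments
    .filter((c) => c.post_id === postId && c.postedAt < before)
    .sort((a, b) => b.postedAt - a.postedAt)
    .slice(0, limit);
}

for (let i = 1; i <= 5; i++) addComment("p1", "user" + i, "comment " + i, i);
```

Reading the post now returns a small, bounded document (the count plus the last three comments), and older comments are fetched only when the reader asks for them.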

Related

Mongodb: about performance and schema design

After learning about performance and schema design in MongoDB, I still can't figure out how I would design the schema for an application where performance is a must.
Let's imagine we have to make YouTube work with MongoDB as its database. How would you design the schema?
OPTION 1: two collections (videos collection and comments collection)
Pros: adding, deleting and editing comments affects only the comments collection, therefore these operations would be more efficient.
Cons: Retrieving videos and comments would be 2 different queries to the database, one for videos and one for comments.
OPTION 2: single collection (videos collection with the comments embedded)
Pros: You retrieve videos and its comments with a single query.
Cons: Adding, deleting and editing comments affect the video Document, therefore these operations would be less efficient.
So what do you think? Are my guesses true?
I may be a lone voice in the wilderness here, but I have to say that embedding should only be used under very special circumstances:
The relation is a "One-To(-Very)-Few" and it is absolutely certain that no document will ever exceed the 16 MB document size limit. A good example would be the relation between "users" and "email addresses": a user is unlikely to have millions of them, and an artificial limit isn't even a problem, since setting the maximum number of addresses a user can have to, say, 50 would hardly cause trouble. It may be unlikely that a video gets millions of comments, but you don't want to impose an artificial limit on it, right?
Updates do not happen very often. If documents increase in size beyond a certain threshold, they might be moved, since documents are guaranteed to be never fragmented. However, document migrations are expensive and you want to prevent them.
Basically, all operations on comments become more complicated and hence more expensive - a bad choice. KISS!
I have written an article about the above, which describes the respective problems in greater detail.
And furthermore, I do not see any advantage in having the comments with the videos. The questions to answer would be
For a given user, what are the videos?
What are the newest videos (with certain tags)?
For a given video, what are the comments?
Note that the only connection between videos and comments here is about a given video, so you already have the _id or something else to positively identify the video. Furthermore, you don't want to load all comments at once, especially if you have a lot of them, since this would decrease UX because of long load times.
Let's say it is the _id. So, with it, you'd be able to have paged comments easily:
db.comments.find({"video_id": idToFind})
  .sort({"_id": 1})  // fixed order, so consecutive pages do not overlap
  .skip((page - 1) * pageSize)
  .limit(pageSize)
hth
As usual, the answer is: it depends. As a rule of thumb you should favour embedding, unless you need to regularly query the embedded objects on their own or the embedded array is likely to get too large (more than roughly 100 records). Using this guideline, there are a few questions you need to ask about your application.
First, how is your application going to access the data? Are you only ever going to show the comments on the same page as the associated video? Or do you want to offer the option to show all comments by a given user across all videos? The first scenario favours embedding (one collection), whereas you would probably be better off with two collections in the second scenario.
Secondly, how many comments do you expect for each video? Taking the analogy of IMDB, you could easily expect more than 100 comments for a popular video, which means you are better off creating two separate collections, as the embedded array of comments would grow large quite quickly. I wouldn't be too concerned about the overhead of an application-level join: it is generally comparable in speed to a server-side join in a relational database, provided your collections are properly indexed.
Finally, how often are users going to update their comments after their initial post? If you lock the comments after 5 minutes, as on StackOverflow, users may not update their comments very often. In that case the overhead of updating or deleting comments in the video collection will be negligible, and may even be outweighed by the cost of performing a second query against a separate comments collection.
You should use embedding for better performance: you will need fewer I/O operations. In the worst case it might take a bit longer to persist the document in the DB, but it won't take much time to retrieve it.
You have to trade write performance against read performance, or vice versa, depending on your application's needs.
Hence it is important to choose your db wisely.

Database design for queries that has tons of sql-like join

I have a collection named posts consisting of multiple article posts and a collection called users which has a lot of user info. Each post has a field called author that references the post author in the users collection.
On my home page I will query the posts collection and return a list of posts to the client. Since I also want to display the author of the post I need to do sql-like join commands so that all the posts will have author names, ids,...etc.
If I return a list of 40 posts I'd have to do 40 sql-like joins, which means that each time I will run 41 queries to get a list of posts with author info. That just seems really expensive.
I am thinking of storing the author info at the time I store the post info. This way I only need 1 query to retrieve all posts and author info. However, when the user info changes (such as a name change) the list will be outdated, and it does not seem easy to manage lists like this.
So is there a better or standard approach to this?
p.s: I am using mongodb
Mongo is a NoSQL DB. NoSQL solutions are typically designed around denormalized data (all the data a query needs is stored in one place).
In your example, the relationship between authors and posts is one-to-many, and the number of authors will be very small compared to the number of posts.
Based on this, you can safely store author info in the posts collection.
If you know most of your queries will run against the posts collection, then it makes sense to store the author in posts. Storing one extra attribute won't take much space, but it will make a big difference in query performance and in how easy the data is to code against and retrieve.
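If copying author info into every post is not acceptable because names change, the 41 queries from the question can still be reduced to 2 by collecting the author ids and fetching them in one go. This is a sketch of that alternative, not part of the answer above; the arrays stand in for the collections, and the helper name `postsWithAuthors` and the `authorName` field are made up for illustration.

```javascript
// In-memory stand-ins for the users and posts collections.
const users = [
  { _id: "u1", name: "Ann" },
  { _id: "u2", name: "Bob" },
];
const posts = [
  { _id: "p1", title: "First", author: "u1" },
  { _id: "p2", title: "Second", author: "u2" },
  { _id: "p3", title: "Third", author: "u1" },
];

function postsWithAuthors(postList) {
  // Query 1 would be db.posts.find(...).limit(40); postList is its result.
  const authorIds = [...new Set(postList.map((p) => p.author))];
  // Query 2: db.users.find({ _id: { $in: authorIds } }), simulated with filter().
  const found = users.filter((u) => authorIds.includes(u._id));
  const byId = new Map(found.map((u) => [u._id, u]));
  // Attach the author's current name to each post in application code.
  return postList.map((p) => ({ ...p, authorName: byId.get(p.author)?.name ?? null }));
}
```

Because the name is read from the users collection at query time, a renamed user shows up correctly without touching any post.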

Mongodb - most efficient way to structure my db?

I'm a bit new to MongoDB and I'm trying to set up a simple server where I will have users, posts, comments, likes and dislikes, among other things. What I'm wondering is: what is the most efficient way to set this up?
Should I have one table for likes where I add userId and postId (and more or less the same for the dislikes and comments tables)?
Or would it be better if likes, dislikes and comments were part of the post? Like:
//Post structure
{
"_id":"kljflskds",
"field1":"content",
"field2":"content",
"likes":[userId,userId,userId],
"dislikes":[userId,userId,userId],
"comments":[{comment object},{comment object},{comment object}]
}
Because for each post I retrieve, I would like to know how many likes, dislikes and comments it has. With the first version I would need to run multiple queries, either on the server (unnecessary processor power?) or from the phone (unnecessary bandwidth). The second would only need one query. I believe the second option, with comments as part of the posts, is more efficient, but I'm not a pro so I'd like to hear what other people think.
As has been pointed out, there are no tables in a document-oriented database. What you'll also find is that unlike a relational database where there is often a 'right way' to structure the database, the same is not true with MongoDB. Your schema should be structured based on how you're going to access the information most regularly. Documents are extremely flexible, unlike rows in tables.
You could create a comments collection or have them directly in the post documents. Two considerations would be: 1. Will you need to access the comments without accessing the post? and 2. Are your documents going to get too big and unwieldy?
In both of these cases with your blog, it would most likely be better to nest the comments, as most of your traffic will be searching for posts, and you'll be pulling all of the comments related to the post. Also, a comment will not be owned by multiple posts; besides, MongoDB isn't meant to be normalized like a relational database, so having duplicate information in multiple documents (i.e. tag names, city names, etc.) is normal.
Also, having a collection for likes is a very 'relational' way of thinking. In MongoDB, I can't think of a use case where you'd want a likes collection. When you're coming from the relational world, you really have to step back and rethink how you're creating your database because you'll be constantly fighting it otherwise.
With only two collections, posts and users, getting the information that you're looking for would be trivial, as you can just get the count of the likes and comments and they're all right there.

In MongoDB is it practical to keep all comments for a post in one document?

I've read in descriptions of document-based DBs that you can, for example, embed all comments under a post in the same document as the post, if you choose to, like so:
{
  _id: "sdfdsfdfdsf",
  title: "post title",
  body: "post body",
  comments: [
    "comment 1 ......................................... end of comment",
    // ...
    "comment n"
  ]
}
I have a similar situation, where each comment could be as large as 8 KB and there could be as many as 30 of them per post.
Even though it's convenient to embed comments in the same document, I wonder whether large documents hurt performance, especially when the MongoDB server and the HTTP server run on separate machines and must communicate through a LAN?
Posting this answer after some of the others, so I will repeat some of the things mentioned.
That said there are a few things to take into account. Consider these three questions :
Will you always require all comments every time you query for a post?
Will you want to query on comments directly (e.g. query comments for a specific user)?
Will your system have relatively low usage?
If all questions can be answered with yes then you can embed the comments array. In all other scenarios you will probably need a separate collection to store your comments.
First of all, you can actually update and remove comments atomically in a concurrency safe way (see updates with positional operators) but there are some things you cannot do such as index based inserts.
The main concern with using embedded arrays for any sort of large collection is the move-on-update issue. MongoDB reserves a certain amount of padding (see db.col.stats().paddingFactor) per document to allow it to grow as needed. If it runs out of this padding (and it will often in your usecase) it will have to move that ever growing document around on the disk. This makes updates an order of magnitude slower and is therefore a serious concern on high bandwidth servers. A related but slightly less vital issue is bandwidth. If you have no choice but to query the entire post with all its comments even though you're only displaying the first 10 you're going to waste quite a bit of bandwidth which can be an issue on cloud environments especially (you can use $slice to avoid some of this).
If you do want to go embedded here are your basic ops :
Add comment :
db.posts.update({_id:[POST ID]}, {$push:{comments:{commentId:"remon-923982", author:"Remon", text:"Hi!"}}})
Update comment :
db.posts.update({_id:[POST ID], 'comments.commentId':"remon-923982"}, {$set:{'comments.$.text':"Hello!"}})
Remove comment
db.posts.update({_id:[POST ID], 'comments.commentId':"remon-923982"}, {$pull:{comments:{commentId:"remon-923982"}}})
All these methods are concurrency safe because the update criteria are part of the (process wide) write lock.
With all that said, you probably want a dedicated collection for your comments, but that comes with a second choice. You can either store each comment in a dedicated document or use comment buckets of, say, 20-30 comments each (described in detail here http://www.10gen.com/presentations/mongosf2011/schemascale). This has advantages and disadvantages, so it's up to you to see which approach fits best for what you want to do. I would go for buckets if your comments per post can exceed a couple of hundred, due to the O(N) performance of the skip(N) cursor method you'll need for paging them. In all other cases just go with a comment-per-document approach. That's the most flexible for querying on comments in other use cases as well.
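The bucket idea can be sketched roughly as follows, with a plain array standing in for a bucket collection. The names `buckets`, `page` and `count` are assumptions for illustration, not taken from the linked presentation, and `BUCKET_SIZE` is kept at 3 only to keep the demo short.

```javascript
const BUCKET_SIZE = 3; // kept small for the demo; the answer suggests 20-30
const buckets = [];    // stand-in for a comment bucket collection (hypothetical)

function addComment(postId, comment) {
  // Look for a bucket of this post that still has room. In MongoDB this lookup
  // would be a query on { post_id: postId, count: { $lt: BUCKET_SIZE } }.
  let bucket = buckets.find((b) => b.post_id === postId && b.count < BUCKET_SIZE);
  if (!bucket) {
    // All buckets are full (or none exists yet): start the next page.
    const page = buckets.filter((b) => b.post_id === postId).length;
    bucket = { post_id: postId, page, count: 0, comments: [] };
    buckets.push(bucket);
  }
  bucket.comments.push(comment);
  bucket.count += 1;
}

// One page of comments is one document, found by (post_id, page) with no skip().
function getPage(postId, page) {
  const bucket = buckets.find((b) => b.post_id === postId && b.page === page);
  return bucket ? bucket.comments : [];
}

for (let i = 1; i <= 7; i++) addComment("p1", "comment " + i);
```

Paging then costs one indexed lookup per page regardless of how deep the page is, at the price of more awkward edits and deletes inside a bucket.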
It greatly depends on the operations you want to allow, but a separate collection is usually better.
For instance, if you want to allow users to edit or delete comments, it is a very good idea to store comments in a separate collection, because these operations are hard or impossible to express w/ atomic modifiers alone, and state management becomes painful. The documentation also covers this.
A key issue w/ embedding comments is that you will have different writers. Normally, a blog post can be modified only by blog authors. With embedded comments, a reader also gets write access to the object, so to speak.
Code like this will be dangerous:
post = db.findArticle( { "_id" : 2332 } );
post.Text = "foo";
// in this moment, someone does a $push on the article's comments
db.update(post);
// now, we've deleted that comment
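The lost update above can be reproduced without a database. The sketch below is only a simulation: `stored` stands in for the document on the server, and the helper names (`read`, `replaceWhole`, `setText`, `pushComment`) are made up for illustration.

```javascript
// Stand-in for the document as stored on the server.
let stored = { _id: 2332, text: "original", comments: [] };

const read = () => JSON.parse(JSON.stringify(stored));   // each client gets its own copy
const replaceWhole = (doc) => { stored = doc; };         // like saving the full document back
const setText = (text) => { stored.text = text; };       // like { $set: { text: ... } }
const pushComment = (c) => { stored.comments.push(c); }; // like { $push: { comments: c } }

// Unsafe: read, modify locally, then write the whole document back.
const post = read();
post.text = "foo";
pushComment("first!"); // a reader comments in between
replaceWhole(post);    // the stale copy overwrites the new comment
const lostAfterReplace = stored.comments.length === 0;

// Safer: send only the field that changed.
pushComment("second!");
setText("bar");
const keptAfterSet = stored.comments.length === 1;
```

The full-document write silently drops the first comment, while the targeted update leaves the comments array untouched.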
For performance reasons it is best to avoid documents that can grow in size over time:
Padding Factors:
"When you update a document in MongoDB, the update occurs in-place if
the document has not grown in size. If the document did grow in size,
however, then it might need to be relocated on disk to find a new disk
location with enough contiguous space to fit the new larger document.
This can lead to problems for write performance if the collection has
many indexes since a move will require updating all the indexes for
the document."
http://www.mongodb.org/display/DOCS/Padding+Factor
If you always retrieve a post with all its comments, why not?
If you don't, or you retrieve comments in a query other than by post (ie. view all of a user's comments on the user's page), then probably not since queries would become much more complicated.
Short answer: Yes and no.
Let's say you are writing a blog based on mongoDB. You would embed your comments into your post.
Why: It's easy to query, you just have to do a single request and get all the data you need to display.
Now, you know you'll get large documents with subdocuments. As you need to serve them through your LAN, I would highly recommend storing them in a different collection.
Why: Sending large documents through your network takes time. And I guess there are situations where you don't need every single subdocument.
TL;DR: Both variants work. I recommend storing your comments in a separate collection.
I'm working on a similar project which requires posts and comments, let me list down the points for both:
Keep in a separate document if you:
- need to delete a specific comment on a post
- want to show the latest comments on any post (like usually it is in the sidebar on blogs)
Keep in the same document if you:
- don't need any of the above
- need to fetch all the comments of a post in the same query (the separate document approach will require fetching the comments from different documents)

mongo db design of following and feeds, where should I embed?

I have a basic question about where I should embed a collection of followers/following in a mongo db. It makes sense to have an embedded collection of following in a user object, but does it also make sense to also embed the converse followers collection as well? That would mean I would have to update and embed in the profile record of both the:
following embedded list in the follower
And the followers embedded list of the followee
I can't ensure atomicity on that unless I also somehow keep a transaction or update status somewhere. Is it worth it embedding in both entities or should I just update #1, embed following in the follower's profile and, put an index on it so that I can query for the converse- followers across all profiles? Is the performance hit on that too much?
Is this a candidate for a collection that should not be embedded? Should I just have a collection of edges where I store following in its own collection with followerid and followedbyId ?
Now if I also have to update a feed to both users when they are followed or following, how should I organize that?
As for the use case, the user will see the people they are following when viewing their feeds, which happens quite often, and also see the followers of a profile when they view the profile detail of anyone, which also happens often but not quite as much as the 1st case. In both cases, the total numbers of following and followers shows up on every profile page.
In general, it's a bad idea to embed following/followed-by relationships into user documents, for several reasons:
(1) there is a maximum document size limit of 16MB, and it's plausible that a popular user of a well-subscribed site might end up with hundreds of thousands of followers, which will approach the maximum document size,
(2) followership relationships change frequently, and so the case where a user gains a lot of followers translates into repeated document growth if you're embedding followers. Frequent document growth will significantly hinder MongoDB performance, and so should be avoided (occasional document growth, especially if documents tend to reach a stable final size, is less of a performance penalty).
So, yes, it is best to split out following/followed-by relationship into a separate collection of records each having two fields, e.g., { _id : , oid : }, with indexes on _id (for the "who am I following?" query) and oid (for the "who's following me?" query). Any individual state change is modeled by a single document addition or removal, though if you're also displaying things like follower counts, you should probably keep separate counters that you update after any edge insertion/deletion.
(Of course, this supposes your business requirements allow you some flexibility on the consistency details: in general, if your display code tells a user he's got 304 followers and then proceeds to enumerate them, only the most fussy user will check that the followers enumerated tally up to 304. If business requirements necessitate absolute consistency, you'll either need a database that isolates transactions for you, or else you'll have to do the counting yourself as part of displaying all user identities.)
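A rough sketch of the edge-collection approach with separate counters, using an array and a Map in place of real collections. The field names `follower` and `followee` are placeholders chosen for this example, not the field names from the answer above.

```javascript
const edges = [];           // stand-in for the edge collection
const counters = new Map(); // userId -> { following, followers }

function counterFor(userId) {
  if (!counters.has(userId)) counters.set(userId, { following: 0, followers: 0 });
  return counters.get(userId);
}

function follow(follower, followee) {
  const exists = edges.some((e) => e.follower === follower && e.followee === followee);
  if (exists) return false; // in MongoDB a unique index on the pair could enforce this
  edges.push({ follower, followee }); // one state change = one inserted document
  counterFor(follower).following += 1;
  counterFor(followee).followers += 1;
  return true;
}

function unfollow(follower, followee) {
  const i = edges.findIndex((e) => e.follower === follower && e.followee === followee);
  if (i === -1) return false;
  edges.splice(i, 1); // one state change = one removed document
  counterFor(follower).following -= 1;
  counterFor(followee).followers -= 1;
  return true;
}

// The two queries, each of which would be served by its own index.
const following = (u) => edges.filter((e) => e.follower === u).map((e) => e.followee);
const followersOf = (u) => edges.filter((e) => e.followee === u).map((e) => e.follower);

follow("ann", "bob");
follow("cat", "bob");
follow("ann", "cat");
```

Profile pages read the counters directly, and the edge list is only enumerated when someone actually opens the followers or following view.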
You can embed them all, but create a new document when you reach a certain limit. For example, you can limit a document to an array of 500 elements and then create a new one. Also, for the feed, you don't have to keep publications that have already been viewed: you can replace them with new ones, so you don't have to create a new document for additional publication storage.
To maintain your performance, I'd advise you to create a collection that can be used with the $graphLookup aggregation stage, where you store who each user is following. A popular user can reach millions of followers, so you should store what people follow instead of who follows them.
I hope it helps you.