MongoDB: about performance and schema design

After learning about performance and schema design in MongoDB, I still can't figure out how I would design the schema for an application where performance is a must.
Let's imagine we have to build YouTube with MongoDB as its database. How would you design the schema?
OPTION 1: two collections (videos collection and comments collection)
Pros: adding, deleting and editing comments affects only the comments collection, therefore these operations would be more efficient.
Cons: Retrieving videos and comments would be 2 different queries to the database, one for videos and one for comments.
OPTION 2: single collection (videos collection with the comments embedded)
Pros: You retrieve a video and its comments with a single query.
Cons: Adding, deleting and editing comments affects the video document, therefore these operations would be less efficient.
So what do you think? Are my guesses true?

As a voice crying in the wilderness, I have to say that embedding should only be used under very special circumstances:
The relation is "One-To(-Very)-Few" and it is absolutely certain that no document will ever exceed this limit. A good example is the relation between "users" and "email addresses": a user is unlikely to have millions of them, and an artificial limit is not even a problem, since capping the number of addresses a user can have at, say, 50 would hardly cause trouble. It may be unlikely that a video gets millions of comments, but you don't want to impose an artificial limit on it, right?
Updates do not happen very often. If documents grow beyond a certain threshold, they might be moved, since documents are guaranteed never to be fragmented. However, document migrations are expensive and you want to prevent them.
Basically, with embedding all operations on comments become more complicated and hence more expensive - a bad choice. KISS!
I have written an article about the above, which describes the respective problems in greater detail.
Furthermore, I do not see any advantage in storing the comments with the videos. The questions to answer would be:
For a given user, what are the videos?
What are the newest videos (with certain tags)?
For a given video, what are the comments?
Note that the only connection between videos and comments here is the last question, about a given video, so you already have the _id or something else that positively identifies the video. Furthermore, you don't want to load all comments at once, especially if you have a lot of them, since long load times would hurt the user experience.
Let's say it is the _id. So, with it, you'd be able to have paged comments easily:
db.comments.find({"video_id": idToFind})
  .sort({"_id": 1})  // fixed order, so pages do not overlap between requests
  .skip((page - 1) * pageSize)
  .limit(pageSize)
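To make the paging arithmetic concrete, here is a minimal in-memory sketch in plain JavaScript (no database involved) of what skip and limit select. The function name pageOfComments and the sample data are made up for illustration, and pages are assumed to be 1-based, as in the query above.

```javascript
// In-memory stand-in for the skip/limit query; this is not MongoDB driver code.
// pageOfComments and the sample data are hypothetical, for illustration only.
function pageOfComments(comments, videoId, page, pageSize) {
  const skip = (page - 1) * pageSize; // pages are 1-based
  return comments
    .filter((c) => c.video_id === videoId)
    .slice(skip, skip + pageSize);
}

const sampleComments = [];
for (let i = 1; i <= 25; i++) {
  sampleComments.push({ _id: i, video_id: "v1", text: "comment " + i });
}
sampleComments.push({ _id: 26, video_id: "v2", text: "other video" });

// Page 3 with 10 per page skips 20 comments and returns the remaining 5.
const thirdPage = pageOfComments(sampleComments, "v1", 3, 10);
console.log(thirdPage.map((c) => c._id)); // ids 21 to 25
```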
hth

As usual, the answer is: it depends. As a rule of thumb you should favour embedding, unless you need to regularly query the embedded objects on their own or the embedded array is likely to get too large (more than roughly 100 records). Using this guideline, there are a few questions you need to ask about your application.
How is your application going to access the data? Are you only ever going to show the comments on the same page as the associated video? Or do you want to provide the option to show all comments for a given user across all movies? The first scenario favours embedding (one collection), whereas you would probably be better off with two collections in the second scenario.
Secondly, how many comments do you expect for each video? Taking IMDB as an analogy, you could easily expect more than 100 comments for a popular video, which means you are better off creating two separate collections, as the embedded array of comments would grow large quite quickly. I wouldn't be too concerned about the overhead of an application-side join; it is generally comparable in speed to a server-side join in a relational database, provided your collections are properly indexed.
Finally, how often are users going to update their comments after their initial post? If you lock comments after 5 minutes, as on StackOverflow, users may not update their comments very often. In that case the overhead of updating or deleting comments in the video collection will be negligible and may even be outweighed by the cost of performing a second query against a separate comments collection.

You should use embedding for better performance. You will need fewer I/O operations. In the worst case it might take a bit longer to persist the document in the DB, but it won't take much time to retrieve it.
You have to trade write performance for read performance or vice versa, depending on your application's needs.
Hence it is important to choose your db wisely.

Related

Performance Implications of Accessing Single MongoDB Document vs Different MongoDB Documents in The Same Collection

Say I have a MongoDB Document that contains within itself a list.
This list gets altered a lot, and there's no real reason why it couldn't have its own collection, with each of the items becoming a document.
Would there be any performance implications of the former? I've got an inkling that document read/writes are going to be blocked while any given connection tries to read it, but the same wouldn't be true for accessing different documents in the same collection.
I find that these questions are effectively impossible to 'answer' here on Stack Overflow. Not only is there not really a 'right' answer, but it is impossible to get enough context from the question to frame a response that appropriately factors in the items that are most important for you to consider in your specific situation. Nonetheless, here are some thoughts that come to mind that may help point you in the right direction.
Performance is obviously an important consideration here, so it's good to have it in mind as you think through the design. Even within the single realm of performance there are various aspects. For example, would it be acceptable for the source document and the associated secondary documents in another collection to be out of sync? If not, and you had to pursue a route such as using transactions to keep them aligned, then that may be a much bigger performance hit overall and not worth pursuing.
As broad as performance is, it is also just a single consideration here. What about usability? Are you able to succinctly express the type of modifications that you would be doing to the array using MongoDB's query language? What about retrieving the data, would you always pull the information back as a single logical document? If so, then that would imply needing to use $lookup very frequently. Even doing so via a view may be cumbersome and could be both a usability as well as performance consideration. Indeed, an overreliance on $lookup can be considered an antipattern.
What does it mean when you say that the list gets "altered" a lot? Are you inserting new information, or updating existing entries? There has been a 16MB size limit for individual documents for a long time in MongoDB, so they generally recommend avoiding unbounded arrays. Indeed processing them can be costly in various ways depending on some specific factors.
Also, where does your inkling about concurrency behavior come from? The MongoDB FAQ on concurrency helps outline some of the expected behavior for various operations and their locking. Often (with any system) it can be most appropriate to build out an environment that appropriately represents your end state and stress test it directly. That often gives a good general sense for how the approach would work in your situation without having to become an expert in the particulars of how the database (or tool in general) works.
You can see that even in this short response, the "recommendation" fluctuates back and forth. Ultimately this question is about a trade-off which we are not in a good position to answer for you. Hopefully this response helps give you some things to think about while doing so.

Mass Update NoSQL Documents: Bad Practice?

I'm storing two collections in a MongoDB database:
==Websites==
id
nickname
url
==Checks==
id
website_id
status
I want to display a list of check statuses with the appropriate website nickname.
For example:
[Google, 200] << (basically a join in SQL-world)
I have thousands of checks and only a few websites.
Which is more efficient?
1. Store the nickname of the website within the "check" directly. This means if the nickname is ever changed, I'll have to perform a mass update of thousands of documents.
2. Return a multidimensional array where the site ID is the key and the nickname is the value. This is to be used when iterating through the list of checks.
I've read that #1 isn't too bad (in the NoSQL world) and may, in fact, be preferred? True?
If it's only a few websites I'd go with option 1 - not as clean and normalized as in the relational/SQL world, but it works and is much less painful than trying to emulate joins with MongoDB. The thing to remember with MongoDB or any other NoSQL database is that you are generally making some kind of trade-off - nothing is for free. I personally really value the schema-less, document-oriented data design, and for the applications I use it for I readily make the trade-offs (like no joins and no transactions).
That said, this is a trade-off - so one thing to always be asking yourself in this situation is why am I using MongoDB or some other NoSQL database? Yes, it's trendy and "hot", but I'd make certain that what you are doing makes sense for a NoSQL approach. If you are spending a lot of time working around the lack of joins and foreign keys, no transactions and other things you're used to in the SQL world I'd think seriously about whether this is the best fit for your problem.
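For comparison, option 2 from the question (a lookup table keyed by website ID, consulted while iterating over the checks) can be sketched in plain JavaScript. This is only an in-memory illustration under the assumption that both result sets have already been fetched; the sample documents and the name attachNicknames are hypothetical.

```javascript
// Application-side "join": build a lookup table once, then walk the checks.
// Plain in-memory sketch; websites and checks stand in for query results.
function attachNicknames(websites, checks) {
  const nicknameById = new Map(websites.map((w) => [w.id, w.nickname]));
  return checks.map((check) => [
    // Fall back to a placeholder if a check points at a missing website.
    nicknameById.get(check.website_id) ?? "(unknown)",
    check.status,
  ]);
}

const websites = [
  { id: 1, nickname: "Google", url: "https://www.google.com" },
  { id: 2, nickname: "Example", url: "https://example.com" },
];
const checks = [
  { id: 10, website_id: 1, status: 200 },
  { id: 11, website_id: 2, status: 503 },
  { id: 12, website_id: 1, status: 200 },
];

// Each row pairs a nickname with a status, e.g. ["Google", 200].
const rows = attachNicknames(websites, checks);
console.log(rows);
```

With only a few websites the lookup table stays tiny, and renaming a website touches one document instead of thousands of checks.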
You might consider a 3rd option: Get rid of the Checks collection and embed the checks for each website as an array in each Websites document.
This way you avoid any JOINs and you avoid inconsistencies, because it is impossible for a Check to exist without the Website it belongs to.
This, however, is only recommended when the checks array for each document stays relatively constant over time and doesn't grow constantly. Rapidly growing documents should be avoided in MongoDB, because every time a document doubles its size, it is moved to a different location in the physical file it is stored in, which slows down write operations. Also, MongoDB has a 16MB limit per document. This limit exists mostly to discourage growing documents.
You haven't said what a Check actually is in your application. If it is a list of tasks you perform periodically and only change occasionally, there would be nothing wrong with embedding. But if you collect the historical results of all checks you ever did, I would rather recommend putting each result (set?) in its own document to avoid document growth.

Questions about better solution to keep comments and votes in one document

I'm looking for a better way to handle this. I designed my documents to store all comments and votes (a vote is like a confirmation that also stores picture and text information) in arrays inside a document. My concern is the document size limit (16 MB so far): if a document keeps a lot of comments and especially votes in internal arrays, it will very probably break when it reaches the size limit. On the other hand, keeping this strategy ensures faster queries.
What should I do? Do as in a relational DB and keep this kind of information in different collections and documents? That will decrease search speed, but it keeps the documents safe and unbroken.
It all depends on how you plan on using the data. If you get a huge number of comments on a single blog post would you really want to query for the post and get all the comments back?
No real-life webpage that shows you a blog post actually does that. They show you the first few comments and then fetch more either as you scroll through those or when you click "show more". That's probably the best (hybrid) model. Store what you need when you first display a blog post in the blog post document, but keep everything else that you can query for later in a separate collection where every comment references the post that it belongs to. Then you can get the additional comments with a single indexed read (probably indexed on post_id and date posted?). You can also use the "bucketing" technique and store comments grouped by post and chunk of time, so that you can fetch an entire "next page of comments" document.
If you architect this correctly, rather than reducing your search speed it will likely increase your search and reading speed for base documents and save you a lot of network bandwidth too.
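The hybrid model described above can be sketched with plain arrays and objects standing in for the two collections. This is a rough illustration, not driver code: PREVIEW_SIZE, addComment, moreComments and the field names firstComments and seq are all hypothetical, and a sequence number is used here in place of the posting date as the paging key.

```javascript
// Hybrid model sketch: the post keeps a small preview of comments, and every
// comment also lives in a separate "collection" that references the post.
// All names below are hypothetical; nothing here talks to a database.
const PREVIEW_SIZE = 3;

const post = { _id: "post-1", title: "Hello", commentCount: 0, firstComments: [] };
const commentsCollection = [];

function addComment(post, comments, author, text) {
  const comment = { post_id: post._id, seq: post.commentCount, author, text };
  comments.push(comment); // the full history lives outside the post
  if (post.firstComments.length < PREVIEW_SIZE) {
    post.firstComments.push({ author, text }); // bounded, so the post stays small
  }
  post.commentCount += 1;
}

// "Show more": fetch the next batch from the separate collection by post and
// position, which would be an indexed read on (post_id, seq) in a database.
function moreComments(comments, postId, afterSeq, limit) {
  return comments
    .filter((c) => c.post_id === postId && c.seq > afterSeq)
    .slice(0, limit);
}

for (let i = 0; i < 8; i++) {
  addComment(post, commentsCollection, "user" + i, "comment " + i);
}

// The page first renders post.firstComments, then asks for what follows it.
const nextBatch = moreComments(commentsCollection, post._id, PREVIEW_SIZE - 1, 3);
console.log(post.firstComments.length, nextBatch.map((c) => c.seq));
```

The first page load needs only the post document; the separate collection is touched only when the reader asks for more.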

In MongoDB is it practical to keep all comments for a post in one document?

I've read in descriptions of document-based DBs that you can, for example, embed all comments under a post in the same document as the post if you choose to, like so:
{
  _id: "sdfdsfdfdsf",
  title: "post title",
  body: "post body",
  comments: [
    "comment 1 ......................................... end of comment",
    // ...
    // comment n
  ]
}
I have a similar situation, where each comment could be as large as 8KB and there could be as many as 30 of them per post.
Even though it's convenient to embed comments in the same document, I wonder whether having large documents impacts performance, especially when the MongoDB server and the HTTP server run on separate machines and must communicate through a LAN?
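As a quick sanity check on the numbers in the question, the worst case can be compared with MongoDB's 16 MB document limit. The sketch below ignores the size of the title, the body and any BSON overhead, and treats KB and MB as powers of 1024; both are simplifying assumptions.

```javascript
// Back-of-the-envelope size estimate for a post with embedded comments.
const KB = 1024;
const MB = 1024 * KB;

const maxCommentBytes = 8 * KB;      // largest comment, from the question
const maxCommentsPerPost = 30;       // most comments per post, from the question
const documentLimitBytes = 16 * MB;  // MongoDB's per-document size limit

const worstCaseBytes = maxCommentBytes * maxCommentsPerPost;
const shareOfLimit = worstCaseBytes / documentLimitBytes;

console.log(worstCaseBytes / KB + " KB");             // 240 KB
console.log((shareOfLimit * 100).toFixed(2) + " %");  // 1.46 %
```

Under these assumptions the size limit is unlikely to be the deciding factor here; transfer cost and update patterns matter more.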
I'm posting this answer after some of the others, so I will repeat some of the things already mentioned.
That said, there are a few things to take into account. Consider these three questions:
Will you always require all comments every time you query for a post?
Can you do without querying comments directly (e.g. querying comments for a specific user)?
Will your system have relatively low usage?
If all questions can be answered with yes, then you can embed the comments array. In all other scenarios you will probably need a separate collection to store your comments.
First of all, you can actually update and remove comments atomically in a concurrency safe way (see updates with positional operators) but there are some things you cannot do such as index based inserts.
The main concern with using embedded arrays for any sort of large collection is the move-on-update issue. MongoDB reserves a certain amount of padding (see db.col.stats().paddingFactor) per document to allow it to grow as needed. If it runs out of this padding (and it often will in your use case), it has to move that ever-growing document around on disk. This makes updates an order of magnitude slower and is therefore a serious concern on high-throughput servers. A related but slightly less vital issue is bandwidth. If you have no choice but to query the entire post with all its comments even though you're only displaying the first 10, you're going to waste quite a bit of bandwidth, which can be an issue in cloud environments especially (you can use $slice to avoid some of this).
If you do want to go embedded here are your basic ops :
Add comment :
db.posts.update({_id:[POST ID]}, {$push:{comments:{commentId:"remon-923982", author:"Remon", text:"Hi!"}}})
Update comment :
db.posts.update({_id:[POST ID], 'comments.commentId':"remon-923982"}, {$set:{'comments.$.text':"Hello!"}})
Remove comment :
db.posts.update({_id:[POST ID], 'comments.commentId':"remon-923982"}, {$pull:{comments:{commentId:"remon-923982"}}})
All these methods are concurrency safe because the update criteria are part of the (process wide) write lock.
With all that said, you probably want a dedicated collection for your comments, but that comes with a second choice. You can either store each comment in a dedicated document or use comment buckets of, say, 20-30 comments each (described in detail here http://www.10gen.com/presentations/mongosf2011/schemascale). This has advantages and disadvantages, so it's up to you to see which approach fits best for what you want to do. I would go for buckets if your comments per post can exceed a couple of hundred, due to the O(N) performance of the skip(N) cursor method you'll need for paging them. In all other cases just go with a comment-per-document approach. That's the most flexible for querying on comments for other use cases as well.
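To illustrate the bucket idea, here is a small in-memory sketch in plain JavaScript. It is not taken from the linked presentation: the bucket shape { post_id, page, count, comments }, the bucket size of 20 and the function names are assumptions made for this example, and a plain array stands in for the buckets collection.

```javascript
// Comment buckets sketch: each bucket document holds up to BUCKET_SIZE
// comments for one post, so one page of comments is one document read.
const BUCKET_SIZE = 20;
const buckets = []; // documents shaped like { post_id, page, count, comments }

function addCommentToBucket(buckets, postId, comment) {
  // Find the newest bucket for this post (highest page number).
  let last = null;
  for (const b of buckets) {
    if (b.post_id === postId && (last === null || b.page > last.page)) last = b;
  }
  // Start a new bucket when there is none yet or the newest one is full.
  if (last === null || last.count >= BUCKET_SIZE) {
    const page = last === null ? 0 : last.page + 1;
    last = { post_id: postId, page, count: 0, comments: [] };
    buckets.push(last);
  }
  last.comments.push(comment);
  last.count += 1;
}

// Paging reads one bucket by (post_id, page) instead of skipping N comments.
function readPage(buckets, postId, page) {
  const bucket = buckets.find((b) => b.post_id === postId && b.page === page);
  return bucket ? bucket.comments : [];
}

for (let i = 0; i < 45; i++) {
  addCommentToBucket(buckets, "post-1", { author: "user" + i, text: "comment " + i });
}
// 45 comments fill two buckets of 20 and leave 5 in the third.
console.log(buckets.length, readPage(buckets, "post-1", 2).length);
```

Reading page N costs the same regardless of N, which is the point of buckets compared with skip(N); the price is that editing or deleting a single comment now means updating an array inside a bucket.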
It greatly depends on the operations you want to allow, but a separate collection is usually better.
For instance, if you want to allow users to edit or delete comments, it is a very good idea to store comments in a separate collection, because these operations are hard or impossible to express w/ atomic modifiers alone, and state management becomes painful. The documentation also covers this.
A key issue w/ embedding comments is that you will have different writers. Normally, a blog post can be modified only by blog authors. With embedded comments, a reader also gets write access to the object, so to speak.
Code like this will be dangerous:
post = db.posts.findOne({ "_id" : 2332 });
post.Text = "foo";
// at this moment, someone else does a $push on the post's comments
db.posts.replaceOne({ "_id" : 2332 }, post);
// now we've overwritten the document with our stale copy and deleted that comment
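The race can be reproduced without a database. In the sketch below a plain object stands in for the stored post; replaceWhole mimics saving a stale copy of the whole document, while setField mimics a targeted update (in the spirit of $set) that touches one field only. All names are hypothetical.

```javascript
// Lost-update sketch with a plain object standing in for the stored document.
function clone(doc) {
  return JSON.parse(JSON.stringify(doc));
}

function runScenario(saveEdit) {
  const store = { post: { _id: 2332, Text: "original", comments: [] } };

  const stale = clone(store.post);        // the author loads the post
  stale.Text = "foo";                     // ...and edits the text locally

  store.post.comments.push("Nice post!"); // meanwhile a reader's comment is pushed

  saveEdit(store, stale);                 // the author saves
  return store.post;
}

// Whole-document save: the stale copy replaces whatever is stored.
const replaceWhole = (store, stale) => { store.post = stale; };
// Targeted save: only the edited field is written.
const setField = (store, stale) => { store.post.Text = stale.Text; };

const afterReplace = runScenario(replaceWhole);
const afterSet = runScenario(setField);
console.log(afterReplace.comments.length); // 0: the reader's comment is gone
console.log(afterSet.comments.length);     // 1: the comment survives
```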
For performance reasons it is best to avoid documents that can grow in size over time:
Padding Factors:
"When you update a document in MongoDB, the update occurs in-place if
the document has not grown in size. If the document did grow in size,
however, then it might need to be relocated on disk to find a new disk
location with enough contiguous space to fit the new larger document.
This can lead to problems for write performance if the collection has
many indexes since a move will require updating all the indexes for
the document."
http://www.mongodb.org/display/DOCS/Padding+Factor
If you always retrieve a post with all its comments, why not?
If you don't, or you retrieve comments in a query other than by post (i.e. view all of a user's comments on the user's page), then probably not, since queries would become much more complicated.
Short answer: Yes and no.
Let's say you are writing a blog based on mongoDB. You would embed your comments into your post.
Why: It's easy to query, you just have to do a single request and get all the data you need to display.
Now, you know you'll get large documents with subdocuments. As you need to serve them through your LAN, I would highly recommend storing them in a different collection.
Why: Sending large documents through your network takes time. And I guess there are situations where you don't need every single subdocument.
TL;DR: Both variants work. I recommend storing your comments in a separate collection.
I'm working on a similar project which requires posts and comments, let me list down the points for both:
Keep in a separate document if you:
- need to delete a specific comment on a post
- want to show the latest comments on any post (like usually it is in the sidebar on blogs)
Keep in the same document if you:
- don't need any of the above
- need to fetch all the comments of a post in the same query (the separate document approach will require fetching the comments from different documents)

mongo db design of following and feeds, where should I embed?

I have a basic question about where I should embed a collection of followers/following in a mongo db. It makes sense to have an embedded collection of following in a user object, but does it also make sense to also embed the converse followers collection as well? That would mean I would have to update and embed in the profile record of both the:
following embedded list in the follower
And the followers embedded list of the followee
I can't ensure atomicity on that unless I also somehow keep a transaction or update status somewhere. Is it worth embedding in both entities, or should I just update #1 (embed following in the follower's profile) and put an index on it so that I can query for the converse (followers) across all profiles? Is the performance hit on that too much?
Is this a candidate for a collection that should not be embedded? Should I just have a collection of edges where I store following in its own collection with followerid and followedbyId ?
Now if I also have to update a feed to both users when they are followed or following, how should I organize that?
As for the use case, the user will see the people they are following when viewing their feeds, which happens quite often, and also see the followers of a profile when they view the profile detail of anyone, which also happens often but not quite as much as the 1st case. In both cases, the total numbers of following and followers shows up on every profile page.
In general, it's a bad idea to embed following/followed-by relationships into user documents, for several reasons:
(1) there is a maximum document size limit of 16MB, and it's plausible that a popular user of a well-subscribed site might end up with hundreds of thousands of followers, which will approach the maximum document size,
(2) followership relationships change frequently, and so the case where a user gains a lot of followers translates into repeated document growth if you're embedding followers. Frequent document growth will significantly hinder MongoDB performance, and so should be avoided (occasional document growth, especially if documents tend to reach a stable final size, is less of a performance penalty).
So, yes, it is best to split out following/followed-by relationship into a separate collection of records each having two fields, e.g., { _id : , oid : }, with indexes on _id (for the "who am I following?" query) and oid (for the "who's following me?" query). Any individual state change is modeled by a single document addition or removal, though if you're also displaying things like follower counts, you should probably keep separate counters that you update after any edge insertion/deletion.
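A separate edge collection with counters can be sketched in memory as follows. A Map and plain objects stand in for the MongoDB collections; the field names follower and followee and all function names are hypothetical and are not meant to reproduce the document shape given above.

```javascript
// Edge-collection sketch: one record per follow relationship plus counters.
const edges = new Map();  // key "follower->followee" -> { follower, followee }
const counters = {};      // userId -> { following, followers }

function counterFor(userId) {
  if (!counters[userId]) counters[userId] = { following: 0, followers: 0 };
  return counters[userId];
}

function follow(follower, followee) {
  const key = follower + "->" + followee;
  if (edges.has(key)) return false;       // already following: nothing changes
  edges.set(key, { follower, followee }); // the state change is one insertion
  counterFor(follower).following += 1;    // counters are updated afterwards
  counterFor(followee).followers += 1;
  return true;
}

function unfollow(follower, followee) {
  if (!edges.delete(follower + "->" + followee)) return false;
  counterFor(follower).following -= 1;
  counterFor(followee).followers -= 1;
  return true;
}

// "Who's following me?" filters edges by followee (an indexed query in a database).
function followersOf(userId) {
  return [...edges.values()]
    .filter((e) => e.followee === userId)
    .map((e) => e.follower);
}

follow("alice", "carol");
follow("bob", "carol");
follow("bob", "carol"); // duplicate, ignored
unfollow("alice", "carol");
console.log(followersOf("carol"), counters.carol.followers);
```

In a real deployment the edge write and the counter updates are separate operations, so the counts can briefly drift from the edges.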
(Of course, this supposes your business requirements allow you some flexibility on the consistency details: in general, if your display code tells a user he's got 304 followers and then proceeds to enumerate them, only the most fussy user will check that the followers enumerated tally up to 304. If business requirements necessitate absolute consistency, you'll either need a database that isolates transactions for you, or else you'll have to do the counting yourself as part of displaying all user identities.)
You can embed them all but create a new document when you reach a certain limit. For example, you can limit a document to an array of 500 elements and then create a new one. Also, if it is about the feed, you don't have to keep publications once they have been viewed; you can replace them with new ones, so you don't have to create a new document to store additional publications.
To maintain your performance, I'd advise you to create a collection that can be used with the $graphLookup aggregation stage, where you store your following relationships. A user can reach millions of followers, so you should store whom people follow instead of who follows them.
I hope it helps you.