MongoDB - most efficient way to structure my db?

I'm a bit new to MongoDB and I'm trying to set up a simple server where I will have users, posts, comments, likes and dislikes, among other things. What I'm wondering is: which way should this be set up most efficiently?
Should I have one table for likes where I add userId and postId (and more or less the same for the dislike and comments tables)?
Or would it be better if likes, dislikes and comments are parts of the post? Like:
//Post structure
{
  "_id": "kljflskds",
  "field1": "content",
  "field2": "content",
  "likes": [userId, userId, userId],
  "dislikes": [userId, userId, userId],
  "comments": [{comment object}, {comment object}, {comment object}]
}
Because for each post, when I retrieve them, I would like to know how many likes it has, how many dislikes and how many comments. With the first version I would either need to do multiple queries on the server (unnecessary processor power?) or on the phone (unnecessary bandwidth). But the second would only need one query. I believe the second option, with comments as a part of the posts, seems more efficient, but I'm not a pro, so I'd like to hear what other people think of this.

As has been pointed out, there are no tables in a document-oriented database. What you'll also find is that unlike a relational database where there is often a 'right way' to structure the database, the same is not true with MongoDB. Your schema should be structured based on how you're going to access the information most regularly. Documents are extremely flexible, unlike rows in tables.
You could create a comments collection or have them directly in the post documents. Two considerations would be: 1. Will you need to access the comments without accessing the post? and 2. Are your documents going to get too big and unwieldy?
In both of these cases with your blog, it would most likely be better to nest the comments, as most of your traffic will be searching for posts and you'll be pulling all of the comments related to the post. Also, a comment will not be owned by multiple posts; besides, MongoDB isn't meant to be normalized like a relational database, so having duplicate information in multiple documents (i.e. tag names, city names, etc.) is normal.
Also, having a collection for likes is a very 'relational' way of thinking. In MongoDB, I can't think of a use case where you'd want a likes collection. When you're coming from the relational world, you really have to step back and rethink how you're creating your database because you'll be constantly fighting it otherwise.
With only two collections, posts and users, getting the information you're looking for is trivial: the likes and comments are right there on the post, so you can just count them.
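To make that concrete, here is a minimal mongo-shell sketch of the embedded approach (the field names and user ids are placeholders, not taken from the question); a single aggregation returns each post together with its like/dislike/comment counts:

db.posts.insertOne({
  field1: "content",
  field2: "content",
  likes: ["user1", "user2"],
  dislikes: ["user3"],
  comments: [{ author: "user1", text: "Nice post" }]
});

// One query, no joins: the counts are derived from the embedded arrays.
db.posts.aggregate([
  { $project: {
      field1: 1,
      field2: 1,
      likeCount: { $size: "$likes" },
      dislikeCount: { $size: "$dislikes" },
      commentCount: { $size: "$comments" }
  } }
]);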

Related

Schema design advice for Social Network

I'm trying to come up with a schema for a social network app.
Users could post Posts, and inside them have Photos.
Both Posts and Photos can have Likes and Comments.
Posts can have several collaborators/owners, which is why I added the Posts Participants table.
Users can search for Posts either by searching for keywords inside the posts texts, or by the hashtags of the posts.
That's why I used the tsvector type for both of them, indexed with the GIN index type.
So far I have come up with the following schema:
My main issues with this design are:
Hashtags in a post - is it fine what I did, i.e - storing the hashtags of a post in one column tsvector inside the Posts table?
two additional ideas I had in mind :
a. have a separate table for hashtags, like this: id|post_id|tag_name, where each record represents one individual hashtag. Sounds a bit inefficient though; it will result in too many records.
b. same as a, but "tag_name" would be a tsvector representing all of the hashtags of the post. This would result in far fewer records in the table than option 'a'.
Saved posts - what if I have 10k posts, and each of them is liked by 1k people? That would result in 10 million records! That doesn't sound efficient.
Normalization - it seems to me there are too many tables, which will require a lot of JOINs to retrieve a whole Post object for the clients (along with the comments, likes, photos and their comments/likes, etc.), and will also be complex to write to. Will the queries to retrieve/write the different Posts be too slow / cumbersome?
Comments - should I separate comments for posts and comments for photos like I did in the design above? or combine them into one table?
I want to have 1-level replies inside the comments. Should I just add a column of "parent_comment_id" inside the Comments table?
Storing the hashtags of a post in one tsvector column inside the Posts table is fine, as you are not using them in any other table. But if you need hashtags in any scenario where you have to join tables, then it would be better to keep them in a separate table for flexibility.
It is fine to keep records like that; you are keeping the IDs, so it won't result in any bottleneck. I believe you are going to use CSVs.
You can always optimize your queries; use your joins wisely and make sure you are using the right join in the right place.
It will be better if you separate comments for posts from comments for photos, as you will find it much easier to retrieve and store the data. You can always combine them when needed, and this avoids any data congestion.
Yes, that sounds fine.

Database design for queries that has tons of sql-like join

I have a collection named posts consisting of multiple article posts and a collection called users which has a lot of user info. Each post has a field called author that references the post author in the users collection.
On my home page I will query the posts collection and return a list of posts to the client. Since I also want to display the author of each post, I need to do SQL-like join commands so that all the posts will have author names, ids, etc.
If I return a list of 40 posts I'd have to do 40 SQL-like joins, which means each time I will do 41 queries to get a list of posts with author info. That just seems really expensive.
I am thinking of storing the author info at the time I store the post info. This way I only need 1 query to retrieve all posts and author info. However, when the user info changes (such as a name change) the list will be outdated, and it doesn't seem easy to manage lists like this.
So is there a better or standard approach to this?
P.S.: I am using MongoDB.
Mongo is a NoSQL DB. By definition, NoSQL solutions are meant to be denormalized (all required data should be located in the same place).
In your example, the relationship between authors and posts is one-to-many, but the ratio of authors to posts is very small. In simple words, the number of authors compared to the number of posts will be very small.
Based on this, you can safely store author info in posts collection.
If you know that most of your queries will be executed on the posts collection, then it makes sense to store the author in posts. It won't take much space to store one extra attribute, but it will make a huge difference in query performance and in how easy it is to write the code that retrieves the data.
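As a rough mongo-shell sketch of that idea (the collection and field names are mine, not the asker's), each post can carry the author's id plus a small denormalized snapshot of the name, so the home page needs only one query:

// Each post stores a reference to the author plus a snapshot of the name.
db.posts.insertOne({
  title: "First post",
  body: "...",
  author: { _id: "u123", name: "Alice" }
});

// One query returns the latest 40 posts with author names already included.
db.posts.find({}, { title: 1, "author.name": 1 }).sort({ _id: -1 }).limit(40);

// When a user changes their name, sync the snapshot in their posts once.
db.posts.updateMany({ "author._id": "u123" }, { $set: { "author.name": "Alicia" } });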

What is the correct noSQL collection structure for the following case?

As someone who got used to thinking in relational terms, I am trying to get a grasp of thinking in the "noSQL way".
Assume the following scenario:
We have a blog (eg. 9gag.com) with many posts and registered users. Every post can be liked by each user. We would like to build a recommendation engine, so we need to track:
all posts viewed by a user
all posts liked by a user
Posts have: title, body, category. Users have: username, password, email, other data.
In a relational DB we would have something like: posts, users, posts_users_views (post_id, users_id, view_date), posts_users_likes (post_id, user_id, like_date).
Question
What would the "right" structure be in a document/column-oriented NoSQL database?
Clarification: should we save an array of all viewed/liked post ids in users (or user ids in posts)? If so, won't we have a problem with the row size getting huge?
In CouchDB you could have separate documents for the user, post, view and like. Showing the views/likes by user can be arranged through a "view" (a materialized map/reduce query) whose map function emits the array key [user_id, post_id]. As the result you get a sorted dictionary (ordered lexicographically by key), so fetching all the views for user='ID' is a query with keys ranging from [ID] to [ID, {}]. You can optimize it, but the basic solution is very simple.
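A hedged sketch of such a map function (the document fields are assumptions, since the question doesn't give them): the composite key sorts all of a user's views together, so they can be read back with a simple key range.

function (doc) {
  if (doc.type == "view") {
    emit([doc.user_id, doc.post_id], null);
  }
}
// Query (conceptually): startkey=["USER_ID"], endkey=["USER_ID", {}] returns
// every post viewed by that user; a second view emitting
// [doc.post_id, doc.user_id] answers the reverse question.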
In the CouchDB wiki there is a comment on using a relationally modeled design and the view collation mechanism (which can substitute for some simple joins). To get some intuition I would rather advise studying the problem of posts and comments, which is also very simple, but not quite as trivial as views and likes :)
There may be no single "NoSQL way", but I think most map/reduce systems share a similar type of thinking. CouchDB is a good tool to start with, precisely because it is so limited :) It is difficult to write queries that would be inefficient in a distributed environment, and its map and reduce functions cannot have side effects (they generate the materialized view incrementally as the document set changes, and the result should not depend on the order of document updates).

Relations in Document-oriented database?

I'm interested in document-oriented databases, and I'd like to play with MongoDB. So I started a fairly simple project (an issue tracker), but am having a hard time thinking in a non-relational way.
My problems:
I have two objects that relate to each other (e.g. issue = {code:"asdf-11", title:"asdf", reporter:{username:"qwer", role:"manager"}} - here I have a user related to the issue). Should I create another document 'user' and reference it in 'issue' document by its id (like in relational databases), or should I leave all the user's data in the subdocument?
If I have objects (subdocuments) in a document, can I update them all in a single query?
I'm totally new to document-oriented databases, and right now I'm trying to develop sort of a CMS using node.js and mongodb so I'm facing the same problems as you.
By trial and error I found this rule of thumb: I make a collection for every entity that may be a "subject" for my queries, while embedding the rest inside other objects.
For example, comments in a blog entry can be embedded, because usually they're bound to the entry itself and I can't think about a useful query made globally on all comments. On the other side, tags attached to a post might deserve their own collection, because even if they're bound to the post, you might want to reason globally about all the tags (for example making a list of trending topics).
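As a rough mongo-shell sketch of that trade-off (the collection and field names are placeholders), tags kept in their own collection make a global "trending topics" list a single aggregation:

db.tags.insertOne({ post_id: "p1", name: "mongodb" });

// Top 10 most used tags across all posts.
db.tags.aggregate([
  { $group: { _id: "$name", uses: { $sum: 1 } } },
  { $sort: { uses: -1 } },
  { $limit: 10 }
]);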
In my mind this is actually pretty simple. Embedded documents can only be accessed via their master document. If you can envision a need to query an object outside the context of the master document, then don't embed it. Use a ref.
For your example
issue = {code:"asdf-11", title:"asdf", reporter:{username:"qwer", role:"manager"}}
I would make issue and reporter each their own document, and reference the reporter in the issue. You could also reference a list of issues in reporter. This way you won't duplicate reporters in issues, you can query them each separately, you can query reporter by issue, and you can query issues by reporter. If you embed reporter in issue, you can only query the one way, reporter by issue.
If you embed documents, you can update them all in a single query, but you have to repeat the update in each master document. This is another good reason to use reference documents.
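A minimal sketch of that referenced design (the ids are placeholders): the reporter lives in its own document and the issue stores only its id, so both directions can be queried.

db.users.insertOne({ _id: "qwer", username: "qwer", role: "manager" });
db.issues.insertOne({ code: "asdf-11", title: "asdf", reporter_id: "qwer" });

// Reporter by issue:
var reporterId = db.issues.findOne({ code: "asdf-11" }).reporter_id;
db.users.findOne({ _id: reporterId });

// Issues by reporter - the query the embedded version can't do directly:
db.issues.find({ reporter_id: "qwer" });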
The beauty of MongoDB and other "NoSQL" products is that there isn't any schema to design. I use MongoDB and I love it, not having to write SQL queries and awful JOIN queries! So, to answer your two questions:
1 - If you create multiple documents, you'll need to make two calls to the DB. Not saying that's a bad thing, but if you can throw everything into one document, why not? I recall when I used MySQL I would create a "blog" table and a "comments" table. Now, I append the comments to the record in the same collection (aka table) and keep building on it.
2 - Yes ...
Schema design in document-oriented DBs can seem difficult at first, but building my startup with Symfony2 and MongoDB I've found that 80% of the time it's just like working with a relational DB.
At first, think of it like a normal DB:
To start, just create your schema as you would with a relational DB:
Each entity should have its own collection, especially if you'll need to paginate the documents in it.
(in Mongo you can somewhat paginate nested document arrays, but the capabilities are limited)
Then just remove overly complicated normalization:
do I need a separate category table? (simply write the category in a column/property as a string or embedded doc)
Can I store comments count directly as an Int in the Author collection? (then update the count with an event, for example in Doctrine ODM)
Embedded documents:
Use embedded documents only for:
clearness (nested documents like: addressInfo, billingInfo in the User collection)
to store tags/categories (e.g.: [ name: "Sport", parent: "Hobby", page: "/sport" ])
to store simple multiple values (for eg. in User collection: list of specialties, list of personal websites)
Don't use them when:
the parent Document will grow too large
when you need to paginate them
when you feel the entity is important enough to deserve its own collection
Duplicate values across collections and precompute counts:
Duplicate some column/attribute values from one collection to another if you need to run queries with those values in the where conditions. (Remember: there are no joins.)
e.g.: in the Ticket collection, also store the author name (not only the ID).
Also, if you need a counter (number of tickets opened by user, by category, etc.), precompute it.
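A hedged mongo-shell sketch of both ideas (the names are illustrative, not from the answer): the ticket duplicates the author's name, and the user document keeps a precomputed counter that is bumped whenever a ticket is opened.

db.tickets.insertOne({ title: "Broken login", authorId: "u1", authorName: "Nigel" });
db.users.updateOne({ _id: "u1" }, { $inc: { openTickets: 1 } });

// Tickets can now be listed or filtered by author name without a join,
// and the count is read directly instead of being recomputed per request.
db.tickets.find({ authorName: "Nigel" });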
Embed references:
When you have a one-to-many or many-to-many reference, use an embedded array with the list of the referenced document ids (see MongoDB DBRef).
You'll again need to use an event to remove an id if the referenced document gets deleted.
(There is an extension for Doctrine ODM if you use it: Reference Integrity)
This kind of reference is managed directly by Doctrine ODM: Reference Many
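A small sketch of such a reference array (using plain ids rather than full DBRefs, with made-up names): a post keeps the ids of its categories, and the cleanup mentioned above is a single update.

db.posts.insertOne({ title: "Hello", categoryIds: ["cat-sport", "cat-hobby"] });

// All posts referencing a category:
db.posts.find({ categoryIds: "cat-sport" });

// What the deletion event / Reference Integrity extension effectively does:
db.posts.updateMany({}, { $pull: { categoryIds: "cat-sport" } });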
It's easy to fix errors:
If you find out late that you made a mistake in the schema design, it's quite simple to fix it with a few lines of JavaScript run directly in the Mongo console.
(stored procedures made easy: no need for complex migration scripts)
Warning: don't use Doctrine ODM Migrations; you'll regret it later.
I've redone this answer since the original took the relation the wrong way round due to misreading the question.
issue = {code:"asdf-11", title:"asdf", reporter:{username:"qwer", role:"manager"}}
As to whether embedding some important information about the user (creator) of the ticket is a wise decision or not depends upon the system specifics.
Are you giving these users the ability to login and report issues they find? If so then it is likely you might want to factor that relation off to a user collection.
On the other hand, if that is not the case then you could easily get away with this schema. The one problem I see here is if you wish to contact the reporter and their job role has changed, that's somewhat awkward; however, that is a real world dilemma, not one for the database.
Since the subdocument represents a single one-to-one relation to a reporter you also should not suffer fragmentation problems mentioned in my original answer.
There is one glaring problem with this schema and that is duplication of changing repeating data (Normalised Form stuff).
Let's take an example. Imagine you hit the real-world dilemma I spoke about earlier and a user called Nigel wants his role to reflect his new job position from now on. This means you have to update every document where Nigel is the reporter and change his role to that new position. This can be a lengthy and resource-consuming query for MongoDB.
To contradict myself again, if you were only to have maybe 100 tickets (i.e. something manageable) per user, then the update operation would likely not be too bad and would, in fact, be handled by the database quite easily; plus, due to the (hopeful) lack of movement of the documents, this would be a completely in-place update.
So whether this should be embedded or not depends heavily upon your querying and documents etc.; however, I would say this schema isn't a good idea, specifically due to the duplication of changing data across many root documents. Technically, yes, you could get away with it, but I would not try.
I would instead split the two out.
If I have objects (subdocuments) in a document, can I update them all in a single query?
Just like the relation style in my original answer, yes and easily.
For example, let's update the role of Nigel to MD (as hinted earlier) and change the ticket status to completed:
db.tickets.update({'reporter.username':'Nigel'},{$set:{'reporter.role':'MD', status: 'completed'}})
So a single document schema does make CRUD easier in this case.
One thing to note, stemming from your wording: you cannot use the positional operator to update all subdocuments under a root document. Instead it will update only the first one found.
Again hopefully that makes sense and I haven't left anything out. HTH
Original Answer
here I have a user related to the issue). Should I create another document 'user' and reference it in 'issue' document by its id (like in relational databases), or should I leave all the user's data in the subdocument?
This is a considerable question and requires some background knowledge before continuing.
The first thing to consider is the size of an issue:
issue = {code:"asdf-11", title:"asdf", reporter:{username:"qwer", role:"manager"}}
It is not very big, and since you no longer need the reporter information (that would be on the root document) it could be even smaller; however, issues are never that simple. If you take a look at the MongoDB JIRA, for example: https://jira.mongodb.org/browse/SERVER-9548 (as a random page that proves my point), the contents of a "ticket" can actually be quite considerable.
The only way you would gain a true benefit from embedding the tickets would be if you could store ALL of a user's information in a single 16 MB block of contiguous storage, which is the maximum size of a BSON document (as currently imposed by mongod).
I don't think you would be able to store all tickets under a single user.
Even if you were to shrink the ticket to, maybe, a code, title and a description, you could still suffer from the "swiss cheese" problem caused by regular updates and changes to documents in MongoDB; as ever, this is a good reference for what I mean: http://www.10gen.com/presentations/storage-engine-internals
You would typically witness this problem as users add multiple tickets to their root user document. The tickets themselves will change as well but maybe not in a drastic or frequent manner.
You can, of course, remedy this problem a bit by using power of 2 sizes allocation: http://docs.mongodb.org/manual/reference/command/collMod/#usePowerOf2Sizes which will do exactly what it says on the tin.
Ok, hypothetically, if you were to only have code and title then yes, you could store the tickets as subdocuments in the root user without too many problems, however, this is something that comes down to specifics that the bounty assignee has not mentioned.
If I have objects (subdocuments) in a document, can I update them all in a single query?
Yes, quite easily. This is one thing that becomes easier with embedding. You could use a query like:
db.users.update({user_id:uid,'tickets.code':'asdf-1'}, {$set:{'tickets.$.title':'Oh NOES'}})
However, to note, you can only update ONE subdocument at a time using the positional operator. As such this means you cannot, in a single atomic operation, update all ticket dates on a single user to 5 days in the future.
As for adding a new ticket, that is quite simple:
db.users.update({user_id:uid},{$push:{tickets:{code:"asdf-1",title:"Whoop"}}})
So yes, you can quite simply, depending on your queries, update the entire users data in a single call.
That was quite a long answer so hopefully I haven't missed anything out, hope it helps.
I like MongoDB, but I have to say that I will use it a lot more soberly in my next project.
Specifically, I have not had as much luck with the Embedded Document facility as people promise.
Embedded Document seems to be useful for Composition (see UML Composition), but not for aggregation. Leaf nodes are great, anything in the middle of your object graph should not be an embedded document. It will make searching and validating your data more of a struggle than you'd want.
One thing that is absolutely better in MongoDB is your many-to-X relationships. You can do a many-to-many with only two tables, and it's possible to represent a many-to-one relationship on either table. That is, you can either put 1 key in N rows, or N keys in 1 row, or both. Notably, queries to accomplish set operations (intersection, union, disjoint set, etc) are actually comprehensible by your coworkers. I have never been satisfied with these queries in SQL. I often have to settle for "two other people will understand this".
If you've ever had your data get really big, you know that inserts and updates can be constrained by how much the indexes cost. You need fewer indexes in MongoDB; an index on A-B-C can be used to query for A, A & B, or A & B & C (but not B, C, B & C or A & C). Plus the ability to invert a relationship lets you move some indexes to secondary tables. My data hasn't gotten big enough to try, but I'm hoping that will help.
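A hedged mongo-shell sketch of both points (the field names are mine): the many-to-X side is just an array of ids, the set-style queries read naturally, and one compound index serves its prefixes.

db.posts.insertOne({ title: "Hi", authorId: "u1", tagIds: ["t1", "t2"] });
db.posts.find({ tagIds: { $all: ["t1", "t2"] } });   // posts tagged with both (intersection)
db.posts.find({ tagIds: { $in: ["t1", "t2"] } });    // posts tagged with either (union)

// One compound index, usable for {authorId}, {authorId, tagIds} and
// {authorId, tagIds, title}, but not for tagIds or title alone.
db.posts.createIndex({ authorId: 1, tagIds: 1, title: 1 });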

I need advice about NoSQL/MongoDB and data/model structure

Recently I've been exploring NoSQL databases. I need advice on how to store data in the most optimal and efficient way for a given problem. I'm targeting MongoDB for now; however, it should be much the same with CouchDB.
Let's say we have these 3 Models:
Story:
id
title
User:
id
name
Vote:
id
story_id
user_id
I want to be able to ask the database these questions:
Who has voted for this Story?
What this User has Voted for?
I do simple joins while working with a relational DB. The question is, how should I store the data for those objects in order to be most efficient?
For example, if I store the Vote objects as a subcollection of Stories, it won't be easy to get the info "what has a user voted for".
I would suggest storing votes as a list of story _ids on each user document. That way you can find out which stories a user has voted for just by looking at the list. To get the users who have voted for a story you can do something like:
db.users.find({stories: story_id})
where story_id is the _id of the story in question. If you create an index on the stories field both of those queries will be fast.
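A short mongo-shell sketch of that design (the ids are placeholders): recording a vote, answering both questions, and indexing the array.

// The user votes for a story: add its _id to the user's list.
db.users.updateOne({ _id: "u1" }, { $addToSet: { stories: "s42" } });

// "What has this user voted for?" - read the array off the user document.
db.users.findOne({ _id: "u1" }, { stories: 1 });

// "Who has voted for this story?" - kept fast by an index on the array.
db.users.createIndex({ stories: 1 });
db.users.find({ stories: "s42" });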
Don't worry about whether your queries are efficient until it starts to matter.
According to the quote below, you're doing it wrong:
The way I have been going about the mind switch is to forget about the database altogether. In the relational DB world you always have to worry about data normalization and your table structure. Ditch it all. Just lay out your web pages. Lay them all out. Now look at them. You're already 2/3 there. If you forget the notion that database size matters and data shouldn't be duplicated, then you're 3/4 there and you didn't even have to write any code! Let your views dictate your models. You don't have to take your objects and make them 2-dimensional anymore as in the relational world. You can store objects with shape now.
how-to-think-in-data-stores-instead-of-databases
OK, you have given a normalized data model as you would in an SQL setup.
In my understanding you don't do this in MongoDB. You could store references, but in the general case you don't, for performance reasons.
I'm not an expert in the NoSQL area by any means, but why don't you simply follow your needs and store the user (ids) that have voted for a story in the stories collection, and the story (ids) a user has voted for in the users collection?
In CouchDB this is very simple. One view emits:
function(doc) {
  if(doc.type == "vote") {
    emit(doc.story_id, doc.user_id);
  }
}
Another view emits:
function(doc) {
  if(doc.type == "vote") {
    emit(doc.user_id, doc.story_id);
  }
}
Both queries are extremely fast since there is no join. If you do need user data or story data, CouchDB supports multi-document fetch. That's also quite fast and is one way to do a "join".
I've been looking into MongoDB and CouchDB a lot lately, but my insight is limited. Still, when thinking about storing the votes inside the story document, you might have to worry about hitting the 4MB document size limit. Even if you don't, you might be constantly increasing the size of the document enough to cause it to get moved, thus slowing down your writes (see how documents are sized in MongoDB).
As for CouchDB, these kinds of things are quite simple, elegant, and quite fast once the view indexes are calculated. Personally, however, I have hesitated to do a similar project in CouchDB because of benchmarks showing it progressively slowing down to a considerable degree as the database grows (and the view indexes grow). I'd love to see some more recent benchmarks showing CouchDB performance as database size increases. I WANT to try MongoDB or CouchDB, but SQL still seems so efficient and logical, so I'll stay with it until the project fits the temptation just right.