Database design for queries that have tons of SQL-like joins - MongoDB

I have a collection named posts consisting of multiple article posts and a collection called users which has a lot of user info. Each post has a field called author that references the post author in the users collection.
On my home page I will query the posts collection and return a list of posts to the client. Since I also want to display the author of each post, I need to do SQL-like joins so that every post comes back with the author's name, id, etc.
If I return a list of 40 posts I'd have to do 40 of these joins, which means 41 queries in total just to get one page of posts with author info. That seems really expensive.
I am thinking of storing the author info at the time I store the post. That way I only need one query to retrieve all posts with their author info. However, when the user info changes (such as a name change) the embedded copies become outdated, and lists like this don't seem easy to manage.
So is there a better or standard approach to this?
p.s: I am using MongoDB

Mongo is a NoSQL DB. By definition, NoSQL solutions are meant to be denormalized: all the data a query needs should live in the same place.
In your example, the relationship between authors and posts is one-to-many, but the number of authors is very small compared to the number of posts.
Based on this, you can safely store the author info in the posts collection.
If you know most of your queries will run against the posts collection, it makes sense to embed the author there. One extra attribute won't take much space, but it makes a big difference in query performance and in how easy the data is to retrieve in code.
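As a minimal sketch in the mongo shell, assuming illustrative field names (title, body, createdAt) that are not taken from the question:
// Option 1: embed a small snapshot of the author in every post.
db.posts.insertOne({
  title: "My first post",
  body: "...",
  createdAt: new Date(),
  author: { _id: ObjectId("5f1a2b3c4d5e6f7a8b9c0d1e"), name: "Jane Doe" }
})
// The home page then needs a single query:
db.posts.find({}).sort({ createdAt: -1 }).limit(40)
// Option 2: keep only the author's _id in each post and join once at read time
// with $lookup (one aggregation instead of 40 extra queries):
db.posts.aggregate([
  { $sort: { createdAt: -1 } },
  { $limit: 40 },
  { $lookup: { from: "users", localField: "author", foreignField: "_id", as: "authorInfo" } },
  { $unwind: "$authorInfo" }
])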

Related

Finding all documents that don't have relations with others in MongoDB

I have a collection of Users and one of Posts. I want to find all the posts that one user has not viewed yet. I expect the number of posts one user views to grow over time, possibly reaching tens or hundreds of thousands for some users, although the majority of users will only have a few hundred.
How should I organize my data in a MongoDB database?
Should I keep the array of viewed posts in the User collection, in the Post collection, in a collection on its own (a document per view) or what else?
How should I then query the database?
The query can be built with the aggregation pipeline: first $lookup to join posts to post views, then $match on the joined array being empty ($exists: false on its first element) to keep only the posts the user has not viewed yet.
This won't be a cheap query with a large volume of data. One strategy for making it faster is to limit the posts before the join, for example by date, or by scoping them to a forum/tag, etc.
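A minimal sketch of that pipeline, assuming a separate post_views collection with { postId, userId } documents (the collection and field names are assumptions, and $lookup with a sub-pipeline needs MongoDB 3.6+):
// Posts a given user has NOT viewed yet.
var targetUserId = ObjectId("5f1a2b3c4d5e6f7a8b9c0d1e");
db.posts.aggregate([
  // scope posts first (here: last 7 days) so the join stays cheap
  { $match: { createdAt: { $gte: new Date(Date.now() - 7 * 24 * 3600 * 1000) } } },
  { $lookup: {
      from: "post_views",
      let: { postId: "$_id" },
      pipeline: [
        { $match: { $expr: { $and: [
            { $eq: ["$postId", "$$postId"] },
            { $eq: ["$userId", targetUserId] }
        ] } } }
      ],
      as: "views"
  } },
  // keep only posts with no view document for this user
  { $match: { "views.0": { $exists: false } } }
])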

Schema design advice for Social Network

I'm trying to come up with a schema for a social network app.
Users could post Posts, and inside them have Photos.
Both Posts and Photos can have Likes and Comments.
Posts can have several collaborators/owners, which is why I added the Posts Participants table.
Users can search for Posts either by searching for keywords inside the posts texts, or by the hashtags of the posts.
That's why I used the tsvector type for both of them, indexed with the GIN index type.
So far I have come up with the following schema:
My main issues with this design are:
Hashtags in a post - is what I did fine, i.e. storing the hashtags of a post in one tsvector column inside the Posts table?
Two additional ideas I had in mind:
a. have a separate table for hashtags, like this: id|post_id|tag_name, where each record represents one individual hashtag. Sounds a bit inefficient though; it will result in too many records.
b. same as a, but "tag_name" would be a tsvector representing all of the hashtags of the post. This would result in far fewer records in the table than option 'a'.
Saved posts - what if I have 10k posts and each of them is liked by 1k people? That would result in 10 million records! That doesn't sound efficient.
Normalization - it seems to me there are too many tables, which will require a lot of JOINs to return a whole Post object to the clients (along with the comments, likes, photos and their comments/likes, etc.), and will be complex to write to as well. Will the queries to retrieve/write different Posts be too slow or cumbersome?
Comments - should I separate comments for posts and comments for photos like I did in the design above? or combine them into one table?
I want to have 1-level replies inside the comments. Should I just add a column of "parent_comment_id" inside the Comments table?
Storing the hashtags of a post in one tsvector column inside the Posts table is fine as long as you are not using them in any other table. But if you need the hashtags in any scenario where you have to join tables, it would be better to keep them in a separate table for flexibility.
It is fine to keep records like that; since you are only storing the ids, it won't become a bottleneck. I believe you are going to use CSVs.
You can always optimize your queries; use your joins wisely and make sure you are using the right join in the right place.
It will be better if you separate comments for posts from comments for photos, as you will find it much easier to retrieve and store the data. You can always combine them later when needed, and this avoids any data congestion.
Yes, a parent_comment_id column sounds fine for 1-level replies.

MongoDB - most efficient way to structure my db?

I'm a bit new to MongoDB and I'm trying to set up a simple server where I will have users, posts, comments, likes and dislikes, among other things. What I'm wondering is: which way would this be set up most efficiently?
Should I have one table for likes where I add userId and postId (more or less the same for the dislikes and comments tables)?
Or would it be better if likes, dislikes and comments were part of the post? Like:
//Post structure
{
  "_id": "kljflskds",
  "field1": "content",
  "field2": "content",
  "likes": [userId, userId, userId],
  "dislikes": [userId, userId, userId],
  "comments": [{comment object}, {comment object}, {comment object}]
}
Because for each post, when I retrieve them, I would like to know how many likes it has, how many dislikes and how many comments. With the first version I would either need to run multiple queries on the server (unnecessary processor power?) or on the phone (unnecessary bandwidth), whereas the second would only need one query. I believe the second option, with comments as part of the posts, seems more efficient, but I'm not a pro so I'd like to hear what other people think of this.
As has been pointed out, there are no tables in a document-oriented database. What you'll also find is that unlike a relational database where there is often a 'right way' to structure the database, the same is not true with MongoDB. Your schema should be structured based on how you're going to access the information most regularly. Documents are extremely flexible, unlike rows in tables.
You could create a comments collection or have them directly in the post documents. Two considerations would be: 1. Will you need to access the comments without accessing the post? and 2. Are your documents going to get too big and unwieldy?
In both of these cases with your blog, it most likely would be better to nest the comments, as most of your traffic will be searching for posts and you'll be pulling all of the comments related to the post. Also, a comment will never be owned by more than one post; besides, MongoDB isn't meant to be normalized like a relational database, so having duplicate information in multiple documents (i.e. tag names, city names, etc.) is normal.
Also, having a collection for likes is a very 'relational' way of thinking. In MongoDB, I can't think of a use case where you'd want a likes collection. When you're coming from the relational world, you really have to step back and rethink how you're creating your database because you'll be constantly fighting it otherwise.
With only two collections, posts and users, getting the information that you're looking for would be trivial, as you can just get the count of the likes and comments and they're all right there.
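As a rough sketch of that last point, with the embedded structure from the question the counts come back in a single query:
// Return each post together with its like/dislike/comment counts.
db.posts.aggregate([
  { $project: {
      field1: 1,
      field2: 1,
      likeCount: { $size: { $ifNull: ["$likes", []] } },
      dislikeCount: { $size: { $ifNull: ["$dislikes", []] } },
      commentCount: { $size: { $ifNull: ["$comments", []] } }
  } }
])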

What is the correct noSQL collection structure for the following case?

As someone who got used to thinking in relational terms, I am trying to get a grasp of thinking in the "noSQL way".
Assume the following scenario:
We have a blog (eg. 9gag.com) with many posts and registered users. Every post can be liked by each user. We would like to build a recommendation engine, so we need to track:
all posts viewed by a user
all posts liked by a user
Posts have: title, body, category. Users have: username, password, email, other data.
In a relational DB we would have something like: posts, users, posts_users_views (post_id, users_id, view_date), posts_users_likes (post_id, user_id, like_date).
Question
What would the "right" structure be in a document/column-oriented noSQL database?
Clarification: Should we save an array of all viewed/liked post ids in users (or user ids in posts)? If so, won't we have a problem with the row (document) size getting huge?
In CouchDB you could have separate documents for the user, post, view and like. Showing the views/likes by user can be arranged with a "view" (a materialized map/reduce query) whose map function emits an array key [user_id, post_id]. As a result you get a dictionary sorted lexicographically by key, so fetching all the views for user 'ID' is a range query with keys from [ID] to [ID, {}]. You can optimize it, but the basic solution is very simple.
The CouchDB wiki has a comment on using a relationally modeled design together with the view collation mechanism (which can substitute for some simple joins). To build some intuition I would rather advise studying the problem of posts and comments, which is also very simple, but not as trivial as views and likes :)
There may be no single "NoSQL way", but I think most map/reduce systems share a similar style of thinking. CouchDB is a good tool to start with because it is very limited :) It is hard to write queries that would be inefficient in a distributed environment, and its map and reduce functions cannot have side effects (they generate the materialized view incrementally as the document set changes, and the result must not depend on the order of document updates).
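A minimal sketch of such a map function, assuming view documents shaped like { type: "view", user_id: ..., post_id: ..., view_date: ... } (the type field and document shape are assumptions):
// CouchDB map function: one row per (user, post) view, keyed for range queries.
function (doc) {
  if (doc.type === "view") {
    emit([doc.user_id, doc.post_id], doc.view_date);
  }
}
// All views for one user: query this view with startkey=["ID"] and endkey=["ID", {}].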

Many-to-many relationships for a social app: MongoDB or graph databases like Neo4j

I have tried to understand embedding in MongoDB but could not find good enough documentation. Linking is not advised as writes are not atomic across documents, and it also means two lookups. Does someone know how to solve this, or would you suggest going to a graph DB like Neo4j?
I am trying to build an application which will need many-to-many relationships. To explain, I will take the example of a library. It can suggest books to a user based on the books his friends are reading and on what neighboring (like-minded) users are reading.
There are Users and Books. Users borrow books and have friends who are other users.
Given a user, I need all books he is reading and the number of mutual friends for each book.
Given a book, I need all the people who are reading it. Maybe, given a user A, this would return the intersection of the people reading the book and the friends of user A. This is mutual friendship.
Users = [
  { name: 'xyz', id: '000000', friend_ids: ['949583', '958694'] },
  { name: 'abc', id: '000001', friend_ids: ['949582', '111111'] }
]
Books = [
  { book: 'da vinci code', author: 'dan brown', readers: ['949583', '000000'] },
  { book: 'iCon', author: 'Young', readers: ['000000', '000001'] }
]
As seen above, I generally need two documents with MongoDB because I need two-way lookups. Duplicating (embedding) one document into another could lead to a lot of duplication (these schemas could store much more information than shown).
Am I modeling my data correctly? Can this be done effectively in MongoDB, or should I look at graph DBs?
A disclaimer: I work for Neo4j
From your outline, requirements and type of data, it seems that your app is right in a sweet spot for graph databases.
I'd suggest you just do a quick spike with a graph database and see how it goes.
There will be no duplication
you have transactions for atomic operations
following links is the natural operation
local queries (e.g. from a user or a book) are cheap and fast
you can use graph algorithms like shortest path to find interesting information about your data
recommendations and similar operations are natural to graph databases (see the sketch below)
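To make that concrete, a hypothetical sketch with the Node.js neo4j-driver; the labels and relationship types (:User, :Book, :FRIEND, :READS) are assumptions, not something the question defines:
// Books a user is reading, each with its number of mutual friends.
const neo4j = require('neo4j-driver');
const driver = neo4j.driver('bolt://localhost:7687', neo4j.auth.basic('neo4j', 'secret'));

async function booksWithMutualFriendCounts(userId) {
  const session = driver.session();
  try {
    const result = await session.run(
      `MATCH (u:User {id: $userId})-[:READS]->(b:Book)
       OPTIONAL MATCH (u)-[:FRIEND]-(f:User)-[:READS]->(b)
       RETURN b.title AS title, count(DISTINCT f) AS mutualFriends`,
      { userId }
    );
    return result.records.map(r => ({
      title: r.get('title'),
      mutualFriends: r.get('mutualFriends').toNumber()
    }));
  } finally {
    await session.close();
  }
}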
Some Questions:
Why did you choose MongoDB in the first place?
What implementation language do you use?
Your basic schema proposal above would work fine for MongoDB, with a few suggestions:
Use integers for identifiers rather than strings. Integers will often be stored more compactly by MongoDB (they will always be 8 bytes, whereas a string's stored size depends on its length). You can use findAndModify to emulate unique sequence generators (like auto_increment in some relational databases), as sketched below; see Mongoengine's SequenceField for an example of how this is done. You could also use ObjectIds, which are always 12 bytes but are virtually guaranteed to be unique without having to store any coordination information in the database.
You should use the _id field instead of id, as this field is always present in MongoDB and has a default unique index created on it. This means your _ids are always unique, and lookups by _id are very fast.
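A minimal sketch of the findAndModify sequence-generator pattern mentioned above (the counters collection name is an assumption):
// Atomically increment and return a named counter, creating it on first use.
function getNextSequence(name) {
  var ret = db.counters.findAndModify({
    query: { _id: name },
    update: { $inc: { seq: 1 } },
    new: true,
    upsert: true
  });
  return ret.seq;
}
// e.g. give a new user an integer _id:
db.users.insertOne({ _id: getNextSequence("userid"), name: "xyz" })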
You are right that using this sort of schema will require multiple find()s, and will incur network round-trip overhead each time. However, for each of the queries you have suggested above, you need no more than 2 lookups, combined with some straightforward application code:
"Given a user, I need all books he is reading and number of mutual friends for the book"
a. Look up the user in question, then
b. query the books collection using db.books.find({_id: {$in: [list, of, books, for, the, user]}}), then
c. for each book, compute the intersection of that book's readers and the user's friends (a concrete sketch of this query follows below)
"Given a book, I need all the people who are reading it."
a. Look up the book in question, then
b. look up all the users who are reading that book, again using $in like db.users.find({_id: {$in: [list, of, users, reading, book]}})
"Maybe, given a user A, this would return the intersection of people reading the book and friends of user A."
a. Look up the user in question, then
b. look up the book in question, then
c. compute the set intersection of the user's friends and the book's readers
I should note that $in can be slow if you have very long lists, as it is effectively equivalent to doing N number of lookups for a list of N items. The server does this for you, however, so it only requires one network round-trip rather than N.
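Putting the pieces together, query #1 might look roughly like this in the shell. It is only a sketch: it assumes users keep the books they are reading in a book_ids array (an illustrative field that is not in the question's schema) and that documents use _id as recommended above.
// Query #1: all books a user is reading, with mutual-friend counts.
var user = db.users.findOne({ _id: "000000" });
var books = db.books.find({ _id: { $in: user.book_ids } }).toArray();
books.forEach(function (book) {
  // mutual friends = the user's friends who are also readers of this book
  var mutual = book.readers.filter(function (r) {
    return user.friend_ids.indexOf(r) !== -1;
  });
  print(book.book + ": " + mutual.length + " mutual friends");
});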
As an alternative to using $in for some of these queries, you can create an index on the array fields, and query the collection for documents with a specific value in the array. For instance, for query #1 above, you could do:
// create an index on the array field "readers"
db.books.createIndex({readers: 1})
// now find all books for user whose id is 1234
db.books.find({readers: 1234})
This is called a multi-key index and can perform better than $in in some cases. Your exact experience will vary depending on the number of documents and the size of the lists.