Schema design advice for a social network - PostgreSQL

I'm trying to come up with a schema for a social network app.
Users can create Posts, and each Post can contain Photos.
Both Posts and Photos can have Likes and Comments.
Posts can have several collaborators/owners, which is why I added the Posts Participants table.
Users can search for Posts either by keywords inside the posts' text or by the posts' hashtags.
That's why I used the tsvector type for both of them, indexed with a GIN index.
So far I have come up with the following schema:
My main issues with this design are:
Hashtags in a post - is what I did fine, i.e. storing the hashtags of a post in a single tsvector column inside the Posts table?
Two additional ideas I had in mind:
a. Have a separate table for hashtags, like this: id|post_id|tag_name, where each record represents one individual hashtag. It sounds a bit inefficient though, since it will result in a lot of records.
b. Same as a, but "tag_name" would be a tsvector representing all of the hashtags of the post. This would result in far fewer records than option 'a'.
Saved posts - what if I have 10k posts and each of them gets liked by 1k people? That would result in 10 million records! That doesn't sound efficient.
Normalization - it seems to me there are too many tables, which will require a lot of JOINs to retrieve a whole Post object for the clients (along with its comments, likes, photos and their comments/likes, etc.), and will also be complex to write to. Will the queries to retrieve/write Posts be too slow or cumbersome?
Comments - should I separate comments on posts from comments on photos, like I did in the design above, or combine them into one table?
I also want to have 1-level replies to comments. Should I just add a "parent_comment_id" column to the Comments table?

Storing the hashtags of a post in a single tsvector column inside the Posts table is fine, as long as you are not using the hashtags in any other table. But if you need the hashtags in any scenario where you have to join tables, it would be better to keep them in a separate table for flexibility.
It is fine to keep the records as they are; as long as you keep the IDs (and index them properly), it won't result in any bottleneck. I believe you are going to use CSVs.
You can always optimize your queries: use your joins wisely and make sure you are using the right join in the right place.
It will be better if you separate comments for posts and comments for photos, as you will find it much easier to retrieve and store the data. You can always combine them when needed, and this avoids any data congestion.
Yes, that sounds OK.

Related

Database design for queries that need tons of SQL-like joins

I have a collection named posts consisting of multiple article posts, and a collection called users which holds a lot of user info. Each post has a field called author that references the post's author in the users collection.
On my home page I will query the posts collection and return a list of posts to the client. Since I also want to display the author of each post, I need to do SQL-like join commands so that all the posts come back with author names, ids, etc.
If I return a list of 40 posts I'd have to do 40 SQL-like joins, which means each time I would do 41 queries to get a list of posts with author info. That just seems really expensive.
I am thinking of storing the author info at the time I store the post info. This way I only need to do one query to retrieve all posts with their author info. However, when the user info changes (such as a name change) the list becomes outdated, and it doesn't seem easy to manage lists like this.
So is there a better or standard approach to this?
P.S.: I am using MongoDB.
Mongo is a NoSQL DB. By definition, NoSQL solutions are meant to be denormalized (all required data should be located in the same place).
In your example, the relationship between authors and posts is one-to-many, but the ratio of authors to posts is very small; in simple words, the number of authors compared to the number of posts will be very small.
Based on this, you can safely store the author info in the posts collection.
If you know that most of your queries will be executed on the posts collection, then it makes sense to store the author in posts; see the sketch below. It won't take much space to store one extra attribute, but it will make a huge difference in query performance and in how easy it is to code and retrieve the data.
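To make the trade-off concrete, here is a minimal sketch for the mongo shell, assuming hypothetical collection and field names (posts, users, author.name and so on are placeholders, not your actual schema). It embeds a small author snapshot in each post and shows the one extra update you need when a user renames themselves.
// Embed a denormalized author snapshot inside each post.
const author = db.users.findOne({ username: "alice" });

db.posts.insertOne({
  title: "Hello world",
  body: "Post text goes here",
  author: {                       // denormalized snapshot of the user
    _id: author._id,              // keep the reference so updates stay possible
    name: author.name
  },
  createdAt: new Date()
});

// Home page: one query returns the posts together with their author names.
db.posts.find({}, { title: 1, "author.name": 1 })
        .sort({ createdAt: -1 })
        .limit(40);

// If the user changes their name, propagate it to their posts in one update.
db.users.updateOne({ _id: author._id }, { $set: { name: "Alice B." } });
db.posts.updateMany({ "author._id": author._id }, { $set: { "author.name": "Alice B." } });
Whether that updateMany is acceptable depends on how often names change; for most blogs it is a rare background write compared to reading the home page thousands of times.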

MongoDB - most efficient way to structure my DB?

I'm a bit new to MongoDB and I'm trying to set up a simple server where I will have users, posts, comments, likes and dislikes, among other things. What I'm wondering is how this should be set up most efficiently.
Should I have one table for likes where I add userId and postId (and more or less the same for the dislikes and comments tables)?
Or would it be better if likes, dislikes and comments were part of the post? Like:
//Post structure
{
  "_id": "kljflskds",
  "field1": "content",
  "field2": "content",
  "likes": [userId, userId, userId],
  "dislikes": [userId, userId, userId],
  "comments": [{comment object}, {comment object}, {comment object}]
}
Because for each post, when I retrieve them, I would like to know how many likes, dislikes and comments it has. With the first version I would either need multiple queries on the server (unnecessary processor power?) or on the phone (unnecessary bandwidth), whereas the second would only need one query. I believe the second option, having comments as part of the posts, is more efficient, but I'm not a pro, so I'd like to hear what other people think of this.
As has been pointed out, there are no tables in a document-oriented database. What you'll also find is that unlike a relational database where there is often a 'right way' to structure the database, the same is not true with MongoDB. Your schema should be structured based on how you're going to access the information most regularly. Documents are extremely flexible, unlike rows in tables.
You could create a comments collection or have them directly in the post documents. Two considerations would be: 1. Will you need to access the comments without accessing the post? and 2. Are your documents going to get too big and unwieldy?
In both of these cases with your blog, it most likely would be better to nest the comments, as most of your traffic will be searching for posts and you'll be pulling all of the comments related to the post. Also, a comment will not be owned by multiple posts; besides, MongoDB isn't meant to be normalized like a relational database, so having duplicate information in multiple documents (e.g. tag names, city names, etc.) is normal.
Also, having a collection for likes is a very 'relational' way of thinking. In MongoDB, I can't think of a use case where you'd want a likes collection. When you're coming from the relational world, you really have to step back and rethink how you're creating your database because you'll be constantly fighting it otherwise.
With only two collections, posts and users, getting the information that you're looking for would be trivial, as you can just get the count of the likes and comments and they're all right there.
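A minimal mongo shell sketch of that two-collection layout, with made-up field names: likes and dislikes are arrays of user ids embedded in the post, and one aggregation returns just the counts without shipping the full arrays to the phone.
// Placeholder ids standing in for real users/posts.
const userId = ObjectId();
const postId = ObjectId();

db.posts.insertOne({
  _id: postId,
  field1: "content",
  field2: "content",
  likes: [],                      // array of user ids
  dislikes: [],
  comments: []                    // array of embedded comment sub-documents
});

// $addToSet stops the same user liking twice; $pull removes any earlier dislike.
db.posts.updateOne(
  { _id: postId },
  { $addToSet: { likes: userId }, $pull: { dislikes: userId } }
);

// Return the post with just the counts, in a single round trip.
db.posts.aggregate([
  { $match: { _id: postId } },
  { $project: {
      field1: 1,
      likeCount: { $size: "$likes" },
      dislikeCount: { $size: "$dislikes" },
      commentCount: { $size: "$comments" }
  } }
]);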

MongoDB/Mongoose: many-to-many relationship

I have two Mongoose schemas, Post and Tag, and I want to design a many-to-many relationship between them. I'm wondering which of these is the best solution for performance:
In both the Tag and Post models, keep an array of references to the other schema's models (each Tag has many Posts referenced in an array of ids, and vice versa).
Keep the array of Tag ids only in the Post schema.
The second solution seems easier to implement, because when I edit the list of tags related to one post only one array has to be modified, but at the same time it might be less performant when getting all the posts that belong to one tag.
"Keep the array of Tag ids only in the Post schema"
I would definitely use this second solution, which is more straightforward.
Unless you have exotic requirements, you shouldn't need each Tag to explicitly track its Post references. An array of Post references in Tag documents would effectively be unbounded in size. This is a usage pattern that tends to create storage fragmentation and/or performance issues for popular documents which frequently outgrow the record padding of their allocated record space.
On the other hand, the number of Tags used in a single Post is unlikely to change much over time and you can make this query performant by adding an index on the Tag array in your Post collection.
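A minimal Mongoose sketch of that second option, with assumed schema and field names: the Tag reference array lives only on Post, and a multikey index on it keeps "all posts for one tag" a single indexed query.
const mongoose = require('mongoose');

const tagSchema = new mongoose.Schema({
  name: { type: String, required: true, unique: true }
});

const postSchema = new mongoose.Schema({
  title: String,
  body: String,
  tags: [{ type: mongoose.Schema.Types.ObjectId, ref: 'Tag' }]  // references live only here
});
postSchema.index({ tags: 1 });   // multikey index: one entry per tag per post

const Tag = mongoose.model('Tag', tagSchema);
const Post = mongoose.model('Post', postSchema);

// "All posts that belong to one tag" stays a single indexed query.
function postsForTag(tagId) {
  return Post.find({ tags: tagId }).populate('tags').exec();
}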

Questions about a better solution for keeping comments and votes in one document

I'm looking for a better way to handle this. I designed my documents to store all comments and votes (including a confirmation that stores picture and text information as well) in arrays inside the document. The issue I'm concerned about is the document size limit (16 MB so far): if a document accumulates a lot of comments, and especially votes, in internal arrays, it will very probably break by reaching the size limit. On the other hand, keeping this strategy ensures faster queries.
What should I do? Do I go the relational-DB route and keep this kind of information in separate collections and documents? That would decrease search speed, but otherwise I keep the data safe and unbroken.
It all depends on how you plan on using the data. If you get a huge number of comments on a single blog post would you really want to query for the post and get all the comments back?
No real-life webpage that shows you a blog post actually does that. They show you the first few comments and then fetch more, either as you scroll through those or when you click "show more". That's probably the best (hybrid) model: store what you need when you first display a blog post in the blog post document itself, but keep everything else that you can query for later in a separate collection where every comment references the post it belongs to. Then you can get the additional comments with a single indexed read (probably indexed on post_id and date posted). You can also use the "bucketing" technique and store comments grouped by post and chunk of time, so that you can fetch an entire "next page of comments" document at once.
If you architect this correctly, then rather than reducing your search speed it will likely increase your search and read speed for the base documents, and save you a lot of network bandwidth too.
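Here is a minimal mongo shell sketch of that hybrid, with assumed collection and field names (comments, recentComments, commentCount are illustrative): full comments live in their own indexed collection, while the post keeps only a counter and a short preview.
// Placeholder ids standing in for a real post and user.
const postId = ObjectId();
const userId = ObjectId();

// Index that backs the "fetch more comments for this post" read.
db.comments.createIndex({ postId: 1, createdAt: -1 });

db.comments.insertOne({
  postId: postId,
  author: userId,
  text: "Nice article!",
  createdAt: new Date()
});

// Keep a counter and the last five comments on the post itself.
db.posts.updateOne(
  { _id: postId },
  {
    $inc: { commentCount: 1 },
    $push: { recentComments: { $each: [{ author: userId, text: "Nice article!" }], $slice: -5 } }
  }
);

// "Show more": page through the rest with one indexed read.
db.comments.find({ postId: postId })
           .sort({ createdAt: -1 })
           .skip(5)
           .limit(20);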

What is the correct noSQL collection structure for the following case?

As someone who is used to thinking in relational terms, I am trying to get a grasp of thinking the "NoSQL way".
Assume the following scenario:
We have a blog (e.g. 9gag.com) with many posts and registered users. Every post can be liked by each user. We would like to build a recommendation engine, so we need to track:
all posts viewed by a user
all posts liked by a user
Posts have: title, body, category. Users have: username, password, email, other data.
In a relational DB we would have something like: posts, users, posts_users_views (post_id, user_id, view_date), posts_users_likes (post_id, user_id, like_date).
Question
What is the "right" structure would be in a document/column oriented noSQL database?
Clarification: Should we save an array of all viewed/liked post ids in users (or user ids in posts)? If so, won't we have a problem with a row size getting huge?
In CouchDB you could have separate documents for the user, post, view and like. Showing the views/likes by user can be arranged with a "view" (a materialized map/reduce query) whose map function emits the array key [user_id, post_id]. As a result you get a sorted dictionary (ordered lexicographically by key), so taking all the views for user='ID' is a query over keys ranging from [ID] to [ID,{}]. You can optimize it, but the basic solution is very simple; a sketch follows.
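A minimal sketch of such a design document, assuming each view/like is stored as its own document with type, user_id, post_id and a date field (all of these names are illustrative):
// _design/recommendations: two map functions, one per tracked event type.
{
  "_id": "_design/recommendations",
  "views": {
    "views_by_user": {
      "map": "function (doc) { if (doc.type === 'view') { emit([doc.user_id, doc.post_id], doc.view_date); } }"
    },
    "likes_by_user": {
      "map": "function (doc) { if (doc.type === 'like') { emit([doc.user_id, doc.post_id], doc.like_date); } }"
    }
  }
}

// All posts viewed by one user (keys must be URL-encoded in practice):
// GET /db/_design/recommendations/_view/views_by_user?startkey=["ID"]&endkey=["ID",{}]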
In the CouchDB wiki there is a comment on using a relationally modeled design and the view collation mechanism (which can substitute for some simple joins). To build intuition, I would rather advise studying the problem of posts and comments, which is also very simple, but not quite as trivial as views and likes :)
There may be no single "NoSQL way", but I think most map/reduce systems share a similar type of thinking. CouchDB is a good tool to start with because it is very limited :) It is difficult to write queries that would be inefficient in a distributed environment, and its map and reduce functions cannot have side effects (they generate the materialized view incrementally as the document set changes, and the result must not depend on the order of document updates).