I have the following schema for posts. Each post has an embedded author and attachments (array of links / videos / photos etc).
{
"content": "Pixable tempts Everpix users with quick-import tool for photos ahead of December 15 closure http:\/\/t.co\/tbsSrVYneK by #psawers",
"author": {
"username": "TheNextWeb",
"id": "10876852",
"name": "The Next Web",
"photo": "https:\/\/pbs.twimg.com\/profile_images\/378800000147133877\/895fa7d3daeed8d32b7c089d9b3e976e_bigger.png",
"url": "https:\/\/twitter.com\/account\/redirect_by_id?id=10876852",
"description": "",
"serviceName": "twitter"
},
"attachments": [
{
"title": "Pixable tempts Everpix users with quick-import tool for photos ahead of December 15 closure",
"description": "Pixable, the SingTel-owned company that organizes your social photos in smart ways, has announced a quick-import tool for Everpix users following the company's decision to close ...",
"url": "http:\/\/t.co\/tbsSrVYneK",
"type": "link",
"photo": "http:\/\/cdn1.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2013\/09\/camera1-.jpg"
}
]
}
Posts are read often (we have a view with 4 tabs, and each tab requires 24 posts to be shown). Currently we are indexing these lists in Redis, so querying 4x24 posts is as simple as fetching the lists from Redis (which returns a list of MongoDB ids) and querying posts with those ids.
Updates on the embedded author happen rarely (for example when the author changes his picture). The updates do not have to be instantaneous or even fast.
We're wondering if we should split the author and the post into two different collections, so that a post would hold a reference to its author instead of an embedded, duplicated author. Is a normalized data model preferred here, given that the author is currently duplicated in every post, resulting in a lot of duplicated data / extra bytes? Or should we continue with the denormalized model?
Since you seem to have a few orders of magnitude more reads than writes, it probably makes little sense to split this data into two collections. Especially with few updates, and with you needing almost all of the author information when showing posts, one query is going to be faster than two. You also get data locality, so you would potentially need less data in memory, which should provide another benefit.
However, you can only really find out by benchmarking this with the amount of data that you'd be using in production.
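If you keep the author embedded, the rare author change can be pushed to all of that author's posts in one background update. Here is a minimal sketch of that idea, assuming a collection named `posts` and the `author.id` field from the schema in the question. `buildAuthorUpdate` is a hypothetical helper, not part of any driver:

```javascript
// Hypothetical helper: turns the changed author fields into a filter and a
// $set update that touch only the embedded author sub-document.
function buildAuthorUpdate(authorId, changes) {
  const set = {};
  for (const [field, value] of Object.entries(changes)) {
    // Dot notation targets a single field inside the embedded document.
    set["author." + field] = value;
  }
  return { filter: { "author.id": authorId }, update: { $set: set } };
}

const { filter, update } = buildAuthorUpdate("10876852", {
  photo: "https://example.com/new.png", // placeholder URL
});

// With a connected shell or driver (assumed collection name `posts`):
// db.posts.updateMany(filter, update);
```

Since the update does not have to be fast, it can run from a background job. An index on `author.id` would keep it from scanning the whole collection.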
I have an application where an article can be linked to multiple platforms.
Article contains a list of platforms and platforms also contains a list of articles.
For more detailed information please look at this stackoverflow question that I asked a few months ago.
https://stackoverflow.com/a/40377383/5770147
The question was on how to create an article and implement the N-N relationship between article and platform.
I have creating and deleting an article set up so that the lists in the platforms are updated as well.
How do I implement editing an article so I can update which platforms are linked to an article?
For creating and editing the linked platforms I use a dropdown menu in which multiple options can be selected. The necessary code can be found in the question previously linked.
Based on the information that you provided, I would recommend two possible approaches, starting from the same foundation:
Use two collections (articles and platforms) and store only a reference to platform documents in an array defined on article documents
I would recommend this approach if:
You have a high cardinality of both article documents and platforms
You want to be able to manage both entities independently, while also syncing references between them
// articles collection schema
{
"_id": ...,
"title": "I am an article",
...
"platforms": [ "platform_1", "platform_2", "platform_3" ],
...
}
// platforms collection schema
{
"_id": "platform_1",
"name": "Platform 1",
"url": "http://right/here",
...
},
{
"_id": "platform_2",
"name": "Platform 2",
"url": "http://right/here",
...
},
{
"_id": "platform_3",
"name": "Platform 3",
"url": "http://right/here",
...
}
Even if this approach is quite flexible, it comes at a cost - if you require both article and platform data, you will have to fire more queries to your MongoDB instance, as the data is split in two different collections.
For example, when loading an article page, considering that you also want to display a list of platforms, you would have to fire a query to the articles collection, and then also trigger a search on the platforms collection to retrieve all the platform entities to which that article is published via the members of the platforms array on the article document.
However, if you only have a small subset of frequently accessed platform attributes that you need to have available when loading an article document, you might enhance the platforms array on the articles collection to store those attributes in addition to the _id reference to the platform documents:
// enhanced articles collection schema
{
"_id": ...,
"title": "I am an article",
...
"platforms": [
{platform_id: "platform_1", name: "Platform 1"},
{platform_id: "platform_2", name: "Platform 2"},
{platform_id: "platform_3", name: "Platform 3"}
],
...
}
This hybrid approach would be suitable if the platform data attributes that you frequently retrieve to display together with article specific data are not changing that often.
Otherwise, you will have to synchronize all the updates that are made to the platform document attributes in the platforms collection with the subset of attributes that you track as part of the platforms array for article documents.
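As a rough illustration of that synchronization, the sketch below builds the filter and update that would refresh the copied `name` in every article after a platform is renamed. It assumes the enhanced schema above and a collection named `articles`. `buildPlatformNameSync` is a hypothetical helper:

```javascript
// Hypothetical helper: builds the filter and update that refresh the copied
// `name` inside the `platforms` array of every affected article.
function buildPlatformNameSync(platformId, newName) {
  return {
    // Matches articles whose platforms array contains this platform.
    filter: { "platforms.platform_id": platformId },
    // The positional `$` operator refers to the array element that the
    // filter matched, so only that entry is rewritten.
    update: { $set: { "platforms.$.name": newName } },
  };
}

const sync = buildPlatformNameSync("platform_1", "Platform One");

// With a connected shell or driver (assumed collection name `articles`):
// db.articles.updateMany(sync.filter, sync.update);
```

The positional operator updates only the first matching element per document, which is enough here as long as a platform appears at most once in each article's array.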
Regarding the management of article lists for individual platforms, I wouldn't recommend storing N-to-N references in both collections, as the aforementioned mechanism already allows you to extract article lists by querying the articles collection using a find query with the _id value of the platform document:
Approach #1:
db.articles.find({"platforms": "platform_1"});
Approach #2:
db.articles.find({"platforms.platform_id": "platform_1"});
Having presented two different approaches, what I would recommend now is for you to analyze the query patterns and performance thresholds of your application and make a calculated decision based on the scenarios that you encounter.
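For the editing part of the question, one possible way to handle a changed dropdown selection is to diff the old and new platform lists. This is only a sketch: `diffPlatforms` is a hypothetical helper, and the `articles` field on platform documents is an assumption that applies only if you still keep the reverse list.

```javascript
// Hypothetical helper: compares the previously stored platform ids with the
// ids selected in the edit form.
function diffPlatforms(oldIds, newIds) {
  const oldSet = new Set(oldIds);
  const newSet = new Set(newIds);
  return {
    added: newIds.filter((id) => !oldSet.has(id)),
    removed: oldIds.filter((id) => !newSet.has(id)),
  };
}

const change = diffPlatforms(
  ["platform_1", "platform_2"],
  ["platform_2", "platform_3"]
);
// change.added   is ["platform_3"]
// change.removed is ["platform_1"]

// If references live only on the article, the edit is a single update:
// db.articles.updateOne({ _id: articleId }, { $set: { platforms: newIds } });
//
// If platform documents also keep an `articles` array (assumed field name):
// db.platforms.updateMany({ _id: { $in: change.added } },
//                         { $addToSet: { articles: articleId } });
// db.platforms.updateMany({ _id: { $in: change.removed } },
//                         { $pull: { articles: articleId } });
```

With references stored on only one side, the diff is not needed at all, which is one more argument for not keeping N-to-N references in both collections.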
I am more used to MySQL but I decided to go MongoDB for this project.
Basically it's a social network.
I have a posts collection where documents currently look like this:
{
"text": "Some post...",
"user": "3j219dj21h18skd2" // User's "_id"
}
I am looking to implement a replies system. Will it be better to simply add an array of replies, like so:
{
"text": "Some post...",
"user": "3j219dj21h18skd2", // User's "_id"
"replies": [
{
"user": "3j219dj200928smd81",
"text": "Nice one!"
},
{
"user": "3j219dj2321md81zb3",
"text": "Wow, this is amazing!"
}
]
}
Or will it be better to have a whole separate "replies" collection with a unique ID for each reply, and then "link" to it by ID in the posts collection?
I am not sure, but it feels like the 1st way is more "NoSQL-like", and the 2nd way is the way I would go for MySQL.
Any inputs are welcome.
This is a typical data modeling question in MongoDB. Since you are planning to store just the _id of the user, the answer is definitely to embed the replies, because they are part of the post object.
If those replies can number in the hundreds or thousands and you are not going to show them by default (for example, you are going to have the users click to load those comments) then it would make more sense to store the replies in a separate collection.
Finally, if you need to store more than the user _id (such as the name) you have to think about maintaining the name in two places (here and in the user maintenance page) as you are duplicating data. This can be manageable or too much work. You have to decide.
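If you do end up moving replies out of the post, the documents could look roughly like the output of the sketch below. The `replies` collection and the `postId` and `position` fields are assumptions made for illustration, and `toReplyDocuments` is a hypothetical helper:

```javascript
// Hypothetical helper: turns embedded replies into standalone documents
// for a separate `replies` collection.
function toReplyDocuments(postId, replies) {
  return replies.map((reply, index) => ({
    postId: postId, // back-reference to the parent post
    user: reply.user, // still only the user's _id
    text: reply.text,
    position: index, // keeps the original order for paging
  }));
}

const replyDocs = toReplyDocuments("post_1", [
  { user: "3j219dj200928smd81", text: "Nice one!" },
  { user: "3j219dj2321md81zb3", text: "Wow, this is amazing!" },
]);

// Loading the first page only when the user asks for it
// (assumed collection name `replies`):
// db.replies.find({ postId: "post_1" }).sort({ position: 1 }).limit(10);
```

The post document then stays small no matter how many replies it collects, at the cost of one extra query when the replies are actually shown.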
I'm approaching the noSQL world.
I studied a little bit around the web (not the best way to study!) and I read the MongoDB documentation.
Around the web I wasn't able to find a real-world example (only lofty descriptions of big architectures that are not well explained, or examples too basic to be realistic).
So I still have some huge holes in my understanding of NoSQL and MongoDB.
I try to summarise one of them, the worst one actually, here below:
Let's imagine the data structure for a post of a simple blog structure:
{
"_id": ObjectId(),
"title": "Title here",
"body": "text of the post here",
"date": ISODate("2010-09-24"),
"author": "author_of_the_post_name",
"comments": [
{
"author": "comment_author_name",
"text": "comment text",
"date": ISODate("date")
},
{
"author": "comment_author_name2",
"text": "comment text",
"date": ISODate("date")
},
...
]
}
So far so good.
All works fine if the author_of_the_post does not change his name (not considering profile picture and description).
The same for all comment_authors.
So if I want to consider this situation I have to use relationships:
"authorID": <author_of_the_post_id>,
for post's author and
"authorID": <comment_author_id>,
for comments authors.
But MongoDB does not allow joins when querying. So there will be a different query for each authorID.
So what happens if I have 100 comments on my blog post?
1 query for the post
1 query to retrieve the author's information
100 queries to retrieve the comment authors' information
**total of 102 queries!!!**
Am I right?
Where is the advantage of using a noSQL here?
In my understanding 102 queries VS 1 bigger query using joins.
Or am I missing something and there is a different way to model this situation?
Thanks for your contribution!
Have you seen this?
http://www.sarahmei.com/blog/2013/11/11/why-you-should-never-use-mongodb/
It sounds like what you are doing is NOT a good use case for NoSQL. Use a relational database for basic data storage to back applications, and use NoSQL for caching and the like.
NoSQL databases are used for storage of non-sensitive data, for instance posts and comments.
You can retrieve all the data with one query. For example, you may not care about outdated fields such as author_name or profile_picture_url, because it's just a post and it will eventually be pushed out of view by newer ones. But if you want the fields kept up to date, you have two options:
The first option is to use some kind of worker service. When a user changes his username or profile picture, you signal the service to traverse all posts and comments and update the affected fields with the new values.
The second option is to store authorId instead of the author name; instead of 2 queries you will then make N+2 queries to fetch the comment author profiles. But use pagination: instead of querying for 100 comments, take 10 and show a "load more" button/link, so you will make 12 queries.
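The N+2 count assumes one query per comment author. A different technique, not the one described above, is to collect the distinct author ids first and fetch them with a single `$in` query, which brings the total down to two queries regardless of the number of comments. A minimal sketch, using in-memory stand-ins for the query results; the `authors` collection name and both helper functions are assumptions:

```javascript
// Hypothetical helper: gathers the distinct author ids used by a post
// and its comments.
function collectAuthorIds(post) {
  const ids = new Set([post.authorID]);
  for (const comment of post.comments) {
    ids.add(comment.authorID);
  }
  return [...ids];
}

// Hypothetical helper: stitches the fetched author documents back onto the
// post and its comments (an application-level join).
function attachAuthors(post, authors) {
  // Ids are plain strings here; ObjectId values would need to be converted
  // with toString() before being used as Map keys.
  const byId = new Map(authors.map((author) => [author._id, author]));
  return {
    ...post,
    author: byId.get(post.authorID),
    comments: post.comments.map((comment) => ({
      ...comment,
      author: byId.get(comment.authorID),
    })),
  };
}

// Stand-in for the result of query 1 (the post itself).
const post = {
  title: "Title here",
  authorID: "u1",
  comments: [
    { authorID: "u2", text: "comment text" },
    { authorID: "u1", text: "another comment" },
  ],
};

const ids = collectAuthorIds(post);

// Query 2 would be (assumed collection name `authors`):
// db.authors.find({ _id: { $in: ids } });
// Stand-in for its result:
const authors = [
  { _id: "u1", name: "Ann" },
  { _id: "u2", name: "Bob" },
];

const resolved = attachAuthors(post, authors);
```

Pagination still helps with payload size, but it is no longer needed just to keep the query count down.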
Hope this helps.
I'm quite new to the NoSQL world.
If I have a very simple webapp with users authenticating and publishing posts, what's the MongoDB (NoSQL) way to store users and posts in the database?
Do I have to store users and posts each in their own collection (like in relational databases)? Or store them in the same collection, in different documents? Or, finally, store redundant user info (credentials) on each post the user has published?
One way you could do it is to use two collections, a posts collection and an authors collection. They could look like the following:
Posts
{
title: "Post title",
body: "Content of the post",
author: "author_id",
date: "...",
comments: [
{
name: "name of the commenter",
email: "...",
comment: "..."
}],
tags: [
"tag1", "tag2", "tag3"
]
}
Authors
{
"_id": "author_id",
"password": "..."
}
Of course, you can put it in a single collection, but @jcrade mentioned a reason why you would/should use two collections. Remember, this is NoSQL. You should design your database from an application point of view, which means asking yourself what data is consumed and how.
This post says it all:
https://www.mongodb.com/blog/post/6-rules-of-thumb-for-mongodb-schema-design-part-1
It really depends on your application and on how many posts you expect your users to have: if it's a one-to-few relationship, then using embedded documents (inside your users model) is probably the way to go. If it's one-to-many (up to a couple of thousand), then just embed an array of IDs in your users model. If it's more than that, then use the answer provided by Horizon_Net.
Read the post, and you get a pretty good idea of what you will have to do. Good luck!
When modeling a NoSQL database, you should keep three basic ideas in mind:
Denormalization
Copy the same data into multiple documents in order to simplify or optimize query processing, or to fit the user’s data into a particular data model.
Aggregation
Embed data into documents (for example, a blog post and its comments). This affects updates in terms of both performance and consistency, because MongoDB guarantees atomicity only for a single document at a time.
Application-level joins
Create application-level joins when it is not a good idea to aggregate the information (for example, when keeping everything in a single document would be a bad idea because every access would have to go through the same resource).
To answer your question:
Create two kinds of document: a blogPost with all of its comments, its tags, and the user ID; and a User with all of the user's information.
I have a Share collection which stores a document for every time a user has shared a Link in my application. The schema looks like this:
{
"userId": String,
"linkId": String,
"dateCreated": Date
}
In my application I am making requests for these documents, but my application requires that the information referenced by the userId and linkId properties is fully resolved/populated/joined (not sure on the terminology) in order to display the information as needed. Thus, every request for a Share document results in a lookup for the subsequent User and Link documents. Furthermore, each Link has a parent Feed document which must also be looked up. This means I have some spaghetti-like code to perform each find operation in a series (3 in total). Yet, the application only needs some of the data found in these calls (one or two properties). That said, the application does need the entire Link document.
This is very slow, and I am wondering whether I should just be replicating the data in the Share document itself. In my head, this is fine because most of the data will not change, but some of it might (e.g. a User's username). This suggests a Share schema design like so:
{
"userId": String,
"user": {
"username": String,
"name": String
},
"linkId": String,
"link": {}, // all of the `Link` data
"feed": {
"title": String
},
"dateCreated": Date
}
What is the consensus on optimising data for the application with regards to this? Do you recommend that I replicate the data and write some glue code to ensure the replicated username gets updated if it changes (for example), or can you recommend a better solution (with details on why)? My other worry about replicating data in this manner is, what if I needed more data in the Share document further down the line?