Is this structure possible to implement with MongoDB?

So I am coding an app in Meteor which is a mini social network for my college. The issue I am facing right now is that my data is relational.
This web app allows people to write posts and post images and links. People who follow a user see that user's posts in their feed. People can share posts, so the data is interconnected.
So basically
Users have followers
Followers get the posts from the people they follow
They can comment and share
The shared post appears on the people who follow the sharer
Every post should be tagged from a predefined set of tags
People who follow a tag should get the posts with that tag whether or not they follow the person who wrote the post

You made the first – and correct – step of defining your use cases before you start to model your data.
However, you have the misconception that interrelated data needs an RDBMS.
There are several ways to model relationships between documents.
Note: The following examples are heavily simplified for brevity and comprehensibility
Embedding
A 1:1 relationship can be modeled simply by embedding:
{
_id: "joe",
name: "Joe Bookreader",
address: {
street: "123 Fake Street",
city: "Faketon",
state: "MA",
zip: "12345"
}
}
A 1:many relationship can be modeled by embedding, too:
{
_id: "joe",
name: "Joe Bookreader",
phone:[
{ type:"mobile", number:"+1 2345 67890"},
{ type:"home", number:"+1 2345 987654"}
]
}
References
The major difference in comparison to RDBMS is that you resolve the references in your application code, as shown below.
Implicit references
Let's say we have publisher and books. A publisher doc may look like this
{
_id: "acme",
name: "Acme Publishing, Inc."
}
and a book doc may look like this
{
_id:"9788636505700",
name: "The Book of Foo",
publisher: "acme"
}
Now here comes the important part: We have basically two use cases we can cover with this data. The first one being
For "The Book of Foo", get the details for the publisher
Easy enough, since we already have "The Book of Foo" and its values:
db.publishers.find({_id:"acme"})
The other use case would be
Which books have been published by Acme Publishing, Inc. ?
Since we have Acme Publishing, Inc's data, again, that is easy enough:
db.books.find({publisher:"acme"})
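Since the server does not resolve these references, the application does. As a rough sketch, the two lookups above can be mimicked with plain arrays standing in for the publishers and books collections. The helper names publisherOf and booksBy are made up for illustration and are not part of any MongoDB API:

```javascript
// In-memory stand-ins for the two collections from the example above.
const publishers = [{ _id: "acme", name: "Acme Publishing, Inc." }];
const books = [
  { _id: "9788636505700", name: "The Book of Foo", publisher: "acme" },
];

// Use case 1: we hold a book and want its publisher's details.
// Mirrors db.publishers.find({_id: book.publisher}).
function publisherOf(book) {
  return publishers.find((p) => p._id === book.publisher) || null;
}

// Use case 2: we hold a publisher and want all of its books.
// Mirrors db.books.find({publisher: publisher._id}).
function booksBy(publisher) {
  return books.filter((b) => b.publisher === publisher._id);
}

console.log(publisherOf(books[0]).name);
console.log(booksBy(publishers[0]).map((b) => b.name));
```

Either direction costs one query, which is why the reference only needs to be stored on one side.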
Explicit References
MongoDB has a notion of references, commonly referred to as DBRefs.
However, these references are resolved by the driver and not by the MongoDB server. Personally, I have never used or needed them, since implicit references work perfectly most of the time.
Example modeling for "Users make posts"
Let's say we have a user document like
{
_id: "joe",
name: "Joe Bookreader",
joined: ISODate("2015-05-05T06:31:00Z"),
…
}
When doing it naively, we would simply embed the posts into the user document.
However, that would limit the number of posts one can make, since there is a hardcoded size limit for BSON documents of 16MB.
So, it is pretty obvious that every post should have its own document with a reference to the author:
{
_id: someObjectId,
title: "My first post",
text: "Some text",
author: "joe"
}
However, this comes with a problem: we want to show the author's name, not his id.
For a single post, we could simply do a lookup in the users collection to get the name. But what if we want to display a list of posts? That would require a lot of queries. So instead, we use redundancy to save those queries and optimize our application:
{
_id: someObjectId,
title: "My first post",
text: "Some text",
author: { id:"joe", name: "Joe Bookreader"}
}
So for a list of posts, we can display them with the poster's name without additional queries. Only when a user wants details about the poster would you look up the poster by his id. With that, we have saved a lot of queries for a common use case. You may say "Stop! What if a user changes his name?" Well, for starters, it is a relatively rare use case. And even when it happens, it is not much of a problem. First, of course, we'd have to update the user document:
db.users.update({"_id":"joe"},{$set:{name:"Joe A. Bookreader"}})
And then, we have to take an additional step:
db.posts.update(
{ "author.id": "joe" },
{ $set:{ "author.name": "Joe A. Bookreader" }},
{ multi: true}
)
Of course, this is kind of costly. But what have we done here? We optimized a common use case at the expense of a rather rare use case. A good bargain in my book.
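To make the two-step rename concrete, here is a small sketch that applies the same logic to in-memory arrays instead of real collections. The function name renameUser and the extra sample posts are hypothetical; against a real database you would issue the two updates shown above.

```javascript
// In-memory stand-ins for the users and posts collections.
const users = [{ _id: "joe", name: "Joe Bookreader" }];
const posts = [
  { _id: 1, title: "My first post", author: { id: "joe", name: "Joe Bookreader" } },
  { _id: 2, title: "Second post", author: { id: "joe", name: "Joe Bookreader" } },
  { _id: 3, title: "Other post", author: { id: "ann", name: "Ann Writer" } },
];

// Step 1 updates the authoritative user document.
// Step 2 updates every redundant copy of the name (the "multi" update).
// Returns the number of posts touched.
function renameUser(userId, newName) {
  const user = users.find((u) => u._id === userId);
  if (!user) return 0;
  user.name = newName;

  let touched = 0;
  for (const post of posts) {
    if (post.author.id === userId) {
      post.author.name = newName;
      touched += 1;
    }
  }
  return touched;
}

const touched = renameUser("joe", "Joe A. Bookreader");
console.log(touched, posts.map((p) => p.author.name));
```

The cost of the rename grows with the number of posts by that author, while the cost of rendering a list of posts stays at a single query.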
I hope this simple example helps you better understand how you can approach your application's use cases with MongoDB.

Related

Good DB-design to reference different collections in MongoDB

I regularly face a similar problem: how to reference several different collections in the same property in MongoDB (or any other NoSQL database). Usually I use Meteor.js for my projects.
Let's take an example for a notes collection that includes some tagIds:
{
_id: "XXXXXXXXXXXXXXXXXXXXXXXX",
message: "This is an important message",
dateTime: "2018-03-01T00:00:00.000Z",
tagIds: [
"123456789012345678901234",
"abcdefabcdefabcdefabcdef"
]
}
So a certain id referenced in tagIds might either be a person, a product or even another note.
Of course, the most obvious solution for this, in my opinion, is to save the type as well:
...
tagIds: [
{
type: "note",
id: "123456789012345678901234",
},
{
type: "person",
id: "abcdefabcdefabcdefabcdef",
}
]
...
Another solution I'm also thinking about is to use several fields for each collection, but I'm not sure if this has any other benefits (apart from the clear separation):
...
tagIdsNotes: ["123456789012345678901234"],
tagIdsPersons: ["abcdefabcdefabcdefabcdef"],
...
But somehow both solutions feel strange to me as they need a lot of extra information (it would be nice to have this information implicit) and so I wanted to ask, if this is the way to go, or if you know any other solution for this?
If you use Meteor Methods to pull this data, you have a chance to run some code, get from DB, run some mappings, pull again from DB etc and return a result. However, if you use pub/sub, things are different, you need to keep it really simple and light.
So, first question: method or pub/sub?
Your question is really more like: should I embed and how much to embed, or should I not embed and build relations (only keep an id of a tag in the message object) and later use aggregations or should I denormalize (duplicate data): http://highscalability.com/building-scalable-databases-denormalization-nosql-movement-and-digg
All these are ok in Mongo depending on your case: https://www.mongodb.com/blog/post/6-rules-of-thumb-for-mongodb-schema-design-part-3
The way I do this is to keep a tags Collection indexed by messageId and eventually date (for sorting). When you have a message, you get all tags by querying the Tags Collection rather than mapping over your tags in your message object and send 3 different queries to 3 different Collections (person, product, note).
If you embed your tags data in the message object, let's say in your UX you want to show there are 3 tags and on click you get those 3 tags. You can basically pull those tags when you pulled the message (and might not need that data) or pull the tags on an action such as click. So, you might want to consider what data you need in your view and only pull that. You could keep an Integer as number of tags on the message object and save the tags in either a tags Collection or embed in your message object.
Following the principles of NoSQL it is ok and advisable to save some data multiple times in different collections to make your queries super fast.
So in a Tags Collection you could save as well things related to your original objects. Let's say
// Tags
{
...
messageId: 'xxx',
createdAt: Date,
person: {
firstName: 'John',
lastName: 'Smith',
userId: 'yyyy',
...etc
}
},
{
...
messageId: 'xxy',
createdAt: Date,
product: {
name: 'product_name',
productId: 'yyzz',
...etc
},
}
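As an illustration of why a single tags collection keeps reads simple, the sketch below collects every tag of one message in a single pass and then splits them by kind. The sample data and the tagsForMessage helper are assumptions made for this example; in Meteor the equivalent would be one find on the tags collection filtered by messageId.

```javascript
// In-memory stand-in for the Tags collection sketched above.
const tags = [
  { messageId: "xxx", createdAt: new Date("2018-03-01"), person: { firstName: "John", lastName: "Smith", userId: "yyyy" } },
  { messageId: "xxx", createdAt: new Date("2018-03-02"), product: { name: "product_name", productId: "yyzz" } },
  { messageId: "xxy", createdAt: new Date("2018-03-03"), product: { name: "other_product", productId: "yyzy" } },
];

// One pass over the tags of a message, grouped by what each tag points at.
// No second lookup in the person or product collections is needed,
// because the display data was duplicated into the tag document.
function tagsForMessage(allTags, messageId) {
  const result = { persons: [], products: [] };
  for (const tag of allTags) {
    if (tag.messageId !== messageId) continue;
    if (tag.person) result.persons.push(tag.person);
    if (tag.product) result.products.push(tag.product);
  }
  return result;
}

console.log(tagsForMessage(tags, "xxx"));
```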

Mongodb: store all related data in one collection or abstract pieces of data from each other?

Schema:
articles: [
{
_id: uid,
owner: userId,
title: string,
text: text,
}
],
comments_1: [
{
// single comment
articleId: uid,
text: text,
user: {
name: string,
id: uid
}
}
],
comments_2: [
{
// all comments at once
articleId: uid,
comments: [
{
_id: commentId,
text: text,
user: {
name: string,
id: uid
}
}
],
}
],
I'm a bit confused with mongodb recommendations:
Say I need to retrieve information for an article page. I'll need to do 2 requests: first to find the article by id, and second to find its comments. If I included comments (comments_2) as a property of each article, I'd need only one query to get all the data I need, and if I needed to list, say, the titles of 20 articles, I'd perform a query specifying the properties to be retrieved, right?
Should I store comments and articles in different collections?
If comments are in a different store, should I store them the comments_1 way or the comments_2 way?
I'll avoid deep explanations, because I think the schema explains my point clearly. Briefly, I don't get whether it's better to store everything in one place and then specify the properties I want to retrieve while querying, or to abstract pieces of data into different collections.
In a relational database, this would be achieved with a JOIN. MongoDB has had a rough equivalent since version 3.2: the $lookup aggregation stage.
This allows you to keep comments and articles in separate collections, but still retrieve the list of comments for an article with a single query.
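As a hedged sketch of what such a query could look like for the schema in the question, the function below builds an aggregation pipeline that matches one article and pulls in its comments. The collection name "comments" and the helper name articleWithCommentsPipeline are assumptions; the result would be passed to db.articles.aggregate(...).

```javascript
// Builds a pipeline: find one article, then attach all comments whose
// articleId equals the article's _id under a new "comments" array field.
function articleWithCommentsPipeline(articleId) {
  return [
    { $match: { _id: articleId } },
    {
      $lookup: {
        from: "comments",          // assumed name of the comments collection
        localField: "_id",         // field on the article
        foreignField: "articleId", // field on each comment
        as: "comments",            // output array field on the article
      },
    },
  ];
}

const pipeline = articleWithCommentsPipeline("article-1");
console.log(JSON.stringify(pipeline, null, 2));
// In the mongo shell: db.articles.aggregate(pipeline)
```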
It's a typical trade-off you have to make. Both approaches have their own pros and cons and you have to choose what fits best for your use case. Couple of inputs:
Single collection:
fast to load a single article, since you load all data in one query
no issues with loading the titles of 20 articles (you can query only a subset of fields using projection)
Multiple collections:
much easier to do perpendicular queries (e.g. comments made by a specific user)
I would go with version 1, since it's simpler and version 2 won't give you any advantage
Well, MongoDB models are usually meant to hold data and relationships together, since it doesn't provide JOINs ($lookup is the nearest thing to a join and is costly, so best avoided).
That's why there is a huge emphasis on denormalization in DB modeling, since there are two benefits of storing data together:
You don't have to join collections, and you can get the data in a single query.
Since MongoDB updates a single document atomically, you can update comments and article in one go, without worrying about transactions and rollback.
So almost certainly you would want to put comments inside the article collection. It would be something like:
articles: [
{
_id: uid,
owner: userId,
title: string,
text: text,
comments: [
{
_id: commentId,
text: text,
user: {
name: string,
id: uid
}
}
]
}
]
Before we agree to it, let us look at the drawbacks of the above approach.
There is a limit of 16MB per document, which is huge, but if the text of your article is large and it also attracts a large number of comments, it could cross 16MB.
Everywhere you fetch an article for other purposes, you would have to exclude the comments field, otherwise the read would be heavy and slow.
If you have to aggregate based on comments one way or another, you might again run into memory limits.
These are serious problems that we cannot ignore, so let us consider keeping comments in a different collection and see what we lose.
First of all, comments and articles, though linked, are different entities, so you may never need to update them together.
Secondly, you would have to load comments separately, which makes sense in the normal use case; that is how most applications proceed anyway, so that too is not an issue.
So in my opinion the clear winner is having two separate collections:
articles: [
{
_id: uid,
owner: userId,
title: string,
text: text,
}
],
comments: [
{
// single comment
articleId: uid,
text: text,
user: {
name: string,
id: uid
}
}
]
You wouldn't want to go the comments_2 way if you choose the two-collection approach, for the same reason: what if there are a huge number of comments for a single article?

MongoDB One to Many Relationship

I’m starting to learn MongoDB, and at one point I was asking myself how to solve the “one to many” relationship design in MongoDB. While searching, I found many comments in other posts/articles like “you are thinking relational”.
OK, I agree. There will be some cases where duplication of information won’t be a problem, for example in the CLIENTS-ORDERS case.
But suppose you have the table ORDERS, which has an embedded DETAIL structure with the PRODUCTS that a client bought.
For one reason or another, you need to change a product name (or some other piece of information) that is already embedded in several orders.
In the end, you are forced to build a one-to-many relationship in MongoDB (that is, putting an ObjectID field as a link to another collection) to solve this simple problem, aren’t you?
But every article/comment I find about this says that it will hurt performance in Mongo. It’s kind of disappointing.
Is there another way to solve/design this without a performance penalty in MongoDB?
One to Many Relations
In this relationship, many entities map to one entity, e.g.:
- a city has many persons who live in it. Say NYC has 8 million people.
Let's assume the below data model:
//city
{
_id: 1,
name: 'NYC',
area: 30,
people: [{
_id: 1,
name: 'name',
gender: 'gender'
.....
},
....
8 million people data inside this array
....
]
}
This won't work because that document is going to be REALLY HUGE. Let's try flipping it on its head.
//people
{
_id: 1,
name: 'John Doe',
gender: gender,
city: {
_id: 1,
name: 'NYC',
area: '30'
.....
}
}
The problem with this design is that there are obviously many people living in NYC, so we've duplicated a lot of city data.
Probably the best way to model this data is to use true linking.
//people
{
_id: 1,
name: 'John Doe',
gender: gender,
city: 'NYC'
}
//city
{
_id: 'NYC',
...
}
In this case, the people collection can be linked to the city collection. Since we don't have foreign key constraints, we have to be consistent about it ourselves. So this is a one-to-many relation, and it requires 2 collections. For small one-to-few relations (which are also one-to-many), like blog post to comments, the comments can be embedded inside the post document as an array.
So, if it's truly one to many, 2 collections with linking work best. But for one to few, one single collection is generally enough.
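A minimal sketch of resolving such links in application code, assuming the people and city documents have already been loaded into arrays (the attachCities helper and the extra sample people are hypothetical). Building a lookup table first avoids one query per person, and the third person shows what a missing foreign key constraint means in practice:

```javascript
const cities = [{ _id: "NYC", area: 30 }];
const people = [
  { _id: 1, name: "John Doe", city: "NYC" },
  { _id: 2, name: "Jane Roe", city: "NYC" },
  { _id: 3, name: "Sam Poe", city: "LA" }, // dangling link: no such city document
];

// Replace each person's city id with the linked city document.
// Nothing enforces that the id exists, so unknown ids resolve to null.
function attachCities(peopleDocs, cityDocs) {
  const byId = new Map(cityDocs.map((c) => [c._id, c]));
  return peopleDocs.map((p) => ({ ...p, city: byId.get(p.city) || null }));
}

const resolved = attachCities(people, cities);
console.log(resolved);
```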
The problem is that you over-normalize your data. An order is defined by a customer, who lives at a certain place at a given point in time and pays a certain price valid at the time of the order (which might change heavily over the application's lifetime and which you have to document anyway), plus several other parameters which are all valid only at a certain point in time. So to document an order (pun intended), you need to persist all data for that point in time. Let me give you an example:
{ _id: "order123456789",
date: ISODate("2014-08-01T16:25:00.141Z"),
customer: ObjectId("53fb38f0040980c9960ee270"),
items:[ ObjectId("53fb3940040980c9960ee271"),
ObjectId("53fb3940040980c9960ee272"),
ObjectId("53fb3940040980c9960ee273")
],
Total:400
}
Now, as long as neither the customer nor the details of the items change, you are able to reproduce where this order was sent to, what the prices on the order were, and the like. But what happens if the customer changes its address? Or if the price of an item changes? You would need to keep track of those changes in their respective documents. It would be much easier and sufficiently efficient to store the order like:
{
_id: "order987654321",
date: ISODate("2014-08-01T16:25:00.141Z"),
customer: {
userID: ObjectId("53fb3940040980c9960ee283"),
recipientName: "Foo Bar",
address: {
street: "742 Evergreen Terrace",
city: "Springfield",
state: null
}
},
items: [
{count:1, productId:ObjectId("53fb3940040980c9960ee300"), price: 42.00 },
{count:3, productId:ObjectId("53fb3940040980c9960ee301"), price: 0.99},
{count:5, productId:ObjectId("53fb3940040980c9960ee302"), price: 199.00}
]
}
With this data model and the usage of aggregation pipelines, you have several advantages:
You don't need to independently keep track of prices and addresses or name changes or gift buys of a customer - it is already documented.
Using aggregation pipelines, you can create price trends without the need to store pricing data independently. You simply store the current price of an item in an order document.
Even complex aggregations such as price elasticity, turnover by state / city and alike can be done using pretty simple aggregations.
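To illustrate the kind of aggregation meant here, the sketch below computes turnover per city over order documents shaped like the one above, in plain JavaScript rather than as a pipeline. The turnoverByCity name and the second sample order are assumptions made for the example; on the server, roughly the same result would come from an $unwind on items followed by a $group on the city.

```javascript
// Orders carry the address and the prices that were valid at order time.
const orders = [
  {
    _id: "order987654321",
    customer: { recipientName: "Foo Bar", address: { city: "Springfield" } },
    items: [
      { count: 1, price: 42.0 },
      { count: 3, price: 0.99 },
      { count: 5, price: 199.0 },
    ],
  },
  {
    _id: "order987654322", // hypothetical second order
    customer: { recipientName: "Baz Qux", address: { city: "Shelbyville" } },
    items: [{ count: 2, price: 10.0 }],
  },
];

// Sum count * price over all items, grouped by the city stored in the order.
function turnoverByCity(orderDocs) {
  const totals = {};
  for (const order of orderDocs) {
    const city = order.customer.address.city;
    const orderTotal = order.items.reduce((sum, i) => sum + i.count * i.price, 0);
    totals[city] = (totals[city] || 0) + orderTotal;
  }
  return totals;
}

const totals = turnoverByCity(orders);
console.log(totals);
```

Because each order stores its own prices and address, no lookup into product or customer history is needed to compute this.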
In general, it is safe to say that in a document oriented database, every property or field which is subject to change in the future and this change would create a different semantic meaning should be stored inside the document. Everything which is subject to change in the future but doesn't touch the semantic meaning (the users password in the example) may be linked via a GUID.

Multilingual data modeling on MongoDB

I am trying to model my objects on MongoDB and am not sure how to proceed. I am building a Product catalog with these characteristics:
No frequent changes to product catalog. A bulk operation may be done weekly / fortnight.
Product information is in multiple languages ( English, Spanish , French ) new language may be added anytime.
Here is what I am trying to do: I need to model my product catalog to capture the multilingual functionality. Assume I have:
product : {
_id:xxx,
sku:"23456",
name:"Name",
description: "Product details",
tags:["x1","x2"],
...
}
Surely name, description, tags and possibly images will change according to language. So, how do I model it?
I can have a separate collection for each language, e.g. enProducts, esProducts, etc.
Or have a JSON representation in the product itself with the individual languages, like:
product :{
id: xxx,
en: {
name: "Name",
description: "product details.."
},
es: {
name: "Name",
description: "product details.."
},
...
}
Or is there any other solution? Need help of MongoDB modeling experts here :)
Another option would be to just keep the values different per language. Would probably make maintaining the schema much easier as well:
product : {
_id:xxx,
sku: {
und: "23456"
},
name: {
en: "Fork",
de: "Gabel"
},
description: {
en: "A metal thingy with four spikes",
de: "Ein Ding aus Metall, das vier Spitzen hat"
}
}
und would be short for "undefined", i.e. the same for all languages, and could be used as a fallback - or you always use "en" as fallback if you'd prefer that.
The above example is roughly how Drupal CMS manages languages (albeit translated from SQL to Mongo).
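The fallback rule described above could be sketched as follows. The localized helper and the order of fallbacks (requested language, then "en", then "und") are assumptions; pick whichever order suits your catalog.

```javascript
const product = {
  sku: { und: "23456" },
  name: { en: "Fork", de: "Gabel" },
  description: { en: "A metal thingy with four spikes" }, // no German text yet
};

// Returns the value for the requested language, falling back to
// English and finally to the language-neutral "und" entry.
function localized(field, lang) {
  if (field == null) return undefined;
  if (lang in field) return field[lang];
  if ("en" in field) return field.en;
  return field.und;
}

console.log(localized(product.name, "de"));
console.log(localized(product.description, "de"));
console.log(localized(product.sku, "de"));
```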
What about this approach:
product: {
id: 1,
name: 'Original Name',
description: 'Original Description',
price: 33,
date: '2019-03-13',
translations: {
es: {
name: 'Nombre Original',
description: 'Descripción Original',
}
}
}
If the user selects some language different to the default and the key translations exists in the object, you only need to merge it, and if any key has no translation, the original remains.
Another advantage is if you need to remove the translation feature or add/remove some language, you only need to change or remove the translation key and not having to refactor the entire schema.
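The merge step mentioned above might look like this in application code. The translate helper is a hypothetical name, and the description was deliberately left out of the Spanish translation to show the fallback to the original:

```javascript
const product = {
  id: 1,
  name: "Original Name",
  description: "Original Description",
  price: 33,
  translations: {
    es: { name: "Nombre Original" }, // description left untranslated on purpose
  },
};

// Overlay the translated keys on the original fields and drop the
// translations key from the result. Untranslated keys keep the original.
function translate(doc, lang) {
  const { translations = {}, ...base } = doc;
  return { ...base, ...(translations[lang] || {}) };
}

console.log(translate(product, "es"));
console.log(translate(product, "fr"));
```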
Both solutions are normally standard for this, the first being standard in RDBMS techs as well (or file based translations being another method that is not possible here).
As for which is best right here, I am leaning towards the second considering your use.
Some of the reasons would be:
One single document load for all translations and product data, no JOINs
Making for a single contiguous read of your disk
Allowing for atomic updating and adding of new languages and changes etc to a single product
But creating some downsides:
Updating could (probably will) create fragmentation, which can be remedied to some extent (not completely) by usePowerOf2Sizes
All your ops will now go to one single part of your hard disk, which may create a bottleneck; however, your scenario is such that you rarely update, if at all, so this shouldn't be a problem.
As a side note: I am judging that fragmentation might not be too much of a problem for you. The reason is that you only really bulk import products, probably from a CSV, so your documents will probably not regularly grow beyond the power-of-2 allocation made at insertion. As such, this point might be moot.
So overall, if planned right the second option is a good one however, there are some considerations to take into account:
Could the multiple descriptions/fields push the document past the 16MB limit?
How do you manually pad the document to use space efficiently and prevent fragmentation?
Those are your biggest concerns if you go with the second option.
Considering that you can fit all of the works of Shakespeare into 4MB with room to spare, I am actually not sure you will reach the 16MB limit. If you do, it would take some considerable text, and maybe storing the images in binary in the document.
Coming back to the first option, your largest concern will be duplication of certain data, i.e. price (France and Spain both have the Euro) unless you use two documents, one to house common data and the other a translation (this will make 4 documents actually but two queries).
Considering that this catalogue will never be updated unless in bulk duplicated data will not matter too much (however, for future reference in the case of expansion I will be cautious) so:
You can make it have one document per translation and not worry about updating prices atomically across all regions
You have one disk read without the fragmentation
No need to manually pad your documents
So both options are readily available but I am leaning towards the second case.
I use the following pattern for keys and values, with an index on key:
{
"id":"ObjectId",
"key":"error1",
"values":[{
"lang":"en",
"value":"Error Message 1"
},
{
"lang":"fa",
"value":"متن خطای شماره 1"
}]
}
and use this query (adapt it to your C# driver):
object = collection.find({"key": "error1"});
See also the MongoDB documentation page "Model One-to-Many Relationships with Embedded Documents".
For a static list of languages I would go with @Zagorulkin Dmitry's solution, as it is easy to query.
For a dynamic list of languages, I would rather not change the schema and allow easy management of the data.
The down side is that querying is less trivial.
{
"product": {
"id": "xxx",
"languageDependentData": [
{
"language": "en",
"name": "Name",
"description": "product details.."
},
{
"language": "es",
"name": "Name",
"description": "product details.."
}
]
}
}
This way will be the best:
product :{
id: xxx,
en: {
name: "Name",
description: "product details.."
},
es: {
name: "Name",
description: "product details.."
},
...
}
simply because you only have to fetch one product, and can then pick any language from it.
Yet another option is to store your primary data in one language only and to have a separate text-resource translation collection where you map any text resource from your primary language to other target languages (no matter if your text resource comes from the primary data store or is just a translation of a system message on your system).
I.e. make no language specific adjustments to the schema and model at all.
The drawback I can see is maintenance: when a product is removed from the primary store, its entries must also be removed from the translation collection. As long as you can guarantee that the same resource is not used elsewhere, this is trivial, but it needs to be programmed :)

Referential integrity in MongoDB. Which is a better practice?

Let's say I have a documents collection and an authors collection. I could design them in two ways:
1st way:
documents
{_id:1, title:"document 1", author:"John", age: 34}
{_id:2, title: "document 2", author: "Maria", age:42 }
{_id:3, title: "document 3", author: "John", age: 34}
authors
{_id:1, name:"John", age:34}
{_id:2, name:"Maria", age:42}
2nd way:
documents
{_id:1, title:"document 1", id_author:1}
{_id:2, title: "document 2", id_author: 2}
{_id:3, title: "document 3", id_author: 1}
authors
{_id:1, name:"John", age:34}
{_id:2, name:"Maria", age:42}
1st way is good because I don't have to simulate a Join when I retrieve a document, I have all the data in the documents collection. But, on the other hand, if I have to change Maria's age, I have to do it in both collections.
2nd way is the opposite: if I need a document and the age of its author, I need to query documents first and then authors. But the good thing is that when I have to change Maria's age I only have to do it in the authors collection.
So, which solution is better? I guess that the more fields you need in authors collection the more likely you'll be using the second way. But, if I am using the 1st way, is there a single query I can use to update the age of Maria in both collections?
Which is the most used solution?
An update in more than one collection would require a transaction. MongoDB did not support multi-document transactions before version 4.0.
Both ways have their own disadvantages.
The first way which is author-data inclusive may be more appropriate in logging situations where its contents won't be subject to change.
The second way is better when you expect the author's details to change or grow over time (most cases).
Like already mentioned, embedding the documents in their respective author's document would be a way to combine the 2 suggestions' benefits but may lead to problems in the long run.
The problem with the first method is updates:
{_id:1, title:"document 1", author:"John", age: 34}
I can imagine that actually you will want an author id in there as well as some of the details you need for querying (schema redundancy).
This could pose a problem, as you notice:
But, on the other hand, if I have to change Maria's age, I have to do it in both collections.
Age changes at least once every year, and more often if you have the age wrong. The name can change as well, especially if you later find that this "John" has a last name or that his name is actually "Johnny".
So the problem with creating redundancy here is that the author document could change dramatically, forcing you to run very slow updates which could massively increase your working set at times. How often this would happen I cannot say with the information provided; that will be up to you to decide.
Normally, redundancy works well when the attributes you copy into the current document are extremely rarely updated. This does not seem to be the case here.
The second way is normally the default for this kind of randomly read and updated relationship. However, there is a possible third method: embedding.
You could embed the documents into the author. This depends on how many documents you are looking to store, though, since MongoDB has a max document size of 16MB.
That being said a possibility is:
{
_id: {},
name: 'John',
age: 43,
documents: [
{ id: 1, title: "New Document" }
]
}
The one downside of this is the use of in-memory operations such as $pull or $push, and beyond that, if your document is consistently and vastly growing, you could see fragmentation.
But again, these are just notes for you to take in; the reality depends upon information not provided.
I would suggest a mix of both approaches: the "static" information is saved along with the documents collection, and the variable data is centralized in the authors collection. Only when the variable data needs to be retrieved do I use the author id to look up his age. Something like this:
documents
{_id:"1", title:"document 1", author:"John", authorId: "1"}
{_id:"2", title: "document 2", author: "Maria", authorId: "2"}
{_id:"3", title: "document 3", author: "John", authorId: "1"}
authors
{_id:"1", name:"John", age:34}
{_id:"2", name:"Maria", age:42}
Age is something you wouldn't require too often, but it could be updated frequently, so this handles both situations better.
As someone else mentioned, Mongo did not offer multi-document transactions (they arrived in version 4.0), so you could have problems if you create the author and the document in one shot.
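As a rough sketch of how the mixed approach plays out in application code, the example below uses arrays in place of the two collections. The helper names listDocuments and authorAge are made up for illustration. The list view needs no lookup at all, while the age is fetched with one extra query only when it is actually wanted:

```javascript
const documents = [
  { _id: "1", title: "document 1", author: "John", authorId: "1" },
  { _id: "2", title: "document 2", author: "Maria", authorId: "2" },
  { _id: "3", title: "document 3", author: "John", authorId: "1" },
];
const authors = [
  { _id: "1", name: "John", age: 34 },
  { _id: "2", name: "Maria", age: 42 },
];

// List view: the author's name is stored redundantly, so no lookup is needed.
function listDocuments() {
  return documents.map((d) => `${d.title} by ${d.author}`);
}

// Detail view: the frequently changing age lives only in authors,
// so it costs one extra lookup by authorId.
function authorAge(doc) {
  const author = authors.find((a) => a._id === doc.authorId);
  return author ? author.age : undefined;
}

console.log(listDocuments());
console.log(authorAge(documents[1]));
```

Changing Maria's age then touches exactly one author document and no entries in documents.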