MongoDB collections - which way will be more efficient?

I am more used to MySQL but I decided to go MongoDB for this project.
Basically it's a social network.
I have a posts collection where documents currently look like this:
{
"text": "Some post...",
"user": "3j219dj21h18skd2" // User's "_id"
}
I am looking to implement a replies system. Would it be better to simply add an array of replies to each post, like so:
{
"text": "Some post...",
"user": "3j219dj21h18skd2", // User's "_id"
"replies": [
{
"user": "3j219dj200928smd81",
"text": "Nice one!"
},
{
"user": "3j219dj2321md81zb3",
"text": "Wow, this is amazing!"
}
]
}
Or will it be better to have a whole separate "replies" collection with a unique ID for each reply, and then "link" to it by ID in the posts collection?
I am not sure, but it feels like the 1st way is more "NoSQL-like", and the 2nd way is the way I would go for MySQL.
Any inputs are welcome.

This is a typical data modeling question in MongoDB. Since you are planning to store just the _id of the user with each reply, the answer is definitely to embed the replies, because they are part of the post object.
If those replies can number in the hundreds or thousands and you are not going to show them by default (for example, you are going to have the users click to load those comments) then it would make more sense to store the replies in a separate collection.
Finally, if you need to store more than the user _id (such as the name) you have to think about maintaining the name in two places (here and in the user maintenance page) as you are duplicating data. This can be manageable or too much work. You have to decide.
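As a rough illustration of the two options, here is a minimal Python sketch that uses plain lists and dicts as stand-ins for collections (it is not driver code). The ids "post1", "r1", "r2" and the helper load_replies are hypothetical names introduced only for this example.

```python
# In-memory stand-ins for the two designs; not a real MongoDB driver.

# Design 1: replies embedded in the post document.
embedded_post = {
    "_id": "post1",  # hypothetical id
    "text": "Some post...",
    "user": "3j219dj21h18skd2",
    "replies": [
        {"user": "3j219dj200928smd81", "text": "Nice one!"},
        {"user": "3j219dj2321md81zb3", "text": "Wow, this is amazing!"},
    ],
}

# Design 2: replies in their own collection, each pointing back at its post.
posts = [{"_id": "post1", "text": "Some post...", "user": "3j219dj21h18skd2"}]
replies = [
    {"_id": "r1", "post": "post1", "user": "3j219dj200928smd81", "text": "Nice one!"},
    {"_id": "r2", "post": "post1", "user": "3j219dj2321md81zb3", "text": "Wow, this is amazing!"},
]


def load_replies(post_id, skip=0, limit=10):
    """Return one page of replies for a post (hypothetical helper)."""
    matching = [r for r in replies if r["post"] == post_id]
    return matching[skip:skip + limit]
```

With the embedded design every read of the post carries all replies with it. With the separate collection the post stays small and the replies can be loaded a page at a time, for example when the user clicks to load them.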

Related

Is keeping a log of all document relationships an anti-pattern in CouchDB?

When we return each document in our database to be consumed by the client, we must also add a property "isInUse" to that document's response payload to indicate whether the given document is referenced by other documents.
This is needed because referenced documents cannot be deleted, so a trash bin button should not be displayed next to its listing entry in the client-side app.
So basically we have relationships where a document can reference another like this:
{
"_id": "factor:1I9JTM97D",
"someProp": 1,
"otherProp": 2,
"defaultBank": <id of some bank document>
}
Previously we used views and selectors to query for each document's references in other documents, however this proved to be non-trivial.
So here's how someone in our team has implemented this now: we register all relationships in dedicated "relationship" documents like the one below and update them every time a document is created/updated/deleted by the server, to reflect any new references or de-references:
{
"_id": "docInUse:bank",
"_rev": "7-f30ffb403549a00f63c6425376c99427",
"items": [
{
"id": "bank:1S36U3FDD",
"usedBy": [
"factor:1I9JTM97D"
]
},
{
"id": "bank:M6FXX6UA5",
"usedBy": [
"salesCharge:VDHV2M9I1",
"salesCharge:7GA3BH32K"
]
}
]
}
The question is whether this solution is an anti-pattern and what are the potential drawbacks.

I would say using a single document to record the relationships between all other documents could be problematic because:
the document "docInUse:bank" could end up being updated frequently. Cloudant allows you to update documents, but when you get to many thousands of revisions the document size becomes non-trivial, because all the previous revision tokens are retained
updating a central document invites the problem of document conflicts if two processes attempt to update the document at the same time. You are allowed to have conflicts, but it is your app's responsibility to manage them
if you have lots of relationships, this document could get very large (I don't know enough about your app to judge)
Another solution is to keep your bank:*, factor:* & salesCharge:* documents the same and create a document per relationship e.g.
{
"_id": "1251251921251251",
"type": "relationship",
"doc": "bank:1S36U3FDD",
"usedby": "factor:1I9JTM97D"
}
You can then find out documents on either side of the "join" by querying documents by the value of doc or usedby with a suitable index.
I've also seen implementations, where the document's _id field contains all of the information:
{
"_id": "bank:1S36U3FDD:factor:1I9JTM97D",
"added": "2018-02-28 10:24:22"
}
and the primary index helpfully sorts the document ids for you, allowing judicious use of GET /db/_all_docs?startkey=x&endkey=y to fetch the relationships for a given bank id.
If you need to undo a relationship, just delete the document!
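To make the key-range idea concrete, here is a small Python sketch that emulates the sorted primary index with a sorted list and a binary search. It is only a simulation: the helper names used_by and is_in_use are hypothetical, the "\ufff0" high sentinel is the convention commonly used for prefix ranges in CouchDB, and Python's code-point ordering is assumed to be close enough to the database's collation for ids like these.

```python
from bisect import bisect_left, bisect_right

# Sorted list of relationship document ids, standing in for the primary index.
relationship_ids = sorted([
    "bank:1S36U3FDD:factor:1I9JTM97D",
    "bank:M6FXX6UA5:salesCharge:7GA3BH32K",
    "bank:M6FXX6UA5:salesCharge:VDHV2M9I1",
])


def used_by(doc_id):
    """Return the ids that reference doc_id, via a startkey/endkey range scan."""
    startkey = doc_id + ":"
    endkey = doc_id + ":\ufff0"  # high sentinel so every id with this prefix is included
    lo = bisect_left(relationship_ids, startkey)
    hi = bisect_right(relationship_ids, endkey)
    # Strip the "<doc_id>:" prefix to leave the referencing document's id.
    return [rid[len(startkey):] for rid in relationship_ids[lo:hi]]


def is_in_use(doc_id):
    """A document is in use if at least one relationship document points at it."""
    return bool(used_by(doc_id))
```

The "isInUse" flag then falls out of a single range read per document, and removing a relationship is just removing one id from the index.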

By building a cache of relationships on every document create/update/delete as you currently implemented it, you are basically recreating an index manually in the database. This is the reason why I would lean towards calling it an antipattern.
One great way to improve your design is to store each relation as a separate document as Glynn suggested.
If your concern is consistency (which I think might be the case, judging by looking at the document types you mentioned), try to put all information about a transaction into a single document. You can define the relationships in a consistent place in your documents, so updating the views would not be necessary:
{
"_id":"salesCharge:VDHV2M9I1",
"relations": [
{ "type": "bank", "id": "bank:M6FXX6UA5" },
{ "type": "whatever", "id": "whatever:xy" }
]
}
Then you can keep your views consistent, and you can rely on CouchDB to keep the "relation cache" up to date.
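A sketch of what such a derived "relation cache" amounts to, emulated in plain Python rather than as a real CouchDB view: a map step emits one (referenced id, referencing id) pair per relation, and the index is rebuilt from the documents themselves. The function names, the "bank:UNUSED0001" document, and the assumption that the factor document also uses a "relations" array are made up for this illustration.

```python
from collections import defaultdict

# Documents carrying their own relations; the last bank is a hypothetical unused one.
documents = [
    {"_id": "salesCharge:VDHV2M9I1", "relations": [{"type": "bank", "id": "bank:M6FXX6UA5"}]},
    {"_id": "salesCharge:7GA3BH32K", "relations": [{"type": "bank", "id": "bank:M6FXX6UA5"}]},
    {"_id": "factor:1I9JTM97D", "relations": [{"type": "bank", "id": "bank:1S36U3FDD"}]},
    {"_id": "bank:M6FXX6UA5"},
    {"_id": "bank:1S36U3FDD"},
    {"_id": "bank:UNUSED0001"},
]


def map_relations(doc):
    """Emulated map step: emit (referenced id, referencing id) for each relation."""
    for relation in doc.get("relations", []):
        yield relation["id"], doc["_id"]


def build_view(docs):
    """Collect the emitted pairs into an index keyed by the referenced id."""
    view = defaultdict(list)
    for doc in docs:
        for key, value in map_relations(doc):
            view[key].append(value)
    return dict(view)


view = build_view(documents)
```

Here a bank is in use exactly when its id appears as a key in the view, so nothing has to be maintained by hand when documents change.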

Mongodb real basic use case

I'm approaching the noSQL world.
I studied a little bit around the web (not the best way to study!) and I read the Mongodb documentation.
Around the web I wasn't able to find a real case example (only fancy flights on big architectures not well explained or too basic to be real world examples).
So I still have some huge holes in my understanding of NoSQL and MongoDB.
I try to summarise one of them, the worst one actually, here below:
Let's imagine the data structure for a post of a simple blog structure:
{
"_id": ObjectId(),
"title": "Title here",
"body": "text of the post here",
"date": ISODate("2010-09-24"),
"author": "author_of_the_post_name",
"comments": [
{
"author": "comment_author_name",
"text": "comment text",
"date": ISODate("date")
},
{
"author": "comment_author_name2",
"text": "comment text",
"date": ISODate("date")
},
...
]
}
So far so good.
All works fine if the author_of_the_post does not change his name (not considering profile picture and description).
The same for all comment_authors.
So if I want to consider this situation I have to use relationships:
"authorID": <author_of_the_post_id>,
for post's author and
"authorID": <comment_author_id>,
for comments authors.
But MongoDB does not allow joins when querying. So there will be a different query for each authorID.
So what happens if I have 100 comments on my blog post?
1 query for the post
1 query to retrieve the post author's information
100 queries to retrieve the comment authors' information
**total of 102 queries!!!**
Am I right?
Where is the advantage of using a noSQL here?
In my understanding 102 queries VS 1 bigger query using joins.
Or am I missing something and there is a different way to model this situation?
Thanks for your contribution!

Have you seen this?
http://www.sarahmei.com/blog/2013/11/11/why-you-should-never-use-mongodb/
It sounds like what you are doing is NOT a good use case for NoSQL. Use relational database for basic data storage to back applications, use NoSQL for caching and the like.

NoSQL databases are often used to store non-sensitive data, for instance posts and comments.
With embedded data you are able to retrieve everything with one query. For example, you may not care about outdated fields such as author_name or profile_picture_url, because it's just a post and over time it will be less visible than newer ones. But if you want the fields kept up to date, you have two options:
The first option is to use some kind of worker service. If a user changes his username or profile picture, you signal the service to traverse all posts and comments and update those fields with the new values.
The second option is to use authorId instead of the author name, so instead of 2 queries you will make N+2 queries to fetch each comment author's profile. But use pagination: instead of querying for 100 comments, take 10 and show a "load more" button/link, so you will make 12 queries.
Hope this helps.
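As a side note on the query count, the author lookups do not have to be issued one per comment: the application can collect the distinct author ids and fetch them in one batch (in MongoDB this would presumably be a single find using the $in operator). The sketch below simulates that with dicts; the user ids, the names, and the helpers find_users_by_ids and render_post are all hypothetical.

```python
# In-memory stand-in for a users collection, keyed by _id (hypothetical data).
users = {
    "u1": {"_id": "u1", "name": "Alice"},
    "u2": {"_id": "u2", "name": "Bob"},
    "u3": {"_id": "u3", "name": "Carol"},
}

post = {
    "title": "Title here",
    "authorID": "u1",
    "comments": [
        {"authorID": "u2", "text": "comment text"},
        {"authorID": "u3", "text": "comment text"},
        {"authorID": "u2", "text": "comment text"},
    ],
}

query_log = []  # records each simulated round trip to the users collection


def find_users_by_ids(ids):
    """Simulate one batched lookup for a whole set of user ids."""
    query_log.append(sorted(ids))
    return {uid: users[uid] for uid in ids if uid in users}


def render_post(post):
    """Application-level join: one batched lookup resolves every author."""
    author_ids = {post["authorID"]} | {c["authorID"] for c in post["comments"]}
    authors = find_users_by_ids(author_ids)
    return {
        "title": post["title"],
        "author": authors[post["authorID"]]["name"],
        "comments": [
            {"author": authors[c["authorID"]]["name"], "text": c["text"]}
            for c in post["comments"]
        ],
    }
```

Under these assumptions a post with 100 comments needs one query for the post and one for all of its authors, regardless of how many comments there are.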

MongoDb - Modeling storage of users & post in a webapp

I'm quite new to nosql world.
If I have a very simple webapp with users authenticating & publishing posts, what's the MongoDB (NoSQL) way to store users & posts in the database?
Do I have to store users & posts each in their own collection (like in relational databases)? Or store them in the same collection, as different documents? Or, finally, with redundant user info (credentials) on each post the user has published?

One way you could do it is to use two collections, a posts collection and an authors collection. They could look like the following:
Posts
{
title: "Post title",
body: "Content of the post",
author: "author_id",
date: "...",
comments: [
{
name: "name of the commenter",
email: "...",
comment: "..."
}],
tags: [
"tag1", "tag2", "tag3"
]
}
Authors
{
"_id": "author_id",
"password": "..."
}
Of course, you can put it all in a single collection, but @jcrade mentioned a reason why you would/should use two collections. Remember, that's NoSQL. You should design your database from the application's point of view, which means asking yourself what data is consumed and how.

This post says it all:
https://www.mongodb.com/blog/post/6-rules-of-thumb-for-mongodb-schema-design-part-1
It really depends on your application, and how many posts you expect your users to have: if it's a one-to-few relationship, then probably using embedded documents (inside your users model) is the way to go. If it's one to many (up to a couple of thousands) then just embed an array of IDs in your users model. If it's more than that, then use the answer provided by Horizon_Net.
Read the post, and you get a pretty good idea of what you will have to do. Good luck!

When you are modeling a NoSQL database you should think about 3 basic ideas:
Denormalization
Copy the same data into multiple documents in order to simplify or optimize query processing, or to fit the user's data into a particular data model.
Aggregation
Embed data into documents (for example, a blog post and its comments). This affects both the performance and the consistency of updates, because MongoDB writes are atomic at the level of a single document.
Application-level joins
Create application-level joins when it's not a good idea to aggregate information (for example, each post as an independent document would be really bad because we need to access the same resource).
To answer your question:
Create two documents: a blogPost with all its comments, its tags, and the user id; and a User with all the user information.

Schema design in MongoDB — to replicate data or not

I have a Share collection which stores a document for every time a user has shared a Link in my application. The schema looks like this:
{
"userId": String,
"linkId": String,
"dateCreated": Date
}
In my application I am making requests for these documents, but my application requires that the information referenced by the userId and linkId properties is fully resolved/populated/joined (not sure on the terminology) in order to display the information as needed. Thus, every request for a Share document results in a lookup for the corresponding User and Link documents. Furthermore, each Link has a parent Feed document which must also be looked up. This means I have some spaghetti-like code to perform each find operation in a series (3 in total). Yet, the application only needs some of the data found in these calls (one or two properties). That said, the application does need the entire Link document.
This is very slow, and I am wondering whether I should just be replicating the data in the Share document itself. In my head, this is fine because most of the data will not change, but some of it might (i.e. a User's username). This suggests a Share schema design like so:
{
"userId": String,
"user": {
"username": String,
"name": String
},
"linkId": String,
"link": {}, // all of the `Link` data
"feed": {
"title": String
},
"dateCreated": Date
}
What is the consensus on optimising data for the application with regards to this? Do you recommend that I replicate the data and write some glue code to ensure the replicated username gets updated if it changes (for example), or can you recommend a better solution (with details on why)? My other worry about replicating data in this manner is, what if I needed more data in the Share document further down the line?
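For a sense of what the "glue code" for a replicated username could involve, here is a minimal in-memory Python sketch. It is an illustration under assumptions, not driver code: the ids, the sample data, and the helper rename_user are hypothetical, and in a real deployment the loop would presumably be a single multi-document update filtered on userId.

```python
# In-memory stand-ins for the User and Share collections (hypothetical data).
users = {
    "u1": {"_id": "u1", "username": "old_name", "name": "Ann"},
    "u2": {"_id": "u2", "username": "bob", "name": "Bob"},
}

shares = [
    {"_id": "s1", "userId": "u1", "user": {"username": "old_name", "name": "Ann"}, "linkId": "l1"},
    {"_id": "s2", "userId": "u2", "user": {"username": "bob", "name": "Bob"}, "linkId": "l1"},
    {"_id": "s3", "userId": "u1", "user": {"username": "old_name", "name": "Ann"}, "linkId": "l2"},
]


def rename_user(user_id, new_username):
    """Update the source of truth, then fan the change out to every replicated copy."""
    users[user_id]["username"] = new_username
    updated = 0
    for share in shares:
        if share["userId"] == user_id:
            share["user"]["username"] = new_username
            updated += 1
    return updated
```

The cost of replication is that every rename touches as many Share documents as the user has created, in exchange for reads that no longer need the User lookup.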

Document design

I am trying out some different options to design and store a document structure in an efficient way in RavenDB.
The structure I am handling is user session and activity tracking information.
A Session is started when a User logs into the system and the activities start getting created. There could be hundreds of activities per session.
The session ends when the user closes / logs out.
A factor that complicates the scenario slightly is that the sessions are displayed in a web portal in real time. In other words: I need to keep track of the session and activities and correlate them to be able to find out if they are ongoing (and how long they have been running) or if they are done.
You can also dig around in the history of course.
I did some research and found two relevant questions here on Stack Overflow, but neither of them really helped me:
Document structure for RavenDB
Activity stream design with RavenDb
The two options I have spiked successfully are: (simplified structures)
1:
{
"User": "User1",
"Machine": "machinename",
"StartTime": "2012-02-13T13:11:52.0000000",
"EndTime": "2012-02-13T13:13:54.0000000",
"Activities": [
{
"Text": "Loaded Function X",
"StartTime": "2012-02-13T13:12:10.0000000",
"EndTime": "2012-02-13T13:12:10.0000000"
},
{
"Text": "Executed action Z",
"StartTime": "2012-02-13T13:12:10.0000000",
"EndTime": "2012-02-13T13:12:10.0000000"
}
]
}
2:
{
"Session" : "SomeSessionId-1",
"User": "User1",
"Machine": "machinename",
"Text": "Loaded Function X",
"StartTime": "2012-02-13T13:12:10.0000000",
"EndTime": "2012-02-13T13:12:10.0000000"
}
{
"Session" : "SomeSessionId-1",
"User": "User1",
"Machine": "machinename",
"Text": "Executed action Z",
"StartTime": "2012-02-13T13:12:10.0000000",
"EndTime": "2012-02-13T13:12:10.0000000"
}
Alternative 1 feels more natural, coming from a relational background, and it was really simple to load up a Session, add events and store it away. But the overhead of loading a Session object and then appending events every time feels really bad for insert performance.
Alternative 2 feels much more efficient, I can simply append events (almost like event-sourcing). But the selections when digging around in events and showing them per session gets a bit more complex.
Is there perhaps a third better alternative?
Could the solution be to separate the events and create another read model?
Am I overcomplicating the issue?

I definitely think you should go with some variant of option 2. Won't the documents grow very large in option 1? That would probably make the inserts very slow.
I can't really see why showing events per session would be any more complicated in option 2 than in option 1, you can just select events by session with
session.Query<Event>().Where(x => x.Session == sessionId)
and RavenDB will automatically create an index for it. And if you want to make more complicated queries you could always create more specialized indexes for that.
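To show that per-session selection stays simple with option 2, here is a small Python sketch that groups flat event documents by their Session value, much like an index would, and derives a per-session summary. The timestamps are trimmed to whole seconds, the second session is invented, and treating an EndTime of None as "still running" is an assumption made for this example; build_session_index and session_summary are hypothetical helpers.

```python
from collections import defaultdict
from datetime import datetime

# Flat event documents as in option 2 (sample data, partly hypothetical).
events = [
    {"Session": "SomeSessionId-1", "User": "User1", "Text": "Loaded Function X",
     "StartTime": "2012-02-13T13:12:10", "EndTime": "2012-02-13T13:12:10"},
    {"Session": "SomeSessionId-1", "User": "User1", "Text": "Executed action Z",
     "StartTime": "2012-02-13T13:12:40", "EndTime": None},  # assumption: None means still running
    {"Session": "SomeSessionId-2", "User": "User2", "Text": "Loaded Function X",
     "StartTime": "2012-02-13T14:00:00", "EndTime": "2012-02-13T14:00:05"},
]


def build_session_index(all_events):
    """Group events by their Session value, much like an index on that field."""
    index = defaultdict(list)
    for event in all_events:
        index[event["Session"]].append(event)
    return index


def session_summary(index, session_id):
    """Summarize one session: event count, first start time, and whether it is ongoing."""
    session_events = index.get(session_id, [])
    starts = [datetime.fromisoformat(e["StartTime"]) for e in session_events]
    return {
        "count": len(session_events),
        "started": min(starts) if starts else None,
        "ongoing": any(e["EndTime"] is None for e in session_events),
    }
```

Appending an event never requires loading the rest of the session, while the grouped view gives the portal what it needs to list sessions and their activities.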

Looks like you just need a User document and a Session document. Create two models, "User" and "Session". The Session doc would have the user id as one property, and would also have nested "activity" properties. It will be easy to show real-time users, sessions and activities in this case. Without knowing more details, I'm oversimplifying of course.
EDIT:
// Sample User document
{
"UserId": "ABC01",
"HomeMachine": "xxxx",
"DateCreated": "12/12/2011"
}
// Sample Session document
{
"UserId": "ABC01",
"Activities": [
{ /* Activity 1 properties */ },
{ /* Activity 2 properties */ }
// ... etc.
]
}
