I wanted to know if anyone knows whether you can overuse embedding in MongoDB. I'm not talking about something like 100 levels deep; in my application my average document size can get pretty large, and simple tests have shown documents of 177 KB.
The application is for logging, so for example I take the Apache access log and extract lots of things from it, like a list of all the pages that were called, a list of all the IP addresses, and so on. These are grouped by minute.
It is unlikely that I would ever have a document that hits the MongoDB document size limit, but I wanted to know: if I kept each of the sub-lists as its own document, would that give better performance when returning subset information (e.g. querying for all the IP addresses seen over 5 minutes)?
When I run the query I filter to show only the IP addresses. Am I wasting the database's performance if I group each minute into one document, or am I wasting it if I split each list into its own document?
You want to structure your collections and documents in a way that reflects how you intend to use the data. If you're going to do a lot of complex queries, especially with subdocuments, you might find it easier to split your documents up into separate collections. An example of this would be splitting comments from blog posts.
Your comments could be stored as an array of subdocuments:
# Example post document with comment subdocuments
{
  title: 'How to Mongo!',
  content: 'So I want to talk about MongoDB.',
  comments: [
    {
      author: 'Renold',
      content: "This post, it's amazing."
    },
    ...
  ]
}
This might cause problems, though, if you want to do complex queries on just comments (e.g. picking the most recent comments from all posts, or getting all comments by one author). If you plan on making these complex queries, you'd be better off creating two collections: one for comments and the other for posts.
# Example post document with "ForeignKeys" to comment documents
{
  _id: ObjectId("50c21579c5f2c80000000000"),
  title: 'How to Mongo!',
  content: 'So I want to talk about MongoDB.',
  comments: [
    ObjectId("50c21579c5f2c80000000001"),
    ObjectId("50c21579c5f2c80000000002"),
    ...
  ]
}
# Example comment document with a "ForeignKey" to a post document
{
  _id: ObjectId("50c21579c5f2c80000000001"),
  post_id: ObjectId("50c21579c5f2c80000000000"),
  author: 'Renold',
  content: "This post, it's amazing."
}
This is similar to how you'd store "ForeignKeys" in a relational database. Normalizing your documents like this makes querying both comments and posts easy. Also, since you're breaking up your documents, each document will take up less memory. The trade-off, though, is that you have to maintain the ObjectId references whenever there's a change to either document (e.g. when you insert/update/delete a comment or post). And since there are no event hooks in Mongo, you have to do all this maintenance in your application.
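With the two-collection design, the "all comments by one author" query from above becomes a plain find on the comments collection (e.g. db.comments.find({author: 'Renold'})). Here is a minimal sketch of that lookup logic, using in-memory arrays as stand-ins for the two collections (the sample data is illustrative):

```javascript
// In-memory stand-ins for the normalized posts and comments collections
const posts = [
  { _id: "p1", title: "How to Mongo!", comments: ["c1", "c2"] }
];
const comments = [
  { _id: "c1", post_id: "p1", author: "Renold", content: "This post, it's amazing." },
  { _id: "c2", post_id: "p1", author: "Howard", content: "Agreed." }
];

// Equivalent of db.comments.find({ author: "Renold" })
const byAuthor = comments.filter(c => c.author === "Renold");

// Equivalent of db.comments.find({ post_id: "p1" }) — all comments on one post
const forPost = comments.filter(c => c.post_id === "p1");
```

In real MongoDB you would index `author` and `post_id` so both lookups stay fast as the comments collection grows.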
On the other hand, if you don't plan on doing any complex queries on a document's subdocuments, you might benefit from storing monolithic objects. For instance, a user's address isn't something you're likely to run queries against:
# Example user document with address subdocument
{
  _id: ObjectId("50c21579c5f2c80000000042"),
  name: 'Howard',
  password: 'naughtysecret',
  address: {
    state: 'FL',
    city: 'Gainesville',
    zip: 32608
  }
}
I'm having a bit of trouble understanding when and why to use embedded documents in a Mongo database.
Imagine we have three collections: users, rooms and bookings.
I have a few questions about a situation like this:
1) How would you update the embedded document? Would it be the responsibility of the application developer to find all instances of Kevin as an embedded document and update them?
2) If the solution is to use document references, is that as heavy as a relational db join? Is this just a case of the example not being a good fit for Mongo?
As always let me know if I'm being a complete idiot.
Imho, you overdid it. Given that the questions from your use cases are
For a given reservation, what room is booked by which user?
For a given user, what are his or her details?
How many beds does a given room provide?
I would go with the following model for rooms
{
  _id: 1001,
  beds: 2
}
for users
{
  _id: new ObjectId(),
  username: "Kevin",
  mobile: "12345678"
}
and for reservations
{
  _id: new ObjectId(),
  date: new ISODate(),
  user: "Kevin",
  room: 1001
}
Now in a reservation overview, you can have all relevant information ("who", "when" and "which") by simply querying reservations, without any overhead to answer the first question from your use cases. In a reservation details view, admittedly, you would have to do two queries, but they are lightning fast with proper indexing and, depending on your technology, can be done asynchronously, too. Note that I saved an index by using the room number as the _id. How to answer the remaining questions should be obvious.
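To make the three lookups concrete, here is a small sketch of the query logic, with plain arrays standing in for the rooms, users and reservations collections (the data mirrors the example documents above; IDs are illustrative):

```javascript
// In-memory stand-ins for the three collections
const rooms = [{ _id: 1001, beds: 2 }];
const users = [{ _id: "u1", username: "Kevin", mobile: "12345678" }];
const reservations = [
  { _id: "r1", date: new Date("2015-05-05"), user: "Kevin", room: 1001 }
];

// Q1: for a given reservation, which room is booked by which user?
// One query on reservations answers this, since it holds both references.
const res = reservations.find(r => r._id === "r1");

// Q2: the user's details — a second query, like db.users.findOne({username: res.user})
const who = users.find(u => u.username === res.user);

// Q3: how many beds does the booked room provide?
// Like db.rooms.findOne({_id: res.room}), keyed on the room number _id.
const beds = rooms.find(r => r._id === res.room).beds;
```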
So as per your original question: embedding is not necessary here, imho.
The database is MongoDB, not SQL, and that isn't going to change.
Suppose you have an app that allows users to post questions and then tag them with a single subject: math, science, english, history.
Also, on the nav, each subject has its own tab that will display questions that are tagged with the corresponding subject.
Since the display of questions by tag is essential to the app, what is the fastest way to retrieve this data?
Possibilities:
(1) Leave the tags as a single field in the questions collection. Problem: Would have to search through every question, then search through the tags field, to find the relevant tags.
(2) A tag collection, with a field for each subject. Problem: if the number of questions grows too large (approx. 20,000 posts), Mongo won't work. From the MongoDB docs (http://blog.mongodb.org/post/35133050249/mongodb-for-the-php-mind-part-3):
With document design you need to consider two things: scale and
search. Scale: MongoDB documents have a limit of 16MB, which although
it sounds quite small, can accommodate thousands of documents. However, if
you are expecting 20,000 comments per post on a high traffic website,
or your comments are unlimited in size, then embedding might not work
well for you.
This seems to leave...
(3) A separate collection for each tag. Is (3) best in this case?
From the (rather rudimentary) spec you give, I'd go for a single collection for the documents, where each document contains an array of tags:
{
  _id: "whatever",
  content: "the question",
  tags: [ "this", "that", "and another tag" ],
  ...
}
Then, for efficient querying by tag, set a multi-key index on the tags.
See https://docs.mongodb.com/manual/indexes/#multikey-index
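A multikey index is created the same way as a regular one (db.questions.createIndex({ tags: 1 })); MongoDB then indexes each array element, so a query with a scalar value matches any document whose tags array contains it. A sketch of that matching semantics, with a plain array standing in for the collection (sample questions are illustrative):

```javascript
// In-memory stand-in for the questions collection
const questions = [
  { _id: 1, content: "What is 2+2?", tags: ["math"] },
  { _id: 2, content: "Why did Rome fall?", tags: ["history"] },
  { _id: 3, content: "Is light a wave?", tags: ["science", "math"] }
];

// Equivalent of db.questions.find({ tags: "math" }):
// a scalar query value matches any element of the array field.
const mathQuestions = questions.filter(q => q.tags.includes("math"));

console.log(mathQuestions.map(q => q._id)); // [1, 3]
```

With the multikey index in place, this stays a single indexed query no matter how many subjects a question is tagged with.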
Let's say I have a service with thousands of users, and I want to post news alerts that they can view. Once they view one, it's marked as seen (for just that user, obviously).
I think I know the answer to this, but is it better to store on the news item a list of users who have seen it? Or is it better to store on the user document a list of all news items they've seen?
I'm assuming the latter is better, mostly because if I have 20,000 users, that means if all of them have seen a particular news alert, then I've got an array of 20,000 IDs stored in that news alert document, which probably isn't good. But this structure seems better:
{
  email: 'person@person.net',
  name: 'Person',
  seenNews: [
    'TTJGGiPsTqqLio4sf',
    'vhePmuShra3MSzYsu',
    'JKFqqCKDmtuuoQBXu',
    'gCFyzu8BAihj8NnXB'
  ]
}
I probably won't have more than a few hundred news items, plus I can always go back and delete old ones anyway.
Or is there an even better way to handle this?
Given you have news
{
  _id: "Fubar2.0",
  title: "Fubar 2.0 released"
}
and users
{
  _id: "12345",
  name: "CoolName"
}
storing what has been seen in either of the above models would sooner or later exceed the BSON document size limit of 16MB. Furthermore, documents that grow in size aren't handled efficiently by the mmapv1 storage engine, which is still the default.
Conclusion: you need to store the news read in separate documents in a seen collection:
{
  _id: {
    newsitem: "Fubar2.0",
    user: "12345"
  }
}
Since we have a compound _id for seen, which is automatically indexed (and held in RAM as long as possible), queries are quite efficient.
The problem is obvious: you need two queries to get the news unseen by a user:
var seen = [];
db.seen.find({"_id.user": "12345"}, {_id: 1}).forEach(
  function(doc) {
    seen.push(doc._id.newsitem);
  }
);
var unseen = db.news.find({_id: {$nin: seen}});
While this works and imho is the proper solution for the situation described, the "unseen" query isn't very efficient.
Depending on the use case, you could instead go with something like this for users
{
  _id: "12345",
  name: "CoolName",
  lastSeen: ISODate("2015-05-05T03:26:36Z")
}
and news like this
{
  _id: {
    title: "FuBar 2.0 released",
    date: ISODate("2015-05-05T03:46:00Z")
  }
}
So when a user logs in, you have already loaded the user document, right? With this, you can get all the news he or she presumably hasn't seen with
db.news.find({"_id.date":{$gte: user.lastSeen} })
Admittedly, you cannot really check which user has seen which news item, but if the goal is to make sure the user is presented with all the news since his or her last visit, the latter solution is efficient and easy to implement (and scale).
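A sketch of the lastSeen pattern, with plain objects standing in for the user document and the news collection (the data mirrors the example documents above):

```javascript
// In-memory stand-ins for the user document and the news collection
const user = {
  _id: "12345",
  name: "CoolName",
  lastSeen: new Date("2015-05-05T03:26:36Z")
};
const news = [
  { _id: { title: "FuBar 2.0 released", date: new Date("2015-05-05T03:46:00Z") } },
  { _id: { title: "Old announcement", date: new Date("2015-04-01T00:00:00Z") } }
];

// Equivalent of db.news.find({ "_id.date": { $gte: user.lastSeen } })
const unseen = news.filter(n => n._id.date >= user.lastSeen);

// After showing the news, bump lastSeen
// (in MongoDB: an update with $set on the user document)
user.lastSeen = new Date();
```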
MongoDB has a restriction that a document can be at most 16MB in size. However, it's also encouraged to embed related data inside the document. For example, a blog post and its comments:
{
  _id: 1,
  title: "First Post",
  content: "...",
  comments: [
    { content: "..." },
    { content: "..." },
    ...
  ]
}
Suppose this post goes viral and I get millions of comments. How should I store the comments in MongoDB? Should I put them inside another collection with the following structure:
{
  _id: 23,
  blogPostId: 1,
  content: "..."
}
If that's the case, how should I make the queries like "get me the blog posts which have more than 10 comments" perform efficiently?
This is quite a common use case for MongoDB and is covered in the online manual. You typically have 3 choices:
Store each comment in a separate document
Embed all comments in the parent document (be sensitive to the 16MB limit)
A hybrid design that stores comments separately from the parent but aggregates them into a small number of documents (buckets), each containing many comments
You could probably also consider another type of hybrid, where you have a maximum number of comments stored in an array on the parent document and then use bucketed comments in a 'comments overflow' collection, which would only be used by those posts that have gone viral. In reality, only a small fraction of visitors would ever interact with the web app in a way that results in a query against the overflow docs. It's a trade-off of runtime efficiency vs. developer complexity.
For most of these options, you would maintain "pre-aggregated" summary data (e.g. the total number of comments) in the parent document, and this is what you would query on easily. Pre-aggregation is also discussed in the online manual.
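For instance, keeping a commentCount field on the parent post makes "posts with more than 10 comments" a plain (indexable) query. A sketch of the pattern with in-memory objects (in MongoDB you would pair the comment insert with a $inc update on the post; field names follow the example documents above):

```javascript
// In-memory stand-ins for one post and the comments collection
const post = { _id: 1, title: "First Post", commentCount: 0 };
const comments = [];

// Adding a comment also bumps the pre-aggregated counter
// (in MongoDB: db.posts.updateOne({_id: 1}, {$inc: {commentCount: 1}}))
function addComment(postDoc, content) {
  comments.push({ blogPostId: postDoc._id, content: content });
  postDoc.commentCount += 1;
}

for (let i = 0; i < 12; i++) {
  addComment(post, "comment " + i);
}

// Equivalent of db.posts.find({ commentCount: { $gt: 10 } })
const popular = [post].filter(p => p.commentCount > 10);
```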
Well, I would go for a simple and efficient solution. Let's view the problem from a query point of view. Most queries are interested in the most recent comments, right? We do agree that embedding documents is quite fast and efficient in MongoDB, but we are limited to a certain document size. Therefore a compromise is needed here.
We can embed the most recent comments and reference the rest.
{
  _id: ObjectId(...),
  title: "Post title",
  most_recent_comments: [
    {
      commentBody: "...",
      author: "..."
    },
    {
      commentBody: "...",
      author: "..."
    }
  ],
  all_comments_ids: [ObjectId(commentId), ObjectId(commentId), ObjectId(commentId)]
}
So, you can create a designated collection for comments and keep the most recent ones embedded. In your program, the recent comments can be a queue with a size limit you specify.
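In MongoDB the fixed-size queue can be maintained atomically with $push using the $each and $slice modifiers. Here is a sketch of that queue logic on plain objects (the size limit of 2 is just for illustration):

```javascript
const MAX_RECENT = 2; // size of the embedded recent-comments queue

// In-memory stand-in for the post document above
const post = {
  title: "Post title",
  most_recent_comments: [],
  all_comments_ids: []
};

// Analogous to a MongoDB update with
//   $push: { most_recent_comments: { $each: [comment], $slice: -MAX_RECENT } }
// which keeps only the newest MAX_RECENT embedded comments.
function addComment(postDoc, commentId, comment) {
  postDoc.all_comments_ids.push(commentId);
  postDoc.most_recent_comments.push(comment);
  if (postDoc.most_recent_comments.length > MAX_RECENT) {
    postDoc.most_recent_comments.shift(); // drop the oldest embedded comment
  }
}

addComment(post, "c1", { author: "A", commentBody: "first" });
addComment(post, "c2", { author: "B", commentBody: "second" });
addComment(post, "c3", { author: "C", commentBody: "third" });
```

After the third insert, only the two newest comments remain embedded, while all three IDs are still referenced in all_comments_ids.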
I'm quite new to the NoSQL world.
If I have a very simple web app with users authenticating and publishing posts, what's the MongoDB (NoSQL) way to store users and posts in the database?
Do I have to store users and posts each in their own collection (like in relational databases)? Or store them in the same collection, as different documents? Or, finally, with redundant user info (credentials) on each post the user has published?
A way you could do it is to use two collections: a posts collection and an authors collection. They could look like the following:
Posts
{
  title: "Post title",
  body: "Content of the post",
  author: "author_id",
  date: "...",
  comments: [
    {
      name: "name of the commenter",
      email: "...",
      comment: "..."
    }
  ],
  tags: [
    "tag1", "tag2", "tag3"
  ]
}
Authors
{
  "_id": "author_id",
  "password": "..."
}
Of course, you can put it all in a single collection, but @jcrade mentioned a reason why you would/should use two collections. Remember, that's NoSQL: you should design your database from an application point of view, which means asking yourself what data is consumed, and how.
This post says it all:
https://www.mongodb.com/blog/post/6-rules-of-thumb-for-mongodb-schema-design-part-1
It really depends on your application and how many posts you expect your users to have: if it's a one-to-few relationship, then probably using embedded documents (inside your users model) is the way to go. If it's one-to-many (up to a couple of thousand), then just embed an array of IDs in your users model. If it's more than that, then use the answer provided by Horizon_Net.
Read the post, and you get a pretty good idea of what you will have to do. Good luck!
When you are modeling a NoSQL database you should think about 3 basic ideas:
Denormalization
Copy the same data into multiple documents in order to simplify/optimize query processing, or to fit the user's data into a particular data model.
Aggregation
Embed related data into documents (for example, a blog post and its comments). This affects updates, both in performance and in consistency, because MongoDB only guarantees atomicity at the level of a single document.
Application-level joins
Create application-level joins when it's not a good idea to aggregate the information (for example, each post as an independent document would be really bad here, because we would need to access the same resource repeatedly).
To answer your question: create two documents. One is blogPost, with all the comments and tags on it plus the user id; the second is User, with all the user information.
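An application-level join for that two-document design might look like the following sketch, with plain objects standing in for the two collections (field names are illustrative):

```javascript
// In-memory stand-ins for the users and blogPosts collections
const users = [
  { _id: "u1", name: "Alice", email: "alice@example.com" }
];
const blogPosts = [
  {
    _id: "b1",
    userId: "u1",
    title: "Hello",
    comments: [{ content: "Nice post" }],
    tags: ["intro"]
  }
];

// The join happens in application code: fetch the post, then its author
// (in MongoDB: two findOne calls, or a $lookup aggregation stage)
const post = blogPosts.find(p => p._id === "b1");
const author = users.find(u => u._id === post.userId);
```

Comments and tags stay embedded in the post (aggregation), while user details live in one place and are joined in only when needed.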