MongoDB Array field maximum length - mongodb

I am using MongoDB with mongoose, and I need to create an "advertisement" schema which has a field called "views", which will be an array of
{userId: String, date: Date}.
I want to know if this is good practice. Although I know how much it will grow for now (up to 1500 elements, then reset), in the future I will not. I want to know whether it would seriously affect the application's performance if that array grew to 50,000 or 100,000 or more (it is an unbounded array). In such cases, what would be the best practice? I thought of just storing an increasing counter, but the business decision is to know by whom and when the ad was seen.
I know that there is a limit for the document itself (16 MB), but not for the fields themselves. But my question is more related to performance than to the document limit.
Thank you!
Edit => In the end it is definitely not a good idea to let an array grow unbounded. I checked the answer provided first, and it is a good approach. However, since I will be querying the whole document with the array property quite often, I didn't want to split it. So, since I don't want to keep data in the array for longer than 3 days, I will pull all elements that are 3 days old or more, and I hope this keeps the array clean.
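A sketch of that cleanup: a $pull update that removes view entries older than 3 days. The model name `Advertisement` and the scheduling are assumptions, not from the original question.

```javascript
// Build a $pull update that removes "views" entries older than 3 days.
// The model name "Advertisement" below is a hypothetical example.
const THREE_DAYS_MS = 3 * 24 * 60 * 60 * 1000;
const cutoff = new Date(Date.now() - THREE_DAYS_MS);

// Remove every element of "views" whose date is older than the cutoff.
const update = { $pull: { views: { date: { $lt: cutoff } } } };

// With mongoose this could run from a periodic job, e.g.:
// await Advertisement.updateMany({}, update);
```

Running this on a schedule (cron job, agenda, etc.) keeps the array bounded without splitting the document.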

I know that there is a limit only for the document (16mb), but not for
the fields themselves.
Fields and their values are part of the document, so they directly affect the document size.
Besides that, having such a big array is usually not the best approach. It decreases performance and complicates queries.
In your case, it is much better to have a separate views collection whose documents reference the advertisements by their _id.
Also, if you expect advertisement.views to be queried pretty often or, for example, you often need to show the last 10 or 20 views, then the Outlier pattern may also work for you.
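A minimal sketch of that reference-based layout, where each view becomes its own document pointing back at the advertisement's _id. Collection and field names here are illustrative assumptions.

```javascript
// Each view is its own document in a separate "views" collection,
// referencing the advertisement by its _id.
const adId = "ad123"; // stands in for the advertisement's ObjectId
const view = { advertisementId: adId, userId: "user42", date: new Date() };

// Shell equivalents for inserting a view and reading back the latest ones:
// db.views.insertOne(view)
// db.views.find({ advertisementId: adId }).sort({ date: -1 }).limit(20)
```

With an index on `advertisementId` (and `date`), the "last N views" query stays fast regardless of how many views accumulate.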

Related

Mongodb: about performance and schema design

After learning about performance and schema design in MongoDB, I still can't figure out how I would design the schema for an application where performance is a must.
Let's imagine we had to make YouTube work with MongoDB as its database. How would you design the schema?
OPTION 1: two collections (videos collection and comments collection)
Pros: adding, deleting and editing comments affects only the comments collection, therefore these operations would be more efficient.
Cons: Retrieving videos and comments would be 2 different queries to the database, one for videos and one for comments.
OPTION 2: single collection (videos collection with the comments embedded)
Pros: You retrieve videos and their comments with a single query.
Cons: Adding, deleting and editing comments affect the video Document, therefore these operations would be less efficient.
So what do you think? Are my guesses true?
As a voice crying in the wilderness, I have to say that embedding should only be used under very special circumstances:
The relation is "one-to-(very-)few" and it is absolutely certain that no document will ever exceed this limit. A good example is the relation between "users" and "email addresses": a user is unlikely to have millions of them, and there isn't even a problem with artificial limits, since setting the maximum number of addresses a user can have to, say, 50 would hardly cause a problem. It may be unlikely that a video gets millions of comments, but you don't want to impose an artificial limit on it, right?
Updates do not happen very often. If documents increase in size beyond a certain threshold, they might be moved, since documents are guaranteed to be never fragmented. However, document migrations are expensive and you want to prevent them.
Basically, all operations on comments become more complicated and hence more expensive - a bad choice. KISS!
I have written an article about the above, which describes the respective problems in greater detail.
And furthermore, I do not see any advantage in having the comments with the videos. The questions to answer would be
For a given user, what are the videos?
What are the newest videos (with certain tags)?
For a given video, what are the comments?
Note that the only connection between videos and comments here is about a given video, so you already have the _id or something else to positively identify the video. Furthermore, you don't want to load all comments at once, especially if you have a lot of them, since this would decrease UX because of long load times.
Let's say it is the _id. So, with it, you'd be able to have paged comments easily:
db.comments.find({"video_id": idToFind})
.sort({"_id": -1})  // a stable sort order makes skip/limit paging deterministic
.skip( (page-1) * pageSize )
.limit( pageSize )
hth
As usual, the answer is: it depends. As a rule of thumb you should favour embedding, unless you need to regularly query the embedded objects on their own or the embedded array is likely to get too large (more than roughly 100 records). Using this guideline, there are a few questions you need to ask about your application.
How is your application going to access the data? Are you only ever going to show the comments on the same page as the associated video? Or do you want to provide the option to show all comments by a given user across all videos? The first scenario favours embedding (one collection), whereas you would probably be better off with two collections in the second scenario.
Secondly, how many comments do you expect for each video ? Taking the analogy of IMDB, you could easily expect more than 100 comments for a popular video, so that means you are better off creating two separate collections as the embedded array of comments would grow large quite quickly. I wouldn't be too concerned about the overhead of an application join, they are generally comparable in speed compared to a server-side join in a relational database provided your collections are properly indexed.
Finally, how often are users going to update their comments after their initial post? If you lock the comments after 5 minutes like on StackOverflow, users may not update their comments very often. In that case the overhead of updating or deleting comments in the video collection will be negligible, and may even be outweighed by the cost of performing a second query against a separate comments collection.
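The "application join" mentioned above amounts to two round trips joined in application code. A rough sketch (collection and field names are illustrative, not from the original):

```javascript
// Two-collection layout: fetch a video, then fetch one page of its comments.
// Here we only build the two queries an application-level join needs.
const videoId = "vid1"; // placeholder for an ObjectId
const page = 1;
const pageSize = 20;

const videoQuery = { _id: videoId };
const commentQuery = { video_id: videoId };
const skip = (page - 1) * pageSize;

// Shell equivalent (two round trips, joined in application code):
// const video = db.videos.findOne(videoQuery)
// const comments = db.comments.find(commentQuery).skip(skip).limit(pageSize)
```

As the answer notes, with an index on `video_id` this second query is cheap, so the join cost is mostly an extra round trip.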
You should use embedding for better performance: your I/Os will be fewer. In the worst case it might take a bit longer to persist the document in the DB, but it won't take much time to retrieve it.
You will have to trade off write performance against reads, or vice versa, depending on your application's needs.
Hence it is important to choose your DB design wisely.

Best way to make a medical history DB in mongodb

I'm making a system that stores all the medical and health data for a person in a database. I've chosen MongoDB to do the work, but I'm new to MongoDB modeling and I don't know the best way to do this.
Do I use a document for each patient and insert subdocuments like this:
$evolution=array(); //subdocument
$record=array(); //subdocument
$prescriptions=array(); //subdocument
$exams=array(); //subdocument
$surgeries=array(); //subdocument
or do I create a new document for each one of these?
I know about the document size limit of 16 MB, but I don't know whether the information will reach that limit.
The exact layout of your documents is highly dependent on the types of queries you need to make. Unfortunately without a detailed understanding of your use case it would be impossible to provide good advice about what is the best layout.
Depending on your use case it may be valid to have a document/patient with sub documents as you indicate. In some cases though it may be better to have a separate collection for each of the fields indicated. It all depends on how big those documents will be, what types of queries you will need to perform etc.
Some general advice:
Try to avoid queries that use multiple collections.
If your queries are getting difficult, you may have the wrong layout. Re-evaluate your layout any time you are in this situation.
Documents that constantly grow can create problems, because Mongo has to keep moving them around in order to make room for the growth. If they will be growing quickly, re-evaluate to see if there is a better layout.
While you can technically store different document layouts in the same collection in Mongo it is not generally considered a good practice. All documents in your collection should ideally follow some sort of schema even if that schema is not rigidly defined.
Field names matter. They take up space in Mongo so short field names are better if you expect to have a lot of data.
The best advice I can offer would be to start with what you think might work and see how it goes. If it gets awkward or difficult to get the information you need then reevaluate.
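Following the question's own sketch, a possible starting point is one document per patient with embedded subdocuments. The field names are taken from the question; the document shape is an assumption to illustrate the "start simple, re-evaluate later" advice.

```javascript
// One document per patient with embedded subdocument arrays,
// mirroring the PHP arrays in the question. Shape is illustrative.
const patient = {
  name: "Jane Doe",
  evolution: [],
  record: [],
  prescriptions: [],
  exams: [],
  surgeries: []
};
// If any of these arrays grows without bound (e.g. exams accumulating over
// decades), move it into its own collection that references the patient's _id.
```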

Mass Update NoSQL Documents: Bad Practice?

I'm storing two collections in a MongoDB database:
==Websites==
id
nickname
url
==Checks==
id
website_id
status
I want to display a list of check statuses with the appropriate website nickname.
For example:
[Google, 200] << (basically a join in SQL-world)
I have thousands of checks and only a few websites.
Which is more efficient?
Store the nickname of the website within the "check" directly. This means if the nickname is ever changed, I'll have to perform a mass update of thousands of documents.
Return a multidimensional array where the site ID is the key and the nickname is the value. This is to be used when iterating through the list of checks.
I've read that #1 isn't too bad in the NoSQL world and may, in fact, be preferred? True?
If it's only a few websites I'd go with option 1 - not as clean and normalized as in the relational/SQL world, but it works and is much less painful than trying to emulate joins with MongoDB. The thing to remember with MongoDB or any other NoSQL database is that you are generally making some kind of trade-off - nothing is for free. I personally really value the schema-less, document-oriented data design, and for the applications I use it for I readily make the trade-offs (like no joins and transactions).
That said, this is a trade-off - so one thing to always be asking yourself in this situation is why am I using MongoDB or some other NoSQL database? Yes, it's trendy and "hot", but I'd make certain that what you are doing makes sense for a NoSQL approach. If you are spending a lot of time working around the lack of joins and foreign keys, no transactions and other things you're used to in the SQL world I'd think seriously about whether this is the best fit for your problem.
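The mass update that option 1 implies is a single multi-document statement. Collection and field names follow the question; the new nickname value is a placeholder.

```javascript
// Denormalized layout: each check carries the website's nickname.
// Renaming a website then means rewriting it on every matching check.
const websiteId = "site1"; // placeholder id
const filter = { website_id: websiteId };
const change = { $set: { nickname: "New Name" } };

// Shell equivalent (one statement, touching thousands of documents at once):
// db.checks.updateMany(filter, change)
```

Since nickname changes are rare and reads are frequent, paying this occasional bulk-write cost is usually the right side of the trade-off.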
You might consider a 3rd option: Get rid of the Checks collection and embed the checks for each website as an array in each Websites document.
This way you avoid any JOINs and you avoid inconsistencies, because it is impossible for a Check to exist without the Website it belongs to.
This, however, is only recommended when the checks array for each document stays relatively constant over time and doesn't grow continually. Rapidly growing documents should be avoided in MongoDB, because every time a document doubles in size, it is moved to a different location in the physical file it is stored in, which slows down write operations. Also, MongoDB has a 16 MB limit per document; this limit exists mostly to discourage ever-growing documents.
You haven't said what a Check actually is in your application. If it is a list of tasks you perform periodically and only make occasional changes to, there would be nothing wrong with embedding. But if you collect the historical results of all checks you ever ran, I would rather recommend putting each result (set?) in its own document to avoid document growth.

MongoDB space usage inefficiencies when using $push

Let's say that I have two collections, A and B. Among other things, one of them (collection A) has an array whose cells contain subdocuments with a handful of keys.
I also have a script that will go through a queue (external to MongoDB), insert its items into collection B, and push any relevant info from those items into subdocuments in an array in collection A, using $push. As the script runs, the size of the documents in collection A grows significantly.
The problem seems to be that whenever a document does not fit its allocated size, MongoDB will move it internally, but it won't release the space it occupied previously; new documents won't use that space unless I run a compact or repairDatabase command.
In my case, the script seems to scorch through my disk space quickly. It inserts a couple of items into collection B, then tries to insert into a document in collection A, and (I'm guessing) relocates said document without reusing its old spot. Perhaps this does not happen every time, thanks to padding, but when these documents are about 10 MB in size, every time it does happen it burns through a significant chunk of the DB, even though the actual data size remains small. The process eats up my (fairly small, admittedly) DB in minutes.
Requiring a compact or repairDatabase command every time this happens is clumsy: there is space on disk, and I would like MongoDB to use it without requesting it explicitly. The alternative of having a separate collection for the subdocuments in the array would fix this issue, and is probably a better design anyway, but one that will require me to make joins that I wanted to avoid, this being one of the advantages of NoSQL.
So, first, does MongoDB actually use space the way I described above? Second, am I approaching this the wrong way? Perhaps there is a parameter I can set to get MongoDB to reuse this space automatically; if there is, is it advisable to use it? And third, are there other, more fitting, design approaches I'm missing?
Most of the questions you have asked are already well covered elsewhere (a Google search would bring up hundreds of links, including critical blog posts on the matter). For this particular case, this presentation should answer about 90% of your questions: http://www.mongodb.com/presentations/storage-engine-internals
As for solving the problem through settings etc., that is not really possible here; power-of-2 sizes won't help for an array which grows like this. So, to answer:
Perhaps there is a parameter I can set to get MongoDB to reuse this space automatically; if there is, is it advisable to use it?
I would say no.
And third, are there other, more fitting, design approaches I'm missing?
For something like this I would recommend using a separate collection, storing each of the array elements as a new document independent of the parent document.
Sammaye's recommendation was correct, but I needed to do some more digging to understand the cause of this issue. Here's what I found.
So, first, does MongoDB actually use space the way I described above?
Yes, but that's not as intended. See bug SERVER-8078, and its (non-obvious) duplicate, SERVER-2958. Frequent $push operations cause MongoDB to shuffle documents around, and their old spots are not (yet!) reused without a compact or repairDatabase command.
Second, am I approaching this the wrong way? Perhaps there is a parameter I can set to get MongoDB to reuse this space automatically; if there is, is it advisable to use it?
For some usages of $push, the usePowerOf2Sizes option initially consumes more space, but stabilizes better (see the discussion on SERVER-8078). It may not work well with arrays that consistently tend to grow, which are a bad idea anyway because document sizes are capped.
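For reference, on the MMAPv1-era servers this discussion concerns, that allocation strategy was toggled per collection with collMod. The collection name "A" is a placeholder; shown here as a plain command document rather than a live shell call.

```javascript
// Command document for enabling power-of-2 record allocation on collection "A".
// In the mongo shell it would be passed to db.runCommand(...).
const command = { collMod: "A", usePowerOf2Sizes: true };
// db.runCommand(command)
```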
And third, are there other, more fitting, design approaches I'm missing?
If an array is going to have hundreds or thousands of items, or if its length is arbitrary but likely large, it's better to move its cells to a different collection, despite the need for additional database calls.

Best Mongodb Data Model for Response time statistic website

In my project, I have servers that send ping requests to websites, measure their response time, and store it every minute.
I'm going to use MongoDB and I'm searching for the best data model.
Which data model is better?
1- a collection for each website, with each request as a document
(1,000 collections)
or
2- one collection for all websites, with each website as a document and each request as a subdocument?
Both solutions run into a particular MongoDB limitation. With the first one (each website its own collection) the limitation is the number of collections: each one needs a namespace entry, and the namespace file is 16 MB by default, so around 16,000 entries fit in it (the size of the namespace file can be increased). In my opinion this is the better solution: you said 1,000 collections are expected, and that can be handled. (Keep in mind that indexes have their own namespace entries and count towards the 16,000.) In this case you store the requests as documents, which are generally much easier to work with afterwards than an embedded array.
The second case runs into the embedded-array limitation, and this one is hard: your documents cannot grow bigger than 16 MB. That is the BSON size limit, and quite a lot fits inside a document, but if you use huge documents which vary in size and change size over time, your storage will get fragmented. The reason will be clear if you watch this webinar. Basically this is the worst thing you can do in terms of storage usage.
If you are likely to use the aggregation framework for further analysis, it will also be harder with the embedded-array approach.
You could do either, but I think you will have to factor in periodic growth of the database in either case. During the expansion of data files the database will be slow/unresponsive. (There might be a setting so this happens in the background; I forget.)
A related question - MongoDB performance with growing data structure, specifically the "Padding Factor"
With the first approach, there is an upper limit to the number of websites you can store, imposed by the maximum number of collections. You can do the calculations based on http://docs.mongodb.org/manual/reference/limits/.
With the second approach, while the number of collections doesn't matter as much, the growth of the database is something you will want to consider.
One approach is to initialize documents with empty data, so they last longer before expanding.
For instance.
{
  website: name,
  responses: [
    { time: ISODate("2013-01-01T00:01:00Z"), ... },
    { time: ISODate("2013-01-01T00:02:00Z"), ... }
    // ... and so on, one entry for each minute/interval you expect
  ]
}
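The pre-allocation idea can be sketched as plain JavaScript: build one day's worth of empty per-minute slots up front, so the document reaches its final size immediately. The `ms` field is an illustrative placeholder to be filled in later via $set; it is not from the original answer.

```javascript
// Build one day's worth of empty per-minute response slots for a website.
function preallocateDay(website, day) {
  const responses = [];
  for (let minute = 0; minute < 24 * 60; minute++) {
    responses.push({
      time: new Date(day.getTime() + minute * 60 * 1000),
      ms: null // response time, filled in later via a positional $set
    });
  }
  return { website: website, responses: responses };
}

const doc = preallocateDay("example.com", new Date(Date.UTC(2013, 0, 1)));
// doc.responses.length is 1440 (one slot per minute of the day)
```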
The downside is that initialization might take longer up front, but you won't have to worry about growth later.
Either way, it is a cost you will have to pay. The only question is when: now, or later?
Consider reading their use cases, particularly http://docs.mongodb.org/manual/use-cases/hierarchical-aggregation/