Performance Implications of Accessing a Single MongoDB Document vs Different MongoDB Documents in the Same Collection

Say I have a MongoDB Document that contains within itself a list.
This list gets altered a lot and there's no real reason why it couldn't have its own collection and each of the items became a document.
Would there be any performance implications to the former? I have an inkling that reads and writes to the document would be blocked while any given connection is accessing it, whereas the same wouldn't be true for accessing different documents in the same collection.

I find that these questions are effectively impossible to 'answer' here on Stack Overflow. Not only is there not really a 'right' answer, but it is impossible to get enough context from the question to frame a response that appropriately factors in the items that are most important for you to consider in your specific situation. Nonetheless, here are some thoughts that come to mind that may help point you in the right direction.
Performance is obviously an important consideration here, so it's good to have it in mind as you think through the design. Even within the single realm of performance there are various aspects. For example, would it be acceptable for the source document and the associated secondary documents in another collection to be out of sync? If not, and you had to pursue a route such as using transactions to keep them aligned, then that may be a much bigger performance hit overall and not worth pursuing.
As broad as performance is, it is also just a single consideration here. What about usability? Are you able to succinctly express the type of modifications that you would be doing to the array using MongoDB's query language? What about retrieving the data, would you always pull the information back as a single logical document? If so, then that would imply needing to use $lookup very frequently. Even doing so via a view may be cumbersome and could be both a usability as well as performance consideration. Indeed, an overreliance on $lookup can be considered an antipattern.
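For illustration only (collection and field names here are hypothetical), reassembling the split-out items onto their parent document with $lookup would look something like this:
db.parents.aggregate([
  { $match: { _id: parentId } },        // the parent document of interest; parentId is a placeholder
  { $lookup: {
      from: "items",                    // the hypothetical secondary collection
      localField: "_id",
      foreignField: "parent_id",
      as: "items"                       // re-attach the list as an array field
  } }
])
Having to run something like this on every read is exactly the kind of overhead to weigh against the simpler writes of the two-collection design.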
What does it mean when you say that the list gets "altered" a lot? Are you inserting new information, or updating existing entries? There has been a 16MB size limit for individual documents for a long time in MongoDB, so they generally recommend avoiding unbounded arrays. Indeed processing them can be costly in various ways depending on some specific factors.
Also, where does your inkling about concurrency behavior come from? There is a FAQ on concurrency here which helps outline some of the expected behavior for various operations and their locking. Often (with any system) it can be most appropriate to build out an environment that appropriately represents your end state and stress test it directly. That often gives a good general sense for how the approach would work in your situation without having to become an expert in the particulars of how the database (or tool in general) works.
You can see that even in this short response, the "recommendation" fluctuates back and forth. Ultimately this question is about a trade-off which we are not in a good position to answer for you. Hopefully this response gives you some things to think about while doing so.

Related

read queries becoming slower the more indexes I add

It seems that as I add more compound indexes to my collection, performance improves up to a point, but beyond that point the more indexes I add, the slower it becomes.
Is this possible? If so why?
EDITED:
I am referring to read queries, not write queries. I am aware that writes will be slower.
This is the case for any sort of index, not just compound indexes.
In MongoDB (and most databases) a lot of operations are sped up by having an index, at the cost of maintaining each index.
Generally speaking this shouldn't slow down things like a find, but it will very much affect insert and update, as those change the underlying data and thus require modifying or rebuilding each index those changes are linked to.
However, even with inserts and updates an index can help speed up those operations, as the query engine can find the documents to update more quickly.
In the end it is very much a balance: the cost of maintaining the indexes, and the space they take up, can (if you were to be overzealous, i.e. creating many, many rarely used indexes) counteract their helpfulness.
For a deeper dive into that, I'd suggest these docs:
https://www.mongodb.com/docs/manual/core/data-model-operations/#std-label-data-model-indexes
https://www.mongodb.com/docs/manual/core/index-creation/
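As a practical aside, if you suspect you have accumulated rarely used indexes, the $indexStats aggregation stage (MongoDB 3.2+) reports how often each index has been used since the server last restarted; the collection name below is illustrative:
db.orders.aggregate([
  { $indexStats: {} },                               // one result document per index
  { $project: { name: 1, "accesses.ops": 1 } }       // index name and its usage count
])
Indexes that show few or no accesses over a representative period are candidates for removal.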
I agree with the information that @Justin Jenkins shared in their answer, as there is absolutely write overhead associated with maintaining indexes. I don't think that answer focuses much on query performance, though, which is what I understand this question to be about. I will give some thoughts about that below, though without additional details about the situation they will necessarily be a little generic.
Although indexes absolutely feel magical at times, they are still just a utility that we make available for the database to use when running operations. Ideally it would never be the case that adding an index would slow down the execution of a query, but unfortunately it can in some circumstances. This is not particularly common which is why it is not often an upfront talking point or concern.
Here are some important considerations:
The database is responsible for figuring out the index(es) that would result in the most efficient execution plan for every arbitrary query that is executed.
Indexes are data structures. They take up space in memory when loaded from disk and must be traversed to be read.
The server hosting the database only has finite resources. Every time it uses some of those resources to maintain indexes it reduces the amount of resources available to process queries. It also introduces more possibilities for locking, yielding, or other contention to maintain consistency.
If you are observing a sudden and drastic degradation in query performance, I would tend to suspect a problem associated with the first consideration above. Again while not particularly common, it is possible that the increased number of indexes is now preventing the database from finding the optimal plan. This would be most likely if the query contained an $or operator, but can happen in other situations as well. Be on the lookout for a different index being reported in the winningPlan of the explain output for the query. It would usually happen after a specific number of indexes were created and/or if that new index(es) had a particular definition relevant to the query of interest.
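To check whether the plan has shifted, you can compare the explain output before and after adding the new index; the collection name and query shape below are just placeholders:
db.collection.find({ status: "A", total: { $gt: 100 } })
  .explain("queryPlanner")     // inspect queryPlanner.winningPlan for the index actually chosen
If the index reported there changed at about the time the slowdown started, you have likely found the culprit.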
A slower and more linear degradation in performance would seem to point to a different cause, such as the second or third items mentioned above. While memory/cache contention can certainly still degrade performance reasonably quickly, you would not see a shift in the query plans with one of these problems. What can happen here instead is that you now have two indexes which (for simplicity) take up twice the amount of space, competing for the same limited space in memory. If what is requested exceeds what is available, then the database will have to begin cycling the useful portions of the indexes (and data) into and out of its cache. This overhead can quickly add up and will result in operations spending more time waiting for their portion of the index to be made available in memory for reading. I would expect a broader portion of queries to be impacted, though more moderately, in this situation.
In any case, the most actionable broad advice we can give would be for you to review and consolidate your existing indexes. There is a little bit of guidance on the topic here in the documentation. The general idea is that the prefix of the index (the keys at the beginning) are the important ones when it comes to usage for queries. Except for a few special circumstances, a single field index on { A: 1 } is completely redundant if you have a separate compound index on { A: 1, B: 1 }. Since the latter index can support all of the operations that the former one can, the former one (single field index in this example) should be removed.
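As a sketch of that consolidation (the index definitions here are illustrative):
db.collection.getIndexes()                    // review what currently exists
db.collection.createIndex({ A: 1, B: 1 })     // compound index whose prefix covers { A: 1 }
db.collection.dropIndex({ A: 1 })             // the single-field index is now redundant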
Ultimately you may have to make some tradeoffs about which indexes to maintain, and there may not be a 'perfect' index present for every single query. That's okay. Sometimes it is better to let one query do a little extra scanning when one of its predicate fields is not indexed, as opposed to maintaining an entirely separate index. There is a tradeoff here at some point and, as @Justin Jenkins put it, it's important not to go too far and become overzealous when creating indexes.

Mongodb: about performance and schema design

After learning about performance and schema design in MongoDB, I still can't figure out how I would design the schema for an application where performance is a must.
Let's imagine we had to build YouTube with MongoDB as its database. How would you design the schema?
OPTION 1: two collections (videos collection and comments collection)
Pros: adding, deleting and editing comments affects only the comments collection, therefore these operations would be more efficient.
Cons: Retrieving videos and comments would be 2 different queries to the database, one for videos and one for comments.
OPTION 2: single collection (videos collection with the comments embedded)
Pros: You retrieve videos and its comments with a single query.
Cons: Adding, deleting and editing comments affect the video Document, therefore these operations would be less efficient.
So what do you think? Are my guesses true?
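For concreteness, the two layouts might look roughly like this (field names are purely illustrative):
// Option 1: two collections, with comments referencing their video
db.videos.insertOne({ _id: 1, title: "My video", uploader: "alice" })
db.comments.insertOne({ video_id: 1, author: "bob", text: "Nice!" })
// Option 2: one videos collection, with comments embedded in each video document
db.videos.insertOne({ _id: 2, title: "Another video", comments: [ { author: "bob", text: "Nice!" } ] })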
At the risk of being a voice crying in the wilderness, I have to say that embedding should only be used under very special circumstances:
The relation is a "One-To-(Very)-Few" and it is absolutely certain that no document will ever exceed this limit. A good example would be the relation between "users" and "email addresses" – a user is unlikely to have millions of them, and there isn't even a problem with artificial limits: setting the maximum number of addresses a user can have to, say, 50 would hardly cause a problem. It may be unlikely that a video gets millions of comments, but you don't want to impose an artificial limit on it, right?
Updates do not happen very often. If documents increase in size beyond a certain threshold, they might be moved, since documents are guaranteed never to be fragmented. Document migrations are expensive, however, and you want to prevent them.
Basically, all operations on comments become more complicated and hence more expensive - a bad choice. KISS!
I have written an article about the above, which describes the respective problems in greater detail.
And furthermore, I do not see any advantage in having the comments with the videos. The questions to answer would be
For a given user, what are the videos?
What are the newest videos (with certain tags)?
For a given video, what are the comments?
Note that the only connection between videos and comments here is about a given video, so you already have the _id or something else to positively identify the video. Furthermore, you don't want to load all comments at once, especially if you have a lot of them, since this would decrease UX because of long load times.
Let's say it is the _id. So, with it, you'd be able to have paged comments easily:
var page = 1, pageSize = 20;              // illustrative values; idToFind identifies the video
db.comments.find({"video_id": idToFind})
  .sort({"_id": -1})                      // newest first, so the pages have a stable order
  .skip( (page-1) * pageSize )
  .limit( pageSize )
hth
As usual the answer is: it depends. As a rule of thumb you should favour embedding, unless you need to regularly query the embedded objects on their own or the embedded array is likely to get too large (more than ~100 records). Using this guideline, there are a few questions you need to ask regarding your application.
How is your application going to access the data? Are you only ever going to show the comments on the same page as the associated video? Or do you want to provide the option to show all comments for a given user across all movies? The first scenario favours embedding (one collection), whereas you would probably be better off with two collections in the second scenario.
Secondly, how many comments do you expect for each video? Taking the analogy of IMDB, you could easily expect more than 100 comments for a popular video, so you are better off creating two separate collections, as the embedded array of comments would grow large quite quickly. I wouldn't be too concerned about the overhead of an application join; provided your collections are properly indexed, it is generally comparable in speed to a server-side join in a relational database.
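An "application join" here is simply two indexed queries issued by your code, roughly like this (collection and field names assumed from the example above, videoId being a placeholder):
var video = db.videos.findOne({ _id: videoId });          // first query: the video itself
var comments = db.comments.find({ video_id: videoId })    // second query: its comments
                          .sort({ created_at: -1 })
                          .limit(20)
                          .toArray();
With an index on comments.video_id (and on created_at, if you sort by it), both lookups stay cheap.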
Finally, how often are users going to update their comments after their initial post? If you lock the comments after 5 minutes like on Stack Overflow, users may not update their comments very often. In that case the overhead of updating or deleting comments in the video collection will be negligible and may even be outweighed by the cost of performing a second query against a separate comments collection.
You should use embedding for better read performance: your I/Os will be fewer. In the worst case it might take a bit longer to persist the document in the DB, but it won't take much time to retrieve it.
You should compromise either on persistence or on reads, depending on your application's needs.
Hence it is important to choose your DB wisely.

Mass Update NoSQL Documents: Bad Practice?

I'm storing two collections in a MongoDB database:
==Websites==
id
nickname
url
==Checks==
id
website_id
status
I want to display a list of check statuses with the appropriate website nickname.
For example:
[Google, 200] << (basically a join in SQL-world)
I have thousands of checks and only a few websites.
Which is more efficient?
1. Store the nickname of the website within the "check" directly. This means if the nickname is ever changed, I'll have to perform a mass update of thousands of documents.
2. Return a multidimensional array where the site ID is the key and the nickname is the value. This is to be used when iterating through the list of checks.
I've read that #1 isn't too bad (in the NoSQL world) and may, in fact, be preferred. True?
If it's only a few websites I'd go with option 1 - not as clean and normalized as in the relational/SQL world but it works and much less painful than trying to emulate joins with MongoDB. The thing to remember with MongoDB or any other NoSQL database is that you are generally making some kind of trade off - nothing is for free. I personally really value the schema-less document oriented data design and for the applications I use it for I readily make the trade-offs (like no joins and transactions).
That said, this is a trade-off - so one thing to always be asking yourself in this situation is why am I using MongoDB or some other NoSQL database? Yes, it's trendy and "hot", but I'd make certain that what you are doing makes sense for a NoSQL approach. If you are spending a lot of time working around the lack of joins and foreign keys, no transactions and other things you're used to in the SQL world I'd think seriously about whether this is the best fit for your problem.
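If you do go with option 1, the occasional nickname change is a single multi-document update, something like the following (field names taken from the question, collection name assumed; updateMany needs MongoDB 3.2+, older servers use update with { multi: true }):
db.checks.updateMany(
  { website_id: websiteId },                 // every check that points at this website
  { $set: { nickname: "New nickname" } }     // rewrite the denormalized copy
)
It touches thousands of documents, but only on the rare occasions a nickname actually changes.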
You might consider a 3rd option: Get rid of the Checks collection and embed the checks for each website as an array in each Websites document.
This way you avoid any JOINs and you avoid inconsistencies, because it is impossible for a Check to exist without the Website it belongs to.
This, however, is only recommended when the checks array for each document stays relatively constant over time and doesn't grow constantly. Rapidly growing documents should be avoided in MongoDB, because every time a document outgrows its allocated size, it is moved to a different location in the physical file it is stored in, which slows down write operations. Also, MongoDB has a 16MB limit per document. This limit exists mostly to discourage growing documents.
You haven't said what a Check actually is in your application. If it is a list of tasks you perform periodically and only make occasional changes to, there would be nothing wrong with embedding. But if you collect the historical results of all checks you ever ran, I would rather recommend putting each result (set?) in its own document to avoid document growth.
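If you did embed, adding a check result would be a single $push (document shape assumed, purely a sketch):
db.websites.updateOne(
  { _id: websiteId },
  { $push: { checks: { status: 200, checked_at: new Date() } } }   // append one check result
)
But as noted above, if this array grows continuously you will run straight into the document-growth problems described.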

Are there any advantages to using a custom _id for documents in MongoDB?

Let's say I have a collection called Articles. If I were to insert a new document into that collection without providing a value for the _id field, MongoDB will generate one for me that is specific to the machine and the time of the operation (e.g. sdf4sd89fds78hj).
However, I do have the ability to pass a value for MongoDB to use as the value of the _id key (e.g. 1).
My question is, are there any advantages to using my own custom _ids, or is it best to just let Mongo do its thing? In what scenarios would I need to assign a custom _id?
Update
For anyone else that may find this. The general idea (as I understand it) is that there's nothing wrong with assigning your own _ids, but it forces you to maintain unique values within your application layer, which is a PITA, and requires an extra query before every insert to make sure you don't accidentally duplicate a value.
Sammaye provides an excellent answer here:
Is it bad to change _id type in MongoDB to integer?
Advantages of generating your own _ids:
You can make them more human-friendly, by assigning incrementing numbers: 1, 2, 3, ...
Or you can make them more human-friendly, using random strings: t3oSKd9q
(That doesn't take up too much space on screen, could be picked out from a list, and could potentially be copied manually if needed. However you do need to make it long enough to prevent collisions.)
If you use randomly generated strings they will have an approximately even sharding distribution, unlike the standard Mongo ObjectIds, which tend to group records created around the same time onto the same shard. (Whether that is helpful or not really depends on your sharding strategy.)
Or you may like to generate your own custom _ids that will group related objects onto one shard, e.g. by owner, or geographical region, or a combination. (Again, whether that is desirable or not depends on how you intend to query the data, and/or how rapidly you are producing and storing it. You can also do this by specifying a shard key, rather than the _id itself. See the discussion below.)
Advantages to using ObjectIds:
ObjectIds are very good at avoiding collisions. If you generate your own _ids randomly or concurrently, then you need to manage the collision risk yourself.
ObjectIds contain their creation time within them. That can be a cheap and easy way to retain the creation date of a document, and to sort documents chronologically. (On the other hand, if you don't want to expose/leak the creation date of a document, then you must not expose its ObjectId!)
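For example, in the mongo shell the embedded creation time can be read back directly (the id below is just an example value):
var id = ObjectId("507f1f77bcf86cd799439011");
id.getTimestamp()                             // returns the creation time as an ISODate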
The nanoid module can help you to generate short random ids. They also provide a calculator which can help you choose a good id length, depending on how many documents/ids you are generating each hour.
Alternatively, I wrote mongoose-generate-unique-key for generating very short random ids (provided you are using the mongoose library).
Sharding strategies
Note: Sharding is only needed if you have a huge number of documents (or very heavy documents) that cannot be managed by one server. It takes quite a bit of effort to set up, so I would not recommend worrying about it until you are sure you actually need it.
I won't claim to be an expert on how best to shard data, but here are some situations we might consider:
An astronomical observatory or particle accelerator handles gigabytes of data per second. When an interesting event is detected, they may want to store a huge amount of data in only a few seconds. In this case, they probably want an even distribution of documents across the shards, so that each shard will be working equally hard to store the data, and no one shard will be overwhelmed.
You have a huge amount of data and you sometimes need to process all of it at once. In this case (but depending on the algorithm) an even distribution might again be desirable, so that all shards can work equally hard on processing their chunk of the data, before combining the results at the end. (Although in this scenario, we may be able to rely on MongoDB's balancer, rather than our shard key, for the even distribution. The balancer runs in the background after data has been stored. After collecting a lot of data, you may need to leave it to redistribute the chunks overnight.)
You have a social media app with a large amount of data, but this time many different users are making many light queries related mainly to their own data, or their specific friends or topics. In this case, it doesn't make sense to involve every shard whenever a user makes a little query. It might make sense to shard by userId (or by topic or by geographical region) so that all documents belonging to one user will be stored on one shard, and when that user makes a query, only one shard needs to do work. This should leave the other shards free to process queries for other users, so many users can be served at once.
Sharding documents by creation time (which the default ObjectIds will give you) might be desirable if you have lots of light queries looking at data for similar time periods. For example many different users querying different historical charts.
But it might not be so desirable if most of your users are querying only the most recent documents (a common situation on social media platforms) because that would mean one or two shards would be getting most of the work. Distributing by topic or perhaps by region might provide a flatter overall distribution, whilst also allowing related documents to clump together on a single shard.
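As a very rough sketch of the "shard by user" idea (database, collection and key names are all illustrative):
sh.enableSharding("socialapp")
sh.shardCollection("socialapp.posts", { userId: 1 })      // ranged key: one user's posts stay together
// or, for the even-distribution scenarios above:
sh.shardCollection("socialapp.events", { _id: "hashed" })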
You may like to read the official docs on this subject:
https://docs.mongodb.com/manual/sharding/#shard-key-strategy
https://docs.mongodb.com/manual/core/sharding-choose-a-shard-key/
I can think of one good reason to generate your own ID up front. That is for idempotency. For example so that it is possible to tell if something worked or not after a crash. This method works well when using re-try logic.
Let me explain. The reason people might consider re-try logic:
Inter-app communication can sometimes fail for different reasons (especially in a microservice architecture). The app becomes more resilient and self-healing when it is codified to re-try and not give up right away. This rides over odd blips that might occur without the consumer ever being affected.
For example when dealing with mongo, a request is sent to the DB to store some object, the DB saves it, but just as it is trying to respond to the client to say everything worked fine, there is a network blip for whatever reason and the “OK” is never received. The app assumes it didn't work and so the app may end up re-trying the same data and storing it twice, or worse it just blows up.
Creating the ID up front is an easy, low overhead way to help deal with re-try logic. Of course one could think of other schemes too.
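A minimal sketch of that scheme (error handling trimmed, collection name illustrative): the client generates the _id before the first attempt, so a retry after an attempt that actually did succeed fails fast with a duplicate-key error instead of storing the data twice.
var orderId = ObjectId();                          // generated up front, reused on every retry
try {
  db.orders.insertOne({ _id: orderId, total: 42 });
} catch (e) {
  if (e.code === 11000) {
    // duplicate key on _id: the earlier attempt made it through, nothing more to do
  } else {
    throw e;                                       // genuine failure, let the retry logic run again
  }
}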
Although this sort of resiliency may be overkill in some types of projects, it really just depends.
I have used custom ids a couple of times and it was quite useful.
In particular I had a collection where I would store stats by date, so the _id was actually a date in a specific format. I did that mostly because I would always query by date. Keep in mind that this approach can simplify your indexes, since no extra index is needed; the default _id index is sufficient.
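Roughly, such a document might look like { _id: "2014-03-01", pageViews: 1234 } (shape assumed for illustration), and a month of stats is then a simple range query on _id alone:
db.stats.find({ _id: { $gte: "2014-03-01", $lte: "2014-03-31" } })   // served by the built-in _id index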
Sometimes the ID is something more meaningful than a randomly generated one. For example, a user collection may use the email address as the _id instead. In my project I generate IDs that are much shorter than the ones Mongodb uses so that the ID shown in the URL is much shorter.
I'll use an example: I created a property management tool that had multiple collections. For simplicity, some fields were duplicated across collections, for example the payment. When I needed to update these records, the change had to happen simultaneously across all the collections they appeared in, so I assigned each payment a custom payment id; when a delete/query action is performed, it changes all instances of it database-wide.

MongoDB space usage inefficiencies when using $push

Let's say that I have two collections, A and B. Among other things, one of them (collection A) has an array whose cells contain subdocuments with a handful of keys.
I also have a script that will go through a queue (external to MongoDB), insert its items on collection B, and push any relevant info from these items into subdocuments in an array in collection A, using $push. As the script runs, the size of the documents in collection A grows significantly.
The problem seems to be that, whenever a document does not fit its allocated size, MongoDB will move it internally, but it won't release the space it occupied previously---new MongoDB documents won't use that space, not unless I run a compact or repairDatabase command.
In my case, the script seems to scorch through my disk space quickly. It inserts a couple of items into collection B, then tries to insert into a document in collection A, and (I'm guessing) relocates said document without reusing its old spot. Perhaps this does not happen every time, because of padding, but when these documents are about 10MB in size, every time it does happen it burns through a significant chunk of the DB, even though the actual data size remains small. The process eats up my (fairly small, admittedly) DB in minutes.
Requiring a compact or repairDatabase command every time this happens is clumsy: there is space on disk, and I would like MongoDB to use it without requesting it explicitly. The alternative of having a separate collection for the subdocuments in the array would fix this issue, and is probably a better design anyway, but one that will require me to make joins that I wanted to avoid, this being one of the advantages of NoSQL.
So, first, does MongoDB actually use space the way I described above? Second, am I approaching this the wrong way? Perhaps there is a parameter I can set to get MongoDB to reuse this space automatically; if there is, is it advisable to use it? And third, are there other, more fitting, design approaches I'm missing?
Most of the questions you have asked you should already know the answers to, having tried to use MongoDB in such a case (a Google search would bring up hundreds of links, including critical blog posts on the matter). However, this presentation should answer about 90% of your questions: http://www.mongodb.com/presentations/storage-engine-internals
As for solving the problem through settings etc., that is not really possible here; power-of-2 sizes won't help for an array which grows like this. So, to answer:
Perhaps there is a parameter I can set to get MongoDB to reuse this space automatically; if there is, is it advisable to use it?
I would say no.
And third, are there other, more fitting, design approaches I'm missing?
For something like this I would recommend using a separate collection to store each of the array elements as a new row independent of the parent document.
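In concrete terms (names are illustrative): instead of $push-ing each queue item into the array in collection A, each item gets its own small document that references the parent:
db.A_items.insert({ parent_id: parentId, payload: item })   // fixed-size documents, no growth of A
db.A_items.find({ parent_id: parentId })                    // read the "array" back when needed
An index on parent_id keeps the read side cheap, and the parent document in collection A never needs to grow.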
Sammaye's recommendation was correct, but I needed to do some more digging to understand the cause of this issue. Here's what I found.
So, first, does MongoDB actually use space the way I described above?
Yes, but that's not as intended. See bug SERVER-8078, and its (non-obvious) duplicate, SERVER-2958. Frequent $push operations cause MongoDB to shuffle documents around, and their old spots are not (yet!) reused without a compact or repairDatabase command.
Second, am I approaching this the wrong way? Perhaps there is a parameter I can set to get MongoDB to reuse this space automatically; if there is, is it advisable to use it?
For some usages of $push, the usePowerOf2Sizes option initially consumes more space, but stabilizes better (see the discussion on SERVER-8078). It may not work well with arrays that consistently tend to grow, which are a bad idea anyway because document sizes are capped.
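For reference, on the MMAPv1-era servers this discussion is about, that option was enabled per collection with collMod (collection name illustrative):
db.runCommand({ collMod: "A", usePowerOf2Sizes: true })   // allocate record sizes in powers of two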
And third, are there other, more fitting, design approaches I'm missing?
If an array is going to have hundreds or thousands of items, or if its length is arbitrary but likely large, it's better to move its cells to a different collection, despite the need for additional database calls.