MongoDB space usage inefficiencies when using $push - mongodb

Let's say that I have two collections, A and B. Among other things, one of them (collection A) has an array whose cells contain subdocuments with a handful of keys.
I also have a script that will go through a queue (external to MongoDB), insert its items on collection B, and push any relevant info from these items into subdocuments in an array in collection A, using $push. As the script runs, the size of the documents in collection A grows significantly.
The problem seems to be that, whenever a document does not fit its allocated size, MongoDB will move it internally, but it won't release the space it occupied previously: new MongoDB documents won't use that space unless I run a compact or repairDatabase command.
In my case, the script seems to scorch through my disk space quickly. It inserts a couple of items into collection B, then tries to insert into a document in collection A, and (I'm guessing) relocates said document without reusing its old spot. Perhaps this does not happen on every update, thanks to padding, but when these documents are about 10MB in size, every time it does happen it burns through a significant chunk of the DB, even though the actual data size remains small. The process eats up my (fairly small, admittedly) DB in minutes.
Requiring a compact or repairDatabase command every time this happens is clumsy: there is space on disk, and I would like MongoDB to use it without requesting it explicitly. The alternative of having a separate collection for the subdocuments in the array would fix this issue, and is probably a better design anyway, but one that will require me to make joins that I wanted to avoid, this being one of the advantages of NoSQL.
So, first, does MongoDB actually use space the way I described above? Second, am I approaching this the wrong way? Perhaps there is a parameter I can set to get MongoDB to reuse this space automatically; if there is, is it advisable to use it? And third, are there other, more fitting, design approaches I'm missing?

Most of the questions you have asked you could have answered yourself; a Google search would have brought up hundreds of links, including critical blog posts on the matter. That said, this presentation should answer about 90% of your questions: http://www.mongodb.com/presentations/storage-engine-internals
As for solving the problem through settings and the like, that's not really possible here; power-of-2 sizes won't help for an array which grows like this. So, to answer:
Perhaps there is a parameter I can set to get MongoDB to reuse this space automatically; if there is, is it advisable to use it?
I would say no.
And third, are there other, more fitting, design approaches I'm missing?
For something like this I would recommend using a separate collection, storing each of the array elements as its own document, independent of the parent document.
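As a loose sketch of that layout in the mongo shell (the child collection and field names here are made up, assuming the parent lives in collection A):

// each former array cell becomes its own document in a hypothetical "a_items"
// collection, referencing its parent in collection A by _id
var parentId = db.A.findOne()._id          // assumes at least one parent document exists
db.a_items.insert({ parentId: parentId, info: "value", createdAt: new Date() })

// reading the "array" back is an indexed query instead of a $push-grown array
db.a_items.ensureIndex({ parentId: 1 })
db.a_items.find({ parentId: parentId })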

Sammaye's recommendation was correct, but I needed to do some more digging to understand the cause of this issue. Here's what I found.
So, first, does MongoDB actually use space the way I described above?
Yes, but that is not the intended behavior. See bug SERVER-8078, and its (non-obvious) duplicate, SERVER-2958. Frequent $push operations cause MongoDB to shuffle documents around, and their old spots are not (yet!) reused without a compact or repairDatabase command.
Second, am I approaching this the wrong way? Perhaps there is a parameter I can set to get MongoDB to reuse this space automatically; if there is, is it advisable to use it?
For some usages of $push, the usePowerOf2Sizes option initially consumes more space, but stabilizes better (see the discussion on SERVER-8078). It may not work well with arrays that consistently tend to grow, which are a bad idea anyway because document sizes are capped.
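For reference, that option is enabled per collection with a collMod command; a minimal example against collection A from the question:

// switch collection A to power-of-2 record allocation (MMAPv1-era setting)
db.runCommand({ collMod: "A", usePowerOf2Sizes: true })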
And third, are there other, more fitting, design approaches I'm missing?
If an array is going to have hundreds or thousands of items, or if its length is arbitrary but likely large, it's better to move its cells to a different collection, despite the need for additional database calls.

Related

Performance Implications of Accessing Single MongoDB Document vs Different MongoDB Documents in The Same Collection

Say I have a MongoDB Document that contains within itself a list.
This list gets altered a lot and there's no real reason why it couldn't have its own collection and each of the items became a document.
Would there be any performance implications of the former? I've got an inkling that document read/writes are going to be blocked while any given connection tries to read it, but the same wouldn't be true for accessing different documents in the same collection.
I find that these questions are effectively impossible to 'answer' here on Stack Overflow. Not only is there not really a 'right' answer, but it is impossible to get enough context from the question to frame a response that appropriately factors in the items that are most important for you to consider in your specific situation. Nonetheless, here are some thoughts that come to mind that may help point you in the right direction.
Performance is obviously an important consideration here, so it's good to have it in mind as you think through the design. Even within the single realm of performance there are various aspects. For example, would it be acceptable for the source document and the associated secondary documents in another collection to be out of sync? If not, and you had to pursue a route such as using transactions to keep them aligned, then that may be a much bigger performance hit overall and not worth pursuing.
As broad as performance is, it is also just a single consideration here. What about usability? Are you able to succinctly express the type of modifications that you would be doing to the array using MongoDB's query language? What about retrieving the data, would you always pull the information back as a single logical document? If so, then that would imply needing to use $lookup very frequently. Even doing so via a view may be cumbersome and could be both a usability as well as performance consideration. Indeed, an overreliance on $lookup can be considered an antipattern.
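For concreteness, a sketch of what reassembling the split data with $lookup might look like (the collection and field names here are made up):

// pull the parent document and its externalized list items back together
var someParentId = db.parents.findOne()._id    // placeholder for a real parent _id
db.parents.aggregate([
    { $match: { _id: someParentId } },
    { $lookup: {
        from: "listItems",
        localField: "_id",
        foreignField: "parentId",
        as: "items"
    } }
])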
What does it mean when you say that the list gets "altered" a lot? Are you inserting new information, or updating existing entries? There has been a 16MB size limit for individual documents for a long time in MongoDB, so they generally recommend avoiding unbounded arrays. Indeed processing them can be costly in various ways depending on some specific factors.
Also, where does your inkling about concurrency behavior come from? There is a FAQ on concurrency here which helps outline some of the expected behavior for various operations and their locking. Often (with any system) it can be most appropriate to build out an environment that appropriately represents your end state and stress test it directly. That often gives a good general sense for how the approach would work in your situation without having to become an expert in the particulars of how the database (or tool in general) works.
You can see that even in this short response, the "recommendation" fluctuates back and forth. Ultimately this question is about a trade-off which we are not in a good position to answer for you. Hopefully this response helps give you some things to think about while doing so.

MongoDB Array field maximum length

I am using MongoDB with mongoose, and I need to create a schema "advertisement" which has a field called "views" that will be an array of
{userId: String, date: Date}.
I want to know if this is good practice, since although I know now how much it will grow (up to 1500, and then it is reset), in the future I will not. I want to know whether, for example, it would seriously affect the performance of the application if that array could reach 50000 or 100000 elements or more. (It is an unbounded array.) In these cases, what would be the best practice? I thought of just storing an increasing counter, but the business decision is to know by whom and when the ad was seen.
I know that there is a limit only for the document (16mb), but not for the fields themselves. But my questions is more related to performance rather than document limit.
Thank you!
Edit => In the end it is definitely not a good idea to let an array grow unbounded. I checked the answer provided first, and it is a good approach. However, since I will be querying the whole document with the array property quite often, I didn't want to split it. So, since I don't want to keep data older than 3 days in the array, I will $pull all elements that are 3 days old or older, and I hope this keeps the array clean.
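A rough sketch of that cleanup in the mongo shell, assuming an "advertisements" collection holding the "views" array described above:

// remove view entries older than 3 days from every advertisement
var cutoff = new Date(Date.now() - 3 * 24 * 60 * 60 * 1000)
db.advertisements.update(
    {},
    { $pull: { views: { date: { $lt: cutoff } } } },
    { multi: true }
)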
I know that there is a limit only for the document (16mb), but not for the fields themselves.
Fields and their values are parts of the document, so they make direct impact on the document size.
Besides that, having such big arrays is usually not the best approach. It decreases performance and complicates queries.
In your case, it is much better to have a separate views collection whose documents reference the advertisements by their _id.
Also, if you expect advertisement.views to be queried pretty often or, for example, you often need to show the last 10 or 20 views, then the Outlier pattern may also work for you.
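As an illustration of that separate collection (all names here are made up):

// each view becomes its own document referencing the advertisement by _id
var adId = db.advertisements.findOne()._id      // placeholder for a real advertisement _id
db.views.insert({ adId: adId, userId: "u123", date: new Date() })
db.views.ensureIndex({ adId: 1, date: -1 })

// the last 20 views of that advertisement
db.views.find({ adId: adId }).sort({ date: -1 }).limit(20)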

Best way to query entire MongoDB collection for ETL

We want to query an entire live production MongoDB collection (v2.6, around 500GB of data on around 70M documents).
We're wondering what's the best approach for this:
1. A single query with no filtering to open a cursor and get documents in batches of 5/6k
2. Iterate with pagination, using a logic of find().limit(5000).skip(currentIteration * 5000)
We're unsure what's the best practice and will yield the best results with minimum impact on performance.
I would go with 1. and 2. mixed, if possible: iterate over your huge dataset in pages, but access those pages by querying instead of skipping over them, since skipping can be costly, as also pointed out by the docs:
The cursor.skip() method is often expensive because it requires the server to walk from the beginning of the collection or index to get the offset or skip position before beginning to return results. As the offset (e.g. pageNumber above) increases, cursor.skip() will become slower and more CPU intensive. With larger collections, cursor.skip() may become IO bound.
So, if possible, build your pages on an indexed field and process those batches of data with a corresponding range query.
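A minimal sketch of such range-based pagination, relying only on the default _id index (the collection name is a placeholder):

// walk the collection in _id order, batch by batch, without skip()
var batchSize = 5000
var lastId = null
while (true) {
    var query = lastId ? { _id: { $gt: lastId } } : {}
    var batch = db.yourcoll.find(query).sort({ _id: 1 }).limit(batchSize).toArray()
    if (batch.length === 0) break
    batch.forEach(function (doc) {
        // process / transform doc here
    })
    lastId = batch[batch.length - 1]._id
}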
The brutal way
Generally speaking, most drivers load batches of documents anyway. So your languages equivalent of
var docs = db.yourcoll.find()
docs.forEach(
    function(doc){
        // whatever per-document processing you need
    }
)
will actually just create a cursor initially and then, when the current batch is close to exhaustion, load a new batch transparently. So doing this pagination manually while planning to access every document in the collection will have little to no advantage, while adding the overhead of multiple queries.
As for ETL, manually iterating over the documents to modify them and then storing them in a new instance does not, under most circumstances, seem reasonable to me, as you would basically be reinventing the wheel.
Alternate approach
Generally speaking, there is no one-size-fits-all "best" way. The best way is the one that best fits your functional and non-functional requirements.
When doing ETL from MongoDB to MongoDB, I usually proceed as follows:
ET…
Unless you have very complicated transformations, MongoDB's aggregation framework is a surprisingly capable ETL tool. I use it regularly for that purpose and have yet to find a problem not solvable with the aggregation framework for in-MongoDB ETL. Given the fact that in general each document is processed one by one, the impact on your production environment should be minimal, if noticeable at all. After you have done your transformation, simply use the $out stage to save the results in a new collection.
Even collection spanning transformations can be achieved, using the $lookup stage.
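A rough sketch of such a pipeline (the collection and field names are made up for illustration):

db.source.aggregate([
    { $match: { status: "active" } },                                  // Extract: select the relevant documents
    { $project: { _id: 1, name: 1, year: { $year: "$createdAt" } } },  // Transform: reshape / derive fields
    { $out: "source_transformed" }                                     // write the result to a new collection
])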
…L
After you did the extract and transform on the old instance, for loading the data to the new MongoDB instance, you have several possibilities:
Create a temporary replica set, consisting of the old instance, the new instance and an arbiter. Make sure your old instance becomes primary, do the ET part, have the primary step down so your new instance becomes primary, and remove the old instance and the arbiter from the replica set. The advantage is that you make use of MongoDB's replication mechanics to get the data from your old instance to your new instance, without the need to worry about partially executed transfers and such. And you can use it the other way around: transfer the data first, make the new instance the primary, remove the other members from the replica set, and then perform your transformations and remove the "old" data.
Use db.cloneCollection(). The advantage here is that you only transfer the collections you need, at the expense of more manual work.
Use db.cloneDatabase() to copy over the entire DB. Unless you have multiple databases on the original instance, this method has little to no advantage over the replica set method.
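For reference, a sketch of the per-collection variant, run on the new instance (the host and collection name are placeholders):

// copies the named collection from the old host into the current database;
// an optional third argument takes a query to restrict which documents are copied
db.cloneCollection("oldhost.example.com:27017", "yourcoll")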
As written, without knowing your exact use cases, transformations and constraints, it is hard to tell which approach makes the most sense for you.
MongoDB 3.4 supports parallelCollectionScan. I have never tried this myself yet, but it looks interesting to me.
This will not work on sharded clusters. If you have a parallel processing setup, this should speed up the scan.
Please see the documentation here: https://docs.mongodb.com/manual/reference/command/parallelCollectionScan/

MongoDB: Declarative or Navigational?

I am still new on the whole area of MongoDB systems.
I was wondering whether anyone of you knows if MongoDB is declarative or navigational when it comes to accessing objects within a document?
What I mean is:
-> Declarative: a pattern is given and the system works out the result. In other words, it works in the same way as SPJ queries
-> Navigational: it always starts from the beginning of a document and continues from there
SPJ (Select-Project-Join) is more than just the accessing of documents/rows; it is the entire process of forming a result set for queries.
I am unsure how the two, "Declarative" and "Navigational", are compatible. You talk about "Declarative" being the formation of a result set, but then "Navigational" being related to accessing a document, i.e. reading from the beginning of it.
I will answer what I believe your question to be about, the access patterns of documents.
I believe MongoDB leaves reading up to the OS itself (it might use some C++ library to do the work for it); as such, it starts from the beginning of a pointer (i.e. "Navigational"). However, how it actually reads a document does not really matter.
Here is why: MongoDB does not split a document up into many pieces to be stored on the hard disk; instead, it stores a document as a single "block" of space. So I am going to turn around and say you probably need to look at the code to be sure exactly how it reads, though I don't see the point. Here is a presentation that will help you understand more about the internals of MongoDB: http://www.mongodb.com/presentations/storage-engine-internals

Is there any way to register a callback for deletions in a capped collection in Mongo?

I want to use a capped collection in Mongo, but I don't want my documents to die when the collection loops around. Instead, I want Mongo to notice that I'm running out of space and move the old documents into another, permanent collection for archival purposes.
Is there a way to have Mongo do this automatically, or can I register a callback that would perform this action?
You shouldn't be using a capped collection for this. I'm assuming you're doing so because you want to keep the amount of "hot" data relatively small and move stale data to a permanent collection. However, this is effectively what happens anyway when you use MongoDB. Data that's accessed often will be in memory and data that is used less often will not be. Same goes for your indexes if they remain right-balanced. I would think you're doing a bit of premature optimization or at least have a suboptimal schema or index strategy for your problem. If you post exactly what you're trying to achieve and where your performance takes a dive I can have a look.
To answer your actual question; MongoDB does not have callbacks or triggers. There are some open feature requests for them though.
EDIT (small elaboration on the technical implementation): MongoDB is built on top of memory-mapped files for its storage engine. That basically means it is an LRU-based cache of "hot" data, where data in this case can be both actual data and index data. As a result, data and associated index data you access often (in your case the data you'd typically have in your capped collection) will be in memory and thus very fast to query. In typical use cases the performance difference between having an "active" collection plus an "archive" collection and just having one big collection should be small. As you can imagine, having more memory available to the mongod process means more data can stay in memory, and as a result performance will improve. There are some nice presentations from 10gen available on mongodb.org that go into more detail and also explain how to keep indexes right-balanced, etc.
At the moment, MongoDB does not support triggers at all. If you want to move documents away before they reach the end of the "cap" then you need to monitor the data usage yourself.
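As a very rough sketch of that manual monitoring (the collection names and threshold are made up; maxSize is the cap reported by collStats for capped collections, and nothing here tracks what has already been archived):

// check how full the capped collection "events" is and copy the oldest
// documents into a normal "events_archive" collection before they roll off
var stats = db.events.stats()
if (stats.size > 0.9 * stats.maxSize) {
    db.events.find().sort({ $natural: 1 }).limit(100).forEach(function (doc) {
        db.events_archive.insert(doc)
    })
}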
However, I don't see why you would want a capped collection and also still want to move your items away. If you clarify that in your question, I'll update the answer.