Total MongoDB storage size

I have a sharded and replicated MongoDB with dozens of millions of records. I know that Mongo writes data with a padding factor to allow fast in-place updates, and I also know that to replicate the database Mongo has to keep an operation log, which requires some (actually, a lot of) space. Even with that knowledge I have no idea how to estimate the actual size Mongo requires given the size of a typical database record. Right now I see a discrepancy of a factor of 2-3 between weekly repairs.
So the question is: how do I estimate the total storage size required by MongoDB, given an average record size in bytes?

The short answer is: you can't, not based solely on avg. document size (at least not in any accurate way).
To explain more verbosely:
The space needed on disk is not simply a function of the average document size. There is also the space needed for any indexes you create. Then there is the space needed when documents outgrow their padding and have to be moved (despite the padding, this does happen) - the space they vacate is placed on a free list for re-use, but depending on the data you subsequently insert, it may or may not actually be re-usable.
Add to that the fact that pre-allocation means a handful of documents will occasionally increase your on-disk space utilization by ~2GB as a new data file is allocated. With sufficient data this is essentially a rounding error, but it is worth bearing in mind.
The only way to estimate this kind of data-to-disk ratio, assuming a consistent usage pattern, is to trend it over time for your particular use case: track the disk space used versus the data inserted (the number of documents may be a better measure than data volume, depending on how variable your document sizes are).
Similarly, it helps to track the insertion rate, document size, and the space gained back from a resync/repair. FYI - rather than running a repair, you can resync a secondary from scratch to get a "fresh" copy of the data files, which can be less disruptive and use less space depending on your set up.
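As a minimal sketch of the kind of numbers worth logging over time (run against the database you are sizing; the fields are from db.stats() on MMAPv1):

    var s = db.stats();
    print("objects:     " + s.objects);       // documents across all collections
    print("avgObjSize:  " + s.avgObjSize);    // average document size in bytes
    print("dataSize:    " + s.dataSize);      // logical size of the data
    print("storageSize: " + s.storageSize);   // space allocated on disk for data
    print("indexSize:   " + s.indexSize);     // space used by all indexes
    print("fileSize:    " + s.fileSize);      // total size of the data files on disk
    // Log these on a schedule (e.g. cron + mongo --eval) and plot disk usage against
    // documents inserted to derive your own bytes-per-document ratio.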

Related

Does the size of the value affect the size of the index in MongoDB?

I have a set of IDs which are numbers anywhere between 8 and 11 digits long, and there are only 300K of them (so not exactly sequential etc.). These are stored in collection A.
I have a collection B with millions of entries in which every entry has an array of these IDs, and every array could have thousands of these IDs. I need to index this field too (i.e. hundreds of millions, potentially up to a billion+ entries). When I indexed it, the index turned out to be very large, way over the RAM size of the cluster.
Would it be worth trying to compress each ID down from 8-11 digits to some short alphanumeric encoded string? Or simply re-number them sequentially from 1 to 300,000 (and maintain a mapping)? Would that have a significant impact on the index size, or is it not worth the effort?
The size of your indexed field affects the size of the index. You can run the collStats command to check the size of the index and compare the size of your indexed field with the total space MongoDB needs for the index (see the snippet below).
MongoDB already applies some compression to indexes (WiredTiger uses prefix compression), so encoding your field as an alphanumeric string will probably bring no benefit, or only a marginal one.
Using a smaller numeric type will save a small amount of space in your index, but if you need to maintain a mapping it's probably not worth the effort and is likely to overcomplicate things.
The index on the 11-digit ID for a collection with only 300K elements should be small, on the order of a few MB, so it's very unlikely that you will have storage or memory issues with that index.
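For example, a minimal shell check of those numbers (the collection and index names below are just placeholders):

    // Compare the average document size with what the indexes actually take up
    var stats = db.collectionA.stats();   // same data as db.runCommand({ collStats: "collectionA" })
    print("avgObjSize:     " + stats.avgObjSize + " bytes");
    print("totalIndexSize: " + stats.totalIndexSize + " bytes");
    printjson(stats.indexSizes);          // per-index sizes, e.g. { "_id_" : ..., "myId_1" : ... }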
Regarding your second collection: if you shave some bytes off each ID, you will get some reduction in index size. For example, reducing each ID from 8 bytes to 4 bytes across roughly 1 billion entries saves a few GB of index.
Saving a few GB on both the index and collection B could be an interesting win, so depending on your needs it may be worth the effort of modifying the IDs to use the smallest type possible. However, you may still hit memory issues (now, or in the near future if the collections keep growing) because the index does not fit in RAM, so sharding the collection could be a good option.
You can also create a hashed index, which gives more or less the same performance if you are mainly interested in saving index size.
Check with some of your data how much index size you save and what the performance impact is, and take the decision from there.

Mongodb Migration Threshold Controls?

I'm seeking a way to control sharded collection migration thresholds in mongodb. These thresholds are described at https://docs.mongodb.com/manual/core/sharding-balancer-administration/#sharding-migration-thresholds
What I see in those values is that the migration thresholds are tuned to roughly 10% of the chunk count for small numbers of chunks (fewer than 20 chunks: 2; 20-79: 4; 80 and above: 8). Beyond that, it's locked at 8 chunks: a difference of just 8 chunks between shard members will trigger migration activity.
For our collections with high activity rates and large bodies of data, this causes balancing thrash - there is almost always a difference of 8 chunks, all the time. With high transaction rates on a sharded collection, there is a range of perfectly acceptable causes of temporary imbalance (which I won't go into here). When we shut off the balancer, small temporary imbalances often correct themselves organically as activity across the cluster shifts. With the balancer turned on, by the time it finishes one migration, another (or many in parallel) triggers right away.
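For context, this is roughly how we toggle the balancer today with the standard shell helpers (nothing custom); what I'm after is a knob for the thresholds themselves, not just an on/off switch or a schedule:

    // Turn balancing off / back on for the whole cluster
    sh.setBalancerState(false)
    sh.getBalancerState()      // confirm the current state
    sh.setBalancerState(true)

    // Or restrict balancing to a quiet window instead of disabling it outright
    db.getSiblingDB("config").settings.update(
        { _id: "balancer" },
        { $set: { activeWindow: { start: "02:00", stop: "06:00" } } },
        { upsert: true }
    )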
With the thresholds locked down like this, our larger collections thrash all the time - consuming IOPS and network bandwidth that we would really like to use in other ways. These tiny migrations have no practical benefit, either: if we're talking about a large collection, then 8 chunks can be a vanishingly small quantity of data relative to any real workload. So we're spending a lot of energy moving lots of small snippets around for zero effective benefit.
I would love to find a config file setting that - at a minimum - allows me to redefine those values. Even better would be a way to force a fractional policy, like 10% of the number of chunks in the collection. I don't see any controls of this type in the mongo documentation, but I could be missing them.
Failing that, I'll have to spin up on the code and retool it myself to build from source, so I'm hoping someone has already solved this and I just can't see where to control it. Thanks in advance!

What is the max size of a collection in MongoDB?

I would like to know the maximum size of a collection in MongoDB.
The MongoDB limitations documentation mentions that a single MMAPv1 database has a maximum size of 32TB.
Does this mean the max size of a collection is 32TB?
If I want to store more than 32TB in one collection what is the solution?
There are theoretical limits, as I will show below, but even the lower bound is pretty high. It is not easy to calculate the limits correctly, but the order of magnitude should be sufficient.
mmapv1
The actual limit depends on a few things like the length of shard names and the like (that adds up if you have a couple of hundred thousand of them), but here is a rough calculation with real-life data.
Each shard needs some space in the config db, which, like any other database, is limited to 32TB on a single machine or in a replica set. On the servers I administrate, the average size of an entry in config.shards is 112 bytes. Furthermore, each chunk needs about 250 bytes of metadata. Let us assume optimal chunk sizes of close to 64MB.
We can have at most 500,000 chunks per shard. 500,000 * 250 bytes equals 125MB of chunk information per shard. So per shard we have 125MB plus the 112-byte shard entry, roughly 125.000112 MB, if we max everything out. Dividing 32TB by that value shows that we can have a maximum of slightly under 256,000 shards in a cluster.
Each shard in turn can hold 32TB worth of data. 256,000 * 32TB is 8.192 exabytes, or 8,192,000 terabytes. That would be the limit for our example.
Let's say it's 8 exabytes. As of now, this easily translates to "enough for all practical purposes". To give you an impression: all the data held by the Library of Congress (arguably one of the biggest libraries in the world in terms of collection size) is estimated at around 20TB, including audio, video, and digital materials. You could fit that into our theoretical MongoDB cluster some 400,000 times. Note that this is the lower bound of the maximum size, using conservative values.
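If you want to replay that back-of-the-envelope arithmetic, here is a small shell sketch using decimal units and the assumptions above:

    // Assumptions from above: 32TB per shard, 64MB chunks,
    // ~250 bytes of chunk metadata and ~112 bytes per config.shards entry
    var shardLimitMB   = 32 * 1000 * 1000;                           // 32TB expressed in MB
    var chunksPerShard = shardLimitMB / 64;                          // ~500,000 chunks
    var metaPerShardMB = (chunksPerShard * 250 + 112) / 1e6;         // ~125.000112 MB in the config db
    var maxShards      = Math.floor(shardLimitMB / metaPerShardMB);  // slightly under 256,000 shards
    var maxDataTB      = maxShards * 32;                             // ~8,192,000 TB, i.e. ~8.2 EB
    print(maxShards + " shards, ~" + maxDataTB + " TB of data");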
WiredTiger
Now for the good part: the WiredTiger storage engine does not have this limitation: the database size is not limited (since there is no limit on how many data files can be used), so we can have an unlimited number of shards. Even when those shards run on mmapv1 and only the config servers run on WiredTiger, the size of a cluster becomes nearly unlimited - the 16.8M TB RAM limit of a 64-bit system might cause problems somewhere and cause the indices of the config.shards collection to be swapped to disk, stalling the system. I can only guess, since my calculator refuses to work with numbers in that area (and I am too lazy to do it by hand), but I estimate the limit here to be in the two-digit yottabyte area (and the space needed to host it somewhere in the size of Texas).
Conclusion
Do not worry about the maximum data size in a sharded environment. No matter what, it is by far enough, even with the most conservative approach. Use sharding, and you are done. Btw: even 32TB is a hell of a lot of data: most clusters I know hold less than that and shard because IOPS and RAM utilization exceeded a single node's capacity.

Does reducing the size of mongodb documents reduce the working set?

I'm trying to determine whether reducing the size of our documents will reduce our working set. Our database is reaching the limit of the RAM on our instances. We store redundant data in an array in each of our documents (which can contain thousands of elements), and I am now limiting that array to 40 elements. That should reduce the size of the collection at fault to about 10% of what it was (this is where our bulk is), but from my understanding of the documentation the storage size will not change. After reading up on Mongo, I'm still not sure whether reducing documents to 10% of their original size will impact the working set. Can someone please help/explain?
Edit: Some background information
One of the things we've done to keep performance up as our database grows is to increase RAM so that the database 'fits' in its entirety. The database is getting close to the 64GB of RAM we have on our replica instances... That is what prompted this question...
Edit: The essential question
Essentially, the question comes down to this: What should I use when I'm calculating if our working set fits in memory? These numbers come from running db.stats():
dataSize + indexSize < RAM
OR
storageSize + indexSize < RAM
OR
fileSize + indexSize < RAM
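For reference, here is how I pull the three candidate combinations out of the shell:

    // The fields referenced above, straight from db.stats()
    var s = db.stats();
    print("dataSize + indexSize:    " + (s.dataSize + s.indexSize));
    print("storageSize + indexSize: " + (s.storageSize + s.indexSize));
    print("fileSize + indexSize:    " + (s.fileSize + s.indexSize));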
Thanks!

How to calculate the future database size in Mongo?

I'm using MongoDB and we are really happy with this DB. But recently our client asked us to estimate the future database size.
We know how to calculate this in a typical relational database, but we don't have long production experience with this NoSQL database.
Things that we know:
db.namecollections.stats() gives us important information like size (of the documents), avgObjSize (average document size), storageSize, and totalIndexSize
With size and totalIndexSize we can calculate the total size of the collection alone, but the big question here is:
Why is there a difference between the collection size and storageSize?
And how can one calculate this with the future database size in mind?
MongoDB pads documents a little so that they can grow without having to be moved to the end of the collection on disk (an expensive operation).
Also, Mongo pre-allocates data files by creating the next one and filling it with zeros before it is needed, to boost speed; that pre-allocated space shows up on disk before any data lands in it.
You can pass the --noprealloc flag to mongod to prevent that from happening.
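A minimal shell sketch to see the gap for yourself, using the collection name from your question:

    // Logical data size vs. space actually allocated on disk
    var stats = db.namecollections.stats();
    print("size:           " + stats.size);           // data size, padding included
    print("storageSize:    " + stats.storageSize);    // space allocated on disk (not released when documents are removed)
    print("totalIndexSize: " + stats.totalIndexSize);
    print("paddingFactor:  " + stats.paddingFactor);  // MMAPv1-era field; absent on newer storage engines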
In regards to your question about calculating disk space 5 years out: if you can figure out an equation for the growth of your data, make some assumptions about your average document size, and decide how many (and what kinds of) indexes you will have, you might be able to come up with something.
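As a rough, hedged sketch of that kind of projection - every number below is a made-up assumption you would replace with your own measurements:

    // All inputs are placeholder assumptions - plug in your own growth figures
    var docsPerDay       = 100000;   // expected insert rate
    var years            = 5;
    var avgObjSize       = 512;      // bytes, e.g. from db.namecollections.stats().avgObjSize
    var storageOverhead  = 1.5;      // storageSize / size ratio you observe today (padding, preallocation)
    var indexBytesPerDoc = 120;      // totalIndexSize / object count you observe today

    var docs    = docsPerDay * 365 * years;
    var dataGB  = docs * avgObjSize * storageOverhead / 1e9;
    var indexGB = docs * indexBytesPerDoc / 1e9;
    print("projected on-disk size: ~" + (dataGB + indexGB).toFixed(0) + " GB");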
Having worked for a bank as well, my suggestion would be to come up with an insane upper bound and then quadruple it. Money is cheap inside a bank; calculation mistakes are not.