MongoDB does not reclaim space despite running repairDatabase() - mongodb

We have been running mongoDB in single unsharded instance, just one database. The size of data files was 0.45 GB. When I looked into the storageSize of all the collections, the total size was ~85 MB. In a bid to reclaim unused space, we ran repairDatabase(), with understanding that file sizes grow from 64 to 128 to 256 and so on till 2 GB. Since the mongo object data we have (85 MB) can be accommodated in 64 + 128 MB files, we were expecting the 256 MB file to be reclaimed. However, to our surprise, no space was reclaimed.
Can someone let us know the logic based on which we can find how much space would be reclaimed? Essentially, given total disk space a database takes, and given total mongo object data size, can one estimate accurately how much space would be reclaimed?
The following is the db.stats() output as requested in a comment:
> db.stats()
{
"db" : "analytics_data_1",
"collections" : 12,
"objects" : 207223,
"avgObjSize" : 353.6659347659285,
"dataSize" : 73287716,
"storageSize" : 84250624,
"numExtents" : 43,
"indexes" : 26,
"indexSize" : 21560112,
"fileSize" : 469762048,
"nsSizeMB" : 16,
"dataFileVersion" : {
"major" : 4,
"minor" : 5
},
"ok" : 1
}
>

The storage FAQ explains that an extra file is always pre-allocated and as soon as you start writing to it, mongod will preallocate the next file.
Repair won't reclaim any space that would normally exist - it can only help if you've deleted a lot of data or dropped some collections.
Disabling preallocation can save you space but will cost you in performance as the file will be allocated when it's actually needed to write to - and that will slow down inserts.

Related

How can MongoDB dataSize be larger than storageSize?

As far as I understand, the storage size for MongoDB should always be larger than data size. However, after upgrading to Mongo 3.0 and using WiredTiger, I start seeing that the data size is larger than the storage size.
Here's from one of the databases:
{
"db" : "Results",
"collections" : NumberInt(1),
"objects" : NumberInt(251816),
"avgObjSize" : 804.4109548241573,
"dataSize" : NumberInt(202563549),
"storageSize" : NumberInt(53755904),
"numExtents" : NumberInt(0),
"indexes" : NumberInt(5),
"indexSize" : NumberInt(41013248),
"ok" : NumberInt(1)
}
Note that 202563549 > 53755904 by far margin. I am confused how this can be. Is the way to read db.stats() different now in Mongo 3.0?
The storageSize metric is equal to the size (in bytes) of all the data extents in the database. Without compression, this number is larger than dataSize because it includes yet-unused space (in data extents) and space vacated by deleted or moved documents within extents. However, as you are using the WiredTiger storage engine, data is compressed on the disk and is therefore smaller than the dataSize.
MongoDB 3.0 with WiredTiger engine uses 'snappy' compression by default.
If this affects your DB performance, you can consider to turn it off (blockCompressor: none) in the mongod.conf file:
storage:
engine: wiredTiger
wiredTiger:
collectionConfig:
blockCompressor: none

why my mongodb fileSize is much bigger than storageSize in db.stats()?

I have a db named log_test1, with only 1 capped collection logs. The max size of capped collection is 512M. After I inserted 200k data, I found the disk usage of the db is 1.6G. With db.stats(), I can see the storageSize is 512M, correct, but my actual fileSize is 1.6G, why did this happen? How can I control the disk size is just my capped collection size plus index size?
> use log_test1
switched to db log_test1
> db.stats()
{
"db" : "log_test1",
"collections" : 3,
"objects" : 200018,
"avgObjSize" : 615.8577328040476,
"dataSize" : 123182632,
"storageSize" : 512008192,
"numExtents" : 3,
"indexes" : 8,
"indexSize" : 71907920,
"fileSize" : 1610612736,
"nsSizeMB" : 16,
"dataFileVersion" : {
"major" : 4,
"minor" : 5
},
"ok" : 1
}
This is probably because MongoDB preallocates data and journal files.
MongoDB 2
In the data directory, MongoDB preallocates data files to a particular size, in part to prevent file system fragmentation. MongoDB names the first data file <databasename>.0, the next <databasename>.1, etc. The first file mongod allocates is 64 megabytes, the next 128 megabytes, and so on, up to 2 gigabytes, at which point all subsequent files are 2 gigabytes. The data files include files with allocated space but that hold no data. mongod may allocate a 1 gigabyte data file that may be 90% empty. For most larger databases, unused allocated space is small compared to the database.
On Unix-like systems, mongod preallocates an additional data file and initializes the disk space to 0. Preallocating data files in the background prevents significant delays when a new database file is next allocated.
You can disable preallocation with the noprealloc run time option. However noprealloc is not intended for use in production environments: only use noprealloc for testing and with small data sets where you frequently drop databases.
MongoDB 3
The data files in your data directory, which is the /data/db
directory in default configurations, might be larger than the data set
inserted into the database. Consider the following possible causes:
Preallocated data files
MongoDB preallocates its data files to avoid filesystem fragmentation,
and because of this, the size of these files do not necessarily
reflect the size of your data.
The storage.mmapv1.smallFiles option will reduce the size of these
files, which may be useful if you have many small databases on disk.
The oplog
If this mongod is a member of a replica set, the data
directory includes the oplog.rs file, which is a preallocated capped
collection in the local database.
The default allocation is approximately 5% of disk space on 64-bit
installations.
The journal
The data directory contains the journal files, which store
write operations on disk before MongoDB applies them to databases.
Empty records
MongoDB maintains lists of empty records in data files
as it deletes documents and collections. MongoDB can reuse this space,
but will not, by default, return this space to the operating system.
Taken from MongoDB Storage FAQ.

Understanding Server Status for "mem"

Taking a look at the serverStatus command, I see the following data.
>db.runCommand( { serverStatus: 1} )
...
"mem" : {
"bits" : 64,
"resident" : 2138, // Mongo uses 2 GB RAM
"virtual" : 33272,
"supported" : true,
"mapped" : 16489, // equals db.coll.totalSize()
"mappedWithJournal" : 32978
},
Mongo recommends that the working set size fit in RAM.
If I understand correctly, then 16.4 GB of Mongo documents/indexes are memory mapped. Since Mongo is only using 2 GB of RAM, whenever Mongo needs to access an address outside of that 2 GB, then Mongo will need to fetch the contents of the address on disk and then load them into memory?
Is this my explanation the main reason that working set must fit into RAM?

Is it 100 million documents too much?

Well, I am new to mongo and today morning I had a (bad) idea. I was playing around with indexes from the shell and decided to create a large collection with many documents (100 million). So I executed the following command:
for (i = 1; i <= 100; i++) {
for (j = 100; j > 0; j--) {
for (k = 1; k <= 100; k++) {
for (l = 100; l > 0; l--) {
db.testIndexes.insert({a:i, b:j, c:k, d:l})
}
}
}
}
However, the things didn't go as I expected:
It took 45 minutes complete the request.
It created 16 GB data on my hard disk.
It used 80% of my RAM (8GB total) and it won't release them till I restarted my PC.
As you can see in the photo below, as the number of documents inside the collection was growing, the time of the insertion of documents was growing as well. I suggest that by the last modification time of the data files:
Is this an expected behavior? I don't think that 100 million simple documents are too much.
P.S. I am now really afraid to run an ensureIndex command.
Edit:
I executed the following command:
> db.testIndexes.stats()
{
"ns" : "test.testIndexes",
"count" : 100000000,
"size" : 7200000056,
"avgObjSize" : 72.00000056,
"storageSize" : 10830266336,
"numExtents" : 28,
"nindexes" : 1,
"lastExtentSize" : 2146426864,
"paddingFactor" : 1,
"systemFlags" : 1,
"userFlags" : 0,
"totalIndexSize" : 3248014112,
"indexSizes" : {
"_id_" : 3248014112
},
"ok" : 1
}
So, the default index on _id has more than 3GB size.
It took 45 minutes complete the request.
Not surprised.
It created 16 GB data on my hard disk.
As #Abhishek states everything seems fine, MongoDB does use a fair amount of space without compression currently (that's coming later hopefully).
It seems that the data size is about 7.2GB while the average object size is 72 bytes, it seems this is working perfectly (since 72 bytes fits into 7.2GB) with the 3GB overhead of the _id index it seems that the storage size of 10GB is fitting quite well.
Though I am concerned that it has used 6GB more than the statistics say it needs to, that might need more looking into. I am guessing it is because of how MongoDB wrote to the data files, it might even be because you was not using a non fire and forget write concern (w>0), all in all; hmmm.
It used 80% of my RAM (8GB total) and it won't release them till I restarted my PC.
MongoDB will try and take as much RAM as the OS will let it. If the OS lets it take 80% then 80% it will take. This is actually a good sign, it shows that MongoDB has the right configuration values to store your working set efficiently.
When running ensureIndex mongod will never free up RAM. It simply has no hooks for that, instead the OS will shrink its allocated block to make room for more (or should rather).
This is an expected behavior, mongo db files starts with filesize 16MB ( test.0 ), and grow till 2GB and then 2GB is constant.
100 million ( 16 GB ) documents in nothing.
You can run ensureIndex, it shouldn't take much time.
You need not to restart your pc, the moment other process needed RAM, mongod will free RAM.
FYI : test.12 is completely empty.
I am guessing you are not worried about 16GB size just for 100 million documents ?

Does MongoDB store documents in 4MB chunks?

I read that MongoDB documents are limited to 4 MB in size. I also read that when you insert a document, MongoDB puts some padding in so that if you add something to the document, the entire document doesn't have to be moved and reindexed.
So I was wondering, does it store documents in 4MB chunks on disk?
Thanks
As of 1.8, individual documents are now limited to 16MB in size (was previously 4MB). This is an arbitary limitation imposed as when you read a document off disk, the whole document is read into RAM. So I think the intention is that this limitation is there to try and safeguard memory / make you think about your schema design.
Data is then stored across multiple data files on disk - I forget the initial file size, but every time the database grows, a new file is created to expand into, where each new file is created bigger than the previous file until a single file size of 2GB is reached. From this point on, if the database continues to grow, subsequent 2GB data files are created for documents to be inserted into.
"chunks" has a meaning in the sharding aspect of MongoDB. Whereby documents are stored in "chunks" of a configurable size and when balancing needs to be done, it's these chunks of data (n documents) that are moved around.
The simple answer is "no." The actual space a document takes up in Mongo's files is variable, but it isn't the maximum document size. The DB engine watches to see how much your documents tend to change after insertion and calculates the padding factor based on that. So it changes all the time.
If you're curious, you can see the actual padding factor and storage space of your data using the .stats() function on a collection in the mongo shell. Here's a real-world example (with some names changed to protect the innocent clients):
{14:42} ~/my_directory ➭ mongo
MongoDB shell version: 1.8.0
connecting to: test
> show collections
schedule_drilldown
schedule_report
system.indexes
> db.schedule_report.stats()
{
"ns" : "test.schedule_report",
"count" : 16749,
"size" : 60743292,
"avgObjSize" : 3626.681712341035,
"storageSize" : 86614016,
"numExtents" : 10,
"nindexes" : 3,
"lastExtentSize" : 23101696,
"paddingFactor" : 1.4599999999953628,
"flags" : 1,
"totalIndexSize" : 2899968,
"indexSizes" : {
"_id_" : 835584,
"WeekEnd_-1_Salon_1" : 925696,
"WeekEnd_-1_AreaCode_1" : 1138688
},
"ok" : 1
}
So my test collection has about 16,749 records in it, with an average size of about 3.6 KB ("avgObjSize") and a total data size of about 60 MB ("size"). However, it turns out they actually take up about 86 MB on disk ("storageSize") because of the padding factor. That padding factor has varied over time as the collection's documents have been updated, but if I inserted a new document right now, it'd allocate 1.46 times as much space as the document needs ("paddingFactor") to avoid having to move things around if I change it later. To me that's a fair size/speed tradeoff.