As far as I understand, the storage size for MongoDB should always be larger than the data size. However, after upgrading to MongoDB 3.0 and switching to WiredTiger, I started seeing data sizes that are larger than the storage size.
Here's from one of the databases:
{
    "db" : "Results",
    "collections" : NumberInt(1),
    "objects" : NumberInt(251816),
    "avgObjSize" : 804.4109548241573,
    "dataSize" : NumberInt(202563549),
    "storageSize" : NumberInt(53755904),
    "numExtents" : NumberInt(0),
    "indexes" : NumberInt(5),
    "indexSize" : NumberInt(41013248),
    "ok" : NumberInt(1)
}
Note that 202563549 exceeds 53755904 by a wide margin. I am confused about how this can be. Is the way to read db.stats() different in MongoDB 3.0?
The storageSize metric is equal to the size (in bytes) of all the data extents in the database. Without compression, this number is larger than dataSize because it includes yet-unused space (in data extents) and space vacated by deleted or moved documents within extents. However, as you are using the WiredTiger storage engine, data is compressed on the disk and is therefore smaller than the dataSize.
MongoDB 3.0 with WiredTiger engine uses 'snappy' compression by default.
If this affects your DB performance, you can consider turning it off (blockCompressor: none) in the mongod.conf file:
storage:
  engine: wiredTiger
  wiredTiger:
    collectionConfig:
      blockCompressor: none
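As a quick sanity check of how much compression you are getting, you can compare dataSize to storageSize. With the numbers from the question, 202563549 / 53755904 ≈ 3.8, so snappy is achieving roughly a 3.8:1 ratio on the document data here (indexes are reported separately under indexSize and use prefix compression by default). A minimal shell sketch:
> var s = db.stats()
> s.dataSize / s.storageSize    // rough data compression ratio; ~3.8 for the stats above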
Related
db.collection.stats()
Response:
"count" : 20696445,
"size" : NumberLong("1478263842661"),
"storageSize" : 334732324864,
"totalIndexSize" : 676327424,
"indexSizes" : {
"_id_" : 377094144,
"leadID_1" : 128049152,
"leadID_hashed" : 171184128
},
"avgObjSize" : 71425.97884134208
My actual disk size matches storageSize. So what are size and the other keys?
You haven't mentioned the version of MongoDB server you are using but given the size of your data is much larger than the storageSize on disk, I'm assuming you are using the WiredTiger storage engine which compresses data and indexes by default. The WiredTiger storage engine was first available as an option in the MongoDB 3.0 production series and became the default storage engine for new deployments in MongoDB 3.2+.
In your example output it looks like you have 1.4TB of uncompressed data which is currently occupying 334GB on disk (the storageSize value). Storage space used by indexes for this collection is reported separately under indexSizes and summed up as totalIndexSize.
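For a rough sense of the compression ratio, dividing the two gives 1478263842661 / 334732324864 ≈ 4.4, i.e. the default snappy block compression is storing this collection's documents in roughly a quarter of their uncompressed size.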
The output of collection.stats() will vary depending on your MongoDB server version and configured storage engine, but is generally described in the MongoDB manual as part of the output of the collStats command which is called by the db.collection.stats() shell helper.
Note: MongoDB documentation is versioned so you should always make sure you are referencing documentation that matches your release series of MongoDB (i.e. 3.2, 3.4, ...). Default documentation links will point to the current production release.
Refer to the linked collStats documentation:
collStats.size
The total size in memory of all records in a collection. This value does not include the record header, which is 16 bytes per record, but does include the record’s padding. Additionally size does not include the size of any indexes associated with the collection, which the totalIndexSize field reports.
The scale argument affects this value.
collStats.storageSize
The total amount of storage allocated to this collection for document storage. The scale argument affects this value.
storageSize does not include index size. See totalIndexSize for index sizing.
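Both of these fields respect the scale argument, so you can ask for the figures in megabytes directly. A small sketch, assuming the collection is called leads (the real collection name is not shown in the question):
> db.leads.stats(1024 * 1024)    // size, storageSize and totalIndexSize reported in MB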
I have a db named log_test1 with only one capped collection, logs. The max size of the capped collection is 512 MB. After I inserted 200k documents, I found that the disk usage of the db is 1.6 GB. With db.stats() I can see that storageSize is 512 MB, which is correct, but the actual fileSize is 1.6 GB. Why did this happen? How can I keep the disk usage to just my capped collection size plus index size?
> use log_test1
switched to db log_test1
> db.stats()
{
    "db" : "log_test1",
    "collections" : 3,
    "objects" : 200018,
    "avgObjSize" : 615.8577328040476,
    "dataSize" : 123182632,
    "storageSize" : 512008192,
    "numExtents" : 3,
    "indexes" : 8,
    "indexSize" : 71907920,
    "fileSize" : 1610612736,
    "nsSizeMB" : 16,
    "dataFileVersion" : {
        "major" : 4,
        "minor" : 5
    },
    "ok" : 1
}
This is probably because MongoDB preallocates data and journal files.
MongoDB 2
In the data directory, MongoDB preallocates data files to a particular size, in part to prevent file system fragmentation. MongoDB names the first data file <databasename>.0, the next <databasename>.1, etc. The first file mongod allocates is 64 megabytes, the next 128 megabytes, and so on, up to 2 gigabytes, at which point all subsequent files are 2 gigabytes. The data files include files with allocated space but that hold no data. mongod may allocate a 1 gigabyte data file that may be 90% empty. For most larger databases, unused allocated space is small compared to the database.
On Unix-like systems, mongod preallocates an additional data file and initializes the disk space to 0. Preallocating data files in the background prevents significant delays when a new database file is next allocated.
You can disable preallocation with the noprealloc run time option. However noprealloc is not intended for use in production environments: only use noprealloc for testing and with small data sets where you frequently drop databases.
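For quick local testing of a setup like this, both behaviours can also be toggled from the 2.x command line; this is only a sketch with a throwaway data directory, and not something to use in production per the warning above:
mongod --dbpath /data/test --smallfiles --noprealloc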
MongoDB 3
The data files in your data directory, which is the /data/db directory in default configurations, might be larger than the data set inserted into the database. Consider the following possible causes:

Preallocated data files
MongoDB preallocates its data files to avoid filesystem fragmentation, and because of this, the size of these files does not necessarily reflect the size of your data. The storage.mmapv1.smallFiles option will reduce the size of these files, which may be useful if you have many small databases on disk.

The oplog
If this mongod is a member of a replica set, the data directory includes the oplog.rs file, which is a preallocated capped collection in the local database. The default allocation is approximately 5% of disk space on 64-bit installations.

The journal
The data directory contains the journal files, which store write operations on disk before MongoDB applies them to databases.

Empty records
MongoDB maintains lists of empty records in data files as it deletes documents and collections. MongoDB can reuse this space, but will not, by default, return this space to the operating system.
Taken from MongoDB Storage FAQ.
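If you are on MongoDB 3.x with the MMAPv1 engine, the same knobs are exposed in mongod.conf; this is a sketch under that assumption (and, as noted above, disabling preallocation is not recommended for production):
storage:
  engine: mmapv1
  mmapv1:
    smallFiles: true          # smaller initial data files, capped at 512 MB each, with smaller journal files
    preallocDataFiles: false  # config-file equivalent of the old --noprealloc flag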
We have been running MongoDB as a single unsharded instance with just one database. The size of the data files was 0.45 GB. When I looked at the storageSize of all the collections, the total was ~85 MB. In a bid to reclaim unused space, we ran repairDatabase(), with the understanding that file sizes grow from 64 to 128 to 256 MB and so on up to 2 GB. Since the Mongo object data we have (85 MB) can be accommodated in the 64 + 128 MB files, we were expecting the 256 MB file to be reclaimed. However, to our surprise, no space was reclaimed.
Can someone let us know the logic based on which we can find how much space would be reclaimed? Essentially, given total disk space a database takes, and given total mongo object data size, can one estimate accurately how much space would be reclaimed?
The following is the db.stats() output as requested in a comment:
> db.stats()
{
    "db" : "analytics_data_1",
    "collections" : 12,
    "objects" : 207223,
    "avgObjSize" : 353.6659347659285,
    "dataSize" : 73287716,
    "storageSize" : 84250624,
    "numExtents" : 43,
    "indexes" : 26,
    "indexSize" : 21560112,
    "fileSize" : 469762048,
    "nsSizeMB" : 16,
    "dataFileVersion" : {
        "major" : 4,
        "minor" : 5
    },
    "ok" : 1
}
>
The storage FAQ explains that an extra file is always pre-allocated and as soon as you start writing to it, mongod will preallocate the next file.
Repair won't reclaim any space that would normally exist - it can only help if you've deleted a lot of data or dropped some collections.
Disabling preallocation can save you space but will cost you in performance as the file will be allocated when it's actually needed to write to - and that will slow down inserts.
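Working through the numbers in the question gives a rough estimate of why nothing came back (assuming the default 2.x allocation pattern of 64 MB, 128 MB, 256 MB, ... data files): data plus indexes occupy about 84250624 + 21560112 ≈ 101 MB, which fits inside the first two data files (64 MB + 128 MB = 192 MB). Because mongod preallocates the next file as soon as it starts writing to the current one, a third 256 MB file exists as well, and 64 + 128 + 256 = 448 MB = 469762048 bytes, which is exactly the fileSize reported by db.stats(). Every file on disk is therefore either in use or the expected preallocation, so repair has nothing left to return.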
Taking a look at the serverStatus command, I see the following data.
>db.runCommand( { serverStatus: 1} )
...
"mem" : {
"bits" : 64,
"resident" : 2138, // Mongo uses 2 GB RAM
"virtual" : 33272,
"supported" : true,
"mapped" : 16489, // equals db.coll.totalSize()
"mappedWithJournal" : 32978
},
Mongo recommends that the working set size fit in RAM.
If I understand correctly, then 16.4 GB of Mongo documents/indexes are memory mapped. Since Mongo is only using 2 GB of RAM, whenever Mongo needs to access an address outside of that 2 GB, it will have to fetch the contents of that address from disk and load them into memory?
Is this the main reason that the working set must fit into RAM?
I read that MongoDB documents are limited to 4 MB in size. I also read that when you insert a document, MongoDB puts some padding in so that if you add something to the document, the entire document doesn't have to be moved and reindexed.
So I was wondering, does it store documents in 4MB chunks on disk?
Thanks
As of 1.8, individual documents are now limited to 16 MB in size (it was previously 4 MB). This is an arbitrary limitation, imposed because when you read a document off disk, the whole document is read into RAM. So I think the intention is that the limitation is there to safeguard memory and make you think about your schema design.
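If you want to see how close a particular document is to that limit, the mongo shell's Object.bsonsize() helper returns the BSON size of a document in bytes; a quick sketch, with a placeholder collection name:
> Object.bsonsize(db.mycollection.findOne())    // bytes; must stay under the 16 MB cap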
Data is then stored across multiple data files on disk - I forget the initial file size, but every time the database grows, a new file is created to expand into, where each new file is created bigger than the previous file until a single file size of 2GB is reached. From this point on, if the database continues to grow, subsequent 2GB data files are created for documents to be inserted into.
"chunks" has a meaning in the sharding aspect of MongoDB. Whereby documents are stored in "chunks" of a configurable size and when balancing needs to be done, it's these chunks of data (n documents) that are moved around.
The simple answer is "no." The actual space a document takes up in Mongo's files is variable, but it isn't the maximum document size. The DB engine watches to see how much your documents tend to change after insertion and calculates the padding factor based on that. So it changes all the time.
If you're curious, you can see the actual padding factor and storage space of your data using the .stats() function on a collection in the mongo shell. Here's a real-world example (with some names changed to protect the innocent clients):
{14:42} ~/my_directory ➭ mongo
MongoDB shell version: 1.8.0
connecting to: test
> show collections
schedule_drilldown
schedule_report
system.indexes
> db.schedule_report.stats()
{
    "ns" : "test.schedule_report",
    "count" : 16749,
    "size" : 60743292,
    "avgObjSize" : 3626.681712341035,
    "storageSize" : 86614016,
    "numExtents" : 10,
    "nindexes" : 3,
    "lastExtentSize" : 23101696,
    "paddingFactor" : 1.4599999999953628,
    "flags" : 1,
    "totalIndexSize" : 2899968,
    "indexSizes" : {
        "_id_" : 835584,
        "WeekEnd_-1_Salon_1" : 925696,
        "WeekEnd_-1_AreaCode_1" : 1138688
    },
    "ok" : 1
}
So my test collection has 16,749 records in it, with an average size of about 3.6 KB ("avgObjSize") and a total data size of about 60 MB ("size"). However, it turns out they actually take up about 86 MB on disk ("storageSize") because of the padding factor. That padding factor has varied over time as the collection's documents have been updated, but if I inserted a new document right now, it would allocate 1.46 times as much space as the document needs ("paddingFactor") to avoid having to move things around if I change it later. To me that's a fair size/speed tradeoff.
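As a rough cross-check on those numbers: size * paddingFactor ≈ 60743292 * 1.46 ≈ 88.7 MB, which is in the same ballpark as the 86.6 MB storageSize. The two don't match exactly because existing documents were padded at whatever factor was current when they were written (the 1.46 only applies to new inserts), and storageSize also includes allocated-but-unused extent space.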