Does MongoDB store documents in 4MB chunks?

I read that MongoDB documents are limited to 4 MB in size. I also read that when you insert a document, MongoDB puts some padding in so that if you add something to the document, the entire document doesn't have to be moved and reindexed.
So I was wondering, does it store documents in 4MB chunks on disk?
Thanks

As of 1.8, individual documents are limited to 16 MB in size (previously 4 MB). This is an arbitrary limit, imposed because when you read a document off disk, the whole document is read into RAM. I think the intention is that this limit safeguards memory and makes you think about your schema design.
Data is stored across multiple data files on disk. I forget the initial file size, but every time the database grows, a new, larger file is created to expand into, until a single file reaches 2 GB. From that point on, if the database continues to grow, additional 2 GB data files are created for documents to be inserted into.
"chunks" has a meaning in the sharding aspect of MongoDB. Whereby documents are stored in "chunks" of a configurable size and when balancing needs to be done, it's these chunks of data (n documents) that are moved around.

The simple answer is "no." The actual space a document takes up in Mongo's files is variable, but it isn't the maximum document size. The DB engine watches to see how much your documents tend to change after insertion and calculates the padding factor based on that. So it changes all the time.
If you're curious, you can see the actual padding factor and storage space of your data using the .stats() function on a collection in the mongo shell. Here's a real-world example (with some names changed to protect the innocent clients):
{14:42} ~/my_directory ➭ mongo
MongoDB shell version: 1.8.0
connecting to: test
> show collections
schedule_drilldown
schedule_report
system.indexes
> db.schedule_report.stats()
{
    "ns" : "test.schedule_report",
    "count" : 16749,
    "size" : 60743292,
    "avgObjSize" : 3626.681712341035,
    "storageSize" : 86614016,
    "numExtents" : 10,
    "nindexes" : 3,
    "lastExtentSize" : 23101696,
    "paddingFactor" : 1.4599999999953628,
    "flags" : 1,
    "totalIndexSize" : 2899968,
    "indexSizes" : {
        "_id_" : 835584,
        "WeekEnd_-1_Salon_1" : 925696,
        "WeekEnd_-1_AreaCode_1" : 1138688
    },
    "ok" : 1
}
So my test collection has about 16,749 records in it, with an average size of about 3.6 KB ("avgObjSize") and a total data size of about 60 MB ("size"). However, it turns out they actually take up about 86 MB on disk ("storageSize") because of the padding factor. That padding factor has varied over time as the collection's documents have been updated, but if I inserted a new document right now, it'd allocate 1.46 times as much space as the document needs ("paddingFactor") to avoid having to move things around if I change it later. To me that's a fair size/speed tradeoff.
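As a rough sanity check, the relationship between those fields can be reproduced with some toy arithmetic in plain JavaScript (the numbers are copied from the stats() output above; this ignores extent preallocation and free lists, so it only approximates the reported storageSize):

```javascript
// Numbers copied from the stats() output above.
const stats = {
  count: 16749,
  avgObjSize: 3626.68, // bytes
  paddingFactor: 1.46,
};

// Each record is allocated roughly avgObjSize * paddingFactor bytes.
const allocatedPerRecord = stats.avgObjSize * stats.paddingFactor;

// Total allocation estimate across all records.
const estimatedStorage = Math.round(stats.count * allocatedPerRecord);

console.log((estimatedStorage / 1e6).toFixed(1) + " MB"); // → "88.7 MB"
```

That lands within a few megabytes of the 86 MB storageSize reported above; the difference comes from preallocated extents and space freed by deletes.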

Related

How to know MongoDB collection size?

db.collection.stats()
Response:
"count" : 20696445,
"size" : NumberLong("1478263842661"),
"storageSize" : 334732324864,
"totalIndexSize" : 676327424,
"indexSizes" : {
    "_id_" : 377094144,
    "leadID_1" : 128049152,
    "leadID_hashed" : 171184128
},
"avgObjSize" : 71425.97884134208
My actual disk usage matches storageSize. So what do size and the other keys mean?
You haven't mentioned the version of MongoDB server you are using but given the size of your data is much larger than the storageSize on disk, I'm assuming you are using the WiredTiger storage engine which compresses data and indexes by default. The WiredTiger storage engine was first available as an option in the MongoDB 3.0 production series and became the default storage engine for new deployments in MongoDB 3.2+.
In your example output it looks like you have 1.4TB of uncompressed data which is currently occupying 334GB on disk (the storageSize value). Storage space used by indexes for this collection is reported separately under indexSizes and summed up as totalIndexSize.
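The effective compression ratio can be read straight off those two fields (plain JavaScript arithmetic over the numbers in the question; nothing MongoDB-specific):

```javascript
// Numbers from the stats output in the question.
const size = 1478263842661;        // uncompressed data size in bytes (~1.4 TB)
const storageSize = 334732324864;  // compressed size on disk (~334 GB)

// Effective ratio achieved by WiredTiger's block compression.
const ratio = size / storageSize;

console.log(ratio.toFixed(2) + "x compression"); // ≈ 4.42x
```

A ratio around 4x is plausible for the default snappy block compressor on repetitive document data.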
The output of collection.stats() will vary depending on your MongoDB server version and configured storage engine, but is generally described in the MongoDB manual as part of the output of the collStats command which is called by the db.collection.stats() shell helper.
Note: MongoDB documentation is versioned so you should always make sure you are referencing documentation that matches your release series of MongoDB (i.e. 3.2, 3.4, ...). Default documentation links will point to the current production release.
From the collStats documentation:
collStats.size
The total size in memory of all records in a collection. This value does not include the record header, which is 16 bytes per record, but does include the record’s padding. Additionally size does not include the size of any indexes associated with the collection, which the totalIndexSize field reports.
The scale argument affects this value.
collStats.storageSize
The total amount of storage allocated to this collection for document storage. The scale argument affects this value.
storageSize does not include index size. See totalIndexSize for index sizing.

Resize storagesize of Mongodb

I would like to know whether it makes sense to resize the storageSize of MongoDB.
I noticed that my size is larger than my storageSize. Could that decrease my performance when I retrieve data?
"count" : 9622,
"size" : 9329997,
"avgObjSize" : 969,
"storageSize" : 3198976,
"capped" : false
If it is necessary, how can I resize the storageSize?
No. Per the docs (why-are-the-files-in-my-data-directory-larger-than-the-data-in-my-database), it is NOT necessary to resize the storageSize, because MongoDB preallocates data and journal files:
The data files in your data directory, which is the /data/db directory in default configurations, might be larger than the data set inserted into the database. Consider the following possible causes:
Preallocated data files
MongoDB preallocates its data files to avoid filesystem fragmentation, and because of this, the size of these files does not necessarily reflect the size of your data.
The storage.mmapv1.smallFiles option will reduce the size of these files, which may be useful if you have many small databases on disk.
The oplog
If this mongod is a member of a replica set, the data directory includes the oplog.rs file, which is a preallocated capped collection in the local database.
The default allocation is approximately 5% of disk space on 64-bit installations. In most cases, you should not need to resize the oplog.
The journal
The data directory contains the journal files, which store write operations on disk before MongoDB applies them to databases. See Journaling.
Empty records
MongoDB maintains lists of empty records in data files as it deletes documents and collections. MongoDB can reuse this space, but will not, by default, return this space to the operating system.
Also, here is a good blog post: how-big-is-your-mongodb.

why my mongodb fileSize is much bigger than storageSize in db.stats()?

I have a db named log_test1 with only one capped collection, logs. The max size of the capped collection is 512 MB. After I inserted 200k documents, I found the disk usage of the db was 1.6 GB. With db.stats() I can see the storageSize is 512 MB, which is correct, but the actual fileSize is 1.6 GB. Why did this happen? How can I keep the disk usage to just my capped collection size plus index size?
> use log_test1
switched to db log_test1
> db.stats()
{
    "db" : "log_test1",
    "collections" : 3,
    "objects" : 200018,
    "avgObjSize" : 615.8577328040476,
    "dataSize" : 123182632,
    "storageSize" : 512008192,
    "numExtents" : 3,
    "indexes" : 8,
    "indexSize" : 71907920,
    "fileSize" : 1610612736,
    "nsSizeMB" : 16,
    "dataFileVersion" : {
        "major" : 4,
        "minor" : 5
    },
    "ok" : 1
}
This is probably because MongoDB preallocates data and journal files.
MongoDB 2
In the data directory, MongoDB preallocates data files to a particular size, in part to prevent file system fragmentation. MongoDB names the first data file <databasename>.0, the next <databasename>.1, etc. The first file mongod allocates is 64 megabytes, the next 128 megabytes, and so on, up to 2 gigabytes, at which point all subsequent files are 2 gigabytes. The data files include files with allocated space but that hold no data. mongod may allocate a 1 gigabyte data file that may be 90% empty. For most larger databases, unused allocated space is small compared to the database.
On Unix-like systems, mongod preallocates an additional data file and initializes the disk space to 0. Preallocating data files in the background prevents significant delays when a new database file is next allocated.
You can disable preallocation with the noprealloc run time option. However noprealloc is not intended for use in production environments: only use noprealloc for testing and with small data sets where you frequently drop databases.
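The doubling schedule described above can be sketched as a toy model (plain JavaScript, not MongoDB source; the 64 MB starting size and 2 GB cap are the defaults quoted in the FAQ text):

```javascript
// Returns the sizes (in MB) of the first n MMAPv1 data files for one
// database: doubling from 64 MB, capped at 2 GB (2048 MB).
function dataFileSizes(n) {
  const sizes = [];
  let size = 64;
  for (let i = 0; i < n; i++) {
    sizes.push(size);
    size = Math.min(size * 2, 2048);
  }
  return sizes;
}

console.log(dataFileSizes(7)); // [64, 128, 256, 512, 1024, 2048, 2048]
```

So a database holding only a few hundred megabytes of documents can still have well over a gigabyte of files allocated, especially with the extra preallocated file on Unix-like systems.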
MongoDB 3
The data files in your data directory, which is the /data/db directory in default configurations, might be larger than the data set inserted into the database. Consider the following possible causes:
Preallocated data files
MongoDB preallocates its data files to avoid filesystem fragmentation, and because of this, the size of these files does not necessarily reflect the size of your data.
The storage.mmapv1.smallFiles option will reduce the size of these files, which may be useful if you have many small databases on disk.
The oplog
If this mongod is a member of a replica set, the data directory includes the oplog.rs file, which is a preallocated capped collection in the local database.
The default allocation is approximately 5% of disk space on 64-bit installations.
The journal
The data directory contains the journal files, which store write operations on disk before MongoDB applies them to databases.
Empty records
MongoDB maintains lists of empty records in data files as it deletes documents and collections. MongoDB can reuse this space, but will not, by default, return this space to the operating system.
Taken from MongoDB Storage FAQ.

Is it 100 million documents too much?

Well, I am new to Mongo, and this morning I had a (bad) idea. I was playing around with indexes from the shell and decided to create a large collection with many documents (100 million). So I executed the following command:
for (i = 1; i <= 100; i++) {
    for (j = 100; j > 0; j--) {
        for (k = 1; k <= 100; k++) {
            for (l = 100; l > 0; l--) {
                db.testIndexes.insert({a: i, b: j, c: k, d: l})
            }
        }
    }
}
However, things didn't go as I expected:
It took 45 minutes to complete the request.
It created 16 GB of data on my hard disk.
It used 80% of my RAM (8 GB total) and wouldn't release it until I restarted my PC.
As you can see in the photo below, as the number of documents in the collection grew, the insertion time grew as well. I surmise that from the last modification times of the data files:
Is this an expected behavior? I don't think that 100 million simple documents are too much.
P.S. I am now really afraid to run an ensureIndex command.
Edit:
I executed the following command:
> db.testIndexes.stats()
{
    "ns" : "test.testIndexes",
    "count" : 100000000,
    "size" : 7200000056,
    "avgObjSize" : 72.00000056,
    "storageSize" : 10830266336,
    "numExtents" : 28,
    "nindexes" : 1,
    "lastExtentSize" : 2146426864,
    "paddingFactor" : 1,
    "systemFlags" : 1,
    "userFlags" : 0,
    "totalIndexSize" : 3248014112,
    "indexSizes" : {
        "_id_" : 3248014112
    },
    "ok" : 1
}
So, the default index on _id has more than 3GB size.
It took 45 minutes to complete the request.
Not surprised.
It created 16 GB of data on my hard disk.
As @Abhishek states, everything seems fine; MongoDB does use a fair amount of space without compression currently (that's hopefully coming later).
The data size is about 7.2 GB while the average object size is 72 bytes, so that is consistent (72 bytes × 100 million documents = 7.2 GB), and with the 3 GB overhead of the _id index, the storage size of about 10 GB fits quite well.
Though I am concerned that it has used 6 GB more on disk than the statistics say it needs; that might need more looking into. I am guessing it is down to how MongoDB wrote to the data files, or possibly that you were using a fire-and-forget write concern (w=0); all in all, hmmm.
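The back-of-the-envelope arithmetic above can be checked directly (plain JavaScript over the stats output in the question):

```javascript
// Numbers from the testIndexes stats output.
const count = 100000000;  // documents
const avgObjSize = 72;    // bytes per document

// Raw data size: 72 bytes per document over 100 million documents.
const dataSize = count * avgObjSize;

console.log(dataSize / 1e9 + " GB"); // prints "7.2 GB", matching "size"
```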
It used 80% of my RAM (8 GB total) and wouldn't release it until I restarted my PC.
MongoDB will try and take as much RAM as the OS will let it. If the OS lets it take 80% then 80% it will take. This is actually a good sign, it shows that MongoDB has the right configuration values to store your working set efficiently.
When running ensureIndex, mongod will never free up RAM itself; it simply has no hooks for that. Instead, the OS will shrink mongod's allocated memory to make room for other processes (or at least it should).
This is expected behavior: mongod data files start at 16 MB (test.0) and grow until they reach 2 GB, after which each additional file is a constant 2 GB.
100 million documents (16 GB) is nothing.
You can run ensureIndex; it shouldn't take much time.
You don't need to restart your PC; the moment another process needs RAM, mongod will free it.
FYI: test.12 is completely empty.
I am guessing you are not worried about a 16 GB size just for 100 million documents?

MongoDB records retrieval very slow using C# API

I am trying to retrieve 100,000 documents from MongoDB as shown below, and it is taking a very long time to return the collection.
var query = Query.EQ("Status", "E");
var items = collection.Find(query).SetLimit(100000).ToList();
Or
var query = Query.GT("_id", idValue);
var items = collection.Find(query).SetLimit(100000).ToList();
Explain:
{
    "cursor" : "BtreeCursor _id_",
    "nscanned" : 1,
    "nscannedObjects" : 1,
    "n" : 1,
    "millis" : 0,
    "nYields" : 0,
    "nChunkSkips" : 0,
    "isMultiKey" : false,
    "indexOnly" : false,
    "indexBounds" : {
        "_id" : [[ObjectId("4f79a64eca98b5fc0e5ae35a"),
                  ObjectId("4f79a64eca98b5fc0e5ae35a")]]
    }
}
Any suggestions to improve query performance? My collection has 2 million documents.
-Venkat
This question was also asked on Google Groups:
https://groups.google.com/forum/?fromgroups#!topicsearchin/mongodb-user/100000/mongodb-user/a6FHFp5aOnA
As I responded on the Google Groups question I tried to reproduce this and was unable to observe any slowness. I was able to read 100,000 documents in 2-3 seconds, depending on whether the documents were near the beginning or near the end of the collection (because I didn't create an index).
My answer to the Google groups question has more details and a link to the test program I used to try and reproduce this.
Given the information you have provided my best guess is that your document size is too large and the delay is not necessarily on the mongo server but on the transmission of the result set back to your app machine. Take a look at your avg document size in the collection, do you have large embedded arrays for example?
Compare the response time when selecting only one field using the .SetFields method (see example here How to retrieve a subset of fields using the C# MongoDB driver?). If the response time is significantly faster then you know that this is the issue.
Have you defined indices?
http://www.mongodb.org/display/DOCS/Indexes
There are several things to check:
Is your query correctly indexed?
If your query is indexed, what are the odds that the data itself is in memory? If you have 20GB of data and 4GB of RAM, then most of your data is not in memory which means that your disks are doing a lot of work.
How much data does 100k documents represent? If your documents are really big they could be sucking up all of the available disk IO or possibly the network? Do you have enough space to store this in RAM on the client?
You can check for disk usage using iostat (a common linux tool) or perfmon (under Windows). If you run these while your query is running, you should get some idea about what's happening with your disks.
Otherwise, you will have to do some reasoning about how much data is moving around here. In general, queries that return 100k objects are not intended to be really fast (not in MongoDB or in SQL). That's more data than humans typically consume in one screen, so you may want to make smaller batches and read 10k objects 10 times instead of 100k objects once.
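One way to sketch that batching idea (a toy, driver-agnostic JavaScript model; fetchBatch is a hypothetical stand-in for a driver query with a limit and an _id resume point, not a real API):

```javascript
// Reads up to `total` documents in batches of `batchSize`, resuming
// each batch from the last _id seen, instead of one huge 100k fetch.
function readInBatches(fetchBatch, batchSize, total) {
  const results = [];
  let lastId = null;
  while (results.length < total) {
    const want = Math.min(batchSize, total - results.length);
    const batch = fetchBatch(lastId, want);
    if (batch.length === 0) break; // collection exhausted
    lastId = batch[batch.length - 1]._id;
    results.push(...batch);
  }
  return results;
}

// Fake in-memory "collection" standing in for the server.
const docs = Array.from({ length: 25 }, (_, i) => ({ _id: i, status: "E" }));
const fetchBatch = (lastId, limit) =>
  docs.filter((d) => lastId === null || d._id > lastId).slice(0, limit);

console.log(readInBatches(fetchBatch, 10, 25).length); // → 25
```

With a real driver, the resume-from-last-_id pattern keeps each round trip small and lets the client process results incrementally instead of buffering 100k documents at once.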
If you don't create indexes for your collection, MongoDB will do a full collection scan; this is the slowest possible method.
You can run explain() on your query. Explain will tell you which indexes (if any) are used for the query, the number of scanned documents, and the total query duration.
If your query hits all the indexes and its execution is still slow, then you probably have a problem with the size of the collection relative to RAM.
MongoDB is fastest when the collection data plus indexes fit in memory. If your collection size is larger than available RAM, the performance drop is very large.
You can check the size of your collection with totalSize(), totalIndexSize() or validate() (these are shell commands).