Newbie to MongoDB: about initial file sizes

Just wanted to understand.
I've just installed MongoDB to test it on Windows. For each DB it creates two files: dbname.0 and dbname.ns.
These DB files have a constant initial size (dbname.0 is 67MB and dbname.ns is 16MB).
Is this normal, and if so, why?
Thanks!

Yes, this is normal - these are the pre-allocated datafile and the namespace file.
dbname.0 is the pre-allocated initial datafile, which starts at 64MB (that is 67,108,864 bytes, which Windows displays as the 67MB you are seeing).
dbname.ns is for book-keeping; "ns" stands for namespace. The default 16MB .ns file supports approximately 24,000 namespaces (collections + indexes); see the --nssize parameter.
Whenever the data grows beyond the capacity of the last dbname.x file, MongoDB allocates a new data file at twice the size of the previous one, up to 2GB. Once the file size reaches 2GB, each successive file is also 2GB.
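To make the doubling rule concrete, here is a small mongo-shell (JavaScript) sketch; the helper function is made up for illustration and is not part of MongoDB:
// Hypothetical helper: sizes (in MB) of the first n MMAPv1 datafiles.
// The first file is 64MB, each subsequent file doubles, capped at 2GB.
function datafileSizesMB(n) {
    var sizes = [], size = 64;
    for (var i = 0; i < n; i++) {
        sizes.push(size);
        size = Math.min(size * 2, 2048);
    }
    return sizes;
}
datafileSizesMB(7); // [ 64, 128, 256, 512, 1024, 2048, 2048 ]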
See:
http://www.mongodb.org/display/DOCS/Using+a+Large+Number+of+Collections
http://www.mongodb.org/display/DOCS/Developer+FAQ
http://www.mongodb.org/display/DOCS/Excessive+Disk+Space
http://comments.gmane.org/gmane.comp.db.mongodb.user/49819
Also:
How many collections are possible in a MongoDB without losing performance?

Related

Mongodb data files become smaller after migration

On my first server I get:
root#prod ~ # du -hs /var/lib/mongodb/
909G /var/lib/mongodb/
After migrating this database with mongodump/mongorestore,
On my second server I get:
root#prod ~ # du -hs /var/lib/mongodb/
30G /var/lib/mongodb/
After I waited a few hours and mongo finished indexing, I got:
root#prod ~ # du -hs /var/lib/mongodb/
54G /var/lib/mongodb/
I tested the database and there is no corrupted or missing data.
Why is there such a big difference in size before and after the migration?
MongoDB does not recover disk space when the actual data size drops, whether due to data deletion or other causes. There's a decent explanation in the online docs:
Why are the files in my data directory larger than the data in my database?
The data files in your data directory, which is the /data/db directory
in default configurations, might be larger than the data set inserted
into the database. Consider the following possible causes:
Preallocated data files.
In the data directory, MongoDB preallocates data files to a particular
size, in part to prevent file system fragmentation. MongoDB names the
first data file <dbname>.0, the next <dbname>.1, etc. The
first file mongod allocates is 64 megabytes, the next 128 megabytes,
and so on, up to 2 gigabytes, at which point all subsequent files are
2 gigabytes. The data files include files with allocated space but
that hold no data. mongod may allocate a 1 gigabyte data file that may
be 90% empty. For most larger databases, unused allocated space is
small compared to the database.
On Unix-like systems, mongod preallocates an additional data file and
initializes the disk space to 0. Preallocating data files in the
background prevents significant delays when a new database file is
next allocated.
You can disable preallocation by setting preallocDataFiles to false.
However do not disable preallocDataFiles for production environments:
only use preallocDataFiles for testing and with small data sets where
you frequently drop databases.
On Linux systems you can use hdparm to get an idea of how costly
allocation might be:
time hdparm --fallocate $((1024*1024)) testfile
The oplog.
If this mongod is a member of a replica set, the data directory
includes the oplog.rs file, which is a preallocated capped collection
in the local database. The default allocation is approximately 5% of
disk space on 64-bit installations, see Oplog Sizing for more
information. In most cases, you should not need to resize the oplog.
However, if you do, see Change the Size of the Oplog.
The journal.
The data directory contains the journal files, which store write
operations on disk prior to MongoDB applying them to databases. See
Journaling Mechanics.
Empty records.
MongoDB maintains lists of empty records in data files when deleting
documents and collections. MongoDB can reuse this space, but will
never return this space to the operating system.
To de-fragment allocated storage, use compact, which de-fragments
allocated space. By de-fragmenting storage, MongoDB can effectively
use the allocated space. compact requires up to 2 gigabytes of extra
disk space to run. Do not use compact if you are critically low on
disk space.
Important
compact only removes fragmentation from MongoDB data files and does
not return any disk space to the operating system.
To reclaim deleted space, use repairDatabase, which rebuilds the
database which de-fragments the storage and may release space to the
operating system. repairDatabase requires up to 2 gigabytes of extra
disk space to run. Do not use repairDatabase if you are critically low
on disk space.
http://docs.mongodb.org/manual/faq/storage/
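In shell terms, the two commands that FAQ passage refers to are invoked like this (the collection name is just an example):
db.runCommand({ compact: "mycollection" })  // defragments one collection in place
db.repairDatabase()  // rebuilds the current database; may release space to the OS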
What they don't tell you are the two other ways to restore/recover disk space - mongodump/mongorestore as you did, or adding a new member to the replica set with an empty disk so that it writes its database files from scratch.
If you are interested in monitoring this, the db.stats() command returns a wealth of data on data, index, storage and file sizes:
http://docs.mongodb.org/manual/reference/command/dbStats/
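A minimal check from the mongo shell, assuming a database named mydb (the name is just a placeholder); the interesting fields are dataSize, storageSize and fileSize:
use mydb
db.stats()
// dataSize    - bytes of actual data (including per-record padding)
// storageSize - bytes allocated to collection extents (>= dataSize)
// fileSize    - total size of the datafiles on disk (>= storageSize)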
Over time the MongoDB files develop fragmentation. When you do a "migration", or whack the data directory and force a re-sync, the files pack down. If your application does a lot of deletes, or updates which grow the documents, fragmentation develops fairly quickly. In our deployment it is the updates that grow documents that cause this. MongoDB moves a document when it sees that the updated version can't fit in the space of the original. There is a way to add a padding factor to the collection to avoid this.
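You can also watch how much padding a collection has accumulated: on the MMAPv1-era storage engine, collection stats expose a paddingFactor field (the collection name below is hypothetical):
db.mycollection.stats().paddingFactor  // 1.0 = no padding, up to 2.0 = 100% padding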

GridFS disk management

In my environments I can have a DB of 5-10 GB or a DB of 10 TB (video recordings).
Focusing on the 5-10 GB case: if I keep the default settings for prealloc and small-files, I can actually lose 20-40% of the disk space to allocations.
In my production environments the disk size can be 512G, but the user can limit the DB allocation to only 10G.
To implement this, I have a scheduled task that deletes the old documents from the DB when the DB dataSize reaches a certain threshold.
I can't use capped collections (GridFS, sharding limitations, cannot delete random documents...), and I can't use the --noprealloc/--smallfiles flags, because I need file inserts to be efficient.
So what happens is this: if dataSize gets to 10G, the fileSize will be at least 12G, so I need to take that into consideration and lower the threshold by 2GB (and lose a lot of disk space).
What I do want is to tell Mongo to pre-allocate all 10 GB the user requested, and to disable further pre-allocation.
For example, running mongod with --noprealloc and --smallfiles, but pre-allocating all 10 GB in advance.
Another protection I gain here is protecting the user against sudden disk-full errors. If he regularly downloads Game of Thrones episodes to the same drive, he can't take space away from the DB's 10G, since it's already pre-allocated.
(using C# driver)
I think I found a solution: you might want to look at the --quota and --quotaFiles command line options. In your case, you also might want to add the --smallfiles option. So
mongod --smallfiles --quota --quotaFiles 11
should give you a size of exactly 10224 MB for your data, which, adding the default namespace file size of 16MB, equals your target size of 10GB, excluding indices.
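To double-check that mongod actually picked those options up, the shell can show the parsed startup flags:
db.runCommand({ getCmdLineOpts: 1 })
// the "parsed" sub-document of the result should list quota, quotaFiles and smallfiles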
The following applies to regular collections as per the documentation. But since metadata can be attached to files, it might very well apply to GridFS as well.
MongoDB uses what is called a record to store data. A record consists of two parts: the actual data and something called "padding". The padding is basically unused space that is reserved in case the document grows in size. The reason for this is that a document, or a file chunk in GridFS respectively, never gets fragmented, to enhance query performance. If a document or file chunk outgrew its record, it would have to be moved to a different location in the datafile(s) every time it is modified, which can be a very costly operation in terms of IO and time. With the default settings, when the document or file chunk grows, the padding is consumed instead of the record being moved, reducing the need to shuffle data around in the data file and thereby improving performance. Only if the growth of the data exceeds the preallocated padding is the document or file chunk moved within the datafile(s).
The default strategy for preallocating padding space is "usePowerOf2Sizes", which determines the padding by taking the document size and using the next power of two as the size preallocated for the document. Say we have a 47-byte document: the usePowerOf2Sizes strategy would preallocate 64 bytes for it, resulting in 17 bytes of padding.
There is another preallocation strategy, however, called "exactFit". It determines the padding space by multiplying the document size by a dynamically computed "paddingFactor". As far as I understand, the padding factor is determined by the average document growth in the respective collection. Since we are talking about static files in your case, the padding factor should always be 0, and because of this, there should not be any "lost" space any more.
So I think a possible solution would be to change the allocation strategy for both the files and the chunks collection to exactFit. Could you try that and share your findings with us?
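If you want to try it, on MongoDB 2.2+ the allocation strategy can be toggled per collection with collMod; fs.files and fs.chunks are the default GridFS collections, and whether this actually gives you an exact fit in practice is something you would need to measure:
db.runCommand({ collMod: "fs.files", usePowerOf2Sizes: false })
db.runCommand({ collMod: "fs.chunks", usePowerOf2Sizes: false })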

db.repairDatabase() did not reduce size of database

I have a db that is about 8G.
I copied the DB to generate a copy.
Then I pruned the copy using the JS console.
Then I ran repairDatabase, and the copy is still the exact same size as the original.
In all likelihood, this means that you did not free up enough space to return an entire extent or file to the OS. Imagine that you have five 2GB files (MongoDB preallocates files in 2GB increments after the first few smaller files) and now imagine that you had 8GB of data in this DB. The last file will always be empty because MongoDB preallocates a file before it needs it. So the 8GB occupies four 2GB files, and one 2GB file is empty.
Now you do some pruning - maybe even 1.8GB worth of deletions. You run repairDatabase, which rewrites every single record as compactly as possible into a new set of database files. Except it still needs the same five 2GB files, because the fourth file has 100MB of data and the last file always has to be empty.
You can look at the output of db.stats() to see what the data size is compared to the storage size, but the fact is that these are relatively small numbers compared to the size of allocated files and that's likely why you are seeing what you are seeing.
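A quick way to see that gap from the shell (the arithmetic is rough; storageSize excludes indexes, so see indexSize too):
var s = db.stats();
(s.storageSize - s.dataSize) / (1024 * 1024)  // MB of slack inside allocated extents
(s.fileSize - s.storageSize) / (1024 * 1024)  // MB on disk beyond collection extents (indexes plus free space)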

Datafiles in /data/db are larger than the data set inserted in MongoDB

I was wondering why, for a given set of data, the MongoDB datafiles in /data/db are larger than the data set inserted into the database. Why does this happen in MongoDB? Can someone clarify this for me?
MongoDB preallocates disk space, so the actual storage on disk will always be larger than the data set.
Each datafile is preallocated to a particular size. (This is done to prevent file system fragmentation, among other reasons.) The first filename for a database is <dbname>.0, then <dbname>.1, etc. <dbname>.0 will be 64MB, <dbname>.1 128MB, et cetera, up to 2GB. Once the files reach 2GB in size, each successive file is also 2GB.
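For test environments where this preallocation gets in the way, it can be dialed down at startup - at a performance cost, so not something to do in production (the dbpath is just an example):
mongod --smallfiles --noprealloc --dbpath /data/db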

Compact command not freeing up space in MongoDB 2.0

I just installed MongoDB 2.0 and tried to run the compact command instead of the repair command used in earlier versions. My database is empty at the moment, meaning there is only one collection with 0 entries and the two system collections (indexes, users). Currently the db takes about 4 GB of space on the hard disk. The db is used as a temp queue, with all items being removed after they have been processed.
I tried to run the following in the mongo shell.
use mydb
db.theOnlyCollection.runCommand("compact")
It returns with
ok: 1
But the same space is still taken up on the hard disk. I tried to compact the system collections as well, but this did not work.
When I run the normal repair command
db.repairDatabase()
the database is compacted and only takes 400 MB.
Does anyone have an idea why the compact command is not working?
Thanks a lot for your help.
Best
Alex
Collection compaction is not supposed to decrease the size of data files. Its main point is to defragment collection and index data - combining unused space gaps into contiguous space, allowing new data to be stored there. Moreover, it may actually increase the size of your data files:
Compaction may increase the total size of your data files by up to 2GB. Even in this case, total collection storage space will decrease.
http://www.mongodb.org/display/DOCS/compact+Command#compactCommand-Effectsofacompaction