Well, I am new to mongo and this morning I had a (bad) idea. I was playing around with indexes from the shell and decided to create a large collection with many documents (100 million). So I executed the following command:
// 100 × 100 × 100 × 100 = 100 million inserts
for (var i = 1; i <= 100; i++) {
    for (var j = 100; j > 0; j--) {
        for (var k = 1; k <= 100; k++) {
            for (var l = 100; l > 0; l--) {
                db.testIndexes.insert({a: i, b: j, c: k, d: l})
            }
        }
    }
}
However, the things didn't go as I expected:
It took 45 minutes to complete the request.
It created 16 GB of data on my hard disk.
It used 80% of my RAM (8 GB total), and it wouldn't release it until I restarted my PC.
As you can see in the photo below, the time to insert documents grew as the number of documents in the collection grew. I infer that from the last modification times of the data files:
Is this expected behavior? I don't think that 100 million simple documents are too much.
P.S. I am now really afraid to run an ensureIndex command.
Edit:
I executed the following command:
> db.testIndexes.stats()
{
"ns" : "test.testIndexes",
"count" : 100000000,
"size" : 7200000056,
"avgObjSize" : 72.00000056,
"storageSize" : 10830266336,
"numExtents" : 28,
"nindexes" : 1,
"lastExtentSize" : 2146426864,
"paddingFactor" : 1,
"systemFlags" : 1,
"userFlags" : 0,
"totalIndexSize" : 3248014112,
"indexSizes" : {
"_id_" : 3248014112
},
"ok" : 1
}
So, the default index on _id is more than 3 GB in size.
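That works out to about 32 bytes per entry (3,248,014,112 / 100,000,000 ≈ 32.5), which seems plausible for a B-tree entry on a 12-byte ObjectId once per-entry overhead is included.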
It took 45 minutes to complete the request.
Not surprised.
It created 16 GB of data on my hard disk.
As @Abhishek states, everything seems fine. MongoDB currently uses a fair amount of space without compression (that is hopefully coming later).
The data size is about 7.2 GB with an average object size of 72 bytes (100 million × 72 bytes = 7.2 GB, so the numbers check out). With the ~3 GB overhead of the _id index, the storage size of roughly 10 GB fits quite well.
Though I am concerned that it has used 6 GB more on disk than the statistics say it needs, and that might need more looking into. I am guessing it comes down to how MongoDB wrote to the data files; it might even be because you were using a fire-and-forget write concern rather than an acknowledged one (w > 0). All in all: hmmm.
It used 80% of my RAM (8 GB total), and it wouldn't release it until I restarted my PC.
MongoDB will try to take as much RAM as the OS will let it. If the OS lets it take 80%, then 80% it will take. This is actually a good sign; it shows that MongoDB has the right configuration values to store your working set efficiently.
When running ensureIndex, mongod will never free up RAM by itself. It simply has no hooks for that; instead, the OS will shrink mongod's allocated memory to make room for other processes (or rather, it should).
This is expected behavior. MongoDB data files start small (test.0 here is 16 MB) and grow by doubling until they reach 2 GB; every file after that is a constant 2 GB.
100 million documents (16 GB) is nothing.
You can run ensureIndex; it shouldn't take much time.
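A minimal sketch, assuming you want an index on field a (background: true keeps the database available during the build, at the cost of a slower build):
// builds an ascending index on "a"; a one-off cost for 100M documents
db.testIndexes.ensureIndex({ a: 1 }, { background: true })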
You don't need to restart your PC; the moment another process needs RAM, the OS will reclaim it from mongod.
FYI: test.12 is completely empty.
I am guessing you are not worried about 16 GB for just 100 million documents?
Related
I have a db named log_test1 with only one capped collection, logs. The max size of the capped collection is 512 MB. After I inserted 200k records, I found the disk usage of the db is 1.6 GB. With db.stats() I can see the storageSize is 512 MB, which is correct, but the actual fileSize is 1.6 GB. Why did this happen? How can I keep the disk usage to just my capped collection size plus index size?
> use log_test1
switched to db log_test1
> db.stats()
{
"db" : "log_test1",
"collections" : 3,
"objects" : 200018,
"avgObjSize" : 615.8577328040476,
"dataSize" : 123182632,
"storageSize" : 512008192,
"numExtents" : 3,
"indexes" : 8,
"indexSize" : 71907920,
"fileSize" : 1610612736,
"nsSizeMB" : 16,
"dataFileVersion" : {
"major" : 4,
"minor" : 5
},
"ok" : 1
}
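For reference, a 512 MB capped collection like this one is created with something like the following (size is in bytes):
> db.createCollection("logs", { capped: true, size: 512 * 1024 * 1024 })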
This is probably because MongoDB preallocates data and journal files.
MongoDB 2
In the data directory, MongoDB preallocates data files to a particular size, in part to prevent file system fragmentation. MongoDB names the first data file <databasename>.0, the next <databasename>.1, etc. The first file mongod allocates is 64 megabytes, the next 128 megabytes, and so on, up to 2 gigabytes, at which point all subsequent files are 2 gigabytes. The data files include files with allocated space but that hold no data. mongod may allocate a 1 gigabyte data file that may be 90% empty. For most larger databases, unused allocated space is small compared to the database.
On Unix-like systems, mongod preallocates an additional data file and initializes the disk space to 0. Preallocating data files in the background prevents significant delays when a new database file is next allocated.
You can disable preallocation with the noprealloc run time option. However noprealloc is not intended for use in production environments: only use noprealloc for testing and with small data sets where you frequently drop databases.
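For a throwaway test instance, those options go on the mongod command line. A sketch, assuming the default dbpath (--smallfiles additionally makes data files start smaller and cap out at a smaller maximum):
mongod --dbpath /data/db --noprealloc --smallfiles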
MongoDB 3
The data files in your data directory, which is the /data/db directory in default configurations, might be larger than the data set inserted into the database. Consider the following possible causes:
Preallocated data files
MongoDB preallocates its data files to avoid filesystem fragmentation, and because of this, the size of these files does not necessarily reflect the size of your data.
The storage.mmapv1.smallFiles option will reduce the size of these files, which may be useful if you have many small databases on disk.
The oplog
If this mongod is a member of a replica set, the data directory includes the oplog.rs file, which is a preallocated capped collection in the local database. The default allocation is approximately 5% of disk space on 64-bit installations.
The journal
The data directory contains the journal files, which store write operations on disk before MongoDB applies them to databases.
Empty records
MongoDB maintains lists of empty records in data files as it deletes documents and collections. MongoDB can reuse this space, but will not, by default, return this space to the operating system.
Taken from MongoDB Storage FAQ.
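To make those numbers easier to compare, note that db.stats() accepts a scale argument; for example, to report in megabytes:
> db.stats(1024 * 1024)
In your output, fileSize (1.6 GB) counts whole preallocated data files, while storageSize (512 MB) counts only the extents actually allocated to your collections, which is why the two differ.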
We have been running MongoDB as a single unsharded instance with just one database. The size of the data files was 0.45 GB. When I looked at the storageSize of all the collections, the total was ~85 MB. In a bid to reclaim unused space, we ran repairDatabase(), with the understanding that file sizes grow from 64 to 128 to 256 MB and so on up to 2 GB. Since the mongo object data we have (85 MB) fits in the 64 MB + 128 MB files, we expected the 256 MB file to be reclaimed. However, to our surprise, no space was reclaimed.
Can someone explain the logic that determines how much space will be reclaimed? Essentially, given the total disk space a database takes and the total mongo object data size, can one accurately estimate how much space would be reclaimed?
The following is the db.stats() output as requested in a comment:
> db.stats()
{
"db" : "analytics_data_1",
"collections" : 12,
"objects" : 207223,
"avgObjSize" : 353.6659347659285,
"dataSize" : 73287716,
"storageSize" : 84250624,
"numExtents" : 43,
"indexes" : 26,
"indexSize" : 21560112,
"fileSize" : 469762048,
"nsSizeMB" : 16,
"dataFileVersion" : {
"major" : 4,
"minor" : 5
},
"ok" : 1
}
>
The storage FAQ explains that an extra file is always pre-allocated and as soon as you start writing to it, mongod will preallocate the next file.
Repair won't reclaim any space that would normally exist - it can only help if you've deleted a lot of data or dropped some collections.
Disabling preallocation can save you space, but it will cost you in performance: each file will be allocated only at the moment it is actually needed for writes, and that will slow down inserts.
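For completeness, repair is invoked from the shell as below; it rewrites all data files for the current database, needs free disk space roughly equal to the data set size plus 2 GB, and blocks the database while it runs:
db.repairDatabase()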
With a 11 GB working set (db.records.totalSize()), I ran the touch command in order to get Mongo to use as much memory as possible on my 16-GB RAM box. Before running touch, the serverStatus command showed that Mongo's mem.resident equaled 5800 (roughly 6 GB RAM).
db.runCommand({ touch: "records", data: true, index: true })
{ "ok" : 1 }
But, after running touch, Mongo's using roughly the same amount of RAM.
"mem" : {
"bits" : 64,
"resident" : 5821, /* only a 21 MB increase */
"virtual" : 29010,
"supported" : true,
"mapped" : 14362,
"mappedWithJournal" : 28724
},
Why did the touch command hardly increase how much RAM Mongo uses (mem.resident)?
The way the MongoDB db.serverStatus() command reports resident memory is by counting how many pages of physical RAM have actually been accessed by the mongod process.
This means that while your collection and indexes were read into RAM, they won't show up in the "res" value until you actually start querying against them.
You can verify that the data was read into RAM (if it was definitely cold before) simply by checking how much RAM the mongod process is holding (resident, not virtual, memory).
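From the shell, that comparison is available directly; resident is the physical RAM actually touched by mongod, while mapped is the total size of the memory-mapped data files:
db.serverStatus().mem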
I am trying to retrieve 100,000 documents from MongoDB as shown below, and it is taking a very long time to return the collection.
var query = Query.EQ("Status", "E");
var items = collection.Find(query).SetLimit(100000).ToList();
Or
var query = Query.GT("_id", idValue);
var items = collection.Find(query).SetLimit(100000).ToList();
Explain:
{
"cursor" : "BtreeCursor _id_",
"nscanned" : 1,
"nscannedObjects" :1,
"n" : 1,
"millis" : 0,
"nYields" : 0,
"nChunkSkips" : 0,
"isMultiKey" : false,
"indexOnly" : false,
"indexBounds" :
{
"_id" :[[ObjectId("4f79a64eca98b5fc0e5ae35a"),
ObjectId("4f79a64eca98b5fc0e5ae35a")]]
}
}
Any suggestions to improve query performance? My collection has 2 million documents.
This question was also asked on Google Groups:
https://groups.google.com/forum/?fromgroups#!topicsearchin/mongodb-user/100000/mongodb-user/a6FHFp5aOnA
As I responded on the Google Groups question I tried to reproduce this and was unable to observe any slowness. I was able to read 100,000 documents in 2-3 seconds, depending on whether the documents were near the beginning or near the end of the collection (because I didn't create an index).
My answer to the Google Groups question has more details and a link to the test program I used to try to reproduce this.
Given the information you have provided, my best guess is that your document size is too large and the delay is not necessarily on the mongo server but in the transmission of the result set back to your app machine. Take a look at the average document size in the collection; do you have large embedded arrays, for example?
Compare the response time when selecting only one field using the .SetFields method (see the example here: How to retrieve a subset of fields using the C# MongoDB driver?). If the response time is significantly faster, then you know that this is the issue.
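In shell terms, that is equivalent to passing a projection document (yourCollection stands in for the actual collection name):
// return only the Status field of each matching document
db.yourCollection.find({ Status: "E" }, { Status: 1 }).limit(100000)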
Have you defined indices?
http://www.mongodb.org/display/DOCS/Indexes
There are several things to check:
Is your query correctly indexed?
If your query is indexed, what are the odds that the data itself is in memory? If you have 20 GB of data and 4 GB of RAM, then most of your data is not in memory, which means that your disks are doing a lot of work.
How much data do 100k documents represent? If your documents are really big, they could be sucking up all of the available disk IO or possibly the network. Do you have enough space to store this in RAM on the client?
You can check for disk usage using iostat (a common linux tool) or perfmon (under Windows). If you run these while your query is running, you should get some idea about what's happening with your disks.
Otherwise, you will have to do some reasoning about how much data is moving around here. In general, queries that return 100k objects are not intended to be really fast (not in MongoDB or in SQL). That's more data than humans typically consume in one screen, so you may want to make smaller batches and read 10k objects 10 times instead of 100k objects once.
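A minimal shell sketch of that batching approach, assuming you walk the collection in ascending _id order (yourCollection and the Status filter are placeholders taken from the question):
var last = null;
while (true) {
    // fetch the next batch of 10k, ordered by _id so the scan can resume
    var q = (last === null) ? { Status: "E" } : { Status: "E", _id: { $gt: last } };
    var batch = db.yourCollection.find(q).sort({ _id: 1 }).limit(10000).toArray();
    if (batch.length === 0) break;
    // ... process the batch here ...
    last = batch[batch.length - 1]._id;
}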
If you don't create indexes for your collection, MongoDB will do a full collection scan - this is the slowest possible method.
You can run explain() on your query. Explain will tell you which indexes (if any) were used, the number of scanned documents, and the total query duration.
If your query hits all the indexes and its execution is still slow, then you probably have a problem with the size of the collection relative to RAM.
MongoDB is fastest when the collection data plus indexes fit in memory. If your collection size is larger than the available RAM, the performance drop is very large.
You can check the size of your collection with totalSize(), totalIndexSize() or validate() (these are shell commands).
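From the shell, these look like the following (yourCollection is a placeholder):
db.yourCollection.totalSize()      // data plus index storage, in bytes
db.yourCollection.totalIndexSize() // index storage only
db.yourCollection.validate()       // detailed structural diagnostics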
I read that MongoDB documents are limited to 4 MB in size. I also read that when you insert a document, MongoDB puts some padding in so that if you add something to the document, the entire document doesn't have to be moved and reindexed.
So I was wondering, does it store documents in 4MB chunks on disk?
Thanks
As of 1.8, individual documents are limited to 16 MB in size (previously 4 MB). This is an arbitrary limit, imposed because when you read a document off disk, the whole document is read into RAM. So I think the intention is for this limit to safeguard memory and make you think about your schema design.
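If you are curious how close a given document is to that limit, the mongo shell can report a document's BSON size in bytes (yourCollection is a placeholder name):
Object.bsonsize(db.yourCollection.findOne())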
Data is then stored across multiple data files on disk - I forget the initial file size, but every time the database grows a new file is created to expand into, each new file bigger than the previous, until a single file size of 2 GB is reached. From that point on, if the database continues to grow, subsequent 2 GB data files are created for documents to be inserted into.
"Chunks" have a meaning in the sharding side of MongoDB, whereby documents are stored in "chunks" of a configurable size; when balancing needs to be done, it is these chunks of data (n documents) that are moved around.
The simple answer is "no." The actual space a document takes up in Mongo's data files is variable, but it isn't the maximum document size. The DB engine watches how much your documents tend to change after insertion and calculates the padding factor based on that, so it changes all the time.
If you're curious, you can see the actual padding factor and storage space of your data using the .stats() function on a collection in the mongo shell. Here's a real-world example (with some names changed to protect the innocent clients):
{14:42} ~/my_directory ➭ mongo
MongoDB shell version: 1.8.0
connecting to: test
> show collections
schedule_drilldown
schedule_report
system.indexes
> db.schedule_report.stats()
{
"ns" : "test.schedule_report",
"count" : 16749,
"size" : 60743292,
"avgObjSize" : 3626.681712341035,
"storageSize" : 86614016,
"numExtents" : 10,
"nindexes" : 3,
"lastExtentSize" : 23101696,
"paddingFactor" : 1.4599999999953628,
"flags" : 1,
"totalIndexSize" : 2899968,
"indexSizes" : {
"_id_" : 835584,
"WeekEnd_-1_Salon_1" : 925696,
"WeekEnd_-1_AreaCode_1" : 1138688
},
"ok" : 1
}
So my test collection has 16,749 records in it, with an average size of about 3.6 KB ("avgObjSize") and a total data size of about 60 MB ("size"). However, they actually take up about 86 MB on disk ("storageSize") because of the padding factor. That padding factor has varied over time as the collection's documents have been updated, but if I inserted a new document right now, it would allocate 1.46 times as much space as the document needs ("paddingFactor") to avoid having to move things around if I change it later. To me that's a fair size/speed tradeoff.
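Following that logic as a rough check: 60,743,292 bytes × 1.46 ≈ 88.7 MB, which is close to the reported storageSize of about 86.6 MB (allocation happens per extent, so the figures won't line up exactly).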