MongoDB collection size before/after dump

I have a question regarding MongoDB's collection size.
I did a small stress test in which my MongoDB server was constantly inserting, deleting, and updating data for about 48 hours. The documents were small: just a numerical value, a timestamp, and an ID.
Now, after those 48 hours, the collection used for inserting, deleting, and updating data was 98,000 bytes and the preallocated storage size was 696,320 bytes. The storage grew that much larger than the actual collection size because of one input spike during an insertion phase. Subsequent deletions shrank the actual collection size again, but the preallocated storage size didn't budge (AFAIK a common database management problem; MySQL, for example, behaves the same way).
After the stress test was completed, I created a dump of my MongoDB database and dropped the database completely, so I could import the dump again afterwards and see how the stats would look then. As I suspected, the collection size was still the same (98,000 bytes), but the preallocated storage size went down to 40,960 bytes (from the previous 696,320 bytes).
We want to try out MongoDB for an application that produces hundreds of MB of data (and hence I/O traffic) every day, so we need to keep the database and the space it occupies to a minimum, preferably without having to create a dump, drop the whole database, and import the dump again every now and then.
Now my question is: is there a way to trigger MongoDB's space reclamation (a "garbage collector" of sorts) programmatically? The application behind it is written in Java, and my idea was to trigger it after a certain amount of time or number of operations, or once the preallocated storage size has reached a certain threshold, roughly as in the sketch below.
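To make the idea concrete, here is a minimal Java sketch of what I have in mind, assuming MongoDB's compact command is the closest thing to such a garbage collector (the threshold, database, and collection names are made up):

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoDatabase;
    import org.bson.Document;

    public class StorageCompactor {
        // Hypothetical threshold: compact once storage exceeds actual data fourfold
        static final double BLOAT_FACTOR = 4.0;

        public static void main(String[] args) {
            try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
                MongoDatabase db = client.getDatabase("stresstest");
                // collStats reports both the actual data size and the allocated storage
                Document stats = db.runCommand(new Document("collStats", "measurements"));
                double size = ((Number) stats.get("size")).doubleValue();
                double storageSize = ((Number) stats.get("storageSize")).doubleValue();
                if (storageSize > size * BLOAT_FACTOR) {
                    // compact defragments the collection's extents in place;
                    // note that it blocks operations while it runs
                    db.runCommand(new Document("compact", "measurements"));
                }
            }
        }
    }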
Or maybe there's an even better (more elegant) way to minimize the occupied space?
Any help would be appreciated and I'll try to provide any further information if needed. Thanks in advance.

Related

Is there a way to release RAM occupied by mongodb indexes after dropping a collection?

The problem is that we have a huge dataset consisting of 50 million records, and almost all fields are indexed, which causes huge RAM consumption; after the collection is deleted, those resources are not released. I know this can be solved by restarting the server, but that solution is not applicable in our situation. So, my question: is there a way to release RAM resources without restarting the mongo server? The Mongo version is 4.4. Thanks in advance.
Not directly... MongoDB never makes memory free; it just replaces its contents or allocates more.
But if you start reading from disk the data you're going to need, that data will replace that part of memory.
The basic problem is that MongoDB will (eventually) use all the free memory that is available and try to keep all active data in memory. So reading data from disk makes that data "active" and changes the contents of the in-memory disk cache.
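You can watch this happening by polling the WiredTiger cache statistics from serverStatus. A minimal Java sketch (connection string assumed; the field names are from the documented wiredTiger.cache section):

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import org.bson.Document;

    public class CacheStats {
        public static void main(String[] args) {
            try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
                // serverStatus exposes the WiredTiger cache metrics in MongoDB 4.x
                Document status = client.getDatabase("admin")
                        .runCommand(new Document("serverStatus", 1));
                Document cache = (Document) ((Document) status.get("wiredTiger")).get("cache");
                System.out.println("bytes currently in cache: "
                        + cache.get("bytes currently in the cache"));
                System.out.println("maximum bytes configured: "
                        + cache.get("maximum bytes configured"));
            }
        }
    }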

GridFS disk management

In my environments I can have a DB of 5-10 GB or a DB of 10 TB (video recordings).
Focusing on the 5-10 GB case: if I keep the default settings for prealloc and small-files, I can actually lose 20-40% of the disk space to allocations.
In my production environments the disk size can be 512 GB, but the user can limit the DB allocation to only 10 GB.
To implement this, I have a scheduled task that deletes the oldest documents from the DB when the DB's dataSize reaches a certain threshold (see the sketch after this question).
I can't use capped collections (GridFS and sharding limitations, and I cannot delete arbitrary documents...), and I can't use the --noprealloc/--smallfiles flags, because I need file inserts to be efficient.
So what happens is this: if dataSize gets to 10 GB, fileSize will be at least 12 GB, so I need to take that into consideration and lower the threshold by 2 GB (and lose a lot of disk space).
What I do want is to tell mongo to pre-allocate all the 10 GB the user requested, and to disable any further pre-allocation.
For example, running mongod with --noprealloc and --smallfiles, but with all 10 GB pre-allocated in advance.
Another protection I gain here is protecting the user against sudden disk-full errors. If he regularly downloads Game of Thrones episodes to the same drive, he can't take space away from the DB's 10 GB, since it's already pre-allocated.
(using the C# driver)
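For reference, the scheduled cleanup task is conceptually something like the following, sketched here in Java for illustration even though we use the C# driver (the threshold, database name, and connection string are made up):

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoDatabase;
    import com.mongodb.client.gridfs.GridFSBucket;
    import com.mongodb.client.gridfs.GridFSBuckets;
    import com.mongodb.client.gridfs.model.GridFSFile;
    import com.mongodb.client.model.Sorts;
    import org.bson.Document;

    public class GridFsCleanup {
        // Hypothetical 10 GB limit chosen by the user
        static final long MAX_DATA_BYTES = 10L * 1024 * 1024 * 1024;

        public static void main(String[] args) {
            try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
                MongoDatabase db = client.getDatabase("recordings");
                GridFSBucket bucket = GridFSBuckets.create(db);

                // dbStats reports dataSize in bytes by default
                Document stats = db.runCommand(new Document("dbStats", 1));
                long dataSize = ((Number) stats.get("dataSize")).longValue();

                // Delete the oldest files until we are back under the threshold
                while (dataSize > MAX_DATA_BYTES) {
                    GridFSFile oldest = bucket.find()
                            .sort(Sorts.ascending("uploadDate"))
                            .first();
                    if (oldest == null) break;
                    bucket.delete(oldest.getObjectId());
                    dataSize -= oldest.getLength();
                }
            }
        }
    }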
I think I found a solution: you might want to look at the --quota and --quotaFiles command-line options. In your case, you might also want to add the --smallfiles option. So
mongod --smallfiles --quota --quotaFiles 11
should give you exactly 10,224 MB for your data which, adding the default namespace file size of 16 MB, equals your target size of 10 GB, excluding indices.
The following applies to regular collections as per the documentation. But since metadata can be attached to files, it might very well apply to GridFS as well.
MongoDB uses what is called a record to store data. A record consists of two parts: the actual data and something called "padding". The padding is essentially unused space that is consumed if the document grows in size. The reason for it is that a document (or a file chunk in GridFS, respectively) should never become fragmented, for the sake of query performance. Without padding, whenever a document or file chunk grew it would have to be moved to a different location in the data file(s), which can be a very costly operation in terms of I/O and time. With the default settings, when a document or file chunk grows it expands into its padding instead of being moved, which reduces the need to shuffle data around in the data files and thereby improves performance. Only if the growth exceeds the preallocated padding is the document or file chunk moved within the data file(s).
The default strategy for preallocating padding space is "usePowerOf2Sizes": it takes the document size and allocates the next power of two as the record size. Say we have a 47-byte document; the usePowerOf2Sizes strategy would preallocate 64 bytes for that document, resulting in 17 bytes of padding.
There is another preallocation strategy, however, called "exactFit". It determines the padding space by multiplying the document size by a dynamically computed "paddingFactor". As far as I understand, the padding factor is derived from the average document growth in the respective collection. Since we are talking about static files in your case, the padding factor should stay at 1 (i.e., no padding at all), and because of this there should not be any "lost" space any more.
So I think a possible solution would be to change the allocation strategy for both the files and the chunks collection to exactFit. Could you try that and share your findings with us?
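A minimal Java sketch of what that change could look like, assuming the default GridFS bucket so the collections are fs.files and fs.chunks (collMod and the usePowerOf2Sizes flag are from the 2.x era this applies to):

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoDatabase;
    import org.bson.Document;

    public class ExactFitAllocation {
        public static void main(String[] args) {
            try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
                MongoDatabase db = client.getDatabase("recordings"); // hypothetical DB name
                // Switch both GridFS collections away from power-of-two allocation
                // (an MMAPv1-era setting; it has no effect on modern storage engines)
                for (String coll : new String[]{"fs.files", "fs.chunks"}) {
                    db.runCommand(new Document("collMod", coll)
                            .append("usePowerOf2Sizes", false));
                }
            }
        }
    }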

Delete a huge collection from mongo db and reclaim the space

I have a mongo shard set up in my production environment.
In my application I create a DB on a daily basis, as a single day's DB size reaches 18 GB.
I have a collection in my DB which logs the raw data for the hits on my site. I use this collection for a single day only, as the raw data is then converted into aggregated data by my DB script.
I want to delete this collection at the end of the day, but given the big size of this collection (almost 6 GB) and a DB size exceeding 17 GB, is it safe to use the repairDatabase command?
Could you please suggest a way to do this?
MongoDB (as of 2.4) currently allocates storage at the database level. You are correct that you would need to run the repairDatabase command in order to reclaim the preallocated storage.
If that space is going to be reused again soon (i.e. for the next day of raw data) you could just leave it allocated rather than running a repair. If you process different amounts of data every day, this may use some excessive storage as you'll basically remain at the "high watermark" where you've had the most storage allocated.
If you're concerned about the space usage, a better approach would be to add the raw data into a separate database that you can drop when you no longer need the raw data (i.e. your raw data goes into a separate DB/collection per day instead of just a separate collection).
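A minimal Java sketch of that pattern, with one raw database per day that gets dropped whole (the naming scheme and connection string are made up):

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import java.time.LocalDate;
    import java.time.format.DateTimeFormatter;

    public class DropRawDb {
        public static void main(String[] args) {
            try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
                // Hypothetical naming scheme: one raw-data DB per day, e.g. "raw_2013_05_21"
                String yesterday = LocalDate.now().minusDays(1)
                        .format(DateTimeFormatter.ofPattern("yyyy_MM_dd"));
                // Dropping the whole database releases its data files back to the OS,
                // unlike deleting documents inside a shared database
                client.getDatabase("raw_" + yesterday).drop();
            }
        }
    }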

Why does MongoDB take up so much space?

I am trying to store records with a set of doubles and ints (around 15-20 fields) in MongoDB. The records mostly (99.99%) have the same structure.
When I store the data in ROOT, which is a very structured data storage format, the file is around 2.5 GB for 22.5 million records. In Mongo, however, the database size (from the command show dbs) is around 21 GB, whereas the data size (from db.collection.stats()) is around 13 GB.
This is a huge overhead (to clarify: 13 GB vs. 2.5 GB; I'm not even talking about the 21 GB), and I guess it is because Mongo stores both keys and values. So the question is: why doesn't Mongo do a better job of making it smaller?
But the main question is: what is the performance impact of this? I have 4 indexes and they come to about 3 GB, so running the server on a single 8 GB machine could become a problem if I double the amount of data and try to keep a large working set in memory.
Any thoughts on whether I should be using SQL or some other DB? Or maybe just keep working with ROOT files, if anyone has tried them?
Basically, this is mongo preparing for the insertion of data. Mongo preallocates storage for data to prevent (or minimize) fragmentation on the disk. This preallocation is observed in the form of files that the mongod instance creates.
First it creates a 64 MB file, then 128 MB, then 256 MB, and so on, doubling until it reaches files of 2 GB (the maximum size of preallocated data files).
There are some other things mongo does that may use additional disk space, such as journaling...
For much, much more info on how MongoDB uses storage space, you can take a look at this page, and in particular the section titled "Why are the files in my data directory larger than the data in my database?"
There are some things you can do to minimize the space that is used, but these techniques (such as using the --smallfiles option) are usually only recommended for development and testing use, never for production.
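If you want to see where the space goes on your own system, you can compare the different size counters from dbStats. A minimal Java sketch (database name and connection string assumed):

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import org.bson.Document;

    public class SpaceReport {
        public static void main(String[] args) {
            try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
                // scale: 1048576 makes dbStats report all sizes in megabytes
                Document stats = client.getDatabase("mydb")
                        .runCommand(new Document("dbStats", 1).append("scale", 1048576));
                System.out.println("dataSize (MB):    " + stats.get("dataSize"));    // the BSON documents themselves
                System.out.println("storageSize (MB): " + stats.get("storageSize")); // includes padding and preallocation
                System.out.println("indexSize (MB):   " + stats.get("indexSize"));   // all indexes combined
            }
        }
    }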
Question: Should you use SQL or MongoDB?
Answer: It depends.
Better way to ask the question: Should you use a relational database or a document database?
Answer:
If your data is highly structured (every row has the same fields), or you rely heavily on foreign keys and you need strong transactional integrity on operations that use those related records... use a relational database.
If your records are heterogeneous (different fields per document) or have variable length fields (arrays) or have embedded documents (hierarchical)... use a document database.
My current software project uses both. Use the right tool for the job!

GridFS and standard collections, memory usage

As far as I know, MongoDB is optimized for the situation when all data fits into memory. And as I understand it, GridFS uses standard collections and all the standard storage methods. Is that right?
Does that mean that storing a large set of data (images, in my case) that is bigger than the current amount of memory will force my real data out of memory?
Or maybe MongoDB is smart enough to give lower priority to the GridFS collections?
MongoDB uses memory-mapped files to manage its data files. If you use data, it will stay in memory; if you don't use it, it will eventually be flushed to disk (and read back when you request it next time). If you need to read all your data, you had better fit it all in RAM, or your system might enter the deadly swap spiral (depending on your load, of course).
If you just store data and don't do much with it, MongoDB will use only a fraction of memory. For example, in one of my projects the total dataset size is over 300 GB, yet mongo takes only 800 MB of RAM (because I almost never read the data, only write it).
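You can check this on your own deployment via the mem section of serverStatus, which reports resident and virtual memory in megabytes. A minimal Java sketch (connection string assumed):

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import org.bson.Document;

    public class MemReport {
        public static void main(String[] args) {
            try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
                Document status = client.getDatabase("admin")
                        .runCommand(new Document("serverStatus", 1));
                Document mem = (Document) status.get("mem");
                // resident: physical RAM actually in use; virtual: address space mapped
                System.out.println("resident (MB): " + mem.get("resident"));
                System.out.println("virtual  (MB): " + mem.get("virtual"));
            }
        }
    }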