Why doesn't MongoDB reuse deleted data's space? [duplicate]

This question already has answers here:
Auto compact the deleted space in mongodb?
(4 answers)
Closed 8 years ago.
I set one shard's maxSize to 20GB and remove data every day, but I see that one db in this shard has a storageSize of about 18GB while its data file on disk is about 33GB and keeps growing. I know Mongo just marks deleted data, but shouldn't it reuse that space? Or do I need to configure something?

I would not run a repair on your database whenever you feel like getting your space back, as some suggest; for one thing, it would cause disastrous performance problems.
The problem you have is that documents are not "fitting" into the spaces in your deleted buckets list (for reference, this is a good presentation that explains everything I am talking about: http://www.mongodb.com/presentations/storage-engine-internals). Since MongoDB runs out of "steam" searching the list before it allocates a whole new document, you are most likely getting a new extent for every document you insert or update.
You could run a compact, but that is as bad as a repair in terms of performance; however, it will move those deleted spaces to a free list that can be used by documents of any size.
You could also run collMod with usePowerOf2Sizes (http://docs.mongodb.org/manual/reference/command/collMod/#usePowerOf2Sizes), which may well fix your problem.
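For example, a minimal sketch of enabling that option (the collection name "mycollection" is hypothetical):

    // Enable power-of-2 record allocation on an existing collection
    // (MMAPv1; this is the default allocation strategy from 2.6 onward):
    db.runCommand({ collMod: "mycollection", usePowerOf2Sizes: true })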
Ideally you should rework your application; you underestimated how in-place updates and inserts work in MongoDB, and now it is slaughtering you.

Related

MongoDB cloneCollection vs copyTo

I need to store a copy of a collection at a given time, work on that copy (a process which could take 20-30 minutes), and then remove the copy.
The copy itself should be an image of the collection at the given time, such that any action that occurs after that time needs to be ignored.
For that, MongoDB has two methods which do the same thing but do it differently, and here comes the question:
db.cloneCollection, which copies the collection to a new collection (I couldn't find any lock-related warnings),
and
db.collection.copyTo,
which copies the documents but causes a collection-wide lock (which could be a problem for me).
The question is: under what assumptions can I work with cloneCollection and copyTo?
The question relates not only to the locking.
I would like to ask for suggestions, hints, or any ideas on how to get the data as fast as possible without hurting workers that are using the collection. (It may also be that the answer involves neither cloneCollection nor copyTo.)
It's not an open debate; there is pretty much only one decisive answer under the following assumptions:
The collection may contain up to 50M entries.
Locking should be reduced to a minimum or none, if possible.
There could be a slight time difference from when I start the process until the full clone of the collection is ready, e.g. if I issue a clone at 10:00 and data from 10:09 is included, that is OK as long as it is not drastically far off.
I'm using MongoDB 3.4 with the WiredTiger storage engine.
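For reference, a minimal sketch of the two invocations (the collection names "events" and "events_snapshot" are hypothetical):

    // cloneCollection pulls a collection from a *remote* instance into
    // the current database; it is not guaranteed to be point-in-time:
    db.cloneCollection("sourcehost:27017", "events", {})

    // copyTo duplicates a collection locally, but it runs server-side
    // JavaScript and takes a global write lock (deprecated since 3.0):
    db.events.copyTo("events_snapshot")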

MongoDb "Working Set" exceeding RAM

I'm collecting time series data in MongoDB. Eventually my working set will be larger than my RAM, but I mostly need to access the recent data.
If I put everything in just one collection, would that still be possible? The index size will keep growing if I put all the data in one collection.
I was thinking of creating a new collection every month and putting the data there. This way, the very old data will not be loaded into RAM unless someone (rarely) needs that archived data.
So my question is: is it better to manually partition the data like that, or just leave everything up to MongoDB?
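For illustration, a minimal sketch of the manual monthly-partitioning approach (the collection name pattern "metrics_YYYY_MM" and the field names are hypothetical):

    // Route writes to the current month's collection; older months'
    // collections (and their indexes) stay out of RAM unless queried.
    var suffix = new Date().toISOString().slice(0, 7).replace("-", "_");
    var coll = db.getCollection("metrics_" + suffix);
    coll.createIndex({ ts: 1 });
    coll.insert({ ts: new Date(), value: 42.0 });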

Why does MongoDB take up so much space?

I am trying to store records with a set of doubles and ints (around 15-20 fields) in MongoDB. The records mostly (99.99%) have the same structure.
When I store the data in ROOT, which is a very structured data storage format, the file is around 2.5GB for 22.5 million records. In Mongo, however, the database size (from the command show dbs) is around 21GB, whereas the data size (from db.collection.stats()) is around 13GB.
This is a huge overhead (to clarify: 13GB vs 2.5GB; I'm not even talking about the 21GB), and I guess it is because Mongo stores both keys and values for every record. So the question is: why doesn't Mongo do a better job of making it smaller, and how could it?
But the main question is: what is the performance impact of this? I have 4 indexes and they come out to 3GB, so running the server on a single 8GB machine could become a problem if I double the amount of data and try to keep a large working set in memory.
Any thoughts on whether I should be using SQL or some other DB, or maybe just keep working with ROOT files, if anyone has tried them?
Basically, this is Mongo preparing for the insertion of data. Mongo preallocates storage for data to prevent (or minimize) fragmentation on the disk. This preallocation is observed in the form of files that the mongod instance creates.
First it creates a 64MB file, then 128MB, then 256MB, doubling each time until it reaches files of 2GB (the maximum size of preallocated data files).
There are some other things Mongo does that might be suspected of using more disk space, such as journaling...
For much, much more info on how MongoDB uses storage space, you can take a look at this page, in particular the section titled "Why are the files in my data directory larger than the data in my database?".
There are some things you can do to minimize the space used, but these techniques (such as using the --smallfiles option) are usually only recommended for development and testing, never for production.
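As a quick check, this sketch compares the logical data size with the allocated and on-disk sizes for the current database (field names as reported by db.stats() under MMAPv1):

    var s = db.stats();
    print("dataSize:    " + s.dataSize);    // bytes of documents (incl. padding)
    print("storageSize: " + s.storageSize); // bytes allocated to extents
    print("fileSize:    " + s.fileSize);    // total size of the data files on disk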
Question: Should you use SQL or MongoDB?
Answer: It depends.
A better way to ask the question: should you use a relational database or a document database?
Answer:
If your data is highly structured (every row has the same fields), or you rely heavily on foreign keys and you need strong transactional integrity on operations that use those related records... use a relational database.
If your records are heterogeneous (different fields per document) or have variable length fields (arrays) or have embedded documents (hierarchical)... use a document database.
My current software project uses both. Use the right tool for the job!

Does MongoDB reuse deleted space?

First off, I know about this question:
Auto compact the deleted space in mongodb?
My question is not about shrinking DB file sizes, though, but more about the reuse of deleted space. Say I have 100K documents in a collection and I then delete 50K of those. Will Mongo reuse the space within its data file that the deleted documents freed, or are they simply "marked" as deleted?
I don't care so much about the actual size of the file on disk; it's more about "does it just grow and grow?"
Update (Mar 2015): As of the 3.0 release, there are multiple storage engines available in MongoDB. This answer applies to the MMAP storage engine (still the default in MongoDB 3.0), the answer for other engines (WiredTiger for example) is quite different and may well be tunable and adjustable. Hence if you are using another engine, please read the relevant docs for that storage engine to determine what your space re-use defaults and options are.
With the MMAP storage engine, when documents are deleted the space left behind is put onto a free list. However, to reuse that space, similarly sized documents need to be inserted later, and MongoDB has to find an appropriate slot for each new document within a certain time frame (once it times out looking through the list, it will just append), otherwise space re-use will not happen very often. This all happens within the data files, so there is no disk space reclamation here; everything is done internally within the existing files.
If you subsequently do a repair, or resync a secondary from scratch, the data files are rewritten and the space on disk will be reclaimed (any padding on docs is also removed). This is where you will see actual space reclamation on-disk. For any other actions (compact included) the on disk usage will not change and may even increase.
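For reference, a minimal sketch of running compact (the collection name "mycoll" is hypothetical); as noted above, this defragments within the existing files but does not shrink them on disk, and it blocks operations on the database in MMAPv1:

    db.runCommand({ compact: "mycoll" })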
With 2.2+ you can now use the collMod command and the usePowerOf2Sizes option to make the re-use of deleted space more likely (note that this is the default in 2.6+). This means that the initial space allocation for a document is a bit less efficient (512 bytes for a 400 byte doc, for example), but it means that when a new doc is inserted it is more likely to be able to re-use that space. If you are deleting (or growing and hence moving) documents a lot, then this will be more efficient in the long term.
For anyone that is interested, one of the people that wrote a lot of the storage code (Mathias Stearn) has a great presentation about the storage internals (the storage-engine-internals talk linked in the first answer above).

MongoDB data remove - reclaim diskspace [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Auto compact the deleted space in mongodb?
My understanding is that on delete operations MongoDB won't free up the disk space but would reuse it as needed.
Is that correct?
If not, would I have to run a repair command?
Could the repair be run on a live mongo instance?
Yes, that is correct.
No; it is better to give MongoDB as much disk space as possible (the more space MongoDB can allocate, the less disk fragmentation you will have; additionally, allocating space is an expensive operation). But if you wish, you can run db.repairDatabase() from the mongo shell to shrink the database size.
Yes, you can run repairDatabase on a live MongoDB instance (it is better to run it during off-peak hours).
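For reference, the repair invocation mentioned above; note that it blocks the instance and needs free disk space roughly equal to the current data set size plus 2GB:

    db.repairDatabase()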
This is somewhat of a duplicate of this MongoDB question ...
Auto compact the deleted space in mongodb?
See that answer for details on how to:
- Reclaim some space
- Use server-side JS to run a recurring job to get back space (including a script you can run ...)
- Look into Capped Collections for some use cases!
Also you can see this related blog posting: http://learnmongo.com/posts/compacting-mongodb-data-files/
I have another solution that might work better than doing db.repairDatabase() if you can't afford for the system to be locked, or don't have double the storage.
You must be using a replica set.
My thought is: once you've removed all of the excess data that's gobbling your disk, stop a secondary replica, wipe its data directory, start it up, and let it resynchronize with the master. Repeat with the other secondaries, one at a time.
On the master, do an rs.stepDown() to hand over MASTER to one of the synced secondaries, then stop this one, wipe it, and let it resync.
The process is time consuming, but it should only cost a few seconds of downtime, when you do the rs.stepDown().
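For reference, a minimal sketch of that final hand-off step, run on the current primary once the secondaries have been resynced:

    rs.stepDown(60)  // step down and avoid re-election for 60 seconds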