I am using Python, Scrapy, and MongoDB for my web scraping project, and I scrape about 40 GB of data daily. Is there a way or setting in the mongodb.conf file so that MongoDB will exit normally before applying a write lock on the db due to a disk-full error?
Every time I hit this disk-full error in MongoDB, I have to manually re-install MongoDB to remove the write lock from the db. I can't run the repair and compact commands on the database, because running them also requires free space.
MongoDB doesn't handle disk-full errors very well in certain cases, but you do not have to uninstall and then re-install MongoDB to remove the lock file. Instead, you can just remove the mongod.lock file yourself. As long as you have journaling enabled, your data should be fine. Of course, at that point you still can't add more data to the MongoDB databases.
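A minimal sketch of clearing the stale lock, assuming the default dbpath of /data/db (adjust for your setup):

# make sure no mongod process is still running, then remove the stale lock file
rm /data/db/mongod.lock
# restart; with journaling enabled the journal is replayed automatically on startup
mongod --dbpath /data/db --journal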
You probably don't need repair at all, and compact only helps if you have actually deleted data from MongoDB. compact does not compress data, so it is only useful if you have indeed deleted data.
Constantly adding and then later deleting data can cause fragmentation and leave lots of disk space unused. You can mostly prevent that with the usePowerOf2Sizes option, which you can set per collection. compact mitigates fragmentation by rewriting the database files, but as you said, you need free disk space for that. I would also advise adding some monitoring to warn you when your data size reaches 50% of your full disk space. At that point there is still plenty of time to use compact to reclaim unused space.
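If you want to try usePowerOf2Sizes, here is a hedged example of enabling it on an existing collection (the database and collection names are placeholders):

# applies powers-of-two record allocation to new allocations in this collection
mongo mydb --eval 'db.runCommand({ collMod: "mycollection", usePowerOf2Sizes: true })'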
Context
For local development and testing within a CI pipeline, I want a postgres docker image that contains some data sampled from production (a few tens of MBs). I will periodically rebuild this image to ensure the sampled data stays fresh.
I don't care at all about data integrity, but I care quite a bit about image size and container disk/memory usage when run. Startup time should be at most a couple of mins.
What I've built
I have a Dockerfile that builds on top of one of the official postgres (postgis) docker images; it initializes the database and uses pg_restore to insert my sample data.
Attempted optimisations
I use a multistage build, copying just the postgres directory into the final image (this helps, as I used node during the build).
I notice that the pg_xlog directory is quite large and seems logically redundant here, since I would happily checkpoint and ditch any WAL before sealing the image. I can't figure out how to get rid of it. I tried starting postgres with the following flags:
--min_wal_size=2 --max_wal_size=3 --archive_mode=off --wal_keep_segments
and running CHECKPOINT and waiting a few seconds, but it doesn't seem to change anything. I also tried deleting the contents of the directory, but that seemed to break the database on its next startup.
Rather than put the actual database in the image, I could just leave a pg_dump file in the image and have the image entrypoint build the database from that. I think this would improve the image size (though I'm not clear why the database should take up much more space than the dump, unless the indexes are especially big; I actually thought the dump format was less compact than the database itself, so this might offset the index size). This would obviously impact startup time (but not prohibitively so).
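For what it's worth, a rough sketch of that dump-in-image approach, relying on the official postgres image's /docker-entrypoint-initdb.d hook, which runs *.sh scripts on first container start (file names here are hypothetical):

# Dockerfile additions (sketch):
#   FROM postgres:9.5
#   COPY sample.dump /sample.dump
#   COPY 10-restore-sample.sh /docker-entrypoint-initdb.d/
# 10-restore-sample.sh, run automatically during first startup:
#!/bin/bash
set -e
# restore the baked-in dump into the default database created by the entrypoint
pg_restore --username "$POSTGRES_USER" --dbname "$POSTGRES_DB" --no-owner /sample.dump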
Summary/Questions
Am I going about this the right way? If so, what kind of disk/memory optimizations can I use? In particular can I remove/shrink pg_xlog?
I'm using Postgres 9.5 and Postgis 2.X.
Was the server ever run with a larger max_wal_size than 3? If so, it could have "recycled" ahead a lot of wal files by renaming old ones for future use. Once those are renamed, they will never be removed until after they are used, even if max_wal_size is later reduced.
I also tried deleting the contents of the directory, but that seemed to break the database on its next startup.
You can fix that by using pg_resetxlog. Just don't get in the habit of running it blindly; it is very dangerous to run outside of a test environment.
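A hedged sketch on Postgres 9.5 (the tool was renamed pg_resetwal in version 10); the data directory path is an assumption based on the official image:

# the server must be stopped cleanly before resetting the WAL
pg_ctl -D /var/lib/postgresql/data stop -m fast
# rewrites pg_xlog with a single minimal segment
pg_resetxlog /var/lib/postgresql/data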
I am familiar both with the MongoDB repairDatabase and compact commands, but these both seem to lock the database and/or collection. Is there another way to reclaim deleted disk space without essentially shutting down the database? What are best practices in this area? Thanks!
Best practice will depend on your schema and what your application does. Here's my use case; perhaps you can learn something from it.

My application stores very large amounts of time-stamped data samples. Deleting data from a very large store is a very expensive operation, and it gets more complicated when you try doing it on a live system. MongoDB had several issues in the past with reclaiming disk space back to the OS, and we had to dance around them; I'm not sure how well it works now.

What solved everything for good was partitioning the data in such a way that we could dispose of old stuff by simply dropping an entire database. Dropping a MongoDB database is a very cheap and efficient operation, almost instantaneous even when you drop a TB. Note that dropping a collection is not as effective as dropping a database; this was actually the key to the solution. To do this we had to redesign the schema. Your case may of course be different, but the lesson learned is that deleting data from large storage is very expensive.
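To illustrate the partition-and-drop pattern, a sketch assuming one database per month of samples (the naming scheme is hypothetical):

# dropping a whole database is near-instant, even for very large data sets
mongo --eval 'db.getSiblingDB("samples_2013_01").dropDatabase()'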
The best method currently is to run a master/slave setup.
Shut down one mongod instance at a time and let it resync.
More details here: Reducing MongoDB database file size
I want to shrink data files size by reclaiming deleted space, but I can't run db.repairDatabase(), because free disk space is not enough.
Update: With WiredTiger, compact does free space.
The original answer to this question is here:
Reducing MongoDB database file size
There really is nothing apart from repair that will reclaim space. compact should allow you to go much longer on the existing space; otherwise, you will have to migrate to a bigger drive.
One way to do this is to use an off-line secondary from your Replica Set. This should give you a whole maintenance window to migrate, repair, move back, and bring it back up.
If you are not running a Replica Set, it's time to look at doing just that.
You could run the compact command on a single collection, or one by one in all the collections you want to shrink.
http://www.mongodb.org/display/DOCS/Compact+Command
db.runCommand( { compact : 'mycollectionname' } )
As noted in the comments, I was mistaken: compact does not actually reclaim disk space; it only defragments and rebuilds collection indexes.
Instead, though, you could use the --repairpath option if you have another drive available with free space.
For example:
mongod --dbpath /data/db --repair --repairpath /data/db0
Shown here: http://docs.mongodb.org/manual/tutorial/recover-data-following-unexpected-shutdown/
You can also do a manual mongodump and mongorestore. That's basically the same as what repairDatabase does, and that way you can dump and restore to/from a different machine with sufficient disk space.
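A hedged example of that round trip; hosts and paths are placeholders:

# dump everything to a disk with enough free space
mongodump --host olddb.example.com --out /mnt/spare/dump
# restore into a fresh instance, which writes compact, unfragmented files
mongorestore --host newdb.example.com /mnt/spare/dump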
If you're running a replica set, you will want to resync each of your secondaries, one at a time. Once this has been completed, step down your primary and resync the newly assigned secondary.
To resync, stop your mongod instance, delete its data files, and start the process back up. Watch the logs to ensure everything starts back up properly and the initial sync has begun.
If you have a lot of data / indexes, ensure your oplog is large enough, otherwise it's likely to go stale.
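A sketch of resyncing one secondary, assuming a dbpath of /data/db and a replica set named rs0 (both placeholders):

# stop the secondary cleanly
mongod --shutdown --dbpath /data/db
# wipe its data files so the next start triggers a full initial sync
rm -rf /data/db/*
# restart; initial sync from the primary begins automatically
mongod --dbpath /data/db --replSet rs0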
There is one other option, if you are using a replica set, but with a lot of caveats. You can fail over to another set member, then delete the files on the now former primary and do a full resync. A full resync rewrites the files from scratch in a similar way to a repair, but you will also have to rebuild indexes. This is not to be done lightly.
If you go down this path, my recommendation would be to have a 3 member replica set before doing this for disk space reclamation, so that at any time when a member is syncing from scratch you have 2 set members fully functional.
If you do not have a replica set, I recommend creating one with two secondaries. When you sync them initially, you will be creating nice, unfragmented and unpadded versions of your data. More here:
http://www.mongodb.org/display/DOCS/Replica+Set+Configuration
Possible Duplicate:
Auto compact the deleted space in mongodb?
My understanding is that on delete operations MongoDB won't free up the disk space but would reuse it as needed.
Is that correct?
If not, would I have to run a repair command?
Could the repair be run on a live mongo instance?
Yes it is correct.
No; it's better to give MongoDB as much disk space as possible (if MongoDB can allocate more space, you will have less disk fragmentation; additionally, allocating space is an expensive operation). But if you wish, you can run db.repairDatabase() from the mongodb shell to shrink the database size.
Yes, you can run repairDatabase on a live mongodb instance (better to run it in off-peak hours).
This is somewhat of a duplicate of this MongoDB question ...
Auto compact the deleted space in mongodb?
See that answer for details on how to:
- Reclaim some space
- Use server-side JS to run a recurring job to get back space (including a script you can run ...)
- Look into Capped Collections for some use cases!
Also you can see this related blog posting: http://learnmongo.com/posts/compacting-mongodb-data-files/
I have another solution that might work better than doing db.repairDatabase() if you can't afford for the system to be locked, or don't have double the storage.
You must be using a replica set.
My thought is once you've removed all of the excess data that's gobbling your disk, stop a secondary replica, wipe its data directory, start it up and let it resynchronize with the master. Repeat with the other secondaries, one at a time.
On the master, do an rs.stepDown() to hand over MASTER to one of the synced secondaries; then stop this one, wipe it, and let it resync.
The process is time consuming, but it should only cost a few seconds of downtime, when you do the rs.stepDown().
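For the hand-over step, a minimal sketch (run against the current primary; the 120-second window is an arbitrary choice):

# ask the primary to step down and not seek re-election for 120 seconds
mongo --eval 'rs.stepDown(120)'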
How can I compact a Firebird 2.1 database, like we do in MS Access (discarding erased data, remaking indexes, etc.)?
Is there a way to do it?
Thanks!
Usually there is no need to compact a Firebird database: see the Firebird release notes about garbage collection and an automatic (per-database configurable) operation named "sweep".
In a few words, Firebird reuses space in pages when records are deleted or the oldest record versions are freed, asking for new disk space chunks only when free space becomes too small (i.e. under a defined percentage).
Sweep is performed by default after a predefined number of committed transactions, but it's an expensive task.
Backup and restore should be considered a last resort for optimizing and shrinking, as it rebuilds and optimizes indexes too; but usually this is not needed, as there are commands and tools to rebuild indexes.
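If you do want to trigger garbage collection by hand, a sweep can be run off-peak with gfix; the credentials and path here are placeholders:

# force a sweep of the whole database now, rather than waiting for the threshold
gfix -sweep -user SYSDBA -password masterkey /data/mydb.fdb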
The only way to do it is to make a backup and a restore.
From the official FAQ:
Many users wonder why they don't get their disk space back when they delete a lot of records from a database.

The reason is that it is an expensive operation; it would require a lot of disk writes and memory, just like defragmenting a hard disk partition. The parts of the database (pages) that were used by such data are marked as empty, and Firebird will reuse them the next time it needs to write new data.

If disk space is critical for you, you can get the space back by doing a backup and then a restore. Since you're doing the backup in order to restore right away, it's wise to use the "inhibit garbage collection" or "don't use garbage collection" switch (-G in gbak), which will make the backup go A LOT FASTER. Garbage collection is used to clean up your database, and as it is a maintenance task, it's often done together with backup (as backup has to go through the entire database anyway). However, you're soon going to ditch that database file, and there's no need to clean it up.
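A hedged example of that backup-then-restore cycle with garbage collection inhibited (the -G switch mentioned above, written here in its lowercase form); file names and credentials are placeholders:

# back up without garbage collecting, which is much faster
gbak -b -g -user SYSDBA -password masterkey /data/mydb.fdb /data/mydb.fbk
# restore into a fresh, compact database file
gbak -c -user SYSDBA -password masterkey /data/mydb.fbk /data/mydb_new.fdb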