Postgres database dump size larger than physical size - postgresql

I just made a pg_dump backup of my database and its size is about 95 GB, but the size of the directory /pgsql/data is about 38 GB.
I ran a VACUUM FULL and the size of the dump did not change. My Postgres installation is version 9.3.4, on a CentOS release 6.3 server.
Is it strange that the dump is so much larger than the physical size, or can I consider this normal?
Thanks in advance!
Regards.
Neme.

The size of pg_dump output and the size of a Postgres cluster (aka 'instance') on disk have very, very little correlation. Consider:
pg_dump has 3 different output formats, 2 of which allow on-the-fly compression
pg_dump output contains only the schema definition and the raw data, in text (or possibly "binary") format. It contains no index data.
The text/"binary" representation of different data types can be larger or smaller than the actual data stored in the database. For example, the number 1 stored in a bigint column takes 8 bytes in the cluster, but only 1 byte in a pg_dump.
This is also why VACUUM FULL had no effect on the size of the backup.
Note that a Point In Time Recovery (PITR) based backup is entirely different from a pg_dump backup. PITR backups are essentially copies of the data on disk.
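As an illustration of the first point, here is a minimal sketch of the output formats and their compression behaviour (the database name mydb and the file names are placeholders):
# plain SQL script; no built-in compression unless you pipe it through gzip yourself
pg_dump -Fp mydb > mydb.sql
# custom-format archive with built-in compression (level 9 here)
pg_dump -Fc -Z9 -f mydb.dump mydb
# directory format, also compressed, and restorable in parallel with pg_restore -j
pg_dump -Fd -f mydb_dir mydb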

Postgres does compress its data in certain situations, using a technique called TOAST:
PostgreSQL uses a fixed page size (commonly 8 kB), and does not allow tuples to span multiple pages. Therefore, it is not possible to store very large field values directly. To overcome this limitation, large field values are compressed and/or broken up into multiple physical rows. This happens transparently to the user, with only small impact on most of the backend code. The technique is affectionately known as TOAST (or "the best thing since sliced bread").
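If you want a rough idea of how much of a particular table actually lives in its TOAST table, one way is to compare these two size functions (the table name mytable is a placeholder):
-- pg_table_size includes TOAST (plus free space / visibility maps); pg_relation_size is the main heap only
SELECT pg_size_pretty(pg_table_size('mytable'))    AS table_including_toast,
       pg_size_pretty(pg_relation_size('mytable')) AS main_heap_only;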

Related

bson vs gzip dump of mongodb

I have a database with
On disk size: 19.032 GB (using the show dbs command)
Data size: 56 GB (using the db.collectionName.stats(1024*1024*1024).size command)
When taking a dump with the mongodump command we can set the --gzip parameter. These are my observations with and without this flag.
command         time taken to dump   size of dump   restoration time   observation
with gzip       30 min               7.5 GB         20 min             in mongostat the insert rate ranged from 30k to 80k per sec
without gzip    10 min               57 GB          50 min             in mongostat the insert rate was very erratic, ranging from 8k to 20k per sec
The dump was taken from a machine with 8 cores and 40 GB of RAM (Machine B) onto a 12-core, 48 GB RAM machine (Machine A), and restored from Machine A onto another 12-core, 48 GB machine (Machine C), to make sure there is no resource contention between the mongo, mongorestore and mongodump processes. The Mongo version is 4.2.0.
I have a few questions:
What is the functional difference between the 2 dumps?
Can the bson dump be zipped to make it smaller?
How does the number of indexes impact the mongodump and restore process? (If we drop some unique indexes and then recreate them, will it expedite the total dump and restore time, considering that MongoDB will not have to enforce uniqueness during the inserts?)
Is there a way to make the overall process faster? From these results it looks like we have to choose between dump speed and restore speed.
Will having a bigger machine (more RAM) reading the dump and restoring it expedite the overall process?
Will a smaller dump help with the overall time?
Update:
2. Can the bson dump be zipped to make it smaller?
Yes:
% ./mongodump -d=test
2022-11-16T21:02:24.100+0530 writing test.test to dump/test/test.bson
2022-11-16T21:02:24.119+0530 done dumping test.test (10000 documents)
% gzip dump/test/test.bson
% ./mongorestore --db=test8 --gzip dump/test/test.bson.gz
2022-11-16T21:02:51.076+0530 The --db and --collection flags are deprecated for this use-case; please use --nsInclude instead, i.e. with --nsInclude=${DATABASE}.${COLLECTION}
2022-11-16T21:02:51.077+0530 checking for collection data in dump/test/test.bson.gz
2022-11-16T21:02:51.184+0530 restoring test8.test from dump/test/test.bson.gz
2022-11-16T21:02:51.337+0530 finished restoring test8.test (10000 documents, 0 failures)
2022-11-16T21:02:51.337+0530 10000 document(s) restored successfully. 0 document(s) failed to restore.
I am no MongoDB expert, but I have good experience working with MongoDB backup and restore activities and will answer to the best of my knowledge.
What is the functional difference between the 2 dumps?
The mongodump command without the --gzip option saves each and every document to a file in BSON format.
This significantly reduces the time taken by the backup and restore operations, since it just reads the BSON file and inserts the documents; the compromise is the size of the .bson dump file.
However, when we pass the --gzip option, the BSON data is compressed before being dumped to a file. This significantly increases the time taken by mongodump and mongorestore, but the backup file is much smaller due to the compression.
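For reference, a minimal sketch of the two invocations (the database name test and the output directories are placeholders):
# plain BSON dump
mongodump --db=test --out=dump/
# gzip-compressed dump; each .bson and .metadata.json file is written gzipped
mongodump --db=test --gzip --out=dump-gzip/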
Can the bson dump be zipped to make it smaller?
Yes, it can be further zipped. But you will be spending additional time, since you have to compress the already compressed file and extract it again before the restore operation, increasing the overall time taken. Do it only if the resulting file is much smaller than the plain gzip output.
EDIT:
As #best wishes pointed out, I completely misread this question.
The gzip performed by mongodump is just a gzip applied on the mongodump side. It is literally the same as compressing the original BSON file manually from our end.
For instance, if you extract the .bson.gz file with any compression application, you get the actual BSON backup file.
Note that zip and gzip are not the same format, even though they both compress files, so you will get different file sizes when comparing mongodump's gzip output with a manually zipped dump.
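A quick illustration of that equivalence (the paths are placeholders):
# mongodump's --gzip output is ordinary gzip: decompressing it yields the plain BSON file
gzip -d dump/test/test.bson.gz    # produces dump/test/test.bson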
How does the number of indexes impact the mongodump and restore process? (If we drop some unique indexes and then recreate them, will it expedite the total dump and restore time, considering that MongoDB will not have to enforce uniqueness during the inserts?)
Whenever you take a dump, the mongodump tool creates a <Collection-Name>.metadata.json file. This basically contains all the indexes, followed by the collection name, uuid, colmod, dbUsersAndRoles and so on.
The number and type of indexes in the collection have no impact during the mongodump operation. However, after restoring the data, the mongorestore command goes through all the indexes in the metadata file and tries to recreate them.
The time taken by this operation depends on the number of indexes and the number of documents in your collection; in short, (No. of Indexes * No. of Documents). The type of index (even if it's unique) doesn't have a major impact on performance. If the indexes were applied to the original collection using the background: true option, rebuilding them during the restore is going to take even more time.
You can skip the indexing during the mongorestore operation by passing the --noIndexRestore option on the command line, and build the indexes later when required.
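A minimal sketch of such a data-only restore (the dump path is a placeholder):
# restore the data only; the indexes listed in the .metadata.json files are not rebuilt
mongorestore --noIndexRestore dump/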
In my company's production backup environment, indexing the keys takes more time than restoring the data.
Is there a way to make the overall process faster? From these results it looks like we have to choose between dump speed and restore speed.
The solution depends...
If network bandwidth is not an issue (for example, moving data between two instances running in the cloud), don't use any compression; it will save you time.
If the data in the newly moved instance won't be accessed immediately, perform the restoration with the --noIndexRestore flag.
If the backup is for cold storage or for saving data for later use, apply gzip compression, manual zip compression, or both (whatever works best for you).
Choose whichever scenario works best for you; the trade-off to weigh is primarily time versus space, and secondarily whether to build the indexes right away.
In my company, we usually take a non-compressed backup and restore for P-1, use gzip compression for prod backups that are weeks old, and further compress manually for backups that are months old.
You have one more option, and I DON'T RECOMMEND THIS METHOD: you can directly move the data path pointed to by your MongoDB instance and change the DB path in the MongoDB instance of the migrated machine. Again, I don't recommend this method, as there are many things that could go wrong, although I had no issues with it on my end. But I can't guarantee the same for you, so do this at your own risk if you decide to.
Will having a bigger machine (more RAM) reading the dump and restoring it expedite the overall process?
I don't think so. I'm not sure about this, but I have 16 GB of RAM and I restored a 40 GB mongodump locally without hitting any bottleneck due to RAM; I could be wrong, though, as I am not sure. Please let me know if you find out the answer yourself.
Will a smaller dump help with the overall time?
If by a smaller dump you mean limiting the data to be dumped using the --query flag, it certainly will, since there is much less data to back up and restore. Remember the No. of Indexes * No. of Documents rule.
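For example, a filtered dump might look like this (the collection name and filter are placeholders; --query also requires --collection):
# dump only the documents of one collection that match a filter
mongodump --db=test --collection=events --query='{"status": "archived"}' --out=dump-partial/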
Hope this helped answer your questions. Let me know if you have any further questions, if I made any mistakes, if you found a better solution, or what you decided on in the end.
Here are my two cents:
In my experience, using --gzip trades time for storage space ("space for time"); both mongodump and mongorestore will have some overhead from it.
In addition, I also use the parallelism settings:
--numParallelCollections=n1
--numInsertionWorkersPerCollection=n2
which may increase performance a little, around 10%; n1 and n2 depend on the number of CPUs on the server.
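A sketch of a restore using these flags (the values 4 and 8 are arbitrary placeholders to tune against your core count, and the dump path is a placeholder too):
# parallel restore of a gzip-compressed dump
mongorestore --gzip --numParallelCollections=4 --numInsertionWorkersPerCollection=8 dump/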
The restore process also rebuilds indexes, and how long that takes depends on how many indexes there are in your databases. Generally speaking, the index rebuild is faster than the data restore.
Hope this helps!

Will gzipping + storage as bytea save more disk space over storage as text?

I have a table containing 30 million rows, and one of the columns is currently a text column. The column is populated with random strings of between 2 and 10 kB in size. I don't need to search the strings directly.
I'm considering gzipping the strings before saving them (typically reducing them 2x in size) and storing them in a bytea column instead.
I have read that PostgreSQL does some compression of text columns by default, so I'm wondering: will there be any actual disk space reduction as a result of the suggested change?
I'm running PostgreSQL 9.3.
PostgreSQL stores text columns that exceed about 2000 bytes in a TOAST table and compresses the data.
The compression is fast, but not very good, so you can have some savings if you use a different compression method. Since the stored values are not very large, the savings will probably be small.
If you want to go that way, you should disable compression on that already compressed column:
ALTER TABLE tab
ALTER bin_col SET STORAGE EXTERNAL;
I'd recommend that you go with PostgreSQL's standard compression and keep things simple, but the best thing would be for you to run a test and see if you get a benefit from using custom compression.
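If you do run such a test, one quick way to compare the on-disk footprint of the two approaches is to sum pg_column_size over each column, which reports the stored, possibly TOAST-compressed, size of every value (the table and column names here are placeholders):
SELECT pg_size_pretty(sum(pg_column_size(text_col)))  AS text_column_size,
       pg_size_pretty(sum(pg_column_size(bytea_col))) AS bytea_column_size
FROM   tab;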

Is the database backup size the same as the database size

I am working with PostgreSQL. I ran the following command and it returns 12 MB:
SELECT pg_size_pretty(pg_database_size('itcs'));
but when I took a backup using pgAdmin, the backup size was 1 MB. Why this difference?
If you are taking a logical backup (with pg_dump), the backup contains only the data, no empty pages, no old versions of rows, no padding, no indexes. It may also be compressed. All that can greatly reduce the size.
If you are taking a physical backup, the backup more or less consists of the actual database files as they are, plus recovery logs to get them to a consistent state. So that would be roughly the same size as the database itself (but you can also compress it).
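One way to see the difference for yourself, sketched here with the database name from the question (the dump file name is a placeholder):
# size reported by Postgres for the database on disk
psql -d itcs -c "SELECT pg_size_pretty(pg_database_size('itcs'));"
# compressed logical dump of the same database, for comparison
pg_dump -Fc -f itcs.dump itcs
ls -lh itcs.dump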

What affects DB2 restored database size?

I have a database TESTDB with the following details:
Database size: 3.2 GB
Database capacity: 302 GB
One of its tablespaces has its high water mark (HWM) too high due to an SMP extent, so it is not letting me reduce the high water mark.
My backup size is around 3.2 GB (as backups contain only used pages).
If I restore this database backup image via a redirected restore, what will be the newly restored database's size?
Will it be around 3.2 GB or around 302 GB?
The short answer is that RESTORE DATABASE will produce a target database that occupies about as much disk space as the source database did when it was backed up.
On its own, the size of a DB2 backup image is not a reliable indicator of how big the target database will be. For one thing, DB2 provides the option to compress the data being backed up, which can make the backup image significantly smaller than the DB2 object data it contains.
As you correctly point out, the backup image only contains non-empty extents (blocks of contiguous pages), but the RESTORE DATABASE command will recreate each tablespace container to its original size (including empty pages) unless you specify different container locations and sizes via the REDIRECT parameter.
The 302GB of capacity you're seeing is from GET_DBSIZE_INFO and similar utilities, and is quite often larger than the total storage the database currently occupies. This is because DB2's capacity calculation includes not only unused pages in DMS tablespaces, but also any free space on volumes or drives that are used by an SMS tablespace (most DB2 LUW databases contain at least one SMS tablespace).
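For reference, a sketch of a redirected restore where the container definitions can be adjusted before the data is laid down (the backup path, timestamp, and target database name are placeholders):
# generate a redirect script that you can edit to change container paths and sizes
db2 "RESTORE DATABASE TESTDB FROM /backups TAKEN AT 20240101120000 INTO TESTDB2 REDIRECT GENERATE SCRIPT redirect.clp"
# after editing redirect.clp, run it to perform the actual restore
db2 -tvf redirect.clp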

Should I compact a MongoDB database before mongodump/mongorestore?

I have a database that hasn't been compacted in a while so its disk size is much larger than actual data and index sizes. I'll be moving it to another database and would like to know:
would compacting speed up mongodump
does mongorestore rebuild the database in a compact way, negating the need to compact
Compact + dump should take longer than a single dump, since the compact pass still operates on the same non-compact data first.
Yes, mongorestore rebuilds the database in a compact way and also releases physical disk space. A simple compact will decrease the data size only, but the disk space will still be allocated by Mongo (you will not be able to use it for other purposes).
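If you do decide to compact first, the command is run per collection; a minimal sketch using the shell (database and collection names are placeholders):
# run the compact command against one collection of the target database
mongosh mydb --eval 'db.runCommand({ compact: "myCollection" })'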