Will gzipping + storage as bytea save more disk space over storage as text? - postgresql

I have a table containing 30 million rows, and one of the columns is currently a text column. The column is populated with random strings of between 2 and 10 kB. I don't need to search the strings directly.
I'm considering gzipping the strings before saving them (typically reducing them 2x in size) and storing them in a bytea column instead.
I have read that Postgresql does some compression of text columns by default, so I'm wondering: Will there be any actual disk space reduction as a product of the suggested change?
I'm running PostgreSQL 9.3.

PostgreSQL stores text columns that exceed about 2000 bytes in a TOAST table and compresses the data.
The compression is fast, but not very effective, so you may see some savings if you use a different compression method. Since the stored values are not very large, the savings will probably be small.
If you want to go that way, you should disable compression on that already compressed column:
ALTER TABLE tab
ALTER bin_col SET STORAGE EXTERNAL;
I'd recommend that you go with PostgreSQL's standard compression and keep things simple, but the best thing would be for you to run a test and see if you get a benefit from using custom compression.
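Before changing the schema, it is worth measuring the ratio on real data. A minimal offline sketch with Python's gzip module (standing in for whatever client-side compression you would use; the sample text is simulated, not the asker's data):

```python
import gzip
import random
import string

random.seed(42)

# Simulate a ~8 kB column value. Truly random bytes barely compress;
# text drawn from a small alphabet (like the asker's ~2x case) does.
value = "".join(random.choices(string.ascii_lowercase + " ", k=8000)).encode()

compressed = gzip.compress(value)
ratio = len(value) / len(compressed)

print(f"original: {len(value)} bytes, gzipped: {len(compressed)} bytes, "
      f"ratio: {ratio:.2f}x")
```

Run the same measurement on a sample of the actual column values and compare against the size PostgreSQL already achieves with its built-in compression.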

Related

Postgresql - How do I find compressed fields in a table?

Postgresql uses different storage techniques for field values. If values become large (e.g. are long texts), Postgresql will eventually apply compression and/or TOAST the values.
How do I find out which fields in a table are compressed?
(Background: I have a database that stores tiny BLOBs in some columns, and I want to find out how much of it is compressed; if compression is hardly used, I want to turn it off so Postgres won't waste CPU cycles trying.)
Starting in v14, there is the function pg_column_compression.
select pg_column_compression(colname),count(*) from tablename group by 1;

Text length reduction while storing in database table

I have a scenario in which a column of the table looks as follows:
1234124124124
2343243253253
2131324324324
4545645354356
0982349874598
1298349832595
5365240240324
0980979879832
0924320982438
....
Many more rows like these will be there.
I want an efficient way to compress all these rows of that column and store them in a single row. I just want to reduce the space the data occupies, using text compression or some similar functionality. When storing everything in the same row, separators can be used to differentiate the individual values.
How can this be achieved?
Thanks in advance.
You don't need to alter your schema to save disk space; MongoDB's WiredTiger storage engine includes compression, so you get the benefit of reduced disk space while keeping the schema you need to do your queries effectively.
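If you do want to do it in application code anyway, the idea from the question (join the values with a separator and compress them into one blob) can be sketched with Python's zlib; the sample rows here are generated for illustration, not taken from the question:

```python
import random
import zlib

random.seed(0)

# Hypothetical 13-digit numeric strings like those in the question.
rows = ["".join(random.choices("0123456789", k=13)) for _ in range(1000)]

# Join with a newline separator and compress into a single blob
# that could be stored in one binary column value.
raw = "\n".join(rows).encode()
blob = zlib.compress(raw)

# To read the values back, decompress and split on the separator.
restored = zlib.decompress(blob).decode().split("\n")

print(f"{len(rows)} rows: {len(raw)} raw bytes -> {len(blob)} compressed bytes")
```

Note that digit strings have roughly 3.3 bits of entropy per character, so even "random" numbers compress to well under half their text size once there are enough of them; very small batches may not shrink at all because of the compression header overhead.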

PostgreSQL - inserting string of 30,000 characters doesn't change size?

Via the command
select
relname as "Table",
pg_size_pretty(pg_total_relation_size(relid)) as "Size",
pg_size_pretty(pg_total_relation_size(relid) - pg_relation_size(relid)) as "External Size"
from pg_catalog.pg_statio_user_tables order by pg_total_relation_size(relid) desc;
I can retrieve the size of my tables (according to this article), which works. But I just came to a weird conclusion: inserting multiple values of approximately 30,000 characters each doesn't change the size.
When executing before inserting I get
Table       | Size  | External Size
------------+-------+--------------
participant | 264kb | 256kb
After inserting (btw they are base64 encoded images) and executing the select command, I get the exact same sizes returned.
I figured this couldn't be correct so I was wondering, is the command wrong? Or does PostgreSQL do something special with very large strings?
(In pgAdmin III the strings do not show in the 'view data' view but are shown when executing select base64image from participant.)
And next to this I was wondering (not my main question, but it would be nice to have it answered): is this best practice (since my app generates Base64 images), or should I e.g. convert them to an image on the backend and store the images on my server instead of in the database?
Storage management
When you insert (or update) data that requires more space on disk than it currently uses, Postgres (or actually any DBMS) will allocate that space to store the new data.
When you delete data, either by setting a column to a smaller value or by deleting rows, the space is not immediately released to the operating system. The assumption is that that space will most probably be re-used by subsequent updates or inserts, and extending a file is a relatively expensive operation, so the database tries to avoid that (again, this is something that all DBMS do).
If the space allocated is much bigger than the space that is actually used, this can influence the speed of retrieval, especially for table scans ("Seq Scan" in the execution plan), as more blocks than necessary need to be read from the hard disk. This is also known as "table bloat".
It is possible to shrink the space used with the statement VACUUM FULL. But that should only be used if you do suspect a problem with bloat. This blog post explains this in more detail.
Storing binary data in the database
If you want to store images in the database, then by all means use bytea instead of a string value. An image encoded in Base64 takes about a third more space than the raw data would.
There are pros and cons regarding the question if binary data (images, documents) should be stored in the database or not. This is a somewhat subjective decision to make and depends on a lot of external factors.
See e.g. here: Which is the best method to store files on the server (in database or storing the location alone)?
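The Base64 overhead is easy to verify offline: Base64 encodes every 3 raw bytes as 4 ASCII characters, so the encoded form is exactly 4/3 the size (ignoring any line breaks and end padding):

```python
import base64
import os

raw = os.urandom(30_000)  # stand-in for a ~30 kB image
encoded = base64.b64encode(raw)

print(f"raw: {len(raw)} bytes, base64: {len(encoded)} bytes")
print(f"overhead: {len(encoded) / len(raw):.2f}x")  # 4/3, about 1.33x
```

Stored as bytea, the raw bytes avoid that overhead entirely, and TOAST will still compress and store them out of line as needed.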

Why is my PostgreSQL table larger (in GB) than the csv it came from?

A < 4 GB csv became a 7.7 GB table in my AWS Postgres instance. And a 14 GB csv wouldn't load into 22 GB of space, I'm guessing because it is also going to double in size! Is this factor of two normal? And if so, why, and is it reliable?
There are many possible reasons:
Indexes take up space. If you have lots of indexes, especially multi-column indexes or GiST / GIN indexes, they can be a big space hog.
Some data types are represented more compactly in text form than in a table. For example, 1 consumes 1 byte in csv (or 2 if you count the comma delimiter) but if you store it in a bigint column it requires 8 bytes.
If there's a FILLFACTOR set, PostgreSQL will intentionally waste space to make later UPDATEs and INSERTs faster. If you don't know what FILLFACTOR is, then there isn't one set.
PostgreSQL has a much larger per-row overhead than CSV. In CSV, the per-row overhead is 1 or 2 bytes for the line ending. Rows in a PostgreSQL table require 24 to 28 bytes plus the data values, mainly because of the metadata required for multiversion concurrency control. So a CSV with very many narrow rows will produce a significantly bigger table than one of the same size in bytes that has fewer, wider rows.
PostgreSQL can do out-of-line storage and compression of values using TOAST. This can make big text strings significantly smaller in the database than in CSV.
You can use octet_length and pg_column_size to get PostgreSQL to tell you how big rows are. Because of TOAST out-of-line compressed storage, pg_column_size might be different for a tuple produced by a VALUES expression vs one SELECTed from a table.
You can also use pg_total_relation_size to find out how big the table for a given sample input is.
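The per-row overhead alone can account for a factor of two on narrow rows. A back-of-the-envelope sketch (using 24 bytes as a round figure for the heap tuple header and ignoring page headers, item pointers, and alignment padding):

```python
# Rough size comparison for N narrow rows: CSV file vs PostgreSQL heap.
n_rows = 10_000_000
payload = 20            # bytes of actual data per row
csv_overhead = 2        # line ending per CSV row
pg_overhead = 24        # heap tuple header per row (approximate)

csv_size = n_rows * (payload + csv_overhead)
pg_size = n_rows * (payload + pg_overhead)

print(f"CSV:      {csv_size / 1e6:.0f} MB")
print(f"Postgres: {pg_size / 1e6:.0f} MB (~{pg_size / csv_size:.1f}x)")
```

With 20-byte payloads the table comes out roughly twice the CSV size before indexes are even counted, which matches the factor of two observed in the question.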

Why do Redshift COPY queries use (much) more disk space for tables with a sort key

I have a large set of data on S3 in the form of a few hundred CSV files that are ~1.7 TB in total (uncompressed). I am trying to copy it to an empty table on a Redshift cluster.
The cluster is empty (no other tables) and has 10 dw2.large nodes. If I set a sort key on the table, the COPY command uses up all available disk space about 25% of the way through, and aborts. If there's no sort key, the copy completes successfully and never uses more than 45% of the available disk space. This behavior is consistent whether or not I also set a distribution key.
I don't really know why this happens, or if it's expected. Has anyone seen this behavior? If so, do you have any suggestions for how to get around it? One idea would be to try importing each file individually, but I'd love to find a way to let Redshift deal with that part itself and do it all in one query.
Got an answer to this from the Redshift team. The cluster needs free space of at least 2.5x the incoming data size to use as temporary space for the sort. You can upsize your cluster, copy the data, and resize it back down.
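Plugging the numbers from the question into that rule of thumb shows why the load fails; a small arithmetic sketch:

```python
# Rule of thumb from the Redshift team: the sort during COPY needs free
# space of at least 2.5x the incoming (uncompressed) data size.
incoming_tb = 1.7       # uncompressed CSV data on S3
node_tb = 0.16          # disk per dw2.large node
nodes = 10

required_tb = incoming_tb * 2.5
available_tb = node_tb * nodes

print(f"required free space: {required_tb:.2f} TB")
print(f"cluster capacity:    {available_tb:.2f} TB")
print("fits" if available_tb >= required_tb else
      "resize the cluster up before loading")
```

About 4.25 TB of temporary space is needed against roughly 1.6 TB of total capacity, so the sorted COPY cannot succeed without temporarily resizing the cluster.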
Each dw2.large node has 0.16 TB of disk space, so a cluster of 10 nodes has about 1.6 TB available in total.
You mentioned that you have around 1.7 TB of raw (uncompressed) data to be loaded into Redshift.
When you load data with the COPY command, Redshift automatically compresses the data as it loads it into the table.
Once you have loaded any table, you can see the compression encodings with this query:
Select "column", type, encoding
from pg_table_def where tablename = 'my_table_name'
First load your data into the table with no sort key and see which compression encodings are applied.
For testing, I suggest you drop and recreate the table each time you load data, so that the compression encodings are analysed each time. Once you have loaded your table using COPY, see the link below and run the script there to determine the table size:
http://docs.aws.amazon.com/redshift/latest/dg/c_analyzing-table-design.html
A sort key also occupies some disk space when you apply one and load data, so a table without a sort key needs less disk space than a table with one.
You need to make sure that compression is actually being applied to the table.
With a sort key applied, the load needs more space to store and sort the data. If you use a sort key, also check whether you can load the data in already-sorted order, so that it is stored sorted on disk; this avoids having to run VACUUM to sort the table after the data is loaded.