How to zip/unzip bytea data in PostgreSQL - postgresql

As we know, Oracle provides the UTL_COMPRESS.LZ_COMPRESS/LZ_UNCOMPRESS functions to compress and uncompress raw input. But when moving to AWS PostgreSQL, there are no equivalent functions.
My question is: how can we compress and uncompress raw data in PostgreSQL? I'm open to any solutions. Thank you in advance.

PostgreSQL does that automatically for you through a feature called TOAST. Values in a table row exceeding a size of 2000 bytes will get compressed and, if the row still exceeds 2000 bytes after that, sliced up and stored out of line. All that happens transparently.
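If you want to see the effect, here is a minimal sketch (the docs table and payload column are hypothetical names) comparing a value's logical size with its stored, possibly compressed, size:
-- pg_column_size() reports the stored (possibly TOAST-compressed) size,
-- octet_length() the logical size of the value.
CREATE TABLE docs (id int PRIMARY KEY, payload bytea);
INSERT INTO docs VALUES (1, convert_to(repeat('compressible text ', 1000), 'UTF8'));
SELECT octet_length(payload) AS logical_bytes,
       pg_column_size(payload) AS stored_bytes
FROM docs
WHERE id = 1;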

Related

PostgreSQL bytea network traffic double expected value

I'm investigating a bandwidth problem and stumbled on an issue with retrieving a bytea value. I tested this with PostgreSQL 10 and 14, the respective psql clients and the psycopg2 client library.
The issue is that if the size of a bytea value is e.g. 10 MB (which I can confirm by doing select length(value) from table where id=1), and I do select value from table where id=1, then the amount of data transferred over the socket is about 20 MB. Note that the value in the database is pre-compressed (so high entropy), and the table is set not to compress the bytea value, to avoid double work.
I can't find any obvious encoding issue since it's all just bytes. I can understand that the psql CLI may negotiate some encoding so it can print the result, but psycopg2 definitely doesn't do that, and I see the same behaviour there.
I tested the same scenario with a text field, and that nearly worked as expected. I started with a copy-paste of lorem ipsum and it transferred the correct amount of data, but when I changed the text to random extended-ASCII values (higher entropy again), it transferred more data than it should have. I have compression disabled for all my columns, so I don't understand why that would happen.
Any ideas as to why this would happen?
That is normal. By default, values are transferred as strings, so a bytea value is rendered as hexadecimal digits, which doubles its size.
As a workaround, you could transfer such data in binary mode. The frontend-backend protocol and the C library offer support for that, but it will depend on your client API whether you can make use of that or not.
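As a rough illustration, using the value and table names from the question ("table" needs quoting because it is a reserved word): the hex rendering doubles the byte count, while binary COPY avoids that on the server side.
-- The text protocol renders bytea as hex, so 4 raw bytes become 8 characters:
SELECT octet_length('\xdeadbeef'::bytea) AS raw_bytes,
       length(encode('\xdeadbeef'::bytea, 'hex')) AS hex_chars;
-- Possible server-side workaround: stream the value in binary COPY format,
-- if your client library can consume it.
COPY (SELECT value FROM "table" WHERE id = 1) TO STDOUT WITH (FORMAT binary);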

Will gzipping + storing as bytea save more disk space than storing as text?

I have a table containing 30 million rows, and one of the columns is currently a text column. The column is populated with random strings between 2 and 10 kB in size. I don't need to search the strings directly.
I'm considering gzipping the strings before saving them (typically reducing them 2x in size) and saving them in a bytea column instead.
I have read that PostgreSQL does some compression of text columns by default, so I'm wondering: will there be any actual disk space reduction as a result of the suggested change?
I'm running PostgreSQL 9.3.
PostgreSQL stores text columns that exceed about 2000 bytes in a TOAST table and compresses the data.
The compression is fast, but not very good, so you may see some savings if you use a different compression method. Since the stored values are not very large, the savings will probably be small.
If you want to go that way, you should disable compression on that already compressed column:
ALTER TABLE tab
ALTER bin_col SET STORAGE EXTERNAL;
I'd recommend that you go with PostgreSQL's standard compression and keep things simple, but the best thing would be for you to run a test and see if you get a benefit from using custom compression.
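If you do run such a test, one rough way to compare is to load the same sample rows into two throwaway tables (hypothetical names below), one holding plain text with default TOAST compression and one holding gzipped bytea with STORAGE EXTERNAL, and compare their total sizes, TOAST included:
SELECT pg_size_pretty(pg_total_relation_size('sample_text'))    AS text_variant,
       pg_size_pretty(pg_total_relation_size('sample_gzipped')) AS gzipped_variant;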

PostgreSQL - inserting string of 30,000 characters doesn't change size?

Via the command
select
relname as "Table",
pg_size_pretty(pg_total_relation_size(relid)) as "Size",
pg_size_pretty(pg_total_relation_size(relid) - pg_relation_size(relid)) as "External Size"
from pg_catalog.pg_statio_user_tables order by pg_total_relation_size(relid) desc;
I can retrieve the size of my tables (according to this article), which works. But I just came to a weird conclusion: inserting multiple values that contain approximately 30,000 characters each doesn't change the size.
When executing this before inserting, I get:
   Table     |  Size  | External Size
-------------+--------+---------------
 participant | 264 kB | 256 kB
After inserting (by the way, they are Base64-encoded images) and executing the select command, I get the exact same sizes returned.
I figured this couldn't be correct, so I was wondering: is the command wrong, or does PostgreSQL do something special with very large strings?
(In pgAdmin III the strings do not show in the 'view data' view, but they are shown when executing select base64image from participant.)
Besides this, I was wondering (not my main question, but it would be nice to have it answered) whether this is best practice (since my app generates Base64 images), or whether I should, for example, convert them to image files on the backend and store them on my server instead of in the database.
Storage management
When you insert (or update) data that requires more space on disk than it currently uses, Postgres (or actually any DBMS) will allocate that space to store the new data.
When you delete data, either by setting a column to a smaller value or by deleting rows, the space is not immediately released to the operating system. The assumption is that this space will most probably be re-used by subsequent updates or inserts, and extending a file is a relatively expensive operation, so the database tries to avoid that (again, this is something that all DBMS do).
If the space allocated is much bigger than the space actually used, this can slow down retrieval, especially for table scans ("Seq Scan" in the execution plan), as more blocks than necessary need to be read from the hard disk. This is also known as "table bloat".
It is possible to shrink the space used with the statement VACUUM FULL. But that should only be used if you suspect a problem with bloat. This blog post explains this in more detail.
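If you do conclude that bloat is the problem here, a minimal sketch against the participant table from the question:
-- Rewrites the whole table and reclaims dead space; it takes an exclusive lock
-- and needs temporary extra disk space, so use it only when bloat is confirmed.
VACUUM FULL participant;
SELECT pg_size_pretty(pg_total_relation_size('participant')) AS total_size;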
Storing binary data in the database
If you want to store images in the database, then by all means use bytea instead of a string value. An image encoded in Base64 takes about a third more space than the raw binary would.
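A minimal sketch of such a conversion, using the base64image column and participant table from your question (the new image column is a name I made up):
ALTER TABLE participant ADD COLUMN image bytea;
-- decode() turns the Base64 text back into raw bytes, which are roughly 25% smaller.
UPDATE participant SET image = decode(base64image, 'base64');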
There are pros and cons to the question of whether binary data (images, documents) should be stored in the database or not. This is a somewhat subjective decision that depends on a lot of external factors.
See e.g. here: Which is the best method to store files on the server (in database or storing the location alone)?

Postgres database dump size larger than physical size

I just made a pg_dump backup of my database and its size is about 95 GB, but the size of the directory /pgsql/data is about 38 GB.
I ran a VACUUM FULL and the size of the dump did not change. My PostgreSQL installation is version 9.3.4, on a CentOS release 6.3 server.
Is it weird that the dump is so much larger than the physical size, or can I consider this normal?
Thanks in advance!
The size of pg_dump output and the size of a Postgres cluster (aka 'instance') on disk have very, very little correlation. Consider:
pg_dump has 3 different output formats, 2 of which allow compression on-the-fly
pg_dump output contains only the schema definition and the raw data, in a text (or possibly "binary") format. It contains no index data.
The text/"binary" representation of different data types can be larger or smaller than actual data stored in the database. For example, the number 1 stored in a bigint field will take 8 bytes in a cluster, but only 1 byte in pg_dump.
This is also why VACUUM FULL had no effect on the size of the backup.
Note that a Point In Time Recovery (PITR) based backup is entirely different from a pg_dump backup. PITR backups are essentially copies of the data on disk.
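To illustrate the third point above, compare the on-disk size of a bigint with the length of its text form as it would appear in a plain dump:
SELECT pg_column_size(1::bigint)  AS on_disk_bytes,   -- 8
       length((1::bigint)::text)  AS dump_text_chars; -- 1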
Postgres does compress its data in certain situations, using a technique called TOAST:
PostgreSQL uses a fixed page size (commonly 8 kB), and does not allow tuples to span multiple pages. Therefore, it is not possible to store very large field values directly. To overcome this limitation, large field values are compressed and/or broken up into multiple physical rows. This happens transparently to the user, with only small impact on most of the backend code. The technique is affectionately known as TOAST (or "the best thing since sliced bread").

Do Redshift column encodings affect query execution speed?

When creating data tables in Amazon Redshift, you can specify various encodings such as MOSTLY32 or BYTEDICT or LZO. Those are the compressions used when storing the columnar values on disk.
I am wondering if my choice of encoding is supposed to make a difference in query execution times. For example, if I make a column BYTEDICT would that make a difference over LZO when it comes to SELECTs, GROUP BYs or FILTERs?
Yes. The compression encoding used translates to the amount of disk storage. Generally, the lower the storage, the better the query performance.
But which encoding is more beneficial to you depends on your data type and its distribution. There is no guarantee that LZO will always be better than BYTEDICT or vice versa. In my experience, I usually load some sample data into the intended table, then run an ANALYZE COMPRESSION, and go with whatever Redshift suggests. That has worked for me.
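For reference, that suggestion step is a single statement in Redshift (the table name is a placeholder):
-- Run after loading a representative sample; Redshift returns a suggested
-- encoding per column, which you can then apply in your CREATE TABLE DDL.
ANALYZE COMPRESSION my_table;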
Amazon has actually released a Python script that can apply this automatically to your database. You can find the script here: https://github.com/awslabs/amazon-redshift-utils/blob/master/src/ColumnEncodingUtility/analyze-schema-compression.py
A bit late, but likely useful to anyone taking a look here:
Amazon can now decide on the best compression to use (see Loading Tables with Automatic Compression) if you are using a COPY command to load your table and there is no existing compression defined on the table.
You just have to add COMPUPDATE ON to your COPY command.
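A sketch of such a COPY, with a placeholder S3 path and IAM role:
-- COMPUPDATE ON lets Redshift pick column encodings automatically when the
-- target table is empty and has no encodings defined.
COPY my_table
FROM 's3://my-bucket/data/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
COMPUPDATE ON;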