Shall I SET STORAGE PLAIN on fixed-length bytea PK column? - postgresql

My Postgres table's primary key is a SHA1 checksum (always 20 bytes) stored in a bytea column (because Postgres doesn't have fixed-length binary types).
Should I run ALTER TABLE t ALTER COLUMN c SET STORAGE PLAIN to keep Postgres from compressing my PK/FK and/or moving it out of line (TOAST), for the sake of lookup and join performance? And why (or why not)?

I would say that that is a micro-optimization that will probably not have a measurable effect.
First, PostgreSQL only considers compressing and slicing values if the row exceeds 2000 bytes, so there will only be an effect at all if your rows routinely exceed that size.
Then, even if the primary key column gets toasted, you will probably only be able to measure a difference if you select a large number of rows in a single table scan. Fetching only a few rows by index won't make a big difference.
I'd benchmark both approaches, but I'd assume that it will be hard to measure a difference. I/O and other costs will probably hide the small extra CPU time required for decompression (remember that the row has to be large for TOAST to kick in in the first place).

Related

oid and bytea are creating system tables

oid -> creates a table pg_largeobject and stores data in there
bytea -> if the compressed data would still exceed 2000 bytes, PostgreSQL splits variable length data types in chunks and stores them out of line in a special “TOAST table” according to https://www.cybertec-postgresql.com/en/binary-data-performance-in-postgresql/
I don't want any other table for large data. I want to store the data in a column of the table I defined. Is that possible?
It is best to avoid Large Objects.
With bytea you can prevent PostgreSQL from storing data out of line in a TOAST table by changing the column definition like
ALTER TABLE tab ALTER col SET STORAGE MAIN;
Then PostgreSQL will compress that column but keep it in the main table.
Since the block size in PostgreSQL is 8kB, and one row is always stored in a single block, that will limit the size of your table rows to somewhat under 8kB (there is a block header and other overhead).
I think that you are trying to solve a non-problem, and your request to not store large data out of line is unreasonable.

Encoding a Postgres UUID in Amazon Redshift

We have a couple of entities which are being persisted into Amazon Redshift for reporting purposes, and these entities have a relationship between them. The source tables in Postgres are related via a foreign key with a UUID datatype, which is not supported in Redshift.
One option is to encode the UUID as a 128-bit signed integer. The Redshift documentation refers to the ability to create NUMBER(38,0), and to the ability to create 128-bit numbers.
But 2^128 = 340,282,366,920,938,463,463,374,607,431,768,211,456, which is 39 digits (thanks, Wikipedia). So despite what the docs say, you cannot store the full 128 bits / 39 digits of precision in Redshift. How do you actually make a full 128-bit number column in Redshift?
In short, the real question behind this is - what is Redshift best practice for storing & joining tables which have UUID primary keys?
Redshift joins will perform well even with a VARCHAR key, so that's where I would start.
The main factor for join performance will be co-locating the rows onto the same compute node. To achieve this you should declare the UUID column as the distribution key on both tables.
Alternatively, if one of the tables is fairly small (<= ~1 million rows), then you can declare that table as DISTSTYLE ALL and choose some other dist key for the larger table.
If you have co-located the join and wish to optimize further then you could try splitting the UUID value into 2 BIGINT columns, one for the top 64 bits and another for the bottom 64. Even half of the UUID is likely to be unique and then you can use the second column as a "tie breaker".
cf. "Amazon Redshift Engineering’s Advanced Table Design Playbook: Preamble, Prerequisites, and Prioritization"
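The two-BIGINT split suggested above can be sketched in Python as follows. This is only an illustration of the arithmetic, not Redshift code: the helper names split_uuid and join_uuid are made up for this example, and it assumes the halves are stored as signed 64-bit values, which is what BIGINT holds. It also confirms the digit count from the question.

```python
import uuid

MASK64 = (1 << 64) - 1  # low 64 bits set


def to_signed64(value):
    """Reinterpret an unsigned 64-bit value as signed two's complement."""
    return value - (1 << 64) if value >= (1 << 63) else value


def split_uuid(u):
    """Return the (high, low) signed 64-bit halves of a UUID (hypothetical helper)."""
    n = u.int  # the UUID as an unsigned 128-bit integer
    return to_signed64(n >> 64), to_signed64(n & MASK64)


def join_uuid(high, low):
    """Rebuild the UUID from the two signed halves (hypothetical helper)."""
    return uuid.UUID(int=((high & MASK64) << 64) | (low & MASK64))


# 2^128 needs 39 decimal digits, one more than a 38-digit decimal can hold.
print(len(str(2 ** 128)))  # 39

u = uuid.UUID("123e4567-e89b-12d3-a456-426614174000")
high, low = split_uuid(u)
assert join_uuid(high, low) == u  # the split is lossless
```

Because the split is lossless, the pair (high, low) can serve as a composite join key, with the first column doing most of the work and the second acting as the tie breaker.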

Database size doubles on update to single new column

I have a fairly simple table used to drive tile processing for a web mapping application.
    Column     |           Type           |                        Modifiers
---------------+--------------------------+---------------------------------------------------------
 id            | integer                  | not null default nextval('wmts_tiles_id_seq'::regclass)
 tile_matrix   | integer                  |
 rowx          | integer                  |
 coly          | integer                  |
 geom          | geometry(Geometry,27700) |
 processed     | smallint                 |
Indexes:
    "wmts_tiles_pkey" PRIMARY KEY, btree (id)
    "wmts_tiles_wmts_tile_matrix_x_y" UNIQUE CONSTRAINT, btree (tile_matrix, rowx, coly)
    "ix_spatial_wmts_tiles" gist (geom)
    "ix_tile_matrix_processed" btree (tile_matrix, processed)
with various indexes (one spatial) and constraints as shown. This table has 240 million rows, and pg_relation_size and pg_total_relation_size indicate that it is 66 GB, of which half is indexes and half is data.
I added a single date column and then ran an update to populate it:
alter table wmts_tiles add column currency_date date;
update wmts_tiles set currency_date = '2014-05-01';
After this, the size went to 133 GB, i.e., it doubled. Once I ran VACUUM FULL on the table, the size shrank back to 67 GB, i.e., 1 GB larger than before, which is what you would expect after adding a 4-byte field (date) to 240 million rows.
I understand that there will often be a reasonable percentage of dead rows in a table where a lot of inserts and deletes are happening, but why would a table double in size under one single update, and is there anything I can do to prevent this? Note that this update was the only transaction running and the table had just been dumped and recreated, so the data pages and indexes were in a compact state prior to the update.
EDIT: I have seen the question "Why does my postgres table get much bigger under update?" and I understand that, to support MVCC, the table needs to be rewritten while it is being updated. What I don't understand is why it stays twice the size until I explicitly run VACUUM FULL.
Most of this is covered by this prior question.
The reason it doesn't shrink is that PostgreSQL doesn't know you want it to. It's inefficient to allocate disk space (grow a table) and then release it (shrink the table) repeatedly. PostgreSQL prefers to grow a table then keep the disk space, marking it empty and ready for re-use.
Not only is the allocation and release cycle inefficient, but the OS also only permits release of space at the end of a file*. So PostgreSQL would have to move all the rows from the end of the file, which was the only place it could write them when you did the update, to the now-free space at the start. It couldn't do this as it went because it couldn't overwrite any of the old data until the update transaction committed.
If you know you won't need the space again, you can use VACUUM FULL to compact the table and release the space.
There's no periodic vacuum full done by autovacuum, partly because it might be quite bad for performance if the table just has to be expanded again, partly because vacuum full is quite I/O intensive (so it'll slow everything else down) and partly because vacuum full requires an access exclusive lock that prevents any concurrent access to the table. PostgreSQL would need an incremental table repack command/feature, which it doesn't have yet, for this to be possible with autovacuum. Patches are welcome... though this part of the code is very complex and getting it right would not be a beginner job.
* Yes, I know you can mark large regions of zeroes within a file as sparse regions on some platforms. Feel free to submit a PostgreSQL patch for that. In practice you'll have to compact anyway, because you don't find large regions with free pages in the table normally. Plus you'd have to deal with page headers.
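The figures in the question are consistent with a very rough model in which the update leaves every old row version in place and writes one new, slightly larger version next to it. The sketch below is only back-of-the-envelope arithmetic under that assumption: it ignores alignment padding and assumes the indexes grow in the same proportion as the heap. The function names are made up for this example.

```python
GIB = 1024 ** 3


def column_gib(rows, bytes_per_value):
    """Raw size of one fixed-width column in GiB, ignoring alignment padding."""
    return rows * bytes_per_value / GIB


def size_after_full_update(size_before_gb, added_gb):
    """Old row versions remain; each row also gets a new version carrying the new column."""
    return size_before_gb + (size_before_gb + added_gb)


added = column_gib(240_000_000, 4)  # 240 million rows, 4-byte date
print(round(added, 2))                             # 0.89
print(round(size_after_full_update(66, added)))    # 133
```

After VACUUM FULL only the new versions remain, which is the 66 GB plus roughly 1 GB reported in the question.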

Does the order of columns in a Postgres table impact performance?

In Postgres does the order of columns in a CREATE TABLE statement impact performance? Consider the following two cases:
CREATE TABLE foo (
a TEXT,
B VARCHAR(512),
pkey INTEGER PRIMARY KEY,
bar_fk INTEGER REFERENCES bar(pkey),
C bytea
);
vs.
CREATE TABLE foo2 (
pkey INTEGER PRIMARY KEY,
bar_fk INTEGER REFERENCES bar(pkey),
B VARCHAR(512),
a TEXT,
C bytea
);
Will performance of foo2 be better than foo because of better byte alignment for the columns? When Postgres executes CREATE TABLE does it follow the column order specified or does it re-organize the columns in optimal order for byte alignment or performance?
Question 1
Will the performance of foo2 be better than foo because of better byte
alignment for the columns?
Yes, the order of columns can have a small impact on performance. Type alignment is the more important factor, because it affects the footprint on disk. You can minimize storage size (play "column tetris") and squeeze more rows on a data page - which is the most important factor for speed.
Normally, it's not worth bothering. With an extreme example like in this related answer you get a substantial difference:
Calculating and saving space in PostgreSQL
Type alignment details:
Making sense of Postgres row sizes
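To make "column tetris" concrete, here is a minimal Python sketch of how alignment padding adds up. It is a simplified model, not PostgreSQL's actual code: it only handles fixed-width columns, assumes smallint is 2 bytes with 2-byte alignment and bigint is 8 bytes with 8-byte alignment, assumes the row's data area is padded to a multiple of 8 bytes (typical on 64-bit platforms), and ignores the tuple header and NULL bitmap. The column names are hypothetical.

```python
# (length in bytes, required alignment in bytes) for two fixed-width types
SMALLINT = (2, 2)
BIGINT = (8, 8)


def round_up(offset, align):
    """Round offset up to the next multiple of align."""
    return -(-offset // align) * align


def data_size(columns, maxalign=8):
    """Bytes used by the column data of one row, including alignment padding."""
    offset = 0
    for _name, (length, align) in columns:
        offset = round_up(offset, align)  # pad so the column starts aligned
        offset += length
    return round_up(offset, maxalign)     # the row as a whole is padded too


interleaved = [("a", SMALLINT), ("b", BIGINT), ("c", SMALLINT), ("d", BIGINT)]
sorted_by_size = [("b", BIGINT), ("d", BIGINT), ("a", SMALLINT), ("c", SMALLINT)]

print(data_size(interleaved))     # 32
print(data_size(sorted_by_size))  # 24
```

In this model the same four columns take 32 or 24 bytes per row depending only on their order, which is the effect the linked answers exploit.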
The other factor is that retrieving column values is slightly faster if you have fixed size columns first. I quote the manual here:
To read the data you need to examine each attribute in turn. First
check whether the field is NULL according to the null bitmap. If it
is, go to the next. Then make sure you have the right alignment. If
the field is a fixed width field, then all the bytes are simply
placed. If it's a variable length field (attlen = -1) then it's a bit
more complicated. All variable-length data types share the common
header structure struct varlena, which includes the total length of
the stored value and some flag bits.
There is an open TODO item to allow reordering of column positions in the Postgres Wiki, partly for these reasons.
Question 2
When Postgres executes a CREATE TABLE does it follow the column order
specified or does it re-organize the columns in optimal order for byte
alignment or performance?
Columns are stored in the defined order, the system does not try to optimize.
I fail to see any relevance of column order to TOAST tables like another answer seems to imply.
As far as I understand, PostgreSQL adheres to the order in which you enter the columns when saving records. Whether this affects performance is debatable. PostgreSQL stores all table data in pages, each 8 kB in size. 8 kB is the default, but it can be changed at compile time.
Each row in the table will take up space within the page. Since your table definition contains variable-length columns, a page can hold a variable number of records. What you want to do is make sure you can fit as many records into one page as possible. That is why you will notice performance degradation when a table has a huge number of columns or the column sizes are huge.
This being said, declaring a varchar(8192) does not mean the page will be filled up with one record, but declaring a CHAR(8192) will use up one whole page irrespective of the amount of data in the column.
There is one more thing to consider when declaring TOASTable types such as TEXT columns. These are columns that could exceed the maximum page size. A table that has TOASTable columns will have an associated TOAST table to store the data and only a pointer to the data is stored with the table. This can impact performance, but can be improved with proper indexes on the TOASTable columns.
To conclude, I would have to say that the order of the columns does not play much of a role in the performance of a table. Most queries use indexes, which are stored separately, to retrieve records, so column order matters little there. It comes down to how many pages need to be read to retrieve the data.

How many records can I store in 5 MB of PostgreSQL on Heroku?

I'm going to store records in a single table with 2 fields:
id -> 4 characters
password_hash -> 64 characters
How many records like the one above will I be able to store in a 5 MB PostgreSQL database on Heroku?
P.S.: given a single table with x columns and a length of y - how can I calculate the space it will take in a database?
Disk space occupied
Calculating the space on disk is not trivial. You have to take into account:
The overhead per table. Small, basically the entries in the system catalog.
The overhead per row (HeapTupleHeader) and per data page (PageHeaderData). Details about page layout in the manual.
Space lost to column alignment, depending on data types.
Space for a NULL bitmap. Effectively free for tables of 8 columns or less, irrelevant for your case.
Dead rows after UPDATE / DELETE. (Until the space is eventually vacuumed and reused.)
Size of index(es). You'll have a primary key, right? Index size is similar to that of a table with just the indexed columns and less overhead per row.
The actual space requirement of the data, depending on respective data types. Details for character types (incl. fixed length types) in the manual:
The storage requirement for a short string (up to 126 bytes) is 1 byte
plus the actual string, which includes the space padding in the case
of character. Longer strings have 4 bytes of overhead instead of 1
More details for all types in the system catalog pg_type.
The database encoding, in particular for character types. UTF-8 uses up to four bytes to store one character (but 7-bit ASCII characters always occupy just one byte, even in UTF-8).
Other small things that may affect your case, like TOAST - which should not affect you with 64-character strings.
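The string-storage rule quoted above, combined with the UTF-8 point, can be sketched in Python like this. It is an estimate only: the function name is made up for this example, the database encoding is assumed to be UTF-8, and compression, TOAST and alignment padding are ignored.

```python
def text_storage_bytes(value):
    """Estimated on-disk bytes for one text value in a UTF-8 database."""
    payload = len(value.encode("utf-8"))     # bytes, not characters
    overhead = 1 if payload <= 126 else 4    # short vs. long string header
    return payload + overhead


print(text_storage_bytes("abcd"))      # 5   (4 ASCII characters + 1 byte)
print(text_storage_bytes("x" * 64))    # 65  (64 ASCII characters + 1 byte)
print(text_storage_bytes("é" * 64))    # 132 (128 bytes in UTF-8 + 4 bytes)
```

The last line shows why character count alone is not enough: 64 non-ASCII characters already cross the 126-byte limit for the short header.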
Calculate with test case
A simple method to find an estimate is to create a test table, fill it with dummy data and measure with database object size functions:
SELECT pg_size_pretty(pg_relation_size('tbl'));
Including indexes:
SELECT pg_size_pretty(pg_total_relation_size('tbl'));
See:
Measure the size of a PostgreSQL table row
A quick test shows the following results:
CREATE TABLE test(a text, b text);
INSERT INTO test -- quick fake of matching rows
SELECT chr((g/1000 +32)) || to_char(g%1000, 'FM000')
, repeat (chr(g%120 + 32), 64)
FROM generate_series(1,50000) g;
SELECT pg_size_pretty(pg_relation_size('test')); -- 5640 kB
SELECT pg_size_pretty(pg_total_relation_size('test')); -- 5648 kB
After adding a primary key:
ALTER TABLE test ADD CONSTRAINT test_pkey PRIMARY KEY(a);
SELECT pg_size_pretty(pg_total_relation_size('test')); -- 6760 kB
So, I'd expect a maximum of around 44k rows without and around 36k rows with primary key.
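The extrapolation from the test table can be written down as a short Python sketch. It assumes that size grows linearly with row count and treats the 5 MB limit as 5000 kB; both are assumptions made here, and the function name is made up for this example. The measured sizes are the ones from the test above.

```python
def rows_that_fit(budget_kb, measured_kb, measured_rows):
    """Scale a measured table size linearly to a storage budget."""
    return budget_kb * measured_rows // measured_kb


BUDGET_KB = 5000        # assumed reading of "5 MB"
MEASURED_ROWS = 50000   # rows inserted in the test

without_pk = rows_that_fit(BUDGET_KB, 5648, MEASURED_ROWS)  # total size, no primary key
with_pk = rows_that_fit(BUDGET_KB, 6760, MEASURED_ROWS)     # total size, with primary key
print(without_pk, with_pk)
```

The results land close to the estimates above. Dead rows from later updates or deletes would reduce the number further.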