I have a table with almost six-hundred columns that contains raw data from an outside source, each row being a single transaction.
How can I order the columns to make its on disk and in memory layout space efficient?
Related
oid -> creates a table pg_largeobjects and stores data in there
bytea -> if the compressed data would still exceed 2000 bytes, PostgreSQL splits variable length data types in chunks and stores them out of line in a special “TOAST table” according to https://www.cybertec-postgresql.com/en/binary-data-performance-in-postgresql/
I don't want any other table for large data I want to store them in a column in my defined table, is that possible?
It is best to avoid Large Objects.
With bytea you can prevent PostgreSQL from storing data out of line in a TOAST table by changing the column definition like
ALTER TABLE tab ALTER col SET STORAGE MAIN;
Then PostgreSQL will compress that column but keep it in the main table.
Since the block size in PostgreSQL is 8kB, and one row is always stored in a single block, that will limit the size of your table rows to somewhat under 8kB (there is a block header and other overhead).
I think that you are trying to solve a non-problem, and your request to not store large data out of line is unreasonable.
My Postgres table's primary key is a SHA1 checksum (always 20 bytes) stored in a bytea column (because Postgres doesn't have fixed-length binary types).
Shall I ALTER TABLE t ALTER COLUMN c SET STORAGE PLAIN not to let Postgres compress and/or outsource (TOAST) my PK/FK for the sake of lookup and join performance? And why (not)?
I would say that that is a micro-optimization that will probably not have a measurable effect.
First, PostgreSQL only considers compressing and slicing values if the row exceeds 2000 bytes, so there will only be an effect at all if your rows routinely exceed that size.
Then, even if the primary key column gets toasted, you will probably only be able to measure a difference if you select a large number of rows in a single table scan. Fetching only a few rows by index won't make a big difference.
I'd benchmark both approaches, but I'd assume that it will be hard to measure a difference. I/O and other costs will probably hide the small extra CPU time required for decompression (remember that the row has to be large for TOAST to kick in in the first place).
I have a stream of data that I can replay any time to reload data into a Postgres table. Lets say I have millions of rows in my table and I add a new column. Now I can replay that stream of data to map a key in the data to the column name that I have just added.
The two options I have are:
1) Truncate and then Insert
2) Upsert
Which would be a better option in terms of performance?
The way PostgreSQL does multiversioning, every update creates a new row version. The old row version will have to be reclaimed later.
This means extra work and tables with a lot of empty space in them.
On the other hand, TRUNCATE just throws away the old table, which is very fast.
You can gain extra performance by using COPY instead of INSERT to load bigger amounts of data.
I have a table with files and various relations to this table, files are stored as bytea. I want to free up space occupied by old files (according to timestamp), however the rows should still be present in the table.
Is it enough to set null to bytea field? Will the data be actually deleted from the table this way?
In PostgreSQL, updating a row creates a new tuple (row version), and the old one is left to be deleted by autovacuum.
Also, larger bytea attributes will be stored out-of-line in the TOAST table that belongs to the table.
When you set the bytea attribute to NULL (which is the right thing to do), two things will happen:
The main table will become bigger because of all the new tuples created by the UPDATE. Autovacuum will free the space, but not shrink the table (the empty space can be re-used by future data modifications).
Entries in the TOAST table will be deleted. Again, autovacuum will free the space, but the table won't shrink.
So what you will actually observe is that after the UPDATE, your table uses more space than before.
You can get rid of all that empty space by running VACUUM (FULL) on the table, but that will block concurrent access to the table for the duration of the operation, so be ready to schedule some down time (you'll probably do that for the UPDATE anyway).
In Postgres does the order of columns in a CREATE TABLE statement impact performance? Consider the following two cases:
CREATE TABLE foo (
a TEXT,
B VARCHAR(512),
pkey INTEGER PRIMARY KEY,
bar_fk INTEGER REFERENCES bar(pkey),
C bytea
);
vs.
CREATE TABLE foo2 (
pkey INTEGER PRIMARY KEY,
bar_fk INTEGER REFERENCES bar(pkey),
B VARCHAR(512),
a TEXT,
C bytea
);
Will performance of foo2 be better than foo because of better byte alignment for the columns? When Postgres executes CREATE TABLE does it follow the column order specified or does it re-organize the columns in optimal order for byte alignment or performance?
Question 1
Will the performance of foo2 be better than foo because of better byte
alignment for the columns?
Yes, the order of columns can have a small impact on performance. Type alignment is the more important factor, because it affects the footprint on disk. You can minimize storage size (play "column tetris") and squeeze more rows on a data page - which is the most important factor for speed.
Normally, it's not worth bothering. With an extreme example like in this related answer you get a substantial difference:
Calculating and saving space in PostgreSQL
Type alignment details:
Making sense of Postgres row sizes
The other factor is that retrieving column values is slightly faster if you have fixed size columns first. I quote the manual here:
To read the data you need to examine each attribute in turn. First
check whether the field is NULL according to the null bitmap. If it
is, go to the next. Then make sure you have the right alignment. If
the field is a fixed width field, then all the bytes are simply
placed. If it's a variable length field (attlen = -1) then it's a bit
more complicated. All variable-length data types share the common
header structure struct varlena, which includes the total length of
the stored value and some flag bits.
There is an open TODO item to allow reordering of column positions in the Postgres Wiki, partly for these reasons.
Question 2
When Postgres executes a CREATE TABLE does it follow the column order
specified or does it re-organize the columns in optimal order for byte
alignment or performance?
Columns are stored in the defined order, the system does not try to optimize.
I fail to see any relevance of column order to TOAST tables like another answer seems to imply.
As far as I understand, PostgreSQL adheres to the order in which you enter the columns when saving records. Whether this affects performance is debatable. PostgreSQL stores all table data in pages each being 8kb in size. 8kb is the default, but it can be change at compile time.
Each row in the table will take up space within the page. Since your table definition contains variable columns, a page can consist of a variable amount of records. What you want to do is make sure you can fit as many records into one page as possible. That is why you will notice performance degradation when a table has a huge amount of columns or column sizes are huge.
This being said, declaring a varchar(8192) does not mean the page will be filled up with one record, but declaring a CHAR(8192) will use up one whole page irrespective of the amount of data in the column.
There is one more thing to consider when declaring TOASTable types such as TEXT columns. These are columns that could exceed the maximum page size. A table that has TOASTable columns will have an associated TOAST table to store the data and only a pointer to the data is stored with the table. This can impact performance, but can be improved with proper indexes on the TOASTable columns.
To conclude, I would have to say that the order of the columns do not play much of role in the performance of a table. Most queries utilise indexes which are store separately to retrieve records and therefore column order is negated. It comes down to how many pages needs to be read to retrieve the data.