What's the impact of TOAST on performance? (adding a hundred varchar columns) - postgresql

Consider a table with the following data:
id bigint Auto Increment
name character varying(255) NULL
category character varying(255) NULL
english character varying(255) NULL
french character varying(255) NULL
pivot character varying(255) NULL
credits character varying(255) NULL
hash character varying(20) NULL
The english column contains data of the following size (in bytes): max 116, min 5, average 42, median: 40.
The number of rows in the table is around 30,000 and will hardly change.
The new 107 columns will be translations of the English.
Will adding 107 columns hurt performance?
The Postgres site says the maximum number of columns on a Postgres table is
250-1600 depending on column types
and
The maximum number of columns for a table is further reduced as the tuple being stored must fit in a single 8192-byte heap page
Will the data fall under that limit?
Size of the largest row
What is the actual storage size of the table's rows? pg_column_size is the
Number of bytes used to store a particular value (possibly compressed)
SELECT id, pg_column_size(t.*) FROM the_table AS t ORDER BY pg_column_size(t.*) DESC;
-- Some stats derived from the query:
-- Min 87 bytes
-- Max 514 bytes
-- Average 216 bytes
-- Median: 209 bytes
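In case it helps to reproduce those numbers, they can be computed in a single aggregate query; this is only a sketch, with the_table standing in for the real table name:
-- Sketch: min/max/average/median of the stored row size (the_table is a placeholder):
SELECT min(pg_column_size(t.*))                                         AS min_bytes,
       max(pg_column_size(t.*))                                         AS max_bytes,
       round(avg(pg_column_size(t.*)))                                  AS avg_bytes,
       percentile_disc(0.5) WITHIN GROUP (ORDER BY pg_column_size(t.*)) AS median_bytes
FROM the_table AS t;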
But no compression is actually happening here, because:
When a row that is to be stored is "too wide" (the threshold for that is 2KB by default), the TOAST mechanism first attempts to compress any wide field values. If that isn't enough to get the row under 2KB, it breaks up the wide field values into chunks that get stored in the associated TOAST table. Each original field value is replaced by a small pointer that shows where to find this "out of line" data in the TOAST table. TOAST will attempt to squeeze the user-table row down to 2KB in this way, but as long as it can get below 8KB, that's good enough and the row can be stored successfully.
Compression would start to kick in once the rows get wider, i.e. once those new columns are added.
It's unclear to me what the compression ratio would be for such data.
I wonder how effective it'll be on lots of short multilingual sentences. Also, I tried to find the exact name of the compression algorithm used by Postgres: the docs say "the LZ family of compression techniques", but which one – LZ77? LZ78? A twist on one of them?
The best way to find out how much compression will achieve here is certainly to try… once I've got the translations. But I'd rather get an idea of it beforehand, as I won't get all the data at once.
TOAST'ed?
If the size of a row goes beyond the page size limit, then Postgres will rely on TOAST not just to compress the data but also to split it up and store it "out of line".
I understand this will increase fetch times for those rows that don't fit… But what's the impact of TOAST on performance? Is it negligible for such a use case?
Bottom-line
At the end of the day…
Is adding those 107 columns a good idea, or should I use a different approach?
If fine, how important is it to be fetching only those columns the user needs? (No user will need all of them.)
Or am I approaching this the wrong way, i.e. is it a case of premature optimization where I'd have been better off just adding the columns and only investigating later if I run into problems?
Using Postgres 9.6. Upgrading is an option if needed.

The best way to find out how much compression will achieve here is certainly to try… once I've got the translations. But I'd rather get an idea of it beforehand, as I won't get all the data at once.
I'd just copy the English version into each of the 107 columns. That should be good enough to get some useful findings. You might worry that the repetition would cause the compression to be idiosyncratic; but each value is compressed in isolation, so it won't "know" that it is identical to some other value.
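As a rough sketch of that experiment (the_table, french and german below are placeholders for the real table and translation columns), one could copy the English text into the new columns and re-measure the row sizes:
-- Rough sketch of the experiment (placeholder names):
UPDATE the_table SET french = english, german = english;    -- ...and so on for all new columns
VACUUM FULL the_table;                                       -- rewrite the table so dead row versions don't skew the numbers
SELECT min(pg_column_size(t.*))        AS min_bytes,
       max(pg_column_size(t.*))        AS max_bytes,
       round(avg(pg_column_size(t.*))) AS avg_bytes
FROM the_table AS t;
SELECT pg_size_pretty(pg_total_relation_size('the_table')); -- includes any TOAST table and indexes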
It's unclear to me what the compression ratio would be for such data.
Not very much. For example, the paragraph of yours I quoted first doesn't get any benefit from compression (when I copied it into 107 other columns). Short segments of ordinary text do not have enough repetition in them to be very compressible. Translating them to other languages is unlikely to change this.
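If you want to check whether a particular stored value actually got compressed, one comparison that works is pg_column_size() against octet_length(): the former reports the stored (possibly compressed) size, the latter the uncompressed size. This is only a sketch, with placeholder names:
-- Sketch: stored (possibly compressed) size vs. uncompressed size
-- (the_table and english are placeholders):
SELECT id,
       octet_length(english)   AS uncompressed_bytes,
       pg_column_size(english) AS stored_bytes
FROM the_table
ORDER BY octet_length(english) DESC
LIMIT 10;
-- stored_bytes noticeably smaller than uncompressed_bytes means the value was compressed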
If fine, how important is it to be fetching only those columns the user needs? (No user will need all of them.)
This question has a very clear answer. You should absolutely select only what you need. Assembling a row from 100+ toasted columns, just to throw most of them away, will slow you down.
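To illustrate (with made-up column names), that just means listing the columns you need instead of SELECT *:
-- Fetch only the translation the user asked for (placeholder names):
SELECT id, name, french FROM the_table WHERE id = 42;
-- rather than detoasting and shipping all 100+ columns:
-- SELECT * FROM the_table WHERE id = 42;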

I don't know if this falls under "premature optimization" so much as under poor design. One way or another you will need some method of knowing which of the 108 versions you need. And what happens when you need to add a 108th translation, or delete, say, the 93rd? So use this information to form a key into a translation table, something like Translation_Test (for_ref_in bigint, language text, translation text). Then access the necessary text (including perhaps the English version) from that table.
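A minimal sketch of that design (the table and column names follow the answer above; the key, constraints and sample query are assumptions):
CREATE TABLE translation_test (
    for_ref_in  bigint NOT NULL,   -- id of the row being translated
    language    text   NOT NULL,   -- e.g. 'en', 'fr', 'de'
    translation text   NOT NULL,
    PRIMARY KEY (for_ref_in, language)
);
-- Fetch exactly one language for one row:
SELECT translation FROM translation_test WHERE for_ref_in = 42 AND language = 'fr';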

Related

Does it waste space to define a column in IBM Db2 on Cloud as a longer VARCHAR() than required?

We often have columns that can contain values of varying sizes. For these, I like to set the data type to VARCHAR with a size way beyond the current maximum length. For example, if I have a column where the current minimum length for a value is 10 and the maximum length is 35, I might set the data type to VARCHAR(64). My rationale is that Db2 stores the 2-byte length followed by the exact value, so there is no difference, from a storage perspective, between defining the data type as VARCHAR(64) and as VARCHAR(35). And I don't get an error if a value with a length of 36 comes along.
Is there a nuance that I'm missing and should I not be so glib about my VARCHAR assignments?
The exact formula to calculate row length is described in the docs for CREATE TABLE. VARCHAR(64) or VARCHAR(35) should not make a difference.
Be aware that rows are stored in data pages in tablespaces. Database systems usually pre-allocate pages for performance reasons. Moreover, pages might not be fully filled, or there may be compression. And you might have defined indexes, which require their own pages and structures. Plus there is metadata in the system catalog.
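If you want to double-check what was actually declared, the Db2 catalog has it; this is just a sketch against SYSCAT.COLUMNS with placeholder schema and table names:
-- Sketch: declared column types and lengths in the Db2 catalog
-- (MYSCHEMA and MYTABLE are placeholders):
SELECT COLNAME, TYPENAME, LENGTH
FROM SYSCAT.COLUMNS
WHERE TABSCHEMA = 'MYSCHEMA' AND TABNAME = 'MYTABLE'
ORDER BY COLNO;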

Does varchar's length have any effect on performance

We have discussions with our development staff over the use of VARCHAR columns, as they define every varchar field as varchar(255), varchar(500), ... – much bigger than the maximum length of the field.
Does varchar's length have any effect on performance in Db2? We have found that it is recommended to use char instead of varchar for columns of 30 bytes or less, and our concern is about varchar fields that are greater than 30 bytes.
Allowing excessive column length is not a good idea. If you allow, let’s say, a FirstName column to have maximum length 500, you may find quite a long irrelevant story there eventually, because why not if it’s allowed :)
As for performance implications.
The only problem could be this: if extended row size is turned on for the database (you simply can't create a table that is too "wide" otherwise) and the total length of a row exceeds the tablespace page size, some varchar column values are moved out of the data page, and more IO will be needed to access such a row in the future. You should keep this behavior in mind, and the probability of it happening is higher with uncontrolled varchar column lengths.
This can also cause a performance hit with ORGANIZE BY COLUMN tables. There is a limit on the total declared width that can be processed within the Columnar Data Engine; if this limit is breached in a query plan, the remainder of the query will be processed in the Row Data Engine.

How does Postgres store a row in a page when the row size exceeds the available free space in the page?

I am exploring the storage mechanism of Postgres. I know that Postgres uses a page-like structure (each page of size 8K) to store rows. One page can contain more than one row. I also know that TOASTing is done by Postgres when a row cannot be contained in a given page.
But I am not certain about the following scenario:
There's only 1K of space left in the current page, and the size of a newly created row exceeds 1K. In that case, what will happen? Will a new page be allocated for that row and the old page be left with unused space? Or will the old page's remaining space be used when another row with size less than or equal to 1K is created?
I am referring to TOAST. The following paragraph is a bit unclear:
When a row that is to be stored is "too wide" (the threshold for that is 2KB by default), the TOAST mechanism first attempts to compress any wide field values. If that isn't enough to get the row under 2KB, it breaks up the wide field values into chunks that get stored in the associated TOAST table. Each original field value is replaced by a small pointer that shows where to find this "out of line" data in the TOAST table. TOAST will attempt to squeeze the user-table row down to 2KB in this way, but as long as it can get below 8KB, that's good enough and the row can be stored successfully.
Why is it talking about two sizes, 8K and 2K? Why does Postgres check against the 2K threshold?
Thanks in advance.
First, I should clarify that “enough room in the table page” has nothing to do with the question if an attribute is TOASTed or not.
The paragraph you quote describes how TOAST tries to reduce the size of a table row that exceeds 2KB by first compressing the values and then storing them “out of line” in a TOAST table.
The idea is to reduce the size such that a row does not use up more than a quarter of the space in a table block. But if that fails and the row ends up bigger than 2KB after TOASTing, that is no problem either, as long as the resulting row fits into one 8KB block.
A table row is always stored in a single table block. If there is not enough space left in any existing block, a new table block is allocated and the existing blocks are left with some empty space. This empty space can still be used for other, smaller new rows.
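If you want to see that free space for a concrete table, the pg_freespacemap extension can show it per block; this is only a sketch, with t1 as a placeholder table name:
-- Sketch: per-block free space, as recorded in the free space map
-- (the map is maintained by VACUUM, so vacuum first for current numbers):
CREATE EXTENSION IF NOT EXISTS pg_freespacemap;
VACUUM t1;
SELECT * FROM pg_freespace('t1');   -- one row per block: (blkno, avail)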
The limits of 8KB for a table block and 2KB as the threshold for TOASTing are somewhat arbitrary and based on experience. You can change them if you are ready to recompile PostgreSQL (the block size is a compile-time option and cannot be changed at initdb time), but I have not heard any reports that doing so is a good idea.
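You can at least verify the block size your cluster was built with, without recompiling anything:
SHOW block_size;                        -- typically 8192
SELECT current_setting('block_size');   -- same value, usable inside queries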

postgres text storage inline or in "background table"?

In PostgreSQL, how can I tell whether a text column is stored inline or stored in a "background table"?
Documentation for text column types says that
Very long values are also stored in background tables so that they do not interfere with rapid access to shorter column values.
Is there a fixed length at which a value is determined to be "very long"? If not, are there other ways of telling how my columns are laid out on disk? I have a table with several columns that are text (or varchar(n)) and want to understand how they are stored under the hood. Is there more documentation on these "background tables" somewhere?
Any varlena data type (that is, any type with variable length, as opposed to fixed-length types passed by value in 4 bytes (32-bit) or 8 bytes (64-bit)) can be TOASTed. TOAST is the process that tries to shrink long rows (records) so that they fit into the 8KB page size.
The row size is checked before the row is physically stored in the relation. When the size exceeds 2KB, the largest fields are selected, compressed, sliced into chunks and moved to the associated secondary TOAST table (named pg_toast.pg_toast_<oid>). A pointer to the TOAST data replaces the value in the main storage. This process is repeated while the row is still bigger than 2KB.
Follow the links provided by a_horse_with_no_name and IMSoP for more detailed documentation.
If your table is called t1, then enter \d+ t1 at your psql prompt; the output includes a Storage column showing each column's storage mode.
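Beyond \d+, the storage mode is also recorded in pg_attribute, and pg_class points to the associated TOAST table; a sketch, with t1 as a placeholder:
-- Per-column storage mode: p = plain, m = main, x = extended, e = external
SELECT attname, attstorage
FROM pg_attribute
WHERE attrelid = 't1'::regclass AND attnum > 0 AND NOT attisdropped;
-- The "background" (TOAST) table, if the table has one:
SELECT reltoastrelid::regclass FROM pg_class WHERE oid = 't1'::regclass;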

Does not using NULL in PostgreSQL still use a NULL bitmap in the header?

Apparently PostgreSQL stores a couple of values in the header of each database row.
If I don't use NULL values in that table - is the null bitmap still there?
Does defining the columns with NOT NULL make any difference?
It's actually more complex than that.
The null bitmap needs one bit per column in the row, rounded up to full bytes. It is only there if the actual row includes at least one NULL value and is fully allocated in that case. NOT NULL constraints do not directly affect that. (Of course, if all fields of your table are NOT NULL, there can never be a null bitmap.)
The "heap tuple header" (per row) is 23 bytes long. Actual data starts at a multiple of MAXALIGN (Maximum data alignment) after that, which is typically 8 bytes on 64-bit OS (4 bytes on 32-bit OS). Run the following command from your PostgreSQL binary dir as root to get a definitive answer:
./pg_controldata /path/to/my/dbcluster
On a typical Debian-based installation of Postgres 12 that would be:
sudo /usr/lib/postgresql/12/bin/pg_controldata /var/lib/postgresql/12/main
Either way, there is one free byte between the header and the aligned start of the data, which the null bitmap can utilize. As long as your table has 8 columns or less, NULL storage is effectively absolutely free (as far as disk space is concerned).
After that, another MAXALIGN (typically 8 bytes) is allocated for the null bitmap to cover another (typically) 64 fields. Etc.
This is valid for at least versions 8.4 - 12 and most likely won't change.
The null bitmap is only present if the HEAP_HASNULL bit is set in t_infomask. If it is present it begins just after the fixed header and occupies enough bytes to have one bit per data column (that is, t_natts bits altogether). In this list of bits, a 1 bit indicates not-null, a 0 bit is a null. When the bitmap is not present, all columns are assumed not-null.
http://www.postgresql.org/docs/9.0/static/storage-page-layout.html#HEAPTUPLEHEADERDATA-TABLE
So for every 8 columns you use one byte of extra storage. For roughly every million rows, that would take up about one megabyte of storage, which does not really seem that important. I would define the tables the way they need to be defined and not worry about the null bitmap.
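One quick way to see the header and null-bitmap behaviour in practice (a sketch; exact numbers depend on MAXALIGN and the column types) is pg_column_size() on row constructors, which includes the 23-byte tuple header plus alignment:
SELECT pg_column_size(ROW());                   -- bare tuple header, padded to MAXALIGN
SELECT pg_column_size(ROW(1::int, 2::int));     -- no NULLs, so no null bitmap
SELECT pg_column_size(ROW(1::int, NULL::int));  -- the null bitmap fits into the padding byte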