Does the order of columns in a Postgres table impact performance? - postgresql

In Postgres does the order of columns in a CREATE TABLE statement impact performance? Consider the following two cases:
CREATE TABLE foo (
a TEXT,
B VARCHAR(512),
pkey INTEGER PRIMARY KEY,
bar_fk INTEGER REFERENCES bar(pkey),
C bytea
);
vs.
CREATE TABLE foo2 (
pkey INTEGER PRIMARY KEY,
bar_fk INTEGER REFERENCES bar(pkey),
B VARCHAR(512),
a TEXT,
C bytea
);
Will performance of foo2 be better than foo because of better byte alignment for the columns? When Postgres executes CREATE TABLE does it follow the column order specified or does it re-organize the columns in optimal order for byte alignment or performance?

Question 1
Will the performance of foo2 be better than foo because of better byte
alignment for the columns?
Yes, the order of columns can have a small impact on performance. Type alignment is the more important factor, because it affects the footprint on disk. You can minimize storage size (play "column tetris") and squeeze more rows on a data page - which is the most important factor for speed.
Normally, it's not worth bothering. With an extreme example like in this related answer you get a substantial difference:
Calculating and saving space in PostgreSQL
Type alignment details:
Making sense of Postgres row sizes
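The padding effect behind "column tetris" can be sketched with a small toy model. This is only an illustration, not PostgreSQL code: it covers a handful of fixed-width types (sizes and alignments as listed in pg_type), assumes 8-byte maximum alignment as on typical 64-bit platforms, and ignores the tuple header, NULLs and variable-length columns. The two column orders are hypothetical.

```python
# Toy model of PostgreSQL alignment padding for fixed-width columns.
# (size, alignment) in bytes, mirroring typlen / typalign in pg_type.
TYPES = {
    "boolean": (1, 1),
    "smallint": (2, 2),
    "integer": (4, 4),
    "bigint": (8, 8),
}
MAXALIGN = 8  # assumption: typical 64-bit platform


def align_up(offset, alignment):
    """Round offset up to the next multiple of alignment."""
    return (offset + alignment - 1) // alignment * alignment


def data_size(column_types):
    """Bytes used by the data area of one row, including padding."""
    offset = 0
    for name in column_types:
        size, alignment = TYPES[name]
        offset = align_up(offset, alignment) + size
    return align_up(offset, MAXALIGN)  # the row as a whole is padded too


# Hypothetical column orders holding exactly the same columns.
careless = ["boolean", "bigint", "smallint", "bigint", "integer"]
packed = ["bigint", "bigint", "integer", "smallint", "boolean"]

print(data_size(careless))  # 40
print(data_size(packed))    # 24
```

Both orders store 23 bytes of actual data; the difference is purely padding.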
The other factor is that retrieving column values is slightly faster if you have fixed size columns first. I quote the manual here:
To read the data you need to examine each attribute in turn. First
check whether the field is NULL according to the null bitmap. If it
is, go to the next. Then make sure you have the right alignment. If
the field is a fixed width field, then all the bytes are simply
placed. If it's a variable length field (attlen = -1) then it's a bit
more complicated. All variable-length data types share the common
header structure struct varlena, which includes the total length of
the stored value and some flag bits.
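Why fixed-size columns first can help is easier to see with a deliberately simplified sketch. It assumes no NULL values, ignores alignment padding, and marks variable-length columns with None; the column lists mirror foo and foo2 from the question. It only shows which columns sit at an offset that follows from the type widths alone.

```python
# Toy model: which columns sit at an offset that depends only on type widths?
# Assumptions: no NULLs, alignment padding ignored, None marks a
# variable-length (varlena) column.
FOO = [("a", None), ("b", None), ("pkey", 4), ("bar_fk", 4), ("c", None)]
FOO2 = [("pkey", 4), ("bar_fk", 4), ("b", None), ("a", None), ("c", None)]


def fixed_offsets(columns):
    """Return {column: offset} for the leading run of fixed-width columns."""
    offsets = {}
    offset = 0
    for name, width in columns:
        if width is None:
            break  # later columns can only be located by reading lengths
        offsets[name] = offset
        offset += width
    return offsets


print(fixed_offsets(FOO))   # {}
print(fixed_offsets(FOO2))  # {'pkey': 0, 'bar_fk': 4}
```

In foo, reaching pkey means stepping over two variable-length values first; in foo2 both integers are at known positions.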
There is an open TODO item to allow reordering of column positions in the Postgres Wiki, partly for these reasons.
Question 2
When Postgres executes a CREATE TABLE does it follow the column order
specified or does it re-organize the columns in optimal order for byte
alignment or performance?
Columns are stored in the defined order; the system does not try to optimize.
I fail to see any relevance of column order to TOAST tables like another answer seems to imply.

As far as I understand, PostgreSQL adheres to the order in which you enter the columns when saving records. Whether this affects performance is debatable. PostgreSQL stores all table data in pages, each 8 kB in size. 8 kB is the default, but it can be changed at compile time.
Each row in the table takes up space within a page. Since your table definition contains variable-length columns, a page can hold a variable number of records. What you want to do is fit as many records into one page as possible. That is why you will notice performance degradation when a table has a huge number of columns or the column sizes are huge.
This being said, declaring a varchar(8192) does not mean the page will be filled up with one record, but a CHAR(8192) value is always blank-padded to the full 8192 characters irrespective of the amount of data in the column.
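As a rough illustration of how row width translates into rows per page, here is a back-of-the-envelope sketch. It assumes the default 8 kB page, a 24-byte page header, a 4-byte line pointer per row and a 23-byte tuple header padded to 24 bytes; it ignores the NULL bitmap, fillfactor and TOAST, and the two row widths are hypothetical.

```python
PAGE_SIZE = 8192   # default block size in bytes
PAGE_HEADER = 24   # page header
LINE_POINTER = 4   # one item identifier per row
TUPLE_HEADER = 24  # 23-byte heap tuple header padded to 8-byte alignment


def rows_per_page(data_bytes):
    """Rows of the given data width (already padded) that fit in one page."""
    per_row = LINE_POINTER + TUPLE_HEADER + data_bytes
    return (PAGE_SIZE - PAGE_HEADER) // per_row


print(rows_per_page(32))    # 136
print(rows_per_page(1000))  # 7
```

A sequential scan over the wide rows has to read roughly 19 times as many pages for the same number of rows.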
There is one more thing to consider when declaring TOASTable types such as TEXT columns. These are columns that could exceed the maximum page size. A table that has TOASTable columns will have an associated TOAST table to store the data and only a pointer to the data is stored with the table. This can impact performance, but can be improved with proper indexes on the TOASTable columns.
To conclude, I would say that the order of the columns does not play much of a role in the performance of a table. Most queries use indexes, which are stored separately, to retrieve records, so the effect of column order is negated. It comes down to how many pages need to be read to retrieve the data.

Related

Difference between BRIN index and table partitioning in PostgreSQL

What is the difference between a BRIN index and a table partition in PostgreSQL? When should I use one instead of the other? It seems that they provide very similar benefits and also have similar use cases.
Example
Suppose we have the following table structure
CREATE TABLE orders (
id SERIAL PRIMARY KEY,
store_id INT,
client_id INT,
created_at timestamp,
information jsonb
);
that has the following characteristics:
orders can only be inserted, deletions are not allowed and updates are very rare and they don't involve the created_at column
the created_at column contains the timestamp of the insertion of the row in the database thus the values in the column are strictly increasing
almost every query uses the created_at column in a condition and some of them may use the store_id and client_id columns
the most accessed rows are the most recent ones in terms of the created_at column
some queries may return a few records (example: analyzing a single record or the records created in a small time interval) while others may scan a vast amount of records (example: aggregate functions for a dashboard functionality)
I have chosen this example because it's very common and because both approaches could be used (in my opinion). In this case, which should I choose: a BRIN index on the whole table, or a partitioned table with maybe a btree index (or just a simple btree index without partitioning)? Does the table dimension influence the choice?
I have used both features (although I'll caveat that my experience with partitioning is from back when you had to use inheritance + constraints, before the introduction of CREATE TABLE ... PARTITION BY). You are correct that they seem similar-ish on a surface level, but they function by completely different mechanisms.
Table partitioning basically works as follows: replace all references to table with (select * from table_partition1 union all select * from table_partition2 /* repeat for all partitions */). The partitions will have a constraint on the partition columns, so that if those columns appear in a WHERE, the constraints can be applied up-front to prune which partitions are actually scanned. IOW, if table_partition1 has CHECK(client_id=1), and your WHERE has client_id=2, table_partition1 will be skipped, since the table constraint automatically excludes all rows in this partition from passing that WHERE.
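The pruning idea can be sketched as a toy model. This is not how the planner is implemented; it only assumes that each partition advertises the client_id values its CHECK constraint allows. The third partition and its values are made up for illustration.

```python
# Toy model of partition pruning. Each partition advertises the set of
# client_id values its CHECK constraint allows (contents are hypothetical).
PARTITIONS = {
    "table_partition1": {1},
    "table_partition2": {2},
    "table_partition3": {3, 4},
}


def partitions_to_scan(client_id):
    """Keep only partitions whose constraint can match client_id."""
    return [name for name, allowed in PARTITIONS.items() if client_id in allowed]


print(partitions_to_scan(2))  # ['table_partition2']
```

The rows themselves are never consulted: a partition is excluded from its constraint alone.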
BRIN indexes, in contrast, choose a block size for the table and then, for each block, record a min/max bound of the indexed column. This allows WHERE conditions to skip entire blocks when we can see, say, that the maximum created_at in a particular block of rows is below a created_at>={some_value} clause in your WHERE.
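A minimal sketch of the min/max idea, assuming a plain Python list stands in for the table in on-disk order and that a "range" is simply ten consecutive rows (a real BRIN index summarises ranges of pages, not rows). The second data set holds the same values in a scattered order to show what happens when on-disk order and value are not correlated.

```python
def build_summaries(values, rows_per_range):
    """One (min, max) pair per consecutive chunk of rows."""
    chunks = (
        values[i:i + rows_per_range]
        for i in range(0, len(values), rows_per_range)
    )
    return [(min(chunk), max(chunk)) for chunk in chunks]


def ranges_to_scan(summaries, lower_bound):
    """Indexes of ranges that may contain a value >= lower_bound."""
    return [i for i, (_, hi) in enumerate(summaries) if hi >= lower_bound]


ordered = list(range(100))                        # e.g. insertion timestamps
scattered = [(i * 37) % 100 for i in range(100)]  # same values, scattered

print(ranges_to_scan(build_summaries(ordered, 10), 90))         # [9]
print(len(ranges_to_scan(build_summaries(scattered, 10), 90)))  # 9
```

With ordered data a single range has to be read; with scattered data nine of the ten ranges still qualify, so the summaries prune almost nothing.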
I can't tell you a definitive answer for your case as to which would work better. Well, that's not true, actually: the definitive answer is, "benchmark it for your own data" ;)
This is kind of fuzzy, but my general feeling is that BRIN is lightweight, and table partitioning is not. BRIN is something that can be added on to an existing table without much trouble, the indexes themselves are very small, and the impact on writes is not major (at least, not without inordinately many indices). Table partitioning, on the other hand, is a different way of representing the data on-disk; you are actually determining into which data files particular rows will be written. This requires a much more involved migration process when introducing it to an existing dataset.
However, the set of query optimizations available for table partitioning is much greater. Not only is there the constraint exclusion I described above, but you can also have indices (even BRIN ones!) on each individual partition. Of course, you can also have BRIN + other indices on a single-big-table, but I'm not sure that is particularly helpful IRL.
A few other thoughts: BRIN is good for monotonic data (timestamps, incrementing IDs, etc.); the more correlated the on-disk ordering is to the indexed value, the more effective a BRIN index can be at pruning blocks to be scanned. Things like customer IDs, however, are unlikely to work well with BRIN; any given block of rows is likely to have at least one relatively low and one relatively high ID. However, fields like that work quite well for partitioning: a partition-per-client, or partitioning on the modulus of a customer ID (which would more commonly be called sharding), is a good way of scaling horizontally, almost without bound.
Any update, even if it does not change the indexed column, will make a BRIN index pretty useless (unless it is a HOT update). Even without that, there are differences, for example:
partitioning allows you to get rid of lots of data efficiently, a BRIN index won't
a partitioned table allows one autovacuum worker per partition, which improves autovacuum performance
But if your only concern is to efficiently select all rows for a certain value of the index or partitioning key, both may offer about the same benefit.

Shall I SET STORAGE PLAIN on fixed-length bytea PK column?

My Postgres table's primary key is a SHA1 checksum (always 20 bytes) stored in a bytea column (because Postgres doesn't have fixed-length binary types).
Shall I ALTER TABLE t ALTER COLUMN c SET STORAGE PLAIN not to let Postgres compress and/or outsource (TOAST) my PK/FK for the sake of lookup and join performance? And why (not)?
I would say that that is a micro-optimization that will probably not have a measurable effect.
First, PostgreSQL only considers compressing and slicing values if the row exceeds 2000 bytes, so there will only be an effect at all if your rows routinely exceed that size.
Then, even if the primary key column gets toasted, you will probably only be able to measure a difference if you select a large number of rows in a single table scan. Fetching only a few rows by index won't make a big difference.
I'd benchmark both approaches, but I'd assume that it will be hard to measure a difference. I/O and other costs will probably hide the small extra CPU time required for decompression (remember that the row has to be large for TOAST to kick in in the first place).

Understanding indexes and performance as they relate to indexed column and non-indexed column data in the same row

I have some tables that are around 100 columns wide. I haven't normalized them because to put it back together would require almost 3 dozen joins and am not sure it would perform any better... haven't tested it yet (I will) so can't say for sure.
Anyway, that really isn't the question. I have been indexing columns in these tables that I know will be pulled frequently, so something like 50 indexes per table.
I got to thinking though. These columns will never be pulled by themselves and are meaningless without the primary key (basically an item number). The PK will always be used for the join and even in simple SELECT queries, it will have to be a specified column so the data makes sense.
That got me thinking further about indexes and how they work. As I understand them, the locations of values are committed to memory for that column so they are quickly found in a query.
For example, if you have:
SELECT itemnumber, expdate
FROM items;
And both itemnumber and expdate are indexed, is that excessive and really adding any benefit? Is it sufficient to just index itemnumber and the index will know that expdate, or anything else that is queried for that item, is on the same row?
Secondly, if multiple columns constitute a primary key, should the index include them grouped together, or is individually sufficient?
For example,
CREATE INDEX test_index ON table (pk_col1, pk_col2, pk_col3);
vs.
CREATE INDEX test_index1 ON table (pk_col1);
CREATE INDEX test_index2 ON table (pk_col2);
CREATE INDEX test_index3 ON table (pk_col3);
Thanks for clearing that up in advance!
Uh oh, there is a mountain of basics that you still have to learn.
I'd recommend that you read the PostgreSQL documentation and the excellent book “SQL Performance Explained”.
I'll give you a few pointers to get you started:
Whenever you create a PRIMARY KEY or UNIQUE constraint, PostgreSQL automatically creates a unique index over all the columns of that constraint. So you don't have to create that index explicitly (but if it is a multicolumn index, it sometimes is useful to create another index on any but the first column).
Indexes are relevant to conditions in the WHERE clause and the GROUP BY clause and to some extent for table joins. They are irrelevant for entries in the SELECT list. An index provides an efficient way to get the part of a table that satisfies a certain condition; an (unsorted) access to all rows of a table will never benefit from an index.
Don't sprinkle your schema with indexes randomly, since indexes use space and make all data modification slow.
Use them where you know that they will do good: on columns on which a foreign key is defined, on columns that appear in WHERE clauses and contain many different values, on columns where your examination of the execution plan (with EXPLAIN) suggests that you can expect a performance benefit.
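The remark about multicolumn indexes can be illustrated with a sorted list standing in for a B-tree. This is only a sketch under simplifying assumptions: the entries are hypothetical, the first column is an integer, and a real database can still read a multicolumn index for a condition on a later column, just without the benefit of the sort order.

```python
from bisect import bisect_left

# Entries of a hypothetical index on (pk_col1, pk_col2), kept sorted.
INDEX = sorted([(1, "a"), (1, "b"), (2, "a"), (2, "c"), (3, "b")])


def lookup_first(value):
    """Binary search works because entries are sorted by the first column."""
    start = bisect_left(INDEX, (value,))
    end = bisect_left(INDEX, (value + 1,))
    return INDEX[start:end]


def lookup_second(value):
    """No useful ordering on the second column alone: check every entry."""
    return [entry for entry in INDEX if entry[1] == value]


print(lookup_first(2))     # [(2, 'a'), (2, 'c')]
print(lookup_second("b"))  # [(1, 'b'), (3, 'b')]
```

Matches on the second column are spread across the whole structure, which is why a separate index on that column can pay off when it is queried on its own.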

How many records can I store in 5 MB of PostgreSQL on Heroku?

I'm going to store records in a single table with 2 fields:
id -> 4 characters
password_hash -> 64 characters
How many records like the one above will I be able to store in a 5 MB PostgreSQL database on Heroku?
P.S.: given a single table with x columns and a length of y - how can I calculate the space it will take in a database?
Disk space occupied
Calculating the space on disk is not trivial. You have to take into account:
The overhead per table. Small, basically the entries in the system catalog.
The overhead per row (HeapTupleHeader) and per data page (PageHeaderData). Details about page layout in the manual.
Space lost to column alignment, depending on data types.
Space for a NULL bitmap. Effectively free for tables of 8 columns or less, irrelevant for your case.
Dead rows after UPDATE / DELETE. (Until the space is eventually vacuumed and reused.)
Size of index(es). You'll have a primary key, right? Index size is similar to that of a table with just the indexed columns and less overhead per row.
The actual space requirement of the data, depending on respective data types. Details for character types (incl. fixed length types) in the manual:
The storage requirement for a short string (up to 126 bytes) is 1 byte
plus the actual string, which includes the space padding in the case
of character. Longer strings have 4 bytes of overhead instead of 1
More details for all types in the system catalog pg_type.
The database encoding in particular for character types. UTF-8 uses up to four bytes to store one character (But 7-Bit-ASCII characters always occupy just one byte, even in UTF-8.)
Other small things that may affect your case, like TOAST - which should not affect you with 64-character strings.
Calculate with test case
A simple method to find an estimate is to create a test table, fill it with dummy data and measure with database object size functions:
SELECT pg_size_pretty(pg_relation_size('tbl'));
Including indexes:
SELECT pg_size_pretty(pg_total_relation_size('tbl'));
See:
Measure the size of a PostgreSQL table row
A quick test shows the following results:
CREATE TABLE test(a text, b text);
INSERT INTO test -- quick fake of matching rows
SELECT chr((g/1000 +32)) || to_char(g%1000, 'FM000')
, repeat (chr(g%120 + 32), 64)
FROM generate_series(1,50000) g;
SELECT pg_size_pretty(pg_relation_size('test')); -- 5640 kB
SELECT pg_size_pretty(pg_total_relation_size('test')); -- 5648 kB
After adding a primary key:
ALTER TABLE test ADD CONSTRAINT test_pkey PRIMARY KEY(a);
SELECT pg_size_pretty(pg_total_relation_size('test')); -- 6760 kB
So, I'd expect a maximum of around 44k rows without and around 36k rows with primary key.
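For reference, a naive linear scaling of the measured sizes can be sketched as follows. It assumes 5 MB means 5 * 1024 kB and that size grows strictly in proportion to the row count, which ignores fixed overhead and dead rows. It comes out slightly above the rounder figures quoted here, so treat both as rough estimates.

```python
def rows_that_fit(budget_kb, measured_kb, measured_rows):
    """Scale a measured row count linearly to a size budget."""
    return budget_kb * measured_rows // measured_kb


BUDGET_KB = 5 * 1024  # assumption: 5 MB = 5120 kB

print(rows_that_fit(BUDGET_KB, 5640, 50_000))  # 45390 (table only)
print(rows_that_fit(BUDGET_KB, 6760, 50_000))  # 37869 (with primary key)
```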

SQL 2008 R2 Row size limit exceeded

I have a SQL Server 2008 R2 database. I created a table and when trying to execute a select statement (with order by clause) against it, I receive the error "Cannot create a row of size 8870 which is greater than the allowable maximum row size of 8060."
I am able to select the data without an order by clause, however the order by clause is important and I require it. I have tried a ROBUST PLAN option but I still received the same error.
My table has 300+ columns with data type TEXT. I have tried using varchar and nvarchar, but have had no success.
Can someone please provide some insight?
Update:
Thanks for the comments. I agree: 300+ columns in one table is not very good design. What I'm trying to do is bring Excel tabs into the database as data tables. Some tabs have 300+ columns.
I first use a CREATE statement to create a table based on the Excel tab, so the columns vary. Then I run various SELECT, UPDATE, INSERT, etc. statements on the table once it has been populated.
The structure of the table usually follows this pattern:
fkVersionID, RowNumber(autonumber), Field1, Field2, Field3, etc...
Is there any way to get around the 8060 row size limit?
You mentioned that you tried nvarchar and varchar ... remember that nvarchar doubles the bytes used, but it is the only one of the two to support foreign characters in some cases, such as accent marks.
varchar is a good choice if you can limit its maximum size appropriately.
8000 characters is still a real limit, but if on average each varchar column is no more than 26 characters, you'll be okay.
You could go riskier and declare varchar with a length of 50, as long as you only use 26 characters per column on average. One column may hold 36 characters and the next 16, and you are okay again (as long as you never exceed the average of 26 characters per column across the 300 columns).
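The arithmetic behind the 26-character figure can be sketched as below. It assumes one byte per character (varchar, not nvarchar) and ignores the per-row and per-column overhead that SQL Server adds, so the real budget is somewhat smaller than this suggests.

```python
ROW_LIMIT = 8060  # bytes, taken from the error message


def average_budget(column_count):
    """Average bytes available per column if the row must fit the limit."""
    return ROW_LIMIT // column_count


def fits(lengths):
    """True if the actual stored lengths of one row stay within the limit."""
    return sum(lengths) <= ROW_LIMIT


print(average_budget(300))   # 26
print(fits([36, 16] * 150))  # True: averages 26 characters per column
print(fits([50] * 300))      # False: every column filled to 50 characters
```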
Obviously with dynamic number of fields, and potential to way exceed the 8000 character limit, it is doomed by SQL's specs.
Your only other alternative is to create multiple tables and when you access the data, have a unique key to join appropriate records on. So in your select statement, use the join, and from multiple tables then you can handle rows with 8000 + 8000 + ...
So it is doable, but you have to work with SQL rules.
I believe you're running into this limitation:
There is no limit to the number of items in the ORDER BY clause. However, there is a limit of 8,060 bytes for the row size of intermediate worktables needed for sort operations. This limits the total size of columns specified in an ORDER BY clause.
I had a legacy app like this, it was a nightmare.
First, I broke it into multiple tables, all one-to-one. This is bad, but less bad than what you've got.
Then I changed the queries to request only the columns that were actually needed. (I can't tell if you have that option.)