Wrong (?) size of PostgreSQL table

I have a table with columns and constraints:
height smallint,
length smallint,
diameter smallint,
volume integer,
idsensorfragments integer,
CONSTRAINT sensorstats_idsensorfragments_fkey FOREIGN KEY (idsensorfragments)
REFERENCES sensorfragments (idsensorfragments) MATCH SIMPLE
ON UPDATE CASCADE ON DELETE CASCADE
(no primary key). There are currently 28,978,112 records in it, but the size of the table seems way too large to me.
Result of the query:
select pg_size_pretty(pg_total_relation_size('sensorstats')), pg_size_pretty(pg_relation_size('sensorstats'))
is:
"1849 MB";"1226 MB"
There is just one index, on the idsensorfragments column. Using simple math you can see that one record takes ~66.7 B (?!). Can anyone explain to me where this number comes from?
5 columns = 2 + 2 + 2 + 4 + 4 = 14 bytes. I have one index and no primary key. Where do the additional ~50 bytes per record come from?
P.S. Table was vacuumed, analyzed and reindexed.

You should take a look at how Database Physical Storage is organized, especially the Page Layout.
PostgreSQL keeps a bunch of extra fields for each tuple (row) and also for each page. Tuples are stored inside pages, because the page is the unit the database operates on, typically 8192 bytes in size. So the extra space usage comes from:
PageHeader, 24 bytes;
Tuple header, 27 bytes (a 23-byte header plus a 4-byte item pointer per tuple);
“invisible” Tuple versions;
reserved free space, according to the Storage Parameters of the table;
NULL indicator array;
(I might have missed something more.)
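Under those assumptions, a rough back-of-the-envelope accounting for this particular table looks like this (a sketch only; exact numbers depend on PostgreSQL version, alignment and NULLs):
-- Rough per-row accounting (a sketch, not exact figures):
--    4 bytes  item pointer in the page
--   24 bytes  tuple header (23 bytes, padded to 8-byte alignment)
--   16 bytes  user data: 2 + 2 + 2 (smallint), 2 bytes alignment padding, 4 + 4 (integer)
--  ~44 bytes per heap row, which matches 1226 MB / 28,978,112 rows ≈ 44 B
-- The remaining ~23 B per row is the btree index on idsensorfragments,
-- which pg_total_relation_size() includes: ≈ 67 B per row in total.
SELECT pg_column_size(s.*) AS row_bytes  -- roughly the heap row size, without the 4-byte item pointer
FROM   sensorstats s
LIMIT  1;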
The layout of physical storage can change between major releases, which is why a major upgrade traditionally required a full dump/restore. In recent versions pg_upgrade is of great help in this process.

Did you do a VACUUM FULL or CLUSTER? If not, the unused space is still dedicated to this table and its index. These statements rewrite the table; a VACUUM without FULL doesn't do the rewrite.
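If you want to check how much of that space is actually dead or free before rewriting, something along these lines works (a sketch; it assumes the pgstattuple extension is available):
CREATE EXTENSION IF NOT EXISTS pgstattuple;
SELECT * FROM pgstattuple('sensorstats');  -- dead_tuple_len and free_space show reclaimable bytes
VACUUM FULL sensorstats;  -- rewrites the table and its indexes; takes an ACCESS EXCLUSIVE lock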

Related

Encoding a Postgres UUID in Amazon Redshift

We have a couple of entities which are being persisted into Amazon Redshift for reporting purposes, and these entities have a relationship between them. The source tables in Postgres are related via a foreign key with a UUID datatype, which is not supported in Redshift.
One option is to encode the UUID as a 128-bit signed integer. The Redshift documentation refers to the ability to create DECIMAL(38,0) columns, and to the ability to store 128-bit numbers.
But 2^128 = 340,282,366,920,938,463,463,374,607,431,768,211,456, which is 39 digits (thanks, Wikipedia). So despite what the docs say, you cannot store the full 128 bits / 39 digits of precision in Redshift. How do you actually make a full 128-bit number column in Redshift?
In short, the real question behind this is: what is Redshift best practice for storing and joining tables that have UUID primary keys?
Redshift joins will perform well even with a VARCHAR key, so that's where I would start.
The main factor for join performance will be co-locating the rows onto the same compute node. To achieve this you should declare the UUID column as the distribution key on both tables.
Alternatively, if one of the tables is fairly small (<= ~1 million rows), then you can declare that table as DISTSTYLE ALL and choose some other dist key for the larger table.
If you have co-located the join and wish to optimize further then you could try splitting the UUID value into 2 BIGINT columns, one for the top 64 bits and another for the bottom 64. Even half of the UUID is likely to be unique and then you can use the second column as a "tie breaker".
c.f. "Amazon Redshift Engineering’s Advanced Table Design Playbook: Preamble, Prerequisites, and Prioritization"

Altering column type from int to bigint frees space?

I have a table of ~15 M rows with an int column.
The OS is Windows 7 and the C: drive (where Postgres is installed) shows:
59 GB free of 238 GB
Then I changed this column type to bigint:
ALTER TABLE mytable ALTER column col TYPE bigint;
And now, C disk:
61 GB free of 238 GB
How were 2 GB freed? It looks as if bigint takes less space than int? Or what happened?
There are no other processes running on this machine (it's a local/home computer) at the moment.
bigint takes 8 bytes, int takes 4 bytes, but space on disk depends on the whole row.
Making sense of Postgres row sizes
Calculating and saving space in PostgreSQL
More importantly, the physical size of the file containing the table also depends on dead tuples (table bloat). Only a rewrite of the table (typically VACUUM FULL) can shrink its physical size. Your ALTER TABLE caused exactly such a whole-table rewrite: only the live rows are copied into a new data file and the old, bloated file is dropped - effectively compacting the table. If the table carried enough bloat beforehand, that explains the freed disk space in spite of the wider column. Plain VACUUM is run automatically by autovacuum with default settings (but not VACUUM FULL).
VACUUM returning disk space to operating system
Are regular VACUUM ANALYZE still recommended under 9.1?
Apart from that side effect, changing a column from int to bigint never reduces the row size (bloat aside). Sometimes it stays the same, because the previous row layout had 4 bytes of alignment padding that the bigint can now use; otherwise the row size increases by another (typically) 8 bytes.
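A small check with pg_column_size() illustrates the padding case (a sketch; the 24-byte composite header and exact sizes assume a typical 64-bit build):
SELECT pg_column_size(ROW(1::int,    now())) AS int_then_timestamptz,    -- 24 + 4 + 4 padding + 8 = 40
       pg_column_size(ROW(1::bigint, now())) AS bigint_then_timestamptz; -- 24 + 8 + 8 = 40
-- Both are 40 bytes: the int's 4 bytes of alignment padding are simply absorbed by the wider bigint.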
Just for clarity: only VACUUM FULL reliably reduces the disk space used by a table, and even that depends on how many dead tuples there are and whether whole pages can be freed. It writes a new data file.
Plain VACUUM only frees space inside the existing data file's pages by removing dead tuples; apart from truncating completely empty pages at the end of the file, it does not lower the number of allocated pages and does not create a new data file.

Database size doubles on update to single new column

I have a fairly simple table used to drive tile processing for a web mapping application.
Column | Type | Modifiers
---------------+--------------------------+---------------------------------------------------------
id | integer | not null default nextval('wmts_tiles_id_seq'::regclass)
tile_matrix | integer |
rowx | integer |
coly | integer |
geom | geometry(Geometry,27700) |
processed | smallint |
Indexes:
"wmts_tiles_pkey" PRIMARY KEY, btree (id)
"wmts_tiles_wmts_tile_matrix_x_y" UNIQUE CONSTRAINT, btree (tile_matrix, rowx, coly)
"ix_spatial_wmts_tiles" gist (geom)
"ix_tile_matrix_processed" btree (tile_matrix, processed)
with various indexes (one spatial) and constraints as shown. This table has 240 million rows, and pg_relation_size and pg_total_relation_size indicate that it is 66 GB, of which half is indexes and half data.
I added a single date column and then ran an update to populate it:
alter table wmts_tiles add column currency_date date;
update wmts_tiles set currency_date = '2014-05-01';
After this, the size went to 133 GB, i.e., it doubled. Once I ran VACUUM FULL on the table, the size shrank back to 67 GB, i.e., 1 GB larger than before, which is what you would expect after adding 240 million rows of a 4-byte field (date).
I understand that there will often be a reasonable percentage of dead rows in a table where a lot of inserts and deletes happen, but why would the table size double under one single update, and is there anything I can do to prevent this? Note that this update was the only transaction running, and the table had just been dumped and recreated, so the data pages and indexes were in a compact state prior to the update.
EDIT: I have seen this question, Why does my postgres table get much bigger under update?, and I understand that to support MVCC the table needs to be rewritten while it is being updated; what I don't understand is why it stays twice the size until I explicitly run VACUUM FULL.
Most of this is covered by this prior question.
The reason it doesn't shrink is that PostgreSQL doesn't know you want it to. It's inefficient to allocate disk space (grow a table) and then release it (shrink the table) repeatedly. PostgreSQL prefers to grow a table and then keep the disk space, marking it empty and ready for re-use.
Not only is the allocate-and-release cycle inefficient, the OS also only permits releasing space at the end of a file*. So PostgreSQL would have to move all the rows from the end of the file - the only place it could write them while you ran the update - into the now-free space at the start. It couldn't do this as it went, because it couldn't overwrite any of the old data until the update transaction committed.
If you know you won't need the space again, you can use VACUUM FULL to compact the table and release the space.
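If you can't afford the temporary doubling in the first place, one common workaround (a sketch only; the batch size and id ranges are illustrative) is to update in batches and let plain VACUUM make the dead space reusable between batches:
UPDATE wmts_tiles
SET    currency_date = '2014-05-01'
WHERE  id BETWEEN 1 AND 1000000
AND    currency_date IS DISTINCT FROM '2014-05-01';
VACUUM wmts_tiles;  -- marks the old row versions as reusable free space
-- ...repeat with the next id range; the table then grows by roughly one batch
-- of new row versions instead of doubling.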
There's no periodic VACUUM FULL done by autovacuum, partly because it might be quite bad for performance if the table just has to be expanded again, partly because VACUUM FULL is quite I/O intensive (so it slows everything else down), and partly because it requires an ACCESS EXCLUSIVE lock that prevents any concurrent access to the table. PostgreSQL would need an incremental table repack command/feature, which it doesn't have yet, for this to be possible with autovacuum. Patches are welcome... though this part of the code is very complex, and getting it right would not be a beginner job.
* Yes, I know you can mark large regions of zeroes within a file as sparse regions on some platforms. Feel free to submit a PostgreSQL patch for that. In practice you'd have to compact anyway, because you don't normally find large runs of entirely free pages in the table. Plus you'd have to deal with the page headers.

Does the order of columns in a Postgres table impact performance?

In Postgres does the order of columns in a CREATE TABLE statement impact performance? Consider the following two cases:
CREATE TABLE foo (
a TEXT,
B VARCHAR(512),
pkey INTEGER PRIMARY KEY,
bar_fk INTEGER REFERENCES bar(pkey),
C bytea
);
vs.
CREATE TABLE foo2 (
pkey INTEGER PRIMARY KEY,
bar_fk INTEGER REFERENCES bar(pkey),
B VARCHAR(512),
a TEXT,
C bytea
);
Will performance of foo2 be better than foo because of better byte alignment for the columns? When Postgres executes CREATE TABLE does it follow the column order specified or does it re-organize the columns in optimal order for byte alignment or performance?
Question 1
Will the performance of foo2 be better than foo because of better byte
alignment for the columns?
Yes, the order of columns can have a small impact on performance. Type alignment is the more important factor, because it affects the footprint on disk. You can minimize storage size (play "column tetris") and squeeze more rows on a data page - which is the most important factor for speed.
Normally, it's not worth bothering. With an extreme example like in this related answer you get a substantial difference:
Calculating and saving space in PostgreSQL
Type alignment details:
Making sense of Postgres row sizes
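To see the alignment effect directly, here is a quick experiment you can run (the table and column names are made up; exact sizes assume a 64-bit build with 8 kB pages):
CREATE TABLE padded   (a smallint, b bigint, c smallint, d bigint);  -- 2 + 6 padding + 8 + 2 + 6 padding + 8 = 32 bytes of data per row
CREATE TABLE tetrised (b bigint, d bigint, a smallint, c smallint);  -- 8 + 8 + 2 + 2 (+ 4 padding) = 24 bytes of data per row
INSERT INTO padded   SELECT 1, 2, 3, 4 FROM generate_series(1, 100000);
INSERT INTO tetrised SELECT 2, 4, 1, 3 FROM generate_series(1, 100000);
SELECT pg_size_pretty(pg_relation_size('padded'))   AS padded_size,    -- noticeably larger
       pg_size_pretty(pg_relation_size('tetrised')) AS tetrised_size;  -- same data, fewer pages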
The other factor is that retrieving column values is slightly faster if you have fixed size columns first. I quote the manual here:
To read the data you need to examine each attribute in turn. First
check whether the field is NULL according to the null bitmap. If it
is, go to the next. Then make sure you have the right alignment. If
the field is a fixed width field, then all the bytes are simply
placed. If it's a variable length field (attlen = -1) then it's a bit
more complicated. All variable-length data types share the common
header structure struct varlena, which includes the total length of
the stored value and some flag bits.
There is an open TODO item to allow reordering of column positions in the Postgres Wiki, partly for these reasons.
Question 2
When Postgres executes a CREATE TABLE does it follow the column order
specified or does it re-organize the columns in optimal order for byte
alignment or performance?
Columns are stored in the defined order; the system does not try to optimize.
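You can verify the stored order in the catalog (using the table foo from the question):
SELECT attnum, attname, atttypid::regtype
FROM   pg_attribute
WHERE  attrelid = 'foo'::regclass
AND    attnum > 0
AND    NOT attisdropped
ORDER  BY attnum;  -- same order as in CREATE TABLE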
I fail to see any relevance of column order to TOAST tables like another answer seems to imply.
As far as I understand, PostgreSQL adheres to the order in which you declared the columns when saving records. Whether this affects performance is debatable. PostgreSQL stores all table data in pages, each 8 kB in size. 8 kB is the default, but it can be changed at compile time.
Each row in the table takes up space within a page. Since your table definition contains variable-length columns, a page can hold a variable number of records. What you want to do is make sure you can fit as many records into one page as possible. That is why you will notice performance degradation when a table has a huge number of columns or the column values are huge.
That being said, declaring a varchar(8192) does not mean one record will fill a whole page, and neither will a char(8192): the char value is padded with spaces to its declared length, so it stores far more data than necessary, but it may still be compressed or moved out of line via TOAST.
There is one more thing to consider when declaring TOAST-able types such as TEXT columns. These are columns whose values could exceed the maximum page size. A table with TOAST-able columns has an associated TOAST table where oversized values are stored out of line, and only a pointer to the data is kept in the main table. This can impact performance, but it can be mitigated with proper indexes on the TOAST-able columns.
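If you want to check whether a table has an associated TOAST table and how big it is, a catalog query along these lines works (using foo from the question):
SELECT c.relname                                         AS table_name,
       c.reltoastrelid::regclass                         AS toast_table,
       pg_size_pretty(pg_relation_size(c.reltoastrelid)) AS toast_size
FROM   pg_class c
WHERE  c.relname = 'foo'
AND    c.reltoastrelid <> 0;  -- 0 means the table has no TOAST table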
To conclude, I would have to say that the order of the columns does not play much of a role in the performance of a table. Most queries use indexes, which are stored separately, to retrieve records, so column order matters little there. It comes down to how many pages need to be read to retrieve the data.

How many records can I store in 5 MB of PostgreSQL on Heroku?

I'm going to store records in a single table with 2 fields:
id -> 4 characters
password_hash -> 64 characters
How many records like the one above will I be able to store in a 5 MB PostgreSQL database on Heroku?
P.S.: Given a single table with x columns of length y, how can I calculate the space it will take in the database?
Disk space occupied
Calculating the space on disk is not trivial. You have to take into account:
The overhead per table. Small, basically the entries in the system catalog.
The overhead per row (HeapTupleHeader) and per data page (PageHeaderData). Details about page layout in the manual.
Space lost to column alignment, depending on data types.
Space for a NULL bitmap. Effectively free for tables of 8 columns or less, irrelevant for your case.
Dead rows after UPDATE / DELETE. (Until the space is eventually vacuumed and reused.)
Size of index(es). You'll have a primary key, right? Index size is similar to that of a table with just the indexed columns and less overhead per row.
The actual space requirement of the data, depending on respective data types. Details for character types (incl. fixed length types) in the manual:
The storage requirement for a short string (up to 126 bytes) is 1 byte
plus the actual string, which includes the space padding in the case
of character. Longer strings have 4 bytes of overhead instead of 1.
More details for all types in the system catalog pg_type.
The database encoding, in particular for character types: UTF-8 uses up to four bytes to store one character (but 7-bit ASCII characters always occupy just one byte, even in UTF-8); see the quick check after this list.
Other small things that may affect your case, like TOAST - which should not affect you with 64-character strings.
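That quick check could look like this (the sample string is arbitrary; the byte count assumes a UTF-8 database):
SELECT char_length('naïve')  AS characters,  -- 5
       octet_length('naïve') AS bytes;       -- 6 in UTF-8, because 'ï' needs two bytes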
Calculate with test case
A simple method to find an estimate is to create a test table, fill it with dummy data and measure with the database object size functions:
SELECT pg_size_pretty(pg_relation_size('tbl'));
Including indexes:
SELECT pg_size_pretty(pg_total_relation_size('tbl'));
See:
Measure the size of a PostgreSQL table row
A quick test shows the following results:
CREATE TABLE test(a text, b text);
INSERT INTO test -- quick fake of matching rows
SELECT chr((g/1000 +32)) || to_char(g%1000, 'FM000')
, repeat (chr(g%120 + 32), 64)
FROM generate_series(1,50000) g;
SELECT pg_size_pretty(pg_relation_size('test')); -- 5640 kB
SELECT pg_size_pretty(pg_total_relation_size('test')); -- 5648 kB
After adding a primary key:
ALTER TABLE test ADD CONSTRAINT test_pkey PRIMARY KEY(a);
SELECT pg_size_pretty(pg_total_relation_size('test')); -- 6760 kB
So, I'd expect a maximum of around 44k rows without and around 36k rows with primary key.