Encoding a Postgres UUID in Amazon Redshift - postgresql

We have a couple of entities which are being persisted into Amazon Redshift for reporting purposes, and these entities have a relationship between them. The source tables in Postgres are related via a foreign key with a UUID datatype, which is not supported in Redshift.
One option is to encode the UUID as a 128 bit signed integer. The Redshift documentation refers to the ability to create NUMBER(38,0), and to the ability to create 128 bit numbers.
But 2^128 = 340,282,366,920,938,463,463,374,607,431,768,211,456 which is 39 digits. (thanks Wikipedia). So despite what the docs say, you cannot store the full 128 bits / 39 digits of precision in Redshift. How do you actually make a full 128 bit number column in Redshift?
In short, the real question behind this is - what is Redshift best practice for storing & joining tables which have UUID primary keys?

Redshift joins will perform well even with a VARCHAR key, so that's where I would start.
The main factor for join performance will be co-locating the rows onto the same compute node. To achieve this you should declare the UUID column as the distribution key on both tables.
Alternatively, if one of the tables is fairly small (<= ~1 million rows), then you can declare that table as DISTSTYLE ALL and choose some other dist key for the larger table.
If you have co-located the join and wish to optimize further then you could try splitting the UUID value into 2 BIGINT columns, one for the top 64 bits and another for the bottom 64. Even half of the UUID is likely to be unique and then you can use the second column as a "tie breaker".
c.f. "Amazon Redshift Engineering’s Advanced Table Design Playbook: Preamble, Prerequisites, and Prioritization"

Related

What is postgres uuid_generate_v4() maximum?

I have logs table with many rows where pk is generated by uuid_generate_v4() function.
What i'm curious about - is there a limit for generated uuids? Like if i will have 10.000.000.000 rows it will not able to generate unique primary key.
Since a UUID is a 128 bit number, the maximum of different UUIDs would be 2^128 = 340.282.366.920.938.463.463.374.607.431.768.211.456 (if that big number calculator made no mistake but it sure is very, very large). So you're far, far, far away from that with just 10.000.000.000.

Shall I SET STORAGE PLAIN on fixed-length bytea PK column?

My Postgres table's primary key is a SHA1 checksum (always 20 bytes) stored in a bytea column (because Postgres doesn't have fixed-length binary types).
Shall I ALTER TABLE t ALTER COLUMN c SET STORAGE PLAIN not to let Postgres compress and/or outsource (TOAST) my PK/FK for the sake of lookup and join performance? And why (not)?
I would say that that is a micro-optimization that will probably not have a measurable effect.
First, PostgreSQL only considers compressing and slicing values if the row exceeds 2000 bytes, so there will only be an effect at all if your rows routinely exceed that size.
Then, even if the primary key column gets toasted, you will probably only be able to measure a difference if you select a large number of rows in a single table scan. Fetching only a few rows by index won't make a big difference.
I'd benchmark both approaches, but I'd assume that it will be hard to measure a difference. I/O and other costs will probably hide the small extra CPU time required for decompression (remember that the row has to be large for TOAST to kick in in the first place).

Is a PK of type bytea slower compared to a PK of a sequential type integer in Postgres?

I am working on a project where I have to store millions of rows with a column x of type bytea (with a maximum size of 128 bytes). I need to query the data by x (i.e. where x = ?). Now I was wondering if I can use x directly as a primary key without any negative performance impact?
I also have to join that table on the primary key from another table, therefore I would also have to store bytea as foreign key in another table.
As far as I know, most database systems make use of a B+-Tree which has a search complexity of θ(log(n)). When using bytea as primary key, I am not sure if Postgres can efficiently organize such a B+-Tree?
If you can guarantee that the value of the bytea never changes, you can use it as primary key.
But it is not necessarily wise to do so: if that key is stored in other tables as well, this will waste space, and an artificial primary key might be better.

Wrong (?) Size of PostgreSQL table

I have a table with columns and constraints:
height smallint,
length smallint,
diameter smallint,
volume integer,
idsensorfragments integer,
CONSTRAINT sensorstats_idsensorfragments_fkey FOREIGN KEY (idsensorfragments)
REFERENCES sensorfragments (idsensorfragments) MATCH SIMPLE
ON UPDATE CASCADE ON DELETE CASCADE
(no primary key). There is currently 28 978 112 records in it, but the size of the table is way too much in my opinion.
Result of the query:
select pg_size_pretty(pg_total_relation_size('sensorstats')), pg_size_pretty(pg_relation_size('sensorstats'))
is:
"1849 MB";"1226 MB"
There is just one index on idsensorfragments column. Using simple math you can see that one record takes ~66,7 B (?!?!). Can anyone explain me where this number comes from?
5 columns = 2 + 2 + 2 + 4 + 4 = 14 Bytes. I have one index, no primary key. Where additional 50B per record comes from?
P.S. Table was vacuumed, analyzed and reindexed.
You should take a look on how Database Physical Storage is organized, especially on the Page Layout.
PostgreSQL keeps a bunch of extra fields for each tuple (row) and also for each Page. Tuples are kept in the Pages, as Page is the item that database operates on, typically 8192 bytes in size. So the extra space usage comes from:
PageHeader, 24 bytes;
Tupleheader, 27 bytes;
“invisible” Tuple versions;
reserved free space, according to the Storage Parameters of the table;
NULL indicator array;
(might have missed something more).
The layout of physical storage changes between major releases, that's the reason you have to perform a full dump/restore. In the recent versions pg_upgrade is of great help in this process.
Did you do a VACUUM FULL or CLUSTER? If not, the unused space is still dedicated to this table and index. These statements rewrite the table, a VACUUM without FULL doesn't do the rewrite.

Doubt in clustered and non Clustered index

I have a doubt that if my table do n't have any constraint like Primary Key,Foreign key,Unique key etc. then can i create the clustered index on table and clustered index can have the douplicate records ?
My 2nd question is where should we exectly use the non clustered index and when it is useful and benificial to create in table?
My 3rd question is How can we create the 249 non clustered index in a table .Is it the meaning, Creating the non clustered index on 249 columns ?
Can you anyone help me to remove my confusion in this.
First, the definition of a clustered index is that it is physical ordering of data on the disk. Every time you do an insert into that table, the new record will be placed on the physical disk in its order based on its value in the clustered index column. Because it is the physical location on the disk, it is (A) the most rapidly accessible column in the table but (B) only possible to define a single clustered index per table. Which column (or columns) you use as the clustered index depend on the data itself and its use. Primary keys are typically the clustered index, especially if the primary key is sequential (e.g. an integer that increments automatically with each insert). This will provide the fastest insert/update/delete functionality. If you are more interested in performing reads (select * from table), you may want to cluster on a Date column, as most queries have either a date in the where clause, the group by clause or both.
Second, clustered indexes (at least in the DB's I know) need not be unique (they CAN have duplicates). Constraining the column to be unique is separate matter. If the clustered index is a primary key its uniqueness is a function of being a primary key.
Third, I can't follow you questions concerning 249 columns. A non-clustered index is basically a tool for accelerating queries at the expense of extra disk space. It's hard to think of a case where creating an index on each column is necessary. If you want a quick rule of thumb...
Write a query using your table.
If a column is required to do a join, index it.
If a column is used in a where column, index it.
Remember all the indexes are doing for you is speeding up your queries. If queries run fast, don't worry about them.
This is just a thumbnail sketch of a large topic. There are tons of more informative/comprehensive resources on this matter, and some depend on the database system ... just google it.