Test HASH function for Postgres table partitioning

I'm using Postgres 11 and would like to use hash partitioning on a table whose primary key is a UUID. I understand I need to choose a number of partitions up front, and that a hash of the primary key, taken modulo the number of partitions, will be used to assign rows to partitions.
Something like this:
CREATE TABLE new_table ( id uuid ) PARTITION BY HASH (id);
CREATE TABLE new_table_0 PARTITION OF new_table FOR VALUES WITH (MODULUS 3, REMAINDER 0);
CREATE TABLE new_table_1 PARTITION OF new_table FOR VALUES WITH (MODULUS 3, REMAINDER 1);
CREATE TABLE new_table_2 PARTITION OF new_table FOR VALUES WITH (MODULUS 3, REMAINDER 2);
The documentation mentions "the hash value of the partition key" but doesn't specify how that hashing takes place. I'd like to test this hash function against my existing data to see the distribution patterns for different numbers of partitions. Something like this:
SELECT unknown_partition_hash_function(id) AS hash_value, COUNT(id) AS number_of_records
FROM existing_table
GROUP BY 1
Is there a way to use this hash function in a SELECT statement?

It should use hash_any internally; that function doesn't seem to be exposed in any way that's directly accessible from SQL.
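That said, PostgreSQL 11 does expose a helper, satisfies_hash_partition(), which applies the same hashing and modulus arithmetic the partitioning code uses. A sketch of how you could preview the distribution with it, assuming new_table already exists as the hash-partitioned parent (the helper needs its OID to look up the partition key type):
-- For each row, find the remainder (0..2) it would hash to under MODULUS 3.
-- e.id must have the same type as the partition key of new_table (uuid here).
SELECT r.remainder,
       count(*) AS number_of_records
FROM existing_table e
JOIN generate_series(0, 2) AS r(remainder)
  ON satisfies_hash_partition('new_table'::regclass, 3, r.remainder, e.id)
GROUP BY r.remainder
ORDER BY r.remainder;
To test a different partition count, change the modulus argument and the generate_series bound together.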

Related

Is it possible to duplicate a table and transform the new table to a partitioned table

I'm trying to create a partitioned table out of an existing table.
For that I wrote this query, but it seems that my syntax is not correct:
create table table_Duplicate as
(select * from table) PARTITION BY RANGE (datecalcul);
Note that datecalcul is a timestamp column.
My main question: is it possible to duplicate a PostgreSQL table and transform the new table into a partitioned one?
Why not create your table first and then insert the data into it:
-- step 1 - declare table definition
create table table_Duplicate (
< copy table structure from table >
) PARTITION BY RANGE (datecalcul);
-- step 2 - declare partitions
create table tablename_2021 PARTITION OF table_Duplicate
for values from ('2021-01-01') TO ('2022-01-01');  -- TO bound is exclusive
create table tablename_2020 PARTITION OF table_Duplicate
for values from ('2020-01-01') TO ('2021-01-01');
...
-- step 3 create indexes
create index on tablename_2021 (datecalcul);
create index on tablename_2020 (datecalcul);
...
-- step 4 insert data
insert into table_Duplicate
select * from table;
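If you don't want to spell out the column list by hand, a possible shortcut for step 1 is the LIKE clause (a sketch, using source_table as a stand-in name since "table" itself is a reserved word; INCLUDING DEFAULTS copies column defaults but not indexes or constraints):
-- copy the column definitions from the existing table
create table table_Duplicate (
    like source_table including defaults
) PARTITION BY RANGE (datecalcul);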

On what basis should I decide on the optimal number of hash partitions for a given table?

For example, I would like to create a hash partitioned table like:
CREATE TABLE partition_table (
some_id INT NOT NULL
) PARTITION BY HASH (some_id);
And I start by creating 4 partitions like:
CREATE TABLE partition_1 PARTITION OF partition_table FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE partition_2 PARTITION OF partition_table FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE partition_3 PARTITION OF partition_table FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE partition_4 PARTITION OF partition_table FOR VALUES WITH (MODULUS 4, REMAINDER 3);
I have read here that "it's always recommended that the number of tables should be a power of 2".
On what basis should I decide on the optimal number of partitions for a given table?
Edit:
My objective behind using hash partitioning is to distribute data uniformly. More precisely, I have a table with a hashed id (which has no logical order), and I would like to distribute data evenly across partitions based on that key.
Hash partitioning is pretty useless: it is essentially striping implemented with partitioning. The only other advantage I see is that it is easier for autovacuum to maintain several smaller tables than one big table.
In my opinion, hash partitioning only makes sense if each partition is on a different tablespace, and each tablespace uses different storage. Then the number of the partitions should be the number of different independent disks you have available.
I don't see a reason to have a power of two for the number of partitions. The article you reference doesn't give any explanation either, so I'd say forget it.
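If you do go ahead with hash partitioning, you can check how evenly the rows actually spread with a query like this (a sketch; tableoid identifies the partition each row is physically stored in):
-- row count per partition of the hash-partitioned table
SELECT tableoid::regclass AS partition_name,
       count(*)           AS row_count
FROM partition_table
GROUP BY tableoid
ORDER BY partition_name;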

Amazon Redshift COMPOUND SORTKEY - does insertion order matter?

Let's say I've created an empty table in Redshift like this:
CREATE TABLE my_table (
val_1 INT ,
val_2 INT ,
val_3 FLOAT
)
COMPOUND SORTKEY(val_1, val_2)
;
When I first populate the table (let's say with the results of some query), should the records be inserted in the SORTKEY order, using the ORDER BY in the code below:
INSERT INTO my_table
SELECT val_1, val_2, val_3 FROM other_table
ORDER BY val_1, val_2
Or is there no need to do that, i.e. is SORTKEY ordering of inserted records handled physically by Redshift itself? Thanks.
Assuming the same behaviour for INSERT INTO as for loading via the COPY command, there is no need to order the records first. According to the AWS docs, all of the following criteria must be met for new rows to be added to the sorted region of the table; in your example you have a COMPOUND SORTKEY of two columns:
- The table uses a compound sort key with only one sort column.
- The sort column is NOT NULL.
- The table is 100 percent sorted or empty.
- All the new rows are higher in sort order than the existing rows, including rows marked for deletion. In this case, Amazon Redshift uses the first eight bytes of the sort key to determine sort order.
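Since your sort key has two columns, the first criterion is not met and new rows land in the unsorted region regardless of insert order. The usual follow-up (a sketch, using the table name from the question) is to re-sort and refresh statistics afterwards:
-- re-sort rows from the unsorted region, then update planner statistics
VACUUM SORT ONLY my_table;
ANALYZE my_table;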

PostgreSQL partitioning syntax FOR VALUES

I'm working on table partitioning in PostgreSQL.
I created a partition for my master table:
CREATE TABLE head_partition_table PARTITION OF master_table
FOR VALUES FROM (DATE_START) TO (DATE_END)
PARTITION BY RANGE (ENTITY_ID, GROUP_NAME);
After that, I want to divide head_partition_table into smaller partitions, so I wrote code:
CREATE TABLE subpartition_table PARTITION OF head_partition_table
FOR VALUES FROM ('0', 'A') TO ('0', 'Z');
I can't find how I can specify individual values rather than a range.
Something like
CREATE TABLE subpartition_table PARTITION OF head_partition_table
FOR VALUES ('0', 'A');
CREATE TABLE subpartition_table PARTITION OF head_partition_table
FOR VALUES ('0', 'Z');
I get a syntax error at or near "(".
Is this possible?
P.S. I tried PARTITION BY LIST, but in that case, I can use just one field.
You can get the individual values you want by introducing another layer of LIST partitions:
CREATE TABLE head_partition_table PARTITION OF master_table
FOR VALUES FROM ('2019-01-01') TO ('2020-01-01')
PARTITION BY LIST (entity_id);
CREATE TABLE subpartition1 PARTITION OF head_partition_table
FOR VALUES IN ('0')
PARTITION BY LIST (group_name);
CREATE TABLE subsubpartition1 PARTITION OF subpartition1
FOR VALUES IN ('A');
But this is more an academic exercise than something useful.
Anything exceeding at most a few hundred partitions will not perform well at all.
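If you build such a nested layout, PostgreSQL 12 and later can show the whole tree at once (a sketch; pg_partition_tree() is not available in PostgreSQL 11):
-- list every partition under master_table with its parent and nesting level
SELECT relid, parentrelid, isleaf, level
FROM pg_partition_tree('master_table');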

PostgreSQL: Partition table by external (non-column) value

I want to split up a table into partitions, with one partition for each distinct value of a column. So, if foo_id is the column, I want to put a row with foo_id=23 into the partition bar_23. Is it possible to then remove the column from the partitions and still use the value for selecting the partition via constraints, or do I have to explicitly name the partition table in the query?
I.e., can I do this
INSERT INTO bar (foo_id, other) VALUES (23, 'value');
without actually having foo_id in the table? Or do I have to go the explicit way:
INSERT INTO bar_23 (other) VALUES ('value');
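For reference, the setup the question describes would look roughly like this with declarative LIST partitioning (a sketch; table and column names taken from the question):
-- parent table partitioned by the distinct values of foo_id
CREATE TABLE bar (
    foo_id int NOT NULL,
    other  text
) PARTITION BY LIST (foo_id);

CREATE TABLE bar_23 PARTITION OF bar FOR VALUES IN (23);

-- routed through the parent: the row with foo_id = 23 lands in bar_23
INSERT INTO bar (foo_id, other) VALUES (23, 'value');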