List Partitioning in Postgres 12

CREATE TABLE countrymeasurements
(
countrycode int NOT NULL,
countryname character varying(30) NOT NULL,
languagename character varying (30) NOT NULL,
daysofoperation character varying(30) NOT NULL,
salesparts bigint,
replaceparts bigint
)
PARTITION BY LIST(countrycode)
(
partition india values(1),
partition japan values(2),
partition china values(3),
partition malaysia values(4)
);
I am getting ERROR: syntax error at or near "(". What am I missing here? I am using Postgres 12.

I don't know where you found that syntax; it's obviously not in the manual. As you can see there, partitions in Postgres are created with CREATE TABLE ... PARTITION OF:
Define the table:
CREATE TABLE countrymeasurements
(
countrycode int NOT NULL,
countryname character varying(30) NOT NULL,
languagename character varying (30) NOT NULL,
daysofoperation character varying(30) NOT NULL,
salesparts bigint,
replaceparts bigint
)
PARTITION BY LIST(countrycode);
Define the partitions:
create table india
partition of countrymeasurements
for values in (1);
create table japan
partition of countrymeasurements
for values in (2);
create table china
partition of countrymeasurements
for values in (3);
create table malaysia
partition of countrymeasurements
for values in (4);
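Just to illustrate the routing (a minimal sketch; the sample rows below are made up), tableoid shows which partition each inserted row landed in:
INSERT INTO countrymeasurements (countrycode, countryname, languagename, daysofoperation)
VALUES (1, 'India', 'Hindi', 'Mon-Sat'),
       (2, 'Japan', 'Japanese', 'Mon-Fri');

SELECT tableoid::regclass AS partition, countrycode, countryname
FROM countrymeasurements;
-- expected: one row reported under the india partition and one under japan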

Welcome to Stack Overflow! Please note that asking questions here without showing prior research may turn away people who otherwise might love to help.
In this case I checked and found no official example for list partitioning. But if you just shorten your statement, it will create a table partitioned by the values in the countrycode column:
CREATE TABLE countrymeasurements
(
countrycode int NOT NULL,
countryname character varying(30) NOT NULL,
languagename character varying (30) NOT NULL,
daysofoperation character varying(30) NOT NULL,
salesparts bigint,
replaceparts bigint
)
PARTITION BY LIST(countrycode)
;
The psql describe table command shows the partitioning is as requested:
psql=# \d countrymeasurements
Table "public.countrymeasurements"
Column | Type | Collation | Nullable | Default
-----------------+-----------------------+-----------+----------+---------
countrycode | integer | | not null |
countryname | character varying(30) | | not null |
languagename | character varying(30) | | not null |
daysofoperation | character varying(30) | | not null |
salesparts | bigint | | |
replaceparts | bigint | | |
Partition key: LIST (countrycode)
Then you can define the partitions as in the answer from @a_horse_with_no_name. But some notes on using such a strategy may be in order.
Notes:
When you allow just 4 explicit partitions via list (as you tried), what happens when value 5 comes along?
The PostgreSQL 12 documentation on DDL partitioning suggests considering hash partitioning instead of list: you choose the number of partitions up front instead of relying on your column values, which may be distributed very unevenly.
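For comparison, a minimal hash-partitioning sketch (the table and partition names are illustrative, not from the question, and the column list is trimmed for brevity):
CREATE TABLE countrymeasurements_hashed
(
countrycode int NOT NULL,
countryname character varying(30) NOT NULL,
salesparts bigint
)
PARTITION BY HASH (countrycode);

CREATE TABLE countrymeasurements_h0 PARTITION OF countrymeasurements_hashed FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE countrymeasurements_h1 PARTITION OF countrymeasurements_hashed FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE countrymeasurements_h2 PARTITION OF countrymeasurements_hashed FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE countrymeasurements_h3 PARTITION OF countrymeasurements_hashed FOR VALUES WITH (MODULUS 4, REMAINDER 3);
Rows are then distributed by a hash of countrycode, so no single value can force all data into one partition.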

Related

Data validation/constraint in Postgres DB

I have a (likely) simple question about data validation in a Postgres DB.
I have the following table:
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
--------------+-----------------------+-----------+----------+---------+----------+--------------+-------------
id_number | integer | | not null | | plain | |
last_name | character varying(50) | | not null | | extended | |
first_name | character varying(50) | | not null | | extended | |
school | character varying(50) | | not null | | extended | |
district | character varying(50) | | not null | | extended | |
Code to create the table
CREATE TABLE students (
id_number INTEGER PRIMARY KEY NOT NULL,
last_name VARCHAR(50) NOT NULL,
first_name VARCHAR(50) NOT NULL,
school VARCHAR(50) NOT NULL,
district VARCHAR(50) NOT NULL);
I want to create a list of valid input strings (text) for a column and reject any other input.
For example: for the "district" column, I want the only input allowed to be "district a", "district b", or "district c".
I've read over the constraints documentation but don't see anything about text constraints or using "or."
Is this possible? If so, how would I do it?
Thanks
Right at the top of the linked documentation it discusses CHECK constraints; that's what you want here:
CREATE TABLE students (
...
district VARCHAR(50) NOT NULL CHECK (district IN ('district a', 'district b', 'district c'))
);
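To sanity-check the constraint (a quick sketch; the sample rows are made up):
INSERT INTO students (id_number, last_name, first_name, school, district)
VALUES (1, 'Doe', 'Jane', 'Some School', 'district a');  -- accepted

INSERT INTO students (id_number, last_name, first_name, school, district)
VALUES (2, 'Doe', 'John', 'Some School', 'district x');  -- rejected with a CHECK constraint violation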
Alternatively, you could add a separate table with the districts and then use a FOREIGN KEY constraint to restrict the districts to only those in the districts table.
For this you'd have something like:
create table districts (
id integer not null primary key,
name varchar not null
)
and then:
CREATE TABLE students (
id_number INTEGER PRIMARY KEY NOT NULL,
last_name VARCHAR(50) NOT NULL,
first_name VARCHAR(50) NOT NULL,
school VARCHAR(50) NOT NULL,
district_id integer not null references districts(id)
)
and you'd JOIN to the districts table to get the district names.
Using a separate table would make it easier to get a list of possible districts, add new ones, remove old ones, and change districts' names. This would also be a more normalized approach; it might be a little more work at the beginning, but it is a big win later on.
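A minimal sketch of that setup in use (the district names are just examples):
INSERT INTO districts (id, name)
VALUES (1, 'district a'), (2, 'district b'), (3, 'district c');

-- Any insert into students with a district_id not present in districts is rejected by the foreign key.
SELECT s.last_name, s.first_name, d.name AS district
FROM students s
JOIN districts d ON d.id = s.district_id;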

psycopg2.errors.UndefinedColumn: column excluded.number does not exist

I have two tables in two different schemas on one database:
CREATE TABLE IF NOT EXISTS target_redshift.dim_collect_projects (
project_id BIGINT NOT NULL UNIQUE,
project_number BIGINT,
project_name VARCHAR(300) NOT NULL,
connect_project_id BIGINT NOT NULL,
project_desc VARCHAR(5000) NOT NULL,
project_type VARCHAR(50) NOT NULL,
project_status VARCHAR(100),
project_path VARCHAR(32768),
language_code VARCHAR(10),
country_code VARCHAR(10),
timezone VARCHAR(10),
project_created_at TIMESTAMP WITHOUT TIME ZONE,
project_modified_at TIMESTAMP WITHOUT TIME ZONE,
date_created TIMESTAMP WITHOUT TIME ZONE NOT NULL DEFAULT NOW(),
date_updated TIMESTAMP WITHOUT TIME ZONE NOT NULL DEFAULT NOW()
);
CREATE TABLE IF NOT EXISTS source_redshift.dim_collect_projects (
id BIGINT NOT NULL UNIQUE,
number BIGINT,
name VARCHAR(300) NOT NULL,
connect_project_id BIGINT NOT NULL,
description VARCHAR(5000) NOT NULL,
type VARCHAR(50) NOT NULL,
status VARCHAR(100),
path VARCHAR(32768),
language VARCHAR(10),
country VARCHAR(10),
timezone VARCHAR(10),
created TIMESTAMP WITHOUT TIME ZONE NULL DEFAULT NOW(),
modified TIMESTAMP WITHOUT TIME ZONE NULL DEFAULT NOW()
);
I need to copy data from the second table to the first.
I do it like this:
INSERT INTO target_redshift.dim_collect_projects AS t
SELECT id, number, name, connect_project_id, description,
type, status, path, language, country, timezone, created,
modified
FROM source_redshift.dim_collect_projects
ON CONFLICT (project_id)
DO UPDATE SET
(t.project_number, t.project_name, t.connect_project_id, t.project_desc,
t.project_type, t.project_status, t.project_path, t.language_code,
t.country_code, t.timezone, t.project_created_at, t.project_modified_at,
t.date_created, t.date_updated) = (EXCLUDED.number, EXCLUDED.name, EXCLUDED.connect_project_id,
EXCLUDED.description, EXCLUDED.type, EXCLUDED.status,
EXCLUDED.path, EXCLUDED.language, EXCLUDED.country,
EXCLUDED.timezone, EXCLUDED.created, EXCLUDED.modified, t.date_created, NOW())
And Airflow sends an error:
psycopg2.errors.UndefinedColumn: column excluded.number does not exist
LINE 12: t.date_created, t.date_updated) = (EXCLUDED.number, ...
You need to use the target_redshift.dim_collect_projects field names for the excluded.* fields, e.g. excluded.project_number. The target table is the controlling one for the column names, as that is where the data insert is being attempted.
UPDATE
Using an example table from my test database:
\d animals
Table "public.animals"
Column | Type | Collation | Nullable | Default
--------+------------------------+-----------+----------+---------
id | integer | | not null |
cond | character varying(200) | | not null |
animal | character varying(200) | | not null |
Indexes:
"animals_pkey" PRIMARY KEY, btree (id)
\d animals_target
Table "public.animals_target"
Column | Type | Collation | Nullable | Default
---------------+------------------------+-----------+----------+---------
target_id | integer | | not null |
target_cond | character varying(200) | | |
target_animal | character varying(200) | | |
Indexes:
"animals_target_pkey" PRIMARY KEY, btree (target_id)
insert into
animals_target
select
*
from
animals
ON CONFLICT
(target_id)
DO UPDATE SET
(target_id, target_cond, target_animal) =
(excluded.target_id, excluded.target_cond, excluded.target_animal);
NOTE: No use of a table alias for the table being inserted into.
The target table is the one the data is being inserted into. The attempted INSERT is into its columns, so they are the ones that are potentially being excluded.
For anyone who might come here later: I had a table which was created from an Excel import, and unwittingly one of the column names started with an invisible Unicode character.
ERROR: column excluded.columnname does not exist
LINE 5: ... (yada, yada) = (excluded.columnname, excluded.yada)
HINT: Perhaps you wanted to reference the column "excluded.columnname".
Since none of the answers have been marked as correct, I suggest that errors like the above may arise, even though everything looks perfectly fine, if a column name begins with one of these invisible characters. At least that was the case for me, and I had to scratch my head for quite a while before I figured it out.
One way to avoid such issues could be to not create tables automatically based on the contents of Excel files.
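If you suspect this, one way to check (a sketch; my_table is a placeholder for your table name) is to compare the character and byte lengths of each column name:
SELECT attname,
       length(attname) AS chars,
       octet_length(convert_to(attname, 'UTF8')) AS bytes
FROM pg_attribute
WHERE attrelid = 'my_table'::regclass
  AND attnum > 0
  AND NOT attisdropped;
-- A name whose byte count exceeds its character count contains non-ASCII
-- characters, which may include an invisible one.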
I got it working like this:
INSERT INTO dim_collect_projects_1 AS t (project_id, project_number, project_name, connect_project_id, project_desc,
project_type, project_status, project_path, language_code,
country_code, timezone, project_created_at, project_modified_at)
SELECT s.id, s.number, s.name, s.connect_project_id, s.description,
s.type, s.status, s.path, s.language, s.country, s.timezone, s.created,
s.modified
FROM dim_collect_projects_2 AS s
ON CONFLICT (project_id)
DO UPDATE SET
(project_number, project_name, connect_project_id, project_desc,
project_type, project_status, project_path, language_code,
country_code, timezone, project_created_at, project_modified_at,
date_updated) = (EXCLUDED.project_number,
EXCLUDED.project_name, EXCLUDED.connect_project_id,
EXCLUDED.project_desc, EXCLUDED.project_type, EXCLUDED.project_status,
EXCLUDED.project_path, EXCLUDED.language_code, EXCLUDED.country_code,
EXCLUDED.timezone, EXCLUDED.project_created_at,
EXCLUDED.project_modified_at, NOW())
WHERE t.project_number != EXCLUDED.project_number
OR t.project_name != EXCLUDED.project_name
OR t.connect_project_id != EXCLUDED.connect_project_id
OR t.project_desc != EXCLUDED.project_desc
OR t.project_type != EXCLUDED.project_type
OR t.project_status != EXCLUDED.project_status
OR t.project_path != EXCLUDED.project_path
OR t.language_code != EXCLUDED.language_code
OR t.country_code != EXCLUDED.country_code
OR t.timezone != EXCLUDED.timezone
OR t.project_created_at != EXCLUDED.project_created_at
OR t.project_modified_at != EXCLUDED.project_modified_at;

how could we store other values in list partition in postgresql

How could we store other values in a list partition in PostgreSQL?
Sample: how can I add a partition for values other than (1,2,3,4) in the table below?
CREATE TABLE countrymeasurements
(
countrycode int NOT NULL,
countryname character varying(30) NOT NULL,
languagename character varying (30) NOT NULL,
daysofoperation character varying(30) NOT NULL,
salesparts bigint,
replaceparts bigint
)
PARTITION BY LIST(countrycode);
Define the partitions:
create table india
partition of countrymeasurements
for values in (1);
create table japan
partition of countrymeasurements
for values in (2);
create table china
partition of countrymeasurements
for values in (3);
create table malaysia
partition of countrymeasurements
for values in (4);
I found the solution now:
create table dwh_user.countrymeasurements_def
partition of countrymeasurements
default;
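A quick check that unlisted values now have somewhere to go (the value 5 and the row contents are made up):
INSERT INTO countrymeasurements (countrycode, countryname, languagename, daysofoperation)
VALUES (5, 'Germany', 'German', 'Mon-Fri');

SELECT tableoid::regclass AS partition, countrycode
FROM countrymeasurements
WHERE countrycode = 5;
-- expected: the row is stored in the default partition countrymeasurements_def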

Postgres: group by and get most frequent value, plus count of matching foreign keys

I am working in Postgres 9.6. I have a table called person that looks like this:
id | integer (pk)
name | character varying(300)
name_slug | character varying(50)
And another table called person_to_manor that looks like this, where person_id is a foreign key to person.id:
id | integer (pk)
manor_id | integer
person_id | integer
I want to combine these two tables to populate a third table canonical_person in which the primary key is name_slug, and which has the following fields:
name_slug | character varying(50) (pk)
name | character varying(300)
num_manor | integer
where:
name_slug is the primary key
name is the most common value of person.name when grouped by name_slug
num_manor is the count of rows in person_to_manor that match any of the values of id for that value of name_slug.
Is this possible in a single SQL query? This is as far as I've got...
INSERT INTO canonical_person
VALUES (
SELECT name_slug,
[most popular value of name from `array_agg(distinct name) from person`],
COUNT(number of rows in person_to_manor that match any of `array_agg(distinct id) from person`)
FROM person
GROUP BY name_slug);
Is it something like this?
I created the three tables:
CREATE TABLE test.person (
id int4 NOT NULL,
"name" varchar(300) NULL,
name_slug varchar(50) NULL,
CONSTRAINT person_pkey PRIMARY KEY (id)
);
CREATE TABLE test.person_to_manor (
id int4 NOT NULL,
manor_id int4 NULL,
person_id int4 NULL,
CONSTRAINT person_to_manor_pkey PRIMARY KEY (id),
CONSTRAINT person_to_manor_person_id_fkey FOREIGN KEY (person_id) REFERENCES
test.person(id)
);
CREATE TABLE test.canonical_person (
name_slug varchar(50) NOT NULL,
"name" varchar(300) NULL,
num_manor int4 NULL,
CONSTRAINT canonical_person_pkey PRIMARY KEY (name_slug)
);
With the following values:
select * from test.person;
id|name|name_slug
--|----|---------
0|a |ab
1|b |aa
2|c |ab
3|a |bb
4|a |ab
select * from test.person_to_manor;
id|manor_id|person_id
--|--------|---------
1| 5| 0
2| 6| 0
3| 7| 2
I ran this query:
insert into test.canonical_person
select name_slug,
name as most_popular_name,
sub.n as count_rows
from (
select name,
name_slug,count(*) as n,
row_number () over(order by count(*) desc) as n_max
from test.person
group by name,name_slug
order by n_max asc
) as sub
where sub.n_max =1;
The result after the query:
select * from test.canonical_person;
name_slug|name|num_manor
---------|----|---------
ab |a | 2
Is this your goal?
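If you need one row per name_slug and num_manor counted from person_to_manor (rather than from person), a variation along these lines may be closer to the goal (a sketch using the question's table names, untested against your data):
INSERT INTO canonical_person (name_slug, name, num_manor)
SELECT DISTINCT ON (p.name_slug)
       p.name_slug,
       p.name,
       (SELECT count(*)
          FROM person_to_manor ptm
          JOIN person p2 ON p2.id = ptm.person_id
         WHERE p2.name_slug = p.name_slug) AS num_manor
FROM person p
GROUP BY p.name_slug, p.name
ORDER BY p.name_slug, count(*) DESC;
-- DISTINCT ON keeps the most frequent name per name_slug; the correlated
-- subquery counts person_to_manor rows for all person ids sharing that slug.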

How do I update 1.3 billion rows in this table more efficiently?

I have 1.3 billion rows in a PostgreSQL table sku_comparison that looks like this:
id1 (INTEGER) | id2 (INTEGER) | (10 SMALLINT columns) | length1 (SMALLINT)... |
... length2 (SMALLINT) | length_difference (SMALLINT)
The id1 and id2 columns reference the id column of a table called sku, which contains about 300,000 rows; each sku row has an associated varchar(25) value in a column, code.
There is a btree index built on id1 and id2, and a compound index of id1 and id2 in sku_comparison. There is a btree index on the id column of sku, as well.
My goal is to update the length1 and length2 columns with the lengths of the corresponding code column from the sku table. However, I ran the following code for over 20 hours, and it did not complete the update:
UPDATE sku_comparison SET length1=length(sku.code) FROM sku
WHERE sku_comparison.id1=sku.id;
All of the data is stored on a single hard disk on a local computer, and the processor is fairly modern. Constructing this table, which required much more complicated string comparisons in Python, only took about 30 hours or so, so I am not sure why something like this would take as long.
edit: here are formatted table definitions:
Table "public.sku"
Column | Type | Modifiers
------------+-----------------------+--------------------------------------------------
id | integer | not null default nextval('sku_id_seq'::regclass)
sku | character varying(25) |
pattern | character varying(25) |
pattern_an | character varying(25) |
firsttwo | character(2) | default ' '::bpchar
reference | character varying(25) |
Indexes:
"sku_pkey" PRIMARY KEY, btree (id)
"sku_sku_idx" UNIQUE, btree (sku)
"sku_firstwo_idx" btree (firsttwo)
Referenced by:
TABLE "sku_comparison" CONSTRAINT "sku_comparison_id1_fkey" FOREIGN KEY (id1) REFERENCES sku(id)
TABLE "sku_comparison" CONSTRAINT "sku_comparison_id2_fkey" FOREIGN KEY (id2) REFERENCES sku(id)
Table "public.sku_comparison"
Column | Type | Modifiers
---------------------------+----------+-------------------------
id1 | integer | not null
id2 | integer | not null
consec_charmatch | smallint |
consec_groupmatch | smallint |
consec_fieldtypematch | smallint |
consec_groupmatch_an | smallint |
consec_fieldtypematch_an | smallint |
general_charmatch | smallint |
general_groupmatch | smallint |
general_fieldtypematch | smallint |
general_groupmatch_an | smallint |
general_fieldtypematch_an | smallint |
length1 | smallint | default 0
length2 | smallint | default 0
length_difference | smallint | default '-999'::integer
Indexes:
"sku_comparison_pkey" PRIMARY KEY, btree (id1, id2)
"ssd_id1_idx" btree (id1)
"ssd_id2_idx" btree (id2)
Foreign-key constraints:
"sku_comparison_id1_fkey" FOREIGN KEY (id1) REFERENCES sku(id)
"sku_comparison_id2_fkey" FOREIGN KEY (id2) REFERENCES sku(id)
Would you consider using an anonymous code block?
Using pseudocode...
FOREACH 'SELECT sku.id,
sku.code,
length(sku.code)
FROM sku
INTO v_skuid, v_skucode, v_skulength'
DO
UPDATE sku_comparison
SET sku_comparison.length1 = v_skulength
WHERE sku_comparison.id1=v_skuid;
END DO
END FOREACH
This would break the whole thing into smaller transactions and you will not be evaluating the length of sku.code every time.
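A minimal runnable sketch of that idea as a PL/pgSQL DO block, assuming the tables from the question (the question's UPDATE refers to sku.code, although the \d output shows the column named sku, so adjust the column name to your schema). Note that a plain DO block still runs as a single transaction unless you COMMIT inside it, which PostgreSQL 11+ allows when the block is not wrapped in an outer transaction:
DO $$
DECLARE
    r RECORD;
BEGIN
    FOR r IN SELECT id, length(code) AS code_length FROM sku LOOP
        -- One small UPDATE per sku row; length(code) is computed once per sku
        -- instead of once per joined sku_comparison row.
        UPDATE sku_comparison
        SET length1 = r.code_length
        WHERE id1 = r.id;
    END LOOP;
END
$$;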