I have a (likely) simple question about data validation in a Postgres DB.
I have the following table:
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
--------------+-----------------------+-----------+----------+---------+----------+--------------+-------------
id_number | integer | | not null | | plain | |
last_name | character varying(50) | | not null | | extended | |
first_name | character varying(50) | | not null | | extended | |
school | character varying(50) | | not null | | extended | |
district | character varying(50) | | not null | | extended | |
Code to create the table
CREATE TABLE students (
id_number INTEGER PRIMARY KEY NOT NULL,
last_name VARCHAR(50) NOT NULL,
first_name VARCHAR(50) NOT NULL,
school VARCHAR(50) NOT NULL,
district VARCHAR(50) NOT NULL);
I want to create a list of valid input strings (text) for a column and reject any other input.
For example: for the "district" column, I want the only allowed inputs to be "district a", "district b", or "district c".
I've read over the constraints documentation but don't see anything about text constraints or using "or."
Is this possible? If so, how would I do it?
Thanks
Right at the top of the linked documentation it discusses CHECK constraints; that's what you want here:
CREATE TABLE students (
...
district VARCHAR(50) NOT NULL CHECK (district in ('district a', 'district b', 'district c'))
);
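With that constraint in place, an insert using any other district value is rejected, for example (the sample row here is made up):
-- fails the CHECK constraint: 'district x' is not in the allowed list
INSERT INTO students (id_number, last_name, first_name, school, district)
VALUES (1, 'Doe', 'Jane', 'Some School', 'district x');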
Alternatively, you could add a separate table with the districts and then use a FOREIGN KEY constraint to restrict the districts to only those in the districts table.
For this you'd have something like:
create table districts (
id integer not null primary key,
name varchar not null
)
and then:
CREATE TABLE students (
id_number INTEGER PRIMARY KEY NOT NULL,
last_name VARCHAR(50) NOT NULL,
first_name VARCHAR(50) NOT NULL,
school VARCHAR(50) NOT NULL,
district_id integer not null references districts(id)
)
and you'd JOIN to the districts table to get the district names.
Using a separate table would make it easier to get a list of possible districts, add new ones, remove old ones, and change district names. It is also a more normalized approach: a little more work at the beginning, but a big win later on.
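For example, a quick sketch of how the lookup table might be used (the district names and ids are just placeholders):
-- populate the lookup table
INSERT INTO districts (id, name) VALUES
    (1, 'district a'),
    (2, 'district b'),
    (3, 'district c');
-- resolve district names for students via the foreign key
SELECT s.id_number, s.last_name, s.first_name, s.school, d.name AS district
FROM students s
JOIN districts d ON d.id = s.district_id;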
I have two tables in two different schemas on one database:
CREATE TABLE IF NOT EXISTS target_redshift.dim_collect_projects (
project_id BIGINT NOT NULL UNIQUE,
project_number BIGINT,
project_name VARCHAR(300) NOT NULL,
connect_project_id BIGINT NOT NULL,
project_desc VARCHAR(5000) NOT NULL,
project_type VARCHAR(50) NOT NULL,
project_status VARCHAR(100),
project_path VARCHAR(32768),
language_code VARCHAR(10),
country_code VARCHAR(10),
timezone VARCHAR(10),
project_created_at TIMESTAMP WITHOUT TIME ZONE,
project_modified_at TIMESTAMP WITHOUT TIME ZONE,
date_created TIMESTAMP WITHOUT TIME ZONE NOT NULL DEFAULT NOW(),
date_updated TIMESTAMP WITHOUT TIME ZONE NOT NULL DEFAULT NOW()
);
CREATE TABLE IF NOT EXISTS source_redshift.dim_collect_projects (
id BIGINT NOT NULL UNIQUE,
number BIGINT,
name VARCHAR(300) NOT NULL,
connect_project_id BIGINT NOT NULL,
description VARCHAR(5000) NOT NULL,
type VARCHAR(50) NOT NULL,
status VARCHAR(100),
path VARCHAR(32768),
language VARCHAR(10),
country VARCHAR(10),
timezone VARCHAR(10),
created TIMESTAMP WITHOUT TIME ZONE NULL DEFAULT NOW(),
modified TIMESTAMP WITHOUT TIME ZONE NULL DEFAULT NOW()
);
I need to copy data from the second table to the first.
I do it like this:
INSERT INTO target_redshift.dim_collect_projects AS t
SELECT id, number, name, connect_project_id, description,
type, status, path, language, country, timezone, created,
modified
FROM source_redshift.dim_collect_projects
ON CONFLICT (project_id)
DO UPDATE SET
(t.project_number, t.project_name, t.connect_project_id, t.project_desc,
t.project_type, t.project_status, t.project_path, t.language_code,
t.country_code, t.timezone, t.project_created_at, t.project_modified_at,
t.date_created, t.date_updated) = (EXCLUDED.number, EXCLUDED.name, EXCLUDED.connect_project_id,
EXCLUDED.description, EXCLUDED.type, EXCLUDED.status,
EXCLUDED.path, EXCLUDED.language, EXCLUDED.country,
EXCLUDED.timezone, EXCLUDED.created, EXCLUDED.modified, t.date_created, NOW())
And Airflow reports an error:
psycopg2.errors.UndefinedColumn: column excluded.number does not exist
LINE 12: t.date_created, t.date_updated) = (EXCLUDED.number, ...
You need to use the target_redshift.dim_collect_projects field names for the excluded.* fields, e.g. excluded.project_number. The target table is the controlling one for the column names, as that is where the insert is being attempted.
UPDATE
Using an example table from my test database:
\d animals
Table "public.animals"
Column | Type | Collation | Nullable | Default
--------+------------------------+-----------+----------+---------
id | integer | | not null |
cond | character varying(200) | | not null |
animal | character varying(200) | | not null |
Indexes:
"animals_pkey" PRIMARY KEY, btree (id)
\d animals_target
Table "public.animals_target"
Column | Type | Collation | Nullable | Default
---------------+------------------------+-----------+----------+---------
target_id | integer | | not null |
target_cond | character varying(200) | | |
target_animal | character varying(200) | | |
Indexes:
"animals_target_pkey" PRIMARY KEY, btree (target_id)
insert into animals_target
select * from animals
ON CONFLICT (target_id)
DO UPDATE SET
    (target_id, target_cond, target_animal) =
    (excluded.target_id, excluded.target_cond, excluded.target_animal);
NOTE: No use of table alias for the table being inserted into.
The target table is the one the data is being inserted into. The attempted INSERT is into its columns, so they are the ones that are potentially excluded.
For anyone who might come here later: I had a table which was created from an Excel import, and unwittingly one of the column names started with an invisible Unicode character.
ERROR: column excluded.columnname does not exist
LINE 5: ... (yada, yada) = (excluded.columnname, excluded.yada)
HINT: Perhaps you wanted to reference the column "excluded.columnname".
Since none of the answers has been marked as correct, I suggest that errors like the above can arise, even though everything looks perfectly fine, when a column name begins with one of these invisible characters. At least that was the case for me, and I had to scratch my head for quite a while before I figured it out.
One way to avoid such issues could be to not create tables automatically based on the contents of Excel files.
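Another option is to sanity-check the imported column names. For instance, a query along these lines (assuming the table ended up in the public schema) should surface any column name containing characters other than lower-case letters, digits, and underscores:
-- flag column names containing anything besides lower-case letters, digits and underscores
SELECT table_name, column_name
FROM information_schema.columns
WHERE table_schema = 'public'
  AND column_name !~ '^[a-z0-9_]+$';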
I did it like this:
INSERT INTO dim_collect_projects_1 AS t (project_id, project_number, project_name, connect_project_id, project_desc,
project_type, project_status, project_path, language_code,
country_code, timezone, project_created_at, project_modified_at)
SELECT s.id, s.number, s.name, s.connect_project_id, s.description,
s.type, s.status, s.path, s.language, s.country, s.timezone, s.created,
s.modified
FROM dim_collect_projects_2 AS s
ON CONFLICT (project_id)
DO UPDATE SET
(project_number, project_name, connect_project_id, project_desc,
project_type, project_status, project_path, language_code,
country_code, timezone, project_created_at, project_modified_at,
date_updated) = (EXCLUDED.project_number,
EXCLUDED.project_name, EXCLUDED.connect_project_id,
EXCLUDED.project_desc, EXCLUDED.project_type, EXCLUDED.project_status,
EXCLUDED.project_path, EXCLUDED.language_code, EXCLUDED.country_code,
EXCLUDED.timezone, EXCLUDED.project_created_at,
EXCLUDED.project_modified_at, NOW())
WHERE t.project_number != EXCLUDED.project_number
OR t.project_name != EXCLUDED.project_name
OR t.connect_project_id != EXCLUDED.connect_project_id
OR t.project_desc != EXCLUDED.project_desc
OR t.project_type != EXCLUDED.project_type
OR t.project_status != EXCLUDED.project_status
OR t.project_path != EXCLUDED.project_path
OR t.language_code != EXCLUDED.language_code
OR t.country_code != EXCLUDED.country_code
OR t.timezone != EXCLUDED.timezone
OR t.project_created_at != EXCLUDED.project_created_at
OR t.project_modified_at != EXCLUDED.project_modified_at;
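One caveat with the != comparisons in the WHERE clause: they evaluate to NULL rather than true when either side is NULL, so a nullable column such as project_status changing from NULL to a value (or back) would be silently skipped. A quick illustration of the difference:
-- != yields NULL when one side is NULL, whereas IS DISTINCT FROM is NULL-safe
SELECT NULL::varchar != 'x'               AS neq_result,         -- NULL
       NULL::varchar IS DISTINCT FROM 'x' AS is_distinct_result; -- true
If that matters for your data, swapping the != comparisons for IS DISTINCT FROM keeps those rows updatable.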
I am working in Postgres 9.6. I have a table called person that looks like this:
id | integer (pk)
name | character varying(300)
name_slug | character varying(50)
And another table called person_to_manor that looks like this, where person_id is a foreign key to person.id:
id | integer (pk)
manor_id | integer
person_id | integer
I want to combine these two tables to populate a third table canonical_person in which the primary key is name_slug, and which has the following fields:
name_slug | character varying(50) (pk)
name | character varying(300)
num_manor | integer
where:
name_slug is the primary key
name is the most common value of person.name when grouped by name_slug
num_manor is the count of rows in person_to_manor that match any of the values of id for that value of name_slug.
Is this possible in a single SQL query? This is as far as I've got...
INSERT INTO canonical_person
VALUES (
SELECT name_slug,
[most popular value of name from `array_agg(distinct name) from person`],
COUNT(number of rows in person_to_manor that match any of `array_agg(distinct id) from person`)
FROM person
GROUP BY name_slug);
Is it something like this?
I created the three tables:
CREATE TABLE test.person (
id int4 NOT NULL,
"name" varchar(300) NULL,
name_slug varchar(50) NULL,
CONSTRAINT person_pkey PRIMARY KEY (id)
);
CREATE TABLE test.person_to_manor (
id int4 NOT NULL,
manor_id int4 NULL,
person_id int4 NULL,
CONSTRAINT person_to_manor_pkey PRIMARY KEY (id),
CONSTRAINT person_to_manor_person_id_fkey FOREIGN KEY (person_id) REFERENCES
test.person(id)
);
CREATE TABLE test.canonical_person (
name_slug varchar(50) NOT NULL,
"name" varchar(300) NULL,
num_manor int4 NULL,
CONSTRAINT canonical_person_pkey PRIMARY KEY (name_slug)
);
With the following values
select * from test.person;
id|name|name_slug
--|----|---------
0|a |ab
1|b |aa
2|c |ab
3|a |bb
4|a |ab
select * from test.person_to_manor;
id|manor_id|person_id
--|--------|---------
1| 5| 0
2| 6| 0
3| 7| 2
I ran this query:
insert into test.canonical_person
select name_slug,
       name as most_popular_name,
       sub.n as count_rows
from (
    select name,
           name_slug,
           count(*) as n,
           row_number() over (order by count(*) desc) as n_max
    from test.person
    group by name, name_slug
    order by n_max asc
) as sub
where sub.n_max = 1;
The result after running the query:
select * from test.canonical_person;
name_slug|name|num_manor
---------|----|---------
ab |a | 2
Is this your goal?
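If the goal is instead one row per name_slug (the most frequent name within each slug, plus the number of person_to_manor rows for any person with that slug), the window function can be partitioned by name_slug and combined with a correlated count. A rough, untested sketch against the same test tables:
INSERT INTO test.canonical_person (name_slug, "name", num_manor)
SELECT ranked.name_slug,
       ranked."name",
       (SELECT count(*)
          FROM test.person p
          JOIN test.person_to_manor ptm ON ptm.person_id = p.id
         WHERE p.name_slug = ranked.name_slug) AS num_manor
FROM (
    SELECT name_slug,
           "name",
           row_number() OVER (PARTITION BY name_slug ORDER BY count(*) DESC) AS rn
    FROM test.person
    GROUP BY name_slug, "name"
) AS ranked
WHERE ranked.rn = 1;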
I am using PostgreSQL and would like to prevent certain required CHARACTER VARYING (VARCHAR) fields from allowing empty string inputs.
These fields would also need to contain unique values, so I am already using a unique constraint; however, this does not prevent an original (unique) empty value.
Basic example, where username needs to be unique and not empty
| id | username | password |
+----+----------+----------+
| 1 | User1 | pw1 | #Allowed
| 2 | User2 | pw1 | #Allowed
| 3 | User2 | pw2 | #Already prevented by constraint
| 4 | '' | pw2 | #Currently allowed, but needs to be prevented
Use a check constraint:
CREATE TABLE foobar(
x TEXT NOT NULL UNIQUE,
CHECK (x <> '')
);
INSERT INTO foobar(x) VALUES('');
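-- this insert is rejected with a check constraint violation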
You can use the standard SQL 'CONSTRAINT...CHECK' clause when defining table fields:
CREATE TABLE test
(
nonempty VARCHAR NOT NULL UNIQUE CONSTRAINT non_empty CHECK(length(nonempty)>0)
)
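Note that length(nonempty) > 0 still accepts strings consisting only of spaces. If those should be rejected as well, a trimmed check is one option (the table and constraint names here are just placeholders):
CREATE TABLE test2
(
 nonempty VARCHAR NOT NULL UNIQUE CONSTRAINT non_blank CHECK (btrim(nonempty) <> '')
);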
As a special kind of constraint, you can put the datatype+constraint into a DOMAIN:
-- set search_path='tmp';
DROP DOMAIN birthdate CASCADE;
CREATE DOMAIN birthdate AS date DEFAULT NULL
CHECK (value >= '1900-01-01' AND value <= now())
;
DROP DOMAIN username CASCADE;
CREATE DOMAIN username AS VARCHAR NOT NULL
CHECK (length(value) > 0)
;
DROP TABLE employee CASCADE;
CREATE TABLE employee
( empno INTEGER NOT NULL PRIMARY KEY
, dob birthdate
, zname username
, UNIQUE (zname)
);
INSERT INTO employee(empno,dob,zname)
VALUES (1,'1980-02-02', 'John Doe' ), (2,'1980-02-02', 'Jon Doeh' );
INSERT INTO employee(empno,dob,zname)
VALUES (3,'1980-02-02', '' ), (4,'1980-01-01', 'Joan Doh' );
This will allow you to reuse the domain again and again, without having to copy the constraint every time.
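For instance, the same domain could back another (hypothetical) table without repeating the check:
CREATE TABLE account
( account_id INTEGER NOT NULL PRIMARY KEY
, login      username
, UNIQUE (login)
);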
-- UPDATE 2021-03-25 (Thanks to #AlexanderPavlov)
There appears to be a serious flaw in Postgres's implementation: it is possible to insert NULLs from the results of an empty scalar subquery.
The (nonsensical) COALESCE() below "fixes" this behaviour.
This allows us to put the database into a forbidden state.
\echo literal NULL
INSERT INTO employee(empno,dob,zname) VALUES (5,'2021-02-02', NULL );
\echo empty (scalar) set
INSERT INTO employee(empno,dob,zname) VALUES (6,'2021-02-02', (select zname from employee where 1=0) );
\echo empty COALESCE((scalar, NULL) ) set
INSERT INTO employee(empno,dob,zname) VALUES (7,'2021-02-02', (select COALESCE(zname,NULL) from employee where 1=0) );
\echo empty set#2
INSERT INTO employee(empno,dob,zname) (select 8,'2021-03-03', zname from employee where 1=0 );
\echo duplicate the complete table
INSERT INTO employee(empno,dob,zname) (select 100+empno,dob+'1mon':: interval, upper(zname) from employee );
select * from employee;
Extra Results:
literal NULL
ERROR: domain username does not allow null values
empty (scalar) set
INSERT 0 1
empty COALESCE((scalar, NULL) ) set
ERROR: domain username does not allow null values
empty set#2
INSERT 0 0
duplicate the complete table
ERROR: domain username does not allow null values
empno | dob | zname
-------+------------+----------
1 | 1980-02-02 | John Doe
2 | 1980-02-02 | Jon Doeh
6 | 2021-02-02 |
(3 rows)
An INSERT on a table triggers a stored proc where the following error occurs.
ERROR: column "targetedfamily" is of type boolean but expression is of type character varying
Hint: You will need to rewrite or cast the expression.
Where: PL/pgSQL function "fn_family_audit" line 19 at SQL statement
And here's the ERRING stored proc (notice that my attempt to fix the problem by doing CAST(NEW.targetedfamily AS BOOLEAN) does NOT seem to work)
CREATE OR REPLACE FUNCTION fn_family_audit() RETURNS TRIGGER AS $tr_family_audit$
BEGIN
--
-- Create a row in family_audit to reflect the operation performed on family,
-- make use of the special variable TG_OP to work out the operation.
--
IF (TG_OP = 'DELETE') THEN
INSERT INTO public.family_audit values (
DEFAULT, 'D', OLD.family_id, OLD.familyserialno, OLD.node_id, OLD.sourcetype, OLD.familyname,
OLD.familynamelocallang, OLD.hofname, OLD.hofnamelocallang, OLD.targetedfamily, OLD.homeless,
OLD.landless, OLD.dependentonlabour, OLD.womenprimaryearner, OLD.landlinenumber, OLD.username , now());
RETURN OLD;
ELSIF (TG_OP = 'UPDATE') THEN
INSERT INTO public.family_audit values(
DEFAULT, 'U',NEW.family_id, NEW.familyserialno, NEW.node_id, NEW.sourcetype, NEW.familyname,
NEW.familynamelocallang, NEW.hofname, NEW.hofnamelocallang, NEW.targetedfamily, NEW.homeless,
NEW.landless, NEW.dependentonlabour, NEW.womenprimaryearner, NEW.landlinenumber, NEW.username , now());
RETURN NEW;
ELSIF (TG_OP = 'INSERT') THEN
INSERT INTO public.family_audit values(
DEFAULT, 'I',NEW.family_id, NEW.familyserialno, NEW.node_id, NEW.sourcetype, NEW.familyname,
NEW.familynamelocallang, NEW.hofname, NEW.hofnamelocallang, CAST(NEW.targetedfamily AS BOOLEAN), NEW.homeless,
NEW.landless, NEW.dependentonlabour, NEW.womenprimaryearner, NEW.landlinenumber, NEW.username , now());
RETURN NEW;
END IF;
RETURN NULL; -- result is ignored since this is an AFTER trigger
END;
$tr_family_audit$ LANGUAGE plpgsql;
Here's the table definition
nucleus4=# \d family;
Table "public.family"
Column | Type | Modifiers
---------------------+-----------------------------+------------------------------------------------------------
family_id | integer | not null default nextval('family_family_id_seq'::regclass)
familyserialno | integer | not null
sourcetype | character varying(20) | not null
familyname | character varying(100) |
familynamelocallang | character varying(255) |
hofname | character varying(100) | not null
hofnamelocallang | character varying(255) | not null
targetedfamily | boolean |
homeless | boolean |
landless | boolean |
dependentonlabour | boolean |
womenprimaryearner | boolean |
landlinenumber | character varying(20) |
username | character varying(20) | not null
adddate | timestamp without time zone | not null default now()
updatedate | timestamp without time zone | not null default now()
node_id | integer | not null
Indexes:
"PK_family" PRIMARY KEY, btree (family_id)
"family_idx" UNIQUE, btree (familyserialno, node_id)
Foreign-key constraints:
"family_fk" FOREIGN KEY (node_id) REFERENCES hierarchynode_master(node_id)
Referenced by:
TABLE "agriland" CONSTRAINT "FK_agriland_family" FOREIGN KEY (family_id) REFERENCES family(family_id) ON UPDATE RESTRICT ON DELETE RESTRICT
TABLE "currentloans" CONSTRAINT "FK_currentloans_family" FOREIGN KEY (family_id) REFERENCES family(family_id) ON UPDATE RESTRICT ON DELETE RESTRICT
TABLE "family_address" CONSTRAINT "FK_family_address_family" FOREIGN KEY (family_id) REFERENCES family(family_id) ON UPDATE RESTRICT ON DELETE RESTRICT
TABLE "family_basic_info" CONSTRAINT "FK_family_basic_info_family" FOREIGN KEY (family_id) REFERENCES family(family_id) ON UPDATE RESTRICT ON DELETE RESTRICT
TABLE "family_entitlement" CONSTRAINT "FK_family_entitlement_family" FOREIGN KEY (family_id) REFERENCES family(family_id) ON UPDATE RESTRICT ON DELETE RESTRICT
TABLE "livestock" CONSTRAINT "FK_livestock_family" FOREIGN KEY (family_id) REFERENCES family(family_id) ON UPDATE RESTRICT ON DELETE RESTRICT
TABLE "member" CONSTRAINT "FK_member_family" FOREIGN KEY (family_id) REFERENCES family(family_id) ON UPDATE RESTRICT ON DELETE RESTRICT
TABLE "otherassets" CONSTRAINT "FK_otherassets_family" FOREIGN KEY (family_id) REFERENCES family(family_id) ON UPDATE RESTRICT ON DELETE RESTRICT
Triggers:
tr_family_audit AFTER INSERT OR DELETE OR UPDATE ON family FOR EACH ROW EXECUTE PROCEDURE fn_family_audit()
tr_family_updatedate BEFORE UPDATE ON family FOR EACH ROW EXECUTE PROCEDURE fn_modify_updatedate_column()
nucleus4=#
Here's family_audit
nucleus4=# \d family_audit;
Table "public.family_audit"
Column | Type | Mod
---------------------+-----------------------------+----------------------------------
familyaudit_id | integer | not null default nextval('family_
operation | character(1) | not null
family_id | integer | not null
familyserialno | integer | not null
sourcetype | character varying(20) | not null
familyname | character varying(100) |
familynamelocallang | character varying(255) |
hofname | character varying(100) | not null
hofnamelocallang | character varying(255) | not null
targetedfamily | boolean |
homeless | boolean |
landless | boolean |
dependentonlabour | boolean |
womenprimaryearner | boolean |
landlinenumber | character varying(20) |
username | character varying(20) | not null
adddate | timestamp without time zone | not null default now()
node_id | integer | not null
Indexes:
"PK_family_audit" PRIMARY KEY, btree (familyaudit_id)
nucleus4=#
Here's the trigger
CREATE TRIGGER tr_family_audit
AFTER INSERT OR UPDATE OR DELETE ON public.family
FOR EACH ROW EXECUTE PROCEDURE fn_family_audit();
I would appreciate any hints.
Thank you,
BR,
~A
Your problem is here:
NEW.hofnamelocallang
Your insert has NEW.node_id in the wrong position: in family_audit, node_id is the last column, so every value after it lands one column off. Try changing your insert to:
INSERT INTO public.family_audit values(
    DEFAULT, 'I', NEW.family_id, NEW.familyserialno,
    NEW.sourcetype, NEW.familyname,
    NEW.familynamelocallang, NEW.hofname, NEW.hofnamelocallang,
    NEW.targetedfamily, NEW.homeless,
    NEW.landless, NEW.dependentonlabour, NEW.womenprimaryearner,
    NEW.landlinenumber, NEW.username, now(), NEW.node_id
);
The error you are getting is basically saying that you were trying to insert NEW.hofnamelocallang into the targetedfamily column (which is boolean, not varchar), because the misplaced NEW.node_id shifted the later values by one.
I would advise that, when you perform an insert, for sanity's sake you always enumerate the columns you are putting values into. Something like this:
insert into foo
(col1, col2, col3) -- column enumeration here
values
(1, 2, 3);
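Applied to the audit trigger, the INSERT branch might look something like this (column names taken from the family_audit definition above; familyaudit_id is left out so its default is used):
INSERT INTO public.family_audit
    (operation, family_id, familyserialno, sourcetype, familyname,
     familynamelocallang, hofname, hofnamelocallang, targetedfamily,
     homeless, landless, dependentonlabour, womenprimaryearner,
     landlinenumber, username, adddate, node_id)
VALUES
    ('I', NEW.family_id, NEW.familyserialno, NEW.sourcetype, NEW.familyname,
     NEW.familynamelocallang, NEW.hofname, NEW.hofnamelocallang,
     NEW.targetedfamily, NEW.homeless, NEW.landless, NEW.dependentonlabour,
     NEW.womenprimaryearner, NEW.landlinenumber, NEW.username, now(), NEW.node_id);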