Composite PRIMARY KEY enforces NOT NULL constraints on involved columns - postgresql

This is one strange, unwanted behavior I encountered in Postgres:
When I create a Postgres table with a composite primary key, it enforces a NOT NULL constraint on each column of the composite combination.
For example,
CREATE TABLE distributors (m_id integer, x_id integer, PRIMARY KEY(m_id, x_id));
enforces a NOT NULL constraint on columns m_id and x_id, which I don't want!
MySQL doesn't do this, and I think Oracle doesn't either.
I understand that PRIMARY KEY enforces UNIQUE and NOT NULL automatically, but that makes sense for a single-column primary key. In a table with a multi-column primary key, uniqueness is determined by the combination.
Is there any simple way of avoiding this behavior of Postgres? When I execute this:
CREATE TABLE distributors (m_id integer, x_id integer);
I do not get any NOT NULL constraints of course. But I would not have a primary key either.

If you need to allow NULL values, use a UNIQUE constraint (or index) instead of a PRIMARY KEY (and add a surrogate PK column - I suggest a serial or IDENTITY column in Postgres 10 or later).
See: Auto increment table column
A UNIQUE constraint allows columns to be NULL:
CREATE TABLE distributor (
  distributor_id int GENERATED ALWAYS AS IDENTITY PRIMARY KEY
, m_id integer
, x_id integer
, UNIQUE (m_id, x_id) -- !
-- , CONSTRAINT distributor_my_name_uni UNIQUE (m_id, x_id) -- verbose form
);
The manual:
For the purpose of a unique constraint, null values are not considered equal, unless NULLS NOT DISTINCT is specified.
In your case, you could enter something like (1, NULL) for (m_id, x_id) any number of times without violating the constraint. Postgres never considers two NULL values equal - as per definition in the SQL standard.
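To demonstrate the default behavior with the distributor table defined above (hypothetical inserts for illustration):
INSERT INTO distributor (m_id, x_id) VALUES (1, NULL);  -- OK
INSERT INTO distributor (m_id, x_id) VALUES (1, NULL);  -- also OK: NULLs are distinct
INSERT INTO distributor (m_id, x_id) VALUES (1, 2);     -- OK
INSERT INTO distributor (m_id, x_id) VALUES (1, 2);     -- fails: duplicate key violates the UNIQUE constraint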
If you need to treat NULL values as equal (i.e. "not distinct") to disallow such "duplicates", I see three options (the first available since Postgres 15):
0. NULLS NOT DISTINCT
This option was added with Postgres 15 and allows treating NULL values as "not distinct", so that two of them conflict in a unique constraint or index. This is the most convenient option going forward. The manual:
That means even in the presence of a unique constraint it is possible to store duplicate rows that contain a null value in at least one of the constrained columns. This behavior can be changed by adding the clause NULLS NOT DISTINCT ...
Detailed instructions:
Create unique constraint with null columns
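For reference, a minimal sketch of the syntax (requires Postgres 15 or later), replacing the plain UNIQUE constraint in the table above:
CREATE TABLE distributor (
  distributor_id int GENERATED ALWAYS AS IDENTITY PRIMARY KEY
, m_id integer
, x_id integer
, UNIQUE NULLS NOT DISTINCT (m_id, x_id)  -- Postgres 15+
);
With this in place, a second row (1, NULL) conflicts with an existing (1, NULL).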
1. Two partial indexes
In addition to the UNIQUE constraint above:
CREATE UNIQUE INDEX dist_m_uni_idx ON distributor (m_id) WHERE x_id IS NULL;
CREATE UNIQUE INDEX dist_x_uni_idx ON distributor (x_id) WHERE m_id IS NULL;
But this gets out of hand quickly with more than two columns that can be NULL. See:
Create unique constraint with null columns
2. A multi-column UNIQUE index on expressions
This replaces the UNIQUE constraint above. We need a free default value that is never present in the involved columns, like -1. Add CHECK constraints to disallow it:
CREATE TABLE distributor (
  distributor serial PRIMARY KEY
, m_id integer
, x_id integer
, CHECK (m_id <> -1)
, CHECK (x_id <> -1)
);
CREATE UNIQUE INDEX distributor_uni_idx
ON distributor (COALESCE(m_id, -1), COALESCE(x_id, -1));
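With this index in place, NULL values are folded to -1 for the uniqueness check, so "duplicates" involving NULL are now rejected. For illustration:
INSERT INTO distributor (m_id, x_id) VALUES (1, NULL);  -- OK
INSERT INTO distributor (m_id, x_id) VALUES (1, NULL);  -- fails: COALESCE maps both rows to (1, -1)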

When you want a polymorphic relation
Your table uses column names that indicate that they are probably references to other tables:
CREATE TABLE distributors (m_id integer, x_id integer);
So I think you are probably trying to model a polymorphic relation to other tables – where a record in your table distributors refers to exactly one m record or one x record, but never both.
Polymorphic relations are difficult in SQL. The best resource I have seen about this topic is "Modeling Polymorphic Associations in a Relational Database". There, four alternative options are presented, and the recommendation for most cases is called "Exclusive Belongs To", which in your case would lead to a table like this:
CREATE TABLE distributors (
  id serial PRIMARY KEY,
  m_id integer REFERENCES m,
  x_id integer REFERENCES x,
  CHECK (
    ((m_id IS NOT NULL)::integer + (x_id IS NOT NULL)::integer) = 1
  )
);
CREATE UNIQUE INDEX ON distributors (m_id) WHERE m_id IS NOT NULL;
CREATE UNIQUE INDEX ON distributors (x_id) WHERE x_id IS NOT NULL;
Like other solutions, this uses a surrogate primary key column, because the SQL standard requires primary key columns to not contain NULL values.
This solution adds a 4th option to the three in @Erwin Brandstetter's answer for how to avoid the case where "you could enter something like (1, NULL) for (m_id, x_id) any number of times without violating the constraint." Here, that case is excluded by a combination of two measures:
Partial unique indexes on each column individually: two records (1, NULL) and (1, NULL) would not violate the constraint on the second column, as NULLs are considered distinct, but they would violate the constraint on the first column (two records with value 1).
Check constraint: the missing piece is preventing multiple (NULL, NULL) records, which are still allowed because NULLs are considered distinct, and anyway because our partial indexes do not cover them (to save space and writes). The CHECK constraint closes this gap by making sure that exactly one of the two columns is NOT NULL, which rules out (NULL, NULL) records entirely.
There's one difference though: all alternatives in @Erwin Brandstetter's answer allow at least one record (NULL, NULL) and any number of records with no NULL value in any column (like (1, 2)). When modeling a polymorphic relation, you want to disallow such records. That is achieved by the check constraint in the solution above.
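To illustrate the combined effect (hypothetical inserts, assuming tables m and x exist and contain the referenced rows):
INSERT INTO distributors (m_id, x_id) VALUES (1, NULL);     -- OK: refers to one m record
INSERT INTO distributors (m_id, x_id) VALUES (NULL, 7);     -- OK: refers to one x record
INSERT INTO distributors (m_id, x_id) VALUES (1, NULL);     -- fails: partial unique index on m_id
INSERT INTO distributors (m_id, x_id) VALUES (NULL, NULL);  -- fails: CHECK demands exactly one reference
INSERT INTO distributors (m_id, x_id) VALUES (1, 2);        -- fails: CHECK demands exactly one reference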

Related

Postgres: best way to implement single favourite flag

I have this table:
CREATE TABLE public.data_source__instrument (
  instrument_id int4 NOT NULL,
  data_source_id int4 NOT NULL,
  CONSTRAINT data_source__instrument__pk PRIMARY KEY (data_source_id, instrument_id)
);
For clarity, here's an example of the data I might have in this table:
instrument_id | data_source_id
--------------|---------------
1             | 1
1             | 2
1             | 3
2             | 2
2             | 3
I would like to be able to set a favourite data source for each instrument. I'd also like each instrument to have 1 and only 1 favourite data source.
The solution I came up with is the below:
CREATE TABLE public.data_source__instrument (
  instrument_id int4 NOT NULL,
  data_source_id int4 NOT NULL,
  fav_data_source boolean NULL, -- ### new column ###
  CONSTRAINT data_source__instrument__pk PRIMARY KEY (data_source_id, instrument_id),
  CONSTRAINT fav_data_source UNIQUE (instrument_id, fav_data_source) -- ### new constraint ###
);
I chose to mark favourites with the value true and set non-favourite tuples to null (because the unique constraint doesn't affect NULLs).
This solution will allow at most one true value per instrument in the fav_data_source column.
Example:
instrument_id | data_source_id | fav_data_source
--------------|----------------|----------------
1             | 1              | true
1             | 2              | null
1             | 3              | null
2             | 2              | null
2             | 3              | true
However, I'm not completely satisfied with this solution. For starters, it allows instruments without a favourite data source. Moreover, the value of fav_data_source could be set to false, which is confusing and not very useful (since it doesn't play well with the unique constraint).
Is there a better way of doing this? Am I overlooking something?
EDIT: Ideally, I'd prefer a simple solution to this problem and avoid using features like database triggers.
What you need is to think about DB normalisation and entities and their relations. Specifically:
having that boolean flag in your join table violates normal form (it introduces a functional dependency on other rows)
"favourite data source" is an entity of its own; the flag you have describes a relation when it should describe an entity
I assume you have separate tables for instruments and data sources, and omitted those just for the sake of simplifying your question.
Here's what you need:
create table instrument (
  id serial primary key
);
create table data_source (
  id serial primary key
);
create table data_source__instrument (
  data_source_id integer references data_source on delete cascade on update cascade,
  instrument_id integer references instrument on delete cascade on update cascade,
  constraint data_source__instrument_pk primary key (data_source_id, instrument_id)
);
create table favourite_data_source (
  data_source_id integer not null,
  instrument_id integer not null unique,
  constraint favourite_data_source_fk foreign key (data_source_id, instrument_id)
    references data_source__instrument (data_source_id, instrument_id)
    on update cascade on delete cascade
);
And now to reproduce your example:
insert into instrument
values (1),
(2);
insert into data_source
values (1),
(2),
(3);
insert into data_source__instrument (instrument_id, data_source_id)
values (1, 1),
(1, 2),
(1, 3),
(2, 2),
(2, 3);
insert into favourite_data_source (instrument_id, data_source_id)
values (1, 1),
(2, 3);
With this scheme:
you can have 0 or 1 favourite data sources per instrument
you have to insert favourite data source data separately
FKs will maintain integrity thanks to the cascades and checks, as shown below
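A hypothetical example of the cascade at work, continuing the inserts above:
-- removing the underlying relation also removes the favourite pointing at it
delete from data_source__instrument
where instrument_id = 1 and data_source_id = 1;
-- the composite FK with "on delete cascade" automatically deletes
-- the matching row (1, 1) from favourite_data_source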
I chose to mark favourites with the value true and set non-favourite tuples to null (because the unique constraint doesn't affect NULLs), [however] the value of the fav_data_source could be set to false which is confusing and not very useful
There are two ways to improve this:
put a check constraint on the column to prevent false values:
fav_data_source boolean NULL check (fav_data_source IS TRUE or fav_data_source IS NULL)
use a partial index to check only rows with true values (see also here):
fav_data_source boolean NOT NULL
…
CREATE UNIQUE INDEX unique_fav_data_source ON data_source__instrument (instrument_id) WHERE (fav_data_source);
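Putting the partial-index variant together, the whole definition might look like this (a sketch based on the original table; the DEFAULT false is an assumption):
CREATE TABLE public.data_source__instrument (
  instrument_id int4 NOT NULL,
  data_source_id int4 NOT NULL,
  fav_data_source boolean NOT NULL DEFAULT false,  -- assumed default
  CONSTRAINT data_source__instrument__pk PRIMARY KEY (data_source_id, instrument_id)
);
CREATE UNIQUE INDEX unique_fav_data_source
ON data_source__instrument (instrument_id)
WHERE (fav_data_source);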
It allows instruments without a favourite data source.
I don't think it's possible to express such an "out of x, at least one y must exist" constraint in a relational database without triggers. (Assuming you still want to allow instruments without any data sources at all.)
I'd also like each instrument to have 1 and only 1 favourite data source.
You might consider storing the favourite data source in the instruments table itself, as a non-nullable foreign reference column.
To avoid duplication, put only the non-favourite (additional/alternate) data source in the relation table, then provide data_source__instrument as a view:
CREATE VIEW data_source__instrument(instrument_id, data_source_id, is_favourite) AS
SELECT id, fav_data_source_id, true FROM instrument
UNION ALL
SELECT instrument_id, data_source_id, false FROM alternate_data_source__instrument;
The downside of this approach is that you cannot create a foreign key reference to the view (should you need to), and that updating the data sources gets quite complicated. It also only works for M:N relations, not 1:N ones, where you would need a way to assert that the favourite source of an instrument actually belongs to that instrument (technically possible with an extra foreign key, but really ugly), and where you would introduce circular references, which are just a mess.
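For completeness, a sketch of the underlying tables this view assumes; the table and column names are hypothetical, and a data_source table with an id column is also assumed:
CREATE TABLE instrument (
  id int4 PRIMARY KEY,
  fav_data_source_id int4 NOT NULL REFERENCES data_source (id)  -- non-nullable favourite
);
CREATE TABLE alternate_data_source__instrument (
  instrument_id int4 REFERENCES instrument (id),
  data_source_id int4 REFERENCES data_source (id),
  PRIMARY KEY (instrument_id, data_source_id)
);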

What is the best way to work with an optional FK constraint in Postgres? [duplicate]

This question already has answers here: How can you represent inheritance in a database?
I have a table in Postgres which contains Things. Each of these Things can be of 1 of 3 types:
is a self contained Thing
is an instance of a SuperThingA
is an instance of a SuperThingB.
Essentially, in object terms, there's an AbstractThing, extended in different ways by SuperThingA or SuperThingB, and all the records in the Thing table are either unextended or extended in 1 of the 2 ways.
In order to represent this on the Thing table, I have a number of fields which are common to Things of all types, and I have 2 optional FK reference columns into the SuperThingA and SuperThingB tables.
Ideally, I would like a "FK IF NOT NULL" constraint on each of the two, but this doesn't appear to be possible; as far as I can see, all I can do is make both fields nullable and rely on the application code to maintain the FK relationships, which is pretty sucky. This seems to be doable in other databases, for example SQL Server, as per SQL Server 2005: Nullable Foreign Key Constraint, but not any way that I've found so far in PG.
How else can I handle this - an insert/update trigger which checks the values when either of those fields is not null and verifies the value is present on whichever parent table? That's maybe doable on a small parent table with limited inserts on the Thing table (which, in fairness, is largely the case here - no more than a couple of hundred records in each of the parent tables, and small numbers of inserts on the Thing table), but in the general case it would be a performance black hole on inserts if one or both parent tables were large.
This is not currently enforced with a FK relationship. I've reviewed the PG docs, and it seems pretty definitive that I can't have an optional FK relationship (which is understandable). It leaves me with a table definition something like this:
CREATE TABLE IF NOT EXISTS Thing(
  Thing_id int4 NOT NULL,
  Thing_description varchar(40),
  Thing_SuperA_FK int4,
  Thing_SuperB_FK char(10),
  CONSTRAINT ThingPK PRIMARY KEY (Thing_id)
);
Every foreign key on a nullable column is only enforced when the value is non-null. This is the default behavior.
create table a (
id int4 not null primary key
);
create table b (
id int4 not null primary key,
a_id int4 references a(id)
);
In the example table b has an optional reference to table a.
insert into a values(1); -- entry in table a
insert into b values (1, 1); -- entry in b with reference to a
insert into b values (2, null); -- entry in b with no reference to a
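A value without a matching row in a is rejected as usual:
insert into b values (3, 99); -- fails: no row with id = 99 in table a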
Depending on your use case it also might make sense to reverse your table structure. Instead of having the common table pointing to two more specialized tables, you can have it the other way around. This way you avoid the nullable FK columns entirely.
create table common (
  id int4 primary key
  -- all the common fields
);
create table special1 (
  common_id int4 not null references common(id)
  -- all the special fields of type special1
);
create table special2 (
  common_id int4 not null references common(id)
  -- all the special fields of type special2
);
You need your SuperN_FK fields defined as nullable foreign keys, then you'll need check constraint(s) on the table to enforce the optional NULLability requirements.
CREATE TABLE Things
( ID int primary key
, col1 varchar(1)
, col2 varchar(1)
, SuperA_FK int constraint fk_SuperA references Things(ID)
, cola1 varchar(1)
, constraint is_SuperA check ((SuperA_FK is null and cola1 is null) or
                              (SuperA_FK is not null and cola1 is not null))
, SuperB_FK int constraint fk_SuperB references Things(ID)
, colb1 varchar(1)
, constraint is_SuperB check ((SuperB_FK is null and colb1 is null) or
                              (SuperB_FK is not null))
, constraint Super_Constraint check (
    case when SuperA_FK is not null then 1 else 0 end +
    case when SuperB_FK is not null then 1 else 0 end
    <= 1 )
);
In the above example I've split the check constraints up for ease of maintenance. The two is_SuperN constraints enforce the NULL requirements on the FK and its related detail columns: either all NULLs, or the FK is not null and some or all of the detail columns are not null. The final Super_Constraint ensures that at most one SuperN_FK is not null.
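A few hypothetical inserts to show the constraints in action:
insert into Things (ID) values (1);                           -- OK: self-contained Thing
insert into Things (ID, SuperA_FK, cola1) values (2, 1, 'a'); -- OK: a SuperA Thing
insert into Things (ID, SuperA_FK) values (3, 1);             -- fails: is_SuperA requires cola1, too
insert into Things (ID, SuperA_FK, cola1, SuperB_FK)
values (4, 1, 'a', 1);                                        -- fails: Super_Constraint allows at most one FK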

PostgreSQL exclusive or on column setting

In PostgreSQL 9.5 I'm wanting to create a table with three columns. I'd basically have something like
create table Foo (
account varchar not null,
team_id integer references team (ident) on delete cascade,
league_id integer references league (ident) on delete cascade
)
The fun part now is that I want them to specify EITHER team_id OR league_id, but not both. The combination of account plus one of the other two columns is then the UNIQUE constraint.
Is that possible to do?
To make sure only one of the columns is supplied, use a check constraint:
alter table foo add
constraint check_team check (not (team_id is not null and league_id is not null));
The above will however not prevent providing a NULL value for both columns. If you want to make sure that exactly one of them is provided you can use:
alter table foo add
constraint check_team check ( (team_id is not null or league_id is not null)
and not (team_id is not null and league_id is not null));
Edit: as Abelisto pointed out, the check constraint can be simplified to
alter table foo add
constraint check_team check ((team_id is null) <> (league_id is null));
I'm not sure about the unique constraint you want to establish. If e.g. the following two rows should be prevented ('x', 1, null), ('x', null, 1) then you can use a unique index like this:
create unique index on foo (account, coalesce(team_id, league_id));
That would only work properly if you enforce the rule that at least one of those columns must be not null.
If however you want to allow the same value in different columns, but want to prevent having the same team_id or league_id twice for an account (allowing the above example), then I think you need two unique indexes:
create unique index on foo (account, team_id) where team_id is not null;
create unique index on foo (account, league_id) where league_id is not null;
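For illustration, with the check constraint and these two partial indexes in place (and assuming matching rows in team and league):
insert into foo values ('x', 1, null); -- OK
insert into foo values ('x', null, 1); -- OK: a different column is set
insert into foo values ('x', 1, null); -- fails: duplicate (account, team_id)
insert into foo values ('y', 1, null); -- OK: different account
insert into foo values ('x', 2, 2);    -- fails: check_team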

Shared Primary key versus Foreign Key

I have a laboratory analysis database and I'm working on the best data layout. I've seen some suggestions based on similar requirements for using a "Shared Primary Key", but I don't see the advantages over just foreign keys. I'm using PostgreSQL; tables listed below:
Sample
___________
sample_id (PK)
sample_type (where in the process the sample came from)
sample_timestamp (when was the sample taken)
Analysis
___________
analysis_id (PK)
sample_id (FK references sample)
method (what analytical method was performed)
analysis_timestamp (when did the analysis take place)
analysis_notes
gc
____________
analysis_id (shared Primary key)
gc_concentration_meoh (methanol concentration)
gc_concentration_benzene (benzene concentration)
spectrophotometer
_____________
analysis_id
spectro_nm (wavelength used)
spectro_abs (absorbance measured)
I could use this design, or I could move the fields from the analysis table into both the gc and spectrophotometer tables, and just use foreign keys between the sample, gc, and spectrophotometer tables. The only advantage I see of this design is in cases where I would just want information on how many or what types of analyses were performed, without having to join in the actual results. However, the additional rules to ensure referential integrity between the shared primary keys, and managing extra joins and triggers (on delete cascade, etc.), appear to make it more of a headache than the minor advantages are worth. I'm not a DBA, but a scientist, so please let me know what I'm missing.
UPDATE:
A shared primary key (as I understand it) is like a one-to-one foreign key with the additional constraint that each value in the parent table (analysis) must appear exactly once in one of the child tables.
I've seen some suggestions based on similar requirements for using a "Shared Primary Key", but I don't see the advantages over just foreign keys.
If I've understood your comments above, the advantage is that only the first implements the requirement that each row in the parent match a row in one child, and only in one child. Here's one way to do that.
create table analytical_methods (
  method_id integer primary key,
  method_name varchar(25) not null unique
);

insert into analytical_methods values
(1, 'gc'), (2, 'spec'), (3, 'Atomic Absorption'), (4, 'pH probe');

create table analysis (
  analysis_id integer primary key,
  sample_id integer not null, -- references samples, not shown
  method_id integer not null references analytical_methods (method_id),
  analysis_timestamp timestamp not null,
  analysis_notes varchar(255),
  -- This unique constraint lets the pair of columns be the target of
  -- foreign key constraints from other tables.
  unique (analysis_id, method_id)
);

-- The combination of a) the default value and the check() constraint on
-- method_id, and b) the foreign key constraint on the paired columns
-- analysis_id and method_id guarantees that rows in this table match a
-- 'gc' row in the analysis table.
--
-- If all the child tables have similar constraints, a row in analysis
-- can match a row in one and only one child table.
create table gc (
  analysis_id integer primary key,
  method_id integer not null
    default 1
    check (method_id = 1),
  foreign key (analysis_id, method_id)
    references analysis (analysis_id, method_id),
  gc_concentration_meoh integer not null,
  gc_concentration_benzene integer not null
);
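Following the same pattern, the spectrophotometer child table might look like this (a sketch; the column types are assumptions, and method_id 2 matches the 'spec' row inserted above):
create table spectrophotometer (
  analysis_id integer primary key,
  method_id integer not null
    default 2
    check (method_id = 2),
  foreign key (analysis_id, method_id)
    references analysis (analysis_id, method_id),
  spectro_nm integer not null,    -- wavelength used
  spectro_abs numeric not null    -- absorbance measured
);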
It looks like in my case this supertype/subtype model is not the best choice. Instead, I should move the fields from the analysis table into all the child tables, and make a series of simple foreign key relationships. The advantage of the supertype/subtype model is when using the primary key of the supertype as a foreign key in another table. Since I am not doing this, the extra layer of complexity will not add anything.

Do I need a primary key for my table, which has a UNIQUE (composite 4-columns), one of which can be NULL?

I have the following table (PostgreSQL 8.3) which stores prices of some products. The prices are synchronised with another database; basically most of the fields below (apart from one) are not updated by our client, but instead dropped and refreshed every once in a while to sync with another stock database:
CREATE TABLE product_pricebands (
  template_sku varchar(20) NOT NULL,
  colourid integer REFERENCES colour (colourid) ON DELETE CASCADE,
  currencyid integer NOT NULL REFERENCES currency (currencyid) ON DELETE CASCADE,
  siteid integer NOT NULL REFERENCES site (siteid) ON DELETE CASCADE,
  master_price numeric(10,2),
  my_custom_field boolean,
  UNIQUE (template_sku, siteid, currencyid, colourid)
);
On the synchronisation, I basically DELETE most of the data above except for data WHERE my_custom_field is TRUE (if it's TRUE, it means the client updated this field via their CMS and therefore this record should not be dropped). I then INSERT 100s to 1000s of rows into the table, and UPDATE where the INSERT fails (i.e. where the combination of (template_sku, siteid, currencyid, colourid) already exists).
My question is - what best practice should be applied here to create a primary key? Is a primary key even needed? I wanted to make the primary key = (template_sku, siteid, currencyid, colourid) - but the colourid field can be NULL, and using it in a composite primary key is not possible.
From what I read on other forum posts, I think I have done the above correctly, and just need to clarify:
1) Should I use a "serial" primary key just in case I ever need one? At the moment I don't, and don't think I ever will, because the important data in the table is the price and my custom field, only identified by the (template_sku, siteid, currencyid, colourid) combination.
2) Since (template_sku, siteid, currencyid, colourid) is the combination that I will use to query a product's price, should I add any further indexing to my columns, such as the "template_sku" which is a varchar? Or is the UNIQUE constraint a good index already for my SELECTs?
Should I use a "serial" primary key just in case I ever need one?
You can easily add a serial column later if you need one:
ALTER TABLE product_pricebands ADD COLUMN id serial;
The column will be filled with unique values automatically. You can even make it the primary key in the same statement (if no primary key is defined, yet):
ALTER TABLE product_pricebands ADD COLUMN id serial PRIMARY KEY;
If you reference the table from other tables I would advise using such a surrogate primary key, because it is rather unwieldy to link by four columns. It is also slower in SELECTs with JOINs.
Either way, you should define a primary key. The UNIQUE index including a nullable column is not a full replacement. It allows duplicates for combinations including a NULL value, because two NULL values are never considered the same. This can lead to trouble.
As
the colourid field can be NULL
you might want to create two unique indexes. The combination (template_sku, siteid, currencyid, colourid) cannot be a PRIMARY KEY, because of the nullable colourid, but you can create a UNIQUE constraint like you already have (implementing an index automatically):
ALTER TABLE product_pricebands ADD CONSTRAINT product_pricebands_uni_idx
UNIQUE (template_sku, siteid, currencyid, colourid);
This index perfectly covers the queries you mention in 2).
Create a partial unique index in addition if you want to avoid "duplicates" with (colourid IS NULL):
CREATE UNIQUE INDEX product_pricebands_uni_null_idx
ON product_pricebands (template_sku, siteid, currencyid)
WHERE colourid IS NULL;
That covers all bases. I wrote more about that technique in a related answer on dba.SE.
The simple alternative to the above is to make colourid NOT NULL and create a primary key instead of the above product_pricebands_uni_idx.
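That alternative would boil down to this (a sketch; it only works once colourid contains no NULL values):
ALTER TABLE product_pricebands DROP CONSTRAINT product_pricebands_uni_idx;
ALTER TABLE product_pricebands ALTER COLUMN colourid SET NOT NULL;
ALTER TABLE product_pricebands ADD PRIMARY KEY (template_sku, siteid, currencyid, colourid);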
Also, as you
basically DELETE most of the data
for your refill operation, it will be faster to drop indexes that are not needed during the refill and recreate them afterwards. Building an index from scratch is faster by an order of magnitude than adding rows incrementally.
How do you know which indexes are used (needed)?
Test your queries with EXPLAIN ANALYZE.
Or use the built-in statistics. pgAdmin displays statistics in a separate tab for the selected object.
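A sketch of the bracketing, using the partial index defined above (the unique constraint stays in place because the refill's INSERT/UPDATE logic depends on it):
DROP INDEX product_pricebands_uni_null_idx;  -- not needed while refilling

-- ... DELETE obsolete rows, INSERT / UPDATE the refreshed data ...

CREATE UNIQUE INDEX product_pricebands_uni_null_idx
ON product_pricebands (template_sku, siteid, currencyid)
WHERE colourid IS NULL;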
It may also be faster to select the few rows with my_custom_field = TRUE into a temporary table, TRUNCATE the base table and re-INSERT the survivors. Depends on whether you have foreign keys defined. Would look like this:
CREATE TEMP TABLE pr_tmp AS
SELECT * FROM product_pricebands WHERE my_custom_field;
TRUNCATE product_pricebands;
INSERT INTO product_pricebands SELECT * FROM pr_tmp;
This avoids a lot of vacuuming.