Postgres unique constraint not enforcing uniqueness

Here is my constraint:
CREATE UNIQUE INDEX index_subscriptions_on_user_id_and_class_type_id_and_deleted_at
ON subscriptions
USING btree
(user_id, class_type_id, deleted_at);
This query proves the constraint is not actually working:
SELECT id, user_id, class_type_id, deleted_at
FROM subscriptions;
The output (not reproduced here) shows multiple rows with the same user_id and class_type_id, all with deleted_at set to NULL.
Why is uniqueness not being enforced?

Unique indexes in Postgres are based on values being equal, but NULL is never equal to anything, including other NULLs. Therefore any row with a NULL deleted_at value is distinct from any other possible row - so you can insert any number of them.
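For example, with the index above in place, both of these inserts succeed (a minimal sketch; it assumes no other constraints on these columns get in the way):
-- Both rows are accepted: the NULL deleted_at values are never considered equal,
-- so the unique index sees two distinct keys.
INSERT INTO subscriptions (user_id, class_type_id, deleted_at) VALUES (1, 2, NULL);
INSERT INTO subscriptions (user_id, class_type_id, deleted_at) VALUES (1, 2, NULL);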
One way around this is to create partial indexes, applying different rules to rows with and without NULLs:
CREATE UNIQUE INDEX ... ON subscriptions
(user_id, class_type_id) WHERE deleted_at IS NULL;
CREATE UNIQUE INDEX ... ON subscriptions
(user_id, class_type_id, deleted_at) WHERE deleted_at IS NOT NULL;
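Since Postgres 15 there is also the NULLS NOT DISTINCT clause (covered in more detail further down), which makes a single unique index treat NULLs as equal; a sketch:
CREATE UNIQUE INDEX index_subscriptions_on_user_id_and_class_type_id_and_deleted_at
ON subscriptions (user_id, class_type_id, deleted_at) NULLS NOT DISTINCT;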

This happens because of the NULL values in the deleted_at column. A unique index (or constraint) allows multiple rows with NULL in an indexed column.
You can prevent that by either declaring the column NOT NULL to force a value for it, or by reducing the unique columns to (user_id, class_type_id).

Related

Use index to speed up query using values from different tables

I have a table products, a table orders and a table orderProducts.
Products have a name as a PK (apple, banana, mango) and a price.
orders have a created_at date and an id as a PK.
orderProducts connects orders and products, so they have a product_name and an order_id. Now I would like to show all orders for a given product that happened in the last 24 hours.
I use the following query:
SELECT
orders.id,
orders.created_at,
products.name,
products.price
FROM
orderProducts
JOIN products ON
products.name=orderProducts.product
JOIN orders ON
orders.id=orderProducts.order
WHERE
products.name='banana'
AND
orders.created_at BETWEEN NOW() - INTERVAL '24 HOURS' AND NOW()
ORDER BY
orders.created_at
This works, but I would like to optimize this query with an index. This index would need to be ordered
1. first by the product name, so it can be filtered,
2. then by the created_at of the order in descending order, so it can select only the ones from the last 24 hours.
The problem is that, from what I have seen, indexes can only be created on a single table, without the possibility of joining another table's values to it. Since two individual indexes do not solve this problem either, I was wondering if there is an alternative way to optimize this particular query.
Here are the table scripts:
CREATE TABLE products
(
name text PRIMARY KEY,
price integer
);
CREATE TABLE orders
(
id SERIAL PRIMARY KEY,
created_at TIMESTAMP DEFAULT NOW()
);
CREATE TABLE orderProducts
(
product text REFERENCES products(name),
"order" integer REFERENCES orders(id)
);
First of all, please do not put indexes everywhere - that leads to slower data-changing operations.
As proposed by @Laurenz Albe - do not guess, check.
Other than that, note that you already know the product name, and its price is repeated in every result row - so you could query it once separately. Whether two queries are faster than a single one in your case is something to check as well.
Please read the docs. I would try this index:
create index orders_id_created_at on orders (created_at desc, id);
Normally id should go first, since it is unique, but here the system should be able to filter on both predicates - the WHERE and the JOIN. Just guessing here.
On orderProducts I would like to see an index on both columns, although for this query only one should be needed. In practice you are going either from products to orders or the other way round - both paths are possible, which is why I wrote about indexing both columns. I would use two separate indexes:
create index orderproducts_product_idx on orderproducts (product) include ("order");
create index orderproducts_order_idx on orderproducts ("order") include (product);
That probably does not change much, but the idea is to answer the query from the index alone, without touching the table itself.
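To verify which of these indexes the planner actually picks ("do not guess - check"), a minimal sketch using EXPLAIN on the original query and schema:
-- Shows the chosen plan together with actual row counts, timing and buffer usage.
EXPLAIN (ANALYZE, BUFFERS)
SELECT o.id, o.created_at, p.name, p.price
FROM orderProducts op
JOIN products p ON p.name = op.product
JOIN orders o ON o.id = op."order"
WHERE p.name = 'banana'
  AND o.created_at BETWEEN NOW() - INTERVAL '24 HOURS' AND NOW()
ORDER BY o.created_at;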
These rules are important in terms of performance:
An integer index is faster than a string index, so try to make primary keys integers whenever possible - joins between the tables use the primary keys, too.
If your WHERE clauses always use the same two fields, create a composite index on both fields.
Foreign key columns are not indexed automatically - you must create indexes for them yourself.
So the recommended table scripts would be:
CREATE TABLE products
(
id serial primary key,
name text,
price integer
);
CREATE UNIQUE INDEX products_name_idx ON products USING btree (name);
CREATE TABLE orders
(
id SERIAL PRIMARY KEY,
created_at TIMESTAMP DEFAULT NOW()
);
CREATE INDEX orders_created_at_idx ON orders USING btree (created_at);
CREATE TABLE orderProducts
(
product_id integer REFERENCES products(id),
order_id integer REFERENCES orders(id)
);
CREATE INDEX orderproducts_product_id_idx ON orderproducts USING btree (product_id, order_id);
---- OR ----
CREATE INDEX orderproducts_product_id ON orderproducts (product_id);
CREATE INDEX orderproducts_order_id ON orderproducts (order_id);
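With this integer-key schema, the original query would have to be adapted to join through the surrogate keys. A sketch (the aliases are mine, the rest follows the question):
SELECT o.id, o.created_at, p.name, p.price
FROM orderProducts op
JOIN products p ON p.id = op.product_id
JOIN orders o ON o.id = op.order_id
WHERE p.name = 'banana'
  AND o.created_at BETWEEN NOW() - INTERVAL '24 HOURS' AND NOW()
ORDER BY o.created_at;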

PostgreSQL exclusive or on column setting

In PostgreSQL 9.5 I'm wanting to create a table with three columns. I'd basically have something like
create table Foo (
account varchar not null,
team_id integer references team (ident) on delete cascade,
league_id integer references league (ident) on delete cascade
)
The fun part now is that I want them to specify EITHER team_id OR league_id, but not both. The combination of account plus one of the other two columns is then the UNIQUE constraint.
Is that possible to do?
To make sure only one of the columns is supplied, use a check constraint:
alter table foo add
constraint check_team check (not (team_id is not null and league_id is not null));
The above will however not prevent providing a NULL value for both columns. If you want to make sure that exactly one of them is provided you can use:
alter table foo add
constraint check_team check ( (team_id is not null or league_id is not null)
and not (team_id is not null and league_id is not null));
Edit: as Abelisto pointed out, the check constraint can be simplified to
alter table foo add
constraint check_team check ((team_id is null) <> (league_id is null));
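A quick sanity check of that constraint (a sketch; it assumes the referenced team and league rows with ident 1 exist):
INSERT INTO foo (account, team_id) VALUES ('a', 1);               -- OK: only team_id set
INSERT INTO foo (account, league_id) VALUES ('a', 1);             -- OK: only league_id set
INSERT INTO foo (account, team_id, league_id) VALUES ('a', 1, 1); -- rejected: both set
INSERT INTO foo (account) VALUES ('a');                           -- rejected: neither set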
I'm not sure about the unique constraint you want to establish. If e.g. the following two rows should be prevented ('x', 1, null), ('x', null, 1) then you can use a unique index like this:
create unique index on foo (account, coalesce(team_id, league_id));
That would only work properly if you enforce the rule that at least one of those columns must be not null.
If, however, you want to allow the same id value to appear once as team_id and once as league_id for an account (allowing the above example), but want to prevent having the same team_id or the same league_id twice for an account, then I think you need two unique indexes:
create unique index on foo (account, team_id) where team_id is not null;
create unique index on foo (account, league_id) where league_id is not null;

Composite PRIMARY KEY enforces NOT NULL constraints on involved columns

This is one strange, unwanted behavior I encountered in Postgres:
When I create a Postgres table with composite primary keys, it enforces NOT NULL constraint on each column of the composite combination.
For example,
CREATE TABLE distributors (m_id integer, x_id integer, PRIMARY KEY(m_id, x_id));
enforces NOT NULL constraint on columns m_id and x_id, which I don't want!
MySQL doesn't do this. I think Oracle doesn't do it either.
I understand that PRIMARY KEY enforces UNIQUE and NOT NULL automatically but that makes sense for single-column primary key. In a multi-column primary key table, the uniqueness is determined by the combination.
Is there any simple way of avoiding this behavior of Postgres? When I execute this:
CREATE TABLE distributors (m_id integer, x_id integer);
I do not get any NOT NULL constraints of course. But I would not have a primary key either.
If you need to allow NULL values, use a UNIQUE constraint (or index) instead of a PRIMARY KEY (and add a surrogate PK column - I suggest a serial or IDENTITY column in Postgres 10 or later).
Auto increment table column
A UNIQUE constraint allows columns to be NULL:
CREATE TABLE distributor (
distributor_id integer GENERATED ALWAYS AS IDENTITY PRIMARY KEY
, m_id integer
, x_id integer
, UNIQUE(m_id, x_id) -- !
-- , CONSTRAINT distributor_my_name_uni UNIQUE (m_id, x_id) -- verbose form
);
The manual:
For the purpose of a unique constraint, null values are not considered equal, unless NULLS NOT DISTINCT is specified.
In your case, you could enter something like (1, NULL) for (m_id, x_id) any number of times without violating the constraint. Postgres never considers two NULL values equal - as per definition in the SQL standard.
If you need to treat NULL values as equal (i.e. "not distinct") to disallow such "duplicates", I see three options (the first added with Postgres 15):
0. NULLS NOT DISTINCT
This option was added with Postgres 15 and allows treating NULL values as "not distinct", so two of them conflict in a unique constraint or index. This is the most convenient option going forward. The manual:
That means even in the presence of a unique constraint it is possible
to store duplicate rows that contain a null value in at least one of
the constrained columns. This behavior can be changed by adding the
clause NULLS NOT DISTINCT ...
Detailed instructions:
Create unique constraint with null columns
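Applied to the example table, a sketch of what this looks like:
CREATE TABLE distributor (
  distributor_id integer GENERATED ALWAYS AS IDENTITY PRIMARY KEY
, m_id integer
, x_id integer
, UNIQUE NULLS NOT DISTINCT (m_id, x_id)  -- NULL values now conflict with each other
);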
1. Two partial indexes
In addition to the UNIQUE constraint above:
CREATE UNIQUE INDEX dist_m_uni_idx ON distributor (m_id) WHERE x_id IS NULL;
CREATE UNIQUE INDEX dist_x_uni_idx ON distributor (x_id) WHERE m_id IS NULL;
But this gets out of hand quickly with more than two columns that can be NULL. See:
Create unique constraint with null columns
2. A multi-column UNIQUE index on expressions
Instead of the UNIQUE constraint. We need a free default value that is never present in involved columns, like -1. Add CHECK constraints to disallow it:
CREATE TABLE distributor (
distributor serial PRIMARY KEY
, m_id integer
, x_id integer
, CHECK (m_id <> -1)
, CHECK (x_id <> -1)
);
CREATE UNIQUE INDEX distributor_uni_idx
ON distributor (COALESCE(m_id, -1), COALESCE(x_id, -1));
When you want a polymorphic relation
Your table uses column names that indicate that they are probably references to other tables:
CREATE TABLE distributors (m_id integer, x_id integer);
So I think you probably are trying to model a polymorphic relation to other tables – where a record in your table distributors can refer to one m record xor one x record.
Polymorphic relations are difficult in SQL. The best resource I have seen about this topic is "Modeling Polymorphic Associations in a Relational Database". There, four alternative options are presented, and the recommendation for most cases is called "Exclusive Belongs To", which in your case would lead to a table like this:
CREATE TABLE distributors (
id serial PRIMARY KEY,
m_id integer REFERENCES m,
x_id integer REFERENCES x,
CHECK (
((m_id IS NOT NULL)::integer + (x_id IS NOT NULL)::integer) = 1
)
);
CREATE UNIQUE INDEX ON distributors (m_id) WHERE m_id IS NOT NULL;
CREATE UNIQUE INDEX ON distributors (x_id) WHERE x_id IS NOT NULL;
Like other solutions, this uses a surrogate primary key column because primary keys are enforced to not contain NULL values in the SQL standard.
This solution adds a 4th option to the three in @Erwin Brandstetter's answer for how to avoid the case where "you could enter something like (1, NULL) for (m_id, x_id) any number of times without violating the constraint." Here, that case is excluded by a combination of two measures:
Partial unique indexes on each column individually: two records (1, NULL) and (1, NULL) would not violate the constraint on the second column as NULLs are considered distinct, but they would violate the constraint on the first column (two records with value 1).
Check constraint: The missing piece is preventing multiple (NULL, NULL) records, still allowed because NULLs are considered distinct, and anyway because our partial indexes do not cover them to save space and write events. This is achieved by the CHECK constraint, which prevents any (NULL, NULL) records by making sure that exactly one column is NULL.
There's one difference though: all alternatives in @Erwin Brandstetter's answer allow at least one record (NULL, NULL) and any number of records with no NULL value in any column (like (1, 2)). When modeling a polymorphic relation, you want to disallow such records. That is achieved by the check constraint in the solution above.

Unique constraint with soft deleted rows excluded

We have a table with a unique constraint on it, for feedback left from one user, for another, in relation to a sale.
ALTER TABLE feedback
ADD CONSTRAINT unique_user_subject_and_sale
UNIQUE (user_id, subject_id, sale_id)
This ensures we don't accidentally get duplicated rows of feedback.
Currently we sometimes hard-delete feedback left in error and let the user leave it again. We want to change to soft delete:
ALTER TABLE feedback
ADD COLUMN deleted_at timestamptz
If deleted_at IS NOT NULL, consider the feedback deleted, though we still have the audit trail in our DB (and will probably show it ghosted out to site admins).
How can we keep our unique constraint when we're using soft delete like this? Is it possible without using a more general CHECK() constraint that does an aggregate check (I've never tried using a check constraint like that)?
It's like I need to append a WHERE clause to the constraint.
Your unique index (which you later edited out of the question):
CREATE UNIQUE INDEX feedback_unique_user_subject_and_sale_null
ON feedback(user_id, subject_id, sale_id)
WHERE deleted_at IS NULL
Your unique index has at least two side effects that might cause you some trouble.
In other tables, you can't set a foreign key constraint that references "feedback". A foreign key reference requires some combination of columns to be declared as either primary key or unique.
Your unique index allows multiple rows that differ only in the "deleted_at" timestamp. So it's possible to end up with rows that look like the example below. Whether this is a problem is application-dependent.
Example:

user_id  subject_id  sale_id  deleted_at
-------  ----------  -------  ----------------------
      1           1        1  2012-01-01 08:00:01.33
      1           1        1  2012-01-01 08:00:01.34
      1           1        1  2012-01-01 08:00:01.35
PostgreSQL documents this kind of index as a partial index, should you need to Google it sometime. Other platforms use different terms for it--filtered index is one. You can limit the problems to a certain extent with a pair of partial indexes.
CREATE UNIQUE INDEX feedback_unique_user_subject_and_sale_null
ON feedback(user_id, subject_id, sale_id)
WHERE deleted_at IS NULL
CREATE UNIQUE INDEX feedback_unique_user_subject_and_sale_not_null
ON feedback(user_id, subject_id, sale_id)
WHERE deleted_at IS NOT NULL
But I see no reason to go to this much trouble, especially given the potential problems with foreign keys. If your table looks like this
create table feedback (
feedback_id integer primary key,
user_id ...
subject_id ...
sale_id ...
deleted_at ...
constraint unique_user_subj_sale
unique (user_id, subject_id, sale_id)
);
then all you need is that unique constraint on {user_id, subject_id, sale_id}. You might further consider making all deletes use the "deleted_at" column instead of doing a hard delete.
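A soft delete is then simply an UPDATE instead of a DELETE; a minimal sketch (the feedback_id value is just a placeholder):
-- Mark the row as deleted instead of removing it.
UPDATE feedback
SET deleted_at = now()
WHERE feedback_id = 123;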
Despite the fact that the PostgreSQL documentation advises against using a unique index instead of a constraint (if the point is to have a constraint), it appears you can do:
CREATE UNIQUE INDEX feedback_unique_user_subject_and_sale
ON feedback(user_id, subject_id, sale_id)
WHERE deleted_at IS NULL
You could create a constraint that includes deleted_at:
ALTER TABLE feedback
ADD CONSTRAINT unique_user_subject_and_sale
UNIQUE (user_id, subject_id, sale_id, deleted_at)
This does not create the problems that a (partial) unique index does.
The only timing problem would be removing a user's feedback, creating a new one, and then deleting it again at exactly the same timestamp. Given the precision of the column, the user would have to do all of that within 1 microsecond - which, let's face it, is impossible; the time between these requests will definitely be greater than 1 microsecond.
The real problem is that NULL values are not compared by a unique constraint, so with deleted_at left NULL for live rows this still won't work.
But you could give not-deleted rows a default value instead of NULL, e.g. the lowest value that can be stored in that column.
If you cannot accept a dummy default value in deleted_at for rows that are not deleted, you can consider adding a deleted_token column instead: generate a unique token for it when you delete a row, and keep it at 0 (or similar) for rows that are not deleted.
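A sketch of that deleted_token idea (the feedback_id column and the new constraint name are assumptions, not from the original schema):
-- Replace the three-column constraint with one that includes the token.
ALTER TABLE feedback DROP CONSTRAINT unique_user_subject_and_sale;
ALTER TABLE feedback ADD COLUMN deleted_token bigint NOT NULL DEFAULT 0;
ALTER TABLE feedback
ADD CONSTRAINT unique_user_subject_sale_token
UNIQUE (user_id, subject_id, sale_id, deleted_token);

-- Soft delete: give the row a distinct token so it no longer collides with live rows
-- (all live rows keep deleted_token = 0, so at most one live row per key is allowed).
UPDATE feedback
SET deleted_at = now(), deleted_token = feedback_id
WHERE feedback_id = 123;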

Do I need a primary key for my table, which has a UNIQUE (composite 4-columns), one of which can be NULL?

I have the following table (PostgreSQL 8.3) which stores prices of some products. The prices are synchronised with another database, basically most of the fields below (apart from one) are not updated by our client - but instead dropped and refreshed every once-in-a-while to sync with another stock database:
CREATE TABLE product_pricebands (
template_sku varchar(20) NOT NULL,
colourid integer REFERENCES colour (colourid) ON DELETE CASCADE,
currencyid integer NOT NULL REFERENCES currency (currencyid) ON DELETE CASCADE,
siteid integer NOT NULL REFERENCES site (siteid) ON DELETE CASCADE,
master_price numeric(10,2),
my_custom_field boolean,
UNIQUE (template_sku, siteid, currencyid, colourid)
);
On the synchronisation, I basically DELETE most of the data above except for data WHERE my_custom_field is TRUE (if it's TRUE, it means the client updated this field via their CMS and therefore this record should not be dropped). I then INSERT 100s to 1000s of rows into the table, and UPDATE where the INSERT fails (i.e. where the combination of (template_sku, siteid, currencyid, colourid) already exists).
My question is - what best practice should be applied here to create a primary key? Is a primary key even needed? I wanted to make the primary key = (template_sku, siteid, currencyid, colourid) - but the colourid field can be NULL, and using it in a composite primary key is not possible.
From what I read on other forum posts, I think I have done the above correctly, and just need to clarify:
1) Should I use a "serial" primary key just in case I ever need one? At the moment I don't, and don't think I ever will, because the important data in the table is the price and my custom field, only identified by the (template_sku, siteid, currencyid, colourid) combination.
2) Since (template_sku, siteid, currencyid, colourid) is the combination that I will use to query a product's price, should I add any further indexing to my columns, such as the "template_sku" which is a varchar? Or is the UNIQUE constraint a good index already for my SELECTs?
Should I use a "serial" primary key just in case I ever need one?
You can easily add a serial column later if you need one:
ALTER TABLE product_pricebands ADD COLUMN id serial;
The column will be filled with unique values automatically. You can even make it the primary key in the same statement (if no primary key is defined, yet):
ALTER TABLE product_pricebands ADD COLUMN id serial PRIMARY KEY;
If you reference the table from other tables I would advise to use such a surrogate primary key, because it is rather unwieldy to link by four columns. It is also slower in SELECTs with JOINs.
Either way, you should define a primary key. The UNIQUE index including a nullable column is not a full replacement. It allows duplicates for combinations including a NULL value, because two NULL values are never considered the same. This can lead to trouble.
As the colourid field can be NULL, you might want to create two unique indexes. The combination (template_sku, siteid, currencyid, colourid) cannot be a PRIMARY KEY because of the nullable colourid, but you can create a UNIQUE constraint like you already have (implementing an index automatically):
ALTER TABLE product_pricebands ADD CONSTRAINT product_pricebands_uni_idx
UNIQUE (template_sku, siteid, currencyid, colourid)
This index perfectly covers the queries you mention in 2).
Create a partial unique index in addition if you want to avoid "duplicates" with (colourid IS NULL):
CREATE UNIQUE INDEX product_pricebands_uni_null_idx
ON product_pricebands (template_sku, siteid, currencyid)
WHERE colourid IS NULL;
To cover all bases. I wrote more about that technique in a related answer on dba.SE.
The simple alternative to the above is to make colourid NOT NULL and create a primary key instead of the above product_pricebands_uni_idx.
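That simple alternative would look like this (a sketch; it only works if every existing row actually has a colourid):
ALTER TABLE product_pricebands ALTER COLUMN colourid SET NOT NULL;
ALTER TABLE product_pricebands
ADD CONSTRAINT product_pricebands_pkey
PRIMARY KEY (template_sku, siteid, currencyid, colourid);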
Also, as you basically DELETE most of the data for your refill operation, it will be faster to drop the indexes that are not needed during the refill and recreate them afterwards. It is faster by an order of magnitude to build an index from scratch than to add all rows incrementally.
How do you know which indexes are used (needed)?
Test your queries with EXPLAIN ANALYZE.
Or use the built-in statistics. pgAdmin displays statistics in a separate tab for the selected object.
It may also be faster to select the few rows with my_custom_field = TRUE into a temporary table, TRUNCATE the base table and re-INSERT the survivors. Depends on whether you have foreign keys defined. Would look like this:
CREATE TEMP TABLE pr_tmp AS
SELECT * FROM product_pricebands WHERE my_custom_field;
TRUNCATE product_pricebands;
INSERT INTO product_pricebands SELECT * FROM pr_tmp;
This avoids a lot of vacuuming.