Multiple unique constraint performance - PostgreSQL

Imagine we have these two tables, with no primary key:
CREATE TABLE test (
name1 INT,
name2 INT,
name3 INT,
UNIQUE (name1, name2, name3)
);
CREATE TABLE test2 (
name1 INT,
name2 INT,
name3 INT,
UNIQUE (name1, name2)
);
I feel like those two tables are somehow exactly the same, but I am not sure whether the combinations they allow are the same. If you have a trick for reasoning about the combinations, I'll be more than happy to hear about it.
Performance-wise, is it the same to add a unique constraint on 2 columns as on, let's say, 5 or 6 columns? I imagine we are just adding one pointer per constraint?

Each unique SQL constraint translates into a definition stored in a system table (the set of these system tables forms what is called the catalog), but also into an index which speeds up the verification of this uniqueness. Neither the number of columns in the index nor the length of the indexed data has much influence on the duration of this check, because the search algorithm is logarithmic.
In your case, the difference in performance will be negligible.
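As a quick sanity check (a sketch, assuming the two example tables above have been created), you can list the indexes Postgres generated to back the unique constraints via the pg_indexes view:
SELECT tablename, indexname, indexdef
FROM pg_indexes
WHERE tablename IN ('test', 'test2');
-- one UNIQUE index per constraint, e.g. test_name1_name2_name3_key and test2_name1_name2_key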


Exclusion constraint for unique constraint, is there a difference? (bis)

Assume the following table:
CREATE TABLE zoo (
cage INTEGER,
animal TEXT
);
What is the real, effective difference between:
ALTER TABLE zoo ADD CONSTRAINT x EXCLUDE USING gist (cage WITH =, animal WITH <>)
and:
CREATE UNIQUE INDEX ON zoo(cage, animal)
?
I took this example from https://www.postgresql.org/docs/current/static/btree-gist.html and was confused about why they demonstrate this with an exclusion constraint instead of a good old unique constraint. So I am wondering whether there is really a difference.
The two do different things.
The exclusion constraint is doing just what the documentation says -- it is guaranteeing that a cage has exactly one type of animal. No lions and sheep in the cage together.
The unique index/constraint says that there are no duplicate animals in the cage. So, a lion and sheep is fine (from that perspective). But two lions or two sheep is not. (Of course, the lion and sheep example is likely to quickly result in a satisfied unique constraint).
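Here is a small sketch to see both behaviors side by side (the table names zoo_excl and zoo_uni are made up; the exclusion constraint needs the btree_gist extension, as on the linked docs page):
CREATE EXTENSION IF NOT EXISTS btree_gist;
-- Exclusion constraint: only one type of animal per cage
CREATE TABLE zoo_excl (cage integer, animal text);
ALTER TABLE zoo_excl ADD CONSTRAINT one_species_per_cage
EXCLUDE USING gist (cage WITH =, animal WITH <>);
INSERT INTO zoo_excl VALUES (1, 'lion');
INSERT INTO zoo_excl VALUES (1, 'lion');  -- OK: same species again
INSERT INTO zoo_excl VALUES (1, 'sheep'); -- fails: a different species in cage 1
-- Unique index: no duplicate (cage, animal) pairs
CREATE TABLE zoo_uni (cage integer, animal text);
CREATE UNIQUE INDEX ON zoo_uni (cage, animal);
INSERT INTO zoo_uni VALUES (1, 'lion');
INSERT INTO zoo_uni VALUES (1, 'sheep'); -- OK: different species in the same cage
INSERT INTO zoo_uni VALUES (1, 'lion');  -- fails: duplicate (cage, animal)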
This type of "exclusion" constraint could be handled using foreign key constraints. Something like this:
CREATE TABLE Cages (
CageId serial PRIMARY KEY,
AnimalType varchar(255) -- or whatever
);
CREATE TABLE CageAnimals (
CageAnimalId serial PRIMARY KEY,
CageId int REFERENCES Cages(CageId),
AnimalName varchar(255)
);
(The model would be a bit more complicated in real life.)

Fewer rows versus fewer columns

I am currently modeling a table schema for PostgreSQL that has a lot of columns and is intended to hold a lot of rows. I don't know if it is faster to have more columns or to split the data into more rows.
The schema looks like this (shortened):
CREATE TABLE child_table (
PRIMARY KEY(id, position),
id bigint REFERENCES parent_table(id) ON DELETE CASCADE,
position integer,
account_id bigint REFERENCES accounts(account_id) ON DELETE CASCADE,
attribute_1 integer,
attribute_2 integer,
attribute_3 integer,
-- about 60 more columns
);
At most 10 rows of child_table are related to one row of parent_table. The order is given by the value in position, which ranges from 1 to 10. parent_table is intended to hold 650 million rows, so with this schema I would end up with 6.5 billion rows in child_table.
Is it smart to do this? Or is it better to model it this way so that I only have 650 million rows:
CREATE TABLE child_table (
PRIMARY KEY(id),
id bigint,
parent_id bigint REFERENCES other_table(id) ON DELETE CASCADE,
account_id_1 bigint REFERENCES accounts(account_id) ON DELETE CASCADE,
attribute_1_1 integer,
attribute_1_2 integer,
attribute_1_3 integer,
account_id_2 bigint REFERENCES accounts(account_id) ON DELETE CASCADE,
attribute_2_1 integer,
attribute_2_2 integer,
attribute_2_3 integer,
-- [...]
);
The number of columns and rows matters less than how well they are indexed. Indexes drastically reduce the number of rows which need to be searched. In a well-indexed table, the total number of rows is irrelevant. If you try to smash 10 rows into one row you'll make indexing much harder. It will also make writing efficient queries which use those indexes harder.
Postgres has many different types of indexes to cover many different types of data and searches. You can even write your own (though that shouldn't be necessary).
At most 10 rows of child_table are related to one row of parent_table.
Avoid encoding business logic in your schema. Business logic changes all the time, especially arbitrary numbers like 10.
One thing you might consider is reducing the number of attribute columns, 60 is a lot, especially if they are actually named attribute_1, attribute_2, etc. Instead, if your attributes are not well defined, store them as a single JSON column with keys and values. Postgres' JSON operations are very efficient (provided you use the jsonb type) and provide a nice middle ground between a key/value store and a relational database.
Similarly, if any sets of attributes are simple lists (like address1, address2, address3), you can also consider using Postgres arrays.
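For illustration only, here is a rough sketch of the jsonb variant (assuming the parent_table and accounts tables from the question; the column name attributes is made up):
CREATE TABLE child_table (
id bigint REFERENCES parent_table(id) ON DELETE CASCADE,
position integer,
account_id bigint REFERENCES accounts(account_id) ON DELETE CASCADE,
attributes jsonb, -- e.g. '{"attribute_1": 42, "attribute_2": 7}'
PRIMARY KEY (id, position)
);
-- GIN index to support containment queries such as:
--   SELECT * FROM child_table WHERE attributes @> '{"attribute_1": 42}';
CREATE INDEX ON child_table USING gin (attributes);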
I can't give better advice than this without specifics.

Composite PRIMARY KEY enforces NOT NULL constraints on involved columns

This is one strange, unwanted behavior I encountered in Postgres:
When I create a Postgres table with a composite primary key, it enforces a NOT NULL constraint on each column of the composite combination.
For example,
CREATE TABLE distributors (m_id integer, x_id integer, PRIMARY KEY(m_id, x_id));
enforces NOT NULL constraint on columns m_id and x_id, which I don't want!
MySQL doesn't do this. I think Oracle doesn't do it either.
I understand that PRIMARY KEY enforces UNIQUE and NOT NULL automatically but that makes sense for single-column primary key. In a multi-column primary key table, the uniqueness is determined by the combination.
Is there any simple way of avoiding this behavior of Postgres? When I execute this:
CREATE TABLE distributors (m_id integer, x_id integer);
I do not get any NOT NULL constraints of course. But I would not have a primary key either.
If you need to allow NULL values, use a UNIQUE constraint (or index) instead of a PRIMARY KEY (and add a surrogate PK column - I suggest a serial or IDENTITY column in Postgres 10 or later).
See: Auto increment table column
A UNIQUE constraint allows columns to be NULL:
CREATE TABLE distributor (
distributor_id int GENERATED ALWAYS AS IDENTITY PRIMARY KEY
, m_id integer
, x_id integer
, UNIQUE(m_id, x_id) -- !
-- , CONSTRAINT distributor_my_name_uni UNIQUE (m_id, x_id) -- verbose form
);
The manual:
For the purpose of a unique constraint, null values are not considered equal, unless NULLS NOT DISTINCT is specified.
In your case, you could enter something like (1, NULL) for (m_id, x_id) any number of times without violating the constraint. Postgres never considers two NULL values equal - as per definition in the SQL standard.
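For instance, with the distributor table defined above, both of these inserts succeed:
INSERT INTO distributor (m_id, x_id) VALUES (1, NULL);
INSERT INTO distributor (m_id, x_id) VALUES (1, NULL); -- no conflict: NULLs are distinct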
If you need to treat NULL values as equal (i.e. "not distinct") to disallow such "duplicates", I see three options (the first one only since Postgres 15):
0. NULLS NOT DISTINCT
This option was added with Postgres 15 and allows treating NULL values as "not distinct", so that two of them conflict in a unique constraint or index. This is the most convenient option going forward. The manual:
That means even in the presence of a unique constraint it is possible
to store duplicate rows that contain a null value in at least one of
the constrained columns. This behavior can be changed by adding the
clause NULLS NOT DISTINCT ...
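A minimal sketch of that syntax, for Postgres 15 or later (same table as above, only the constraint changes):
CREATE TABLE distributor (
distributor_id int GENERATED ALWAYS AS IDENTITY PRIMARY KEY
, m_id integer
, x_id integer
, UNIQUE NULLS NOT DISTINCT (m_id, x_id)
);
INSERT INTO distributor (m_id, x_id) VALUES (1, NULL);
INSERT INTO distributor (m_id, x_id) VALUES (1, NULL); -- now fails: duplicate key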
Detailed instructions:
Create unique constraint with null columns
1. Two partial indexes
In addition to the UNIQUE constraint above:
CREATE UNIQUE INDEX dist_m_uni_idx ON distributor (m_id) WHERE x_id IS NULL;
CREATE UNIQUE INDEX dist_x_uni_idx ON distributor (x_id) WHERE m_id IS NULL;
But this gets out of hand quickly with more than two columns that can be NULL. See:
Create unique constraint with null columns
2. A multi-column UNIQUE index on expressions
This replaces the UNIQUE constraint above. We need a free default value that is never present in the involved columns, like -1, plus CHECK constraints to disallow it:
CREATE TABLE distributor (
distributor serial PRIMARY KEY
, m_id integer
, x_id integer
, CHECK (m_id <> -1)
, CHECK (x_id <> -1)
);
CREATE UNIQUE INDEX distributor_uni_idx
ON distributor (COALESCE(m_id, -1), COALESCE(x_id, -1));
When you want a polymorphic relation
Your table uses column names that indicate that they are probably references to other tables:
CREATE TABLE distributors (m_id integer, x_id integer);
So I think you probably are trying to model a polymorphic relation to other tables – where a record in your table distributors can refer to one m record xor one x record.
Polymorphic relations are difficult in SQL. The best resource I have seen about this topic is "Modeling Polymorphic Associations in a Relational Database". There, four alternative options are presented, and the recommendation for most cases is called "Exclusive Belongs To", which in your case would lead to a table like this:
CREATE TABLE distributors (
id serial PRIMARY KEY,
m_id integer REFERENCES m,
x_id integer REFERENCES x,
CHECK (
((m_id IS NOT NULL)::integer + (x_id IS NOT NULL)::integer) = 1
)
);
CREATE UNIQUE INDEX ON distributors (m_id) WHERE m_id IS NOT NULL;
CREATE UNIQUE INDEX ON distributors (x_id) WHERE x_id IS NOT NULL;
Like the other solutions, this uses a surrogate primary key column, because the SQL standard requires primary key columns to be NOT NULL.
This solution adds a 4th option to the three in @Erwin Brandstetter's answer for how to avoid the case where "you could enter something like (1, NULL) for (m_id, x_id) any number of times without violating the constraint." Here, that case is excluded by a combination of two measures:
Partial unique indexes on each column individually: two records (1, NULL) and (1, NULL) would not violate the constraint on the second column (NULLs are considered distinct, and those rows are not even covered by its partial index), but they would violate the constraint on the first column (two records with value 1).
Check constraint: The missing piece is preventing multiple (NULL, NULL) records, which would still be allowed because NULLs are considered distinct (and, in any case, our partial indexes do not cover them, to save space and writes). This is achieved by the CHECK constraint, which rules out (NULL, NULL) records entirely by making sure that exactly one of the two columns is NOT NULL.
There's one difference though: all alternatives in @Erwin Brandstetter's answer allow at least one record (NULL, NULL), and any number of records with no NULL value in any column (like (1, 2)). When modeling a polymorphic relation, you want to disallow such records. That is achieved by the CHECK constraint in the solution above.
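To illustrate, assuming a row with primary key 1 exists in m and one with primary key 2 in x, the combined constraints behave like this:
INSERT INTO distributors (m_id, x_id) VALUES (1, NULL);    -- OK: refers to m only
INSERT INTO distributors (m_id, x_id) VALUES (NULL, 2);    -- OK: refers to x only
INSERT INTO distributors (m_id, x_id) VALUES (1, NULL);    -- fails: partial unique index on m_id
INSERT INTO distributors (m_id, x_id) VALUES (NULL, NULL); -- fails: CHECK requires exactly one reference
INSERT INTO distributors (m_id, x_id) VALUES (1, 2);       -- fails: CHECK forbids referring to both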

PostgreSQL table design

I need to create a table (PostgreSQL 9.1) and I am stuck. Could you possibly help?
The incoming data can assume either of the two formats:
client id(int), shop id(int), asof(date), quantity
client id(int), shop type(int), shop genre(varchar), asof(date), quantity
The given incoming CSV template is: {client id, shop id, shop type, shop genre, asof, quantity}
In the first case, the key is -- client id, shop id, asof
In the second case, the key is -- client id, shop type, shop genre, asof
I tried something like:
create table(
client_id int references...,
shop_id int references...,
shop_type int references...,
shop_genre varchar(30),
asof date,
quantity real,
primary key( client_id, shop_id, shop_type, shop_genre, asof )
);
But then I ran into a problem. When the data is in format 1, the inserts fail because of NULLs in the PK.
The queries within a client can be either by shop id, or by a combination of shop type and genre. There are no use cases of partial or regex matches on genre.
What would be a suitable design? Must I split this into 2 tables and then take a union of search results? Or, is it customary to put 0's and blanks for missing values and move along?
If it matters, the table is expected to be 100-500 million rows once all historic data is loaded.
Thanks.
You could try partial unique indexes, also known as filtered or conditional unique indexes.
http://www.postgresql.org/docs/9.2/static/indexes-partial.html
Basically, what it comes down to is that uniqueness is only enforced for rows matching a WHERE clause.
For example (of course, test for correctness and impact on performance):
CREATE TABLE client(
pk_id SERIAL,
client_id int,
shop_id int,
shop_type int,
shop_genre varchar(30),
asof date,
quantity real,
PRIMARY KEY (pk_id)
);
CREATE UNIQUE INDEX uidx1_client
ON client
USING btree
(client_id, shop_id, asof, quantity)
WHERE client_id = 200;
CREATE UNIQUE INDEX uidx2_client
ON client
USING btree
(client_id, asof, quantity)
WHERE client_id = 500;
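If I read the question right, the two key formats could be covered along these lines (an untested sketch, using the client table above; verify that rows of format 2 really do leave shop_id NULL):
-- Format 1: rows that have a shop_id; key is (client id, shop id, asof)
CREATE UNIQUE INDEX uidx_client_by_shop
ON client (client_id, shop_id, asof)
WHERE shop_id IS NOT NULL;
-- Format 2: rows without a shop_id; key is (client id, shop type, shop genre, asof)
CREATE UNIQUE INDEX uidx_client_by_type_genre
ON client (client_id, shop_type, shop_genre, asof)
WHERE shop_id IS NULL;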
A simple solution would be to create a field for the primary key which would use one of two algorithms to generate its data depending on what is passed in.
If you wanted a fully normalised solution, you would probably need to split the shop information into two separate tables and have it referenced from this table using outer joins.
You may also be able to use table inheritance, which is available in Postgres.
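For completeness, a rough sketch of the inheritance route (all table names are made up; note that the unique constraints only apply per child table, while queries against the parent see rows from both children):
CREATE TABLE sales_base (
client_id int NOT NULL,
asof date NOT NULL,
quantity real
);
CREATE TABLE sales_by_shop (
shop_id int NOT NULL,
UNIQUE (client_id, shop_id, asof)
) INHERITS (sales_base);
CREATE TABLE sales_by_type_genre (
shop_type int NOT NULL,
shop_genre varchar(30) NOT NULL,
UNIQUE (client_id, shop_type, shop_genre, asof)
) INHERITS (sales_base);
-- SELECT ... FROM sales_base scans both children automatically.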

GUID and automatic id as primary key in SQL databases

SELECT COUNT(*) FROM table_name;
My algorithm is:
check count
count+1 is the new primary key starting point
Then keep on incrementing before every insert operation
But what is this GUID? Does SQL Server provide something that automatically generates an incrementing primary key?
There are 3 options
CREATE TABLE A
(
ID INT IDENTITY(1,1) PRIMARY KEY,
... Other Columns
)
CREATE TABLE B
(
ID UNIQUEIDENTIFIER DEFAULT NEWID() PRIMARY KEY,
... Other Columns
)
CREATE TABLE C
(
ID UNIQUEIDENTIFIER DEFAULT NEWSEQUENTIALID() PRIMARY KEY,
... Other Columns
)
One reason why you might prefer C rather than B would be to reduce fragmentation if you were to use the ID as the clustered index.
I'm not sure if you're also asking about IDENTITY or not, but a GUID is a unique identifier that is (almost) guaranteed to be unique. It can be used for primary keys but isn't recommended unless you're doing offline work or planning on merging databases.
For example, a "normal" IDENTITY primary key is:
1 Jason
2 Jake
3 Mike
which when merging with another database which looks like
1 Lisa
2 John
3 Sam
will be tricky. You've got to re-key some columns, make sure that your FKs are in order, etc. Using GUIDs, the data looks like this, and is easy to merge:
1FB74D3F-2C84-43A6-9FB6-0EFC7092F4CE Jason
845D5184-6383-473F-A5D6-4DE98DBFBC39 Jake
8F515331-4457-49D0-A9F5-5814EE7F50BA Mike
CE789C89-E01F-4BCE-AC05-CBDF10419E78 Lisa
4D51B568-107C-4B63-9F7F-24592704118F John
7FA4ED64-7356-4013-A78A-C8CCAB329954 Sam
Note that a GUID takes a lot more space than an INT, and because of this it's recommended to use an INT as a primary key unless you absolutely need to.
create table YourTable
(id int identity(1,1) primary key,
col1 varchar(10)
)
will automatically create the primary key for you.
Check GUID in T-SQL; I don't have it at hand right now.
The issue with using count, then count + 1 as the key, is that if you were to delete a record from the middle, you would end up generating a duplicate key.
EG:
Key Data
1 A
2 B
3 C
4 D
Now delete B (count becomes 3) and insert E. This tries to make the new primary key 4, which already exists:
Key Data
1 A
3 C
4 D <--After delete count = 3 here
4 E <--Attempted insert with key 4
You could use a primary key with auto-increment to make sure you don't have this issue (MySQL syntax shown):
CREATE TABLE myTable
(
P_Id int NOT NULL AUTO_INCREMENT,
PRIMARY KEY (P_Id)
)
Or you could use a GUID. GUIDs work by creating a 128-bit value (represented as a 32-character hex string):
Key Data
24EC84E0-36AA-B489-0C7B-074837BCEA5D A
.
.
This gives 2^128 possible values (really large), so the chance of the same value being generated twice is extremely small. Besides, there are algorithms that help ensure this doesn't happen. So GUIDs are a pretty good choice for a key as well.
Whether you use an integer or a GUID usually depends on the application, policy, etc.