PostgreSQL table inheritance and constraints - postgresql

I have a DDL for some tables similar to the following:
CREATE TABLE devices
(
device_uuid uuid NOT NULL,
manufacturer_uuid uuid NOT NULL,
...
CONSTRAINT device_manufacturer_uuid_fkey FOREIGN KEY (manufacturer_uuid)
REFERENCES manufacturer (manufacturer_uuid) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE NO ACTION,
CONSTRAINT devices_device_uuid_key UNIQUE (device_uuid),
CONSTRAINT devices_pkey PRIMARY KEY (device_uuid)
);
I might have different types of devices, e.g. "peripherals", "power", "graphics", etc.
There are two approaches I can take:
Add a simple type column to this table
Pros: I have a pretty simple database structure.
Cons: There could potentially be hundreds of thousands of entries, which could lead to performance problems. The list of "devices" would be searched fairly regularly, and searching through all those entries to find all those of type "peripherals" every time might not be great.
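A minimal sketch of this first approach (the device_type column, its type, and the index name are illustrative additions, not part of the original DDL):
CREATE TABLE devices
(
device_uuid uuid NOT NULL,
manufacturer_uuid uuid NOT NULL,
device_type text NOT NULL, -- e.g. 'peripherals', 'power', 'graphics'
CONSTRAINT devices_pkey PRIMARY KEY (device_uuid),
CONSTRAINT device_manufacturer_uuid_fkey FOREIGN KEY (manufacturer_uuid)
REFERENCES manufacturer (manufacturer_uuid)
);
-- An index on the type column keeps "all peripherals" lookups from scanning every row:
CREATE INDEX devices_device_type_idx ON devices (device_type);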
Inherit from the above table for every type.
The DDL for this would look similar to the following (I believe):
CREATE TABLE devicesperipherals
(
type device NOT NULL,
CONSTRAINT devicesperipherals_manufacturer_uuid_fkey FOREIGN KEY (manufacturer_uuid)
REFERENCES manufacturer (manufacturer_uuid) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE NO ACTION,
CONSTRAINT devicesperipherals_device_uuid_key UNIQUE (device_uuid)
)
INHERITS (devices)
WITH (
OIDS=FALSE
);
Pros: From what I know about table inheritance, this approach should have better performance.
Cons: More complex database structure, more space required for indexing.
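For context on how queries behave under inheritance (standard PostgreSQL semantics, not from the question): a query against the parent automatically scans all children unless ONLY is specified.
SELECT count(*) FROM devices; -- parent plus every inheriting table
SELECT count(*) FROM ONLY devices; -- the parent table alone
SELECT count(*) FROM devicesperipherals; -- a single child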
Which is the preferred approach in such a case? Why?

Related

Postgres ALTER TABLE... unexpected performance when applying multiple changes

I am attempting to speed up a bulk load. The bulk load is performed into a table with all primary keys, indexes, and foreign keys dropped. After the data finishes loading, we reapply the necessary constraints. As a simple test I have the following setup:
CREATE TABLE users
(
id int primary key
);
CREATE TABLE events
(
id serial,
user_id1 int,
user_id2 int,
unique_int1 int,
unique_int2 int,
unique_int3 int
);
INSERT INTO
users (id)
SELECT generate_series(1,100000000);
INSERT INTO
events (user_id1,user_id2,unique_int1,unique_int2,unique_int3)
SELECT
generate_series(1,100000000),
generate_series(1,100000000),
generate_series(1,100000000),
generate_series(1,100000000),
generate_series(1,100000000);
I then wish to apply the following constraints via six individual ALTER TABLE statements:
ALTER TABLE ONLY public.events
ADD CONSTRAINT events_pkey PRIMARY KEY (id);
ALTER TABLE ONLY public.events
ADD CONSTRAINT events_user_id1_fkey FOREIGN KEY (user_id1) REFERENCES public.users(id);
ALTER TABLE ONLY public.events
ADD CONSTRAINT events_user_id2_fkey FOREIGN KEY (user_id2) REFERENCES public.users(id);
ALTER TABLE ONLY public.events
ADD CONSTRAINT events_unique_int1_me_key UNIQUE (unique_int1);
ALTER TABLE ONLY public.events
ADD CONSTRAINT events_unique_int2_me_key UNIQUE (unique_int2);
ALTER TABLE ONLY public.events
ADD CONSTRAINT events_unique_int3_me_key UNIQUE (unique_int3);
Each of the above statements takes approximately 90 seconds to run, for a total of roughly 450 seconds. I expected that combining the above statements into a single ALTER TABLE statement would reduce the time, but in fact it increased.
ALTER TABLE ONLY public.events
ADD CONSTRAINT events_pkey PRIMARY KEY (id),
ADD CONSTRAINT events_user_id1_fkey FOREIGN KEY (user_id1) REFERENCES public.users(id),
ADD CONSTRAINT events_user_id2_fkey FOREIGN KEY (user_id2) REFERENCES public.users(id),
ADD CONSTRAINT events_unique_int1_me_key UNIQUE (unique_int1),
ADD CONSTRAINT events_unique_int2_me_key UNIQUE (unique_int2),
ADD CONSTRAINT events_unique_int3_me_key UNIQUE (unique_int3);
This takes 520 seconds, whereas I expected it to take less than 450 seconds. My reason for expecting this comes from the Postgres documentation for the ALTER TABLE statement, where the Notes section reads:
The main reason for providing the option to specify multiple changes in a single ALTER TABLE is that multiple table scans or rewrites can thereby be combined into a single pass over the table.
Can anyone explain the above measurements or have any suggestions?
This case is not going to be helped much, as each of the commands needs its own pass over the data to verify the condition of its constraint (FK or UNIQUE). Also from the docs:
When multiple subcommands are given, the lock acquired will be the strictest one required by any subcommand.
So by combining the commands you are operating under the strictest locking for the entire command.
The section you quoted is amplified further on:
For example, it is possible to add several columns and/or alter the type of several columns in a single command. This is particularly useful with large tables, since only one pass over the table need be made.
The conclusion is that combining commands is not necessarily a time saver.
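For contrast, here is the kind of case the docs have in mind, where combining genuinely saves a pass (a hypothetical example, not taken from the question): changing the type of two columns in one statement rewrites the table once instead of twice.
ALTER TABLE public.events
ALTER COLUMN unique_int1 TYPE bigint,
ALTER COLUMN unique_int2 TYPE bigint;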
Unifying the table read is going to be particularly beneficial when the scan (or rewrite) is the only IO involved, as with a column type change or a CHECK constraint. But here you need to actually build the indexes or look up the foreign keys, and that is what will dominate your run time. So it is no surprise that you don't see a big gain.
As for the small loss in performance, it could be due to worse cache usage, less efficient parallel query execution, or interrupted stride length for readahead or for bulk writing to your IO devices.
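One practical workaround, sketched here as a suggestion rather than something from the answer above: build the unique indexes in separate concurrent sessions, attach the finished indexes as constraints (which does no extra table scan), and add the foreign keys as NOT VALID before validating them in a second, lighter step.
-- In separate, concurrent sessions:
CREATE UNIQUE INDEX events_pkey_idx ON public.events (id);
CREATE UNIQUE INDEX events_unique_int1_idx ON public.events (unique_int1);
-- Attaching a finished index as a constraint is nearly instant:
ALTER TABLE public.events ADD CONSTRAINT events_pkey PRIMARY KEY USING INDEX events_pkey_idx;
ALTER TABLE public.events ADD CONSTRAINT events_unique_int1_me_key UNIQUE USING INDEX events_unique_int1_idx;
-- Foreign keys: register first, validate afterwards:
ALTER TABLE public.events ADD CONSTRAINT events_user_id1_fkey
FOREIGN KEY (user_id1) REFERENCES public.users (id) NOT VALID;
ALTER TABLE public.events VALIDATE CONSTRAINT events_user_id1_fkey;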

Composite FK referencing atomic PK + non-unique attribute

I am trying to create the following tables in Postgres 13.3:
CREATE TABLE IF NOT EXISTS accounts (
account_id Integer PRIMARY KEY NOT NULL
);
CREATE TABLE IF NOT EXISTS users (
user_id Integer PRIMARY KEY NOT NULL,
account_id Integer NOT NULL REFERENCES accounts(account_id) ON DELETE CASCADE
);
CREATE TABLE IF NOT EXISTS calendars (
calendar_id Integer PRIMARY KEY NOT NULL,
user_id Integer NOT NULL,
account_id Integer NOT NULL,
FOREIGN KEY (user_id, account_id) REFERENCES users(user_id, account_id) ON DELETE CASCADE
);
But I get the following error when creating the calendars table:
ERROR: there is no unique constraint matching given keys for referenced table "users"
This does not make much sense to me, since the foreign key contains user_id, which is the PK of the users table and therefore already carries a uniqueness constraint. If I add an explicit uniqueness constraint on the combination of user_id and account_id like so:
ALTER TABLE users ADD UNIQUE (user_id, account_id);
then I am able to create the calendars table. This unique constraint seems unnecessary to me, as user_id is already unique. Can someone please explain what I am missing here?
Postgres is so smart/dumb that it doesn't assume the designer will do stupid things.
The Postgres designers could have taken different strategies:
Detect the transitivity, and make the FK depend not only on users.user_id but also on users.account_id -> accounts.account_id. This is doable but costly. It also involves adding multiple dependency records in the catalogs for a single FK constraint. Enforcing the constraint (on UPDATE or DELETE in either of the two referenced tables) could get very complex.
Detect the transitivity, and silently ignore the redundant column reference. This amounts to lying to the programmer. It would also need to be represented in the catalogs.
Cascading DDL operations would get more complex, too. (Remember: DDL is already very hard with respect to concurrency/versioning.)
From the execution/performance point of view, enforcing the constraints currently involves "pseudo triggers" on the referenced table's indexes (except DEFERRED constraints, which have to be handled specially).
So, IMHO, the Postgres developers made the sane choice of refusing to do stupidly complex things.
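For reference, the question's schema works once the composite UNIQUE constraint exists; folded into the table definition, the users table becomes the following (the calendars DDL from the question then runs unchanged):
CREATE TABLE IF NOT EXISTS users (
user_id Integer PRIMARY KEY NOT NULL,
account_id Integer NOT NULL REFERENCES accounts(account_id) ON DELETE CASCADE,
UNIQUE (user_id, account_id) -- gives the composite FK in calendars a unique index to target
);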

Is it a good idea to store relations to many different tables in one field? [duplicate]

I have a database which has three tables
Messages - PK = MessageId
Drafts - PK = DraftId
History - FK = RelatedItemId
The History table has a single foreign-key column [RelatedItemId] which maps to one of the two primary keys, in Messages or in Drafts.
Is there a name for this relationship?
Is it just bad design?
Is there a better way to design this relationship?
Here are the CREATE TABLE statements for this question:
CREATE TABLE [dbo].[History](
[HistoryId] [uniqueidentifier] NOT NULL,
[RelatedItemId] [uniqueidentifier] NULL,
CONSTRAINT [PK_History] PRIMARY KEY CLUSTERED ( [HistoryId] ASC )
)
CREATE TABLE [dbo].[Messages](
[MessageId] [uniqueidentifier] NOT NULL,
CONSTRAINT [PK_Messages] PRIMARY KEY CLUSTERED ( [MessageId] ASC )
)
CREATE TABLE [dbo].[Drafts](
[DraftId] [uniqueidentifier] NOT NULL,
CONSTRAINT [PK_Drafts] PRIMARY KEY CLUSTERED ( [DraftId] ASC )
)
In short, the solution you have used is called:
Polymorphic Association
Objective: reference multiple parents.
Resulting anti-pattern: a dual-purpose foreign key, violating first normal form (atomicity) and losing referential integrity.
Solution: simplify the relationship.
By the way, creating a common super-table (a shared parent of Messages and Drafts) would help you here.
Is there a name for this relationship?
There is no standard name that I'm aware of, but I've heard people using the term "generic FKs" or even "inner-platform effect".
Is it just bad design?
Yes.
The reason: it prevents you from declaring a FOREIGN KEY, and therefore prevents the DBMS from enforcing referential integrity directly. You must instead enforce it through imperative code, which is surprisingly difficult.
Is there a better way to design this relationship?
Yes.
Create a separate FOREIGN KEY column for each referenced table. Make them NULL-able, but make sure exactly one of them is non-NULL through a CHECK constraint, as sketched below.
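A sketch of what that looks like for the question's tables (the MessageId/DraftId column names and the constraint name are mine, for illustration):
CREATE TABLE [dbo].[History](
[HistoryId] [uniqueidentifier] NOT NULL,
[MessageId] [uniqueidentifier] NULL REFERENCES [dbo].[Messages] ([MessageId]),
[DraftId] [uniqueidentifier] NULL REFERENCES [dbo].[Drafts] ([DraftId]),
CONSTRAINT [PK_History] PRIMARY KEY CLUSTERED ( [HistoryId] ASC ),
-- exactly one of the two parent references must be set
CONSTRAINT [CK_History_OneParent] CHECK (
([MessageId] IS NOT NULL AND [DraftId] IS NULL) OR
([MessageId] IS NULL AND [DraftId] IS NOT NULL)
)
)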
Alternatively, take a look at inheritance.
Best practice I have found is to create a function that returns whether the passed-in value exists in either the Messages or the Drafts PK column. You can then add a CHECK constraint on the column in History that calls this function, so a row can only be inserted if the value exists.
Example code (T-SQL):
CREATE FUNCTION dbo.is_related_there ( @value uniqueidentifier )
RETURNS TINYINT
AS
BEGIN
IF EXISTS (SELECT 1 FROM dbo.Drafts WHERE DraftId = @value)
OR EXISTS (SELECT 1 FROM dbo.Messages WHERE MessageId = @value)
RETURN 1;
RETURN 0;
END;
GO
ALTER TABLE dbo.History ADD CONSTRAINT
CK_HistoryExists CHECK (dbo.is_related_there(RelatedItemId) = 1);
Hope that runs and helps lol

postgres - constraint on sum of a column (without triggers)

I want to constrain the sum of a certain attribute of the child entities of a parent entity to a certain attribute of that parent entity. I want to do this in PostgreSQL and without using triggers. An example follows:
Assume we have a crate with a volume attribute. We want to fill it with smaller boxes, which have their own volume attributes. The sum of volumes of all boxes in the crate cannot be greater than the volume of the crate.
The idea I have in mind is something like:
CREATE TABLE crates (
crate_id int NOT NULL,
crate_volume int NOT NULL,
crate_volume_used int NOT NULL DEFAULT 0,
CONSTRAINT crates_pkey PRIMARY KEY (crate_id),
CONSTRAINT ukey_for_fkey_ref_from_boxes
UNIQUE (crate_id, crate_volume, crate_volume_used),
CONSTRAINT crate_volume_used_cannot_be_greater_than_crate_volume
CHECK (crate_volume_used <= crate_volume),
CONSTRAINT crate_volume_must_be_positive CHECK (crate_volume >= 0)
);
CREATE TABLE boxes (
box_id int NOT NULL,
box_volume int NOT NULL,
crate_id int NOT NULL,
crate_volume int NOT NULL,
crate_volume_used int NOT NULL,
id_of_previous_box int,
previous_sum_of_volumes_of_boxes int,
current_sum_of_volumes_of_boxes int NOT NULL,
id_of_next_box int,
CONSTRAINT boxes_pkey PRIMARY KEY (box_id),
CONSTRAINT box_volume_must_be_positive CHECK (box_volume >= 0),
CONSTRAINT crate_fkey FOREIGN KEY (crate_id, crate_volume, crate_volume_used)
REFERENCES crates (crate_id, crate_volume, crate_volume_used) MATCH SIMPLE
ON UPDATE CASCADE ON DELETE NO ACTION DEFERRABLE INITIALLY DEFERRED,
CONSTRAINT previous_box_self_ref_fkey FOREIGN KEY (id_of_previous_box, previous_sum_of_volumes_of_boxes)
REFERENCES boxes (box_id, current_sum_of_volumes_of_boxes) MATCH SIMPLE
ON UPDATE CASCADE ON DELETE NO ACTION DEFERRABLE INITIALLY DEFERRED,
CONSTRAINT ukey_for_previous_box_self_ref_fkey UNIQUE (box_id, current_sum_of_volumes_of_boxes),
CONSTRAINT previous_box_self_ref_fkey_validity UNIQUE (crate_id, id_of_previous_box),
CONSTRAINT next_box_self_ref_fkey FOREIGN KEY (id_of_next_box)
REFERENCES boxes (box_id) MATCH SIMPLE
ON UPDATE CASCADE ON DELETE NO ACTION DEFERRABLE INITIALLY DEFERRED,
CONSTRAINT next_box_self_ref_fkey_validity UNIQUE (crate_id, id_of_next_box),
CONSTRAINT self_ref_key_integrity CHECK (
(id_of_previous_box IS NULL AND previous_sum_of_volumes_of_boxes IS NULL) OR
(id_of_previous_box IS NOT NULL AND previous_sum_of_volumes_of_boxes IS NOT NULL)
),
CONSTRAINT sum_of_volumes_of_boxes_check1 CHECK (current_sum_of_volumes_of_boxes <= crate_volume),
CONSTRAINT sum_of_volumes_of_boxes_check2 CHECK (
(previous_sum_of_volumes_of_boxes IS NULL AND current_sum_of_volumes_of_boxes=box_volume) OR
(previous_sum_of_volumes_of_boxes IS NOT NULL AND current_sum_of_volumes_of_boxes=box_volume+previous_sum_of_volumes_of_boxes)
),
CONSTRAINT crate_volume_used_check CHECK (
(id_of_next_box IS NULL AND crate_volume_used=current_sum_of_volumes_of_boxes) OR
(id_of_next_box IS NOT NULL)
)
);
CREATE UNIQUE INDEX single_first_box ON boxes (crate_id) WHERE id_of_previous_box IS NULL;
CREATE UNIQUE INDEX single_last_box ON boxes (crate_id) WHERE id_of_next_box IS NULL;
My question is whether this is a workable way of doing this and whether there is a better (less confusing, more optimized, etc.) way. Or should I just stick to triggers?
Thanks in advance.
My question is whether there is a better (less confusing, more optimized, etc.) way of doing this.
Yes, there is: in a word, use a trigger…
No, never mind that you don't want to use one. Use a trigger here; no ifs, no buts.
Expanding on the comments I and others posted earlier:
What you're doing amounts to writing a constraint trigger that verifies that sum(boxes.box_volume) <= crates.crate_volume. It's just doing so in a very, very bastardized way (by having check constraints and unique keys and foreign keys masquerade as an aggregate function), and doing the relevant calculations within your app at that.
Your only achievement in avoiding a genuine trigger will be errors down the road when two concurrent updates try to affect the same crate. All this at the cost of maintaining unnecessary unique indexes and foreign keys.
Sure, you'll end up fixing some or all of these issues and refining your "implementation" further by making the foreign keys deferrable, adding locks, yada yada. But in the end, you're basically doing what amounts to writing a vastly inefficient aggregate function.
So use a trigger. Either maintain a current_volume column in crates using AFTER triggers on boxes and enforce the cap with a simple CHECK constraint on crates, or add constraint triggers on boxes to enforce the check directly. A sketch of the first option follows below.
If you need more convincing, just consider the overhead that you're creating. Really. Take a cold, hard look at it: instead of maintaining one volume column in crates using triggers (if even that), you're maintaining no fewer than six fields that serve absolutely no purpose beyond your constraint, plus so many unique indexes and foreign key constraints related to them that I genuinely lose count when I try to enumerate them. And check constraints on them, at that. This all adds up in terms of storage and write performance.
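To make that concrete, here is a minimal sketch of the first option on a simplified schema (the function and trigger names are mine; EXECUTE FUNCTION assumes PostgreSQL 11+, use EXECUTE PROCEDURE on older versions):
CREATE TABLE crates (
crate_id int PRIMARY KEY,
crate_volume int NOT NULL CHECK (crate_volume >= 0),
crate_volume_used int NOT NULL DEFAULT 0,
CHECK (crate_volume_used <= crate_volume)
);
CREATE TABLE boxes (
box_id int PRIMARY KEY,
crate_id int NOT NULL REFERENCES crates (crate_id),
box_volume int NOT NULL CHECK (box_volume >= 0)
);
CREATE FUNCTION maintain_crate_volume_used() RETURNS trigger AS $$
BEGIN
IF TG_OP IN ('INSERT', 'UPDATE') THEN
UPDATE crates SET crate_volume_used = crate_volume_used + NEW.box_volume
WHERE crate_id = NEW.crate_id;
END IF;
IF TG_OP IN ('UPDATE', 'DELETE') THEN
UPDATE crates SET crate_volume_used = crate_volume_used - OLD.box_volume
WHERE crate_id = OLD.crate_id;
END IF;
RETURN NULL; -- AFTER trigger: return value is ignored
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER boxes_volume_used
AFTER INSERT OR UPDATE OF box_volume, crate_id OR DELETE ON boxes
FOR EACH ROW EXECUTE FUNCTION maintain_crate_volume_used();
The UPDATE of the parent row takes a row lock on the crate, which serializes concurrent inserts into the same crate, and the CHECK constraint rejects any change that would overfill it.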

Shared Primary key versus Foreign Key

I have a laboratory analysis database and I'm working on the best data layout. I've seen some suggestions, based on similar requirements, for using a "Shared Primary Key", but I don't see the advantages over just foreign keys. I'm using PostgreSQL; the tables are listed below.
Sample
___________
sample_id (PK)
sample_type (where in the process the sample came from)
sample_timestamp (when was the sample taken)
Analysis
___________
analysis_id (PK)
sample_id (FK references sample)
method (what analytical method was performed)
analysis_timestamp (when did the analysis take place)
analysis_notes
gc
____________
analysis_id (shared Primary key)
gc_concentration_meoh (methanol concentration)
gc_concentration_benzene (benzene concentration)
spectrophotometer
_____________
analysis_id
spectro_nm (wavelength used)
spectro_abs (absorbance measured)
I could use this design, or I could move the fields from the analysis table into both the gc and spectrophotometer tables and just use foreign keys between the sample, gc, and spectrophotometer tables. The only advantage I see in the shared-key design is for cases where I just want to know how many or what types of analyses were performed, without having to join in the actual results. However, the additional rules to ensure referential integrity between the shared primary keys, and managing extra joins and triggers (ON DELETE CASCADE, etc.), appear to make it more of a headache than the minor advantages justify. I'm not a DBA but a scientist, so please let me know what I'm missing.
UPDATE:
A shared primary key (as I understand it) is like a one-to-one foreign key with the additional constraint that each value in the parent table (analysis) must appear in one of the child tables exactly once, and in only one of them.
I've seen some suggestions based on similar requirements for using a "Shared Primary Key", but I don't see the advantages over just foreign keys.
If I've understood your comments above, the advantage is that only the first design implements the requirement that each row in the parent match a row in one child, and in only one child. Here's one way to do that.
create table analytical_methods (
method_id integer primary key,
method_name varchar(25) not null unique
);
insert into analytical_methods values
(1, 'gc'),(2, 'spec'), (3, 'Atomic Absorption'), (4, 'pH probe');
create table analysis (
analysis_id integer primary key,
sample_id integer not null, --references samples, not shown
method_id integer not null references analytical_methods (method_id),
analysis_timestamp timestamp not null,
analysis_notes varchar(255),
-- This unique constraint lets the pair of columns be the target of
-- foreign key constraints from other tables.
unique (analysis_id, method_id)
);
-- The combination of a) the default value and the check() constraint on
-- method_id, and b) the foreign key constraint on the paired columns
-- analysis_id and method_id guarantee that rows in this table match a
-- gc row in the analysis table.
--
-- If all the child tables have similar constraints, a row in analysis
-- can match a row in one and only one child table.
create table gc (
analysis_id integer primary key,
method_id integer not null
default 1
check (method_id = 1),
foreign key (analysis_id, method_id)
references analysis (analysis_id, method_id),
gc_concentration_meoh integer not null,
gc_concentration_benzene integer not null
);
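For completeness, the matching spectrophotometer child would follow the same pattern (a sketch; method_id 2 matches the rows inserted above, and the measurement column types are guesses):
create table spectrophotometer (
analysis_id integer primary key,
method_id integer not null
default 2
check (method_id = 2),
foreign key (analysis_id, method_id)
references analysis (analysis_id, method_id),
spectro_nm integer not null, -- wavelength used
spectro_abs numeric not null -- absorbance measured
);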
It looks like in my case this supertype/subtype model is not the best choice. Instead, I should move the fields from the analysis table into all the child tables and make a series of simple foreign key relationships. The advantage of the supertype/subtype model comes when using the primary key of the supertype as a foreign key in another table. Since I am not doing that, the extra layer of complexity will not add anything.
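A sketch of that simpler layout, with the shared analysis fields folded into the child (spectrophotometer would mirror it with its own measurement columns; the method column is dropped because the table itself identifies the method):
create table gc (
gc_id integer primary key,
sample_id integer not null, -- references sample, not shown
analysis_timestamp timestamp not null,
analysis_notes varchar(255),
gc_concentration_meoh integer not null,
gc_concentration_benzene integer not null
);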