My googling powers were not strong enough for this one. This is a purely theoretical question.
Let's say I have a huge database with hundreds of tables, and each table has a user column that references the user table.
Now if I would change the user column to have a foreign key constraint, would the increase in database size be noticeable?
If by "change the user column to have a foreign key constraint" you mean something like:
alter table some_table
add constraint fk_some_table_users
foreign key (user_id)
references users (id);
Then the answer is: no, this will in no way change the size of your database (except some additional rows in the system catalogs to store the definition of your constraint).
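If you want to convince yourself, you can compare the table size before and after adding the constraint; a quick sketch, using the placeholder names from above:

select pg_size_pretty(pg_total_relation_size('some_table'));  -- before

alter table some_table
add constraint fk_some_table_users
foreign key (user_id)
references users (id);

select pg_size_pretty(pg_total_relation_size('some_table'));  -- after: unchanged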
The constraints will improve the reliability of your data, and in some cases they might even help the optimizer to remove unnecessary joins or take other shortcuts based on the constraint information. There is, however, a small performance overhead when inserting or deleting rows, because the constraint needs to be verified. But this small overhead does not outweigh the advantages that you gain from having consistent data in your database.
I have never seen an application which claimed to be able to "have that under control" that didn't need data cleaning after having been in production for some time. So best leave this kind of check to the database.
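For what it's worth, the typical clean-up query you end up writing when this check was left to the application looks something like this (same placeholder names as above):

-- find rows whose user_id no longer points at an existing user
select t.*
from some_table t
left join users u on u.id = t.user_id
where t.user_id is not null
  and u.id is null;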
Context
PostgreSQL Version: 13.7
Hosting: Google Cloud SQL
Situation
I have 2 partitioned tables where table A has a foreign key to table B. By the nature of how table partitioning works, the foreign keys on table A's partitions all point to the parent table B. In practice, I have one tenant id per partition, so it would be optimal for a partition of table A to have a direct foreign key to the partition for the same tenant on table B, but that is understandably not facilitated by table partitioning.
When I attach an empty table as a partition to table B, the operation is really quick. When I attach a partition to table A, it takes forever. Upon investigating, I found that the operation creates thousands of locks across all of the partitions of table B, due to the foreign key referencing the parent table B. I believe this happens because, when the table becomes a partition of table A, it inherits the foreign key to table B, and a check runs across all of table B's partitions (even though the new partition is empty). When I remove the foreign key, the operation is very fast.
Problem
I assume that disabling the foreign key constraints would make the attach operation quick. Unfortunately, since I am using Google Cloud SQL, I can't disable the foreign keys, as they don't grant me superuser privileges. One option I can see is to remove the foreign key from the parent table and add it back after the operation (sketched below). I don't love this option for many reasons (maintainability, data integrity, sanity, ...). There is also the option of simply not using foreign keys, which I also don't love.
This isn't just an issue of speed. The mass locking creates contention in the db that prevents other ambient operations that a user might be performing at the time of the partition attaching.
Question
Are there any other options that I can explore to reduce the amount of locking that is occurring when attaching a partition to a table that has a foreign key to a partitioned table?
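For reference, the drop-and-re-add workaround mentioned above would look roughly like this (all object names are made up):

begin;

-- drop the inherited foreign key from the parent table
alter table table_a drop constraint table_a_tenant_fkey;

-- now the attach is fast, since no check runs against table_b's partitions
alter table table_a attach partition table_a_tenant_42
  for values in (42);

-- re-adding the constraint re-checks every row of table_a against table_b,
-- which is exactly the kind of heavy operation I want to avoid
alter table table_a
  add constraint table_a_tenant_fkey
  foreign key (tenant_id, b_id)
  references table_b (tenant_id, id);

commit;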
I'm performing schema changes on a large database, correcting ancient design mistakes (expanding primary keys and their corresponding foreign keys from INTEGER to BIGINT). The basic process is:
Shut down our application.
Drop DB triggers and constraints.
Perform the changes (ALTER TABLE foo ALTER COLUMN bar TYPE BIGINT for each table and primary/foreign key).
Recreate the triggers and constraints (NOT VALID).
Restart the application.
Validate the constraints (ALTER TABLE foo VALIDATE CONSTRAINT bar for each constraint; steps 3, 4 and 6 are sketched below).
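For a single table and constraint, steps 3, 4 and 6 boil down to something like the following (parent, child and the constraint name are placeholders):

-- step 3: widen the key columns
alter table parent alter column id type bigint;
alter table child alter column parent_id type bigint;

-- step 4: re-create the foreign key without checking existing rows
alter table child
  add constraint child_parent_id_fkey
  foreign key (parent_id)
  references parent (id)
  not valid;

-- step 6: validate once the application is back up
alter table child validate constraint child_parent_id_fkey;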
Note:
Our Postgres DB (version 11.7) and our application are hosted on Heroku.
Some of our tables are quite large (millions of rows, the largest being ~1.2B rows).
The problem is in the final validation step. When conditions are just "right", a single ALTER TABLE foo VALIDATE CONSTRAINT bar can create database writes at a pace that exceeds the WAL's write capacity. This leads to varying degrees of unhappiness up to crashing the DB server. (My understanding is that Heroku uses a bespoke WAL plug-in to implement their "continuous backups" and "db follower" features. I've tried contacting Heroku support on this issue -- their response was less than helpful, even though we're on an enterprise-level support contract).
My question: Is there any downside to leaving these constraints in the NOT VALID state?
Related: Does anyone know why validating a constraint generates so much write activity?
There are downsides to leaving a constraint as NOT VALID. Firstly, you may have data that doesn't meet the constraint requirements, meaning you have data that shouldn't be in your table. But also, the query planner won't be able to use the constraint predicate to rule out rows based on whether they satisfy the constraint.
As for all the WAL activity, I can only imagine that it's because it has to set a flag on those rows to mark them as valid. This should generate relatively little WAL compared to actual row updates, but I guess if you have enough rows being validated, it adds up. It usually shouldn't cause a crash unless storage becomes full.
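If you want to quantify it, here is a rough sketch (pg_current_wal_lsn() exists since PostgreSQL 10; foo and bar are the placeholders from the question):

select pg_current_wal_lsn();  -- note the returned LSN

alter table foo validate constraint bar;

-- subtracting two LSNs yields the number of bytes of WAL written in between
select pg_size_pretty(pg_current_wal_lsn() - 'LSN_NOTED_ABOVE'::pg_lsn);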
The IBM DB2 documentation says:
To improve the performance of queries, you can add informational
constraints to your tables.
And there is this NOT ENFORCED option we can provide:
ALTER TABLE <name> <constraint attributes> NOT ENFORCED
The explanation is fairly simple:
NOT ENFORCED should only be specified if the table data is
independently known to conform to the constraint. Query results might
be unpredictable if the data does not actually conform to the
constraint.
From what I understood, if I have, let's say, a foreign key in a table declared as NOT ENFORCED, that's absolutely the same as not having it at all.
But then what are the real use cases for it, and when should this option be used?
(What is the difference between having a NOT ENFORCED constraint and not having it at all?)
The so-called Information Constraints can be used to improve performance. This is done by adding insights to the database: without the informational constraint, Db2 would not know about the relationship between the two tables and the related columns. With it, the SQL query compiler and optimizer can exploit that fact and optimize query execution.
As a consequence, an informational constraint should only be applied when the data is indeed constrained in the specified ways. Db2 does not enforce it; the user (you) is guaranteeing that data property. Hence, when it is not true, query results could be wrong, because Db2 assumes that the relationships are present.
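For example, a foreign key declared as informational in Db2 looks like this (hypothetical table and column names):

ALTER TABLE orders
  ADD CONSTRAINT fk_orders_customers
  FOREIGN KEY (customer_id)
  REFERENCES customers (id)
  NOT ENFORCED
  ENABLE QUERY OPTIMIZATION;

Db2 will never check this constraint on INSERT, UPDATE or DELETE, but the optimizer may exploit it, for example to eliminate a join against customers when no column of customers appears in the query output.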
I noticed in Postgres that when we create a table, it seems to automatically create a btree index on the PRIMARY KEY CONSTRAINT. Looking at the properties of the CONSTRAINT, it would appear it is not clustered. How do I cluster it, and should I cluster it?
You have to use the CLUSTER command:
CLUSTER stone.unitloaddetail USING pk10;
Remember that this rewrites the table and blocks it for others during that time.
Also, clustering is not maintained when table data are modified, so you have to schedule regular CLUSTER runs if you want to keep the table clustered.
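CLUSTER ... USING also records pk10 as the table's clustering index, so the scheduled runs only need the short form:

CLUSTER stone.unitloaddetail;

-- the clustering index can also be set or changed explicitly:
ALTER TABLE stone.unitloaddetail CLUSTER ON pk10;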
Addressing the "should you" part, it depends on the likelihood of queries needing to access multiple rows having adjacent values of the clustering key.
For a table with a synthetic primary key, it probably makes more sense to cluster on a foreign key column.
Imagine that you have a table of products. Are you more likely to request multiple products having:
consecutive product_id?
the same location_id?
the same type_id?
the same manufacturer_id?
If it would solve a problem for you to improve the performance of the system for one of these particular cases, then that is the column by which you should consider clustering.
If doing so would not solve a problem, then do not do it.
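For example, if queries typically fetch all products of one manufacturer, clustering by that foreign key would look like this (hypothetical names):

create index products_manufacturer_idx
  on products (manufacturer_id);

cluster products using products_manufacturer_idx;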
Instead of having a composite primary key (the table maintains the relationship between two tables, each representing an entity), the proposed design is to have an identity column as the primary key, with a unique constraint enforced over the two columns that hold the primary keys of the related entities.
To me, having an identity column on each relationship table breaks the normalisation rules.
What is the industry standards?
What are the considerations to make before making the design decision on this?
Which approach is right?
There are lots of tables where you may want to have an identity column as a primary key. However, in the case of an M:M relationship table like the one you describe, best practice is NOT to use a new identity column for the primary key.
RThomas's link in his comment provides excellent reasons why the best practice is NOT to add an identity column. Here's that link.
The cons will outweigh the pros in pretty much every case, but since you asked for pros and cons I put a couple of unlikely pros in as well.
Cons
Adds complexity
Can lead to duplicate relationships unless you enforce uniqueness on the relationship (which a primary key would do by default).
Likely slower: db must maintain two indexes rather than one.
Pros
All the pros are pretty sketchy
If you had a situation where you needed to use the primary key of the relationship table as a join to a separate table (e.g. an audit table?) the join would likely be faster. (As noted though--adding and removing records will likely be slower. Further, if your relationship table is a relationship between tables that themselves use unique IDs, the speed increase from using one identity column in the join vs two will be minimal.)
The application, for simplicity, may assume that every table it works with has a unique ID as its primary key. (That's poor design in the app but you may not have control over it.) You could imagine a scenario where it is better to introduce some extra complexity in the DB than the extra complexity into such an app.
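To make the trade-off concrete, the two designs for a hypothetical M:M table between product and category look like this:

-- design 1: composite primary key (the best practice described above)
create table product_category (
  product_id  int not null references product (id),
  category_id int not null references category (id),
  primary key (product_id, category_id)
);

-- design 2: identity primary key; uniqueness must be enforced separately,
-- so the database maintains two indexes instead of one
create table product_category2 (
  id          int generated always as identity primary key,
  product_id  int not null references product (id),
  category_id int not null references category (id),
  unique (product_id, category_id)
);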
Cons:

Composite primary keys have to be imported into all referencing tables. That means larger indexes, and more code to write (e.g. the joins, the updates). If you are systematic about using composite primary keys, it can become very cumbersome.

You can't update a part of the primary key. E.g. if you use (university_id, student_id) as the primary key in a table of university students, and one student changes university, you have to delete and recreate the record.
Pros:

Composite primary keys allow you to enforce a common kind of constraint in a powerful and seamless way. Suppose you have a table UNIVERSITY, a table STUDENT, a table COURSE, and a table STUDENT_COURSE (which student follows which course). If it is a constraint that you always have to be a student of university A in order to follow a course of university A, then that constraint will be automatically validated if university_id is a part of the composite keys of both STUDENT and COURSE.
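A sketch of that schema (using the table names from the example):

create table university (
  university_id int primary key
);

create table student (
  university_id int not null references university,
  student_id    int,
  primary key (university_id, student_id)
);

create table course (
  university_id int not null references university,
  course_id     int,
  primary key (university_id, course_id)
);

create table student_course (
  university_id int,
  student_id    int,
  course_id     int,
  primary key (university_id, student_id, course_id),
  foreign key (university_id, student_id)
    references student (university_id, student_id),
  foreign key (university_id, course_id)
    references course (university_id, course_id)
);

-- both foreign keys share the university_id column, so a student can only
-- be linked to courses of their own university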
You have to create all the key columns in every table where the key is used as a foreign key. This is the biggest disadvantage.