I'm taking a course about PostgreSQL, coming from a MySQL background, and I stumbled upon the USING table expression. I know it is a shorthand to, well, shorten the ON conditions for JOINs, but I have questions:
https://www.postgresql.org/docs/13/queries-table-expressions.html
Are they actually used?
I think that having, say, a "customerid" PRIMARY KEY on some "customers" table just to be able to use USING is way more inconvenient than just having a plain "id" PRIMARY KEY as I've always done; is that bad practice?
USING clauses are used quite often. It is rather a design choice for the tables in a database. Sometimes customers.id is used in the primary table and sometimes customers.customer_id.
Usually you'll see customer_id as foreign keys in other tables.
If in your queries you plan to do a lot of simple joins on foreign keys versus primary keys, structuring the tables to be able to use the USING clause might be worth it, since it simplifies many queries.
I would say neither of the two options could be considered bad practice.
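To make the trade-off concrete, here is a minimal sketch (table and column names are invented) of how the two naming conventions play out in a join:

    -- With a shared column name, USING works and the key appears only once in the result:
    SELECT c.name, o.total
    FROM customers c
    JOIN orders o USING (customer_id);

    -- With a plain "id" primary key you spell out the ON condition instead:
    SELECT c.name, o.total
    FROM customers c
    JOIN orders o ON o.customer_id = c.id;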
We have a large table in our Postgres production database which we want to start "sharding" using foreign tables and inheritance.
The desired architecture will be to have 1 (empty) table that defines the schema and several foreign tables inheriting from the empty "parent" table. (possible with Postgres 9.5)
I found this well-written article https://www.depesz.com/2015/04/02/waiting-for-9-5-allow-foreign-tables-to-participate-in-inheritance/ that explains everything about how to do it from scratch.
My question is how to reduce the needed migration of data to a minimum.
We have this 100+ GB table now that should become our first "shard", and in the future we will regularly add new "shards". At some point, the older shards will be moved to another tablespace (on cheaper hardware, since they become less important).
My question now:
Is there a way to "ALTER" an existing table to be a foreign table instead?
There is no way to use ALTER TABLE to do this.
You really have to do it manually. This is no different (really) from doing table partitioning: you create your partitions, you load the data, and you direct reads and writes to the partitions.
Now in your case, in terms of doing sharding there are a number of tools I would look at to make this less painful. First, if you make sure your tables are split the way you like them first, you can use a logical replication solution like Bucardo to replicate the writes while you are moving everything over.
There are some other approaches (parallelized readers and writers) that may save you some time at the expense of db load, but those are niche tools.
There is no native solution for shard management of standard PostgreSQL (and I don't know enough about Postgres-XL in this regard to know how well it can manage changing shard criteria). However pretty much anything is possible with a little work and knowledge.
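For illustration, here is a minimal sketch of the target layout from the depesz article, with invented server and table names (PostgreSQL 9.5+; a matching CREATE USER MAPPING for the foreign server is also required):

    -- Empty parent table that only defines the schema:
    CREATE TABLE measurements (
        id        bigint NOT NULL,
        logged_at timestamptz NOT NULL,
        payload   text
    );

    -- A remote "shard" attached via postgres_fdw and inheritance:
    CREATE EXTENSION IF NOT EXISTS postgres_fdw;

    CREATE SERVER shard_2015 FOREIGN DATA WRAPPER postgres_fdw
        OPTIONS (host 'shard1.example.com', dbname 'metrics');

    CREATE FOREIGN TABLE measurements_2015 ()
        INHERITS (measurements)
        SERVER shard_2015
        OPTIONS (table_name 'measurements_2015');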
I have a task to implement a 'rollback' (not the usual rollback) function for a batch of entries from different tables. For example:
def rollback(cursor, entries):
    # entries is a dict of the form:
    # {'table_name1': [id1, id2, ...], 'table_name2': [id1, id2, ...], ...}
I need to delete the entries in each table_name, but because these entries may have relationships between them, it's a bit complex. My idea is in several steps:
1. Find out all columns from all tables that are nullable.
2. Update all those entries, setting every nullable column to NULL. After this step there should be no circular dependencies left (if there still are, I think the rows couldn't have been inserted in the first place).
3. Find out their dependencies and do a topological sort.
4. Delete them one by one.
My questions are:
Does the idea make sense?
Has anyone done something similar before? And how?
How do I query the system catalogs for step 3? I'm quite new to PostgreSQL.
Any ideas and suggestions would be appreciated.
(1) and (2) are not right. It's quite likely that there will be columns defined NOT NULL REFERENCES othertable(othercol) - there are in any normal schema.
What I think you need to do is to sort the foreign key dependency graph to find an ordering that allows you to DELETE, table-by-table, the data you need to remove. Be aware that circular dependencies are possible due to deferred foreign key constraints, so you need to demote/ignore DEFERRABLE INITIALLY DEFERRED constraints; you can temporarily violate those so long as it's all consistent again at COMMIT time.
Even then you might run into issues. What if a client used SET CONSTRAINTS to make a DEFERRABLE INITIALLY IMMEDIATE constraint DEFERRED during a transaction? You'd then fail to cope with the circular dependency. To handle this your code must [SET CONSTRAINTS ALL DEFERRED] before proceeding.
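As a sketch of that last point, the deleting transaction would look something like this:

    BEGIN;
    -- Demote every deferrable constraint so circular FK references
    -- only have to be consistent again at COMMIT time:
    SET CONSTRAINTS ALL DEFERRED;
    -- ... DELETEs issued in topologically sorted table order ...
    COMMIT;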
You will need to look at the information_schema or the PostgreSQL-specific system catalogs to work out the dependencies. It might be worth a look at the pg_dump source code too, since it tries to order dumped tables to avoid dependency conflicts. You'll be particularly interested in the pg_constraint catalog, or its information_schema equivalents information_schema.referential_constraints, information_schema.constraint_table_usage and information_schema.constraint_column_usage.
You can use either the information_schema or pg_catalog. Don't use both. information_schema is SQL-standard and more portable, but it can be slow to query and doesn't have all the information pg_catalog contains. On the flip side, pg_catalog's schema isn't guaranteed to remain compatible across major versions (like 9.1 to 9.2) - though it generally does - and its use isn't portable.
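As a starting point (a sketch, not a complete dependency resolver), the foreign-key edges for the topological sort can be pulled from pg_constraint like this:

    -- One row per foreign key: which table references which,
    -- plus the deferrability flags discussed above.
    SELECT con.conrelid::regclass  AS referencing_table,
           con.confrelid::regclass AS referenced_table,
           con.conname             AS constraint_name,
           con.condeferrable,
           con.condeferred
    FROM pg_constraint con
    WHERE con.contype = 'f';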
Instead of having a composite primary key (this table maintains the relationship between two tables, each representing an entity), the proposed design has an identity column as the primary key, with a unique constraint enforced over the two columns that carry the primary keys of the related entities.
To me, having an identity column on each relationship table breaks the normalisation rules.
What is the industry standard?
What are the considerations to make before making the design decision on this?
Which approach is right?
There are lots of tables where you may want to have an identity column as a primary key. However, in the case of a M:M relationship table you describe, best practice is NOT to use a new identity column for the primary key.
RThomas's link in his comment provides excellent reasons why the best practice is NOT to add an identity column. Here's that link.
The cons will outweigh the pros in pretty much every case, but since you asked for pros and cons I put a couple of unlikely pros in as well.
Cons
Adds complexity
Can lead to duplicate relationships unless you enforce uniqueness on the relationship (which a primary key would do by default).
Likely slower: db must maintain two indexes rather than one.
Pros
All the pros are pretty sketchy
If you had a situation where you needed to use the primary key of the relationship table as a join to a separate table (e.g. an audit table?) the join would likely be faster. (As noted though--adding and removing records will likely be slower. Further, if your relationship table is a relationship between tables that themselves use unique IDs, the speed increase from using one identity column in the join vs two will be minimal.)
The application, for simplicity, may assume that every table it works with has a unique ID as its primary key. (That's poor design in the app but you may not have control over it.) You could imagine a scenario where it is better to introduce some extra complexity in the DB than the extra complexity into such an app.
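To make the two designs concrete, here is a hedged sketch of the two alternatives in PostgreSQL syntax (all names invented; assumes student and course tables with single-column primary keys):

    -- Composite primary key: one index enforces both identity and uniqueness.
    CREATE TABLE student_course (
        student_id int NOT NULL REFERENCES student,
        course_id  int NOT NULL REFERENCES course,
        PRIMARY KEY (student_id, course_id)
    );

    -- Surrogate identity key: the unique constraint is still needed,
    -- so the database now maintains two indexes instead of one.
    CREATE TABLE student_course (
        id         bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        student_id int NOT NULL REFERENCES student,
        course_id  int NOT NULL REFERENCES course,
        UNIQUE (student_id, course_id)
    );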
Cons:
Composite primary keys have to be imported into all referencing tables. That means larger indexes and more code to write (e.g. the joins, the updates). If you are systematic about using composite primary keys, it can become very cumbersome.
You can't update a part of the primary key. E.g. if you use (university_id, student_id) as the primary key in a table of university students, and one student changes university, you have to delete and recreate the record.
Pros:
Composite primary keys allow you to enforce a common kind of constraint in a powerful and seamless way. Suppose you have a table UNIVERSITY, a table STUDENT, a table COURSE, and a table STUDENT_COURSE (which student follows which course). If it is a constraint that you always have to be a student of university A in order to follow a course of university A, then that constraint will be automatically validated if university_id is part of the composite keys of both STUDENT and COURSE. (See the sketch below.)
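A sketch of that university example (column types invented), showing how the composite foreign keys enforce the constraint for free:

    CREATE TABLE university (
        university_id int PRIMARY KEY
    );
    CREATE TABLE student (
        university_id int REFERENCES university,
        student_id    int,
        PRIMARY KEY (university_id, student_id)
    );
    CREATE TABLE course (
        university_id int REFERENCES university,
        course_id     int,
        PRIMARY KEY (university_id, course_id)
    );
    -- Because university_id appears in both composite foreign keys,
    -- a student can only be enrolled in courses of their own university.
    CREATE TABLE student_course (
        university_id int,
        student_id    int,
        course_id     int,
        PRIMARY KEY (university_id, student_id, course_id),
        FOREIGN KEY (university_id, student_id) REFERENCES student,
        FOREIGN KEY (university_id, course_id)  REFERENCES course
    );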
You have to repeat all the key columns in every table where they are used as a foreign key. This is the biggest disadvantage.
In T-SQL (SQL Server 2008), is it technically correct to INNER JOIN two tables through key-less columns (no relationships)? Thank you.
Yes. I often join a child of a child table, bypassing the "middle" table.
Foreign keys are constraints for data integrity: they have nothing to do with query formulation. You should have them anyway, unless you don't care about data quality and integrity.
You can join on any two fields that have the same data (and they should be the same datatype as well). If the fields are indexed you should not have performance issues either, unless you are joining on two varchar(4000) fields. You may even need to do this when you have a poor database design and one column is serving more than one purpose, especially if you have used an EAV model.
However what you won't get in this scenario is data integrity. Without a foreign key to enforce the rules, you may find that the related table has values that don't have a match to the parent table. This could result in records not seen in queries that should be seen or records which cannot be correctly interpreted. So it is a good practice to set up foreign keys where they are applicable.
It will work, though it might not be very efficient... You should definitely create the foreign keys if you can.
Technically it will work, no problems there. However, sometimes the query plan generator will use FKs to help make better use of indexes. But from a design standpoint it's not such a great idea: you should be using FKs as much as possible, especially if you want to go the ORM route.
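For illustration (a T-SQL sketch with invented names): the join runs fine without any declared relationship, and the foreign key can be added afterwards purely for integrity:

    -- Joining on columns with no declared relationship; all that matters
    -- is that the data (and ideally the datatypes) match:
    SELECT o.OrderID, c.CustomerName
    FROM dbo.Orders AS o
    INNER JOIN dbo.Customers AS c
        ON c.CustomerCode = o.CustomerCode;

    -- Adding the constraint later, as the answers recommend
    -- (CustomerCode must be unique in dbo.Customers for this to be allowed):
    ALTER TABLE dbo.Orders
        ADD CONSTRAINT FK_Orders_Customers
        FOREIGN KEY (CustomerCode) REFERENCES dbo.Customers (CustomerCode);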
Caveats:
Let me first clarify that this is not a question about whether to use surrogate primary keys or not.
Also, this is NOT related to identities (SQL Server) / sequences (Oracle) and their pros / cons. I did get a fair bit of an idea about that thanks to this, this and this.
Question:
I come from a SQL Server background and have always used identity columns as surrogate primary keys for most tables.
Based on my knowledge of Oracle, I find that the nearest equivalent in Oracle is the SEQUENCE, which can be used to simulate something similar to Identity in SQL Server.
As I am new to Oracle and my database has 100+ tables, the main things I am concerned about are:
Considering I have to create a sequence for each table in Oracle (almost), would this be the standard accepted implementation for simulating Identity, or is there a better / easier way to achieve this kind of implementation in Oracle?
Are there any specific GOTCHA's related to having so many sequences in Oracle?
The system supports both Oracle 10G and 11G
Considering I have to create a sequence for each table in Oracle (almost), would this be the standard accepted implementation for simulating Identity or is there a better / easier way to achieve this kind of implementation in Oracle?
Yes, it is very typical in Oracle to create a sequence for each table. It is possible to use the same sequence for several tables, but you run the risk of making it a bottleneck by using a single sequence for many/all tables: see this AskTom q/a
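For illustration, the usual 10g/11g pattern (names invented) is one sequence per table, often paired with a BEFORE INSERT trigger to mimic SQL Server's identity behaviour:

    CREATE SEQUENCE customers_seq;

    CREATE OR REPLACE TRIGGER customers_bi
    BEFORE INSERT ON customers
    FOR EACH ROW
    WHEN (new.customer_id IS NULL)
    BEGIN
        -- On 11g you can assign :new.customer_id := customers_seq.NEXTVAL directly;
        -- the SELECT ... FROM dual form below also works on 10g.
        SELECT customers_seq.NEXTVAL INTO :new.customer_id FROM dual;
    END;
    /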
Are there any specific GOTCHA's related to having so many sequences in Oracle?
None that I can think of.
100+ tables is not very many. I routinely work on databases with several hundred sequences, one for each table. The more the merrier.
It's even conceivable to have more sequences than tables - unlike identity columns in other DBMSs, sequences can be used for more than just creating surrogate key values.
An alternative is to use GUIDs - in Oracle you can call SYS_GUID to generate unique values.
A good article, followed by comments with pros and cons for both approaches: http://rwijk.blogspot.com/2009/12/sysguid.html
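A minimal sketch of the SYS_GUID variant (table and column names invented):

    CREATE TABLE events (
        event_id RAW(16) DEFAULT SYS_GUID() PRIMARY KEY,
        payload  VARCHAR2(4000)
    );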