I have a task to implement a 'rollback' (not the usual database rollback) function for a batch of entries from different tables. For example:
def rollback(cursor, entries):
# entries is a dict of this form:
# {'table_name1': [id1, id2, ...], 'table_name2': [id1, id2, ...], ...}
I need to delete the entries in each table_name, but because these entries may have relationships between them it's a bit complex. My idea is in several steps:
1. Find all columns in all tables that are nullable.
2. Update the entries, setting every nullable column to NULL. After this step there should be no circular dependencies left (if there still are, I think the rows could not have been inserted in the first place).
3. Find the tables' dependencies and make a topological sort.
4. Delete the entries table by table in that order.
My questions are:
Does the idea make sense?
Has anyone done something similar before? And how?
How do I query the meta tables for step 3? I'm quite new to PostgreSQL.
Any ideas and suggestions would be appreciated.
Steps (1) and (2) are not right. It's quite likely that there will be columns defined NOT NULL REFERENCES othertable(othercol) - there are in any normal schema.
What I think you need to do is to sort the foreign key dependency graph to find an ordering that allows you to DELETE, table-by-table, the data you need to remove. Be aware that circular dependencies are possible due to deferred foreign key constraints, so you need to demote/ignore DEFERRABLE INITIALLY DEFERRED constraints; you can temporarily violate those so long as it's all consistent again at COMMIT time.
Even then you might run into issues. What if a client used SET CONSTRAINTS to make a DEFERRABLE INITIALLY IMMEDIATE constraint DEFERRED during a transaction? You'd then fail to cope with the circular dependency. To handle this, your code must issue SET CONSTRAINTS ALL DEFERRED before proceeding.
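A minimal sketch of that flow (the table names and ids are hypothetical; the actual DELETE order would come from your topological sort):

BEGIN;
-- Deferrable constraints may now be violated until COMMIT.
SET CONSTRAINTS ALL DEFERRED;

-- Delete in the reverse-dependency order produced by the sort.
DELETE FROM child_table  WHERE id IN (1, 2);
DELETE FROM parent_table WHERE id IN (3, 4);

COMMIT;  -- all deferred constraints are checked here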
You will need to look at the information_schema or the PostgreSQL-specific system catalogs to work out the dependencies. It might be worth a look at the pg_dump source code too, since it tries to order dumped tables to avoid dependency conflicts. You'll be particularly interested in the pg_constraint catalog, or its information_schema equivalents information_schema.referential_constraints, information_schema.constraint_table_usage and information_schema.constraint_column_usage.
You can use either the information_schema or pg_catalog. Don't use both. information_schema is SQL-standard and more portable, but it can be slow to query and doesn't have all the information pg_catalog contains. On the flip side, pg_catalog's schema isn't guaranteed to remain compatible across major versions (like 9.1 to 9.2) - though it generally does - and its use isn't portable.
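For example, a query along these lines against pg_constraint lists every foreign key with its referencing and referenced tables, which gives you the edges of the dependency graph (a sketch; adapt to your needs):

SELECT con.conname             AS constraint_name,
       con.conrelid::regclass  AS referencing_table,
       con.confrelid::regclass AS referenced_table,
       con.condeferrable       AS is_deferrable
FROM pg_catalog.pg_constraint con
WHERE con.contype = 'f';  -- 'f' = foreign key constraints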
I'm taking a course on PostgreSQL, coming from a MySQL background, and I stumbled upon the USING clause in table expressions. I know it is a shorthand to, well, shorten the ON conditions for JOINs, but I have questions:
https://www.postgresql.org/docs/13/queries-table-expressions.html
Are they actually used?
I think that having, say, a "customerid" PRIMARY KEY on some "customers" table just to be able to use USING is way more inconvenient than the plain "id" PRIMARY KEY I've always used; is it bad practice?
USING clauses are used quite often. It is rather a design choice for the tables in a database. Sometimes customers.id is used in the primary table and sometimes customers.customer_id.
Usually you'll see customer_id as foreign keys in other tables.
If your queries will do a lot of simple joins of foreign keys against primary keys, structuring the tables so you can use the USING clause might be worth it, since it simplifies many queries.
I would say neither of the two options could be considered bad practice.
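For illustration, with a shared customer_id column the two spellings below are equivalent (hypothetical customers/orders tables):

-- With ON, both join columns remain visible and must be qualified.
SELECT c.customer_id, c.name, o.total
FROM customers c
JOIN orders o ON o.customer_id = c.customer_id;

-- With USING, the shared column name is written once and appears
-- only once in the result.
SELECT customer_id, c.name, o.total
FROM customers c
JOIN orders o USING (customer_id);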
NOTE: I have never done this before:
What are some steps or documentation to help normalize tables/views in a database? Currently, there are several tables and views in the database that do not use the primary/foreign key concept and repeat the same information across multiple tables.
I'd like to clean this up and also set up a process that keeps relationships updated. For example, if a person's zipcode changes or a record is removed, the related rows in other tables should be updated automatically.
NOTE: My question is about normalizing existing database tables. The tables are live, so how do I approach normalization? Do I create a brand-new database with the table structure I want and then move the data to that database? Once the data is moved, do I plug in stored procedures and imports?
This question is somewhat broad, so I will only explain the concept.
Views are generally used for reporting/data presentation purposes and therefore I would not try to normalise them. Your case may be different.
You also need to be clear about primary / foreign key concept:
Lack of actual constraints (e.g. PRIMARY KEY, FOREIGN KEY) defined on the table does not mean that the tables do not have logical relationships on columns.
Data maintenance can be implemented in Triggers.
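For a case like the zipcode example, declarative referential actions can cover much of this before you reach for triggers; a minimal sketch with hypothetical persons/orders tables:

CREATE TABLE persons (
    person_id integer PRIMARY KEY,
    zipcode   text NOT NULL
);

CREATE TABLE orders (
    order_id  integer PRIMARY KEY,
    person_id integer NOT NULL
        REFERENCES persons (person_id)
        ON UPDATE CASCADE   -- key changes propagate automatically
        ON DELETE CASCADE   -- deleting a person removes their orders
);

Once the zipcode lives only on persons, changing it requires no propagation at all; that is precisely what normalisation buys you.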
If you really have a situation where a lot of highly de-normalised data exists in tables for no apparent reason and you want to normalise it then this problem can be approached in two ways:
Full re-write - I would recommend for small / new Apps
"Gradual" re-factoring - large / mature applications, where underlying data relationships are complex and / or may not be fully understood.
Within "Gradual" re-factoring there are a few ways as well:
2.a. You take one old table and replace it with a new table, changing at the same time all code that uses the old table to use the new table. For large systems this can be problematic, as you simply may not be aware of all the places that reference the table. On the other hand, it may be useful when the table structure change is not significant and/or the number of dependencies is small.
2.b. Another way is to create new table(s) (in the same database) in the shape/form you desire. The current tables are then replaced with views that return data identical to the old tables but sourced from the new tables. This approach removes or minimises the need to modify all dependencies immediately. The drawback is that the view replacing an old table can become rather complex, especially if INSTEAD OF triggers need to be implemented on it.
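A minimal sketch of option 2.b (the denormalised customer_info table and its normalised replacements are hypothetical):

-- Move the old table out of the way and replace it with a view of
-- the same name, sourced from the new normalised tables.
ALTER TABLE customer_info RENAME TO customer_info_old;

CREATE VIEW customer_info AS
SELECT c.customer_id,
       c.name,
       z.zipcode,
       z.city
FROM customers c
JOIN zipcodes z ON z.zip_id = c.zip_id;

Existing queries against customer_info keep working; writes would need INSTEAD OF triggers on the view.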
We have a large table in our Postgres production database which we want to start "sharding" using foreign tables and inheritance.
The desired architecture will be to have 1 (empty) table that defines the schema and several foreign tables inheriting from the empty "parent" table. (possible with Postgres 9.5)
I found this well written article https://www.depesz.com/2015/04/02/waiting-for-9-5-allow-foreign-tables-to-participate-in-inheritance/ that explains everything on how to do it from scratch.
My question is how to reduce the needed migration of data to a minimum.
We have this 100+ GB table now, which should become our first "shard", and in the future we will regularly add new "shards". At some point, the older shards will be moved to another tablespace (on cheaper hardware, since they become less important).
My question now:
Is there a way to "ALTER" an existing table to be a foreign table instead?
There is no way to use ALTER TABLE to do this.
You really have to do it manually. This is no different, really, from doing table partitioning: you create your partitions, you load the data, and you direct reads and writes to the partitions.
Now in your case, in terms of doing sharding, there are a number of tools I would look at to make this less painful. If you make sure your tables are split the way you like them before the move, you can use a logical replication solution like Bucardo to replicate the writes while you are moving everything over.
There are some other approaches (parallelized readers and writers) that may save you some time at the expense of db load, but those are niche tools.
There is no native solution for shard management of standard PostgreSQL (and I don't know enough about Postgres-XL in this regard to know how well it can manage changing shard criteria). However pretty much anything is possible with a little work and knowledge.
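A minimal sketch of the parent/shard layout described in the question (server and table names are hypothetical; requires PostgreSQL 9.5+ and a postgres_fdw server with a user mapping already set up):

-- Empty parent table that only defines the schema.
CREATE TABLE measurements (
    id       bigint NOT NULL,
    taken_at timestamptz NOT NULL,
    value    double precision
);

-- A shard living on another server, attached via inheritance.
CREATE FOREIGN TABLE measurements_2015 ()
    INHERITS (measurements)
    SERVER shard_server
    OPTIONS (table_name 'measurements_2015');

-- Queries against the parent transparently include the shard's rows.
SELECT count(*) FROM measurements;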
I am receiving a CSV of records from outside; when I create or update an entry in PostgreSQL, I need to create a mirror entry that differs only in sign. This could be done at the program level, but I am curious whether it is possible using triggers.
The examples I can find all end with code like
FOR EACH ROW EXECUTE PROCEDURE foo()
and they usually deal with checks, add additional info using NEW.additionalfield, or insert into another table. If I use a trigger this way to insert another row into the same table, it seems the trigger will fire again and the creation becomes recursive.
Any way to work this out?
When dealing with triggers, the rules of thumb are:
If it changes the current row, based on some business rules or other (e.g. adding extra info or processing calculated fields), it belongs in a BEFORE trigger.
If it has side effects on one or more rows in separate tables, it belongs in an AFTER trigger.
If it runs integrity checks on any table that no other built-in constraints (checks, unique keys, foreign keys, exclude, etc.) can take care of, it belongs in a CONSTRAINT [after] trigger.
If it has side effects on one or more other rows within the same table, you should probably revisit your schema, your code flow, or both.
Regarding that last point, there actually are workarounds in Postgres, such as trying to take a lock or checking xmin against the transaction's xid, to avoid getting bogged down in recursive scenarios. A more recent version (9.2) additionally introduced pg_trigger_depth(). But I'd still advise against it.
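If you go down that road anyway, a minimal sketch of such a guard using pg_trigger_depth() (the entries table and its columns are hypothetical):

-- The mirror row is inserted only at depth 1; the insert performed
-- by the trigger itself fires the trigger at depth 2 and is skipped.
CREATE FUNCTION mirror_entry() RETURNS trigger AS $$
BEGIN
    IF pg_trigger_depth() = 1 THEN
        INSERT INTO entries (account, amount)
        VALUES (NEW.account, -NEW.amount);
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER entries_mirror
    AFTER INSERT ON entries
    FOR EACH ROW EXECUTE PROCEDURE mirror_entry();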
Note that a constraint trigger can be created as deferrable initially deferred. This will delay the constraint trigger until the very end of the transaction, rather than immediately after the statement.
Your question and nickname hint that you're wondering how to automatically balance a set of lines in a double-entry book-keeping application. Assuming so, do NOT create the balancing entry automatically. Instead, begin a transaction, enter each line separately, and have a (for each row, deferrable initially deferred) constraint trigger pick things up from there and reject the entire batch if anything is unbalanced. Proceeding that way will spare you a mountain of headaches when you want to balance more than two or three lines with each other.
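A minimal sketch of such a trigger (the journal_lines table and its columns are hypothetical):

-- Runs once per row at COMMIT time and rejects the transaction if
-- the batch the row belongs to does not sum to zero.
CREATE FUNCTION check_balanced() RETURNS trigger AS $$
BEGIN
    IF (SELECT sum(amount) FROM journal_lines
        WHERE journal_id = NEW.journal_id) <> 0
    THEN
        RAISE EXCEPTION 'journal % is unbalanced', NEW.journal_id;
    END IF;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE CONSTRAINT TRIGGER journal_balanced
    AFTER INSERT OR UPDATE ON journal_lines
    DEFERRABLE INITIALLY DEFERRED
    FOR EACH ROW EXECUTE PROCEDURE check_balanced();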
Another reading might be that you want to create an audit trail. If so, create additional audit tables and use after triggers to populate them. There are multiple ways to create and manage these audit tables. Look into slowly changing dimensions. (Fwiw, type 6 with a start_end column of type tsrange or tstzrange works well for the audit tables if you're interested in a table's full history including its history of relationships with other audit tables.) Use the "live" tables for your application to keep things fast, and use the audit-tables when you need historical reporting.
In T-SQL (SQL Server 2008), is it technically correct to INNER JOIN two tables through key-less columns (no relationships)? Thank you.
Yes. I often join a child of a child table, bypassing the "middle" table.
Foreign keys are constraints for data integrity: they have nothing to do with query formulation. You should have them anyway, unless you don't care about data quality and integrity.
You can join on any two fields that have the same data (and they should be the same datatype as well). If the fields are indexed you should not have performance issues either, unless you are joining on two varchar(4000) fields. You may even need to do this when you have a poor database design and one column serves more than one purpose, especially if you have used an EAV model.
However what you won't get in this scenario is data integrity. Without a foreign key to enforce the rules, you may find that the related table has values that don't have a match to the parent table. This could result in records not seen in queries that should be seen or records which cannot be correctly interpreted. So it is a good practice to set up foreign keys where they are applicable.
It will work, though it might not be very efficient... You should definitely create the foreign keys if you can.
Technically it will work - no problems there. However, sometimes the query plan generator will use FKs to help make better use of indexes. But from a design standpoint it's not such a great idea. You should be using FKs as much as possible, especially if you want to go the ORM route.
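For illustration, a join on columns with no declared relationship, followed by the foreign key the answers above recommend (hypothetical orders/customers tables; customer_code must be a primary key or unique column on customers for the constraint to be accepted):

-- The join itself works on any type-compatible columns.
SELECT o.order_id, c.customer_name
FROM orders AS o
INNER JOIN customers AS c
    ON o.customer_code = c.customer_code;

-- Adding the constraint afterwards enforces the integrity.
ALTER TABLE orders
    ADD CONSTRAINT fk_orders_customers
    FOREIGN KEY (customer_code)
    REFERENCES customers (customer_code);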