How can a relational database with foreign key constraints ingest data that may be in the wrong order?

The database is ingesting data from a stream, and all the rows needed to satisfy a foreign key constraint may be late or never arrive.
This can likely be accomplished by staging the data in another datastore, one without foreign key constraints, and loading it into the database with fk constraints once all the needed rows are available. However, this adds complexity and I'd like to avoid it.
We're working on a solution that creates "placeholder" rows to point the foreign key to. When the real data comes in, the placeholder is replaced with real values. Again, this adds complexity, but it's the best solution we've found so far.
How do people typically solve this problem?
Edit: Some sample data which might help explain the problem:
Let's say we have these tables:
CREATE TABLE "order" (
id INTEGER NOT NULL,
order_number INTEGER,
PRIMARY KEY (id),
UNIQUE (order_number)
);
CREATE TABLE line_item (
id INTEGER NOT NULL,
order_number INTEGER REFERENCES "order"(order_number),
PRIMARY KEY (id)
);
If I insert an order first, not a problem! But let's say I try:
INSERT INTO line_item (order_number) values (123) before order 123 has been inserted. This will of course fail the fk constraint. But the data might arrive in exactly this order, since it is read from a stream that collects it from multiple sources.
Also, to address @philpxy's question: I didn't really find much on this. One thing that was mentioned was deferred constraints, a mechanism that postpones the fk checks until the end of a transaction. I don't think that is possible in my case, however, since these insert statements will be run at random times, whenever the data is received.

You have a business workflow problem, because line items of individual orders are coming in before the orders themselves have come in. One workaround, perhaps not ideal, would be to create a before insert trigger which checks, for every incoming insert to the line_item table, whether that order already exists in the order table. If not, then it will first insert the order record before trying the insert on line_item.
CREATE OR REPLACE FUNCTION "public"."fn_insert_order" () RETURNS trigger AS $$
BEGIN
    -- Create the parent order first if it does not exist yet
    INSERT INTO "order" (order_number)
    SELECT NEW.order_number
    WHERE NOT EXISTS (SELECT 1 FROM "order" WHERE order_number = NEW.order_number);
    RETURN NEW;
END;
$$
LANGUAGE plpgsql;
-- trigger
CREATE TRIGGER "trigger_insert_order"
BEFORE INSERT ON line_item FOR EACH ROW
EXECUTE PROCEDURE fn_insert_order();
Note: I am assuming that the id column of the order table in fact is auto increment, in which case Postgres would automatically assign a value to it when inserting as above. Most likely, this is what you want, as having two id columns which both need to be manually assigned does not make much sense.

You could accomplish that with a BEFORE INSERT trigger on line_item.
In that trigger you check whether a matching row exists in order, and if not, you insert a dummy row.
That will allow the INSERT to succeed, at the cost of some performance.
To insert rows into order, use
INSERT INTO "order" ...
ON CONFLICT (order_number) DO UPDATE SET
id = EXCLUDED.id;
Updating a primary key is problematic and may lead to conflicts. One way you could get around that is if you use negative ids for artificially generated orders (assuming that the real ids are positive). If you have any references to that primary key, you'd have to define the constraint with ON UPDATE CASCADE.
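To make the placeholder idea concrete, here is a minimal sketch of the flow. It is not the answerer's exact method: it uses Python's built-in sqlite3 module as a stand-in for PostgreSQL so it runs without a server, the table is called orders to avoid quoting the reserved word, and the customer and is_placeholder columns as well as both helper functions are made up for illustration. It also leaves the surrogate id alone and only fills in the non-key columns, which sidesteps the primary key update issue mentioned above.

```python
import sqlite3

# SQLite stands in for PostgreSQL; the ON CONFLICT syntax used here is
# accepted by both, but this is a sketch, not a drop-in implementation.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE orders (
        id             INTEGER PRIMARY KEY,        -- surrogate key, auto-assigned
        order_number   INTEGER NOT NULL UNIQUE,
        customer       TEXT,                       -- hypothetical payload, NULL in a placeholder
        is_placeholder INTEGER NOT NULL DEFAULT 0  -- hypothetical marker column
    );
    CREATE TABLE line_item (
        id           INTEGER PRIMARY KEY,
        order_number INTEGER NOT NULL REFERENCES orders(order_number)
    );
""")

def ensure_placeholder(order_number):
    """Insert a dummy order unless one (real or dummy) already exists."""
    conn.execute(
        "INSERT INTO orders (order_number, is_placeholder) VALUES (?, 1) "
        "ON CONFLICT (order_number) DO NOTHING",
        (order_number,),
    )

def upsert_order(order_number, customer):
    """Insert the real order, or overwrite the placeholder's non-key columns."""
    conn.execute(
        "INSERT INTO orders (order_number, customer) VALUES (?, ?) "
        "ON CONFLICT (order_number) DO UPDATE SET "
        "customer = excluded.customer, is_placeholder = 0",
        (order_number, customer),
    )

# The line item arrives first, so a placeholder order is created for it.
ensure_placeholder(123)
conn.execute("INSERT INTO line_item (order_number) VALUES (123)")

# The real order arrives later and replaces the placeholder values.
upsert_order(123, "ACME")
conn.commit()
```

In PostgreSQL the ensure_placeholder step would live in the BEFORE INSERT trigger rather than in application code.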

Related

Is there a way to reserve a range in a postgres sequence?

I'm writing a program that generates large numbers of rows to be inserted into a PostgreSQL database. Due to the presence of multiple indices, this process is getting slower over time. That's why I want to move to using COPY which seems to be significantly faster. The problem is that one of the tables has a foreign key into another, and I do not have the IDs for the foreign key column.
I was thinking that maybe if I could reserve a range in the sequence used for the primary key of the first table, I could do the ID assignment manually but I don't think Postgres natively supports such an operation. Is there a way to achieve this another way?

First off, identify the business key for the parent and child tables from your source data. Create those tables with a unique constraint on the business key. This will not be the surrogate (auto-generated) PK. Now create a staging table with all the necessary columns (except the FK). Since you will most likely be using it across sessions, this is a permanent table, but each load of it is meant to be used only once. With this in place, insert into the parent table, generating the PK from the sequence. Then insert into the child, selecting the FK by joining to the parent on the business key.
insert into parent ( <columns> )
select <columns>
from stage
on conflict ( <business key> ) do nothing;
insert into child ( <columns>, parent_id )
select s.<columns>, a.id
from stage s
join parent a on s.<business key> = a.<business key>
on conflict (parent_id, child_bk) do nothing;
Since the above is rather obscure in the abstract see a concrete example here. There is no need attempting to "reserve a range", just let the pk/fk develop naturally.
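A runnable sketch of the staging approach, under some assumptions: Python's built-in sqlite3 module stands in for PostgreSQL, and the parent_bk and child_bk columns and the load helper are invented for the example. The "WHERE true" is only there because SQLite needs it to parse INSERT ... SELECT ... ON CONFLICT unambiguously; PostgreSQL does not require it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE parent (
        id        INTEGER PRIMARY KEY,     -- surrogate key, generated by the database
        parent_bk TEXT NOT NULL UNIQUE     -- business key (hypothetical column)
    );
    CREATE TABLE child (
        id        INTEGER PRIMARY KEY,
        parent_id INTEGER NOT NULL REFERENCES parent(id),
        child_bk  TEXT NOT NULL,           -- business key (hypothetical column)
        UNIQUE (parent_id, child_bk)
    );
    -- The staging table carries business keys only, no FK column.
    CREATE TABLE stage (parent_bk TEXT NOT NULL, child_bk TEXT NOT NULL);
""")

def load(rows):
    """Bulk-load one batch of (parent_bk, child_bk) pairs through the staging table."""
    with conn:  # one transaction per batch
        conn.executemany("INSERT INTO stage VALUES (?, ?)", rows)
        # Parents first: the database assigns the surrogate ids.
        conn.execute("""
            INSERT INTO parent (parent_bk)
            SELECT DISTINCT parent_bk FROM stage WHERE true
            ON CONFLICT (parent_bk) DO NOTHING""")
        # Children next: the FK is looked up via the business key.
        conn.execute("""
            INSERT INTO child (parent_id, child_bk)
            SELECT p.id, s.child_bk
            FROM stage s JOIN parent p ON p.parent_bk = s.parent_bk
            WHERE true
            ON CONFLICT (parent_id, child_bk) DO NOTHING""")
        conn.execute("DELETE FROM stage")

load([("A", "a1"), ("A", "a2"), ("B", "b1")])
load([("A", "a2"), ("C", "c1")])  # overlaps with the first batch
```

With PostgreSQL, the executemany call into stage is where COPY would go.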

Execute Function Once per Unique Value

I have two tables that contain data related to everyday business:
CREATE TABLE main_table (
main_id serial,
cola text,
colb text,
colc text,
CONSTRAINT main_table_pkey PRIMARY KEY (main_id)
);
CREATE TABLE second_table (
second_id serial,
main_id integer,
cold text,
CONSTRAINT second_table_pkey PRIMARY KEY (second_id),
CONSTRAINT second_table_fkey FOREIGN KEY (main_id)
REFERENCES main_table (main_id) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE NO ACTION
);
We have a need to know when some data was updated in these tables so that exports can be generated and pushed to third parties. I've created a third table to hold the update information:
CREATE TYPE field AS ENUM ('cola', 'colb', 'colc', 'cold');
CREATE TABLE table_updates (
main_id int,
field field,
updated_on date NOT NULL DEFAULT NOW(),
CONSTRAINT table_updates_fkey FOREIGN KEY (main_id)
REFERENCES main_table (main_id) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE NO ACTION
);
main_table has a trigger to update table_updates before UPDATE queries, which satisfies the need to track three of the four column updates.
I can easily add the same type of trigger to second_table, however because main_id is not unique the function can be executed several times for a single main_id value, which is not desirable.
How can I create a function that, when updating several rows in second_table, executes only once per main_id?

How can I create a function that, when updating several rows in second_table, executes only once per main_id?
If your inserts are batched by main_id, i.e. INSERT INTO tbl (main_id...) VALUES (main_id ...),(main_id ...),(main_id ...), you can use the rule system to fire once for the whole INSERT or UPDATE. From the documentation:
For the things that can be implemented by both, which is best depends on the usage of the database. A trigger is fired once for each affected row. A rule modifies the query or generates an additional query. So if many rows are affected in one statement, a rule issuing one extra command is likely to be faster than a trigger that is called for every single row and must re-determine what to do many times. However, the trigger approach is conceptually far simpler than the rule approach, and is easier for novices to get right.
Short of that, you may also want to look into the normal LISTEN and NOTIFY, which give you the ability to use async actions. If that's your thing and you decide to keep the trigger method, consider the Trigger Change Notification module, tcn.
My suggestion is to do this in the app (outside of the DB) if at all possible. Remember in PostgreSQL temp tables are local to the session. So you can have each loader-session do something like this,
BEGIN;
CREATE TEMP TABLE etl_inventory (LIKE foo);  -- foo stands for the real target table
COPY etl_inventory FROM stdin;
-- Are they different? If so, NOTIFY
-- UPSERT
COMMIT;
And then have one daemon that handles the exports and adds to the export queue when it receives the NOTIFY event.

While Evan's answer is correct, I think this question could benefit from an example.
This is the rule definition I used with the example tables in the question:
CREATE OR REPLACE RULE update_update_table
AS ON UPDATE TO second_table
DO ALSO (
INSERT INTO table_updates (
main_id, field
)
SELECT DISTINCT OLD.main_id, 'cold'::field
WHERE NOT EXISTS (
SELECT TRUE
FROM table_updates
WHERE main_id = OLD.main_id
AND field = 'cold'
);
UPDATE table_updates
SET updated_on = NOW()
WHERE main_id = OLD.main_id
AND field = 'cold'
)

Is it possible to catch a foreign key violation in postgres

I'm trying to insert data into a table which has a foreign key constraint. If there is a constraint violation in a row that I'm inserting, I want to chuck that data away.
The issue is that postgres returns an error every time I violate the constraint. Is it possible for me to have some statement in my insert statement like 'ON FOREIGN KEY CONSTRAINT DO NOTHING'?
EDIT:
This is the query that I'm trying to do, where info is a dict:
cursor.execute("INSERT INTO event (case_number_id, date, \
session, location, event_type, worker, result) VALUES \
(%(id_number)s, %(date)s, %(session)s, \
%(location)s, %(event_type)s, %(worker)s, %(result)s) ON CONFLICT DO NOTHING", info)
It errors out when there is a foreign key violation

If you're only inserting a single row at a time, you can create a savepoint before the insert and rollback to it when the insert fails (or release it when the insert succeeds).
For Postgres 9.5 or later, you can use INSERT ... ON CONFLICT DO NOTHING which does what it says. You can also write ON CONFLICT DO UPDATE SET column = value..., which will automagically convert your insert into an update of the row you are conflicting with (this functionality is sometimes called "upsert").
ON CONFLICT does not help here, however, because the OP is dealing with a foreign key constraint rather than a unique constraint, and ON CONFLICT only handles unique and exclusion constraint violations. In that case, you can most easily use the savepoint method I described earlier, but for multiple rows it may prove tedious. If you need to insert multiple rows at once, it should be reasonably performant to split them into multiple insert statements, provided you are not working in autocommit mode, all inserts occur in one transaction, and you are not inserting a very large number of rows.
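The savepoint-per-row pattern might look roughly like this. This is a sketch with assumptions: Python's built-in sqlite3 module stands in for PostgreSQL, the case_file table and the helper name are made up, and the event table is cut down to two columns. In PostgreSQL the savepoint is what keeps the transaction usable after a failed insert; with psycopg2 you would catch psycopg2.errors.ForeignKeyViolation instead of sqlite3.IntegrityError.

```python
import sqlite3

# isolation_level=None: transactions are controlled by hand below.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE case_file (id INTEGER PRIMARY KEY)")  # hypothetical parent table
conn.execute(
    "CREATE TABLE event (id INTEGER PRIMARY KEY, "
    "case_number_id INTEGER NOT NULL REFERENCES case_file(id), result TEXT)"
)
conn.execute("INSERT INTO case_file (id) VALUES (1), (2)")

def insert_skipping_fk_errors(rows):
    """Insert rows one by one in a single transaction, discarding rows that fail."""
    skipped = []
    conn.execute("BEGIN")
    for row in rows:
        conn.execute("SAVEPOINT one_row")
        try:
            conn.execute(
                "INSERT INTO event (case_number_id, result) VALUES (?, ?)", row
            )
        except sqlite3.IntegrityError:
            # Note: this also swallows other constraint errors, not only FK ones.
            conn.execute("ROLLBACK TO SAVEPOINT one_row")
            skipped.append(row)
        finally:
            conn.execute("RELEASE SAVEPOINT one_row")
    conn.execute("COMMIT")
    return skipped

skipped = insert_skipping_fk_errors([(1, "won"), (99, "lost"), (2, "adjourned")])
```

Case 99 does not exist, so only that row is thrown away; the rows before and after it are committed together.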
Sometimes, you really do need multiple inserts in a single statement, because the round-trip overhead of talking to your database plus the cost of having savepoints on every insert is simply too high. In this case, there are a number of imperfect approaches. Probably the least bad is to build a nested query which selects your data and joins it against the other table, something like this:
INSERT INTO table_A (column_A, column_B, column_C)
SELECT A_rows.*
FROM (VALUES (...)) AS A_rows(column_A, column_B, column_C)
JOIN table_B ON A_rows.column_B = table_B.column_B;
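A runnable version of that join-filtered insert, as a sketch: Python's built-in sqlite3 module stands in for PostgreSQL, and because SQLite does not accept a column list after a subquery alias, the incoming rows are named through a WITH clause instead. The sample data is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE table_b (column_b INTEGER PRIMARY KEY);
    CREATE TABLE table_a (
        column_a TEXT,
        column_b INTEGER NOT NULL REFERENCES table_b(column_b),
        column_c TEXT
    );
    INSERT INTO table_b VALUES (1), (2);
""")

# Made-up incoming rows; the middle one references a missing parent.
incoming = [("x", 1, "keep"), ("y", 99, "drop"), ("z", 2, "keep")]

# One "(?, ?, ?)" group per row, with the values flattened to match.
placeholders = ", ".join(["(?, ?, ?)"] * len(incoming))
params = [value for row in incoming for value in row]

# The inner join silently drops rows whose column_b has no match in table_b,
# so the foreign key is never violated and no error is raised.
conn.execute(
    f"""
    WITH a_rows (column_a, column_b, column_c) AS (VALUES {placeholders})
    INSERT INTO table_a (column_a, column_b, column_c)
    SELECT a_rows.column_a, a_rows.column_b, a_rows.column_c
    FROM a_rows
    JOIN table_b ON a_rows.column_b = table_b.column_b
    """,
    params,
)
conn.commit()
```

One caveat: a parent row deleted by a concurrent transaction between the join and the insert can still cause a violation, which is part of why this approach counts as imperfect.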

How to INSERT OR UPDATE while MATCHING a non Primary Key without updating existing Primary Key?

I'm currently working with Firebird and attempting to use the UPDATE OR INSERT functionality to solve a new case within our software. Basically, we need to pull data from a source, put it into an existing table, and then update that data at regular intervals, adding any new references. The source is not a database, so it isn't a matter of using MERGE to link the two tables (unless we make a separate table and then merge it, but that seems unnecessary).
The problem is that we cannot use the primary key of the existing table for matching, because we need to match on the ID we get from the source. We can use the MATCHING clause without trouble, but the primary key of the existing table is then updated to the next key every time, because it has to be in the query in case an insert happens. Here is the query (along with the C# parameter additions) to demonstrate the problem.
UPDATE OR INSERT INTO existingtable (PrimaryKey, UniqueSourceID, Data) VALUES (?,?,?) MATCHING (UniqueSourceID);
this.AddInParameter("PrimaryKey", FbDbType.Integer, itemID);
this.AddInParameter("UniqueSourceID", FbDbType.Integer, source.id);
this.AddInParameter("Data", FbDbType.SmallInt, source.data);
The problem is that every time the UPDATE branch fires, the primary key also changes to the next incremented key. I need a way to leave the primary key alone when updating, but still set it when inserting.

Do not generate the primary key manually; let a trigger generate it when necessary:
CREATE SEQUENCE seq_existingtable;
SET TERM ^ ;
CREATE TRIGGER Gen_PK FOR existingtable
ACTIVE BEFORE INSERT
AS
BEGIN
IF(NEW.PrimaryKey IS NULL)THEN NEW.PrimaryKey = NEXT VALUE FOR seq_existingtable;
END^
SET TERM ; ^
Now you can omit the PK field from your statement:
UPDATE OR INSERT INTO existingtable (UniqueSourceID, Data) VALUES (?,?) MATCHING (UniqueSourceID);
and when the insert is triggered by the statement then the trigger will take care of creating the PK. If you need to know the generated PK then use the RETURNING clause of the UPDATE OR INSERT statement.

PostgreSQL ON INSERT CASCADE

I've got two tables - one is Product and one is ProductSearchResult.
Whenever someone tries to insert a SearchResult with a product that is not listed in the Product table, the foreign key constraint is violated, hence I get an error.
I would like to know how I could get my database to automatically create that missing Product in the Product table (just the ProductID; all other attributes can be left blank).
Is there such a thing as CASCADE ON INSERT? If there is, I was not able to get it working.
Rules are executed after the insert, so because we get an error beforehand, they are useless if you use "DO ALSO". If you use "DO INSTEAD" and add the INSERT command at the end, you end up with endless recursion.
I reckon a Trigger is the way to go - but all my attempts to write one failed.
Any recommendations?
The Table Structure:
CREATE TABLE Product (
ID char(10) PRIMARY KEY,
Title varchar(150),
Manufacturer varchar(80),
Category smallint,
FOREIGN KEY(Category) REFERENCES Category(ID) ON DELETE CASCADE);
CREATE TABLE ProductSearchResult (
SearchTermID smallint NOT NULL,
ProductID char(10) NOT NULL,
DateFirstListed date NOT NULL DEFAULT current_date,
DateLastListed date NOT NULL DEFAULT current_date,
PRIMARY KEY (SearchTermID,ProductID),
FOREIGN KEY (SearchTermID) REFERENCES SearchTerm(ID) ON DELETE CASCADE,
FOREIGN KEY (ProductID) REFERENCES Product ON DELETE CASCADE);

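Yes, triggers are the way to go. On PostgreSQL versions before 9.0, you first have to enable the language before you can use triggers in plpgsql: as user postgres, run the command createlang with the proper parameters. From 9.0 onward, plpgsql is installed by default and this step is not needed.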
Once you've done that, you have to:
1. Write a function in plpgsql
2. Create a trigger to invoke that function
See example 39-3 for a basic example.
Note that a function body in Postgres is a string, with a special quoting mechanism: 2 dollar signs with an optional word in between them, as the quotes. (The word allows you to quote other similar quotes.)
Also note that you can reuse a trigger procedure for multiple tables, as long as they have the columns your procedure uses.
So the function has to:
1. Check if the value of NEW.ProductID exists in the Product table, with a select statement (you ought to be able to use SELECT count(*) ... INTO someint, or SELECT EXISTS(...) INTO somebool)
2. If not, insert a new row in that table
If you still get stuck, come back here.
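For illustration, here is what such a trigger can look like end to end. This is a sketch under assumptions, not the plpgsql version: it uses Python's built-in sqlite3 module and SQLite's trigger syntax so it can be run without a server, the tables are trimmed to the columns that matter, and the trigger name is made up. In PostgreSQL the same INSERT ... WHERE NOT EXISTS statement would sit inside a plpgsql trigger function that returns NEW.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE product (
        id    TEXT PRIMARY KEY,
        title TEXT                         -- left NULL for auto-created products
    );
    CREATE TABLE product_search_result (
        search_term_id INTEGER NOT NULL,
        product_id     TEXT NOT NULL REFERENCES product(id) ON DELETE CASCADE,
        PRIMARY KEY (search_term_id, product_id)
    );
    -- Before each insert, create the referenced product if it is missing.
    CREATE TRIGGER create_missing_product
    BEFORE INSERT ON product_search_result
    FOR EACH ROW
    BEGIN
        INSERT INTO product (id)
        SELECT NEW.product_id
        WHERE NOT EXISTS (SELECT 1 FROM product WHERE id = NEW.product_id);
    END;
""")

# Neither insert fails, although the product was never inserted explicitly.
conn.execute("INSERT INTO product_search_result VALUES (1, 'B000000001')")
conn.execute("INSERT INTO product_search_result VALUES (2, 'B000000001')")
conn.commit()
```

The second insert finds the product already there, so the trigger adds nothing and only one Product row exists afterwards.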

In any case (rules OR triggers) the insert needs to create a new key (and new values for the attributes) in the products table. In most cases, this implies that a (serial,sequence) surrogate primary key should be used in the products table, and that the "real world" product_id ("product number") should default to NULL, and be degraded to a candidate key.
BTW: a rule can be used, rules just are tricky to implement correctly for N:1 relations (they need the same kind of EXISTS-logic as in Bart's answer above).
Maybe cascading on INSERT is not such a good idea after all. What do you want to happen if someone inserts a ProductSearchResult record for a nonexistent product? [IMO a FK is always a domain; you cannot extend a domain just by referring to a nonexistent value for it; that would make the FK constraint meaningless]