DELETE then INSERT is randomly raising a duplicate key violates unique constraint error - postgresql

I have a query (for a website) that replaces old data with new data.
I run the query in one call to the database via the PHP pg_query function and also use pgbouncer with transaction pool mode. I would be very surprised if two of the same queries are running at the same time, but is that the only explanation for this? I don't have any triggers or SERIAL columns on the table.
CREATE TABLE mydata (
id INT NOT NULL,
val TEXT NOT NULL
);
ALTER TABLE mydata ADD CONSTRAINT mydata_unique UNIQUE (id);
The statement that raises the conflict is
DELETE FROM mydata WHERE id IN (1,2,3);
INSERT INTO mydata (id,val) VALUES (1,'one');
INSERT INTO mydata (id,val) VALUES (2,'two');
INSERT INTO mydata (id,val) VALUES (3,'three');
Version PostgreSQL 12.2

I assume that you are not running these statements in parallel, but one after the other.
Still, this could easily cause conflicts if several database sessions are doing the same thing at the same time: a second session may insert rows after the first session deleted the old rows, but before it inserted the new rows.
To protect yourself from that with row locks, run all statements in a single transaction. This may occasionally lead to a deadlock, which is no big deal - just repeat the transaction that failed.
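For example, with the table above that means sending the statements as one transaction, and simply retrying if it fails with a deadlock:
BEGIN;
DELETE FROM mydata WHERE id IN (1,2,3);
INSERT INTO mydata (id,val) VALUES (1,'one');
INSERT INTO mydata (id,val) VALUES (2,'two');
INSERT INTO mydata (id,val) VALUES (3,'three');
COMMIT;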

Related

Postgres Insert as select from table ignore any errors

I am trying to bulk load records from a temp table into a table using an INSERT ... SELECT statement with an ON CONFLICT ... DO UPDATE strategy.
I want to load as many records as possible. Currently, if there are any foreign key violations, no records get inserted and everything gets rolled back. Is there a way to insert the valid records and skip the faulty ones?
In https://dba.stackexchange.com/a/46477 I saw a strategy of joining against the foreign table in the query to filter out the faulty rows. I don't want to do that either, as I may have many foreign keys on that table and it would make my query more complex and table-specific. I would like it to be generic.
Sample use case: if I have 100 rows in the temp table and rows 5 and 7 cause insertion failures, I want to insert the other 98 records and identify which two rows failed.
I want to avoid inserting record by record and catching the error, as that is not efficient. I am doing this whole exercise to avoid loading the table row by row.
Oracle provides support for catching bulk errors in one shot.
Sample https://stackoverflow.com/a/36430893/8575780
https://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:1422998100346727312
I have already explored loading with COPY; it catches NOT NULL constraint and other data type errors, but when a foreign key violation happens nothing gets committed.
I am looking for something closer to what pgloader does when it encounters errors.
https://pgloader.readthedocs.io/en/latest/pgloader.html#batches-and-retry-behaviour
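For reference, the kind of statement described above looks roughly like this (table and column names here are made up); a single row that violates a foreign key still aborts the whole statement:
INSERT INTO target_table (id, val)
SELECT id, val
FROM temp_table
ON CONFLICT (id) DO UPDATE SET val = EXCLUDED.val;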

How to set Ignore Duplicate Key in Postgresql while table creation itself

I am creating a table in PostgreSQL 9.5 where id is the primary key. While inserting rows into the table, if anyone tries to insert a duplicate id, I want it to be ignored instead of raising an exception. Is there any way to set this up at table creation time so that duplicate entries get ignored?
There are many techniques to resolve the duplicate insertion issue when writing the insert query itself, e.g. using ON CONFLICT DO NOTHING or a WHERE EXISTS clause. But I want to handle this on the table creation end so that the person writing the insert query doesn't need to bother with it.
Creating a RULE is one possible solution. Are there other possible solutions? Maybe something like this:
`CREATE TABLE dbo.foo (bar int PRIMARY KEY WITH (FILLFACTOR=90, IGNORE_DUP_KEY = ON))`
Although this exact statement doesn't work on PostgreSQL 9.5 on my machine.
Add a BEFORE INSERT trigger or an ON INSERT ... DO INSTEAD rule; otherwise it has to be handled by the inserting query. Both solutions will require more resources on each insert.
An alternative is to use a function with arguments for the insert that checks for duplicates, so end users call the function instead of a plain INSERT statement.
A WHERE EXISTS sub-query is not atomic, by the way, so you can still get an exception after the check...
On 9.5, ON CONFLICT DO NOTHING is still the best solution.
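As a rough illustration of the rule-based approach mentioned above (table and column names are made up, and the EXISTS check is not atomic under concurrent inserts, as noted above):
CREATE TABLE foo (bar int PRIMARY KEY);
CREATE RULE ignore_duplicate_bar AS
    ON INSERT TO foo
    WHERE EXISTS (SELECT 1 FROM foo WHERE bar = NEW.bar)
    DO INSTEAD NOTHING;
-- duplicate inserts are now silently skipped instead of raising an error
INSERT INTO foo (bar) VALUES (1);
INSERT INTO foo (bar) VALUES (1);
The function-based alternative could be sketched like this, with callers using insert_foo() instead of a raw INSERT:
CREATE FUNCTION insert_foo(p_bar int) RETURNS void
LANGUAGE sql AS $$
    INSERT INTO foo (bar) VALUES (p_bar) ON CONFLICT (bar) DO NOTHING;
$$;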

Is it possible to catch a foreign key violation in postgres

I'm trying to insert data into a table which has a foreign key constraint. If there is a constraint violation in a row that I'm inserting, I want to chuck that data away.
The issue is that postgres returns an error every time I violate the constraint. Is it possible for me to have some statement in my insert statement like 'ON FOREIGN KEY CONSTRAINT DO NOTHING'?
EDIT:
This is the query that I'm trying to do, where info is a dict:
cursor.execute("INSERT INTO event (case_number_id, date, \
session, location, event_type, worker, result) VALUES \
(%(id_number)s, %(date)s, %(session)s, \
%(location)s, %(event_type)s, %(worker)s, %(result)s) ON CONFLICT DO NOTHING", info)
It errors out when there is a foreign key violation
If you're only inserting a single row at a time, you can create a savepoint before the insert and rollback to it when the insert fails (or release it when the insert succeeds).
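For example, a rough sketch of that savepoint dance in plain SQL (the success/failure branch has to be taken by your application code, and the column values here are placeholders):
BEGIN;
SAVEPOINT before_insert;
INSERT INTO event (case_number_id, date, session, location, event_type, worker, result)
VALUES (123, '2021-01-01', 'AM', 'Room 1', 'hearing', 'J. Doe', 'adjourned');
-- on a foreign key violation: ROLLBACK TO SAVEPOINT before_insert;
-- on success:
RELEASE SAVEPOINT before_insert;
COMMIT;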
For Postgres 9.5 or later, you can use INSERT ... ON CONFLICT DO NOTHING which does what it says. You can also write ON CONFLICT DO UPDATE SET column = value..., which will automagically convert your insert into an update of the row you are conflicting with (this functionality is sometimes called "upsert").
This does not work because OP is dealing with a foreign key constraint rather than a unique constraint. In that case, you can most easily use the savepoint method I described earlier, but for multiple rows it may prove tedious. If you need to insert multiple rows at once, it should be reasonably performant to split them into multiple insert statements, provided you are not working in autocommit mode, all inserts occur in one transaction, and you are not inserting a very large number of rows.
Sometimes, you really do need multiple inserts in a single statement, because the round-trip overhead of talking to your database plus the cost of having savepoints on every insert is simply too high. In this case, there are a number of imperfect approaches. Probably the least bad is to build a nested query which selects your data and joins it against the other table, something like this:
INSERT INTO table_A (column_A, column_B, column_C)
SELECT A_rows.*
FROM (VALUES (...)) AS A_rows(column_A, column_B, column_C)
JOIN table_B ON A_rows.column_B = table_B.column_B;
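Rows in the VALUES list whose column_B has no match in table_B simply drop out of the join, so only rows that satisfy the foreign key constraint ever reach the INSERT.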

Postgres 9.3: Sharelock issue with simple INSERT

Update: Potential solution below
I have a large corpus of configuration files consisting of key/value pairs that I'm trying to push into a database. A lot of the keys and values are repeated across configuration files, so I'm storing the data using 3 tables: one for all unique keys, one for all unique values, and one listing the key/value pairs for each file.
Problem:
I'm using multiple concurrent processes (and therefore connections) to add the raw data into the database. Unfortunately I get a lot of deadlocks when trying to add values to the key and value tables. I have tried a few different methods of inserting the data (shown below), but I always end up with a "deadlock detected" error:
TransactionRollbackError: deadlock detected
DETAIL: Process 26755 waits for ShareLock on transaction 689456; blocked by process 26754.
Process 26754 waits for ShareLock on transaction 689467; blocked by process 26755.
I was wondering if someone could shed some light on exactly what could be causing these deadlocks, and possibly point me towards some way of fixing the issue. Looking at the SQL statements I'm using (listed below), I don't really see why there is any co-dependency at all. Thanks for reading!
Example config file:
example_key this_is_the_value
other_example other_value
third example yet_another_value
Table definitions:
CREATE TABLE keys (
id SERIAL PRIMARY KEY,
hash UUID UNIQUE NOT NULL,
key TEXT);
CREATE TABLE values (
id SERIAL PRIMARY KEY,
hash UUID UNIQUE NOT NULL,
value TEXT);
CREATE TABLE keyvalue_pairs (
id SERIAL PRIMARY KEY,
file_id INTEGER REFERENCES filenames,
key_id INTEGER REFERENCES keys,
value_id INTEGER REFERENCES values);
SQL Statements:
Initially I was trying to use this statement to avoid any exceptions:
WITH s AS (
SELECT id, hash, key FROM keys
WHERE hash = 'hash_value'
), i AS (
INSERT INTO keys (hash, key)
SELECT 'hash_value', 'key_value'
WHERE NOT EXISTS (SELECT 1 FROM s)
returning id, hash, key
)
SELECT id, hash, key FROM i
UNION ALL
SELECT id, hash, key FROM s;
But even something as simple as this causes the deadlocks:
INSERT INTO keys (hash, key)
VALUES ('hash_value', 'key_value')
RETURNING id;
In both cases, if I get an exception because the inserted hash value is not unique, I use savepoints to roll back the change and another statement to just select the id I'm after.
I'm using hashes for the unique field, as some of the keys and values are too long to be indexed.
Full example of the python code (using psycopg2) with savepoints:
key_value = 'this_key'
hash_val = generate_uuid(key_value)
try:
    cursor.execute(
        '''
        SAVEPOINT duplicate_hash_savepoint;
        INSERT INTO keys (hash, key)
        VALUES (%s, %s)
        RETURNING id;
        ''',
        (hash_val, key_value)
    )
    result = cursor.fetchone()[0]
    cursor.execute('''RELEASE SAVEPOINT duplicate_hash_savepoint''')
    return result
except psycopg2.IntegrityError as e:
    cursor.execute(
        '''
        ROLLBACK TO SAVEPOINT duplicate_hash_savepoint;
        '''
    )
    # TODO: Should ensure that values match and this isn't just
    # a hash collision
    cursor.execute(
        '''
        SELECT id FROM keys WHERE hash=%s LIMIT 1;
        ''',
        (hash_val,)
    )
    return cursor.fetchone()[0]
Update:
So I believe I found a hint on another Stack Exchange site:
Specifically:
UPDATE, DELETE, SELECT FOR UPDATE, and SELECT FOR SHARE commands behave the same as SELECT in terms of searching for target rows: they will only find target rows that were committed as of the command start time. However, such a target row might have already been updated (or deleted or locked) by another concurrent transaction by the time it is found. In this case, the would-be updater will wait for the first updating transaction to commit or roll back (if it is still in progress). If the first updater rolls back, then its effects are negated and the second updater can proceed with updating the originally found row. If the first updater commits, the second updater will ignore the row if the first updater deleted it, otherwise it will attempt to apply its operation to the updated version of the row.
While I'm still not exactly sure where the co-dependency is, it seems that processing a large number of key/value pairs without committing would likely result in something like this. Sure enough, if I commit after each individual configuration file is added, the deadlocks don't occur.
It looks like you're in this situation:
The table to INSERT into has a primary key (or unique index(es) of any sort).
Several INSERTs into that table are performed within one transaction (as opposed to committing immediately after each one)
The rows to insert come in random order (with regard to the primary key)
The rows are inserted in concurrent transactions.
This situation creates the following opportunity for deadlock:
Assume there are two sessions that each started a transaction:
Session #1: insert row with PK 'A'
Session #2: insert row with PK 'B'
Session #1: try to insert row with PK 'B'
=> Session #1 is put to wait until Session #2 commits or rolls back
Session #2: try to insert row with PK 'A'
=> Session #2 is put to wait for Session #1.
Shortly thereafter, the deadlock detector becomes aware that both sessions are now waiting for each other, and terminates one of them with a fatal "deadlock detected" error.
If you're in this scenario, the simplest solution is to COMMIT after a new entry is inserted, before attempting to insert any new row into the table.
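To make the interleaving concrete with the keys table from the question, the two sessions execute roughly this (the UUID values are made up):
-- Session #1
BEGIN;
INSERT INTO keys (hash, key) VALUES ('aaaaaaaa-0000-0000-0000-000000000000', 'key_a');
-- Session #2
BEGIN;
INSERT INTO keys (hash, key) VALUES ('bbbbbbbb-0000-0000-0000-000000000000', 'key_b');
-- Session #1: blocks, waiting for session #2 to commit or roll back
INSERT INTO keys (hash, key) VALUES ('bbbbbbbb-0000-0000-0000-000000000000', 'key_b');
-- Session #2: now waits for session #1 -> deadlock detected
INSERT INTO keys (hash, key) VALUES ('aaaaaaaa-0000-0000-0000-000000000000', 'key_a');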
Postgres is known for that type of deadlock, to be honest. I often encounter such problems when different workers update information about interleaving entities. Recently I had a task of importing a big list of scientific paper metadata from multiple JSON files. I was using parallel processes via joblib to read from several files at the same time. Deadlocks kept occurring on the authors(id bigint primary key, name text) table, because many files contained papers by the same authors and therefore produced inserts with the same authors. I was using insert into authors (id,name) values %s on conflict(id) do nothing, but that was not helping. I tried sorting the tuples before sending them to the Postgres server, with little success. What really helped me was keeping a list of known authors in a Redis set (accessible to all processes):
if not rexecute("sismember", "known_authors", author_id):
# your logic...
rexecute("sadd", "known_authors", author_id)
I recommend this approach to everyone. Use Memurai if you are limited to Windows. Sad but true, there are not a lot of other options for Postgres.

Duplicate Key error when using INSERT DEFAULT

I am getting a duplicate key error, DB2 SQL Error: SQLCODE=-803, SQLSTATE=23505, when I try to INSERT records. The primary key is one column, INTEGER 4, Generated, and it is the first column.
The insert looks like this: INSERT INTO SCHEMA.TABLE1 values (DEFAULT, ?, ?, ...)
It's my understanding that using the value DEFAULT will just let DB2 auto-generate the key at the time of insert, which is what I want. This works most of the time, but sometimes/randomly I get the duplicate key error. Thoughts?
More specifically, I'm running against DB2 9.7.0.3, using Scriptella to copy a bunch of records from one database to another. Sometimes I can process a bunch with no problems, other times I'll get the error right away, other times after 2 records, or 20 records, or 30 records, etc. Does not seem to be a pattern, nor is it the same record every time. If I change the data to copy 1 record instead of a bunch, sometimes I'll get the error one time, then it's fine the next time.
I thought maybe some other process was inserting records during my batch program, and creating keys at the same time. However, the tables I'm copying TO should not have any other users/processes trying to INSERT records during this same time frame, although there could be READS happening.
Edit: adding create info:
Create table SCHEMA.TABLE1 (
SYSTEM_USER_KEY INTEGER NOT NULL
generated by default as identity (start with 1 increment by 1 cache 20),
COL2...,
)
alter table SCHEMA.TABLE1
add constraint SYSTEM_USER_SYSTEM_USER_KEY_IDX
Primary Key (SYSTEM_USER_KEY);
You most likely have records in your table with IDs that are bigger than the next value in your identity sequence. To find out where your identity sequence currently stands, run the following query.
select s.nextcachefirstvalue-s.cache, s.nextcachefirstvalue-s.increment
from syscat.COLIDENTATTRIBUTES as a inner join syscat.sequences as s on a.seqid=s.seqid
where a.tabschema='SCHEMA'
and a.TABNAME='TABLE1'
and a.COLNAME='SYSTEM_USER_KEY'
So basically what happened is that somehow you got records in your table with IDs that are bigger than the current last value of your identity sequence. Sooner or later these IDs will collide with identity-generated IDs.
There are different ways this could have happened. One possibility is that data was loaded which already contained values for the ID column, or that records were inserted with an explicit value for the ID. Another option is that the identity sequence was reset to start at a lower value than the maximum ID in the table.
Whatever the cause, you will probably also want the fix:
SELECT MAX(<primary_key_column>) FROM <table>;
ALTER TABLE <table> ALTER COLUMN <primary_key_column> RESTART WITH <number from previous query + 1>;
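Applied to the table from the question, that would look something like the following, where 12345 stands in for whatever the MAX query returned, plus one:
SELECT MAX(SYSTEM_USER_KEY) FROM SCHEMA.TABLE1;
ALTER TABLE SCHEMA.TABLE1 ALTER COLUMN SYSTEM_USER_KEY RESTART WITH 12345;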