Constraint makes insert fail but primary key still incremented - postgresql

I have a table whose primary key is based on a sequence defined with START WITH 1 INCREMENT BY 1.
The primary key works great, incrementing as needed.
However, I have another field that is UNIQUE. When I attempt to insert a new row that fails the UNIQUE check, the sequence behind the primary key is still advanced. I expected the primary key not to advance because the INSERT failed.
To test this I inserted two rows; their primary keys were 1 and 2 respectively. I then attempted to insert the data from row two again, and it failed due to the unique constraint. I then inserted another row with a unique value, and the primary key jumped from 2 to 4, skipping the 3 that would have been used had the unique constraint not been violated.
Is this expected behaviour for Postgres?

That is normal behavior.
The documentation explains that:
To avoid blocking concurrent transactions that obtain numbers from the same sequence, a nextval operation is never rolled back; that is, once a value has been fetched it is considered used and will not be returned again. This is true even if the surrounding transaction later aborts, or if the calling query ends up not using the value. For example an INSERT with an ON CONFLICT clause will compute the to-be-inserted tuple, including doing any required nextval calls, before detecting any conflict that would cause it to follow the ON CONFLICT rule instead. Such cases will leave unused “holes” in the sequence of assigned values. Thus, PostgreSQL sequence objects cannot be used to obtain “gapless” sequences.
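For a quick way to reproduce this, here is a minimal sketch (the table and values are made up for illustration, and a serial column stands in for the explicit sequence; in autocommit mode each statement is its own transaction):
CREATE TABLE t (
    id  serial PRIMARY KEY,
    val text UNIQUE
);
INSERT INTO t (val) VALUES ('a');  -- id 1
INSERT INTO t (val) VALUES ('b');  -- id 2
INSERT INTO t (val) VALUES ('b');  -- fails on the UNIQUE constraint, but nextval() has already run
INSERT INTO t (val) VALUES ('c');  -- id 4; the value 3 is a permanent hole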
For details and examples, you can read my blog.

Related

psql upsert results in noncontinuous id

I have a PostgreSQL (>9.5) table with a primary key id and a unique key col. When I use
INSERT INTO table_a (col) VALUES ('xxx') ON CONFLICT (col) DO NOTHING;
to perform an upsert, let's say a row with id 1 is generated.
If I run the SQL again, nothing will happen, but the id 2 will actually be generated and abandoned.
Then if I insert a new record, for example,
INSERT INTO table_a (col) VALUES ('yyy') ON CONFLICT (col) DO NOTHING;
another row with id 3 will be generated, and id 2 is wasted!
Is there any way to avoid this waste?
Presumably id is a serial. Under the hood this causes a nextval() call on a sequence. A number once returned by nextval() will never be returned again, and the nextval() call happens before checking for possible conflicts.
From "9.16. Sequence Manipulation Functions":
nextval
(...)
Important: To avoid blocking concurrent transactions that obtain numbers from the same sequence, a nextval operation is never rolled back; that is, once a value has been fetched it is considered used and will not be returned again. This is true even if the surrounding transaction later aborts, or if the calling query ends up not using the value. For example an INSERT with an ON CONFLICT clause will compute the to-be-inserted tuple, including doing any required nextval calls, before detecting any conflict that would cause it to follow the ON CONFLICT rule instead. Such cases will leave unused "holes" in the sequence of assigned values. Thus, PostgreSQL sequence objects cannot be used to obtain "gapless" sequences.
In conclusion, that means the answer to your question is no: there is no way to avoid this unless you generate the values yourself somehow.
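If the waste really matters, a common workaround (sketched here against the table_a example from the question, not guaranteed to fit every workload) is to filter out existing rows before the INSERT evaluates the id default, keeping ON CONFLICT only as a safety net for concurrent races:
INSERT INTO table_a (col)
SELECT 'xxx'
WHERE NOT EXISTS (SELECT 1 FROM table_a WHERE col = 'xxx')
ON CONFLICT (col) DO NOTHING;
-- If the row already exists, the SELECT produces nothing, no row is inserted,
-- and nextval() on the id sequence is never called. Only when two sessions race
-- past the NOT EXISTS check at the same time is an id still consumed.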

serial in postgres is being increased even though I added on conflict do nothing

I'm using Postgres 9.5 and seeing some weird things here.
I have a cron job running every 5 minutes that fires a SQL statement adding a list of records if they do not already exist.
INSERT INTO
sometable (customer, balance)
VALUES
(:customer, :balance)
ON CONFLICT (customer) DO NOTHING
sometable.customer is a primary key (text)
sometable structure is:
id: serial
customer: text
balance: bigint
Now it seems like every time this job runs, the id sequence is silently incremented by 1. So the next time I really add a record, its id is thousands of numbers above my last value. I thought this query checks for conflicts and, if so, does nothing, but currently it seems like it tries to insert the record, increments the id and then stops.
Any suggestions?
The reason this feels weird to you is that you are thinking of the increment on the counter as part of the insert operation, and therefore the "DO NOTHING" ought to mean "don't increment anything". You're picturing this:
Check values to insert against constraint
If duplicate detected, abort
Increment sequence
Insert data
But in fact, the increment has to happen before the insert is attempted. A SERIAL column in Postgres is implemented as a DEFAULT which executes the nextval() function on a bound SEQUENCE. Before the DBMS can do anything with the data, it's got to have a complete set of columns, so the order of operations is like this:
Resolve default values, including incrementing the sequence
Check values to insert against constraint
If duplicate detected, abort
Insert data
This can be seen intuitively if the duplicate key is in the autoincrement field itself:
CREATE TABLE foo ( id SERIAL NOT NULL PRIMARY KEY, bar text );
-- Insert row 1
INSERT INTO foo ( bar ) VALUES ( 'test' );
-- Reset the sequence so that the next nextval() call returns 1 again
SELECT setval(pg_get_serial_sequence('foo', 'id'), 1, false);
-- Attempt to insert row 1 again
INSERT INTO foo ( bar ) VALUES ( 'test 2' )
ON CONFLICT (id) DO NOTHING;
Clearly, this can't know if there's a conflict without incrementing the sequence, so the "do nothing" has to come after that increment.
As already said by @a_horse_with_no_name and @Serge Ballesta, serials are always incremented even if the INSERT fails.
You can try to "roll back" the serial value to the maximum id in use by resetting the corresponding sequence:
SELECT setval('sometable_id_seq', MAX(id), true) FROM sometable;
As said by @a_horse_with_no_name, that is by design. Serial fields are implemented under the hood through sequences, and for good reason: once you have gotten a new value from a sequence, you cannot roll it back. Imagine the following scenario (sketched in code after this list):
the sequence is at n
transaction A requires a new value: it gets n+1
in a concurrent transaction, B requires a new value: it gets n+2
for whatever reason, A rolls back its transaction - would you feel safe resetting the sequence?
That is the reason why sequences (and serial fields) simply document that, in the case of rolled-back transactions, holes can occur in the returned values. Only uniqueness is guaranteed.
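The same scenario, sketched with explicit nextval() calls (using the sometable_id_seq sequence from the setval example above):
-- Session A:
BEGIN;
SELECT nextval('sometable_id_seq');  -- returns n+1

-- Session B, concurrently:
BEGIN;
SELECT nextval('sometable_id_seq');  -- returns n+2

-- Session A:
ROLLBACK;
-- Winding the sequence back to n here would eventually hand out n+1 and n+2
-- again, colliding with the values Session B may already have committed.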
Well, there is a technique that allows you to do that sort of thing. It is called an insert mutex. It is old, but it works.
https://www.percona.com/blog/2011/11/29/avoiding-auto-increment-holes-on-innodb-with-insert-ignore/
The general idea is that you do an INSERT ... SELECT, and if your values are duplicates the SELECT returns no rows, which of course prevents the INSERT, so the counter is never incremented. A bit mind-boggling, but perfectly valid and performant.
This of course completely ignores ON DUPLICATE handling, but one gets back control over the counter.
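Roughly, for the sometable example from the question, the technique looks like this (a sketch only, reusing the named parameters from the question):
INSERT INTO sometable (customer, balance)
SELECT :customer, :balance
WHERE NOT EXISTS (
    SELECT 1 FROM sometable WHERE customer = :customer
);
-- When the customer already exists, the SELECT yields no row, nothing is inserted,
-- and the id default (nextval) is never evaluated. Two sessions racing past the
-- NOT EXISTS check can still collide on the unique key, so that rare case still
-- needs error handling (or an ON CONFLICT clause) on top.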

Postgres 9.3: Sharelock issue with simple INSERT

Update: Potential solution below
I have a large corpus of configuration files consisting of key/value pairs that I'm trying to push into a database. A lot of the keys and values are repeated across configuration files, so I'm storing the data using 3 tables: one for all unique keys, one for all unique values, and one listing the key/value pairs for each file.
Problem:
I'm using multiple concurrent processes (and therefore connections) to add the raw data into the database. Unfortunately I get a lot of deadlocks when trying to add values to the keys and values tables. I have tried a few different methods of inserting the data (shown below), but always end up with a "deadlock detected" error:
TransactionRollbackError: deadlock detected DETAIL: Process 26755
waits for ShareLock on transaction 689456; blocked by process 26754.
Process 26754 waits for ShareLock on transaction 689467; blocked by
process 26755.
I was wondering if someone could shed some light on exactly what could be causing these deadlocks, and possibly point me towards some way of fixing the issue. Looking at the SQL statements I'm using (listed below), I don't really see why there is any co-dependency at all. Thanks for reading!
Example config file:
example_key this_is_the_value
other_example other_value
third example yet_another_value
Table definitions:
CREATE TABLE keys (
    id SERIAL PRIMARY KEY,
    hash UUID UNIQUE NOT NULL,
    key TEXT);

CREATE TABLE values (
    id SERIAL PRIMARY KEY,
    hash UUID UNIQUE NOT NULL,
    key TEXT);

CREATE TABLE keyvalue_pairs (
    id SERIAL PRIMARY KEY,
    file_id INTEGER REFERENCES filenames,
    key_id INTEGER REFERENCES keys,
    value_id INTEGER REFERENCES values);
SQL Statements:
Initially I was trying to use this statement to avoid any exceptions:
WITH s AS (
    SELECT id, hash, key FROM keys
    WHERE hash = 'hash_value'
), i AS (
    INSERT INTO keys (hash, key)
    SELECT 'hash_value', 'key_value'
    WHERE NOT EXISTS (SELECT 1 FROM s)
    RETURNING id, hash, key
)
SELECT id, hash, key FROM i
UNION ALL
SELECT id, hash, key FROM s;
But even something as simple as this causes the deadlocks:
INSERT INTO keys (hash, key)
VALUES ('hash_value', 'key_value')
RETURNING id;
In both cases, if I get an exception thrown because the inserted hash value is not unique, I use savepoints to roll back the change and another statement to just select the id I'm after.
I'm using hashes for the unique field, as some of the keys and values are too long to be indexed.
Full example of the Python code (using psycopg2) with savepoints:
key_value = 'this_key'
hash_val = generate_uuid(key_value)
try:
    cursor.execute(
        '''
        SAVEPOINT duplicate_hash_savepoint;
        INSERT INTO keys (hash, key)
        VALUES (%s, %s)
        RETURNING id;
        ''',
        (hash_val, key_value)
    )
    result = cursor.fetchone()[0]
    cursor.execute('''RELEASE SAVEPOINT duplicate_hash_savepoint''')
    return result
except psycopg2.IntegrityError:
    # The hash already exists: roll back to the savepoint and look up the id instead
    cursor.execute(
        '''
        ROLLBACK TO SAVEPOINT duplicate_hash_savepoint;
        '''
    )
    # TODO: Should ensure that values match and this isn't just
    # a hash collision
    cursor.execute(
        '''
        SELECT id FROM keys WHERE hash = %s LIMIT 1;
        ''',
        (hash_val,)
    )
    return cursor.fetchone()[0]
Update:
So I believe I found a hint on another Stack Exchange site. Specifically:
UPDATE, DELETE, SELECT FOR UPDATE, and SELECT FOR SHARE commands behave the same as SELECT in terms of searching for target rows: they will only find target rows that were committed as of the command start time. However, such a target row might have already been updated (or deleted or locked) by another concurrent transaction by the time it is found. In this case, the would-be updater will wait for the first updating transaction to commit or roll back (if it is still in progress). If the first updater rolls back, then its effects are negated and the second updater can proceed with updating the originally found row. If the first updater commits, the second updater will ignore the row if the first updater deleted it, otherwise it will attempt to apply its operation to the updated version of the row.
While I'm still not exactly sure where the co-dependency is, it seems that processing a large number of key/value pairs without committing would likely result in something like this. Sure enough, if I commit after each individual configuration file is added, the deadlocks don't occur.
It looks like you're in this situation:
The table to INSERT into has a primary key (or unique index(es) of any sort).
Several INSERTs into that table are performed within one transaction (as opposed to committing immediately after each one)
The rows to insert come in random order (with regard to the primary key)
The rows are inserted in concurrent transactions.
This situation creates the following opportunity for deadlock:
Assume there are two sessions, each of which has started a transaction.
Session #1: insert row with PK 'A'
Session #2: insert row with PK 'B'
Session #1: try to insert row with PK 'B'
=> Session #1 is put to wait until Session #2 commits or rolls back
Session #2: try to insert row with PK 'A'
=> Session #2 is put to wait for Session #1.
Shortly thereafter, the deadlock detector becomes aware that both sessions are now waiting for each other, and terminates one of them with a fatal "deadlock detected" error.
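Concretely, with the keys table from the question, the interleaving looks roughly like this (the UUID literals are made up for illustration):
-- Session #1:
BEGIN;
INSERT INTO keys (hash, key) VALUES ('11111111-1111-1111-1111-111111111111', 'A');

-- Session #2:
BEGIN;
INSERT INTO keys (hash, key) VALUES ('22222222-2222-2222-2222-222222222222', 'B');

-- Session #1 (blocks, waiting for Session #2 to commit or roll back):
INSERT INTO keys (hash, key) VALUES ('22222222-2222-2222-2222-222222222222', 'B');

-- Session #2 (now waits for Session #1; the deadlock detector aborts one of them):
INSERT INTO keys (hash, key) VALUES ('11111111-1111-1111-1111-111111111111', 'A');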
If you're in this scenario, the simplest solution is to COMMIT after a new entry is inserted, before attempting to insert any new row into the table.
Postgres is known for that type of deadlock, to be honest. I often encounter such problems when different workers update information about interleaving entities. Recently I had the task of importing a big list of scientific paper metadata from multiple JSON files. I was using parallel processes via joblib to read several files at the same time. Deadlocks kept occurring on the authors(id bigint primary key, name text) table, because many files contained papers by the same authors and therefore produced inserts with oftentimes the same authors. I was using insert into authors (id, name) values %s on conflict (id) do nothing, but that was not helping. I tried sorting tuples before sending them to the Postgres server, with little success. What really helped me was keeping a list of known authors in a Redis set (accessible to all processes):
if not rexecute("sismember", "known_authors", author_id):
    # your logic...
    rexecute("sadd", "known_authors", author_id)
I recommend this approach to everyone. Use Memurai if you are limited to Windows. Sad but true: there are not a lot of other options for Postgres.

How to INSERT OR UPDATE while MATCHING a non Primary Key without updating existing Primary Key?

I'm currently working with Firebird and attempting to utilize the UPDATE OR INSERT functionality to solve a particular new case within our software. Basically, we need to pull data off of a source and put it into an existing table, then update that data at regular intervals and add any new references. The source is not a database, so it isn't a matter of using MERGE to link the two tables (unless we make a separate table and then merge it, but that seems unnecessary).
The problem rests on the fact that we cannot use the primary key of the existing table for matching, because we need to match based on the ID we get from the source. We can use the MATCHING clause no problem, but the issue is that the primary key of the existing table gets updated to the next key every time, because it has to appear in the query for the case where the statement inserts. Here is the query (along with the C# parameter additions) to demonstrate the problem:
UPDATE OR INSERT INTO existingtable (PrimaryKey, UniqueSourceID, Data) VALUES (?,?,?) MATCHING (UniqueSourceID);
this.AddInParameter("PrimaryKey", FbDbType.Integer, itemID);
this.AddInParameter("UniqueSourceID", FbDbType.Integer, source.id);
this.AddInParameter("Data", FbDbType.SmallInt, source.data);
The problem is that every time the UPDATE path fires, the primary key is also overwritten with the next generated key. I need a way to leave the primary key alone when updating, but still supply one when inserting.
Do not generate the primary key manually; let a trigger generate it when necessary:
CREATE SEQUENCE seq_existingtable;

SET TERM ^ ;

CREATE TRIGGER Gen_PK FOR existingtable
ACTIVE BEFORE INSERT
AS
BEGIN
    IF (NEW.PrimaryKey IS NULL) THEN NEW.PrimaryKey = NEXT VALUE FOR seq_existingtable;
END^

SET TERM ; ^
Now you can omit the PK field from your statement:
UPDATE OR INSERT INTO existingtable (UniqueSourceID, Data) VALUES (?,?) MATCHING (UniqueSourceID);
and when the statement ends up inserting, the trigger will take care of creating the PK. If you need to know the generated PK, use the RETURNING clause of the UPDATE OR INSERT statement.
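For example, roughly (a sketch assuming the same table and parameter order as above):
UPDATE OR INSERT INTO existingtable (UniqueSourceID, Data)
VALUES (?, ?)
MATCHING (UniqueSourceID)
RETURNING PrimaryKey;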

Duplicate Key error when using INSERT DEFAULT

I am getting a duplicate key error, DB2 SQL Error: SQLCODE=-803, SQLSTATE=23505, when I try to INSERT records. The primary key is one column, INTEGER 4, Generated, and it is the first column.
The insert looks like this: INSERT INTO SCHEMA.TABLE1 VALUES (DEFAULT, ?, ?, ...)
It's my understanding that using the value DEFAULT will just let DB2 auto-generate the key at the time of insert, which is what I want. This works most of the time, but sometimes/randomly I get the duplicate key error. Thoughts?
More specifically, I'm running against DB2 9.7.0.3, using Scriptella to copy a bunch of records from one database to another. Sometimes I can process a bunch with no problems, other times I'll get the error right away, other times after 2 records, or 20 records, or 30 records, etc. There does not seem to be a pattern, nor is it the same record every time. If I change the data to copy 1 record instead of a bunch, sometimes I'll get the error once, and then it's fine the next time.
I thought maybe some other process was inserting records during my batch program, and creating keys at the same time. However, the tables I'm copying TO should not have any other users/processes trying to INSERT records during this same time frame, although there could be READS happening.
Edit: adding create info:
Create table SCHEMA.TABLE1 (
    SYSTEM_USER_KEY INTEGER NOT NULL
        generated by default as identity (start with 1 increment by 1 cache 20),
    COL2...,
)

alter table SCHEMA.TABLE1
    add constraint SYSTEM_USER_SYSTEM_USER_KEY_IDX
    Primary Key (SYSTEM_USER_KEY);
You most likely have records in your table with IDs that are bigger than the next value of your identity sequence. To find out what value your sequence is currently at, run the following query:
select s.nextcachefirstvalue-s.cache, s.nextcachefirstvalue-s.increment
from syscat.COLIDENTATTRIBUTES as a inner join syscat.sequences as s on a.seqid=s.seqid
where a.tabschema='SCHEMA'
and a.TABNAME='TABLE1'
and a.COLNAME='SYSTEM_USER_KEY'
So basically what happened is that somehow you got records in your table with IDs that are bigger than the current last value of your identity sequence, so sooner or later those IDs will collide with identity-generated IDs.
There are different ways this could have happened. One possibility is that data was loaded which already contained values for the ID column, or that records were inserted with an explicit value for the ID. Another option is that the identity sequence was reset to start at a lower value than the max ID in the table.
Whatever the cause, you may also want the fix:
SELECT MAX(<primary_key_column>) FROM <table>;
ALTER TABLE <table> ALTER COLUMN <primary_key_column> RESTART WITH <number from previous query + 1>;
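Applied to the table from the question, that would look roughly like this (1001 is only an illustration; use your actual MAX(SYSTEM_USER_KEY) + 1):
SELECT MAX(SYSTEM_USER_KEY) FROM SCHEMA.TABLE1;
-- suppose this returns 1000
ALTER TABLE SCHEMA.TABLE1 ALTER COLUMN SYSTEM_USER_KEY RESTART WITH 1001;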