psql upsert results in noncontinuous id - postgresql

I have a PostgreSQL (>9.5) table with a primary key id and a unique column col. When I use
INSERT INTO table_a (col) VALUES ('xxx') ON CONFLICT (col) DO NOTHING;
to perform an upsert, let's say a row with id 1 is generated.
If I run the SQL again, nothing visible happens, but id 2 is actually generated and abandoned.
Then if I insert a new record, for example,
INSERT INTO table_a (col) VALUES ('yyy') ON CONFLICT (col) DO NOTHING;
Another row with id 3 will be generated, and id 2 is wasted!
Is there any way to avoid this waste?

Presumably id is a serial column. Under the hood, this causes a nextval() call on a sequence. A number once returned by nextval() will never be returned again, and the nextval() call happens before checking for possible conflicts.
From "9.16. Sequence Manipulation Functions":
nextval
(...)
Important: To avoid blocking concurrent transactions that obtain numbers from the same sequence, a nextval operation is never rolled back; that is, once a value has been fetched it is considered used and will not be returned again. This is true even if the surrounding transaction later aborts, or if the calling query ends up not using the value. For example an INSERT with an ON CONFLICT clause will compute the to-be-inserted tuple, including doing any required nextval calls, before detecting any conflict that would cause it to follow the ON CONFLICT rule instead. Such cases will leave unused "holes" in the sequence of assigned values. Thus, PostgreSQL sequence objects cannot be used to obtain "gapless" sequences.
In conclusion, the answer to your question is no: there is no way to avoid this, unless you generate the values yourself somehow.
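To illustrate the "generate the values yourself" alternative: below is a minimal sketch of a gapless counter kept in its own single-row table, with the increment and the insert done in one transaction so that a skipped duplicate consumes nothing. It uses Python's sqlite3 purely so the demo is self-contained (table and column names mirror the question); in Postgres the same pattern would use a counter row updated inside the transaction, at the cost of serializing all writers.

```python
import sqlite3

# Autocommit mode so we can manage transactions explicitly.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE counter (n INTEGER NOT NULL)")
conn.execute("INSERT INTO counter (n) VALUES (0)")
conn.execute("CREATE TABLE table_a (id INTEGER PRIMARY KEY, col TEXT UNIQUE)")

def gapless_insert(conn, value):
    """Insert value with a gapless id; duplicates consume no id at all."""
    cur = conn.cursor()
    cur.execute("BEGIN IMMEDIATE")  # serializes writers: the price of gaplessness
    try:
        cur.execute("SELECT 1 FROM table_a WHERE col = ?", (value,))
        if cur.fetchone() is None:
            cur.execute("UPDATE counter SET n = n + 1")
            cur.execute("SELECT n FROM counter")
            (n,) = cur.fetchone()
            cur.execute("INSERT INTO table_a (id, col) VALUES (?, ?)", (n, value))
        cur.execute("COMMIT")
    except Exception:
        cur.execute("ROLLBACK")
        raise

for v in ["xxx", "xxx", "yyy"]:
    gapless_insert(conn, v)

print(conn.execute("SELECT id, col FROM table_a ORDER BY id").fetchall())
# [(1, 'xxx'), (2, 'yyy')]
```

The duplicate 'xxx' leaves no hole: 'yyy' gets id 2, not 3.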

Related

Constraint makes insert fail but primary key still incremented

I have a table that has a primary key based off of a sequence START WITH 1 INCREMENT BY 1.
Primary key works great, incrementing as needed.
However I have another field that is UNIQUE. When I attempt to insert a new row that will fail the UNIQUE check, the primary key is still updated. I expect this to not update the primary key because the INSERT failed.
To test this I inserted two rows, primary key was 1 and 2 respectively. I then attempted to insert the data from row two again. It failed due to the unique constraint. I then inserted another row with a unique value and the primary key jumped from 2 to 4, skipping the three that would have been used had the unique constraint not failed.
Is this expected behaviour for Postgres?
That is normal behavior.
The documentation explains that:
To avoid blocking concurrent transactions that obtain numbers from the same sequence, a nextval operation is never rolled back; that is, once a value has been fetched it is considered used and will not be returned again. This is true even if the surrounding transaction later aborts, or if the calling query ends up not using the value. For example an INSERT with an ON CONFLICT clause will compute the to-be-inserted tuple, including doing any required nextval calls, before detecting any conflict that would cause it to follow the ON CONFLICT rule instead. Such cases will leave unused “holes” in the sequence of assigned values. Thus, PostgreSQL sequence objects cannot be used to obtain “gapless” sequences.
For details and examples, you can read my blog.

Avoiding double SELECT/INSERT by INSERT'ing placeholder

Is it possible to perform a query that will SELECT for some values and if those values do not exist, perform an INSERT and return the very same values - in a single query?
Background:
I am writing an application with a great deal of concurrency. At one point, a function will check the database to see if a certain key value exists, using SELECT. If the key exists, the function can safely exit. If the value does not exist, the function will perform a REST API call to fetch the necessary data, then INSERT the values into the database.
This works fine until it is run concurrently. Two threads (I am using Go, so goroutines) will each independently run the SELECT. Since both queries report that the key does not exist, both will independently perform the REST API call and both will attempt to INSERT the values.
Currently, I avoid double-insertion by using a unique constraint. However, I would like to avoid even the double API call by having the first query SELECT for the key value and, if it does not exist, INSERT a placeholder - then return those values. This way, subsequent SELECT queries report that the key value already exists and will not perform the API calls or INSERT.
In Pseudo-code, something like this:
SELECT values FROM my_table WHERE key=KEY_HERE;
if found:
    RETURN SELECTED VALUES;
if not found:
    INSERT INTO my_table (values, key) VALUES (random_placeholder, KEY_HERE);
    SELECT values FROM my_table WHERE key=KEY_HERE;
The application code will insert a random value so that a routine/thread can determine if it was the one that generated the new INSERT and will subsequently go ahead and perform the Rest API call.
This is for a Go application using the pgx library.
Thanks!
You could write a stored procedure, and it would be a single query for the client to execute. PostgreSQL, of course, would still execute multiple statements. A PostgreSQL INSERT statement can return values with the RETURNING keyword, so you may not need the second SELECT.
Lock the table in an appropriate lock mode.
For example in the strictest possible mode ACCESS EXCLUSIVE:
BEGIN TRANSACTION;
LOCK elbat
IN ACCESS EXCLUSIVE MODE;
SELECT *
FROM elbat
WHERE id = 1;
-- if there wasn't any row returned make the API call and
INSERT INTO elbat
(id,
<other columns>)
VALUES (1,
<API call return values>);
COMMIT;
-- return the values to the application
Once one transaction has acquired the ACCESS EXCLUSIVE lock, no other transaction can even read from the table until the acquiring transaction ends. And ACCESS EXCLUSIVE won't be granted while any other (even weaker) lock is held. That way the instance of your component that gets the lock first will do the check and the INSERT if necessary. The other one is blocked in the meantime, and by the time it finally gets access, the INSERT has already been done by the first transaction, so it no longer needs to make the API call (unless the first transaction failed for some reason and was rolled back).
If this is too strict for your use case, you may need to find out which lock level might be appropriate for you. Maybe, if you can make any component accessing the database (or at least the table) cooperative (and it sounds like you can do this), even advisory locks are enough.
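The flow above can be sketched end to end. Here is a minimal, runnable illustration using Python's sqlite3, where BEGIN IMMEDIATE stands in for the table lock (the function name get_or_claim and the placeholder format are invented for the example); with pgx you would run the same statements over a transaction:

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)  # manage transactions manually
conn.execute("CREATE TABLE my_table (key TEXT PRIMARY KEY, value TEXT NOT NULL)")

def get_or_claim(conn, key):
    """Return (value, claimed); claimed=True means this caller inserted
    the placeholder and is responsible for the REST API call."""
    cur = conn.cursor()
    cur.execute("BEGIN IMMEDIATE")  # stands in for LOCK ... IN ACCESS EXCLUSIVE MODE
    try:
        cur.execute("SELECT value FROM my_table WHERE key = ?", (key,))
        row = cur.fetchone()
        if row is not None:
            cur.execute("COMMIT")
            return row[0], False          # someone else already claimed it
        placeholder = "pending-%08x" % random.getrandbits(32)
        cur.execute("INSERT INTO my_table (key, value) VALUES (?, ?)",
                    (key, placeholder))
        cur.execute("COMMIT")
        return placeholder, True          # we claimed it: go do the API call
    except Exception:
        cur.execute("ROLLBACK")
        raise

value1, claimed1 = get_or_claim(conn, "k1")  # first caller claims the key
value2, claimed2 = get_or_claim(conn, "k1")  # second caller sees the placeholder
```

Only the first caller gets claimed=True; the second sees the same placeholder value and skips the API call.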

Is postgresql sequence nextval consistent with insert order?

Given a table with id bigint default nextval('foo_sequence'),
can I assume the order of the ids is consistent with the insert order?
I mean, is a later-inserted id always greater than earlier-inserted ids?
I am trying to calculate and save a continuously incrementing sequence number for each row.
Here is how I did it:
SELECT count(*) as seq_no from foo where id < some_id;
// get the seq no
UPDATE foo SET seq_no = seq_no_above + 1 WHERE id = some_id;
But it sometimes gives duplicate seq_no values.
If the ids were consistent with the insert order, there should be no duplicates.
In the simplest and purest sense, yes. It depends on what you mean by "earlier" and "later", though, as you have to consider opening the transaction and closing the transaction. If a transaction has not been committed, then theoretically a record could show up later with an earlier ID.
The IDs are allocated when the insert happens, but the records will not show up until the records are committed. So if commit order is different, you may see some strange behavior depending on how strict your use case is.
For example:
Open Transaction A
Insert records 1,2
Open Transaction B
Insert records 3,4
Close transaction B
Select * (get 3,4)
Close transaction A
Select * (get 1,2,3,4)
You also have to consider caching when deciding whether the values are sequential. From the (very good) Postgres docs:
Furthermore, although multiple sessions are guaranteed to allocate distinct sequence values, the values might be generated out of sequence when all the sessions are considered. For example, with a cache setting of 10, session A might reserve values 1..10 and return nextval=1, then session B might reserve values 11..20 and return nextval=11 before session A has generated nextval=2. Thus, with a cache setting of one it is safe to assume that nextval values are generated sequentially; with a cache setting greater than one you should only assume that the nextval values are all distinct, not that they are generated purely sequentially. Also, last_value will reflect the latest value reserved by any session, whether or not it has yet been returned by nextval.
One last caveat is someone with appropriate privileges can always reset the sequence to a different value, which obviously would throw a wrench into things.
EDIT:
To address your use case above, you definitely want to use sequences (and likely add NOT NULL / PRIMARY KEY constraints as well, to ensure uniqueness). In pgAdmin, at least, you can do all of this by setting data type serial. Though I have mentioned caveats, for 99% of practical purposes, you get uniqueness and sequential ordering (hence sequences) the way that you want.
In any case, we would need to see example data to confirm why you are seeing duplication (how to create a reproducible example). I presume the duplication you are seeing is in seq_no and not id, which illustrates that the problem is your query. If duplication is in id, then you have other problems, and that would explain duplication in seq_no.
Sequences are much better for transactional definition in the data (they take care of uniqueness for you, perform well in concurrency, and do not cause duplication... plus you get sequential ordering for the most part). For unique keys, they are best used with NOT NULL and PRIMARY KEY or UNIQUE constraints.
But if you need a perfect increment, it is better to do something like the below:
select *, row_number() over (order by value) as id
from foo
;
Postgres window functions are very powerful, but are definitely not the standard to use for inserting data with sequential keys. They are more useful for reporting, analysis, and complex queries after the fact.
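For completeness, here is a tiny runnable sketch of the row_number() approach (shown with Python's sqlite3, which has supported window functions since SQLite 3.25; the SQL is the same in Postgres, and ordering by id here is just an assumption of the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (id INTEGER PRIMARY KEY, value TEXT)")
# id has a hole (2 is missing), as after an aborted insert burned a value
conn.executemany("INSERT INTO foo (id, value) VALUES (?, ?)",
                 [(1, "a"), (3, "b"), (4, "c")])

# row_number() yields a gapless 1..N numbering regardless of holes in id
rows = conn.execute(
    "SELECT id, row_number() OVER (ORDER BY id) AS seq_no FROM foo ORDER BY id"
).fetchall()
print(rows)  # [(1, 1), (3, 2), (4, 3)]
```

The numbering stays dense even though the ids have a gap, which is exactly what the count(*)-based UPDATE above was trying to achieve.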

Is it possible to catch a foreign key violation in postgres

I'm trying to insert data into a table which has a foreign key constraint. If there is a constraint violation in a row that I'm inserting, I want to chuck that data away.
The issue is that postgres returns an error every time I violate the constraint. Is it possible for me to have some statement in my insert statement like 'ON FOREIGN KEY CONSTRAINT DO NOTHING'?
EDIT:
This is the query that I'm trying to do, where info is a dict:
cursor.execute("INSERT INTO event (case_number_id, date, \
session, location, event_type, worker, result) VALUES \
(%(id_number)s, %(date)s, %(session)s, \
%(location)s, %(event_type)s, %(worker)s, %(result)s) ON CONFLICT DO NOTHING", info)
It errors out when there is a foreign key violation
If you're only inserting a single row at a time, you can create a savepoint before the insert and rollback to it when the insert fails (or release it when the insert succeeds).
For Postgres 9.5 or later, you can use INSERT ... ON CONFLICT DO NOTHING which does what it says. You can also write ON CONFLICT DO UPDATE SET column = value..., which will automagically convert your insert into an update of the row you are conflicting with (this functionality is sometimes called "upsert").
This does not work because OP is dealing with a foreign key constraint rather than a unique constraint. In that case, you can most easily use the savepoint method I described earlier, but for multiple rows it may prove tedious. If you need to insert multiple rows at once, it should be reasonably performant to split them into multiple insert statements, provided you are not working in autocommit mode, all inserts occur in one transaction, and you are not inserting a very large number of rows.
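The savepoint method can be sketched as follows. This demo uses Python's sqlite3 so it is self-contained (the parent/child tables are invented for the example); with psycopg2 against Postgres you would issue the same SAVEPOINT / ROLLBACK TO SAVEPOINT / RELEASE statements inside one transaction:

```python
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)  # manage transactions manually
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE parent (id INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE child (id INTEGER PRIMARY KEY,"
             " parent_id INTEGER REFERENCES parent(id))")
conn.execute("INSERT INTO parent (id) VALUES (1)")

rows = [(1, 1), (2, 999), (3, 1)]  # (2, 999) violates the foreign key

cur = conn.cursor()
cur.execute("BEGIN")
kept = 0
for row in rows:
    cur.execute("SAVEPOINT sp")
    try:
        cur.execute("INSERT INTO child (id, parent_id) VALUES (?, ?)", row)
        cur.execute("RELEASE SAVEPOINT sp")        # keep this row
        kept += 1
    except sqlite3.IntegrityError:
        cur.execute("ROLLBACK TO SAVEPOINT sp")    # discard only this row
        cur.execute("RELEASE SAVEPOINT sp")
cur.execute("COMMIT")
print(kept)  # 2
```

The offending row is chucked away while the surrounding transaction and the other two inserts survive.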
Sometimes, you really do need multiple inserts in a single statement, because the round-trip overhead of talking to your database plus the cost of having savepoints on every insert is simply too high. In this case, there are a number of imperfect approaches. Probably the least bad is to build a nested query which selects your data and joins it against the other table, something like this:
INSERT INTO table_A (column_A, column_B, column_C)
SELECT A_rows.*
FROM (VALUES (...)) AS A_rows(column_A, column_B, column_C)
JOIN table_B ON A_rows.column_B = table_B.column_B;

serial in postgres is being increased even though I added on conflict do nothing

I'm using Postgres 9.5 and seeing some weird things here.
I have a cron job running every 5 minutes, firing a SQL statement that adds a list of records if they don't exist yet.
INSERT INTO
sometable (customer, balance)
VALUES
(:customer, :balance)
ON CONFLICT (customer) DO NOTHING
sometable.customer is a primary key (text)
sometable structure is:
id: serial
customer: text
balance: bigint
Now it seems like every time this job runs, the id field is silently incremented by 1. So the next time I really add a record, its id is thousands of numbers above my last value. I thought this query checks for conflicts and, if so, does nothing, but currently it seems like it tries to insert the record, increments the id, and then stops.
Any suggestions?
The reason this feels weird to you is that you are thinking of the increment on the counter as part of the insert operation, and therefore the "DO NOTHING" ought to mean "don't increment anything". You're picturing this:
Check values to insert against constraint
If duplicate detected, abort
Increment sequence
Insert data
But in fact, the increment has to happen before the insert is attempted. A SERIAL column in Postgres is implemented as a DEFAULT which executes the nextval() function on a bound SEQUENCE. Before the DBMS can do anything with the data, it's got to have a complete set of columns, so the order of operations is like this:
Resolve default values, including incrementing the sequence
Check values to insert against constraint
If duplicate detected, abort
Insert data
This can be seen intuitively if the duplicate key is in the autoincrement field itself:
CREATE TABLE foo ( id SERIAL NOT NULL PRIMARY KEY, bar text );
-- Insert row 1
INSERT INTO foo ( bar ) VALUES ( 'test' );
-- Reset the sequence
SELECT setval(pg_get_serial_sequence('foo', 'id'), 0, true);
-- Attempt to insert row 1 again
INSERT INTO foo ( bar ) VALUES ( 'test 2' )
ON CONFLICT (id) DO NOTHING;
Clearly, this can't know if there's a conflict without incrementing the sequence, so the "do nothing" has to come after that increment.
As already said by @a_horse_with_no_name and @Serge Ballesta, serials are always incremented even if the INSERT fails.
You can try to "roll back" the serial value to the maximum id in use by changing the corresponding sequence:
SELECT setval('sometable_id_seq', MAX(id), true) FROM sometable;
As said by @a_horse_with_no_name, that is by design. Serial type fields are implemented under the hood through sequences, and for evident reasons, once you have gotten a new value from a sequence, you cannot roll back the last value. Imagine the following scenario:
sequence is at n
A requires a new value : got n+1
in a concurrent transaction B requires a new value: got n+2
for any reason A rolls back its transaction - would it be safe to reset the sequence?
That is the reason why sequences (and serial fields) just document that, in case of rolled-back transactions, holes can occur in the returned values. Only uniqueness is guaranteed.
Well, there is a technique that allows you to do stuff like that. It is called an insert mutex. It is old, old, old, but it works.
https://www.percona.com/blog/2011/11/29/avoiding-auto-increment-holes-on-innodb-with-insert-ignore/
The general idea is that you do an INSERT ... SELECT, and if your values are duplicates the SELECT returns no rows, which of course prevents the INSERT, so the sequence is not incremented. A bit mind-boggling, but perfectly valid and performant.
This of course completely bypasses ON CONFLICT, but you get back control over the sequence.
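A runnable sketch of the insert-mutex idea, using Python's sqlite3 for portability (in Postgres the statement is the same INSERT ... SELECT ... WHERE NOT EXISTS, against the serial column). Note this is not concurrency-safe by itself: two sessions can both pass the NOT EXISTS check, so you still keep the unique constraint as a backstop, and in that rare race a gap can still appear:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sometable (id INTEGER PRIMARY KEY AUTOINCREMENT,"
             " customer TEXT UNIQUE, balance INTEGER)")

def insert_if_absent(conn, customer, balance):
    # When the customer already exists, the SELECT yields no row,
    # so no insert is attempted and no id is consumed.
    with conn:  # commits on success
        conn.execute(
            "INSERT INTO sometable (customer, balance) "
            "SELECT ?, ? WHERE NOT EXISTS "
            "(SELECT 1 FROM sometable WHERE customer = ?)",
            (customer, balance, customer),
        )

for c, b in [("alice", 10), ("alice", 10), ("bob", 20)]:
    insert_if_absent(conn, c, b)

print(conn.execute("SELECT id, customer FROM sometable ORDER BY id").fetchall())
# [(1, 'alice'), (2, 'bob')]
```

The repeated 'alice' insert consumes no id, so 'bob' gets id 2 with no hole in between.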