serial in postgres is being increased even though I added on conflict do nothing - postgresql

I'm using Postgres 9.5 and seeing some weird things here.
I have a cron job running every 5 minutes that fires a SQL statement to add a list of records if they don't already exist.
INSERT INTO
sometable (customer, balance)
VALUES
(:customer, :balance)
ON CONFLICT (customer) DO NOTHING
sometable.customer is a primary key (text)
sometable structure is:
id: serial
customer: text
balance: bigint
Now it seems that every time this job runs, the id field is silently incremented by 1. So the next time I actually add a row, its id is thousands of numbers above my last value. I thought this query checks for conflicts and, if there is one, does nothing, but it seems that it tries to insert the record, increments the id, and then stops.
Any suggestions?

The reason this feels weird to you is that you are thinking of the increment on the counter as part of the insert operation, and therefore the "DO NOTHING" ought to mean "don't increment anything". You're picturing this:
Check values to insert against constraint
If duplicate detected, abort
Increment sequence
Insert data
But in fact, the increment has to happen before the insert is attempted. A SERIAL column in Postgres is implemented as a DEFAULT which executes the nextval() function on a bound SEQUENCE. Before the DBMS can do anything with the data, it's got to have a complete set of columns, so the order of operations is like this:
Resolve default values, including incrementing the sequence
Check values to insert against constraint
If duplicate detected, abort
Insert data
This can be seen intuitively if the duplicate key is in the autoincrement field itself:
CREATE TABLE foo ( id SERIAL NOT NULL PRIMARY KEY, bar text );
-- Insert row 1
INSERT INTO foo ( bar ) VALUES ( 'test' );
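-- Reset the sequence so that the next nextval() returns 1 again
-- (setval to 0 is rejected because the sequence's minimum value is 1)
SELECT setval(pg_get_serial_sequence('foo', 'id'), 1, false);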
-- Attempt to insert row 1 again
INSERT INTO foo ( bar ) VALUES ( 'test 2' )
ON CONFLICT (id) DO NOTHING;
Clearly, this can't know if there's a conflict without incrementing the sequence, so the "do nothing" has to come after that increment.
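The order of operations described above can be mimicked with a toy model in plain Python. This is only a sketch, not how Postgres is implemented: ToyTable and its method name are made up for illustration, and itertools.count stands in for the sequence bound to the serial column.

```python
import itertools


class ToyTable:
    """Toy model of a table with a serial id and a unique 'customer' column."""

    def __init__(self):
        self._sequence = itertools.count(1)  # stands in for the bound SEQUENCE
        self.rows = {}                       # customer -> (id, balance)

    def insert_on_conflict_do_nothing(self, customer, balance):
        # Step 1: resolve defaults first, which consumes a sequence value.
        new_id = next(self._sequence)
        # Step 2: only now check the unique constraint.
        if customer in self.rows:
            return None                      # DO NOTHING; new_id is discarded
        # Step 3: insert the complete row.
        self.rows[customer] = (new_id, balance)
        return new_id


def demo_sequence_burn():
    table = ToyTable()
    first = table.insert_on_conflict_do_nothing("alice", 100)
    for _ in range(3):                       # three cron runs that hit the conflict
        table.insert_on_conflict_do_nothing("alice", 100)
    second = table.insert_on_conflict_do_nothing("bob", 50)
    return first, second
```

Each conflicting run burns one value, so the next real row gets an id well past the previous one.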

As already said by #a_horse_with_no_name and #Serge Ballesta serials are always incremented even if INSERT fails.
You can try to "rollback" serial value to maximum id used by changing the corresponding sequence:
SELECT setval('sometable_id_seq', MAX(id), true) FROM sometable;

As said by #a_horse_with_no_name, that is by design. Serial type fields are implemented under the hood through sequences, and for evident reasons, once you have gotten a new value from a sequence, you cannot roll back the last value. Imagine the following scenario:
sequence is at n
A requires a new value : got n+1
in a concurrent transaction B requires a new value: got n+2
for any reason A rolls back its transaction: would you feel safe resetting the sequence?
That is why sequences (and serial fields) are documented as possibly leaving holes in the returned values when transactions are rolled back. Only uniqueness is guaranteed.
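To make the scenario concrete, here is a small sketch of what would go wrong if the sequence were rewound after A's rollback. ToySequence is a hypothetical stand-in for a real sequence, with nextval and setval reduced to their simplest form.

```python
class ToySequence:
    """Minimal stand-in for a sequence: a counter with nextval and setval."""

    def __init__(self, start=0):
        self.last_value = start

    def nextval(self):
        self.last_value += 1
        return self.last_value

    def setval(self, value):
        self.last_value = value


def rewind_after_rollback(n=10):
    seq = ToySequence(start=n)
    a_value = seq.nextval()        # transaction A gets n+1
    b_value = seq.nextval()        # concurrent transaction B gets n+2
    # A rolls back, and we naively "give back" its value:
    seq.setval(a_value - 1)
    later = [seq.nextval(), seq.nextval()]   # n+1, then n+2 again
    return b_value, later
```

The second value handed out after the rewind is the one B is already using, which is exactly the duplicate that never rolling back nextval avoids.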

Well, there is a technique that lets you do this. It is called an insert mutex. It is very old, but it works.
https://www.percona.com/blog/2011/11/29/avoiding-auto-increment-holes-on-innodb-with-insert-ignore/
The general idea is to do an INSERT ... SELECT: if your values would be duplicates, the SELECT returns no rows, so nothing is inserted and the auto-increment counter is not incremented. It is a bit mind-boggling, but perfectly valid and performant.
This of course bypasses ON DUPLICATE entirely, but you get back control over the counter.
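A rough sketch of that guarded INSERT ... SELECT, using Python's built-in sqlite3 as a stand-in so it runs without a server. The table mirrors the question's sometable; the helper name guarded_insert is made up. Against Postgres the same statement shape applies, but note that the NOT EXISTS check is not atomic across concurrent sessions, so the unique constraint still has to stay in place.

```python
import sqlite3

# The SELECT yields one row only when the customer is absent,
# so a duplicate never reaches the INSERT at all.
GUARDED_INSERT = (
    "INSERT INTO sometable (customer, balance) "
    "SELECT ?, ? "
    "WHERE NOT EXISTS (SELECT 1 FROM sometable WHERE customer = ?)"
)


def guarded_insert(conn, customer, balance):
    """Insert only when the customer is absent; return True if a row was added."""
    before = conn.total_changes
    conn.execute(GUARDED_INSERT, (customer, balance, customer))
    return conn.total_changes > before


def demo_guarded_insert():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sometable ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " customer TEXT UNIQUE NOT NULL,"
        " balance INTEGER)"
    )
    results = [
        guarded_insert(conn, "alice", 100),
        guarded_insert(conn, "alice", 100),   # duplicate: SELECT returns no row
        guarded_insert(conn, "bob", 50),
    ]
    ids = [row[0] for row in conn.execute("SELECT id FROM sometable ORDER BY id")]
    conn.close()
    return results, ids
```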


How can a relational database with foreign key constraints ingest data that may be in the wrong order?

The database is ingesting data from a stream, and all the rows needed to satisfy a foreign key constraint may be late or never arrive.
This can likely be accomplished by using another datastore, one without foreign key constraints, and then when all the needed data is available, read into the database which has fk constraints. However, this adds complexity and I'd like to avoid it.
We're working on a solution that creates "placeholder" rows to point the foreign key to. When the real data comes in, the placeholder is replaced with real values. Again, this adds complexity, but it's the best solution we've found so far.
How do people typically solve this problem?
Edit: Some sample data which might help explain the problem:
Let's say we have these tables:
CREATE TABLE "order" (
id INTEGER NOT NULL,
order_number INTEGER,
PRIMARY KEY (id),
UNIQUE (order_number)
);
CREATE TABLE line_item (
id INTEGER NOT NULL,
order_number INTEGER REFERENCES "order"(order_number),
PRIMARY KEY (id)
);
If I insert an order first, not a problem! But let's say I try:
INSERT INTO line_item (order_number) values (123) before order 123 was inserted. This will fail the fk constraint of course. But this might be the order I get the data, since it's reading from a stream that is collecting this data from multiple sources.
Also, to address #philpxy's question, I didn't really find much on this. One thing that was mentioned was deferred constraints. This is a mechanism that waits to do the fk constraints at the end of a transaction. I don't think it's possible to do that in my case however, since these insert statements will be run at random times whenever the data is received.
You have a business workflow problem, because line items of individual orders are coming in before the orders themselves have come in. One workaround, perhaps not ideal, would be to create a before insert trigger which checks, for every incoming insert to the line_item table, whether that order already exists in the order table. If not, then it will first insert the order record before trying the insert on line_item.
CREATE OR REPLACE FUNCTION "public"."fn_insert_order" () RETURNS trigger AS $$
BEGIN
-- Create the parent order first if it does not exist yet
INSERT INTO "order" (order_number)
SELECT NEW.order_number
WHERE NOT EXISTS (SELECT 1 FROM "order" WHERE order_number = NEW.order_number);
RETURN NEW;
END;
$$
LANGUAGE plpgsql;
-- trigger
CREATE TRIGGER "trigger_insert_order"
BEFORE INSERT ON line_item FOR EACH ROW
EXECUTE PROCEDURE fn_insert_order();
Note: I am assuming that the id column of the order table in fact is auto increment, in which case Postgres would automatically assign a value to it when inserting as above. Most likely, this is what you want, as having two id columns which both need to be manually assigned does not make much sense.
You could accomplish that with a BEFORE INSERT trigger on line_item.
In that trigger you query order if a matching item exists, and if not, you insert a dummy row.
That will allow the INSERT to succeed, at the cost of some performance.
To insert rows into order, use
INSERT INTO "order" ...
ON CONFLICT (order_number) DO UPDATE SET
id = EXCLUDED.id;
Updating a primary key is problematic and may lead to conflicts. One way you could get around that is if you use negative ids for artificially generated orders (assuming that the real ids are positive). If you have any references to that primary key, you'd have to define the constraint with ON UPDATE CASCADE.
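The placeholder idea can also be sketched on the application side. The following uses Python's sqlite3 purely as a runnable stand-in for Postgres; the is_placeholder column and the helper names are assumptions for illustration and do not appear in the question's schema.

```python
import sqlite3


def open_orders_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")   # SQLite enforces FKs only when asked
    conn.execute(
        'CREATE TABLE "order" ('
        " id INTEGER PRIMARY KEY,"
        " order_number INTEGER UNIQUE NOT NULL,"
        " is_placeholder INTEGER NOT NULL DEFAULT 0)"   # hypothetical marker column
    )
    conn.execute(
        "CREATE TABLE line_item ("
        " id INTEGER PRIMARY KEY,"
        ' order_number INTEGER REFERENCES "order"(order_number))'
    )
    return conn


def add_line_item(conn, order_number):
    # Create a placeholder parent if the order has not arrived yet.
    conn.execute(
        'INSERT OR IGNORE INTO "order" (order_number, is_placeholder) VALUES (?, 1)',
        (order_number,),
    )
    conn.execute("INSERT INTO line_item (order_number) VALUES (?)", (order_number,))


def add_order(conn, order_number):
    # The real order arrives: insert it, or promote an existing placeholder.
    conn.execute(
        'INSERT OR IGNORE INTO "order" (order_number) VALUES (?)', (order_number,)
    )
    conn.execute(
        'UPDATE "order" SET is_placeholder = 0 WHERE order_number = ?',
        (order_number,),
    )
```

With this shape the line item can arrive before or after its order, and the foreign key is satisfied either way.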

psql upsert results in noncontinuous id

I have a PostgreSQL (>9.5) table with primary key id and a unique key col. When I use
INSERT INTO table_a (col) VALUES('xxx') ON CONFLICT(col) DO NOTHING;
to perform an upsert, let's say a row with id 1 is generated.
If I run the SQL again, nothing happens, but id 2 is actually generated and abandoned.
Then if I insert a new record, for example,
INSERT INTO table_a (col) VALUES('yyy') ON CONFLICT(col) DO NOTHING;
another row with id 3 is generated and id 2 is wasted!
Is there any way to avoid this waste?
Presumably id is a serial. Under the hood this causes a nextval() call on a sequence. Once nextval() has returned a number, it will never return that number again. And the call to nextval() happens before checking for possible conflicts.
From "9.16. Sequence Manipulation Functions":
nextval
(...)
Important: To avoid blocking concurrent transactions that obtain numbers from the same sequence, a nextval operation is never rolled back; that is, once a value has been fetched it is considered used and will not be returned again. This is true even if the surrounding transaction later aborts, or if the calling query ends up not using the value. For example an INSERT with an ON CONFLICT clause will compute the to-be-inserted tuple, including doing any required nextval calls, before detecting any conflict that would cause it to follow the ON CONFLICT rule instead. Such cases will leave unused "holes" in the sequence of assigned values. Thus, PostgreSQL sequence objects cannot be used to obtain "gapless" sequences.
In short, the answer to your question is no: there is no way to avoid this unless you generate the values yourself somehow.

Postgres 9.3: Sharelock issue with simple INSERT

Update: Potential solution below
I have a large corpus of configuration files consisting of key/value pairs that I'm trying to push into a database. A lot of the keys and values are repeated across configuration files so I'm storing the data using 3 tables. One for all unique key values, one for all unique pair values, and one listing all the key/value pairs for each file.
Problem:
I'm using multiple concurrent processes (and therefore connections) to add the raw data into the database. Unfortunately I get a lot of detected deadlocks when trying to add values to the key and value tables. I have tried a few different methods of inserting the data (shown below), but always end up with a "deadlock detected" error:
TransactionRollbackError: deadlock detected DETAIL: Process 26755
waits for ShareLock on transaction 689456; blocked by process 26754.
Process 26754 waits for ShareLock on transaction 689467; blocked by
process 26755.
I was wondering if someone could shed some light on exactly what could be causing these deadlocks, and possibly point me towards some way of fixing the issue. Looking at the SQL statements I'm using (listed below), I don't really see why there is any co-dependency at all. Thanks for reading!
Example config file:
example_key this_is_the_value
other_example other_value
third example yet_another_value
Table definitions:
CREATE TABLE keys (
id SERIAL PRIMARY KEY,
hash UUID UNIQUE NOT NULL,
key TEXT);
CREATE TABLE values (
id SERIAL PRIMARY KEY,
hash UUID UNIQUE NOT NULL,
key TEXT);
CREATE TABLE keyvalue_pairs (
id SERIAL PRIMARY KEY,
file_id INTEGER REFERENCES filenames,
key_id INTEGER REFERENCES keys,
value_id INTEGER REFERENCES values);
SQL Statements:
Initially I was trying to use this statement to avoid any exceptions:
WITH s AS (
SELECT id, hash, key FROM keys
WHERE hash = 'hash_value'
), i AS (
INSERT INTO keys (hash, key)
SELECT 'hash_value', 'key_value'
WHERE NOT EXISTS (SELECT 1 FROM s)
returning id, hash, key
)
SELECT id, hash, key FROM i
UNION ALL
SELECT id, hash, key FROM s;
But even something as simple as this causes the deadlocks:
INSERT INTO keys (hash, key)
VALUES ('hash_value', 'key_value')
RETURNING id;
In both cases, if I get an exception thrown because the inserted hash
value is not unique, I use savepoints to rollback the change and
another statement to just select the id I'm after.
I'm using hashes for the unique field, as some of the keys and values
are too long to be indexed
Full example of the python code (using psycopg2) with savepoints:
import psycopg2

# Hypothetical wrapper: the original snippet shows only the function body.
# generate_uuid is the poster's own helper and is not shown here.
def get_key_id(cursor, key_value='this_key'):
    hash_val = generate_uuid(key_value)
    try:
        cursor.execute(
            '''
            SAVEPOINT duplicate_hash_savepoint;
            INSERT INTO keys (hash, key)
            VALUES (%s, %s)
            RETURNING id;
            ''',
            (hash_val, key_value)
        )
        result = cursor.fetchone()[0]
        cursor.execute('''RELEASE SAVEPOINT duplicate_hash_savepoint''')
        return result
    except psycopg2.IntegrityError as e:
        # The hash already exists: undo the failed insert and look the row up
        cursor.execute(
            '''
            ROLLBACK TO SAVEPOINT duplicate_hash_savepoint;
            '''
        )
        # TODO: Should ensure that values match and this isn't just
        # a hash collision
        cursor.execute(
            '''
            SELECT id FROM keys WHERE hash=%s LIMIT 1;
            ''',
            (hash_val,)
        )
        return cursor.fetchone()[0]
Update:
So I believe I found a hint on another Stack Exchange site.
Specifically:
UPDATE, DELETE, SELECT FOR UPDATE, and SELECT FOR SHARE commands
behave the same as SELECT in terms of searching for target rows: they
will only find target rows that were committed as of the command start
time. However, such a target row might have already been updated (or
deleted or locked) by another concurrent transaction by the time it is
found. In this case, the would-be updater will wait for the first
updating transaction to commit or roll back (if it is still in
progress). If the first updater rolls back, then its effects are
negated and the second updater can proceed with updating the
originally found row. If the first updater commits, the second updater
will ignore the row if the first updater deleted it, otherwise it
will attempt to apply its operation to the updated version of the row.
While I'm still not exactly sure where the co-dependency is, it seems that processing a large number of key/value pairs without committing would likely result in something like this. Sure enough, if I commit after each individual configuration file is added, the deadlocks don't occur.
It looks like you're in this situation:
The table to INSERT into has a primary key (or unique index(es) of any sort).
Several INSERTs into that table are performed within one transaction (as opposed to committing immediately after each one)
The rows to insert come in random order (with regard to the primary key)
The rows are inserted in concurrent transactions.
This situation creates the following opportunity for deadlock:
Assuming there are two sessions, that each started a transaction.
Session #1: insert row with PK 'A'
Session #2: insert row with PK 'B'
Session #1: try to insert row with PK 'B'
=> Session #1 is put to wait until Session #2 commits or rolls back
Session #2: try to insert row with PK 'A'
=> Session #2 is put to wait for Session #1.
Shortly thereafter, the deadlock detector notices that both sessions are now waiting for each other, and aborts one of them with a deadlock detected error.
If you're in this scenario, the simplest solution is to COMMIT after a new entry is inserted, before attempting to insert any new row into the table.
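The interleaving above can be replayed in a toy wait-for model. This is a simplified sketch and not Postgres's actual deadlock detector: find_deadlock and the step format are invented for illustration, and a step with key None is treated as a COMMIT by that session.

```python
def find_deadlock(steps):
    """Replay (session, key) insert attempts; return the waiting cycle or None.

    Toy model: the first session to insert a key holds it until it commits,
    and any other session inserting the same key waits for the holder.
    A step (session, None) means that session commits.
    """
    holder = {}      # key -> session with an uncommitted insert of that key
    waits_for = {}   # blocked session -> session it is waiting for
    for session, key in steps:
        if key is None:
            # Commit: release this session's keys and wake up its waiters.
            # (A real retry would then hit the unique constraint at once
            # instead of waiting.)
            holder = {k: s for k, s in holder.items() if s != session}
            waits_for = {w: s for w, s in waits_for.items() if s != session}
            continue
        owner = holder.setdefault(key, session)
        if owner == session:
            continue
        waits_for[session] = owner
        # Follow the chain of waits and see whether it leads back to us.
        chain = [session]
        current = owner
        while current in waits_for:
            chain.append(current)
            current = waits_for[current]
            if current == session:
                return chain
    return None


INTERLEAVED = [(1, "A"), (2, "B"), (1, "B"), (2, "A")]
COMMIT_EACH = [(1, "A"), (1, None), (2, "B"), (2, None),
               (1, "B"), (1, None), (2, "A"), (2, None)]
```

find_deadlock(INTERLEAVED) reports a cycle between the two sessions, while find_deadlock(COMMIT_EACH) reports none, because a session that commits after each insert never holds one key while waiting for another.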
To be honest, Postgres is known for this type of deadlock. I often encounter such problems when different workers update information about overlapping entities. Recently I had to import a big list of scientific paper metadata from multiple JSON files. I was using parallel processes via joblib to read from several files at the same time. Deadlocks kept occurring on the authors(id bigint primary key, name text) table because many files contained papers by the same authors, and therefore often produced inserts for the same authors. I was using insert into authors (id,name) values %s on conflict(id) do nothing, but that did not help. I tried sorting the tuples before sending them to the Postgres server, with little success. What really helped me was keeping a list of known authors in a Redis set (accessible to all processes):
if not rexecute("sismember", "known_authors", author_id):
    # your logic...
    rexecute("sadd", "known_authors", author_id)
I recommend this to everyone. Use Memurai if you are limited to Windows. Sad but true, there are not many other options for Postgres.

Using the serial datatype as a foreign key

Let's say that I have two tables.
The first is: table lists, with list_id SERIAL, list_name TEXT
The second table is, trivially, a table which says if the list is public: list_id INT, is_public INT
Obviously a bit of a contrived case, but I am planning out some tables and this seems to be an issue. If I insert a new list_name into table lists, then it'll give me a new serial number...but now I will need to use that serial number in the second table. Obviously in this case, you could simply add is_public to the first table, but in the case of a linking list where you have a compound key, you'll need to know the serial value that was returned.
How do people usually handle this? Do they get the return type from the insert using whatever system they're interacting with the database with?
One approach to this sort of thing is:
INSERT...
SELECT lastval()
INSERT...
INSERT into the first table, use lastval() to get the "value most recently obtained with nextval for any sequence" (in the current session), and then use that value to build your next INSERT.
There's also INSERT ... RETURNING:
The optional RETURNING clause causes INSERT to compute and return value(s) based on each row actually inserted. This is primarily useful for obtaining values that were supplied by defaults, such as a serial sequence number.
Using INSERT ... RETURNING id basically combines the first two steps above into one so you'd do:
INSERT ... RETURNING id
INSERT ...
where the second INSERT would use the id returned from the first INSERT.
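From application code the two-step pattern might look like the sketch below. It uses Python's sqlite3 so it can run anywhere, with cursor.lastrowid standing in for RETURNING id; with psycopg2 you would instead append RETURNING list_id to the first INSERT and read the value with cursor.fetchone()[0]. The table name list_visibility and the helper names are assumptions for illustration.

```python
import sqlite3


def open_lists_db():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE lists ("
        " list_id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " list_name TEXT)"
    )
    conn.execute(
        "CREATE TABLE list_visibility ("          # hypothetical name for the second table
        " list_id INTEGER REFERENCES lists(list_id),"
        " is_public INTEGER)"
    )
    return conn


def create_list(conn, list_name, is_public):
    # First INSERT: the database assigns the serial-like id.
    cur = conn.execute("INSERT INTO lists (list_name) VALUES (?)", (list_name,))
    list_id = cur.lastrowid                       # stand-in for RETURNING list_id
    # Second INSERT: reuse the generated id as the foreign key.
    conn.execute(
        "INSERT INTO list_visibility (list_id, is_public) VALUES (?, ?)",
        (list_id, int(is_public)),
    )
    conn.commit()
    return list_id
```

Doing both inserts in one transaction, as here, means a failure in the second step does not leave a list without its visibility row.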

Values missing in postgres serial field

I run a small site and use PostgreSQL 8.2.17 (only version available at my host) to store data.
In the last few months there were 3 crashes of the database system on my server, and every time it happened 31 IDs from a serial field (primary key) in one of the tables went missing. There are now 93 IDs missing.
Table:
CREATE TABLE "REGISTRY"
(
"ID" serial NOT NULL,
"strUID" character varying(11),
"strXml" text,
"intStatus" integer,
"strUIDOrg" character varying(11)
)
It is very important for me that all the ID values are there. What can I do to solve this problem?
You cannot expect a serial column to be free of holes.
You can implement a gapless key by sacrificing concurrency, like this:
create table registry_last_id (value int not null);
insert into registry_last_id values (-1);
create function next_registry_id() returns int language sql volatile
as $$
update registry_last_id set value=value+1 returning value
$$;
create table registry ( id int primary key default next_registry_id(), ... )
But any transaction that tries to insert into the registry table will block until the other inserting transaction finishes and writes its data to disk. This will limit you to no more than 125 inserting transactions per second on a 7500rpm disk drive (7500 / 60 = 125 rotations per second).
Also, any delete from the registry table will create a gap.
This solution is based on the article Gapless Sequences for Primary Keys by A. Elein Mustain, which is somewhat outdated.
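The reason the counter table stays gapless, unlike a sequence, is that the increment is an ordinary UPDATE and is therefore rolled back together with a failed insert. A minimal sketch of that behaviour, using Python's sqlite3 as a stand-in for Postgres; the payload column and the helper names are assumptions for illustration.

```python
import sqlite3


def open_registry_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE registry_last_id (value INTEGER NOT NULL)")
    conn.execute("INSERT INTO registry_last_id VALUES (-1)")
    conn.execute(
        "CREATE TABLE registry (id INTEGER PRIMARY KEY, payload TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def insert_registry(conn, payload):
    """Take the next id and insert the row in one transaction."""
    try:
        # The counter bump is part of the transaction, unlike nextval().
        conn.execute("UPDATE registry_last_id SET value = value + 1")
        new_id = conn.execute("SELECT value FROM registry_last_id").fetchone()[0]
        conn.execute(
            "INSERT INTO registry (id, payload) VALUES (?, ?)", (new_id, payload)
        )
        conn.commit()
        return new_id
    except sqlite3.Error:
        conn.rollback()            # also undoes the counter bump, so no gap
        raise
```

A failed insert (for example a NULL payload here) leaves the counter where it was, so the next successful insert gets the very next id.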
Are you missing 93 records or do you have 3 "holes" of 31 missing numbers?
A sequence is not transaction-safe: it never rolls back. Therefore it is not a way to generate a series of numbers without holes.
From the manual:
Important: To avoid blocking
concurrent transactions that obtain
numbers from the same sequence, a
nextval operation is never rolled
back; that is, once a value has been
fetched it is considered used, even if
the transaction that did the nextval
later aborts. This means that aborted
transactions might leave unused
"holes" in the sequence of assigned
values. setval operations are never
rolled back, either.
Thanks to the answers from Matthew Wood and Frank Heikens I think I have a solution.
Instead of using a serial field I have to create my own sequence and set the CACHE parameter to 1. This way Postgres will not cache values and each one will be taken directly from the sequence :)
Thanks for all your help :)