PostgreSQL SHARE ROW EXCLUSIVE lock vs pg_advisory_lock

I'm running Postgres 9.3 and have a table called 'tags', accessed through Python's psycopg2 module. It gets updated/inserted by two different methods, called 'update' and 'insert'. I also have several workers running concurrently, each of which calls either 'update' or 'insert'. Due to a uniqueness constraint, I'd like to lock the 'tags' table directly before I perform the inserts or updates, and I commit the transaction directly after.
So my code roughly looks like this (in psycopg2 parlance):
UPDATE:
cur.execute("LOCK TABLE tags IN SHARE ROW EXCLUSIVE MODE")
cur.execute("UPDATE tags SET ...")
cur.execute("DELETE FROM tags ....")
cur.execute("INSERT INTO tags ...")
connection.commit()
INSERT:
cur.execute("LOCK TABLE tags IN SHARE ROW EXCLUSIVE MODE")
cur.execute("DELETE FROM tags ....")
cur.execute("INSERT INTO tags ...")
connection.commit()
And my schema looks like
user_id varchar NOT NULL,
tag varchar NOT NULL,
time timestamptz,
CONSTRAINT unique_tag_key PRIMARY KEY (user_id, tag),
CONSTRAINT seen_before_user FOREIGN KEY (user_id)
REFERENCES user_id_table (user_id) MATCH SIMPLE
ON UPDATE NO ACTION ON DELETE NO ACTION
The problem I'm getting is that when I run concurrent workers I get deadlocks upon executing the share lock.
Weirdly though, if I replace the LOCK TABLE calls with a call like
cur.execute("SELECT pg_advisory_lock(tag_hash)")
where tag_hash is a hash on the tags table name 'tags', I get no such errors.
Why is it that I get errors with SHARE ROW EXCLUSIVE, but not with pg_advisory_lock? Are there any downsides to using pg_advisory_lock here if I can guarantee that the tags table never gets modified outside of these two methods?
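For illustration, here is one way the tag_hash mentioned above could be derived. This is only a sketch: the question does not show how the hash is computed, and the helper name advisory_lock_key is hypothetical. It assumes the single-argument form of pg_advisory_lock, which takes a bigint, so the key has to fit in a signed 64-bit integer. Python's built-in hash() is avoided because string hashes are randomized per process by default, so separate workers could end up with different keys.

```python
import hashlib


def advisory_lock_key(name):
    """Map a name to a stable signed 64-bit integer (hypothetical helper)."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    # The first 8 bytes, read as a signed integer, fit PostgreSQL's bigint.
    return int.from_bytes(digest[:8], byteorder="big", signed=True)


tag_hash = advisory_lock_key("tags")

# With a psycopg2 cursor the key would be passed as a parameter, e.g.:
#   cur.execute("SELECT pg_advisory_lock(%s)", (tag_hash,))
```

Every worker that computes the key this way gets the same value for 'tags', so they all contend for the same advisory lock.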

I think you are going about this the wrong way.
You should never lock an entire table in Postgres unless you are running DDL on it.
You already have a unique constraint on the fields in question.
Your best bet is to run the transaction in a try/except. If a constraint violation is raised, handle it properly; if a deadlock detected exception is raised, simply retry the transaction.
In general, Postgres is very good at handling locks without needing manual lock control, unless you are doing something very crazy.
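A minimal sketch of the try/except-and-retry approach, under some assumptions: connection is a psycopg2-style connection, work is a hypothetical callable that issues the UPDATE/DELETE/INSERT statements, and the function name run_transaction is made up for this example. It relies on psycopg2 errors exposing the SQLSTATE in their pgcode attribute; 40P01 is PostgreSQL's code for deadlock_detected. Any other error, including a unique violation (23505), is rolled back and re-raised so the caller can handle it.

```python
import time

DEADLOCK_DETECTED = "40P01"  # PostgreSQL SQLSTATE for deadlock_detected


def run_transaction(connection, work, max_attempts=5, backoff_seconds=0.05):
    """Run work(cursor) in one transaction, retrying when a deadlock is reported."""
    for attempt in range(1, max_attempts + 1):
        cursor = connection.cursor()
        try:
            result = work(cursor)
            connection.commit()
            return result
        except Exception as exc:
            # Always roll back so the connection is usable again.
            connection.rollback()
            code = getattr(exc, "pgcode", None)
            if code == DEADLOCK_DETECTED and attempt < max_attempts:
                # Short, growing pause before the next attempt.
                time.sleep(backoff_seconds * attempt)
                continue
            # Constraint violations and everything else go to the caller.
            raise
        finally:
            cursor.close()
```

The retry limit and the pause are arbitrary choices for the sketch; the point is only that the whole transaction is replayed, not just the statement that failed.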

Related

Out of shared memory when deleting rows with lots of incoming foreign keys

I develop a multi-tenancy application where we have a single master schema to keep track of tenants, along with 99 app databases to distribute load. Each of 33 tables within each app database also has a tenant column pointing to the master schema. This means there are 3,267 foreign keys pointing to the master schema's tenant id, and roughly 6000 triggers associated with the tenant table.
Recently, I added a table and started getting this error in the teardown portion of our test suite where we delete the test tenant:
psycopg2.errors.OutOfMemory: out of shared memory
HINT: You might need to increase max_locks_per_transaction.
CONTEXT: SQL statement "SELECT 1 FROM ONLY "test2"."item" x WHERE $1 OPERATOR(pg_catalog.=) "tenant" FOR KEY SHARE OF x"
For query
SET CONSTRAINTS ALL IMMEDIATE
Raising max_locks_per_transaction as suggested solves the problem, as does deleting some of the app schemas. The obvious solution here would be to reduce the number of redundant schemas or delete the foreign key constraints so we don't have to hold so many locks, but I'm curious if there's something else going on here.
I had imagined that only the rows to be deleted (associated with the test schema) would be locked, and so only the test schema would be locked. And anyway, by this point there is no data left pointing to the tenant table, so the locking is pretty much redundant in practice.
Update:
For more context, I'm not doing anything really fancy here. Below is a simplified example of what my schema and query look like:
CREATE SCHEMA master;
CREATE table master.tenant (id uuid NOT NULL PRIMARY KEY);
CREATE SCHEMA app_00;
CREATE table app_00.account (id uuid NOT NULL PRIMARY KEY, tenant uuid NOT NULL);
ALTER TABLE app_00.account ADD CONSTRAINT fk_tenant FOREIGN KEY (tenant) REFERENCES master.tenant(id) DEFERRABLE;
CREATE table app_00.item (id uuid NOT NULL PRIMARY KEY, tenant uuid NOT NULL);
ALTER TABLE app_00.item ADD CONSTRAINT fk_tenant FOREIGN KEY (tenant) REFERENCES master.tenant(id) DEFERRABLE;
In reality I'm creating 33 tables for each schema of app_00..99. Assuming my database is populated with data, the query that fails with the above error is:
DELETE FROM master.tenant WHERE id = 'some uuid';
You don't tell us much about the setup, but probably partitioning or inheritance are involved. These features often require that a statement recurse to table partitions or inheritance children, either during query planning or execution. At any rate, your SQL statements have to touch many tables.
Whenever PostgreSQL touches a table, it places a lock on it to avoid conflicting concurrent executions. If lots of tables are involved, the shared lock table, which has room for max_locks_per_transaction * (max_connections + max_prepared_transactions) entries, can be exhausted.
The solution is simply to increase max_locks_per_transaction. Don't worry, there is no negative consequence in raising that parameter; only a little more shared memory is allocated during server startup.
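As a rough back-of-the-envelope check, the size of the shared lock table can be compared with the number of tables the DELETE has to visit. The sketch below uses the PostgreSQL defaults (max_locks_per_transaction = 64, max_connections = 100, max_prepared_transactions = 0) as assumptions, because the question does not state the actual settings; the function name is made up for this example. How many locks each foreign key check really takes is also not stated in the thread, so the table count is only a lower bound.

```python
def shared_lock_table_slots(max_locks_per_transaction=64,
                            max_connections=100,
                            max_prepared_transactions=0):
    """Number of objects the shared lock table can track (defaults assumed)."""
    return max_locks_per_transaction * (max_connections + max_prepared_transactions)


# From the question: 99 app schemas with 33 tables each reference the tenant table.
referencing_tables = 99 * 33

print("tables touched by the delete:", referencing_tables)            # 3267
print("lock slots with default settings:", shared_lock_table_slots())  # 6400
print("lock slots with max_connections=20:",
      shared_lock_table_slots(max_connections=20))                     # 1280
```

If each check takes more than one lock, or if the test server runs with a lower max_connections than the default, the headroom shrinks quickly, which would be consistent with the error appearing after one more table was added.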

How to set Ignore Duplicate Key in Postgresql while table creation itself

I am creating a table in PostgreSQL 9.5 where id is the primary key. If anyone tries to insert a row with a duplicate id, I want it to be ignored instead of raising an exception. Is there any way to set this at table creation so that duplicate entries get ignored?
There are many techniques for resolving duplicate insertions when writing the insert query, e.g. using ON CONFLICT DO NOTHING or a WHERE EXISTS clause. But I want to handle this at table creation so that the person writing the insert query doesn't need to bother.
Creating a RULE is one possible solution. Are there other possible solutions? Maybe something like this:
`CREATE TABLE dbo.foo (bar int PRIMARY KEY WITH (FILLFACTOR=90, IGNORE_DUP_KEY = ON))`
Although this exact statement doesn't work on PostgreSQL 9.5 on my machine.
Add a BEFORE INSERT trigger, or a rule ON INSERT DO INSTEAD; otherwise it has to be handled by the inserting query. Both solutions will require more resources on each insert.
An alternative is to provide a function with arguments that checks for duplicates and performs the insert, so end users call the function instead of writing an INSERT statement.
A WHERE EXISTS sub-query is not atomic, by the way, so you can still get an exception after the check.
On 9.5, ON CONFLICT DO NOTHING is still the best solution.
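To show the effect of ON CONFLICT DO NOTHING without a running server, here is a small sketch that uses Python's built-in sqlite3 module as a stand-in. That is an assumption made purely for convenience: SQLite (3.24 or later) accepts the same clause, but the question is about PostgreSQL, where the statement would be sent through a driver such as psycopg2 with %s placeholders instead of ?. The table and column names are made up.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (id INTEGER PRIMARY KEY, label TEXT)")

rows = [(1, "first"), (2, "second"), (1, "duplicate of 1")]
for row in rows:
    # The duplicate id is skipped silently instead of raising an error.
    conn.execute(
        "INSERT INTO foo (id, label) VALUES (?, ?) ON CONFLICT (id) DO NOTHING",
        row,
    )
conn.commit()

stored = conn.execute("SELECT id, label FROM foo ORDER BY id").fetchall()
print(stored)  # [(1, 'first'), (2, 'second')]
```

Note that the clause still has to appear in every INSERT statement, which is exactly what the question hoped to avoid; the trigger, rule, and function options above are the ways to move that logic into the database.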

Is it possible to catch a foreign key violation in postgres

I'm trying to insert data into a table which has a foreign key constraint. If there is a constraint violation in a row that I'm inserting, I want to chuck that data away.
The issue is that postgres returns an error every time I violate the constraint. Is it possible for me to have some statement in my insert statement like 'ON FOREIGN KEY CONSTRAINT DO NOTHING'?
EDIT:
This is the query that I'm trying to do, where info is a dict:
cursor.execute("INSERT INTO event (case_number_id, date, \
session, location, event_type, worker, result) VALUES \
(%(id_number)s, %(date)s, %(session)s, \
%(location)s, %(event_type)s, %(worker)s, %(result)s) ON CONFLICT DO NOTHING", info)
It errors out when there is a foreign key violation.
If you're only inserting a single row at a time, you can create a savepoint before the insert and rollback to it when the insert fails (or release it when the insert succeeds).
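A minimal sketch of the savepoint approach, assuming a psycopg2-style cursor on a connection that is not in autocommit mode (savepoints only exist inside a transaction). The helper name insert_or_skip and the savepoint name are hypothetical. It relies on the driver exposing the SQLSTATE as pgcode on the exception; 23503 is PostgreSQL's code for foreign_key_violation.

```python
FOREIGN_KEY_VIOLATION = "23503"  # PostgreSQL SQLSTATE for foreign_key_violation


def insert_or_skip(cursor, sql, params, savepoint="before_insert"):
    """Run one INSERT; return True if it succeeded, False if it broke a foreign key."""
    cursor.execute("SAVEPOINT " + savepoint)
    try:
        cursor.execute(sql, params)
    except Exception as exc:
        if getattr(exc, "pgcode", None) != FOREIGN_KEY_VIOLATION:
            # Not the error we are prepared to ignore; let the caller decide.
            raise
        # Undo only the failed insert; earlier work in the transaction survives.
        cursor.execute("ROLLBACK TO SAVEPOINT " + savepoint)
        return False
    cursor.execute("RELEASE SAVEPOINT " + savepoint)
    return True
```

With the query from the question, a call could look like insert_or_skip(cursor, "INSERT INTO event (...) VALUES (...)", info), followed by a single connection.commit() once all rows have been attempted.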
For Postgres 9.5 or later, you can use INSERT ... ON CONFLICT DO NOTHING which does what it says. You can also write ON CONFLICT DO UPDATE SET column = value..., which will automagically convert your insert into an update of the row you are conflicting with (this functionality is sometimes called "upsert").
This does not work because OP is dealing with a foreign key constraint rather than a unique constraint. In that case, you can most easily use the savepoint method I described earlier, but for multiple rows it may prove tedious. If you need to insert multiple rows at once, it should be reasonably performant to split them into multiple insert statements, provided you are not working in autocommit mode, all inserts occur in one transaction, and you are not inserting a very large number of rows.
Sometimes, you really do need multiple inserts in a single statement, because the round-trip overhead of talking to your database plus the cost of having savepoints on every insert is simply too high. In this case, there are a number of imperfect approaches. Probably the least bad is to build a nested query which selects your data and joins it against the other table, something like this:
INSERT INTO table_A (column_A, column_B, column_C)
SELECT A_rows.*
FROM (VALUES (...)) AS A_rows(column_A, column_B, column_C)
JOIN table_B ON A_rows.column_B = table_B.column_B;

Postgres 9.3: Sharelock issue with simple INSERT

Update: Potential solution below
I have a large corpus of configuration files consisting of key/value pairs that I'm trying to push into a database. A lot of the keys and values are repeated across configuration files so I'm storing the data using 3 tables. One for all unique key values, one for all unique pair values, and one listing all the key/value pairs for each file.
Problem:
I'm using multiple concurrent processes (and therefore connections) to add the raw data into the database. Unfortunately I get a lot of detected deadlocks when trying to add values to the key and value tables. I have tried a few different methods of inserting the data (shown below), but always end up with a "deadlock detected" error:
TransactionRollbackError: deadlock detected DETAIL: Process 26755
waits for ShareLock on transaction 689456; blocked by process 26754.
Process 26754 waits for ShareLock on transaction 689467; blocked by
process 26755.
I was wondering if someone could shed some light on exactly what could be causing these deadlocks, and possibly point me towards some way of fixing the issue. Looking at the SQL statements I'm using (listed below), I don't really see why there is any co-dependency at all. Thanks for reading!
Example config file:
example_key this_is_the_value
other_example other_value
third example yet_another_value
Table definitions:
CREATE TABLE keys (
id SERIAL PRIMARY KEY,
hash UUID UNIQUE NOT NULL,
key TEXT);
CREATE TABLE values (
id SERIAL PRIMARY KEY,
hash UUID UNIQUE NOT NULL,
key TEXT);
CREATE TABLE keyvalue_pairs (
id SERIAL PRIMARY KEY,
file_id INTEGER REFERENCES filenames,
key_id INTEGER REFERENCES keys,
value_id INTEGER REFERENCES values);
SQL Statements:
Initially I was trying to use this statement to avoid any exceptions:
WITH s AS (
SELECT id, hash, key FROM keys
WHERE hash = 'hash_value'
), i AS (
INSERT INTO keys (hash, key)
SELECT 'hash_value', 'key_value'
WHERE NOT EXISTS (SELECT 1 FROM s)
returning id, hash, key
)
SELECT id, hash, key FROM i
UNION ALL
SELECT id, hash, key FROM s;
But even something as simple as this causes the deadlocks:
INSERT INTO keys (hash, key)
VALUES ('hash_value', 'key_value')
RETURNING id;
In both cases, if I get an exception thrown because the inserted hash
value is not unique, I use savepoints to rollback the change and
another statement to just select the id I'm after.
I'm using hashes for the unique field, as some of the keys and values
are too long to be indexed
Full example of the python code (using psycopg2) with savepoints:
import psycopg2

# The function name is illustrative; the original excerpt only showed the body.
def get_or_create_key_id(cursor, key_value):
    # generate_uuid is the poster's own helper (not shown).
    hash_val = generate_uuid(key_value)
    try:
        cursor.execute(
            '''
            SAVEPOINT duplicate_hash_savepoint;
            INSERT INTO keys (hash, key)
            VALUES (%s, %s)
            RETURNING id;
            ''',
            (hash_val, key_value)
        )
        result = cursor.fetchone()[0]
        cursor.execute('''RELEASE SAVEPOINT duplicate_hash_savepoint''')
        return result
    except psycopg2.IntegrityError as e:
        # The hash already exists: undo the failed insert only.
        cursor.execute(
            '''
            ROLLBACK TO SAVEPOINT duplicate_hash_savepoint;
            '''
        )
        #TODO: Should ensure that values match and this isn't just
        #a hash collision
        cursor.execute(
            '''
            SELECT id FROM keys WHERE hash=%s LIMIT 1;
            ''',
            (hash_val,)
        )
        return cursor.fetchone()[0]
Update:
So I believe I found a hint on another Stack Exchange site.
Specifically:
UPDATE, DELETE, SELECT FOR UPDATE, and SELECT FOR SHARE commands
behave the same as SELECT in terms of searching for target rows: they
will only find target rows that were committed as of the command start
time. However, such a target row might have already been updated (or
deleted or locked) by another concurrent transaction by the time it is
found. In this case, the would-be updater will wait for the first
updating transaction to commit or roll back (if it is still in
progress). If the first updater rolls back, then its effects are
negated and the second updater can proceed with updating the
originally found row. If the first updater commits, the second updater
will ignore the row if the first updater deleted it, otherwise it
will attempt to apply its operation to the updated version of the row.
While I'm still not exactly sure where the co-dependency is, it seems that processing a large number of key/value pairs without committing would likely result in something like this. Sure enough, if I commit after each individual configuration file is added, the deadlocks don't occur.
It looks like you're in this situation:
The table to INSERT into has a primary key (or unique index(es) of any sort).
Several INSERTs into that table are performed within one transaction (as opposed to committing immediately after each one)
The rows to insert come in random order (with regard to the primary key)
The rows are inserted in concurrent transactions.
This situation creates the following opportunity for deadlock:
Assuming there are two sessions, that each started a transaction.
Session #1: insert row with PK 'A'
Session #2: insert row with PK 'B'
Session #1: try to insert row with PK 'B'
=> Session #1 is put to wait until Session #2 commits or rolls back
Session #2: try to insert row with PK 'A'
=> Session #2 is put to wait for Session #1.
Shortly thereafter, the deadlock detector notices that both sessions are now waiting for each other, and aborts one of the transactions with a deadlock detected error.
If you're in this scenario, the simplest solution is to COMMIT after a new entry is inserted, before attempting to insert any new row into the table.
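The interleaving above can be replayed in a toy model to make the cycle visible. This is only an illustration under simplifying assumptions: it is not how PostgreSQL implements its deadlock detector, each session waits for at most one other session, nobody ever commits, and the function names are made up.

```python
def replay(steps):
    """Replay (session, key) inserts; return who ends up waiting for whom."""
    holders = {}    # key -> session that inserted it and has not committed
    waits_for = {}  # blocked session -> session it is waiting on
    for session, key in steps:
        owner = holders.setdefault(key, session)
        if owner != session:
            # Same unique key, other transaction still open: this session blocks.
            waits_for[session] = owner
    return waits_for


def find_deadlock(waits_for):
    """Return the sessions that form a wait cycle, or None if there is none."""
    for start in waits_for:
        path, seen, current = [start], {start}, start
        while current in waits_for:
            current = waits_for[current]
            if current == start:
                return path
            if current in seen:
                break
            seen.add(current)
            path.append(current)
    return None


# The scenario from the answer: keys arrive in opposite order in the two sessions.
crossed = replay([(1, "A"), (2, "B"), (1, "B"), (2, "A")])
print(find_deadlock(crossed))   # [1, 2]

# Same keys, same order in both sessions: session 2 waits, but there is no cycle.
ordered = replay([(1, "A"), (2, "A"), (1, "B"), (2, "B")])
print(find_deadlock(ordered))   # None
```

In this model, committing after each insert removes the entry from holders straight away, so no session stays blocked on another for long; that corresponds to the fix suggested above.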
Postgres is known for this type of deadlock, to be honest. I often encounter such problems when different workers update information about overlapping entities. Recently I had to import a big list of scientific paper metadata from multiple JSON files. I was using parallel processes via joblib to read from several files at the same time. Deadlocks kept occurring on the authors(id bigint primary key, name text) table because many files contained papers by the same authors, and therefore produced inserts for the same authors. I was using insert into authors (id,name) values %s on conflict(id) do nothing, but that was not helping. I tried sorting tuples before sending them to the Postgres server, with little success. What really helped me was keeping a list of known authors in a Redis set (accessible to all processes):
if not rexecute("sismember", "known_authors", author_id):
    # your logic...
    rexecute("sadd", "known_authors", author_id)
Which I recommend to everyone. Use Memurai if you are limited to Windows. Sad but true, not a lot of other options for Postgres.

Way to migrate a create table with sequence from postgres to DB2

I need to migrate a DDL from Postgres to DB2, and it has to behave the same as it does in Postgres. There is a table that generates values from a sequence, but the values can also be given explicitly.
Postgres
create sequence hist_id_seq;
create table benchmarksql.history (
hist_id integer not null default nextval('hist_id_seq') primary key,
h_c_id integer,
h_c_d_id integer,
h_c_w_id integer,
h_d_id integer,
h_w_id integer,
h_date timestamp,
h_amount decimal(6,2),
h_data varchar(24)
);
(Look at the sequence call in the hist_id column to define the value of the primary key)
The business logic inserts into the table by explicitly providing an ID, and in other cases, it leaves the database to choose the number.
If I change this in DB2 to a GENERATED ALWAYS it will throw errors because there are some provided values. On the other side, if I create the table with GENERATED BY DEFAULT, DB2 will throw an error when trying to insert with the same value (SQL0803N), because the "internal sequence" does not take into account the already inserted values, and it does not retry with a next value.
And, I do not want to restart the sequence each time a provided ID was inserted.
This is the problem in BenchmarkSQL when trying to port it to DB2: https://sourceforge.net/projects/benchmarksql/ (File sqlTableCreates)
How can I implement the same database logic in DB2 as it does in Postgres (and apparently in Oracle)?
You're operating under a misconception: that sources external to the db get to dictate its internal keys. Ideally/conceptually, autogenerated ids will never need to be seen outside of the db, as conceptually there should be unique natural keys for export or reporting. Still, there are times when applications will need to manage some ids, often when setting up related entities (eg, JPA seems to want to work this way).
However, if you add an id value that you generated from a different source, the db won't be able to manage it. How could it? It's not efficient - for one thing, attempting to do so would do one of the following
Be unsafe in the face of multiple clients (attempt to add duplicate keys)
Serialize access to the table (for a potentially slow query, too)
(This usually shows up when people attempt something like: SELECT MAX(id) + 1, which would require locking the entire table for thread safety, likely including statements that don't even touch that column. If you try to find any "first-unused" id - trying to fill gaps - this gets more complicated and problematic)
Neither is ideal, so it's best to not have the problem in the first place. This is usually done by having id columns be autogenerated, but (as pointed out earlier) there are situations where we may need to know what the id will be before we insert the row into the table. Fortunately, there's a standard SQL object for this, SEQUENCE. This provides a db-managed, thread-safe, fast way to get ids. It appears that in PostgreSQL you can use sequences in the DEFAULT clause for a column, but DB2 doesn't allow it. If you don't want to specify an id every time (it should be autogenerated some of the time), you'll need another way; this is the perfect time to use a BEFORE INSERT trigger:
CREATE TRIGGER Add_Generated_Id NO CASCADE BEFORE INSERT ON benchmarksql.history
REFERENCING NEW AS Incoming_Entity
FOR EACH ROW
WHEN (Incoming_Entity.hist_id IS NULL)
SET Incoming_Entity.hist_id = NEXT VALUE FOR hist_id_seq
(something like this - not tested. You didn't specify where in the project this would belong)
So, if you then add a row with something like:
INSERT INTO benchmarksql.history (hist_id, h_data) VALUES(null, 'a')
or
INSERT INTO benchmarksql.history (h_data) VALUES('a')
an id will be generated and attached automatically. Note that ALL ids added to the table must come from the given sequence (as @mustaccio pointed out, this appears to be true even in PostgreSQL), or any UNIQUE CONSTRAINT on the column will start throwing duplicate-key errors. So any time your application needs an id before inserting a row in the table, you'll need some form of
SELECT NEXT VALUE FOR hist_id_seq
FROM sysibm.sysdummy1
... and that's it, pretty much. This is completely thread and concurrency safe, will not maintain/require long-term locks, nor require serialized access to the table.