How can I deal with a race condition in PostgreSQL? - postgresql

I've been getting a handful of errors on Postgresql, that seem to be tied to this race condition.
I have a process/daemon written in Twisted Python. The easiest way to describe it is as a Web Crawler - it pulls a page, parses the links, and logs what it's seen. Because of HTTP blocking, Twisted runs multiple "concurrent" processes deferred to threads.
Here's the race condition...
When I encounter a url shortener , this logic happens:
result= """SELECT * FROM shortened_link WHERE ( url_shortened = %(url)s ) LIMIT 1;"""
if result:
pass
else:
result= """INSERT INTO shortened_link ( url_shortened ..."
A surprising number or psycopg2.IntegrityError's are raised, because the unique index on url_shortened gets violated.
The select/insert does actually run that close together. From what I can tell, it looks like 2 shortened links get queued next to one another.
Process A: Select, returns Null
Process B: Select, returns Null
Process A: Insert , success
Process B: Insert , integrity error
Can anyone suggest any tips/tricks to handle this ? I'd like to avoid explicit locking, because I know that'll open up a whole other set of problems.

Do it all in a single command:
result= """
INSERT INTO shortened_link ( url_shortened ...
SELECT %(url)s
where not exists (
select 1
from shortened_link
WHERE url_shortened = %(url)s
);"""
It will only insert if that link does not exist.

There's really not a solution that avoids the need to be able to handle the possibility of a unique constraint violation error. If your framework can't do it then I'd wrap the SQL in a PL/pgSQL function or procedure that can.
Given that you can handle the error you might as well not test for the existence of the unique value and just attempt the insert, letting any error be handled by the EXCEPTION clause.

You either need a mutex lock of some kind, or you will have to live with the redundancies that will occur due to the race condition.
If you choose to go with the mutex lock - you don't necessarily need to use a database-level lock. You can simply lock down the Twisted process to block other threads handling a similar shortened url.
If you choose to avoid the lock, remove the unique constraint on the url_shortened field. Periodically, you can move these records to a 'clean' table that contains a single unique copy of each shortened url.

Related

Should I lock a PostgreSQL table when invoking setval for sequence with the max ID function?

I have the following SQL script which sets the sequence value corresponding to max value of the ID column:
SELECT SETVAL('mytable_id_seq', COALESCE(MAX(id), 1)) FROM mytable;
Should I lock 'mytable' in this case in order to prevent changing ID in a parallel request, such in the example below?
request #1 request #2
MAX(id)=5
inserted id 6
SETVAL=5
Or setval(max(id)) is an atomic operation?
Your suspicion is right, this approach is subject to race conditions.
But locking the table won't help, because it won't keep a concurrent transaction from fetching new sequence values. This transaction will block while the table is locked, but will happily continue inserting once the lock is gone, using a sequence value it got while the table was locked.
If it were possible to lock sequences, that might be a solution, but it is not possible to lock sequences.
I can think of two solutions:
Remove all privileges on the sequence while you modify it, so that concurrent requests to the sequence will fail. That causes errors, of course.
The pragmatic way: use
SELECT SETVAL('mytable_id_seq', COALESCE(MAX(id), 1) + 100000) FROM mytable;
Here 100000 is a value that is safely bigger than the number rows that might get inserted while your operatoin is running.
You can use two requests in the same transaction:
ALTER SEQUENCE mytable_id_seq RESTART;
SELECT SETVAL('mytable_id_seq', COALESCE(MAX(id), 1)) FROM mytable;
Note: the first command will lock the sequence for other transactions

how to lock a table for writing

I would like to lock a table for writing during a period of time, while leaving it available for reading.
Is that possible ?
Ideally I would like to lock the table with a predicate (for example prevent writing rows "where country = france").
If you really want to lock against such inserts, i.e. the query should hang and only continue when you allow it, you would have to place a SHARE lock on the table and keep the transaction open.
This is usually not a good idea.
If you want to prevent any such inserts, i.e. throw an error when such an insert is attempted, create a BEFORE INSERT trigger that throws an exception if the NEW row satisfies the condition.
You can use FOR SHARE lock, which blocks other transactions from performing like UPDATE and DELETE, while allowing SELECT FOR SHARE. (Read the docs for details: https://www.postgresql.org/docs/9.4/explicit-locking.html [13.3.2])
For example, there are 2 processes accessing table user_table, in the following sequence:
Process A: BEGIN;
Process A: SELECT username FROM user_table WHERE country = france FOR SHARE;
Process B: SELECT * FROM user_table FOR SHARE; (In here, process B can still read all the rows of the table)
Process B: UPDATE user_table SET username = 'test' WHERE country = france; (In here, process B is blocked and is waiting for process A to finish its transaction)

How can pgsql sequence be undefined when I just called nextval?

I've got an app built on top of PostgresQL, which makes use of a custom sequence. I think I understand sequences pretty well by now: they are non-transactional, currval is defined only within the current session, etc. But I don't understand this:
2015-10-13 10:37:16 SQLSelect: SELECT nextval('commit_id_seq')
2015-10-13 10:37:16 commit_id_seq: 57
2015-10-13 10:37:16 SQLExecute: UPDATE bid SET is_archived=false,company_id=1436,contact_id=15529,...(etc)...,sharing_policy='' WHERE id = 56229
2015-10-13 10:37:16 ERROR: ERROR: currval of sequence "commit_id_seq" is not yet defined in this session
CONTEXT: SQL statement "INSERT INTO history (table_name, record_id, sec_user_id, created, action, notes, status, before, after, commit_id)
SELECT TG_TABLE_NAME, rec.id, (SELECT id FROM sec_user WHERE name = CURRENT_USER), now(), SUBSTR(TG_OP,1,1), note, stat, oldH, newH, currval('commit_id_seq')"
PL/pgSQL function log_to_history() line 28 at SQL statement
[3]
We log every call to the database, and in the case of the SELECT nextval, I also log the result. The above are the exact calls, except that I trimmed the UPDATE statement (because the original is really long).
So, you can see that we just called nextval on the sequence, got a reasonable number back, and then we do an UPDATE that invokes a trigger function that attempts to use currval on that sequence... and it fails, claiming currval is not defined.
Note that this doesn't usually happen, but once it does start happening, it does so consistently (perhaps until the user disconnects from the DB).
How can this be? And what can I do about it?
Your UPDATE statement obviously calls a trigger. The most plausible cause of this error is that the trigger function is in a different schema from where the sequence is defined and the schema of the sequence is not in the search_path. That gives you two options to resolve this:
Make the schema of the sequence visible to the trigger function using SET search_path TO .... Note that this will make all objects in the schema of the sequence visible, which may be something of a security risk, depending on your database design.
Schema-qualify the sequence name in the trigger function: currval('my_schema.commit_id_seq').
Another plausible cause is connection pooling at your application end. Log the "session ID" (really just the starting time and pid of the current session) by adding %c to your log_line_prefix() parameter in postgresql.conf. In PostgreSQL every command runs in its own transaction unless a transaction is explicitly established. Connection pooling software also works at the transaction level (i.e. you start a transaction and then your connection will stay open until you close it, outside of a transaction there are no guarantees about session persistence). If that is the case you can wrap your entire set of commands in a BEGIN ... COMMIT block (you should probably use a specific call from your pooling software), or better yet, change your code to not depend on a previous nextval() call.

DB2 deadlock timeout Sqlstate: 40001, reason code 68 due to update statements called from servlet using SQL

I am calling update statements one after the other from a servlet to DB2. I am getting error sqlstate 40001, reason code 68 which i found it is due to deadlock timeout.
How can I resolve this issue?
Can it be resolved by setting query timeout?
If yes then how to use it with update statements in servlet or where to use it?
The reason code 68 already tells you this is due to a lock timeout (deadlock is reason code 2) It could be due to other users running queries at the same time that use the same data you are accessing, or your own multiple updates.
Begin by running db2pd -db locktest -locks show detail from a db2 command line to see where the locks are. You'll then need to run something like:
select tabschema, tabname, tableid, tbspaceid
from syscat.tables where tbspaceid = # and tableid = #
filling in the # symbols with the ID number you get from the db2pd command output.
Once you see where the locks are, here are some tips:
◦Deadlock frequency can sometimes be reduced by ensuring that all applications access their common data in the same order – meaning, for example, that they access (and therefore lock) rows in Table A, followed by Table B, followed by Table C, and so on.
taken from: http://publib.boulder.ibm.com/infocenter/db2luw/v9r7/topic/com.ibm.db2.luw.admin.trb.doc/doc/t0055074.html
recommended reading: http://www.ibm.com/developerworks/data/library/techarticle/dm-0511bond/index.html
Addendum: if your servlet or another guilty application is using select statements found to be involved in the deadlock, you can try appending with ur to the select statements if accuracy of the newly updated (or inserted) data isn't important.
For me, the solution was adding FOR READ ONLY WITH UR at the end of all my SELECT statements. (Apparently my select statements were returning so much data, it locked the tables long enough to interfere with other SQL statements)
See https://www.ibm.com/support/knowledgecenter/SSEPEK_10.0.0/sqlref/src/tpc/db2z_sql_isolationclause.html

How do I do large non-blocking updates in PostgreSQL?

I want to do a large update on a table in PostgreSQL, but I don't need the transactional integrity to be maintained across the entire operation, because I know that the column I'm changing is not going to be written to or read during the update. I want to know if there is an easy way in the psql console to make these types of operations faster.
For example, let's say I have a table called "orders" with 35 million rows, and I want to do this:
UPDATE orders SET status = null;
To avoid being diverted to an offtopic discussion, let's assume that all the values of status for the 35 million columns are currently set to the same (non-null) value, thus rendering an index useless.
The problem with this statement is that it takes a very long time to go into effect (solely because of the locking), and all changed rows are locked until the entire update is complete. This update might take 5 hours, whereas something like
UPDATE orders SET status = null WHERE (order_id > 0 and order_id < 1000000);
might take 1 minute. Over 35 million rows, doing the above and breaking it into chunks of 35 would only take 35 minutes and save me 4 hours and 25 minutes.
I could break it down even further with a script (using pseudocode here):
for (i = 0 to 3500) {
db_operation ("UPDATE orders SET status = null
WHERE (order_id >" + (i*1000)"
+ " AND order_id <" + ((i+1)*1000) " + ")");
}
This operation might complete in only a few minutes, rather than 35.
So that comes down to what I'm really asking. I don't want to write a freaking script to break down operations every single time I want to do a big one-time update like this. Is there a way to accomplish what I want entirely within SQL?
Column / Row
... I don't need the transactional integrity to be maintained across
the entire operation, because I know that the column I'm changing is
not going to be written to or read during the update.
Any UPDATE in PostgreSQL's MVCC model writes a new version of the whole row. If concurrent transactions change any column of the same row, time-consuming concurrency issues arise. Details in the manual. Knowing the same column won't be touched by concurrent transactions avoids some possible complications, but not others.
Index
To avoid being diverted to an offtopic discussion, let's assume that
all the values of status for the 35 million columns are currently set
to the same (non-null) value, thus rendering an index useless.
When updating the whole table (or major parts of it) Postgres never uses an index. A sequential scan is faster when all or most rows have to be read. On the contrary: Index maintenance means additional cost for the UPDATE.
Performance
For example, let's say I have a table called "orders" with 35 million
rows, and I want to do this:
UPDATE orders SET status = null;
I understand you are aiming for a more general solution (see below). But to address the actual question asked: This can be dealt with in a matter milliseconds, regardless of table size:
ALTER TABLE orders DROP column status
, ADD column status text;
The manual (up to Postgres 10):
When a column is added with ADD COLUMN, all existing rows in the table
are initialized with the column's default value (NULL if no DEFAULT
clause is specified). If there is no DEFAULT clause, this is merely a metadata change [...]
The manual (since Postgres 11):
When a column is added with ADD COLUMN and a non-volatile DEFAULT
is specified, the default is evaluated at the time of the statement
and the result stored in the table's metadata. That value will be used
for the column for all existing rows. If no DEFAULT is specified,
NULL is used. In neither case is a rewrite of the table required.
Adding a column with a volatile DEFAULT or changing the type of an
existing column will require the entire table and its indexes to be
rewritten. [...]
And:
The DROP COLUMN form does not physically remove the column, but
simply makes it invisible to SQL operations. Subsequent insert and
update operations in the table will store a null value for the column.
Thus, dropping a column is quick but it will not immediately reduce
the on-disk size of your table, as the space occupied by the dropped
column is not reclaimed. The space will be reclaimed over time as
existing rows are updated.
Make sure you don't have objects depending on the column (foreign key constraints, indices, views, ...). You would need to drop / recreate those. Barring that, tiny operations on the system catalog table pg_attribute do the job. Requires an exclusive lock on the table which may be a problem for heavy concurrent load. (Like Buurman emphasizes in his comment.) Baring that, the operation is a matter of milliseconds.
If you have a column default you want to keep, add it back in a separate command. Doing it in the same command applies it to all rows immediately. See:
Add new column without table lock?
To actually apply the default, consider doing it in batches:
Does PostgreSQL optimize adding columns with non-NULL DEFAULTs?
General solution
dblink has been mentioned in another answer. It allows access to "remote" Postgres databases in implicit separate connections. The "remote" database can be the current one, thereby achieving "autonomous transactions": what the function writes in the "remote" db is committed and can't be rolled back.
This allows to run a single function that updates a big table in smaller parts and each part is committed separately. Avoids building up transaction overhead for very big numbers of rows and, more importantly, releases locks after each part. This allows concurrent operations to proceed without much delay and makes deadlocks less likely.
If you don't have concurrent access, this is hardly useful - except to avoid ROLLBACK after an exception. Also consider SAVEPOINT for that case.
Disclaimer
First of all, lots of small transactions are actually more expensive. This only makes sense for big tables. The sweet spot depends on many factors.
If you are not sure what you are doing: a single transaction is the safe method. For this to work properly, concurrent operations on the table have to play along. For instance: concurrent writes can move a row to a partition that's supposedly already processed. Or concurrent reads can see inconsistent intermediary states. You have been warned.
Step-by-step instructions
The additional module dblink needs to be installed first:
How to use (install) dblink in PostgreSQL?
Setting up the connection with dblink very much depends on the setup of your DB cluster and security policies in place. It can be tricky. Related later answer with more how to connect with dblink:
Persistent inserts in a UDF even if the function aborts
Create a FOREIGN SERVER and a USER MAPPING as instructed there to simplify and streamline the connection (unless you have one already).
Assuming a serial PRIMARY KEY with or without some gaps.
CREATE OR REPLACE FUNCTION f_update_in_steps()
RETURNS void AS
$func$
DECLARE
_step int; -- size of step
_cur int; -- current ID (starting with minimum)
_max int; -- maximum ID
BEGIN
SELECT INTO _cur, _max min(order_id), max(order_id) FROM orders;
-- 100 slices (steps) hard coded
_step := ((_max - _cur) / 100) + 1; -- rounded, possibly a bit too small
-- +1 to avoid endless loop for 0
PERFORM dblink_connect('myserver'); -- your foreign server as instructed above
FOR i IN 0..200 LOOP -- 200 >> 100 to make sure we exceed _max
PERFORM dblink_exec(
$$UPDATE public.orders
SET status = 'foo'
WHERE order_id >= $$ || _cur || $$
AND order_id < $$ || _cur + _step || $$
AND status IS DISTINCT FROM 'foo'$$); -- avoid empty update
_cur := _cur + _step;
EXIT WHEN _cur > _max; -- stop when done (never loop till 200)
END LOOP;
PERFORM dblink_disconnect();
END
$func$ LANGUAGE plpgsql;
Call:
SELECT f_update_in_steps();
You can parameterize any part according to your needs: the table name, column name, value, ... just be sure to sanitize identifiers to avoid SQL injection:
Table name as a PostgreSQL function parameter
Avoid empty UPDATEs:
How do I (or can I) SELECT DISTINCT on multiple columns?
Postgres uses MVCC (multi-version concurrency control), thus avoiding any locking if you are the only writer; any number of concurrent readers can work on the table, and there won't be any locking.
So if it really takes 5h, it must be for a different reason (e.g. that you do have concurrent writes, contrary to your claim that you don't).
You should delegate this column to another table like this:
create table order_status (
order_id int not null references orders(order_id) primary key,
status int not null
);
Then your operation of setting status=NULL will be instant:
truncate order_status;
First of all - are you sure that you need to update all rows?
Perhaps some of the rows already have status NULL?
If so, then:
UPDATE orders SET status = null WHERE status is not null;
As for partitioning the change - that's not possible in pure sql. All updates are in single transaction.
One possible way to do it in "pure sql" would be to install dblink, connect to the same database using dblink, and then issue a lot of updates over dblink, but it seems like overkill for such a simple task.
Usually just adding proper where solves the problem. If it doesn't - just partition it manually. Writing a script is too much - you can usually make it in a simple one-liner:
perl -e '
for (my $i = 0; $i <= 3500000; $i += 1000) {
printf "UPDATE orders SET status = null WHERE status is not null
and order_id between %u and %u;\n",
$i, $i+999
}
'
I wrapped lines here for readability, generally it's a single line. Output of above command can be fed to psql directly:
perl -e '...' | psql -U ... -d ...
Or first to file and then to psql (in case you'd need the file later on):
perl -e '...' > updates.partitioned.sql
psql -U ... -d ... -f updates.partitioned.sql
I am by no means a DBA, but a database design where you'd frequently have to update 35 million rows might have… issues.
A simple WHERE status IS NOT NULL might speed up things quite a bit (provided you have an index on status) – not knowing the actual use case, I'm assuming if this is run frequently, a great part of the 35 million rows might already have a null status.
However, you can make loops within the query via the LOOP statement. I'll just cook up a small example:
CREATE OR REPLACE FUNCTION nullstatus(count INTEGER) RETURNS integer AS $$
DECLARE
i INTEGER := 0;
BEGIN
FOR i IN 0..(count/1000 + 1) LOOP
UPDATE orders SET status = null WHERE (order_id > (i*1000) and order_id <((i+1)*1000));
RAISE NOTICE 'Count: % and i: %', count,i;
END LOOP;
RETURN 1;
END;
$$ LANGUAGE plpgsql;
It can then be run by doing something akin to:
SELECT nullstatus(35000000);
You might want to select the row count, but beware that the exact row count can take a lot of time. The PostgreSQL wiki has an article about slow counting and how to avoid it.
Also, the RAISE NOTICE part is just there to keep track on how far along the script is. If you're not monitoring the notices, or do not care, it would be better to leave it out.
Are you sure this is because of locking? I don't think so and there's many other possible reasons. To find out you can always try to do just the locking. Try this:
BEGIN;
SELECT NOW();
SELECT * FROM order FOR UPDATE;
SELECT NOW();
ROLLBACK;
To understand what's really happening you should run an EXPLAIN first (EXPLAIN UPDATE orders SET status...) and/or EXPLAIN ANALYZE. Maybe you'll find out that you don't have enough memory to do the UPDATE efficiently. If so, SET work_mem TO 'xxxMB'; might be a simple solution.
Also, tail the PostgreSQL log to see if some performance related problems occurs.
I would use CTAS:
begin;
create table T as select col1, col2, ..., <new value>, colN from orders;
drop table orders;
alter table T rename to orders;
commit;
Some options that haven't been mentioned:
Use the new table trick. Probably what you'd have to do in your case is write some triggers to handle it so that changes to the original table also go propagated to your table copy, something like that... (percona is an example of something that does it the trigger way). Another option might be the "create a new column then replace the old one with it" trick, to avoid locks (unclear if helps with speed).
Possibly calculate the max ID, then generate "all the queries you need" and pass them in as a single query like update X set Y = NULL where ID < 10000 and ID >= 0; update X set Y = NULL where ID < 20000 and ID > 10000; ... then it might not do as much locking, and still be all SQL, though you do have extra logic up front to do it :(
PostgreSQL version 11 handles this for you automatically with the Fast ALTER TABLE ADD COLUMN with a non-NULL default feature. Please do upgrade to version 11 if possible.
An explanation is provided in this blog post.