Table locking in a plpgsql function - postgresql

Let's say I've written plpgsql function that does the following:
CREATE OR REPLACE FUNCTION foobar (_foo_data_id bigint)
RETURNS bigint AS $$
BEGIN
DROP TABLE IF EXISTS tmp_foobar;
CREATE TEMP TABLE tmp_foobar AS
SELECT *
FROM foo_table ft
WHERE ft.foo_data_id = _foo_data_id;
-- more SELECT queries on unrelated tables
-- a final SELECT query that invokes tmp_foobar
END;
First question:
If I simultaneously invoked this function twice, is it possible for the second invocation of foobar() to drop the tmp_foobar table while the first invocation of foobar() is still running?
I understand that SELECT statements create an ACCESS SHARE lock, but will that lock persist until the SELECT statement completes or until the implied COMMIT at the end of the function?
Second question:
If the latter is true, will the second invocation of foobar() indefinitely re-try DROP TABLE IF EXISTS tmp_foobar; until the lock is dropped or will it fail at some point?

If you simultaneously invoke a function twice, it means you're using two separate sessions to do so. Temporary tables are not shared between sessions, so the second session would not "see" tmp_foobar from the first session, and there would be no interaction. See http://www.postgresql.org/docs/9.2/static/sql-createtable.html#AEN70605 ("Temporary tables").
Locks persist until the end of the transaction (regardless of how you acquire them; exception are advisory locks, but that's not what you're doing.)
The second question does not need an answer, because the premise is false.
One more thing. It might be useful to create indexes on that temporary table of yours, and ANALYZE it; that might cause the final query to be faster.

Related

How do I chain a VACUUM off of a purge routine running with pg_cron?

Postgres 13.4
I've got some pg_cron jobs set up to periodically delete older records out of log-like files. What I'd like to do is to run VACUUM ANALYZE after performing a purge. Unfortunately, I can't work out how to do this in a stored function. Am I missing a trick? Is a stored procedure more appropriate?
As an example, here's one of my purge routines
CREATE OR REPLACE FUNCTION dba.purge_event_log (
retain_days_in integer_positive default 14)
RETURNS int4
AS $BODY$
WITH -- Use a CTE so that we've got a way of returning the count easily.
deleted AS (
-- Normal-looking code for this requires a literal:
-- where your_dts < now() - INTERVAL '14 days'
-- Don't want to use a literal, SQL injection, etc.
-- Instead, using a interval constructor to achieve the same result:
DELETE
FROM dba.event_log
WHERE dts < now() - make_interval (days => $1)
RETURNING *
),
----------------------------------------
-- Save details to a custom log table
----------------------------------------
logit AS (
insert into dba.event_log (name, details)
values ('purge_event_log(' || retain_days_in::text || ')',
'count = ' || (select count(*)::text from deleted)
)
)
----------------------------------------
-- Return result count
----------------------------------------
select count(*) from deleted;
$BODY$
LANGUAGE sql;
COMMENT ON FUNCTION dba.purge_event_log (integer_positive) IS
'Delete dba.event_log records older than the day count passed in, with a default retention period of 14 days.';
The truth is, I don't really care about the count(*) result from this routine, in this case. But I might want a result and an additional action in some other, similar context. As you can see, the routine deletes records, uses a CTE to insert a report into another table, and then returns a result. No matter what, I figure this example is a good way to get me head around the alternatives and options in stored functions. The main thing I want to achieve here is to delete records, and then run maintenance. if this is an awkward fit for a stored function or procedure, I could write out an entry to a vacuum_list table with the table name, and have another job to run though that list.
If there's a smarter way to approach vacuum without the extra, I'm of course interested in that. But I'm also interested in understanding the limits on what operationa you can combine in PL/PgSQL routines.
Pavel Stehule' answer is correct and complete. I decided to follow-up a bit here as I like to dig in on bugs in my code, behaviors in Postgres, etc. to get a better sense of what I'm dealing with. I'm including some notes below for anyone who finds them of use.
COMMAND cannot be executed...
The reference to "VACUUM cannot be executed inside a transaction block" gave me a better way to search the docs for similarly restricted commands. The information below probably doesn't cover everything, but it's a start.
Command Limitation
CREATE DATABASE
ALTER DATABASE If creating a new table space.
DROP DATABASE
CLUSTER Without any parameters.
CREATE TABLESPACE
DROP TABLESPACE
REINDEX All in system catalogs, database, or schema.
CREATE SUBSCRIPTION When creating a replication slot (the default behavior.)
ALTER SUBSCRIPTION With refresh option as true.
DROP SUBSCRIPTION If the subscription is associated with a replication slot.
COMMIT PREPARED
ROLLBACK PREPARED
DISCARD ALL
VACUUM
The accepted answer indicates that the limitation has nothing to do with the specific server-side language used. I've just come across an older thread that has some excellent explanations and links for stored functions and transactions:
Do stored procedures run in database transaction in Postgres?
Sample Code
I also wondered about stored procedures, as they're allowed to control transactions. I tried them out in PG 13 and, no, the code is treated like a stored function, down to the error messages.
For anyone that goes in for this sort of thing, here are the "hello world" samples of sQL and PL/PgSQL stored functions and procedures to test out how VACCUM behaves in these cases. Spoiler: It doesn't work, as advertised.
SQL Function
/*
select * from dba.vacuum_sql_function();
Fails:
ERROR: VACUUM cannot be executed from a function
CONTEXT: SQL function "vacuum_sql_function" statement 1. 0.000 seconds. (Line 13).
*/
DROP FUNCTION IF EXISTS dba.vacuum_sql_function();
CREATE FUNCTION dba.vacuum_sql_function()
RETURNS VOID
LANGUAGE sql
AS $sql_code$
VACUUM ANALYZE activity;
$sql_code$;
select * from dba.vacuum_sql_function(); -- Fails.
PL/PgSQL Function
/*
select * from dba.vacuum_plpgsql_function();
Fails:
ERROR: VACUUM cannot be executed from a function
CONTEXT: SQL statement "VACUUM ANALYZE activity"
PL/pgSQL function vacuum_plpgsql_function() line 4 at SQL statement. 0.000 seconds. (Line 22).
*/
DROP FUNCTION IF EXISTS dba.vacuum_plpgsql_function();
CREATE FUNCTION dba.vacuum_plpgsql_function()
RETURNS VOID
LANGUAGE plpgsql
AS $plpgsql_code$
BEGIN
VACUUM ANALYZE activity;
END
$plpgsql_code$;
select * from dba.vacuum_plpgsql_function();
SQL Procedure
/*
call dba.vacuum_sql_procedure();
ERROR: VACUUM cannot be executed from a function
CONTEXT: SQL function "vacuum_sql_procedure" statement 1. 0.000 seconds. (Line 20).
*/
DROP PROCEDURE IF EXISTS dba.vacuum_sql_procedure();
CREATE PROCEDURE dba.vacuum_sql_procedure()
LANGUAGE SQL
AS $sql_code$
VACUUM ANALYZE activity;
$sql_code$;
call dba.vacuum_sql_procedure();
PL/PgSQL Procedure
/*
call dba.vacuum_plpgsql_procedure();
ERROR: VACUUM cannot be executed from a function
CONTEXT: SQL statement "VACUUM ANALYZE activity"
PL/pgSQL function vacuum_plpgsql_procedure() line 4 at SQL statement. 0.000 seconds. (Line 23).
*/
DROP PROCEDURE IF EXISTS dba.vacuum_plpgsql_procedure();
CREATE PROCEDURE dba.vacuum_plpgsql_procedure()
LANGUAGE plpgsql
AS $plpgsql_code$
BEGIN
VACUUM ANALYZE activity;
END
$plpgsql_code$;
call dba.vacuum_plpgsql_procedure();
Other Options
Plenty. As I understand it, VACUUM, and a handful of other commands, are not supported in server-side code running within Postgres. Therefore, you code needs to start from somewhere else. That can be:
Whatever cron you've got in your server's OS.
Any exteral client you like.
pg_cron.
As we're deployed on RDS, those last two options are where I'll look. And there's one more:
Let AUTOVACCUM and an occasional VACCUM do their thing.
That's pretty easy to do, and seems to work fine for the bulk of our needs.
Another Idea
If you do want a bit more control and some custom logging, I'm imagining a table like this:
CREATE TABLE IF NOT EXISTS dba.vacuum_list (
database_name text,
schema_name text,
table_name text,
run boolean,
run_analyze boolean,
run_full boolean,
last_run_dts timestamp)
ALTER TABLE dba.vacuum_list ADD CONSTRAINT
vacuum_list_pk
PRIMARY KEY (database_name, schema_name, table_name);
That's just a sketch. The idea is like this:
You INSERT into vacuum_list when a table needs some vacuuming, at least as far as you're concerned.
In my case, that would be an UPSERT as I don't need a full log-like table, just a single row per table of interest with the last outcome and/or pending state.
Periodically, a remote client, etc. connects, reads the table, and executes each specified VACUUM, according to the options specified in the record.
The external client updates the row with the last run timestamp, and whatever else you're including in the row.
Optionally, you could include fields for duration and change in relation size pre:post vacuuming.
That last option is what I'm interested in. None of our VACUUM calls were working for quite some time as there was a months-old dead connection from something sever-side. VACUUM appears to run fine, in such a case, it just can't delete a whole lot of rows. (Because of the super old "open" transaction ID, visibility maps, etc.) The only way to see this sort of thing seems to be to VACUUM VERBOSE and study the output. Or to record vacuum time and, more important, relation size change to flag cases where nothing seems to happen, when it seems like it should.
VACUUM is "top level" command. It cannot be executed from PL/pgSQL ever or from any other PL.

Which explicit lock to use for a trigger?

I am trying to understand which type of a lock to use for a trigger function.
Simplified function:
CREATE OR REPLACE FUNCTION max_count() RETURNS TRIGGER AS
$$
DECLARE
max_row INTEGER := 6;
association_count INTEGER := 0;
BEGIN
LOCK TABLE my_table IN ROW EXCLUSIVE MODE;
SELECT INTO association_count COUNT(*) FROM my_table WHERE user_id = NEW.user_id;
IF association_count > max_row THEN
RAISE EXCEPTION 'Too many rows';
END IF;
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE CONSTRAINT TRIGGER my_max_count
AFTER INSERT OR UPDATE ON my_table
DEFERRABLE INITIALLY DEFERRED
FOR EACH ROW
EXECUTE PROCEDURE max_count();
I initially was planning to use EXCLUSIVE but it feels too heavy. What I really want is to ensure that during this function execution no new rows are added to the table with concerned user_id.
If you want to prevent concurrent transactions from modifying the table, a SHARE lock would be correct. But that could lead to a deadlock if two such transactions run at the same time — each has modified some rows and is blocked by the other one when it tries to escalate the table lock.
Moreover, all table locks that conflict with SHARE UPDATE EXCLUSIVE will lead to autovacuum cancelation, which will cause table bloat when it happens too often.
So stay away from table locks, they are usually the wrong thing.
The better way to go about this is to use no explicit locking at all, but to use the SERIALIZABLE isolation level for all transactions that access this table.
Then you can simply use your trigger (without lock), and no anomalies can occur. If you get a serialization error, repeat the transaction.
This comes with a certain performance penalty, but allows more concurrency than a table lock. It also avoids the problems described in the beginning.

Is it safe to use temporary tables when an application may try to create them for independent, but simultaneous processes?

I am hoping that I can articulate this effectively, so here it goes:
I am creating a model which will be run on a platform by users, possibly simultaneously, but each model run is marked by a unique integer identifier. This model will execute a series of PostgreSQL queries and eventually write a result elswehere.
Now because of the required parallelization of model runs, I have to make sure that the processes will not collide, despite running in the same database. I am at a point now where I have to store a list of records, sorted by a score variable and then operate on them. This is the beginning of the query:
DO
$$
DECLARE row RECORD;
BEGIN
DROP TABLE IF EXISTS ranked_clusters;
CREATE TEMP TABLE ranked_clusters AS (
SELECT
pl.cluster_id AS c_id,
SUM(pl.total_area) AS cluster_score
FROM
emob.parking_lots AS pl
WHERE
pl.cluster_id IS NOT NULL
AND
run_id = 2005149
GROUP BY
pl.cluster_id
ORDER BY
cluster_score DESC
);
FOR row IN SELECT c_id FROM ranked_clusters LOOP
RAISE NOTICE 'Cluster %', row.c_id;
END LOOP;
END;
$$ LANGUAGE plpgsql;
So I create a temporary table called ranked_clusters and then iterate through it, at the moment just logging the identifiers of each record.
I have been careful to only build this list from records which have a run_id value equal to a certain number, so data from the same source, but with a different number will be ignored.
What I am worried about however is that a simultaneous process will also create its own ranked_clusters temporary table, which will collide with the first one, invalidating the results.
So my question is essentially this: Are temporary tables only visible to the session which creates them (or to the cursor object from say, Python)? And is it therefore safe to use a temporary table in this way?
The main reason I ask is because I see that these so-called "temporary" tables seem to persist after I execute the query in PgAdmin III, and the query fails on the next execution because the table already exists. This troubles me because it seems as though the tables are actually globally accessible during their lifetime and would therefore introduce the possibility of a collision when a simultaneous run occurs.
Thanks #a_horse_with_no_name for the explanation but I am not yet convinced that it is safe, because I have been able to execute the following code:
import psycopg2 as pg2
conn = pg2.connect(dbname=CONFIG["GEODB_NAME"],
user=CONFIG["GEODB_USER"],
password=CONFIG["GEODB_PASS"],
host=CONFIG["GEODB_HOST"],
port=CONFIG["GEODB_PORT"])
conn.autocommit = True
cur = conn.cursor()
conn2 = pg2.connect(dbname=CONFIG["GEODB_NAME"],
user=CONFIG["GEODB_USER"],
password=CONFIG["GEODB_PASS"],
host=CONFIG["GEODB_HOST"],
port=CONFIG["GEODB_PORT"])
conn2.autocommit = True
cur2 = conn.cursor()
cur.execute("CREATE TEMPORARY TABLE temptable (tempcol INTEGER); INSERT INTO temptable VALUES (0);")
cur2.execute("SELECT tempcol FROM temptable;")
print(cur2.fetchall())
And I receive the value in temptable despite it being created as a temporary table in a completely different connection as the one which queries it afterwards. Am I missing something here? Because it seems like the temporary table is indeed accessible between connections.
The above had a typo, Both cursors were actually being spawned from conn, rather than one from conn and another from conn2. Individual connections in psycopg2 are not able to access each other's temporary tables, but cursors spawned from the same connection are.
Temporary tables are only visible to the session (=connection) that created them. Even if two sessions create the same table, they won't interfere with each other.
Temporary tables are removed automatically when the session is disconnected.
If you want to automatically remove them when your transaction ends, use the ON COMMIT DROP option when creating the table.
So the answer is: yes, this is safe.
Unrelated, but: you can't store rows "in a sorted way". Rows in a table have no implicit sort order. The only way you can get a guaranteed sort order is to use an ORDER BY when selecting the rows. The order by that is part of your CREATE TABLE AS statement is pretty much useless.
If you have to rely on the sort order of the rows, the only safe way to do that is in the SELECT statement:
FOR row IN SELECT c_id FROM ranked_clusters ORDER BY cluster_score
LOOP
RAISE NOTICE 'Cluster %', row.c_id;
END LOOP;

PostgreSQL BEFORE INSERT trigger locking behavior in a concurrent environment

I have a general function that can manipulate the sequence of any table (why is irrelevant to my question). It reads the current value, works out the new value, sets it, and returns its calculation, which is what's inserted. This is obviously a multi-step process.
I call it from a BEFORE INSERT trigger on tables where I need it.
All I need to know is am I guaranteed that the function will be called by only one caller at a time in a multi-user environment?
Specifically, does the BEFORE INSERT trigger have to complete before it is called again by another caller?
Logically, I would assume yes, but one never knows what may be going on under the hood.
If the answer is no, what minimal locking would I need on the function to guarantee I can read and write the sequence in a "thread-safe" manner?
I'm using PG 10.
EDIT
Here is the function updated with a lock:
CREATE OR REPLACE FUNCTION public.uts_set()
RETURNS TRIGGER AS
$$
DECLARE
sv int8;
seq text := format('%I.%I_uts_seq', tg_table_schema, tg_table_name);
BEGIN
EXECUTE format('LOCK TABLE %I IN ROW EXCLUSIVE MODE;', tg_table_name);
EXECUTE 'SELECT last_value+1 FROM ' || seq INTO sv; -- currval(seq) isn't useable
PERFORM setval(seq, GREATEST(sv, (EXTRACT(epoch FROM localtimestamp) * 1000000)::int8), false);
RETURN NULL;
END;
$$ LANGUAGE plpgsql;
However, a SELECT already acquires ROW EXCLUSIVE, so this statement may be redundant and a stronger lock may be needed. Or, conversely, it may mean no lock is needed.
UPDATE
If I am reading this SO question correctly, my original version without the LOCK should work since the trigger acquires the same lock my updated function is redundantly taking.
All I need to know is am I guaranteed that the function will be called by only one caller at a time in a multi-user environment?
No. Not related to calling functions itself, but you can achieve this behaviour with SERIALIZABLE transaction isolation level:
This level emulates serial transaction execution for all committed
transactions; as if transactions had been executed one after another,
serially, rather than concurrently
But this approach would introduce several tradeoffs, such preparing your application to retry transactions with serialization failure.
Maybe a missed something, but I really believe that you just need NEXTVAL, something like below:
CREATE OR REPLACE FUNCTION public.uts_set()
RETURNS TRIGGER AS
$$
DECLARE
sv int8;
-- First, use %I wildcard for identifiers instead of %s
seq text := format('%I.%I', tg_table_schema, tg_table_name || '_uts_seq');
BEGIN
-- Second, you couldn't call CURRVAL on a session
-- that you didn't issued NEXTVAL before
sv := NEXTVAL(seq);
-- Do your logic here...
-- Result is ignored since this is an STATEMENT trigger
RETURN NULL;
END;
$$ LANGUAGE plpgsql;
Remember that CURRVAL acts on session local scope and NEXTVAL on global scope, so you have a reliable thread-safe mechanism in hands.
The sequence itself handles thread safety with concurrent sessions. So it real comes down to the code that is interacting with the sequence. The following code is thread safe:
SELECT nextval('myseq');
If the sequence is doing much fancier things like setval and currval, I would be more worried about that being done in a high transaction/multi-user environment. Even so, the sequence itself should be locked from other queries while the sequence is being manipulated.

How to lock(exclusive) a record during a function call in pgsql?

In pgsql, how may I lock a record during a function run? considering the following function.
create or replace function foo.bar_func(int)returns int as $$
with s as (select * from foo.bar where id=$1);-- <--lock the fetched row [BEGIN]
/*
Some query, update, insert, ...
*/
select coalesce(s.id,-1) from s;--return something.
-- <-- release the locked row [END]
$$ language sql;
I like to lock the row(if found) at the begin of function, till it finishes its work.
How does pg_advisory_lock(bigint) work? does it help here?what is the difference with select for update?
SELECT … FOR UPDATE does what you expect, namely locks the returned rows exclusively until the end of the current transaction (see here).
Advisory locks, on the other hand, are application defined. They are held either until the end of the current transaction or until the end of the current session (see here). Thus, they need to be checked manually and you may need to release them manually.
If you want to use variables (like s in your sample code), you have to use PL/pgSQL. However, there doesn't seem to be a way to make your function transactional. Instead, it will be always executed in the context of the surrounding transaction. Adding an EXCEPTION clause to your function causes the function to be wrapped in a subtransaction (see here), but locks acquired by your function will be held until the end of the surrounding transaction. I tested with PG 9.3.