I have three tables: film, show and room. I want to insert new shows for a film in a given room, and I thought I could do it with a trigger that checks that there are no time collisions with the existing ones.
I wrote this:
CREATE OR REPLACE TRIGGER insert_new_show
BEFORE INSERT ON show
FOR EACH ROW
DECLARE
-- pragma autonomous_transaction;
collisioni varchar(40);
runtime INT;
error1 EXCEPTION;
error2 EXCEPTION;
BEGIN
IF :new.time < CURRENT_TIMESTAMP THEN
RAISE error1;
END IF;
SELECT runtime INTO runtime
FROM film
WHERE title = deref(:new.di_film).title;
SELECT film.title INTO collisioni
FROM show JOIN film ON deref(show.di_film).title=film.title
WHERE DEREF(room).nome = DEREF(:NEW.room).nome
AND ( (:new.time < show.time AND :new.time + runtime + 10 > show.time)
OR (:new.time > show.time
AND :new.time < show.time + runtime + 10));
IF NOT SQL%NOTFOUND THEN
INSERT INTO show
SELECT to_timestamp(:NEW.orario,'yyyy-mm-dd hh24:mi:ss')
, :NEW.max_n_spot,:NEW.costo, REF(s), NULL, REF(f)
FROM film f, room s
WHERE f.title=deref(:NEW.di_film).title
AND s.nome=deref(:NEW.room).nome;
ELSE
RAISE error2;
END IF;
EXCEPTION
WHEN error1 THEN
RAISE_APPLICATION_ERROR(-20491,'Error');
WHEN error2 THEN
RAISE_APPLICATION_ERROR(-20491,'Error');
END;
If I don't use the "pragma autonomous_transaction", I get a "table ... is mutating" error. But, of course, with the autonomous transaction the trigger can't see the :new values.
I also thought about doing the check with a CHECK constraint, but I don't know whether that would work.
Can you help me to find a solution, please?
This is the classic "use a package / procedure" situation.
You get ORA-04091: table name is mutating, trigger/function may not see it when you query a table in a trigger while you are performing a DML operation on that same table. Oracle does not allow this in order to maintain a read-consistent view of the data. There are various ways to avoid this, most of which involve trying to do something slightly funky, including, as you've already noted, an autonomous transaction.
However, they don't address the root cause of the error, which is that you're doing something wrong. This error is normally indicative of one or both of two things:
You're putting far too much logic into a trigger
Your data-model is flawed.
The solution is to move the logic into a package or procedure, which has the added benefits of removing a layer of obfuscation and making debugging easier.
I can't see how a check constraint is possible here as you would still need to know the run-time from FILM, which requires querying a second table.
Further reading:
Compound Triggers - official documentation
Avoid mutating table exceptions - AskTom
"Fix" Oracle mutating table trigger errors - dba-oracle.com (double quotes my own).
Related
I have a trigger defined on several tables to fire after all INSERT, UPDATE, or DELETE, all using the same trigger function. The trigger function performs an expensive check, but I can speed it up significantly by filtering some of the intermediate steps of that check using either a WHERE machine_serial = NEW.machine_serial or WHERE machine_serial = OLD.machine_serial clause, depending on what type of statement fired the trigger. However, not all the tables actually have a machine_serial column, so I can't perform this filtering when the trigger is fired on one of those tables. I am currently trying to find a good solution to making the decision of whether to filter or not from within the trigger function, and I believe that simply checking whether NEW or OLD has the machine_serial field would be easiest, clearest, and fastest. I can't find any way to do that in the documentation though, but checking whether a RECORD contains a certain field seems like such a basic, commonplace operation for anyone that has to work with RECORDs that I assume that I've just got to be missing it somewhere - I can't imagine that it's just not possible.
For completeness, I'll go over the alternatives I've considered to the hypothetical does-RECORD-have-field check:
I could create two trigger functions, do_expensive_check_with_machine_serial() and do_expensive_check_without_machine_serial(), and use one or the other depending on whether the table has the machine_serial column. But if I or anyone after me needs to alter the logic in either one of these functions, they'll need to remember to alter the logic in the other one, too.
I could stick with the one trigger function I currently have, and figure out whether the firing table has machine_serial by just trying to access NEW.machine_serial or OLD.machine_serial. If that raises an exception, I can catch it and then I'll know the field isn't present. But the manual explicitly suggests avoiding using exception blocks unless absolutely necessary, due to performance impacts.
I could stick with the one trigger function I currently have, and just add a check like this: IF (TG_TABLE_SCHEMA = x AND TG_TABLE_NAME = y) OR (TG_TABLE_SCHEMA = w AND TG_TABLE_NAME = z) OR ..., and just maintain that list of every table that has a machine_serial column. But then I and anyone that comes after me would need to alter that check in the trigger function any time the trigger is added to a new table, which is less than ideal.
Of course, the above three alternatives would all function, but they all feel like bad design choices to me. Maybe it's because I'm used to the dynamicness offered by Python, but if I used any of these alternatives, I would feel like I'm doing something wrong. And PostgreSQL is pretty good about offering lots of operators on all sorts of data types, so I just can't imagine that something as basic as checking whether a RECORD or ROW-type variable contains a certain field is impossible.
Before I show the solution, I have to say that this requirement can be a signal of an unhappy design. Maybe you are trying to implement functionality that should not be implemented in triggers. Triggers are good, but triggers that are too smart, too generic or too rich can be very slow and very hard to maintain and debug (though, as with everything in life, there are exceptions to the rule).
So first - you can look at the system catalog:
CREATE FUNCTION public.foo_trg() RETURNS trigger
LANGUAGE plpgsql
AS $$
begin
raise notice 'a exists %', exists(select * from pg_attribute where attrelid = new.tableoid and attname = 'a');
raise notice 'd exists %', exists(select * from pg_attribute where attrelid = new.tableoid and attname = 'd');
return new;
end;
$$;
CREATE TABLE public.foo (
a integer,
b integer
);
CREATE TRIGGER foo_trg_insert
AFTER INSERT ON public.foo
FOR EACH ROW EXECUTE FUNCTION public.foo_trg();
(2022-09-02 06:18:41) postgres=# insert into foo values(1,2);
NOTICE: a exists t
NOTICE: d exists f
INSERT 0 1
The second solution is based on a record-to-jsonb transformation:
CREATE OR REPLACE FUNCTION public.foo_trg()
RETURNS trigger
LANGUAGE plpgsql
AS $$
declare j jsonb;
begin
j := to_jsonb(new);
raise notice 'a exists %', j ? 'a';
raise notice 'd exists %', j ? 'd';
return new;
end;
$$;
(2022-09-02 06:24:54) postgres=# insert into foo values(1,2);
NOTICE: a exists t
NOTICE: d exists f
INSERT 0 1
The second solution can be faster, because it doesn't require queries against the system catalog (it hits just the system catalog cache), but it doesn't work on some legacy PostgreSQL releases.
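For the use case above, either check can feed straight into the filtering decision. A minimal sketch, assuming a hypothetical helper run_expensive_check(text) that treats NULL as "no filter":
CREATE OR REPLACE FUNCTION do_expensive_check() RETURNS trigger
LANGUAGE plpgsql
AS $$
declare
  rec jsonb;
begin
  -- OLD is the only row available on DELETE; NEW covers INSERT and UPDATE
  if tg_op = 'DELETE' then
    rec := to_jsonb(old);
  else
    rec := to_jsonb(new);
  end if;
  if rec ? 'machine_serial' then
    -- the firing table has the column: restrict the expensive check to this machine
    perform run_expensive_check(rec ->> 'machine_serial');
  else
    -- no machine_serial column on this table: run the unfiltered check
    perform run_expensive_check(NULL);
  end if;
  return null;  -- return value is ignored for AFTER row triggers
end;
$$;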
I would love to be able to validate objects representing table rows using the database's existing constraints (triggers that raise exceptions and checks) without actually inserting them into the database.
Is there currently a way one could do this in Postgres, at least with BEFORE INSERT triggers and CHECK constraints? I assume it makes no sense with AFTER INSERT triggers.
The easiest way I can think of right now would be to:
Lock the table
Insert a new row
If an exception is raised, pass it on to the API; else DELETE the row and call it valid
Unlock
But I can see several issues with this.
A simpler way is to insert within a transaction and not commit:
BEGIN;
INSERT INTO tbl(...) VALUES (...);
-- see effects ...
ROLLBACK;
No need for additional locking. The row is never visible to any other transaction with the default transaction isolation level READ COMMITTED. (You might be stalling concurrent writes that conflict with the tested row.)
Notable side-effect: Sequences of serial or IDENTITY columns are advanced even if the INSERT is never committed. But gaps in sequential numbers are to be expected anyway and nothing to worry about.
Be wary of triggers with side-effects. All "transactional" SQL effects are rolled back, even most DDL commands. But some special operations (like advancing sequences) are never rolled back.
Also, DEFERRED constraints do not kick in. The manual:
DEFERRED constraints are not checked until transaction commit.
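If you do want deferred constraints to be exercised during such a test, one option is to switch them to immediate mode inside the transaction before rolling back (same generic tbl as above):
BEGIN;
INSERT INTO tbl(...) VALUES (...);
SET CONSTRAINTS ALL IMMEDIATE;  -- outstanding deferred constraints are checked right here
ROLLBACK;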
If you need this a lot, work with a copy of your table, or even your database.
Strictly speaking, while any trigger / constraint / concurrent event is allowed, there is no other way to "validate objects" than to insert them into the actual target table in the actual target database at the actual point in time. Triggers, constraints, even default values, can interact with the current state of the whole DB. The more possibilities are ruled out and requirements are reduced, the more options we might have to emulate the test.
CREATE FUNCTION validate_function ( )
RETURNS trigger LANGUAGE plpgsql
AS $function$
DECLARE
valid_flag boolean := 't';
BEGIN
--Validation code
if valid_flag = 'f' then
RAISE EXCEPTION 'This record is not valid id %', NEW.id
USING HINT = 'Please enter valid record';
RETURN NULL;
else
RETURN NEW;
end if;
END;
$function$;
CREATE TRIGGER validate_rec BEFORE INSERT OR UPDATE ON some_tbl
FOR EACH ROW EXECUTE FUNCTION validate_function();
With this function and trigger you do the validation inside the trigger. If the new record fails validation you set valid_flag to false and then use that to raise an exception. The RETURN NULL; is redundant and will not be reached, since RAISE EXCEPTION aborts the function; returning NULL from a BEFORE trigger would, however, also abort the insert or update. If the record is valid then you RETURN NEW and the insert/update completes.
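Combined with the rollback trick from above, the trigger can then be used to validate a row without keeping it; a sketch against the same some_tbl:
BEGIN;
INSERT INTO some_tbl (...) VALUES (...);  -- the BEFORE trigger raises if the row is invalid
ROLLBACK;  -- discard the row even if it passed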
About 8 months ago I used a suggestion to set up a holding table, then push to the formal table and prevent duplicate entries, per this post: Best way to prevent duplicate data on copy csv postgresql
It's been working very nicely, but today I noticed some errors and gaps in the data.
Here's my insert statement:
And here's how the index is set up:
And here's an example of the error I'm getting, although on the next cron insert, it went through.
Here's it going through fine:
I haven't noted any large changes in the data incoming. Here's what the data looks like that's coming in now:
In summary, I've noticed recent oddities with the insert statements, and success is erratic, resulting in large data gaps in the database. Thanks for any help, and I'm happy to provide more details, but I wanted to see if my information sounds like something someone else has already dealt with.
Thanks very much for any help,
S
As Gordon pointed out in his answer to your previous question, this approach only works if you have exclusive access to the table. There is a delay between the existence check and the insert itself, and if another process modifies the table during this window, you may end up with duplicates.
If you're on Postgres 9.5+, the best approach is to skip the existence check and simply use an INSERT ... ON CONFLICT DO NOTHING statement.
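For example, a minimal sketch reusing the holding staging table and the ltg_data columns from the loop further below (with no conflict target given, any unique violation is simply skipped):
INSERT INTO ltg_data (pulsecount, intensity, time, lon, lat, ltg_geom)
SELECT pulsecount, intensity, time, lon, lat, ltg_geom
FROM holding
ON CONFLICT DO NOTHING;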
On earlier versions, the simplest solution (if you can afford to do so) would be to lock the table for the duration of the import. Otherwise, you can emulate ON CONFLICT DO NOTHING (albeit less efficiently) using a loop and an exception handler:
DO $$
DECLARE r RECORD;
BEGIN
FOR r IN (SELECT * FROM holding) LOOP
BEGIN
INSERT INTO ltg_data (pulsecount, intensity, time, lon, lat, ltg_geom)
VALUES (r.pulsecount, r.intensity, r.time, r.lon, r.lat, r.ltg_geom);
EXCEPTION WHEN unique_violation THEN
NULL; -- duplicate row: skip it and continue with the next one
END;
END LOOP;
END
$$;
As an aside, DELETE FROM holding is time-consuming and probably unnecessary. Create your staging table as TEMP, and it will be cleaned up automatically at the end of your session. You can easily build a table which matches the structure of your import target:
CREATE TEMP TABLE holding (LIKE ltg_data);
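If the whole import runs in one transaction, the staging table can even clean itself up automatically at commit:
CREATE TEMP TABLE holding (LIKE ltg_data) ON COMMIT DROP;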
I have created a "merge" function which is supposed to execute either an UPDATE or an INSERT query, depending on existing data. Instead of writing an upsert-wrapper for each table (as in most of the available examples), this function takes entire SQL strings. Both of the SQL strings are automatically generated by our application.
The plan is to call the function like this:
-- hypothetical "settings" table, with a primary key of (user_id, setting):
SELECT merge(
$$UPDATE settings SET value = 'x' WHERE user_id = 42 AND setting = 'foo'$$,
$$INSERT INTO settings (user_id, setting, value) VALUES (42, 'foo', 'x')$$
);
Here's the full code of the merge() function:
CREATE OR REPLACE FUNCTION merge (update_sql TEXT, insert_sql TEXT) RETURNS TEXT AS
$func$
DECLARE
max_iterations INTEGER := 10;
i INTEGER := 0;
num_updated INTEGER;
BEGIN
-- usually returns before re-entering the loop
LOOP
-- first try the update
EXECUTE update_sql;
GET DIAGNOSTICS num_updated = ROW_COUNT;
IF num_updated > 0 THEN
RETURN 'UPDATE';
END IF;
-- nothing was updated: try the insert, watching out for concurrent inserts
BEGIN
EXECUTE insert_sql;
RETURN 'INSERT';
EXCEPTION WHEN unique_violation THEN
-- nop; just loop and try again from the top
END;
-- emergency brake
i := i + 1;
IF i >= max_iterations THEN
RAISE EXCEPTION 'merge(): tried looping % times, giving up now.', i;
EXIT;
END IF;
END LOOP;
END;
$func$
LANGUAGE plpgsql;
It appears to work well enough in my tests, but I'm not certain if I haven't missed anything crucial, especially regarding concurrent UPDATE/INSERT/DELETE queries, which may be issued without using this function. Did I overlook anything important?
Among the resources I consulted for this function are:
UPDATE/INSERT example 40.2 in the PostgreSQL manual
Why is UPSERT so complicated?
SO: Insert, on duplicate update (postgresql)
(Edit: one of the goals was to avoid locking the target table.)
The answer to your question depends on the context of how your application(s) will access the database. There are many ways to solve this, as nicely discussed in the post by depesz that you cited yourself. In addition, you might want to consider using writeable CTEs (see here). The question Insert, on duplicate update in PostgreSQL? also has some interesting discussions for your decision-making process.
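To illustrate the writeable-CTE route with the settings example from the question (note that, unlike the looping function, this sketch does not retry if a concurrent insert sneaks in between the two steps):
WITH upd AS (
   UPDATE settings SET value = 'x'
   WHERE  user_id = 42 AND setting = 'foo'
   RETURNING user_id
   )
INSERT INTO settings (user_id, setting, value)
SELECT 42, 'foo', 'x'
WHERE  NOT EXISTS (SELECT 1 FROM upd);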
I want to do a large update on a table in PostgreSQL, but I don't need the transactional integrity to be maintained across the entire operation, because I know that the column I'm changing is not going to be written to or read during the update. I want to know if there is an easy way in the psql console to make these types of operations faster.
For example, let's say I have a table called "orders" with 35 million rows, and I want to do this:
UPDATE orders SET status = null;
To avoid being diverted to an offtopic discussion, let's assume that all the values of status for the 35 million rows are currently set to the same (non-null) value, thus rendering an index useless.
The problem with this statement is that it takes a very long time to go into effect (solely because of the locking), and all changed rows are locked until the entire update is complete. This update might take 5 hours, whereas something like
UPDATE orders SET status = null WHERE (order_id > 0 and order_id < 1000000);
might take 1 minute. Over 35 million rows, breaking the job into 35 such chunks would take only 35 minutes and save me 4 hours and 25 minutes.
I could break it down even further with a script (using pseudocode here):
for (i = 0 to 35000) {
    db_operation("UPDATE orders SET status = null"
                 + " WHERE order_id > " + (i*1000)
                 + " AND order_id < " + ((i+1)*1000));
}
This operation might complete in only a few minutes, rather than 35.
So that comes down to what I'm really asking. I don't want to write a freaking script to break down operations every single time I want to do a big one-time update like this. Is there a way to accomplish what I want entirely within SQL?
Column / Row
... I don't need the transactional integrity to be maintained across
the entire operation, because I know that the column I'm changing is
not going to be written to or read during the update.
Any UPDATE in PostgreSQL's MVCC model writes a new version of the whole row. If concurrent transactions change any column of the same row, time-consuming concurrency issues arise. Details in the manual. Knowing the same column won't be touched by concurrent transactions avoids some possible complications, but not others.
Index
To avoid being diverted to an offtopic discussion, let's assume that
all the values of status for the 35 million rows are currently set
to the same (non-null) value, thus rendering an index useless.
When updating the whole table (or major parts of it) Postgres never uses an index. A sequential scan is faster when all or most rows have to be read. On the contrary: Index maintenance means additional cost for the UPDATE.
Performance
For example, let's say I have a table called "orders" with 35 million
rows, and I want to do this:
UPDATE orders SET status = null;
I understand you are aiming for a more general solution (see below). But to address the actual question asked: this can be dealt with in a matter of milliseconds, regardless of table size:
ALTER TABLE orders DROP column status
, ADD column status text;
The manual (up to Postgres 10):
When a column is added with ADD COLUMN, all existing rows in the table
are initialized with the column's default value (NULL if no DEFAULT
clause is specified). If there is no DEFAULT clause, this is merely a metadata change [...]
The manual (since Postgres 11):
When a column is added with ADD COLUMN and a non-volatile DEFAULT
is specified, the default is evaluated at the time of the statement
and the result stored in the table's metadata. That value will be used
for the column for all existing rows. If no DEFAULT is specified,
NULL is used. In neither case is a rewrite of the table required.
Adding a column with a volatile DEFAULT or changing the type of an
existing column will require the entire table and its indexes to be
rewritten. [...]
And:
The DROP COLUMN form does not physically remove the column, but
simply makes it invisible to SQL operations. Subsequent insert and
update operations in the table will store a null value for the column.
Thus, dropping a column is quick but it will not immediately reduce
the on-disk size of your table, as the space occupied by the dropped
column is not reclaimed. The space will be reclaimed over time as
existing rows are updated.
Make sure you don't have objects depending on the column (foreign key constraints, indices, views, ...); you would need to drop and recreate those. Barring that, tiny operations on the system catalog table pg_attribute do the job. It requires an exclusive lock on the table, which may be a problem under heavy concurrent load (as Buurman emphasizes in his comment). That aside, the operation is a matter of milliseconds.
If you have a column default you want to keep, add it back in a separate command. Doing it in the same command applies it to all rows immediately. See:
Add new column without table lock?
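A sketch of keeping a default without touching existing rows (the default value 'new' is only an example):
ALTER TABLE orders DROP COLUMN status, ADD COLUMN status text;
ALTER TABLE orders ALTER COLUMN status SET DEFAULT 'new';  -- only affects future rows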
To actually apply the default, consider doing it in batches:
Does PostgreSQL optimize adding columns with non-NULL DEFAULTs?
General solution
dblink has been mentioned in another answer. It allows access to "remote" Postgres databases in implicit separate connections. The "remote" database can be the current one, thereby achieving "autonomous transactions": what the function writes in the "remote" db is committed and can't be rolled back.
This makes it possible to run a single function that updates a big table in smaller parts, with each part committed separately. It avoids building up transaction overhead for very big numbers of rows and, more importantly, releases locks after each part, which allows concurrent operations to proceed without much delay and makes deadlocks less likely.
If you don't have concurrent access, this is hardly useful - except to avoid ROLLBACK after an exception. Also consider SAVEPOINT for that case.
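A sketch of the SAVEPOINT variant inside a single transaction (batch boundaries are arbitrary here):
BEGIN;
SAVEPOINT batch_1;
UPDATE orders SET status = null WHERE order_id BETWEEN 1 AND 1000000;
-- on error: ROLLBACK TO SAVEPOINT batch_1; then continue with the next batch
RELEASE SAVEPOINT batch_1;
-- ... further batches ...
COMMIT;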
Disclaimer
First of all, lots of small transactions are actually more expensive. This only makes sense for big tables. The sweet spot depends on many factors.
If you are not sure what you are doing: a single transaction is the safe method. For this to work properly, concurrent operations on the table have to play along. For instance: concurrent writes can move a row to a partition that's supposedly already processed. Or concurrent reads can see inconsistent intermediary states. You have been warned.
Step-by-step instructions
The additional module dblink needs to be installed first:
How to use (install) dblink in PostgreSQL?
Setting up the connection with dblink very much depends on the setup of your DB cluster and security policies in place. It can be tricky. Related later answer with more how to connect with dblink:
Persistent inserts in a UDF even if the function aborts
Create a FOREIGN SERVER and a USER MAPPING as instructed there to simplify and streamline the connection (unless you have one already).
Assuming a serial PRIMARY KEY with or without some gaps.
CREATE OR REPLACE FUNCTION f_update_in_steps()
RETURNS void AS
$func$
DECLARE
_step int; -- size of step
_cur int; -- current ID (starting with minimum)
_max int; -- maximum ID
BEGIN
SELECT INTO _cur, _max min(order_id), max(order_id) FROM orders;
-- 100 slices (steps) hard coded
_step := ((_max - _cur) / 100) + 1; -- rounded, possibly a bit too small
-- +1 to avoid endless loop for 0
PERFORM dblink_connect('myserver'); -- your foreign server as instructed above
FOR i IN 0..200 LOOP -- 200 >> 100 to make sure we exceed _max
PERFORM dblink_exec(
$$UPDATE public.orders
SET status = 'foo'
WHERE order_id >= $$ || _cur || $$
AND order_id < $$ || _cur + _step || $$
AND status IS DISTINCT FROM 'foo'$$); -- avoid empty update
_cur := _cur + _step;
EXIT WHEN _cur > _max; -- stop when done (never loop till 200)
END LOOP;
PERFORM dblink_disconnect();
END
$func$ LANGUAGE plpgsql;
Call:
SELECT f_update_in_steps();
You can parameterize any part according to your needs: the table name, column name, value, ... just be sure to sanitize identifiers to avoid SQL injection:
Table name as a PostgreSQL function parameter
Avoid empty UPDATEs:
How do I (or can I) SELECT DISTINCT on multiple columns?
Postgres uses MVCC (multi-version concurrency control), thus avoiding any locking if you are the only writer; any number of concurrent readers can work on the table, and there won't be any locking.
So if it really takes 5h, it must be for a different reason (e.g. that you do have concurrent writes, contrary to your claim that you don't).
You should delegate this column to another table like this:
create table order_status (
order_id int not null references orders(order_id) primary key,
status int not null
);
Then your operation of setting status=NULL will be instant:
truncate order_status;
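Reads then pick the status up with an outer join (rows with no entry in order_status simply show NULL):
SELECT o.*, s.status
FROM   orders o
LEFT   JOIN order_status s USING (order_id);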
First of all - are you sure that you need to update all rows?
Perhaps some of the rows already have status NULL?
If so, then:
UPDATE orders SET status = null WHERE status is not null;
As for partitioning the change - that's not possible in pure SQL: a single UPDATE statement always runs in a single transaction.
One possible way to do it in "pure sql" would be to install dblink, connect to the same database using dblink, and then issue a lot of updates over dblink, but it seems like overkill for such a simple task.
Usually just adding a proper WHERE clause solves the problem. If it doesn't, just partition it manually. Writing a full script is too much - you can usually do it with a simple one-liner:
perl -e '
for (my $i = 0; $i <= 35000000; $i += 1000) {
printf "UPDATE orders SET status = null WHERE status is not null
and order_id between %u and %u;\n",
$i, $i+999
}
'
I wrapped lines here for readability, generally it's a single line. Output of above command can be fed to psql directly:
perl -e '...' | psql -U ... -d ...
Or first to file and then to psql (in case you'd need the file later on):
perl -e '...' > updates.partitioned.sql
psql -U ... -d ... -f updates.partitioned.sql
I am by no means a DBA, but a database design where you'd frequently have to update 35 million rows might have… issues.
A simple WHERE status IS NOT NULL might speed things up quite a bit (provided you have an index on status) – not knowing the actual use case, I'm assuming that if this is run frequently, a great part of the 35 million rows might already have a null status.
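If that index is missing, a partial index covering only the rows that still need the update is one option (the index name here is made up):
CREATE INDEX orders_status_pending_idx ON orders (order_id) WHERE status IS NOT NULL;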
However, you can loop within a PL/pgSQL function via the LOOP statement. I'll just cook up a small example:
CREATE OR REPLACE FUNCTION nullstatus(count INTEGER) RETURNS integer AS $$
DECLARE
i INTEGER := 0;
BEGIN
FOR i IN 0..(count/1000 + 1) LOOP
UPDATE orders SET status = null WHERE (order_id > (i*1000) and order_id <((i+1)*1000));
RAISE NOTICE 'Count: % and i: %', count,i;
END LOOP;
RETURN 1;
END;
$$ LANGUAGE plpgsql;
It can then be run by doing something akin to:
SELECT nullstatus(35000000);
You might want to select the row count to pass in, but beware that computing the exact row count can take a lot of time. The PostgreSQL wiki has an article about slow counting and how to avoid it.
Also, the RAISE NOTICE part is just there to keep track of how far along the script is. If you're not monitoring the notices, or don't care, it would be better to leave it out.
Are you sure this is because of locking? I don't think so, and there are many other possible reasons. To find out, you can always try doing just the locking. Try this:
BEGIN;
SELECT NOW();
SELECT * FROM orders FOR UPDATE;
SELECT NOW();
ROLLBACK;
To understand what's really happening you should first run an EXPLAIN (EXPLAIN UPDATE orders SET status...) and/or EXPLAIN ANALYZE (keeping in mind that EXPLAIN ANALYZE actually executes the statement). Maybe you'll find out that you don't have enough memory to do the UPDATE efficiently. If so, SET work_mem TO 'xxxMB'; might be a simple solution.
Also, tail the PostgreSQL log to see whether any performance-related problems occur.
I would use CTAS:
begin;
create table T as select col1, col2, ..., <new value>, colN from orders;
drop table orders;
alter table T rename to orders;
commit;
Some options that haven't been mentioned:
Use the new-table trick. Probably what you'd have to do in your case is write some triggers so that changes to the original table also get propagated to your table copy, something like that... (Percona is an example of something that does it the trigger way). Another option might be the "create a new column, then replace the old one with it" trick, to avoid locks (unclear if it helps with speed).
Possibly calculate the max ID, then generate "all the queries you need" and pass them in as a single query like update X set Y = NULL where ID < 10000 and ID >= 0; update X set Y = NULL where ID < 20000 and ID > 10000; ... then it might not do as much locking, and still be all SQL, though you do have extra logic up front to do it :(
PostgreSQL version 11 handles this for you automatically with the Fast ALTER TABLE ADD COLUMN with a non-NULL default feature. Please do upgrade to version 11 if possible.
An explanation is provided in this blog post.