Why doesn't this function actually update anything? - postgresql

This is a SQL learning exercise. I have a 'tbl' with a single integer field. There are no indices on this table. I have this function:
CREATE OR REPLACE FUNCTION rcs()
RETURNS VOID AS $$
DECLARE
c CURSOR FOR SELECT * FROM tbl ORDER BY sequence;
s INTEGER;
r RECORD;
BEGIN
s := 0;
OPEN c;
LOOP
FETCH c INTO r;
EXIT WHEN NOT FOUND;
RAISE NOTICE 'got it';
r.sequence = s;
s := s + 1;
RAISE NOTICE '%', r.sequence;
END LOOP;
CLOSE c;
END;
$$ language 'plpgsql'
This loads and runs cleanly and the RAISE statements suggest that the 'sequence' field gets updated to 0, 1 etc. in accordance with the ORDER BY.
However, when I SELECT the table afterwards, the pre-existing values (which happen to all be '6') did not change.
Is this something to do with transactions? I tried fiddling around with COMMIT, etc. to no avail.
This is a freshly installed Postgresql 9.4.4 running on a Linode with no hackery of config files or anything like that, from the 'psql' command line.
EDIT: maybe it's because 'r' isn't actually the DB table, it's some kind of temporary copy of it? If so, please clarify, hopefully what I'm trying to achieve here is obvious (and I know it may be a bit nonsensical, but surely it's possible without resorting to reading the set into Java, etc.)

The actual problem: your function does not contain an UPDATE statement so nothing gets written to disk. r.sequence = s; simply assigns a new value to a variable that is held in memory.
To fix this, you need something like:
UPDATE tbl
set sequence = s -- change the actual column in the table
WHERE current of c; -- for the current row of your cursor
If where current of doesn't work, you need to switch that to a "regular" where clause:
UPDATE tbl
set sequence = s
WHERE tbl.pk_column = r.pk_column; -- the current primary key value
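Put together with the loop from the question, the WHERE CURRENT OF variant might look like the sketch below. The FOR UPDATE clause is an addition that is not in the original function: PostgreSQL recommends it for cursors used with WHERE CURRENT OF, and without it a cursor query containing ORDER BY is not guaranteed to work.

```sql
CREATE OR REPLACE FUNCTION rcs()
RETURNS void AS $$
DECLARE
    -- FOR UPDATE (an assumption, not in the question's code) locks the fetched
    -- rows and keeps WHERE CURRENT OF usable despite the ORDER BY.
    c CURSOR FOR SELECT * FROM tbl ORDER BY sequence FOR UPDATE;
    s integer := 0;
    r record;
BEGIN
    OPEN c;
    LOOP
        FETCH c INTO r;
        EXIT WHEN NOT FOUND;
        -- write to the table row the cursor is positioned on,
        -- instead of assigning to the in-memory record r
        UPDATE tbl SET sequence = s WHERE CURRENT OF c;
        s := s + 1;
    END LOOP;
    CLOSE c;
END;
$$ LANGUAGE plpgsql;
```

Calling SELECT rcs(); and then selecting from tbl should now show the renumbered values.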
But a much more efficient solution is to do this in a single statement:
update tbl
   set sequence = tu.new_sequence
from (
   select t.pk_column,
          row_number() over (order by t.sequence) - 1 as new_sequence -- row_number() starts at 1, the loop in the question started at 0
   from tbl t
) tu
where tbl.pk_column = tu.pk_column;
You need to replace the column name pk_column with the real primary key (or unique) column of your table.

Related

Is there any way to check whether a PostgreSQL record- or row-type variable contains a specific field from inside a function?

I have a trigger defined on several tables to fire after all INSERT, UPDATE, or DELETE, all using the same trigger function. The trigger function performs an expensive check, but I can speed it up significantly by filtering some of the intermediate steps of that check using either a WHERE machine_serial = NEW.machine_serial or WHERE machine_serial = OLD.machine_serial clause, depending on what type of statement fired the trigger. However, not all the tables actually have a machine_serial column, so I can't perform this filtering when the trigger is fired on one of those tables. I am currently trying to find a good solution to making the decision of whether to filter or not from within the trigger function, and I believe that simply checking whether NEW or OLD has the machine_serial field would be easiest, clearest, and fastest. I can't find any way to do that in the documentation though, but checking whether a RECORD contains a certain field seems like such a basic, commonplace operation for anyone that has to work with RECORDs that I assume that I've just got to be missing it somewhere - I can't imagine that it's just not possible.
For completeness, I'll go over the alternatives I've considered to the hypothetical does-RECORD-have-field check:
I could create two trigger functions, do_expensive_check_with_machine_serial() and do_expensive_check_without_machine_serial(), and use one or the other depending on whether the table has the machine_serial column. But if I or anyone after me needs to alter the logic in either one of these functions, they'll need to remember to alter the logic in the other one, too.
I could stick with the one trigger function I currently have, and figure out whether the firing table has machine_serial by just trying to access NEW.machine_serial or OLD.machine_serial. If that raises an exception, I can catch it and then I'll know the field isn't present. But the manual explicitly suggests avoiding using exception blocks unless absolutely necessary, due to performance impacts.
I could stick with the one trigger function I currently have, and just add a check like this: IF (TG_TABLE_SCHEMA = x AND TG_TABLE_NAME = y) OR (TG_TABLE_SCHEMA = w AND TG_TABLE_NAME = z) OR ..., and just maintain that list of every table that has a machine_serial column. But then I and anyone that comes after me would need to alter that check in the trigger function any time the trigger is added to a new table, which is less than ideal.
Of course, the above three alternatives would all function, but they all feel like bad design choices to me. Maybe it's because I'm used to the dynamicness offered by Python, but if I used any of these alternatives, I would feel like I'm doing something wrong. And PostgreSQL is pretty good about offering lots of operators on all sorts of data types, so I just can't imagine that something as basic as checking whether a RECORD or ROW-type variable contains a certain field is impossible.
Before I show the solution, I have to say that this requirement can be a sign of a questionable design. Maybe you are trying to implement functionality that does not belong in triggers. Triggers are good, but triggers that are too smart, too generic, or too rich can be very slow and very hard to maintain and debug (though, as with everything in life, there are exceptions to the rule).
So first, you can look in the system catalog:
CREATE FUNCTION public.foo_trg() RETURNS trigger
LANGUAGE plpgsql
AS $$
begin
raise notice 'a exists %', exists(select * from pg_attribute where attrelid = new.tableoid and attname = 'a');
raise notice 'd exists %', exists(select * from pg_attribute where attrelid = new.tableoid and attname = 'd');
return new;
end;
$$;
CREATE TABLE public.foo (
a integer,
b integer
);
CREATE TRIGGER foo_trg_insert
AFTER INSERT ON public.foo
FOR EACH ROW EXECUTE FUNCTION public.foo_trg();
(2022-09-02 06:18:41) postgres=# insert into foo values(1,2);
NOTICE: a exists t
NOTICE: d exists f
INSERT 0 1
The second solution is based on a record-to-jsonb transformation:
CREATE OR REPLACE FUNCTION public.foo_trg()
RETURNS trigger
LANGUAGE plpgsql
AS $$
declare j jsonb;
begin
j := to_jsonb(new);
raise notice 'a exists %', j ? 'a';
raise notice 'd exists %', j ? 'd';
return new;
end;
$$
(2022-09-02 06:24:54) postgres=# insert into foo values(1,2);
NOTICE: a exists t
NOTICE: d exists f
INSERT 0 1
The second solution can be faster because it does not need to query the system catalog; it only hits the system catalog cache. However, it does not work on legacy PostgreSQL releases that lack the jsonb functions used here.
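Applied to the trigger from the question, the jsonb check could be sketched as follows. The function name do_expensive_check and the NOTICE messages are placeholders for illustration; the real filtering logic would go where the notices are.

```sql
CREATE OR REPLACE FUNCTION do_expensive_check() RETURNS trigger
LANGUAGE plpgsql AS $$
DECLARE
    j jsonb;
BEGIN
    -- OLD is the only row available for DELETE; NEW is used for INSERT and UPDATE
    IF TG_OP = 'DELETE' THEN
        j := to_jsonb(OLD);
    ELSE
        j := to_jsonb(NEW);
    END IF;

    IF j ? 'machine_serial' THEN
        -- placeholder: run the check filtered by this value
        RAISE NOTICE 'filtering on machine_serial = %', j ->> 'machine_serial';
    ELSE
        -- placeholder: run the unfiltered check
        RAISE NOTICE 'no machine_serial column on %.%', TG_TABLE_SCHEMA, TG_TABLE_NAME;
    END IF;

    RETURN NULL;  -- the return value is ignored for AFTER row triggers
END;
$$;
```

Note that j ->> 'machine_serial' yields text, so a cast back to the column's type may be needed before comparing.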

Declare and return value for DELETE and INSERT

I am trying to remove duplicated data from some of our databases based upon unique id's. All deleted data should be stored in a separate table for auditing purposes. Since it concerns quite some databases and different schemas and tables I wanted to start using variables to reduce chance of errors and the amount of work it will take me.
This is the best example query I could think of, but it doesn't work:
do $$
declare #source_schema varchar := 'my_source_schema';
declare #source_table varchar := 'my_source_table';
declare #target_table varchar := 'my_target_schema' || source_table || '_duplicates'; --target schema and appendix are always the same, source_table is a variable input.
declare #unique_keys varchar := ('1', '2', '3')
begin
select into #target_table
from #source_schema.#source_table
where id in (#unique_keys);
delete from #source_schema.#source_table where export_id in (#unique_keys);
end ;
$$;
The query syntax works with hard-coded values.
Most of the times my variables are perceived as columns or not recognized at all. :(
You need to create and then call a PL/pgSQL procedure with input parameters:
CREATE OR REPLACE PROCEDURE duplicates_suppress
(my_target_schema text, my_source_schema text, my_source_table text, unique_keys text[])
LANGUAGE plpgsql AS
$$
BEGIN
   -- <@ means "is contained by": keep the rows whose id is in the passed array
   EXECUTE FORMAT(
      'WITH list AS (INSERT INTO %1$I.%3$I_duplicates SELECT * FROM %2$I.%3$I WHERE array[id] <@ %4$L :: integer[] RETURNING id)
       DELETE FROM %2$I.%3$I AS t USING list AS l WHERE t.id = l.id', my_target_schema, my_source_schema, my_source_table, unique_keys :: text) ;
END ;
$$ ;
The procedure duplicates_suppress inserts into my_target_schema.my_source_table || '_duplicates' the rows from my_source_schema.my_source_table whose id is in the array unique_keys and then deletes these rows from the table my_source_schema.my_source_table .
See the test result in dbfiddle.
As has been commented, you need some kind of dynamic SQL, in a FUNCTION, PROCEDURE, or DO statement, to do it on the server.
You should be comfortable with PL/pgSQL. Dynamic SQL is no beginners' toy.
Example with a PROCEDURE, like Edouard already suggested. You'll need a FUNCTION instead to wrap it in an outer transaction (like you very well might). See:
When to use stored procedure / user-defined function?
CREATE OR REPLACE PROCEDURE pg_temp.f_archive_dupes(_source_schema text, _source_table text, _unique_keys int[], OUT _row_count int)
LANGUAGE plpgsql AS
$proc$
-- target schema and appendix are always the same, source_table is a variable input
DECLARE
_target_schema CONSTANT text := 's2'; -- hardcoded
_target_table text := _source_table || '_duplicates';
_sql text := format(
'WITH del AS (
DELETE FROM %I.%I
WHERE id = ANY($1)
RETURNING *
)
INSERT INTO %I.%I TABLE del', _source_schema, _source_table
, _target_schema, _target_table);
BEGIN
RAISE NOTICE '%', _sql; -- debug
EXECUTE _sql USING _unique_keys; -- execute
GET DIAGNOSTICS _row_count = ROW_COUNT;
END
$proc$;
Call:
CALL pg_temp.f_archive_dupes('s1', 't1', '{1, 3}', 0);
db<>fiddle here
I made the procedure temporary, since I assume you don't need to keep it permanently. Create it once per database. See:
How to create a temporary function in PostgreSQL?
Passed schema and table names are case-sensitive strings! (Unlike unquoted identifiers in plain SQL.) Either way, be wary of SQL-injection when concatenating SQL dynamically. See:
Are PostgreSQL column names case-sensitive?
Table name as a PostgreSQL function parameter
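To see what the %I specifier does with the passed names, format() can be run on its own. The names below are hypothetical and only illustrate the quoting:

```sql
-- A name that is a plain lower-case identifier is inserted as-is
SELECT format('DELETE FROM %I.%I WHERE id = ANY($1)', 's1', 't1');
-- DELETE FROM s1.t1 WHERE id = ANY($1)

-- A name with upper-case letters or spaces gets double-quoted,
-- so it is matched case-sensitively and cannot inject SQL
SELECT format('DELETE FROM %I.%I WHERE id = ANY($1)', 's1', 'My Table');
-- DELETE FROM s1."My Table" WHERE id = ANY($1)
```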
I made _unique_keys type int[] (array of integer) since your sample values look like integers. Use the actual data type of your id columns!
The variable _sql holds the query string, so it can easily be debugged before actually executing it. RAISE NOTICE '%', _sql; serves that purpose.
I suggest commenting out the EXECUTE line until you are sure.
I made the PROCEDURE return the number of processed rows. You didn't ask for that, but it's typically convenient. At hardly any cost. See:
Dynamic SQL (EXECUTE) as condition for IF statement
Best way to get result count before LIMIT was applied
Last, but not least, use DELETE ... RETURNING * in a data-modifying CTE. Since that has to find rows only once it comes at about half the cost of separate SELECT and DELETE. And it's perfectly safe. If anything goes wrong, the whole transaction is rolled back anyway.
Two separate commands can also run into concurrency issues or race conditions which are ruled out this way, as DELETE implicitly locks the rows to delete. Example:
Replicating data between Postgres DBs
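For reference, the statement that the procedure builds corresponds to the following static SQL. The schema and table names s1.t1 and s2.t1_duplicates are assumed here, matching the example call, and both tables are assumed to have the same column layout:

```sql
-- Delete the rows and archive exactly those rows in one statement
WITH del AS (
   DELETE FROM s1.t1
   WHERE  id = ANY ('{1,3}'::int[])
   RETURNING *
)
INSERT INTO s2.t1_duplicates
TABLE del;
```

Because both steps are a single statement, either both the delete and the insert take effect or neither does.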
Or you can build the statements in a client program. Like psql, and use \gexec. Example:
Filter column names from existing table for SQL DDL statement
Based on Erwin's answer, minor optimization...
create or replace procedure pg_temp.p_archive_dump
(_source_schema text, _source_table text,
 _unique_key int[], _target_schema text)
language plpgsql as
$$
declare
   _row_count bigint;
   -- plain concatenation; %I in format() quotes the name where needed
   _target_table text := _source_table || '_' || array_to_string(_unique_key, '_');
begin
   raise notice 'the deleted rows will be stored in %.%', _target_schema, _target_table;
   execute format('create table %I.%I as select * from %I.%I limit 0', _target_schema, _target_table, _source_schema, _source_table);
   execute format('with mm as ( delete from %I.%I where id = any (%L) returning * ) insert into %I.%I table mm'
      , _source_schema, _source_table, _unique_key, _target_schema, _target_table);
   get diagnostics _row_count = row_count;
   raise notice 'rows affected: %', _row_count;
end
$$;
If _unique_key holds only a few values, this solution also creates the target table for you, named after the source table and the keys. You still need to create the target schema yourself.
If _unique_key holds many values, customize the naming so the archive table gets a sensible name.
Let's call it:
call pg_temp.p_archive_dump('s1','t1', '{1,2}','s2');
s1 is the source schema, t1 is source table, {1,2} is the unique key you want to extract to the new table. s2 is the target schema

How to migrate oracle's pipelined function into PostgreSQL

As I am a newbie to PL/pgSQL, I got stuck while migrating an Oracle function to PostgreSQL.
Oracle query:
create or replace FUNCTION employee_all_case(
p_ugr_id IN integer,
p_case_type_id IN integer
)
RETURN number_tab_t PIPELINED
-- LANGUAGE 'plpgsql'
-- COST 100
-- VOLATILE
-- AS $$
-- DECLARE
is
l_user_id NUMBER;
l_account_id NUMBER;
BEGIN
l_user_id := p_ugr_id;
l_account_id := p_case_type_id;
FOR cases IN
(SELECT ccase.case_id, ccase.employee_id
FROM ct_case ccase
INNER JOIN ct_case_type ctype
ON (ccase.case_type_id=ctype.case_type_id)
WHERE ccase.employee_id = l_user_id)
LOOP
IF cases.employee_id IS NOT NULL THEN
PIPE ROW (cases.case_id);
END IF;
END LOOP;
RETURN;
END;
--$$
When I execute this function then I get the following result
select * from table(select employee_all_case(14533,1190) from dual)
My question here is: I really do not understand the pipelined function and how can I obtain the same result in PostgreSQL as Oracle query ?
Please help.
Thank you guys, your solution was very helpful.
I found the desired result:
-- select * from employee_all_case(14533,1190);
-- drop function employee_all_case
create or replace FUNCTION employee_all_case(p_ugr_id IN integer ,p_case_type_id IN integer)
returns table (case_id double precision)
-- PIPELINED
LANGUAGE 'plpgsql'
COST 100
VOLATILE
AS $$
DECLARE
-- is
l_user_id integer;
l_account_id integer;
BEGIN
l_user_id := cp_lookup$get_user_id_from_ugr_id(p_ugr_id);
l_account_id := cp_lookup$acctid_from_ugr(p_ugr_id);
RETURN QUERY SELECT ccase.case_id
FROM ct_case ccase
INNER JOIN ct_case_type ctype ON ccase.case_type_id = ctype.case_type_id
WHERE ccase.employee_id = p_ugr_id
and ccase.employee_id IS NOT NULL;
--return NEXT;
END;
$$
You would rewrite that to a set returning function:
Change the return type to
RETURNS SETOF integer
and do away with the PIPELINED.
Change the PIPE ROW statement to
RETURN NEXT cases.case_id;
Of course, you will have to do the obvious syntactic changes, like using integer instead of NUMBER and putting the IN before the parameter name.
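Following these steps, a direct translation of the Oracle function might look like the sketch below. It assumes that ct_case.case_id can be returned as integer; adjust the return type if the column has a different type.

```sql
CREATE OR REPLACE FUNCTION employee_all_case(p_ugr_id integer, p_case_type_id integer)
RETURNS SETOF integer
LANGUAGE plpgsql AS $$
DECLARE
    cases record;
BEGIN
    FOR cases IN
        SELECT ccase.case_id, ccase.employee_id
        FROM ct_case ccase
        INNER JOIN ct_case_type ctype ON ccase.case_type_id = ctype.case_type_id
        WHERE ccase.employee_id = p_ugr_id
    LOOP
        IF cases.employee_id IS NOT NULL THEN
            RETURN NEXT cases.case_id;  -- replaces PIPE ROW (cases.case_id)
        END IF;
    END LOOP;
    RETURN;
END;
$$;
```

It is called like a table: select * from employee_all_case(14533, 1190);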
But actually, it is quite unnecessary to write a function for that. Doing it in a single SELECT statement would be both simpler and faster.
Pipelined functions are best translated to a simple SQL function returning a table.
Something like this:
create or replace function employee_all_case(p_ugr_id integer, p_case_type_id integer)
returns table (case_id integer)
as
$$
   SELECT ccase.case_id
   FROM ct_case ccase
   INNER JOIN ct_case_type ctype ON ccase.case_type_id = ctype.case_type_id
   WHERE ccase.employee_id = p_ugr_id
   and ccase.employee_id IS NOT NULL;
$$
language sql;
Note that your sample code did not use the second parameter p_case_type_id.
Usage is also more straightforward:
select *
from employee_all_case(14533,1190);
Before diving into the solution, here is some background that should help you understand it better.
PIPELINED was introduced to reduce memory allocation at run time.
Collections occupy memory as soon as they are created, so the more rows you collect, the more memory gets allocated.
Pipelining removes the need to build huge collections by piping rows out of the function, which saves memory and allows subsequent processing to start before all the rows are generated.
Pipelined table functions include the PIPELINED clause and use the PIPE ROW call to push rows out of the function as soon as they are created, rather than building up a table collection.
How does pipelining optimize memory usage?
Instead of storing the data in an array, the function hands over each row with PIPE ROW (of the desired type) and then moves on to the next row.
Coming to the solution in PL/pgSQL, there are two options:
1. Simple, but not preferred when returning large amounts of data: remove PIPELINED from the return declaration and return an array of the desired type, something like RETURNS typrec2[]. Wherever you are using PIPE ROW, append that entry to the array, and finally return the array.
2. Create a temp table like
CREATE TEMPORARY TABLE temp_table (required fields) ON COMMIT DROP;
and insert the data into it. Replace PIPE ROW with an INSERT statement and finish with a return statement like
return query select * from temp_table;
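As a rough illustration of the array approach, here is a self-contained sketch. The function squares_as_array is a made-up example that has nothing to do with the tables in the question; it only shows where the former PIPE ROW call goes.

```sql
CREATE OR REPLACE FUNCTION squares_as_array(n integer)
RETURNS integer[]
LANGUAGE plpgsql AS $$
DECLARE
    result integer[] := '{}';
BEGIN
    FOR i IN 1..n LOOP
        -- where Oracle would PIPE ROW (i * i), append to the array instead
        result := array_append(result, i * i);
    END LOOP;
    RETURN result;  -- the whole array is built in memory before returning
END;
$$;

-- unnest() turns the array back into rows: 1, 4, 9
SELECT * FROM unnest(squares_as_array(3));
```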
The best link for understanding PIPELINED in Oracle: https://oracle-base.com/articles/misc/pipelined-table-functions
A fairly ordinary reference for PostgreSQL: http://manojadinesh.blogspot.com/2011/11/pipelined-in-oracle-as-well-in.html
Hope this helps someone conceptually.

CURSOR vs select statement in loop

I just saw a simple example in another StackOverflow question that used a cursor to loop through a table. I would have just looped through the results of a select query instead of wrapping the select query in a cursor. What is the advantage of using a cursor?
(I couldn't include the example here because StackOverflow thought my question was mostly code, and demanded more details. I've run into that annoying restriction before. If I can ask my question clearly in just a few words, I should be able to. I'll see if I can find a link to that question, and if I can, I'll add the link here.)
Here is the original question where I saw CURSOR used.
What is the advantage of using a cursor?
The only advantage is that you have to write more code (if they pay you for each line of code).
do $$
declare
rec record;
cur cursor for select i from generate_series(1, 3) i;
begin
open cur;
loop
fetch cur into rec;
exit when not found; -- "rec is null" would also end the loop on a row whose columns are all NULL
raise notice '%', rec.i;
end loop;
close cur;
end
$$;
A loop through query results just opens a (virtual) cursor, fetches rows, checks range, exits when needed and closes the cursor for you.
do $$
declare
rec record;
begin
for rec in select i from generate_series(1, 3) i
loop
raise notice '%', rec.i;
end loop;
end
$$;
There are several ways:
1. Use an explicit cursor in PL/pgSQL and loop through it and process each result row.
Example (c is a refcursor variable, r a record variable):
OPEN c FOR SELECT id FROM a WHERE ok;
LOOP
   FETCH c INTO r;
   EXIT WHEN NOT FOUND;
   UPDATE b SET a_ok = TRUE WHERE a_id = r.id;
END LOOP;
CLOSE c;
2. Use a FOR r IN SELECT ... LOOP in PL/pgSQL. This is effectively the same as 1. with a clearer syntax.
Example:
FOR c IN SELECT id FROM a WHERE ok LOOP
   UPDATE b SET a_ok = TRUE WHERE a_id = c.id;
END LOOP;
3. Run a SELECT query without a cursor and process each result row on the client side, probably issuing a database query for each result.
Example (in pseudocode):
resultset := db_exec('SELECT id FROM a WHERE ok');
while (resultset.next()) {
   db_exec('UPDATE b SET a_ok = TRUE WHERE a_id = ' || resultset.get('id'));
}
4. Use a JOIN.
Example:
UPDATE b SET a_ok = TRUE
FROM a
WHERE a.id = b.a_id AND a.ok;
Method 3. is the most terrible way conceivable to solve the problem, because it causes a lot of client-server round trips and has the database parse a gazillion statements.
Alas, it is often the way how SQL novices attack the problem. I call it home-grown nested loop join.
On top of all that, the client software will often snarf the complete result set from the first query into memory, which causes yet another problem.
Methods 1. and 2. are equivalent, except that 2. is more elegant. It saves the round trips and uses prepared statements under the hood, so the UPDATE doesn't have to be parsed all the time. Still, the executor has to run many times, and PL/pgSQL is known not to be particularly fast. It is also a kind of home-grown nested loop join.
Method 4 is the way to go. Not only is everything run in a single query, but PostgreSQL can also use a more effective join strategy if that is better.
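To try method 4 on some data, a small self-contained session could look like this. The table layouts and sample rows are assumptions made up for the demonstration:

```sql
-- Minimal versions of tables a and b
CREATE TEMP TABLE a (id int PRIMARY KEY, ok boolean);
CREATE TEMP TABLE b (a_id int, a_ok boolean DEFAULT false);

INSERT INTO a VALUES (1, true), (2, false), (3, true);
INSERT INTO b (a_id) VALUES (1), (2), (3);

-- One statement, no loop: the planner picks the join strategy
UPDATE b SET a_ok = TRUE
FROM a
WHERE a.id = b.a_id AND a.ok;

-- a_ok is now true for a_id 1 and 3, and still false for a_id 2
SELECT a_id, a_ok FROM b ORDER BY a_id;
```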

Is this generic MERGE/UPSERT function for PostgreSQL safe?

I have created a "merge" function which is supposed to execute either an UPDATE or an INSERT query, depending on existing data. Instead of writing an upsert-wrapper for each table (as in most of the available examples), this function takes entire SQL strings. Both of the SQL strings are automatically generated by our application.
The plan is to call the function like this:
-- hypothetical "settings" table, with a primary key of (user_id, setting):
SELECT merge(
$$UPDATE settings SET value = 'x' WHERE user_id = 42 AND setting = 'foo'$$,
$$INSERT INTO settings (user_id, setting, value) VALUES (42, 'foo', 'x')$$
);
Here's the full code of the merge() function:
CREATE OR REPLACE FUNCTION merge (update_sql TEXT, insert_sql TEXT) RETURNS TEXT AS
$func$
DECLARE
max_iterations INTEGER := 10;
i INTEGER := 0;
num_updated INTEGER;
BEGIN
-- usually returns before re-entering the loop
LOOP
-- first try the update
EXECUTE update_sql;
GET DIAGNOSTICS num_updated = ROW_COUNT;
IF num_updated > 0 THEN
RETURN 'UPDATE';
END IF;
-- nothing was updated: try the insert, watching out for concurrent inserts
BEGIN
EXECUTE insert_sql;
RETURN 'INSERT';
EXCEPTION WHEN unique_violation THEN
-- nop; just loop and try again from the top
END;
-- emergency brake
i := i + 1;
IF i >= max_iterations THEN
RAISE EXCEPTION 'merge(): tried looping % times, giving up now.', i;
EXIT;
END IF;
END LOOP;
END;
$func$
LANGUAGE plpgsql;
It appears to work well enough in my tests, but I'm not certain if I haven't missed anything crucial, especially regarding concurrent UPDATE/INSERT/DELETE queries, which may be issued without using this function. Did I overlook anything important?
Among the resources I consulted for this function are:
UPDATE/INSERT example 40.2 in the PostgreSQL manual
Why is UPSERT so complicated?
SO: Insert, on duplicate update (postgresql)
(Edit: one of the goals was to avoid locking the target table.)
The answer to your question depends on the context in which your application(s) will access the database. There are many ways to solve this, as nicely discussed in the depesz post you cited yourself. In addition, you might want to consider using writable CTEs. The question "Insert, on duplicate update in PostgreSQL?" also has some interesting discussions to help your decision-making.
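For the hypothetical settings table from the question, a writable CTE version might be sketched like this. It is only an illustration of the technique: it does not by itself close the race with concurrent inserts, so a unique_violation is still possible and may need a retry.

```sql
-- Try the UPDATE first; INSERT only if the UPDATE touched no row
WITH upd AS (
    UPDATE settings
    SET    value = 'x'
    WHERE  user_id = 42 AND setting = 'foo'
    RETURNING user_id
)
INSERT INTO settings (user_id, setting, value)
SELECT 42, 'foo', 'x'
WHERE  NOT EXISTS (SELECT 1 FROM upd);
```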