Issue with nested loops in postgresql

I am wondering if there is something related to variables and nested loops in postgresql that works differently than in other languages.
Example:
CREATE OR REPLACE FUNCTION public.generate_syllables()
RETURNS integer AS
$BODY$
DECLARE
w RECORD;
s RECORD;
current_syllable integer := 1;
vowel_trigger integer := 0;
syllable_count integer := 1;
BEGIN
FOR w IN SELECT id FROM words LOOP
FOR s IN SELECT sound, id FROM sounds WHERE id = w.id ORDER BY ordering LOOP
IF (SELECT sr.vowel FROM sound_reference sr WHERE sr.sound = s.sound) = 1 AND vowel_trigger = 1 THEN
syllable_count := syllable_count + 1;
UPDATE sounds SET syllable = syllable_count WHERE id = s.id;
vowel_trigger := 0;
ELSIF (SELECT sr.vowel FROM sound_reference sr WHERE sr.sound = s.sound) = 1 THEN
vowel_trigger := 1;
UPDATE sounds SET syllable = syllable_count WHERE id = s.id;
ELSE
UPDATE sounds SET syllable = syllable_count WHERE id = s.id;
END IF;
END LOOP;
UPDATE words SET syllables = syllable_count WHERE id = w.id;
syllable_count := 1;
vowel_trigger := 0;
END LOOP;
RETURN 1;
END;
$BODY$
LANGUAGE plpgsql VOLATILE;
When I run this function as is, it never enters the first condition of the IF statement. I tested this by adding a RETURN statement inside that first branch. At first I thought this must be a logic error, but I have traced through it by hand with examples generated from my dataset, and it should work as desired. What is even stranger is that when I comment out the vowel_trigger := 0 line in the outer loop, it DOES enter the first branch. Of course the logic then doesn't work correctly either, and from that I have gathered that these variables are being reset BEFORE the nested loop finishes looping, which would also explain why the first condition is never entered: vowel_trigger is set back to 0 before the loop makes it back up to the first condition.
In other words, it seems to me that my nested loop is not acting like a nested loop; rather, the nested loop extends into the outer loop before it restarts. I imagine I must just not understand how to properly create a nested loop, or perhaps they just can't work this way in PostgreSQL. Any advice would be greatly appreciated.

You haven't provided table structures and, even more important, data, and your function's behavior depends entirely on the data in the tables words, sounds, and sound_reference. For example, if sound_reference is empty, vowel_trigger will never be 1, so the first IF branch is unreachable.
This will help to debug your function:
RAISE NOTICE 'printlining helps to debug! vowel_trigger=%, syllable_count=%',
vowel_trigger, syllable_count;
As a side note, I've noticed that UPDATE sounds SET syllable = syllable_count WHERE id = s.id; is repeated in all if/else branches, so it may be worth moving it outside them, right before the inner END LOOP;.
Addition:
...when I comment out the line in the outer loop, for vowel_trigger := 0, then it DOES enter the first if statement.
It tells us that one of the inner loop's executions ends with vowel_trigger being 1, which would allow the first IF to trigger; but right after the inner loop you set it back to 0, so the first IF never fires.
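Putting both suggestions together, the inner loop could be sketched like this. This is only a sketch against the table and column names visible in the question; it assumes sound_reference.sound is unique per sound, and it pulls the vowel lookup into the FOR query with a LEFT JOIN so sounds missing from sound_reference still get updated:
FOR s IN
    SELECT s2.sound, s2.id, sr.vowel
    FROM sounds s2
    LEFT JOIN sound_reference sr ON sr.sound = s2.sound
    WHERE s2.id = w.id
    ORDER BY s2.ordering
LOOP
    IF s.vowel = 1 AND vowel_trigger = 1 THEN
        syllable_count := syllable_count + 1;
        vowel_trigger := 0;
    ELSIF s.vowel = 1 THEN
        vowel_trigger := 1;
    END IF;
    RAISE NOTICE 'sound=%, vowel_trigger=%, syllable_count=%',
        s.sound, vowel_trigger, syllable_count;
    -- the UPDATE was identical in all branches, so it runs once here
    UPDATE sounds SET syllable = syllable_count WHERE id = s.id;
END LOOP;
The RAISE NOTICE inside the loop will show you exactly when vowel_trigger flips, which should make the reset outside the loop visible immediately.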


How can I query a custom datatype object inside an array of said custom datatype in PL/pgSQL?

Suppose I have:
CREATE TYPE compfoo AS (f1 int, f2 text);
And I create a table foo containing two columns, fooid and fooname, corresponding to the fields of compfoo; later I insert some records: (1, 'aa'), (2, 'bb'), (3, 'cc').
Then I define a PL/pgSQL function, more or less as follows:
create or replace function foo_query()
returns text
language plpgsql
as $$
declare
r compfoo;
arr compfoo [];
footemp compfoo;
result text;
begin
for r in
select * from foo where fooid = 1 OR fooid = 2
loop
arr := array_append(arr, r);
end loop;
foreach footemp in array arr
loop
select footemp.f1 into result where footemp.f1 = 1;
end loop;
return result;
end;
$$
Where I query first foo by column names and save the results into arr, an array of compfoo. Later, I iterate over arr and try to query the elements by their fieldnames as defined in compfoo.
I don't get an error per se in Postgres but the result of my function is null.
What am I doing wrong?
RAISE NOTICE should be your best friend: you can print the values of variables at various points in your code. The basic issue is uninitialized values. The arr variable is initialized to NULL, and any operation over NULL yields NULL again.
Another problem is the statement select footemp.f1 into result where footemp.f1 = 1;. SELECT INTO in Postgres overwrites the target variable with NULL when the result is empty. In the second iteration the query returns an empty set, so result is set back to NULL.
The biggest problem with your example is the style of programming. You are using an ISAM style, and your code can be terribly slow.
Don't use array_append in a loop when you can use the array_agg function in a query and skip the loop entirely,
Don't use SELECT INTO when you aren't reading data from tables,
Don't try to copy Oracle's BULK COLLECT and FOREACH-over-collection pattern. PostgreSQL is not Oracle; it uses a very different architecture, and this pattern doesn't increase performance (as it does on Oracle); you will probably lose some performance.
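As a sketch of the first point: the collect-into-array loop can be replaced by a single statement (assuming, as in the question, that foo's two columns line up with compfoo's fields):
select array_agg(row(f.fooid, f.fooname)::compfoo order by f.fooid)
into arr
from foo f
where f.fooid in (1, 2);
This does the row construction and aggregation inside the SQL engine, and arr is never NULL-appended because array_agg builds the whole array at once.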
Your fixed code can look like this:
CREATE OR REPLACE FUNCTION public.foo_query()
RETURNS text
LANGUAGE plpgsql
AS $function$
declare
r compfoo;
arr compfoo [] default '{}'; --<<<
footemp compfoo;
result text;
begin
for r in
select * from foo where fooid = 1 or fooid = 2
loop
arr := array_append(arr, r);
end loop;
foreach footemp in array arr
loop
if footemp.f1 = 1 then --<<<
result := footemp.f1;
end if;
end loop;
return result;
end;
$function$
;
It returns the expected result. But it is a perfect example of how not to write stored procedures. Don't try to replace SQL inside stored procedures: all of the code in this procedure can be replaced by a single query. On bigger data this code can be very slow.
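A sketch of that single-query version, assuming (as in the question) that the goal is to return the matching id as text when it equals 1 among ids 1 and 2:
CREATE OR REPLACE FUNCTION public.foo_query()
RETURNS text
LANGUAGE sql
AS $$
    -- the whole loop-and-filter logic collapses to one WHERE clause
    SELECT f.fooid::text
    FROM foo f
    WHERE f.fooid IN (1, 2)
      AND f.fooid = 1;
$$;
No loops, no arrays, no intermediate variables, and the planner can use an index on fooid.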

How do I update all rows of a postgres table sequentially, and run the query in parallel while avoiding deadlocks?

First question here after years of reading answers, so apologies if I've got some things wrong!
I'm creating a very large image mosaic (e.g. a big image made of lots of small images) and I've got it working, but I'm struggling to work out how to assign the candidate images to the target locations in parallel.
I have two tables in a postgres database - tiles and candidates. Each contains 50,000 rows.
For each candidate, I want to assign it to the best target location (tile) that has not already been assigned a candidate.
My tables look like this:
create table candidates
(
id integer not null primary key,
lab_l double precision,
lab_a double precision,
lab_b double precision,
processing_by integer
);
create table tiles
(
id integer not null primary key,
pos_x integer,
pos_y integer,
lab_l double precision,
lab_a double precision,
lab_b double precision,
candidate_id integer
);
To find the best tile for a candidate I'm using this function:
create or replace function update_candidates() RETURNS integer
LANGUAGE SQL
PARALLEL SAFE
AS
$$
WITH updated_candidate AS (
SELECT id, pg_advisory_xact_lock(id)
FROM candidates
WHERE id = (
SELECT id
FROM candidates
FOR UPDATE SKIP LOCKED
LIMIT 1
)
)
UPDATE tiles
SET candidate_id = (SELECT id FROM updated_candidate)
WHERE tiles.id = (SELECT tile_id
FROM (select t.id AS tile_id,
c.id AS candidate_id,
delta_e_cie_2000(t.lab_l, t.lab_a, t.lab_b,
c.lab_l, c.lab_a, c.lab_b) AS diff
FROM tiles t,
candidates c
WHERE c.id = (SELECT id FROM updated_candidate)
AND t.candidate_id IS NULL
ORDER BY diff ASC
LIMIT 1 FOR UPDATE
) AS ticid)
RETURNING id;
$$
;
Which uses the delta_e_cie_2000 function to calculate the color difference between candidate and each of the un-used tiles.
To assign all the candidates to tiles I'm using this function:
create or replace function update_all_tiles() RETURNS int
LANGUAGE plpgsql
PARALLEL SAFE
AS
$$
DECLARE
counter integer := 0;
last_id integer := 1;
BEGIN
WHILE last_id IS NOT NULL
LOOP
RAISE NOTICE 'Counter %', counter;
counter := counter + 1;
last_id := update_candidates();
RAISE NOTICE 'Updated %', last_id;
END LOOP;
RETURN counter;
END
$$
;
Running select update_all_tiles(); is working fine on a single thread. It's producing the correct output but it's pretty slow - it takes 20-30 minutes to update 50,000 tiles.
I'm running it on a machine with lots of cores and was hoping to use them all, or at least more than one.
However, if I run select update_all_tiles() in a second session, it runs for a while and then one of the processes outputs something along the lines of:
ERROR: deadlock detected
DETAIL: Process 2541 waits for ShareLock on transaction 1628083; blocked by process 1664.
Process 1664 waits for ShareLock on transaction 1628084; blocked by process 2541.
Ideally, I'd like the update_all_tiles() function itself to run in parallel so I could just call it once and it would use all the cores.
If that's not possible I'd like to be able to run the function in multiple sessions.
Is there a way I can do this while avoiding the deadlock detected error?
Any help would be greatly appreciated.
OK, I solved this specific issue like this:
create or replace function assign_next_candidate() returns int
LANGUAGE plpgsql
PARALLEL SAFE
as
$$
declare
_candidate_id integer := 0;
_tile_id integer := 0;
begin
--first we get a candidate, and lock it (using `FOR UPDATE`) so that no other process will pick it up
-- and `SKIP LOCKED` so that we don't pick up one another process is using
UPDATE candidates
set processing_by = pg_backend_pid()
WHERE id = (
SELECT id
from candidates
where processing_by is null
FOR UPDATE SKIP LOCKED
LIMIT 1
)
RETURNING id into _candidate_id;
raise notice 'Candidate %', _candidate_id;
-- now we get the least different tile that is not locked and update it, using the candidate_id from the previous query
UPDATE tiles
set candidate_id = _candidate_id
WHERE id = (
select t.id as tile_id
from tiles t,
candidates c
where c.id = _candidate_id
and t.candidate_id is null
order by delta_e_cie_2000(t.lab_l, t.lab_a, t.lab_b,
c.lab_l, c.lab_a, c.lab_b) asc
Limit 1 for update skip locked
)
RETURNING id into _tile_id;
raise notice 'Tile %', _tile_id;
return _tile_id;
end
$$;
select assign_next_candidate();
--does not commit changes until it's finished all assignments
create or replace function assign_all_candidates() RETURNS int
LANGUAGE plpgsql
PARALLEL SAFE
AS
$$
declare
counter integer := 0;
last_id integer := 1;
begin
while last_id is not null
loop
raise notice 'Counter %', counter;
counter := counter + 1;
last_id := assign_next_candidate();
raise notice 'Updated %', last_id;
end loop;
return counter;
end
$$
;
I found this article really helpful: https://www.2ndquadrant.com/en/blog/what-is-select-skip-locked-for-in-postgresql-9-5/
I've now got another issue which is that the queries run in parallel fine but they slow down a lot as they get further through the available rows. I think this is because the changes are not committed inside the loop in assign_all_candidates(), which means that the SKIP LOCKED queries are having to skip over a lot of candidates before finding one they can actually assign. I think I can get round that by breaking out of the loop every X records, committing those updates, and then re-calling assign_all_candidates().
It's still much much faster than it was and seems robust.
Update:
As far as I can tell there is no way to commit within a plpgsql function (at least up to version 13). You can commit within anonymous code blocks and within procedures, but procedures can't run in parallel.
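For completeness, committing in batches inside a procedure (PostgreSQL 11+) could be sketched like this; the procedure name and batch size are assumptions, and a procedure is invoked with CALL rather than SELECT:
create or replace procedure assign_all_candidates_batched(batch_size int default 100)
LANGUAGE plpgsql
as
$$
declare
    last_id integer := 1;
    n integer := 0;
begin
    while last_id is not null
    loop
        last_id := assign_next_candidate();
        n := n + 1;
        if n % batch_size = 0 then
            commit;  -- releases row locks, so other sessions skip fewer rows
        end if;
    end loop;
    commit;
end
$$;
-- invoked as: CALL assign_all_candidates_batched(100);
Each session would still have to be its own worker (procedures themselves don't parallelize), but the periodic commits address the lock pile-up described above.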
I've ended up creating a function that takes an array/series of IDs and updates each of them. I then call this function from multiple processes (via Python) so that the results are committed after each "block" is processed.
queries = [
"select assign_candidates(array[1,2,3])",
"select assign_candidates(array[4,5,6])"
]
with Pool(2) as p:
p.map(do_query, queries)
Where the assign_candidates function looks something like
CREATE FUNCTION assign_candidates(_candidates integer[]) RETURNS integer
PARALLEL SAFE
LANGUAGE plpgsql
AS
$$
DECLARE
_candidate_id int := 0;
BEGIN
foreach _candidate_id in ARRAY _candidates
LOOP
perform assign_candidate(_candidate_id);
END LOOP;
RETURN 1;
END;
$$;
This "split queries" approach scales fairly linearly as I process more rows, unlike the previous approach, which slowed down a lot as the number of rows grew (because it had to skip a lot of locked but not committed rows).
It also scaled fairly well on more processors, although doubling the number of processes does not halve the time (I guess because they are also trying to lock the same rows, and skipping over each other's locked rows).
For my use case the above approach reduced the processing time from nearly an hour to a few minutes, which was good enough.

Change ws_source from lead based on last source when inserting to table Postgresql

So, I have a function that sets ws_source based on ws_tsource. Right now I can set the right ws_source based on different ws_tsource values, for example:
If ws_tsource = foobar then ws_source = foo,
if ws_tsource = foobar2 then ws_source = foo2;
But now I have to split the leads from the same ws_tsource between 'foo' and 'foo2', so that as they arrive one gets ws_source 'foo' and the next one gets ws_source 'foo2'.
I created a variable to check the last lead with that ws_tsource so that the new ws_source will be different, but v_return is not affecting the IF block correctly, and therefore not setting the correct ws_source. Regardless of the ws_source of the last lead, the new ws_source is always set to 'foo2'.
What am I doing wrong here, and is it even feasible to do?
CREATE OR REPLACE FUNCTION public.leads_ins()
RETURNS trigger
LANGUAGE plpgsql
AS $function$
DECLARE
v_return text;
v_priority integer;
BEGIN
if NEW.ws_source IN ('foo') then
v_priority := 13821;
if left(new.ws_tsource,7) = 'foobar' then
select ws_source into v_return from leads where ws_tsource = 'foobar' order by ws_id desc limit 1;
if v_return = 'foo' then
v_priority := 16460;
update leads
set ws_source = 'foo2'
where ws_id = NEW.ws_id;
end if;
if v_return = 'foo2' then
v_priority := 13821;
update leads
set ws_source = 'foo'
where ws_id = NEW.ws_id;
end if;
end if;
end if;
return NEW;
end;
$function$
;
So,
All I had to do was
select ws_source into v_return from leads where ws_tsource = 'foobar' order by ws_id desc limit 1 offset 1;
Because the trigger wasn't INSERTING a lead but UPDATING the table with the new lead already there. Without the offset, the "last lead" was always the new lead itself, which has 'foo' as ws_source by default, hence it always updated to 'foo2' and never the other way around.

How to code an atomic transaction in PL/pgSQL

I have multiple select, insert and update statements to complete a transaction, but I can't seem to ensure that all statements succeed before the changes are committed to the tables.
The transaction doesn't seem to be atomic.
I do have begin and end in my function, but the transaction still doesn't seem to be atomic.
CREATE FUNCTION public.testarray(salesid integer, items json) RETURNS integer
LANGUAGE plpgsql
AS $$
declare
resu text;
resu2 text := 'TH';
ssrow RECORD;
oserlen int := 0;
nserlen int := 0;
counter int := 0;
begin
select json_array_length(items::json->'oserial') into oserlen;
while counter < nserlen loop
select items::json#>>array['oserial',counter::text] into resu;
select * into strict ssrow from salesserial where fk_salesid=salesid and serialnum=resu::int;
insert into stockloct(serialnum,fk_barcode,source,exflag) values(ssrow.serialnum,ssrow.fk_barcode,ssrow.fk_salesid,true);
counter := counter + 1;
end loop;
counter := 0;
select json_array_length(items::json->'nserial') into nserlen;
while counter < nserlen loop
select items::json#>>array['nserial',counter::text,'serial'] into resu2;
select * into ssrow from stockloc where serialnum=resu2::int;
insert into salesserial(fk_salesid,serialnum,fk_barcode) values(salesid,ssrow.serialnum,ssrow.fk_barcode);
counter := counter + 1;
end loop;
select items::json#>'{nserial,0,serial}' into resu2;
return resu;
end;
$$;
Even when the first insert fails, the second insert seems to be able to succeed.
I see that by “fail” you mean “does not insert any rows”.
That is not surprising, since the first loop is never executed: both counter and nserlen are 0 at that point.
Perhaps you mean the first WHILE condition to be counter < oserlen?
You also seem to be confused by PL/pgSQL's BEGIN: while it looks like the BEGIN that starts a transaction, it is quite different. It is just the “opening parenthesis” of a PL/pgSQL block.
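Concretely, the first loop would then read (same statements as in the question, only the WHILE condition changed):
select json_array_length(items::json->'oserial') into oserlen;
while counter < oserlen loop   -- was: counter < nserlen, which is still 0 here
    select items::json#>>array['oserial',counter::text] into resu;
    select * into strict ssrow from salesserial where fk_salesid=salesid and serialnum=resu::int;
    insert into stockloct(serialnum,fk_barcode,source,exflag) values(ssrow.serialnum,ssrow.fk_barcode,ssrow.fk_salesid,true);
    counter := counter + 1;
end loop;
With that change, a failure in the first loop (for example, the INTO STRICT raising NO_DATA_FOUND) aborts the whole function, and because a function always runs inside a single transaction, the second loop's inserts are rolled back with it.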

Variable containing the number of rows affected by previous DELETE? (in a function)

I have a function that is used as an INSERT trigger. This function deletes rows that would conflict with [the serial number in] the row being inserted. It works beautifully, so I'd really rather not debate the merits of the concept.
DECLARE
re1 feeds_item.shareurl%TYPE;
BEGIN
SELECT regexp_replace(NEW.shareurl, '/[^/]+(-[0-9]+\.html)$','/[^/]+\\1') INTO re1;
RAISE NOTICE 'DELETEing rows from feeds_item where shareurl ~ ''%''', re1;
DELETE FROM feeds_item where shareurl ~ re1;
RETURN NEW;
END;
I would like to add to the NOTICE an indication of how many rows are affected (aka: deleted). How can I do that (using LANGUAGE 'plpgsql')?
UPDATE:
Based on some excellent guidance from "Chicken in the kitchen", I have changed it to this:
DECLARE
re1 feeds_item.shareurl%TYPE;
num_rows int;
BEGIN
SELECT regexp_replace(NEW.shareurl, '/[^/]+(-[0-9]+\.html)$','/[^/]+\\1') INTO re1;
DELETE FROM feeds_item where shareurl ~ re1;
IF FOUND THEN
GET DIAGNOSTICS num_rows = ROW_COUNT;
RAISE NOTICE 'DELETEd % row(s) from feeds_item where shareurl ~ ''%''', num_rows, re1;
END IF;
RETURN NEW;
END;
For a very robust solution that is part of PostgreSQL's SQL, and not just plpgsql, you could also do the following:
with a as (DELETE FROM feeds_item WHERE shareurl ~ re1 returning 1)
select count(*) from a;
You can actually get a lot more information, such as:
with a as (delete from sales returning amount)
select sum(amount) from a;
to see totals; in this way you can compute any aggregate, and even group and filter it.
In Oracle PL/SQL, the system variable to store the number of deleted / inserted / updated rows is:
SQL%ROWCOUNT
After a DELETE / INSERT / UPDATE statement, and BEFORE COMMITTING, you can store SQL%ROWCOUNT in a variable of type NUMBER. Remember that COMMIT or ROLLBACK resets SQL%ROWCOUNT to ZERO, so you have to copy the SQL%ROWCOUNT value into a variable BEFORE the COMMIT or ROLLBACK.
Example:
BEGIN
DECLARE
affected_rows NUMBER DEFAULT 0;
BEGIN
DELETE FROM feeds_item
WHERE shareurl = re1;
affected_rows := SQL%ROWCOUNT;
DBMS_OUTPUT.put_line(
'This DELETE would affect '
|| affected_rows
|| ' records in FEEDS_ITEM table.');
ROLLBACK;
END;
END;
I also found this interesting solution (source: http://markmail.org/message/grqap2pncqd6w3sp ):
On 4/7/07, Karthikeyan Sundaram wrote:
Hi,
I am using 8.1.0 postgres and trying to write a plpgsql block. In that I am inserting a row. I want to check to see if the row has been inserted or not.
In oracle we can say like this
begin
insert into table_a values (1);
if sql%rowcount > 0
then
dbms_output.put_line('rows inserted');
else
dbms_output.put_line('rows not inserted');
end if;
end;
Is there something equal to sql%rowcount in postgres? Please help.
Regards skarthi
Maybe:
http://www.postgresql.org/docs/8.2/static/plpgsql-statements.html#PLPGSQL-STATEMENTS-SQL-ONEROW
Click on the link above, you'll see this content:
37.6.6. Obtaining the Result Status
There are several ways to determine the effect of a command. The first
method is to use the GET DIAGNOSTICS command, which has the form:
GET DIAGNOSTICS variable = item [ , ... ];
This command allows
retrieval of system status indicators. Each item is a key word
identifying a state value to be assigned to the specified variable
(which should be of the right data type to receive it). The currently
available status items are ROW_COUNT, the number of rows processed by
the last SQL command sent down to the SQL engine, and RESULT_OID, the
OID of the last row inserted by the most recent SQL command. Note that
RESULT_OID is only useful after an INSERT command into a table
containing OIDs.
An example:
GET DIAGNOSTICS integer_var = ROW_COUNT;
The second method to
determine the effects of a command is to check the special variable
named FOUND, which is of type boolean. FOUND starts out false within
each PL/pgSQL function call. It is set by each of the following types
of statements:
A SELECT INTO statement sets FOUND true if a row is assigned, false if
no row is returned.
A PERFORM statement sets FOUND true if it produces (and discards) a
row, false if no row is produced.
UPDATE, INSERT, and DELETE statements set FOUND true if at least one
row is affected, false if no row is affected.
A FETCH statement sets FOUND true if it returns a row, false if no row
is returned.
A FOR statement sets FOUND true if it iterates one or more times, else
false. This applies to all three variants of the FOR statement
(integer FOR loops, record-set FOR loops, and dynamic record-set FOR
loops). FOUND is set this way when the FOR loop exits; inside the
execution of the loop, FOUND is not modified by the FOR statement,
although it may be changed by the execution of other statements within
the loop body.
FOUND is a local variable within each PL/pgSQL function; any changes
to it affect only the current function.
I would like to share my code (I had this idea from Roelof Rossouw):
CREATE OR REPLACE FUNCTION my_schema.sp_delete_mytable(_id integer)
RETURNS integer AS
$BODY$
DECLARE
AFFECTEDROWS integer;
BEGIN
WITH a AS (DELETE FROM mytable WHERE id = _id RETURNING 1)
SELECT count(*) INTO AFFECTEDROWS FROM a;
IF AFFECTEDROWS = 1 THEN
RETURN 1;
ELSE
RETURN 0;
END IF;
EXCEPTION WHEN OTHERS THEN
RETURN 0;
END;
$BODY$
LANGUAGE plpgsql VOLATILE
COST 100;