Asynchronous data load by parallel sessions - PostgreSQL

Looking for help with a data load function designed to support asynchronous execution by parallel sessions.
The Process_Log table contains the list of data load functions, each with its current status and a list of upstream dependencies.
Each session first looks for a function that is ready for execution, calls it, and updates its status.
For further details please see the comments in the code.
In Oracle PL/SQL I would design this as a nested block within the loop, with an autonomous transaction for the status updates.
I am not sure how to achieve that in Postgres. Running 9.2.
CREATE OR REPLACE FUNCTION dm_operations.dm_load()
  RETURNS void AS
$BODY$
declare
    _run_cnt       integer;
    _ready_cnt     integer;
    _process_id    dm_operations.process_log.process_id%type;
    _exec_name     dm_operations.process_log.exec_name%type;
    _rowcnt        dm_operations.process_log.rows_affected%type;
    _error         text;
    _error_text    text;
    _error_detail  text;
    _error_hint    text;
    _error_context text;
begin
    loop
        begin -- nested block, so the exception handler below does not end the loop
            --(1) Find one function ready to run
            select sum(case when process_status = 'RUNNING' then 1 else 0 end) run_cnt,
                   sum(case when process_status = 'READY' then 1 else 0 end) ready_cnt,
                   min(case when process_status = 'READY' then process_id end) process_id
              into _run_cnt, _ready_cnt, _process_id
              from dm_operations.process_log; --One row per executable data load function
            --(2) Exit loop if nothing is ready
            if _ready_cnt = 0 then
                exit;
            else
                --(3) Lock the row until the status is updated
                select exec_name
                  into _exec_name
                  from dm_operations.process_log
                 where process_id = _process_id
                   for update;
                --(4) Set status of the function to 'RUNNING'
                --New status must be visible to other sessions
                update dm_operations.process_log
                   set process_status = 'RUNNING',
                       start_ts = now()
                 where process_id = _process_id;
                --(5) Release lock. (How?)
                --(6) Execute data load function. See example below.
                -- Is this correct syntax for a dynamic call to a function that returns void?
                execute 'perform dm_operations.'||_exec_name;
                --(7) Get number of rows processed by the data load function
                GET DIAGNOSTICS _rowcnt := ROW_COUNT;
                --(8) Upon successful function execution set status to 'SUCCESS'
                update dm_operations.process_log
                   set process_status = 'SUCCESS',
                       end_ts = now(),
                       rows_affected = _rowcnt
                 where process_id = _process_id;
                --(9) Check dependencies and update status
                --These changes must be visible to the next loop iteration, and to other sessions
                update dm_operations.process_log pl1
                   set process_status = 'READY'
                 where process_status is null
                   and not exists (select null
                                     from dm_operations.process_log pl2
                                    where pl2.process_id in (select unnest(pl1.depends_on))
                                      and coalesce(pl2.process_status,'NULL') <> 'SUCCESS');
            end if;
        --(10) Log error and allow the loop to continue
        EXCEPTION
            when others then
                GET STACKED DIAGNOSTICS _error_text    = MESSAGE_TEXT,
                                        _error_detail  = PG_EXCEPTION_DETAIL,
                                        _error_hint    = PG_EXCEPTION_HINT,
                                        _error_context = PG_EXCEPTION_CONTEXT;
                _error := _error_text ||
                          _error_detail ||
                          _error_hint ||
                          _error_context;
                update dm_operations.process_log
                   set process_status = 'ERROR',
                       start_ts = now(),
                       rows_affected = _rowcnt,
                       error_text = _error
                 where process_id = _process_id;
        end;
    end loop;
end;
$BODY$
LANGUAGE plpgsql;
Data load function example (6):
CREATE OR REPLACE FUNCTION load_target()
  RETURNS void AS
$BODY$
begin
    execute 'truncate table target_table';
    insert into target_table
    select ...
      from source_table;
end;
$BODY$
LANGUAGE plpgsql;
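As an aside on the autonomous-transaction part: a common workaround is sketched below, under the assumption that the dblink extension is installed and local authentication permits the connection (the connection string is illustrative). The status update runs over a second connection, so it commits independently of the calling transaction, much like an Oracle autonomous transaction:
create extension if not exists dblink;

-- inside the PL/pgSQL block, in place of a plain status UPDATE:
perform dblink_exec(
    'dbname=' || current_database(),
    'update dm_operations.process_log
        set process_status = ''RUNNING'', start_ts = now()
      where process_id = ' || _process_id);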

You cannot start asynchronous operations in PL/pgSQL.
There are two options I can think of:
The hard way: upgrade to a more recent PostgreSQL version and write a background worker in C that executes load_target. You'd have to use the background worker API for that (available from 9.3 on, dynamic workers from 9.4).
The easy way: don't write your function in the database, but on the client side. Then you can simply open several database sessions and run the functions in parallel that way.
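Whichever route you take, each worker session still has to claim one READY function atomically (steps (3)-(5) in the question). A minimal sketch against the question's process_log schema: 9.2 has no FOR UPDATE SKIP LOCKED, but a single UPDATE ... RETURNING is enough, because under READ COMMITTED a session that loses the race re-checks the WHERE clause after its lock wait and simply gets zero rows back. Run it in its own short transaction so the row lock is released promptly:
update dm_operations.process_log
   set process_status = 'RUNNING',
       start_ts = now()
 where process_id = (select min(process_id)
                       from dm_operations.process_log
                      where process_status = 'READY')
   and process_status = 'READY'  -- re-checked after any lock wait
returning process_id, exec_name;
As an aside on step (6): EXECUTE runs SQL, not PL/pgSQL, so PERFORM is not valid there; the usual dynamic form is execute 'select dm_operations.' || _exec_name || '()';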

Related

update is not allowed in a non volatile function postgres

I tried to use a cursor with multiple parameters. The function was created without any problem, but when I call it using
select scrat.update_status()
I get the following error:
update is not allowed in a non volatile function
the function:
create function scrat.update_status() returns void
as
$$
DECLARE
    day_to_process record;
BEGIN
    FOR day_to_process IN (SELECT DISTINCT inst_status.date, inst_status.code, scrat.inst_status.id
                             FROM scrat.inst_status
                            WHERE inst_status.status = 'S'
                            ORDER BY 1)
    LOOP
        raise notice 'Processing Date %', day_to_process.date::text;
        update scrat.inst_status
           set status = (select a.status
                           from (select status, max(date)
                                   from scrat.inst_status
                                  where status <> 'S'
                                    and date::date < day_to_process.date
                                  group by status
                                  order by 2 desc
                                  limit 1) a)
         where inst_status.date = day_to_process.date
           and id = day_to_process.id
           and code = day_to_process.code;
    END LOOP;
END;
$$ LANGUAGE plpgsql STABLE;
As the documentation states:
STABLE indicates that the function cannot modify the database, and that within a single table scan it will consistently return the same result for the same argument values, but that its result could change across SQL statements.
So you will have to mark the function as VOLATILE.
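Since VOLATILE is also the default, simply removing the STABLE keyword from the CREATE statement is enough. Alternatively, the volatility can be changed in place, without re-creating the function body:
-- switch the existing function from STABLE to VOLATILE
ALTER FUNCTION scrat.update_status() VOLATILE;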

Postgresql AFTER UPDATE trigger OLD has no field

I have a trigger, like many other triggers that fire after UPDATE, that checks a field of the OLD row.
For some reason this trigger throws this error:
ERROR: record "old" has no field "status"
CONTEXT: SQL statement "SELECT NOT (OLD.status = NEW.status) AND NEW.status = 'Success'"
This is the body of the trigger:
CREATE OR REPLACE FUNCTION tr_online_payment_after_update()
  RETURNS trigger AS
$BODY$
DECLARE
    v_rec RECORD;
BEGIN
    -- If state changed to Success
    IF NOT (OLD.status = NEW.status) AND NEW.status = 'Success' THEN
        -- Find any invoices attached and set them to paid
        FOR v_rec IN
            SELECT invoice_id_fk
              FROM online_payment_invoice
             WHERE online_payment_id_fk = NEW.online_payment_id
        LOOP
            UPDATE invoice
               SET paid_date = CURRENT_DATE,
                   updator_id_fk = -2,
                   updated = LOCALTIMESTAMP
             WHERE invoice_id = v_rec.invoice_id_fk;
        END LOOP;
    END IF;
    RETURN NULL;
END;
$BODY$
LANGUAGE plpgsql VOLATILE
COST 100;

CREATE TRIGGER tr_online_payment_after_update
    AFTER UPDATE
    ON online_payment
    FOR EACH ROW
    EXECUTE PROCEDURE tr_online_payment_after_update();
The weird thing is that it actually seems to run the invoice-update part of the trigger.
I cannot see what I am missing here. It does not make sense.
This is the full output:
=>UPDATE online_payment SET status = 'Success' WHERE online_payment_id = 18;
ERROR: record "old" has no field "status"
CONTEXT: SQL statement "SELECT NOT (OLD.status = NEW.status) AND NEW.status = 'Success'"
PL/pgSQL function tr_online_payment_after_update() line 7 at IF
SQL statement "UPDATE load_unit
SET load_hinderance = load_hinderance(load_unit_id)
WHERE load_consign_match_id_fk = v_lcm_id"
PL/pgSQL function tr_invoice_after_update() line 50 at SQL statement
SQL statement "UPDATE invoice
SET paid_date = CURRENT_DATE,
updator_id_fk = -2,
updated = LOCALTIMESTAMP
WHERE invoice_id = v_rec.invoice_id_fk"
PL/pgSQL function tr_online_payment_after_update() line 14 at SQL statement
Add the following diagnostic line to the function:
RAISE NOTICE 'Called % % on %',
TG_WHEN, TG_OP, TG_TABLE_NAME;
Then run the SQL statement that invokes the trigger and see if the output is what you expect.
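Judging from the stack trace, the UPDATE on invoice cascades through tr_invoice_after_update into an UPDATE on load_unit, which appears to fire this same function from a table that has no status column; the NOTICE above will show exactly which table and operation it was. If several tables need probing, a minimal standalone diagnostic trigger function (the name is illustrative) can be attached wherever needed:
CREATE OR REPLACE FUNCTION trg_debug_notice()
RETURNS trigger AS
$BODY$
BEGIN
    -- report when, for which operation, and on which table we fired
    RAISE NOTICE 'Called % % on %', TG_WHEN, TG_OP, TG_TABLE_NAME;
    RETURN NULL;  -- the return value of an AFTER ... FOR EACH ROW trigger is ignored
END;
$BODY$
LANGUAGE plpgsql;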

How to optimize a PostgreSQL procedure

I have 61 million non-unique emails with statuses.
These emails need to be deduplicated, with logic based on status.
I wrote a stored procedure, but it runs too long.
How can I optimize the execution time of this procedure?
CREATE OR REPLACE FUNCTION public.load_oxy_emails() RETURNS boolean AS $$
DECLARE
    row record;
    rec record;
    new_id int;
BEGIN
    FOR row IN SELECT * FROM oxy_email ORDER BY id LOOP
        SELECT * INTO rec FROM oxy_emails_clean WHERE email = row.email;
        IF rec IS NOT NULL THEN
            IF row.status = 3 THEN
                UPDATE oxy_emails_clean SET status = 3 WHERE id = rec.id;
            END IF;
        ELSE
            INSERT INTO oxy_emails_clean(id, email, status)
            VALUES (nextval('oxy_emails_clean_id_seq'), row.email, row.status);
            SELECT currval('oxy_emails_clean_id_seq') INTO new_id;
            INSERT INTO oxy_emails_clean_websites_relation(oxy_emails_clean_id, website_id)
            VALUES (new_id, row.website_id);
        END IF;
    END LOOP;
    RETURN true;
END;
$$
LANGUAGE plpgsql;
How can I optimize the execution time of this procedure?
Don't do it with a loop.
Row-by-row processing (also known as "slow-by-slow") is almost always a lot slower than bulk changes, where a single statement processes many rows in one go.
The change of the status can easily be done using a single statement:
update oxy_emails_clean oec
   set status = 3
  from oxy_email oe
 where oe.email = oec.email  -- match on email: the two tables use different id sequences
   and oe.status = 3;
The copying of the rows can be done using a chain of CTEs:
with to_copy as (
    select *
      from oxy_email
     where status <> 3 --<< all those that have a different status
), clean_inserted as (
    INSERT INTO oxy_emails_clean (id, email, status)
    select nextval('oxy_emails_clean_id_seq'), email, status
      from to_copy
    returning id, email  -- return the email too, to match the rows back up
)
insert into oxy_emails_clean_websites_relation (oxy_emails_clean_id, website_id)
select ci.id, tc.website_id
  from clean_inserted ci
  join to_copy tc on tc.email = ci.email;  -- the generated ids differ from oxy_email.id
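One addition worth considering (assuming each email is meant to appear only once in oxy_emails_clean, which is what the deduplication implies): a unique index on the email column both enforces the deduplication and turns any per-email lookup into an index scan instead of a sequential scan.
create unique index oxy_emails_clean_email_idx on oxy_emails_clean (email);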

Fixing invalid memory alloc request at PostgreSQL 9.2.9

I've encountered a problem querying some of my tables recently. When I try to select data I get an error: ERROR: invalid memory alloc request size 4294967293. This generally indicates data corruption. There is a nice and precise technique for deleting corrupted rows, described here: https://confluence.atlassian.com/jirakb/invalid-memory-alloc-request-size-440107132.html
But since I have lots of corrupted tables, this method is too slow. So I've found a nice function which returns the last successful ctid, here: http://blog.dob.sk/2012/05/19/fixing-pg_dump-invalid-memory-alloc-request-size/
Looking for a corrupted row is a bit faster when using it, but still not fast enough. I slightly modified it to store every "last successful ctid" in a separate table, and now it looks like this:
CREATE OR REPLACE FUNCTION
find_bad_row(tableName TEXT)
RETURNS void
as $find_bad_row$
DECLARE
    result tid;
    curs REFCURSOR;
    row1 RECORD;
    row2 RECORD;
    tabName TEXT;
    count BIGINT := 0;
BEGIN
    DROP TABLE IF EXISTS bad_rows_tbl;
    CREATE TABLE bad_rows_tbl (id varchar(255), offs BIGINT);
    SELECT reverse(split_part(reverse($1), '.', 1)) INTO tabName;
    OPEN curs FOR EXECUTE 'SELECT ctid FROM ' || tableName;
    count := 1;
    FETCH curs INTO row1;
    WHILE row1.ctid IS NOT NULL LOOP
        BEGIN
            result = row1.ctid;
            count := count + 1;
            FETCH curs INTO row1;
            EXECUTE 'SELECT (each(hstore(' || tabName || '))).* FROM '
                    || tableName || ' WHERE ctid = $1' INTO row2
            USING row1.ctid;
            IF count % 100000 = 0 THEN
                RAISE NOTICE 'rows processed: %', count;
            END IF;
        EXCEPTION
            WHEN SQLSTATE 'XX000' THEN
                RAISE NOTICE 'LAST CTID: %', result;
                -- quote the tid so it inserts cleanly into the varchar column
                EXECUTE 'INSERT INTO bad_rows_tbl VALUES(' || quote_literal(result) || ',' || count || ')';
        END;
    END LOOP;
    CLOSE curs;
END
$find_bad_row$
LANGUAGE plpgsql;
I'm quite new to PL/pgSQL, so I'm stuck on the following question: how do I get not the last successful ctid but the exact unsuccessful one (or calculate the next ctid from the last successful one), so that I can insert it into bad_rows_tbl and later use it as an argument for a DELETE statement?
Hoping for some help...
UPD: the function I ended up with:
CREATE OR REPLACE FUNCTION
find_bad_row(tableName TEXT)
RETURNS tid[]
as $find_bad_row$
DECLARE
    result tid;
    curs REFCURSOR;
    row1 RECORD;
    row2 RECORD;
    tabName TEXT;
    youNeedMe BOOLEAN = false;
    count BIGINT := 0;
    arrIter BIGINT := 0;
    arr tid[];
BEGIN
    CREATE TABLE bad_rows_tbl (id varchar(255), offs BIGINT);
    SELECT reverse(split_part(reverse($1), '.', 1)) INTO tabName;
    OPEN curs FOR EXECUTE 'SELECT ctid FROM ' || tableName;
    count := 1;
    FETCH curs INTO row1;
    WHILE row1.ctid IS NOT NULL LOOP
        BEGIN
            result = row1.ctid;
            count := count + 1;
            IF youNeedMe THEN
                arr[arrIter] = result;
                arrIter := arrIter + 1;
                RAISE NOTICE 'ADDING CTID: %', result;
                youNeedMe = FALSE;
            END IF;
            FETCH curs INTO row1;
            EXECUTE 'SELECT (each(hstore(' || tabName || '))).* FROM '
                    || tableName || ' WHERE ctid = $1' INTO row2
            USING row1.ctid;
            IF count % 100000 = 0 THEN
                RAISE NOTICE 'rows processed: %', count;
            END IF;
        EXCEPTION
            WHEN SQLSTATE 'XX000' THEN
                RAISE NOTICE 'LAST GOOD CTID: %', result;
                youNeedMe = TRUE;
        END;
    END LOOP;
    CLOSE curs;
    RETURN arr;
END
$find_bad_row$
LANGUAGE plpgsql;
This is supplemental to the function given in the question, and it covers the next steps once the database is dumpable.
Your next steps should be:
Dump everything (pg_dumpall) and restore it on a physically different system. The reason: at this point we don't know what caused this, and the chances are not bad that it was hardware.
You need to take the old system down and run hardware diagnostics on it, looking for problems. You really want to find out what happened so you don't run into it again. Of particular interest:
Double-check ECC RAM and MCE logs
Look at all RAID arrays and their battery backups
CPUs and PSUs
If it were me I would also look at environmental factors such as AC input and datacenter temperature.
Go over your backup strategy. In particular, look at PITR (and the related utility pgbarman). Make sure you can recover from a similar situation in the future if you run into it.
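For the PITR point, a minimal sketch of the WAL-archiving settings involved (postgresql.conf; the archive destination is illustrative and must be a path the server can write to):
wal_level = archive          # on 9.2; 'replica' on 9.6 and later
archive_mode = on            # changing this requires a server restart
archive_command = 'test ! -f /mnt/backup/wal/%f && cp %p /mnt/backup/wal/%f'
With archiving in place, a base backup plus the archived WAL lets you recover to any point in time before the corruption.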
Data corruption doesn't just happen. In rare cases it can be caused by bugs in PostgreSQL, but in most cases it is due to your hardware or to custom code you have running in the back-end. Narrowing down the cause and ensuring recoverability are critical going forward.
Assuming you aren't running custom C code in your database, most likely your data corruption is due to something on the hardware side.

count number of rows to be affected before update in trigger

I want to know the number of rows that will be affected by an UPDATE query, in a BEFORE per-statement trigger. Is that possible?
The problem is that I want to allow only queries that will update up to 4 rows. If the affected row count is 5 or more, I want to raise an error.
I don't want to do this in application code, because I need this check at the database level.
Is this at all possible?
Thanks in advance for any clues on that.
Write a function that performs the UPDATE for you and raises an error (rolling the transaction back) when too many rows are affected.
create function update_max(varchar, int)
RETURNS void AS
$BODY$
DECLARE
    sql ALIAS FOR $1;
    max ALIAS FOR $2;
    rcount INT;
BEGIN
    EXECUTE sql;
    GET DIAGNOSTICS rcount = ROW_COUNT;
    IF rcount > max THEN
        --ROLLBACK;
        RAISE EXCEPTION 'Too many rows affected (%).', rcount;
    END IF;
    --COMMIT;
END;
$BODY$ LANGUAGE plpgsql;
Then call it like
select update_max('update t1 set id=id+10 where id < 4', 3);
where the first parameter is your SQL statement and the second your maximum row count.
Simon had a good idea but his implementation is unnecessarily complicated. This is my proposition:
create or replace function trg_check_max_4()
returns trigger as $$
begin
    perform true from pg_class
     where relname = 'check_max_4' and relnamespace = pg_my_temp_schema();
    if not FOUND then
        create temporary table check_max_4
            (value int check (value <= 4))
            on commit drop;
        insert into check_max_4 values (0);
    end if;
    update check_max_4 set value = value + 1;
    return new;
end; $$ language plpgsql;
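For it to count anything, the function has to be attached to the protected table as a row-level BEFORE UPDATE trigger, so the counter is incremented once per affected row (the table name t1 is illustrative):
create trigger trg_bu_t1_check_max_4
    before update on t1
    for each row
    execute procedure trg_check_max_4();
The check constraint then aborts the fifth update within one transaction, and on commit drop cleans the counter up automatically.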
I've created something like this:
begin;

create table test (
    id integer
);
insert into test(id) select generate_series(1, 100);

create or replace function trg_check_max_4_updated_records()
returns trigger as $$
declare
    counter_ integer := 0;
    tablename_ text := 'temptable';
begin
    raise notice 'trigger fired';
    select count(42) into counter_
      from pg_catalog.pg_tables where tablename = tablename_;
    if counter_ = 0 then
        raise notice 'Creating table %', tablename_;
        execute 'create temporary table ' || tablename_ || ' (counter integer) on commit drop';
        execute 'insert into ' || tablename_ || ' (counter) values(1)';
        execute 'select counter from ' || tablename_ into counter_;
        raise notice 'Actual value for counter = [%]', counter_;
    else
        execute 'select counter from ' || tablename_ into counter_;
        execute 'update ' || tablename_ || ' set counter = counter + 1';
        raise notice 'updating';
        execute 'select counter from ' || tablename_ into counter_;
        raise notice 'Actual value for counter = [%]', counter_;
        if counter_ > 4 then
            raise exception 'Cannot change more than 4 rows in one transaction';
        end if;
    end if;
    return new;
end; $$ language plpgsql;

create trigger trg_bu_test before
update on test
for each row
execute procedure trg_check_max_4_updated_records();

update test set id = 10 where id <= 1;
update test set id = 10 where id <= 2;
update test set id = 10 where id <= 3;
update test set id = 10 where id <= 4;
update test set id = 10 where id <= 5;

rollback;
The main idea is to have a BEFORE UPDATE FOR EACH ROW trigger that creates (if necessary) a temporary table, dropped at the end of the transaction. This table holds just one row with one value: the number of rows updated in the current transaction. For each updated row the value is incremented; if it exceeds 4, the transaction is stopped.
But I think this is the wrong solution to the problem. What stops someone from running the offending query you describe twice, changing 8 rows in total? And what about deleting rows, or truncating the table?
PostgreSQL has two types of triggers: row and statement triggers. Row triggers only work within the context of a row, so you can't use those. Unfortunately, "before" statement triggers don't see what kind of change is about to take place, so I don't believe you can use those either.
Based on that, I would say it's unlikely you'll be able to build that kind of protection into the database using triggers, unless you use an "after" trigger and roll back the transaction if the condition isn't satisfied. Wouldn't mind being proved wrong. :)
Have a look at using the Serializable isolation level. I believe this will give you a consistent view of the database data within your transaction. Then you can use option #1 that MusiGenesis mentions below, without the timing vulnerability. Test it, of course, to validate.
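A sketch of how that combination might look, reusing the test table from the earlier answer (under SERIALIZABLE, PostgreSQL 9.1 and later raises a serialization failure rather than letting a concurrent writer invalidate the count between the two statements):
begin isolation level serializable;

select count(*) from test where id <= 4;     -- the client checks this is <= 4 ...
update test set id = id + 10 where id <= 4;  -- ... and only then issues the UPDATE

commit;  -- fails with a serialization error if a concurrent writer interfered
On a serialization failure, retry the whole transaction.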
I've never worked with PostgreSQL, so my answer may not apply. In SQL Server, your trigger can call a stored procedure which would do one of two things:
Perform a SELECT COUNT(*) to determine the number of records that will be affected by the UPDATE, and then only execute the UPDATE if the count is 4 or less
Perform the UPDATE within a transaction, and only commit the transaction if the returned number of rows affected is 4 or less
No. 1 is timing-vulnerable (the number of records affected by the UPDATE may change between the COUNT(*) check and the actual UPDATE). No. 2 is pretty inefficient if there are many cases where the number of rows updated is greater than 4.