How to optimize a PostgreSQL procedure

I have 61 million non-unique emails with statuses.
These emails need to be deduplicated, with the logic driven by status.
I wrote a stored procedure, but it runs too long.
How can I optimize the execution time of this procedure?
CREATE OR REPLACE FUNCTION public.load_oxy_emails() RETURNS boolean AS $$
DECLARE
    row record;
    rec record;
    new_id int;
BEGIN
    FOR row IN SELECT * FROM oxy_email ORDER BY id LOOP
        SELECT * INTO rec FROM oxy_emails_clean WHERE email = row.email;
        IF rec IS NOT NULL THEN
            IF row.status = 3 THEN
                UPDATE oxy_emails_clean SET status = 3 WHERE id = rec.id;
            END IF;
        ELSE
            INSERT INTO oxy_emails_clean(id, email, status)
            VALUES (nextval('oxy_emails_clean_id_seq'), row.email, row.status);
            SELECT currval('oxy_emails_clean_id_seq') INTO new_id;
            INSERT INTO oxy_emails_clean_websites_relation(oxy_emails_clean_id, website_id)
            VALUES (new_id, row.website_id);
        END IF;
    END LOOP;
    RETURN true;
END;
$$
LANGUAGE plpgsql;

Don't do it with a loop.
Row-by-row processing (also known as "slow-by-slow") is almost always a lot slower than bulk changes, where a single statement processes many rows in one go.
The change of the status can easily be done with a single statement:
update oxy_emails_clean oec
    set status = 3
from oxy_email oe
where oe.email = oec.email --<< match on email, as the loop does
    and oe.status = 3;
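With 61 million source rows, this statement (and the insert below) needs index support on the email columns. A minimal sketch, assuming no such indexes exist yet (the index names are illustrative):

-- Hypothetical supporting indexes; adjust the names to your schema.
create index if not exists oxy_email_email_idx on oxy_email (email);
create index if not exists oxy_emails_clean_email_idx on oxy_emails_clean (email);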
The copying of the rows can be done using a chain of CTEs:
with to_copy as (
    select *
    from oxy_email
    where status <> 3 --<< all those that have a different status
), clean_inserted as (
    insert into oxy_emails_clean (id, email, status)
    select nextval('oxy_emails_clean_id_seq'), email, status
    from to_copy
    returning id, email --<< also return the email so the new rows can be matched
)
insert into oxy_emails_clean_websites_relation (oxy_emails_clean_id, website_id)
select ci.id, tc.website_id
from clean_inserted ci
    join to_copy tc on tc.email = ci.email;
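For reference, the deduplication itself can also be expressed in a single pass with DISTINCT ON, keeping one row per email and letting status = 3 win the tie. A minimal sketch, assuming oxy_emails_clean starts out empty (the original question doesn't say):

-- Pick one row per email in one statement; rows with status = 3 sort first.
insert into oxy_emails_clean (id, email, status)
select nextval('oxy_emails_clean_id_seq'), email, status
from (
    select distinct on (email) email, status
    from oxy_email
    order by email, (status = 3) desc
) dedup;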

Related

Is it worth Parallel/Concurrent INSERT INTO... (SELECT...) to the same Table in Postgres?

I was attempting an INSERT INTO.... ( SELECT... ) (inserting a batch of rows from a SELECT... subquery) onto the same table in my database. For the most part it was working; however, I did see a "Deadlock" exception logged every now and then. Does it make sense to do this, or is there a way to avoid a deadlock scenario? On a high level, my queries both resemble this structure:
CREATE OR REPLACE PROCEDURE myConcurrentProc() LANGUAGE plpgsql
AS $procedure$
DECLARE
    row_count bigint := 1; -- was undeclared in the original post
BEGIN
    LOOP
        EXIT WHEN row_count = 0;
        WITH cte AS (SELECT *
                     FROM TableA tbla
                     WHERE EXISTS (SELECT 1 FROM TableB tblb WHERE tblb.id = tbla.id))
        INSERT INTO concurrent_table (SELECT id FROM cte);
        GET DIAGNOSTICS row_count = ROW_COUNT;
        COMMIT;
        UPDATE log_tbl
        SET status = 'FINISHED'
        WHERE job_name = 'tblA_and_B_job';
    END LOOP;
END;
$procedure$;
And the other script, which runs in parallel and also INSERTs into the same table, is basically:
CREATE OR REPLACE PROCEDURE myConcurrentProc2() LANGUAGE plpgsql -- distinct name assumed; the post reused the same name
AS $procedure$
DECLARE
    row_count bigint := 1;
BEGIN
    LOOP
        EXIT WHEN row_count = 0;
        WITH cte AS (SELECT *
                     FROM TableC c
                     WHERE EXISTS (SELECT 1 FROM TableD d WHERE d.id = c.id))
        INSERT INTO concurrent_table (SELECT id FROM cte);
        GET DIAGNOSTICS row_count = ROW_COUNT;
        COMMIT;
        UPDATE log_tbl
        SET status = 'FINISHED'
        WHERE job_name = 'tbl_C_and_D_job';
    END LOOP;
END;
$procedure$;
So you can see I'm querying two different tables in each script, but inserting into the same concurrent_table. I also have the UPDATE... statement that writes to a log table, so I suppose that could also cause issues. Is there any way to use BEGIN... END here and COMMIT to avoid deadlock/concurrency issues, or should I just create a second table to hold the "tbl_C_and_D_job" data?

Asynchronous data load by parallel sessions

Looking for help with a data load function designed to support asynchronous execution by parallel sessions.
The Process_Log table contains the list of data load functions, with their current status and a list of upstream dependencies.
Each session first looks for a function that is ready for execution, calls it, and updates status.
For further details please see comments in the code.
In Oracle PL/SQL I would design it as a nested block within the loop, and autonomous transaction for status updates.
Not sure how to achieve that in Postgres. Running 9.2.
CREATE OR REPLACE FUNCTION dm_operations.dm_load()
    RETURNS void AS
$BODY$
declare
    _run_cnt integer;
    _ready_cnt integer;
    _process_id dm_operations.process_log.process_id%type;
    _exec_name dm_operations.process_log.exec_name%type;
    _rowcnt dm_operations.process_log.rows_affected%type;
    _error text;
    _error_text text;
    _error_detail text;
    _error_hint text;
    _error_context text;
begin
    loop
        begin -- nested block, so an error is caught without leaving the loop
            --(1) Find one function ready to run
            select sum(case when process_status = 'RUNNING' then 1 else 0 end) run_cnt,
                   sum(case when process_status = 'READY' then 1 else 0 end) ready_cnt,
                   min(case when process_status = 'READY' then process_id end) process_id
              into _run_cnt, _ready_cnt, _process_id
              from dm_operations.process_log; --One row per each executable data load function
            --(2) Exit loop if nothing is ready
            if _ready_cnt = 0 then
                exit;
            else
                --(3) Lock the row until the status is updated
                select exec_name
                  into _exec_name
                  from dm_operations.process_log
                 where process_id = _process_id
                   for update;
                --(4) Set status of the function to 'RUNNING'
                --New status must be visible to other sessions
                update dm_operations.process_log
                   set process_status = 'RUNNING',
                       start_ts = now()
                 where process_id = _process_id;
                --(5) Release lock. (How?)
                --(6) Execute data load function. See example below.
                -- Is this correct syntax for dynamic call to a function that returns void?
                execute 'perform dm_operations.'||_exec_name;
                --(7) Get number of rows processed by the data load function
                GET DIAGNOSTICS _rowcnt := ROW_COUNT;
                --(8) Upon successful function execution set status to 'SUCCESS'
                update dm_operations.process_log
                   set process_status = 'SUCCESS',
                       end_ts = now(),
                       rows_affected = _rowcnt
                 where process_id = _process_id;
                --(9) Check dependencies and update status
                --These changes must be visible to the next loop iteration, and to other sessions
                update dm_operations.process_log pl1
                   set process_status = 'READY'
                 where process_status is null
                   and not exists (select null
                                     from dm_operations.process_log pl2
                                    where pl2.process_id in (select unnest(pl1.depends_on))
                                      and coalesce(pl2.process_status,'NULL') <> 'SUCCESS');
            end if;
        --(10) Log error and allow the loop to continue
        exception
            when others then
                GET STACKED DIAGNOSTICS _error_text = MESSAGE_TEXT,
                                        _error_detail = PG_EXCEPTION_DETAIL,
                                        _error_hint = PG_EXCEPTION_HINT,
                                        _error_context = PG_EXCEPTION_CONTEXT;
                _error := _error_text||
                          _error_detail||
                          _error_hint||
                          _error_context;
                update dm_operations.process_log
                   set process_status = 'ERROR',
                       start_ts = now(),
                       rows_affected = _rowcnt,
                       error_text = _error
                 where process_id = _process_id;
        end;
    end loop;
end;
$BODY$
LANGUAGE plpgsql;
Data load function example (6):
CREATE OR REPLACE FUNCTION load_target()
    RETURNS void AS
$BODY$
begin
    execute 'truncate table target_table';
    insert into target_table
    select ...
      from source_table;
end;
$BODY$
LANGUAGE plpgsql;
You cannot start asynchronous operations in PL/pgSQL.
There are two options I can think of:
The hard way: upgrade to a more recent PostgreSQL version and write a background worker in C that executes load_target. You'd have to use the background worker API for that.
Don't write your function in the database, but on the client side. Then you can simply open several database sessions and run functions in parallel that way.
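As an aside (not part of the original answer), the dblink extension is sometimes used to emulate asynchronous execution from within a session on old releases such as 9.2: it sends the work over a second connection. A rough sketch, assuming dblink is installed and the connection string is adjusted to your setup:

-- Requires: CREATE EXTENSION dblink;
SELECT dblink_connect('worker1', 'dbname=mydb');
SELECT dblink_send_query('worker1', 'select dm_operations.load_target()'); -- returns immediately
-- ... do other work, then collect the (void) result:
SELECT * FROM dblink_get_result('worker1') AS t(res text);
SELECT dblink_disconnect('worker1');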

Recursive query with cursor in PostgreSQL, no data found

How can I use a recursive query and then a cursor to update multiple rows in PostgreSQL? I try to return data, but no data is found. Is there an alternative to using a recursive query and cursor, or maybe better code? Please help me.
drop function proses_stock_invoice(varchar, varchar, character varying);
create or replace function proses_stock_invoice
    (p_medical_cd varchar, p_post_cd varchar, p_pstruserid character varying)
    returns void
    language plpgsql
as $function$
declare
    cursor_data refcursor;
    cursor_proses refcursor;
    v_medicalCd varchar(20);
    v_itemCd varchar(20);
    v_quantity numeric(10);
begin
    open cursor_data for
        with recursive hasil(idnya, level, pasien_cd, id_root) as (
            select medical_cd, 1, pasien_cd, medical_root_cd
            from trx_medical
            where medical_cd = p_pstruserid
            union all
            select A.medical_cd, level + 1, A.pasien_cd, A.medical_root_cd
            from trx_medical A, hasil B
            where A.medical_root_cd = B.idnya
        )
        select idnya from hasil where level >= 1;
    fetch next from cursor_data into v_medicalCd;
    return v_medicalCd;
    while (found)
    loop
        open cursor_proses for
            select B.item_cd, B.quantity
            from trx_medical_resep A
            join trx_resep_data B on A.medical_resep_seqno = B.medical_resep_seqno
            where A.medical_cd = v_medicalCd and B.resep_tp = 'RESEP_TP_1';
        fetch next from cursor_proses into v_itemCd, v_quantity;
        while (found)
        loop
            update inv_pos_item
            set quantity = quantity - v_quantity, modi_id = p_pstruserid, modi_id = now()
            where item_cd = v_itemCd and pos_cd = p_post_cd;
        end loop;
        close cursor_proses;
    end loop;
    close cursor_data;
end
$function$;
but no data is found?
You have a function that RETURNS void, so it will never return any data to you. Still, you have the statement return v_medicalCd after fetching the first record from the first cursor, so the function will return at that point and never reach the lines below.
When analyzing your function you have (1) a cursor that yields a number of idnya values from table trx_medical, which is input for (2) a cursor that yields a number of v_itemCd, v_quantity from tables trx_medical_resep, trx_resep_data for each idnya, which is then used to (3) update some rows in table inv_pos_item. You do not need cursors to do that and it is, in fact, extremely inefficient. Instead, turn the whole thing into a single update statement.
I am assuming here that you want to update an inventory of medicines by subtracting the medicines prescribed to patients from the stock in the inventory. This means that you will have to sum up prescribed amounts by type of medicine. That should look like this (note the comments):
CREATE FUNCTION proses_stock_invoice
    -- VVV parameter not used
    (p_medical_cd varchar, p_post_cd varchar, p_pstruserid varchar)
    RETURNS void AS $function$
UPDATE inv_pos_item --                              VVV column repeated VVV
   SET quantity = quantity - prescribed.quantity, modi_id = p_pstruserid, modi_id = now()
  FROM (
      WITH RECURSIVE hasil(idnya, level, pasien_cd, id_root) AS (
          SELECT medical_cd, 1, pasien_cd, medical_root_cd
          FROM trx_medical
          WHERE medical_cd = p_pstruserid
          UNION ALL
          SELECT A.medical_cd, level + 1, A.pasien_cd, A.medical_root_cd
          FROM trx_medical A, hasil B
          WHERE A.medical_root_cd = B.idnya
      )
      SELECT B.item_cd, sum(B.quantity) AS quantity
      FROM trx_medical_resep A
      JOIN trx_resep_data B USING (medical_resep_seqno)
      JOIN hasil ON A.medical_cd = hasil.idnya
      WHERE B.resep_tp = 'RESEP_TP_1'
      --AND hasil.level >= 1  Useless because level is always >= 1
      GROUP BY 1
  ) prescribed
 WHERE item_cd = prescribed.item_cd
   AND pos_cd = p_post_cd;
$function$ LANGUAGE sql STRICT;
Important
As with all UPDATE statements, test this code before you run the function. You can do that by running the prescribed sub-query separately as a stand-alone query to ensure that it does the right thing.
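For example, run just the prescribed derived table on its own (a sketch; 'MED001' is a placeholder for the p_pstruserid value you would pass in):

WITH RECURSIVE hasil(idnya, level, pasien_cd, id_root) AS (
    SELECT medical_cd, 1, pasien_cd, medical_root_cd
    FROM trx_medical
    WHERE medical_cd = 'MED001' -- placeholder for p_pstruserid
    UNION ALL
    SELECT A.medical_cd, level + 1, A.pasien_cd, A.medical_root_cd
    FROM trx_medical A, hasil B
    WHERE A.medical_root_cd = B.idnya
)
SELECT B.item_cd, sum(B.quantity) AS quantity
FROM trx_medical_resep A
JOIN trx_resep_data B USING (medical_resep_seqno)
JOIN hasil ON A.medical_cd = hasil.idnya
WHERE B.resep_tp = 'RESEP_TP_1'
GROUP BY 1;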

PostgreSQL COPY with schema support

I'm trying to load some data from CSV using the PostgreSQL COPY command. The trick is that I'd like to implement multi-tenancy on a userid (which is contained in the CSV). Is there an easy way to tell the COPY command to filter based on this userid when loading the CSV?
i.e. all rows with userid=x go to schema=x, rows with userid=y go to schema=y.
There is no way to do this with just the COPY command, but you could copy all your data into a master table, and then put together a simple PL/pgSQL function that does this for you. Something like this -
CREATE OR REPLACE FUNCTION public.spike()
    RETURNS void AS
$BODY$
DECLARE
    user_id integer;
    destination_schema text;
BEGIN
    FOR user_id IN SELECT userid FROM master_table GROUP BY userid LOOP
        CASE user_id
            WHEN 1 THEN
                destination_schema := 'foo';
            WHEN 2 THEN
                destination_schema := 'bar';
            ELSE
                destination_schema := 'baz';
        END CASE;
        EXECUTE 'INSERT INTO '|| destination_schema ||'.my_table SELECT * FROM master_table WHERE userid=$1' USING user_id;
        -- EXECUTE 'DELETE FROM master_table WHERE userid=$1' USING user_id;
    END LOOP;
    TRUNCATE TABLE master_table;
    RETURN;
END;
$BODY$
LANGUAGE plpgsql VOLATILE
COST 100;
This gets all unique user_ids from master_table, uses a CASE statement to determine the destination schema, and then executes an INSERT ... SELECT to move the rows; finally it clears master_table (via TRUNCATE here; the commented-out DELETE is a per-user alternative).
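A sketch of the full load, with a hypothetical file path and column list:

-- Path, columns, and CSV options are illustrative.
COPY master_table (userid, email) FROM '/tmp/data.csv' WITH (FORMAT csv, HEADER);
SELECT public.spike();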

Is there a similar function in PostgreSQL for MySQL's SQL_CALC_FOUND_ROWS?

Everybody using MySQL knows:
SELECT SQL_CALC_FOUND_ROWS ..... FROM table WHERE ... LIMIT 5, 10;
and right after, run this:
SELECT FOUND_ROWS();
How do I do this in PostgreSQL? So far, I have found only ways where I have to send the query twice...
No, there is not (at least not as of July 2007). I'm afraid you'll have to resort to:
BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT id, username, title, date FROM posts ORDER BY date DESC LIMIT 20;
SELECT count(*) AS total FROM posts;
END;
The isolation level needs to be SERIALIZABLE (in modern PostgreSQL, REPEATABLE READ also suffices) so that both SELECTs see the same snapshot and no concurrent updates slip in between them.
Another option you have, though, is to use a trigger to count rows as they're INSERTed or DELETEd. Suppose you have the following table:
CREATE TABLE posts (
    id SERIAL PRIMARY KEY,
    poster TEXT,
    title TEXT,
    time TIMESTAMPTZ DEFAULT now()
);
INSERT INTO posts (poster, title) VALUES ('Alice', 'Post 1');
INSERT INTO posts (poster, title) VALUES ('Bob', 'Post 2');
INSERT INTO posts (poster, title) VALUES ('Charlie', 'Post 3');
Then, perform the following to create a table called post_count that contains a running count of the number of rows in posts:
-- Don't let any new posts be added while we're setting up the counter.
BEGIN;
LOCK TABLE posts;
-- Create and initialize our post_count table.
SELECT count(*) INTO TABLE post_count FROM posts;
-- Create the trigger function.
CREATE FUNCTION post_added_or_removed() RETURNS TRIGGER AS $$
BEGIN
    IF TG_OP = 'DELETE' THEN
        UPDATE post_count SET count = count - 1;
    ELSIF TG_OP = 'INSERT' THEN
        UPDATE post_count SET count = count + 1;
    END IF;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;
-- Call the trigger function any time a row is inserted or deleted.
CREATE TRIGGER post_added_or_removed_tgr
    AFTER INSERT OR DELETE
    ON posts
    FOR EACH ROW
    EXECUTE PROCEDURE post_added_or_removed();
COMMIT;
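With the trigger in place, getting the total is a one-row read instead of a full scan:

-- Cheap replacement for SELECT count(*) FROM posts;
SELECT count FROM post_count;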
Note that this maintains a running count of all of the rows in posts. To keep a running count of certain rows, you'll have to tweak it:
SELECT count(*) INTO TABLE post_count FROM posts WHERE poster <> 'Bob';
CREATE OR REPLACE FUNCTION post_added_or_removed() RETURNS TRIGGER AS $$
BEGIN
    -- The IF statements are nested because OR does not short circuit.
    IF TG_OP = 'DELETE' THEN
        IF OLD.poster <> 'Bob' THEN
            UPDATE post_count SET count = count - 1;
        END IF;
    ELSIF TG_OP = 'INSERT' THEN
        IF NEW.poster <> 'Bob' THEN
            UPDATE post_count SET count = count + 1;
        END IF;
    END IF;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;
There is a simple way, but keep in mind that the following count(*) window function will be applied to all rows that pass the WHERE clause, before LIMIT/OFFSET kicks in (which may be costly):
SELECT id,
       count(*) OVER () AS cnt
FROM objects
WHERE id > 2
OFFSET 50
LIMIT 5;
No, PostgreSQL doesn't count all relevant results when you only need 10 results. You need a separate COUNT to count all results.