Is it worth parallel/concurrent INSERT INTO ... (SELECT ...) to the same table in Postgres?

I was attempting an INSERT INTO ... (SELECT ...) (inserting a batch of rows from a SELECT subquery) into the same table in my database. For the most part it was working; however, I did see a "Deadlock" exception logged every now and then. Does it make sense to do this, or is there a way to avoid a deadlock scenario? At a high level, both of my queries resemble this structure:
CREATE OR REPLACE PROCEDURE myConcurrentProc() LANGUAGE plpgsql
AS $procedure$
DECLARE
    row_count bigint;
BEGIN
    LOOP
        WITH cte AS (SELECT *
                     FROM TableA tbla
                     WHERE EXISTS (SELECT 1 FROM TableB tblb WHERE tblb.id = tbla.id))
        INSERT INTO concurrent_table
        SELECT id FROM cte;
        GET DIAGNOSTICS row_count = ROW_COUNT;
        COMMIT;
        UPDATE log_tbl
        SET status = 'FINISHED'
        WHERE job_name = 'tblA_and_B_job';
        EXIT WHEN row_count = 0;
    END LOOP;
END
$procedure$;
And the other script that runs in parallel and also INSERTs into the same table is basically:
CREATE OR REPLACE PROCEDURE myConcurrentProc2() LANGUAGE plpgsql
AS $procedure$
DECLARE
    row_count bigint;
BEGIN
    LOOP
        WITH cte AS (SELECT *
                     FROM TableC c
                     WHERE EXISTS (SELECT 1 FROM TableD d WHERE d.id = c.id))
        INSERT INTO concurrent_table
        SELECT id FROM cte;
        GET DIAGNOSTICS row_count = ROW_COUNT;
        COMMIT;
        UPDATE log_tbl
        SET status = 'FINISHED'
        WHERE job_name = 'tbl_C_and_D_job';
        EXIT WHEN row_count = 0;
    END LOOP;
END
$procedure$;
So you can see I'm querying two different tables in each script, but inserting into the same concurrent_table. I also have the UPDATE statement that writes to a log table, so I suppose that could also cause issues. Is there any way to use BEGIN ... END and COMMIT here to avoid any deadlock/concurrency issues, or should I just create a second table to hold the "tbl_C_and_D_job" data?
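For reference, that second-table alternative could look like this (a minimal sketch; concurrent_table_cd and the view name are hypothetical):

-- a separate target table for the C/D job, same shape as the original
CREATE TABLE concurrent_table_cd (LIKE concurrent_table INCLUDING ALL);

-- readers still see both jobs' output as one relation
CREATE VIEW concurrent_table_all AS
SELECT * FROM concurrent_table
UNION ALL
SELECT * FROM concurrent_table_cd;

With each job writing to its own table, the two INSERT streams no longer touch the same indexes, which removes their opportunity to deadlock against each other.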

Related

How to iterate over all schemas and find the row count of a same-named table in every schema, every 5 minutes?

Imagine there are 5 schemas in my database, and in every schema there is a table with a common name (e.g. table1). Records get inserted into table1 every 5 minutes. How can I iterate over all schemas and calculate the row count of each table1? (I have to automate the process, so I am going to write the code in a function and call that function every 5 minutes using crontab.)
Basically two options: hard-code schema.table and union the results. So something like:
create or replace function count_rows_in_each_table1()
returns table (schema_name text, number_of_rows bigint)
language sql
as $$
select 'schema1', count(*) from schema1.table1 union all
select 'schema2', count(*) from schema2.table1 union all
select 'schema3', count(*) from schema3.table1 union all
...
select 'scheman', count(*) from scheman.table1;
$$;
The alternative is building the query dynamically from information_schema:
create or replace function count_rows_in_each_table1()
returns table (schema_name text, number_of_rows bigint)
language plpgsql
as $$
declare
    c_rows_count cursor is
        select table_schema::text
        from information_schema.tables
        where table_name = 'table1';
    l_tbl record;
    l_sql_statement text = '';
    l_connector text = '';
    l_base_select text = 'select ''%s'', count(*) from %I.table1';
begin
    for l_tbl in c_rows_count
    loop
        l_sql_statement = l_sql_statement ||
                          l_connector ||
                          format(l_base_select, l_tbl.table_schema, l_tbl.table_schema);
        l_connector = ' union all ';
    end loop;
    raise notice E'Running Query: \n%', l_sql_statement;
    return query execute l_sql_statement;
end;
$$;
Which is better? With few schemas and infrequent schema add/drop, opt for the first: it is direct and clearly shows what you are doing. If you add/drop schemas often, opt for the second. If you have many schemas but seldom add/drop them, modify the second to generate the first, save it, and schedule execution of the generated query.
NOTE: Not tested
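To call either function every 5 minutes with crontab, as the question intends, an entry along these lines would do (a sketch; the database name, user, and log path are placeholders):

*/5 * * * * psql -d mydb -U myuser -c 'select * from count_rows_in_each_table1();' >> /var/log/table1_counts.log 2>&1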

Postgres SQL | IF ELSE | HOW TO

I am using psql (PostgreSQL) 11.2 (Debian 11.2-1.pgdg90+1).
I am trying to write logic in a .psql file that needs to import some data into a table if this table is empty, else do something else.
I am struggling to find the correct syntax to make it work.
Would appreciate some help around this.
DO $$ BEGIN
SELECT count(*) from (SELECT 1 table_x LIMIT 1) as isTableEmpty
IF isTableEmpty > 0
THEN
INSERT INTO table_x
SELECT * FROM table_b;
ELSE
INSERT INTO table_y
SELECT * FROM table_b;
END IF;
END $$;
thanks!
Read up on plpgsql block structure. Then you would know you need a DECLARE section to declare isTableEmpty, and from the SELECT INTO documentation that you need to select into the isTableEmpty variable. So:
...
DECLARE
    isTableEmpty integer;
BEGIN
    SELECT count(*) into isTableEmpty from (SELECT 1 table_x LIMIT 1) as t;
...
Though I'm not sure what you are trying to accomplish with:
SELECT count(*) from (SELECT 1 table_x LIMIT 1) as isTableEmpty
as that is always going to return 1 (without a FROM clause, SELECT 1 table_x merely aliases the constant 1 as table_x).
You are using count just to determine whether a row exists in the table. To do so you would need to create a variable in the DO block, select into that variable, and reference that variable. This is all unnecessary; you can just use exists(...) instead of count(*) .... See demo:
do $$
begin
    if not exists (select null from table_x) then
        insert into table_x (...)
        values (...);
    else
        insert into table_y (...)
        values (...);
    end if;
end;
$$;

How to optimize postgresql procedure

I have 61 million non-unique emails with statuses.
These emails need to be deduplicated, with logic based on status.
I wrote a stored procedure, but it runs too long.
How can I optimize the execution time of this procedure?
CREATE OR REPLACE FUNCTION public.load_oxy_emails() RETURNS boolean AS $$
DECLARE
    row record;
    rec record;
    new_id int;
BEGIN
    FOR row IN SELECT * FROM oxy_email ORDER BY id LOOP
        SELECT * INTO rec FROM oxy_emails_clean WHERE email = row.email;
        IF rec IS NOT NULL THEN
            IF row.status = 3 THEN
                UPDATE oxy_emails_clean SET status = 3 WHERE id = rec.id;
            END IF;
        ELSE
            INSERT INTO oxy_emails_clean(id, email, status) VALUES(nextval('oxy_emails_clean_id_seq'), row.email, row.status);
            SELECT currval('oxy_emails_clean_id_seq') INTO new_id;
            INSERT INTO oxy_emails_clean_websites_relation(oxy_emails_clean_id, website_id) VALUES(new_id, row.website_id);
        END IF;
    END LOOP;
    RETURN true;
END;
$$
LANGUAGE 'plpgsql';
How can I optimize the execution time of this procedure?
Don't do it with a loop.
Row-by-row processing (also known as "slow-by-slow") is almost always a lot slower than bulk changes where a single statement processes many rows in one go.
The change of the status can easily be done using a single statement:
update oxy_emails_clean oec
set status = 3
from oxy_email oe
where oe.email = oec.email -- match by email, as the loop did
  and oe.status = 3;
The copying of the rows can be done using a chain of CTEs:
with to_copy as (
    select *
    from oxy_email
    where status <> 3 --<< all those that have a different status
), clean_inserted as (
    insert into oxy_emails_clean (id, email, status)
    select nextval('oxy_emails_clean_id_seq'), email, status
    from to_copy
    returning id, email
)
insert into oxy_emails_clean_websites_relation (oxy_emails_clean_id, website_id)
select ci.id, tc.website_id
from clean_inserted ci
join to_copy tc on tc.email = ci.email; -- correlate the new rows with their source rows by email
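If the set-based statements are still slow, indexes on the email columns help both the status update and the final join (a sketch, assuming no such indexes exist yet):

create index if not exists oxy_email_email_idx on oxy_email (email);
create index if not exists oxy_emails_clean_email_idx on oxy_emails_clean (email);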

PostgreSQL trigger not firing on update

I am using PostgreSQL 9.3.4
There are two linked tables updated remotely: table1 and table2
table2 has a foreign key dependency on table1.
Tables are updated in the following manner on the remote server:
Insert into table1 returning id;
Insert into table2 using the previous id.
The query is sent (one transaction with two insert statements).
I need to duplicate the new rows to a remote db using dblink, so I created two 'before insert or update' triggers for table1 and table2.
The problem is that only the table2 trigger is firing; the first one isn't when fired by the remote update (running a test query from pgAdmin under the same user, both triggers fire OK).
I assumed it is because the update is processed as one transaction/query on the remote server, so I tried to process both tables in the second trigger, but still no luck - only table2 is processed.
What could be the reason?
Thanks
P.S.
Trigger codes
Version 1
PROCEDURE fn_replicate_data:
DECLARE
BEGIN
    PERFORM DBLINK_EXEC('myconn','INSERT INTO table1(dataid,sessionid,uid) VALUES('||new.dataid||','||new.sessionid||','||new.uid||')');
    RETURN new;
END;
PROCEDURE fn_replicate_data2:
DECLARE
BEGIN
    PERFORM DBLINK_EXEC('myconn','INSERT INTO table2(dataid,data) VALUES('||new.dataid||','''||new.data||''')');
    RETURN new;
END;
CREATE TRIGGER tr_remote_insert_data
BEFORE INSERT OR UPDATE ON table1
FOR EACH ROW EXECUTE PROCEDURE fn_replicate_data();
CREATE TRIGGER tr_remote_insert_data2
BEFORE INSERT OR UPDATE ON table2
FOR EACH ROW EXECUTE PROCEDURE fn_replicate_data2();
Version 2
PROCEDURE fn_replicate_data:
DECLARE
    var table1%ROWTYPE;
BEGIN
    select * into var from table1 where dataid = new.dataid;
    PERFORM DBLINK_EXEC('myconn','INSERT INTO table1(dataid,sessionid,uid) VALUES('||var.dataid||','||var.sessionid||','||var.uid||')');
    PERFORM DBLINK_EXEC('myconn','INSERT INTO table2(dataid,data) VALUES('||new.dataid||','''||new.data||''')');
    RETURN new;
END;
CREATE TRIGGER tr_remote_insert_data
BEFORE INSERT OR UPDATE ON table2
FOR EACH ROW EXECUTE PROCEDURE fn_replicate_data();
The reason was a NULL value in the uid field: it had type bigint and no default value in the db, and concatenating a NULL into the query string makes the whole string NULL, so the trigger's dblink_exec call did nothing.
The fix is either
IF NEW.uid IS NULL THEN
    uid := 'DEFAULT';
ELSE
    uid := NEW.uid::text;
END IF;
(with uid declared as text and spliced into the insert string) before the insert query, or (simpler) adding a default value in the db.
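Alternatively, NULL-safe quoting sidesteps the problem entirely. A minimal sketch of the first trigger function's statement using format() and quote_nullable() (same table and connection names as above; note this inserts a literal NULL rather than the column default):

PERFORM DBLINK_EXEC('myconn',
    format('INSERT INTO table1(dataid,sessionid,uid) VALUES (%s,%s,%s)',
           quote_nullable(new.dataid),
           quote_nullable(new.sessionid),
           quote_nullable(new.uid)));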

postgres count from table efficient way

In my application we are using PostgreSQL; the summary table now has one million records.
When I run the following query, it takes 80,927 ms:
SELECT COUNT(*) AS count
FROM summary_views
GROUP BY question_id,category_type_id
Is there any efficient way to do this?
COUNT(*) in PostgreSQL tends to be slow. That is a consequence of MVCC: visibility rules mean the rows actually have to be scanned and counted. One of the workarounds for the problem is a row-counting trigger with a helper table:
create table table_count(
    table_count_id text primary key,
    rows int default 0
);

CREATE OR REPLACE FUNCTION table_count_update()
RETURNS trigger AS
$BODY$
begin
    if tg_op = 'INSERT' then
        update table_count set rows = rows + 1
        where table_count_id = TG_TABLE_NAME;
    elsif tg_op = 'DELETE' then
        update table_count set rows = rows - 1
        where table_count_id = TG_TABLE_NAME;
    end if;
    return null;
end;
$BODY$
LANGUAGE 'plpgsql' VOLATILE;
The next step is to add the proper trigger declaration for each table you'd like to use it with. For example, for table tab_name:
begin;
insert into table_count values
('tab_name',(select count(*) from tab_name));
create trigger tab_name_table_count after insert or delete
on tab_name for each row execute procedure table_count_update();
commit;
It is important to run this in a transaction block to keep the actual count and the helper table in sync in case of a delete or insert between the initial count and the trigger creation; the transaction guarantees this. From now on, to get the current count instantly, just invoke:
select rows from table_count where table_count_id = 'tab_name';
Edit: for your GROUP BY clause you'll need a more sophisticated trigger function and count table, keyed by the grouping columns; see the sketch below.
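A minimal sketch of that per-group variant, assuming summary_views has question_id and category_type_id columns and PostgreSQL 9.5+ for ON CONFLICT:

create table summary_views_count(
    question_id int,
    category_type_id int,
    rows int default 0,
    primary key (question_id, category_type_id)
);

CREATE OR REPLACE FUNCTION summary_views_count_update()
RETURNS trigger AS
$BODY$
begin
    if tg_op = 'INSERT' then
        -- create the group on first sight, otherwise bump its counter
        insert into summary_views_count (question_id, category_type_id, rows)
        values (new.question_id, new.category_type_id, 1)
        on conflict (question_id, category_type_id)
        do update set rows = summary_views_count.rows + 1;
    elsif tg_op = 'DELETE' then
        update summary_views_count
        set rows = rows - 1
        where question_id = old.question_id
          and category_type_id = old.category_type_id;
    end if;
    return null;
end;
$BODY$
LANGUAGE plpgsql VOLATILE;

create trigger summary_views_table_count after insert or delete
on summary_views for each row execute procedure summary_views_count_update();

The per-group counts then come straight from summary_views_count instead of scanning summary_views.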