Delete Duplicate rows in several Postgresql tables


I have a Postgres database with more than 1000 tables (table1, table2, table3, and so on).
I imported all of these tables with a script, and apparently the script had problems during the import.
Many tables have duplicate rows (all values exactly the same).
I can go into each table and delete the duplicate rows using DBeaver, but with over 1000 tables that is very time-consuming.
Example of tables:
table1
name gender age
a m 20
a m 20
b f 21
b f 21
table2
fruit hobby
x running
x running
y stamp
y stamp
How can I do the following:
Identify tables in postgres with duplicate rows.
Delete all duplicate rows, leaving 1 record.
I need to do this on all 1000+ tables at once.

Since you want to automate deduplication across all tables, you need a PL/pgSQL function that builds dynamic queries.
Try this function:
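create or replace function func_dedup(_schemaname varchar) returns void as
$$
declare
  _rec record;
begin
  -- loop over ordinary tables only; views cannot be truncated
  for _rec in select table_name from information_schema.tables
              where table_schema = _schemaname and table_type = 'BASE TABLE'
  loop
    -- %I quotes identifiers, so mixed-case or unusual table names also work
    execute format('create temp table tab_temp as select distinct * from %I.%I', _schemaname, _rec.table_name);
    execute format('truncate %I.%I', _schemaname, _rec.table_name);
    execute format('insert into %I.%I select * from tab_temp', _schemaname, _rec.table_name);
    -- drop the temp table so the name can be reused in the next iteration
    execute 'drop table tab_temp';
  end loop;
end;
$$
language plpgsql;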
Now call your function like below:
select * from func_dedup('your_schema');
Steps:
Get the list of all tables in your schema with the query below, and loop over each table.
select table_name from information_schema.tables where table_schema=_schemaname
Insert all distinct records into a TEMP TABLE.
Truncate the main table.
Insert all the data from the TEMP TABLE back into the main table.
Drop the TEMP TABLE. (Dropping it is important because the same name is reused in the next loop iteration.)
Note: if your tables are very large, consider using a regular table instead of a TEMP TABLE.
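The question also asks how to identify which tables contain duplicates before deleting anything. The following is a minimal sketch of one way to do that, not part of the answer above: the function name func_find_dups and its output column names are made up for illustration. It compares count(*) with the count of distinct rows for every base table in the schema, so it assumes every column type supports equality comparison (tables with json or xml columns, for example, would raise an error).

```sql
create or replace function func_find_dups(_schemaname varchar)
returns table (tbl text, total_rows bigint, distinct_rows bigint) as
$$
declare
  _rec record;
begin
  for _rec in select table_name from information_schema.tables
              where table_schema = _schemaname and table_type = 'BASE TABLE'
  loop
    -- count all rows and distinct rows of the current table
    execute format(
      'select (select count(*) from %1$I.%2$I),
              (select count(*) from (select distinct * from %1$I.%2$I) d)',
      _schemaname, _rec.table_name)
    into total_rows, distinct_rows;

    -- report only tables that actually contain duplicates
    if total_rows > distinct_rows then
      tbl := _rec.table_name;
      return next;
    end if;
  end loop;
end;
$$
language plpgsql;
```

It could be called with select * from func_find_dups('your_schema'); and each returned row names a table whose total row count exceeds its distinct row count.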

Related

Can't create Postgres procedure from a query

I am coming from a mssql world and moving over to postgres. I am trying to create a new procedure from a query I wrote and it fails on creation. I am using pgAdmin 4 to create the proc and I've tried copy-pasting the query into the "code" tab of the dialog box.
What I'm trying to accomplish is inserting a bunch of rows into a table and outputting the ids from the identity column into a temporary table. I will be using those ids for more work further down the line, but it's failing before it is even usable. The way I did it in MSSQL was I had a table variable and used "output inserted.id" to get those values to insert into the table variable.
From what I understand, I have to create a temp table and use the returning keyword in postgres. The following query works if I run it in a query window
CREATE TEMPORARY TABLE temp_table
(
temp_id integer
);
WITH ROWS AS
(
INSERT INTO table_a
(some_name_a)
SELECT some_name_b
FROM table_b
RETURNING id)
INSERT INTO temp_table(temp_id)
SELECT id FROM ROWS;
But when I try to create the procedure for that, I get this error:
ERROR: syntax error at or near "CREATE"
LINE 3: AS $BODY$CREATE TEMPORARY TABLE temp_table
^
Here is what the create proc code looks like:
CREATE OR REPLACE PROCEDURE public.temp()
LANGUAGE 'plpgsql'
AS $BODY$
CREATE TEMPORARY TABLE temp_table
(
temp_id integer
);
WITH ROWS AS
(
INSERT INTO table_a
(some_name_a)
SELECT some_name_b
FROM table_b
RETURNING id)
INSERT INTO temp_table(temp_id)
SELECT id FROM ROWS;
$BODY$;
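One likely cause of the error above is that a PL/pgSQL body must be wrapped in a BEGIN ... END; block, which the procedure text is missing. The following is a hedged sketch of the same procedure with that block added. The IF NOT EXISTS clause is an addition of this sketch (an assumption, so that a second call in the same session does not fail on the existing temp table); everything else is taken from the question.

```sql
CREATE OR REPLACE PROCEDURE public.temp()
LANGUAGE plpgsql
AS $BODY$
BEGIN
    -- IF NOT EXISTS is an assumption added here so repeated calls
    -- within one session reuse the existing temp table
    CREATE TEMPORARY TABLE IF NOT EXISTS temp_table
    (
        temp_id integer
    );

    -- insert the rows and capture the generated ids
    WITH rows AS
    (
        INSERT INTO table_a (some_name_a)
        SELECT some_name_b
        FROM table_b
        RETURNING id
    )
    INSERT INTO temp_table (temp_id)
    SELECT id FROM rows;
END;
$BODY$;
```

The procedure would then be run with CALL public.temp();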

Should I use Plpgsql to loop through table instead of using SQL?

I have a job which runs every night to load changes into a temporary table and apply those changes to the main table.
CREATE TEMP TABLE IF NOT EXISTS tmp AS SELECT * FROM mytable LIMIT 0;
COPY tmp FROM PROGRAM '';
11 SQL queries to update 'mytable' based on data from 'tmp'
I have a large number of queries to delete duplicates from tmp, update values in tmp, update values in the main table and insert new rows into the main table. Is it possible to loop over both tables using plpgsql instead?
UPDATE mytable m
SET "Field" = t."Field" +1
FROM tmp t
WHERE (t."ID" = m."ID");
In this example, it is simple change of a column value. Instead, I want to do more complex operations on both the main table as well as the temp table.
EDIT: here is some PSEUDO code of what I imagine.
LOOP tmp t, mytable m
BEGIN
-- operation in plpgsql including UPDATE, INSERT, DELETE
END
WHERE t.ID = m.ID;

You can use a PL/pgSQL FOR loop to iterate over query results.
DECLARE
  myrow RECORD;
BEGIN
  FOR myrow IN SELECT * FROM table1 JOIN table2 USING (id)
  LOOP
    -- ... do something with the row ...
  END LOOP;
END;
If you want to update a table while looping over it, you can create a FOR UPDATE cursor, but that won't work if the query is a join, because then you are not opening an update cursor on a single table.
Note that writing to or updating temp tables is much faster than writing to normal tables, because temp tables have no WAL and crash recovery overhead, and they are owned by a single connection, so you don't have to worry about locks.
If you put a query inside the loop, though, it will be executed once per row, which can get pretty slow. It's usually faster to use bulk queries, even if they're complicated.
If you want to UPDATE many rows in the temp table with values that depend on other tables and joins, it could be faster to run several updates on the temp table with different join and WHERE conditions.
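As an illustration of the bulk-query approach, here is a minimal sketch that deduplicates tmp, updates existing rows, and inserts new ones with three set-based statements instead of a loop. It reuses the tmp, mytable, "ID" and "Field" names from the question, and it assumes that "ID" identifies a row and that it does not matter which duplicate in tmp is kept; the real job would need its own rules for that.

```sql
-- Remove duplicate "ID" rows from tmp, keeping one arbitrary row per "ID".
-- ctid is the physical row identifier, used here only to tell copies apart.
DELETE FROM tmp a
USING tmp b
WHERE a."ID" = b."ID"
  AND a.ctid > b.ctid;

-- Update rows that already exist in mytable.
UPDATE mytable m
SET "Field" = t."Field"
FROM tmp t
WHERE t."ID" = m."ID";

-- Insert rows that exist only in tmp (tmp has the same columns as mytable).
INSERT INTO mytable
SELECT t.*
FROM tmp t
WHERE NOT EXISTS (SELECT 1 FROM mytable m WHERE m."ID" = t."ID");
```

Each statement touches every affected row in one pass, which is usually cheaper than issuing one statement per row from inside a loop.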

How can I delete all tables from a Firebird 3.0 database using single query?

I'm working on a JSF application that uses a Firebird 3.0 database containing hundreds of tables. I need to delete all tables from time to time.
I have checked this query:
DROP TABLE TABLE_NAME
but it can delete only one table at a time, which is very time-consuming for the program. Is there another approach that removes them all at once?

You can create a procedure that drops the tables:
create or alter procedure PRC_DROP_TABLES
as
declare variable TBL varchar(50);
begin
for select r.rdb$relation_name
from rdb$relation_fields r
where
r.rdb$system_flag=0 and r.rdb$view_context is null
-- and r.rdb$relation_name not containing '$' --uncomment and modify this if you want to filter tables by a condition
group by r.rdb$relation_name
into :tbl do
execute statement 'drop table '||:tbl;
end

Create a temp table (if not exists) for use into a custom procedure

I'm trying to get the hang of using temp tables:
CREATE OR REPLACE FUNCTION test1(user_id BIGINT) RETURNS BIGINT AS
$BODY$
BEGIN
create temp table temp_table1
ON COMMIT DELETE ROWS
as SELECT table1.column1, table1.column2
FROM table1
INNER JOIN -- ............
if exists (select * from temp_table1) then
-- work with the result
return 777;
else
return 0;
end if;
END;
$BODY$
LANGUAGE plpgsql;
I want the rows in temp_table1 to be deleted immediately or as soon as possible, which is why I added ON COMMIT DELETE ROWS. Obviously, I got the error:
ERROR: relation "temp_table1" already exists
I tried to add IF NOT EXISTS, but I couldn't get it to work, and I couldn't find a working example that does what I'm looking for.
Your suggestions?

Drop the table each time before creating the temp table, as below:
BEGIN
DROP TABLE IF EXISTS temp_table1;
create temp table temp_table1
-- the rest of your code goes here

The problem with temp tables is that dropping and recreating them bloats pg_attribute heavily, so one sunny morning you will find database performance dead and pg_attribute at 200+ GB while the database itself is around 10 GB.
We are very heavy on temp tables, with >500 rps and async I/O via Node.js, and experienced very heavy bloating of pg_attribute because of that. All you are left with is very aggressive vacuuming, which hurts performance.
None of the answers given here solves this, because they all bloat pg_attribute heavily.
The solution is simply this:
create temp table if not exists my_temp_table (description) on commit delete rows;
That way you can keep working with temp tables and spare your pg_attribute.
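Applied to the function from the question, the create-once pattern might look like the sketch below. The function name test1_sketch, the parameter name p_user_id and the column types are assumptions made for illustration, and the join and filter that the question elides are left out, so the SELECT reads from table1 only.

```sql
CREATE OR REPLACE FUNCTION test1_sketch(p_user_id BIGINT) RETURNS BIGINT AS
$BODY$
BEGIN
    -- Created once per session; later calls reuse the same table.
    CREATE TEMP TABLE IF NOT EXISTS temp_table1 (
        column1 integer,   -- assumed type
        column2 text       -- assumed type
    ) ON COMMIT DELETE ROWS;

    -- Clear rows left by an earlier call in the same transaction.
    DELETE FROM temp_table1;

    INSERT INTO temp_table1 (column1, column2)
    SELECT t.column1, t.column2
    FROM table1 t;         -- the original join and filter are elided in the question

    IF EXISTS (SELECT 1 FROM temp_table1) THEN
        -- work with the result
        RETURN 777;
    ELSE
        RETURN 0;
    END IF;
END;
$BODY$
LANGUAGE plpgsql;
```

Because the table is only created when it is missing, repeated calls do not add and remove catalog entries; ON COMMIT DELETE ROWS empties it at the end of each transaction.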

You want to DROP the temp table after commit (not DELETE ROWS), so:
begin
create temp table temp_table1
on commit drop
...

Postgres SELECT ... FOR UPDATE in functions

I have two questions about using SELECT … FOR UPDATE row-level locking in a Postgres function:
Does it matter which columns I select? Do they have any relation to what data I need to lock and then update?
SELECT * FROM table WHERE x=y FOR UPDATE;
vs
SELECT 1 FROM table WHERE x=y FOR UPDATE;
I can't do a select in a function without saving the data somewhere, so I save to a dummy variable. This seems hacky; is it the right way to do things?
Here is my function:
CREATE OR REPLACE FUNCTION update_message(v_1 INTEGER, v_timestamp INTEGER, v_version INTEGER)
RETURNS void AS $$
DECLARE
v_timestamp_conv TIMESTAMP;
dummy INTEGER;
BEGIN
SELECT timestamp 'epoch' + v_timestamp * interval '1 second' INTO v_timestamp_conv;
SELECT 1 INTO dummy FROM my_table WHERE userid=v_1 LIMIT 1 FOR UPDATE;
UPDATE my_table SET (timestamp) = (v_timestamp_conv) WHERE userid=v_1 AND version < v_version;
END;
$$ LANGUAGE plpgsql;

Does it matter which columns I select?
No, it doesn't matter. Even if SELECT 1 FROM table WHERE ... FOR UPDATE is used, the query locks all rows that meet the WHERE conditions.
If the query retrieves rows from a join, and we want to lock rows only from specific tables rather than from every table involved in the join, the SELECT ... FOR UPDATE OF list-of-tablenames syntax can be useful:
http://www.postgresql.org/docs/9.0/static/sql-select.html#SQL-FOR-UPDATE-SHARE
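A minimal sketch of the FOR UPDATE OF form, using two hypothetical tables, orders(id, customer_id, status) and customers(id, name), that are not part of the question:

```sql
-- Hypothetical tables: orders(id, customer_id, status), customers(id, name).
-- Only the selected rows of orders are locked; the joined customers rows are not.
SELECT o.id, c.name
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE o.status = 'pending'
FOR UPDATE OF o;
```

The name after OF must be the table name or alias as it appears in the FROM clause, here the alias o.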
I can't do a select in a function without saving the data somewhere, so I save to a dummy variable. This seems hacky; is it the right way to do things?
In PL/pgSQL, use the PERFORM command to run a query and discard its result:
http://www.postgresql.org/docs/9.2/static/plpgsql-statements.html#PLPGSQL-STATEMENTS-SQL-NORESULT
Instead of:
SELECT 1 INTO dummy FROM my_table WHERE userid=v_1 LIMIT 1 FOR UPDATE;
use:
PERFORM 1 FROM my_table WHERE userid=v_1 LIMIT 1 FOR UPDATE;