Should I use PL/pgSQL to loop through tables instead of using SQL?

I have a job which runs every night to load changes into a temporary table and apply those changes to the main table.
CREATE TEMP TABLE IF NOT EXISTS tmp AS SELECT * FROM mytable LIMIT 0;
COPY tmp FROM PROGRAM '';
-- 11 SQL queries to update 'mytable' based on data from 'tmp'
I have a large number of queries to delete duplicates from tmp, update values in tmp, update values in the main table and insert new rows into the main table. Is it possible to loop over both tables using plpgsql instead?
UPDATE mytable m
SET "Field" = t."Field" +1
FROM tmp t
WHERE (t."ID" = m."ID");
In this example, it is a simple change of a column value. Instead, I want to do more complex operations on both the main table and the temp table.
EDIT: here is some pseudocode of what I imagine.
LOOP tmp t, mytable m
BEGIN
-- operation in plpgsql including UPDATE, INSERT, DELETE
END
WHERE t.ID = m.ID;

You can use a plpgsql FOR loop to iterate over query results.
DO $$
DECLARE
    myrow RECORD;
BEGIN
    FOR myrow IN SELECT * FROM table1 JOIN table2 USING (id)
    LOOP
        -- ... do something with the row ...
    END LOOP;
END $$;
If you want to update a table while looping over it, you can create a FOR UPDATE cursor, but that won't work if the query is a join, because then you're not opening an update cursor on a table.
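For the single-table case, here is a minimal sketch of that pattern, reusing mytable and its "Field" column from the question; the per-row UPDATE is purely illustrative:
DO $$
DECLARE
    cur CURSOR FOR SELECT "Field" FROM mytable FOR UPDATE;
    r RECORD;
BEGIN
    FOR r IN cur LOOP
        -- the cursor stays positioned on the row just fetched,
        -- so WHERE CURRENT OF updates exactly that row
        UPDATE mytable SET "Field" = r."Field" + 1 WHERE CURRENT OF cur;
    END LOOP;
END $$;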
Note that writing to and updating temp tables is much faster than writing to normal tables, because temp tables have no WAL or crash-recovery overhead, and they are owned by a single connection, so you don't have to worry about locks.
If you put a query inside the loop, it will be executed many times though, which could get pretty slow. It's usually faster to use bulk queries, even if they're complicated.
If you want to UPDATE many rows in the temp table with values that depend on other tables and joins, it could be faster to run several updates on the temp table with different join and WHERE conditions.
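For instance, staying with the tmp/mytable example from the question (the WHERE conditions here are made up purely for illustration), two set-based updates with different join and filter conditions might replace a row-by-row loop:
UPDATE tmp t
SET "Field" = m."Field"
FROM mytable m
WHERE t."ID" = m."ID"
  AND t."Field" IS NULL;

UPDATE tmp t
SET "Field" = 0
WHERE NOT EXISTS (SELECT 1 FROM mytable m WHERE m."ID" = t."ID");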


Delete Duplicate rows in several Postgresql tables

I have a postgres database with several tables like table1, table2, table3. More than 1000 tables.
I imported all of these tables from a script, and apparently the script had issues during the import.
Many tables have duplicate rows (all values exactly the same).
I am able to go into each table and delete the duplicate rows using DBeaver, but because there are over 1000 tables, it is very time consuming.
Example of tables:
table1
name  gender  age
a     m       20
a     m       20
b     f       21
b     f       21

table2
fruit  hobby
x      running
x      running
y      stamp
y      stamp
How can I do the following:
Identify tables in postgres with duplicate rows.
Delete all duplicate rows, leaving 1 record.
I need to do this on all 1000+ tables at once.
As you want to automate deduplication across all your tables, you need a plpgsql function in which you can write dynamic queries to achieve it.
Try this function:
create or replace function func_dedup(_schemaname varchar) returns void as
$$
declare
    _rec record;
begin
    for _rec in select table_name from information_schema.tables where table_schema = _schemaname
    loop
        execute format('create temp table tab_temp as select distinct * from %I.%I', _schemaname, _rec.table_name);
        execute format('truncate %I.%I', _schemaname, _rec.table_name);
        execute format('insert into %I.%I select * from tab_temp', _schemaname, _rec.table_name);
        execute 'drop table tab_temp';
    end loop;
end;
$$
language plpgsql;
Now call your function like below:
select * from func_dedup('your_schema');
Steps:
Get the list of all tables in your schema using the query below, and loop over each table.
select table_name from information_schema.tables where table_schema = _schemaname
Insert all distinct records in a TEMP TABLE.
Truncate your main table.
Insert all your data from TEMP TABLE to main table.
Drop the TEMP TABLE. (Dropping the temp table here is important, because it has to be recreated for the next loop iteration.)
Note - if your tables are very large, then consider using a regular table instead of a TEMP TABLE.

Is there an alternative to temp tables in PostgreSQL to hold and manipulate result sets?

Within a plpgsql function, I need to hold the result set of a select statement and perform many subsequent queries and manipulations over that set.
I read about temp tables in PostgreSQL, but these tables are visible in session/transaction scope, while I need my table (or any data structure holding the result set) to be locally visible and only exist within the function, so that each function call can have its own copy of that table/data structure.
I simply need a table-like structure to hold a SELECT result set within a function call, instead of temp tables.
Is there such thing?
Concurrent sessions may each have their own (local) temp tables with the same names. Here is an example of a function which does nothing particularly useful, but creates a temp table, returns its data, and drops the table on exit:
create or replace function my_function()
returns table (id int, str text)
language plpgsql as $$
begin
    create temp table my_temp_table as
        select i as id, i::text as str
        from generate_series(1, 3) i;
    return query
        select *
        from my_temp_table;
    drop table my_temp_table;
end $$;
The function may be safely run in concurrent sessions.
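For example, calling it returns the rows built inside the function (expected output shown as comments):
select * from my_function();
--  id | str
-- ----+-----
--   1 | 1
--   2 | 2
--   3 | 3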
drop table if exists... at the beginning of the function is a reasonable alternative. Do not forget to use temp or temporary in create temp table...
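A minimal sketch of that alternative, applied to the example above (the function name is illustrative); the temp table then simply survives until the next call or until the session ends:
create or replace function my_function_v2()
returns table (id int, str text)
language plpgsql as $$
begin
    -- clean up any copy left over from a previous call in this session
    drop table if exists my_temp_table;
    create temp table my_temp_table as
        select i as id, i::text as str
        from generate_series(1, 3) i;
    return query
        select *
        from my_temp_table;
end $$;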

Is it safe to use temporary tables when an application may try to create them for independent, but simultaneous processes?

I am hoping that I can articulate this effectively, so here it goes:
I am creating a model which will be run on a platform by users, possibly simultaneously, but each model run is marked by a unique integer identifier. This model will execute a series of PostgreSQL queries and eventually write a result elsewhere.
Now because of the required parallelization of model runs, I have to make sure that the processes will not collide, despite running in the same database. I am at a point now where I have to store a list of records, sorted by a score variable and then operate on them. This is the beginning of the query:
DO
$$
DECLARE row RECORD;
BEGIN
    DROP TABLE IF EXISTS ranked_clusters;
    CREATE TEMP TABLE ranked_clusters AS (
        SELECT
            pl.cluster_id AS c_id,
            SUM(pl.total_area) AS cluster_score
        FROM
            emob.parking_lots AS pl
        WHERE
            pl.cluster_id IS NOT NULL
            AND
            run_id = 2005149
        GROUP BY
            pl.cluster_id
        ORDER BY
            cluster_score DESC
    );
    FOR row IN SELECT c_id FROM ranked_clusters LOOP
        RAISE NOTICE 'Cluster %', row.c_id;
    END LOOP;
END;
$$ LANGUAGE plpgsql;
So I create a temporary table called ranked_clusters and then iterate through it, at the moment just logging the identifiers of each record.
I have been careful to only build this list from records which have a run_id value equal to a certain number, so data from the same source, but with a different number will be ignored.
What I am worried about however is that a simultaneous process will also create its own ranked_clusters temporary table, which will collide with the first one, invalidating the results.
So my question is essentially this: Are temporary tables only visible to the session which creates them (or to the cursor object from say, Python)? And is it therefore safe to use a temporary table in this way?
The main reason I ask is because I see that these so-called "temporary" tables seem to persist after I execute the query in PgAdmin III, and the query fails on the next execution because the table already exists. This troubles me because it seems as though the tables are actually globally accessible during their lifetime and would therefore introduce the possibility of a collision when a simultaneous run occurs.
Thanks @a_horse_with_no_name for the explanation, but I am not yet convinced that it is safe, because I have been able to execute the following code:
import psycopg2 as pg2

conn = pg2.connect(dbname=CONFIG["GEODB_NAME"],
                   user=CONFIG["GEODB_USER"],
                   password=CONFIG["GEODB_PASS"],
                   host=CONFIG["GEODB_HOST"],
                   port=CONFIG["GEODB_PORT"])
conn.autocommit = True
cur = conn.cursor()

conn2 = pg2.connect(dbname=CONFIG["GEODB_NAME"],
                    user=CONFIG["GEODB_USER"],
                    password=CONFIG["GEODB_PASS"],
                    host=CONFIG["GEODB_HOST"],
                    port=CONFIG["GEODB_PORT"])
conn2.autocommit = True
cur2 = conn.cursor()

cur.execute("CREATE TEMPORARY TABLE temptable (tempcol INTEGER); INSERT INTO temptable VALUES (0);")
cur2.execute("SELECT tempcol FROM temptable;")
print(cur2.fetchall())
And I receive the value from temptable despite it being created as a temporary table in a completely different connection from the one which queries it afterwards. Am I missing something here? It seems like the temporary table is indeed accessible between connections.
The above had a typo: both cursors were actually being spawned from conn, rather than one from conn and another from conn2. Individual connections in psycopg2 are not able to access each other's temporary tables, but cursors spawned from the same connection are.
Temporary tables are only visible to the session (=connection) that created them. Even if two sessions create the same table, they won't interfere with each other.
Temporary tables are removed automatically when the session is disconnected.
If you want to automatically remove them when your transaction ends, use the ON COMMIT DROP option when creating the table.
So the answer is: yes, this is safe.
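For illustration, a minimal sketch of the ON COMMIT DROP variant (the table and column names are made up):
BEGIN;
CREATE TEMP TABLE scratch (c_id int, cluster_score numeric) ON COMMIT DROP;
-- ... fill and use scratch here ...
COMMIT;  -- scratch is dropped automatically at this point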
Unrelated, but: you can't store rows "in a sorted way". Rows in a table have no implicit sort order. The only way you can get a guaranteed sort order is to use an ORDER BY when selecting the rows. The order by that is part of your CREATE TABLE AS statement is pretty much useless.
If you have to rely on the sort order of the rows, the only safe way to do that is in the SELECT statement:
FOR row IN SELECT c_id FROM ranked_clusters ORDER BY cluster_score
LOOP
RAISE NOTICE 'Cluster %', row.c_id;
END LOOP;

Create a temp table (if not exists) for use in a custom procedure

I'm trying to get the hang of using temp tables:
CREATE OR REPLACE FUNCTION test1(user_id BIGINT) RETURNS BIGINT AS
$BODY$
BEGIN
    create temp table temp_table1
    ON COMMIT DELETE ROWS
    as SELECT table1.column1, table1.column2
    FROM table1
    INNER JOIN -- ............
    if exists (select * from temp_table1) then
        -- work with the result
        return 777;
    else
        return 0;
    end if;
END;
$BODY$
LANGUAGE plpgsql;
I want the rows of temp_table1 to be deleted immediately, or as soon as possible; that's why I added ON COMMIT DELETE ROWS. Obviously, I got the error:
ERROR: relation "temp_table1" already exists
I tried to add IF NOT EXISTS, but I simply couldn't find a working example of it that does what I'm looking for.
Your suggestions?
DROP the table each time before creating the TEMP table, as below:
BEGIN
DROP TABLE IF EXISTS temp_table1;
create temp table temp_table1
-- the rest of your code goes here
The problem with temp tables is that dropping and recreating them bloats pg_attribute heavily, and one sunny morning you will find database performance dead, with pg_attribute at 200+ GB while your database is around 10 GB.
We are very heavy on temp tables, having >500 requests per second and async I/O via Node.js, and thus experienced very heavy bloating of pg_attribute because of that. All you are left with is very aggressive vacuuming, which halts performance.
The other answers given here do not solve this, because they all bloat pg_attribute heavily.
So the elegant solution is this:
create temp table if not exists my_temp_table (/* column definitions */) on commit delete rows;
So you go on playing with temp tables and save your pg_attribute.
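A hedged sketch of that pattern inside a function (the table, column, and function names are illustrative): the temp table is created only once per session, later calls reuse the same relation, and only its rows are cleared at commit, so pg_attribute is not churned on every call:
create or replace function use_scratch_table() returns void
language plpgsql as $$
begin
    -- the first call in a session creates the table; later calls find it and skip creation
    create temp table if not exists scratch (id int, payload text) on commit delete rows;
    insert into scratch values (1, 'example');
    -- ... work with scratch here; its rows vanish when the transaction commits
end $$;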
You want to DROP the temp table on commit (not DELETE ROWS), so:
begin
create temp table temp_table1
on commit drop
...
Documentation

Inserts into Sybase table in the cursor fetch loop

I am writing a Sybase stored procedure with a cursor fetching records from a table and inserting records back into it. Surprisingly, I found that the fetches see and return records inserted in the same loop. Under some conditions this results in an endless loop; sometimes the loop ends after inserting a number of extra records. Looking through the Sybase documentation I could not find a cure. Transaction isolation does not help, since we are acting inside a single transaction. Of course, I could solve the problem by inserting into a temporary table in the loop and then inserting back into the main table after the loop ends.
But the question remains: how can I isolate cursor fetches from inserts into the same table?
Pseudocode follows:
create table t (v int)
go
insert into t (v) values (1)
go
create procedure p as
begin
    declare @v int
    declare c cursor for select v from t
    begin transaction
    open c
    while 1 = 1 begin
        fetch c into @v
        if (@@sqlstatus != 0) break
        insert into t (v) values (@v + 1)
    end
    close c
    commit
end
go
exec p
go
I was surprised by this behavior, too. Cursors only iterate over rows; they are not immune to changes made during the loop. The manual states this:
A searched or positioned update on an allpages-locked table can change the location of the row; for example, if it updates key columns of a clustered index. The cursor does not track the row; it remains positioned just before the next row at the original location. Positioned updates are not allowed until a subsequent fetch returns the next row. The updated row may be visible to the cursor a second time, if the row moves to a later position in the search order.
My solution was to insert into a temporary table and copy the results back at the end; this also sped up the process by a factor of about 10.
Pseudo-code:
select * into #results from OriginalTable where 1<>1 --< create temp table with same columns
WHILE fetch...
BEGIN
insert into #results
select -your-results-here
END
insert into OriginalTable select * from #results