How to COPY to multiple CSV files in PostgreSQL? - postgresql

I've got a PostGIS database of points in Postgres, and I would like to extract the points in several geographically distinct areas to CSV files, one file per area.
I have set up an area table with area polygons, and area titles and I would like to effectively loop through that table, using something like Postgis' st_intersects() to select the data to go in each CSV file, and get the filename for the CSV file from the title in the area table.
I'm comfortable with the details of doing the intersection code, and setting up the CSV output - what I don't know is how to do it for each area. Is it possible to do something like this with some sort of join? Or do I need to do it with a stored procedure, and use a loop construct in plpgsql?

You can loop over rows in your area table in plpgsql. But be careful to get quoting of identifiers and values right:
Assuming this setup:
CREATE TABLE area (
title text PRIMARY KEY
, area_polygon geometry
);
CREATE TABLE points(
point_id serial PRIMARY KEY
, the_geom geometry);
You can use this plpgsql block:
DO
$do$
DECLARE
_title text;
BEGIN
FOR _title IN
SELECT title FROM area
LOOP
EXECUTE format('COPY (SELECT p.*
FROM area a
JOIN points p ON ST_INTERSECTS(p.the_geom, a.area_polygon)
WHERE a.title = %L) TO %L (FORMAT csv)'
, _title
, '/path/to/' || _title || '.csv');
END LOOP;
END
$do$;
Use format with %L (for string literal) to get properly quoted strings to avoid syntax errors and possible SQL injection. You still need to use strings in area.title that work for file names.)
Also careful to quote the filename as a whole, not just the title part of it.
You must concatenate the whole command as string. The "utility command" COPY does not allow variable substitution. That's only possible with the core DML commands SELECT, INSERT, UPDATE, and DELETE. See:
Error when setting n_distinct using a plpgsql variable
So don't read out area.area_polygon in the loop. It would have to be cast to text to concatenate it into the query string, where the text representation would be cast back to geometry (or whatever your actual undisclosed data type is). That's prone to errors.
Instead I only read area.title to uniquely identify the row and handle the rest in the query internally.

You can use a plpgsql function or an inline do (if you only need to do it once and you don't want to store a function.)
do $body$
DECLARE i int;
BEGIN FOR i IN SELECT DISTINCT city FROM table
LOOP RAISE
NOTICE 'foo';
EXECUTE format($$COPY (SELECT * FROM foo WHERE x='%s') TO /tmp/%s$$, i, i);
END LOOP;
RETURN;
END;
$body$ LANGUAGE plpgsql;

Related

Declare and return value for DELETE and INSERT

I am trying to remove duplicated data from some of our databases based upon unique id's. All deleted data should be stored in a separate table for auditing purposes. Since it concerns quite some databases and different schemas and tables I wanted to start using variables to reduce chance of errors and the amount of work it will take me.
This is the best example query I could think off, but it doesn't work:
do $$
declare #source_schema varchar := 'my_source_schema';
declare #source_table varchar := 'my_source_table';
declare #target_table varchar := 'my_target_schema' || source_table || '_duplicates'; --target schema and appendix are always the same, source_table is a variable input.
declare #unique_keys varchar := ('1', '2', '3')
begin
select into #target_table
from #source_schema.#source_table
where id in (#unique_keys);
delete from #source_schema.#source_table where export_id in (#unique_keys);
end ;
$$;
The query syntax works with hard-coded values.
Most of the times my variables are perceived as columns or not recognized at all. :(
You need to create and then call a plpgsql procedure with input parameters :
CREATE OR REPLACE PROCEDURE duplicates_suppress
(my_target_schema text, my_source_schema text, my_source_table text, unique_keys text[])
LANGUAGE plpgsql AS
$$
BEGIN
EXECUTE FORMAT(
'WITH list AS (INSERT INTO %1$I.%3$I_duplicates SELECT * FROM %2$I.%3$I WHERE array[id] <# %4$L :: integer[] RETURNING id)
DELETE FROM %2$I.%3$I AS t USING list AS l WHERE t.id = l.id', my_target_schema, my_source_schema, my_source_table, unique_keys :: text) ;
END ;
$$ ;
The procedure duplicates_suppress inserts into my_target_schema.my_source_table || '_duplicates' the rows from my_source_schema.my_source_table whose id is in the array unique_keys and then deletes these rows from the table my_source_schema.my_source_table .
See the test result in dbfiddle.
As has been commented, you need some kind of dynamic SQL. In a FUNCTION, PROCEDURE or a DO statement to do it on the server.
You should be comfortable with PL/pgSQL. Dynamic SQL is no beginners' toy.
Example with a PROCEDURE, like Edouard already suggested. You'll need a FUNCTION instead to wrap it in an outer transaction (like you very well might). See:
When to use stored procedure / user-defined function?
CREATE OR REPLACE PROCEDURE pg_temp.f_archive_dupes(_source_schema text, _source_table text, _unique_keys int[], OUT _row_count int)
LANGUAGE plpgsql AS
$proc$
-- target schema and appendix are always the same, source_table is a variable input
DECLARE
_target_schema CONSTANT text := 's2'; -- hardcoded
_target_table text := _source_table || '_duplicates';
_sql text := format(
'WITH del AS (
DELETE FROM %I.%I
WHERE id = ANY($1)
RETURNING *
)
INSERT INTO %I.%I TABLE del', _source_schema, _source_table
, _target_schema, _target_table);
BEGIN
RAISE NOTICE '%', _sql; -- debug
EXECUTE _sql USING _unique_keys; -- execute
GET DIAGNOSTICS _row_count = ROW_COUNT;
END
$proc$;
Call:
CALL pg_temp.f_archive_dupes('s1', 't1', '{1, 3}', 0);
db<>fiddle here
I made the procedure temporary, since I assume you don't need to keep it permanently. Create it once per database. See:
How to create a temporary function in PostgreSQL?
Passed schema and table names are case-sensitive strings! (Unlike unquoted identifiers in plain SQL.) Either way, be wary of SQL-injection when concatenating SQL dynamically. See:
Are PostgreSQL column names case-sensitive?
Table name as a PostgreSQL function parameter
Made _unique_keys type int[] (array of integer) since your sample values look like integers. Use a the actual data type of your id columns!
The variable _sql holds the query string, so it can easily be debugged before actually executing. Using RAISE NOTICE '%', _sql; for that purpose.
I suggest to comment the EXECUTE line until you are sure.
I made the PROCEDURE return the number of processed rows. You didn't ask for that, but it's typically convenient. At hardly any cost. See:
Dynamic SQL (EXECUTE) as condition for IF statement
Best way to get result count before LIMIT was applied
Last, but not least, use DELETE ... RETURNING * in a data-modifying CTE. Since that has to find rows only once it comes at about half the cost of separate SELECT and DELETE. And it's perfectly safe. If anything goes wrong, the whole transaction is rolled back anyway.
Two separate commands can also run into concurrency issues or race conditions which are ruled out this way, as DELETE implicitly locks the rows to delete. Example:
Replicating data between Postgres DBs
Or you can build the statements in a client program. Like psql, and use \gexec. Example:
Filter column names from existing table for SQL DDL statement
Based on Erwin's answer, minor optimization...
create or replace procedure pg_temp.p_archive_dump
(_source_schema text, _source_table text,
_unique_key int[],_target_schema text)
language plpgsql as
$$
declare
_row_count bigint;
_target_table text := '';
BEGIN
select quote_ident(_source_table) ||'_'|| array_to_string(_unique_key,'_') into _target_table from quote_ident(_source_table);
raise notice 'the deleted table records will store in %.%',_target_schema, _target_table;
execute format('create table %I.%I as select * from %I.%I limit 0',_target_schema, _target_table,_source_schema,_source_table );
execute format('with mm as ( delete from %I.%I where id = any (%L) returning * ) insert into %I.%I table mm'
,_source_schema,_source_table,_unique_key, _target_schema, _target_table);
GET DIAGNOSTICS _row_count = ROW_COUNT;
RAISE notice 'rows influenced, %',_row_count;
end
$$;
--
if your _unique_key is not that much, this solution also create a table for you. Obviously you need to create the target schema yourself.
If your unique_key is too much, you can customize to properly rename the dumped table.
Let's call it.
call pg_temp.p_archive_dump('s1','t1', '{1,2}','s2');
s1 is the source schema, t1 is source table, {1,2} is the unique key you want to extract to the new table. s2 is the target schema

What Identifier in "EXECUTE FORMAT" with "FOR ... IN" loop "record" variable/iterator that uses dot/period "." for Column name in PLPGSQL Procedure?

I am using a basic plpgsql EXECUTE... FORMAT dynamic script with a FOR...IN loop. The problem is, the LOOP has a variable/pointer (like any FOR LOOP in any language) that iterates through the query result set of the SELECT... query in the EXECUTE. I need to make the column name part after the period/dot in temprecord.column_name dynamic, and thus use an Identifier (i.e. %I, %s, %L) for it, As you know, the way to get data (In this column is an ALTER TABLE sql statement) out of this variable/pointer is to use the "dot notation", that is, i.e.
/* Output employee names from column "names" in Employee table */
FOR temprecord IN
EXECUTE format('SELECT *
FROM %I t', ''Employee'')
LOOP
EXECUTE temprecord.names; -- THIS WORKS FINE WHEN I HARDCODE IT. I CAN'T SEEM TO MAKE IT DYNAMIC
END LOOP;
So the above works fine when I hardcord temprecord.names. The problem is I want the column name dynamic, so if different callers/methods call my function, I can select different columns through the temprecord iterator, and EXECUTE this data.
I tried many times what I have below and the best response I have gotten so far is that the sql query that I have in the column (as I stated above) executed, however, it showed an syntax error and truncation error, but I clearly noticed it was returned the all three columns of my table and concatenating the columns together. But I know the real problem is the variable temprecord not picking up on the dot notation that specifies column name i.e. temprecord.column_name_here, that's why it returns all columns and throws truncation here. As I stated above, it works when hardcoded.
i.e.
/* Using the $$ for string formatting here */
CREATE OR REPLACE PROCEDURE my_proc(drop_or_add text)
LANGUAGE plpgsql
AS
$procedure$
DECLARE
temprecord record;
col_nm text;
BEGIN
col_nm := concat_ws('_','sql',drop_or_add); -- SHOULD CONCATENATE TO COLUMN NAME, I.E. sql_drop OR sql_add WHICH HOLDS SQL "ALTER TABLE..." QUERY
FOR temprecord IN
EXECUTE format($f$ SELECT t.col1, t.col2, t.col3%s
FROM some_tbl t
WHERE t.col4 = %L, drop_or_add, 'blue')
LOOP
EXECUTE format($f$ %I.%s $f$, "temprecord", col_nm);
END LOOP;
END
$procedure$;
Hopefully none of this syntax is confusing, nor my logic. Again, I am simply looping with the temprecord iterator variable and trying to access some columns of my database table, that holds some SQL statements and needs to be dynamic based on the argument passed to my Procedure. So, potentially EXECUTE temprecord.sql_drop or EXECUTE temprecord.sql_add can execute.
I always used the %s in my other plpgsql scripts when I access a column of table using dot notation i.e. tbl1.id = tbl2.% and it's always worked fine. I have tried basically every combination possible, but I'll list a few here, i.e.
EXECUTE format($f$ %I.%s $f$, "temprecord", col_nm);
EXECUTE format($f$ %I%s $f$, "temprecord", '.sql_drop');
EXECUTE format($f$ %I.%s $f$, temprecord, col_nm);
EXECUTE format($f$ %I.%I $f$, "temprecord", col_nm);
EXECUTE format($f$ %I.%I $f$, "temprecord", "col_nm");
.
..
...
the string passed to EXECUTE statement have to be valid SQL command. It cannot be an expression.
The PL/pgSQL is almost static language - there are some dynamic features, but almost it is static typed language. Overusing EXECUTE statement is generally bad idea. It can work, but the code will be very unreadable and not well maintainable.
For dynamic access to record, the best way is (today) using transformation to json:
postgres=# create or replace function fx(nm varchar(64))
returns void as $$
declare
r record;
r2 record;
begin
for r in execute format('select v as %I from generate_series(1,3) g(v)', nm)
loop
raise notice '%', (row_to_json(r))->>nm;
end loop;
end;
$$ language plpgsql;
CREATE FUNCTION
postgres=# select fx('foo');
NOTICE: 1
NOTICE: 2
NOTICE: 3
┌────┐
│ fx │
╞════╡
│ │
└────┘
(1 row)
I was able to find a solution, however, I absolutely tried every combination of Identifiers and I could not get this to work when concatenating the record iterator and the column name in the EXECUTE FORMAT block. So my solution was just stick stick a CASE statement before the second EXECUTE... and then directly pass that String variable to the EXECUTE and use only one identifier. It seemed like the problem was any time I was trying to divide/split up the temprecord.column_name inside the FORMAT i.e. EXECUTE FORMAT($f$ %I.$s $f$, temprecord,column_name), it would just not work. However, I am able to do this with regular Tables and Columns in SELECT.. FROM WHERE t1.col1 = t2.%, so I'm not sure if its just different with the record variable because its part of a FOR... IN LOOP or something.
Solution
CASE
WHEN drop_or_add ILIKE '%drop%' THEN
v_exect_sql := temprow.sql_drop;
WHEN drop_or_add ILIKE '%add%' THEN
v_exect_sql := temprow.sql_add;
END CASE;
BEGIN
EXECUTE format($f$ %s $f$, v_exect_sql); -- Have to pass the `temprecord.column_name` all as one Identifier or would NOT work for me

Syntax for parameterized PostgreSQL function using dynamic SQL

This code:
ALTER TABLE myschema.mytable add column geom geometry (point,4326);
CREATE INDEX mytable_idx on myschema.mytable using GIST(geom);
UPDATE myschema.mytable set geom = st_setsrid(st_point(mytable.long, mytable.lat), 4326);
This works fine when updating a single table. How would you convert it into a dynamic SQL function, with schema and table as parameters?
Since the function input must be an existing table, the simplest safe way would be to use a regclass input parameter like demonstrated here:
Table name as a PostgreSQL function parameter
However, you also need the bare table name for the concatenated index name, so I'll stick with taking text for schema and table separately:
CREATE OR REPLACE FUNCTION create_geom(_sch text, _tab text)
RETURNS void
LANGUAGE plpgsql AS
$func$
BEGIN
EXECUTE format(
'ALTER TABLE %1$I.%2$I ADD COLUMN geom geometry(POINT,4326);
UPDATE %1$I.%2$I SET geom = st_setsrid(st_point(long, lat), 4326);
CREATE INDEX %3$I ON %1$I.%2$I USING gist(geom);'
, _sch, _tab
, _tab || '_geom_gist_idx');
END
$func$;
Call:
SELECT create_geom('myschema', 'mytable');
Use a single EXECUTE, no need for multiple calls.
Just omit table-qualification for columns in the UPDATE. While not joining additional tables, column names are unambiguous. Else, use a table alias, which can be constant. Like:
UPDATE %1$s AS x SET geom = st_setsrid(st_point(x.long, x.lat), 4326);
But it's smarter to populate the column before you build the index. That's a lot faster and produces a balanced index without bloat. So I switched the commands.
Note how I concatenate the index name first (_tab || '_geom_gist_idx'), and then double-quote as required with %3$I. That's the safe way. Something like %I_idx fails with non-standard names.
That said, it's typically a mistake to add columns with redundant information to a table. (What keeps you from changing one or the other? Why bloat the table?) Either just use an expression index instead of all of the above:
CREATE INDEX ON myschema.mytable USING gist (st_setsrid(st_point(long, lat), 4326));
Or drop the now redundant long & lat from the table. Those can be extracted from the new geom cheaply on the fly.
Or, if you need all columns (for special performance reasons?), consider a generated column instead. See:
Computed / calculated / virtual / derived columns in PostgreSQL
Having your queries as SQL templates and using format function for identifiers:
CREATE OR REPLACE FUNCTION public.create_geom(sch text, tab text)
RETURNS void language plpgsql AS $body$
DECLARE
DYNSQLA constant text := 'ALTER TABLE %I.%I add column geom geometry (point,4326)';
DYNSQLB constant text := 'CREATE INDEX %I_idx on %I.%I using GIST(geom);';
DYNSQLC constant text := 'UPDATE %I.%I set geom = st_setsrid(st_point(%I.long, %I.lat), 4326)';
BEGIN
execute format(DYNSQLA, sch, tab);
execute format(DYNSQLB, tab, sch, tab);
execute format(DYNSQLC, sch, tab, tab, tab);
END;
$body$;
SELECT create_geom('myschema','mytable');

How to call a stored procedure in SQL developer where the argument is a long string with carriage returns

I am trying to run a stored procedure that takes a string with carriage returns as an argument from within a SQL Developer session. Specifically the string is itself a SQL statement which gets picked up by the procedure, processed and stored in a table.
The problem is finding a way to preserve the readable formatting of the statements i.e., text with multiple lines / carriage returns. The closest I have gotten is the following:
1) Create a new table to store the SQL statements:
CREATE TABLE sql_table
(
id NUMBER,
sql_string CLOB
);
2) Create a stored procedure:
CREATE OR REPLACE PROCEDURE
update_sql(var_id NUMBER, var_sql_string IN CLOB) IS
BEGIN
INSERT INTO sql_table (id, sql_string) VALUES (var_id, var_sql_string);
--do other stuff
END; /
3) Run the following command to add a new row with a sql statement to the table:
EXEC update_sql(127,'&input');
4) At the prompt, paste in a multiple line statement such as:
SELECT *
FROM any_table
WHERE a = b;
5) Then query sql_table and copy and paste the content of the column sql_string into a text editor - the carriage returns are now gone:
SELECT * FROM any_table WHERE a=b;
As mentioned, I would like to preserve the carriage returns so that the statements display nicely when extracted from the table.
Any help appreciated.
Thanks,
Dennis
You didn't share any sample text you wanted to store, so I made up a very simple one
DECLARE
VAR_ID NUMBER;
VAR_SQL_STRING CLOB;
BEGIN
VAR_ID := 4;
VAR_SQL_STRING := q'[SELECT
* FROM EMPLOYEES
WHERE FIRST_NAME = 'JEFF']';
UPDATE_SQL(
VAR_ID => VAR_ID,
VAR_SQL_STRING => VAR_SQL_STRING
);
--rollback;
END;
The q feature for string literals is quite nice because it handles quoted strings for you auto-magically, which you'll have all over the place in your SQL. Steven Feuerstein talks about this here in his article covering strings in PL/SQL.
You don't say HOW you're extracting the data...
As mentioned, I would like to preserve the carriage returns so that
the statements display nicely when extracted from the table.
But browsing the table in SQL Developer -

Use function variable in dynamic COPY statement

According to docs of PostgreSQL it is possible to copy data to csv file right from a query without using an intermediate table. I am curious how to do that.
CREATE OR REPLACE FUNCTION m_tbl(my_var integer)
RETURNS void AS
$BODY$
DECLARE
BEGIN
COPY (
select my_var
)
TO 'c:/temp/out.csv';
END;
$$ LANGUAGE plpgsql;
I get an error: no such column 'my_var'.
Yes, it is possible to COPY from any query, whether or not it refers to a table.
However, COPY is a non-plannable statement, a utility statement. It doesn't support query parameters - and query parameters are how PL/PgSQL implements the insertion of variables into statements.
So you can't use PL/PgSQL variables with COPY.
You must instead use dynamic SQL with EXECUTE. See the Pl/PgSQL documentation for examples. There are lots of examples here on Stack Overflow and on https://dba.stackexchange.com/ too.
Something like:
EXECUTE format('
COPY (
select %L
)
TO ''c:/temp/out.csv'';
', my_var);
The same applies if you want the file path to be dynamic - you'd use:
EXECUTE format('
COPY (
select %L
)
TO %L;
', my_var, 'file_name.csv');
It also works for dynamic column names but you would use %I (for identifier, like "my_name") instead of %L for literal like 'my_value'. For details on %I and %L, see the documentation for format.