Creating a table with many columns in PostgreSQL

In order to use COPY in PostgreSQL (in my case, from a CSV file) I need to create the destination table first.
Now, if my table has 60 columns, for instance, it feels weird and inefficient to write this out manually:
CREATE TABLE table_name(
column1 datatype,
column2 datatype,
column3 datatype,
.....
column60 datatype
);
Those who use PostgreSQL - how do you get around this issue?

I usually use the file_fdw extension to read data from CSV files.
But unfortunately, file_fdw is not that convenient or flexible for tasks like reading a CSV file with many columns. CREATE TABLE will work with any number of columns, but if it doesn't correspond to the CSV file, it will fail later, when performing a SELECT. So the problem of explicitly creating the table remains. However, it is possible to solve it.
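For reference, this is what the manual file_fdw setup looks like when the columns are already known (the server name, path and column list here are just placeholders):

CREATE EXTENSION IF NOT EXISTS file_fdw;
CREATE SERVER csv_import FOREIGN DATA WRAPPER file_fdw;
-- one hand-written column definition per CSV column:
CREATE FOREIGN TABLE readcsv_manual (col1 text, col2 text, col3 text)
SERVER csv_import
OPTIONS ( filename '/home/nikolay/tmp/sample.csv', format 'csv' );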
Here is a brute-force approach that doesn't require anything except Postgres. Written in PL/pgSQL, this function tries to create a foreign table with a single column and attempts to SELECT from it. If that fails, the failed attempt is rolled back and it tries again with 2 columns, and so on, until the SELECT succeeds. All columns are of type text – this is quite a limitation, but it still solves the task of having a ready-to-SELECT table instead of doing the work manually.
create or replace function autocreate_table_to_read_csv(
  fdw_server text,
  csv text,
  table_name text,
  max_columns_num int default 100
) returns void as $$
declare
  i int;
  sql text;
  rec record;
begin
  execute format('drop foreign table if exists %I', table_name);
  for i in 1..max_columns_num loop
    begin
      select into sql
        format('create foreign table %I (', table_name)
        || string_agg('col' || n::text || ' text', ', ')
        || format(
             e') server %I options ( filename \'%s\', format \'csv\' );',
             fdw_server,
             csv
           )
      from generate_series(1, i) as g(n);
      raise debug 'SQL: %', sql;
      execute sql;
      execute format('select * from %I limit 1;', table_name) into rec;
      -- looks OK, so the number of columns corresponds to the first row of CSV file
      raise info 'Table % created with % column(s). SQL: %', table_name, i, sql;
      exit;
    exception when others then
      raise debug 'CSV has more than % column(s), making another attempt...', i;
    end;
  end loop;
end;
$$ language plpgsql;
Once it finds the proper number of columns, it reports it (see raise info).
To see more details, run set client_min_messages to debug; before using the function.
Example of use:
test=# create server csv_import foreign data wrapper file_fdw;
CREATE SERVER
test=# set client_min_messages to debug;
SET
test=# select autocreate_table_to_read_csv('csv_import', '/home/nikolay/tmp/sample.csv', 'readcsv');
NOTICE: foreign table "readcsv" does not exist, skipping
DEBUG: SQL: create foreign table readcsv (col1 text) server csv_import options ( filename '/home/nikolay/tmp/sample.csv', format 'csv' );
DEBUG: CSV has more than 1 column(s), making another attempt...
DEBUG: SQL: create foreign table readcsv (col1 text, col2 text) server csv_import options ( filename '/home/nikolay/tmp/sample.csv', format 'csv' );
DEBUG: CSV has more than 2 column(s), making another attempt...
DEBUG: SQL: create foreign table readcsv (col1 text, col2 text, col3 text) server csv_import options ( filename '/home/nikolay/tmp/sample.csv', format 'csv' );
INFO: Table readcsv created with 3 column(s). SQL: create foreign table readcsv (col1 text, col2 text, col3 text) server csv_import options ( filename '/home/nikolay/tmp/sample.csv', format 'csv' );
autocreate_table_to_read_csv
------------------------------
(1 row)
test=# select * from readcsv limit 2;
col1 | col2 | col3
-------+-------+-------
1313 | xvcv | 22
fvbvb | 2434 | 4344
(2 rows)
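From there, getting the data into a regular table (the original goal of the question) is a single statement; real_table is just a placeholder name, and all columns stay text until you cast them:

create table real_table as
select * from readcsv;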
Update: I found an implementation of a very similar approach (but without the "brute force": it requires explicit specification of the number of columns in the CSV file) for COPY .. FROM: How to import CSV file data into a PostgreSQL table?
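For completeness, here is a rough sketch of that COPY-based idea for finding the number of columns without file_fdw. It assumes the CSV's first line is representative and contains no tab characters or quoted commas (use \copy from psql if you don't have server-side file access):

create temp table raw_lines (line text);
copy raw_lines from '/home/nikolay/tmp/sample.csv';  -- default text format: each CSV line lands in the single column
select array_length(string_to_array(line, ','), 1) as column_count
from raw_lines
limit 1;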
P.S. Actually, this would be a really good task for improving the file_fdw and COPY .. FROM capabilities of Postgres and making them more flexible – for example, postgres_fdw has the very handy IMPORT FOREIGN SCHEMA command, which lets you define remote ("foreign") objects very quickly, with just one line – it saves a lot of effort. Having a similar thing for CSV data would be awesome.

Related

Declare and return value for DELETE and INSERT

I am trying to remove duplicated data from some of our databases based upon unique ids. All deleted data should be stored in a separate table for auditing purposes. Since it concerns quite a few databases with different schemas and tables, I wanted to start using variables to reduce the chance of errors and the amount of work it will take me.
This is the best example query I could think of, but it doesn't work:
do $$
declare #source_schema varchar := 'my_source_schema';
declare #source_table varchar := 'my_source_table';
declare #target_table varchar := 'my_target_schema' || source_table || '_duplicates'; --target schema and appendix are always the same, source_table is a variable input.
declare #unique_keys varchar := ('1', '2', '3')
begin
select into #target_table
from #source_schema.#source_table
where id in (#unique_keys);
delete from #source_schema.#source_table where export_id in (#unique_keys);
end ;
$$;
The query syntax works with hard-coded values.
Most of the time my variables are perceived as columns or not recognized at all. :(
You need to create and then call a plpgsql procedure with input parameters:
CREATE OR REPLACE PROCEDURE duplicates_suppress
(my_target_schema text, my_source_schema text, my_source_table text, unique_keys text[])
LANGUAGE plpgsql AS
$$
BEGIN
EXECUTE FORMAT(
'WITH list AS (INSERT INTO %1$I.%3$I_duplicates SELECT * FROM %2$I.%3$I WHERE array[id] <@ %4$L :: integer[] RETURNING id)
DELETE FROM %2$I.%3$I AS t USING list AS l WHERE t.id = l.id', my_target_schema, my_source_schema, my_source_table, unique_keys :: text) ;
END ;
$$ ;
The procedure duplicates_suppress inserts into my_target_schema.my_source_table || '_duplicates' the rows from my_source_schema.my_source_table whose id is in the array unique_keys and then deletes these rows from the table my_source_schema.my_source_table.
See the test result in dbfiddle.
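Call it with the names and keys from your example, along these lines (note that the *_duplicates table must already exist in the target schema, since the procedure only fills it):

CALL duplicates_suppress('my_target_schema', 'my_source_schema', 'my_source_table', ARRAY['1','2','3']);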
As has been commented, you need some kind of dynamic SQL. In a FUNCTION, PROCEDURE or a DO statement to do it on the server.
You should be comfortable with PL/pgSQL. Dynamic SQL is no beginners' toy.
Example with a PROCEDURE, like Edouard already suggested. You'll need a FUNCTION instead if you want to wrap it in an outer transaction (like you very well might). See:
When to use stored procedure / user-defined function?
CREATE OR REPLACE PROCEDURE pg_temp.f_archive_dupes(_source_schema text, _source_table text, _unique_keys int[], OUT _row_count int)
  LANGUAGE plpgsql AS
$proc$
-- target schema and appendix are always the same, source_table is a variable input
DECLARE
   _target_schema CONSTANT text := 's2';  -- hardcoded
   _target_table  text := _source_table || '_duplicates';
   _sql           text := format(
      'WITH del AS (
          DELETE FROM %I.%I
          WHERE id = ANY($1)
          RETURNING *
          )
       INSERT INTO %I.%I TABLE del'
    , _source_schema, _source_table
    , _target_schema, _target_table);
BEGIN
   RAISE NOTICE '%', _sql;           -- debug
   EXECUTE _sql USING _unique_keys;  -- execute
   GET DIAGNOSTICS _row_count = ROW_COUNT;
END
$proc$;
Call:
CALL pg_temp.f_archive_dupes('s1', 't1', '{1, 3}', 0);
db<>fiddle here
I made the procedure temporary, since I assume you don't need to keep it permanently. Create it once per database. See:
How to create a temporary function in PostgreSQL?
Passed schema and table names are case-sensitive strings! (Unlike unquoted identifiers in plain SQL.) Either way, be wary of SQL-injection when concatenating SQL dynamically. See:
Are PostgreSQL column names case-sensitive?
Table name as a PostgreSQL function parameter
Made _unique_keys type int[] (array of integer) since your sample values look like integers. Use the actual data type of your id columns!
The variable _sql holds the query string, so it can easily be debugged before actually executing it. Use RAISE NOTICE '%', _sql; for that purpose.
I suggest commenting out the EXECUTE line until you are sure.
I made the PROCEDURE return the number of processed rows. You didn't ask for that, but it's typically convenient. At hardly any cost. See:
Dynamic SQL (EXECUTE) as condition for IF statement
Best way to get result count before LIMIT was applied
Last, but not least, use DELETE ... RETURNING * in a data-modifying CTE. Since it has to find the rows only once, it comes at about half the cost of a separate SELECT and DELETE. And it's perfectly safe. If anything goes wrong, the whole transaction is rolled back anyway.
Two separate commands can also run into concurrency issues or race conditions which are ruled out this way, as DELETE implicitly locks the rows to delete. Example:
Replicating data between Postgres DBs
Or you can build the statements in a client program. Like psql, and use \gexec. Example:
Filter column names from existing table for SQL DDL statement
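A minimal sketch of that psql route, reusing the s1 / t1 / s2 names and the DELETE ... RETURNING CTE from above (it assumes s2.t1_duplicates already exists; the query is written without a trailing semicolon so that \gexec sends the buffer and executes the generated statement):

SELECT format(
  'WITH del AS (DELETE FROM s1.%1$I WHERE id = ANY(''{1,3}'') RETURNING *)
   INSERT INTO s2.%1$I_duplicates SELECT * FROM del'
, 't1')
\gexec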
Based on Erwin's answer, minor optimization...
create or replace procedure pg_temp.p_archive_dump
(_source_schema text, _source_table text,
_unique_key int[],_target_schema text)
language plpgsql as
$$
declare
_row_count bigint;
_target_table text := '';
BEGIN
_target_table := quote_ident(_source_table) || '_' || array_to_string(_unique_key, '_');
raise notice 'the deleted records will be stored in %.%', _target_schema, _target_table;
execute format('create table %I.%I as select * from %I.%I limit 0',_target_schema, _target_table,_source_schema,_source_table );
execute format('with mm as ( delete from %I.%I where id = any (%L) returning * ) insert into %I.%I table mm'
,_source_schema,_source_table,_unique_key, _target_schema, _target_table);
GET DIAGNOSTICS _row_count = ROW_COUNT;
RAISE notice 'rows affected: %', _row_count;
end
$$;
--
If your _unique_key array is not too long, this solution also creates the target table for you. Obviously you need to create the target schema yourself.
If the unique_key array is long, you can customize how the dumped table is named.
Let's call it.
call pg_temp.p_archive_dump('s1','t1', '{1,2}','s2');
s1 is the source schema, t1 is the source table, {1,2} are the unique keys you want to extract to the new table, and s2 is the target schema.

PostgreSQL dynamic column selection

I am struggling a bit with some dynamic PostgreSQL:
I have a table named "list_columns" containing the list of column names in a column called "column_name"; those column names come from an input table called "input_table".
[list_columns]
column_name
col_name_a
col_name_b
col_name_c...
[input_table]
col_name_a | col_name_b | col_name_c | col_name_d | col_name_e
-----------+------------+------------+------------+-----------
value_a_1  | value_b_1  | value_c_1  | value_d_1  | value_e_1
value_a_2  | value_b_2  | value_c_2  | value_d_2  | value_e_2
...        | ...        | ...        | ...        | ...
What I'd like to do is dynamically create a new table using that list, something like this:
create table output_table as
select (select distinct(column_name) separated by "," from list_columns) from input_table;
The resulting table would be:
[output_table]
col_name_a | col_name_b | col_name_c
-----------+------------+-----------
value_a_1  | value_b_1  | value_c_1
value_a_2  | value_b_2  | value_c_2
...        | ...        | ...
I saw that I should use some EXECUTE procedure, but I can't figure out how to do so.
Note: I know I could directly select the 3 columns; I oversimplified the case.
If someone would be kind enough to help me with this, thank you!
Regards,
Jonathan
You need dynamic SQL for this, and for that you need PL/pgSQL.
You need to assemble the CREATE TABLE statement based on the column names stored in list_columns, then run that generated SQL.
do
$$
declare
l_columns text;
l_sql text;
begin
-- this generates the list of columns
select string_agg(distinct column_name, ',')
into l_columns
from list_columns;
-- this generates the actual CREATE TABLE statement using the columns
-- from the previous step
l_sql := 'create table output_table as select '||l_columns||' from input_table';
-- this runs the generated SQL, thus creating the output table.
execute l_sql;
end;
$$;
If you need that a lot, you can put that into a stored function (your unsupported Postgres version doesn't support real procedures).
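For example, a sketch of the same logic wrapped in a function (same table names as above; columns with special characters would additionally need quote_ident):

create or replace function create_output_table()
  returns void
as
$$
declare
  l_columns text;
begin
  -- assemble the column list from list_columns
  select string_agg(distinct column_name, ',')
    into l_columns
  from list_columns;

  -- recreate the output table from the generated statement
  execute 'drop table if exists output_table';
  execute 'create table output_table as select '||l_columns||' from input_table';
end;
$$
language plpgsql;

-- then simply:
select create_output_table();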

Postgres 11 throwing cache lookup failed for type errors

Here is the test case and results:
drop table if exists test1;
drop table if exists test2;
drop trigger if exists test1_tr on test1;
drop function if exists tf_test1;
create table test1 (name varchar(8) not null);
create table test2 (name varchar(8) not null);
\echo create trigger function tf_test1
CREATE OR REPLACE FUNCTION tf_test1() RETURNS trigger AS $BODY$
BEGIN
IF TG_OP = 'INSERT' THEN
INSERT INTO test2(name) VALUES (NEW.name);
END IF;
return new;
END
$BODY$
LANGUAGE 'plpgsql';
\echo create trigger test1_tr
CREATE TRIGGER test1_tr
AFTER INSERT OR UPDATE OR DELETE ON test1 FOR EACH ROW
EXECUTE PROCEDURE tf_test1();
\echo Insert
insert into test1 (name) values ('NAME_001');
insert into test1 (name) values ('NAME_002');
insert into test1 (name) values ('NAME_003');
insert into test1 (name) values ('NAME_004');
\echo Select test1
select * from test1;
\echo Select test2
select * from test2;
---------------------------- output -------------------------------
DROP TABLE
DROP TABLE
DROP TABLE
DROP TABLE
DROP TRIGGER
DROP FUNCTION
CREATE TABLE
CREATE TABLE
create trigger function tf_test1
CREATE FUNCTION
create trigger test1_tr
CREATE TRIGGER
Insert
INSERT 0 1
psql:test3.sql:28: ERROR: cache lookup failed for type 113
CONTEXT: SQL statement "INSERT INTO test2(name) VALUES (NEW.name)"
PL/pgSQL function tf_test1() line 4 at SQL statement
INSERT 0 1
INSERT 0 1
Select test1
name
----------
NAME_001
NAME_003
NAME_004
(3 rows)
Select test2
name
----------
NAME_001
NAME_003
NAME_004
(3 rows)
We have several servers running various flavors of RHEL 7.x. All Postgres instances are v11. This is happening on about 1/2 of them. There doesn't seem to be any consistent RH version that is the culprit.
I have queried both pg_class and pg_type for the OID referenced as the missing type. In all cases, the result set is empty.
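i.e., something along these lines (113 being the OID from the error message above):

select * from pg_type  where oid = 113;
select * from pg_class where oid = 113;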
Any help is appreciated.
I would also appreciate an insight into what's happening with Postgres. I'm a long-time Oracle DBA, but fairly new to Postgres. It seems like an internal Postgres error and not really a code problem, but a web search doesn't turn up much.
Follow-up on this to provide some closure. We had increased our buffer and effective cache size in the postgresql.conf file and also turned auditing on (pgaudit extension) full blast. For the machines where the PG memory conf parameters exceeded the physical memory of the machine and auditing was turned on, we would get cache lookup errors. A clue about this was that the errors would hop around in the job flow, were not consistent from machine to machine, and were effectively unsquashable bugs (dropping the offending trigger would just cause the cache error somewhere else in the job stream).
For now, we have increased the physical memory of the servers and turned auditing off. The cache lookup errors are gone. Further tuning is needed so we can eventually turn auditing back on.
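For reference, the knobs involved are these postgresql.conf parameters (the values below are purely illustrative, not our actual settings):

shared_buffers = 8GB                   # was sized beyond the machine's physical RAM
effective_cache_size = 24GB
shared_preload_libraries = 'pgaudit'
pgaudit.log = 'all'                    # auditing turned on "full blast"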

Dynamically insert from one table into another and use destination table data types

Below is a function that's part of a process used to periodically upload many CSVs (that can change) into a Postgres 9.6 db. Anyone have any suggestions on how to improve this function or other data upload processes you'd like to share?
The function works (I think), so I thought I'd share in case it would be helpful for someone else. As a total newbie, this took me flipping forever, so hopefully I can save someone some time.
I lifted code from various sources+++ to make this function, which inserts all of the columns in the source table that have matching destination table columns, casting the data type from the source columns into the data type of the destination columns during the insert. I plan to turn this into a trigger function(s) that executes upon update of the source table(s).
Bigger picture: 1) a batch file runs dbf2csv to export DBFs into CSVs, 2) batch files run csvkit to load the many CSVs into separate tables in a schema called dataloader and add a new column for the CSV date, 3) the below function moves the data from the dataloader table to the main tables located in the public schema. I had thought about using PGloader, but I don't know Python. An issue that I will have is if new columns are added to the source CSVs (this function will ignore them), but I can monitor that manually as the columns don't change much.
+++ A few I can remember (thanks!)
Dynamic insert
Dynamic insert #2
Data type
More dynamic code
I experimented with FDWs and can't remember why I didn't use this approach.
Foreign data wrapper
CREATE OR REPLACE FUNCTION dataloader.insert_des3 (
tbl_des pg_catalog.regclass,
tbl_src pg_catalog.regclass
)
RETURNS void AS
$body$
DECLARE
tdes_cols text;
tsrc_cols text;
BEGIN
SET search_path TO dataloader, public;
SELECT string_agg( quote_ident( c1.attname ), ',' ),
string_agg( COALESCE( quote_ident( c2.attname ), 'NULL' ) || '::' || format_type(c1.atttypid, c1.atttypmod), ',' )
INTO tdes_cols,
tsrc_cols
FROM pg_attribute c1
LEFT JOIN pg_attribute c2
ON c2.attrelid = tbl_src
AND c2.attnum > 0 --attnum is the column number of c2
AND NOT c2.attisdropped
AND c1.attname = lower(c2.attname)
WHERE c1.attrelid = tbl_des
AND c1.attnum > 0
AND NOT c1.attisdropped
AND c1.attname <> 'id';
EXECUTE format(
' INSERT INTO %I (%s)
SELECT %s
FROM %I
',
tbl_des,
tdes_cols,
tsrc_cols,
tbl_src
);
END;
$body$
LANGUAGE 'plpgsql'
VOLATILE
CALLED ON NULL INPUT
SECURITY INVOKER
COST 100;
To call the function:
SELECT dataloader.insert_des3('public.tbl_des', 'dataloader.tbl_src');
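The trigger variant mentioned above could look roughly like this: an untested sketch with placeholder names, statement-level so it runs once per load:

CREATE OR REPLACE FUNCTION dataloader.tg_insert_des() RETURNS trigger AS
$body$
BEGIN
  -- TG_TABLE_SCHEMA / TG_TABLE_NAME identify the source table the trigger fired on
  PERFORM dataloader.insert_des3('public.tbl_des',
                                 format('%I.%I', TG_TABLE_SCHEMA, TG_TABLE_NAME)::regclass);
  RETURN NULL;  -- return value is ignored for AFTER ... FOR EACH STATEMENT
END;
$body$ LANGUAGE plpgsql;

CREATE TRIGGER tbl_src_loaded
AFTER INSERT OR UPDATE ON dataloader.tbl_src
FOR EACH STATEMENT
EXECUTE PROCEDURE dataloader.tg_insert_des();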

Create a temporary table from a selection or insert if table already exist

How to create a temporary table, if it does not already exist, and add the selected rows to it?
CREATE TABLE AS
is the simplest and fastest way:
CREATE TEMP TABLE tbl AS
SELECT * FROM tbl WHERE ... ;
Do not use SELECT INTO. See:
Combine two tables into a new one so that select rows from the other one are ignored
Not sure whether table already exists
CREATE TABLE IF NOT EXISTS ... was introduced in version Postgres 9.1.
For older versions, use the function provided in this related answer:
PostgreSQL create table if not exists
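On 9.1 or later, the temp-table variant is simply (column definitions here are placeholders):

CREATE TEMP TABLE IF NOT EXISTS tbl (col1 int, col2 text);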
Then:
INSERT INTO tbl (col1, col2, ...)
SELECT col1, col2, ...
Chances are, something is going wrong in your code if the temp table already exists. Make sure you don't duplicate data in the table or something. Or consider the following paragraph ...
Unique names
Temporary tables are only visible within your current session (not to be confused with transaction!). So the table name cannot conflict with other sessions. If you need unique names within your session, you could use dynamic SQL and utilize a SEQUENCE:
Create once:
CREATE SEQUENCE tablename_helper_seq;
You could use a DO statement (or a plpgsql function):
DO
$do$
BEGIN
EXECUTE
'CREATE TEMP TABLE tbl' || nextval('tablename_helper_seq'::regclass) || ' AS
SELECT * FROM tbl WHERE ... ';
RAISE NOTICE 'Temporary table created: "tbl%"', lastval();
END
$do$;
lastval() and currval(regclass) are instrumental in returning the dynamically created table name.
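For instance, still in the same session (a sketch):

SELECT 'tbl' || currval('tablename_helper_seq') AS temp_table_name;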