Using PostgreSQL comments as descriptions in dbt docs - postgresql

We've been adding comments to the columns in Postgres as column descriptions. Similarly, dbt lets you write column descriptions in its docs.
How would I go about writing SQL to automatically carry the same descriptions from Postgres into dbt docs?

Here's how I often do it.
Take a look at this answer on how to pull descriptions from pg_catalog (a Postgres version of that query is sketched after the code sample below).
From there, write a BQ query that generates JSON, which you can then convert to a YAML file you can use directly in dbt.
Run it in BigQuery and save the results as a JSON file.
Use a json2yaml tool.
Save the YAML file to an appropriate place in your project tree.
Code sample:
-- intended to be saved as JSON and converted to YAML
-- ex. cat script_job_id_1.json | python3 json2yaml.py | tee schema.yml
-- version will be created as version: '2'. Remove the quotes after conversion
DECLARE database STRING;
DECLARE dataset STRING;
DECLARE dataset_desc STRING;
DECLARE source_qry STRING;
SET database = "bigquery-public-data";
SET dataset = "census_bureau_acs";
SET dataset_desc = "";
SET source_qry = CONCAT('''CREATE OR REPLACE TEMP TABLE tt_master_table AS ''',
'''(''',
'''SELECT cfp.table_name, ''',
'''cfp.column_name, ''',
'''cfp.description ''',
'''FROM `''', database, '''`.''', dataset, '''.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS cfp ''',
''')''');
EXECUTE IMMEDIATE source_qry;
WITH column_info AS (
SELECT table_name as name,
ARRAY_AGG(STRUCT(column_name AS name, COALESCE(description,"") AS description)) AS columns
FROM tt_master_table
GROUP by table_name
)
, table_level AS (
SELECT CONCAT(database, ".", dataset) AS name,
database,
dataset,
dataset_desc AS `description`,
ARRAY_AGG(
STRUCT(name, columns)) AS tables
FROM column_info
GROUP BY database,
dataset,
dataset_desc
LIMIT 1)
SELECT CAST(2 AS INT) AS version,
ARRAY_AGG(STRUCT(name, database, dataset, description, tables)) AS sources
FROM table_level
GROUP BY version
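If you are pulling the column comments from Postgres itself (the pg_catalog route mentioned above), a query along these lines returns the same table/column/description rows to feed into the JSON step. This is a sketch, not tested against your database; the schema name public is an assumption, adjust it to yours.
-- Sketch: list column comments from the Postgres catalog (schema name 'public' is an assumption)
SELECT c.table_name,
       c.column_name,
       COALESCE(pgd.description, '') AS description
FROM information_schema.columns c
LEFT JOIN pg_catalog.pg_statio_all_tables st
       ON st.schemaname = c.table_schema
      AND st.relname = c.table_name
LEFT JOIN pg_catalog.pg_description pgd
       ON pgd.objoid = st.relid
      AND pgd.objsubid = c.ordinal_position
WHERE c.table_schema = 'public'
ORDER BY c.table_name, c.ordinal_position;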

Related

How to add ONE column to ALL tables in postgresql schema

The question is pretty simple, but I can't seem to find a concrete answer anywhere.
I need to update all tables inside my postgresql schema to include a timestamp column with default NOW(). I'm wondering how I can do this via a query instead of having to go to each individual table. There are several hundred tables in the schema and they all just need to have the one column added with the default value.
Any help would be greatly appreciated!
The easy way with psql: run a query to generate the commands, then save and run the results.
-- Turn off headers:
\t
-- Use SQL to build SQL:
SELECT 'ALTER TABLE public.' || table_name || ' add fecha timestamp not null default now();'
FROM information_schema.tables
WHERE table_type = 'BASE TABLE' AND table_schema='public';
-- If the output looks good, write it to a file and run it:
\g out.tmp
\i out.tmp
-- or, if you don't want the temporary file, use \gexec to run it:
\gexec
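If you'd rather do it in one statement instead of generating a script, a DO block with dynamic SQL also works. This is a sketch following the same example: the column name fecha and the public schema are taken from the query above.
DO $$
DECLARE
    t text;
BEGIN
    FOR t IN
        SELECT table_name
        FROM information_schema.tables
        WHERE table_type = 'BASE TABLE' AND table_schema = 'public'
    LOOP
        -- %I quotes the table name safely
        EXECUTE format(
            'ALTER TABLE public.%I ADD COLUMN fecha timestamp NOT NULL DEFAULT now();',
            t
        );
    END LOOP;
END
$$;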

How to create CSV and save it in a variable for further processing in postgresql?

Facing kind of a mini challenge here today.
I want to create a CSV string from a column of a table in PostgreSQL, using a SQL query inside a stored function, and store it in another table as a single value (for further processing on that table).
My database engine is PostgreSQL.
I have seen lots of examples allowing the user to use COPY TO and COPY FROM but they either return to STDOUT or save to a file.
Copy (Select id From product limit 10) To STDOUT With CSV DELIMITER ',';
Source Data:
Product
id | Name
10 | Product1
21 | Product1
34 | Product1
45 | Product1
17 | Product1
Required/Target Data:
TempTable
value
10,21,34,45,17
Neither of the above fits my requirement. I want to be able to store the generated CSV in another column of another table.
Similar Code for SQL Server:
I used to do this in SQL Server using the following code.
CREATE FUNCTION [dbo].[CreateCSV] (@MyXML XML)
RETURNS VARCHAR(MAX)
AS
BEGIN
    DECLARE @listStr VARCHAR(MAX);
    SELECT
        @listStr =
            COALESCE(@listStr + ',', '') +
            c.value('@Value[1]', 'nvarchar(max)')
    FROM @MyXML.nodes('/row') AS T(c)
    RETURN @listStr
END
In SQL Server, I would generate the CSV by calling the CreateCSV() function within a stored procedure. I am trying to replicate the process in PostgreSQL.
I must admit I am new to PostgreSQL, so I need your help with this.
Appreciate a helpful response.
Thanks
Steve
Thanks @a_horse_with_no_name.
Turns out I needed
SELECT string_agg(id,',') FROM (Select cast (id as varchar(100)) From product limit 10) AS tab;
Thanks for helping me with that. :)
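To wrap that in a stored function and store the result in another table, as the original question asked, a sketch along these lines should work. The target table temp_table and its value column mirror the TempTable/value layout from the question; adjust the names to your schema.
CREATE OR REPLACE FUNCTION create_csv()
RETURNS void AS $$
DECLARE
    csv_value text;
BEGIN
    -- build the CSV string from the first 10 product ids
    SELECT string_agg(id::text, ',')
      INTO csv_value
      FROM (SELECT id FROM product LIMIT 10) AS tab;

    -- store it as a single value in the target table (name assumed)
    INSERT INTO temp_table (value) VALUES (csv_value);
END;
$$ LANGUAGE plpgsql;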

How to copy from CSV file to PostgreSQL table with headers in CSV file?

I want to copy a CSV file to a Postgres table. There are about 100 columns in this table, so I do not want to rewrite them if I don't have to.
I am using the \copy table from 'table.csv' delimiter ',' csv; command but without a table created I get ERROR: relation "table" does not exist. If I add a blank table I get no error, but nothing happens. I tried this command two or three times and there was no output or messages, but the table was not updated when I checked it through PGAdmin.
Is there a way to import a table with headers included like I am trying to do?
This worked. The first row had column names in it.
COPY wheat FROM 'wheat_crop_data.csv' DELIMITER ';' CSV HEADER
With the Python library pandas, you can easily create column names and infer data types from a csv file.
from sqlalchemy import create_engine
import pandas as pd
engine = create_engine('postgresql://user:pass@localhost/db_name')
df = pd.read_csv('/path/to/csv_file')
df.to_sql('pandas_db', engine)
The if_exists parameter can be set to replace or append to an existing table, e.g. df.to_sql('pandas_db', engine, if_exists='replace'). This works for additional input file types as well; see the pandas documentation for details.
Alternative: from the terminal, without server-side file permissions
The PostgreSQL documentation says, in the Notes section of COPY:
The path will be interpreted relative to the working directory of the server process (normally the cluster's data directory), not the client's working directory.
So, generally, using psql or any client, even on a local server, you have problems... And if you're writing a COPY command for other users, e.g. in a GitHub README, the reader will have problems...
The only way to express a relative path with client permissions is to use STDIN:
When STDIN or STDOUT is specified, data is transmitted via the connection between the client and the server.
as noted here:
psql -h remotehost -d remote_mydb -U myuser -c \
"copy mytable (column1, column2) from STDIN with delimiter as ','" \
< ./relative_path/file.csv
I have been using this function for a while with no problems. You just need to provide the number of columns in the csv file, and it will take the header names from the first row and create the table for you:
create or replace function data.load_csv_file
(
    target_table text,   -- name of the table that will be created
    csv_file_path text,
    col_count integer
)
returns void
as $$
declare
    iter integer;    -- dummy integer to iterate columns with
    col text;        -- to keep column names in each iteration
    col_first text;  -- first column name, e.g., top left corner on a csv file or spreadsheet
begin
    set schema 'data';

    create table temp_table ();

    -- add just enough number of columns
    for iter in 1..col_count
    loop
        execute format('alter table temp_table add column col_%s text;', iter);
    end loop;

    -- copy the data from csv file
    execute format('copy temp_table from %L with delimiter '','' quote ''"'' csv ', csv_file_path);

    iter := 1;
    col_first := (select col_1
                  from temp_table
                  limit 1);

    -- update the column names based on the first row which has the column names
    for col in execute format('select unnest(string_to_array(trim(temp_table::text, ''()''), '','')) from temp_table where col_1 = %L', col_first)
    loop
        execute format('alter table temp_table rename column col_%s to %s', iter, col);
        iter := iter + 1;
    end loop;

    -- delete the columns row // using quote_ident or %I does not work here!?
    execute format('delete from temp_table where %s = %L', col_first, col_first);

    -- change the temp table name to the name given as parameter, if not blank
    if length(target_table) > 0 then
        execute format('alter table temp_table rename to %I', target_table);
    end if;
end;
$$ language plpgsql;
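Usage is then a single call, for example (the table name, file path, and column count below are placeholders for your own values):
-- placeholders: adjust the target table name, file path, and column count
select data.load_csv_file('my_table', '/path/to/data_sample.csv', 10);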
## csv with header
$ psql -U$db_user -h$db_host -p$db_port -d DB_NAME \
-c "\COPY TB_NAME FROM 'data_sample.csv' WITH (FORMAT CSV, header);"
## csv without header
$ psql -U$db_user -h$db_host -p$db_port -d DB_NAME \
-c "\COPY TB_NAME FROM 'data_sample.csv' WITH (FORMAT CSV);"
## csv without header, specify column
$ psql -U$db_user -h$db_host -p$db_port -d DB_NAME \
-c "\COPY TB_NAME(COL1,COL2) FROM 'data_sample.csv' WITH (FORMAT CSV);"
All columns in the CSV should match the table's columns (or the specified column list).
About COPY: https://www.postgresql.org/docs/9.2/sql-copy.html
You can use d6tstack, which creates the table for you and is faster than pd.to_sql() because it uses native DB import commands. It supports Postgres as well as MySQL and MS SQL.
import pandas as pd
import d6tstack.utils

df = pd.read_csv('table.csv')
uri_psql = 'postgresql+psycopg2://usr:pwd@localhost/db'
d6tstack.utils.pd_to_psql(df, uri_psql, 'table')
It is also useful for importing multiple CSVs, handling data schema changes, and/or preprocessing with pandas (e.g. for dates) before writing to the db; see further down in the examples notebook:
import glob
import d6tstack.combine_csv

d6tstack.combine_csv.CombinerCSV(glob.glob('*.csv'),
    apply_after_read=apply_fun).to_psql_combine(uri_psql, 'table')

How can a list of table's field names be queried from PostgreSQL?

How can a plain-text list of a table's field names be retrieved from a PostgreSQL database?
Just query INFORMATION_SCHEMA.COLUMNS, like this:
SELECT
column_name,
data_type,
character_maximum_length,
ordinal_position
FROM information_schema.columns
WHERE table_name = 'mytable'
Better still, INFORMATION_SCHEMA is almost universally supported by all popular SQL databases, so this should work anywhere.
If you really just want a plain-text-file recipe, you can execute this query with command-line psql and save the output as CSV or something like that (see the sketch below).
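For example, a one-liner along these lines writes the column names to a text file. This is a sketch: the database name mydb, the table name mytable, and the output file are placeholders.
psql -At -d mydb \
  -c "SELECT column_name FROM information_schema.columns WHERE table_name = 'mytable' ORDER BY ordinal_position" \
  > columns.txt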

SQL join from multiple tables

We've got a system (MS SQL 2008 R2-based) that has a number of "input" databases and one "output" database. I'd like to write a query that will read from the output DB and JOIN it to data in one of the source DBs. However, the source may be one or more individual tables :( The name of the source DB is included in the output DB; ideally, I'd like to do something like the following (pseudo-SQL ahoy):
select o.[UID]
,o.[description]
,i.[data]
from [output].dbo.[description] as o
left join (select [UID]
,[data]
from
[output.sourcedb].dbo.datatable
) as i
on i.[UID] = o.[UID];
Is there any way to do something like the above - "dynamically" specify the database and table to be joined on for each row in the query?
Try using the exec function, then specify the select as a string, adding variables for database names and tables where appropriate. Simple example:
DECLARE @dbName VARCHAR(255), @tableName VARCHAR(255), @colName VARCHAR(255)
...
EXEC('SELECT * FROM ' + @dbName + '.dbo.' + @tableName + ' WHERE ' + @colName + ' = 1')
No, the table must be known at the time you prepare the query. Otherwise how would the query optimizer know what indexes it might be able to use? Or whether the table you reference even has a UID column?
You'll have to do this in stages:
Fetch the sourcedb value from your output database in one query.
Build an SQL query string, interpolating the value you fetched in the first query into the FROM clause of the second query.
Be careful to check that this value contains a legitimate database name. For instance, filter out non-alpha characters or apply a regular expression or look it up in a whitelist. Otherwise you're exposing yourself to a SQL Injection risk.
Execute the new SQL string you built with exec(), as @user353852 suggests.
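Put together, a sketch of that staged approach might look like the following. The column name sourcedb, the whitelist check against sys.databases, and the single-row fetch are assumptions layered on top of the pseudo-SQL above.
DECLARE @sourceDb sysname, @sql nvarchar(max);

-- 1. Fetch the source database name stored in the output database
--    (the column name sourcedb is an assumption)
SELECT TOP 1 @sourceDb = o.[sourcedb]
FROM [output].dbo.[description] AS o;

-- 2. Only proceed if the value matches a known database (whitelist check)
IF EXISTS (SELECT 1 FROM sys.databases WHERE name = @sourceDb)
BEGIN
    -- 3. Build the query string, interpolating the validated and quoted name, then run it
    SET @sql = N'SELECT o.[UID], o.[description], i.[data]
                 FROM [output].dbo.[description] AS o
                 LEFT JOIN ' + QUOTENAME(@sourceDb) + N'.dbo.datatable AS i
                   ON i.[UID] = o.[UID];';
    EXEC sp_executesql @sql;
END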