How to check content loading status on a static database? - postgresql

We have a static database that we keep updated with loader scripts. These loader scripts get current information from third-party sources, clean it, and upload it to the database.
I have already written some SQL scripts to ensure that the required schemas and tables exist. Now I'd like to check that each table has the expected row count.
I did something like this:
select case when count(*) = <someNumber>
then 'someSchema.someTable OK'
else 'someSchema.someTable BAD row count' end
from someSchema.someTable;
But writing this kind of query for ~300 tables is cumbersome.
Now I was thinking maybe there's a way to have a table like:
create table expected_row_count (
schema_name varchar,
table_name varchar,
row_count bigint
);
And somehow test all listed tables and only output the ones that fail the count check. But I'm kind of lost now... Should I try to write a function? Can a table like this be used to build queries and execute them?

Full credit goes to a_horse_with_no_name; I'm posting a reply for completeness:
Check row count
First let's create some data to test the query:
create schema if not exists data;
create table if not exists data.test1 (nothing int);
create table if not exists data.test2 (nothing int);
insert into data.test1 (nothing)
(select random() from generate_series(1, 28));
insert into data.test2 (nothing)
(select random() from generate_series(1, 55));
create table if not exists public.expected_row_count (
table_schema varchar not null default '',
table_name varchar not null default '',
row_count bigint not null default 0
);
insert into public.expected_row_count (table_schema, table_name, row_count) values
('data', 'test1', (select count(*) from data.test1)),
('data', 'test2', (select count(*) from data.test2))
;
Now the query to check the data:
select * from (
select
table_schema,
table_name,
(xpath('/row/cnt/text()', xml_count))[1]::text::int as row_count
from (
select
table_schema,
table_name,
query_to_xml(format('select count(*) as cnt from %I.%I', table_schema, table_name), false, true, '') as xml_count
from information_schema.tables
where table_schema = 'data' --<< change here for the schema you want
) infs ) as r
inner join expected_row_count erc
on r.table_schema = erc.table_schema
and r.table_name = erc.table_name
and r.row_count != erc.row_count
;
The previous query should give an empty result if all counts are OK, and list the tables with mismatched counts if not. To check it, update the count for some table in expected_row_count and re-run the query. For example:
update expected_row_count set row_count = 666 where table_name = 'test1';
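For convenience, the same comparison can also be wrapped in a view (a minimal sketch based on the query above; the schema filter 'data' and the view name row_count_mismatch are assumptions, adjust as needed), so that after each load the scripts only need to run select * from row_count_mismatch:
create or replace view row_count_mismatch as
select r.table_schema,
       r.table_name,
       erc.row_count as expected_count,
       r.row_count as actual_count
from (
    select table_schema,
           table_name,
           (xpath('/row/cnt/text()',
                  query_to_xml(format('select count(*) as cnt from %I.%I', table_schema, table_name), false, true, '')
           ))[1]::text::bigint as row_count
    from information_schema.tables
    where table_schema = 'data' --<< change here for the schema you want
) r
join expected_row_count erc
  on r.table_schema = erc.table_schema
 and r.table_name = erc.table_name
 and r.row_count <> erc.row_count;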

Related

Execute result of a PostgreSQL query again to get the final result set?

I'm looking for a way to get the list of all JSON attributes across all my PostgreSQL tables dynamically.
I have Query 1, which generates a list of SQL statements; I then want to run those SQL statements to get the final output all in one go (like the dynamic SQL concept in SQL Server).
Query 1 looks like this:
create temporary table test (ordr int, field varchar(1000));
-- Step 1: Create temp table to insert all table/col/json attribute info
insert into test(ordr,field)
select 0 ordr,'create temporary table temp_table
( table_schema varchar(200)
,table_name varchar(200)
,column_name varchar(200)
,json_attribute varchar(200)
,data_type varchar(50)
);'
union
-- Non json type columns
select 1 ordr, 'insert into temp_table(table_name, column_name,data_type,json_attribute)'
union
-- Json columns with data like json object
select
3 ordr,
concat('select distinct ''', t.table_name, ''' tbl, ''', c.column_name, ''' col, ''' , c.data_type,''' data_type, '
,'jsonb_object_keys(', c.column_name, ') json_attribute', ' from ', t.table_name,
' where jsonb_typeof(' , c.column_name, ') = ''object'' union') AS field
from information_schema.tables t
join information_schema.columns c on c.table_name = t.table_name
where t.table_schema not in ('information_schema', 'pg_catalog')
--and table_type = 'BASE TABLE'
and c.data_type ='jsonb';
-- final sql statements to build the temp table
-- copy the whole "txt" column to a separate window and execute it; it will create a temp table "temp_table" containing all tables/cols/json attributes
select ordr
,(case when t.ordr = (select max(t2.ordr) from test t2) then replace(field,'union','') else field end) txt
from test t
union
select 9999, ';select * from temp_table;'
order by 1 ;
Query 1 output: a list of SQL statements.
I'm looking for a way to run Query 1 and then its generated statements, getting the final result set all in one go.
Any lead or guidance would be really appreciated.
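One common way to get this "all in one go" behaviour in PostgreSQL is an anonymous PL/pgSQL DO block that reads the generated fragments back from test and EXECUTEs them. Below is a minimal sketch under the assumption that Query 1 above has already populated test; since an anonymous block cannot return a result set, the final SELECT is issued separately afterwards:
do $$
declare
    ddl_sql text;
    ins_sql text;
begin
    -- fragment 0 is the CREATE TEMPORARY TABLE statement
    select field into ddl_sql from test where ordr = 0;
    execute ddl_sql;

    -- fragments 1 and 3 together form one INSERT ... SELECT ... UNION ... statement;
    -- the last generated SELECT still ends in a dangling UNION, so strip it off
    select regexp_replace(string_agg(field, E'\n' order by ordr), 'union\s*$', '')
      into ins_sql
      from test
     where ordr > 0;
    execute ins_sql;
end
$$;

-- an anonymous DO block cannot return rows, so read the result afterwards
select * from temp_table;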

Postgres-11 : Alter Table dynamically

I am trying to alter a table dynamically based on another table.
Below is the piece of code I wrote in a PostgreSQL stored procedure,
but I am running into some syntax errors.
Please help me here.
I just started working with PostgreSQL and I come from a SQL Server background. In SQL Server I would use PRINT statements to debug dynamic queries inside procedures; if there is an equivalent (or a link to refer to), please share that as well. It would help me.
DROP TABLE IF EXISTS temp_table;
CREATE TEMP TABLE temp_table AS
with cte as
(
select column_name,data_type,character_maximum_length
from information_schema."columns" c
where table_name = 'customer_new' and table_schema = 'public'
and column_default is null
)
,cte1 as
(
select column_name,data_type,character_maximum_length
from information_schema."columns" c
where table_name = 'customer_old' and table_schema = 'public'
and column_default is null
)
select cte.column_name,
case when cte.character_maximum_length is not null then cte.data_type||'('||cte.character_maximum_length||')' else cte.data_type end as data_type
from cte
left join cte1 on cte.column_name = cte1.column_name
where cte1.column_name is null;
for v_column_name,v_data_type in SELECT column_name,data_type FROM temp_table
loop
execute format ('alter table %s add column %s %s ;', v_dump_table_name, v_column_name, v_data_type);
end loop;
DROP TABLE IF EXISTS temp_table;
Thanks in advance.
I also tried building the ALTER statements as text, like this:
DROP TABLE IF EXISTS temp_table;
CREATE TEMP TABLE temp_table AS
with cte as
(
select column_name,data_type,character_maximum_length
from information_schema."columns" c
where table_name = 'customer_new' and table_schema = 'public'
and column_default is null
)
,cte1 as
(
select column_name,data_type,character_maximum_length
from information_schema."columns" c
where table_name = 'customer_old' and table_schema = 'public'
and column_default is null
)
select 'alter table '||v_dump_table_name||' add column '||cte2.column_name||' '||data_type
as col
from cte2;
for v_column in SELECT col FROM temp_table
loop
execute format ('%s ;', v_column);
end loop;
DROP TABLE IF EXISTS temp_table;
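For reference, here is a minimal working sketch of the same idea as an anonymous DO block. The target table name v_dump_table_name is an assumption here (set to customer_old; adjust to your real dump table). On the debugging question: RAISE NOTICE is the usual PL/pgSQL counterpart of SQL Server's PRINT for inspecting dynamic SQL before executing it.
do $$
declare
    -- assumption: the table that should receive the missing columns
    v_dump_table_name text := 'customer_old';
    rec   record;
    v_sql text;
begin
    for rec in
        select c_new.column_name,
               case when c_new.character_maximum_length is not null
                    then c_new.data_type || '(' || c_new.character_maximum_length || ')'
                    else c_new.data_type
               end as data_type
        from information_schema.columns c_new
        where c_new.table_schema = 'public'
          and c_new.table_name   = 'customer_new'
          and c_new.column_default is null
          and not exists (
                select 1
                from information_schema.columns c_old
                where c_old.table_schema = 'public'
                  and c_old.table_name   = 'customer_old'
                  and c_old.column_default is null
                  and c_old.column_name  = c_new.column_name)
    loop
        v_sql := format('alter table %I add column %I %s',
                        v_dump_table_name, rec.column_name, rec.data_type);
        raise notice '%', v_sql;  -- print the statement before running it
        execute v_sql;
    end loop;
end
$$;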

How to list MAX(id) of all tables given the db schema name?

I am looking for a pgsql query to pull the last PK value for all tables in a given db schema.
I need this for my db migration work.
You can do this with a variation of a dynamic row count query:
with pk_list as (
select tbl_ns.nspname as table_schema,
tbl.relname as table_name,
cons.conname as pk_name,
col.attname as pk_column
from pg_class tbl
join pg_constraint cons on tbl.oid = cons.conrelid and cons.contype = 'p'
join pg_namespace tbl_ns on tbl_ns.oid = tbl.relnamespace
join pg_attribute col on col.attrelid = tbl.oid and col.attnum = cons.conkey[1]
join pg_type typ on typ.oid = col.atttypid
where tbl.relkind = 'r'
and cardinality(cons.conkey) = 1 -- only single column primary keys
and tbl_ns.nspname not in ('pg_catalog', 'information_schema')
and typ.typname in ('int2','int4','int8','varchar','numeric','float4','float8','date','timestamp','timestamptz')
and has_table_privilege(format('%I.%I', tbl_ns.nspname, tbl.relname), 'select')
), maxvals as (
select table_schema, table_name, pk_column,
(xpath('/row/max/text()',
query_to_xml(format('select max(%I) from %I.%I', pk_column, table_schema, table_name), true, true, ''))
)[1]::text as max_val
from pk_list
)
select table_schema,
table_name,
pk_column,
max_val
from maxvals;
The first CTE (pk_list) retrieves the name of the primary key column for each "user" table (that is: tables that are not system tables).
The second CTE (maxvals) then creates a select statement that retrieves the max value for each PK column from the first CTE and runs that query using query_to_xml(). The xpath() function is then used to parse the XML and return the max value as a text value (so it's possible to mix numbers and varchars).
The final select then simply displays the result from that.
The above has the following restrictions:
Only single-column primary keys are considered
It only deals with data types that support using max() on them (e.g. UUID columns are not included)

How to create a postgres query to generate counts of columns where the tables are specified as data

I am trying to produce a table containing counts of non-null data points for columns in the "Area Health Resource File", which contains per-county demographic and health data.
I have reworked the data into time series from the provided format, resulting in a bunch of tables named "series_foo" for each data category foo, with rows identified by county FIPS and year (initial and final for multi-year surveys).
Now I want to produce counts over the time series columns. So far the query I have is:
do language plpgsql $$
declare
query text;
begin
query := (with cats as (
select tcategory, format('series_%s', tcategory) series_tbl
from series_categories),
cols as (
select tcategory, series_tbl, attname col
from pg_attribute a join pg_class r on a.attrelid = r.oid
join cats c on c.series_tbl = r.relname
where attname not in ('FIPS', 'initial', 'final')
and attnum >= 0
order by tcategory, col),
scols as (
select tcategory, series_tbl, col,
format('count(%s)', quote_ident(col)) sel
from cols),
sel as (
select format(
E' (select %s tcategory, %s col, %s from %s)\n',
quote_literal(tcategory), quote_literal(col), sel, series_tbl) q
from scols)
select string_agg(q, E'union\n') from sel);
execute format(
'select * into category_column_counts from (%s) x', query);
end;
$$;
(Here the "series_categories" table has category name.)
This ... "works" but is probably hundreds of times too slow. Its doing ~10,000
individual tablescans, which could be reduced 500-fold, as there are only 20-ish
categories. I would like to use select count(col1), count(col2) ...
for each table, then "unnest" these row records and concatenate all together.
I haven't figured it out though. I looked at:
https://stackoverflow.com/a/14087244/435563
for inspiration, but haven't transformed that successfully.
I don't know the AHRF format (I looked up the web site but there are too many cute nurse pictures for me to focus on the content...) but you are probably going about it the wrong way by first extracting the data into multiple tables and then trying to piece it back together again. Instead, you should use a design pattern called Entity-Attribute-Value that stores all the data values in a single table with a category identifier and a "feature" identifier, with table structures somewhat like this:
CREATE TABLE categories (
id serial PRIMARY KEY,
category text NOT NULL,
... -- other attributes like min/max allowable values, measurement technique, etc.
);
CREATE TABLE feature ( -- town, county, state, whatever
id serial PRIMARY KEY,
fips varchar NOT NULL,
name varchar,
... -- other attributes
);
CREATE TABLE measurement (
feature integer REFERENCES feature,
category integer REFERENCES categories,
dt date,
value double precision NOT NULL,
PRIMARY KEY (feature, category, dt)
);
This design pattern is very flexible. For instance, you can store 50 categories for some rows of one feature class and only 5 for another set of rows. You can store data from multiple observations on different dates or years. You can have multiple "feature" tables with separate "measurement" tables, or you can set it up with table inheritance.
Answering your query is then very straightforward using standard PK-FK relationships. More to the point, answering any query is far easier than with your current structure of divide-but-not-conquer.
I don't know exactly how your "initial year"/"final year" data works, but otherwise your requirement would be met by a simple query like so:
SELECT f.fips, c.category, count(*)
FROM feature f -- replace feature by whatever real table you create, like "county"
JOIN measurement m ON m.feature = f.id
JOIN categories c ON c.id = m.category
GROUP BY f.fips, c.category;
Do you want to know dental decay as a function of smoking, alcohol consumption versus psychiatric help, correlation between obesity and substance abuse, trend in toddler development? All fairly easy with the above structure, all a slow, painful slog with multiple tables.
Here is the optimization I found: it uses json_each(row_to_json(c)) to turn records into sequences of individual values.
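To see the core trick in isolation before the full version: a single scan produces one row of per-column counts, row_to_json turns that row into a JSON object, and json_each unpivots it into one (key, value) row per column. A minimal sketch with a hypothetical table series_demo and columns a, b:
create temporary table series_demo (a int, b int);
insert into series_demo values (1, null), (2, 3);

-- one table scan yields all per-column counts, then json_each unpivots them
select d.key as column_name,
       d.value::text::int as non_null_count
from (
    select (json_each(row_to_json(c))).*
    from (select count(a) as a, count(b) as b from series_demo) c
) d;
-- column_name | non_null_count
-- a           | 2
-- b           | 1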
do language plpgsql $$
declare
query text;
begin
query := (with cats as (
select tcategory, table_name
from series_category_tables),
cols as (
select tcategory, table_name, attname col, typname type_name
from pg_attribute a join pg_class r on a.attrelid = r.oid
join cats c on c.table_name = r.relname
join pg_type t on t.oid = a.atttypid
where attname not in ('FIPS', 'initial', 'final')
and attnum >= 0
order by tcategory, col),
-- individual "count" fields
sel as (
select
format(
E' (select %s tcategory, %s table_name, \n'
|| E' d.key column_name, d.value->>''f2'' type_name, '
|| E'(d.value->>''f1'')::int count\n'
|| E' from (\n'
|| E' select (json_each(row_to_json(c))).* from (select\n'
|| E' %s \n'
|| E' from %s) c) d)\n',
quote_literal(tcategory),
quote_literal(table_name),
string_agg(
format(
' row(count(%1$s), %2$s) %1$s',
quote_ident(col), quote_literal(type_name)),
E',\n'), quote_ident(table_name)) selstr
from cols
group by tcategory, table_name),
selu as (
select
string_agg(selstr, E'union\n') selu
from sel)
select * from selu);
drop table if exists category_columns;
create table category_columns (
tcategory text, table_name text,
column_name text, type_name text, count int);
execute format(
'insert into category_columns select * from (%s) x', query);
end;
$$;
It runs in ~45 seconds vs 6 minutes for the previous version. Can I/you do better than this?

Postgresql, select a "fake" row

In Postgres 8.4 or higher, what is the most efficient way to get a row of data populated by defaults without actually creating the row? E.g., as a transaction (pseudocode):
create table "mytable"
(
id serial PRIMARY KEY NOT NULL,
parent_id integer NOT NULL DEFAULT 1,
random_id integer NOT NULL DEFAULT random()
);
begin transaction
fake_row = insert into mytable (id) values (0) returning *;
delete from mytable where id=0;
return fake_row;
end transaction
Basically I'd expect a query with a single row where parent_id is 1 and random_id is a random number (or other function return value), but I don't want this record to persist in the table or affect the primary key sequence serial_id_seq.
My options seem to be using a transaction like above or creating views which are copies of the table with the fake row added but I don't know all the pros and cons of each or whether a better way exists.
I'm looking for an answer that assumes no prior knowledge of the datatypes or default values of any column except id or the number or ordering of the columns. Only the table name will be known and that a record with id 0 should not exist in the table.
In the past I created the fake record 0 as a permanent record but I've come to consider this record a type of pollution (since I typically have to filter it out of future queries).
You can copy the table definition and defaults to the temp table with:
CREATE TEMP TABLE table_name_rt (LIKE table_name INCLUDING DEFAULTS);
And use this temp table to generate dummy rows. Such a table will be dropped at the end of the session (or transaction) and will only be visible to the current session.
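A minimal sketch of that idea against the mytable definition from the question. INCLUDING DEFAULTS copies the default expressions (so the id default still points at mytable's sequence); supplying id explicitly avoids advancing it:
begin;
-- temp copy with the same column defaults; on commit drop removes it automatically
create temp table mytable_rt (like mytable including defaults) on commit drop;

-- supply only the id; every other column is filled in from its default
insert into mytable_rt (id) values (0) returning *;
commit;  -- mytable_rt disappears here, mytable itself was never touched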
You can query the catalog and build a dynamic query.
Say we have this table:
create table test10(
id serial primary key,
first_name varchar( 100 ),
last_name varchar( 100 ) default 'Tom',
age int not null default 38,
salary float default 100.22
);
When you run the following query:
SELECT string_agg( txt, ' ' order by id )
FROM (
select 1 id, 'SELECT ' txt
union all
select 2, -9999 || ' as id '
union all
select 3, ', '
|| coalesce( column_default, 'null'||'::'||c.data_type )
|| ' as ' || c.column_name
from information_schema.columns c
where table_schema = 'public'
and table_name = 'test10'
and ordinal_position > 1
) xx
;
you will get this string as a result:
"SELECT -9999 as id , null::character varying as first_name ,
'Tom'::character varying as last_name , 38 as age , 100.22 as salary"
then execute this query and you will get the "phantom row".
We can build a function that builds and executes the query and returns our row as a result:
CREATE OR REPLACE FUNCTION get_phantom_rec (p_i test10.id%type )
returns test10 as $$
DECLARE
v_sql text;
myrow test10%rowtype;
begin
SELECT string_agg( txt, ' ' order by id )
INTO v_sql
FROM (
select 1 id, 'SELECT ' txt
union all
select 2, p_i || ' as id '
union all
select 3, ', '
|| coalesce( column_default, 'null'||'::'||c.data_type )
|| ' as ' || c.column_name
from information_schema.columns c
where table_schema = 'public'
and table_name = 'test10'
and ordinal_position > 1
) xx
;
EXECUTE v_sql INTO myrow;
RETURN myrow;
END$$ LANGUAGE plpgsql ;
and then this simple query gives you what you want:
select * from get_phantom_rec ( -9999 );
id | first_name | last_name | age | salary
-------+------------+-----------+-----+--------
-9999 | | Tom | 38 | 100.22
I would just select the fake values as literals:
select 1 id, 1 parent_id, 1 user_id
The returned row will be (virtually) indistinguishable from a real row.
To get the values from the catalog:
select
0 as id, -- special case for serial type, just return 0
(select column_default::int -- Cast to int, because we know the column is int
from INFORMATION_SCHEMA.COLUMNS
where table_name = 'mytable'
and column_name = 'parent_id') as parent_id,
(select column_default::int -- Cast to int, because we know the column is int
from INFORMATION_SCHEMA.COLUMNS
where table_name = 'mytable'
and column_name = 'user_id') as user_id;
Note that you must know what the columns are and their type, but this is reasonable. If you change the table schema (except default value), you would need to tweak the query.
See the above as a SQLFiddle.