Select and union same attributes from many tables using PostgreSQL

I'm trying to write a more efficient PostgreSQL query that will UNION together like attributes across numerous tables. The context is that the database has separate tables for results on different assessments, and I'm trying to look at the outcomes across all assessments. At the moment, for instance, I have one table that stores the names of all of the assessment tables:
| tables |
|---------|
| table_a |
| table_b |
| table_c |
And I'm aggregating the outcomes together using this query (in reality, I'm aggregating across 25+ different tables):
SELECT person_id, subject_id, outcome
FROM table_a
UNION ALL
SELECT person_id, subject_id, outcome
FROM table_b
UNION ALL
SELECT person_id, subject_id, outcome
FROM table_c
Is there a PostgreSQL approach to essentially loop the same SELECT statement through multiple tables and then UNION ALL the results together (so I don't have to repeat the snippet above 25+ times)?

You should write code that generates the statement for you. One thing you can do is to write a PL/pgSQL function:
CREATE FUNCTION get_them()
RETURNS TABLE (
    person_id  bigint,
    subject_id bigint,
    outcome    text
) LANGUAGE plpgsql AS
$$DECLARE
    v_sql text := '';
    v_sep text := '';
    v_tab text;
BEGIN
    FOR v_tab IN
        SELECT tables FROM tab_of_tabs
    LOOP
        v_sql := v_sql || v_sep ||
                 format(
                     'SELECT person_id, subject_id, outcome FROM %I',
                     v_tab
                 );
        v_sep := ' UNION ALL ';
    END LOOP;
    RETURN QUERY EXECUTE v_sql;
END;$$;
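Once the function exists, you can query it like any other set-returning function. A minimal usage sketch, assuming tab_of_tabs is populated with your assessment table names:
-- query the generated union like a regular table
SELECT person_id, subject_id, outcome
FROM get_them();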

PostgreSQL: How to query all tables with exact number of lines

I'm looking for a way to find the row count for all my tables in Postgres. I know I can do this one table at a time with:
SELECT count(*) FROM table_name;
but I'd like to see the row count for all the tables and then order by that to get an idea of how big all my tables are.
There are three ways to get this sort of count, each with its own tradeoffs.
If you want a true count, you have to execute a SELECT statement like the one you used against each table. This is because PostgreSQL keeps row visibility information in the row itself, not anywhere else, so any accurate count can only be relative to some transaction. You're getting a count of what that transaction sees at the point in time when it executes. You could automate this to run against every table in the database, but you probably don't need that level of accuracy or want to wait that long.
WITH tbl AS (
    SELECT table_schema,
           table_name
    FROM information_schema.tables
    WHERE table_name NOT LIKE 'pg_%'
      AND table_schema IN ('public')
)
SELECT table_schema,
       table_name,
       (xpath('/row/c/text()',
              query_to_xml(format('select count(*) as c from %I.%I', table_schema, table_name),
                           FALSE, TRUE, '')))[1]::text::int AS rows_n
FROM tbl
ORDER BY rows_n DESC;
The second approach notes that the statistics collector tracks roughly how many rows are "live" (not deleted or obsoleted by later updates) at any time. This value can be off by a bit under heavy activity, but is generally a good estimate:
SELECT schemaname,relname,n_live_tup
FROM pg_stat_user_tables
ORDER BY n_live_tup DESC;
That can also show you how many rows are dead, which is itself an interesting number to monitor.
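For example, the dead-row estimate comes from the same view; a quick sketch:
-- estimated dead (not yet vacuumed) rows per table, largest first
SELECT schemaname, relname, n_dead_tup
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;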
The third way is to note that the system ANALYZE command, which is executed by the autovacuum process regularly as of PostgreSQL 8.3 to update table statistics, also computes a row estimate. You can grab that one like this:
SELECT
nspname AS schemaname,relname,reltuples
FROM pg_class C
LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
WHERE
nspname NOT IN ('pg_catalog', 'information_schema') AND
relkind='r'
ORDER BY reltuples DESC;
Which of these queries is better to use is hard to say. Normally I make that decision based on whether there's more useful information I also want to use inside of pg_class or inside of pg_stat_user_tables. For basic counting purposes just to see how big things are in general, either should be accurate enough.
Here is a solution that does not require functions to get an accurate count for each table:
select table_schema,
table_name,
(xpath('/row/cnt/text()', xml_count))[1]::text::int as row_count
from (
select table_name, table_schema,
query_to_xml(format('select count(*) as cnt from %I.%I', table_schema, table_name), false, true, '') as xml_count
from information_schema.tables
where table_schema = 'public' --<< change here for the schema you want
) t
query_to_xml will run the passed SQL query and return the result as XML (the row count for that table). The outer xpath() then extracts the count from that XML and converts it to a number.
The derived table is not really necessary, but makes the xpath() a bit easier to understand - otherwise the whole query_to_xml() would need to be passed to the xpath() function.
To get estimates, see Greg Smith's answer.
To get exact counts, the other answers so far are plagued with some issues, some of them serious (see below). Here's a version that's hopefully better:
CREATE FUNCTION rowcount_all(schema_name text DEFAULT 'public')
RETURNS TABLE (table_name text, cnt bigint) AS
$$
DECLARE
    table_name text;
BEGIN
    FOR table_name IN
        SELECT c.relname FROM pg_class c
        JOIN pg_namespace s ON (c.relnamespace = s.oid)
        WHERE c.relkind = 'r' AND s.nspname = schema_name
    LOOP
        RETURN QUERY EXECUTE format('select cast(%L as text), count(*) from %I.%I',
                                    table_name, schema_name, table_name);
    END LOOP;
END
$$ LANGUAGE plpgsql;
It takes a schema name as parameter, or public if no parameter is given.
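For example (the 'audit' schema below is just a placeholder name):
SELECT * FROM rowcount_all();        -- default: the public schema
SELECT * FROM rowcount_all('audit'); -- or any other schema name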
To work with a specific list of schemas or a list coming from a query without modifying the function, it can be called from within a query like this:
WITH rc(schema_name,tbl) AS (
select s.n,rowcount_all(s.n) from (values ('schema1'),('schema2')) as s(n)
)
SELECT schema_name,(tbl).* FROM rc;
This produces a three-column output with the schema, the table, and the row count.
Now here are some issues in the other answers that this function avoids:
Table and schema names shouldn't be injected into executable SQL without being quoted, either with quote_ident or with the more modern format() function and its %I format specifier (see the short demonstration after this list). Otherwise some malicious person may name their table tablename;DROP TABLE other_table, which is a perfectly valid table name.
Even without the SQL injection and funny-characters problems, a table name may exist in variants differing only by case. If one table is named ABCD and another abcd, the SELECT count(*) FROM... must use a quoted name, otherwise it will skip ABCD and count abcd twice. The %I of format does this automatically.
information_schema.tables lists custom composite types in addition to tables, even when table_type is 'BASE TABLE' (!). As a consequence, we can't iterate over information_schema.tables, otherwise we risk running select count(*) from name_of_composite_type, and that would fail. On the other hand, pg_class where relkind='r' should always work fine.
The type of COUNT() is bigint, not int. Tables with more than 2.15 billion rows may exist (running a count(*) on them is a bad idea, though).
A permanent type need not be created for a function to return a result set with several columns. RETURNS TABLE(definition...) is a better alternative.
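To illustrate the quoting point, here is a small sketch of how format()'s %I renders identifiers (the output of each call is shown as a comment):
SELECT format('SELECT count(*) FROM %I', 'abcd');  -- SELECT count(*) FROM abcd
SELECT format('SELECT count(*) FROM %I', 'ABCD');  -- SELECT count(*) FROM "ABCD"
SELECT format('SELECT count(*) FROM %I', 'tablename;DROP TABLE other_table');
-- SELECT count(*) FROM "tablename;DROP TABLE other_table"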
The hacky, practical answer for people trying to evaluate which Heroku plan they need and can't wait for Heroku's slow row counter to refresh:
Basically, you want to run \dt in psql and copy the results to your favorite text editor (they will look like this:
public | auth_group | table | axrsosvelhutvw
public | auth_group_permissions | table | axrsosvelhutvw
public | auth_permission | table | axrsosvelhutvw
public | auth_user | table | axrsosvelhutvw
public | auth_user_groups | table | axrsosvelhutvw
public | auth_user_user_permissions | table | axrsosvelhutvw
public | background_task | table | axrsosvelhutvw
public | django_admin_log | table | axrsosvelhutvw
public | django_content_type | table | axrsosvelhutvw
public | django_migrations | table | axrsosvelhutvw
public | django_session | table | axrsosvelhutvw
public | exercises_assignment | table | axrsosvelhutvw
), then run a regex search and replace like this:
^[^|]*\|\s+([^|]*?)\s+\| table \|.*$
to:
select '\1', count(*) from \1 union/g
which will yield you something very similar to this:
select 'auth_group', count(*) from auth_group union
select 'auth_group_permissions', count(*) from auth_group_permissions union
select 'auth_permission', count(*) from auth_permission union
select 'auth_user', count(*) from auth_user union
select 'auth_user_groups', count(*) from auth_user_groups union
select 'auth_user_user_permissions', count(*) from auth_user_user_permissions union
select 'background_task', count(*) from background_task union
select 'django_admin_log', count(*) from django_admin_log union
select 'django_content_type', count(*) from django_content_type union
select 'django_migrations', count(*) from django_migrations union
select 'django_session', count(*) from django_session
;
(You'll need to remove the last union and add the semicolon at the end manually)
Run it in psql and you're done.
?column? | count
--------------------------------+-------
auth_group_permissions | 0
auth_user_user_permissions | 0
django_session | 1306
django_content_type | 17
auth_user_groups | 162
django_admin_log | 9106
django_migrations | 19
[..]
If you don't mind potentially stale data, you can access the same statistics used by the query optimizer.
Something like:
SELECT relname, n_tup_ins - n_tup_del as rowcount FROM pg_stat_all_tables;
Two simple steps (note: no need to change anything, just copy and paste):
1. create function
create function cnt_rows(schema text, tablename text) returns integer
as
$body$
declare
    result integer;
    query varchar;
begin
    query := 'SELECT count(1) FROM ' || schema || '.' || tablename;
    execute query into result;
    return result;
end;
$body$
language plpgsql;
2. Run this query to get the row count for all the tables
select sum(cnt_rows) as total_no_of_rows from (select
cnt_rows(table_schema, table_name)
from information_schema.tables
where
table_schema not in ('pg_catalog', 'information_schema')
and table_type='BASE TABLE') as subq;
or
To get row counts per table
select
table_schema,
table_name,
cnt_rows(table_schema, table_name)
from information_schema.tables
where
table_schema not in ('pg_catalog', 'information_schema')
and table_type='BASE TABLE'
order by 3 desc;
Not sure if an answer in bash is acceptable to you, but FWIW...
PGCOMMAND=" psql -h localhost -U fred -d mydb -At -c \"
SELECT table_name
FROM information_schema.tables
WHERE table_type='BASE TABLE'
AND table_schema='public'
\""
TABLENAMES=$(export PGPASSWORD=test; eval "$PGCOMMAND")
for TABLENAME in $TABLENAMES; do
PGCOMMAND=" psql -h localhost -U fred -d mydb -At -c \"
SELECT '$TABLENAME',
count(*)
FROM $TABLENAME
\""
eval "$PGCOMMAND"
done
Extracted from my comment on Greg Smith's answer to make it more readable:
with tbl as (
SELECT table_schema,table_name
FROM information_schema.tables
WHERE table_name not like 'pg_%' AND table_schema IN ('public')
)
SELECT
table_schema,
table_name,
(xpath('/row/c/text()',
query_to_xml(format('select count(*) AS c from %I.%I', table_schema, table_name),
false,
true,
'')))[1]::text::int AS rows_n
FROM tbl ORDER BY 3 DESC;
Thanks to @a_horse_with_no_name
I usually don't rely on statistics, especially in PostgreSQL.
SELECT table_name, dsql2('select count(*) from '||table_name) as rownum
FROM information_schema.tables
WHERE table_type='BASE TABLE'
AND table_schema='livescreen'
ORDER BY 2 DESC;
CREATE OR REPLACE FUNCTION dsql2(i_text text)
RETURNS int AS
$BODY$
Declare
v_val int;
BEGIN
execute i_text into v_val;
return v_val;
END;
$BODY$
LANGUAGE plpgsql VOLATILE
COST 100;
This worked for me
SELECT schemaname,relname,n_live_tup FROM pg_stat_user_tables ORDER BY
n_live_tup DESC;
I don't remember the URL I collected this from, but I hope it helps:
CREATE TYPE table_count AS (table_name TEXT, num_rows INTEGER);
CREATE OR REPLACE FUNCTION count_em_all () RETURNS SETOF table_count AS '
DECLARE
the_count RECORD;
t_name RECORD;
r table_count%ROWTYPE;
BEGIN
FOR t_name IN
SELECT
c.relname
FROM
pg_catalog.pg_class c LEFT JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE
c.relkind = ''r''
AND n.nspname = ''public''
ORDER BY 1
LOOP
FOR the_count IN EXECUTE ''SELECT COUNT(*) AS "count" FROM '' || t_name.relname
LOOP
END LOOP;
r.table_name := t_name.relname;
r.num_rows := the_count.count;
RETURN NEXT r;
END LOOP;
RETURN;
END;
' LANGUAGE plpgsql;
Executing select count_em_all(); should get you row count of all your tables.
I made a small variation to include all tables, including non-public ones.
CREATE TYPE table_count AS (table_schema TEXT,table_name TEXT, num_rows INTEGER);
CREATE OR REPLACE FUNCTION count_em_all () RETURNS SETOF table_count AS '
DECLARE
the_count RECORD;
t_name RECORD;
r table_count%ROWTYPE;
BEGIN
FOR t_name IN
SELECT table_schema,table_name
FROM information_schema.tables
where table_schema !=''pg_catalog''
and table_schema !=''information_schema''
ORDER BY 1,2
LOOP
FOR the_count IN EXECUTE ''SELECT COUNT(*) AS "count" FROM '' || t_name.table_schema||''.''||t_name.table_name
LOOP
END LOOP;
r.table_schema := t_name.table_schema;
r.table_name := t_name.table_name;
r.num_rows := the_count.count;
RETURN NEXT r;
END LOOP;
RETURN;
END;
' LANGUAGE plpgsql;
use select count_em_all(); to call it.
Hope you find this useful.
Paul
You can use this query to generate all table names with their counts
select ' select '''|| tablename ||''', count(*) from ' || tablename ||'
union' from pg_tables where schemaname='public';
The result of the above query will be:
select 'dim_date', count(*) from dim_date union
select 'dim_store', count(*) from dim_store union
select 'dim_product', count(*) from dim_product union
select 'dim_employee', count(*) from dim_employee union
You'll need to remove the last union and add the semicolon at the end:
select 'dim_date', count(*) from dim_date union
select 'dim_store', count(*) from dim_store union
select 'dim_product', count(*) from dim_product union
select 'dim_employee', count(*) from dim_employee;
RUN !!!
Here is a much simpler way.
tables="$(echo '\dt' | psql -U "${PGUSER}" | tail -n +4 | head -n-2 | tr -d ' ' | cut -d '|' -f2)"
for table in $tables; do
printf "%s: %s\n" "$table" "$(echo "SELECT COUNT(*) FROM $table;" | psql -U "${PGUSER}" | tail -n +3 | head -n-2 | tr -d ' ')"
done
The output should look like this:
auth_group: 0
auth_group_permissions: 0
auth_permission: 36
auth_user: 2
auth_user_groups: 0
auth_user_user_permissions: 0
authtoken_token: 2
django_admin_log: 0
django_content_type: 9
django_migrations: 22
django_session: 0
mydata_table1: 9011
mydata_table2: 3499
you can update the psql -U "${PGUSER}" portion as needed to access your database
Note that the head -n-2 syntax may not work on macOS; you may need a different implementation there.
Tested on psql (PostgreSQL) 11.2 under CentOS 7
If you want it sorted by row count, just pipe the loop through sort:
for table in $tables; do
printf "%s: %s\n" "$table" "$(echo "SELECT COUNT(*) FROM $table;" | psql -U "${PGUSER}" | tail -n +3 | head -n-2 | tr -d ' ')"
done | sort -k 2,2nr
Output:
mydata_table1: 9011
mydata_table2: 3499
auth_permission: 36
django_migrations: 22
django_content_type: 9
authtoken_token: 2
auth_user: 2
auth_group: 0
auth_group_permissions: 0
auth_user_groups: 0
auth_user_user_permissions: 0
django_admin_log: 0
django_session: 0
I like Daniel Vérité's answer.
But when you can't use a CREATE statement, you can either use a bash solution or, if you're a Windows user, a PowerShell one:
# You don't need this if you have pgpass.conf
$env:PGPASSWORD = "userpass"
# Get table list
$tables = & 'C:\Program Files\PostgreSQL\9.4\bin\psql.exe' -U user -w -d dbname -At -c "select table_name from information_schema.tables where table_type='BASE TABLE' AND table_schema='schema1'"
foreach ($table in $tables) {
& 'C:\path_to_postgresql\bin\psql.exe' -U root -w -d dbname -At -c "select '$table', count(*) from $table"
}
I wanted the total from all tables plus a list of tables with their counts, a little like a performance chart showing where the most time was spent.
WITH results AS (
SELECT nspname AS schemaname,relname,reltuples
FROM pg_class C
LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
WHERE
nspname NOT IN ('pg_catalog', 'information_schema') AND
relkind='r'
GROUP BY schemaname, relname, reltuples
)
SELECT * FROM results
UNION
SELECT 'all' AS schemaname, 'all' AS relname, SUM(reltuples) AS "reltuples" FROM results
ORDER BY reltuples DESC
You could of course put a LIMIT clause on the results in this version too so that you get the largest n offenders as well as a total.
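For instance, a sketch of the same idea trimmed to the ten largest tables plus the total row (an untested variation on the query above):
WITH results AS (
    SELECT nspname AS schemaname, relname, reltuples
    FROM pg_class C
    LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
    WHERE nspname NOT IN ('pg_catalog', 'information_schema')
      AND relkind = 'r'
)
SELECT * FROM (
    SELECT * FROM results ORDER BY reltuples DESC LIMIT 10  -- the 10 largest tables
) top_n
UNION ALL
SELECT 'all' AS schemaname, 'all' AS relname, SUM(reltuples) AS reltuples
FROM results
ORDER BY reltuples DESC;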
One thing that should be noted about this is that you need to let it sit for a while after bulk imports. I tested this by adding 5,000 rows to a database across several tables using real import data. It showed 1,800 records for about a minute (probably a configurable window).
This is based on https://stackoverflow.com/a/2611745/1548557, so thanks and recognition to that answer for the query used within the CTE.
If you're in the psql shell, using \gexec allows you to execute the syntax described in syed's answer and Aur's answer without manual edits in an external text editor.
with x (y) as (
select
'select count(*), '''||
tablename||
''' as "tablename" from '||
tablename||' '
from pg_tables
where schemaname='public'
)
select
string_agg(y,' union all '||chr(10)) || ' order by tablename'
from x \gexec
Note that string_agg() is used both to place the union all delimiter between the statements and to collapse the separate data rows into a single unit to be passed into the query buffer.
\gexec
Sends the current query buffer to the server, then treats each column of each row of the query's output (if any) as a SQL statement to be executed.
The query below will give the row count and size for each table:
select table_schema, table_name,
pg_relation_size('"'||table_schema||'"."'||table_name||'"')/1024/1024 size_MB,
(xpath('/row/c/text()', query_to_xml(format('select count(*) AS c from %I.%I', table_schema, table_name),
false, true,'')))[1]::text::int AS rows_n
from information_schema.tables
order by size_MB desc;

How to pivot or crosstab in postgresql without writing a function?

I have a dataset that looks something like this:
I'd like to aggregate all co values on one row, so the final result looks something like:
Seems pretty easy, right? Just write a query using crosstab, as suggested in this answer. The problem is that it requires that I CREATE EXTENSION tablefunc; and I don't have write access to my DB.
Can anyone recommend an alternative?
Conditional aggregation:
SELECT co,
MIN(CASE WHEN ontology_type = 'industry' THEN tags END) AS industry,
MIN(CASE WHEN ontology_type = 'customer_type' THEN tags END) AS customer_type,
-- ...
FROM tab_name
GROUP BY co
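Since the original sample data was only shown as an image, here is a purely hypothetical miniature dataset (the column names co, ontology_type, and tags are taken from the query above) showing what the conditional aggregation produces:
-- hypothetical data; the real dataset was posted as an image
CREATE TEMP TABLE tab_name (co text, ontology_type text, tags text);
INSERT INTO tab_name VALUES
    ('acme', 'industry',      'software'),
    ('acme', 'customer_type', 'smb');

SELECT co,
       MIN(CASE WHEN ontology_type = 'industry'      THEN tags END) AS industry,
       MIN(CASE WHEN ontology_type = 'customer_type' THEN tags END) AS customer_type
FROM tab_name
GROUP BY co;
--  co   | industry | customer_type
-- ------+----------+---------------
--  acme | software | smb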
You can use DO to generate and PREPARE your own SQL with crosstab columns, then EXECUTE it.
-- replace tab_name to yours table name
DO $$
DECLARE
_query text;
_name text;
BEGIN
_name := 'prepared_query';
_query := '
SELECT co
'||(SELECT ', '||string_agg(DISTINCT
' string_agg(DISTINCT
CASE ontology_type WHEN '||quote_literal(ontology_type)||' THEN tags
ELSE NULL
END, '',''
) AS '||quote_ident(ontology_type),',')
FROM tab_name)||'
FROM tab_name
GROUP BY co
';
BEGIN
EXECUTE 'DEALLOCATE '||_name;
EXCEPTION
WHEN invalid_sql_statement_name THEN
END;
EXECUTE 'PREPARE '||_name||' AS '||_query;
END
$$;
EXECUTE prepared_query;
Since version 9.4 there's json_object_agg(), which lets us do part of the necessary magic dynamically.
However, to be totally dynamic, a temp type (a temp table) has to be built first, by executing dynamic SQL inside an anonymous code block.
DB FIDDLE (UK):
https://dbfiddle.uk/Sn7iO4zL
DISCLAIMER: Typically the ability to create TEMP TABLES is granted to end users, but YMMV. Another concern is whether anonymous code blocks can be executed as in-line code by regular users.
-- /**
-- begin test data
-- begin test data
-- begin test data
-- */
DROP TABLE IF EXISTS tmpSales ;
CREATE TEMP TABLE tmpSales AS
SELECT
sale_id
,TRUNC(RANDOM()*12)+1 AS book_id
,TRUNC(RANDOM()*100)+1 AS customer_id
,(date '2010-01-01' + random() * (timestamp '2016-12-31' - timestamp '2010-01-01')) AS sale_date
FROM generate_series(1,10000) AS sale_id;
DROP TABLE IF EXISTS tmp_month_total ;
CREATE TEMP TABLE tmp_month_total AS
SELECT
date_part( 'year' , sale_date ) AS year
,date_part( 'month', sale_date ) AS mn
,to_char(sale_date, 'mon') AS month
,COUNT(*) AS total
FROM tmpSales
GROUP BY date_part('year', sale_date), to_char(sale_date, 'mon') ,date_part( 'month', sale_date )
;
DATA:
+----+--+-----+-----+
|year|mn|month|total|
+----+--+-----+-----+
|2010|1 |jan |127 |
|2010|2 |feb |117 |
|2010|3 |mar |121 |
|2010|4 |apr |131 |
|2010|5 |may |106 |
|2010|6 |jun |121 |
|2010|7 |jul |129 |
|2010|8 |aug |114 |
|2010|9 |sep |115 |
|2010|10|oct |110 |
|2010|11|nov |133 |
|2010|12|dec |108 |
+----+--+-----+-----+
-- /**
-- END test data
-- END test data
-- END test data
-- */
-- /**
-- dyn. build a temporary row-type based on existing data, not hard-coded
-- dyn. build a temporary row-type based on existing data, not hard-coded
-- dyn. build a temporary row-type based on existing data, not hard-coded
-- **/
DROP TABLE IF EXISTS tmpTblTyp CASCADE ;
DO LANGUAGE plpgsql $$ DECLARE v_sqlstring VARCHAR = ''; BEGIN
v_sqlstring := CONCAT( 'CREATE TEMP TABLE tmpTblTyp AS SELECT '
,(SELECT STRING_AGG( CONCAT('NULL::int AS ' , month )::TEXT , ' ,'
ORDER BY mn
)::TEXT
FROM
(SELECT DISTINCT month, mn FROM tmp_month_total )a )
,' LIMIT 0 '
) ; -- RAISE NOTICE '%', v_sqlstring ;
EXECUTE( v_sqlstring ) ; END $$;
DROP TABLE IF EXISTS tmpMoToJson ;
CREATE TEMP TABLE tmpMoToJson AS
SELECT
year AS year
,(json_build_array( months )) AS js_months_arr
,json_populate_recordset ( NULL::tmpTblTyp /** use temp table as a record type!! **/
, json_build_array( months )
) jprs /** builds row-type column that can be expanded with (jprs).*
**/
FROM ( SELECT year
-- accum data into JSON array
,json_object_agg(month,total) AS months
FROM tmp_month_total
GROUP BY year
ORDER BY year
) a
;
SELECT
year
,(ROW((jprs).*)::tmpTblTyp).* -- explode the composite type row
FROM tmpMoToJson ;
+----+---+---+---+---+---+---+---+---+---+---+---+---+
|year|jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec|
+----+---+---+---+---+---+---+---+---+---+---+---+---+
|2010|127|117|121|131|106|121|129|114|115|110|133|108|
|2011|117|112|117|115|139|116|119|152|117|112|115|103|
|2012|129|111|98 |140|109|131|114|110|112|115|100|121|
|2013|128|112|141|127|141|102|113|109|111|110|123|116|
|2014|129|114|117|118|111|123|106|111|127|121|124|145|
|2015|118|113|131|122|120|121|140|114|118|108|114|131|
|2016|117|110|139|100|110|116|112|109|131|117|122|132|
+----+---+---+---+---+---+---+---+---+---+---+---+---+
By using PIVOT we can also achieve the required output:
SELECT co
,industry
,customer_type
,product_type
,sales_model
,stage
FROM dataSet
PIVOT(max(tags) FOR ontologyType IN (
industry
,customer_type
,product_type
,sales_model
,stage
)) AS PVT

Postgres find all rows in database tables matching criteria on a given column

I am trying to write sub-queries so that I search all tables for a column named id, and since there are multiple tables with an id column, I want to add the condition id = 3119093.
My attempt was:
Select *
from information_schema.tables
where id = '3119093' and id IN (
Select table_name
from information_schema.columns
where column_name = 'id' );
This didn't work so I tried:
Select *
from information_schema.tables
where table_name IN (
Select table_name
from information_schema.columns
where column_name = 'id' and 'id' IN (
Select * from table_name where 'id' = 3119093));
This isn't the right way either. Any help would be appreciated. Thanks!
A harder attempt is:
CREATE OR REPLACE FUNCTION search_columns(
needle text,
haystack_tables name[] default '{}',
haystack_schema name[] default '{public}'
)
RETURNS table(schemaname text, tablename text, columnname text, rowctid text)
AS $$
begin
FOR schemaname,tablename,columnname IN
SELECT c.table_schema,c.table_name,c.column_name
FROM information_schema.columns c
JOIN information_schema.tables t ON
(t.table_name=c.table_name AND t.table_schema=c.table_schema)
WHERE (c.table_name=ANY(haystack_tables) OR haystack_tables='{}')
AND c.table_schema=ANY(haystack_schema)
AND t.table_type='BASE TABLE'
--AND c.column_name = "id"
LOOP
EXECUTE format('SELECT ctid FROM %I.%I WHERE cast(%I as text) like %L',
schemaname,
tablename,
columnname,
needle
) INTO rowctid;
IF rowctid is not null THEN
RETURN NEXT;
END IF;
END LOOP;
END;
$$ language plpgsql;
select * from search_columns('%3119093%'::varchar,'{}'::name[]) ;
The only problem is that this code only displays the table name and column name. I then have to manually enter
Select * from table_name where id = 3119093
where I got the table name from the code above.
I want to automatically return the matching rows from each table, but I don't know how to get the table name automatically.
I took the time to make it work for you.
For starters, some information on what is going on inside the code.
Explanation
the function takes two input arguments: column name and column value
it requires a created type that it will return a set of
the first loop identifies tables that have a column with the name given as the input argument
it then forms a query which aggregates all rows that match the input condition inside every table taken from the previous step, with the comparison based on ILIKE, as per your example
the function goes into the second loop only if there is at least one row in the currently visited table that matches the specified condition (then the array is not null)
the second loop unnests the array of rows that match the condition and, for every element, puts it into the function output with the RETURN NEXT rec clause
Notes
Searching with LIKE is inefficient; I suggest adding another input argument for the column type and restricting the lookup by joining pg_catalog.pg_type (see the sketch after these notes).
The second loop is there so that if more than 1 row is found for a particular table, then every row gets returned.
If you are looking for something else, such as key-value pairs rather than just the values, then you need to extend the function. You could, for example, build JSON from the rows.
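As a sketch of the type-restriction idea above (the filter on typ.typname stands in for a hypothetical extra function argument), the first loop's lookup query could be narrowed like this:
-- only consider columns named 'id' whose type is int4
SELECT nam.nspname AS schemaname,
       cls.relname AS tablename,
       att.attname AS colname
FROM pg_attribute att
JOIN pg_class cls ON att.attrelid = cls.oid
JOIN pg_namespace nam ON cls.relnamespace = nam.oid
JOIN pg_catalog.pg_type typ ON typ.oid = att.atttypid
WHERE cls.relkind = 'r'
  AND att.attname = 'id'
  AND typ.typname = 'int4';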
Now, to the code.
Test case
CREATE TABLE tbl1 (col1 int, id int); -- does contain values
CREATE TABLE tbl2 (col1 int, col2 int); -- doesn't contain column "id"
CREATE TABLE tbl3 (id int, col5 int); -- doesn't contain values
INSERT INTO tbl1 (col1, id)
VALUES (1, 5), (1, 33), (1, 25);
Table stores data:
postgres=# select * From tbl1;
col1 | id
------+----
1 | 5
1 | 33
1 | 25
(3 rows)
Creating type
CREATE TYPE sometype AS ( schemaname text, tablename text, colname text, entirerow text );
Function code
CREATE OR REPLACE FUNCTION search_tables_for_column (
v_column_name text
, v_column_value text
)
RETURNS SETOF sometype
LANGUAGE plpgsql
STABLE
AS
$$
DECLARE
rec sometype%rowtype;
v_row_array text[];
rec2 record;
arr_el text;
BEGIN
FOR rec IN
SELECT
nam.nspname AS schemaname
, cls.relname AS tablename
, att.attname AS colname
, null::text AS entirerow
FROM
pg_attribute att
JOIN pg_class cls ON att.attrelid = cls.oid
JOIN pg_namespace nam ON cls.relnamespace = nam.oid
WHERE
cls.relkind = 'r'
AND att.attname = v_column_name
LOOP
EXECUTE format('SELECT ARRAY_AGG(row(tablename.*)::text) FROM %I.%I AS tablename WHERE %I::text ILIKE %s',
rec.schemaname, rec.tablename, rec.colname, quote_literal(concat('%',v_column_value,'%'))) INTO v_row_array;
IF v_row_array is not null THEN
FOR rec2 IN
SELECT unnest(v_row_array) AS one_row
LOOP
rec.entirerow := rec2.one_row;
RETURN NEXT rec;
END LOOP;
END IF;
END LOOP;
END
$$;
Example call & output
postgres=# select * from search_tables_for_column('id','5');
schemaname | tablename | colname | entirerow
------------+-----------+---------+-----------
public | tbl1 | id | (1,5)
public | tbl1 | id | (1,25)
(2 rows)

How to create a Postgres query to generate counts of columns where tables are specified as data

I am trying to produce a table containing counts of non-null datapoints for columns in the "Area Health Resource File" -- which contains per-county demographic and health data.
I have reworked the data into timeseries from the provided format, resulting in a bunch of tables named "series_foo" for each data category foo, with rows identified by county FIPS and year (initial and final for multiyear surveys).
Now I want to produce counts over the timeseries columns. So far the query I have is:
do language plpgsql $$
declare
query text;
begin
query := (with cats as (
select tcategory, format('series_%s', tcategory) series_tbl
from series_categories),
cols as (
select tcategory, series_tbl, attname col
from pg_attribute a join pg_class r on a.attrelid = r.oid
join cats c on c.series_tbl = r.relname
where attname not in ('FIPS', 'initial', 'final')
and attnum >= 0
order by tcategory, col),
scols as (
select tcategory, series_tbl, col,
format('count(%s)', quote_ident(col)) sel
from cols),
sel as (
select format(
E' (select %s tcategory, %s col, %s from %s)\n',
quote_literal(tcategory), quote_literal(col), sel, series_tbl) q
from scols)
select string_agg(q, E'union\n') from sel);
execute format(
'select * into category_column_counts from (%s) x', query);
end;
$$;
(Here the "series_categories" table has category name.)
This ... "works" but is probably hundreds of times too slow. Its doing ~10,000
individual tablescans, which could be reduced 500-fold, as there are only 20-ish
categories. I would like to use select count(col1), count(col2) ...
for each table, then "unnest" these row records and concatenate all together.
I haven't figured it out though. I looked at:
https://stackoverflow.com/a/14087244/435563
for inspiration, but haven't transformed that successfully.
I don't know the AHRF format (I looked up the web site but there are too many cute nurse pictures for me to focus on the content...) but you are probably going about it the wrong way by first extracting data into multiple tables and then trying to piece it back together again. Instead, you should use a design pattern called Entity-Attribute-Value that stores all the data values in a single table with a category identifier and a "feature" identifier, with table structures somewhat like this:
CREATE TABLE categories (
id serial PRIMARY KEY,
category text NOT NULL,
... -- other attributes like min/max allowable values, measurement technique, etc.
);
CREATE TABLE feature ( -- town, county, state, whatever
id serial PRIMARY KEY,
fips varchar NOT NULL,
name varchar,
... -- other attributes
);
CREATE TABLE measurement (
feature integer REFERENCES feature,
category integer REFERENCES categories,
dt date,
value double precision NOT NULL,
PRIMARY KEY (feature, category, dt)
);
This design pattern is very flexible. For instance, you can store 50 categories for some rows of one feature class and only 5 for another set of rows. You can store data from multiple observations on different dates or years. You can have multiple "feature" tables with separate "measurement" tables, or you can set it up with table inheritance.
Answering your query is then very straightforward using standard PK-FK relationships. More to the point, answering any query is far easier than with your current structure of divide-but-not-conquer.
I don't know exactly how your "initial year"/"final year" data works, but otherwise your requirement would be met by a simple query like so:
SELECT f.fips, c.category, count(*)
FROM feature f -- replace feature by whatever real table you create, like "county"
JOIN measurement m ON m.feature = f.id
JOIN categories c ON c.id = m.category
GROUP BY f.fips, c.category;
Do you want to know dental decay as a function of smoking, alcohol consumption versus psychiatric help, correlation between obesity and substance abuse, trends in toddler development? All fairly easy with the above structure, all a slow painful slog with multiple tables.
Here is the optimization I found: it uses json_each(row_to_json(c)) to turn records into sequences of individual values.
do language plpgsql $$
declare
query text;
begin
query := (with cats as (
select tcategory, table_name
from series_category_tables),
cols as (
select tcategory, table_name, attname col, typname type_name
from pg_attribute a join pg_class r on a.attrelid = r.oid
join cats c on c.table_name = r.relname
join pg_type t on t.oid = a.atttypid
where attname not in ('FIPS', 'initial', 'final')
and attnum >= 0
order by tcategory, col),
-- individual "count" fields
sel as (
select
format(
E' (select %s tcategory, %s table_name, \n'
|| E' d.key column_name, d.value->>''f2'' type_name, '
|| E'(d.value->>''f1'')::int count\n'
|| E' from (\n'
|| E' select (json_each(row_to_json(c))).* from (select\n'
|| E' %s \n'
|| E' from %s) c) d)\n',
quote_literal(tcategory),
quote_literal(table_name),
string_agg(
format(
' row(count(%1$s), %2$s) %1$s',
quote_ident(col), quote_literal(type_name)),
E',\n'), quote_ident(table_name)) selstr
from cols
group by tcategory, table_name),
selu as (
select
string_agg(selstr, E'union\n') selu
from sel)
select * from selu);
drop table if exists category_columns;
create table category_columns (
tcategory text, table_name text,
column_name text, type_name text, count int);
execute format(
'insert into category_columns select * from (%s) x', query);
end;
$$;
It runs in ~45 seconds vs 6 minutes for the previous version. Can I/you do better than this?

Duplicate single database record

Hello, what is the easiest way to duplicate a DB record in the same table?
My problem is that the table where I am doing this has many columns, like 100+, and I don't like how the solution looks. Here is what I do (this is inside a plpgsql function):
...
1. duplicate record
INSERT INTO history
    (SELECT NEXTVAL('history_id_seq'), col_1, col_2, ... , col_100
     FROM history
     WHERE history_id = 1234
     ORDER BY datetime DESC
     LIMIT 1)
RETURNING
    history_id INTO new_history_id;
2. update some columns
UPDATE history
SET
col_5 = 'test_5',
col_23 = 'test_23',
datetime = CURRENT_TIMESTAMP
WHERE history_id = new_history_id;
Here are the problems I am attempting to solve
Listing all these 100+ columns looks lame
When a new column is added, eventually the function has to be updated too
On separate DB instances the column order might differ, which would cause the function to fail
I am not sure if I can list them once more (solving issue 3), like insert into <table> (<columns_list>) values (<query>), but then the query looks even uglier.
I would like to achieve something like 'insert into ', but this seems impossible, as the unique primary key constraint will raise a duplication error.
Any suggestions?
Thanks in advance for your time.
This isn't pretty or particularly optimized, but there are a couple of ways to go about this. Ideally you might want to do this all in an UPDATE trigger, though you could implement a duplication function something like this:
-- create source table
CREATE TABLE history (history_id serial not null primary key, col_2 int, col_3 int, col_4 int, datetime timestamptz default now());
-- add some data
INSERT INTO history (col_2, col_3, col_4)
SELECT g, g * 10, g * 100 FROM generate_series(1, 100) AS g;
-- function to duplicate record
CREATE OR REPLACE FUNCTION fn_history_duplicate(p_history_id integer) RETURNS SETOF history AS
$BODY$
DECLARE
cols text;
insert_statement text;
BEGIN
-- build list of columns
SELECT array_to_string(array_agg(column_name::name), ',') INTO cols
FROM information_schema.columns
WHERE (table_schema, table_name) = ('public', 'history')
AND column_name <> 'history_id';
-- build insert statement
insert_statement := 'INSERT INTO history (' || cols || ') SELECT ' || cols || ' FROM history WHERE history_id = $1 RETURNING *';
-- execute statement
RETURN QUERY EXECUTE insert_statement USING p_history_id;
RETURN;
END;
$BODY$
LANGUAGE 'plpgsql';
-- test
SELECT * FROM fn_history_duplicate(1);
history_id | col_2 | col_3 | col_4 | datetime
------------+-------+-------+-------+-------------------------------
101 | 1 | 10 | 100 | 2013-04-15 14:56:11.131507+00
(1 row)
As I noted in my original comment, you might also take a look at the colnames extension as an alternative to querying the information schema.
You don't need the UPDATE anyway; you can supply the constant values directly in the SELECT statement:
INSERT INTO history
SELECT NEXTVAL('history_id_seq'),
col_1,
col_2,
col_3,
col_4,
'test_5',
...
'test_23',
...,
col_100
FROM history
WHERE history_id = 1234
ORDER BY datetime DESC
LIMIT 1
RETURNING history_id INTO new_history_id;