How to pivot or crosstab in postgresql without writing a function? - postgresql

I have a dataset that looks something like this:
I'd like to aggregate all co values on one row, so the final result looks something like:
Seems pretty easy, right? Just write a query using crosstab, as suggested in this answer. The problem is that this requires me to run CREATE EXTENSION tablefunc; and I don't have write access to my DB.
Can anyone recommend an alternative?

Conditional aggregation:
SELECT co,
MIN(CASE WHEN ontology_type = 'industry' THEN tags END) AS industry,
MIN(CASE WHEN ontology_type = 'customer_type' THEN tags END) AS customer_type,
-- ...
FROM tab_name
GROUP BY co
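On PostgreSQL 9.4 or later, the same conditional aggregation can also be written with the aggregate FILTER clause. This is a minimal sketch that assumes the column names co, ontology_type and tags from the question; the sample rows in the CTE are made up for illustration.

```sql
-- Hypothetical sample rows standing in for tab_name.
WITH tab_name(co, ontology_type, tags) AS (
    VALUES
        ('acme',   'industry',      'retail'),
        ('acme',   'customer_type', 'b2c'),
        ('globex', 'industry',      'energy')
)
SELECT co,
       -- FILTER restricts which rows each aggregate sees
       MIN(tags) FILTER (WHERE ontology_type = 'industry')      AS industry,
       MIN(tags) FILTER (WHERE ontology_type = 'customer_type') AS customer_type
FROM tab_name
GROUP BY co
ORDER BY co;
```

A company with no row for a given ontology_type (globex has no customer_type here) gets NULL in that column, the same as with the CASE form.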

You can use DO to generate and PREPARE your own SQL with crosstab columns, then EXECUTE it.
-- replace tab_name with your table name
DO $$
DECLARE
_query text;
_name text;
BEGIN
_name := 'prepared_query';
_query := '
SELECT co
'||(SELECT ', '||string_agg(DISTINCT
' string_agg(DISTINCT
CASE ontology_type WHEN '||quote_literal(ontology_type)||' THEN tags
ELSE NULL
END, '',''
) AS '||quote_ident(ontology_type),',')
FROM tab_name)||'
FROM tab_name
GROUP BY co
';
BEGIN
EXECUTE 'DEALLOCATE '||_name;
EXCEPTION
WHEN invalid_sql_statement_name THEN
NULL; -- nothing to deallocate on the first run
END;
EXECUTE 'PREPARE '||_name||' AS '||_query;
END
$$;
EXECUTE prepared_query;
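Before preparing a generated statement, it can help to look at the column list it will contain. The sketch below builds that list with format() (%L quotes a literal, %I quotes an identifier) as an alternative to quote_literal()/quote_ident(). The sample rows are hypothetical.

```sql
-- Hypothetical sample rows standing in for tab_name.
WITH tab_name(co, ontology_type, tags) AS (
    VALUES ('acme', 'industry',      'retail'),
           ('acme', 'customer_type', 'b2c')
)
SELECT string_agg(
           -- one aggregated output column per distinct ontology_type
           format('string_agg(DISTINCT CASE ontology_type WHEN %L THEN tags END, '','') AS %I',
                  ontology_type, ontology_type),
           ', ' ORDER BY ontology_type
       ) AS column_list
FROM (SELECT DISTINCT ontology_type FROM tab_name) AS t;
```

The result is a single text value that can be pasted between SELECT co, and FROM tab_name GROUP BY co to check the generated query by hand.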

Since version 9.4 there's json_object_agg(), which lets us do part of the necessary magic dynamically.
However, to be fully dynamic, a temp type (a temp table) first has to be built by running dynamic SQL (EXECUTE) inside an anonymous DO block.
DB FIDDLE (UK):
https://dbfiddle.uk/Sn7iO4zL
DISCLAIMER: Typically the ability to create TEMP TABLES is granted to end users, but YMMV. Another concern is whether anonymous procedures (DO blocks) can be executed as inline code by regular users.
-- /**
-- begin test data
-- */
DROP TABLE IF EXISTS tmpSales ;
CREATE TEMP TABLE tmpSales AS
SELECT
sale_id
,TRUNC(RANDOM()*12)+1 AS book_id
,TRUNC(RANDOM()*100)+1 AS customer_id
,(date '2010-01-01' + random() * (timestamp '2016-12-31' - timestamp '2010-01-01')) AS sale_date
FROM generate_series(1,10000) AS sale_id;
DROP TABLE IF EXISTS tmp_month_total ;
CREATE TEMP TABLE tmp_month_total AS
SELECT
date_part( 'year' , sale_date ) AS year
,date_part( 'month', sale_date ) AS mn
,to_char(sale_date, 'mon') AS month
,COUNT(*) AS total
FROM tmpSales
GROUP BY date_part('year', sale_date), to_char(sale_date, 'mon') ,date_part( 'month', sale_date )
;
DATA:
+----+--+-----+-----+
|year|mn|month|total|
+----+--+-----+-----+
|2010|1 |jan |127 |
|2010|2 |feb |117 |
|2010|3 |mar |121 |
|2010|4 |apr |131 |
|2010|5 |may |106 |
|2010|6 |jun |121 |
|2010|7 |jul |129 |
|2010|8 |aug |114 |
|2010|9 |sep |115 |
|2010|10|oct |110 |
|2010|11|nov |133 |
|2010|12|dec |108 |
+----+--+-----+-----+
-- /**
-- END test data
-- */
-- /**
-- dyn. build a temporary row-type based on existing data, not hard-coded
-- **/
DROP TABLE IF EXISTS tmpTblTyp CASCADE ;
DO LANGUAGE plpgsql $$ DECLARE v_sqlstring VARCHAR = ''; BEGIN
v_sqlstring := CONCAT( 'CREATE TEMP TABLE tmpTblTyp AS SELECT '
,(SELECT STRING_AGG( CONCAT('NULL::int AS ' , month )::TEXT , ' ,'
ORDER BY mn
)::TEXT
FROM
(SELECT DISTINCT month, mn FROM tmp_month_total )a )
,' LIMIT 0 '
) ; -- RAISE NOTICE '%', v_sqlstring ;
EXECUTE( v_sqlstring ) ; END $$;
DROP TABLE IF EXISTS tmpMoToJson ;
CREATE TEMP TABLE tmpMoToJson AS
SELECT
year AS year
,(json_build_array( months )) AS js_months_arr
,json_populate_recordset ( NULL::tmpTblTyp /** use temp table as a record type!! **/
, json_build_array( months )
) jprs /** builds row-type column that can be expanded with (jprs).*
**/
FROM ( SELECT year
-- accum data into JSON array
,json_object_agg(month,total) AS months
FROM tmp_month_total
GROUP BY year
ORDER BY year
) a
;
SELECT
year
,(ROW((jprs).*)::tmpTblTyp).* -- explode the composite type row
FROM tmpMoToJson ;
+----+---+---+---+---+---+---+---+---+---+---+---+---+
|year|jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec|
+----+---+---+---+---+---+---+---+---+---+---+---+---+
|2010|127|117|121|131|106|121|129|114|115|110|133|108|
|2011|117|112|117|115|139|116|119|152|117|112|115|103|
|2012|129|111|98 |140|109|131|114|110|112|115|100|121|
|2013|128|112|141|127|141|102|113|109|111|110|123|116|
|2014|129|114|117|118|111|123|106|111|127|121|124|145|
|2015|118|113|131|122|120|121|140|114|118|108|114|131|
|2016|117|110|139|100|110|116|112|109|131|117|122|132|
+----+---+---+---+---+---+---+---+---+---+---+---+---+
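The core of the approach above is that json_object_agg() folds the (month, total) pairs of each year into one JSON object, which is then expanded into columns. A reduced sketch of that idea follows, using json_to_record() with a hard-coded column list in place of the dynamically built temp table. The sample rows and the two-month column list are assumptions for illustration.

```sql
-- Hypothetical sample of the tmp_month_total data.
WITH tmp_month_total(year, month, total) AS (
    VALUES (2010, 'jan', 127), (2010, 'feb', 117),
           (2011, 'jan', 117), (2011, 'feb', 112)
),
per_year AS (
    -- one JSON object per year, e.g. {"jan": 127, "feb": 117}
    SELECT year, json_object_agg(month, total) AS months
    FROM tmp_month_total
    GROUP BY year
)
SELECT p.year, x.jan, x.feb
FROM per_year AS p,
     -- expand the JSON object into typed columns
     json_to_record(p.months) AS x(jan int, feb int)
ORDER BY p.year;
```

The column definition list after json_to_record() is what the temp table tmpTblTyp replaces in the fully dynamic version.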

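Using PIVOT we can also get the required output. Note that PIVOT is SQL Server / Oracle syntax; PostgreSQL has no PIVOT clause, so this query will not run there as written.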
SELECT co
,industry
,customer_type
,product_type
,sales_model
,stage
FROM dataSet
PIVOT(max(tags) FOR ontologyType IN (
industry
,customer_type
,product_type
,sales_model
,stage
)) AS PVT

Related

Select and union same attributes from many tables using PostgreSQL

I'm trying to write a more efficient PostgreSQL query that will UNION together the like attributes across numerous tables. Context is that the database has separate tables for results on different assessments, and I'm trying to look at the outcomes across all assessments. At the moment, for instance, I have one table that stores the name of all of the assessment tables:
| tables |
|---------|
| table_a |
| table_b |
| table_c |
And I'm aggregating the outcomes together using this query (in reality, I'm aggregating across 25+ different tables):
SELECT person_id, subject_id, outcome
FROM table_a
UNION ALL
SELECT person_id, subject_id, outcome
FROM table_b
UNION ALL
SELECT person_id, subject_id, outcome
FROM table_c
Is there a PostgreSQL approach to essentially looping the same SELECT statement through multiple tables, and then UNION ALL results together (e.g. so I don't have to repeat the snippet above 25+ times)?
You should write code that generates the statement for you. One thing you can do is to write a PL/pgSQL function:
CREATE FUNCTION get_them()
RETURNS TABLE (
person_id bigint,
subject_id bigint,
outcome text
) LANGUAGE plpgsql AS
$$DECLARE
v_sql text := '';
v_sep text := '';
v_tab text;
BEGIN
FOR v_tab IN
SELECT tables FROM tab_of_tabs
LOOP
v_sql := v_sql || v_sep ||
format(
'SELECT person_id, subject_id, outcome FROM %I',
v_tab
);
v_sep := ' UNION ALL ';
END LOOP;
RETURN QUERY EXECUTE v_sql;
END;$$;
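The statement that the loop assembles can also be produced by a single string_agg() call, which is handy for inspecting the generated SQL before wrapping it in a function. This is a sketch under assumptions: the temp tables below are hypothetical stand-ins for the real assessment tables and for tab_of_tabs.

```sql
-- Hypothetical stand-ins for the real tables.
CREATE TEMP TABLE table_a (person_id bigint, subject_id bigint, outcome text);
CREATE TEMP TABLE table_b (LIKE table_a);
CREATE TEMP TABLE tab_of_tabs (tables text);
INSERT INTO tab_of_tabs VALUES ('table_a'), ('table_b');

-- Build one SELECT per listed table and join them with UNION ALL.
SELECT string_agg(
           format('SELECT person_id, subject_id, outcome FROM %I', tables),
           ' UNION ALL ' ORDER BY tables
       ) AS generated_sql
FROM tab_of_tabs;
```

Once the function exists, calling it is just SELECT * FROM get_them();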

Postgres find all rows in database tables matching criteria on a given column

I am trying to write sub-queries so that I search all tables for a column named id and since there are multiple tables with id column, I want to add the condition, so that id = 3119093.
My attempt was:
Select *
from information_schema.tables
where id = '3119093' and id IN (
Select table_name
from information_schema.columns
where column_name = 'id' );
This didn't work so I tried:
Select *
from information_schema.tables
where table_name IN (
Select table_name
from information_schema.columns
where column_name = 'id' and 'id' IN (
Select * from table_name where 'id' = 3119093));
This isn't the right way either. Any help would be appreciated. Thanks!
A harder attempt is:
CREATE OR REPLACE FUNCTION search_columns(
needle text,
haystack_tables name[] default '{}',
haystack_schema name[] default '{public}'
)
RETURNS table(schemaname text, tablename text, columnname text, rowctid text)
AS $$
begin
FOR schemaname,tablename,columnname IN
SELECT c.table_schema,c.table_name,c.column_name
FROM information_schema.columns c
JOIN information_schema.tables t ON
(t.table_name=c.table_name AND t.table_schema=c.table_schema)
WHERE (c.table_name=ANY(haystack_tables) OR haystack_tables='{}')
AND c.table_schema=ANY(haystack_schema)
AND t.table_type='BASE TABLE'
--AND c.column_name = "id"
LOOP
EXECUTE format('SELECT ctid FROM %I.%I WHERE cast(%I as text) like %L',
schemaname,
tablename,
columnname,
needle
) INTO rowctid;
IF rowctid is not null THEN
RETURN NEXT;
END IF;
END LOOP;
END;
$$ language plpgsql;
select * from search_columns('%3119093%'::varchar,'{}'::name[]) ;
The only problem is this code displays the table name and column name. I have to then manually enter
Select * from table_name where id = 3119093
where I got the table name from the code above.
I want to automatically implement returning rows from a table but I don't know how to get the table name automatically.
I took the time to make it work for you.
For starters, some information on what is going on inside the code.
Explanation
1. The function takes two input arguments: column name and column value.
2. It requires a previously created type, of which it returns a set.
3. The first loop identifies tables that have a column with the name given as the input argument.
4. It then forms a query which aggregates all rows that match the input condition inside every table taken from step 3, with comparison based on ILIKE, as per your example.
5. The function goes into the second loop only if there is at least one row in the currently visited table that matches the specified condition (then the array is not null).
6. The second loop unnests the array of rows that match the condition and puts every element in the function output with the RETURN NEXT rec clause.
Notes
Searching with LIKE is inefficient - I suggest adding another input argument "column type" and restrict it in the lookup by adding a join to pg_catalog.pg_type table.
The second loop is there so that if more than 1 row is found for a particular table, then every row gets returned.
If you are looking for something else, like you need key-value pairs, not just the values, then you need to extend the function. You could for example build json format from rows.
Now, to the code.
Test case
CREATE TABLE tbl1 (col1 int, id int); -- does contain values
CREATE TABLE tbl2 (col1 int, col2 int); -- doesn't contain column "id"
CREATE TABLE tbl3 (id int, col5 int); -- doesn't contain values
INSERT INTO tbl1 (col1, id)
VALUES (1, 5), (1, 33), (1, 25);
Table stores data:
postgres=# select * From tbl1;
col1 | id
------+----
1 | 5
1 | 33
1 | 25
(3 rows)
Creating type
CREATE TYPE sometype AS ( schemaname text, tablename text, colname text, entirerow text );
Function code
CREATE OR REPLACE FUNCTION search_tables_for_column (
v_column_name text
, v_column_value text
)
RETURNS SETOF sometype
LANGUAGE plpgsql
STABLE
AS
$$
DECLARE
rec sometype%rowtype;
v_row_array text[];
rec2 record;
arr_el text;
BEGIN
FOR rec IN
SELECT
nam.nspname AS schemaname
, cls.relname AS tablename
, att.attname AS colname
, null::text AS entirerow
FROM
pg_attribute att
JOIN pg_class cls ON att.attrelid = cls.oid
JOIN pg_namespace nam ON cls.relnamespace = nam.oid
WHERE
cls.relkind = 'r'
AND att.attname = v_column_name
LOOP
EXECUTE format('SELECT ARRAY_AGG(row(tablename.*)::text) FROM %I.%I AS tablename WHERE %I::text ILIKE %s',
rec.schemaname, rec.tablename, rec.colname, quote_literal(concat('%',v_column_value,'%'))) INTO v_row_array;
IF v_row_array is not null THEN
FOR rec2 IN
SELECT unnest(v_row_array) AS one_row
LOOP
rec.entirerow := rec2.one_row;
RETURN NEXT rec;
END LOOP;
END IF;
END LOOP;
END
$$;
Example call and output
postgres=# select * from search_tables_for_column('id','5');
schemaname | tablename | colname | entirerow
------------+-----------+---------+-----------
public | tbl1 | id | (1,5)
public | tbl1 | id | (1,25)
(2 rows)
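The Notes mention building JSON from the matching rows to get key-value pairs instead of bare values. A minimal sketch of that variation for a single table follows, using row_to_json(). The table tbl1_demo is a hypothetical copy of the test case, and wiring this into the function's dynamic query is left out.

```sql
-- Hypothetical copy of the tbl1 test data.
CREATE TEMP TABLE tbl1_demo (col1 int, id int);
INSERT INTO tbl1_demo (col1, id) VALUES (1, 5), (1, 33), (1, 25);

-- Each matching row comes back as a JSON object keyed by column name.
SELECT row_to_json(t) AS entirerow
FROM tbl1_demo AS t
WHERE t.id::text ILIKE '%5%';
```

Inside the function, the same idea would presumably replace row(tablename.*)::text in the generated query, with the entirerow field kept as text or changed to json.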

How to create postgres query to generate counts of columns where tables specified as data

I am trying to produce a table containing counts of non-null datapoints for columns in the "Area Health Resource File" -- which contains per-county demographic and health data.
I have reworked the data into timeseries from the provided format, resulting
in a bunch of tables named "series_foo" for some data category foo, with
rows identified by county FIPS and year (initial and final for multiyear surveys).
Now I want to produce counts over the timeseries columns. So far the query I have is:
do language plpgsql $$
declare
query text;
begin
query := (with cats as (
select tcategory, format('series_%s', tcategory) series_tbl
from series_categories),
cols as (
select tcategory, series_tbl, attname col
from pg_attribute a join pg_class r on a.attrelid = r.oid
join cats c on c.series_tbl = r.relname
where attname not in ('FIPS', 'initial', 'final')
and attnum >= 0
order by tcategory, col),
scols as (
select tcategory, series_tbl, col,
format('count(%s)', quote_ident(col)) sel
from cols),
sel as (
select format(
E' (select %s tcategory, %s col, %s from %s)\n',
quote_literal(tcategory), quote_literal(col), sel, series_tbl) q
from scols)
select string_agg(q, E'union\n') from sel);
execute format(
'select * into category_column_counts from (%s) x', query);
end;
$$;
(Here the "series_categories" table has category name.)
This ... "works" but is probably hundreds of times too slow. It's doing ~10,000
individual tablescans, which could be reduced 500-fold, as there are only 20-ish
categories. I would like to use select count(col1), count(col2) ...
for each table, then "unnest" these row records and concatenate all together.
I haven't figured it out though. I looked at:
https://stackoverflow.com/a/14087244/435563
for inspiration, but haven't transformed that successfully.
I don't know the AHRF format (I looked up the web site but there are too many cute nurse pictures for me to focus on the content...) but you are probably going about it the wrong way by first extracting data into multiple tables and then trying to piece it back together again. Instead, you should use a design pattern called Entity-Attribute-Value, which stores all the data values in a single table with a category identifier and a "feature" identifier, with table structures somewhat like this:
CREATE TABLE categories (
id serial PRIMARY KEY,
category text NOT NULL,
... -- other attributes like min/max allowable values, measurement technique, etc.
);
CREATE TABLE feature ( -- town, county, state, whatever
id serial PRIMARY KEY,
fips varchar NOT NULL,
name varchar,
... -- other attributes
);
CREATE TABLE measurement (
feature integer REFERENCES feature,
category integer REFERENCES categories,
dt date,
value double precision NOT NULL,
PRIMARY KEY (feature, category, dt)
);
This design pattern is very flexible. For instance, you can store 50 categories for some rows of one feature class and only 5 for another set of rows. You can store data from multiple observations on different dates or years. You can have multiple "feature" tables with separate "measurement" tables, or you can set it up with table inheritance.
Answering your query is then very straightforward using standard PK-FK relationships. More to the point, answering any query is far easier than with your current structure of divide-but-not-conquer.
I don't know exactly how your "initial year"/"final year" data works, but otherwise your requirement would be met by a simple query like so:
SELECT f.fips, c.category, count(*)
FROM feature f -- replace feature by whatever real table you create, like "county"
JOIN measurement m ON m.feature = f.id
JOIN categories c ON c.id = m.category
GROUP BY f.fips, c.category;
Do you want to know dental decay as a function of smoking, alcohol consumption versus psychiatric help, correlation between obesity and substance abuse, trend in toddler development? All fairly easy with the above structure, all a slow painful trod with multiple tables.
Here is the optimization I found: it uses json_each(row_to_json(c)) to turn records into sequences of individual values.
do language plpgsql $$
declare
query text;
begin
query := (with cats as (
select tcategory, table_name
from series_category_tables),
cols as (
select tcategory, table_name, attname col, typname type_name
from pg_attribute a join pg_class r on a.attrelid = r.oid
join cats c on c.table_name = r.relname
join pg_type t on t.oid = a.atttypid
where attname not in ('FIPS', 'initial', 'final')
and attnum >= 0
order by tcategory, col),
-- individual "count" fields
sel as (
select
format(
E' (select %s tcategory, %s table_name, \n'
|| E' d.key column_name, d.value->>''f2'' type_name, '
|| E'(d.value->>''f1'')::int count\n'
|| E' from (\n'
|| E' select (json_each(row_to_json(c))).* from (select\n'
|| E' %s \n'
|| E' from %s) c) d)\n',
quote_literal(tcategory),
quote_literal(table_name),
string_agg(
format(
' row(count(%1$s), %2$s) %1$s',
quote_ident(col), quote_literal(type_name)),
E',\n'), quote_ident(table_name)) selstr
from cols
group by tcategory, table_name),
selu as (
select
string_agg(selstr, E'union\n') selu
from sel)
select * from selu);
drop table if exists category_columns;
create table category_columns (
tcategory text, table_name text,
column_name text, type_name text, count int);
execute format(
'insert into category_columns select * from (%s) x', query);
end;
$$;
It runs in ~45 seconds vs 6 minutes for the previous version. Can I/you do better than this?
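The key trick, stripped of the query generation, is to compute every count in one scan and then turn the single result row into one row per column. A small sketch of that step follows, using json_each_text() instead of json_each() to keep the cast simple. The table series_demo and its columns are hypothetical.

```sql
-- Hypothetical series table with two data columns.
WITH series_demo(fips, initial, pop, income) AS (
    VALUES ('01001', 2010, 100,  NULL::int),
           ('01003', 2010, NULL, 5),
           ('01005', 2010, 7,    8)
)
SELECT d.key        AS column_name,
       d.value::int AS non_null_count
FROM (
         -- one table scan computes every count at once
         SELECT count(pop) AS pop, count(income) AS income
         FROM series_demo
     ) AS c,
     -- unpivot the single result row into (key, value) pairs
     json_each_text(row_to_json(c)) AS d;
```

Each generated per-table subquery in the block above follows this shape, with the count list produced by string_agg().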

Postgresql, select a "fake" row

In Postgres 8.4 or higher, what is the most efficient way to get a row of data populated by defaults without actually creating the row. Eg, as a transaction (pseudocode):
create table "mytable"
(
id serial PRIMARY KEY NOT NULL,
parent_id integer NOT NULL DEFAULT 1,
random_id integer NOT NULL DEFAULT random()
);
begin transaction
fake_row = insert into mytable (id) values (0) returning *;
delete from mytable where id=0;
return fake_row;
end transaction
Basically I'd expect a query with a single row where parent_id is 1 and random_id is a random number (or other function return value) but I don't want this record to persist in the table or impact on the primary key sequence serial_id_seq.
My options seem to be using a transaction like above or creating views which are copies of the table with the fake row added but I don't know all the pros and cons of each or whether a better way exists.
I'm looking for an answer that assumes no prior knowledge of the datatypes or default values of any column except id or the number or ordering of the columns. Only the table name will be known and that a record with id 0 should not exist in the table.
In the past I created the fake record 0 as a permanent record but I've come to consider this record a type of pollution (since I typically have to filter it out of future queries).
You can copy the table definition and defaults to the temp table with:
CREATE TEMP TABLE table_name_rt (LIKE table_name INCLUDING DEFAULTS);
Then use this temp table to generate dummy rows. Such a table is dropped at the end of the session (or at the end of the transaction if it is created with ON COMMIT DROP) and is only visible to the current session.
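A sketch of how the dummy row could then be produced. The table definition is a hypothetical variant of the one in the question, with the random_id default scaled so the value is visibly random. Because INCLUDING DEFAULTS also copies the nextval() default of a serial column, the id is supplied explicitly here so that the sequence is not touched.

```sql
-- Hypothetical variant of the question's table.
CREATE TEMP TABLE mytable (
    id        serial PRIMARY KEY,
    parent_id integer NOT NULL DEFAULT 1,
    random_id integer NOT NULL DEFAULT (random() * 1000)::int
);

-- Copy the column definitions and defaults, but no rows or constraints on keys.
CREATE TEMP TABLE mytable_rt (LIKE mytable INCLUDING DEFAULTS);

-- Supplying id explicitly means nextval() is never called;
-- all other columns are filled from their defaults.
INSERT INTO mytable_rt (id) VALUES (0)
RETURNING *;
```

The row lives only in mytable_rt, so nothing needs to be filtered out of mytable afterwards.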
You can query the catalog and build a dynamic query
Say we have this table:
create table test10(
id serial primary key,
first_name varchar( 100 ),
last_name varchar( 100 ) default 'Tom',
age int not null default 38,
salary float default 100.22
);
When you run the following query:
SELECT string_agg( txt, ' ' order by id )
FROM (
select 1 id, 'SELECT ' txt
union all
select 2, -9999 || ' as id '
union all
select 3, ', '
|| coalesce( column_default, 'null'||'::'||c.data_type )
|| ' as ' || c.column_name
from information_schema.columns c
where table_schema = 'public'
and table_name = 'test10'
and ordinal_position > 1
) xx
;
you will get this string as a result:
"SELECT -9999 as id , null::character varying as first_name ,
'Tom'::character varying as last_name , 38 as age , 100.22 as salary"
then execute this query and you will get the "phantom row".
We can build a function that builds and executes the query and returns our row as a result:
CREATE OR REPLACE FUNCTION get_phantom_rec (p_i test10.id%type )
returns test10 as $$
DECLARE
v_sql text;
myrow test10%rowtype;
begin
SELECT string_agg( txt, ' ' order by id )
INTO v_sql
FROM (
select 1 id, 'SELECT ' txt
union all
select 2, p_i || ' as id '
union all
select 3, ', '
|| coalesce( column_default, 'null'||'::'||c.data_type )
|| ' as ' || c.column_name
from information_schema.columns c
where table_schema = 'public'
and table_name = 'test10'
and ordinal_position > 1
) xx
;
EXECUTE v_sql INTO myrow;
RETURN myrow;
END$$ LANGUAGE plpgsql ;
and then this simple query gives you what you want:
select * from get_phantom_rec ( -9999 );
id | first_name | last_name | age | salary
-------+------------+-----------+-----+--------
-9999 | | Tom | 38 | 100.22
I would just select the fake values as literals:
select 1 id, 1 parent_id, 1 user_id
The returned row will be (virtually) indistinguishable from a real row.
To get the values from the catalog:
select
0 as id, -- special case for serial type, just return 0
(select column_default::int -- Cast to int, because we know the column is int
from INFORMATION_SCHEMA.COLUMNS
where table_name = 'mytable'
and column_name = 'parent_id') as parent_id,
(select column_default::int -- Cast to int, because we know the column is int
from INFORMATION_SCHEMA.COLUMNS
where table_name = 'mytable'
and column_name = 'user_id') as user_id;
Note that you must know what the columns are and their type, but this is reasonable. If you change the table schema (except default value), you would need to tweak the query.
See the above as a SQLFiddle.

Duplicate single database record

Hello what is the easiest way to duplicate a DB record over the same table?
My problem is that the table where I am doing this has many columns, 100+, and I don't like how the solution looks. Here is what I do (this is inside a plpgsql function):
...
1. duplicate record
INSERT INTO history
(SELECT NEXTVAL('history_id_seq'), col_1, col_2, ... , col_100
FROM history
WHERE history_id = 1234
ORDER BY datetime DESC
LIMIT 1)
RETURNING
history_id INTO new_history_id;
2. update some columns
UPDATE history
SET
col_5 = 'test_5',
col_23 = 'test_23',
datetime = CURRENT_TIMESTAMP
WHERE history_id = new_history_id;
Here are the problems I am attempting to solve:
1. Listing all these 100+ columns looks lame
2. When a new column is eventually added, the function has to be updated too
3. On separate DB instances the column order might differ, which would cause the function to fail
I am not sure if I can list them once more (solving issue 3) like insert into <table> (<columns_list>) values (<query>) but then the query looks even uglier.
I would like to achieve something like 'insert into ', but this seems impossible, since the unique primary key constraint will raise a duplication error.
Any suggestions?
Thanks in advance for your time.
This isn't pretty or particularly optimized but there are a couple of ways to go about this. Ideally, you might want to do this all in an UPDATE trigger though you could implement a duplication function something like this:
-- create source table
CREATE TABLE history (history_id serial not null primary key, col_2 int, col_3 int, col_4 int, datetime timestamptz default now());
-- add some data
INSERT INTO history (col_2, col_3, col_4)
SELECT g, g * 10, g * 100 FROM generate_series(1, 100) AS g;
-- function to duplicate record
CREATE OR REPLACE FUNCTION fn_history_duplicate(p_history_id integer) RETURNS SETOF history AS
$BODY$
DECLARE
cols text;
insert_statement text;
BEGIN
-- build list of columns
SELECT array_to_string(array_agg(column_name::name), ',') INTO cols
FROM information_schema.columns
WHERE (table_schema, table_name) = ('public', 'history')
AND column_name <> 'history_id';
-- build insert statement
insert_statement := 'INSERT INTO history (' || cols || ') SELECT ' || cols || ' FROM history WHERE history_id = $1 RETURNING *';
-- execute statement
RETURN QUERY EXECUTE insert_statement USING p_history_id;
RETURN;
END;
$BODY$
LANGUAGE 'plpgsql';
-- test
SELECT * FROM fn_history_duplicate(1);
history_id | col_2 | col_3 | col_4 | datetime
------------+-------+-------+-------+-------------------------------
101 | 1 | 10 | 100 | 2013-04-15 14:56:11.131507+00
(1 row)
As I noted in my original comment, you might also take a look at the colnames extension as an alternative to querying the information schema.
You don't need the update anyway, you can supply the constant values directly in the SELECT statement:
INSERT INTO history
SELECT NEXTVAL('history_id_seq'),
col_1,
col_2,
col_3,
col_4,
'test_5',
...
'test_23',
...,
col_100
FROM history
WHERE history_id = 1234
ORDER BY datetime DESC
LIMIT 1
RETURNING history_id INTO new_history_id;
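A small runnable sketch of the same idea with an explicit column list, so that the primary key is left to its default and the column order of the table no longer matters. The table history_demo and its columns are hypothetical, reduced from the 100+ columns in the question.

```sql
-- Hypothetical reduced version of the history table.
CREATE TEMP TABLE history_demo (
    history_id serial PRIMARY KEY,
    col_2      int,
    col_5      text,
    datetime   timestamptz DEFAULT now()
);
INSERT INTO history_demo (col_2, col_5) VALUES (1, 'old');

-- Duplicate row 1: history_id is omitted so the serial default assigns a
-- new key; col_5 and datetime are overridden directly in the SELECT list.
INSERT INTO history_demo (col_2, col_5, datetime)
SELECT col_2, 'test_5', CURRENT_TIMESTAMP
FROM history_demo
WHERE history_id = 1
RETURNING *;
```

With an explicit column list on both sides, no separate UPDATE is needed, and the statement does not depend on the physical column order.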