Is there an equivalent of DataFrame's 'describe' for a PostgreSQL table?

I want to summarize multiple tables in my database, getting each column's statistics (min, max, avg, number of null values, etc.).
Is there a PostgreSQL command/tool for doing that?

PostgreSQL maintains statistics on all tables. They are made visible via the pg_stats view.
It contains at least some of the information you are after, such as the proportion of null values, as well as other potentially useful information like histograms of the most commonly occurring values.
These statistics are maintained by the database itself to aid in query planning.
Example Usage: Obtain fraction of nulls and number of distinct values in table 'foo':
ispdb_t1=> select tablename || '.' || attname as tablecolumn, null_frac, n_distinct from pg_stats where tablename='foo';
tablecolumn | null_frac | n_distinct
-------------------+-------------+------------
foo.name | 0 | -1
foo.a | 0.000785309 | 4
foo.b | 0.000241633 | 4
foo.id | 0 | -1
foo.d | 0 | 553
(6 rows)
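pg_stats does not store exact min/max/avg figures, so if you need those you can compute them per table with ordinary aggregates. A minimal sketch (assuming foo's columns a and b are numeric; adjust the column list for each table):
SELECT count(*)            AS total_rows,
       min(a)              AS a_min,
       max(a)              AS a_max,
       avg(a)              AS a_avg,
       count(*) - count(a) AS a_nulls,  -- count(a) ignores NULLs
       min(b)              AS b_min,
       max(b)              AS b_max,
       avg(b)              AS b_avg,
       count(*) - count(b) AS b_nulls
FROM foo;  -- a and b are assumed numeric columns of foo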

Related

PostgreSQL how to generate a partition row_number() with certain numbers overridden

I have an unusual problem I'm trying to solve with SQL where I need to generate sequential numbers for partitioned rows but override specific numbers with values from the data, while not breaking the sequence (unless the override causes a number to be used greater than the number of rows present).
I feel I might be able to achieve this by selecting the rows where I need to override the generated sequence value and the rows where I don't, unioning them together, and somehow using COALESCE to get the desired dynamically generated sequence value; or maybe there's some way I can utilise a recursive CTE.
I've not been able to solve this problem yet, but I've put together a SQL Fiddle which provides a simplified version:
http://sqlfiddle.com/#!17/236b5/5
The desired_dynamic_number is what I'm trying to generate and the generated_dynamic_number is my current work-in-progress attempt.
Any pointers around the best way to achieve the desired_dynamic_number values dynamically?
Update:
I'm almost there using lag:
http://sqlfiddle.com/#!17/236b5/24
step-by-step demo: db<>fiddle
SELECT
    *,
    COALESCE(                                    -- 3
        first_value(override_as_number) OVER w   -- 2
        , 1
    )
    + row_number() OVER w - 1                    -- 4, 5
FROM (
    SELECT
        *,
        SUM(                                     -- 1
            CASE WHEN override_as_number IS NOT NULL THEN 1 ELSE 0 END
        ) OVER (PARTITION BY grouped_by ORDER BY secondary_order_by)
        AS grouped
    FROM sample
) s
WINDOW w AS (PARTITION BY grouped_by, grouped ORDER BY secondary_order_by)
Create a new subpartition within your partitions: this cumulative sum creates a unique group id for every group of records that starts with a non-NULL override_as_number followed by NULL records. So, for instance, your rows (AAA, d) to (AAA, f) belong to the same subpartition/group.
first_value() gives the first value of such subpartition.
The COALESCE ensures a non-NULL result from the first_value() function if your partition starts with a NULL record.
row_number() - 1 creates a row count within a subpartition, starting with 0.
Adding the row count to the first_value() of a subpartition creates your result: the non-NULL record that starts a subpartition keeps its own value (row count 0), the first following NULL record gets that value +1, and so forth.
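For reference, a minimal sketch of sample data the query above can be run against (inferred from the result table shown further down; column types are assumptions):
CREATE TABLE sample (
    grouped_by         text,
    secondary_order_by text,
    override_as_number int
);
INSERT INTO sample VALUES
    ('AAA', 'a', 1),
    ('AAA', 'b', NULL),
    ('AAA', 'c', 3),
    ('AAA', 'd', 3),
    ('AAA', 'e', NULL),
    ('AAA', 'f', NULL),
    ('AAA', 'g', 999),
    ('XYZ', 'a', NULL),
    ('ZZZ', 'a', NULL),
    ('ZZZ', 'b', NULL);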
The query below gives the exact result, but you need to verify it with all combinations.
select c.*, COALESCE(c.override_as_number, c.act) as final
FROM (
    select b.*,
           dense_rank() over (partition by grouped_by order by grouped_by, actual) as act
    from (
        select a.*,
               COALESCE(override_as_number, row_num) as actual
        FROM (
            select grouped_by,
                   secondary_order_by,
                   dense_rank() over (partition by grouped_by order by grouped_by, secondary_order_by) as row_num,
                   override_as_number,
                   desired_dynamic_number
            from fiddle
        ) a
    ) b
) c;
column "final" is the result
grouped_by | secondary_order_by | row_num | override_as_number | desired_dynamic_number | actual | act | final
------------+--------------------+---------+--------------------+------------------------+--------+-----+-------
AAA | a | 1 | 1 | 1 | 1 | 1 | 1
AAA | b | 2 | | 2 | 2 | 2 | 2
AAA | c | 3 | 3 | 3 | 3 | 3 | 3
AAA | d | 4 | 3 | 3 | 3 | 3 | 3
AAA | e | 5 | | 4 | 5 | 4 | 4
AAA | f | 6 | | 5 | 6 | 5 | 5
AAA | g | 7 | 999 | 999 | 999 | 6 | 999
XYZ | a | 1 | | 1 | 1 | 1 | 1
ZZZ | a | 1 | | 1 | 1 | 1 | 1
ZZZ | b | 2 | | 2 | 2 | 2 | 2
(10 rows)
Hope this helps!
The real-world problem I was trying to solve did not have a nicely ordered secondary_order_by column; instead it was something a bit more randomised (a created timestamp).
For the benefit of people who stumble across this question with a similar problem to solve, a colleague solved it using a cartesian join, whose solution I'm posting below. The solution is Snowflake SQL, which should be possible to adapt to Postgres. It does fall down on higher override_as_number values, though, unless the 1000 in from table(generator(rowcount => 1000)) is increased to something suitably high.
The SQL:
with tally_table as (
select row_number() over (order by seq4()) as gen_list
from table(generator(rowcount => 1000))
),
base as (
select *,
IFF(override_as_number IS NULL, row_number() OVER(PARTITION BY grouped_by, override_as_number order by random),override_as_number) as rownum
from "SANDPIT"."TEST"."SAMPLEDATA" order by grouped_by,override_as_number,random
) --select * from base order by grouped_by,random;
,
cart_product as (
select *
from tally_table cross join (Select distinct grouped_by from base ) as distinct_grouped_by
) --select * from cart_product;
,
filter_product as (
select *,
row_number() OVER(partition by cart_product.grouped_by order by cart_product.grouped_by,gen_list) as seq_order
from cart_product
where CONCAT(grouped_by,'~',gen_list) NOT IN (select concat(grouped_by,'~',override_as_number) from base where override_as_number is not null)
) --select * from try2 order by 2,3 ;
select base.grouped_by,
base.random,
base.override_as_number,
base.answer, -- This is hard coded as test data
IFF(override_as_number is null, gen_list, seq_order) as computed_answer
from base inner join filter_product on base.rownum = filter_product.seq_order and base.grouped_by = filter_product.grouped_by
order by base.grouped_by,
random;
In the end I went for a simpler solution using a temporary table and cursor to inject override_as_number values and shuffle other numbers.
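That temporary-table-and-cursor approach isn't shown here, but a rough sketch of one possible shape of it (not the actual code; it assumes the sample table above and mirrors the logic of the window-function answer) could be:
-- Hypothetical sketch: assign dynamic numbers by looping over a temp copy of sample
CREATE TEMP TABLE numbered AS
SELECT grouped_by, secondary_order_by, override_as_number,
       NULL::int AS dynamic_number
FROM sample;

DO $$
DECLARE
    rec         record;
    current_grp text;
    current_val int;
BEGIN
    FOR rec IN
        SELECT grouped_by, secondary_order_by, override_as_number
        FROM numbered
        ORDER BY grouped_by, secondary_order_by
    LOOP
        -- reset the counter at the start of each partition
        IF current_grp IS DISTINCT FROM rec.grouped_by THEN
            current_grp := rec.grouped_by;
            current_val := 0;
        END IF;

        -- inject the override if present, otherwise continue the sequence
        IF rec.override_as_number IS NOT NULL THEN
            current_val := rec.override_as_number;
        ELSE
            current_val := current_val + 1;
        END IF;

        UPDATE numbered
        SET dynamic_number = current_val
        WHERE grouped_by = rec.grouped_by
          AND secondary_order_by = rec.secondary_order_by;
    END LOOP;
END $$;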

PostGIS minimum distance of two sets including other variables from both tables

I have two tables (table1 and table2) with three columns: id, value and geometry. The geometries are point features.
I want to do a join between both tables where the resulting table contains for each point of table1, the minimum distance to a point of table2, the value of table1 and the value of the corresponding point of table2.
I tried the following code, but logically this gives, for each point of table1, the distance to each point of table2. However, I cannot leave v2 out of the GROUP BY clause. How can I get the table I want?
SELECT t1.value AS v1,
t2.value AS v2,
MIN(st_distance(t1.geometry, t2.geometry)) AS dis
FROM table1 t1, table2 t2
GROUP BY v1, v2
For simplicity I took integer values and their differences instead of the distance between points (it works exactly the same: just swap the subtraction for the st_distance function):
demo: db<>fiddle
SELECT DISTINCT ON (v1.point)
v1.point,
v2.point,
abs(v1.point - v2.point)
FROM
table1 v1
CROSS JOIN table2 v2
ORDER BY v1.point, abs(v1.point - v2.point)
My tables:
table1.point: 1, 2, 4, 8, 16
table2.point: 2, 3, 5, 7, 11, 13
The result:
| point | point | abs |
|-------|-------|-----|
| 1 | 2 | 1 |
| 2 | 2 | 0 |
| 4 | 3 | 1 |
| 8 | 7 | 1 |
| 16 | 13 | 3 |
Explanation:
You have to calculate all differences to know which one is the smallest. That's the reason for the CROSS JOIN. Now you can ORDER BY the points of table1 and the differences (or distances). Notice the abs() function: This makes all negative values positive. Otherwise difference -42 would be taken instead of +1.
DISTINCT ON (v1.point) takes the first ordered row for each v1.point.
Notice:
Because of the CROSS JOIN and the heavy mathematics in st_distance it could be really slow for huge data sets!
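Translated back to the original PostGIS setup, a sketch (assuming table1 and table2 with the id, value and geometry columns described in the question):
-- nearest table2 point for each table1 point (sketch)
SELECT DISTINCT ON (t1.id)
    t1.id,
    t1.value                              AS v1,
    t2.value                              AS v2,
    ST_Distance(t1.geometry, t2.geometry) AS dist
FROM table1 t1
CROSS JOIN table2 t2
ORDER BY t1.id, ST_Distance(t1.geometry, t2.geometry);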

Postgres Average calculation ignores null

This is my postgres table
name | revenue
--------+---------
John | 100
Will | 100
Tom | 100
Susan | 100
Ben |
(5 rows)
Here, when I calculate the average of revenue, it returns 100, which is clearly not what I expect: sum/count would be 400/5 = 80. Is this behaviour by design, or am I missing the point?
I know I could change NULL to 0 and process as usual. But given the default behaviour, is this the intentional and preferred way of calculating an average?
This is both intentional and perfectly logical. Remember that NULL means that the value is unknown.
It might, for instance, represent a value which will be filled in at some future date. If the future value turns out to be 0, the average will be 400 / 5 = 80, as you say; but if the future value turns out to be 200, the average value will be 600 / 5 = 120 instead. All we can know right now is that the average of known values is 400 / 4 = 100.
If you actually know that you have 0 revenue for this item, you should store 0 in that column. If you don't know what revenue you have for that item, you should exclude it from your calculations, which is what Postgres, following the SQL Standard, does for you.
If you can't fix the data, but it is in fact a case that all NULLs in this table should be treated as 0 - or as some other fixed value - you can use a COALESCE inside the aggregate:
SELECT AVG(COALESCE(revenue, 0)) as forced_average
You should force a 0 value for null revenues.
create table tbl (name varchar(10), revenue int);
✓
insert into tbl values
('John', 100), ('Will', 100), ('Tom', 100), ('Susan', 100), ('Ben', null);
5 rows affected
select avg(case when revenue is null then 0 else revenue end) from tbl;
| avg |
| ------------------: |
| 80.0000000000000000 |
select avg(coalesce(revenue,0)) from tbl;
| avg |
| ------------------: |
| 80.0000000000000000 |
dbfiddle here
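For comparison, the default behaviour on the same data (a sketch, not part of the fiddle above):
select avg(revenue) from tbl;
-- 100.0000000000000000  (only the four non-NULL rows are averaged)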

crosstab in PostgreSQL, count

Crosstab function returns error:
No function matches the given name and argument types
I have a table with clients, dates and client type.
Example:
CLIENT_ID | DATE | CLI_TYPE
1234 | 201601 | F
1236 | 201602 | P
1234 | 201602 | F
1237 | 201601 | F
I would like to get the number of distinct clients grouped by date, and then count clients by type (with the types P and F as columns, counting clients according to whether they are P or F).
Something like this:
DATE | COUNT_CLIENT | P | F
201601 | 2 | 0 | 2
201602 | 2 | 1 | 1
SELECT date
, count(DISTINCT client_id) AS count_client
, count(*) FILTER (WHERE cli_type = 'P') AS p
, count(*) FILTER (WHERE cli_type = 'F') AS f
FROM clients
GROUP BY date;
This counts distinct clients per date, and total rows for cli_type 'P' and 'F'. It's undefined how you want to count multiple types for the same client (or whether that's even possible).
About aggregate FILTER:
Postgres COUNT number of column values with INNER JOIN
crosstab() might make it faster, but it's pretty unclear what you want exactly.
About crosstab():
PostgreSQL Crosstab Query
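For completeness, a sketch of what a crosstab() version could look like (it assumes the tablefunc extension and guesses the column types of the clients table):
CREATE EXTENSION IF NOT EXISTS tablefunc;

SELECT ct.date,
       c.count_client,
       COALESCE(ct.p, 0) AS p,
       COALESCE(ct.f, 0) AS f
FROM crosstab(
       $$SELECT date, cli_type, count(*)
         FROM clients
         GROUP BY date, cli_type
         ORDER BY date, cli_type$$,
       $$VALUES ('P'), ('F')$$
     ) AS ct(date int, p bigint, f bigint)   -- column types are assumptions
JOIN (
       SELECT date, count(DISTINCT client_id) AS count_client
       FROM clients
       GROUP BY date
     ) c USING (date);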

How do I get min, median and max from my query in postgresql?

I have written a query in which one column is a month. From that I have to get min month, max month, and median month. Below is my query.
select ext.employee,
pl.fromdate,
ext.FULL_INC as full_inc,
prevExt.FULL_INC as prevInc,
(extract(year from age (pl.fromdate))*12 +extract(month from age (pl.fromdate))) as month,
case
when prevExt.FULL_INC is not null then (ext.FULL_INC -coalesce(prevExt.FULL_INC,0))
else 0
end as difference,
(case when prevExt.FULL_INC is not null then (ext.FULL_INC - prevExt.FULL_INC) / prevExt.FULL_INC*100 else 0 end) as percent
from pl_payroll pl
inner join pl_extpayfile ext
on pl.cid = ext.payrollid
and ext.FULL_INC is not null
left outer join pl_extpayfile prevExt
on prevExt.employee = ext.employee
and prevExt.cid = (select max (cid) from pl_extpayfile
where employee = prevExt.employee
and payrollid = (
select max(p.cid)
from pl_extpayfile,
pl_payroll p
where p.cid = payrollid
and pl_extpayfile.employee = prevExt.employee
and p.fromdate < pl.fromdate
))
and coalesce(prevExt.FULL_INC, 0) > 0
where ext.employee = 17
and (exists (
select employee
from pl_extpayfile preext
where preext.employee = ext.employee
and preext.FULL_INC <> ext.FULL_INC
and payrollid in (
select cid
from pl_payroll
where cid = (
select max(p.cid)
from pl_extpayfile,
pl_payroll p
where p.cid = payrollid
and pl_extpayfile.employee = preext.employee
and p.fromdate < pl.fromdate
)
)
)
or not exists (
select employee
from pl_extpayfile fext,
pl_payroll p
where fext.employee = ext.employee
and p.cid = fext.payrollid
and p.fromdate < pl.fromdate
and fext.FULL_INC > 0
)
)
order by employee,
ext.payrollid desc
If that is not possible, then is it possible to get the max month and min month?
To calculate the median in PostgreSQL, simply take the 50% percentile (no need to add extra functions or anything):
SELECT PERCENTILE_CONT(0.5) WITHIN GROUP(ORDER BY x) FROM t;
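If you want all three figures in one pass, a sketch (months here is a hypothetical stand-in; substitute the query from the question, wrapped in a subquery, and the month column it computes):
SELECT min(month)                                         AS min_month,
       max(month)                                         AS max_month,
       percentile_cont(0.5) WITHIN GROUP (ORDER BY month) AS median_month
FROM months;  -- months(month) is a placeholder for the query in the question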
You want the aggregate functions named min and max. See the PostgreSQL documentation and tutorial:
http://www.postgresql.org/docs/current/static/tutorial-agg.html
http://www.postgresql.org/docs/current/static/functions-aggregate.html
There's no built-in median in PostgreSQL, however one has been implemented and contributed to the wiki:
http://wiki.postgresql.org/wiki/Aggregate_Median
It's used the same way as min and max once you've loaded it. Being written in PL/PgSQL it'll be a fair bit slower, but there's even a C version there that you could adapt if speed was vital.
UPDATE After comment:
It sounds like you want to show the statistical aggregates alongside the individual results. You can't do this with a plain aggregate function because you can't reference columns not in the GROUP BY in the result list.
You will need to fetch the stats from subqueries, or use your aggregates as window functions.
Given dummy data:
CREATE TABLE dummystats ( depname text, empno integer, salary integer );
INSERT INTO dummystats(depname,empno,salary) VALUES
('develop',11,5200),
('develop',7,4200),
('personell',2,5555),
('mgmt',1,9999999);
... and after adding the median aggregate from the PG wiki:
You can do this with an ordinary aggregate:
regress=# SELECT min(salary), max(salary), median(salary) FROM dummystats;
min | max | median
------+---------+----------------------
4200 | 9999999 | 5377.5000000000000000
(1 row)
but not this:
regress=# SELECT depname, empno, min(salary), max(salary), median(salary)
regress-# FROM dummystats;
ERROR: column "dummystats.depname" must appear in the GROUP BY clause or be used in an aggregate function
because it doesn't make sense in the aggregation model to show the averages alongside individual values. You can show groups:
regress=# SELECT depname, min(salary), max(salary), median(salary)
regress-# FROM dummystats GROUP BY depname;
depname | min | max | median
-----------+---------+---------+-----------------------
personell | 5555 | 5555 | 5555.0000000000000000
develop | 4200 | 5200 | 4700.0000000000000000
mgmt | 9999999 | 9999999 | 9999999.000000000000
(3 rows)
... but it sounds like you want the individual values. For that, you must use a window, a feature new in PostgreSQL 8.4.
regress=# SELECT depname, empno,
min(salary) OVER (),
max(salary) OVER (),
median(salary) OVER ()
FROM dummystats;
depname | empno | min | max | median
-----------+-------+------+---------+-----------------------
develop | 11 | 4200 | 9999999 | 5377.5000000000000000
develop | 7 | 4200 | 9999999 | 5377.5000000000000000
personell | 2 | 4200 | 9999999 | 5377.5000000000000000
mgmt | 1 | 4200 | 9999999 | 5377.5000000000000000
(4 rows)
See also:
http://www.postgresql.org/docs/current/static/tutorial-window.html
http://www.postgresql.org/docs/current/static/functions-window.html
One more option for median:
SELECT x
FROM table
ORDER BY x
LIMIT 1 OFFSET (SELECT count(*) FROM table) / 2
To find Median:
For instance, consider that we have 6000 rows present in the table. First we need to take half of the rows from the original table (because we know the median is always the middle value); here half of 6000 is 3000 (take 3001 to get exactly the two middle values).
SELECT *
FROM (SELECT column_name
      FROM Table_name
      ORDER BY column_name
      LIMIT 3001) AS Table1
ORDER BY column_name DESC  -- DESC (Z-A) so that, together with LIMIT 2, the last two
LIMIT 2;                   -- values (the 3000th and 3001st rows) of the 6000 are returned
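To collapse those two middle values into a single median, you can average them (a sketch building on the query above):
SELECT avg(column_name) AS median
FROM (
    SELECT column_name
    FROM (SELECT column_name
          FROM Table_name
          ORDER BY column_name
          LIMIT 3001) AS Table1
    ORDER BY column_name DESC
    LIMIT 2
) AS two_middle_rows;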