Postgres: Find number of distinct values for each column - postgresql

I am trying to find the number of distinct values in each column of a table. Declaratively that is:
for each column of table xyz
run_query("SELECT COUNT(DISTINCT column) FROM xyz")
Finding the column names of a table is shown here.
SELECT column_name
FROM information_schema.columns
WHERE table_name=xyz
However, I don't manage to merge the count query inside. I tried various queries, this one:
SELECT column_name, thecount
FROM information_schema.columns,
(SELECT COUNT(DISTINCT column_name) FROM myTable) AS thecount
WHERE table_name=myTable
is syntactically not allowed (reference to column_name in the nested query not allowed).
This one seems erroneous too (timeout):
SELECT column_name, count(distinct column_name)
FROM information_schema.columns, myTable
WHERE table_name=myTable
What is the right way to get the number of distinct values for each column of a table with one query?
Article SQL to find the number of distinct values in a column talks about a fixed column only.

In general, SQL expects the names of items (fields, tables, roles, indices, constraints, etc) in a statement to be constant. That many database systems let you examine the structure through something like information_schema does not mean you can plug that data into the running statement.
You can however use the information_schema to construct new SQL statements that you execute separately.
First consider your original problem.
CREATE TABLE foo (a numeric, b numeric, c numeric);
INSERT INTO foo(a,b,c)
VALUES (1,1,1), (1,1,2), (1,1,3), (1,2,1), (1,2,2);
SELECT COUNT(DISTINCT a) "distinct a",
COUNT(DISTINCT b) "distinct b",
COUNT(DISTINCT c) "distinct c"
FROM foo;
If you know the name of all of your columns when you are writing the query, that is sufficient.
If you are seeking data for an arbitrary table, you need to construct the SQL statement via SQL (I've added plenty of whitespace so you can see the different levels involved):
SELECT 'SELECT ' || STRING_AGG( 'COUNT (DISTINCT '
|| column_name
|| ') "'
|| column_name
|| '"',
',')
|| ' FROM foo;'
FROM information_schema.columns
WHERE table_name='foo';
That however is just the text of the necessary SQL statement. Depending on how you are accessing Postgresql, it might be easy for you to feed that into a new query, or if you are keeping everything inside Postgresql, then you will have to resort to one of the integrated procedural languages. An excellent (though complex,) discussion of the issues may provide guidance.

Related

Is a subquery able to select columns from outer query? [duplicate]

This question already has answers here:
sql server 2008 management studio not checking the syntax of my query
(2 answers)
Closed 1 year ago.
I have the following select:
SELECT DISTINCT pl
FROM [dbo].[VendorPriceList] h
WHERE PartNumber IN (SELECT DISTINCT PartNumber
FROM [dbo].InvoiceData
WHERE amount > 10
AND invoiceDate > DATEADD(yyyy, -1, CURRENT_TIMESTAMP)
UNION
SELECT DISTINCT PartNumber
FROM [dbo].VendorDeals)
The issue here is that the table [dbo].VendorDeals has NO column PartNumber, however no error is detected and the query works with the first part of the union.
Even more, IntelliSense also allows and recognize PartNumber. This fails only when inside a complex statement.
It is pretty obvious that if you qualify column names, the mistake will be evident.
This isn't a bug in SQL Server/the T-SQL dialect parsing, no, this is working exactly as intended. The problem, or bug, is in your T-SQL; specifically because you haven't qualified your columns. As I don't have the definition of your table, I'm going to provide sample DDL first:
CREATE TABLE dbo.Table1 (MyColumn varchar(10), OtherColumn int);
CREATE TABLE dbo.Table2 (YourColumn varchar(10) OtherColumn int);
And then an example that is similar to your query:
SELECT MyColumn
FROM dbo.Table1
WHERE MyColumn IN (SELECT MyColumn FROM dbo.Table2);
This, firstly, will parse; it is a valid query. Secondly, provided that dbo.Table2 contains at least one row, then every row from table dbo.Table1 will be returned where MyColumn has a non-NULL value. Why? Well, let's qualify the column with table's name as SQL Server would parse them:
SELECT Table1.MyColumn
FROM dbo.Table1
WHERE Table1.MyColumn IN (SELECT Table1.MyColumn FROM dbo.Table2);
Notice that the column inside the IN is also referencing Table1, not Table2. By default if a column has it's alias omitted in a subquery it will be assumed to be referencing the table(s) defined in that subquery. If, however, none of the tables in the sub query have a column by that name, then it will be assumed to reference a table where that column does exist; in this case Table1.
Let's, instead, take a different example, using the other column in the tables:
SELECT OtherColumn
FROM dbo.Table1
WHERE OtherColumn IN (SELECT OtherColumn FROM dbo.Table2);
This would be parsed as the following:
SELECT Table1.OtherColumn
FROM dbo.Table1
WHERE Table1.OtherColumn IN (SELECT Table2.OtherColumn FROM dbo.Table2);
This is because OtherColumn exists in both tables. As, in the subquery, OtherColumn isn't qualified it is assumed the column wanted is the one in the table defined in the same scope, Table2.
So what is the solution? Alias and qualify your columns:
SELECT T1.MyColumn
FROM dbo.Table1 T1
WHERE T1.MyColumn IN (SELECT T2.MyColumn FROM dbo.Table2 T2);
This will, unsurprisingly, error as Table2 has no column MyColumn.
Personally, I suggest that unless you have only one table being referenced in a query, you alias and qualify all your columns. This not only ensures that the wrong column can't be referenced (such as in a subquery) but also means that other readers know exactly what columns are being referenced. It also stops failures in the future. I have honestly lost count how many times over years I have had a process fall over due to the "ambiguous column" error, due to a table's definition being changed and a query referencing the table wasn't properly qualified by the developer...

How to describe columns (get their names, data types, etc.) of a SQL query in PostgreSQL

I need a way to get a "description" of the columns from a SELECT query (cursor), such as their names, data types, precision, scale, etc., in PostgreSQL (or better yet PL/pgSQL).
I'm transitioning from Oracle PL/SQL, where I can get such description using a built-in procedure dbms_sql.describe_columns. It returns an array of records, one for each column of a given (parsed) cursor.
EDB has it implemented too (https://www.enterprisedb.com/docs/en/9.0/oracompat/Postgres_Plus_Advanced_Server_Oracle_Compatibility_Guide-127.htm#P13324_681237)
An examples of such query:
select col1 from tab where col2 = :a
I need an API (or a workaround) that could be called like this (hopefully):
select query_column_description('select col1 from tab where col2 = :a');
that will return something similar to:
{{"col1","numeric"}}
Why? We build views where these queries become individual columns. For example, view's query would look like the following:
select (select col1 from tab where col2 = t.colA) as col1::numeric
from tab_main t
http://sqlfiddle.com/#!17/21c7a/2
You can use systems table :
First step create a temporary view with your query (without clause where)
create or replace view temporary view a_view as
select col1 from tab
then select
select
row_to_json(t.*)
from (
select
column_name,
data_type
from
information_schema.columns
where
table_schema = 'public' and
table_name = 'a_view'
) as t

select all except for a specific column

I have a table with more than 20 columns, I want to get all columns except for one which I'll use in a conditional expression.
SELECT s.* (BUT NOT column1),
CASE WHEN column1 is null THEN 1 ELSE 2 END AS column1
from tb_sample s;
Can I achieve it in postgresql given the logic above?
It may not be ideal, but you can use information_schema to get the columns and use the column to exclude in the where clause.
That gives you a list of all the column names you DO want, which you can copy/paste into your select query:
select textcat(column_name, ',')
from information_schema.columns
where table_name ='table_name' and column_name !='column_to_exclude';

PostgreSQL select uniques from three different columns

I have one large table 100m+ rows and two smaller ones 2m rows ea. All three tables have a column of company names that need to be sent out to an API for matching. I want to select the strings from each column and then combine into a single column of unique strings.
I'm using a version of this response, but unsurprisingly the performance is very slow. Combined 2 columns into one column SQL
SELECT DISTINCT
unnest(string_to_array(upper(t.buyer) || '#' || upper(a.aw_supplier_name) || '#' || upper(b.supplier_source_string), '#'))
FROM
tenders t,
awards a,
banking b
;
Any ideas on a more performant way to achieve this?
Update: the banking table is the largest table with 100m rows.
Assuming PostgreSQL 9.6 and borrowing the select from rd_nielsen's answer, the following should give you a comma delimited string of the distinct names.
WITH cte
AS (
SELECT UPPER(T.buyer) NAMES
FROM tenders T
UNION
SELECT UPPER(A.aw_supplier_name) NAMES
FROM awards A
UNION
SELECT UPPER(b.supplier_source_string) NAMES
FROM banking b
)
SELECT array_to_string(ARRAY_AGG(cte.names), ',')
FROM cte
To get just a list of the combined names from all three tables, you could instead union together the selections from each table, like so:
select
upper(t.buyer)
from
tenders t
union
select
upper(a.aw_supplier_name)
from
awards a
union
select
upper(b.supplier_source_string)
from
banking b
;

Error while using regexp_split_to_table (Amazon Redshift)

I have the same question as this:
Splitting a comma-separated field in Postgresql and doing a UNION ALL on all the resulting tables
Just that my 'fruits' column is delimited by '|'. When I try:
SELECT
yourTable.ID,
regexp_split_to_table(yourTable.fruits, E'|') AS split_fruits
FROM yourTable
I get the following:
ERROR: type "e" does not exist
Q1. What does the E do? I saw some examples where E is not used. The official docs don't explain it in their "quick brown fox..." example.
Q2. How do I use '|' as the delimiter for my query?
Edit: I am using PostgreSQL 8.0.2. unnest() and regexp_split_to_table() both are not supported.
A1
E is a prefix for Posix-style escape strings. You don't normally need this in modern Postgres. Only prepend it if you want to interpret special characters in the string. Like E'\n' for a newline char.Details and links to documentation:
Insert text with single quotes in PostgreSQL
SQL select where column begins with \
E is pointless noise in your query, but it should still work. The answer you are linking to is not very good, I am afraid.
A2
Should work as is. But better without the E.
SELECT id, regexp_split_to_table(fruits, '|') AS split_fruits
FROM tbl;
For simple delimiters, you don't need expensive regular expressions. This is typically faster:
SELECT id, unnest(string_to_array(fruits, '|')) AS split_fruits
FROM tbl;
In Postgres 9.3+ you'd rather use a LATERAL join for set-returning functions:
SELECT t.id, f.split_fruits
FROM tbl t
LEFT JOIN LATERAL unnest(string_to_array(fruits, '|')) AS f(split_fruits)
ON true;
Details:
What is the difference between LATERAL and a subquery in PostgreSQL?
PostgreSQL unnest() with element number
Amazon Redshift is not Postgres
It only implements a reduced set of features as documented in its manual. In particular, there are no table functions, including the essential functions unnest(), generate_series() or regexp_split_to_table() when working with its "compute nodes" (accessing any tables).
You should go with a normalized table layout to begin with (extra table with one fruit per row).
Or here are some options to create a set of rows in Redshift:
How to select multiple rows filled with constants in Amazon Redshift?
This workaround should do it:
Create a table of numbers, with at least as many rows as there can be fruits in your column. Temporary or permanent if you'll keep using it. Say we never have more than 9:
CREATE TEMP TABLE nr9(i int);
INSERT INTO nr9(i) VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9);
Join to the number table and use split_part(), which is actually implemented in Redshift:
SELECT *, split_part(t.fruits, '|', n.i) As fruit
FROM nr9 n
JOIN tbl t ON split_part(t.fruits, '|', n.i) <> ''
Voilá.