PostgreSQL: select uniques from three different columns

I have one large table (100M+ rows) and two smaller ones (~2M rows each). All three tables have a column of company names that need to be sent out to an API for matching. I want to select the strings from each column and then combine them into a single column of unique strings.
I'm using a version of this answer (Combined 2 columns into one column SQL), but unsurprisingly the performance is very slow:
SELECT DISTINCT
    unnest(string_to_array(upper(t.buyer) || '#' || upper(a.aw_supplier_name) || '#' || upper(b.supplier_source_string), '#'))
FROM tenders t, awards a, banking b;
Any ideas on a more performant way to achieve this?
Update: the banking table is the largest table with 100m rows.

Assuming PostgreSQL 9.6, and borrowing the SELECT from rd_nielsen's answer, the following should give you a comma-delimited string of the distinct names. (Note that your original query cross joins all three tables -- a cartesian product -- which is why it is so slow; a UNION scans each table only once and removes duplicates as it goes.)
WITH cte AS (
    SELECT UPPER(t.buyer) AS names FROM tenders t
    UNION
    SELECT UPPER(a.aw_supplier_name) FROM awards a
    UNION
    SELECT UPPER(b.supplier_source_string) FROM banking b
)
SELECT array_to_string(ARRAY_AGG(cte.names), ',')
FROM cte;
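On 9.6 you could also fold the two functions into a single string_agg() call (a minor variation, not part of the original answer):
SELECT string_agg(cte.names, ',')
FROM cte;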

To get just a list of the combined names from all three tables, you could instead union together the selections from each table, like so:
SELECT upper(t.buyer) FROM tenders t
UNION
SELECT upper(a.aw_supplier_name) FROM awards a
UNION
SELECT upper(b.supplier_source_string) FROM banking b;
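If the deduplicated list will be consumed more than once (say, sent to the API in batches), it may help to materialize it instead of re-running the UNION over the 100m-row banking table each time. A sketch; the table name distinct_names is my own:
CREATE TABLE distinct_names AS
SELECT upper(t.buyer) AS name FROM tenders t
UNION
SELECT upper(a.aw_supplier_name) FROM awards a
UNION
SELECT upper(b.supplier_source_string) FROM banking b;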

Related

PostgreSQL - Pattern Match - String to Sub-string

I am trying to join two tables within a database, based upon a matching postcode, but am struggling where there are multiple postcodes relating to a single row of data.
i.e. table 1 has 2 columns (a unique ID and postcodes). It is possible for a record to have just a single postcode in this column or multiple postcodes in comma-separated form.
table 2 also has two columns (development description and postcode). In this table the postcode column can have only one postcode.
I would like to identify & join where the postcode from table 2 matches or is included within the relevant column in table 1. I have been able to do so where there is a single postcode within each column, but am currently unable to do so where there are multiple postcodes in table 1.
The below code brings back the matches where there is a single postcode.
SELECT t1.id,
t1.postcodes,
t2.dev_description,
t2.postcode
FROM table1 AS t1
INNER JOIN table2 AS t2
ON t2.postcode LIKE t1.postcodes
WHERE t2.postcode = 'XXX XXX'
I have tried using '%'|| ||'%' and various other functions, but am at a bit of a loss to be honest.
If someone could help it would be great!
Thanks
You could join on:
',' || t1.postcodes || ',' like '%,' || t2.postcode || ',%'
This would expand to:
',1234AB,2345AB,3456AB,' like '%,1234AB,%'
Or you can use string_to_array and the @> contains operator:
string_to_array(t1.postcodes, ',') @> array[t2.postcode]
This expands to:
array['1234AB','2345AB','3456AB'] @> array['1234AB']
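Plugged into the original query, the array version looks like this (a sketch based on the asker's query; it assumes the postcodes in t1.postcodes are comma-separated without surrounding whitespace):
SELECT t1.id,
       t1.postcodes,
       t2.dev_description,
       t2.postcode
FROM table1 AS t1
INNER JOIN table2 AS t2
    ON string_to_array(t1.postcodes, ',') @> ARRAY[t2.postcode]
WHERE t2.postcode = 'XXX XXX';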
Hmmm, I've never joined two tables using ON and LIKE... Anyway, look up the function STRPOS.
Something like this perhaps:
...
OR (STRPOS(t1.postcodes, t2.postcode) > 0)
...
Note that a bare STRPOS test can yield false positives when one postcode is a prefix of another (e.g. '1234AB' is found inside '1234ABC'); the delimiter-wrapping approach above avoids that.

How can I combine two PIVOTs that use different aggregate elements and the same spreading/grouping elements into a single row per ID?

Couldn't find an exact duplicate question so please push one to me if you know of one.
https://i.stack.imgur.com/Xjmca.jpg
See the screenshot (sorry for link, not enough rep). In the table I have ID, Cat, Awd, and Xmit.
I want a resultset where each row is a distinct ID plus the aggregate Awd and Xmit amounts for each Cat (so four add'l columns per ID).
Currently I'm using two CTEs, one to aggregate each of Awd and Xmit. Both make use of the PIVOT operator, using Cat to spread and ID to group. After each CTE does its thing, I'm INNER JOINing them on ID.
WITH CTE1 (ID, P_Awd, G_Awd) AS (
    SELECT ...
    FROM Table
    PIVOT (SUM(Awd) FOR Cat IN ([P], [G])) p
),
CTE2 (ID, P_Xmit, G_Xmit) AS (
    [same as CTE1, but with "Xmit" in place of "Awd"]
)
SELECT CTE1.ID, P_Awd, P_Xmit, G_Awd, G_Xmit
FROM CTE1 INNER JOIN CTE2 ON CTE1.ID = CTE2.ID
The output of this (greatly simplified) is two rows per ID, with each row holding the resultset of one CTE or the other.
What am I overlooking? Am I overcomplicating this?
Here is one method, via a CROSS APPLY.
Also, this assumes you don't need dynamic SQL.
Example
Select *
From (
Select ID
,B.*
From YourTable A
Cross Apply ( values (cat+'_Awd',Awd)
,(cat+'_Xmit',Xmit)
) B(Item,Value)
) src
Pivot (sum(Value) for Item in ([P_Awd],[P_XMit],[G_Awd],[G_XMit]) ) pvt
Returns (limited set -- best if you don't use images for sample data):
ID  P_Awd  P_XMit  G_Awd  G_XMit
1   1000   500     1000   0
2   2000   1500    500    500
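For a fixed set of categories, plain conditional aggregation gives the same one-row-per-ID result without PIVOT or CROSS APPLY (a sketch against the same assumed table and columns, not from the original answer):
Select ID
      ,sum(case when Cat = 'P' then Awd  end) as P_Awd
      ,sum(case when Cat = 'P' then Xmit end) as P_Xmit
      ,sum(case when Cat = 'G' then Awd  end) as G_Awd
      ,sum(case when Cat = 'G' then Xmit end) as G_Xmit
From YourTable
Group By ID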

Error while using regexp_split_to_table (Amazon Redshift)

I have the same question as this:
Splitting a comma-separated field in Postgresql and doing a UNION ALL on all the resulting tables
Just that my 'fruits' column is delimited by '|'. When I try:
SELECT
yourTable.ID,
regexp_split_to_table(yourTable.fruits, E'|') AS split_fruits
FROM yourTable
I get the following:
ERROR: type "e" does not exist
Q1. What does the E do? I saw some examples where E is not used. The official docs don't explain it in their "quick brown fox..." example.
Q2. How do I use '|' as the delimiter for my query?
Edit: I am using PostgreSQL 8.0.2. Neither unnest() nor regexp_split_to_table() is supported.
A1
E is a prefix for Posix-style escape strings. You don't normally need it in modern Postgres; only prepend it if you want to interpret special characters in the string, like E'\n' for a newline character. Details and links to documentation:
Insert text with single quotes in PostgreSQL
SQL select where column begins with \
E is pointless noise in your query, but it should still work. The answer you are linking to is not very good, I'm afraid.
A2
The pipe | is a special character in regular expressions (alternation), so it should be escaped in the pattern; unescaped, it matches the empty string and splits the value into single characters. The E, again, is not needed:
SELECT id, regexp_split_to_table(fruits, '\|') AS split_fruits
FROM tbl;
For simple delimiters, you don't need expensive regular expressions. This is typically faster:
SELECT id, unnest(string_to_array(fruits, '|')) AS split_fruits
FROM tbl;
In Postgres 9.3+ you'd rather use a LATERAL join for set-returning functions:
SELECT t.id, f.split_fruits
FROM tbl t
LEFT JOIN LATERAL unnest(string_to_array(fruits, '|')) AS f(split_fruits)
ON true;
Details:
What is the difference between LATERAL and a subquery in PostgreSQL?
PostgreSQL unnest() with element number
Amazon Redshift is not Postgres
It only implements a reduced set of features, as documented in its manual. In particular, there are no table functions, including the essential unnest(), generate_series() and regexp_split_to_table(), when working with its "compute nodes" (i.e. when accessing any tables).
You should go with a normalized table layout to begin with (extra table with one fruit per row).
Or here are some options to create a set of rows in Redshift:
How to select multiple rows filled with constants in Amazon Redshift?
This workaround should do it:
Create a table of numbers with at least as many rows as there can be fruits in your column -- temporary, or permanent if you'll keep using it. Say we never have more than 9:
CREATE TEMP TABLE nr9(i int);
INSERT INTO nr9(i) VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9);
Join to the number table and use split_part(), which is actually implemented in Redshift:
SELECT *, split_part(t.fruits, '|', n.i) As fruit
FROM nr9 n
JOIN tbl t ON split_part(t.fruits, '|', n.i) <> ''
Voilà.
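If the original order of the fruits matters, the element number from the numbers table can be carried along and sorted on (a variation on the query above, not part of the original answer):
SELECT t.id, n.i, split_part(t.fruits, '|', n.i) AS fruit
FROM nr9 n
JOIN tbl t ON split_part(t.fruits, '|', n.i) <> ''
ORDER BY t.id, n.i;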

SELECT * except nth column

Is it possible to SELECT * but without the n-th column, for example the 2nd?
I have some views with 4 and 5 columns (each has different column names, except for the 2nd column), but I do not want to show the second column.
SELECT * -- how to prevent 2nd column to be selected?
FROM view4
WHERE col2 = 'foo';
SELECT * -- how to prevent 2nd column to be selected?
FROM view5
WHERE col2 = 'foo';
without having to list all the columns (since they all have different column names).
The real answer is that you practically just cannot (see LINK). This has been a requested feature for decades, and the developers refuse to implement it. The best practice is to list the column names instead of *; using * is itself a source of performance penalties anyway.
However, in case you really need it, you can select the columns directly from the schema (check LINK), or, as in the example below, use two PostgreSQL built-ins: the ARRAY constructor and the ARRAY_TO_STRING function. The first transforms a query result into an array, and the second concatenates the array components into a string. The list separator can be specified with the second parameter of ARRAY_TO_STRING;
SELECT 'SELECT ' ||
ARRAY_TO_STRING(ARRAY(SELECT COLUMN_NAME::VARCHAR(50)
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME='view4' AND
COLUMN_NAME NOT IN ('col2')
ORDER BY ORDINAL_POSITION
), ', ') || ' FROM view4';
where strings are concatenated with the standard operator ||. The COLUMN_NAME data type is information_schema.sql_identifier, which requires explicit conversion to CHAR/VARCHAR.
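For illustration: if view4 had the (hypothetical) columns col1, col2, col3 and col4, the query above would emit the text
SELECT col1, col3, col4 FROM view4
which you then run as a second statement.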
But that is not recommended either. What if you add more columns in the long run, but they are not necessarily required for that query?
You would start pulling more columns than you need.
What if the SELECT is part of an INSERT, as in
INSERT INTO tableA (col1, col2, col3, ..., coln) SELECT everything but 2 columns FROM tableB
The column match will be wrong and your insert will fail.
It's possible, but I still recommend writing out every needed column for every SELECT, even if nearly every column is required.
Conclusion:
Since you are already using a VIEW, the simplest and most reliable way is to alter your view and list the column names, excluding your 2nd column.
-- my table with 2 rows and 4 columns
DROP TABLE IF EXISTS t_target_table;
CREATE TEMP TABLE t_target_table as
SELECT 1 as id, 1 as v1, 2 as v2, 3 as v3, 4 as v4
UNION ALL
SELECT 2 as id, 5 as v1, -6 as v2, 7 as v3, 8 as v4
;
-- my computation and stuff that I have to measure; any logic could be done here!
DROP TABLE IF EXISTS t_processing;
CREATE TEMP TABLE t_processing as
SELECT *, md5(t_target_table::text) as row_hash, case when v2 < 0 THEN true else false end as has_negative_value_in_v2
FROM t_target_table
;
-- now we want to insert that stuff into the t_target_table
-- this is standard
-- INSERT INTO t_target_table (id, v1, v2, v3, v4) SELECT id, v1, v2, v3, v4 FROM t_processing;
-- this is advanced ;-)
INSERT INTO t_target_table
-- the following row selects only the columns that are present in the target table, and ignores the others
SELECT r.* FROM (SELECT to_jsonb(t_processing) as d FROM t_processing) t JOIN LATERAL jsonb_populate_record(NULL::t_target_table, d) as r ON TRUE
;
-- WARNING: you need an object that represents the target structure; an exclusion of a single column is not possible
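This works because jsonb_populate_record() simply ignores keys in the JSON object that have no matching column in the target row type, so the extra row_hash and has_negative_value_in_v2 columns are silently dropped.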
For columns col1, col2, col3 and col4 you will need to request
SELECT col1, col3, col4 FROM...
to omit the second column. Requesting
SELECT *
will get you all the columns

Postgres: Find number of distinct values for each column

I am trying to find the number of distinct values in each column of a table. Declaratively that is:
for each column of table xyz
run_query("SELECT COUNT(DISTINCT column) FROM xyz")
Finding the column names of a table is shown here.
SELECT column_name
FROM information_schema.columns
WHERE table_name = 'xyz';
However, I can't manage to merge the count query into it. I tried various queries; this one:
SELECT column_name, thecount
FROM information_schema.columns,
(SELECT COUNT(DISTINCT column_name) FROM myTable) AS thecount
WHERE table_name = 'myTable'
is syntactically not allowed (a reference to column_name in the nested query is not allowed).
This one seems erroneous too (timeout):
SELECT column_name, count(distinct column_name)
FROM information_schema.columns, myTable
WHERE table_name = 'myTable'
What is the right way to get the number of distinct values for each column of a table with one query?
The article SQL to find the number of distinct values in a column only covers a fixed column.
In general, SQL expects the names of items (fields, tables, roles, indices, constraints, etc) in a statement to be constant. That many database systems let you examine the structure through something like information_schema does not mean you can plug that data into the running statement.
You can however use the information_schema to construct new SQL statements that you execute separately.
First consider your original problem.
CREATE TABLE foo (a numeric, b numeric, c numeric);
INSERT INTO foo(a,b,c)
VALUES (1,1,1), (1,1,2), (1,1,3), (1,2,1), (1,2,2);
SELECT COUNT(DISTINCT a) "distinct a",
COUNT(DISTINCT b) "distinct b",
COUNT(DISTINCT c) "distinct c"
FROM foo;
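With the sample data above, this returns one row (a has one distinct value, b two, c three):
distinct a | distinct b | distinct c
-----------+------------+------------
         1 |          2 |          3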
If you know the name of all of your columns when you are writing the query, that is sufficient.
If you are seeking data for an arbitrary table, you need to construct the SQL statement via SQL (I've added plenty of whitespace so you can see the different levels involved):
SELECT 'SELECT ' || STRING_AGG( 'COUNT (DISTINCT '
|| column_name
|| ') "'
|| column_name
|| '"',
',')
|| ' FROM foo;'
FROM information_schema.columns
WHERE table_name='foo';
That, however, is just the text of the necessary SQL statement. Depending on how you are accessing Postgresql, it might be easy for you to feed that into a new query; if you are keeping everything inside Postgresql, you will have to resort to one of the integrated procedural languages. An excellent (though complex) discussion of the issues may provide guidance.
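If everything stays inside Postgres, a minimal PL/pgSQL sketch could look like this (the function name, the jsonb return type, and the to_jsonb() wrapper are my own choices; to_jsonb() needs Postgres 9.5+):
CREATE OR REPLACE FUNCTION count_distinct_per_column(_tbl text)
  RETURNS jsonb
  LANGUAGE plpgsql AS
$$
DECLARE
   stmt   text;
   result jsonb;
BEGIN
   -- build the statement from the catalog, quoting identifiers with %I
   SELECT 'SELECT to_jsonb(x) FROM (SELECT '
       || string_agg(format('COUNT(DISTINCT %I) AS %I', column_name, column_name), ', ')
       || format(' FROM %I) x', _tbl)
   INTO   stmt
   FROM   information_schema.columns
   WHERE  table_name = _tbl;

   -- run it and return a single jsonb object of column -> distinct count
   EXECUTE stmt INTO result;
   RETURN result;
END
$$;
-- usage:
-- SELECT count_distinct_per_column('foo');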