PostgreSQL reusing computed result as input to other select computations - postgresql

Is there any way we can take a computed result inside the select clause and insert it into another computation inside the select clause?
For example this is what I want to have but can't so far:
select trim(leading 'https://www.amazon.com' from url) as trimmedURL,
substring(trimmedURL from position('/' in trimmedURL) for position('html' in trimmedURL))....
As you can see, I have used trimmedURL three times inside the substring function. I know I can do this naively by copying and pasting trim(leading 'https://www.amazon.com' from url) into the substring function.
Is there any way to avoid that, so I don't end up with very large function calls when the first computed value is used many times inside other functions? This would improve code readability and usability.

You could use a lateral join and place the computed fields in the lateral subquery. The lateral fields are then accessible from the main query.
Postgres documentation for lateral join
For example:
SELECT
trimmedUrl
, SUBSTRING(trimmedURL,10,20) url_part
FROM mytable
LEFT JOIN LATERAL (SELECT trim(leading 'https://www.amazon.com' from url) as trimmedURL) trmd
ON TRUE
Also, note that PostgreSQL ignores case in the names of columns, tables, etc. unless they are quoted.
Here's a self-contained example:
WITH x(col) AS (Values ('abc://cdf/def'), ('abc://xyz/pqr'))
SELECT x.col, SUBSTRING(y.col2 from position('/' in y.col2)) reusing_computation
FROM x
LEFT JOIN LATERAL (SELECT trim(leading 'abc://' from col) col2) y ON TRUE
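An alternative that is not part of the answer above, shown here only as a sketch: compute the value once in a CTE (or derived table) and refer to it by name in the outer query. The CTE name trimmed is made up for illustration; the data and the trim expression are the ones from the self-contained example.

```sql
-- Compute the intermediate value once, then reuse it by name.
-- Note: trim(leading 'abc://' from col) strips any leading characters
-- that occur in the set 'abc://', not the literal prefix.
WITH x(col) AS (VALUES ('abc://cdf/def'), ('abc://xyz/pqr')),
trimmed AS (
    SELECT col, trim(leading 'abc://' from col) AS col2
    FROM x
)
SELECT col,
       substring(col2 from position('/' in col2)) AS reusing_computation
FROM trimmed;
```

The lateral join keeps everything in one query level, while the CTE form may read more naturally when several later expressions depend on the same intermediate value.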

Related

Return partitioned row_num()=1 in same cte

Is there a shorter way to filter a CTE on row_num = 1 than an external WHERE clause? I vaguely recall doing this in Teradata with a 'qualify' clause. Is there a way with less code that I can use in Postgres?
with
first_touch as (
select
l.session_id as last_session,
l.client_id,
f.session_id as first_session,
row_number() over(partition by f.client_id order by f.date asc) as rn
from ga_marketing.sessions l
join ga_marketing.sessions f on l.client_id = f.client_id
where l.date between '2021-06-01' and '2021-06-11'
)
select *
from first_touch
where rn = 1
I would rather somehow filter within the cte for rn=1 rather than outside. Is this possible? Is there a shorter way to write what I want?
You need at least two levels of SELECT to do that with a window function. You could do both levels in the CTE if you wanted and then have a third dummy SELECT outside the CTE, but I don't see what the point of that would be, other than to make the dummy SELECT appear cleaner (no WHERE clause, no column "rn"). That part would get shorter, but the CTE would get longer. Or you could do away with the CTE altogether and write nested queries directly, which I guess would be "shorter" in the number of keystrokes used. Or you could encapsulate different fragments in a view, to hide some of the levels from sight.
I think you could also write this using DISTINCT ON or JOIN LATERAL, rather than using a window function, but that doesn't seem to be what you are asking.
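As a rough sketch of the DISTINCT ON variant mentioned above, assuming the same ga_marketing.sessions table and columns as in the question: DISTINCT ON keeps the first row of each client_id group in ORDER BY order, so no rn column or outer filter is needed.

```sql
-- One row per client: the row with the earliest f.date.
-- The DISTINCT ON expression must match the leading ORDER BY expression.
select distinct on (f.client_id)
    l.session_id as last_session,
    l.client_id,
    f.session_id as first_session
from ga_marketing.sessions l
join ga_marketing.sessions f on l.client_id = f.client_id
where l.date between '2021-06-01' and '2021-06-11'
order by f.client_id, f.date asc;
```

As with row_number(), which row wins among ties on f.date is not determined unless more ORDER BY columns are added.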

Is it a function or a table in this lateral join case?

It is often particularly handy to LEFT JOIN to a LATERAL subquery, so that source rows will appear in the result even if the LATERAL subquery produces no rows for them. For example, if get_product_names() returns the names of products made by a manufacturer, but some manufacturers in our table currently produce no products, we could find out which ones those are like this:
SELECT m.name
FROM manufacturers m LEFT JOIN LATERAL get_product_names(m.id) pname ON true
WHERE pname IS NULL;
The passage above is taken from the PostgreSQL manual.
I think I finally understand what LATERAL means. In this case, however, I am not sure whether get_product_names is a table or a function. My understanding is one of the following:
A: get_product_names(m.id) is a function that takes m.id as an input parameter and returns a table. The returned table is aliased as pname. Overall, table m is joined to a table that is null (because of the WHERE condition).
B: get_product_names is a table, and table m is left joined to table get_product_names on m.id. pname is an alias for get_product_names. Overall, table m is joined to a table that is null (because of the WHERE condition).
get_product_names is a table function (also known as set returning function or SRF in PostgreSQL slang). Such a function does not necessarily return a single result row, but arbitrarily many rows.
Since the result of such a function is a table, you typically use it in SQL statements where you would use a table, that is, in the FROM clause.
A simple example is
SELECT * FROM generate_series(1, 5);
generate_series
-----------------
1
2
3
4
5
(5 rows)
You can also use normal functions in this way; they are then treated as table functions that return exactly one row.
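To make both points concrete, here is a sketch of how a function like get_product_names could be defined. The manual does not show its definition, so the products table, its columns, and the parameter name p_manufacturer_id are assumptions made for illustration. The last statement shows an ordinary scalar function used in FROM.

```sql
-- Hypothetical schema; only manufacturers(id, name) is implied by the manual's query.
CREATE TABLE manufacturers (id int PRIMARY KEY, name text);
CREATE TABLE products (manufacturer_id int REFERENCES manufacturers (id), name text);

-- A set-returning function: each call yields zero or more rows.
CREATE FUNCTION get_product_names(p_manufacturer_id int)
RETURNS SETOF text
LANGUAGE sql STABLE
AS $$
    SELECT p.name
    FROM products p
    WHERE p.manufacturer_id = p_manufacturer_id;
$$;

-- An ordinary scalar function in FROM behaves like a one-row table.
SELECT * FROM upper('abc') AS t(val);
```

With a definition along these lines, interpretation A in the question is the right one: the function is called once per row of manufacturers, and the LEFT JOIN keeps manufacturers for which it returns no rows.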

PostgreSQL CTE records as parameters to function

I have a function that accepts two integers as parameters my_function(input_a, input_b). Is there an easy way to pass the results of a CTE (that returns records of input_a, input_b) into the function?
Should I be looking into writing a custom function with a for loop or is there a better approach?
If the function returns a single record then:
WITH cte AS (SELECT 1 a, 2 b)
SELECT my_function(a, b) FROM cte;
will work. However, if the function is an SRF (set-returning function), then you need a LATERAL join to let the database know that you want to feed the rows of the earlier tables in the FROM clause into the function that appears later. (For a function call in FROM the LATERAL keyword itself is optional, but writing it makes the intent explicit.) This is accomplished like so:
WITH cte AS (SELECT 1 a, 2 b)
SELECT * FROM cte, LATERAL my_function(a, b);
The LATERAL will cause PostgreSQL to take each row from the CTE and run "my_function" with the values from that row, returning the results of that function to the overall SELECT statement.
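A self-contained sketch of the same pattern, using the built-in generate_series(start, stop) as a stand-in for my_function since its definition is not shown; the CTE values are made up for illustration.

```sql
-- Each CTE row is fed to the set-returning function;
-- the function's rows are joined back to the row that produced them.
WITH cte(a, b) AS (VALUES (1, 3), (10, 11))
SELECT cte.a, cte.b, s.n
FROM cte
CROSS JOIN LATERAL generate_series(cte.a, cte.b) AS s(n);
```

This returns three rows for the first CTE row (n = 1, 2, 3) and two rows for the second (n = 10, 11).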

Retrieval of columns from functions that returns table (or setof record)

I keep running into variations of this problem and never remember the workaround, only that "oh, it was so simple, but how?"... Perhaps there are some patterns, and a best way to work with each pattern. Let's look at the main one, exemplified by unnest() and ts_stat().
First, good examples, no problems, because unnest() returns only one column:
SELECT * FROM unnest(array[1,2,3]) t(id); -- OK, the int column is there
SELECT unnest(array[1,2,3]) AS id; -- OK, a single int column
WITH t AS (SELECT unnest(array[1,2,3]) as id)
SELECT id, unnest(array[4,id]) as x
FROM t; -- more complex, but ok!
Now a function that returns a defined SETOF RECORD:
SELECT * FROM ts_stat('SELECT kx FROM terms where id=2') -- GOOD
-- shows all word|ndoc|nentry columns
SELECT ts_stat('SELECT kx FROM terms where id=2') as x -- BAD
-- the columns are lost, only a single "x" column is shown... but it works
-- NOTE: the same applies to any similar function, such as json_each(), etc.
See the GOOD/BAD comments above. So this is the problem: a SETOF RECORD with more than one column. In the simplest case (unnest above), the solution is to use the function on the "FROM side", as a table; but when the RECORD has multiple fields, the problem arises.
--MAIN EXAMPLE FOR THE DISCUSSION:
WITH t AS (SELECT unnest(array[1,2,3]) as id)
SELECT id, ts_stat('SELECT kx FROM terms where id='||id) as x
FROM t; -- BAD, but works...
Now, in this main example, it is not possible to use ts_stat() on the "FROM side". So, to characterize the pattern: a function that returns a TABLE or a SETOF RECORD, in a query where we need the columns, but where the function can't go on the "FROM side".
QUESTION: What is the generic (and most elegant) solution to this pattern? What syntax pattern shows the columns?
NOTE: another problem is that, if you don't remember the exact syntax of the solution, you try things that don't work. In this case, an error:
WITH t AS (SELECT unnest(array[1,2,3]) as id)
SELECT id, x.word, x.ndoc, x.nentry
FROM (
SELECT t.id,
ts_stat('SELECT kx FROM terms where id='||id) as x
FROM t
) s;
SQL PARSER ERROR (PostgreSQL 9.5): no table "x" in the FROM clause.
You should never use a set-returning-function (SRF) in a SELECT list. The main example should be written with an implicit LATERAL JOIN:
SELECT v.id, x.*
FROM (VALUES (1),(2),(3)) v(id)
JOIN ts_stat('SELECT kx FROM terms where id=' || v.id) x ON true;
The lateral join is implicit here because an SRF can refer to columns from relations specified before it in the FROM clause without using the keyword LATERAL. In the example above the SRF ts_stat() makes a lateral reference to the column id of relation v. You can also do this with e.g. sub-queries, but then you have to explicitly use the keyword LATERAL.
Note that while you can use an SRF in a select list, its use is discouraged. You provide the example of unnest(anyarray), which is interesting because there is also the overloaded variant unnest(anyarray, ...) (i.e. unnest multiple arrays in one call), which will throw an error when used in a select list; it can only be used as a row source. The reason why you should not use SRFs in a select list is that there is no obvious solution when using multiple SRFs each producing a different number of rows.
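A small sketch of the multi-array unnest form used as a row source; the arrays and the alias t(n, s) are made up for illustration. It shows how the FROM-clause form gives a well-defined answer when the inputs have different lengths.

```sql
-- Multi-array unnest is only allowed in FROM.
-- The arrays are expanded side by side; shorter ones are padded with NULLs.
SELECT *
FROM unnest(ARRAY[1, 2, 3], ARRAY['a', 'b']) AS t(n, s);
```

This yields three rows: (1, a), (2, b) and (3, NULL).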

How to specify two expressions in the select list when the subquery is not introduced with EXISTS

I have a query that uses a subquery and I am having a problem returning the expected results. The error I receive is..."Only one expression can be specified in the select list when the subquery is not introduced with EXISTS." How can I rewrite this to work?
SELECT
a.Part,
b.Location,
b.LeadTime
FROM
dbo.Parts a
LEFT OUTER JOIN dbo.Vendor b ON b.Part = a.Part
WHERE
b.Location IN ('A','B','C')
AND
Date IN (SELECT Location, MAX(Date) FROM dbo.Vendor GROUP BY Location)
GROUP BY
a.Part,
b.Location,
b.LeadTime
ORDER BY
a.Part
I think something like this may be what you're looking for. You didn't say what version of SQL Server--this works in SQL 2005 and up:
SELECT
p.Part,
p.Location, -- from *p*, otherwise if no match we'll get a NULL
v.LeadTime
FROM
dbo.Parts p
OUTER APPLY (
SELECT TOP (1) * -- * here is okay because we specify columns outside
FROM dbo.Vendor v
WHERE p.Location = v.Location -- the correlation part
ORDER BY v.Date DESC
) v
WHERE
p.Location IN ('A','B','C')
ORDER BY
p.Part
;
Now, your query can be repaired as is by adding the "correlation" part to change your query into a correlated subquery as demonstrated in Kory's answer (you'd also remove the GROUP BY clause). However, that method still requires an additional and unnecessary join, hurting performance, plus it can only pull one column at a time. This method allows you to pull all the columns from the other table, and has no extra join.
Note: this gives logically the same results as Lamak's answer, however I prefer it for a few reasons:
When there is an index on the correlation columns (Location, here) this can be satisfied with seeks, but the Row_Number solution has to scan (I believe).
I prefer the way this expresses the intent of the query more directly and succinctly. In the Row_Number method, one must get out to the outer condition to see that we are only grabbing the rn = 1 values, then bop back into the CTE to see what that is.
Using CROSS APPLY or OUTER APPLY, all the other tables not involved in the single-inner-row-per-outer-row selection are outside where (to me) they belong. We aren't squishing concerns together. Using Row_Number feels a bit like throwing a DISTINCT on a query to fix duplication rather than dealing with the underlying issue. I guess this is basically the same issue as the previous point worded in a different way.
The moment you have TWO tables from which you wish to pull the most recent value, the Row_Number() solution blows up completely. With this syntax, you just easily add another APPLY clause, and it's crystal clear what you're doing. There is a way to use Row_Number for the multiple tables scenario by moving the other tables outside, but I still don't prefer that syntax.
Using this syntax allows you to perform additional joins based on whether the selected row exists or not (in the case that no matching row was found). In the Row_Number solution, you can only reasonably do that NOT NULL checking in the outer query--so you are forced to split up the query into multiple, separated parts (you don't want to be joining to values you will be discarding!).
P.S. I strongly encourage you to use aliases that hint at the table they represent. Please don't use a and b. I used p for Parts and v for Vendor--this helps you and others make sense of the query more quickly in the future.
If I understood you correctly, you want the rows with the max date for locations A, B and C. Now, assuming SQL Server 2005+, you can do this:
;WITH CTE AS
(
SELECT
a.Part,
b.Location,
b.LeadTime,
RN = ROW_NUMBER() OVER(PARTITION BY a.Part ORDER BY [Date] DESC)
FROM
dbo.Parts a
LEFT OUTER JOIN dbo.Vendor b ON b.Part = a.Part
WHERE
b.Location IN ('A','B','C')
)
SELECT Part,
Location,
LeadTime
FROM CTE
WHERE RN = 1
ORDER BY Part
In your subquery you need to correlate the Location and Part to the outer query.
Example:
Date = (SELECT MAX(Date)
FROM dbo.Vendor v
WHERE v.Location = b.Location
AND v.Part = b.Part
)
So this will bring back one date for each location and part.
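Put together with the original query, the correlated-subquery repair might look like the following sketch. It assumes, as the fragment above does, that Date, Location and Part are all columns of dbo.Vendor, and it drops the GROUP BY, which is no longer needed.

```sql
-- T-SQL sketch: keep only the vendor row with the latest Date
-- for each (Location, Part) combination.
SELECT
    a.Part,
    b.Location,
    b.LeadTime
FROM dbo.Parts a
LEFT OUTER JOIN dbo.Vendor b ON b.Part = a.Part
WHERE b.Location IN ('A', 'B', 'C')
  AND b.[Date] = (SELECT MAX(v.[Date])
                  FROM dbo.Vendor v
                  WHERE v.Location = b.Location
                    AND v.Part = b.Part)
ORDER BY a.Part;
```

Because the WHERE clause filters on columns of b, the LEFT OUTER JOIN effectively behaves like an inner join here, just as it does in the original query.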