How to do a recursive lateral join in PostgreSQL?

I have a tree-like data structure: Object -> package -> package -> .... -> package. I have a query over the table containing the Objects, and I need to check whether the topmost parent has a value set or not.
Using only a recursive CTE (WITH RECURSIVE) just gives me all the packages, and I don't know which one is the topmost parent of my current object. With a lateral join I can run a query per row to check for the value, but I can't find a way to write a query that works like a recursive lateral join.
The output of the query should be a table containing all the values of Object and a value from the topmost package.
Is there a way to do it purely in SQL or do I need to have some intermediate data processing on server side?

It's a bit unclear to me what you are trying to achieve, but:
I need to check if the topmost parent has a value set or not.
This can be done by "carrying" the root information through the recursion. In the root query of the CTE, you can select the attribute you want to make available to every level. In the recursive part you simply take the attribute from the parent.
Something along these lines:
with recursive all_packages as (
select t.*, t.some_column as root_value
from the_table t
where <condition to select the roots>
union all
select t.*, p.root_value
from the_table t
join all_packages p on p.some_id = t.some_parent_id
where p.root_value .... (do something with the root value)
)
select *
from all_packages;
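To make the idea concrete, here is a minimal runnable sketch. It is not the asker's real schema: the package and object tables, their columns and the sample rows are all hypothetical, and it runs on SQLite through Python's built-in sqlite3 module instead of PostgreSQL so that it needs no server. The WITH RECURSIVE part is written the same way in PostgreSQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical schema: packages form a tree, each object sits in one package.
    CREATE TABLE package (id INTEGER PRIMARY KEY, parent_id INTEGER, value TEXT);
    CREATE TABLE object  (id INTEGER PRIMARY KEY, name TEXT, package_id INTEGER);
    INSERT INTO package VALUES (1, NULL, 'root-a'), (2, 1, NULL), (3, 2, NULL),
                               (4, NULL, NULL), (5, 4, NULL);
    INSERT INTO object VALUES (10, 'obj-x', 3), (11, 'obj-y', 5);
""")

rows = conn.execute("""
    WITH RECURSIVE all_packages AS (
        -- Root query: top-level packages expose their own value as root_value.
        SELECT p.id, p.value AS root_value
        FROM package p
        WHERE p.parent_id IS NULL
        UNION ALL
        -- Recursive part: every child inherits root_value from its parent row.
        SELECT c.id, ap.root_value
        FROM package c
        JOIN all_packages ap ON ap.id = c.parent_id
    )
    -- Each object picks up the root value of the tree its package belongs to.
    SELECT o.id, o.name, ap.root_value
    FROM object o
    JOIN all_packages ap ON ap.id = o.package_id
    ORDER BY o.id
""").fetchall()

for row in rows:
    print(row)
```

Because every package row carries the value of its own root, a plain join from the object table is enough, and no per-row lateral query is needed. An object whose topmost package has no value simply gets NULL (None in Python) in root_value.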

Related

Is it possible to create a view for a subquery referring the main query?

I have a query that uses a subquery. Basically it is like this:
SELECT A.name, A.pk,
array_to_string(array(SELECT B.name FROM b WHERE B.relA = A.pk ),', ')
FROM A;
Basically it makes a column in A from a to-many relationship to B. (In my case A is a list of items and B contains tags related to that items. The query makes a column with a list of tags for each row in A.)
Since the real world query is more complex and I need the subquery more than one time, I want to make a view from the subquery (DRY). This is not possible, because A.pk is only known, if the subquery is a subquery in a main query that fetches from A. It is not known if the subquery stands alone. So I cannot create a view from the stand-alone version:
CREATE VIEW bview AS SELECT B.b FROM B WHERE B.relA=A.pk;
gives me the expected:
ERROR: missing FROM-clause entry for table "A"
Is there a way to define a kind of "incomplete view" that is not executed by itself, but only inside a main query that completes the subquery, without using functions?
Edit: The WHERE clause inside the subquery cannot be replaced with a JOIN clause, because it takes the A.pk from the outer query.
You can create a simple view without referring to table A and then use that as a row source in various parts of your complex query:
CREATE VIEW bview AS
SELECT relA, string_agg(name, ', ') AS tags
FROM b
GROUP BY relA;
This may seem inefficient because if you run the view like this without qualification, then all tags for all relA are concatenated. However, when you use the view in a larger query with qualifications, then the view is only evaluated for those relA values that are asked for in the outer query. This is because the view is "merged" with the outer query by the planner.
So you end up with:
SELECT name, pk, tags
FROM A
JOIN bview ON relA = pk;
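A small runnable sketch of this approach, under stated assumptions: the sample rows are invented, and it uses SQLite through Python's sqlite3 module, where group_concat plays the role of PostgreSQL's string_agg. It shows only the shape of the view and the join; it does not demonstrate the PostgreSQL planner behaviour described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (pk INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE b (name TEXT, relA INTEGER);
    INSERT INTO a VALUES (1, 'item-1'), (2, 'item-2'), (3, 'item-3');
    INSERT INTO b VALUES ('red', 1), ('big', 1), ('blue', 2);

    -- The view aggregates per relA and never mentions table a.
    -- group_concat is SQLite's counterpart of string_agg.
    CREATE VIEW bview AS
        SELECT relA, group_concat(name, ', ') AS tags
        FROM b
        GROUP BY relA;
""")

rows = conn.execute("""
    SELECT a.name, a.pk, bview.tags
    FROM a
    JOIN bview ON bview.relA = a.pk
    ORDER BY a.pk
""").fetchall()

for row in rows:
    print(row)
```

Note that item-3, which has no tags, drops out of the result because of the inner join. The original subquery form returns every row of A, so a LEFT JOIN to the view is the closer equivalent if untagged items must be kept.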

Two filters on one column with respect to each other

I would like to filter my data source by itself. In SQL this is just INNER JOINing a table to itself.
For example,
SELECT table1.*
FROM table1 INNER JOIN (SELECT id FROM table1 WHERE variable = 'X' AND value = 1) q1 ON table1.id = q1.id
WHERE table1.variable = 'Y'
As you can see, I want to present only the rows where variable equals 'Y', restricted to ids that have variable = 'X' and value = 1.
I can also write it like this,
SELECT *
FROM table1
WHERE variable = 'Y' AND id IN (SELECT id FROM table1 WHERE variable = 'X' AND value = 1)
I am using a long data file, which means my primary key is 'id' and 'variable' together. So I want all the variable = 'Y' data to be presented only if the 'id' has variable = 'X' AND value = 1. How do I translate this process into a Tableau dashboard?
Any suggestions on how to do it without inner joining the data source to itself? I tried the inner join, but my data is very large, which results in too much processing time and makes all the other processes extremely slow.
First, just point your data source at table1 without any other changes. Plain and simple.
Second, back on a worksheet, select the id field in the data pane and right click to create a set. Choose the "all" radio button on the General tab of the set dialog, and then switch to the Condition tab. Define the set via the formula max(variable = 'X' and value = 1). Call your set something meaningful like ids_having_an_X1. This will create a set of ids that have at least one data row matching your condition. Think of it as a list of ids that could go inside a SQL IN (...) clause, if that helps.
Now you can use your set on the filter shelf to only include those ids in the query, or in calculations, or on other shelves.
To get the effect of your where clause, put variable on the filter shelf, choosing only the value 'Y'.
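For readers who think in SQL, the set condition can be pictured as a grouped test per id. The sketch below is only an analogy with made-up rows, run on SQLite through Python's sqlite3 module; it is not the query Tableau generates. It relies on SQLite evaluating a boolean expression to 0 or 1, so MAX(...) = 1 means that at least one row of that id satisfies the condition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INTEGER, variable TEXT, value INTEGER);
    INSERT INTO table1 VALUES
        (1, 'X', 1), (1, 'Y', 5),   -- id 1 has X = 1, so its Y row qualifies
        (2, 'X', 0), (2, 'Y', 7),   -- id 2 has X = 0, so it is excluded
        (3, 'Y', 9);                -- id 3 has no X row at all
""")

rows = conn.execute("""
    SELECT t.id, t.variable, t.value
    FROM table1 t
    WHERE t.variable = 'Y'            -- the filter on variable
      AND t.id IN (                   -- the set ids_having_an_X1
          SELECT id
          FROM table1
          GROUP BY id
          HAVING MAX(variable = 'X' AND value = 1) = 1
      )
    ORDER BY t.id
""").fetchall()

print(rows)
```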

Updating for each row in a table

I have this query here which returns an error because of too many rows returned:
UPDATE tmp_rsl2 SET comm_percent=( SELECT c2.comm_percent
FROM tmp_rsl2 t1
INNER JOIN gn_salesperson g1 ON t1.sales_person=g1.sales_person
INNER JOIN comm_schema c1 ON g1.comm_schema=c1.comm_schema
INNER JOIN comm_schema_dt c2 ON c1.comm_schema_id=c2.comm_schema_id AND (t1.balance_amount::numeric <= (COALESCE(c2.value_amount,0)) ));
Basically for each row of the comm_percent column, I want to update all of them using the subquery SELECT statement. I imagine using a FOR loop or something but I'd like to hear ideas or to know a proper way to do this.
The TOO_MANY_ROWS error is about assigning to something that can hold only one value, whereas the SELECT query returns more than one row.
Without a reference schema it's difficult to give SQL that is sure to work (which is not to say that the issue lies with the schema), but you need to ensure that the SELECT assigned to comm_percent returns only one row. A very blind attempt at how it might work in your case is given below; again, without knowing the schema it's difficult to gauge whether it will work.
UPDATE tmp_rsl2
SET comm_percent = c2.comm_percent
FROM gn_salesperson g1
INNER JOIN comm_schema c1 ON g1.comm_schema = c1.comm_schema
INNER JOIN comm_schema_dt c2 ON c1.comm_schema_id = c2.comm_schema_id
-- conditions on the updated table go in WHERE: PostgreSQL does not allow
-- the target table to be referenced inside the JOIN ... ON clauses
WHERE tmp_rsl2.sales_person = g1.sales_person
AND tmp_rsl2.balance_amount::NUMERIC <= COALESCE(c2.value_amount, 0);
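One way to guarantee a single value per updated row is a correlated scalar subquery with an explicit tie-break. Note that the subquery in the question scans tmp_rsl2 a second time as t1 instead of referring to the row being updated, which by itself can return many rows. The sketch below is hedged: the column types and sample rows are invented, the rule "take the lowest value_amount tier that still covers the balance" is an assumption about the intended business logic, and it runs on SQLite through Python's sqlite3 module (the UPDATE statement itself uses no SQLite-specific syntax).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical schema and data, guessed from the column names in the question.
    CREATE TABLE tmp_rsl2 (sales_person TEXT, balance_amount REAL, comm_percent REAL);
    CREATE TABLE gn_salesperson (sales_person TEXT, comm_schema TEXT);
    CREATE TABLE comm_schema (comm_schema TEXT, comm_schema_id INTEGER);
    CREATE TABLE comm_schema_dt (comm_schema_id INTEGER, value_amount REAL, comm_percent REAL);
    INSERT INTO tmp_rsl2 VALUES ('ann', 500, NULL), ('bob', 1500, NULL);
    INSERT INTO gn_salesperson VALUES ('ann', 'std'), ('bob', 'std');
    INSERT INTO comm_schema VALUES ('std', 1);
    INSERT INTO comm_schema_dt VALUES (1, 1000, 5.0), (1, 2000, 7.5);
""")

conn.execute("""
    UPDATE tmp_rsl2
    SET comm_percent = (
        SELECT c2.comm_percent
        FROM gn_salesperson g1
        JOIN comm_schema c1 ON g1.comm_schema = c1.comm_schema
        JOIN comm_schema_dt c2 ON c1.comm_schema_id = c2.comm_schema_id
        -- Correlate with the row being updated instead of re-scanning tmp_rsl2.
        WHERE g1.sales_person = tmp_rsl2.sales_person
          AND tmp_rsl2.balance_amount <= COALESCE(c2.value_amount, 0)
        -- Assumed tie-break: the lowest tier that covers the balance wins.
        ORDER BY c2.value_amount
        LIMIT 1
    )
""")

rows = conn.execute(
    "SELECT sales_person, comm_percent FROM tmp_rsl2 ORDER BY sales_person"
).fetchall()
print(rows)
```

Here ann's balance of 500 matches both tiers, which is exactly the situation that raises the "too many rows" error; the ORDER BY ... LIMIT 1 makes the choice explicit instead of leaving it to chance.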
UPDATE
As per below comments, have given an unrelated SQLFiddle example that should give an idea of how to perform an UPDATE of all rows of a table looking up corresponding values from another table.

Does inner join affect order by?

I have a function a() which gives result in a specific order.
I want to do:
select final.*,tablex.name
from a() as final
inner join tablex on (final.key=tablex.key2)
My question is: can I guarantee that the join won't affect the order of rows as a() set it?
a() is:
select ....
from....
joins...
order by x,y,z
The short version:
The order of rows returned by a SQL query is not guaranteed in any way unless you use an order by.
Any order you see without an order by is pure coincidence and cannot be relied upon.
So how did I always get the correct order so far, when I did select * from a()?
If your function is a SQL function, then the query inside the function is executed "as is" (it's essentially "inlined") so you only run a single query that does have an order by. If it's a PL/pgSQL function and the only thing it does is a RETURN QUERY ... then you again only have a single query that is executed which does have an order by.
Assuming you do use a SQL function, then running:
select final.*,tablex.name
from a() as final
join tablex on final.key=tablex.key2
is equivalent to:
select final.*,tablex.name
from (
-- this is your query inside the function
select ...
from ...
join ...
order by x,y,z
) as final
join tablex on final.key=tablex.key2;
In this case the order by inside the derived table doesn't make sense, as it might be "overruled" by an overall order by. In fact some databases would outright reject this query (and I sometimes wish Postgres would as well).
Without an order by on the overall query, the database is free to choose any order of rows that it wants.
So to get back to the initial question:
can I guarantee that the join won't affect the order of rows as a() set it?
The answer to that is a clear: NO - the order of the rows for that query is in no way guaranteed. If you need an order that you can rely on, you have to specify an order by.
I would even go so far as to remove the order by from the function. What if someone runs select * from a() order by z,y,x? I don't think Postgres will be smart enough to remove the order by inside the function.
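A minimal sketch of the recommended shape, with the ordering stated on the overall query. Everything here is illustrative: a_result is a hypothetical table standing in for the rows returned by a(), the sample data is invented, and the sketch runs on SQLite through Python's sqlite3 module. It assumes a() exposes the sort columns x, y and z so that the outer query can order by them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- a_result stands in for the rows returned by the function a().
    CREATE TABLE a_result (key INTEGER, x INTEGER, y INTEGER, z INTEGER);
    CREATE TABLE tablex (key2 INTEGER, name TEXT);
    INSERT INTO a_result VALUES (1, 2, 0, 0), (2, 1, 5, 0), (3, 1, 2, 9);
    INSERT INTO tablex VALUES (3, 'c'), (1, 'a'), (2, 'b');
""")

rows = conn.execute("""
    SELECT final.key, tablex.name
    FROM a_result AS final
    JOIN tablex ON final.key = tablex.key2
    -- The ordering is requested on the overall query, after the join,
    -- so it no longer depends on how the join happens to be executed.
    ORDER BY final.x, final.y, final.z
""").fetchall()

print(rows)
```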

How to specify two expressions in the select list when the subquery is not introduced with EXISTS

I have a query that uses a subquery and I am having a problem returning the expected results. The error I receive is..."Only one expression can be specified in the select list when the subquery is not introduced with EXISTS." How can I rewrite this to work?
SELECT
a.Part,
b.Location,
b.LeadTime
FROM
dbo.Parts a
LEFT OUTER JOIN dbo.Vendor b ON b.Part = a.Part
WHERE
b.Location IN ('A','B','C')
AND
Date IN (SELECT Location, MAX(Date) FROM dbo.Vendor GROUP BY Location)
GROUP BY
a.Part,
b.Location,
b.LeadTime
ORDER BY
a.Part
I think something like this may be what you're looking for. You didn't say what version of SQL Server--this works in SQL 2005 and up:
SELECT
p.Part,
p.Location, -- from *p*, otherwise if no match we'll get a NULL
v.LeadTime
FROM
dbo.Parts p
OUTER APPLY (
SELECT TOP (1) * -- * here is okay because we specify columns outside
FROM dbo.Vendor v
WHERE p.Location = v.Location -- the correlation part
ORDER BY v.Date DESC
) v
WHERE
p.Location IN ('A','B','C')
ORDER BY
p.Part
;
Now, your query can be repaired as is by adding the "correlation" part to change your query into a correlated subquery as demonstrated in Kory's answer (you'd also remove the GROUP BY clause). However, that method still requires an additional and unnecessary join, hurting performance, plus it can only pull one column at a time. This method allows you to pull all the columns from the other table, and has no extra join.
Note: this gives logically the same results as Lamak's answer, however I prefer it for a few reasons:
When there is an index on the correlation columns (Location, here) this can be satisfied with seeks, but the Row_Number solution has to scan (I believe).
I prefer the way this expresses the intent of the query more directly and succinctly. In the Row_Number method, one must get out to the outer condition to see that we are only grabbing the rn = 1 values, then bop back into the CTE to see what that is.
Using CROSS APPLY or OUTER APPLY, all the other tables not involved in the single-inner-row-per-outer-row selection are outside where (to me) they belong. We aren't squishing concerns together. Using Row_Number feels a bit like throwing a DISTINCT on a query to fix duplication rather than dealing with the underlying issue. I guess this is basically the same issue as the previous point worded in a different way.
The moment you have TWO tables from which you wish to pull the most recent value, the Row_Number() solution blows up completely. With this syntax, you just easily add another APPLY clause, and it's crystal clear what you're doing. There is a way to use Row_Number for the multiple tables scenario by moving the other tables outside, but I still don't prefer that syntax.
Using this syntax allows you to perform additional joins based on whether the selected row exists or not (in the case that no matching row was found). In the Row_Number solution, you can only reasonably do that NOT NULL checking in the outer query--so you are forced to split up the query into multiple, separated parts (you don't want to be joining to values you will be discarding!).
P.S. I strongly encourage you to use aliases that hint at the table they represent. Please don't use a and b. I used p for Parts and v for Vendor--this helps you and others make sense of the query more quickly in the future.
If I understood you correctly, you want the rows with the max date for locations A, B and C. Now, assuming SQL Server 2005+, you can do this:
;WITH CTE AS
(
SELECT
a.Part,
b.Location,
b.LeadTime,
RN = ROW_NUMBER() OVER(PARTITION BY a.Part ORDER BY [Date] DESC)
FROM
dbo.Parts a
LEFT OUTER JOIN dbo.Vendor b ON b.Part = a.Part
WHERE
b.Location IN ('A','B','C')
)
SELECT Part,
Location,
LeadTime
FROM CTE
WHERE RN = 1
ORDER BY Part
In your subquery you need to correlate the Location and Part to the outer query.
Example:
Date = (SELECT MAX(Date)
FROM dbo.Vendor v
WHERE v.Location = b.Location
AND v.Part = b.Part
)
This brings back one date for each combination of location and part.
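A runnable sketch of the correlated-subquery fix, under assumptions: the sample rows are invented, the join to Parts is left out to keep it short, dates are stored as ISO text so that MAX compares them correctly, and it runs on SQLite through Python's sqlite3 module instead of SQL Server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Vendor (Part TEXT, Location TEXT, LeadTime INTEGER, Date TEXT);
    INSERT INTO Vendor VALUES
        ('bolt', 'A', 10, '2020-01-01'),
        ('bolt', 'A',  7, '2021-06-01'),   -- latest row for bolt at A
        ('bolt', 'B',  4, '2019-03-15'),
        ('nut',  'D',  3, '2022-02-02');   -- location D is filtered out
""")

rows = conn.execute("""
    SELECT b.Part, b.Location, b.LeadTime
    FROM Vendor b
    WHERE b.Location IN ('A', 'B', 'C')
      -- The subquery returns exactly one value per outer row because it is
      -- correlated on both Location and Part.
      AND b.Date = (SELECT MAX(v.Date)
                    FROM Vendor v
                    WHERE v.Location = b.Location
                      AND v.Part = b.Part)
    ORDER BY b.Part, b.Location
""").fetchall()

print(rows)
```

If two rows share the same maximum date for a location and part, both are returned; the TOP (1) and ROW_NUMBER() variants above pick a single row instead.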