How to access fields after cross join of the same table in PySpark - pyspark

I ran a cross join on a table and it worked fine. Now the problem is I don't know how to address the same field from the resulting dataframe.
df = spark.sql("select p1.id, p2.id from profile p1 CROSS JOIN profile p2 WHERE p1.id < p2.id")
When I printed out the first row, I got something like this:
Row(id=21398968, id=76109821)
Running "print(res_2[0]['id'])" yields only the first one as a scalar value (not a list)

You can change your query to be:
df = spark.sql("select p1.id AS p1_id, p2.id AS p2_id from profile p1 CROSS JOIN profile p2 WHERE p1.id < p2.id")
By using AS you should be able to avoid the name conflict.
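As a quick check of the aliasing idea without a Spark cluster, here is a minimal sketch that runs the same query shape against an in-memory SQLite table. SQLite is only a stand-in here (an assumption for illustration), and the three sample ids are made up.

```python
import sqlite3

# Stand-in for the Spark table: an in-memory SQLite table named profile.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE profile (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO profile (id) VALUES (?)", [(1,), (2,), (3,)])

# Both sides of the self join get an alias, and each selected column
# gets its own output name via AS.
cur = conn.execute(
    "SELECT p1.id AS p1_id, p2.id AS p2_id "
    "FROM profile p1 CROSS JOIN profile p2 "
    "WHERE p1.id < p2.id "
    "ORDER BY p1_id, p2_id"
)
column_names = [d[0] for d in cur.description]
pairs = cur.fetchall()

print(column_names)  # the two output columns now have distinct names
print(pairs)         # every unordered pair of ids appears once
```

With the aliased query in PySpark, the two values in a row should then be reachable as row['p1_id'] and row['p2_id'].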

Related

Is it possible to do a "LIMIT 1" on a left join in Postgres?

I have two tables: one for money and attributes surrounding it (e.g. who earnt it) and a child table for the "ledger" - this contains one or more entries that represent the history of money that has moved.
SELECT SUM(pl.achieved)
FROM payout p
LEFT JOIN payout_ledgers pl ON pl.payout_id = p.id
This query works well when there is only one ledger item, but when more are added the SUM will increase. I want to join only the latest row. So hypothetically:
SELECT SUM(pl.achieved)
FROM payout p
LEFT JOIN payout_ledgers pl ON pl.payout_id = p.id ORDER BY pl.ts DESC LIMIT 1
WHERE ...
ORDER BY ...
LIMIT ...
(which sadly doesn't work)
What I have tried:
Using a subquery works, but is painfully slow given the size of the data set (and other omitted properties and where clauses etc.):
SELECT SUM(pl.achieved)
FROM payout p
LEFT JOIN payout_ledgers pl ON pl.payout_id = p.id AND pl.id = (SELECT id FROM payout_ledgers WHERE payout_id = p.id ORDER BY ts DESC LIMIT 1)
Incidentally, I'm unsure why this subquery is so slow (~12 seconds, as opposed to 150ms with no subquery). I would have expected it to be quicker given that we're only selecting based on the foreign key (payout_id).
Another thing I tried was to do a select from the join, my logic being that if we select from the small joined dataset instead of the whole table it would be quicker. However, I was met with a relation "pl" does not exist error:
SELECT SUM(pl.achieved)
FROM payout p
LEFT JOIN payout_ledgers pl ON pl.payout_id = p.id
WHERE pl.id = (SELECT id FROM pl ORDER BY ts DESC LIMIT 1)
Thank you in advance for any suggestions. I am also open to suggestions for schema changes that could make this type of logic easier, although my preference would be to try and get the query working since the schema is not easy to change on our production environment.
If you're on Postgres 9.3+, you can use a LEFT JOIN LATERAL:
SELECT SUM(sub.achieved)
FROM payout p
LEFT JOIN LATERAL (SELECT achieved
FROM payout_ledgers pl
WHERE pl.payout_id = p.id
ORDER BY pl.ts DESC LIMIT 1) sub ON true
This will return the sum of the "achieved" field in the most recent entry in payout_ledgers for all payouts.
Alternatively, use window functions:
-- using row_number()
SELECT SUM(sss.achieved)
FROM (SELECT pl.achieved
, row_number() OVER (PARTITION BY pl.payout_id ORDER BY pl.ts DESC) AS rn
FROM payout p
JOIN payout_ledgers pl ON pl.payout_id = p.id
) sss
WHERE sss.rn = 1
;
-- using last_value(); the explicit frame makes it see the whole partition,
-- and DISTINCT keeps one row per payout
SELECT SUM(sss.achieved)
FROM (SELECT DISTINCT pl.payout_id
, last_value(pl.achieved) OVER (PARTITION BY pl.payout_id ORDER BY pl.ts ASC
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS achieved
FROM payout p
JOIN payout_ledgers pl ON pl.payout_id = p.id
) sss
;
BTW: you do not need the LEFT JOIN here. Payouts without ledger rows only contribute NULLs, and SUM ignores NULLs, so a plain JOIN gives the same total.
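To see the difference between the plain join and the row_number() variant, here is a small sketch using Python's sqlite3 as a stand-in for Postgres. It assumes an SQLite build with window function support (3.25 or later); the sample rows and the integer ts column are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE payout (id INTEGER PRIMARY KEY);
CREATE TABLE payout_ledgers (
    id INTEGER PRIMARY KEY,
    payout_id INTEGER REFERENCES payout(id),
    ts INTEGER,
    achieved INTEGER
);
INSERT INTO payout (id) VALUES (1), (2), (3);
-- payout 1 has two ledger rows, payout 2 has one, payout 3 has none
INSERT INTO payout_ledgers (payout_id, ts, achieved) VALUES
    (1, 10, 100), (1, 20, 150),
    (2, 10, 40);
""")

# Plain join: every ledger row is counted, so older entries inflate the sum.
naive_total = conn.execute("""
    SELECT SUM(pl.achieved)
    FROM payout p
    JOIN payout_ledgers pl ON pl.payout_id = p.id
""").fetchone()[0]

# row_number(): rank ledger rows per payout, newest first, keep rank 1 only.
latest_total = conn.execute("""
    SELECT SUM(sss.achieved)
    FROM (SELECT pl.achieved,
                 row_number() OVER (PARTITION BY pl.payout_id
                                    ORDER BY pl.ts DESC) AS rn
          FROM payout p
          JOIN payout_ledgers pl ON pl.payout_id = p.id) sss
    WHERE sss.rn = 1
""").fetchone()[0]

print(naive_total, latest_total)
```

Here the plain join adds up all three ledger rows, while the ranked version only adds the newest row of each payout.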

COALESCE TSQL with a join tsql

I have a requirement to pick up data that can live in more than one place, and I have some familiarity with the COALESCE function. Basically I am looking to coalesce the join itself, but from what I can find online it seems I can only do this on fields.
So we have a Products and a Suppliers table, and each also has a temp counterpart, so 4 tables in total (products, tempproducts, suppliers, tempsuppliers). The suppliers and products tables store our existing products and suppliers, and the temp tables store any new suppliers/products. We also have a tempsupplierproduct table which joins new suppliers to new products. However, we can end up in a situation where a new supplier has an existing product: the new supplier will be in the tempsuppliers table and its product will be in the products table, NOT in tempproducts, since it is not new. We will still have a new tempsupplierproduct row to join the two up.
So I want a query which looks in the tempsupplierproducts table and then gets basic information about the supplier and products. To do this I am using a coalesce.
SELECT DISTINCT SP.*, COALESCE(P.Product, PD.Product) 'Product', COALESCE(S.Supplier, SU.Supplier) 'Supplier'
FROM tempsupplierproduct SP
LEFT JOIN tempProduct P ON SP.ProductCode = P.Code
LEFT JOIN Products PD ON SP.ProductCode = PD.Code
LEFT JOIN tempSupplier S ON SP.SupplierCode = S.Code
LEFT JOIN Suppliers SU ON SP.SupplierCode = SU.Code
Now while this works, something at the back of my head tells me it is not entirely right. Ideally I want: if the data is not in table A, then join to table B. I have seen examples of coalescing inside the join itself, but I am unsure how to do this. Something like
LEFT JOIN Suppliers Su ON SP.SupplierCode = COALESCE(S.Code, SU.Code)
may be a way, but I am confused by it: all it is saying is use the code in the temp table and, if it is not there, use the supplier code. So what would this mean if we have a code in the temp table? Will it try to join on that? If so, this is incorrect too.
Any help is appreciated
You can union the two suppliers tables together and then join them in one go like this. I'm assuming that there are no duplicates between the two tables in this case but with a bit of extra work that could be resolved as well.
WITH AllSuppliers AS
(
SELECT Code, Supplier FROM Suppliers
UNION ALL
SELECT Code, Supplier FROM tempSupplier
)
SELECT DISTINCT SP.*, COALESCE(P.Product, PD.Product) 'Product', S.Supplier
FROM tempsupplierproduct SP
LEFT JOIN tempProduct P ON SP.ProductCode = P.Code
LEFT JOIN Products PD ON SP.ProductCode = PD.Code
LEFT JOIN AllSuppliers S ON SP.SupplierCode = S.Code
If you need to handle duplicates in the two suppliers tables then an approach like this should work, essentially we rank the duplicates and then pick the highest ranked result. For two tables you could use a full outer join between the two but this approach will scale to any number of tables.
WITH AllSuppliers AS
(
SELECT Code, Supplier, 1 AS TablePriority FROM Suppliers
UNION ALL
SELECT Code, Supplier, 2 AS TablePriority FROM tempSupplier
),
SuppliersRanked AS
(
SELECT Code, Supplier,
ROW_NUMBER() OVER (PARTITION BY Code ORDER BY TablePriority) AS RowPriority
FROM AllSuppliers
)
SELECT DISTINCT SP.*, COALESCE(P.Product, PD.Product) 'Product', S.Supplier
FROM tempsupplierproduct SP
LEFT JOIN tempProduct P ON SP.ProductCode = P.Code
LEFT JOIN Products PD ON SP.ProductCode = PD.Code
LEFT JOIN SuppliersRanked S ON SP.SupplierCode = S.Code
AND RowPriority = 1
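The ranked-union approach above can be tried out with a small sketch like this one. It uses Python's sqlite3 as a stand-in for SQL Server (assuming an SQLite build with window functions, 3.25 or later), and the supplier rows are invented; 'S2' deliberately exists in both tables to exercise the duplicate handling.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Suppliers (Code TEXT, Supplier TEXT);
CREATE TABLE tempSupplier (Code TEXT, Supplier TEXT);
CREATE TABLE tempsupplierproduct (SupplierCode TEXT, ProductCode TEXT);
INSERT INTO Suppliers VALUES ('S1', 'Acme'), ('S2', 'Globex');
INSERT INTO tempSupplier VALUES ('S2', 'Globex (pending)'), ('S3', 'Initech');
INSERT INTO tempsupplierproduct VALUES ('S1', 'P1'), ('S2', 'P1'), ('S3', 'P2');
""")

rows = conn.execute("""
WITH AllSuppliers AS (
    -- lower TablePriority wins when a code exists in both tables
    SELECT Code, Supplier, 1 AS TablePriority FROM Suppliers
    UNION ALL
    SELECT Code, Supplier, 2 AS TablePriority FROM tempSupplier
),
SuppliersRanked AS (
    SELECT Code, Supplier,
           ROW_NUMBER() OVER (PARTITION BY Code
                              ORDER BY TablePriority) AS RowPriority
    FROM AllSuppliers
)
SELECT SP.SupplierCode, SP.ProductCode, S.Supplier
FROM tempsupplierproduct SP
LEFT JOIN SuppliersRanked S
       ON SP.SupplierCode = S.Code AND S.RowPriority = 1
ORDER BY SP.SupplierCode
""").fetchall()

for row in rows:
    print(row)
```

Each tempsupplierproduct row comes back exactly once, and for the duplicated code the name from Suppliers is the one that is kept.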
You can absolutely join on a coalesced field. Here is a snippet from one of my production views:
LEFT JOIN [Portal].tblHelpdeskresource supplier ON PO.fld_str_SupplierID = supplier.fld_str_SupplierID
-- Job type a
LEFT JOIN [Portal].tblHelpDeskFault HDF ON PO.fld_int_HelpdeskFaultID = HDF.fld_int_ID
-- Job Type b
LEFT JOIN [Portal].tblProjectHeader PH ON PO.fld_int_ProjectHeaderID = PH.fld_int_ID
LEFT JOIN [Portal].tblPPMScheduleLine PSL ON PH.fld_int_PPMScheduleRef = PSL.fld_int_ID
-- Managers (used to be separate for a & b type, now converged)
LEFT JOIN [Portal].uvw_HelpDeskSiteManagers PSM ON COALESCE(PSL.fld_int_StoreID,HDF.fld_int_StoreID) = PSM.PortalSiteId
LEFT JOIN [Portal].tblHelpdeskResource PHDR ON PSM.PortalResourceId = PHDR.fld_int_ID
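A stripped-down version of that pattern, as a sketch: the jobs and site_managers tables below are hypothetical stand-ins for the production tables in the snippet, and Python's sqlite3 replaces SQL Server. Each job carries a store id from one of two sources, and the join condition coalesces them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- hypothetical tables: a job gets its store id from a project or from a fault
CREATE TABLE jobs (
    id INTEGER PRIMARY KEY,
    project_store_id INTEGER,
    fault_store_id INTEGER
);
CREATE TABLE site_managers (site_id INTEGER PRIMARY KEY, manager TEXT);
INSERT INTO jobs VALUES (1, 10, NULL), (2, NULL, 20), (3, NULL, NULL);
INSERT INTO site_managers VALUES (10, 'Ana'), (20, 'Ben');
""")

rows = conn.execute("""
SELECT j.id, m.manager
FROM jobs j
LEFT JOIN site_managers m
       -- use the project store id when present, else the fault store id
       ON COALESCE(j.project_store_id, j.fault_store_id) = m.site_id
ORDER BY j.id
""").fetchall()

print(rows)
```

Job 3 has neither store id, so COALESCE yields NULL, nothing matches, and the LEFT JOIN leaves its manager as NULL.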

TSQL -- Where Statements on Multiple columns in Update

My basic question has to do with updating multiple columns at once from specified values in my query. The reason I want to do this is that I am updating my values from a ginormous table, so I only want to query it once in order to reduce run time. Here is an example select statement that returns the value I want for just one of the columns I need to update:
select a.Value
from Table1
left outer join
(
select ID, FilterCol1, FilterCol2, Value
from Table2
) a on a.ID = Table1.ID
where {Condition1a on FilterCol1}
and {Condition2a on FilterCol2}
In order to update multiple columns at once I would like to be able do something like this (but it returns NULL):
Update T1
set T1Value1 = (select a.Value where {Condition1a on FilterCol1}
and {Condition2a on FilterCol2})
,T1Value2 = (select a.Value where {Condition1b on FilterCol1}
and {Condition2b on FilterCol2})
from Table1 T1
left outer join
(
select ID, FilterCol1, FilterCol2, Value
from Table2
) a on a.ID = Table1.ID
Any help figuring this out would be greatly appreciated, let me know if you have any questions or if I made any errors. Thanks!
EDIT: I think I have identified the problem, but I'm not sure of a solution yet. I think seeing the issue requires a little more context: The select from table 2 is actually an unpivot on a wide table. This means that when the left outer join is applied, there will be multiple rows for a given ID. What the case statement that Earl suggested seems to be doing (and I assume this is happening with the where clause as well) is comparing my Conditions to only the first row of the columns from a. Since my conditions are meant to help determine which of the rows from a is chosen, they will always evaluate false for the first row (I know this just from what I know about the data), hence my perpetual NULL values. Does anyone know of a workaround to look at the other rows in a?
UPDATE T1
SET T1Value1 = CASE WHEN (a.FilterCol1 = Condition1a AND a.FilterCol2 = Condition2a) THEN a.Value END,
T1Value2 = CASE WHEN (a.FilterCol1 = Condition1b AND a.FilterCol2 = Condition2b) THEN a.Value END
FROM Table1 T1
left outer join
(
select ID, FilterCol1, FilterCol2, Value
from Table2
) a on a.ID = T1.ID
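For the case described in the EDIT, where Table2 has several rows per ID, one possible workaround is to collapse those rows to one per ID first, with a CASE inside an aggregate picking out the row each condition is after. The sketch below is only an illustration under assumptions: it uses Python's sqlite3 instead of SQL Server, the literal filter values 'a'/'b' and 'c'/'d' stand in for the elided conditions, and it relies on SQLite's row-value UPDATE syntax (3.15 or later) so that Table2 is read once per updated row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Table1 (ID INTEGER PRIMARY KEY, T1Value1 TEXT, T1Value2 TEXT);
CREATE TABLE Table2 (ID INTEGER, FilterCol1 TEXT, FilterCol2 TEXT, Value TEXT);
INSERT INTO Table1 (ID) VALUES (1), (2);
-- several rows per ID, as after an unpivot; the first row matches nothing
INSERT INTO Table2 VALUES
    (1, 'x', 'p', 'skip me'),
    (1, 'a', 'b', 'first'),
    (1, 'c', 'd', 'second'),
    (2, 'c', 'd', 'only second');
""")

conn.execute("""
UPDATE Table1
SET (T1Value1, T1Value2) = (
    -- MAX(CASE ...) scans every row for this ID and keeps the matching Value
    SELECT MAX(CASE WHEN a.FilterCol1 = 'a' AND a.FilterCol2 = 'b'
                    THEN a.Value END),
           MAX(CASE WHEN a.FilterCol1 = 'c' AND a.FilterCol2 = 'd'
                    THEN a.Value END)
    FROM Table2 a
    WHERE a.ID = Table1.ID
)
""")

updated = conn.execute(
    "SELECT ID, T1Value1, T1Value2 FROM Table1 ORDER BY ID"
).fetchall()
print(updated)
```

In T-SQL the same idea would presumably go into the derived table: select ID plus one MAX(CASE ...) column per target, GROUP BY ID, and then join that one-row-per-ID result to T1 in the UPDATE.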

TSQL/SQL Server 2008 R2 - Recursive select consolidating self-referenced table Unit and apply SUM on UnitSale and UnitCharge

I've been searching here and everywhere and I can't find a proper path to follow for my problem.
Here is the structure I am using:
Table [Unit] - represents a unit of an organization, like Management, General Coordination, Production Team 1, etc.
This table is self-referenced by its own key in the ParentID column.
Table [UnitSale] - holds fictitious sales data, referencing a specific Unit.
Table [UnitCharge] - holds fictitious costs and charges of a specific Unit.
My goal is to select the Units, starting from the top-most member of the tree, recursively consolidating its child Units by applying SUM to each UnitSale and UnitCharge of the children, and finally applying these totals to the current Unit, in this case the top-most one.
Image of sample data: http://brit.dyndns-work.com:89/Brit/SampleData.png
Check the SQL Fiddle: http://sqlfiddle.com/#!3/75c3cc/3
Any help?
A CTE is a good way to go. I would however do it bottom-up: attribute sales from the lower levels to the upper levels, group by unit, and finally join to Unit for the description and calculate the rate. Check the updated fiddle: http://sqlfiddle.com/#!3/75c3cc/16/0.
with cte1 as
(
select u.id, u.parentid, s.salevalue, c.chargevalue
from Unit u
left join UnitSale s on s.unitid = u.id
left join UnitCharge c on c.unitid = u.id
union all
select u.id, u.parentid, x.salevalue, x.chargevalue
from Unit u
inner join cte1 x on x.parentid = u.id
)
, cte2 as
(
select id, sum(salevalue) as totalsale, sum(chargevalue) as totalcharge
from cte1
group by id
)
select u.id, u.description, u.parentid, x.totalsale, x.totalcharge, x.totalsale / x.totalcharge as rate
from cte2 x
inner join unit u on u.id = x.id
order by u.description
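Here is a small runnable sketch of the same bottom-up rollup, using Python's sqlite3 in place of SQL Server (SQLite needs the RECURSIVE keyword). The three-level tree and the amounts are invented. One deliberate difference from the query above: each unit's own sales and charges are summed separately before the recursion, on the assumption that a unit may have several sale rows and several charge rows, in which case joining both child tables directly would multiply rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Unit (id INTEGER PRIMARY KEY, parentid INTEGER, description TEXT);
CREATE TABLE UnitSale (unitid INTEGER, salevalue INTEGER);
CREATE TABLE UnitCharge (unitid INTEGER, chargevalue INTEGER);
INSERT INTO Unit VALUES
    (1, NULL, 'Management'), (2, 1, 'Coordination'), (3, 2, 'Team 1');
INSERT INTO UnitSale VALUES (2, 100), (3, 50), (3, 25);
INSERT INTO UnitCharge VALUES (1, 10), (3, 5);
""")

rows = conn.execute("""
WITH RECURSIVE own AS (
    -- each unit's own totals, one row per unit
    SELECT u.id, u.parentid,
           COALESCE((SELECT SUM(s.salevalue) FROM UnitSale s
                     WHERE s.unitid = u.id), 0) AS salevalue,
           COALESCE((SELECT SUM(c.chargevalue) FROM UnitCharge c
                     WHERE c.unitid = u.id), 0) AS chargevalue
    FROM Unit u
),
rolled AS (
    SELECT id, parentid, salevalue, chargevalue FROM own
    UNION ALL
    -- hand every row up to the parent unit, one level per iteration
    SELECT u.id, u.parentid, x.salevalue, x.chargevalue
    FROM Unit u
    JOIN rolled x ON x.parentid = u.id
)
SELECT u.id, u.description, SUM(r.salevalue), SUM(r.chargevalue)
FROM rolled r
JOIN Unit u ON u.id = r.id
GROUP BY u.id, u.description
ORDER BY u.id
""").fetchall()

for row in rows:
    print(row)
```

Every unit ends up with its own figures plus those of all its descendants, so the top-most unit carries the totals for the whole tree.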

Postgres join not respecting outer where clause

In SQL Server, I know for sure that the following query:
SELECT things.*
FROM things
LEFT OUTER JOIN (
SELECT thingreadings.thingid, reading
FROM thingreadings
INNER JOIN things on thingreadings.thingid = things.id
ORDER BY reading DESC LIMIT 1) AS readings
ON things.id = readings.thingid
WHERE things.id = '1'
would join against thingreadings only once the WHERE id = 1 had restricted the record set down, so it left joins against just one row. However, in order for performance to be acceptable in Postgres, I have to add the WHERE id = 1 to the INNER JOIN things on thingreadings.thingid = things.id line too.
This isn't ideal; is it possible to force Postgres to know that what I am joining against is only one row without explicitly adding the WHERE clauses everywhere?
An example of this problem can be seen here:
I am trying to recreate the following query in a more efficient way:
SELECT things.id, things.name,
(SELECT thingreadings.id FROM thingreadings WHERE thingid = things.id ORDER BY id DESC LIMIT 1),
(SELECT thingreadings.reading FROM thingreadings WHERE thingid = things.id ORDER BY id DESC LIMIT 1)
FROM things
WHERE id IN (1,2)
http://sqlfiddle.com/#!15/a172c/2
Not really sure why you did all that work. Isn't the inner query enough?
SELECT t.*
FROM thingreadings tr
INNER JOIN things t on tr.thingid = t.id AND t.id = '1'
ORDER BY tr.reading DESC
LIMIT 1;
When you want to select the latest value for each thingID, you can do:
SELECT t.*,a.reading
FROM things t
INNER JOIN (
SELECT t1.*
FROM thingreadings t1
LEFT JOIN thingreadings t2
ON (t1.thingid = t2.thingid AND t1.reading < t2.reading)
WHERE t2.thingid IS NULL
) a ON a.thingid = t.id
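The derived table keeps, for each thingid, the row for which no other row of that thing has a larger reading, i.e. the highest reading, which this query treats as the latest one. The JOIN then gets you the information from the things table for that record.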
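The self-join trick can be checked on a few rows with a sketch like the following, which uses Python's sqlite3 as a stand-in for Postgres. The sample things and readings are made up, and the third thing has no readings at all.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE things (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE thingreadings (
    id INTEGER PRIMARY KEY,
    thingid INTEGER,
    reading INTEGER
);
INSERT INTO things VALUES (1, 'boiler'), (2, 'pump'), (3, 'idle');
INSERT INTO thingreadings (thingid, reading) VALUES (1, 5), (1, 9), (2, 7);
""")

rows = conn.execute("""
SELECT t.id, t.name, a.reading
FROM things t
INNER JOIN (
    SELECT t1.*
    FROM thingreadings t1
    -- t2 matches only rows of the same thing with a larger reading
    LEFT JOIN thingreadings t2
           ON t1.thingid = t2.thingid AND t1.reading < t2.reading
    -- no such row found means t1 already holds the largest reading
    WHERE t2.thingid IS NULL
) a ON a.thingid = t.id
ORDER BY t.id
""").fetchall()

print(rows)
```

Because of the INNER JOIN, a thing without any readings drops out of the result; a LEFT JOIN from things would keep it with a NULL reading.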
The where clause in SQL applies to the result set you're requesting, NOT to the join.
What your code is NOT saying: "do this join only for the ID of 1"...
What your code IS saying: "do this join, then pull records out of it where the ID is 1"...
This is why you need the inner where clause. Incidentally, I also think Filipe is right about the unnecessary code.