convert SQL join query to pyspark syntax - pyspark

I'm trying to convert a known-working SQL query to pyspark, given two dataframes, using methods such as .join, .where, .filter, etc.
Here are examples of SQL queries that work (only selecting r.id where I will normally select more columns):
# "invalid" records, where there is a matching `record_id` for rv_df
SELECT DISTINCT(r.id) FROM core_record AS r LEFT OUTER JOIN core_recordvalidation rv ON r.id = rv.record_id WHERE r.job_id = 41 AND rv.record_id is not null;
# "valid" records, where there is no matching `record_id` for rv_df
SELECT DISTINCT(r.id) FROM core_record AS r LEFT OUTER JOIN core_recordvalidation rv ON r.id = rv.record_id WHERE r.job_id = 41 AND rv.record_id is null;
I'm about 80% of the way there, but I'm having trouble wrapping my head around the last few steps and how to do this most efficiently.
I've got a Dataframe r_df with column id that I'd like to join with Dataframe rv_df on column record_id. As output, I'd like only distinct r.id, and only columns from r_df, none from rv_df. Finally, I'd like two different calls: one where there is a match (what will be "invalid" records for me), and one where there is not a match (what I consider "valid" records).
I have pyspark queries that get close, but I'm not clear on how to ensure that r_df.id is distinct and how to select only columns from r_df, none from rv_df.
Any help would be much appreciated!

Just had to walk away for a couple hours. Found a solution that works for my use case.
First, selecting only distinct record_id from rv_df:
rv_df = rv_df.select('record_id').distinct()
Then use that for intersection and disjoints:
# Intersection:
j_df = r_df.join(rv_df, r_df.id == rv_df.record_id, 'leftsemi').select(r_df['*'])
# Disjoint:
j_df = r_df.join(rv_df, r_df.id == rv_df.record_id, 'leftanti').select(r_df['*'])
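For comparison with the original SQL: a left semi join keeps the left-hand rows that have at least one match (like WHERE EXISTS), and a left anti join keeps those with no match (like WHERE NOT EXISTS). Neither duplicates left-hand rows or returns right-hand columns. The sketch below checks that behaviour with Python's built-in sqlite3 instead of Spark; the rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE core_record (id INTEGER PRIMARY KEY, job_id INTEGER);
    CREATE TABLE core_recordvalidation (record_id INTEGER);
    -- Illustrative data: record 1 has two validation rows, record 2 has none.
    INSERT INTO core_record VALUES (1, 41), (2, 41), (3, 42);
    INSERT INTO core_recordvalidation VALUES (1), (1), (3);
""")

# "Invalid" records: semi join, i.e. at least one matching record_id.
invalid_ids = [row[0] for row in conn.execute("""
    SELECT r.id FROM core_record AS r
    WHERE r.job_id = 41
      AND EXISTS (SELECT 1 FROM core_recordvalidation rv WHERE rv.record_id = r.id)
    ORDER BY r.id
""")]

# "Valid" records: anti join, i.e. no matching record_id.
valid_ids = [row[0] for row in conn.execute("""
    SELECT r.id FROM core_record AS r
    WHERE r.job_id = 41
      AND NOT EXISTS (SELECT 1 FROM core_recordvalidation rv WHERE rv.record_id = r.id)
    ORDER BY r.id
""")]

print(invalid_ids, valid_ids)
```

Record 1 appears only once in the "invalid" result even though it has two validation rows, which is why no separate DISTINCT is needed with this form.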

Related

Postgresql :joining on fields using CASE expression

I am trying to join two tables on two fields with a below condition
If condition 1 is satisfied then join ON a.field_1 = b.field_1
If condition 2 is satisfied then join ON a.field_2 = b.field_2
In order to do so, I am writing the below query
SELECT
a.field_1,a.field_2,
b.field_1,b.field_2 FROM table a
INNER JOIN table b
CASE WHEN COALESCE(TRIM(a.field_1),'') = '' THEN a.field_1 = b.field_1
ELSE a.field_2 = b.field_2 END
I am not sure whether this would run.
From the manual:
T1 { [INNER] | { LEFT | RIGHT | FULL } [OUTER] } JOIN T2 ON boolean_expression
A JOIN accepts an arbitrary "boolean expression", which may reference columns from both joined relations. While the trivial form is t1.col = t2.col, the boolean expression is not limited to it and so comparing two sets of columns based on some other column is totally fine.
Please note the syntax error that @sticky bit pointed out.
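A minimal sketch of a CASE expression used as the join condition, written with the ON keyword that the grammar above requires. It runs against Python's built-in sqlite3 rather than PostgreSQL. The table names a and b, the sample rows, and the choice of condition (use field_1 when it is non-blank, otherwise fall back to field_2) are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (field_1 TEXT, field_2 TEXT);
    CREATE TABLE b (field_1 TEXT, field_2 TEXT);
    INSERT INTO a VALUES ('x', 'p'), ('', 'q'), (NULL, 'r');
    INSERT INTO b VALUES ('x', 'zzz'), ('yyy', 'q'), ('yyy', 'r');
""")

# The CASE expression evaluates to one of two comparisons per row pair,
# so the whole thing is a single boolean join condition.
rows = conn.execute("""
    SELECT a.field_1, a.field_2, b.field_1, b.field_2
    FROM a
    INNER JOIN b
      ON CASE
           WHEN COALESCE(TRIM(a.field_1), '') <> '' THEN a.field_1 = b.field_1
           ELSE a.field_2 = b.field_2
         END
    ORDER BY a.field_2
""").fetchall()

for row in rows:
    print(row)
```

The first row of a joins on field_1; the rows with an empty or NULL field_1 join on field_2.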

Cannot get a result by Max date

I'm trying to get the highlighted result only as it's the latest date. First time I've asked a question here so I apologize in advance if this isn't clear. Thanks
By using the following query
SELECT
MAX(A.Insp_Date) AS Last_Insp_Date
,A.Doc_ID
,A.Service_Call_ID
,A.Customer_ID
,A.Address_Code
,A.State
,A.Branch
,B.HydLoc
,B.FlwOutSz
,B.StaticPSI
,B.ResidualPSI
,B.PititPSI
,B.FlwGPM
FROM [dbo].[fofHydrntInspHdr] AS A
LEFT OUTER JOIN [dbo].[fofHYD2800FlwTstRT] AS B
ON A.Doc_ID = B.Doc_ID
WHERE A.Doc_ID > 0
AND A.Address_Code = 'GEN0021'
GROUP BY
A.Doc_ID
,A.Service_Call_ID
,A.Customer_ID
,A.Address_Code
,A.State
,A.Branch
,B.HydLoc
,B.FlwOutSz
,B.StaticPSI
,B.ResidualPSI
,B.PititPSI
,B.FlwGPM
I've also tried using max doc_id and it still doesn't work. Appreciate any help.
Another option that shouldn't require two scans of your table is to filter for the latest using a window function:
with r as
(
SELECT
A.Insp_Date AS Last_Insp_Date
,A.Doc_ID
,A.Service_Call_ID
,A.Customer_ID
,A.Address_Code
,A.State
,A.Branch
,B.HydLoc
,B.FlwOutSz
,B.StaticPSI
,B.ResidualPSI
,B.PititPSI
,B.FlwGPM
,DENSE_RANK() OVER (ORDER BY A.Insp_Date DESC) AS r
FROM [dbo].[fofHydrntInspHdr] AS A
LEFT OUTER JOIN [dbo].[fofHYD2800FlwTstRT] AS B
ON A.Doc_ID = B.Doc_ID
WHERE A.Doc_ID > 0
AND A.Address_Code = 'GEN0021'
)
SELECT
Last_Insp_Date
,Doc_ID
,Service_Call_ID
,Customer_ID
,Address_Code
,State
,Branch
,HydLoc
,FlwOutSz
,StaticPSI
,ResidualPSI
,PititPSI
,FlwGPM
FROM r
WHERE r = 1;
As an aside, I would advise against aliasing your tables with A, B, C etc as they don't relate to the table and make understanding the query later on more awkward. In this case, aliases like h and ft would convey that one table is the Headers and the other the Flow Tests, whilst also reducing character count.
It also looks like you have some bad duplication going on in your results there, which suggests that either your query is not joining and filtering appropriately or your data is messy.
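The window-function approach can be tried on a small scale. This is a sketch using Python's built-in sqlite3 (assuming it bundles SQLite 3.25 or later, which added window functions) rather than SQL Server. The table insp_hdr and its rows are hypothetical stand-ins for the header table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE insp_hdr (Doc_ID INTEGER, Insp_Date TEXT, Address_Code TEXT);
    INSERT INTO insp_hdr VALUES
        (1, '2018-01-15', 'GEN0021'),
        (2, '2018-06-30', 'GEN0021'),
        (3, '2018-06-30', 'GEN0021'),
        (4, '2018-09-01', 'OTHER');
""")

# Rank rows by date (newest first) after filtering, then keep rank 1 only.
latest = conn.execute("""
    WITH ranked AS (
        SELECT Doc_ID,
               Insp_Date AS Last_Insp_Date,
               DENSE_RANK() OVER (ORDER BY Insp_Date DESC) AS rnk
        FROM insp_hdr
        WHERE Doc_ID > 0 AND Address_Code = 'GEN0021'
    )
    SELECT Doc_ID, Last_Insp_Date
    FROM ranked
    WHERE rnk = 1
    ORDER BY Doc_ID
""").fetchall()

print(latest)
```

DENSE_RANK keeps every row that shares the latest date, so both documents dated 2018-06-30 come back. If exactly one row is wanted, ROW_NUMBER() with a tie-breaking ORDER BY column would be the usual substitute.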

EFCore returning too many columns for a simple LEFT OUTER join

I am currently using EFCore 1.1 (preview release) with SQL Server.
I am doing what I thought was a simple OUTER JOIN between an Order and OrderItem table.
var orders = from order in ctx.Order
join orderItem in ctx.OrderItem
on order.OrderId equals orderItem.OrderId into tmp
from oi in tmp.DefaultIfEmpty()
select new
{
order.OrderDt,
Sku = (oi == null) ? null : oi.Sku,
Qty = (oi == null) ? (int?) null : oi.Qty
};
The actual data returned is correct (I know earlier versions had issues with OUTER JOINS not working at all). However the SQL is horrible and includes every column in Order and OrderItem which is problematic considering one of them is a large XML Blob.
SELECT [order].[OrderId], [order].[OrderStatusTypeId],
[order].[OrderSummary], [order].[OrderTotal], [order].[OrderTypeId],
[order].[ParentFSPId], [order].[ParentOrderId],
[order].[PayPalECToken], [order].[PaymentFailureTypeId] ....
...[orderItem].[OrderId], [orderItem].[OrderItemType], [orderItem].[Qty],
[orderItem].[SKU] FROM [Order] AS [order] LEFT JOIN [OrderItem] AS
[orderItem] ON [order].[OrderId] = [orderItem].[OrderId] ORDER BY
[order].[OrderId]
(There are many more columns not shown here.)
On the other hand - if I make it an INNER JOIN then the SQL is as expected with only the columns in my select clause:
SELECT [order].[OrderDt], [orderItem].[SKU], [orderItem].[Qty] FROM
[Order] AS [order] INNER JOIN [OrderItem] AS [orderItem] ON
[order].[OrderId] = [orderItem].[OrderId]
I tried reverting to EFCore 1.01, but got some horrible nuget package errors and gave up with that.
Not clear whether this is an actual regression issue or an incomplete feature in EFCore. But couldn't find any further information about this elsewhere.
Edit: EFCore 2.1 has addressed a lot of issues with grouping and also N+1 type issues where a separate query is made for every child entity. Very impressed with the performance in fact.
3/14/18 - 2.1 Preview 1 of EFCore isn't recommended because the GROUP BY SQL has some issues when using OrderBy() but it's fixed in nightly builds and Preview 2.
The following applies to EF Core 1.1.0 (release).
Although one shouldn't have to do such things, I tried several alternative query syntaxes (using a navigation property instead of a manual join, joining subqueries containing an anonymous type projection, using let / an intermediate Select, using Concat / Union to emulate a left join, the alternative left join syntax, etc.). The result was either the same as in the post, and/or more than one query being executed, and/or invalid SQL queries, and/or strange runtime exceptions like IndexOutOfRange, InvalidArgument etc.
What I can say based on tests is that most likely the problem is related to bug(s) (regression, incomplete implementation - does it really matter) in GroupJoin translation. For instance, #7003: Wrong SQL generated for query with group join on a subquery that is not present in the final projection or #6647 - Left Join (GroupJoin) always materializes elements resulting in unnecessary data pulling etc.
Until it gets fixed (when?), as a (far from perfect) workaround I could suggest using the alternative left outer join syntax (from a in A from b in B.Where(b => b.Key == a.Key).DefaultIfEmpty()):
var orders = from o in ctx.Order
from oi in ctx.OrderItem.Where(oi => oi.OrderId == o.OrderId).DefaultIfEmpty()
select new
{
OrderDt = o.OrderDt,
Sku = oi.Sku,
Qty = (int?)oi.Qty
};
which produces the following SQL:
SELECT [o].[OrderDt], [t1].[Sku], [t1].[Qty]
FROM [Order] AS [o]
CROSS APPLY (
SELECT [t0].*
FROM (
SELECT NULL AS [empty]
) AS [empty0]
LEFT JOIN (
SELECT [oi0].*
FROM [OrderItem] AS [oi0]
WHERE [oi0].[OrderId] = [o].[OrderId]
) AS [t0] ON 1 = 1
) AS [t1]
As you can see, the projection is ok, but instead of LEFT JOIN it uses strange CROSS APPLY which might introduce another performance issue.
Also note that you have to use casts for value types and nothing for strings when accessing the right joined table as shown above. If you use null checks as in the original query, you'll get ArgumentNullException at runtime (yet another bug).
Using "into" will create a temporary identifier to store the results.
Reference: MSDN: into (C# Reference)
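So removing "into tmp from oi in tmp.DefaultIfEmpty()" will result in clean SQL with only the three columns. Note that without DefaultIfEmpty() this is an inner join, so orders with no order items are no longer returned.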
var orders = from order in ctx.Order
join orderItem in ctx.OrderItem
on order.OrderId equals orderItem.OrderId
select new
{
order.OrderDt,
Sku = orderItem.Sku,
Qty = orderItem.Qty
};

Postgres join table: Return only records from one table but with values from others [duplicate]

I have a table with ~5,000 records. I have made three join columns in this table. The values in each column are not unique. I want to join to another table (sequentially) by each of these three columns to return values given a condition.
The join table contains multiple columns. Three of these columns are the join columns, which correspond to the first table's join columns. The join columns in the join table are unique. I want to take the values from the join table and bring them into a new column in the first table.
I have a code that I have put together from other suggestions and it runs but I am receiving over 8 million records in the return table. I want the table to only have the records from the first table.
Here is the code:
CREATE TABLE current_condition_joined AS SELECT
a.id, a.geom, a.condition_join_1, a.condition_join_2, a.condition_join_3,
coalesce(b.condition, c.condition2, d.condition3) as current_condition,
coalesce(b.ecosite, c.ecosite2, d.ecosite3) as current_ecosite,
coalesce(b.ecophase, c.ecophase2, d.ecophase3) as current_ecophase,
coalesce(b.consite, c.consite2, d.consite3) as current_consite,
coalesce(b.conphase, c.conphase2, d.conphase3) as current_conphase
FROM current_condition a
LEFT JOIN boreal_mixedwood_labeled b ON a.condition_join_1 = b.label
LEFT JOIN boreal_mixedwood_labeled c ON a.condition_join_2 = c.label2
LEFT JOIN boreal_mixedwood_labeled d ON a.condition_join_3 = d.label3
WHERE b.condition != 'ERROR' and c.condition2 != 'ERROR';
I want to get the values from the first join if condition is not ERROR, else the values from the second join if condition is not ERROR, else the values of the third join.
I've looked around, but all the examples are asking slightly different things than I am, so I can't piece it together.
This is not the same question as: Nested Case statement type error (postgres)
The question asked there was in regard to making a nested statement work. This question is about how the join works. Two different questions, two different posts.
Try adding DISTINCT.
CREATE TABLE current_condition_joined AS SELECT DISTINCT
a.id, a.geom, a.condition_join_1, a.condition_join_2, a.condition_join_3,
coalesce(b.condition, c.condition2, d.condition3) as current_condition,
coalesce(b.ecosite, c.ecosite2, d.ecosite3) as current_ecosite,
coalesce(b.ecophase, c.ecophase2, d.ecophase3) as current_ecophase,
coalesce(b.consite, c.consite2, d.consite3) as current_consite,
coalesce(b.conphase, c.conphase2, d.conphase3) as current_conphase
FROM current_condition a
LEFT JOIN boreal_mixedwood_labeled b ON a.condition_join_1 = b.label
LEFT JOIN boreal_mixedwood_labeled c ON a.condition_join_2 = c.label2
LEFT JOIN boreal_mixedwood_labeled d ON a.condition_join_3 = d.label3
WHERE b.condition != 'ERROR' and c.condition2 != 'ERROR';
You can also try GROUP BY.
The code you present is what I gave you for your previous question:
Nested Case statement type error (postgres).
But you broke it by moving the conditions b.condition != 'ERROR' and c.condition2 != 'ERROR' to the WHERE clause, which is simply wrong. Consider:
Query with LEFT JOIN not returning rows for count of 0
If rows are multiplied, then your join conditions most probably identify multiple matching rows, multiplying each other. Hard to diagnose while you still refuse to provide the table definition of boreal_mixedwood_labeled like I requested repeatedly for your previous question.
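To illustrate why moving the condition into the WHERE clause matters, here is a small sketch with Python's built-in sqlite3 instead of Postgres. The table labeled and all rows are hypothetical and cut down to one join column. A filter on the right-hand table in WHERE discards the NULL-extended rows a LEFT JOIN produces, while the same filter in the ON clause keeps every left-hand row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE current_condition (id INTEGER, condition_join_1 TEXT);
    CREATE TABLE labeled (label TEXT, condition TEXT);
    INSERT INTO current_condition VALUES (1, 'A'), (2, 'B'), (3, 'C');
    INSERT INTO labeled VALUES ('A', 'GOOD'), ('B', 'ERROR');
""")

# Filter in WHERE: rows 2 and 3 vanish (NULL != 'ERROR' is not true).
in_where = conn.execute("""
    SELECT a.id, b.condition
    FROM current_condition a
    LEFT JOIN labeled b ON a.condition_join_1 = b.label
    WHERE b.condition != 'ERROR'
    ORDER BY a.id
""").fetchall()

# Filter in ON: every row of the left table survives; unmatched ones get NULL.
in_on = conn.execute("""
    SELECT a.id, b.condition
    FROM current_condition a
    LEFT JOIN labeled b ON a.condition_join_1 = b.label
                       AND b.condition != 'ERROR'
    ORDER BY a.id
""").fetchall()

print(in_where)
print(in_on)
```

With the condition in ON, a NULL from the first join lets COALESCE fall through to the next join's value, which is the behaviour the question describes wanting.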

Postgres - Get data from each alias

In my application I have a query that does multiple joins with a table position, like this:
SELECT *
FROM (...) as trips
join trip as t on trips.trip_id = t.trip_id
left outer join vehicle as v on v.vehicle_id = t.trip_vehicle_id
left outer join position as start on trips.start_position_id = start.position_id and start.position_vehicle_id = v.vehicle_id
left outer join position as "end" on trips.end_position_id = "end".position_id and "end".position_vehicle_id = v.vehicle_id
left outer join position as last on trips.last_position_id = last.position_id and last.position_vehicle_id = v.vehicle_id;
My table position has 35 columns (for example, position_id).
When I run the query, the result should contain the columns of position three times: once each for start, end and last. But Postgres cannot distinguish between, for example, start.position_id, end.position_id and last.position_id, so these three columns are grouped and appear as one, position_id.
As the data in start.position_id and end.position_id are different, the position_id column that appears in the result is empty.
Without having to rename all the columns (like start.position_id AS start_position_id), how can I get each group of data separately, for example all columns from 'start'? In MySQL I can do this by calling fetch_fields and giving the function an alias, like 'start'.
Can I do this in Postgres?
Best Regards,
Nuno Oliveira
My understanding is that you can't (or find it difficult to) discern between which table each column with a shared name (such as "position_id") belongs to, but only need to see one of the sets of shared columns at any one time. If that is the case, use tablename.* in your SELECT, so SELECT trips.*, start.*... would show the columns from trips and start, but no columns from other tables involved in the join.
SELECT [...,] start.* [,...] FROM [...] atable AS start [...]
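A minimal sketch of the alias.* form, using Python's built-in sqlite3 instead of Postgres. The two-column position table and the sample rows are assumptions for illustration. Only the columns belonging to the start alias come back, even though position is joined twice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE position (position_id INTEGER, lat REAL);
    CREATE TABLE trips (trip_id INTEGER, start_position_id INTEGER, end_position_id INTEGER);
    INSERT INTO position VALUES (10, 1.5), (20, 2.5);
    INSERT INTO trips VALUES (1, 10, 20);
""")

# start.* expands to the columns of the "start" alias only.
cur = conn.execute("""
    SELECT start.*
    FROM trips
    LEFT JOIN position AS start ON trips.start_position_id = start.position_id
    LEFT JOIN position AS "end" ON trips.end_position_id = "end".position_id
""")
columns = [d[0] for d in cur.description]
rows = cur.fetchall()

print(columns, rows)
```

Swapping start.* for "end".* in the SELECT list would return the end position's columns instead.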