converting sql to dataframe api - scala

How can the SQL below be converted to Spark? I attempted the conversion below but got this error:
Error evaluating method : '$eq$eq$eq': Method threw 'java.lang.RuntimeException' exception.
I am also not sure how to represent the correlated condition where sp1.cart_id = sp.cart_id in a Spark query.
select distinct
o.order_id
, 'PENDING'
from shopping sp
inner join order o
on o.cart_id = sp.cart_id
where o.order_date = (select max(sp1.order_date)
from shopping sp1
where sp1.cart_id = sp.cart_id)
SHOPPING_DF
  .select(
    "ORDER_ID",
    "PENDING")
  .join(ORDER_DF, Seq("CART_ID"), "inner")
  .filter(col("ORDER_DATE") === SHOPPING_DF.groupBy("CART_ID").agg(max("ORDER_DATE")))

The correlated subquery can be rewritten as a simple join against a derived table on shopping that uses the window function MAX to determine the latest order date for each cart_id. In SQL this would be:
SELECT DISTINCT
o.order_id,
'PENDING'
FROM
order o
INNER JOIN (
SELECT
cart_id,
MAX(order_date) OVER (
PARTITION BY cart_id
) as order_date
FROM
shopping
) sp ON sp.cart_id = o.cart_id AND
sp.order_date = o.order_date
This can be run on your Spark session to achieve the same results.
Converted to the Spark DataFrame API, it could be written as:
ORDER_DF.alias("o")
  .join(
    SHOPPING_DF.selectExpr(
      "cart_id",
      "MAX(order_date) OVER (PARTITION BY cart_id) as order_date"
    ).alias("sp"),
    Seq("cart_id", "order_date"),
    "inner"
  )
  .selectExpr(
    "o.order_id",
    "'PENDING' as PENDING"
  ).distinct()
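Alternatively, the same window can be built with the typed Window API instead of a SQL expression string. A minimal sketch, assuming the same SHOPPING_DF and ORDER_DF with cart_id, order_id, and order_date columns:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.max

// Latest order_date per cart_id, replacing the correlated subquery.
val byCart = Window.partitionBy("cart_id")

val latestShopping = SHOPPING_DF
  .withColumn("order_date", max("order_date").over(byCart))
  .select("cart_id", "order_date")

val pendingOrders = ORDER_DF.alias("o")
  .join(latestShopping, Seq("cart_id", "order_date"), "inner")
  .selectExpr("o.order_id", "'PENDING' as PENDING")
  .distinct()
The typed form avoids string parsing and lets the compiler catch typos in method names, though Spark produces the same plan either way.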
Let me know if this works for you.

Related

Select multiple non aggregated columns with group by in postgres

I'm making a query with multiple non-aggregated columns alongside a GROUP BY clause, but Postgres throws an error saying I have to either add the non-aggregated columns to the GROUP BY or apply an aggregate function to them. This is the query I'm trying to run:
select
tb1.pipeline as pipeline_id,
tb3.pipeline_name as pipeline_name,
tb2."name" as integration_name,
cast(tb1.integration_id as VARCHAR) as integration_id,
tb1.created_at as created_at,
cast(tb1.id as VARCHAR) as batch_id,
sum(tb1.row_select) as row_select,
sum(tb1.row_insert) as row_insert,
from
table1 tb1
join
table2 tb2 on tb1.integration_id = tb2.id
join
table3 tb3 on tb1.pipeline = tb3.id
where
tb1.pipeline is not null
and tb1.is_super_parent = false
group by
tb1.pipeline
I found one solution/hack for this error: adding a MAX function around all the other non-aggregated columns solves my problem.
select
tb1.pipeline as pipeline_id,
max(tb3.pipeline_name) as pipeline_name,
max(tb2."name") as integration_name,
max(cast(tb1.integration_id as VARCHAR)) as integration_id,
max(tb1.created_at) as created_at,
max(cast(tb1.id as VARCHAR)) as batch_id,
sum(tb1.row_select) as row_select,
sum(tb1.row_insert) as row_insert,
from
table1 tb1
join
table2 tb2 on tb1.integration_id = tb2.id
join
table3 tb3 on tb1.pipeline = tb3.id
where
tb1.pipeline is not null
and tb1.is_super_parent = false
group by
tb1.pipeline
But I don't want to add MAX functions when there is no need for them, and applying MAX to all those columns makes the query expensive. Is there a better approach to solving this issue? Thanks in advance.
Well, the first thing you need is to learn to format your queries so you can get an idea of their flow at a glance. Also note that the extra comma after row_insert, right before from, will give a syntax error. With that said, how do you solve your issue?
You cannot avoid the additional aggregates or the expanded group by as long as they exist in the same query scope. You need to separate the aggregation from the selection of the additional columns. You basically have two choices:
Perform the aggregation in a CTE.
with sums (pipeline_id, row_select, row_insert) as
     ( select tb1.pipeline
            , sum(tb1.row_select) as row_select
            , sum(tb1.row_insert) as row_insert
         from table1 tb1
        where tb1.pipeline is not null
          and tb1.is_super_parent = false
        group by tb1.pipeline
     )
select s.pipeline_id
     , tb3.pipeline_name
     , tb2."name" integration_name
     , s.row_select
     , s.row_insert
  from sums s
  join table2 tb2 on (s.pipeline_id = tb2.id)
  join table3 tb3 on (s.pipeline_id = tb3.id);
Perform the aggregation in a sub-query.
select s.pipeline_id
     , tb3.pipeline_name
     , tb2."name" integration_name
     , s.row_select
     , s.row_insert
  from ( select tb1.pipeline as pipeline_id
              , sum(tb1.row_select) as row_select
              , sum(tb1.row_insert) as row_insert
           from table1 tb1
          where tb1.pipeline is not null
            and tb1.is_super_parent = false
          group by tb1.pipeline
       ) s
  join table2 tb2 on (s.pipeline_id = tb2.id)
  join table3 tb3 on (s.pipeline_id = tb3.id);
NOTE: Not tested as no sample data supplied.
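For comparison with the Spark conversion in the main question, the same aggregate-then-join shape maps directly onto the DataFrame API. A hypothetical sketch, assuming DataFrames table1DF and table3DF with the columns used above:
import org.apache.spark.sql.functions.{col, sum}

// Aggregate first (the CTE / sub-query step)...
val sums = table1DF
  .filter(col("pipeline").isNotNull && col("is_super_parent") === false)
  .groupBy("pipeline")
  .agg(sum("row_select").as("row_select"), sum("row_insert").as("row_insert"))

// ...then join the descriptive columns onto the already-aggregated rows.
val result = sums
  .join(table3DF, sums("pipeline") === table3DF("id"))
  .select(sums("pipeline").as("pipeline_id"), col("pipeline_name"),
          col("row_select"), col("row_insert"))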

instead of fetching multiple tables using pyspark how can we execute join query using jdbc

customer - c_id, c_name, c_address
product - p_id, p_name, price
supplier - s_id, s_name, s_address
orders - o_id, c_id, p_id, quantity, time
SELECT o.o_id,
c.c_id,
c.c_name,
p.p_id,
p.p_name,
p.price * o.quantity AS amount
FROM customer c
JOIN orders o ON o.c_id = c.c_id
JOIN product p ON p.p_id = o.p_id;
I want to execute the above query without fetching the 3 tables as individual DataFrames in PySpark and performing the joins on DataFrames.
You can use a query in place of a table name, as described in the PySpark documentation:
df = spark.read.jdbc(
"url", "(query) as table",
properties={"user":"username", "password":"password"})
In your case it will be:
df = spark.read.jdbc("url", """
(
SELECT o.o_id,
c.c_id,
c.c_name,
p.p_id,
p.p_name,
p.price * o.quantity AS amount
FROM customer c
JOIN orders o ON o.c_id = c.c_id
JOIN product p ON p.p_id = o.p_id
) as table""", properties={"user":"username", "password":"password"})
This answer used this type of query in place of a table name; this related question may also be relevant to your case.
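Since the main question targets Scala, the same trick looks like this with the Scala DataFrameReader — a sketch with placeholder connection details:
import java.util.Properties

val props = new Properties()
props.setProperty("user", "username")      // placeholder credentials
props.setProperty("password", "password")

// Wrapping the join in parentheses and aliasing it makes the JDBC source
// push the whole query down to the database instead of reading 3 tables.
val query = """(SELECT o.o_id, c.c_id, c.c_name, p.p_id, p.p_name,
               |        p.price * o.quantity AS amount
               |   FROM customer c
               |   JOIN orders o ON o.c_id = c.c_id
               |   JOIN product p ON p.p_id = o.p_id) AS t""".stripMargin

val df = spark.read.jdbc("jdbc:postgresql://host:5432/db", query, props)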

SQL Server 2012 Passing parameter from main query to the Joined subquery

I need to select some settings from some joined tables, but only if the ItemID is among the first 1000 Items ordered by EndTime DESC.
To do this I built the following query which, although it can surely be improved, works:
SELECT ss.ModuleCode, ss.MaxItems, w.*
FROM Subscriptions ss
JOIN Sellers s ON s.UID=ss.UID
JOIN Items i ON s.UserID=i.UserID
JOIN Items ii ON i.ItemID=ii.ItemID
JOIN Modules mo ON ss.ModuleCode=mo.ModuleCode
JOIN Settings w ON w.UID=s.UID AND ss.ModuleCode=w.WCode
FULL JOIN GoogleFonts f ON f.FontCode=a.FontFamily
JOIN ( SELECT
         ItemID
       FROM Items
       WHERE UserID=@UserID
       ORDER BY EndTime DESC
       OFFSET 0 ROWS
       FETCH FIRST (1000) ROWS ONLY
     ) it ON it.ItemID=i.ItemID
WHERE it.ItemID=@ItemID
AND .....
but since MaxItems is not always 1000 and its value is defined by ss.MaxItems,
I would like to replace the fixed value of 1000 with the dynamic value of ss.MaxItems, but I haven't found a way to do it.
Although not optimal, since it makes the query much heavier, I tried putting a further subquery in place of 1000, with this result:
SELECT ss.ModuleCode, ss.MaxItems, w.*
FROM Subscriptions ss
JOIN Sellers s ON s.UID=ss.UID
JOIN Items i ON s.UserID=i.UserID
JOIN Items ii ON i.ItemID=ii.ItemID
JOIN Modules mo ON ss.ModuleCode=mo.ModuleCode
JOIN Settings w ON w.UID=s.UID AND ss.ModuleCode=w.WCode
FULL JOIN GoogleFonts f ON f.FontCode=a.FontFamily
JOIN ( SELECT
         ItemID
       FROM Items
       WHERE UserID=@UserID
       ORDER BY EndTime DESC
       OFFSET 0 ROWS
       FETCH FIRST ( SELECT ss.MaxItems
                     FROM Subscriptions ss
                     JOIN Sellers s ON s.UID=ss.UID
                     JOIN Items i ON s.UserID=i.UserID
                     JOIN Modules mo ON ss.ModuleCode=mo.ModuleCode
                     JOIN Settings w ON w.UID=s.UID AND ss.ModuleCode=w.WCode
                     WHERE i.ItemID=@ItemID) ROWS ONLY
     ) it ON it.ItemID=i.ItemID
WHERE it.ItemID=@ItemID
AND .....
but since this returns more than one value it is not accepted; limiting the last subquery to TOP 1 will work, but it will not be fully dynamic as required.
Can you suggest how to solve this, or at least point me toward the solution?
Thanks!
Instead of FETCH, use ROW_NUMBER():
JOIN (SELECT ItemID, ROW_NUMBER() OVER (PARTITION BY UserID ORDER BY EndTime DESC) as seqnum
      FROM Items it
      WHERE UserID = @UserID
     ) it
     ON it.ItemID = i.ItemID AND seqnum <= ss.MaxItems
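This rank-then-filter pattern also carries over to the Spark DataFrame API used in the main question. A hypothetical sketch, assuming an itemsDF with UserID, ItemID, and EndTime, and a subscriptionsDF carrying a per-user MaxItems limit:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, desc, row_number}

// Rank each user's items from most recent EndTime downward...
val byUser = Window.partitionBy("UserID").orderBy(desc("EndTime"))
val ranked = itemsDF.withColumn("seqnum", row_number().over(byUser))

// ...then apply the per-row limit as a join + filter instead of a
// fixed FETCH FIRST n ROWS clause.
val limited = ranked
  .join(subscriptionsDF, Seq("UserID"))
  .filter(col("seqnum") <= col("MaxItems"))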

Postgres join not respecting outer where clause

In SQL Server, I know for sure that the following query;
SELECT things.*
FROM things
LEFT OUTER JOIN (
SELECT thingreadings.thingid, reading
FROM thingreadings
INNER JOIN things on thingreadings.thingid = things.id
ORDER BY reading DESC LIMIT 1) AS readings
ON things.id = readings.thingid
WHERE things.id = '1'
would join against thingreadings only once the WHERE id = 1 had restricted the record set down, so it left joins against just one row. However, for performance to be acceptable in Postgres, I have to add the WHERE id = 1 to the INNER JOIN things on thingreadings.thingid = things.id line too.
This isn't ideal; is it possible to make Postgres see that what I am joining against is only one row, without explicitly adding the WHERE clause everywhere?
An example of this problem can be seen here;
I am trying to recreate the following query in a more efficient way;
SELECT things.id, things.name,
(SELECT thingreadings.id FROM thingreadings WHERE thingid = things.id ORDER BY id DESC LIMIT 1),
(SELECT thingreadings.reading FROM thingreadings WHERE thingid = things.id ORDER BY id DESC LIMIT 1)
FROM things
WHERE id IN (1,2)
http://sqlfiddle.com/#!15/a172c/2
Not really sure why you did all that work. Isn't the inner query enough?
SELECT t.*
FROM thingreadings tr
INNER JOIN things t on tr.thingid = t.id AND t.id = '1'
ORDER BY tr.reading DESC
LIMIT 1;
sqlfiddle demo
When you want to select the latest value for each thingID, you can do:
SELECT t.*,a.reading
FROM things t
INNER JOIN (
SELECT t1.*
FROM thingreadings t1
LEFT JOIN thingreadings t2
ON (t1.thingid = t2.thingid AND t1.reading < t2.reading)
WHERE t2.thingid IS NULL
) a ON a.thingid = t.id
sqlfiddle demo
The derived table gets you the record with the most recent reading, then the JOIN gets you the information from the things table for that record.
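The same keep-the-row-with-no-larger-reading trick can be expressed in the Spark DataFrame API from the main question — a minimal sketch, assuming thingreadingsDF and thingsDF DataFrames with the columns above:
import org.apache.spark.sql.functions.col

val t1 = thingreadingsDF.alias("t1")
val t2 = thingreadingsDF.alias("t2")

// Self left-join: a reading survives only if no other reading for the
// same thingid is larger, i.e. it is the latest one.
val latest = t1
  .join(t2, col("t1.thingid") === col("t2.thingid") &&
            col("t1.reading") < col("t2.reading"), "left")
  .filter(col("t2.thingid").isNull)
  .selectExpr("t1.*")

val result = thingsDF.join(latest, thingsDF("id") === latest("thingid"))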
The WHERE clause in SQL applies to the result set you're requesting, NOT to the join.
What your code is NOT saying: "do this join only for the ID of 1"...
What your code IS saying: "do this join, then pull records out of it where the ID is 1"...
This is why you need the inner where clause. Incidentally, I also think Filipe is right about the unnecessary code.

TSQL Update Query behaving unexpectedly

I have a nested select query that returns the proper number of rows. The query builds a recordset, compares it to a table, and returns the records from the query that are not in the table.
I converted the select query to an update query, trying to populate the table with the rows returned from the query. When I run the update query, it returns zero rows to update. I don't understand why, because the select query returns records and I am using the same code in the update query.
Thanks
Select Query: (This is returning several records)
Select *
From
(SELECT DISTINCT
ProductClass,SalProductClass.[Description],B.Branch,B.BranchDesc,B.Salesperson,B.Name,
CAST(0 AS FLOAT) AS Rate,'N' AS Split
FROM (SELECT SalBranch.Branch,SalBranch.[Description] AS BranchDesc,A.Salesperson,A.Name
FROM (SELECT DISTINCT
Salesperson,Name
FROM SalSalesperson
) A
CROSS JOIN SalBranch
) B
CROSS JOIN SalProductClass
) C
Left Outer Join RateComm On
RateComm.ProductClass = C.ProductClass and
RateComm.Branch = C.Branch And RateComm.Salesperson = C.Salesperson
Where RateComm.ProductClass is Null
Update Query: (This is returning zero records)
UPDATE RateComm
SET RateComm.ProductClass=C.ProductClass,RateComm.ProdClassDesc=C.ProdClassDesc,
RateComm.Branch=C.Branch,RateComm.BranchDesc=C.BranchDesc,RateComm.Salesperson=C.Salesperson,
RateComm.Name=C.Name,RateComm.Rate=C.Rate,RateComm.Split=C.Split
FROM (SELECT DISTINCT
ProductClass,SalProductClass.[Description] AS ProdClassDesc,B.Branch,B.BranchDesc,B.Salesperson,B.Name,
CAST(0 AS FLOAT) AS Rate,'N' AS Split
FROM (SELECT SalBranch.Branch,SalBranch.[Description] AS BranchDesc,A.Salesperson,A.Name
FROM (SELECT DISTINCT
Salesperson,Name
FROM SalSalesperson
) A
CROSS JOIN SalBranch
) B
CROSS JOIN SalProductClass
) C
LEFT OUTER JOIN RateComm ON C.ProductClass=RateComm.ProductClass AND
C.Salesperson=RateComm.Salesperson AND C.Branch=RateComm.Branch
WHERE RateComm.ProductClass IS NULL
It's difficult to update what doesn't exist: the WHERE RateComm.ProductClass IS NULL filter keeps exactly the rows that have no match in RateComm, so there are no existing rows for the UPDATE to touch. Have you tried an INSERT ... SELECT query instead?
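Incidentally, the find-missing-rows-then-add pattern maps cleanly onto the DataFrame API from the main question: a left_anti join plays the role of the LEFT OUTER JOIN ... IS NULL probe. A hypothetical sketch, assuming candidatesDF holds the derived rows (the C sub-select) and rateCommDF mirrors the RateComm table:
// Keep only the candidate rows that have no match in RateComm
// (the DataFrame equivalent of LEFT JOIN ... WHERE key IS NULL),
// then append them rather than trying to UPDATE non-existent rows.
val missing = candidatesDF.join(
  rateCommDF,
  Seq("ProductClass", "Branch", "Salesperson"),
  "left_anti")

missing.write.mode("append").insertInto("RateComm")  // hypothetical target table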