Analytical query - dense_rank on Spark Streaming - pyspark

I would like to run the operation below on a Spark Streaming (Structured Streaming) DataFrame, but it fails with the error "Non-time-based windows are not supported on streaming DataFrames/Datasets". Any suggestion on this would be really helpful.
task1_2_1_DF = spark.sql("""select Origin, UniqueCarrier, avg_dep_delay
    from (
        select Origin, UniqueCarrier, avg_dep_delay,
               dense_rank() over (partition by Origin order by avg_dep_delay asc) rnk
        from (
            select Origin, UniqueCarrier, round(avg(DepDelay), 2) avg_dep_delay
            from cc598_task1_src_data
            where Cancelled = 0
              and Origin in ('CMI', 'BWI', 'MIA', 'LAX', 'IAH', 'SFO')
            group by Origin, UniqueCarrier) tmp
    ) tmp1
    where rnk <= 10""")

Related

Oracle to Postgres - Convert Keep dense_rank query

I'm new to PostgreSQL and am trying to convert an Oracle query that uses KEEP DENSE_RANK to Postgres.
Here is the Oracle query that works properly:
SELECT
dbx.bundle_id,
j.JOB_CD||'_'||d.doc_type_cd||wgc.ending_cd as job_doc_type_wgt_grp,
b.create_date,
min (A.ADDRESS1) keep ( dense_rank first order by a.POSTAL_CD) as first_address,
max (A.ADDRESS1) keep ( dense_rank last order by a.POSTAL_CD) as last_address,
min (A.DOCUMENT_ID) keep ( dense_rank first order by a.POSTAL_CD) as first_docid,
max (A.DOCUMENT_ID) keep ( dense_rank last order by a.POSTAL_CD) as last_docid,
min (doc.original_img_name) keep ( dense_rank first order by a.POSTAL_CD) as first_doc_name,
max (doc.original_img_name) keep ( dense_rank last order by a.POSTAL_CD) as last_doc_name,
COUNT(distinct a.DOCUMENT_ID) as doc_count,
SUM(B.PAGES) AS BUNDLE_PAGE_COUNT
FROM ADDRESS A
JOIN DOC_BUNDLE_XREF DBX ON (DBX.DOCUMENT_ID = A.DOCUMENT_ID)
JOIN DOCUMENT DOC ON DOC.DOCUMENT_ID = DBX.DOCUMENT_ID
JOIN BUNDLE B ON B.BUNDLE_ID = DBX.BUNDLE_ID
JOIN JOB J ON J.JOB_ID = B.JOB_ID
JOIN DOC_TYPE D ON D.DOC_TYPE_ID=B.DOC_TYPE_ID
JOIN WEIGHT_GROUP_CD WGC ON WGC.WEIGHT_GROUP_CD_ID = B.WEIGHT_GROUP_CD_ID
WHERE A.ADDRESS_TYPE_ID =
(SELECT MAX( address_type_id )
FROM ADDRESS AI
WHERE AI.document_id =A.DOCUMENT_ID)
AND DBX.BUNDLE_ID in (1404,1405,1407)
group by dbx.BUNDLE_ID, j.JOB_CD||'_'||d.doc_type_cd||wgc.ending_cd, b.create_date;
Here's the PG version:
SELECT
dbx.bundle_id,
j.JOB_CD||'_'||d.doc_type_cd||wgc.ending_cd as job_doc_type_wgt_grp,
b.create_date,
FIRST_VALUE(A.ADDRESS1) OVER (order by a.POSTAL_CD) as first_address,
LAST_VALUE(A.ADDRESS1) OVER (order by a.POSTAL_CD) as last_address,
FIRST_VALUE(A.DOCUMENT_ID) OVER (order by a.POSTAL_CD) as first_docid,
LAST_VALUE(A.DOCUMENT_ID) OVER (order by a.POSTAL_CD) as last_docid,
FIRST_VALUE(doc.original_img_name) OVER (order by a.POSTAL_CD) as first_doc_name,
LAST_VALUE(doc.original_img_name) OVER (order by a.POSTAL_CD) as last_doc_name,
COUNT(distinct a.DOCUMENT_ID) as doc_count,
SUM(B.PAGES) AS BUNDLE_PAGE_COUNT
FROM ADDRESS A
JOIN DOC_BUNDLE_XREF DBX ON (DBX.DOCUMENT_ID = A.DOCUMENT_ID)
JOIN DOCUMENT DOC ON DOC.DOCUMENT_ID = DBX.DOCUMENT_ID
JOIN BUNDLE B ON B.BUNDLE_ID = DBX.BUNDLE_ID
JOIN JOB J ON J.JOB_ID = B.JOB_ID
JOIN DOC_TYPE D ON D.DOC_TYPE_ID=B.DOC_TYPE_ID
JOIN WEIGHT_GROUP_CD WGC ON WGC.WEIGHT_GROUP_CD_ID = B.WEIGHT_GROUP_CD_ID
WHERE A.ADDRESS_TYPE_ID = (
SELECT MAX( address_type_id )
FROM ADDRESS AI
WHERE AI.document_id =A.DOCUMENT_ID)
AND DBX.BUNDLE_ID in (1404,1405,1407)
group by dbx.BUNDLE_ID, j.JOB_CD||'_'||d.doc_type_cd||wgc.ending_cd, b.create_date;
Running this statement yields the following error:
SQL Error [42803]: ERROR: column "a.address1" must appear in the GROUP BY clause or be used in an aggregate function
I've tried MIN and MAX in place of FIRST_VALUE and LAST_VALUE, same results. The same error happens on all of the other FIRST_VALUE and LAST_VALUE functions.
What am I missing here? Any idea why it doesn't recognize a.address1 (or any of the other columns) as being in an aggregate?
I'm using DBeaver version 21 to run these queries if that makes any difference. Any guidance is greatly appreciated.
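In case it helps to see the shape of a working rewrite, here is a minimal sketch of one common approach, assuming the Oracle KEEP (DENSE_RANK FIRST/LAST) values should be computed per bundle. Window functions in Postgres are evaluated after GROUP BY, so their arguments must themselves be grouped or aggregated; the usual workaround is to compute FIRST_VALUE/LAST_VALUE per row in a subquery and aggregate in the outer query. Only the address columns are shown, and the PARTITION BY column and explicit frame are assumptions:
-- Sketch only: evaluate the window functions before grouping, then aggregate.
-- PARTITION BY dbx.bundle_id and the explicit frame are assumptions; the other
-- FIRST_VALUE/LAST_VALUE pairs would follow the same pattern.
SELECT
    s.bundle_id,
    s.job_doc_type_wgt_grp,
    s.create_date,
    MIN(s.first_address) AS first_address,  -- constant within each group, so MIN just picks it
    MIN(s.last_address)  AS last_address,
    COUNT(DISTINCT s.document_id) AS doc_count
FROM (
    SELECT
        dbx.bundle_id,
        j.job_cd || '_' || d.doc_type_cd || wgc.ending_cd AS job_doc_type_wgt_grp,
        b.create_date,
        a.document_id,
        FIRST_VALUE(a.address1) OVER w AS first_address,
        LAST_VALUE(a.address1)  OVER w AS last_address  -- needs the full frame below; the default frame stops at the current row
    FROM address a
    JOIN doc_bundle_xref dbx ON dbx.document_id = a.document_id
    JOIN bundle b ON b.bundle_id = dbx.bundle_id
    JOIN job j ON j.job_id = b.job_id
    JOIN doc_type d ON d.doc_type_id = b.doc_type_id
    JOIN weight_group_cd wgc ON wgc.weight_group_cd_id = b.weight_group_cd_id
    WHERE a.address_type_id = (SELECT MAX(ai.address_type_id)
                               FROM address ai
                               WHERE ai.document_id = a.document_id)
      AND dbx.bundle_id IN (1404, 1405, 1407)
    WINDOW w AS (PARTITION BY dbx.bundle_id
                 ORDER BY a.postal_cd
                 ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
) s
GROUP BY s.bundle_id, s.job_doc_type_wgt_grp, s.create_date;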

ROW_NUMBER() over partition by [DateCol vs 1] order by DateCol

I'm pretty new at T-SQL.
I saw this T-SQL script:
SELECT [Date], ClosePrice, ROW_NUMBER() over (partition by 1 order by [Date]) rn
FROM NIFTY_SMALLCAP_250_STOCKS
source: https://youtu.be/vE8UcS8U_xE?t=2882 (with my own small change)
It works as expected:
[screenshot: expected result (and sample data)]
Then I changed the script to use [Date] instead of 1:
SELECT [Date], ClosePrice, ROW_NUMBER() over (partition by [Date] order by [Date]) rn
FROM NIFTY_SMALLCAP_250_STOCKS
The result: all of the "dynamic" rn column values equal 1.
[screenshot: the result]
My question:
Why doesn't partitioning by [Date] (which is obviously unique) work here?
What am I missing about the ROW_NUMBER() and PARTITION BY combination?
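To see the difference concretely, here is a tiny T-SQL sketch on made-up data (the derived table and dates are hypothetical):
-- PARTITION BY 1 puts every row in one partition, so the numbering runs 1, 2, 3, ...
-- PARTITION BY [Date] restarts the numbering for each distinct date; with unique
-- dates, every partition holds exactly one row, so rn is always 1.
SELECT [Date],
       ROW_NUMBER() OVER (PARTITION BY 1      ORDER BY [Date]) AS rn_whole_set,
       ROW_NUMBER() OVER (PARTITION BY [Date] ORDER BY [Date]) AS rn_per_date
FROM (VALUES ('2024-01-02'), ('2024-01-03'), ('2024-01-04')) AS t([Date]);
-- rn_whole_set: 1, 2, 3     rn_per_date: 1, 1, 1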

DELETE statement deleting data partially

When I execute the query below in DB2, it does not delete the data all at once.
I have 800 records; every record has exactly one duplicate, and I want to delete one record of each pair, which should leave 400 records in the DB.
Below is a sample of RESERVATION_NUMBER.
DELETE FROM reservation_number
WHERE reservation_id IN (
    SELECT reservation_id
    FROM (
        SELECT ROW_NUMBER() OVER() AS RN,
               msr1.reservation_number,
               msr1.reservation_id,
               msr1.used_flag
        FROM reservation_number msr1,
             reservation_number msr2
        WHERE msr1.reservation_number = msr2.reservation_number
          AND msr1.reservation_id <> msr2.reservation_id
        ORDER BY msr1.reservation_number)
    WHERE MOD(rn, 2) = 0
    ORDER BY reservation_number)
This query only deletes the complete set of data if I execute it multiple times. The rows are deleted in the following fashion:
400, 168, 076, 038, 019, 003, 001
Would this not be easier?
DELETE FROM (
    SELECT ROW_NUMBER() OVER(PARTITION BY RESERVATION_NUMBER
                             ORDER BY RESERVATION_ID) AS RN
    FROM RESERVATION_NUMBER
) WHERE RN > 1
I got the fix: I was missing the ORDER BY inside OVER().
Here is the corrected query:
DELETE FROM RESERVATION_NUMBER
WHERE RESERVATION_ID IN (
    SELECT RESERVATION_ID
    FROM (
        SELECT ROW_NUMBER() OVER(ORDER BY msr1.RESERVATION_NUMBER) AS RN,
               msr1.RESERVATION_NUMBER,
               msr1.RESERVATION_ID,
               msr1.USED_FLAG
        FROM RESERVATION_NUMBER msr1,
             RESERVATION_NUMBER msr2
        WHERE msr1.RESERVATION_NUMBER = msr2.RESERVATION_NUMBER
          AND msr1.RESERVATION_ID <> msr2.RESERVATION_ID
        ORDER BY msr1.RESERVATION_NUMBER )
    WHERE MOD(RN, 2) = 1
    ORDER BY RESERVATION_NUMBER )
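A quick way to sanity-check the result, if you want one, is to look for reservation numbers that still occur more than once (this follow-up query is a suggestion, not part of the original fix):
-- Should return no rows once only one record per reservation_number remains.
SELECT reservation_number, COUNT(*) AS cnt
FROM reservation_number
GROUP BY reservation_number
HAVING COUNT(*) > 1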

Implement ROW_NUMBER() in beamSQL

I have the below query:
SELECT DISTINCT Summed, ROW_NUMBER () OVER (order by Summed desc) as Rank from table1
I have to write it in Apache Beam (Beam SQL). Below is my code:
PCollection<BeamRecord> rec_2_part2 = rec_2.apply(BeamSql.query("SELECT DISTINCT Summed, ROW_NUMBER(Summed) OVER (ORDER BY Summed) Rank1 from PCOLLECTION "));
But I'm getting the below error :
Caused by: java.lang.UnsupportedOperationException: Operator: ROW_NUMBER is not supported yet!
Any idea how to implement ROW_NUMBER() in Beam SQL?
Here is one way you can approximate your current query without using ROW_NUMBER:
SELECT
    t1.Summed,
    (SELECT COUNT(*)
     FROM (SELECT DISTINCT Summed FROM table1) t2
     WHERE t2.Summed >= t1.Summed) AS Rank
FROM
(
    SELECT DISTINCT Summed
    FROM table1
) t1
The basic idea is to first use a subquery to get a table with only the distinct Summed values. Then, use a correlated subquery to simulate the row number. This isn't a very efficient method, but if ROW_NUMBER is not available, you're stuck with some alternative.
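As a rough illustration of why the correlated count behaves like a descending row number, here is the same idea run on made-up values (plain SQL, e.g. Postgres, not Beam SQL syntax):
-- Because the values are distinct, counting how many distinct values are >= the
-- current one gives 1 for the largest, 2 for the next, and so on: the same
-- numbering as ROW_NUMBER() OVER (ORDER BY Summed DESC).
WITH t1 AS (SELECT DISTINCT Summed FROM (VALUES (10), (20), (30), (30)) AS v(Summed))
SELECT t1.Summed,
       (SELECT COUNT(*) FROM t1 AS t2 WHERE t2.Summed >= t1.Summed) AS Rank
FROM t1
-- Summed = 30 -> Rank 1, Summed = 20 -> Rank 2, Summed = 10 -> Rank 3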
The solution which worked for the above query:
PCollection<BeamRecord> rec_2 = rec_1.apply(BeamSql.query("SELECT max(Summed) as maxed, max(Summed)-10 as least, 'a' as Dummy from PCOLLECTION"));

How to update a table with a list of values at a time?

I have
update NewLeaderBoards set MonthlyRank=(Select RowNumber() (order by TotalPoints desc) from LeaderBoards)
I tried it this way -
(Select RowNumber() (order by TotalPoints desc) from LeaderBoards) as NewRanks
update NewLeaderBoards set MonthlyRank = NewRanks
But it doesn't work for me. Can anyone suggest how I can perform an update in this way?
You need to use the WITH statement and a full CTE:
;With Ranks As
(
Select PrimaryKeyColumn, Row_Number() Over( Order By TotalPoints Desc ) As Num
From LeaderBoards
)
Update NewLeaderBoards
Set MonthlyRank = T2.Num
From NewLeaderBoards As T1
Join Ranks As T2
On T2.PrimaryKeyColumn = T1.PrimaryKeyColumn