reduceByKey a list of lists in PySpark - group-by

I am new to PySpark and so far it has been a bit difficult to understand the way it works, especially when you are used to libraries like pandas. But it seems to be the way to go for big data.
For my current ETL job, I have the following elements:
This is my rdd:
[
[
('SMSG', 'BKT'), ('SQNR', '00000004'), ('STNQ', '06'), ('TRNN', '000001'), ('SMSG', 'BKS'), ('SQNR', '00000005'), ('STNQ', '24'), ('DAIS', '171231'), ('TRNN', '000001'), ....
],
[
('SMSG', 'BKT'), ('SQNR', '00000024'), ('STNQ', '06'), ('TRNN', '000002'), ('NRID', ' '), ('TREC', '020'), ('TRNN', '000002'), ('NRID', ' '), ('TACN', '001'), ('CARF', ' '), ...
],
...
]
The raw data is a fixed-size text file.
What I want to do now is to group by key within each sublist of the list.
The final result should be:
[
[
('SMSG_1', 'BKT'), ('SMSG_2', 'BKS'), ('SQNR_1', '00000004'), ('SQNR_2', '00000005'), ('STNQ_1', '06'), ('STNQ_2', '24'), ('TRNN', '000001'), ('DAIS', '171231'), ...
],
[
('SMSG', 'BKT'),('SQNR', '00000024'),('STNQ','06'),('TRNN', '000002'),('NRID', ' '), ('TREC', '020'), ('TACN', '001'), ('CARF', ' '),...
],
...
]
Basically the rules are as follows:
1- if the keys are the same and the values are also the same, remove the duplicates.
2- if the keys are the same but the values differ, rename the columns by adding a "_Number" suffix, where Number is the occurrence index of that key (see the plain-Python sketch below).
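Just to illustrate the two rules in plain Python, this is what I want to happen inside each sublist (only an illustration, not the Spark solution I am looking for):
from collections import OrderedDict

def apply_rules(pairs):
    # group values by key, keeping first-seen order and dropping exact duplicates (rule 1)
    grouped = OrderedDict()
    for key, value in pairs:
        grouped.setdefault(key, [])
        if value not in grouped[key]:
            grouped[key].append(value)
    # keys that end up with several distinct values get a _1, _2, ... suffix (rule 2)
    out = []
    for key, values in grouped.items():
        if len(values) == 1:
            out.append((key, values[0]))
        else:
            out.extend(('{}_{}'.format(key, i), v) for i, v in enumerate(values, 1))
    return out

apply_rules([('SMSG', 'BKT'), ('SQNR', '00000004'), ('TRNN', '000001'), ('SMSG', 'BKS'), ('SQNR', '00000005'), ('TRNN', '000001')])
# [('SMSG_1', 'BKT'), ('SMSG_2', 'BKS'), ('SQNR_1', '00000004'), ('SQNR_2', '00000005'), ('TRNN', '000001')]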
My code starts as follows:
def addBKT():
...
def prepareTrans():
...
if __name__ == '__main__':
input_folder = '/Users/admin/Documents/Training/FR20180101HOT'
rdd = sc.wholeTextFiles(input_folder).map(lambda x: x[1].split("BKT"))
rdd = rdd.flatMap(prepareTrans).map(addBKT).map(lambda x: x.split("\n")).map(hot_to_flat_file_v2)
print(rdd.take(1))
The print gives me (as shared before) the following list of lists of tuples. I am taking only 1 sublist, but the full RDD has about 2000 sublists of tuples:
[
[
('SMSG', 'BKT'), ('SQNR', '00000004'), ('STNQ', '06'), ('TRNN', '000001'), ('SMSG', 'BKS'), ('SQNR', '00000005'), ('STNQ', '24'), ('DAIS', '171231'), ('TRNN', '000001'), ....
]
]
I first tried to reduce the nested lists as follows:
rdd = rdd.flatMap(lambda x:x).reduceByKey(list)
I was expecting as a result a new list of lists without duplicates, with the tuples that share a key but have different values grouped under that key. However, I am not able to make that work.
As a second step, I was planning to expand each tuple with multiple values into as many new key/value tuples as there are values in the grouped tuple, i.e. ('Key', ['Value1', 'Value2']) becomes ('Key_1', 'Value1'), ('Key_2', 'Value2').
Finally, the goal of all these transformations is to convert the final RDD to a DataFrame and store it in Parquet format.
I really hope someone has done something like this in the past. I spent a lot of time trying to do this, but I was not able to make it work, nor was I able to find any example online.
Thank you for your help.

As you are new to Spark, you may not be aware of the Spark DataFrame API. DataFrames are a higher-level abstraction than RDDs. Here I solved your problem using a PySpark DataFrame. Have a look at this, and don't hesitate to learn Spark DataFrames.
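The snippets below assume the usual imports (not shown here), roughly:
from pyspark.sql import Window
from pyspark.sql.functions import dense_rank, asc, concat, col, lit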
rdd1 = sc.parallelize([("SMSG", "BKT"), ("SMSG", "BKT"), ("SMSG", "BKS"), ('SQNR', '00000004'), ('SQNR', '00000005') ])
rddToDF = rdd1.toDF(["C1", "C2"])
+----+--------+
| C1| C2|
+----+--------+
|SMSG| BKT|
|SMSG| BKT|
|SMSG| BKS|
|SQNR|00000004|
|SQNR|00000005|
+----+--------+
DfRmDup = rddToDF.drop_duplicates() #Removing duplicates from Dataframe
DfRmDup.show()
+----+--------+
| C1| C2|
+----+--------+
|SQNR|00000004|
|SMSG| BKT|
|SQNR|00000005|
|SMSG| BKS|
+----+--------+
rank = DfRmDup.withColumn("rank", dense_rank().over(Window.partitionBy("C1").orderBy(asc("C2"))))
rank.show()
+----+--------+----+
| C1| C2|rank|
+----+--------+----+
|SQNR|00000004| 1|
|SQNR|00000005| 2|
|SMSG| BKS| 1|
|SMSG| BKT| 2|
+----+--------+----+
rank.withColumn("C1", concat(col("C1"), lit("_"), col("rank"))).drop("rank").show()
+------+--------+
| C1| C2|
+------+--------+
|SQNR_1|00000004|
|SQNR_2|00000005|
|SMSG_1| BKS|
|SMSG_2| BKT|
+------+--------+
#Converting back to RDD
rank.withColumn("C1", concat(col("C1"), lit("_"), col("rank"))).drop("rank").rdd.map(lambda x: (x[0],x[1])).collect()
[('SQNR_1', '00000004'),
('SQNR_2', '00000005'),
('SMSG_1', 'BKS'),
('SMSG_2', 'BKT')]
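If you need to keep your ~2000 sublists separate (rather than deduplicating across the whole RDD), one option is to tag each sublist with an index before flattening and partition the window by that index as well. A rough, untested sketch along the same lines, assuming rdd is your RDD of lists of tuples:
indexed = rdd.zipWithIndex() \
    .flatMap(lambda rec: [(rec[1], kv[0], kv[1]) for kv in rec[0]])   # (record_id, key, value)
dfRec = indexed.toDF(["record_id", "C1", "C2"]).drop_duplicates()

ranked = dfRec.withColumn(
    "rank", dense_rank().over(Window.partitionBy("record_id", "C1").orderBy(asc("C2"))))
ranked.withColumn("C1", concat(col("C1"), lit("_"), col("rank"))).drop("rank").show()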

Thank you a lot for the link, I followed the solution provided. The DataFrame got created successfully, which is great.
input_folder = '/Users/admin/Documents/Training/FR20180101HOT'
rdd_split = sc.wholeTextFiles(input_folder).map(lambda x: x[1].split("BKT"))
rdd_trans = rdd_split.flatMap(prepareTrans).map(addBKT).map(lambda x: x.split("\n")).map(hot_to_flat_file_v2)
#rdd_group = rdd_trans.map(lambda x : x[i] for i in range(len(x))).reduceByKey(lambda x, y: str(x) + ','+ str(y))
df = spark.read.options(inferSchema="true").csv(rdd_trans)
print(df.show(1))
The print shows me something like this:
+--------+-------+--------+------------+--------+------+--------+----------+----...
|     _c0|    _c1|     _c2|         _c3|     _c4|   _c5|     _c6|       _c7|  ...   (auto-generated columns continue up to _c701)
+--------+-------+--------+------------+--------+------+--------+----------+----...
|[('SMSG'| 'BKT')| ('SQNR'| '00000004')| ('STNQ'| '06')| ('TRNN'| '000001')| ...  | ('CUTP'| 'EUR2')]|
+--------+-------+--------+------------+--------+------+--------+----------+----...
I think I still need to go through each pair of columns, rename the second column with the value from the first row of the first column, and finally drop the first column of every pair.
Or is it possible to add more options to:
df = spark.read.options(inferSchema="true").csv(rdd_trans)
to get exactly the right DataFrame structure? That would avoid extra processing time (my goal is to be faster than the pandas version).
In the meantime, I tried to do:
df.write.parquet("/Users/admin/Documents/Training/FR20180101HOT.parquet")
But got an error:
Py4JJavaError: An error occurred while calling o447851.parquet.
: org.apache.spark.SparkException: Job aborted.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:196)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
...
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 8220.0 failed 1 times, most recent failure: Lost task 0.0 in stage 8220.0 (TID 12712, localhost, executor driver): org.apache.spark.SparkException: Task failed while writing rows.
...
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
...
I can't paste the whole error message due to the text limit, but it seems related to a memory issue.
I did a count for the df:
print(df.count())
15723
That is equal to the number of rows in my pandas version (other Python code not using PySpark), so it is reading the right number of rows. However, in pandas I am able to export to Parquet without a problem.
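One thing I am also considering, to avoid the CSV round-trip entirely (not tested yet): since each element of rdd_trans is already a list of (key, value) tuples, maybe each record can be turned into a Row directly, assuming the _1/_2 renaming already makes the keys unique within a record:
from pyspark.sql import Row

df_direct = rdd_trans.map(lambda kvs: Row(**dict(kvs))).toDF()
print(df_direct.show(1))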

You can try regexp_replace for your case.
Check the example case below,
df1.withColumn("c0", regexp_replace("_c0", "[()']", "")).withColumn("c1", regexp_replace("_c1", "\)", "")).show()
+----+---+---+---+
| _c0|_c1| c0| c1|
+----+---+---+---+
|('a'| 2)| a| 2|
|('b'| 4)| b| 4|
|('c'| 6)| c| 6|
+----+---+---+---+
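Since you have a few hundred auto-named columns, you could apply the same replacement to every column in a loop, something along these lines (untested sketch):
from pyspark.sql.functions import regexp_replace

clean_df = df
for c in clean_df.columns:
    clean_df = clean_df.withColumn(c, regexp_replace(c, "[()']", ""))
clean_df.show(1)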

Related

Element-wise addition of lists in Pyspark Dataframe

I have a dataframe:
data = [{"category": 'A', "bigram": 'delicious spaghetti', "vector": [0.01, -0.02, 0.03], 'all_vector' : 2},
{"category": 'A', "bigram": 'delicious dinner', "vector": [0.04, 0.05, 0.06], 'all_vector' : 2},
{"category": 'B', "bigram": 'new blog', "vector": [-0.14, -0.15, -0.16], 'all_vector' : 2},
{"category": 'B', "bigram": 'bright sun', "vector": [0.071, -0.09, 0.063], 'all_vector' : 2}
]
sdf = spark.createDataFrame(data)
+----------+-------------------+--------+---------------------+
|all_vector|bigram |category|vector |
+----------+-------------------+--------+---------------------+
|2 |delicious spaghetti|A |[0.01, -0.02, 0.03] |
|2 |delicious dinner |A |[0.04, 0.05, 0.06] |
|2 |new blog |B |[-0.14, -0.15, -0.16]|
|2 |bright sun |B |[0.071, -0.09, 0.063]|
+----------+-------------------+--------+---------------------+
I need to element-wise add the lists in the vector column and divide by the all_vector column (I need to normalize the vector), then group by the category column. I wrote some example code, but unfortunately it doesn't work:
@udf_annotator(returnType=ArrayType(FloatType()))
def result_vector(vector, all_vector):
    lst = [sum(x) for x in zip(*vector)] / all_vector
    return lst

sdf_new = sdf\
    .withColumn('norm_vector', result_vector(F.col('vector'), F.col('all_vector')))\
    .withColumn('rank', F.row_number().over(Window.partitionBy('category')))\
    .where(F.col('rank') == 1)
I want it this way:
+----------+-------------------+--------+-----------------------+---------------------+
|all_vector|bigram |category|norm_vector |vector |
+----------+-------------------+--------+-----------------------+---------------------+
|2 |delicious spaghetti|A |[0.05, 0.03, 0.09] |[0.01, -0.02, 0.03] |
|2 |delicious dinner |A |[0.05, 0.03, 0.09] |[0.04, 0.05, 0.06] |
|2 |new blog |B |[-0.069, -0.24, -0.097]|[-0.14, -0.15, -0.16]|
|2 |bright sun |B |[-0.069, -0.24, -0.097]|[0.071, -0.09, 0.063]|
+----------+-------------------+--------+-----------------------+---------------------+
The zip_with function will help you zip two arrays and apply a function element wise. To use the function, we can create an array collection of the arrays in the vector column, and use the aggregate function. There might also be other simpler ways to do this though.
data_sdf. \
    withColumn('vector_collection', func.collect_list('vector').over(wd.partitionBy('cat'))). \
    withColumn('ele_wise_sum',
               func.expr('''
                   aggregate(vector_collection,
                             cast(array() as array<double>),
                             (x, y) -> zip_with(x, y, (a, b) -> coalesce(a, 0) + coalesce(b, 0))
                            )
               ''')
               ). \
    show(truncate=False)
# +---+---------------------+----------------------------------------------+-------------------------------------+
# |cat|vector |vector_collection |ele_wise_sum |
# +---+---------------------+----------------------------------------------+-------------------------------------+
# |B |[-0.14, -0.15, -0.16]|[[-0.14, -0.15, -0.16], [0.071, -0.09, 0.063]]|[-0.06900000000000002, -0.24, -0.097]|
# |B |[0.071, -0.09, 0.063]|[[-0.14, -0.15, -0.16], [0.071, -0.09, 0.063]]|[-0.06900000000000002, -0.24, -0.097]|
# |A |[0.01, -0.02, 0.03] |[[0.01, -0.02, 0.03], [0.04, 0.05, 0.06]] |[0.05, 0.030000000000000002, 0.09] |
# |A |[0.04, 0.05, 0.06] |[[0.01, -0.02, 0.03], [0.04, 0.05, 0.06]] |[0.05, 0.030000000000000002, 0.09] |
# +---+---------------------+----------------------------------------------+-------------------------------------+
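(The snippet above assumes from pyspark.sql import functions as func, Window as wd.) To get the norm_vector from your question, the element-wise sum can then be divided by all_vector with transform, for example (a sketch reusing the same column names, assuming data_sdf still carries the all_vector column):
data_sdf. \
    withColumn('vector_collection', func.collect_list('vector').over(wd.partitionBy('cat'))). \
    withColumn('ele_wise_sum',
               func.expr('aggregate(vector_collection, cast(array() as array<double>), '
                         '(x, y) -> zip_with(x, y, (a, b) -> coalesce(a, 0) + coalesce(b, 0)))')). \
    withColumn('norm_vector', func.expr('transform(ele_wise_sum, v -> v / all_vector)')). \
    show(truncate=False)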

Pyspark windows function: preceding and following event

I have the following dataframe in pyspark:
+------------------- +-------------------+---------+-----------------------+-----------+
|device_id |order_creation_time|order_id |status_check_time |status_code|
+--------------------+-------------------+---------+-----------------------+-----------+
|67a-05df-4ca5-af6ajn|2022-11-26 23:54:41|105785113|2022-11-26 23:55:33.858|200 |
|67a-05df-4ca5-af6ajn|2022-11-26 23:54:41|105785113|2022-11-26 23:55:13.1 |200 |
|67a-05df-4ca5-af6ajn|2022-11-26 23:54:41|105785113|2022-11-26 23:54:57.682|200 |
|67a-05df-4ca5-af6ajn|2022-11-26 23:54:41|105785113|2022-11-26 23:54:36.676|200 |
|67a-05df-4ca5-af6ajn|2022-11-26 23:54:41|105785113|2022-11-26 23:54:21.293|200 |
+--------------------+-------------------+---------+-----------------------+-----------+
I need to get the status_check_time immediately preceding and the one immediately following the order_creation_time.
The order_creation_time column will always be constant within the same order_id (so each order_id has only 1 order_creation_time).
In this case, the output should be:
+------------------- +-------------------+---------+---------------------------+-----------------------+
|device_id |order_creation_time|order_id |previous_status_check_time |next_status_check_time |
+--------------------+-------------------+---------+---------------------------+-----------------------+
|67a-05df-4ca5-af6ajn|2022-11-26 23:54:41|105785113|2022-11-26 23:54:36.676 |2022-11-26 23:54:57.682|
+--------------------+-------------------+---------+---------------------------+-----------------------+
I was trying to use lag and lead functions, but I'm not getting the desired output:
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import lag, lead

ss = (
    SparkSession
    .builder
    .appName("test")
    .master("local[2]")
    .getOrCreate()
)

data = [
    {"device_id": "67a-05df-4ca5-af6ajn", "order_creation_time": "2022-11-26 23:54:41", "order_id": "105785113", "status_check_time": "2022-11-26 23:55:33.858", "status_code": 200},
    {"device_id": "67a-05df-4ca5-af6ajn", "order_creation_time": "2022-11-26 23:54:41", "order_id": "105785113", "status_check_time": "2022-11-26 23:55:13.1", "status_code": 200},
    {"device_id": "67a-05df-4ca5-af6ajn", "order_creation_time": "2022-11-26 23:54:41", "order_id": "105785113", "status_check_time": "2022-11-26 23:54:57.682", "status_code": 200},
    {"device_id": "67a-05df-4ca5-af6ajn", "order_creation_time": "2022-11-26 23:54:41", "order_id": "105785113", "status_check_time": "2022-11-26 23:54:36.676", "status_code": 200},
    {"device_id": "67a-05df-4ca5-af6ajn", "order_creation_time": "2022-11-26 23:54:41", "order_id": "105785113", "status_check_time": "2022-11-26 23:54:21.293", "status_code": 200}
]

df = ss.createDataFrame(data)

windowSpec = Window.partitionBy("device_id").orderBy("status_check_time")

(
    df.withColumn(
        "previous_status_check_time", lag("status_check_time").over(windowSpec)
    ).withColumn(
        "next_status_check_time", lead("status_check_time").over(windowSpec)
    ).show(truncate=False)
)
Any ideas of how to fix this??
We can calculate the difference between the two timestamps in seconds and retain the ones that are the closest negative and closest positive.
data_sdf. \
    withColumn('ts_diff', func.col('status_check_time').cast('long') - func.col('order_creation_time').cast('long')). \
    groupBy([k for k in data_sdf.columns if k != 'status_check_time']). \
    agg(func.max(func.when(func.col('ts_diff') < 0, func.struct('ts_diff', 'status_check_time'))).status_check_time.alias('previous_status_check_time'),
        func.min(func.when(func.col('ts_diff') >= 0, func.struct('ts_diff', 'status_check_time'))).status_check_time.alias('next_status_check_time')
        ). \
    show(truncate=False)
# +--------------------+-------------------+---------+-----------+--------------------------+-----------------------+
# |device_id |order_creation_time|order_id |status_code|previous_status_check_time|next_status_check_time |
# +--------------------+-------------------+---------+-----------+--------------------------+-----------------------+
# |67a-05df-4ca5-af6ajn|2022-11-26 23:54:41|105785113|200 |2022-11-26 23:54:36.676 |2022-11-26 23:54:57.682|
# +--------------------+-------------------+---------+-----------+--------------------------+-----------------------+
The timestamp difference results in the following
# +--------------------+-------------------+---------+-----------------------+-----------+-------+
# |device_id |order_creation_time|order_id |status_check_time |status_code|ts_diff|
# +--------------------+-------------------+---------+-----------------------+-----------+-------+
# |67a-05df-4ca5-af6ajn|2022-11-26 23:54:41|105785113|2022-11-26 23:55:33.858|200        |52     |
# |67a-05df-4ca5-af6ajn|2022-11-26 23:54:41|105785113|2022-11-26 23:55:13.1  |200        |32     |
# |67a-05df-4ca5-af6ajn|2022-11-26 23:54:41|105785113|2022-11-26 23:54:57.682|200        |16     |
# |67a-05df-4ca5-af6ajn|2022-11-26 23:54:41|105785113|2022-11-26 23:54:36.676|200        |-5     |
# |67a-05df-4ca5-af6ajn|2022-11-26 23:54:41|105785113|2022-11-26 23:54:21.293|200        |-20    |
# +--------------------+-------------------+---------+-----------------------+-----------+-------+
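One thing to watch: if the dataframe is built from the string literals above, order_creation_time and status_check_time come out as strings, so they presumably need a cast to timestamp before the .cast('long') arithmetic, e.g.:
from pyspark.sql import functions as func

data_sdf = df \
    .withColumn('order_creation_time', func.to_timestamp('order_creation_time')) \
    .withColumn('status_check_time', func.to_timestamp('status_check_time'))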

Is there more efficient way to implement cosine similarity in PySpark 1.6?

I am trying to compute the cosine similarity between a given user_id from the users table and another table with movies, in order to find the most similar movies to recommend.
Cosine similarity = dot(a,b) / (norm(a) * norm(b)), or dot(a,b) / sqrt(dot(a,a) * dot(b,b))
df = self.given_user.crossJoin(self.movies_df)
df = df.select('userId', 'movieId', 'user_features', 'movie_features')
df = df.rdd.map(lambda x: (x[0], x[1], x[2], x[3], float(np.dot(np.array(x[2]), np.array(x[3]))))).toDF(df.columns + ['dotxy'])
df = df.rdd.map(lambda x: (x[0], x[1], x[2], x[3], x[4], float(np.dot(np.array(x[2]), np.array(x[2]))))).toDF(df.columns + ['dotxx'])
df = df.rdd.map(lambda x: (x[0], x[1], x[2], x[3], x[4], x[5], float(np.dot(np.array(x[3]), np.array(x[3]))))).toDF(df.columns + ['dotyy'])
output = df.withColumn('cosine_sim', F.col("dotxy") / F.sqrt(F.col("dotxx") * F.col("dotyy")))
output.select('userId', 'movieId', 'dotxy', 'dotxx', 'dotyy', 'cosine_sim').orderBy('cosine_sim', ascending=False).show(5)
The resulting output looks like:
+------+-------+-----+-----+-----+----------+
|userId|movieId|dotxy|dotxx|dotyy|cosine_sim|
+------+-------+-----+-----+-----+----------+
| 18| 1430| 1.0| 0.5| 2.0| 1.0|
| 18| 2177| 1.0| 0.5| 2.0| 1.0|
| 18| 1565| 1.0| 0.5| 2.0| 1.0|
| 18| 415| 1.0| 0.5| 2.0| 1.0|
| 18| 1764| 1.0| 0.5| 2.0| 1.0|
+------+-------+-----+-----+-----+----------+
Is there a more efficient/compact way to implement the cosine similarity function in PySpark 1.6?
You could use more numpy functions.
import numpy as np
df = spark.createDataFrame([(18, 1, [1, 0, 1], [1, 1, 1])]).toDF('userId','movieId','user_features','movie_features')
df.rdd.map(lambda x: (x[0], x[1], x[2], x[3],
                      float(np.dot(np.array(x[2]), np.array(x[3])) /
                            (np.linalg.norm(np.array(x[2])) * np.linalg.norm(np.array(x[3])))))) \
    .toDF(df.columns + ['cosine_sim']).show()
+------+-------+-------------+--------------+------------------+
|userId|movieId|user_features|movie_features| cosine_sim |
+------+-------+-------------+--------------+------------------+
| 18| 1| [1, 0, 1]| [1, 1, 1]|0.8164965809277259|
+------+-------+-------------+--------------+------------------+
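Applied to your crossJoin dataframe, that single map pass would look something like this (untested sketch, assuming the feature columns are numeric arrays):
df = self.given_user.crossJoin(self.movies_df).select('userId', 'movieId', 'user_features', 'movie_features')
output = df.rdd.map(lambda x: (x[0], x[1],
                               float(np.dot(np.array(x[2]), np.array(x[3])) /
                                     (np.linalg.norm(np.array(x[2])) * np.linalg.norm(np.array(x[3])))))) \
    .toDF(['userId', 'movieId', 'cosine_sim'])
output.orderBy('cosine_sim', ascending=False).show(5)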

Finding most populated cities per country

I need to write code that gives the most populated city per country, together with its population.
Here is the input data:
/** Input data */
val inputDf = Seq(
("Warsaw", "Poland", "1 764 615"),
("Cracow", "Poland", "769 498"),
("Paris", "France", "2 206 488"),
("Villeneuve-Loubet", "France", "15 020"),
("Pittsburgh PA", "United States", "302 407"),
("Chicago IL", "United States", "2 716 000"),
("Milwaukee WI", "United States", "595 351"),
("Vilnius", "Lithuania", "580 020"),
("Stockholm", "Sweden", "972 647"),
("Goteborg", "Sweden", "580 020")
).toDF("name", "country", "population")
println("Input:")
inputDf.show(false)
My solution was:
val topPopulation = inputDf
// .select("name", "country", "population")
.withColumn("population", regexp_replace($"population", " ", "").cast("Integer"))
// .agg(max($"population").alias(("population")))
// .withColumn("population", regexp_replace($"population", " ", "").cast("Integer"))
// .withColumn("country", $"country")
// .withColumn("name", $"name")
// .cast("Integer")
.groupBy("country")
.agg(
max("population").alias("population")
)
.orderBy($"population".desc)
// .orderBy("max(population)")
topPopulation
But I have trouble, because of "Except can only be performed on tables with the same number of columns, but the first table has 2 columns and the second table has 3 columns;;"
Input:
+-----------------+-------------+----------+
|name |country |population|
+-----------------+-------------+----------+
|Warsaw |Poland |1 764 615 |
|Cracow |Poland |769 498 |
|Paris |France |2 206 488 |
|Villeneuve-Loubet|France |15 020 |
|Pittsburgh PA |United States|302 407 |
|Chicago IL |United States|2 716 000 |
|Milwaukee WI |United States|595 351 |
|Vilnius |Lithuania |580 020 |
|Stockholm |Sweden |972 647 |
|Goteborg |Sweden |580 020 |
+-----------------+-------------+----------+
Expected:
+----------+-------------+----------+
|name |country |population|
+----------+-------------+----------+
|Warsaw |Poland |1 764 615 |
|Paris |France |2 206 488 |
|Chicago IL|United States|2 716 000 |
|Vilnius |Lithuania |580 020 |
|Stockholm |Sweden |972 647 |
+----------+-------------+----------+
Actual:
+-------------+----------+
|country |population|
+-------------+----------+
|United States|2716000 |
|France |2206488 |
|Poland |1764615 |
|Sweden |972647 |
|Lithuania |580020 |
+-------------+----------+
Try this-
Load the test data
val inputDf = Seq(
("Warsaw", "Poland", "1 764 615"),
("Cracow", "Poland", "769 498"),
("Paris", "France", "2 206 488"),
("Villeneuve-Loubet", "France", "15 020"),
("Pittsburgh PA", "United States", "302 407"),
("Chicago IL", "United States", "2 716 000"),
("Milwaukee WI", "United States", "595 351"),
("Vilnius", "Lithuania", "580 020"),
("Stockholm", "Sweden", "972 647"),
("Goteborg", "Sweden", "580 020")
).toDF("name", "country", "population")
println("Input:")
inputDf.show(false)
/**
* Input:
* +-----------------+-------------+----------+
* |name |country |population|
* +-----------------+-------------+----------+
* |Warsaw |Poland |1 764 615 |
* |Cracow |Poland |769 498 |
* |Paris |France |2 206 488 |
* |Villeneuve-Loubet|France |15 020 |
* |Pittsburgh PA |United States|302 407 |
* |Chicago IL |United States|2 716 000 |
* |Milwaukee WI |United States|595 351 |
* |Vilnius |Lithuania |580 020 |
* |Stockholm |Sweden |972 647 |
* |Goteborg |Sweden |580 020 |
* +-----------------+-------------+----------+
*/
Find the city in each country having the max population
val topPopulation = inputDf
.withColumn("population", regexp_replace($"population", " ", "").cast("Integer"))
.withColumn("population_name", struct($"population", $"name"))
.groupBy("country")
.agg(max("population_name").as("population_name"))
.selectExpr("country", "population_name.*")
topPopulation.show(false)
topPopulation.printSchema()
/**
* +-------------+----------+----------+
* |country |population|name |
* +-------------+----------+----------+
* |France |2206488 |Paris |
* |Poland |1764615 |Warsaw |
* |Lithuania |580020 |Vilnius |
* |Sweden |972647 |Stockholm |
* |United States|2716000 |Chicago IL|
* +-------------+----------+----------+
*
* root
* |-- country: string (nullable = true)
* |-- population: integer (nullable = true)
* |-- name: string (nullable = true)
*/

Fix query to resolve to_char and or string comparison issue in scala databricks 2.4.3

I've processed a parquet file and created the following data frame in Scala Spark 2.4.3.
+-----------+------------+-----------+--------------+-----------+
| itemno|requestMonth|requestYear|totalRequested|requestDate|
+-----------+------------+-----------+--------------+-----------+
| 7512365| 2| 2014| 110.0| 2014-02-01|
| 7519278| 4| 2013| 96.0| 2013-04-01|
|5436134-070| 12| 2013| 8.0| 2013-12-01|
| 7547385| 1| 2014| 89.0| 2014-01-01|
| 0453978| 9| 2014| 18.0| 2014-09-01|
| 7558402| 10| 2014| 260.0| 2014-10-01|
|5437662-070| 7| 2013| 78.0| 2013-07-01|
| 3089858| 11| 2014| 5.0| 2014-11-01|
| 7181584| 2| 2017| 4.0| 2017-02-01|
| 7081417| 3| 2017| 15.0| 2017-03-01|
| 5814215| 4| 2017| 35.0| 2017-04-01|
| 7178940| 10| 2014| 5.0| 2014-10-01|
| 0450636| 1| 2015| 7.0| 2015-01-01|
| 5133406| 5| 2014| 46.0| 2014-05-01|
| 2204858| 12| 2015| 34.0| 2015-12-01|
| 1824299| 5| 2015| 1.0| 2015-05-01|
|5437474-620| 8| 2015| 4.0| 2015-08-01|
| 3086317| 9| 2014| 1.0| 2014-09-01|
| 2204331| 3| 2015| 2.0| 2015-03-01|
| 5334160| 1| 2018| 2.0| 2018-01-01|
+-----------+------------+-----------+--------------+-----------+
To derive a new feature, I am trying to apply the logic and rearrange the data frame as follows:
itemno – as it is in the above-mentioned data frame
startDate - the start of the season
endDate - the end of the season
totalRequested - number of parts requested in that season
percentageOfRequests - totalRequested in the current season / total over this plus the 3 previous seasons (4 seasons in total)
//seasons date for reference
Spring: 1 March to 31 May.
Summer: 1 June to 31 August.
Autumn: 1 September to 30 November.
Winter: 1 December to 28 February.
What I did:
I tried the following two approaches.
case
when to_char(StartDate,'MMDD') between '0301' and '0531' then 'spring'
.....
.....
end as season
but it didn't work. I used the to_char logic in an Oracle DB and it worked there, but after looking around I found that Spark SQL doesn't have this function. Also, I tried
import org.apache.spark.sql.functions._
val dateDF1 = orvPartRequestsDF.withColumn("MMDD", concat_ws("-", month($"requestDate"), dayofmonth($"requestDate")))
%sql
select distinct requestDate, MMDD,
case
when MMDD between '3-1' and '5-31' then 'Spring'
when MMDD between '6-1' and '8-31' then 'Summer'
when MMDD between '9-1' and '11-30' then 'Autumn'
when MMDD between '12-1' and '2-28' then 'Winter'
end as season
from temporal
and it also didn't work. Could you please let me know what I am missing here (my guess is that I can't compare strings like this, but I am not sure, so I am asking here) and how I can solve it?
After JXC's solution #1 with range between
Since I was seeing some discrepancy, I am sharing the data frame again. The following is the dataframe seasonDF12:
+-------+-----------+--------------+------+----------+
| itemno|requestYear|totalRequested|season|seasonCalc|
+-------+-----------+--------------+------+----------+
|0450000| 2011| 0.0|Winter| 201075|
|0450000| 2011| 0.0|Winter| 201075|
|0450000| 2011| 0.0|Spring| 201100|
|0450000| 2011| 0.0|Spring| 201100|
|0450000| 2011| 0.0|Spring| 201100|
|0450000| 2011| 0.0|Summer| 201125|
|0450000| 2011| 0.0|Summer| 201125|
|0450000| 2011| 0.0|Summer| 201125|
|0450000| 2011| 0.0|Autumn| 201150|
|0450000| 2011| 0.0|Autumn| 201150|
|0450000| 2011| 0.0|Autumn| 201150|
|0450000| 2011| 0.0|Winter| 201175|
|0450000| 2012| 3.0|Winter| 201175|
|0450000| 2012| 1.0|Winter| 201175|
|0450000| 2012| 4.0|Spring| 201200|
|0450000| 2012| 0.0|Spring| 201200|
|0450000| 2012| 0.0|Spring| 201200|
|0450000| 2012| 2.0|Summer| 201225|
|0450000| 2012| 3.0|Summer| 201225|
|0450000| 2012| 2.0|Summer| 201225|
+-------+-----------+--------------+------+----------+
to which I'll apply
val seasonDF2 = seasonDF12.selectExpr("*", """
sum(totalRequested) OVER (
PARTITION BY itemno
ORDER BY seasonCalc
RANGE BETWEEN 100 PRECEDING AND CURRENT ROW
) AS sum_totalRequested
""")
and in the result, if you look at the first value of 40 in the sum_totalRequested column, all the entries above it are 0, so I am not sure why it is 40. I think I already shared it, but I need the above dataframe to be transformed into:
itemno, startDateOfSeason, endDateOfSeason, season, sum_totalRequestedBySeason, ratio (totalRequested in the current season / totalRequested in the last 3 + current seasons)
The final output will be like this:
itemno  startDateOfSeason  endDateOfSeason  season  sum_totalRequestedBySeason  ratio
123     12/01/2018         02/28/2019       winter  12                          12 / (12 + 36)   (36 from the previous three seasons)
123     03/01/2019         05/31/2019       spring  24                          24 / (24 + 45)   (45 from the previous three seasons)
Edit-2: adjusted to calculate the sums grouped by season first, and then the Window aggregate sum:
Edit-1: Based on the comments, the named season is not required. We can encode Spring, Summer, Autumn and Winter as 0, 25, 50 and 75 respectively, so the season label becomes an integer equal to year(requestDate)*100 plus that offset; this lets us use rangeBetween (offset = -100 for the current + the previous 3 seasons) in Window aggregate functions:
Note: below is pyspark code:
df.createOrReplaceTempView("df_table")
df1 = spark.sql("""
WITH t1 AS ( SELECT *
, year(requestDate) as YY
, date_format(requestDate, "MMdd") as MMDD
FROM df_table )
, t2 AS ( SELECT *,
CASE
WHEN MMDD BETWEEN '0301' AND '0531' THEN
named_struct(
'startDateOfSeason', date(concat_ws('-', YY, '03-01'))
, 'endDateOfSeason', date(concat_ws('-', YY, '05-31'))
, 'season', 'spring'
, 'label', int(YY)*100
)
WHEN MMDD BETWEEN '0601' AND '0831' THEN
named_struct(
'startDateOfSeason', date(concat_ws('-', YY, '06-01'))
, 'endDateOfSeason', date(concat_ws('-', YY, '08-31'))
, 'season', 'summer'
, 'label', int(YY)*100 + 25
)
WHEN MMDD BETWEEN '0901' AND '1130' THEN
named_struct(
'startDateOfSeason', date(concat_ws('-', YY, '09-01'))
, 'endDateOfSeason', date(concat_ws('-', YY, '11-30'))
, 'season', 'autumn'
, 'label', int(YY)*100 + 50
)
WHEN MMDD BETWEEN '1201' AND '1231' THEN
named_struct(
'startDateOfSeason', date(concat_ws('-', YY, '12-01'))
, 'endDateOfSeason', last_day(concat_ws('-', int(YY)+1, '02-28'))
, 'season', 'winter'
, 'label', int(YY)*100 + 75
)
WHEN MMDD BETWEEN '0101' AND '0229' THEN
named_struct(
'startDateOfSeason', date(concat_ws('-', int(YY)-1, '12-01'))
, 'endDateOfSeason', last_day(concat_ws('-', YY, '02-28'))
, 'season', 'winter'
, 'label', (int(YY)-1)*100 + 75
)
END AS seasons
FROM t1
)
SELECT itemno
, seasons.*
, sum(totalRequested) AS sum_totalRequestedBySeason
FROM t2
GROUP BY itemno, seasons
""")
This will get the following result:
df1.show()
+-----------+-----------------+---------------+------+------+--------------------------+
| itemno|startDateOfSeason|endDateOfSeason|season| label|sum_totalRequestedBySeason|
+-----------+-----------------+---------------+------+------+--------------------------+
|5436134-070| 2013-12-01| 2013-12-31|winter|201375| 8.0|
| 1824299| 2015-03-01| 2015-05-31|spring|201500| 1.0|
| 0453978| 2014-09-01| 2014-11-30|autumn|201450| 18.0|
| 7181584| 2017-01-01| 2017-02-28|winter|201675| 4.0|
| 7178940| 2014-09-01| 2014-11-30|autumn|201450| 5.0|
| 7547385| 2014-01-01| 2014-02-28|winter|201375| 89.0|
| 5814215| 2017-03-01| 2017-05-31|spring|201700| 35.0|
| 3086317| 2014-09-01| 2014-11-30|autumn|201450| 1.0|
| 0450636| 2015-01-01| 2015-02-28|winter|201475| 7.0|
| 2204331| 2015-03-01| 2015-05-31|spring|201500| 2.0|
|5437474-620| 2015-06-01| 2015-08-31|summer|201525| 4.0|
| 5133406| 2014-03-01| 2014-05-31|spring|201400| 46.0|
| 7081417| 2017-03-01| 2017-05-31|spring|201700| 15.0|
| 7519278| 2013-03-01| 2013-05-31|spring|201300| 96.0|
| 7558402| 2014-09-01| 2014-11-30|autumn|201450| 260.0|
| 2204858| 2015-12-01| 2015-12-31|winter|201575| 34.0|
|5437662-070| 2013-06-01| 2013-08-31|summer|201325| 78.0|
| 5334160| 2018-01-01| 2018-02-28|winter|201775| 2.0|
| 3089858| 2014-09-01| 2014-11-30|autumn|201450| 5.0|
| 7512365| 2014-01-01| 2014-02-28|winter|201375| 110.0|
+-----------+-----------------+---------------+------+------+--------------------------+
After we have the season totals, we calculate the sum of the current plus previous 3 seasons using a Window aggregate function, and then the ratio:
df1.selectExpr("*", """
round(sum_totalRequestedBySeason/sum(sum_totalRequestedBySeason) OVER (
PARTITION BY itemno
ORDER BY label
RANGE BETWEEN 100 PRECEDING AND CURRENT ROW
),2) AS ratio_of_current_over_current_plus_past_3_seasons
""").show()