Parse the XML column into multiple columns and transpose into rows based on count in Spark DataFrame - scala

I have a scenario where the XML column response_output has an OrderCount and an orders element with the corresponding order details.
For example, in the XML below OrderCount is 4, and under orders there are 4 order entries:
<USR_ORD><OrderResponse><OrderResult>
<OrderCount>4</OrderCount>
<ORDTime>2021-02-02 21:13:12</ORDTime><ORDStatus>COMPLETE</ORDStatus>
<ORDValue>
<USR1OrderTotalTime>221</USR1OrderTotalTime><USR1OrderKYC>{ND}</USR1OrderKYC><USR1OrderLoc>{ND}</USR1OrderLoc>
<orders>
<order><name>MR RITA SOMA</name><address>606 JAL TXS</address><tracknumber>7825225</tracknumber><status>UNK</status></order>
<order><name>MR RITA SOMA</name><address>1 BAL, HAL</address><tracknumber>7825226</tracknumber><status>FAIL</status></order>
<order><name>MR RODREX SOMA</name><address>18, GHC,BAN</address><tracknumber>7825224</tracknumber><status>SUC</status></order>
<order><name>MR RITA SOMA</name><address>1 BAL, HAL</address><tracknumber>7825223</tracknumber><status>SUC</status></order>
</orders>
<USR1Orderqnt>10</USR1Orderqnt><USR1Orderxyz>0</USR1Orderxyz><USR1OrderD>{ND}</USR1OrderD>
</ORDValue>
</OrderResult></OrderResponse></USR_ORD>
I need to retrieve records based on OrderCount: if OrderCount is 4, I need to iterate over orders 4 times and fetch 4 records with all the order details; if OrderCount is 1, I need to fetch 1 record with its order details.
Could anyone help me with a Spark 2 / Scala solution for this?
Source data:
+-----------+-----------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|customer_id|response_id|response_output |
+-----------+-----------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|100 |1 |<USR_ORD><OrderResponse><OrderResult><OrderCount>1</OrderCount><ORDTime>2021-02-02 10:34:19</ORDTime><ORDStatus>COMPLETE</ORDStatus><ORDValue><USR1OrderTotalTime>321</USR1OrderTotalTime><USR1OrderKYC>{ND}</USR1OrderKYC><USR1OrderLoc>{ND}</USR1OrderLoc><orders><order><name>MRS MITA PERS</name><address>17 MAXI RD CHN</address><tracknumber>7825222</tracknumber><status>FAIL</status><amount>4500</amount><orderdate>2019-10-18</orderdate></order></orders><USR1Orderqnt>10</USR1Orderqnt><USR1Orderxyz>0</USR1Orderxyz><USR1OrderD>{ND}</USR1OrderD></ORDValue></OrderResult></OrderResponse></USR_ORD> |
|200 |1 |<USR_ORD><OrderResponse><OrderResult><OrderCount>4</OrderCount><ORDTime>2021-02-02 21:13:12</ORDTime><ORDStatus>COMPLETE</ORDStatus><ORDValue><USR1OrderTotalTime>221</USR1OrderTotalTime><USR1OrderKYC>{ND}</USR1OrderKYC><USR1OrderLoc>{ND}</USR1OrderLoc><orders><order><name>MR RITA SOMA</name><address>606 JAL TXS</address><tracknumber>7825225</tracknumber><status>UNK</status><amount>1030</amount><orderdate>2020-11-16</orderdate></order><order><name>MR RITA SOMA</name><address>1 BAL, HAL</address><tracknumber>7825226</tracknumber><status>FAIL</status><amount>8000</amount><orderdate>2018-07-17</orderdate></order><order><name>MR RODREX SOMA</name><address>18, GHC, BAN</address><tracknumber>7825224</tracknumber><status>SUC</status><amount>2500</amount><orderdate>2017-09-16</orderdate></order><order><name>MR RITA SOMA</name><address>1 BAL, HAL</address><tracknumber>7825223</tracknumber><status>SUC</status><amount>2700</amount><orderdate>2017-04-22</orderdate></order></orders><USR1Orderqnt>10</USR1Orderqnt><USR1Orderxyz>0</USR1Orderxyz><USR1OrderD>{ND}</USR1OrderD></ORDValue></OrderResult></OrderResponse></USR_ORD>|
+-----------+-----------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
When I run the SQL below I get the result shown, but I need to get 4 records for customer_id 200, since the count is 4, with the corresponding order details.
spark.sql("""select
| customer_id,
| xpath_string(response_output,'USR_ORD/OrderResponse/OrderResult/OrderCount') as OrderCount,
| xpath_string(response_output,'USR_ORD/OrderResponse/OrderResult/ORDValue/orders/order/name') as name,
| xpath_string(response_output,'USR_ORD/OrderResponse/OrderResult/ORDValue/orders/order/address') as address,
| xpath_string(response_output,'USR_ORD/OrderResponse/OrderResult/ORDValue/orders/order/tracknumber') as tracknumber,
| xpath_string(response_output,'USR_ORD/OrderResponse/OrderResult/ORDValue/orders/order/status') as status
| from cust_tbl""").show()
Result I am getting:
+-----------+----------+-------------+--------------+-----------+------+
|customer_id|OrderCount| name| address|tracknumber|status|
+-----------+----------+-------------+--------------+-----------+------+
| 100| 1|MRS MITA PERS|17 MAXI RD CHN| 7825222| FAIL|
| 200| 4| MR RITA SOMA| 606 JAL TXS| 7825225| UNK|
+-----------+----------+-------------+--------------+-----------+------+
Expected output:
+-----------+----------+--------------+--------------+-----------+------+
|customer_id|OrderCount|name          |address       |tracknumber|status|
+-----------+----------+--------------+--------------+-----------+------+
|200        |4         |MR RITA SOMA  |606 JAL TXS   |7825225    |UNK   |
|200        |4         |MR RITA SOMA  |1 BAL, HAL    |7825226    |FAIL  |
|200        |4         |MR RODREX SOMA|18, GHC, BAN  |7825224    |SUC   |
|200        |4         |MR RITA SOMA  |1 BAL, HAL    |7825223    |SUC   |
|100        |1         |MRS MITA PERS |17 MAXI RD CHN|7825222    |FAIL  |
+-----------+----------+--------------+--------------+-----------+------+

The function xpath_string extracts a single string value for the given XPath expression. For your case, you need xpath to get an array of node values for each order detail (name, status, ...) and zip them together using arrays_zip:
import org.apache.spark.sql.functions.{arrays_zip, explode, expr}

val df1 = df.withColumn(
  "OrderCount",
  expr("xpath_string(response_output, 'USR_ORD/OrderResponse/OrderResult/OrderCount')")
).withColumn(
  "orders",
  explode(
    arrays_zip(
      expr("xpath(response_output, '/USR_ORD/OrderResponse/OrderResult/ORDValue/orders/order/name/text()')"),
      expr("xpath(response_output, '/USR_ORD/OrderResponse/OrderResult/ORDValue/orders/order/address/text()')"),
      expr("xpath(response_output, '/USR_ORD/OrderResponse/OrderResult/ORDValue/orders/order/tracknumber/text()')"),
      expr("xpath(response_output, '/USR_ORD/OrderResponse/OrderResult/ORDValue/orders/order/status/text()')")
    ).cast("array<struct<name:string,address:string,tracknumber:string,status:string>>")
  )
).select("customer_id", "OrderCount", "orders.*")

df1.show(false)
//+-----------+----------+--------------+--------------+-----------+------+
//|customer_id|OrderCount|name |address |tracknumber|status|
//+-----------+----------+--------------+--------------+-----------+------+
//|100 |1 |MRS MITA PERS |17 MAXI RD CHN|7825222 |FAIL |
//|200 |4 |MR RITA SOMA |606 JAL TXS |7825225 |UNK |
//|200 |4 |MR RITA SOMA |1 BAL, HAL |7825226 |FAIL |
//|200 |4 |MR RODREX SOMA|18, GHC, BAN |7825224 |SUC |
//|200 |4 |MR RITA SOMA |1 BAL, HAL |7825223 |SUC |
//+-----------+----------+--------------+--------------+-----------+------+
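If you would rather stay in Spark SQL like your original query, roughly the same logic can be written with a LATERAL VIEW over arrays_zip (Spark 2.4+). This is only a sketch and assumes the DataFrame is registered as the cust_tbl view used in the question:
df.createOrReplaceTempView("cust_tbl")
spark.sql("""
  select
    customer_id,
    xpath_string(response_output, 'USR_ORD/OrderResponse/OrderResult/OrderCount') as OrderCount,
    o.name, o.address, o.tracknumber, o.status
  from cust_tbl
  lateral view explode(
    cast(arrays_zip(
      xpath(response_output, '/USR_ORD/OrderResponse/OrderResult/ORDValue/orders/order/name/text()'),
      xpath(response_output, '/USR_ORD/OrderResponse/OrderResult/ORDValue/orders/order/address/text()'),
      xpath(response_output, '/USR_ORD/OrderResponse/OrderResult/ORDValue/orders/order/tracknumber/text()'),
      xpath(response_output, '/USR_ORD/OrderResponse/OrderResult/ORDValue/orders/order/status/text()')
    ) as array<struct<name:string,address:string,tracknumber:string,status:string>>)
  ) t as o
""").show(false)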
Update
For Spark < 2.4, you can posexplode each array column and join on the index:
import org.apache.spark.sql.functions.{col, expr}

val df1 = df.withColumn(
  "OrderCount",
  expr("xpath_string(response_output, 'USR_ORD/OrderResponse/OrderResult/OrderCount')")
).select(
  col("customer_id"),
  col("OrderCount"),
  expr("xpath(response_output, '/USR_ORD/OrderResponse/OrderResult/ORDValue/orders/order/name/text()')").as("name"),
  expr("xpath(response_output, '/USR_ORD/OrderResponse/OrderResult/ORDValue/orders/order/address/text()')").as("address"),
  expr("xpath(response_output, '/USR_ORD/OrderResponse/OrderResult/ORDValue/orders/order/tracknumber/text()')").as("tracknumber"),
  expr("xpath(response_output, '/USR_ORD/OrderResponse/OrderResult/ORDValue/orders/order/status/text()')").as("status")
)

val result = df1.selectExpr("customer_id", "OrderCount", "posexplode(name) as (idx, name)")
  .join(
    df1.selectExpr("customer_id", "posexplode(address) as (idx, address)"),
    Seq("idx", "customer_id")
  ).join(
    df1.selectExpr("customer_id", "posexplode(tracknumber) as (idx, tracknumber)"),
    Seq("idx", "customer_id")
  ).join(
    df1.selectExpr("customer_id", "posexplode(status) as (idx, status)"),
    Seq("idx", "customer_id")
  ).drop("idx")
result.show(false)
//+-----------+----------+--------------+--------------+-----------+------+
//|customer_id|OrderCount|name |address |tracknumber|status|
//+-----------+----------+--------------+--------------+-----------+------+
//|100 |1 |MRS MITA PERS |17 MAXI RD CHN|7825222 |FAIL |
//|200 |4 |MR RITA SOMA |606 JAL TXS |7825225 |UNK |
//|200 |4 |MR RITA SOMA |1 BAL, HAL |7825226 |FAIL |
//|200 |4 |MR RODREX SOMA|18, GHC, BAN |7825224 |SUC |
//|200 |4 |MR RITA SOMA |1 BAL, HAL |7825223 |SUC |
//+-----------+----------+--------------+--------------+-----------+------+
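Another option for Spark < 2.4, if you want to avoid the self joins, is to zip the four xpath arrays with a small UDF and explode once. This is a sketch that reuses the df1 with the name/address/tracknumber/status array columns built above; the Order case class and zipOrders UDF are names introduced here for illustration:
import org.apache.spark.sql.functions.{col, explode, udf}

case class Order(name: String, address: String, tracknumber: String, status: String)

// zip the four parallel arrays into a single array of Order structs
val zipOrders = udf { (n: Seq[String], a: Seq[String], t: Seq[String], s: Seq[String]) =>
  n.indices.map(i => Order(n(i), a(i), t(i), s(i)))
}

val result2 = df1
  .withColumn("order", explode(zipOrders(col("name"), col("address"), col("tracknumber"), col("status"))))
  .select("customer_id", "OrderCount", "order.*")

result2.show(false)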

Related

How to implement UniqueCount in Spark Scala

I am trying to implement uniqueCount in Spark Scala.
Below is the transformation I am trying to implement:
case when ([last_revision]=1) and ([source]="AR") then UniqueCount([review_uuid]) OVER ([encounter_id]) end
Input
|last_revision|source|review_uuid |encounter_id|
|-------------|------|--------------|------------|
|1 |AR |123-1234-12345|7654 |
|1 |AR |123-7890-45678|7654 |
|1 |MR |789-1234-12345|7654 |
Expected Output
|last_revision|source|review_uuid |encounter_id|reviews_per_encounter|
|-------------|------|--------------|------------|---------------------|
|1 |AR |123-1234-12345|7654 |2 |
|1 |AR |123-7890-45678|7654 |2 |
|1 |MR |789-1234-12345|7654 |null |
My code:
.withColumn("reviews_per_encounter", when(col("last_revision") === "1" && col("source") === "AR", size(collect_set(col("review_uuid")).over(Window.partitionBy(col("encounter_id"))))))
My output:
|last_revision|source|review_uuid |encounter_id|reviews_per_encounter|
|-------------|------|--------------|------------|---------------------|
|1 |AR |123-1234-12345|7654 |3 |
|1 |AR |123-7890-45678|7654 |3 |
|1 |MR |789-1234-12345|7654 |null |
Schema:
last_revision : integer
source : string
review_uuid : string
encounter_id : string
reviews_per_encounter : integer
In place of 2 (expected) I am getting the value 3; I am not sure what mistake I am making here.
Please help. Thanks.
The output makes perfect sense. As I commented, this is because:
size(collect_set(col("review_uuid")))
means:
give me the count of unique review_uuids in the whole dataframe (result: 3)
What you're looking for is:
give me the count of unique review_uuids only for the rows where source is "AR" and last_revision is 1 (result: 2)
Notice the difference: this doesn't actually need window functions or over. You can achieve it with a subquery or a self join; here is how to do it with a self left join:
df.join(
  df.where(col("last_revision") === lit(1) && col("source") === "AR")
    .select(count_distinct(col("review_uuid")) as "reviews_per_encounter"),
  col("last_revision") === lit(1) && col("source") === "AR",
  "left"
)
Output:
+-------------+------+-----------+------------+---------------------+
|last_revision|source|review_uuid|encounter_id|reviews_per_encounter|
+-------------+------+-----------+------------+---------------------+
| 1| AR| 12345| 7654| 2|
| 1| AR| 45678| 7654| 2|
| 1| MR| 78945| 7654| null|
+-------------+------+-----------+------------+---------------------+
(I used some random uuid's, they were too long to copy :) )
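If you prefer to keep the window shape of your original attempt, a sketch of the same idea: feed only the matching rows' uuids into collect_set (collect_set skips nulls) and take its size per encounter_id. Column names are taken from your question:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, collect_set, size, when}

val isAr = col("last_revision") === 1 && col("source") === "AR"
val w = Window.partitionBy("encounter_id")

df.withColumn(
  "reviews_per_encounter",
  when(isAr, size(collect_set(when(isAr, col("review_uuid"))).over(w)))
)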

Convert a column from StringType to Json (object)

Here is some sample data:
val df4 = sc.parallelize(List(
  ("A1", 45, "5", 1, 90),
  ("A2", 60, "1", 1, 120),
  ("A6", 30, "9", 1, 450),
  ("A7", 89, "7", 1, 333),
  ("A7", 89, "4", 1, 320),
  ("A2", 60, "5", 1, 22),
  ("A1", 45, "22", 1, 1)
)).toDF("CID", "age", "children", "marketplace_id", "value")
Thanks to @Shu for this piece of code:
val df5 = df4.selectExpr("CID","""to_json(named_struct("id", children)) as item""", "value", "marketplace_id")
+---+-----------+-----+--------------+
|CID|item |value|marketplace_id|
+---+-----------+-----+--------------+
|A1 |{"id":"5"} |90 |1 |
|A2 |{"id":"1"} |120 |1 |
|A6 |{"id":"9"} |450 |1 |
|A7 |{"id":"7"} |333 |1 |
|A7 |{"id":"4"} |320 |1 |
|A2 |{"id":"5"} |22 |1 |
|A1 |{"id":"22"}|1 |1 |
+---+-----------+-----+--------------+
When you do df5.dtypes you get:
(CID,StringType), (item,StringType), (value,IntegerType), (marketplace_id,IntegerType)
The column item is of string type; is there a way it can be of a json/object type (if that is a thing)?
EDIT 1:
I will describe what I am trying to achieve here; the two steps above remain the same.
val w = Window.partitionBy("CID").orderBy(desc("value"))
val sorted_list = df5.withColumn("item", collect_list("item").over(w)).groupBy("CID").agg(max("item") as "item")
Output:
+---+-------------------------+
|CID|item |
+---+-------------------------+
|A6 |[{"id":"9"}] |
|A2 |[{"id":"1"}, {"id":"5"}] |
|A7 |[{"id":"7"}, {"id":"4"}] |
|A1 |[{"id":"5"}, {"id":"22"}]|
+---+-------------------------+
Now whatever is inside [ ] is a string, which is causing a problem for one of the tools we are using.
Pardon me if this is a basic question; I am new to Scala and Spark.
Store the JSON data using a struct type; check the code below.
scala> dfa
.withColumn("item_without_json",struct($"cid".as("id")))
.withColumn("item_as_json",to_json($"item_without_json"))
.show(false)
+---+-----------+-----+--------------+-----------------+------------+
|CID|item |value|marketplace_id|item_without_json|item_as_json|
+---+-----------+-----+--------------+-----------------+------------+
|A1 |{"id":"A1"}|90 |1 |[A1] |{"id":"A1"} |
|A2 |{"id":"A2"}|120 |1 |[A2] |{"id":"A2"} |
|A6 |{"id":"A6"}|450 |1 |[A6] |{"id":"A6"} |
|A7 |{"id":"A7"}|333 |1 |[A7] |{"id":"A7"} |
|A7 |{"id":"A7"}|320 |1 |[A7] |{"id":"A7"} |
|A2 |{"id":"A2"}|22 |1 |[A2] |{"id":"A2"} |
|A1 |{"id":"A1"}|1 |1 |[A1] |{"id":"A1"} |
+---+-----------+-----+--------------+-----------------+------------+
Based on your comment about converting the whole dataset to JSON, you would use:
df4
  .select(collect_list(struct($"CID".as("id"))).as("items"))
  .write
  .json(path)
The output will look like:
{"items":[{"id":"A1"},{"id":"A2"},{"id":"A6"},{"id":"A7"}, ...]}
If you need the result in memory to pass down to a function, use toJSON instead of write.json(...).
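For the EDIT part, one way to keep the elements as objects rather than strings is to build item as a struct (no to_json) before collecting, so the aggregated column is an array<struct<id:string>> instead of an array of JSON strings. A sketch reusing your df4 and window, assuming your downstream tool can read the struct type:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, collect_list, desc, max, struct}

val w = Window.partitionBy("CID").orderBy(desc("value"))

val sorted_list = df4
  .withColumn("item", struct(col("children").as("id")))  // keep item as a struct, not a JSON string
  .withColumn("item", collect_list("item").over(w))
  .groupBy("CID")
  .agg(max("item") as "item")                            // item is array<struct<id:string>>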

Join with uneven columns

I have two dataframes structured the following way:
|Source|#Users|#Clicks|Hour|Type
and
Type|Total # Users|Hour
I'd like to join these dataframes based on hour; however, the first dataframe is at a deeper granularity than the second and therefore has more rows. Basically I want a dataframe where I have
|Source|#Users|#Clicks|Hour|Type|Total # Users
where Total # Users comes from the second dataframe. Any suggestions? I think maybe I want to use map?
Edit:
Here's an example
DF1
|Source|#Users|#Clicks|Hour|Type
|Prod1 |50 |3 |01 |Internet
|Prod2 |10 |2 |07 |iOS
|Prod3 |1 |50 |07 |Internet
|Prod2 |3 |2 |07 |Internet
|Prod3 |8 |2 |05 |Internet
DF2
|Type |Total #Users|Hour
|Internet|100 |01
|iOS |500 |01
|Internet|300 |07
|Internet|15 |05
|iOS |20 |07
Result
|Source|#Users|#Clicks|Hour|Type |Total #Users
|Prod1 |50 |3 |01 |Internet|100
|Prod2 |10 |2 |07 |iOS |20
|Prod3 |1 |50 |07 |Internet|300
|Prod2 |3 |2 |07 |Internet|300
|Prod3 |8 |2 |05 |Internet|15
That's a left join you're trying to do:
df1.join(df2, df1("Hour") === df2("Hour") && df1("Type") === df2("Type"), "left_outer")
Short version: a left join keeps all the rows from df1 and joins them, on the condition, with the matching rows of df2 when there is a match (null when there is none, duplicates when there are several matches).
More info on Pyspark join
More info on SQL Joins types
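A minimal runnable sketch with the sample data from the edit; joining on Seq("Hour", "Type") also avoids duplicated Hour/Type columns in the output:
import spark.implicits._ // spark is the active SparkSession

val df1 = Seq(
  ("Prod1", 50, 3, "01", "Internet"),
  ("Prod2", 10, 2, "07", "iOS"),
  ("Prod3", 1, 50, "07", "Internet"),
  ("Prod2", 3, 2, "07", "Internet"),
  ("Prod3", 8, 2, "05", "Internet")
).toDF("Source", "#Users", "#Clicks", "Hour", "Type")

val df2 = Seq(
  ("Internet", 100, "01"),
  ("iOS", 500, "01"),
  ("Internet", 300, "07"),
  ("Internet", 15, "05"),
  ("iOS", 20, "07")
).toDF("Type", "Total #Users", "Hour")

df1.join(df2, Seq("Hour", "Type"), "left").show(false)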

How can I do map reduce on spark dataframe group by conditional columns?

My spark dataframe looks like this:
+------+------+-------+------+
|userid|useid1|userid2|score |
+------+------+-------+------+
|23 |null |dsad |3 |
|11 |44 |null |4 |
|231 |null |temp |5 |
|231 |null |temp |2 |
+------+------+-------+------+
I want to do the calculation for each pair of userid and useid1/userid2 (whichever is not null).
If it's useid1, I multiply the score by 5; if it's userid2, I multiply the score by 3.
Finally, I want to add up all the scores for each pair.
The result should be:
+------+--------+-----------+
|userid|useid1/2|final score|
+------+--------+-----------+
|23 |dsad |9 |
|11 |44 |20 |
|231 |temp |21 |
+------+--------+-----------+
How can I do this?
For the groupBy part, I know the dataframe has a groupBy function, but I don't know if I can use it conditionally, e.g. if useid1 is null, groupBy(userid, userid2); if userid2 is null, groupBy(userid, useid1).
For the calculation part, how do I multiply by 3 or 5 based on the condition?
The solution below will help solve your problem.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val groupByUserWinFun = Window.partitionBy("userid", "useid1/2")

val finalScoreDF = userDF
  .withColumn("useid1/2", when($"useid1".isNull, $"userid2").otherwise($"useid1"))
  .withColumn("finalscore", when($"useid1".isNull, $"score" * 3).otherwise($"score" * 5))
  .withColumn("finalscore", sum("finalscore").over(groupByUserWinFun))
  .select("userid", "useid1/2", "finalscore")
  .distinct()
Using the when method in Spark SQL, select useid1 or userid2 and multiply the score based on the condition.
Output:
+------+--------+----------+
|userid|useid1/2|finalscore|
+------+--------+----------+
| 11 | 44| 20.0|
| 23 | dsad| 9.0|
| 231| temp| 21.0|
+------+--------+----------+
Group by will work:
val original = Seq(
(23, null, "dsad", 3),
(11, "44", null, 4),
(231, null, "temp", 5),
(231, null, "temp", 2)
).toDF("userid", "useid1", "userid2", "score")
// action
val result = original
.withColumn("useid1/2", coalesce($"useid1", $"userid2"))
.withColumn("score", $"score" * when($"useid1".isNotNull, 5).otherwise(3))
.groupBy("userid", "useid1/2")
.agg(sum("score").alias("final score"))
result.show(false)
Output:
+------+--------+-----------+
|userid|useid1/2|final score|
+------+--------+-----------+
|23 |dsad |9 |
|231 |temp |21 |
|11 |44 |20 |
+------+--------+-----------+
coalesce will do what you need:
df.withColumn("useid1/2", coalesce(col("useid1"), col("userid2")))
Basically, this function returns the first non-null value in the given order.
documentation :
COALESCE(T v1, T v2, ...)
Returns the first v that is not NULL, or NULL if all v's are NULL.
It needs an import: import org.apache.spark.sql.functions.coalesce
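A quick illustration of coalesce on the rows from the question (df is assumed to be the dataframe with the userid/useid1/userid2/score columns shown above):
import org.apache.spark.sql.functions.{coalesce, col}

df.select(col("userid"), coalesce(col("useid1"), col("userid2")).as("useid1/2"), col("score")).show()
// +------+--------+-----+
// |userid|useid1/2|score|
// +------+--------+-----+
// |    23|    dsad|    3|
// |    11|      44|    4|
// |   231|    temp|    5|
// |   231|    temp|    2|
// +------+--------+-----+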

append two dataframes and update data

Hello guys, I want to update an old dataframe based on the pos_id and article_id fields.
If the tuple (pos_id, article_id) exists, I add each column to the old one; if it doesn't exist, I add the new row. That worked fine. But I don't know how to deal with the case where the old dataframe is initially empty; in that case I want to add the new rows from the second dataframe to the old one. Here is what I did:
import org.apache.spark.sql.types.{DateType, DoubleType, LongType}

val histocaisse = spark.read
  .format("csv")
  .option("header", "true") // reading the headers
  .load("C:/Users/MHT/Desktop/histocaisse_dte1.csv")

val hist = histocaisse
  .withColumn("pos_id", 'pos_id.cast(LongType))
  .withColumn("article_id", 'article_id.cast(LongType))
  .withColumn("date", 'date.cast(DateType))
  .withColumn("qte", 'qte.cast(DoubleType))
  .withColumn("ca", 'ca.cast(DoubleType))

val histocaisse2 = spark.read
  .format("csv")
  .option("header", "true") // reading the headers
  .load("C:/Users/MHT/Desktop/histocaisse_dte2.csv")

val hist2 = histocaisse2
  .withColumn("pos_id", 'pos_id.cast(LongType))
  .withColumn("article_id", 'article_id.cast(LongType))
  .withColumn("date", 'date.cast(DateType))
  .withColumn("qte", 'qte.cast(DoubleType))
  .withColumn("ca", 'ca.cast(DoubleType))

hist2.show(false)
hist2.show(false)
+------+----------+----------+----+----+
|pos_id|article_id|date |qte |ca |
+------+----------+----------+----+----+
|1 |1 |2000-01-07|2.5 |3.5 |
|2 |2 |2000-01-07|14.7|12.0|
|3 |3 |2000-01-07|3.5 |1.2 |
+------+----------+----------+----+----+
+------+----------+----------+----+----+
|pos_id|article_id|date |qte |ca |
+------+----------+----------+----+----+
|1 |1 |2000-01-08|2.5 |3.5 |
|2 |2 |2000-01-08|14.7|12.0|
|3 |3 |2000-01-08|3.5 |1.2 |
|4 |4 |2000-01-08|3.5 |1.2 |
|5 |5 |2000-01-08|14.5|1.2 |
|6 |6 |2000-01-08|2.0 |1.25|
+------+----------+----------+----+----+
+------+----------+----------+----+----+
|pos_id|article_id|date |qte |ca |
+------+----------+----------+----+----+
|1 |1 |2000-01-08|5.0 |7.0 |
|2 |2 |2000-01-08|39.4|24.0|
|3 |3 |2000-01-08|7.0 |2.4 |
|4 |4 |2000-01-08|3.5 |1.2 |
|5 |5 |2000-01-08|14.5|1.2 |
|6 |6 |2000-01-08|2.0 |1.25|
+------+----------+----------+----+----+
Here is the solution I found:
val df = hist2.join(hist1, Seq("article_id", "pos_id"), "left")
.select($"pos_id", $"article_id",
coalesce(hist2("date"), hist1("date")).alias("date"),
(coalesce(hist2("qte"), lit(0)) + coalesce(hist1("qte"), lit(0))).alias("qte"),
(coalesce(hist2("ca"), lit(0)) + coalesce(hist1("ca"), lit(0))).alias("ca"))
.orderBy("pos_id", "article_id")
This doesn't work when hist1 is empty. Any help please?
Thanks a lot.
Not sure if I understood correctly, but if the problem is that hist1 is sometimes empty and that makes the join crash, something you can try is this:
import scala.util.{Failure, Success, Try}

val checkHist1Empty = Try(hist1.first)

val df = checkHist1Empty match {
  case Success(_) =>
    hist2.join(hist1, Seq("article_id", "pos_id"), "left")
      .select($"pos_id", $"article_id",
        coalesce(hist2("date"), hist1("date")).alias("date"),
        (coalesce(hist2("qte"), lit(0)) + coalesce(hist1("qte"), lit(0))).alias("qte"),
        (coalesce(hist2("ca"), lit(0)) + coalesce(hist1("ca"), lit(0))).alias("ca"))
      .orderBy("pos_id", "article_id")
  case Failure(_) =>
    hist2.select($"pos_id", $"article_id",
      coalesce(hist2("date")).alias("date"),
      coalesce(hist2("qte"), lit(0)).alias("qte"),
      coalesce(hist2("ca"), lit(0)).alias("ca"))
      .orderBy("pos_id", "article_id")
}
This basically checks whether hist1 is empty before performing the join. If it is empty, it builds the df with the same logic but applied only to the hist2 dataframe. If it contains data, it applies the logic you had, which you said works.
Instead of doing a join, why don't you do a union of the two dataframes and then groupBy (pos_id, article_id), aggregating sums for qte and ca?
val df3 = df1.unionAll(df2) // or df1.union(df2) on Spark 2.x
val df4 = df3.groupBy("pos_id", "article_id")
  .agg(max("date").as("date"), sum("qte").as("qte"), sum("ca").as("ca"))
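Another sketch that covers the empty hist1 case without checking it first: a full outer join on the keys, summing with coalesce so a missing side counts as 0 (this assumes both dataframes have the same schema, even when one of them is empty):
import spark.implicits._ // for the $-notation; spark is the active SparkSession
import org.apache.spark.sql.functions.{coalesce, lit}

val merged = hist1.join(hist2, Seq("pos_id", "article_id"), "full_outer")
  .select(
    $"pos_id", $"article_id",
    coalesce(hist2("date"), hist1("date")).alias("date"),
    (coalesce(hist1("qte"), lit(0)) + coalesce(hist2("qte"), lit(0))).alias("qte"),
    (coalesce(hist1("ca"), lit(0)) + coalesce(hist2("ca"), lit(0))).alias("ca")
  )
  .orderBy("pos_id", "article_id")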