Merge two columns of array of structs based on a key - scala

I have a dataframe of schema as below:
input dataframe
|-- A: string (nullable = true)
|-- B_2020: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- x: double (nullable = true)
| | |-- y: double (nullable = true)
| | |-- z: double (nullable = true)
|-- B_2019: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- x: double (nullable = true)
| | |-- y: double (nullable = true)
I want to merge 2020 and 2019 columns into one column of array of structs as well based on the matching key value.
Desired schema:
expected output dataframe
|-- A: string (nullable = true)
|-- B: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- x_this_year: double (nullable = true)
| | |-- y_this_year: double (nullable = true)
| | |-- x_last_year: double (nullable = true)
| | |-- y_last_year: double (nullable = true)
| | |-- z_this_year: double (nullable = true)
I would like to merge on the matching key in the structs. Also note, if there is a key present only in one of 2019 or 2020 data, then null need to be used to substitute the values of the other year in merged column.

scala> val df = Seq(
| ("ABC",
| Seq(
| ("a", 2, 4, 6),
| ("b", 3, 6, 9),
| ("c", 1, 2, 3)
| ),
| Seq(
| ("a", 4, 8),
| ("d", 3, 4)
| ))
| ).toDF("A", "B_2020", "B_2019").select(
| $"A",
| $"B_2020" cast "array<struct<key:string,x:double,y:double,z:double>>",
| $"B_2019" cast "array<struct<key:string,x:double,y:double>>"
| )
df: org.apache.spark.sql.DataFrame = [A: string, B_2020: array<struct<key:string,x:double,y:double,z:double>> ... 1 more field]
scala> df.printSchema
root
|-- A: string (nullable = true)
|-- B_2020: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- x: double (nullable = true)
| | |-- y: double (nullable = true)
| | |-- z: double (nullable = true)
|-- B_2019: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- x: double (nullable = true)
| | |-- y: double (nullable = true)
scala> df.show(false)
+---+------------------------------------------------------------+------------------------------+
|A |B_2020 |B_2019 |
+---+------------------------------------------------------------+------------------------------+
|ABC|[[a, 2.0, 4.0, 6.0], [b, 3.0, 6.0, 9.0], [c, 1.0, 2.0, 3.0]]|[[a, 4.0, 8.0], [d, 3.0, 4.0]]|
+---+------------------------------------------------------------+------------------------------+
scala> val df2020 = df.select($"A", explode($"B_2020") as "this_year").select($"A",
| $"this_year.key" as "key", $"this_year.x" as "x_this_year",
| $"this_year.y" as "y_this_year", $"this_year.z" as "z_this_year")
df2020: org.apache.spark.sql.DataFrame = [A: string, key: string ... 3 more fields]
scala> val df2019 = df.select($"A", explode($"B_2019") as "last_year").select($"A",
| $"last_year.key" as "key", $"last_year.x" as "x_last_year",
| $"last_year.y" as "y_last_year")
df2019: org.apache.spark.sql.DataFrame = [A: string, key: string ... 2 more fields]
scala> df2020.show(false)
+---+---+-----------+-----------+-----------+
|A |key|x_this_year|y_this_year|z_this_year|
+---+---+-----------+-----------+-----------+
|ABC|a |2.0 |4.0 |6.0 |
|ABC|b |3.0 |6.0 |9.0 |
|ABC|c |1.0 |2.0 |3.0 |
+---+---+-----------+-----------+-----------+
scala> df2019.show(false)
+---+---+-----------+-----------+
|A |key|x_last_year|y_last_year|
+---+---+-----------+-----------+
|ABC|a |4.0 |8.0 |
|ABC|d |3.0 |4.0 |
+---+---+-----------+-----------+
scala> val outputDF = df2020.join(df2019, Seq("A", "key"), "outer").select(
| $"A" as "market_name",
| struct($"key", $"x_this_year", $"y_this_year", $"x_last_year",
| $"y_last_year", $"z_this_year") as "cancellation_policy_booking")
outputDF: org.apache.spark.sql.DataFrame = [market_name: string, cancellation_policy_booking: struct<key: string, x_this_year: double ... 4 more fields>]
scala> outputDF.printSchema
root
|-- market_name: string (nullable = true)
|-- cancellation_policy_booking: struct (nullable = false)
| |-- key: string (nullable = true)
| |-- x_this_year: double (nullable = true)
| |-- y_this_year: double (nullable = true)
| |-- x_last_year: double (nullable = true)
| |-- y_last_year: double (nullable = true)
| |-- z_this_year: double (nullable = true)
scala> outputDF.show(false)
+-----------+----------------------------+
|market_name|cancellation_policy_booking |
+-----------+----------------------------+
|ABC |[b, 3.0, 6.0,,, 9.0] |
|ABC |[a, 2.0, 4.0, 4.0, 8.0, 6.0]|
|ABC |[d,,, 3.0, 4.0,] |
|ABC |[c, 1.0, 2.0,,, 3.0] |
+-----------+----------------------------+

Related

How to convert a spark dataframe to a list of structs in scala

I have a spark dataframe composed of 12 rows and different columns, 22 in this case.
I want to convert it into a dataframe of the format:
root
|-- data: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- ast: double (nullable = true)
| | |-- blk: double (nullable = true)
| | |-- dreb: double (nullable = true)
| | |-- fg3_pct: double (nullable = true)
| | |-- fg3a: double (nullable = true)
| | |-- fg3m: double (nullable = true)
| | |-- fg_pct: double (nullable = true)
| | |-- fga: double (nullable = true)
| | |-- fgm: double (nullable = true)
| | |-- ft_pct: double (nullable = true)
| | |-- fta: double (nullable = true)
| | |-- ftm: double (nullable = true)
| | |-- games_played: long (nullable = true)
| | |-- seconds: double (nullable = true)
| | |-- oreb: double (nullable = true)
| | |-- pf: double (nullable = true)
| | |-- player_id: long (nullable = true)
| | |-- pts: double (nullable = true)
| | |-- reb: double (nullable = true)
| | |-- season: long (nullable = true)
| | |-- stl: double (nullable = true)
| | |-- turnover: double (nullable = true)
Where each element of the dataframe data field corresponds to a different row of the original dataframe.
The final goal is exporting it to .json file which will have the format:
{"data": [{row1}, {row2}, ..., {row12}]}
The code I am employing at the moment is the following:
val best_12_struct = best_12.withColumn("data", array((0 to 11).map(i => struct(col("ast"), col("blk"), col("dreb"), col("fg3_pct"), col("fg3a"),
col("fg3m"), col("fg_pct"), col("fga"), col("fgm"),
col("ft_pct"), col("fta"), col("ftm"), col("games_played"),
col("seconds"), col("oreb"), col("pf"), col("player_id"),
col("pts"), col("reb"), col("season"), col("stl"), col("turnover"))) : _*))
val best_12_data = best_12_struct.select("data")
But the array(0 to 11) copies 12 times the same element into data. Therefore, the .json I finally obtain has 12 {"data": ...}, being in each the same row copied 12 times, instead of just one {"data": ...} with 12 elements, corresponding each to one row of the original dataframe.
you have 12 times the same row as the method withColumn will only pick information from the current treated row.
You need to aggregate rows at dataframe level with collect_list that is an aggregate function as follow:
import org.apache.spark.sql.functions._
val best_12_data = best_12
.withColumn("row", struct(col("ast"), col("blk"), col("dreb"), col("fg3_pct"), col("fg3a"), col("fg3m"), col("fg_pct"), col("fga"), col("fgm"), col("ft_pct"), col("fta"), col("ftm"), col("games_played"), col("seconds"), col("oreb"), col("pf"), col("player_id"), col("pts"), col("reb"), col("season"), col("stl"), col("turnover")))
.agg(collect_list(col("row")).as("data"))

How to update column value in case of array of struct in spark scala

root
|-- _id: string (nullable = true)
|-- h: string (nullable = true)
|-- inc: string (nullable = true)
|-- op: string (nullable = true)
|-- ts: string (nullable = true)
|-- Animal: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- Elephant: string (nullable = false)
| | |-- Lion: string (nullable = true)
| | |-- Zebra: string (nullable = true)
| | |-- Dog: string (nullable = true)
I just want to is this posible to update the array of struct to some value if I Have a list of column of which I dont want to update.
For eg
If I have a list List[String] = List(Zebra,Dog)
Is this possible to set all other array of column to 0 like Elephant and Lion will be 0
+---+----+-----+------+-------+--------------------+
|_id|h |inc |op |ts |webhooks |
+---+----+-----+------+-------+--------------------+
|fa1|fa11|fa111|fa1111|fa11111|[[1, 1, 0, 1]]|
|fb1|fb11|fb111|fb1111|fb11111|[[0, 1, 1, 0]]|
+---+----+-----+------+-------+--------------------+
After operations It will be
+---+----+-----+------+-------+--------------------+
|_id|h |inc |op |ts |webhooks |
+---+----+-----+------+-------+--------------------+
|fa1|fa11|fa111|fa1111|fa11111|[[0, 0, 0, 1]]|
|fb1|fb11|fb111|fb1111|fb11111|[[0, 0, 1, 0]]|
+---+----+-----+------+-------+--------------------+
I was going by iteration by row
Like I made a function
def changeValue(row :Row) = {
//some code
}
But not able to do so
Check below code.
scala> ddf.show(false)
+---+----+-----+------+-------+--------------------+
|_id|h |inc |op |ts |webhooks |
+---+----+-----+------+-------+--------------------+
|fa1|fa11|fa111|fa1111|fa11111|[[1, 11, 111, 1111]]|
|fb1|fb11|fb111|fb1111|fb11111|[[2, 22, 222, 2222]]|
+---+----+-----+------+-------+--------------------+
scala> val columnsTobeUpdatedInWebhooks = Seq("zebra","dog") // Columns to be updated in webhooks.
columnsTobeUpdatedInWebhooks: Seq[String] = List(zebra, dog)
Constructing Expression
val expr = flatten(
array(
ddf
.select(explode($"webhooks").as("webhooks"))
.select("webhooks.*")
.columns
.map(c => if(columnsTobeUpdatedInWebhooks.contains(c)) col(s"webhooks.${c}").as(c) else array(lit(0)).as(c)):_*
)
)
expr: org.apache.spark.sql.Column = flatten(array(array(0) AS `elephant`, array(0) AS `lion`, webhooks.zebra AS `zebra`, webhooks.dog AS `dog`))
Applying Expression
scala> ddf.withColumn("webhooks",struct(expr)).show(false)
+---+----+-----+------+-------+--------------+
|_id|h |inc |op |ts |webhooks |
+---+----+-----+------+-------+--------------+
|fa1|fa11|fa111|fa1111|fa11111|[[0, 0, 0, 1]]|
|fb1|fb11|fb111|fb1111|fb11111|[[0, 0, 1, 0]]|
+---+----+-----+------+-------+--------------+
Final Schema
scala> ddf.withColumn("webhooks",allwebhookColumns).printSchema
root
|-- _id: string (nullable = true)
|-- h: string (nullable = true)
|-- inc: string (nullable = true)
|-- op: string (nullable = true)
|-- ts: string (nullable = true)
|-- webhooks: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- elephant: integer (nullable = false)
| | |-- lion: integer (nullable = false)
| | |-- zebra: integer (nullable = false)
| | |-- dog: integer (nullable = false)

Convert all the columns of a spark dataframe into a json format and then include the json formatted data as a column in another/parent dataframe

Converted dataframe(say child dataframe) into json using df.toJSON
After json conversion the schema looks like this :
root
|-- value: string (nullable = true)
I used the following suggestion to get child dataframe into the intermediate parent schema/dataframe:
scala> parentDF.toJSON.select(struct($"value").as("data")).printSchema
root
|-- data: struct (nullable = false)
| |-- value: string (nullable = true)
Now I still need to build the parentDF schema further to make it look like:
root
|-- id
|-- version
|-- data: struct (nullable = false)
| |-- value: string (nullable = true)
Q1) How can I build the id column using the id from value(i.e value.id needs to be represented as id)
Q2) I need to bring version from a different dataframe(say versionDF) where version is a constant(in all columns). Do I fetch one row from this versionDF to read value of version column and then populate it as literal in the parentDF ?
please help with any code snippets on this.
Instead of toJSON use to_json in select statement & select required columns along with to_json function.
Check below code.
val version = // Get version value from versionDF
parentDF.select($"id",struct(to_json(struct($"*")).as("value")).as("data"),lit(version).as("version"))
scala> parentDF.select($"id",struct(to_json(struct($"*")).as("value")).as("data"),lit(version).as("version")).printSchema
root
|-- id: integer (nullable = false)
|-- data: struct (nullable = false)
| |-- value: string (nullable = true)
|-- version: double (nullable = false)
Updated
scala> df.select($"id",to_json(struct(struct($"*").as("value"))).as("data"),lit(version).as("version")).printSchema
root
|-- id: integer (nullable = false)
|-- data: string (nullable = true)
|-- version: integer (nullable = false)
scala> df.select($"id",to_json(struct(struct($"*").as("value"))).as("data"),lit(version).as("version")).show(false)
+---+------------------------------------------+-------+
|id |data |version|
+---+------------------------------------------+-------+
|1 |{"value":{"id":1,"col1":"a1","col2":"b1"}}|1 |
|2 |{"value":{"id":2,"col1":"a2","col2":"b2"}}|1 |
|3 |{"value":{"id":3,"col1":"a3","col2":"b3"}}|1 |
+---+------------------------------------------+-------+
Update-1
scala> df.select($"id",to_json(struct($"*").as("value")).as("data"),lit(version).as("version")).printSchema
root
|-- id: integer (nullable = false)
|-- data: string (nullable = true)
|-- version: integer (nullable = false)
scala> df.select($"id",to_json(struct($"*").as("value")).as("data"),lit(version).as("version")).show(false)
+---+--------------------------------+-------+
|id |data |version|
+---+--------------------------------+-------+
|1 |{"id":1,"col1":"a1","col2":"b1"}|1 |
|2 |{"id":2,"col1":"a2","col2":"b2"}|1 |
|3 |{"id":3,"col1":"a3","col2":"b3"}|1 |
+---+--------------------------------+-------+
Try this:
scala> val versionDF = List((1.0)).toDF("version")
versionDF: org.apache.spark.sql.DataFrame = [version: double]
scala> versionDF.show
+-------+
|version|
+-------+
| 1.0|
+-------+
scala> val version = versionDF.first.get(0)
version: Any = 1.0
scala>
scala> val childDF = List((1,"a1","b1"),(2,"a2","b2"),(3,"a3","b3")).toDF("id","col1","col2")
childDF: org.apache.spark.sql.DataFrame = [id: int, col1: string ... 1 more field]
scala> childDF.show
+---+----+----+
| id|col1|col2|
+---+----+----+
| 1| a1| b1|
| 2| a2| b2|
| 3| a3| b3|
+---+----+----+
scala>
scala> val parentDF = childDF.toJSON.select(struct($"value").as("data")).withColumn("id",from_json($"data.value",childDF.schema).getItem("id")).withColumn("version",lit(version))
parentDF: org.apache.spark.sql.DataFrame = [data: struct<value: string>, id: int ... 1 more field]
scala> parentDF.printSchema
root
|-- data: struct (nullable = false)
| |-- value: string (nullable = true)
|-- id: integer (nullable = true)
|-- version: double (nullable = false)
scala> parentDF.show(false)
+----------------------------------+---+-------+
|data |id |version|
+----------------------------------+---+-------+
|[{"id":1,"col1":"a1","col2":"b1"}]|1 |1.0 |
|[{"id":2,"col1":"a2","col2":"b2"}]|2 |1.0 |
|[{"id":3,"col1":"a3","col2":"b3"}]|3 |1.0 |
+----------------------------------+---+-------+

How to parse the JSON data using Spark-Scala

I've requirement to parse the JSON data as shown in the expected results below, currently i'm not getting how to include the signals name(ABS, ADA, ADW) in Signal column. Any help would be much appreciated.
I tried something which gives the results as shown below, but i will need to include all the signals in SIGNAL column as well which is shown in the expected results.
jsonDF.select(explode($"ABS") as "element").withColumn("stime", col("element.E")).withColumn("can_value", col("element.V")).drop(col("element")).show()
+-------------+--------- --+
| stime|can_value |
+-------------+--------- +
|value of E |value of V |
+-------------+----------- +
df.printSchema
-- ABS: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: long (nullable = true)
| | |-- V: long (nullable = true)
|-- ADA: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: long (nullable = true)
| | |-- V: long (nullable = true)
|-- ADW: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: long (nullable = true)
| | |-- V: long (nullable = true)
|-- ALT: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: long (nullable = true)
| | |-- V: double (nullable = true)
|-- APP: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: long (nullable = true)
| | |-- V: double (nullable = true)
I will need output like below:
-----------------+-------------+---------+
|SIGNAL |stime |can_value|
+-----------------+-------------+---------+
|ABS |value of E |value of V |
|ADA |value of E |value of V |
|ADW |value of E |value of V |
+-----------------+-------------+---------+
To get the expected output, and to insert values in Signal column:
jsonDF.select(explode($"ABS") as "element")
.withColumn("stime", col("element.E"))
.withColumn("can_value", col("element.V"))
.drop(col("element"))
.withColumn("SIGNAL",lit("ABS"))
.show()
And the generalized version of the above approach:
(Based on the result of df.printSchema assuming that, you have signal values as column names, and those columns contain array having elements of the form struct(E,V))
val columns:Array[String] = df.columns
var arrayOfDFs:Array[DataFrame] = Array()
for(col_name <- columns){
val temp = df.selectExpr("explode("+col_name+") as element")
.select(
lit(col_name).as("SIGNAL"),
col("element.E").as("stime"),
col("element.V").as("can_value"))
arrayOfDFs = arrayOfDFs :+ temp
}
val jsonDF = arrayOfDFs.reduce(_ union _)
jsonDF.show(false)

How to project an array of structs in spark dataframe API

I have a Dataframe like this:
val df = Seq(
Seq(("a","b","c"))
)
.toDF("arr")
.select($"arr".cast("array<struct<c1:string,c2:string,c3:string>>"))
df.printSchema
root
|-- arr: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- c1: string (nullable = true)
| | |-- c2: string (nullable = true)
| | |-- c3: string (nullable = true)
df.show()
+---------+
| arr|
+---------+
|[[a,b,c]]|
+---------+
I want to select only c1 and c3, such that:
df.printSchema
root
|-- arr: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- c1: string (nullable = true)
| | |-- c3: string (nullable = true)
df.show()
+---------+
| arr|
+---------+
|[[a,c]] |
+---------+
Can this be done without UDF?
I can do it with an UDF, but I'd like a solution without it, something like
df
.select($"arr.c1".as("arr"))
root
|-- arr: array (nullable = true)
| |-- element: string (containsNull = true)
But this only works to select 1 struct element, I've also tried :
df
.select(array(struct($"arr.c1",$"arr.c3")).as("arr"))
but this gives
root
|-- arr: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- c1: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- c3: array (nullable = true)
| | | |-- element: string (containsNull = true)
I can only give an answer for the Python API but I am sure the Scala API has something very similar.
The key is the function arrays_zip, which, according to the documentation, "[r]eturns a merged array of structs in which the N-th struct contains all N-th values of input arrays."
Example (still from the documentation):
from pyspark.sql.functions import arrays_zip
df = spark.createDataFrame([(([1, 2, 3], [2, 3, 4]))], ['vals1', 'vals2'])
df.select(arrays_zip(df.vals1, df.vals2).alias('zipped')).collect()
# Prints: [Row(zipped=[Row(vals1=1, vals2=2), Row(vals1=2, vals2=3), Row(vals1=3, vals2=4)])]