Flatten nested dataframe and reconvert to original form - pyspark

I have a parquet file with the following schema
root
|-- listOfMetrics: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Action: string (nullable = true)
| | |-- id: long (nullable = true)
| | |-- date: date (nullable = true)
| | |-- female_executives: double (nullable = true)
| | |-- male_executives: double (nullable = true)
| | |-- female_directors: double (nullable = true)
| | |-- male_directors: double (nullable = true)
| | |-- female_executives_and_directors: double (nullable = true)
| | |-- male_executives_and_directors: double (nullable = true)
| | |-- flag: integer (nullable = true)
and df.show() returns something like the following
+----------------------------------+
| listOfMetrics |
+----------------------------------+
| [[ADD, 5394, 2...|
| [[ADD, 527, 20...|
| [[ADD, 714, 20...|
| [[ADD, 765, 20...|
| [[ADD, 996, 20...|
| [[ADD, 146, 20...|
| [[ADD, 947, 20...|
+----------------------------------+
The 'Action' field is what I am targeting. It can contain either 'ADD' or 'DELETE', and I need to separate the rows based on that value.
My approach was to flatten the dataframe, separate the rows using pyspark.sql, and then convert the result back to its original form, but I failed at the conversion step.
I have the following questions:
Is there a better way to do this in pyspark?
Can we dynamically convert the dataframe back to its original form after the transformation?
Can we separate the rows without flattening them?
I am new to spark and finding it very difficult to get this working
Thanks

It is possible to split each array into two parts using the higher-order function filter (available in Spark SQL since 2.4): one part containing only the elements with ADD and one containing only the elements with DELETE. stack then turns the two parts into separate rows. The original structure is kept, so flattening the dataframe is not necessary.
I am using a slightly simplified set of test data:
root
|-- listOfMetrics: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Action: string (nullable = true)
| | |-- flag: long (nullable = true)
| | |-- id: long (nullable = true)
+----------------------------------------------------------------------------------+
|listOfMetrics |
+----------------------------------------------------------------------------------+
|[[ADD, 2, 1], [ADD, 4, 3], [DELETE, 6, 5], [DELETE, 8, 7], [ADD, 10, 9]] |
|[[ADD, 12, 11], [ADD, 14, 13], [DELETE, 16, 15], [DELETE, 18, 17], [ADD, 110, 19]]|
+----------------------------------------------------------------------------------+
The code:
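from pyspark.sql import functions as F

# Split each array by Action, then emit each part as its own row.
df.withColumn("listOfMetrics", F.expr("""stack(2,
    filter(listOfMetrics, e -> e.Action = 'ADD'),
    filter(listOfMetrics, e -> e.Action = 'DELETE'))""")) \
    .show(truncate=False)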
Output:
+----------------------------------------------+
|listOfMetrics |
+----------------------------------------------+
|[[ADD, 2, 1], [ADD, 4, 3], [ADD, 10, 9]] |
|[[DELETE, 6, 5], [DELETE, 8, 7]] |
|[[ADD, 12, 11], [ADD, 14, 13], [ADD, 110, 19]]|
|[[DELETE, 16, 15], [DELETE, 18, 17]] |
+----------------------------------------------+
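If the Action field can take more than two values, the same filter/stack expression can be generated rather than typed by hand. The following is a minimal sketch in plain Python that only builds the SQL string; the helper name build_split_expr is hypothetical, and it assumes the values contain no single quotes.

```python
def build_split_expr(array_col, field, values):
    """Build a Spark SQL expression that splits `array_col` into one
    array per value of `field` and stacks the parts into separate rows."""
    # One filter(...) call per distinct value of the struct field.
    filters = [
        f"filter(`{array_col}`, e -> e.`{field}` = '{value}')"
        for value in values
    ]
    # stack(n, a1, ..., an) emits n rows, one per argument.
    return f"stack({len(filters)}, {', '.join(filters)})"


split_expr = build_split_expr("listOfMetrics", "Action", ["ADD", "DELETE"])
print(split_expr)
```

The resulting string can be passed to F.expr in place of the hand-written expression. Note that an array with no matching elements still yields a row containing an empty array, so a follow-up filter such as F.size("listOfMetrics") > 0 may be wanted.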

Related

How to update column value in case of array of struct in spark scala

root
|-- _id: string (nullable = true)
|-- h: string (nullable = true)
|-- inc: string (nullable = true)
|-- op: string (nullable = true)
|-- ts: string (nullable = true)
|-- Animal: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- Elephant: string (nullable = false)
| | |-- Lion: string (nullable = true)
| | |-- Zebra: string (nullable = true)
| | |-- Dog: string (nullable = true)
I want to know whether it is possible to update the fields of an array of structs to some value, given a list of the fields that I don't want to update.
For example, if I have a list List[String] = List(Zebra, Dog),
is it possible to set all the other fields of the struct to 0, so that Elephant and Lion become 0?
+---+----+-----+------+-------+--------------------+
|_id|h |inc |op |ts |webhooks |
+---+----+-----+------+-------+--------------------+
|fa1|fa11|fa111|fa1111|fa11111|[[1, 1, 0, 1]]|
|fb1|fb11|fb111|fb1111|fb11111|[[0, 1, 1, 0]]|
+---+----+-----+------+-------+--------------------+
After the operation it should be:
+---+----+-----+------+-------+--------------------+
|_id|h |inc |op |ts |webhooks |
+---+----+-----+------+-------+--------------------+
|fa1|fa11|fa111|fa1111|fa11111|[[0, 0, 0, 1]]|
|fb1|fb11|fb111|fb1111|fb11111|[[0, 0, 1, 0]]|
+---+----+-----+------+-------+--------------------+
I was trying to iterate row by row, using a function like
def changeValue(row: Row) = {
  //some code
}
but I was not able to get it working.
Check the code below.
scala> ddf.show(false)
+---+----+-----+------+-------+--------------------+
|_id|h |inc |op |ts |webhooks |
+---+----+-----+------+-------+--------------------+
|fa1|fa11|fa111|fa1111|fa11111|[[1, 11, 111, 1111]]|
|fb1|fb11|fb111|fb1111|fb11111|[[2, 22, 222, 2222]]|
+---+----+-----+------+-------+--------------------+
scala> val columnsTobeUpdatedInWebhooks = Seq("zebra","dog") // Columns to keep as-is in webhooks; all other columns are set to 0.
columnsTobeUpdatedInWebhooks: Seq[String] = List(zebra, dog)
Constructing Expression
val expr = flatten(
  array(
    ddf
      .select(explode($"webhooks").as("webhooks"))
      .select("webhooks.*")
      .columns
      .map(c => if(columnsTobeUpdatedInWebhooks.contains(c)) col(s"webhooks.${c}").as(c) else array(lit(0)).as(c)):_*
  )
)
expr: org.apache.spark.sql.Column = flatten(array(array(0) AS `elephant`, array(0) AS `lion`, webhooks.zebra AS `zebra`, webhooks.dog AS `dog`))
Applying Expression
scala> ddf.withColumn("webhooks",struct(expr)).show(false)
+---+----+-----+------+-------+--------------+
|_id|h |inc |op |ts |webhooks |
+---+----+-----+------+-------+--------------+
|fa1|fa11|fa111|fa1111|fa11111|[[0, 0, 0, 1]]|
|fb1|fb11|fb111|fb1111|fb11111|[[0, 0, 1, 0]]|
+---+----+-----+------+-------+--------------+
Final Schema
scala> ddf.withColumn("webhooks",allwebhookColumns).printSchema
root
|-- _id: string (nullable = true)
|-- h: string (nullable = true)
|-- inc: string (nullable = true)
|-- op: string (nullable = true)
|-- ts: string (nullable = true)
|-- webhooks: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- elephant: integer (nullable = false)
| | |-- lion: integer (nullable = false)
| | |-- zebra: integer (nullable = false)
| | |-- dog: integer (nullable = false)
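An alternative that avoids explode altogether is the higher-order function transform (Spark SQL 2.4+), which rewrites each struct inside the array in place. The sketch below is plain Python that only builds the expression string; the helper name build_zeroing_expr is hypothetical, and the field names are taken from the example above. The same string can be used from Scala via expr(...).

```python
def build_zeroing_expr(array_col, all_fields, keep_fields):
    """Build a Spark SQL expression that rewrites every struct in
    `array_col`: fields in `keep_fields` are copied, all others become 0."""
    keep = set(keep_fields)
    parts = []
    for name in all_fields:
        # Copy the original value for kept fields, use the literal 0 otherwise.
        value = f"e.`{name}`" if name in keep else "0"
        parts.append(f"'{name}', {value}")
    return f"transform(`{array_col}`, e -> named_struct({', '.join(parts)}))"


zero_expr = build_zeroing_expr(
    "webhooks", ["elephant", "lion", "zebra", "dog"], ["zebra", "dog"]
)
print(zero_expr)
```

Because the literal 0 is an integer, the zeroed fields become integers in the result; if the original fields are strings, a cast or a '0' literal may be needed to keep the schema unchanged.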

Merge two columns of array of structs based on a key

I have a dataframe of schema as below:
input dataframe
|-- A: string (nullable = true)
|-- B_2020: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- x: double (nullable = true)
| | |-- y: double (nullable = true)
| | |-- z: double (nullable = true)
|-- B_2019: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- x: double (nullable = true)
| | |-- y: double (nullable = true)
I want to merge the 2020 and 2019 columns into a single column, also an array of structs, matching the elements on their key value.
Desired schema:
expected output dataframe
|-- A: string (nullable = true)
|-- B: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- x_this_year: double (nullable = true)
| | |-- y_this_year: double (nullable = true)
| | |-- x_last_year: double (nullable = true)
| | |-- y_last_year: double (nullable = true)
| | |-- z_this_year: double (nullable = true)
I would like to merge on the matching key in the structs. Also note that if a key is present in only one of the 2019 or 2020 data, then nulls should be used for the other year's values in the merged column.
scala> val df = Seq(
| ("ABC",
| Seq(
| ("a", 2, 4, 6),
| ("b", 3, 6, 9),
| ("c", 1, 2, 3)
| ),
| Seq(
| ("a", 4, 8),
| ("d", 3, 4)
| ))
| ).toDF("A", "B_2020", "B_2019").select(
| $"A",
| $"B_2020" cast "array<struct<key:string,x:double,y:double,z:double>>",
| $"B_2019" cast "array<struct<key:string,x:double,y:double>>"
| )
df: org.apache.spark.sql.DataFrame = [A: string, B_2020: array<struct<key:string,x:double,y:double,z:double>> ... 1 more field]
scala> df.printSchema
root
|-- A: string (nullable = true)
|-- B_2020: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- x: double (nullable = true)
| | |-- y: double (nullable = true)
| | |-- z: double (nullable = true)
|-- B_2019: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- x: double (nullable = true)
| | |-- y: double (nullable = true)
scala> df.show(false)
+---+------------------------------------------------------------+------------------------------+
|A |B_2020 |B_2019 |
+---+------------------------------------------------------------+------------------------------+
|ABC|[[a, 2.0, 4.0, 6.0], [b, 3.0, 6.0, 9.0], [c, 1.0, 2.0, 3.0]]|[[a, 4.0, 8.0], [d, 3.0, 4.0]]|
+---+------------------------------------------------------------+------------------------------+
scala> val df2020 = df.select($"A", explode($"B_2020") as "this_year").select($"A",
| $"this_year.key" as "key", $"this_year.x" as "x_this_year",
| $"this_year.y" as "y_this_year", $"this_year.z" as "z_this_year")
df2020: org.apache.spark.sql.DataFrame = [A: string, key: string ... 3 more fields]
scala> val df2019 = df.select($"A", explode($"B_2019") as "last_year").select($"A",
| $"last_year.key" as "key", $"last_year.x" as "x_last_year",
| $"last_year.y" as "y_last_year")
df2019: org.apache.spark.sql.DataFrame = [A: string, key: string ... 2 more fields]
scala> df2020.show(false)
+---+---+-----------+-----------+-----------+
|A |key|x_this_year|y_this_year|z_this_year|
+---+---+-----------+-----------+-----------+
|ABC|a |2.0 |4.0 |6.0 |
|ABC|b |3.0 |6.0 |9.0 |
|ABC|c |1.0 |2.0 |3.0 |
+---+---+-----------+-----------+-----------+
scala> df2019.show(false)
+---+---+-----------+-----------+
|A |key|x_last_year|y_last_year|
+---+---+-----------+-----------+
|ABC|a |4.0 |8.0 |
|ABC|d |3.0 |4.0 |
+---+---+-----------+-----------+
scala> val outputDF = df2020.join(df2019, Seq("A", "key"), "outer").select(
| $"A" as "market_name",
| struct($"key", $"x_this_year", $"y_this_year", $"x_last_year",
| $"y_last_year", $"z_this_year") as "cancellation_policy_booking")
outputDF: org.apache.spark.sql.DataFrame = [market_name: string, cancellation_policy_booking: struct<key: string, x_this_year: double ... 4 more fields>]
scala> outputDF.printSchema
root
|-- market_name: string (nullable = true)
|-- cancellation_policy_booking: struct (nullable = false)
| |-- key: string (nullable = true)
| |-- x_this_year: double (nullable = true)
| |-- y_this_year: double (nullable = true)
| |-- x_last_year: double (nullable = true)
| |-- y_last_year: double (nullable = true)
| |-- z_this_year: double (nullable = true)
scala> outputDF.show(false)
+-----------+----------------------------+
|market_name|cancellation_policy_booking |
+-----------+----------------------------+
|ABC |[b, 3.0, 6.0,,, 9.0] |
|ABC |[a, 2.0, 4.0, 4.0, 8.0, 6.0]|
|ABC |[d,,, 3.0, 4.0,] |
|ABC |[c, 1.0, 2.0,,, 3.0] |
+-----------+----------------------------+
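The result above holds one struct per row. To get the single array column B from the desired schema, the rows would still need to be grouped by A again and collected (for example with groupBy and collect_list). As a sanity check of the intended merge semantics, here is a minimal plain-Python sketch of the same full outer merge on key; the function name merge_by_key and the dict-based row representation are assumptions for illustration only.

```python
def merge_by_key(this_year, last_year):
    """Full outer merge of two lists of dicts on 'key'.
    Values missing on one side are filled with None (null)."""
    this_by_key = {row["key"]: row for row in this_year}
    last_by_key = {row["key"]: row for row in last_year}
    merged = []
    # Union of keys gives the full outer join; sorted for stable output.
    for key in sorted(this_by_key.keys() | last_by_key.keys()):
        t = this_by_key.get(key, {})
        l = last_by_key.get(key, {})
        merged.append({
            "key": key,
            "x_this_year": t.get("x"),
            "y_this_year": t.get("y"),
            "x_last_year": l.get("x"),
            "y_last_year": l.get("y"),
            "z_this_year": t.get("z"),
        })
    return merged


b_2020 = [
    {"key": "a", "x": 2.0, "y": 4.0, "z": 6.0},
    {"key": "b", "x": 3.0, "y": 6.0, "z": 9.0},
    {"key": "c", "x": 1.0, "y": 2.0, "z": 3.0},
]
b_2019 = [
    {"key": "a", "x": 4.0, "y": 8.0},
    {"key": "d", "x": 3.0, "y": 4.0},
]
merged = merge_by_key(b_2020, b_2019)
for row in merged:
    print(row)
```

Key a appears in both years and gets all five values, while b, c and d are filled with None on the side where they are missing, matching the joined output shown above.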

How to select all structures from a dataframe in pyspark?

I have a json database loaded using pyspark.
I'm trying to access the "x" component of every struct in it.
This is the output of df.select("level_instance_json.player").printSchema()
root
|-- player: struct (nullable = true)
| |-- 0: struct (nullable = true)
| | |-- head_pitch: long (nullable = true)
| | |-- head_roll: long (nullable = true)
| | |-- head_yaw: long (nullable = true)
| | |-- r: long (nullable = true)
| | |-- x: long (nullable = true)
| | |-- y: long (nullable = true)
| |-- 1: struct (nullable = true)
| | |-- head_pitch: long (nullable = true)
| | |-- head_roll: long (nullable = true)
| | |-- head_yaw: long (nullable = true)
| | |-- r: long (nullable = true)
| | |-- x: long (nullable = true)
| | |-- y: long (nullable = true)
...
I've tried selecting all using the '*' selector but it doesn't work.
df.select("level_instance_json.player.*.x").show(10) gives this error:
'No such struct field * in 0, 1, 10, 100, 1000, 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 101, 1010, 1011, 1012, 1013, 1014, 1015, 1016, 1017, 1018, 1019, 102,...
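You can build the list of field paths from the schema:
# Names of the nested structs under player ('0', '1', ...)
list_player_numbers = [el.name for el in df.select("level_instance_json.player").schema['player'].dataType]
# Full dotted path to the x field of each player struct
list_fields = ['.'.join(['level_instance_json', 'player', player_number, 'x']) for player_number in list_player_numbers]
output = df.select(list_fields)
This should work.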
Xavier

Is this a valid Spark Schema?

I was in the process of flattening a Spark Schema using the method suggested here, when I came across an edge case -
val writerSchema = StructType(Seq(
StructField("f1", ArrayType(ArrayType(
StructType(Seq(
StructField("f2", ArrayType(LongType))
))
)))
))
writerSchema.printTreeString()
root
|-- f1: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: struct (containsNull = true)
| | | |-- f2: array (nullable = true)
| | | | |-- element: long (containsNull = true)
The flattening method prints only f1 and not
f1
f1.f2
as I expected it to be.
Questions -
Is writerSchema a valid Spark schema?
How do I handle ArrayType objects when flattening the schema?
If you want to handle data like this
val json = """{"f1": [{"f2": [1, 2, 3] }, {"f2": [4,5,6]}, {"f2": [7,8,9]}, {"f2": [10,11,12]}]}"""
A schema that matches this data would be
val writerSchema = StructType(Seq(
StructField("f1", ArrayType(
StructType(Seq(
StructField("f2", ArrayType(LongType))
))
))))
root
|-- f1: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- f2: array (nullable = true)
| | | |-- element: long (containsNull = true)
Nesting an ArrayType inside another ArrayType is valid in Spark, but data of this shape does not need it: a single array of structs is enough.
So let's suppose you have a dataframe inputDF:
inputDF.printSchema
root
|-- f1: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- f2: array (nullable = true)
| | | |-- element: long (containsNull = true)
inputDF.show(false)
+-------------------------------------------------------------------------------------------------------+
|f1 |
+-------------------------------------------------------------------------------------------------------+
|[[WrappedArray(1, 2, 3)], [WrappedArray(4, 5, 6)], [WrappedArray(7, 8, 9)], [WrappedArray(10, 11, 12)]]|
+-------------------------------------------------------------------------------------------------------+
To flatten this dataframe we can explode the array columns (f1 and f2):
First, flatten column 'f1'
val semiFlattenDF = inputDF.select(explode(col("f1"))).select(col("col.*"))
semiFlattenDF.printSchema
root
|-- f2: array (nullable = true)
| |-- element: long (containsNull = true)
semiFlattenDF.show
+------------+
| f2|
+------------+
| [1, 2, 3]|
| [4, 5, 6]|
| [7, 8, 9]|
|[10, 11, 12]|
+------------+
Now flatten column 'f2' and get the column name as 'value'
val fullyFlattenDF = semiFlattenDF.select(explode(col("f2")).as("value"))
So now the DataFrame is flattened:
fullyFlattenDF.printSchema
root
|-- value: long (nullable = true)
fullyFlattenDF.show
+-----+
|value|
+-----+
| 1|
| 2|
| 3|
| 4|
| 5|
| 6|
| 7|
| 8|
| 9|
| 10|
| 11|
| 12|
+-----+
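On the question of how to handle ArrayType when flattening the schema itself (rather than the data): one option is to look through any number of array layers until a struct or a primitive type is reached. The sketch below is plain Python over the dict form of a schema, which is the shape that StructType.jsonValue() returns in PySpark; the function name flatten_schema is hypothetical.

```python
def flatten_schema(schema, prefix=""):
    """Return the dotted paths of all fields in a struct schema,
    looking through (possibly nested) arrays to reach inner structs."""
    paths = []
    for field in schema["fields"]:
        path = f"{prefix}{field['name']}"
        paths.append(path)
        dtype = field["type"]
        # Unwrap every array layer to get at the element type.
        while isinstance(dtype, dict) and dtype.get("type") == "array":
            dtype = dtype["elementType"]
        # Recurse only into structs; primitive types are plain strings.
        if isinstance(dtype, dict) and dtype.get("type") == "struct":
            paths.extend(flatten_schema(dtype, prefix=path + "."))
    return paths


# Dict form of the writerSchema from the question: array<array<struct<f2: array<long>>>>
writer_schema = {
    "type": "struct",
    "fields": [{
        "name": "f1",
        "type": {
            "type": "array",
            "elementType": {
                "type": "array",
                "elementType": {
                    "type": "struct",
                    "fields": [{
                        "name": "f2",
                        "type": {"type": "array", "elementType": "long"},
                    }],
                },
            },
        },
    }],
}
flat_paths = flatten_schema(writer_schema)
print(flat_paths)  # ['f1', 'f1.f2']
```

This yields both f1 and f1.f2 for the doubly nested schema. The paths describe the schema only; selecting f1.f2 from actual data still returns nested arrays unless the arrays are exploded as shown above.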

Removing duplicate array structs by last item in array struct in Spark Dataframe

So my table looks something like this:
customer_1|place|customer_2|item |count
-------------------------------------------------
a | NY | b |(2010,304,310)| 34
a | NY | b |(2024,201,310)| 21
a | NY | b |(2010,304,312)| 76
c | NY | x |(2010,304,310)| 11
a | NY | b |(453,131,235) | 10
I've tried the following, but it does not eliminate the duplicates, because the full array is still part of the struct (as it should be, since I need it in the end result).
val df = df_one.withColumn("vs", struct(col("item").getItem(size(col("item"))-1), col("item"), col("count")))
  .groupBy(col("customer_1"), col("place"), col("customer_2"))
  .agg(max("vs").alias("vs"))
  .select(col("customer_1"), col("place"), col("customer_2"), col("vs.item"), col("vs.count"))
I would like to group by the customer_1, place and customer_2 columns and, for each distinct last item (index -1) of the array, return only the row with the highest count. Any ideas?
Expected output:
customer_1|place|customer_2|item |count
-------------------------------------------------
a | NY | b |(2010,304,312)| 76
a | NY | b |(2010,304,310)| 34
a | NY | b |(453,131,235) | 10
c | NY | x |(2010,304,310)| 11
Given that the dataframe has the following schema
root
|-- customer_1: string (nullable = true)
|-- place: string (nullable = true)
|-- customer_2: string (nullable = true)
|-- item: array (nullable = true)
| |-- element: integer (containsNull = false)
|-- count: string (nullable = true)
You can apply the concat function to create a temp column for detecting duplicate rows, as done below
import org.apache.spark.sql.functions._
df.withColumn("temp", concat($"customer_1",$"place",$"customer_2", $"item"(size($"item")-1)))
.dropDuplicates("temp")
.drop("temp")
You should get following output
+----------+-----+----------+----------------+-----+
|customer_1|place|customer_2|item |count|
+----------+-----+----------+----------------+-----+
|a |NY |b |[2010, 304, 312]|76 |
|c |NY |x |[2010, 304, 310]|11 |
|a |NY |b |[453, 131, 235] |10 |
|a |NY |b |[2010, 304, 310]|34 |
+----------+-----+----------+----------------+-----+
Struct
Given that the dataframe has the following schema
root
|-- customer_1: string (nullable = true)
|-- place: string (nullable = true)
|-- customer_2: string (nullable = true)
|-- item: struct (nullable = true)
| |-- _1: integer (nullable = false)
| |-- _2: integer (nullable = false)
| |-- _3: integer (nullable = false)
|-- count: string (nullable = true)
We can still do the same as above, with a slight change to get the third item from the struct:
import org.apache.spark.sql.functions._
df.withColumn("temp", concat($"customer_1",$"place",$"customer_2", $"item._3"))
.dropDuplicates("temp")
.drop("temp")
Hope the answer is helpful
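One caveat: dropDuplicates keeps an arbitrary row for each duplicate key, so it does not by itself guarantee that the row with the highest count survives. In Spark this is usually handled by ranking the rows within each key by count (for example with a window function) and keeping the top one. The plain-Python sketch below only illustrates that selection rule on the sample data; the function name dedupe_by_last_item and the dict-based rows are assumptions for illustration.

```python
def dedupe_by_last_item(rows):
    """For each (customer_1, place, customer_2, last item) keep only
    the row with the highest count."""
    best = {}
    for row in rows:
        key = (row["customer_1"], row["place"], row["customer_2"], row["item"][-1])
        # count is a string in the schema, so compare it numerically.
        if key not in best or int(row["count"]) > int(best[key]["count"]):
            best[key] = row
    return list(best.values())


rows = [
    {"customer_1": "a", "place": "NY", "customer_2": "b", "item": (2010, 304, 310), "count": "34"},
    {"customer_1": "a", "place": "NY", "customer_2": "b", "item": (2024, 201, 310), "count": "21"},
    {"customer_1": "a", "place": "NY", "customer_2": "b", "item": (2010, 304, 312), "count": "76"},
    {"customer_1": "c", "place": "NY", "customer_2": "x", "item": (2010, 304, 310), "count": "11"},
    {"customer_1": "a", "place": "NY", "customer_2": "b", "item": (453, 131, 235), "count": "10"},
]
deduped = dedupe_by_last_item(rows)
for row in deduped:
    print(row)
```

On the sample data this drops only the (2024, 201, 310) row with count 21, which shares the last item 310 with the count-34 row in the same group.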