Pivot spark dataframe array of kv pairs into individual columns - scala

I have following schema:
root
|-- id: string (nullable = true)
|-- date: timestamp (nullable = true)
|-- config: struct (nullable = true)
| |-- entry: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- key: string (nullable = true)
| | | |-- value: string (nullable = true)
There will be no more than 3 key-value pairs (k1, k2, k3) in the array, and I would like to turn each key into its own column, with the data for that column coming from the value of the same kv pair.
+--------+----------+----------+----------+---------+
|id |date |k1 |k2 |k3 |
+--------+----------+----------+----------+---------+
| id1 |2019-08-12|id1-v1 |id1-v2 |id1-v3 |
| id2 |2019-08-12|id2-v1 |id2-v2 |id2-v3 |
+--------+----------+----------+----------+---------+
So far I tried something like this:
sourceDF.filter($"someColumn".contains("SOME_STRING"))
.select($"id", $"date", $"config.entry" as "kvpairs")
.withColumn($"kvpairs".getItem(0).getField("key").toString(), $"kvpairs".getItem(0).getField("value"))
.withColumn($"kvpairs".getItem(1).getField("key").toString(), $"kvpairs".getItem(1).getField("value"))
.withColumn($"kvpairs".getItem(2).getField("key").toString(), $"kvpairs".getItem(2).getField("value"))
But in this case, the column names are shown as kvpairs[0][key], kvpairs[1][key] and kvpairs[2][key] as shown below:
+--------+----------+---------------+---------------+---------------+
|id |date |kvpairs[0][key]|kvpairs[1][key]|kvpairs[2][key]|
+--------+----------+---------------+---------------+---------------+
| id1 |2019-08-12| id1-v1 | id1-v2 | id1-v3 |
| id2 |2019-08-12| id2-v1 | id2-v2 | id2-v3 |
+--------+----------+---------------+---------------+---------------+
Two questions:
Is my approach right? Is there a better and easier way to pivot this
such that I get one row per array with the 3 kv pairs as 3 columns? I want to handle cases where the order of the kv pairs may differ.
If the above approach is fine, how do I alias the column name to the data of the "key" element in the array?

Using multiple withColumn together with getItem will not work since the order of the kv pairs may differ. What you can do instead is explode the array and then use pivot as follows:
sourceDF.filter($"someColumn".contains("SOME_STRING"))
.select($"id", $"date", explode($"config.entry") as "exploded")
.select($"id", $"date", $"exploded.*")
.groupBy("id", "date")
.pivot("key")
.agg(first("value"))
The usage of first inside the aggregation here assumes there will be a single value for each key; otherwise collect_list or collect_set can be used (see the variant after the result below).
Result:
+---+----------+------+------+------+
|id |date |k1 |k2 |k3 |
+---+----------+------+------+------+
|id1|2019-08-12|id1-v1|id1-v2|id1-v3|
|id2|2019-08-12|id2-v1|id2-v2|id2-v3|
+---+----------+------+------+------+
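For instance, a minimal sketch of the collect_list variant of the same pipeline (the only change is the aggregation; each pivoted column then holds an array of values rather than a single value):
import org.apache.spark.sql.functions.{collect_list, explode}
sourceDF.filter($"someColumn".contains("SOME_STRING"))
  .select($"id", $"date", explode($"config.entry") as "exploded")
  .select($"id", $"date", $"exploded.*")   // yields key and value columns
  .groupBy("id", "date")
  .pivot("key")
  .agg(collect_list("value"))              // array of values per key instead of a single value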

Related

get keys from MapType column in pyspark and use it in navigation

I have a Spark df with the following schema:
|-- col1 : string
|-- col2 : string
|-- data: struct
| |-- items: map (nullable = true)
| | |-- key: string
| | |-- value: struct
| | | |-- id: string
| | | |-- legalNature
and I'm receiving different JSON documents with different keys under the items object;
here is an example:
+-------+---------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| col1| col2 | data |
+-------+---------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| xxx | yyy |{"data":{"items":{"71f3e2e4-5c3a-466d-8063-7bfd753b303c":{"id":"123","legalNature":"legalNature","allowedSignature":[],"category":"TP01","createdOn":"2021-01-22T13:17:12.502+01:00"}}}}|
| ... | ... |{"data":{"items":{"86c6b41e-085c-eb11-a812-000d3ab29c25":{"id":"153","legalNature":"legalNature","allowedSignature":[],"category":"TP01","createdOn":"2021-01-22T13:17:12.502+01:00"}}}}|
| | |{"data":{"items":{"56c6b41e-085c-eb11-a812-000d3ab29c24":{"id":"173","legalNature":"legalNature","allowedSignature":[],"category":"TP01","createdOn":"2021-01-22T13:17:12.502+01:00"}}}}|
| | |{"data":{"items":{"1843f179-3687-eb11-a812-0022489bac2c":{"id":"193","legalNature":"legalNature","allowedSignature":[],"category":"TP01","createdOn":"2021-01-22T13:17:12.502+01:00"}}}}|
| | |{"data":{"items":{"2643f179-3687-eb11-a812-0022489bac2a":{"id":"133","legalNature":"legalNature","allowedSignature":[],"category":"TP01","createdOn":"2021-01-22T13:17:12.502+01:00"}}}}|
| | |{"data":{"items":{"91f3e2e4-5c3a-466d-8063-7bfd753b345i":{"id":"143","legalNature":"legalNature","allowedSignature":[],"category":"TP01","createdOn":"2021-01-22T13:17:12.502+01:00"}}}}|
+-------+---------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
items is a MapType(StringType(), itemsSchema) column. Since the key string of the map may change in each JSON I get, how can I navigate my JSON schema dynamically in order to get the fields inside the items struct?
For instance, I need something like this for performing a select operation:
df
.select(
col("col1"),
col("col2").alias("my_col_2"),
col(f"data.items.{itemsKey}.legalNature"),
col(f"data.items.{itemsKey}.id")
)
where itemsKey changes for each JSON inside my df.
I already see that for getting the keys I can use the map_keys function, like this:
df.select(map_keys("data.items"))
But the problem is that this function returns a dataframe and not a string; more precisely, a dataframe with a different itemsKey for each row.
Is there a way to get my itemsKey dynamically?
I hope I've been clear, any help is appreciated.
You should use the explode function from pyspark.sql.functions:
from pyspark.sql.functions import explode
df3 = df.select("col1", "col2", explode("data.items"))
This will give results like:
+-----+-----+------------------------------------+--------------------------------------------+
| col1| col2|                                 key|                                       value|
+-----+-----+------------------------------------+--------------------------------------------+
|  ...|  ...|a04452cb-a909-47b0-ad5a-9bc44c6014e3|{"legalNature":"legalNature","id": "123",..}|
+-----+-----+------------------------------------+--------------------------------------------+
To filter on your particular item key, simply use the filter function:
filteredValue = df3.filter(df3.key == "Your_item_id").select("value").first()
legalNature = filteredValue["value"]["legalNature"]
id = filteredValue["value"]["id"]
...

Retrieve subkey values of all the keys in json spark dataframe

I have a data frame with a schema like below (I have a large number of keys):
|-- loginRequest: struct (nullable = true)
| |-- responseHeader: struct (nullable = true)
| | |-- status: long (nullable = true)
| | |-- code: long (nullable = true)
|-- loginResponse: struct (nullable = true)
| |-- responseHeader: struct (nullable = true)
| | |-- status: long (nullable = true)
| | |-- code: long (nullable = true)
I want to create a column with the status taken from responseHeader.status of whichever key is present.
Expected
+--------------------+--------------------+------------+
|        loginRequest|       loginResponse|      status|
+--------------------+--------------------+------------+
|               [0,1]|                null|           0|
|                null|               [0,1]|           0|
|                null|               [0,1]|           0|
|                null|               [1,0]|           1|
+--------------------+--------------------+------------+
Thanks in Advance
A simple select will solve your problem.
You have a nested field:
loginResponse: struct (nullable = true)
| |-- responseHeader: struct (nullable = true)
| | |-- status
A quick way would be to flatten your dataframe.
Doing something like this:
df.select(df.col("loginRequest.*"),df.col("loginResponse.*"))
And work from there.
Or you could use something like this:
var explodeDF = df.withColumn("statusRequest", df("loginRequest.responseHeader"))
These questions may also help:
Flattening Rows in Spark
DataFrame explode list of JSON objects
In order to get it to populate from either the response or the request, you can use a when condition in Spark (see the sketch after the link below):
- How to use AND or OR condition in when in Spark
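For example, a minimal sketch of that when-based approach, assuming only one of the two structs is non-null per row (the name statusDF is made up here):
import org.apache.spark.sql.functions.{col, when}
// take the status from loginRequest when it is present, otherwise fall back to loginResponse
val statusDF = df.withColumn("status",
  when(col("loginRequest").isNotNull, col("loginRequest.responseHeader.status"))
    .otherwise(col("loginResponse.responseHeader.status"))
)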
You are able to get the subfields with the . delimiter in the select statement, and with the help of the coalesce method you should get exactly what you aim for. Calling the input dataframe df, with your specified input schema, this piece of code should do the work:
import org.apache.spark.sql.functions.{coalesce, col}
val df_status = df.withColumn("status",
coalesce(
col("loginRequest.responseHeader.status"),
col("loginResponse.responseHeader.status")
)
)
What coalesce does is take the first non-null value in the order of the columns passed to the method; if there is no non-null value, it returns null (see https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/functions.html#coalesce-org.apache.spark.sql.Column...-).

Convert Array with nested struct to string column along with other columns from the PySpark DataFrame

This is similar to Pyspark: cast array with nested struct to string
But the accepted answer is not working for my case, so I am asking here.
|-- Col1: string (nullable = true)
|-- Col2: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Col2Sub: string (nullable = true)
Sample JSON
{"Col1":"abc123","Col2":[{"Col2Sub":"foo"},{"Col2Sub":"bar"}]}
This gives the result in a single column:
import pyspark.sql.functions as F
df.selectExpr("EXPLODE(Col2) AS structCol").select(F.expr("concat_ws(',', structCol.*)").alias("Col2_concated")).show()
+----------------+
| Col2_concated |
+----------------+
|foo,bar |
+----------------+
But how do I get a result or DataFrame like this?
+-------+---------------+
|Col1 | Col2_concated |
+-------+---------------+
|abc123 |foo,bar |
+-------+---------------+
EDIT:
This solution gives the wrong result
df.selectExpr("Col1","EXPLODE(Col2) AS structCol").select("Col1", F.expr("concat_ws(',', structCol.*)").alias("Col2_concated")).show()
+-------+---------------+
|Col1   | Col2_concated |
+-------+---------------+
|abc123 |foo            |
|abc123 |bar            |
+-------+---------------+
Just avoid the explode and you are already there. All you need is the concat_ws function. This function concatenates multiple string columns with a given separator. See the example below:
from pyspark.sql import functions as F
j = '{"Col1":"abc123","Col2":[{"Col2Sub":"foo"},{"Col2Sub":"bar"}]}'
df = spark.read.json(sc.parallelize([j]))
#printSchema tells us the column names we can use with concat_ws
df.printSchema()
Output:
root
|-- Col1: string (nullable = true)
|-- Col2: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Col2Sub: string (nullable = true)
The column Col2 is an array of Col2Sub and we can use this column name to get the desired result:
bla = df.withColumn('Col2', F.concat_ws(',', df.Col2.Col2Sub))
bla.show()
+------+-------+
| Col1| Col2|
+------+-------+
|abc123|foo,bar|
+------+-------+

PySpark Dataframe Transpose as List

I'm working with the pyspark sql api, and am trying to group rows with repeated values into a list of the rest of the contents. It's similar to a transpose, but instead of pivoting all values, it will put the values into an array.
Current output:
group_id | member_id | name
55 | 123 | jake
55 | 234 | tim
65 | 345 | chris
Desired output:
group_id | members
55 | [[123, 'jake'], [234, 'tim']]
65 | [345, 'chris']
You need to groupby the group_id and use pyspark.sql.functions.collect_list() as the aggregation function.
As for combining the member_id and name columns, you have two options:
Option 1: Use pyspark.sql.functions.array:
from pyspark.sql.functions import array, collect_list
df1 = df.groupBy("group_id")\
.agg(collect_list(array("member_id", "name")).alias("members"))
df1.show(truncate=False)
#+--------+-------------------------------------------------+
#|group_id|members |
#+--------+-------------------------------------------------+
#|55 |[WrappedArray(123, jake), WrappedArray(234, tim)]|
#|65 |[WrappedArray(345, chris)] |
#+--------+-------------------------------------------------+
This returns a WrappedArray of arrays of strings. The integers are converted to strings because you can't have mixed type arrays.
df1.printSchema()
#root
# |-- group_id: integer (nullable = true)
# |-- members: array (nullable = true)
# | |-- element: array (containsNull = true)
# | | |-- element: string (containsNull = true)
Option 2: Use pyspark.sql.functions.struct
from pyspark.sql.functions import collect_list, struct
df2 = df.groupBy("group_id")\
.agg(collect_list(struct("member_id", "name")).alias("members"))
df2.show(truncate=False)
#+--------+-----------------------+
#|group_id|members |
#+--------+-----------------------+
#|65 |[[345,chris]] |
#|55 |[[123,jake], [234,tim]]|
#+--------+-----------------------+
This returns an array of structs, with named fields for member_id and name
df2.printSchema()
#root
# |-- group_id: integer (nullable = true)
# |-- members: array (nullable = true)
# | |-- element: struct (containsNull = true)
# | | |-- member_id: integer (nullable = true)
# | | |-- name: string (nullable = true)
What's useful about the struct method is that you can access elements of the nested array by name using the dot accessor:
df2.select("group_id", "members.member_id").show()
#+--------+----------+
#|group_id| member_id|
#+--------+----------+
#| 65| [345]|
#| 55|[123, 234]|
#+--------+----------+

Lookup table in Spark

I have a dataframe in Spark with no clearly defined schema that I want to use as a lookup table. For example, the dataframe below:
+------------------------------------------------------------------------+
|lookupcolumn |
+------------------------------------------------------------------------+
|[val1,val2,val3,val4,val5,val6] |
+------------------------------------------------------------------------+
The schema would look like this:
|-- lookupcolumn: struct (nullable = true)
| |-- key1: string (nullable = true)
| |-- key2: string (nullable = true)
| |-- key3: string (nullable = true)
| |-- key4: string (nullable = true)
| |-- key5: string (nullable = true)
| |-- key6: string (nullable = true)
I'm saying "schema not clearly defined" since the number of keys is unknown while the data is being read, so I leave it to Spark to infer the schema.
Now, if I have another dataframe with a column as below:
+-----------------+
| datacolumn|
+-----------------+
| key1 |
| key3 |
| key5 |
| key2 |
| key4 |
+-----------------+
and I want the result to be:
+-----------------+
| resultcolumn|
+-----------------+
| val1 |
| val3 |
| val5 |
| val2 |
| val4 |
+-----------------+
I tried a UDF like this:
val get_val = udf((keyindex: String) => {
val res = lookupDf.select($"lookupcolumn"(keyindex).alias("result"))
res.head.toString
})
But it throws a NullPointerException.
Can someone tell me what's wrong with the UDF, and if there's a better/simpler way of doing this lookup in Spark?
I assume that the lookup table is quite small, in which case it makes more sense to collect it to the driver and convert it to a normal Map, and then use this Map in the UDF. (The NullPointerException comes from referencing lookupDf inside the UDF; a DataFrame cannot be used from within a UDF, which runs on the executors.) It can be done in many ways, for example like this:
val values = lookupDf.select("lookupcolumn.*").head.toSeq.map(_.toString)
val keys = lookupDf.select("lookupcolumn.*").columns
val lookup_map = keys.zip(values).toMap
Using the above lookup_map variable, the UDF will simply be:
val lookup = udf((key: String) => lookup_map.get(key))
And the final dataframe can be obtained by:
val df2 = df.withColumn("resultcolumn", lookup($"datacolumn"))
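For completeness, a UDF-free sketch is also possible once the map has been collected: turn it into a map literal column with typedLit and index it with the data column (the name df_result is made up here; this assumes Spark 2.2+, where typedLit is available):
import org.apache.spark.sql.functions.typedLit
// build a literal MapType column from the collected Scala Map
val lookupCol = typedLit(lookup_map)
// index the map column by the key stored in datacolumn; missing keys yield null
val df_result = df.withColumn("resultcolumn", lookupCol($"datacolumn"))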