Flattening map<string,string> column in spark scala - scala

Below is my source schema.
root
|-- header: struct (nullable = true)
| |-- timestamp: long (nullable = true)
| |-- id: string (nullable = true)
| |-- honame: string (nullable = true)
|-- device: struct (nullable = true)
| |-- srcId: string (nullable = true)
| |-- srctype.: string (nullable = true)
|-- ATTRIBUTES: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
|-- event_date: date (nullable = true)
|-- event_datetime: string (nullable = true)
I want to explode the ATTRIBUTES map type column and select all the columns which ends with _id.
Im using the below code.
val exploded = batch_df.select($"event_date", explode($"ATTRIBUTES")).show()
I am getting the below sample output.
---+----------+--------------------+--------------------+
|date | key| value|
+----------+--------------------+--------------------+
|2021-05-18|SYST_id | 85|
|2021-05-18|RECVR_id | 1|
|2021-05-18|Account_Id| | 12345|
|2021-05-18|Vb_id | 845|
|2021-05-18|SYS_INFO_id | 640|
|2021-05-18|mem_id | 456|
------------------------------------------------------
However, my required output is as below.
+---+-------+--------------+-----------+------------+-------+-------------+-------+
|date | SYST_id | RECVR_id | Account_Id | Vb_id | SYS_INFO_id| mem_id|
+----+------+--------------+-----------+------------+-------+-------------+-------+
|2021-05-18| 85 | 1 | 12345 | 845 | 640 | 456 |
+-----------+--------------+-----------+------------+-------+-------------+-------+
Could someone pls assist.

Your approach works. You only have to add a pivot operation after the explode:
import org.apache.spark.sql.functions._
exploded.groupBy("date").pivot("key").agg(first("value")).show()
I assume that the combination of date and key is unique, so it is safe to take the first (and only) value in the aggregation. If the combination is not unique, you could use collect_list as aggregation function.
Edit:
To add scrId and srctype, simply add these columns to the select statement:
val exploded = batch_df.select($"event_date", $"device.srcId", $"device.srctype", explode($"ATTRIBUTES"))
To reduce the number of columns after the pivot operation, apply a filter on the key column before aggregating:
val relevant_cols = Array("Account_Id", "Vb_id", "RECVR_id", "mem_id") // the four additional columns
exploded.filter($"key".isin(relevant_cols:_*).or($"key".endsWith(lit("_split"))))
.groupBy("date").pivot("key").agg(first("value")).show()

Related

Scala spark: extract columns from a schema

I have a schema that looks following:
|-- contributors: map (nullable = true)
| | |-- key: string
| | |-- value: array (valueContainsNull = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- type: string (nullable = true)
| | | | |-- name: string (nullable = true)
| | | | |-- id: string (nullable = true)
I would like to have a dataframe that have the columns key, name and id
I have used the following code to get name and id but how do I get the column key?
df.select(explode(col("contributors")))
.select(explode(col("value")))
.select(col("col.*"))
Update
I tried to apply the first solution to the following schema but the compiler does not like it. I would like to get value._name and subgenres.element.value._name
|-- mainGenre: struct (nullable = true)
| |-- value: struct (nullable = true)
| | |-- _name: string (nullable = true)
| |-- subgenres: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- value: struct (nullable = true)
| | | | |-- type: string (nullable = true))
| | | | |-- _name: string (nullable = true)
| | | |-- name: map (nullable = true)
| | | | |-- key: string
| | | | |-- value: string (valueContainsNull = true)
I tried to create a variable with value._name and then insert it in my second variable like this.
val col_mainGenre_name = df_r.select(col("mainGenre.*"))
.select(col("value.*"))
.select(col("_name"))
.drop("readableName")
.drop("description")
val df_exploded = df_r.select(col("mainGenre.*"))
.select(col_mainGenre_name, col("value.*"))
You can add key column in your second and third select. select method of dataframe accepts several columns as argument.
You should modify your code as follows:
import org.apache.spark.sql.functions.{col, explode}
df.select(explode(col("contributors")))
.select(col("key"), explode(col("value")))
.select(col("key"), col("col.*"))
With the following contributors input column:
+--------------------------------------------------------------------------------------------+
|contributors |
+--------------------------------------------------------------------------------------------+
|{key1 -> [{type11, name11, id11}, {type12, name12, id12}], key2 -> [{type21, name21, id21}]}|
|{key3 -> [{type31, name31, id31}, {type32, name32, id32}], key4 -> []} |
+--------------------------------------------------------------------------------------------+
You get the following ouput:
+----+------+------+----+
|key |type |name |id |
+----+------+------+----+
|key1|type11|name11|id11|
|key1|type12|name12|id12|
|key2|type21|name21|id21|
|key3|type31|name31|id31|
|key3|type32|name32|id32|
+----+------+------+----+
if you want to keep only name and id columns from value, you should also modify the last select to select only col.id and col.name columns:
import org.apache.spark.sql.functions.{col, explode}
df.select(explode(col("contributors")))
.select(col("key"), explode(col("value")))
.select(col("key"), col("col.name"), col("col.id"))
With the same contributors column input, you get your expected ouput:
+----+------+----+
|key |name |id |
+----+------+----+
|key1|name11|id11|
|key1|name12|id12|
|key2|name21|id21|
|key3|name31|id31|
|key3|name32|id32|
+----+------+----+

Retrive subkey values of all the keys in json spark dataframe

i have a data frame with schema like below: (I have large number of keys )
|-- loginRequest: struct (nullable = true)
| |-- responseHeader: struct (nullable = true)
| | |-- status: long (nullable = true)
| | |-- code: long (nullable = true)
|-- loginResponse: struct (nullable = true)
| |-- responseHeader: struct (nullable = true)
| | |-- status: long (nullable = true)
| | |-- code: long (nullable = true)
I want to create a column with status of all the keys of responseHeader.status
Expected
+--------------------+--------------------+------------+
| loginRequest| loginResponse| status |
+--------------------+--------------------+------------+
|[0,1] | null| 0 |
| null|[0,1] | 0 |
| null| [0,1]| 0 |
| null| [1,0]| 1 |
+--------------------+--------------------+-------------
Thanks in Advance
A simple select will solve your problem.
You have a nest field :
loginResponse: struct (nullable = true)
| |-- responseHeader: struct (nullable = true)
| | |-- status
A quick way would be to flatten your dataframe.
Doing something like this:
df.select(df.col("loginRequest.*"),df.col("loginResponse.*"))
And get it working from there:
Or,
You could use something like this:
var explodeDF = df.withColumn("statusRequest", df("loginRequest. responseHeader"))
which you helped me into and these questions:
Flattening Rows in Spark
DataFrame explode list of JSON objects
In order to get it to populate either from response or request, you can use and when condition in spark.
- How to use AND or OR condition in when in Spark
You are able to get the subfields with the . delimiter in the select statement and with the help of the coalesce method, you should get exactly what you aim for, i.e. let's call the input dataframe df with your specified input schema, then this piece of code should do the work:
import org.apache.spark.sql.functions.{coalesce, col}
val df_status = df.withColumn("status",
coalesce(
col("loginRequest.responseHeader.status"),
col("loginResponse.responseHeader.status")
)
)
What coalesce does, is that it takes first non-null value in the order of the input columns to the method and in case there is no non-null value, it will return null (see https://spark.apache.org/docs/2.3.0/api/java/org/apache/spark/sql/functions.html#coalesce-org.apache.spark.sql.Column...-).

Convert Array with nested struct to string column along with other columns from the PySpark DataFrame

This is similar to Pyspark: cast array with nested struct to string
But, the accepted answer is not working for my case, so asking here
|-- Col1: string (nullable = true)
|-- Col2: array (nullable = true)
|-- element: struct (containsNull = true)
|-- Col2Sub: string (nullable = true)
Sample JSON
{"Col1":"abc123","Col2":[{"Col2Sub":"foo"},{"Col2Sub":"bar"}]}
This gives result in a single column
import pyspark.sql.functions as F
df.selectExpr("EXPLODE(Col2) AS structCol").select(F.expr("concat_ws(',', structCol.*)").alias("Col2_concated")).show()
+----------------+
| Col2_concated |
+----------------+
|foo,bar |
+----------------+
But, how to get a result or DataFrame like this
+-------+---------------+
|Col1 | Col2_concated |
+-------+---------------+
|abc123 |foo,bar |
+-------+---------------+
EDIT:
This solution gives the wrong result
df.selectExpr("Col1","EXPLODE(Col2) AS structCol").select("Col1", F.expr("concat_ws(',', structCol.*)").alias("Col2_concated")).show()
+-------+---------------+
|Col1 | Col2_concated |
+-------+---------------+
|abc123 |foo |
+-------+---------------+
|abc123 |bar |
+-------+---------------+
Just avoid the explode and you are already there. All you need is the concat_ws function. This function concatenates multiple string columns with a given seperator. See example below:
from pyspark.sql import functions as F
j = '{"Col1":"abc123","Col2":[{"Col2Sub":"foo"},{"Col2Sub":"bar"}]}'
df = spark.read.json(sc.parallelize([j]))
#printSchema tells us the column names we can use with concat_ws
df.printSchema()
Output:
root
|-- Col1: string (nullable = true)
|-- Col2: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Col2Sub: string (nullable = true)
The column Col2 is an array of Col2Sub and we can use this column name to get the desired result:
bla = df.withColumn('Col2', F.concat_ws(',', df.Col2.Col2Sub))
bla.show()
+------+-------+
| Col1| Col2|
+------+-------+
|abc123|foo,bar|
+------+-------+

In PySpark how to parse an embedded JSON

I am new to PySpark.
I have a JSON file which has below schema
df = spark.read.json(input_file)
df.printSchema()
|-- UrlsInfo: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- displayUrl: string (nullable = true)
| | |-- type: string (nullable = true)
| | |-- url: string (nullable = true)
|-- type: long (nullable = true)
I want a new result dataframe which should have only two columns type and UrlsInfo.element.DisplayUrl
This is my try code, which doesn't give the expected output
df.createOrReplaceTempView("the_table")
resultDF = spark.sql("SELECT type, UrlsInfo.element.DisplayUrl FROM the_table")
resultDF.show()
I want resultDF to be something like this:
Type | DisplayUrl
----- ------------
2 | http://example.com
This is related JSON file parsing in Pyspark, but doesn't answer my question.
As you can see in your schema, UrlsInfo is an array type, not a struct. The "element" schema item thus refers not to a named property (you're trying to access it by .element) but to an array element (which responds to an index like [0]).
I've reproduced your schema by hand:
from pyspark.sql import Row
df = spark.createDataFrame([Row(UrlsInfo=[Row(displayUri="http://example.com", type="narf", url="poit")], Type=2)])
df.printSchema()
root
|-- Type: long (nullable = true)
|-- UrlsInfo: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- displayUri: string (nullable = true)
| | |-- type: string (nullable = true)
| | |-- url: string (nullable = true)
and I'm able to produce a table like what you seem to be looking for by using an index:
df.createOrReplaceTempView("temp")
resultDF = spark.sql("SELECT type, UrlsInfo[0].DisplayUri FROM temp")
resultDF.show()
+----+----------------------+
|type|UrlsInfo[0].DisplayUri|
+----+----------------------+
| 2| http://example.com|
+----+----------------------+
However, this only gives the first element (if any) of UrlsInfo in the second column.
EDIT: I'd forgotten about the EXPLODE function, which you can use here to treat the UrlsInfo elements like a set of rows:
from pyspark.sql import Row
df = spark.createDataFrame([Row(UrlsInfo=[Row(displayUri="http://example.com", type="narf", url="poit"), Row(displayUri="http://another-example.com", type="narf", url="poit")], Type=2)])
df.createOrReplaceTempView("temp")
resultDF = spark.sql("SELECT type, EXPLODE(UrlsInfo.displayUri) AS displayUri FROM temp")
resultDF.show()
+----+--------------------+
|type| displayUri|
+----+--------------------+
| 2| http://example.com|
| 2|http://another-ex...|
+----+--------------------+

How to find the "lowest" element from array<struct>?

I've a dataframe with following schema -
|-- ID: string (nullable = true)
|-- VALUES: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _v1: string (nullable = true)
| | |-- _v2: string (nullable = true)
VALUES are like -
[["1","a"],["2","b"],["3","c"],["4","d"]]
[["4","g"]]
[["3","e"],["4","f"]]
I want to take the VALUES with the lowest integer i.e.
The result df should look like - (which will be StructType now, not Array[Struct])
["1","a"]
["4","g"]
["3","e"]
Can someone please guide me how can I approach this problem by creating a udf ?
Thanks in advance.
You don't need a UDF for that. Just use sort_array and pick the first element.
df.show
+--------------------+
| data_arr|
+--------------------+
|[[4,a], [2,b], [1...|
| [[1,a]]|
| [[3,b], [1,v]]|
+--------------------+
df.printSchema
root
|-- data_arr: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- col1: string (nullable = false)
| | |-- col2: string (nullable = false)
import org.apache.spark.sql.functions.sort_array
df.withColumn("first_asc", sort_array($"data_arr")(0)).show
+--------------------+---------+
| data_arr|first_asc|
+--------------------+---------+
|[[4,a], [2,b], [1...| [1,c]|
| [[1,a]]| [1,a]|
| [[3,b], [1,v]]| [1,v]|
+--------------------+---------+
Using the same dataframe as in the example:
val findSmallest = udf((rows: Seq[Row]) => {
rows.map(row => (row.getAs[String](0), row.getAs[String](1))).sorted.head
})
df.withColumn("SMALLEST", findSmallest($"VALUES"))
Will give a result like this:
+---+--------------------+--------+
| ID| VALUES|SMALLEST|
+---+--------------------+--------+
| 1|[[1,a], [2,b], [3...| [1,2]|
| 2| [[4,e]]| [4,g]|
| 3| [[3,g], [4,f]]| [3,g]|
+---+--------------------+--------+
If you only want the final values use select("SMALLEST).