I have a schema that looks like the following:
|-- contributors: map (nullable = true)
| | |-- key: string
| | |-- value: array (valueContainsNull = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- type: string (nullable = true)
| | | | |-- name: string (nullable = true)
| | | | |-- id: string (nullable = true)
I would like to have a dataframe that has the columns key, name, and id.
I have used the following code to get name and id, but how do I get the key column?
df.select(explode(col("contributors")))
.select(explode(col("value")))
.select(col("col.*"))
Update
I tried to apply the first solution to the following schema, but the compiler does not like it. I would like to get value._name and subgenres.element.value._name.
|-- mainGenre: struct (nullable = true)
| |-- value: struct (nullable = true)
| | |-- _name: string (nullable = true)
| |-- subgenres: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- value: struct (nullable = true)
| | | | |-- type: string (nullable = true)
| | | | |-- _name: string (nullable = true)
| | | |-- name: map (nullable = true)
| | | | |-- key: string
| | | | |-- value: string (valueContainsNull = true)
I tried to create a variable with value._name and then insert it into my second variable like this.
val col_mainGenre_name = df_r.select(col("mainGenre.*"))
.select(col("value.*"))
.select(col("_name"))
.drop("readableName")
.drop("description")
val df_exploded = df_r.select(col("mainGenre.*"))
.select(col_mainGenre_name, col("value.*"))
You can add the key column in your second and third select. The select method of a DataFrame accepts several columns as arguments.
You should modify your code as follows:
import org.apache.spark.sql.functions.{col, explode}
df.select(explode(col("contributors")))
.select(col("key"), explode(col("value")))
.select(col("key"), col("col.*"))
With the following contributors input column:
+--------------------------------------------------------------------------------------------+
|contributors |
+--------------------------------------------------------------------------------------------+
|{key1 -> [{type11, name11, id11}, {type12, name12, id12}], key2 -> [{type21, name21, id21}]}|
|{key3 -> [{type31, name31, id31}, {type32, name32, id32}], key4 -> []} |
+--------------------------------------------------------------------------------------------+
You get the following output:
+----+------+------+----+
|key |type |name |id |
+----+------+------+----+
|key1|type11|name11|id11|
|key1|type12|name12|id12|
|key2|type21|name21|id21|
|key3|type31|name31|id31|
|key3|type32|name32|id32|
+----+------+------+----+
If you want to keep only the name and id columns from value, you should also modify the last select to select only the col.id and col.name columns:
import org.apache.spark.sql.functions.{col, explode}
df.select(explode(col("contributors")))
.select(col("key"), explode(col("value")))
.select(col("key"), col("col.name"), col("col.id"))
With the same contributors input column, you get your expected output:
+----+------+----+
|key |name |id |
+----+------+----+
|key1|name11|id11|
|key1|name12|id12|
|key2|name21|id21|
|key3|name31|id31|
|key3|name32|id32|
+----+------+----+
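Regarding the updated mainGenre schema: a minimal sketch, assuming the schema printed in the update (the alias names are only illustrative). Struct fields can be reached with dot notation, and accessing a struct field through an array yields an array of that field:
import org.apache.spark.sql.functions.col
df_r.select(
  col("mainGenre.value._name").as("mainGenre_name"),
  col("mainGenre.subgenres.value._name").as("subgenre_names") // array<string> of subgenre names
)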
Related
Below is my source schema.
root
|-- header: struct (nullable = true)
| |-- timestamp: long (nullable = true)
| |-- id: string (nullable = true)
| |-- honame: string (nullable = true)
|-- device: struct (nullable = true)
| |-- srcId: string (nullable = true)
| |-- srctype: string (nullable = true)
|-- ATTRIBUTES: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
|-- event_date: date (nullable = true)
|-- event_datetime: string (nullable = true)
I want to explode the ATTRIBUTES map-type column and select all the columns that end with _id.
I'm using the code below.
val exploded = batch_df.select($"event_date", explode($"ATTRIBUTES"))
exploded.show()
I am getting the sample output below.
+----------+-----------+-----+
|date      |key        |value|
+----------+-----------+-----+
|2021-05-18|SYST_id    |85   |
|2021-05-18|RECVR_id   |1    |
|2021-05-18|Account_Id |12345|
|2021-05-18|Vb_id      |845  |
|2021-05-18|SYS_INFO_id|640  |
|2021-05-18|mem_id     |456  |
+----------+-----------+-----+
However, my required output is as below.
+----------+-------+--------+----------+-----+-----------+------+
|date      |SYST_id|RECVR_id|Account_Id|Vb_id|SYS_INFO_id|mem_id|
+----------+-------+--------+----------+-----+-----------+------+
|2021-05-18|85     |1       |12345     |845  |640        |456   |
+----------+-------+--------+----------+-----+-----------+------+
Could someone please assist?
Your approach works. You only have to add a pivot operation after the explode:
import org.apache.spark.sql.functions._
exploded.groupBy("date").pivot("key").agg(first("value")).show()
I assume that the combination of date and key is unique, so it is safe to take the first (and only) value in the aggregation. If the combination is not unique, you could use collect_list as the aggregation function.
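A minimal sketch of that collect_list variant, in case of duplicates; each pivoted cell then becomes an array of all values seen for that (date, key) pair:
import org.apache.spark.sql.functions.collect_list
exploded.groupBy("date").pivot("key").agg(collect_list("value")).show()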
Edit:
To add srcId and srctype, simply add these columns to the select statement:
val exploded = batch_df.select($"event_date", $"device.srcId", $"device.srctype", explode($"ATTRIBUTES"))
To reduce the number of columns after the pivot operation, apply a filter on the key column before aggregating:
val relevant_cols = Array("Account_Id", "Vb_id", "RECVR_id", "mem_id") // the four additional columns
exploded.filter($"key".isin(relevant_cols:_*).or($"key".endsWith(lit("_split"))))
.groupBy("date").pivot("key").agg(first("value")).show()
I have a dataframe with this schema:
root
|-- customer_id: string (nullable = true)
|-- service: struct (nullable = true)
| |-- cat1: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- category: string (nullable = true)
| | | |-- match_id: string (nullable = true)
| |-- cat2: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- category: string (nullable = true)
| | | |-- match_id: string (nullable = true)
The actual data looks like this:
+-----------+-------------------------------------------------------------------------------+
|customer_id|service |
+-----------+-------------------------------------------------------------------------------+
|CID1 |[[[cat1, service1], [cat1, service3]],] |
|CID2 |[[[cat1, service4],], [[cat2, service7], [cat2, service8], [cat2, service9]]] |
+-----------+-------------------------------------------------------------------------------+
I hope the transformed data can look like this:
+-----------+------+--------------------------------------------------------------------------+
|customer_id| cat | service |
+-----------+------+--------------------------------------------------------------------------+
|CID1 | cat1 | [[cat1, service1], [cat1, service3]] |
|CID2 | cat1 | [[cat1, service4]] |
|CID2 | cat2 | [[cat2, service7], [cat2, service8], [cat2, service9]] |
+-----------+------+--------------------------------------------------------------------------+
Or even better (though it will be simple to get here if I can produce the form above):
+-----------+------+-----------------------------------+
|customer_id| cat | service |
+-----------+------+-----------------------------------+
|CID1 | cat1 | [service1, service3] |
|CID2 | cat1 | [service4] |
|CID2 | cat2 | [service7, service8, service9] |
+-----------+------+-----------------------------------+
where service is a concatenation of the original cat1 and cat2.
One thing to note is that there could be many fields under the original service, meaning there could be cat1, cat2, cat3, and so on.
I'm new to Scala as well as Spark, and have searched for a while, but haven't seen similar examples.
You could explode your service column twice and collect the list by grouping on customer_id:
import org.apache.spark.sql.functions.{array, col, collect_list, explode}
// service is a struct, not an array, so wrap its cat fields in an array before the first explode
val explodedOnceDF = df.select(col("customer_id"), explode(array(col("service.cat1"), col("service.cat2"))).as("service"))
val explodedTwiceDF = explodedOnceDF.select(col("customer_id"), explode(col("service")).as("service"))
val requiredOutput = explodedTwiceDF.groupBy("customer_id").agg(collect_list("service").as("service"))
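If you also need the cat name as its own column (the first output shape in the question), here is a sketch of one way to do it, assuming Spark 2.4+ for map_from_arrays and transform, and reading the cat field names from the struct schema so that extra cat3, cat4, ... fields are picked up automatically:
import org.apache.spark.sql.functions.{array, col, explode, expr, lit, map_from_arrays}
// cat field names taken from the service struct, e.g. Array("cat1", "cat2", ...)
val catCols = df.select(col("service.*")).columns
val byCat = df.select(
    col("customer_id"),
    // build a map of catName -> array<struct>, then explode it into (cat, service) rows
    explode(map_from_arrays(
      array(catCols.map(lit): _*),
      array(catCols.map(c => col(s"service.$c")): _*)
    )).as(Seq("cat", "service")))
  .where(col("service").isNotNull) // drop cats that are null for a customer
// optional: keep only the match_id values, as in the second desired output
val byCatIds = byCat.withColumn("service", expr("transform(service, x -> x.match_id)"))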
Hope this helps!
I have a requirement to parse JSON data as shown in the expected results below; currently I cannot work out how to include the signal names (ABS, ADA, ADW) in the SIGNAL column. Any help would be much appreciated.
I tried something that gives the results shown below, but I also need to include all the signals in the SIGNAL column, as shown in the expected results.
jsonDF.select(explode($"ABS") as "element").withColumn("stime", col("element.E")).withColumn("can_value", col("element.V")).drop(col("element")).show()
+----------+----------+
|stime     |can_value |
+----------+----------+
|value of E|value of V|
+----------+----------+
df.printSchema
|-- ABS: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: long (nullable = true)
| | |-- V: long (nullable = true)
|-- ADA: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: long (nullable = true)
| | |-- V: long (nullable = true)
|-- ADW: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: long (nullable = true)
| | |-- V: long (nullable = true)
|-- ALT: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: long (nullable = true)
| | |-- V: double (nullable = true)
|-- APP: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: long (nullable = true)
| | |-- V: double (nullable = true)
I will need output like below:
+------+----------+----------+
|SIGNAL|stime     |can_value |
+------+----------+----------+
|ABS   |value of E|value of V|
|ADA   |value of E|value of V|
|ADW   |value of E|value of V|
+------+----------+----------+
To get the expected output and to insert values in the SIGNAL column:
jsonDF.select(explode($"ABS") as "element")
.withColumn("stime", col("element.E"))
.withColumn("can_value", col("element.V"))
.drop(col("element"))
.withColumn("SIGNAL",lit("ABS"))
.show()
And the generalized version of the above approach:
(Based on the result of df.printSchema, this assumes that the signal names are the column names and that each of those columns contains an array of struct(E, V) elements.)
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit}
val columns: Array[String] = df.columns
var arrayOfDFs: Array[DataFrame] = Array()
for(col_name <- columns){
val temp = df.selectExpr("explode("+col_name+") as element")
.select(
lit(col_name).as("SIGNAL"),
col("element.E").as("stime"),
col("element.V").as("can_value"))
arrayOfDFs = arrayOfDFs :+ temp
}
val jsonDF = arrayOfDFs.reduce(_ union _)
jsonDF.show(false)
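The same thing can be written without the mutable Array; a sketch that also casts V to double, since some of the arrays hold long values and others double, so the union keeps a single consistent type:
import org.apache.spark.sql.functions.{col, lit}
val combined = df.columns.map { c =>
    df.selectExpr(s"explode(`$c`) as element")
      .select(
        lit(c).as("SIGNAL"),
        col("element.E").as("stime"),
        col("element.V").cast("double").as("can_value"))
  }
  .reduce(_ union _)
combined.show(false)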
Hello,
I'm working with the Spark framework in Scala. My dataframe has a column with the following structure and content:
+---------------------------------------------------------------------------------------------+
|Email_Code |
+---------------------------------------------------------------------------------------------+
|[WrappedArray([3,spain]), WrappedArray([,]), WrappedArray([3,spain])] |
|[WrappedArray([3,spain]), WrappedArray([3,spain])] |
+---------------------------------------------------------------------------------------------+
|-- Email_Code: array (nullable = true)
| |-- element: array (containsNull = false)
| | |-- element: struct (containsNull = false)
| | | |-- Code: string (nullable = true)
| | | |-- Value: string (nullable = true)
I am trying to develop a UDF that takes all the values of the Code field present in the nested arrays, but I have not been able to.
I would like an output like the following:
+---------------------------------------------------------------------------------------------+
|Email_Code |
+---------------------------------------------------------------------------------------------+
|[3,,3] |
|[3,3] |
+---------------------------------------------------------------------------------------------+
Any help please?
I managed to fix it:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf
val transformation = udf((data: Seq[Seq[Row]]) =>
  data.flatten.map { case Row(code: String, value: String) => code })
df.withColumn("result", transformation($"columnName"))
I have a dataframe with the following schema -
|-- ID: string (nullable = true)
|-- VALUES: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _v1: string (nullable = true)
| | |-- _v2: string (nullable = true)
VALUES are like -
[["1","a"],["2","b"],["3","c"],["4","d"]]
[["4","g"]]
[["3","e"],["4","f"]]
I want to take the VALUES element with the lowest integer, i.e.
the result df should look like this (the value will now be a StructType, not an Array[Struct]) -
["1","a"]
["4","g"]
["3","e"]
Can someone please guide me on how I can approach this problem by creating a UDF?
Thanks in advance.
You don't need a UDF for that. Just use sort_array and pick the first element.
df.show
+--------------------+
| data_arr|
+--------------------+
|[[4,a], [2,b], [1...|
| [[1,a]]|
| [[3,b], [1,v]]|
+--------------------+
df.printSchema
root
|-- data_arr: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- col1: string (nullable = false)
| | |-- col2: string (nullable = false)
import org.apache.spark.sql.functions.sort_array
df.withColumn("first_asc", sort_array($"data_arr")(0)).show
+--------------------+---------+
| data_arr|first_asc|
+--------------------+---------+
|[[4,a], [2,b], [1...| [1,c]|
| [[1,a]]| [1,a]|
| [[3,b], [1,v]]| [1,v]|
+--------------------+---------+
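One caveat: sort_array compares the struct fields as strings, so with multi-digit values "10" would sort before "2". If the first field can grow past one digit, here is a sketch that compares it numerically instead, assuming Spark 3.0+ for array_sort with a comparator:
import org.apache.spark.sql.functions.expr
df.withColumn("first_asc",
  expr("element_at(array_sort(data_arr, (l, r) -> cast(l.col1 as int) - cast(r.col1 as int)), 1)"))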
Using the same dataframe as in the example:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf
val findSmallest = udf((rows: Seq[Row]) => {
  rows.map(row => (row.getAs[String](0), row.getAs[String](1))).sorted.head
})
df.withColumn("SMALLEST", findSmallest($"VALUES"))
Will give a result like this:
+---+--------------------+--------+
| ID|              VALUES|SMALLEST|
+---+--------------------+--------+
|  1|[[1,a], [2,b], [3...|   [1,a]|
|  2|             [[4,g]]|   [4,g]|
|  3|      [[3,e], [4,f]]|   [3,e]|
+---+--------------------+--------+
If you only want the final values, use select("SMALLEST").