Spark 2.0 DataFrame: collect multiple rows as an array by column - Scala

I have a DataFrame like the one below, and I want to combine multiple rows into an array when the key columns have the same values.
val data = Seq(("a","b","sum",0),("a","b","avg",2)).toDF("id1","id2","type","value2")
data.show
+---+---+----+------+
|id1|id2|type|value2|
+---+---+----+------+
|  a|  b| sum|     0|
|  a|  b| avg|     2|
+---+---+----+------+
I want to convert it to the following:
+---+---+----+------+
|id1|id2|agg |value2|
+---+---+----+------+
|  a|  b| 0,2|     0|
+---+---+----+------+
The printSchema output should look like this:
root
 |-- id1: string (nullable = true)
 |-- id2: string (nullable = true)
 |-- agg: struct (nullable = true)
 |    |-- sum: int (nullable = true)
 |    |-- dc: int (nullable = true)

You can:
import org.apache.spark.sql.functions._
val data = Seq(
  ("a","b","sum",0),("a","b","avg",2)
).toDF("id1","id2","type","value2")
// first(..., ignoreNulls = true) keeps the value from the row where the when condition matched
val result = data.groupBy($"id1", $"id2").agg(struct(
  first(when($"type" === "sum", $"value2"), ignoreNulls = true).alias("sum"),
  first(when($"type" === "avg", $"value2"), ignoreNulls = true).alias("avg")
).alias("agg"))
result.show
+---+---+-----+
|id1|id2|  agg|
+---+---+-----+
|  a|  b|[0,2]|
+---+---+-----+
result.printSchema
root
 |-- id1: string (nullable = true)
 |-- id2: string (nullable = true)
 |-- agg: struct (nullable = false)
 |    |-- sum: integer (nullable = true)
 |    |-- avg: integer (nullable = true)
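If the set of type values may grow over time, a pivot-based sketch (my addition, not part of the original answer; it assumes the same data DataFrame defined above) avoids hand-writing one first(when(...)) expression per type:

```scala
import org.apache.spark.sql.functions._

// Each listed "type" value becomes its own column; naming the values
// up front spares Spark an extra job to discover the distinct values.
val pivoted = data
  .groupBy($"id1", $"id2")
  .pivot("type", Seq("sum", "avg"))
  .agg(first($"value2"))

// Repack the pivoted top-level columns into a struct to match the schema above.
val result = pivoted.select($"id1", $"id2", struct($"sum", $"avg").as("agg"))
```

The trade-off is that pivot produces top-level columns, so the extra select is only needed if you really want them nested in a struct.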

Convert all the columns of a spark dataframe into a json format and then include the json formatted data as a column in another/parent dataframe

I converted a DataFrame (say, the child DataFrame) into JSON using df.toJSON.
After the JSON conversion, the schema looks like this:
root
 |-- value: string (nullable = true)
I used the following suggestion to wrap the child DataFrame into the intermediate parent schema/DataFrame:
scala> parentDF.toJSON.select(struct($"value").as("data")).printSchema
root
 |-- data: struct (nullable = false)
 |    |-- value: string (nullable = true)
Now I still need to build the parentDF schema further to make it look like:
root
 |-- id
 |-- version
 |-- data: struct (nullable = false)
 |    |-- value: string (nullable = true)
Q1) How can I build the id column using the id from value (i.e. value.id needs to be exposed as id)?
Q2) I need to bring version from a different DataFrame (say versionDF), where version is a constant (the same in all rows). Do I fetch one row from versionDF to read the value of its version column and then populate it as a literal in parentDF?
Please help with any code snippets for this.
Instead of toJSON, use the to_json function in the select statement, and select the required columns alongside it.
Check the code below.
val version = // Get version value from versionDF
parentDF.select($"id",struct(to_json(struct($"*")).as("value")).as("data"),lit(version).as("version"))
scala> parentDF.select($"id",struct(to_json(struct($"*")).as("value")).as("data"),lit(version).as("version")).printSchema
root
 |-- id: integer (nullable = false)
 |-- data: struct (nullable = false)
 |    |-- value: string (nullable = true)
 |-- version: double (nullable = false)
Updated
scala> df.select($"id",to_json(struct(struct($"*").as("value"))).as("data"),lit(version).as("version")).printSchema
root
 |-- id: integer (nullable = false)
 |-- data: string (nullable = true)
 |-- version: integer (nullable = false)
scala> df.select($"id",to_json(struct(struct($"*").as("value"))).as("data"),lit(version).as("version")).show(false)
+---+------------------------------------------+-------+
|id |data                                      |version|
+---+------------------------------------------+-------+
|1  |{"value":{"id":1,"col1":"a1","col2":"b1"}}|1      |
|2  |{"value":{"id":2,"col1":"a2","col2":"b2"}}|1      |
|3  |{"value":{"id":3,"col1":"a3","col2":"b3"}}|1      |
+---+------------------------------------------+-------+
Update-1
scala> df.select($"id",to_json(struct($"*").as("value")).as("data"),lit(version).as("version")).printSchema
root
 |-- id: integer (nullable = false)
 |-- data: string (nullable = true)
 |-- version: integer (nullable = false)
scala> df.select($"id",to_json(struct($"*").as("value")).as("data"),lit(version).as("version")).show(false)
+---+--------------------------------+-------+
|id |data                            |version|
+---+--------------------------------+-------+
|1  |{"id":1,"col1":"a1","col2":"b1"}|1      |
|2  |{"id":2,"col1":"a2","col2":"b2"}|1      |
|3  |{"id":3,"col1":"a3","col2":"b3"}|1      |
+---+--------------------------------+-------+
Try this:
scala> val versionDF = List((1.0)).toDF("version")
versionDF: org.apache.spark.sql.DataFrame = [version: double]
scala> versionDF.show
+-------+
|version|
+-------+
|    1.0|
+-------+
scala> val version = versionDF.first.get(0)
version: Any = 1.0
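A small aside (my addition, not the answerer's): first.get(0) types version as Any, so the type of lit(version) is only resolved at runtime. If the column type is known, Row.getAs keeps the static type:

```scala
// getAs pins the static type, so lit(version) produces a DoubleType column.
val version: Double = versionDF.first.getAs[Double]("version")
```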
scala>
scala> val childDF = List((1,"a1","b1"),(2,"a2","b2"),(3,"a3","b3")).toDF("id","col1","col2")
childDF: org.apache.spark.sql.DataFrame = [id: int, col1: string ... 1 more field]
scala> childDF.show
+---+----+----+
| id|col1|col2|
+---+----+----+
|  1|  a1|  b1|
|  2|  a2|  b2|
|  3|  a3|  b3|
+---+----+----+
scala>
scala> val parentDF = childDF.toJSON.select(struct($"value").as("data")).withColumn("id",from_json($"data.value",childDF.schema).getItem("id")).withColumn("version",lit(version))
parentDF: org.apache.spark.sql.DataFrame = [data: struct<value: string>, id: int ... 1 more field]
scala> parentDF.printSchema
root
 |-- data: struct (nullable = false)
 |    |-- value: string (nullable = true)
 |-- id: integer (nullable = true)
 |-- version: double (nullable = false)
scala> parentDF.show(false)
+----------------------------------+---+-------+
|data                              |id |version|
+----------------------------------+---+-------+
|[{"id":1,"col1":"a1","col2":"b1"}]|1  |1.0    |
|[{"id":2,"col1":"a2","col2":"b2"}]|2  |1.0    |
|[{"id":3,"col1":"a3","col2":"b3"}]|3  |1.0    |
+----------------------------------+---+-------+

Flatten a Scala array data type column to multiple columns

Is there any way to flatten an array column in a Scala DataFrame?
I know that selecting filed.a works when I name the fields, but I don't want to specify them manually.
df.printSchema()
 |-- client_version: string (nullable = true)
 |-- filed: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- a: string (nullable = true)
 |    |    |-- b: string (nullable = true)
 |    |    |-- c: string (nullable = true)
 |    |    |-- d: string (nullable = true)
Final DataFrame:
df.printSchema()
 |-- client_version: string (nullable = true)
 |-- filed_a: string (nullable = true)
 |-- filed_b: string (nullable = true)
 |-- filed_c: string (nullable = true)
 |-- filed_d: string (nullable = true)
You can flatten your ArrayType column with explode and map the nested struct element names to the wanted top-level column names, as shown below:
import org.apache.spark.sql.functions._
case class S(a: String, b: String, c: String, d: String)
val df = Seq(
  ("1.0", Seq(S("a1", "b1", "c1", "d1"))),
  ("2.0", Seq(S("a2", "b2", "c2", "d2"), S("a3", "b3", "c3", "d3")))
).toDF("client_version", "filed")
df.printSchema
// root
// |-- client_version: string (nullable = true)
// |-- filed: array (nullable = true)
// |    |-- element: struct (containsNull = true)
// |    |    |-- a: string (nullable = true)
// |    |    |-- b: string (nullable = true)
// |    |    |-- c: string (nullable = true)
// |    |    |-- d: string (nullable = true)
val dfFlattened = df.withColumn("filed_element", explode($"filed"))
val structElements = dfFlattened.select($"filed_element.*").columns
val dfResult = dfFlattened.select(
  col("client_version") +: structElements.map(
    c => col(s"filed_element.$c").as(s"filed_$c")
  ): _*
)
dfResult.show
// +--------------+-------+-------+-------+-------+
// |client_version|filed_a|filed_b|filed_c|filed_d|
// +--------------+-------+-------+-------+-------+
// |           1.0|     a1|     b1|     c1|     d1|
// |           2.0|     a2|     b2|     c2|     d2|
// |           2.0|     a3|     b3|     c3|     d3|
// +--------------+-------+-------+-------+-------+
dfResult.printSchema
// root
// |-- client_version: string (nullable = true)
// |-- filed_a: string (nullable = true)
// |-- filed_b: string (nullable = true)
// |-- filed_c: string (nullable = true)
// |-- filed_d: string (nullable = true)
Use explode to flatten the arrays by adding more rows and then select with the * notation to bring the struct columns back to the top.
import org.apache.spark.sql.functions.{collect_list, explode, struct}
import spark.implicits._
val df = Seq(
  ("1", "a", "a", "a"),
  ("1", "b", "b", "b"),
  ("2", "a", "a", "a"),
  ("2", "b", "b", "b"),
  ("2", "c", "c", "c"),
  ("3", "a", "a", "a"))
  .toDF("idx", "A", "B", "C")
  .groupBy("idx")
  .agg(collect_list(struct("A", "B", "C")).as("nested_col"))
df.printSchema()
// root
// |-- idx: string (nullable = true)
// |-- nested_col: array (nullable = true)
// |    |-- element: struct (containsNull = true)
// |    |    |-- A: string (nullable = true)
// |    |    |-- B: string (nullable = true)
// |    |    |-- C: string (nullable = true)
df.show
// +---+--------------------+
// |idx|          nested_col|
// +---+--------------------+
// |  3|         [[a, a, a]]|
// |  1|[[a, a, a], [b, b...|
// |  2|[[a, a, a], [b, b...|
// +---+--------------------+
val dfExploded = df.withColumn("exploded", explode($"nested_col")).drop("nested_col")
dfExploded.show
// +---+---------+
// |idx| exploded|
// +---+---------+
// |  3|[a, a, a]|
// |  1|[a, a, a]|
// |  1|[b, b, b]|
// |  2|[a, a, a]|
// |  2|[b, b, b]|
// |  2|[c, c, c]|
// +---+---------+
val finalDF = dfExploded.select("idx", "exploded.*")
finalDF.show
// +---+---+---+---+
// |idx|  A|  B|  C|
// +---+---+---+---+
// |  3|  a|  a|  a|
// |  1|  a|  a|  a|
// |  1|  b|  b|  b|
// |  2|  a|  a|  a|
// |  2|  b|  b|  b|
// |  2|  c|  c|  c|
// +---+---+---+---+

Explode multiple columns of same type with different lengths

I have a Spark DataFrame in the following format that needs to be exploded. I checked other solutions such as this one. However, in my case, before and after can be arrays of different lengths.
root
 |-- id: string (nullable = true)
 |-- before: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- start_time: string (nullable = true)
 |    |    |-- end_time: string (nullable = true)
 |    |    |-- area: string (nullable = true)
 |-- after: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- start_time: string (nullable = true)
 |    |    |-- end_time: string (nullable = true)
 |    |    |-- area: string (nullable = true)
For instance, if the DataFrame has just one row, before is an array of size 2, and after is an array of size 3, the exploded version should have 5 rows with the following schema:
root
 |-- id: string (nullable = true)
 |-- type: string (nullable = true)
 |-- start_time: integer (nullable = false)
 |-- end_time: string (nullable = true)
 |-- area: string (nullable = true)
where type is a new column which can be "before" or "after".
I can do this in two separate explodes, where I create the type column in each explode and then union them:
val dfSummary1 = df
  .withColumn("before_exp", explode($"before"))
  .withColumn("type", lit("before"))
  .withColumn("start_time", $"before_exp.start_time")
  .withColumn("end_time", $"before_exp.end_time")
  .withColumn("area", $"before_exp.area")
  .drop("before_exp", "before")
val dfSummary2 = df
  .withColumn("after_exp", explode($"after"))
  .withColumn("type", lit("after"))
  .withColumn("start_time", $"after_exp.start_time")
  .withColumn("end_time", $"after_exp.end_time")
  .withColumn("area", $"after_exp.area")
  .drop("after_exp", "after")
val dfResult = dfSummary1.unionAll(dfSummary2)
But, I was wondering if there is a more elegant way to do this. Thanks.
You can also achieve this without a union. With this data:
case class Area(start_time: String, end_time: String, area: String)
val df = Seq((
  "1",
  Seq(Area("01:00", "01:30", "10"), Area("02:00", "02:30", "20")),
  Seq(Area("07:00", "07:30", "70"), Area("08:00", "08:30", "80"), Area("09:00", "09:30", "90"))
)).toDF("id", "before", "after")
you can do:
df
  .select($"id",
    explode(
      array(
        struct(lit("before").as("type"), $"before".as("data")),
        struct(lit("after").as("type"), $"after".as("data"))
      )
    ).as("step1")
  )
  .select($"id", $"step1.type", explode($"step1.data").as("step2"))
  .select($"id", $"type", $"step2.*")
  .show()
+---+------+----------+--------+----+
| id|  type|start_time|end_time|area|
+---+------+----------+--------+----+
|  1|before|     01:00|   01:30|  10|
|  1|before|     02:00|   02:30|  20|
|  1| after|     07:00|   07:30|  70|
|  1| after|     08:00|   08:30|  80|
|  1| after|     09:00|   09:30|  90|
+---+------+----------+--------+----+
I think exploding the two columns separately followed by a union is a decent straightforward approach. You could simplify the StructField-element selection a little and create a simple method for the repetitive explode process, like below:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.DataFrame
case class Area(start_time: String, end_time: String, area: String)
val df = Seq((
  "1",
  Seq(Area("01:00", "01:30", "10"), Area("02:00", "02:30", "20")),
  Seq(Area("07:00", "07:30", "70"), Area("08:00", "08:30", "80"), Area("09:00", "09:30", "90"))
)).toDF("id", "before", "after")
def explodeCol(df: DataFrame, colName: String): DataFrame = {
  val expColName = colName + "_exp"
  df.
    withColumn("type", lit(colName)).
    withColumn(expColName, explode(col(colName))).
    select("id", "type", expColName + ".*")
}
val dfResult = explodeCol(df, "before") union explodeCol(df, "after")
dfResult.show
// +---+------+----------+--------+----+
// | id|  type|start_time|end_time|area|
// +---+------+----------+--------+----+
// |  1|before|     01:00|   01:30|  10|
// |  1|before|     02:00|   02:30|  20|
// |  1| after|     07:00|   07:30|  70|
// |  1| after|     08:00|   08:30|  80|
// |  1| after|     09:00|   09:30|  90|
// +---+------+----------+--------+----+

Spark: How to split struct type into multiple columns?

I know this question has been asked many times on Stack Overflow and has been satisfactorily answered in most posts, but I'm not sure if this is the best way in my case.
I have a Dataset that has several struct types embedded in it:
root
 |-- STRUCT1: struct (nullable = true)
 |    |-- FIELD_1: string (nullable = true)
 |    |-- FIELD_2: long (nullable = true)
 |    |-- FIELD_3: integer (nullable = true)
 |-- STRUCT2: struct (nullable = true)
 |    |-- FIELD_4: string (nullable = true)
 |    |-- FIELD_5: long (nullable = true)
 |    |-- FIELD_6: integer (nullable = true)
 |-- STRUCT3: struct (nullable = true)
 |    |-- FIELD_7: string (nullable = true)
 |    |-- FIELD_8: long (nullable = true)
 |    |-- FIELD_9: integer (nullable = true)
 |-- ARRAYSTRUCT4: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- FIELD_10: integer (nullable = true)
 |    |    |-- FIELD_11: integer (nullable = true)
+-------+------------+------------+------------------+
|STRUCT1|     STRUCT2|     STRUCT3|      ARRAYSTRUCT4|
+-------+------------+------------+------------------+
|[1,2,3]|[aa, xx, yy]|[p1, q2, r3]|[[1a, 2b],[3c,4d]]|
+-------+------------+------------+------------------+
I want to convert this into:
1. A dataset where the structs are expanded into columns.
2. A data set where the array (ARRAYSTRUCT4) is exploded into rows.
root
 |-- FIELD_1: string (nullable = true)
 |-- FIELD_2: long (nullable = true)
 |-- FIELD_3: integer (nullable = true)
 |-- FIELD_4: string (nullable = true)
 |-- FIELD_5: long (nullable = true)
 |-- FIELD_6: integer (nullable = true)
 |-- FIELD_7: string (nullable = true)
 |-- FIELD_8: long (nullable = true)
 |-- FIELD_9: integer (nullable = true)
 |-- FIELD_10: integer (nullable = true)
 |-- FIELD_11: integer (nullable = true)
+-------+------------+------------+---------+ ---------+----------+
|FIELD_1| FIELD_2 | FIELD_3 | FIELD_4 | |FIELD_10| FIELD_11 |
+-------+------------+------------+---------+ ... ---------+----------+
|1 |2 |3 | aa | | 1a | 2b |
+-------+------------+------------+-----------------------------------+
To achieve this, I could use:
val expanded = df.select("STRUCT1.*", "STRUCT2.*", "STRUCT3.*", "ARRAYSTRUCT4")
followed by an explode:
val exploded = expanded.select(explode(expanded("ARRAYSTRUCT4")))
However, I was wondering if there's a more functional way to do this, especially the select. I could use withColumn as below:
data.withColumn("FIELD_1", $"STRUCT1".getItem(0))
.withColumn("FIELD_2", $"STRUCT1".getItem(1))
.....
But I have 80+ columns. Is there a better way to achieve this?
You can first make all columns struct-type by explode-ing any Array(struct) columns into struct columns via foldLeft, then use map to interpolate each of the struct column names into col.*, as shown below:
import org.apache.spark.sql.functions._
case class S1(FIELD_1: String, FIELD_2: Long, FIELD_3: Int)
case class S2(FIELD_4: String, FIELD_5: Long, FIELD_6: Int)
case class S3(FIELD_7: String, FIELD_8: Long, FIELD_9: Int)
case class S4(FIELD_10: Int, FIELD_11: Int)
val df = Seq(
  (S1("a1", 101, 11), S2("a2", 102, 12), S3("a3", 103, 13), Array(S4(1, 1), S4(3, 3))),
  (S1("b1", 201, 21), S2("b2", 202, 22), S3("b3", 203, 23), Array(S4(2, 2), S4(4, 4)))
).toDF("STRUCT1", "STRUCT2", "STRUCT3", "ARRAYSTRUCT4")
// +-----------+-----------+-----------+--------------+
// |    STRUCT1|    STRUCT2|    STRUCT3|  ARRAYSTRUCT4|
// +-----------+-----------+-----------+--------------+
// |[a1,101,11]|[a2,102,12]|[a3,103,13]|[[1,1], [3,3]]|
// |[b1,201,21]|[b2,202,22]|[b3,203,23]|[[2,2], [4,4]]|
// +-----------+-----------+-----------+--------------+
val arrayCols = df.dtypes.
  filter( t => t._2.startsWith("ArrayType(StructType") ).
  map(_._1)
// arrayCols: Array[String] = Array(ARRAYSTRUCT4)
val expandedDF = arrayCols.foldLeft(df)( (accDF, c) =>
  accDF.withColumn(c.replace("ARRAY", ""), explode(col(c))).drop(c)
)
val structCols = expandedDF.columns
expandedDF.select(structCols.map(c => col(s"$c.*")): _*).show
// +-------+-------+-------+-------+-------+-------+-------+-------+-------+--------+--------+
// |FIELD_1|FIELD_2|FIELD_3|FIELD_4|FIELD_5|FIELD_6|FIELD_7|FIELD_8|FIELD_9|FIELD_10|FIELD_11|
// +-------+-------+-------+-------+-------+-------+-------+-------+-------+--------+--------+
// |     a1|    101|     11|     a2|    102|     12|     a3|    103|     13|       1|       1|
// |     a1|    101|     11|     a2|    102|     12|     a3|    103|     13|       3|       3|
// |     b1|    201|     21|     b2|    202|     22|     b3|    203|     23|       2|       2|
// |     b1|    201|     21|     b2|    202|     22|     b3|    203|     23|       4|       4|
// +-------+-------+-------+-------+-------+-------+-------+-------+-------+--------+--------+
Note that for simplicity it's assumed that your DataFrame has only struct and Array(struct)-type columns. If there are other data types, just apply filtering conditions to arrayCols and structCols accordingly.
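That column split can be sketched in plain Scala over the (name, typeString) pairs that df.dtypes returns; the type strings below are shortened, hypothetical versions of Spark's real output, but they start with the same prefixes the filter keys on:

```scala
// Simulated df.dtypes output: (columnName, typeString) pairs.
val dtypes = Array(
  ("STRUCT1",      "StructType(StructField(FIELD_1,StringType,true))"),
  ("ARRAYSTRUCT4", "ArrayType(StructType(StructField(FIELD_10,IntegerType,true)),true)"),
  ("plain_col",    "StringType")
)

// Array(struct) columns get exploded first.
val arrayCols = dtypes.collect { case (name, tpe) if tpe.startsWith("ArrayType(StructType") => name }
// Plain struct columns get expanded with col(s"$name.*").
val structCols = dtypes.collect { case (name, tpe) if tpe.startsWith("StructType") => name }
// Anything else is selected as-is, without the ".*" suffix.
val otherCols = dtypes.map(_._1).diff(arrayCols ++ structCols)
```

In the real pipeline you would then select otherCols.map(col) ++ structCols.map(c => col(s"$c.*")) after the foldLeft explode step.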

How to transform Spark Dataframe columns to a single column of a string array

I want to know how I can "merge" multiple DataFrame columns into one as a string array.
For example, I have this dataframe:
val df = sqlContext.createDataFrame(Seq((1, "Jack", "125", "Text"), (2,"Mary", "152", "Text2"))).toDF("Id", "Name", "Number", "Comment")
Which looks like this:
scala> df.show
+---+----+------+-------+
| Id|Name|Number|Comment|
+---+----+------+-------+
|  1|Jack|   125|   Text|
|  2|Mary|   152|  Text2|
+---+----+------+-------+
scala> df.printSchema
root
 |-- Id: integer (nullable = false)
 |-- Name: string (nullable = true)
 |-- Number: string (nullable = true)
 |-- Comment: string (nullable = true)
How can I transform it so it would look like this:
scala> df.show
+---+-----------------+
| Id|             List|
+---+-----------------+
|  1|  [Jack,125,Text]|
|  2| [Mary,152,Text2]|
+---+-----------------+
scala> df.printSchema
root
 |-- Id: integer (nullable = false)
 |-- List: array (nullable = true)
 |    |-- element: string (containsNull = true)
Use org.apache.spark.sql.functions.array:
import org.apache.spark.sql.functions._
val result = df.select($"Id", array($"Name", $"Number", $"Comment") as "List")
result.show()
// +---+------------------+
// |Id |List              |
// +---+------------------+
// |1  |[Jack, 125, Text] |
// |2  |[Mary, 152, Text2]|
// +---+------------------+
It can also be used with withColumn (note the Scala 2 rename syntax for the import, and that the array should go into the new List column rather than overwrite Id):
import org.apache.spark.sql.{functions => F}
df.withColumn("List", F.array(F.col("Name"), F.col("Number"), F.col("Comment")))
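If listing the three columns by hand gets tedious, the array can also be built from the remaining column names; a small sketch, assuming every column except Id belongs in the list:

```scala
import org.apache.spark.sql.functions._

// Take every column except the key and wrap them in a single array column.
val listCols = df.columns.filterNot(_ == "Id").map(col)
val result = df.select($"Id", array(listCols: _*).as("List"))
```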