Spark GroupBy agg collect_list multiple columns

I have a question similar to this one, but in my case the columns that collect_list should operate on are given as a list of names. For example:
scala> w.show
+---+-----+----+-----+
|iid|event|date|place|
+---+-----+----+-----+
|  A|   D1|  T0|   P1|
|  A|   D0|  T1|   P2|
|  B|   Y1|  T0|   P3|
|  B|   Y2|  T2|   P3|
|  C|   H1|  T0|   P5|
|  C|   H0|  T9|   P5|
|  B|   Y0|  T1|   P2|
|  B|   H1|  T3|   P6|
|  D|   H1|  T2|   P4|
+---+-----+----+-----+
scala> val combList = List("event", "date", "place")
combList: List[String] = List(event, date, place)
scala> val v = w.groupBy("iid").agg(collect_list(combList(0)), collect_list(combList(1)), collect_list(combList(2)))
v: org.apache.spark.sql.DataFrame = [iid: string, collect_list(event): array<string> ... 2 more fields]
scala> v.show
+---+-------------------+------------------+-------------------+
|iid|collect_list(event)|collect_list(date)|collect_list(place)|
+---+-------------------+------------------+-------------------+
|  B|   [Y1, Y2, Y0, H1]|  [T0, T2, T1, T3]|   [P3, P3, P2, P6]|
|  D|               [H1]|              [T2]|               [P4]|
|  C|           [H1, H0]|          [T0, T9]|           [P5, P5]|
|  A|           [D1, D0]|          [T0, T1]|           [P1, P2]|
+---+-------------------+------------------+-------------------+
Is there any way to apply collect_list to multiple columns inside agg without knowing the number of elements in combList in advance?

You can use collect_list(struct(col1, col2)) AS elements, which collects several columns into a single array-of-structs column.
Example:
df.select("cd_issuer", "cd_doc", "cd_item", "nm_item").printSchema
val outputDf = spark.sql(s"SELECT cd_issuer, cd_doc, collect_list(struct(cd_item, nm_item)) AS item FROM teste GROUP BY cd_issuer, cd_doc")
outputDf.printSchema
df
|-- cd_issuer: string (nullable = true)
|-- cd_doc: string (nullable = true)
|-- cd_item: string (nullable = true)
|-- nm_item: string (nullable = true)
outputDf
|-- cd_issuer: string (nullable = true)
|-- cd_doc: string (nullable = true)
|-- item: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- cd_item: string (nullable = true)
| | |-- nm_item: string (nullable = true)
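To answer the original question directly: you can also build one collect_list expression per name in combList and pass them to agg as varargs, so the number of columns never has to be known in advance. A minimal sketch, reusing w and combList from above:
import org.apache.spark.sql.functions.{col, collect_list}
// One aggregation expression per column name; agg takes the first expression
// plus the rest as varargs, so combList can have any length.
val aggExprs = combList.map(c => collect_list(col(c)).as(s"collect_list($c)"))
val v = w.groupBy("iid").agg(aggExprs.head, aggExprs.tail: _*)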

Related

flatten scala array data type column to multiple columns

Is there any possible way to flatten an array column in a Scala DataFrame?
I know that selecting individual fields with select("filed.a") works, but I don't want to specify them manually.
df.printSchema()
|-- client_version: string (nullable = true)
|-- filed: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- a: string (nullable = true)
| | |-- b: string (nullable = true)
| | |-- c: string (nullable = true)
| | |-- d: string (nullable = true)
The final df should be:
df.printSchema()
|-- client_version: string (nullable = true)
|-- filed_a: string (nullable = true)
|-- filed_b: string (nullable = true)
|-- filed_c: string (nullable = true)
|-- filed_d: string (nullable = true)
You can flatten your ArrayType column with explode and map the nested struct element names to the wanted top-level column names, as shown below:
import org.apache.spark.sql.functions._
import spark.implicits._

case class S(a: String, b: String, c: String, d: String)

val df = Seq(
  ("1.0", Seq(S("a1", "b1", "c1", "d1"))),
  ("2.0", Seq(S("a2", "b2", "c2", "d2"), S("a3", "b3", "c3", "d3")))
).toDF("client_version", "filed")
df.printSchema
// root
// |-- client_version: string (nullable = true)
// |-- filed: array (nullable = true)
// | |-- element: struct (containsNull = true)
// | | |-- a: string (nullable = true)
// | | |-- b: string (nullable = true)
// | | |-- c: string (nullable = true)
// | | |-- d: string (nullable = true)
val dfFlattened = df.withColumn("filed_element", explode($"filed"))
val structElements = dfFlattened.select($"filed_element.*").columns
val dfResult = dfFlattened.select(
  col("client_version") +: structElements.map(
    c => col(s"filed_element.$c").as(s"filed_$c")
  ): _*
)
dfResult.show
// +--------------+-------+-------+-------+-------+
// |client_version|filed_a|filed_b|filed_c|filed_d|
// +--------------+-------+-------+-------+-------+
// |           1.0|     a1|     b1|     c1|     d1|
// |           2.0|     a2|     b2|     c2|     d2|
// |           2.0|     a3|     b3|     c3|     d3|
// +--------------+-------+-------+-------+-------+
dfResult.printSchema
// root
// |-- client_version: string (nullable = true)
// |-- filed_a: string (nullable = true)
// |-- filed_b: string (nullable = true)
// |-- filed_c: string (nullable = true)
// |-- filed_d: string (nullable = true)
Use explode to flatten the array into additional rows, then use select with the * notation to bring the struct fields back up as top-level columns.
import org.apache.spark.sql.functions.{collect_list, explode, struct}
import spark.implicits._
val df = Seq(("1", "a", "a", "a"),
("1", "b", "b", "b"),
("2", "a", "a", "a"),
("2", "b", "b", "b"),
("2", "c", "c", "c"),
("3", "a", "a","a")).toDF("idx", "A", "B", "C")
.groupBy(("idx"))
.agg(collect_list(struct("A", "B", "C")).as("nested_col"))
df.printSchema()
// root
// |-- idx: string (nullable = true)
// |-- nested_col: array (nullable = true)
// | |-- element: struct (containsNull = true)
// | | |-- A: string (nullable = true)
// | | |-- B: string (nullable = true)
// | | |-- C: string (nullable = true)
df.show
// +---+--------------------+
// |idx|          nested_col|
// +---+--------------------+
// |  3|         [[a, a, a]]|
// |  1|[[a, a, a], [b, b...|
// |  2|[[a, a, a], [b, b...|
// +---+--------------------+
val dfExploded = df.withColumn("exploded", explode($"nested_col")).drop("nested_col")
dfExploded.show
// +---+---------+
// |idx| exploded|
// +---+---------+
// |  3|[a, a, a]|
// |  1|[a, a, a]|
// |  1|[b, b, b]|
// |  2|[a, a, a]|
// |  2|[b, b, b]|
// |  2|[c, c, c]|
// +---+---------+
val finalDF = dfExploded.select("idx", "exploded.*")
finalDF.show
// +---+---+---+---+
// |idx|  A|  B|  C|
// +---+---+---+---+
// |  3|  a|  a|  a|
// |  1|  a|  a|  a|
// |  1|  b|  b|  b|
// |  2|  a|  a|  a|
// |  2|  b|  b|  b|
// |  2|  c|  c|  c|
// +---+---+---+---+
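On Spark 2.0+, the explode plus exploded.* steps can also be collapsed into a single selectExpr using the SQL generator inline; a minimal sketch on the same df (finalDF2 is just an illustrative name):
// inline() explodes the array of structs into one row per element and
// promotes each struct field (A, B, C) to a top-level column in one step.
val finalDF2 = df.selectExpr("idx", "inline(nested_col)")
finalDF2.show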

Read CSV with last column as array of values (and the values are inside parenthesis and separated by comma) in Spark

I have a CSV file where the last column is inside parentheses and its values are separated by commas; the number of values in that column is variable. When I read the file into a DataFrame with column names as shown below, I get Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: The number of columns doesn't match. My CSV file looks like this:
a1,b1,true,2017-05-16T07:00:41.0000000,2.5,(c1,d1,e1)
a2,b2,true,2017-05-26T07:00:42.0000000,0.5,(c2,d2,e2,f2,g2)
a2,b2,true,2017-05-26T07:00:42.0000000,0.5,(c2)
a2,b2,true,2017-05-26T07:00:42.0000000,0.5,(c2,d2)
a2,b2,true,2017-05-26T07:00:42.0000000,0.5,(c2,d2,e2)
a2,b2,true,2017-05-26T07:00:42.0000000,0.5,(c2,d2,e2,k2,f2)
What I finally want is something like this:
root
|-- MId: string (nullable = true)
|-- PId: string (nullable = true)
|-- IsTeacher: boolean (nullable = true)
|-- STime: datetype (nullable = true)
|-- TotalMinutes: double (nullable = true)
|-- SomeArrayHeader: array<string> (nullable = true)
I have written the following code so far:
val infoDF = sqlContext.read.format("csv")
  .option("header", "false")
  .load(inputPath)
  .toDF(
    "MId",
    "PId",
    "IsTeacher",
    "STime",
    "TotalMinutes",
    "SomeArrayHeader")
I thought of reading the file without column names and then casting everything after the 5th column to an array type, but I am having problems with the parentheses. Is there a way to do this while reading, telling Spark that the fields inside the parentheses are actually one field of array type?
OK, the solution below is tactical and specific to your case, but it worked for me:
import org.apache.spark.sql.functions._
import spark.implicits._

val df = spark.read.option("quote", "(").csv("in/staff.csv").toDF(
  "MId",
  "PId",
  "IsTeacher",
  "STime",
  "TotalMinutes",
  "arr")
df.show()

val df2 = df.withColumn("arr", split(regexp_replace('arr, "[)]", ""), ","))
df2.printSchema()
df2.show()
Output:
+---+---+---------+--------------------+------------+---------------+
|MId|PId|IsTeacher|               STime|TotalMinutes|            arr|
+---+---+---------+--------------------+------------+---------------+
| a1| b1|     true|2017-05-16T07:00:...|         2.5|      c1,d1,e1)|
| a2| b2|     true|2017-05-26T07:00:...|         0.5|c2,d2,e2,f2,g2)|
| a2| b2|     true|2017-05-26T07:00:...|         0.5|            c2)|
| a2| b2|     true|2017-05-26T07:00:...|         0.5|         c2,d2)|
| a2| b2|     true|2017-05-26T07:00:...|         0.5|      c2,d2,e2)|
| a2| b2|     true|2017-05-26T07:00:...|         0.5|c2,d2,e2,k2,f2)|
+---+---+---------+--------------------+------------+---------------+
root
|-- MId: string (nullable = true)
|-- PId: string (nullable = true)
|-- IsTeacher: string (nullable = true)
|-- STime: string (nullable = true)
|-- TotalMinutes: string (nullable = true)
|-- arr: array (nullable = true)
| |-- element: string (containsNull = true)
+---+---+---------+--------------------+------------+--------------------+
|MId|PId|IsTeacher|               STime|TotalMinutes|                 arr|
+---+---+---------+--------------------+------------+--------------------+
| a1| b1|     true|2017-05-16T07:00:...|         2.5|        [c1, d1, e1]|
| a2| b2|     true|2017-05-26T07:00:...|         0.5|[c2, d2, e2, f2, g2]|
| a2| b2|     true|2017-05-26T07:00:...|         0.5|                [c2]|
| a2| b2|     true|2017-05-26T07:00:...|         0.5|            [c2, d2]|
| a2| b2|     true|2017-05-26T07:00:...|         0.5|        [c2, d2, e2]|
| a2| b2|     true|2017-05-26T07:00:...|         0.5|[c2, d2, e2, k2, f2]|
+---+---+---------+--------------------+------------+--------------------+
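If relying on quote = "(" feels fragile, a rough alternative sketch (assuming the parenthesised list is always the last field; the casts are illustrative and STime is left as a string) is to read each line as raw text and split it with regular expressions:
import org.apache.spark.sql.functions._

val raw = spark.read.text("in/staff.csv")   // a single string column named "value"
val parsed = raw.select(
    // everything before the opening parenthesis, split on commas
    split(regexp_extract(col("value"), "^(.*?),\\(", 1), ",").as("prefix"),
    // everything between the parentheses, split on commas
    split(regexp_extract(col("value"), "\\(([^)]*)\\)", 1), ",").as("arr")
  ).select(
    col("prefix").getItem(0).as("MId"),
    col("prefix").getItem(1).as("PId"),
    col("prefix").getItem(2).cast("boolean").as("IsTeacher"),
    col("prefix").getItem(3).as("STime"),
    col("prefix").getItem(4).cast("double").as("TotalMinutes"),
    col("arr").as("SomeArrayHeader")
  )
parsed.printSchema()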

Spark: How to split struct type into multiple columns?

I know this question has been asked many times on Stack Overflow and has been satisfactorily answered in most posts, but I'm not sure if this is the best way in my case.
I have a Dataset that has several struct types embedded in it:
root
|-- STRUCT1: struct (nullable = true)
| |-- FIELD_1: string (nullable = true)
| |-- FIELD_2: long (nullable = true)
| |-- FIELD_3: integer (nullable = true)
|-- STRUCT2: struct (nullable = true)
| |-- FIELD_4: string (nullable = true)
| |-- FIELD_5: long (nullable = true)
| |-- FIELD_6: integer (nullable = true)
|-- STRUCT3: struct (nullable = true)
| |-- FIELD_7: string (nullable = true)
| |-- FIELD_8: long (nullable = true)
| |-- FIELD_9: integer (nullable = true)
|-- ARRAYSTRUCT4: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- FIELD_10: integer (nullable = true)
| | |-- FIELD_11: integer (nullable = true)
+-------+------------+------------+------------------+
|STRUCT1| STRUCT2 | STRUCT3 | ARRAYSTRUCT4 |
+-------+------------+------------+------------------+
|[1,2,3]|[aa, xx, yy]|[p1, q2, r3]|[[1a, 2b],[3c,4d]]|
+-------+------------+------------+------------------+
I want to convert this into:
1. A dataset where the structs are expanded into columns.
2. A dataset where the array (ARRAYSTRUCT4) is exploded into rows.
root
|-- FIELD_1: string (nullable = true)
|-- FIELD_2: long (nullable = true)
|-- FIELD_3: integer (nullable = true)
|-- FIELD_4: string (nullable = true)
|-- FIELD_5: long (nullable = true)
|-- FIELD_6: integer (nullable = true)
|-- FIELD_7: string (nullable = true)
|-- FIELD_8: long (nullable = true)
|-- FIELD_9: integer (nullable = true)
|-- FIELD_10: integer (nullable = true)
|-- FIELD_11: integer (nullable = true)
+-------+------------+------------+---------+ ---------+----------+
|FIELD_1| FIELD_2 | FIELD_3 | FIELD_4 | |FIELD_10| FIELD_11 |
+-------+------------+------------+---------+ ... ---------+----------+
|1 |2 |3 | aa | | 1a | 2b |
+-------+------------+------------+-----------------------------------+
To achieve this, I could use:
val expanded = df.select("STRUCT1.*", "STRUCT2.*", "STRUCT3.*", "STRUCT4")
followed by an explode:
val exploded = expanded.select(explode(expanded("STRUCT4")))
However, I was wondering if there's a more functional way to do this, especially the select. I could use withColumn as below:
data.withColumn("FIELD_1", $"STRUCT1".getItem(0))
.withColumn("FIELD_2", $"STRUCT1".getItem(1))
.....
But I have 80+ columns. Is there a better way to achieve this?
You can first make all columns struct-type by exploding any Array(struct) columns into struct columns via foldLeft, then use map to interpolate each struct column name into col.*, as shown below:
import org.apache.spark.sql.functions._
import spark.implicits._

case class S1(FIELD_1: String, FIELD_2: Long, FIELD_3: Int)
case class S2(FIELD_4: String, FIELD_5: Long, FIELD_6: Int)
case class S3(FIELD_7: String, FIELD_8: Long, FIELD_9: Int)
case class S4(FIELD_10: Int, FIELD_11: Int)

val df = Seq(
  (S1("a1", 101, 11), S2("a2", 102, 12), S3("a3", 103, 13), Array(S4(1, 1), S4(3, 3))),
  (S1("b1", 201, 21), S2("b2", 202, 22), S3("b3", 203, 23), Array(S4(2, 2), S4(4, 4)))
).toDF("STRUCT1", "STRUCT2", "STRUCT3", "ARRAYSTRUCT4")
// +-----------+-----------+-----------+--------------+
// |    STRUCT1|    STRUCT2|    STRUCT3|  ARRAYSTRUCT4|
// +-----------+-----------+-----------+--------------+
// |[a1,101,11]|[a2,102,12]|[a3,103,13]|[[1,1], [3,3]]|
// |[b1,201,21]|[b2,202,22]|[b3,203,23]|[[2,2], [4,4]]|
// +-----------+-----------+-----------+--------------+
val arrayCols = df.dtypes.filter( t => t._2.startsWith("ArrayType(StructType") ).
  map(_._1)
// arrayCols: Array[String] = Array(ARRAYSTRUCT4)

val expandedDF = arrayCols.foldLeft(df)( (accDF, c) =>
  accDF.withColumn(c.replace("ARRAY", ""), explode(col(c))).drop(c)
)

val structCols = expandedDF.columns

expandedDF.select(structCols.map(c => col(s"$c.*")): _*).
  show
// +-------+-------+-------+-------+-------+-------+-------+-------+-------+--------+--------+
// |FIELD_1|FIELD_2|FIELD_3|FIELD_4|FIELD_5|FIELD_6|FIELD_7|FIELD_8|FIELD_9|FIELD_10|FIELD_11|
// +-------+-------+-------+-------+-------+-------+-------+-------+-------+--------+--------+
// |     a1|    101|     11|     a2|    102|     12|     a3|    103|     13|       1|       1|
// |     a1|    101|     11|     a2|    102|     12|     a3|    103|     13|       3|       3|
// |     b1|    201|     21|     b2|    202|     22|     b3|    203|     23|       2|       2|
// |     b1|    201|     21|     b2|    202|     22|     b3|    203|     23|       4|       4|
// +-------+-------+-------+-------+-------+-------+-------+-------+-------+--------+--------+
Note that for simplicity it's assumed that your DataFrame has only struct and Array(struct)-type columns. If there are other data types, just apply filtering conditions to arrayCols and structCols accordingly.
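For example, a minimal sketch of that filtering (variable names are illustrative), expanding only the struct columns and passing any other columns through unchanged:
// Split columns into struct columns (to be expanded with .*) and the rest.
val (structOnly, others) = expandedDF.dtypes.partition(_._2.startsWith("StructType"))
expandedDF.select(
  others.map(t => col(t._1)) ++ structOnly.map(t => col(s"${t._1}.*")): _*
).show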

pyspark VectorUDT to integer or float conversion

Below is my dataframe
a b c d
1 2 3 [1211]
2 2 4 [1222]
4 5 4 [12322]
Here the d column is of vector type (VectorUDT) and I was not able to convert it directly to an integer. Below is the code I tried for the conversion:
newDF = newDF.select(col('d'),
                     newDF.d.cast('int').alias('d'))
Can someone please help with this?
We can use a udf to deserialize the vector and access its values:
>>> from pyspark.sql import functions as F
>>> from pyspark.sql.types import IntegerType
>>> from pyspark.ml.linalg import Vectors
>>> df = spark.createDataFrame([(1,2,3,Vectors.dense([1211])),(2,2,4,Vectors.dense([1222])),(4,5,4,Vectors.dense([12322]))],['a','b','c','d'])
>>> df.show()
+---+---+---+---------+
|  a|  b|  c|        d|
+---+---+---+---------+
|  1|  2|  3| [1211.0]|
|  2|  2|  4| [1222.0]|
|  4|  5|  4|[12322.0]|
+---+---+---+---------+
>>> df.printSchema()
root
|-- a: long (nullable = true)
|-- b: long (nullable = true)
|-- c: long (nullable = true)
|-- d: vector (nullable = true)
>>> udf1 = F.udf(lambda x : int(x[0]),IntegerType())
>>> df.select('d',udf1('d').alias('d1')).show()
+---------+-----+
|        d|   d1|
+---------+-----+
| [1211.0]| 1211|
| [1222.0]| 1222|
|[12322.0]|12322|
+---------+-----+
>>> df.select('d',udf1('d').alias('d1')).printSchema()
root
|-- d: vector (nullable = true)
|-- d1: integer (nullable = true)

spark2.0 dataframe collect multiple rows as array by column

I have a dataframe like the one below, and I want to collect multiple rows into a single array/struct when the key column values are the same:
val data = Seq(("a","b","sum",0),("a","b","avg",2)).toDF("id1","id2","type","value2")
data.show
+---+---+----+------+
|id1|id2|type|value2|
+---+---+----+------+
|  a|  b| sum|     0|
|  a|  b| avg|     2|
+---+---+----+------+
I want to convert it to the following:
+---+---+----+------+
|id1|id2| agg|value2|
+---+---+----+------+
|  a|  b| 0,2|     0|
+---+---+----+------+
The printSchema output should look like this:
root
|-- id1: string (nullable = true)
|-- id2: string (nullable = true)
|-- agg: struct (nullable = true)
| |-- sum: int (nullable = true)
| |-- dc: int (nullable = true)
You can:
import org.apache.spark.sql.functions._
import spark.implicits._

val data = Seq(
  ("a", "b", "sum", 0), ("a", "b", "avg", 2)
).toDF("id1", "id2", "type", "value2")

val result = data.groupBy($"id1", $"id2").agg(struct(
  first(when($"type" === "sum", $"value2"), true).alias("sum"),
  first(when($"type" === "avg", $"value2"), true).alias("avg")
).alias("agg"))
result.show
+---+---+-----+
|id1|id2|  agg|
+---+---+-----+
|  a|  b|[0,2]|
+---+---+-----+
result.printSchema
root
|-- id1: string (nullable = true)
|-- id2: string (nullable = true)
|-- agg: struct (nullable = false)
| |-- sum: integer (nullable = true)
| |-- avg: integer (nullable = true)
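A pivot-based variant of the same aggregation, for comparison; a minimal sketch assuming the only type values are "sum" and "avg":
// Pivot "type" into one column per value, then pack those columns into a struct.
val pivoted = data.groupBy($"id1", $"id2")
  .pivot("type", Seq("sum", "avg"))
  .agg(first($"value2"))
  .select($"id1", $"id2", struct($"sum", $"avg").alias("agg"))
pivoted.show
// For the sample data this gives one row per (id1, id2) with agg = [0,2].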