How to refer to data in Spark SQL using column names? - scala

If I have a file named shades.txt with the following data:
1 | 1 | dark red
2 | 1 | light red
3 | 2 | dark green
4 | 3 | light blue
5 | 3 | sky blue
I can store it in Spark SQL like below:
var shades_part = sc.textFile("shades.txt")
var shades_data = shades_part.map(line => line.split("\\|").map(elem => elem.trim))
shades_data.toDF().createOrReplaceTempView("shades_data")
But when I try to run a query on it
sqlContext.sql("select * from shades_data A where A.shade_name = "dark green")
I get an error:
cannot resolve '`A.shade_name`' given input columns: [A.value];

You can use a case class for your data:
case class Shades(id: Int, colorCode:Int, shadeColor: String)
Then modify your code as follows:
var shades_part = sc.textFile("shades.txt")
var shades_data = shades_part.map(line => line.split("\\|")).map(elem => Shades(elem(0).trim.toInt, elem(1).trim.toInt, elem(2).trim))
val df = shades_data.toDF()
You should then have a DataFrame like this:
+---+---------+----------+
|id |colorCode|shadeColor|
+---+---------+----------+
|1 |1 |dark red |
|2 |1 |light red |
|3 |2 |dark green|
|4 |3 |light blue|
|5 |3 |sky blue |
+---+---------+----------+
Now you can use the filter function:
df.filter($"shadeColor" === "dark green").show(false)
which should give you
+---+---------+----------+
|id |colorCode|shadeColor|
+---+---------+----------+
|3 |2 |dark green|
+---+---------+----------+
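If you prefer the original SQL style, you can also register this DataFrame as a temporary view and refer to the columns by name. A minimal sketch mirroring the query from the question:
df.createOrReplaceTempView("shades_data")
sqlContext.sql("select * from shades_data A where A.shadeColor = 'dark green'").show(false)
which should print the same single dark green row.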
Using Schema
You can create a schema as follows (note the imports from org.apache.spark.sql.types):
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
val schema = StructType(Array(
  StructField("id", IntegerType, true),
  StructField("colorCode", IntegerType, true),
  StructField("shadeColor", StringType, true)
))
and use the schema in sqlContext as
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("delimiter", "|")
.schema(schema)
.option("ignoreTrailingWhiteSpace", true)
.option("ignoreLeadingWhiteSpace", true)
.load("shades.txt")
Or you can use the createDataFrame function (Row comes from org.apache.spark.sql):
import org.apache.spark.sql.Row
var shades_part = sc.textFile("shades.txt")
var shades_data = shades_part.map(line => line.split("\\|").map(_.trim)).map(elem => Row.fromSeq(Seq(elem(0).toInt, elem(1).toInt, elem(2))))
sqlContext.createDataFrame(shades_data, schema).show(false)

Check the schema of the DataFrame and you will know the column names:
val dataframe = shades_data.toDF()
dataframe.printSchema()
If no column names are given when the DataFrame is created, Spark assigns default names (for example _c0, _c1, ... when reading CSV files without a header, or _1, _2, ... for tuples).
Otherwise, you can give column names while creating the DataFrame, like below:
val dataframe = shades_data.toDF("col1", "col2", "col3")
dataframe.printSchema()
root
|-- col1: integer (nullable = false)
|-- col2: integer (nullable = false)
|-- col3: string (nullable = true)

Related

Extracting array elements and mapping them to case class

The following is the output I am getting after performing a groupByKey, mapGroups and then a joinWith operation on the case class dataset:
+------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
|_1 |_2 |
+------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[IND0001,Christopher,Black] |null |
|[IND0002,Madeleine,Kerr] |[IND0002,WrappedArray([IND0002,ACC0155,323], [IND0002,ACC0262,60])] |
|[IND0003,Sarah,Skinner] |[IND0003,WrappedArray([IND0003,ACC0235,631], [IND0003,ACC0486,400], [IND0003,ACC0540,53])] |
|[IND0004,Rachel,Parsons] |[IND0004,WrappedArray([IND0004,ACC0116,965])] |
|[IND0005,Oliver,Johnston] |[IND0005,WrappedArray([IND0005,ACC0146,378], [IND0005,ACC0201,34], [IND0005,ACC0450,329])] |
|[IND0006,Carl,Metcalfe] |[IND0006,WrappedArray([IND0006,ACC0052,57], [IND0006,ACC0597,547])] |
The code is as follows:
val test = accountDS.groupByKey(_.customerId).mapGroups{ case (id, xs) => (id, xs.toSeq)}
test.show(false)
val newTest = customerDS.joinWith(test, customerDS("customerId") === test("_1"), "leftouter")
newTest.show(500,false)
Now I want to take the arrays and output them in a format as follows:
+----------+-----------+----------+---------------------------------------------------------------------+--------------+------------+-----------------+
|customerId|forename |surname |accounts |numberAccounts|totalBalance|averageBalance |
+----------+-----------+----------+---------------------------------------------------------------------+--------------+------------+-----------------+
|IND0001 |Christopher|Black |[] |0 |0 |0.0 |
|IND0002 |Madeleine |Kerr |[[IND0002,ACC0155,323], [IND0002,ACC0262,60]] |2 |383 |191.5 |
|IND0003 |Sarah |Skinner |[[IND0003,ACC0235,631], [IND0003,ACC0486,400], [IND0003,ACC0540,53]] |3 |1084 |361.3333333333333|
Note: I cannot use spark.sql.functions._ at all --> training academy rules :(
How do I get the above output which should be mapped to a case class as follows:
case class CustomerAccountOutput(
customerId: String,
forename: String,
surname: String,
//Accounts for this customer
accounts: Seq[AccountData],
//Statistics of the accounts
numberAccounts: Int,
totalBalance: Long,
averageBalance: Double
)
I really need help with this. Stuck with this for weeks without a working solution.
Let's say you have the following DataFrame:
val sourceDf = Seq((1, Array("aa", "CA")), (2, Array("bb", "OH"))).toDF("id", "data_arr")
sourceDf.show()
// output:
+---+--------+
| id|data_arr|
+---+--------+
| 1|[aa, CA]|
| 2|[bb, OH]|
+---+--------+
and you want to convert it to the following schema:
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
val destSchema = StructType(Array(
StructField("id", IntegerType, true),
StructField("name", StringType, true),
StructField("state", StringType, true)
))
You can do:
import scala.collection.mutable
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.catalyst.encoders.RowEncoder
val destDf: DataFrame = sourceDf
.map { sourceRow =>
Row(sourceRow(0), sourceRow.getAs[mutable.WrappedArray[String]](1)(0), sourceRow.getAs[mutable.WrappedArray[String]](1)(1))
}(RowEncoder(destSchema))
destDf.show()
// output:
+---+----+-----+
| id|name|state|
+---+----+-----+
| 1| aa| CA|
| 2| bb| OH|
+---+----+-----+
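Applying the same idea to the original question without spark.sql.functions, the joined Dataset can be mapped straight into CustomerAccountOutput with plain Scala. This is only a sketch: it assumes case classes Customer(customerId, forename, surname) and AccountData with a numeric balance field (neither is shown in the question), and that spark.implicits._ is in scope:
val result = newTest.map { case (cust, grouped) =>
  // grouped is null for customers with no accounts (leftouter join)
  val accounts = Option(grouped).map(_._2).getOrElse(Seq.empty[AccountData])
  val total = accounts.map(_.balance).sum
  CustomerAccountOutput(
    cust.customerId,
    cust.forename,
    cust.surname,
    accounts,
    accounts.size,
    total,
    if (accounts.isEmpty) 0.0 else total.toDouble / accounts.size
  )
}
result.show(500, false)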

Extract Array of Struct from Parquet into multi-value csv Spark Scala

Using Spark Scala I am trying to extract an array of Struct from parquet. The input is a parquet file. The output is a csv file. The field of the csv can have "multi-values" delimited by "#;". The csv is delimited by ",". What is the best way to accomplish this?
Schema
root
|-- llamaEvent: struct (nullable = true)
| |-- activity: struct (nullable = true)
| | |-- Animal: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- time: string (nullable = true)
| | | | |-- status: string (nullable = true)
| | | | |-- llamaType: string (nullable = true)
Example input as JSON (the actual input will be Parquet):
{
"llamaEvent":{
"activity":{
"Animal":[
{
"time":"5-1-2020",
"status":"Running",
"llamaType":"red llama"
},
{
"time":"6-2-2020",
"status":"Sitting",
"llamaType":"blue llama"
}
]
}
}
}
Desired CSV Output
time,status,llamaType
5-1-2020#;6-2-2020,running#;sitting,red llama#;blue llama
Update:
Based on some trial and error, I believe a solution like this may be appropriate, depending on the use case. It takes a shortcut by grabbing the array item, casting it to a string, and then parsing out the extraneous characters, which is good for some use cases.
df.select(col("llamaEvent.activity").getItem("Animal").getItem("time").cast("String"))
Then you can perform whatever parsing you want afterwards, such as regexp_replace:
df.withColumn("time", regexp_replace(col("time"), ",", "#;"))
Several appropriate solutions were also proposed using groupBy, explode, and aggregate.
One approach would be to flatten the array of animal attribute structs using SQL function inline and aggregate the attributes via collect_list, followed by concatenating with the specific delimiter.
Given a DataFrame df similar to your provided schema, the following transformations will generate the wanted dataset, dfResult:
import org.apache.spark.sql.functions.{collect_list, concat_ws, expr}
import spark.implicits._
val attribCSVs = List("time", "status", "llamaType").map(
c => concat_ws("#;", collect_list(c)).as(c)
)
val dfResult = df.
select($"eventId", expr("inline(llamaEvent.activity.Animal)")).
groupBy("eventId").agg(attribCSVs.head, attribCSVs.tail: _*)
Note that an event identifying column eventId is added to the sample json data for the necessary groupBy aggregation.
Let's assemble some sample data:
val jsons = Seq(
"""{
"eventId": 1,
"llamaEvent":{
"activity":{
"Animal":[
{
"time":"5-1-2020",
"status":"Running",
"llamaType":"red llama"
},
{
"time":"6-2-2020",
"status":"Sitting",
"llamaType":"blue llama"
}
]
}
}
}""",
"""{
"eventId": 2,
"llamaEvent":{
"activity":{
"Animal":[
{
"time":"5-2-2020",
"status":"Running",
"llamaType":"red llama"
},
{
"time":"6-3-2020",
"status":"Standing",
"llamaType":"blue llama"
}
]
}
}
}"""
)
val df = spark.read.option("multiLine", true).json(jsons.toDS)
df.show(false)
+-------+----------------------------------------------------------------------+
|eventId|llamaEvent |
+-------+----------------------------------------------------------------------+
|1 |{{[{red llama, Running, 5-1-2020}, {blue llama, Sitting, 6-2-2020}]}} |
|2 |{{[{red llama, Running, 5-2-2020}, {blue llama, Standing, 6-3-2020}]}}|
+-------+----------------------------------------------------------------------+
Applying the above transformations, dfResult should look like below:
dfResult.show(false)
+-------+------------------+-----------------+---------------------+
|eventId|time |status |llamaType |
+-------+------------------+-----------------+---------------------+
|1 |5-1-2020#;6-2-2020|Running#;Sitting |red llama#;blue llama|
|2 |5-2-2020#;6-3-2020|Running#;Standing|red llama#;blue llama|
+-------+------------------+-----------------+---------------------+
Writing dfResult to a CSV file:
dfResult.write.option("header", true).csv("/path/to/csv")
/*
eventId,time,status,llamaType
1,5-1-2020#;6-2-2020,Running#;Sitting,red llama#;blue llama
2,5-2-2020#;6-3-2020,Running#;Standing,red llama#;blue llama
*/
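If the source data really has no event identifier, the same collect_list aggregation can be applied globally instead of per group. A sketch reusing attribCSVs from above (dfGlobal is just an illustrative name):
val dfGlobal = df.
  select(expr("inline(llamaEvent.activity.Animal)")).
  agg(attribCSVs.head, attribCSVs.tail: _*)
dfGlobal.show(false)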
This will be a working solution for you (PySpark):
from pyspark.sql import functions as F, types as T
# a_json holds the input JSON from the question, as a Python dict
df = spark.createDataFrame([(str([a_json]))],T.StringType())
df = df.withColumn('col', F.from_json("value", T.ArrayType(T.StringType())))
df = df.withColumn("col", F.explode("col"))
df = df.withColumn("col", F.from_json("col", T.MapType(T.StringType(), T.StringType())))
df = df.withColumn("llamaEvent", df.col.getItem("llamaEvent"))
df = df.withColumn("llamaEvent", F.from_json("llamaEvent", T.MapType(T.StringType(), T.StringType())))
df = df.select("*", F.explode("llamaEvent").alias("x","y"))
df = df.withColumn("Activity", F.from_json("y", T.MapType(T.StringType(), T.StringType())))
df = df.select("*", F.explode("Activity").alias("x","yy"))
df = df.withColumn("final_col", F.from_json("yy", T.ArrayType(T.StringType())))
df = df.withColumn("final_col", F.explode("final_col"))
df = df.withColumn("final_col", F.from_json("final_col", T.MapType(T.StringType(), T.StringType())))
df = df.withColumn("time", df.final_col.getItem("time")).withColumn("status", df.final_col.getItem("status")).withColumn("llamaType", df.final_col.getItem("llamaType")).withColumn("agg_col", F.lit("1"))
df_grp = df.groupby("agg_col").agg(F.concat_ws("#;", F.collect_list(df.time)).alias("time"), F.concat_ws("#;", F.collect_list(df.status)).alias("status"), F.concat_ws("#;", F.collect_list(df.llamaType)).alias("llamaType"))
display(df)
+--------------------+--------------------+--------------------+--------+--------------------+--------------------+------+--------------------+--------------------+--------+-------+----------+-------+
| value| col| llamaEvent| x| y| Activity| x| yy| final_col| time| status| llamaType|agg_col|
+--------------------+--------------------+--------------------+--------+--------------------+--------------------+------+--------------------+--------------------+--------+-------+----------+-------+
|[{'llamaEvent': {...|[llamaEvent -> {"...|[activity -> {"An...|activity|{"Animal":[{"time...|[Animal -> [{"tim...|Animal|[{"time":"5-1-202...|[time -> 5-1-2020...|5-1-2020|Running| red llama| 1|
|[{'llamaEvent': {...|[llamaEvent -> {"...|[activity -> {"An...|activity|{"Animal":[{"time...|[Animal -> [{"tim...|Animal|[{"time":"5-1-202...|[time -> 6-2-2020...|6-2-2020|Sitting|blue llama| 1|
+--------------------+--------------------+--------------------+--------+--------------------+--------------------+------+--------------------+--------------------+--------+-------+----------+-------+
df_grp.show(truncate=False)
+-------+------------------+----------------+---------------------+
|agg_col|time |status |llamaType |
+-------+------------------+----------------+---------------------+
|1 |5-1-2020#;6-2-2020|Running#;Sitting|red llama#;blue llama|
+-------+------------------+----------------+---------------------+

Add new column of Map Datatype to Spark Dataframe in scala

I'm able to create a new Dataframe with one column having Map datatype.
val inputDF2 = Seq(
(1, "Visa", 1, Map[String, Int]()),
(2, "MC", 2, Map[String, Int]())).toDF("id", "card_type", "number_of_cards", "card_type_details")
scala> inputDF2.show(false)
+---+---------+---------------+-----------------+
|id |card_type|number_of_cards|card_type_details|
+---+---------+---------------+-----------------+
|1 |Visa |1 |[] |
|2 |MC |2 |[] |
+---+---------+---------------+-----------------+
Now I want to create a new column of the same type as card_type_details. I'm trying to use the spark withColumn method to add this new column.
inputDF2.withColumn("tmp", lit(null) cast "map<String, Int>").show(false)
+---------+---------+---------------+---------------------+-----+
|person_id|card_type|number_of_cards|card_type_details |tmp |
+---------+---------+---------------+---------------------+-----+
|1 |Visa |1 |[] |null |
|2 |MC |2 |[] |null |
+---------+---------+---------------+---------------------+-----+
When I checked the schemas of both columns, they look the same, but the values behave differently.
scala> inputDF2.withColumn("tmp", lit(null) cast "map<String, Int>").printSchema
root
|-- id: integer (nullable = false)
|-- card_type: string (nullable = true)
|-- number_of_cards: integer (nullable = false)
|-- card_type_details: map (nullable = true)
| |-- key: string
| |-- value: integer (valueContainsNull = false)
|-- tmp: map (nullable = true)
| |-- key: string
| |-- value: integer (valueContainsNull = true)
I'm not sure whether I'm adding the new column correctly. The issue comes when I apply the .isEmpty method on the tmp column: I get a NullPointerException.
scala> def checkValue = udf((card_type_details: Map[String, Int]) => {
| var output_map = Map[String, Int]()
| if (card_type_details.isEmpty) { output_map += 0.toString -> 1 }
| else {output_map = card_type_details }
| output_map
| })
checkValue: org.apache.spark.sql.expressions.UserDefinedFunction
scala> inputDF2.withColumn("value", checkValue(col("card_type_details"))).show(false)
+---+---------+---------------+-----------------+--------+
|id |card_type|number_of_cards|card_type_details|value |
+---+---------+---------------+-----------------+--------+
|1 |Visa |1 |[] |[0 -> 1]|
|2 |MC |2 |[] |[0 -> 1]|
+---+---------+---------------+-----------------+--------+
scala> inputDF2.withColumn("tmp", lit(null) cast "map<String, Int>")
.withColumn("value", checkValue(col("tmp"))).show(false)
org.apache.spark.SparkException: Failed to execute user defined function($anonfun$checkValue$1: (map<string,int>) => map<string,int>)
Caused by: java.lang.NullPointerException
at $anonfun$checkValue$1.apply(<console>:28)
at $anonfun$checkValue$1.apply(<console>:26)
at org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:108)
at org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:107)
at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1063)
How do I add a new column that has the same values as the card_type_details column?
To add the tmp column with the same value as card_type_details, you just do:
inputDF2.withColumn("tmp", col("cart_type_details"))
If you aim to add a column with an empty map and avoid the NullPointerException, the solution is:
inputDF2.withColumn("tmp", typedLit(Map.empty[Int, String]))

Run a custom transformation on string columns

Suppose I have the following dataframe:
var df = Seq(
("2019-09-01", 0.1, 1, "0x0000000000000001", "0x00000001", "True"),
("2019-09-02", 0.2, 2, "0x0000000000000002", "0x00000002", "False"),
("2019-09-03", 0.3, 3, "0x0000000000000003", "0x00000003", "True")
).toDF("Timestamp", "Float", "Integer", "Hex1", "Hex2", "Bool")
I need to run a transformation on the string columns (in this example: Hex1, Hex2 and Bool) and convert them to a numeric value using some custom logic.
The dataframes are generated by reading CSV files whose schema I don't know in advance. All I know is that they contain a Timestamp column as the first column and then a variable number of columns which might be numeric (integers or doubles/floats) or these hex and boolean values.
I'm thinking this transformation would need to find all the string columns and for each one, run the transformation that will add a new column to the dataframe with the numerical representation of the string.
In this case, the hex values would be converted to their decimal representation. And the "True", "False" strings would be converted to 1 and 0 respectively.
Back to the simplified example, I should get a df like this:
|Timestamp |Float|Integer|Hex1 |Hex2 |Bool |
|-----------|-----|-------|------------------|----------|-----|
|2019-09-01 |0.1 |1 |1 |1 |1 |
|2019-09-02 |0.2 |2 |2 |2 |0 |
|2019-09-03"|0.3 |3 |3 |3 |1 |
With all numeric (integer, float or double) columns except for the Timestamp
As per your example, use the following functions:
Use the conv standard function to convert hex to the appropriate type:
conv(num: Column, fromBase: Int, toBase: Int): Column converts a number in a string column from one base to another.
Use when for the Bool column:
when(condition: Column, value: Any): Column evaluates a list of conditions and returns one of multiple possible result expressions.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.IntegerType
val s1 = df.
withColumn("Hex1", conv(col("Hex1").substr(lit(3), length(col("Hex1"))), 16, 10) cast IntegerType).
withColumn("Hex2", conv(col("Hex2").substr(lit(3), length(col("Hex2"))), 16, 10) cast IntegerType).
withColumn("Bool", when(col("Bool") === "True", 1)
.otherwise(0))
s1.show()
s1.printSchema()
If you want to do the same task dynamically, as your problem definition requires, you have to do some extra work.
Create a mapping of column name to data type. This can be abstracted out: you can keep the mapping in an external file and generate it dynamically by reading that file.
val list = List(
("Hex", "Hex1"),
("Hex", "Hex2"),
("Bool", "Bool")
)
Create a converter using pattern matching:
object Helper {
def convert(columnDetail: (String, String)): Column = {
columnDetail._1 match {
case "Hex" => conv(col(columnDetail._2).substr(lit(3), length(col(columnDetail._2))), 16, 10) cast IntegerType
case "Bool" => when(col(columnDetail._2) === "True", 1).otherwise(0)
// your other case
}
}
}
You can add all the cases and their appropriate implementations.
Final solution:
import spark.implicits._
var df = Seq(
("2019-09-01", 0.1, 1, "0x0000000000000001", "0x00000001", "True"),
("2019-09-02", 0.2, 2, "0x0000000000000002", "0x00000002", "False"),
("2019-09-03", 0.3, 3, "0x0000000000000003", "0x00000003", "True")
).toDF("Timestamp", "Float", "Integer", "Hex1", "Hex2", "Bool")
val list = List(
("Hex", "Hex1"),
("Hex", "Hex2"),
("Bool", "Bool")
)
val temp = list.foldLeft(df) { (tempDF, listValue) =>
tempDF.withColumn(listValue._2, Helper.convert(listValue))
}
temp.show(false)
temp.printSchema()
object Helper {
def convert(columnDetail: (String, String)): Column = {
columnDetail._1 match {
case "Hex" => conv(col(columnDetail._2).substr(lit(3), length(col(columnDetail._2))), 16, 10) cast IntegerType
case "Bool" => when(col(columnDetail._2) === "True", 1).otherwise(0)
// your other case
}
}
}
Result:
+----------+-----+-------+----+----+----+
|Timestamp |Float|Integer|Hex1|Hex2|Bool|
+----------+-----+-------+----+----+----+
|2019-09-01|0.1 |1 |1 |1 |1 |
|2019-09-02|0.2 |2 |2 |2 |0 |
|2019-09-03|0.3 |3 |3 |3 |1 |
+----------+-----+-------+----+----+----+
root
|-- Timestamp: string (nullable = true)
|-- Float: double (nullable = false)
|-- Integer: integer (nullable = false)
|-- Hex1: integer (nullable = true)
|-- Hex2: integer (nullable = true)
|-- Bool: integer (nullable = false)
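If the string columns are not known up front, the same conversions can be driven off the DataFrame schema at run time. A minimal sketch, assuming (as in the sample data) that hex values always start with "0x" and booleans are exactly "True"/"False":
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{LongType, StringType}
// every string column except the Timestamp gets rewritten in place
val stringCols = df.schema.fields
  .filter(f => f.dataType == StringType && f.name != "Timestamp")
  .map(_.name)
val converted = stringCols.foldLeft(df) { (acc, c) =>
  acc.withColumn(c,
    when(col(c).startsWith("0x"),
      conv(col(c).substr(lit(3), length(col(c))), 16, 10).cast(LongType))
    .when(col(c) === "True", lit(1L))
    .when(col(c) === "False", lit(0L))
    .otherwise(lit(null).cast(LongType)))
}
converted.show(false)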
Below is my Spark code to do this. I have used the conv function of Spark SQL (http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.conv). Also, if you want to write logic to dynamically identify all string columns at run time and perform the conversion, it can only be done if you know exactly what kind of conversion you are going to do.
var df = Seq(
("2019-09-01", 0.1, 1, "0x0000000000000001", "0x00000001", "True"),
("2019-09-02", 0.2, 2, "0x0000000000000002", "0x00000002", "False"),
("2019-09-03", 0.3, 3, "0x0000000000000003", "0x00000003", "True")
).toDF("Timestamp", "Float", "Integer", "Hex1", "Hex2", "Bool")
// df.show
df.createOrReplaceTempView("sourceTable")
val finalDF = spark.sql("""
select Timestamp,
Float,
Integer,
conv(substr(Hex1,3),16,10) as Hex1,
conv(substr(Hex2,3),16,10) as Hex2,
case when Bool = "True" then 1
when Bool = "False" then 0
else NULL
end as Bool
from sourceTable
""")
finalDF.show
Result :
+----------+-----+-------+----+----+----+
| Timestamp|Float|Integer|Hex1|Hex2|Bool|
+----------+-----+-------+----+----+----+
|2019-09-01| 0.1| 1| 1| 1| 1|
|2019-09-02| 0.2| 2| 2| 2| 0|
|2019-09-03| 0.3| 3| 3| 3| 1|
+----------+-----+-------+----+----+----+

How to return ListBuffer as a column from UDF using Spark Scala?

I am trying to use a UDF and return a ListBuffer as a column from it, but I am getting an error.
I have created a DataFrame by executing the code below:
val df = Seq((1,"dept3##rama##kumar","dept3##rama##kumar"), (2,"dept31##rama1##kumar1","dept33##rama3##kumar3")).toDF("id","str1","str2")
df.show()
It shows the following:
+---+--------------------+--------------------+
| id| str1| str2|
+---+--------------------+--------------------+
| 1| dept3##rama##kumar| dept3##rama##kumar|
| 2|dept31##rama1##ku...|dept33##rama3##ku...|
+---+--------------------+--------------------+
As per my requirement, I have to split the above columns based on some inputs, so I have tried a UDF like below:
def appendDelimiterError=udf((id: Int, str1: String, str2: String)=> {
var lit = new ListBuffer[Any]()
if(str1.contains("##"){val a=str1.split("##")}
else if(str1.contains("##"){val a=str1.split("##")}
else if(str1.contains("#&"){val a=str1.split("#&")}
if(str2.contains("##"){ val b=str2.split("##")}
else if(str2.contains("##"){ val b=str2.split("##") }
else if(str1.contains("##"){val b=str2.split("##")}
var tmp_row = List(a,"test1",b)
lit +=tmp_row
return lit
})
I try to call it by executing the code below:
val df1=df.withColumn("newcol",appendDelimiterError(df("id"),df("str1"),df("str2")))
I am getting the error "this was a bad call". I want to use a ListBuffer/List to store the result and return it to the calling place.
my expected output will be:
+---+--------------------+------------------------+----------------------------------------------------------------------+
| id| str1| str2 | newcol |
+---+--------------------+------------------------+----------------------------------------------------------------------+
| 1| dept3##rama##kumar| dept3##rama##kumar |ListBuffer(List("dept","rama","kumar"),List("dept3","rama","kumar")) |
| 2|dept31##rama1##kumar1|dept33##rama3##kumar3 | ListBuffer(List("dept31","rama1","kumar1"),List("dept33","rama3","kumar3")) |
+---+--------------------+------------------------+----------------------------------------------------------------------+
How to achieve this?
An alternative, using my own fictional data (which you can tailor to your needs) and no UDF:
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
val df = Seq(
(1, "111##cat##666", "222##fritz##777"),
(2, "AAA##cat##555", "BBB##felix##888"),
(3, "HHH##mouse##yyy", "123##mickey##ZZZ")
).toDF("c0", "c1", "c2")
val df2 = df.withColumn( "c_split", split(col("c1"), ("(##)|(##)|(##)|(##)") ))
.union(df.withColumn("c_split", split(col("c2"), ("(##)|(##)|(##)|(##)") )) )
df2.show(false)
df2.printSchema()
val df3 = df2.groupBy(col("c0")).agg(collect_list(col("c_split")).as("List_of_Data") )
df3.show(false)
df3.printSchema()
This gives the answer, but without a ListBuffer (is one really necessary?), as follows:
+---+---------------+----------------+------------------+
|c0 |c1 |c2 |c_split |
+---+---------------+----------------+------------------+
|1 |111##cat##666 |222##fritz##777 |[111, cat, 666] |
|2 |AAA##cat##555 |BBB##felix##888 |[AAA, cat, 555] |
|3 |HHH##mouse##yyy|123##mickey##ZZZ|[HHH, mouse, yyy] |
|1 |111##cat##666 |222##fritz##777 |[222, fritz, 777] |
|2 |AAA##cat##555 |BBB##felix##888 |[BBB, felix, 888] |
|3 |HHH##mouse##yyy|123##mickey##ZZZ|[123, mickey, ZZZ]|
+---+---------------+----------------+------------------+
root
|-- c0: integer (nullable = false)
|-- c1: string (nullable = true)
|-- c2: string (nullable = true)
|-- c_split: array (nullable = true)
| |-- element: string (containsNull = true)
+---+---------------------------------------+
|c0 |List_of_Data |
+---+---------------------------------------+
|1 |[[111, cat, 666], [222, fritz, 777]] |
|3 |[[HHH, mouse, yyy], [123, mickey, ZZZ]]|
|2 |[[AAA, cat, 555], [BBB, felix, 888]] |
+---+---------------------------------------+
root
|-- c0: integer (nullable = false)
|-- List_of_Data: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
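If a ListBuffer is genuinely required on the driver side, the aggregated column can be collected and converted locally. A minimal sketch, assuming Scala 2.11/2.12 where .to[ListBuffer] is available:
import scala.collection.mutable.ListBuffer
// collect the grouped rows and rebuild the ListBuffer-of-Lists shape per c0
val collected: Map[Int, ListBuffer[List[String]]] =
  df3.collect().map { row =>
    val c0 = row.getInt(0)
    val lists = row.getSeq[Seq[String]](1).map(_.toList)
    c0 -> lists.to[ListBuffer]
  }.toMap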