Scala Spark: Flatten Array of Key/Value structs

I have an input dataframe which contains an array-typed column. Each entry in the array is a struct consisting of a key (one of about four values) and a value. I want to turn this into a dataframe with one column for each possible key, and nulls where that value is not in the array for that row. Keys are never duplicated in any of the arrays, but they may be out of order or missing.
So far the best I've got is
val wantedCols =df.columns
.filter(_ != arrayCol)
.filter(_ != "col")
val flattened = df
.select((wantedCols.map(col(_)) ++ Seq(explode(col(arrayCol)))):_*)
.groupBy(wantedCols.map(col(_)):_*)
.pivot("col.key")
.agg(first("col.value"))
This does exactly what I want, but it's hideous and I have no idea what the ramifications of grouping on every-column-but-one would be. What's the RIGHT way to do this?
EDIT: Example input/output:
case class testStruct(name : String, number : String)
val dfExampleInput = Seq(
(0, "KY", Seq(testStruct("A", "45"))),
(1, "OR", Seq(testStruct("A", "30"), testStruct("B", "10"))))
.toDF("index", "state", "entries")
.show
+-----+-----+------------------+
|index|state| entries|
+-----+-----+------------------+
| 0| KY| [[A, 45]]|
| 1| OR|[[A, 30], [B, 10]]|
+-----+-----+------------------+
val dfExampleOutput = Seq(
(0, "KY", "45", null),
(1, "OR", "30", "10"))
.toDF("index", "state", "A", "B")
.show
+-----+-----+---+----+
|index|state| A| B|
+-----+-----+---+----+
| 0| KY| 45|null|
| 1| OR| 30| 10|
+-----+-----+---+----+
FURTHER EDIT:
I submitted a solution myself (see below) that handles this well so long as you know the keys in advance (in my case I do.) If finding the keys is an issue, another answer holds code to handle that.

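Here is an approach without groupBy, pivot, agg and first. Note that it picks values by array position, so it only gives correct results when every row lists its keys in the same order, and it assumes max holds the largest array length (its definition is not shown).
Please check the code below.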
scala> val df = Seq((0, "KY", Seq(("A", "45"))),(1, "OR", Seq(("A", "30"),("B", "10")))).toDF("index", "state", "entries").withColumn("entries",$"entries".cast("array<struct<name:string,number:string>>"))
df: org.apache.spark.sql.DataFrame = [index: int, state: string ... 1 more field]
scala> df.printSchema
root
|-- index: integer (nullable = false)
|-- state: string (nullable = true)
|-- entries: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- name: string (nullable = true)
| | |-- number: string (nullable = true)
scala> df.show(false)
+-----+-----+------------------+
|index|state|entries |
+-----+-----+------------------+
|0 |KY |[[A, 45]] |
|1 |OR |[[A, 30], [B, 10]]|
+-----+-----+------------------+
scala> val finalDFColumns = df.select(explode($"entries").as("entries")).select("entries.*").select("name").distinct.map(_.getAs[String](0)).orderBy($"value".asc).collect.foldLeft(df.limit(0))((cdf,c) => cdf.withColumn(c,lit(null))).columns
finalDFColumns: Array[String] = Array(index, state, entries, A, B)
scala> val finalDF = df.select($"*" +: (0 until max).map(i => $"entries".getItem(i)("number").as(i.toString)): _*)
finalDF: org.apache.spark.sql.DataFrame = [index: int, state: string ... 3 more fields]
scala> finalDF.show(false)
+-----+-----+------------------+---+----+
|index|state|entries |0 |1 |
+-----+-----+------------------+---+----+
|0 |KY |[[A, 45]] |45 |null|
|1 |OR |[[A, 30], [B, 10]]|30 |10 |
+-----+-----+------------------+---+----+
scala> finalDF.printSchema
root
|-- index: integer (nullable = false)
|-- state: string (nullable = true)
|-- entries: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- name: string (nullable = true)
| | |-- number: string (nullable = true)
|-- 0: string (nullable = true)
|-- 1: string (nullable = true)
scala> finalDF.columns.zip(finalDFColumns).foldLeft(finalDF)((fdf,column) => fdf.withColumnRenamed(column._1,column._2)).show(false)
+-----+-----+------------------+---+----+
|index|state|entries |A |B |
+-----+-----+------------------+---+----+
|0 |KY |[[A, 45]] |45 |null|
|1 |OR |[[A, 30], [B, 10]]|30 |10 |
+-----+-----+------------------+---+----+
Final Output
scala> finalDF.columns.zip(finalDFColumns).foldLeft(finalDF)((fdf,column) => fdf.withColumnRenamed(column._1,column._2)).drop($"entries").show(false)
+-----+-----+---+----+
|index|state|A |B |
+-----+-----+---+----+
|0 |KY |45 |null|
|1 |OR |30 |10 |
+-----+-----+---+----+

I wouldn't worry too much about grouping by several columns, other than potentially making things confusing. In that vein, if there is a simpler, more maintainable way, go for it. Without example input/output, I'm not sure if this gets you where you're trying to go, but maybe it'll be of use:
Seq(Seq("k1" -> "v1", "k2" -> "v2")).toDS() // some basic input based on my understanding of your description
.select(explode($"value")) // flatten the array
.select("col.*") // de-nest the struct
.groupBy("_2") // one row per distinct value
.pivot("_1") // one column per distinct key
.count // or agg(first) if you want the value in each column
.show
+---+----+----+
| _2| k1| k2|
+---+----+----+
| v2|null| 1|
| v1| 1|null|
+---+----+----+
Based on what you've now said, I get the impression that there are many columns like "state" that aren't required for the aggregation, but need to be in the final result.
For reference, if you didn't need to pivot, you could add a struct column with all such fields nested within, then add it to your aggregation, e.g.: .agg(first($"myStruct"), first($"number")). The main advantage is only having actual key column(s) referenced in the groupBy. But when using pivot things get a little weird, so we'll set that option aside.
In this use case, the simplest way I could come up with involves splitting your dataframe and joining it back together after the aggregation using some rowkey. In this example I am assuming that "index" is suitable for that purpose:
val mehCols = dfExampleInput.columns.filter(_ != "entries").map(col)
val mehDF = dfExampleInput.select(mehCols:_*)
val aggDF = dfExampleInput
.select($"index", explode($"entries").as("entry"))
.select($"index", $"entry.*")
.groupBy("index")
.pivot("name")
.agg(first($"number"))
scala> mehDF.join(aggDF, Seq("index")).show
+-----+-----+---+----+
|index|state| A| B|
+-----+-----+---+----+
| 0| KY| 45|null|
| 1| OR| 30| 10|
+-----+-----+---+----+
I doubt you would see much of a difference in performance, if any. Maybe at the extremes, eg: very many meh columns, or very many pivot columns, or something like that, or maybe nothing at all. Personally, I would test both with decently-sized input, and if there wasn't a significant difference, use whichever one seemed easier to maintain.
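If the keys are known up front, one variation worth including in such a test is to pass them to pivot explicitly. The following is only a sketch under that assumption: the knownKeys parameter, the Entry case class and the PivotWithKnownKeys object are names made up for illustration. With explicit values, Spark does not need an extra pass over the data to discover the distinct pivot values.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode, first}

// Same shape as the example input in the question (hypothetical name)
case class Entry(name: String, number: String)

object PivotWithKnownKeys {
  // knownKeys is an assumption: the caller already knows every possible key
  def flatten(df: DataFrame, knownKeys: Seq[String]): DataFrame = {
    val otherCols = df.columns.filter(_ != "entries").map(col)
    val aggDF = df
      .select(col("index"), explode(col("entries")).as("entry"))
      .select(col("index"), col("entry.*"))
      .groupBy("index")
      .pivot("name", knownKeys) // explicit values: no extra job to find them
      .agg(first(col("number")))
    // join the untouched columns back on the row key
    df.select(otherCols: _*).join(aggDF, Seq("index"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("pivot-known-keys").getOrCreate()
    import spark.implicits._
    val input = Seq(
      (0, "KY", Seq(Entry("A", "45"))),
      (1, "OR", Seq(Entry("A", "30"), Entry("B", "10")))
    ).toDF("index", "state", "entries")
    flatten(input, Seq("A", "B")).orderBy("index").show()
    spark.stop()
  }
}
```

As with the join above, rows whose array is empty or null produce no exploded rows and therefore drop out of the inner join; a left join from the untouched columns would keep them.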

Here is another way, based on the assumption that there are no duplicate keys in the entries column, i.e., input such as Seq(testStruct("A", "30"), testStruct("A", "70"), testStruct("B", "10")) is not handled correctly (only one value per key survives the conversion to a map). This solution combines the RDD and DataFrame APIs:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.explode
import org.apache.spark.sql.types.StructType
case class testStruct(name : String, number : String)
val df = Seq(
(0, "KY", Seq(testStruct("A", "45"))),
(1, "OR", Seq(testStruct("A", "30"), testStruct("B", "10"))),
(2, "FL", Seq(testStruct("A", "30"), testStruct("B", "10"), testStruct("C", "20"))),
(3, "TX", Seq(testStruct("B", "60"), testStruct("A", "19"), testStruct("C", "40")))
)
.toDF("index", "state", "entries")
.cache
// get all possible keys from entries i.e Seq[A, B, C]
val finalCols = df.select(explode($"entries").as("entry"))
.select($"entry".getField("name").as("entry_name"))
.distinct
.collect
.map{_.getAs[String]("entry_name")}
.sorted // Attention: we need to retain the order of the columns
// 1. when generating row values and
// 2. when creating the schema
val rdd = df.rdd.map{ r =>
// transform the entries array into a map i.e Map(A -> 30, B -> 10)
val entriesMap = r.getSeq[Row](2).map{r => (r.getString(0), r.getString(1))}.toMap
// look up every key of finalCols, in order, using null where the current row has no such entry
// Attention: iterating over finalCols (rather than merging two Maps) guarantees the column order,
// because a Scala Map with more than four entries does not preserve insertion order
val entryValues = finalCols.map{c => entriesMap.getOrElse(c, null)}
// concatenate the two first row values ["index", "state"] with the entry values
val finalValues = Seq(r(0), r(1)) ++ entryValues
Row.fromSeq(finalValues)
}
val extraCols = finalCols.map{c => s"`${c}` STRING"}
val schema = StructType.fromDDL("`index` INT, `state` STRING," + extraCols.mkString(","))
val finalDf = spark.createDataFrame(rdd, schema)
finalDf.show
// +-----+-----+---+----+----+
// |index|state| A| B| C|
// +-----+-----+---+----+----+
// | 0| KY| 45|null|null|
// | 1| OR| 30| 10|null|
// | 2| FL| 30| 10| 20|
// | 3| TX| 19| 60| 40|
// +-----+-----+---+----+----+
Note: the solution requires one extra action to retrieve the unique keys. The row-building step itself doesn't cause any shuffling, since it is based on narrow transformations only; the distinct used to collect the keys does shuffle, but only the key names.

I've worked out a solution myself:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, lit, when}
def extractFromArray(colName : String, key : String, numKeys : Int, keyName : String) = {
val indexCols = (0 to numKeys-1).map(col(colName).getItem(_))
indexCols.foldLeft(lit(null))((innerCol : Column, indexCol : Column) =>
when(indexCol.isNotNull && (indexCol.getItem(keyName) === key), indexCol)
.otherwise(innerCol))
}
Example:
case class testStruct(name : String, number : String)
val df = Seq(
(0, "KY", Seq(testStruct("A", "45"))),
(1, "OR", Seq(testStruct("A", "30"), testStruct("B", "10"))),
(2, "FL", Seq(testStruct("A", "30"), testStruct("B", "10"), testStruct("C", "20"))),
(3, "TX", Seq(testStruct("B", "60"), testStruct("A", "19"), testStruct("C", "40")))
)
.toDF("index", "state", "entries")
.withColumn("B", extractFromArray("entries", "B", 3, "name"))
.show
which produces:
+-----+-----+--------------------+-------+
|index|state| entries| B|
+-----+-----+--------------------+-------+
| 0| KY| [[A, 45]]| null|
| 1| OR| [[A, 30], [B, 10]]|[B, 10]|
| 2| FL|[[A, 30], [B, 10]...|[B, 10]|
| 3| TX|[[B, 60], [A, 19]...|[B, 60]|
+-----+-----+--------------------+-------+
This solution is a little different from other answers:
It works on only a single key at a time
It requires the key name and number of keys be known in advance
It produces a column of structs, rather than doing the extra step of extracting specific values
It works as a simple column-to-column operation, rather than requiring transformations on the entire DF
It can be evaluated lazily
The first three issues can be handled by calling code, and leave it somewhat more flexible for cases where you already know the keys or where the structs contain additional values to extract.
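As a rough illustration of what such calling code might look like, here is a sketch that applies the function once per known key and keeps only the number field. The flattenKnownKeys helper, the KeyValue case class and the ExtractKnownKeys object are hypothetical names. It assumes no array holds more elements than there are keys, and that an out-of-range array index yields null, which is the non-ANSI behavior; the session below sets that explicitly.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, when}

// Same shape as testStruct in the example (hypothetical name)
case class KeyValue(name: String, number: String)

object ExtractKnownKeys {
  // The function from the answer, reproduced so this sketch is self-contained
  def extractFromArray(colName: String, key: String, numKeys: Int, keyName: String): Column = {
    val indexCols = (0 until numKeys).map(col(colName).getItem(_))
    indexCols.foldLeft(lit(null))((innerCol, indexCol) =>
      when(indexCol.isNotNull && (indexCol.getItem(keyName) === key), indexCol)
        .otherwise(innerCol))
  }

  // Hypothetical helper: one column per known key, holding only the "number" field
  def flattenKnownKeys(df: DataFrame, keys: Seq[String]): DataFrame =
    keys.foldLeft(df) { (acc, k) =>
      acc.withColumn(k, extractFromArray("entries", k, keys.size, "name").getField("number"))
    }.drop("entries")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("extract-known-keys")
      .config("spark.sql.ansi.enabled", "false") // out-of-range getItem returns null
      .getOrCreate()
    import spark.implicits._
    val df = Seq(
      (0, "KY", Seq(KeyValue("A", "45"))),
      (3, "TX", Seq(KeyValue("B", "60"), KeyValue("A", "19"), KeyValue("C", "40")))
    ).toDF("index", "state", "entries")
    flattenKnownKeys(df, Seq("A", "B", "C")).show()
    spark.stop()
  }
}
```

Everything stays a column expression, so the result is still evaluated lazily and needs no grouping or shuffle.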

Related

Scala modify value of nested column

I need to conditionally modify a value of a nested field in a Dataframe (or create a new field with the nested values).
I would like to do it without having to use a UDF, and I really want to avoid RDD/map, since the production tables can have many hundreds of millions of records and map in that situation doesn't strike me as efficient or fast.
Below is the test case:
case class teste(var testID: Int = 0, var testDesc: String = "", var testValue: String = "")
val DFMain = Seq( ("A",teste(1, "AAA", "10")),("B",teste(2, "BBB", "20")),("C",teste(3, "CCC", "30"))).toDF("F1","F2")
val DFNewData = Seq( ("A",teste(1, "AAA", "40")),("B",teste(2, "BBB", "50")),("C",teste(3, "CCC", "60"))).toDF("F1","F2")
val DFJoined = DFMain.join(DFNewData,DFMain("F2.testID")===DFNewData("F2.testID"),"left").
select(DFMain("F1"), DFMain("F2"), DFNewData("F2.testValue").as("NewValue")).
withColumn("F2.testValue",$"NewValue")
DFJoined.show()
This will add a new column, but I need F2.testValue inside the struct to be set to the value of NewValue when it's 50 or above.
Original Data:
+---+------------+
| F1| F2|
+---+------------+
| A|[1, AAA, 10]|
| B|[2, BBB, 20]|
| C|[3, CCC, 30]|
+---+------------+
Desired Result:
+---+------------+
| F1| F2|
+---+------------+
| A|[1, AAA, 10]|
| B|[2, BBB, 50]|
| C|[3, CCC, 60]|
+---+------------+
Could you please try this.
case class teste(var testID: Int = 0, var testDesc: String = "", var testValue: String = "")
val DFMain = Seq( ("A",teste(1, "AAA", "10")),("B",teste(2, "BBB", "20")),("C",teste(3, "CCC", "30"))).toDF("F1","F2")
DFMain.show(false)
+---+------------+
|F1 |F2 |
+---+------------+
|A |[1, AAA, 10]|
|B |[2, BBB, 20]|
|C |[3, CCC, 30]|
+---+------------+
val DFNewData = Seq( ("A",teste(1, "AAA", "40")),("B",teste(2, "BBB", "50")),("C",teste(3, "CCC", "60"))).toDF("F1","F2")
val DFJoined = DFMain.join(DFNewData,DFMain("F2.testID")===DFNewData("F2.testID"),"left").
select(DFMain("F1"), DFMain("F2"), DFNewData("F2.testValue").as("NewValue"))
.withColumn("F2_testValue",$"NewValue")
DFJoined.show
+---+------------+--------+------------+
| F1| F2|NewValue|F2_testValue|
+---+------------+--------+------------+
| A|[1, AAA, 10]| 40| 40|
| B|[2, BBB, 20]| 50| 50|
| C|[3, CCC, 30]| 60| 60|
+---+------------+--------+------------+
DFJoined.printSchema
root
|-- F1: string (nullable = true)
|-- F2: struct (nullable = true)
| |-- testID: integer (nullable = false)
| |-- testDesc: string (nullable = true)
| |-- testValue: string (nullable = true)
|-- NewValue: string (nullable = true)
|-- F2_testValue: string (nullable = true)
DFJoined.withColumn("f2_new", expr(" case when F2_testValue>=50 then concat_ws('|',F2.testID,F2.testDesc,F2_testValue) else concat_ws('|',F2.testID,F2.testDesc,F2.testValue) end "))
.withColumn("f2_new3",struct(split($"f2_new","[|]")(0),split($"f2_new","[|]")(1),split($"f2_new","[|]")(2) ) )
.show(false)
+---+------------+--------+------------+--------+------------+
|F1 |F2 |NewValue|F2_testValue|f2_new |f2_new3 |
+---+------------+--------+------------+--------+------------+
|A |[1, AAA, 10]|40 |40 |1|AAA|10|[1, AAA, 10]|
|B |[2, BBB, 20]|50 |50 |2|BBB|50|[2, BBB, 50]|
|C |[3, CCC, 30]|60 |60 |3|CCC|60|[3, CCC, 60]|
+---+------------+--------+------------+--------+------------+
f2_new3 is the desired output.
The reason for the workaround is that the following does not work:
DFJoined.withColumn("f2_new", expr(" case when F2_testValue>=50 then struct(F2.testID,F2.testDesc,F2_testValue) else struct(F2.testID,F2.testDesc,F2.testValue) end ")).show()
In addition to stack0114106's answer, I also found this solution to the problem; the two are more or less alike:
val DFFinal = DFJoined.selectExpr("""
named_struct(
'F1', F1,
'F2', named_struct(
'testID', F2.testID,
'testDesc', F2.testDesc,
'testValue', case when NewValue>=50 then NewValue else F2.testValue end
)
) as named_struct
""").select($"named_struct.F1", $"named_struct.F2")

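On Spark 3.1 or later, Column.withField may be a shorter route to the same result, since it replaces a single field of a struct in place. The sketch below assumes that version and a DataFrame shaped like DFJoined above (columns F1, F2 and NewValue); the applyNewValue helper, the Teste case class and the UpdateNestedField object are illustrative names, and the cast to int is an assumption about how the string values should be compared.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, when}

// Same fields as the teste case class in the question (hypothetical name)
case class Teste(testID: Int, testDesc: String, testValue: String)

object UpdateNestedField {
  // Expects columns F1, F2 (struct) and NewValue, as in DFJoined above
  def applyNewValue(joined: DataFrame): DataFrame =
    joined
      .withColumn("F2", col("F2").withField(
        "testValue",
        // compare numerically; keep the old value when NewValue is below 50 or null
        when(col("NewValue").cast("int") >= 50, col("NewValue"))
          .otherwise(col("F2.testValue"))))
      .drop("NewValue")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("update-nested-field").getOrCreate()
    import spark.implicits._
    val dfMain = Seq(("A", Teste(1, "AAA", "10")), ("B", Teste(2, "BBB", "20")), ("C", Teste(3, "CCC", "30"))).toDF("F1", "F2")
    val dfNew = Seq(("A", Teste(1, "AAA", "40")), ("B", Teste(2, "BBB", "50")), ("C", Teste(3, "CCC", "60"))).toDF("F1", "F2")
    val joined = dfMain
      .join(dfNew, dfMain("F2.testID") === dfNew("F2.testID"), "left")
      .select(dfMain("F1"), dfMain("F2"), dfNew("F2.testValue").as("NewValue"))
    applyNewValue(joined).orderBy("F1").show(false)
    spark.stop()
  }
}
```

The other struct fields are carried over untouched, so nothing has to be listed by hand when the struct gains fields later.
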
Run a custom transformation on string columns

Suppose I have the following dataframe:
var df = Seq(
("2019-09-01", 0.1, 1, "0x0000000000000001", "0x00000001", "True"),
("2019-09-02", 0.2, 2, "0x0000000000000002", "0x00000002", "False"),
("2019-09-03", 0.3, 3, "0x0000000000000003", "0x00000003", "True")
).toDF("Timestamp", "Float", "Integer", "Hex1", "Hex2", "Bool")
I need to run a transformation on the string columns (in this example: Hex1, Hex2 and Bool) and convert them to a numeric value by using some custom logic.
The dataframes are generated by reading CSV files whose schema I don't know. All I know is that they contain a Timestamp column as the first column and then a variable number of columns which might be numeric (integers or doubles/floats) or these hex and boolean values.
I'm thinking this transformation would need to find all the string columns and for each one, run the transformation that will add a new column to the dataframe with the numerical representation of the string.
In this case, the hex values would be converted to their decimal representation. And the "True", "False" strings would be converted to 1 and 0 respectively.
Back to the simplified example, I should get a df like this:
|Timestamp |Float|Integer|Hex1 |Hex2 |Bool |
|-----------|-----|-------|------------------|----------|-----|
|2019-09-01 |0.1 |1 |1 |1 |1 |
|2019-09-02 |0.2 |2 |2 |2 |0 |
|2019-09-03 |0.3 |3 |3 |3 |1 |
With all numeric (integer, float or double) columns except for the Timestamp
For your example, use the following functions:
conv(num: Column, fromBase: Int, toBase: Int): Column converts a number in a string column from one base to another; use it to turn the hex strings into decimal values.
when(condition: Column, value: Any): Column evaluates a list of conditions and returns one of multiple possible result expressions; use it for the "True"/"False" strings.
import org.apache.spark.sql.types.IntegerType
import org.apache.spark.sql.functions._
val s1 = df.
withColumn("Hex1", conv(col("Hex1").substr(lit(3), length(col("Hex1"))), 16, 10) cast IntegerType).
withColumn("Hex2", conv(col("Hex2").substr(lit(3), length(col("Hex2"))), 16, 10) cast IntegerType).
withColumn("Bool", when(col("Bool") === "True", 1)
.otherwise(0))
s1.show()
s1.printSchema()
Your problem statement asks for this to happen dynamically, which takes some extra work.
Create a mapping from each column to its conversion type. This can be abstracted out: you can keep the mapping in an external file and build the list dynamically by reading it.
val list = List(
("Hex", "Hex1"),
("Hex", "Hex2"),
("Bool", "Bool")
)
Create a converter using pattern matching:
object Helper {
def convert(columnDetail: (String, String)): Column = {
columnDetail._1 match {
case "Hex" => conv(col(columnDetail._2).substr(lit(3), length(col(columnDetail._2))), 16, 10) cast IntegerType
case "Bool" => when(col(columnDetail._2) === "True", 1).otherwise(0)
// your other case
}
}
}
You can add all the other cases and their implementations.
Final solution:
import spark.implicits._
var df = Seq(
("2019-09-01", 0.1, 1, "0x0000000000000001", "0x00000001", "True"),
("2019-09-02", 0.2, 2, "0x0000000000000002", "0x00000002", "False"),
("2019-09-03", 0.3, 3, "0x0000000000000003", "0x00000003", "True")
).toDF("Timestamp", "Float", "Integer", "Hex1", "Hex2", "Bool")
val list = List(
("Hex", "Hex1"),
("Hex", "Hex2"),
("Bool", "Bool")
)
val temp = list.foldLeft(df) { (tempDF, listValue) =>
tempDF.withColumn(listValue._2, Helper.convert(listValue))
}
temp.show(false)
temp.printSchema()
object Helper {
def convert(columnDetail: (String, String)): Column = {
columnDetail._1 match {
case "Hex" => conv(col(columnDetail._2).substr(lit(3), length(col(columnDetail._2))), 16, 10) cast IntegerType
case "Bool" => when(col(columnDetail._2) === "True", 1).otherwise(0)
// your other case
}
}
}
Result:
+----------+-----+-------+----+----+----+
|Timestamp |Float|Integer|Hex1|Hex2|Bool|
+----------+-----+-------+----+----+----+
|2019-09-01|0.1 |1 |1 |1 |1 |
|2019-09-02|0.2 |2 |2 |2 |0 |
|2019-09-03|0.3 |3 |3 |3 |1 |
+----------+-----+-------+----+----+----+
root
|-- Timestamp: string (nullable = true)
|-- Float: double (nullable = false)
|-- Integer: integer (nullable = false)
|-- Hex1: integer (nullable = true)
|-- Hex2: integer (nullable = true)
|-- Bool: integer (nullable = false)
Below is my Spark code to do this. I have used the conv function of Spark SQL: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.conv. Also, if you want to write logic that identifies all string columns dynamically at run time and converts them, that can only be done if you know exactly what kind of conversion you are going to apply.
var df = Seq(
("2019-09-01", 0.1, 1, "0x0000000000000001", "0x00000001", "True"),
("2019-09-02", 0.2, 2, "0x0000000000000002", "0x00000002", "False"),
("2019-09-03", 0.3, 3, "0x0000000000000003", "0x00000003", "True")
).toDF("Timestamp", "Float", "Integer", "Hex1", "Hex2", "Bool")
// df.show
df.createOrReplaceTempView("sourceTable")
val finalDF = spark.sql("""
select Timestamp,
Float,
Integer,
conv(substr(Hex1,3),16,10) as Hex1,
conv(substr(Hex2,3),16,10) as Hex2,
case when Bool = "True" then 1
when Bool = "False" then 0
else NULL
end as Bool
from sourceTable
""")
finalDF.show
Result :
+----------+-----+-------+----+----+----+
| Timestamp|Float|Integer|Hex1|Hex2|Bool|
+----------+-----+-------+----+----+----+
|2019-09-01| 0.1| 1| 1| 1| 1|
|2019-09-02| 0.2| 2| 2| 2| 0|
|2019-09-03| 0.3| 3| 3| 3| 1|
+----------+-----+-------+----+----+----+
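For the dynamic case, one possible sketch is to pick the string columns out of the DataFrame schema and apply a single rule-based conversion to each. Everything below is an assumption made for illustration: the toNumeric and convertStringColumns names, the keep parameter that protects the Timestamp column, and the rules themselves (values starting with "0x" are hex, "True"/"False" map to 1/0, anything else becomes null). It also assumes the CSV was read with schema inference, so that genuinely numeric columns are not strings.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, conv, length, lit, when}
import org.apache.spark.sql.types.StringType

object StringColumnsToNumeric {
  // Assumed rules: "0x..." is hex, "True"/"False" become 1/0, anything else becomes null
  def toNumeric(c: Column): Column =
    when(c.startsWith("0x"), conv(c.substr(lit(3), length(c)), 16, 10).cast("long"))
      .when(c === "True", lit(1L))
      .when(c === "False", lit(0L))

  // Converts every string column except those named in keep (hypothetical parameter)
  def convertStringColumns(df: DataFrame, keep: Set[String] = Set("Timestamp")): DataFrame = {
    val stringCols = df.schema.fields.collect {
      case f if f.dataType == StringType && !keep.contains(f.name) => f.name
    }
    stringCols.foldLeft(df)((acc, name) => acc.withColumn(name, toNumeric(col(name))))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("string-cols-to-numeric").getOrCreate()
    import spark.implicits._
    val df = Seq(
      ("2019-09-01", 0.1, 1, "0x0000000000000001", "0x00000001", "True"),
      ("2019-09-02", 0.2, 2, "0x0000000000000002", "0x00000002", "False")
    ).toDF("Timestamp", "Float", "Integer", "Hex1", "Hex2", "Bool")
    val converted = convertStringColumns(df)
    converted.show(false)
    converted.printSchema()
    spark.stop()
  }
}
```

Because the rule is applied per value rather than per column, no external column-to-type mapping is needed, at the cost of silently producing null for strings that match none of the rules.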

Spark scala dataframe: Merging multiple columns into single column

I have a spark dataframe which looks something like below:
+---+------+----+
| id|animal|talk|
+---+------+----+
| 1| bat|done|
| 2| mouse|mone|
| 3| horse| gun|
| 4| horse|some|
+---+------+----+
I want to generate a new column, say merged which would look something like
+---+-----------------------------------------------------------+
| id| merged columns |
+---+-----------------------------------------------------------+
| 1| [{name: animal, value: bat}, {name: talk, value: done}] |
| 2| [{name: animal, value: mouse}, {name: talk, value: mone}] |
| 3| [{name: animal, value: horse}, {name: talk, value: gun}] |
| 4| [{name: animal, value: horse}, {name: talk, value: some}] |
+---+-----------------------------------------------------------+
Basically, combining all the columns into an Array of case class merged(name:String, value: String).
Can anyone help me with how to do this in Scala?
Here for simplicity I have used only two columns but generic answer which can work for N number of columns would greatly help.
Your expected output doesn't seem to reflect your requirement of producing a list of name-value structured objects. If I understand it correctly, consider using foldLeft to iteratively convert the wanted columns to StructType name-value columns, and group them into an ArrayType column:
import org.apache.spark.sql.functions._
val df = Seq(
(1, "bat", "done"),
(2, "mouse", "mone"),
(3, "horse", "gun"),
(4, "horse", "some")
).toDF("id", "animal", "talk")
val cols = df.columns.filter(_ != "id")
val resultDF = cols.
foldLeft(df)( (accDF, c) =>
accDF.withColumn(c, struct(lit(c).as("name"), col(c).as("value")))
).
select($"id", array(cols.map(col): _*).as("merged"))
resultDF.show(false)
// +---+-----------------------------+
// |id |merged |
// +---+-----------------------------+
// |1 |[[animal,bat], [talk,done]] |
// |2 |[[animal,mouse], [talk,mone]]|
// |3 |[[animal,horse], [talk,gun]] |
// |4 |[[animal,horse], [talk,some]]|
// +---+-----------------------------+
resultDF.printSchema
// root
// |-- id: integer (nullable = false)
// |-- merged: array (nullable = false)
// | |-- element: struct (containsNull = false)
// | | |-- name: string (nullable = false)
// | | |-- value: string (nullable = true)
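The same result can probably be built in a single select, without rewriting each column first. This is only a sketch: the mergeColumns helper and the MergeColumns object are made-up names, and the cast to string is an added assumption so that columns of different types can share one array element type.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array, col, lit, struct}

object MergeColumns {
  // Builds one (name, value) struct per non-id column and collects them into an array
  def mergeColumns(df: DataFrame, idCol: String): DataFrame = {
    val pairs = df.columns.filter(_ != idCol).map { c =>
      struct(lit(c).as("name"), col(c).cast("string").as("value"))
    }
    df.select(col(idCol), array(pairs: _*).as("merged"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("merge-columns").getOrCreate()
    import spark.implicits._
    val df = Seq(
      (1, "bat", "done"),
      (2, "mouse", "mone")
    ).toDF("id", "animal", "talk")
    mergeColumns(df, "id").show(false)
    spark.stop()
  }
}
```

This works for any number of columns, because the list of structs is derived from df.columns.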

create empty array-column of given schema in Spark

Because Parquet cannot persist empty arrays, I replaced empty arrays with null before writing a table. Now, as I read the table, I want to do the opposite:
I have a DataFrame with the following schema :
|-- id: long (nullable = false)
|-- arr: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- x: double (nullable = true)
| | |-- y: double (nullable = true)
and the following content:
+---+-----------+
| id| arr|
+---+-----------+
| 1|[[1.0,2.0]]|
| 2| null|
+---+-----------+
I'd like to replace the null-array (id=2) with an empty array, i.e.
+---+-----------+
| id| arr|
+---+-----------+
| 1|[[1.0,2.0]]|
| 2| []|
+---+-----------+
I've tried:
val arrSchema = df.schema(1).dataType
df
.withColumn("arr",when($"arr".isNull,array().cast(arrSchema)).otherwise($"arr"))
.show()
which gives :
java.lang.ClassCastException: org.apache.spark.sql.types.NullType$
cannot be cast to org.apache.spark.sql.types.StructType
Edit : I don't want to "hardcode" any schema of my array column (at least not the schema of the struct) because this can vary from case to case. I can only use the schema information from df at runtime
I'm using Spark 2.1 by the way, therefore I cannot use typedLit
Spark 2.2+ with known external type
In general you can use typedLit to provide empty arrays.
import org.apache.spark.sql.functions.typedLit
typedLit(Seq.empty[(Double, Double)])
To use specific names for nested objects you can use case classes:
case class Item(x: Double, y: Double)
typedLit(Seq.empty[Item])
or rename by cast:
typedLit(Seq.empty[(Double, Double)])
.cast("array<struct<x: Double, y: Double>>")
Spark 2.1+ with schema only
With schema only you can try:
val schema = StructType(Seq(
StructField("arr", ArrayType(StructType(Seq(
StructField("x", DoubleType),
StructField("y", DoubleType)
))))
))
def arrayOfSchema(schema: StructType) =
from_json(lit("""{"arr": []}"""), schema)("arr")
arrayOfSchema(schema).alias("arr")
where schema can be extracted from the existing DataFrame and wrapped with additional StructType:
StructType(Seq(
StructField("arr", df.schema("arr").dataType)
))
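Put together, the schema-only variant might look like the sketch below, which takes the element type from the DataFrame at runtime and fills nulls with coalesce. The emptyArrayLike and fillNullArrays helpers and the EmptyArrayFromSchema object are hypothetical names, and the sample data uses tuples, so the struct fields are called _1 and _2 rather than x and y.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{coalesce, col, from_json, lit}
import org.apache.spark.sql.types.{StructField, StructType}

object EmptyArrayFromSchema {
  // Parses an empty JSON array using the column's own array type, taken from the schema
  def emptyArrayLike(df: DataFrame, colName: String): Column = {
    val wrapper = StructType(Seq(StructField("arr", df.schema(colName).dataType)))
    from_json(lit("""{"arr": []}"""), wrapper)("arr")
  }

  // Replaces null arrays in colName with an empty array of the same type
  def fillNullArrays(df: DataFrame, colName: String): DataFrame =
    df.withColumn(colName, coalesce(col(colName), emptyArrayLike(df, colName)))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("empty-array-from-schema").getOrCreate()
    import spark.implicits._
    val df = Seq(
      (1L, Option(Seq((1.0, 2.0)))),
      (2L, Option.empty[Seq[(Double, Double)]])
    ).toDF("id", "arr")
    fillNullArrays(df, "arr").show()
    spark.stop()
  }
}
```

Nothing about the struct is hardcoded here, which matches the constraint that the element schema can vary from case to case.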
One way is to use a UDF:
val arrSchema = df.schema(1).dataType // ArrayType(StructType(StructField(x,DoubleType,true), StructField(y,DoubleType,true)),true)
val emptyArr = udf(() => Seq.empty[Any],arrSchema)
df
.withColumn("arr",when($"arr".isNull,emptyArr()).otherwise($"arr"))
.show()
+---+-----------+
| id| arr|
+---+-----------+
| 1|[[1.0,2.0]]|
| 2| []|
+---+-----------+
Another approach would be to use coalesce:
val df = Seq(
(Some(1), Some(Array((1.0, 2.0)))),
(Some(2), None)
).toDF("id", "arr")
df.withColumn("arr", coalesce($"arr", typedLit(Array.empty[(Double, Double)]))).
show
// +---+-----------+
// | id| arr|
// +---+-----------+
// | 1|[[1.0,2.0]]|
// | 2| []|
// +---+-----------+
UDF with case class could also be interesting:
case class Item(x: Double, y: Double)
val udf_emptyArr = udf(() => Seq[Item]())
df
.withColumn("arr",coalesce($"arr",udf_emptyArr()))
.show()

How to join two dataframes?

I cannot get Spark's DataFrame join to work (no result gets produced). Here is my code:
val e = Seq((1, 2), (1, 3), (2, 4))
var edges = e.map(p => Edge(p._1, p._2)).toDF()
var filtered = edges.filter("start = 1").distinct()
println("filtered")
filtered.show()
filtered.printSchema()
println("edges")
edges.show()
edges.printSchema()
var joined = filtered.join(edges, filtered("end") === edges("start"))//.select(filtered("start"), edges("end"))
println("joined")
joined.show()
It requires case class Edge(start: Int, end: Int) to be defined at top level. Here is the output it produces:
filtered
+-----+---+
|start|end|
+-----+---+
| 1| 2|
| 1| 3|
+-----+---+
root
|-- start: integer (nullable = false)
|-- end: integer (nullable = false)
edges
+-----+---+
|start|end|
+-----+---+
| 1| 2|
| 1| 3|
| 2| 4|
+-----+---+
root
|-- start: integer (nullable = false)
|-- end: integer (nullable = false)
joined
+-----+---+-----+---+
|start|end|start|end|
+-----+---+-----+---+
+-----+---+-----+---+
I don't understand why the output is empty. Why doesn't the first row of filtered get combined with the last row of edges?
val f2 = filtered.withColumnRenamed("start", "fStart").withColumnRenamed("end", "fEnd")
f2.join(edges, f2("fEnd") === edges("start")).show
I believe this is because filtered("start").equals(edges("start")): filtered is a filtered view of edges, so the two share the same column definitions. Because the columns are the same, Spark cannot tell which DataFrame you are referencing.
As such you can do things like
edges.select(filtered("start")).show
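Another common way around this ambiguity is to alias both sides of the join and refer to the columns through the aliases instead of through the DataFrame objects. A minimal sketch, where the twoHop helper, the SelfJoinWithAliases object and the aliases f and e are names chosen for illustration:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Must be defined at top level, as noted in the question
case class Edge(start: Int, end: Int)

object SelfJoinWithAliases {
  // Joins the edges leaving node 1 with the edges that continue from their end points
  def twoHop(edges: DataFrame): DataFrame = {
    val filtered = edges.filter("start = 1").distinct()
    filtered.as("f")
      .join(edges.as("e"), col("f.end") === col("e.start"))
      .select(col("f.start").as("start"), col("e.end").as("end"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("self-join-aliases").getOrCreate()
    import spark.implicits._
    val edges = Seq((1, 2), (1, 3), (2, 4)).map(p => Edge(p._1, p._2)).toDF()
    twoHop(edges).show()
    spark.stop()
  }
}
```

With the aliases in place, each column reference carries its own qualifier, so no columns need to be renamed before the join.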