Spark SQL nested withColumn - scala

I have a DataFrame that has multiple columns of which some of them are structs. Something like this
root
|-- foo: struct (nullable = true)
| |-- bar: string (nullable = true)
| |-- baz: string (nullable = true)
|-- abc: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- def: struct (nullable = true)
| | | |-- a: string (nullable = true)
| | | |-- b: integer (nullable = true)
| | | |-- c: string (nullable = true)
I want to apply a UserDefinedFunction on the column baz to replace baz with a function of baz, but I cannot figure out how to do that. Here is an example of the desired output (note that baz is now an int)
root
|-- foo: struct (nullable = true)
| |-- bar: string (nullable = true)
| |-- baz: int (nullable = true)
|-- abc: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- def: struct (nullable = true)
| | | |-- a: string (nullable = true)
| | | |-- b: integer (nullable = true)
| | | |-- c: string (nullable = true)
It looks like DataFrame.withColumn only works on top level columns but not on nested columns. I'm using Scala for this problem.
Can someone help me out with this?
Thanks

that's easy, just use a dot to select nested structures, e.g. $"foo.baz" :
case class Foo(bar:String,baz:String)
case class Record(foo:Foo)
val df = Seq(
Record(Foo("Hi","There"))
).toDF()
df.printSchema
root
|-- foo: struct (nullable = true)
| |-- bar: string (nullable = true)
| |-- baz: string (nullable = true)
val myUDF = udf((s:String) => {
// do something with s
s.toUpperCase
})
df
.withColumn("udfResult",myUDF($"foo.baz"))
.show
+----------+---------+
| foo|udfResult|
+----------+---------+
|[Hi,There]| THERE|
+----------+---------+
If you want to add the result of your UDF to the existing struct foo, i.e. to get:
root
|-- foo: struct (nullable = false)
| |-- bar: string (nullable = true)
| |-- baz: string (nullable = true)
| |-- udfResult: string (nullable = true)
there are two options:
with withColumn:
df
.withColumn("udfResult",myUDF($"foo.baz"))
.withColumn("foo",struct($"foo.*",$"udfResult"))
.drop($"udfResult")
with select:
df
.select(struct($"foo.*",myUDF($"foo.baz").as("udfResult")).as("foo"))
EDIT:
Replacing the existing attribute in the struct with the result from the UDF:
unfortunately, this does not work:
df
.withColumn("foo.baz",myUDF($"foo.baz"))
but can be done like this:
// get all columns except foo.baz
val structCols = df.select($"foo.*")
.columns
.filter(_!="baz")
.map(name => col("foo."+name))
df.withColumn(
"foo",
struct((structCols:+myUDF($"foo.baz").as("baz")):_*)
)

You can do this using the struct function as Raphael Roth has already been demonstrated in their answer above. There is an easier way to do this though using the Make Structs Easy* library. The library adds a withField method to the Column class allowing you to add/replace Columns inside a StructType column, in much the same way as the withColumn method on the DataFrame class allows you to add/replace columns inside a DataFrame. For your specific use-case, you could do something like this:
import org.apache.spark.sql.functions._
import com.github.fqaiser94.mse.methods._
// generate some fake data
case class Foo(bar: String, baz: String)
case class Record(foo: Foo, arrayOfFoo: Seq[Foo])
val df = Seq(
Record(Foo("Hello", "World"), Seq(Foo("Blue", "Red"), Foo("Green", "Yellow")))
).toDF
df.printSchema
// root
// |-- foo: struct (nullable = true)
// | |-- bar: string (nullable = true)
// | |-- baz: string (nullable = true)
// |-- arrayOfFoo: array (nullable = true)
// | |-- element: struct (containsNull = true)
// | | |-- bar: string (nullable = true)
// | | |-- baz: string (nullable = true)
df.show(false)
// +--------------+------------------------------+
// |foo |arrayOfFoo |
// +--------------+------------------------------+
// |[Hello, World]|[[Blue, Red], [Green, Yellow]]|
// +--------------+------------------------------+
// example user defined function that capitalizes a given string
val myUdf = udf((s: String) => s.toUpperCase)
// capitalize value of foo.baz
df.withColumn("foo", $"foo".withField("baz", myUdf($"foo.baz"))).show(false)
// +--------------+------------------------------+
// |foo |arrayOfFoo |
// +--------------+------------------------------+
// |[Hello, WORLD]|[[Blue, Red], [Green, Yellow]]|
// +--------------+------------------------------+
I noticed you had a follow-up question about replacing a Column nested inside a struct nested inside of an array.
This can also be done by combining the functions provided by the Make Structs Easy library with the functions provided by spark-hofs library, as follows:
import za.co.absa.spark.hofs._
// capitalize the value of foo.baz in each element of arrayOfFoo
df.withColumn("arrayOfFoo", transform($"arrayOfFoo", foo => foo.withField("baz", myUdf(foo.getField("baz"))))).show(false)
// +--------------+------------------------------+
// |foo |arrayOfFoo |
// +--------------+------------------------------+
// |[Hello, World]|[[Blue, RED], [Green, YELLOW]]|
// +--------------+------------------------------+
*Full disclosure: I am the author of the Make Structs Easy library that is referenced in this answer.

Related

Scala spark UDF function that takes input and puts it in an Array

I am trying to create a Scala UDF for Spark, that can be used in Spark SQL. The objective of the function is to accept any column type as input, and put it in an ArrayType, unless the input is already an ArrayType.
Here's the code I have so far:
package com.latitudefinancial.spark.udf
import org.apache.spark.sql.api.java.UDF1
import org.apache.spark.sql.types._
import org.apache.spark.sql.SparkSession
class GetDatatype extends UDF1[Object, scala.collection.Seq[_]] {
override def call(inputObject: Object): scala.collection.Seq[_] = {
if (inputObject.isInstanceOf[scala.collection.Seq[_]]) {
return inputObject.asInstanceOf[scala.collection.Seq[_]]
} else {
return Array(inputObject)
}
}
}
val myFunc = new GetDatatype().call _
val myFuncUDF = udf(myFunc)
spark.udf.register("myFuncUDF", myFuncUDF)
The data may look like this:
+-----------+-----------+--------------------------------------------------------------+--------+-------------------------------+
|create_date|item |datatype_of_item |item2 |datatype_of_item2 |
+-----------+-----------+--------------------------------------------------------------+--------+-------------------------------+
|2021-06-01 |[item 3, 3]|org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema|string 3|java.lang.String |
+-----------+-----------+--------------------------------------------------------------+--------+-------------------------------+
or this:
+-----------+--------------------------+-------------------------------------------+--------------------+-------------------------------------------+
|create_date|item |datatype_of_item |item2 |datatype_of_item_2 |
+-----------+--------------------------+-------------------------------------------+--------------------+-------------------------------------------+
|2021-05-01 |[[item 1, 1], [item 2, 2]]|scala.collection.mutable.WrappedArray$ofRef|[string 1, string 2]|scala.collection.mutable.WrappedArray$ofRef|
|2021-06-01 |[[item 3, 3]] |scala.collection.mutable.WrappedArray$ofRef|[string 3] |scala.collection.mutable.WrappedArray$ofRef|
+-----------+--------------------------+-------------------------------------------+--------------------+-------------------------------------------+
The UDF function may be passed contents from item or item2 columns.
However when executing this line:
val myFuncUDF = udf(myFunc)
I get the following error:
scala> val myFuncUDF = udf(myFunc)
java.lang.UnsupportedOperationException: Schema for type Any is not supported
at org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$schemaFor$1(ScalaReflection.scala:743)
Spark cannot use UDFs with this return type (Any, oder Object). You could do it without UDF I think:
val df = Seq(
(Seq((1,"a"),(2,"b")),(1,"a"))
).toDF("item","item 2")
def wrapInArray(df:DataFrame,c:String) = if(df.schema(c).dataType.isInstanceOf[ArrayType]) col(c) else array(col(c))
df
.withColumn("test",wrapInArray(df,"item"))
.withColumn("test 2",wrapInArray(df,"item 2"))
gives the schema
root
|-- item: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _1: integer (nullable = false)
| | |-- _2: string (nullable = true)
|-- item 2: struct (nullable = true)
| |-- _1: integer (nullable = false)
| |-- _2: string (nullable = true)
|-- test: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _1: integer (nullable = false)
| | |-- _2: string (nullable = true)
|-- test 2: array (nullable = false)
| |-- element: struct (containsNull = true)
| | |-- _1: integer (nullable = false)
| | |-- _2: string (nullable = true)

How to flatten Array of WrappedArray of structs in scala

I have a dataframe with the following schema:
root
|-- id: string (nullable = true)
|-- collect_list(typeCounts): array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: struct (containsNull = true)
| | | |-- type: string (nullable = true)
| | | |-- count: long (nullable = false)
Example data:
+-----------+----------------------------------------------------------------------------+
|id |collect_list(typeCounts) |
+-----------+----------------------------------------------------------------------------+
|1 |[WrappedArray([B00XGS,6], [B001FY,5]), WrappedArray([B06LJ7,4])]|
|2 |[WrappedArray([B00UFY,3])] |
+-----------+----------------------------------------------------------------------------+
How can I flatten collect_list(typeCounts) to a flat array of structs in scala? I have read some answers on stackoverflow for similar questions suggesting UDF's, but I am not sure what the UDF method signature should be for structs.
If you're on Spark 2.4+, instead of using a UDF (which is generally less efficient than native Spark functions) you can apply flatten, like below:
df.withColumn("collect_list(typeCounts)", flatten($"collect_list(typeCounts)"))
i am not sure what the udf method signature should be for structs
UDF takes structs as Rows for input and may return them as Scala case classes. To flatten the nested collections, you can create a simple UDF as follows:
import org.apache.spark.sql.Row
case class TC(`type`: String, count: Long)
val flattenLists = udf{ (lists: Seq[Seq[Row]]) =>
lists.flatMap( _.map{ case Row(t: String, c: Long) => TC(t, c) } )
}
To test out the UDF, let's assemble a DataFrame with your described schema:
val df = Seq(
("1", Seq(TC("B00XGS", 6), TC("B001FY", 5))),
("1", Seq(TC("B06LJ7", 4))),
("2", Seq(TC("B00UFY", 3)))
).toDF("id", "typeCounts").
groupBy("id").agg(collect_list("typeCounts"))
df.printSchema
// root
// |-- id: string (nullable = true)
// |-- collect_list(typeCounts): array (nullable = true)
// | |-- element: array (containsNull = true)
// | | |-- element: struct (containsNull = true)
// | | | |-- type: string (nullable = true)
// | | | |-- count: long (nullable = false)
Applying the UDF:
df.
withColumn("collect_list(typeCounts)", flattenLists($"collect_list(typeCounts)")).
printSchema
// root
// |-- id: string (nullable = true)
// |-- collect_list(typeCounts): array (nullable = true)
// | |-- element: struct (containsNull = true)
// | | |-- type: string (nullable = true)
// | | |-- count: long (nullable = false)

Convert flattened data frame to struct in Spark

I had a deep nested JSON files which I had to process, and in order to do that I had to flatten them because couldn't find a way to hash some deep nested fields. This is how my dataframe looks like (after flattening):
scala> flattendedJSON.printSchema
root
|-- header_appID: string (nullable = true)
|-- header_appVersion: string (nullable = true)
|-- header_userID: string (nullable = true)
|-- body_cardId: string (nullable = true)
|-- body_cardStatus: string (nullable = true)
|-- body_cardType: string (nullable = true)
|-- header_userAgent_browser: string (nullable = true)
|-- header_userAgent_browserVersion: string (nullable = true)
|-- header_userAgent_deviceName: string (nullable = true)
|-- body_beneficiary_beneficiaryAccounts_beneficiaryAccountOwner: string (nullable = true)
|-- body_beneficiary_beneficiaryPhoneNumbers_beneficiaryPhoneNumber: string (nullable = true)
And I need to convert it back to original structure (before flattening):
scala> nestedJson.printSchema
root
|-- header: struct (nullable = true)
| |-- appID: string (nullable = true)
| |-- appVersion: string (nullable = true)
| |-- userAgent: struct (nullable = true)
| | |-- browser: string (nullable = true)
| | |-- browserVersion: string (nullable = true)
| | |-- deviceName: string (nullable = true)
|-- body: struct (nullable = true)
| |-- beneficiary: struct (nullable = true)
| | |-- beneficiaryAccounts: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- beneficiaryAccountOwner: string (nullable = true)
| | |-- beneficiaryPhoneNumbers: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- beneficiaryPhoneNumber: string (nullable = true)
| |-- cardId: string (nullable = true)
| |-- cardStatus: string (nullable = true)
| |-- cardType: string (nullable = true)
I've managed to do it with single nested field, but if it's more, it can't work and I can't find a way to do it properly. Here's what I tried:
val structColumns = flattendedJSON.columns.filter(_.contains("_"))
val structColumnsMap = structColumns.map(_.split("\\_")).
groupBy(_(0)).mapValues(_.map(_(1)))
val dfExpanded = structColumnsMap.foldLeft(flattendedJSON){ (accDF, kv) =>
val cols = kv._2.map(v => col("`" + kv._1 + "_" + v + "`").as(v))
accDF.withColumn(kv._1, struct(cols: _*))
}
val dfResult = structColumns.foldLeft(flattendedJSON)(_ drop _)
And it's working if I have one nested object (e.g. header_appID), but in case of header_userAgent_browser, I get an exception:
org.apache.spark.sql.AnalysisException: cannot resolve
'header_userAgent' given input columns: ..
Using Spark 2.3 and Scala 2.11.8
I would recommend use case classes to work with a Dataset instead of flatten the DF and then again try to convert to the old json format. Even if it has nested objects you can define a set of case classes to cast it. It allows you to work with an object notation making the things easier than DF.
There are tools where you can provide a sample of the json and it generates the classes for you (I use this: https://json2caseclass.cleverapps.io).
If you anyways want to convert it from the DF, an alternative could be, create a Dataset using map on your DF. Something like this:
case class NestedNode(fieldC: String, fieldD: String) // for JSON
case class MainNode(fieldA: String, fieldB: NestedNode) // for JSON
case class FlattenData(fa: String, fc: String, fd: String)
Seq(
FlattenData("A1", "B1", "C1"),
FlattenData("A2", "B2", "C2"),
FlattenData("A3", "B3", "C3")
).toDF
.as[FlattenData] // Cast it to access with object notation
.map(flattenItem=>{
MainNode(flattenItem.fa, NestedNode(flattenItem.fc, flattenItem.fd) ) // Creating output format
})
At the end, that schema defined with the classes will be used by yourDS.write.mode(your_save_mode).json(your_target_path)

Renaming column names of a DataFrame in Spark Scala

I am trying to convert all the headers / column names of a DataFrame in Spark-Scala. as of now I come up with following code which only replaces a single column name.
for( i <- 0 to origCols.length - 1) {
df.withColumnRenamed(
df.columns(i),
df.columns(i).toLowerCase
);
}
If structure is flat:
val df = Seq((1L, "a", "foo", 3.0)).toDF
df.printSchema
// root
// |-- _1: long (nullable = false)
// |-- _2: string (nullable = true)
// |-- _3: string (nullable = true)
// |-- _4: double (nullable = false)
the simplest thing you can do is to use toDF method:
val newNames = Seq("id", "x1", "x2", "x3")
val dfRenamed = df.toDF(newNames: _*)
dfRenamed.printSchema
// root
// |-- id: long (nullable = false)
// |-- x1: string (nullable = true)
// |-- x2: string (nullable = true)
// |-- x3: double (nullable = false)
If you want to rename individual columns you can use either select with alias:
df.select($"_1".alias("x1"))
which can be easily generalized to multiple columns:
val lookup = Map("_1" -> "foo", "_3" -> "bar")
df.select(df.columns.map(c => col(c).as(lookup.getOrElse(c, c))): _*)
or withColumnRenamed:
df.withColumnRenamed("_1", "x1")
which use with foldLeft to rename multiple columns:
lookup.foldLeft(df)((acc, ca) => acc.withColumnRenamed(ca._1, ca._2))
With nested structures (structs) one possible option is renaming by selecting a whole structure:
val nested = spark.read.json(sc.parallelize(Seq(
"""{"foobar": {"foo": {"bar": {"first": 1.0, "second": 2.0}}}, "id": 1}"""
)))
nested.printSchema
// root
// |-- foobar: struct (nullable = true)
// | |-- foo: struct (nullable = true)
// | | |-- bar: struct (nullable = true)
// | | | |-- first: double (nullable = true)
// | | | |-- second: double (nullable = true)
// |-- id: long (nullable = true)
#transient val foobarRenamed = struct(
struct(
struct(
$"foobar.foo.bar.first".as("x"), $"foobar.foo.bar.first".as("y")
).alias("point")
).alias("location")
).alias("record")
nested.select(foobarRenamed, $"id").printSchema
// root
// |-- record: struct (nullable = false)
// | |-- location: struct (nullable = false)
// | | |-- point: struct (nullable = false)
// | | | |-- x: double (nullable = true)
// | | | |-- y: double (nullable = true)
// |-- id: long (nullable = true)
Note that it may affect nullability metadata. Another possibility is to rename by casting:
nested.select($"foobar".cast(
"struct<location:struct<point:struct<x:double,y:double>>>"
).alias("record")).printSchema
// root
// |-- record: struct (nullable = true)
// | |-- location: struct (nullable = true)
// | | |-- point: struct (nullable = true)
// | | | |-- x: double (nullable = true)
// | | | |-- y: double (nullable = true)
or:
import org.apache.spark.sql.types._
nested.select($"foobar".cast(
StructType(Seq(
StructField("location", StructType(Seq(
StructField("point", StructType(Seq(
StructField("x", DoubleType), StructField("y", DoubleType)))))))))
).alias("record")).printSchema
// root
// |-- record: struct (nullable = true)
// | |-- location: struct (nullable = true)
// | | |-- point: struct (nullable = true)
// | | | |-- x: double (nullable = true)
// | | | |-- y: double (nullable = true)
For those of you interested in PySpark version (actually it's same in Scala - see comment below) :
merchants_df_renamed = merchants_df.toDF(
'merchant_id', 'category', 'subcategory', 'merchant')
merchants_df_renamed.printSchema()
Result:
root
|-- merchant_id: integer (nullable = true)
|-- category: string (nullable = true)
|-- subcategory: string (nullable = true)
|-- merchant: string (nullable = true)
def aliasAllColumns(t: DataFrame, p: String = "", s: String = ""): DataFrame =
{
t.select( t.columns.map { c => t.col(c).as( p + c + s) } : _* )
}
In case is isn't obvious, this adds a prefix and a suffix to each of the current column names. This can be useful when you have two tables with one or more columns having the same name, and you wish to join them but still be able to disambiguate the columns in the resultant table. It sure would be nice if there were a similar way to do this in "normal" SQL.
Suppose the dataframe df has 3 columns id1, name1, price1
and you wish to rename them to id2, name2, price2
val list = List("id2", "name2", "price2")
import spark.implicits._
val df2 = df.toDF(list:_*)
df2.columns.foreach(println)
I found this approach useful in many cases.
Sometime we have the column name is below format in SQLServer or MySQL table
Ex : Account Number,customer number
But Hive tables do not support column name containing spaces, so please use below solution to rename your old column names.
Solution:
val renamedColumns = df.columns.map(c => df(c).as(c.replaceAll(" ", "_").toLowerCase()))
df = df.select(renamedColumns: _*)
tow table join not rename the joined key
// method 1: create a new DF
day1 = day1.toDF(day1.columns.map(x => if (x.equals(key)) x else s"${x}_d1"): _*)
// method 2: use withColumnRenamed
for ((x, y) <- day1.columns.filter(!_.equals(key)).map(x => (x, s"${x}_d1"))) {
day1 = day1.withColumnRenamed(x, y)
}
works!

Nested JSON in Spark

I have the following JSON loaded as a DataFrame:
root
|-- data: struct (nullable = true)
| |-- field1: string (nullable = true)
| |-- field2: string (nullable = true)
|-- moreData: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- more1: string (nullable = true)
| | |-- more2: string (nullable = true)
| | |-- more3: string (nullable = true)
I want to get the following RDD from this DataFrame:
RDD[(more1, more2, more3, field1, field2)]
How can I do this? I think I have to use flatMap for the nested JSON?
A combination of explode and dot syntax should do the trick:
import org.apache.spark.sql.functions.explode
case class Data(field1: String, field2: String)
case class MoreData(more1: String, more2: String, more3: String)
val df = sc.parallelize(Seq(
(Data("foo", "bar"), Array(MoreData("a", "b", "c"), MoreData("d", "e", "f")))
)).toDF("data", "moreData")
df.printSchema
// root
// |-- data: struct (nullable = true)
// | |-- field1: string (nullable = true)
// | |-- field2: string (nullable = true)
// |-- moreData: array (nullable = true)
// | |-- element: struct (containsNull = true)
// | | |-- more1: string (nullable = true)
// | | |-- more2: string (nullable = true)
// | | |-- more3: string (nullable = true)
val columns = Seq(
$"moreData.more1", $"moreData.more2", $"moreData.more3",
$"data.field1", $"data.field2")
val aRDD = df.withColumn("moreData", explode($"moreData"))
.select(columns: _*)
.rdd
aRDD.collect
// Array[org.apache.spark.sql.Row] = Array([a,b,c,foo,bar], [d,e,f,foo,bar])
Depending on your requirements you can follow this with map to extract values from the rows:
import org.apache.spark.sql.Row
aRDD.map{case Row(m1: String, m2: String, m3: String, f1: String, f2: String) =>
(m1, m2, m3, f1, f2)}
See also Querying Spark SQL DataFrame with complex types