I have a dataframe:
+--------------------+------+
|people |person|
+--------------------+------+
|[[jack, jill, hero]]|joker |
+--------------------+------+
It's schema:
root
|-- people: struct (nullable = true)
| |-- person: array (nullable = true)
| | |-- element: string (containsNull = true)
|-- person: string (nullable = true)
Here, root--person is a string. So, I can update this field using udf as:
def updateString = udf((s: String) => {
"Mr. " + s
})
df.withColumn("person", updateString(col("person"))).select("person").show(false)
output:
+---------+
|person |
+---------+
|Mr. joker|
+---------+
I want to do same operation on root--people--person column which contains array of person. How to achieve this using udf?
def updateArray = udf((arr: Seq[Row]) => ???
df.withColumn("people", updateArray(col("people.person"))).select("people").show(false)
expected:
+------------------------------+
|people |
+------------------------------+
|[Mr. hero, Mr. jack, Mr. jill]|
+------------------------------+
Edit: I also want to preserve its schema after updating root--people--person.
Expected schema of people:
df.select("people").printSchema()
root
|-- people: struct (nullable = false)
| |-- person: array (nullable = true)
| | |-- element: string (containsNull = true)
Thanks,
The problem here is that people is s struct with only 1 field. In your UDF, you need to return Tuple1 and then further cast the output of your UDF to keep the names correct:
def updateArray = udf((r: Row) => Tuple1(r.getAs[Seq[String]](0).map(x=>"Mr."+x)))
val newDF = df
.withColumn("people",updateArray($"people").cast("struct<person:array<string>>"))
newDF.printSchema()
newDF.show()
gives
root
|-- people: struct (nullable = true)
| |-- person: array (nullable = true)
| | |-- element: string (containsNull = true)
|-- person: string (nullable = true)
+--------------------+------+
| people|person|
+--------------------+------+
|[[Mr.jack, Mr.jil...| joker|
+--------------------+------+
for you just need to update your function and everything remains the same.
here is the code snippet.
scala> df2.show
+------+------------------+
|people| person|
+------+------------------+
| joker|[jack, jill, hero]|
+------+------------------+
//jus order is changed
I just updated your function instead of using Row I am using here Seq[String]
scala> def updateArray = udf((arr: Seq[String]) => arr.map(x=>"Mr."+x))
scala> df2.withColumn("test",updateArray($"person")).show(false)
+------+------------------+---------------------------+
|people|person |test |
+------+------------------+---------------------------+
|joker |[jack, jill, hero]|[Mr.jack, Mr.jill, Mr.hero]|
+------+------------------+---------------------------+
//keep all the column for testing purpose you could drop if you dont want.
let me know if you want to know more about same.
Let's create data for testing
scala> val data = Seq((List(Array("ja", "ji", "he")), "person")).toDF("people", "person")
data: org.apache.spark.sql.DataFrame = [people: array<array<string>>, person: string]
scala> data.printSchema
root
|-- people: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
|-- person: string (nullable = true)
create UDF for our requirements
scala> def arrayConcat(array:Seq[Seq[String]], str: String) = array.map(_.map(str + _))
arrayConcat: (array: Seq[Seq[String]], str: String)Seq[Seq[String]]
scala> val arrayConcatUDF = udf(arrayConcat _)
arrayConcatUDF: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function2>,ArrayType(ArrayType(StringType,true),true),Some(List(ArrayType(ArrayType(StringType,true),true), StringType)))
Applying the udf
scala> data.withColumn("dasd", arrayConcatUDF($"people", lit("Mr."))).show(false)
+--------------------------+------+-----------------------------------+
|people |person|dasd |
+--------------------------+------+-----------------------------------+
|[WrappedArray(ja, ji, he)]|person|[WrappedArray(Mr.ja, Mr.ji, Mr.he)]|
+--------------------------+------+-----------------------------------+
You may need to tweak a bit(I think any tweak is hardly required) but this contains the most of it to solve your problem
Related
I am trying to create a Scala UDF for Spark, that can be used in Spark SQL. The objective of the function is to accept any column type as input, and put it in an ArrayType, unless the input is already an ArrayType.
Here's the code I have so far:
package com.latitudefinancial.spark.udf
import org.apache.spark.sql.api.java.UDF1
import org.apache.spark.sql.types._
import org.apache.spark.sql.SparkSession
class GetDatatype extends UDF1[Object, scala.collection.Seq[_]] {
override def call(inputObject: Object): scala.collection.Seq[_] = {
if (inputObject.isInstanceOf[scala.collection.Seq[_]]) {
return inputObject.asInstanceOf[scala.collection.Seq[_]]
} else {
return Array(inputObject)
}
}
}
val myFunc = new GetDatatype().call _
val myFuncUDF = udf(myFunc)
spark.udf.register("myFuncUDF", myFuncUDF)
The data may look like this:
+-----------+-----------+--------------------------------------------------------------+--------+-------------------------------+
|create_date|item |datatype_of_item |item2 |datatype_of_item2 |
+-----------+-----------+--------------------------------------------------------------+--------+-------------------------------+
|2021-06-01 |[item 3, 3]|org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema|string 3|java.lang.String |
+-----------+-----------+--------------------------------------------------------------+--------+-------------------------------+
or this:
+-----------+--------------------------+-------------------------------------------+--------------------+-------------------------------------------+
|create_date|item |datatype_of_item |item2 |datatype_of_item_2 |
+-----------+--------------------------+-------------------------------------------+--------------------+-------------------------------------------+
|2021-05-01 |[[item 1, 1], [item 2, 2]]|scala.collection.mutable.WrappedArray$ofRef|[string 1, string 2]|scala.collection.mutable.WrappedArray$ofRef|
|2021-06-01 |[[item 3, 3]] |scala.collection.mutable.WrappedArray$ofRef|[string 3] |scala.collection.mutable.WrappedArray$ofRef|
+-----------+--------------------------+-------------------------------------------+--------------------+-------------------------------------------+
The UDF function may be passed contents from item or item2 columns.
However when executing this line:
val myFuncUDF = udf(myFunc)
I get the following error:
scala> val myFuncUDF = udf(myFunc)
java.lang.UnsupportedOperationException: Schema for type Any is not supported
at org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$schemaFor$1(ScalaReflection.scala:743)
Spark cannot use UDFs with this return type (Any, oder Object). You could do it without UDF I think:
val df = Seq(
(Seq((1,"a"),(2,"b")),(1,"a"))
).toDF("item","item 2")
def wrapInArray(df:DataFrame,c:String) = if(df.schema(c).dataType.isInstanceOf[ArrayType]) col(c) else array(col(c))
df
.withColumn("test",wrapInArray(df,"item"))
.withColumn("test 2",wrapInArray(df,"item 2"))
gives the schema
root
|-- item: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _1: integer (nullable = false)
| | |-- _2: string (nullable = true)
|-- item 2: struct (nullable = true)
| |-- _1: integer (nullable = false)
| |-- _2: string (nullable = true)
|-- test: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _1: integer (nullable = false)
| | |-- _2: string (nullable = true)
|-- test 2: array (nullable = false)
| |-- element: struct (containsNull = true)
| | |-- _1: integer (nullable = false)
| | |-- _2: string (nullable = true)
I have a dataframe with the following schema:
root
|-- id: string (nullable = true)
|-- collect_list(typeCounts): array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: struct (containsNull = true)
| | | |-- type: string (nullable = true)
| | | |-- count: long (nullable = false)
Example data:
+-----------+----------------------------------------------------------------------------+
|id |collect_list(typeCounts) |
+-----------+----------------------------------------------------------------------------+
|1 |[WrappedArray([B00XGS,6], [B001FY,5]), WrappedArray([B06LJ7,4])]|
|2 |[WrappedArray([B00UFY,3])] |
+-----------+----------------------------------------------------------------------------+
How can I flatten collect_list(typeCounts) to a flat array of structs in scala? I have read some answers on stackoverflow for similar questions suggesting UDF's, but I am not sure what the UDF method signature should be for structs.
If you're on Spark 2.4+, instead of using a UDF (which is generally less efficient than native Spark functions) you can apply flatten, like below:
df.withColumn("collect_list(typeCounts)", flatten($"collect_list(typeCounts)"))
i am not sure what the udf method signature should be for structs
UDF takes structs as Rows for input and may return them as Scala case classes. To flatten the nested collections, you can create a simple UDF as follows:
import org.apache.spark.sql.Row
case class TC(`type`: String, count: Long)
val flattenLists = udf{ (lists: Seq[Seq[Row]]) =>
lists.flatMap( _.map{ case Row(t: String, c: Long) => TC(t, c) } )
}
To test out the UDF, let's assemble a DataFrame with your described schema:
val df = Seq(
("1", Seq(TC("B00XGS", 6), TC("B001FY", 5))),
("1", Seq(TC("B06LJ7", 4))),
("2", Seq(TC("B00UFY", 3)))
).toDF("id", "typeCounts").
groupBy("id").agg(collect_list("typeCounts"))
df.printSchema
// root
// |-- id: string (nullable = true)
// |-- collect_list(typeCounts): array (nullable = true)
// | |-- element: array (containsNull = true)
// | | |-- element: struct (containsNull = true)
// | | | |-- type: string (nullable = true)
// | | | |-- count: long (nullable = false)
Applying the UDF:
df.
withColumn("collect_list(typeCounts)", flattenLists($"collect_list(typeCounts)")).
printSchema
// root
// |-- id: string (nullable = true)
// |-- collect_list(typeCounts): array (nullable = true)
// | |-- element: struct (containsNull = true)
// | | |-- type: string (nullable = true)
// | | |-- count: long (nullable = false)
I have a column in a dataframe that is an array [always of a single item], that looks like this:
root
|-- emdaNo: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- _value: string (nullable = true)
| | |-- id: string (nullable = true)
I can't for the life of me work out how to get the _value from it, in to a string...
Assuming x is the dataframe, i've tried:
x.select($"arrayName._value") // Yields ["myStringHere"]
and
x.select($"arrayName[0]._value") // Errors
How do i get a nice string of the value held in _value out please?
case class Element(_value: String, id: String)
val df = Seq(Array(Element("foo", "bar"))).toDF("emdaNo")
df.select(element_at($"emdaNo._value", 1) as "_value").show()
Output:
+------+
|_value|
+------+
| foo|
+------+
Alternatively (and before Spark 2.4)
df.select($"emdaNo._value"(0))
or
df.select($"emdaNo._value".getItem(0))
I have a dataframe with the following schema:
|-- A: map (nullable = true)
| |-- key: string
| |-- value: array (valueContainsNull = true)
| | |-- element: struct (containsNull = true)
| | | |-- uid: string (nullable = true)
| | | |-- price: double (nullable = true)
| | | |-- type: string (nullable = true)
|-- keyindex: string (nullable = true)
For example, if I have the following data:
{"A":{
"innerkey_1":[{"uid":"1","price":0.01,"recordtype":"STAT"},
{"uid":"6","price":4.3,"recordtype":"DYN"}],
"innerkey_2":[{"uid":"2","price":2.01,"recordtype":"DYN"},
{"uid":"4","price":6.1,"recordtype":"DYN"}]},
"innerkey_2"}
I use the following schema to read the data into a dataframe:
val schema = (new StructType().add("mainkey", MapType(StringType, new ArrayType(new StructType().add("uid",StringType).add("price",DoubleType).add("recordtype",StringType), true))).add("keyindex",StringType))
I am trying to figure out if I can use the keyindex to select values from the map. Since the keyindex in the example is "innerkey_2", I want the output to be
[{"uid":"2","price":2.01,"recordtype":"DYN"},
{"uid":"4","price":6.1,"recordtype":"DYN"}]
Thanks for your help!
getItem should do the trick:
scala> val df = Seq(("innerkey2", Map("innerkey2" -> Seq(("1", 0.01, "STAT"))))).toDF("keyindex", "A")
df: org.apache.spark.sql.DataFrame = [keyindex: string, A: map<string,array<struct<_1:string,_2:double,_3:string>>>]
scala> df.select($"A"($"keyindex")).show
+---------------+
| A[keyindex]|
+---------------+
|[[1,0.01,STAT]]|
+---------------+
I have a DataFrame that has multiple columns of which some of them are structs. Something like this
root
|-- foo: struct (nullable = true)
| |-- bar: string (nullable = true)
| |-- baz: string (nullable = true)
|-- abc: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- def: struct (nullable = true)
| | | |-- a: string (nullable = true)
| | | |-- b: integer (nullable = true)
| | | |-- c: string (nullable = true)
I want to apply a UserDefinedFunction on the column baz to replace baz with a function of baz, but I cannot figure out how to do that. Here is an example of the desired output (note that baz is now an int)
root
|-- foo: struct (nullable = true)
| |-- bar: string (nullable = true)
| |-- baz: int (nullable = true)
|-- abc: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- def: struct (nullable = true)
| | | |-- a: string (nullable = true)
| | | |-- b: integer (nullable = true)
| | | |-- c: string (nullable = true)
It looks like DataFrame.withColumn only works on top level columns but not on nested columns. I'm using Scala for this problem.
Can someone help me out with this?
Thanks
that's easy, just use a dot to select nested structures, e.g. $"foo.baz" :
case class Foo(bar:String,baz:String)
case class Record(foo:Foo)
val df = Seq(
Record(Foo("Hi","There"))
).toDF()
df.printSchema
root
|-- foo: struct (nullable = true)
| |-- bar: string (nullable = true)
| |-- baz: string (nullable = true)
val myUDF = udf((s:String) => {
// do something with s
s.toUpperCase
})
df
.withColumn("udfResult",myUDF($"foo.baz"))
.show
+----------+---------+
| foo|udfResult|
+----------+---------+
|[Hi,There]| THERE|
+----------+---------+
If you want to add the result of your UDF to the existing struct foo, i.e. to get:
root
|-- foo: struct (nullable = false)
| |-- bar: string (nullable = true)
| |-- baz: string (nullable = true)
| |-- udfResult: string (nullable = true)
there are two options:
with withColumn:
df
.withColumn("udfResult",myUDF($"foo.baz"))
.withColumn("foo",struct($"foo.*",$"udfResult"))
.drop($"udfResult")
with select:
df
.select(struct($"foo.*",myUDF($"foo.baz").as("udfResult")).as("foo"))
EDIT:
Replacing the existing attribute in the struct with the result from the UDF:
unfortunately, this does not work:
df
.withColumn("foo.baz",myUDF($"foo.baz"))
but can be done like this:
// get all columns except foo.baz
val structCols = df.select($"foo.*")
.columns
.filter(_!="baz")
.map(name => col("foo."+name))
df.withColumn(
"foo",
struct((structCols:+myUDF($"foo.baz").as("baz")):_*)
)
You can do this using the struct function as Raphael Roth has already been demonstrated in their answer above. There is an easier way to do this though using the Make Structs Easy* library. The library adds a withField method to the Column class allowing you to add/replace Columns inside a StructType column, in much the same way as the withColumn method on the DataFrame class allows you to add/replace columns inside a DataFrame. For your specific use-case, you could do something like this:
import org.apache.spark.sql.functions._
import com.github.fqaiser94.mse.methods._
// generate some fake data
case class Foo(bar: String, baz: String)
case class Record(foo: Foo, arrayOfFoo: Seq[Foo])
val df = Seq(
Record(Foo("Hello", "World"), Seq(Foo("Blue", "Red"), Foo("Green", "Yellow")))
).toDF
df.printSchema
// root
// |-- foo: struct (nullable = true)
// | |-- bar: string (nullable = true)
// | |-- baz: string (nullable = true)
// |-- arrayOfFoo: array (nullable = true)
// | |-- element: struct (containsNull = true)
// | | |-- bar: string (nullable = true)
// | | |-- baz: string (nullable = true)
df.show(false)
// +--------------+------------------------------+
// |foo |arrayOfFoo |
// +--------------+------------------------------+
// |[Hello, World]|[[Blue, Red], [Green, Yellow]]|
// +--------------+------------------------------+
// example user defined function that capitalizes a given string
val myUdf = udf((s: String) => s.toUpperCase)
// capitalize value of foo.baz
df.withColumn("foo", $"foo".withField("baz", myUdf($"foo.baz"))).show(false)
// +--------------+------------------------------+
// |foo |arrayOfFoo |
// +--------------+------------------------------+
// |[Hello, WORLD]|[[Blue, Red], [Green, Yellow]]|
// +--------------+------------------------------+
I noticed you had a follow-up question about replacing a Column nested inside a struct nested inside of an array.
This can also be done by combining the functions provided by the Make Structs Easy library with the functions provided by spark-hofs library, as follows:
import za.co.absa.spark.hofs._
// capitalize the value of foo.baz in each element of arrayOfFoo
df.withColumn("arrayOfFoo", transform($"arrayOfFoo", foo => foo.withField("baz", myUdf(foo.getField("baz"))))).show(false)
// +--------------+------------------------------+
// |foo |arrayOfFoo |
// +--------------+------------------------------+
// |[Hello, World]|[[Blue, RED], [Green, YELLOW]]|
// +--------------+------------------------------+
*Full disclosure: I am the author of the Make Structs Easy library that is referenced in this answer.