I am trying to create a dataframe that returns an empty array for a nested struct type if another column is false. I created a dummy dataframe to illustrate my problem.
import spark.implicits._
import org.apache.spark.sql.functions._
val newDf = spark.createDataFrame(Seq(
("user1","true", Some(8), Some("usd"), Some("tx1")),
("user1", "true", Some(9), Some("usd"), Some("tx2")),
("user2", "false", None, None, None))).toDF("userId","flag", "amount", "currency", "transactionId")
val amountStruct = struct("amount"
,"currency").alias("amount")
val transactionStruct = struct("transactionId"
, "amount").alias("transactions")
val dataStruct = struct("flag","transactions").alias("data")
val finalDf = newDf.
withColumn("amount", amountStruct).
withColumn("transactions", transactionStruct).
select("userId", "flag","transactions").
groupBy("userId", "flag").
agg(collect_list("transactions").alias("transactions")).
withColumn("data", dataStruct).
drop("transactions","flag")
This is the output:
+------+--------------------+
|userId| data|
+------+--------------------+
| user2| [false, [[, [,]]]]|
| user1|[true, [[tx1, [8,...|
+------+--------------------+
and schema:
root
|-- userId: string (nullable = true)
|-- data: struct (nullable = false)
| |-- flag: string (nullable = true)
| |-- transactions: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- transactionId: string (nullable = true)
| | | |-- amount: struct (nullable = false)
| | | | |-- amount: integer (nullable = true)
| | | | |-- currency: string (nullable = true)
The output I want:
+------+--------------------+
|userId| data|
+------+--------------------+
| user2| [false, []] |
| user1|[true, [[tx1, [8,...|
+------+--------------------+
I've tried doing the following before the collect_list, but had no luck:
import org.apache.spark.sql.functions.typedLit
val emptyArray = typedLit(Array.empty[(String, Array[(Int, String)])])
testDf.withColumn("transactions", when($"flag" === "false", emptyArray).otherwise($"transactions")).show()
You were moments from victory. The approach with collect_list is the way to go; it just needs a little nudge.
TL;DR Solution
val newDf = spark
.createDataFrame(
Seq(
("user1", "true", Some(8), Some("usd"), Some("tx1")),
("user1", "true", Some(9), Some("usd"), Some("tx2")),
("user2", "false", None, None, None)
)
)
.toDF("userId", "flag", "amount", "currency", "transactionId")
val dataStruct = struct("flag", "transactions")
val finalDf2 = newDf
.groupBy("userId", "flag")
.agg(
collect_list(
when(
$"transactionId".isNotNull && $"amount".isNotNull && $"currency".isNotNull,
struct(
$"transactionId",
struct($"amount", $"currency").alias("amount")
)
)).alias("transactions"))
.withColumn("data", dataStruct)
.drop("transactions", "flag")
Explanation
SQL Aggregate Function Behavior
First of all, when it comes to behavior, Spark follows SQL conventions. All SQL aggregate functions (and collect_list is an aggregate function) ignore NULLs in their input as if they were never there.
Let's take a look at how collect_list behaves:
Seq(
("a", Some(1)),
("a", Option.empty[Int]),
("a", Some(3)),
("b", Some(10)),
("b", Some(20)),
("b", Option.empty[Int])
)
.toDF("col1", "col2")
.groupBy($"col1")
.agg(collect_list($"col2") as "col2_list")
.show()
And the result is:
+----+---------+
|col1|col2_list|
+----+---------+
| b| [10, 20]|
| a| [1, 3]|
+----+---------+
Tracking Down Nullability
It looks like collect_list behaves properly. So the reason you are seeing those blanks in your output is that the column passed to collect_list is not nullable.
To prove it, let's examine the schema of the DataFrame just before it gets aggregated:
newDf
.withColumn("amount", amountStruct)
.withColumn("transactions", transactionStruct)
.printSchema()
root
|-- userId: string (nullable = true)
|-- flag: string (nullable = true)
|-- amount: struct (nullable = false)
| |-- amount: integer (nullable = true)
| |-- currency: string (nullable = true)
|-- currency: string (nullable = true)
|-- transactionId: string (nullable = true)
|-- transactions: struct (nullable = false)
| |-- transactionId: string (nullable = true)
| |-- amount: struct (nullable = false)
| | |-- amount: integer (nullable = true)
| | |-- currency: string (nullable = true)
Note the transactions: struct (nullable = false) part. It confirms the suspicion.
If we translate all the nested nullability to Scala, here's what you have:
case class Row(
transactions: Transactions,
// other fields
)
case class Transactions(
transactionId: Option[String],
amount: Option[Amount],
)
case class Amount(
amount: Option[Int],
currency: Option[String]
)
And here's what you want instead:
case class Row(
transactions: Option[Transactions], // this is optional now
// other fields
)
case class Transactions(
transactionId: String, // while this is not optional
amount: Amount, // neither is this
)
case class Amount(
amount: Int, // neither is this
currency: String // neither is this
)
Fixing the Nullability
Now the last step is simple. To make the column that is the input to collect_list "properly" nullable, you have to check the nullability of the amount, currency, and transactionId columns.
The result will be NOT NULL if and only if all the input columns are NOT NULL.
You can use the same when API method to construct the result. The otherwise clause, if omitted, implicitly returns NULL, which is exactly what you need.
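If you want to see the nullability flip on its own, here is a quick check against the sample newDf (not part of the final pipeline), comparing the plain struct with the when-guarded one:
newDf.select(
  struct($"transactionId", struct($"amount", $"currency").alias("amount")).alias("plain"),
  when(
    $"transactionId".isNotNull && $"amount".isNotNull && $"currency".isNotNull,
    struct($"transactionId", struct($"amount", $"currency").alias("amount"))
  ).alias("guarded")
).printSchema()
root
|-- plain: struct (nullable = false)
| |-- transactionId: string (nullable = true)
| |-- amount: struct (nullable = false)
| | |-- amount: integer (nullable = true)
| | |-- currency: string (nullable = true)
|-- guarded: struct (nullable = true)
| |-- transactionId: string (nullable = true)
| |-- amount: struct (nullable = false)
| | |-- amount: integer (nullable = true)
| | |-- currency: string (nullable = true)
With the guard in place, the expression evaluates to NULL on user2's row, and collect_list simply skips it, leaving an empty array.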
Related
DF1 - flat dataframe with data
+---------+--------+-------+
|FirstName|LastName| Device|
+---------+--------+-------+
| Robert|Williams|android|
| Maria|Sharpova| iphone|
+---------+--------+-------+
root
|-- FirstName: string (nullable = true)
|-- LastName: string (nullable = true)
|-- Device: string (nullable = true)
DF2 - empty dataframe with same column names
+------+----+
|header|body|
+------+----+
+------+----+
root
|-- header: struct (nullable = true)
| |-- FirstName: string (nullable = true)
| |-- LastName: string (nullable = true)
|-- body: struct (nullable = true)
| |-- Device: string (nullable = true)
DF2 schema Code:
import org.apache.spark.sql.types._
val schema = StructType(Array(
StructField("header", StructType(Array(
StructField("FirstName", StringType),
StructField("LastName", StringType)))),
StructField("body", StructType(Array(
StructField("Device", StringType))))
))
DF2 with data from DF1 would be the final output.
I need to do this for multiple columns in a complex schema and make it configurable, and I have to do this without using case classes.
APPROACH #1 - use schema.fields.map to map DF1 -> DF2?
APPROACH #2 - create a new DF and define data and schema?
APPROACH #3 - use zip and map transformations to define a 'select col as col' query... I don't know if this would work for a nested (StructType) schema
How would I go on about doing that?
import spark.implicits._
import org.apache.spark.sql.functions._
val sourceDF = Seq(
("Robert", "Williams", "android"),
("Maria", "Sharpova", "iphone")
).toDF("FirstName", "LastName", "Device")
val resDF = sourceDF
.withColumn("header", struct('FirstName, 'LastName))
.withColumn("body", struct(col("Device")))
.select('header, 'body)
resDF.printSchema
// root
// |-- header: struct (nullable = false)
// | |-- FirstName: string (nullable = true)
// | |-- LastName: string (nullable = true)
// |-- body: struct (nullable = false)
// | |-- Device: string (nullable = true)
resDF.show(false)
// +------------------+---------+
// |header |body |
// +------------------+---------+
// |[Robert, Williams]|[android]|
// |[Maria, Sharpova] |[iphone] |
// +------------------+---------+
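To cover the configurable part of the question (APPROACH #1), the same construction can be driven by the target schema instead of hard-coded column lists. This is only a sketch, and it assumes every top-level field of the target schema is a struct whose leaf names match columns that already exist in the flat sourceDF:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{StructField, StructType}
def nestByTargetSchema(flatDf: DataFrame, targetSchema: StructType): DataFrame = {
  // build one nested column per top-level struct field, from its leaf field names
  val nestedCols = targetSchema.fields.map {
    case StructField(name, inner: StructType, _, _) =>
      struct(inner.fieldNames.map(col): _*).alias(name)
  }
  flatDf.select(nestedCols: _*)
}
// nestByTargetSchema(sourceDF, schema) should reproduce resDF,
// except that the created structs come out as nullable = false.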
I am trying to convert the data type of some columns based on a case class.
val simpleDf = Seq(("James",34,"2006-01-01","true","M",3000.60),
("Michael",33,"1980-01-10","true","F",3300.80),
("Robert",37,"1995-01-05","false","M",5000.50)
).toDF("firstName","age","jobStartDate","isGraduated","gender","salary")
// Output
simpleDf.printSchema()
root
|-- firstName: string (nullable = true)
|-- age: integer (nullable = false)
|-- jobStartDate: string (nullable = true)
|-- isGraduated: string (nullable = true)
|-- gender: string (nullable = true)
|-- salary: double (nullable = false)
Here I want to change the data type of jobStartDate to Timestamp and isGraduated to Boolean. Is that conversion possible using the case class?
I am aware this can be done by casting each column, but in my case I need to map the incoming DF based on a defined case class.
case class empModel(firstName:String,
age:Integer,
jobStartDate:java.sql.Timestamp,
isGraduated:Boolean,
gender:String,
salary:Double
)
val newDf = simpleDf.as[empModel].toDF
newDf.show(false)
I am getting errors because of the string to timestamp conversion. Is there any workaround?
You can generate the schema from the case class using ScalaReflection:
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.catalyst.ScalaReflection
val schema = ScalaReflection.schemaFor[empModel].dataType.asInstanceOf[StructType]
Now you can pass this schema when you load your files into a DataFrame.
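For example (the path and reader options below are placeholders, not from the question):
val typedDf = spark.read
  .schema(schema)                          // schema reflected from empModel
  .option("timestampFormat", "yyyy-MM-dd") // only needed if dates arrive as plain yyyy-MM-dd strings
  .csv("/path/to/employees.csv")           // hypothetical path
typedDf.printSchema()
// jobStartDate is read as timestamp and isGraduated as boolean, matching empModel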
Or, if you prefer to cast some or all columns after you read the DataFrame, you can iterate over the schema fields and cast each one to the corresponding data type, for example by using foldLeft:
val df = schema.fields.foldLeft(simpleDf){
(df, s) => df.withColumn(s.name, df(s.name).cast(s.dataType))
}
df.printSchema
//root
// |-- firstName: string (nullable = true)
// |-- age: integer (nullable = true)
// |-- jobStartDate: timestamp (nullable = true)
// |-- isGraduated: boolean (nullable = false)
// |-- gender: string (nullable = true)
// |-- salary: double (nullable = false)
df.show
//+---------+---+-------------------+-----------+------+------+
//|firstName|age| jobStartDate|isGraduated|gender|salary|
//+---------+---+-------------------+-----------+------+------+
//| James| 34|2006-01-01 00:00:00| true| M|3000.6|
//| Michael| 33|1980-01-10 00:00:00| true| F|3300.8|
//| Robert| 37|1995-01-05 00:00:00| false| M|5000.5|
//+---------+---+-------------------+-----------+------+------+
I have a dataframe with the following schema:
root
|-- id: string (nullable = true)
|-- collect_list(typeCounts): array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: struct (containsNull = true)
| | | |-- type: string (nullable = true)
| | | |-- count: long (nullable = false)
Example data:
+-----------+----------------------------------------------------------------------------+
|id |collect_list(typeCounts) |
+-----------+----------------------------------------------------------------------------+
|1 |[WrappedArray([B00XGS,6], [B001FY,5]), WrappedArray([B06LJ7,4])]|
|2 |[WrappedArray([B00UFY,3])] |
+-----------+----------------------------------------------------------------------------+
How can I flatten collect_list(typeCounts) to a flat array of structs in Scala? I have read some answers on Stack Overflow for similar questions suggesting UDFs, but I am not sure what the UDF method signature should be for structs.
If you're on Spark 2.4+, instead of using a UDF (which is generally less efficient than native Spark functions) you can apply flatten, like below:
df.withColumn("collect_list(typeCounts)", flatten($"collect_list(typeCounts)"))
I am not sure what the UDF method signature should be for structs
A UDF takes structs as Rows on input and may return them as Scala case classes. To flatten the nested collections, you can create a simple UDF as follows:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
case class TC(`type`: String, count: Long)
val flattenLists = udf{ (lists: Seq[Seq[Row]]) =>
lists.flatMap( _.map{ case Row(t: String, c: Long) => TC(t, c) } )
}
To test out the UDF, let's assemble a DataFrame with your described schema:
val df = Seq(
("1", Seq(TC("B00XGS", 6), TC("B001FY", 5))),
("1", Seq(TC("B06LJ7", 4))),
("2", Seq(TC("B00UFY", 3)))
).toDF("id", "typeCounts").
groupBy("id").agg(collect_list("typeCounts"))
df.printSchema
// root
// |-- id: string (nullable = true)
// |-- collect_list(typeCounts): array (nullable = true)
// | |-- element: array (containsNull = true)
// | | |-- element: struct (containsNull = true)
// | | | |-- type: string (nullable = true)
// | | | |-- count: long (nullable = false)
Applying the UDF:
df.
withColumn("collect_list(typeCounts)", flattenLists($"collect_list(typeCounts)")).
printSchema
// root
// |-- id: string (nullable = true)
// |-- collect_list(typeCounts): array (nullable = true)
// | |-- element: struct (containsNull = true)
// | | |-- type: string (nullable = true)
// | | |-- count: long (nullable = false)
I have a dataframe:
+--------------------+------+
|people |person|
+--------------------+------+
|[[jack, jill, hero]]|joker |
+--------------------+------+
Its schema:
root
|-- people: struct (nullable = true)
| |-- person: array (nullable = true)
| | |-- element: string (containsNull = true)
|-- person: string (nullable = true)
Here, root--person is a string, so I can update this field using a UDF:
def updateString = udf((s: String) => {
"Mr. " + s
})
df.withColumn("person", updateString(col("person"))).select("person").show(false)
output:
+---------+
|person |
+---------+
|Mr. joker|
+---------+
I want to do the same operation on the root--people--person column, which contains an array of person. How can I achieve this using a UDF?
def updateArray = udf((arr: Seq[Row]) => ???
df.withColumn("people", updateArray(col("people.person"))).select("people").show(false)
expected:
+------------------------------+
|people |
+------------------------------+
|[Mr. hero, Mr. jack, Mr. jill]|
+------------------------------+
Edit: I also want to preserve its schema after updating root--people--person.
Expected schema of people:
df.select("people").printSchema()
root
|-- people: struct (nullable = false)
| |-- person: array (nullable = true)
| | |-- element: string (containsNull = true)
Thanks,
The problem here is that people is a struct with only one field. In your UDF, you need to return a Tuple1 and then cast the output of your UDF to keep the field names correct:
def updateArray = udf((r: Row) => Tuple1(r.getAs[Seq[String]](0).map(x=>"Mr."+x)))
val newDF = df
.withColumn("people",updateArray($"people").cast("struct<person:array<string>>"))
newDF.printSchema()
newDF.show()
gives
root
|-- people: struct (nullable = true)
| |-- person: array (nullable = true)
| | |-- element: string (containsNull = true)
|-- person: string (nullable = true)
+--------------------+------+
| people|person|
+--------------------+------+
|[[Mr.jack, Mr.jil...| joker|
+--------------------+------+
For this you just need to update your function; everything else remains the same.
Here is the code snippet.
scala> df2.show
+------+------------------+
|people| person|
+------+------------------+
| joker|[jack, jill, hero]|
+------+------------------+
// just the order is changed
I just updated your function: instead of using Row, I am using Seq[String] here.
scala> def updateArray = udf((arr: Seq[String]) => arr.map(x=>"Mr."+x))
scala> df2.withColumn("test",updateArray($"person")).show(false)
+------+------------------+---------------------------+
|people|person |test |
+------+------------------+---------------------------+
|joker |[jack, jill, hero]|[Mr.jack, Mr.jill, Mr.hero]|
+------+------------------+---------------------------+
// kept all the columns for testing purposes; you can drop the ones you don't want
Let me know if you want to know more about this.
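If you also need to keep the people struct from the original question (the Edit about preserving the schema), the same Seq[String] version of the function can be wrapped back into a struct. This is a sketch against the question's df, not df2:
df.withColumn("people", struct(updateArray(col("people.person")).alias("person")))
  .select("people")
  .printSchema()
// root
// |-- people: struct (nullable = false)
// | |-- person: array (nullable = true)
// | | |-- element: string (containsNull = true)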
Let's create data for testing
scala> val data = Seq((List(Array("ja", "ji", "he")), "person")).toDF("people", "person")
data: org.apache.spark.sql.DataFrame = [people: array<array<string>>, person: string]
scala> data.printSchema
root
|-- people: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
|-- person: string (nullable = true)
Create a UDF for our requirements:
scala> def arrayConcat(array:Seq[Seq[String]], str: String) = array.map(_.map(str + _))
arrayConcat: (array: Seq[Seq[String]], str: String)Seq[Seq[String]]
scala> val arrayConcatUDF = udf(arrayConcat _)
arrayConcatUDF: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function2>,ArrayType(ArrayType(StringType,true),true),Some(List(ArrayType(ArrayType(StringType,true),true), StringType)))
Applying the UDF:
scala> data.withColumn("dasd", arrayConcatUDF($"people", lit("Mr."))).show(false)
+--------------------------+------+-----------------------------------+
|people |person|dasd |
+--------------------------+------+-----------------------------------+
|[WrappedArray(ja, ji, he)]|person|[WrappedArray(Mr.ja, Mr.ji, Mr.he)]|
+--------------------------+------+-----------------------------------+
You may need to tweak it a bit (though I think hardly any tweaking is required), but this contains most of what you need to solve your problem.
Within the JSON objects I am attempting to process, I am being given a nested StructType where each key represents a specific location, which then contains a currency and price:
root
|-- id: string (nullable = true)
|-- pricingByCountry: struct (nullable = true)
| |-- regionPrices: struct (nullable = true)
| | |-- AT: struct (nullable = true)
| | | |-- currency: string (nullable = true)
| | | |-- price: double (nullable = true)
| | |-- BT: struct (nullable = true)
| | | |-- currency: string (nullable = true)
| | | |-- price: double (nullable = true)
| | |-- CL: struct (nullable = true)
| | | |-- currency: string (nullable = true)
| | | |-- price: double (nullable = true)
...etc.
and I'd like to explode it so that rather than having a column per country, I can have a row for each country:
+---+--------+---------+------+
| id| country| currency| price|
+---+--------+---------+------+
| 0| AT| EUR| 100|
| 0| BT| NGU| 400|
| 0| CL| PES| 200|
+---+--------+---------+------+
These solutions make sense intuitively: Spark DataFrame exploding a map with the key as a member and Spark scala - Nested StructType conversion to Map, but unfortunately they don't work here because I'm passing in a column and not a whole row to be mapped. I don't want to manually map the whole row, just a specific column that contains nested structs. There are several other attributes at the same level as "id" that I'd like to keep in the structure.
I think it can be done as below:
// JSON test data
val ds = Seq("""{"id":"abcd","pricingByCountry":{"regionPrices":{"AT":{"currency":"EUR","price":100.00},"BT":{"currency":"NGE","price":200.00},"CL":{"currency":"PES","price":300.00}}}}""").toDS
val df = spark.read.json(ds)
// Schema to map the UDF output
import org.apache.spark.sql.types._
val outputSchema = ArrayType(StructType(Seq(
StructField("country", StringType, false),
StructField("currency", StringType, false),
StructField("price", DoubleType, false)
)))
// UDF takes value of `regionPrices` json string and converts
// it to Array of tuple(country, currency, price)
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
val toMap = udf((jsonString: String) => {
import com.fasterxml.jackson.databind._
import com.fasterxml.jackson.module.scala.DefaultScalaModule
val jsonMapper = new ObjectMapper()
jsonMapper.registerModule(DefaultScalaModule)
val jsonMap = jsonMapper.readValue(jsonString, classOf[Map[String, Map[String, Double]]])
jsonMap.map(f => (f._1, f._2("currency"), f._2("price"))).toSeq
}, outputSchema)
val result = df.
select(col("id").as("id"), explode(toMap(to_json(col("pricingByCountry.regionPrices")))).as("temp")).
select(col("id"), col("temp.country").as("country"), col("temp.currency").as("currency"), col("temp.price").as("price"))
Output will be:
scala> result.show
+----+-------+--------+-----+
| id|country|currency|price|
+----+-------+--------+-----+
|abcd| AT| EUR|100.0|
|abcd| BT| NGE|200.0|
|abcd| CL| PES|300.0|
+----+-------+--------+-----+
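If you would rather avoid the round trip through JSON and the UDF entirely, a schema-driven alternative (just a sketch, relying on the country codes discovered from the struct at runtime) is to build one struct per country and explode the array of them:
val countries = df.select("pricingByCountry.regionPrices.*").columns  // Array(AT, BT, CL)
val noUdf = df.select(
    col("id"),
    explode(array(countries.map(c => struct(
      lit(c).alias("country"),
      col(s"pricingByCountry.regionPrices.$c.currency").alias("currency"),
      col(s"pricingByCountry.regionPrices.$c.price").alias("price")
    )): _*)).alias("temp")
  ).select(col("id"), col("temp.country"), col("temp.currency"), col("temp.price"))
// noUdf.show should return the same rows as result above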