Nested struct keys exploded into column values - scala

Within the JSON objects I am attempting to process, I am being given a nested StructType where each key represents a specific location, which then contains a currency and price:
|-- id: string (nullable = true)
|-- pricingByCountry: struct (nullable = true)
| |-- regionPrices: struct (nullable = true)
| | |-- AT: struct (nullable = true)
| | | |-- currency: string (nullable = true)
| | | |-- price: double (nullable = true)
| | |-- BT: struct (nullable = true)
| | | |-- currency: string (nullable = true)
| | | |-- price: double (nullable = true)
| | |-- CL: struct (nullable = true)
| | | |-- currency: string (nullable = true)
| | | |-- price: double (nullable = true)
...etc.
I'd like to explode it so that, rather than having a column per country, I have a row for each country:
+---+--------+---------+------+
| id| country| currency| price|
+---+--------+---------+------+
| 0| AT| EUR| 100|
| 0| BT| NGU| 400|
| 0| CL| PES| 200|
+---+--------+---------+------+
These solutions make sense intuitively: "Spark DataFrame exploding a map with the key as a member" and "Spark scala - Nested StructType conversion to Map". Unfortunately, they don't work here because I'm passing in a column and not a whole row to be mapped. I don't want to manually map the whole row, just a specific column that contains nested structs; there are several other attributes at the same level as "id" that I'd like to keep in place.

I think it can be done as below:
// JSON test data
val ds = Seq("""{"id":"abcd","pricingByCountry":{"regionPrices":{"AT":{"currency":"EUR","price":100.00},"BT":{"currency":"NGE","price":200.00},"CL":{"currency":"PES","price":300.00}}}}""").toDS
val df = spark.read.json(ds)

// Schema of the UDF output
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val outputSchema = ArrayType(StructType(Seq(
  StructField("country", StringType, false),
  StructField("currency", StringType, false),
  StructField("price", DoubleType, false)
)))

// UDF takes the `regionPrices` struct serialized as a JSON string and converts
// it to an array of (country, currency, price) tuples
val toMap = udf((jsonString: String) => {
  import com.fasterxml.jackson.databind._
  import com.fasterxml.jackson.module.scala.DefaultScalaModule
  val jsonMapper = new ObjectMapper()
  jsonMapper.registerModule(DefaultScalaModule)
  val jsonMap = jsonMapper.readValue(jsonString, classOf[Map[String, Map[String, Double]]])
  jsonMap.map(f => (f._1, f._2("currency"), f._2("price"))).toSeq
}, outputSchema)

val result = df.
  select(col("id").as("id"), explode(toMap(to_json(col("pricingByCountry.regionPrices")))).as("temp")).
  select(col("id"), col("temp.country").as("country"), col("temp.currency").as("currency"), col("temp.price").as("price"))
Output will be:
scala> result.show
+----+-------+--------+-----+
| id|country|currency|price|
+----+-------+--------+-----+
|abcd| AT| EUR|100.0|
|abcd| BT| NGE|200.0|
|abcd| CL| PES|300.0|
+----+-------+--------+-----+
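Alternatively, a UDF-free sketch of the same idea (assuming the column names are exactly as in the sample data above): read the country codes off the inferred schema of regionPrices and explode an array of structs.
import org.apache.spark.sql.functions._

// country codes come straight from the inferred schema of regionPrices
val countries = df.select("pricingByCountry.regionPrices.*").columns

// one struct per country: (country, currency, price)
val countryStructs = countries.map { c =>
  struct(
    lit(c).as("country"),
    col(s"pricingByCountry.regionPrices.$c.currency").as("currency"),
    col(s"pricingByCountry.regionPrices.$c.price").as("price")
  )
}

val noUdfResult = df
  .select(col("id"), explode(array(countryStructs: _*)).as("temp"))
  .select(col("id"), col("temp.country"), col("temp.currency"), col("temp.price"))
// noUdfResult.show gives the same rows as result above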

Related

Scala Spark - copy data from 1 Dataframe into another DF with nested schema & same column names

DF1 - flat dataframe with data
+---------+--------+-------+
|FirstName|LastName| Device|
+---------+--------+-------+
| Robert|Williams|android|
| Maria|Sharpova| iphone|
+---------+--------+-------+
root
|-- FirstName: string (nullable = true)
|-- LastName: string (nullable = true)
|-- Device: string (nullable = true)
DF2 - empty dataframe with same column names
+------+----+
|header|body|
+------+----+
+------+----+
root
|-- header: struct (nullable = true)
| |-- FirstName: string (nullable = true)
| |-- LastName: string (nullable = true)
|-- body: struct (nullable = true)
| |-- Device: string (nullable = true)
DF2 schema code:
import org.apache.spark.sql.types._

val schema = StructType(Array(
  StructField("header", StructType(Array(
    StructField("FirstName", StringType),
    StructField("LastName", StringType)))),
  StructField("body", StructType(Array(
    StructField("Device", StringType))))
))
DF2 with data from DF1 would be the final output.
I need to do this for multiple columns of a complex schema and make it configurable, and I have to do it without using a case class.
APPROACH #1 - use schema.fields.map to map DF1 -> DF2?
APPROACH #2 - create a new DF and define data and schema?
APPROACH #3 - use zip and map transformations to define a 'select col as col' query... I don't know if this would work for a nested (StructType) schema.
How would I go about doing this?
import spark.implicits._
import org.apache.spark.sql.functions._

val sourceDF = Seq(
  ("Robert", "Williams", "android"),
  ("Maria", "Sharpova", "iphone")
).toDF("FirstName", "LastName", "Device")

val resDF = sourceDF
  .withColumn("header", struct('FirstName, 'LastName))
  .withColumn("body", struct(col("Device")))
  .select('header, 'body)
resDF.printSchema
// root
// |-- header: struct (nullable = false)
// | |-- FirstName: string (nullable = true)
// | |-- LastName: string (nullable = true)
// |-- body: struct (nullable = false)
// | |-- Device: string (nullable = true)
resDF.show(false)
// +------------------+---------+
// |header |body |
// +------------------+---------+
// |[Robert, Williams]|[android]|
// |[Maria, Sharpova] |[iphone] |
// +------------------+---------+
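To make this configurable rather than hard-coded (roughly the asker's APPROACH #1), a possible sketch is to drive the select from the target schema. This assumes only one level of nesting and that every nested leaf name exists as a flat column of the same name in sourceDF:
import org.apache.spark.sql.types._

val targetSchema = StructType(Array(
  StructField("header", StructType(Array(
    StructField("FirstName", StringType),
    StructField("LastName", StringType)))),
  StructField("body", StructType(Array(
    StructField("Device", StringType))))
))

val nestedCols = targetSchema.fields.map {
  case StructField(name, st: StructType, _, _) =>
    struct(st.fieldNames.map(col): _*).as(name)  // wrap the flat leaf columns into a struct
  case StructField(name, _, _, _) =>
    col(name)                                    // pass non-struct columns through unchanged
}

val configurableDF = sourceDF.select(nestedCols: _*)
// configurableDF has the same header/body schema as resDF above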

How to change datatype of columns in a dataframe based on a case class in scala/spark

I am trying to convert the data type of some columns based on a case class.
val simpleDf = Seq(("James",34,"2006-01-01","true","M",3000.60),
("Michael",33,"1980-01-10","true","F",3300.80),
("Robert",37,"1995-01-05","false","M",5000.50)
).toDF("firstName","age","jobStartDate","isGraduated","gender","salary")
// Output
simpleDf.printSchema()
root
|-- firstName: string (nullable = true)
|-- age: integer (nullable = false)
|-- jobStartDate: string (nullable = true)
|-- isGraduated: string (nullable = true)
|-- gender: string (nullable = true)
|-- salary: double (nullable = false)
Here I wanted to change the datatype of jobStartDate to Timestamp and isGraduated to Boolean. I am wondering if that conversion is possible using the case class?
I am aware this can be done by casting each column but in my case, I need to map the incoming DF based on a case class defined.
case class empModel(firstName: String,
                    age: Integer,
                    jobStartDate: java.sql.Timestamp,
                    isGraduated: Boolean,
                    gender: String,
                    salary: Double)

val newDf = simpleDf.as[empModel].toDF
newDf.show(false)
I am getting errors because of the string to timestamp conversion. Is there any workaround?
You can generate the schema from the case class using ScalaReflection:
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.catalyst.ScalaReflection
val schema = ScalaReflection.schemaFor[empModel].dataType.asInstanceOf[StructType]
Now, you can pass this schema when you load your files into a DataFrame.
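For example (the path, format, and timestampFormat here are assumptions for illustration, not from the question):
val typedDf = spark.read
  .schema(schema)                           // schema generated from empModel above
  .option("timestampFormat", "yyyy-MM-dd")  // matches the sample jobStartDate values
  .json("/path/to/employees.json")          // hypothetical input path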
Or, if you prefer to cast some or all columns after you read the DataFrame, you can iterate over the schema fields and cast each column to its corresponding data type, using foldLeft for example:
val df = schema.fields.foldLeft(simpleDf) {
  (df, s) => df.withColumn(s.name, df(s.name).cast(s.dataType))
}
df.printSchema
//root
// |-- firstName: string (nullable = true)
// |-- age: integer (nullable = true)
// |-- jobStartDate: timestamp (nullable = true)
// |-- isGraduated: boolean (nullable = false)
// |-- gender: string (nullable = true)
// |-- salary: double (nullable = false)
df.show
//+---------+---+-------------------+-----------+------+------+
//|firstName|age| jobStartDate|isGraduated|gender|salary|
//+---------+---+-------------------+-----------+------+------+
//| James| 34|2006-01-01 00:00:00| true| M|3000.6|
//| Michael| 33|1980-01-10 00:00:00| true| F|3300.8|
//| Robert| 37|1995-01-05 00:00:00| false| M|5000.5|
//+---------+---+-------------------+-----------+------+------+

how to update spark dataframe column containing array using udf

I have a dataframe:
+--------------------+------+
|people |person|
+--------------------+------+
|[[jack, jill, hero]]|joker |
+--------------------+------+
Its schema:
root
|-- people: struct (nullable = true)
| |-- person: array (nullable = true)
| | |-- element: string (containsNull = true)
|-- person: string (nullable = true)
Here, root--person is a string, so I can update this field using a udf:
def updateString = udf((s: String) => {
  "Mr. " + s
})
df.withColumn("person", updateString(col("person"))).select("person").show(false)
output:
+---------+
|person |
+---------+
|Mr. joker|
+---------+
I want to do the same operation on the root--people--person column, which contains an array of person values. How can I achieve this using a udf?
def updateArray = udf((arr: Seq[Row]) => ???
df.withColumn("people", updateArray(col("people.person"))).select("people").show(false)
expected:
+------------------------------+
|people |
+------------------------------+
|[Mr. hero, Mr. jack, Mr. jill]|
+------------------------------+
Edit: I also want to preserve its schema after updating root--people--person.
Expected schema of people:
df.select("people").printSchema()
root
|-- people: struct (nullable = false)
| |-- person: array (nullable = true)
| | |-- element: string (containsNull = true)
Thanks,
The problem here is that people is a struct with only 1 field. In your UDF, you need to return Tuple1 and then further cast the output of your UDF to keep the names correct:
def updateArray = udf((r: Row) => Tuple1(r.getAs[Seq[String]](0).map(x => "Mr." + x)))

val newDF = df
  .withColumn("people", updateArray($"people").cast("struct<person:array<string>>"))
newDF.printSchema()
newDF.show()
gives
root
|-- people: struct (nullable = true)
| |-- person: array (nullable = true)
| | |-- element: string (containsNull = true)
|-- person: string (nullable = true)
+--------------------+------+
| people|person|
+--------------------+------+
|[[Mr.jack, Mr.jil...| joker|
+--------------------+------+
For you, just updating your function is enough; everything else remains the same.
Here is the code snippet.
scala> df2.show
+------+------------------+
|people| person|
+------+------------------+
| joker|[jack, jill, hero]|
+------+------------------+
// just the order of the columns is changed
I just updated your function: instead of Row, I am using Seq[String] here.
scala> def updateArray = udf((arr: Seq[String]) => arr.map(x=>"Mr."+x))
scala> df2.withColumn("test",updateArray($"person")).show(false)
+------+------------------+---------------------------+
|people|person |test |
+------+------------------+---------------------------+
|joker |[jack, jill, hero]|[Mr.jack, Mr.jill, Mr.hero]|
+------+------------------+---------------------------+
// all columns are kept for testing purposes; you could drop the ones you don't want
Let me know if you want to know more about this.
Let's create data for testing
scala> val data = Seq((List(Array("ja", "ji", "he")), "person")).toDF("people", "person")
data: org.apache.spark.sql.DataFrame = [people: array<array<string>>, person: string]
scala> data.printSchema
root
|-- people: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
|-- person: string (nullable = true)
create UDF for our requirements
scala> def arrayConcat(array:Seq[Seq[String]], str: String) = array.map(_.map(str + _))
arrayConcat: (array: Seq[Seq[String]], str: String)Seq[Seq[String]]
scala> val arrayConcatUDF = udf(arrayConcat _)
arrayConcatUDF: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function2>,ArrayType(ArrayType(StringType,true),true),Some(List(ArrayType(ArrayType(StringType,true),true), StringType)))
Applying the udf
scala> data.withColumn("dasd", arrayConcatUDF($"people", lit("Mr."))).show(false)
+--------------------------+------+-----------------------------------+
|people |person|dasd |
+--------------------------+------+-----------------------------------+
|[WrappedArray(ja, ji, he)]|person|[WrappedArray(Mr.ja, Mr.ji, Mr.he)]|
+--------------------------+------+-----------------------------------+
You may need to tweak it a bit (though I think hardly any tweaking is required), but this contains most of what you need to solve your problem.

Convert a json string to array of key-value pairs in Spark scala

I have a JSON string that I load into a Spark DataFrame. The JSON string can have between 0 and 3 key-value pairs.
When more than one kv pair is sent, product_facets is correctly formatted as an array, like below:
{"id":1,
"productData":{
"product":{
"product_name":"xyz",
"product_facets":{"entry":[{"key":"test","value":"success"}, {"key": "test2","value" : "fail"}]}
}}}
I can now use the explode function:
sourceDF.filter($"someKey".contains("some_string"))
.select($"id", explode($"productData.product.product_facets.entry") as "kvPairs")
However, when only one key-value pair is sent, the source JSON string for entry is not formatted as an array with square brackets:
{"id":1,
"productData":{
"product":{
"product_name":"xyz",
"product_facets":{"entry":{"key":"test","value":"success"}}
}}}
The schema for product tag looks like:
| |-- product: struct (nullable = true)
| | |-- product_facets: struct (nullable = true)
| | | |-- entry: string (nullable = true)
| | |-- product_name: string (nullable = true)
How can I change entry to an array of key-value pairs that is compatible with the explode function? My end goal is to pivot the keys into individual columns, and I want to group by after exploding the kv pairs. I tried using from_json but could not get it to work.
val schema = StructType(
  Seq(
    StructField("entry", ArrayType(
      StructType(
        Seq(
          StructField("key", StringType),
          StructField("value", StringType)
        )
      )
    ))
  )
)
sourceDF.filter($"someKey".contains("some_string"))
.select($"id", from_json($"productData.product.product_facets.entry", schema) as "kvPairsFromJson")
But the above creates a new column kvPairsFromJson that looks like "[]", and using explode does nothing.
Any pointers on what's going on, or is there a better way to do this?
I think one approach could be:
1. Create a udf which takes the entry value as a JSON string and converts it to a List(Tuple(K, V)).
2. In the udf, check whether the entry value is an array or not and do the conversion accordingly.
The code below implements the above approach:
// one row where entry is an array and one where it is not
val ds = Seq(
  """{"id":1,"productData":{"product":{"product_name":"xyz","product_facets":{"entry":[{"key":"test","value":"success"},{"key":"test2","value":"fail"}]}}}}""",
  """{"id":2,"productData":{"product":{"product_name":"xyz","product_facets":{"entry":{"key":"test","value":"success"}}}}}"""
).toDS
val df = spark.read.json(ds)

// Schema used by the udf to generate the output column
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val outputSchema = ArrayType(StructType(Seq(
  StructField("key", StringType, false),
  StructField("value", StringType, false)
)))

// Converts a non-array entry value to an array of (key, value) tuples
val toArray = udf((json: String) => {
  import com.fasterxml.jackson.databind._
  import com.fasterxml.jackson.module.scala.DefaultScalaModule
  val jsonMapper = new ObjectMapper()
  jsonMapper.registerModule(DefaultScalaModule)
  if (!json.startsWith("[")) {
    val jsonMap = jsonMapper.readValue(json, classOf[Map[String, String]])
    List((jsonMap("key"), jsonMap("value")))
  } else {
    jsonMapper.readValue(json, classOf[List[Map[String, String]]]).map(f => (f("key"), f("value")))
  }
}, outputSchema)

val arrayResult = df.select(col("id").as("id"),
  toArray(col("productData.product.product_facets.entry")).as("entry"))
val arrayExploded = df.select(col("id").as("id"),
  explode(toArray(col("productData.product.product_facets.entry"))).as("entry"))
val explodedToCols = arrayExploded
  .select(col("id"), col("entry.key").as("key"), col("entry.value").as("value"))
Results in:
scala> arrayResult.printSchema
root
|-- id: long (nullable = true)
|-- entry: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = false)
| | |-- value: string (nullable = false)
scala> arrayExploded.printSchema
root
|-- id: long (nullable = true)
|-- entry: struct (nullable = true)
| |-- key: string (nullable = false)
| |-- value: string (nullable = false)
scala> arrayResult.show(false)
+---+--------------------------------+
|id |entry |
+---+--------------------------------+
|1 |[[test, success], [test2, fail]]|
|2 |[[test, success]] |
+---+--------------------------------+
scala> arrayExploded.show(false)
+---+---------------+
|id |entry |
+---+---------------+
|1 |[test, success]|
|1 |[test2, fail] |
|2 |[test, success]|
+---+---------------+
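As an aside, if you prefer to avoid Jackson, a pure-DataFrame sketch (assuming entry is a string column, as in the question's schema) is to normalize the raw string so that it is always a JSON array and then parse it with from_json using the same outputSchema:
// normalize the raw JSON string, then parse with from_json and explode
val entryStr = col("productData.product.product_facets.entry")
val asArrayStr = when(entryStr.startsWith("["), entryStr)
  .otherwise(concat(lit("["), entryStr, lit("]")))

val noJackson = df.select(col("id"),
  explode(from_json(asArrayStr, outputSchema)).as("entry"))
  .select(col("id"), col("entry.key").as("key"), col("entry.value").as("value"))
From there, the pivot mentioned in the question could look something like noJackson.groupBy("id").pivot("key").agg(first("value")).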

Rename nested struct columns in a Spark DataFrame [duplicate]

This question already has answers here: Rename nested field in spark dataframe (5 answers). Closed 3 years ago.
I am trying to change the names of a DataFrame's columns in Scala. I can easily change the column names for top-level fields, but I'm having difficulty converting struct and array-of-struct columns.
Below is my DataFrame schema.
|-- _VkjLmnVop: string (nullable = true)
|-- _KaTasLop: string (nullable = true)
|-- AbcDef: struct (nullable = true)
| |-- UvwXyz: struct (nullable = true)
| | |-- _MnoPqrstUv: string (nullable = true)
| | |-- _ManDevyIxyz: string (nullable = true)
But I need the schema like below
|-- vkj_lmn_vop: string (nullable = true)
|-- ka_tas_lop: string (nullable = true)
|-- abc_def: struct (nullable = true)
| |-- uvw_xyz: struct (nullable = true)
| | |-- mno_pqrst_uv: string (nullable = true)
| | |-- man_devy_ixyz: string (nullable = true)
For non-struct columns I'm changing column names as below:
def aliasAllColumns(df: DataFrame): DataFrame = {
  df.select(df.columns.map { c =>
    df.col(c)
      .as(
        c.replaceAll("_", "")
          .replaceAll("([A-Z])", "_$1")
          .toLowerCase
          .replaceFirst("_", ""))
  }: _*)
}
aliasAllColumns(file_data_df).show(1)
How can I change struct column names dynamically?
You can create a recursive method to traverse the DataFrame schema for renaming the columns:
import org.apache.spark.sql.types._

def renameAllCols(schema: StructType, rename: String => String): StructType = {
  def recurRename(schema: StructType): Seq[StructField] = schema.fields.map {
    case StructField(name, dtype: StructType, nullable, meta) =>
      StructField(rename(name), StructType(recurRename(dtype)), nullable, meta)
    case StructField(name, dtype: ArrayType, nullable, meta) if dtype.elementType.isInstanceOf[StructType] =>
      StructField(rename(name), ArrayType(StructType(recurRename(dtype.elementType.asInstanceOf[StructType])), true), nullable, meta)
    case StructField(name, dtype, nullable, meta) =>
      StructField(rename(name), dtype, nullable, meta)
  }
  StructType(recurRename(schema))
}
Testing it with the following example:
import org.apache.spark.sql.functions._
import spark.implicits._
val renameFcn = (s: String) =>
  s.replace("_", "").replaceAll("([A-Z])", "_$1").toLowerCase.dropWhile(_ == '_')

case class C(A_Bc: Int, D_Ef: Int)

val df = Seq(
  (10, "a", C(1, 2), Seq(C(11, 12), C(13, 14)), Seq(101, 102)),
  (20, "b", C(3, 4), Seq(C(15, 16)), Seq(103))
).toDF("_VkjLmnVop", "_KaTasLop", "AbcDef", "ArrStruct", "ArrInt")
val newDF = spark.createDataFrame(df.rdd, renameAllCols(df.schema, renameFcn))
newDF.printSchema
// root
// |-- vkj_lmn_vop: integer (nullable = false)
// |-- ka_tas_lop: string (nullable = true)
// |-- abc_def: struct (nullable = true)
// | |-- a_bc: integer (nullable = false)
// | |-- d_ef: integer (nullable = false)
// |-- arr_struct: array (nullable = true)
// | |-- element: struct (containsNull = true)
// | | |-- a_bc: integer (nullable = false)
// | | |-- d_ef: integer (nullable = false)
// |-- arr_int: array (nullable = true)
// | |-- element: integer (containsNull = false)
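Note that the RDD round-trip in createDataFrame can be avoided if you prefer: Spark allows casting a struct to a struct of the same shape with different field names, so a possible alternative (a sketch reusing renameAllCols and renameFcn from above) is to cast each top-level column to its renamed type:
// rename everything, including nested fields, by casting each top-level column
val renamedSchema = renameAllCols(df.schema, renameFcn)
val newDF2 = df.select(df.schema.fields.zip(renamedSchema.fields).map {
  case (orig, renamed) => col(orig.name).cast(renamed.dataType).as(renamed.name)
}: _*)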
As far as I know, it's not possible to rename nested fields directly.
From one side, you could try moving to a flat object.
However, if you need to keep the structure, you can play with spark.sql.functions.struct(*cols).
Creates a new struct column.
Parameters: cols – list of column names (string) or list of Column expressions
You will need to decompose the whole schema, generate the aliases that you need, and then compose it again using the struct function.
It's not the best solution. But it's something :)
PS: I'm referencing the PySpark doc since it contains a better explanation than the Scala one.
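To make the decompose-and-rebuild idea concrete, here is a minimal hand-written sketch for the AbcDef column of the test DataFrame from the answer above (reusing its renameFcn); every nesting level has to be rebuilt explicitly with struct and aliases:
import org.apache.spark.sql.functions.{col, struct}

val rebuilt = df
  .withColumn("AbcDef", struct(
    col("AbcDef.A_Bc").as(renameFcn("A_Bc")),  // nested fields renamed via alias
    col("AbcDef.D_Ef").as(renameFcn("D_Ef"))
  ))
  .withColumnRenamed("AbcDef", renameFcn("AbcDef"))
  .withColumnRenamed("_VkjLmnVop", renameFcn("_VkjLmnVop"))
  .withColumnRenamed("_KaTasLop", renameFcn("_KaTasLop"))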