Flatten Nested schema in DataFrame, getting AnalysisException: cannot resolve column name - scala

I have a DF:
root
|-- str1: struct (nullable = true)
| |-- a1: string (nullable = true)
| |-- a2: string (nullable = true)
| |-- a3: string (nullable = true)
|-- str2: string (nullable = true)
|-- str3: string (nullable = true)
|-- str4: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- b1: string (nullable = true)
| | |-- b2: string (nullable = true)
| | |-- b3: boolean (nullable = true)
| | |-- b4: struct (nullable = true)
| | | |-- c1: integer (nullable = true)
| | | |-- c2: string (nullable = true)
| | | |-- c3: integer (nullable = true)
I am trying to flatten it; to do that I have used the code below:
import org.apache.spark.sql.Column
import org.apache.spark.sql.types.{ArrayType, StructType}

def flattenSchema(schema: StructType, prefix: String = null): Array[Column] = {
  schema.fields.flatMap { f =>
    val colName = if (prefix == null) f.name else prefix + "." + f.name
    f.dataType match {
      // Recurse into structs, carrying the parent name as a prefix
      case st: StructType => flattenSchema(st, colName)
      // For arrays of structs, recurse into the element type
      case at: ArrayType =>
        val st = at.elementType.asInstanceOf[StructType]
        flattenSchema(st, colName)
      // Leaf field: select it and alias it with the dotted path
      case _ => Array(new Column(colName).as(colName))
    }
  }
}
val d1 = df.select(flattenSchema(df.schema): _*)
It gives me the output below:
|-- str1.a1: string (nullable = true)
|-- str1.a2: string (nullable = true)
|-- str1.a3: string (nullable = true)
|-- str2: string (nullable = true)
|-- str3: string (nullable = true)
|-- str4.b1: array (nullable = true)
| |-- element: string (containsNull = true)
|-- str4.b2: array (nullable = true)
| |-- element: string (containsNull = true)
|-- str4.b3: array (nullable = true)
| |-- element: string (containsNull = true)
|-- str4.b4.c1: array (nullable = true)
| |-- element: integer (containsNull = true)
|-- str4.b4.c2: array (nullable = true)
| |-- element: string (containsNull = true)
|-- str4.b4.c3: array (nullable = true)
| |-- element: integer (containsNull = true)
The problem arises when I try to query it:
d1.select("str2").show // works without issue
But when I query any flattened nested column:
d1.select("str1.a1")
Error:
org.apache.spark.sql.AnalysisException: cannot resolve '`str1.a1`' given input columns: ....
What am I doing wrong here? Or is there another way to achieve the desired result?

Spark does not resolve plain string column names that contain a dot (.); the dot is used to access a child column of a struct-type column. If you access the same column from the original DataFrame df, it works, because there str1 is a struct type.
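Two possible fixes, sketched here as hedged suggestions rather than part of the original answer: escape the literal dot with backticks when selecting, or change the alias inside flattenSchema so flattened names use underscores instead of dots.

// 1. Backticks make Spark treat the dotted name as a single column:
d1.select("`str1.a1`").show()

// 2. Or alias with underscores in flattenSchema so no dots survive:
//    case _ => Array(new Column(colName).as(colName.replace(".", "_")))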

Related

rename spark dataframe structType fields

Given a dynamic StructType whose name is not known in advance (it is dynamic, so its name changes), how can we rename its fields? The name is variable, so do not assume "MAIN_COL" appears in the schema.
root
|-- MAIN_COL: struct (nullable = true)
| |-- a: string (nullable = true)
| |-- b: string (nullable = true)
| |-- c: string (nullable = true)
| |-- d: string (nullable = true)
| |-- f: long (nullable = true)
| |-- g: long (nullable = true)
| |-- h: long (nullable = true)
| |-- j: long (nullable = true)
How can we write dynamic code that renames the fields of a StructType, using the struct's own name as a prefix?
root
|-- MAIN_COL: struct (nullable = true)
| |-- MAIN_COL_a: string (nullable = true)
| |-- MAIN_COL_b: string (nullable = true)
| |-- MAIN_COL_c: string (nullable = true)
| |-- MAIN_COL_d: string (nullable = true)
| |-- MAIN_COL_f: long (nullable = true)
| |-- MAIN_COL_g: long (nullable = true)
| |-- MAIN_COL_h: long (nullable = true)
| |-- MAIN_COL_j: long (nullable = true)
You can use the DSL to update the schema of nested columns.
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

// Take the struct column's own name instead of hard-coding "MAIN_COL"
val structField = df.schema.fields.head
val schema: StructType = structField.dataType.asInstanceOf[StructType]
val updatedSchema = StructType(
  schema.fields.map(sf => StructField(structField.name + "_" + sf.name, sf.dataType))
)
val resultDF = df.withColumn(structField.name, col(structField.name).cast(updatedSchema))
Updated Schema:
root
|-- MAIN_COL: struct (nullable = false)
| |-- MAIN_COL_a: string (nullable = true)
| |-- MAIN_COL_b: string (nullable = true)
| |-- MAIN_COL_c: string (nullable = true)
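An alternative sketch (my assumption, not part of the original answer): rebuild the struct from aliased fields instead of casting; this yields the same renamed fields.

import org.apache.spark.sql.functions.{col, struct}

// structField and schema are the values defined above.
val renamedFields = schema.fields.map { sf =>
  col(s"${structField.name}.${sf.name}").as(s"${structField.name}_${sf.name}")
}
val resultDF2 = df.withColumn(structField.name, struct(renamedFields: _*))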

Convert flattened data frame to struct in Spark

I had deeply nested JSON files to process, and in order to do that I had to flatten them, because I couldn't find a way to hash some deeply nested fields. This is how my dataframe looks (after flattening):
scala> flattendedJSON.printSchema
root
|-- header_appID: string (nullable = true)
|-- header_appVersion: string (nullable = true)
|-- header_userID: string (nullable = true)
|-- body_cardId: string (nullable = true)
|-- body_cardStatus: string (nullable = true)
|-- body_cardType: string (nullable = true)
|-- header_userAgent_browser: string (nullable = true)
|-- header_userAgent_browserVersion: string (nullable = true)
|-- header_userAgent_deviceName: string (nullable = true)
|-- body_beneficiary_beneficiaryAccounts_beneficiaryAccountOwner: string (nullable = true)
|-- body_beneficiary_beneficiaryPhoneNumbers_beneficiaryPhoneNumber: string (nullable = true)
And I need to convert it back to the original structure (before flattening):
scala> nestedJson.printSchema
root
|-- header: struct (nullable = true)
| |-- appID: string (nullable = true)
| |-- appVersion: string (nullable = true)
| |-- userAgent: struct (nullable = true)
| | |-- browser: string (nullable = true)
| | |-- browserVersion: string (nullable = true)
| | |-- deviceName: string (nullable = true)
|-- body: struct (nullable = true)
| |-- beneficiary: struct (nullable = true)
| | |-- beneficiaryAccounts: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- beneficiaryAccountOwner: string (nullable = true)
| | |-- beneficiaryPhoneNumbers: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- beneficiaryPhoneNumber: string (nullable = true)
| |-- cardId: string (nullable = true)
| |-- cardStatus: string (nullable = true)
| |-- cardType: string (nullable = true)
I've managed to do it with a single nested field, but with more levels it doesn't work, and I can't find a way to do it properly. Here's what I tried:
val structColumns = flattendedJSON.columns.filter(_.contains("_"))

val structColumnsMap = structColumns
  .map(_.split("\\_"))
  .groupBy(_(0))
  .mapValues(_.map(_(1)))

val dfExpanded = structColumnsMap.foldLeft(flattendedJSON) { (accDF, kv) =>
  val cols = kv._2.map(v => col("`" + kv._1 + "_" + v + "`").as(v))
  accDF.withColumn(kv._1, struct(cols: _*))
}

val dfResult = structColumns.foldLeft(flattendedJSON)(_ drop _)
And it works if I have one nesting level (e.g. header_appID), but in the case of header_userAgent_browser I get an exception:
org.apache.spark.sql.AnalysisException: cannot resolve
'header_userAgent' given input columns: ..
Using Spark 2.3 and Scala 2.11.8
I would recommend using case classes and working with a Dataset, instead of flattening the DF and then trying to convert it back to the old JSON format. Even if it has nested objects, you can define a set of case classes to cast it. It lets you work with object notation, which makes things easier than with a DF.
There are tools where you can provide a sample of the JSON and they generate the classes for you (I use this: https://json2caseclass.cleverapps.io).
If you anyway want to convert it from the DF, an alternative could be to create a Dataset using map on your DF. Something like this:
case class NestedNode(fieldC: String, fieldD: String) // for JSON
case class MainNode(fieldA: String, fieldB: NestedNode) // for JSON
case class FlattenData(fa: String, fc: String, fd: String)

Seq(
  FlattenData("A1", "B1", "C1"),
  FlattenData("A2", "B2", "C2"),
  FlattenData("A3", "B3", "C3")
).toDF
  .as[FlattenData] // cast it to access fields with object notation
  .map(flattenItem =>
    MainNode(flattenItem.fa, NestedNode(flattenItem.fc, flattenItem.fd)) // create the output format
  )
In the end, the schema defined by the classes will be used when you write: yourDS.write.mode(your_save_mode).json(your_target_path)
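For illustration, a short usage sketch (the val name and output path are assumptions, not from the original answer):

// Assign the mapped Dataset from the snippet above, then write it out;
// the case-class structure becomes the nested JSON structure.
val nestedDS = Seq(FlattenData("A1", "B1", "C1")).toDF
  .as[FlattenData]
  .map(f => MainNode(f.fa, NestedNode(f.fc, f.fd)))
nestedDS.write.mode("overwrite").json("/tmp/nested_json") // illustrative path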

Schema conversion from String to Array[StructType] using Spark Scala

I have the sample data shown below; I need to convert the columns (ABS, ALT) from string to Array[StructType] using Spark Scala code. Any help would be much appreciated.
With the help of a UDF I was able to convert from string to ArrayType, but I need some help converting from string to Array[StructType] for these two columns (ABS, ALT).
VIN          TT  MSG_TYPE  ABS                          ALT
MSGXXXXXXXX  1   SIGL      [{"E":1569XXXXXXX,"V":0.0}]  [{"E":156957XXXXXX,"V":0.0}]
df.currentSchema
root
|-- VIN: string (nullable = true)
|-- TT: long (nullable = true)
|-- MSG_TYPE: string (nullable = true)
|-- ABS: string (nullable = true)
|-- ALT: string (nullable = true)
df.expectedSchema:
|-- VIN: string (nullable = true)
|-- TT: long (nullable = true)
|-- MSG_TYPE: string (nullable = true)
|-- ABS: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: long (nullable = true)
| | |-- V: long (nullable = true)
|-- ALT: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: long (nullable = true)
| | |-- V: double (nullable = true)
It also works if you try it as below:
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{ArrayType, DoubleType, LongType, StructField, StructType}
// the $ syntax needs: import spark.implicits._

val schema = ArrayType(StructType(Seq(StructField("E", LongType), StructField("V", DoubleType))))
val final_df = newDF
  .withColumn("ABS", from_json($"ABS", schema))
  .withColumn("ALT", from_json($"ALT", schema))
final_df.printSchema:
root
|-- VIN: string (nullable = true)
|-- TT: string (nullable = true)
|-- MSG_TYPE: string (nullable = true)
|-- ABS: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: long (nullable = true)
| | |-- V: double (nullable = false)
|-- ALT: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: long (nullable = true)
| | |-- V: double (nullable = false)
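Note that the expected schema above has V as long in ABS but double in ALT, while the single schema here parses both as double. A hedged variation (an untested sketch, not from the original answer) uses a separate schema per column; be aware that a decimal literal such as "V":0.0 may come back null when parsed as long, depending on Spark's JSON coercion.

import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._

// One schema per column, matching the expected element types exactly.
val absSchema = ArrayType(StructType(Seq(StructField("E", LongType), StructField("V", LongType))))
val altSchema = ArrayType(StructType(Seq(StructField("E", LongType), StructField("V", DoubleType))))

val final_df2 = newDF
  .withColumn("ABS", from_json($"ABS", absSchema))
  .withColumn("ALT", from_json($"ALT", altSchema))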
You can use a UDF to parse the JSON and transform it into arrays of structs.
First, define a function that parses the JSON (based on this answer):
import scala.util.parsing.json.JSON

case class Data(E: String, V: Double)

// Typed extractors for the untyped values returned by JSON.parseFull
class CC[T] extends Serializable { def unapply(a: Any): Option[T] = Some(a.asInstanceOf[T]) }
object M extends CC[Map[String, Any]]
object L extends CC[List[Any]]
object S extends CC[String]
object D extends CC[Double]

def toStruct(in: String): Array[Data] = {
  if (in == null || in.isEmpty) return new Array[Data](0)
  val result = for {
    Some(L(list)) <- List(JSON.parseFull(in)) // parse the JSON array
    M(data) <- list                           // each element is a map
    S(e) = data("E")
    D(v) = data("V")
  } yield Data(e, v)
  result.toArray
}
This function returns an array of Data objects that already have the correct array-of-struct shape. (Note that Data declares E as String, which is why the resulting schema below shows E as string rather than long.) Now we use this function to define a UDF:
val ts: String => Array[Data] = toStruct(_)
import org.apache.spark.sql.functions.udf
val toStructUdf = udf(ts)
Finally, we call the UDF (for example in a select statement):
val df = ...
val newdf = df.select('VIN, 'TT, 'MSG_TYPE, toStructUdf('ABS).as("ABS"), toStructUdf('ALT).as("ALT"))
newdf.printSchema()
Output:
root
|-- VIN: string (nullable = true)
|-- TT: string (nullable = true)
|-- MSG_TYPE: string (nullable = true)
|-- ABS: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: string (nullable = true)
| | |-- V: double (nullable = false)
|-- ALT: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: string (nullable = true)
| | |-- V: double (nullable = false)

How to cast all columns of a DataFrame (with Nested StructTypes and nested ArrayType) to string in Spark

root
|-- channelGrouping: string (nullable = true)
|-- clientId: string (nullable = true)
|-- customDimensions: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- index: long (nullable = true)
| | |-- value: string (nullable = true)
|-- date: string (nullable = true)
|-- device: struct (nullable = true)
| |-- browser: string (nullable = true)
| |-- browserSize: integer (nullable = true)
| |-- browserVersion: string (nullable = true)
| |-- deviceCategory: string (nullable = true)
| |-- flashVersion: string (nullable = true)
| |-- isMobile: boolean (nullable = true)
| |-- javaEnabled: boolean (nullable = true)
import org.apache.spark.sql.types.StructType

// Cast every struct column to a struct with all-string fields
val structCastExpression1 = df.schema
  .filter(_.dataType.isInstanceOf[StructType])
  .map(c => (c.name, c.dataType.asInstanceOf[StructType].map(_.name)))
  .map { case (col, sub) =>
    s"""cast($col as struct${sub.map(c => s"$c:string").mkString("<", ",", ">")}) as $col"""
  }
// List(cast(s1 as struct<x:string,y:string>) as s1,
//      cast(s2 as struct<u:string,v:string>) as s2)

// Cast every non-struct column to plain string
val otherColumns = df.schema
  .filterNot(_.dataType.isInstanceOf[StructType])
  .map(c => s"cast(${c.name} as string) as ${c.name}")
// List(cast(id as string) as id, cast(d as string) as d)

// original columns
val originalColumns = df.columns

// Union both the expressions into one big expression
val finalExpression = otherColumns.union(structCastExpression1)
// List(cast(id as string) as id,
//      cast(d as string) as d,
//      cast(s1 as struct<x:string,y:string>) as s1,
//      cast(s2 as struct<u:string,v:string>) as s2)

// Use `selectExpr` to pass the expressions
df.selectExpr(finalExpression: _*)
  .select(originalColumns.head, originalColumns.tail: _*)
  .printSchema
After using this, I get:
root
|-- channelGrouping: string (nullable = true)
|-- clientId: string (nullable = true)
|-- customDimensions: string (nullable = true)
|-- date: string (nullable = true)
|-- device: struct (nullable = true)
| |-- browser: string (nullable = true)
| |-- browserSize: string (nullable = true)
| |-- browserVersion: string (nullable = true)
| |-- deviceCategory: string (nullable = true)
| |-- flashVersion: string (nullable = true)
| |-- isMobile: string (nullable = true)
| |-- javaEnabled: string (nullable = true)
| |-- language: string (nullable = true)
The expected output is:
root
|-- channelGrouping: string (nullable = true)
|-- clientId: string (nullable = true)
|-- customDimensions: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- index: string (nullable = true)
| | |-- value: string (nullable = true)
|-- date: string (nullable = true)
|-- device: struct (nullable = true)
| |-- browser: string (nullable = true)
| |-- browserSize: string (nullable = true)
| |-- browserVersion: string (nullable = true)
| |-- deviceCategory: string (nullable = true)
| |-- flashVersion: string (nullable = true)
| |-- isMobile: boolean (nullable = true)
| |-- javaEnabled: boolean (nullable = true)
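A hedged extension of the same idea (an untested sketch, not from the original thread): array-of-struct columns such as customDimensions can keep their array shape by casting to array<struct<...>> instead of to a plain string.

import org.apache.spark.sql.types.{ArrayType, StructType}

// Build cast expressions for array-of-struct columns, stringifying
// only the leaf fields while preserving the array shape.
val arrayStructCastExpression = df.schema
  .filter(f => f.dataType match {
    case ArrayType(_: StructType, _) => true
    case _ => false
  })
  .map { f =>
    val st = f.dataType.asInstanceOf[ArrayType].elementType.asInstanceOf[StructType]
    val fields = st.map(sf => s"${sf.name}:string").mkString("<", ",", ">")
    s"cast(${f.name} as array<struct$fields>) as ${f.name}"
  }
// e.g. cast(customDimensions as array<struct<index:string,value:string>>) as customDimensions

These expressions can be unioned into finalExpression before the selectExpr call (and such array columns would then need to be excluded from the plain-string casts in otherColumns).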

Flatten complex JSON schema using pyspark

I am trying to flatten a complex JSON structure containing nested arrays and struct elements, using a generic function that should work for any JSON file with any schema.
Below is part of a sample JSON structure that I want to flatten
root
|-- Data: struct (nullable = true)
| |-- Record: struct (nullable = true)
| | |-- FName: string (nullable = true)
| | |-- LName: long (nullable = true)
| | |-- Address: struct (nullable = true)
| | | |-- Applicant: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- Id: long (nullable = true)
| | | | | |-- Type: string (nullable = true)
| | | | | |-- Option: long (nullable = true)
| | | |-- Location: string (nullable = true)
| | | |-- Town: long (nullable = true)
| | |-- IsActive: boolean (nullable = true)
|-- Id: string (nullable = true)
to
root
|-- Data_Record_FName: string (nullable = true)
|-- Data_Record_LName: long (nullable = true)
|-- Data_Record_Address_Applicant_Id: long (nullable = true)
|-- Data_Record_Address_Applicant_Type: string (nullable = true)
|-- Data_Record_Address_Applicant_Option: long (nullable = true)
|-- Data_Record_Address_Location: string (nullable = true)
|-- Data_Record_Address_Town: long (nullable = true)
|-- Data_Record_IsActive: boolean (nullable = true)
|-- Id: string (nullable = true)
I am using the code below, as suggested in this thread:
How to flatten a struct in a Spark dataframe?
from pyspark.sql.functions import col

def flatten_df(nested_df, layers):
    flat_cols = []
    nested_cols = []
    flat_df = []

    flat_cols.append([c[0] for c in nested_df.dtypes if c[1][:6] != 'struct'])
    nested_cols.append([c[0] for c in nested_df.dtypes if c[1][:6] == 'struct'])

    flat_df.append(nested_df.select(
        flat_cols[0] +
        [col(nc + '.' + c).alias(nc + '_' + c)
         for nc in nested_cols[0]
         for c in nested_df.select(nc + '.*').columns]
    ))

    for i in range(1, layers):
        print(flat_cols[i - 1])
        flat_cols.append([c[0] for c in flat_df[i - 1].dtypes if c[1][:6] != 'struct'])
        nested_cols.append([c[0] for c in flat_df[i - 1].dtypes if c[1][:6] == 'struct'])
        flat_df.append(flat_df[i - 1].select(
            flat_cols[i] +
            [col(nc + '.' + c).alias(nc + '_' + c)
             for nc in nested_cols[i]
             for c in flat_df[i - 1].select(nc + '.*').columns]
        ))

    return flat_df[-1]

my_flattened_df = flatten_df(jsonDF, 10)
my_flattened_df.printSchema()
But it doesn't work for array elements. With the above code I get the output below. Can you please help? How can I modify this code to include arrays too?
root
|-- Data_Record_FName: string (nullable = true)
|-- Data_Record_LName: long (nullable = true)
|-- Data_Record_Address_Applicant: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Id: long (nullable = true)
| | |-- Type: string (nullable = true)
| | |-- Option: long (nullable = true)
|-- Data_Record_Address_Location: string (nullable = true)
|-- Data_Record_Address_Town: long (nullable = true)
|-- Data_Record_IsActive: boolean (nullable = true)
|-- Id: string (nullable = true)
This is not a duplicate, as there is no existing post about a generic function to flatten a complex JSON schema that includes arrays.