PySpark: coalesce struct fields inside a struct

I have a struct coming from a data source where the struct fields have multiple possible data types like the following:
|-- priority: struct (nullable = true)
| |-- priority_a: struct (nullable = true)
| | |-- union: boolean (nullable = true)
| | |-- int32: integer (nullable = true)
| | |-- double: double (nullable = true)
| |-- priority_b: integer (nullable = true)
| |-- priority_c: struct (nullable = true)
| | |-- union: boolean (nullable = true)
| | |-- double: double (nullable = true)
| | |-- int32: integer (nullable = true)
| |-- priority_d: struct (nullable = true)
| | |-- union: boolean (nullable = true)
| | |-- double: double (nullable = true)
| | |-- int32: integer (nullable = true)
| |-- priority_e: double (nullable = true)
I want to coalesce the struct fields and cast them to a data type which makes the most sense, for instance:
|-- priority: struct (nullable = true)
| |-- priority_a: integer (nullable = true)
| |-- priority_b: integer (nullable = true)
| |-- priority_c: double (nullable = true)
| |-- priority_d: double (nullable = true)
| |-- priority_e: double (nullable = true)
If a column is not a struct field inside a struct, the following code works perfectly for what I need:
# c, pc, struct_path and t are variables defined elsewhere in my pipeline
try:
    cols = [f'{c}.{col}' for col in source.select(f'{c}.*').columns]
    if f'{struct_path}.union' in cols:
        cols.remove(f'{struct_path}.union')
    source = source.withColumn(pc, f.coalesce(*cols).cast(t))  # t is the type I want to cast to
except:
    source = source.withColumn(c, f.col(c).cast(t))
I would like to do the same recursively for a struct whose nested struct fields can have multiple data types. Is it possible to do so?

A StructType exposes its fields through the fields property, so you can loop over the schema and check each field to see whether it is a StructType:
from pyspark.sql import types as T

for field in schema.fields:
    if isinstance(field.dataType, T.StructType):
        print(field.dataType.fields)
Or, if you want to walk it recursively:
def flatten(schema, prefix=None):
    fields = []
    for field in schema.fields:
        name = prefix + '.' + field.name if prefix else field.name
        dtype = field.dataType
        if isinstance(dtype, T.ArrayType):
            dtype = dtype.elementType
        if isinstance(dtype, T.StructType):
            print(dtype)
            fields += flatten(dtype, prefix=name)
        else:
            fields.append((dtype, name))
    return fields
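Building on that, here is a minimal sketch (not from the original answer) of how the priority struct from the question could be rebuilt: struct-typed members get their non-union sub-fields coalesced, plain members are just cast. The target types and the source DataFrame name are assumptions taken from the question.

from pyspark.sql import functions as F, types as T

# Hypothetical target type per member, based on the desired schema above.
targets = {
    'priority_a': 'int', 'priority_b': 'int',
    'priority_c': 'double', 'priority_d': 'double', 'priority_e': 'double',
}

def coalesce_member(df, name, t):
    # Look up the member's type inside the priority struct.
    field = df.schema['priority'].dataType[name]
    if isinstance(field.dataType, T.StructType):
        # Coalesce the typed sub-fields, skipping the union flag.
        cols = [f'priority.{name}.{sub.name}'
                for sub in field.dataType.fields if sub.name != 'union']
        return F.coalesce(*[F.col(c) for c in cols]).cast(t).alias(name)
    return F.col(f'priority.{name}').cast(t).alias(name)

source = source.withColumn(
    'priority',
    F.struct(*[coalesce_member(source, n, t) for n, t in targets.items()])
)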

Related

Compare two columns in different dataframes, of types String and Array<string> respectively, in PySpark without using the explode function

I have two dfs:
df1:
sku category cep seller state
4858 BDU 00000 xefd SP
df2:
depth price sku seller infos_product
6.1 5.60 47347 gaha [{1, 86800000, 86...
For df2 I have the following schema:
|-- depth: double (nullable = true)
|-- sku: string (nullable = true)
|-- price: double (nullable = true)
|-- seller: string (nullable = true)
|-- infos_produt: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- modality_id: integer (nullable = true)
| | |-- cep_coleta_ini: integer (nullable = true)
| | |-- cep_coleta_fim: integer (nullable = true)
| | |-- cep_entrega_ini: integer (nullable = true)
| | |-- cep_entrega_fim: integer (nullable = true)
| | |-- cubage_factor_entrega: double (nullable = true)
| | |-- value_coleta: double (nullable = true)
| | |-- value_entrega: double (nullable = true)
| | |-- city: string (nullable = true)
| | |-- state: string (nullable = true)
I need to do a check between these df's, something like this:
condi = [(df1.seller_id == df2.seller) & (df2.infos_produt.state == df1.state)]
df_finish = df1.join(df2, on=condi, how='left')
But it returns an error:
AnalysisException: cannot resolve '(infos_produt.`state` = view.coverage_state)' due to data type mismatch: differing types in '(infos_produt.`state` = view.coverage_state)' (array<string> and string).
Can anyone help me?
PS: I would like to resolve this problem without applying explode, because the data is big and the explode approach doesn't work for me.
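One hedged way around the type mismatch, sketched with the column names from the question, is to test membership in the nested array with array_contains instead of comparing the array directly to a string:

from pyspark.sql import functions as F

# array_contains accepts a column value on recent Spark versions; on older
# versions you may need F.expr with fully qualified column names instead.
condi = [
    (df1.seller == df2.seller) &
    F.array_contains(df2['infos_produt.state'], df1.state)
]
df_finish = df1.join(df2, on=condi, how='left')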

How do I check if a column is present in a Spark DataFrame

I am trying to write logic that returns an empty column if a column does not exist in the dataframe.
The schema changes very frequently; sometimes the whole struct is missing (temp1) or the array inside the struct is missing (suffix).
Schema looks like this:
root
|-- id: string (nullable = true)
|-- temp: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- temp1: struct (nullable = true)
| | | |-- code: string (nullable = true)
| | | |-- code1: array (nullable = true)
| | | | |-- element: string (containsNull = true)
| | |-- temp2: struct (nullable = true)
| | | |-- name1: array (nullable = true)
| | | | |-- element: string (containsNull = true)
| | | |-- suffix: array (nullable = true)
| | | | |-- element: string (containsNull = true)
|-- timestamp: timestamp (nullable = true)
Or like this:
root
|-- id: string (nullable = true)
|-- temp: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- temp2: struct (nullable = true)
| | | |-- name1: array (nullable = true)
| | | | |-- element: string (containsNull = true)
|-- timestamp: timestamp (nullable = true)
When I try the logic below against the second schema, I get an exception that the struct was not found:
def has_Column(df: DataFrame, path: String) = Try(df(path)).isSuccess
df.withColumn("id", col("id")).
withColumn("tempLn", explode(col("temp"))).
withColumn("temp1_code1", when(lit(has_Column(df, "tempLn.temp1.code1")), concat_ws(" ",col("tempLn.temp1.code1"))).otherwise(lit("").cast("string"))).
withColumn("temp2_suffix", when(lit(has_Column(df, "tempLn.temp2.suffix")), concat_ws(" ",col("tempLn.temp2.suffix"))).otherwise(lit("").cast("string")))
Error:
org.apache.spark.sql.AnalysisException: No such struct field temp1;
You need to do the existence check outside the select/withColumn... methods. Because you reference the column in the 'then' part of the when expression, Spark tries to resolve it during the analysis of the query.
So you'll need to test it like this:
if (has_Column(df, "tempLn.temp2.suffix"))
  df.withColumn("temp2_suffix", concat_ws(" ", col("tempLn.temp2.suffix")))
else
  df.withColumn("temp2_suffix", lit(""))
To do it for multiple columns you can use foldLeft like this:
val df1 = Seq(
  ("tempLn.temp1.code1", "temp1_code1"),
  ("tempLn.temp2.suffix", "temp2_suffix")
).foldLeft(df) {
  case (acc, (field, newCol)) => {
    if (has_Column(acc, field))
      acc.withColumn(newCol, concat_ws(" ", col(field)))
    else
      acc.withColumn(newCol, lit(""))
  }
}
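For reference, a hedged PySpark translation of the same pattern (assuming df already contains the exploded tempLn column from the question):

from pyspark.sql import functions as F
from pyspark.sql.utils import AnalysisException

def has_column(df, path):
    # Selecting the path forces analysis, so a missing nested field raises here.
    try:
        df.select(path)
        return True
    except AnalysisException:
        return False

for field, new_col in [('tempLn.temp1.code1', 'temp1_code1'),
                       ('tempLn.temp2.suffix', 'temp2_suffix')]:
    if has_column(df, field):
        df = df.withColumn(new_col, F.concat_ws(' ', F.col(field)))
    else:
        df = df.withColumn(new_col, F.lit(''))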

Flatten Nested schema in DataFrame, getting AnalysisException: cannot resolve column name

I have a DF:
|-- str1: struct (nullable = true)
| |-- a1: string (nullable = true)
| |-- a2: string (nullable = true)
| |-- a3: string (nullable = true)
|-- str2: string (nullable = true)
|-- str3: string (nullable = true)
|-- str4: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- b1: string (nullable = true)
| | |-- b2: string (nullable = true)
| | |-- b3: boolean (nullable = true)
| | |-- b4: struct (nullable = true)
| | | |-- c1: integer (nullable = true)
| | | |-- c2: string (nullable = true)
| | | |-- c3: integer (nullable = true)
I am trying to flatten it; to do that I have used the code below:
def flattenSchema(schema: StructType, prefix: String = null): Array[Column] = {
  schema.fields.flatMap(f => {
    val colName = if (prefix == null) f.name else (prefix + "." + f.name)
    f.dataType match {
      case st: StructType => flattenSchema(st, colName)
      case at: ArrayType =>
        val st = at.elementType.asInstanceOf[StructType]
        flattenSchema(st, colName)
      case _ => Array(new Column(colName).as(colName))
    }
  })
}

val d1 = df.select(flattenSchema(df.schema): _*)
It's giving me the output below:
|-- str1.a1: string (nullable = true)
|-- str1.a2: string (nullable = true)
|-- str1.a3: string (nullable = true)
|-- str2: string (nullable = true)
|-- str3: string (nullable = true)
|-- str4.b1: array (nullable = true)
| |-- element: string (containsNull = true)
|-- str4.b2: array (nullable = true)
| |-- element: string (containsNull = true)
|-- str4.b3: array (nullable = true)
| |-- element: string (containsNull = true)
|-- str4.b4.c1: array (nullable = true)
| |-- element: integer (containsNull = true)
|-- str4.b4.c2: array (nullable = true)
| |-- element: string (containsNull = true)
|-- str4.b4.c3: array (nullable = true)
| |-- element: integer (containsNull = true)
The problem arises when I try to query it:
d1.select("str2").show -- no issue here
but when I query any of the flattened nested columns:
d1.select("str1.a1")
Error:
org.apache.spark.sql.AnalysisException: cannot resolve '`str1.a1`' given input columns: ....
What am I doing wrong here? Or is there any other way to achieve the desired result?
Spark does not support a plain string column name that contains a dot (.); the dot is used to access the child columns of a struct type column. If you try to access the same path on the original dataframe df, it should work, since in df it is a struct type.
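As a side note, the flattened column can still be selected by escaping the literal dotted name with backticks (or the flattenSchema aliases could replace dots with underscores to avoid escaping). A minimal PySpark-style sketch using the names from the question:

# Backticks make Spark treat the dotted name as one literal column name.
d1.select("`str1.a1`").show()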

Convert flattened data frame to struct in Spark

I had deeply nested JSON files which I had to process, and in order to do that I had to flatten them because I couldn't find a way to hash some deeply nested fields. This is how my dataframe looks (after flattening):
scala> flattendedJSON.printSchema
root
|-- header_appID: string (nullable = true)
|-- header_appVersion: string (nullable = true)
|-- header_userID: string (nullable = true)
|-- body_cardId: string (nullable = true)
|-- body_cardStatus: string (nullable = true)
|-- body_cardType: string (nullable = true)
|-- header_userAgent_browser: string (nullable = true)
|-- header_userAgent_browserVersion: string (nullable = true)
|-- header_userAgent_deviceName: string (nullable = true)
|-- body_beneficiary_beneficiaryAccounts_beneficiaryAccountOwner: string (nullable = true)
|-- body_beneficiary_beneficiaryPhoneNumbers_beneficiaryPhoneNumber: string (nullable = true)
And I need to convert it back to original structure (before flattening):
scala> nestedJson.printSchema
root
|-- header: struct (nullable = true)
| |-- appID: string (nullable = true)
| |-- appVersion: string (nullable = true)
| |-- userAgent: struct (nullable = true)
| | |-- browser: string (nullable = true)
| | |-- browserVersion: string (nullable = true)
| | |-- deviceName: string (nullable = true)
|-- body: struct (nullable = true)
| |-- beneficiary: struct (nullable = true)
| | |-- beneficiaryAccounts: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- beneficiaryAccountOwner: string (nullable = true)
| | |-- beneficiaryPhoneNumbers: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- beneficiaryPhoneNumber: string (nullable = true)
| |-- cardId: string (nullable = true)
| |-- cardStatus: string (nullable = true)
| |-- cardType: string (nullable = true)
I've managed to do it with a single nested field, but with more levels it doesn't work and I can't find a way to do it properly. Here's what I tried:
val structColumns = flattendedJSON.columns.filter(_.contains("_"))

val structColumnsMap = structColumns.map(_.split("\\_")).
  groupBy(_(0)).mapValues(_.map(_(1)))

val dfExpanded = structColumnsMap.foldLeft(flattendedJSON) { (accDF, kv) =>
  val cols = kv._2.map(v => col("`" + kv._1 + "_" + v + "`").as(v))
  accDF.withColumn(kv._1, struct(cols: _*))
}

val dfResult = structColumns.foldLeft(flattendedJSON)(_ drop _)
It works if I have one nested object (e.g. header_appID), but in the case of header_userAgent_browser, I get an exception:
org.apache.spark.sql.AnalysisException: cannot resolve
'header_userAgent' given input columns: ..
Using Spark 2.3 and Scala 2.11.8
I would recommend using case classes to work with a Dataset instead of flattening the DF and then trying to convert it back to the old JSON format. Even if it has nested objects, you can define a set of case classes to cast it. That allows you to work with object notation, which makes things easier than with a DF.
There are tools where you can provide a sample of the JSON and they generate the classes for you (I use this one: https://json2caseclass.cleverapps.io).
If you still want to convert it from the DF, an alternative could be to create a Dataset using map on your DF. Something like this:
case class NestedNode(fieldC: String, fieldD: String)  // for JSON
case class MainNode(fieldA: String, fieldB: NestedNode) // for JSON
case class FlattenData(fa: String, fc: String, fd: String)

Seq(
  FlattenData("A1", "B1", "C1"),
  FlattenData("A2", "B2", "C2"),
  FlattenData("A3", "B3", "C3")
).toDF
  .as[FlattenData] // Cast it to access with object notation
  .map(flattenItem => {
    MainNode(flattenItem.fa, NestedNode(flattenItem.fc, flattenItem.fd)) // Creating output format
  })
In the end, the schema defined with the classes will be used when you write: yourDS.write.mode(your_save_mode).json(your_target_path).
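If the rebuild needs to stay at the DataFrame level, here is a hedged PySpark sketch of the struct-rebuild approach the asker attempted, written out explicitly with struct(); the column names come from the flattened schema above and the beneficiary array-of-struct fields are omitted for brevity:

from pyspark.sql import functions as F

nested = flattendedJSON.select(
    F.struct(
        F.col('header_appID').alias('appID'),
        F.col('header_appVersion').alias('appVersion'),
        F.struct(
            F.col('header_userAgent_browser').alias('browser'),
            F.col('header_userAgent_browserVersion').alias('browserVersion'),
            F.col('header_userAgent_deviceName').alias('deviceName'),
        ).alias('userAgent'),
    ).alias('header'),
    F.struct(
        F.col('body_cardId').alias('cardId'),
        F.col('body_cardStatus').alias('cardStatus'),
        F.col('body_cardType').alias('cardType'),
    ).alias('body'),
)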

Spark: pruning nested columns/fields

I have a question about the possibility of pruning nested fields.
I'm developing a source for a High Energy Physics data format (ROOT).
Below is the schema for a file read with the DataSource that I'm developing.
root
|-- EventAuxiliary: struct (nullable = true)
| |-- processHistoryID_: struct (nullable = true)
| | |-- hash_: string (nullable = true)
| |-- id_: struct (nullable = true)
| | |-- run_: integer (nullable = true)
| | |-- luminosityBlock_: integer (nullable = true)
| | |-- event_: long (nullable = true)
| |-- processGUID_: string (nullable = true)
| |-- time_: struct (nullable = true)
| | |-- timeLow_: integer (nullable = true)
| | |-- timeHigh_: integer (nullable = true)
| |-- luminosityBlock_: integer (nullable = true)
| |-- isRealData_: boolean (nullable = true)
| |-- experimentType_: integer (nullable = true)
| |-- bunchCrossing_: integer (nullable = true)
| |-- orbitNumber_: integer (nullable = true)
| |-- storeNumber_: integer (nullable = true)
The DataSource is here https://github.com/diana-hep/spark-root/blob/master/src/main/scala/org/dianahep/sparkroot/experimental/package.scala#L62
When building a reader using the buildReader method of the FileFormat:
override def buildReaderWithPartitionValues(
    sparkSession: SparkSession,
    dataSchema: StructType,
    partitionSchema: StructType,
    requiredSchema: StructType,
    filters: Seq[Filter],
    options: Map[String, String],
    hadoopConf: Configuration): PartitionedFile => Iterator[InternalRow]
I see that requiredSchema always contains all of the fields/members of the top-level column that is being looked at. Meaning that when I want to select a particular nested field with
df.select("EventAuxiliary.id_.run_"), requiredSchema will again be the full struct for that top-level column ("EventAuxiliary"). I would expect the schema to be something like this:
root
|-- EventAuxiliary: struct...
| |-- id_: struct ...
| | |-- run_: integer
since that is the only schema required by the select statement.
Basically, I want to know how I can prune nested fields at the data source level. I thought that requiredSchema would contain only the fields coming from the df.select.
I'm trying to see what Avro/Parquet do and found this: https://github.com/apache/spark/pull/14957/files
If there are suggestions/comments - would be appreciated!
Thanks!
VK
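As a hedged side note: in later Spark releases, nested schema pruning for the built-in columnar formats sits behind an optimizer flag, and only when that rule applies does requiredSchema carry just the selected leaves; whether a custom FileFormat such as spark-root sees a pruned schema depends on the Spark version and the source implementation. A minimal sketch:

# Enable nested schema pruning (default in newer releases) and inspect the
# plan to see which nested leaves are actually requested from the source.
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", "true")
df.select("EventAuxiliary.id_.run_").explain(True)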