Explode array of structs to columns in PySpark

I'd like to explode an array of structs to columns (as defined by the struct fields). E.g. this schema:
root
|-- news_style_super: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: struct (containsNull = true)
| | | |-- name: string (nullable = true)
| | | |-- sbox_ctr: double (nullable = true)
| | | |-- wise_ctr: double (nullable = true)
Should be transformed to
|-- name: string (nullable = true)
|-- sbox_ctr: double (nullable = true)
|-- wise_ctr: double (nullable = true)
How can I do this?

from pyspark.sql.functions import explode
from pyspark.sql.types import ArrayType, StructType, StringType, LongType, DoubleType

def get_final_dataframe(pathname, df):
    # Walks a dot-separated path, exploding arrays and selecting structs along
    # the way. hasColumn (not shown) is assumed to check whether the column
    # path exists in df.
    cur_names = pathname.split(".")
    if len(cur_names) > 1:
        root_name = cur_names[0]
        delimiter = "."
        new_path_name = delimiter.join(cur_names[1:len(cur_names)])
        for field in df.schema.fields:
            if field.name == root_name:
                if type(field.dataType) == ArrayType:
                    return get_final_dataframe(pathname, df.select(explode(root_name).alias(root_name)))
                elif type(field.dataType) == StructType:
                    if hasColumn(df, delimiter.join(cur_names[0:2])):
                        return get_final_dataframe(new_path_name, df.select(delimiter.join(cur_names[0:2])))
                    else:
                        return -1, -1
                else:
                    return -1, -1
    else:
        root_name = cur_names[0]
        for field in df.schema.fields:
            if field.name == root_name:
                if type(field.dataType) == StringType:
                    return df, "string"
                elif type(field.dataType) == LongType:
                    return df, "numeric"
                elif type(field.dataType) == DoubleType:
                    return df, "numeric"
                else:
                    return df, -1
    return -1, -1
Then you can call it like this:
key = "a.b.c.name"
# key = "context.content_feature.tag.name"
df2, field_type = get_final_dataframe(key, df1)
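For the specific schema in the question, a shorter route is also possible (a minimal sketch, assuming df1 has the news_style_super column shown above): explode the two array levels, then expand the struct fields.
from pyspark.sql.functions import explode

flat = (df1
    .select(explode("news_style_super").alias("inner"))  # array<struct>
    .select(explode("inner").alias("s"))                  # struct
    .select("s.*"))                                       # name, sbox_ctr, wise_ctr
flat.printSchema()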

Related

Parsing complex nested JSON in Spark Scala

I have a complex JSON with the schema below which I need to convert to a DataFrame in Spark. Since the schema is complex, I am unable to do it completely.
The JSON file has a very complex schema, so using explode with column select might be problematic.
Below is the schema which I am trying to convert:
root
|-- data: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
|-- meta: struct (nullable = true)
| |-- view: struct (nullable = true)
| | |-- approvals: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- reviewedAt: long (nullable = true)
| | | | |-- reviewedAutomatically: boolean (nullable = true)
| | | | |-- state: string (nullable = true)
| | | | |-- submissionDetails: struct (nullable = true)
| | | | | |-- permissionType: string (nullable =
I have used the code below to flatten the data, but there is still nested data which I need to flatten into columns:
def flattenStructSchema(schema: StructType, prefix: String = null): Array[Column] = {
  schema.fields.flatMap(f => {
    val columnName = if (prefix == null) f.name else (prefix + "." + f.name)
    f.dataType match {
      case st: StructType => flattenStructSchema(st, columnName)
      case _ => Array(col(columnName).as(columnName.replace(".", "_")))
    }
  })
}
val df2 = df.select(col("meta"))
val df4 = df.select(col("data"))
val df3 = df2.select(flattenStructSchema(df2.schema): _*)
df3.printSchema()
df3.show(10, false)
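The posted helper only descends into struct columns, so array columns such as meta.view.approvals stay nested. One possible way to handle those is the explode approach used elsewhere in this thread; a minimal PySpark-style sketch (field names taken from the schema printed above):
from pyspark.sql.functions import col, explode_outer

# Explode the array of approval structs into one row per approval, then pull
# the struct fields up as columns.
approvals = (df
    .select(explode_outer("meta.view.approvals").alias("appr"))
    .select(col("appr.reviewedAt"),
            col("appr.reviewedAutomatically"),
            col("appr.state")))
approvals.printSchema()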

How to filter rows having array of struct?

I have a DataFrame with an array of structs, and I want to select only certain fields of the structs inside that array. Is that possible while I am iterating through rows?
Schema
root
|-- day: long (nullable = true)
|-- table_row: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- DATE: string (nullable = true)
| | |-- ADMISSION_NUM: string (nullable = true)
| | |-- SOURCE_CODE: string (nullable = true)
What I am doing is iterating through rows. Can we select the struct fields row-wise within the array? I only want to know how this is possible.
def keepColumnInarray(columns: Set[String], row: Row): Row = {
  //Some
}
For example, if I want to keep only the column "DATE", then keepColumnInarray should select only this:
Output Schema
root
|-- day: long (nullable = true)
|-- table_row: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- DATE: string (nullable = true)
Spark >= 2.4
df.withColumn("table_row", expr("TRANSFORM(table_row, x -> named_struct('DATE', x.DATE))"))
This will convert
table_row: array
struct (nullable = true)
| |-- DATE: string (nullable = true)
| |-- ADMISSION_NUM: string (nullable = true)
| |-- SOURCE_CODE: string (nullable = true)
to
table_row: array
struct (nullable = true)
| |-- DATE: string (nullable = true)
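The same expr string can also be used unchanged from PySpark (a minimal sketch, Spark >= 2.4):
from pyspark.sql.functions import expr

# transform() rebuilds each struct in the array keeping only the DATE field.
df = df.withColumn("table_row", expr("TRANSFORM(table_row, x -> named_struct('DATE', x.DATE))"))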
Update-1 (based on comments)
Spark < 2.4
Use the UDF below to select columns:
val df = spark.range(2).withColumnRenamed("id", "day")
  .withColumn("table_row", expr("array(named_struct('DATE', 'sample_date'," +
    " 'ADMISSION_NUM', 'sample_adm_num', 'SOURCE_CODE', 'sample_source_code'))"))
df.show(false)
df.printSchema()
//
// +---+---------------------------------------------------+
// |day|table_row |
// +---+---------------------------------------------------+
// |0 |[[sample_date, sample_adm_num, sample_source_code]]|
// |1 |[[sample_date, sample_adm_num, sample_source_code]]|
// +---+---------------------------------------------------+
//
// root
// |-- day: long (nullable = false)
// |-- table_row: array (nullable = false)
// | |-- element: struct (containsNull = false)
// | | |-- DATE: string (nullable = false)
// | | |-- ADMISSION_NUM: string (nullable = false)
// | | |-- SOURCE_CODE: string (nullable = false)
//
import scala.collection.mutable
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType}

// Rebuild each struct Row keeping only the requested fields, with a schema
// restricted to those fields.
def keepColumnInarray(columnsToKeep: Seq[String], rows: mutable.WrappedArray[Row]) = {
  rows.map(r => {
    new GenericRowWithSchema(r.getValuesMap(columnsToKeep).values.toArray,
      StructType(r.schema.filter(s => columnsToKeep.contains(s.name))))
  })
}
val keepColumns = udf((columnsToKeep: Seq[String], rows: mutable.WrappedArray[Row]) =>
  keepColumnInarray(columnsToKeep, rows),
  ArrayType(StructType(StructField("DATE", StringType) :: Nil)))

val processedDF = df
  .withColumn("table_row_new", keepColumns(array(lit("DATE")), col("table_row")))
processedDF.show(false)
processedDF.printSchema()
//
// +---+---------------------------------------------------+---------------+
// |day|table_row |table_row_new |
// +---+---------------------------------------------------+---------------+
// |0 |[[sample_date, sample_adm_num, sample_source_code]]|[[sample_date]]|
// |1 |[[sample_date, sample_adm_num, sample_source_code]]|[[sample_date]]|
// +---+---------------------------------------------------+---------------+
//
// root
// |-- day: long (nullable = false)
// |-- table_row: array (nullable = false)
// | |-- element: struct (containsNull = false)
// | | |-- DATE: string (nullable = false)
// | | |-- ADMISSION_NUM: string (nullable = false)
// | | |-- SOURCE_CODE: string (nullable = false)
// |-- table_row_new: array (nullable = true)
// | |-- element: struct (containsNull = true)
// | | |-- DATE: string (nullable = true)
//

Flatten Nested schema in DataFrame, getting AnalysisException: cannot resolve column name

I have a DF:
|-- str1: struct (nullable = true)
| |-- a1: string (nullable = true)
| |-- a2: string (nullable = true)
| |-- a3: string (nullable = true)
|-- str2: string (nullable = true)
|-- str3: string (nullable = true)
|-- str4: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- b1: string (nullable = true)
| | |-- b2: string (nullable = true)
| | |-- b3: boolean (nullable = true)
| | |-- b4: struct (nullable = true)
| | | |-- c1: integer (nullable = true)
| | | |-- c2: string (nullable = true)
| | | |-- c3: integer (nullable = true)
I am trying to flatten it; to do that I have used the code below:
def flattenSchema(schema: StructType, prefix: String = null): Array[Column] = {
  schema.fields.flatMap(f => {
    val colName = if (prefix == null) f.name else (prefix + "." + f.name)
    f.dataType match {
      case st: StructType => flattenSchema(st, colName)
      case at: ArrayType =>
        val st = at.elementType.asInstanceOf[StructType]
        flattenSchema(st, colName)
      case _ => Array(new Column(colName).as(colName))
    }
  })
}
val d1 = df.select(flattenSchema(df.schema): _*)
It's giving me the output below:
|-- str1.a1: string (nullable = true)
|-- str1.a2: string (nullable = true)
|-- str1.a3: string (nullable = true)
|-- str2: string (nullable = true)
|-- str3: string (nullable = true)
|-- str4.b1: array (nullable = true)
| |-- element: string (containsNull = true)
|-- str4.b2: array (nullable = true)
| |-- element: string (containsNull = true)
|-- str4.b3: array (nullable = true)
| |-- element: string (containsNull = true)
|-- str4.b4.c1: array (nullable = true)
| |-- element: integer (containsNull = true)
|-- str4.b4.c2: array (nullable = true)
| |-- element: string (containsNull = true)
|-- str4.b4.c3: array (nullable = true)
| |-- element: integer (containsNull = true)
The problem arises when I try to query it:
d1.select("str2").show // works fine
But when I query any flattened nested column:
d1.select("str1.a1")
Error:
org.apache.spark.sql.AnalysisException: cannot resolve '`str1.a1`' given input columns: ....
What am I doing wrong here? Or is there any other way to achieve the desired result?
Spark treats a dot (.) in a column name as access to a child column of a struct type column, so a column whose name literally contains a dot cannot be resolved from a plain string. If you access the same path on the original dataframe df it works, because there str1 is a struct type column.
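For illustration (shown in PySpark-style syntax; the Scala calls are identical), the flattened columns of d1 can be referenced by backtick-quoting the name, or avoided entirely by aliasing with underscores as the flattenStructSchema snippet earlier in this thread does:
# Backticks tell Spark the dot is part of the name, not struct-field access.
d1.select("`str1.a1`").show()
# On the original df the dot is struct-field access, so this works as-is.
df.select("str1.a1").show()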

How to explode StructType to rows from json dataframe in Spark rather than to columns

I read a nested JSON with this schema:
root
|-- company: struct (nullable = true)
| |-- 0: string (nullable = true)
| |-- 1: string (nullable = true)
| |-- 10: string (nullable = true)
| |-- 100: string (nullable = true)
| |-- 101: string (nullable = true)
| |-- 102: string (nullable = true)
| |-- 103: string (nullable = true)
| |-- 104: string (nullable = true)
| |-- 105: string (nullable = true)
| |-- 106: string (nullable = true)
| |-- 107: string (nullable = true)
| |-- 108: string (nullable = true)
| |-- 109: string (nullable = true)
When I try to:
df.select(col("company.*"))
I get every field of the struct "company" as a column, but I want them as rows: one row per id, with the string in another column. Instead of:
0 1 10 100 101 102
"hey" "yooyo" "yuyu" "hey" "yooyo" "yuyu"
I would rather get something like:
id name
0 "hey"
1 "yoooyo"
10 "yuuy"
100 "hey"
101 "yooyo"
102 "yuyu"
Thanks in advance for your help,
Tricky
Try this using union:
val dfExpl = df.select("company.*")
dfExpl.columns
  .map(name => dfExpl.select(lit(name), col(name)))
  .reduce(_ union _)
  .show
Or alternatively using array/explode:
val dfExpl = df.select("company.*")
val selectExpr = dfExpl
  .columns
  .map(name =>
    struct(
      lit(name).as("id"),
      col(name).as("value")
    ).as("col")
  )
dfExpl
  .select(
    explode(array(selectExpr: _*))
  )
  .select("col.*")
  .show()
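For reference, the same array/explode idea translated to PySpark (a minimal sketch, assuming the company struct fields are all strings as in the schema above):
from pyspark.sql.functions import array, col, explode, lit, struct

# Build one struct per column holding (id = column name, value = column value),
# pack them into an array, explode to one row per id, then unpack the struct.
df_expl = df.select("company.*")
select_expr = [struct(lit(c).alias("id"), col(c).alias("value")).alias("col")
               for c in df_expl.columns]
df_expl.select(explode(array(*select_expr))).select("col.*").show()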

Explode array in apache spark Data Frame

I am trying to flatten the schema of an existing dataframe with nested fields. The structure of my dataframe is something like this:
root
|-- Id: long (nullable = true)
|-- Type: string (nullable = true)
|-- Uri: string (nullable = true)
|-- Type: array (nullable = true)
| |-- element: string (containsNull = true)
|-- Gender: array (nullable = true)
| |-- element: string (containsNull = true)
Type and Gender can contain an array of elements, one element, or a null value.
I tried to use the following code:
var resDf = df.withColumn("FlatType", explode(df("Type")))
But as a result I lose the rows in which Type is null. For example, if I have 10 rows and Type is null in 7 of them and non-null in 3, the resulting data frame after explode has only three rows.
How can I keep rows with null values but explode array of values?
I found something of a workaround but am still stuck in one place. For standard types we can do the following:
def customExplode(df: DataFrame, field: String, colType: String): org.apache.spark.sql.Column = {
  var exploded = None: Option[org.apache.spark.sql.Column]
  colType.toLowerCase() match {
    case "string" =>
      val avoidNull = udf((column: Seq[String]) =>
        if (column == null) Seq[String](null)
        else column)
      exploded = Some(explode(avoidNull(df(field))))
    case "boolean" =>
      val avoidNull = udf((xs: Seq[Boolean]) =>
        if (xs == null) Seq[Boolean]()
        else xs)
      exploded = Some(explode(avoidNull(df(field))))
    case _ => exploded = Some(explode(df(field)))
  }
  exploded.get
}
And after that just use it like this:
val explodedField = customExplode(resultDf, fieldName, fieldTypeMap(field))
resultDf = resultDf.withColumn(newName, explodedField)
However, I have a problem with struct types, for the following kind of structure:
|-- Address: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- AddressType: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- DEA: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- Number: array (nullable = true)
| | | | | |-- element: string (containsNull = true)
| | | | |-- ExpirationDate: array (nullable = true)
| | | | | |-- element: timestamp (containsNull = true)
| | | | |-- Status: array (nullable = true)
| | | | | |-- element: string (containsNull = true)
How can we process that kind of schema when DEA is null?
Thank you in advance.
P.S. I tried to use lateral views, but the result is the same.
Maybe you can try using when:
val resDf = df.withColumn("FlatType", when(df("Type").isNotNull, explode(df("Type"))))
As shown in the when function's documentation, the value null is inserted for the values that do not match the conditions.
I think what you want is to use explode_outer instead of explode.
See the Apache Spark docs: explode and explode_outer.
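For example, a minimal sketch in PySpark (the Scala functions.explode_outer call is analogous):
from pyspark.sql.functions import explode_outer

# Unlike explode, explode_outer keeps rows whose array is null or empty,
# emitting a null FlatType instead of dropping the row (Spark >= 2.2).
resDf = df.withColumn("FlatType", explode_outer(df["Type"]))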