I have a nested schema with a depth level of 6. I am facing issues while traversing each element in the schema to modify a column. I have a list containing the column names that need to be modified (hashed/anonymized).
My initial thought is to traverse each element in the schema, compare each column with the list items, and modify it once there is a match, but I do not know how to do it.
List values: ['type', 'name', 'work', 'email']
Sample schema:
|-- abc: struct (nullable = true)
| |-- xyz: struct (nullable = true)
| | |-- abc123: string (nullable = true)
| | |-- services: struct (nullable = true)
| | | |-- service: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- type: string (nullable = true)
| | | | | |-- subtype: string (nullable = true)
| | | |-- name: string (nullable = true)
| | |-- details: struct (nullable = true)
| | | |-- work: string (nullable = true)
Note: if I flatten the schema it creates 600+ columns, so I am looking for a solution that modifies the columns dynamically (no hardcoding).
EDIT:
In case it helps, I am sharing my code where I am trying to modify the value, but it is broken:
from pyspark.sql.types import StructType, StructField, ArrayType, StringType

def change_nested_field_value(schema, new_df, fields_to_change, parent=""):
    new_schema = []
    if isinstance(schema, StringType):
        return schema
    for field in schema:
        full_field_name = field.name
        short_name = full_field_name.split('.')
        if parent:
            full_field_name = parent + "." + full_field_name
        # print(full_field_name)
        if short_name[-1] not in fields_to_change:
            if isinstance(field.dataType, StructType):
                inner_schema = change_nested_field_value(field.dataType, new_df, fields_to_change, full_field_name)
                new_schema.append(StructField(field.name, inner_schema))
            elif isinstance(field.dataType, ArrayType):
                inner_schema = change_nested_field_value(field.dataType.elementType, new_df, fields_to_change, full_field_name)
                new_schema.append(StructField(field.name, ArrayType(inner_schema)))
            else:
                new_schema.append(StructField(field.name, field.dataType))
        # else:
            ############ this is where I have access to the nested element. I need to modify the value here
            # print(StructField(field.name, field.dataType))
    return StructType(new_schema)
You can use the answers from here after converting the struct column into JSON using to_json and finally converting the JSON string back to a struct using from_json. The schema for from_json can be inferred from the original struct.
from typing import Dict, Any, List
from pyspark.sql import functions as F
from pyspark.sql.types import StringType, StructType
import json

def encrypt(s: str) -> str:
    return f"encrypted_{s}"

def walk_dict(struct: Dict[str, Any], fields_to_encrypt: List[str]):
    keys_copy = set(struct.keys())
    for k in keys_copy:
        if k in fields_to_encrypt and isinstance(struct[k], str):
            struct[k] = encrypt(struct[k])
        else:
            walk_fields(struct[k], fields_to_encrypt)

def walk_fields(field: Any, fields_to_encrypt: List[str]):
    if isinstance(field, dict):
        walk_dict(field, fields_to_encrypt)
    if isinstance(field, list):
        [walk_fields(e, fields_to_encrypt) for e in field]

def encrypt_fields(json_string: str) -> str:
    fields_to_encrypt = ["type", "subtype", "work"]
    as_json = json.loads(json_string)
    walk_fields(as_json, fields_to_encrypt)
    return json.dumps(as_json)

field_encryption_udf = F.udf(encrypt_fields, StringType())
data = [{
"abc": {
"xyz": {
"abc123": "abc123Value",
"services": {
"service": [
{
"type": "type_element_1",
"subtype": "subtype_element_1",
},
{
"type": "type_element_2",
"subtype": "subtype_element_2",
}
],
"name": "nameVal"
},
"details": {
"work": "workVal"
}
}
}
}, ]
df = spark.read.json(spark.sparkContext.parallelize(data))
schema_for_abc_column = StructType.fromJson([x.jsonValue()["type"] for x in df.schema.fields if x.name == "abc"][0])
df.withColumn("value_changes_column", F.from_json(field_encryption_udf(F.to_json("abc")), schema_for_abc_column)).show(truncate=False)
"""
+-----------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|abc |value_changes_column |
+-----------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{{abc123Value, {workVal}, {nameVal, [{subtype_element_1, type_element_1}, {subtype_element_2, type_element_2}]}}}|{{abc123Value, {encrypted_workVal}, {nameVal, [{encrypted_subtype_element_1, encrypted_type_element_1}, {encrypted_subtype_element_2, encrypted_type_element_2}]}}}|
+-----------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------+
"""
Related
I am trying to iterate over an array of arrays as a column in a Spark dataframe and am looking for the best way to do this.
Schema:
root
|-- Animal: struct (nullable = true)
| |-- Species: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- mammal: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- description: string (nullable = true)
Currently I am using this logic. This only gets the first array.
df.select(
col("Animal.Species").getItem(0).getItem("mammal").getItem("description")
)
Pseudo Logic:
col("Animal.Species").getItem(0).getItem("mammal").getItem("description")
+
col("Animal.Species").getItem(1).getItem("mammal").getItem("description")
+
col("Animal.Species").getItem(2).getItem("mammal").getItem("description")
+
col("Animal.Species").getItem(...).getItem("mammal").getItem("description")
Desired Example Output (flattened elements as string)
llama, sheep, rabbit, hare
Not obvious, but you can use . (or the getField method of Column) to select "through" arrays of structs. Selecting Animal.Species.mammal returns an array of arrays of the innermost structs. Unfortunately, this array of arrays prevents you from being able to drill further down with something like Animal.Species.mammal.description, so you need to flatten it first, then use getField().
If I understand your schema correctly, the following JSON should be a valid input:
{
"Animal": {
"Species": [
{
"mammal": [
{ "description": "llama" },
{ "description": "sheep" }
]
},
{
"mammal": [
{ "description": "rabbit" },
{ "description": "hare" }
]
}
]
}
}
val df = spark.read.json("data.json")
df.printSchema
// root
// |-- Animal: struct (nullable = true)
// | |-- Species: array (nullable = true)
// | | |-- element: struct (containsNull = true)
// | | | |-- mammal: array (nullable = true)
// | | | | |-- element: struct (containsNull = true)
// | | | | | |-- description: string (nullable = true)
df.select("Animal.Species.mammal").show(false)
// +----------------------------------------+
// |mammal |
// +----------------------------------------+
// |[[{llama}, {sheep}], [{rabbit}, {hare}]]|
// +----------------------------------------+
df.select(flatten(col("Animal.Species.mammal"))).show(false)
// +------------------------------------+
// |flatten(Animal.Species.mammal) |
// +------------------------------------+
// |[{llama}, {sheep}, {rabbit}, {hare}]|
// +------------------------------------+
This is now an array of structs and you can use getField("description") to obtain the array of interest:
df.select(flatten(col("Animal.Species.mammal")).getField("description")).show(false)
// +--------------------------------------------------------+
// |flatten(Animal.Species.mammal AS mammal#173).description|
// +--------------------------------------------------------+
// |[llama, sheep, rabbit, hare] |
// +--------------------------------------------------------+
Finally, array_join with separator ", " can be used to obtain the desired string:
df.select(
array_join(
flatten(col("Animal.Species.mammal")).getField("description"),
", "
) as "animals"
).show(false)
// +--------------------------+
// |animals |
// +--------------------------+
// |llama, sheep, rabbit, hare|
// +--------------------------+
You can apply explode two times: first on Animal.Species and then on the result of the first explode:
import org.apache.spark.sql.functions._
df.withColumn("tmp", explode(col("Animal.Species")))
.withColumn("tmp", explode(col("tmp.mammal")))
.select("tmp.description")
.show()
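The two explodes give one row per description. If you still need the single comma-separated string from the question, one option (a sketch, assuming it is acceptable to add a surrogate row id to group by; the id column is not part of the original code) is to aggregate the exploded rows back:
import org.apache.spark.sql.functions._

df.withColumn("id", monotonically_increasing_id())   // surrogate key per original row
  .withColumn("tmp", explode(col("Animal.Species")))
  .withColumn("tmp", explode(col("tmp.mammal")))
  .groupBy("id")
  .agg(array_join(collect_list(col("tmp.description")), ", ") as "animals")
  .show(false)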
I have a complex JSON with the below schema which I need to convert to a dataframe in Spark. Since the schema is complex, I am unable to do it completely.
The JSON file has a very complex schema, and using explode with column select might be problematic.
Below is the schema which I am trying to convert:
root
|-- data: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
|-- meta: struct (nullable = true)
| |-- view: struct (nullable = true)
| | |-- approvals: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- reviewedAt: long (nullable = true)
| | | | |-- reviewedAutomatically: boolean (nullable = true)
| | | | |-- state: string (nullable = true)
| | | | |-- submissionDetails: struct (nullable = true)
| | | | | |-- permissionType: string (nullable =
I have used the below code to flatten the data, but there is still nested data which I need to flatten into columns:
def flattenStructSchema(schema: StructType, prefix: String = null): Array[Column] = {
  schema.fields.flatMap(f => {
    val columnName = if (prefix == null) f.name else (prefix + "." + f.name)
    f.dataType match {
      case st: StructType => flattenStructSchema(st, columnName)
      case _ => Array(col(columnName).as(columnName.replace(".", "_")))
    }
  })
}
val df2 = df.select(col("meta"))
val df4 = df.select(col("data"))
val df3 = df2.select(flattenStructSchema(df2.schema): _*)
df3.printSchema()
df3.show(10, false)
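For what it's worth, a minimal sketch of one way to also handle the array columns is to alternate the struct flattening above with explode_outer (which keeps rows whose array is null) until neither structs nor arrays remain. This assumes the flattenStructSchema helper above is in scope and is an illustration, not a tested solution for the full schema:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import scala.annotation.tailrec

// Flatten all structs, then explode the first remaining array column and repeat.
@tailrec
def flattenAll(df: DataFrame): DataFrame = {
  val flattened = df.select(flattenStructSchema(df.schema): _*)
  flattened.schema.fields.find(_.dataType.isInstanceOf[ArrayType]) match {
    case Some(f) => flattenAll(flattened.withColumn(f.name, explode_outer(col(f.name))))
    case None    => flattened
  }
}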
I have a Spark dataframe that looks like this in JSON:
{
"site_id": "ABC",
"region": "Texas",
"areas": [
{
"Carbon": [
"ABB",
"ABD",
"ABE"
]
}
],
"site_name": "ABC"
}
and I need to turn the "areas" column into this:
"areas":
[
{
"area_name": "Carbon",
"pls": [
{
"pl_name": "ABB"
},
{
"pl_name": "ABD"
},
{
"pl_name": "ABE"
}
]
}
]
I already did df.collect() and manipulated the dictionary directly, but that created some complexity. Is there a way to do this directly in the dataframe itself?
Edit:
Here is the input schema
|-- site_id: string
|-- region: string
|-- site_name: string
|-- areas: array
| |-- element: map
| | |-- keyType: string
| | |-- valueType: array
| | | |-- element: string
In the output schema, the goal is to have the valueType also be a dictionary. I actually save the data to a DynamoDB table, so the output should look like the example I provided when scanned from the table.
Processing and producing JSON is not Spark's strength, as far as I understand. The easiest approach (that is, not flattening, then grouping by, then collecting, then pivoting, etc.) is using a UDF. I completely understand a UDF is not as fast as a built-in Spark transformation, but if your data scale is not that big then it shouldn't be a problem.
from pyspark.sql import functions as F
from pyspark.sql import types as T

def transform_json(arr):
    r = []
    for e in arr:
        for k in e.keys():
            r.append({
                'area_name': k,
                'pls': [{'pl_name': i} for i in e[k]]
            })
    return r
(df
.withColumn('areas', F.udf(
transform_json,
T.ArrayType(T.StructType([
T.StructField('area_name', T.StringType()),
T.StructField('pls', T.ArrayType(T.StructType([
T.StructField('pl_name', T.StringType())
]))),
])
))('areas')
)
.show(10, False)
)
# Output
# +------------------------------------------------------------------+
# |areas |
# +------------------------------------------------------------------+
# |[{Carbon, [{ABB}, {ABD}, {ABE}]}, {Oxygen, [{ABB}, {ABD}, {ABE}]}]|
# +------------------------------------------------------------------+
# Schema
# root
# |-- areas: array (nullable = true)
# | |-- element: struct (containsNull = true)
# | | |-- area_name: string (nullable = true)
# | | |-- pls: array (nullable = true)
# | | | |-- element: struct (containsNull = true)
# | | | | |-- pl_name: string (nullable = true)
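For comparison, the same reshaping can also be expressed without a UDF using built-in higher-order functions. A sketch in the Scala API (Spark 3.0+), assuming areas is array<map<string, array<string>>> as in the input schema; not a drop-in for the Python code above:
import org.apache.spark.sql.functions._

df.withColumn("areas",
  flatten(transform(col("areas"), m =>
    transform(map_entries(m), e =>
      struct(
        e("key").as("area_name"),
        transform(e("value"), v => struct(v.as("pl_name"))).as("pls")
      )
    )
  ))
).show(10, false)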
I am new to Spark and Scala. I have a JSON array struct as input, similar to the below schema.
root
|-- entity: struct (nullable = true)
| |-- email: string (nullable = true)
| |-- primaryAddresses: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- postalCode: string (nullable = true)
| | | |-- streetAddress: struct (nullable = true)
| | | | |-- line1: string (nullable = true)
I flattened the array struct to the below sample Dataframe
+-------------+--------------------------------------+--------------------------------------+
|entity.email |entity.primaryAddresses[0].postalCode |entity.primaryAddresses[1].postalCode |....
+-------------+--------------------------------------+--------------------------------------+
|a#b.com | | |
|a#b.com | |12345 |
|a#b.com |12345 | |
|a#b.com |0 |0 |
+-------------+--------------------------------------+--------------------------------------+
My end goal is to calculate presence/absence/zero counts for each of the columns for data quality metrics. But before I calculate the data quality metrics, I am looking for an approach to derive one new column for each of the array column elements, as below, such that:
if all values of a particular array element are empty, then the derived column is empty for that element
if at least one value is present for an array element, the element presence is considered 1
if all values of an array element are zero, then I mark the element as zero (I calibrate this as presence = 1 and zero = 1 when I calculate data quality later)
Below is a sample intermediate dataframe that I am trying to achieve, with a column derived for each of the array elements. The original array elements are dropped.
+-------------+--------------------------------------+
|entity.email |entity.primaryAddresses.postalCode |.....
+-------------+--------------------------------------+
|a#b.com | |
|a#b.com |1 |
|a#b.com |1 |
|a#b.com |0 |
+-------------+--------------------------------------+
The input JSON record elements are dynamic and can change. To derive columns for array elements, I build a Scala map whose key is the column name without the array index (example: entity.primaryAddresses.postalCode) and whose value is the list of array elements to run rules on for that key. I am looking for an approach to achieve the above intermediate data frame.
One concern is that for certain input files, after I flatten the dataframe, the column count exceeds 70k+. And since the record count is expected to be in the millions, I am wondering whether, instead of flattening the JSON, I should explode each of the elements for better performance.
Appreciate any ideas. Thank you.
I created a helper function so you can directly call df.explodeColumns on a DataFrame.
The code below will flatten multi-level array and struct type columns.
Use the function below to extract the columns and then apply your transformations on them.
scala> df.printSchema
root
|-- entity: struct (nullable = false)
| |-- email: string (nullable = false)
| |-- primaryAddresses: array (nullable = false)
| | |-- element: struct (containsNull = false)
| | | |-- postalCode: string (nullable = false)
| | | |-- streetAddress: struct (nullable = false)
| | | | |-- line1: string (nullable = false)
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import scala.annotation.tailrec
import scala.util.Try
implicit class DFHelpers(df: DataFrame) {
  // One level of struct flattening: promote struct fields to top-level columns
  def columns = {
    val dfColumns = df.columns.map(_.toLowerCase)
    df.schema.fields.flatMap { data =>
      data match {
        case column if column.dataType.isInstanceOf[StructType] => {
          column.dataType.asInstanceOf[StructType].fields.map { field =>
            val columnName = column.name
            val fieldName = field.name
            col(s"${columnName}.${fieldName}").as(s"${columnName}_${fieldName}")
          }.toList
        }
        case column => List(col(s"${column.name}"))
      }
    }
  }

  // Recursively flatten until no struct columns remain
  def flatten: DataFrame = {
    val empty = df.schema.filter(_.dataType.isInstanceOf[StructType]).isEmpty
    empty match {
      case false =>
        df.select(columns: _*).flatten
      case _ => df
    }
  }

  // Explode every array column (keeping rows with null arrays) and re-flatten until no arrays remain
  def explodeColumns = {
    @tailrec
    def columns(cdf: DataFrame): DataFrame = cdf.schema.fields.filter(_.dataType.typeName == "array") match {
      case c if !c.isEmpty => columns(c.foldLeft(cdf)((dfa, field) => {
        dfa.withColumn(field.name, explode_outer(col(s"${field.name}"))).flatten
      }))
      case _ => cdf
    }
    columns(df.flatten)
  }
}
scala> df.explodeColumns.printSchema
root
|-- entity_email: string (nullable = false)
|-- entity_primaryAddresses_postalCode: string (nullable = true)
|-- entity_primaryAddresses_streetAddress_line1: string (nullable = true)
You can leverage a custom user-defined function that can help you do the data quality metrics.
val postalUdf = udf((postalCode0: Int, postalCode1: Int) => {
  // TODO implement your logic here
})
then use it to create a new dataframe column:
df
.withColumn("postcalCode", postalUdf(col("postalCode_0"), col("postalCode_1")))
.show()
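For illustration, a sketch of what such a UDF might look like if it implements the presence/zero rules from the question. The column names and the array() wrapping here are hypothetical, and the rule interpretation (empty if all element values are missing, "0" if all non-missing values are zero, otherwise "1") is an assumption based on the question's description:
import org.apache.spark.sql.functions._

// Derive one presence/zero indicator from all element values of a field
val presenceUdf = udf((values: Seq[String]) => {
  val nonEmpty = values.filter(v => v != null && v.nonEmpty)
  if (nonEmpty.isEmpty) ""                 // all values empty -> derived column empty
  else if (nonEmpty.forall(_ == "0")) "0"  // all present values are zero -> "0"
  else "1"                                 // at least one real value present -> "1"
})

df.withColumn(
  "entity_primaryAddresses_postalCode",
  presenceUdf(array(col("`entity.primaryAddresses[0].postalCode`"),
                    col("`entity.primaryAddresses[1].postalCode`")))
).show()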
I am trying to flatten the schema of an existing dataframe with nested fields. The structure of my dataframe is something like this:
root
|-- Id: long (nullable = true)
|-- Type: string (nullable = true)
|-- Uri: string (nullable = true)
|-- Type: array (nullable = true)
| |-- element: string (containsNull = true)
|-- Gender: array (nullable = true)
| |-- element: string (containsNull = true)
Type and Gender can contain an array of elements, one element, or a null value.
I tried to use the following code:
var resDf = df.withColumn("FlatType", explode(df("Type")))
But as a result, in the resulting data frame I lose the rows for which Type is null. That means, for example, if I have 10 rows and Type is null in 7 rows and not null in 3, after I use explode the resulting data frame has only three rows.
How can I keep the rows with null values but still explode the array of values?
I found some kind of workaround but am still stuck in one place. For standard types we can do the following:
def customExplode(df: DataFrame, field: String, colType: String): org.apache.spark.sql.Column = {
  var exploded = None: Option[org.apache.spark.sql.Column]
  colType.toLowerCase() match {
    case "string" =>
      val avoidNull = udf((column: Seq[String]) =>
        if (column == null) Seq[String](null)
        else column)
      exploded = Some(explode(avoidNull(df(field))))
    case "boolean" =>
      val avoidNull = udf((xs: Seq[Boolean]) =>
        if (xs == null) Seq[Boolean]()
        else xs)
      exploded = Some(explode(avoidNull(df(field))))
    case _ => exploded = Some(explode(df(field)))
  }
  exploded.get
}
And after that just use it like this:
val explodedField = customExplode(resultDf, fieldName, fieldTypeMap(field))
resultDf = resultDf.withColumn(newName, explodedField)
However, I have a problem with struct types, for the following kind of structure:
|-- Address: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- AddressType: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- DEA: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- Number: array (nullable = true)
| | | | | |-- element: string (containsNull = true)
| | | | |-- ExpirationDate: array (nullable = true)
| | | | | |-- element: timestamp (containsNull = true)
| | | | |-- Status: array (nullable = true)
| | | | | |-- element: string (containsNull = true)
How can we process that kind of schema when DEA is null?
Thank you in advance.
P.S. I tried to use lateral views but the result is the same.
Maybe you can try using when:
val resDf = df.withColumn("FlatType", when(df("Type").isNotNull, explode(df("Type"))))
As shown in the when function's documentation, the value null is inserted for the values that do not match the conditions.
I think what you wanted is to use explode_outer instead of explode
See the Apache docs: explode and explode_outer.
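A minimal sketch of that suggestion, based on the withColumn line from the question: explode_outer keeps the rows whose array is null or empty (FlatType is simply null for them) instead of dropping them.
import org.apache.spark.sql.functions._

// Same as the original explode, but null/empty arrays produce a row with a null FlatType
val resDf = df.withColumn("FlatType", explode_outer(df("Type")))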