How to make fields inside nested structure non-nullable - pyspark

I have a data structure like the one shown below. I want to make sure that field2 contains only non-nullable values, and also that field3.ZZZ has nullable = true.
root
|-- field1: string (nullable = true)
|-- field2: struct (nullable = true)
|-- field3: struct (nullable = true)
| |-- XXX: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- s: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- v: string (nullable = true)
| | | |-- v: long (nullable = true)
| |-- YYY: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- s: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- v: string (nullable = true)
| | | |-- v: long (nullable = true)
| |-- ZZZ: array (nullable = true)
| | |-- element: null (containsNull = true)
I did a lot of research, but haven't found a clear way to change these settings in a nested structure. Can anyone help me with this?

Related

Non-primitive, unsupported type error in ADF when trying to read a Parquet file

In Databricks, a df is generated and saved as a parquet file. Here is the schema:
root
|-- dq_check_id: string (nullable = false)
|-- data_attribute_id: long (nullable = true)
|-- dq_check_scope_number_of_records: integer (nullable = false)
|-- dq_check_hit_number_of_records: integer (nullable = false)
|-- snapshotdate: timestamp (nullable = false)
|-- dq_execution_date: timestamp (nullable = false)
|-- generated_by: string (nullable = false)
|-- dq_check_outcomes: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- rule_output_cd: integer (nullable = false)
| | |-- business_key: array (nullable = false)
| | | |-- element: struct (containsNull = false)
| | | | |-- identifying_data_element_id: long (nullable = true)
| | | | |-- identifying_data_element_value: string (nullable = true)
| | |-- technical_key: array (nullable = false)
| | | |-- element: struct (containsNull = false)
| | | | |-- pk_attr_id: long (nullable = true)
| | | | |-- pk_attr_value: string (nullable = true)
| | |-- dq_check_attributes: array (nullable = false)
| | | |-- element: struct (containsNull = false)
| | | | |-- dq_check_attr_id: long (nullable = true)
| | | | |-- dq_check_attr_value: string (nullable = true)
| | | | |-- dq_check_attr_seq: string (nullable = false)
| | |-- outcome_details: array (nullable = false)
| | | |-- element: struct (containsNull = false)
| | | | |-- outcome_attr_id: integer (nullable = false)
| | | | |-- outcome_attr_value: string (nullable = false)
| | | | |-- outcome_attr_seq: string (nullable = false)
|-- generated_date: timestamp (nullable = true)
Then, when trying to read this parquet file in ADF, this error appears:
Parquet file contained column 'dq_check_outcomes', which is of a non-primitive, unsupported type.
Are you sure there is not a MAP or LIST in the parquet file.
https://www.vertica.com/docs/10.0.x/HTML/Content/Authoring/ExternalTables/ComplexTypes.htm
Please look at the Microsoft documentation on supported options and data types; at the top it states that MAP/LIST are not supported. My suggestion is to rebuild the parquet file section by section until you find the nested column causing the issue.
https://learn.microsoft.com/en-us/azure/data-factory/supported-file-formats-and-compression-codecs-legacy#parquet-format
The documentation includes a table of supported types for reference; the primitive types in your file all appear to be supported, so the nested ARRAY/STRUCT wrapper is the likely culprit.

Update a highly nested column from string to struct

|-- x: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- y: long (nullable = true)
| | |-- z: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- log: string (nullable = true)
I have the above nested schema where I want to change column z's log from string to struct.
|-- x: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- y: long (nullable = true)
| | |-- z: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- log: struct (nullable = true)
| | | | | |-- b: string (nullable = true)
| | | | | |-- c: string (nullable = true)
I'm not using Spark 3 but Spark 2.4.x. I'd prefer the Scala way, but Python works too, since this is a one-time manual thing to backfill some past data.
Is there a way to do this with a udf or some other approach?
I know it's easy to do with from_json, but the nested array of structs is causing issues.
I think it depends on the values in your log column, i.e. how you want to split the string into two separate fields.
The following PySpark code will just "move" your log values into the b and c fields.
# Example data:
from pyspark.sql import functions as F, types as T

schema = T.StructType([
    T.StructField('x', T.ArrayType(T.StructType([
        T.StructField('y', T.LongType()),
        T.StructField('z', T.ArrayType(T.StructType([
            T.StructField('log', T.StringType())
        ]))),
    ])))
])
df = spark.createDataFrame(
    [[[[9, [['text']]]]]],  # one row: x = [{y: 9, z: [{log: 'text'}]}]
    schema
)
df.printSchema()
# root
# |-- x: array (nullable = true)
# | |-- element: struct (containsNull = true)
# | | |-- y: long (nullable = true)
# | | |-- z: array (nullable = true)
# | | | |-- element: struct (containsNull = true)
# | | | | |-- log: string (nullable = true)
df = df.withColumn('x', F.expr('transform(x, e -> struct(e.y as y, array(struct(struct(e.z.log[0] as b, e.z.log[0] as c) as log)) as z))'))
df.printSchema()
# root
# |-- x: array (nullable = true)
# | |-- element: struct (containsNull = false)
# | | |-- y: long (nullable = true)
# | | |-- z: array (nullable = false)
# | | | |-- element: struct (containsNull = false)
# | | | | |-- log: struct (nullable = false)
# | | | | | |-- b: string (nullable = true)
# | | | | | |-- c: string (nullable = true)
If string transformations are needed on the log column, the e.z.log[0] parts need to be changed to include them.
Higher-order functions are your friend in this case, coalesce in particular. Code below:
df = df.withColumn('x', F.expr('transform(x, e -> struct(e.y as y, array(struct(coalesce(("1" as a, "2" as b)) as log)) as z))'))
df.printSchema()
root
|-- x: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- y: long (nullable = true)
| | |-- z: array (nullable = false)
| | | |-- element: struct (containsNull = false)
| | | | |-- log: struct (nullable = false)
| | | | | |-- a: string (nullable = false)
| | | | | |-- b: string (nullable = false)

Suggestions for tuning code that contains explode and groupBy

I wrote the code for the problem below, but it has the following issues. Please suggest any tuning that can be done.
It takes more time than I would expect.
There are 3 brands as of now, and they are hardcoded. If more brands are added, I need to update the code manually.
input dataframe schema :
root
|-- id: string (nullable = true)
|-- attrib: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
|-- pref: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- pref_type: string (nullable = true)
| | |-- brand: string (nullable = true)
| | |-- tp_id: string (nullable = true)
| | |-- aff: float (nullable = true)
| | |-- pre_id: string (nullable = true)
| | |-- cr_date: string (nullable = true)
| | |-- up_date: string (nullable = true)
| | |-- pref_attrib: map (nullable = true)
| | | |-- key: string
| | | |-- value: string (valueContainsNull = true)
expected output schema:
root
|-- id: string (nullable = true)
|-- attrib: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
|-- pref: struct (nullable = false)
| |-- brandA: array (nullable = true)
| | |-- element: struct (containsNull = false)
| | | |-- pref_type: string (nullable = true)
| | | |-- tp_id: string (nullable = true)
| | | |-- aff: float (nullable = true)
| | | |-- pref_id: string (nullable = true)
| | | |-- cr_date: string (nullable = true)
| | | |-- up_date: string (nullable = true)
| | | |-- pref_attrib: map (nullable = true)
| | | | |-- key: string
| | | | |-- value: string (valueContainsNull = true)
| |-- brandB: array (nullable = true)
| | |-- element: struct (containsNull = false)
| | | |-- pref_type: string (nullable = true)
| | | |-- tp_id: string (nullable = true)
| | | |-- aff: float (nullable = true)
| | | |-- pref_id: string (nullable = true)
| | | |-- cr_date: string (nullable = true)
| | | |-- up_date: string (nullable = true)
| | | |-- pref_attrib: map (nullable = true)
| | | | |-- key: string
| | | | |-- value: string (valueContainsNull = true)
| |-- brandC: array (nullable = true)
| | |-- element: struct (containsNull = false)
| | | |-- pref_type: string (nullable = true)
| | | |-- tp_id: string (nullable = true)
| | | |-- aff: float (nullable = true)
| | | |-- pref_id: string (nullable = true)
| | | |-- cr_date: string (nullable = true)
| | | |-- up_date: string (nullable = true)
| | | |-- pref_attrib: map (nullable = true)
| | | | |-- key: string
| | | | |-- value: string (valueContainsNull = true)
The processing can be done based on the brand attribute under preferences (preferences.brand).
I have written the below code for that:
def modifyBrands(inputDf: DataFrame): DataFrame = {
  import org.apache.spark.sql.functions._
  val PreferenceProps = Array("pref_type", "tp_id", "aff", "pref_id", "cr_date", "up_date", "pref_attrib")

  val explodedDf = inputDf.select(col("id"), explode(col("pref")))
    .select(
      col("id"),
      col("col.pref_type"),
      col("col.brand"),
      col("col.tp_id"),
      col("col.aff"),
      col("col.pre_id").as("pref_id"), // aliased to match the expected output schema
      col("col.cr_date"),
      col("col.up_date"),
      col("col.pref_attrib")
    ).cache()

  val brandAddedDf = explodedDf
    .withColumn("brandA", when(col("brand") === "brandA", struct(PreferenceProps.head, PreferenceProps.tail: _*)))
    .withColumn("brandB", when(col("brand") === "brandB", struct(PreferenceProps.head, PreferenceProps.tail: _*)))
    .withColumn("brandC", when(col("brand") === "brandC", struct(PreferenceProps.head, PreferenceProps.tail: _*)))
    .cache()
  explodedDf.unpersist()

  val groupedDf = brandAddedDf.groupBy("id").agg(
      collect_list("brandA").alias("brandA"),
      collect_list("brandB").alias("brandB"),
      collect_list("brandC").alias("brandC")
    ).withColumn("preferences", struct(
      when(size(col("brandA")).notEqual(0), col("brandA")).alias("brandA"),
      when(size(col("brandB")).notEqual(0), col("brandB")).alias("brandB"),
      when(size(col("brandC")).notEqual(0), col("brandC")).alias("brandC")
    )).drop("brandA", "brandB", "brandC")
    .cache()
  brandAddedDf.unpersist()

  val idAttributesDf = inputDf.select("id", "attrib").cache()
  val joinedDf = idAttributesDf.join(groupedDf, "id")
  groupedDf.unpersist()
  idAttributesDf.unpersist()

  joinedDf.printSchema()
  joinedDf // the returned df will be written as a parquet file
}
You can simplify your code using the higher-order function filter on arrays. Just map over the brand names and, for each one, return a filtered array from pref. This way you avoid the explode / groupBy part entirely.
Here's a complete example:
val data = """{"id":1,"attrib":{"key":"k","value":"v"},"pref":[{"pref_type":"type1","brand":"brandA","tp_id":"id1","aff":"aff1","pre_id":"pre_id1","cr_date":"2021-01-06","up_date":"2021-01-06","pref_attrib":{"key":"k","value":"v"}},{"pref_type":"type1","brand":"brandB","tp_id":"id1","aff":"aff1","pre_id":"pre_id1","cr_date":"2021-01-06","up_date":"2021-01-06","pref_attrib":{"key":"k","value":"v"}},{"pref_type":"type1","brand":"brandC","tp_id":"id1","aff":"aff1","pre_id":"pre_id1","cr_date":"2021-01-06","up_date":"2021-01-06","pref_attrib":{"key":"k","value":"v"}}]}"""
val inputDf = spark.read.json(Seq(data).toDS)
val brands = Seq("brandA", "brandB", "brandC")
// or getting them from input dataframe
// val brands = inputDf.select("pref.brand").as[Seq[String]].collect.flatten
val brandAddedDf = inputDf.withColumn(
  "pref",
  struct(brands.map(b => expr(s"filter(pref, x -> x.brand = '$b')").as(b)): _*)
)
brandAddedDf.printSchema
//root
// |-- attrib: struct (nullable = true)
// | |-- key: string (nullable = true)
// | |-- value: string (nullable = true)
// |-- id: long (nullable = true)
// |-- pref: struct (nullable = false)
// | |-- brandA: array (nullable = true)
// | | |-- element: struct (containsNull = true)
// | | | |-- aff: string (nullable = true)
// | | | |-- brand: string (nullable = true)
// | | | |-- cr_date: string (nullable = true)
// | | | |-- pre_id: string (nullable = true)
// | | | |-- pref_attrib: struct (nullable = true)
// | | | | |-- key: string (nullable = true)
// | | | | |-- value: string (nullable = true)
// | | | |-- pref_type: string (nullable = true)
// | | | |-- tp_id: string (nullable = true)
// | | | |-- up_date: string (nullable = true)
// | |-- brandB: array (nullable = true)
// | | |-- element: struct (containsNull = true)
// | | | |-- aff: string (nullable = true)
// | | | |-- brand: string (nullable = true)
// | | | |-- cr_date: string (nullable = true)
// | | | |-- pre_id: string (nullable = true)
// | | | |-- pref_attrib: struct (nullable = true)
// | | | | |-- key: string (nullable = true)
// | | | | |-- value: string (nullable = true)
// | | | |-- pref_type: string (nullable = true)
// | | | |-- tp_id: string (nullable = true)
// | | | |-- up_date: string (nullable = true)
// | |-- brandC: array (nullable = true)
// | | |-- element: struct (containsNull = true)
// | | | |-- aff: string (nullable = true)
// | | | |-- brand: string (nullable = true)
// | | | |-- cr_date: string (nullable = true)
// | | | |-- pre_id: string (nullable = true)
// | | | |-- pref_attrib: struct (nullable = true)
// | | | | |-- key: string (nullable = true)
// | | | | |-- value: string (nullable = true)
// | | | |-- pref_type: string (nullable = true)
// | | | |-- tp_id: string (nullable = true)
// | | | |-- up_date: string (nullable = true)
I think there are a couple of issues with how you are writing your code, but the real way to tell where the problem lies is to look at the Spark UI. I find the "Jobs" tab and the "SQL" tab very informative for figuring out where the code is spending most of its time; then see if those parts can be rewritten for more speed. Some of the items I point out below may not matter if there is a bottleneck elsewhere that accounts for most of the time.
There are reasons to create nested structures (like you do for brand); I'm just not sure I see the payoff here, and it's not explained. Consider why you are maintaining this structure and what the benefit is. Is there a performance gain, or is it simply an artifact of how the data was created?
General tips that might help a little:
In general you should only cache data that you will use more than once. You cache several DataFrames here that are only used a single time.
A small performance boost (in other words, for when you need every millisecond): withColumn doesn't perform quite as well as select, likely due to extra object creation, so where possible use select instead of withColumn. It's not really worth rewriting your code unless you genuinely need every millisecond.

Flatten Parquet File with nested Arrays and StructType Spark Scala

I am looking to dynamically flatten a parquet file in Spark with Scala, and I'm wondering what an efficient way to achieve this would be.
The parquet file contains Array and Struct types nested at multiple depth levels. The schema can change in the future, so I cannot hardcode any attribute names. The desired end result is a flattened, delimited file.
Would a solution using flatMap and recursively exploding work?
Example Schema:
|-- exCar: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- exCarOne: string (nullable = true)
| | |-- exCarTwo: string (nullable = true)
| | |-- exCarThree: string (nullable = true)
|-- exProduct: string (nullable = true)
|-- exName: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- exNameOne: string (nullable = true)
| | |-- exNameTwo: string (nullable = true)
| | |-- exNameThree: string (nullable = true)
| | |-- exNameFour: string (nullable = true)
| | |-- exNameCode: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- exNameCodeOne: string (nullable = true)
| | | | |-- exNameCodeTwo: string (nullable = true)
| | |-- exColor: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- exColorOne: string (nullable = true)
| | | | |-- exColorTwo: string (nullable = true)
| | | | |-- exWheelColor: array (nullable = true)
| | | | | |-- element: struct (containsNull = true)
| | | | | | |-- exWheelColorOne: string (nullable = true)
| | | | | | |-- exWheelColorTwo: string (nullable = true)
| | | | | | |-- exWheelColorThree: string (nullable = true)
| | |-- exGlass: string (nullable = true)
|-- exDetails: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- exBill: string (nullable = true)
| | |-- exAccount: string (nullable = true)
| | |-- exLoan: string (nullable = true)
| | |-- exRate: string (nullable = true)
Desired output Schema:
exCar.exCarOne
exCar.exCarTwo
exCar.exCarThree
exProduct
exName.exNameOne
exName.exNameTwo
exName.exNameThree
exName.exNameFour
exName.exNameCode.exNameCodeOne
exName.exNameCode.exNameCodeTwo
exName.exColor.exColorOne
exName.exColor.exColorTwo
exName.exColor.exWheelColor.exWheelColorOne
exName.exColor.exWheelColor.exWheelColorTwo
exName.exColor.exWheelColor.exWheelColorThree
exName.exGlass
exDetails.exBill
exDetails.exAccount
exDetails.exLoan
exDetails.exRate
There are two things that need to be done:
1) Explode the array columns from the outermost nested arrays inward: explode exName (giving you a lot of rows whose structs contain exColor), then explode exColor, which gives you access to exWheelColor, and so on.
2) Project each nested field to a separate column.

Get WrappedArray row value and convert it into a string in Scala

I have a data frame that looks like the one below:
+---------------------------------------------------------------------+
|value |
+---------------------------------------------------------------------+
|[WrappedArray(LineItem_organizationId, LineItem_lineItemId)] |
|[WrappedArray(OrganizationId, LineItemId, SegmentSequence_segmentId)]|
+---------------------------------------------------------------------+
From the above two rows I want to create strings in this format:
"LineItem_organizationId", "LineItem_lineItemId"
"OrganizationId", "LineItemId", "SegmentSequence_segmentId"
I want this to be dynamic: if a third value is present in the first column, the string should have one more comma-separated value.
How can I do this in Scala?
This is what I am doing to create the data frame:
val xmlFiles = "C://Users//u6034690//Desktop//SPARK//trfsmallfffile//XML"
val discriptorFileLOcation = "C://Users//u6034690//Desktop//SPARK//trfsmallfffile//FinancialLineItem//REFXML"
import sqlContext.implicits._
val dfDiscriptor = sqlContext.read.format("com.databricks.spark.xml").option("rowTag", "FlatFileDescriptor").load(discriptorFileLOcation)
dfDiscriptor.printSchema()
val firstColumn = dfDiscriptor.select($"FFFileType.FFRecord.FFField").as("FFField")
val FirstColumnOfHeaderFile = firstColumn.select(explode($"FFField")).as("ColumnsDetails").select(explode($"col")).first.get(0).toString().split(",")(5)
println(FirstColumnOfHeaderFile)
//dfDiscriptor.printSchema()
val primaryKeyColumnsFinancialLineItem = dfDiscriptor.select(explode($"FFFileType.FFRecord.FFPrimKey.FFPrimKeyCol"))
primaryKeyColumnsFinancialLineItem.show(false)
Adding the full schema
root
|-- FFColumnDelimiter: string (nullable = true)
|-- FFContentItem: struct (nullable = true)
| |-- _VALUE: string (nullable = true)
| |-- _ffMajVers: long (nullable = true)
| |-- _ffMinVers: double (nullable = true)
|-- FFFileEncoding: string (nullable = true)
|-- FFFileType: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- FFPhysicalFile: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- FFFileName: string (nullable = true)
| | | | |-- FFRowCount: long (nullable = true)
| | |-- FFRecord: struct (nullable = true)
| | | |-- FFField: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- FFColumnNumber: long (nullable = true)
| | | | | |-- FFDataType: string (nullable = true)
| | | | | |-- FFFacets: struct (nullable = true)
| | | | | | |-- FFMaxLength: long (nullable = true)
| | | | | | |-- FFTotalDigits: long (nullable = true)
| | | | | |-- FFFieldIsOptional: boolean (nullable = true)
| | | | | |-- FFFieldName: string (nullable = true)
| | | | | |-- FFForKey: struct (nullable = true)
| | | | | | |-- FFForKeyCol: string (nullable = true)
| | | | | | |-- FFForKeyRecord: string (nullable = true)
| | | |-- FFPrimKey: struct (nullable = true)
| | | | |-- FFPrimKeyCol: array (nullable = true)
| | | | | |-- element: string (containsNull = true)
| | | |-- FFRecordType: string (nullable = true)
|-- FFHeaderRow: boolean (nullable = true)
|-- FFId: string (nullable = true)
|-- FFRowDelimiter: string (nullable = true)
|-- FFTimeStamp: string (nullable = true)
|-- _env: string (nullable = true)
|-- _ffMajVers: long (nullable = true)
|-- _ffMinVers: double (nullable = true)
|-- _ffPubstyle: string (nullable = true)
|-- _schemaLocation: string (nullable = true)
|-- _sr: string (nullable = true)
|-- _xmlns: string (nullable = true)
|-- _xsi: string (nullable = true)
Looking at your given dataframe
+---------------------------------------------------------------------+
|value |
+---------------------------------------------------------------------+
|[WrappedArray(LineItem_organizationId, LineItem_lineItemId)] |
|[WrappedArray(OrganizationId, LineItemId, SegmentSequence_segmentId)]|
+---------------------------------------------------------------------+
it must have the following schema
|-- value: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
If the above assumptions are true, then you should write a udf function as
import org.apache.spark.sql.functions._
def arrayToString = udf((arr: collection.mutable.WrappedArray[collection.mutable.WrappedArray[String]]) => arr.flatten.mkString(", "))
And apply it to the dataframe as
df.withColumn("value", arrayToString($"value"))
And you should have
+-----------------------------------------------------+
|value |
+-----------------------------------------------------+
|LineItem_organizationId, LineItem_lineItemId |
|OrganizationId, LineItemId, SegmentSequence_segmentId|
+-----------------------------------------------------+
|-- value: string (nullable = true)