I have the nested XML file below, which leads to my two questions.
Which functions should I use to flatten the file into a DataFrame table? Examples would be very helpful.
How can I use the field "CNTL_ID" (PK) to link it with the "ENRLMTS" record? If this cannot be achieved easily, how can I create a surrogate key on the fly to link "PRVDR_INFO" with "ENRLMTS"?
XML File:
<PRVDR>
<PRVDR_INFO>
<INDVDL_INFO>
<CNTL_ID>12345678</CNTL_ID>
<BIRTH_DT>19200609</BIRTH_DT>
<BIRTH_STATE_CD>VA</BIRTH_STATE_CD>
<BIRTH_STATE_NAME>VIRGINIA</BIRTH_STATE_NAME>
<BIRTH_CNTRY_CD>US</BIRTH_CNTRY_CD>
<BIRTH_CNTRY_NAME>UNITED STATES</BIRTH_CNTRY_NAME>
<BIRTH_FRGN_SW>Z</BIRTH_FRGN_SW>
<NAME_LIST>
<PEC_INDVDL_NAME>
<NAME_CD>I</NAME_CD>
<NAME_DESC>INDIVIDUAL NAME</NAME_DESC>
<FIRST_NAME>WILL</FIRST_NAME>
<MDL_NAME>J</MDL_NAME>
<LAST_NAME>SMITH</LAST_NAME>
<DATA_STUS_CD>CURRENT</DATA_STUS_CD>
</PEC_INDVDL_NAME>
<PEC_INDVDL_NAME>
<NAME_CD>I</NAME_CD>
<NAME_DESC>INDIVIDUAL NAME</NAME_DESC>
<FIRST_NAME>WILL</FIRST_NAME>
<LAST_NAME>SMITH</LAST_NAME>
<TRMNTN_DT>2010-09-10T13:19:38</TRMNTN_DT>
<DATA_STUS_CD>HISTORY</DATA_STUS_CD>
</PEC_INDVDL_NAME>
</NAME_LIST>
<PEC_TIN>
<TIN>555778888</TIN>
<TAX_IDENT_TYPE_CD>T</TAX_IDENT_TYPE_CD>
<TAX_IDENT_DESC>SSN</TAX_IDENT_DESC>
<DATA_STUS_CD>CURRENT</DATA_STUS_CD>
</PEC_TIN>
<PEC_NPI>
<NPI>3334211156</NPI>
<VRFYD_BUSNS_SW>Y</VRFYD_BUSNS_SW>
<CREAT_TS>2010-09-10T13:16:28</CREAT_TS>
<DATA_STUS_CD>CURRENT</DATA_STUS_CD>
</PEC_NPI>
</INDVDL_INFO>
</PRVDR_INFO>
<ENRLMTS>
<ABC_855X>
<ENRLMT_INFO>
<ENRLMT_DTLS>
<FORM_TYPE_CD>9999A</FORM_TYPE_CD>
<ENRLMT_ID>123444555666778899000</ENRLMT_ID>
<ENRLMT_STUS_DLTS>
<STUS_CD>06</STUS_CD>
<STUS_DESC>APPROVED</STUS_DESC>
<STUS_DT>2012-05-14T16:04:22</STUS_DT>
<DATA_STUS_CD>HISTORY</DATA_STUS_CD>
<ENRLMT_STUS_RSN_DLTS>
<STUS_RSN_CD>047</STUS_RSN_CD>
<STUS_RSN_DESC>APPROVED</STUS_RSN_DESC>
<DATA_STUS_CD>HISTORY</DATA_STUS_CD>
</ENRLMT_STUS_RSN_DLTS>
</ENRLMT_STUS_DLTS>
<ENRLMT_STUS_DLTS>
<STUS_CD>06</STUS_CD>
<STUS_DESC>APPROVED</STUS_DESC>
<STUS_DT>2016-08-09T14:33:40</STUS_DT>
<DATA_STUS_CD>CURRENT</DATA_STUS_CD>
<ENRLMT_STUS_RSN_DLTS>
<STUS_RSN_CD>081</STUS_RSN_CD>
<STUS_RSN_DESC>APPROVED FOR REVALIDATION</STUS_RSN_DESC>
<DATA_STUS_CD>CURRENT</DATA_STUS_CD>
</ENRLMT_STUS_RSN_DLTS>
</ENRLMT_STUS_DLTS>
<BUSNS_STATE>VA</BUSNS_STATE>
<BUSNS_STATE_NAME>VIRGINIA</BUSNS_STATE_NAME>
<CNTRCTR_LIST>
<CNTRCTR_INFO>
<CNTRCTR_ID>11111</CNTRCTR_ID>
<CNTRCTR_NAME>SOLUTIONS, INC.</CNTRCTR_NAME>
<DATA_STUS_CD>CURRENT</DATA_STUS_CD>
</CNTRCTR_INFO>
</CNTRCTR_LIST>
</ENRLMT_DTLS>
</ENRLMT_INFO>
<PEC_ENRLMT_REVLDTN>
<REVLDTN_INSTNC_NUM>1</REVLDTN_INSTNC_NUM>
<REVLDTN_STUS_CD>03</REVLDTN_STUS_CD>
<REVLDTN_STUS_DESC>CANCELLED</REVLDTN_STUS_DESC>
</PEC_ENRLMT_REVLDTN>
<ACPT_NEW_PTNT_SW>Y</ACPT_NEW_PTNT_SW>
</ABC_855X>
</ENRLMTS>
</PRVDR>
Schema is below:
root
|-- ENRLMTS: struct (nullable = true)
| |-- ABC_855X: struct (nullable = true)
| | |-- ACPT_NEW_PTNT_SW: string (nullable = true)
| | |-- ENRLMT_INFO: struct (nullable = true)
| | | |-- ENRLMT_DTLS: struct (nullable = true)
| | | | |-- BUSNS_STATE: string (nullable = true)
| | | | |-- BUSNS_STATE_NAME: string (nullable = true)
| | | | |-- CNTRCTR_LIST: struct (nullable = true)
| | | | | |-- CNTRCTR_INFO: struct (nullable = true)
| | | | | | |-- CNTRCTR_ID: integer (nullable = true)
| | | | | | |-- CNTRCTR_NAME: string (nullable = true)
| | | | | | |-- DATA_STUS_CD: string (nullable = true)
| | | | |-- ENRLMT_ID: double (nullable = true)
| | | | |-- ENRLMT_STUS_DLTS: array (nullable = true)
| | | | | |-- element: struct (containsNull = true)
| | | | | | |-- DATA_STUS_CD: string (nullable = true)
| | | | | | |-- ENRLMT_STUS_RSN_DLTS: struct (nullable = true)
| | | | | | | |-- DATA_STUS_CD: string (nullable = true)
| | | | | | | |-- STUS_RSN_CD: integer (nullable = true)
| | | | | | | |-- STUS_RSN_DESC: string (nullable = true)
| | | | | | |-- STUS_CD: integer (nullable = true)
| | | | | | |-- STUS_DESC: string (nullable = true)
| | | | | | |-- STUS_DT: string (nullable = true)
| | | | |-- FORM_TYPE_CD: string (nullable = true)
| | |-- PEC_ENRLMT_REVLDTN: struct (nullable = true)
| | | |-- REVLDTN_INSTNC_NUM: integer (nullable = true)
| | | |-- REVLDTN_STUS_CD: integer (nullable = true)
| | | |-- REVLDTN_STUS_DESC: string (nullable = true)
|-- PRVDR_INFO: struct (nullable = true)
| |-- INDVDL_INFO: struct (nullable = true)
| | |-- BIRTH_CNTRY_CD: string (nullable = true)
| | |-- BIRTH_CNTRY_NAME: string (nullable = true)
| | |-- BIRTH_DT: integer (nullable = true)
| | |-- BIRTH_FRGN_SW: string (nullable = true)
| | |-- BIRTH_STATE_CD: string (nullable = true)
| | |-- BIRTH_STATE_NAME: string (nullable = true)
| | |-- CNTL_ID: integer (nullable = true)
| | |-- NAME_LIST: struct (nullable = true)
| | | |-- PEC_INDVDL_NAME: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- DATA_STUS_CD: string (nullable = true)
| | | | | |-- FIRST_NAME: string (nullable = true)
| | | | | |-- LAST_NAME: string (nullable = true)
| | | | | |-- MDL_NAME: string (nullable = true)
| | | | | |-- NAME_CD: string (nullable = true)
| | | | | |-- NAME_DESC: string (nullable = true)
| | | | | |-- TRMNTN_DT: string (nullable = true)
| | |-- PEC_NPI: struct (nullable = true)
| | | |-- CREAT_TS: string (nullable = true)
| | | |-- DATA_STUS_CD: string (nullable = true)
| | | |-- NPI: long (nullable = true)
| | | |-- VRFYD_BUSNS_SW: string (nullable = true)
| | |-- PEC_TIN: struct (nullable = true)
| | | |-- DATA_STUS_CD: string (nullable = true)
| | | |-- TAX_IDENT_DESC: string (nullable = true)
| | | |-- TAX_IDENT_TYPE_CD: string (nullable = true)
| | | |-- TIN: integer (nullable = true)
My current nested DataFrame output is below. I would like to break this nested output into multiple rows if possible.
+--------------------+--------------------+
| ENRLMTS| PRVDR_INFO|
+--------------------+--------------------+
|{{Y, {{VA, VIRGIN...|{{US, UNITED STAT...|
+--------------------+--------------------+
Thank you very much.
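In case it helps, here is a minimal sketch of one way to do both (in PySpark, assuming the file is read with the spark-xml package; the path prvdr.xml is a placeholder). Because each <PRVDR> element becomes a single row, PRVDR_INFO and ENRLMTS already sit on the same row, so CNTL_ID can be carried into both flattened tables; a surrogate key via monotonically_increasing_id() works the same way if CNTL_ID were ever missing:

from pyspark.sql import functions as F

# Each <PRVDR> element becomes one row.
df = spark.read.format("xml").option("rowTag", "PRVDR").load("prvdr.xml")

# Surrogate key per provider row; unique but not consecutive.
keyed = df.withColumn("prvdr_sk", F.monotonically_increasing_id())

# Flatten PRVDR_INFO: one row per name in NAME_LIST (add other scalar fields as needed).
prvdr_flat = keyed.select(
    "prvdr_sk",
    F.col("PRVDR_INFO.INDVDL_INFO.CNTL_ID").alias("CNTL_ID"),
    F.col("PRVDR_INFO.INDVDL_INFO.BIRTH_DT").alias("BIRTH_DT"),
    F.explode("PRVDR_INFO.INDVDL_INFO.NAME_LIST.PEC_INDVDL_NAME").alias("name")
).select("prvdr_sk", "CNTL_ID", "BIRTH_DT", "name.*")

# Flatten ENRLMTS: one row per enrollment status entry, carrying the same keys.
enrlmt_flat = keyed.select(
    "prvdr_sk",
    F.col("PRVDR_INFO.INDVDL_INFO.CNTL_ID").alias("CNTL_ID"),
    F.col("ENRLMTS.ABC_855X.ENRLMT_INFO.ENRLMT_DTLS.ENRLMT_ID").alias("ENRLMT_ID"),
    F.explode("ENRLMTS.ABC_855X.ENRLMT_INFO.ENRLMT_DTLS.ENRLMT_STUS_DLTS").alias("stus")
).select("prvdr_sk", "CNTL_ID", "ENRLMT_ID", "stus.*")

# The two flat tables now link on CNTL_ID (or on the surrogate prvdr_sk alone).
joined = prvdr_flat.join(enrlmt_flat, ["prvdr_sk", "CNTL_ID"])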
I wrote the code below, but it has the following problems. Please suggest whether some tuning can be done.
It takes more time than I think it should.
There are 3 brands as of now, and they are hardcoded. If more brands are added, I need to extend the code manually.
Input dataframe schema:
root
|-- id: string (nullable = true)
|-- attrib: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
|-- pref: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- pref_type: string (nullable = true)
| | |-- brand: string (nullable = true)
| | |-- tp_id: string (nullable = true)
| | |-- aff: float (nullable = true)
| | |-- pre_id: string (nullable = true)
| | |-- cr_date: string (nullable = true)
| | |-- up_date: string (nullable = true)
| | |-- pref_attrib: map (nullable = true)
| | | |-- key: string
| | | |-- value: string (valueContainsNull = true)
Expected output schema:
root
|-- id: string (nullable = true)
|-- attrib: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
|-- pref: struct (nullable = false)
| |-- brandA: array (nullable = true)
| | |-- element: struct (containsNull = false)
| | | |-- pref_type: string (nullable = true)
| | | |-- tp_id: string (nullable = true)
| | | |-- aff: float (nullable = true)
| | | |-- pref_id: string (nullable = true)
| | | |-- cr_date: string (nullable = true)
| | | |-- up_date: string (nullable = true)
| | | |-- pref_attrib: map (nullable = true)
| | | | |-- key: string
| | | | |-- value: string (valueContainsNull = true)
| |-- brandB: array (nullable = true)
| | |-- element: struct (containsNull = false)
| | | |-- pref_type: string (nullable = true)
| | | |-- tp_id: string (nullable = true)
| | | |-- aff: float (nullable = true)
| | | |-- pref_id: string (nullable = true)
| | | |-- cr_date: string (nullable = true)
| | | |-- up_date: string (nullable = true)
| | | |-- pref_attrib: map (nullable = true)
| | | | |-- key: string
| | | | |-- value: string (valueContainsNull = true)
| |-- brandC: array (nullable = true)
| | |-- element: struct (containsNull = false)
| | | |-- pref_type: string (nullable = true)
| | | |-- tp_id: string (nullable = true)
| | | |-- aff: float (nullable = true)
| | | |-- pref_id: string (nullable = true)
| | | |-- cr_date: string (nullable = true)
| | | |-- up_date: string (nullable = true)
| | | |-- pref_attrib: map (nullable = true)
| | | | |-- key: string
| | | | |-- value: string (valueContainsNull = true)
The processing can be done based on the brand attribute under pref (pref.brand).
I have written the code below for that:
def modifyBrands(inputDf: DataFrame): DataFrame = {
  import org.apache.spark.sql.functions._
  // Fields of the output struct; the input field pre_id is renamed to pref_id below.
  val preferenceProps = Array("pref_type", "tp_id", "aff", "pref_id", "cr_date", "up_date", "pref_attrib")
  val explodedDf = inputDf.select(col("id"), explode(col("pref")))
    .select(
      col("id"),
      col("col.pref_type"),
      col("col.brand"),
      col("col.tp_id"),
      col("col.aff"),
      col("col.pre_id").as("pref_id"),
      col("col.cr_date"),
      col("col.up_date"),
      col("col.pref_attrib")
    ).cache()
  val brandAddedDf = explodedDf
    .withColumn("brandA", when(col("brand") === "brandA", struct(preferenceProps.head, preferenceProps.tail: _*)))
    .withColumn("brandB", when(col("brand") === "brandB", struct(preferenceProps.head, preferenceProps.tail: _*)))
    .withColumn("brandC", when(col("brand") === "brandC", struct(preferenceProps.head, preferenceProps.tail: _*)))
    .cache()
  explodedDf.unpersist()
  val groupedDf = brandAddedDf.groupBy("id").agg(
    collect_list("brandA").alias("brandA"),
    collect_list("brandB").alias("brandB"),
    collect_list("brandC").alias("brandC")
  ).withColumn("pref", struct(
    when(size(col("brandA")).notEqual(0), col("brandA")).alias("brandA"),
    when(size(col("brandB")).notEqual(0), col("brandB")).alias("brandB"),
    when(size(col("brandC")).notEqual(0), col("brandC")).alias("brandC")
  )).drop("brandA", "brandB", "brandC")
    .cache()
  brandAddedDf.unpersist()
  val idAttributesDf = inputDf.select("id", "attrib").cache()
  val joinedDf = idAttributesDf.join(groupedDf, "id")
  groupedDf.unpersist()
  idAttributesDf.unpersist()
  joinedDf.printSchema()
  joinedDf // returning the joined df, which will be written out as a parquet file
}
You can simplify your code using the higher-order function filter on arrays: map over the brand names and, for each one, return a filtered array from pref. This way you avoid the explode/group-by part entirely.
Here's a complete example:
val data = """{"id":1,"attrib":{"key":"k","value":"v"},"pref":[{"pref_type":"type1","brand":"brandA","tp_id":"id1","aff":"aff1","pre_id":"pre_id1","cr_date":"2021-01-06","up_date":"2021-01-06","pref_attrib":{"key":"k","value":"v"}},{"pref_type":"type1","brand":"brandB","tp_id":"id1","aff":"aff1","pre_id":"pre_id1","cr_date":"2021-01-06","up_date":"2021-01-06","pref_attrib":{"key":"k","value":"v"}},{"pref_type":"type1","brand":"brandC","tp_id":"id1","aff":"aff1","pre_id":"pre_id1","cr_date":"2021-01-06","up_date":"2021-01-06","pref_attrib":{"key":"k","value":"v"}}]}"""
val inputDf = spark.read.json(Seq(data).toDS)
val brands = Seq("brandA", "brandB", "brandC")
// or get them dynamically from the input dataframe:
// val brands = inputDf.select("pref.brand").as[Seq[String]].collect.flatten.distinct
val brandAddedDf = inputDf.withColumn(
"pref",
struct(brands.map(b => expr(s"filter(pref, x -> x.brand = '$b')").as(b)): _*)
)
brandAddedDf.printSchema
//root
// |-- attrib: struct (nullable = true)
// | |-- key: string (nullable = true)
// | |-- value: string (nullable = true)
// |-- id: long (nullable = true)
// |-- pref: struct (nullable = false)
// | |-- brandA: array (nullable = true)
// | | |-- element: struct (containsNull = true)
// | | | |-- aff: string (nullable = true)
// | | | |-- brand: string (nullable = true)
// | | | |-- cr_date: string (nullable = true)
// | | | |-- pre_id: string (nullable = true)
// | | | |-- pref_attrib: struct (nullable = true)
// | | | | |-- key: string (nullable = true)
// | | | | |-- value: string (nullable = true)
// | | | |-- pref_type: string (nullable = true)
// | | | |-- tp_id: string (nullable = true)
// | | | |-- up_date: string (nullable = true)
// | |-- brandB: array (nullable = true)
// | | |-- element: struct (containsNull = true)
// | | | |-- aff: string (nullable = true)
// | | | |-- brand: string (nullable = true)
// | | | |-- cr_date: string (nullable = true)
// | | | |-- pre_id: string (nullable = true)
// | | | |-- pref_attrib: struct (nullable = true)
// | | | | |-- key: string (nullable = true)
// | | | | |-- value: string (nullable = true)
// | | | |-- pref_type: string (nullable = true)
// | | | |-- tp_id: string (nullable = true)
// | | | |-- up_date: string (nullable = true)
// | |-- brandC: array (nullable = true)
// | | |-- element: struct (containsNull = true)
// | | | |-- aff: string (nullable = true)
// | | | |-- brand: string (nullable = true)
// | | | |-- cr_date: string (nullable = true)
// | | | |-- pre_id: string (nullable = true)
// | | | |-- pref_attrib: struct (nullable = true)
// | | | | |-- key: string (nullable = true)
// | | | | |-- value: string (nullable = true)
// | | | |-- pref_type: string (nullable = true)
// | | | |-- tp_id: string (nullable = true)
// | | | |-- up_date: string (nullable = true)
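One caveat: as the schema above shows, each element still contains the original brand field, whereas your expected output schema drops it and renames pre_id to pref_id. If that matters, filter can be combined with transform to reshape each element. A minimal sketch (written in PySpark here, but the SQL string inside expr is identical in Scala; both functions need Spark 2.4+):

from pyspark.sql.functions import expr, struct

brands = ["brandA", "brandB", "brandC"]

# For each brand: keep matching elements, then rebuild each element
# without the brand field, renaming pre_id to pref_id along the way.
pref_cols = [
    expr(
        f"transform(filter(pref, x -> x.brand = '{b}'), "
        "x -> struct(x.pref_type as pref_type, x.tp_id as tp_id, "
        "x.aff as aff, x.pre_id as pref_id, x.cr_date as cr_date, "
        "x.up_date as up_date, x.pref_attrib as pref_attrib))"
    ).alias(b)
    for b in brands
]

result = inputDf.withColumn("pref", struct(*pref_cols))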
I think there are a couple of issues with how you are doing your code, but the real way to tell where you have a problem is to look at the Spark UI. I find the "Jobs" tab and the "SQL" tab very informative for figuring out where the code is spending most of its time; then see whether those parts can be rewritten for more speed. Some of the items I point out below may not matter if there is a bottleneck elsewhere that accounts for most of the time.
There are reasons to create nested structures (like you are doing for brand); I'm just not sure I see the payoff here, and it isn't explained. Consider why you are maintaining this structure and what the benefit is: is there a performance gain, or is it simply an artifact of how the data was created?
General tips that might help a little:
In general you should only cache a DataFrame that you use more than once. You cache several DataFrames here that are used only once.
A small, small performance boost (in other words, for when you need every millisecond): withColumn actually doesn't perform as well as select, likely due to some object creation on every call, so where possible use a single select instead of chained withColumn calls, as in the sketch below. It's not really worth rewriting your code unless you really need every millisecond.
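For illustration, a minimal sketch of the difference (in PySpark; df, brand, and aff are stand-ins for your own columns, and the Scala API reads the same):

from pyspark.sql import functions as F

# Each withColumn call creates a new Dataset and re-runs analysis:
slow = (df
    .withColumn("brandA", F.when(F.col("brand") == "brandA", F.col("aff")))
    .withColumn("brandB", F.when(F.col("brand") == "brandB", F.col("aff"))))

# A single select produces the same columns in one projection:
fast = df.select(
    "*",
    F.when(F.col("brand") == "brandA", F.col("aff")).alias("brandA"),
    F.when(F.col("brand") == "brandB", F.col("aff")).alias("brandB"),
)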
I have a spark dataframe with the following schema:
|-- id: long (nullable = true)
|-- comment: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: long (nullable = true)
| | |-- body: string (nullable = true)
| | |-- html_body: string (nullable = true)
| | |-- author_id: long (nullable = true)
| | |-- uploads: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- attachments: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- thumbnails: array (nullable = true)
| | | | | |-- element: struct (containsNull = true)
| | | | | | |-- id: long (nullable = true)
| | | | | | |-- file_name: string (nullable = true)
| | | | | | |-- url: string (nullable = true)
| | | | | | |-- content_url: string (nullable = true)
| | | | | | |-- mapped_content_url: string (nullable = true)
| | | | | | |-- content_type: string (nullable = true)
| | | | | | |-- size: long (nullable = true)
| | | | | | |-- width: long (nullable = true)
| | | | | | |-- height: long (nullable = true)
| | | | | | |-- inline: boolean (nullable = true)
| | |-- created_at: string (nullable = true)
| | |-- public: boolean (nullable = true)
| | |-- channel: string (nullable = true)
| | |-- from: string (nullable = true)
| | |-- location: string (nullable = true)
However, this has way more data than I need. For each element of the comment array, I would like to combine first comment.created_at and then comment.body into a new struct column called comment_final.
The end goal is to build from that a single string column that flattens the entire array into an HTML-like field.
For the end result, I would like to do the following:
.withColumn('final_body', array_join(col('comment.body'), '<br/><br/>'))
Can someone help me out with the array/struct data modelling?
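A sketch of one way to do this with the higher-order function transform (assuming Spark 2.4+). Note that array_join needs an array of strings, so each comment_final struct is rendered to a string before joining; the space separator between created_at and body is an assumption:

from pyspark.sql import functions as F

result = (df
    # Keep only the fields needed, as an array of structs.
    .withColumn(
        "comment_final",
        F.expr("transform(comment, c -> struct(c.created_at, c.body))")
    )
    # array_join needs array<string>, so render each struct to a string first.
    .withColumn(
        "final_body",
        F.array_join(
            F.expr("transform(comment_final, c -> concat_ws(' ', c.created_at, c.body))"),
            "<br/><br/>"
        )
    )
)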
I have a requirement to mask the data in some of the fields of a given schema. I've researched a lot and couldn't find the answer I need.
This is the schema where I need some changes on the fields (answer_type, response0, response3):
| |-- choices: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- choice_id: long (nullable = true)
| | | |-- created_time: long (nullable = true)
| | | |-- updated_time: long (nullable = true)
| | | |-- created_by: long (nullable = true)
| | | |-- updated_by: long (nullable = true)
| | | |-- answers: struct (nullable = true)
| | | | |-- answer_node_internal_id: long (nullable = true)
| | | | |-- label: string (nullable = true)
| | | | |-- text: map (nullable = true)
| | | | | |-- key: string
| | | | | |-- value: string (valueContainsNull = true)
| | | | |-- data_tag: string (nullable = true)
| | | | |-- answer_type: string (nullable = true)
| | | |-- response: struct (nullable = true)
| | | | |-- response0: string (nullable = true)
| | | | |-- response1: long (nullable = true)
| | | | |-- response2: double (nullable = true)
| | | | |-- response3: array (nullable = true)
| | | | | |-- element: string (containsNull = true)
Is there a way I can assign values to those fields in PySpark without affecting the structure above?
I've tried using explode, but I can't revert to the original schema. I don't want to create a new column, and at the same time I don't want to lose any data from the provided schema.
Oh, I ran into a similar problem a few days ago. I suggest transforming the StructType to JSON; then, with a UDF, you can make the internal changes, and afterwards you can get the original struct back.
You should look at to_json and from_json in the documentation.
https://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html#pyspark.sql.functions.from_json
https://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html#pyspark.sql.functions.to_json
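To make that concrete, a minimal sketch of the round trip on the choices array (the "***" mask values are placeholders to adapt; this assumes a Spark version whose to_json/from_json handle arrays of structs, 2.4+ to be safe):

import json
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

@F.udf(returnType=StringType())
def mask_choices(choices_json):
    # Operates on the JSON string produced by to_json.
    if choices_json is None:
        return None
    choices = json.loads(choices_json)
    for choice in choices:
        if choice.get("answers"):
            choice["answers"]["answer_type"] = "***"
        if choice.get("response"):
            choice["response"]["response0"] = "***"
            if choice["response"].get("response3"):
                choice["response"]["response3"] = ["***"]
    return json.dumps(choices)

# struct -> JSON string -> masked JSON -> back to the original struct type
choices_type = df.schema["choices"].dataType
masked = df.withColumn(
    "choices",
    F.from_json(mask_choices(F.to_json(F.col("choices"))), choices_type)
)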
I have a dataframe as below:
+--------------------+
| pas1|
+--------------------+
|[[[[H, 5, 16, 201...|
|[, 1956-09-22, AD...|
|[, 1961-03-19, AD...|
|[, 1962-02-09, AD...|
+--------------------+
I want to extract a few columns from each of the above 4 rows and create a dataframe like the one below. The column names should come from the schema, not hardcoded ones like column1 & column2.
+--------+-----------+
| gender | givenName |
+--------+-----------+
| a      | b         |
| a      | b         |
| a      | b         |
| a      | b         |
+--------+-----------+
pas1 schema:
root
|-- pas1: struct (nullable = true)
| |-- contactList: struct (nullable = true)
| | |-- contact: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- contactTypeCode: string (nullable = true)
| | | | |-- contactMediumTypeCode: string (nullable = true)
| | | | |-- contactTypeID: string (nullable = true)
| | | | |-- lastUpdateTimestamp: string (nullable = true)
| | | | |-- contactInformation: string (nullable = true)
| |-- dateOfBirth: string (nullable = true)
| |-- farePassengerTypeCode: string (nullable = true)
| |-- gender: string (nullable = true)
| |-- givenName: string (nullable = true)
| |-- groupDepositIndicator: string (nullable = true)
| |-- infantIndicator: string (nullable = true)
| |-- lastUpdateTimestamp: string (nullable = true)
| |-- passengerFOPList: struct (nullable = true)
| | |-- passengerFOP: struct (nullable = true)
| | | |-- fopID: string (nullable = true)
| | | |-- lastUpdateTimestamp: string (nullable = true)
| | | |-- fopFreeText: string (nullable = true)
| | | |-- fopSupplementaryInfoList: struct (nullable = true)
| | | | |-- fopSupplementaryInfo: array (nullable = true)
| | | | | |-- element: struct (containsNull = true)
| | | | | | |-- type: string (nullable = true)
| | | | | | |-- value: string (nullable = true)
Thanks for the help
If you want to extract a few columns from a dataframe containing structs, you can simply do something like this:
from pyspark.sql import SparkSession, Row
spark = SparkSession.builder.appName('Test').getOrCreate()
df = spark.sparkContext.parallelize([Row(pas1=Row(gender='a', givenName='b'))]).toDF()
df.select('pas1.gender','pas1.givenName').show()
Instead, if you want to flatten your dataframe, this question should help you: How to unwrap nested Struct column into multiple columns?
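And if you'd rather not hardcode the column names, you can take them from the schema itself. A sketch that selects every top-level string field of pas1 (filter the list down to the fields you actually want):

from pyspark.sql.types import StringType

# Pull field names from the pas1 struct instead of hardcoding them.
pas1_type = df.schema["pas1"].dataType
wanted = ["pas1." + f.name for f in pas1_type.fields
          if isinstance(f.dataType, StringType)]
df.select(*wanted).show()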