While loading a path containing XML files (100,000+ files) that share almost the same structure, I need to create a dataframe that maps the XML tag fields to other field names (I'm using alias for this task). I'm using the spark-xml library to achieve this goal.
However, there are some specific tag names that occur in some XML files and not in others (ICMS00, ICMS10, ICMS20, etc). Example:
<det nItem="1">
  <imposto>
    <ICMS>
      <ICMS00>
        <orig>0</orig>
        <CST>00</CST>
        <modBC>0</modBC>
        <vBC>50.60</vBC>
        <pICMS>12.00</pICMS>
        <vICMS>6.07</vICMS>
      </ICMS00>
    </ICMS>
  </imposto>
</det>
<det nItem="1">
  <imposto>
    <ICMS>
      <ICMS20>
        <orig>1</orig>
        <CST>10</CST>
      </ICMS20>
    </ICMS>
  </imposto>
</det>
The schema while loading without any modification is:
-- det: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- imposto: struct (nullable = true)
| | | |-- ICMS: struct (nullable = true)
| | | | |-- ICMS00: struct (nullable = true)
| | | | | |-- CST: long (nullable = true)
| | | | | |-- modBC: long (nullable = true)
| | | | | |-- orig: long (nullable = true)
| | | | | |-- pICMS: double (nullable = true)
| | | | | |-- vBC: double (nullable = true)
| | | | | |-- vICMS: double (nullable = true)
| | | | |-- ICMS20: struct (nullable = true)
| | | | | |-- CST: long (nullable = true)
| | | | | |-- orig: long (nullable = true)
I need a solution that maps the content of ICMS00 and ICMS20 to the same column while creating the dataframe. However, I could not find anything similar to a select using regex, or a way to specify the sub-tag without the full path.
The result schema should be similar to:
-- det: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- imposto: struct (nullable = true)
| | | |-- ICMS: struct (nullable = true)
| | | | |-- ICMS: struct (nullable = true) ###common tag name###
| | | | | |-- CST: long (nullable = true)
| | | | | |-- modBC: long (nullable = true)
| | | | | |-- orig: long (nullable = true)
| | | | | |-- pICMS: double (nullable = true)
| | | | | |-- vBC: double (nullable = true)
| | | | | |-- vICMS: double (nullable = true)
I already tried to change the schema before selecting fields:
# Transform variable ICMSXX fields to static ICMS
import json
import re

from pyspark.sql.functions import col, explode
from pyspark.sql.types import StructField

det_schema = df.schema['det'].json()
icms_schema = re.sub(r"ICMS([0-9])+", r"ICMS", det_schema)
det_schema_modified = StructField.fromJson(json.loads(icms_schema))
det_schema_modified.dataType

# Explode det (item) and visualize the schema
df_itens = df.select(col('det').cast(det_schema_modified.dataType)).withColumn('det', explode(col('det')))
df_itens.select('det').printSchema()
However, this duplicates the schema and gives an error when trying to select, because of the duplicated fields in the schema:
-- det: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- imposto: struct (nullable = true)
| | | |-- ICMS: struct (nullable = true)
| | | | |-- ICMS: struct (nullable = true)
| | | | | |-- CST: long (nullable = true)
| | | | | |-- modBC: long (nullable = true)
| | | | | |-- orig: long (nullable = true)
| | | | | |-- pICMS: double (nullable = true)
| | | | | |-- vBC: double (nullable = true)
| | | | | |-- vICMS: double (nullable = true)
| | | | |-- ICMS: struct (nullable = true)
| | | | | |-- CST: long (nullable = true)
| | | | | |-- orig: long (nullable = true)
My current code:
df = spark.read.format('com.databricks.spark.xml').options(rowTag='infNFe').load(file_location)
df_itens = df_itens.select(
    col("det._nItem").alias("id"),
    col("det.prod.cProd").alias("prod_cProd"),
    col("det.prod.cEAN").alias("prod_cEAN"),
    col("det.prod.NCM").alias("prod_NCM"),
    col("det.imposto.ICMS.ICMS00.modBC").alias("ICMS_modBC"),
    col("det.imposto.ICMS.ICMS00.vBC").alias("ICMS_vBC"),
    col("det.imposto.ICMS.ICMS20.modBC").alias("ICMS_modBC"),
    col("det.imposto.ICMS.ICMS20.vBC").alias("ICMS_vBC"),
    # etc. for ICMS40, ICMS50, ICMS60, ...
)
Is there a way to select using regex or to handle these variable XML tag names while loading these files?
Something similar to:
df_itens.select(col("det.imposto.ICMS.ICMS*.modBC").alias("ICMS_modBC"))
or
df_itens.select(col("det.*.*.*.modBC").alias("ICMS_modBC"))
Related
I have the nested XML file below, which leads to my 2 questions.
How, and with what functions, should I flatten the file into a dataframe table? Examples would be very helpful.
How can I use the field "CNTL_ID" (PK) to link it with the "ENRLMTS" record? If this cannot be achieved easily, how can I create a surrogate key on the fly to link "PRVDR_INFO" with "ENRLMTS"?
XML File:
<PRVDR>
  <PRVDR_INFO>
    <INDVDL_INFO>
      <CNTL_ID>12345678</CNTL_ID>
      <BIRTH_DT>19200609</BIRTH_DT>
      <BIRTH_STATE_CD>VA</BIRTH_STATE_CD>
      <BIRTH_STATE_NAME>VIRGINIA</BIRTH_STATE_NAME>
      <BIRTH_CNTRY_CD>US</BIRTH_CNTRY_CD>
      <BIRTH_CNTRY_NAME>UNITED STATES</BIRTH_CNTRY_NAME>
      <BIRTH_FRGN_SW>Z</BIRTH_FRGN_SW>
      <NAME_LIST>
        <PEC_INDVDL_NAME>
          <NAME_CD>I</NAME_CD>
          <NAME_DESC>INDIVIDUAL NAME</NAME_DESC>
          <FIRST_NAME>WILL</FIRST_NAME>
          <MDL_NAME>J</MDL_NAME>
          <LAST_NAME>SMITH</LAST_NAME>
          <DATA_STUS_CD>CURRENT</DATA_STUS_CD>
        </PEC_INDVDL_NAME>
        <PEC_INDVDL_NAME>
          <NAME_CD>I</NAME_CD>
          <NAME_DESC>INDIVIDUAL NAME</NAME_DESC>
          <FIRST_NAME>WILL</FIRST_NAME>
          <LAST_NAME>SMITH</LAST_NAME>
          <TRMNTN_DT>2010-09-10T13:19:38</TRMNTN_DT>
          <DATA_STUS_CD>HISTORY</DATA_STUS_CD>
        </PEC_INDVDL_NAME>
      </NAME_LIST>
      <PEC_TIN>
        <TIN>555778888</TIN>
        <TAX_IDENT_TYPE_CD>T</TAX_IDENT_TYPE_CD>
        <TAX_IDENT_DESC>SSN</TAX_IDENT_DESC>
        <DATA_STUS_CD>CURRENT</DATA_STUS_CD>
      </PEC_TIN>
      <PEC_NPI>
        <NPI>3334211156</NPI>
        <VRFYD_BUSNS_SW>Y</VRFYD_BUSNS_SW>
        <CREAT_TS>2010-09-10T13:16:28</CREAT_TS>
        <DATA_STUS_CD>CURRENT</DATA_STUS_CD>
      </PEC_NPI>
    </INDVDL_INFO>
  </PRVDR_INFO>
  <ENRLMTS>
    <ABC_855X>
      <ENRLMT_INFO>
        <ENRLMT_DTLS>
          <FORM_TYPE_CD>9999A</FORM_TYPE_CD>
          <ENRLMT_ID>123444555666778899000</ENRLMT_ID>
          <ENRLMT_STUS_DLTS>
            <STUS_CD>06</STUS_CD>
            <STUS_DESC>APPROVED</STUS_DESC>
            <STUS_DT>2012-05-14T16:04:22</STUS_DT>
            <DATA_STUS_CD>HISTORY</DATA_STUS_CD>
            <ENRLMT_STUS_RSN_DLTS>
              <STUS_RSN_CD>047</STUS_RSN_CD>
              <STUS_RSN_DESC>APPROVED</STUS_RSN_DESC>
              <DATA_STUS_CD>HISTORY</DATA_STUS_CD>
            </ENRLMT_STUS_RSN_DLTS>
          </ENRLMT_STUS_DLTS>
          <ENRLMT_STUS_DLTS>
            <STUS_CD>06</STUS_CD>
            <STUS_DESC>APPROVED</STUS_DESC>
            <STUS_DT>2016-08-09T14:33:40</STUS_DT>
            <DATA_STUS_CD>CURRENT</DATA_STUS_CD>
            <ENRLMT_STUS_RSN_DLTS>
              <STUS_RSN_CD>081</STUS_RSN_CD>
              <STUS_RSN_DESC>APPROVED FOR REVALIDATION</STUS_RSN_DESC>
              <DATA_STUS_CD>CURRENT</DATA_STUS_CD>
            </ENRLMT_STUS_RSN_DLTS>
          </ENRLMT_STUS_DLTS>
          <BUSNS_STATE>VA</BUSNS_STATE>
          <BUSNS_STATE_NAME>VIRGINIA</BUSNS_STATE_NAME>
          <CNTRCTR_LIST>
            <CNTRCTR_INFO>
              <CNTRCTR_ID>11111</CNTRCTR_ID>
              <CNTRCTR_NAME>SOLUTIONS, INC.</CNTRCTR_NAME>
              <DATA_STUS_CD>CURRENT</DATA_STUS_CD>
            </CNTRCTR_INFO>
          </CNTRCTR_LIST>
        </ENRLMT_DTLS>
      </ENRLMT_INFO>
      <PEC_ENRLMT_REVLDTN>
        <REVLDTN_INSTNC_NUM>1</REVLDTN_INSTNC_NUM>
        <REVLDTN_STUS_CD>03</REVLDTN_STUS_CD>
        <REVLDTN_STUS_DESC>CANCELLED</REVLDTN_STUS_DESC>
      </PEC_ENRLMT_REVLDTN>
      <ACPT_NEW_PTNT_SW>Y</ACPT_NEW_PTNT_SW>
    </ABC_855X>
  </ENRLMTS>
</PRVDR>
Schema is below:
root
|-- ENRLMTS: struct (nullable = true)
| |-- ABC_855X: struct (nullable = true)
| | |-- ACPT_NEW_PTNT_SW: string (nullable = true)
| | |-- ENRLMT_INFO: struct (nullable = true)
| | | |-- ENRLMT_DTLS: struct (nullable = true)
| | | | |-- BUSNS_STATE: string (nullable = true)
| | | | |-- BUSNS_STATE_NAME: string (nullable = true)
| | | | |-- CNTRCTR_LIST: struct (nullable = true)
| | | | | |-- CNTRCTR_INFO: struct (nullable = true)
| | | | | | |-- CNTRCTR_ID: integer (nullable = true)
| | | | | | |-- CNTRCTR_NAME: string (nullable = true)
| | | | | | |-- DATA_STUS_CD: string (nullable = true)
| | | | |-- ENRLMT_ID: double (nullable = true)
| | | | |-- ENRLMT_STUS_DLTS: array (nullable = true)
| | | | | |-- element: struct (containsNull = true)
| | | | | | |-- DATA_STUS_CD: string (nullable = true)
| | | | | | |-- ENRLMT_STUS_RSN_DLTS: struct (nullable = true)
| | | | | | | |-- DATA_STUS_CD: string (nullable = true)
| | | | | | | |-- STUS_RSN_CD: integer (nullable = true)
| | | | | | | |-- STUS_RSN_DESC: string (nullable = true)
| | | | | | |-- STUS_CD: integer (nullable = true)
| | | | | | |-- STUS_DESC: string (nullable = true)
| | | | | | |-- STUS_DT: string (nullable = true)
| | | | |-- FORM_TYPE_CD: string (nullable = true)
| | |-- PEC_ENRLMT_REVLDTN: struct (nullable = true)
| | | |-- REVLDTN_INSTNC_NUM: integer (nullable = true)
| | | |-- REVLDTN_STUS_CD: integer (nullable = true)
| | | |-- REVLDTN_STUS_DESC: string (nullable = true)
|-- PRVDR_INFO: struct (nullable = true)
| |-- INDVDL_INFO: struct (nullable = true)
| | |-- BIRTH_CNTRY_CD: string (nullable = true)
| | |-- BIRTH_CNTRY_NAME: string (nullable = true)
| | |-- BIRTH_DT: integer (nullable = true)
| | |-- BIRTH_FRGN_SW: string (nullable = true)
| | |-- BIRTH_STATE_CD: string (nullable = true)
| | |-- BIRTH_STATE_NAME: string (nullable = true)
| | |-- CNTL_ID: integer (nullable = true)
| | |-- NAME_LIST: struct (nullable = true)
| | | |-- PEC_INDVDL_NAME: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- DATA_STUS_CD: string (nullable = true)
| | | | | |-- FIRST_NAME: string (nullable = true)
| | | | | |-- LAST_NAME: string (nullable = true)
| | | | | |-- MDL_NAME: string (nullable = true)
| | | | | |-- NAME_CD: string (nullable = true)
| | | | | |-- NAME_DESC: string (nullable = true)
| | | | | |-- TRMNTN_DT: string (nullable = true)
| | |-- PEC_NPI: struct (nullable = true)
| | | |-- CREAT_TS: string (nullable = true)
| | | |-- DATA_STUS_CD: string (nullable = true)
| | | |-- NPI: long (nullable = true)
| | | |-- VRFYD_BUSNS_SW: string (nullable = true)
| | |-- PEC_TIN: struct (nullable = true)
| | | |-- DATA_STUS_CD: string (nullable = true)
| | | |-- TAX_IDENT_DESC: string (nullable = true)
| | | |-- TAX_IDENT_TYPE_CD: string (nullable = true)
| | | |-- TIN: integer (nullable = true)
My current nested dataframe output is below. I would like to break this nested output into multiple rows if possible.
+--------------------+--------------------+
| ENRLMTS| PRVDR_INFO|
+--------------------+--------------------+
|{{Y, {{VA, VIRGIN...|{{US, UNITED STAT...|
+--------------------+--------------------+
Thank you much.
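As a rough illustration (not the only way), here is a minimal sketch assuming the DataFrame above is called df: add a surrogate key per provider row with monotonically_increasing_id(), flatten each side with explode_outer, and join them back on that key. CNTL_ID can then travel along from PRVDR_INFO:

from pyspark.sql.functions import col, explode_outer, monotonically_increasing_id

df_keyed = df.withColumn("prvdr_key", monotonically_increasing_id())

# flatten the provider names (one row per PEC_INDVDL_NAME element)
prvdr_names = df_keyed.select(
    "prvdr_key",
    col("PRVDR_INFO.INDVDL_INFO.CNTL_ID").alias("CNTL_ID"),
    explode_outer("PRVDR_INFO.INDVDL_INFO.NAME_LIST.PEC_INDVDL_NAME").alias("name"),
).select("prvdr_key", "CNTL_ID", "name.*")

# flatten the enrolment statuses (one row per ENRLMT_STUS_DLTS element)
enrlmts = df_keyed.select(
    "prvdr_key",
    col("ENRLMTS.ABC_855X.ENRLMT_INFO.ENRLMT_DTLS.ENRLMT_ID").alias("ENRLMT_ID"),
    explode_outer("ENRLMTS.ABC_855X.ENRLMT_INFO.ENRLMT_DTLS.ENRLMT_STUS_DLTS").alias("stus"),
).select("prvdr_key", "ENRLMT_ID", "stus.*")

# link enrolments back to the provider via the surrogate key
linked = enrlmts.join(prvdr_names.select("prvdr_key", "CNTL_ID").distinct(), "prvdr_key")

Any further nested array can be exploded the same way to produce one row per element.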
I have a spark dataframe with the following schema:
|-- id: long (nullable = true)
|-- comment: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: long (nullable = true)
| | |-- body: string (nullable = true)
| | |-- html_body: string (nullable = true)
| | |-- author_id: long (nullable = true)
| | |-- uploads: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- attachments: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- thumbnails: array (nullable = true)
| | | | | |-- element: struct (containsNull = true)
| | | | | | |-- id: long (nullable = true)
| | | | | | |-- file_name: string (nullable = true)
| | | | | | |-- url: string (nullable = true)
| | | | | | |-- content_url: string (nullable = true)
| | | | | | |-- mapped_content_url: string (nullable = true)
| | | | | | |-- content_type: string (nullable = true)
| | | | | | |-- size: long (nullable = true)
| | | | | | |-- width: long (nullable = true)
| | | | | | |-- height: long (nullable = true)
| | | | | | |-- inline: boolean (nullable = true)
| | |-- created_at: string (nullable = true)
| | |-- public: boolean (nullable = true)
| | |-- channel: string (nullable = true)
| | |-- from: string (nullable = true)
| | |-- location: string (nullable = true)
However, this has way more data than what I need. For each element of the comment array, I would like to concatenate first comment.created_at and then comment.body into a new struct column called comment_final.
The end goal is to build from that a string column that flattens the entire array into a single HTML-like string field.
For the end result, I would like to do the following:
.withColumn('final_body', array_join(col('comment.body'), '<br/><br/>'))
Can someone help me out with the array/struct data modelling?
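A minimal sketch of one way to do this, assuming the DataFrame above is called df and Spark 3.1+ for the Python transform() helper (on older versions the same logic can be written with expr("transform(...)")). The " - " separator between created_at and body is only an illustrative choice:

from pyspark.sql.functions import array_join, col, concat_ws, transform

df_out = (
    df
    # per-element "created_at - body" string built from the comment array
    .withColumn("comment_final",
                transform(col("comment"),
                          lambda c: concat_ws(" - ", c["created_at"], c["body"])))
    # collapse the array into a single HTML-like field
    .withColumn("final_body", array_join(col("comment_final"), "<br/><br/>"))
)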
I am new to Spark and have the following df schema (part of it).
root
|-- value: binary (nullable = false)
|-- event: struct (nullable = false)
| |-- eventtime: struct (nullable = true)
| | |-- seconds: long (nullable = true)
| | |-- ms: float (nullable = true)
| |-- fault: struct (nullable = true)
| | |-- collections: struct (nullable = true)
| | | |-- snapshots: array (nullable = false) --> ** FIRST LEVEL ARRAY (or array of arrays) **
| | | | |-- element: struct (containsNull = false)
| | | | | |-- ringbuffer: struct (nullable = true)
| | | | | | |-- columns: array (nullable = false) --> ** SECOND LEVEL ARRAY **
| | | | | | | |-- element: struct (containsNull = false)
| | | | | | | | |-- doubles: struct (nullable = true)
| | | | | | | | | |-- values: array (nullable = false)
| | | | | | | | | | |-- element: float (containsNull = false)
....................................
..........................
I am able to add a new field under fault with the following code, and the new field comp_id appears at the same level as collections.
df.withColumn("event", col("event").withField("fault.comp_id", lit(1234)))
How can I add a new field inside an array of arrays? For example, adding a new test_field under columns? I tried to get into the arrays by specifying the first index, 0:
df.withColumn("event",col("event").withField("fault.collections.snapshots.0.ringbuffer.columns.0.test_field", lit("test_value")))
But got this error
org.apache.spark.sql.catalyst.parser.ParseException: mismatched input '.0' expecting {<EOF>, '.', '-'}(line 1, pos 25)
== SQL ==
fault.snapshots.snapshots.0.ringbuffer.columns.0.test_field
-------------------------^^^
at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:255)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:124)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parseMultipartIdentifier(ParseDriver.scala:61)
at org.apache.spark.sql.catalyst.expressions.UpdateFields$.nameParts(complexTypeCreator.scala:693)
at org.apache.spark.sql.catalyst.expressions.UpdateFields$.apply(complexTypeCreator.scala:701)
at org.apache.spark.sql.Column.withField(Column.scala:927)
at org.apache.spark.sql.Dataset.transform(Dataset.scala:2751)
at org.apache.spark.sql.Dataset.transform(Dataset.scala:2751)
So the desired schema would be like the following.
root
|-- value: binary (nullable = false)
|-- event: struct (nullable = false)
| |-- eventtime: struct (nullable = true)
| | |-- seconds: long (nullable = true)
| | |-- ms: float (nullable = true)
| |-- fault: struct (nullable = true)
| | |-- collections: struct (nullable = true)
| | | |-- snapshots: array (nullable = false) --> ** FIRST LEVEL ARRAY (or array of arrays) **
| | | | |-- element: struct (containsNull = false)
| | | | | |-- ringbuffer: struct (nullable = true)
| | | | | | |-- columns: array (nullable = false) --> ** SECOND LEVEL ARRAY **
| | | | | | | |-- element: struct (containsNull = false)
| | | | | | | | |-- doubles: struct (nullable = true)
| | | | | | | | | |-- values: array (nullable = false)
| | | | | | | | | | |-- element: float (containsNull = false)
| | | | | | | | |-- test_field: string (nullable = true)
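A minimal sketch of one way to get the test_field shown in the desired schema above, assuming Spark 3.1+ (Python transform() and Column.withField). Array elements cannot be addressed by index inside a withField path, so each array level has to be rewritten element by element with transform(), and only the innermost element gets the new field:

from pyspark.sql.functions import col, lit, transform

df_out = df.withColumn(
    "event",
    col("event").withField(
        "fault.collections.snapshots",
        # rewrite the first-level array element by element
        transform(
            col("event.fault.collections.snapshots"),
            lambda snap: snap.withField(
                "ringbuffer.columns",
                # rewrite the second-level array, adding test_field to each element
                transform(
                    snap["ringbuffer"]["columns"],
                    lambda c: c.withField("test_field", lit("test_value"))
                )
            )
        )
    )
)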
I have a requirement where I need to mask the data for some of the fields in a given schema. I've researched a lot and couldn't find the answer I need.
This is the schema where I need some changes to the fields (answer_type, response0, response3):
| |-- choices: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- choice_id: long (nullable = true)
| | | |-- created_time: long (nullable = true)
| | | |-- updated_time: long (nullable = true)
| | | |-- created_by: long (nullable = true)
| | | |-- updated_by: long (nullable = true)
| | | |-- answers: struct (nullable = true)
| | | | |-- answer_node_internal_id: long (nullable = true)
| | | | |-- label: string (nullable = true)
| | | | |-- text: map (nullable = true)
| | | | | |-- key: string
| | | | | |-- value: string (valueContainsNull = true)
| | | | |-- data_tag: string (nullable = true)
| | | | |-- answer_type: string (nullable = true)
| | | |-- response: struct (nullable = true)
| | | | |-- response0: string (nullable = true)
| | | | |-- response1: long (nullable = true)
| | | | |-- response2: double (nullable = true)
| | | | |-- response3: array (nullable = true)
| | | | | |-- element: string (containsNull = true)
Is there a way I could assign values to those fields without affecting the above structure in pyspark?
I've tried using explode but I can't revert to the original schema. I don't want to create a new column either, and at the same time I don't want to lose any data from the provided schema object.
Oh, I had a similar problem a few days ago. I suggest transforming the StructType to JSON, then with a UDF you can make the internal changes, and afterwards you can get the original struct back again.
You should look at to_json and from_json in the documentation.
https://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html#pyspark.sql.functions.from_json
https://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html#pyspark.sql.functions.to_json
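A minimal sketch of the to_json / from_json round trip described above, assuming the array column is (or has been selected up to) a top-level column named choices on a DataFrame df. The mask values and the exact fields touched are illustrative only:

import json
from pyspark.sql.functions import col, from_json, to_json, udf
from pyspark.sql.types import StringType

# keep the original schema so from_json can rebuild the exact same structure
choices_schema = df.schema["choices"].dataType

@udf(StringType())
def mask_choices(choices_json):
    if choices_json is None:
        return None
    choices = json.loads(choices_json)
    for choice in choices:
        # illustrative masking of answer_type, response0 and response3
        if choice.get("answers"):
            choice["answers"]["answer_type"] = "***"
        if choice.get("response"):
            choice["response"]["response0"] = "***"
            choice["response"]["response3"] = ["***"]
    return json.dumps(choices)

df_masked = df.withColumn(
    "choices",
    from_json(mask_choices(to_json(col("choices"))), choices_schema)
)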
I have a DataFrame with the schema below. This is basically an XML file, which I have converted to a DataFrame for further processing. I am trying to extract the _Date column, but it looks like some type mismatch is happening.
df1.printSchema
|-- PlayWeek: struct (nullable = true)
| |-- TicketSales: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- PlayDate: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- BoxOfficeDetail: array (nullable = true)
| | | | | | |-- element: struct (containsNull = true)
| | | | | | | |-- VisualFormatCd: struct (nullable = true)
| | | | | | | | |-- Code: struct (nullable = true)
| | | | | | | | | |-- _SequenceId: long (nullable = true)
| | | | | | | | | |-- _VALUE: double (nullable = true)
| | | | | | | |-- _SessionTypeCd: string (nullable = true)
| | | | | | | |-- _TicketPrice: double (nullable = true)
| | | | | | | |-- _TicketQuantity: long (nullable = true)
| | | | | | | |-- _TicketTax: double (nullable = true)
| | | | | | | |-- _TicketTypeCd: string (nullable = true)
| | | | | |-- _Date: string (nullable = true)
| | | |-- _FilmId: long (nullable = true)
| | | |-- _Screen: long (nullable = true)
| | | |-- _TheatreId: long (nullable = true)
| |-- _BusinessEndDate: string (nullable = true)
| |-- _BusinessStartDate: string (nullable = true)
I need to extract the _Date column but it's throwing the error below:
scala> df1.select(df1.col("PlayWeek.TicketSales.PlayDate._Date")).show()
org.apache.spark.sql.AnalysisException: cannot resolve 'PlayWeek.TicketSales.PlayDate[_Date]' due to data type mismatch: argument 2 requires integral type, however, '_Date' is of string type.;
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:65)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:335)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:335)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:334)
Any help would be appreciated.
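The mismatch happens because PlayDate is an array nested inside the TicketSales array, so the dot path ends up trying to index an array of arrays with a string. A minimal sketch (written in PySpark for illustration, assuming the DataFrame is df1) that explodes both array levels before selecting the field:

from pyspark.sql.functions import col, explode

df_dates = (
    df1
    # one row per TicketSales element
    .withColumn("sale", explode(col("PlayWeek.TicketSales")))
    # one row per PlayDate element within each sale
    .withColumn("play", explode(col("sale.PlayDate")))
    .select(col("play._Date").alias("Date"))
)
df_dates.show()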