While loading a path containing XML files (100,000+ files) that share almost the same structure, I need to create a dataframe that maps the XML tag fields to other field names (I'm using alias for this task). I'm using the spark-xml library to achieve this goal.
However, there are some specific tag names that occur in some XML files and not in others (ICMS00, ICMS10, ICMS20, etc). Example:
<det nItem="1">
  <imposto>
    <ICMS>
      <ICMS00>
        <orig>0</orig>
        <CST>00</CST>
        <modBC>0</modBC>
        <vBC>50.60</vBC>
        <pICMS>12.00</pICMS>
        <vICMS>6.07</vICMS>
      </ICMS00>
    </ICMS>
  </imposto>
</det>
<det nItem="1">
  <imposto>
    <ICMS>
      <ICMS20>
        <orig>1</orig>
        <CST>10</CST>
      </ICMS20>
    </ICMS>
  </imposto>
</det>
The schema while loading without any modification is:
-- det: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- imposto: struct (nullable = true)
| | | |-- ICMS: struct (nullable = true)
| | | | |-- ICMS00: struct (nullable = true)
| | | | | |-- CST: long (nullable = true)
| | | | | |-- modBC: long (nullable = true)
| | | | | |-- orig: long (nullable = true)
| | | | | |-- pICMS: double (nullable = true)
| | | | | |-- vBC: double (nullable = true)
| | | | | |-- vICMS: double (nullable = true)
| | | | |-- ICMS20: struct (nullable = true)
| | | | | |-- CST: long (nullable = true)
| | | | | |-- orig: long (nullable = true)
I need a solution that maps the content of ICMS00 and ICMS20 to the same column while creating the dataframe. However, I could not find anything similar to a select using regex, or a way to specify the sub-tag without the full path.
The result schema should be similar to:
-- det: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- imposto: struct (nullable = true)
| | | |-- ICMS: struct (nullable = true)
| | | | |-- ICMS: struct (nullable = true) ###common tag name###
| | | | | |-- CST: long (nullable = true)
| | | | | |-- modBC: long (nullable = true)
| | | | | |-- orig: long (nullable = true)
| | | | | |-- pICMS: double (nullable = true)
| | | | | |-- vBC: double (nullable = true)
| | | | | |-- vICMS: double (nullable = true)
I already tried to change the schema before selecting fields:
# Transform variable ICMSXX fields to static ICMS
import json
import re

from pyspark.sql.functions import col, explode
from pyspark.sql.types import StructField

det_schema = df.schema['det'].json()
icms_schema = re.sub(r"ICMS([0-9])+", r"ICMS", det_schema)
det_schema_modified = StructField.fromJson(json.loads(icms_schema))
det_schema_modified.dataType

# Explode det (item) and visualize the schema
df_itens = df.select(col('det').cast(det_schema_modified.dataType)).withColumn('det', explode(col('det')))
df_itens.select('det').printSchema()
However, this duplicates the schema and gives an error when trying to select, because of the duplicated fields in the schema:
-- det: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- imposto: struct (nullable = true)
| | | |-- ICMS: struct (nullable = true)
| | | | |-- ICMS: struct (nullable = true)
| | | | | |-- CST: long (nullable = true)
| | | | | |-- modBC: long (nullable = true)
| | | | | |-- orig: long (nullable = true)
| | | | | |-- pICMS: double (nullable = true)
| | | | | |-- vBC: double (nullable = true)
| | | | | |-- vICMS: double (nullable = true)
| | | | |-- ICMS: struct (nullable = true)
| | | | | |-- CST: long (nullable = true)
| | | | | |-- orig: long (nullable = true)
My current code:
df = spark.read.format('com.databricks.spark.xml').options(rowTag='infNFe').load(file_location)
df_itens = df_itens.select(
    col("det._nItem").alias("id"),
    col("det.prod.cProd").alias("prod_cProd"),
    col("det.prod.cEAN").alias("prod_cEAN"),
    col("det.prod.NCM").alias("prod_NCM"),
    col("det.imposto.ICMS.ICMS00.modBC").alias("ICMS_modBC"),
    col("det.imposto.ICMS.ICMS00.vBC").alias("ICMS_vBC"),
    col("det.imposto.ICMS.ICMS20.modBC").alias("ICMS_modBC"),
    col("det.imposto.ICMS.ICMS20.vBC").alias("ICMS_vBC"),
    # etc. for ICMS40, ICMS50, ICMS60, ...
)
Is there a way to select using regex or to handle these variable XML tag names while loading these files?
Something similar to:
df_itens.select(col("det.imposto.ICMS.ICMS*.modBC").alias("ICMS_modBC"))
or
df_itens.select(col("det.*.*.*.modBC").alias("ICMS_modBC"))
Related
I have the nested XML file below, which leads to my 2 questions.
How, and with what functions, should I flatten the file into a dataframe table? Examples would be very helpful.
How can I use the field "CNTL_ID" (PK) to link it with the "ENRLMTS" record? If this cannot be achieved easily, how can I create a surrogate key on the fly to link "PRVDR_INFO" with "ENRLMTS"?
XML File:
<PRVDR>
  <PRVDR_INFO>
    <INDVDL_INFO>
      <CNTL_ID>12345678</CNTL_ID>
      <BIRTH_DT>19200609</BIRTH_DT>
      <BIRTH_STATE_CD>VA</BIRTH_STATE_CD>
      <BIRTH_STATE_NAME>VIRGINIA</BIRTH_STATE_NAME>
      <BIRTH_CNTRY_CD>US</BIRTH_CNTRY_CD>
      <BIRTH_CNTRY_NAME>UNITED STATES</BIRTH_CNTRY_NAME>
      <BIRTH_FRGN_SW>Z</BIRTH_FRGN_SW>
      <NAME_LIST>
        <PEC_INDVDL_NAME>
          <NAME_CD>I</NAME_CD>
          <NAME_DESC>INDIVIDUAL NAME</NAME_DESC>
          <FIRST_NAME>WILL</FIRST_NAME>
          <MDL_NAME>J</MDL_NAME>
          <LAST_NAME>SMITH</LAST_NAME>
          <DATA_STUS_CD>CURRENT</DATA_STUS_CD>
        </PEC_INDVDL_NAME>
        <PEC_INDVDL_NAME>
          <NAME_CD>I</NAME_CD>
          <NAME_DESC>INDIVIDUAL NAME</NAME_DESC>
          <FIRST_NAME>WILL</FIRST_NAME>
          <LAST_NAME>SMITH</LAST_NAME>
          <TRMNTN_DT>2010-09-10T13:19:38</TRMNTN_DT>
          <DATA_STUS_CD>HISTORY</DATA_STUS_CD>
        </PEC_INDVDL_NAME>
      </NAME_LIST>
      <PEC_TIN>
        <TIN>555778888</TIN>
        <TAX_IDENT_TYPE_CD>T</TAX_IDENT_TYPE_CD>
        <TAX_IDENT_DESC>SSN</TAX_IDENT_DESC>
        <DATA_STUS_CD>CURRENT</DATA_STUS_CD>
      </PEC_TIN>
      <PEC_NPI>
        <NPI>3334211156</NPI>
        <VRFYD_BUSNS_SW>Y</VRFYD_BUSNS_SW>
        <CREAT_TS>2010-09-10T13:16:28</CREAT_TS>
        <DATA_STUS_CD>CURRENT</DATA_STUS_CD>
      </PEC_NPI>
    </INDVDL_INFO>
  </PRVDR_INFO>
  <ENRLMTS>
    <ABC_855X>
      <ENRLMT_INFO>
        <ENRLMT_DTLS>
          <FORM_TYPE_CD>9999A</FORM_TYPE_CD>
          <ENRLMT_ID>123444555666778899000</ENRLMT_ID>
          <ENRLMT_STUS_DLTS>
            <STUS_CD>06</STUS_CD>
            <STUS_DESC>APPROVED</STUS_DESC>
            <STUS_DT>2012-05-14T16:04:22</STUS_DT>
            <DATA_STUS_CD>HISTORY</DATA_STUS_CD>
            <ENRLMT_STUS_RSN_DLTS>
              <STUS_RSN_CD>047</STUS_RSN_CD>
              <STUS_RSN_DESC>APPROVED</STUS_RSN_DESC>
              <DATA_STUS_CD>HISTORY</DATA_STUS_CD>
            </ENRLMT_STUS_RSN_DLTS>
          </ENRLMT_STUS_DLTS>
          <ENRLMT_STUS_DLTS>
            <STUS_CD>06</STUS_CD>
            <STUS_DESC>APPROVED</STUS_DESC>
            <STUS_DT>2016-08-09T14:33:40</STUS_DT>
            <DATA_STUS_CD>CURRENT</DATA_STUS_CD>
            <ENRLMT_STUS_RSN_DLTS>
              <STUS_RSN_CD>081</STUS_RSN_CD>
              <STUS_RSN_DESC>APPROVED FOR REVALIDATION</STUS_RSN_DESC>
              <DATA_STUS_CD>CURRENT</DATA_STUS_CD>
            </ENRLMT_STUS_RSN_DLTS>
          </ENRLMT_STUS_DLTS>
          <BUSNS_STATE>VA</BUSNS_STATE>
          <BUSNS_STATE_NAME>VIRGINIA</BUSNS_STATE_NAME>
          <CNTRCTR_LIST>
            <CNTRCTR_INFO>
              <CNTRCTR_ID>11111</CNTRCTR_ID>
              <CNTRCTR_NAME>SOLUTIONS, INC.</CNTRCTR_NAME>
              <DATA_STUS_CD>CURRENT</DATA_STUS_CD>
            </CNTRCTR_INFO>
          </CNTRCTR_LIST>
        </ENRLMT_DTLS>
      </ENRLMT_INFO>
      <PEC_ENRLMT_REVLDTN>
        <REVLDTN_INSTNC_NUM>1</REVLDTN_INSTNC_NUM>
        <REVLDTN_STUS_CD>03</REVLDTN_STUS_CD>
        <REVLDTN_STUS_DESC>CANCELLED</REVLDTN_STUS_DESC>
      </PEC_ENRLMT_REVLDTN>
      <ACPT_NEW_PTNT_SW>Y</ACPT_NEW_PTNT_SW>
    </ABC_855X>
  </ENRLMTS>
</PRVDR>
Schema is below:
root
|-- ENRLMTS: struct (nullable = true)
| |-- ABC_855X: struct (nullable = true)
| | |-- ACPT_NEW_PTNT_SW: string (nullable = true)
| | |-- ENRLMT_INFO: struct (nullable = true)
| | | |-- ENRLMT_DTLS: struct (nullable = true)
| | | | |-- BUSNS_STATE: string (nullable = true)
| | | | |-- BUSNS_STATE_NAME: string (nullable = true)
| | | | |-- CNTRCTR_LIST: struct (nullable = true)
| | | | | |-- CNTRCTR_INFO: struct (nullable = true)
| | | | | | |-- CNTRCTR_ID: integer (nullable = true)
| | | | | | |-- CNTRCTR_NAME: string (nullable = true)
| | | | | | |-- DATA_STUS_CD: string (nullable = true)
| | | | |-- ENRLMT_ID: double (nullable = true)
| | | | |-- ENRLMT_STUS_DLTS: array (nullable = true)
| | | | | |-- element: struct (containsNull = true)
| | | | | | |-- DATA_STUS_CD: string (nullable = true)
| | | | | | |-- ENRLMT_STUS_RSN_DLTS: struct (nullable = true)
| | | | | | | |-- DATA_STUS_CD: string (nullable = true)
| | | | | | | |-- STUS_RSN_CD: integer (nullable = true)
| | | | | | | |-- STUS_RSN_DESC: string (nullable = true)
| | | | | | |-- STUS_CD: integer (nullable = true)
| | | | | | |-- STUS_DESC: string (nullable = true)
| | | | | | |-- STUS_DT: string (nullable = true)
| | | | |-- FORM_TYPE_CD: string (nullable = true)
| | |-- PEC_ENRLMT_REVLDTN: struct (nullable = true)
| | | |-- REVLDTN_INSTNC_NUM: integer (nullable = true)
| | | |-- REVLDTN_STUS_CD: integer (nullable = true)
| | | |-- REVLDTN_STUS_DESC: string (nullable = true)
|-- PRVDR_INFO: struct (nullable = true)
| |-- INDVDL_INFO: struct (nullable = true)
| | |-- BIRTH_CNTRY_CD: string (nullable = true)
| | |-- BIRTH_CNTRY_NAME: string (nullable = true)
| | |-- BIRTH_DT: integer (nullable = true)
| | |-- BIRTH_FRGN_SW: string (nullable = true)
| | |-- BIRTH_STATE_CD: string (nullable = true)
| | |-- BIRTH_STATE_NAME: string (nullable = true)
| | |-- CNTL_ID: integer (nullable = true)
| | |-- NAME_LIST: struct (nullable = true)
| | | |-- PEC_INDVDL_NAME: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- DATA_STUS_CD: string (nullable = true)
| | | | | |-- FIRST_NAME: string (nullable = true)
| | | | | |-- LAST_NAME: string (nullable = true)
| | | | | |-- MDL_NAME: string (nullable = true)
| | | | | |-- NAME_CD: string (nullable = true)
| | | | | |-- NAME_DESC: string (nullable = true)
| | | | | |-- TRMNTN_DT: string (nullable = true)
| | |-- PEC_NPI: struct (nullable = true)
| | | |-- CREAT_TS: string (nullable = true)
| | | |-- DATA_STUS_CD: string (nullable = true)
| | | |-- NPI: long (nullable = true)
| | | |-- VRFYD_BUSNS_SW: string (nullable = true)
| | |-- PEC_TIN: struct (nullable = true)
| | | |-- DATA_STUS_CD: string (nullable = true)
| | | |-- TAX_IDENT_DESC: string (nullable = true)
| | | |-- TAX_IDENT_TYPE_CD: string (nullable = true)
| | | |-- TIN: integer (nullable = true)
My current nested dataframe output is below. I would like to break this nested output into multiple rows if possible.
+--------------------+--------------------+
| ENRLMTS| PRVDR_INFO|
+--------------------+--------------------+
|{{Y, {{VA, VIRGIN...|{{US, UNITED STAT...|
+--------------------+--------------------+
Thank you much.
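As a rough illustration (not the only way), here is a minimal sketch assuming the DataFrame above is called df: add a surrogate key per provider row with monotonically_increasing_id(), flatten each side with explode_outer, and join them back on that key. CNTL_ID can then travel along from PRVDR_INFO:

from pyspark.sql.functions import col, explode_outer, monotonically_increasing_id

df_keyed = df.withColumn("prvdr_key", monotonically_increasing_id())

# flatten the provider names (one row per PEC_INDVDL_NAME element)
prvdr_names = df_keyed.select(
    "prvdr_key",
    col("PRVDR_INFO.INDVDL_INFO.CNTL_ID").alias("CNTL_ID"),
    explode_outer("PRVDR_INFO.INDVDL_INFO.NAME_LIST.PEC_INDVDL_NAME").alias("name"),
).select("prvdr_key", "CNTL_ID", "name.*")

# flatten the enrolment statuses (one row per ENRLMT_STUS_DLTS element)
enrlmts = df_keyed.select(
    "prvdr_key",
    col("ENRLMTS.ABC_855X.ENRLMT_INFO.ENRLMT_DTLS.ENRLMT_ID").alias("ENRLMT_ID"),
    explode_outer("ENRLMTS.ABC_855X.ENRLMT_INFO.ENRLMT_DTLS.ENRLMT_STUS_DLTS").alias("stus"),
).select("prvdr_key", "ENRLMT_ID", "stus.*")

# link enrolments back to the provider via the surrogate key
linked = enrlmts.join(prvdr_names.select("prvdr_key", "CNTL_ID").distinct(), "prvdr_key")

Any further nested array can be exploded the same way to produce one row per element.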
I have a spark dataframe with the following schema:
|-- id: long (nullable = true)
|-- comment: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: long (nullable = true)
| | |-- body: string (nullable = true)
| | |-- html_body: string (nullable = true)
| | |-- author_id: long (nullable = true)
| | |-- uploads: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- attachments: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- thumbnails: array (nullable = true)
| | | | | |-- element: struct (containsNull = true)
| | | | | | |-- id: long (nullable = true)
| | | | | | |-- file_name: string (nullable = true)
| | | | | | |-- url: string (nullable = true)
| | | | | | |-- content_url: string (nullable = true)
| | | | | | |-- mapped_content_url: string (nullable = true)
| | | | | | |-- content_type: string (nullable = true)
| | | | | | |-- size: long (nullable = true)
| | | | | | |-- width: long (nullable = true)
| | | | | | |-- height: long (nullable = true)
| | | | | | |-- inline: boolean (nullable = true)
| | |-- created_at: string (nullable = true)
| | |-- public: boolean (nullable = true)
| | |-- channel: string (nullable = true)
| | |-- from: string (nullable = true)
| | |-- location: string (nullable = true)
However, this has way more data than what I need. For each element of the comment array, I would like to concatenate first comment.created_at and then comment.body into a new struct column called comment_final.
The end goal is to build from that a string column that flattens the entire array into a single HTML-like string field.
For the end result, I would like to do the following:
.withColumn('final_body', array_join(col('comment.body'), '<br/><br/>'))
Can someone help me out with the array/struct data modelling?
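A minimal sketch of one way to do this, assuming the DataFrame above is called df and Spark 3.1+ for the Python transform() helper (on older versions the same logic can be written with expr("transform(...)")). The " - " separator between created_at and body is only an illustrative choice:

from pyspark.sql.functions import array_join, col, concat_ws, transform

df_out = (
    df
    # per-element "created_at - body" string built from the comment array
    .withColumn("comment_final",
                transform(col("comment"),
                          lambda c: concat_ws(" - ", c["created_at"], c["body"])))
    # collapse the array into a single HTML-like field
    .withColumn("final_body", array_join(col("comment_final"), "<br/><br/>"))
)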
I am new to Spark and have the following df schema (part of it).
root
|-- value: binary (nullable = false)
|-- event: struct (nullable = false)
| |-- eventtime: struct (nullable = true)
| | |-- seconds: long (nullable = true)
| | |-- ms: float (nullable = true)
| |-- fault: struct (nullable = true)
| | |-- collections: struct (nullable = true)
| | | |-- snapshots: array (nullable = false) --> ** FIRST LEVEL ARRAY (or array of arrays) **
| | | | |-- element: struct (containsNull = false)
| | | | | |-- ringbuffer: struct (nullable = true)
| | | | | | |-- columns: array (nullable = false) --> ** SECOND LEVEL ARRAY **
| | | | | | | |-- element: struct (containsNull = false)
| | | | | | | | |-- doubles: struct (nullable = true)
| | | | | | | | | |-- values: array (nullable = false)
| | | | | | | | | | |-- element: float (containsNull = false)
....................................
..........................
I am able to add a new field under fault with the following code, and the new field comp_id appears at the same level as collections.
df.withColumn("event", col("event").withField("fault.comp_id", lit(1234)))
How can I add a new field inside an array of arrays? For example, adding a new test_field under columns? I tried to get into the arrays by specifying the first index, 0:
df.withColumn("event",col("event").withField("fault.collections.snapshots.0.ringbuffer.columns.0.test_field", lit("test_value")))
But got this error
org.apache.spark.sql.catalyst.parser.ParseException: mismatched input '.0' expecting {<EOF>, '.', '-'}(line 1, pos 25)
== SQL ==
fault.snapshots.snapshots.0.ringbuffer.columns.0.test_field
-------------------------^^^
at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:255)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:124)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parseMultipartIdentifier(ParseDriver.scala:61)
at org.apache.spark.sql.catalyst.expressions.UpdateFields$.nameParts(complexTypeCreator.scala:693)
at org.apache.spark.sql.catalyst.expressions.UpdateFields$.apply(complexTypeCreator.scala:701)
at org.apache.spark.sql.Column.withField(Column.scala:927)
at org.apache.spark.sql.Dataset.transform(Dataset.scala:2751)
at org.apache.spark.sql.Dataset.transform(Dataset.scala:2751)
So the desired schema would be like the following.
root
|-- value: binary (nullable = false)
|-- event: struct (nullable = false)
| |-- eventtime: struct (nullable = true)
| | |-- seconds: long (nullable = true)
| | |-- ms: float (nullable = true)
| |-- fault: struct (nullable = true)
| | |-- collections: struct (nullable = true)
| | | |-- snapshots: array (nullable = false) --> ** FIRST LEVEL ARRAY (or array of arrays) **
| | | | |-- element: struct (containsNull = false)
| | | | | |-- ringbuffer: struct (nullable = true)
| | | | | | |-- columns: array (nullable = false) --> ** SECOND LEVEL ARRAY **
| | | | | | | |-- element: struct (containsNull = false)
| | | | | | | | |-- doubles: struct (nullable = true)
| | | | | | | | | |-- values: array (nullable = false)
| | | | | | | | | | |-- element: float (containsNull = false)
| | | | | | | | |-- test_field: string (nullable = true)
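A minimal sketch of one way to get the test_field shown in the desired schema above, assuming Spark 3.1+ (Python transform() and Column.withField). Array elements cannot be addressed by index inside a withField path, so each array level has to be rewritten element by element with transform(), and only the innermost element gets the new field:

from pyspark.sql.functions import col, lit, transform

df_out = df.withColumn(
    "event",
    col("event").withField(
        "fault.collections.snapshots",
        # rewrite the first-level array element by element
        transform(
            col("event.fault.collections.snapshots"),
            lambda snap: snap.withField(
                "ringbuffer.columns",
                # rewrite the second-level array, adding test_field to each element
                transform(
                    snap["ringbuffer"]["columns"],
                    lambda c: c.withField("test_field", lit("test_value"))
                )
            )
        )
    )
)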
I have a requirement where I need to mask the data for some of the fields in a given schema. I've researched a lot and couldn't find the answer I need.
This is the schema where I need some changes to the fields (answer_type, response0, response3):
| |-- choices: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- choice_id: long (nullable = true)
| | | |-- created_time: long (nullable = true)
| | | |-- updated_time: long (nullable = true)
| | | |-- created_by: long (nullable = true)
| | | |-- updated_by: long (nullable = true)
| | | |-- answers: struct (nullable = true)
| | | | |-- answer_node_internal_id: long (nullable = true)
| | | | |-- label: string (nullable = true)
| | | | |-- text: map (nullable = true)
| | | | | |-- key: string
| | | | | |-- value: string (valueContainsNull = true)
| | | | |-- data_tag: string (nullable = true)
| | | | |-- answer_type: string (nullable = true)
| | | |-- response: struct (nullable = true)
| | | | |-- response0: string (nullable = true)
| | | | |-- response1: long (nullable = true)
| | | | |-- response2: double (nullable = true)
| | | | |-- response3: array (nullable = true)
| | | | | |-- element: string (containsNull = true)
Is there a way I could assign values to those fields without affecting the above structure in pyspark?
I've tried using explode but I can't revert to the original schema. I don't want to create a new column either, and at the same time I don't want to lose any data from the provided schema object.
Oh, I had a similar problem a few days ago. I suggest transforming the StructType to JSON, then with a UDF you can make the internal changes, and afterwards you can get the original struct back again.
You should look at to_json and from_json in the documentation.
https://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html#pyspark.sql.functions.from_json
https://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html#pyspark.sql.functions.to_json
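A minimal sketch of the to_json / from_json round trip described above, assuming the array column is (or has been selected up to) a top-level column named choices on a DataFrame df. The mask values and the exact fields touched are illustrative only:

import json
from pyspark.sql.functions import col, from_json, to_json, udf
from pyspark.sql.types import StringType

# keep the original schema so from_json can rebuild the exact same structure
choices_schema = df.schema["choices"].dataType

@udf(StringType())
def mask_choices(choices_json):
    if choices_json is None:
        return None
    choices = json.loads(choices_json)
    for choice in choices:
        # illustrative masking of answer_type, response0 and response3
        if choice.get("answers"):
            choice["answers"]["answer_type"] = "***"
        if choice.get("response"):
            choice["response"]["response0"] = "***"
            choice["response"]["response3"] = ["***"]
    return json.dumps(choices)

df_masked = df.withColumn(
    "choices",
    from_json(mask_choices(to_json(col("choices"))), choices_schema)
)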
I have a DataFrame with the schema below. This is basically an XML file, which I have converted to a DataFrame for further processing. I am trying to extract the _Date column, but it looks like some type mismatch is happening.
df1.printSchema
|-- PlayWeek: struct (nullable = true)
| |-- TicketSales: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- PlayDate: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- BoxOfficeDetail: array (nullable = true)
| | | | | | |-- element: struct (containsNull = true)
| | | | | | | |-- VisualFormatCd: struct (nullable = true)
| | | | | | | | |-- Code: struct (nullable = true)
| | | | | | | | | |-- _SequenceId: long (nullable = true)
| | | | | | | | | |-- _VALUE: double (nullable = true)
| | | | | | | |-- _SessionTypeCd: string (nullable = true)
| | | | | | | |-- _TicketPrice: double (nullable = true)
| | | | | | | |-- _TicketQuantity: long (nullable = true)
| | | | | | | |-- _TicketTax: double (nullable = true)
| | | | | | | |-- _TicketTypeCd: string (nullable = true)
| | | | | |-- _Date: string (nullable = true)
| | | |-- _FilmId: long (nullable = true)
| | | |-- _Screen: long (nullable = true)
| | | |-- _TheatreId: long (nullable = true)
| |-- _BusinessEndDate: string (nullable = true)
| |-- _BusinessStartDate: string (nullable = true)
I need to extract the _Date column but it's throwing the error below:
scala> df1.select(df1.col("PlayWeek.TicketSales.PlayDate._Date")).show()
org.apache.spark.sql.AnalysisException: cannot resolve 'PlayWeek.TicketSales.PlayDate[_Date]' due to data type mismatch: argument 2 requires integral type, however, '_Date' is of string type.;
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:65)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:335)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:335)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:334)
Any help would be appreciated.
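The mismatch happens because PlayDate is an array nested inside the TicketSales array, so the dot path ends up trying to index an array of arrays with a string. A minimal sketch (written in PySpark for illustration, assuming the DataFrame is df1) that explodes both array levels before selecting the field:

from pyspark.sql.functions import col, explode

df_dates = (
    df1
    # one row per TicketSales element
    .withColumn("sale", explode(col("PlayWeek.TicketSales")))
    # one row per PlayDate element within each sale
    .withColumn("play", explode(col("sale.PlayDate")))
    .select(col("play._Date").alias("Date"))
)
df_dates.show()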