How to assign constant values to nested objects in PySpark?

I have a requirement where I need to mask the data in some of the fields of a given schema. I've researched a lot and couldn't find the answer I need.
This is the schema where I need some changes on the fields (answer_type, response0, response3):
| |-- choices: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- choice_id: long (nullable = true)
| | | |-- created_time: long (nullable = true)
| | | |-- updated_time: long (nullable = true)
| | | |-- created_by: long (nullable = true)
| | | |-- updated_by: long (nullable = true)
| | | |-- answers: struct (nullable = true)
| | | | |-- answer_node_internal_id: long (nullable = true)
| | | | |-- label: string (nullable = true)
| | | | |-- text: map (nullable = true)
| | | | | |-- key: string
| | | | | |-- value: string (valueContainsNull = true)
| | | | |-- data_tag: string (nullable = true)
| | | | |-- answer_type: string (nullable = true)
| | | |-- response: struct (nullable = true)
| | | | |-- response0: string (nullable = true)
| | | | |-- response1: long (nullable = true)
| | | | |-- response2: double (nullable = true)
| | | | |-- response3: array (nullable = true)
| | | | | |-- element: string (containsNull = true)
Is there a way I could assign values to those fields without affecting the above structure in PySpark?
I've tried using explode, but I can't revert to the original schema. I don't want to create a new column, and at the same time I don't want to lose any data from the provided schema object.

Oh, I had a similar problem a few days ago. I suggest transforming the struct type to JSON, then making the internal changes with a UDF, and afterwards you can get the original struct back.
You should look at to_json and from_json in the documentation:
https://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html#pyspark.sql.functions.from_json
https://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html#pyspark.sql.functions.to_json
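A minimal sketch of that approach, assuming the array column is named choices and that overwriting answer_type, response0, and response3 with a fixed placeholder counts as masking (the mask_choices helper and the "MASKED" values are assumptions for illustration):
import json

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Serialize the array of structs to JSON, rewrite the target fields in
# plain Python, then parse it back with the original schema intact.
def mask_choices(payload):
    if payload is None:
        return None
    choices = json.loads(payload)
    for choice in choices:
        if choice.get("answers") is not None:
            choice["answers"]["answer_type"] = "MASKED"
        if choice.get("response") is not None:
            choice["response"]["response0"] = "MASKED"
            choice["response"]["response3"] = ["MASKED"]
    return json.dumps(choices)

mask_udf = F.udf(mask_choices, StringType())

# Capture the original ArrayType(StructType) so from_json rebuilds it exactly.
choices_schema = df.schema["choices"].dataType
masked = df.withColumn(
    "choices",
    F.from_json(mask_udf(F.to_json(F.col("choices"))), choices_schema),
)
Because from_json is given the schema captured from the original column, the nested structure is preserved and no new column is introduced.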

Related

Flatten Nested XML file in Pyspark

I have the nested XML file below, which leads to my two questions.
How and with what functions should I flatten the file into a dataframe table? Examples would be very helpful.
How can I use the field "CNTL_ID" (PK) to link it with the "ENRLMTS" record? If this cannot be achieved easily, how can I create a surrogate key on the fly to link "PRVDR_INFO" with "ENRLMTS"?
XML File:
<PRVDR>
<PRVDR_INFO>
<INDVDL_INFO>
<CNTL_ID>12345678</CNTL_ID>
<BIRTH_DT>19200609</BIRTH_DT>
<BIRTH_STATE_CD>VA</BIRTH_STATE_CD>
<BIRTH_STATE_NAME>VIRGINIA</BIRTH_STATE_NAME>
<BIRTH_CNTRY_CD>US</BIRTH_CNTRY_CD>
<BIRTH_CNTRY_NAME>UNITED STATES</BIRTH_CNTRY_NAME>
<BIRTH_FRGN_SW>Z</BIRTH_FRGN_SW>
<NAME_LIST>
<PEC_INDVDL_NAME>
<NAME_CD>I</NAME_CD>
<NAME_DESC>INDIVIDUAL NAME</NAME_DESC>
<FIRST_NAME>WILL</FIRST_NAME>
<MDL_NAME>J</MDL_NAME>
<LAST_NAME>SMITH</LAST_NAME>
<DATA_STUS_CD>CURRENT</DATA_STUS_CD>
</PEC_INDVDL_NAME>
<PEC_INDVDL_NAME>
<NAME_CD>I</NAME_CD>
<NAME_DESC>INDIVIDUAL NAME</NAME_DESC>
<FIRST_NAME>WILL</FIRST_NAME>
<LAST_NAME>SMITH</LAST_NAME>
<TRMNTN_DT>2010-09-10T13:19:38</TRMNTN_DT>
<DATA_STUS_CD>HISTORY</DATA_STUS_CD>
</PEC_INDVDL_NAME>
</NAME_LIST>
<PEC_TIN>
<TIN>555778888</TIN>
<TAX_IDENT_TYPE_CD>T</TAX_IDENT_TYPE_CD>
<TAX_IDENT_DESC>SSN</TAX_IDENT_DESC>
<DATA_STUS_CD>CURRENT</DATA_STUS_CD>
</PEC_TIN>
<PEC_NPI>
<NPI>3334211156</NPI>
<VRFYD_BUSNS_SW>Y</VRFYD_BUSNS_SW>
<CREAT_TS>2010-09-10T13:16:28</CREAT_TS>
<DATA_STUS_CD>CURRENT</DATA_STUS_CD>
</PEC_NPI>
</INDVDL_INFO>
</PRVDR_INFO>
<ENRLMTS>
<ABC_855X>
<ENRLMT_INFO>
<ENRLMT_DTLS>
<FORM_TYPE_CD>9999A</FORM_TYPE_CD>
<ENRLMT_ID>123444555666778899000</ENRLMT_ID>
<ENRLMT_STUS_DLTS>
<STUS_CD>06</STUS_CD>
<STUS_DESC>APPROVED</STUS_DESC>
<STUS_DT>2012-05-14T16:04:22</STUS_DT>
<DATA_STUS_CD>HISTORY</DATA_STUS_CD>
<ENRLMT_STUS_RSN_DLTS>
<STUS_RSN_CD>047</STUS_RSN_CD>
<STUS_RSN_DESC>APPROVED</STUS_RSN_DESC>
<DATA_STUS_CD>HISTORY</DATA_STUS_CD>
</ENRLMT_STUS_RSN_DLTS>
</ENRLMT_STUS_DLTS>
<ENRLMT_STUS_DLTS>
<STUS_CD>06</STUS_CD>
<STUS_DESC>APPROVED</STUS_DESC>
<STUS_DT>2016-08-09T14:33:40</STUS_DT>
<DATA_STUS_CD>CURRENT</DATA_STUS_CD>
<ENRLMT_STUS_RSN_DLTS>
<STUS_RSN_CD>081</STUS_RSN_CD>
<STUS_RSN_DESC>APPROVED FOR REVALIDATION</STUS_RSN_DESC>
<DATA_STUS_CD>CURRENT</DATA_STUS_CD>
</ENRLMT_STUS_RSN_DLTS>
</ENRLMT_STUS_DLTS>
<BUSNS_STATE>VA</BUSNS_STATE>
<BUSNS_STATE_NAME>VIRGINIA</BUSNS_STATE_NAME>
<CNTRCTR_LIST>
<CNTRCTR_INFO>
<CNTRCTR_ID>11111</CNTRCTR_ID>
<CNTRCTR_NAME>SOLUTIONS, INC.</CNTRCTR_NAME>
<DATA_STUS_CD>CURRENT</DATA_STUS_CD>
</CNTRCTR_INFO>
</CNTRCTR_LIST>
</ENRLMT_DTLS>
</ENRLMT_INFO>
<PEC_ENRLMT_REVLDTN>
<REVLDTN_INSTNC_NUM>1</REVLDTN_INSTNC_NUM>
<REVLDTN_STUS_CD>03</REVLDTN_STUS_CD>
<REVLDTN_STUS_DESC>CANCELLED</REVLDTN_STUS_DESC>
</PEC_ENRLMT_REVLDTN>
<ACPT_NEW_PTNT_SW>Y</ACPT_NEW_PTNT_SW>
</ABC_855X>
</ENRLMTS>
</PRVDR>
Schema is below:
root
|-- ENRLMTS: struct (nullable = true)
| |-- ABC_855X: struct (nullable = true)
| | |-- ACPT_NEW_PTNT_SW: string (nullable = true)
| | |-- ENRLMT_INFO: struct (nullable = true)
| | | |-- ENRLMT_DTLS: struct (nullable = true)
| | | | |-- BUSNS_STATE: string (nullable = true)
| | | | |-- BUSNS_STATE_NAME: string (nullable = true)
| | | | |-- CNTRCTR_LIST: struct (nullable = true)
| | | | | |-- CNTRCTR_INFO: struct (nullable = true)
| | | | | | |-- CNTRCTR_ID: integer (nullable = true)
| | | | | | |-- CNTRCTR_NAME: string (nullable = true)
| | | | | | |-- DATA_STUS_CD: string (nullable = true)
| | | | |-- ENRLMT_ID: double (nullable = true)
| | | | |-- ENRLMT_STUS_DLTS: array (nullable = true)
| | | | | |-- element: struct (containsNull = true)
| | | | | | |-- DATA_STUS_CD: string (nullable = true)
| | | | | | |-- ENRLMT_STUS_RSN_DLTS: struct (nullable = true)
| | | | | | | |-- DATA_STUS_CD: string (nullable = true)
| | | | | | | |-- STUS_RSN_CD: integer (nullable = true)
| | | | | | | |-- STUS_RSN_DESC: string (nullable = true)
| | | | | | |-- STUS_CD: integer (nullable = true)
| | | | | | |-- STUS_DESC: string (nullable = true)
| | | | | | |-- STUS_DT: string (nullable = true)
| | | | |-- FORM_TYPE_CD: string (nullable = true)
| | |-- PEC_ENRLMT_REVLDTN: struct (nullable = true)
| | | |-- REVLDTN_INSTNC_NUM: integer (nullable = true)
| | | |-- REVLDTN_STUS_CD: integer (nullable = true)
| | | |-- REVLDTN_STUS_DESC: string (nullable = true)
|-- PRVDR_INFO: struct (nullable = true)
| |-- INDVDL_INFO: struct (nullable = true)
| | |-- BIRTH_CNTRY_CD: string (nullable = true)
| | |-- BIRTH_CNTRY_NAME: string (nullable = true)
| | |-- BIRTH_DT: integer (nullable = true)
| | |-- BIRTH_FRGN_SW: string (nullable = true)
| | |-- BIRTH_STATE_CD: string (nullable = true)
| | |-- BIRTH_STATE_NAME: string (nullable = true)
| | |-- CNTL_ID: integer (nullable = true)
| | |-- NAME_LIST: struct (nullable = true)
| | | |-- PEC_INDVDL_NAME: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- DATA_STUS_CD: string (nullable = true)
| | | | | |-- FIRST_NAME: string (nullable = true)
| | | | | |-- LAST_NAME: string (nullable = true)
| | | | | |-- MDL_NAME: string (nullable = true)
| | | | | |-- NAME_CD: string (nullable = true)
| | | | | |-- NAME_DESC: string (nullable = true)
| | | | | |-- TRMNTN_DT: string (nullable = true)
| | |-- PEC_NPI: struct (nullable = true)
| | | |-- CREAT_TS: string (nullable = true)
| | | |-- DATA_STUS_CD: string (nullable = true)
| | | |-- NPI: long (nullable = true)
| | | |-- VRFYD_BUSNS_SW: string (nullable = true)
| | |-- PEC_TIN: struct (nullable = true)
| | | |-- DATA_STUS_CD: string (nullable = true)
| | | |-- TAX_IDENT_DESC: string (nullable = true)
| | | |-- TAX_IDENT_TYPE_CD: string (nullable = true)
| | | |-- TIN: integer (nullable = true)
My current nested dataframe output is below. I would like to break this nested output into multiple rows if possible.
+--------------------+--------------------+
| ENRLMTS| PRVDR_INFO|
+--------------------+--------------------+
|{{Y, {{VA, VIRGIN...|{{US, UNITED STAT...|
+--------------------+--------------------+
Thank you very much.
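A minimal sketch of one way to flatten this, assuming df was read with spark-xml using rowTag="PRVDR". Selecting CNTL_ID in the same select that explodes ENRLMT_STUS_DLTS keeps every status row linked to its provider, so no surrogate key is needed; the stus alias and the chosen output fields are assumptions:
from pyspark.sql import functions as F

flat = (
    df.select(
        F.col("PRVDR_INFO.INDVDL_INFO.CNTL_ID").alias("CNTL_ID"),
        F.col("ENRLMTS.ABC_855X.ENRLMT_INFO.ENRLMT_DTLS.ENRLMT_ID").alias("ENRLMT_ID"),
        # One output row per element of the status array.
        F.explode("ENRLMTS.ABC_855X.ENRLMT_INFO.ENRLMT_DTLS.ENRLMT_STUS_DLTS").alias("stus"),
    )
    .select(
        "CNTL_ID",
        "ENRLMT_ID",
        F.col("stus.STUS_CD"),
        F.col("stus.STUS_DESC"),
        F.col("stus.STUS_DT"),
    )
)
flat.show(truncate=False)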

Reshape array of structs on pyspark

I have a spark dataframe with the following schema:
|-- id: long (nullable = true)
|-- comment: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: long (nullable = true)
| | |-- body: string (nullable = true)
| | |-- html_body: string (nullable = true)
| | |-- author_id: long (nullable = true)
| | |-- uploads: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- attachments: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- thumbnails: array (nullable = true)
| | | | | |-- element: struct (containsNull = true)
| | | | | | |-- id: long (nullable = true)
| | | | | | |-- file_name: string (nullable = true)
| | | | | | |-- url: string (nullable = true)
| | | | | | |-- content_url: string (nullable = true)
| | | | | | |-- mapped_content_url: string (nullable = true)
| | | | | | |-- content_type: string (nullable = true)
| | | | | | |-- size: long (nullable = true)
| | | | | | |-- width: long (nullable = true)
| | | | | | |-- height: long (nullable = true)
| | | | | | |-- inline: boolean (nullable = true)
| | |-- created_at: string (nullable = true)
| | |-- public: boolean (nullable = true)
| | |-- channel: string (nullable = true)
| | |-- from: string (nullable = true)
| | |-- location: string (nullable = true)
However, this has way more data than I need. For each element of the comment array, I would like to concatenate first comment.created_at and then comment.body into a new struct column called comment_final.
The end goal is to build from that a string column that flattens the entire array into a single HTML-like field.
For the end result, I would like to do something like the following:
.withColumn('final_body', array_join(col('comment.body'), '<br/><br/>'))
Can someone help me out with the array/struct data modelling?
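A minimal sketch, assuming Spark 3.1+ for the functions.transform API; comment_final and final_body are the names from the question, and the separator between created_at and body is an assumption:
from pyspark.sql import functions as F

result = (
    df.withColumn(
        "comment_final",
        # Keep only created_at and body per comment, as an array of structs.
        F.transform(
            "comment",
            lambda c: F.struct(
                c["created_at"].alias("created_at"),
                c["body"].alias("body"),
            ),
        ),
    )
    .withColumn(
        "final_body",
        # Render each struct as "created_at body", then join the array
        # into one HTML-like string.
        F.array_join(
            F.transform(
                "comment_final",
                lambda c: F.concat_ws(" ", c["created_at"], c["body"]),
            ),
            "<br/><br/>",
        ),
    )
)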

Spark: Select dynamic field name during XML load

While loading a path composed of XML files (100,000+ files) that contain almost the same structure, I need to create a dataframe mapping the XML tag fields to other field names (I'm using alias for this task). I'm using the spark-xml library to achieve this goal.
However, there are some specific tag names that occur in some XML files and not in others (ICMS00, ICMS10, ICMS20, etc.). Example:
<det nItem="1">
<imposto>
<ICMS>
<ICMS00>
<orig>0</orig>
<CST>00</CST>
<modBC>0</modBC>
<vBC>50.60</vBC>
<pICMS>12.00</pICMS>
<vICMS>6.07</vICMS>
</ICMS00>
</ICMS>
</imposto>
</det>
<det nItem="1">
<imposto>
<ICMS>
<ICMS20>
<orig>1</orig>
<CST>10</CST>
</ICMS20>
</ICMS>
</imposto>
</det>
The schema while loading without any modification is:
-- det: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- imposto: struct (nullable = true)
| | | |-- ICMS: struct (nullable = true)
| | | | |-- ICMS00: struct (nullable = true)
| | | | | |-- CST: long (nullable = true)
| | | | | |-- modBC: long (nullable = true)
| | | | | |-- orig: long (nullable = true)
| | | | | |-- pICMS: double (nullable = true)
| | | | | |-- vBC: double (nullable = true)
| | | | | |-- vICMS: double (nullable = true)
| | | | |-- ICMS20: struct (nullable = true)
| | | | | |-- CST: long (nullable = true)
| | | | | |-- orig: long (nullable = true)
I need a solution that maps the content of ICMS00 and ICMS20 to the same column while creating the dataframe. However, I could not find anything similar to a select using a regex, or a way to specify the sub-tag without the full path.
The result schema should be similar to:
-- det: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- imposto: struct (nullable = true)
| | | |-- ICMS: struct (nullable = true)
| | | | |-- ICMS: struct (nullable = true) ###common tag name###
| | | | | |-- CST: long (nullable = true)
| | | | | |-- modBC: long (nullable = true)
| | | | | |-- orig: long (nullable = true)
| | | | | |-- pICMS: double (nullable = true)
| | | | | |-- vBC: double (nullable = true)
| | | | | |-- vICMS: double (nullable = true)
I already tried to change the schema before selecting fields:
import re
import json
from pyspark.sql.functions import col, explode
from pyspark.sql.types import StructField

# Transform variable ICMSXX fields to static ICMS
det_schema = df.schema['det'].json()
icms_schema = re.sub(r"ICMS([0-9])+", r"ICMS", det_schema)
det_schema_modified = StructField.fromJson(json.loads(icms_schema))
det_schema_modified.dataType  # inspect the modified type
# Explode det (item) and visualize the schema
df_itens = df.select(col('det').cast(det_schema_modified.dataType)).withColumn('det', explode(col('det')))
df_itens.select('det').printSchema()
However, it duplicates the schema and gives an error when trying to select, because of the duplicated fields:
-- det: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- imposto: struct (nullable = true)
| | | |-- ICMS: struct (nullable = true)
| | | | |-- ICMS: struct (nullable = true)
| | | | | |-- CST: long (nullable = true)
| | | | | |-- modBC: long (nullable = true)
| | | | | |-- orig: long (nullable = true)
| | | | | |-- pICMS: double (nullable = true)
| | | | | |-- vBC: double (nullable = true)
| | | | | |-- vICMS: double (nullable = true)
| | | | |-- ICMS: struct (nullable = true)
| | | | | |-- CST: long (nullable = true)
| | | | | |-- orig: long (nullable = true)
My current code:
from pyspark.sql.functions import col

df = spark.read.format('com.databricks.spark.xml').options(rowTag='infNFe').load(file_location)
df_itens = df.select(
    col("det._nItem").alias("id"),
    col("det.prod.cProd").alias("prod_cProd"),
    col("det.prod.cEAN").alias("prod_cEAN"),
    col("det.prod.NCM").alias("prod_NCM"),
    col("det.imposto.ICMS.ICMS00.modBC").alias("ICMS_modBC"),
    col("det.imposto.ICMS.ICMS00.vBC").alias("ICMS_vBC"),
    col("det.imposto.ICMS.ICMS20.modBC").alias("ICMS_modBC"),
    col("det.imposto.ICMS.ICMS20.vBC").alias("ICMS_vBC"),
    # etc. for ICMS40, ICMS50, ICMS60, ...
)
Is there a way to select using regex or to handle these variable XML tag names while loading these files?
Something similar to:
df_itens.select(col("det.imposto.ICMS.ICMS*.modBC").alias("ICMS_modBC"))
or
df_itens.select(col("det.*.*.*.modBC").alias("ICMS_modBC"))

How can a dataframe with a list of lists be exploded so that each line becomes columns in PySpark?

I have a data frame as below
+--------------------+
| pas1|
+--------------------+
|[[[[H, 5, 16, 201...|
|[, 1956-09-22, AD...|
|[, 1961-03-19, AD...|
|[, 1962-02-09, AD...|
+--------------------+
I want to extract a few columns from each of the above 4 rows and create a dataframe like the one below. Column names should come from the schema, not hard-coded ones like column1 and column2.
+--------+-----------+
| gender | givenName |
+--------+-----------+
| a      | b         |
| a      | b         |
| a      | b         |
| a      | b         |
+--------+-----------+
pas1 - schema
root
|-- pas1: struct (nullable = true)
| |-- contactList: struct (nullable = true)
| | |-- contact: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- contactTypeCode: string (nullable = true)
| | | | |-- contactMediumTypeCode: string (nullable = true)
| | | | |-- contactTypeID: string (nullable = true)
| | | | |-- lastUpdateTimestamp: string (nullable = true)
| | | | |-- contactInformation: string (nullable = true)
| |-- dateOfBirth: string (nullable = true)
| |-- farePassengerTypeCode: string (nullable = true)
| |-- gender: string (nullable = true)
| |-- givenName: string (nullable = true)
| |-- groupDepositIndicator: string (nullable = true)
| |-- infantIndicator: string (nullable = true)
| |-- lastUpdateTimestamp: string (nullable = true)
| |-- passengerFOPList: struct (nullable = true)
| | |-- passengerFOP: struct (nullable = true)
| | | |-- fopID: string (nullable = true)
| | | |-- lastUpdateTimestamp: string (nullable = true)
| | | |-- fopFreeText: string (nullable = true)
| | | |-- fopSupplementaryInfoList: struct (nullable = true)
| | | | |-- fopSupplementaryInfo: array (nullable = true)
| | | | | |-- element: struct (containsNull = true)
| | | | | | |-- type: string (nullable = true)
| | | | | | |-- value: string (nullable = true)
Thanks for the help.

If you want to extract a few columns from a dataframe containing structs, you can simply do something like this:
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.appName('Test').getOrCreate()
df = spark.sparkContext.parallelize([Row(pas1=Row(gender='a', givenName='b'))]).toDF()
df.select('pas1.gender', 'pas1.givenName').show()
Instead, if you want to flatten your dataframe, this question should help you: How to unwrap nested Struct column into multiple columns?
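If you also want the column names to come from the schema instead of hard-coding them, as the question asks, here is a minimal sketch; the choice to take every top-level string field of pas1 is an assumption:
from pyspark.sql.functions import col
from pyspark.sql.types import StringType

# Select every top-level string field of pas1, naming each output column
# after the schema field rather than hard-coding it.
pas1_fields = df.schema["pas1"].dataType.fields
string_cols = [
    col(f"pas1.{f.name}").alias(f.name)
    for f in pas1_fields
    if isinstance(f.dataType, StringType)
]
df.select(*string_cols).show()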

How to unwind array in DataFrame (from JSON)?

Each record in an RDD contains a JSON document. I'm using SQLContext to create a DataFrame from the JSON like this:
val signalsJsonRdd = sqlContext.jsonRDD(signalsJson)
Below is the schema. datapayload is an array of items. I want to explode the array of items to get a dataframe where each row is an item from datapayload. I tried to do something based on this answer, but it seems that I would need to model the entire structure of the item in the case Row(arr: Array[...]) statement. I'm probably missing something.
val payloadDfs = signalsJsonRdd.explode($"data.datapayload") {
  case org.apache.spark.sql.Row(arr: Array[String]) => arr.map(Tuple1(_))
}
The above code throws a scala.MatchError, because the type of the actual Row is very different from Row(arr: Array[String]). There is probably a simple way to do what I want, but I can't find it. Please help.
The schema is given below:
signalsJsonRdd.printSchema()
root
|-- _corrupt_record: string (nullable = true)
|-- data: struct (nullable = true)
| |-- dataid: string (nullable = true)
| |-- datapayload: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- Reading: struct (nullable = true)
| | | | |-- A2DPActive: boolean (nullable = true)
| | | | |-- Accuracy: double (nullable = true)
| | | | |-- Active: boolean (nullable = true)
| | | | |-- Address: string (nullable = true)
| | | | |-- Charging: boolean (nullable = true)
| | | | |-- Connected: boolean (nullable = true)
| | | | |-- DeviceName: string (nullable = true)
| | | | |-- Guid: string (nullable = true)
| | | | |-- HandsFree: boolean (nullable = true)
| | | | |-- Header: double (nullable = true)
| | | | |-- Heading: double (nullable = true)
| | | | |-- Latitude: double (nullable = true)
| | | | |-- Longitude: double (nullable = true)
| | | | |-- PositionSource: long (nullable = true)
| | | | |-- Present: boolean (nullable = true)
| | | | |-- Radius: double (nullable = true)
| | | | |-- SSID: string (nullable = true)
| | | | |-- SSIDLength: long (nullable = true)
| | | | |-- SpeedInKmh: double (nullable = true)
| | | | |-- State: string (nullable = true)
| | | | |-- Time: string (nullable = true)
| | | | |-- Type: string (nullable = true)
| | | |-- Time: string (nullable = true)
| | | |-- Type: string (nullable = true)
tl;dr The explode function is your friend (or my favorite, flatMap).
The explode function creates a new row for each element in the given array or map column.
Something like the following should work:
signalsJsonRdd.withColumn("element", explode($"data.datapayload"))
See functions object.