Pyspark put columns inside an array and then in a struct - pyspark

I have this code:
test_2 = test.select(struct(
col("transaction_start_date"),
col("transaction_end_date"),
col("batch_time_stamp"),
col("batch_id"),
array(
struct(col("vin"),
col("number_of_outbound_calls"),
col("number_of_inbound_calls")).alias("sad")
).alias("batch_data")).alias("transactionRecord"))
The schema is:
root
|-- transactionRecord: struct (nullable = false)
| |-- transaction_start_date: string (nullable = false)
| |-- transaction_end_date: string (nullable = false)
| |-- batch_time_stamp: string (nullable = false)
| |-- batch_id: string (nullable = false)
| |-- batch_data: array (nullable = false)
| | |-- element: struct (containsNull = false)
| | | |-- vin: string (nullable = false)
| | | |-- number_of_outbound_calls: long (nullable = true)
| | | |-- number_of_inbound_calls: long (nullable = true)
But i need this
root
|-- transactionRecord: struct (nullable = false)
| |-- transaction_start_date: string (nullable = false)
| |-- transaction_end_date: string (nullable = false)
| |-- batch_time_stamp: string (nullable = false)
| |-- batch_id: string (nullable = false)
| |-- batch_data: array (nullable = false)
| | |-- sad: struct (nullable = false)
| | | |-- vin: string (nullable = false)
| | | |-- number_of_outbound_calls: long (nullable = true)
| | | |-- number_of_inbound_calls: long (nullable = true)
I need inside array will be a struct and put alias to this struct. I tried but I dont know the code dosent work

Related

How to make fields inside nested structure non-nullable

I have data structure like below shown. What I want to make sure is that field2 only contains non-nullable values, and also want to make sure field3.ZZZ is nullable = true.
root
|-- field1: string (nullable = true)
|-- field2: struct (nullable = true)
|-- field3: struct (nullable = true)
| |-- XXX: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- s: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- v: string (nullable = true)
| | | |-- v: long (nullable = true)
| |-- YYY: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- s: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- v: string (nullable = true)
| | | |-- v: long (nullable = true)
| |-- ZZZ: array (nullable = true)
| | |-- element: null (containsNull = true)
I did lots of research, but haven't found clear way to do the setting in nested structure. Can anyone help me on this?

Created nested struct schema SPARK - Schema Jira

I use Dabricks for data engineering, I'm trying to build this schema through StructType but I'm not getting it. Can someone help me? This is the structure of a Jira "issues" json file. I need to create the schema to create the Dataframe in Pyspark. Copy the json return to check the schema structure.
df = spark.read.option("multiline", "true").json("data/issue.json")
df.show()
+--------------------+--------------------+-----+-------+--------------------+
| expand| fields| id| key| self|
+--------------------+--------------------+-----+-------+--------------------+
|renderedFields,na...|{{0, 0}, null, nu...|10000|FIRST-1|https://weldermar...|
+--------------------+--------------------+-----+-------+--------------------+
root
|-- expand: string (nullable = true)
|-- fields: struct (nullable = true)
| |-- aggregateprogress: struct (nullable = true)
| | |-- progress: long (nullable = true)
| | |-- total: long (nullable = true)
| |-- aggregatetimeestimate: string (nullable = true)
| |-- aggregatetimeoriginalestimate: string (nullable = true)
| |-- aggregatetimespent: string (nullable = true)
| |-- assignee: string (nullable = true)
| |-- attachment: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- comment: struct (nullable = true)
| | |-- comments: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- author: struct (nullable = true)
| | | | | |-- accountId: string (nullable = true)
| | | | | |-- accountType: string (nullable = true)
| | | | | |-- active: boolean (nullable = true)
| | | | | |-- avatarUrls: struct (nullable = true)
| | | | | | |-- 16x16: string (nullable = true)
| | | | | | |-- 24x24: string (nullable = true)
| | | | | | |-- 32x32: string (nullable = true)
| | | | | | |-- 48x48: string (nullable = true)
| | | | | |-- displayName: string (nullable = true)
| | | | | |-- emailAddress: string (nullable = true)
| | | | | |-- self: string (nullable = true)
| | | | | |-- timeZone: string (nullable = true)
| | | | |-- body: struct (nullable = true)
| | | | | |-- content: array (nullable = true)
| | | | | | |-- element: struct (containsNull = true)
| | | | | | | |-- content: array (nullable = true)
| | | | | | | | |-- element: struct (containsNull = true)
| | | | | | | | | |-- text: string (nullable = true)
| | | | | | | | | |-- type: string (nullable = true)
| | | | | | | |-- type: string (nullable = true)
| | | | | |-- type: string (nullable = true)
| | | | | |-- version: long (nullable = true)
| | | | |-- created: string (nullable = true)
| | | | |-- id: string (nullable = true)
| | | | |-- jsdPublic: boolean (nullable = true)
| | | | |-- self: string (nullable = true)
| | | | |-- updateAuthor: struct (nullable = true)
| | | | | |-- accountId: string (nullable = true)
| | | | | |-- accountType: string (nullable = true)
| | | | | |-- active: boolean (nullable = true)
| | | | | |-- avatarUrls: struct (nullable = true)
| | | | | | |-- 16x16: string (nullable = true)
| | | | | | |-- 24x24: string (nullable = true)
| | | | | | |-- 32x32: string (nullable = true)
| | | | | | |-- 48x48: string (nullable = true)
| | | | | |-- displayName: string (nullable = true)
| | | | | |-- emailAddress: string (nullable = true)
| | | | | |-- self: string (nullable = true)
| | | | | |-- timeZone: string (nullable = true)
| | | | |-- updated: string (nullable = true)
| | |-- maxResults: long (nullable = true)
| | |-- self: string (nullable = true)
| | |-- startAt: long (nullable = true)
| | |-- total: long (nullable = true)
| |-- components: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- created: string (nullable = true)
| |-- creator: struct (nullable = true)
| | |-- accountId: string (nullable = true)
| | |-- accountType: string (nullable = true)
| | |-- active: boolean (nullable = true)
| | |-- avatarUrls: struct (nullable = true)
| | | |-- 16x16: string (nullable = true)
| | | |-- 24x24: string (nullable = true)
| | | |-- 32x32: string (nullable = true)
| | | |-- 48x48: string (nullable = true)
| | |-- displayName: string (nullable = true)
| | |-- emailAddress: string (nullable = true)
| | |-- self: string (nullable = true)
| | |-- timeZone: string (nullable = true)
| |-- customfield_10001: string (nullable = true)
| |-- customfield_10002: string (nullable = true)
| |-- customfield_10003: string (nullable = true)
| |-- customfield_10004: string (nullable = true)
| |-- customfield_10005: string (nullable = true)
| |-- customfield_10006: string (nullable = true)
| |-- customfield_10007: string (nullable = true)
| |-- customfield_10008: string (nullable = true)
| |-- customfield_10009: string (nullable = true)
| |-- customfield_10010: string (nullable = true)
| |-- customfield_10014: string (nullable = true)
| |-- customfield_10015: string (nullable = true)
| |-- customfield_10016: string (nullable = true)
| |-- customfield_10017: string (nullable = true)
| |-- customfield_10018: struct (nullable = true)
| | |-- hasEpicLinkFieldDependency: boolean (nullable = true)
| | |-- nonEditableReason: struct (nullable = true)
| | | |-- message: string (nullable = true)
| | | |-- reason: string (nullable = true)
| | |-- showField: boolean (nullable = true)
| |-- customfield_10019: string (nullable = true)
| |-- customfield_10020: string (nullable = true)
| |-- customfield_10021: string (nullable = true)
| |-- customfield_10022: string (nullable = true)
| |-- customfield_10023: string (nullable = true)
| |-- customfield_10024: string (nullable = true)
| |-- customfield_10025: string (nullable = true)
| |-- customfield_10026: string (nullable = true)
| |-- customfield_10027: string (nullable = true)
| |-- customfield_10028: string (nullable = true)
| |-- customfield_10029: string (nullable = true)
| |-- customfield_10030: string (nullable = true)
| |-- description: string (nullable = true)
| |-- duedate: string (nullable = true)
| |-- environment: string (nullable = true)
| |-- fixVersions: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- issuelinks: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- issuerestriction: struct (nullable = true)
| | |-- shouldDisplay: boolean (nullable = true)
| |-- issuetype: struct (nullable = true)
| | |-- avatarId: long (nullable = true)
| | |-- description: string (nullable = true)
| | |-- entityId: string (nullable = true)
| | |-- hierarchyLevel: long (nullable = true)
| | |-- iconUrl: string (nullable = true)
| | |-- id: string (nullable = true)
| | |-- name: string (nullable = true)
| | |-- self: string (nullable = true)
| | |-- subtask: boolean (nullable = true)
| |-- labels: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- lastViewed: string (nullable = true)
| |-- priority: struct (nullable = true)
| | |-- iconUrl: string (nullable = true)
| | |-- id: string (nullable = true)
| | |-- name: string (nullable = true)
| | |-- self: string (nullable = true)
| |-- progress: struct (nullable = true)
| | |-- progress: long (nullable = true)
| | |-- total: long (nullable = true)
| |-- project: struct (nullable = true)
| | |-- avatarUrls: struct (nullable = true)
| | | |-- 16x16: string (nullable = true)
| | | |-- 24x24: string (nullable = true)
| | | |-- 32x32: string (nullable = true)
| | | |-- 48x48: string (nullable = true)
| | |-- id: string (nullable = true)
| | |-- key: string (nullable = true)
| | |-- name: string (nullable = true)
| | |-- projectTypeKey: string (nullable = true)
| | |-- self: string (nullable = true)
| | |-- simplified: boolean (nullable = true)
| |-- reporter: struct (nullable = true)
| | |-- accountId: string (nullable = true)
| | |-- accountType: string (nullable = true)
| | |-- active: boolean (nullable = true)
| | |-- avatarUrls: struct (nullable = true)
| | | |-- 16x16: string (nullable = true)
| | | |-- 24x24: string (nullable = true)
| | | |-- 32x32: string (nullable = true)
| | | |-- 48x48: string (nullable = true)
| | |-- displayName: string (nullable = true)
| | |-- emailAddress: string (nullable = true)
| | |-- self: string (nullable = true)
| | |-- timeZone: string (nullable = true)
| |-- resolution: string (nullable = true)
| |-- resolutiondate: string (nullable = true)
| |-- security: string (nullable = true)
| |-- status: struct (nullable = true)
| | |-- description: string (nullable = true)
| | |-- iconUrl: string (nullable = true)
| | |-- id: string (nullable = true)
| | |-- name: string (nullable = true)
| | |-- self: string (nullable = true)
| | |-- statusCategory: struct (nullable = true)
| | | |-- colorName: string (nullable = true)
| | | |-- id: long (nullable = true)
| | | |-- key: string (nullable = true)
| | | |-- name: string (nullable = true)
| | | |-- self: string (nullable = true)
| |-- statuscategorychangedate: string (nullable = true)
| |-- subtasks: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- summary: string (nullable = true)
| |-- timeestimate: string (nullable = true)
| |-- timeoriginalestimate: string (nullable = true)
| |-- timespent: string (nullable = true)
| |-- updated: string (nullable = true)
| |-- versions: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- votes: struct (nullable = true)
| | |-- hasVoted: boolean (nullable = true)
| | |-- self: string (nullable = true)
| | |-- votes: long (nullable = true)
| |-- watches: struct (nullable = true)
| | |-- isWatching: boolean (nullable = true)
| | |-- self: string (nullable = true)
| | |-- watchCount: long (nullable = true)
| |-- worklog: struct (nullable = true)
| | |-- maxResults: long (nullable = true)
| | |-- startAt: long (nullable = true)
| | |-- total: long (nullable = true)
| | |-- worklogs: array (nullable = true)
| | | |-- element: string (containsNull = true)
| |-- workratio: long (nullable = true)
|-- id: string (nullable = true)
|-- key: string (nullable = true)
|-- self: string (nullable = true)

Non-primitive, unsupported type error in ADF when trying to read a Parquet file

In Databricks, a df is generated and saved as a parquet file. Here is the schema:
root
|-- dq_check_id: string (nullable = false)
|-- data_attribute_id: long (nullable = true)
|-- dq_check_scope_number_of_records: integer (nullable = false)
|-- dq_check_hit_number_of_records: integer (nullable = false)
|-- snapshotdate: timestamp (nullable = false)
|-- dq_execution_date: timestamp (nullable = false)
|-- generated_by: string (nullable = false)
|-- dq_check_outcomes: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- rule_output_cd: integer (nullable = false)
| | |-- business_key: array (nullable = false)
| | | |-- element: struct (containsNull = false)
| | | | |-- identifying_data_element_id: long (nullable = true)
| | | | |-- identifying_data_element_value: string (nullable = true)
| | |-- technical_key: array (nullable = false)
| | | |-- element: struct (containsNull = false)
| | | | |-- pk_attr_id: long (nullable = true)
| | | | |-- pk_attr_value: string (nullable = true)
| | |-- dq_check_attributes: array (nullable = false)
| | | |-- element: struct (containsNull = false)
| | | | |-- dq_check_attr_id: long (nullable = true)
| | | | |-- dq_check_attr_value: string (nullable = true)
| | | | |-- dq_check_attr_seq: string (nullable = false)
| | |-- outcome_details: array (nullable = false)
| | | |-- element: struct (containsNull = false)
| | | | |-- outcome_attr_id: integer (nullable = false)
| | | | |-- outcome_attr_value: string (nullable = false)
| | | | |-- outcome_attr_seq: string (nullable = false)
|-- generated_date: timestamp (nullable = true)
Then, when trying to read this parquet file in ADF, this error arrives:
Parquet file contained column 'dq_check_outcomes', which is of a non-primitive, unsupported type.
Are you sure there is not a MAP or LIST in the parquet file.
https://www.vertica.com/docs/10.0.x/HTML/Content/Authoring/ExternalTables/ComplexTypes.htm
Please look at the Microsoft documentation on supported options and data types. At the top it states it does not support MAP/LIST. My suggestion is to rebuild the parquet file, section by section until you fine the nested issue.
https://learn.microsoft.com/en-us/azure/data-factory/supported-file-formats-and-compression-codecs-legacy#parquet-format
Here is an image to the table for reference. It looks like all types are supported from your file.

suggestion for tune up the code which contains explode and groupby

I wrote the code for below probelem but it has below problems. Please suggest me if some tuning can be done.
It takes more time I think.
there are 3 brands as of now. It is hardcoded. If more brands would be added, i need to add the code manually.
input dataframe schema :
root
|-- id: string (nullable = true)
|-- attrib: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
|-- pref: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- pref_type: string (nullable = true)
| | |-- brand: string (nullable = true)
| | |-- tp_id: string (nullable = true)
| | |-- aff: float (nullable = true)
| | |-- pre_id: string (nullable = true)
| | |-- cr_date: string (nullable = true)
| | |-- up_date: string (nullable = true)
| | |-- pref_attrib: map (nullable = true)
| | | |-- key: string
| | | |-- value: string (valueContainsNull = true)
expected output schema:
root
|-- id: string (nullable = true)
|-- attrib: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
|-- pref: struct (nullable = false)
| |-- brandA: array (nullable = true)
| | |-- element: struct (containsNull = false)
| | | |-- pref_type: string (nullable = true)
| | | |-- tp_id: string (nullable = true)
| | | |-- aff: float (nullable = true)
| | | |-- pref_id: string (nullable = true)
| | | |-- cr_date: string (nullable = true)
| | | |-- up_date: string (nullable = true)
| | | |-- pref_attrib: map (nullable = true)
| | | | |-- key: string
| | | | |-- value: string (valueContainsNull = true)
| |-- brandB: array (nullable = true)
| | |-- element: struct (containsNull = false)
| | | |-- pref_type: string (nullable = true)
| | | |-- tp_id: string (nullable = true)
| | | |-- aff: float (nullable = true)
| | | |-- pref_id: string (nullable = true)
| | | |-- cr_date: string (nullable = true)
| | | |-- up_date: string (nullable = true)
| | | |-- pref_attrib: map (nullable = true)
| | | | |-- key: string
| | | | |-- value: string (valueContainsNull = true)
| |-- brandC: array (nullable = true)
| | |-- element: struct (containsNull = false)
| | | |-- pref_type: string (nullable = true)
| | | |-- tp_id: string (nullable = true)
| | | |-- aff: float (nullable = true)
| | | |-- pref_id: string (nullable = true)
| | | |-- cr_date: string (nullable = true)
| | | |-- up_date: string (nullable = true)
| | | |-- pref_attrib: map (nullable = true)
| | | | |-- key: string
| | | | |-- value: string (valueContainsNull = true)
The processing can be done based on the brand attribute under preferences(preferences.brand)
I have written the below code for that:
def modifyBrands(inputDf: DataFrame): DataFrame ={
val PreferenceProps = Array("pref_type", "tp_id", "aff", "pref_id", "cr_date", "up_date", "pref_attrib")
import org.apache.spark.sql.functions._
val explodedDf = inputDf.select(col("id"), explode(col("pref")))
.select(
col("id"),
col("col.pref_type"),
col("col.brand"),
col("col.tp_id"),
col("col.aff"),
col("col.pre_id"),
col("col.cr_dt"),
col("col.up_dt"),
col("col.pref_attrib")
).cache()
val brandAddedDf = explodedDf
.withColumn("brandA", when(col("brand") === "brandA", struct(PreferenceProps.head, PreferenceProps.tail:_*)).as("brandA"))
.withColumn("brandB", when(col("brand") === "brandB", struct(PreferenceProps.head, PreferenceProps.tail:_*)).as("brandB"))
.withColumn("brandC", when(col("brand") === "brandC", struct(PreferenceProps.head, PreferenceProps.tail:_*)).as("brandC"))
.cache()
explodedDf.unpersist()
val groupedDf = brandAddedDf.groupBy("id").agg(
collect_list("brandA").alias("brandA"),
collect_list("brandB").alias("brandB"),
collect_list("brandC").alias("brandC")
).withColumn("preferences", struct(
when(size(col("brandA")).notEqual(0), col("brandA")).alias("brandA"),
when(size(col("brandB")).notEqual(0), col("brandB")).alias("brandB"),
when(size(col("brandC")).notEqual(0), col("brandC")).alias("brandC"),
)).drop("brandA", "brandB", "brandC")
.cache()
brandAddedDf.unpersist()
val idAttributesDf = inputDf.select("id", "attrib").cache()
val joinedDf = idAttributesDf.join(groupedDf, "id")
groupedDf.unpersist()
idAttributesDf.unpersist()
joinedDf.printSchema()
joinedDf // returning joined df which will be wrote as paquet file.
}
You can simplify your code using higher-order function filter on arrays. Just map through brand names and for-each one return a filtered array from pref. This way you avoid the exploding / grouping part.
Here's a complete example:
val data = """{"id":1,"attrib":{"key":"k","value":"v"},"pref":[{"pref_type":"type1","brand":"brandA","tp_id":"id1","aff":"aff1","pre_id":"pre_id1","cr_date":"2021-01-06","up_date":"2021-01-06","pref_attrib":{"key":"k","value":"v"}},{"pref_type":"type1","brand":"brandB","tp_id":"id1","aff":"aff1","pre_id":"pre_id1","cr_date":"2021-01-06","up_date":"2021-01-06","pref_attrib":{"key":"k","value":"v"}},{"pref_type":"type1","brand":"brandC","tp_id":"id1","aff":"aff1","pre_id":"pre_id1","cr_date":"2021-01-06","up_date":"2021-01-06","pref_attrib":{"key":"k","value":"v"}}]}"""
val inputDf = spark.read.json(Seq(data).toDS)
val brands = Seq("brandA", "brandB", "brandC")
// or getting them from input dataframe
// val brands = inputDf.select("pref.brand").as[Seq[String]].collect.flatten
val brandAddedDf = inputDf.withColumn(
"pref",
struct(brands.map(b => expr(s"filter(pref, x -> x.brand = '$b')").as(b)): _*)
)
brandAddedDf.printSchema
//root
// |-- attrib: struct (nullable = true)
// | |-- key: string (nullable = true)
// | |-- value: string (nullable = true)
// |-- id: long (nullable = true)
// |-- pref: struct (nullable = false)
// | |-- brandA: array (nullable = true)
// | | |-- element: struct (containsNull = true)
// | | | |-- aff: string (nullable = true)
// | | | |-- brand: string (nullable = true)
// | | | |-- cr_date: string (nullable = true)
// | | | |-- pre_id: string (nullable = true)
// | | | |-- pref_attrib: struct (nullable = true)
// | | | | |-- key: string (nullable = true)
// | | | | |-- value: string (nullable = true)
// | | | |-- pref_type: string (nullable = true)
// | | | |-- tp_id: string (nullable = true)
// | | | |-- up_date: string (nullable = true)
// | |-- brandB: array (nullable = true)
// | | |-- element: struct (containsNull = true)
// | | | |-- aff: string (nullable = true)
// | | | |-- brand: string (nullable = true)
// | | | |-- cr_date: string (nullable = true)
// | | | |-- pre_id: string (nullable = true)
// | | | |-- pref_attrib: struct (nullable = true)
// | | | | |-- key: string (nullable = true)
// | | | | |-- value: string (nullable = true)
// | | | |-- pref_type: string (nullable = true)
// | | | |-- tp_id: string (nullable = true)
// | | | |-- up_date: string (nullable = true)
// | |-- brandC: array (nullable = true)
// | | |-- element: struct (containsNull = true)
// | | | |-- aff: string (nullable = true)
// | | | |-- brand: string (nullable = true)
// | | | |-- cr_date: string (nullable = true)
// | | | |-- pre_id: string (nullable = true)
// | | | |-- pref_attrib: struct (nullable = true)
// | | | | |-- key: string (nullable = true)
// | | | | |-- value: string (nullable = true)
// | | | |-- pref_type: string (nullable = true)
// | | | |-- tp_id: string (nullable = true)
// | | | |-- up_date: string (nullable = true)
I think they're are a couple issues with how you are doing your code, but the real way to tell where you have a problem with your code is to look at the SPARK UI. I find the "Jobs" tab and the "SQL" tab very informative to figure out where the code is spending most of its time. Then see if those parts can be re-written to give you more speed. Some of the items I point out below may not matter if there is a bottleneck elsewhere that really is where most of the time is being spent.
There are reasons to create nested structures (Like you are for Brand). I'm just not sure I see the payoff here and it's not explained. It should be considered why you are maintaining this structure and what the benefit is. Is there a performance gain for maintaining it? Or is it simply an artifact of how the data was created?
General tips that might help a little:
In general you should only cache code that you will use more than once. You have a lot of code you don't use more than once but you still cache.
Small, small performance boost. (So in other words when you need every millisecond...) withColumn actually doesn't perform as well as select. (Likely due to some object creation) where possible use select instead of withColumn. Not really worth re-writing your code unless you really need every milli-second.

Get WrappedArray row valule and convert it into string in Scala

I have a data frame which comes as like below
+---------------------------------------------------------------------+
|value |
+---------------------------------------------------------------------+
|[WrappedArray(LineItem_organizationId, LineItem_lineItemId)] |
|[WrappedArray(OrganizationId, LineItemId, SegmentSequence_segmentId)]|
+---------------------------------------------------------------------+
From the above two rows I want to create a string which is in this format
"LineItem_organizationId", "LineItem_lineItemId"
"OrganizationId", "LineItemId", "SegmentSequence_segmentId"
I want to create this as dynamic so in first column third value is present my string will have one more , separated columns value .
How can I do this in Scala .
this is what I am doing in order to create data frame
val xmlFiles = "C://Users//u6034690//Desktop//SPARK//trfsmallfffile//XML"
val discriptorFileLOcation = "C://Users//u6034690//Desktop//SPARK//trfsmallfffile//FinancialLineItem//REFXML"
import sqlContext.implicits._
val dfDiscriptor = sqlContext.read.format("com.databricks.spark.xml").option("rowTag", "FlatFileDescriptor").load(discriptorFileLOcation)
dfDiscriptor.printSchema()
val firstColumn = dfDiscriptor.select($"FFFileType.FFRecord.FFField").as("FFField")
val FirstColumnOfHeaderFile = firstColumn.select(explode($"FFField")).as("ColumnsDetails").select(explode($"col")).first.get(0).toString().split(",")(5)
println(FirstColumnOfHeaderFile)
//dfDiscriptor.printSchema()
val primaryKeyColumnsFinancialLineItem = dfDiscriptor.select(explode($"FFFileType.FFRecord.FFPrimKey.FFPrimKeyCol"))
primaryKeyColumnsFinancialLineItem.show(false)
Adding the full schema
root
|-- FFColumnDelimiter: string (nullable = true)
|-- FFContentItem: struct (nullable = true)
| |-- _VALUE: string (nullable = true)
| |-- _ffMajVers: long (nullable = true)
| |-- _ffMinVers: double (nullable = true)
|-- FFFileEncoding: string (nullable = true)
|-- FFFileType: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- FFPhysicalFile: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- FFFileName: string (nullable = true)
| | | | |-- FFRowCount: long (nullable = true)
| | |-- FFRecord: struct (nullable = true)
| | | |-- FFField: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- FFColumnNumber: long (nullable = true)
| | | | | |-- FFDataType: string (nullable = true)
| | | | | |-- FFFacets: struct (nullable = true)
| | | | | | |-- FFMaxLength: long (nullable = true)
| | | | | | |-- FFTotalDigits: long (nullable = true)
| | | | | |-- FFFieldIsOptional: boolean (nullable = true)
| | | | | |-- FFFieldName: string (nullable = true)
| | | | | |-- FFForKey: struct (nullable = true)
| | | | | | |-- FFForKeyCol: string (nullable = true)
| | | | | | |-- FFForKeyRecord: string (nullable = true)
| | | |-- FFPrimKey: struct (nullable = true)
| | | | |-- FFPrimKeyCol: array (nullable = true)
| | | | | |-- element: string (containsNull = true)
| | | |-- FFRecordType: string (nullable = true)
|-- FFHeaderRow: boolean (nullable = true)
|-- FFId: string (nullable = true)
|-- FFRowDelimiter: string (nullable = true)
|-- FFTimeStamp: string (nullable = true)
|-- _env: string (nullable = true)
|-- _ffMajVers: long (nullable = true)
|-- _ffMinVers: double (nullable = true)
|-- _ffPubstyle: string (nullable = true)
|-- _schemaLocation: string (nullable = true)
|-- _sr: string (nullable = true)
|-- _xmlns: string (nullable = true)
|-- _xsi: string (nullable = true)
Looking at your given dataframe
+---------------------------------------------------------------------+
|value |
+---------------------------------------------------------------------+
|[WrappedArray(LineItem_organizationId, LineItem_lineItemId)] |
|[WrappedArray(OrganizationId, LineItemId, SegmentSequence_segmentId)]|
+---------------------------------------------------------------------+
it must have the following schema
|-- value: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
If the above assumption are true then you should write a udf function as
import org.apache.spark.sql.functions._
def arrayToString = udf((arr: collection.mutable.WrappedArray[collection.mutable.WrappedArray[String]]) => arr.flatten.mkString(", "))
And use it in the dataframe as
df.withColumn("value", arrayToString($"value"))
And you should have
+-----------------------------------------------------+
|value |
+-----------------------------------------------------+
|LineItem_organizationId, LineItem_lineItemId |
|OrganizationId, LineItemId, SegmentSequence_segmentId|
+-----------------------------------------------------+
|-- value: string (nullable = true)