Create nested struct schema in Spark - Jira schema - PySpark

I use Databricks for data engineering, and I'm trying to build this schema through StructType, but I can't get it right. Can someone help me? This is the structure of a Jira "issues" JSON file; I need to create the schema in order to build the DataFrame in PySpark. The output of reading the JSON is copied below so you can check the schema structure.
df = spark.read.option("multiline", "true").json("data/issue.json")
df.show()
+--------------------+--------------------+-----+-------+--------------------+
| expand| fields| id| key| self|
+--------------------+--------------------+-----+-------+--------------------+
|renderedFields,na...|{{0, 0}, null, nu...|10000|FIRST-1|https://weldermar...|
+--------------------+--------------------+-----+-------+--------------------+
root
|-- expand: string (nullable = true)
|-- fields: struct (nullable = true)
| |-- aggregateprogress: struct (nullable = true)
| | |-- progress: long (nullable = true)
| | |-- total: long (nullable = true)
| |-- aggregatetimeestimate: string (nullable = true)
| |-- aggregatetimeoriginalestimate: string (nullable = true)
| |-- aggregatetimespent: string (nullable = true)
| |-- assignee: string (nullable = true)
| |-- attachment: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- comment: struct (nullable = true)
| | |-- comments: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- author: struct (nullable = true)
| | | | | |-- accountId: string (nullable = true)
| | | | | |-- accountType: string (nullable = true)
| | | | | |-- active: boolean (nullable = true)
| | | | | |-- avatarUrls: struct (nullable = true)
| | | | | | |-- 16x16: string (nullable = true)
| | | | | | |-- 24x24: string (nullable = true)
| | | | | | |-- 32x32: string (nullable = true)
| | | | | | |-- 48x48: string (nullable = true)
| | | | | |-- displayName: string (nullable = true)
| | | | | |-- emailAddress: string (nullable = true)
| | | | | |-- self: string (nullable = true)
| | | | | |-- timeZone: string (nullable = true)
| | | | |-- body: struct (nullable = true)
| | | | | |-- content: array (nullable = true)
| | | | | | |-- element: struct (containsNull = true)
| | | | | | | |-- content: array (nullable = true)
| | | | | | | | |-- element: struct (containsNull = true)
| | | | | | | | | |-- text: string (nullable = true)
| | | | | | | | | |-- type: string (nullable = true)
| | | | | | | |-- type: string (nullable = true)
| | | | | |-- type: string (nullable = true)
| | | | | |-- version: long (nullable = true)
| | | | |-- created: string (nullable = true)
| | | | |-- id: string (nullable = true)
| | | | |-- jsdPublic: boolean (nullable = true)
| | | | |-- self: string (nullable = true)
| | | | |-- updateAuthor: struct (nullable = true)
| | | | | |-- accountId: string (nullable = true)
| | | | | |-- accountType: string (nullable = true)
| | | | | |-- active: boolean (nullable = true)
| | | | | |-- avatarUrls: struct (nullable = true)
| | | | | | |-- 16x16: string (nullable = true)
| | | | | | |-- 24x24: string (nullable = true)
| | | | | | |-- 32x32: string (nullable = true)
| | | | | | |-- 48x48: string (nullable = true)
| | | | | |-- displayName: string (nullable = true)
| | | | | |-- emailAddress: string (nullable = true)
| | | | | |-- self: string (nullable = true)
| | | | | |-- timeZone: string (nullable = true)
| | | | |-- updated: string (nullable = true)
| | |-- maxResults: long (nullable = true)
| | |-- self: string (nullable = true)
| | |-- startAt: long (nullable = true)
| | |-- total: long (nullable = true)
| |-- components: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- created: string (nullable = true)
| |-- creator: struct (nullable = true)
| | |-- accountId: string (nullable = true)
| | |-- accountType: string (nullable = true)
| | |-- active: boolean (nullable = true)
| | |-- avatarUrls: struct (nullable = true)
| | | |-- 16x16: string (nullable = true)
| | | |-- 24x24: string (nullable = true)
| | | |-- 32x32: string (nullable = true)
| | | |-- 48x48: string (nullable = true)
| | |-- displayName: string (nullable = true)
| | |-- emailAddress: string (nullable = true)
| | |-- self: string (nullable = true)
| | |-- timeZone: string (nullable = true)
| |-- customfield_10001: string (nullable = true)
| |-- customfield_10002: string (nullable = true)
| |-- customfield_10003: string (nullable = true)
| |-- customfield_10004: string (nullable = true)
| |-- customfield_10005: string (nullable = true)
| |-- customfield_10006: string (nullable = true)
| |-- customfield_10007: string (nullable = true)
| |-- customfield_10008: string (nullable = true)
| |-- customfield_10009: string (nullable = true)
| |-- customfield_10010: string (nullable = true)
| |-- customfield_10014: string (nullable = true)
| |-- customfield_10015: string (nullable = true)
| |-- customfield_10016: string (nullable = true)
| |-- customfield_10017: string (nullable = true)
| |-- customfield_10018: struct (nullable = true)
| | |-- hasEpicLinkFieldDependency: boolean (nullable = true)
| | |-- nonEditableReason: struct (nullable = true)
| | | |-- message: string (nullable = true)
| | | |-- reason: string (nullable = true)
| | |-- showField: boolean (nullable = true)
| |-- customfield_10019: string (nullable = true)
| |-- customfield_10020: string (nullable = true)
| |-- customfield_10021: string (nullable = true)
| |-- customfield_10022: string (nullable = true)
| |-- customfield_10023: string (nullable = true)
| |-- customfield_10024: string (nullable = true)
| |-- customfield_10025: string (nullable = true)
| |-- customfield_10026: string (nullable = true)
| |-- customfield_10027: string (nullable = true)
| |-- customfield_10028: string (nullable = true)
| |-- customfield_10029: string (nullable = true)
| |-- customfield_10030: string (nullable = true)
| |-- description: string (nullable = true)
| |-- duedate: string (nullable = true)
| |-- environment: string (nullable = true)
| |-- fixVersions: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- issuelinks: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- issuerestriction: struct (nullable = true)
| | |-- shouldDisplay: boolean (nullable = true)
| |-- issuetype: struct (nullable = true)
| | |-- avatarId: long (nullable = true)
| | |-- description: string (nullable = true)
| | |-- entityId: string (nullable = true)
| | |-- hierarchyLevel: long (nullable = true)
| | |-- iconUrl: string (nullable = true)
| | |-- id: string (nullable = true)
| | |-- name: string (nullable = true)
| | |-- self: string (nullable = true)
| | |-- subtask: boolean (nullable = true)
| |-- labels: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- lastViewed: string (nullable = true)
| |-- priority: struct (nullable = true)
| | |-- iconUrl: string (nullable = true)
| | |-- id: string (nullable = true)
| | |-- name: string (nullable = true)
| | |-- self: string (nullable = true)
| |-- progress: struct (nullable = true)
| | |-- progress: long (nullable = true)
| | |-- total: long (nullable = true)
| |-- project: struct (nullable = true)
| | |-- avatarUrls: struct (nullable = true)
| | | |-- 16x16: string (nullable = true)
| | | |-- 24x24: string (nullable = true)
| | | |-- 32x32: string (nullable = true)
| | | |-- 48x48: string (nullable = true)
| | |-- id: string (nullable = true)
| | |-- key: string (nullable = true)
| | |-- name: string (nullable = true)
| | |-- projectTypeKey: string (nullable = true)
| | |-- self: string (nullable = true)
| | |-- simplified: boolean (nullable = true)
| |-- reporter: struct (nullable = true)
| | |-- accountId: string (nullable = true)
| | |-- accountType: string (nullable = true)
| | |-- active: boolean (nullable = true)
| | |-- avatarUrls: struct (nullable = true)
| | | |-- 16x16: string (nullable = true)
| | | |-- 24x24: string (nullable = true)
| | | |-- 32x32: string (nullable = true)
| | | |-- 48x48: string (nullable = true)
| | |-- displayName: string (nullable = true)
| | |-- emailAddress: string (nullable = true)
| | |-- self: string (nullable = true)
| | |-- timeZone: string (nullable = true)
| |-- resolution: string (nullable = true)
| |-- resolutiondate: string (nullable = true)
| |-- security: string (nullable = true)
| |-- status: struct (nullable = true)
| | |-- description: string (nullable = true)
| | |-- iconUrl: string (nullable = true)
| | |-- id: string (nullable = true)
| | |-- name: string (nullable = true)
| | |-- self: string (nullable = true)
| | |-- statusCategory: struct (nullable = true)
| | | |-- colorName: string (nullable = true)
| | | |-- id: long (nullable = true)
| | | |-- key: string (nullable = true)
| | | |-- name: string (nullable = true)
| | | |-- self: string (nullable = true)
| |-- statuscategorychangedate: string (nullable = true)
| |-- subtasks: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- summary: string (nullable = true)
| |-- timeestimate: string (nullable = true)
| |-- timeoriginalestimate: string (nullable = true)
| |-- timespent: string (nullable = true)
| |-- updated: string (nullable = true)
| |-- versions: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- votes: struct (nullable = true)
| | |-- hasVoted: boolean (nullable = true)
| | |-- self: string (nullable = true)
| | |-- votes: long (nullable = true)
| |-- watches: struct (nullable = true)
| | |-- isWatching: boolean (nullable = true)
| | |-- self: string (nullable = true)
| | |-- watchCount: long (nullable = true)
| |-- worklog: struct (nullable = true)
| | |-- maxResults: long (nullable = true)
| | |-- startAt: long (nullable = true)
| | |-- total: long (nullable = true)
| | |-- worklogs: array (nullable = true)
| | | |-- element: string (containsNull = true)
| |-- workratio: long (nullable = true)
|-- id: string (nullable = true)
|-- key: string (nullable = true)
|-- self: string (nullable = true)
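One way to approach this is to build the nested StructType bottom-up and reuse the shapes that repeat, such as the avatarUrls block and the user struct that appears as author, updateAuthor, creator, and reporter. The sketch below covers only a few representative branches of the schema above; the remaining fields follow the same pattern:

from pyspark.sql.types import (
    ArrayType, BooleanType, LongType, StringType, StructField, StructType
)

# Reusable struct for the avatarUrls block.
avatar_urls = StructType([
    StructField("16x16", StringType(), True),
    StructField("24x24", StringType(), True),
    StructField("32x32", StringType(), True),
    StructField("48x48", StringType(), True),
])

# Reusable struct for Jira user objects (author, updateAuthor, creator, reporter).
user = StructType([
    StructField("accountId", StringType(), True),
    StructField("accountType", StringType(), True),
    StructField("active", BooleanType(), True),
    StructField("avatarUrls", avatar_urls, True),
    StructField("displayName", StringType(), True),
    StructField("emailAddress", StringType(), True),
    StructField("self", StringType(), True),
    StructField("timeZone", StringType(), True),
])

progress = StructType([
    StructField("progress", LongType(), True),
    StructField("total", LongType(), True),
])

# Only a few representative fields are listed; extend in the same way for
# comment, issuetype, project, the customfield_* columns, and the rest.
fields = StructType([
    StructField("aggregateprogress", progress, True),
    StructField("assignee", StringType(), True),
    StructField("attachment", ArrayType(StringType(), True), True),
    StructField("creator", user, True),
    StructField("labels", ArrayType(StringType(), True), True),
    StructField("progress", progress, True),
    StructField("reporter", user, True),
    StructField("summary", StringType(), True),
])

schema = StructType([
    StructField("expand", StringType(), True),
    StructField("fields", fields, True),
    StructField("id", StringType(), True),
    StructField("key", StringType(), True),
    StructField("self", StringType(), True),
])

df = spark.read.option("multiline", "true").schema(schema).json("data/issue.json")

Alternatively, since spark.read.json already infers this structure, you can capture the inferred schema once via df.schema (or serialize it with df.schema.json()) and reuse it for later reads instead of typing it out by hand.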

Related

AssertionError: assertion failed: object serializer should have only one bound reference but there are 0

I have a Java POJO RawSpan with the following schema. My question is only about the trace_id, but I am including the entire schema for completeness:
root
|-- customer_id: string (nullable = true)
|-- trace_id: binary (nullable = true)
|-- entity_list: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- customer_id: string (nullable = false)
| | |-- entity_id: string (nullable = true)
| | |-- entity_type: string (nullable = false)
| | |-- entity_name: string (nullable = true)
| | |-- attributes: struct (nullable = true)
| | | |-- attribute_map: map (nullable = false)
| | | | |-- key: string
| | | | |-- value: struct (valueContainsNull = false)
| | | | | |-- value: string (nullable = true)
| | | | | |-- binary_value: binary (nullable = true)
| | | | | |-- value_list: array (nullable = true)
| | | | | | |-- element: string (containsNull = false)
| | | | | |-- value_map: map (nullable = true)
| | | | | | |-- key: string
| | | | | | |-- value: string (valueContainsNull = false)
| | |-- related_entity_ids: struct (nullable = true)
| | | |-- ids: array (nullable = false)
| | | | |-- element: string (containsNull = false)
|-- resource: struct (nullable = true)
| |-- attributes: struct (nullable = false)
| | |-- attribute_map: map (nullable = false)
| | | |-- key: string
| | | |-- value: struct (valueContainsNull = false)
| | | | |-- value: string (nullable = true)
| | | | |-- binary_value: binary (nullable = true)
| | | | |-- value_list: array (nullable = true)
| | | | | |-- element: string (containsNull = false)
| | | | |-- value_map: map (nullable = true)
| | | | | |-- key: string
| | | | | |-- value: string (valueContainsNull = false)
|-- event: struct (nullable = true)
| |-- customer_id: string (nullable = false)
| |-- event_id: binary (nullable = false)
| |-- event_name: string (nullable = true)
| |-- entity_id_list: array (nullable = false)
| | |-- element: string (containsNull = false)
| |-- resource_index: integer (nullable = false)
| |-- attributes: struct (nullable = true)
| | |-- attribute_map: map (nullable = false)
| | | |-- key: string
| | | |-- value: struct (valueContainsNull = false)
| | | | |-- value: string (nullable = true)
| | | | |-- binary_value: binary (nullable = true)
| | | | |-- value_list: array (nullable = true)
| | | | | |-- element: string (containsNull = false)
| | | | |-- value_map: map (nullable = true)
| | | | | |-- key: string
| | | | | |-- value: string (valueContainsNull = false)
| |-- start_time_millis: long (nullable = false)
| |-- end_time_millis: long (nullable = false)
| |-- metrics: struct (nullable = true)
| | |-- metric_map: map (nullable = false)
| | | |-- key: string
| | | |-- value: struct (valueContainsNull = false)
| | | | |-- value: double (nullable = true)
| | | | |-- binary_value: binary (nullable = true)
| | | | |-- value_list: array (nullable = true)
| | | | | |-- element: double (containsNull = false)
| | | | |-- value_map: map (nullable = true)
| | | | | |-- key: string
| | | | | |-- value: double (valueContainsNull = false)
| |-- event_ref_list: array (nullable = false)
| | |-- element: struct (containsNull = false)
| | | |-- trace_id: binary (nullable = false)
| | | |-- event_id: binary (nullable = false)
| | | |-- ref_type: string (nullable = false)
| |-- enriched_attributes: struct (nullable = true)
| | |-- attribute_map: map (nullable = false)
| | | |-- key: string
| | | |-- value: struct (valueContainsNull = false)
| | | | |-- value: string (nullable = true)
| | | | |-- binary_value: binary (nullable = true)
| | | | |-- value_list: array (nullable = true)
| | | | | |-- element: string (containsNull = false)
| | | | |-- value_map: map (nullable = true)
| | | | | |-- key: string
| | | | | |-- value: string (valueContainsNull = false)
| |-- jaegerFields: struct (nullable = true)
| | |-- flags: integer (nullable = false)
| | |-- logs: array (nullable = true)
| | | |-- element: string (containsNull = false)
| | |-- warnings: array (nullable = true)
| | | |-- element: string (containsNull = false)
| |-- http: struct (nullable = true)
| | |-- request: struct (nullable = true)
| | | |-- url: string (nullable = true)
| | | |-- scheme: string (nullable = true)
| | | |-- host: string (nullable = true)
| | | |-- method: string (nullable = true)
| | | |-- path: string (nullable = true)
| | | |-- query_string: string (nullable = true)
| | | |-- body: string (nullable = true)
| | | |-- session_id: string (nullable = true)
| | | |-- cookies: array (nullable = false)
| | | | |-- element: string (containsNull = false)
| | | |-- user_agent: string (nullable = true)
| | | |-- size: integer (nullable = false)
| | | |-- headers: struct (nullable = true)
| | | | |-- host: string (nullable = true)
| | | | |-- authority: string (nullable = true)
| | | | |-- content_type: string (nullable = true)
| | | | |-- path: string (nullable = true)
| | | | |-- x_forwarded_for: string (nullable = true)
| | | | |-- user_agent: string (nullable = true)
| | | | |-- cookie: string (nullable = true)
| | | | |-- other_headers: map (nullable = false)
| | | | | |-- key: string
| | | | | |-- value: string (valueContainsNull = false)
| | | |-- params: map (nullable = false)
| | | | |-- key: string
| | | | |-- value: string (valueContainsNull = false)
| | |-- response: struct (nullable = true)
| | | |-- body: string (nullable = true)
| | | |-- status_code: integer (nullable = false)
| | | |-- status_message: string (nullable = true)
| | | |-- size: integer (nullable = false)
| | | |-- cookies: array (nullable = false)
| | | | |-- element: string (containsNull = false)
| | | |-- headers: struct (nullable = true)
| | | | |-- content_type: string (nullable = true)
| | | | |-- set_cookie: string (nullable = true)
| | | | |-- other_headers: map (nullable = false)
| | | | | |-- key: string
| | | | | |-- value: string (valueContainsNull = false)
| |-- grpc: struct (nullable = true)
| | |-- request: struct (nullable = true)
| | | |-- method: string (nullable = true)
| | | |-- host_port: string (nullable = true)
| | | |-- call_options: string (nullable = true)
| | | |-- body: string (nullable = true)
| | | |-- size: integer (nullable = false)
| | | |-- metadata: map (nullable = false)
| | | | |-- key: string
| | | | |-- value: string (valueContainsNull = false)
| | | |-- request_metadata: struct (nullable = true)
| | | | |-- authority: string (nullable = true)
| | | | |-- content_type: string (nullable = true)
| | | | |-- path: string (nullable = true)
| | | | |-- x_forwarded_for: string (nullable = true)
| | | | |-- user_agent: string (nullable = true)
| | | | |-- other_metadata: map (nullable = false)
| | | | | |-- key: string
| | | | | |-- value: string (valueContainsNull = false)
| | |-- response: struct (nullable = true)
| | | |-- body: string (nullable = true)
| | | |-- size: integer (nullable = false)
| | | |-- status_code: integer (nullable = false)
| | | |-- status_message: string (nullable = true)
| | | |-- error_name: string (nullable = true)
| | | |-- error_message: string (nullable = true)
| | | |-- metadata: map (nullable = false)
| | | | |-- key: string
| | | | |-- value: string (valueContainsNull = false)
| | | |-- response_metadata: struct (nullable = true)
| | | | |-- content_type: string (nullable = true)
| | | | |-- other_metadata: map (nullable = false)
| | | | | |-- key: string
| | | | | |-- value: string (valueContainsNull = false)
| |-- sql: struct (nullable = true)
| | |-- query: string (nullable = true)
| | |-- db_type: string (nullable = true)
| | |-- url: string (nullable = true)
| | |-- params: string (nullable = true)
| | |-- sqlstate: string (nullable = true)
| |-- service_name: string (nullable = true)
| |-- rpc: struct (nullable = true)
| | |-- system: string (nullable = true)
| | |-- service: string (nullable = true)
| | |-- method: string (nullable = true)
|-- received_time_millis: long (nullable = true)
I am trying to group these spans by traceId (type: ByteBuffer; RawSpan itself is a POJO) in Spark as follows:
implicit val spanEncoder: Encoder[RawSpan] = Encoders.bean(classOf[RawSpan])
implicit val bytebufEncoder: Encoder[ByteBuffer] = Encoders.bean(classOf[ByteBuffer])
val rawSpansDataset = input1.select("value").select(from_avro(..)).select("from_avro(value).*").as[RawSpan]
val groupedSpans = rawSpansDataset.groupByKey(rawSpan => rawSpan.getTraceId())
groupedSpans.count.show
The above instruction throws the following exception:
AssertionError: assertion failed: object serializer should have only one bound reference but there are 0
at scala.Predef$.assert(Predef.scala:223)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.$anonfun$tuple$2(ExpressionEncoder.scala:100)
at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at scala.collection.TraversableLike.map(TraversableLike.scala:238)
at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
at scala.collection.AbstractTraversable.map(Traversable.scala:108)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.tuple(ExpressionEncoder.scala:98)
at org.apache.spark.sql.KeyValueGroupedDataset.aggUntyped(KeyValueGroupedDataset.scala:616)
at org.apache.spark.sql.KeyValueGroupedDataset.agg(KeyValueGroupedDataset.scala:626)
at org.apache.spark.sql.KeyValueGroupedDataset.count(KeyValueGroupedDataset.scala:733)
I have been unable to debug what I am doing wrong. How can I resolve this?
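java.nio.ByteBuffer is an abstract class, not a JavaBean, so Encoders.bean(classOf[ByteBuffer]) cannot derive a working serializer for it; that is a likely source of the assertion. One workaround (a sketch, assuming getTraceId returns a java.nio.ByteBuffer as the schema suggests) is to avoid encoding the ByteBuffer at all and group by a primitive key derived from it, such as a hex string:

import java.nio.ByteBuffer

// Hypothetical helper: render the trace id as a hex string so the grouping
// key is a plain String with well-defined equality. duplicate() leaves the
// original buffer's position untouched.
def traceIdToHex(buf: ByteBuffer): String = {
  val copy = buf.duplicate()
  val bytes = new Array[Byte](copy.remaining())
  copy.get(bytes)
  bytes.map(b => f"$b%02x").mkString
}

// Encoder[String] comes from spark.implicits._; no bean encoder for
// ByteBuffer is needed.
val groupedSpans = rawSpansDataset.groupByKey(span => traceIdToHex(span.getTraceId))
groupedSpans.count().show()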

Suggestions for tuning code that contains explode and groupBy

I wrote the code below, but it has the following problems; please suggest whether some tuning can be done.
It takes more time than I think it should.
There are 3 brands as of now, and they are hardcoded; if more brands are added, I need to update the code manually.
Input dataframe schema:
root
|-- id: string (nullable = true)
|-- attrib: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
|-- pref: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- pref_type: string (nullable = true)
| | |-- brand: string (nullable = true)
| | |-- tp_id: string (nullable = true)
| | |-- aff: float (nullable = true)
| | |-- pre_id: string (nullable = true)
| | |-- cr_date: string (nullable = true)
| | |-- up_date: string (nullable = true)
| | |-- pref_attrib: map (nullable = true)
| | | |-- key: string
| | | |-- value: string (valueContainsNull = true)
Expected output schema:
root
|-- id: string (nullable = true)
|-- attrib: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
|-- pref: struct (nullable = false)
| |-- brandA: array (nullable = true)
| | |-- element: struct (containsNull = false)
| | | |-- pref_type: string (nullable = true)
| | | |-- tp_id: string (nullable = true)
| | | |-- aff: float (nullable = true)
| | | |-- pref_id: string (nullable = true)
| | | |-- cr_date: string (nullable = true)
| | | |-- up_date: string (nullable = true)
| | | |-- pref_attrib: map (nullable = true)
| | | | |-- key: string
| | | | |-- value: string (valueContainsNull = true)
| |-- brandB: array (nullable = true)
| | |-- element: struct (containsNull = false)
| | | |-- pref_type: string (nullable = true)
| | | |-- tp_id: string (nullable = true)
| | | |-- aff: float (nullable = true)
| | | |-- pref_id: string (nullable = true)
| | | |-- cr_date: string (nullable = true)
| | | |-- up_date: string (nullable = true)
| | | |-- pref_attrib: map (nullable = true)
| | | | |-- key: string
| | | | |-- value: string (valueContainsNull = true)
| |-- brandC: array (nullable = true)
| | |-- element: struct (containsNull = false)
| | | |-- pref_type: string (nullable = true)
| | | |-- tp_id: string (nullable = true)
| | | |-- aff: float (nullable = true)
| | | |-- pref_id: string (nullable = true)
| | | |-- cr_date: string (nullable = true)
| | | |-- up_date: string (nullable = true)
| | | |-- pref_attrib: map (nullable = true)
| | | | |-- key: string
| | | | |-- value: string (valueContainsNull = true)
The processing can be done based on the brand attribute under the preferences (preferences.brand).
I have written the following code for that:
def modifyBrands(inputDf: DataFrame): DataFrame = {
  import org.apache.spark.sql.functions._
  val PreferenceProps = Array("pref_type", "tp_id", "aff", "pref_id", "cr_date", "up_date", "pref_attrib")
  val explodedDf = inputDf.select(col("id"), explode(col("pref")))
    .select(
      col("id"),
      col("col.pref_type"),
      col("col.brand"),
      col("col.tp_id"),
      col("col.aff"),
      col("col.pre_id").as("pref_id"), // input column is pre_id; renamed to match PreferenceProps
      col("col.cr_date"),
      col("col.up_date"),
      col("col.pref_attrib")
    ).cache()
  val brandAddedDf = explodedDf
    .withColumn("brandA", when(col("brand") === "brandA", struct(PreferenceProps.head, PreferenceProps.tail: _*)))
    .withColumn("brandB", when(col("brand") === "brandB", struct(PreferenceProps.head, PreferenceProps.tail: _*)))
    .withColumn("brandC", when(col("brand") === "brandC", struct(PreferenceProps.head, PreferenceProps.tail: _*)))
    .cache()
  explodedDf.unpersist()
  val groupedDf = brandAddedDf.groupBy("id").agg(
      collect_list("brandA").alias("brandA"),
      collect_list("brandB").alias("brandB"),
      collect_list("brandC").alias("brandC")
    )
    .withColumn("preferences", struct(
      when(size(col("brandA")).notEqual(0), col("brandA")).alias("brandA"),
      when(size(col("brandB")).notEqual(0), col("brandB")).alias("brandB"),
      when(size(col("brandC")).notEqual(0), col("brandC")).alias("brandC")
    ))
    .drop("brandA", "brandB", "brandC")
    .cache()
  brandAddedDf.unpersist()
  val idAttributesDf = inputDf.select("id", "attrib").cache()
  val joinedDf = idAttributesDf.join(groupedDf, "id")
  groupedDf.unpersist()
  idAttributesDf.unpersist()
  joinedDf.printSchema()
  joinedDf // the returned df is later written out as a Parquet file
}
You can simplify your code by using the higher-order function filter on arrays: just map over the brand names and, for each one, return a filtered array from pref. This way you avoid the explode / group-by part entirely.
Here's a complete example:
val data = """{"id":1,"attrib":{"key":"k","value":"v"},"pref":[{"pref_type":"type1","brand":"brandA","tp_id":"id1","aff":"aff1","pre_id":"pre_id1","cr_date":"2021-01-06","up_date":"2021-01-06","pref_attrib":{"key":"k","value":"v"}},{"pref_type":"type1","brand":"brandB","tp_id":"id1","aff":"aff1","pre_id":"pre_id1","cr_date":"2021-01-06","up_date":"2021-01-06","pref_attrib":{"key":"k","value":"v"}},{"pref_type":"type1","brand":"brandC","tp_id":"id1","aff":"aff1","pre_id":"pre_id1","cr_date":"2021-01-06","up_date":"2021-01-06","pref_attrib":{"key":"k","value":"v"}}]}"""
val inputDf = spark.read.json(Seq(data).toDS)
val brands = Seq("brandA", "brandB", "brandC")
// or getting them dynamically from the input dataframe:
// val brands = inputDf.select("pref.brand").as[Seq[String]].collect.flatten.distinct
val brandAddedDf = inputDf.withColumn(
  "pref",
  struct(brands.map(b => expr(s"filter(pref, x -> x.brand = '$b')").as(b)): _*)
)
brandAddedDf.printSchema
//root
// |-- attrib: struct (nullable = true)
// | |-- key: string (nullable = true)
// | |-- value: string (nullable = true)
// |-- id: long (nullable = true)
// |-- pref: struct (nullable = false)
// | |-- brandA: array (nullable = true)
// | | |-- element: struct (containsNull = true)
// | | | |-- aff: string (nullable = true)
// | | | |-- brand: string (nullable = true)
// | | | |-- cr_date: string (nullable = true)
// | | | |-- pre_id: string (nullable = true)
// | | | |-- pref_attrib: struct (nullable = true)
// | | | | |-- key: string (nullable = true)
// | | | | |-- value: string (nullable = true)
// | | | |-- pref_type: string (nullable = true)
// | | | |-- tp_id: string (nullable = true)
// | | | |-- up_date: string (nullable = true)
// | |-- brandB: array (nullable = true)
// | | |-- element: struct (containsNull = true)
// | | | |-- aff: string (nullable = true)
// | | | |-- brand: string (nullable = true)
// | | | |-- cr_date: string (nullable = true)
// | | | |-- pre_id: string (nullable = true)
// | | | |-- pref_attrib: struct (nullable = true)
// | | | | |-- key: string (nullable = true)
// | | | | |-- value: string (nullable = true)
// | | | |-- pref_type: string (nullable = true)
// | | | |-- tp_id: string (nullable = true)
// | | | |-- up_date: string (nullable = true)
// | |-- brandC: array (nullable = true)
// | | |-- element: struct (containsNull = true)
// | | | |-- aff: string (nullable = true)
// | | | |-- brand: string (nullable = true)
// | | | |-- cr_date: string (nullable = true)
// | | | |-- pre_id: string (nullable = true)
// | | | |-- pref_attrib: struct (nullable = true)
// | | | | |-- key: string (nullable = true)
// | | | | |-- value: string (nullable = true)
// | | | |-- pref_type: string (nullable = true)
// | | | |-- tp_id: string (nullable = true)
// | | | |-- up_date: string (nullable = true)
I think there are a couple of issues with how your code is written, but the real way to find where the problem is is to look at the Spark UI. I find the "Jobs" tab and the "SQL" tab very informative for figuring out where the code spends most of its time; then see whether those parts can be rewritten for more speed. Some of the items I point out below may not matter if the real bottleneck is elsewhere.
There are reasons to create nested structures (like you are doing for brand), but I'm not sure I see the payoff here, and it isn't explained. Consider why you are maintaining this structure and what the benefit is: is there a performance gain, or is it simply an artifact of how the data was created?
General tips that might help a little:
In general you should only cache data that you will use more than once. You cache several DataFrames that you only use once.
A small performance boost (in other words, for when you need every millisecond): withColumn doesn't actually perform as well as select, likely due to some extra object creation, so use select instead of withColumn where possible. It's not really worth rewriting your code unless you truly need every millisecond.
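For example, the three chained withColumn calls in the question could be collapsed into a single select (a sketch against the question's explodedDf and PreferenceProps):

import org.apache.spark.sql.functions.{col, struct, when}

// One projection instead of three successive withColumn calls; when() without
// otherwise() yields null for non-matching rows, exactly as before.
val brandAddedDf = explodedDf.select(
  col("*"),
  when(col("brand") === "brandA", struct(PreferenceProps.head, PreferenceProps.tail: _*)).alias("brandA"),
  when(col("brand") === "brandB", struct(PreferenceProps.head, PreferenceProps.tail: _*)).alias("brandB"),
  when(col("brand") === "brandC", struct(PreferenceProps.head, PreferenceProps.tail: _*)).alias("brandC")
)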

Reshape array of structs on pyspark

I have a spark dataframe with the following schema:
|-- id: long (nullable = true)
|-- comment: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: long (nullable = true)
| | |-- body: string (nullable = true)
| | |-- html_body: string (nullable = true)
| | |-- author_id: long (nullable = true)
| | |-- uploads: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- attachments: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- thumbnails: array (nullable = true)
| | | | | |-- element: struct (containsNull = true)
| | | | | | |-- id: long (nullable = true)
| | | | | | |-- file_name: string (nullable = true)
| | | | | | |-- url: string (nullable = true)
| | | | | | |-- content_url: string (nullable = true)
| | | | | | |-- mapped_content_url: string (nullable = true)
| | | | | | |-- content_type: string (nullable = true)
| | | | | | |-- size: long (nullable = true)
| | | | | | |-- width: long (nullable = true)
| | | | | | |-- height: long (nullable = true)
| | | | | | |-- inline: boolean (nullable = true)
| | |-- created_at: string (nullable = true)
| | |-- public: boolean (nullable = true)
| | |-- channel: string (nullable = true)
| | |-- from: string (nullable = true)
| | |-- location: string (nullable = true)
However, this has far more data than I need. For each element of the comment array, I would like to combine first comment.created_at and then comment.body into a new struct column called comment_final.
The end goal is to build from that a string column that flattens the entire array into an HTML-like string field.
For the end result, I would like to do the following:
.withColumn('final_body',array_join(col('comment.body'),'<br/><br/>')).
Can someone help me out with the array/struct data modelling?
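One possible shape for this (a sketch using the transform higher-order function via expr, which works on Spark 2.4+; the "created_at: body" concatenation format is an assumption):

from pyspark.sql import functions as F

df2 = (df
    # Keep only created_at and body from each comment element, as a new
    # array-of-struct column.
    .withColumn(
        "comment_final",
        F.expr("transform(comment, c -> named_struct('created_at', c.created_at, 'body', c.body))")
    )
    # Flatten each comment to 'created_at: body' and join the results into a
    # single HTML-like string field.
    .withColumn(
        "final_body",
        F.array_join(
            F.expr("transform(comment, c -> concat(c.created_at, ': ', c.body))"),
            "<br/><br/>"
        )
    )
)
df2.select("id", "final_body").show(truncate=False)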

Not able to get column value from Spark data frame

I have loaded data from an XML file:
val xmlContent=spark.sqlContext.read.format("com.databricks.spark.xml").option("rowTag","GROUP.NOTES").load("/datalake/other/decomlake/spark-xml-poc/Sample.xml")
The schema of xmlContent is as follows:
xmlContent.printSchema
root
|-- GROUP.NOTES-ROW: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- XML_COLUMN_1_TEXT-MV: struct (nullable = true)
| | | |-- XML_COLUMN_1_TEXT: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- _VALUE: string (nullable = true)
| | | | | |-- _val: long (nullable = true)
| | |-- XML_COLUMN_2_TEXT-MV: struct (nullable = true)
| | | |-- XML_COLUMN_2_TEXT: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- _VALUE: string (nullable = true)
| | | | | |-- _val: long (nullable = true)
| | |-- XML_COLUMN_3_TEXT-MV: struct (nullable = true)
| | | |-- XML_COLUMN_3_TEXT: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- _VALUE: string (nullable = true)
| | | | | |-- _val: long (nullable = true)
| | |-- XML_FLD001: string (nullable = true)
| | |-- XML_FLD002: string (nullable = true)
| | |-- XML_FLD004: string (nullable = true)
| | |-- XML_FLD006: string (nullable = true)
| | |-- XML_FLD007: string (nullable = true)
| | |-- XML_ID: string (nullable = true)
| | |-- _confidential: string (nullable = true)
|-- _account: string (nullable = true)
|-- _area: string (nullable = true)
|-- _exbatch: string (nullable = true)
|-- _filename: string (nullable = true)
|-- _mahptablename: string (nullable = true)
|-- _subaccount: string (nullable = true)
I am able to get the values of these columns:
_account,
_area,
_exbatch,
_filename,
_mahptablename,
_subaccount
but I am not able to get the value of the column GROUP.NOTES-ROW, as I get the following error:
val groupNoteDf=xmlContent.select("GROUP.NOTES-ROW").show()
org.apache.spark.sql.AnalysisException: cannot resolve '`GROUP.NOTES-ROW`' given input columns: [_exbatch, _account, _area, GROUP.NOTES-ROW, _subaccount, _mahptablename, _filename];;
'Project ['GROUP.NOTES-ROW]
+- Relation[GROUP.NOTES-ROW#0,_account#1,_area#2,_exbatch#3,_filename#4,_mahptablename#5,_subaccount#6] XmlRelation(<function0>,Some(/datalake/other/decomlake/spark-xml-poc/Sample.xml),Map(rowtag -> GROUP.NOTES, path -> /datalake/other/decomlake/spark-xml-poc/Sample.xml),null)
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:89)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:86)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:268)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:268)
at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:279)
at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:289)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:293)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:293)
at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$6.apply(QueryPlan.scala:298)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:298)
at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:268)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:86)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:79)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:79)
at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91)
I am using the Databricks API (spark-xml) to parse and load the XML file.
Thanks in advance to anyone who can help me find the solution.
You need backticks (`) to escape the dot and hyphen in your column name, so that Spark treats GROUP.NOTES-ROW as a single column name rather than a nested-field reference.
xmlContent.select("`GROUP.NOTES-ROW`").show()
Also, since show() returns Unit, don't assign its result to a variable; you can view your DataFrame directly with the statement above. If you do want to create a new DataFrame by assigning to a variable, don't call show() on it.
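The same escaping applies to nested field access (a sketch against the schema above): backtick each path segment that contains a dot or hyphen, not the whole path.

import org.apache.spark.sql.functions.{col, explode}

xmlContent
  .select(explode(col("`GROUP.NOTES-ROW`")).as("row"))
  .select(col("row.XML_ID"), col("row.`XML_COLUMN_1_TEXT-MV`"))
  .show()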

Get WrappedArray row value and convert it into a string in Scala

I have a data frame which looks like the one below:
+---------------------------------------------------------------------+
|value |
+---------------------------------------------------------------------+
|[WrappedArray(LineItem_organizationId, LineItem_lineItemId)] |
|[WrappedArray(OrganizationId, LineItemId, SegmentSequence_segmentId)]|
+---------------------------------------------------------------------+
From the above two rows I want to create strings in this format:
"LineItem_organizationId", "LineItem_lineItemId"
"OrganizationId", "LineItemId", "SegmentSequence_segmentId"
I want this to be dynamic, so that if an array in the first column has a third value, the string gets one more comma-separated value.
How can I do this in Scala?
This is what I am doing in order to create the data frame:
val xmlFiles = "C://Users//u6034690//Desktop//SPARK//trfsmallfffile//XML"
val discriptorFileLOcation = "C://Users//u6034690//Desktop//SPARK//trfsmallfffile//FinancialLineItem//REFXML"
import sqlContext.implicits._
val dfDiscriptor = sqlContext.read.format("com.databricks.spark.xml").option("rowTag", "FlatFileDescriptor").load(discriptorFileLOcation)
dfDiscriptor.printSchema()
val firstColumn = dfDiscriptor.select($"FFFileType.FFRecord.FFField").as("FFField")
val FirstColumnOfHeaderFile = firstColumn.select(explode($"FFField")).as("ColumnsDetails").select(explode($"col")).first.get(0).toString().split(",")(5)
println(FirstColumnOfHeaderFile)
//dfDiscriptor.printSchema()
val primaryKeyColumnsFinancialLineItem = dfDiscriptor.select(explode($"FFFileType.FFRecord.FFPrimKey.FFPrimKeyCol"))
primaryKeyColumnsFinancialLineItem.show(false)
Adding the full schema:
root
|-- FFColumnDelimiter: string (nullable = true)
|-- FFContentItem: struct (nullable = true)
| |-- _VALUE: string (nullable = true)
| |-- _ffMajVers: long (nullable = true)
| |-- _ffMinVers: double (nullable = true)
|-- FFFileEncoding: string (nullable = true)
|-- FFFileType: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- FFPhysicalFile: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- FFFileName: string (nullable = true)
| | | | |-- FFRowCount: long (nullable = true)
| | |-- FFRecord: struct (nullable = true)
| | | |-- FFField: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- FFColumnNumber: long (nullable = true)
| | | | | |-- FFDataType: string (nullable = true)
| | | | | |-- FFFacets: struct (nullable = true)
| | | | | | |-- FFMaxLength: long (nullable = true)
| | | | | | |-- FFTotalDigits: long (nullable = true)
| | | | | |-- FFFieldIsOptional: boolean (nullable = true)
| | | | | |-- FFFieldName: string (nullable = true)
| | | | | |-- FFForKey: struct (nullable = true)
| | | | | | |-- FFForKeyCol: string (nullable = true)
| | | | | | |-- FFForKeyRecord: string (nullable = true)
| | | |-- FFPrimKey: struct (nullable = true)
| | | | |-- FFPrimKeyCol: array (nullable = true)
| | | | | |-- element: string (containsNull = true)
| | | |-- FFRecordType: string (nullable = true)
|-- FFHeaderRow: boolean (nullable = true)
|-- FFId: string (nullable = true)
|-- FFRowDelimiter: string (nullable = true)
|-- FFTimeStamp: string (nullable = true)
|-- _env: string (nullable = true)
|-- _ffMajVers: long (nullable = true)
|-- _ffMinVers: double (nullable = true)
|-- _ffPubstyle: string (nullable = true)
|-- _schemaLocation: string (nullable = true)
|-- _sr: string (nullable = true)
|-- _xmlns: string (nullable = true)
|-- _xsi: string (nullable = true)
Looking at your given dataframe
+---------------------------------------------------------------------+
|value |
+---------------------------------------------------------------------+
|[WrappedArray(LineItem_organizationId, LineItem_lineItemId)] |
|[WrappedArray(OrganizationId, LineItemId, SegmentSequence_segmentId)]|
+---------------------------------------------------------------------+
it must have the following schema
|-- value: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
If the above assumption is true, then you can write a udf function as
import org.apache.spark.sql.functions._
def arrayToString = udf((arr: collection.mutable.WrappedArray[collection.mutable.WrappedArray[String]]) => arr.flatten.mkString(", "))
And use it in the dataframe as
df.withColumn("value", arrayToString($"value"))
And you should have
+-----------------------------------------------------+
|value |
+-----------------------------------------------------+
|LineItem_organizationId, LineItem_lineItemId |
|OrganizationId, LineItemId, SegmentSequence_segmentId|
+-----------------------------------------------------+
|-- value: string (nullable = true)
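On Spark 2.4+ the same result is also possible without a UDF, using the built-in flatten and concat_ws functions (a sketch, assuming the same array-of-array-of-string schema):

import org.apache.spark.sql.functions.{concat_ws, flatten}

// flatten collapses array<array<string>> to array<string>;
// concat_ws then joins the elements with ", ".
df.withColumn("value", concat_ws(", ", flatten($"value")))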