AWS Glue Resolve Column Choice as Array or Struct - pyspark

I've run out of ideas on how to solve the following issue. A table in the Glue Data Catalog has this schema:
root
|-- _id: string
|-- _field: struct
| |-- ref: choice
| | |-- array
| | | |-- element: struct
| | | | |-- value: null
| | | | |-- key: string
| | | | |-- name: string
| | |-- struct
| | | |-- value: null
| | | |-- key: choice
| | | | |-- int
| | | | |-- string
| | | |-- name: string
If I try to resolve the ref choice using
resolved = df.resolveChoice(specs=[("_field.ref", "cast:array")])
I lose records.
Any ideas on how I could:
filter the DataFrame on whether _field.ref is an array or struct
convert struct records into an array or vice-versa

I was able to solve my own problem by using
resolved_df = ResolveChoice.apply(df, choice = "make_cols")
This will save array values in a new ref_array column and struct values in a new ref_struct column.
This allowed me to split the DataFrame by
resolved_df1 = resolved_df.filter(col("ref_array").isNotNull()).select(col("ref_array").alias("ref"))
resolved_df2 = resolved_df.filter(col("ref_struct").isNotNull()).select(col("ref_struct").alias("ref"))
After either converting the arrays into individual structs (using explode()) or wrapping the structs in single-element arrays (using array()), recombine the two halves.
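A rough sketch of that recombination (hedged: it assumes resolved_df1 and resolved_df2 are the DataFrames from above, each reduced to the single ref column, and that wrapping the struct side in a one-element array produces a schema compatible with the array side):
from pyspark.sql.functions import array, col

# resolved_df1: rows where ref is already array<struct<...>>
# resolved_df2: rows where ref is a single struct -> wrap it in a one-element array
array_side = resolved_df1
struct_side = resolved_df2.select(array(col("ref")).alias("ref"))

recombined = array_side.unionByName(struct_side)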

Related

Spark, adding new field in array of arrays

I am new to Spark and have the following df schema (part of it):
root
|-- value: binary (nullable = false)
|-- event: struct (nullable = false)
| |-- eventtime: struct (nullable = true)
| | |-- seconds: long (nullable = true)
| | |-- ms: float (nullable = true)
| |-- fault: struct (nullable = true)
| | |-- collections: struct (nullable = true)
| | | |-- snapshots: array (nullable = false) --> ** FIRST LEVEL ARRAY (or array of arrays) **
| | | | |-- element: struct (containsNull = false)
| | | | | |-- ringbuffer: struct (nullable = true)
| | | | | | |-- columns: array (nullable = false) --> ** SECOND LEVEL ARRAY **
| | | | | | | |-- element: struct (containsNull = false)
| | | | | | | | |-- doubles: struct (nullable = true)
| | | | | | | | | |-- values: array (nullable = false)
| | | | | | | | | | |-- element: float (containsNull = false)
....................................
..........................
I am able to add a new field under fault with the following code, and the new field comp_id appears at the same level as collections.
df.withColumn("event", col("event").withField("fault.comp_id", lit(1234)))
How can I add a new field inside an array of arrays, for example a new test_field under columns? I tried to reach into the arrays by specifying the first index, 0:
df.withColumn("event",col("event").withField("fault.collections.snapshots.0.ringbuffer.columns.0.test_field", lit("test_value")))
But got this error
org.apache.spark.sql.catalyst.parser.ParseException: mismatched input '.0' expecting {<EOF>, '.', '-'}(line 1, pos 25)
== SQL ==
fault.snapshots.snapshots.0.ringbuffer.columns.0.test_field
-------------------------^^^
at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:255)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:124)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parseMultipartIdentifier(ParseDriver.scala:61)
at org.apache.spark.sql.catalyst.expressions.UpdateFields$.nameParts(complexTypeCreator.scala:693)
at org.apache.spark.sql.catalyst.expressions.UpdateFields$.apply(complexTypeCreator.scala:701)
at org.apache.spark.sql.Column.withField(Column.scala:927)
at org.apache.spark.sql.Dataset.transform(Dataset.scala:2751)
at org.apache.spark.sql.Dataset.transform(Dataset.scala:2751)
So the desired schema would look like the following.
root
|-- value: binary (nullable = false)
|-- event: struct (nullable = false)
| |-- eventtime: struct (nullable = true)
| | |-- seconds: long (nullable = true)
| | |-- ms: float (nullable = true)
| |-- fault: struct (nullable = true)
| | |-- collections: struct (nullable = true)
| | | |-- snapshots: array (nullable = false) --> ** FIRST LEVEL ARRAY (or array of arrays) **
| | | | |-- element: struct (containsNull = false)
| | | | | |-- ringbuffer: struct (nullable = true)
| | | | | | |-- columns: array (nullable = false) --> ** SECOND LEVEL ARRAY **
| | | | | | | |-- element: struct (containsNull = false)
| | | | | | | | |-- doubles: struct (nullable = true)
| | | | | | | | | |-- values: array (nullable = false)
| | | | | | | | | | |-- element: float (containsNull = false)
| | | | | | | | |-- test_field: string (nullable = true)
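One hedged way to get there (a sketch, assuming Spark 3.1+ where Column.withField and the higher-order transform function exist): withField has no positional syntax for arrays, so rebuild each array level with transform() and add the field to every element of columns.
from pyspark.sql import functions as F

# Rebuild the nested arrays so every element of columns gains test_field.
df2 = df.withColumn(
    "event",
    F.col("event").withField(
        "fault.collections.snapshots",
        F.transform(
            F.col("event.fault.collections.snapshots"),
            lambda snap: snap.withField(
                "ringbuffer.columns",
                F.transform(
                    snap["ringbuffer"]["columns"],
                    lambda c: c.withField("test_field", F.lit("test_value")),
                ),
            ),
        ),
    ),
)
This adds test_field to every element rather than only to index 0; the ParseException above is the parser rejecting the .0 index in the field path, since withField does not support index-based paths.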

Exploding Nested Struct In Spark Dataframe having Different Schema

I have a JSON which has the below schema:
|-- Pool: struct (nullable = true)
| |-- 1: struct (nullable = true)
| | |-- Client: struct (nullable = true)
| | | |-- 1: struct (nullable = true)
| | | | |-- Active: boolean (nullable = true)
| | | | |-- Alias: string (nullable = true)
| | | | |-- Chaddr: string (nullable = true)
| | | |-- 2: struct (nullable = true)
| | | | |-- Active: boolean (nullable = true)
| | | | |-- Alias: string (nullable = true)
| | | | |-- Chaddr: string (nullable = true)
| |-- 2: struct (nullable = true)
| | |-- Alias: string (nullable = true)
| | |-- Chaddr: string (nullable = true)
| | |-- ChaddrMask: string (nullable = true)
| | |-- Client: struct (nullable = true)
| | | |-- 1: struct (nullable = true)
| | | | |-- Active: boolean (nullable = true)
| | | | |-- Alias: string (nullable = true)
| | | | |-- Chaddr: string (nullable = true)
And the output that I am trying to achieve is:
PoolId  ClientID  Client_Active
1       1         true
1       2         false
2       1         true
This schema keeps changing with each JSON. For example, right now there are 2 Pool IDs, but another JSON may have 5 Pool IDs, and the same goes for Client IDs.
The problems with this are:
We can't use explode on a struct.
Pool can't be converted to a Map, since each Pool has different Client IDs, which leads to a different schema for each row.
Any thoughts on how to achieve this?
I have tried this link for converting the struct to a Map and then exploding, but it doesn't work when there are different numbers of Client IDs in different Pools.
From my perspective, you only need to define a UDF.
Here's an example:
Define a projection case class (what you want as the resulting structure):
case class Projection(PoolId: String, ClientID: String, Client_Active: Boolean)
Define a UDF like the one below, allowing you to work with both your structure (fields) and your data:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{explode, udf}

// Walk the Pool fields, then each Client field, emitting one Projection per (pool, client).
val myUdf = udf { r: Row =>
  r.schema.fields.flatMap { rf =>
    val poolId = rf.name
    val pool = r.getAs[Row](poolId)
    val clientRow = pool.getAs[Row]("Client")
    clientRow.schema.fields.map { cr =>
      val clientId = cr.name
      val isActive = clientRow.getAs[Row](clientId).getAs[Boolean]("Active")
      Projection(poolId, clientId, isActive)
    }
  }
}
Use your UDF:
val newDF = df.select(explode(myUdf($"Pool")).as("projection"))
  .select("projection.*")
  .cache

newDF.show(false)
The output is the expected one:
+------+--------+-------------+
|PoolId|ClientID|Client_Active|
+------+--------+-------------+
|1 |1 |true |
|1 |2 |false |
|2 |1 |true |
+------+--------+-------------+
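For readers working in pyspark (as in the rest of this collection), here is a hedged sketch of the same field-walking idea driven off the DataFrame schema instead of a UDF; it assumes every Client struct has an Active field, as in the schema above:
from pyspark.sql import functions as F

pool_type = df.schema["Pool"].dataType
projections = []
for pool_field in pool_type.fields:                       # pool ids: "1", "2", ...
    client_type = pool_field.dataType["Client"].dataType
    for client_field in client_type.fields:               # client ids: "1", "2", ...
        projections.append(
            F.struct(
                F.lit(pool_field.name).alias("PoolId"),
                F.lit(client_field.name).alias("ClientID"),
                F.col("Pool")
                 .getField(pool_field.name)
                 .getField("Client")
                 .getField(client_field.name)
                 .getField("Active")
                 .alias("Client_Active"),
            )
        )

newDF = df.select(F.explode(F.array(*projections)).alias("projection")).select("projection.*")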

Select key column from data as null if it doesn't exist in pyspark

My Dataframe (df) is structured as follows:
root
|-- val1: string (nullable = true)
|-- val2: string (nullable = true)
|-- val3: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _type: string (nullable = true)
| | |-- key: string (nullable = true)
| | |-- value: string (nullable = true)
I have two sample records as follows:
+------+------+--------------------------+
| val1 | val2 | val3                     |
+------+------+--------------------------+
| A    | a    | {k1: A1, k2: A2, k3: A3} |
+------+------+--------------------------+
| B    | b    | {k3: B3}                 |
+------+------+--------------------------+
I'm trying to select data from this as follows:
df.select("val1", "val2", "val3.k1", "val3.k2", "val3.k3")
And I want my output to look like:
+------+------+------+------+------+
| val1 | val2 | k1   | k2   | k3   |
+------+------+------+------+------+
| A    | a    | A1   | A2   | A3   |
+------+------+------+------+------+
| B    | b    | NULL | NULL | B3   |
+------+------+------+------+------+
But since I don't have the keys k1 and k2 for all records, the select statement throws an error. How do I solve this? I'm relatively new to pyspark.
I think you can use
df.selectExpr('val3.*')
Let me know if this works
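If that alone doesn't get you there, one hedged alternative (a sketch, assuming the key/value element fields shown in the val3 schema above and Spark 2.4+ for map_from_entries) is to turn the array of structs into a map and look each key up; missing keys come back as NULL:
from pyspark.sql import functions as F

# Project each element of val3 down to (key, value), build a map, then index it by key.
kv = F.map_from_entries(F.expr("transform(val3, x -> struct(x.key, x.value))"))
result = df.select(
    "val1",
    "val2",
    kv["k1"].alias("k1"),
    kv["k2"].alias("k2"),
    kv["k3"].alias("k3"),
)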

How to assign constant values to the nested objects in pyspark?

I have a requirement where I need to mask the data for some of the fields in a given schema. I've researched a lot and couldn't find the answer I need.
This is the schema where I need some changes on the fields (answer_type, response0, response3):
| |-- choices: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- choice_id: long (nullable = true)
| | | |-- created_time: long (nullable = true)
| | | |-- updated_time: long (nullable = true)
| | | |-- created_by: long (nullable = true)
| | | |-- updated_by: long (nullable = true)
| | | |-- answers: struct (nullable = true)
| | | | |-- answer_node_internal_id: long (nullable = true)
| | | | |-- label: string (nullable = true)
| | | | |-- text: map (nullable = true)
| | | | | |-- key: string
| | | | | |-- value: string (valueContainsNull = true)
| | | | |-- data_tag: string (nullable = true)
| | | | |-- answer_type: string (nullable = true)
| | | |-- response: struct (nullable = true)
| | | | |-- response0: string (nullable = true)
| | | | |-- response1: long (nullable = true)
| | | | |-- response2: double (nullable = true)
| | | | |-- response3: array (nullable = true)
| | | | | |-- element: string (containsNull = true)
Is there a way I could assign values to those fields without affecting the above structure in pyspark?
I've tried using explode but I can't get back to the original schema. I don't want to create a new column either, and at the same time I don't want to lose any data from the provided schema object.
Oh, I hit a similar problem a few days ago. I suggest converting the struct to JSON, making the internal changes with a UDF, and then converting back so you get the original struct again.
You should look at to_json and from_json in the documentation:
https://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html#pyspark.sql.functions.from_json
https://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html#pyspark.sql.functions.to_json
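A minimal sketch of that round trip (hedged: the top-level choices column, the masking values, and the mask_choices helper are assumptions for illustration; if choices is nested, apply the same idea at its parent level):
import json
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

def mask_choices(js):
    # Hypothetical masking logic for answer_type, response0 and response3.
    if js is None:
        return None
    choices = json.loads(js)
    for choice in choices:
        if choice.get("answers"):
            choice["answers"]["answer_type"] = "MASKED"
        if choice.get("response"):
            choice["response"]["response0"] = "MASKED"
            choice["response"]["response3"] = ["MASKED"]
    return json.dumps(choices)

mask_udf = F.udf(mask_choices, StringType())

choices_schema = df.schema["choices"].dataType  # reuse the original nested schema
masked = df.withColumn(
    "choices",
    F.from_json(mask_udf(F.to_json("choices")), choices_schema),
)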

Spark Data frame throwing error when trying to query nested column

I have a DataFrame with the below schema. It is basically an XML file which I have converted to a DataFrame for further processing. I'm trying to extract the _Date column, but it looks like some type mismatch is happening.
df1.printSchema
|-- PlayWeek: struct (nullable = true)
| |-- TicketSales: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- PlayDate: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- BoxOfficeDetail: array (nullable = true)
| | | | | | |-- element: struct (containsNull = true)
| | | | | | | |-- VisualFormatCd: struct (nullable = true)
| | | | | | | | |-- Code: struct (nullable = true)
| | | | | | | | | |-- _SequenceId: long (nullable = true)
| | | | | | | | | |-- _VALUE: double (nullable = true)
| | | | | | | |-- _SessionTypeCd: string (nullable = true)
| | | | | | | |-- _TicketPrice: double (nullable = true)
| | | | | | | |-- _TicketQuantity: long (nullable = true)
| | | | | | | |-- _TicketTax: double (nullable = true)
| | | | | | | |-- _TicketTypeCd: string (nullable = true)
| | | | | |-- _Date: string (nullable = true)
| | | |-- _FilmId: long (nullable = true)
| | | |-- _Screen: long (nullable = true)
| | | |-- _TheatreId: long (nullable = true)
| |-- _BusinessEndDate: string (nullable = true)
| |-- _BusinessStartDate: string (nullable = true)
I need to extract the _Date column, but it's throwing the below error:
scala> df1.select(df1.col("PlayWeek.TicketSales.PlayDate._Date")).show()
org.apache.spark.sql.AnalysisException: cannot resolve 'PlayWeek.TicketSales.PlayDate[_Date]' due to data type mismatch: argument 2 requires integral type, however, '_Date' is of string type.;
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:65)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:335)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:335)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:334)
Any help would be appreciated.
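One hedged workaround (a pyspark sketch, matching the rest of this collection; the same shape works in Scala): explode the nested arrays first so that ._Date is resolved against a struct instead of an array of arrays, which is what the analysis error above is objecting to.
from pyspark.sql import functions as F

dates = (
    df1
    .select(F.explode("PlayWeek.TicketSales").alias("ts"))  # one row per TicketSales element
    .select(F.explode("ts.PlayDate").alias("pd"))            # one row per PlayDate element
    .select("pd._Date")
)
dates.show()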