I have a Glue DynamicFrame with the following structure; due to some historical data, there are differences in the structure.
When I try to change the structure, resolveChoice is not working.
|-- logs: array
| |-- element: struct
| | |-- ...
| | |-- target: struct
| | | |-- ...
| | | |-- details: struct
| | | | |-- ...
| | | | |-- reviews: struct
| | | | | |-- comment: string
| | | | | |-- generalReason: choice
| | | | | | |-- array
| | | | | | | |-- element: array
| | | | | | | | |-- element: choice
| | | | | | | | | |-- int
| | | | | | | | | |-- string
| | | | | | |-- string
| | | | | | |-- struct
| | | | | | | |-- deleted: long
| | | | | | | |-- brandId: string
| | | | | | | |-- brand: string
| | | | | | | |-- name_JP: string
| | | | | | | |-- status: boolean
| | | | | | | |-- modifiedDate: string
| | | | | | | |-- name_EN: string
| | | | | | | |-- id: string
| | |-- ...
|-- SK: string
|-- PK: string
The code I am running:
case_logs_temp = case_logs.resolveChoice(
specs=[
("logs.target.details.reviews.generalReason", "project:struct")
],
transformation_ctx="case_logs_resolveChoice",
)
# case_logs.printSchema()
case_logs_temp.printSchema()
The printSchema output looks exactly like the one above.
Any ideas how to resolve the issue?
The following code works:
case_logs_temp = case_logs.resolveChoice(
    specs=[
        ("logs[].target.details.reviews.generalReason", "project:struct")
    ],
    transformation_ctx="case_logs_resolveChoice",
)
Thanks to the question "In AWS Glue, how do I apply resolveChoice to a struct element within an array in a DynamicFrame?"
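For completeness, the same fix can also be written with the ResolveChoice transform class instead of the DynamicFrame method. A minimal sketch, using the case_logs DynamicFrame from above:
from awsglue.transforms import ResolveChoice

# Equivalent to case_logs.resolveChoice(...); the "[]" in the spec path is what
# descends into the array elements before reaching generalReason.
case_logs_temp = ResolveChoice.apply(
    frame=case_logs,
    specs=[("logs[].target.details.reviews.generalReason", "project:struct")],
    transformation_ctx="case_logs_resolveChoice",
)
case_logs_temp.printSchema()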
I have a table consisting of main ID, subID and two timestamps (start-end).
+-------+---------------------+---------------------+---------------------+
|main_id|sub_id |start_timestamp |end_timestamp |
+-------+---------------------+---------------------+---------------------+
| 1| 1 | 2021/06/01 19:00 | 2021/06/01 19:10 |
| 1| 2 | 2021/06/01 19:01 | 2021/06/01 19:10 |
| 1| 3 | 2021/06/01 19:01 | 2021/06/01 19:05 |
| 1| 3 | 2021/06/01 19:07 | 2021/06/01 19:09 |
My goal is to pick all the records with the same main_id (different sub_ids) and perform a logical AND on the timestamp columns, i.e. find the periods during which all the sub_ids were active.
+-------+---------------------------+---------------------------+
|main_id| global_start | global_stop |
+-------+---------------------------+---------------------------+
| 1| 2021/06/01 19:01 | 2021/06/01 19:05 |
| 1| 2021/06/01 19:07 | 2021/06/01 19:09 |
Thanks!
Partial solution
This kind of logic is probably quite difficult to implement in pure Spark; the built-in functions are not enough for it.
Also, the expected output has 2 rows, while a simple group by main_id would output only one row, so the logic behind it is not trivial.
I would advise you to group your data by main_id and process it with Python, using a UDF.
from pyspark.sql import functions as F

# Agg your data by main_id
df2 = (
df.groupby("main_id", "sub_id")
.agg(
F.collect_list(F.struct("start_timestamp", "end_timestamp")).alias("timestamps")
)
.groupby("main_id")
.agg(F.collect_list(F.struct("sub_id", "timestamps")).alias("data"))
)
df2.show(truncate=False)
+-------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|main_id|data |
+-------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|1 |[[3, [[2021/06/01 19:01, 2021/06/01 19:05], [2021/06/01 19:07, 2021/06/01 19:09]]], [1, [[2021/06/01 19:00, 2021/06/01 19:10]]], [2, [[2021/06/01 19:01, 2021/06/01 19:10]]]]|
+-------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
df2.printSchema()
root
|-- main_id: long (nullable = true)
|-- data: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- sub_id: long (nullable = true)
| | |-- timestamps: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- start_timestamp: string (nullable = true)
| | | | |-- end_timestamp: string (nullable = true)
Once this step is achieved, you can process the data column with Python and perform your logical AND.
@F.udf  # add the required return type depending on what the function returns
def process(data):
"""
data is a complex type (see df2.printSchema()) :
list(dict(
"sub_id": value_of_sub_id,
"timestamps": list(dict(
"start_timestamp": value_of_start_timestamp,
"end_timestamp": value_of_end_timestamp,
))
))
"""
... # implement the "logical AND" here.
df2.select(
"main_id",
process(F.col("data"))
)
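To make the "logical AND" concrete, here is a minimal sketch of what the UDF could look like: it intersects the interval lists of the different sub_ids and returns the periods common to all of them. The timestamp format, the return type and the final explode are my assumptions, not part of the original answer.
from datetime import datetime

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

FMT = "%Y/%m/%d %H:%M"

out_type = ArrayType(StructType([
    StructField("global_start", StringType()),
    StructField("global_stop", StringType()),
]))

@F.udf(out_type)
def process(data):
    # Parse each sub_id's intervals into datetime pairs.
    per_sub = [
        [(datetime.strptime(t["start_timestamp"], FMT),
          datetime.strptime(t["end_timestamp"], FMT)) for t in row["timestamps"]]
        for row in data
    ]
    # Intersect the interval lists pairwise: after each step, result holds the
    # periods covered by every sub_id processed so far.
    result = sorted(per_sub[0])
    for intervals in per_sub[1:]:
        merged = []
        for a_start, a_end in result:
            for b_start, b_end in sorted(intervals):
                start, end = max(a_start, b_start), min(a_end, b_end)
                if start < end:
                    merged.append((start, end))
        result = merged
    return [(s.strftime(FMT), e.strftime(FMT)) for s, e in result]

# One row per common period.
(df2.select("main_id", F.explode(process("data")).alias("period"))
    .select("main_id", "period.global_start", "period.global_stop")
    .show(truncate=False))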
I hope this can help you or others to build a final solution.
I have run out of ideas on how to solve the following issue. A table in the Glue Data Catalog has this schema:
root
|-- _id: string
|-- _field: struct
| |-- ref: choice
| | |-- array
| | | |-- element: struct
| | | | |-- value: null
| | | | |-- key: string
| | | | |-- name: string
| | |-- struct
| | | |-- value: null
| | | |-- key: choice
| | | | |-- int
| | | | |-- string
| | | |-- name: string
If I try to resolve the ref choice using
resolved = df.resolveChoice(
    specs=[("_field.ref", "cast:array")]
)
I lose records.
Any ideas on how I could:
filter the DataFrame on whether _field.ref is an array or struct
convert struct records into an array or vice-versa
I was able to solve my own problem by using
resolved_df = ResolveChoice.apply(df, choice = "make_cols")
This will save the array values in a new ref_array column and the struct values in a new ref_struct column.
This allowed me to split the DataFrame by
resolved_df1 = resolved_df.filter(col("ref_array").isNotNull()).select(col("ref_array").alias("ref"))
resolved_df2 = resolved_df.filter(col("ref_struct").isNotNull()).select(col("ref_struct").alias("ref"))
After either converting the array to structs (using explode()) or wrapping the structs in an array (using array()), recombine them.
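As an illustration of that last step, here is a minimal sketch of one way to recombine the two halves, assuming resolved_df1 and resolved_df2 are the DataFrames from the snippet above and that the element structs are compatible (my sketch, not the author's exact code):
from pyspark.sql import functions as F

# Wrap each struct in a one-element array so the struct side matches the
# array side's schema, then union the two halves back together.
structs_as_arrays = resolved_df2.select(F.array(F.col("ref")).alias("ref"))
recombined = resolved_df1.unionByName(structs_as_arrays)
recombined.printSchema()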
I have the following catalog and want to use AWS Glue to flatten it:
| accountId | resourceId | items |
|-----------|------------|-----------------------------------------------------------------|
| 1 | r1 | {application:{component:[{name: "tool", version: "1.0"}, {name: "app", version: "1.0"}]}} |
| 1 | r2 | {application:{component:[{name: "tool", version: "2.0"}, {name: "app", version: "2.0"}]}} |
| 2 | r3 | {application:{component:[{name: "tool", version: "3.0"}, {name: "app", version: "3.0"}]}} |
Here is my schema
root
|-- accountId:
|-- resourceId:
|-- PeriodId:
|-- items:
| |-- application:
| | |-- component: array
I want to flatten it to the following:
| accountId | resourceId | name | version |
|-----------|------------|------|---------|
| 1 | r1 | tool | 1.0 |
| 1 | r1 | app | 1.0 |
| 1 | r2 | tool | 2.0 |
| 1 | r2 | app | 2.0 |
| 2 | r3 | tool | 3.0 |
| 2 | r3 | app | 3.0 |
From what I understand from your schema and data, yours is a deeply nested structure, so you could explode on items.application.component, and then select your name and version columns from that.
This link might help you understand: https://docs.databricks.com/spark/latest/dataframes-datasets/complex-nested-data.html
from pyspark.sql import functions as F
df.withColumn("items", F.explode(F.col("items.application.component")))\
.select("accountId","resourceId","items.name","items.version").show()
+---------+----------+----+-------+
|accountId|resourceId|name|version|
+---------+----------+----+-------+
| 1| r1|tool| 1.0|
| 1| r1| app| 1.0|
| 1| r2|tool| 2.0|
| 1| r2| app| 2.0|
| 2| r3|tool| 3.0|
| 2| r3| app| 3.0|
+---------+----------+----+-------+
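If the table is read through Glue rather than directly into a Spark DataFrame, one way to wire this in is to convert the DynamicFrame, apply the same explode, and convert back. A minimal sketch, where dyf and glueContext are assumed to come from your Glue job setup:
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql import functions as F

# Convert to a Spark DataFrame, flatten, then go back to a DynamicFrame.
flat_df = (dyf.toDF()
           .withColumn("component", F.explode(F.col("items.application.component")))
           .select("accountId", "resourceId", "component.name", "component.version"))

flat_dyf = DynamicFrame.fromDF(flat_df, glueContext, "flat_dyf")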
I'm new to Scala. I have a dataframe where I am trying to convert one column from string to date, in other words like below:
1) yyyyMMddHHmmss(20150610120256) ->yyyy-MM-dd HH:mm:ss(2015-06-10 12:02:56)
2) yyyyMMddHHmmss(20150611 ) ->yyyy-MM-dd(2015-06-11)
The first case I am able to achieve successfully, but there is a problem with the second case, where the time is missing; because of this I am not able to convert it into a date. More details below. Any help will be appreciated.
df.printSchema
root
|-- TYPE: string (nullable = true)
|-- CODE: string (nullable = true)
|-- SQ_CODE: string (nullable = true)
|-- RE_TYPE: string (nullable = true)
|-- VERY_ID: long (nullable = true)
|-- IN_DATE: string (nullable = true)
df.show
Input
+-----+-------+---------+---------+-------------------+-----------------+
| TYPE| CODE| SQ_CODE| RE_TYPE | VERY_ID| IN_DATE |
+-----+-------+---------+---------+-------------------+-----------------+
| F | 000544| 2017002| OP | 95032015062763298| 20150610120256 |
| F | 000544| 2017002| LD | 95032015062763261| 20150611 |
| F | 000544| 2017002| AK | 95037854336743246| 20150611012356 |
+-----+-------+---------+---------+-------------------+-----------------+
df=df.withColumn("IN_DATE",when(lit(length(regexp_replace(df("IN_DATE"),"\\s+",""))) === lit(8) ,
to_date(from_unixtime(regexp_replace(df("IN_DATE"),"\\s+",""),"yyyyMMdd").cast("date")))
.otherwise(unix_timestamp(df("IN_DATE"),"yyyyMMddHHmmss").cast("timestamp")))
Actual output
+-----+-------+---------+---------+-------------------+----------------------+
| TYPE| CODE| SQ_CODE| RE_TYPE | VERY_ID| IN_DATE |
+-----+-------+---------+---------+-------------------+----------------------+
| F | 000544| 2017002| OP | 95032015062763298| 2015-06-10 12:02:56 |
| F | 000544| 2017002| LD | 95032015062763261| null |
| F | 000544| 2017002| AK | 95037854336743246| 2015-06-11 01:23:56 |
+-----+-------+---------+---------+-------------------+----------------------+
df=df.withColumn("IN_DATE",when(lit(length(regexp_replace(df("IN_DATE"),"\\s+",""))) === lit(8) ,
to_date(from_unixtime(regexp_replace(df("IN_DATE"),"\\s+",""),"yyyyMMdd").cast("timestamp")))
.otherwise(unix_timestamp(df("IN_DATE"),"yyyyMMddHHmmss").cast("timestamp")))
Actual output
+-----+-------+---------+---------+-------------------+----------------------+
| TYPE| CODE| SQ_CODE| RE_TYPE | VERY_ID| IN_DATE |
+-----+-------+---------+---------+-------------------+----------------------+
| F | 000544| 2017002| OP | 95032015062763298| 2015-06-10 12:02:56 |
| F | 000544| 2017002| LD | 95032015062763261| 2015-06-11 00:00:00 |
| F | 000544| 2017002| AK | 95037854336743246| 2015-06-11 01:23:56 |
+-----+-------+---------+---------+-------------------+----------------------+
Expected output
+-----+-------+---------+---------+-------------------+----------------------+
| TYPE| CODE| SQ_CODE| RE_TYPE | VERY_ID| IN_DATE |
+-----+-------+---------+---------+-------------------+----------------------+
| F | 000544| 2017002| OP | 95032015062763298| 2015-06-10 12:02:56 |
| F | 000544| 2017002| LD | 95032015062763261| 2015-06-11 |
| F | 000544| 2017002| AK | 95037854336743246| 2015-06-11 01:23:56 |
+-----+-------+---------+---------+-------------------+----------------------+
I would:
Choose a more precise data type - here TimestampType.
coalesce with different formats.
import org.apache.spark.sql.functions._
import spark.implicits._  // spark is the SparkSession; needed for toDF and the $ syntax

val df = Seq("20150610120256", "20150611").toDF("IN_DATE")

df.withColumn("IN_DATE", coalesce(
  to_timestamp($"IN_DATE", "yyyyMMddHHmmss"),
  to_timestamp($"IN_DATE", "yyyyMMdd"))).show
+-------------------+
| IN_DATE|
+-------------------+
|2015-06-10 12:02:56|
|2015-06-11 00:00:00|
+-------------------+
There are several options to implement a date parser:
Use the built-in Spark SQL function to_date().
Create a user-defined function that does different date parsing based on the input format you like, and returns the string.
The 2015-06-11 format is spark.sql.types.DateType and the 2015-06-10 12:02:56 format is spark.sql.types.TimestampType.
You can't have two data types in the same column; a schema should have only one data type for each column.
I would suggest you create two new columns and put the format you desire in each of them, as follows:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{DateType, TimestampType}
df.withColumn("IN_DATE_DateOnly",from_unixtime(unix_timestamp(df("IN_DATE"),"yyyyMMdd")).cast(DateType))
.withColumn("IN_DATE_DateAndTime",unix_timestamp(df("IN_DATE"),"yyyyMMddHHmmSS").cast(TimestampType))
This will give you a dataframe as:
+----+------+-------+-------+-----------------+--------------+----------------+---------------------+
|TYPE|CODE |SQ_CODE|RE_TYPE|VERY_ID |IN_DATE |IN_DATE_DateOnly|IN_DATE_DateAndTime |
+----+------+-------+-------+-----------------+--------------+----------------+---------------------+
|F |000544|2017002|OP |95032015062763298|20150610120256|null |2015-06-10 12:02:00.0|
|F |000544|2017002|LD |95032015062763261|20150611 |2015-06-11 |null |
|F |000544|2017002|AK |95037854336743246|20150611012356|null |2015-06-11 01:23:00.0|
+----+------+-------+-------+-----------------+--------------+----------------+---------------------+
You can see that the data types are different:
root
|-- TYPE: string (nullable = true)
|-- CODE: string (nullable = true)
|-- SQ_CODE: string (nullable = true)
|-- RE_TYPE: string (nullable = true)
|-- VERY_ID: string (nullable = true)
|-- IN_DATE: string (nullable = true)
|-- IN_DATE_DateOnly: date (nullable = true)
|-- IN_DATE_DateAndTime: timestamp (nullable = true)
I hope the answer is helpful.
Try this query:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{DateType, TimestampType}

df.withColumn("IN_DATE", when(lit(length(regexp_replace(df("IN_DATE"), "\\s+", ""))) === lit(8),
    to_date(from_unixtime(unix_timestamp(regexp_replace(df("IN_DATE"), "\\s+", ""), "yyyyMMdd"))).cast(DateType))
  .otherwise(unix_timestamp(df("IN_DATE"), "yyyyMMddHHmmss").cast(TimestampType)))
I have a DataFrame with the schema below. It is basically an XML file which I have converted to a DataFrame for further processing. I am trying to extract the _Date column, but it looks like some type mismatch is happening.
df1.printSchema
|-- PlayWeek: struct (nullable = true)
| |-- TicketSales: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- PlayDate: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- BoxOfficeDetail: array (nullable = true)
| | | | | | |-- element: struct (containsNull = true)
| | | | | | | |-- VisualFormatCd: struct (nullable = true)
| | | | | | | | |-- Code: struct (nullable = true)
| | | | | | | | | |-- _SequenceId: long (nullable = true)
| | | | | | | | | |-- _VALUE: double (nullable = true)
| | | | | | | |-- _SessionTypeCd: string (nullable = true)
| | | | | | | |-- _TicketPrice: double (nullable = true)
| | | | | | | |-- _TicketQuantity: long (nullable = true)
| | | | | | | |-- _TicketTax: double (nullable = true)
| | | | | | | |-- _TicketTypeCd: string (nullable = true)
| | | | | |-- _Date: string (nullable = true)
| | | |-- _FilmId: long (nullable = true)
| | | |-- _Screen: long (nullable = true)
| | | |-- _TheatreId: long (nullable = true)
| |-- _BusinessEndDate: string (nullable = true)
| |-- _BusinessStartDate: string (nullable = true)
I need to extract the _Date column, but it's throwing the error below:
scala> df1.select(df1.col("PlayWeek.TicketSales.PlayDate._Date")).show()
org.apache.spark.sql.AnalysisException: cannot resolve 'PlayWeek.TicketSales.PlayDate[_Date]' due to data type mismatch: argument 2 requires integral type, however, '_Date' is of string type.;
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:65)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:335)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:335)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:334)
Any help would be appreciated.
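The error comes from trying to reach a field through two levels of arrays by name alone (TicketSales and PlayDate are both arrays), so Spark falls back to array indexing and expects an integer. A minimal sketch of one workaround, shown with the DataFrame API in PySpark as an assumption rather than a tested answer, is to explode each array level before selecting the field:
from pyspark.sql import functions as F

# _Date sits inside PlayWeek.TicketSales (array) -> PlayDate (array), so
# explode each array level first, then select the field by name.
dates = (df1
         .select(F.explode(F.col("PlayWeek.TicketSales")).alias("sale"))
         .select(F.explode(F.col("sale.PlayDate")).alias("playDate"))
         .select(F.col("playDate._Date").alias("Date")))
dates.show()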