Create a Spark DataFrame from a nested array of struct elements? - scala

I have read a JSON file into Spark. This file has the following structure:
root
|-- engagement: struct (nullable = true)
| |-- engagementItems: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- availabilityEngagement: struct (nullable = true)
| | | | |-- dimapraUnit: struct (nullable = true)
| | | | | |-- code: string (nullable = true)
| | | | | |-- constrained: boolean (nullable = true)
| | | | | |-- id: long (nullable = true)
| | | | | |-- label: string (nullable = true)
| | | | | |-- ranking: long (nullable = true)
| | | | | |-- type: string (nullable = true)
| | | | | |-- version: long (nullable = true)
| | | | | |-- visible: boolean (nullable = true)
I created a recursive function that flattens the schema by expanding columns of nested StructType:
def flattenSchema(schema: StructType, prefix: String = null):Array[Column]=
{
schema.fields.flatMap(f => {
val colName = if (prefix == null) f.name else (prefix + "." + f.name)
f.dataType match {
case st: StructType => flattenSchema(st, colName)
case _ => Array(col(colName).alias(colName))
}
})
}
val newDF=SIWINSDF.select(flattenSchema(SIWINSDF.schema):_*)
val secondDF=newDF.toDF(newDF.columns.map(_.replace(".", "_")): _*)
How can I flatten an ArrayType that contains nested StructType elements, for example engagementItems: array (nullable = true)?
Any help is appreciated.

The problem here is that you need to handle the ArrayType case and then treat its element type as a StructType. You can do that with a Scala runtime cast.
First I reproduced the scenario as follows (by the way, including this in your question would be very helpful, since it makes the problem much easier to reproduce):
case class DimapraUnit(code: String, constrained: Boolean, id: Long, label: String, ranking: Long, _type: String, version: Long, visible: Boolean)
case class AvailabilityEngagement(dimapraUnit: DimapraUnit)
case class Element(availabilityEngagement: AvailabilityEngagement)
case class Engagement(engagementItems: Array[Element])
case class root(engagement: Engagement)
def getSchema(): StructType ={
import org.apache.spark.sql.types._
import org.apache.spark.sql.catalyst.ScalaReflection
val schema = ScalaReflection.schemaFor[root].dataType.asInstanceOf[StructType]
schema.printTreeString()
schema
}
This will print out:
root
|-- engagement: struct (nullable = true)
| |-- engagementItems: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- availabilityEngagement: struct (nullable = true)
| | | | |-- dimapraUnit: struct (nullable = true)
| | | | | |-- code: string (nullable = true)
| | | | | |-- constrained: boolean (nullable = false)
| | | | | |-- id: long (nullable = false)
| | | | | |-- label: string (nullable = true)
| | | | | |-- ranking: long (nullable = false)
| | | | | |-- _type: string (nullable = true)
| | | | | |-- version: long (nullable = false)
| | | | | |-- visible: boolean (nullable = false)
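Then I modified your function by adding an extra case for ArrayType and converting its element type to StructType using asInstanceOf (this cast assumes every array holds structs; an array of primitives such as array<string> would throw a ClassCastException):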
import org.apache.spark.sql.types._
def flattenSchema(schema: StructType, prefix: String = null):Array[Column]=
{
schema.fields.flatMap(f => {
val colName = if (prefix == null) f.name else (prefix + "." + f.name)
f.dataType match {
case st: StructType => flattenSchema(st, colName)
case at: ArrayType =>
val st = at.elementType.asInstanceOf[StructType]
flattenSchema(st, colName)
case _ => Array(new Column(colName).alias(colName))
}
})
}
And finally the results:
val s = getSchema()
val res = flattenSchema(s)
res.foreach(println(_))
Output:
engagement.engagementItems.availabilityEngagement.dimapraUnit.code AS `engagement.engagementItems.availabilityEngagement.dimapraUnit.code`
engagement.engagementItems.availabilityEngagement.dimapraUnit.constrained AS `engagement.engagementItems.availabilityEngagement.dimapraUnit.constrained`
engagement.engagementItems.availabilityEngagement.dimapraUnit.id AS `engagement.engagementItems.availabilityEngagement.dimapraUnit.id`
engagement.engagementItems.availabilityEngagement.dimapraUnit.label AS `engagement.engagementItems.availabilityEngagement.dimapraUnit.label`
engagement.engagementItems.availabilityEngagement.dimapraUnit.ranking AS `engagement.engagementItems.availabilityEngagement.dimapraUnit.ranking`
engagement.engagementItems.availabilityEngagement.dimapraUnit._type AS `engagement.engagementItems.availabilityEngagement.dimapraUnit._type`
engagement.engagementItems.availabilityEngagement.dimapraUnit.version AS `engagement.engagementItems.availabilityEngagement.dimapraUnit.version`
engagement.engagementItems.availabilityEngagement.dimapraUnit.visible AS `engagement.engagementItems.availabilityEngagement.dimapraUnit.visible`
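Note that selecting a field through an array of structs returns an array column (one array of values per input row), not one row per element. If one row per array element is wanted, the array has to be exploded before its struct is expanded. The following is only a minimal sketch of that idea: `flattenAll` is a hypothetical helper (not part of Spark), it joins names with underscores, it ignores MapType, and it assumes that exploding every array is acceptable (several sibling arrays would multiply rows).

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, explode_outer}
import org.apache.spark.sql.types.{ArrayType, StructType}

// Hypothetical helper: explode arrays and expand structs, one column at a time,
// until no top-level column is an array or a struct.
def flattenAll(df: DataFrame): DataFrame = {
  val next = df.schema.fields.collectFirst {
    // Array column: one output row per element (explode_outer keeps rows with null/empty arrays)
    case f if f.dataType.isInstanceOf[ArrayType] =>
      df.withColumn(f.name, explode_outer(col(s"`${f.name}`")))
    // Struct column: replace it with one column per field, named parent_child
    case f if f.dataType.isInstanceOf[StructType] =>
      val st = f.dataType.asInstanceOf[StructType]
      val expanded = st.fieldNames.map(n => col(s"`${f.name}`.`$n`").alias(s"${f.name}_$n"))
      val others = df.columns.filter(_ != f.name).map(c => col(s"`$c`"))
      df.select(others ++ expanded: _*)
  }
  // Recurse until nothing is left to flatten
  next.map(flattenAll).getOrElse(df)
}
```

With the schema from the question, `flattenAll(SIWINSDF)` would be expected to yield columns such as `engagement_engagementItems_availabilityEngagement_dimapraUnit_code`, with one row per engagement item.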

Related

parsing complex nested json in Spark scala

I have a complex JSON with the schema below, which I need to convert to a DataFrame in Spark. Since the schema is complex, I am unable to do it completely.
The JSON file has a very complex schema, and using explode with column select might be problematic.
Below is the schema I am trying to convert:
root
|-- data: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
|-- meta: struct (nullable = true)
| |-- view: struct (nullable = true)
| | |-- approvals: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- reviewedAt: long (nullable = true)
| | | | |-- reviewedAutomatically: boolean (nullable = true)
| | | | |-- state: string (nullable = true)
| | | | |-- submissionDetails: struct (nullable = true)
| | | | | |-- permissionType: string (nullable =
I have used the code below to flatten the data, but there is still nested data that I need to flatten into columns:
def flattenStructSchema(schema: StructType, prefix: String = null) : Array[Column] = {
schema.fields.flatMap(f => {
val columnName = if (prefix == null)
f.name else (prefix + "." + f.name)
f.dataType match {
case st: StructType => flattenStructSchema(st, columnName)
case _ => Array(col(columnName).as(columnName.replace(".","_")))
}
})
}
val df2 = df.select(col("meta"))
val df4 = df.select(col("data"))
val df3 = df2.select(flattenStructSchema(df2.schema):_*) // no .show() here: show() returns Unit, so df3 would not be a DataFrame
df3.printSchema()
df3.show(10,false)

Apply a function to a column inside a structure of a Spark DataFrame, replacing that column

I cannot find exactly what I am looking for, so here is my question. I fetch some data from MongoDB into a Spark DataFrame. The DataFrame has the following schema (df.printSchema):
|-- flight: struct (nullable = true)
| |-- legs: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- arrival: timestamp (nullable = true)
| | | |-- departure: timestamp (nullable = true)
| |-- segments: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- arrival: timestamp (nullable = true)
| | | |-- departure: timestamp (nullable = true)
Do note the top-level structure, followed by an array, inside which I need to change my data.
For example:
{
"flight": {
"legs": [{
"departure": ISODate("2020-10-30T13:35:00.000Z"),
"arrival": ISODate("2020-10-30T14:47:00.000Z")
}
],
"segments": [{
"departure": ISODate("2020-10-30T13:35:00.000Z"),
"arrival": ISODate("2020-10-30T14:47:00.000Z")
}
]
}
}
I want to export this in Json, but for some business reason, I want the arrival dates to have a different format than the departure dates. For example, I may want to export the departure ISODate in ms from epoch, but not the arrival one.
To do so, I thought of applying a custom function to do the transformation:
// Here I can do any transformation. I hope to replace the timestamp with the needed value
val doSomething: UserDefinedFunction = udf( (value: Seq[Timestamp]) => {
value.map(x => "doSomething" + x.getTime) }
)
val newDf = df.withColumn("flight.legs.departure",
doSomething(df.col("flight.legs.departure")))
But this simply returns a brand new column, containing an array of a single doSomething string.
{
"flight": {
"legs": [{
"arrival": "2020-10-30T14:47:00Z",
"departure": "2020-10-30T13:35:00Z"
}
],
"segments": [{
"arrival": "2020-10-30T14:47:00Z",
"departure": "2020-10-30T13:35:00Z",
}
]
},
"flight.legs.departure": ["doSomething1596268800000"]
}
And newDf.show(1)
+--------------------+---------------------+
| flight|flight.legs.departure|
+--------------------+---------------------+
|[[[182], 94, [202...| [doSomething15962...|
+--------------------+---------------------+
Instead of
{
...
"arrival": "2020-10-30T14:47:00Z",
//leg departure date that I changed
"departure": "doSomething1596268800000"
... // segments not affected in this example
"arrival": "2020-10-30T14:47:00Z",
"departure": "2020-10-30T13:35:00Z",
...
}
Any ideas how to proceed?
Edit - clarification:
Please bear in mind that my schema is far more complex than what is shown above. For example, there is yet another top-level data tag, so flight sits below it along with other information. Then inside flight, legs and segments there are many more elements, some of which are also nested. I only focused on the ones that I needed to change.
I am saying this because I would like the simplest solution that would scale, i.e. ideally one that simply changes the required elements without having to deconstruct and then reconstruct the whole nested structure. If we cannot avoid that, is using case classes the simplest solution?
Please check the code below.
Execution Time
With UDF : Time taken: 679 ms
Without UDF : Time taken: 1493 ms
Code With UDF
scala> :paste
// Entering paste mode (ctrl-D to finish)
// Creating UDF to update value inside array.
import java.text.SimpleDateFormat
val dateFormat = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss") // For me the departure values are strings, so this parses them into dates. HH is the 24-hour field (hh is 12-hour and only works here through lenient parsing).
val doSomething = udf((value: Seq[String]) => {
value.map(x => s"dosomething${dateFormat.parse(x).getTime}")
})
// Exiting paste mode, now interpreting.
import java.text.SimpleDateFormat
dateFormat: java.text.SimpleDateFormat = java.text.SimpleDateFormat@41bd83a
doSomething: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,ArrayType(StringType,true),Some(List(ArrayType(StringType,true))))
scala> :paste
// Entering paste mode (ctrl-D to finish)
spark.time {
val updated = df.select("flight.*").withColumn("legs",arrays_zip($"legs.arrival",doSomething($"legs.departure")).cast("array<struct<arrival:string,departure:string>>")).select(struct($"segments",$"legs").as("flight"))
updated.printSchema
updated.show(false)
}
// Exiting paste mode, now interpreting.
root
|-- flight: struct (nullable = false)
| |-- segments: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- arrival: string (nullable = true)
| | | |-- departure: string (nullable = true)
| |-- legs: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- arrival: string (nullable = true)
| | | |-- departure: string (nullable = true)
+-------------------------------------------------------------------------------------------------+
|flight |
+-------------------------------------------------------------------------------------------------+
|[[[2020-10-30T14:47:00, 2020-10-30T13:35:00]], [[2020-10-30T14:47:00, dosomething1604045100000]]]|
+-------------------------------------------------------------------------------------------------+
Time taken: 679 ms
scala>
Code Without UDF
scala> val df = spark.read.json(Seq("""{"flight": {"legs": [{"departure": "2020-10-30T13:35:00","arrival": "2020-10-30T14:47:00"}],"segments": [{"departure": "2020-10-30T13:35:00","arrival": "2020-10-30T14:47:00"}]}}""").toDS)
df: org.apache.spark.sql.DataFrame = [flight: struct<legs: array<struct<arrival:string,departure:string>>, segments: array<struct<arrival:string,departure:string>>>]
scala> df.printSchema
root
|-- flight: struct (nullable = true)
| |-- legs: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- arrival: string (nullable = true)
| | | |-- departure: string (nullable = true)
| |-- segments: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- arrival: string (nullable = true)
| | | |-- departure: string (nullable = true)
scala> df.show(false)
+--------------------------------------------------------------------------------------------+
|flight |
+--------------------------------------------------------------------------------------------+
|[[[2020-10-30T14:47:00, 2020-10-30T13:35:00]], [[2020-10-30T14:47:00, 2020-10-30T13:35:00]]]|
+--------------------------------------------------------------------------------------------+
scala> :paste
// Entering paste mode (ctrl-D to finish)
spark.time {
val updated= df
.select("flight.*")
.select($"segments",$"legs.arrival",$"legs.departure") // extracting legs struct column values.
.withColumn("departure",explode($"departure")) // exploding departure column
.withColumn("departure",concat_ws("-",lit("something"),$"departure".cast("timestamp").cast("long"))) // updating departure column values
.groupBy($"segments",$"arrival") // grouping columns except legs column
.agg(collect_list($"departure").as("departure")) // constructing list back
.select($"segments",arrays_zip($"arrival",$"departure").as("legs")) // construction arrival & departure columns using arrays_zip method.
.select(struct($"legs",$"segments").as("flight")) // finally creating flight by combining legs & segments columns.
updated.printSchema
updated.show(false)
}
// Exiting paste mode, now interpreting.
root
|-- flight: struct (nullable = false)
| |-- legs: array (nullable = true)
| | |-- element: struct (containsNull = false)
| | | |-- arrival: string (nullable = true)
| | | |-- departure: string (nullable = true)
| |-- segments: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- arrival: string (nullable = true)
| | | |-- departure: string (nullable = true)
+---------------------------------------------------------------------------------------------+
|flight |
+---------------------------------------------------------------------------------------------+
|[[[2020-10-30T14:47:00, something-1604045100]], [[2020-10-30T14:47:00, 2020-10-30T13:35:00]]]|
+---------------------------------------------------------------------------------------------+
Time taken: 1493 ms
scala>
Try this
scala> df.show(false)
+----------------------------------------------------------------------------------------------------------------+
|flight |
+----------------------------------------------------------------------------------------------------------------+
|[[[2020-10-30T13:35:00.000Z, 2020-10-30T14:47:00.000Z]], [[2020-10-30T13:35:00.000Z, 2020-10-30T14:47:00.000Z]]]|
|[[[2020-10-25T13:15:00.000Z, 2020-10-25T14:37:00.000Z]], [[2020-10-25T13:15:00.000Z, 2020-10-25T14:37:00.000Z]]]|
+----------------------------------------------------------------------------------------------------------------+
scala>
scala> df.printSchema
root
|-- flight: struct (nullable = true)
| |-- legs: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- dep: string (nullable = true)
| | | |-- arr: string (nullable = true)
| |-- segments: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- dep: string (nullable = true)
| | | |-- arr: string (nullable = true)
scala>
scala> val myudf = udf(
| (arrs:Seq[String]) => {
| arrs.map("something" ++ _)
| }
| )
myudf: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,ArrayType(StringType,true),Some(List(ArrayType(StringType,true))))
scala> val df2 = df.select($"flight", myudf($"flight.legs.arr") as "editedArrs")
df2: org.apache.spark.sql.DataFrame = [flight: struct<legs: array<struct<dep:string,arr:string>>, segments: array<struct<dep:string,arr:string>>>, editedArrs: array<string>]
scala> df2.printSchema
root
|-- flight: struct (nullable = true)
| |-- legs: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- dep: string (nullable = true)
| | | |-- arr: string (nullable = true)
| |-- segments: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- dep: string (nullable = true)
| | | |-- arr: string (nullable = true)
|-- editedArrs: array (nullable = true)
| |-- element: string (containsNull = true)
scala> df2.show(false)
+----------------------------------------------------------------------------------------------------------------+-----------------------------------+
|flight |editedArrs |
+----------------------------------------------------------------------------------------------------------------+-----------------------------------+
|[[[2020-10-30T13:35:00.000Z, 2020-10-30T14:47:00.000Z]], [[2020-10-30T13:35:00.000Z, 2020-10-30T14:47:00.000Z]]]|[something2020-10-30T14:47:00.000Z]|
|[[[2020-10-25T13:15:00.000Z, 2020-10-25T14:37:00.000Z]], [[2020-10-25T13:15:00.000Z, 2020-10-25T14:37:00.000Z]]]|[something2020-10-25T14:37:00.000Z]|
+----------------------------------------------------------------------------------------------------------------+-----------------------------------+
scala>
scala>
scala> val df3 = df2.select(struct(arrays_zip($"flight.legs.dep", $"editedArrs") cast "array<struct<dep:string,arr:string>>" as "legs", $"flight.segments") as "flight")
df3: org.apache.spark.sql.DataFrame = [flight: struct<legs: array<struct<dep:string,arr:string>>, segments: array<struct<dep:string,arr:string>>>]
scala>
scala> df3.printSchema
root
|-- flight: struct (nullable = false)
| |-- legs: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- dep: string (nullable = true)
| | | |-- arr: string (nullable = true)
| |-- segments: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- dep: string (nullable = true)
| | | |-- arr: string (nullable = true)
scala>
scala> df3.show(false)
+-------------------------------------------------------------------------------------------------------------------------+
|flight |
+-------------------------------------------------------------------------------------------------------------------------+
|[[[2020-10-30T13:35:00.000Z, something2020-10-30T14:47:00.000Z]], [[2020-10-30T13:35:00.000Z, 2020-10-30T14:47:00.000Z]]]|
|[[[2020-10-25T13:15:00.000Z, something2020-10-25T14:37:00.000Z]], [[2020-10-25T13:15:00.000Z, 2020-10-25T14:37:00.000Z]]]|
+-------------------------------------------------------------------------------------------------------------------------+
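Both answers above rebuild the legs array with arrays_zip. On newer Spark versions the field can instead be rewritten in place, which may fit the "change only the required elements" requirement from the question better. The sketch below assumes Spark 3.1 or later (`functions.transform` taking a Scala lambda was added in 3.0 and `Column.withField` in 3.1); `rewriteLegDepartures` is a hypothetical name, and the "doSomething" prefix plus epoch seconds merely stand in for whatever transformation is actually needed.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, concat, lit, transform}

// Assumption: Spark 3.1+ (functions.transform with a lambda, Column.withField).
// Replaces flight.legs[*].departure and leaves every other field untouched.
def rewriteLegDepartures(df: DataFrame): DataFrame =
  df.withColumn(
    "flight",
    col("flight").withField(
      "legs",
      transform(
        col("flight.legs"),
        // For each leg struct, overwrite only the departure field.
        // Casting a timestamp to long yields epoch seconds (not milliseconds).
        leg => leg.withField(
          "departure",
          concat(
            lit("doSomething"),
            leg.getField("departure").cast("timestamp").cast("long").cast("string"))))))
```

Because `withField` replaces a single field of the struct, other fields of flight (such as segments) and other fields of each leg should pass through unchanged, however many there are.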

how to explode a dataframe schema in databricks

I have a schema that should be exploded, below is the schema
|-- CaseNumber: string (nullable = true)
|-- Customers: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Contacts: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- FirstName: string (nullable = true)
| | | | |-- LastName: string (nullable = true)
I want my schema to be like this,
|-- CaseNumber: string (nullable = true)
|-- FirstName: string (nullable = true)
|-- LastName: string (nullable = true)
or
+----------+---------+--------+
|CaseNumber|FirstName|LastName|
+----------+---------+--------+
|         1|       aa|      bb|
|         2|       cc|      dd|
+----------+---------+--------+
I am new to Databricks; any help would be appreciated. Thanks.
Here is one way to solve it without using explode command -
// Case classes mirroring the schema (field names must match the column names)
case class Contact(FirstName: String, LastName: String)
case class Customer(Contacts: Seq[Contact])
case class MyCase(CaseNumber: String, Customers: Seq[Customer])
// Requires import spark.implicits._ for the encoders
val dataset = dataframe.as[MyCase]
// One output row per contact, repeating the case number; assumes no null arrays
val flat = dataset.flatMap { mycase =>
  for {
    customer <- mycase.Customers
    contact <- customer.Contacts
  } yield (mycase.CaseNumber, contact.FirstName, contact.LastName)
}.toDF("CaseNumber", "FirstName", "LastName")
I think you can still do explode(customersFlat.contacts). I'm sure I did something like this a while ago, so forgive my syntax and let me know whether this works:
df.select(col("caseNumber"), explode(col("customersFlat.contacts")).as("contacts")).select("caseNumber", "contacts.firstName", "contacts.lastName")
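For the schema exactly as posted in the question, the explode route could look roughly like the sketch below. It assumes the column names Customers, Contacts, FirstName and LastName from the printed schema; `flattenCases` is a hypothetical name. Each explode turns one array level into rows, so two are needed for the two nested arrays.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, explode}

// Sketch: one output row per contact, carrying the case number along.
def flattenCases(df: DataFrame): DataFrame =
  df
    // One row per customer
    .select(col("CaseNumber"), explode(col("Customers")).as("customer"))
    // One row per contact of each customer
    .select(col("CaseNumber"), explode(col("customer.Contacts")).as("contact"))
    // Pull the leaf fields up to top-level columns
    .select(col("CaseNumber"), col("contact.FirstName"), col("contact.LastName"))
```

Note that explode drops rows whose array is null or empty; explode_outer would keep them with null names instead.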

Exploding nested df columns in Spark Scala

Column name is 'col1' and is of the form:
col1: array (nullable = true)
| |-- A1: struct (containsNull = true)
| | |-- B0: struct (nullable = true)
| | | |-- B01: string (nullable = true)
| | | |-- B02: string (nullable = true)
| | |-- B1: string (nullable = true)
| | |-- B2: string (nullable = true)
| | |-- B3: string (nullable = true)
| | |-- B4: string (nullable = true)
| | |-- B5: string (nullable = true)
I am trying two things; the first is to fetch the value B2. Code:
val explodeDF = test_df.explode($"col1") { case Row(col1_details:Array[String]) =>
col1_details:Array.map{ col1_details:Array =>
val firstName = col1_details:Array(2).asInstanceOf[String]
val lastName = col1_details:Array(3).asInstanceOf[String]
val email = col1_details:Array(4).asInstanceOf[String]
val salary = col1_details:Array(5).asInstanceOf[String]
notes_details(firstName, lastName, email, salary)
}
}
Error:
error: too many arguments for method apply: (index: Int)Char in class StringOps
col1_details(firstName, lastName, email, salary)
I have tried various snippets and have been getting different errors. Any suggestions on what the mistake might be would be very helpful.
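One way to fetch B2 without pattern matching on Row is to explode the array with the column-based API and then select the struct fields by name. This is only a sketch under the assumption that col1 is an array of structs with the fields shown in the schema above; `extractB2` is a hypothetical name.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, explode}

// Sketch: one row per element of col1, then select fields of the struct by name.
def extractB2(df: DataFrame): DataFrame =
  df
    .select(explode(col("col1")).as("a1"))
    // Nested fields use dot notation: a1.B0.B01 reaches into the inner struct
    .select(col("a1.B2").as("B2"), col("a1.B0.B01").as("B01"))
```

Without the explode, `df.select(col("col1.B2"))` would return an array of B2 values per input row rather than one row per element.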

How to access elements in a ArrayType in a writeStream?

I am building up a schema to accept some data streaming in. It has an ArrayType with some elements. Here is my StructType with the ArrayType:
val innerBody = StructType(
StructField("value", LongType, false) ::
StructField("spent", BooleanType, false) ::
StructField("tx_index", LongType, false) :: Nil)
val prev_out = StructType(StructField("prev_out", innerBody, false) :: Nil)
val body = StructType(
StructField("inputs", ArrayType(prev_out, false), false) ::
StructField("out", ArrayType(innerBody, false), false) :: Nil)
val schema = StructType(StructField("x", body, false) :: Nil)
This builds a schema like:
root
|-- bit: struct (nullable = true)
| |-- x: struct (nullable = false)
| | |-- inputs: array (nullable = false)
| | | |-- element: struct (containsNull = false)
| | | | |-- prev_out: struct (nullable = false)
| | | | | |-- value: long (nullable = false)
| | | | | |-- spent: boolean (nullable = false)
| | | | | |-- tx_index: long (nullable = false)
| | |-- out: array (nullable = false)
| | | |-- element: struct (containsNull = false)
| | | | |-- value: long (nullable = false)
| | | | |-- spent: boolean (nullable = false)
| | | | |-- tx_index: long (nullable = false)
I am trying to select the value from the "value element" in schema as it is streaming in. I am using the writeStream sink.
val parsed = df.select("bit.x.inputs.element.prev_out.value")
.writeStream.format("console").start()
I have the code above, but it gives an error.
Message: cannot resolve 'bit.x.inputs.element.prev_out.value' given
input columns: [key, value, timestamp, partition, offset,
timestampType, topic];;
How can I access the "value" element in this schema?
If you have a DataFrame like this, an explode followed by a select will help.
df.printSchema()
//root
//|-- bit: struct (nullable = true)
//| |-- x: struct (nullable = true)
//| | |-- inputs: array (nullable = true)
//| | | |-- element: struct (containsNull = true)
//| | | | |-- prev_out: struct (nullable = true)
//| | | | | |-- spent: boolean (nullable = true)
//| | | | | |-- tx_infex: long (nullable = true)
//| | | | | |-- value: long (nullable = true)
import org.apache.spark.sql.functions._
val intermediateDf: DataFrame = df.select(explode(col("bit.x.inputs")).as("interCol"))
intermediateDf.printSchema()
//root
//|-- interCol: struct (nullable = true)
//| |-- prev_out: struct (nullable = true)
//| | |-- spent: boolean (nullable = true)
//| | |-- tx_infex: long (nullable = true)
//| | |-- value: long (nullable = true)
val finalDf: DataFrame = intermediateDf.select(col("interCol.prev_out.value").as("value"))
finalDf.printSchema()
//root
//|-- value: long (nullable = true)
finalDf.show()
//+-----------+
//| value|
//+-----------+
//|12347628746|
//|12347628746|
//+-----------+
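The error in the question lists the input columns as key, value, timestamp, partition, offset, timestampType and topic, which suggests the stream is still the raw source DataFrame (these are the columns a Kafka source produces) and the schema has not been applied yet. Also, "element" in the printed schema is only how printSchema labels array items; it is not a field name and cannot appear in a column path. A minimal sketch, assuming the payload is JSON in the value column and that `schema` is the StructType built in the question; `parseAndSelect` is a hypothetical name:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, explode, from_json}
import org.apache.spark.sql.types.StructType

// Sketch: parse the raw payload, then explode the inputs array and select the field.
def parseAndSelect(rawDf: DataFrame, schema: StructType): DataFrame =
  rawDf
    // Assumption: value holds a JSON document matching `schema`
    .select(from_json(col("value").cast("string"), schema).as("bit"))
    // One row per element of the inputs array
    .select(explode(col("bit.x.inputs")).as("input"))
    .select(col("input.prev_out.value").as("value"))
```

On a streaming DataFrame the result can then be written as in the question, for example `parseAndSelect(df, schema).writeStream.format("console").start()`.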