Iterate dataframe column Array of Array in Spark Scala - scala

I am trying to iterate over an array of array as a column in Spark dataframe. Looking for the best way to do this.
Schema:
root
|-- Animal: struct (nullable = true)
| |-- Species: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- mammal: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- description: string (nullable = true)
Currently I am using this logic. This only gets the first array.
df.select(
col("Animal.Species").getItem(0).getItem("mammal").getItem("description")
)
Pseudo Logic:
col("Animal.Species").getItem(0).getItem("mammal").getItem("description")
+
col("Animal.Species").getItem(1).getItem("mammal").getItem("description")
+
col("Animal.Species").getItem(2).getItem("mammal").getItem("description")
+
col("Animal.Species").getItem(...).getItem("mammal").getItem("description")
Desired Example Output (flattened elements as string)
llama, sheep, rabbit, hare

Not obvious, but you can use . (or the getField method of Column) to select "through" arrays of structs. Selecting Animal.Species.mammal returns an array of array of the innermost structs. Unfortunately, this array of array prevents you from being able to drill further down with something like Animal.Species.mammal.description, so you need to flatten it first, then use getField().
If I understand your schema correctly, the following JSON should be a valid input:
{
"Animal": {
"Species": [
{
"mammal": [
{ "description": "llama" },
{ "description": "sheep" }
]
},
{
"mammal": [
{ "description": "rabbit" },
{ "description": "hare" }
]
}
]
}
}
val df = spark.read.json("data.json")
df.printSchema
// root
// |-- Animal: struct (nullable = true)
// | |-- Species: array (nullable = true)
// | | |-- element: struct (containsNull = true)
// | | | |-- mammal: array (nullable = true)
// | | | | |-- element: struct (containsNull = true)
// | | | | | |-- description: string (nullable = true)
df.select("Animal.Species.mammal").show(false)
// +----------------------------------------+
// |mammal |
// +----------------------------------------+
// |[[{llama}, {sheep}], [{rabbit}, {hare}]]|
// +----------------------------------------+
df.select(flatten(col("Animal.Species.mammal"))).show(false)
// +------------------------------------+
// |flatten(Animal.Species.mammal) |
// +------------------------------------+
// |[{llama}, {sheep}, {rabbit}, {hare}]|
// +------------------------------------+
This is now an array of structs and you can use getField("description") to obtain the array of interest:
df.select(flatten(col("Animal.Species.mammal")).getField("description")).show(false)
// +--------------------------------------------------------+
// |flatten(Animal.Species.mammal AS mammal#173).description|
// +--------------------------------------------------------+
// |[llama, sheep, rabbit, hare] |
// +--------------------------------------------------------+
Finally, array_join with separator ", " can be used to obtain the desired string:
df.select(
array_join(
flatten(col("Animal.Species.mammal")).getField("description"),
", "
) as "animals"
).show(false)
// +--------------------------+
// |animals |
// +--------------------------+
// |llama, sheep, rabbit, hare|
// +--------------------------+

You can apply explode two times: first time on Animal.Species and second time on the result of the first time:
import org.apache.spark.sql.functions._
df.withColumn("tmp", explode(col("Animal.Species")))
.withColumn("tmp", explode(col("tmp.mammal")))
.select("tmp.description")
.show()

Related

parsing complex nested json in Spark scala

I am having a complex json with below schema which i need to convert to a dataframe in spark. Since the schema is compex I am unable to do it completely.
The Json file has a very complex schema and using explode with column select might be problematic
Below is the schema which I am trying to convert:
root
|-- data: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
|-- meta: struct (nullable = true)
| |-- view: struct (nullable = true)
| | |-- approvals: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- reviewedAt: long (nullable = true)
| | | | |-- reviewedAutomatically: boolean (nullable = true)
| | | | |-- state: string (nullable = true)
| | | | |-- submissionDetails: struct (nullable = true)
| | | | | |-- permissionType: string (nullable =
I have used the below code to flatten the data but still there nested data which i need to flatten into columns:
def flattenStructSchema(schema: StructType, prefix: String = null) : Array[Column] = {
schema.fields.flatMap(f => {
val columnName = if (prefix == null)
f.name else (prefix + "." + f.name)
f.dataType match {
case st: StructType => flattenStructSchema(st, columnName)
case _ => Array(col(columnName).as(columnName.replace(".","_")))
}
})
}
val df2 = df.select(col("meta"))
val df4 = df.select(col("data"))
val df3 = df2.select(flattenStructSchema(df2.schema):_*).show()
df3.printSchema()
df3.show(10,false)

Pyspark turn a list to a dictionary in a specific column

I have a spark dataframe that looks like this in json;
{
"site_id": "ABC",
"region": "Texas",
"areas": [
{
"Carbon": [
"ABB",
"ABD",
"ABE"
]
}
],
"site_name": "ABC"
}
and I need to turn "areas" column to this;
"areas":
[
{
"area_name": "Carbon",
"pls": [
{
"pl_name": "ABB"
},
{
"pl_name": "ABD"
},
{
"pl_name": "ABE"
}
]
}
]
I already did df.collect() and manipulated the dictionary directly, but that created some complexity. Is there a way to do this directly in the dataframe itself?
Edit:
Here is the input schema
|-- site_id: string
|-- region: string
|-- site_name: string
|-- areas: array
| |-- element: map
| | |-- keyType: string
| | |-- valueType: array
| | | |-- element: string
in the output schema, the goal is to have the valueType also a dictionary. I actually save the data to a dynamodb table so the output should like the example I provided when scanned from the table.
Processing and producing JSON is not Spark's strength as far as my understanding. The easiest approach (that is not flattening then grouping by then collecting then pivoting etc) is using UDF. I completely understand UDF is not as fast as built-in Spark transformation, but if your data scale is not that big then it shouldn't be a problem.
def transform_json(arr):
r = []
for e in arr:
for k in e.keys():
r.append({
'area_name': k,
'pls': [{'pl_name': i} for i in e[k]]
})
return r
(df
.withColumn('areas', F.udf(
transform_json,
T.ArrayType(T.StructType([
T.StructField('area_name', T.StringType()),
T.StructField('pls', T.ArrayType(T.StructType([
T.StructField('pl_name', T.StringType())
]))),
])
))('areas')
)
.show(10, False)
)
# Output
# +------------------------------------------------------------------+
# |areas |
# +------------------------------------------------------------------+
# |[{Carbon, [{ABB}, {ABD}, {ABE}]}, {Oxygen, [{ABB}, {ABD}, {ABE}]}]|
# +------------------------------------------------------------------+
# Schema
# root
# |-- areas: array (nullable = true)
# | |-- element: struct (containsNull = true)
# | | |-- area_name: string (nullable = true)
# | | |-- pls: array (nullable = true)
# | | | |-- element: struct (containsNull = true)
# | | | | |-- pl_name: string (nullable = true)

Apply a function to a column inside a structure of a Spark DataFrame, replacing that column

I cannot find exactly what I am looking for, so here it is my question. I fetch from MongoDb some data into a Spark Dataframe. The dataframe has the following schema (df.printSchema):
|-- flight: struct (nullable = true)
| |-- legs: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- arrival: timestamp (nullable = true)
| | | |-- departure: timestamp (nullable = true)
| |-- segments: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- arrival: timestamp (nullable = true)
| | | |-- departure: timestamp (nullable = true)
Do note the top-level structure, followed by an array, inside which I need to change my data.
For example:
{
"flight": {
"legs": [{
"departure": ISODate("2020-10-30T13:35:00.000Z"),
"arrival": ISODate("2020-10-30T14:47:00.000Z")
}
],
"segments": [{
"departure": ISODate("2020-10-30T13:35:00.000Z"),
"arrival": ISODate("2020-10-30T14:47:00.000Z")
}
]
}
}
I want to export this in Json, but for some business reason, I want the arrival dates to have a different format than the departure dates. For example, I may want to export the departure ISODate in ms from epoch, but not the arrival one.
To do so, I thought of applying a custom function to do the transformation:
// Here I can do any tranformation. I hope to replace the timestamp with the needed value
val doSomething: UserDefinedFunction = udf( (value: Seq[Timestamp]) => {
value.map(x => "doSomething" + x.getTime) }
)
val newDf = df.withColumn("flight.legs.departure",
doSomething(df.col("flight.legs.departure")))
But this simply returns a brand new column, containing an array of a single doSomething string.
{
"flight": {
"legs": [{
"arrival": "2020-10-30T14:47:00Z",
"departure": "2020-10-30T13:35:00Z"
}
],
"segments": [{
"arrival": "2020-10-30T14:47:00Z",
"departure": "2020-10-30T13:35:00Z",
}
]
},
"flight.legs.departure": ["doSomething1596268800000"]
}
And newDf.show(1)
+--------------------+---------------------+
| flight|flight.legs.departure|
+--------------------+---------------------+
|[[[182], 94, [202...| [doSomething15962...|
+--------------------+---------------------+
Instead of
{
...
"arrival": "2020-10-30T14:47:00Z",
//leg departure date that I changed
"departure": "doSomething1596268800000"
... // segments not affected in this example
"arrival": "2020-10-30T14:47:00Z",
"departure": "2020-10-30T13:35:00Z",
...
}
Any ideas how to proceed?
Edit - clarification:
Please bear in mind that my schema is way more complex than what shown above. For example, there is yet another top level data tag, so flight is below along with other information. Then inside flight, legs and segments there are multiple more elements, some that are also nested. I only focused on the ones that I needed to change.
I am saying this, because I would like the simplest solution that would scale. I.e. ideally one that would simply change the required elements without having to de-construct and that re-construct the whole nested structure. If we cannot avoid that, is using case classes the simplest solution?
Please check the code below.
Execution Time
With UDF : Time taken: 679 ms
Without UDF : Time taken: 1493 ms
Code With UDF
scala> :paste
// Entering paste mode (ctrl-D to finish)
// Creating UDF to update value inside array.
import java.text.SimpleDateFormat
val dateFormat = new SimpleDateFormat("yyyy-MM-dd'T'hh:mm:ss") // For me departure values are in string, so using this to convert sql timestmap.
val doSomething = udf((value: Seq[String]) => {
value.map(x => s"dosomething${dateFormat.parse(x).getTime}")
})
// Exiting paste mode, now interpreting.
import java.text.SimpleDateFormat
dateFormat: java.text.SimpleDateFormat = java.text.SimpleDateFormat#41bd83a
doSomething: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,ArrayType(StringType,true),Some(List(ArrayType(StringType,true))))
scala> :paste
// Entering paste mode (ctrl-D to finish)
spark.time {
val updated = df.select("flight.*").withColumn("legs",arrays_zip($"legs.arrival",doSomething($"legs.departure")).cast("array<struct<arrival:string,departure:string>>")).select(struct($"segments",$"legs").as("flight"))
updated.printSchema
updated.show(false)
}
// Exiting paste mode, now interpreting.
root
|-- flight: struct (nullable = false)
| |-- segments: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- arrival: string (nullable = true)
| | | |-- departure: string (nullable = true)
| |-- legs: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- arrival: string (nullable = true)
| | | |-- departure: string (nullable = true)
+-------------------------------------------------------------------------------------------------+
|flight |
+-------------------------------------------------------------------------------------------------+
|[[[2020-10-30T14:47:00, 2020-10-30T13:35:00]], [[2020-10-30T14:47:00, dosomething1604045100000]]]|
+-------------------------------------------------------------------------------------------------+
Time taken: 679 ms
scala>
Code Without UDF
scala> val df = spark.read.json(Seq("""{"flight": {"legs": [{"departure": "2020-10-30T13:35:00","arrival": "2020-10-30T14:47:00"}],"segments": [{"departure": "2020-10-30T13:35:00","arrival": "2020-10-30T14:47:00"}]}}""").toDS)
df: org.apache.spark.sql.DataFrame = [flight: struct<legs: array<struct<arrival:string,departure:string>>, segments: array<struct<arrival:string,departure:string>>>]
scala> df.printSchema
root
|-- flight: struct (nullable = true)
| |-- legs: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- arrival: string (nullable = true)
| | | |-- departure: string (nullable = true)
| |-- segments: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- arrival: string (nullable = true)
| | | |-- departure: string (nullable = true)
scala> df.show(false)
+--------------------------------------------------------------------------------------------+
|flight |
+--------------------------------------------------------------------------------------------+
|[[[2020-10-30T14:47:00, 2020-10-30T13:35:00]], [[2020-10-30T14:47:00, 2020-10-30T13:35:00]]]|
+--------------------------------------------------------------------------------------------+
scala> :paste
// Entering paste mode (ctrl-D to finish)
spark.time {
val updated= df
.select("flight.*")
.select($"segments",$"legs.arrival",$"legs.departure") // extracting legs struct column values.
.withColumn("departure",explode($"departure")) // exploding departure column
.withColumn("departure",concat_ws("-",lit("something"),$"departure".cast("timestamp").cast("long"))) // updating departure column values
.groupBy($"segments",$"arrival") // grouping columns except legs column
.agg(collect_list($"departure").as("departure")) // constructing list back
.select($"segments",arrays_zip($"arrival",$"departure").as("legs")) // construction arrival & departure columns using arrays_zip method.
.select(struct($"legs",$"segments").as("flight")) // finally creating flight by combining legs & segments columns.
updated.printSchema
updated.show(false)
}
// Exiting paste mode, now interpreting.
root
|-- flight: struct (nullable = false)
| |-- legs: array (nullable = true)
| | |-- element: struct (containsNull = false)
| | | |-- arrival: string (nullable = true)
| | | |-- departure: string (nullable = true)
| |-- segments: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- arrival: string (nullable = true)
| | | |-- departure: string (nullable = true)
+---------------------------------------------------------------------------------------------+
|flight |
+---------------------------------------------------------------------------------------------+
|[[[2020-10-30T14:47:00, something-1604045100]], [[2020-10-30T14:47:00, 2020-10-30T13:35:00]]]|
+---------------------------------------------------------------------------------------------+
Time taken: 1493 ms
scala>
Try this
scala> df.show(false)
+----------------------------------------------------------------------------------------------------------------+
|flight |
+----------------------------------------------------------------------------------------------------------------+
|[[[2020-10-30T13:35:00.000Z, 2020-10-30T14:47:00.000Z]], [[2020-10-30T13:35:00.000Z, 2020-10-30T14:47:00.000Z]]]|
|[[[2020-10-25T13:15:00.000Z, 2020-10-25T14:37:00.000Z]], [[2020-10-25T13:15:00.000Z, 2020-10-25T14:37:00.000Z]]]|
+----------------------------------------------------------------------------------------------------------------+
scala>
scala> df.printSchema
root
|-- flight: struct (nullable = true)
| |-- legs: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- dep: string (nullable = true)
| | | |-- arr: string (nullable = true)
| |-- segments: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- dep: string (nullable = true)
| | | |-- arr: string (nullable = true)
scala>
scala> val myudf = udf(
| (arrs:Seq[String]) => {
| arrs.map("something" ++ _)
| }
| )
myudf: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,ArrayType(StringType,true),Some(List(ArrayType(StringType,true))))
scala> val df2 = df.select($"flight", myudf($"flight.legs.arr") as "editedArrs")
df2: org.apache.spark.sql.DataFrame = [flight: struct<legs: array<struct<dep:string,arr:string>>, segments: array<struct<dep:string,arr:string>>>, editedArrs: array<string>]
scala> df2.printSchema
root
|-- flight: struct (nullable = true)
| |-- legs: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- dep: string (nullable = true)
| | | |-- arr: string (nullable = true)
| |-- segments: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- dep: string (nullable = true)
| | | |-- arr: string (nullable = true)
|-- editedArrs: array (nullable = true)
| |-- element: string (containsNull = true)
scala> df2.show(false)
+----------------------------------------------------------------------------------------------------------------+-----------------------------------+
|flight |editedArrs |
+----------------------------------------------------------------------------------------------------------------+-----------------------------------+
|[[[2020-10-30T13:35:00.000Z, 2020-10-30T14:47:00.000Z]], [[2020-10-30T13:35:00.000Z, 2020-10-30T14:47:00.000Z]]]|[something2020-10-30T14:47:00.000Z]|
|[[[2020-10-25T13:15:00.000Z, 2020-10-25T14:37:00.000Z]], [[2020-10-25T13:15:00.000Z, 2020-10-25T14:37:00.000Z]]]|[something2020-10-25T14:37:00.000Z]|
+----------------------------------------------------------------------------------------------------------------+-----------------------------------+
scala>
scala>
scala> val df3 = df2.select(struct(arrays_zip($"flight.legs.dep", $"editedArrs") cast "array<struct<dep:string,arr:string>>" as "legs", $"flight.segments") as "flight")
df3: org.apache.spark.sql.DataFrame = [flight: struct<legs: array<struct<dep:string,arr:string>>, segments: array<struct<dep:string,arr:string>>>]
scala>
scala> df3.printSchema
root
|-- flight: struct (nullable = false)
| |-- legs: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- dep: string (nullable = true)
| | | |-- arr: string (nullable = true)
| |-- segments: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- dep: string (nullable = true)
| | | |-- arr: string (nullable = true)
scala>
scala> df3.show(false)
+-------------------------------------------------------------------------------------------------------------------------+
|flight |
+-------------------------------------------------------------------------------------------------------------------------+
|[[[2020-10-30T13:35:00.000Z, something2020-10-30T14:47:00.000Z]], [[2020-10-30T13:35:00.000Z, 2020-10-30T14:47:00.000Z]]]|
|[[[2020-10-25T13:15:00.000Z, something2020-10-25T14:37:00.000Z]], [[2020-10-25T13:15:00.000Z, 2020-10-25T14:37:00.000Z]]]|
+-------------------------------------------------------------------------------------------------------------------------+

How to Process Nested struct with Array element in Spark RDD

I am using spark sql data processing for nested with array.
{
"isActive": true,
"sample": {
"someitem": {
"thesearecool": [{
"neat": "wow"
},
{
"neat": "tubular"
}
]
},
"coolcolors": [{
"color": "red",
"hex": "ff0000"
},
{
"color": "blue",
"hex": "0000ff"
}
]
}
}
schema :
root
|-- isActive: boolean (nullable = true)
|-- sample: struct (nullable = true)
| |-- coolcolors: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- color: string (nullable = true)
| | | |-- hex: string (nullable = true)
| |-- someitem: struct (nullable = true)
| | |-- thesearecool: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- neat: string (nullable = true)
code :
val nested1 = nested.withColumn("color_data", explode($"sample.coolcolors")).select("isActive","color_data.color","color_data.hex","sample.someitem.thesearecool.neat")
val nested2 = nested.withColumn("thesearecool_data", explode($"sample.someitem.thesearecool")).select("thesearecool_data.neat")
sample output:
+--------+-----+------+--------------+
|isActive|color|hex |neat |
+--------+-----+------+--------------+
|true |red |ff0000|[wow, tubular]|
|true |blue |0000ff|[wow, tubular]|
+--------+-----+------+--------------+
+-------+
|neat |
+-------+
|wow |
|tubular|
+-------+
we need to process data single result.
Explode twice, and select what you want.
df.withColumn("coolcolors", explode($"sample.coolcolors"))
.withColumn("thesearecool", explode($"sample.someitem.thesearecool"))
.select("isActive", "coolcolors.color", "coolcolors.hex", "thesearecool.neat").show
Then,
+--------+-----+------+-------+
|isActive|color| hex| neat|
+--------+-----+------+-------+
| true| red|ff0000| wow|
| true| red|ff0000|tubular|
| true| blue|0000ff| wow|
| true| blue|0000ff|tubular|
+--------+-----+------+-------+

Explode array in apache spark Data Frame

I am trying to flatten a schema of existing dataframe with nested fields. Structure of my dataframe is something like that:
root
|-- Id: long (nullable = true)
|-- Type: string (nullable = true)
|-- Uri: string (nullable = true)
|-- Type: array (nullable = true)
| |-- element: string (containsNull = true)
|-- Gender: array (nullable = true)
| |-- element: string (containsNull = true)
Type and gender can contain array of elements, one element or null value.
I tried to use the following code:
var resDf = df.withColumn("FlatType", explode(df("Type")))
But as a result in a resulting data frame I loose rows for which I had null values for Type column. It means, for example, if I have 10 rows and in 7 rows type is null and in 3 type is not null, after I use explode in resulting data frame I have only three rows.
How can I keep rows with null values but explode array of values?
I found some kind of workaround but still stuck in one place. For standard types we can do the following:
def customExplode(df: DataFrame, field: String, colType: String): org.apache.spark.sql.Column = {
var exploded = None: Option[org.apache.spark.sql.Column]
colType.toLowerCase() match {
case "string" =>
val avoidNull = udf((column: Seq[String]) =>
if (column == null) Seq[String](null)
else column)
exploded = Some(explode(avoidNull(df(field))))
case "boolean" =>
val avoidNull = udf((xs: Seq[Boolean]) =>
if (xs == null) Seq[Boolean]()
else xs)
exploded = Some(explode(avoidNull(df(field))))
case _ => exploded = Some(explode(df(field)))
}
exploded.get
}
And after that just use it like this:
val explodedField = customExplode(resultDf, fieldName, fieldTypeMap(field))
resultDf = resultDf.withColumn(newName, explodedField)
However, I have a problem for struct type for the following type of structure:
|-- Address: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- AddressType: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- DEA: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- Number: array (nullable = true)
| | | | | |-- element: string (containsNull = true)
| | | | |-- ExpirationDate: array (nullable = true)
| | | | | |-- element: timestamp (containsNull = true)
| | | | |-- Status: array (nullable = true)
| | | | | |-- element: string (containsNull = true)
How can we process that kind of schema when DEA is null?
Thank you in advance.
P.S. I tried to use Lateral views but result is the same.
Maybe you can try using when:
val resDf = df.withColumn("FlatType", when(df("Type").isNotNull, explode(df("Type")))
As shown in the when function's documentation, the value null is inserted for the values that do not match the conditions.
I think what you wanted is to use explode_outer instead of explode
see apache docs : explode and explode_outer