PySpark: convert JSON to DataFrame

I have a .json file that I need to convert into a DataFrame. The file looks like this:
{
  "id": "517379",
  "created_at": "2020-11-16T04:25:03Z",
  "company": "1707",
  "invoice": [
    {
      "invoice_id": "4102",
      "date": "2020-11-16T04:25:03Z",
      "id": "517379",
      "cantidad": "21992.47",
      "extra_data": {
        "column": "ASDFG",
        "credito": "Crédito"
      }
    }
  ]
}
I need it as a df, in this shape:
id     | created_at           | company | invoice_id | date                 | id     | cantidad | column | credito
517379 | 2020-11-16T04:25:03Z | 1707    | 4102       | 2020-11-16T04:25:03Z | 517379 | 21992.47 | ASDFG  | Crédito

By default, Spark tries to parse each line as a separate JSON document, so if your file contains JSON objects that span multiple lines, you will end up with a DataFrame containing a single _corrupt_record column. To solve this problem you need to set the read option multiline to true.
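To see the failure mode described above (a sketch; the exact behavior also depends on the JSON parse mode):
# Without the multiline option, Spark parses the pretty-printed file line
# by line, so every record ends up in a single _corrupt_record column.
df_bad = spark.read.json("test.json")
df_bad.printSchema()
# root
#  |-- _corrupt_record: string (nullable = true)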
test.json
[
  {
    "id": "517379",
    "created_at": "2020-11-16T04:25:03Z",
    "company": "1707",
    "invoice": [
      {
        "invoice_id": "4102",
        "date": "2020-11-16T04:25:03Z",
        "id": "517379",
        "cantidad": "21992.47",
        "extra_data": {
          "column": "ASDFG",
          "credito": "Crédito"
        }
      }
    ]
  },
  {
    "id": "1234",
    "created_at": "2020-11-16T04:25:03Z",
    "company": "1707",
    "invoice": [
      {
        "invoice_id": "4102",
        "date": "2020-11-16T04:25:03Z",
        "id": "517379",
        "cantidad": "21992.47",
        "extra_data": {
          "column": "ASDFG",
          "credito": "Crédito"
        }
      }
    ]
  }
]
pyspark code
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.option("multiline", "true").json("test.json")
df.printSchema()
root
|-- company: string (nullable = true)
|-- created_at: string (nullable = true)
|-- id: string (nullable = true)
|-- invoice: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- cantidad: string (nullable = true)
| | |-- date: string (nullable = true)
| | |-- extra_data: struct (nullable = true)
| | | |-- column: string (nullable = true)
| | | |-- credito: string (nullable = true)
| | |-- id: string (nullable = true)
| | |-- invoice_id: string (nullable = true)
df.show()
+-------+--------------------+------+--------------------+
|company| created_at| id| invoice|
+-------+--------------------+------+--------------------+
| 1707|2020-11-16T04:25:03Z|517379|[{21992.47, 2020-...|
| 1707|2020-11-16T04:25:03Z| 1234|[{21992.47, 2020-...|
+-------+--------------------+------+--------------------+
If you know the schema, it's better to provide it: this avoids the schema-inference pass and lets you parse each field as the type you want.
from pyspark.sql import types as pyspark_types
from pyspark.sql import functions as pyspark_functions

schema = pyspark_types.StructType(fields=[
    pyspark_types.StructField("id", pyspark_types.StringType()),
    pyspark_types.StructField("created_at", pyspark_types.TimestampType()),
    pyspark_types.StructField("company", pyspark_types.StringType()),
    pyspark_types.StructField("invoice", pyspark_types.ArrayType(
        pyspark_types.StructType(fields=[
            pyspark_types.StructField("invoice_id", pyspark_types.StringType()),
            pyspark_types.StructField("date", pyspark_types.TimestampType()),
            pyspark_types.StructField("id", pyspark_types.StringType()),
            pyspark_types.StructField("cantidad", pyspark_types.StringType()),
            pyspark_types.StructField("extra_data", pyspark_types.StructType(fields=[
                pyspark_types.StructField("column", pyspark_types.StringType()),
                pyspark_types.StructField("credito", pyspark_types.StringType())
            ]))
        ])
    )),
])
df = spark.read.option("multiline","true").json("test.json", schema=schema)
df.printSchema()
root
|-- id: string (nullable = true)
|-- created_at: timestamp (nullable = true)
|-- company: string (nullable = true)
|-- invoice: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- invoice_id: string (nullable = true)
| | |-- date: timestamp (nullable = true)
| | |-- id: string (nullable = true)
| | |-- cantidad: string (nullable = true)
| | |-- extra_data: struct (nullable = true)
| | | |-- column: string (nullable = true)
| | | |-- credito: string (nullable = true)
df.show()
+------+-------------------+-------+--------------------+
| id| created_at|company| invoice|
+------+-------------------+-------+--------------------+
|517379|2020-11-16 04:25:03| 1707|[{4102, 2020-11-1...|
| 1234|2020-11-16 04:25:03| 1707|[{4102, 2020-11-1...|
+------+-------------------+-------+--------------------+
# To get a completely denormalized dataframe
df = df.withColumn("invoice", pyspark_functions.explode_outer("invoice")) \
    .select("id", "company", "created_at",
            "invoice.invoice_id", "invoice.date", "invoice.id", "invoice.cantidad",
            "invoice.extra_data.*")
In the second example, Spark doesn't try to detect the schema, since it's provided; instead it casts the values to the types you specified (created_at and date in the second example are timestamps instead of strings).
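One caveat with the denormalized select above: it returns two columns that are both named id (the top-level one and invoice.id). A sketch of a variant that aliases the inner one (the name invoice_line_id is my own choice, not from the question):
from pyspark.sql import functions as pyspark_functions

df_flat = (df
    .withColumn("invoice", pyspark_functions.explode_outer("invoice"))
    .select("id", "company", "created_at",
            "invoice.invoice_id", "invoice.date",
            pyspark_functions.col("invoice.id").alias("invoice_line_id"),  # avoid the duplicate "id"
            "invoice.cantidad",
            "invoice.extra_data.*"))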

Related

Convert array of array to array of struct in PySpark

I have a dataframe like the one below:
id  contact_persons
-----------------------
1   [[abc, abc@xyz.com, 896676, manager], [pqr, pqr@xyz.com, 89809043, director], [stu, stu@xyz.com, 09909343, programmer]]
The schema looks like this:
root
|-- id: string (nullable = true)
|-- contact_persons: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
I need to convert this dataframe to the schema below:
root
|-- id: string (nullable = true)
|-- contact_persons: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- emails: string (nullable = true)
| | |-- name: string (nullable = true)
| | |-- phone: string (nullable = true)
| | |-- roles: string (nullable = true)
I know there is a struct function in PySpark, but in this scenario I don't know how to use it, as the array is dynamically sized.
You can use the TRANSFORM expression to cast it:
import pyspark.sql.functions as f

df = spark.createDataFrame([
    ['1', [['abc', 'abc@xyz.com', '896676', 'manager'],
           ['pqr', 'pqr@xyz.com', '89809043', 'director'],
           ['stu', 'stu@xyz.com', '09909343', 'programmer']]]
], schema='id string, contact_persons array<array<string>>')

expression = 'TRANSFORM(contact_persons, el -> STRUCT(el[0] AS name, el[1] AS emails, el[2] AS phone, el[3] AS roles))'
output_df = df.withColumn('contact_persons', f.expr(expression))
# output_df.printSchema()
# root
# |-- id: string (nullable = true)
# |-- contact_persons: array (nullable = true)
# | |-- element: struct (containsNull = false)
# | | |-- name: string (nullable = true)
# | | |-- emails: string (nullable = true)
# | | |-- phone: string (nullable = true)
# | | |-- roles: string (nullable = true)
output_df.show(truncate=False)
+---+-----------------------------------------------------------------------------------------------------------------------+
|id |contact_persons |
+---+-----------------------------------------------------------------------------------------------------------------------+
|1  |[{abc, abc@xyz.com, 896676, manager}, {pqr, pqr@xyz.com, 89809043, director}, {stu, stu@xyz.com, 09909343, programmer}]|
+---+-----------------------------------------------------------------------------------------------------------------------+
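If you are on Spark 3.1+ and prefer to stay in the DataFrame API, the same reshaping can be written with pyspark.sql.functions.transform instead of the SQL TRANSFORM string; a sketch that should be equivalent to the expr version above:
import pyspark.sql.functions as f

# Each element `el` is an array<string>; build a struct from its positions.
output_df = df.withColumn(
    'contact_persons',
    f.transform('contact_persons', lambda el: f.struct(
        el[0].alias('name'),
        el[1].alias('emails'),
        el[2].alias('phone'),
        el[3].alias('roles'),
    ))
)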

How do I check if a column is present in a Spark DataFrame

I am trying to write logic that returns an empty column if a column does not exist in the dataframe.
The schema changes very frequently; sometimes the whole struct is missing (temp1), or the array inside the struct is missing (suffix).
Schema looks like this:
root
|-- id: string (nullable = true)
|-- temp: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- temp1: struct (nullable = true)
| | | |-- code: string (nullable = true)
| | | |-- code1: array (nullable = true)
| | | | |-- element: string (containsNull = true)
| | |-- temp2: struct (nullable = true)
| | | |-- name1: array (nullable = true)
| | | | |-- element: string (containsNull = true)
| | | |-- suffix: array (nullable = true)
| | | | |-- element: string (containsNull = true)
|-- timestamp: timestamp (nullable = true)
Or like this:
root
|-- id: string (nullable = true)
|-- temp: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- temp2: struct (nullable = true)
| | | |-- name1: array (nullable = true)
| | | | |-- element: string (containsNull = true)
|-- timestamp: timestamp (nullable = true)
When I try the below logic on the second schema, I get an exception that the struct field is not found:
def has_Column(df: DataFrame, path: String) = Try(df(path)).isSuccess

df.withColumn("id", col("id"))
  .withColumn("tempLn", explode(col("temp")))
  .withColumn("temp1_code1",
    when(lit(has_Column(df, "tempLn.temp1.code1")), concat_ws(" ", col("tempLn.temp1.code1")))
      .otherwise(lit("").cast("string")))
  .withColumn("temp2_suffix",
    when(lit(has_Column(df, "tempLn.temp2.suffix")), concat_ws(" ", col("tempLn.temp2.suffix")))
      .otherwise(lit("").cast("string")))
Error:
org.apache.spark.sql.AnalysisException: No such struct field temp1;
You need to check for the column's existence outside of the select/withColumn... methods. Because you reference the column in the then branch of the when expression, Spark tries to resolve it during the analysis of the query, before the condition is ever evaluated.
So you'll need to test like this:
if (has_Column(df, "tempLn.temp2.suffix"))
  df.withColumn("temp2_suffix", concat_ws(" ", col("tempLn.temp2.suffix")))
else
  df.withColumn("temp2_suffix", lit(""))
To do it for multiple columns you can use foldLeft like this:
val df1 = Seq(
  ("tempLn.temp1.code1", "temp1_code1"),
  ("tempLn.temp2.suffix", "temp2_suffix")
).foldLeft(df) {
  case (acc, (field, newCol)) =>
    if (has_Column(acc, field))
      acc.withColumn(newCol, concat_ws(" ", col(field)))
    else
      acc.withColumn(newCol, lit(""))
}
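For PySpark users, a rough equivalent of the same pattern (a sketch; note that, as in the Scala version, the path must resolve on the dataframe you test, so the exploded tempLn column is created first):
from pyspark.sql import functions as F

def has_column(df, path):
    """True if the (possibly nested) column path resolves on df."""
    try:
        df[path]
        return True
    except Exception:  # resolution errors surface through Py4J, so catch broadly
        return False

exploded = df.withColumn("tempLn", F.explode(F.col("temp")))
for field, new_col in [("tempLn.temp1.code1", "temp1_code1"),
                       ("tempLn.temp2.suffix", "temp2_suffix")]:
    if has_column(exploded, field):
        exploded = exploded.withColumn(new_col, F.concat_ws(" ", F.col(field)))
    else:
        exploded = exploded.withColumn(new_col, F.lit(""))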

Convert a JSON string to a struct column without schema in Spark

Spark: 3.0.0
Scala: 2.12.8
My data frame has a column with JSON string, and I want to create a new column from it with the StructType.
temp_json_string
{"name":"test","id":"12","category":[{"products":["A","B"],"displayName":"test_1","displayLabel":"test1"},{"products":["C"],"displayName":"test_2","displayLabel":"test2"}],"createdAt":"","createdBy":""}
root
|-- temp_json_string: string (nullable = true)
Formatted JSON:
{
  "name":"test",
  "id":"12",
  "category":[
    {
      "products":["A", "B"],
      "displayName":"test_1",
      "displayLabel":"test1"
    },
    {
      "products":["C"],
      "displayName":"test_2",
      "displayLabel":"test2"
    }
  ],
  "createdAt":"",
  "createdBy":""
}
I want to create a new column of type Struct so I tried:
dataFrame
  .withColumn("temp_json_struct", struct(col("temp_json_string")))
  .select("temp_json_struct")
Now, I get the schema as:
root
|-- temp_json_struct: struct (nullable = false)
| |-- temp_json_string: string (nullable = true)
Desired result:
root
|-- temp_json_struct: struct (nullable = false)
| |-- name: string (nullable = true)
| |-- category: array (nullable = true)
| | |-- products: array (nullable = true)
| | |-- displayName: string (nullable = true)
| | |-- displayLabel: string (nullable = true)
| |-- createdAt: timestamp (nullable = true)
| |-- updatedAt: timestamp (nullable = true)
json_str_col is the column that has the JSON string. I had multiple files, so that's why the first line iterates through each row to extract the schema. If you know your schema up front, just replace json_schema with that.
from pyspark.sql.functions import col, from_json

json_schema = spark.read.json(df.rdd.map(lambda row: row.json_str_col)).schema
df = df.withColumn('new_col', from_json(col('json_str_col'), json_schema))
// import spark implicits for conversion to dataset (.as[String])
import spark.implicits._
val df = ??? //create your dataframe having the 'temp_json_string' column
//convert Dataset[Row] aka Dataframe to Dataset[String]
val ds = df.select("temp_json_string").as[String]
//read as json
spark.read.json(ds)
See the Spark documentation for DataFrameReader.json.
There are at least two different ways to retrieve/discover the schema for a given JSON.
For the illustration, let's create some data first:
import org.apache.spark.sql.types.StructType

val jsData = Seq(
  ("""{
    "name":"test","id":"12","category":[
      {
        "products":["A", "B"],
        "displayName":"test_1",
        "displayLabel":"test1"
      },
      {
        "products":["C"],
        "displayName":"test_2",
        "displayLabel":"test2"
      }
    ],
    "createdAt":"",
    "createdBy":""}""")
)
Option 1: schema_of_json
The first option is to use the built-in function schema_of_json. The function will return the schema for the given JSON in DDL format:
val json = jsData.toDF("js").collect()(0).getString(0)
val ddlSchema: String = spark.sql(s"select schema_of_json('${json}')")
  .collect()(0)              // get 1st row
  .getString(0)              // get 1st col of the row as string
  .replace("null", "string") // replace null types with string; this occurs since you have "createdAt":""

// struct<category:array<struct<displayLabel:string,displayName:string,products:array<string>>>,createdAt:null,createdBy:null,id:string,name:string>

val schema: StructType = StructType.fromDDL(s"js_schema $ddlSchema")
Note that you might expect schema_of_json to also work at the column level, i.e. schema_of_json(js_col); unfortunately, this doesn't work as expected, so we are forced to pass a literal string instead.
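PySpark exposes the same built-in, with the same literal-only limitation; a sketch of the equivalent, where json_str is assumed to hold one sample JSON document:
from pyspark.sql import functions as F

# schema_of_json wants a foldable argument, so pass one sample document
# as a literal rather than a data column.
ddl_schema = spark.range(1).select(F.schema_of_json(F.lit(json_str))).first()[0]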
Option 2: use Spark JSON reader (recommended)
import org.apache.spark.sql.functions.from_json
val schema: StructType = spark.read.json(jsData.toDS).schema
// schema.printTreeString
// root
// |-- category: array (nullable = true)
// | |-- element: struct (containsNull = true)
// | | |-- displayLabel: string (nullable = true)
// | | |-- displayName: string (nullable = true)
// | | |-- products: array (nullable = true)
// | | | |-- element: string (containsNull = true)
// |-- createdAt: string (nullable = true)
// |-- createdBy: string (nullable = true)
// |-- id: string (nullable = true)
// |-- name: string (nullable = true)
As you can see, here we are producing a schema based on StructType and not a DDL string as in the previous case.
After discovering the schema, we can move on to the next step: converting the JSON data into a struct. To achieve that we will use the from_json built-in function:
jsData.toDF("js")
  .withColumn("temp_json_struct", from_json($"js", schema))
  .printSchema()
// root
// |-- js: string (nullable = true)
// |-- temp_json_struct: struct (nullable = true)
// | |-- category: array (nullable = true)
// | | |-- element: struct (containsNull = true)
// | | | |-- displayLabel: string (nullable = true)
// | | | |-- displayName: string (nullable = true)
// | | | |-- products: array (nullable = true)
// | | | | |-- element: string (containsNull = true)
// | |-- createdAt: string (nullable = true)
// | |-- createdBy: string (nullable = true)
// | |-- id: string (nullable = true)
// | |-- name: string (nullable = true)

Apply a function to a column inside a structure of a Spark DataFrame, replacing that column

I cannot find exactly what I am looking for, so here is my question. I fetch some data from MongoDB into a Spark DataFrame. The dataframe has the following schema (df.printSchema):
|-- flight: struct (nullable = true)
| |-- legs: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- arrival: timestamp (nullable = true)
| | | |-- departure: timestamp (nullable = true)
| |-- segments: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- arrival: timestamp (nullable = true)
| | | |-- departure: timestamp (nullable = true)
Do note the top-level structure, followed by an array, inside which I need to change my data.
For example:
{
  "flight": {
    "legs": [{
      "departure": ISODate("2020-10-30T13:35:00.000Z"),
      "arrival": ISODate("2020-10-30T14:47:00.000Z")
    }],
    "segments": [{
      "departure": ISODate("2020-10-30T13:35:00.000Z"),
      "arrival": ISODate("2020-10-30T14:47:00.000Z")
    }]
  }
}
I want to export this to JSON, but for business reasons I want the arrival dates to have a different format than the departure dates. For example, I may want to export the departure ISODate as milliseconds from the epoch, but not the arrival one.
To do so, I thought of applying a custom function to do the transformation:
// Here I can do any transformation. I hope to replace the timestamp with the needed value
val doSomething: UserDefinedFunction = udf((value: Seq[Timestamp]) => {
  value.map(x => "doSomething" + x.getTime)
})

val newDf = df.withColumn("flight.legs.departure",
  doSomething(df.col("flight.legs.departure")))
But this simply returns a brand new column, containing an array of a single doSomething string.
{
  "flight": {
    "legs": [{
      "arrival": "2020-10-30T14:47:00Z",
      "departure": "2020-10-30T13:35:00Z"
    }],
    "segments": [{
      "arrival": "2020-10-30T14:47:00Z",
      "departure": "2020-10-30T13:35:00Z"
    }]
  },
  "flight.legs.departure": ["doSomething1596268800000"]
}
And newDf.show(1)
+--------------------+---------------------+
| flight|flight.legs.departure|
+--------------------+---------------------+
|[[[182], 94, [202...| [doSomething15962...|
+--------------------+---------------------+
Instead of
{
  ...
  "arrival": "2020-10-30T14:47:00Z",
  // leg departure date that I changed
  "departure": "doSomething1596268800000"
  ... // segments not affected in this example
  "arrival": "2020-10-30T14:47:00Z",
  "departure": "2020-10-30T13:35:00Z",
  ...
}
Any ideas how to proceed?
Edit - clarification:
Please bear in mind that my schema is way more complex than what is shown above. For example, there is yet another top-level data tag, so flight sits below it along with other information. Then inside flight, legs and segments, there are many more elements, some of which are also nested. I only focused on the ones that I needed to change.
I am saying this because I would like the simplest solution that would scale. I.e., ideally one that would simply change the required elements without having to de-construct and then re-construct the whole nested structure. If we cannot avoid that, is using case classes the simplest solution?
Please check the code below.
Execution Time
With UDF : Time taken: 679 ms
Without UDF : Time taken: 1493 ms
Code With UDF
scala> :paste
// Entering paste mode (ctrl-D to finish)
// Creating a UDF to update values inside the array.
import java.text.SimpleDateFormat
val dateFormat = new SimpleDateFormat("yyyy-MM-dd'T'hh:mm:ss") // For me the departure values are strings, so using this to convert them to a SQL timestamp.
val doSomething = udf((value: Seq[String]) => {
  value.map(x => s"dosomething${dateFormat.parse(x).getTime}")
})
// Exiting paste mode, now interpreting.
import java.text.SimpleDateFormat
dateFormat: java.text.SimpleDateFormat = java.text.SimpleDateFormat@41bd83a
doSomething: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,ArrayType(StringType,true),Some(List(ArrayType(StringType,true))))
scala> :paste
// Entering paste mode (ctrl-D to finish)
spark.time {
  val updated = df.select("flight.*")
    .withColumn("legs", arrays_zip($"legs.arrival", doSomething($"legs.departure"))
      .cast("array<struct<arrival:string,departure:string>>"))
    .select(struct($"segments", $"legs").as("flight"))
  updated.printSchema
  updated.show(false)
}
// Exiting paste mode, now interpreting.
root
|-- flight: struct (nullable = false)
| |-- segments: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- arrival: string (nullable = true)
| | | |-- departure: string (nullable = true)
| |-- legs: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- arrival: string (nullable = true)
| | | |-- departure: string (nullable = true)
+-------------------------------------------------------------------------------------------------+
|flight |
+-------------------------------------------------------------------------------------------------+
|[[[2020-10-30T14:47:00, 2020-10-30T13:35:00]], [[2020-10-30T14:47:00, dosomething1604045100000]]]|
+-------------------------------------------------------------------------------------------------+
Time taken: 679 ms
scala>
Code Without UDF
scala> val df = spark.read.json(Seq("""{"flight": {"legs": [{"departure": "2020-10-30T13:35:00","arrival": "2020-10-30T14:47:00"}],"segments": [{"departure": "2020-10-30T13:35:00","arrival": "2020-10-30T14:47:00"}]}}""").toDS)
df: org.apache.spark.sql.DataFrame = [flight: struct<legs: array<struct<arrival:string,departure:string>>, segments: array<struct<arrival:string,departure:string>>>]
scala> df.printSchema
root
|-- flight: struct (nullable = true)
| |-- legs: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- arrival: string (nullable = true)
| | | |-- departure: string (nullable = true)
| |-- segments: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- arrival: string (nullable = true)
| | | |-- departure: string (nullable = true)
scala> df.show(false)
+--------------------------------------------------------------------------------------------+
|flight |
+--------------------------------------------------------------------------------------------+
|[[[2020-10-30T14:47:00, 2020-10-30T13:35:00]], [[2020-10-30T14:47:00, 2020-10-30T13:35:00]]]|
+--------------------------------------------------------------------------------------------+
scala> :paste
// Entering paste mode (ctrl-D to finish)
spark.time {
  val updated = df
    .select("flight.*")
    .select($"segments", $"legs.arrival", $"legs.departure")          // extracting legs struct column values.
    .withColumn("departure", explode($"departure"))                   // exploding departure column
    .withColumn("departure", concat_ws("-", lit("something"), $"departure".cast("timestamp").cast("long"))) // updating departure column values
    .groupBy($"segments", $"arrival")                                 // grouping columns except legs column
    .agg(collect_list($"departure").as("departure"))                  // constructing the list back
    .select($"segments", arrays_zip($"arrival", $"departure").as("legs")) // constructing arrival & departure columns using arrays_zip
    .select(struct($"legs", $"segments").as("flight"))                // finally creating flight by combining legs & segments columns.
  updated.printSchema
  updated.show(false)
}
// Exiting paste mode, now interpreting.
root
|-- flight: struct (nullable = false)
| |-- legs: array (nullable = true)
| | |-- element: struct (containsNull = false)
| | | |-- arrival: string (nullable = true)
| | | |-- departure: string (nullable = true)
| |-- segments: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- arrival: string (nullable = true)
| | | |-- departure: string (nullable = true)
+---------------------------------------------------------------------------------------------+
|flight |
+---------------------------------------------------------------------------------------------+
|[[[2020-10-30T14:47:00, something-1604045100]], [[2020-10-30T14:47:00, 2020-10-30T13:35:00]]]|
+---------------------------------------------------------------------------------------------+
Time taken: 1493 ms
scala>
Try this
scala> df.show(false)
+----------------------------------------------------------------------------------------------------------------+
|flight |
+----------------------------------------------------------------------------------------------------------------+
|[[[2020-10-30T13:35:00.000Z, 2020-10-30T14:47:00.000Z]], [[2020-10-30T13:35:00.000Z, 2020-10-30T14:47:00.000Z]]]|
|[[[2020-10-25T13:15:00.000Z, 2020-10-25T14:37:00.000Z]], [[2020-10-25T13:15:00.000Z, 2020-10-25T14:37:00.000Z]]]|
+----------------------------------------------------------------------------------------------------------------+
scala>
scala> df.printSchema
root
|-- flight: struct (nullable = true)
| |-- legs: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- dep: string (nullable = true)
| | | |-- arr: string (nullable = true)
| |-- segments: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- dep: string (nullable = true)
| | | |-- arr: string (nullable = true)
scala>
scala> val myudf = udf(
| (arrs:Seq[String]) => {
| arrs.map("something" ++ _)
| }
| )
myudf: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,ArrayType(StringType,true),Some(List(ArrayType(StringType,true))))
scala> val df2 = df.select($"flight", myudf($"flight.legs.arr") as "editedArrs")
df2: org.apache.spark.sql.DataFrame = [flight: struct<legs: array<struct<dep:string,arr:string>>, segments: array<struct<dep:string,arr:string>>>, editedArrs: array<string>]
scala> df2.printSchema
root
|-- flight: struct (nullable = true)
| |-- legs: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- dep: string (nullable = true)
| | | |-- arr: string (nullable = true)
| |-- segments: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- dep: string (nullable = true)
| | | |-- arr: string (nullable = true)
|-- editedArrs: array (nullable = true)
| |-- element: string (containsNull = true)
scala> df2.show(false)
+----------------------------------------------------------------------------------------------------------------+-----------------------------------+
|flight |editedArrs |
+----------------------------------------------------------------------------------------------------------------+-----------------------------------+
|[[[2020-10-30T13:35:00.000Z, 2020-10-30T14:47:00.000Z]], [[2020-10-30T13:35:00.000Z, 2020-10-30T14:47:00.000Z]]]|[something2020-10-30T14:47:00.000Z]|
|[[[2020-10-25T13:15:00.000Z, 2020-10-25T14:37:00.000Z]], [[2020-10-25T13:15:00.000Z, 2020-10-25T14:37:00.000Z]]]|[something2020-10-25T14:37:00.000Z]|
+----------------------------------------------------------------------------------------------------------------+-----------------------------------+
scala>
scala>
scala> val df3 = df2.select(struct(arrays_zip($"flight.legs.dep", $"editedArrs") cast "array<struct<dep:string,arr:string>>" as "legs", $"flight.segments") as "flight")
df3: org.apache.spark.sql.DataFrame = [flight: struct<legs: array<struct<dep:string,arr:string>>, segments: array<struct<dep:string,arr:string>>>]
scala>
scala> df3.printSchema
root
|-- flight: struct (nullable = false)
| |-- legs: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- dep: string (nullable = true)
| | | |-- arr: string (nullable = true)
| |-- segments: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- dep: string (nullable = true)
| | | |-- arr: string (nullable = true)
scala>
scala> df3.show(false)
+-------------------------------------------------------------------------------------------------------------------------+
|flight |
+-------------------------------------------------------------------------------------------------------------------------+
|[[[2020-10-30T13:35:00.000Z, something2020-10-30T14:47:00.000Z]], [[2020-10-30T13:35:00.000Z, 2020-10-30T14:47:00.000Z]]]|
|[[[2020-10-25T13:15:00.000Z, something2020-10-25T14:37:00.000Z]], [[2020-10-25T13:15:00.000Z, 2020-10-25T14:37:00.000Z]]]|
+-------------------------------------------------------------------------------------------------------------------------+
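On Spark 3.1+ there is a middle ground that matches the asker's request to change the required elements without de-constructing the whole structure: Column.withField combined with functions.transform. A PySpark sketch using the original question's field names (the doSomething formatting is a stand-in for the real business rule):
from pyspark.sql import functions as F

df2 = df.withColumn(
    "flight",
    F.col("flight").withField(
        "legs",
        F.transform(
            F.col("flight.legs"),
            # rewrite only legs[*].departure, leaving arrival and segments untouched
            lambda leg: leg.withField(
                "departure",
                F.concat(F.lit("doSomething"),
                         leg["departure"].cast("timestamp").cast("long").cast("string")))
        )
    )
)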

Extract struct fields in Spark Scala

I have a df with this schema:
|-- Data: struct (nullable = true)
| |-- address_billing: struct (nullable = true)
| | |-- address1: string (nullable = true)
| | |-- address2: string (nullable = true)
| |-- address_shipping: struct (nullable = true)
| | |-- address1: string (nullable = true)
| | |-- address2: string (nullable = true)
| | |-- city: string (nullable = true)
| |-- cancelled_initiator: string (nullable = true)
| |-- cancelled_reason: string (nullable = true)
| |-- statuses: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- store_code: string (nullable = true)
| |-- store_name: string (nullable = true)
| |-- tax_code: string (nullable = true)
| |-- total: string (nullable = true)
| |-- updated_at: string (nullable = true)
I need to extract all of its fields into separate columns without manually specifying the names.
Is there any way we can do this?
I tried:
val df2 = df1.select(df1.col("Data.*"))
but got the error
org.apache.spark.sql.AnalysisException: No such struct field * in address_billing, address_shipping,....
Also, can anyone suggest how to add a prefix to all these columns, since some of the column names may be the same?
Output should be like
address_billing_address1
address_billing_address2
.
.
.
Just change df1.col to col. Either of these should work:
df1.select(col("Data.*"))
df1.select($"Data.*")
df1.select("Data.*")
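The prefix part of the question is not covered by the answer above. One option is to walk the schema recursively and alias each leaf with its underscore-joined path; a PySpark sketch, assuming the struct column is named Data and dropping the leading Data_ from the alias:
from pyspark.sql import functions as F
from pyspark.sql.types import StructType

def flatten(schema, prefix):
    """Collect leaf columns under `prefix`, aliased with underscore-joined paths."""
    cols = []
    for field in schema.fields:
        path = f"{prefix}.{field.name}"
        if isinstance(field.dataType, StructType):
            cols += flatten(field.dataType, path)
        else:
            alias = path.split(".", 1)[1].replace(".", "_")  # e.g. address_billing_address1
            cols.append(F.col(path).alias(alias))
    return cols

df2 = df1.select(flatten(df1.schema["Data"].dataType, "Data"))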