Spark SQL data frame

Spark SQL data frame - scala

Data structure:
{"Emp":{"Name":"John", "Sal":"2000", "Address":[{"loc":"Sanjose","Zip":"222"},{"loc":"dayton","Zip":"333"}]}}
Now I want to load the data into a data frame and want to append zip to loc. The loc column name should be same (loc). The transformed data should be like this:
{"Emp":{"Name":"John", "Sal":"2000", "Address":[{"loc":"Sanjose222","Zip":"222"},{"loc":"dayton333","Zip":"333"}]}}
No RDDs. I need a data frame operation to achieve this, preferably with the withColumn function. How can I do this?

Given a data structure as
val jsonString = """{"Emp":{"Name":"John","Sal":"2000","Address":[{"loc":"Sanjose","Zip":"222"},{"loc":"dayton","Zip":"333"}]}}"""
You can covert it to dataframe as
val df = spark.read.json(sc.parallelize(jsonString::Nil))
which would give you
+-----------------------------------------------------+
|Emp |
+-----------------------------------------------------+
|[WrappedArray([222,Sanjose], [333,dayton]),John,2000]|
+-----------------------------------------------------+
//root
// |-- Emp: struct (nullable = true)
// | |-- Address: array (nullable = true)
// | | |-- element: struct (containsNull = true)
// | | | |-- Zip: string (nullable = true)
// | | | |-- loc: string (nullable = true)
// | |-- Name: string (nullable = true)
// | |-- Sal: string (nullable = true)
Now to get the desired output you would need to separate struct Emp column to separate columns and use Address array column in udf function to get your desired result as
import org.apache.spark.sql.functions._
def attachZipWithLoc = udf((array: Seq[Row])=> array.map(row => address(row.getAs[String]("loc")+row.getAs[String]("Zip"), row.getAs[String]("Zip"))))
df.select($"Emp.*")
.withColumn("Address", attachZipWithLoc($"Address"))
.select(struct($"Name".as("Name"), $"Sal".as("Sal"), $"Address".as("Address")).as("Emp"))
where address in udf class is a case class
case class address(loc: String, Zip: String)
which should give you
+-----------------------------------------------------------+
|Emp |
+-----------------------------------------------------------+
|[John,2000,WrappedArray([Sanjose222,222], [dayton333,333])]|
+-----------------------------------------------------------+
//root
// |-- Emp: struct (nullable = false)
// | |-- Name: string (nullable = true)
// | |-- Sal: string (nullable = true)
// | |-- Address: array (nullable = true)
// | | |-- element: struct (containsNull = true)
// | | | |-- loc: string (nullable = true)
// | | | |-- Zip: string (nullable = true)
Now to get the json you can just use .toJSON and you should get
+-----------------------------------------------------------------------------------------------------------------+
|value |
+-----------------------------------------------------------------------------------------------------------------+
|{"Emp":{"Name":"John","Sal":"2000","Address":[{"loc":"Sanjose222","Zip":"222"},{"loc":"dayton333","Zip":"333"}]}}|
+-----------------------------------------------------------------------------------------------------------------+

Related

How do I check if column present in the Spark DataFrame

I am trying a logic to return an empty column if column does not exist in dataframe.
Schema changes very frequent, sometime the whole struct will be missing (temp1) or array inside struct will be missing (suffix)
Schema looks like this:
root
|-- id: string (nullable = true)
|-- temp: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- temp1: struct (nullable = true)
| | | |-- code: string (nullable = true)
| | | |-- code1: array (nullable = true)
| | | | |-- element: string (containsNull = true)
| | |-- temp2: struct (nullable = true)
| | | |-- name1: array (nullable = true)
| | | | |-- element: string (containsNull = true)
| | | |-- suffix: array (nullable = true)
| | | | |-- element: string (containsNull = true)
|-- timestamp: timestamp (nullable = true)
Or like this:
root
|-- id: string (nullable = true)
|-- temp: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- temp2: struct (nullable = true)
| | | |-- name1: array (nullable = true)
| | | | |-- element: string (containsNull = true)
|-- timestamp: timestamp (nullable = true)
When I am trying the below logic for the second schema, getting an exception that Struct not found
def has_Column(df: DataFrame, path: String) = Try(df(path)).isSuccess
df.withColumn("id", col("id")).
withColumn("tempLn", explode(col("temp"))).
withColumn("temp1_code1", when(lit(has_Column(df, "tempLn.temp1.code1")), concat_ws(" ",col("tempLn.temp1.code1"))).otherwise(lit("").cast("string"))).
withColumn("temp2_suffix", when(lit(has_Column(df, "tempLn.temp2.suffix")), concat_ws(" ",col("tempLn.temp2.suffix"))).otherwise(lit("").cast("string")))
Error:
org.apache.spark.sql.AnalysisException: No such struct field temp1;

You need to do the check the existence outside the select/withColumn... methods. As you reference it in the then part of case when expression, Spark tries to resolve it during the analysis of the query.
So you'll need to test like this:
if (has_Column(df, "tempLn.temp1.code1"))
df.withColumn("temp2_suffix", concat_ws(" ",col("tempLn.temp2.suffix")))
else
df.withColumn("temp2_suffix", lit(""))
To do it for multiple columns you can use foldLeft like this:
val df1 = Seq(
("tempLn.temp1.code1", "temp1_code1"),
("tempLn.temp2.suffix", "temp2_suffix")
).foldLeft(df) {
case (acc, (field, newCol)) => {
if (has_Column(acc, field))
acc.withColumn(newCol, concat_ws(" ", col(field)))
else
acc.withColumn(newCol, lit(""))
}
}

How to flatten Array of WrappedArray of structs in scala

I have a dataframe with the following schema:
root
|-- id: string (nullable = true)
|-- collect_list(typeCounts): array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: struct (containsNull = true)
| | | |-- type: string (nullable = true)
| | | |-- count: long (nullable = false)
Example data:
+-----------+----------------------------------------------------------------------------+
|id |collect_list(typeCounts) |
+-----------+----------------------------------------------------------------------------+
|1 |[WrappedArray([B00XGS,6], [B001FY,5]), WrappedArray([B06LJ7,4])]|
|2 |[WrappedArray([B00UFY,3])] |
+-----------+----------------------------------------------------------------------------+
How can I flatten collect_list(typeCounts) to a flat array of structs in scala? I have read some answers on stackoverflow for similar questions suggesting UDF's, but I am not sure what the UDF method signature should be for structs.

If you're on Spark 2.4+, instead of using a UDF (which is generally less efficient than native Spark functions) you can apply flatten, like below:
df.withColumn("collect_list(typeCounts)", flatten($"collect_list(typeCounts)"))
i am not sure what the udf method signature should be for structs
UDF takes structs as Rows for input and may return them as Scala case classes. To flatten the nested collections, you can create a simple UDF as follows:
import org.apache.spark.sql.Row
case class TC(`type`: String, count: Long)
val flattenLists = udf{ (lists: Seq[Seq[Row]]) =>
lists.flatMap( _.map{ case Row(t: String, c: Long) => TC(t, c) } )
}
To test out the UDF, let's assemble a DataFrame with your described schema:
val df = Seq(
("1", Seq(TC("B00XGS", 6), TC("B001FY", 5))),
("1", Seq(TC("B06LJ7", 4))),
("2", Seq(TC("B00UFY", 3)))
).toDF("id", "typeCounts").
groupBy("id").agg(collect_list("typeCounts"))
df.printSchema
// root
// |-- id: string (nullable = true)
// |-- collect_list(typeCounts): array (nullable = true)
// | |-- element: array (containsNull = true)
// | | |-- element: struct (containsNull = true)
// | | | |-- type: string (nullable = true)
// | | | |-- count: long (nullable = false)
Applying the UDF:
df.
withColumn("collect_list(typeCounts)", flattenLists($"collect_list(typeCounts)")).
printSchema
// root
// |-- id: string (nullable = true)
// |-- collect_list(typeCounts): array (nullable = true)
// | |-- element: struct (containsNull = true)
// | | |-- type: string (nullable = true)
// | | |-- count: long (nullable = false)

Spark - Flatten Array of Structs using flatMap

I have a df with schema -
root
|-- arrayCol: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: string (nullable = true)
| | |-- email: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- qty: long (nullable = true)
| | |-- rqty: long (nullable = true)
| | |-- pids: array (nullable = true)
| | | |-- element: string (containsNull = true)
| | |-- sqty: long (nullable = true)
| | |-- id1: string (nullable = true)
| | |-- id2: string (nullable = true)
| | |-- window: struct (nullable = true)
| | | |-- end: string (nullable = true)
| | | |-- start: string (nullable = true)
| | |-- otherId: string (nullable = true)
|-- primarykey: string (nullable = true)
|-- runtime: string (nullable = true)
I don't want to use explode as its extremely slow and wanted to try flapMap instead.
I tried doing -
val ds = df1.as[(Array[StructType], String, String)]
ds.flatMap{ case(x, y, z) => x.map((_, y, z))}.toDF()
This gives me error -
scala.MatchError: org.apache.spark.sql.types.StructType
How do I flatten arrayCol?
Sample data -
{
"primaryKeys":"sfdfrdsdjn",
"runtime":"2020-10-31T13:01:04.813Z",
"arrayCol":[{"id":"qwerty","id1":"dsfdsfdsf","window":{"start":"2020-11-01T10:30:00Z","end":"2020-11-01T12:30:00Z"}, "email":[],"id2":"sdfsdfsdPuyOplzlR1idvfPkv5138g","rqty":3,"sqty":3,"qty":3,"otherId":null}]
}
Expected Output -
primaryKey runtime arrayCol
sfdfrdsdjn 2020-10-31T13:01:04.813Z {"id":"qwerty","id1":"dsfdsfdsf","window":{"start":"2020-11-01T10:30:00Z","end":"2020-11-01T12:30:00Z"}, "email":[],"id2":"sdfsdfsdPuyOplzlR1idvfPkv5138g","rqty":3,"sqty":3,"qty":3,"otherId":null}
I want one row for every element in arrayCol. Just like explode(arrayCol)

You almost had it. Remember when using spark with scala, always try to use the Dataset API as often as possible. This not only increases readeability, but helps solve these type of issues very quickly.
case class ArrayColWindow(end:String,start:String)
case class ArrayCol(id:String,email:Seq[String], qty:Long,rqty:Long,pids:Seq[String],
sqty:Long,id1:String,id2:String,window:ArrayColWindow, otherId:String)
case class FullArrayCols(arrayCol:Seq[ArrayCol],primarykey:String,runtime:String)
val inputTest = List(
FullArrayCols(Seq(ArrayCol("qwerty", Seq(), 3, 3, Seq(), 3, "dsfdsfdsf", "sdfsdfsdPuyOplzlR1idvfPkv5138g",
ArrayColWindow("2020-11-01T10:30:00Z", "2020-11-01T12:30:00Z"), null)),
"sfdfrdsdjn", "2020-10-31T13:01:04.813Z")
).toDS()
val output = inputTest.as[(Seq[ArrayCol],String,String)].flatMap{ case(x, y, z) => x.map((_, y, z))}
output.show(truncate=false)

you could just change
val ds = df1.as[(Array[StructType], String, String)]
to
val ds = df1.as[(Array[String], String, String)]
and you can get rid of the error and see the output you want.

Convert a JSON string to a struct column without schema in Spark

Spark: 3.0.0
Scala: 2.12.8
My data frame has a column with JSON string, and I want to create a new column from it with the StructType.
temp_json_string
{"name":"test","id":"12","category":[{"products":["A","B"],"displayName":"test_1","displayLabel":"test1"},{"products":["C"],"displayName":"test_2","displayLabel":"test2"}],"createdAt":"","createdBy":""}
root
|-- temp_json_string: string (nullable = true)
Formatted JSON:
{
"name":"test",
"id":"12",
"category":[
{
"products":[
"A",
"B"
],
"displayName":"test_1",
"displayLabel":"test1"
},
{
"products":[
"C"
],
"displayName":"test_2",
"displayLabel":"test2"
}
],
"createdAt":"",
"createdBy":""
}
I want to create a new column of type Struct so I tried:
dataFrame
.withColumn("temp_json_struct", struct(col("temp_json_string")))
.select("temp_json_struct")
Now, I get the schema as:
root
|-- temp_json_struct: struct (nullable = false)
| |-- temp_json_string: string (nullable = true)
Desired result:
root
|-- temp_json_struct: struct (nullable = false)
| |-- name: string (nullable = true)
| |-- category: array (nullable = true)
| | |-- products: array (nullable = true)
| | |-- displayName: string (nullable = true)
| | |-- displayLabel: string (nullable = true)
| |-- createdAt: timestamp (nullable = true)
| |-- updatedAt: timestamp (nullable = true)

json_str_col is the column that has JSON string. I had multiple files so that's why the fist line is iterating through each row to extract the schema. If you know your schema up front then just replace json_schema with that.
json_schema = spark.read.json(df.rdd.map(lambda row: row.json_str_col)).schema
df = df.withColumn('new_col', from_json(col('json_str_col'), json_schema))

// import spark implicits for conversion to dataset (.as[String])
import spark.implicits._
val df = ??? //create your dataframe having the 'temp_json_string' column
//convert Dataset[Row] aka Dataframe to Dataset[String]
val ds = df.select("temp_json_string").as[String]
//read as json
spark.read.json(ds)
Documentation

There at least two different ways to retrieve/discover the schema for a given JSON.
For the illustration, let's create some data first:
import org.apache.spark.sql.types.StructType
val jsData = Seq(
("""{
"name":"test","id":"12","category":[
{
"products":[
"A",
"B"
],
"displayName":"test_1",
"displayLabel":"test1"
},
{
"products":[
"C"
],
"displayName":"test_2",
"displayLabel":"test2"
}
],
"createdAt":"",
"createdBy":""}""")
)
Option 1: schema_of_json
The first option is to use the built-in function schema_of_json. The function will return the schema for the given JSON in DDL format:
val json = jsData.toDF("js").collect()(0).getString(0)
val ddlSchema: String = spark.sql(s"select schema_of_json('${json}')")
.collect()(0) //get 1st row
.getString(0) //get 1st col of the row as string
.replace("null", "string") //replace type with string, this occurs since you have "createdAt":""
// struct<category:array<struct<displayLabel:string,displayName:string,products:array<string>>>,createdAt:null,createdBy:null,id:string,name:string>
val schema: StructType = StructType.fromDDL(s"js_schema $ddlSchema")
Note that you would expect that schema_of_json would also work on the column level i.e: schema_of_json(js_col), unfortunately, this doesn't work as expected therefore we are forced to pass a string instead.
Option 2: use Spark JSON reader (recommended)
import org.apache.spark.sql.functions.from_json
val schema: StructType = spark.read.json(jsData.toDS).schema
// schema.printTreeString
// root
// |-- category: array (nullable = true)
// | |-- element: struct (containsNull = true)
// | | |-- displayLabel: string (nullable = true)
// | | |-- displayName: string (nullable = true)
// | | |-- products: array (nullable = true)
// | | | |-- element: string (containsNull = true)
// |-- createdAt: string (nullable = true)
// |-- createdBy: string (nullable = true)
// |-- id: string (nullable = true)
// |-- name: string (nullable = true)
As you can see, here we are producing a schema based on StructType and not a DDL string as in the previous case.
After discovering the schema we can move on to the next step which is converting the JSON data into a struct. To achieve that we will use from_json built-in function:
jsData.toDF("js")
.withColumn("temp_json_struct", from_json($"js", schema))
.printSchema()
// root
// |-- js: string (nullable = true)
// |-- temp_json_struct: struct (nullable = true)
// | |-- category: array (nullable = true)
// | | |-- element: struct (containsNull = true)
// | | | |-- displayLabel: string (nullable = true)
// | | | |-- displayName: string (nullable = true)
// | | | |-- products: array (nullable = true)
// | | | | |-- element: string (containsNull = true)
// | |-- createdAt: string (nullable = true)
// | |-- createdBy: string (nullable = true)
// | |-- id: string (nullable = true)
// | |-- name: string (nullable = true)

Apply a function to a column inside a structure of a Spark DataFrame, replacing that column

I cannot find exactly what I am looking for, so here it is my question. I fetch from MongoDb some data into a Spark Dataframe. The dataframe has the following schema (df.printSchema):
|-- flight: struct (nullable = true)
| |-- legs: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- arrival: timestamp (nullable = true)
| | | |-- departure: timestamp (nullable = true)
| |-- segments: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- arrival: timestamp (nullable = true)
| | | |-- departure: timestamp (nullable = true)
Do note the top-level structure, followed by an array, inside which I need to change my data.
For example:
{
"flight": {
"legs": [{
"departure": ISODate("2020-10-30T13:35:00.000Z"),
"arrival": ISODate("2020-10-30T14:47:00.000Z")
}
],
"segments": [{
"departure": ISODate("2020-10-30T13:35:00.000Z"),
"arrival": ISODate("2020-10-30T14:47:00.000Z")
}
]
}
}
I want to export this in Json, but for some business reason, I want the arrival dates to have a different format than the departure dates. For example, I may want to export the departure ISODate in ms from epoch, but not the arrival one.
To do so, I thought of applying a custom function to do the transformation:
// Here I can do any tranformation. I hope to replace the timestamp with the needed value
val doSomething: UserDefinedFunction = udf( (value: Seq[Timestamp]) => {
value.map(x => "doSomething" + x.getTime) }
)
val newDf = df.withColumn("flight.legs.departure",
doSomething(df.col("flight.legs.departure")))
But this simply returns a brand new column, containing an array of a single doSomething string.
{
"flight": {
"legs": [{
"arrival": "2020-10-30T14:47:00Z",
"departure": "2020-10-30T13:35:00Z"
}
],
"segments": [{
"arrival": "2020-10-30T14:47:00Z",
"departure": "2020-10-30T13:35:00Z",
}
]
},
"flight.legs.departure": ["doSomething1596268800000"]
}
And newDf.show(1)
+--------------------+---------------------+
| flight|flight.legs.departure|
+--------------------+---------------------+
|[[[182], 94, [202...| [doSomething15962...|
+--------------------+---------------------+
Instead of
{
...
"arrival": "2020-10-30T14:47:00Z",
//leg departure date that I changed
"departure": "doSomething1596268800000"
... // segments not affected in this example
"arrival": "2020-10-30T14:47:00Z",
"departure": "2020-10-30T13:35:00Z",
...
}
Any ideas how to proceed?
Edit - clarification:
Please bear in mind that my schema is way more complex than what shown above. For example, there is yet another top level data tag, so flight is below along with other information. Then inside flight, legs and segments there are multiple more elements, some that are also nested. I only focused on the ones that I needed to change.
I am saying this, because I would like the simplest solution that would scale. I.e. ideally one that would simply change the required elements without having to de-construct and that re-construct the whole nested structure. If we cannot avoid that, is using case classes the simplest solution?

Please check the code below.
Execution Time
With UDF : Time taken: 679 ms
Without UDF : Time taken: 1493 ms
Code With UDF
scala> :paste
// Entering paste mode (ctrl-D to finish)
// Creating UDF to update value inside array.
import java.text.SimpleDateFormat
val dateFormat = new SimpleDateFormat("yyyy-MM-dd'T'hh:mm:ss") // For me departure values are in string, so using this to convert sql timestmap.
val doSomething = udf((value: Seq[String]) => {
value.map(x => s"dosomething${dateFormat.parse(x).getTime}")
})
// Exiting paste mode, now interpreting.
import java.text.SimpleDateFormat
dateFormat: java.text.SimpleDateFormat = java.text.SimpleDateFormat#41bd83a
doSomething: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,ArrayType(StringType,true),Some(List(ArrayType(StringType,true))))
scala> :paste
// Entering paste mode (ctrl-D to finish)
spark.time {
val updated = df.select("flight.*").withColumn("legs",arrays_zip($"legs.arrival",doSomething($"legs.departure")).cast("array<struct<arrival:string,departure:string>>")).select(struct($"segments",$"legs").as("flight"))
updated.printSchema
updated.show(false)
}
// Exiting paste mode, now interpreting.
root
|-- flight: struct (nullable = false)
| |-- segments: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- arrival: string (nullable = true)
| | | |-- departure: string (nullable = true)
| |-- legs: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- arrival: string (nullable = true)
| | | |-- departure: string (nullable = true)
+-------------------------------------------------------------------------------------------------+
|flight |
+-------------------------------------------------------------------------------------------------+
|[[[2020-10-30T14:47:00, 2020-10-30T13:35:00]], [[2020-10-30T14:47:00, dosomething1604045100000]]]|
+-------------------------------------------------------------------------------------------------+
Time taken: 679 ms
scala>
Code Without UDF
scala> val df = spark.read.json(Seq("""{"flight": {"legs": [{"departure": "2020-10-30T13:35:00","arrival": "2020-10-30T14:47:00"}],"segments": [{"departure": "2020-10-30T13:35:00","arrival": "2020-10-30T14:47:00"}]}}""").toDS)
df: org.apache.spark.sql.DataFrame = [flight: struct<legs: array<struct<arrival:string,departure:string>>, segments: array<struct<arrival:string,departure:string>>>]
scala> df.printSchema
root
|-- flight: struct (nullable = true)
| |-- legs: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- arrival: string (nullable = true)
| | | |-- departure: string (nullable = true)
| |-- segments: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- arrival: string (nullable = true)
| | | |-- departure: string (nullable = true)
scala> df.show(false)
+--------------------------------------------------------------------------------------------+
|flight |
+--------------------------------------------------------------------------------------------+
|[[[2020-10-30T14:47:00, 2020-10-30T13:35:00]], [[2020-10-30T14:47:00, 2020-10-30T13:35:00]]]|
+--------------------------------------------------------------------------------------------+
scala> :paste
// Entering paste mode (ctrl-D to finish)
spark.time {
val updated= df
.select("flight.*")
.select($"segments",$"legs.arrival",$"legs.departure") // extracting legs struct column values.
.withColumn("departure",explode($"departure")) // exploding departure column
.withColumn("departure",concat_ws("-",lit("something"),$"departure".cast("timestamp").cast("long"))) // updating departure column values
.groupBy($"segments",$"arrival") // grouping columns except legs column
.agg(collect_list($"departure").as("departure")) // constructing list back
.select($"segments",arrays_zip($"arrival",$"departure").as("legs")) // construction arrival & departure columns using arrays_zip method.
.select(struct($"legs",$"segments").as("flight")) // finally creating flight by combining legs & segments columns.
updated.printSchema
updated.show(false)
}
// Exiting paste mode, now interpreting.
root
|-- flight: struct (nullable = false)
| |-- legs: array (nullable = true)
| | |-- element: struct (containsNull = false)
| | | |-- arrival: string (nullable = true)
| | | |-- departure: string (nullable = true)
| |-- segments: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- arrival: string (nullable = true)
| | | |-- departure: string (nullable = true)
+---------------------------------------------------------------------------------------------+
|flight |
+---------------------------------------------------------------------------------------------+
|[[[2020-10-30T14:47:00, something-1604045100]], [[2020-10-30T14:47:00, 2020-10-30T13:35:00]]]|
+---------------------------------------------------------------------------------------------+
Time taken: 1493 ms
scala>

Try this
scala> df.show(false)
+----------------------------------------------------------------------------------------------------------------+
|flight |
+----------------------------------------------------------------------------------------------------------------+
|[[[2020-10-30T13:35:00.000Z, 2020-10-30T14:47:00.000Z]], [[2020-10-30T13:35:00.000Z, 2020-10-30T14:47:00.000Z]]]|
|[[[2020-10-25T13:15:00.000Z, 2020-10-25T14:37:00.000Z]], [[2020-10-25T13:15:00.000Z, 2020-10-25T14:37:00.000Z]]]|
+----------------------------------------------------------------------------------------------------------------+
scala>
scala> df.printSchema
root
|-- flight: struct (nullable = true)
| |-- legs: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- dep: string (nullable = true)
| | | |-- arr: string (nullable = true)
| |-- segments: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- dep: string (nullable = true)
| | | |-- arr: string (nullable = true)
scala>
scala> val myudf = udf(
| (arrs:Seq[String]) => {
| arrs.map("something" ++ _)
| }
| )
myudf: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,ArrayType(StringType,true),Some(List(ArrayType(StringType,true))))
scala> val df2 = df.select($"flight", myudf($"flight.legs.arr") as "editedArrs")
df2: org.apache.spark.sql.DataFrame = [flight: struct<legs: array<struct<dep:string,arr:string>>, segments: array<struct<dep:string,arr:string>>>, editedArrs: array<string>]
scala> df2.printSchema
root
|-- flight: struct (nullable = true)
| |-- legs: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- dep: string (nullable = true)
| | | |-- arr: string (nullable = true)
| |-- segments: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- dep: string (nullable = true)
| | | |-- arr: string (nullable = true)
|-- editedArrs: array (nullable = true)
| |-- element: string (containsNull = true)
scala> df2.show(false)
+----------------------------------------------------------------------------------------------------------------+-----------------------------------+
|flight |editedArrs |
+----------------------------------------------------------------------------------------------------------------+-----------------------------------+
|[[[2020-10-30T13:35:00.000Z, 2020-10-30T14:47:00.000Z]], [[2020-10-30T13:35:00.000Z, 2020-10-30T14:47:00.000Z]]]|[something2020-10-30T14:47:00.000Z]|
|[[[2020-10-25T13:15:00.000Z, 2020-10-25T14:37:00.000Z]], [[2020-10-25T13:15:00.000Z, 2020-10-25T14:37:00.000Z]]]|[something2020-10-25T14:37:00.000Z]|
+----------------------------------------------------------------------------------------------------------------+-----------------------------------+
scala>
scala>
scala> val df3 = df2.select(struct(arrays_zip($"flight.legs.dep", $"editedArrs") cast "array<struct<dep:string,arr:string>>" as "legs", $"flight.segments") as "flight")
df3: org.apache.spark.sql.DataFrame = [flight: struct<legs: array<struct<dep:string,arr:string>>, segments: array<struct<dep:string,arr:string>>>]
scala>
scala> df3.printSchema
root
|-- flight: struct (nullable = false)
| |-- legs: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- dep: string (nullable = true)
| | | |-- arr: string (nullable = true)
| |-- segments: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- dep: string (nullable = true)
| | | |-- arr: string (nullable = true)
scala>
scala> df3.show(false)
+-------------------------------------------------------------------------------------------------------------------------+
|flight |
+-------------------------------------------------------------------------------------------------------------------------+
|[[[2020-10-30T13:35:00.000Z, something2020-10-30T14:47:00.000Z]], [[2020-10-30T13:35:00.000Z, 2020-10-30T14:47:00.000Z]]]|
|[[[2020-10-25T13:15:00.000Z, something2020-10-25T14:37:00.000Z]], [[2020-10-25T13:15:00.000Z, 2020-10-25T14:37:00.000Z]]]|
+-------------------------------------------------------------------------------------------------------------------------+

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Spark SQL data frame - scala

Related

How do I check if column present in the Spark DataFrame

How to flatten Array of WrappedArray of structs in scala

Spark - Flatten Array of Structs using flatMap

Convert a JSON string to a struct column without schema in Spark

Apply a function to a column inside a structure of a Spark DataFrame, replacing that column

Categories

Resources