How to Process Nested struct with Array element in Spark RDD - scala

I am using Spark SQL to process nested data that contains arrays. Here is a sample record:
{
  "isActive": true,
  "sample": {
    "someitem": {
      "thesearecool": [
        { "neat": "wow" },
        { "neat": "tubular" }
      ]
    },
    "coolcolors": [
      { "color": "red", "hex": "ff0000" },
      { "color": "blue", "hex": "0000ff" }
    ]
  }
}
Schema:
root
|-- isActive: boolean (nullable = true)
|-- sample: struct (nullable = true)
| |-- coolcolors: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- color: string (nullable = true)
| | | |-- hex: string (nullable = true)
| |-- someitem: struct (nullable = true)
| | |-- thesearecool: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- neat: string (nullable = true)
Code:
val nested1 = nested
  .withColumn("color_data", explode($"sample.coolcolors"))
  .select("isActive", "color_data.color", "color_data.hex", "sample.someitem.thesearecool.neat")

val nested2 = nested
  .withColumn("thesearecool_data", explode($"sample.someitem.thesearecool"))
  .select("thesearecool_data.neat")
Sample output:
+--------+-----+------+--------------+
|isActive|color|hex |neat |
+--------+-----+------+--------------+
|true |red |ff0000|[wow, tubular]|
|true |blue |0000ff|[wow, tubular]|
+--------+-----+------+--------------+
+-------+
|neat |
+-------+
|wow |
|tubular|
+-------+
I need to combine these into a single result, one row per (color, neat) combination.

Explode twice, and select what you want.
df.withColumn("coolcolors", explode($"sample.coolcolors"))
.withColumn("thesearecool", explode($"sample.someitem.thesearecool"))
.select("isActive", "coolcolors.color", "coolcolors.hex", "thesearecool.neat").show
Then,
+--------+-----+------+-------+
|isActive|color| hex| neat|
+--------+-----+------+-------+
| true| red|ff0000| wow|
| true| red|ff0000|tubular|
| true| blue|0000ff| wow|
| true| blue|0000ff|tubular|
+--------+-----+------+-------+
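Note that the two explodes produce the cartesian product of the two arrays. One caveat: explode drops rows whose array is null or empty, so if either array can be missing, explode_outer keeps the row with nulls instead. A minimal sketch:
import org.apache.spark.sql.functions.explode_outer

// Same shape as above, but rows with a null/empty coolcolors or
// thesearecool array survive with null fields instead of disappearing.
df.withColumn("coolcolors", explode_outer($"sample.coolcolors"))
  .withColumn("thesearecool", explode_outer($"sample.someitem.thesearecool"))
  .select("isActive", "coolcolors.color", "coolcolors.hex", "thesearecool.neat")
  .show()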

Related

Iterate dataframe column Array of Array in Spark Scala

I am trying to iterate over an array of arrays stored as a column in a Spark DataFrame, and I'm looking for the best way to do this.
Schema:
root
|-- Animal: struct (nullable = true)
| |-- Species: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- mammal: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- description: string (nullable = true)
Currently I am using this logic. This only gets the first array.
df.select(
col("Animal.Species").getItem(0).getItem("mammal").getItem("description")
)
Pseudo Logic:
col("Animal.Species").getItem(0).getItem("mammal").getItem("description")
+
col("Animal.Species").getItem(1).getItem("mammal").getItem("description")
+
col("Animal.Species").getItem(2).getItem("mammal").getItem("description")
+
col("Animal.Species").getItem(...).getItem("mammal").getItem("description")
Desired Example Output (flattened elements as string)
llama, sheep, rabbit, hare
Not obvious, but you can use . (or the getField method of Column) to select "through" arrays of structs. Selecting Animal.Species.mammal returns an array of array of the innermost structs. Unfortunately, this array of array prevents you from being able to drill further down with something like Animal.Species.mammal.description, so you need to flatten it first, then use getField().
If I understand your schema correctly, the following JSON should be a valid input:
{
  "Animal": {
    "Species": [
      {
        "mammal": [
          { "description": "llama" },
          { "description": "sheep" }
        ]
      },
      {
        "mammal": [
          { "description": "rabbit" },
          { "description": "hare" }
        ]
      }
    ]
  }
}
val df = spark.read.json("data.json")
df.printSchema
// root
// |-- Animal: struct (nullable = true)
// | |-- Species: array (nullable = true)
// | | |-- element: struct (containsNull = true)
// | | | |-- mammal: array (nullable = true)
// | | | | |-- element: struct (containsNull = true)
// | | | | | |-- description: string (nullable = true)
df.select("Animal.Species.mammal").show(false)
// +----------------------------------------+
// |mammal |
// +----------------------------------------+
// |[[{llama}, {sheep}], [{rabbit}, {hare}]]|
// +----------------------------------------+
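As an aside, this doubly-nested result is exactly why the direct drill-down mentioned above fails; a quick sketch to confirm (wrapped in Try so it is safe to run):
import scala.util.Try

// Field extraction cannot recurse through the inner array level, so the
// direct path raises an AnalysisException at analysis time.
Try(df.select("Animal.Species.mammal.description")).isFailure
// res: Boolean = true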
df.select(flatten(col("Animal.Species.mammal"))).show(false)
// +------------------------------------+
// |flatten(Animal.Species.mammal) |
// +------------------------------------+
// |[{llama}, {sheep}, {rabbit}, {hare}]|
// +------------------------------------+
This is now an array of structs and you can use getField("description") to obtain the array of interest:
df.select(flatten(col("Animal.Species.mammal")).getField("description")).show(false)
// +--------------------------------------------------------+
// |flatten(Animal.Species.mammal AS mammal#173).description|
// +--------------------------------------------------------+
// |[llama, sheep, rabbit, hare] |
// +--------------------------------------------------------+
Finally, array_join with separator ", " can be used to obtain the desired string:
df.select(
array_join(
flatten(col("Animal.Species.mammal")).getField("description"),
", "
) as "animals"
).show(false)
// +--------------------------+
// |animals |
// +--------------------------+
// |llama, sheep, rabbit, hare|
// +--------------------------+
Alternatively, you can apply explode twice: first on Animal.Species, and then on the mammal field of the result:
import org.apache.spark.sql.functions._
df.withColumn("tmp", explode(col("Animal.Species")))
.withColumn("tmp", explode(col("tmp.mammal")))
.select("tmp.description")
.show()
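This yields one row per description. If the goal is still the single comma-separated string, here is a sketch of folding the exploded rows back together (hedged: collect_list does not guarantee element order):
import org.apache.spark.sql.functions.{array_join, collect_list, explode}

// Explode down to one row per description, then aggregate back into a
// single comma-separated string.
df.withColumn("tmp", explode(col("Animal.Species")))
  .withColumn("tmp", explode(col("tmp.mammal")))
  .agg(array_join(collect_list(col("tmp.description")), ", ") as "animals")
  .show(false)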

Pyspark convert Json to DF

I have this .json file and I need to convert it to a DataFrame. The file looks like this:
{
  "id": "517379",
  "created_at": "2020-11-16T04:25:03Z",
  "company": "1707",
  "invoice": [
    {
      "invoice_id": "4102",
      "date": "2020-11-16T04:25:03Z",
      "id": "517379",
      "cantidad": "21992.47",
      "extra_data": {
        "column": "ASDFG",
        "credito": "Crédito"
      }
    }
  ]
}
I need it flattened into a DataFrame like this:
id     | created_at           | company | invoice_id | date                 | id     | cantidad | column | credito
517379 | 2020-11-16T04:25:03Z | 1707    | 4102       | 2020-11-16T04:25:03Z | 517379 | 21992.47 | ASDFG  | Crédito
By default, Spark tries to parse each line as a separate JSON document, so if your file spreads JSON objects across multiple lines you will get a DataFrame with a single _corrupt_record column. To solve this, set the read option multiline to true.
test.json
[
  {
    "id": "517379",
    "created_at": "2020-11-16T04:25:03Z",
    "company": "1707",
    "invoice": [
      {
        "invoice_id": "4102",
        "date": "2020-11-16T04:25:03Z",
        "id": "517379",
        "cantidad": "21992.47",
        "extra_data": {
          "column": "ASDFG",
          "credito": "Crédito"
        }
      }
    ]
  },
  {
    "id": "1234",
    "created_at": "2020-11-16T04:25:03Z",
    "company": "1707",
    "invoice": [
      {
        "invoice_id": "4102",
        "date": "2020-11-16T04:25:03Z",
        "id": "517379",
        "cantidad": "21992.47",
        "extra_data": {
          "column": "ASDFG",
          "credito": "Crédito"
        }
      }
    ]
  }
]
PySpark code:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.option("multiline", "true").json("test.json")
df.printSchema()
root
|-- company: string (nullable = true)
|-- created_at: string (nullable = true)
|-- id: string (nullable = true)
|-- invoice: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- cantidad: string (nullable = true)
| | |-- date: string (nullable = true)
| | |-- extra_data: struct (nullable = true)
| | | |-- column: string (nullable = true)
| | | |-- credito: string (nullable = true)
| | |-- id: string (nullable = true)
| | |-- invoice_id: string (nullable = true)
df.show()
+-------+--------------------+------+--------------------+
|company| created_at| id| invoice|
+-------+--------------------+------+--------------------+
| 1707|2020-11-16T04:25:03Z|517379|[{21992.47, 2020-...|
| 1707|2020-11-16T04:25:03Z| 1234|[{21992.47, 2020-...|
+-------+--------------------+------+--------------------+
If you know the schema, it's better to provide it: this skips the schema-inference pass and parses each field as the type you want.
from pyspark.sql import types as pyspark_types
from pyspark.sql import functions as pyspark_functions

schema = pyspark_types.StructType(fields=[
    pyspark_types.StructField("id", pyspark_types.StringType()),
    pyspark_types.StructField("created_at", pyspark_types.TimestampType()),
    pyspark_types.StructField("company", pyspark_types.StringType()),
    pyspark_types.StructField("invoice", pyspark_types.ArrayType(
        pyspark_types.StructType(fields=[
            pyspark_types.StructField("invoice_id", pyspark_types.StringType()),
            pyspark_types.StructField("date", pyspark_types.TimestampType()),
            pyspark_types.StructField("id", pyspark_types.StringType()),
            pyspark_types.StructField("cantidad", pyspark_types.StringType()),
            pyspark_types.StructField("extra_data", pyspark_types.StructType(fields=[
                pyspark_types.StructField("column", pyspark_types.StringType()),
                pyspark_types.StructField("credito", pyspark_types.StringType()),
            ])),
        ])
    )),
])
df = spark.read.option("multiline","true").json("test.json", schema=schema)
df.printSchema()
root
|-- id: string (nullable = true)
|-- created_at: timestamp (nullable = true)
|-- company: string (nullable = true)
|-- invoice: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- invoice_id: string (nullable = true)
| | |-- date: timestamp (nullable = true)
| | |-- id: string (nullable = true)
| | |-- cantidad: string (nullable = true)
| | |-- extra_data: struct (nullable = true)
| | | |-- column: string (nullable = true)
| | | |-- credito: string (nullable = true)
df.show()
+------+-------------------+-------+--------------------+
| id| created_at|company| invoice|
+------+-------------------+-------+--------------------+
|517379|2020-11-16 04:25:03| 1707|[{4102, 2020-11-1...|
| 1234|2020-11-16 04:25:03| 1707|[{4102, 2020-11-1...|
+------+-------------------+-------+--------------------+
# To get a completely denormalized dataframe (note that both the top-level
# id and invoice.id are kept, so the result contains two columns named "id")
df = df.withColumn("invoice", pyspark_functions.explode_outer("invoice")) \
       .select("id", "company", "created_at",
               "invoice.invoice_id", "invoice.date", "invoice.id", "invoice.cantidad",
               "invoice.extra_data.*")
In the second example, Spark doesn't try to detect the schema, since it's provided; instead it casts the fields to the types you specified (created_at and date are timestamps instead of strings).

Apply a function to a column inside a structure of a Spark DataFrame, replacing that column

I cannot find exactly what I am looking for, so here is my question. I fetch some data from MongoDB into a Spark DataFrame. The DataFrame has the following schema (df.printSchema):
|-- flight: struct (nullable = true)
| |-- legs: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- arrival: timestamp (nullable = true)
| | | |-- departure: timestamp (nullable = true)
| |-- segments: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- arrival: timestamp (nullable = true)
| | | |-- departure: timestamp (nullable = true)
Do note the top-level structure, followed by an array, inside which I need to change my data.
For example:
{
"flight": {
"legs": [{
"departure": ISODate("2020-10-30T13:35:00.000Z"),
"arrival": ISODate("2020-10-30T14:47:00.000Z")
}
],
"segments": [{
"departure": ISODate("2020-10-30T13:35:00.000Z"),
"arrival": ISODate("2020-10-30T14:47:00.000Z")
}
]
}
}
I want to export this as JSON, but for business reasons I want the arrival dates to have a different format than the departure dates. For example, I may want to export the departure ISODate as milliseconds from the epoch, but not the arrival one.
To do so, I thought of applying a custom function to do the transformation:
// Here I can do any transformation. I hope to replace the timestamp with the needed value.
val doSomething: UserDefinedFunction = udf((value: Seq[Timestamp]) => {
  value.map(x => "doSomething" + x.getTime)
})
val newDf = df.withColumn("flight.legs.departure",
doSomething(df.col("flight.legs.departure")))
But this simply returns a brand new column, containing an array of a single doSomething string.
{
"flight": {
"legs": [{
"arrival": "2020-10-30T14:47:00Z",
"departure": "2020-10-30T13:35:00Z"
}
],
"segments": [{
"arrival": "2020-10-30T14:47:00Z",
"departure": "2020-10-30T13:35:00Z",
}
]
},
"flight.legs.departure": ["doSomething1596268800000"]
}
And newDf.show(1)
+--------------------+---------------------+
| flight|flight.legs.departure|
+--------------------+---------------------+
|[[[182], 94, [202...| [doSomething15962...|
+--------------------+---------------------+
Instead of
{
...
"arrival": "2020-10-30T14:47:00Z",
//leg departure date that I changed
"departure": "doSomething1596268800000"
... // segments not affected in this example
"arrival": "2020-10-30T14:47:00Z",
"departure": "2020-10-30T13:35:00Z",
...
}
Any ideas how to proceed?
Edit - clarification:
Please bear in mind that my schema is way more complex than what shown above. For example, there is yet another top level data tag, so flight is below along with other information. Then inside flight, legs and segments there are multiple more elements, some that are also nested. I only focused on the ones that I needed to change.
I am saying this because I would like the simplest solution that would scale; ideally one that simply changes the required elements without having to de-construct and then re-construct the whole nested structure. If we cannot avoid that, is using case classes the simplest solution?
Please check the code below.
Execution Time
With UDF : Time taken: 679 ms
Without UDF : Time taken: 1493 ms
Code With UDF
scala> :paste
// Entering paste mode (ctrl-D to finish)
// Creating UDF to update value inside array.
import java.text.SimpleDateFormat
val dateFormat = new SimpleDateFormat("yyyy-MM-dd'T'hh:mm:ss") // For me the departure values are strings, so I use this to parse them into timestamps.
val doSomething = udf((value: Seq[String]) => {
value.map(x => s"dosomething${dateFormat.parse(x).getTime}")
})
// Exiting paste mode, now interpreting.
import java.text.SimpleDateFormat
dateFormat: java.text.SimpleDateFormat = java.text.SimpleDateFormat@41bd83a
doSomething: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,ArrayType(StringType,true),Some(List(ArrayType(StringType,true))))
scala> :paste
// Entering paste mode (ctrl-D to finish)
spark.time {
val updated = df.select("flight.*")
  .withColumn("legs",
    arrays_zip($"legs.arrival", doSomething($"legs.departure"))
      .cast("array<struct<arrival:string,departure:string>>"))
  .select(struct($"segments", $"legs").as("flight"))
updated.printSchema
updated.show(false)
}
// Exiting paste mode, now interpreting.
root
|-- flight: struct (nullable = false)
| |-- segments: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- arrival: string (nullable = true)
| | | |-- departure: string (nullable = true)
| |-- legs: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- arrival: string (nullable = true)
| | | |-- departure: string (nullable = true)
+-------------------------------------------------------------------------------------------------+
|flight |
+-------------------------------------------------------------------------------------------------+
|[[[2020-10-30T14:47:00, 2020-10-30T13:35:00]], [[2020-10-30T14:47:00, dosomething1604045100000]]]|
+-------------------------------------------------------------------------------------------------+
Time taken: 679 ms
scala>
Code Without UDF
scala> val df = spark.read.json(Seq("""{"flight": {"legs": [{"departure": "2020-10-30T13:35:00","arrival": "2020-10-30T14:47:00"}],"segments": [{"departure": "2020-10-30T13:35:00","arrival": "2020-10-30T14:47:00"}]}}""").toDS)
df: org.apache.spark.sql.DataFrame = [flight: struct<legs: array<struct<arrival:string,departure:string>>, segments: array<struct<arrival:string,departure:string>>>]
scala> df.printSchema
root
|-- flight: struct (nullable = true)
| |-- legs: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- arrival: string (nullable = true)
| | | |-- departure: string (nullable = true)
| |-- segments: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- arrival: string (nullable = true)
| | | |-- departure: string (nullable = true)
scala> df.show(false)
+--------------------------------------------------------------------------------------------+
|flight |
+--------------------------------------------------------------------------------------------+
|[[[2020-10-30T14:47:00, 2020-10-30T13:35:00]], [[2020-10-30T14:47:00, 2020-10-30T13:35:00]]]|
+--------------------------------------------------------------------------------------------+
scala> :paste
// Entering paste mode (ctrl-D to finish)
spark.time {
val updated= df
.select("flight.*")
.select($"segments",$"legs.arrival",$"legs.departure") // extracting legs struct column values.
.withColumn("departure",explode($"departure")) // exploding departure column
.withColumn("departure",concat_ws("-",lit("something"),$"departure".cast("timestamp").cast("long"))) // updating departure column values
.groupBy($"segments",$"arrival") // grouping columns except legs column
.agg(collect_list($"departure").as("departure")) // constructing list back
.select($"segments",arrays_zip($"arrival",$"departure").as("legs")) // construction arrival & departure columns using arrays_zip method.
.select(struct($"legs",$"segments").as("flight")) // finally creating flight by combining legs & segments columns.
updated.printSchema
updated.show(false)
}
// Exiting paste mode, now interpreting.
root
|-- flight: struct (nullable = false)
| |-- legs: array (nullable = true)
| | |-- element: struct (containsNull = false)
| | | |-- arrival: string (nullable = true)
| | | |-- departure: string (nullable = true)
| |-- segments: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- arrival: string (nullable = true)
| | | |-- departure: string (nullable = true)
+---------------------------------------------------------------------------------------------+
|flight |
+---------------------------------------------------------------------------------------------+
|[[[2020-10-30T14:47:00, something-1604045100]], [[2020-10-30T14:47:00, 2020-10-30T13:35:00]]]|
+---------------------------------------------------------------------------------------------+
Time taken: 1493 ms
scala>
Try this
scala> df.show(false)
+----------------------------------------------------------------------------------------------------------------+
|flight |
+----------------------------------------------------------------------------------------------------------------+
|[[[2020-10-30T13:35:00.000Z, 2020-10-30T14:47:00.000Z]], [[2020-10-30T13:35:00.000Z, 2020-10-30T14:47:00.000Z]]]|
|[[[2020-10-25T13:15:00.000Z, 2020-10-25T14:37:00.000Z]], [[2020-10-25T13:15:00.000Z, 2020-10-25T14:37:00.000Z]]]|
+----------------------------------------------------------------------------------------------------------------+
scala>
scala> df.printSchema
root
|-- flight: struct (nullable = true)
| |-- legs: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- dep: string (nullable = true)
| | | |-- arr: string (nullable = true)
| |-- segments: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- dep: string (nullable = true)
| | | |-- arr: string (nullable = true)
scala>
scala> val myudf = udf(
| (arrs:Seq[String]) => {
| arrs.map("something" ++ _)
| }
| )
myudf: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,ArrayType(StringType,true),Some(List(ArrayType(StringType,true))))
scala> val df2 = df.select($"flight", myudf($"flight.legs.arr") as "editedArrs")
df2: org.apache.spark.sql.DataFrame = [flight: struct<legs: array<struct<dep:string,arr:string>>, segments: array<struct<dep:string,arr:string>>>, editedArrs: array<string>]
scala> df2.printSchema
root
|-- flight: struct (nullable = true)
| |-- legs: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- dep: string (nullable = true)
| | | |-- arr: string (nullable = true)
| |-- segments: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- dep: string (nullable = true)
| | | |-- arr: string (nullable = true)
|-- editedArrs: array (nullable = true)
| |-- element: string (containsNull = true)
scala> df2.show(false)
+----------------------------------------------------------------------------------------------------------------+-----------------------------------+
|flight |editedArrs |
+----------------------------------------------------------------------------------------------------------------+-----------------------------------+
|[[[2020-10-30T13:35:00.000Z, 2020-10-30T14:47:00.000Z]], [[2020-10-30T13:35:00.000Z, 2020-10-30T14:47:00.000Z]]]|[something2020-10-30T14:47:00.000Z]|
|[[[2020-10-25T13:15:00.000Z, 2020-10-25T14:37:00.000Z]], [[2020-10-25T13:15:00.000Z, 2020-10-25T14:37:00.000Z]]]|[something2020-10-25T14:37:00.000Z]|
+----------------------------------------------------------------------------------------------------------------+-----------------------------------+
scala>
scala>
scala> val df3 = df2.select(struct(arrays_zip($"flight.legs.dep", $"editedArrs") cast "array<struct<dep:string,arr:string>>" as "legs", $"flight.segments") as "flight")
df3: org.apache.spark.sql.DataFrame = [flight: struct<legs: array<struct<dep:string,arr:string>>, segments: array<struct<dep:string,arr:string>>>]
scala>
scala> df3.printSchema
root
|-- flight: struct (nullable = false)
| |-- legs: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- dep: string (nullable = true)
| | | |-- arr: string (nullable = true)
| |-- segments: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- dep: string (nullable = true)
| | | |-- arr: string (nullable = true)
scala>
scala> df3.show(false)
+-------------------------------------------------------------------------------------------------------------------------+
|flight |
+-------------------------------------------------------------------------------------------------------------------------+
|[[[2020-10-30T13:35:00.000Z, something2020-10-30T14:47:00.000Z]], [[2020-10-30T13:35:00.000Z, 2020-10-30T14:47:00.000Z]]]|
|[[[2020-10-25T13:15:00.000Z, something2020-10-25T14:37:00.000Z]], [[2020-10-25T13:15:00.000Z, 2020-10-25T14:37:00.000Z]]]|
+-------------------------------------------------------------------------------------------------------------------------+
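As a side note for newer Spark versions: since 3.0 the Scala DSL exposes the transform higher-order function (available via expr in 2.4), which rewrites array elements in place without a UDF and without manually zipping the struct back together. A hedged sketch against the question's original schema:
import org.apache.spark.sql.functions._

// Rebuild flight.legs element by element, rewriting only departure; segments
// pass through untouched. Assumes Spark 3.0+ for the Column-lambda form of
// transform(). Note: casting a timestamp to long yields epoch seconds,
// whereas the UDF's getTime yielded milliseconds.
val updated = df.withColumn(
  "flight",
  struct(
    transform($"flight.legs", leg => struct(
      leg("arrival").as("arrival"),
      concat(lit("doSomething"), leg("departure").cast("long").cast("string")).as("departure")
    )).as("legs"),
    $"flight.segments".as("segments")
  )
)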

How to access elements in an ArrayType in a writeStream?

I am building up a schema to accept some data streaming in. It has an ArrayType with some elements. Here is my StructType with the ArrayType:
val innerBody = StructType(
  StructField("value", LongType, false) ::
  StructField("spent", BooleanType, false) ::
  StructField("tx_index", LongType, false) :: Nil)

val prev_out = StructType(StructField("prev_out", innerBody, false) :: Nil)

val body = StructType(
  StructField("inputs", ArrayType(prev_out, false), false) ::
  StructField("out", ArrayType(innerBody, false), false) :: Nil)

val schema = StructType(StructField("x", body, false) :: Nil)
This builds a schema like:
root
|-- bit: struct (nullable = true)
| |-- x: struct (nullable = false)
| | |-- inputs: array (nullable = false)
| | | |-- element: struct (containsNull = false)
| | | | |-- prev_out: struct (nullable = false)
| | | | | |-- value: long (nullable = false)
| | | | | |-- spent: boolean (nullable = false)
| | | | | |-- tx_index: long (nullable = false)
| | |-- out: array (nullable = false)
| | | |-- element: struct (containsNull = false)
| | | | |-- value: long (nullable = false)
| | | | |-- spent: boolean (nullable = false)
| | | | |-- tx_index: long (nullable = false)
I am trying to select the value from the "value element" in schema as it is streaming in. I am using the writeStream sink.
val parsed = df.select("bit.x.inputs.element.prev_out.value")
.writeStream.format("console").start()
I have the code above, but it gives an error.
Message: cannot resolve 'bit.x.inputs.element.prev_out.value' given
input columns: [key, value, timestamp, partition, offset,
timestampType, topic];;
How can I access the "value" element in this schema?
If you have a DataFrame like this, an explode followed by a select will help you.
df.printSchema()
//root
//|-- bit: struct (nullable = true)
//| |-- x: struct (nullable = true)
//| | |-- inputs: array (nullable = true)
//| | | |-- element: struct (containsNull = true)
//| | | | |-- prev_out: struct (nullable = true)
//| | | | | |-- spent: boolean (nullable = true)
//| | | | | |-- tx_index: long (nullable = true)
//| | | | | |-- value: long (nullable = true)
import org.apache.spark.sql.functions._
val intermediateDf: DataFrame = df.select(explode(col("bit.x.inputs")).as("interCol"))
intermediateDf.printSchema()
//root
//|-- interCol: struct (nullable = true)
//| |-- prev_out: struct (nullable = true)
//| | |-- spent: boolean (nullable = true)
//| | |-- tx_index: long (nullable = true)
//| | |-- value: long (nullable = true)
val finalDf: DataFrame = intermediateDf.select(col("interCol.prev_out.value").as("value"))
finalDf.printSchema()
//root
//|-- value: long (nullable = true)
finalDf.show()
//+-----------+
//| value|
//+-----------+
//|12347628746|
//|12347628746|
//+-----------+
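A closing note on the original error: the listed input columns [key, value, timestamp, partition, offset, timestampType, topic] are the raw columns of a Kafka source, so the JSON payload must be parsed with from_json before any nested field exists to select. A minimal sketch, assuming a Kafka stream whose value carries the JSON and reusing the schema built in the question (aliased as bit to match the paths above):
import org.apache.spark.sql.functions._

// Parse the raw Kafka value into the structured schema first; only then do
// the nested fields become addressable and can be exploded as shown above.
val query = df
  .select(from_json(col("value").cast("string"), schema).as("bit"))
  .select(explode(col("bit.x.inputs")).as("input"))
  .select(col("input.prev_out.value"))
  .writeStream
  .format("console")
  .start()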

Pivot spark multilevel Dataset

I have a Dataset in Spark with this schema:
root
|-- from: struct (nullable = false)
| |-- id: string (nullable = true)
| |-- name: string (nullable = true)
| |-- tags: string (nullable = true)
|-- v1: struct (nullable = false)
| |-- id: string (nullable = true)
| |-- name: string (nullable = true)
| |-- tags: string (nullable = true)
|-- v2: struct (nullable = false)
| |-- id: string (nullable = true)
| |-- name: string (nullable = true)
| |-- tags: string (nullable = true)
|-- v3: struct (nullable = false)
| |-- id: string (nullable = true)
| |-- name: string (nullable = true)
| |-- tags: string (nullable = true)
|-- to: struct (nullable = false)
| |-- id: string (nullable = true)
| |-- name: string (nullable = true)
| |-- tags: string (nullable = true)
How can I make a table (with only 3 columns: id, name, tags) from this Dataset in Scala?
Just combine all the columns into an array, explode and select all nested fields:
import org.apache.spark.sql.functions.{array, col, explode}
import spark.implicits._  // already in scope in spark-shell

case class Vertex(id: String, name: String, tags: String)

val df = Seq((
  Vertex("1", "from", "a"), Vertex("2", "V1", "b"), Vertex("3", "V2", "c"),
  Vertex("4", "v3", "d"), Vertex("5", "to", "e")
)).toDF("from", "v1", "v2", "v3", "to")

df.select(explode(array(df.columns map col: _*)).alias("col")).select("col.*")
with the result as follows:
+---+----+----+
| id|name|tags|
+---+----+----+
| 1|from| a|
| 2| V1| b|
| 3| V2| c|
| 4| v3| d|
| 5| to| e|
+---+----+----+
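One hedged caveat: array() requires all five struct columns to have exactly the same type. If the structs ever diverge (extra fields, different field order), here is a sketch of normalizing them first:
import org.apache.spark.sql.functions.{array, col, explode, struct}

// Rebuild each top-level struct with just the three shared fields, in the
// same order, so array() can stack them.
val aligned = df.columns.map(c =>
  struct(col(s"$c.id"), col(s"$c.name"), col(s"$c.tags")))

df.select(explode(array(aligned: _*)).alias("col")).select("col.*")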