Check the code below. It generates a DataFrame with ambiguous column names if duplicate keys are present. How should we modify the code to add the parent column name as a prefix?
I added another column with JSON data.
scala> val df = Seq(
(77, "email1", """{"key1":38,"key3":39}""","""{"name":"aaa","age":10}"""),
(78, "email2", """{"key1":38,"key4":39}""","""{"name":"bbb","age":20}"""),
(178, "email21", """{"key1":"when string","key4":36, "key6":"test", "key10":false }""","""{"name":"ccc","age":30}"""),
(179, "email8", """{"sub1":"qwerty","sub2":["42"]}""","""{"name":"ddd","age":40}"""),
(180, "email8", """{"sub1":"qwerty","sub2":["42", "56", "test"]}""","""{"name":"eee","age":50}""")
).toDF("id", "name", "colJson","personInfo")
scala> df.printSchema
root
|-- id: integer (nullable = false)
|-- name: string (nullable = true)
|-- colJson: string (nullable = true)
|-- personInfo: string (nullable = true)
scala> df.show(false)
+---+-------+---------------------------------------------------------------+-----------------------+
|id |name |colJson |personInfo |
+---+-------+---------------------------------------------------------------+-----------------------+
|77 |email1 |{"key1":38,"key3":39} |{"name":"aaa","age":10}|
|78 |email2 |{"key1":38,"key4":39} |{"name":"bbb","age":20}|
|178|email21|{"key1":"when string","key4":36, "key6":"test", "key10":false }|{"name":"ccc","age":30}|
|179|email8 |{"sub1":"qwerty","sub2":["42"]} |{"name":"ddd","age":40}|
|180|email8 |{"sub1":"qwerty","sub2":["42", "56", "test"]} |{"name":"eee","age":50}|
+---+-------+---------------------------------------------------------------+-----------------------+
I created a fromJson implicit function. You can pass multiple columns to it, and it will parse the JSON and extract the columns.
scala> :paste
// Entering paste mode (ctrl-D to finish)
import org.apache.spark.sql.{Column, DataFrame, Row}
import org.apache.spark.sql.functions.from_json
implicit class DFHelper(inDF: DataFrame) {
  import inDF.sparkSession.implicits._
  def fromJson(columns: Column*): DataFrame = {
    // Infer a schema for each JSON column by reading it as a JSON dataset.
    val schemas = columns.map(column => (column, inDF.sparkSession.read.json(inDF.select(column).as[String]).schema))
    // Replace each JSON string column with its parsed struct.
    val mdf = schemas.foldLeft(inDF)((df, schema) => {
      df.withColumn(schema._1.toString(), from_json(schema._1, schema._2))
    })
    // Flatten struct columns with <col>.*; this is where duplicate keys become ambiguous.
    mdf.selectExpr(mdf.schema.map(c => if (c.dataType.typeName == "struct") s"${c.name}.*" else c.name): _*)
  }
}
// Exiting paste mode, now interpreting.
import org.apache.spark.sql.{Column, DataFrame, Row}
import org.apache.spark.sql.functions.from_json
defined class DFHelper
scala> df.fromJson($"colJson",$"personInfo").show(false)
+---+-------+-----------+-----+----+----+----+------+--------------+---+----+
|id |name |key1 |key10|key3|key4|key6|sub1 |sub2 |age|name|
+---+-------+-----------+-----+----+----+----+------+--------------+---+----+
|77 |email1 |38 |null |39 |null|null|null |null |10 |aaa |
|78 |email2 |38 |null |null|39 |null|null |null |20 |bbb |
|178|email21|when string|false|null|36 |test|null |null |30 |ccc |
|179|email8 |null |null |null|null|null|qwerty|[42] |40 |ddd |
|180|email8 |null |null |null|null|null|qwerty|[42, 56, test]|50 |eee |
+---+-------+-----------+-----+----+----+----+------+--------------+---+----+
scala> df.fromJson($"colJson",$"personInfo").printSchema()
root
|-- id: integer (nullable = false)
|-- name: string (nullable = true)
|-- key1: string (nullable = true)
|-- key10: boolean (nullable = true)
|-- key3: long (nullable = true)
|-- key4: long (nullable = true)
|-- key6: string (nullable = true)
|-- sub1: string (nullable = true)
|-- sub2: array (nullable = true)
| |-- element: string (containsNull = true)
|-- age: long (nullable = true)
|-- name: string (nullable = true)
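If the goal is to keep a flat result but avoid the ambiguity shown above (the top-level name and the name inside personInfo both become a name column), a minimal sketch is a variant of the helper that prefixes each extracted field with its parent column name. DFHelperPrefixed and fromJsonPrefixed are illustrative names, not part of the original code:
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.StructType

// Hypothetical variant of the helper above: instead of flattening with bare child
// names, prefix every extracted field with its parent column name, e.g.
// personInfo_name, so duplicate keys no longer collide.
implicit class DFHelperPrefixed(inDF: DataFrame) {
  import inDF.sparkSession.implicits._
  def fromJsonPrefixed(columns: Column*): DataFrame = {
    // Infer one schema per JSON column by reading it as a JSON dataset.
    val schemas = columns.map(c =>
      (c, inDF.sparkSession.read.json(inDF.select(c).as[String]).schema))
    // Replace each string column with its parsed struct.
    val withStructs = schemas.foldLeft(inDF) { case (df, (c, schema)) =>
      df.withColumn(c.toString(), from_json(c, schema))
    }
    // Flatten: struct children become <parent>_<child>; other columns pass through.
    val flatCols = withStructs.schema.flatMap { f =>
      f.dataType match {
        case st: StructType =>
          st.fieldNames.toSeq.map(child =>
            col(s"`${f.name}`.`$child`").as(s"${f.name}_$child"))
        case _ => Seq(col(f.name))
      }
    }
    withStructs.select(flatCols: _*)
  }
}

// df.fromJsonPrefixed($"colJson", $"personInfo") would then yield columns such as
// id, name, colJson_key1, ..., personInfo_name, personInfo_age.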
Try this-
df.show(false)
df.printSchema()
/**
* +---+-------+---------------------------------------------------------------+-----------------------+
* |id |name |colJson |personInfo |
* +---+-------+---------------------------------------------------------------+-----------------------+
* |77 |email1 |{"key1":38,"key3":39} |{"name":"aaa","age":10}|
* |78 |email2 |{"key1":38,"key4":39} |{"name":"bbb","age":20}|
* |178|email21|{"key1":"when string","key4":36, "key6":"test", "key10":false }|{"name":"ccc","age":30}|
* |179|email8 |{"sub1":"qwerty","sub2":["42"]} |{"name":"ddd","age":40}|
* |180|email8 |{"sub1":"qwerty","sub2":["42", "56", "test"]} |{"name":"eee","age":50}|
* +---+-------+---------------------------------------------------------------+-----------------------+
*
* root
* |-- id: integer (nullable = false)
* |-- name: string (nullable = true)
* |-- colJson: string (nullable = true)
* |-- personInfo: string (nullable = true)
*
* @param inDF
*/
implicit class DFHelper(inDF: DataFrame) {
  import inDF.sparkSession.implicits._
  def fromJson(columns: Column*): DataFrame = {
    val schemas = columns.map(column => (column, inDF.sparkSession.read.json(inDF.select(column).as[String]).schema))
    val mdf = schemas.foldLeft(inDF)((df, schema) => {
      df.withColumn(schema._1.toString(), from_json(schema._1, schema._2))
    })
    mdf //.selectExpr(mdf.schema.map(c => if(c.dataType.typeName =="struct") s"${c.name}.*" else c.name):_*)
  }
}
val p = df.fromJson($"colJson", $"personInfo")
p.show(false)
p.printSchema()
/**
* +---+-------+---------------------------------+----------+
* |id |name |colJson |personInfo|
* +---+-------+---------------------------------+----------+
* |77 |email1 |[38,, 39,,,,] |[10, aaa] |
* |78 |email2 |[38,,, 39,,,] |[20, bbb] |
* |178|email21|[when string, false,, 36, test,,]|[30, ccc] |
* |179|email8 |[,,,,, qwerty, [42]] |[40, ddd] |
* |180|email8 |[,,,,, qwerty, [42, 56, test]] |[50, eee] |
* +---+-------+---------------------------------+----------+
*
* root
* |-- id: integer (nullable = false)
* |-- name: string (nullable = true)
* |-- colJson: struct (nullable = true)
* | |-- key1: string (nullable = true)
* | |-- key10: boolean (nullable = true)
* | |-- key3: long (nullable = true)
* | |-- key4: long (nullable = true)
* | |-- key6: string (nullable = true)
* | |-- sub1: string (nullable = true)
* | |-- sub2: array (nullable = true)
* | | |-- element: string (containsNull = true)
* |-- personInfo: struct (nullable = true)
* | |-- age: long (nullable = true)
* | |-- name: string (nullable = true)
*/
// fetch columns of struct using <parent_col>.<child_col>
p.select($"colJson.key1", $"personInfo.age").show(false)
/**
* +-----------+---+
* |key1 |age|
* +-----------+---+
* |38 |10 |
* |38 |20 |
* |when string|30 |
* |null |40 |
* |null |50 |
* +-----------+---+
*/
Related
I wrote a Scala script to load an Avro file and work with the generated data (to retrieve top contributors).
The problem is that loading the file gives a dataset that I cannot convert to a dataframe, because it contains some complex types:
val history_src = "path_to_avro_files\\frwiki*.avro"
val revisions_dataset = spark.read.format("avro").load(history_src)
// gives a dataset; we can see the data and call take(1) without problems
val first_essay = revisions_dataset.map(row => (row.getString(0), row.getLong(2), row.get(3).asInstanceOf[mutable.WrappedArray[Revision]].array
  .map(x => (x.r_contributor.r_username, x.r_contributor.r_contributor_id, x.r_contributor.r_contributor_ip)))).take(1)
// gives: GenericRowWithSchema cannot be cast to Revision
val second_essay = revisions_dataset.map(row => (row.getString(0), row.getLong(2), row.get(3).asInstanceOf[mutable.WrappedArray[GenericRowWithSchema]].toStream
  .map(x => (x.getLong(0), row.get(3).asInstanceOf[mutable.WrappedArray[GenericRowWithSchema]].map(c => (c.getLong(0))))))).take(1)
// gives: WrappedArray$ofRef cannot be cast to scala.collection.mutable.ArrayBuffer
I tried with Encoders using my case classes below, but it didn't work:
case class History (title: String, namespace: Long, id: Long, revisions: Array[Revision])
case class Contributor (r_username: String, r_contributor_id: Long, r_contributor_ip: String)
case class Revision(r_id: Long, r_parent_id: Long, timestamp : Long, r_contributor: Contributor, sha: String)
The schema generated from my revisions_dataset looks like this:
root
|-- p_title: string (nullable = true)
|-- p_namespace: long (nullable = true)
|-- p_id: long (nullable = true)
|-- p_revisions: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- r_id: long (nullable = true)
| | |-- r_parent_id: long (nullable = true)
| | |-- r_timestamp: long (nullable = true)
| | |-- r_contributor: struct (nullable = true)
| | | |-- r_username: string (nullable = true)
| | | |-- r_contributor_id: long (nullable = true)
| | | |-- r_contributor_ip: string (nullable = true)
| | |-- r_sha1: string (nullable = true)
My goal is to have a dataframe from which I can retrieve the list of contributors from the revisions and flatten it, so that the contributors end up at the page level (the same level as the title).
Any help, please?
import org.apache.spark.sql.functions._
val r1 = Revision(1, 1, 1, Contributor("c1", 1, "ip1"), "sha")
val r2 = Revision(1, 1, 1, Contributor("c2", 2, "ip2"), "sha")
val r3 = Revision(1, 1, 1, Contributor("c3", 3, "ip3"), "sha")
val revisions_dataset = Seq(
("title1", 0L, 1L, Array(r1, r2)),
("title1", 0L, 2L, Array(r1, r3)),
("title1", 0L, 3L, Array(r2))
).toDF("p_title", "p_namespace", "p_id", "p_revisions")
val flattened = revisions_dataset.select($"p_title", $"p_id", explode($"p_revisions").alias("p_revision"))
.withColumn("r_contributor_username", $"p_revision.r_contributor.r_username")
.withColumn("r_contributor_id", $"p_revision.r_contributor.r_contributor_id")
.withColumn("r_contributor_ip", $"p_revision.r_contributor.r_contributor_ip")
.drop("p_revision")
flattened.show(false)
Output:
+-------+----+----------------------+----------------+----------------+
|p_title|p_id|r_contributor_username|r_contributor_id|r_contributor_ip|
+-------+----+----------------------+----------------+----------------+
|title1 |1 |c1 |1 |ip1 |
|title1 |1 |c2 |2 |ip2 |
|title1 |2 |c1 |1 |ip1 |
|title1 |2 |c3 |3 |ip3 |
|title1 |3 |c2 |2 |ip2 |
+-------+----+----------------------+----------------+----------------+
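A typed alternative, as a sketch against the original Avro DataFrame whose schema is printed in the question (not the toy frame above): Encoders typically fail because the case class field names do not match the schema (timestamp vs r_timestamp, sha vs r_sha1, title vs p_title). With hypothetical case classes that mirror that schema field-for-field, the dataset can be flattened with flatMap:
import org.apache.spark.sql.Dataset

// Hypothetical case classes matching the printed schema field-for-field.
case class RContributor(r_username: String, r_contributor_id: Option[Long], r_contributor_ip: String)
case class RRevision(r_id: Option[Long], r_parent_id: Option[Long], r_timestamp: Option[Long],
                     r_contributor: RContributor, r_sha1: String)
case class Page(p_title: String, p_namespace: Option[Long], p_id: Option[Long],
                p_revisions: Seq[RRevision])

import spark.implicits._
val pages: Dataset[Page] = revisions_dataset.as[Page]

// One row per (page title, contributor); assumes r_contributor is never null.
val contributors = pages.flatMap(p =>
  Option(p.p_revisions).getOrElse(Seq.empty).map(r =>
    (p.p_title, r.r_contributor.r_username, r.r_contributor.r_contributor_id)))
contributors.show(false)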
I have an array of structs:
root
|-- _id: string (nullable = true)
|-- h: string (nullable = true)
|-- inc: string (nullable = true)
|-- op: string (nullable = true)
|-- ts: string (nullable = true)
|-- webhooks: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- index: string (nullable = false)
| | |-- failed_at: string (nullable = true)
| | |-- status: string (nullable = true)
| | |-- updated_at: string (nullable = true)
I have to remove columns from the array of structs (webhooks), taking the input from a list,
e.g. filterList: List[String] = List("index", "status").
I have to remove the columns that are not in the list:
root
|-- _id: string (nullable = true)
|-- h: string (nullable = true)
|-- inc: string (nullable = true)
|-- op: string (nullable = true)
|-- ts: string (nullable = true)
|-- webhooks: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- index: string (nullable = false)
| | |-- status: string (nullable = true)
I want to do this operation at the row level, i.e. by iterating over each row, not at the DataFrame level and not by using Spark operations.
Check the code below.
scala> val ddf = Seq(("fa1","fa11","fa111","fa1111","fa11111",Seq(("1","11","111","1111"))),("fb1","fb11","fb111","fb1111","fb11111",Seq(("2","22","222","2222")))).toDF("_id","h","inc","op","ts","webhooks").withColumn("webhooks",$"webhooks".cast("array<struct<index:string,failed_at:string,status:string,updated_at:string>>"))
ddf: org.apache.spark.sql.DataFrame = [_id: string, h: string ... 4 more fields]
scala> ddf.printSchema
root
|-- _id: string (nullable = true)
|-- h: string (nullable = true)
|-- inc: string (nullable = true)
|-- op: string (nullable = true)
|-- ts: string (nullable = true)
|-- webhooks: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- index: string (nullable = true)
| | |-- failed_at: string (nullable = true)
| | |-- status: string (nullable = true)
| | |-- updated_at: string (nullable = true)
scala> ddf.show(false)
+---+----+-----+------+-------+--------------------+
|_id|h |inc |op |ts |webhooks |
+---+----+-----+------+-------+--------------------+
|fa1|fa11|fa111|fa1111|fa11111|[[1, 11, 111, 1111]]|
|fb1|fb11|fb111|fb1111|fb11111|[[2, 22, 222, 2222]]|
+---+----+-----+------+-------+--------------------+
scala> val columsExceptWebhooks = ddf.columns.filterNot(_ == "webhooks").map(col(_))
columsExceptWebhooks: Array[org.apache.spark.sql.Column] = Array(_id, h, inc, op, ts)
scala> val columnsRequiredInWebhooks = Seq("index","status").map(c => col(s"webhooks.${c}"))
columnsRequiredInWebhooks: Seq[org.apache.spark.sql.Column] = List(webhooks.index, webhooks.status)
scala> ddf.withColumn("webhooks",explode($"webhooks")).select((columsExceptWebhooks :+ array(struct(columnsRequiredInWebhooks:_*)).as("webhook")):_*).show
+---+----+-----+------+-------+----------+
|_id| h| inc| op| ts| webhook|
+---+----+-----+------+-------+----------+
|fa1|fa11|fa111|fa1111|fa11111|[[1, 111]]|
|fb1|fb11|fb111|fb1111|fb11111|[[2, 222]]|
+---+----+-----+------+-------+----------+
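On Spark 2.4+ the same trimming can also be done without exploding, by rebuilding every array element with only the wanted fields via the transform higher-order function. A sketch against the ddf above (filterList and trimmed are illustrative names):
import org.apache.spark.sql.functions.expr

// Keep only the fields named in filterList inside every webhooks element.
val filterList: List[String] = List("index", "status")
val keepFields = filterList.map(c => s"'$c', x.$c").mkString(", ")

val trimmed = ddf.withColumn(
  "webhooks",
  expr(s"transform(webhooks, x -> named_struct($keepFields))"))

trimmed.printSchema
trimmed.show(false)
// webhooks becomes array<struct<index:string,status:string>>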
Maybe you are looking for this.
Please check the output column named processed.
val df = spark.range(2).withColumn("webhooks",
array(
struct(lit("index1").as("index"), lit("failed_at1").as("failed_at"),
lit("status1").as("status"), lit("updated_at1").as("updated_at")),
struct(lit("index2").as("index"), lit("failed_at2").as("failed_at"),
lit("status2").as("status"), lit("updated_at2").as("updated_at"))
)
)
df.show(false)
df.printSchema()
/**
* +---+----------------------------------------------------------------------------------------+
* |id |webhooks |
* +---+----------------------------------------------------------------------------------------+
* |0 |[[index1, failed_at1, status1, updated_at1], [index2, failed_at2, status2, updated_at2]]|
* |1 |[[index1, failed_at1, status1, updated_at1], [index2, failed_at2, status2, updated_at2]]|
* +---+----------------------------------------------------------------------------------------+
*/
val filterList: List[String]= List("index1","status1")
val (index, status) = filterList.head -> filterList.last
df.selectExpr( "webhooks",
s"filter(webhooks, x -> array(x.index, x.status)=array('$index', '$status')) as processed")
.show(false)
/**
* +----------------------------------------------------------------------------------------+--------------------------------------------+
* |webhooks |processed |
* +----------------------------------------------------------------------------------------+--------------------------------------------+
* |[[index1, failed_at1, status1, updated_at1], [index2, failed_at2, status2, updated_at2]]|[[index1, failed_at1, status1, updated_at1]]|
* |[[index1, failed_at1, status1, updated_at1], [index2, failed_at2, status2, updated_at2]]|[[index1, failed_at1, status1, updated_at1]]|
* +----------------------------------------------------------------------------------------+--------------------------------------------+
*/
I have added both columns, webhooks and processed, to the output so that they can easily be compared; for the result, please check only the processed column.
I cannot find exactly what I am looking for, so here is my question. I fetch some data from MongoDB into a Spark DataFrame. The dataframe has the following schema (df.printSchema):
|-- flight: struct (nullable = true)
| |-- legs: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- arrival: timestamp (nullable = true)
| | | |-- departure: timestamp (nullable = true)
| |-- segments: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- arrival: timestamp (nullable = true)
| | | |-- departure: timestamp (nullable = true)
Do note the top-level structure, followed by an array, inside which I need to change my data.
For example:
{
"flight": {
"legs": [{
"departure": ISODate("2020-10-30T13:35:00.000Z"),
"arrival": ISODate("2020-10-30T14:47:00.000Z")
}
],
"segments": [{
"departure": ISODate("2020-10-30T13:35:00.000Z"),
"arrival": ISODate("2020-10-30T14:47:00.000Z")
}
]
}
}
I want to export this as JSON, but for business reasons I want the arrival dates to have a different format than the departure dates. For example, I may want to export the departure ISODate as milliseconds from the epoch, but not the arrival one.
To do so, I thought of applying a custom function to do the transformation:
// Here I can do any transformation. I hope to replace the timestamp with the needed value
val doSomething: UserDefinedFunction = udf( (value: Seq[Timestamp]) => {
value.map(x => "doSomething" + x.getTime) }
)
val newDf = df.withColumn("flight.legs.departure",
doSomething(df.col("flight.legs.departure")))
But this simply returns a brand new column, containing an array of a single doSomething string.
{
"flight": {
"legs": [{
"arrival": "2020-10-30T14:47:00Z",
"departure": "2020-10-30T13:35:00Z"
}
],
"segments": [{
"arrival": "2020-10-30T14:47:00Z",
"departure": "2020-10-30T13:35:00Z",
}
]
},
"flight.legs.departure": ["doSomething1596268800000"]
}
And newDf.show(1) gives:
+--------------------+---------------------+
| flight|flight.legs.departure|
+--------------------+---------------------+
|[[[182], 94, [202...| [doSomething15962...|
+--------------------+---------------------+
Instead of
{
...
"arrival": "2020-10-30T14:47:00Z",
//leg departure date that I changed
"departure": "doSomething1596268800000"
... // segments not affected in this example
"arrival": "2020-10-30T14:47:00Z",
"departure": "2020-10-30T13:35:00Z",
...
}
Any ideas how to proceed?
Edit - clarification:
Please bear in mind that my schema is way more complex than what is shown above. For example, there is yet another top-level data tag, so flight sits below it along with other information. Inside flight, legs and segments there are many more elements, some of which are also nested. I only focused on the ones that I needed to change.
I am saying this because I would like the simplest solution that would scale, i.e. ideally one that simply changes the required elements without having to de-construct and then re-construct the whole nested structure. If we cannot avoid that, is using case classes the simplest solution?
Please check the code below.
Execution Time
With UDF : Time taken: 679 ms
Without UDF : Time taken: 1493 ms
Code With UDF
scala> :paste
// Entering paste mode (ctrl-D to finish)
// Creating UDF to update value inside array.
import java.text.SimpleDateFormat
val dateFormat = new SimpleDateFormat("yyyy-MM-dd'T'hh:mm:ss") // Departure values here are strings, so this is used to parse them into epoch milliseconds.
val doSomething = udf((value: Seq[String]) => {
  value.map(x => s"dosomething${dateFormat.parse(x).getTime}")
})
// Exiting paste mode, now interpreting.
import java.text.SimpleDateFormat
dateFormat: java.text.SimpleDateFormat = java.text.SimpleDateFormat@41bd83a
doSomething: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,ArrayType(StringType,true),Some(List(ArrayType(StringType,true))))
scala> :paste
// Entering paste mode (ctrl-D to finish)
spark.time {
  val updated = df.select("flight.*").withColumn("legs",arrays_zip($"legs.arrival",doSomething($"legs.departure")).cast("array<struct<arrival:string,departure:string>>")).select(struct($"segments",$"legs").as("flight"))
  updated.printSchema
  updated.show(false)
}
// Exiting paste mode, now interpreting.
root
|-- flight: struct (nullable = false)
| |-- segments: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- arrival: string (nullable = true)
| | | |-- departure: string (nullable = true)
| |-- legs: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- arrival: string (nullable = true)
| | | |-- departure: string (nullable = true)
+-------------------------------------------------------------------------------------------------+
|flight |
+-------------------------------------------------------------------------------------------------+
|[[[2020-10-30T14:47:00, 2020-10-30T13:35:00]], [[2020-10-30T14:47:00, dosomething1604045100000]]]|
+-------------------------------------------------------------------------------------------------+
Time taken: 679 ms
scala>
Code Without UDF
scala> val df = spark.read.json(Seq("""{"flight": {"legs": [{"departure": "2020-10-30T13:35:00","arrival": "2020-10-30T14:47:00"}],"segments": [{"departure": "2020-10-30T13:35:00","arrival": "2020-10-30T14:47:00"}]}}""").toDS)
df: org.apache.spark.sql.DataFrame = [flight: struct<legs: array<struct<arrival:string,departure:string>>, segments: array<struct<arrival:string,departure:string>>>]
scala> df.printSchema
root
|-- flight: struct (nullable = true)
| |-- legs: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- arrival: string (nullable = true)
| | | |-- departure: string (nullable = true)
| |-- segments: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- arrival: string (nullable = true)
| | | |-- departure: string (nullable = true)
scala> df.show(false)
+--------------------------------------------------------------------------------------------+
|flight |
+--------------------------------------------------------------------------------------------+
|[[[2020-10-30T14:47:00, 2020-10-30T13:35:00]], [[2020-10-30T14:47:00, 2020-10-30T13:35:00]]]|
+--------------------------------------------------------------------------------------------+
scala> :paste
// Entering paste mode (ctrl-D to finish)
spark.time {
val updated= df
.select("flight.*")
.select($"segments",$"legs.arrival",$"legs.departure") // extracting legs struct column values.
.withColumn("departure",explode($"departure")) // exploding departure column
.withColumn("departure",concat_ws("-",lit("something"),$"departure".cast("timestamp").cast("long"))) // updating departure column values
.groupBy($"segments",$"arrival") // grouping columns except legs column
.agg(collect_list($"departure").as("departure")) // constructing list back
.select($"segments",arrays_zip($"arrival",$"departure").as("legs")) // construction arrival & departure columns using arrays_zip method.
.select(struct($"legs",$"segments").as("flight")) // finally creating flight by combining legs & segments columns.
updated.printSchema
updated.show(false)
}
// Exiting paste mode, now interpreting.
root
|-- flight: struct (nullable = false)
| |-- legs: array (nullable = true)
| | |-- element: struct (containsNull = false)
| | | |-- arrival: string (nullable = true)
| | | |-- departure: string (nullable = true)
| |-- segments: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- arrival: string (nullable = true)
| | | |-- departure: string (nullable = true)
+---------------------------------------------------------------------------------------------+
|flight |
+---------------------------------------------------------------------------------------------+
|[[[2020-10-30T14:47:00, something-1604045100]], [[2020-10-30T14:47:00, 2020-10-30T13:35:00]]]|
+---------------------------------------------------------------------------------------------+
Time taken: 1493 ms
scala>
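On Spark 2.4+ there is also a middle ground, sketched below under the same assumption of string-typed departure values: rewrite flight.legs in place with the transform higher-order function, so neither a UDF nor the explode/groupBy round trip is needed.
import org.apache.spark.sql.functions.{expr, struct}

// Rebuild flight.legs element by element, changing only departure;
// segments is carried over unchanged.
val updated = df.select(
  struct(
    expr("""transform(flight.legs, x -> named_struct(
              'arrival',   x.arrival,
              'departure', concat('dosomething', cast(unix_timestamp(cast(x.departure as timestamp)) as string))
            ))""").as("legs"),
    $"flight.segments".as("segments")
  ).as("flight"))

updated.printSchema
updated.show(false)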
Try this
scala> df.show(false)
+----------------------------------------------------------------------------------------------------------------+
|flight |
+----------------------------------------------------------------------------------------------------------------+
|[[[2020-10-30T13:35:00.000Z, 2020-10-30T14:47:00.000Z]], [[2020-10-30T13:35:00.000Z, 2020-10-30T14:47:00.000Z]]]|
|[[[2020-10-25T13:15:00.000Z, 2020-10-25T14:37:00.000Z]], [[2020-10-25T13:15:00.000Z, 2020-10-25T14:37:00.000Z]]]|
+----------------------------------------------------------------------------------------------------------------+
scala>
scala> df.printSchema
root
|-- flight: struct (nullable = true)
| |-- legs: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- dep: string (nullable = true)
| | | |-- arr: string (nullable = true)
| |-- segments: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- dep: string (nullable = true)
| | | |-- arr: string (nullable = true)
scala>
scala> val myudf = udf(
| (arrs:Seq[String]) => {
| arrs.map("something" ++ _)
| }
| )
myudf: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,ArrayType(StringType,true),Some(List(ArrayType(StringType,true))))
scala> val df2 = df.select($"flight", myudf($"flight.legs.arr") as "editedArrs")
df2: org.apache.spark.sql.DataFrame = [flight: struct<legs: array<struct<dep:string,arr:string>>, segments: array<struct<dep:string,arr:string>>>, editedArrs: array<string>]
scala> df2.printSchema
root
|-- flight: struct (nullable = true)
| |-- legs: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- dep: string (nullable = true)
| | | |-- arr: string (nullable = true)
| |-- segments: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- dep: string (nullable = true)
| | | |-- arr: string (nullable = true)
|-- editedArrs: array (nullable = true)
| |-- element: string (containsNull = true)
scala> df2.show(false)
+----------------------------------------------------------------------------------------------------------------+-----------------------------------+
|flight |editedArrs |
+----------------------------------------------------------------------------------------------------------------+-----------------------------------+
|[[[2020-10-30T13:35:00.000Z, 2020-10-30T14:47:00.000Z]], [[2020-10-30T13:35:00.000Z, 2020-10-30T14:47:00.000Z]]]|[something2020-10-30T14:47:00.000Z]|
|[[[2020-10-25T13:15:00.000Z, 2020-10-25T14:37:00.000Z]], [[2020-10-25T13:15:00.000Z, 2020-10-25T14:37:00.000Z]]]|[something2020-10-25T14:37:00.000Z]|
+----------------------------------------------------------------------------------------------------------------+-----------------------------------+
scala>
scala>
scala> val df3 = df2.select(struct(arrays_zip($"flight.legs.dep", $"editedArrs") cast "array<struct<dep:string,arr:string>>" as "legs", $"flight.segments") as "flight")
df3: org.apache.spark.sql.DataFrame = [flight: struct<legs: array<struct<dep:string,arr:string>>, segments: array<struct<dep:string,arr:string>>>]
scala>
scala> df3.printSchema
root
|-- flight: struct (nullable = false)
| |-- legs: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- dep: string (nullable = true)
| | | |-- arr: string (nullable = true)
| |-- segments: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- dep: string (nullable = true)
| | | |-- arr: string (nullable = true)
scala>
scala> df3.show(false)
+-------------------------------------------------------------------------------------------------------------------------+
|flight |
+-------------------------------------------------------------------------------------------------------------------------+
|[[[2020-10-30T13:35:00.000Z, something2020-10-30T14:47:00.000Z]], [[2020-10-30T13:35:00.000Z, 2020-10-30T14:47:00.000Z]]]|
|[[[2020-10-25T13:15:00.000Z, something2020-10-25T14:37:00.000Z]], [[2020-10-25T13:15:00.000Z, 2020-10-25T14:37:00.000Z]]]|
+-------------------------------------------------------------------------------------------------------------------------+
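On Spark 3.1+ the arrays_zip reconstruction can be avoided entirely: withField updates a single struct field and transform applies the change to every array element. A sketch against the same df (dep/arr field names as above; df4 is just an illustrative name):
import org.apache.spark.sql.functions.{concat, lit, transform}

// Update only flight.legs.arr; every other field stays untouched.
val df4 = df.withColumn(
  "flight",
  $"flight".withField(
    "legs",
    transform($"flight.legs", leg => leg.withField("arr", concat(lit("something"), leg("arr"))))
  ))

df4.printSchema
df4.show(false)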
I am building up a schema to accept some data streaming in. It has an ArrayType with some elements. Here is my StructType with the ArrayType:
val innerBody = StructType(
  StructField("value", LongType, false) ::
  StructField("spent", BooleanType, false) ::
  StructField("tx_index", LongType, false) :: Nil)

val prev_out = StructType(StructField("prev_out", innerBody, false) :: Nil)

val body = StructType(
  StructField("inputs", ArrayType(prev_out, false), false) ::
  StructField("out", ArrayType(innerBody, false), false) :: Nil)

val schema = StructType(StructField("x", body, false) :: Nil)
This builds a schema like:
root
|-- bit: struct (nullable = true)
| |-- x: struct (nullable = false)
| | |-- inputs: array (nullable = false)
| | | |-- element: struct (containsNull = false)
| | | | |-- prev_out: struct (nullable = false)
| | | | | |-- value: long (nullable = false)
| | | | | |-- spent: boolean (nullable = false)
| | | | | |-- tx_index: long (nullable = false)
| | |-- out: array (nullable = false)
| | | |-- element: struct (containsNull = false)
| | | | |-- value: long (nullable = false)
| | | | |-- spent: boolean (nullable = false)
| | | | |-- tx_index: long (nullable = false)
I am trying to select the value from the "value" element in the schema as the data streams in. I am using the writeStream sink.
val parsed = df.select("bit.x.inputs.element.prev_out.value")
.writeStream.format("console").start()
I have the code above, but it gives an error:
Message: cannot resolve 'bit.x.inputs.element.prev_out.value' given
input columns: [key, value, timestamp, partition, offset,
timestampType, topic];;
How can I access the "value" element in this schema?
If you have a data frame like this, an explode followed by a select will help you.
df.printSchema()
//root
//|-- bit: struct (nullable = true)
//| |-- x: struct (nullable = true)
//| | |-- inputs: array (nullable = true)
//| | | |-- element: struct (containsNull = true)
//| | | | |-- prev_out: struct (nullable = true)
//| | | | | |-- spent: boolean (nullable = true)
//| | | | | |-- tx_infex: long (nullable = true)
//| | | | | |-- value: long (nullable = true)
import org.apache.spark.sql.functions._
val intermediateDf: DataFrame = df.select(explode(col("bit.x.inputs")).as("interCol"))
intermediateDf.printSchema()
//root
//|-- interCol: struct (nullable = true)
//| |-- prev_out: struct (nullable = true)
//| | |-- spent: boolean (nullable = true)
//| | |-- tx_infex: long (nullable = true)
//| | |-- value: long (nullable = true)
val finalDf: DataFrame = intermediateDf.select(col("interCol.prev_out.value").as("value"))
finalDf.printSchema()
//root
//|-- value: long (nullable = true)
finalDf.show()
//+-----------+
//| value|
//+-----------+
//|12347628746|
//|12347628746|
//+-----------+
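Note that the error message lists the raw Kafka source columns (key, value, timestamp, partition, ...), which suggests the schema has not been applied to the incoming payload yet. A sketch, assuming a Kafka source and the schema built above (the bit alias is only illustrative):
import org.apache.spark.sql.functions.{col, explode, from_json}

// Parse the Kafka `value` payload with the schema defined above, then drill into the array.
val parsed = df
  .select(from_json(col("value").cast("string"), schema).as("bit"))
  .select(explode(col("bit.x.inputs")).as("input"))
  .select(col("input.prev_out.value").as("value"))

val query = parsed.writeStream.format("console").start()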
This is my existing data frame:
+------------------+-------------------------+-----------+---------------+-------------------------+---------------------------+------------------------+--------------------------+---------------+-----------+----------------+-----------------+----------------------+--------------------------+-----------+--------------------+-----------+--------------------------------------------------------------------------------------------+-----------------------+------------------+-----------------------------+-----------------------+----------------------------------+
|DataPartition |TimeStamp |_lineItemId|_organizationId|fl:FinancialConceptGlobal|fl:FinancialConceptGlobalId|fl:FinancialConceptLocal|fl:FinancialConceptLocalId|fl:InstrumentId|fl:IsCredit|fl:IsDimensional|fl:IsRangeAllowed|fl:IsSegmentedByOrigin|fl:SegmentGroupDescription|fl:Segments|fl:StatementTypeCode|FFAction|!||LineItemName |LineItemName.languageId|LocalLanguageLabel|LocalLanguageLabel.languageId|SegmentChildDescription|SegmentChildDescription.languageId|
+------------------+-------------------------+-----------+---------------+-------------------------+---------------------------+------------------------+--------------------------+---------------+-----------+----------------+-----------------+----------------------+--------------------------+-----------+--------------------+-----------+--------------------------------------------------------------------------------------------+-----------------------+------------------+-----------------------------+-----------------------+----------------------------------+
|SelfSourcedPrivate|2017-11-02T10:23:59+00:00|3 |4298009288 |XTOT |3016350 |null |null |null |true |false |false |false |null |null |BAL |I|!| |Total Assets |505074 |null |null |null |null |
And here is the schema of the above data frame:
root
|-- DataPartition: string (nullable = true)
|-- TimeStamp: string (nullable = true)
|-- _lineItemId: long (nullable = true)
|-- _organizationId: long (nullable = true)
|-- fl:FinancialConceptGlobal: string (nullable = true)
|-- fl:FinancialConceptGlobalId: long (nullable = true)
|-- fl:FinancialConceptLocal: string (nullable = true)
|-- fl:FinancialConceptLocalId: long (nullable = true)
|-- fl:InstrumentId: long (nullable = true)
|-- fl:IsCredit: boolean (nullable = true)
|-- fl:IsDimensional: boolean (nullable = true)
|-- fl:IsRangeAllowed: boolean (nullable = true)
|-- fl:IsSegmentedByOrigin: boolean (nullable = true)
|-- fl:SegmentGroupDescription: string (nullable = true)
|-- fl:Segments: struct (nullable = true)
| |-- fl:SegmentSequence: struct (nullable = true)
| | |-- _VALUE: long (nullable = true)
| | |-- _segmentId: long (nullable = true)
|-- fl:StatementTypeCode: string (nullable = true)
|-- FFAction|!|: string (nullable = true)
|-- LineItemName: string (nullable = true)
|-- LineItemName.languageId: long (nullable = true)
|-- LocalLanguageLabel: string (nullable = true)
|-- LocalLanguageLabel.languageId: long (nullable = true)
|-- SegmentChildDescription: string (nullable = true)
|-- SegmentChildDescription.languageId: long (nullable = true)
I want to rename the header columns of the data frame using the code below.
val temp = dfTypeNew.select(dfTypeNew.columns.filter(x => !x.equals("fl:Segments")).map(x => col(x).as(x.replace("_", "LineItem_").replace("fl:", ""))): _*)
When I do that, I get the error below:
Exception in thread "main" org.apache.spark.sql.AnalysisException:
Can't extract value from LineItemName#368: need struct type but got
string;
When I rename columns that do not contain a dot (.), I am able to extract them.
The error occurs because the dot (.) is used to access a struct field.
To read a column whose name contains a dot, use backticks, as below:
val df = Seq(
("a","b","c"),
("a","b","c")
).toDF("x", "y", "z.z")
df.select("x", "`z.z`").show(false)
Output
+---+---+
|x  |z.z|
+---+---+
|a  |c  |
|a  |c  |
+---+---+
Hope this helps!
Edited by Ramesh
@Anupam, all you had to do was use the technique that Shankar suggested above in your code, as:
val temp = dfTypeNew.select(dfTypeNew.columns.filter(x => !x.equals("fl:Segments")).map(x => col(s"`${x}`").as(x.replace("_", "LineItem_").replace("fl:", ""))): _*)
That's all.
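For reference, the same backtick-plus-alias pattern applied to the toy frame above, as a minimal sketch (replacing the dot with an underscore is just one possible rename):
import org.apache.spark.sql.functions.col

// Quote every column with backticks when reading it, then alias the cleaned name.
val renamed = df.select(df.columns.map(c => col(s"`$c`").as(c.replace(".", "_"))): _*)

renamed.printSchema
// root
//  |-- x: string (nullable = true)
//  |-- y: string (nullable = true)
//  |-- z_z: string (nullable = true)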