I wrote a Scala script to load an Avro file and work with the generated data (to retrieve the top contributors).
The problem is that loading the file gives a dataset that I cannot convert to a dataframe, because it contains some complex types:
val history_src = "path_to_avro_files\\frwiki*.avro"
val revisions_dataset = spark.read.format("avro").load(history_src)
// gives a dataset; we can see the data and call take(1) without problems
val first_essay = revisions_dataset.map(row => (row.getString(0), row.getLong(2), row.get(3).asInstanceOf[mutable.WrappedArray[Revision]].array
.map(x=> (x.r_contributor.r_username, x.r_contributor.r_contributor_id, x.r_contributor.r_contributor_ip)))).take(1)
//gives GenericRowWithSchema cannot be cast to Revision
val second_essay = revisions_dataset.map(row => (row.getString(0), row.getLong(2), row.get(3).asInstanceOf[mutable.WrappedArray[GenericRowWithSchema]].toStream
.map(x=> (x.getLong(0),row.get(3).asInstanceOf[mutable.WrappedArray[GenericRowWithSchema]].map(c => (c.getLong(0))))))).take(1)
// gives WrappedArray$ofRef cannot be cast to scala.collection.mutable.ArrayBuffer
I tried with Encoders and Encoder using my case classes below, but it didn't work:
case class History (title: String, namespace: Long, id: Long, revisions: Array[Revision])
case class Contributor (r_username: String, r_contributor_id: Long, r_contributor_ip: String)
case class Revision(r_id: Long, r_parent_id: Long, timestamp : Long, r_contributor: Contributor, sha: String)
The schema I can generate from my revisions_dataset looks like this:
root
|-- p_title: string (nullable = true)
|-- p_namespace: long (nullable = true)
|-- p_id: long (nullable = true)
|-- p_revisions: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- r_id: long (nullable = true)
| | |-- r_parent_id: long (nullable = true)
| | |-- r_timestamp: long (nullable = true)
| | |-- r_contributor: struct (nullable = true)
| | | |-- r_username: string (nullable = true)
| | | |-- r_contributor_id: long (nullable = true)
| | | |-- r_contributor_ip: string (nullable = true)
| | |-- r_sha1: string (nullable = true)
My goal is to have a dataframe from which I can retrieve the list of contributors from the revisions array and flatten it, so that each page carries a list of contributors at the same level as the title.
Any help, please?
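One route (a sketch, not from the original post) is to make the case-class field names match the Avro column names exactly (p_title, p_revisions, r_timestamp, r_sha1, ...), so Spark's built-in encoders can turn the loaded dataframe into a typed Dataset and each array element comes back as a case class instead of a GenericRowWithSchema:
import org.apache.spark.sql.SparkSession

// Hypothetical case classes whose field names match the Avro columns exactly.
// Nullable long columns are modelled as Option so the encoder tolerates nulls.
case class PContributor(r_username: String, r_contributor_id: Option[Long], r_contributor_ip: String)
case class PRevision(r_id: Long, r_parent_id: Option[Long], r_timestamp: Long,
                     r_contributor: PContributor, r_sha1: String)
case class Page(p_title: String, p_namespace: Long, p_id: Long, p_revisions: Seq[PRevision])

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// With matching field names the cast to a typed Dataset succeeds.
val typed = spark.read.format("avro").load(history_src).as[Page]
val contributors = typed.flatMap(p =>
  p.p_revisions.map(r => (p.p_title, r.r_contributor.r_username)))
Alternatively, you can stay in the DataFrame API and flatten with explode, as shown below on sample data built from the case classes.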
import org.apache.spark.sql.functions._
val r1 = Revision(1, 1, 1, Contributor("c1", 1, "ip1"), "sha")
val r2 = Revision(1, 1, 1, Contributor("c2", 2, "ip2"), "sha")
val r3 = Revision(1, 1, 1, Contributor("c3", 3, "ip3"), "sha")
val revisions_dataset = Seq(
("title1", 0L, 1L, Array(r1, r2)),
("title1", 0L, 2L, Array(r1, r3)),
("title1", 0L, 3L, Array(r2))
).toDF("p_title", "p_namespace", "p_id", "p_revisions")
val flattened = revisions_dataset.select($"p_title", $"p_id", explode($"p_revisions").alias("p_revision"))
.withColumn("r_contributor_username", $"p_revision.r_contributor.r_username")
.withColumn("r_contributor_id", $"p_revision.r_contributor.r_contributor_id")
.withColumn("r_contributor_ip", $"p_revision.r_contributor.r_contributor_ip")
.drop("p_revision")
flattened.show(false)
Output:
+-------+----+----------------------+----------------+----------------+
|p_title|p_id|r_contributor_username|r_contributor_id|r_contributor_ip|
+-------+----+----------------------+----------------+----------------+
|title1 |1 |c1 |1 |ip1 |
|title1 |1 |c2 |2 |ip2 |
|title1 |2 |c1 |1 |ip1 |
|title1 |2 |c3 |3 |ip3 |
|title1 |3 |c2 |2 |ip2 |
+-------+----+----------------------+----------------+----------------+
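If you also want the contributors grouped back per page (the "list of contributors inside the page" mentioned in the question), a small follow-up sketch on the same flattened dataframe:
// Aggregate the flattened rows back into one contributor list per page title.
// collect_set drops duplicate contributors; use collect_list to keep duplicates.
val contributorsPerPage = flattened
  .groupBy($"p_title")
  .agg(collect_set($"r_contributor_username").alias("contributors"))

contributorsPerPage.show(false)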
Check the code below. It generates a dataframe with ambiguous column names if duplicate keys are present. How should we modify the code to add the parent column name as a prefix?
I added another column with JSON data.
scala> val df = Seq(
(77, "email1", """{"key1":38,"key3":39}""","""{"name":"aaa","age":10}"""),
(78, "email2", """{"key1":38,"key4":39}""","""{"name":"bbb","age":20}"""),
(178, "email21", """{"key1":"when string","key4":36, "key6":"test", "key10":false }""","""{"name":"ccc","age":30}"""),
(179, "email8", """{"sub1":"qwerty","sub2":["42"]}""","""{"name":"ddd","age":40}"""),
(180, "email8", """{"sub1":"qwerty","sub2":["42", "56", "test"]}""","""{"name":"eee","age":50}""")
).toDF("id", "name", "colJson","personInfo")
scala> df.printSchema
root
|-- id: integer (nullable = false)
|-- name: string (nullable = true)
|-- colJson: string (nullable = true)
|-- personInfo: string (nullable = true)
scala> df.show(false)
+---+-------+---------------------------------------------------------------+-----------------------+
|id |name |colJson |personInfo |
+---+-------+---------------------------------------------------------------+-----------------------+
|77 |email1 |{"key1":38,"key3":39} |{"name":"aaa","age":10}|
|78 |email2 |{"key1":38,"key4":39} |{"name":"bbb","age":20}|
|178|email21|{"key1":"when string","key4":36, "key6":"test", "key10":false }|{"name":"ccc","age":30}|
|179|email8 |{"sub1":"qwerty","sub2":["42"]} |{"name":"ddd","age":40}|
|180|email8 |{"sub1":"qwerty","sub2":["42", "56", "test"]} |{"name":"eee","age":50}|
+---+-------+---------------------------------------------------------------+-----------------------+
I created a fromJson implicit function. You can pass multiple columns to it, and it will parse and extract the columns from the JSON.
scala> :paste
// Entering paste mode (ctrl-D to finish)
import org.apache.spark.sql.{Column, DataFrame, Row}
import org.apache.spark.sql.functions.from_json
implicit class DFHelper(inDF: DataFrame) {
  import inDF.sparkSession.implicits._
  def fromJson(columns: Column*): DataFrame = {
    // Infer a schema for each JSON string column by reading it with spark.read.json.
    val schemas = columns.map(column => (column, inDF.sparkSession.read.json(inDF.select(column).as[String]).schema))
    // Replace each string column with its parsed struct.
    val mdf = schemas.foldLeft(inDF)((df, schema) => {
      df.withColumn(schema._1.toString(), from_json(schema._1, schema._2))
    })
    // Flatten every struct column into top-level columns.
    mdf.selectExpr(mdf.schema.map(c => if (c.dataType.typeName == "struct") s"${c.name}.*" else c.name): _*)
  }
}
// Exiting paste mode, now interpreting.
import org.apache.spark.sql.{Column, DataFrame, Row}
import org.apache.spark.sql.functions.from_json
defined class DFHelper
scala> df.fromJson($"colJson",$"personInfo").show(false)
+---+-------+-----------+-----+----+----+----+------+--------------+---+----+
|id |name |key1 |key10|key3|key4|key6|sub1 |sub2 |age|name|
+---+-------+-----------+-----+----+----+----+------+--------------+---+----+
|77 |email1 |38 |null |39 |null|null|null |null |10 |aaa |
|78 |email2 |38 |null |null|39 |null|null |null |20 |bbb |
|178|email21|when string|false|null|36 |test|null |null |30 |ccc |
|179|email8 |null |null |null|null|null|qwerty|[42] |40 |ddd |
|180|email8 |null |null |null|null|null|qwerty|[42, 56, test]|50 |eee |
+---+-------+-----------+-----+----+----+----+------+--------------+---+----+
scala> df.fromJson($"colJson",$"personInfo").printSchema()
root
|-- id: integer (nullable = false)
|-- name: string (nullable = true)
|-- key1: string (nullable = true)
|-- key10: boolean (nullable = true)
|-- key3: long (nullable = true)
|-- key4: long (nullable = true)
|-- key6: string (nullable = true)
|-- sub1: string (nullable = true)
|-- sub2: array (nullable = true)
| |-- element: string (containsNull = true)
|-- age: long (nullable = true)
|-- name: string (nullable = true)
Try this:
df.show(false)
df.printSchema()
/**
* +---+-------+---------------------------------------------------------------+-----------------------+
* |id |name |colJson |personInfo |
* +---+-------+---------------------------------------------------------------+-----------------------+
* |77 |email1 |{"key1":38,"key3":39} |{"name":"aaa","age":10}|
* |78 |email2 |{"key1":38,"key4":39} |{"name":"bbb","age":20}|
* |178|email21|{"key1":"when string","key4":36, "key6":"test", "key10":false }|{"name":"ccc","age":30}|
* |179|email8 |{"sub1":"qwerty","sub2":["42"]} |{"name":"ddd","age":40}|
* |180|email8 |{"sub1":"qwerty","sub2":["42", "56", "test"]} |{"name":"eee","age":50}|
* +---+-------+---------------------------------------------------------------+-----------------------+
*
* root
* |-- id: integer (nullable = false)
* |-- name: string (nullable = true)
* |-- colJson: string (nullable = true)
* |-- personInfo: string (nullable = true)
*
* @param inDF
*/
implicit class DFHelper(inDF: DataFrame) {
  import inDF.sparkSession.implicits._
  def fromJson(columns: Column*): DataFrame = {
    // Infer a schema for each JSON string column by reading it with spark.read.json.
    val schemas = columns.map(column => (column, inDF.sparkSession.read.json(inDF.select(column).as[String]).schema))
    // Replace each string column with its parsed struct, but keep the structs
    // nested instead of flattening them, so duplicate keys stay unambiguous.
    val mdf = schemas.foldLeft(inDF)((df, schema) => {
      df.withColumn(schema._1.toString(), from_json(schema._1, schema._2))
    })
    mdf //.selectExpr(mdf.schema.map(c => if (c.dataType.typeName == "struct") s"${c.name}.*" else c.name): _*)
  }
}
val p = df.fromJson($"colJson", $"personInfo")
p.show(false)
p.printSchema()
/**
* +---+-------+---------------------------------+----------+
* |id |name |colJson |personInfo|
* +---+-------+---------------------------------+----------+
* |77 |email1 |[38,, 39,,,,] |[10, aaa] |
* |78 |email2 |[38,,, 39,,,] |[20, bbb] |
* |178|email21|[when string, false,, 36, test,,]|[30, ccc] |
* |179|email8 |[,,,,, qwerty, [42]] |[40, ddd] |
* |180|email8 |[,,,,, qwerty, [42, 56, test]] |[50, eee] |
* +---+-------+---------------------------------+----------+
*
* root
* |-- id: integer (nullable = false)
* |-- name: string (nullable = true)
* |-- colJson: struct (nullable = true)
* | |-- key1: string (nullable = true)
* | |-- key10: boolean (nullable = true)
* | |-- key3: long (nullable = true)
* | |-- key4: long (nullable = true)
* | |-- key6: string (nullable = true)
* | |-- sub1: string (nullable = true)
* | |-- sub2: array (nullable = true)
* | | |-- element: string (containsNull = true)
* |-- personInfo: struct (nullable = true)
* | |-- age: long (nullable = true)
* | |-- name: string (nullable = true)
*/
// fetch columns of struct using <parent_col>.<child_col>
p.select($"colJson.key1", $"personInfo.age").show(false)
/**
* +-----------+---+
* |key1 |age|
* +-----------+---+
* |38 |10 |
* |38 |20 |
* |when string|30 |
* |null |40 |
* |null |50 |
* +-----------+---+
*/
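If the goal is to flatten while avoiding the ambiguity, one option (a sketch, not part of the answer above) is to alias each child column with its parent column name as a prefix when expanding the structs:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType

// Expand every struct column, prefixing children with the parent column name,
// e.g. colJson_key1 and personInfo_name instead of two ambiguous "name" columns.
val prefixed = p.select(
  p.schema.fields.flatMap { f =>
    f.dataType match {
      case s: StructType => s.fieldNames.map(c => col(s"${f.name}.$c").alias(s"${f.name}_$c"))
      case _             => Array(col(f.name))
    }
  }: _*
)
prefixed.printSchema()
prefixed.show(false)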
I cannot find exactly what I am looking for, so here is my question. I fetch some data from MongoDB into a Spark dataframe. The dataframe has the following schema (df.printSchema):
|-- flight: struct (nullable = true)
| |-- legs: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- arrival: timestamp (nullable = true)
| | | |-- departure: timestamp (nullable = true)
| |-- segments: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- arrival: timestamp (nullable = true)
| | | |-- departure: timestamp (nullable = true)
Do note the top-level structure, followed by an array, inside which I need to change my data.
For example:
{
"flight": {
"legs": [{
"departure": ISODate("2020-10-30T13:35:00.000Z"),
"arrival": ISODate("2020-10-30T14:47:00.000Z")
}
],
"segments": [{
"departure": ISODate("2020-10-30T13:35:00.000Z"),
"arrival": ISODate("2020-10-30T14:47:00.000Z")
}
]
}
}
I want to export this in Json, but for some business reason, I want the arrival dates to have a different format than the departure dates. For example, I may want to export the departure ISODate in ms from epoch, but not the arrival one.
To do so, I thought of applying a custom function to do the transformation:
// Here I can do any transformation. I hope to replace the timestamp with the needed value.
val doSomething: UserDefinedFunction = udf((value: Seq[Timestamp]) => {
  value.map(x => "doSomething" + x.getTime)
})
val newDf = df.withColumn("flight.legs.departure",
doSomething(df.col("flight.legs.departure")))
But this simply returns a brand new column, containing an array of a single doSomething string.
{
"flight": {
"legs": [{
"arrival": "2020-10-30T14:47:00Z",
"departure": "2020-10-30T13:35:00Z"
}
],
"segments": [{
"arrival": "2020-10-30T14:47:00Z",
"departure": "2020-10-30T13:35:00Z",
}
]
},
"flight.legs.departure": ["doSomething1596268800000"]
}
And newDf.show(1) gives:
+--------------------+---------------------+
| flight|flight.legs.departure|
+--------------------+---------------------+
|[[[182], 94, [202...| [doSomething15962...|
+--------------------+---------------------+
Instead of
{
...
"arrival": "2020-10-30T14:47:00Z",
//leg departure date that I changed
"departure": "doSomething1596268800000"
... // segments not affected in this example
"arrival": "2020-10-30T14:47:00Z",
"departure": "2020-10-30T13:35:00Z",
...
}
Any ideas how to proceed?
Edit - clarification:
Please bear in mind that my schema is way more complex than what is shown above. For example, there is yet another top-level data tag, so flight sits below it along with other information. Then inside flight, legs, and segments there are many more elements, some of them also nested. I only focused on the ones I need to change.
I am saying this because I would like the simplest solution that would scale, i.e. ideally one that simply changes the required elements without having to de-construct and then re-construct the whole nested structure. If we cannot avoid that, is using case classes the simplest solution?
Please check the code below.
Execution Time
With UDF : Time taken: 679 ms
Without UDF : Time taken: 1493 ms
Code With UDF
scala> :paste
// Entering paste mode (ctrl-D to finish)
// Creating UDF to update value inside array.
import java.text.SimpleDateFormat
val dateFormat = new SimpleDateFormat("yyyy-MM-dd'T'hh:mm:ss") // For me the departure values are strings, so this is used to parse them into timestamps.
val doSomething = udf((value: Seq[String]) => {
value.map(x => s"dosomething${dateFormat.parse(x).getTime}")
})
// Exiting paste mode, now interpreting.
import java.text.SimpleDateFormat
dateFormat: java.text.SimpleDateFormat = java.text.SimpleDateFormat@41bd83a
doSomething: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,ArrayType(StringType,true),Some(List(ArrayType(StringType,true))))
scala> :paste
// Entering paste mode (ctrl-D to finish)
spark.time {
val updated = df.select("flight.*")
  .withColumn("legs", arrays_zip($"legs.arrival", doSomething($"legs.departure"))
    .cast("array<struct<arrival:string,departure:string>>"))
  .select(struct($"segments", $"legs").as("flight"))
updated.printSchema
updated.show(false)
}
// Exiting paste mode, now interpreting.
root
|-- flight: struct (nullable = false)
| |-- segments: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- arrival: string (nullable = true)
| | | |-- departure: string (nullable = true)
| |-- legs: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- arrival: string (nullable = true)
| | | |-- departure: string (nullable = true)
+-------------------------------------------------------------------------------------------------+
|flight |
+-------------------------------------------------------------------------------------------------+
|[[[2020-10-30T14:47:00, 2020-10-30T13:35:00]], [[2020-10-30T14:47:00, dosomething1604045100000]]]|
+-------------------------------------------------------------------------------------------------+
Time taken: 679 ms
scala>
Code Without UDF
scala> val df = spark.read.json(Seq("""{"flight": {"legs": [{"departure": "2020-10-30T13:35:00","arrival": "2020-10-30T14:47:00"}],"segments": [{"departure": "2020-10-30T13:35:00","arrival": "2020-10-30T14:47:00"}]}}""").toDS)
df: org.apache.spark.sql.DataFrame = [flight: struct<legs: array<struct<arrival:string,departure:string>>, segments: array<struct<arrival:string,departure:string>>>]
scala> df.printSchema
root
|-- flight: struct (nullable = true)
| |-- legs: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- arrival: string (nullable = true)
| | | |-- departure: string (nullable = true)
| |-- segments: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- arrival: string (nullable = true)
| | | |-- departure: string (nullable = true)
scala> df.show(false)
+--------------------------------------------------------------------------------------------+
|flight |
+--------------------------------------------------------------------------------------------+
|[[[2020-10-30T14:47:00, 2020-10-30T13:35:00]], [[2020-10-30T14:47:00, 2020-10-30T13:35:00]]]|
+--------------------------------------------------------------------------------------------+
scala> :paste
// Entering paste mode (ctrl-D to finish)
spark.time {
val updated= df
.select("flight.*")
.select($"segments",$"legs.arrival",$"legs.departure") // extracting legs struct column values.
.withColumn("departure",explode($"departure")) // exploding departure column
.withColumn("departure",concat_ws("-",lit("something"),$"departure".cast("timestamp").cast("long"))) // updating departure column values
.groupBy($"segments",$"arrival") // grouping columns except legs column
.agg(collect_list($"departure").as("departure")) // constructing list back
.select($"segments",arrays_zip($"arrival",$"departure").as("legs")) // construction arrival & departure columns using arrays_zip method.
.select(struct($"legs",$"segments").as("flight")) // finally creating flight by combining legs & segments columns.
updated.printSchema
updated.show(false)
}
// Exiting paste mode, now interpreting.
root
|-- flight: struct (nullable = false)
| |-- legs: array (nullable = true)
| | |-- element: struct (containsNull = false)
| | | |-- arrival: string (nullable = true)
| | | |-- departure: string (nullable = true)
| |-- segments: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- arrival: string (nullable = true)
| | | |-- departure: string (nullable = true)
+---------------------------------------------------------------------------------------------+
|flight |
+---------------------------------------------------------------------------------------------+
|[[[2020-10-30T14:47:00, something-1604045100]], [[2020-10-30T14:47:00, 2020-10-30T13:35:00]]]|
+---------------------------------------------------------------------------------------------+
Time taken: 1493 ms
scala>
Try this
scala> df.show(false)
+----------------------------------------------------------------------------------------------------------------+
|flight |
+----------------------------------------------------------------------------------------------------------------+
|[[[2020-10-30T13:35:00.000Z, 2020-10-30T14:47:00.000Z]], [[2020-10-30T13:35:00.000Z, 2020-10-30T14:47:00.000Z]]]|
|[[[2020-10-25T13:15:00.000Z, 2020-10-25T14:37:00.000Z]], [[2020-10-25T13:15:00.000Z, 2020-10-25T14:37:00.000Z]]]|
+----------------------------------------------------------------------------------------------------------------+
scala>
scala> df.printSchema
root
|-- flight: struct (nullable = true)
| |-- legs: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- dep: string (nullable = true)
| | | |-- arr: string (nullable = true)
| |-- segments: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- dep: string (nullable = true)
| | | |-- arr: string (nullable = true)
scala>
scala> val myudf = udf(
| (arrs:Seq[String]) => {
| arrs.map("something" ++ _)
| }
| )
myudf: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,ArrayType(StringType,true),Some(List(ArrayType(StringType,true))))
scala> val df2 = df.select($"flight", myudf($"flight.legs.arr") as "editedArrs")
df2: org.apache.spark.sql.DataFrame = [flight: struct<legs: array<struct<dep:string,arr:string>>, segments: array<struct<dep:string,arr:string>>>, editedArrs: array<string>]
scala> df2.printSchema
root
|-- flight: struct (nullable = true)
| |-- legs: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- dep: string (nullable = true)
| | | |-- arr: string (nullable = true)
| |-- segments: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- dep: string (nullable = true)
| | | |-- arr: string (nullable = true)
|-- editedArrs: array (nullable = true)
| |-- element: string (containsNull = true)
scala> df2.show(false)
+----------------------------------------------------------------------------------------------------------------+-----------------------------------+
|flight |editedArrs |
+----------------------------------------------------------------------------------------------------------------+-----------------------------------+
|[[[2020-10-30T13:35:00.000Z, 2020-10-30T14:47:00.000Z]], [[2020-10-30T13:35:00.000Z, 2020-10-30T14:47:00.000Z]]]|[something2020-10-30T14:47:00.000Z]|
|[[[2020-10-25T13:15:00.000Z, 2020-10-25T14:37:00.000Z]], [[2020-10-25T13:15:00.000Z, 2020-10-25T14:37:00.000Z]]]|[something2020-10-25T14:37:00.000Z]|
+----------------------------------------------------------------------------------------------------------------+-----------------------------------+
scala>
scala>
scala> val df3 = df2.select(struct(arrays_zip($"flight.legs.dep", $"editedArrs") cast "array<struct<dep:string,arr:string>>" as "legs", $"flight.segments") as "flight")
df3: org.apache.spark.sql.DataFrame = [flight: struct<legs: array<struct<dep:string,arr:string>>, segments: array<struct<dep:string,arr:string>>>]
scala>
scala> df3.printSchema
root
|-- flight: struct (nullable = false)
| |-- legs: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- dep: string (nullable = true)
| | | |-- arr: string (nullable = true)
| |-- segments: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- dep: string (nullable = true)
| | | |-- arr: string (nullable = true)
scala>
scala> df3.show(false)
+-------------------------------------------------------------------------------------------------------------------------+
|flight |
+-------------------------------------------------------------------------------------------------------------------------+
|[[[2020-10-30T13:35:00.000Z, something2020-10-30T14:47:00.000Z]], [[2020-10-30T13:35:00.000Z, 2020-10-30T14:47:00.000Z]]]|
|[[[2020-10-25T13:15:00.000Z, something2020-10-25T14:37:00.000Z]], [[2020-10-25T13:15:00.000Z, 2020-10-25T14:37:00.000Z]]]|
+-------------------------------------------------------------------------------------------------------------------------+
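If you are on Spark 2.4 or later, a simpler alternative (a sketch, not taken from either answer, and assuming a dataframe with the question's original schema where legs carry arrival/departure timestamps) is the SQL higher-order function transform, which rewrites the array elements in place without exploding or zipping:
import org.apache.spark.sql.functions.expr

// Rebuild only flight.legs: each leg keeps its arrival and gets a rewritten
// departure (epoch milliseconds prefixed with "doSomething"); segments stay untouched.
val updatedFlights = df.withColumn("flight", expr("""
  named_struct(
    'legs', transform(flight.legs, l -> named_struct(
      'arrival', l.arrival,
      'departure', concat('doSomething', cast(cast(l.departure as long) * 1000 as string)))),
    'segments', flight.segments)
"""))
updatedFlights.show(false)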
I have a dataframe:
+--------------------+------+
|people |person|
+--------------------+------+
|[[jack, jill, hero]]|joker |
+--------------------+------+
It's schema:
root
|-- people: struct (nullable = true)
| |-- person: array (nullable = true)
| | |-- element: string (containsNull = true)
|-- person: string (nullable = true)
Here, root--person is a string. So, I can update this field using udf as:
def updateString = udf((s: String) => {
"Mr. " + s
})
df.withColumn("person", updateString(col("person"))).select("person").show(false)
output:
+---------+
|person |
+---------+
|Mr. joker|
+---------+
I want to do same operation on root--people--person column which contains array of person. How to achieve this using udf?
def updateArray = udf((arr: Seq[Row]) => ???)
df.withColumn("people", updateArray(col("people.person"))).select("people").show(false)
expected:
+------------------------------+
|people |
+------------------------------+
|[Mr. hero, Mr. jack, Mr. jill]|
+------------------------------+
Edit: I also want to preserve its schema after updating root--people--person.
Expected schema of people:
df.select("people").printSchema()
root
|-- people: struct (nullable = false)
| |-- person: array (nullable = true)
| | |-- element: string (containsNull = true)
Thanks,
The problem here is that people is a struct with only one field. In your UDF, you need to return a Tuple1 and then cast the output of the UDF to keep the field names correct:
def updateArray = udf((r: Row) => Tuple1(r.getAs[Seq[String]](0).map(x=>"Mr."+x)))
val newDF = df
.withColumn("people",updateArray($"people").cast("struct<person:array<string>>"))
newDF.printSchema()
newDF.show()
gives
root
|-- people: struct (nullable = true)
| |-- person: array (nullable = true)
| | |-- element: string (containsNull = true)
|-- person: string (nullable = true)
+--------------------+------+
| people|person|
+--------------------+------+
|[[Mr.jack, Mr.jil...| joker|
+--------------------+------+
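On Spark 2.4 or later the same update can also be done without a UDF (a sketch, not from the answer above), using the transform higher-order function while keeping the struct wrapper and field name intact:
import org.apache.spark.sql.functions.expr

// Rewrites each element of people.person and rebuilds the single-field struct,
// so the schema of "people" is preserved.
val newDF2 = df.withColumn("people",
  expr("named_struct('person', transform(people.person, p -> concat('Mr. ', p)))"))

newDF2.printSchema()
newDF2.show(false)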
For this you just need to update your function, and everything else remains the same.
Here is the code snippet.
scala> df2.show
+------+------------------+
|people| person|
+------+------------------+
| joker|[jack, jill, hero]|
+------+------------------+
// just the order of the columns is changed
I just updated your function: instead of using Row, I am using Seq[String] here.
scala> def updateArray = udf((arr: Seq[String]) => arr.map(x=>"Mr."+x))
scala> df2.withColumn("test",updateArray($"person")).show(false)
+------+------------------+---------------------------+
|people|person |test |
+------+------------------+---------------------------+
|joker |[jack, jill, hero]|[Mr.jack, Mr.jill, Mr.hero]|
+------+------------------+---------------------------+
// keeping all the columns for testing purposes; you could drop them if you don't want them
Let me know if you want to know more about this.
Let's create data for testing
scala> val data = Seq((List(Array("ja", "ji", "he")), "person")).toDF("people", "person")
data: org.apache.spark.sql.DataFrame = [people: array<array<string>>, person: string]
scala> data.printSchema
root
|-- people: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: string (containsNull = true)
|-- person: string (nullable = true)
Create a UDF for our requirement:
scala> def arrayConcat(array:Seq[Seq[String]], str: String) = array.map(_.map(str + _))
arrayConcat: (array: Seq[Seq[String]], str: String)Seq[Seq[String]]
scala> val arrayConcatUDF = udf(arrayConcat _)
arrayConcatUDF: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function2>,ArrayType(ArrayType(StringType,true),true),Some(List(ArrayType(ArrayType(StringType,true),true), StringType)))
Applying the UDF:
scala> data.withColumn("dasd", arrayConcatUDF($"people", lit("Mr."))).show(false)
+--------------------------+------+-----------------------------------+
|people |person|dasd |
+--------------------------+------+-----------------------------------+
|[WrappedArray(ja, ji, he)]|person|[WrappedArray(Mr.ja, Mr.ji, Mr.he)]|
+--------------------------+------+-----------------------------------+
You may need to tweak it a bit (though I think hardly any tweaking is required), but this contains most of what you need to solve your problem.
I am building up a schema to accept some data streaming in. It has an ArrayType with some elements. Here is my StructType with the ArrayType:
val innerBody = StructType(
  StructField("value", LongType, false) ::
  StructField("spent", BooleanType, false) ::
  StructField("tx_index", LongType, false) :: Nil)

val prev_out = StructType(StructField("prev_out", innerBody, false) :: Nil)

val body = StructType(
  StructField("inputs", ArrayType(prev_out, false), false) ::
  StructField("out", ArrayType(innerBody, false), false) :: Nil)

val schema = StructType(StructField("x", body, false) :: Nil)
This builds a schema like"
root
|-- bit: struct (nullable = true)
| |-- x: struct (nullable = false)
| | |-- inputs: array (nullable = false)
| | | |-- element: struct (containsNull = false)
| | | | |-- prev_out: struct (nullable = false)
| | | | | |-- value: long (nullable = false)
| | | | | |-- spent: boolean (nullable = false)
| | | | | |-- tx_index: long (nullable = false)
| | |-- out: array (nullable = false)
| | | |-- element: struct (containsNull = false)
| | | | |-- value: long (nullable = false)
| | | | |-- spent: boolean (nullable = false)
| | | | |-- tx_index: long (nullable = false)
I am trying to select the value from the "value" element in the schema as it is streaming in. I am using the writeStream sink.
val parsed = df.select("bit.x.inputs.element.prev_out.value")
.writeStream.format("console").start()
I have the code above, but it gives an error:
Message: cannot resolve 'bit.x.inputs.element.prev_out.value' given
input columns: [key, value, timestamp, partition, offset,
timestampType, topic];;
How can I access the "value" element in this schema?
If you have a dataframe like this, an explode followed by a select will help you.
df.printSchema()
//root
//|-- bit: struct (nullable = true)
//| |-- x: struct (nullable = true)
//| | |-- inputs: array (nullable = true)
//| | | |-- element: struct (containsNull = true)
//| | | | |-- prev_out: struct (nullable = true)
//| | | | | |-- spent: boolean (nullable = true)
//| | | | | |-- tx_infex: long (nullable = true)
//| | | | | |-- value: long (nullable = true)
import org.apache.spark.sql.functions._
val intermediateDf: DataFrame = df.select(explode(col("bit.x.inputs")).as("interCol"))
intermediateDf.printSchema()
//root
//|-- interCol: struct (nullable = true)
//| |-- prev_out: struct (nullable = true)
//| | |-- spent: boolean (nullable = true)
//| | |-- tx_infex: long (nullable = true)
//| | |-- value: long (nullable = true)
val finalDf: DataFrame = intermediateDf.select(col("interCol.prev_out.value").as("value"))
finalDf.printSchema()
//root
//|-- value: long (nullable = true)
finalDf.show()
//+-----------+
//| value|
//+-----------+
//|12347628746|
//|12347628746|
//+-----------+
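The error in the question ("given input columns: [key, value, timestamp, partition, offset, timestampType, topic]") suggests the stream comes straight from Kafka, so the JSON payload still sits in the raw value column. A sketch under that assumption, parsing it first and then applying the same explode-and-select before writing to the console sink:
import org.apache.spark.sql.functions.{col, explode, from_json}
import org.apache.spark.sql.types.{StructField, StructType}

// Wrap the question's `schema` (which starts at "x") in the top-level "bit"
// field shown in the printed tree; both names come from the question.
val bitSchema = StructType(StructField("bit", schema, true) :: Nil)

val parsed = df
  .select(from_json(col("value").cast("string"), bitSchema).as("data"))
  .select(explode(col("data.bit.x.inputs")).as("input"))
  .select(col("input.prev_out.value").as("value"))
  .writeStream
  .format("console")
  .start()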
I have a dataframe with the following schema:
|-- A: map (nullable = true)
| |-- key: string
| |-- value: array (valueContainsNull = true)
| | |-- element: struct (containsNull = true)
| | | |-- uid: string (nullable = true)
| | | |-- price: double (nullable = true)
| | | |-- type: string (nullable = true)
|-- keyindex: string (nullable = true)
For example, if I have the following data:
{"A":{
"innerkey_1":[{"uid":"1","price":0.01,"recordtype":"STAT"},
{"uid":"6","price":4.3,"recordtype":"DYN"}],
"innerkey_2":[{"uid":"2","price":2.01,"recordtype":"DYN"},
{"uid":"4","price":6.1,"recordtype":"DYN"}]},
"innerkey_2"}
I use the following schema to read the data into a dataframe:
val schema = (new StructType().add("mainkey", MapType(StringType, new ArrayType(new StructType().add("uid",StringType).add("price",DoubleType).add("recordtype",StringType), true))).add("keyindex",StringType))
I am trying to figure out if I can use the keyindex to select values from the map. Since the keyindex in the example is "innerkey_2", I want the output to be
[{"uid":"2","price":2.01,"recordtype":"DYN"},
{"uid":"4","price":6.1,"recordtype":"DYN"}]
Thanks for your help!
getItem should do the trick:
scala> val df = Seq(("innerkey2", Map("innerkey2" -> Seq(("1", 0.01, "STAT"))))).toDF("keyindex", "A")
df: org.apache.spark.sql.DataFrame = [keyindex: string, A: map<string,array<struct<_1:string,_2:double,_3:string>>>]
scala> df.select($"A"($"keyindex")).show
+---------------+
| A[keyindex]|
+---------------+
|[[1,0.01,STAT]]|
+---------------+
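The same idea with the question's own column names (a sketch, assuming the dataframe was read with the schema from the question, where the map column ends up named mainkey), plus an optional explode to get one row per record:
import org.apache.spark.sql.functions.{col, explode}

// Look up the array stored under the key held in the keyindex column,
// then unnest it into one row per record.
val selected = df.select(col("mainkey")(col("keyindex")).as("records"))

selected
  .select(explode(col("records")).as("record"))
  .select("record.uid", "record.price", "record.recordtype")
  .show(false)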