How do I write the code below in a type-safe manner in Spark Scala with the Dataset API?
val schema: StructType = Encoders.product[CaseClass].schema
//read json from a file
val readAsDataSet: Dataset[CaseClass] = sparkSession.read.option("mode", mode).schema(schema).json(path).as[CaseClass]
//below code needs to be written in a type-safe way:
val someDF = readAsDataSet.withColumn("col1", explode(col("col_to_be_exploded")))
  .select(from_unixtime(col("timestamp").divide(1000)).as("date"), col("col1"))
As someone said in the comments, you can create a Dataset[CaseClass] and do your operations on that. Let's set it up:
import spark.implicits._
case class MyTest(timestamp: Long, col_explode: Seq[String])
val df = Seq(
  MyTest(1673850366000L, Seq("some", "strings", "here")),
  MyTest(1271850365998L, Seq("pasta", "with", "cream")),
  MyTest(611850366000L, Seq("tasty", "food"))
).toDF("timestamp", "col_explode").as[MyTest]
df.show(false)
+-------------+---------------------+
|timestamp |col_explode |
+-------------+---------------------+
|1673850366000|[some, strings, here]|
|1271850365998|[pasta, with, cream] |
|611850366000 |[tasty, food] |
+-------------+---------------------+
Typically, you can express many operations with the map function and plain Scala.
A map returns the same number of elements as its input. The explode function that you're using, however, does not return the same number of elements, so that behaviour is better expressed with the flatMap function, as the small sketch below shows.
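A minimal sketch of the difference, using the df dataset defined above (the row counts in the comments follow from the three sample rows):
val sameCount = df.map(_.timestamp)        // 3 rows in, 3 rows out: Dataset[Long]
val moreRows  = df.flatMap(_.col_explode)  // 3 rows in, 8 rows out: Dataset[String]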
So, using the Scala language and the flatMap function together, you can do something like this:
import java.time.LocalDateTime
import java.time.ZoneOffset
case class Exploded(datetime: String, exploded: String)
val output = df.flatMap { case MyTest(timestamp, col_explode) =>
  col_explode.map { value =>
    val date = LocalDateTime.ofEpochSecond(timestamp / 1000, 0, ZoneOffset.UTC).toString
    Exploded(date, value)
  }
}
output.show(false)
+-------------------+--------+
|datetime |exploded|
+-------------------+--------+
|2023-01-16T06:26:06|some |
|2023-01-16T06:26:06|strings |
|2023-01-16T06:26:06|here |
|2010-04-21T11:46:05|pasta |
|2010-04-21T11:46:05|with |
|2010-04-21T11:46:05|cream |
|1989-05-22T14:26:06|tasty |
|1989-05-22T14:26:06|food |
+-------------------+--------+
As you can see, we've created a second case class called Exploded which we use to type our output dataset. The output dataset has the type org.apache.spark.sql.Dataset[Exploded], so everything is completely type safe.
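If you want to tie this back to the file read in the original question, here is a minimal sketch, assuming path, mode and a SparkSession named spark are defined as in the question and that MyTest matches the JSON layout:
import org.apache.spark.sql.{Dataset, Encoders}
val fromFile: Dataset[MyTest] = spark.read
  .option("mode", mode)
  .schema(Encoders.product[MyTest].schema)  // same Encoders-derived schema trick as in the question
  .json(path)
  .as[MyTest]
// applying the same flatMap as above then yields a Dataset[Exploded] from the file contents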
Hi guys, I'm using Spark/Scala 2.4.4 and have an RDD[Map[String, String]] with persons... the rows look like this:
rdd = Map(
  (FirstName -> "Michael"),
  (SecondName -> "Jackson")
)
I created a case class like this:
case class Person(FirstName: String, SecondName: String) {
val FullName = FirstName + SecondName
}
How can I convert this RDD to a Dataset[Person]? And is it possible to generate the FullName column when the class is instantiated, or do I need to use withColumn?
Thanks!
You need to expand the map into two columns and cast the rows as Person.
val rdd = sc.parallelize(Seq(Map(
  "FirstName" -> "Michael",
  "SecondName" -> "Jackson"
)))
val df = rdd.toDF("person").select("person.FirstName", "person.SecondName").as[Person]
Then you can get the FullName using map, for example:
df.map(r => r.FullName).show
+--------------+
| value|
+--------------+
|MichaelJackson|
+--------------+
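If you also want FullName to appear as its own column (rather than only as a value computed inside each Person), one option is to map into a case class whose constructor carries it, so the encoder picks it up. A minimal sketch, using a hypothetical PersonWithFullName case class and assuming spark.implicits._ is in scope:
case class PersonWithFullName(FirstName: String, SecondName: String, FullName: String)
val withFullName = df.map(p => PersonWithFullName(p.FirstName, p.SecondName, p.FullName))
withFullName.show()  // columns: FirstName, SecondName, FullName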
My goal is to have a Spark dataframe that holds each of my Candy objects in a separate row, with their respective properties:
+------------------------------------+
|main                                |
+------------------------------------+
|{"brand":"brand1","name":"snickers"}|
|{"brand":"brand2","name":"haribo"}  |
+------------------------------------+
Case class for proof of concept:
case class Candy(
  brand: String,
  name: String)
val candy1 = Candy("brand1", "snickers")
val candy2 = Candy("brand2", "haribo")
So far I have only managed to put them in the same row with:
import org.json4s.DefaultFormats
import org.json4s.jackson.Serialization.{read, write}
implicit val formats = DefaultFormats
val body = write(Array(candy1, candy2))
val df=Seq(body).toDF("main")
df.show(5, false)
This gives me everything in one row instead of two. How can I split each object into its own row while maintaining the schema of my Candy objects?
+-------------------------------------------------------------------------+
| main |
+-------------------------------------------------------------------------+
|[{"brand":"brand1","name":"snickers"},{"brand":"brand2","name":"haribo"}]|
+-------------------------------------------------------------------------+
Do you want to keep each item as a json string inside the dataframe?
If you don't, you can do this, taking advantage of a Dataset's ability to handle case classes:
val df=Seq(candy1, candy2).toDS
This will give you the following output:
+------+--------+
| brand| name|
+------+--------+
|brand1|snickers|
|brand2| haribo|
+------+--------+
IMHO that's the best option, but if you want to keep your data as a json string, then you can first define a toJson method inside your case class:
case class Candy(brand:String, name: String) {
def toJson = s"""{"brand": "$brand", "name": "$name" }"""
}
And then build the DF using that method:
val df=Seq(candy1.toJson, candy2.toJson).toDF("main")
OUTPUT
+----------------------------------------+
|main |
+----------------------------------------+
|{"brand": "brand1", "name": "snickers" }|
|{"brand": "brand2", "name": "haribo" } |
+----------------------------------------+
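If you instead want to start from the single-row json array you already built (val df = Seq(body).toDF("main")) and split it into one row per Candy, a sketch using from_json and explode (assuming Spark 2.4+, where from_json accepts an array schema) could look like this:
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.functions.{col, explode, from_json}
import org.apache.spark.sql.types.ArrayType
val arrayDf = Seq(body).toDF("main")              // the one-row dataframe from the question
val candySchema = Encoders.product[Candy].schema  // struct<brand:string,name:string>
val candies = arrayDf
  .select(explode(from_json(col("main"), ArrayType(candySchema))).as("candy"))
  .select("candy.*")
  .as[Candy]                                      // Dataset[Candy], one row per object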
Id    Identifiers
'123' {"country":"PR", "idType":"SELECTED","status":"Not Done"}
'234' {"country":"PR", "idType":"NOT SELECTED","status":"Not Done"}
I need to change Identifiers.status to "Done" when idType is equal to "SELECTED".
So the expected output will be:
Id    Identifiers
'123' {"country":"PR", "idType":"SELECTED","status":"Done"}
'234' {"country":"PR", "idType":"NOT SELECTED","status":"Not Done"}
I tried using:
df.withColumn("NewIdentifiers", when($"Identifiers.idType" === "SELECTED", "DONE"))
But this fails, giving null.
Starting from Spark 2.2:
You can use the from_json (to create individual columns) and to_json (to recreate the json object) built-in functions for this case.
Example:
//sample data
df.show(false)
//+---+-------------------------------------------------------------+
//|Id |Identifiers |
//+---+-------------------------------------------------------------+
//|123|{"country":"PR", "idType":"SELECTED","status":"Not Done"} |
//|234|{"country":"PR", "idType":"NOT SELECTED","status":"Not Done"}|
//+---+-------------------------------------------------------------+
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
//defining the schema
val sch=new StructType().add("country",StringType).add("idType",StringType).add("status",StringType)
//read the identifiers using from_json and pass the schema
val df1=df.withColumn("jsn",from_json(col("Identifiers"),sch)).select("Id","jsn.*")
//required json cols
val jsn_cols=df1.columns.filter(_.toLowerCase != "id")
//here we are using when otherwise and updating status column then recreating json object using to_json function
df1.withColumn("status",when(col("idType") === "SELECTED",lit("Done")).otherwise(col("status"))).
withColumn("identifiers",to_json(struct(jsn_cols.head,jsn_cols.tail:_*))).
drop(jsn_cols:_*).
show(false)
//+---+------------------------------------------------------------+
//|Id |identifiers |
//+---+------------------------------------------------------------+
//|123|{"country":"PR","idType":"SELECTED","status":"Done"} |
//|234|{"country":"PR","idType":"NOT SELECTED","status":"Not Done"}|
//+---+------------------------------------------------------------+
I have two Spark datasets: one with columns accountid and key, where key is an array ([key1, key2, key3, ...]), and another with columns accountid and result, where result is a json string of key/value pairs ({key: value, key: value, ...}). I need to update the values in the second dataset whenever a key appears for that accountid in the first dataset.
import org.apache.spark.sql.functions._
val df= sc.parallelize(Seq(("20180610114049", "id1","key1"),
("20180610114049", "id2","key2"),
("20180610114049", "id1","key1"),
("20180612114049", "id2","key1"),
("20180613114049", "id3","key2"),
("20180613114049", "id3","key3")
)).toDF("date","accountid", "key")
val gp=df.groupBy("accountid","date").agg(collect_list("key"))
+---------+--------------+-----------------+
|accountid| date|collect_list(key)|
+---------+--------------+-----------------+
| id2|20180610114049| [key2]|
| id1|20180610114049| [key1, key1]|
| id3|20180613114049| [key2, key3]|
| id2|20180612114049| [key1]|
+---------+--------------+-----------------+
val df2= sc.parallelize(Seq(("20180610114049", "id1","{'key1':'0.0','key2':'0.0','key3':'0.0'}"),
("20180610114049", "id2","{'key1':'0.0','key2':'0.0','key3':'0.0'}"),
("20180611114049", "id1","{'key1':'0.0','key2':'0.0','key3':'0.0'}"),
("20180612114049", "id2","{'key1':'0.0','key2':'0.0','key3':'0.0'}"),
("20180613114049", "id3","{'key1':'0.0','key2':'0.0','key3':'0.0'}")
)).toDF("date","accountid", "result")
+--------------+---------+----------------------------------------+
|date |accountid|result |
+--------------+---------+----------------------------------------+
|20180610114049|id1 |{'key1':'0.0','key2':'0.0','key3':'0.0'}|
|20180610114049|id2 |{'key1':'0.0','key2':'0.0','key3':'0.0'}|
|20180611114049|id1 |{'key1':'0.0','key2':'0.0','key3':'0.0'}|
|20180612114049|id2 |{'key1':'0.0','key2':'0.0','key3':'0.0'}|
|20180613114049|id3 |{'key1':'0.0','key2':'0.0','key3':'0.0'}|
+--------------+---------+----------------------------------------+
expected output
+--------------+---------+----------------------------------------+
|date |accountid|result |
+--------------+---------+----------------------------------------+
|20180610114049|id1 |{'key1':'1.0','key2':'0.0','key3':'0.0'}|
|20180610114049|id2 |{'key1':'0.0','key2':'1.0','key3':'0.0'}|
|20180611114049|id1 |{'key1':'0.0','key2':'0.0','key3':'0.0'}|
|20180612114049|id2 |{'key1':'1.0','key2':'0.0','key3':'0.0'}|
|20180613114049|id3 |{'key1':'0.0','key2':'1.0','key3':'1.0'}|
+--------------+---------+----------------------------------------+
You will most definitely need a UDF to do it cleanly here.
You can pass both the array and the JSON to the UDF after joining on date and accountid, parse the JSON inside the UDF using the parser of your choice (I'm using json4s in the example), check whether each key exists in the array and change its value if it does, then convert the map back to JSON and return it from the UDF.
val gp=df.groupBy("accountid","date").agg(collect_list("key").as("key"))
val joined = df2.join(gp, Seq("date", "accountid") , "left_outer")
joined.show(false)
//+--------------+---------+----------------------------------------+------------+
//|date |accountid|result |key |
//+--------------+---------+----------------------------------------+------------+
//|20180610114049|id2 |{'key1':'0.0','key2':'0.0','key3':'0.0'}|[key2] |
//|20180613114049|id3 |{'key1':'0.0','key2':'0.0','key3':'0.0'}|[key2, key3]|
//|20180610114049|id1 |{'key1':'0.0','key2':'0.0','key3':'0.0'}|[key1, key1]|
//|20180611114049|id1 |{'key1':'0.0','key2':'0.0','key3':'0.0'}|null |
//|20180612114049|id2 |{'key1':'0.0','key2':'0.0','key3':'0.0'}|[key1] |
//+--------------+---------+----------------------------------------+------------+
// the UDF that will do the most work
// it's important to declare `formats` inside the function
// to avoid object not Serializable exception
// Not all cases are covered, use with caution :D
val convertJsonValues = udf { (json: String, arr: Seq[String]) =>
  import org.json4s.jackson.JsonMethods._
  import org.json4s.JsonDSL._
  implicit val format = org.json4s.DefaultFormats
  // replace single quotes with double quotes so the parser accepts the input
  val kvMap = parse(json.replaceAll("'", "\"")).values.asInstanceOf[Map[String, String]]
  // flip the value to "1.0" for every key that appears in the array
  val updatedKV = kvMap.map { case (k, v) => if (arr.contains(k)) (k, "1.0") else (k, v) }
  compact(render(updatedKV))
}
// Use when-otherwise and send empty array where `key` is null
joined.select($"date",
$"accountid",
when($"key".isNull, convertJsonValues($"result", array()))
.otherwise(convertJsonValues($"result", $"key"))
.as("result")
).show(false)
//+--------------+---------+----------------------------------------+
//|date |accountid|result |
//+--------------+---------+----------------------------------------+
//|20180610114049|id2 |{"key1":"0.0","key2":"1.0","key3":"0.0"}|
//|20180613114049|id3 |{"key1":"0.0","key2":"1.0","key3":"1.0"}|
//|20180610114049|id1 |{"key1":"1.0","key2":"0.0","key3":"0.0"}|
//|20180611114049|id1 |{"key1":"0.0","key2":"0.0","key3":"0.0"}|
//|20180612114049|id2 |{"key1":"1.0","key2":"0.0","key3":"0.0"}|
//+--------------+---------+----------------------------------------+
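For completeness: if you happen to be on Spark 3.0+, the same transformation can be sketched without a UDF by parsing the json into a map column and rewriting its values with transform_values. This is only a sketch under that version assumption, reusing gp and the column names from above:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{MapType, StringType}
df2.join(gp, Seq("date", "accountid"), "left_outer")
  .withColumn("key", coalesce($"key", array().cast("array<string>")))       // no match -> empty key list
  .withColumn("kv", from_json($"result", MapType(StringType, StringType)))  // json string -> map<string,string>
  .withColumn("result", to_json(transform_values($"kv",
    (k, v) => when(array_contains($"key", k), lit("1.0")).otherwise(v))))   // flip matched keys to "1.0"
  .select("date", "accountid", "result")
  .show(false)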
You can achieve your requirement with a udf function after joining both dataframes. Of course there are steps involved such as converting json to struct, struct back to json, case class usage and more (comments are provided in the code for further explanation).
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
//aliasing the collected keys
val gp = df.groupBy("accountid","date").agg(collect_list("key").as("keys"))
//schema for converting the json string to a struct
val schema = StructType(Seq(StructField("key1", StringType, true), StructField("key2", StringType, true), StructField("key3", StringType, true)))
//udf function to update the values of the struct, where result is a case class
def updateKeysUdf = udf((arr: Seq[String], json: Row) =>
  Seq(json.schema.fieldNames.map(key => if (arr.contains(key)) "1.0" else json.getAs[String](key)))
    .collect { case Array(a, b, c) => result(a, b, c) }
    .head
)
//changing the json string to a struct using the above schema
df2.withColumn("result", from_json(col("result"), schema))
  .as("df2") //aliasing df2 for joining and selecting
  .join(gp.as("gp"), col("df2.accountid") === col("gp.accountid"), "left") //aliasing the gp dataframe and joining on accountid
  .select(col("df2.accountid"), col("df2.date"), to_json(updateKeysUdf(col("gp.keys"), col("df2.result"))).as("result")) //calling the above udf function and converting the result back to a json string
  .show(false)
where result is a case class
case class result(key1: String, key2: String, key3: String)
which should give you
+---------+--------------+----------------------------------------+
|accountid|date |result |
+---------+--------------+----------------------------------------+
|id3 |20180613114049|{"key1":"0.0","key2":"1.0","key3":"1.0"}|
|id1 |20180610114049|{"key1":"1.0","key2":"0.0","key3":"0.0"}|
|id1 |20180611114049|{"key1":"1.0","key2":"0.0","key3":"0.0"}|
|id2 |20180610114049|{"key1":"0.0","key2":"1.0","key3":"0.0"}|
|id2 |20180610114049|{"key1":"1.0","key2":"0.0","key3":"0.0"}|
|id2 |20180612114049|{"key1":"0.0","key2":"1.0","key3":"0.0"}|
|id2 |20180612114049|{"key1":"1.0","key2":"0.0","key3":"0.0"}|
+---------+--------------+----------------------------------------+
I hope the answer is helpful