Creating spark dataframe with rows holding object with schema - scala

My goal is to have a Spark DataFrame that holds each of my Candy objects in a separate row, with their respective properties:
+------------------------------------+
|main                                |
+------------------------------------+
|{"brand":"brand1","name":"snickers"}|
|{"brand":"brand2","name":"haribo"}  |
+------------------------------------+
Case class for Proof of concept
case class Candy(
brand: String,
name: String)
val candy1 = Candy("brand1", "snickers")
val candy2 = Candy("brand2", "haribo")
So far I have only managed to put them in the same row with:
import org.json4s.DefaultFormats
import org.json4s.jackson.Serialization.{read, write}
implicit val formats = DefaultFormats
val body = write(Array(candy1, candy2))
val df=Seq(body).toDF("main")
df.show(5, false)
giving me everything in one row instead of 2. How can I split each object up into its own row while maintaining the schema of my Candy object?
+-------------------------------------------------------------------------+
| main |
+-------------------------------------------------------------------------+
|[{"brand":"brand1","name":"snickers"},{"brand":"brand2","name":"haribo"}]|
+-------------------------------------------------------------------------+

Do you want to keep the item as a json string inside the dataframe?
If you don't, you can take advantage of the Dataset API's ability to handle case classes directly:
// requires `import spark.implicits._` in scope
val df=Seq(candy1, candy2).toDS
This will give you the following output:
+------+--------+
| brand| name|
+------+--------+
|brand1|snickers|
|brand2| haribo|
+------+--------+
IMHO that's the best option, but if you want to keep your data as a JSON string, you can first define a toJson method inside your case class (note that this simple interpolation does not escape quotes or backslashes inside the values):
case class Candy(brand:String, name: String) {
def toJson = s"""{"brand": "$brand", "name": "$name" }"""
}
And then build the DF using that method:
val df=Seq(candy1.toJson, candy2.toJson).toDF("main")
OUTPUT
+----------------------------------------+
|main |
+----------------------------------------+
|{"brand": "brand1", "name": "snickers" }|
|{"brand": "brand2", "name": "haribo" } |
+----------------------------------------+
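As an alternative to a hand-written toJson, Spark can serialize each row itself through Dataset.toJSON, which also takes care of escaping. The following is a minimal sketch, assuming a local SparkSession; the object name CandyJsonSketch and the helper jsonRows are made up for illustration and are not part of the original answer.

```scala
import org.apache.spark.sql.SparkSession

// Defined at top level so Spark can derive an encoder for it.
case class Candy(brand: String, name: String)

// Hypothetical helper object, only for this illustration.
object CandyJsonSketch {
  // Returns one JSON string per Candy, serialized by Spark itself.
  def jsonRows(spark: SparkSession, candies: Seq[Candy]): Seq[String] = {
    import spark.implicits._
    val ds = candies.toDS()          // typed Dataset[Candy], one row per object
    val df = ds.toJSON.toDF("main")  // one JSON string per row, in a single column named `main`
    df.as[String].collect().toSeq
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local single-threaded session is enough for the demo.
    val spark = SparkSession.builder().master("local[1]").appName("candy-json").getOrCreate()
    jsonRows(spark, Seq(Candy("brand1", "snickers"), Candy("brand2", "haribo"))).foreach(println)
    spark.stop()
  }
}
```

With this approach the typed Dataset stays available for further processing, and the JSON column is only produced at the point where a string is actually needed.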

Related

Write transformation in typesafe way

How do I write the below code in typesafe manner in spark scala with Dataset Api:
val schema: StructType = Encoders.product[CaseClass].schema
//read json from a file
val readAsDataSet: Dataset[CaseClass] = sparkSession.read.option("mode", mode).schema(schema).json(path).as[CaseClass]
//below code needs to be written in type safe way:
val someDF= readAsDataSet.withColumn("col1",explode(col("col_to_be_exploded")))
.select(from_unixtime(col("timestamp").divide(1000))
.as("date"), col("col1"))
As someone in the comments said, you can create a Dataset[CaseClass] and do your operations on there. Let's set it up:
import spark.implicits._
case class MyTest (timestamp: Long, col_explode: Seq[String])
val df = Seq(
MyTest(1673850366000L, Seq("some", "strings", "here")),
MyTest(1271850365998L, Seq("pasta", "with", "cream")),
MyTest(611850366000L, Seq("tasty", "food"))
).toDF("timestamp", "col_explode").as[MyTest]
df.show(false)
+-------------+---------------------+
|timestamp |col_explode |
+-------------+---------------------+
|1673850366000|[some, strings, here]|
|1271850365998|[pasta, with, cream] |
|611850366000 |[tasty, food] |
+-------------+---------------------+
Typically, you can do many operations with the map function and the Scala language.
A map function returns exactly one output element per input element. The explode function that you're using, however, can return several rows per input row. You can implement this behaviour with the flatMap function.
So, using the Scala language and the flatMap function together, you can do something like this:
import java.time.LocalDateTime
import java.time.ZoneOffset
case class Exploded (datetime: String, exploded: String)
val output = df.flatMap{ case MyTest(timestamp, col_explode) =>
col_explode.map( value => {
val date = LocalDateTime.ofEpochSecond(timestamp/1000, 0, ZoneOffset.UTC).toString
Exploded(date, value)
}
)
}
output.show(false)
+-------------------+--------+
|datetime |exploded|
+-------------------+--------+
|2023-01-16T06:26:06|some |
|2023-01-16T06:26:06|strings |
|2023-01-16T06:26:06|here |
|2010-04-21T11:46:05|pasta |
|2010-04-21T11:46:05|with |
|2010-04-21T11:46:05|cream |
|1989-05-22T14:26:06|tasty |
|1989-05-22T14:26:06|food |
+-------------------+--------+
As you see, we've created a second case class called Exploded which we use to type our output dataset. Our output dataset has the following type: org.apache.spark.sql.Dataset[Exploded] so everything is completely type safe.
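The difference between map and flatMap described above can also be seen on plain Scala collections, without Spark. This is only a sketch: Record is a hypothetical type that mirrors the shape of MyTest, and the sample values are invented.

```scala
object MapVsFlatMapSketch {
  // Hypothetical record type mirroring the shape of MyTest.
  final case class Record(timestamp: Long, values: Seq[String])

  val records: Seq[Record] = Seq(
    Record(1L, Seq("some", "strings", "here")),
    Record(2L, Seq("tasty", "food"))
  )

  // map: exactly one output element per input element, so the result still has 2 elements.
  val mapped: Seq[Seq[(Long, String)]] =
    records.map(r => r.values.map(v => (r.timestamp, v)))

  // flatMap: each input may yield zero or more outputs, flattened into one sequence.
  val exploded: Seq[(Long, String)] =
    records.flatMap(r => r.values.map(v => (r.timestamp, v)))

  def main(args: Array[String]): Unit = {
    println(mapped.size)   // 2
    println(exploded.size) // 5
  }
}
```

A record with an empty values list contributes nothing to the flatMap result, which matches how explode drops rows whose array is empty.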

create view for two different dataframe in scala spark

I have a code snippet that reads a JSON array of file paths, unions the data it reads, and gives me two different tables. I want to call createOrReplaceTempView(name) for each of those two tables, with the name taken from the JSON array like below:
{
"source": [
{
"name": "testPersons",
"data": [
"E:\\dataset\\2020-05-01\\",
"E:\\dataset\\2020-05-02\\"
],
"type": "json"
},
{
"name": "testPets",
"data": [
"E:\\dataset\\2020-05-01\\078\\",
"E:\\dataset\\2020-05-02\\078\\"
],
"type": "json"
}
]
}
My output:
testPersons
+------+---+
|name  |age|
+------+---+
|John  |24 |
|Cammy |20 |
|Britto|30 |
|George|23 |
|Mikle |15 |
+------+---+
testPets
+------+---+
|name  |age|
+------+---+
|piku  |2  |
|jimmy |3  |
|rapido|1  |
+------+---+
Above are my output and the JSON array. My code iterates through each array element and reads the files listed in its data section.
How can I change the code below to create a temp view for each output table?
For example, I want to call .createOrReplaceTempView("testPersons") and .createOrReplaceTempView("testPets"), with the view name taken from the JSON array.
if (dataArr(counter)("type").value.toString() == "json") {
val name = dataArr(counter)("name").value.toString()
val dataPath = dataArr(counter)("data").arr
val input = dataPath.map(item => {
val rdd = spark.sparkContext.wholeTextFiles(item.str).map(i => "[" + i._2.replaceAll("\\}.*\n{0,}.*\\{", "},{") + "]")
spark
.read
.schema(Schema.getSchema(name))
.option("multiLine", true)
.json(rdd)
})
val emptyDF = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], Schema.getSchema(name))
val finalDF = input.foldLeft(emptyDF)((x, y) => x.union(y))
finalDF.show()
}
Expected output:
spark.sql("SELECT * FROM testPersons").show()
spark.sql("SELECT * FROM testPets").show()
It should give me the table for each one.
Since you already have your data wrangled into shape and have your rows in DataFrames and simply want to access them as temporary views, I suppose you are looking for the function(s):
createOrReplaceGlobalTempView
createOrReplaceTempView
They can be invoked from a DataFrame/Dataset.
df.createOrReplaceGlobalTempView("testPersons")
spark.sql("SELECT * FROM global_temp.testPersons").show()
df.createOrReplaceTempView("testPersons")
spark.sql("SELECT * FROM testPersons").show()
For an explanation about the difference between the two, you can take a look at this question.
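In short, a temp view is tied to the SparkSession that created it, while a global temp view lives in the global_temp database and is shared by all sessions of the same application. The sketch below illustrates this under the assumption of a local SparkSession; the object name ViewScopeSketch, the view name testPersonsGlobal, and the sample rows are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical helper object, only for this illustration.
object ViewScopeSketch {
  // Returns (is the session-scoped view visible from a new session?,
  //          row count of the global view as seen from that new session).
  def check(spark: SparkSession): (Boolean, Long) = {
    import spark.implicits._
    val df = Seq(("John", 24), ("Cammy", 20)).toDF("name", "age")

    df.createOrReplaceTempView("testPersons")             // tied to this SparkSession
    df.createOrReplaceGlobalTempView("testPersonsGlobal") // shared across sessions of this application

    val other = spark.newSession() // same application, separate temp view registry
    val sessionViewVisible = other.catalog.tableExists("testPersons")
    val globalCount = other.sql("SELECT * FROM global_temp.testPersonsGlobal").count()
    (sessionViewVisible, globalCount)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("view-scope").getOrCreate()
    println(check(spark))
    spark.stop()
  }
}
```

If all queries run in a single session, as in the question, createOrReplaceTempView is usually sufficient.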
If you want to read the JSON dynamically, load the files listed under data into DataFrames, and then register each one as its own view, you can do something like this:
import net.liftweb.json._
import net.liftweb.json.DefaultFormats
case class Source(name: String, data: List[String], `type`: String)
val file = scala.io.Source.fromFile("path/to/your/file").mkString
implicit val formats: DefaultFormats.type = DefaultFormats
val json = parse(file)
val sourceList = (json \ "source").children
for (source <- sourceList) {
val s = source.extract[Source]
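// load every path with the format named in the `type` field, then union the results
val df = s.data.map(d => spark.read.format(s.`type`).load(d)).reduce(_ union _)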
df.createOrReplaceTempView(s.name)
}

Generate dynamic header using Scala case class for Spark Table

I have an existing case class having many fields
case class output(
userId: String,
timeStamp: String,
...
)
And I am using it to generate header for a spark job like this.
--------------------
userId | timeStamp|
--------------------
1 2324444444
2 2334445556
Now I want to add more columns, which will come from a map(attributeName, attributeValue), using the attributeNames as column names. So my question is: how can I add a map to the case class, and how can I use the map keys as column names to generate dynamic columns? My final output should look like this:
----------------------------------------------------
userId | timeStamp| attributeName1 | attributeName2
----------------------------------------------------
1 2324444444| |
2 2334445554| |
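You can do something like this:
case class output(
userId: String,
timeStamp: String,
keyvalues: Map[String, String],
...
)
import spark.implicits._
import org.apache.spark.sql.functions._
// textFile would yield a Dataset[String], which cannot be read as `output`;
// JSON input is an assumption here, so use the reader that matches your data
val df = spark.read.json(inputlocation).as[output]
// collect the distinct keys that occur in the map column
val keysDF = df.select(explode(map_keys($"keyvalues"))).distinct()
// build one column per key, named after the key
val keyCols = keysDF.collect().map(f => f.get(0)).map(f => col("keyvalues").getItem(f).as(f.toString))
df.select(Seq(col("userId"), col("timeStamp")) ++ keyCols: _*)
Or you can check this thread for other ways to do it.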
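The map-keys approach can be tried end to end on in-memory data. This is a minimal sketch, assuming a local SparkSession and Spark 2.3 or later (for map_keys); the names UserAttributes, DynamicColumnsSketch, and widen, as well as the sample values, are hypothetical stand-ins for the elided parts of the original case class.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode, map_keys}

// Hypothetical stand-in for the `output` case class, with the map field added.
case class UserAttributes(userId: String, timeStamp: String, keyvalues: Map[String, String])

object DynamicColumnsSketch {
  // Turns every distinct key of the `keyvalues` map into its own column.
  def widen(df: DataFrame): DataFrame = {
    val keys: Seq[String] = df
      .select(explode(map_keys(col("keyvalues"))))
      .distinct()
      .collect()              // the set of keys must be known on the driver to build the projection
      .map(_.getString(0))
      .sorted                 // sorted only to make the column order deterministic
      .toSeq
    val keyCols = keys.map(k => col("keyvalues").getItem(k).as(k))
    df.select(Seq(col("userId"), col("timeStamp")) ++ keyCols: _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("dynamic-columns").getOrCreate()
    import spark.implicits._
    val ds = Seq(
      UserAttributes("1", "2324444444", Map("attributeName1" -> "a")),
      UserAttributes("2", "2334445556", Map("attributeName2" -> "b"))
    ).toDS()
    widen(ds.toDF()).show(false)
    spark.stop()
  }
}
```

Rows whose map lacks a given key get null in that column. Because the keys are collected to the driver, this sketch fits a moderate number of distinct attribute names.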

Spark update value in the second dataset based on the value from first dataset

I have two Spark datasets. The first has the columns accountid and key, where key is an array [key1, key2, key3, ...]. The second has the column accountid and a column of key/value pairs stored as JSON: accountid, {key:value, key:value, ...}. I need to update a value in the second dataset if its key appears for that accountid in the first dataset.
import org.apache.spark.sql.functions._
val df= sc.parallelize(Seq(("20180610114049", "id1","key1"),
("20180610114049", "id2","key2"),
("20180610114049", "id1","key1"),
("20180612114049", "id2","key1"),
("20180613114049", "id3","key2"),
("20180613114049", "id3","key3")
)).toDF("date","accountid", "key")
val gp=df.groupBy("accountid","date").agg(collect_list("key"))
+---------+--------------+-----------------+
|accountid| date|collect_list(key)|
+---------+--------------+-----------------+
| id2|20180610114049| [key2]|
| id1|20180610114049| [key1, key1]|
| id3|20180613114049| [key2, key3]|
| id2|20180612114049| [key1]|
+---------+--------------+-----------------+
val df2= sc.parallelize(Seq(("20180610114049", "id1","{'key1':'0.0','key2':'0.0','key3':'0.0'}"),
("20180610114049", "id2","{'key1':'0.0','key2':'0.0','key3':'0.0'}"),
("20180611114049", "id1","{'key1':'0.0','key2':'0.0','key3':'0.0'}"),
("20180612114049", "id2","{'key1':'0.0','key2':'0.0','key3':'0.0'}"),
("20180613114049", "id3","{'key1':'0.0','key2':'0.0','key3':'0.0'}")
)).toDF("date","accountid", "result")
+--------------+---------+----------------------------------------+
|date |accountid|result |
+--------------+---------+----------------------------------------+
|20180610114049|id1 |{'key1':'0.0','key2':'0.0','key3':'0.0'}|
|20180610114049|id2 |{'key1':'0.0','key2':'0.0','key3':'0.0'}|
|20180611114049|id1 |{'key1':'0.0','key2':'0.0','key3':'0.0'}|
|20180612114049|id2 |{'key1':'0.0','key2':'0.0','key3':'0.0'}|
|20180613114049|id3 |{'key1':'0.0','key2':'0.0','key3':'0.0'}|
+--------------+---------+----------------------------------------+
expected output
+--------------+---------+----------------------------------------+
|date |accountid|result |
+--------------+---------+----------------------------------------+
|20180610114049|id1 |{'key1':'1.0','key2':'0.0','key3':'0.0'}|
|20180610114049|id2 |{'key1':'0.0','key2':'1.0','key3':'0.0'}|
|20180611114049|id1 |{'key1':'0.0','key2':'0.0','key3':'0.0'}|
|20180612114049|id2 |{'key1':'1.0','key2':'0.0','key3':'0.0'}|
|20180613114049|id3 |{'key1':'0.0','key2':'1.0','key3':'1.0'}|
+--------------+---------+----------------------------------------+
You will most definitely need a UDF to do it cleanly here.
You can pass both the array and the JSON to the UDF after joining on date and accountid, parse the JSON inside the UDF using the parser of your choice (I'm using JSON4S in the example), check if the key exists in the array and then change the value, convert it to JSON again and return it from the UDF.
val gp=df.groupBy("accountid","date").agg(collect_list("key").as("key"))
val joined = df2.join(gp, Seq("date", "accountid") , "left_outer")
joined.show(false)
//+--------------+---------+----------------------------------------+------------+
//|date |accountid|result |key |
//+--------------+---------+----------------------------------------+------------+
//|20180610114049|id2 |{'key1':'0.0','key2':'0.0','key3':'0.0'}|[key2] |
//|20180613114049|id3 |{'key1':'0.0','key2':'0.0','key3':'0.0'}|[key2, key3]|
//|20180610114049|id1 |{'key1':'0.0','key2':'0.0','key3':'0.0'}|[key1, key1]|
//|20180611114049|id1 |{'key1':'0.0','key2':'0.0','key3':'0.0'}|null |
//|20180612114049|id2 |{'key1':'0.0','key2':'0.0','key3':'0.0'}|[key1] |
//+--------------+---------+----------------------------------------+------------+
// the UDF that will do the most work
// it's important to declare `formats` inside the function
// to avoid object not Serializable exception
// Not all cases are covered, use with caution :D
val convertJsonValues = udf{(json: String, arr: Seq[String]) =>
import org.json4s.jackson.JsonMethods._
import org.json4s.JsonDSL._
implicit val format = org.json4s.DefaultFormats
// replace single quotes with double
val kvMap = parse(json.replaceAll("'", """"""")).values.asInstanceOf[Map[String,String]]
val updatedKV = kvMap.map{ case(k,v) => if(arr.contains(k)) (k,"1.0") else (k,v) }
compact(render(updatedKV))
}
// Use when-otherwise and send empty array where `key` is null
joined.select($"date",
$"accountid",
when($"key".isNull, convertJsonValues($"result", array()))
.otherwise(convertJsonValues($"result", $"key"))
.as("result")
).show(false)
//+--------------+---------+----------------------------------------+
//|date |accountid|result |
//+--------------+---------+----------------------------------------+
//|20180610114049|id2 |{"key1":"0.0","key2":"1.0","key3":"0.0"}|
//|20180613114049|id3 |{"key1":"0.0","key2":"1.0","key3":"1.0"}|
//|20180610114049|id1 |{"key1":"1.0","key2":"0.0","key3":"0.0"}|
//|20180611114049|id1 |{"key1":"0.0","key2":"0.0","key3":"0.0"}|
//|20180612114049|id2 |{"key1":"1.0","key2":"0.0","key3":"0.0"}|
//+--------------+---------+----------------------------------------+
You can achieve your requirement with a udf function after you join both dataframes. It involves a few extra steps, such as converting the JSON to a struct, converting the struct back to JSON, and using a case class (comments are provided for further explanation):
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
//aliasing the collected key
val gp = df.groupBy("accountid","date").agg(collect_list("key").as("keys"))
//schema for converting json to struct
val schema = StructType(Seq(StructField("key1", StringType, true), StructField("key2", StringType, true), StructField("key3", StringType, true)))
//udf function to update the values of struct where result is a case class
def updateKeysUdf = udf((arr: Seq[String], json: Row) => Seq(json.schema.fieldNames.map(key => if(arr.contains(key)) "1.0" else json.getAs[String](key))).collect{case Array(a,b,c) => result(a,b,c)}.toList(0))
//changing json string to struct using the above schema
df2.withColumn("result", from_json(col("result"), schema))
.as("df2") //aliasing df2 for joining and selecting
.join(gp.as("gp"), col("df2.accountid") === col("gp.accountid"), "left") //aliasing gp dataframe and joining with accountid
.select(col("df2.accountid"), col("df2.date"), to_json(updateKeysUdf(col("gp.keys"), col("df2.result"))).as("result")) //selecting and calling above udf function and finally converting to json string
.show(false)
where result is a case class
case class result(key1: String, key2: String, key3: String)
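which should give you the output below. Note that the join above matches on accountid only, so each row of df2 is paired with every date group of that account and the result has more rows than the expected output in the question. To match it, the join would also have to use date, and the udf would then have to cope with a null keys array for rows without a match.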
+---------+--------------+----------------------------------------+
|accountid|date |result |
+---------+--------------+----------------------------------------+
|id3 |20180613114049|{"key1":"0.0","key2":"1.0","key3":"1.0"}|
|id1 |20180610114049|{"key1":"1.0","key2":"0.0","key3":"0.0"}|
|id1 |20180611114049|{"key1":"1.0","key2":"0.0","key3":"0.0"}|
|id2 |20180610114049|{"key1":"0.0","key2":"1.0","key3":"0.0"}|
|id2 |20180610114049|{"key1":"1.0","key2":"0.0","key3":"0.0"}|
|id2 |20180612114049|{"key1":"0.0","key2":"1.0","key3":"0.0"}|
|id2 |20180612114049|{"key1":"1.0","key2":"0.0","key3":"0.0"}|
+---------+--------------+----------------------------------------+
I hope the answer is helpful
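The core of both UDFs is a small pure function: set the value to "1.0" for every key that appears in the collected list. Keeping that logic outside Spark makes it easy to test, including the case where a left join on date and accountid leaves the key array null. A minimal sketch, where the names UpdateKeysSketch and markKeys are made up for illustration:

```scala
object UpdateKeysSketch {
  // Sets the value to "1.0" for every key listed in `keys`; other entries are kept as they are.
  // `keys` may be null when a left join found no match, so it is wrapped in Option first.
  def markKeys(values: Map[String, String], keys: Seq[String]): Map[String, String] = {
    val present = Option(keys).getOrElse(Seq.empty).toSet
    values.map { case (k, v) => if (present.contains(k)) k -> "1.0" else k -> v }
  }

  def main(args: Array[String]): Unit = {
    val result = Map("key1" -> "0.0", "key2" -> "0.0", "key3" -> "0.0")
    println(markKeys(result, Seq("key2", "key3")))
    println(markKeys(result, null))
  }
}
```

A UDF can then parse the JSON, call such a function, and render the map back to JSON, and the null check removes the need for the when/otherwise branch around the UDF call.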

passing UDF to a method or class

I have a UDF say
val testUDF = udf{s: String=>s.toUpperCase}
I want to create this UDF in a separate method or may be something else like an implementation class and pass it on another class which uses it. Is it possible?
Say suppose I have a class A
class A(df: DataFrame) {
def testMethod(): DataFrame = {
val demo=df.select(testUDF(col))
}
}
class A should be able to use UDF. Can this be achieved?
Given a dataframe as
+----+
|col1|
+----+
|abc |
|dBf |
|Aec |
+----+
And a udf function
import org.apache.spark.sql.functions._
val testUDF = udf{s: String=>s.toUpperCase}
You can definitely use that udf function from another class as
val demo = df.select(testUDF(col("col1")).as("upperCasedCol"))
which should give you
+-------------+
|upperCasedCol|
+-------------+
|ABC |
|DBF |
|AEC |
+-------------+
However, I would suggest using built-in functions where possible: a udf function requires the column data to be serialized and deserialized, which costs more time and memory than the built-in functions. A udf function should be the last choice.
You can use the upper function for your case:
val demo = df.select(upper(col("col1")).as("upperCasedCol"))
This will generate the same output as the original udf function
I hope the answer is helpful
Updated
Since your question asks how to call a udf function defined in another class or object, here is how.
Suppose you have an object where you define the udf function, or the built-in function that I suggested, as:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
object UDFs {
def testUDF = udf{s: String=>s.toUpperCase}
def testUpper(column: Column) = upper(column)
}
Your A class is as in your question, I just added another function
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
class A(df: DataFrame) {
def testMethod(): DataFrame = {
val demo = df.select(UDFs.testUDF(col("col1")))
demo
}
def usingUpper() = {
df.select(UDFs.testUpper(col("col1")))
}
}
Then you can call the functions from main as below
import org.apache.spark.sql.SparkSession
object TestUpper {
def main(args: Array[String]): Unit = {
val sparkSession = SparkSession.builder().appName("Simple Application")
.master("local")
.config("", "")
.getOrCreate()
import sparkSession.implicits._
val df = Seq(
("abc"),
("dBf"),
("Aec")
).toDF("col1")
val a = new A(df)
//calling udf function
a.testMethod().show(false)
//calling upper function
a.usingUpper().show(false)
}
}
I guess this is more than helpful
If I understand correctly you would actually like some kind of factory to create this user-defined-function for a specific class A.
This could be achieved using a type class which gets injected implicitly.
E.g. (I had to define UDF and DataFrame to be able to test this)
type UDF = String => String
case class DataFrame(col: String) {
def select(in: String) = s"col:$col, in:$in"
}
trait UDFFactory[A] {
def testUDF: UDF
}
implicit object UDFFactoryA extends UDFFactory[AClass] {
def testUDF: UDF = _.toUpperCase
}
class AClass(df: DataFrame) {
def testMethod(implicit factory: UDFFactory[AClass]) = {
val demo = df.select(factory.testUDF(df.col))
println(demo)
}
}
val a = new AClass(DataFrame("test"))
a.testMethod // prints 'col:test, in:TEST'
Like you mentioned, create a method exactly like your UDF in your object body or companion class,
val myUDF = udf((str:String) => { str.toUpperCase })
Then for some dataframe df do this,
val res=df withColumn("NEWCOLNAME", myUDF(col("OLDCOLNAME")))
This will turn something like this,
+-------------------+
| OLDCOLNAME |
+-------------------+
| abc |
+-------------------+
to
+-------------------+-------------------+
| OLDCOLNAME | NEWCOLNAME |
+-------------------+-------------------+
| abc | ABC |
+-------------------+-------------------+
Let me know if this helped, Cheers.
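Yes, that's possible, as functions are objects in Scala and can be passed around:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.col
class A(df: DataFrame, testUdf: UserDefinedFunction) {
def testMethod(): DataFrame = {
// "col1" is an assumed column name; replace it with the column you want to transform
df.select(testUdf(col("col1")))
}
}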
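Putting the pieces together, the following sketch defines the UDF in one object and hands it to a class through the constructor. It assumes a local SparkSession; the names Udfs, Transformer, and PassUdfSketch are hypothetical, and the column name col1 is taken from the earlier answer's sample data.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.{col, udf}

// The UDF lives in its own object so any class can reference or receive it.
object Udfs {
  // Option(...) guards against null values in the column.
  val toUpper: UserDefinedFunction = udf((s: String) => Option(s).map(_.toUpperCase).orNull)
}

// Hypothetical counterpart of class A: the UDF is injected through the constructor.
class Transformer(df: DataFrame, f: UserDefinedFunction) {
  def testMethod(column: String): DataFrame =
    df.select(f(col(column)).as("upperCasedCol"))
}

object PassUdfSketch {
  // Builds a small DataFrame, applies the injected UDF, and returns the resulting strings.
  def run(spark: SparkSession): Seq[String] = {
    import spark.implicits._
    val df = Seq("abc", "dBf", "Aec").toDF("col1")
    new Transformer(df, Udfs.toUpper).testMethod("col1").as[String].collect().toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("pass-udf").getOrCreate()
    run(spark).foreach(println)
    spark.stop()
  }
}
```

Passing the UDF as a constructor argument keeps the class independent of where the function is defined, which makes it straightforward to swap in a different function in tests.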