Index of an array item in Scala

I have a table with an array column like:
+-Name-+
array
0: {"given_name":"B. A.", "surname":"Name1"}
1: {"given_name":"A.", "surname":"Name2"}
2: {"given_name":"C." "surname":"Name3"}
I would like to add one more field, "index", starting at 1, to each element of the array so I can track the order of the authors, like:
+-Name-+
array
0: {"given_name":"B. A.", "surname":"Name1", "index":"1"}
1: {"given_name":"A.", "surname":"Name2", "index":"2"}
2: {"given_name":"C." "surname":"Name3", "index":"3"}
How can I do this in Scala? Your help is much appreciated.

Here's one approach using a UDF that maps each element of the array-typed column to also include the element's index:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
import spark.implicits._
case class Name(given_name: String, surname: String)
case class NameIdx(given_name: String, surname: String, index: Int)
val df = Seq(
  Seq(Name("John", "Doe"), Name("Jane", "Smith"), Name("Mike", "Davis")),
  Seq(Name("Rachel", "Smith"), Name("Steve", "Thompson"))
).toDF("name")
val addIndex = udf((names: Seq[Row]) => names.map {
  case name @ Row(gn: String, sn: String) => NameIdx(gn, sn, names.indexOf(name) + 1)
})
df.select(addIndex($"name").as("name")).show(false)
// +----------------------------------------------+
// |name |
// +----------------------------------------------+
// |[[John,Doe,1], [Jane,Smith,2], [Mike,Davis,3]]|
// |[[Rachel,Smith,1], [Steve,Thompson,2]] |
// +----------------------------------------------+
To produce JSON values, apply to_json as follows:
df.select(to_json(addIndex($"name")).as("name")).show(false)
// +-----------------------------------------------------------------------------------------------------------------------------------------------------+
// |name |
// +-----------------------------------------------------------------------------------------------------------------------------------------------------+
// |[{"given_name":"John","surname":"Doe","index":1},{"given_name":"Jane","surname":"Smith","index":2},{"given_name":"Mike","surname":"Davis","index":3}]|
// |[{"given_name":"Rachel","surname":"Smith","index":1},{"given_name":"Steve","surname":"Thompson","index":2}] |
// +-----------------------------------------------------------------------------------------------------------------------------------------------------+
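A hedged alternative sketch (reusing the df, Name, and NameIdx definitions above): zipWithIndex avoids rescanning the array with indexOf and stays correct even if two entries of the array happen to be identical.
// zipWithIndex pairs each element with its position, so no indexOf lookup is needed
val addIndexZip = udf((names: Seq[Row]) => names.zipWithIndex.map {
  case (Row(gn: String, sn: String), i) => NameIdx(gn, sn, i + 1)
})
df.select(to_json(addIndexZip($"name")).as("name")).show(false)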

Related

Generate dynamic header using Scala case class for Spark Table

I have an existing case class with many fields
case class output(
  userId: String,
  timeStamp: String,
  ...
)
And I am using it to generate the header for a Spark job, like this.
---------------------
userId | timeStamp
---------------------
1      | 2324444444
2      | 2334445556
Now I want to add more columns, and these columns will come from a map(attributeName, attributeValue). So my question is: how can I add a map to the case class, and how can I then use the map keys as column names to generate dynamic columns? After this, my final output should be like:
----------------------------------------------------
userId | timeStamp  | attributeName1 | attributeName2
----------------------------------------------------
1      | 2324444444 |                |
2      | 2334445554 |                |
You can do something like this:
case class output(
  userId: String,
  timeStamp: String,
  keyvalues: Map[String, String], // assuming string attribute values
  ...
)
import spark.implicits._
import org.apache.spark.sql.functions._
val df = spark.read.textFile(inputlocation).as[output]
val keysDF = df.select(explode(map_keys($"keyvalues"))).distinct()
val keyCols = keysDF.collect().map(f => f.get(0)).map(f => col("keyvalues").getItem(f).as(f.toString))
df.select(col("userId") +: keyCols: _*)
Or you can check this thread for other ways to do it.
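For a quick end-to-end illustration of the same idea, here is a minimal self-contained sketch (the case class, column names, and sample values are illustrative assumptions, not your actual data):
import spark.implicits._
import org.apache.spark.sql.functions._
// Illustrative case class carrying the map of dynamic attributes
case class Output(userId: String, timeStamp: String, keyvalues: Map[String, String])
val sample = Seq(
  Output("1", "2324444444", Map("attributeName1" -> "a")),
  Output("2", "2334445556", Map("attributeName2" -> "b"))
).toDS()
// Collect the distinct map keys, then turn each key into its own column
val dynCols = sample.select(explode(map_keys($"keyvalues"))).distinct()
  .collect().map(_.getString(0))
  .map(k => col("keyvalues").getItem(k).as(k))
sample.select(Seq(col("userId"), col("timeStamp")) ++ dynCols: _*).show()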

Spark update value in the second dataset based on the value from first dataset

I have two Spark datasets: one with columns accountid and key, where the key column is an array in the format [key1, key2, key3, ...], and another with columns accountid and key values as JSON, i.e. accountid, {key: value, key: value, ...}. I need to update a value in the second dataset if its key appears for that accountid in the first dataset.
import org.apache.spark.sql.functions._
val df = sc.parallelize(Seq(
  ("20180610114049", "id1", "key1"),
  ("20180610114049", "id2", "key2"),
  ("20180610114049", "id1", "key1"),
  ("20180612114049", "id2", "key1"),
  ("20180613114049", "id3", "key2"),
  ("20180613114049", "id3", "key3")
)).toDF("date", "accountid", "key")
val gp = df.groupBy("accountid", "date").agg(collect_list("key"))
+---------+--------------+-----------------+
|accountid| date|collect_list(key)|
+---------+--------------+-----------------+
| id2|20180610114049| [key2]|
| id1|20180610114049| [key1, key1]|
| id3|20180613114049| [key2, key3]|
| id2|20180612114049| [key1]|
+---------+--------------+-----------------+
val df2 = sc.parallelize(Seq(
  ("20180610114049", "id1", "{'key1':'0.0','key2':'0.0','key3':'0.0'}"),
  ("20180610114049", "id2", "{'key1':'0.0','key2':'0.0','key3':'0.0'}"),
  ("20180611114049", "id1", "{'key1':'0.0','key2':'0.0','key3':'0.0'}"),
  ("20180612114049", "id2", "{'key1':'0.0','key2':'0.0','key3':'0.0'}"),
  ("20180613114049", "id3", "{'key1':'0.0','key2':'0.0','key3':'0.0'}")
)).toDF("date", "accountid", "result")
+--------------+---------+----------------------------------------+
|date |accountid|result |
+--------------+---------+----------------------------------------+
|20180610114049|id1 |{'key1':'0.0','key2':'0.0','key3':'0.0'}|
|20180610114049|id2 |{'key1':'0.0','key2':'0.0','key3':'0.0'}|
|20180611114049|id1 |{'key1':'0.0','key2':'0.0','key3':'0.0'}|
|20180612114049|id2 |{'key1':'0.0','key2':'0.0','key3':'0.0'}|
|20180613114049|id3 |{'key1':'0.0','key2':'0.0','key3':'0.0'}|
+--------------+---------+----------------------------------------+
expected output
+--------------+---------+----------------------------------------+
|date |accountid|result |
+--------------+---------+----------------------------------------+
|20180610114049|id1 |{'key1':'1.0','key2':'0.0','key3':'0.0'}|
|20180610114049|id2 |{'key1':'0.0','key2':'1.0','key3':'0.0'}|
|20180611114049|id1 |{'key1':'0.0','key2':'0.0','key3':'0.0'}|
|20180612114049|id2 |{'key1':'1.0','key2':'0.0','key3':'0.0'}|
|20180613114049|id3 |{'key1':'0.0','key2':'1.0','key3':'1.0'}|
+--------------+---------+----------------------------------------+
You will most definitely need a UDF to do it cleanly here.
You can pass both the array and the JSON to the UDF after joining on date and accountid, parse the JSON inside the UDF using the parser of your choice (I'm using JSON4S in the example), check if the key exists in the array and then change the value, convert it to JSON again and return it from the UDF.
val gp = df.groupBy("accountid", "date").agg(collect_list("key").as("key"))
val joined = df2.join(gp, Seq("date", "accountid"), "left_outer")
joined.show(false)
//+--------------+---------+----------------------------------------+------------+
//|date |accountid|result |key |
//+--------------+---------+----------------------------------------+------------+
//|20180610114049|id2 |{'key1':'0.0','key2':'0.0','key3':'0.0'}|[key2] |
//|20180613114049|id3 |{'key1':'0.0','key2':'0.0','key3':'0.0'}|[key2, key3]|
//|20180610114049|id1 |{'key1':'0.0','key2':'0.0','key3':'0.0'}|[key1, key1]|
//|20180611114049|id1 |{'key1':'0.0','key2':'0.0','key3':'0.0'}|null |
//|20180612114049|id2 |{'key1':'0.0','key2':'0.0','key3':'0.0'}|[key1] |
//+--------------+---------+----------------------------------------+------------+
// the UDF that will do the most work
// it's important to declare `formats` inside the function
// to avoid an "object not serializable" exception
// Not all cases are covered, use with caution :D
val convertJsonValues = udf { (json: String, arr: Seq[String]) =>
  import org.json4s.jackson.JsonMethods._
  import org.json4s.JsonDSL._
  implicit val formats = org.json4s.DefaultFormats
  // replace single quotes with double quotes
  val kvMap = parse(json.replaceAll("'", """"""")).values.asInstanceOf[Map[String, String]]
  val updatedKV = kvMap.map { case (k, v) => if (arr.contains(k)) (k, "1.0") else (k, v) }
  compact(render(updatedKV))
}
// Use when-otherwise and send empty array where `key` is null
joined.select($"date",
$"accountid",
when($"key".isNull, convertJsonValues($"result", array()))
.otherwise(convertJsonValues($"result", $"key"))
.as("result")
).show(false)
//+--------------+---------+----------------------------------------+
//|date |accountid|result |
//+--------------+---------+----------------------------------------+
//|20180610114049|id2 |{"key1":"0.0","key2":"1.0","key3":"0.0"}|
//|20180613114049|id3 |{"key1":"0.0","key2":"1.0","key3":"1.0"}|
//|20180610114049|id1 |{"key1":"1.0","key2":"0.0","key3":"0.0"}|
//|20180611114049|id1 |{"key1":"0.0","key2":"0.0","key3":"0.0"}|
//|20180612114049|id2 |{"key1":"1.0","key2":"0.0","key3":"0.0"}|
//+--------------+---------+----------------------------------------+
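An alternative sketch, in case you prefer to keep the JSON handling inside Spark itself (this is my variation, assuming Spark 2.2+ where from_json accepts a MapType): let from_json parse the string into a map so the UDF only flips values and never touches a JSON parser.
import org.apache.spark.sql.types.{MapType, StringType}
// Parse into a map column first, flip the flagged keys, then re-serialize with to_json
val mapSchema = MapType(StringType, StringType)
val flipKeys = udf((kv: Map[String, String], keys: Seq[String]) =>
  kv.map { case (k, v) => if (keys != null && keys.contains(k)) (k, "1.0") else (k, v) })
joined.select($"date", $"accountid",
  to_json(flipKeys(from_json(regexp_replace($"result", "'", "\""), mapSchema), $"key")).as("result"))
  .show(false)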
You can achieve your requirement with a udf function after joining both dataframes. Of course there are steps involved such as converting the JSON string to a struct, converting the struct back to JSON, using a case class, and more (comments are provided for further explanation).
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
//aliasing the collected keys
val gp = df.groupBy("accountid", "date").agg(collect_list("key").as("keys"))
//schema for converting the json string to a struct
val schema = StructType(Seq(StructField("key1", StringType, true), StructField("key2", StringType, true), StructField("key3", StringType, true)))
//udf function to update the values of the struct, where result is a case class
def updateKeysUdf = udf((arr: Seq[String], json: Row) =>
  Seq(json.schema.fieldNames.map(key => if (arr.contains(key)) "1.0" else json.getAs[String](key)))
    .collect { case Array(a, b, c) => result(a, b, c) }
    .head)
//changing the json string to a struct using the above schema
df2.withColumn("result", from_json(col("result"), schema))
  .as("df2") //aliasing df2 for joining and selecting
  .join(gp.as("gp"), col("df2.accountid") === col("gp.accountid"), "left") //aliasing the gp dataframe and joining on accountid
  .select(col("df2.accountid"), col("df2.date"), to_json(updateKeysUdf(col("gp.keys"), col("df2.result"))).as("result")) //selecting, calling the above udf, and converting back to a json string
  .show(false)
where result is a case class
case class result(key1: String, key2: String, key3: String)
which should give you
+---------+--------------+----------------------------------------+
|accountid|date |result |
+---------+--------------+----------------------------------------+
|id3 |20180613114049|{"key1":"0.0","key2":"1.0","key3":"1.0"}|
|id1 |20180610114049|{"key1":"1.0","key2":"0.0","key3":"0.0"}|
|id1 |20180611114049|{"key1":"1.0","key2":"0.0","key3":"0.0"}|
|id2 |20180610114049|{"key1":"0.0","key2":"1.0","key3":"0.0"}|
|id2 |20180610114049|{"key1":"1.0","key2":"0.0","key3":"0.0"}|
|id2 |20180612114049|{"key1":"0.0","key2":"1.0","key3":"0.0"}|
|id2 |20180612114049|{"key1":"1.0","key2":"0.0","key3":"0.0"}|
+---------+--------------+----------------------------------------+
I hope the answer is helpful
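A side note on the output above (my observation, not part of the original answer): the repeated rows for id1 and id2 come from joining on accountid alone. Joining on both date and accountid keeps one row per row of df2; the only extra care needed is the row whose keys column is null after the left join, which a when/otherwise can handle:
df2.withColumn("result", from_json(col("result"), schema))
  .as("df2")
  .join(gp.as("gp"),
    col("df2.accountid") === col("gp.accountid") && col("df2.date") === col("gp.date"),
    "left")
  .select(col("df2.accountid"), col("df2.date"),
    when(col("gp.keys").isNull, to_json(col("df2.result"))) // nothing to flip for this row
      .otherwise(to_json(updateKeysUdf(col("gp.keys"), col("df2.result"))))
      .as("result"))
  .show(false)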

Change value of nested column in DataFrame

I have a dataframe with two levels of nested fields:
root
|-- request: struct (nullable = true)
| |-- dummyID: string (nullable = true)
| |-- data: struct (nullable = true)
| | |-- fooID: string (nullable = true)
| | |-- barID: string (nullable = true)
I want to update the value of the fooID column here. I was able to update the value at the first level (for example the dummyID column) using this question as a reference: How to add a nested column to a DataFrame
Input data:
{
"request": {
"dummyID": "test_id",
"data": {
"fooID": "abc",
"barID": "1485351"
}
}
}
Output data:
{
"request": {
"dummyID": "test_id",
"data": {
"fooID": "def",
"barID": "1485351"
}
}
}
How can I do it using Scala?
Here is a generic solution to this problem that makes it possible to update any number of nested values, at any level, based on an arbitrary function applied in a recursive traversal:
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, struct}
import org.apache.spark.sql.types.StructType

def mutate(df: DataFrame, fn: Column => Column): DataFrame = {
  // Get a projection with fields mutated by `fn` and select it
  // out of the original frame with the schema reassigned to the original
  // frame (explained later)
  df.sqlContext.createDataFrame(df.select(traverse(df.schema, fn): _*).rdd, df.schema)
}

def traverse(schema: StructType, fn: Column => Column, path: String = ""): Array[Column] = {
  schema.fields.map(f => {
    f.dataType match {
      case s: StructType => struct(traverse(s, fn, path + f.name + "."): _*)
      case _ => fn(col(path + f.name))
    }
  })
}
This is effectively equivalent to the usual "just redefine the whole struct as a projection" solutions, but it automates re-nesting fields with the original structure AND preserves nullability/metadata (which are lost when you redefine the structs manually). Annoyingly, preserving those properties isn't possible while creating the projection (afaict) so the code above redefines the schema manually.
An example application:
case class Organ(name: String, count: Int)
case class Disease(id: Int, name: String, organ: Organ)
case class Drug(id: Int, name: String, alt: Array[String])
val df = Seq(
  (1, Drug(1, "drug1", Array("x", "y")), Disease(1, "disease1", Organ("heart", 2))),
  (2, Drug(2, "drug2", Array("a")), Disease(2, "disease2", Organ("eye", 3)))
).toDF("id", "drug", "disease")
df.show(false)
+---+------------------+-------------------------+
|id |drug |disease |
+---+------------------+-------------------------+
|1 |[1, drug1, [x, y]]|[1, disease1, [heart, 2]]|
|2 |[2, drug2, [a]] |[2, disease2, [eye, 3]] |
+---+------------------+-------------------------+
// Update the integer field ("count") at the lowest level:
val df2 = mutate(df, c => if (c.toString == "disease.organ.count") c - 1 else c)
df2.show(false)
+---+------------------+-------------------------+
|id |drug |disease |
+---+------------------+-------------------------+
|1 |[1, drug1, [x, y]]|[1, disease1, [heart, 1]]|
|2 |[2, drug2, [a]] |[2, disease2, [eye, 2]] |
+---+------------------+-------------------------+
// This will NOT necessarily be equal unless the metadata and nullability
// of all fields is preserved (as the code above does)
assertResult(df.schema.toString)(df2.schema.toString)
A limitation of this is that it cannot add new fields, only update existing ones (though the map can be changed into a flatMap and the function to return Array[Column] for that, if you don't care about preserving nullability/metadata).
Additionally, here is a more generic version for Dataset[T]:
case class Record(id: Int, drug: Drug, disease: Disease)
def mutateDS[T](df: Dataset[T], fn: Column => Column)(implicit enc: Encoder[T]): Dataset[T] = {
df.sqlContext.createDataFrame(df.select(traverse(df.schema, fn):_*).rdd, enc.schema).as[T]
}
// To call as typed dataset:
val fn: Column => Column = c => if (c.toString == "disease.organ.count") c - 1 else c
mutateDS(df.as[Record], fn).show(false)
// To call as untyped dataset:
import org.apache.spark.sql.catalyst.encoders.{ExpressionEncoder, RowEncoder}
implicit val encoder: ExpressionEncoder[Row] = RowEncoder(df.schema) // This is necessary regardless of sparkSession.implicits._ imports
mutateDS(df, fn).show(false)
One way, although cumbersome, is to fully unpack and recreate the column by explicitly referencing each element of the original struct.
dataFrame.withColumn("person",
struct(
col("person.age").alias("age),
struct(
col("person.name.first").alias("first"),
lit("some new value").alias("last")).alias("name")))

How to use the functions.explode to flatten element in dataFrame

I've made this piece of code:
case class RawPanda(id: Long, zip: String, pt: String, happy: Boolean, attributes: Array[Double])
case class PandaPlace(name: String, pandas: Array[RawPanda])
import org.apache.spark.SparkConf
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

object TestSparkDataFrame extends App {
  System.setProperty("hadoop.home.dir", "E:\\Programmation\\Libraries\\hadoop")
  val conf = new SparkConf().setAppName("TestSparkDataFrame").set("spark.driver.memory", "4g").setMaster("local[*]")
  val session = SparkSession.builder().config(conf).getOrCreate()
  import session.implicits._

  def createAndPrintSchemaRawPanda(session: SparkSession): DataFrame = {
    val newPanda = RawPanda(1, "M1B 5K7", "giant", true, Array(0.1, 0.1))
    val pandaPlace = PandaPlace("torronto", Array(newPanda))
    val df = session.createDataFrame(Seq(pandaPlace))
    df
  }

  val df2 = createAndPrintSchemaRawPanda(session)
  df2.show
+--------+--------------------+
| name| pandas|
+--------+--------------------+
|torronto|[[1,M1B 5K7,giant...|
+--------+--------------------+
val pandaInfo = df2.explode(df2("pandas")) {
  case Row(pandas: Seq[Row]) =>
    pandas.map {
      case Row(
        id: Long,
        zip: String,
        pt: String,
        happy: Boolean,
        attrs: Seq[Double]) => RawPanda(id, zip, pt, happy, attrs.toArray)
    }
}
pandaInfo.show
+--------+--------------------+---+-------+-----+-----+----------+
| name| pandas| id| zip| pt|happy|attributes|
+--------+--------------------+---+-------+-----+-----+----------+
|torronto|[[1,M1B 5K7,giant...| 1|M1B 5K7|giant| true|[0.1, 0.1]|
+--------+--------------------+---+-------+-----+-----+----------+
The problem is that the explode function, as I used it, is deprecated, so I would like to recalculate the pandaInfo dataframe using the method advised in the deprecation warning:
use flatMap() or select() with functions.explode() instead
But then when I do :
val pandaInfo = df2.select(functions.explode(df2("pandas")))
I obtain the same result as I had in df2.
I don't know how to proceed to use flatMap or functions.explode.
How could I use flatMap or functions.explode to obtain the result that I want (the one in pandaInfo)?
I've seen this post and this other one, but neither of them helped me.
Calling select with the explode function returns a DataFrame where the array pandas is "broken up" into individual records. Then, if you want to "flatten" the structure of the resulting single RawPanda per record, you can select the individual columns using a dot-separated "route":
val pandaInfo2 = df2.select($"name", explode($"pandas") as "pandas")
  .select($"name", $"pandas",
    $"pandas.id" as "id",
    $"pandas.zip" as "zip",
    $"pandas.pt" as "pt",
    $"pandas.happy" as "happy",
    $"pandas.attributes" as "attributes"
  )
A less verbose version of the exact same operation would be:
import org.apache.spark.sql.Encoders // going to use this to "encode" case class into schema
val pandaColumns = Encoders.product[RawPanda].schema.fields.map(_.name)
val pandaInfo3 = df2.select($"name", explode($"pandas") as "pandas")
.select(Seq($"name", $"pandas") ++ pandaColumns.map(f => $"pandas.$f" as f): _*)

Transformation of data into list of objects of class in spark scala

I am trying to write a Spark transformation to convert the data below into a list of objects of the following class. I am totally new to Scala and Spark, and I tried splitting the data and putting it into the case class, but I was unable to merge the rows back together. I'd appreciate your help with this.
Data :
FirstName,LastName,Country,match,Goals
Cristiano,Ronaldo,Portugal,Match1,1
Cristiano,Ronaldo,Portugal,Match2,1
Cristiano,Ronaldo,Portugal,Match3,0
Cristiano,Ronaldo,Portugal,Match4,2
Lionel,Messi,Argentina,Match1,1
Lionel,Messi,Argentina,Match2,2
Lionel,Messi,Argentina,Match3,1
Lionel,Messi,Argentina,Match4,2
Desired output:
PlayerStats {
  String FirstName,
  String LastName,
  String Country,
  Map<String, Int> matchandscore
}
Assuming you have already loaded the data into an RDD[String] named data:
case class PlayerStats(FirstName: String, LastName: String, Country: String, matchandscore: Map[String, Int])
import org.apache.spark.rdd.RDD

val result: RDD[PlayerStats] = data
  .filter(!_.startsWith("FirstName")) // remove the header line
  .map(_.split(",")).map { // map into case classes
    case Array(fn, ln, cntry, mn, g) => PlayerStats(fn, ln, cntry, Map(mn -> g.toInt))
  }
  .keyBy(p => (p.FirstName, p.LastName)) // key by player
  .reduceByKey((p1, p2) => p1.copy(matchandscore = p1.matchandscore ++ p2.matchandscore))
  .map(_._2) // drop the key
First convert each line into a key-value pair, say (Cristiano, rest of the data), then apply groupByKey (reduceByKey can also work), and then convert the grouped key-value data into your class by filling in the values. Take the famous word count program as a reference, and see the sketch below:
http://spark.apache.org/examples.html
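A minimal sketch of that idea, reusing the PlayerStats case class from the answer above (the input path and the sc variable are assumptions):
// Key on (first name, last name, country), merge the match -> goals maps, then build the case class
val perPlayer = sc.textFile("players.csv")          // assumed input path
  .filter(!_.startsWith("FirstName"))               // drop the header line
  .map(_.split(","))
  .map(a => ((a(0), a(1), a(2)), Map(a(3) -> a(4).toInt)))
  .reduceByKey(_ ++ _)
  .map { case ((fn, ln, country), scores) => PlayerStats(fn, ln, country, scores) }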
You could try something as follows:
val file = sc.textFile("myfile.csv")
val df = file.map(line => line.split(",")). // split each line on commas
  filter(lineSplit => lineSplit(0) != "FirstName"). // filter out the header row
  map(lineSplit => { // transform lines
    (lineSplit(0), lineSplit(1), lineSplit(2), Map((lineSplit(3), lineSplit(4).toInt)))
  }).
  toDF("FirstName", "LastName", "Country", "MatchAndScore")
df.schema
// res34: org.apache.spark.sql.types.StructType = StructType(StructField(FirstName,StringType,true), StructField(LastName,StringType,true), StructField(Country,StringType,true), StructField(MatchAndScore,MapType(StringType,IntegerType,false),true))
df.show
+---------+--------+---------+----------------+
|FirstName|LastName| Country| MatchAndScore|
+---------+--------+---------+----------------+
|Cristiano| Ronaldo| Portugal|Map(Match1 -> 1)|
|Cristiano| Ronaldo| Portugal|Map(Match2 -> 1)|
|Cristiano| Ronaldo| Portugal|Map(Match3 -> 0)|
|Cristiano| Ronaldo| Portugal|Map(Match4 -> 2)|
| Lionel| Messi|Argentina|Map(Match1 -> 1)|
| Lionel| Messi|Argentina|Map(Match2 -> 2)|
| Lionel| Messi|Argentina|Map(Match3 -> 1)|
| Lionel| Messi|Argentina|Map(Match4 -> 2)|
+---------+--------+---------+----------------+
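If the goal is one row per player, as in the desired output, the per-match maps can additionally be merged. A hedged sketch using a small merging UDF on the df built above (the helper name is illustrative):
import org.apache.spark.sql.functions._
// Merge the per-match maps collected for each player into a single map
val mergeMaps = udf((maps: Seq[Map[String, Int]]) => maps.reduce(_ ++ _))
df.groupBy("FirstName", "LastName", "Country")
  .agg(mergeMaps(collect_list(col("MatchAndScore"))).as("matchandscore"))
  .show(false)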