I have a DataFrame in Scala from which I need to create a new DataFrame for the distinct values of the SourceHash field.
var myProductsList = List[ProductInfo]()
val distinctFiles = dfDateFiltered.select(col("SourceHash")).distinct()
distinctFiles.foreach(rowFilter => {
  val productInfo = createProductInfo(validFrom, validTo, dfDateFiltered, rowFilter.getString(0))
  myProductsList = myProductsList :+ productInfo
})
myProductsList.toDF()
The problem is that this code throws a java.lang.NullPointerException inside createProductInfo for any invocation on the DataFrame dfDateFiltered.
The only way I can overcome this is by calling collect() before the foreach:
distinctFiles.collect().foreach(rowFilter => { ... })
But collect() is an expensive call, so it must be avoided. How can I efficiently extract a new Dataset without losing performance?
Below is the createProductInfo code:
private def createProductInfo(validFrom: String, validTo: String, dfDateFiltered: Dataset[Row], rowFilter: String): ProductInfo = {
  val dfPerFile = dfDateFiltered.filter(col("SourceHash") === rowFilter)
  val dfRow = dfPerFile.head
  val clientCount = dfPerFile.filter(col("ServerOrClient") === "Client").count
  val buildVersion = dfPerFile.filter(col("ServerOrClient") === "Server").select(col("BuildVersion")).head.getString(0)
  val productInfo = ProductInfo(dfRow.getInt(0),
    dfRow.getInt(1),
    dfRow.getString(12),
    dfRow.getString(13),
    dfRow.getString(14),
    validFrom,
    validTo,
    dfRow.getString(8),
    dfRow.getTimestamp(9),
    clientCount,
    buildVersion
  )
  productInfo
}
Function "createProductInfo" can be avoided, values can be collected by grouping. Original dataset don't exists in question, approach can be shown on such data:
val dfDateFiltered = Seq(
(1, "Server", 1),
(1, "Client", 2),
(2, "Client", 3)
).toDF("SourceHash", "ServerOrClient", "BuildVersion")
val validFrom = "Today"
dfDateFiltered
.groupBy("SourceHash")
.agg(sum(when($"ServerOrClient" === lit("Client"), 1).otherwise(0)).alias("clientCount"),
first(when($"ServerOrClient" === lit("Server"), col("BuildVersion")).otherwise(null), true).alias("buildVersion")
)
.withColumn("validFrom", lit(validFrom))
.as[Product]
Output:
+----------+-----------+------------+---------+
|SourceHash|clientCount|buildVersion|validFrom|
+----------+-----------+------------+---------+
|1 |1 |1 |Today |
|2 |1 |null |Today |
+----------+-----------+------------+---------+
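For completeness, the .as[Product] call above needs a matching case class in scope. A minimal sketch (the name Product is assumed from the call; note that sum yields a Long, and buildVersion can be null, hence the Option):

case class Product(
  SourceHash: Int,
  clientCount: Long,
  buildVersion: Option[String],
  validFrom: String)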
My goal is to have a Spark DataFrame that holds each of my Candy objects in a separate row, with their respective properties:
+------------------------------------+
|main                                |
+------------------------------------+
|{"brand":"brand1","name":"snickers"}|
|{"brand":"brand2","name":"haribo"}  |
+------------------------------------+
Case class for the proof of concept:
case class Candy(
brand: String,
name: String)
val candy1 = Candy("brand1", "snickers")
val candy2 = Candy("brand2", "haribo")
So far I have only managed to put them in the same row with:
import org.json4s.DefaultFormats
import org.json4s.jackson.Serialization.{read, write}
implicit val formats = DefaultFormats
val body = write(Array(candy1, candy2))
val df=Seq(body).toDF("main")
df.show(5, false)
giving me everything in one row instead of two. How can I split each object up into its own row while maintaining the schema of my Candy object?
+-------------------------------------------------------------------------+
| main |
+-------------------------------------------------------------------------+
|[{"brand":"brand1","name":"snickers"},{"brand":"brand2","name":"haribo"}]|
+-------------------------------------------------------------------------+
Do you want to keep each item as a JSON string inside the dataframe?
If you don't, you can do this, taking advantage of the Dataset's ability to handle case classes:
val df=Seq(candy1, candy2).toDS
This will give you the following output:
+------+--------+
| brand| name|
+------+--------+
|brand1|snickers|
|brand2| haribo|
+------+--------+
IMHO that's the best option, but if you want to keep your data as a JSON string, then you can first define a toJson method inside your case class:
case class Candy(brand:String, name: String) {
def toJson = s"""{"brand": "$brand", "name": "$name" }"""
}
And then build the DF using that method:
val df=Seq(candy1.toJson, candy2.toJson).toDF("main")
OUTPUT
+----------------------------------------+
|main |
+----------------------------------------+
|{"brand": "brand1", "name": "snickers" }|
|{"brand": "brand2", "name": "haribo" } |
+----------------------------------------+
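Alternatively, Spark's built-in toJSON serializes each row for you, so the handwritten method can be skipped (a sketch reusing the same case class):

val df = Seq(candy1, candy2).toDS().toJSON.toDF("main")
df.show(false)

This should print each object as its own JSON row, matching the desired output above.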
Suppose we have one CSV file with a header like
fname,age,class,dob
and a second CSV file with a header like
f_name,user_age,class,DataofBith
I am trying to define a common header for all CSV files, so that each one returns a DataFrame with the same standard header, like
first_name,age,class,dob
val df2 = df.withColumnRenamed("DateOfBirth","DateOfBirth").withColumnRenamed("fname","name")
df2.printSchema()
But that way is not generic. Can we do this dynamically for every CSV, mapping each variant to the standard name (e.g. both fname and f_name should be converted to first_name)?
You can use a list of old-to-new column name pairs and then iterate over it, as below.
Approach 1:
val df= Seq((1,"goutam","kumar"),(2,"xyz","kumar")).toDF("id","fname","lname")
val schema=Seq("id"->"sid","fname"->"sfname","lname"->"slname")
val mapedSchema = schema.map(x=>df(x._1).as(x._2))
df.select(mapedSchema :_*)
While reading the CSV, pass option("header", false); then you can get rid of the old header entirely and map the positional columns directly onto the new schema.
Approach 2:
val schema = Seq("sid", "sfname", "slname")
val mapedSchema = df.columns.zip(schema)
val mapedSchemaWithDF = mapedSchema.map(x => df(x._1).as(x._2))
df.select(mapedSchemaWithDF: _*)
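If the goal is simply to replace all column names positionally, toDF does the same in a single call (a sketch using the schema list above):

val renamed = df.toDF(schema: _*)
// renamed now has the columns sid, sfname, slname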
The function withColumnRenamed also works if the column is not present in the dataframe. Hence you can go ahead and read all the dataframes, apply the same renaming logic everywhere, and union them all later.
import org.apache.spark.sql.DataFrame
def renaming(df: DataFrame): DataFrame = {
df.withColumnRenamed("dob", "DateOfBirth")
.withColumnRenamed("fname", "name")
.withColumnRenamed("f_name", "name")
.withColumnRenamed("user_age", "age")
// append more renaming functions here
}
val df1 = renaming(spark.read.csv("...your path here"))
val df2 = renaming(spark.read.csv("...another path here"))
val result = df1.union(df2) // unionAll is deprecated in favor of union since Spark 2.0
result will have the same schema (name, age, class, DateOfBirth) in this case.
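Note that union matches columns by position; if the renamed dataframes can end up with their columns in different orders, unionByName (available since Spark 2.3) matches them by name instead:

val result = df1.unionByName(df2)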
Edit:
Following your input, if I understand correctly what you have to do, what about this?
val df1 = spark.read.csv("...your path here").toDF("name", "age", "class", "born_date")
val df2 = spark.read.csv("...another path here").toDF("name", "age", "class", "born_date")
You can use a simple select in combination with a Scala Map. It is easier to handle the column transformations via a dictionary (Map) in which the key is the old name and the value the new name.
Let's first create the two datasets as you described them:
val df1 = Seq(
("toto", 23, "g", "2010-06-09"),
("bla", 35, "s", "1990-10-01"),
("pino", 12, "a", "1995-10-05")
).toDF("fname", "age", "class", "dob")
val df2 = Seq(
("toto", 23, "g", "2010-06-09"),
("bla", 35, "s", "1990-10-01"),
("pino", 12, "a", "1995-10-05")
).toDF("f_name", "user_age", "class", "DataofBith")
Next we create a Scala function named transform which accepts two arguments: the target df and a mapping which contains the transformation details:
val mapping = Map(
"fname" -> "first_name",
"f_name" -> "first_name",
"user_age" -> "age",
"DataofBith" -> "dob"
)
def transform(df: DataFrame, mapping: Map[String, String]) : DataFrame = {
val keys = mapping.keySet
val cols = df.columns.map{c =>
if(keys.contains(c))
df(c).as(mapping(c))
else
df(c)
}
df.select(cols:_*)
}
The function goes through the given columns, checking first whether the current column exists in mapping. If so, it renames it using the corresponding value from the dictionary; otherwise the column remains untouched. Note that this just renames the column (via alias), so we don't expect it to affect performance.
Finally, some examples:
val newDF1 = transform(df1, mapping)
newDF1.show
// +----------+---+-----+----------+
// |first_name|age|class| dob|
// +----------+---+-----+----------+
// | toto| 23| g|2010-06-09|
// | bla| 35| s|1990-10-01|
// | pino| 12| a|1995-10-05|
// +----------+---+-----+----------+
val newDF2 = transform(df2, mapping)
newDF2.show
// +----------+---+-----+----------+
// |first_name|age|class| dob|
// +----------+---+-----+----------+
// | toto| 23| g|2010-06-09|
// | bla| 35| s|1990-10-01|
// | pino| 12| a|1995-10-05|
// +----------+---+-----+----------+
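The same mapping can also be folded over withColumnRenamed, which is likewise a pure metadata operation (an equivalent sketch):

val newDF1ViaFold = mapping.foldLeft(df1) { case (df, (oldName, newName)) =>
  df.withColumnRenamed(oldName, newName)
}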
I have two Spark datasets: one with columns accountid and key, where the key column is an array in the format [key1, key2, key3, ...], and another with columns accountid and result, where result is a JSON string of key/value pairs, i.e. accountid, {key: value, key: value, ...}. I need to update the value in the second dataset whenever the key appears for that accountid in the first dataset.
import org.apache.spark.sql.functions._
val df= sc.parallelize(Seq(("20180610114049", "id1","key1"),
("20180610114049", "id2","key2"),
("20180610114049", "id1","key1"),
("20180612114049", "id2","key1"),
("20180613114049", "id3","key2"),
("20180613114049", "id3","key3")
)).toDF("date","accountid", "key")
val gp=df.groupBy("accountid","date").agg(collect_list("key"))
+---------+--------------+-----------------+
|accountid| date|collect_list(key)|
+---------+--------------+-----------------+
| id2|20180610114049| [key2]|
| id1|20180610114049| [key1, key1]|
| id3|20180613114049| [key2, key3]|
| id2|20180612114049| [key1]|
+---------+--------------+-----------------+
val df2= sc.parallelize(Seq(("20180610114049", "id1","{'key1':'0.0','key2':'0.0','key3':'0.0'}"),
("20180610114049", "id2","{'key1':'0.0','key2':'0.0','key3':'0.0'}"),
("20180611114049", "id1","{'key1':'0.0','key2':'0.0','key3':'0.0'}"),
("20180612114049", "id2","{'key1':'0.0','key2':'0.0','key3':'0.0'}"),
("20180613114049", "id3","{'key1':'0.0','key2':'0.0','key3':'0.0'}")
)).toDF("date","accountid", "result")
+--------------+---------+----------------------------------------+
|date |accountid|result |
+--------------+---------+----------------------------------------+
|20180610114049|id1 |{'key1':'0.0','key2':'0.0','key3':'0.0'}|
|20180610114049|id2 |{'key1':'0.0','key2':'0.0','key3':'0.0'}|
|20180611114049|id1 |{'key1':'0.0','key2':'0.0','key3':'0.0'}|
|20180612114049|id2 |{'key1':'0.0','key2':'0.0','key3':'0.0'}|
|20180613114049|id3 |{'key1':'0.0','key2':'0.0','key3':'0.0'}|
+--------------+---------+----------------------------------------+
expected output
+--------------+---------+----------------------------------------+
|date |accountid|result |
+--------------+---------+----------------------------------------+
|20180610114049|id1 |{'key1':'1.0','key2':'0.0','key3':'0.0'}|
|20180610114049|id2 |{'key1':'0.0','key2':'1.0','key3':'0.0'}|
|20180611114049|id1 |{'key1':'0.0','key2':'0.0','key3':'0.0'}|
|20180612114049|id2 |{'key1':'1.0','key2':'0.0','key3':'0.0'}|
|20180613114049|id3 |{'key1':'0.0','key2':'1.0','key3':'1.0'}|
+--------------+---------+----------------------------------------+
You will most definitely need a UDF to do it cleanly here.
You can pass both the array and the JSON to the UDF after joining on date and accountid, parse the JSON inside the UDF using the parser of your choice (I'm using json4s in the example), check whether each key exists in the array and change its value if so, convert the map back to JSON, and return it from the UDF.
val gp=df.groupBy("accountid","date").agg(collect_list("key").as("key"))
val joined = df2.join(gp, Seq("date", "accountid") , "left_outer")
joined.show(false)
//+--------------+---------+----------------------------------------+------------+
//|date |accountid|result |key |
//+--------------+---------+----------------------------------------+------------+
//|20180610114049|id2 |{'key1':'0.0','key2':'0.0','key3':'0.0'}|[key2] |
//|20180613114049|id3 |{'key1':'0.0','key2':'0.0','key3':'0.0'}|[key2, key3]|
//|20180610114049|id1 |{'key1':'0.0','key2':'0.0','key3':'0.0'}|[key1, key1]|
//|20180611114049|id1 |{'key1':'0.0','key2':'0.0','key3':'0.0'}|null |
//|20180612114049|id2 |{'key1':'0.0','key2':'0.0','key3':'0.0'}|[key1] |
//+--------------+---------+----------------------------------------+------------+
// the UDF that will do the most work
// it's important to declare `formats` inside the function
// to avoid object not Serializable exception
// Not all cases are covered, use with caution :D
val convertJsonValues = udf{(json: String, arr: Seq[String]) =>
import org.json4s.jackson.JsonMethods._
import org.json4s.JsonDSL._
implicit val format = org.json4s.DefaultFormats
// replace single quotes with double
val kvMap = parse(json.replaceAll("'", """"""")).values.asInstanceOf[Map[String,String]]
val updatedKV = kvMap.map{ case(k,v) => if(arr.contains(k)) (k,"1.0") else (k,v) }
compact(render(updatedKV))
}
// Use when-otherwise and send empty array where `key` is null
joined.select($"date",
$"accountid",
when($"key".isNull, convertJsonValues($"result", array()))
.otherwise(convertJsonValues($"result", $"key"))
.as("result")
).show(false)
//+--------------+---------+----------------------------------------+
//|date |accountid|result |
//+--------------+---------+----------------------------------------+
//|20180610114049|id2 |{"key1":"0.0","key2":"1.0","key3":"0.0"}|
//|20180613114049|id3 |{"key1":"0.0","key2":"1.0","key3":"1.0"}|
//|20180610114049|id1 |{"key1":"1.0","key2":"0.0","key3":"0.0"}|
//|20180611114049|id1 |{"key1":"0.0","key2":"0.0","key3":"0.0"}|
//|20180612114049|id2 |{"key1":"1.0","key2":"0.0","key3":"0.0"}|
//+--------------+---------+----------------------------------------+
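As a small simplification (a sketch, assuming Spark 2.2+ for typedLit), coalesce can replace the when/otherwise by substituting an empty array where key is null:

joined.select($"date",
  $"accountid",
  convertJsonValues($"result", coalesce($"key", typedLit(Seq.empty[String]))).as("result")
).show(false)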
You can achieve your requirement with the use of a udf function after you join both dataframes. Of course there are steps involved such as converting the JSON to a struct, the struct back to JSON, the use of a case class, and more (comments are provided for further explanation).
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
//aliasing the collected key
val gp = df.groupBy("accountid","date").agg(collect_list("key").as("keys"))
//schema for converting json to struct
val schema = StructType(Seq(StructField("key1", StringType, true), StructField("key2", StringType, true), StructField("key3", StringType, true)))
//udf function to update the values of the struct, where result is a case class (defined below)
def updateKeysUdf = udf((arr: Seq[String], json: Row) => {
  val Array(key1, key2, key3) = json.schema.fieldNames
    .map(key => if (arr.contains(key)) "1.0" else json.getAs[String](key))
  result(key1, key2, key3)
})
//changing json string to stuct using the above schema
df2.withColumn("result", from_json(col("result"), schema))
.as("df2") //aliasing df2 for joining and selecting
.join(gp.as("gp"), col("df2.accountid") === col("gp.accountid"), "left") //aliasing gp dataframe and joining with accountid
.select(col("df2.accountid"), col("df2.date"), to_json(updateKeysUdf(col("gp.keys"), col("df2.result"))).as("result")) //selecting and calling above udf function and finally converting to json stirng
.show(false)
where result is a case class
case class result(key1: String, key2: String, key3: String)
which should give you
+---------+--------------+----------------------------------------+
|accountid|date |result |
+---------+--------------+----------------------------------------+
|id3 |20180613114049|{"key1":"0.0","key2":"1.0","key3":"1.0"}|
|id1 |20180610114049|{"key1":"1.0","key2":"0.0","key3":"0.0"}|
|id1 |20180611114049|{"key1":"1.0","key2":"0.0","key3":"0.0"}|
|id2 |20180610114049|{"key1":"0.0","key2":"1.0","key3":"0.0"}|
|id2 |20180610114049|{"key1":"1.0","key2":"0.0","key3":"0.0"}|
|id2 |20180612114049|{"key1":"0.0","key2":"1.0","key3":"0.0"}|
|id2 |20180612114049|{"key1":"1.0","key2":"0.0","key3":"0.0"}|
+---------+--------------+----------------------------------------+
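Because the join above is on accountid alone, an account with several dates yields duplicated rows, as visible in the output. A stricter sketch joins on both keys; the UDF must then tolerate a null keys array for unmatched rows (e.g. by wrapping it as Option(arr).getOrElse(Seq.empty[String])):

.join(gp.as("gp"),
  col("df2.accountid") === col("gp.accountid") && col("df2.date") === col("gp.date"),
  "left")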
I hope the answer is helpful
I have output from a Spark DataFrame like below:
Amt  |id   |num |Start_date              |Identifier
43.45|19840|A345|[2014-12-26, 2013-12-12]|[232323, 45466]
43.45|19840|A345|[2010-03-16, 2013-16-12]|[34343, 45454]
My requirement is to generate output in below format from the above output
Amt  |id   |num |Start_date|Identifier
43.45|19840|A345|2014-12-26|232323
43.45|19840|A345|2013-12-12|45466
43.45|19840|A345|2010-03-16|34343
43.45|19840|A345|2013-16-12|45454
Can somebody help me achieve this?
Is this the thing you're looking for?
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
val sparkSession = ...
import sparkSession.implicits._
val input = sc.parallelize(Seq(
(43.45, 19840, "A345", Seq("2014-12-26", "2013-12-12"), Seq(232323,45466)),
(43.45, 19840, "A345", Seq("2010-03-16", "2013-16-12"), Seq(34343,45454))
)).toDF("amt", "id", "num", "start_date", "identifier")
val zipArrays = udf { (dates: Seq[String], identifiers: Seq[Int]) =>
dates.zip(identifiers)
}
val output = input.select($"amt", $"id", $"num", explode(zipArrays($"start_date", $"identifier")))
.select($"amt", $"id", $"num", $"col._1".as("start_date"), $"col._2".as("identifier"))
output.show()
Which returns:
+-----+-----+----+----------+----------+
| amt| id| num|start_date|identifier|
+-----+-----+----+----------+----------+
|43.45|19840|A345|2014-12-26| 232323|
|43.45|19840|A345|2013-12-12| 45466|
|43.45|19840|A345|2010-03-16| 34343|
|43.45|19840|A345|2013-16-12| 45454|
+-----+-----+----+----------+----------+
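As a side note, on Spark 2.4+ the built-in arrays_zip can replace the custom UDF entirely (a sketch over the same input):

val output = input
  .select($"amt", $"id", $"num", explode(arrays_zip($"start_date", $"identifier")))
  .select($"amt", $"id", $"num", $"col.start_date", $"col.identifier")
output.show()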
EDIT:
Since you would like to have multiple columns that should be zipped, you should try something like this:
val input = sc.parallelize(Seq(
(43.45, 19840, "A345", Seq("2014-12-26", "2013-12-12"), Seq("232323","45466"), Seq("123", "234")),
(43.45, 19840, "A345", Seq("2010-03-16", "2013-16-12"), Seq("34343","45454"), Seq("345", "456"))
)).toDF("amt", "id", "num", "start_date", "identifier", "another_column")
val zipArrays = udf { seqs: Seq[Seq[String]] =>
for(i <- seqs.head.indices) yield seqs.fold(Seq.empty)((accu, seq) => accu :+ seq(i))
}
val columnsToSelect = Seq($"amt", $"id", $"num")
val columnsToZip = Seq($"start_date", $"identifier", $"another_column")
val outputColumns = columnsToSelect ++ columnsToZip.zipWithIndex.map { case (column, index) =>
$"col".getItem(index).as(column.toString())
}
val output = input.select($"amt", $"id", $"num", explode(zipArrays(array(columnsToZip: _*)))).select(outputColumns: _*)
output.show()
/*
+-----+-----+----+----------+----------+--------------+
| amt| id| num|start_date|identifier|another_column|
+-----+-----+----+----------+----------+--------------+
|43.45|19840|A345|2014-12-26| 232323| 123|
|43.45|19840|A345|2013-12-12| 45466| 234|
|43.45|19840|A345|2010-03-16| 34343| 345|
|43.45|19840|A345|2013-16-12| 45454| 456|
+-----+-----+----+----------+----------+--------------+
*/
If I understand correctly, you want the first elements of col 3 and 4.
Does this make sense?
import org.apache.spark.sql.Row

val newRows = oldDataFrame.rdd.map { row =>
  val zro = row(0)                   // 43.45
  val one = row(1)                   // 19840
  val two = row(2)                   // A345
  val dates = row.getSeq[String](3)  // [2014-12-26, 2013-12-12]
  val numbers = row.getSeq[Int](4)   // [232323, 45466]
  Row(zro, one, two, dates(0), numbers(0))
}
// newRows is an RDD[Row]; rebuild a DataFrame with spark.createDataFrame(newRows, newSchema)
You could use SparkSQL.
First you create a view with the information we need to process:
df.createOrReplaceTempView("tableTest")
Then you can select the data with the expansions. Note that two independent LATERAL VIEW explodes produce the cross product of the two arrays; if you need position-wise pairs instead, zip the arrays first, as in the UDF approach above:
sparkSession.sqlContext.sql(
"SELECT Amt, id, num, expanded_start_date, expanded_id " +
"FROM tableTest " +
"LATERAL VIEW explode(Start_date) Start_date AS expanded_start_date " +
"LATERAL VIEW explode(Identifier) AS expanded_id")
.show()