Spark Dataframe GroupBy and compute Complex aggregate function

Using a Spark DataFrame, I need to compute a percentage with the following formula:
Group by "KEY" and calculate "re_pcnt" as (sum(SA) / sum(SA / (PCT / 100))) * 100
For instance, the input DataFrame is:
val values1 = List(List("01", "20000", "45.30"), List("01", "30000", "45.30"))
.map(row => (row(0), row(1), row(2)))
val DS1 = values1.toDF("KEY", "SA", "PCT")
DS1.show()
+---+-----+-----+
|KEY|   SA|  PCT|
+---+-----+-----+
| 01|20000|45.30|
| 01|30000|45.30|
+---+-----+-----+
Expected result:
+---+--------------+
|KEY|       re_pcnt|
+---+--------------+
| 01|45.30000038505|
+---+--------------+
I have tried to calculate it as below:
val result = DS1.groupBy("KEY").agg(((sum("SA").divide(
sum(
("SA").divide(
("PCT").divide(100)
)
)
)) * 100).as("re_pcnt"))
But I get the following error: Error:(36, 16) value divide is not a member of String ("SA").divide({
Any suggestions on implementing the above logic?

The error occurs because "SA" and "PCT" are plain Strings, not Columns. You can import spark.implicits._ and then use $ to refer to a column.
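import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum
val spark = SparkSession.builder.getOrCreate()
import spark.implicits._ // enables the $"..." column syntax
val result = DS1.groupBy("KEY")
  .agg(((sum($"SA").divide(sum(($"SA").divide(($"PCT").divide(100))))) * 100)
    .as("re_pcnt"))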
This gives the requested output.
If you do not want to import the implicits, you can use the col() function instead of $.
It is also possible to pass a string to the agg() function by wrapping it in expr(). However, the string needs to be changed a bit. The following gives exactly the same result as before, but uses a string expression instead:
val opr = "sum(SA)/(sum(SA/(PCT/100))) * 100"
val df = DS1.groupBy("KEY").agg(expr(opr).as("re_pcnt"))
Note that .as("re_pcnt") needs to be inside the agg() call; it cannot be placed outside.

Your code works almost perfectly. You just have to put the '$' symbol in order to specify you're passing a column:
val result = DS1.groupBy($"KEY").agg(((sum($"SA").divide(
sum(
($"SA").divide(
($"PCT").divide(100)
)
)
)) * 100).as("re_pcnt"))
Here's the output:
result.show()
+---+-------+
|KEY|re_pcnt|
+---+-------+
| 01|   45.3|
+---+-------+
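The arithmetic behind the formula can be checked on plain Scala collections, without Spark. This is only a sketch: the `rows` list and the `rePcnt` name are illustrative, and the values are the two sample rows from the question, typed as doubles.

```scala
// Rows of (KEY, SA, PCT), taken from the sample input in the question.
val rows = List(("01", 20000.0, 45.30), ("01", 30000.0, 45.30))

// Group by KEY and apply (sum(SA) / sum(SA / (PCT / 100))) * 100 per group.
val rePcnt: Map[String, Double] = rows.groupBy(_._1).map { case (key, group) =>
  val sumSa = group.map(_._2).sum
  val sumScaled = group.map { case (_, sa, pct) => sa / (pct / 100) }.sum
  key -> ((sumSa / sumScaled) * 100)
}

println(rePcnt)
```

When every row in a group has the same PCT, as here, the formula reduces to that PCT, so the result for key 01 is 45.3 up to floating-point rounding.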

Related

How to Transform a Spark Scala Nested Map within a Map Data Structure?

I want to write a nested data structure consisting of a Map inside another Map using an array of a Scala case class.
The result should transform this dataframe:
+-----+-------+----------+----+
|Value|Country| Timestamp| Sum|
+-----+-------+----------+----+
|  123|    ITA|1475600500|18.0|
|  123|    ITA|1475600516|19.0|
+-----+-------+----------+----+
into:
+----------------------------------------------------------------------+
|value                                                                 |
+----------------------------------------------------------------------+
|[{"value":123,"attributes":{"ITA":{"1475600500":18,"1475600516":19}}}]|
+----------------------------------------------------------------------+
The actualResult dataset below gets me close but the structure isn't quite the same as my expected dataframe.
case class Record(value: Integer, attributes: Map[String, Map[String, BigDecimal]])
val actualResult = df
.map(r =>
Array(
Record(
r.getAs[Int]("Value"),
Map(
r.getAs[String]("Country") ->
Map(
r.getAs[String]("Timestamp") -> new BigDecimal(
r.getAs[Double]("Sum").toString
)
)
)
)
)
)
The Timestamp column in the actualResult dataset doesn't get combined together into the same Record row but rather creates two separate rows instead.
+------------------------------------------------------+
|value                                                 |
+------------------------------------------------------+
|[{"value":123,"attributes":{"ITA":{"1475600516":19}}}]|
|[{"value":123,"attributes":{"ITA":{"1475600500":18}}}]|
+------------------------------------------------------+

By using groupBy and collect_list on a combined column created with struct, I was able to get a single row, as in the output below.
val mycsv =
"""
|Value|Country|Timestamp|Sum
| 123|ITA|1475600500|18.0
| 123|ITA|1475600516|19.0
""".stripMargin('|').lines.toList.toDS()
val df: DataFrame = spark.read.option("header", true)
.option("sep", "|")
.option("inferSchema", true)
.csv(mycsv)
df.show
val df1 = df.
groupBy("Value","Country")
.agg( collect_list(struct(col("Country"), col("Timestamp"), col("Sum"))).alias("attributes")).drop("Country")
val json = df1.toJSON // you can save in to file
json.show(false)
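Result: the two rows are combined into one (attributes is an array of structs here rather than the nested map from the question):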
+-----+-------+----------+----+
|Value|Country| Timestamp| Sum|
+-----+-------+----------+----+
|123.0|ITA |1475600500|18.0|
|123.0|ITA |1475600516|19.0|
+-----+-------+----------+----+
+----------------------------------------------------------------------------------------------------------------------------------------------+
|value |
+----------------------------------------------------------------------------------------------------------------------------------------------+
|{"Value":123.0,"attributes":[{"Country":"ITA","Timestamp":1475600500,"Sum":18.0},{"Country":"ITA","Timestamp":1475600516,"Sum":19.0}]}|
+----------------------------------------------------------------------------------------------------------------------------------------------+
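For comparison, the nested Map-inside-a-Map shape that the question asks for can be sketched on plain Scala collections. This is not Spark code and not the answerer's method; the `rows` list is a hypothetical stand-in for the DataFrame rows, and `Record` mirrors the case class from the question (with `Int` and Scala's `BigDecimal` assumed).

```scala
// Rows of (Value, Country, Timestamp, Sum), taken from the question's input.
val rows = List(
  (123, "ITA", "1475600500", BigDecimal("18.0")),
  (123, "ITA", "1475600516", BigDecimal("19.0"))
)

case class Record(value: Int, attributes: Map[String, Map[String, BigDecimal]])

// Group by Value first, then by Country, and collect Timestamp -> Sum pairs
// into the innermost map so both timestamps end up in the same Record.
val records: List[Record] = rows.groupBy(_._1).toList.map { case (value, byValue) =>
  val attributes = byValue.groupBy(_._2).map { case (country, byCountry) =>
    country -> byCountry.map { case (_, _, ts, sum) => ts -> sum }.toMap
  }
  Record(value, attributes)
}

println(records)
```

The key point is that the grouping has to happen before the inner map is built; mapping row by row, as in the question, can only ever produce one timestamp per record.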

How to properly format the string in a new column of DataFrame?

I have a DataFrame with two columns col1 and col2 (Spark 2.2.0 and Scala 2.11). I need to create a new column in the following format:
=path("http://mywebsite.com/photo/AAA_BBB.jpg", 1)
where AAA is the value of col1 and BBB is the value of col2 for a given row.
The problem is that I do not know how to properly handle the double quotes ("). I tried this:
df = df.withColumn($"url",=path("http://mywebsite.com/photo/"+col("col1") + "_"+col("col2")+".jpg", 1))"
UPDATE:
It compiles now, but the column values are not inserted into the string. Instead of the column values, I see the column name as literal text.
df = df.withColumn("url_rec",lit("=path('http://mywebsite.com/photo/"+col("col1")+"_"+col("col1")+".jpg', 1)"))
I get this:
=path('http://mywebsite.com/photo/col1_col1.jpg', 1)

As stated in the comments, you can either nest several concat calls:
d.show
+---+---+
|  a|  b|
+---+---+
|AAA|BBB|
+---+---+
d.withColumn("URL",
  concat(
    concat(
      concat(
        concat(lit("""=path("http://mywebsite.com/photo/"""), $"a"),
        lit("_")),
      $"b"),
    lit(""".jpg", 1)""")
  )
).select("URL").as[String].first
// String = =path("http://mywebsite.com/photo/AAA_BBB.jpg", 1)
Or you can map over the rows of the dataframe to append a new column (which is cleaner than the concat method):
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
// Map over the underlying RDD[Row]; Dataset[Row].map would need an explicit Row encoder in Spark 2.x
val urlRdd = d.rdd.map { x =>
  Row.fromSeq(x.toSeq ++ Seq(s"""=path("http://mywebsite.com/photo/${x.getAs[String]("a")}_${x.getAs[String]("b")}.jpg", 1)"""))
}
val newDF = sqlContext.createDataFrame(urlRdd, d.schema.add("url", StringType))
newDF.map(_.getAs[String]("url")).first
// String = =path("http://mywebsite.com/photo/AAA_BBB.jpg", 1)

This is an old question, but I am adding my answer for anybody else who needs it. You can use the format_string function:
scala> df1.show()
+----+----+
|col1|col2|
+----+----+
| AAA| BBB|
+----+----+
scala> df1.withColumn(
"URL",
format_string(
"""=path("http://mywebsite.com/photo/%s_%s.jpg", 1)""",
col("col1"),
col("col2")
)
).show(truncate = false)
+----+----+--------------------------------------------------+
|col1|col2|URL |
+----+----+--------------------------------------------------+
|AAA |BBB |=path("http://mywebsite.com/photo/AAA_BBB.jpg", 1)|
+----+----+--------------------------------------------------+
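format_string takes a printf-style pattern, so the template can be tried out on ordinary Scala strings first. The sketch below uses plain `String.format` semantics rather than Spark, with `template` and `url` as illustrative names and "AAA"/"BBB" standing in for the column values.

```scala
// A triple-quoted string avoids having to escape the inner double quotes.
val template = """=path("http://mywebsite.com/photo/%s_%s.jpg", 1)"""

// Each %s is replaced, in order, by the corresponding argument.
val url = template.format("AAA", "BBB")

println(url)
```

This also hints at why the attempt in the question printed col1: concatenating a String with a Column using + calls toString on the Column, which yields its name rather than a per-row value.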

Using of String Functions in Dataframe Join in scala

I am trying to join two dataframes with a condition like "Wo" in "Hello World", i.e. the dataframe1 column contains the dataframe2 col1 value.
In HQL, we can use instr(t1.col1,t2.col1)>0
How can I achieve the same condition with DataFrames in Scala? I tried
df1.join(df2,df1("col1").indexOfSlice(df2("col1")) > 0)
But it throws the error below:
error: value indexOfSlice is not a member of org.apache.spark.sql.Column
I just want to achieve the HQL query below using DataFrames.
select t1.*,t2.col1 from t1,t2 where instr(t1.col1,t2.col1)>0

The following solution was tested with Spark 2.2. You need to define a UDF, and you can specify the join condition as a where filter on a cross join:
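import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.udf
// indexOfSlice is 0-based and returns -1 when c2 does not occur in c1
val indexOfSlice_ = (c1: String, c2: String) => c1.indexOfSlice(c2)
val islice = udf(indexOfSlice_)
val df10: DataFrame = Seq(("Hello World", 2), ("Foo", 3)).toDF("c1", "c2")
val df20: DataFrame = Seq(("Wo", 2), ("Bar", 3)).toDF("c3", "c4")
// >= 0 (not > 0) keeps matches at the very start of c1, like instr(...) > 0 in HQL
df10.crossJoin(df20).where(islice(df10.col("c1"), df20.col("c3")) >= 0).show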
// +-----------+---+---+---+
// |         c1| c2| c3| c4|
// +-----------+---+---+---+
// |Hello World|  2| Wo|  2|
// +-----------+---+---+---+
PS: Beware! A cross join is an expensive operation, as it yields a Cartesian product.
EDIT: Consider reading this when you want to use this solution.
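One detail worth checking when porting instr(...) > 0 to indexOfSlice: instr is 1-based and returns 0 for "not found", while Scala's indexOfSlice is 0-based and returns -1. A small plain-Scala sketch (the value names are illustrative):

```scala
// indexOfSlice is 0-based and returns -1 when the slice is absent.
val inside  = "Hello World".indexOfSlice("Wo") // 6
val atStart = "World".indexOfSlice("Wo")       // 0
val missing = "Foo".indexOfSlice("Wo")         // -1

// A "contains" test therefore needs >= 0; a > 0 test would drop the atStart case.
val containsWo = Seq("Hello World", "World", "Foo")
  .map(s => s -> (s.indexOfSlice("Wo") >= 0))

println(containsWo)
```

With the sample data from the answer the difference does not show up, because "Wo" sits in the middle of "Hello World", but it would for any value that starts with the search string.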

Spark and SparkSQL: How to imitate window function?

Description
Given a dataframe df
id | date
---------------
1 | 2015-09-01
2 | 2015-09-01
1 | 2015-09-03
1 | 2015-09-04
2 | 2015-09-04
I want to create a running counter or index,
grouped by the same id and
sorted by date in that group,
thus
id | date | counter
--------------------------
1 | 2015-09-01 | 1
1 | 2015-09-03 | 2
1 | 2015-09-04 | 3
2 | 2015-09-01 | 1
2 | 2015-09-04 | 2
This is something I can achieve with window function, e.g.
val w = Window.partitionBy("id").orderBy("date")
val resultDF = df.select( df("id"), rowNumber().over(w) )
Unfortunately, Spark 1.4.1 does not support window functions for regular dataframes:
org.apache.spark.sql.AnalysisException: Could not resolve window function 'row_number'. Note that, using window functions currently requires a HiveContext;
Questions
How can I achieve the above computation on current Spark 1.4.1 without using window functions?
When will window functions for regular dataframes be supported in Spark?
Thanks!

You can use HiveContext for local DataFrames as well and, unless you have a very good reason not to, it is probably a good idea anyway. It is the default SQLContext available in the spark-shell and pyspark shells (as of now sparkR seems to use a plain SQLContext), and its parser is recommended by the Spark SQL and DataFrame Guide.
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rowNumber
object HiveContextTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Hive Context")
    val sc = new SparkContext(conf)
    val sqlContext = new HiveContext(sc)
    import sqlContext.implicits._
    val df = sc.parallelize(
      ("foo", 1) :: ("foo", 2) :: ("bar", 1) :: ("bar", 2) :: Nil
    ).toDF("k", "v")
    val w = Window.partitionBy($"k").orderBy($"v")
    df.select($"k", $"v", rowNumber.over(w).alias("rn")).show
  }
}

You can do this with RDDs. Personally I find the API for RDDs makes a lot more sense - I don't always want my data to be 'flat' like a dataframe.
val df = sqlContext.sql("select 1, '2015-09-01'"
).unionAll(sqlContext.sql("select 2, '2015-09-01'")
).unionAll(sqlContext.sql("select 1, '2015-09-03'")
).unionAll(sqlContext.sql("select 1, '2015-09-04'")
).unionAll(sqlContext.sql("select 2, '2015-09-04'"))
// dataframe as an RDD (of Row objects)
df.rdd
// grouping by the first column of the row
.groupBy(r => r(0))
// map each group - an Iterable[Row] - to a list and sort by the second column
.map(g => g._2.toList.sortBy(row => row(1).toString))
.collect()
The above gives a result like the following:
Array[List[org.apache.spark.sql.Row]] =
Array(
List([1,2015-09-01], [1,2015-09-03], [1,2015-09-04]),
List([2,2015-09-01], [2,2015-09-04]))
If you want the position within the 'group' as well, you can use zipWithIndex.
df.rdd.groupBy(r => r(0)).map(g =>
g._2.toList.sortBy(row => row(1).toString).zipWithIndex).collect()
Array[List[(org.apache.spark.sql.Row, Int)]] = Array(
List(([1,2015-09-01],0), ([1,2015-09-03],1), ([1,2015-09-04],2)),
List(([2,2015-09-01],0), ([2,2015-09-04],1)))
You could flatten this back to a simple List/Array of Row objects using flatMap, but if you need to perform anything on the 'group', that won't be a great idea.
The downside to using RDDs like this is that it's tedious to convert from DataFrame to RDD and back again.
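The group, sort, zipWithIndex and flatten steps can be tried on an ordinary Scala List before running them on an RDD. This is a sketch on local collections only, using the sample rows from the question; `rows` and `withCounter` are illustrative names, and the index is shifted by one to get a counter that starts at 1.

```scala
// (id, date) pairs from the question; ISO dates sort correctly as strings.
val rows = List(
  (1, "2015-09-01"), (2, "2015-09-01"), (1, "2015-09-03"),
  (1, "2015-09-04"), (2, "2015-09-04")
)

val withCounter: List[(Int, String, Int)] = rows
  .groupBy(_._1)          // one group per id
  .toList
  .sortBy(_._1)           // deterministic order of the groups
  .flatMap { case (id, group) =>
    // sort the dates inside the group, then number them starting at 1
    group.map(_._2).sorted.zipWithIndex.map { case (date, idx) => (id, date, idx + 1) }
  }

withCounter.foreach(println)
```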

I totally agree that window functions for DataFrames are the way to go if you have Spark 1.5 or later. But if you are really stuck with an older version (e.g. 1.4.1), here is a hacky way to solve this with a self-join:
val df = sc.parallelize((1, "2015-09-01") :: (2, "2015-09-01") :: (1, "2015-09-03") :: (1, "2015-09-04") :: (2, "2015-09-04") :: Nil)
  .toDF("id", "date")
val dfDuplicate = df.selectExpr("id as idDup", "date as dateDup")
// For each (id, date), count the rows with the same id whose date is not later
val dfWithCounter = df.join(dfDuplicate, $"id" === $"idDup")
  .where($"date" >= $"dateDup")
  .groupBy($"id", $"date")
  .agg(count($"idDup").as("counter"))
  .select($"id", $"date", $"counter")
Now if you run dfWithCounter.show, you will get:
+---+----------+-------+
| id|      date|counter|
+---+----------+-------+
|  1|2015-09-01|      1|
|  1|2015-09-04|      3|
|  1|2015-09-03|      2|
|  2|2015-09-01|      1|
|  2|2015-09-04|      2|
+---+----------+-------+
Note that the rows are not sorted by date, but the counter is correct. You can also reverse the direction of the counter by changing >= to <= in the where clause.

scala spark - matching dataframes based on variable dates

I'm trying to match two dataframes based on a variable date window. My code already finds exact matches; what I want is every likely candidate within a configurable window of days.
I was able to get exact matches on dates with my code.
But records that are a few days off on either side could still be reasonable to join on, so I want to find out whether they are still viable matches.
I've looked for something similar to Python's pd.to_timedelta('1 day') in Spark to add to the filter, but have had no luck.
Here is my current code, which joins the dataframes on the ID column and then filters so that the from_date in the second dataframe falls between the start_date and the end_date of the first dataframe.
What I need is not only the exact date match, but also to match records that fall within a day or two (on either side) of the actual dates.
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().getOrCreate()
import spark.implicits._ // needed for the $"..." column syntax used below
val df1 = spark.read.option("header","true")
.option("inferSchema","true").csv("../data/df1.csv")
val df2 = spark.read.option("header","true")
.option("inferSchema","true")
.csv("../data/df2.csv")
val df = df2.join(df1,
(df1("ID") === df2("ID")) &&
(df2("from_date") >= df1("start_date")) &&
(df2("from_date") <= df1("end_date")),"left")
.select(df1("ID"), df1("start_date"), df1("end_date"),
$"from_date", $"to_date")
df.coalesce(1).write.format("com.databricks.spark.csv")
.option("header", "true").save("../mydata.csv")
Essentially I want to be able to edit this date window to increase or decrease the data actually matching.
Would really appreciate your input. I'm new to spark/scala but gotta say I'm loving it so far ... soo much faster (and cleaner) than python!
cheers

You can apply date_add and date_sub to start_date/end_date in your join condition, as shown below:
import org.apache.spark.sql.functions._
import java.sql.Date
val df1 = Seq(
(1, Date.valueOf("2018-12-01"), Date.valueOf("2018-12-05")),
(2, Date.valueOf("2018-12-01"), Date.valueOf("2018-12-06")),
(3, Date.valueOf("2018-12-01"), Date.valueOf("2018-12-07"))
).toDF("ID", "start_date", "end_date")
val df2 = Seq(
(1, Date.valueOf("2018-11-30")),
(2, Date.valueOf("2018-12-08")),
(3, Date.valueOf("2018-12-08"))
).toDF("ID", "from_date")
val deltaDays = 1
df2.join( df1,
df1("ID") === df2("ID") &&
df2("from_date") >= date_sub(df1("start_date"), deltaDays) &&
df2("from_date") <= date_add(df1("end_date"), deltaDays),
"left_outer"
).show
// +---+----------+----+----------+----------+
// | ID| from_date|  ID|start_date|  end_date|
// +---+----------+----+----------+----------+
// |  1|2018-11-30|   1|2018-12-01|2018-12-05|
// |  2|2018-12-08|null|      null|      null|
// |  3|2018-12-08|   3|2018-12-01|2018-12-07|
// +---+----------+----+----------+----------+
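The widened-window condition itself can be checked outside Spark with java.time. The sketch below is plain Scala, not the answerer's code: `DateRange`, `Event` and `inWindow` are hypothetical names, the data is the same sample as in the join, and it assumes at most one range per ID.

```scala
import java.time.LocalDate

case class DateRange(id: Int, start: LocalDate, end: LocalDate)
case class Event(id: Int, from: LocalDate)

val ranges = List(
  DateRange(1, LocalDate.parse("2018-12-01"), LocalDate.parse("2018-12-05")),
  DateRange(2, LocalDate.parse("2018-12-01"), LocalDate.parse("2018-12-06")),
  DateRange(3, LocalDate.parse("2018-12-01"), LocalDate.parse("2018-12-07"))
)
val events = List(
  Event(1, LocalDate.parse("2018-11-30")),
  Event(2, LocalDate.parse("2018-12-08")),
  Event(3, LocalDate.parse("2018-12-08"))
)

val deltaDays = 1L

// Same test as the join condition: start - delta <= from <= end + delta.
def inWindow(e: Event, r: DateRange, delta: Long): Boolean =
  e.id == r.id &&
    !e.from.isBefore(r.start.minusDays(delta)) &&
    !e.from.isAfter(r.end.plusDays(delta))

// Left-outer style: keep every event, paired with its matching range if any.
val matched: List[(Event, Option[DateRange])] =
  events.map(e => e -> ranges.find(r => inWindow(e, r, deltaDays)))

matched.foreach(println)
```

Raising or lowering `deltaDays` widens or narrows the window; with a value of 1, IDs 1 and 3 match and ID 2 does not, which agrees with the join output.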

You can also get the same results using the datediff() function:
scala> val df1 = Seq((1, "2018-12-01", "2018-12-05"),(2, "2018-12-01", "2018-12-06"),(3, "2018-12-01", "2018-12-07")).toDF("ID", "start_date", "end_date").withColumn("start_date",'start_date.cast("date")).withColumn("end_date",'end_date.cast("date"))
df1: org.apache.spark.sql.DataFrame = [ID: int, start_date: date ... 1 more field]
scala> val df2 = Seq((1, "2018-11-30"), (2, "2018-12-08"),(3, "2018-12-08")).toDF("ID", "from_date").withColumn("from_date",'from_date.cast("date"))
df2: org.apache.spark.sql.DataFrame = [ID: int, from_date: date]
scala> val delta = 1;
delta: Int = 1
scala> df2.join(df1,df1("ID") === df2("ID") && datediff('from_date,'start_date) >= -delta && datediff('from_date,'end_date)<=delta, "leftOuter").show(false)
+---+----------+----+----------+----------+
|ID |from_date |ID |start_date|end_date |
+---+----------+----+----------+----------+
|1 |2018-11-30|1 |2018-12-01|2018-12-05|
|2 |2018-12-08|null|null |null |
|3 |2018-12-08|3 |2018-12-01|2018-12-07|
+---+----------+----+----------+----------+