Spark: how to get the resultSet with a condition in a groupedData (Scala)

Is there a way to group a DataFrame using its own schema?
The data has the following format:
Country | Class | Name | age
US, 1,'aaa',21
US, 1,'bbb',20
BR, 2,'ccc',30
AU, 3,'ddd',20
....
I would like to produce something like:
Country | Class 1 Students | Class 2 Students
US , 2, 0
BR , 0, 1
....
Condition 1: group by country.
Condition 2: count only class values 1 and 2.
This is the source code so far:
val df = Seq(("US", 1, "AAA", 19), ("US", 1, "BBB", 20), ("KR", 2, "CCC", 29),
  ("AU", 3, "DDD", 18)).toDF("country", "class", "name", "age")
df.groupBy("country").agg(count($"name") as "Cnt")

You should use the pivot function.
val df = Seq(("US", 1, "AAA", 19), ("US", 1, "BBB", 20), ("KR", 2, "CCC", 29),
  ("AU", 3, "DDD", 18)).toDF("country", "class", "name", "age")
df.groupBy("country").pivot("class").agg(count($"name") as "Cnt").show
+-------+---+---+---+
|country| 1| 2| 3|
+-------+---+---+---+
| AU| 0| 0| 1|
| US| 2| 0| 0|
| KR| 0| 1| 0|
+-------+---+---+---+
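If you only want the class 1 and class 2 buckets (condition 2), you can pass the pivot values explicitly, which also spares Spark from scanning the data to discover them. A sketch based on the same df; the renames are just an assumption about the headers you want:
df.groupBy("country")
  .pivot("class", Seq(1, 2))
  .agg(count($"name") as "Cnt")
  .na.fill(0)
  .withColumnRenamed("1", "Class 1 Students")
  .withColumnRenamed("2", "Class 2 Students")
  .show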

Related

How to compute cumulative sum on multiple float columns?

I have 100 float columns in a Dataframe which are ordered by date.
ID Date C1 C2 ....... C100
1 02/06/2019 32.09 45.06 99
1 02/04/2019 32.09 45.06 99
2 02/03/2019 32.09 45.06 99
2 05/07/2019 32.09 45.06 99
I need the cumulative sum of C1 to C100, computed per ID and ordered by date.
The target DataFrame should look like this:
ID Date C1 C2 ....... C100
1 02/04/2019 32.09 45.06 99
1 02/06/2019 64.18 90.12 198
2 02/03/2019 32.09 45.06 99
2 05/07/2019 64.18 90.12 198
I want to achieve this without looping over C1 to C100 manually.
My initial code for one column:
var DF1 = DF.withColumn("CumSum_c1", sum("C1").over(
  Window.partitionBy("ID")
    .orderBy(col("date").asc)))
I found a similar question here, but it only does this manually for two columns: Cumulative sum in Spark
It's a classic use case for foldLeft. Let's generate some data first:
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._

val df = spark.range(1000)
  .withColumn("c1", 'id + 3)
  .withColumn("c2", 'id % 2 + 1)
  .withColumn("date", monotonically_increasing_id)
  .withColumn("id", 'id % 10 + 1)

// We will select the columns we want to compute the cumulative sum of.
val columns = df.drop("id", "date").columns
val w = Window.partitionBy(col("id")).orderBy(col("date").asc)
val results = columns.foldLeft(df) { (tmp, column) =>
  tmp.withColumn(s"cum_sum_$column", sum(column).over(w))
}
results.orderBy("id", "date").show
// +---+---+---+-----------+----------+----------+
// | id| c1| c2| date|cum_sum_c1|cum_sum_c2|
// +---+---+---+-----------+----------+----------+
// | 1| 3| 1| 0| 3| 1|
// | 1| 13| 1| 10| 16| 2|
// | 1| 23| 1| 20| 39| 3|
// | 1| 33| 1| 30| 72| 4|
// | 1| 43| 1| 40| 115| 5|
// | 1| 53| 1| 8589934592| 168| 6|
// | 1| 63| 1| 8589934602| 231| 7|
Here is another way, using a simple select expression:
val w = Window.partitionBy($"id").orderBy($"date".asc)
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)
// get the columns you want to sum
val columnsToSum = df.drop("ID", "Date").columns
// map over those columns and create the new cumulative-sum columns
val selectExpr = Seq(col("ID"), col("Date")) ++
  columnsToSum.map(c => sum(col(c)).over(w).alias(c))
df.select(selectExpr: _*).show()
Gives:
+---+----------+-----+-----+----+
| ID| Date| C1| C2|C100|
+---+----------+-----+-----+----+
| 1|02/04/2019|32.09|45.06| 99|
| 1|02/06/2019|64.18|90.12| 198|
| 2|02/03/2019|32.09|45.06| 99|
| 2|05/07/2019|64.18|90.12| 198|
+---+----------+-----+-----+----+
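One subtle difference between the two solutions: the first window spec has no explicit frame, so Spark falls back to the default RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which treats rows with equal dates as ties and sums them together, while the second pins the frame to physical rows. If that distinction matters for your data, the first spec can be made explicit in the same way (a small tweak, not part of the original answer):
val w = Window.partitionBy(col("id")).orderBy(col("date").asc)
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)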

Understanding pivot and agg

I have the following columns in DataFrame df:
c_id p_id type values
278230 57371100 11 1
278230 57371100 12 1
...
I execute the following code and expect to see columns 11_total and 12_total:
df
  .groupBy($"c_id", $"p_id")
  .pivot("type")
  .agg(sum("values") as "total")
  .na.fill(0)
  .show()
Instead, I get columns 11 and 12:
+-----------+----------+---+---+
| c_id| p_id| 11| 12|
+-----------+----------+---+---+
| 278230| 57371100| 0| 1|
| 337790| 72031970| 3| 0|
| 320710| 71904400| 0| 1|
Why?
That's because Spark appends the aggregation alias to the pivoted column names only when there are multiple aggregations, for clarity:
val df = Seq(
  (278230, 57371100, 11, 1),
  (278230, 57371100, 12, 2),
  (337790, 72031970, 11, 1),
  (337790, 72031970, 11, 2),
  (337790, 72031970, 12, 3)
).toDF("c_id", "p_id", "type", "values")
df.groupBy($"c_id", $"p_id").pivot("type")
  .agg(sum("values").as("total"))
  .show
// +------+--------+---+---+
// | c_id| p_id| 11| 12|
// +------+--------+---+---+
// |337790|72031970| 3| 3|
// |278230|57371100| 1| 2|
// +------+--------+---+---+
df.groupBy($"c_id", $"p_id").pivot("type").
agg(sum("values").as("total"), max("values").as("max")).
show
// +------+--------+--------+------+--------+------+
// | c_id| p_id|11_total|11_max|12_total|12_max|
// +------+--------+--------+------+--------+------+
// |337790|72031970| 3| 2| 3| 3|
// |278230|57371100| 1| 1| 2| 2|
// +------+--------+--------+------+--------+------+
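If you do want the _total suffix while keeping a single aggregation, one workaround (not part of the original answer, just a sketch) is to rename the pivoted columns afterwards:
val pivoted = df.groupBy($"c_id", $"p_id").pivot("type")
  .agg(sum("values").as("total"))
  .na.fill(0)
// rename every non-key column, e.g. "11" -> "11_total"
val keyCols = Set("c_id", "p_id")
val renamed = pivoted.columns.foldLeft(pivoted) { (tmp, c) =>
  if (keyCols(c)) tmp else tmp.withColumnRenamed(c, s"${c}_total")
}
renamed.show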

Spark Scala: Count Consecutive Months

I have the following DataFrame example:
Provider Patient Date
Smith John 2016-01-23
Smith John 2016-02-20
Smith John 2016-03-21
Smith John 2016-06-25
Smith Jill 2016-02-01
Smith Jill 2016-03-10
James Jill 2017-04-10
James Jill 2017-05-11
I want to programmatically add a column that indicates how many consecutive months a patient sees a doctor. The new DataFrame would look like this:
Provider Patient Date consecutive_id
Smith John 2016-01-23 3
Smith John 2016-02-20 3
Smith John 2016-03-21 3
Smith John 2016-06-25 1
Smith Jill 2016-02-01 2
Smith Jill 2016-03-10 2
James Jill 2017-04-10 2
James Jill 2017-05-11 2
I'm assuming that there is a way to achieve this with a Window function, but I haven't been able to figure it out yet and I'm looking forward to the insight the community can provide. Thanks.
There are at least three ways to get the result:
Implement the logic in SQL
Use the Spark API for window functions, .over(windowSpec) (a DataFrame sketch of this is shown below, after the mapPartitions output)
Use .rdd.mapPartitions directly
See also: Introducing Window Functions in Spark SQL
For all solutions you can call .toDebugString to see the operations under the hood.
The SQL solution is below.
val my_df = List(
  ("Smith", "John", "2016-01-23"),
  ("Smith", "John", "2016-02-20"),
  ("Smith", "John", "2016-03-21"),
  ("Smith", "John", "2016-06-25"),
  ("Smith", "Jill", "2016-02-01"),
  ("Smith", "Jill", "2016-03-10"),
  ("James", "Jill", "2017-04-10"),
  ("James", "Jill", "2017-05-11")
).toDF(Seq("Provider", "Patient", "Date"): _*)
my_df.createOrReplaceTempView("tbl")
val q = """
select t2.*, count(*) over (partition by provider, patient, grp) consecutive_id
from (select t1.*, sum(x) over (partition by provider, patient order by yyyymm) grp
from (select t0.*,
case
when cast(yyyymm as int) -
cast(lag(yyyymm) over (partition by provider, patient order by yyyymm) as int) = 1
then 0
else 1
end x
from (select tbl.*, substr(translate(date, '-', ''), 1, 6) yyyymm from tbl) t0) t1) t2
"""
sql(q).show
sql(q).rdd.toDebugString
Output
scala> sql(q).show
+--------+-------+----------+------+---+---+--------------+
|Provider|Patient| Date|yyyymm| x|grp|consecutive_id|
+--------+-------+----------+------+---+---+--------------+
| Smith| Jill|2016-02-01|201602| 1| 1| 2|
| Smith| Jill|2016-03-10|201603| 0| 1| 2|
| James| Jill|2017-04-10|201704| 1| 1| 2|
| James| Jill|2017-05-11|201705| 0| 1| 2|
| Smith| John|2016-01-23|201601| 1| 1| 3|
| Smith| John|2016-02-20|201602| 0| 1| 3|
| Smith| John|2016-03-21|201603| 0| 1| 3|
| Smith| John|2016-06-25|201606| 1| 2| 1|
+--------+-------+----------+------+---+---+--------------+
Update
Mix of .mapPartitions + .over(windowSpec)
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, IntegerType, StructField, StructType}

val schema = new StructType()
  .add(StructField("provider", StringType, true))
  .add(StructField("patient", StringType, true))
  .add(StructField("date", StringType, true))
  .add(StructField("x", IntegerType, true))
  .add(StructField("grp", IntegerType, true))

// Walk the rows of each partition in order, flag streak starts (x = 1) based on the
// yyyymm of the previous row, and carry a running streak number (grp).
def f(iter: Iterator[Row]): Iterator[Row] = {
  iter.scanLeft(Row("_", "_", "000000", 0, 0)) { case (prev, cur) =>
    val x =
      if (cur.getString(2).replaceAll("-", "").substring(0, 6).toInt ==
          prev.getString(2).replaceAll("-", "").substring(0, 6).toInt + 1) 0
      else 1
    val grp = prev.getInt(4) + x
    Row(cur.getString(0), cur.getString(1), cur.getString(2), x, grp)
  }.drop(1)
}

val df_mod = spark.createDataFrame(
  my_df.repartition($"provider", $"patient")
    .sortWithinPartitions($"date")
    .rdd.mapPartitions(f, true),
  schema)

import org.apache.spark.sql.expressions.Window
val windowSpec = Window.partitionBy($"provider", $"patient", $"grp")
df_mod.withColumn("consecutive_id", count(lit("1")).over(windowSpec))
  .orderBy($"provider", $"patient", $"date").show
Output
scala> df_mod.withColumn("consecutive_id", count(lit("1")).over(windowSpec)
| ).orderBy($"provider", $"patient", $"date").show
+--------+-------+----------+---+---+--------------+
|provider|patient| date| x|grp|consecutive_id|
+--------+-------+----------+---+---+--------------+
| James| Jill|2017-04-10| 1| 1| 2|
| James| Jill|2017-05-11| 0| 1| 2|
| Smith| Jill|2016-02-01| 1| 1| 2|
| Smith| Jill|2016-03-10| 0| 1| 2|
| Smith| John|2016-01-23| 1| 1| 3|
| Smith| John|2016-02-20| 0| 1| 3|
| Smith| John|2016-03-21| 0| 1| 3|
| Smith| John|2016-06-25| 1| 2| 1|
+--------+-------+----------+---+---+--------------+
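For the pure window-function approach mentioned at the top of this answer, here is a hedged sketch of the same gap-and-islands logic as the SQL, written directly against my_df with the DataFrame API. Like the SQL, the yyyymm arithmetic does not treat a December-to-January step as consecutive:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val byPatient = Window.partitionBy($"Provider", $"Patient").orderBy($"yyyymm")

my_df
  .withColumn("yyyymm", substring(translate($"Date", "-", ""), 1, 6).cast("int"))
  // x = 1 whenever the current month is not exactly one month after the previous row
  .withColumn("x", when($"yyyymm" - lag($"yyyymm", 1).over(byPatient) === 1, 0).otherwise(1))
  // the running sum of x numbers the streaks ("islands") per provider/patient
  .withColumn("grp", sum($"x").over(byPatient))
  // the size of each streak is the consecutive_id
  .withColumn("consecutive_id",
    count(lit(1)).over(Window.partitionBy($"Provider", $"Patient", $"grp")))
  .orderBy($"Provider", $"Patient", $"Date")
  .show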
You could:
Reformat your dates to month integers (2016-01 = 1, 2016-02 = 2, 2017-01 = 13, etc.)
Combine all the dates into an array with a window and collect_list:
val winSpec = Window.partitionBy("Provider", "Patient").orderBy("Date")
df.withColumn("Dates", collect_list("Date").over(winSpec))
Pass the array into a modified version of #marios solution as a UDF, registered with spark.udf.register, to get the maximum number of consecutive months

DataFrame.map needs to produce more rows than are in the dataset

I am using Scala and Spark and have a simple dataframe.map that produces the required transformation on the data. However, I need to emit an additional row of data alongside the modified original. How can I do this with dataframe.map?
ex:
dataset from:
id, name, age
1, john, 23
2, peter, 32
If age < 25, default it to 25.
dataset to:
id, name, age
1, john, 25
1, john, -23
2, peter, 32
Would a 'UnionAll' handle it?
eg.
df1 = original dataframe
df2 = transformed df1
df1.unionAll(df2)
EDIT: implementation using unionAll()
val df1 = sqlContext.createDataFrame(Seq((1, "john", 23), (2, "peter", 32)))
  .toDF("id", "name", "age")
def udfTransform = udf[Int, Int] { age => if (age < 25) 25 else age }
val df2 = df1.withColumn("age2", udfTransform($"age"))
  .where("age != age2")
  .drop("age2")
df1.withColumn("age", udfTransform($"age"))
  .unionAll(df2)
  .orderBy("id")
  .show()
+---+-----+---+
| id| name|age|
+---+-----+---+
| 1| john| 25|
| 1| john| 23|
| 2|peter| 32|
+---+-----+---+
Note: the implementation differs a bit from the originally proposed (naive) solution. The devil is always in the detail!
EDIT 2: implementation using nested array and explode
val df1 = sqlContext.createDataFrame(Seq((1, "john", 23), (2, "peter", 32)))
  .toDF("id", "name", "age")
def udfArr = udf[Array[Int], Int] { age =>
  if (age < 25) Array(age, 25) else Array(age)
}
val df2 = df1.withColumn("age", udfArr($"age"))
df2.show()
+---+-----+--------+
| id| name| age|
+---+-----+--------+
| 1| john|[23, 25]|
| 2|peter| [32]|
+---+-----+--------+
df2.withColumn("age",explode($"age") ).show()
+---+-----+---+
| id| name|age|
+---+-----+---+
| 1| john| 23|
| 1| john| 25|
| 2|peter| 32|
+---+-----+---+
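A side note, not from the original answer: on Spark 2.x with typed Datasets, flatMap is the most direct way to let a map-like transformation emit extra rows. A minimal sketch, assuming a hypothetical Person case class and the usual spark-session implicits; it mirrors the unionAll output above:
import spark.implicits._

case class Person(id: Int, name: String, age: Int)

val ds = Seq(Person(1, "john", 23), Person(2, "peter", 32)).toDS

// Emit the adjusted row plus the original whenever age < 25, otherwise pass the row through.
val result = ds.flatMap(p => if (p.age < 25) Seq(p.copy(age = 25), p) else Seq(p))
result.orderBy("id").show()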

I want to convert all my existing UDTFs in Hive to Scala functions and use them from Spark SQL

Can anyone give me an example UDTF (e.g. explode) written in Scala that returns multiple rows, and show how to use it as a UDF in Spark SQL?
Table: table1
+------+----------+----------+
|userId|someString| varA|
+------+----------+----------+
| 1| example1| [0, 2, 5]|
| 2| example2|[1, 20, 5]|
+------+----------+----------+
I'd like to create the following Scala code:
def exampleUDTF(var: Seq[Int]) = <Return Type???> {
// code to explode varA field ???
}
sqlContext.udf.register("exampleUDTF",exampleUDTF _)
sqlContext.sql("FROM table1 SELECT userId, someString, exampleUDTF(varA)").collect().foreach(println)
Expected output:
+------+----------+----+
|userId|someString|varA|
+------+----------+----+
| 1| example1| 0|
| 1| example1| 2|
| 1| example1| 5|
| 2| example2| 1|
| 2| example2| 20|
| 2| example2| 5|
+------+----------+----+
You can't do this with a UDF. A UDF can only add a single column to a DataFrame. There is, however, a function called DataFrame.explode, which you can use instead. To do it with your example, you would do this:
import org.apache.spark.sql._

val df = Seq(
  (1, "example1", Array(0, 2, 5)),
  (2, "example2", Array(1, 20, 5))
).toDF("userId", "someString", "varA")

val explodedDf = df.explode($"varA") {
  case Row(arr: Seq[Int]) => arr.toArray.map(a => Tuple1(a))
}.drop($"varA").withColumnRenamed("_1", "varA")

explodedDf.show
+------+----------+-----+
|userId|someString| varA|
+------+----------+-----+
| 1| example1| 0|
| 1| example1| 2|
| 1| example1| 5|
| 2| example2| 1|
| 2| example2| 20|
| 2| example2| 5|
+------+----------+-----+
Note that explode takes a function as an argument. So even though you can't create a UDF that does what you want, you can create a function and pass it to explode, like this:
def exploder(row: Row): Seq[Tuple1[Int]] = {
  row match { case Row(arr: Seq[Int]) => arr.map(v => Tuple1(v)) }
}
df.explode($"varA")(exploder)
That's about the best you are going to get in terms of recreating a UDTF.
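Worth adding (beyond the original answer): in Spark 2.x, DataFrame.explode is deprecated, and the same result is normally produced with the explode function inside a select:
import org.apache.spark.sql.functions.explode

df.select($"userId", $"someString", explode($"varA").as("varA")).show()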
Hive Table:
name id
["Subhajit Sen","Binoy Mondal","Shantanu Dutta"] 15
["Gobinathan SP","Harsh Gupta","Rahul Anand"] 16
Creating a Scala function:
def toUpper(name: Seq[String]) = name.map(_.toUpperCase)
Registering the function as a UDF:
sqlContext.udf.register("toUpper", toUpper _)
Calling the UDF using sqlContext and storing the output as a DataFrame:
var df = sqlContext.sql("SELECT toUpper(name) FROM namelist").toDF("Name")
Exploding the DataFrame:
df.explode(df("Name")) { case org.apache.spark.sql.Row(arr: Seq[String]) => arr.map(v => Tuple1(v)) }
  .drop(df("Name")).withColumnRenamed("_1", "Name").show
Result:
+--------------+
| Name|
+--------------+
| SUBHAJIT SEN|
| BINOY MONDAL|
|SHANTANU DUTTA|
| GOBINATHAN SP|
| HARSH GUPTA|
| RAHUL ANAND|
+--------------+