End-dating records using window functions in Spark SQL - scala

I have a dataframe like below
+----+----+----------+----------+
|colA|colB| colC| colD|
+----+----+----------+----------+
| a| 2|2013-12-12|2999-12-31|
| b| 3|2011-12-14|2999-12-31|
| a| 4|2013-12-17|2999-12-31|
| b| 8|2011-12-19|2999-12-31|
| a| 6|2013-12-23|2999-12-31|
+----+----+----------+----------+
I need to group the records by colA, rank them by colC (the most recent date gets the highest rank), and then update colD by subtracting a day from the colC value of the next rank.
The final dataframe should look like below
+----+----+----------+----------+
|colA|colB| colC| colD|
+----+----+----------+----------+
| a| 2|2013-12-12|2013-12-16|
| a| 4|2013-12-17|2013-12-22|
| a| 6|2013-12-23|2999-12-31|
| b| 3|2011-12-14|2011-12-18|
| b| 8|2011-12-19|2999-12-31|
+----+----+----------+----------+

You can get it using window functions:
scala> val df = Seq(("a",2,"2013-12-12","2999-12-31"),("b",3,"2011-12-14","2999-12-31"),("a",4,"2013-12-17","2999-12-31"),("b",8,"2011-12-19","2999-12-31"),("a",6,"2013-12-23","2999-12-31")).toDF("colA","colB","colC","colD")
df: org.apache.spark.sql.DataFrame = [colA: string, colB: int ... 2 more fields]
scala> val df2 = df.withColumn("colc",'colc.cast("date")).withColumn("cold",'cold.cast("date"))
df2: org.apache.spark.sql.DataFrame = [colA: string, colB: int ... 2 more fields]
scala> df2.createOrReplaceTempView("yash")
scala> spark.sql(""" select cola,colb,colc,cold, rank() over(partition by cola order by colc) c1, coalesce(date_sub(lead(colc) over(partition by cola order by colc),1),cold) as cold2 from yash """).show
+----+----+----------+----------+---+----------+
|cola|colb| colc| cold| c1| cold2|
+----+----+----------+----------+---+----------+
| b| 3|2011-12-14|2999-12-31| 1|2011-12-18|
| b| 8|2011-12-19|2999-12-31| 2|2999-12-31|
| a| 2|2013-12-12|2999-12-31| 1|2013-12-16|
| a| 4|2013-12-17|2999-12-31| 2|2013-12-22|
| a| 6|2013-12-23|2999-12-31| 3|2999-12-31|
+----+----+----------+----------+---+----------+
scala>
Removing the unnecessary columns:
scala> spark.sql(""" select cola,colb,colc, coalesce(date_sub(lead(colc) over(partition by cola order by colc),1),cold) as cold from yash """).show
+----+----+----------+----------+
|cola|colb| colc| cold|
+----+----+----------+----------+
| b| 3|2011-12-14|2011-12-18|
| b| 8|2011-12-19|2999-12-31|
| a| 2|2013-12-12|2013-12-16|
| a| 4|2013-12-17|2013-12-22|
| a| 6|2013-12-23|2999-12-31|
+----+----+----------+----------+
scala>
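The same lead/date_sub/coalesce logic can also be expressed with the DataFrame API instead of a temp view. A minimal sketch, assuming the df2 built above (in spark-shell, so spark.implicits._ is already in scope):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{coalesce, col, date_sub, lead}

// next row's colc within each cola group, minus one day; fall back to the original cold
val w = Window.partitionBy("cola").orderBy("colc")
df2.withColumn("cold", coalesce(date_sub(lead("colc", 1).over(w), 1), col("cold"))).show()
Here coalesce keeps the original 2999-12-31 end date for the most recent record of each group, where lead returns null.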

You can create a row_number over a window partitioned by colA and ordered by colC, then do a self join on the dataframe. The code should look like this:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._  // for the $"col" syntax (already in scope in spark-shell)

// rank rows within each colA group by ascending colC
val rnkDF = df.withColumn("rnk", row_number().over(Window.partitionBy("colA").orderBy($"colC".asc)))
  .withColumn("rnkminusone", $"rnk" - lit(1))
// self join each row to the next-ranked row of the same colA; its colC minus one day becomes the new colD
val joinDF = rnkDF.alias("A").join(rnkDF.alias("B"), ($"A.colA" === $"B.colA").and($"A.rnk" === $"B.rnkminusone"), "left")
  .select($"A.colA".as("colA"),
    $"A.colB".as("colB"),
    $"A.colC".as("colC"),
    when($"B.colC".isNull, $"A.colD").otherwise(date_sub($"B.colC", 1)).as("colD"))
The results are below. I hope this helps.
+----+----+----------+----------+
|colA|colB| colC| colD|
+----+----+----------+----------+
| a| 2|2013-12-12|2013-12-16|
| a| 4|2013-12-17|2013-12-22|
| a| 6|2013-12-23|2999-12-31|
| b| 3|2011-12-14|2011-12-18|
| b| 8|2011-12-19|2999-12-31|
+----+----+----------+----------+

Related

Join two dataframes and replace the original column values using Spark Scala

I have two DFs
df1:
+---+-----+--------+
|key|price| date|
+---+-----+--------+
| 1| 1.0|20210101|
| 2| 2.0|20210101|
| 3| 3.0|20210101|
+---+-----+--------+
df2:
+---+-----+
|key|price|
+---+-----+
| 1| 1.1|
| 2| 2.2|
| 3| 3.3|
+---+-----+
I'd like to replace price column values from df1 with price values from df2 where df1.key == df2.key
Expected output:
+---+-----+--------+
|key|price| date|
+---+-----+--------+
| 1| 1.1|20210101|
| 2| 2.2|20210101|
| 3| 3.3|20210101|
+---+-----+--------+
I've found some solutions in python but I couldn't come up with a working solution in Scala.
Simply join, then drop df1's price column:
val df = df1.join(df2, Seq("key")).drop(df1("price"))
df.show
//+---+-----+--------+
//|key|price| date|
//+---+-----+--------+
//| 1| 1.1|20210101|
//| 2| 2.2|20210101|
//| 3| 3.3|20210101|
//+---+-----+--------+
Or, if you have more entries in df1 and you want to keep their price when there is no match in df2, use a left join + coalesce expression:
val df = df1.join(df2, Seq("key"), "left").select(
  col("key"),
  col("date"),
  coalesce(df2("price"), df1("price")).as("price")
)
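For example, with hypothetical data where key 4 exists only in df1 (not from the question; assumes spark.implicits._ is in scope), the left join keeps the df1 price when df2 has no match:
import org.apache.spark.sql.functions.{coalesce, col}

// hypothetical inputs: key 4 has no counterpart in df2
val df1 = Seq((1, 1.0, "20210101"), (2, 2.0, "20210101"), (4, 4.0, "20210101")).toDF("key", "price", "date")
val df2 = Seq((1, 1.1), (2, 2.2)).toDF("key", "price")

df1.join(df2, Seq("key"), "left").select(
  col("key"),
  col("date"),
  coalesce(df2("price"), df1("price")).as("price")
).show()
// keys 1 and 2 take the df2 price; key 4 keeps its df1 price because df2("price") is null for it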

Scala Spark use Window function to find max value

I have a data set that looks like this:
+------------------------+-----+
| timestamp| zone|
+------------------------+-----+
| 2019-01-01 00:05:00 | A|
| 2019-01-01 00:05:00 | A|
| 2019-01-01 00:05:00 | B|
| 2019-01-01 01:05:00 | C|
| 2019-01-01 02:05:00 | B|
| 2019-01-01 02:05:00 | B|
+------------------------+-----+
For each hour I need to count which zone had the most rows and end up with a table that looks like this:
+-----+-----+-----+
| hour| zone| max |
+-----+-----+-----+
| 0| A| 2|
| 1| C| 1|
| 2| B| 2|
+-----+-----+-----+
My instructions say that I need to use the Window function along with "group by" to find my max count.
I've tried a few things but I'm not sure if I'm close. Any help would be appreciated.
You can use two subsequent window functions to get your result:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._  // for the $"col" syntax

df
  .withColumn("hour", hour($"timestamp"))
  .withColumn("cnt", count("*").over(Window.partitionBy($"hour", $"zone")))
  .withColumn("rnb", row_number().over(Window.partitionBy($"hour").orderBy($"cnt".desc)))
  .where($"rnb" === 1)
  .select($"hour", $"zone", $"cnt".as("max"))
You can use window functions together with group by on dataframes. In your case you could use the rank() over (partition by) window function.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// first, group by hour and zone and count the rows per group
val df_group = data_tms
  .select(hour(col("timestamp")).as("hour"), col("zone"))
  .groupBy(col("hour"), col("zone"))
  .agg(count("zone").as("max"))

// second, rank within each hour by the count in descending order
val df_rank = df_group
  .select(col("hour"),
    col("zone"),
    col("max"),
    rank().over(Window.partitionBy(col("hour")).orderBy(col("max").desc)).as("rank"))

// keep only the top-ranked zone per hour
df_rank
  .where(col("rank") === 1)
  .select(col("hour"), col("zone"), col("max"))
  .orderBy(col("hour"))
  .show()
/*
+----+----+---+
|hour|zone|max|
+----+----+---+
| 0| A| 2|
| 1| C| 1|
| 2| B| 2|
+----+----+---+
*/

How to get the percentage of totals for each count after a groupBy in pyspark?

Given the following DataFrame:
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local").appName("test").getOrCreate()
df = spark.createDataFrame([['a',1],['b', 2],['a', 3]], ['category', 'value'])
df.show()
+--------+-----+
|category|value|
+--------+-----+
| a| 1|
| b| 2|
| a| 3|
+--------+-----+
I want to count the number of items in each category and provide a percentage of total for each count, like so
+--------+-----+----------+
|category|count|percentage|
+--------+-----+----------+
| b| 1| 0.333|
| a| 2| 0.667|
+--------+-----+----------+
You can obtain the count and percentage/ratio of totals with the following
import pyspark.sql.functions as f
from pyspark.sql.window import Window
df.groupBy('category').count()\
    .withColumn('percentage', f.round(
        f.col('count') / f.sum('count').over(Window.partitionBy()), 3))\
    .show()
+--------+-----+----------+
|category|count|percentage|
+--------+-----+----------+
| b| 1| 0.333|
| a| 2| 0.667|
+--------+-----+----------+
The previous statement can be divided into steps. df.groupBy('category').count() produces the count:
+--------+-----+
|category|count|
+--------+-----+
| b| 1|
| a| 2|
+--------+-----+
Then, by applying a window function, we can obtain the total count on each row:
df.groupBy('category').count().withColumn('total', f.sum('count').over(Window.partitionBy())).show()
+--------+-----+-----+
|category|count|total|
+--------+-----+-----+
| b| 1| 3|
| a| 2| 3|
+--------+-----+-----+
where the total column is calculated by adding together all the counts in the partition (a single partition that includes all rows).
Once we have count and total for each row we can calculate the ratio:
df.groupBy('category')\
    .count()\
    .withColumn('total', f.sum('count').over(Window.partitionBy()))\
    .withColumn('percentage', f.col('count') / f.col('total'))\
    .show()
+--------+-----+-----+------------------+
|category|count|total| percentage|
+--------+-----+-----+------------------+
| b| 1| 3|0.3333333333333333|
| a| 2| 3|0.6666666666666666|
+--------+-----+-----+------------------+
You can groupby and aggregate with agg:
import pyspark.sql.functions as F
df.groupby('category').agg(F.count('value') / df.count()).show()
Output:
+--------+------------------+
|category|(count(value) / 3)|
+--------+------------------+
| b|0.3333333333333333|
| a|0.6666666666666666|
+--------+------------------+
To make it nicer you can use:
df.groupby('category').agg(
    F.round(F.count('value') / df.count(), 2).alias('ratio')
).show()
Output:
+--------+-----+
|category|ratio|
+--------+-----+
| b| 0.33|
| a| 0.67|
+--------+-----+
You can also use SQL:
df.createOrReplaceTempView('df')
spark.sql(
"""
SELECT category, COUNT(*) / (SELECT COUNT(*) FROM df) AS ratio
FROM df
GROUP BY category
"""
).show()

How to sort on a variable within each group in pyspark?

I am trying to sort a value val using another column ts for each id.
# imports
from pyspark.sql import functions as F
from pyspark.sql import SparkSession
import pandas as pd

# create a session and dummy data
spark = SparkSession.builder.getOrCreate()
pdf = pd.DataFrame([['2', 2, 'cat'], ['1', 1, 'dog'], ['1', 2, 'cat'], ['2', 3, 'cat'], ['2', 4, 'dog']],
                   columns=['id', 'ts', 'val'])
sdf = spark.createDataFrame(pdf)
sdf.show()
+---+---+---+
| id| ts|val|
+---+---+---+
| 2| 2|cat|
| 1| 1|dog|
| 1| 2|cat|
| 2| 3|cat|
| 2| 4|dog|
+---+---+---+
You can aggregate by id and sort by ts:
sorted_sdf = (sdf.groupBy('id')
              .agg(F.sort_array(F.collect_list(F.struct(F.col('ts'), F.col('val'))), asc=True)
              .alias('sorted_col'))
              )
sorted_sdf.show()
+---+--------------------+
| id| sorted_col|
+---+--------------------+
| 1| [[1,dog], [2,cat]]|
| 2|[[2,cat], [3,cat]...|
+---+--------------------+
Then, we can explode this list:
explode_sdf = sorted_sdf.select( 'id' , F.explode( F.col('sorted_col') ).alias('sorted_explode') )
explode_sdf.show()
+---+--------------+
| id|sorted_explode|
+---+--------------+
| 1| [1,dog]|
| 1| [2,cat]|
| 2| [2,cat]|
| 2| [3,cat]|
| 2| [4,dog]|
+---+--------------+
Break the tuples in sorted_explode into two columns:
detupled_sdf = explode_sdf.select( 'id', 'sorted_explode.*' )
detupled_sdf.show()
+---+---+---+
| id| ts|val|
+---+---+---+
| 1| 1|dog|
| 1| 2|cat|
| 2| 2|cat|
| 2| 3|cat|
| 2| 4|dog|
+---+---+---+
Now our original dataframe is sorted by ts for each id!

Get distinct values of specific column with max of different columns

I have the following DataFrame
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| A| 6|null|null|
| B|null| 5|null|
| C|null|null| 7|
| B|null|null| 4|
| B|null| 2|null|
| B|null| 1|null|
| A| 4|null|null|
+----+----+----+----+
What I would like to do in Spark is to return all entries of col1 where it has the maximum value for one of the columns col2, col3 or col4.
This snippet won't do what I want:
df.groupBy("col1").max("col2","col3","col4").show()
And this one just gives the max only for one column (1):
df.groupBy("col1").max("col2").show()
I even tried to merge the single outputs by this:
// merge rows
val rows = test1.rdd.zip(test2.rdd).map {
  case (rowLeft, rowRight) => Row.fromSeq(rowLeft.toSeq ++ rowRight.toSeq)
}
// merge schemas
val schema = StructType(test1.schema.fields ++ test2.schema.fields)
// create new df
val test3: DataFrame = sqlContext.createDataFrame(rows, schema)
where test1 and test2 are DataFrames produced by queries like (1).
So how do I achieve this nicely?
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
| A| 6|null|null|
| B|null| 5|null|
| C|null|null| 7|
+----+----+----+----+
Or even only the distinct values:
+----+
|col1|
+----+
| A|
| B|
| C|
+----+
Thanks in advance! Best
You can use something like below:
sqlcontext.sql("""
  select x.* from table_name x,
    (select max(col2) as a, max(col3) as b, max(col4) as c from table_name) temp
  where a = x.col2 or b = x.col3 or c = x.col4
""")
This will give the desired result.
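If you would rather stay in the DataFrame API, a rough equivalent of the query above is the following sketch (not from the original answers; it assumes df is the DataFrame shown in the question):
import org.apache.spark.sql.functions.{col, max}

// compute the column-wise maxima once (a single-row DataFrame), then keep rows matching any of them
val maxes = df.agg(max("col2").as("a"), max("col3").as("b"), max("col4").as("c"))
val result = df.crossJoin(maxes)
  .where(col("col2") === col("a") || col("col3") === col("b") || col("col4") === col("c"))
  .select("col1", "col2", "col3", "col4")

result.show()
// result.select("col1").distinct.show() gives just the distinct col1 values
The crossJoin is cheap here because maxes has exactly one row.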
It can be solved like this:
df.registerTempTable("temp")
spark.sql("SELECT max(col2) AS max2, max(col3) AS max3, max(col4) AS max4 FROM temp").registerTempTable("max_temp")
spark.sql("SELECT col1 FROM temp, max_temp WHERE col2 = max2 OR col3 = max3 OR col4 = max4").show