combining lag with row computation in windowing apache spark - scala

assume there is a dataframe as follows:
a| b|
1| 3|
1| 5|
2| 6|
2| 9|
2|14|
I want to produce a final dataframe like this
a| b| c
1| 3| 0
1| 5| -2
2| 6| -6
2| 9| -10
2| 14| -17
The value of c is computed for every row except the first one as a-b+c for the previous row. I tried to use lag as well as rowsBetween, but no success Since "c" value does not exist and it is filled with random variable!!
val w = Window.partitionBy().orderBy($"a", $"b)
df.withColumn("c", lead($"a", 1, 0).over(w) - lead($"b", 1, 0).over(w) + lead($"c", 1, 0).over(w))

You can't reference c while calculating c; What you need is a cumulative sum, which could simply be:
df.withColumn("c", sum(lag($"a" - $"b", 1, 0).over(w)).over(w)).show
+---+---+---+
| a| b| c|
+---+---+---+
| 1| 3| 0|
| 1| 5| -2|
| 2| 6| -6|
| 2| 9|-10|
| 2| 14|-17|
+---+---+---+
But note this is inefficient due to the lack of the partition column.

Related

How to build a rank based on threshold in Spark?

Suppose I have a dataframe:
val df = Seq(
(1,"A"),
(1,"B"),
(1,"C"),
(1,"D"),
(1,"E"),
(1,"F"),
(1,"G"),
(1,"H"),
(2,"I"),
(2,"J"),
(2,"J"),
(2,"J"),
(3,"K"),
).toDF("id", "code")
I need to rank it based on ids and with respect to some threshold. Example:
threshold = 3
id code rank
1 A 1
1 B 1
1 C 1 -- threshold has been reached
1 D 2
1 E 2
1 F 2 -- threshold has been reached
1 G 3
1 H 3
2 I 1
2 J 1
2 J 1 -- threshold has been reached
2 J 2
3 K 1
How can I do it?
I can create a simple rank:
df.withColumn("rank", dense_rank().over(Window.orderBy("id")))
But how to split ranked groups by threshold?
A solution that does not require to move all data into one partition:
//get the largest number of equal ids
val maxGroupSize = df.groupBy("id").count().agg(max("count")).first().getLong(0)
val threshold = 3
var f = maxGroupSize
while( f % threshold>0) f=f+1
df.withColumn("tmp1", 'id* f)
.withColumn("tmp2", dense_rank().over(Window.partitionBy("id").orderBy("code"))-1)
.withColumn("tmp3", 'tmp1+'tmp2)
.withColumn("rank", ('tmp3 / threshold).cast("int"))
Result:
+---+----+----+----+----+----+
| id|code|tmp1|tmp2|tmp3|rank|
+---+----+----+----+----+----+
| 1| A| 9| 0| 9| 3|
| 1| B| 9| 1| 10| 3|
| 1| C| 9| 2| 11| 3|
| 1| D| 9| 3| 12| 4|
| 1| E| 9| 4| 13| 4|
| 1| F| 9| 5| 14| 4|
| 1| G| 9| 6| 15| 5|
| 1| H| 9| 7| 16| 5|
| 2| I| 18| 0| 18| 6|
| 2| J| 18| 1| 19| 6|
| 3| K| 27| 0| 27| 9|
+---+----+----+----+----+----+
The downside of this approach is that the ranks are not consecutive.
It would be possible to fix this with another window
df.withColumn("rank2", dense_rank().over(Window.orderBy("rank")))
but this would again move all data to a single executor.

Compare sum of values between two specific date ranges over different categories

I'm working in databricks. I have the following dataframe:
+----------+---+-----+
| date|cat|value|
+----------+---+-----+
|2022-08-11| a| 1|
|2022-08-12| a| 1|
|2022-08-13| a| 1|
|2022-08-14| a| 1|
|2022-08-15| a| 1|
|2022-08-16| a| 1|
|2022-08-17| a| 2|
|2022-08-18| a| 2|
|2022-08-19| a| 2|
|2022-08-20| a| 2|
|2022-08-21| a| 2|
|2022-08-22| a| 2|
|2022-08-11| b| 1|
|2022-08-12| b| 1|
|2022-08-13| b| 1|
|2022-08-14| b| 1|
|2022-08-15| b| 1|
|2022-08-16| b| 1|
|2022-08-17| b| 3|
|2022-08-18| b| 3|
|2022-08-19| b| 3|
|2022-08-20| b| 3|
|2022-08-21| b| 3|
|2022-08-22| b| 3|
+----------+---+-----+
I want to be able to compare the sum of the values between the 17 and the 22 (week1) and between the 11 and the 16 (week2). Start end and end date of each period are predefined.
So far I've tried something like this:
w = (Window.partitionBy('cat'))
df = (df
.withColumn('date', f.to_date('date', 'yyyy-MM-dd'))
.withColumn('value_week_1',
f.when(
(f.col('date') >= '2022-08-17') &
(f.col('date') <= '2022-08-22'),
f.sum('value').over(w)
)
)
.withColumn('value_week_2',
f.when(
(f.col('date') >= '2022-08-11') &
(f.col('date') <= '2022-08-16'),
f.sum('value').over(w)
)
)
)
but It doesn't work and I'm not sure I'm going in the right direction.
Ultimately I'd like to have something like this:
+----------+---+-----+----+------+--------+
| date|cat|value| w1| w2| diff|
+----------+---+-----+----+------+--------+
|2022-08-11| a| 1| 6| 12| 6|
|2022-08-12| a| 1| 6| 12| 6|
|2022-08-13| a| 1| 6| 12| 6|
|2022-08-14| a| 1| 6| 12| 6|
|2022-08-15| a| 1| 6| 12| 6|
|2022-08-16| a| 1| 6| 12| 6|
|2022-08-17| a| 2| 6| 12| 6|
|2022-08-18| a| 2| 6| 12| 6|
|2022-08-19| a| 2| 6| 12| 6|
|2022-08-20| a| 2| 6| 12| 6|
|2022-08-21| a| 2| 6| 12| 6|
|2022-08-22| a| 2| 6| 12| 6|
|2022-08-11| b| 3| 18| 30| 12|
|2022-08-12| b| 3| 18| 30| 12|
|2022-08-13| b| 3| 18| 30| 12|
|2022-08-14| b| 3| 18| 30| 12|
|2022-08-15| b| 3| 18| 30| 12|
|2022-08-16| b| 3| 18| 30| 12|
|2022-08-17| b| 5| 18| 30| 12|
|2022-08-18| b| 5| 18| 30| 12|
|2022-08-19| b| 5| 18| 30| 12|
|2022-08-20| b| 5| 18| 30| 12|
|2022-08-21| b| 5| 18| 30| 12|
|2022-08-22| b| 5| 18| 30| 12|
+----------+---+-----+----+------+--------+
I think we don't need to use window in your case, we can just:
df_agg = df\
.withColumn('week', func.when((func.col('date')>='2022-08-17')&(func.col('date')<='2022-08-22'), func.lit('w1')).otherwise(func.lit('w2')))\
.groupby('cat').pivot('week')\
.agg(func.sum('value'))\
.withColumn('diff', func.col('w2')-func.col('w1'))
We can just create a new column called week to see if the date is under which week, then create a pivot table.
w =Window.partitionBy('cat').orderBy('cat')
df1 = (
#Create week column to help partion. Use row number to create cululative day count. Find each 7th day using pytho's modulo
df.withColumn('wk',(~(row_number().over(w)%7>0)).cast('int')).withColumn('wk',F.sum('wk').over(Window.partitionBy('cat').orderBy().rowsBetween(-sys.maxsize, 0))+1)
#Finfd the cumulative sum per group per week
.groupby('cat','wk').agg(F.collect_list('date').alias('date'),F.sum('value').alias('value')).withColumn('date', explode('date'))
# #Put the total sum in an array in preparation for pivot
.withColumn('value_1', F.collect_set('value').over(Window.partitionBy('cat').orderBy('date','value').rowsBetween(-sys.maxsize, sys.maxsize)))
# #pivot and create week columns
.withColumn('wk',F.array(F.struct(*[F.col('value_1')[i].alias(f"week_{i+1}")for i in range(2)]))).selectExpr('*','inline(wk)').drop('wk','value_1')
# #Find the difference
.withColumn('diff', abs(col('week_1')-col('week_2')))
).show()
There are somethings about this problem that do not make sense. Please see end of article for my observations.
First, the dates from 8/11 to 8/16 do not make up a whole week. Second, the labels of week-1 being 8/17 to 8/22 and week-2 being 8/11 to 8/16 are logically backwards.
I am going to solve this problem using PySpark and Spark SQL since it is straight forward.
#
# Create sample data
#
dat1 = [
("2022-08-11","a",1),
("2022-08-12","a",1),
("2022-08-13","a",1),
("2022-08-14","a",1),
("2022-08-15","a",1),
("2022-08-16","a",1),
("2022-08-17","a",2),
("2022-08-18","a",2),
("2022-08-19","a",2),
("2022-08-20","a",2),
("2022-08-21","a",2),
("2022-08-22","a",2),
("2022-08-11","b",1),
("2022-08-12","b",1),
("2022-08-13","b",1),
("2022-08-14","b",1),
("2022-08-15","b",1),
("2022-08-16","b",1),
("2022-08-17","b",3),
("2022-08-18","b",3),
("2022-08-19","b",3),
("2022-08-20","b",3),
("2022-08-21","b",3),
("2022-08-22","b",3)
]
col1 = ["date", "cat", "value"]
df1 = spark.createDataFrame(data=dat1, schema=col1)
df1.createOrReplaceTempView("sample_data")
The above code create a temporary view with the data set.
#
# Core data - add category w0
#
stmt = """
select
date,
cat,
value,
case
when date >= "2022-08-11" and date <= "2022-08-16" then 2
when date >= "2022-08-17" and date <= "2022-08-22" then 1
else 0
end as w0
from sample_data as q1
"""
df2 = spark.sql(stmt)
df2.createOrReplaceTempView("core_data")
The code above labels the data as week-1 or week-2 and save this category information as w0. This could have been hard coded into the dataset above.
#
# Pivot data - sum vaule by cat, pivot on w0
#
stmt = """
select * from
(
select cat, w0, value from core_data
)
pivot (
cast(sum(value) as DECIMAL(4, 2)) as total
for w0 in (1 w1, 2 w2)
)
"""
df3 = spark.sql(stmt)
df3.createOrReplaceTempView("pivot_data")
The code above creates a column per week category and summarizes the values.
Please note, the result set has 3/5 for cat = b while the original data set has 1/3. I am using your original data set.
Last but not least, we join the core_data to the pivot_data and create a calculated column of the difference of (w1-w2).
You can use spark.sql() to create a dataframe and save this result as a file if you want.
To recap, the length of the week categories is not 7 days, labeling a prior week a greater number than the current does not make sense, and the expected result set is wrong in your example since the input set has different numbers.
In short, working with temporary views allows you to leverage your existing T-SQL skills.

pyspark: Auto filling in implicit missing values

I have a dataframe
user day amount
a 2 10
a 1 14
a 4 5
b 1 4
You see that, the maximum value of day is 4, and the minimum value is 1. I want to fill 0 for amount column in all missing days of all users, so the above data frame will become.
user day amount
a 2 10
a 1 14
a 4 5
a 3 0
b 1 4
b 2 0
b 3 0
b 4 0
How could I do that in PySpark? Many thanks.
Here is one approach. You can get the min and max values first , then group on user column and pivot, then fill in missing columns and fill all nulls as 0, then stack them back:
min_max = df.agg(F.min("day"),F.max("day")).collect()[0]
df1 = df.groupBy("user").pivot("day").agg(F.first("amount").alias("amount")).na.fill(0)
missing_cols = [F.lit(0).alias(str(i)) for i in range(min_max[0],min_max[1]+1)
if str(i) not in df1.columns ]
df1 = df1.select("*",*missing_cols)
#+----+---+---+---+---+
#|user| 1| 2| 4| 3|
#+----+---+---+---+---+
#| b| 4| 0| 0| 0|
#| a| 14| 10| 5| 0|
#+----+---+---+---+---+
#the next step is inspired from https://stackoverflow.com/a/37865645/9840637
arr = F.explode(F.array([F.struct(F.lit(c).alias("day"), F.col(c).alias("amount"))
for c in df1.columns[1:]])).alias("kvs")
(df1.select(["user"] + [arr])
.select(["user"]+ ["kvs.day", "kvs.amount"]).orderBy("user")).show()
+----+---+------+
|user|day|amount|
+----+---+------+
| a| 1| 14|
| a| 2| 10|
| a| 4| 5|
| a| 3| 0|
| b| 1| 4|
| b| 2| 0|
| b| 4| 0|
| b| 3| 0|
+----+---+------+
Note, since column day was pivotted , the dtype might have changed so you may have to cast them back to the original dtype
Another way to do this is to use sequence, array functions and explode. (spark2.4+)
from pyspark.sql import functions as F
from pyspark.sql.window import Window
w=Window().partitionBy(F.lit(0))
df.withColumn("boundaries", F.sequence(F.min("day").over(w),F.max("day").over(w),F.lit(1)))\
.groupBy("user").agg(F.collect_list("day").alias('day'),F.collect_list("amount").alias('amount')\
,F.first("boundaries").alias("boundaries")).withColumn("boundaries", F.array_except("boundaries","day"))\
.withColumn("day",F.flatten(F.array("day","boundaries"))).drop("boundaries")\
.withColumn("zip", F.explode(F.arrays_zip("day","amount")))\
.select("user","zip.day", F.when(F.col("zip.amount").isNull(),\
F.lit(0)).otherwise(F.col("zip.amount")).alias("amount")).show()
#+----+---+------+
#|user|day|amount|
#+----+---+------+
#| a| 2| 10|
#| a| 1| 14|
#| a| 4| 5|
#| a| 3| 0|
#| b| 1| 4|
#| b| 2| 0|
#| b| 3| 0|
#| b| 4| 0|
#+----+---+------+

How to aggregate contiguous rows in pyspark

I have an immense amount of user data (billions of rows) where I need to summarize the amount of time spent in a specific state by each user.
Let's say it's historical web data, and I want to sum the amount of time each user has spent on the site. The data only says if the user is present.
df = spark.createDataFrame([("A", 1), ("A", 2), ("A", 3),("B", 4 ),("B", 5 ),("A", 6 ),("A", 7 ),("A", 8 )], ["user","timestamp"])
+----+---------+
|user|timestamp|
+----+---------+
| A| 1|
| A| 2|
| A| 3|
| B| 4|
| B| 5|
| A| 6|
| A| 7|
| A| 8|
+----+---------+
The correct answer would be this since I'm summing the total per contiguous segment.
+----+---------+
|user| ttl |
+----+---------+
| A| 4|
| B| 1|
+----+---------+
I tried doing a max()-min() and groupby but that resulted in segment A being 8-1 and gave the wrong answer.
In sqlite I was able to get the answer by creating a partition number and then finding the difference and summing. I created the partition with this...
SELECT
COUNT(*) FILTER (WHERE a.user <>
( SELECT b.user
FROM foobar AS b
WHERE a.timestamp > b.timestamp
ORDER BY b.timestamp DESC
LIMIT 1
))
OVER (ORDER BY timestamp) c,
user,
timestamp
FROM foobar a;
which gave me...
+----+---------+---+
|user|timestamp| c |
+----+---------+---+
| A| 1| 1 |
| A| 2| 1 |
| A| 3| 1 |
| B| 4| 2 |
| B| 5| 2 |
| A| 6| 3 |
| A| 7| 3 |
| A| 8| 3 |
+----+---------+---+
Then the LAST() - FIRST() functions in sql made that easy to finish.
Any ideas on how to scale this and do it in pyspark? I can't seem to find adequate substitutes for the "count(*) where(...)" sqlite offered
We can do this:
Create the DataFrame
from pyspark.sql.window import Window
from pyspark.sql.functions import max, min
from pyspark.sql import functions as F
df = spark.createDataFrame([("A", 1), ("A", 2), ("A", 3),("B", 4 ),("B", 5 ),("A", 6 ),("A", 7 ),("A", 8 )], ["user","timestamp"])
df.show()
+----+---------+
|user|timestamp|
+----+---------+
| A| 1|
| A| 2|
| A| 3|
| B| 4|
| B| 5|
| A| 6|
| A| 7|
| A| 8|
+----+---------+
Assign a row_number to each row, which are ordered by timestamp. The column dummy is used such that we can use window function row_number.
df = df.withColumn('dummy', F.lit(1))
w1 = Window.partitionBy('dummy').orderBy('timestamp')
df = df.withColumn('row_number', F.row_number().over(w1))
df.show()
+----+---------+-----+----------+
|user|timestamp|dummy|row_number|
+----+---------+-----+----------+
| A| 1| 1| 1|
| A| 2| 1| 2|
| A| 3| 1| 3|
| B| 4| 1| 4|
| B| 5| 1| 5|
| A| 6| 1| 6|
| A| 7| 1| 7|
| A| 8| 1| 8|
+----+---------+-----+----------+
We want to create a sub group within each user group here.
(1) For each user group, compute the difference of current row's row_number to previous row's row_number. So any difference larger than 1 indicating there's a new contiguous group. This results diff, note the first row in each group has a value of -1.
(2) We then assign null to every row with diff==1. This results column diff2.
(3) Next, we use the last function to fill the rows with diff2 == null using the last non-null value in column diff2. This results subgroupid.
This is the sub group we want to create for each user group.
w2 = Window.partitionBy('user').orderBy('timestamp')
df = df.withColumn('diff', df['row_number'] - F.lag('row_number').over(w2)).fillna(-1)
df = df.withColumn('diff2', F.when(df['diff']==1, None).otherwise(F.abs(df['diff'])))
df = df.withColumn('subgroupid', F.last(F.col('diff2'), True).over(w2))
df.show()
+----+---------+-----+----------+----+-----+----------+
|user|timestamp|dummy|row_number|diff|diff2|subgroupid|
+----+---------+-----+----------+----+-----+----------+
| B| 4| 1| 4| -1| 1| 1|
| B| 5| 1| 5| 1| null| 1|
| A| 1| 1| 1| -1| 1| 1|
| A| 2| 1| 2| 1| null| 1|
| A| 3| 1| 3| 1| null| 1|
| A| 6| 1| 6| 3| 3| 3|
| A| 7| 1| 7| 1| null| 3|
| A| 8| 1| 8| 1| null| 3|
+----+---------+-----+----------+----+-----+----------+
We now group by both user and subgroupid to compute the time each user spent on each contiguous time interval.
Lastly, we group by user only to sum up the total time spent by each user.
s = "(max('timestamp') - min('timestamp'))"
df = df.groupBy(['user', 'subgroupid']).agg(eval(s))
s = s.replace("'","")
df = df.groupBy('user').sum(s).select('user', F.col("sum(" + s + ")").alias('total_time'))
df.show()
+----+----------+
|user|total_time|
+----+----------+
| B| 1|
| A| 4|
+----+----------+

Improve the efficiency of Spark SQL in repeated calls to groupBy/count. Pivot the outcome

I have a Spark DataFrame consisting of columns of integers. I want to tabulate each column and pivot the outcome by the column names.
In the following toy example, I start with this DataFrame df
+---+---+---+---+---+
| a| b| c| d| e|
+---+---+---+---+---+
| 1| 1| 1| 0| 2|
| 1| 1| 1| 1| 1|
| 2| 2| 2| 3| 3|
| 0| 0| 0| 0| 1|
| 1| 1| 1| 0| 0|
| 3| 3| 3| 2| 2|
| 0| 1| 1| 1| 0|
+---+---+---+---+---+
Each cell can only contain one of {0, 1, 2, 3}. Now I want to tabulate the counts in each column. Ideally, I would have a column for each label (0, 1, 2, 3), and a row for each column. I do:
val output = df.columns.map(cs => df.select(cs).groupBy(cs).count().orderBy(cs).
withColumnRenamed(cs, "severity").
withColumnRenamed("count", "counts").withColumn("window", lit(cs))
)
I get an Array of DataFrames, one for each row of the df. Each of these dataframes has 4 rows (one for each outcome). Then I do:
val longOutput = output.reduce(_ union _) // flatten the array to produce one dataframe
longOutput.show()
to collapse the Array.
+--------+------+------+
|severity|counts|window|
+--------+------+------+
| 0| 2| a|
| 1| 3| a|
| 2| 1| a|
| 3| 1| a|
| 0| 1| b|
| 1| 4| b|
| 2| 1| b|
| 3| 1| b|
...
And finally, I pivot on the original column names
longOutput.cache()
val results = longOutput.groupBy("window").pivot("severity").agg(first("counts"))
results.show()
+------+---+---+---+---+
|window| 0| 1| 2| 3|
+------+---+---+---+---+
| e| 2| 2| 2| 1|
| d| 3| 2| 1| 1|
| c| 1| 4| 1| 1|
| b| 1| 4| 1| 1|
| a| 2| 3| 1| 1|
+------+---+---+---+---+
However the reduction piece took 8 full seconds on the toy example. It ran for over 2 hours on my actual data which had 1000 columns and 400,000 rows before I terminated it. I am running locally on a machine with 12 cores and 128G of RAM. But clearly, what I'm doing is slow on even a small amount of data, so machine size is not in itself the problem. The column groupby/count took only 7 minutes on the full data set. But then I can't do anything with that Array[DataFrame].
I tried several ways of avoiding union. I tried writing out my array to disk, but that failed due to a memory problem after several hours of effort. I also tried to adjust memory allowances on Zeppelin
So I need a way of doing the tabulation that does not give me an Array of DataFrames, but rather a simple data frame.
The problem with your code is that you trigger one spark job per column and then a big union. In general, it's much faster to try and keep everything within the same one.
In your case, instead of dividing the work, you could explode the dataframe to do everything in one pass like this:
df
.select(array(df.columns.map(c => struct(lit(c) as "name", col(c) as "value") ) : _*) as "a")
.select(explode('a))
.select($"col.name" as "name", $"col.value" as "value")
.groupBy("name")
.pivot("value")
.count()
.show()
This first line is the only one that's a bit tricky. It creates an array of tuples where each column name is mapped to its value. Then we explode it (one line per element of the array) and finally compute a basic pivot.