How to aggregate contiguous rows in pyspark - pyspark

I have an immense amount of user data (billions of rows) where I need to summarize the amount of time spent in a specific state by each user.
Let's say it's historical web data, and I want to sum the amount of time each user has spent on the site. The data only says if the user is present.
df = spark.createDataFrame([("A", 1), ("A", 2), ("A", 3),("B", 4 ),("B", 5 ),("A", 6 ),("A", 7 ),("A", 8 )], ["user","timestamp"])
+----+---------+
|user|timestamp|
+----+---------+
| A| 1|
| A| 2|
| A| 3|
| B| 4|
| B| 5|
| A| 6|
| A| 7|
| A| 8|
+----+---------+
The correct answer would be this since I'm summing the total per contiguous segment.
+----+---------+
|user| ttl |
+----+---------+
| A| 4|
| B| 1|
+----+---------+
I tried doing a max()-min() and groupby but that resulted in segment A being 8-1 and gave the wrong answer.
In sqlite I was able to get the answer by creating a partition number and then finding the difference and summing. I created the partition with this...
SELECT
COUNT(*) FILTER (WHERE a.user <>
( SELECT b.user
FROM foobar AS b
WHERE a.timestamp > b.timestamp
ORDER BY b.timestamp DESC
LIMIT 1
))
OVER (ORDER BY timestamp) c,
user,
timestamp
FROM foobar a;
which gave me...
+----+---------+---+
|user|timestamp| c |
+----+---------+---+
| A| 1| 1 |
| A| 2| 1 |
| A| 3| 1 |
| B| 4| 2 |
| B| 5| 2 |
| A| 6| 3 |
| A| 7| 3 |
| A| 8| 3 |
+----+---------+---+
Then the LAST() - FIRST() functions in sql made that easy to finish.
Any ideas on how to scale this and do it in pyspark? I can't seem to find adequate substitutes for the "count(*) where(...)" sqlite offered

We can do this:
Create the DataFrame
from pyspark.sql.window import Window
from pyspark.sql.functions import max, min
from pyspark.sql import functions as F
df = spark.createDataFrame([("A", 1), ("A", 2), ("A", 3),("B", 4 ),("B", 5 ),("A", 6 ),("A", 7 ),("A", 8 )], ["user","timestamp"])
df.show()
+----+---------+
|user|timestamp|
+----+---------+
| A| 1|
| A| 2|
| A| 3|
| B| 4|
| B| 5|
| A| 6|
| A| 7|
| A| 8|
+----+---------+
Assign a row_number to each row, which are ordered by timestamp. The column dummy is used such that we can use window function row_number.
df = df.withColumn('dummy', F.lit(1))
w1 = Window.partitionBy('dummy').orderBy('timestamp')
df = df.withColumn('row_number', F.row_number().over(w1))
df.show()
+----+---------+-----+----------+
|user|timestamp|dummy|row_number|
+----+---------+-----+----------+
| A| 1| 1| 1|
| A| 2| 1| 2|
| A| 3| 1| 3|
| B| 4| 1| 4|
| B| 5| 1| 5|
| A| 6| 1| 6|
| A| 7| 1| 7|
| A| 8| 1| 8|
+----+---------+-----+----------+
We want to create a sub group within each user group here.
(1) For each user group, compute the difference of current row's row_number to previous row's row_number. So any difference larger than 1 indicating there's a new contiguous group. This results diff, note the first row in each group has a value of -1.
(2) We then assign null to every row with diff==1. This results column diff2.
(3) Next, we use the last function to fill the rows with diff2 == null using the last non-null value in column diff2. This results subgroupid.
This is the sub group we want to create for each user group.
w2 = Window.partitionBy('user').orderBy('timestamp')
df = df.withColumn('diff', df['row_number'] - F.lag('row_number').over(w2)).fillna(-1)
df = df.withColumn('diff2', F.when(df['diff']==1, None).otherwise(F.abs(df['diff'])))
df = df.withColumn('subgroupid', F.last(F.col('diff2'), True).over(w2))
df.show()
+----+---------+-----+----------+----+-----+----------+
|user|timestamp|dummy|row_number|diff|diff2|subgroupid|
+----+---------+-----+----------+----+-----+----------+
| B| 4| 1| 4| -1| 1| 1|
| B| 5| 1| 5| 1| null| 1|
| A| 1| 1| 1| -1| 1| 1|
| A| 2| 1| 2| 1| null| 1|
| A| 3| 1| 3| 1| null| 1|
| A| 6| 1| 6| 3| 3| 3|
| A| 7| 1| 7| 1| null| 3|
| A| 8| 1| 8| 1| null| 3|
+----+---------+-----+----------+----+-----+----------+
We now group by both user and subgroupid to compute the time each user spent on each contiguous time interval.
Lastly, we group by user only to sum up the total time spent by each user.
s = "(max('timestamp') - min('timestamp'))"
df = df.groupBy(['user', 'subgroupid']).agg(eval(s))
s = s.replace("'","")
df = df.groupBy('user').sum(s).select('user', F.col("sum(" + s + ")").alias('total_time'))
df.show()
+----+----------+
|user|total_time|
+----+----------+
| B| 1|
| A| 4|
+----+----------+

Related

create a new column to increment value when value resets to 1 in another column in pyspark

Logic and columnIn Pyspark DataFrame consider a column like [1,2,3,4,1,2,1,1,2,3,1,2,1,1,2]. Pyspark Column
create a new column to increment value when value resets to 1.
Expected output is[1,1,1,1,2,2,3,4,4,4,5,5,6,7,7]
i am bit new to pyspark, if anyone can help me it would be great for me.
written the logic as like below
def sequence(row_num):
results = [1, ]
flag = 1
for col in range(0, len(row_num)-1):
if row_num[col][0]>=row_num[col+1][0]:
flag+=1
results.append(flag)
return results
but not able to pass a column through udf. please help me in this
Your Dataframe:
df = spark.createDataFrame(
[
('1','a'),
('2','b'),
('3','c'),
('4','d'),
('1','e'),
('2','f'),
('1','g'),
('1','h'),
('2','i'),
('3','j'),
('1','k'),
('2','l'),
('1','m'),
('1','n'),
('2','o')
], ['group','label']
)
+-----+-----+
|group|label|
+-----+-----+
| 1| a|
| 2| b|
| 3| c|
| 4| d|
| 1| e|
| 2| f|
| 1| g|
| 1| h|
| 2| i|
| 3| j|
| 1| k|
| 2| l|
| 1| m|
| 1| n|
| 2| o|
+-----+-----+
You can create a flag and use a Window Function to calculate the cumulative sum. No need to use an UDF:
from pyspark.sql import Window as W
from pyspark.sql import functions as F
w = W.partitionBy().orderBy('label').rowsBetween(Window.unboundedPreceding, 0)
df\
.withColumn('Flag', F.when(F.col('group') == 1, 1).otherwise(0))\
.withColumn('Output', F.sum('Flag').over(w))\
.show()
+-----+-----+----+------+
|group|label|Flag|Output|
+-----+-----+----+------+
| 1| a| 1| 1|
| 2| b| 0| 1|
| 3| c| 0| 1|
| 4| d| 0| 1|
| 1| e| 1| 2|
| 2| f| 0| 2|
| 1| g| 1| 3|
| 1| h| 1| 4|
| 2| i| 0| 4|
| 3| j| 0| 4|
| 1| k| 1| 5|
| 2| l| 0| 5|
| 1| m| 1| 6|
| 1| n| 1| 7|
| 2| o| 0| 7|
+-----+-----+----+------+

pyspark: Auto filling in implicit missing values

I have a dataframe
user day amount
a 2 10
a 1 14
a 4 5
b 1 4
You see that, the maximum value of day is 4, and the minimum value is 1. I want to fill 0 for amount column in all missing days of all users, so the above data frame will become.
user day amount
a 2 10
a 1 14
a 4 5
a 3 0
b 1 4
b 2 0
b 3 0
b 4 0
How could I do that in PySpark? Many thanks.
Here is one approach. You can get the min and max values first , then group on user column and pivot, then fill in missing columns and fill all nulls as 0, then stack them back:
min_max = df.agg(F.min("day"),F.max("day")).collect()[0]
df1 = df.groupBy("user").pivot("day").agg(F.first("amount").alias("amount")).na.fill(0)
missing_cols = [F.lit(0).alias(str(i)) for i in range(min_max[0],min_max[1]+1)
if str(i) not in df1.columns ]
df1 = df1.select("*",*missing_cols)
#+----+---+---+---+---+
#|user| 1| 2| 4| 3|
#+----+---+---+---+---+
#| b| 4| 0| 0| 0|
#| a| 14| 10| 5| 0|
#+----+---+---+---+---+
#the next step is inspired from https://stackoverflow.com/a/37865645/9840637
arr = F.explode(F.array([F.struct(F.lit(c).alias("day"), F.col(c).alias("amount"))
for c in df1.columns[1:]])).alias("kvs")
(df1.select(["user"] + [arr])
.select(["user"]+ ["kvs.day", "kvs.amount"]).orderBy("user")).show()
+----+---+------+
|user|day|amount|
+----+---+------+
| a| 1| 14|
| a| 2| 10|
| a| 4| 5|
| a| 3| 0|
| b| 1| 4|
| b| 2| 0|
| b| 4| 0|
| b| 3| 0|
+----+---+------+
Note, since column day was pivotted , the dtype might have changed so you may have to cast them back to the original dtype
Another way to do this is to use sequence, array functions and explode. (spark2.4+)
from pyspark.sql import functions as F
from pyspark.sql.window import Window
w=Window().partitionBy(F.lit(0))
df.withColumn("boundaries", F.sequence(F.min("day").over(w),F.max("day").over(w),F.lit(1)))\
.groupBy("user").agg(F.collect_list("day").alias('day'),F.collect_list("amount").alias('amount')\
,F.first("boundaries").alias("boundaries")).withColumn("boundaries", F.array_except("boundaries","day"))\
.withColumn("day",F.flatten(F.array("day","boundaries"))).drop("boundaries")\
.withColumn("zip", F.explode(F.arrays_zip("day","amount")))\
.select("user","zip.day", F.when(F.col("zip.amount").isNull(),\
F.lit(0)).otherwise(F.col("zip.amount")).alias("amount")).show()
#+----+---+------+
#|user|day|amount|
#+----+---+------+
#| a| 2| 10|
#| a| 1| 14|
#| a| 4| 5|
#| a| 3| 0|
#| b| 1| 4|
#| b| 2| 0|
#| b| 3| 0|
#| b| 4| 0|
#+----+---+------+

Spark Dataframe: Group and rank rows on a certain column value

I am trying to rank a column when the "ID" column numbering starts from 1 to max and then resets from 1.
So, the first three rows have a continuous numbering on "ID"; hence these should be grouped with group rank =1. Rows four and five are in another group, group rank = 2.
The rows are sorted by "rownum" column. I am aware of the row_number window function but I don't think I can apply for this use case as there is no constant window. I can only think of looping through each row in the dataframe but not sure how I can update a column when number resets to 1.
val df = Seq(
(1, 1 ),
(2, 2 ),
(3, 3 ),
(4, 1),
(5, 2),
(6, 1),
(7, 1),
(8, 2)
).toDF("rownum", "ID")
df.show()
Expected result is below:
You can do it with 2 window-functions, the first one to flag the state, the second one to calculate a running sum:
df
.withColumn("increase", $"ID" > lag($"ID",1).over(Window.orderBy($"rownum")))
.withColumn("group_rank_of_ID",sum(when($"increase",lit(0)).otherwise(lit(1))).over(Window.orderBy($"rownum")))
.drop($"increase")
.show()
gives:
+------+---+----------------+
|rownum| ID|group_rank_of_ID|
+------+---+----------------+
| 1| 1| 1|
| 2| 2| 1|
| 3| 3| 1|
| 4| 1| 2|
| 5| 2| 2|
| 6| 1| 3|
| 7| 1| 4|
| 8| 2| 4|
+------+---+----------------+
As #Prithvi noted, we can use lead here.
The tricky part is in order to use window function such as lead, we need to at least provide the order.
Consider
val nextID = lag('ID, 1, -1) over Window.orderBy('rownum)
val isNewGroup = 'ID <= nextID cast "integer"
val group_rank_of_ID = sum(isNewGroup) over Window.orderBy('rownum)
/* you can try
df.withColumn("intermediate", nextID).show
// ^^^^^^^-- can be `isNewGroup`, or other vals
*/
df.withColumn("group_rank_of_ID", group_rank_of_ID).show
/* returns
+------+---+----------------+
|rownum| ID|group_rank_of_ID|
+------+---+----------------+
| 1| 1| 0|
| 2| 2| 0|
| 3| 3| 0|
| 4| 1| 1|
| 5| 2| 1|
| 6| 1| 2|
| 7| 1| 3|
| 8| 2| 3|
+------+---+----------------+
*/
df.withColumn("group_rank_of_ID", group_rank_of_ID + 1).show
/* returns
+------+---+----------------+
|rownum| ID|group_rank_of_ID|
+------+---+----------------+
| 1| 1| 1|
| 2| 2| 1|
| 3| 3| 1|
| 4| 1| 2|
| 5| 2| 2|
| 6| 1| 3|
| 7| 1| 4|
| 8| 2| 4|
+------+---+----------------+
*/

Adding a Column in DataFrame from another column of same dataFrame Pyspark

I have a Pyspark dataframe df, like following:
+---+----+---+
| id|name| c|
+---+----+---+
| 1| a| 5|
| 2| b| 4|
| 3| c| 2|
| 4| d| 3|
| 5| e| 1|
+---+----+---+
I want to add a column match_name that have value from the name column where id == c
Is it possible to do it with function withColumn()?
Currently i have to create two dataframes and then perform join.
Which is inefficient on large dataset.
Expected Output:
+---+----+---+----------+
| id|name| c|match_name|
+---+----+---+----------+
| 1| a| 5| e|
| 2| b| 4| d|
| 3| c| 2| b|
| 4| d| 3| c|
| 5| e| 1| a|
+---+----+---+----------+
Yes, it is possible, with when:
from pyspark.sql.functions import when, col
condition = col("id") == col("match")
result = df.withColumn("match_name", when(condition, col("name"))
result.show()
id name match match_name
1 a 3 null
2 b 2 b
3 c 5 null
4 d 4 d
5 e 1 null
You may also use otherwise to provide a different value if the condition is not met.

spark dataframe groupby multiple times

val df = (Seq((1, "a", "10"),(1,"b", "12"),(1,"c", "13"),(2, "a", "14"),
(2,"c", "11"),(1,"b","12" ),(2, "c", "12"),(3,"r", "11")).
toDF("col1", "col2", "col3"))
So I have a spark dataframe with 3 columns.
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| a| 10|
| 1| b| 12|
| 1| c| 13|
| 2| a| 14|
| 2| c| 11|
| 1| b| 12|
| 2| c| 12|
| 3| r| 11|
+----+----+----+
My requirement is actually I need to perform two levels of groupby as explained below.
Level1:
If I do groupby on col1 and do a sum of Col3. I will get below two columns.
1. col1
2. sum(col3)
I will loose col2 here.
Level2:
If i want to again group by on col1 and col2 and do a sum of Col3 I will get below 3 columns.
1. col1
2. col2
3. sum(col3)
My requirement is actually I need to perform two levels of groupBy and have these two columns(sum(col3) of level1, sum(col3) of level2) in a final one dataframe.
How can I do this, can anyone explain?
spark : 1.6.2
Scala : 2.10
One option is to do the two sum separately and then join them back:
(df.groupBy("col1", "col2").agg(sum($"col3").as("sum_level2")).
join(df.groupBy("col1").agg(sum($"col3").as("sum_level1")), Seq("col1")).show)
+----+----+----------+----------+
|col1|col2|sum_level2|sum_level1|
+----+----+----------+----------+
| 2| c| 23.0| 37.0|
| 2| a| 14.0| 37.0|
| 1| c| 13.0| 47.0|
| 1| b| 24.0| 47.0|
| 3| r| 11.0| 11.0|
| 1| a| 10.0| 47.0|
+----+----+----------+----------+
Another option is to use the window functions, considering the fact that the level1_sum is the sum of level2_sum grouped by col1:
import org.apache.spark.sql.expressions.Window
val w = Window.partitionBy($"col1")
(df.groupBy("col1", "col2").agg(sum($"col3").as("sum_level2")).
withColumn("sum_level1", sum($"sum_level2").over(w)).show)
+----+----+----------+----------+
|col1|col2|sum_level2|sum_level1|
+----+----+----------+----------+
| 1| c| 13.0| 47.0|
| 1| b| 24.0| 47.0|
| 1| a| 10.0| 47.0|
| 3| r| 11.0| 11.0|
| 2| c| 23.0| 37.0|
| 2| a| 14.0| 37.0|
+----+----+----------+----------+