I am dealing with a problem in PySpark.
Input:
+-------+---------------------+
|user_id|activity_timestamp |
+-------+---------------------+
| 1| 2021/06/01 19:00 |
| 1| 2021/06/01 19:01 |
| 2| 2021/06/01 19:01 |
| 1| 2021/06/01 19:02 |
| 2| 2021/06/01 19:02 |
| 1| 2021/06/01 19:10 |
| 1| 2021/06/01 19:11 |
+-------+---------------------+
Desired output:
For each user, detect periods of continuous activity, where the gap between consecutive timestamps is smaller than, for example, 5 minutes:
+-------+------------------+------------------+
|user_id|    activity_start|     activity_stop|
+-------+------------------+------------------+
|      1|  2021/06/01 19:00|  2021/06/01 19:02|
|      1|  2021/06/01 19:10|  2021/06/01 19:11|
|      2|  2021/06/01 19:01|  2021/06/01 19:02|
+-------+------------------+------------------+
Progress so far: I have used a window function to find the time of the previous activity, and from that I have calculated the time that has passed since the previous activity. But I am struggling to create the desired output.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

time_window = Window.partitionBy("user_id").orderBy("user_id", "activity_timestamp")
df = (df
    .withColumn('prev_time', F.lag(F.col('activity_timestamp')).over(time_window))
    .withColumn('time_gap', F.col('activity_timestamp').cast("long") - F.col('prev_time').cast("long"))
)
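Note: the snippet above assumes activity_timestamp is already a proper timestamp column. If it arrives as a yyyy/MM/dd HH:mm string, as shown in the sample input, it would first need to be parsed, for example:
df = df.withColumn('activity_timestamp', F.to_timestamp('activity_timestamp', 'yyyy/MM/dd HH:mm'))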
Starting with the time difference between two consecutive rows within a window works, just like time_gap in the question.
The next steps are:
- create a new column whose value depends on this difference: if the difference is smaller than 5 minutes, the new column gets the value 0, otherwise 1 (the first row of each partition, which has no previous timestamp, also gets 1)
- sum this new column over the window. All rows that belong to the same period get the same number, because only the first row of each period contributes a 1 and all other rows contribute 0. In the code below, this column is called id
- group the dataframe by user_id and id and take the minimum and maximum of activity_timestamp as start and end of the period
from pyspark.sql import functions as F
df.selectExpr("*", """
sum(
case
when lag(activity_timestamp) over (PARTITION BY user_id ORDER BY activity_timestamp) is null then 1
when cast(activity_timestamp as long)-cast(lag(activity_timestamp) over (PARTITION BY user_id ORDER BY activity_timestamp) as long) > 60 * 5 then 1
else 0
end
) over (PARTITION BY user_id ORDER BY activity_timestamp) as id
""") \
.groupBy("user_id", "id") \
.agg(F.min("activity_timestamp").alias("activity_start"),
F.max("activity_timestamp").alias("activity_end")) \
.drop("id") \
.show()
Output:
+-------+-------------------+-------------------+
|user_id| activity_start| activity_end|
+-------+-------------------+-------------------+
| 1|2021-06-01 19:00:00|2021-06-01 19:02:00|
| 1|2021-06-01 19:10:00|2021-06-01 19:11:00|
| 2|2021-06-01 19:01:00|2021-06-01 19:02:00|
+-------+-------------------+-------------------+
First convert the time to minutes, then use lag to bring the previous time into the same row. From there, calculate the time difference and add a column that contains 0 if the difference between the times is less than 5 minutes and 1 otherwise. Summing over this new column, reusing the same window, leads to the desired result.
from pyspark.sql.window import Window
from pyspark.sql import functions as F
from pyspark.sql import types as T
w = Window.partitionBy('user_id').orderBy('unixtime')
df.withColumn('unixtime', F.unix_timestamp('activity_timestamp') / 60)\
.withColumn('lag', F.lag('unixtime').over(w))\
.withColumn('diff', F.col('lag') - F.col('unixtime'))\
.withColumn('group', F.sum(F.when(F.col('diff') > -5., 0).otherwise(1)).over(w))\
.groupBy('user_id', 'group').agg(F.first('activity_timestamp').alias('activity_start'),
F.last('activity_timestamp').alias('activity_end'))\
.drop('group').show()
Output:
+-------+-------------------+-------------------+
|user_id| activity_start| activity_end|
+-------+-------------------+-------------------+
| 1|2021-06-01 19:00:00|2021-06-01 19:02:00|
| 1|2021-06-01 19:10:00|2021-06-01 19:11:00|
| 2|2021-06-01 19:01:00|2021-06-01 19:02:00|
+-------+-------------------+-------------------+
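A side note on the aggregation above: first and last after a groupBy are not guaranteed to respect the window ordering once the data has been shuffled. Because each period is bounded by its earliest and latest timestamp anyway, min and max are a safer choice. A minimal variant of the last step, where df_grouped is a hypothetical name for the dataframe after the group column has been added:
(df_grouped
    .groupBy('user_id', 'group')
    .agg(F.min('activity_timestamp').alias('activity_start'),   # earliest timestamp in the period
         F.max('activity_timestamp').alias('activity_end'))     # latest timestamp in the period
    .drop('group')
    .show())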
I am rewriting legacy SAS code to PySpark. One of those blocks uses the SAS lag function. The way I understood the notes, an ID is a duplicate if it has two intake dates that are less than 4 days apart.
/*Next create flag if the same ID has two intake dates less than 4 days apart*/
/*Data MUST be sorted by ID and DESCENDING IntakeDate!!!*/
data duplicates (drop= lag_ID lag_IntakeDate);
set df2;
by ID;
lag_ID = lag(ID);
lag_INtakeDate = lag(IntakeDate);
if ID = lag_ID then do;
intake2TIME = intck('day', lag_IntakeDate, IntakeDate);
end;
if 0 <= abs(intake2TIME) < 4 then DUPLICATE = 1;
run;
/* If the DUPLICATE > 1, then it is a duplicate and eventually will be dropped.*/
I tried to meet the condition described in the comments: I pulled, via SQL, the ID and intake dates ordered by ID and descending intake date:
SELECT ID, intakeDate, col3, col4
from df order by ID, intakeDate DESC
I googled the lag equivalent and this is what I found:
https://www.educba.com/pyspark-lag/
However, I have not used window functions before, and the concept introduced by that site does not quite make sense to me, so I tried the following to check whether my understanding of WHERE EXISTS might work:
SELECT *
FROM df
WHERE EXISTS (
SELECT *
FROM df v2
WHERE df.ID = v2.ID AND DATEDIFF(df.IntakeDate, v2.IntakeDate) > 4 )
/* not sure about the second condition, though */
Initial df
+-----------+------------------+
| Id| IntakeDate|
+-----------+------------------+
| 5.0| 2021-04-14|
| 5.0| 2021-05-06|
| 5.0| 2021-05-08|
| 10.0| 2021-04-21|
| 10.0| 2021-05-25|
| 14.0| 2021-03-08|
| 14.0| 2021-03-09|
| 14.0| 2021-09-30|
| 14.0| 2022-04-08|
| 15.0| 2021-04-27|
| 15.0| 2021-05-18|
| 15.0| 2022-01-17|
| 26.0| 2021-08-27|
| 26.0| 2021-09-17|
+-----------+------------------+
The expected df will have a row dropped if the next intake date is within 3 days of the prior date:
+-----------+------------------+
| Id| IntakeDate|
+-----------+------------------+
| 5.0| 2021-04-14|
| 5.0| 2021-05-06| row to drop
| 5.0| 2021-05-08|
| 10.0| 2021-04-21|
| 10.0| 2021-05-25|
| 14.0| 2021-03-08| row to drop
| 14.0| 2021-03-09|
| 14.0| 2021-09-30|
| 14.0| 2022-04-08|
| 15.0| 2021-04-27|
| 15.0| 2021-05-18|
| 15.0| 2022-01-17|
| 26.0| 2021-08-27|
| 26.0| 2021-09-17|
+-----------+------------------+
Please try the following code:
import pyspark.sql.functions as F
from pyspark.sql.window import Window

lead_over_id = Window.partitionBy('id').orderBy('IntakeDate')
df = (df
    .withColumn('lead_1_date', F.lag('IntakeDate', -1).over(lead_over_id))   # next intake date for the same id
    .withColumn('date_diff', F.datediff('lead_1_date', 'IntakeDate'))        # days from this intake to the next one
    .where((F.col('date_diff') > 4) | F.col('date_diff').isNull())           # keep rows whose next intake is far enough away (or that have none)
    .drop('lead_1_date', 'date_diff')
)
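For comparison with the original SAS program, a more literal lag-based translation that flags the duplicates instead of filtering them out could look like the sketch below; flagged is just an illustrative name, and the descending sort mirrors the DESCENDING IntakeDate requirement in the SAS comments:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w_desc = Window.partitionBy('Id').orderBy(F.col('IntakeDate').desc())
flagged = (df
    .withColumn('lag_IntakeDate', F.lag('IntakeDate').over(w_desc))                    # the following (later) intake date, due to the descending sort
    .withColumn('intake2TIME', F.datediff('IntakeDate', 'lag_IntakeDate'))             # negative day count to that later intake
    .withColumn('DUPLICATE', F.when(F.abs(F.col('intake2TIME')) < 4, 1).otherwise(0))  # 1 when another intake follows within 3 days
    .drop('lag_IntakeDate')
)
Rows with DUPLICATE = 1 are the ones to drop. Note that the SAS rule is strictly "less than 4 days apart", while the filter above uses date_diff > 4, which would also drop a gap of exactly 4 days.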
I want to pivot a column and then rank the data from the pivoted column. Here is sample data:
| id | objective | metric | score |
|----|-----------|-------------|-------|
| 1 | Sales | Total Sales | 10 |
| 1 | Marketing | Total Reach | 4 |
| 2 | Sales | Total Sales | 2 |
| 2 | Marketing | Total Reach | 11 |
| 3 | Sales | Total Sales | 9 |
This would be my expected output after pivot + rank:
| id | Sales | Marketing |
|----|--------|-----------|
| 1 | 1 | 2 |
| 2 | 3 | 1 |
| 3 | 2 | 3 |
The ranking is based on sum(score) for each objective. An objective can also have multiple metrics, but that isn't included in the sample for simplicity.
I have been able to successfully pivot and count the scores like so:
import pyspark.sql.functions as sf

pivot = (
    spark.table('scoring_table')
    .select('id', 'objective', 'metric', 'score')
    .groupBy('id')
    .pivot('objective')
    .agg(
        sf.sum('score').alias('score')
    )
)
This then lets me see the total score per objective, but I'm unsure how to rank these. I have tried the following after aggregation:
.withColumn('rank', rank().over(Window.partitionBy('id', 'objective').orderBy(sf.col('score').desc())))
However, objective is no longer available at this point, as it has been pivoted away. I then tried this instead:
.withColumn('rank', rank().over(Window.partitionBy('id', 'Sales', 'Marketing').orderBy(sf.col('score').desc())))
But the score column is also no longer available. How can I rank these scores after pivoting the data?
You just need to order by the score after pivot:
from pyspark.sql import functions as F, Window
df2 = df.groupBy('id').pivot('objective').agg(F.sum('score')).fillna(0)
df3 = df2.select(
'id',
*[F.rank().over(Window.orderBy(F.desc(c))).alias(c) for c in df2.columns[1:]]
)
df3.show()
+---+---------+-----+
| id|Marketing|Sales|
+---+---------+-----+
| 2| 1| 3|
| 1| 2| 1|
| 3| 3| 2|
+---+---------+-----+
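One caveat with this approach: rank leaves gaps when two ids tie on a total, and Window.orderBy without partitionBy pulls all rows into a single partition (Spark logs a warning), which is fine for a small pivoted table but worth knowing. If gap-free numbering is preferred, dense_rank is a drop-in replacement; a minimal variant of the select above:
df3 = df2.select(
    'id',
    *[F.dense_rank().over(Window.orderBy(F.desc(c))).alias(c) for c in df2.columns[1:]]
)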
I have a dataframe such as:
id | value | date1 | date2
-------------------------------------
1 | 20 | 2015-09-01 | 2018-03-01
1 | 30 | 2019-04-04 | 2015-03-02
1 | 40 | 2014-01-01 | 2016-06-09
2 | 15 | 2014-01-01 | 2013-06-01
2 | 25 | 2019-07-18 | 2016-07-07
and want to return for each id the sum(value) where date1<max(date2) for that id. In the above example we will get:
id | sum_value
-----------
1 | 60
2 | 15
since for id 1 the max(date2) is 2018-03-01 and the first and third rows fit the condition date1<max(date2), so the sum is 20 + 40 = 60.
I have tried the code below, but max can't be used outside of an aggregation like this.
df.withColumn('sum_value',F.when(F.col('date1')<F.max(F.col('date2')), value).otherwise(0))
.groupby(['id'])
Do you have any suggestions? The table is 2 billion rows, so I am looking for options other than re-joining.
You can use a Window function. Direct translation of your requirements would be:
from pyspark.sql.functions import col, max as _max, sum as _sum
from pyspark.sql import Window
df.withColumn("max_date2", _max("date2").over(Window.partitionBy("id")))\
.where(col("date1") < col("max_date2"))\
.groupBy("id")\
.agg(_sum("value").alias("sum_value"))\
.show()
#+---+---------+
#| id|sum_value|
#+---+---------+
#| 1| 60.0|
#| 2| 15.0|
#+---+---------+
I'm trying to work through the following exercise using Scala and Spark.
Given a file containing two columns: a time in seconds and a value
Example:
|---------------------|------------------|
| seconds | value |
|---------------------|------------------|
| 225 | 1,5 |
| 245 | 0,5 |
| 300 | 2,4 |
| 319 | 1,2 |
| 320 | 4,6 |
|---------------------|------------------|
and given a value V to be used for the rolling window, this output should be created:
Example with V=20
|--------------|---------|--------------------|----------------------|
| seconds | value | num_row_in_window |sum_values_in_windows |
|--------------|---------|--------------------|----------------------|
| 225 | 1,5 | 1 | 1,5 |
| 245 | 0,5 | 2 | 2 |
| 300 | 2,4 | 1 | 2,4 |
| 319 | 1,2 | 2 | 3,6 |
| 320 | 4,6 | 3 | 8,2 |
|--------------|---------|--------------------|----------------------|
num_row_in_window is the number of rows contained in the current window and
sum_values_in_windows is the sum of the values contained in the current window.
I've been trying the sliding function and the SQL API, but it's a bit unclear to me which is the best way to tackle this problem, considering that I'm a Spark/Scala novice.
This is a perfect application for window functions. By using rangeBetween you can set your sliding window to 20 seconds. Note that in the example below no partitioning is specified (no partitionBy). Without partitioning, this code will not scale:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lit, sum}
import ss.implicits._
val df = Seq(
(225, 1.5),
(245, 0.5),
(300, 2.4),
(319, 1.2),
(320, 4.6)
).toDF("seconds", "value")
val window = Window.orderBy($"seconds").rangeBetween(-20L, 0L) // add partitioning here
df
.withColumn("num_row_in_window", sum(lit(1)).over(window))
.withColumn("sum_values_in_window", sum($"value").over(window))
.show()
+-------+-----+-----------------+--------------------+
|seconds|value|num_row_in_window|sum_values_in_window|
+-------+-----+-----------------+--------------------+
| 225| 1.5| 1| 1.5|
| 245| 0.5| 2| 2.0|
| 300| 2.4| 1| 2.4|
| 319| 1.2| 2| 3.6|
| 320| 4.6| 3| 8.2|
+-------+-----+-----------------+--------------------+
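For PySpark users, the same rangeBetween approach translates almost one-to-one. A minimal sketch, assuming the data is already in a dataframe df with numeric seconds and value columns (result is just an illustrative name):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.orderBy('seconds').rangeBetween(-20, 0)  # add partitionBy here if a grouping key exists

result = (df
    .withColumn('num_row_in_window', F.count(F.lit(1)).over(w))      # rows that fall inside the 20-second window
    .withColumn('sum_values_in_window', F.sum('value').over(w)))     # sum of the values inside that window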
Consider the following dataframe:
+-------+-----------+-------+
| rid| createdon| count|
+-------+-----------+-------+
| 124| 2017-06-15| 1 |
| 123| 2017-06-14| 2 |
| 123| 2017-06-14| 1 |
+-------+-----------+-------+
I need to add up the count column across rows that have the same createdon and rid.
Therefore the resulting dataframe should be as follows:
+-------+-----------+-------+
| rid| createdon| count|
+-------+-----------+-------+
| 124| 2017-06-15| 1 |
| 123| 2017-06-14| 3 |
+-------+-----------+-------+
I am using Spark 2.0.2.
I have tried agg, conditions inside select etc, but couldn't find the solution. Can anyone help me?
Try this:
import org.apache.spark.sql.{functions => func}
df.groupBy($"rid", $"createdon").agg(func.sum($"count").alias("count"))
This should do what you want:
import org.apache.spark.sql.functions.sum
df
.groupBy($"rid",$"createdon")
.agg(sum($"count").as("count"))
.show
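For completeness, the PySpark equivalent is the same groupBy-and-sum, assuming the usual functions import:
from pyspark.sql import functions as F

df.groupBy('rid', 'createdon').agg(F.sum('count').alias('count')).show()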