Calculate distinct ID for a specified time period - pyspark

I have a Spark dataframe that looks like this:
+---+-------+----------+----------+
| ID|area_id|       dob|       dod|
+---+-------+----------+----------+
|id1|      A|2000/09/10|      Null|
|id2|      A|2001/09/28|2010/01/02|
|id3|      B|2017/09/30|      Null|
|id4|      B|2019/10/01|2020/12/10|
|id5|      C|2005/10/08|2010/07/13|
+---+-------+----------+----------+
where dob is the date of birth and dod is the date of death.
I'd like to calculate a distinct number of IDs per area_id for a specified time period where a time period could be:
a year (e.g. 2010, 2020, ...)
a year-month (2010-01, 2020-12, ...)
...
This is different from calculating moving averages or aggregating by intervals, so I'd appreciate any ideas for more appropriate approaches.

Replace the nulls in dod with today's date and stick the result in a temporary table.
Use a WHERE clause with BETWEEN; use the expr function so you can reference columns: expr("[the date in question] BETWEEN dob AND dod").
Group by area_id, ID and count. A sketch of this approach follows.
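A minimal PySpark sketch of that approach, assuming df holds the data from the question with dob/dod as date columns; query_date is just a hypothetical example value:

from pyspark.sql import functions as F

query_date = "2010-01-15"  # hypothetical "date in question"

active = (
    df
    # treat a missing date of death as "still alive": replace nulls with today
    .withColumn("dod_filled", F.coalesce("dod", F.current_date()))
    # keep rows whose [dob, dod] span covers the date in question
    .where(F.expr(f"DATE'{query_date}' BETWEEN dob AND dod_filled"))
)

# distinct number of IDs per area_id that were alive on that date
active.groupBy("area_id").agg(F.countDistinct("ID").alias("distinct_ids")).show()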

Given that your column schema is as follows:
types.StructField('ID', types.StringType())
types.StructField('area_id', types.StringType())
types.StructField('dob', types.DateType())
types.StructField('dod', types.DateType())
You can use pyspark.sql functions like the following.
from pyspark.sql import functions
# by month
df.groupBy(df["area_id"], functions.month(df["dob"])).count()

# by quarter
df.groupBy(df["area_id"], functions.quarter(df["dob"])).count()

# by year
df.groupBy(df["area_id"], functions.year(df["dob"])).count()

# by year and month
df.groupBy(df["area_id"], functions.year(df["dob"]), functions.month(df["dob"])).count()
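Note that .count() here counts rows per group; if you specifically need the number of distinct IDs per group (as the question asks), you can aggregate with countDistinct instead, e.g.:

# distinct IDs per area and year of birth
df.groupBy(df["area_id"], functions.year(df["dob"])).agg(functions.countDistinct(df["ID"])).show()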

First you want to find the records that match an arbitrary time period, and then apply collect_set after grouping the matching rows on area_id.
I use an extensible, extractor-based approach that can be extended to arbitrary time period notations. In my example, I cover the notations used as examples in your question; year-month is broken down into a condition with both year and month specified.
I have modified the input to include cases that better illustrate the idea.
Step 1
from pyspark.sql import functions as F
data = [("id1", "A", "2000/09/10", "2021/11/10"),
        ("id2", "A", "2001/09/28", "2020/10/02"),
        ("id3", "B", "2017/09/30", None),
        ("id4", "B", "2017/10/01", "2020/12/10"),
        ("id5", "C", "2005/10/08", "2010/07/13")]

df = spark.createDataFrame(data, ("ID", "area_id", "dob", "dod"))\
          .withColumn("dob", F.to_date("dob", "yyyy/MM/dd"))\
          .withColumn("dod", F.to_date("dod", "yyyy/MM/dd"))
df.show()
#+---+-------+----------+----------+
#| ID|area_id| dob| dod|
#+---+-------+----------+----------+
#|id1| A|2000-09-10|2021-11-10|
#|id2| A|2001-09-28|2020-10-02|
#|id3| B|2017-09-30| null|
#|id4| B|2017-10-01|2020-12-10|
#|id5| C|2005-10-08|2010-07-13|
#+---+-------+----------+----------+
# Map of supported extractors
extractor_map = {"quarter": F.quarter, "month": F.month, "year": F.year}

# Specify conditions using the extractors defined above.
# Here: find rows such that 2019-10 lies between `dob` and `dod`.
conditions = {"month": 10, "year": 2019}

# Iterate through the conditions and, in each iteration, update the
# conditional expression to include the result of the condition evaluation
# after extracting the value with the appropriate extractor.
# The extractors are not null safe and will evaluate to `null`;
# depending on how you want to treat nulls, you can modify the condition.
conditional_expression = F.lit(True)
for term, condition in conditions.items():
    extractor = extractor_map[term]
    conditional_expression = conditional_expression & F.lit(condition).between(extractor("dob"), extractor("dod"))

condition_example = df.withColumn("include", conditional_expression)
condition_example.show()
#+---+-------+----------+----------+-------+
#| ID|area_id| dob| dod|include|
#+---+-------+----------+----------+-------+
#|id1| A|2000-09-10|2021-11-10| true|
#|id2| A|2001-09-28|2020-10-02| true|
#|id3| B|2017-09-30| null| null|
#|id4| B|2017-10-01|2020-12-10| true|
#|id5| C|2005-10-08|2010-07-13| false|
#+---+-------+----------+----------+-------+
Step 2
# Filter rows that match the condition
df_to_group = condition_example.filter(F.col("include") == True)
# Grouping on `area_id` and collecting distinct `ID`
df_to_group.groupBy("area_id").agg(F.collect_set("ID")).show()
Output
+-------+---------------+
|area_id|collect_set(ID)|
+-------+---------------+
| B| [id4]|
| A| [id2, id1]|
+-------+---------------+
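If you only need the count rather than the set of IDs itself, you could swap collect_set for countDistinct (or take the size of the collected set); a small variation on Step 2 under the same setup:

# distinct number of IDs per area for the matching rows
df_to_group.groupBy("area_id").agg(F.countDistinct("ID").alias("distinct_ids")).show()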

Related

Determine if dates are continuous in a list

I have a dataframe containing the id of some person and the date on which he performed a certain action:
+----+----------+
| id| date|
+----+----------+
| 1|2022-09-01|
| 1|2022-10-01|
| 1|2022-11-01|
| 2|2022-07-01|
| 2|2022-10-01|
| 2|2022-11-01|
| 3|2022-09-01|
| 3|2022-10-01|
| 3|2022-11-01|
+----+----------+
I need to determine whether this person performed the action over a certain period of time (say, the last 3 months). In this specific example, person number 2 missed months 08 and 09, so the condition is not met. I therefore expect to get the following result:
+----+------------------------------------+------+
| id| dates|3month|
+----+------------------------------------+------+
| 1|[2022-09-01, 2022-10-01, 2022-11-01]| true|
| 2|[2022-07-01, 2022-10-01, 2022-11-01]| false|
| 3|[2022-09-01, 2022-10-01, 2022-11-01]| true|
+----+------------------------------------+------+
I understand that I should group by person ID and get an array of dates that correspond to it.
data.groupBy(col("id")).agg(collect_list("date") as "dates").withColumn("3month", ???)
However, I'm at a loss writing a function that would check compliance with this requirement. I have an option using recursion, but it does not suit me due to low performance (there may be more than a thousand dates). I would be very grateful if someone could help me with my problem.
A simple trick is to use a set instead of a list in your aggregation, in order to have distinct values, and then check the size of that set.
Here are some possible solutions:
Solution 1
Assuming you have a list of months of interest on which you want to check, you can perform a preliminary filter on the required months, then aggregate and validate.
import org.apache.spark.sql.{functions => F}
import java.time.{LocalDate, Duration}

val requiredMonths = Seq(
  LocalDate.parse("2022-09-01"),
  LocalDate.parse("2022-10-01"),
  LocalDate.parse("2022-11-01")
)

df
  .filter(F.date_trunc("month", $"date").isInCollection(requiredMonths))
  .groupBy($"id")
  .agg(F.collect_set(F.date_trunc("month", $"date")) as "months")
  .withColumn("is_valid", F.size($"months") === requiredMonths.size)
date_trunc is used to truncate the date column to month.
Solution 2
Similar to the previous one, with a preliminary filter, but here assuming you have a range of months:
import java.time.temporal.ChronoUnit

val firstMonth = LocalDate.parse("2022-09-01")
val lastMonth = LocalDate.parse("2022-11-01")
val requiredNumberOfMonths = ChronoUnit.MONTHS.between(firstMonth, lastMonth) + 1

df
  .withColumn("month", F.date_trunc("month", $"date"))
  .filter($"month" >= firstMonth && $"month" <= lastMonth)
  .groupBy($"id")
  .agg(F.collect_set($"month") as "months")
  .withColumn("is_valid", F.size($"months") === requiredNumberOfMonths)
Solution 3
Both solutions 1 and 2 have a problem: ids whose dates have no intersection with the months of interest are excluded from the final result entirely.
This is caused by the filter applied before grouping.
Here is a solution based on solution 2 that does not filter and solves this problem.
df
  .withColumn("month", F.date_trunc("month", $"date"))
  .groupBy($"id")
  .agg(F.collect_set(F.when($"month" >= firstMonth && $"month" <= lastMonth, $"month")) as "months")
  .withColumn("is_valid", F.size($"months") === requiredNumberOfMonths)
Now the filter is performed using a conditional collect_set.
Solutions 1 and 2 are still worth considering, because the preliminary filter can have performance advantages, and in some cases excluding those ids may be exactly the result you want.
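Since the surrounding thread is mostly PySpark, a rough PySpark translation of solution 3 might look like the sketch below (assuming the same df with id and date columns; the month bounds and required count are just the values from the example):

from pyspark.sql import functions as F

first_month = "2022-09-01"   # illustrative bounds, as in the Scala example
last_month = "2022-11-01"
required_number_of_months = 3

result = (
    df
    .withColumn("month", F.date_trunc("month", "date"))
    .groupBy("id")
    # collect_set ignores the nulls produced by F.when for out-of-range months
    .agg(F.collect_set(
        F.when(F.col("month").between(F.to_timestamp(F.lit(first_month)),
                                       F.to_timestamp(F.lit(last_month))),
               F.col("month"))
    ).alias("months"))
    .withColumn("is_valid", F.size("months") == required_number_of_months)
)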

Group by and save the max value with overlapping columns in scala spark

I have data that looks like this:
id,start,expiration,customerid,content
1,13494,17358,0001,whateveriwanthere
2,14830,28432,0001,somethingelsewoo
3,11943,19435,0001,yes
4,39271,40231,0002,makingfakedata
5,01321,02143,0002,morefakedata
In the data above, I want to group rows by customerid where start and expiration overlap (essentially just merging intervals). I am doing this successfully by grouping by the customer id, then aggregating with first("start") and max("expiration").
df.groupBy("customerid").agg(first("start"), max("expiration"))
However, this drops the id column entirely. I want to save the id of the row that had the max expiration. For instance, I want my output to look like this:
id,start,expiration,customerid
2,11943,28432,0001
4,39271,40231,0002
5,01321,02143,0002
I am not sure how to add that id column for whichever row had the maximum expiration.
You can use a cumulative conditional sum along with the lag function to define a group column that flags rows that overlap. Then simply group by customerid + group and take the min start and max expiration. To get the id value associated with the max expiration date, you can use this trick with struct ordering:
import org.apache.spark.sql.expressions.Window

val w = Window.partitionBy("customerid").orderBy("start")

val result = df.withColumn(
    "group",
    sum(
      when(
        col("start").between(lag("start", 1).over(w), lag("expiration", 1).over(w)),
        0
      ).otherwise(1)
    ).over(w)
  ).groupBy("customerid", "group").agg(
    min(col("start")).as("start"),
    max(struct(col("expiration"), col("id"))).as("max")
  ).select("max.id", "customerid", "start", "max.expiration")
result.show
//+---+----------+-----+----------+
//| id|customerid|start|expiration|
//+---+----------+-----+----------+
//| 5| 0002|01321| 02143|
//| 4| 0002|39271| 40231|
//| 2| 0001|11943| 28432|
//+---+----------+-----+----------+
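For anyone following this answer from PySpark, a rough translation of the same cumulative-sum plus struct-ordering trick might look like this (same column names assumed):

from pyspark.sql import functions as F, Window

w = Window.partitionBy("customerid").orderBy("start")

result = (
    df
    # start a new group whenever the current row does not overlap the previous one
    .withColumn(
        "group",
        F.sum(
            F.when(
                F.col("start").between(F.lag("start", 1).over(w), F.lag("expiration", 1).over(w)),
                0
            ).otherwise(1)
        ).over(w)
    )
    .groupBy("customerid", "group")
    .agg(
        F.min("start").alias("start"),
        # struct ordering: the max is taken on expiration first, carrying the matching id along
        F.max(F.struct("expiration", "id")).alias("max")
    )
    .select("max.id", "customerid", "start", "max.expiration")
)
result.show()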

Count unique ids between two consecutive dates that are values of a column in PySpark

I have a PySpark DF with ID and Date columns, looking like this.
+---+----------+
| ID|      Date|
+---+----------+
|  1|2021-10-01|
|  2|2021-10-01|
|  1|2021-10-02|
|  3|2021-10-02|
+---+----------+
I want to count the number of unique IDs that did not exist on the previous day. So here the result would be 1, as there is only one new unique ID on 2021-10-02.
+---+----------+-----+
| ID|      Date|Count|
+---+----------+-----+
|  1|2021-10-01|    -|
|  2|2021-10-01|    -|
|  1|2021-10-02|    1|
|  3|2021-10-02|    1|
+---+----------+-----+
I tried following this solution but it does not work on date type values. Any help would be highly appreciated.
Thank you!
If you want to avoid a self-join (e.g. for performance reasons), you could work with Window functions:
from pyspark.sql import Row, Window, functions as F
import datetime

df = spark.createDataFrame([
    Row(ID=1, date=datetime.date(2021, 10, 1)),
    Row(ID=2, date=datetime.date(2021, 10, 1)),
    Row(ID=1, date=datetime.date(2021, 10, 2)),
    Row(ID=2, date=datetime.date(2021, 10, 2)),
    Row(ID=1, date=datetime.date(2021, 10, 3)),
    Row(ID=3, date=datetime.date(2021, 10, 3)),
])
First add the number of days since an ID was last seen (will be None if it never appeared before)
df = df.withColumn('days_since_last_occurrence', F.datediff('date', F.lag('date').over(Window.partitionBy('ID').orderBy('date'))))
Second, we add a column marking rows where this number of days is not 1. We add a 1 into this column so that we can later sum over this column to count the rows
df = df.withColumn('is_new', F.when(F.col('days_since_last_occurrence') == 1, None).otherwise(1))
Now we sum over all rows with the same date and then remove the columns we do not require anymore:
(
    df
    .withColumn('count', F.sum('is_new').over(Window.partitionBy('date')))  # sum over all rows with the same date
    .drop('is_new', 'days_since_last_occurrence')
    .sort('date', 'ID')
    .show()
)
# Output:
+---+----------+-----+
| ID| date|count|
+---+----------+-----+
| 1|2021-10-01| 2|
| 2|2021-10-01| 2|
| 1|2021-10-02| null|
| 2|2021-10-02| null|
| 1|2021-10-03| 1|
| 3|2021-10-03| 1|
+---+----------+-----+
Collect the id list of the current day and of the previous day, then take the size of the difference between the two to get the final result.
Update: a solution that eliminates the join.
df = df.select('date', F.expr('collect_set(id) over (partition by date) as id_arr')).dropDuplicates() \
    .select('*', F.expr('size(array_except(id_arr, lag(id_arr, 1, id_arr) over (order by date))) as count')) \
    .select(F.explode('id_arr').alias('id'), 'date', 'count')
df.show(truncate=False)

How to select the N highest values for each category in spark scala

Say I have this dataset:
val main_df = Seq(("yankees-mets",8,20),("yankees-redsox",4,14),("yankees-mets",6,17),
  ("yankees-redsox",2,10),("yankees-mets",5,17),("yankees-redsox",5,10)).toDF("teams","homeruns","hits")
which looks like this:
+--------------+--------+----+
|         teams|homeruns|hits|
+--------------+--------+----+
|  yankees-mets|       8|  20|
|yankees-redsox|       4|  14|
|  yankees-mets|       6|  17|
|yankees-redsox|       2|  10|
|  yankees-mets|       5|  17|
|yankees-redsox|       5|  10|
+--------------+--------+----+
I want to pivot on the teams column, and for all the other columns return the 2 (or N) highest values for that column. So for yankees-mets and homeruns it would return 8 and 6, since those were the 2 highest homerun totals for them.
How would I do this in the general case?
Thanks
Your problem is not really a good fit for pivot, since a pivot means:
A pivot is an aggregation where one (or more in the general case) of the grouping columns has its distinct values transposed into individual columns.
You could create an additional rank column with a window function and then select only rows with rank 1 or 2:
import org.apache.spark.sql.expressions.Window

main_df.withColumn(
    "rank",
    rank().over(
      Window.partitionBy("teams")
        .orderBy($"homeruns".desc)
    )
  )
  .where($"teams" === "yankees-mets" and ($"rank" === 1 or $"rank" === 2))
  .show
+------------+--------+----+----+
| teams|homeruns|hits|rank|
+------------+--------+----+----+
|yankees-mets| 8| 20| 1|
|yankees-mets| 6| 17| 2|
+------------+--------+----+----+
Then, if you no longer need the rank column, you can simply drop it.
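In PySpark, the same window-rank approach generalizes to an arbitrary N; a small sketch, assuming main_df is the equivalent DataFrame loaded in Python and n is the number of rows to keep per team:

from pyspark.sql import functions as F, Window

n = 2  # number of top values to keep per team (illustrative)
w = Window.partitionBy("teams").orderBy(F.col("homeruns").desc())

top_n = (
    main_df
    .withColumn("rank", F.rank().over(w))
    .where(F.col("rank") <= n)   # keep the n highest-homerun rows per team
    .drop("rank")
)
top_n.show()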

How to associate dates with a count or an integer [duplicate]

This question already has answers here:
Spark SQL window function with complex condition
(2 answers)
Closed 4 years ago.
I have a DataFrame with columns "id" and "date". date is of format yyyy-MM-dd; here is an example:
+---------+----------+
| item_id| ds|
+---------+----------+
| 25867869|2018-05-01|
| 17190474|2018-01-02|
| 19870756|2018-01-02|
|172248680|2018-07-29|
| 41148162|2018-03-01|
+---------+----------+
I want to create a new column in which each date is associated with an integer starting from 1, such that the smallest (earliest) date gets integer 1, the next (2nd earliest) date gets 2, and so on.
I want my DataFrame to look like this... :
+---------+----------+---------+
| item_id| ds| number|
+---------+----------+---------+
| 25867869|2018-05-01| 3|
| 17190474|2018-01-02| 1|
| 19870756|2018-01-02| 1|
|172248680|2018-07-29| 4|
| 41148162|2018-03-01| 2|
+---------+----------+---------+
Explanation:
2018-01-02 is the earliest date, hence its number is 1; since there are 2 rows with the same date, the number 1 appears twice. After 2018-01-02, the next date is 2018-03-01, hence its number is 2, and so on. How can I create such a column?
This can be achieved by dense_rank in Window functions.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val win = Window.orderBy(to_date(col("ds"),"yyyy-MM-dd").asc)
val df1 = df.withColumn("number", dense_rank() over win)
df1 will have the column number as you required.
Note: to_date(col("ds"), "yyyy-MM-dd") is mandatory; otherwise the column is treated as a String and the ordering does not serve the purpose.
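If you are working in PySpark rather than Scala, the equivalent would look roughly like this (note that a window without partitionBy pulls all rows into a single partition):

from pyspark.sql import functions as F, Window

# order globally by the parsed date; dense_rank gives equal dates the same number
win = Window.orderBy(F.to_date(F.col("ds"), "yyyy-MM-dd").asc())
df1 = df.withColumn("number", F.dense_rank().over(win))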
You could make a function that gets the oldest entry without a number, something like:
SELECT * FROM tablename WHERE number IS NULL ORDER BY ds ASC
then another query that gets the greatest number assigned so far:
SELECT * FROM tablename ORDER BY number DESC
Then, if both queries return the same date, update the table with that same number:
UPDATE tablename SET number = 'greatest number from first query' WHERE ds = 'the date from first query'
or, if the dates are different, do the same but add 1 to the number:
UPDATE tablename SET number = 'greatest number from first query' + 1 WHERE ds = 'the date from first query'
To make this work, you should first assign the number 1 to the oldest entry.
Repeat this in a loop until the first query (which checks whether any row has no number set) returns no rows.
The first query supposes that the number column starts out all NULL; if that is not the case, change the WHERE condition to check for whatever "empty" means in your table.