I have a dataframe containing the id of some person and the date on which he performed a certain action:
+----+----------+
|  id|      date|
+----+----------+
|   1|2022-09-01|
|   1|2022-10-01|
|   1|2022-11-01|
|   2|2022-07-01|
|   2|2022-10-01|
|   2|2022-11-01|
|   3|2022-09-01|
|   3|2022-10-01|
|   3|2022-11-01|
+----+----------+
I need to determine whether each person performed the action in every month of a given period (say, the last 3 months). In this example, person 2 missed months 08 and 09, so the condition is not met. I expect the following result:
+----+------------------------------------+------+
|  id|                               dates|3month|
+----+------------------------------------+------+
|   1|[2022-09-01, 2022-10-01, 2022-11-01]|  true|
|   2|[2022-07-01, 2022-10-01, 2022-11-01]| false|
|   3|[2022-09-01, 2022-10-01, 2022-11-01]|  true|
+----+------------------------------------+------+
I understand that I should group by person ID and get an array of dates that correspond to it.
data.groupBy(col("id")).agg(collect_list("date") as "dates").withColumn("3month", ???)
However, I'm at a loss as to how to write a function that checks this requirement. I have a recursive option, but its performance is too low for my case (there may be more than one thousand dates). I would be very grateful if someone could help me with my problem.
A simple trick is to use a set instead of a list in your aggregation, in order to have distinct values, and then check the size of that set.
Here are some possible solutions:
Solution 1
Assuming you have a list of months of interest on which you want to check, you can perform a preliminary filter on the required months, then aggregate and validate.
import org.apache.spark.sql.{functions => F}
import java.time.LocalDate
val requiredMonths = Seq(
  LocalDate.parse("2022-09-01"),
  LocalDate.parse("2022-10-01"),
  LocalDate.parse("2022-11-01")
)
df
  .filter(F.date_trunc("month", $"date").isInCollection(requiredMonths))
  .groupBy($"id")
  .agg(F.collect_set(F.date_trunc("month", $"date")) as "months")
  .withColumn("is_valid", F.size($"months") === requiredMonths.size)
date_trunc is used to truncate the date column to month.
Solution 2
Similar to the previous one, with a preliminary filter, but here assuming you have a range of months.
import java.time.temporal.ChronoUnit
val firstMonth = LocalDate.parse("2022-09-01")
val lastMonth = LocalDate.parse("2022-11-01")
// inclusive number of months in the range
val requiredNumberOfMonths = ChronoUnit.MONTHS.between(firstMonth, lastMonth) + 1
df
  .withColumn("month", F.date_trunc("month", $"date"))
  .filter($"month" >= firstMonth && $"month" <= lastMonth)
  .groupBy($"id")
  .agg(F.collect_set($"month") as "months")
  .withColumn("is_valid", F.size($"months") === requiredNumberOfMonths)
Solution 3
Solutions 1 and 2 share a problem: ids that have no dates within the period of interest are excluded from the final result entirely.
This is caused by the filter applied before grouping.
Here is a variant of solution 2 that does not filter and so avoids this problem.
df
  .withColumn("month", F.date_trunc("month", $"date"))
  .groupBy($"id")
  .agg(F.collect_set(F.when($"month" >= firstMonth && $"month" <= lastMonth, $"month")) as "months")
  .withColumn("is_valid", F.size($"months") === requiredNumberOfMonths)
Now the filtering is done by a conditional collect_set: months outside the range become null, and collect_set ignores nulls.
Solutions 1 and 2 are still worth considering, because the preliminary filter can be an advantage and, in some cases, excluding those ids is the expected result.
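To make the set-size idea concrete, here is a minimal plain-Python sketch of the same logic as solution 3, run on the sample rows from the question. It is not Spark code; the names rows, months_by_id and is_valid are only illustrative.

```python
from collections import defaultdict
from datetime import date

# Sample rows (id, date) copied from the question.
rows = [
    (1, date(2022, 9, 1)), (1, date(2022, 10, 1)), (1, date(2022, 11, 1)),
    (2, date(2022, 7, 1)), (2, date(2022, 10, 1)), (2, date(2022, 11, 1)),
    (3, date(2022, 9, 1)), (3, date(2022, 10, 1)), (3, date(2022, 11, 1)),
]

first_month = date(2022, 9, 1)
last_month = date(2022, 11, 1)
# Inclusive number of months in the range, like ChronoUnit.MONTHS.between(...) + 1.
required = (last_month.year - first_month.year) * 12 + (last_month.month - first_month.month) + 1

months_by_id = defaultdict(set)
for person_id, d in rows:
    month = d.replace(day=1)   # truncate the date to the first day of its month
    months_by_id[person_id]    # touch the key so every id appears in the result
    if first_month <= month <= last_month:
        months_by_id[person_id].add(month)

# An id is valid when it has one distinct month for every month in the range.
is_valid = {person_id: len(months) == required for person_id, months in months_by_id.items()}
print(is_valid)  # {1: True, 2: False, 3: True}
```

Because a set keeps only distinct months, several actions in the same month count once, which is why comparing the set size with the number of required months is enough.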
Related
I have a PySpark DF, with ID and Date column, looking like this.
+---+----------+
| ID|      Date|
+---+----------+
|  1|2021-10-01|
|  2|2021-10-01|
|  1|2021-10-02|
|  3|2021-10-02|
+---+----------+
I want to count the number of unique IDs that did not exist in the date one day before. So, here the result would be 1 as there is only one new unique ID in 2021-10-02.
+---+----------+-----+
| ID|      Date|Count|
+---+----------+-----+
|  1|2021-10-01|    -|
|  2|2021-10-01|    -|
|  1|2021-10-02|    1|
|  3|2021-10-02|    1|
+---+----------+-----+
I tried following this solution but it does not work on date type value. Any help would be highly appreciated.
Thank you!
If you want to avoid a self-join (e.g. for performance reasons), you could work with Window functions:
from pyspark.sql import Row, Window
from pyspark.sql import functions as F
import datetime
df = spark.createDataFrame([
    Row(ID=1, date=datetime.date(2021,10,1)),
    Row(ID=2, date=datetime.date(2021,10,1)),
    Row(ID=1, date=datetime.date(2021,10,2)),
    Row(ID=2, date=datetime.date(2021,10,2)),
    Row(ID=1, date=datetime.date(2021,10,3)),
    Row(ID=3, date=datetime.date(2021,10,3)),
])
First add the number of days since an ID was last seen (will be None if it never appeared before)
df = df.withColumn('days_since_last_occurrence', F.datediff('date', F.lag('date').over(Window.partitionBy('ID').orderBy('date'))))
Second, we add a column marking rows where this number of days is not 1. We add a 1 into this column so that we can later sum over this column to count the rows
df = df.withColumn('is_new', F.when(F.col('days_since_last_occurrence') == 1, None).otherwise(1))
Now we do the sum of all rows with the same date and then remove the column we do not require anymore:
(
df
.withColumn('count', F.sum('is_new').over(Window.partitionBy('date'))) # sum over all rows with the same date
.drop('is_new', 'days_since_last_occurrence')
.sort('date', 'ID')
.show()
)
# Output:
+---+----------+-----+
| ID|      date|count|
+---+----------+-----+
|  1|2021-10-01|    2|
|  2|2021-10-01|    2|
|  1|2021-10-02| null|
|  2|2021-10-02| null|
|  1|2021-10-03|    1|
|  3|2021-10-03|    1|
+---+----------+-----+
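The window logic above can also be traced by hand. The following plain-Python sketch (not PySpark) mirrors it on the same six rows, assuming each (ID, date) pair occurs at most once; last_seen and new_ids_per_date are illustrative names. Unlike the Spark output, a date with no new IDs gets 0 here instead of null.

```python
from collections import defaultdict
from datetime import date

# Same rows as the example DataFrame: (ID, date).
rows = [
    (1, date(2021, 10, 1)), (2, date(2021, 10, 1)),
    (1, date(2021, 10, 2)), (2, date(2021, 10, 2)),
    (1, date(2021, 10, 3)), (3, date(2021, 10, 3)),
]

last_seen = {}                       # ID -> most recent earlier date (plays the role of lag)
new_ids_per_date = defaultdict(int)  # date -> number of IDs not seen the day before

for row_id, d in sorted(rows, key=lambda r: (r[1], r[0])):
    previous = last_seen.get(row_id)
    # An ID counts as new unless it was last seen exactly one day earlier.
    is_new = previous is None or (d - previous).days != 1
    new_ids_per_date[d] += 1 if is_new else 0
    last_seen[row_id] = d

for d in sorted(new_ids_per_date):
    print(d.isoformat(), new_ids_per_date[d])
```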
Collect the set of ids for the current date and for the previous date, then take the size of the difference between the two sets to get the final result.
Updated with a solution that eliminates the join:
df = df.select('date', F.expr('collect_set(id) over (partition by date) as id_arr')).dropDuplicates() \
.select('*', F.expr('size(array_except(id_arr, lag(id_arr,1,id_arr) over (order by date))) as count')) \
.select(F.explode('id_arr').alias('id'), 'date', 'count')
df.show(truncate=False)
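As a rough illustration of the set-difference idea, here is a plain-Python sketch (not PySpark) on the rows from the question. Like the lag over (order by date) above, it compares each date with the previous date present in the data, which is an assumption: it is only the calendar day before when no days are missing. The names ids_by_date and counts are illustrative.

```python
from collections import defaultdict
from datetime import date

# Rows from the question: (ID, Date).
rows = [
    (1, date(2021, 10, 1)), (2, date(2021, 10, 1)),
    (1, date(2021, 10, 2)), (3, date(2021, 10, 2)),
]

# Equivalent of collect_set(id) over (partition by date).
ids_by_date = defaultdict(set)
for row_id, d in rows:
    ids_by_date[d].add(row_id)

counts = {}
previous_ids = None
for d in sorted(ids_by_date):
    current_ids = ids_by_date[d]
    # The first date has no predecessor, so it is compared with itself (count 0),
    # which is what the default argument of lag(id_arr, 1, id_arr) does.
    baseline = current_ids if previous_ids is None else previous_ids
    counts[d] = len(current_ids - baseline)  # like size(array_except(...))
    previous_ids = current_ids

print(counts)
```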
I have a Spark dataframe that looks like this:
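+---+-------+----------+----------+
| ID|area_id|       dob|       dod|
+---+-------+----------+----------+
|id1|      A|2000/09/10|      Null|
|id2|      A|2001/09/28|2010/01/02|
|id3|      B|2017/09/30|      Null|
|id4|      B|2019/10/01|2020/12/10|
|id5|      C|2005/10/08|2010/07/13|
+---+-------+----------+----------+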
where dob is the date of birth and dod is the date of death.
I'd like to calculate a distinct number of IDs per area_id for a specified time period where a time period could be:
a year (e.g. 2010, 2020, ...)
a year-month (2010-01, 2020-12, ...)
...
This is different from calculating moving averages or aggregating by intervals, so I'll appreciate any ideas for more appropriate approaches.
Replace the nulls in dod with today's date and store the result in a temporary table.
Filter with a WHERE clause using BETWEEN. Use the expr function so that you can refer to columns: expr(" [the date in question] BETWEEN dob AND dod ")
Group by area_id, ID.
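The three steps above can be sketched in plain Python (not Spark) on the rows from the question. The function name distinct_ids_per_area and the explicit today parameter are assumptions made for this illustration; today is passed in so that the result is reproducible.

```python
from collections import defaultdict
from datetime import date

# Rows from the question: (ID, area_id, dob, dod); None stands for Null.
rows = [
    ("id1", "A", date(2000, 9, 10), None),
    ("id2", "A", date(2001, 9, 28), date(2010, 1, 2)),
    ("id3", "B", date(2017, 9, 30), None),
    ("id4", "B", date(2019, 10, 1), date(2020, 12, 10)),
    ("id5", "C", date(2005, 10, 8), date(2010, 7, 13)),
]

def distinct_ids_per_area(rows, day, today):
    """Count distinct IDs per area whose [dob, dod] interval contains `day`."""
    ids = defaultdict(set)
    for row_id, area_id, dob, dod in rows:
        end = dod if dod is not None else today  # step 1: replace null dod with today
        if dob <= day <= end:                    # step 2: day BETWEEN dob AND dod
            ids[area_id].add(row_id)             # step 3: group by area_id
    return {area: len(members) for area, members in ids.items()}

result = distinct_ids_per_area(rows, day=date(2010, 1, 1), today=date(2024, 1, 1))
print(result)  # {'A': 2, 'C': 1}
```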
Given that your column schema is as follows.
types.StructField('ID', types.StringType())
types.StructField('area_id', types.StringType())
types.StructField('dob', types.DateType())
types.StructField('dod', types.DateType())
You can use pyspark.sql functions like the following.
from pyspark.sql import functions
#by month
df.groupBy(df["area_id"], functions.month(df["dob"])).count()
#by quarter
df.groupBy(df["area_id"], functions.quarter(df["dob"])).count()
#by year
df.groupBy(df["area_id"], functions.year(df["dob"])).count()
#by year and month
df.groupBy(df["area_id"], functions.year(df["dob"]), functions.month(df["dob"])).count()
First you want to find the records that match an arbitrary time period and then apply collect_set after grouping on area_id for matching rows.
I use an extensible lambda based system that can be extended to arbitrary time period notation. In my example, I cover the notations used as examples in your question. I break down year-month into a condition with both year and month specified.
I have modified the input to include cases that better illustrate the idea.
Step 1
from pyspark.sql import functions as F
data = [("id1", "A", "2000/09/10", "2021/11/10"),
("id2", "A", "2001/09/28", "2020/10/02",),
("id3", "B", "2017/09/30", None),
("id4", "B", "2017/10/01", "2020/12/10",),
("id5", "C", "2005/10/08", "2010/07/13",), ]
df = spark.createDataFrame(data, ("ID", "area_id", "dob", "dod",))\
.withColumn("dob", F.to_date("dob", "yyyy/MM/dd"))\
.withColumn("dod", F.to_date("dod", "yyyy/MM/dd"))
df.show()
#+---+-------+----------+----------+
#| ID|area_id|       dob|       dod|
#+---+-------+----------+----------+
#|id1|      A|2000-09-10|2021-11-10|
#|id2|      A|2001-09-28|2020-10-02|
#|id3|      B|2017-09-30|      null|
#|id4|      B|2017-10-01|2020-12-10|
#|id5|      C|2005-10-08|2010-07-13|
#+---+-------+----------+----------+
# Map of supported extractors
extractor_map = {"quarter": F.quarter, "month": F.month, "year": F.year}
# specify conditions using extractors defined
# Find rows such that the 2019-10 lies between `dob` and `dod`
conditions = {"month": 10, "year": 2019}
# Iterate through the conditions and, in each iteration,
# update the conditional expression to include the result of the
# condition evaluation after extracting the value with the appropriate extractor.
# The extractors are not `null` safe and will evaluate to `null` for null input;
# depending on how you want to handle nulls, you can modify the condition.
conditional_expression = F.lit(True)
for term, condition in conditions.items():
    extractor = extractor_map[term]
    conditional_expression = (conditional_expression) & (F.lit(condition).between(extractor("dob"), extractor("dod")))
condition_example = df.withColumn("include", conditional_expression)
condition_example.show()
#+---+-------+----------+----------+-------+
#| ID|area_id|       dob|       dod|include|
#+---+-------+----------+----------+-------+
#|id1|      A|2000-09-10|2021-11-10|   true|
#|id2|      A|2001-09-28|2020-10-02|   true|
#|id3|      B|2017-09-30|      null|   null|
#|id4|      B|2017-10-01|2020-12-10|   true|
#|id5|      C|2005-10-08|2010-07-13|  false|
#+---+-------+----------+----------+-------+
Step 2
# Filter rows that match the condition
df_to_group = condition_example.filter(F.col("include") == True)
# Grouping on `area_id` and collecting distinct `ID`
df_to_group.groupBy("area_id").agg(F.collect_set("ID")).show()
Output
+-------+---------------+
|area_id|collect_set(ID)|
+-------+---------------+
|      B|          [id4]|
|      A|     [id2, id1]|
+-------+---------------+
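One caveat: comparing extracted year and month numbers independently can give a different answer than comparing whole dates (for example, dob in 2000-11 and dod in 2021-03 covers 2019-10, yet 10 does not lie between 11 and 3). A possible alternative, sketched here in plain Python rather than Spark, is to turn the period into a start and end date and test for interval overlap. The helpers period_bounds and overlaps are hypothetical names, and a missing dod is treated as "still alive", which is an assumption; that is why id3 is included here although the output above omits it.

```python
import calendar
from datetime import date

def period_bounds(year, month=None):
    """Return the first and last day of a year or of a year-month."""
    if month is None:
        return date(year, 1, 1), date(year, 12, 31)
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, 1), date(year, month, last_day)

def overlaps(dob, dod, start, end):
    """True if [dob, dod] intersects [start, end]; dod=None means no end date."""
    return dob <= end and (dod is None or dod >= start)

# Same rows as the Step 1 data: (ID, area_id, dob, dod).
rows = [
    ("id1", "A", date(2000, 9, 10), date(2021, 11, 10)),
    ("id2", "A", date(2001, 9, 28), date(2020, 10, 2)),
    ("id3", "B", date(2017, 9, 30), None),
    ("id4", "B", date(2017, 10, 1), date(2020, 12, 10)),
    ("id5", "C", date(2005, 10, 8), date(2010, 7, 13)),
]

start, end = period_bounds(2019, 10)
ids_per_area = {}
for row_id, area_id, dob, dod in rows:
    if overlaps(dob, dod, start, end):
        ids_per_area.setdefault(area_id, set()).add(row_id)

print({area: sorted(ids) for area, ids in ids_per_area.items()})
```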
I have a DataFrame with an id column ("item_id") and a date column ("ds"). The date is in yyyy-mm-dd format. Here is an example:
+---------+----------+
|  item_id|        ds|
+---------+----------+
| 25867869|2018-05-01|
| 17190474|2018-01-02|
| 19870756|2018-01-02|
|172248680|2018-07-29|
| 41148162|2018-03-01|
+---------+----------+
I want to create a new column in which each date is associated with an integer starting from 1, such that the smallest (earliest) date gets 1, the next (second earliest) date gets 2, and so on.
I want my DataFrame to look like this:
+---------+----------+---------+
|  item_id|        ds|   number|
+---------+----------+---------+
| 25867869|2018-05-01|        3|
| 17190474|2018-01-02|        1|
| 19870756|2018-01-02|        1|
|172248680|2018-07-29|        4|
| 41148162|2018-03-01|        2|
+---------+----------+---------+
Explanation:
2018-01-02 is the earliest date, so its number is 1. Since there are 2 rows with that date, 1 appears twice. The next date after 2018-01-02 is 2018-03-01, so its number is 2, and so on. How can I create such a column?
This can be achieved by dense_rank in Window functions.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val win = Window.orderBy(to_date(col("ds"),"yyyy-MM-dd").asc)
val df1 = df.withColumn("number", dense_rank() over win)
df1 will have the column number as required.
Note: to_date(col("ds"),"yyyy-MM-dd") makes the window order by actual dates. Without it the values are compared as strings, which only gives the right order when every value is in zero-padded yyyy-MM-dd form.
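For readers unfamiliar with dense_rank, here is a small plain-Python sketch (not Spark) of what it computes on the sample rows: equal dates share a number and the numbering has no gaps. The names rank_of and numbered are illustrative.

```python
from datetime import date

# Sample rows from the question: (item_id, ds).
rows = [
    (25867869, "2018-05-01"),
    (17190474, "2018-01-02"),
    (19870756, "2018-01-02"),
    (172248680, "2018-07-29"),
    (41148162, "2018-03-01"),
]

# Sort the distinct dates and number them 1, 2, 3, ... (this is the dense rank).
distinct_dates = sorted({date.fromisoformat(ds) for _, ds in rows})
rank_of = {d: i for i, d in enumerate(distinct_dates, start=1)}

# Attach the rank to every row, keeping the original row order.
numbered = [(item_id, ds, rank_of[date.fromisoformat(ds)]) for item_id, ds in rows]
for row in numbered:
    print(row)
```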
You should write a function that fetches the oldest row without a number, something like:
SELECT * FROM tablename WHERE number IS NULL ORDER BY ds ASC
Then run another query to get the greatest number assigned so far:
SELECT * FROM tablename ORDER BY number DESC
If both queries return the same date, update the table with the same number:
UPDATE tablename SET number = 'greatest number from second query' WHERE ds = 'the date from first query'
If the dates are different, do the same but add 1 to the number:
UPDATE tablename SET number = 'greatest number from second query' + 1 WHERE ds = 'the date from first query'
To make this work you should first assign the number 1 to the oldest entry.
You should do this in a loop until the first query (which checks whether any number is still unset) returns nothing.
The first query assumes that the empty column is all null; if that is not the case, change the WHERE condition to match however an empty value is represented.
I would like to drop all records that are duplicate entries except for a difference in the timestamp. The allowed offset could be any amount of time, but for simplicity I will use 2 minutes.
+-------------------+-----+----+
|Date |ColA |ColB|
+-------------------+-----+----+
|2017-07-04 18:50:21|ABC |DEF |
|2017-07-04 18:50:26|ABC |DEF |
|2017-07-04 18:50:21|ABC |KLM |
+-------------------+-----+----+
I would like my dataframe to have only rows
+-------------------+-----+----+
|Date |ColA |ColB|
+-------------------+-----+----+
|2017-07-04 18:50:26|ABC |DEF |
|2017-07-04 18:50:21|ABC |KLM |
+-------------------+-----+----+
I tried something like this but this does not remove duplicates.
val joinedDfNoDuplicates = joinedDFTransformed.as("df1").join(joinedDFTransformed.as("df2"),
  col("df1.ColA") === col("df2.ColA") &&
  col("df1.ColB") === col("df2.ColB") &&
  abs(unix_timestamp(col("df1.Date")) - unix_timestamp(col("df2.Date"))) > offset
)
For now, I am just selecting distinct rows or grouping by the minimum, as in "Find minimum for a timestamp through Spark groupBy dataframe", based on certain columns. I would like a more robust solution, because data outside of that interval may be valid. Also, the offset could change, for example to 5 seconds or 5 minutes depending on requirements.
Somebody suggested creating a UDF that compares the dates when all other columns are the same, but I am not sure exactly how to do that so that I either filter out rows or add a flag and then remove those rows. Any help would be greatly appreciated.
Similar SQL question here: Duplicate entries with different timestamp
Thanks!
I would do it like this:
Define a Window that orders the dates over a dummy column.
Add a dummy column and fill it with a constant value.
Add a new column containing the date of the previous record.
Calculate the difference between the date and the previous date.
Filter your records based on the value of the difference.
The code can be something like the following:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
val w = Window.partitionBy("dummy").orderBy("Date") // step 1
df.withColumn("dummy", lit(1)) // step 2
  .withColumn("previousDate", lag($"Date", 1) over w) // step 3
  .withColumn("difference", unix_timestamp($"Date") - unix_timestamp($"previousDate")) // step 4
The above solution is valid if you have pairs of records that might be close in time. If you have more than two records, you can compare each record to the first record (not the previous one) in the window: instead of lag($"Date", 1), use first($"Date"). In this case the 'difference' column contains the time difference between the current record and the first record in the window.
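To show how the "compare with the first record" variant could drive the final filter, here is a plain-Python sketch (not Spark) on the three rows from the question. It rests on two assumptions that the answer does not state: records are grouped by (ColA, ColB), and within each cluster of near-duplicates the latest record is kept, which matches the expected output in the question. The function name drop_near_duplicates is hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Rows from the question: (Date, ColA, ColB).
rows = [
    (datetime(2017, 7, 4, 18, 50, 21), "ABC", "DEF"),
    (datetime(2017, 7, 4, 18, 50, 26), "ABC", "DEF"),
    (datetime(2017, 7, 4, 18, 50, 21), "ABC", "KLM"),
]

def drop_near_duplicates(rows, offset):
    """Keep one record per cluster of timestamps lying within `offset` of the cluster's first record."""
    groups = defaultdict(list)
    for ts, col_a, col_b in rows:
        groups[(col_a, col_b)].append(ts)

    kept = []
    for (col_a, col_b), stamps in groups.items():
        stamps.sort()
        cluster_start = cluster_last = stamps[0]
        for ts in stamps[1:]:
            if ts - cluster_start <= offset:
                cluster_last = ts  # same cluster: remember the latest record
            else:
                kept.append((cluster_last, col_a, col_b))
                cluster_start = cluster_last = ts  # start a new cluster
        kept.append((cluster_last, col_a, col_b))
    return kept

result = drop_near_duplicates(rows, timedelta(minutes=2))
for row in result:
    print(row)
```

Changing the offset to 5 seconds or 5 minutes only requires passing a different timedelta.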
Hi, I would like to pick one row from each group of records based on some rules. I have a dataframe like the one below:
id price date
a 100 2016
a 200 2016
a 100 2016
b 100 2016
b 100 2015
My output dataframe should be
id price date
a 200 2016
b 100 2016
In the given dataframe the rules are based on two columns: within each group of ids (a, b), the row is chosen first by maximum price and then by most recent date. My actual rules are more complicated and involve a lot of other columns too.
What is the best approach for a problem like this? I need to pick a row from a group of rows based on some rules. Any help would be appreciated. Thanks
Try this.
val df = Seq(("a",100,2016), ("a",200,2016), ("a",100,2016), ("b",100,2016),("b",100,2015)).toDF("id", "price", "date")
df.show
val df1 = df.select($"id", struct($"price", $"date").alias("data")).groupBy($"id").agg(max("data").alias("data")).select($"id", $"data.price", $"data.date")
df1.show
You will get the output like below.
+---+-----+----+
| id|price|date|
+---+-----+----+
|  b|  100|2016|
|  a|  200|2016|
+---+-----+----+
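The struct trick works because structs are compared field by field, in order: price first, then date. The same idea in plain Python (not Spark) uses tuple comparison; the name best is illustrative.

```python
# Rows from the question: (id, price, date).
rows = [
    ("a", 100, 2016), ("a", 200, 2016), ("a", 100, 2016),
    ("b", 100, 2016), ("b", 100, 2015),
]

best = {}
for row_id, price, year in rows:
    candidate = (price, year)  # compared left to right, like struct(price, date)
    if row_id not in best or candidate > best[row_id]:
        best[row_id] = candidate

print(best)  # {'a': (200, 2016), 'b': (100, 2016)}
```

More complicated rules can be expressed by putting the deciding columns into the tuple in priority order.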