Polars get count of events prior to "this" event, but within given duration

I have been struggling to create a feature: a counter of the number of events prior to each event, where each prior event must have occurred within a given duration (dt). I know how to do it for all previous events; that is easy using cumsum and over on the given column. But if I want to count only events within, e.g., the last 2 days, how do I do that?
Below is how I do it (the wrong way) with cumsum.
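import polars as pl
from datetime import date
df = pl.DataFrame(
    data={
        "Event": ["Rain", "Sun", "Rain", "Sun", "Rain", "Sun", "Rain", "Sun"],
        "Date": [date(2022, 1, d) for d in (1, 2, 2, 3, 3, 5, 5, 8)],
    }
)
Adding a per-event cumulative count with cumsum and over gives: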
shape: (8, 3)
│ Event ┆ Date ┆ cum_sum │
│ --- ┆ --- ┆ --- │
│ str ┆ date ┆ u32 │
│ Rain ┆ 2022-01-01 ┆ 0 │
│ Sun ┆ 2022-01-02 ┆ 0 │
│ Rain ┆ 2022-01-02 ┆ 1 │
│ Sun ┆ 2022-01-03 ┆ 1 │
│ Rain ┆ 2022-01-03 ┆ 2 │
│ Sun ┆ 2022-01-05 ┆ 2 │
│ Rain ┆ 2022-01-05 ┆ 3 │
│ Sun ┆ 2022-01-08 ┆ 3 │
What I would like to output is this:
shape: (8, 3)
│ Event ┆ Date ┆ cum_sum │
│ --- ┆ --- ┆ --- │
│ str ┆ date ┆ u32 │
│ Rain ┆ 2022-01-01 ┆ 0 │
│ Sun ┆ 2022-01-02 ┆ 0 │
│ Rain ┆ 2022-01-02 ┆ 1 │
│ Sun ┆ 2022-01-03 ┆ 1 │
│ Rain ┆ 2022-01-03 ┆ 2 │
│ Sun ┆ 2022-01-05 ┆ 1 │
│ Rain ┆ 2022-01-05 ┆ 1 │
│ Sun ┆ 2022-01-08 ┆ 0 │
(Preferably, a solution that scales somewhat well..)
Tried this without success

You can try a groupby_rolling for this.
df.groupby_rolling(index_column="Date", period="2d", by="Event", closed="both").agg(
    pl.count() - 1
).sort(["Date", "Event"], reverse=[False, True])
shape: (8, 3)
│ Event ┆ Date ┆ count │
│ --- ┆ --- ┆ --- │
│ str ┆ date ┆ u32 │
│ Rain ┆ 2022-01-01 ┆ 0 │
│ Sun ┆ 2022-01-02 ┆ 0 │
│ Rain ┆ 2022-01-02 ┆ 1 │
│ Sun ┆ 2022-01-03 ┆ 1 │
│ Rain ┆ 2022-01-03 ┆ 2 │
│ Sun ┆ 2022-01-05 ┆ 1 │
│ Rain ┆ 2022-01-05 ┆ 1 │
│ Sun ┆ 2022-01-08 ┆ 0 │
We subtract one in the agg because we do not want to count the current event, only prior events. (The sort at the end is just to order the rows to match the original data.)
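As a cross-check of the logic, here is a minimal plain-Python sketch of the same count, written without Polars. The helper name count_prior_within is hypothetical, and the window is assumed to be inclusive on both ends, which is what the desired output in the question implies.

```python
from bisect import bisect_left
from collections import defaultdict
from datetime import date, timedelta


def count_prior_within(events, window):
    """For each (name, day) row, count earlier rows with the same name
    whose day lies in [day - window, day]. Rows must be sorted by day."""
    seen = defaultdict(list)  # name -> days seen so far, ascending
    counts = []
    for name, day in events:
        days = seen[name]
        # Position of the first earlier day that is still inside the window.
        first_in_window = bisect_left(days, day - window)
        counts.append(len(days) - first_in_window)
        days.append(day)
    return counts


events = [
    ("Rain", date(2022, 1, 1)), ("Sun", date(2022, 1, 2)),
    ("Rain", date(2022, 1, 2)), ("Sun", date(2022, 1, 3)),
    ("Rain", date(2022, 1, 3)), ("Sun", date(2022, 1, 5)),
    ("Rain", date(2022, 1, 5)), ("Sun", date(2022, 1, 8)),
]
print(count_prior_within(events, timedelta(days=2)))
```

Because each lookup is a binary search over the days already seen for that event type, the sketch stays close to O(n log n) on sorted input.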


Given a data frame with n columns of numbers, how could you calculate the Pearson correlation of all column-pair combinations?

Let's say I have a Polars data frame like this:
=> shape: (19, 5)
│ date ┆ open_AA ┆ open_AADI ┆ open_AADR ┆ open_AAL │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
│ 1674777600000 ┆ 51.39 ┆ 12.84 ┆ 50.0799 ┆ 16.535 │
│ 1674691200000 ┆ 52.43 ┆ 13.14 ┆ 49.84 ┆ 16.54 │
│ 1674604800000 ┆ 51.87 ┆ 12.88 ┆ 49.75 ┆ 15.97 │
│ 1674518400000 ┆ 51.22 ┆ 12.81 ┆ 50.1 ┆ 16.01 │
│ ... ┆ ... ┆ ... ┆ ... ┆ ... │
│ 1672876800000 ┆ 45.3 ┆ 12.7 ┆ 47.185 ┆ 13.5 │
│ 1672790400000 ┆ 44.77 ┆ 12.355 ┆ 47.32 ┆ 12.86 │
│ 1672704000000 ┆ 45.77 ┆ 12.91 ┆ 47.84 ┆ 12.91 │
│ 1672358400000 ┆ 46.01 ┆ 12.57 ┆ 47.29 ┆ 12.55 │
I'm looking to calculate the Pearson correlation between each pair-combination of all columns (except the date one). The result would look something like this:
=> shape: (5, 5)
│ symbol ┆ open_AA ┆ open_AADI ┆ open_AADR ┆ open_AAL │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ utf8 ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
│ open_AA ┆ 1 ┆ 1 ┆ .1 ┆ -.5 │
│ open_AADI ┆ .2 ┆ 1 ┆ .2 ┆ .4 │
│ open_AADR ┆ .4 ┆ .2 ┆ 1 ┆ .3 │
│ open_AAL. ┆ -.45 ┆ -.6 ┆ 50.1 ┆ 1 │
My hunch is that I need to do the following:
Get the cartesian product of columns [1..] as a new data frame.
Using Polars expressions, calculate the pearson_corr of each series pair.
I'm new to Polars and am having trouble with the syntax. Can anyone point me in the right direction?
Say you start with:
df = pl.DataFrame({"date":[5,6,7],"foo": [1, 3, 9], "bar": [4, 1, 3], "ham": [2, 18, 9]})
You want to exclude some columns, so let's put those in a variable:
excl_cols = ["date"]
(
    df.drop(excl_cols)  # Use drop to exclude the date column (or whatever columns you don't want)
    .pearson_corr()  # this is the meat and potatoes of the request, but it's missing your symbol column on the left
    .select([pl.Series(df.drop(excl_cols).columns).alias('symbol'),  # a Series of the column names, to become its own column
             pl.all()])  # then just every other column
)
shape: (3, 4)
│ symbol ┆ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ f64 ┆ f64 │
│ foo ┆ 1.0 ┆ -0.052414 ┆ 0.169695 │
│ bar ┆ -0.052414 ┆ 1.0 ┆ -0.993036 │
│ ham ┆ 0.169695 ┆ -0.993036 ┆ 1.0 │
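To sanity-check these numbers, the asker's original plan (take the cartesian product of the columns and compute Pearson's r for each pair) can be sketched in plain Python. This is only an illustration of the formula, not how Polars computes it; the helper name pearson is hypothetical.

```python
from itertools import product
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


columns = {"foo": [1, 3, 9], "bar": [4, 1, 3], "ham": [2, 18, 9]}

# Every ordered pair of column names, including each column with itself.
corr = {(a, b): pearson(columns[a], columns[b]) for a, b in product(columns, repeat=2)}

for a in columns:
    print(a, [round(corr[a, b], 6) for b in columns])
```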
Use DataFrame.pearson_corr
In [9]: df.drop('date').pearson_corr()
shape: (2, 2)
│ open_AA ┆ open_AADI │
│ --- ┆ --- │
│ f64 ┆ f64 │
│ 1.0 ┆ 1.0 │
│ 1.0 ┆ 1.0 │

Efficient way to rename columns from pivot

Currently pivot is joining the "values" column and value from "columns" column as new column name using underscore. Example from data below, new column name = "monthly_qty" + "_" + "product_a"
>>> data = pl.DataFrame({"month":["Jan", "Jan", "Feb", "Feb", "Mar", "Mar"], "type":["product_a", "product_b"]*3, "monthly_qty":[10,20]*3, "monthly_amt":[5., 8.]*3})
>>> data
shape: (6, 4)
│ month ┆ type ┆ monthly_qty ┆ monthly_amt │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 ┆ f64 │
│ Jan ┆ product_a ┆ 10 ┆ 5.0 │
│ Jan ┆ product_b ┆ 20 ┆ 8.0 │
│ Feb ┆ product_a ┆ 10 ┆ 5.0 │
│ Feb ┆ product_b ┆ 20 ┆ 8.0 │
│ Mar ┆ product_a ┆ 10 ┆ 5.0 │
│ Mar ┆ product_b ┆ 20 ┆ 8.0 │
>>> data = data.pivot(index="month", columns="type", values=["monthly_qty", "monthly_amt"])
>>> data
shape: (3, 5)
│ month ┆ monthly_qty_product_a ┆ monthly_qty_product_b ┆ monthly_amt_product_a ┆ monthly_amt_product_b │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ f64 ┆ f64 │
│ Jan ┆ 10 ┆ 20 ┆ 5.0 ┆ 8.0 │
│ Feb ┆ 10 ┆ 20 ┆ 5.0 ┆ 8.0 │
│ Mar ┆ 10 ┆ 20 ┆ 5.0 ┆ 8.0 │
I wish to rename the columns as below, but not sure what is the most efficient way.
old column = "monthly_qty_product_a"
new_column = "product_a:monthly_qty"
This is what I can think of now, provided that the number of underscore is fixed.
>>> new_cols = {col: col if col == "month" else f"{'_'.join(col.split('_')[2:])}:{'_'.join(col.split('_')[0:2])}" for col in data.columns}
>>> data.rename(new_cols)
shape: (3, 5)
│ month ┆ product_a:monthly_qty ┆ product_b:monthly_qty ┆ product_a:monthly_amt ┆ product_b:monthly_amt │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ f64 ┆ f64 │
│ Jan ┆ 10 ┆ 20 ┆ 5.0 ┆ 8.0 │
│ Feb ┆ 10 ┆ 20 ┆ 5.0 ┆ 8.0 │
│ Mar ┆ 10 ┆ 20 ┆ 5.0 ┆ 8.0 │
This will not work if value column has more than one underscore, e.g. "monthly_growth_pct"
Is there a better way of doing this? Any advice is much appreciated
There is no way in DataFrame.pivot to control this naming.
I would suggest to modify your long format dataframe (6 x 4) a bit by renaming the column monthly_qty to monthly_qty<CHAR>, where <CHAR> is a character you are quite sure is not present, for example !:
data = data.rename({"monthly_qty":"monthly_qty!"})
Proceed with the pivot, and then split on ! in your renaming logic.
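The renaming step can be sketched as follows. This assumes every value column was given the marker before the pivot (not only monthly_qty), and that the pivoted names follow the pattern described in the question, i.e. value name, underscore, type value. The helper name swap_pivot_name and the sample column list are hypothetical.

```python
MARKER = "!"  # assumed separator; pick any character absent from your column names


def swap_pivot_name(col, marker=MARKER):
    """Turn 'monthly_qty!_product_a' into 'product_a:monthly_qty'."""
    if marker not in col:
        return col  # index columns such as 'month' are left untouched
    value_name, type_name = col.split(marker, 1)
    # Drop the single underscore that pivot inserted after the value name.
    if type_name.startswith("_"):
        type_name = type_name[1:]
    return f"{type_name}:{value_name}"


pivoted_cols = ["month", "monthly_qty!_product_a", "monthly_growth_pct!_product_b"]
new_cols = {col: swap_pivot_name(col) for col in pivoted_cols}
print(new_cols)
```

The resulting dict can be passed to data.rename(new_cols) as in the question, and it no longer depends on how many underscores the value column contains.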

Computing and retrieving operations at the group level without collapsing data frame in polars?

I am trying to compute a stat (or more) at the group level without having to create a second data frame. The current way I do it is by relying on the generation of a second data frame with the desired aggregation that I then merge back to the original one.
A silly example:
import polars as pl
df = pl.DataFrame({'name': ['Steve', 'Larry', 'Tom', 'Steve', 'Tom', 'Steve'],
                   'points': range(6)})
shape: (6, 2)
│ name ┆ points │
│ --- ┆ --- │
│ str ┆ i64 │
│ Steve ┆ 0 │
│ Larry ┆ 1 │
│ Tom ┆ 2 │
│ Steve ┆ 3 │
│ Tom ┆ 4 │
│ Steve ┆ 5 │
We created a simple data frame above in which some groups have more entries than others. In a second step, we compute an additional data frame to keep track of the size of each group.
entries= df.groupby('name').agg(pl.count().alias('entries'))
shape: (3, 2)
│ name ┆ entries │
│ --- ┆ --- │
│ str ┆ u32 │
│ Steve ┆ 3 │
│ Tom ┆ 2 │
│ Larry ┆ 1 │
Now we bring back this information to the original data frame in a third step.
print(df.join(entries, left_on='name', right_on='name', how='left'))
shape: (6, 3)
│ name ┆ points ┆ entries │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ u32 │
│ Steve ┆ 0 ┆ 3 │
│ Larry ┆ 1 ┆ 1 │
│ Tom ┆ 2 ┆ 2 │
│ Steve ┆ 3 ┆ 3 │
│ Tom ┆ 4 ┆ 2 │
│ Steve ┆ 5 ┆ 3 │
Is there a way to avoid this triangulation? I have the feeling that using over might be a solution but I can't figure it out yet.
Well ... I managed. Posting the question helped me organize my thoughts and indeed, over was the solution.
shape: (6, 3)
│ name ┆ points ┆ entries │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ u32 │
│ Steve ┆ 0 ┆ 3 │
│ Larry ┆ 1 ┆ 1 │
│ Tom ┆ 2 ┆ 2 │
│ Steve ┆ 3 ┆ 3 │
│ Tom ┆ 4 ┆ 2 │
│ Steve ┆ 5 ┆ 3 │

polars equivalent to pandas groupby shift()

Is there an equivalent of pandas df.groupby().shift in Polars? For reference, see the pandas question "Use pandas.shift() within a group".
You can use the over expression to accomplish this in Polars. Using the example from the link...
import polars as pl
df = pl.DataFrame({
    'object': [1, 1, 1, 2, 2],
    'period': [1, 2, 4, 4, 23],
    'value': [24, 67, 89, 5, 23],
})
df.with_columns([
    pl.col('value').shift().over('object').alias('prev_value')
])
shape: (5, 4)
│ object ┆ period ┆ value ┆ prev_value │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 │
│ 1 ┆ 1 ┆ 24 ┆ null │
│ 1 ┆ 2 ┆ 67 ┆ 24 │
│ 1 ┆ 4 ┆ 89 ┆ 67 │
│ 2 ┆ 4 ┆ 5 ┆ null │
│ 2 ┆ 23 ┆ 23 ┆ 5 │
To perform this on more than one column, you can specify the columns in the pl.col expression, and then use a prefix/suffix to name the new columns. For example:
df.with_columns(pl.col(['period', 'value']).shift().over('object').prefix("prev_"))
shape: (5, 5)
│ object ┆ period ┆ value ┆ prev_period ┆ prev_value │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
│ 1 ┆ 1 ┆ 24 ┆ null ┆ null │
│ 1 ┆ 2 ┆ 67 ┆ 1 ┆ 24 │
│ 1 ┆ 4 ┆ 89 ┆ 2 ┆ 67 │
│ 2 ┆ 4 ┆ 5 ┆ null ┆ null │
│ 2 ┆ 23 ┆ 23 ┆ 4 ┆ 5 │
Using multiple values with over
Let's use this data.
df = pl.DataFrame(
    {
        "id": [1] * 5 + [2] * 5,
        "date": ["2020-01-01", "2020-01-01", "2020-02-01", "2020-02-01", "2020-02-01"] * 2,
        "value1": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        "value2": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
    }
).with_columns([pl.col("date").str.strptime(pl.Date, "%Y-%m-%d")])
shape: (10, 4)
│ id ┆ date ┆ value1 ┆ value2 │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ date ┆ i64 ┆ i64 │
│ 1 ┆ 2020-01-01 ┆ 1 ┆ 10 │
│ 1 ┆ 2020-01-01 ┆ 2 ┆ 20 │
│ 1 ┆ 2020-02-01 ┆ 3 ┆ 30 │
│ 1 ┆ 2020-02-01 ┆ 4 ┆ 40 │
│ 1 ┆ 2020-02-01 ┆ 5 ┆ 50 │
│ 2 ┆ 2020-01-01 ┆ 6 ┆ 60 │
│ 2 ┆ 2020-01-01 ┆ 7 ┆ 70 │
│ 2 ┆ 2020-02-01 ┆ 8 ┆ 80 │
│ 2 ┆ 2020-02-01 ┆ 9 ┆ 90 │
│ 2 ┆ 2020-02-01 ┆ 10 ┆ 100 │
We can place a list of our grouping variables in the over expression (as well as a list in our pl.col expression). Polars will run them all in parallel.
df.with_columns([
    pl.col(["value1", "value2"]).shift().over(['id','date']).prefix("prev_"),
    pl.col(["value1", "value2"]).diff().over(['id','date']).suffix("_diff"),
])
shape: (10, 8)
│ id ┆ date ┆ value1 ┆ value2 ┆ prev_value1 ┆ prev_value2 ┆ value1_diff ┆ value2_diff │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ date ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
│ 1 ┆ 2020-01-01 ┆ 1 ┆ 10 ┆ null ┆ null ┆ null ┆ null │
│ 1 ┆ 2020-01-01 ┆ 2 ┆ 20 ┆ 1 ┆ 10 ┆ 1 ┆ 10 │
│ 1 ┆ 2020-02-01 ┆ 3 ┆ 30 ┆ null ┆ null ┆ null ┆ null │
│ 1 ┆ 2020-02-01 ┆ 4 ┆ 40 ┆ 3 ┆ 30 ┆ 1 ┆ 10 │
│ 1 ┆ 2020-02-01 ┆ 5 ┆ 50 ┆ 4 ┆ 40 ┆ 1 ┆ 10 │
│ 2 ┆ 2020-01-01 ┆ 6 ┆ 60 ┆ null ┆ null ┆ null ┆ null │
│ 2 ┆ 2020-01-01 ┆ 7 ┆ 70 ┆ 6 ┆ 60 ┆ 1 ┆ 10 │
│ 2 ┆ 2020-02-01 ┆ 8 ┆ 80 ┆ null ┆ null ┆ null ┆ null │
│ 2 ┆ 2020-02-01 ┆ 9 ┆ 90 ┆ 8 ┆ 80 ┆ 1 ┆ 10 │
│ 2 ┆ 2020-02-01 ┆ 10 ┆ 100 ┆ 9 ┆ 90 ┆ 1 ┆ 10 │

convert a pandas loc operation that needed the index to assign values to polars

In this example I have three columns: 'DayOfWeek', 'Time', and 'Risk'.
I want to group by 'DayOfWeek', take only the first element of each group, and assign a high risk to it. This means the first known hour in each day of the week is the one that has the highest risk. The rest are initialized to 'Low' risk.
In pandas I had an additional column for the index, but in Polars I do not. I could create one artificially, but is it even necessary?
Can I do this somehow smarter with Polars?
df['risk'] = "Low"
df = df.sort('Time')
df.loc[df.groupby("DayOfWeek").head(1).index, "risk"] = "High"
The index is unique in this case and goes to range(n)
Here is my solution btw. (I don't really like it)
df = df.with_column(pl.arange(0, df.shape[0]).alias('pseudo_index'))
# find lowest time for day
indexes_df = df.sort('Time').groupby('DayOfWeek').head(1)
# Set 'High' as col for all rows from groupby
indexes_df = indexes_df.select('pseudo_index').with_column(pl.lit('High').alias('risk'))
# Left join will generate null values for all values that are not in indexes_df 'pseudo_index'
df = df.join(indexes_df, how='left', on='pseudo_index').select([
    pl.all().exclude(['pseudo_index', 'risk']), pl.col('risk').fill_null(pl.lit('Low'))
])
You can use window functions to find where the first "index" of the "DayOfWeek" group equals the "index" column.
For that we only need to set an "index" column. We can do that easily with:
A method: df.with_row_count(<name>)
An expression: pl.arange(0, pl.count()).alias(<name>)
After that we can use this predicate:
pl.first("index").over("DayOfWeek") == pl.col("index")
Finally we use a when -> then -> otherwise expression to use that condition and create our new "Risk" column.
Let's start with some data. In the snippet below I create an hourly date range and then determine the weekdays from that.
Preparing data
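from datetime import datetime
df = pl.DataFrame({
    "Time": pl.date_range(datetime(2022, 6, 1), datetime(2022, 6, 30), "1h").sample(frac=1.5, with_replacement=True).sort(),
}).select([
    pl.arange(0, pl.count()).alias("index"), pl.all(), pl.col("Time").dt.weekday().alias("DayOfWeek"),
])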
shape: (1045, 3)
│ index ┆ Time ┆ DayOfWeek │
│ --- ┆ --- ┆ --- │
│ i64 ┆ datetime[ns] ┆ u32 │
│ 0 ┆ 2022-06-29 22:00:00 ┆ 3 │
│ 1 ┆ 2022-06-14 11:00:00 ┆ 2 │
│ 2 ┆ 2022-06-11 21:00:00 ┆ 6 │
│ 3 ┆ 2022-06-27 20:00:00 ┆ 1 │
│ ... ┆ ... ┆ ... │
│ 1041 ┆ 2022-06-11 09:00:00 ┆ 6 │
│ 1042 ┆ 2022-06-18 22:00:00 ┆ 6 │
│ 1043 ┆ 2022-06-18 01:00:00 ┆ 6 │
│ 1044 ┆ 2022-06-23 18:00:00 ┆ 4 │
Computing Risk values
pl.first("index").over("DayOfWeek") == pl.col("index")
shape: (1045, 3)
│ Time ┆ DayOfWeek ┆ Risk │
│ --- ┆ --- ┆ --- │
│ datetime[ns] ┆ u32 ┆ str │
│ 2022-06-29 22:00:00 ┆ 3 ┆ High │
│ 2022-06-14 11:00:00 ┆ 2 ┆ High │
│ 2022-06-11 21:00:00 ┆ 6 ┆ High │
│ 2022-06-27 20:00:00 ┆ 1 ┆ High │
│ ... ┆ ... ┆ ... │
│ 2022-06-11 09:00:00 ┆ 6 ┆ Low │
│ 2022-06-18 22:00:00 ┆ 6 ┆ Low │
│ 2022-06-18 01:00:00 ┆ 6 ┆ Low │
│ 2022-06-23 18:00:00 ┆ 4 ┆ Low │