Advice refactoring polars expr - python-polars

I have a polars expr, and I cannot use a context 'cause my function has to return a polars expr.
I've implemented a RSI indicator in polars:
rsi_indicator = (100*pl.when(pl.col("close").pct_change() >= 0) \
.then(pl.col("close").pct_change()) \
.otherwise(0.0) \
.rolling_mean(window_size=window) \
/ (pl.when(pl.col("close").pct_change() >= 0) \
.then(pl.col("close").pct_change()) \
.otherwise(0.0).rolling_mean(window_size=window) + \
pl.when(pl.col("close").pct_change() < 0) \
.then(pl.col("close").pct_change()) \
.otherwise(0.0).abs().rolling_mean(window_size=window))).alias(f"rsi_{window}")
I would refactor this code isolating some quantities in order to easily maintaining code and readability.
For example I'd like to define a variable
U = pl.when(pl.col("close").pct_change() >= 0) \
.then(pl.col("close").pct_change()) \
.otherwise(0.0) \
.rolling_mean(window_size=window)
and its fiend V for negative returns in order to return simply 100*U/(U+V) but seems it doesn't work. Any advise?

First, let's use some real financial data, so that our results look reasonable. I'll also change the variable names so that your code will work, as-is.
import polars as pl
from yfinance import Ticker
ticker_data = Ticker("AAPL").history(period="3mo")
df = pl.from_pandas(ticker_data)
df = df.rename({col_nm: col_nm.lower() for col_nm in df.columns})
df.tail(10)
shape: (10, 7)
┌────────────┬────────────┬────────────┬────────────┬──────────┬───────────┬──────────────┐
│ open ┆ high ┆ low ┆ close ┆ volume ┆ dividends ┆ stock splits │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 ┆ f64 ┆ i64 ┆ f64 ┆ i64 │
╞════════════╪════════════╪════════════╪════════════╪══════════╪═══════════╪══════════════╡
│ 136.820007 ┆ 138.589996 ┆ 135.630005 ┆ 138.270004 ┆ 72433800 ┆ 0.0 ┆ 0 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 139.899994 ┆ 141.910004 ┆ 139.770004 ┆ 141.660004 ┆ 89116800 ┆ 0.0 ┆ 0 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 142.699997 ┆ 143.490005 ┆ 140.970001 ┆ 141.660004 ┆ 70207900 ┆ 0.0 ┆ 0 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 142.130005 ┆ 143.419998 ┆ 137.320007 ┆ 137.440002 ┆ 67083400 ┆ 0.0 ┆ 0 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 137.460007 ┆ 140.669998 ┆ 136.669998 ┆ 139.229996 ┆ 66242400 ┆ 0.0 ┆ 0 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 137.25 ┆ 138.369995 ┆ 133.770004 ┆ 136.720001 ┆ 98964500 ┆ 0.0 ┆ 0 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 136.039993 ┆ 139.039993 ┆ 135.660004 ┆ 138.929993 ┆ 71007500 ┆ 0.0 ┆ 0 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 137.770004 ┆ 141.610001 ┆ 136.929993 ┆ 141.559998 ┆ 73353800 ┆ 0.0 ┆ 0 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 141.350006 ┆ 144.119995 ┆ 141.080002 ┆ 142.919998 ┆ 73972200 ┆ 0.0 ┆ 0 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 141.354996 ┆ 144.335007 ┆ 143.300003 ┆ 144.304993 ┆ 5503179 ┆ 0.0 ┆ 0 │
└────────────┴────────────┴────────────┴────────────┴──────────┴───────────┴──────────────┘
Next, I'll reformat your code, and construct U and V, per your question.
window = 10
rsi_indicator = (
100
* pl.when(pl.col("close").pct_change() >= 0)
.then(pl.col("close").pct_change())
.otherwise(0.0)
.rolling_mean(window_size=window)
/ (
pl.when(pl.col("close").pct_change() >= 0)
.then(pl.col("close").pct_change())
.otherwise(0.0)
.rolling_mean(window_size=window)
+ pl.when(pl.col("close").pct_change() < 0)
.then(pl.col("close").pct_change())
.otherwise(0.0)
.abs()
.rolling_mean(window_size=window)
)
).alias(f"rsi_{window}")
U = (
pl.when(pl.col("close").pct_change() >= 0)
.then(pl.col("close").pct_change())
.otherwise(0.0)
.rolling_mean(window_size=window)
)
V = (
pl.when(pl.col("close").pct_change() < 0)
.then(pl.col("close").pct_change())
.otherwise(0.0)
.abs()
.rolling_mean(window_size=window)
)
We can then express rsi_indicator in terms of U and V as follows:
df.select([
pl.col('close'),
rsi_indicator,
100 * (U / (U + V)).alias('rsi_UV')
]).tail(10)
shape: (10, 3)
┌────────────┬───────────┬───────────┐
│ close ┆ rsi_10 ┆ rsi_UV │
│ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 │
╞════════════╪═══════════╪═══════════╡
│ 138.270004 ┆ 37.209581 ┆ 37.209581 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 141.660004 ┆ 49.321609 ┆ 49.321609 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 141.660004 ┆ 58.898881 ┆ 58.898881 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 137.440002 ┆ 61.526297 ┆ 61.526297 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 139.229996 ┆ 62.768001 ┆ 62.768001 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 136.720001 ┆ 53.110534 ┆ 53.110534 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 138.929993 ┆ 69.836919 ┆ 69.836919 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 141.559998 ┆ 71.086117 ┆ 71.086117 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 142.919998 ┆ 66.779857 ┆ 66.779857 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 144.304993 ┆ 70.359592 ┆ 70.359592 │
└────────────┴───────────┴───────────┘

Related

Given a data frame with n columns of numbers, how could you calculate the Pearson correlation of all column-pair combinations?

Let's say I have a Polars data frame like this:
=> shape: (19, 5)
┌───────────────┬─────────┬───────────┬───────────┬──────────┐
│ date ┆ open_AA ┆ open_AADI ┆ open_AADR ┆ open_AAL │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞═══════════════╪═════════╪═══════════╪═══════════╪══════════╡
│ 1674777600000 ┆ 51.39 ┆ 12.84 ┆ 50.0799 ┆ 16.535 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1674691200000 ┆ 52.43 ┆ 13.14 ┆ 49.84 ┆ 16.54 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1674604800000 ┆ 51.87 ┆ 12.88 ┆ 49.75 ┆ 15.97 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1674518400000 ┆ 51.22 ┆ 12.81 ┆ 50.1 ┆ 16.01 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ... ┆ ... ┆ ... ┆ ... │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1672876800000 ┆ 45.3 ┆ 12.7 ┆ 47.185 ┆ 13.5 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1672790400000 ┆ 44.77 ┆ 12.355 ┆ 47.32 ┆ 12.86 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1672704000000 ┆ 45.77 ┆ 12.91 ┆ 47.84 ┆ 12.91 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1672358400000 ┆ 46.01 ┆ 12.57 ┆ 47.29 ┆ 12.55 │
└───────────────┴─────────┴───────────┴───────────┴──────────┘
I'm looking to calculate the Pearson correlation between each pair-combination of all columns (except the date one). The result would look something like this:
=> shape: (5, 5)
┌───────────────┬─────────┬───────────┬───────────┬──────────┐
│ symbol ┆ open_AA ┆ open_AADI ┆ open_AADR ┆ open_AAL │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ utf8 ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞═══════════════╪═════════╪═══════════╪═══════════╪══════════╡
│ open_AA ┆ 1 ┆ 1 ┆ .1 ┆ -.5 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ open_AADI ┆ .2 ┆ 1 ┆ .2 ┆ .4 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ open_AADR ┆ .4 ┆ .2 ┆ 1 ┆ .3 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ open_AAL. ┆ -.45 ┆ -.6 ┆ 50.1 ┆ 1 │
└───────────────┴─────────┴───────────┴───────────┴──────────┘
My hunch is that I need to do the following:
Get the cartesian product of columns [1..] as a new data frame.
Using Polars expressions, calculate the pearson_corr of each of each series pair.
I'm new to Polars and am having trouble with the syntax. Can anyone point me in the right direction?
Say you start with:
df = pl.DataFrame({"date":[5,6,7],"foo": [1, 3, 9], "bar": [4, 1, 3], "ham": [2, 18, 9]})
You want to exclude some cols, so let's put those in a variable
excl_cols=['date']
Then...
(
df.drop(excl_cols) # Use drop to exclude the date column (or whatever columns you don't want)
.pearson_corr() # this is the meat and potatos of the request but it's missing your symbol column on left
.select(
[
pl.Series(df.drop(excl_cols).columns).alias('symbol'), # This just creates a Series out of the column names to become its own column
pl.all() #then just every other column
])
)
shape: (3, 4)
┌────────┬───────────┬───────────┬───────────┐
│ symbol ┆ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ f64 ┆ f64 │
╞════════╪═══════════╪═══════════╪═══════════╡
│ foo ┆ 1.0 ┆ -0.052414 ┆ 0.169695 │
│ bar ┆ -0.052414 ┆ 1.0 ┆ -0.993036 │
│ ham ┆ 0.169695 ┆ -0.993036 ┆ 1.0 │
└────────┴───────────┴───────────┴───────────┘
Use DataFrame.pearson_corr
In [9]: df.drop('date').pearson_corr()
Out[9]:
shape: (2, 2)
┌─────────┬───────────┐
│ open_AA ┆ open_AADI │
│ --- ┆ --- │
│ f64 ┆ f64 │
╞═════════╪═══════════╡
│ 1.0 ┆ 1.0 │
│ 1.0 ┆ 1.0 │
└─────────┴───────────┘

python-polars is there a np.where equivalent?

Polars is there a np.where equivalent? trying to replicate the following code in polars.
If the value is above a certain threshold column called Is_Acceptable is 1 or if it is below it is 0
import pandas as pd
import numpy as np
df = pd.DataFrame({"fruit":["orange","apple","mango","kiwi"], "value":[1,0.8,0.7,1.2]})
df["Is_Acceptable?"] = np.where(df["value"].lt(0.9), 1, 0)
print(df)
Yes, there is pl.when().then().otherwise() expression
import polars as pl
from polars import col
df = pl.DataFrame({
"fruit": ["orange","apple","mango","kiwi"],
"value": [1, 0.8, 0.7, 1.2]
})
df = df.with_column(
pl.when(col('value') < 0.9).then(1).otherwise(0).alias('Is_Acceptable?')
)
print(df)
┌────────┬───────┬────────────────┐
│ fruit ┆ value ┆ Is_Acceptable? │
│ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ i64 │
╞════════╪═══════╪════════════════╡
│ orange ┆ 1.0 ┆ 0 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ apple ┆ 0.8 ┆ 1 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ mango ┆ 0.7 ┆ 1 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ kiwi ┆ 1.2 ┆ 0 │
└────────┴───────┴────────────────┘
The when/then/otherwise expression is a good general-purpose answer. However, in this case, one shortcut is to simply create a boolean expression.
(
df
.with_column(
(pl.col('value') < 0.9).alias('Is_Acceptable')
)
)
shape: (4, 3)
┌────────┬───────┬───────────────┐
│ fruit ┆ value ┆ Is_Acceptable │
│ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ bool │
╞════════╪═══════╪═══════════════╡
│ orange ┆ 1.0 ┆ false │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ apple ┆ 0.8 ┆ true │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ mango ┆ 0.7 ┆ true │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ kiwi ┆ 1.2 ┆ false │
└────────┴───────┴───────────────┘
In numeric computations, False will be upcast to 0, and True will be upcast to 1. Or, if you prefer, you can upcast them explicitly to a different type.
(
df
.with_column(
(pl.col('value') < 0.9).cast(pl.Int64).alias('Is_Acceptable')
)
)
shape: (4, 3)
┌────────┬───────┬───────────────┐
│ fruit ┆ value ┆ Is_Acceptable │
│ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ i64 │
╞════════╪═══════╪═══════════════╡
│ orange ┆ 1.0 ┆ 0 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ apple ┆ 0.8 ┆ 1 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ mango ┆ 0.7 ┆ 1 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ kiwi ┆ 1.2 ┆ 0 │
└────────┴───────┴───────────────┘

Match the behavior of prefix_sep argument to pandas.get_dummies, in polars

I have a variable driver_age and some levels 16_to_25, 25_to_34, etc.
I would like the dummy encoded columns to have names like driver_age#16_to_25.
I have the following workaround, but it is incompatible with LazyFrames.
prefix_sep = "#"
for col in features_categorical:
ddf = df.get_column(col).to_dummies()
new_names = [f"{col}{prefix_sep}{x[len(col)+1:]}" for x in ddf.columns]
mapper = dict(zip(ddf.columns, new_names))
ddf = ddf.rename(mapper)
df = df.drop(col).hstack(ddf)
Is there a more efficient way to do this? Would it be reasonable to request this as a feature?
One (somewhat) easier way to accomplish this is to add the # as a suffix to your Categorical columns, and then target the #_ with a simple list comprehension.
Let's start with this data.
import polars as pl
df = (
pl.DataFrame([
pl.Series(
name='driver_age',
values=['16_to_25', '25_to_34', '35_to_45', '45_to_55'],
dtype=pl.Categorical),
pl.Series(
name='marital_status',
values=['S', 'M'] * 2,
dtype=pl.Categorical
),
pl.Series(
name='col1',
values=[1, 2, 3, 4],
),
pl.Series(
name='col2',
values=[10, 20, 30, 40],
),
])
)
df
shape: (4, 4)
┌────────────┬────────────────┬──────┬──────┐
│ driver_age ┆ marital_status ┆ col1 ┆ col2 │
│ --- ┆ --- ┆ --- ┆ --- │
│ cat ┆ cat ┆ i64 ┆ i64 │
╞════════════╪════════════════╪══════╪══════╡
│ 16_to_25 ┆ S ┆ 1 ┆ 10 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 25_to_34 ┆ M ┆ 2 ┆ 20 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 35_to_45 ┆ S ┆ 3 ┆ 30 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 45_to_55 ┆ M ┆ 4 ┆ 40 │
└────────────┴────────────────┴──────┴──────┘
We use the suffix Expression to add a # to the end of the column names that are Categorical and create our dummy variables.
df = (
pl.get_dummies(
df
.select([
pl.exclude(pl.Categorical),
pl.col(pl.Categorical).suffix('#'),
]),
columns=[s.name + '#' for s in df.select(pl.col(pl.Categorical))]
)
)
df
shape: (4, 8)
┌──────┬──────┬──────────────────────┬──────────────────────┬──────────────────────┬──────────────────────┬───────────────────┬───────────────────┐
│ col1 ┆ col2 ┆ driver_age#_16_to_25 ┆ driver_age#_25_to_34 ┆ driver_age#_35_to_45 ┆ driver_age#_45_to_55 ┆ marital_status#_M ┆ marital_status#_S │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ u8 ┆ u8 ┆ u8 ┆ u8 ┆ u8 ┆ u8 │
╞══════╪══════╪══════════════════════╪══════════════════════╪══════════════════════╪══════════════════════╪═══════════════════╪═══════════════════╡
│ 1 ┆ 10 ┆ 1 ┆ 0 ┆ 0 ┆ 0 ┆ 0 ┆ 1 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 20 ┆ 0 ┆ 1 ┆ 0 ┆ 0 ┆ 1 ┆ 0 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 30 ┆ 0 ┆ 0 ┆ 1 ┆ 0 ┆ 0 ┆ 1 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ 40 ┆ 0 ┆ 0 ┆ 0 ┆ 1 ┆ 1 ┆ 0 │
└──────┴──────┴──────────────────────┴──────────────────────┴──────────────────────┴──────────────────────┴───────────────────┴───────────────────┘
From here, it's a one-liner to change the column names:
df.columns = [col_nm.replace('#_', '#') for col_nm in df.columns]
df
shape: (4, 8)
┌──────┬──────┬─────────────────────┬─────────────────────┬─────────────────────┬─────────────────────┬──────────────────┬──────────────────┐
│ col1 ┆ col2 ┆ driver_age#16_to_25 ┆ driver_age#25_to_34 ┆ driver_age#35_to_45 ┆ driver_age#45_to_55 ┆ marital_status#M ┆ marital_status#S │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ u8 ┆ u8 ┆ u8 ┆ u8 ┆ u8 ┆ u8 │
╞══════╪══════╪═════════════════════╪═════════════════════╪═════════════════════╪═════════════════════╪══════════════════╪══════════════════╡
│ 1 ┆ 10 ┆ 1 ┆ 0 ┆ 0 ┆ 0 ┆ 0 ┆ 1 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 20 ┆ 0 ┆ 1 ┆ 0 ┆ 0 ┆ 1 ┆ 0 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 30 ┆ 0 ┆ 0 ┆ 1 ┆ 0 ┆ 0 ┆ 1 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ 40 ┆ 0 ┆ 0 ┆ 0 ┆ 1 ┆ 1 ┆ 0 │
└──────┴──────┴─────────────────────┴─────────────────────┴─────────────────────┴─────────────────────┴──────────────────┴──────────────────┘
It's not done in Lazy mode, but then again, the get_dummies is also not available in Lazy mode.

Polars is throwing an error when I convert from eger to lazy execution

This code works and returns the expected result.
import polars as pl
df = pl.DataFrame({
'A':[1,2,3,3,2,1],
'B':[1,1,1,2,2,2]
})
(df
#.lazy()
.groupby('B')
.apply(lambda x: x
.with_columns(
[pl.col("A").shift(i).alias(f"A_lag_{i}") for i in range(3)]
)
)
.with_columns(
[pl.col(f'A_lag_{i}') / pl.col('A') for i in range(3)]
)
#.collect()
)
However, if you comment out the .lazy() and .collect() you get a NotFoundError: f'A_lag_0
I've tried a few versions of this code, but I can't entirely understand if I'm doing something wrong, or whether this is a bug in Polars.
This doesn't address the error that you are receiving, but the more idiomatic way to express this in Polars is to use the over expression. For example:
(
df
.lazy()
.with_columns([
pl.col("A").shift(i).over('B').alias(f"A_lag_{i}")
for i in range(3)])
.with_columns([
(pl.col(f"A_lag_{i}") / pl.col("A")).suffix('_result')
for i in range(3)])
.collect()
)
shape: (6, 8)
┌─────┬─────┬─────────┬─────────┬─────────┬────────────────┬────────────────┬────────────────┐
│ A ┆ B ┆ A_lag_0 ┆ A_lag_1 ┆ A_lag_2 ┆ A_lag_0_result ┆ A_lag_1_result ┆ A_lag_2_result │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ f64 ┆ f64 ┆ f64 │
╞═════╪═════╪═════════╪═════════╪═════════╪════════════════╪════════════════╪════════════════╡
│ 1 ┆ 1 ┆ 1 ┆ null ┆ null ┆ 1.0 ┆ null ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 1 ┆ 2 ┆ 1 ┆ null ┆ 1.0 ┆ 0.5 ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 1 ┆ 3 ┆ 2 ┆ 1 ┆ 1.0 ┆ 0.666667 ┆ 0.333333 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 2 ┆ 3 ┆ null ┆ null ┆ 1.0 ┆ null ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2 ┆ 2 ┆ 3 ┆ null ┆ 1.0 ┆ 1.5 ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 2 ┆ 1 ┆ 2 ┆ 3 ┆ 1.0 ┆ 2.0 ┆ 3.0 │
└─────┴─────┴─────────┴─────────┴─────────┴────────────────┴────────────────┴────────────────┘

Excel equivalent average if on moving window

I'm learning polars (as substitute of pandas) and I would reply some excel functions.
In particular average if over a rolling windows.
Let us suppose we have a column with positive and negative value, how can I create a new column with rolling average only if all the value in the column are positive?
import polars as pl
df = pl.DataFrame(
{
"Date": ["12/04/98", "19/04/98", "26/04/98", "03/05/98", "10/05/98", "17/05/98", "24/05/98", "31/05/98", "07/06/98"],
"Close": [15.46 ,15.09 ,16.13 ,15.13 ,14.47 ,14.78 ,15.20 ,15.07 ,12.59]
}
)
df = df.with_columns([(
pl.col("Close").pct_change().alias("Close Returns")
)])
This creates a data frame with the column "Close Returns" and the new column will be it's average on a fixed windows only if the are all positive.
And if I want to create a new column as result of quotient positive average over negative?
As example for a window of two elements, in the image below there is the first which is null and do nothing. First widows contains a positive and a negative so returns zero (I need 2 positive value) while last window contains two negative and the mean can be computed.
Here my solution but I'm not satisfied:
import polars as pl
dataset = pl.DataFrame(
{
"Date": ["12/04/98", "19/04/98", "26/04/98", "03/05/98", "10/05/98", "17/05/98", "24/05/98", "31/05/98", "07/06/98"],
"Close": [15.46 ,15.09 ,16.13 ,15.13 ,14.47 ,14.78 ,15.20 ,15.07 ,12.59]
}
)
q = dataset.lazy().with_column(pl.col("Date").str.strptime(pl.Date, fmt="%d/%m/%y"))
df = q.collect()
df = df.with_columns([(
pl.col("Close").pct_change().alias("Close Returns")
)])
lag_vector = [2, 6, 7, 10, 12, 13]
for lag in lag_vector:
out = df.groupby_rolling(
index_column="Date", period=f"{lag}w"
).agg([
pl.col("Close Returns").filter(pl.col("Close Returns") >= 0).mean().alias("positive mean"),
pl.col("Close Returns").filter(pl.col("Close Returns") < 0).mean().alias("negative mean"),
])
out["negative mean"] = out["negative mean"].fill_null("zero")
out["positive mean"] = out["positive mean"].fill_null("zero")
out = out.with_columns([
(pl.col("positive mean") / (pl.col("positive mean") - pl.col("negative mean"))).alias(f"{lag} lag mean"),
])
df = df.join(out.select(["Date", f"{lag} lag mean"]), left_on="Date", right_on="Date")
Edit: I've tweaked my answer to use the any expression so that the non-negative windowed mean is calculated if any (rather than all) of the values in the window is non-negative. Likewise, for the negative windowed mean.
lag_vector = [1, 2, 3]
for lag in lag_vector:
out = (
df
.groupby_rolling(index_column="Date", period=f"{lag}w").agg(
[
pl.col('Close Returns').alias('Close Returns list'),
pl.when((pl.col("Close Returns") >= 0).any())
.then(pl.col('Close Returns').filter(pl.col("Close Returns") >= 0).mean())
.otherwise(0)
.alias("positive mean"),
pl.when((pl.col("Close Returns") < 0).any())
.then(pl.col('Close Returns').filter(pl.col("Close Returns") < 0).mean())
.otherwise(0)
.alias("negative mean"),
]
)
)
print(out)
Window size 1 week:
shape: (9, 4)
┌────────────┬────────────────────┬───────────────┬───────────────┐
│ Date ┆ Close Returns list ┆ positive mean ┆ negative mean │
│ --- ┆ --- ┆ --- ┆ --- │
│ date ┆ list [f64] ┆ f64 ┆ f64 │
╞════════════╪════════════════════╪═══════════════╪═══════════════╡
│ 1998-04-12 ┆ [null] ┆ 0.0 ┆ 0.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1998-04-19 ┆ [-0.023933] ┆ 0.0 ┆ -0.023933 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1998-04-26 ┆ [0.0689] ┆ 0.0689 ┆ 0.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1998-05-03 ┆ [-0.061996] ┆ 0.0 ┆ -0.061996 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1998-05-10 ┆ [-0.043622] ┆ 0.0 ┆ -0.043622 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1998-05-17 ┆ [0.021424] ┆ 0.021424 ┆ 0.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1998-05-24 ┆ [0.028417] ┆ 0.028417 ┆ 0.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1998-05-31 ┆ [-0.008553] ┆ 0.0 ┆ -0.008553 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1998-06-07 ┆ [-0.164565] ┆ 0.0 ┆ -0.164565 │
└────────────┴────────────────────┴───────────────┴───────────────┘
Window size 2 weeks:
shape: (9, 4)
┌────────────┬────────────────────────┬───────────────┬───────────────┐
│ Date ┆ Close Returns list ┆ positive mean ┆ negative mean │
│ --- ┆ --- ┆ --- ┆ --- │
│ date ┆ list [f64] ┆ f64 ┆ f64 │
╞════════════╪════════════════════════╪═══════════════╪═══════════════╡
│ 1998-04-12 ┆ [null] ┆ 0.0 ┆ 0.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1998-04-19 ┆ [null, -0.023933] ┆ 0.0 ┆ -0.023933 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1998-04-26 ┆ [-0.023933, 0.0689] ┆ 0.0689 ┆ -0.023933 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1998-05-03 ┆ [0.0689, -0.061996] ┆ 0.0689 ┆ -0.061996 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1998-05-10 ┆ [-0.061996, -0.043622] ┆ 0.0 ┆ -0.052809 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1998-05-17 ┆ [-0.043622, 0.021424] ┆ 0.021424 ┆ -0.043622 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1998-05-24 ┆ [0.021424, 0.028417] ┆ 0.0249 ┆ 0.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1998-05-31 ┆ [0.028417, -0.008553] ┆ 0.028417 ┆ -0.008553 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1998-06-07 ┆ [-0.008553, -0.164565] ┆ 0.0 ┆ -0.086559 │
└────────────┴────────────────────────┴───────────────┴───────────────┘
Window size 3 weeks:
shape: (9, 4)
┌────────────┬──────────────────────────────────┬───────────────┬───────────────┐
│ Date ┆ Close Returns list ┆ positive mean ┆ negative mean │
│ --- ┆ --- ┆ --- ┆ --- │
│ date ┆ list [f64] ┆ f64 ┆ f64 │
╞════════════╪══════════════════════════════════╪═══════════════╪═══════════════╡
│ 1998-04-12 ┆ [null] ┆ 0.0 ┆ 0.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1998-04-19 ┆ [null, -0.023933] ┆ 0.0 ┆ -0.023933 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1998-04-26 ┆ [null, -0.023933, 0.0689] ┆ 0.0689 ┆ -0.023933 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1998-05-03 ┆ [-0.023933, 0.0689, -0.061996] ┆ 0.0689 ┆ -0.042965 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1998-05-10 ┆ [0.0689, -0.061996, -0.043622] ┆ 0.0689 ┆ -0.052809 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1998-05-17 ┆ [-0.061996, -0.043622, 0.021424] ┆ 0.021424 ┆ -0.052809 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1998-05-24 ┆ [-0.043622, 0.021424, 0.028417] ┆ 0.0249 ┆ -0.043622 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1998-05-31 ┆ [0.021424, 0.028417, -0.008553] ┆ 0.0249 ┆ -0.008553 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1998-06-07 ┆ [0.028417, -0.008553, -0.164565] ┆ 0.028417 ┆ -0.086559 │
└────────────┴──────────────────────────────────┴───────────────┴───────────────┘
Is this closer to what you are looking for?
You can use groupby_rolling and then in the aggregation filter out values that are negative.
In the example below, we parse the dates and then groupby a window of 10 days ("10d"), finally we aggregate by our conditions.
df = pl.DataFrame(
{
"Date": ["12/04/98", "19/04/98", "26/04/98", "03/05/98", "10/05/98", "17/05/98", "24/05/98",],
"Close": [15.46 ,15.09 ,16.13 ,15.13 ,14.47 ,14.78 ,15.20]
}
)
(df.with_column(pl.col("Date").str.strptime(pl.Date, fmt="%d/%m/%y"))
.groupby_rolling(index_column="Date", period="10d")
.agg([
pl.col("Close").filter(pl.col("Close") > 0).mean().alias("mean")
])
)
shape: (7, 2)
┌────────────┬────────┐
│ Date ┆ mean │
│ --- ┆ --- │
│ date ┆ f64 │
╞════════════╪════════╡
│ 1998-04-12 ┆ 15.46 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1998-04-19 ┆ 15.275 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1998-04-26 ┆ 15.61 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1998-05-03 ┆ 15.63 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1998-05-10 ┆ 14.8 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1998-05-17 ┆ 14.625 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1998-05-24 ┆ 14.99 │
└────────────┴────────┘