Python Polars: Unique values in each row

I have a Polars dataframe with 150 columns of currency codes. I can identify them with a regex expression: df.select(pl.col('^.*cur$')). I am trying to determine the unique set of currency codes in each row; nulls should be ignored.
df = pl.DataFrame(
    {
        "col1cur": ["EUR", "EUR", "EUR"],
        "col2cur": [None, "EUR", None],
        "col3cur": ["EUR", None, None],
        "col4cur": ["EUR", "GBP", None],
        "target": [["EUR"], ["EUR", "GBP"], ["EUR"]]
    }
)
In pandas, I would do the following. Can anyone help with how to approach this in Polars?
pandas_df = pd.DataFrame(
    {
        "col1cur": ["EUR", "EUR", "EUR"],
        "col2cur": [None, "EUR", None],
        "col3cur": ["EUR", None, None],
        "col4cur": ["EUR", "GBP", None],
    },
    dtype="string",
)
pandas_df["target"] = pandas_df.apply(
    lambda x: pd.Series(x.dropna().unique()).to_list(), axis=1
)

Since polars isn't really good at row operations, I'd start off with a melt.
df.drop('target').with_row_count('i').join(
    df.drop('target').with_row_count('i')
    .melt('i')
    .filter(~pl.col('value').is_null())
    .groupby('i')
    .agg(pl.col('value').unique()),
    on='i'
).sort('i').drop('i')
We use with_row_count to create an index that preserves the identity of the original rows, then melt, filter out the nulls, group by what was previously each row, aggregate to the unique values, and finally join back to the original columns on the row index.
shape: (3, 5)
┌─────────┬─────────┬─────────┬─────────┬────────────────┐
│ col1cur ┆ col2cur ┆ col3cur ┆ col4cur ┆ value │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str ┆ list[str] │
╞═════════╪═════════╪═════════╪═════════╪════════════════╡
│ EUR ┆ null ┆ EUR ┆ EUR ┆ ["EUR"] │
│ EUR ┆ EUR ┆ null ┆ GBP ┆ ["GBP", "EUR"] │
│ EUR ┆ null ┆ null ┆ null ┆ ["EUR"] │
└─────────┴─────────┴─────────┴─────────┴────────────────┘
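If you want the aggregated column to be named target like in the question, a rename can be tacked onto the end. A minimal sketch of the same query (only the final column name changes; the values are as above):
(
    df.drop('target').with_row_count('i').join(
        df.drop('target').with_row_count('i')
        .melt('i')
        .filter(~pl.col('value').is_null())
        .groupby('i')
        .agg(pl.col('value').unique()),
        on='i'
    )
    .sort('i')
    .drop('i')
    .rename({'value': 'target'})  # rename the aggregated list column to match the question
)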

Here's the closest I could get:
In [39]: df.with_columns(pl.concat_list(pl.col('*')).arr.unique().alias('target'))
Out[39]:
shape: (3, 5)
┌─────────┬─────────┬─────────┬─────────┬──────────────────────┐
│ col1cur ┆ col2cur ┆ col3cur ┆ col4cur ┆ target │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str ┆ list[str] │
╞═════════╪═════════╪═════════╪═════════╪══════════════════════╡
│ EUR ┆ null ┆ EUR ┆ EUR ┆ [null, "EUR"] │
│ EUR ┆ EUR ┆ null ┆ GBP ┆ ["EUR", null, "GBP"] │
│ EUR ┆ null ┆ null ┆ null ┆ [null, "EUR"] │
└─────────┴─────────┴─────────┴─────────┴──────────────────────┘
I'll update the answer if/when I find a way to exclude nulls.
A slower solution, but one which excludes nulls:
In [44]: df.with_columns(pl.concat_list(pl.col('*')).apply(lambda x: list(set(i for i in x if i is not None))).alias('target'))
Out[44]:
shape: (3, 5)
┌─────────┬─────────┬─────────┬─────────┬────────────────┐
│ col1cur ┆ col2cur ┆ col3cur ┆ col4cur ┆ target │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str ┆ list[str] │
╞═════════╪═════════╪═════════╪═════════╪════════════════╡
│ EUR ┆ null ┆ EUR ┆ EUR ┆ ["EUR"] │
│ EUR ┆ EUR ┆ null ┆ GBP ┆ ["EUR", "GBP"] │
│ EUR ┆ null ┆ null ┆ null ┆ ["EUR"] │
└─────────┴─────────┴─────────┴─────────┴────────────────┘

I would propose concat_list with arr.eval.
df.drop('target').with_columns(
    pl.concat_list('*')
    .arr.eval(pl.element().unique().drop_nulls(), parallel=True)
    .alias('target')
)
This is similar to what @jqurious proposed in a comment.
Here is the result:
shape: (3, 5)
┌─────────┬─────────┬─────────┬─────────┬────────────────┐
│ col1cur ┆ col2cur ┆ col3cur ┆ col4cur ┆ target │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str ┆ list[str] │
╞═════════╪═════════╪═════════╪═════════╪════════════════╡
│ EUR ┆ null ┆ EUR ┆ EUR ┆ ["EUR"] │
│ EUR ┆ EUR ┆ null ┆ GBP ┆ ["EUR", "GBP"] │
│ EUR ┆ null ┆ null ┆ null ┆ ["EUR"] │
└─────────┴─────────┴─────────┴─────────┴────────────────┘
Edit: added parallel=True to arr.eval to run the evaluation in parallel (suggestion of @jqurious).
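Since the question selects the 150 currency columns with a regex, the same expression can presumably be scoped to just those columns instead of '*'. A sketch, assuming every currency column name ends in cur, so any other columns in the real dataframe are passed through untouched:
df.drop('target').with_columns(
    pl.concat_list(pl.col('^.*cur$'))                              # regex column selection from the question
    .arr.eval(pl.element().unique().drop_nulls(), parallel=True)   # same unique + drop_nulls evaluation as above
    .alias('target')
)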

Given a data frame with n columns of numbers, how could you calculate the Pearson correlation of all column-pair combinations?

Let's say I have a Polars data frame like this:
=> shape: (19, 5)
┌───────────────┬─────────┬───────────┬───────────┬──────────┐
│ date ┆ open_AA ┆ open_AADI ┆ open_AADR ┆ open_AAL │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞═══════════════╪═════════╪═══════════╪═══════════╪══════════╡
│ 1674777600000 ┆ 51.39 ┆ 12.84 ┆ 50.0799 ┆ 16.535 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1674691200000 ┆ 52.43 ┆ 13.14 ┆ 49.84 ┆ 16.54 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1674604800000 ┆ 51.87 ┆ 12.88 ┆ 49.75 ┆ 15.97 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1674518400000 ┆ 51.22 ┆ 12.81 ┆ 50.1 ┆ 16.01 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ... ┆ ... ┆ ... ┆ ... │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1672876800000 ┆ 45.3 ┆ 12.7 ┆ 47.185 ┆ 13.5 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1672790400000 ┆ 44.77 ┆ 12.355 ┆ 47.32 ┆ 12.86 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1672704000000 ┆ 45.77 ┆ 12.91 ┆ 47.84 ┆ 12.91 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1672358400000 ┆ 46.01 ┆ 12.57 ┆ 47.29 ┆ 12.55 │
└───────────────┴─────────┴───────────┴───────────┴──────────┘
I'm looking to calculate the Pearson correlation between each pair-combination of all columns (except the date one). The result would look something like this:
=> shape: (5, 5)
┌───────────────┬─────────┬───────────┬───────────┬──────────┐
│ symbol ┆ open_AA ┆ open_AADI ┆ open_AADR ┆ open_AAL │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ utf8 ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞═══════════════╪═════════╪═══════════╪═══════════╪══════════╡
│ open_AA ┆ 1 ┆ 1 ┆ .1 ┆ -.5 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ open_AADI ┆ .2 ┆ 1 ┆ .2 ┆ .4 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ open_AADR ┆ .4 ┆ .2 ┆ 1 ┆ .3 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ open_AAL. ┆ -.45 ┆ -.6 ┆ 50.1 ┆ 1 │
└───────────────┴─────────┴───────────┴───────────┴──────────┘
My hunch is that I need to do the following:
1. Get the cartesian product of columns [1..] as a new data frame.
2. Using Polars expressions, calculate the pearson_corr of each series pair.
I'm new to Polars and am having trouble with the syntax. Can anyone point me in the right direction?
Say you start with:
df = pl.DataFrame({"date":[5,6,7],"foo": [1, 3, 9], "bar": [4, 1, 3], "ham": [2, 18, 9]})
You want to exclude some cols, so let's put those in a variable:
excl_cols=['date']
Then...
(
    df.drop(excl_cols)   # use drop to exclude the date column (or whatever columns you don't want)
    .pearson_corr()      # this is the meat and potatoes of the request, but it's missing your symbol column on the left
    .select(
        [
            pl.Series(df.drop(excl_cols).columns).alias('symbol'),  # a Series of the column names, to become its own column
            pl.all(),    # then just every other column
        ]
    )
)
shape: (3, 4)
┌────────┬───────────┬───────────┬───────────┐
│ symbol ┆ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ f64 ┆ f64 │
╞════════╪═══════════╪═══════════╪═══════════╡
│ foo ┆ 1.0 ┆ -0.052414 ┆ 0.169695 │
│ bar ┆ -0.052414 ┆ 1.0 ┆ -0.993036 │
│ ham ┆ 0.169695 ┆ -0.993036 ┆ 1.0 │
└────────┴───────────┴───────────┴───────────┘
Use DataFrame.pearson_corr
In [9]: df.drop('date').pearson_corr()
Out[9]:
shape: (2, 2)
┌─────────┬───────────┐
│ open_AA ┆ open_AADI │
│ --- ┆ --- │
│ f64 ┆ f64 │
╞═════════╪═══════════╡
│ 1.0 ┆ 1.0 │
│ 1.0 ┆ 1.0 │
└─────────┴───────────┘
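If you do want to follow the route hinted at in the question (explicit column pairs rather than DataFrame.pearson_corr), a rough sketch using the pl.pearson_corr expression over all pair combinations might look like the following; the pair naming scheme is just an illustration:
from itertools import combinations

import polars as pl

cols = [c for c in df.columns if c != "date"]

# one pl.pearson_corr expression per unordered column pair
pairwise = df.select(
    [
        pl.pearson_corr(pl.col(a), pl.col(b)).alias(f"{a}~{b}")
        for a, b in combinations(cols, 2)
    ]
)
This yields one column per pair; reshaping it into the square matrix shown above is then a melt/pivot exercise.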

python-polars: is there an np.where equivalent?

Is there an np.where equivalent in Polars? I'm trying to replicate the following code in Polars.
A new column called Is_Acceptable? should be 1 if the value is below a certain threshold, and 0 otherwise.
import pandas as pd
import numpy as np
df = pd.DataFrame({"fruit":["orange","apple","mango","kiwi"], "value":[1,0.8,0.7,1.2]})
df["Is_Acceptable?"] = np.where(df["value"].lt(0.9), 1, 0)
print(df)
Yes, there is the pl.when().then().otherwise() expression.
import polars as pl
from polars import col

df = pl.DataFrame({
    "fruit": ["orange", "apple", "mango", "kiwi"],
    "value": [1, 0.8, 0.7, 1.2]
})
df = df.with_column(
    pl.when(col('value') < 0.9).then(1).otherwise(0).alias('Is_Acceptable?')
)
print(df)
┌────────┬───────┬────────────────┐
│ fruit ┆ value ┆ Is_Acceptable? │
│ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ i64 │
╞════════╪═══════╪════════════════╡
│ orange ┆ 1.0 ┆ 0 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ apple ┆ 0.8 ┆ 1 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ mango ┆ 0.7 ┆ 1 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ kiwi ┆ 1.2 ┆ 0 │
└────────┴───────┴────────────────┘
The when/then/otherwise expression is a good general-purpose answer. However, in this case, one shortcut is to simply create a boolean expression.
(
    df
    .with_column(
        (pl.col('value') < 0.9).alias('Is_Acceptable')
    )
)
shape: (4, 3)
┌────────┬───────┬───────────────┐
│ fruit ┆ value ┆ Is_Acceptable │
│ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ bool │
╞════════╪═══════╪═══════════════╡
│ orange ┆ 1.0 ┆ false │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ apple ┆ 0.8 ┆ true │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ mango ┆ 0.7 ┆ true │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ kiwi ┆ 1.2 ┆ false │
└────────┴───────┴───────────────┘
In numeric computations, False will be upcast to 0, and True will be upcast to 1. Or, if you prefer, you can upcast them explicitly to a different type.
(
    df
    .with_column(
        (pl.col('value') < 0.9).cast(pl.Int64).alias('Is_Acceptable')
    )
)
shape: (4, 3)
┌────────┬───────┬───────────────┐
│ fruit ┆ value ┆ Is_Acceptable │
│ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ i64 │
╞════════╪═══════╪═══════════════╡
│ orange ┆ 1.0 ┆ 0 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ apple ┆ 0.8 ┆ 1 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ mango ┆ 0.7 ┆ 1 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ kiwi ┆ 1.2 ┆ 0 │
└────────┴───────┴───────────────┘
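As a small illustration of that implicit upcast, the boolean column can be fed straight into a numeric aggregation. A minimal sketch (n_acceptable is just an illustrative name):
(
    df
    .with_column((pl.col('value') < 0.9).alias('Is_Acceptable'))
    .select(pl.col('Is_Acceptable').sum().alias('n_acceptable'))   # False counts as 0, True as 1
)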

Polars is throwing an error when I convert from eager to lazy execution

This code works and returns the expected result.
import polars as pl

df = pl.DataFrame({
    'A': [1, 2, 3, 3, 2, 1],
    'B': [1, 1, 1, 2, 2, 2]
})
(df
    #.lazy()
    .groupby('B')
    .apply(lambda x: x
        .with_columns(
            [pl.col("A").shift(i).alias(f"A_lag_{i}") for i in range(3)]
        )
    )
    .with_columns(
        [pl.col(f'A_lag_{i}') / pl.col('A') for i in range(3)]
    )
    #.collect()
)
However, if you uncomment the .lazy() and .collect() lines (i.e. run it lazily), you get NotFoundError: f'A_lag_0
I've tried a few versions of this code, but I can't tell whether I'm doing something wrong or whether this is a bug in Polars.
This doesn't address the error that you are receiving, but the more idiomatic way to express this in Polars is to use the over expression. For example:
(
    df
    .lazy()
    .with_columns([
        pl.col("A").shift(i).over('B').alias(f"A_lag_{i}")
        for i in range(3)
    ])
    .with_columns([
        (pl.col(f"A_lag_{i}") / pl.col("A")).suffix('_result')
        for i in range(3)
    ])
    .collect()
)
shape: (6, 8)
┌─────┬─────┬─────────┬─────────┬─────────┬────────────────┬────────────────┬────────────────┐
│ A ┆ B ┆ A_lag_0 ┆ A_lag_1 ┆ A_lag_2 ┆ A_lag_0_result ┆ A_lag_1_result ┆ A_lag_2_result │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ f64 ┆ f64 ┆ f64 │
╞═════╪═════╪═════════╪═════════╪═════════╪════════════════╪════════════════╪════════════════╡
│ 1 ┆ 1 ┆ 1 ┆ null ┆ null ┆ 1.0 ┆ null ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 1 ┆ 2 ┆ 1 ┆ null ┆ 1.0 ┆ 0.5 ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 1 ┆ 3 ┆ 2 ┆ 1 ┆ 1.0 ┆ 0.666667 ┆ 0.333333 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 2 ┆ 3 ┆ null ┆ null ┆ 1.0 ┆ null ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2 ┆ 2 ┆ 3 ┆ null ┆ 1.0 ┆ 1.5 ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 2 ┆ 1 ┆ 2 ┆ 3 ┆ 1.0 ┆ 2.0 ┆ 3.0 │
└─────┴─────┴─────────┴─────────┴─────────┴────────────────┴────────────────┴────────────────┘
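For what it's worth, the same over-based expressions should also run eagerly, so this approach presumably sidesteps the eager/lazy discrepancy entirely. A sketch with lazy()/collect() simply dropped:
(
    df
    .with_columns([
        pl.col("A").shift(i).over("B").alias(f"A_lag_{i}")
        for i in range(3)
    ])
    .with_columns([
        (pl.col(f"A_lag_{i}") / pl.col("A")).suffix("_result")
        for i in range(3)
    ])
)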

Idiomatic replacement of empty string '' with pl.Null (null) in polars

I have a polars DataFrame with a number of Series that look like:
pl.Series(['cow', 'cat', '', 'lobster', ''])
and I'd like them to be
pl.Series(['cow', 'cat', pl.Null, 'lobster', pl.Null])
A simple string replacement won't work since pl.Null is not of type PyString:
pl.Series(['cow', 'cat', '', 'lobster', '']).str.replace('', pl.Null)
What's the idiomatic way of doing this for a Series/DataFrame in polars?
Series
For a single Series, you can use the set method.
import polars as pl
my_series = pl.Series(['cow', 'cat', '', 'lobster', ''])
my_series.set(my_series.str.lengths() == 0, None)
shape: (5,)
Series: '' [str]
[
"cow"
"cat"
null
"lobster"
null
]
DataFrame
For DataFrames, I would suggest using when/then/otherwise. For example, with this data:
df = pl.DataFrame({
    'str1': ['cow', 'dog', "", 'lobster', ''],
    'str2': ['', 'apple', "orange", '', 'kiwi'],
    'str3': ['house', '', "apartment", 'condo', ''],
})
df
shape: (5, 3)
┌─────────┬────────┬───────────┐
│ str1 ┆ str2 ┆ str3 │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞═════════╪════════╪═══════════╡
│ cow ┆ ┆ house │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ dog ┆ apple ┆ │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ ┆ orange ┆ apartment │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ lobster ┆ ┆ condo │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ ┆ kiwi ┆ │
└─────────┴────────┴───────────┘
We can run a replacement on all string columns as follows:
df.with_columns([
    pl.when(pl.col(pl.Utf8).str.lengths() == 0)
    .then(None)
    .otherwise(pl.col(pl.Utf8))
    .keep_name()
])
shape: (5, 3)
┌─────────┬────────┬───────────┐
│ str1 ┆ str2 ┆ str3 │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞═════════╪════════╪═══════════╡
│ cow ┆ null ┆ house │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ dog ┆ apple ┆ null │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ null ┆ orange ┆ apartment │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ lobster ┆ null ┆ condo │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ null ┆ kiwi ┆ null │
└─────────┴────────┴───────────┘
The above should be fairly performant.
If you only want to replace empty strings with null on certain columns, you can provide a list:
only_these = ['str1', 'str2']
df.with_columns([
    pl.when(pl.col(only_these).str.lengths() == 0)
    .then(None)
    .otherwise(pl.col(only_these))
    .keep_name()
])
shape: (5, 3)
┌─────────┬────────┬───────────┐
│ str1 ┆ str2 ┆ str3 │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞═════════╪════════╪═══════════╡
│ cow ┆ null ┆ house │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ dog ┆ apple ┆ │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ null ┆ orange ┆ apartment │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ lobster ┆ null ┆ condo │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ null ┆ kiwi ┆ │
└─────────┴────────┴───────────┘

Excel equivalent of AVERAGEIF on a moving window

I'm learning polars (as a substitute for pandas) and I would like to replicate some Excel functions, in particular AVERAGEIF over a rolling window.
Suppose we have a column with positive and negative values: how can I create a new column with the rolling average, computed only if all the values in the window are positive?
import polars as pl

df = pl.DataFrame(
    {
        "Date": ["12/04/98", "19/04/98", "26/04/98", "03/05/98", "10/05/98", "17/05/98", "24/05/98", "31/05/98", "07/06/98"],
        "Close": [15.46, 15.09, 16.13, 15.13, 14.47, 14.78, 15.20, 15.07, 12.59]
    }
)
df = df.with_columns([
    pl.col("Close").pct_change().alias("Close Returns")
])
This creates a data frame with the column "Close Returns"; the new column should be its average over a fixed window, but only if the values in the window are all positive.
And what if I want to create a new column as the quotient of the positive average over the negative average?
As an example, for a window of two elements: the first window is null, so nothing is done. The next window contains one positive and one negative value, so it returns zero (I need 2 positive values), while the last window contains two negative values, so the mean can be computed.
Here is my solution, but I'm not satisfied with it:
import polars as pl

dataset = pl.DataFrame(
    {
        "Date": ["12/04/98", "19/04/98", "26/04/98", "03/05/98", "10/05/98", "17/05/98", "24/05/98", "31/05/98", "07/06/98"],
        "Close": [15.46, 15.09, 16.13, 15.13, 14.47, 14.78, 15.20, 15.07, 12.59]
    }
)
q = dataset.lazy().with_column(pl.col("Date").str.strptime(pl.Date, fmt="%d/%m/%y"))
df = q.collect()
df = df.with_columns([
    pl.col("Close").pct_change().alias("Close Returns")
])

lag_vector = [2, 6, 7, 10, 12, 13]
for lag in lag_vector:
    out = df.groupby_rolling(
        index_column="Date", period=f"{lag}w"
    ).agg([
        pl.col("Close Returns").filter(pl.col("Close Returns") >= 0).mean().alias("positive mean"),
        pl.col("Close Returns").filter(pl.col("Close Returns") < 0).mean().alias("negative mean"),
    ])
    out["negative mean"] = out["negative mean"].fill_null("zero")
    out["positive mean"] = out["positive mean"].fill_null("zero")
    out = out.with_columns([
        (pl.col("positive mean") / (pl.col("positive mean") - pl.col("negative mean"))).alias(f"{lag} lag mean"),
    ])
    df = df.join(out.select(["Date", f"{lag} lag mean"]), left_on="Date", right_on="Date")
Edit: I've tweaked my answer to use the any expression so that the non-negative windowed mean is calculated if any (rather than all) of the values in the window is non-negative. Likewise, for the negative windowed mean.
lag_vector = [1, 2, 3]
for lag in lag_vector:
    out = (
        df
        .groupby_rolling(index_column="Date", period=f"{lag}w")
        .agg([
            pl.col('Close Returns').alias('Close Returns list'),
            pl.when((pl.col("Close Returns") >= 0).any())
            .then(pl.col('Close Returns').filter(pl.col("Close Returns") >= 0).mean())
            .otherwise(0)
            .alias("positive mean"),
            pl.when((pl.col("Close Returns") < 0).any())
            .then(pl.col('Close Returns').filter(pl.col("Close Returns") < 0).mean())
            .otherwise(0)
            .alias("negative mean"),
        ])
    )
    print(out)
Window size 1 week:
shape: (9, 4)
┌────────────┬────────────────────┬───────────────┬───────────────┐
│ Date ┆ Close Returns list ┆ positive mean ┆ negative mean │
│ --- ┆ --- ┆ --- ┆ --- │
│ date ┆ list [f64] ┆ f64 ┆ f64 │
╞════════════╪════════════════════╪═══════════════╪═══════════════╡
│ 1998-04-12 ┆ [null] ┆ 0.0 ┆ 0.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1998-04-19 ┆ [-0.023933] ┆ 0.0 ┆ -0.023933 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1998-04-26 ┆ [0.0689] ┆ 0.0689 ┆ 0.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1998-05-03 ┆ [-0.061996] ┆ 0.0 ┆ -0.061996 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1998-05-10 ┆ [-0.043622] ┆ 0.0 ┆ -0.043622 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1998-05-17 ┆ [0.021424] ┆ 0.021424 ┆ 0.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1998-05-24 ┆ [0.028417] ┆ 0.028417 ┆ 0.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1998-05-31 ┆ [-0.008553] ┆ 0.0 ┆ -0.008553 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1998-06-07 ┆ [-0.164565] ┆ 0.0 ┆ -0.164565 │
└────────────┴────────────────────┴───────────────┴───────────────┘
Window size 2 weeks:
shape: (9, 4)
┌────────────┬────────────────────────┬───────────────┬───────────────┐
│ Date ┆ Close Returns list ┆ positive mean ┆ negative mean │
│ --- ┆ --- ┆ --- ┆ --- │
│ date ┆ list [f64] ┆ f64 ┆ f64 │
╞════════════╪════════════════════════╪═══════════════╪═══════════════╡
│ 1998-04-12 ┆ [null] ┆ 0.0 ┆ 0.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1998-04-19 ┆ [null, -0.023933] ┆ 0.0 ┆ -0.023933 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1998-04-26 ┆ [-0.023933, 0.0689] ┆ 0.0689 ┆ -0.023933 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1998-05-03 ┆ [0.0689, -0.061996] ┆ 0.0689 ┆ -0.061996 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1998-05-10 ┆ [-0.061996, -0.043622] ┆ 0.0 ┆ -0.052809 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1998-05-17 ┆ [-0.043622, 0.021424] ┆ 0.021424 ┆ -0.043622 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1998-05-24 ┆ [0.021424, 0.028417] ┆ 0.0249 ┆ 0.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1998-05-31 ┆ [0.028417, -0.008553] ┆ 0.028417 ┆ -0.008553 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1998-06-07 ┆ [-0.008553, -0.164565] ┆ 0.0 ┆ -0.086559 │
└────────────┴────────────────────────┴───────────────┴───────────────┘
Window size 3 weeks:
shape: (9, 4)
┌────────────┬──────────────────────────────────┬───────────────┬───────────────┐
│ Date ┆ Close Returns list ┆ positive mean ┆ negative mean │
│ --- ┆ --- ┆ --- ┆ --- │
│ date ┆ list [f64] ┆ f64 ┆ f64 │
╞════════════╪══════════════════════════════════╪═══════════════╪═══════════════╡
│ 1998-04-12 ┆ [null] ┆ 0.0 ┆ 0.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1998-04-19 ┆ [null, -0.023933] ┆ 0.0 ┆ -0.023933 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1998-04-26 ┆ [null, -0.023933, 0.0689] ┆ 0.0689 ┆ -0.023933 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1998-05-03 ┆ [-0.023933, 0.0689, -0.061996] ┆ 0.0689 ┆ -0.042965 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1998-05-10 ┆ [0.0689, -0.061996, -0.043622] ┆ 0.0689 ┆ -0.052809 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1998-05-17 ┆ [-0.061996, -0.043622, 0.021424] ┆ 0.021424 ┆ -0.052809 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1998-05-24 ┆ [-0.043622, 0.021424, 0.028417] ┆ 0.0249 ┆ -0.043622 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1998-05-31 ┆ [0.021424, 0.028417, -0.008553] ┆ 0.0249 ┆ -0.008553 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1998-06-07 ┆ [0.028417, -0.008553, -0.164565] ┆ 0.028417 ┆ -0.086559 │
└────────────┴──────────────────────────────────┴───────────────┴───────────────┘
Is this closer to what you are looking for?
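If you also want the ratio from your own attempt (positive mean over the spread between the two means) attached back to the original frame, a rough sketch reusing your formula inside the same kind of loop could be:
for lag in lag_vector:
    out = (
        df
        .groupby_rolling(index_column="Date", period=f"{lag}w")
        .agg([
            pl.when((pl.col("Close Returns") >= 0).any())
            .then(pl.col("Close Returns").filter(pl.col("Close Returns") >= 0).mean())
            .otherwise(0)
            .alias("positive mean"),
            pl.when((pl.col("Close Returns") < 0).any())
            .then(pl.col("Close Returns").filter(pl.col("Close Returns") < 0).mean())
            .otherwise(0)
            .alias("negative mean"),
        ])
        # your ratio; note it may produce NaN where both means are zero
        .with_column(
            (pl.col("positive mean") / (pl.col("positive mean") - pl.col("negative mean")))
            .alias(f"{lag} lag mean")
        )
    )
    df = df.join(out.select(["Date", f"{lag} lag mean"]), on="Date")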
You can use groupby_rolling and then in the aggregation filter out values that are negative.
In the example below, we parse the dates and then groupby a window of 10 days ("10d"), finally we aggregate by our conditions.
df = pl.DataFrame(
    {
        "Date": ["12/04/98", "19/04/98", "26/04/98", "03/05/98", "10/05/98", "17/05/98", "24/05/98"],
        "Close": [15.46, 15.09, 16.13, 15.13, 14.47, 14.78, 15.20]
    }
)
(df.with_column(pl.col("Date").str.strptime(pl.Date, fmt="%d/%m/%y"))
    .groupby_rolling(index_column="Date", period="10d")
    .agg([
        pl.col("Close").filter(pl.col("Close") > 0).mean().alias("mean")
    ])
)
shape: (7, 2)
┌────────────┬────────┐
│ Date ┆ mean │
│ --- ┆ --- │
│ date ┆ f64 │
╞════════════╪════════╡
│ 1998-04-12 ┆ 15.46 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1998-04-19 ┆ 15.275 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1998-04-26 ┆ 15.61 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1998-05-03 ┆ 15.63 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1998-05-10 ┆ 14.8 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1998-05-17 ┆ 14.625 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1998-05-24 ┆ 14.99 │
└────────────┴────────┘