Polars solution to normalise groups by per-group reference value

I'm trying to use Polars to normalise the values of groups of entries by a single reference value per group.
In the example data below, I'm trying to generate the column normalised, which contains each value divided by the value of the ref row in the same group, i.e.:
group_id reference_state value normalised
1 ref 5 1.0
1 a 3 0.6
1 b 1 0.2
2 ref 4 1.0
2 a 8 2.0
2 b 2 0.5
This is straightforward in Pandas:
for (i, x) in df.groupby("group_id"):
    ref_val = x.loc[x["reference_state"] == "ref"]["value"]
    df.loc[df["group_id"] == i, "normalised"] = x["value"] / ref_val.to_list()[0]
Is there a way to do this in Polars?
Thanks in advance!

You can use a window function to make an expression operate on different groups via:
.over("group_id")
and then you can write the logic which divides by the values if equal to "ref" with:
pl.col("value") / pl.col("value").filter(pl.col("reference_state") == "ref").first()
Putting it all together:
import polars as pl
df = pl.DataFrame({
    "group_id": [1, 1, 1, 2, 2, 2],
    "reference_state": ["ref", "a", "b", "ref", "a", "b"],
    "value": [5, 3, 1, 4, 8, 2],
})
df.with_columns([(
    pl.col("value") /
    pl.col("value").filter(pl.col("reference_state") == "ref").first()
).over("group_id").alias("normalised")])
shape: (6, 4)
│ group_id ┆ reference_state ┆ value ┆ normalised │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ i64 ┆ f64 │
│ 1 ┆ ref ┆ 5 ┆ 1.0 │
│ 1 ┆ a ┆ 3 ┆ 0.6 │
│ 1 ┆ b ┆ 1 ┆ 0.2 │
│ 2 ┆ ref ┆ 4 ┆ 1.0 │
│ 2 ┆ a ┆ 8 ┆ 2.0 │
│ 2 ┆ b ┆ 2 ┆ 0.5 │

Here's one way to do it:
create a temporary dataframe which, for each group_id, tells you the value where reference_state is 'ref'
join with that temporary dataframe
df.join(
    df.filter(pl.col("reference_state") == "ref").select(["group_id", "value"]),
    on="group_id",
).with_columns((pl.col("value") / pl.col("value_right")).alias("normalised")).drop("value_right")
This gives you:
shape: (6, 4)
│ group_id ┆ reference_state ┆ value ┆ normalised │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ i64 ┆ f64 │
│ 1 ┆ ref ┆ 5 ┆ 1.0 │
│ 1 ┆ a ┆ 3 ┆ 0.6 │
│ 1 ┆ b ┆ 1 ┆ 0.2 │
│ 2 ┆ ref ┆ 4 ┆ 1.0 │
│ 2 ┆ a ┆ 8 ┆ 2.0 │
│ 2 ┆ b ┆ 2 ┆ 0.5 │


Match the behavior of prefix_sep argument to pandas.get_dummies, in polars

I have a variable driver_age and some levels 16_to_25, 25_to_34, etc.
I would like the dummy encoded columns to have names like driver_age#16_to_25.
I have the following workaround, but it is incompatible with LazyFrames.
prefix_sep = "#"
for col in features_categorical:
ddf = df.get_column(col).to_dummies()
new_names = [f"{col}{prefix_sep}{x[len(col)+1:]}" for x in ddf.columns]
mapper = dict(zip(ddf.columns, new_names))
ddf = ddf.rename(mapper)
df = df.drop(col).hstack(ddf)
Is there a more efficient way to do this? Would it be reasonable to request this as a feature?
One (somewhat) easier way to accomplish this is to add the # as a suffix to your Categorical columns, and then target the #_ with a simple list comprehension.
Let's start with this data.
import polars as pl
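df = pl.DataFrame([
    pl.Series(name='driver_age', values=['16_to_25', '25_to_34', '35_to_45', '45_to_55']).cast(pl.Categorical),
    pl.Series(name='marital_status', values=['S', 'M'] * 2).cast(pl.Categorical),
    pl.Series(name='col1', values=[1, 2, 3, 4]),
    pl.Series(name='col2', values=[10, 20, 30, 40]),
])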
shape: (4, 4)
│ driver_age ┆ marital_status ┆ col1 ┆ col2 │
│ --- ┆ --- ┆ --- ┆ --- │
│ cat ┆ cat ┆ i64 ┆ i64 │
│ 16_to_25 ┆ S ┆ 1 ┆ 10 │
│ 25_to_34 ┆ M ┆ 2 ┆ 20 │
│ 35_to_45 ┆ S ┆ 3 ┆ 30 │
│ 45_to_55 ┆ M ┆ 4 ┆ 40 │
We use the suffix Expression to add a # to the end of the column names that are Categorical and create our dummy variables.
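df = (
    df.select([pl.exclude(pl.Categorical), pl.col(pl.Categorical).suffix('#')])
    .pipe(pl.get_dummies, columns=[s.name + '#' for s in df.select(pl.col(pl.Categorical))])
)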
shape: (4, 8)
│ col1 ┆ col2 ┆ driver_age#_16_to_25 ┆ driver_age#_25_to_34 ┆ driver_age#_35_to_45 ┆ driver_age#_45_to_55 ┆ marital_status#_M ┆ marital_status#_S │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ u8 ┆ u8 ┆ u8 ┆ u8 ┆ u8 ┆ u8 │
│ 1 ┆ 10 ┆ 1 ┆ 0 ┆ 0 ┆ 0 ┆ 0 ┆ 1 │
│ 2 ┆ 20 ┆ 0 ┆ 1 ┆ 0 ┆ 0 ┆ 1 ┆ 0 │
│ 3 ┆ 30 ┆ 0 ┆ 0 ┆ 1 ┆ 0 ┆ 0 ┆ 1 │
│ 4 ┆ 40 ┆ 0 ┆ 0 ┆ 0 ┆ 1 ┆ 1 ┆ 0 │
From here, it's a one-liner to change the column names:
df.columns = [col_nm.replace('#_', '#') for col_nm in df.columns]
shape: (4, 8)
│ col1 ┆ col2 ┆ driver_age#16_to_25 ┆ driver_age#25_to_34 ┆ driver_age#35_to_45 ┆ driver_age#45_to_55 ┆ marital_status#M ┆ marital_status#S │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ u8 ┆ u8 ┆ u8 ┆ u8 ┆ u8 ┆ u8 │
│ 1 ┆ 10 ┆ 1 ┆ 0 ┆ 0 ┆ 0 ┆ 0 ┆ 1 │
│ 2 ┆ 20 ┆ 0 ┆ 1 ┆ 0 ┆ 0 ┆ 1 ┆ 0 │
│ 3 ┆ 30 ┆ 0 ┆ 0 ┆ 1 ┆ 0 ┆ 0 ┆ 1 │
│ 4 ┆ 40 ┆ 0 ┆ 0 ┆ 0 ┆ 1 ┆ 1 ┆ 0 │
This does not run in lazy mode, but then again, get_dummies is not available in lazy mode either.

Polars unable to compute over() when an input was a list

I have a dataframe like this:
import polars as pl
orig_df = pl.DataFrame({
    "primary_key": [1, 1, 2, 2, 3, 3],
    "simple_foreign_keys": [1, 1, 2, 4, 4, 4],
    "fancy_foreign_keys": [[1], [1], [2], [1], [3, 4], [3, 4]],
})
I have some logic that computes when the secondary column changes, then additional logic to rank those changes per primary column. It works fine on the simple data type:
foreign_key_changed = pl.col('simple_foreign_keys') != pl.col('simple_foreign_keys').shift_and_fill(1, 0)
df = orig_df.sort('primary_key').with_column(foreign_key_changed.cumsum().alias('raw_changes'))
shape: (6, 4)
│ primary_key ┆ simple_foreign_keys ┆ fancy_foreign_keys ┆ raw_changes │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ list[i64] ┆ u32 │
│ 1 ┆ 1 ┆ [1] ┆ 1 │
│ 1 ┆ 1 ┆ [1] ┆ 1 │
│ 2 ┆ 2 ┆ [2] ┆ 2 │
│ 2 ┆ 4 ┆ [1] ┆ 3 │
│ 3 ┆ 4 ┆ [3, 4] ┆ 3 │
│ 3 ┆ 4 ┆ [3, 4] ┆ 3 │
df = orig_df.sort('primary_key').with_column(foreign_key_changed.cumsum().rank('dense').over('primary_key').alias('ranked_changes'))
shape: (6, 4)
│ primary_key ┆ simple_foreign_keys ┆ fancy_foreign_keys ┆ ranked_changes │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ list[i64] ┆ u32 │
│ 1 ┆ 1 ┆ [1] ┆ 1 │
│ 1 ┆ 1 ┆ [1] ┆ 1 │
│ 2 ┆ 2 ┆ [2] ┆ 1 │
│ 2 ┆ 4 ┆ [1] ┆ 2 │
│ 3 ┆ 4 ┆ [3, 4] ┆ 1 │
│ 3 ┆ 4 ┆ [3, 4] ┆ 1 │
But if I try the exact same logic on the List column, it blows up on me. Note that the intermediate columns (the cumsum, the rank) are still plain integers:
foreign_key_changed = pl.col('fancy_foreign_keys') != pl.col('fancy_foreign_keys').shift_and_fill(1, [])
df = orig_df.sort('primary_key').with_column(foreign_key_changed.cumsum().alias('raw_changes'))
shape: (6, 4)
│ primary_key ┆ simple_foreign_keys ┆ fancy_foreign_keys ┆ raw_changes │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ list[i64] ┆ u32 │
│ 1 ┆ 1 ┆ [1] ┆ 1 │
│ 1 ┆ 1 ┆ [1] ┆ 1 │
│ 2 ┆ 2 ┆ [2] ┆ 2 │
│ 2 ┆ 4 ┆ [1] ┆ 3 │
│ 3 ┆ 4 ┆ [3, 4] ┆ 4 │
│ 3 ┆ 4 ┆ [3, 4] ┆ 4 │
df = orig_df.sort('primary_key').with_column(foreign_key_changed.cumsum().rank('dense').over('primary_key').alias('ranked_changes'))
thread '<unnamed>' panicked at 'implementation error, cannot get ref List(Null) from Int64', /Users/runner/work/polars/polars/polars/polars-core/src/series/mod.rs:945:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
File "polars_example.py", line 18, in <module>
df = orig_df.sort('primary_key').with_column(foreign_key_changed.cumsum().rank('dense').over('primary_key').alias('ranked_changes'))
File "venv/lib/python3.9/site-packages/polars/internals/frame.py", line 4144, in with_column
return self.with_columns([column])
File "venv/lib/python3.9/site-packages/polars/internals/frame.py", line 5308, in with_columns
File "venv/lib/python3.9/site-packages/polars/internals/lazy_frame.py", line 652, in collect
return self._dataframe_class._from_pydf(ldf.collect())
pyo3_runtime.PanicException: implementation error, cannot get ref List(Null) from Int64
Am I doing this wrong? Is there some massaging that has to happen with the types?
This is with polars 0.13.62.

Apply to a list of columns in Polars

In the following dataframe I would like to multiply var_3 and var_4 by -1. I can do so with the method below, but I am wondering whether it can be done by collecting the column names in a list (imagine that the dataframe may have many more than 4 columns):
df = pl.DataFrame({"var_1": ["a", "a", "b"],
"var_2": ["c", "d", "e"],
"var_3": [1, 2, 3],
"var_4": [4, 5, 6]})
df.with_columns([pl.col("var_3") * -1,
pl.col("var_4") * -1])
This returns the desired dataframe.
My try at it goes like this, though it is not applying the multiplication:
var_list = ["var_3", "var_4"]
pl_cols_var_list = [pl.col(k) for k in var_list]
df.with_columns(pl_cols_var_list * -1)
You were close. You can provide your list of variable names (as strings) directly to the polars.col expression:
var_list = ["var_3", "var_4"]
df.with_columns(pl.col(var_list) * -1)
shape: (3, 4)
│ var_1 ┆ var_2 ┆ var_3 ┆ var_4 │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 ┆ i64 │
│ a ┆ c ┆ -1 ┆ -4 │
│ a ┆ d ┆ -2 ┆ -5 │
│ b ┆ e ┆ -3 ┆ -6 │
Another tip, if you have lots of columns and want to exclude only a few, you can use the polars.exclude expression:
var_list = ["var_1", "var_2"]
df.with_columns(pl.exclude(var_list) * -1)
shape: (3, 4)
│ var_1 ┆ var_2 ┆ var_3 ┆ var_4 │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 ┆ i64 │
│ a ┆ c ┆ -1 ┆ -4 │
│ a ┆ d ┆ -2 ┆ -5 │
│ b ┆ e ┆ -3 ┆ -6 │

polars outer join default null value

Can I specify the default NULL value for outer joins? Like 0?
The join method does not currently have an option for setting a default value for nulls. However, there is an easy way to accomplish this.
Let's say we have this data:
import polars as pl
df1 = pl.DataFrame({"key": ["a", "b", "d"], "var1": [1, 1, 1]})
df2 = pl.DataFrame({"key": ["a", "b", "c"], "var2": [2, 2, 2]})
df1.join(df2, on="key", how="outer")
shape: (4, 3)
│ key ┆ var1 ┆ var2 │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
│ a ┆ 1 ┆ 2 │
│ b ┆ 1 ┆ 2 │
│ c ┆ null ┆ 2 │
│ d ┆ 1 ┆ null │
To replace the nulls with another value, add fill_null:
df1.join(df2, on="key", how="outer").with_columns(pl.all().fill_null(0))
shape: (4, 3)
│ key ┆ var1 ┆ var2 │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
│ a ┆ 1 ┆ 2 │
│ b ┆ 1 ┆ 2 │
│ c ┆ 0 ┆ 2 │
│ d ┆ 1 ┆ 0 │

How to create fields dynamically

Is there a way to create fields dynamically? I know there are several ways, but I would like to know the best approach in Polars. For example, I want to add 12 shifted columns to an existing dataframe (lag1, lag2, lag3, ..., lagN). How can I achieve this?
You can just use the python language for that. Polars expressions are lazily evaluated, so you can create them anywhere, in a for loop, a function, list comprehension, you name it.
Below are two examples of dynamically created lag columns: one calls a function, assigns the result to a variable and then uses that variable; the other uses a list comprehension.
import polars as pl

# some initial dataframe
df = pl.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [5, 4, 3, 2, 1]
})

# a function that returns a lazily evaluated expression
def lag(name: str, n: int) -> pl.Expr:
    return pl.col(name).shift(n).suffix(f"_lag_{n}")

# a lazily evaluated expression assigned to a variable
lag_foo = lag("a", 1)

out = df.select([
    lag_foo,
] + [lag("b", i) for i in range(5)])  # create exprs with a list comprehension
This outputs:
shape: (5, 6)
│ a_lag_1 ┆ b_lag_0 ┆ b_lag_1 ┆ b_lag_2 ┆ b_lag_3 ┆ b_lag_4 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
│ null ┆ 5 ┆ null ┆ null ┆ null ┆ null │
│ 1 ┆ 4 ┆ 5 ┆ null ┆ null ┆ null │
│ 2 ┆ 3 ┆ 4 ┆ 5 ┆ null ┆ null │
│ 3 ┆ 2 ┆ 3 ┆ 4 ┆ 5 ┆ null │
│ 4 ┆ 1 ┆ 2 ┆ 3 ┆ 4 ┆ 5 │