Apply to a list of columns in Polars - python-polars

In the following dataframe I would like to multiply var_3 and var_4 by negative 1. I can do so using the following method, but I am wondering if it can be done by collecting the columns in a list (imagining that there may be many more than 4 columns in the dataframe):
df = pl.DataFrame({
    "var_1": ["a", "a", "b"],
    "var_2": ["c", "d", "e"],
    "var_3": [1, 2, 3],
    "var_4": [4, 5, 6],
})
df.with_columns([
    pl.col("var_3") * -1,
    pl.col("var_4") * -1,
])
Which returns the desired dataframe:
var_1  var_2  var_3  var_4
a      c      -1     -4
a      d      -2     -5
b      e      -3     -6
My attempt goes like this, though it does not apply the multiplication:
var_list = ["var_3", "var_4"]
pl_cols_var_list = [pl.col(k) for k in var_list]
df.with_columns(pl_cols_var_list * -1)
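(A side note, not in the original question: the no-op has a plain-Python explanation. Multiplying a list by -1 repeats it -1 times, i.e. zero times, so with_columns receives an empty list of expressions. A quick check:)
pl_cols_var_list = [pl.col(k) for k in ["var_3", "var_4"]]
print(pl_cols_var_list * -1)  # [] -- list * -1 yields an empty list,
                              # so df.with_columns([]) adds nothing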

You were close. You can provide your list of variable names (as strings) directly to the polars.col expression:
var_list = ["var_3", "var_4"]
df.with_columns(pl.col(var_list) * -1)
shape: (3, 4)
┌───────┬───────┬───────┬───────┐
│ var_1 ┆ var_2 ┆ var_3 ┆ var_4 │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 ┆ i64 │
╞═══════╪═══════╪═══════╪═══════╡
│ a ┆ c ┆ -1 ┆ -4 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ a ┆ d ┆ -2 ┆ -5 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b ┆ e ┆ -3 ┆ -6 │
└───────┴───────┴───────┴───────┘
Another tip: if you have lots of columns and want to exclude only a few, you can use the polars.exclude expression:
var_list = ["var_1", "var_2"]
df.with_columns(pl.exclude(var_list) * -1)
shape: (3, 4)
┌───────┬───────┬───────┬───────┐
│ var_1 ┆ var_2 ┆ var_3 ┆ var_4 │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 ┆ i64 │
╞═══════╪═══════╪═══════╪═══════╡
│ a ┆ c ┆ -1 ┆ -4 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ a ┆ d ┆ -2 ┆ -5 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b ┆ e ┆ -3 ┆ -6 │
└───────┴───────┴───────┴───────┘
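A related option, not from the original answer: pl.col also accepts a dtype, so when the target columns share a type you can select them without naming them at all. A minimal sketch against the example dataframe, assuming the string columns should stay untouched:
df.with_columns(pl.col(pl.Int64) * -1)  # negate every Int64 column
This keeps working as more numeric columns are added, with no list to maintain.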

Related

Polars solution to normalise groups by per-group reference value

I'm trying to use Polars to normalise the values of groups of entries by a single reference value per group.
In the example data below, I'm trying to generate the column normalised which contains values divided by the per-group ref reference state value, i.e.:
group_id  reference_state  value  normalised
1         ref              5      1.0
1         a                3      0.6
1         b                1      0.2
2         ref              4      1.0
2         a                8      2.0
2         b                2      0.5
This is straightforward in Pandas:
for (i, x) in df.groupby("group_id"):
    ref_val = x.loc[x["reference_state"] == "ref"]["value"]
    df.loc[df["group_id"] == i, "normalised"] = x["value"] / ref_val.to_list()[0]
Is there a way to do this in Polars?
Thanks in advance!
You can use a window function to make an expression operate on different groups via:
.over("group_id")
and then you can write the logic that divides each value by the group's "ref" value with:
pl.col("value") / pl.col("value").filter(pl.col("reference_state") == "ref").first()
Putting it all together:
df = pl.DataFrame({
    "group_id": [1, 1, 1, 2, 2, 2],
    "reference_state": ["ref", "a", "b", "ref", "a", "b"],
    "value": [5, 3, 1, 4, 8, 2],
})
(df.with_columns([
    (
        pl.col("value") /
        pl.col("value").filter(pl.col("reference_state") == "ref").first()
    ).over("group_id").alias("normalised")
]))
shape: (6, 4)
┌──────────┬─────────────────┬───────┬────────────┐
│ group_id ┆ reference_state ┆ value ┆ normalised │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ i64 ┆ f64 │
╞══════════╪═════════════════╪═══════╪════════════╡
│ 1 ┆ ref ┆ 5 ┆ 1.0 │
│ 1 ┆ a ┆ 3 ┆ 0.6 │
│ 1 ┆ b ┆ 1 ┆ 0.2 │
│ 2 ┆ ref ┆ 4 ┆ 1.0 │
│ 2 ┆ a ┆ 8 ┆ 2.0 │
│ 2 ┆ b ┆ 2 ┆ 0.5 │
└──────────┴─────────────────┴───────┴────────────┘
Here's one way to do it:
- create a temporary dataframe which, for each group_id, gives the value where reference_state is 'ref'
- join with that temporary dataframe
(
    df.join(
        df.filter(pl.col("reference_state") == "ref").select(["group_id", "value"]),
        on="group_id",
    )
    .with_column((pl.col("value") / pl.col("value_right")).alias("normalised"))
    .drop("value_right")
)
This gives you:
shape: (6, 4)
┌──────────┬─────────────────┬───────┬────────────┐
│ group_id ┆ reference_state ┆ value ┆ normalised │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ i64 ┆ f64 │
╞══════════╪═════════════════╪═══════╪════════════╡
│ 1 ┆ ref ┆ 5 ┆ 1.0 │
│ 1 ┆ a ┆ 3 ┆ 0.6 │
│ 1 ┆ b ┆ 1 ┆ 0.2 │
│ 2 ┆ ref ┆ 4 ┆ 1.0 │
│ 2 ┆ a ┆ 8 ┆ 2.0 │
│ 2 ┆ b ┆ 2 ┆ 0.5 │
└──────────┴─────────────────┴───────┴────────────┘

Select columns from LazyFrame by condition

polars.LazyFrame.var returns the variance of each column in a table, as below:
>>> df = pl.DataFrame({"a": [1, 2, 3, 4], "b": [1, 2, 1, 1], "c": [1, 1, 1, 1]}).lazy()
>>> df.collect()
shape: (4, 3)
┌─────┬─────┬─────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 1 ┆ 1 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ 2 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 3 ┆ 1 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 4 ┆ 1 ┆ 1 │
└─────┴─────┴─────┘
>>> df.var().collect()
shape: (1, 3)
┌──────────┬──────┬─────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 │
╞══════════╪══════╪═════╡
│ 1.666667 ┆ 0.25 ┆ 0.0 │
└──────────┴──────┴─────┘
I wish to select only the columns with variance > 0 from the LazyFrame, but couldn't find a solution.
With an eager DataFrame I can iterate over the columns and then filter them by condition, as below:
>>> data.var()
shape: (1, 3)
┌──────────┬──────┬─────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 │
╞══════════╪══════╪═════╡
│ 1.666667 ┆ 0.25 ┆ 0.0 │
└──────────┴──────┴─────┘
>>> cols = pl.select([s for s in data.var() if (s > 0).all()]).columns
>>> cols
['a', 'b']
>>> data.select(cols)
shape: (4, 2)
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 3 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 4 ┆ 1 │
└─────┴─────┘
But it doesn't work in LazyFrame:
>>> data = data.lazy()
>>> data
<polars.internals.lazyframe.frame.LazyFrame object at 0x7f0e3d9966a0>
>>> cols = pl.select([s for s in data.var() if (s > 0).all()]).columns
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 1, in <listcomp>
File "/home/jasmine/miniconda3/envs/jupyternb/lib/python3.9/site-packages/polars/internals/lazyframe/frame.py", line 421, in __getitem__
raise TypeError(
TypeError: 'LazyFrame' object is not subscriptable (aside from slicing). Use 'select()' or 'filter()' instead.
The reason for doing this in LazyFrame is that we want to maximize the performance. Any advice would be much appreciated. Thanks!
Polars doesn't know what the variance is until after it is calculated, but that is the same time at which it displays the results, so there is no way to filter the reported columns that is more performant than just displaying all of them, at least with respect to the Polars calculation. (It could be that python/jupyter takes longer to display more results than fewer.)
With that said you could do something like this:
df.var().melt().filter(pl.col('value')>0).collect()
which gives you what you want in one line but it's a different shape.
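Not in the original answer, but for reference: given the variances computed above, that one-liner should produce something like:
shape: (2, 2)
┌──────────┬──────────┐
│ variable ┆ value    │
│ ---      ┆ ---      │
│ str      ┆ f64      │
╞══════════╪══════════╡
│ a        ┆ 1.666667 │
│ b        ┆ 0.25     │
└──────────┴──────────┘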
You could also do something like this:
dfvar = df.var()
dfvar.select(
    dfvar.melt()
    .filter(pl.col('value') > 0)
    .select('variable')
    .collect()
    .to_series()
    .to_list()
).collect()
Building on the answer from @dean MacGregor, we:
- do the var calculation
- melt
- apply the filter
- extract the variable column with the column names
- pass it as a list to select
df.select(
    df.var()
    .melt()
    .filter(pl.col('value') > 0)
    .collect()["variable"]
    .to_list()
).collect()
shape: (4, 2)
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 3 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 4 ┆ 1 │
└─────┴─────┘
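If this pattern comes up often, it can be wrapped in a small helper. The function name and threshold parameter below are hypothetical; this is just a sketch built on the melt-based approach above:
import polars as pl

def select_by_variance(lf: pl.LazyFrame, threshold: float = 0.0) -> pl.LazyFrame:
    # Materialize only the 1-row variance frame, melt it into
    # (variable, value) rows, and keep the names whose variance
    # exceeds the threshold.
    keep = (
        lf.var()
        .melt()
        .filter(pl.col("value") > threshold)
        .collect()["variable"]
        .to_list()
    )
    # Return a LazyFrame restricted to the surviving columns.
    return lf.select(keep)
With the example data, select_by_variance(df).collect() should reproduce the result above.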

Python-Polars: How to filter categorical column with string list

I have a Polars dataframe like below:
df_cat = pl.DataFrame(
    [
        pl.Series("a_cat", ["c", "a", "b", "c", "b"], dtype=pl.Categorical),
        pl.Series("b_cat", ["F", "G", "E", "G", "G"], dtype=pl.Categorical),
    ]
)
print(df_cat)
shape: (5, 2)
┌───────┬───────┐
│ a_cat ┆ b_cat │
│ --- ┆ --- │
│ cat ┆ cat │
╞═══════╪═══════╡
│ c ┆ F │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ a ┆ G │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b ┆ E │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ c ┆ G │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b ┆ G │
└───────┴───────┘
The following filter runs perfectly fine:
print(df_cat.filter(pl.col('a_cat') == 'c'))
shape: (2, 2)
┌───────┬───────┐
│ a_cat ┆ b_cat │
│ --- ┆ --- │
│ cat ┆ cat │
╞═══════╪═══════╡
│ c ┆ F │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ c ┆ G │
└───────┴───────┘
What I want is to use a list of strings to run the filter more efficiently. I tried, and ended up with the following error message:
print(df_cat.filter(pl.col('a_cat').is_in(['a', 'c'])))
---------------------------------------------------------------------------
ComputeError Traceback (most recent call last)
d:\GitRepo\Test2\stockEMD3.ipynb Cell 9 in <cell line: 1>()
----> 1 print(df_cat.filter(pl.col('a_cat').is_in(['c'])))
File c:\ProgramData\Anaconda3\envs\charm3.9\lib\site-packages\polars\internals\dataframe\frame.py:2185, in DataFrame.filter(self, predicate)
2181 if _NUMPY_AVAILABLE and isinstance(predicate, np.ndarray):
2182 predicate = pli.Series(predicate)
2184 return (
-> 2185 self.lazy()
2186 .filter(predicate) # type: ignore[arg-type]
2187 .collect(no_optimization=True, string_cache=False)
2188 )
File c:\ProgramData\Anaconda3\envs\charm3.9\lib\site-packages\polars\internals\lazyframe\frame.py:660, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, string_cache, no_optimization, slice_pushdown)
650 projection_pushdown = False
652 ldf = self._ldf.optimization_toggle(
653 type_coercion,
654 predicate_pushdown,
(...)
658 slice_pushdown,
659 )
--> 660 return pli.wrap_df(ldf.collect())
ComputeError: joins/or comparisons on categorical dtypes can only happen if they are created under the same global string cache
From this Stack Overflow link I understand that "You need to set a global string cache to compare categoricals created in different columns/lists", but my questions are:
Why does the == filter with one single string work?
What is the proper way to filter a categorical column with a list of strings?
Thanks!
Actually, you don't need to set a global string cache to compare strings to Categorical variables. You can use cast to accomplish this.
Let's use this data. I've included the integer values that underlie the Categorical variables to demonstrate something later.
import polars as pl
df_cat = (
    pl.DataFrame(
        [
            pl.Series("a_cat", ["c", "a", "b", "c", "X"], dtype=pl.Categorical),
            pl.Series("b_cat", ["F", "G", "E", "S", "X"], dtype=pl.Categorical),
        ]
    )
    .with_column(
        pl.all().to_physical().suffix('_phys')
    )
)
df_cat
shape: (5, 4)
┌───────┬───────┬────────────┬────────────┐
│ a_cat ┆ b_cat ┆ a_cat_phys ┆ b_cat_phys │
│ --- ┆ --- ┆ --- ┆ --- │
│ cat ┆ cat ┆ u32 ┆ u32 │
╞═══════╪═══════╪════════════╪════════════╡
│ c ┆ F ┆ 0 ┆ 0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ a ┆ G ┆ 1 ┆ 1 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ b ┆ E ┆ 2 ┆ 2 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ c ┆ S ┆ 0 ┆ 3 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ X ┆ X ┆ 3 ┆ 4 │
└───────┴───────┴────────────┴────────────┘
Comparing a categorical variable to a string
If we cast a Categorical variable back to its string values, we can make any comparison we need. For example:
df_cat.filter(pl.col('a_cat').cast(pl.Utf8).is_in(['a', 'c']))
shape: (3, 4)
┌───────┬───────┬────────────┬────────────┐
│ a_cat ┆ b_cat ┆ a_cat_phys ┆ b_cat_phys │
│ --- ┆ --- ┆ --- ┆ --- │
│ cat ┆ cat ┆ u32 ┆ u32 │
╞═══════╪═══════╪════════════╪════════════╡
│ c ┆ F ┆ 0 ┆ 0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ a ┆ G ┆ 1 ┆ 1 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ c ┆ S ┆ 0 ┆ 3 │
└───────┴───────┴────────────┴────────────┘
Or, in a filter step, we can compare the string values of two Categorical variables that do not share the same string cache:
df_cat.filter(pl.col('a_cat').cast(pl.Utf8) == pl.col('b_cat').cast(pl.Utf8))
shape: (1, 4)
┌───────┬───────┬────────────┬────────────┐
│ a_cat ┆ b_cat ┆ a_cat_phys ┆ b_cat_phys │
│ --- ┆ --- ┆ --- ┆ --- │
│ cat ┆ cat ┆ u32 ┆ u32 │
╞═══════╪═══════╪════════════╪════════════╡
│ X ┆ X ┆ 3 ┆ 4 │
└───────┴───────┴────────────┴────────────┘
Notice that it is the string values being compared (not the integers underlying the two Categorical variables).
The equality operator on Categorical variables
The following statements are equivalent:
df_cat.filter((pl.col('a_cat') == 'a'))
df_cat.filter((pl.col('a_cat').cast(pl.Utf8) == 'a'))
The former is syntactic sugar for the latter, as the former is a common use case.
As the error states: ComputeError: joins/or comparisons on categorical dtypes can only happen if they are created under the same global string cache.
Comparisons of categorical values are only allowed under a global string cache. You really want to set this in such a case as it speeds up comparisons and prevents expensive casts to strings.
Setting this at the start of your query will ensure it runs:
import polars as pl
pl.Config.set_global_string_cache()
This is a new answer based on the one from @ritchie46. As of Polars 0.15.15 it is now:
import polars as pl
pl.toggle_string_cache(True)
A StringCache() context manager can also be used; see the polars documentation:
with pl.StringCache():
    print(df_cat.filter(pl.col('a_cat').is_in(['a', 'c'])))

Polars unable to compute over() when an input was a list

I have a dataframe like this:
import polars as pl
orig_df = pl.DataFrame({
    "primary_key": [1, 1, 2, 2, 3, 3],
    "simple_foreign_keys": [1, 1, 2, 4, 4, 4],
    "fancy_foreign_keys": [[1], [1], [2], [1], [3, 4], [3, 4]],
})
I have some logic that computes when the secondary column changes, then additional logic to rank those changes per primary column. It works fine on the simple data type:
foreign_key_changed = pl.col('simple_foreign_keys') != pl.col('simple_foreign_keys').shift_and_fill(1, 0)
df = orig_df.sort('primary_key').with_column(foreign_key_changed.cumsum().alias('raw_changes'))
print(df)
shape: (6, 4)
┌─────────────┬─────────────────────┬────────────────────┬─────────────┐
│ primary_key ┆ simple_foreign_keys ┆ fancy_foreign_keys ┆ raw_changes │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ list[i64] ┆ u32 │
╞═════════════╪═════════════════════╪════════════════════╪═════════════╡
│ 1 ┆ 1 ┆ [1] ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 1 ┆ [1] ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2 ┆ [2] ┆ 2 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 4 ┆ [1] ┆ 3 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 4 ┆ [3, 4] ┆ 3 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 4 ┆ [3, 4] ┆ 3 │
└─────────────┴─────────────────────┴────────────────────┴─────────────┘
df = orig_df.sort('primary_key').with_column(foreign_key_changed.cumsum().rank('dense').over('primary_key').alias('ranked_changes'))
print(df)
shape: (6, 4)
┌─────────────┬─────────────────────┬────────────────────┬────────────────┐
│ primary_key ┆ simple_foreign_keys ┆ fancy_foreign_keys ┆ ranked_changes │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ list[i64] ┆ u32 │
╞═════════════╪═════════════════════╪════════════════════╪════════════════╡
│ 1 ┆ 1 ┆ [1] ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 1 ┆ [1] ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2 ┆ [2] ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 4 ┆ [1] ┆ 2 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 4 ┆ [3, 4] ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 4 ┆ [3, 4] ┆ 1 │
└─────────────┴─────────────────────┴────────────────────┴────────────────┘
But if I try the exact same logic on the List column, it blows up on me. Note that the intermediate columns (the cumsum, the rank) are still plain integers:
foreign_key_changed = pl.col('fancy_foreign_keys') != pl.col('fancy_foreign_keys').shift_and_fill(1, [])
df = orig_df.sort('primary_key').with_column(foreign_key_changed.cumsum().alias('raw_changes'))
print(df)
shape: (6, 4)
┌─────────────┬─────────────────────┬────────────────────┬─────────────┐
│ primary_key ┆ simple_foreign_keys ┆ fancy_foreign_keys ┆ raw_changes │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ list[i64] ┆ u32 │
╞═════════════╪═════════════════════╪════════════════════╪═════════════╡
│ 1 ┆ 1 ┆ [1] ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 1 ┆ [1] ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2 ┆ [2] ┆ 2 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 4 ┆ [1] ┆ 3 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 4 ┆ [3, 4] ┆ 4 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 4 ┆ [3, 4] ┆ 4 │
└─────────────┴─────────────────────┴────────────────────┴─────────────┘
df = orig_df.sort('primary_key').with_column(foreign_key_changed.cumsum().rank('dense').over('primary_key').alias('ranked_changes'))
thread '<unnamed>' panicked at 'implementation error, cannot get ref List(Null) from Int64', /Users/runner/work/polars/polars/polars/polars-core/src/series/mod.rs:945:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
File "polars_example.py", line 18, in <module>
df = orig_df.sort('primary_key').with_column(foreign_key_changed.cumsum().rank('dense').over('primary_key').alias('ranked_changes'))
File "venv/lib/python3.9/site-packages/polars/internals/frame.py", line 4144, in with_column
return self.with_columns([column])
File "venv/lib/python3.9/site-packages/polars/internals/frame.py", line 5308, in with_columns
self.lazy()
File "venv/lib/python3.9/site-packages/polars/internals/lazy_frame.py", line 652, in collect
return self._dataframe_class._from_pydf(ldf.collect())
pyo3_runtime.PanicException: implementation error, cannot get ref List(Null) from Int64
Am I doing this wrong? Is there some massaging that has to happen with the types?
This is with polars 0.13.62.
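One possible workaround, not from the original thread and untested against 0.13.62: since the cumulative sum itself computes fine (the raw_changes column above is a plain u32), materialize it first and then rank that integer column in a second step, so the window expression never has the list column in its input chain:
df = (
    orig_df.sort('primary_key')
    # step 1: materialize the change counter as a plain integer column
    .with_column(foreign_key_changed.cumsum().alias('raw_changes'))
    # step 2: rank the integer column per group; no list dtype involved
    .with_column(
        pl.col('raw_changes').rank('dense').over('primary_key').alias('ranked_changes')
    )
)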

polars outer join default null value

https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.DataFrame.join.html
Can I specify the default NULL value for outer joins? Like 0?
The join method does not currently have an option for setting a default value for nulls. However, there is an easy way to accomplish this.
Let's say we have this data:
import polars as pl
df1 = pl.DataFrame({"key": ["a", "b", "d"], "var1": [1, 1, 1]})
df2 = pl.DataFrame({"key": ["a", "b", "c"], "var2": [2, 2, 2]})
df1.join(df2, on="key", how="outer")
shape: (4, 3)
┌─────┬──────┬──────┐
│ key ┆ var1 ┆ var2 │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════╪══════╪══════╡
│ a ┆ 1 ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ b ┆ 1 ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ c ┆ null ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ d ┆ 1 ┆ null │
└─────┴──────┴──────┘
To replace the null values with a different value, simply use this:
df1.join(df2, on="key", how="outer").with_column(pl.all().fill_null(0))
shape: (4, 3)
┌─────┬──────┬──────┐
│ key ┆ var1 ┆ var2 │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════╪══════╪══════╡
│ a ┆ 1 ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ b ┆ 1 ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ c ┆ 0 ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ d ┆ 1 ┆ 0 │
└─────┴──────┴──────┘
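A variation, not in the original answer: if some columns should keep their nulls (or are non-numeric), restrict the fill to specific columns instead of using pl.all(). This reuses the pl.col-with-a-list pattern shown earlier:
df1.join(df2, on="key", how="outer").with_column(
    # fill only the value columns, leaving the join key untouched
    pl.col(["var1", "var2"]).fill_null(0)
)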