Index operation on list column data in polars - python-polars

I'm working with polars 0.13.46 for Python, and I have a column holding a list of strings for which I need to check whether a particular string occurs before another. I have created the following code example that works, but it needs to break out of Polars via apply, which makes it very slow.
import polars as pl
from polars import col
df = pl.DataFrame(
    {
        'str': ['A', 'B', 'C', 'B', 'A'],
        'group': [1, 1, 2, 1, 2]
    }
).lazy()
df_groups = df.groupby('group').agg([col('str').list().alias('str_list')])
print(df_groups.collect())
pre = 'A'
succ = 'B'
df_groups_filtered = df_groups.filter(
col('str_list').apply(
lambda str_list:
pre in str_list and succ in str_list and
str_list.to_list().index(pre) < str_list.to_list().index(succ)
)
)
df_groups_filtered.collect()
This produces the desired result: of the two groups shown below (the output of print(df_groups.collect())), only the first row should be kept.
┌───────┬─────────────────┐
│ group ┆ str_list │
│ --- ┆ --- │
│ i64 ┆ list[str] │
╞═══════╪═════════════════╡
│ 1 ┆ ["A", "B", "B"] │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ ["C", "A"] │
└───────┴─────────────────┘
I know that I can do
df_groups_filtered = df_groups.filter(
    col('str_list').arr.contains(pre) & col('str_list').arr.contains(succ)
)
for the part of checking that both strings are contained, but I couldn't figure out how I can check the order in pure polars.
Are there ways to achieve this natively with polars?

One way that we can solve this problem is to use the arr.eval expression. The arr.eval expression allows us to treat a list as if it were a Series/column, so that we can apply all the same expressions we are accustomed to using.
The Algorithm
(
    df_groups
    .filter(
        pl.col("str_list")
        .arr.eval(
            pl.element().filter(
                ((pl.element() == succ).cumsum() == 0) & (pl.element() == pre)
            )
        )
        .arr.lengths() > 0
    )
    .collect()
    .filter(pl.col("str_list").arr.contains(succ))
)
shape: (1, 2)
┌───────┬─────────────────┐
│ group ┆ str_list │
│ --- ┆ --- │
│ i64 ┆ list[str] │
╞═══════╪═════════════════╡
│ 1 ┆ ["A", "B", "B"] │
└───────┴─────────────────┘
Note: there is currently a bug in Polars which causes an error when we use
.filter(pl.col("str_list").arr.contains(succ))
in lazy mode, which is why the code above collects first and applies that filter in eager mode. (I'll file a bug report for it.)
How the algorithm works, in steps
The arr.eval expression allows us to treat a list as a Series/column, so that we can apply our usual toolkit of expressions to our problem.
That said, using arr.eval can seem a bit confusing at first, so we'll walk through this in steps.
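Before we do, here's a minimal taste of arr.eval in isolation (a tiny sketch, assuming the same Polars version as the question): it runs an ordinary expression over each list as if that list were its own Series.
import polars as pl
pl.DataFrame({'vals': [[1, 2, 3], [10, 20]]}).with_column(
    pl.col('vals').arr.eval(pl.element() * 2).alias('doubled')
)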
As a Series
Let's first see how the algorithm works when our data is a Series/column, and then back into how we code this when the data is in lists.
Let's start with this data. We'll attempt to find any time that cat appears before dog.
df_groups = pl.DataFrame([
    pl.Series('cat_dog', ['aardvark', 'cat', 'mouse', 'dog', 'sloth', 'zebra']),
    pl.Series('dog_cat', ['aardvark', 'dog', 'mouse', 'cat', 'sloth', 'zebra']),
    pl.Series('cat_dog_cat', ['aardvark', 'cat', 'mouse', 'dog', 'monkey', 'cat']),
    pl.Series('dog_cat_dog', ['aardvark', 'dog', 'mouse', 'cat', 'monkey', 'dog']),
    pl.Series('no_dog', ['aardvark', 'cat', 'mouse', 'cat', 'monkey', 'zebra']),
    pl.Series('no_cat', ['aardvark', 'mouse', 'dog', 'monkey', 'dog', 'zebra']),
    pl.Series('neither', ['aardvark', 'mouse', 'tiger', 'zebra', 'sloth', 'zebra']),
])
df_groups
shape: (6, 7)
┌──────────┬──────────┬─────────────┬─────────────┬──────────┬──────────┬──────────┐
│ cat_dog ┆ dog_cat ┆ cat_dog_cat ┆ dog_cat_dog ┆ no_dog ┆ no_cat ┆ neither │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str ┆ str ┆ str ┆ str │
╞══════════╪══════════╪═════════════╪═════════════╪══════════╪══════════╪══════════╡
│ aardvark ┆ aardvark ┆ aardvark ┆ aardvark ┆ aardvark ┆ aardvark ┆ aardvark │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ cat ┆ dog ┆ cat ┆ dog ┆ cat ┆ mouse ┆ mouse │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ mouse ┆ mouse ┆ mouse ┆ mouse ┆ mouse ┆ dog ┆ tiger │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ dog ┆ cat ┆ dog ┆ cat ┆ cat ┆ monkey ┆ zebra │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ sloth ┆ sloth ┆ monkey ┆ monkey ┆ monkey ┆ dog ┆ sloth │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ zebra ┆ zebra ┆ cat ┆ dog ┆ zebra ┆ zebra ┆ zebra │
└──────────┴──────────┴─────────────┴─────────────┴──────────┴──────────┴──────────┘
To detect our first occurrence of a dog, we'll use the cumsum expression on a boolean expression.
pre = "cat"
succ = "dog"
df_groups = df_groups.with_columns(
    (pl.all() == succ).cumsum().suffix('__cumsum')
)
df_groups.select(sorted(df_groups.columns))
shape: (6, 14)
┌──────────┬─────────────────┬─────────────┬─────────────────────┬──────────┬─────────────────┬─────────────┬─────────────────────┬──────────┬─────────────────┬──────────┬────────────────┬──────────┬────────────────┐
│ cat_dog ┆ cat_dog__cumsum ┆ cat_dog_cat ┆ cat_dog_cat__cumsum ┆ dog_cat ┆ dog_cat__cumsum ┆ dog_cat_dog ┆ dog_cat_dog__cumsum ┆ neither ┆ neither__cumsum ┆ no_cat ┆ no_cat__cumsum ┆ no_dog ┆ no_dog__cumsum │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ u32 ┆ str ┆ u32 ┆ str ┆ u32 ┆ str ┆ u32 ┆ str ┆ u32 ┆ str ┆ u32 ┆ str ┆ u32 │
╞══════════╪═════════════════╪═════════════╪═════════════════════╪══════════╪═════════════════╪═════════════╪═════════════════════╪══════════╪═════════════════╪══════════╪════════════════╪══════════╪════════════════╡
│ aardvark ┆ 0 ┆ aardvark ┆ 0 ┆ aardvark ┆ 0 ┆ aardvark ┆ 0 ┆ aardvark ┆ 0 ┆ aardvark ┆ 0 ┆ aardvark ┆ 0 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ cat ┆ 0 ┆ cat ┆ 0 ┆ dog ┆ 1 ┆ dog ┆ 1 ┆ mouse ┆ 0 ┆ mouse ┆ 0 ┆ cat ┆ 0 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ mouse ┆ 0 ┆ mouse ┆ 0 ┆ mouse ┆ 1 ┆ mouse ┆ 1 ┆ tiger ┆ 0 ┆ dog ┆ 1 ┆ mouse ┆ 0 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ dog ┆ 1 ┆ dog ┆ 1 ┆ cat ┆ 1 ┆ cat ┆ 1 ┆ zebra ┆ 0 ┆ monkey ┆ 1 ┆ cat ┆ 0 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ sloth ┆ 1 ┆ monkey ┆ 1 ┆ sloth ┆ 1 ┆ monkey ┆ 1 ┆ sloth ┆ 0 ┆ dog ┆ 2 ┆ monkey ┆ 0 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ zebra ┆ 1 ┆ cat ┆ 1 ┆ zebra ┆ 1 ┆ dog ┆ 2 ┆ zebra ┆ 0 ┆ zebra ┆ 2 ┆ zebra ┆ 0 │
└──────────┴─────────────────┴─────────────┴─────────────────────┴──────────┴─────────────────┴─────────────┴─────────────────────┴──────────┴─────────────────┴──────────┴────────────────┴──────────┴────────────────┘
So our goal will be to see if there are any rows where the value is cat and where the cumsum on the dog expression is also zero.
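For example, we can test one column directly, using the cat_dog column and its cat_dog__cumsum companion from above (a small sketch; the select returns a one-row boolean frame):
df_groups.select(
    ((pl.col('cat_dog') == pre) & (pl.col('cat_dog__cumsum') == 0)).any()
)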
Using lists and arr.eval
We'll take this in steps.
First, let's create some data as lists (instead of Series). I'm also going to eliminate the lazy mode, to reduce the clutter.
df_groups = pl.DataFrame({
    'group': ['cat_dog', 'dog_cat', 'cat_dog_cat', 'dog_cat_dog', 'no_dog', 'no_cat', 'neither'],
    'str_list': [
        ['aardvark', 'cat', 'mouse', 'dog'],
        ['aardvark', 'dog', 'mouse', 'cat'],
        ['aardvark', 'cat', 'mouse', 'dog', 'monkey', 'cat'],
        ['aardvark', 'dog', 'mouse', 'cat', 'monkey', 'dog'],
        ['aardvark', 'cat', 'mouse', 'cat', 'monkey'],
        ['aardvark', 'mouse', 'dog', 'monkey', 'dog'],
        ['aardvark', 'mouse', 'tiger', 'zebra'],
    ]
})
df_groups
shape: (7, 2)
┌─────────────┬─────────────────────────────────────┐
│ group ┆ str_list │
│ --- ┆ --- │
│ str ┆ list[str] │
╞═════════════╪═════════════════════════════════════╡
│ cat_dog ┆ ["aardvark", "cat", ... "dog"] │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ dog_cat ┆ ["aardvark", "dog", ... "cat"] │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ cat_dog_cat ┆ ["aardvark", "cat", ... "cat"] │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ dog_cat_dog ┆ ["aardvark", "dog", ... "dog"] │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ no_dog ┆ ["aardvark", "cat", ... "monkey"... │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ no_cat ┆ ["aardvark", "mouse", ... "dog"] │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ neither ┆ ["aardvark", "mouse", ... "zebra... │
└─────────────┴─────────────────────────────────────┘
Now, let's look for any list where cat appears before a dog (i.e., the cumsum value on dog is zero).
pre = "cat"
succ = "dog"
(
    df_groups
    .with_columns(
        pl.col("str_list")
        .arr.eval(
            pl.element().filter(
                ((pl.element() == succ).cumsum() == 0) & (pl.element() == pre)
            )
        )
    )
)
shape: (7, 2)
┌─────────────┬────────────────┐
│ group ┆ str_list │
│ --- ┆ --- │
│ str ┆ list[str] │
╞═════════════╪════════════════╡
│ cat_dog ┆ ["cat"] │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ dog_cat ┆ [] │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ cat_dog_cat ┆ ["cat"] │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ dog_cat_dog ┆ [] │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ no_dog ┆ ["cat", "cat"] │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ no_cat ┆ [] │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ neither ┆ [] │
└─────────────┴────────────────┘
We see that only three lists have a cat before a dog appears.
Next, we'll change the with_columns to a filter to keep only those rows where we found one or more cat before a dog.
(
    df_groups
    .filter(
        pl.col("str_list")
        .arr.eval(
            pl.element().filter(
                ((pl.element() == succ).cumsum() == 0) & (pl.element() == pre)
            )
        )
        .arr.lengths() > 0
    )
)
shape: (3, 2)
┌─────────────┬─────────────────────────────────────┐
│ group ┆ str_list │
│ --- ┆ --- │
│ str ┆ list[str] │
╞═════════════╪═════════════════════════════════════╡
│ cat_dog ┆ ["aardvark", "cat", ... "dog"] │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ cat_dog_cat ┆ ["aardvark", "cat", ... "cat"] │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ no_dog ┆ ["aardvark", "cat", ... "monkey"... │
└─────────────┴─────────────────────────────────────┘
And finally, we need to eliminate any rows where a dog never appears.
(
    df_groups
    .filter(pl.col("str_list").arr.contains(succ))
    .filter(
        pl.col("str_list")
        .arr.eval(
            pl.element().filter(
                ((pl.element() == succ).cumsum() == 0) & (pl.element() == pre)
            )
        )
        .arr.lengths() > 0
    )
)
shape: (2, 2)
┌─────────────┬────────────────────────────────┐
│ group ┆ str_list │
│ --- ┆ --- │
│ str ┆ list[str] │
╞═════════════╪════════════════════════════════╡
│ cat_dog ┆ ["aardvark", "cat", ... "dog"] │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ cat_dog_cat ┆ ["aardvark", "cat", ... "cat"] │
└─────────────┴────────────────────────────────┘
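For reuse, the whole check can be wrapped in a small helper that returns a filter expression. A sketch (occurs_before is my own name, not a Polars API; given the bug mentioned above, the arr.contains part may need to run in eager mode):
def occurs_before(list_col: str, pre: str, succ: str) -> pl.Expr:
    # `pre` appears while the running count of `succ` is still zero ...
    pre_first = (
        pl.col(list_col)
        .arr.eval(
            pl.element().filter(
                ((pl.element() == succ).cumsum() == 0) & (pl.element() == pre)
            )
        )
        .arr.lengths() > 0
    )
    # ... and `succ` actually occurs somewhere in the list
    return pre_first & pl.col(list_col).arr.contains(succ)

df_groups.filter(occurs_before("str_list", "cat", "dog"))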

It's not the easiest code to read, but this should work reasonably well. I will see if we can add an arg_where expression.
If we want to do this fast, for now we have to create a temporary dummy column (until Polars can eliminate common subexpressions).
For every list, we keep only the elements equal to pre or succ (or more values; I made it generic so it works with any number of values).
This should leave these lists:
shape: (2,)
Series: 'dummy' [list]
[
["A"]
["A", "B"]
]
Then in the filter operation we use a fold to create a boolean predicate that unfolds to:
dummy[0] == order[0] & dummy[1] == order[1]
We could write this out explicitly, which would be a bit more readable, but then it would not generalize to any number of elements.
pre = 'A'
succ = 'B'
order = [pre, succ]

# we first compute a dummy column
# we do that sequentially because it can be expensive to compute it multiple times
df_groups.with_columns([
    # we use arr.eval
    # and run the search in parallel
    pl.col("str_list").arr.eval(
        expr=pl.element().filter(pl.element().is_in(order)).head(2),
        parallel=True
    ).alias("dummy"),
]).filter(
    # we use a fold because this is generic for any number of elements
    pl.fold(acc=True,
            f=lambda acc, e: acc & e,
            exprs=[pl.col("dummy").arr.get(i) == order[i] for i in range(len(order))]
    )
).drop("dummy")

Here's another way to do it that uses explode, pivot, and join as the primary strategy.
If we start from the same cat/dog df as @user18559875, then we get
df_groups = pl.DataFrame({
    'group': ['cat_dog', 'dog_cat', 'cat_dog_cat', 'dog_cat_dog', 'no_dog', 'no_cat', 'neither'],
    'str_list': [
        ['aardvark', 'cat', 'mouse', 'dog'],
        ['aardvark', 'dog', 'mouse', 'cat'],
        ['aardvark', 'cat', 'mouse', 'dog', 'monkey', 'cat'],
        ['aardvark', 'dog', 'mouse', 'cat', 'monkey', 'dog'],
        ['aardvark', 'cat', 'mouse', 'cat', 'monkey'],
        ['aardvark', 'mouse', 'dog', 'monkey', 'dog'],
        ['aardvark', 'mouse', 'tiger', 'zebra'],
    ]
})
Taking this step by step, first we want an id column, so we use with_row_count
df_groups.with_row_count('id')
We then want to use explode to convert the str_list into a bigger df where each value of that nested list becomes its own row
df_groups.with_row_count('id').explode('str_list')
From here, we add another index to capture the ordering within each list
df_groups.with_row_count('id').explode('str_list').with_row_count('order')
Now we do a groupby on our id and str_list columns, and aggregate by the minimum value of order (i.e., the index of each value's first occurrence in the underlying list)
df_groups.with_row_count('id').explode('str_list').with_row_count('order') \
    .groupby(['id','str_list']).agg(pl.col('order').min())
Since we only care about the relative position of dogs and cats, we filter to keep only dogs and cats. We could (and arguably should) do this as early as possible, which would be right after the explode; I'm putting it after the with_row_count in case the true ordinal position matters.
df_groups.with_row_count('id').explode('str_list').with_row_count('order') \
    .filter(pl.col('str_list').is_in(['cat','dog'])) \
    .groupby(['id','str_list']).agg(pl.col('order').min())
The next thing we want is to filter for when cat comes before dog, so we pivot to get a cat column and a dog column, from which we can simply filter for when cat < dog and/or when dog isn't present.
df_groups.with_row_count('id').explode('str_list').with_row_count('order') \
    .filter(pl.col('str_list').is_in(['cat','dog'])) \
    .groupby(['id','str_list']).agg(pl.col('order').min()) \
    .pivot(index='id', columns='str_list', values='order') \
    .filter((pl.col('cat')<pl.col('dog')) | (pl.col('dog').is_null()))
But that's not quite what you want; it only identifies the rows you want. So we select just the id column and join it back to the original.
df_groups.with_row_count('id').join(
    df_groups.with_row_count('id').explode('str_list').with_row_count('order') \
        .filter(pl.col('str_list').is_in(['cat','dog'])) \
        .groupby(['id','str_list']).agg(pl.col('order').min()) \
        .pivot(index='id', columns='str_list', values='order') \
        .filter((pl.col('cat')<pl.col('dog')) | (pl.col('dog').is_null())).select('id'),
    on='id').select(pl.exclude('id'))
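Wrapped up as a reusable function over arbitrary pre/succ values, the whole pipeline might look like this (a sketch; filter_pre_before_succ is my own name, and it keeps the same keep-if-succ-absent behavior as the steps above):
def filter_pre_before_succ(df: pl.DataFrame, pre: str, succ: str) -> pl.DataFrame:
    # ids of rows where `pre` first occurs before `succ` (or `succ` is absent);
    # note the pivot only creates a `succ` column if `succ` occurs somewhere
    ids = (
        df.with_row_count('id').explode('str_list').with_row_count('order')
        .filter(pl.col('str_list').is_in([pre, succ]))
        .groupby(['id', 'str_list']).agg(pl.col('order').min())
        .pivot(index='id', columns='str_list', values='order')
        .filter((pl.col(pre) < pl.col(succ)) | (pl.col(succ).is_null()))
        .select('id')
    )
    return df.with_row_count('id').join(ids, on='id').select(pl.exclude('id'))

filter_pre_before_succ(df_groups, 'cat', 'dog')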

I'd like to add another solution that I actually ended up adopting. Credit goes to @mcrumiller, who posted this on GitHub.
import polars as pl
from polars import col, when

df = pl.DataFrame({
    "id": [1, 2, 3, 4],
    "list": [['A', 'B'], ['B', 'A'], ['C', 'B', 'D'], ['D', 'A', 'C', 'B']]
})

def loc_of(value):
    # only execute if the item is contained in the list
    return when(col("list").arr.contains(value)).then(
        col("list").arr.eval(
            # create an array of True/False, then cast to 1's and 0's;
            # arg_max() then finds the first occurrence of 1, i.e. the first occurrence of value
            (pl.element() == value).cast(pl.UInt8).arg_max(),
            parallel=True
        ).arr.first()
    ).otherwise(None)  # return null if not found

df.filter(loc_of('A') < loc_of('B'))
I really like the simplicity of this approach. Performance-wise it is very similar to the approach of @user18559875.
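A nice side benefit is that loc_of can be used on its own to materialize the positions, which makes the filter easy to sanity-check (a small sketch):
df.with_columns([
    loc_of('A').alias('loc_A'),
    loc_of('B').alias('loc_B'),
])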


python-polars is there a np.where equivalent?

Is there an np.where equivalent in Polars? I'm trying to replicate the following code in Polars.
If the value is below a certain threshold, a column called Is_Acceptable? is 1; otherwise it is 0.
import pandas as pd
import numpy as np
df = pd.DataFrame({"fruit":["orange","apple","mango","kiwi"], "value":[1,0.8,0.7,1.2]})
df["Is_Acceptable?"] = np.where(df["value"].lt(0.9), 1, 0)
print(df)
Yes, there is the pl.when().then().otherwise() expression
import polars as pl
from polars import col
df = pl.DataFrame({
    "fruit": ["orange", "apple", "mango", "kiwi"],
    "value": [1, 0.8, 0.7, 1.2]
})
df = df.with_column(
    pl.when(col('value') < 0.9).then(1).otherwise(0).alias('Is_Acceptable?')
)
print(df)
┌────────┬───────┬────────────────┐
│ fruit ┆ value ┆ Is_Acceptable? │
│ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ i64 │
╞════════╪═══════╪════════════════╡
│ orange ┆ 1.0 ┆ 0 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ apple ┆ 0.8 ┆ 1 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ mango ┆ 0.7 ┆ 1 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ kiwi ┆ 1.2 ┆ 0 │
└────────┴───────┴────────────────┘
The when/then/otherwise expression is a good general-purpose answer. However, in this case, one shortcut is to simply create a boolean expression.
(
    df
    .with_column(
        (pl.col('value') < 0.9).alias('Is_Acceptable')
    )
)
shape: (4, 3)
┌────────┬───────┬───────────────┐
│ fruit ┆ value ┆ Is_Acceptable │
│ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ bool │
╞════════╪═══════╪═══════════════╡
│ orange ┆ 1.0 ┆ false │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ apple ┆ 0.8 ┆ true │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ mango ┆ 0.7 ┆ true │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ kiwi ┆ 1.2 ┆ false │
└────────┴───────┴───────────────┘
In numeric computations, False will be upcast to 0, and True will be upcast to 1. Or, if you prefer, you can upcast them explicitly to a different type.
(
    df
    .with_column(
        (pl.col('value') < 0.9).cast(pl.Int64).alias('Is_Acceptable')
    )
)
shape: (4, 3)
┌────────┬───────┬───────────────┐
│ fruit ┆ value ┆ Is_Acceptable │
│ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ i64 │
╞════════╪═══════╪═══════════════╡
│ orange ┆ 1.0 ┆ 0 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ apple ┆ 0.8 ┆ 1 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ mango ┆ 0.7 ┆ 1 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ kiwi ┆ 1.2 ┆ 0 │
└────────┴───────┴───────────────┘
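That upcast also means the boolean expression can feed straight into aggregations; for example, counting the acceptable rows (a small sketch):
df.select(
    (pl.col('value') < 0.9).sum().alias('n_acceptable')
)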

Python-Polars: How to filter categorical column with string list

I have a Polars dataframe like below:
df_cat = pl.DataFrame(
    [
        pl.Series("a_cat", ["c", "a", "b", "c", "b"], dtype=pl.Categorical),
        pl.Series("b_cat", ["F", "G", "E", "G", "G"], dtype=pl.Categorical)
    ]
)
print(df_cat)
shape: (5, 2)
┌───────┬───────┐
│ a_cat ┆ b_cat │
│ --- ┆ --- │
│ cat ┆ cat │
╞═══════╪═══════╡
│ c ┆ F │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ a ┆ G │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b ┆ E │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ c ┆ G │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b ┆ G │
└───────┴───────┘
The following filter runs perfectly fine:
print(df_cat.filter(pl.col('a_cat') == 'c'))
shape: (2, 2)
┌───────┬───────┐
│ a_cat ┆ b_cat │
│ --- ┆ --- │
│ cat ┆ cat │
╞═══════╪═══════╡
│ c ┆ F │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ c ┆ G │
└───────┴───────┘
What I want is to use a list of strings to run the filter more efficiently. So I tried, and ended up with the following error message:
print(df_cat.filter(pl.col('a_cat').is_in(['a', 'c'])))
---------------------------------------------------------------------------
ComputeError Traceback (most recent call last)
d:\GitRepo\Test2\stockEMD3.ipynb Cell 9 in <cell line: 1>()
----> 1 print(df_cat.filter(pl.col('a_cat').is_in(['c'])))
File c:\ProgramData\Anaconda3\envs\charm3.9\lib\site-packages\polars\internals\dataframe\frame.py:2185, in DataFrame.filter(self, predicate)
2181 if _NUMPY_AVAILABLE and isinstance(predicate, np.ndarray):
2182 predicate = pli.Series(predicate)
2184 return (
-> 2185 self.lazy()
2186 .filter(predicate) # type: ignore[arg-type]
2187 .collect(no_optimization=True, string_cache=False)
2188 )
File c:\ProgramData\Anaconda3\envs\charm3.9\lib\site-packages\polars\internals\lazyframe\frame.py:660, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, string_cache, no_optimization, slice_pushdown)
650 projection_pushdown = False
652 ldf = self._ldf.optimization_toggle(
653 type_coercion,
654 predicate_pushdown,
(...)
658 slice_pushdown,
659 )
--> 660 return pli.wrap_df(ldf.collect())
ComputeError: joins/or comparisons on categorical dtypes can only happen if they are created under the same global string cache
From this Stack Overflow link I understand "You need to set a global string cache to compare categoricals created in different columns/lists.", but my questions are:
Why does the == filter with one single string work?
What is the proper way to filter a categorical column with a list of strings?
Thanks!
Actually, you don't need to set a global string cache to compare strings to Categorical variables. You can use cast to accomplish this.
Let's use this data. I've included the integer values that underlie the Categorical variables to demonstrate something later.
import polars as pl
df_cat = (
    pl.DataFrame(
        [
            pl.Series("a_cat", ["c", "a", "b", "c", "X"], dtype=pl.Categorical),
            pl.Series("b_cat", ["F", "G", "E", "S", "X"], dtype=pl.Categorical),
        ]
    )
    .with_column(
        pl.all().to_physical().suffix('_phys')
    )
)
df_cat
shape: (5, 4)
┌───────┬───────┬────────────┬────────────┐
│ a_cat ┆ b_cat ┆ a_cat_phys ┆ b_cat_phys │
│ --- ┆ --- ┆ --- ┆ --- │
│ cat ┆ cat ┆ u32 ┆ u32 │
╞═══════╪═══════╪════════════╪════════════╡
│ c ┆ F ┆ 0 ┆ 0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ a ┆ G ┆ 1 ┆ 1 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ b ┆ E ┆ 2 ┆ 2 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ c ┆ S ┆ 0 ┆ 3 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ X ┆ X ┆ 3 ┆ 4 │
└───────┴───────┴────────────┴────────────┘
Comparing a categorical variable to a string
If we cast a Categorical variable back to its string values, we can make any comparison we need. For example:
df_cat.filter(pl.col('a_cat').cast(pl.Utf8).is_in(['a', 'c']))
shape: (3, 4)
┌───────┬───────┬────────────┬────────────┐
│ a_cat ┆ b_cat ┆ a_cat_phys ┆ b_cat_phys │
│ --- ┆ --- ┆ --- ┆ --- │
│ cat ┆ cat ┆ u32 ┆ u32 │
╞═══════╪═══════╪════════════╪════════════╡
│ c ┆ F ┆ 0 ┆ 0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ a ┆ G ┆ 1 ┆ 1 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ c ┆ S ┆ 0 ┆ 3 │
└───────┴───────┴────────────┴────────────┘
Or in a filter step comparing the string values of two Categorical variables that do not share the same string cache.
df_cat.filter(pl.col('a_cat').cast(pl.Utf8) == pl.col('b_cat').cast(pl.Utf8))
shape: (1, 4)
┌───────┬───────┬────────────┬────────────┐
│ a_cat ┆ b_cat ┆ a_cat_phys ┆ b_cat_phys │
│ --- ┆ --- ┆ --- ┆ --- │
│ cat ┆ cat ┆ u32 ┆ u32 │
╞═══════╪═══════╪════════════╪════════════╡
│ X ┆ X ┆ 3 ┆ 4 │
└───────┴───────┴────────────┴────────────┘
Notice that it is the string values being compared (not the integers underlying the two Categorical variables).
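To see the difference, here's a sketch comparing the underlying physical integers instead (using the *_phys columns from above). This matches the first three rows (0 == 0, 1 == 1, 2 == 2), even though none of their string values agree:
df_cat.filter(pl.col('a_cat_phys') == pl.col('b_cat_phys'))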
The equality operator on Categorical variables
The following statements are equivalent:
df_cat.filter((pl.col('a_cat') == 'a'))
df_cat.filter((pl.col('a_cat').cast(pl.Utf8) == 'a'))
The former is syntactic sugar for the latter, as the former is a common use case.
As the error states: ComputeError: joins/or comparisons on categorical dtypes can only happen if they are created under the same global string cache.
Comparisons of categorical values are only allowed under a global string cache. You really want to set this in such a case as it speeds up comparisons and prevents expensive casts to strings.
Setting this at the start of your query will ensure it runs:
import polars as pl
pl.Config.set_global_string_cache()
This is an updated answer, based on the one from @ritchie46. As of Polars 0.15.15 it is now:
import polars as pl
pl.toggle_string_cache(True)
Also, a StringCache() context manager can be used; see the Polars documentation:
with pl.StringCache():
    print(df_cat.filter(pl.col('a_cat').is_in(['a', 'c'])))

Polars: assign existing category

I am using Polars to analyze some A/B test data (and a little bit more...). Now I had to correct for some inconsistency. df_prep is a Polars DataFrame that has a column 'Group' of type cat with levels 'A' and 'B'.
Naively, I did this:
# After the A/B test period, everything is B!
df_prep = (df_prep.lazy()
    .with_column(
        pl.when(pl.col('Datum') >= pl.col('TestEndDate'))
        .then('B')
        .otherwise(pl.col('Group'))
        .alias('Group'))
    .collect())
However, the problem is now that df_prep['Group'].unique() gives
shape: (3,)
Series: 'Group' [cat]
[
"B"
"A"
"B"
]
This is obviously not what I wanted. I wanted to assign the existing category "B".
How could this be achieved?
EDIT: I found one way:
df_prep = df_prep.with_column(pl.col('Group').cast(pl.Utf8).cast(pl.Categorical).alias('Group'))
But this doesn't seem right to me... Isn't there a more idiomatic solution?
This is a common problem when comparing string values to Categorical values. One way to solve this problem is to use a string cache, either globally or using a context manager.
Without a string cache
First, let's take a closer look at what is occurring. Let's start with this data, and look at the underlying physical representation of the Categorical variable (the integer that represents each unique category value).
import polars as pl
from datetime import date
df_prep = pl.DataFrame(
    [
        pl.Series(
            name="Group",
            values=["A", "A", "B", "B"],
            dtype=pl.Categorical,
        ),
        pl.Series(
            name="Datum",
            values=pl.date_range(date(2022, 1, 1), date(2022, 1, 4), "1d"),
        ),
        pl.Series(name="TestEndDate", values=[date(2022, 1, 4)] * 4),
    ]
)
(
    df_prep
    .with_column(pl.col('Group').to_physical().alias('Physical'))
)
shape: (4, 4)
┌───────┬────────────┬─────────────┬──────────┐
│ Group ┆ Datum ┆ TestEndDate ┆ Physical │
│ --- ┆ --- ┆ --- ┆ --- │
│ cat ┆ date ┆ date ┆ u32 │
╞═══════╪════════════╪═════════════╪══════════╡
│ A ┆ 2022-01-01 ┆ 2022-01-04 ┆ 0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ A ┆ 2022-01-02 ┆ 2022-01-04 ┆ 0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ B ┆ 2022-01-03 ┆ 2022-01-04 ┆ 1 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ B ┆ 2022-01-04 ┆ 2022-01-04 ┆ 1 │
└───────┴────────────┴─────────────┴──────────┘
Note that A is assigned a physical value of 0; B, a value of 1.
Now, let's run the next step (without a string cache), and see what happens:
result = (
    df_prep.lazy()
    .with_column(
        pl.when(pl.col("Datum") >= pl.col("TestEndDate"))
        .then("B")
        .otherwise(pl.col("Group"))
        .alias("Group")
    )
    .with_column(pl.col('Group').to_physical().alias('Physical'))
    .collect()
)
result
shape: (4, 4)
┌───────┬────────────┬─────────────┬──────────┐
│ Group ┆ Datum ┆ TestEndDate ┆ Physical │
│ --- ┆ --- ┆ --- ┆ --- │
│ cat ┆ date ┆ date ┆ u32 │
╞═══════╪════════════╪═════════════╪══════════╡
│ A ┆ 2022-01-01 ┆ 2022-01-04 ┆ 1 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ A ┆ 2022-01-02 ┆ 2022-01-04 ┆ 1 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ B ┆ 2022-01-03 ┆ 2022-01-04 ┆ 2 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ B ┆ 2022-01-04 ┆ 2022-01-04 ┆ 0 │
└───────┴────────────┴─────────────┴──────────┘
Notice what happened. Without a string cache, the underlying physical representations of the Categorical values have changed. Indeed, the Categorical value B now has two underlying physical representations: 2 and 0. Polars sees the two B's as distinct.
Indeed, we see this if we use unique on this column:
result.get_column('Group').unique()
shape: (3,)
Series: 'Group' [cat]
[
"B"
"A"
"B"
]
Using a global string cache
One easy way to handle this is to use a global string cache while making comparisons between strings and Categorical values, or setting values for Categorical variables using strings.
We'll set the global string cache and rerun the algorithm. We'll use Polars' toggle_string_cache method to achieve this.
pl.toggle_string_cache(True)
df_prep = pl.DataFrame(
    [
        pl.Series(
            name="Group",
            values=["A", "A", "B", "B"],
            dtype=pl.Categorical,
        ),
        pl.Series(
            name="Datum",
            values=pl.date_range(date(2022, 1, 1), date(2022, 1, 4), "1d"),
        ),
        pl.Series(name="TestEndDate", values=[date(2022, 1, 4)] * 4),
    ]
)
result = (
    df_prep.lazy()
    .with_column(
        pl.when(pl.col("Datum") >= pl.col("TestEndDate"))
        .then("B")
        .otherwise(pl.col("Group"))
        .alias("Group")
    )
    .with_column(pl.col('Group').to_physical().alias('Physical'))
    .collect()
)
result
shape: (4, 4)
┌───────┬────────────┬─────────────┬──────────┐
│ Group ┆ Datum ┆ TestEndDate ┆ Physical │
│ --- ┆ --- ┆ --- ┆ --- │
│ cat ┆ date ┆ date ┆ u32 │
╞═══════╪════════════╪═════════════╪══════════╡
│ A ┆ 2022-01-01 ┆ 2022-01-04 ┆ 0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ A ┆ 2022-01-02 ┆ 2022-01-04 ┆ 0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ B ┆ 2022-01-03 ┆ 2022-01-04 ┆ 1 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ B ┆ 2022-01-04 ┆ 2022-01-04 ┆ 1 │
└───────┴────────────┴─────────────┴──────────┘
result.get_column('Group').unique()
shape: (2,)
Series: 'Group' [cat]
[
"A"
"B"
]
Notice how the Categorical variable maintains its correct physical representation. And the results of using unique on Group are what we expect.
Using a Context Manager
If you don't want to keep a global string cache in effect, you can use a context manager to set a localized, temporary StringCache while you are making comparisons to strings.
with pl.StringCache():
    df_prep = pl.DataFrame(
        [
            pl.Series(
                name="Group",
                values=["A", "A", "B", "B"],
                dtype=pl.Categorical,
            ),
            pl.Series(
                name="Datum",
                values=pl.date_range(date(2022, 1, 1), date(2022, 1, 4), "1d"),
            ),
            pl.Series(name="TestEndDate", values=[date(2022, 1, 4)] * 4),
        ]
    )
    result = (
        df_prep.lazy()
        .with_column(
            pl.when(pl.col("Datum") >= pl.col("TestEndDate"))
            .then("B")
            .otherwise(pl.col("Group"))
            .alias("Group")
        )
        .with_column(pl.col('Group').to_physical().alias('Physical'))
        .collect()
    )
result
shape: (4, 4)
┌───────┬────────────┬─────────────┬──────────┐
│ Group ┆ Datum ┆ TestEndDate ┆ Physical │
│ --- ┆ --- ┆ --- ┆ --- │
│ cat ┆ date ┆ date ┆ u32 │
╞═══════╪════════════╪═════════════╪══════════╡
│ A ┆ 2022-01-01 ┆ 2022-01-04 ┆ 0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ A ┆ 2022-01-02 ┆ 2022-01-04 ┆ 0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ B ┆ 2022-01-03 ┆ 2022-01-04 ┆ 1 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ B ┆ 2022-01-04 ┆ 2022-01-04 ┆ 1 │
└───────┴────────────┴─────────────┴──────────┘
result.get_column('Group').unique()
shape: (2,)
Series: 'Group' [cat]
[
"A"
"B"
]
Edit: Reading/Scanning external files
You can read/scan external files with a string cache in effect. For example, below I've saved our DataFrame to tmp.parquet.
If I use read_parquet with a string cache in effect, the Categorical variables are included in the string cache.
(Note: in the examples below, I'll use a Context Manager -- to clearly delineate where the string cache is in effect.)
import polars as pl
with pl.StringCache():
    (
        pl.read_parquet('tmp.parquet')
        .with_column(
            pl.when(pl.col("Datum") >= pl.col("TestEndDate"))
            .then("B")
            .otherwise(pl.col("Group"))
            .alias("Group")
        )
        .with_column(pl.col('Group').to_physical().alias('Physical'))
    )
shape: (4, 4)
┌───────┬────────────┬─────────────┬──────────┐
│ Group ┆ Datum ┆ TestEndDate ┆ Physical │
│ --- ┆ --- ┆ --- ┆ --- │
│ cat ┆ date ┆ date ┆ u32 │
╞═══════╪════════════╪═════════════╪══════════╡
│ A ┆ 2022-01-01 ┆ 2022-01-04 ┆ 0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ A ┆ 2022-01-02 ┆ 2022-01-04 ┆ 0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ B ┆ 2022-01-03 ┆ 2022-01-04 ┆ 1 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ B ┆ 2022-01-04 ┆ 2022-01-04 ┆ 1 │
└───────┴────────────┴─────────────┴──────────┘
Notice that our Categorical values are correct. (The B values have the same underlying physical representation.)
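The same applies to lazy scanning. A sketch with scan_parquet (keeping the collect inside the context manager, so the string cache is in effect when the file is actually read):
with pl.StringCache():
    result = (
        pl.scan_parquet('tmp.parquet')
        .with_column(
            pl.when(pl.col("Datum") >= pl.col("TestEndDate"))
            .then("B")
            .otherwise(pl.col("Group"))
            .alias("Group")
        )
        .collect()
    )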
However, if we move the read_parquet method outside the Context Manager (so that the DataFrame is created without a string cache), we have a problem.
df_prep = pl.read_parquet('tmp.parquet')

with pl.StringCache():
    (
        df_prep
        .with_column(
            pl.when(pl.col("Datum") >= pl.col("TestEndDate"))
            .then("B")
            .otherwise(pl.col("Group"))
            .alias("Group")
        )
        .with_column(pl.col('Group').to_physical().alias('Physical'))
    )
Traceback (most recent call last):
File "<stdin>", line 7, in <module>
File "/home/corey/.virtualenvs/StackOverflow/lib/python3.10/site-packages/polars/internals/dataframe/frame.py", line 4027, in with_column
self.lazy()
File "/home/corey/.virtualenvs/StackOverflow/lib/python3.10/site-packages/polars/internals/lazyframe/frame.py", line 803, in collect
return pli.wrap_df(ldf.collect())
exceptions.ComputeError: cannot combine categorical under a global string cache with a non cached categorical
The error message says it all.
Edit: Placing existing Categorical columns under a string cache
One way to correct the situation above (assuming that it's already too late to re-read your DataFrame with a string cache) is to put a new string cache into effect, and then cast the values back to strings and then back to Categorical.
Below, we'll use a shortcut to perform this for all Categorical columns in parallel - by specifying pl.Categorical in the pl.col.
with pl.StringCache():
    (
        df_prep
        .with_columns([
            pl.col(pl.Categorical).cast(pl.Utf8).cast(pl.Categorical)
        ])
        .with_column(
            pl.when(pl.col("Datum") >= pl.col("TestEndDate"))
            .then("B")
            .otherwise(pl.col("Group"))
            .alias("Group")
        )
        .with_column(pl.col('Group').to_physical().alias('Physical'))
    )
shape: (4, 4)
┌───────┬────────────┬─────────────┬──────────┐
│ Group ┆ Datum ┆ TestEndDate ┆ Physical │
│ --- ┆ --- ┆ --- ┆ --- │
│ cat ┆ date ┆ date ┆ u32 │
╞═══════╪════════════╪═════════════╪══════════╡
│ A ┆ 2022-01-01 ┆ 2022-01-04 ┆ 0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ A ┆ 2022-01-02 ┆ 2022-01-04 ┆ 0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ B ┆ 2022-01-03 ┆ 2022-01-04 ┆ 1 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ B ┆ 2022-01-04 ┆ 2022-01-04 ┆ 1 │
└───────┴────────────┴─────────────┴──────────┘
And now our code works correctly again.

Idiomatic replacement of empty string '' with pl.Null (null) in polars

I have a polars DataFrame with a number of Series that look like:
pl.Series(['cow', 'cat', '', 'lobster', ''])
and I'd like them to be
pl.Series(['cow', 'cat', pl.Null, 'lobster', pl.Null])
A simple string replacement won't work since pl.Null is not of type PyString:
pl.Series(['cow', 'cat', '', 'lobster', '']).str.replace('', pl.Null)
What's the idiomatic way of doing this for a Series/DataFrame in polars?
Series
For a single Series, you can use the set method.
import polars as pl
my_series = pl.Series(['cow', 'cat', '', 'lobster', ''])
my_series.set(my_series.str.lengths() == 0, None)
shape: (5,)
Series: '' [str]
[
"cow"
"cat"
null
"lobster"
null
]
DataFrame
For DataFrames, I would suggest using when/then/otherwise. For example, with this data:
df = pl.DataFrame({
    'str1': ['cow', 'dog', '', 'lobster', ''],
    'str2': ['', 'apple', 'orange', '', 'kiwi'],
    'str3': ['house', '', 'apartment', 'condo', ''],
})
df
shape: (5, 3)
┌─────────┬────────┬───────────┐
│ str1 ┆ str2 ┆ str3 │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞═════════╪════════╪═══════════╡
│ cow ┆ ┆ house │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ dog ┆ apple ┆ │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ ┆ orange ┆ apartment │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ lobster ┆ ┆ condo │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ ┆ kiwi ┆ │
└─────────┴────────┴───────────┘
We can run a replacement on all string columns as follows:
df.with_columns([
    pl.when(pl.col(pl.Utf8).str.lengths() == 0)
    .then(None)
    .otherwise(pl.col(pl.Utf8))
    .keep_name()
])
shape: (5, 3)
┌─────────┬────────┬───────────┐
│ str1 ┆ str2 ┆ str3 │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞═════════╪════════╪═══════════╡
│ cow ┆ null ┆ house │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ dog ┆ apple ┆ null │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ null ┆ orange ┆ apartment │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ lobster ┆ null ┆ condo │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ null ┆ kiwi ┆ null │
└─────────┴────────┴───────────┘
The above should be fairly performant.
If you only want to replace empty strings with null on certain columns, you can provide a list:
only_these = ['str1', 'str2']
df.with_columns([
    pl.when(pl.col(only_these).str.lengths() == 0)
    .then(None)
    .otherwise(pl.col(only_these))
    .keep_name()
])
shape: (5, 3)
┌─────────┬────────┬───────────┐
│ str1 ┆ str2 ┆ str3 │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞═════════╪════════╪═══════════╡
│ cow ┆ null ┆ house │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ dog ┆ apple ┆ │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ null ┆ orange ┆ apartment │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ lobster ┆ null ┆ condo │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ null ┆ kiwi ┆ │
└─────────┴────────┴───────────┘

Polars: how to add a column in front?

What would be the most idiomatic (and efficient) way to add a column in front of a polars data frame? The same thing as .with_column, but adding it at index 0?
You can select the columns in the order you want them in your new DataFrame.
df = pl.DataFrame({
    "a": [1, 2, 3],
    "b": [True, None, False]
})
df.select([
    pl.lit("foo").alias("z"),
    pl.all()
])
shape: (3, 3)
┌─────┬─────┬───────┐
│ z ┆ a ┆ b │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ bool │
╞═════╪═════╪═══════╡
│ foo ┆ 1 ┆ true │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ foo ┆ 2 ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ foo ┆ 3 ┆ false │
└─────┴─────┴───────┘
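The same idea works when the front column is computed from the existing ones; just list its expression first in the select (a small sketch):
df.select([
    (pl.col("a") * 2).alias("a_doubled"),
    pl.all(),
])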