Polars: assign existing category

I am using Polars to analyze some A/B test data (and a little bit more...). Now I need to correct for an inconsistency. df_prep is a Polars DataFrame that has a column 'Group' of type cat with levels 'A' and 'B'.
Naively, I did this:
# After the A/B test period, everything is B!
df_prep = (df_prep.lazy()
    .with_column(
        pl.when(pl.col('Datum') >= pl.col('TestEndDate'))
        .then('B')
        .otherwise(pl.col('Group'))
        .alias('Group'))
    .collect())
However, the problem is now that df_prep['Group'].unique() gives
shape: (3,)
Series: 'Group' [cat]
[
"B"
"A"
"B"
]
This is obviously not what I wanted. I wanted to assign the existing category "B".
How could this be achieved?
EDIT: I found one way:
df_prep = df_prep.with_column(pl.col('Group').cast(pl.Utf8).cast(pl.Categorical).alias('Group'))
But this doesn't seem right to me... Isn't there a more idiomatic solution?

This is a common problem when comparing string values to Categorical values. One way to solve this problem is to use a string cache, either globally or using a context manager.
Without a string cache
First, let's take a closer look at what is occurring. Let's start with this data, and look at the underlying physical representation of the Categorical variable (the integer that represents each unique category value).
import polars as pl
from datetime import date
df_prep = pl.DataFrame(
[
pl.Series(
name="Group",
values=["A", "A", "B", "B"],
dtype=pl.Categorical,
),
pl.Series(
name="Datum",
values=pl.date_range(date(2022, 1, 1), date(2022, 1, 4), "1d"),
),
pl.Series(name="TestEndDate", values=[date(2022, 1, 4)] * 4),
]
)
(
df_prep
.with_column(pl.col('Group').to_physical().alias('Physical'))
)
shape: (4, 4)
┌───────┬────────────┬─────────────┬──────────┐
│ Group ┆ Datum ┆ TestEndDate ┆ Physical │
│ --- ┆ --- ┆ --- ┆ --- │
│ cat ┆ date ┆ date ┆ u32 │
╞═══════╪════════════╪═════════════╪══════════╡
│ A ┆ 2022-01-01 ┆ 2022-01-04 ┆ 0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ A ┆ 2022-01-02 ┆ 2022-01-04 ┆ 0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ B ┆ 2022-01-03 ┆ 2022-01-04 ┆ 1 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ B ┆ 2022-01-04 ┆ 2022-01-04 ┆ 1 │
└───────┴────────────┴─────────────┴──────────┘
Note that A is assigned a physical value of 0; B, a value of 1.
Now, let's run the next step (without a string cache), and see what happens:
result = (
df_prep.lazy()
.with_column(
pl.when(pl.col("Datum") >= pl.col("TestEndDate"))
.then("B")
.otherwise(pl.col("Group"))
.alias("Group")
)
.with_column(pl.col('Group').to_physical().alias('Physical'))
.collect()
)
result
shape: (4, 4)
┌───────┬────────────┬─────────────┬──────────┐
│ Group ┆ Datum ┆ TestEndDate ┆ Physical │
│ --- ┆ --- ┆ --- ┆ --- │
│ cat ┆ date ┆ date ┆ u32 │
╞═══════╪════════════╪═════════════╪══════════╡
│ A ┆ 2022-01-01 ┆ 2022-01-04 ┆ 1 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ A ┆ 2022-01-02 ┆ 2022-01-04 ┆ 1 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ B ┆ 2022-01-03 ┆ 2022-01-04 ┆ 2 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ B ┆ 2022-01-04 ┆ 2022-01-04 ┆ 0 │
└───────┴────────────┴─────────────┴──────────┘
Notice what happened. Without a string cache, the underlying physical representations of the Categorical values have changed. Indeed, the Categorical value B now has two underlying physical representations: 2 and 0. Polars sees the two B's as distinct.
We can see this if we use unique on this column:
result.get_column('Group').unique()
shape: (3,)
Series: 'Group' [cat]
[
"B"
"A"
"B"
]
Using a global string cache
One easy way to handle this is to keep a global string cache in effect while making comparisons between strings and Categorical values, or while setting Categorical values from strings.
We'll enable the global string cache and rerun the algorithm, using Polars' toggle_string_cache function.
pl.toggle_string_cache(True)
df_prep = pl.DataFrame(
[
pl.Series(
name="Group",
values=["A", "A", "B", "B"],
dtype=pl.Categorical,
),
pl.Series(
name="Datum",
values=pl.date_range(date(2022, 1, 1), date(2022, 1, 4), "1d"),
),
pl.Series(name="TestEndDate", values=[date(2022, 1, 4)] * 4),
]
)
result = (
df_prep.lazy()
.with_column(
pl.when(pl.col("Datum") >= pl.col("TestEndDate"))
.then("B")
.otherwise(pl.col("Group"))
.alias("Group")
)
.with_column(pl.col('Group').to_physical().alias('Physical'))
.collect()
)
result
shape: (4, 4)
┌───────┬────────────┬─────────────┬──────────┐
│ Group ┆ Datum ┆ TestEndDate ┆ Physical │
│ --- ┆ --- ┆ --- ┆ --- │
│ cat ┆ date ┆ date ┆ u32 │
╞═══════╪════════════╪═════════════╪══════════╡
│ A ┆ 2022-01-01 ┆ 2022-01-04 ┆ 0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ A ┆ 2022-01-02 ┆ 2022-01-04 ┆ 0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ B ┆ 2022-01-03 ┆ 2022-01-04 ┆ 1 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ B ┆ 2022-01-04 ┆ 2022-01-04 ┆ 1 │
└───────┴────────────┴─────────────┴──────────┘
result.get_column('Group').unique()
shape: (2,)
Series: 'Group' [cat]
[
"A"
"B"
]
Notice how the Categorical variable maintains its correct physical representation. And the results of using unique on Group are what we expect.
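One note: a global string cache stays in effect until you turn it off. Here is a minimal sketch of disabling it again once the string comparisons are done, assuming the same toggle_string_cache API used above:
# Sketch: disable the global string cache when you no longer need it
pl.toggle_string_cache(False)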
Using a Context Manager
If you don't want to keep a global string cache in effect, you can use a context manager to set a localized, temporary StringCache while you are making comparisons to strings.
with pl.StringCache():
    df_prep = pl.DataFrame(
        [
            pl.Series(
                name="Group",
                values=["A", "A", "B", "B"],
                dtype=pl.Categorical,
            ),
            pl.Series(
                name="Datum",
                values=pl.date_range(date(2022, 1, 1), date(2022, 1, 4), "1d"),
            ),
            pl.Series(name="TestEndDate", values=[date(2022, 1, 4)] * 4),
        ]
    )
    result = (
        df_prep.lazy()
        .with_column(
            pl.when(pl.col("Datum") >= pl.col("TestEndDate"))
            .then("B")
            .otherwise(pl.col("Group"))
            .alias("Group")
        )
        .with_column(pl.col('Group').to_physical().alias('Physical'))
        .collect()
    )
result
shape: (4, 4)
┌───────┬────────────┬─────────────┬──────────┐
│ Group ┆ Datum ┆ TestEndDate ┆ Physical │
│ --- ┆ --- ┆ --- ┆ --- │
│ cat ┆ date ┆ date ┆ u32 │
╞═══════╪════════════╪═════════════╪══════════╡
│ A ┆ 2022-01-01 ┆ 2022-01-04 ┆ 0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ A ┆ 2022-01-02 ┆ 2022-01-04 ┆ 0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ B ┆ 2022-01-03 ┆ 2022-01-04 ┆ 1 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ B ┆ 2022-01-04 ┆ 2022-01-04 ┆ 1 │
└───────┴────────────┴─────────────┴──────────┘
result.get_column('Group').unique()
shape: (2,)
Series: 'Group' [cat]
[
"A"
"B"
]
Edit: Reading/Scanning external files
You can read/scan external files with a string cache in effect. For example, below I've saved our DataFrame to tmp.parquet.
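(For reference, a minimal sketch of how that file could have been written, assuming the df_prep DataFrame built earlier and DataFrame.write_parquet:)
# Sketch only: persist the earlier DataFrame so it can be re-read below
df_prep.write_parquet('tmp.parquet')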
If I use read_parquet with a string cache in effect, the Categorical variables are included in the string cache.
(Note: in the examples below, I'll use a Context Manager, to clearly delineate where the string cache is in effect.)
import polars as pl
with pl.StringCache():
    (
        pl.read_parquet('tmp.parquet')
        .with_column(
            pl.when(pl.col("Datum") >= pl.col("TestEndDate"))
            .then("B")
            .otherwise(pl.col("Group"))
            .alias("Group")
        )
        .with_column(pl.col('Group').to_physical().alias('Physical'))
    )
shape: (4, 4)
┌───────┬────────────┬─────────────┬──────────┐
│ Group ┆ Datum ┆ TestEndDate ┆ Physical │
│ --- ┆ --- ┆ --- ┆ --- │
│ cat ┆ date ┆ date ┆ u32 │
╞═══════╪════════════╪═════════════╪══════════╡
│ A ┆ 2022-01-01 ┆ 2022-01-04 ┆ 0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ A ┆ 2022-01-02 ┆ 2022-01-04 ┆ 0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ B ┆ 2022-01-03 ┆ 2022-01-04 ┆ 1 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ B ┆ 2022-01-04 ┆ 2022-01-04 ┆ 1 │
└───────┴────────────┴─────────────┴──────────┘
Notice that our Categorical values are correct. (The B values have the same underlying physical representation.)
However, if we move the read_parquet method outside the Context Manager (so that the DataFrame is created without a string cache), we have a problem.
df_prep = pl.read_parquet('tmp.parquet')
with pl.StringCache():
    (
        df_prep
        .with_column(
            pl.when(pl.col("Datum") >= pl.col("TestEndDate"))
            .then("B")
            .otherwise(pl.col("Group"))
            .alias("Group")
        )
        .with_column(pl.col('Group').to_physical().alias('Physical'))
    )
Traceback (most recent call last):
File "<stdin>", line 7, in <module>
File "/home/corey/.virtualenvs/StackOverflow/lib/python3.10/site-packages/polars/internals/dataframe/frame.py", line 4027, in with_column
self.lazy()
File "/home/corey/.virtualenvs/StackOverflow/lib/python3.10/site-packages/polars/internals/lazyframe/frame.py", line 803, in collect
return pli.wrap_df(ldf.collect())
exceptions.ComputeError: cannot combine categorical under a global string cache with a non cached categorical
The error message says it all.
Edit: Placing existing Categorical columns under a string cache
One way to correct the situation above (assuming that it's already too late to re-read your DataFrame with a string cache in effect) is to put a string cache into effect, cast the Categorical values back to strings, and then cast them back to Categorical.
Below, we'll use a shortcut to perform this for all Categorical columns in parallel, by specifying pl.Categorical in pl.col.
with pl.StringCache():
    (
        df_prep
        .with_columns([
            pl.col(pl.Categorical).cast(pl.Utf8).cast(pl.Categorical)
        ])
        .with_column(
            pl.when(pl.col("Datum") >= pl.col("TestEndDate"))
            .then("B")
            .otherwise(pl.col("Group"))
            .alias("Group")
        )
        .with_column(pl.col('Group').to_physical().alias('Physical'))
    )
shape: (4, 4)
┌───────┬────────────┬─────────────┬──────────┐
│ Group ┆ Datum ┆ TestEndDate ┆ Physical │
│ --- ┆ --- ┆ --- ┆ --- │
│ cat ┆ date ┆ date ┆ u32 │
╞═══════╪════════════╪═════════════╪══════════╡
│ A ┆ 2022-01-01 ┆ 2022-01-04 ┆ 0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ A ┆ 2022-01-02 ┆ 2022-01-04 ┆ 0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ B ┆ 2022-01-03 ┆ 2022-01-04 ┆ 1 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ B ┆ 2022-01-04 ┆ 2022-01-04 ┆ 1 │
└───────┴────────────┴─────────────┴──────────┘
And now our code works correctly again.
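As a quick sanity check, here is a sketch (my own addition, assuming the frame produced above is assigned to result inside the same StringCache block): unique on Group should once again return just the two expected categories.
with pl.StringCache():
    result = (
        df_prep
        .with_columns([pl.col(pl.Categorical).cast(pl.Utf8).cast(pl.Categorical)])
        .with_column(
            pl.when(pl.col("Datum") >= pl.col("TestEndDate"))
            .then("B")
            .otherwise(pl.col("Group"))
            .alias("Group")
        )
    )
    # expect shape (2,): "A" and "B"
    print(result.get_column("Group").unique())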

Related

Join between Polars dataframes with inequality conditions

I would like to join two dataframes, using an inequality condition (i.e. greater than) as the join condition.
Given two dataframes, I would like to get the result equivalent to the SQL written below.
stock_market_value = pl.DataFrame(
{
"date": [date(2022, 1, 1), date(2022, 2, 1), date(2022, 3, 1)],
"price": [10.00, 12.00, 14.00]
}
)
my_stock_orders = pl.DataFrame(
{
"date": [date(2022, 1, 15), date(2022, 2, 15)],
"quantity": [2, 5]
}
)
I have read that Polars supports join of type asof, but I don't think it applies to my case (maybe putting tolerance equal to infinity?).
For the sake of clarity, I wrote the join as a SQL statement.
SELECT m.date, m.price * o.quantity AS portfolio_value
FROM stock_market_value m LEFT JOIN my_stock_orders o
ON m.date >= o.date
Example query/output:
duckdb.sql("""
SELECT
m.date market_date,
o.date order_date,
price,
quantity,
price * quantity AS portfolio_value
FROM stock_market_value m LEFT JOIN my_stock_orders o
ON m.date >= o.date
""").pl()
shape: (4, 5)
┌─────────────┬────────────┬───────┬──────────┬─────────────────┐
│ market_date | order_date | price | quantity | portfolio_value │
│ --- | --- | --- | --- | --- │
│ date | date | f64 | i64 | f64 │
╞═════════════╪════════════╪═══════╪══════════╪═════════════════╡
│ 2022-01-01 | null | 10.0 | null | null │
│ 2022-02-01 | 2022-01-15 | 12.0 | 2 | 24.0 │
│ 2022-03-01 | 2022-01-15 | 14.0 | 2 | 28.0 │
│ 2022-03-01 | 2022-02-15 | 14.0 | 5 | 70.0 │
└─────────────┴────────────┴───────┴──────────┴─────────────────┘
Why asof() is not the solution
Comments suggested using asof, but it does not actually work in the way I expect.
Forward asof
result_fwd = stock_market_value.join_asof(
my_stock_orders, left_on="date", right_on="date", strategy="forward"
)
print(result_fwd)
shape: (3, 3)
┌────────────┬───────┬──────────┐
│ date ┆ price ┆ quantity │
│ --- ┆ --- ┆ --- │
│ date ┆ f64 ┆ i64 │
╞════════════╪═══════╪══════════╡
│ 2022-01-01 ┆ 10.0 ┆ 2 │
│ 2022-02-01 ┆ 12.0 ┆ 5 │
│ 2022-03-01 ┆ 14.0 ┆ null │
└────────────┴───────┴──────────┘
Backward asof
result_bwd = stock_market_value.join_asof(
my_stock_orders, left_on="date", right_on="date", strategy="backward"
)
print(result_bwd)
shape: (3, 3)
┌────────────┬───────┬──────────┐
│ date ┆ price ┆ quantity │
│ --- ┆ --- ┆ --- │
│ date ┆ f64 ┆ i64 │
╞════════════╪═══════╪══════════╡
│ 2022-01-01 ┆ 10.0 ┆ null │
│ 2022-02-01 ┆ 12.0 ┆ 2 │
│ 2022-03-01 ┆ 14.0 ┆ 5 │
└────────────┴───────┴──────────┘
Thanks!
You can do a join_asof. If you want to look forward, you should use the forward strategy:
stock_market_value.join_asof(
my_stock_orders,
on='date',
strategy='forward',
).with_columns((pl.col("price") * pl.col("quantity")).alias("value"))
┌────────────┬───────┬──────────┬───────┐
│ date ┆ price ┆ quantity ┆ value │
│ --- ┆ --- ┆ --- ┆ --- │
│ date ┆ f64 ┆ i64 ┆ f64 │
╞════════════╪═══════╪══════════╪═══════╡
│ 2022-01-01 ┆ 10.0 ┆ 2 ┆ 20.0 │
│ 2022-02-01 ┆ 12.0 ┆ 5 ┆ 60.0 │
│ 2022-03-01 ┆ 14.0 ┆ null ┆ null │
└────────────┴───────┴──────────┴───────┘
You can use join_asof to determine which records to exclude from the date logic, then perform a cartesian product + filter yourself on the remainder, then merge everything back together. The following implements what you want, although it's a little bit hacky.
Update: using Polars' native cross join instead of a self-defined Cartesian product function.
import polars as pl
from polars import col
from datetime import date
stock_market_value = pl.DataFrame({
"market_date": [date(2022, 1, 1), date(2022, 2, 1), date(2022, 3, 1)],
"price": [10.00, 12.00, 14.00]
})
stock_market_orders = pl.DataFrame({
"order_date": [date(2022, 1, 15), date(2022, 2, 15)],
"quantity": [2, 5]
})
# use a backwards join-asof to find rows in market_value that have no rows in orders with order date < market date
stock_market_value = stock_market_value.with_columns(
stock_market_value.join_asof(
stock_market_orders,
left_on="market_date",
right_on="order_date",
)["order_date"].is_not_null().alias("has_match")
)
nonmatched_rows = stock_market_value.filter(col("has_match")==False).drop("has_match")
# keep all other rows and perform a cartesian product
matched_rows = stock_market_value.filter(col("has_match")==True).drop("has_match")
df = matched_rows.join(stock_market_orders, how="cross")
# filter based on our join condition
df = df.filter(col("market_date") > col("order_date"))
# concatenate the unmatched with the filtered result for our final answer
df = pl.concat((nonmatched_rows, df), how="diagonal")
print(df)
Output:
shape: (4, 4)
┌─────────────┬───────┬────────────┬──────────┐
│ market_date ┆ price ┆ order_date ┆ quantity │
│ --- ┆ --- ┆ --- ┆ --- │
│ date ┆ f64 ┆ date ┆ i64 │
╞═════════════╪═══════╪════════════╪══════════╡
│ 2022-01-01 ┆ 10.0 ┆ null ┆ null │
│ 2022-02-01 ┆ 12.0 ┆ 2022-01-15 ┆ 2 │
│ 2022-03-01 ┆ 14.0 ┆ 2022-01-15 ┆ 2 │
│ 2022-03-01 ┆ 14.0 ┆ 2022-02-15 ┆ 5 │
└─────────────┴───────┴────────────┴──────────┘
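If you also want the portfolio_value column from the original SQL, one sketch (working from the df produced above and reusing the imported col helper) is to multiply price by quantity; the unmatched first row keeps a null, matching the SQL output.
# Sketch: derive portfolio_value as in the SQL statement from the question
df = df.with_columns([
    (col("price") * col("quantity")).alias("portfolio_value")
])
print(df)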

Polars solution to normalise groups by per-group reference value

I'm trying to use Polars to normalise the values of groups of entries by a single reference value per group.
In the example data below, I'm trying to generate the column normalised, which contains each value divided by the value where reference_state is "ref" for that group, i.e.:
group_id reference_state value normalised
1 ref 5 1.0
1 a 3 0.6
1 b 1 0.2
2 ref 4 1.0
2 a 8 2.0
2 b 2 0.5
This is straightforward in Pandas:
for (i, x) in df.groupby("group_id"):
    ref_val = x.loc[x["reference_state"] == "ref"]["value"]
    df.loc[df["group_id"] == i, "normalised"] = x["value"] / ref_val.to_list()[0]
Is there a way to do this in Polars?
Thanks in advance!
You can use a window function to make an expression operate on different groups via:
.over("group_id")
and then you can write the logic that divides each value by the group's "ref" value with:
pl.col("value") / pl.col("value").filter(pl.col("reference_state") == "ref").first()
Putting it all together:
df = pl.DataFrame({
"group_id": [1, 1, 1, 2, 2, 2],
"reference_state": ["ref", "a", "b", "ref", "a", "b"],
"value": [5, 3, 1, 4, 8, 2],
})
(df.with_columns([
(
pl.col("value") /
pl.col("value").filter(pl.col("reference_state") == "ref").first()
).over("group_id").alias("normalised")
]))
shape: (6, 4)
┌──────────┬─────────────────┬───────┬────────────┐
│ group_id ┆ reference_state ┆ value ┆ normalised │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ i64 ┆ f64 │
╞══════════╪═════════════════╪═══════╪════════════╡
│ 1 ┆ ref ┆ 5 ┆ 1.0 │
│ 1 ┆ a ┆ 3 ┆ 0.6 │
│ 1 ┆ b ┆ 1 ┆ 0.2 │
│ 2 ┆ ref ┆ 4 ┆ 1.0 │
│ 2 ┆ a ┆ 8 ┆ 2.0 │
│ 2 ┆ b ┆ 2 ┆ 0.5 │
└──────────┴─────────────────┴───────┴────────────┘
Here's one way to do it:
- create a temporary dataframe which, for each group_id, gives the value where reference_state is 'ref'
- join with that temporary dataframe
(
df.join(
df.filter(pl.col("reference_state") == "ref").select(["group_id", "value"]),
on="group_id",
)
.with_column((pl.col("value") / pl.col("value_right")).alias("normalised"))
.drop("value_right")
)
This gives you:
shape: (6, 4)
┌──────────┬─────────────────┬───────┬────────────┐
│ group_id ┆ reference_state ┆ value ┆ normalised │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ i64 ┆ f64 │
╞══════════╪═════════════════╪═══════╪════════════╡
│ 1 ┆ ref ┆ 5 ┆ 1.0 │
│ 1 ┆ a ┆ 3 ┆ 0.6 │
│ 1 ┆ b ┆ 1 ┆ 0.2 │
│ 2 ┆ ref ┆ 4 ┆ 1.0 │
│ 2 ┆ a ┆ 8 ┆ 2.0 │
│ 2 ┆ b ┆ 2 ┆ 0.5 │
└──────────┴─────────────────┴───────┴────────────┘

Python-Polars: How to filter categorical column with string list

I have a Polars dataframe like below:
df_cat = pl.DataFrame(
[
pl.Series("a_cat", ["c", "a", "b", "c", "b"], dtype=pl.Categorical),
pl.Series("b_cat", ["F", "G", "E", "G", "G"], dtype=pl.Categorical)
])
print(df_cat)
shape: (5, 2)
┌───────┬───────┐
│ a_cat ┆ b_cat │
│ --- ┆ --- │
│ cat ┆ cat │
╞═══════╪═══════╡
│ c ┆ F │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ a ┆ G │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b ┆ E │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ c ┆ G │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b ┆ G │
└───────┴───────┘
The following filter runs perfectly fine:
print(df_cat.filter(pl.col('a_cat') == 'c'))
shape: (2, 2)
┌───────┬───────┐
│ a_cat ┆ b_cat │
│ --- ┆ --- │
│ cat ┆ cat │
╞═══════╪═══════╡
│ c ┆ F │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ c ┆ G │
└───────┴───────┘
What I want is to use a list of strings to run the filter more efficiently. So I tried, and ended up with the following error message:
print(df_cat.filter(pl.col('a_cat').is_in(['a', 'c'])))
---------------------------------------------------------------------------
ComputeError Traceback (most recent call last)
d:\GitRepo\Test2\stockEMD3.ipynb Cell 9 in <cell line: 1>()
----> 1 print(df_cat.filter(pl.col('a_cat').is_in(['c'])))
File c:\ProgramData\Anaconda3\envs\charm3.9\lib\site-packages\polars\internals\dataframe\frame.py:2185, in DataFrame.filter(self, predicate)
2181 if _NUMPY_AVAILABLE and isinstance(predicate, np.ndarray):
2182 predicate = pli.Series(predicate)
2184 return (
-> 2185 self.lazy()
2186 .filter(predicate) # type: ignore[arg-type]
2187 .collect(no_optimization=True, string_cache=False)
2188 )
File c:\ProgramData\Anaconda3\envs\charm3.9\lib\site-packages\polars\internals\lazyframe\frame.py:660, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, string_cache, no_optimization, slice_pushdown)
650 projection_pushdown = False
652 ldf = self._ldf.optimization_toggle(
653 type_coercion,
654 predicate_pushdown,
(...)
658 slice_pushdown,
659 )
--> 660 return pli.wrap_df(ldf.collect())
ComputeError: joins/or comparisons on categorical dtypes can only happen if they are created under the same global string cache
From this Stackoverflow link I understand "You need to set a global string cache to compare categoricals created in different columns/lists.", but my questions are:
Why does the == single-string filter case work?
What is the proper way to filter a categorical column with a list of strings?
Thanks!
Actually, you don't need to set a global string cache to compare strings to Categorical variables. You can use cast to accomplish this.
Let's use this data. I've included the integer values that underlie the Categorical variables to demonstrate something later.
import polars as pl
df_cat = (
pl.DataFrame(
[
pl.Series("a_cat", ["c", "a", "b", "c", "X"], dtype=pl.Categorical),
pl.Series("b_cat", ["F", "G", "E", "S", "X"], dtype=pl.Categorical),
]
)
.with_column(
pl.all().to_physical().suffix('_phys')
)
)
df_cat
shape: (5, 4)
┌───────┬───────┬────────────┬────────────┐
│ a_cat ┆ b_cat ┆ a_cat_phys ┆ b_cat_phys │
│ --- ┆ --- ┆ --- ┆ --- │
│ cat ┆ cat ┆ u32 ┆ u32 │
╞═══════╪═══════╪════════════╪════════════╡
│ c ┆ F ┆ 0 ┆ 0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ a ┆ G ┆ 1 ┆ 1 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ b ┆ E ┆ 2 ┆ 2 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ c ┆ S ┆ 0 ┆ 3 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ X ┆ X ┆ 3 ┆ 4 │
└───────┴───────┴────────────┴────────────┘
Comparing a categorical variable to a string
If we cast a Categorical variable back to its string values, we can make any comparison we need. For example:
df_cat.filter(pl.col('a_cat').cast(pl.Utf8).is_in(['a', 'c']))
shape: (3, 4)
┌───────┬───────┬────────────┬────────────┐
│ a_cat ┆ b_cat ┆ a_cat_phys ┆ b_cat_phys │
│ --- ┆ --- ┆ --- ┆ --- │
│ cat ┆ cat ┆ u32 ┆ u32 │
╞═══════╪═══════╪════════════╪════════════╡
│ c ┆ F ┆ 0 ┆ 0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ a ┆ G ┆ 1 ┆ 1 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ c ┆ S ┆ 0 ┆ 3 │
└───────┴───────┴────────────┴────────────┘
Or, in a filter step, we can compare the string values of two Categorical variables that do not share the same string cache:
df_cat.filter(pl.col('a_cat').cast(pl.Utf8) == pl.col('b_cat').cast(pl.Utf8))
shape: (1, 4)
┌───────┬───────┬────────────┬────────────┐
│ a_cat ┆ b_cat ┆ a_cat_phys ┆ b_cat_phys │
│ --- ┆ --- ┆ --- ┆ --- │
│ cat ┆ cat ┆ u32 ┆ u32 │
╞═══════╪═══════╪════════════╪════════════╡
│ X ┆ X ┆ 3 ┆ 4 │
└───────┴───────┴────────────┴────────────┘
Notice that it is the string values being compared (not the integers underlying the two Categorical variables).
The equality operator on Categorical variables
The following statements are equivalent:
df_cat.filter((pl.col('a_cat') == 'a'))
df_cat.filter((pl.col('a_cat').cast(pl.Utf8) == 'a'))
The former is syntactic sugar for the latter, as the former is a common use case.
As the error states: ComputeError: joins/or comparisons on categorical dtypes can only happen if they are created under the same global string cache.
Comparisons of categorical values are only allowed under a global string cache. You really want to set one in such a case, as it speeds up comparisons and prevents expensive casts to strings.
Setting this on the start of your query will ensure it runs:
import polars as pl
pl.Config.set_global_string_cache()
This is a new answer based on the one from @ritchie46.
As of Polars 0.15.15, it is now:
import polars as pl
pl.toggle_string_cache(True)
A StringCache() context manager can also be used; see the Polars documentation:
with pl.StringCache():
    print(df_cat.filter(pl.col('a_cat').is_in(['a', 'c'])))

How to form dynamic expressions without breaking on types

Is there any way to make dynamic Polars expressions not break with errors?
Currently I'm just excluding the columns by type, but I'm wondering if there is a better way.
For example, I have a df coming from Parquet. If I just execute an expression on all columns, it might break for certain types. Instead, I want to contain these errors and possibly return a default value like None or -1 or something else.
import polars as pl
df = pl.scan_parquet("/path/to/data/*.parquet")
print(df.schema)
# Prints: {'date_time': <class 'polars.datatypes.Datetime'>, 'incident': <class 'polars.datatypes.Utf8'>, 'address': <class 'polars.datatypes.Utf8'>, 'city': <class 'polars.datatypes.Utf8'>, 'zipcode': <class 'polars.datatypes.Int32'>}
Now if i form generic expression on top of this, there are chances it may fail. For example,
# Finding positive count across all columns
# Fails due to: exceptions.ComputeError: cannot compare Utf8 with numeric data
print(df.select((pl.all() > 0).count().prefix("__positive_count_")).collect())
# Finding unique counts across all columns
# Fails due to: pyo3_runtime.PanicException: 'unique_counts' not implemented for datetime[ns] data types
print(df.select(pl.all().unique_counts().prefix("__unique_count_")).collect())
# Finding empty string count across all columns
# Fails due to: exceptions.SchemaError: Series dtype Int32 != utf8
# Note: this could have been avoided by doing an explict cast to string first
print(df.select((pl.all().str.lengths() > 0).count().prefix("__empty_count_")).collect())
I'll keep to things that work in lazy mode, as it appears that you are working in lazy mode with Parquet files.
Let's use this data as an example:
import polars as pl
from datetime import datetime
df = pl.DataFrame(
{
"col_int": [-2, -2, 0, 2, 2],
"col_float": [-20.0, -10, 10, 20, 20],
"col_date": pl.date_range(datetime(2020, 1, 1), datetime(2020, 5, 1), "1mo"),
"col_str": ["str1", "str2", "", None, "str5"],
"col_bool": [True, False, False, True, False],
}
).lazy()
df.collect()
shape: (5, 5)
┌─────────┬───────────┬─────────────────────┬─────────┬──────────┐
│ col_int ┆ col_float ┆ col_date ┆ col_str ┆ col_bool │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ datetime[ns] ┆ str ┆ bool │
╞═════════╪═══════════╪═════════════════════╪═════════╪══════════╡
│ -2 ┆ -20.0 ┆ 2020-01-01 00:00:00 ┆ str1 ┆ true │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ -2 ┆ -10.0 ┆ 2020-02-01 00:00:00 ┆ str2 ┆ false │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 0 ┆ 10.0 ┆ 2020-03-01 00:00:00 ┆ ┆ false │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 20.0 ┆ 2020-04-01 00:00:00 ┆ null ┆ true │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 20.0 ┆ 2020-05-01 00:00:00 ┆ str5 ┆ false │
└─────────┴───────────┴─────────────────────┴─────────┴──────────┘
Using the col Expression
One feature of the col expression is that you can supply a datatype, or even a list of datatypes. For example, if we want to restrict our query to floats, we can do the following:
df.select((pl.col(pl.Float64) > 0).sum().suffix("__positive_count_")).collect()
shape: (1, 1)
┌────────────────────────────┐
│ col_float__positive_count_ │
│ --- │
│ u32 │
╞════════════════════════════╡
│ 3 │
└────────────────────────────┘
(Note: (pl.col(...) > 0) creates a series of boolean values that need to be summed, not counted)
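To make the distinction concrete, here is a small sketch using col_float from the sample data above (the alias names are just for illustration):
df.select([
    (pl.col("col_float") > 0).sum().alias("n_true"),      # 3: number of True values
    (pl.col("col_float") > 0).count().alias("n_values"),  # 5: number of values counted
]).collect()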
To include more than one datatype, you can supply a list of datatypes to col.
df.select(
(pl.col([pl.Int64, pl.Float64]) > 0).sum().suffix("__positive_count_")
).collect()
shape: (1, 2)
┌──────────────────────────┬────────────────────────────┐
│ col_int__positive_count_ ┆ col_float__positive_count_ │
│ --- ┆ --- │
│ u32 ┆ u32 │
╞══════════════════════════╪════════════════════════════╡
│ 2 ┆ 3 │
└──────────────────────────┴────────────────────────────┘
You can also combine these into the same select statement if you'd like.
df.select(
[
(pl.col(pl.Utf8).str.lengths() == 0).sum().suffix("__empty_count"),
pl.col(pl.Utf8).is_null().sum().suffix("__null_count"),
(pl.col([pl.Float64, pl.Int64]) > 0).sum().suffix("_positive_count"),
]
).collect()
shape: (1, 4)
┌──────────────────────┬─────────────────────┬──────────────────────────┬────────────────────────┐
│ col_str__empty_count ┆ col_str__null_count ┆ col_float_positive_count ┆ col_int_positive_count │
│ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ u32 ┆ u32 ┆ u32 │
╞══════════════════════╪═════════════════════╪══════════════════════════╪════════════════════════╡
│ 1 ┆ 1 ┆ 3 ┆ 2 │
└──────────────────────┴─────────────────────┴──────────────────────────┴────────────────────────┘
The Cookbook has a handy list of datatypes.
Using the exclude expression
Another handy trick is to use the exclude expression. With this, we can select all columns except columns of certain datatypes. For example:
df.select(
[
pl.exclude(pl.Utf8).max().suffix("_max"),
pl.exclude([pl.Utf8, pl.Boolean]).min().suffix("_min"),
]
).collect()
shape: (1, 7)
┌─────────────┬───────────────┬─────────────────────┬──────────────┬─────────────┬───────────────┬─────────────────────┐
│ col_int_max ┆ col_float_max ┆ col_date_max ┆ col_bool_max ┆ col_int_min ┆ col_float_min ┆ col_date_min │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ datetime[ns] ┆ u32 ┆ i64 ┆ f64 ┆ datetime[ns] │
╞═════════════╪═══════════════╪═════════════════════╪══════════════╪═════════════╪═══════════════╪═════════════════════╡
│ 2 ┆ 20.0 ┆ 2020-05-01 00:00:00 ┆ 1 ┆ -2 ┆ -20.0 ┆ 2020-01-01 00:00:00 │
└─────────────┴───────────────┴─────────────────────┴──────────────┴─────────────┴───────────────┴─────────────────────┘
Unique counts
One caution: unique_counts results in Series of varying lengths.
df.select(pl.col("col_int").unique_counts().prefix(
"__unique_count_")).collect()
shape: (3, 1)
┌────────────────────────┐
│ __unique_count_col_int │
│ --- │
│ u32 │
╞════════════════════════╡
│ 2 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 │
└────────────────────────┘
df.select(pl.col("col_float").unique_counts().prefix(
"__unique_count_")).collect()
shape: (4, 1)
┌──────────────────────────┐
│ __unique_count_col_float │
│ --- │
│ u32 │
╞══════════════════════════╡
│ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 │
└──────────────────────────┘
As such, these should not be combined into the same results. Each column/Series of a DataFrame must have the same length.
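If what you actually need is a single, combinable number per column, one sketch (assuming a count of distinct values per column is acceptable instead of the per-value counts) is n_unique, which returns exactly one value per column:
df.select(pl.all().n_unique().suffix("__n_unique")).collect()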

How to create fields dynamically

Is there any way to create fields dynamically? I know there are some ways, but it would be better to know the best approach in Polars. For example, I want to add 12 shifted columns (lag1, lag2, lag3, ..., lagN) to an existing dataframe. How can this be achieved?
Thanks.
You can just use the Python language for that. Polars expressions are lazily evaluated, so you can create them anywhere: in a for loop, a function, a list comprehension, you name it.
Below I give an example of dynamically created lag columns: one by calling a function, assigning the result to a variable, and then using that variable; and several more with a list comprehension.
# some initial dataframe
df = pl.DataFrame({
"a": [1, 2, 3, 4, 5],
"b": [5, 4, 3, 2, 1]
})
# a function that returns a lazy evaluated expression
def lag(name: str, n: int) -> pl.Expr:
    return pl.col(name).shift(n).suffix(f"_lag_{n}")
# a lazy evaluated expression assigned to a variable
lag_foo = lag("a", 1)
out = df.select(
    [lag_foo]
    + [lag("b", i) for i in range(5)]  # create exprs with a list comprehension
)
print(out)
This outputs:
shape: (5, 6)
┌─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ a_lag_1 ┆ b_lag_0 ┆ b_lag_1 ┆ b_lag_2 ┆ b_lag_3 ┆ b_lag_4 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════════╪═════════╪═════════╪═════════╪═════════╪═════════╡
│ null ┆ 5 ┆ null ┆ null ┆ null ┆ null │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 4 ┆ 5 ┆ null ┆ null ┆ null │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 3 ┆ 4 ┆ 5 ┆ null ┆ null │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 2 ┆ 3 ┆ 4 ┆ 5 ┆ null │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ 1 ┆ 2 ┆ 3 ┆ 4 ┆ 5 │
└─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘
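Since the question asked for 12 lag columns added to the existing dataframe (select keeps only the selected columns), here is a sketch using with_columns together with the same lag helper defined above; note that with only 5 rows of sample data, the higher lags are entirely null.
# Sketch: add a_lag_1 .. a_lag_12 alongside the existing columns
out = df.with_columns([lag("a", i) for i in range(1, 13)])
print(out)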