Copying row values based on Config file from polars dataframe - python-polars

I have a dataframe consisting of an ID, Local, Entity, Field and Global column.
import polars as pl

# Creating a dictionary with the data
data = {
    'ID': [4, 4, 4, 4, 4],
    'Local': ['A', 'B', 'C', 'D', 'E'],
    'Field': ['P', 'Q', 'R', 'S', 'T'],
    'Entity': ['K', 'L', 'M', 'N', 'O'],
    'Global': ['F', 'G', 'H', 'I', 'J'],
}
# Creating the dataframe
table = pl.DataFrame(data)
print(table)
shape: (5, 5)
┌─────┬───────┬───────┬──────────┬────────┐
│ ID ┆ Local ┆ Field ┆ Entity ┆ Global │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ str ┆ str │
╞═════╪═══════╪═══════╪══════════╪════════╡
│ 4 ┆ A ┆ P ┆ K ┆ F │
│ 4 ┆ B ┆ Q ┆ L ┆ G │
│ 4 ┆ C ┆ R ┆ M ┆ H │
│ 4 ┆ D ┆ S ┆ N ┆ I │
│ 4 ┆ E ┆ T ┆ O ┆ J │
└─────┴───────┴───────┴──────────┴────────┘
Certain rows within the dataset need to be copied. For this purpose, a config file is provided with the following information:
copying:
  - column_name: P
    source_table: K
    destination_table: X
  - column_name: S
    source_table: N
    destination_table: W
In the config file, the column_name value refers to the Field column, source_table refers to the given Entity, and destination_table should become the new entry in the Entity column. The goal is to enrich the data based on existing rows (just with other tables).
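For reference, a config in this shape can be loaded with OmegaConf along these lines (a minimal sketch; the config.yaml filename is an assumption):
from omegaconf import OmegaConf

conf = OmegaConf.load("config.yaml")  # hypothetical path
print(conf.copying[0].column_name)    # "P"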
The solution should look like this:
shape: (7, 5)
┌─────┬───────┬───────┬──────────┬────────┐
│ ID ┆ Local ┆ Field ┆ Entity ┆ Global │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ str ┆ str │
╞═════╪═══════╪═══════╪══════════╪════════╡
│ 4 ┆ A ┆ P ┆ K ┆ F │
│ 4 ┆ B ┆ Q ┆ L ┆ G │
│ 4 ┆ C ┆ R ┆ M ┆ H │
│ 4 ┆ D ┆ S ┆ N ┆ I │
│ 4 ┆ E ┆ T ┆ O ┆ J │
│ 4 ┆ A ┆ P ┆ X ┆ F │
│ 4 ┆ D ┆ S ┆ W ┆ I │
└─────┴───────┴───────┴──────────┴────────┘
The dataset is a Polars DataFrame and the config file is loaded with OmegaConf. I tried it with this code:
conf.copying = [
    {"column_name": "P", "source_table": "K", "destination_table": "X"},
    {"column_name": "S", "source_table": "N", "destination_table": "W"},
]

# Iterate through the config file
for i in range(len(conf.copying)):
    # Select the rows from the table dataframe that match the
    # column_name and source_table fields in the config
    match_rows = table.filter(
        (pl.col("Field") == conf.copying[i]["column_name"])
        & (pl.col("Entity") == conf.copying[i]["source_table"])
    )
    # Drop the Entity column
    match_rows = match_rows.select(
        [
            "ID",
            "Local",
            "Field",
            "Global",
        ]
    )
    # Add the column Entity with the destination_table value
    match_rows = match_rows.with_columns(
        pl.lit(conf.copying[i]["destination_table"]).alias("Entity")
    )
    # Restore the original column order
    match_rows = match_rows[
        [
            "ID",
            "Local",
            "Field",
            "Entity",
            "Global",
        ]
    ]
    # Append the new rows to the original table dataframe
    df_copy = match_rows.vstack(table)
However, the data is not copied and appended to the existing dataset as expected. What am I doing wrong?

It's probably better if you extract all of the rows you want in one go.
One possible approach is to build a list of when/then expressions:
rules = [
    pl.when(
        (pl.col("Field") == rule["column_name"])
        & (pl.col("Entity") == rule["source_table"])
    )
    .then(pl.lit(rule["destination_table"]))  # pl.lit ensures a literal value, not a column lookup
    .alias("Entity")
    for rule in conf.copying
]
You can use pl.coalesce to combine the rules into a single expression and .drop_nulls() to remove the non-matches.
>>> table.with_columns(pl.coalesce(rules)).drop_nulls()
shape: (2, 5)
┌─────┬───────┬───────┬────────┬────────┐
│ ID | Local | Field | Entity | Global │
│ --- | --- | --- | --- | --- │
│ i64 | str | str | str | str │
╞═════╪═══════╪═══════╪════════╪════════╡
│ 4 | A | P | X | F │
├─────┼───────┼───────┼────────┼────────┤
│ 4 | D | S | W | I │
└─────┴───────┴───────┴────────┴────────┘
This can then be added to the original dataframe.
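For example, a sketch using the rules list from above:
matched = table.with_columns(pl.coalesce(rules)).drop_nulls()
df = pl.concat([table, matched])  # original rows plus the copied rows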
Another style I've seen is to start with an "empty" when/then to build a .when().then() chain, e.g.
rules = pl.when(False).then(None)
for rule in conf.copying:
    rules = rules.when(
        (pl.col("Field") == rule["column_name"])
        & (pl.col("Entity") == rule["source_table"])
    ).then(pl.lit(rule["destination_table"]))
>>> table.with_columns(rules.alias("Entity")).drop_nulls()
shape: (2, 5)
┌─────┬───────┬───────┬────────┬────────┐
│ ID | Local | Field | Entity | Global │
│ --- | --- | --- | --- | --- │
│ i64 | str | str | str | str │
╞═════╪═══════╪═══════╪════════╪════════╡
│ 4 | A | P | X | F │
├─────┼───────┼───────┼────────┼────────┤
│ 4 | D | S | W | I │
└─────┴───────┴───────┴────────┴────────┘
I'm not sure it would make much difference to filter the rows first and then do the Entity modification - pl.any_horizontal() can be used to combine the per-rule predicates if you take that route:
table.filter(
    pl.any_horizontal([
        (pl.col("Field") == rule["column_name"])
        & (pl.col("Entity") == rule["source_table"])
        for rule in conf.copying
    ])
)
shape: (2, 5)
┌─────┬───────┬───────┬────────┬────────┐
│ ID | Local | Field | Entity | Global │
│ --- | --- | --- | --- | --- │
│ i64 | str | str | str | str │
╞═════╪═══════╪═══════╪════════╪════════╡
│ 4 | A | P | K | F │
├─────┼───────┼───────┼────────┼────────┤
│ 4 | D | S | N | I │
└─────┴───────┴───────┴────────┴────────┘
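Putting that together, the filtered rows could then be given their new Entity values and appended; a sketch reusing rules and conf from above:
predicates = [
    (pl.col("Field") == rule["column_name"])
    & (pl.col("Entity") == rule["source_table"])
    for rule in conf.copying
]
copies = table.filter(pl.any_horizontal(predicates)).with_columns(pl.coalesce(rules))
df = pl.concat([table, copies])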

Related

How can I use when, then and otherwise with multiple conditions in polars?

I have a data set with three columns. Column A is to be checked for strings. If the string matches foo or spam, the values in the same row for the other two columns L and G should be changed to XX. For this I have tried the following.
df = pl.DataFrame(
    {
        "A": ["foo", "ham", "spam", "egg"],
        "L": ["A54", "A12", "B84", "C12"],
        "G": ["X34", "C84", "G96", "L6"],
    }
)
print(df)
print(df)
shape: (4, 3)
┌──────┬─────┬─────┐
│ A ┆ L ┆ G │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞══════╪═════╪═════╡
│ foo ┆ A54 ┆ X34 │
│ ham ┆ A12 ┆ C84 │
│ spam ┆ B84 ┆ G96 │
│ egg ┆ C12 ┆ L6 │
└──────┴─────┴─────┘
Expected outcome:
shape: (4, 3)
┌──────┬─────┬─────┐
│ A ┆ L ┆ G │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞══════╪═════╪═════╡
│ foo ┆ XX ┆ XX │
│ ham ┆ A12 ┆ C84 │
│ spam ┆ XX ┆ XX │
│ egg ┆ C12 ┆ L6 │
└──────┴─────┴─────┘
I tried this:
df = df.with_column(
    pl.when((pl.col("A") == "foo") | (pl.col("A") == "spam"))
    .then((pl.col("L")= "XX") & (pl.col("G")= "XX"))
    .otherwise((pl.col("L")) & (pl.col("G")))
)
However, this does not work. Can someone help me with this?
For setting multiple columns to the same value you could use:
df.with_columns(
    pl.when(pl.col("A").is_in(["foo", "spam"]))
    .then(pl.lit("XX"))
    .otherwise(pl.col(["L", "G"]))
    .keep_name()
)
shape: (4, 3)
┌──────┬─────┬─────┐
│ A | L | G │
│ --- | --- | --- │
│ str | str | str │
╞══════╪═════╪═════╡
│ foo | XX | XX │
├──────┼─────┼─────┤
│ ham | A12 | C84 │
├──────┼─────┼─────┤
│ spam | XX | XX │
├──────┼─────┼─────┤
│ egg | C12 | L6 │
└──────┴─────┴─────┘
.is_in() can be used instead of multiple == x | == y chains.
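For comparison, these two predicates are equivalent:
(pl.col("A") == "foo") | (pl.col("A") == "spam")
pl.col("A").is_in(["foo", "spam"])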
To update multiple columns at once with different values you could use .map() and a dictionary:
df.with_columns(
    pl.when(pl.col("A").is_in(["foo", "spam"]))
    .then(
        pl.col(["L", "G"]).map(
            lambda column: {
                "L": "XX",
                "G": "YY",
            }.get(column.name)
        )
    )
    .otherwise(pl.col(["L", "G"]))
)
shape: (4, 3)
┌──────┬─────┬─────┐
│ A | L | G │
│ --- | --- | --- │
│ str | str | str │
╞══════╪═════╪═════╡
│ foo | XX | YY │
├──────┼─────┼─────┤
│ ham | A12 | C84 │
├──────┼─────┼─────┤
│ spam | XX | YY │
├──────┼─────┼─────┤
│ egg | C12 | L6 │
└──────┴─────┴─────┘
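Another way to express different per-column values is one when/then per column; a sketch (pl.lit marks the replacement values as literals rather than column names):
condition = pl.col("A").is_in(["foo", "spam"])
df.with_columns([
    pl.when(condition).then(pl.lit("XX")).otherwise(pl.col("L")).alias("L"),
    pl.when(condition).then(pl.lit("YY")).otherwise(pl.col("G")).alias("G"),
])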

Join between Polars dataframes with inequality conditions

I would like to join two dataframes using an inequality condition, i.e. greater than, as the join condition.
Given two dataframes, I would like to get the result equivalent to the SQL written below.
from datetime import date

import polars as pl

stock_market_value = pl.DataFrame(
    {
        "date": [date(2022, 1, 1), date(2022, 2, 1), date(2022, 3, 1)],
        "price": [10.00, 12.00, 14.00],
    }
)
my_stock_orders = pl.DataFrame(
    {
        "date": [date(2022, 1, 15), date(2022, 2, 15)],
        "quantity": [2, 5],
    }
)
I have read that Polars supports asof joins, but I don't think they apply to my case (maybe by setting the tolerance to infinity?).
For the sake of clarity, I wrote the join in the form of an SQL statement.
SELECT m.date, m.price * o.quantity AS portfolio_value
FROM stock_market_value m LEFT JOIN my_stock_orders o
ON m.date >= o.date
Example query/output:
import duckdb

duckdb.sql("""
    SELECT
        m.date market_date,
        o.date order_date,
        price,
        quantity,
        price * quantity AS portfolio_value
    FROM stock_market_value m LEFT JOIN my_stock_orders o
    ON m.date >= o.date
""").pl()
shape: (4, 5)
┌─────────────┬────────────┬───────┬──────────┬─────────────────┐
│ market_date | order_date | price | quantity | portfolio_value │
│ --- | --- | --- | --- | --- │
│ date | date | f64 | i64 | f64 │
╞═════════════╪════════════╪═══════╪══════════╪═════════════════╡
│ 2022-01-01 | null | 10.0 | null | null │
│ 2022-02-01 | 2022-01-15 | 12.0 | 2 | 24.0 │
│ 2022-03-01 | 2022-01-15 | 14.0 | 2 | 28.0 │
│ 2022-03-01 | 2022-02-15 | 14.0 | 5 | 70.0 │
└─────────────┴────────────┴───────┴──────────┴─────────────────┘
Why asof() is not the solution
Comments suggested using asof, but it does not actually work the way I expect.
Forward asof
result_fwd = stock_market_value.join_asof(
    my_stock_orders, left_on="date", right_on="date", strategy="forward"
)
print(result_fwd)
shape: (3, 3)
┌────────────┬───────┬──────────┐
│ date ┆ price ┆ quantity │
│ --- ┆ --- ┆ --- │
│ date ┆ f64 ┆ i64 │
╞════════════╪═══════╪══════════╡
│ 2022-01-01 ┆ 10.0 ┆ 2 │
│ 2022-02-01 ┆ 12.0 ┆ 5 │
│ 2022-03-01 ┆ 14.0 ┆ null │
└────────────┴───────┴──────────┘
Backward asof
result_bwd = stock_market_value.join_asof(
    my_stock_orders, left_on="date", right_on="date", strategy="backward"
)
print(result_bwd)
shape: (3, 3)
┌────────────┬───────┬──────────┐
│ date ┆ price ┆ quantity │
│ --- ┆ --- ┆ --- │
│ date ┆ f64 ┆ i64 │
╞════════════╪═══════╪══════════╡
│ 2022-01-01 ┆ 10.0 ┆ null │
│ 2022-02-01 ┆ 12.0 ┆ 2 │
│ 2022-03-01 ┆ 14.0 ┆ 5 │
└────────────┴───────┴──────────┘
Thanks!
You can do a join_asof. If you want to look forward, you should use the forward strategy:
stock_market_value.join_asof(
    my_stock_orders,
    on='date',
    strategy='forward',
).with_columns((pl.col("price") * pl.col("quantity")).alias("value"))
shape: (3, 4)
┌────────────┬───────┬──────────┬───────┐
│ date ┆ price ┆ quantity ┆ value │
│ --- ┆ --- ┆ --- ┆ --- │
│ date ┆ f64 ┆ i64 ┆ f64 │
╞════════════╪═══════╪══════════╪═══════╡
│ 2022-01-01 ┆ 10.0 ┆ 2 ┆ 20.0 │
│ 2022-02-01 ┆ 12.0 ┆ 5 ┆ 60.0 │
│ 2022-03-01 ┆ 14.0 ┆ null ┆ null │
└────────────┴───────┴──────────┴───────┘
You can use join_asof to determine which records to exclude from the date logic, then perform a cartesian product + filter yourself on the remainder, then merge everything back together. The following implements what you want, although it's a little bit hacky.
Update: using Polars' native cross join instead of a self-defined cartesian-product function.
import polars as pl
from polars import col
from datetime import date

stock_market_value = pl.DataFrame({
    "market_date": [date(2022, 1, 1), date(2022, 2, 1), date(2022, 3, 1)],
    "price": [10.00, 12.00, 14.00],
})
stock_market_orders = pl.DataFrame({
    "order_date": [date(2022, 1, 15), date(2022, 2, 15)],
    "quantity": [2, 5],
})

# Use a backward join_asof to flag rows in market_value that have at least
# one order with order_date on or before the market date
stock_market_value = stock_market_value.with_columns(
    stock_market_value.join_asof(
        stock_market_orders,
        left_on="market_date",
        right_on="order_date",
    )["order_date"].is_not_null().alias("has_match")
)
nonmatched_rows = stock_market_value.filter(~col("has_match")).drop("has_match")
# Keep all other rows and perform a cartesian product
matched_rows = stock_market_value.filter(col("has_match")).drop("has_match")
df = matched_rows.join(stock_market_orders, how="cross")
# Filter based on our join condition
df = df.filter(col("market_date") > col("order_date"))
# Concatenate the unmatched rows with the filtered result for our final answer
df = pl.concat((nonmatched_rows, df), how="diagonal")
print(df)
print(df)
Output:
shape: (4, 4)
┌─────────────┬───────┬────────────┬──────────┐
│ market_date ┆ price ┆ order_date ┆ quantity │
│ --- ┆ --- ┆ --- ┆ --- │
│ date ┆ f64 ┆ date ┆ i64 │
╞═════════════╪═══════╪════════════╪══════════╡
│ 2022-01-01 ┆ 10.0 ┆ null ┆ null │
│ 2022-02-01 ┆ 12.0 ┆ 2022-01-15 ┆ 2 │
│ 2022-03-01 ┆ 14.0 ┆ 2022-01-15 ┆ 2 │
│ 2022-03-01 ┆ 14.0 ┆ 2022-02-15 ┆ 5 │
└─────────────┴───────┴────────────┴──────────┘
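Note: more recent Polars releases (1.7+) ship a native inequality join, DataFrame.join_where, which expresses the SQL condition directly. A sketch (version-dependent; it has inner-join semantics, so the unmatched market rows would still need to be concatenated back, e.g. diagonally, to mimic the LEFT JOIN):
result = stock_market_value.join_where(
    stock_market_orders,
    pl.col("market_date") >= pl.col("order_date"),
)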

Complex asof joins with option to select between duplicates + strictly less/greater than + use utf8

Is there a way to perform something similar to an asof join, but with:
- the option to select which element to join with (e.g. first, last) if there are duplicates
- the option to join with only strictly less/greater than
- the ability to use utf-8
Here's a code example:
import polars as pl

df1 = pl.DataFrame({
    'by_1': ['X', 'X', 'Y', 'Y'] * 8,
    'by_2': ['X', 'Y', 'X', 'Y'] * 8,
    'on_1': ['A'] * 16 + ['C'] * 16,
    'on_2': (['A'] * 8 + ['C'] * 8) * 2,
    '__index__': list(range(32)),
})
df2 = pl.DataFrame([
    {'by_1': 'Y', 'by_2': 'Y', 'on_1': 'B', 'on_2': 'A'},
    {'by_1': 'Y', 'by_2': 'Y', 'on_1': 'C', 'on_2': 'A'},
    {'by_1': 'Y', 'by_2': 'Z', 'on_1': 'A', 'on_2': 'A'},
])
df1:
┌──────┬──────┬──────┬──────┬───────────┐
│ by_1 ┆ by_2 ┆ on_1 ┆ on_2 ┆ __index__ │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str ┆ i64 │
╞══════╪══════╪══════╪══════╪═══════════╡
│ X ┆ X ┆ A ┆ A ┆ 0 │
│ X ┆ Y ┆ A ┆ A ┆ 1 │
│ Y ┆ X ┆ A ┆ A ┆ 2 │
│ Y ┆ Y ┆ A ┆ A ┆ 3 │
│ X ┆ X ┆ A ┆ A ┆ 4 │
│ X ┆ Y ┆ A ┆ A ┆ 5 │
│ Y ┆ X ┆ A ┆ A ┆ 6 │
│ Y ┆ Y ┆ A ┆ A ┆ 7 │
│ X ┆ X ┆ A ┆ C ┆ 8 │
│ X ┆ Y ┆ A ┆ C ┆ 9 │
│ Y ┆ X ┆ A ┆ C ┆ 10 │
│ Y ┆ Y ┆ A ┆ C ┆ 11 │
│ X ┆ X ┆ A ┆ C ┆ 12 │
│ X ┆ Y ┆ A ┆ C ┆ 13 │
│ Y ┆ X ┆ A ┆ C ┆ 14 │
│ Y ┆ Y ┆ A ┆ C ┆ 15 │
│ X ┆ X ┆ C ┆ A ┆ 16 │
│ X ┆ Y ┆ C ┆ A ┆ 17 │
│ Y ┆ X ┆ C ┆ A ┆ 18 │
│ Y ┆ Y ┆ C ┆ A ┆ 19 │
│ X ┆ X ┆ C ┆ A ┆ 20 │
│ X ┆ Y ┆ C ┆ A ┆ 21 │
│ Y ┆ X ┆ C ┆ A ┆ 22 │
│ Y ┆ Y ┆ C ┆ A ┆ 23 │
│ X ┆ X ┆ C ┆ C ┆ 24 │
│ X ┆ Y ┆ C ┆ C ┆ 25 │
│ Y ┆ X ┆ C ┆ C ┆ 26 │
│ Y ┆ Y ┆ C ┆ C ┆ 27 │
│ X ┆ X ┆ C ┆ C ┆ 28 │
│ X ┆ Y ┆ C ┆ C ┆ 29 │
│ Y ┆ X ┆ C ┆ C ┆ 30 │
│ Y ┆ Y ┆ C ┆ C ┆ 31 │
└──────┴──────┴──────┴──────┴───────────┘
df2:
┌──────┬──────┬──────┬──────┐
│ by_1 ┆ by_2 ┆ on_1 ┆ on_2 │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str │
╞══════╪══════╪══════╪══════╡
│ Y ┆ Y ┆ B ┆ A │
│ Y ┆ Y ┆ C ┆ A │
│ Y ┆ Z ┆ A ┆ A │
└──────┴──────┴──────┴──────┘
# Case 1 - Less Than (lt)
df2.join_asof_lt(
    df1,
    by=['by_1', 'by_2'],
    on=['on_1', 'on_2'],
    lt_select_eq='first',
)
┌──────┬──────┬──────┬──────┬───────────┐
│ by_1 ┆ by_2 ┆ on_1 ┆ on_2 ┆ __index__ │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str ┆ i64 │
╞══════╪══════╪══════╪══════╪═══════════╡
│ Y ┆ Y ┆ B ┆ A ┆ 11 │ # First strictly less than is ('Y', 'Y'), ('A', 'C'), which exists at index 11 and 15. Since lt_select_eq is 'first', choose 11
│ Y ┆ Y ┆ C ┆ A ┆ 11 │ # First strictly less than is ('Y', 'Y'), ('A', 'C'), which exists at index 11 and 15. Since lt_select_eq is 'first', choose 11
│ Y ┆ Z ┆ A ┆ A ┆ null │ # Group (Y, Z) does not exist, so return null
└──────┴──────┴──────┴──────┴───────────┘
df2.join_asof_lt(
    df1,
    by=['by_1', 'by_2'],
    on=['on_1', 'on_2'],
    lt_select_eq='last',
)
┌──────┬──────┬──────┬──────┬───────────┐
│ by_1 ┆ by_2 ┆ on_1 ┆ on_2 ┆ __index__ │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str ┆ i64 │
╞══════╪══════╪══════╪══════╪═══════════╡
│ Y ┆ Y ┆ B ┆ A ┆ 15 │ # First strictly less than is ('Y', 'Y'), ('A', 'C'), which exists at index 11 and 15. Since lt_select_eq is 'last', choose 15
│ Y ┆ Y ┆ C ┆ A ┆ 15 │ # First strictly less than is ('Y', 'Y'), ('A', 'C'), which exists at index 11 and 15. Since lt_select_eq is 'last', choose 15
│ Y ┆ Z ┆ A ┆ A ┆ null │ # Group (Y, Z) does not exist, so return null
└──────┴──────┴──────┴──────┴───────────┘
# Case 2 - Less Than or Equal To (leq)
df2.join_asof_leq(
    df1,
    by=['by_1', 'by_2'],
    on=['on_1', 'on_2'],
    lt_select_eq='first',
    eq_select_eq='last',
)
┌──────┬──────┬──────┬──────┬───────────┐
│ by_1 ┆ by_2 ┆ on_1 ┆ on_2 ┆ __index__ │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str ┆ i64 │
╞══════╪══════╪══════╪══════╪═══════════╡
│ Y ┆ Y ┆ B ┆ A ┆ 11 │ # First less than or equal to is ('Y', 'Y'), ('A', 'C'), which exists at index 11 and 15. Since lt_select_eq is 'first', choose 11
│ Y ┆ Y ┆ C ┆ A ┆ 23 │ # First less than or equal to is ('Y', 'Y'), ('C', 'A'), which exists at index 19 and 23. Since eq_select_eq is 'last', choose 23
│ Y ┆ Z ┆ A ┆ A ┆ null │ # Group (Y, Z) does not exist, so return null
└──────┴──────┴──────┴──────┴───────────┘
df2.join_asof_leq(
    df1,
    by=['by_1', 'by_2'],
    on=['on_1', 'on_2'],
    lt_select_eq='last',
    eq_select_eq='first',
)
┌──────┬──────┬──────┬──────┬───────────┐
│ by_1 ┆ by_2 ┆ on_1 ┆ on_2 ┆ __index__ │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str ┆ i64 │
╞══════╪══════╪══════╪══════╪═══════════╡
│ Y ┆ Y ┆ B ┆ A ┆ 15 │ # First less than or equal to is ('Y', 'Y'), ('A', 'C'), which exists at index 11 and 15. Since lt_select_eq is 'last', choose 15
│ Y ┆ Y ┆ C ┆ A ┆ 19 │ # First less than or equal to is ('Y', 'Y'), ('C', 'A'), which exists at index 19 and 23. Since eq_select_eq is 'first', choose 19
│ Y ┆ Z ┆ A ┆ A ┆ null │ # Group (Y, Z) does not exist, so return null
└──────┴──────┴──────┴──────┴───────────┘
These examples are for lt / leq, but it could also be gt / geq. Thanks!
I don't follow why indexes 3 and 7 are not to be classed as less than, but as an example:
df3 = df2.join(df1, on=["by_1", "by_2"], how="left")

df3.filter(
    pl.col("__index__").is_null()
    | (pl.col("on_1_right") < pl.col("on_1"))
)
shape: (9, 7)
┌──────┬──────┬──────┬──────┬────────────┬────────────┬───────────┐
│ by_1 | by_2 | on_1 | on_2 | on_1_right | on_2_right | __index__ │
│ --- | --- | --- | --- | --- | --- | --- │
│ str | str | str | str | str | str | i64 │
╞══════╪══════╪══════╪══════╪════════════╪════════════╪═══════════╡
│ Y | Y | B | A | A | A | 3 │
│ Y | Y | B | A | A | A | 7 │
│ Y | Y | B | A | A | C | 11 │
│ Y | Y | B | A | A | C | 15 │
│ Y | Y | C | A | A | A | 3 │
│ Y | Y | C | A | A | A | 7 │
│ Y | Y | C | A | A | C | 11 │
│ Y | Y | C | A | A | C | 15 │
│ Y | Z | A | A | null | null | null │
└──────┴──────┴──────┴──────┴────────────┴────────────┴───────────┘
Get "closest" match per group:
group_keys = ["by_1", "by_2", "on_1", "on_2"]

df3 = df2.join(df1, on=["by_1", "by_2"], how="left")
result = (
    df3
    .filter(
        pl.col("__index__").is_null()
        | (pl.col("on_1") > pl.col("on_1_right"))
    )
    .filter(
        pl.col(["on_1_right", "on_2_right"])
        == pl.col(["on_1_right", "on_2_right"]).last().over(group_keys)
    )
)
print(result)
shape: (5, 7)
┌──────┬──────┬──────┬──────┬────────────┬────────────┬───────────┐
│ by_1 | by_2 | on_1 | on_2 | on_1_right | on_2_right | __index__ │
│ --- | --- | --- | --- | --- | --- | --- │
│ str | str | str | str | str | str | i64 │
╞══════╪══════╪══════╪══════╪════════════╪════════════╪═══════════╡
│ Y | Y | B | A | A | C | 11 │
│ Y | Y | B | A | A | C | 15 │
│ Y | Y | C | A | A | C | 11 │
│ Y | Y | C | A | A | C | 15 │
│ Y | Z | A | A | null | null | null │
└──────┴──────┴──────┴──────┴────────────┴────────────┴───────────┘
If you .groupby(group_keys) that result, you can use .first() / .last() to pick between the duplicates:
groups = result.groupby(group_keys)
>>> groups.agg(pl.col("__index__").first())
shape: (3, 5)
┌──────┬──────┬──────┬──────┬───────────┐
│ by_1 | by_2 | on_1 | on_2 | __index__ │
│ --- | --- | --- | --- | --- │
│ str | str | str | str | i64 │
╞══════╪══════╪══════╪══════╪═══════════╡
│ Y | Y | C | A | 11 │
│ Y | Z | A | A | null │
│ Y | Y | B | A | 11 │
└──────┴──────┴──────┴──────┴───────────┘
>>> groups.agg(pl.col("__index__").last())
shape: (3, 5)
┌──────┬──────┬──────┬──────┬───────────┐
│ by_1 | by_2 | on_1 | on_2 | __index__ │
│ --- | --- | --- | --- | --- │
│ str | str | str | str | i64 │
╞══════╪══════╪══════╪══════╪═══════════╡
│ Y | Y | B | A | 15 │
│ Y | Z | A | A | null │
│ Y | Y | C | A | 15 │
└──────┴──────┴──────┴──────┴───────────┘
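If the intent is a strict lexicographic comparison over both on columns, that condition can also be spelled out explicitly; a sketch:
lex_lt = (
    (pl.col("on_1_right") < pl.col("on_1"))
    | ((pl.col("on_1_right") == pl.col("on_1")) & (pl.col("on_2_right") < pl.col("on_2")))
)
df3.filter(pl.col("__index__").is_null() | lex_lt)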

Python-Polars: How to filter categorical column with string list

I have a Polars dataframe like below:
import polars as pl

df_cat = pl.DataFrame(
    [
        pl.Series("a_cat", ["c", "a", "b", "c", "b"], dtype=pl.Categorical),
        pl.Series("b_cat", ["F", "G", "E", "G", "G"], dtype=pl.Categorical),
    ]
)
print(df_cat)
shape: (5, 2)
┌───────┬───────┐
│ a_cat ┆ b_cat │
│ --- ┆ --- │
│ cat ┆ cat │
╞═══════╪═══════╡
│ c ┆ F │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ a ┆ G │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b ┆ E │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ c ┆ G │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b ┆ G │
└───────┴───────┘
The following filter runs perfectly fine:
print(df_cat.filter(pl.col('a_cat') == 'c'))
shape: (2, 2)
┌───────┬───────┐
│ a_cat ┆ b_cat │
│ --- ┆ --- │
│ cat ┆ cat │
╞═══════╪═══════╡
│ c ┆ F │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ c ┆ G │
└───────┴───────┘
What I want is to use a list of strings to run the filter more efficiently. I tried this and ended up with the following error message:
print(df_cat.filter(pl.col('a_cat').is_in(['a', 'c'])))
---------------------------------------------------------------------------
ComputeError Traceback (most recent call last)
d:\GitRepo\Test2\stockEMD3.ipynb Cell 9 in <cell line: 1>()
----> 1 print(df_cat.filter(pl.col('a_cat').is_in(['c'])))
File c:\ProgramData\Anaconda3\envs\charm3.9\lib\site-packages\polars\internals\dataframe\frame.py:2185, in DataFrame.filter(self, predicate)
2181 if _NUMPY_AVAILABLE and isinstance(predicate, np.ndarray):
2182 predicate = pli.Series(predicate)
2184 return (
-> 2185 self.lazy()
2186 .filter(predicate) # type: ignore[arg-type]
2187 .collect(no_optimization=True, string_cache=False)
2188 )
File c:\ProgramData\Anaconda3\envs\charm3.9\lib\site-packages\polars\internals\lazyframe\frame.py:660, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, string_cache, no_optimization, slice_pushdown)
650 projection_pushdown = False
652 ldf = self._ldf.optimization_toggle(
653 type_coercion,
654 predicate_pushdown,
(...)
658 slice_pushdown,
659 )
--> 660 return pli.wrap_df(ldf.collect())
ComputeError: joins/or comparisons on categorical dtypes can only happen if they are created under the same global string cache
From this Stack Overflow link I understand "You need to set a global string cache to compare categoricals created in different columns/lists." but my questions are:
1. Why does the == single-string filter case work?
2. What is the proper way to filter a categorical column with a list of strings?
Thanks!
Actually, you don't need to set a global string cache to compare strings to Categorical variables. You can use cast to accomplish this.
Let's use this data. I've included the integer values that underlie the Categorical variables to demonstrate something later.
import polars as pl

df_cat = (
    pl.DataFrame(
        [
            pl.Series("a_cat", ["c", "a", "b", "c", "X"], dtype=pl.Categorical),
            pl.Series("b_cat", ["F", "G", "E", "S", "X"], dtype=pl.Categorical),
        ]
    )
    .with_column(
        pl.all().to_physical().suffix('_phys')
    )
)
df_cat
df_cat
shape: (5, 4)
┌───────┬───────┬────────────┬────────────┐
│ a_cat ┆ b_cat ┆ a_cat_phys ┆ b_cat_phys │
│ --- ┆ --- ┆ --- ┆ --- │
│ cat ┆ cat ┆ u32 ┆ u32 │
╞═══════╪═══════╪════════════╪════════════╡
│ c ┆ F ┆ 0 ┆ 0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ a ┆ G ┆ 1 ┆ 1 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ b ┆ E ┆ 2 ┆ 2 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ c ┆ S ┆ 0 ┆ 3 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ X ┆ X ┆ 3 ┆ 4 │
└───────┴───────┴────────────┴────────────┘
Comparing a categorical variable to a string
If we cast a Categorical variable back to its string values, we can make any comparison we need. For example:
df_cat.filter(pl.col('a_cat').cast(pl.Utf8).is_in(['a', 'c']))
shape: (3, 4)
┌───────┬───────┬────────────┬────────────┐
│ a_cat ┆ b_cat ┆ a_cat_phys ┆ b_cat_phys │
│ --- ┆ --- ┆ --- ┆ --- │
│ cat ┆ cat ┆ u32 ┆ u32 │
╞═══════╪═══════╪════════════╪════════════╡
│ c ┆ F ┆ 0 ┆ 0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ a ┆ G ┆ 1 ┆ 1 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ c ┆ S ┆ 0 ┆ 3 │
└───────┴───────┴────────────┴────────────┘
Or use it in a filter step to compare the string values of two Categorical variables that do not share the same string cache:
df_cat.filter(pl.col('a_cat').cast(pl.Utf8) == pl.col('b_cat').cast(pl.Utf8))
shape: (1, 4)
┌───────┬───────┬────────────┬────────────┐
│ a_cat ┆ b_cat ┆ a_cat_phys ┆ b_cat_phys │
│ --- ┆ --- ┆ --- ┆ --- │
│ cat ┆ cat ┆ u32 ┆ u32 │
╞═══════╪═══════╪════════════╪════════════╡
│ X ┆ X ┆ 3 ┆ 4 │
└───────┴───────┴────────────┴────────────┘
Notice that it is the string values being compared (not the integers underlying the two Categorical variables).
The equality operator on Categorical variables
The following statements are equivalent:
df_cat.filter((pl.col('a_cat') == 'a'))
df_cat.filter((pl.col('a_cat').cast(pl.Utf8) == 'a'))
The former is syntactic sugar for the latter, as the former is a common use case.
As the error states: ComputeError: joins/or comparisons on categorical dtypes can only happen if they are created under the same global string cache.
Comparisons of categorical values are only allowed under a global string cache. You really want to set this in such a case as it speeds up comparisons and prevents expensive casts to strings.
Setting this at the start of your query will ensure it runs:
import polars as pl
pl.Config.set_global_string_cache()
This is a new answer based on the one from @ritchie46.
As of Polars 0.15.15 it is now:
import polars as pl
pl.toggle_string_cache(True)
A pl.StringCache() context manager can also be used; see the Polars documentation:
with pl.StringCache():
    print(df_cat.filter(pl.col('a_cat').is_in(['a', 'c'])))
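For reference, newer Polars releases renamed the global toggle again; a sketch (version-dependent):
import polars as pl

pl.enable_string_cache()  # newer replacement for pl.toggle_string_cache(True)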

window agg over one value, but return another via Polars

I am trying to use Polars to do a window aggregate over one value, but map it back to another.
For example, I want to get the name of the max value in a group, instead of (or in combination with) just the max value.
Assuming an input of something like this:
|label|name|value|
|a. | foo| 1. |
|a. | bar| 2. |
|b. | baz| 1.5. |
|b. | boo| -1 |
# 'max_by' is not a real method, just using it to express what I'm trying to achieve.
df.select(col('label'), col('name').max_by('value').over('label'))
I want an output like this:
|label|name|
|a. | bar|
|b. | baz|
Ideally with the value too, but I know I can easily add that in via col('value').max().over('label').
|label|name|value|
|a. | bar| 2. |
|b. | baz| 1.5.|
You were close. There is a sort_by expression that can be used.
df.groupby('label').agg(pl.all().sort_by('value').last())
shape: (2, 3)
┌───────┬──────┬───────┐
│ label ┆ name ┆ value │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 │
╞═══════╪══════╪═══════╡
│ a. ┆ bar ┆ 2.0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b. ┆ baz ┆ 1.5 │
└───────┴──────┴───────┘
If you need a windowed version of this:
df.with_columns([
    pl.col(['name', 'value']).sort_by('value').last().over('label').suffix("_max")
])
shape: (4, 5)
┌───────┬──────┬───────┬──────────┬───────────┐
│ label ┆ name ┆ value ┆ name_max ┆ value_max │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 ┆ str ┆ f64 │
╞═══════╪══════╪═══════╪══════════╪═══════════╡
│ a. ┆ foo ┆ 1.0 ┆ bar ┆ 2.0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ a. ┆ bar ┆ 2.0 ┆ bar ┆ 2.0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ b. ┆ baz ┆ 1.5 ┆ baz ┆ 1.5 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ b. ┆ boo ┆ -1.0 ┆ baz ┆ 1.5 │
└───────┴──────┴───────┴──────────┴───────────┘
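An alternative that avoids the sort is to index name at the position of the maximum value. A sketch (assuming Expr.get accepts an expression index, which recent Polars versions support):
df.with_columns([
    pl.col('name').get(pl.col('value').arg_max()).over('label').alias('name_max'),
    pl.col('value').max().over('label').alias('value_max'),
])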