Polars, sum column based on other column values in `groupby` - python-polars

I want to calculate the sum of a column in a groupby based on values of another column. It is pretty much what pl.Expr.value_counts does (see example), except that I want to apply a function (e.g. sum) to a specific column, in this case the Price column.
I know that I could do the groupby on Weather + Windy and then aggregate, but I can't do that, since I have plenty of other aggregations I need to compute on the Weather groupby alone.
import polars as pl

df = pl.DataFrame(
    data={
        "Weather": ["Rain", "Sun", "Rain", "Sun", "Rain", "Sun", "Rain", "Sun"],
        "Price": [1, 2, 3, 4, 5, 6, 7, 8],
        "Windy": ["Y", "Y", "Y", "Y", "N", "N", "N", "N"],
    }
)
I can get the number of counts per windy day with value_counts:
df_agg = (df
    .groupby("Weather")
    .agg([
        pl.col("Windy")
        .value_counts()
        .alias("Price")
    ])
)
shape: (2, 2)
┌─────────┬────────────────────┐
│ Weather ┆ Price │
│ --- ┆ --- │
│ str ┆ list[struct[2]] │
╞═════════╪════════════════════╡
│ Sun ┆ [{"Y",2}, {"N",2}] │
│ Rain ┆ [{"Y",2}, {"N",2}] │
└─────────┴────────────────────┘
I would like to do something like this:
df_agg = (df
    .groupby("Weather")
    .agg([
        pl.col("Windy")
        .custom_fun_on_other_col("Price", sum)
        .alias("Price")
    ])
)
and this is the result I want:
shape: (2, 2)
┌─────────┬────────────────────┐
│ Weather ┆ Price │
│ --- ┆ --- │
│ str ┆ list[struct[2]] │
╞═════════╪════════════════════╡
│ Sun ┆ [{"Y",6},{"N",14}] │
│ Rain ┆ [{"Y",4},{"N",12}] │
└─────────┴────────────────────┘
(Using polars version 0.15.15)

Inside a groupby context, you could combine .repeat_by().flatten() with .value_counts(): repeating each Windy value Price times means that counting the repeated values is the same as summing Price per group (a small illustration follows after the output).
df.groupby("Weather").agg(
pl.col("Windy").repeat_by("Price").flatten().value_counts()
.alias("Price")
)
shape: (2, 2)
┌─────────┬─────────────────────┐
│ Weather | Price │
│ --- | --- │
│ str | list[struct[2]] │
╞═════════╪═════════════════════╡
│ Sun | [{"N",14}, {"Y",6}] │
├─────────┼─────────────────────┤
│ Rain | [{"Y",4}, {"N",12}] │
└─────────┴─────────────────────┘
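To see why this works, here is a small illustration (not part of the original answer) using just the values from the "Sun" group, where Windy is ["Y", "Y", "N", "N"] and Price is [2, 4, 6, 8]:
import polars as pl

# repeat_by("Price") repeats "Y" 2 + 4 times and "N" 6 + 8 times, so counting the
# repeated values with value_counts yields {"Y",6} and {"N",14} (in some order) -
# exactly the per-group Price sums from the desired output.
sun = pl.DataFrame({"Windy": ["Y", "Y", "N", "N"], "Price": [2, 4, 6, 8]})
print(sun.select(pl.col("Windy").repeat_by("Price").flatten().value_counts()))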
Do you know about Window functions?
df.with_columns(
    pl.sum("Price").over(["Weather", "Windy"]).alias("sum")
)
shape: (8, 4)
┌─────────┬───────┬───────┬─────┐
│ Weather | Price | Windy | sum │
│ --- | --- | --- | --- │
│ str | i64 | str | i64 │
╞═════════╪═══════╪═══════╪═════╡
│ Rain | 1 | Y | 4 │
├─────────┼───────┼───────┼─────┤
│ Sun | 2 | Y | 6 │
├─────────┼───────┼───────┼─────┤
│ Rain | 3 | Y | 4 │
├─────────┼───────┼───────┼─────┤
│ Sun | 4 | Y | 6 │
├─────────┼───────┼───────┼─────┤
│ Rain | 5 | N | 12 │
├─────────┼───────┼───────┼─────┤
│ Sun | 6 | N | 14 │
├─────────┼───────┼───────┼─────┤
│ Rain | 7 | N | 12 │
├─────────┼───────┼───────┼─────┤
│ Sun | 8 | N | 14 │
└─────────┴───────┴───────┴─────┘
You could also create the struct if desired:
pl.struct(["Windy", pl.sum("Price").over(["Weather", "Windy"])])

For instance, you can create a temporary dataframe and then join it with the main dataframe.
tmp = (df
    .groupby(["Weather", "Windy"]).agg(pl.col("Price").sum())
    .select([pl.col("Weather"), pl.struct(["Windy", "Price"])])
    .groupby("Weather").agg(pl.list("Windy"))
)

df.groupby("Weather").agg([
    # your other aggregations ...
]).join(tmp, on="Weather")
┌─────────┬─────────────────────┐
│ Weather ┆ Windy │
│ --- ┆ --- │
│ str ┆ list[struct[2]] │
╞═════════╪═════════════════════╡
│ Rain ┆ [{"Y",4}, {"N",12}] │
│ Sun ┆ [{"N",14}, {"Y",6}] │
└─────────┴─────────────────────┘

How can I use when, then and otherwise with multiple conditions in polars?

I have a data set with three columns. Column A is to be checked for strings. If the string matches foo or spam, the values in the same row for the other two columns L and G should be changed to XX. For this I have tried the following.
df = pl.DataFrame(
    {
        "A": ["foo", "ham", "spam", "egg"],
        "L": ["A54", "A12", "B84", "C12"],
        "G": ["X34", "C84", "G96", "L6"],
    }
)
print(df)
shape: (4, 3)
┌──────┬─────┬─────┐
│ A ┆ L ┆ G │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞══════╪═════╪═════╡
│ foo ┆ A54 ┆ X34 │
│ ham ┆ A12 ┆ C84 │
│ spam ┆ B84 ┆ G96 │
│ egg ┆ C12 ┆ L6 │
└──────┴─────┴─────┘
Expected outcome:
shape: (4, 3)
┌──────┬─────┬─────┐
│ A ┆ L ┆ G │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞══════╪═════╪═════╡
│ foo ┆ XX ┆ XX │
│ ham ┆ A12 ┆ C84 │
│ spam ┆ XX ┆ XX │
│ egg ┆ C12 ┆ L6 │
└──────┴─────┴─────┘
I tried this:
df = df.with_column(
    pl.when((pl.col("A") == "foo") | (pl.col("A") == "spam"))
    .then((pl.col("L")= "XX") & (pl.col("G")= "XX"))
    .otherwise((pl.col("L")) & (pl.col("G")))
)
However, this does not work. Can someone help me with this?
For setting multiple columns to the same value you could use:
df.with_columns(
    pl.when(pl.col("A").is_in(["foo", "spam"]))
    .then("XX")
    .otherwise(pl.col(["L", "G"]))
    .keep_name()
)
shape: (4, 3)
┌──────┬─────┬─────┐
│ A | L | G │
│ --- | --- | --- │
│ str | str | str │
╞══════╪═════╪═════╡
│ foo | XX | XX │
├──────┼─────┼─────┤
│ ham | A12 | C84 │
├──────┼─────┼─────┤
│ spam | XX | XX │
├──────┼─────┼─────┤
│ egg | C12 | L6 │
└──────┴─────┴─────┘
.is_in() can be used instead of multiple == x | == y chains.
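For example, with the df above these two filters select the same rows; the second is just shorter:
df.filter((pl.col("A") == "foo") | (pl.col("A") == "spam"))
df.filter(pl.col("A").is_in(["foo", "spam"]))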
To update multiple columns at once with different values you could use .map() and a dictionary:
df.with_columns(
    pl.when(pl.col("A").is_in(["foo", "spam"]))
    .then(pl.col(["L", "G"]).map(
        lambda column: {
            "L": "XX",
            "G": "YY",
        }.get(column.name)))
    .otherwise(pl.col(["L", "G"]))
)
shape: (4, 3)
┌──────┬─────┬─────┐
│ A | L | G │
│ --- | --- | --- │
│ str | str | str │
╞══════╪═════╪═════╡
│ foo | XX | YY │
├──────┼─────┼─────┤
│ ham | A12 | C84 │
├──────┼─────┼─────┤
│ spam | XX | YY │
├──────┼─────┼─────┤
│ egg | C12 | L6 │
└──────┴─────┴─────┘

Copying row values based on Config file from polars dataframe

I have a dataframe consisting of an ID, Local, Entity, Field and Global column.
# Creating a dictionary with the data
data = {'ID': [4, 4, 4, 4, 4],
        'Local': ['A', 'B', 'C', 'D', 'E'],
        'Field': ['P', 'Q', 'R', 'S', 'T'],
        'Entity': ['K', 'L', 'M', 'N', 'O'],
        'Global': ['F', 'G', 'H', 'I', 'J']}

# Creating the dataframe
table = pl.DataFrame(data)
print(table)
shape: (5, 5)
┌─────┬───────┬───────┬──────────┬────────┐
│ ID ┆ Local ┆ Field ┆ Entity ┆ Global │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ str ┆ str │
╞═════╪═══════╪═══════╪══════════╪════════╡
│ 4 ┆ A ┆ P ┆ K ┆ F │
│ 4 ┆ B ┆ Q ┆ L ┆ G │
│ 4 ┆ C ┆ R ┆ M ┆ H │
│ 4 ┆ D ┆ S ┆ N ┆ I │
│ 4 ┆ E ┆ T ┆ O ┆ J │
└─────┴───────┴───────┴──────────┴────────┘
Within the dataset, certain rows need to be copied. For this purpose the following config file is provided:
copying:
  - column_name: P
    source_table: K
    destination_table: X
  - column_name: S
    source_table: N
    destination_table: W
In the config file there is a column_name value which refers to the Field column, a source_table which refers to the given Entity, and a destination_table which should become the new entry in the Entity column. The goal is to enrich the data based on existing rows (just with other tables).
The solution should look like this:
shape: (7, 5)
┌─────┬───────┬───────┬──────────┬────────┐
│ ID ┆ Local ┆ Field ┆ Entity ┆ Global │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ str ┆ str │
╞═════╪═══════╪═══════╪══════════╪════════╡
│ 4 ┆ A ┆ P ┆ K ┆ F │
│ 4 ┆ B ┆ Q ┆ L ┆ G │
│ 4 ┆ C ┆ R ┆ M ┆ H │
│ 4 ┆ D ┆ S ┆ N ┆ I │
│ 4 ┆ E ┆ T ┆ O ┆ J │
│ 4 ┆ A ┆ P ┆ X ┆ F │
│ 4 ┆ D ┆ S ┆ W ┆ I │
└─────┴───────┴───────┴──────────┴────────┘
The dataset is a polars DataFrame and the config file is loaded as omegaconf. I tried it with this code:
conf.copying = [
    {"column_name": "P", "source_table": "K", "destination_table": "X"},
    {"column_name": "S", "source_table": "N", "destination_table": "W"},
]

# Iterate through the config file
for i in range(len(conf.copying)):
    # Select the rows from the table dataframe that match the column_name and
    # source_table fields in the config
    match_rows = table.filter(
        (pl.col("Field") == conf.copying[i]["column_name"])
        & (pl.col("Entity") == conf.copying[i]["source_table"])
    )
    # Keep all columns except Entity
    match_rows = match_rows.select(
        [
            "ID",
            "Local",
            "Field",
            "Global",
        ]
    )
    # Add the Entity column with the destination_table value
    match_rows = match_rows.with_columns(
        pl.lit(conf.copying[i]["destination_table"]).alias("Entity")
    )
    match_rows = match_rows[
        [
            "ID",
            "Local",
            "Field",
            "Entity",
            "Global",
        ]
    ]
    # Append the new rows to the original table dataframe
    df_copy = match_rows.vstack(table)
However, the data is not copied as expected and added to the existing dataset. What am I doing wrong?
It's probably better if you extract all of the rows you want in one go.
One possible approach is to build a list of when/then expressions:
rules = [
    pl.when(
        (pl.col("Field") == rule["column_name"]) &
        (pl.col("Entity") == rule["source_table"]))
    .then(rule["destination_table"])
    .alias("Entity")
    for rule in conf.copying
]
You can use pl.coalesce and .drop_nulls() to remove non-matches.
>>> table.with_columns(pl.coalesce(rules)).drop_nulls()
shape: (2, 5)
┌─────┬───────┬───────┬────────┬────────┐
│ ID | Local | Field | Entity | Global │
│ --- | --- | --- | --- | --- │
│ i64 | str | str | str | str │
╞═════╪═══════╪═══════╪════════╪════════╡
│ 4 | A | P | X | F │
├─────┼───────┼───────┼────────┼────────┤
│ 4 | D | S | W | I │
└─────┴───────┴───────┴────────┴────────┘
This can then be added to the original dataframe.
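For instance, a minimal sketch (assuming the rules list from above): stack the rewritten rows onto the original table to get the 7-row result from the question.
new_rows = table.with_columns(pl.coalesce(rules)).drop_nulls()
result = pl.concat([table, new_rows])  # or table.vstack(new_rows); shape: (7, 5)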
Another style I've seen is to start with an "empty" when/then to build a .when().then() chain e.g.
rules = pl.when(False).then(None)

for rule in conf.copying:
    rules = (
        rules.when(
            (pl.col("Field") == rule["column_name"]) &
            (pl.col("Entity") == rule["source_table"]))
        .then(rule["destination_table"])
    )
>>> table.with_columns(rules.alias("Entity")).drop_nulls()
shape: (2, 5)
┌─────┬───────┬───────┬────────┬────────┐
│ ID | Local | Field | Entity | Global │
│ --- | --- | --- | --- | --- │
│ i64 | str | str | str | str │
╞═════╪═══════╪═══════╪════════╪════════╡
│ 4 | A | P | X | F │
├─────┼───────┼───────┼────────┼────────┤
│ 4 | D | S | W | I │
└─────┴───────┴───────┴────────┴────────┘
I'm not sure if it would make much difference to filter the rows first and then do the Entity modification - pl.any() could be used if you were to do that (a sketch of the full variant follows after the output):
table.filter(
    pl.any([
        (pl.col("Field") == rule["column_name"]) &
        (pl.col("Entity") == rule["source_table"])
        for rule in conf.copying
    ])
)
shape: (2, 5)
┌─────┬───────┬───────┬────────┬────────┐
│ ID | Local | Field | Entity | Global │
│ --- | --- | --- | --- | --- │
│ i64 | str | str | str | str │
╞═════╪═══════╪═══════╪════════╪════════╡
│ 4 | A | P | K | F │
├─────┼───────┼───────┼────────┼────────┤
│ 4 | D | S | N | I │
└─────┴───────┴───────┴────────┴────────┘
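A sketch of that filter-first variant (assuming the list-of-expressions rules defined earlier - the answer does not spell this out): filter the matching rows, then overwrite Entity with the coalesced rules.
matched = table.filter(
    pl.any([
        (pl.col("Field") == rule["column_name"]) &
        (pl.col("Entity") == rule["source_table"])
        for rule in conf.copying
    ])
)
new_rows = matched.with_columns(pl.coalesce(rules).alias("Entity"))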

Complex asof joins with option to select between duplicates + strictly less/greater than + use utf8

Is there a way to perform something similar to an asof join, but with
The option to select which element to join with (e.g. first, last) if there are duplicates
The option to join with only strictly less/greater than
Ability to use utf-8
Here's a code example:
import polars as pl

df1 = pl.DataFrame({
    'by_1': ['X', 'X', 'Y', 'Y'] * 8,
    'by_2': ['X', 'Y', 'X', 'Y'] * 8,
    'on_1': ['A'] * 16 + ['C'] * 16,
    'on_2': (['A'] * 8 + ['C'] * 8) * 2,
    '__index__': list(range(32))
})
df2 = pl.DataFrame([
    {'by_1': 'Y', 'by_2': 'Y', 'on_1': 'B', 'on_2': 'A'},
    {'by_1': 'Y', 'by_2': 'Y', 'on_1': 'C', 'on_2': 'A'},
    {'by_1': 'Y', 'by_2': 'Z', 'on_1': 'A', 'on_2': 'A'},
])
df1:
┌──────┬──────┬──────┬──────┬───────────┐
│ by_1 ┆ by_2 ┆ on_1 ┆ on_2 ┆ __index__ │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str ┆ i64 │
╞══════╪══════╪══════╪══════╪═══════════╡
│ X ┆ X ┆ A ┆ A ┆ 0 │
│ X ┆ Y ┆ A ┆ A ┆ 1 │
│ Y ┆ X ┆ A ┆ A ┆ 2 │
│ Y ┆ Y ┆ A ┆ A ┆ 3 │
│ X ┆ X ┆ A ┆ A ┆ 4 │
│ X ┆ Y ┆ A ┆ A ┆ 5 │
│ Y ┆ X ┆ A ┆ A ┆ 6 │
│ Y ┆ Y ┆ A ┆ A ┆ 7 │
│ X ┆ X ┆ A ┆ C ┆ 8 │
│ X ┆ Y ┆ A ┆ C ┆ 9 │
│ Y ┆ X ┆ A ┆ C ┆ 10 │
│ Y ┆ Y ┆ A ┆ C ┆ 11 │
│ X ┆ X ┆ A ┆ C ┆ 12 │
│ X ┆ Y ┆ A ┆ C ┆ 13 │
│ Y ┆ X ┆ A ┆ C ┆ 14 │
│ Y ┆ Y ┆ A ┆ C ┆ 15 │
│ X ┆ X ┆ C ┆ A ┆ 16 │
│ X ┆ Y ┆ C ┆ A ┆ 17 │
│ Y ┆ X ┆ C ┆ A ┆ 18 │
│ Y ┆ Y ┆ C ┆ A ┆ 19 │
│ X ┆ X ┆ C ┆ A ┆ 20 │
│ X ┆ Y ┆ C ┆ A ┆ 21 │
│ Y ┆ X ┆ C ┆ A ┆ 22 │
│ Y ┆ Y ┆ C ┆ A ┆ 23 │
│ X ┆ X ┆ C ┆ C ┆ 24 │
│ X ┆ Y ┆ C ┆ C ┆ 25 │
│ Y ┆ X ┆ C ┆ C ┆ 26 │
│ Y ┆ Y ┆ C ┆ C ┆ 27 │
│ X ┆ X ┆ C ┆ C ┆ 28 │
│ X ┆ Y ┆ C ┆ C ┆ 29 │
│ Y ┆ X ┆ C ┆ C ┆ 30 │
│ Y ┆ Y ┆ C ┆ C ┆ 31 │
└──────┴──────┴──────┴──────┴───────────┘
df2:
┌──────┬──────┬──────┬──────┐
│ by_1 ┆ by_2 ┆ on_1 ┆ on_2 │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str │
╞══════╪══════╪══════╪══════╡
│ Y ┆ Y ┆ B ┆ A │
│ Y ┆ Y ┆ C ┆ A │
│ Y ┆ Z ┆ A ┆ A │
└──────┴──────┴──────┴──────┘
# Case 1 - Less Than (lt)
df2.join_asof_lt(
    df1,
    by=['by_1', 'by_2'],
    on=['on_1', 'on_2'],
    lt_select_eq='first',
)
┌──────┬──────┬──────┬──────┬───────────┐
│ by_1 ┆ by_2 ┆ on_1 ┆ on_2 ┆ __index__ │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str ┆ i64 │
╞══════╪══════╪══════╪══════╪═══════════╡
│ Y ┆ Y ┆ B ┆ A ┆ 11 │ # First strictly less than is ('Y', 'Y'), ('A', 'C'), which exists at index 11 and 15. Since lt_select_eq is 'first', choose 11
│ Y ┆ Y ┆ C ┆ A ┆ 11 │ # First strictly less than is ('Y', 'Y'), ('A', 'C'), which exists at index 11 and 15. Since lt_select_eq is 'first', choose 11
│ Y ┆ Z ┆ A ┆ A ┆ null │ # Group (Z, Y) does not exist, so return None
└──────┴──────┴──────┴──────┴───────────┘
df2.join_asof_lt(
    df1,
    by=['by_1', 'by_2'],
    on=['on_1', 'on_2'],
    lt_select_eq='last',
)
┌──────┬──────┬──────┬──────┬───────────┐
│ by_1 ┆ by_2 ┆ on_1 ┆ on_2 ┆ __index__ │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str ┆ i64 │
╞══════╪══════╪══════╪══════╪═══════════╡
│ Y ┆ Y ┆ B ┆ A ┆ 15 │ # First strictly less than is ('Y', 'Y'), ('A', 'C'), which exists at index 11 and 15. Since lt_select_eq is 'last', choose 15
│ Y ┆ Y ┆ C ┆ A ┆ 15 │ # First strictly less than is ('Y', 'Y'), ('A', 'C'), which exists at index 11 and 15. Since lt_select_eq is 'last', choose 15
│ Y ┆ Z ┆ A ┆ A ┆ null │ # Group (Z, Y) does not exist, so return None
└──────┴──────┴──────┴──────┴───────────┘
# Case 2 - Less Than or Equal To (leq)
df2.join_asof_leq(
    df1,
    by=['by_1', 'by_2'],
    on=['on_1', 'on_2'],
    lt_select_eq='first',
    eq_select_eq='last',
)
┌──────┬──────┬──────┬──────┬───────────┐
│ by_1 ┆ by_2 ┆ on_1 ┆ on_2 ┆ __index__ │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str ┆ i64 │
╞══════╪══════╪══════╪══════╪═══════════╡
│ Y ┆ Y ┆ B ┆ A ┆ 11 │ # First less than or equal to is ('Y', 'Y'), ('A', 'C'), which exists at index 11 and 15. Since lt_select_eq is 'first', choose 11
│ Y ┆ Y ┆ C ┆ A ┆ 23 │ # First less than or equal to is ('Y', 'Y'), ('C', 'A'), which exists at index 19 and 23. Since eq_select_eq is 'last', choose 23
│ Y ┆ Z ┆ A ┆ A ┆ null │ # Group (Z, Y) does not exist, so return None
└──────┴──────┴──────┴──────┴───────────┘
df2.join_asof_leq(
    df1,
    by=['by_1', 'by_2'],
    on=['on_1', 'on_2'],
    lt_select_eq='last',
    eq_select_eq='first',
)
┌──────┬──────┬──────┬──────┬───────────┐
│ by_1 ┆ by_2 ┆ on_1 ┆ on_2 ┆ __index__ │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str ┆ i64 │
╞══════╪══════╪══════╪══════╪═══════════╡
│ Y ┆ Y ┆ B ┆ A ┆ 15 │ # First less than or equal to is ('Y', 'Y'), ('A', 'C'), which exists at index 11 and 15. Since lt_select_eq is 'last', choose 15
│ Y ┆ Y ┆ C ┆ A ┆ 19 │ # First less than or equal to is ('Y', 'Y'), ('C', 'A'), which exists at index 19 and 23. Since eq_select_eq is 'first', choose 19
│ Y ┆ Z ┆ A ┆ A ┆ null │ # Group (Z, Y) does not exist, so return None
└──────┴──────┴──────┴──────┴───────────┘
These examples are for lt / leq, but it could also be gt / geq. Thanks!
I don't follow why indexes 3 and 7 are not to be classed as less than, but for example:
df3 = df2.join(df1, on=["by_1", "by_2"], how="left")

df3.filter(
    pl.col("__index__").is_null() |
    (pl.col("on_1_right") < pl.col("on_1"))
)
shape: (9, 7)
┌──────┬──────┬──────┬──────┬────────────┬────────────┬───────────┐
│ by_1 | by_2 | on_1 | on_2 | on_1_right | on_2_right | __index__ │
│ --- | --- | --- | --- | --- | --- | --- │
│ str | str | str | str | str | str | i64 │
╞══════╪══════╪══════╪══════╪════════════╪════════════╪═══════════╡
│ Y | Y | B | A | A | A | 3 │
│ Y | Y | B | A | A | A | 7 │
│ Y | Y | B | A | A | C | 11 │
│ Y | Y | B | A | A | C | 15 │
│ Y | Y | C | A | A | A | 3 │
│ Y | Y | C | A | A | A | 7 │
│ Y | Y | C | A | A | C | 11 │
│ Y | Y | C | A | A | C | 15 │
│ Y | Z | A | A | null | null | null │
└──────┴──────┴──────┴──────┴────────────┴────────────┴───────────┘
Get "closest" match per group:
group_keys = ["by_1", "by_2", "on_1", "on_2"]

df3 = df2.join(df1, on=["by_1", "by_2"], how="left")
(
    df3
    .filter(
        pl.col("__index__").is_null() |
        (pl.col("on_1") > pl.col("on_1_right")))
    .filter(
        pl.col(["on_1_right", "on_2_right"])
        == pl.col(["on_1_right", "on_2_right"]).last().over(group_keys))
)
shape: (5, 7)
┌──────┬──────┬──────┬──────┬────────────┬────────────┬───────────┐
│ by_1 | by_2 | on_1 | on_2 | on_1_right | on_2_right | __index__ │
│ --- | --- | --- | --- | --- | --- | --- │
│ str | str | str | str | str | str | i64 │
╞══════╪══════╪══════╪══════╪════════════╪════════════╪═══════════╡
│ Y | Y | B | A | A | C | 11 │
│ Y | Y | B | A | A | C | 15 │
│ Y | Y | C | A | A | C | 11 │
│ Y | Y | C | A | A | C | 15 │
│ Y | Z | A | A | null | null | null │
└──────┴──────┴──────┴──────┴────────────┴────────────┴───────────┘
If you .groupby(group_keys) that result (e.g. groups = closest.groupby(group_keys), where closest stands for the filtered frame above), you can use .first() / .last():
>>> groups.agg(pl.col("__index__").first())
shape: (3, 5)
┌──────┬──────┬──────┬──────┬───────────┐
│ by_1 | by_2 | on_1 | on_2 | __index__ │
│ --- | --- | --- | --- | --- │
│ str | str | str | str | i64 │
╞══════╪══════╪══════╪══════╪═══════════╡
│ Y | Y | C | A | 11 │
│ Y | Z | A | A | null │
│ Y | Y | B | A | 11 │
└──────┴──────┴──────┴──────┴───────────┘
>>> groups.agg(pl.col("__index__").last())
shape: (3, 5)
┌──────┬──────┬──────┬──────┬───────────┐
│ by_1 | by_2 | on_1 | on_2 | __index__ │
│ --- | --- | --- | --- | --- │
│ str | str | str | str | i64 │
╞══════╪══════╪══════╪══════╪═══════════╡
│ Y | Y | B | A | 15 │
│ Y | Z | A | A | null │
│ Y | Y | C | A | 15 │
└──────┴──────┴──────┴──────┴───────────┘

convert a pandas loc operation that needed the index to assign values to polars

In this example I have three columns: 'DayOfWeek', 'Time' and 'Risk'.
I want to group by 'DayOfWeek', take only the first element of each group and assign a 'High' risk to it. That is, the first known hour in each day of week is the one with the highest risk; the rest are initialized to 'Low' risk.
In pandas I had an additional column for the index, but in polars I do not. I could artificially create one, but is it even necessary?
Can I do this somehow smarter with polars?
df['risk'] = "Low"
df = df.sort('Time')
df.loc[df.groupby("DayOfWeek").head(1).index, "risk"] = "High"
The index is unique in this case and goes to range(n)
Here is my solution btw. (I don't really like it)
df = df.with_column(pl.arange(0, df.shape[0]).alias('pseudo_index'))

# find lowest time for day
indexes_df = df.sort('Time').groupby('DayOfWeek').head(1)

# Set 'High' as col for all rows from groupby
indexes_df = indexes_df.select('pseudo_index').with_column(pl.lit('High').alias('risk'))

# Left join will generate null values for all values that are not in indexes_df 'pseudo_index'
df = df.join(indexes_df, how='left', on='pseudo_index').select([
    pl.all().exclude(['pseudo_index', 'risk']),
    pl.col('risk').fill_null(pl.lit('Low'))
])
You can use window functions to find where the first "index" of the "DayOfWeek" group equals the"index" column.
For that we only need to set an "index" column. We can do that easily with:
A method: df.with_row_count(<name>)
An expression: pl.arange(0, pl.count()).alias(<name>)
After that we can use this predicate:
pl.first("index").over("DayOfWeek") == pl.col("index")
Finally we use a when -> then -> otherwise expression to use that condition and create our new "Risk" column.
Example
Let's start with some data. In the snippet below I create an hourly date range and then determine the weekdays from that.
Preparing data
from datetime import datetime

df = pl.DataFrame({
    "Time": pl.date_range(datetime(2022, 6, 1), datetime(2022, 6, 30), "1h")
            .sample(frac=1.5, with_replacement=True)
            .sort(),
}).select([
    pl.arange(0, pl.count()).alias("index"),
    pl.all(),
    pl.col("Time").dt.weekday().alias("DayOfWeek"),
])
print(df)
shape: (1045, 3)
┌───────┬─────────────────────┬───────────┐
│ index ┆ Time ┆ DayOfWeek │
│ --- ┆ --- ┆ --- │
│ i64 ┆ datetime[ns] ┆ u32 │
╞═══════╪═════════════════════╪═══════════╡
│ 0 ┆ 2022-06-29 22:00:00 ┆ 3 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 2022-06-14 11:00:00 ┆ 2 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2022-06-11 21:00:00 ┆ 6 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 2022-06-27 20:00:00 ┆ 1 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ... ┆ ... │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 1041 ┆ 2022-06-11 09:00:00 ┆ 6 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 1042 ┆ 2022-06-18 22:00:00 ┆ 6 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 1043 ┆ 2022-06-18 01:00:00 ┆ 6 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 1044 ┆ 2022-06-23 18:00:00 ┆ 4 │
└───────┴─────────────────────┴───────────┘
Computing Risk values
df = df.with_column(
    pl.when(
        pl.first("index").over("DayOfWeek") == pl.col("index")
    ).then(
        "High"
    ).otherwise(
        "Low"
    ).alias("Risk")
).drop("index")
print(df)
shape: (1045, 3)
┌─────────────────────┬───────────┬──────┐
│ Time ┆ DayOfWeek ┆ Risk │
│ --- ┆ --- ┆ --- │
│ datetime[ns] ┆ u32 ┆ str │
╞═════════════════════╪═══════════╪══════╡
│ 2022-06-29 22:00:00 ┆ 3 ┆ High │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2022-06-14 11:00:00 ┆ 2 ┆ High │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2022-06-11 21:00:00 ┆ 6 ┆ High │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2022-06-27 20:00:00 ┆ 1 ┆ High │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ ... ┆ ... ┆ ... │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2022-06-11 09:00:00 ┆ 6 ┆ Low │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2022-06-18 22:00:00 ┆ 6 ┆ Low │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2022-06-18 01:00:00 ┆ 6 ┆ Low │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2022-06-23 18:00:00 ┆ 4 ┆ Low │
└─────────────────────┴───────────┴──────┘

window agg over one value, but return another via Polars

I am trying to use polars to do a window aggregate over one value, but map it back to another.
For example, if I wanted to get the name of the max value in a group, instead of (or in combination with) just the max value.
Assuming an input of something like this:
|label|name|value|
|a. | foo| 1. |
|a. | bar| 2. |
|b. | baz| 1.5. |
|b. | boo| -1 |
# 'max_by' is not a real method, just using it to express what I'm trying to achieve.
df.select(col('label'), col('name').max_by('value').over('label'))
I want an output like this:
|label|name|
|a. | bar|
|b. | baz|
Ideally with the value too, but I know I can easily add that in via col('value').max().over('label').
|label|name|value|
|a. | bar| 2. |
|b. | baz| 1.5.|
You were close. There is a sort_by expression that can be used.
df.groupby('label').agg(pl.all().sort_by('value').last())
shape: (2, 3)
┌───────┬──────┬───────┐
│ label ┆ name ┆ value │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 │
╞═══════╪══════╪═══════╡
│ a. ┆ bar ┆ 2.0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b. ┆ baz ┆ 1.5 │
└───────┴──────┴───────┘
If you need a windowed version of this:
df.with_columns([
    pl.col(['name', 'value']).sort_by('value').last().over('label').suffix("_max")
])
shape: (4, 5)
┌───────┬──────┬───────┬──────────┬───────────┐
│ label ┆ name ┆ value ┆ name_max ┆ value_max │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 ┆ str ┆ f64 │
╞═══════╪══════╪═══════╪══════════╪═══════════╡
│ a. ┆ foo ┆ 1.0 ┆ bar ┆ 2.0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ a. ┆ bar ┆ 2.0 ┆ bar ┆ 2.0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ b. ┆ baz ┆ 1.5 ┆ baz ┆ 1.5 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ b. ┆ boo ┆ -1.0 ┆ baz ┆ 1.5 │
└───────┴──────┴───────┴──────────┴───────────┘
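If only the label and name columns are wanted (the first desired output in the question), the same idea can be narrowed down - a sketch, not from the original answer:
df.groupby('label').agg(
    pl.col('name').sort_by('value').last()
)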