Complex asof joins with option to select between duplicates + strictly less/greater than + use utf8 - python-polars

Is there a way to perform something similar to an asof join, but with:
- the option to select which element to join with (e.g. first, last) if there are duplicates
- the option to join with only strictly less/greater than
- the ability to use utf-8
Here's a code example:
import polars as pl

df1 = pl.DataFrame({
    'by_1': ['X', 'X', 'Y', 'Y'] * 8,
    'by_2': ['X', 'Y', 'X', 'Y'] * 8,
    'on_1': ['A'] * 16 + ['C'] * 16,
    'on_2': (['A'] * 8 + ['C'] * 8) * 2,
    '__index__': list(range(32)),
})
df2 = pl.DataFrame([
    {'by_1': 'Y', 'by_2': 'Y', 'on_1': 'B', 'on_2': 'A'},
    {'by_1': 'Y', 'by_2': 'Y', 'on_1': 'C', 'on_2': 'A'},
    {'by_1': 'Y', 'by_2': 'Z', 'on_1': 'A', 'on_2': 'A'},
])
df1:
┌──────┬──────┬──────┬──────┬───────────┐
│ by_1 ┆ by_2 ┆ on_1 ┆ on_2 ┆ __index__ │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str ┆ i64 │
╞══════╪══════╪══════╪══════╪═══════════╡
│ X ┆ X ┆ A ┆ A ┆ 0 │
│ X ┆ Y ┆ A ┆ A ┆ 1 │
│ Y ┆ X ┆ A ┆ A ┆ 2 │
│ Y ┆ Y ┆ A ┆ A ┆ 3 │
│ X ┆ X ┆ A ┆ A ┆ 4 │
│ X ┆ Y ┆ A ┆ A ┆ 5 │
│ Y ┆ X ┆ A ┆ A ┆ 6 │
│ Y ┆ Y ┆ A ┆ A ┆ 7 │
│ X ┆ X ┆ A ┆ C ┆ 8 │
│ X ┆ Y ┆ A ┆ C ┆ 9 │
│ Y ┆ X ┆ A ┆ C ┆ 10 │
│ Y ┆ Y ┆ A ┆ C ┆ 11 │
│ X ┆ X ┆ A ┆ C ┆ 12 │
│ X ┆ Y ┆ A ┆ C ┆ 13 │
│ Y ┆ X ┆ A ┆ C ┆ 14 │
│ Y ┆ Y ┆ A ┆ C ┆ 15 │
│ X ┆ X ┆ C ┆ A ┆ 16 │
│ X ┆ Y ┆ C ┆ A ┆ 17 │
│ Y ┆ X ┆ C ┆ A ┆ 18 │
│ Y ┆ Y ┆ C ┆ A ┆ 19 │
│ X ┆ X ┆ C ┆ A ┆ 20 │
│ X ┆ Y ┆ C ┆ A ┆ 21 │
│ Y ┆ X ┆ C ┆ A ┆ 22 │
│ Y ┆ Y ┆ C ┆ A ┆ 23 │
│ X ┆ X ┆ C ┆ C ┆ 24 │
│ X ┆ Y ┆ C ┆ C ┆ 25 │
│ Y ┆ X ┆ C ┆ C ┆ 26 │
│ Y ┆ Y ┆ C ┆ C ┆ 27 │
│ X ┆ X ┆ C ┆ C ┆ 28 │
│ X ┆ Y ┆ C ┆ C ┆ 29 │
│ Y ┆ X ┆ C ┆ C ┆ 30 │
│ Y ┆ Y ┆ C ┆ C ┆ 31 │
└──────┴──────┴──────┴──────┴───────────┘
df2:
┌──────┬──────┬──────┬──────┐
│ by_1 ┆ by_2 ┆ on_1 ┆ on_2 │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str │
╞══════╪══════╪══════╪══════╡
│ Y ┆ Y ┆ B ┆ A │
│ Y ┆ Y ┆ C ┆ A │
│ Y ┆ Z ┆ A ┆ A │
└──────┴──────┴──────┴──────┘
# Case 1 - Less Than (lt)
df2.join_asof_lt(
    df1,
    by=['by_1', 'by_2'],
    on=['on_1', 'on_2'],
    lt_select_eq='first',
)
┌──────┬──────┬──────┬──────┬───────────┐
│ by_1 ┆ by_2 ┆ on_1 ┆ on_2 ┆ __index__ │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str ┆ i64 │
╞══════╪══════╪══════╪══════╪═══════════╡
│ Y ┆ Y ┆ B ┆ A ┆ 11 │ # Closest strictly-less match is ('Y', 'Y'), ('A', 'C'), which exists at indexes 11 and 15. Since lt_select_eq is 'first', choose 11
│ Y ┆ Y ┆ C ┆ A ┆ 11 │ # Closest strictly-less match is ('Y', 'Y'), ('A', 'C'), which exists at indexes 11 and 15. Since lt_select_eq is 'first', choose 11
│ Y ┆ Z ┆ A ┆ A ┆ null │ # Group ('Y', 'Z') does not exist, so return null
└──────┴──────┴──────┴──────┴───────────┘
df2.join_asof_lt(
    df1,
    by=['by_1', 'by_2'],
    on=['on_1', 'on_2'],
    lt_select_eq='last',
)
┌──────┬──────┬──────┬──────┬───────────┐
│ by_1 ┆ by_2 ┆ on_1 ┆ on_2 ┆ __index__ │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str ┆ i64 │
╞══════╪══════╪══════╪══════╪═══════════╡
│ Y ┆ Y ┆ B ┆ A ┆ 15 │ # Closest strictly-less match is ('Y', 'Y'), ('A', 'C'), which exists at indexes 11 and 15. Since lt_select_eq is 'last', choose 15
│ Y ┆ Y ┆ C ┆ A ┆ 15 │ # Closest strictly-less match is ('Y', 'Y'), ('A', 'C'), which exists at indexes 11 and 15. Since lt_select_eq is 'last', choose 15
│ Y ┆ Z ┆ A ┆ A ┆ null │ # Group ('Y', 'Z') does not exist, so return null
└──────┴──────┴──────┴──────┴───────────┘
# Case 2 - Less Than or Equal To (leq)
df2.join_asof_leq(
    df1,
    by=['by_1', 'by_2'],
    on=['on_1', 'on_2'],
    lt_select_eq='first',
    eq_select_eq='last',
)
┌──────┬──────┬──────┬──────┬───────────┐
│ by_1 ┆ by_2 ┆ on_1 ┆ on_2 ┆ __index__ │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str ┆ i64 │
╞══════╪══════╪══════╪══════╪═══════════╡
│ Y ┆ Y ┆ B ┆ A ┆ 11 │ # Closest less-than-or-equal match is ('Y', 'Y'), ('A', 'C'), which exists at indexes 11 and 15. Since lt_select_eq is 'first', choose 11
│ Y ┆ Y ┆ C ┆ A ┆ 23 │ # Closest less-than-or-equal match is the exact match ('Y', 'Y'), ('C', 'A'), which exists at indexes 19 and 23. Since eq_select_eq is 'last', choose 23
│ Y ┆ Z ┆ A ┆ A ┆ null │ # Group ('Y', 'Z') does not exist, so return null
└──────┴──────┴──────┴──────┴───────────┘
df2.join_asof_leq(
    df1,
    by=['by_1', 'by_2'],
    on=['on_1', 'on_2'],
    lt_select_eq='last',
    eq_select_eq='first',
)
┌──────┬──────┬──────┬──────┬───────────┐
│ by_1 ┆ by_2 ┆ on_1 ┆ on_2 ┆ __index__ │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str ┆ i64 │
╞══════╪══════╪══════╪══════╪═══════════╡
│ Y ┆ Y ┆ B ┆ A ┆ 15 │ # Closest less-than-or-equal match is ('Y', 'Y'), ('A', 'C'), which exists at indexes 11 and 15. Since lt_select_eq is 'last', choose 15
│ Y ┆ Y ┆ C ┆ A ┆ 19 │ # Closest less-than-or-equal match is the exact match ('Y', 'Y'), ('C', 'A'), which exists at indexes 19 and 23. Since eq_select_eq is 'first', choose 19
│ Y ┆ Z ┆ A ┆ A ┆ null │ # Group ('Y', 'Z') does not exist, so return null
└──────┴──────┴──────┴──────┴───────────┘
These examples are for lt / leq, but it could also be gt / geq. Thanks!

I don't follow why indexes 3 and 7 are not also classed as less than, but as an example:
df3 = df2.join(df1, on=["by_1", "by_2"], how="left")
df3.filter(
    pl.col("__index__").is_null() |
    (pl.col("on_1_right") < pl.col("on_1"))
)
shape: (9, 7)
┌──────┬──────┬──────┬──────┬────────────┬────────────┬───────────┐
│ by_1 | by_2 | on_1 | on_2 | on_1_right | on_2_right | __index__ │
│ --- | --- | --- | --- | --- | --- | --- │
│ str | str | str | str | str | str | i64 │
╞══════╪══════╪══════╪══════╪════════════╪════════════╪═══════════╡
│ Y | Y | B | A | A | A | 3 │
│ Y | Y | B | A | A | A | 7 │
│ Y | Y | B | A | A | C | 11 │
│ Y | Y | B | A | A | C | 15 │
│ Y | Y | C | A | A | A | 3 │
│ Y | Y | C | A | A | A | 7 │
│ Y | Y | C | A | A | C | 11 │
│ Y | Y | C | A | A | C | 15 │
│ Y | Z | A | A | null | null | null │
└──────┴──────┴──────┴──────┴────────────┴────────────┴───────────┘
Get "closest" match per group:
group_keys = ["by_1", "by_2", "on_1", "on_2"]
df3 = df2.join(df1, on=["by_1", "by_2"], how="left")
(
df3
.filter(
pl.col("__index__").is_null() |
(pl.col("on_1") > pl.col("on_1_right")))
.filter(
pl.col([
"on_1_right",
"on_2_right"
]) == pl.col(["on_1_right", "on_2_right"])
.last()
.over(group_keys))
)
shape: (5, 7)
┌──────┬──────┬──────┬──────┬────────────┬────────────┬───────────┐
│ by_1 | by_2 | on_1 | on_2 | on_1_right | on_2_right | __index__ │
│ --- | --- | --- | --- | --- | --- | --- │
│ str | str | str | str | str | str | i64 │
╞══════╪══════╪══════╪══════╪════════════╪════════════╪═══════════╡
│ Y | Y | B | A | A | C | 11 │
│ Y | Y | B | A | A | C | 15 │
│ Y | Y | C | A | A | C | 11 │
│ Y | Y | C | A | A | C | 15 │
│ Y | Z | A | A | null | null | null │
└──────┴──────┴──────┴──────┴────────────┴────────────┴───────────┘
If you .groupby(group_keys) that result, you can use .first() / .last():
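Here groups is assumed to be that grouped result (closest being a hypothetical name for the filtered frame from the previous snippet):
groups = closest.groupby(group_keys)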
>>> groups.agg(pl.col("__index__").first())
shape: (3, 5)
┌──────┬──────┬──────┬──────┬───────────┐
│ by_1 | by_2 | on_1 | on_2 | __index__ │
│ --- | --- | --- | --- | --- │
│ str | str | str | str | i64 │
╞══════╪══════╪══════╪══════╪═══════════╡
│ Y | Y | C | A | 11 │
│ Y | Z | A | A | null │
│ Y | Y | B | A | 11 │
└──────┴──────┴──────┴──────┴───────────┘
>>> groups.agg(pl.col("__index__").last())
shape: (3, 5)
┌──────┬──────┬──────┬──────┬───────────┐
│ by_1 | by_2 | on_1 | on_2 | __index__ │
│ --- | --- | --- | --- | --- │
│ str | str | str | str | i64 │
╞══════╪══════╪══════╪══════╪═══════════╡
│ Y | Y | B | A | 15 │
│ Y | Z | A | A | null │
│ Y | Y | C | A | 15 │
└──────┴──────┴──────┴──────┴───────────┘
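Putting those steps together, a minimal sketch of a reusable helper (asof_lt_join is a hypothetical name, not a Polars API; it assumes the right frame is sorted on the on columns within each group, and, as in the filters above, only the first on column drives the strictness check):
import polars as pl

def asof_lt_join(left, right, by, on, select="first"):
    # Emulates a strict less-than asof-style join on utf-8 columns,
    # picking 'first' or 'last' among duplicate matches.
    group_keys = by + on
    on_r = [f"{c}_right" for c in on]
    joined = left.join(right, on=by, how="left")
    closest = (
        joined
        # keep strictly-less rows only (unmatched groups are re-added below)
        .filter(pl.col(on[0]) > pl.col(on_r[0]))
        # within each group, keep only rows carrying the closest right-hand key
        .filter(pl.col(on_r) == pl.col(on_r).last().over(group_keys))
    )
    pick = pl.col("__index__").first() if select == "first" else pl.col("__index__").last()
    matches = closest.groupby(group_keys).agg(pick)
    # join back so groups with no match (e.g. ('Y', 'Z')) keep a null index
    return left.join(matches, on=group_keys, how="left")

asof_lt_join(df2, df1, by=["by_1", "by_2"], on=["on_1", "on_2"], select="first")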

How can I use when, then and otherwise with multiple conditions in polars?

I have a data set with three columns. Column A is to be checked for strings. If the string matches foo or spam, the values in the same row for the other two columns L and G should be changed to XX. For this I have tried the following.
df = pl.DataFrame(
    {
        "A": ["foo", "ham", "spam", "egg"],
        "L": ["A54", "A12", "B84", "C12"],
        "G": ["X34", "C84", "G96", "L6"],
    }
)
print(df)
shape: (4, 3)
┌──────┬─────┬─────┐
│ A ┆ L ┆ G │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞══════╪═════╪═════╡
│ foo ┆ A54 ┆ X34 │
│ ham ┆ A12 ┆ C84 │
│ spam ┆ B84 ┆ G96 │
│ egg ┆ C12 ┆ L6 │
└──────┴─────┴─────┘
expected outcome
shape: (4, 3)
┌──────┬─────┬─────┐
│ A ┆ L ┆ G │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞══════╪═════╪═════╡
│ foo ┆ XX ┆ XX │
│ ham ┆ A12 ┆ C84 │
│ spam ┆ XX ┆ XX │
│ egg ┆ C12 ┆ L6 │
└──────┴─────┴─────┘
I tried this
df = df.with_column(
    pl.when((pl.col("A") == "foo") | (pl.col("A") == "spam"))
    .then((pl.col("L")= "XX") & (pl.col("G")= "XX"))
    .otherwise((pl.col("L")) & (pl.col("G")))
)
However, this does not work. Can someone help me with this?
For setting multiple columns to the same value you could use:
df.with_columns(
    pl.when(pl.col("A").is_in(["foo", "spam"]))
    .then("XX")
    .otherwise(pl.col(["L", "G"]))
    .keep_name()
)
shape: (4, 3)
┌──────┬─────┬─────┐
│ A | L | G │
│ --- | --- | --- │
│ str | str | str │
╞══════╪═════╪═════╡
│ foo | XX | XX │
├──────┼─────┼─────┤
│ ham | A12 | C84 │
├──────┼─────┼─────┤
│ spam | XX | XX │
├──────┼─────┼─────┤
│ egg | C12 | L6 │
└──────┴─────┴─────┘
.is_in() can be used instead of multiple == x | == y chains.
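For illustration, these two filters select the same rows (a minimal sketch against the df above):
df.filter((pl.col("A") == "foo") | (pl.col("A") == "spam"))
df.filter(pl.col("A").is_in(["foo", "spam"]))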
To update multiple columns at once with different values you could use .map() and a dictionary:
df.with_columns(
    pl.when(pl.col("A").is_in(["foo", "spam"]))
    .then(pl.col(["L", "G"]).map(
        lambda column: {
            "L": "XX",
            "G": "YY",
        }.get(column.name)))
    .otherwise(pl.col(["L", "G"]))
)
shape: (4, 3)
┌──────┬─────┬─────┐
│ A | L | G │
│ --- | --- | --- │
│ str | str | str │
╞══════╪═════╪═════╡
│ foo | XX | YY │
├──────┼─────┼─────┤
│ ham | A12 | C84 │
├──────┼─────┼─────┤
│ spam | XX | YY │
├──────┼─────┼─────┤
│ egg | C12 | L6 │
└──────┴─────┴─────┘

Copying row values based on Config file from polars dataframe

I have a dataframe consisting of an ID, Local, Entity, Field and Global column.
# Creating a dictionary with the data
data = {
    'ID': [4, 4, 4, 4, 4],
    'Local': ['A', 'B', 'C', 'D', 'E'],
    'Field': ['P', 'Q', 'R', 'S', 'T'],
    'Entity': ['K', 'L', 'M', 'N', 'O'],
    'Global': ['F', 'G', 'H', 'I', 'J'],
}
# Creating the dataframe
table = pl.DataFrame(data)
print(table)
shape: (5, 5)
┌─────┬───────┬───────┬──────────┬────────┐
│ ID ┆ Local ┆ Field ┆ Entity ┆ Global │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ str ┆ str │
╞═════╪═══════╪═══════╪══════════╪════════╡
│ 4 ┆ A ┆ P ┆ K ┆ F │
│ 4 ┆ B ┆ Q ┆ L ┆ G │
│ 4 ┆ C ┆ R ┆ M ┆ H │
│ 4 ┆ D ┆ S ┆ N ┆ I │
│ 4 ┆ E ┆ T ┆ O ┆ J │
└─────┴───────┴───────┴──────────┴────────┘
Within the dataset, certain rows need to be copied. For this purpose the following config file is provided:
copying:
  - column_name: P
    source_table: K
    destination_table: X
  - column_name: S
    source_table: N
    destination_table: W
In the config file there is a column_name value which refers to the Field column, a source_table which refers to the given Entity and a destination_table which should be the future entry in the Entity column. The goal is to enrich data based on existing rows (just with other tables).
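For reference, such a config can be loaded with omegaconf along these lines (config.yaml is an assumed filename; the copying entries then behave like the dicts used below):
from omegaconf import OmegaConf
conf = OmegaConf.load("config.yaml")  # conf.copying is a list of rule entries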
The solution should look like this:
shape: (7, 5)
┌─────┬───────┬───────┬──────────┬────────┐
│ ID ┆ Local ┆ Field ┆ Entity ┆ Global │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ str ┆ str │
╞═════╪═══════╪═══════╪══════════╪════════╡
│ 4 ┆ A ┆ P ┆ K ┆ F │
│ 4 ┆ B ┆ Q ┆ L ┆ G │
│ 4 ┆ C ┆ R ┆ M ┆ H │
│ 4 ┆ D ┆ S ┆ N ┆ I │
│ 4 ┆ E ┆ T ┆ O ┆ J │
│ 4 ┆ A ┆ P ┆ X ┆ F │
│ 4 ┆ D ┆ S ┆ W ┆ I │
└─────┴───────┴───────┴──────────┴────────┘
The dataset is a polars DataFrame and the config file is loaded with omegaconf. I tried it with this code:
conf.copying = [
    {"column_name": "P", "source_table": "K", "destination_table": "X"},
    {"column_name": "S", "source_table": "N", "destination_table": "W"},
]
# Iterate through the config file
for i in range(len(conf.copying)):
    # Select the rows from the table dataframe that match the column_name
    # and source_table fields in the config
    match_rows = table.filter(
        (pl.col("Field") == conf.copying[i]["column_name"])
        & (pl.col("Entity") == conf.copying[i]["source_table"])
    )
    # Drop the Entity column before re-adding it
    match_rows = match_rows.select(
        [
            "ID",
            "Local",
            "Field",
            "Global",
        ]
    )
    # Add the column Entity with the destination_table
    match_rows = match_rows.with_columns(
        pl.lit(conf.copying[i]["destination_table"]).alias("Entity")
    )
    # Restore the original column order
    match_rows = match_rows[
        [
            "ID",
            "Local",
            "Field",
            "Entity",
            "Global",
        ]
    ]
    # Append the new rows to the original table dataframe
    df_copy = match_rows.vstack(table)
However, the data is not copied as expected and added to the existing dataset. What am I doing wrong?
Note that df_copy is overwritten on every loop iteration, and each vstack stacks onto the original table rather than onto the accumulated result, so only the last match survives. It's probably better if you extract all of the rows you want in one go.
One possible approach is to build a list of when/then expressions:
rules = [
    pl.when(
        (pl.col("Field") == rule["column_name"]) &
        (pl.col("Entity") == rule["source_table"]))
    .then(rule["destination_table"])
    .alias("Entity")
    for rule in conf.copying
]
You can use pl.coalesce and .drop_nulls() to remove non-matches.
>>> table.with_columns(pl.coalesce(rules)).drop_nulls()
shape: (2, 5)
┌─────┬───────┬───────┬────────┬────────┐
│ ID | Local | Field | Entity | Global │
│ --- | --- | --- | --- | --- │
│ i64 | str | str | str | str │
╞═════╪═══════╪═══════╪════════╪════════╡
│ 4 | A | P | X | F │
├─────┼───────┼───────┼────────┼────────┤
│ 4 | D | S | W | I │
└─────┴───────┴───────┴────────┴────────┘
This can then be added to the original dataframe.
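For example (a sketch, reusing the rules list; new_rows and df_copy are assumed names):
new_rows = table.with_columns(pl.coalesce(rules)).drop_nulls()
df_copy = pl.concat([table, new_rows])  # original rows plus the copied ones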
Another style I've seen is to start with an "empty" when/then to build a .when().then() chain e.g.
rules = pl.when(False).then(None)
for rule in conf.copying:
    rules = (
        rules.when(
            (pl.col("Field") == rule["column_name"]) &
            (pl.col("Entity") == rule["source_table"]))
        .then(rule["destination_table"])
    )
>>> table.with_columns(rules.alias("Entity")).drop_nulls()
shape: (2, 5)
┌─────┬───────┬───────┬────────┬────────┐
│ ID | Local | Field | Entity | Global │
│ --- | --- | --- | --- | --- │
│ i64 | str | str | str | str │
╞═════╪═══════╪═══════╪════════╪════════╡
│ 4 | A | P | X | F │
├─────┼───────┼───────┼────────┼────────┤
│ 4 | D | S | W | I │
└─────┴───────┴───────┴────────┴────────┘
I'm not sure it would make much difference to filter the rows first and then do the Entity modification, but pl.any() could be used if you wanted to do that:
table.filter(
    pl.any([
        (pl.col("Field") == rule["column_name"]) &
        (pl.col("Entity") == rule["source_table"])
        for rule in conf.copying
    ])
)
shape: (2, 5)
┌─────┬───────┬───────┬────────┬────────┐
│ ID | Local | Field | Entity | Global │
│ --- | --- | --- | --- | --- │
│ i64 | str | str | str | str │
╞═════╪═══════╪═══════╪════════╪════════╡
│ 4 | A | P | K | F │
├─────┼───────┼───────┼────────┼────────┤
│ 4 | D | S | N | I │
└─────┴───────┴───────┴────────┴────────┘
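For completeness, a sketch of that filter-first variant, reusing the earlier rules list of when/then expressions to rewrite Entity on just the filtered rows (matched is an assumed name):
matched = table.filter(
    pl.any([
        (pl.col("Field") == rule["column_name"]) &
        (pl.col("Entity") == rule["source_table"])
        for rule in conf.copying
    ])
)
# every remaining row matches exactly one rule, so coalesce picks it
matched.with_columns(pl.coalesce(rules))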

polars groupby and pivot converting code from pyspark

I'm currently converting some code from pyspark to polars and need some help.
In pyspark I am grouping by col1 and col2, then pivoting on a column called VariableName using the Value column. How would I do this in polars?
pivotDF = df.groupBy("col1","col2").pivot("VariableName").max("Value")
Let's start with this data:
import polars as pl
from pyspark.sql import SparkSession

df = pl.DataFrame(
    {
        "col1": ["A", "B"] * 12,
        "col2": ["x", "y", "z"] * 8,
        "VariableName": ["one", "two", "three", "four"] * 6,
        "Value": pl.arange(0, 24, eager=True),
    }
)
df
shape: (24, 4)
┌──────┬──────┬──────────────┬───────┐
│ col1 ┆ col2 ┆ VariableName ┆ Value │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ i64 │
╞══════╪══════╪══════════════╪═══════╡
│ A ┆ x ┆ one ┆ 0 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ B ┆ y ┆ two ┆ 1 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ A ┆ z ┆ three ┆ 2 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ B ┆ x ┆ four ┆ 3 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ ... ┆ ... ┆ ... ┆ ... │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ A ┆ z ┆ one ┆ 20 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ B ┆ x ┆ two ┆ 21 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ A ┆ y ┆ three ┆ 22 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ B ┆ z ┆ four ┆ 23 │
└──────┴──────┴──────────────┴───────┘
Running your query on pyspark yields:
spark = SparkSession.builder.getOrCreate()
(
    spark
    .createDataFrame(df.to_pandas())
    .groupBy("col1", "col2")
    .pivot("VariableName")
    .max("Value")
    .sort(["col1", "col2"])
    .show()
)
+----+----+----+----+-----+----+
|col1|col2|four| one|three| two|
+----+----+----+----+-----+----+
| A| x|null| 12| 18|null|
| A| y|null| 16| 22|null|
| A| z|null| 20| 14|null|
| B| x| 15|null| null| 21|
| B| y| 19|null| null| 13|
| B| z| 23|null| null| 17|
+----+----+----+----+-----+----+
In Polars, we would code this using pivot.
(
    df
    .pivot(
        index=["col1", "col2"],
        values="Value",
        columns="VariableName",
        aggregate_fn="max",
    )
    .sort(["col1", "col2"])
)
shape: (6, 6)
┌──────┬──────┬──────┬──────┬───────┬──────┐
│ col1 ┆ col2 ┆ one ┆ two ┆ three ┆ four │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞══════╪══════╪══════╪══════╪═══════╪══════╡
│ A ┆ x ┆ 12 ┆ null ┆ 18 ┆ null │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ A ┆ y ┆ 16 ┆ null ┆ 22 ┆ null │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ A ┆ z ┆ 20 ┆ null ┆ 14 ┆ null │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ B ┆ x ┆ null ┆ 21 ┆ null ┆ 15 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ B ┆ y ┆ null ┆ 13 ┆ null ┆ 19 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ B ┆ z ┆ null ┆ 17 ┆ null ┆ 23 │
└──────┴──────┴──────┴──────┴───────┴──────┘
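One difference worth noting: pyspark orders the pivoted columns alphabetically, while Polars keeps them in order of first appearance. If you want the pyspark ordering, a small sketch (pivotDF is an assumed name for the Polars result above):
pivotDF.select(sorted(pivotDF.columns))  # col1, col2, four, one, three, two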

Apply to a list of columns in Polars

In the following dataframe I would like to multiply var_3 and var_4 by negative 1. I can do so using the following method, but I am wondering if it can be done by collecting them in a list (imagining that there may be many more than 4 columns in the dataframe):
df = pl.DataFrame({"var_1": ["a", "a", "b"],
"var_2": ["c", "d", "e"],
"var_3": [1, 2, 3],
"var_4": [4, 5, 6]})
df.with_columns([pl.col("var_3") * -1,
pl.col("var_4") * -1])
Which returns the desired dataframe:
shape: (3, 4)
┌───────┬───────┬───────┬───────┐
│ var_1 ┆ var_2 ┆ var_3 ┆ var_4 │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 ┆ i64 │
╞═══════╪═══════╪═══════╪═══════╡
│ a ┆ c ┆ -1 ┆ -4 │
│ a ┆ d ┆ -2 ┆ -5 │
│ b ┆ e ┆ -3 ┆ -6 │
└───────┴───────┴───────┴───────┘
My try at it goes like this, though it is not applying the multiplication:
var_list = ["var_3", "var_4"]
pl_cols_var_list = [pl.col(k) for k in var_list]
df.with_columns(pl_cols_var_list * -1)
You were close (note that pl_cols_var_list * -1 is Python list multiplication, which produces an empty list here, so no expressions were applied). You can provide your list of variable names (as strings) directly to the polars.col expression:
var_list = ["var_3", "var_4"]
df.with_columns(pl.col(var_list) * -1)
shape: (3, 4)
┌───────┬───────┬───────┬───────┐
│ var_1 ┆ var_2 ┆ var_3 ┆ var_4 │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 ┆ i64 │
╞═══════╪═══════╪═══════╪═══════╡
│ a ┆ c ┆ -1 ┆ -4 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ a ┆ d ┆ -2 ┆ -5 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b ┆ e ┆ -3 ┆ -6 │
└───────┴───────┴───────┴───────┘
Another tip, if you have lots of columns and want to exclude only a few, you can use the polars.exclude expression:
var_list = ["var_1", "var_2"]
df.with_columns(pl.exclude(var_list) * -1)
shape: (3, 4)
┌───────┬───────┬───────┬───────┐
│ var_1 ┆ var_2 ┆ var_3 ┆ var_4 │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 ┆ i64 │
╞═══════╪═══════╪═══════╪═══════╡
│ a ┆ c ┆ -1 ┆ -4 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ a ┆ d ┆ -2 ┆ -5 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b ┆ e ┆ -3 ┆ -6 │
└───────┴───────┴───────┴───────┘
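Another option: pl.col also accepts a dtype, so you can target columns by type rather than by name. A minimal sketch (assuming every Int64 column should be negated):
df.with_columns(pl.col(pl.Int64) * -1)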

window agg over one value, but return another via Polars

I am trying to use polars to do a window aggregate over one value, but map it back to another.
For example, if I wanted to get the name of the max value in a group, instead of (or in combination with) just the max value.
assuming an input of something like this.
|label|name|value|
|a. | foo| 1. |
|a. | bar| 2. |
|b. | baz| 1.5. |
|b. | boo| -1 |
# 'max_by' is not a real method, just using it to express what I'm trying to achieve.
df.select(col('label'), col('name').max_by('value').over('label'))
I want an output like this:
|label|name|
|a. | bar|
|b. | baz|
Ideally with the value too, but I know I can easily add that in via col('value').max().over('label'):
|label|name|value|
|a. | bar| 2. |
|b. | baz| 1.5.|
You were close. There is a sort_by expression that can be used.
df.groupby('label').agg(pl.all().sort_by('value').last())
shape: (2, 3)
┌───────┬──────┬───────┐
│ label ┆ name ┆ value │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 │
╞═══════╪══════╪═══════╡
│ a. ┆ bar ┆ 2.0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b. ┆ baz ┆ 1.5 │
└───────┴──────┴───────┘
If you need a windowed version of this:
df.with_columns([
    pl.col(['name', 'value']).sort_by('value').last().over('label').suffix("_max")
])
shape: (4, 5)
┌───────┬──────┬───────┬──────────┬───────────┐
│ label ┆ name ┆ value ┆ name_max ┆ value_max │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 ┆ str ┆ f64 │
╞═══════╪══════╪═══════╪══════════╪═══════════╡
│ a. ┆ foo ┆ 1.0 ┆ bar ┆ 2.0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ a. ┆ bar ┆ 2.0 ┆ bar ┆ 2.0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ b. ┆ baz ┆ 1.5 ┆ baz ┆ 1.5 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ b. ┆ boo ┆ -1.0 ┆ baz ┆ 1.5 │
└───────┴──────┴───────┴──────────┴───────────┘
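An alternative to sort_by is to pair arg_max with take (a sketch on the same data; take was later renamed gather in newer Polars versions):
df.groupby('label').agg([
    # pick the 'name' at the position where 'value' is largest
    pl.col('name').take(pl.col('value').arg_max()).first(),
    pl.col('value').max(),
])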