polars groupby and pivot converting code from pyspark

I am currently converting some code from PySpark to Polars and need some help.
In PySpark I am grouping by col1 and col2, then pivoting on a column called VariableName using
the Value column. How would I do this in Polars?
pivotDF = df.groupBy("col1","col2").pivot("VariableName").max("Value")

Let's start with this data:
import polars as pl
from pyspark.sql import SparkSession

df = pl.DataFrame(
    {
        "col1": ["A", "B"] * 12,
        "col2": ["x", "y", "z"] * 8,
        "VariableName": ["one", "two", "three", "four"] * 6,
        "Value": pl.arange(0, 24, eager=True),
    }
)
df
shape: (24, 4)
┌──────┬──────┬──────────────┬───────┐
│ col1 ┆ col2 ┆ VariableName ┆ Value │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ i64 │
╞══════╪══════╪══════════════╪═══════╡
│ A ┆ x ┆ one ┆ 0 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ B ┆ y ┆ two ┆ 1 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ A ┆ z ┆ three ┆ 2 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ B ┆ x ┆ four ┆ 3 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ ... ┆ ... ┆ ... ┆ ... │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ A ┆ z ┆ one ┆ 20 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ B ┆ x ┆ two ┆ 21 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ A ┆ y ┆ three ┆ 22 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ B ┆ z ┆ four ┆ 23 │
└──────┴──────┴──────────────┴───────┘
Running your query in PySpark yields:
spark = SparkSession.builder.getOrCreate()
(
    spark
    .createDataFrame(df.to_pandas())
    .groupBy("col1", "col2")
    .pivot("VariableName")
    .max("Value")
    .sort(["col1", "col2"])
    .show()
)
+----+----+----+----+-----+----+
|col1|col2|four| one|three| two|
+----+----+----+----+-----+----+
| A| x|null| 12| 18|null|
| A| y|null| 16| 22|null|
| A| z|null| 20| 14|null|
| B| x| 15|null| null| 21|
| B| y| 19|null| null| 13|
| B| z| 23|null| null| 17|
+----+----+----+----+-----+----+
In Polars, we would code this using pivot. (One difference visible in the outputs: Spark sorts the new pivot columns alphabetically, while Polars keeps them in order of first appearance.)
(
    df
    .pivot(
        index=["col1", "col2"],
        values="Value",
        columns="VariableName",
        aggregate_fn="max",
    )
    .sort(["col1", "col2"])
)
shape: (6, 6)
┌──────┬──────┬──────┬──────┬───────┬──────┐
│ col1 ┆ col2 ┆ one ┆ two ┆ three ┆ four │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞══════╪══════╪══════╪══════╪═══════╪══════╡
│ A ┆ x ┆ 12 ┆ null ┆ 18 ┆ null │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ A ┆ y ┆ 16 ┆ null ┆ 22 ┆ null │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ A ┆ z ┆ 20 ┆ null ┆ 14 ┆ null │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ B ┆ x ┆ null ┆ 21 ┆ null ┆ 15 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ B ┆ y ┆ null ┆ 13 ┆ null ┆ 19 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ B ┆ z ┆ null ┆ 17 ┆ null ┆ 23 │
└──────┴──────┴──────┴──────┴───────┴──────┘
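
Note that pivot is only available on eager DataFrames. If you need the same result inside a lazy query, and you know the set of VariableName values up front, one sketch is groupby plus a filtered max per name (the hardcoded variable_names list is the assumption here):

# Sketch: emulate the pivot lazily; assumes the category values are known ahead of time.
variable_names = ["one", "two", "three", "four"]
(
    df
    .lazy()
    .groupby(["col1", "col2"])
    .agg([
        pl.col("Value").filter(pl.col("VariableName") == name).max().alias(name)
        for name in variable_names
    ])
    .sort(["col1", "col2"])
    .collect()
)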


Complex asof joins with option to select between duplicates + strictly less/greater than + use utf8

Is there a way to perform something similar to an asof join, but with:
- the option to select which element to join with (e.g. first, last) if there are duplicates
- the option to join with only strictly less/greater than
- the ability to use utf-8
Here's a code example:
import polars as pl

df1 = pl.DataFrame({
    'by_1': ['X', 'X', 'Y', 'Y'] * 8,
    'by_2': ['X', 'Y', 'X', 'Y'] * 8,
    'on_1': ['A'] * 16 + ['C'] * 16,
    'on_2': (['A'] * 8 + ['C'] * 8) * 2,
    '__index__': list(range(32))
})
df2 = pl.DataFrame([
    {'by_1': 'Y', 'by_2': 'Y', 'on_1': 'B', 'on_2': 'A'},
    {'by_1': 'Y', 'by_2': 'Y', 'on_1': 'C', 'on_2': 'A'},
    {'by_1': 'Y', 'by_2': 'Z', 'on_1': 'A', 'on_2': 'A'},
])
df1:
┌──────┬──────┬──────┬──────┬───────────┐
│ by_1 ┆ by_2 ┆ on_1 ┆ on_2 ┆ __index__ │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str ┆ i64 │
╞══════╪══════╪══════╪══════╪═══════════╡
│ X ┆ X ┆ A ┆ A ┆ 0 │
│ X ┆ Y ┆ A ┆ A ┆ 1 │
│ Y ┆ X ┆ A ┆ A ┆ 2 │
│ Y ┆ Y ┆ A ┆ A ┆ 3 │
│ X ┆ X ┆ A ┆ A ┆ 4 │
│ X ┆ Y ┆ A ┆ A ┆ 5 │
│ Y ┆ X ┆ A ┆ A ┆ 6 │
│ Y ┆ Y ┆ A ┆ A ┆ 7 │
│ X ┆ X ┆ A ┆ C ┆ 8 │
│ X ┆ Y ┆ A ┆ C ┆ 9 │
│ Y ┆ X ┆ A ┆ C ┆ 10 │
│ Y ┆ Y ┆ A ┆ C ┆ 11 │
│ X ┆ X ┆ A ┆ C ┆ 12 │
│ X ┆ Y ┆ A ┆ C ┆ 13 │
│ Y ┆ X ┆ A ┆ C ┆ 14 │
│ Y ┆ Y ┆ A ┆ C ┆ 15 │
│ X ┆ X ┆ C ┆ A ┆ 16 │
│ X ┆ Y ┆ C ┆ A ┆ 17 │
│ Y ┆ X ┆ C ┆ A ┆ 18 │
│ Y ┆ Y ┆ C ┆ A ┆ 19 │
│ X ┆ X ┆ C ┆ A ┆ 20 │
│ X ┆ Y ┆ C ┆ A ┆ 21 │
│ Y ┆ X ┆ C ┆ A ┆ 22 │
│ Y ┆ Y ┆ C ┆ A ┆ 23 │
│ X ┆ X ┆ C ┆ C ┆ 24 │
│ X ┆ Y ┆ C ┆ C ┆ 25 │
│ Y ┆ X ┆ C ┆ C ┆ 26 │
│ Y ┆ Y ┆ C ┆ C ┆ 27 │
│ X ┆ X ┆ C ┆ C ┆ 28 │
│ X ┆ Y ┆ C ┆ C ┆ 29 │
│ Y ┆ X ┆ C ┆ C ┆ 30 │
│ Y ┆ Y ┆ C ┆ C ┆ 31 │
└──────┴──────┴──────┴──────┴───────────┘
df2:
┌──────┬──────┬──────┬──────┐
│ by_1 ┆ by_2 ┆ on_1 ┆ on_2 │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str │
╞══════╪══════╪══════╪══════╡
│ Y ┆ Y ┆ B ┆ A │
│ Y ┆ Y ┆ C ┆ A │
│ Y ┆ Z ┆ A ┆ A │
└──────┴──────┴──────┴──────┘
# Case 1 - Less Than (lt)
df2.join_asof_lt(
    df1,
    by=['by_1', 'by_2'],
    on=['on_1', 'on_2'],
    lt_select_eq='first',
)
┌──────┬──────┬──────┬──────┬───────────┐
│ by_1 ┆ by_2 ┆ on_1 ┆ on_2 ┆ __index__ │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str ┆ i64 │
╞══════╪══════╪══════╪══════╪═══════════╡
│ Y ┆ Y ┆ B ┆ A ┆ 11 │ # First strictly less than is ('Y', 'Y'), ('A', 'C'), which exists at index 11 and 15. Since lt_select_eq is 'first', choose 11
│ Y ┆ Y ┆ C ┆ A ┆ 11 │ # First strictly less than is ('Y', 'Y'), ('A', 'C'), which exists at index 11 and 15. Since lt_select_eq is 'first', choose 11
│ Y ┆ Z ┆ A ┆ A ┆ null │ # Group (Z, Y) does not exist, so return None
└──────┴──────┴──────┴──────┴───────────┘
df2.join_asof_lt(
    df1,
    by=['by_1', 'by_2'],
    on=['on_1', 'on_2'],
    lt_select_eq='last',
)
┌──────┬──────┬──────┬──────┬───────────┐
│ by_1 ┆ by_2 ┆ on_1 ┆ on_2 ┆ __index__ │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str ┆ i64 │
╞══════╪══════╪══════╪══════╪═══════════╡
│ Y ┆ Y ┆ B ┆ A ┆ 15 │ # First strictly less than is ('Y', 'Y'), ('A', 'C'), which exists at index 11 and 15. Since lt_select_eq is 'last', choose 15
│ Y ┆ Y ┆ C ┆ A ┆ 15 │ # First strictly less than is ('Y', 'Y'), ('A', 'C'), which exists at index 11 and 15. Since lt_select_eq is 'last', choose 15
│ Y ┆ Z ┆ A ┆ A ┆ null │ # Group (Z, Y) does not exist, so return None
└──────┴──────┴──────┴──────┴───────────┘
# Case 2 - Less Than or Equal To (leq)
df2.join_asof_leq(
    df1,
    by=['by_1', 'by_2'],
    on=['on_1', 'on_2'],
    lt_select_eq='first',
    eq_select_eq='last',
)
┌──────┬──────┬──────┬──────┬───────────┐
│ by_1 ┆ by_2 ┆ on_1 ┆ on_2 ┆ __index__ │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str ┆ i64 │
╞══════╪══════╪══════╪══════╪═══════════╡
│ Y ┆ Y ┆ B ┆ A ┆ 11 │ # First less than or equal to is ('Y', 'Y'), ('A', 'C'), which exists at index 11 and 15. Since lt_select_eq is 'first', choose 11
│ Y ┆ Y ┆ C ┆ A ┆ 23 │ # First less than or equal to is ('Y', 'Y'), ('C', 'A'), which exists at index 19 and 23. Since eq_select_eq is 'last', choose 23
│ Y ┆ Z ┆ A ┆ A ┆ null │ # Group (Z, Y) does not exist, so return None
└──────┴──────┴──────┴──────┴───────────┘
df2.join_asof_leq(
    df1,
    by=['by_1', 'by_2'],
    on=['on_1', 'on_2'],
    lt_select_eq='last',
    eq_select_eq='first',
)
┌──────┬──────┬──────┬──────┬───────────┐
│ by_1 ┆ by_2 ┆ on_1 ┆ on_2 ┆ __index__ │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str ┆ i64 │
╞══════╪══════╪══════╪══════╪═══════════╡
│ Y ┆ Y ┆ B ┆ A ┆ 15 │ # First less than or equal to is ('Y', 'Y'), ('A', 'C'), which exists at index 11 and 15. Since lt_select_eq is 'last', choose 15
│ Y ┆ Y ┆ C ┆ A ┆ 19 │ # First less than or equal to is ('Y', 'Y'), ('C', 'A'), which exists at index 19 and 23. Since eq_select_eq is 'first', choose 19
│ Y ┆ Z ┆ A ┆ A ┆ null │ # Group (Z, Y) does not exist, so return None
└──────┴──────┴──────┴──────┴───────────┘
These examples are for lt / leq, but it could also be gt / geq. Thanks!
I don't follow why indexes 3 and 7 are not to be classed as "less than", but, for example:
df3 = df2.join(df1, on=["by_1", "by_2"], how="left")
df3.filter(
    pl.col("__index__").is_null() |
    (pl.col("on_1_right") < pl.col("on_1"))
)
shape: (9, 7)
┌──────┬──────┬──────┬──────┬────────────┬────────────┬───────────┐
│ by_1 | by_2 | on_1 | on_2 | on_1_right | on_2_right | __index__ │
│ --- | --- | --- | --- | --- | --- | --- │
│ str | str | str | str | str | str | i64 │
╞══════╪══════╪══════╪══════╪════════════╪════════════╪═══════════╡
│ Y | Y | B | A | A | A | 3 │
│ Y | Y | B | A | A | A | 7 │
│ Y | Y | B | A | A | C | 11 │
│ Y | Y | B | A | A | C | 15 │
│ Y | Y | C | A | A | A | 3 │
│ Y | Y | C | A | A | A | 7 │
│ Y | Y | C | A | A | C | 11 │
│ Y | Y | C | A | A | C | 15 │
│ Y | Z | A | A | null | null | null │
└──────┴──────┴──────┴──────┴────────────┴────────────┴───────────┘
Get "closest" match per group:
group_keys = ["by_1", "by_2", "on_1", "on_2"]
df3 = df2.join(df1, on=["by_1", "by_2"], how="left")
(
    df3
    .filter(
        pl.col("__index__").is_null() |
        (pl.col("on_1") > pl.col("on_1_right")))
    .filter(
        pl.col(["on_1_right", "on_2_right"])
        == pl.col(["on_1_right", "on_2_right"]).last().over(group_keys))
)
shape: (5, 7)
┌──────┬──────┬──────┬──────┬────────────┬────────────┬───────────┐
│ by_1 | by_2 | on_1 | on_2 | on_1_right | on_2_right | __index__ │
│ --- | --- | --- | --- | --- | --- | --- │
│ str | str | str | str | str | str | i64 │
╞══════╪══════╪══════╪══════╪════════════╪════════════╪═══════════╡
│ Y | Y | B | A | A | C | 11 │
│ Y | Y | B | A | A | C | 15 │
│ Y | Y | C | A | A | C | 11 │
│ Y | Y | C | A | A | C | 15 │
│ Y | Z | A | A | null | null | null │
└──────┴──────┴──────┴──────┴────────────┴────────────┴───────────┘
If you .groupby(group_keys) that result, you can use .first() / .last():
>>> groups = closest.groupby(group_keys)  # "closest" being the filtered frame above
>>> groups.agg(pl.col("__index__").first())
shape: (3, 5)
┌──────┬──────┬──────┬──────┬───────────┐
│ by_1 | by_2 | on_1 | on_2 | __index__ │
│ --- | --- | --- | --- | --- │
│ str | str | str | str | i64 │
╞══════╪══════╪══════╪══════╪═══════════╡
│ Y | Y | C | A | 11 │
│ Y | Z | A | A | null │
│ Y | Y | B | A | 11 │
└──────┴──────┴──────┴──────┴───────────┘
>>> groups.agg(pl.col("__index__").last())
shape: (3, 5)
┌──────┬──────┬──────┬──────┬───────────┐
│ by_1 | by_2 | on_1 | on_2 | __index__ │
│ --- | --- | --- | --- | --- │
│ str | str | str | str | i64 │
╞══════╪══════╪══════╪══════╪═══════════╡
│ Y | Y | B | A | 15 │
│ Y | Z | A | A | null │
│ Y | Y | C | A | 15 │
└──────┴──────┴──────┴──────┴───────────┘
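
Putting those pieces together, an end-to-end sketch of the lt case with 'first' selection (this only chains the join, filters, and groupby shown above; join_asof_lt itself remains hypothetical):

group_keys = ["by_1", "by_2", "on_1", "on_2"]
closest = (
    df2.join(df1, on=["by_1", "by_2"], how="left")
    .filter(
        pl.col("__index__").is_null() |
        (pl.col("on_1_right") < pl.col("on_1")))
    .filter(
        pl.col(["on_1_right", "on_2_right"])
        == pl.col(["on_1_right", "on_2_right"]).last().over(group_keys))
)
# 'first' among the remaining duplicates; maintain_order keeps df2's row order.
closest.groupby(group_keys, maintain_order=True).agg(pl.col("__index__").first())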

Polars is throwing an error when I convert from eager to lazy execution

This code works and returns the expected result.
import polars as pl

df = pl.DataFrame({
    'A': [1, 2, 3, 3, 2, 1],
    'B': [1, 1, 1, 2, 2, 2]
})
(df
    #.lazy()
    .groupby('B')
    .apply(lambda x: x
        .with_columns(
            [pl.col("A").shift(i).alias(f"A_lag_{i}") for i in range(3)]
        )
    )
    .with_columns(
        [pl.col(f'A_lag_{i}') / pl.col('A') for i in range(3)]
    )
    #.collect()
)
However, if you uncomment the .lazy() and .collect() lines, you get a NotFoundError: f'A_lag_0
I've tried a few versions of this code, but I can't work out whether I'm doing something wrong or whether this is a bug in Polars.
This doesn't address the error that you are receiving (most likely the lazy optimizer cannot infer the schema produced by the Python lambda passed to groupby().apply, so the later reference to A_lag_0 is not found), but the more idiomatic way to express this in Polars is to use the over expression. For example:
(
    df
    .lazy()
    .with_columns([
        pl.col("A").shift(i).over('B').alias(f"A_lag_{i}")
        for i in range(3)
    ])
    .with_columns([
        (pl.col(f"A_lag_{i}") / pl.col("A")).suffix('_result')
        for i in range(3)
    ])
    .collect()
)
shape: (6, 8)
┌─────┬─────┬─────────┬─────────┬─────────┬────────────────┬────────────────┬────────────────┐
│ A ┆ B ┆ A_lag_0 ┆ A_lag_1 ┆ A_lag_2 ┆ A_lag_0_result ┆ A_lag_1_result ┆ A_lag_2_result │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ f64 ┆ f64 ┆ f64 │
╞═════╪═════╪═════════╪═════════╪═════════╪════════════════╪════════════════╪════════════════╡
│ 1 ┆ 1 ┆ 1 ┆ null ┆ null ┆ 1.0 ┆ null ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 1 ┆ 2 ┆ 1 ┆ null ┆ 1.0 ┆ 0.5 ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 1 ┆ 3 ┆ 2 ┆ 1 ┆ 1.0 ┆ 0.666667 ┆ 0.333333 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 2 ┆ 3 ┆ null ┆ null ┆ 1.0 ┆ null ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2 ┆ 2 ┆ 3 ┆ null ┆ 1.0 ┆ 1.5 ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 2 ┆ 1 ┆ 2 ┆ 3 ┆ 1.0 ┆ 2.0 ┆ 3.0 │
└─────┴─────┴─────────┴─────────┴─────────┴────────────────┴────────────────┴────────────────┘

polars equivalent to pandas groupby shift()

Is there an equivalent to df.groupby().shift() in Polars? (See: Use pandas.shift() within a group.)
You can use the over expression to accomplish this in Polars. Using the example from the link...
import polars as pl

df = pl.DataFrame({
    'object': [1, 1, 1, 2, 2],
    'period': [1, 2, 4, 4, 23],
    'value': [24, 67, 89, 5, 23],
})
df.with_column(
    pl.col('value').shift().over('object').alias('prev_value')
)
shape: (5, 4)
┌────────┬────────┬───────┬────────────┐
│ object ┆ period ┆ value ┆ prev_value │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 │
╞════════╪════════╪═══════╪════════════╡
│ 1 ┆ 1 ┆ 24 ┆ null │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 2 ┆ 67 ┆ 24 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 4 ┆ 89 ┆ 67 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 4 ┆ 5 ┆ null │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 23 ┆ 23 ┆ 5 │
└────────┴────────┴───────┴────────────┘
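One thing to keep in mind: over applies the shift following the frame's current row order within each group, so if the rows may not already be ordered within each group (e.g. by period), sort first. A minimal sketch:

# shift().over() follows the existing row order within each group,
# so sort by the ordering column first if the frame may be unsorted.
df.sort(['object', 'period']).with_column(
    pl.col('value').shift().over('object').alias('prev_value')
)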
To perform this on more than one column, you can specify the columns in the pl.col expression, and then use a prefix/suffix to name the new columns. For example:
df.with_columns(
    pl.col(['period', 'value']).shift().over('object').prefix("prev_")
)
shape: (5, 5)
┌────────┬────────┬───────┬─────────────┬────────────┐
│ object ┆ period ┆ value ┆ prev_period ┆ prev_value │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞════════╪════════╪═══════╪═════════════╪════════════╡
│ 1 ┆ 1 ┆ 24 ┆ null ┆ null │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 2 ┆ 67 ┆ 1 ┆ 24 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 4 ┆ 89 ┆ 2 ┆ 67 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 4 ┆ 5 ┆ null ┆ null │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 23 ┆ 23 ┆ 4 ┆ 5 │
└────────┴────────┴───────┴─────────────┴────────────┘
Using multiple values with over
Let's use this data.
df = pl.DataFrame(
    {
        "id": [1] * 5 + [2] * 5,
        "date": ["2020-01-01", "2020-01-01", "2020-02-01", "2020-02-01", "2020-02-01"] * 2,
        "value1": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        "value2": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
    }
).with_column(pl.col('date').str.strptime(pl.Date))
df
shape: (10, 4)
┌─────┬────────────┬────────┬────────┐
│ id ┆ date ┆ value1 ┆ value2 │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ date ┆ i64 ┆ i64 │
╞═════╪════════════╪════════╪════════╡
│ 1 ┆ 2020-01-01 ┆ 1 ┆ 10 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1 ┆ 2020-01-01 ┆ 2 ┆ 20 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1 ┆ 2020-02-01 ┆ 3 ┆ 30 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1 ┆ 2020-02-01 ┆ 4 ┆ 40 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1 ┆ 2020-02-01 ┆ 5 ┆ 50 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2020-01-01 ┆ 6 ┆ 60 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2020-01-01 ┆ 7 ┆ 70 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2020-02-01 ┆ 8 ┆ 80 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2020-02-01 ┆ 9 ┆ 90 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2020-02-01 ┆ 10 ┆ 100 │
└─────┴────────────┴────────┴────────┘
We can place a list of our grouping variables in the over expression (as well as a list in our pl.col expression). Polars will run them all in parallel.
df.with_columns([
    pl.col(["value1", "value2"]).shift().over(['id', 'date']).prefix("prev_"),
    pl.col(["value1", "value2"]).diff().over(['id', 'date']).suffix("_diff"),
])
shape: (10, 8)
┌─────┬────────────┬────────┬────────┬─────────────┬─────────────┬─────────────┬─────────────┐
│ id ┆ date ┆ value1 ┆ value2 ┆ prev_value1 ┆ prev_value2 ┆ value1_diff ┆ value2_diff │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ date ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════╪════════════╪════════╪════════╪═════════════╪═════════════╪═════════════╪═════════════╡
│ 1 ┆ 2020-01-01 ┆ 1 ┆ 10 ┆ null ┆ null ┆ null ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 2020-01-01 ┆ 2 ┆ 20 ┆ 1 ┆ 10 ┆ 1 ┆ 10 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 2020-02-01 ┆ 3 ┆ 30 ┆ null ┆ null ┆ null ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 2020-02-01 ┆ 4 ┆ 40 ┆ 3 ┆ 30 ┆ 1 ┆ 10 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 2020-02-01 ┆ 5 ┆ 50 ┆ 4 ┆ 40 ┆ 1 ┆ 10 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2020-01-01 ┆ 6 ┆ 60 ┆ null ┆ null ┆ null ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2020-01-01 ┆ 7 ┆ 70 ┆ 6 ┆ 60 ┆ 1 ┆ 10 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2020-02-01 ┆ 8 ┆ 80 ┆ null ┆ null ┆ null ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2020-02-01 ┆ 9 ┆ 90 ┆ 8 ┆ 80 ┆ 1 ┆ 10 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2020-02-01 ┆ 10 ┆ 100 ┆ 9 ┆ 90 ┆ 1 ┆ 10 │
└─────┴────────────┴────────┴────────┴─────────────┴─────────────┴─────────────┴─────────────┘

window agg over one value, but return another via Polars

I am trying to use Polars to do a window aggregate over one value, but map it back to another.
For example, I want to get the name of the max value in a group, instead of (or in combination with) just the max value.
Assuming an input of something like this:
| label | name | value |
| a.    | foo  | 1.0   |
| a.    | bar  | 2.0   |
| b.    | baz  | 1.5   |
| b.    | boo  | -1.0  |
# 'max_by' is not a real method, just using it to express what I'm trying to achieve.
df.select(col('label'), col('name').max_by('value').over('label'))
I want an output like this:
| label | name |
| a.    | bar  |
| b.    | baz  |
Ideally with the value too, but I know I can easily add that in via col('value').max().over('label').
| label | name | value |
| a.    | bar  | 2.0   |
| b.    | baz  | 1.5   |
You were close. There is a sort_by expression that can be used.
df.groupby('label').agg(pl.all().sort_by('value').last())
shape: (2, 3)
┌───────┬──────┬───────┐
│ label ┆ name ┆ value │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 │
╞═══════╪══════╪═══════╡
│ a. ┆ bar ┆ 2.0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b. ┆ baz ┆ 1.5 │
└───────┴──────┴───────┘
If you need a windowed version of this:
df.with_columns([
    pl.col(['name', 'value']).sort_by('value').last().over('label').suffix("_max")
])
shape: (4, 5)
┌───────┬──────┬───────┬──────────┬───────────┐
│ label ┆ name ┆ value ┆ name_max ┆ value_max │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 ┆ str ┆ f64 │
╞═══════╪══════╪═══════╪══════════╪═══════════╡
│ a. ┆ foo ┆ 1.0 ┆ bar ┆ 2.0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ a. ┆ bar ┆ 2.0 ┆ bar ┆ 2.0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ b. ┆ baz ┆ 1.5 ┆ baz ┆ 1.5 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ b. ┆ boo ┆ -1.0 ┆ baz ┆ 1.5 │
└───────┴──────┴───────┴──────────┴───────────┘
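As an aside, the same "max_by" idea can also be written with an explicit filter against the group max; a sketch (ties are broken by taking the first matching row):

# Sketch: keep the rows where value equals the group max, then take the first.
df.groupby('label').agg([
    pl.col('name').filter(pl.col('value') == pl.col('value').max()).first(),
    pl.col('value').max(),
])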

polars outer join default null value

https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.DataFrame.join.html
Can I specify the default NULL value for outer joins? Like 0?
The join method does not currently have an option for setting a default value for nulls. However, there is an easy way to accomplish this.
Let's say we have this data:
import polars as pl
df1 = pl.DataFrame({"key": ["a", "b", "d"], "var1": [1, 1, 1]})
df2 = pl.DataFrame({"key": ["a", "b", "c"], "var2": [2, 2, 2]})
df1.join(df2, on="key", how="outer")
shape: (4, 3)
┌─────┬──────┬──────┐
│ key ┆ var1 ┆ var2 │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════╪══════╪══════╡
│ a ┆ 1 ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ b ┆ 1 ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ c ┆ null ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ d ┆ 1 ┆ null │
└─────┴──────┴──────┘
To replace the null values with a different value, simply use fill_null:
df1.join(df2, on="key", how="outer").with_column(pl.all().fill_null(0))
shape: (4, 3)
┌─────┬──────┬──────┐
│ key ┆ var1 ┆ var2 │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════╪══════╪══════╡
│ a ┆ 1 ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ b ┆ 1 ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ c ┆ 0 ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ d ┆ 1 ┆ 0 │
└─────┴──────┴──────┘
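One caveat: pl.all() targets every column, including the join key. If you want to restrict the fill to the value columns, name them explicitly; a sketch:

# Fill nulls only in the value columns, leaving the join key untouched.
df1.join(df2, on="key", how="outer").with_columns(
    [pl.col(["var1", "var2"]).fill_null(0)]
)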