Polars: groupby rolling sum - python-polars

Say I have
df = pl.DataFrame({'group': [1, 1, 1, 3, 3, 3, 4, 4], 'value': [1, 4, 2, 5, 3, 4, 2, 3]})
I'd like to get a rolling sum, with a window of 2, within each group.
Expected output is:
┌───────┐
│ value │
│ --- │
│ i64 │
╞═══════╡
│ 1 │
├╌╌╌╌╌╌╌┤
│ 5 │
├╌╌╌╌╌╌╌┤
│ 6 │
├╌╌╌╌╌╌╌┤
│ 5 │
├╌╌╌╌╌╌╌┤
│ 8 │
├╌╌╌╌╌╌╌┤
│ 7 │
├╌╌╌╌╌╌╌┤
│ 2 │
├╌╌╌╌╌╌╌┤
│ 5 │
└───────┘

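Use .rolling_sum() together with .over("group") so the window is applied within each group.
Setting min_periods=1 lets the first row of each group be summed on its own instead of producing a null.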
>>> df.select(pl.col("value").rolling_sum(2, min_periods=1).over("group"))
shape: (8, 1)
┌───────┐
│ value │
│ --- │
│ i64 │
╞═══════╡
│ 1 │
├╌╌╌╌╌╌╌┤
│ 5 │
├╌╌╌╌╌╌╌┤
│ 6 │
├╌╌╌╌╌╌╌┤
│ 5 │
├╌╌╌╌╌╌╌┤
│ 8 │
├╌╌╌╌╌╌╌┤
│ 7 │
├╌╌╌╌╌╌╌┤
│ 2 │
├╌╌╌╌╌╌╌┤
│ 5 │
└───────┘
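For anyone who wants to check the numbers by hand, here is a plain-Python sketch of what the expression above computes on this data. It is not how Polars implements it; the function name rolling_sum_per_group is made up for illustration, and it assumes partial windows are summed as-is (the min_periods=1 behaviour).

```python
from collections import defaultdict, deque


def rolling_sum_per_group(groups, values, window=2):
    """Sum the last `window` values seen so far within each group.

    Hypothetical helper for illustration only; partial windows at the
    start of a group are summed as-is, like min_periods=1.
    """
    # One bounded deque per group holds that group's most recent values.
    recent = defaultdict(lambda: deque(maxlen=window))
    out = []
    for g, v in zip(groups, values):
        recent[g].append(v)  # the deque drops the oldest value once full
        out.append(sum(recent[g]))
    return out


groups = [1, 1, 1, 3, 3, 3, 4, 4]
values = [1, 4, 2, 5, 3, 4, 2, 3]
print(rolling_sum_per_group(groups, values))  # [1, 5, 6, 5, 8, 7, 2, 5]
```

The result matches the expected column from the question, and rows keep their original order, which is also what .over() does.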

Related

Create duplicates of row based column values

I'm trying to build a histogram of some data in polars. As part of my histogram code, I need to duplicate some rows. I've got a column of values, where each row also has a weight that says how many times the row should be added to the histogram.
How can I duplicate my value rows according to the weight column?
Here is some example data, with a target series:
import polars as pl
df = pl.DataFrame({"value":[1,2,3], "weight":[2, 2, 1]})
print(df)
# shape: (3, 2)
# ┌───────┬────────┐
# │ value ┆ weight │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═══════╪════════╡
# │ 1 ┆ 2 │
# │ 2 ┆ 2 │
# │ 3 ┆ 1 │
# └───────┴────────┘
s_target = pl.Series(name="value", values=[1,1,2,2,3])
print(s_target)
# shape: (5,)
# Series: 'value' [i64]
# [
# 1
# 1
# 2
# 2
# 3
# ]
How about:
(
    df.with_columns(
        pl.col("value").repeat_by(pl.col("weight"))
    )
    .select(pl.col("value").arr.explode())
)
In [11]: df.with_columns(pl.col('value').repeat_by(pl.col('weight'))).select(pl.col('value').arr.explode())
Out[11]:
shape: (5, 1)
┌───────┐
│ value │
│ --- │
│ i64 │
╞═══════╡
│ 1 │
│ 1 │
│ 2 │
│ 2 │
│ 3 │
└───────┘
I didn't know you could do this so easily; I only learned about it while writing this answer. Polars is so nice :)
Turns out repeat_by and a subsequent explode are the perfect building blocks for this transformation:
>>> df.select(pl.col('value').repeat_by('weight').arr.explode())
shape: (5, 1)
┌───────┐
│ value │
│ --- │
│ i64 │
╞═══════╡
│ 1 │
│ 1 │
│ 2 │
│ 2 │
│ 3 │
└───────┘
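If it helps to see the two steps spelled out, here is a rough plain-Python equivalent of repeat-then-explode. The helper name repeat_rows is hypothetical and this is only a sketch of the logic, not of the Polars internals.

```python
from itertools import chain, repeat


def repeat_rows(values, weights):
    """Repeat each value `weight` times, then flatten into one list.

    Hypothetical helper: repeat() plays the role of repeat_by, and
    chain.from_iterable() plays the role of explode.
    """
    return list(chain.from_iterable(repeat(v, w) for v, w in zip(values, weights)))


print(repeat_rows([1, 2, 3], [2, 2, 1]))  # [1, 1, 2, 2, 3]
```

A weight of 0 simply drops the value from the flattened list in this sketch.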

Enumerate each group

Starting with
df = pl.DataFrame({'group': [1, 1, 1, 3, 3, 3, 4, 4]})
how can I get a column which numbers the 'group' column?
Here's what df looks like:
shape: (8, 1)
┌───────┐
│ group │
│ --- │
│ i64 │
╞═══════╡
│ 1 │
├╌╌╌╌╌╌╌┤
│ 1 │
├╌╌╌╌╌╌╌┤
│ 1 │
├╌╌╌╌╌╌╌┤
│ 3 │
├╌╌╌╌╌╌╌┤
│ 3 │
├╌╌╌╌╌╌╌┤
│ 3 │
├╌╌╌╌╌╌╌┤
│ 4 │
├╌╌╌╌╌╌╌┤
│ 4 │
└───────┘
and here's my expected output:
shape: (8, 2)
┌───────┬─────────┐
│ group ┆ group_i │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═══════╪═════════╡
│ 1 ┆ 0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 1 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 1 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 1 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ 2 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ 2 │
└───────┴─────────┘
Here's one way I came up with, but it feels a bit complex for this task. Is there a simpler way?
df.with_column(((pl.col('group')!=pl.col('group').shift()).cast(pl.Int64).cumsum()-1).alias('group_i'))
I think the terms come from SQL:
You're looking to .rank() your data, in particular with a "dense" ranking.
>>> df.with_column(pl.col("group").alias("group_i").rank("dense") - 1)
shape: (8, 2)
┌───────┬─────────┐
│ group | group_i │
│ --- | --- │
│ i64 | u32 │
╞═══════╪═════════╡
│ 1 | 0 │
├───────┼─────────┤
│ 1 | 0 │
├───────┼─────────┤
│ 1 | 0 │
├───────┼─────────┤
│ 3 | 1 │
├───────┼─────────┤
│ 3 | 1 │
├───────┼─────────┤
│ 3 | 1 │
├───────┼─────────┤
│ 4 | 2 │
├───────┼─────────┤
│ 4 | 2 │
└───────┴─────────┘
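One thing worth keeping in mind: a dense rank numbers groups by their sorted value, while the shift-and-compare approach from the question numbers consecutive runs. On sorted data like the example they agree, but not in general. A small plain-Python sketch (both function names are made up for illustration):

```python
def dense_rank_zero_based(values):
    """Zero-based dense rank: equal values share a rank, with no gaps."""
    order = {v: i for i, v in enumerate(sorted(set(values)))}
    return [order[v] for v in values]


def run_ids(values):
    """Number consecutive runs of equal values in order of appearance."""
    ids, current = [], -1
    prev = object()  # sentinel that compares unequal to any real value
    for v in values:
        if v != prev:
            current += 1
            prev = v
        ids.append(current)
    return ids


group = [1, 1, 1, 3, 3, 3, 4, 4]
print(dense_rank_zero_based(group))  # [0, 0, 0, 1, 1, 1, 2, 2]
print(run_ids(group))                # [0, 0, 0, 1, 1, 1, 2, 2]

# On unsorted data the two numberings differ.
print(dense_rank_zero_based([3, 1, 3]))  # [1, 0, 1]
print(run_ids([3, 1, 3]))                # [0, 1, 2]
```

So the rank version is the simpler choice when the group column is sorted, or when numbering by value rather than by position is what you want.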

Is it possible in Polars to "reset" cumsum() at a certain condition?

I need to cumsum column b until a becomes True. After that, the cumsum should restart from that row, and so on.
a | b
-------------
False | 1
False | 2
True | 3
False | 4
Can I do this in Polars without looping over each row?
You could use the .cumsum() of the a column as the "group number".
>>> df.select(pl.col("a").cumsum())
shape: (4, 1)
┌─────┐
│ a │
│ --- │
│ i64 │
╞═════╡
│ 0 │
├╌╌╌╌╌┤
│ 0 │
├╌╌╌╌╌┤
│ 1 │
├╌╌╌╌╌┤
│ 1 │
└─────┘
And use that with .over()
>>> df.select(pl.col("b").cumsum().over(pl.col("a").cumsum()))
shape: (4, 1)
┌─────┐
│ b │
│ --- │
│ i64 │
╞═════╡
│ 1 │
├╌╌╌╌╌┤
│ 3 │
├╌╌╌╌╌┤
│ 3 │
├╌╌╌╌╌┤
│ 7 │
└─────┘
If the True row should instead be counted in the preceding group, apply .shift().backward_fill() to the group number:
>>> df.select(pl.col("b").cumsum().over(
... pl.col("a").cumsum().shift().backward_fill()))
shape: (4, 1)
┌─────┐
│ b │
│ --- │
│ i64 │
╞═════╡
│ 1 │
├╌╌╌╌╌┤
│ 3 │
├╌╌╌╌╌┤
│ 6 │
├╌╌╌╌╌┤
│ 4 │
└─────┘
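To make the difference between the two variants concrete, here is the intended logic written as explicit loops in plain Python. This is only a sketch checked against the four example rows; the function names are hypothetical, and the whole point of the Polars version is that you don't need the loop.

```python
def cumsum_reset_at(flags, values):
    """Restart the running total on every row where the flag is True."""
    total, out = 0, []
    for flag, v in zip(flags, values):
        if flag:
            total = 0  # the True row starts a new group
        total += v
        out.append(total)
    return out


def cumsum_reset_after(flags, values):
    """Count the True row in the current total, then restart on the next row."""
    total, out = 0, []
    for flag, v in zip(flags, values):
        total += v
        out.append(total)
        if flag:
            total = 0  # the True row closes its group
    return out


a = [False, False, True, False]
b = [1, 2, 3, 4]
print(cumsum_reset_at(a, b))     # [1, 3, 3, 7]
print(cumsum_reset_after(a, b))  # [1, 3, 6, 4]
```

The first function reproduces the plain .over(pl.col("a").cumsum()) output, and the second reproduces the shifted one.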

Sorting a groupby expression when taking first row, keeping all columns

Given the following dataframe, I would like to group by "foo", sort on "bar", and then keep the whole row.
df = pl.DataFrame(
    {
        "foo": [1, 1, 1, 2, 2, 2, 3],
        "bar": [5, 7, 6, 4, 2, 3, 1],
        "baz": [1, 2, 3, 4, 5, 6, 7],
    }
)
df_desired = pl.DataFrame({"foo": [1, 2, 3], "bar": [5, 2, 1], "baz": [1,5,7]})
>>> df_desired
shape: (3, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ baz │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 1 ┆ 5 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ 2 ┆ 5 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 3 ┆ 1 ┆ 7 │
└─────┴─────┴─────┘
I can do this by sorting beforehand, but this is expensive compared to sorting the group:
df_solution = df.sort("bar").groupby("foo", maintain_order=True).first().sort(by="foo")
assert df_desired.frame_equal(df_solution)
I can sort "bar" inside the aggregation, as in this SO answer:
>>> df.groupby("foo").agg(pl.col("bar").sort().first()).sort(by="foo")
shape: (3, 2)
┌─────┬─────┐
│ foo ┆ bar │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1 ┆ 5 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 3 ┆ 1 │
└─────┴─────┘
but then I only get that column. How do I also keep "baz"'s row value? Any additional entries to .agg([]) are independent of the new pl.col("bar").sort().
You could use .unique() instead of .groupby() after the .sort(). This relies on .unique() keeping the first row it encounters for each "foo":
>>> df.sort(by="bar").unique(subset="foo")
shape: (3, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ baz │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 3 ┆ 1 ┆ 7 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ 2 ┆ 5 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 1 ┆ 5 ┆ 1 │
└─────┴─────┴─────┘
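For .groupby().agg(), you can get the index of the row with the smallest "bar" using pl.col("bar").arg_min().
You can pass this to pl.all().take() to return all columns. (Without maintain_order=True, the order of the groups in the output is not guaranteed.)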
>>> df.groupby("foo").agg(pl.all().take(pl.col("bar").arg_min()))
shape: (3, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ baz │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 3 ┆ 1 ┆ 7 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ 2 ┆ 5 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 1 ┆ 5 ┆ 1 │
└─────┴─────┴─────┘
UPDATE: this can also be written as .sort_by().first():
>>> df.groupby("foo").agg(pl.all().sort_by("bar").first())
shape: (3, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ baz │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 1 ┆ 5 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 3 ┆ 1 ┆ 7 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ 2 ┆ 5 │
└─────┴─────┴─────┘
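As a sanity check on what "the row with the smallest bar per foo" means, here is a single-pass plain-Python sketch. It assumes rows are given as dicts, and the helper name first_row_per_group is made up; like the arg_min approach, it never sorts a whole group.

```python
def first_row_per_group(rows, group_key, sort_key):
    """Keep, for each group, the whole row with the smallest `sort_key`.

    Hypothetical helper: one pass over the rows, tracking the best row
    seen so far per group. Ties keep the earliest row.
    """
    best = {}
    for row in rows:
        g = row[group_key]
        if g not in best or row[sort_key] < best[g][sort_key]:
            best[g] = row
    # Sort by group key only to get a stable, readable output order.
    return [best[g] for g in sorted(best)]


foo = [1, 1, 1, 2, 2, 2, 3]
bar = [5, 7, 6, 4, 2, 3, 1]
baz = [1, 2, 3, 4, 5, 6, 7]
rows = [{"foo": f, "bar": b, "baz": z} for f, b, z in zip(foo, bar, baz)]

for row in first_row_per_group(rows, "foo", "bar"):
    print(row)
# {'foo': 1, 'bar': 5, 'baz': 1}
# {'foo': 2, 'bar': 2, 'baz': 5}
# {'foo': 3, 'bar': 1, 'baz': 7}
```

This gives the same three rows as df_desired in the question.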

How to use join in expression context?

Suppose I have a mapping dataframe that I would like to join to an original dataframe:
df = pl.DataFrame({
    'A': [1, 2, 3, 2, 1],
})
mapper = pl.DataFrame({
    'key': [1, 2, 3, 4, 5],
    'value': ['a', 'b', 'c', 'd', 'e']
})
I can map A to value directly via df.join(mapper, ...), but is there a way to do this in an expression context, i.e. while building columns? As in:
df.with_columns([
    (pl.col('A')+1).join(mapper, left_on='A', right_on='key')
])
Which would furnish:
shape: (5, 2)
┌─────┬───────┐
│ A ┆ value │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═══════╡
│ 1 ┆ b │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 1 ┆ b │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2 ┆ c │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2 ┆ c │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3 ┆ d │
└─────┴───────┘
Probably, yes. I just put df.select(pl.col('A') + 1) inside:
df = df.with_columns([
    pl.col('A'),
    df.select(pl.col('A') + 1).join(mapper, left_on='A', right_on='key')['value']
])
print(df)
┌─────┬───────┐
│ A ┆ value │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═══════╡
│ 1 ┆ b │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2 ┆ b │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3 ┆ c │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2 ┆ c │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 1 ┆ d │
└─────┴───────┘
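One caveat: a join is not guaranteed to hand rows back in the original order, and in the output above the second row pairs A=2 with 'b', which suggests the joined column is no longer aligned with df row by row. For comparison, here is what a row-aligned lookup of A+1 gives in plain Python. This is only a sketch of the intended mapping, and map_shifted is a made-up name.

```python
def map_shifted(a_values, keys, values, offset=1):
    """Look up (a + offset) in a key -> value mapping, row by row.

    Hypothetical helper: keys without a match map to None, similar to
    what a left join would leave as null.
    """
    lookup = dict(zip(keys, values))
    return [lookup.get(a + offset) for a in a_values]


A = [1, 2, 3, 2, 1]
keys = [1, 2, 3, 4, 5]
values = ['a', 'b', 'c', 'd', 'e']
print(map_shifted(A, keys, values))  # ['b', 'c', 'd', 'c', 'b']
```

If the joined column needs to line up with A, it is worth checking the result against a row-wise lookup like this one before attaching it with with_columns.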