Cumulative sum that resets when turning negative/positive - python-polars

[enter image description here]
I am trying to add a column (column C) to my polars dataframe that counts how many times a value of one of the dataframe's columns (column A) is greater/less than the value of another column (column B). Once the value turns from less/greater to greater/less the cumulative sum should reset and start counting from 1/-1 again.

The data
I'm going to change the data in the example you provided.
df = pl.DataFrame(
{
"a": [11, 10, 10, 10, 9, 8, 8, 8, 8, 8, 15, 15, 15],
"b": [11, 9, 9, 9, 9, 9, 10, 8, 8, 10, 11, 11, 15],
}
)
print(df)
shape: (13, 2)
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 11 ┆ 11 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 10 ┆ 9 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 10 ┆ 9 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 10 ┆ 9 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 9 ┆ 9 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 8 ┆ 9 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 8 ┆ 10 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 8 ┆ 8 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 8 ┆ 8 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 8 ┆ 10 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 15 ┆ 11 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 15 ┆ 11 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 15 ┆ 15 │
└─────┴─────┘
Notice the cases where the two columns are the same. Your post didn't address what to do in these cases, so I made some assumptions as to what should happen. (You can adapt the code to handle those cases differently.)
The algorithm
df = (
df
.with_column((pl.col("a") - pl.col("b")).sign().alias("sign_a_minus_b"))
.with_column(
pl.when(pl.col("sign_a_minus_b") == 0)
.then(None)
.otherwise(pl.col("sign_a_minus_b"))
.forward_fill()
.alias("run_type")
)
.with_column(
(pl.col("run_type") != pl.col("run_type").shift_and_fill(1, 0))
.cumsum()
.alias("run_id")
)
.with_column(pl.col("sign_a_minus_b").cumsum().over("run_id").alias("result"))
)
print(df)
shape: (13, 6)
┌─────┬─────┬────────────────┬──────────┬────────┬────────┐
│ a ┆ b ┆ sign_a_minus_b ┆ run_type ┆ run_id ┆ result │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ u32 ┆ i64 │
╞═════╪═════╪════════════════╪══════════╪════════╪════════╡
│ 11 ┆ 11 ┆ 0 ┆ null ┆ 1 ┆ 0 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 10 ┆ 9 ┆ 1 ┆ 1 ┆ 2 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 10 ┆ 9 ┆ 1 ┆ 1 ┆ 2 ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 10 ┆ 9 ┆ 1 ┆ 1 ┆ 2 ┆ 3 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 9 ┆ 9 ┆ 0 ┆ 1 ┆ 2 ┆ 3 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 8 ┆ 9 ┆ -1 ┆ -1 ┆ 3 ┆ -1 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 8 ┆ 10 ┆ -1 ┆ -1 ┆ 3 ┆ -2 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 8 ┆ 8 ┆ 0 ┆ -1 ┆ 3 ┆ -2 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 8 ┆ 8 ┆ 0 ┆ -1 ┆ 3 ┆ -2 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 8 ┆ 10 ┆ -1 ┆ -1 ┆ 3 ┆ -3 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 15 ┆ 11 ┆ 1 ┆ 1 ┆ 4 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 15 ┆ 11 ┆ 1 ┆ 1 ┆ 4 ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 15 ┆ 15 ┆ 0 ┆ 1 ┆ 4 ┆ 2 │
└─────┴─────┴────────────────┴──────────┴────────┴────────┘
I've left the intermediate calculations in the output, merely to show how the algorithm works. (You can drop them.)
The basic idea is to calculate a run_id for each run of positive or negative values. We will then use the cumsum function and the over windowing expression to create a running count of positives/negatives over each run_id.
Key assumption: ties in columns a and b do not interrupt a run, but they do not contribute to the total for that run of positive/negative values.
sign_a_minus_b does two things: it identifies whether a run is positive/negative, and whether there is a tie in columns a and b.
run_type extends any run to include any cases where a tie occurs in columns a and b. The null value at the top of the column was intended - it shows what happens when a tie occurs in the first row.
result is the output column. Note that tied columns do not interrupt a run, but they don't contribute to the totals for that run.
One final note: if ties in columns a and b are not allowed, then this algorithm can be simplified ... and run faster.

Not very elegant or Pythonic, but something like the below should work:
import pandas as pd
df = pd.DataFrame({'a': [10, 10, 10, 8, 8, 8, 15, 15]
,'b': [9, 9, 9, 9, 10, 10, 11, 11]})
df['c'] = df.apply(lambda row: 1 if row['a'] > row['b'] else 0, axis=1)
df['d'] = df.apply(lambda row: 0 if row['a'] > row['b'] else -1, axis=1)
for i in range(1, len(df)):
if df.loc[i, 'a'] > df.loc[i, 'b']:
df.loc[i, 'c'] = df.loc[i-1, 'c'] + 1
df.loc[i, 'd'] = 0
else:
df.loc[i, 'd'] = df.loc[i-1, 'd'] - 1
df.loc[i, 'c'] = 0
df['ans'] = df['c'] + df['d']
print(df)
Also you might need to think about what the value should be for the specific case when column a and b are equal.

Related

How to group items by difference in percent in Polars?

I'd like to group values so that the difference between each group item remains within a certain percentage. E.g. each time when an item is 5% over the first group element it goes to the new group. As a return I need the first group value. Example with 5% threshold where 'a' is given, 'group' and 'groupFirst' must be calculated:
import polars as pl
df = pl.DataFrame({'a': [100, 103, 105, 106, 105, 104, 103, 106, 100, 102],
'group': [0, 0, 1, 1, 1, 1, 1, 1, 2, 2],
'groupFirst': [100, 100, 105, 105, 105, 105, 105, 105, 100, 100]})
print(df)
shape: (10, 3)
┌─────┬───────┬────────────┐
│ a ┆ group ┆ groupFirst │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═══════╪════════════╡
│ 100 ┆ 0 ┆ 100 │
│ 103 ┆ 0 ┆ 100 │
│ 105 ┆ 1 ┆ 105 │
│ 106 ┆ 1 ┆ 105 │
│ ... ┆ ... ┆ ... │
│ 103 ┆ 1 ┆ 105 │
│ 106 ┆ 1 ┆ 105 │
│ 100 ┆ 2 ┆ 100 │
│ 102 ┆ 2 ┆ 100 │
└─────┴───────┴────────────┘
Say you want to reset cummax when your values exceed the value 6. You could do:
(
df.with_columns(
(pl.col("a") >= 6)
.shift(1)
.fill_null(False)
.cumsum()
.alias("group")
).with_columns(
pl.col("a")
.cummax()
.over(pl.col("group"))
.alias("cummax")
)
)
Example:
In [73]: df = pl.DataFrame({'a': [1, 3, 5, 6, 1, 4, 3, 6, 5, 6]})
In [74]: (
...: df.with_columns(
...: (pl.col("a") >= 6)
...: .shift(1)
...: .fill_null(False)
...: .cumsum()
...: .alias("group")
...: ).with_columns(
...: pl.col("a")
...: .cummax()
...: .over(pl.col("group"))
...: .alias("cummax")
...: )
...: )
Out[74]:
shape: (10, 3)
┌─────┬───────┬────────┐
│ a ┆ group ┆ cummax │
│ --- ┆ --- ┆ --- │
│ i64 ┆ u32 ┆ i64 │
╞═════╪═══════╪════════╡
│ 1 ┆ 0 ┆ 1 │
│ 3 ┆ 0 ┆ 3 │
│ 5 ┆ 0 ┆ 5 │
│ 6 ┆ 0 ┆ 6 │
│ ... ┆ ... ┆ ... │
│ 3 ┆ 1 ┆ 4 │
│ 6 ┆ 1 ┆ 6 │
│ 5 ┆ 2 ┆ 5 │
│ 6 ┆ 2 ┆ 6 │
└─────┴───────┴────────┘
I don't think there is a way to use polars expressions to generate the groups since they always rely on the previous group. That being said the groups can be easily generated in O(n) so the penalty for doing that in python should be minor.
import numpy as np
def make_groups(a, threshold=1.05):
a=np.array(a)
outarray=np.empty(len(a), dtype=a.dtype)
outarray[0]=0
curgroup=a[0]
for indx, cur_a in enumerate(a[1:],1):
if cur_a >= threshold * curgroup or cur_a * threshold <= curgroup:
outarray[indx] = outarray[indx-1] + 1
curgroup=cur_a
else:
outarray[indx] = outarray[indx-1]
return pl.Series(outarray)
So now let's take that to our data.
df = pl.DataFrame({'a': [100, 103, 105, 106, 105, 104, 103, 106, 100, 102]})
We just do a map (incidentally, I tried making make_groups into a np.ufunc but couldn't get it to work).
df \
.with_columns(pl.col('a').map(lambda x: make_groups(x, 1.05)).alias('group')) \
.with_columns((pl.col('a').list().over('group').arr.first()).alias('groupFirst'))
shape: (10, 3)
┌─────┬───────┬────────────┐
│ a ┆ group ┆ groupFirst │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═══════╪════════════╡
│ 100 ┆ 0 ┆ 100 │
│ 103 ┆ 0 ┆ 100 │
│ 105 ┆ 1 ┆ 105 │
│ 106 ┆ 1 ┆ 105 │
│ ... ┆ ... ┆ ... │
│ 103 ┆ 1 ┆ 105 │
│ 106 ┆ 1 ┆ 105 │
│ 100 ┆ 2 ┆ 100 │
│ 102 ┆ 2 ┆ 100 │
└─────┴───────┴────────────┘
By the way, if you just want to use the default threshold then you can just do...
df \
.with_columns(pl.col('a').map(make_groups).alias('group')) \
.with_columns((pl.col('a').list().over('group').arr.first()).alias('groupFirst'))

Polars solution to normalise groups by per-group reference value

I'm trying to use Polars to normalise the values of groups of entries by a single reference value per group.
In the example data below, I'm trying to generate the column normalised which contains values divided by the per-group ref reference state value, i.e.:
group_id reference_state value normalised
1 ref 5 1.0
1 a 3 0.6
1 b 1 0.2
2 ref 4 1.0
2 a 8 2.0
2 b 2 0.5
This is straightforward in Pandas:
for (i, x) in df.groupby("group_id"):
ref_val = x.loc[x["reference_state"] == "ref"]["value"]
df.loc[df["group_id"] == i, "normalised"] = x["value"] / ref_val.to_list()[0]
Is there a way to do this in Polars?
Thanks in advance!
You can use a window function to make an expression operate on different groups via:
.over("group_id")
and then you can write the logic which divides by the values if equal to "ref" with:
pl.col("value") / pl.col("value").filter(pl.col("reference_state") == "ref").first()
Putting it all together:
df = pl.DataFrame({
"group_id": [1, 1, 1, 2, 2, 2],
"reference_state": ["ref", "a", "b", "ref", "a", "b"],
"value": [5, 3, 1, 4, 8, 2],
})
(df.with_columns([
(
pl.col("value") /
pl.col("value").filter(pl.col("reference_state") == "ref").first()
).over("group_id").alias("normalised")
]))
shape: (6, 4)
┌──────────┬─────────────────┬───────┬────────────┐
│ group_id ┆ reference_state ┆ value ┆ normalised │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ i64 ┆ f64 │
╞══════════╪═════════════════╪═══════╪════════════╡
│ 1 ┆ ref ┆ 5 ┆ 1.0 │
│ 1 ┆ a ┆ 3 ┆ 0.6 │
│ 1 ┆ b ┆ 1 ┆ 0.2 │
│ 2 ┆ ref ┆ 4 ┆ 1.0 │
│ 2 ┆ a ┆ 8 ┆ 2.0 │
│ 2 ┆ b ┆ 2 ┆ 0.5 │
└──────────┴─────────────────┴───────┴────────────┘
Here's one way to do it:
create a temporary dataframe which, for each group_id, tells you the value where reference_state is 'ref'
join with that temporary dataframe
(
df.join(
df.filter(pl.col("reference_state") == "ref").select(["group_id", "value"]),
on="group_id",
)
.with_column((pl.col("value") / pl.col("value_right")).alias("normalised"))
.drop("value_right")
)
This gives you:
Out[16]:
shape: (6, 4)
┌──────────┬─────────────────┬───────┬────────────┐
│ group_id ┆ reference_state ┆ value ┆ normalised │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ i64 ┆ f64 │
╞══════════╪═════════════════╪═══════╪════════════╡
│ 1 ┆ ref ┆ 5 ┆ 1.0 │
│ 1 ┆ a ┆ 3 ┆ 0.6 │
│ 1 ┆ b ┆ 1 ┆ 0.2 │
│ 2 ┆ ref ┆ 4 ┆ 1.0 │
│ 2 ┆ a ┆ 8 ┆ 2.0 │
│ 2 ┆ b ┆ 2 ┆ 0.5 │
└──────────┴─────────────────┴───────┴────────────┘

Sorting a groupby expression when taking first row, keeping all columns

Given the following dataframe, I would like to group by "foo", sort on "bar", and then keep the whole row.
df = pl.DataFrame(
{
"foo": [1, 1, 1, 2, 2, 2, 3],
"bar": [5, 7, 6, 4, 2, 3, 1],
"baz": [1, 2, 3, 4, 5, 6, 7],
}
)
df_desired = pl.DataFrame({"foo": [1, 2, 3], "bar": [5, 2, 1], "baz": [1,5,7]})
>>> df_desired
shape: (3, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ baz │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 1 ┆ 5 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ 2 ┆ 5 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 3 ┆ 1 ┆ 7 │
└─────┴─────┴─────┘
I can do this by sorting beforehand, but this is expensive compared to sorting the group:
df_solution = df.sort("bar").groupby("foo", maintain_order=True).first().sort(by="foo")
assert df_desired.frame_equal(df_solution)
I can sort by "foo" in the aggregation, as in this SO answer:
>>> df.groupby("foo").agg(pl.col("bar").sort().first()).sort(by="foo")
shape: (3, 2)
┌─────┬─────┐
│ foo ┆ bar │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1 ┆ 5 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 3 ┆ 1 │
└─────┴─────┘
but then I only get that column. How do I also keep "baz"'s row value? Any additional entries to .agg([]) are independent of the new pl.col("bar").sort().
You could use .unique() instead of .groupby() after the .sort()
>>> df.sort(by="bar").unique(subset="foo")
shape: (3, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ baz │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 3 ┆ 1 ┆ 7 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ 2 ┆ 5 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 1 ┆ 5 ┆ 1 │
└─────┴─────┴─────┘
For .groupby().agg() you can get the index of the row with pl.col("bar").arg_min()
You can pass this to pl.all().take() to return all columns.
>>> df.groupby("foo").agg(pl.all().take(pl.col("bar").arg_min()))
shape: (3, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ baz │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 3 ┆ 1 ┆ 7 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ 2 ┆ 5 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 1 ┆ 5 ┆ 1 │
└─────┴─────┴─────┘
UPDATE:
Can also be written as .sort_by().first()
>>> df.groupby("foo").agg(pl.all().sort_by("bar").first())
shape: (3, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ baz │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 1 ┆ 5 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 3 ┆ 1 ┆ 7 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ 2 ┆ 5 │
└─────┴─────┴─────┘

Polars unable to compute over() when an input was a list

I have a dataframe like this:
import polars as pl
orig_df = pl.DataFrame({
"primary_key": [1, 1, 2, 2, 3, 3],
"simple_foreign_keys": [1, 1, 2, 4, 4, 4],
"fancy_foreign_keys": [[1], [1], [2], [1], [3, 4], [3, 4]],
})
I have some logic that computes when the secondary column changes, then additional logic to rank those changes per primary column. It works fine on the simple data type:
foreign_key_changed = pl.col('simple_foreign_keys') != pl.col('simple_foreign_keys').shift_and_fill(1, 0)
df = orig_df.sort('primary_key').with_column(foreign_key_changed.cumsum().alias('raw_changes'))
print(df)
shape: (6, 4)
┌─────────────┬─────────────────────┬────────────────────┬─────────────┐
│ primary_key ┆ simple_foreign_keys ┆ fancy_foreign_keys ┆ raw_changes │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ list[i64] ┆ u32 │
╞═════════════╪═════════════════════╪════════════════════╪═════════════╡
│ 1 ┆ 1 ┆ [1] ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 1 ┆ [1] ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2 ┆ [2] ┆ 2 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 4 ┆ [1] ┆ 3 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 4 ┆ [3, 4] ┆ 3 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 4 ┆ [3, 4] ┆ 3 │
└─────────────┴─────────────────────┴────────────────────┴─────────────┘
df = orig_df.sort('primary_key').with_column(foreign_key_changed.cumsum().rank('dense').over('primary_key').alias('ranked_changes'))
print(df)
shape: (6, 4)
┌─────────────┬─────────────────────┬────────────────────┬────────────────┐
│ primary_key ┆ simple_foreign_keys ┆ fancy_foreign_keys ┆ ranked_changes │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ list[i64] ┆ u32 │
╞═════════════╪═════════════════════╪════════════════════╪════════════════╡
│ 1 ┆ 1 ┆ [1] ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 1 ┆ [1] ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2 ┆ [2] ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 4 ┆ [1] ┆ 2 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 4 ┆ [3, 4] ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 4 ┆ [3, 4] ┆ 1 │
└─────────────┴─────────────────────┴────────────────────┴────────────────┘
But if I try the exact same logic on the List column, it blows up on me. Note that the intermediate columns (the cumsum, the rank) are still plain integers:
foreign_key_changed = pl.col('fancy_foreign_keys') != pl.col('fancy_foreign_keys').shift_and_fill(1, [])
df = orig_df.sort('primary_key').with_column(foreign_key_changed.cumsum().alias('raw_changes'))
print(df)
shape: (6, 4)
┌─────────────┬─────────────────────┬────────────────────┬─────────────┐
│ primary_key ┆ simple_foreign_keys ┆ fancy_foreign_keys ┆ raw_changes │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ list[i64] ┆ u32 │
╞═════════════╪═════════════════════╪════════════════════╪═════════════╡
│ 1 ┆ 1 ┆ [1] ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 1 ┆ [1] ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2 ┆ [2] ┆ 2 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 4 ┆ [1] ┆ 3 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 4 ┆ [3, 4] ┆ 4 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 4 ┆ [3, 4] ┆ 4 │
└─────────────┴─────────────────────┴────────────────────┴─────────────┘
df = orig_df.sort('primary_key').with_column(foreign_key_changed.cumsum().rank('dense').over('primary_key').alias('ranked_changes'))
thread '<unnamed>' panicked at 'implementation error, cannot get ref List(Null) from Int64', /Users/runner/work/polars/polars/polars/polars-core/src/series/mod.rs:945:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
File "polars_example.py", line 18, in <module>
df = orig_df.sort('primary_key').with_column(foreign_key_changed.cumsum().rank('dense').over('primary_key').alias('ranked_changes'))
File "venv/lib/python3.9/site-packages/polars/internals/frame.py", line 4144, in with_column
return self.with_columns([column])
File "venv/lib/python3.9/site-packages/polars/internals/frame.py", line 5308, in with_columns
self.lazy()
File "venv/lib/python3.9/site-packages/polars/internals/lazy_frame.py", line 652, in collect
return self._dataframe_class._from_pydf(ldf.collect())
pyo3_runtime.PanicException: implementation error, cannot get ref List(Null) from Int64
Am I doing this wrong? Is there some massaging that has to happen with the types?
This is with polars 0.13.62.

polars equivalent to pandas groupby shift()

Is there an equivalent way to to df.groupby().shift in polars? Use pandas.shift() within a group
You can use the over expression to accomplish this in Polars. Using the example from the link...
import polars as pl
df = pl.DataFrame({
'object': [1, 1, 1, 2, 2],
'period': [1, 2, 4, 4, 23],
'value': [24, 67, 89, 5, 23],
})
df.with_column(
pl.col('value').shift().over('object').alias('prev_value')
)
shape: (5, 4)
┌────────┬────────┬───────┬────────────┐
│ object ┆ period ┆ value ┆ prev_value │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 │
╞════════╪════════╪═══════╪════════════╡
│ 1 ┆ 1 ┆ 24 ┆ null │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 2 ┆ 67 ┆ 24 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 4 ┆ 89 ┆ 67 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 4 ┆ 5 ┆ null │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 23 ┆ 23 ┆ 5 │
└────────┴────────┴───────┴────────────┘
To perform this on more than one column, you can specify the columns in the pl.col expression, and then use a prefix/suffix to name the new columns. For example:
df.with_columns(
pl.col(['period', 'value']).shift().over('object').prefix("prev_")
)
shape: (5, 5)
┌────────┬────────┬───────┬─────────────┬────────────┐
│ object ┆ period ┆ value ┆ prev_period ┆ prev_value │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞════════╪════════╪═══════╪═════════════╪════════════╡
│ 1 ┆ 1 ┆ 24 ┆ null ┆ null │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 2 ┆ 67 ┆ 1 ┆ 24 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 4 ┆ 89 ┆ 2 ┆ 67 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 4 ┆ 5 ┆ null ┆ null │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 23 ┆ 23 ┆ 4 ┆ 5 │
└────────┴────────┴───────┴─────────────┴────────────┘
Using multiple values with over
Let's use this data.
df = pl.DataFrame(
{
"id": [1] * 5 + [2] * 5,
"date": ["2020-01-01", "2020-01-01", "2020-02-01", "2020-02-01", "2020-02-01"] * 2,
"value1": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
"value2": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
}
).with_column(pl.col('date').str.strptime(pl.Date))
df
shape: (10, 4)
┌─────┬────────────┬────────┬────────┐
│ id ┆ date ┆ value1 ┆ value2 │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ date ┆ i64 ┆ i64 │
╞═════╪════════════╪════════╪════════╡
│ 1 ┆ 2020-01-01 ┆ 1 ┆ 10 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1 ┆ 2020-01-01 ┆ 2 ┆ 20 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1 ┆ 2020-02-01 ┆ 3 ┆ 30 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1 ┆ 2020-02-01 ┆ 4 ┆ 40 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1 ┆ 2020-02-01 ┆ 5 ┆ 50 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2020-01-01 ┆ 6 ┆ 60 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2020-01-01 ┆ 7 ┆ 70 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2020-02-01 ┆ 8 ┆ 80 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2020-02-01 ┆ 9 ┆ 90 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2020-02-01 ┆ 10 ┆ 100 │
└─────┴────────────┴────────┴────────┘
We can place a list of our grouping variables in the over expression (as well as a list in our pl.col expression). Polars will run them all in parallel.
df.with_columns([
pl.col(["value1", "value2"]).shift().over(['id','date']).prefix("prev_"),
pl.col(["value1", "value2"]).diff().over(['id','date']).suffix("_diff"),
])
shape: (10, 8)
┌─────┬────────────┬────────┬────────┬─────────────┬─────────────┬─────────────┬─────────────┐
│ id ┆ date ┆ value1 ┆ value2 ┆ prev_value1 ┆ prev_value2 ┆ value1_diff ┆ value2_diff │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ date ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════╪════════════╪════════╪════════╪═════════════╪═════════════╪═════════════╪═════════════╡
│ 1 ┆ 2020-01-01 ┆ 1 ┆ 10 ┆ null ┆ null ┆ null ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 2020-01-01 ┆ 2 ┆ 20 ┆ 1 ┆ 10 ┆ 1 ┆ 10 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 2020-02-01 ┆ 3 ┆ 30 ┆ null ┆ null ┆ null ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 2020-02-01 ┆ 4 ┆ 40 ┆ 3 ┆ 30 ┆ 1 ┆ 10 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 2020-02-01 ┆ 5 ┆ 50 ┆ 4 ┆ 40 ┆ 1 ┆ 10 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2020-01-01 ┆ 6 ┆ 60 ┆ null ┆ null ┆ null ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2020-01-01 ┆ 7 ┆ 70 ┆ 6 ┆ 60 ┆ 1 ┆ 10 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2020-02-01 ┆ 8 ┆ 80 ┆ null ┆ null ┆ null ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2020-02-01 ┆ 9 ┆ 90 ┆ 8 ┆ 80 ┆ 1 ┆ 10 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2020-02-01 ┆ 10 ┆ 100 ┆ 9 ┆ 90 ┆ 1 ┆ 10 │
└─────┴────────────┴────────┴────────┴─────────────┴─────────────┴─────────────┴─────────────┘