Enumerate each group - python-polars

Starting with
df = pl.DataFrame({'group': [1, 1, 1, 3, 3, 3, 4, 4]})
how can I get a column which numbers the 'group' column?
Here's what df looks like:
shape: (8, 1)
┌───────┐
│ group │
│ --- │
│ i64 │
╞═══════╡
│ 1 │
├╌╌╌╌╌╌╌┤
│ 1 │
├╌╌╌╌╌╌╌┤
│ 1 │
├╌╌╌╌╌╌╌┤
│ 3 │
├╌╌╌╌╌╌╌┤
│ 3 │
├╌╌╌╌╌╌╌┤
│ 3 │
├╌╌╌╌╌╌╌┤
│ 4 │
├╌╌╌╌╌╌╌┤
│ 4 │
└───────┘
and here's my expected output:
shape: (8, 2)
┌───────┬─────────┐
│ group ┆ group_i │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═══════╪═════════╡
│ 1 ┆ 0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 1 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 1 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 1 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ 2 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ 2 │
└───────┴─────────┘
Here's one way I came up with, but it feels a bit complex for this task. Is there a simpler way?
df.with_column(((pl.col('group')!=pl.col('group').shift()).cast(pl.Int64).cumsum()-1).alias('group_i'))

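You're looking to .rank() your data, specifically with a "dense" ranking (I think the term comes from SQL). Note that a dense rank numbers the distinct values in sorted order, so it matches your run-based numbering only when the group column is sorted, as it is here.
>>> df.with_column((pl.col("group").rank("dense") - 1).alias("group_i"))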
shape: (8, 2)
┌───────┬─────────┐
│ group ┆ group_i │
│ --- ┆ --- │
│ i64 ┆ u32 │
╞═══════╪═════════╡
│ 1 ┆ 0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 1 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 1 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 1 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ 2 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ 2 │
└───────┴─────────┘
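To make the difference between the two approaches concrete, here is a minimal plain-Python sketch of what each one computes. It is not how polars implements either operation; the function names dense_rank_zero_based and run_ids are made up for illustration.

```python
def dense_rank_zero_based(values):
    """Map each value to its 0-based position among the sorted distinct values."""
    position = {v: i for i, v in enumerate(sorted(set(values)))}
    return [position[v] for v in values]


def run_ids(values):
    """Number consecutive runs of equal values, starting from 0."""
    ids = []
    current = -1
    previous = object()  # sentinel that never equals a real value
    for v in values:
        if v != previous:
            current += 1
            previous = v
        ids.append(current)
    return ids


sorted_groups = [1, 1, 1, 3, 3, 3, 4, 4]
unsorted_groups = [3, 3, 1, 1, 3]

print(dense_rank_zero_based(sorted_groups))    # [0, 0, 0, 1, 1, 1, 2, 2]
print(run_ids(sorted_groups))                  # [0, 0, 0, 1, 1, 1, 2, 2]
print(dense_rank_zero_based(unsorted_groups))  # [1, 1, 0, 0, 1]
print(run_ids(unsorted_groups))                # [0, 0, 1, 1, 2]
```

On sorted input the two agree. On unsorted input they differ, so pick whichever definition of "group number" you actually need.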

Related

Polars table convert a list column to separate rows i.e. unnest a list column to multiple rows

I have a Polars dataframe in the form:
df = pl.DataFrame({'a':[1,2,3], 'b':[['a','b'],['a'],['c','d']]})
┌─────┬────────────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ list[str] │
╞═════╪════════════╡
│ 1 ┆ ["a", "b"] │
│ 2 ┆ ["a"] │
│ 3 ┆ ["c", "d"] │
└─────┴────────────┘
I want to convert it to the following form. I plan to save to a parquet file, and query the file (with sql).
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═════╡
│ 1 ┆ "a" │
│ 1 ┆ "b" │
│ 2 ┆ "a" │
│ 3 ┆ "c" │
│ 3 ┆ "d" │
└─────┴─────┘
I have seen an answer that works on struct columns, but df.unnest('b') on my data results in the error:
SchemaError: Series of dtype: List(Utf8) != Struct
I also found a GitHub issue showing that a list can be converted to a struct, but I can't work out how to do that, or whether it applies here.
To decompose a column of lists into one row per element, you can use the .explode() method:
df = pl.DataFrame({'a':[1,2,3], 'b':[['a','b'],['a'],['c','d']]})
df.explode("b")
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═════╡
│ 1 ┆ a │
│ 1 ┆ b │
│ 2 ┆ a │
│ 3 ┆ c │
│ 3 ┆ d │
└─────┴─────┘
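If it helps to see what "explode" means row by row, here is a small plain-Python sketch of the same reshaping on a list of dicts. The explode function below is a hypothetical helper, not the polars method. It emits one row with None for an empty list, which I believe matches what polars does, but treat that as an assumption.

```python
def explode(rows, column):
    """Return one output row per element of rows[i][column]."""
    out = []
    for row in rows:
        # Assumption: an empty list still yields a single row holding None.
        items = row[column] or [None]
        for item in items:
            new_row = dict(row)  # copy so the input rows are left untouched
            new_row[column] = item
            out.append(new_row)
    return out


rows = [
    {"a": 1, "b": ["a", "b"]},
    {"a": 2, "b": ["a"]},
    {"a": 3, "b": ["c", "d"]},
]

for row in explode(rows, "b"):
    print(row)
```

Every other column is simply repeated for each element of the list, which is why a ends up as 1, 1, 2, 3, 3.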

Polars: groupby rolling sum

Say I have
df = pl.DataFrame({'group': [1, 1, 1, 3, 3, 3, 4, 4], 'value': [1, 4, 2, 5, 3, 4, 2, 3]})
I'd like to get a rolling sum, with a window of 2, for each group.
Expected output is:
┌───────┐
│ value │
│ --- │
│ i64 │
╞═══════╡
│ 1 │
├╌╌╌╌╌╌╌┤
│ 5 │
├╌╌╌╌╌╌╌┤
│ 6 │
├╌╌╌╌╌╌╌┤
│ 5 │
├╌╌╌╌╌╌╌┤
│ 8 │
├╌╌╌╌╌╌╌┤
│ 7 │
├╌╌╌╌╌╌╌┤
│ 2 │
├╌╌╌╌╌╌╌┤
│ 5 │
└───────┘
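Use .rolling_sum() together with .over("group") so the window never crosses a group boundary.
Passing min_periods=1 lets a window with fewer than 2 values still produce a sum, so the first row of each group is not null.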
>>> df.select(pl.col("value").rolling_sum(2, min_periods=1).over("group"))
shape: (8, 1)
┌───────┐
│ value │
│ --- │
│ i64 │
╞═══════╡
│ 1 │
├╌╌╌╌╌╌╌┤
│ 5 │
├╌╌╌╌╌╌╌┤
│ 6 │
├╌╌╌╌╌╌╌┤
│ 5 │
├╌╌╌╌╌╌╌┤
│ 8 │
├╌╌╌╌╌╌╌┤
│ 7 │
├╌╌╌╌╌╌╌┤
│ 2 │
├╌╌╌╌╌╌╌┤
│ 5 │
└───────┘
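For reference, here is a plain-Python sketch of the computation: a rolling sum of the last `window` values seen within each group, where shorter windows at the start of a group are still summed. grouped_rolling_sum is a made-up name, and this only illustrates the semantics, not the polars implementation.

```python
from collections import defaultdict, deque


def grouped_rolling_sum(groups, values, window=2):
    """Sum the last `window` values seen so far within each group."""
    # One bounded queue per group; old values fall off automatically.
    recent = defaultdict(lambda: deque(maxlen=window))
    out = []
    for g, v in zip(groups, values):
        recent[g].append(v)
        out.append(sum(recent[g]))
    return out


groups = [1, 1, 1, 3, 3, 3, 4, 4]
values = [1, 4, 2, 5, 3, 4, 2, 3]
print(grouped_rolling_sum(groups, values))  # [1, 5, 6, 5, 8, 7, 2, 5]
```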

Is it possible in Polars to "reset" cumsum() at a certain condition?

I need to cumsum the column b until a becomes True. After that, the cumsum should restart from that row, and so on.
a | b
-------------
False | 1
False | 2
True | 3
False | 4
Can I do this in Polars without looping over each row?
You could use the .cumsum() of the a column as the "group number".
>>> df.select(pl.col("a").cumsum())
shape: (4, 1)
┌─────┐
│ a │
│ --- │
│ i64 │
╞═════╡
│ 0 │
├╌╌╌╌╌┤
│ 0 │
├╌╌╌╌╌┤
│ 1 │
├╌╌╌╌╌┤
│ 1 │
└─────┘
And use that with .over():
>>> df.select(pl.col("b").cumsum().over(pl.col("a").cumsum()))
shape: (4, 1)
┌─────┐
│ b │
│ --- │
│ i64 │
╞═════╡
│ 1 │
├╌╌╌╌╌┤
│ 3 │
├╌╌╌╌╌┤
│ 3 │
├╌╌╌╌╌┤
│ 7 │
└─────┘
If the True row should close the current group instead of starting a new one, apply .shift().backward_fill() to the group number:
>>> df.select(pl.col("b").cumsum().over(
... pl.col("a").cumsum().shift().backward_fill()))
shape: (4, 1)
┌─────┐
│ b │
│ --- │
│ i64 │
╞═════╡
│ 1 │
├╌╌╌╌╌┤
│ 3 │
├╌╌╌╌╌┤
│ 6 │
├╌╌╌╌╌┤
│ 4 │
└─────┘
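The two variants above can be checked against a plain loop. This is only a reference sketch in ordinary Python; reset_cumsum and its include_flagged_row parameter are hypothetical names, and the point of the polars answer is precisely that you do not need such a loop.

```python
def reset_cumsum(flags, values, include_flagged_row=False):
    """Cumulative sum of values that restarts whenever a flag is True.

    include_flagged_row=False: the flagged row starts the new sum.
    include_flagged_row=True:  the flagged row ends the old sum.
    """
    out = []
    total = 0
    for flag, value in zip(flags, values):
        if flag and not include_flagged_row:
            total = 0  # reset before adding the flagged row
        total += value
        out.append(total)
        if flag and include_flagged_row:
            total = 0  # reset after the flagged row was counted
    return out


a = [False, False, True, False]
b = [1, 2, 3, 4]
print(reset_cumsum(a, b))                            # [1, 3, 3, 7]
print(reset_cumsum(a, b, include_flagged_row=True))  # [1, 3, 6, 4]
```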

Ordinal encoding of column in polars

I would like to do an ordinal encoding of a column. Pandas has the nice and convenient method of pd.factorize(), however, I would like to achieve the same in polars.
df = pl.DataFrame({"a": [5, 8, 10], "b": ["hi", "hello", "hi"]})
┌─────┬───────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═══════╡
│ 5 ┆ hi │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 8 ┆ hello │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 10 ┆ hi │
└─────┴───────┘
desired result:
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 0 ┆ 0 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 1 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ 0 │
└─────┴─────┘
You can join with a dummy DataFrame that contains the unique values and the ordinal encoding you are interested in:
df = pl.DataFrame({"a": [5, 8, 10], "b": ["hi", "hello", "hi"]})
unique = df.select(
    pl.col("b").unique(maintain_order=True)
).with_row_count(name="ordinal")
df.join(unique, on="b")
Or you could "misuse" the fact that categorical values are backed by u32 integers.
df.with_column(
    pl.col("b").cast(pl.Categorical).to_physical().alias("ordinal")
)
Both methods output:
shape: (3, 3)
┌─────┬───────┬─────────┐
│ a ┆ b ┆ ordinal │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ u32 │
╞═════╪═══════╪═════════╡
│ 5 ┆ hi ┆ 0 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 8 ┆ hello ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 10 ┆ hi ┆ 0 │
└─────┴───────┴─────────┘
Here's another way to do it, although I doubt it's better than the dummy DataFrame approach from @ritchie46:
df.with_columns([
    pl.col('b').unique().list().alias('uniq'),
    pl.col('b').unique().list().arr.eval(pl.element().rank()).alias('uniqid'),
]).explode(['uniq', 'uniqid']).filter(pl.col('b') == pl.col('uniq')).select(pl.exclude('uniq')).with_column(pl.col('uniqid') - 1)
There's almost certainly a way to improve this. It creates a new column uniq, a list of all the unique values of the column, and a column uniqid, which (I think, and it seems to be) holds the 1-based rank of each of those values. It then explodes both, creating a row for every value in uniq, and filters out the rows where uniq doesn't equal column b. Since rank is 1-based rather than 0-based, you have to subtract 1, and the uniq column is excluded because it just duplicates b.
If the order is not important, you could use .rank(method="dense"):
>>> df.select(pl.all().rank(method="dense") - 1)
shape: (3, 2)
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ u32 ┆ u32 │
╞═════╪═════╡
│ 0 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 1 ┆ 0 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ 1 │
└─────┴─────┘
If the order is important, you could instead rank the row number at which each value first appears:
>>> (
... df.with_row_count()
... .with_columns([
... pl.col("row_nr").first()
... .over(col)
... .rank(method="dense")
... .alias(col) - 1
... for col in df.columns
... ])
... .drop("row_nr")
... )
shape: (3, 2)
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ u32 ┆ u32 │
╞═════╪═════╡
│ 0 ┆ 0 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 1 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ 0 │
└─────┴─────┘
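For comparison, the order-preserving encoding (codes assigned by first appearance, as in the desired result) can be written in a few lines of plain Python. This is a sketch of the semantics only; factorize here is a hypothetical helper, not pd.factorize or any polars function.

```python
def factorize(values):
    """Assign each distinct value a code in order of first appearance."""
    codes = {}
    # setdefault stores len(codes) only when the value is new,
    # so codes grow 0, 1, 2, ... as new values are encountered.
    return [codes.setdefault(v, len(codes)) for v in values]


print(factorize([5, 8, 10]))             # [0, 1, 2]
print(factorize(["hi", "hello", "hi"]))  # [0, 1, 0]
```

Contrast this with the dense rank, which would give "hello" the code 0 because it sorts before "hi".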

Computing and retrieving operations at the group level without collapsing data frame in polars?

I am trying to compute a stat (or more) at the group level without having to create a second data frame. The current way I do it is by relying on the generation of a second data frame with the desired aggregation that I then merge back to the original one.
A silly example:
import polars as pl
df = pl.DataFrame({'name': ['Steve', 'Larry', 'Tom', 'Steve', 'Tom', 'Steve'],
                   'points': range(6)})
print(df)
shape: (6, 2)
┌───────┬────────┐
│ name ┆ points │
│ --- ┆ --- │
│ str ┆ i64 │
╞═══════╪════════╡
│ Steve ┆ 0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ Larry ┆ 1 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ Tom ┆ 2 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ Steve ┆ 3 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ Tom ┆ 4 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ Steve ┆ 5 │
└───────┴────────┘
We created a simple data frame above in which some groups have more entries than others. In a second step, we compute an additional data frame to keep track of the size of each group.
entries = df.groupby('name').agg(pl.count().alias('entries'))
print(entries)
shape: (3, 2)
┌───────┬─────────┐
│ name ┆ entries │
│ --- ┆ --- │
│ str ┆ u32 │
╞═══════╪═════════╡
│ Steve ┆ 3 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ Tom ┆ 2 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ Larry ┆ 1 │
└───────┴─────────┘
Now we bring back this information to the original data frame in a third step.
print(df.join(entries, left_on='name', right_on='name', how='left'))
shape: (6, 3)
┌───────┬────────┬─────────┐
│ name ┆ points ┆ entries │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ u32 │
╞═══════╪════════╪═════════╡
│ Steve ┆ 0 ┆ 3 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ Larry ┆ 1 ┆ 1 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ Tom ┆ 2 ┆ 2 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ Steve ┆ 3 ┆ 3 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ Tom ┆ 4 ┆ 2 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ Steve ┆ 5 ┆ 3 │
└───────┴────────┴─────────┘
Is there a way to avoid this triangulation? I have the feeling that using over might be a solution but I can't figure it out yet.
Well ... I managed. Posting the question helped me organize my thoughts and indeed, over was the solution.
df.with_column(pl.col('name').count().over('name').alias('entries'))
shape: (6, 3)
┌───────┬────────┬─────────┐
│ name ┆ points ┆ entries │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ u32 │
╞═══════╪════════╪═════════╡
│ Steve ┆ 0 ┆ 3 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ Larry ┆ 1 ┆ 1 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ Tom ┆ 2 ┆ 2 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ Steve ┆ 3 ┆ 3 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ Tom ┆ 4 ┆ 2 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ Steve ┆ 5 ┆ 3 │
└───────┴────────┴─────────┘
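Conceptually, a window expression like this computes the aggregate once per group and then broadcasts it back to every row of that group, without collapsing the frame. A rough plain-Python sketch of that idea (not the polars internals) looks like this:

```python
from collections import Counter

names = ['Steve', 'Larry', 'Tom', 'Steve', 'Tom', 'Steve']

# Step 1: aggregate once per group.
counts = Counter(names)

# Step 2: broadcast each group's result back onto its rows.
entries = [counts[name] for name in names]

print(entries)  # [3, 1, 2, 3, 2, 3]
```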