Filter DataFrame using within-group expression - python-polars

Assuming I already have a predicate expression, how do I filter with that predicate, but apply it only within groups? For example, the predicate might be to keep all rows equal to the maximum within a group. (Multiple rows could be kept in a group if there is a tie.)
With my dplyr experience, I thought that I could just .groupby and then .filter, but that does not work.
import polars as pl
df = pl.DataFrame(dict(x=[0, 0, 1, 1], y=[1, 2, 3, 3]))
expression = pl.col("y") == pl.col("y").max()
df.groupby("x").filter(expression)
# AttributeError: 'GroupBy' object has no attribute 'filter'
I then thought I could apply .over to the expression, but that does not work either.
import polars as pl
df = pl.DataFrame(dict(x=[0, 0, 1, 1], y=[1, 2, 3, 3]))
expression = pl.col("y") == pl.col("y").max()
df.filter(expression.over("x"))
# RuntimeError: Any(ComputeError("this binary expression is not an aggregation:
# [(col(\"y\")) == (col(\"y\").max())]
# pherhaps you should add an aggregation like, '.sum()', '.min()', '.mean()', etc.
# if you really want to collect this binary expression, use `.list()`"))
For this particular problem, I can invoke .over on the max, but I don't know how to apply this to an arbitrary predicate I don't have control over.
import polars as pl
df = pl.DataFrame(dict(x=[0, 0, 1, 1], y=[1, 2, 3, 3]))
expression = pl.col("y") == pl.col("y").max().over("x")
df.filter(expression)
# shape: (3, 2)
# ┌─────┬─────┐
# │ x   ┆ y   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 0   ┆ 2   │
# ├╌╌╌╌╌┼╌╌╌╌╌┤
# │ 1   ┆ 3   │
# ├╌╌╌╌╌┼╌╌╌╌╌┤
# │ 1   ┆ 3   │
# └─────┴─────┘

If you had updated to polars>=0.13.0 your second try would have worked. :)
df = pl.DataFrame(dict(
    x=[0, 0, 1, 1],
    y=[1, 2, 3, 3])
)
df.filter((pl.col("y") == pl.max("y").over("x")))
shape: (3, 2)
┌─────┬─────┐
│ x   ┆ y   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 0   ┆ 2   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 1   ┆ 3   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 1   ┆ 3   │
└─────┴─────┘

Related

Horizontal mean with polars expressions

I have an expression and I need to create a column holding the row mean.
I saw in the documentation that an expression cannot take an axis, and that mean computes the mean over the whole data frame rather than per row.
Is it possible to compute a row mean? Maybe with a fold?
For example, let us take this frame:
df = pl.DataFrame(
    {
        "foo": [1, 2, 3],
        "bar": [6, 7, 8],
    }
)
This works with df.mean(axis=1), but I'm writing a context with several expressions and I would like to compute the horizontal mean inside that context.
Thanks.
df = pl.DataFrame(
    {
        "foo": [1, 2, 3],
        "bar": [6, 7, 8],
    }
)
df.with_column(
    (pl.sum(pl.all()) / 2).alias("mean")
)
shape: (3, 3)
┌─────┬─────┬──────┐
│ foo ┆ bar ┆ mean │
│ --- ┆ --- ┆ ---  │
│ i64 ┆ i64 ┆ f64  │
╞═════╪═════╪══════╡
│ 1   ┆ 6   ┆ 3.5  │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2   ┆ 7   ┆ 4.5  │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 3   ┆ 8   ┆ 5.5  │
└─────┴─────┴──────┘
If the horizontal aggregations get more complicated, I would recommend using pl.fold or pl.reduce.

In Polars how do I print all elements of a list column?

I have a Polars DataFrame with a list column. I want to control how many elements of a pl.List column are printed.
I've tried pl.Config.set_fmt_str_lengths(), but this only restricts the number of elements when set to a small value; it doesn't show more elements for a large value.
I'm working in Jupyterlab but I think it's a general issue.
import polars as pl
N = 5
df = (
    pl.DataFrame(
        {
            'id': range(N)
        }
    )
    .with_row_count("value")
    .groupby_rolling(
        "id", period=f"{N}i"
    )
    .agg(
        pl.col("value")
    )
)
df
shape: (5, 2)
┌─────┬───────────────┐
│ id  ┆ value         │
│ --- ┆ ---           │
│ i64 ┆ list[u32]     │
╞═════╪═══════════════╡
│ 0   ┆ [0]           │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1   ┆ [0, 1]        │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2   ┆ [0, 1, 2]     │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3   ┆ [0, 1, ... 3] │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4   ┆ [0, 1, ... 4] │
└─────┴───────────────┘
pl.Config.set_tbl_rows(100)
And more generally, I would try looking at dir(pl.Config)
You can use the following config parameter from the Polars Documentation to set the length of the output e.g. 100.
import polars as pl
pl.Config.set_fmt_str_lengths(100)
Currently I do not think you can, directly; the documentation for Config does not list any such method, and for me (in VSCode at least) set_fmt_str_lengths does not affect lists.
However, if your goal is simply to be able to see what you're working on and you don't mind a slightly hacky workaround, you can simply add a column next to it where you convert your list to a string representation of itself, at which point pl.Config.set_fmt_str_lengths(<some large n>) will then display however much of it you like. For example:
import polars as pl
pl.Config.set_fmt_str_lengths(100)
N = 5
df = (
    pl.DataFrame(
        {
            'id': range(N)
        }
    )
    .with_row_count("value")
    .groupby_rolling(
        "id", period=f"{N}i"
    )
    .agg(
        pl.col("value")
    ).with_column(
        # Build a "[a, b, c]" string from each list so it prints in full.
        pl.col("value").apply(lambda x: "[" + ", ".join(str(i) for i in x) + "]").alias("string_repr")
    )
)
df
shape: (5, 3)
┌─────┬───────────────┬─────────────────┐
│ id  ┆ value         ┆ string_repr     │
│ --- ┆ ---           ┆ ---             │
│ i64 ┆ list[u32]     ┆ str             │
╞═════╪═══════════════╪═════════════════╡
│ 0   ┆ [0]           ┆ [0]             │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1   ┆ [0, 1]        ┆ [0, 1]          │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2   ┆ [0, 1, 2]     ┆ [0, 1, 2]       │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3   ┆ [0, 1, ... 3] ┆ [0, 1, 2, 3]    │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4   ┆ [0, 1, ... 4] ┆ [0, 1, 2, 3, 4] │
└─────┴───────────────┴─────────────────┘

Sum columns based on column names in a list for polars

In Python Polars, I can add two or more columns to make a new column by using an expression something like
frame.with_column((pl.col('colname1') + pl.col('colname2')).alias('new_colname'))
However, if I have all the column names in a list, is there a way to sum all the columns in that list and create a new column based on the result?
Thanks
sum expr supports horizontal summing. From the docs,
List[Expr] -> aggregate the sum value horizontally.
Sample code for ref,
import polars as pl
df = pl.DataFrame({"a": [1, 2, 3], "b": [1, 2, None]})
print(df)
This results in something like,
shape: (3, 2)
┌─────┬──────┐
│ a   ┆ b    │
│ --- ┆ ---  │
│ i64 ┆ i64  │
╞═════╪══════╡
│ 1   ┆ 1    │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2   ┆ 2    │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 3   ┆ null │
└─────┴──────┘
On this you can do something like,
cols = ["a", "b"]
df2 = df.select(pl.sum([pl.col(i) for i in cols]).alias('new_colname'))
print(df2)
Which will result in,
shape: (3, 1)
┌─────────────┐
│ new_colname │
│ ---         │
│ i64         │
╞═════════════╡
│ 2           │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4           │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ null        │
└─────────────┘

How can I call a numpy ufunc with two positional arguments in polars?

I would like to call a numpy universal function (ufunc) that has two positional arguments in polars.
df.with_column(
    numpy.left_shift(pl.col('col1'), 8)
)
Above attempt results in the following error message
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/usr/local/lib/python3.8/dist-packages/polars/internals/expr.py", line 181, in __array_ufunc__
    out_type = ufunc(np.array([1])).dtype
TypeError: left_shift() takes from 2 to 3 positional arguments but 1 were given
There are other ways to perform this computation, e.g.,
df['col1'] = numpy.left_shift(df['col1'], 8)
... but I'm trying to use this with a polars.LazyFrame.
I'm using polars 0.13.13 and Python 3.8.
Edit: as of Polars 0.13.19, the apply method converts Numpy datatypes to Polars datatypes without requiring the Numpy item method.
When you need to pass only one column from Polars to the ufunc (as in your example), the easiest method is to use the apply function on that particular column.
import numpy as np
import polars as pl
df = pl.DataFrame({"col1": [2, 4, 8, 16]}).lazy()
df.with_column(
    pl.col("col1").apply(lambda x: np.left_shift(x, 8).item()).alias("result")
).collect()
shape: (4, 2)
┌──────┬────────┐
│ col1 ┆ result │
│ ---  ┆ ---    │
│ i64  ┆ i64    │
╞══════╪════════╡
│ 2    ┆ 512    │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 4    ┆ 1024   │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 8    ┆ 2048   │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 16   ┆ 4096   │
└──────┴────────┘
If you need to pass multiple columns from Polars to the ufunc, then use the struct expression with apply.
df = pl.DataFrame({"col1": [2, 4, 8, 16], "shift": [1, 1, 2, 2]}).lazy()
df.with_column(
    pl.struct(["col1", "shift"])
    .apply(lambda cols: np.left_shift(cols["col1"], cols["shift"]).item())
    .alias("result")
).collect()
shape: (4, 3)
┌──────┬───────┬────────┐
│ col1 ┆ shift ┆ result │
│ ---  ┆ ---   ┆ ---    │
│ i64  ┆ i64   ┆ i64    │
╞══════╪═══════╪════════╡
│ 2    ┆ 1     ┆ 4      │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 4    ┆ 1     ┆ 8      │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 8    ┆ 2     ┆ 32     │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 16   ┆ 2     ┆ 64     │
└──────┴───────┴────────┘
One note: the use of the numpy item method may no longer be needed in future releases of Polars. (Presently, the apply method does not always automatically translate between numpy dtypes and Polars dtypes.)
Does this help?

How to assign Exponential Moving Averages after groupby in python polars

I have just started using polars in python and I'm coming from pandas.
I would like to know how I can replicate the pandas code below in Python Polars.
import pandas as pd
import polars as pl
df['exp_mov_avg_col'] = df.groupby('agg_col')['ewm_col'].transform(lambda x : x.ewm(span=14).mean())
I have tried the following:
df.groupby('agg_col').agg([pl.col('ewm_col').ewm_mean().alias('exp_mov_avg_col')])
but this gives me a list of exponential moving averages per group. I want the values in that list to be assigned to a column in the original dataframe at the correct indexes, just like the pandas code does.
You can use window functions which apply an expression within a group defined by .over("group").
df = pl.DataFrame({
    "agg_col": [1, 1, 2, 3, 3, 3],
    "ewm_col": [1, 2, 3, 4, 5, 6]
})
(df.select([
    pl.all().exclude("ewm_col"),
    pl.col("ewm_col").ewm_mean(alpha=0.5).over("agg_col")
]))
Outputs:
shape: (6, 2)
┌─────────┬──────────┐
│ agg_col ┆ ewm_col  │
│ ---     ┆ ---      │
│ i64     ┆ f64      │
╞═════════╪══════════╡
│ 1       ┆ 1.0      │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1       ┆ 1.666667 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 2       ┆ 3.0      │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 3       ┆ 4.0      │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 3       ┆ 4.666667 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 3       ┆ 5.428571 │
└─────────┴──────────┘