In pandas, one can use logical indexing to assign items:
s = pd.Series(['a', 'b', 'c', 'd', 'e'])
idx = [True, False, True, False, True]
s[idx] = ['x', 'y', 'z']
In polars, we can do this with set_at_idx:
s = pl.Series(['a', 'b', 'c', 'd', 'e'])
idx = pl.Series([True, False, True, False, True])
s = s.set_at_idx(idx.arg_true(), ['x', 'y', 'z']) # arg_true() gives us integer indices
However, I'm struggling to figure out how to do this in an expression context, as I cannot do df['col'] = df['col'].set_at_idx.... Instead, polars suggests I use with_column:
import polars as pl
from polars import col, when
df = pl.DataFrame({
"my_col": ['a', 'b', 'c', 'd', 'e'],
"idx": [True, False, True, False, True]
})
new_values = ['x', 'y', 'z']
df = df.with_column(
when(col("idx")).then(new_values) # how to do this?
.otherwise(col("my_col"))
)
My method of getting around this is somewhat long-winded and there must be an easier way:
s = df["my_col"].clone()
s = s.set_at_idx(df["idx"].arg_true(), new_values)
df = df.with_column(s.alias("my_col"))
Syntactically it's not horrible, but is there an easier way to simply update a series with a list of values (or another Series)?
I'm not aware of an elegant way to convert your code using Series directly to Expressions. But we won't let that stop us.
One underappreciated aspect of the Polars architecture is the ability to compose our own Expressions using existing Polars API Expressions. (Perhaps there should be a section in the User Guide for this.)
So, let's create an Expression to do what we need.
The code below may seem overwhelming at first. We'll look at examples and I'll explain how it works in detail below.
Expression: set_by_mask
Here's a custom Expression that will set values based on a mask. For lack of a better name I've called it set_by_mask. The Expression is a bit rough (e.g., it does zero error-checking), but it should act as a good starting point.
Notice at the end that we will assign this function as a method of the Expr class, so that it can be used just like any other Expression (e.g., it can participate in any valid chain of Expressions, be used within a groupby, etc.).
Much of the code below deals with "convenience" methods (e.g., allowing the mask parameters to be a list/tuple or an Expression or a Series). Later, we'll go through how the heart of the algorithm works.
Here's the code:
from typing import Any, Sequence
import polars.internals as pli
def set_by_mask(
self: pli.Expr,
mask: str | Sequence[bool] | pli.Series | pli.Expr,
values: Sequence[Any] | pli.Series | pli.Expr,
) -> pli.Expr:
"""
Set values at mask locations.
Parameters
----------
mask
Indices with True values are replaced with values.
Sequence[bool]: list or tuple of boolean values
str: column name of boolean Expression
Series | Expr: Series or Expression that evaluates to boolean
values
Values to replace where mask is True
Notes:
The number of elements in values must match the number of
True values in mask.
The mask Expression/list/tuple must match the length of the
Expression for which values are being set.
"""
if isinstance(mask, str):
mask = pl.col(mask)
if isinstance(mask, Sequence):
mask = pli.Series("", mask)
if isinstance(values, Sequence):
values = pli.Series("", values)
if isinstance(mask, pli.Series):
mask = pli.lit(mask)
if isinstance(values, pli.Series):
values = pli.lit(values)
result = (
self
.sort_by(mask)
.slice(0, mask.is_not().sum())
.append(values)
.sort_by(mask.arg_sort())
)
return self._from_pyexpr(result._pyexpr)
pli.Expr.set_by_mask = set_by_mask
Examples
First, let's run through some examples of how this works.
In the example below, we'll pass a string as our mask parameter -- indicating the column name of df to be used as a mask. And we'll pass a simple Python list of string values as our values parameter.
Remember to run the code above first before running the examples below. We need the set_by_mask function to be a method of the Expr class. (Don't worry, it's not permanent - when the Python interpreter exits, the Expr class will be restored to its original state.)
import polars as pl
df = pl.DataFrame({
"my_col": ["a", "b", "c", "d", "e"],
"idx": [True, False, True, False, True],
})
new_values = ("x", "y", "z")
(
df
.with_columns([
pl.col('my_col').set_by_mask("idx", new_values).alias('result')
])
)
shape: (5, 3)
┌────────┬───────┬────────┐
│ my_col ┆ idx ┆ result │
│ --- ┆ --- ┆ --- │
│ str ┆ bool ┆ str │
╞════════╪═══════╪════════╡
│ a ┆ true ┆ x │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ b ┆ false ┆ b │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ c ┆ true ┆ y │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ d ┆ false ┆ d │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ e ┆ true ┆ z │
└────────┴───────┴────────┘
We see that the values for a, c, and e have been replaced, consistent with where the mask evaluated to True.
As another example, let's pass the mask and values parameters as external Series (ignoring the idx column).
new_values = pl.Series("", ['1', '2', '4', '5'])
mask = pl.Series("", [True, True, False, True, True])
(
df
.with_columns([
pl.col('my_col').set_by_mask(mask, new_values).alias('result')
])
)
shape: (5, 3)
┌────────┬───────┬────────┐
│ my_col ┆ idx ┆ result │
│ --- ┆ --- ┆ --- │
│ str ┆ bool ┆ str │
╞════════╪═══════╪════════╡
│ a ┆ true ┆ 1 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ b ┆ false ┆ 2 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ c ┆ true ┆ c │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ d ┆ false ┆ 4 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ e ┆ true ┆ 5 │
└────────┴───────┴────────┘
How it works
The heart of the algorithm is this snippet.
result = (
self
.sort_by(mask)
.slice(0, mask.is_not().sum())
.append(values)
.sort_by(mask.arg_sort())
)
To see how it works, we'll use the first example and watch how the algorithm builds the answer.
First, we'll need to change the with_columns in your query to a select, because the intermediate steps in the algorithm won't produce a column whose length matches the other columns, which will lead to an error.
Here's the code that we'll run to observe the steps in the algorithm:
import polars as pl
new_values = ("x", "y", "z")
(
pl.DataFrame({
"my_col": ["a", "b", "c", "d", "e"],
"idx": [True, False, True, False, True],
})
.select([
pl.col('my_col').set_by_mask("idx", new_values).alias('result')
])
)
shape: (5, 1)
┌────────┐
│ result │
│ --- │
│ str │
╞════════╡
│ x │
├╌╌╌╌╌╌╌╌┤
│ b │
├╌╌╌╌╌╌╌╌┤
│ y │
├╌╌╌╌╌╌╌╌┤
│ d │
├╌╌╌╌╌╌╌╌┤
│ z │
└────────┘
With that in place, let's look at how the algorithm evolves.
The Algorithm in steps
The first step of the algorithm is to sort the original column so that non-changing values (those corresponding to a mask value of False) are sorted to the top. We'll accomplish this using the sort_by Expression, and pass mask as our sorting criterion.
I'll truncate the heart of the algorithm to just this first step.
result = (
self
.sort_by(mask)
)
Here's the result.
shape: (5, 1)
┌────────┐
│ result │
│ --- │
│ str │
╞════════╡
│ b │
├╌╌╌╌╌╌╌╌┤
│ d │
├╌╌╌╌╌╌╌╌┤
│ a │
├╌╌╌╌╌╌╌╌┤
│ c │
├╌╌╌╌╌╌╌╌┤
│ e │
└────────┘
In our example, values b and d are not changing and are sorted to the top; values a, c, and e are being replaced and are sorted to the bottom.
In the next step, we'll use the slice Expression to eliminate those values that will be replaced.
result = (
self
.sort_by(mask)
.slice(0, mask.is_not().sum())
)
shape: (2, 1)
┌────────┐
│ result │
│ --- │
│ str │
╞════════╡
│ b │
├╌╌╌╌╌╌╌╌┤
│ d │
└────────┘
In the next step, we'll use the append Expression to place the new values at the bottom.
result = (
self
.sort_by(mask)
.slice(0, mask.is_not().sum())
.append(values)
)
shape: (5, 1)
┌────────┐
│ result │
│ --- │
│ str │
╞════════╡
│ b │
├╌╌╌╌╌╌╌╌┤
│ d │
├╌╌╌╌╌╌╌╌┤
│ x │
├╌╌╌╌╌╌╌╌┤
│ y │
├╌╌╌╌╌╌╌╌┤
│ z │
└────────┘
Now for the tricky step: how to get the values sorted in the proper order.
We're going to use arg_sort to accomplish this. One property of an arg_sort is that it can restore a sorted column back to its original un-sorted state.
If we look at the values above, the non-replaced values are at the top (corresponding to a mask value of False). And the replaced values are at the bottom (corresponding to a mask value of True). This corresponds to a mask of [False, False, True, True, True].
That, in turn, corresponds to the mask expression when it is sorted (False sorts before True). Hence, sorting the column by the arg_sort of the mask will restore the column to correspond to the original un-sorted mask column.
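Before the final step, here's a tiny standalone illustration of that arg_sort property, using the same toy data (a sketch, separate from the algorithm itself):
import polars as pl

df = pl.DataFrame({
    "my_col": ["a", "b", "c", "d", "e"],
    "idx": [True, False, True, False, True],
})

df.select([
    # sorting by the mask floats the non-replaced values to the top: b, d, a, c, e
    pl.col("my_col").sort_by("idx").alias("sorted"),
    # sorting that result by the arg_sort of the mask restores the original order: a, b, c, d, e
    pl.col("my_col").sort_by("idx").sort_by(pl.col("idx").arg_sort()).alias("restored"),
])
With that property in mind, here's the final step added to the chain: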
result = (
self
.sort_by(mask)
.slice(0, mask.is_not().sum())
.append(values)
.sort_by(mask.arg_sort())
)
shape: (5, 1)
┌────────┐
│ result │
│ --- │
│ str │
╞════════╡
│ x │
├╌╌╌╌╌╌╌╌┤
│ b │
├╌╌╌╌╌╌╌╌┤
│ y │
├╌╌╌╌╌╌╌╌┤
│ d │
├╌╌╌╌╌╌╌╌┤
│ z │
└────────┘
It's subtle, but it works.
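And to illustrate the chaining point from earlier: the custom Expression composes with built-in Expressions like any other. Here's a hedged sketch reusing the df and new_values from the first example (the uppercase step is just an arbitrary follow-on Expression):
(
    df
    .with_columns([
        pl.col("my_col")
        .set_by_mask("idx", new_values)
        .str.to_uppercase()
        .alias("result_upper")
    ])
)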
I appreciate the above may be more than you wanted. But hopefully, it demonstrates how we can compose our own Expressions using existing Polars Expressions.
In Python Polars, I can add one or more columns together to make a new column by using an expression, something like:
frame.with_column((pl.col('colname1') + pl.col('colname2')).alias('new_colname'))
However, if I have all the column names in a list, is there a way to sum all the columns in that list and create a new column based on the result?
Thanks
The sum expression supports horizontal summing. From the docs:
List[Expr] -> aggregate the sum value horizontally.
Sample code for reference:
import polars as pl
df = pl.DataFrame({"a": [1, 2, 3], "b": [1, 2, None]})
print(df)
This results in something like:
shape: (3, 2)
┌─────┬──────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪══════╡
│ 1 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2 ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 3 ┆ null │
└─────┴──────┘
On this you can do something like:
cols = ["a", "b"]
df2 = df.select(pl.sum([pl.col(i) for i in cols]).alias('new_colname'))
print(df2)
Which will result in,
shape: (3, 1)
┌─────────────┐
│ new_colname │
│ ---         │
│ i64         │
╞═════════════╡
│ 2           │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4           │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ null        │
└─────────────┘
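If you'd rather keep the original columns and just append the sum as a new column, with_column should work the same way (a sketch using the same cols list as above):
df3 = df.with_column(pl.sum([pl.col(c) for c in cols]).alias("new_colname"))
print(df3)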
I would like to call a numpy universal function (ufunc) that has two positional arguments in polars.
df.with_column(
numpy.left_shift(pl.col('col1'), 8)
)
The above attempt results in the following error message:
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "/usr/local/lib/python3.8/dist-packages/polars/internals/expr.py", line 181, in __array_ufunc__
out_type = ufunc(np.array([1])).dtype
TypeError: left_shift() takes from 2 to 3 positional arguments but 1 were given
There are other ways to perform this computation, e.g.,
df['col1'] = numpy.left_shift(df['col1'], 8)
... but I'm trying to use this with a polars.LazyFrame.
I'm using polars 0.13.13 and Python 3.8.
Edit: as of Polars 0.13.19, the apply method converts Numpy datatypes to Polars datatypes without requiring the Numpy item method.
When you need to pass only one column from Polars to the ufunc (as in your example), the easiest method is to use the apply method on that particular column.
import numpy as np
import polars as pl
df = pl.DataFrame({"col1": [2, 4, 8, 16]}).lazy()
df.with_column(
pl.col("col1").apply(lambda x: np.left_shift(x, 8).item()).alias("result")
).collect()
shape: (4, 2)
┌──────┬────────┐
│ col1 ┆ result │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞══════╪════════╡
│ 2 ┆ 512 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 4 ┆ 1024 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 8 ┆ 2048 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 16 ┆ 4096 │
└──────┴────────┘
If you need to pass multiple columns from Polars to the ufunc, then use the struct expression with apply.
df = pl.DataFrame({"col1": [2, 4, 8, 16], "shift": [1, 1, 2, 2]}).lazy()
df.with_column(
pl.struct(["col1", "shift"])
.apply(lambda cols: np.left_shift(cols["col1"], cols["shift"]).item())
.alias("result")
).collect()
shape: (4, 3)
┌──────┬───────┬────────┐
│ col1 ┆ shift ┆ result │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞══════╪═══════╪════════╡
│ 2 ┆ 1 ┆ 4 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 4 ┆ 1 ┆ 8 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 8 ┆ 2 ┆ 32 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 16 ┆ 2 ┆ 64 │
└──────┴───────┴────────┘
One Note: the use of the numpy item method may no longer be needed in future releases of Polars. (Presently, the apply method does not always automatically translate between numpy dtypes and Polars dtypes.)
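As a side note: when the ufunc can operate on a whole column at once (as np.left_shift can), an alternative worth trying is Expr.map, which hands the entire Series to your function instead of calling it element by element. A hedged sketch, reusing the lazy df from the first example:
df.with_column(
    pl.col("col1")
    .map(lambda s: pl.Series(np.left_shift(s.to_numpy(), 8)))
    .alias("result")
).collect()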
Does this help?
I have just started using polars in python and I'm coming from pandas.
I would like to know how can I replicate the below pandas code in python polars
import pandas as pd
import polars as pl
df['exp_mov_avg_col'] = df.groupby('agg_col')['ewm_col'].transform(lambda x : x.ewm(span=14).mean())
I have tried the following:
df.groupby('agg_col').agg([pl.col('ewm_col').ewm_mean().alias('exp_mov_avg_col')])
but this gives me a list of exponential moving averages per provider; I want that list to be assigned to a column in the original dataframe at the correct indexes, just like the pandas code does.
You can use window functions, which apply an expression within a group defined by .over("group").
df = pl.DataFrame({
"agg_col": [1, 1, 2, 3, 3, 3],
"ewm_col": [1, 2, 3, 4, 5, 6]
})
(df.select([
pl.all().exclude("ewm_col"),
pl.col("ewm_col").ewm_mean(alpha=0.5).over("agg_col")
]))
Outputs:
shape: (6, 2)
┌─────────┬──────────┐
│ agg_col ┆ ewm_col │
│ --- ┆ --- │
│ i64 ┆ f64 │
╞═════════╪══════════╡
│ 1 ┆ 1.0 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 1.666667 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 3.0 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 4.0 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 4.666667 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 5.428571 │
└─────────┴──────────┘
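To mirror your pandas code more closely (span=14) and keep all existing columns while adding a new one, something like the following should work (a sketch, assuming your real column names):
df = df.with_column(
    pl.col("ewm_col").ewm_mean(span=14).over("agg_col").alias("exp_mov_avg_col")
)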
Hi,
Is there any function that can generate a series by calculating the row-wise minimum of two series? The functionality would be similar to np.minimum:
a = [1,4,2,5,2]
b = [5,1,4,2,5]
np.minimum(a,b) -> [1,1,2,2,2]
Thanks.
q = df.lazy().with_column(
    pl.when(pl.col("a") > pl.col("b"))
    .then(pl.col("b"))
    .otherwise(pl.col("a"))
    .alias("minimum")
)
df = q.collect()
I haven't tested it, but I think this should work.
As the accepted answer states, you can use a pl.when -> then -> otherwise expression.
If you have a wider DataFrame, you can use the DataFrame.min() method, the pl.min expression, or a pl.fold for more control.
DataFrame.min method
import polars as pl
df = pl.DataFrame({
"a": [1,4,2,5,2],
"b": [5,1,4,2,5],
"c": [3,2,5,7,2]
})
df.min(axis=1)
This outputs:
shape: (5,)
Series: 'a' [i64]
[
1
1
2
2
2
]
Min expression
When pl.min is given multiple expression inputs, the minimum is determined row-wise instead of column-wise.
df.select(pl.min(["a", "b", "c"]))
Outputs:
shape: (5, 1)
┌─────┐
│ min │
│ --- │
│ i64 │
╞═════╡
│ 1 │
├╌╌╌╌╌┤
│ 1 │
├╌╌╌╌╌┤
│ 2 │
├╌╌╌╌╌┤
│ 2 │
├╌╌╌╌╌┤
│ 2 │
└─────┘
Fold expression
Or with a fold expression:
df.select(
pl.fold(int(1e9), lambda acc, a: pl.when(acc > a).then(a).otherwise(acc), ["a", "b", "c"])
)
shape: (5, 1)
┌─────────┐
│ literal │
│ --- │
│ i64 │
╞═════════╡
│ 1 │
├╌╌╌╌╌╌╌╌╌┤
│ 1 │
├╌╌╌╌╌╌╌╌╌┤
│ 2 │
├╌╌╌╌╌╌╌╌╌┤
│ 2 │
├╌╌╌╌╌╌╌╌╌┤
│ 2 │
└─────────┘
The fold allows for more cool things, because you operate over expressions.
So we could for instance compute the min of the squared columns:
pl.fold(int(1e9), lambda acc, a: pl.when(acc > a).then(a).otherwise(acc), [pl.all()**2])
Or we could compute the min of the square root of column "a" and the squares of the rest of the columns:
pl.fold(int(1e9), lambda acc, a: pl.when(acc > a).then(a).otherwise(acc), [pl.col("a").sqrt(), pl.all().exclude("a")**2])
You get the idea.
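For completeness, here's how the squared-columns variant might look inside a full query (a sketch on the same df):
df.select(
    pl.fold(
        int(1e9),
        lambda acc, a: pl.when(acc > a).then(a).otherwise(acc),
        [pl.all() ** 2],
    ).alias("min_of_squares")
)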