How to assign multiple values to specific locations in a series in an expr? - python-polars

In pandas, one can use logical indexing to assign items:
import pandas as pd

s = pd.Series(['a', 'b', 'c', 'd', 'e'])
idx = [True, False, True, False, True]
s[idx] = ['x', 'y', 'z']
In polars, we can do this with set_at_idx:
s = pl.Series(['a', 'b', 'c', 'd', 'e'])
idx = pl.Series([True, False, True, False, True])
s = s.set_at_idx(idx.arg_true(), ['x', 'y', 'z']) # arg_true() gives us integer indices
However, I'm struggling to figure out how to do this in an expression context, as I cannot do df['col'] = df['col'].set_at_idx.... Instead, polars suggests I use with_column:
import polars as pl
from polars import col, when

df = pl.DataFrame({
    "my_col": ['a', 'b', 'c', 'd', 'e'],
    "idx": [True, False, True, False, True]
})
new_values = ['x', 'y', 'z']
df = df.with_column(
    when(col("idx")).then(new_values)  # how to do this?
    .otherwise(col("my_col"))
)
My method of getting around this is somewhat long-winded and there must be an easier way:
s = df["my_col"].clone()
s = s.set_at_idx(df["idx"].arg_true(), new_values)
df = df.with_column(s.alias("my_col"))
Syntactically it's not horrible, but is there an easier way to simply update a series with a list of values (or other series)?

I'm not aware of an elegant way to convert your code using Series directly to Expressions. But we won't let that stop us.
One underappreciated aspect of the Polars architecture is the ability to compose our own Expressions using existing Polars API Expressions. (Perhaps there should be a section in the User Guide for this.)
So, let's create an Expression to do what we need.
The code below may seem overwhelming at first. We'll look at examples and I'll explain how it works in detail below.
Expression: set_by_mask
Here's a custom Expression that will set values based on a mask. For lack of a better name I've called it set_by_mask. The Expression is a bit rough (e.g., it does zero error-checking), but it should act as a good starting point.
Notice at the end that we assign this function as a method of the Expr class, so that it can be used just like any other Expression (e.g., it can participate in any valid chain of Expressions, be used within a groupby, etc.).
Much of the code below deals with "convenience" handling (e.g., allowing the mask parameter to be a list/tuple, an Expression, or a Series). Later, we'll go through how the heart of the algorithm works.
Here's the code:
from __future__ import annotations  # needed for the `str | ...` union syntax on Python < 3.10

from typing import Any, Sequence

import polars as pl
import polars.internals as pli


def set_by_mask(
    self: pli.Expr,
    mask: str | Sequence[bool] | pli.Series | pli.Expr,
    values: Sequence[Any] | pli.Series | pli.Expr,
) -> pli.Expr:
    """
    Set values at mask locations.

    Parameters
    ----------
    mask
        Locations with True values are replaced with values.
        Sequence[bool]: list or tuple of boolean values
        str: column name of a boolean column
        Series | Expr: Series or Expression that evaluates to boolean
    values
        Values to replace where mask is True.

    Notes
    -----
    The number of elements in values must match the number of
    True values in mask.

    The mask Expression/list/tuple must match the length of the
    Expression for which values are being set.
    """
    if isinstance(mask, str):
        mask = pl.col(mask)
    if isinstance(mask, Sequence):
        mask = pli.Series("", mask)
    if isinstance(values, Sequence):
        values = pli.Series("", values)
    if isinstance(mask, pli.Series):
        mask = pli.lit(mask)
    if isinstance(values, pli.Series):
        values = pli.lit(values)

    result = (
        self
        .sort_by(mask)
        .slice(0, mask.is_not().sum())
        .append(values)
        .sort_by(mask.arg_sort())
    )
    return self._from_pyexpr(result._pyexpr)


pli.Expr.set_by_mask = set_by_mask
Examples
First, let's run through some examples of how this works.
In the example below, we'll pass a string as our mask parameter -- indicating the column name of df to be used as a mask. And we'll pass a simple Python list of string values as our values parameter.
Remember to run the code above first before running the examples below. We need the set_by_mask function to be a method of the Expr class. (Don't worry, it's not permanent - when the Python interpreter exits, the Expr class will be restored to its original state.)
import polars as pl

df = pl.DataFrame({
    "my_col": ["a", "b", "c", "d", "e"],
    "idx": [True, False, True, False, True],
})
new_values = ("x", "y", "z")
(
    df
    .with_columns([
        pl.col('my_col').set_by_mask("idx", new_values).alias('result')
    ])
)
shape: (5, 3)
┌────────┬───────┬────────┐
│ my_col ┆ idx ┆ result │
│ --- ┆ --- ┆ --- │
│ str ┆ bool ┆ str │
╞════════╪═══════╪════════╡
│ a ┆ true ┆ x │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ b ┆ false ┆ b │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ c ┆ true ┆ y │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ d ┆ false ┆ d │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ e ┆ true ┆ z │
└────────┴───────┴────────┘
We see that the values for a, c, and e have been replaced, consistent with where the mask evaluated to True.
As another example, let's pass the mask and values parameters as external Series (ignoring the idx column).
new_values = pl.Series("", ['1', '2', '4', '5'])
mask = pl.Series("", [True, True, False, True, True])
(
    df
    .with_columns([
        pl.col('my_col').set_by_mask(mask, new_values).alias('result')
    ])
)
shape: (5, 3)
┌────────┬───────┬────────┐
│ my_col ┆ idx ┆ result │
│ --- ┆ --- ┆ --- │
│ str ┆ bool ┆ str │
╞════════╪═══════╪════════╡
│ a ┆ true ┆ 1 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ b ┆ false ┆ 2 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ c ┆ true ┆ c │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ d ┆ false ┆ 4 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ e ┆ true ┆ 5 │
└────────┴───────┴────────┘
How it works
The heart of the algorithm is this snippet.
result = (
    self
    .sort_by(mask)
    .slice(0, mask.is_not().sum())
    .append(values)
    .sort_by(mask.arg_sort())
)
To see how it works, we'll use the first example and watch how the algorithm builds the answer.
First, we'll need to change the with_columns in your query to a select: the intermediate steps in the algorithm won't produce a column whose length matches the other columns, which would lead to an error.
Here's the code that we'll run to observe the steps in the algorithm:
import polars as pl

new_values = ("x", "y", "z")
(
    pl.DataFrame({
        "my_col": ["a", "b", "c", "d", "e"],
        "idx": [True, False, True, False, True],
    })
    .select([
        pl.col('my_col').set_by_mask("idx", new_values).alias('result')
    ])
)
shape: (5, 1)
┌────────┐
│ result │
│ --- │
│ str │
╞════════╡
│ x │
├╌╌╌╌╌╌╌╌┤
│ b │
├╌╌╌╌╌╌╌╌┤
│ y │
├╌╌╌╌╌╌╌╌┤
│ d │
├╌╌╌╌╌╌╌╌┤
│ z │
└────────┘
With that in place, let's look at how the algorithm evolves.
The Algorithm in steps
The first step of the algorithm is to sort the original column so that non-changing values (those corresponding to a mask value of False) are sorted to the top. We'll accomplish this using the sort_by Expression, and pass mask as our sorting criterion.
I'll change the heart of the algorithm to only these steps.
result = (
    self
    .sort_by(mask)
)
Here's the result.
shape: (5, 1)
┌────────┐
│ result │
│ --- │
│ str │
╞════════╡
│ b │
├╌╌╌╌╌╌╌╌┤
│ d │
├╌╌╌╌╌╌╌╌┤
│ a │
├╌╌╌╌╌╌╌╌┤
│ c │
├╌╌╌╌╌╌╌╌┤
│ e │
└────────┘
In our example, values b and d are not changing and are sorted to the top; values a, c, and e are being replaced and are sorted to the bottom.
In the next step, we'll use the slice Expression to eliminate those values that will be replaced.
result = (
    self
    .sort_by(mask)
    .slice(0, mask.is_not().sum())
)
shape: (2, 1)
┌────────┐
│ result │
│ --- │
│ str │
╞════════╡
│ b │
├╌╌╌╌╌╌╌╌┤
│ d │
└────────┘
In the next step, we'll use the append Expression to place the new values at the bottom.
result = (
    self
    .sort_by(mask)
    .slice(0, mask.is_not().sum())
    .append(values)
)
shape: (5, 1)
┌────────┐
│ result │
│ --- │
│ str │
╞════════╡
│ b │
├╌╌╌╌╌╌╌╌┤
│ d │
├╌╌╌╌╌╌╌╌┤
│ x │
├╌╌╌╌╌╌╌╌┤
│ y │
├╌╌╌╌╌╌╌╌┤
│ z │
└────────┘
Now for the tricky step: how to get the values sorted in the proper order.
We're going to use arg_sort to accomplish this. One property of an arg_sort is that it can restore a sorted column back to its original un-sorted state.
If we look at the values above, the non-replaced values are at the top (corresponding to a mask value of False), and the replaced values are at the bottom (corresponding to a mask value of True). This corresponds to a mask of [False, False, True, True, True].
That, in turn, is the mask expression when it is sorted (False sorts before True). Hence, sorting the column by the arg_sort of the mask will restore the column to correspond to the original un-sorted mask column.
result = (
    self
    .sort_by(mask)
    .slice(0, mask.is_not().sum())
    .append(values)
    .sort_by(mask.arg_sort())
)
shape: (5, 1)
┌────────┐
│ result │
│ --- │
│ str │
╞════════╡
│ x │
├╌╌╌╌╌╌╌╌┤
│ b │
├╌╌╌╌╌╌╌╌┤
│ y │
├╌╌╌╌╌╌╌╌┤
│ d │
├╌╌╌╌╌╌╌╌┤
│ z │
└────────┘
It's subtle, but it works.
I appreciate the above may be more than you wanted. But hopefully, it demonstrates how we can compose our own Expressions using existing Polars Expressions.

Related

divide a dataframe into another dataframe (element by element)

In pandas, I could set several named columns as an index and find the quotient of the division of two DataFrames, like this:
import pandas as pd

df_1 = pd.DataFrame({
    'Name': ['a', 'a', 'a', 'b', 'b', 'c'],
    'Name_2': ['first', 'second', 'third', 'first', 'second', 'first'],
    'Value': [20, 40, 50, 100, 150, 400]
})
df_2 = pd.DataFrame({
    'Name': ['a', 'a', 'a', 'b', 'b', 'c'],
    'Name_2': ['first', 'second', 'third', 'first', 'second', 'first'],
    'Value': [10, 20, 25, 50, 75, 200]
})
df_1 = df_1.set_index(['Name', 'Name_2'])
df_2 = df_2.set_index(['Name', 'Name_2'])
df_1 / df_2
How can something like this be implemented in python-polars?
I can't find an answer to this question in the documentation.
You just use a join, then do the math on the appropriate column(s).
df_1 = pl.DataFrame({
    'Name': ['a', 'a', 'a', 'b', 'b', 'c'],
    'Name_2': ['first', 'second', 'third', 'first', 'second', 'first'],
    'Value': [20, 40, 50, 100, 150, 400]
})
df_2 = pl.DataFrame({
    'Name': ['a', 'a', 'a', 'b', 'b', 'c'],
    'Name_2': ['first', 'second', 'third', 'first', 'second', 'first'],
    'Value': [10, 20, 25, 50, 75, 200]
})
That's the setup; the solution is:
df_1.join(df_2, on=['Name', 'Name_2']) \
    .select(['Name', 'Name_2', pl.col('Value') / pl.col('Value_right')])
If you have a bunch of "Value" columns and different index columns, you can do something like:
myindxcols = ['Name', 'Name_2']
myvalcols = [x for x in df_1.columns if x in df_2.columns and x not in myindxcols]
df_1.join(df_2, on=myindxcols) \
    .select(myindxcols + [pl.col(x) / pl.col(f"{x}_right") for x in myvalcols])
Dean MacGregor beat me to it. Please accept his answer.
df_1 = pl.DataFrame({
    "Name": ["a", "a", "a", "b", "b", "c"],
    "Name_2": ["first", "second", "third", "first", "second", "first"],
    "Value": [20, 40, 50, 100, 150, 400]
})
df_2 = pl.DataFrame({
    "Name": ["a", "a", "a", "b", "b", "c"],
    "Name_2": ["first", "second", "third", "first", "second", "first"],
    "Value": [10, 20, 25, 50, 75, 200]
})
keys = ["Name", "Name_2"]
(df_1
    .join(df_2, on=keys, suffix="_right")
    .select([
        *keys,
        pl.col("Value") / pl.col("Value_right")
    ])
)
shape: (6, 3)
┌──────┬────────┬───────┐
│ Name ┆ Name_2 ┆ Value │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 │
╞══════╪════════╪═══════╡
│ a ┆ first ┆ 2.0 │
│ a ┆ second ┆ 2.0 │
│ a ┆ third ┆ 2.0 │
│ b ┆ first ┆ 2.0 │
│ b ┆ second ┆ 2.0 │
│ c ┆ first ┆ 2.0 │
└──────┴────────┴───────┘

Filtering selected columns based on column aggregate

I wish to select only columns with fewer than 3 unique values. I can generate a boolean mask via pl.all().n_unique() < 3, but I don't know if I can use that mask via the polars API for this.
Currently, I am solving it via python. Is there a more idiomatic way?
import polars as pl, pandas as pd
df = pl.DataFrame({"col1":[1,1,2], "col2":[1,2,3], "col3":[3,3,3]})
# target is:
# df_few_unique = pl.DataFrame({"col1":[1,1,2], "col3":[3,3,3]})
# my attempt:
mask = df.select(pl.all().n_unique() < 3).to_numpy()[0]
cols = [col for col, m in zip(df.columns, mask) if m]
df_few_unique = df.select(cols)
df_few_unique
Equivalent in pandas:
df_pandas = df.to_pandas()
mask = (df_pandas.nunique() < 3)
df_pandas.loc[:, mask]
Edit: after some thinking, I discovered an even easier way to do this, one that doesn't rely on boolean masking at all.
pl.select(
    [s for s in df if s.n_unique() < 3]
)
shape: (3, 2)
┌──────┬──────┐
│ col1 ┆ col3 │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞══════╪══════╡
│ 1 ┆ 3 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 1 ┆ 3 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2 ┆ 3 │
└──────┴──────┘
Previous answer
One easy way is to use the compress function from Python's itertools.
from itertools import compress
df.select(compress(df.columns, df.select(pl.all().n_unique() < 3).row(0)))
shape: (3, 2)
┌──────┬──────┐
│ col1 ┆ col3 │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞══════╪══════╡
│ 1 ┆ 3 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 1 ┆ 3 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2 ┆ 3 │
└──────┴──────┘
compress allows us to apply a boolean mask to a list, which in this case is a list of column names.
list(compress(df.columns, df.select(pl.all().n_unique() < 3).row(0)))
['col1', 'col3']

Sum columns based on column names in a list for polars

So in python Polars, I can add one or more columns to make a new column by using an expression, something like:
frame.with_column((pl.col('colname1') + pl.col('colname2')).alias('new_colname'))
However, if I have all the column names in a list, is there a way to sum all the columns in that list and create a new column based on the result?
Thanks
The sum expression supports horizontal summing. From the docs:
List[Expr] -> aggregate the sum value horizontally.
Sample code for ref,
import polars as pl
df = pl.DataFrame({"a": [1, 2, 3], "b": [1, 2, None]})
print(df)
This results in something like,
shape: (3, 2)
┌─────┬──────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪══════╡
│ 1 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2 ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 3 ┆ null │
└─────┴──────┘
On this you can do something like,
cols = ["a", "b"]
df2 = df.select(pl.sum([pl.col(i) for i in cols]).alias('new_colname'))
print(df2)
Which will result in,
shape: (3, 1)
┌─────────────┐
│ new_colname │
│ ---         │
│ i64         │
╞═════════════╡
│ 2           │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4           │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ null        │
└─────────────┘

Polars: How to reorder columns in a specific order?

I cannot find how to reorder columns in a polars dataframe in the polars DataFrame docs.
thx
Using the select method is the recommended way to reorder columns in polars.
Example:
Input:
df
┌──────┬──────┬──────┐
│ Col1 ┆ Col2 ┆ Col3 │
│ ---  ┆ ---  ┆ ---  │
│ str  ┆ str  ┆ str  │
╞══════╪══════╪══════╡
│ a    ┆ x    ┆ p    │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ b    ┆ y    ┆ q    │
└──────┴──────┴──────┘
Output:
df.select(['Col3', 'Col2', 'Col1'])
or
df.select([pl.col('Col3'), pl.col('Col2'), pl.col('Col1')])
┌──────┬──────┬──────┐
│ Col3 ┆ Col2 ┆ Col1 │
│ ---  ┆ ---  ┆ ---  │
│ str  ┆ str  ┆ str  │
╞══════╪══════╪══════╡
│ p    ┆ x    ┆ a    │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ q    ┆ y    ┆ b    │
└──────┴──────┴──────┘
Note:
While df[['Col3', 'Col2', 'Col1']] gives the same result (version 0.14), it is recommended (link) that you use the select method instead.
We strongly recommend selecting data with expressions for almost all
use cases. Square bracket indexing is perhaps useful when doing
exploratory data analysis in a terminal or notebook when you just want
a quick look at a subset of data.
For all other use cases we recommend using expressions because:
expressions can be parallelized
the expression approach can be used in lazy and eager mode while the indexing approach can only be used in eager mode
in lazy mode the query optimizer can optimize expressions
Turns out it is the same as pandas:
df = df[['PRODUCT', 'PROGRAM', 'MFG_AREA', 'VERSION', 'RELEASE_DATE', 'FLOW_SUMMARY', 'TESTSUITE', 'MODULE', 'BASECLASS', 'SUBCLASS', 'Empty', 'Color', 'BINNING', 'BYPASS', 'Status', 'Legend']]
That seems like a special case of projection to me.
df = pl.DataFrame({
    "c": [1, 2],
    "a": ["a", "b"],
    "b": [True, False]
})
df.select(sorted(df.columns))
shape: (2, 3)
┌─────┬───────┬─────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ str ┆ bool ┆ i64 │
╞═════╪═══════╪═════╡
│ a ┆ true ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ b ┆ false ┆ 2 │
└─────┴───────┴─────┘

calculating rowwise minimum of two series?

Hi,
Is there any function that can generate a series by calculating the row-wise minimum of two series? The functionality would be similar to np.minimum:
a = [1, 4, 2, 5, 2]
b = [5, 1, 4, 2, 5]
np.minimum(a, b)  # -> [1, 1, 2, 2, 2]
Thanks.
q = df.lazy().with_column(
    pl.when(pl.col("a") > pl.col("b"))
    .then(pl.col("b"))
    .otherwise(pl.col("a"))
    .alias("minimum")
)
df = q.collect()
I didn't test it, but I think this should work.
As the accepted answer states, you can use pl.when -> then -> otherwise expression.
If you have a wider DataFrame, you can use the DataFrame.min() method, pl.min expression, or a pl.fold for more control.
import polars as pl

df = pl.DataFrame({
    "a": [1, 4, 2, 5, 2],
    "b": [5, 1, 4, 2, 5],
    "c": [3, 2, 5, 7, 2]
})
df.min(axis=1)
This outputs:
shape: (5,)
Series: 'a' [i64]
[
1
1
2
2
2
]
Min expression
When pl.min is given multiple expression inputs, the minimum is determined row-wise instead of column-wise.
df.select(pl.min(["a", "b", "c"]))
Outputs:
shape: (5, 1)
┌─────┐
│ min │
│ --- │
│ i64 │
╞═════╡
│ 1 │
├╌╌╌╌╌┤
│ 1 │
├╌╌╌╌╌┤
│ 2 │
├╌╌╌╌╌┤
│ 2 │
├╌╌╌╌╌┤
│ 2 │
└─────┘
Fold expression
Or with a fold expression:
df.select(
    pl.fold(
        int(1e9),
        lambda acc, a: pl.when(acc > a).then(a).otherwise(acc),
        ["a", "b", "c"],
    )
)
shape: (5, 1)
┌─────────┐
│ literal │
│ --- │
│ i64 │
╞═════════╡
│ 1 │
├╌╌╌╌╌╌╌╌╌┤
│ 1 │
├╌╌╌╌╌╌╌╌╌┤
│ 2 │
├╌╌╌╌╌╌╌╌╌┤
│ 2 │
├╌╌╌╌╌╌╌╌╌┤
│ 2 │
└─────────┘
The fold allows for more cool things, because you operate over expressions.
So we could for instance compute the min of the squared columns:
pl.fold(int(1e9), lambda acc, a: pl.when(acc > a).then(a).otherwise(acc), [pl.all()**2])
Or we could compute the min of the square root of column "a" and the squares of the rest of the columns:
pl.fold(int(1e9), lambda acc, a: pl.when(acc > a).then(a).otherwise(acc), [pl.col("a").sqrt(), pl.all().exclude("a")**2])
You get the idea.