Polars - how to parallelize lambda that uses only Polars expressions? - python-polars

This runs on a single core, despite not using (seemingly) any non-Polars stuff. What am I doing wrong?
(The goal is to convert the list in the doc_ids field of every row into its string representation, so that [1, 2, 3] (list[int]) becomes '[1, 2, 3]' (string).)
import polars as pl
df = pl.DataFrame(dict(ent = ['a', 'b'], doc_ids = [[2,3], [3]]))
df = (df.lazy()
    .with_column(
        pl.concat_str([
            pl.lit('['),
            pl.col('doc_ids').apply(lambda x: x.cast(pl.Utf8)).arr.join(', '),
            pl.lit(']')
        ])
        .alias('docs_str')
    )
    .drop('doc_ids')
).collect()

In general, we want to avoid apply at all costs. It acts like a black-box function that Polars cannot optimize, leading to single-threaded performance.
Here's one way that we can eliminate apply: replace it with arr.eval. arr.eval allows us to treat a list as if it were an Expression/Series, which allows us to use standard expressions on it.
(
    df.lazy()
    .with_column(
        pl.concat_str(
            [
                pl.lit("["),
                pl.col("doc_ids")
                .arr.eval(pl.element().cast(pl.Utf8))
                .arr.join(", "),
                pl.lit("]"),
            ]
        ).alias("docs_str")
    )
    .drop("doc_ids")
    .collect()
)
shape: (2, 2)
┌─────┬──────────┐
│ ent ┆ docs_str │
│ --- ┆ --- │
│ str ┆ str │
╞═════╪══════════╡
│ a ┆ [2, 3] │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ b ┆ [3] │
└─────┴──────────┘

Related

polars - take a window of N rows surrounding a row fulfilling a condition

Consider the following dataframe:
df = pl.DataFrame({
"letters": ["A", "B", "C", "D", "E", "F", "G", "H"],
"values": ["aa", "bb", "cc", "dd", "ee", "ff", "gg", "hh"]
})
print(df)
shape: (8, 2)
┌─────────┬────────┐
│ letters ┆ values │
│ --- ┆ --- │
│ str ┆ str │
╞═════════╪════════╡
│ A ┆ aa │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ B ┆ bb │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ C ┆ cc │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ D ┆ dd │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ E ┆ ee │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ F ┆ ff │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ G ┆ gg │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ H ┆ hh │
└─────────┴────────┘
How do I take a window of size +/- N around any row that satisfies a given condition? For example, the condition is pl.col("letters").str.contains("D|F") and N = 2. Then, the output should be:
┌─────────┬────────────────────────────────┐
│ letters ┆ output │
│ --- ┆ --- │
│ str ┆ list[str] │
╞═════════╪════════════════════════════════╡
│ D ┆ ["bb", "cc", "dd", "ee", "ff"] │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ F ┆ ["dd", "ee", "ff", "gg", "hh"] │
└─────────┴────────────────────────────────┘
Note that the windows are overlapping in this case (the F window also contains dd and the D windows also contains ff). Also, note that N = 2 for the sake of simplicity here but, in reality, it'll be larger (~10 - 20). And the dataset is relatively large so I'd like to do this as efficiently as possible without exploding memory usage.
EDIT: To make the ask more explicit, here's the query in DuckDB's SQL syntax that gives the right answer (and I'd like to know how to translate it to Polars):
import duckdb
df_table = df.to_arrow()
con = duckdb.connect()
query = """
SELECT
letters,
list(values) OVER (
ROWS BETWEEN 2 PRECEDING
AND 2 FOLLOWING
) as combined
FROM df_table
QUALIFY letters in ('D', 'F')
"""
print(pl.from_arrow(con.execute(query).arrow()))
shape: (2, 2)
┌─────────┬────────────────────────┐
│ letters ┆ combined │
│ --- ┆ --- │
│ str ┆ list[str] │
╞═════════╪════════════════════════╡
│ D ┆ ["bb", "cc", ... "ff"] │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ F ┆ ["dd", "ee", ... "hh"] │
└─────────┴────────────────────────┘
Benchmarks of suggested solutions
I ran the suggested solutions in a Jupyter notebook on one of Amazon's ml.c5.xlarge machines. While the notebook was running, I also kept htop open in a terminal to observe CPU and memory use. The dataset had 12M+ rows.
I ran both solutions via both the eager and lazy APIs. For good measure, I also tried using a simple Python for loop to extract the slices after identifying the rows of interest and also DuckDB.
Summary Table
Polars showed robust performance and judicious memory use (with @jqurious's method) because of the clever, no-copy implementation of .shift(). Surprisingly, a well-thought-out Python for loop did just as well. DuckDB performed rather poorly in both speed and memory use.
Neither Polars nor DuckDB used more than one core for the operation. I'm not sure if that's due to a lack of optimization or if this problem is just not amenable to parallelization. I suppose we're only filtering over one column and then taking slices of that same column, so there's not much that multiple threads can do.
method             cpu use       memory use        time
ΩΠΟΚΕΚΡΥΜΜΕΝΟΣ     single core   explosion
jqurious           single core   2.53G to 2.53G    4.63 s
(smart) for loop   single core   2.53G to 2.58G    4.91 s
DuckDB             single core   1.62G to 6.13G    38.6 s
cpu use shows whether multiple cores were taxed during the operation.
memory use shows how much memory was in use before the operation and the maximum memory use during the operation.
@ΩΠΟΚΕΚΡΥΜΜΕΝΟΣ's solution:
preceding = 2
following = 2
look_around = [pl.col("body").shift(-i)
               for i in range(-preceding, following + 1)]
(
    df
    .with_column(
        pl.when(pl.col('body').str.contains(regex))
        .then(pl.concat_list(look_around))
        .alias('combined')
    )
    .filter(pl.col('combined').is_not_null())
)
Unfortunately, on my rather large dataset, this solution caused the memory use to explode and the kernel to crash with both the eager and lazy APIs.
@jqurious's solution:
preceding = 2
following = 2
look_around = [
    pl.col("body").shift(-i).alias(f"lag_{i}")
    for i in range(-preceding, following + 1)
]
(
    df
    .with_columns(look_around)
    .filter(pl.col("body").str.contains(regex))
    .select([
        pl.col("body"),
        # Reuse preceding/following here so the window size is set in one place.
        pl.concat_list(
            [f"lag_{i}" for i in range(-preceding, following + 1)]
        ).alias("output"),
    ])
)
eager:
cpu use: single-core
memory use: 2.53G -> 2.53G
time: 4.63 s ± 6.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
lazy:
cpu use: single-core
memory use: 2.53G -> 2.53G
time: 4.63 s ± 3.85 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
(Smart) Python for loop
preceding = 2
following = 2
output = []
indices = df.with_row_count().select(
    pl.col("row_nr").filter(pl.col("body").str.contains(regex))
)["row_nr"]
for x in indices:
    # Clamp the start at the first row so that windows near the top are
    # truncated instead of extending past `following` rows after the match.
    start = max(0, x - preceding)
    end = x + following + 1
    output.append(df["body"].slice(start, end - start))
cpu use: single-core
memory use: 2.53G -> 2.58G
time: 4.91 s ± 24.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
DuckDB
Note that I first converted the df to an Arrow.Table before running the query so DuckDB could directly act on it. Also, I'm not sure if the conversion of the result back to Arrow takes up a huge amount of computation and is unfair to it.
preceding = 2
following = 2
query = f"""
SELECT
body,
list(body) OVER (
ROWS BETWEEN {preceding} PRECEDING
AND {following} FOLLOWING
) as combined
FROM df_table
QUALIFY regexp_matches(body, '{regex}')
"""
result = con.execute(query).arrow()
With DuckDB, my first attempt to run the computation crashed. I had to retry by reading to an Arrow Table directly without using Polars (this saved about 1GB of memory) to give DuckDB more memory to use.
first try:
cpu: single-core
memory: 2.53G -> 6.93G -> crash!
time: NA
second try:
cpu: single-core
memory: 1.62G -> 6.13G
time: 38.6 s ± 311 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
A modification of the answer to "Use the rolling function of polars to get a list of all values in the rolling windows":
>>> (
...     df
...     .with_columns(
...         [pl.col("values").shift(i).alias(f"lag_{i}") for i in range(-2, 3)])
...     .filter(pl.col("letters").str.contains("D|F"))
...     .select([
...         pl.col("letters"),
...         pl.concat_list(reversed([f"lag_{i}" for i in range(-2, 3)])).alias("output")
...     ])
... )
shape: (2, 2)
┌─────────┬────────────────────────────────┐
│ letters ┆ output                         │
│ ---     ┆ ---                            │
│ str     ┆ list[str]                      │
╞═════════╪════════════════════════════════╡
│ D       ┆ ["bb", "cc", "dd", "ee", "ff"] │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ F       ┆ ["dd", "ee", "ff", "gg", "hh"] │
└─────────┴────────────────────────────────┘
You can try this:
preceding = 2
following = 2
look_around = [pl.col("values").shift(-i)
               for i in range(-preceding, following + 1)]
(
    df
    .with_column(
        pl.when(pl.col('letters').str.contains('D|F'))
        .then(pl.concat_list(look_around))
        .alias('combined')
    )
    .filter(pl.col('combined').is_not_null())
)
shape: (2, 3)
┌─────────┬────────┬────────────────────────┐
│ letters ┆ values ┆ combined │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ list[str] │
╞═════════╪════════╪════════════════════════╡
│ D ┆ dd ┆ ["bb", "cc", ... "ff"] │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ F ┆ ff ┆ ["dd", "ee", ... "hh"] │
└─────────┴────────┴────────────────────────┘

How do I pipe an expression in Polars?

How do you "pipe" an expression in Polars?
Consider this code:
def transformation(col:pl.Series)->pl.Series:
    return col.tanh().suffix('_tanh')
It'd be nice to be able to do this:
df.with_columns([
pl.col('colA').pipe(transformation),
pl.col('colB').pipe(transformation),
pl.col('colC').pipe(transformation),
pl.col('colD').pipe(transformation),
])
But I don't think Polars supports .pipe for Series / expressions.
The alternative is
df.with_columns([
transformation(pl.col('colA')),
transformation(pl.col('colB')),
transformation(pl.col('colC')),
transformation(pl.col('colD')),
])
But this gets messy (IMO) when you have arguments to the transformation function
Edit:
I implemented this and it "works" for me
def _pipe(self, func, *args, **kwargs):
    return func(self, *args, **kwargs)
pl.Expr.pipe = _pipe
Typically (like pandas) you'd apply pipe at the DataFrame level.
Especially in conjunction with lazy-eval, this would be equivalent to chaining expressions; your function will receive the underlying eager/lazy frame, along with any optional *args and **kwargs, and by making it lazy() you ensure that your chain of operations can still take advantage of the query optimiser and parallelisation.
For example:
import polars as pl
# define some UDFs
def extend_with_tan( df ):
    return df.with_columns( pl.all().tanh().suffix("_tanh") )
def mul_in_place( df, n ):
    return df.select( (pl.all() * n).suffix(f"_x{n}") )
# init lazyframe
df = pl.DataFrame({
    "colA": [-4],
    "colB": [-2],
    "colC": [10],
}).lazy()
# pipe/result
dfx = df.pipe( extend_with_tan ).pipe( mul_in_place, n=3 )
dfx.collect()
# ┌─────────┬─────────┬─────────┬──────────────┬──────────────┬──────────────┐
# │ colA_x3 ┆ colB_x3 ┆ colC_x3 ┆ colA_tanh_x3 ┆ colB_tanh_x3 ┆ colC_tanh_x3 │
# │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ i64 ┆ f64 ┆ f64 ┆ f64 │
# ╞═════════╪═════════╪═════════╪══════════════╪══════════════╪══════════════╡
# │ -12 ┆ -6 ┆ 30 ┆ -2.997988 ┆ -2.892083 ┆ 3.0 │
# └─────────┴─────────┴─────────┴──────────────┴──────────────┴──────────────┘
API Docs: polars "pipe" method
As you've probably realized, adding custom methods in order to be able to do method chaining is unfortunately not a first-class citizen in Python.
In polars, a canonical way that hopefully satisfies you is to instead write a function that returns an expression. You do this already (although the type hint is incorrectly set to pl.Series), but can save some space by giving a string argument to our transformation function:
import polars as pl
df = pl.DataFrame({"colA": [-4], "colB": [-2], "colC": [0], "colD": [2]})
def transformation(name: str | list[str]) -> pl.Expr:
    return pl.col(name).tanh().suffix("_tanh")
df1 = df.with_columns(
[
transformation("colA"),
transformation("colB"),
transformation("colC"),
transformation("colD"),
]
)
I realise this doesn't quite do what you wanted, but perhaps the following will cheer you up a bit. Since pl.col() can take a list of column names, we can do the following:
df2 = df.with_column(transformation(["colA", "colB", "colC", "colD"]))
assert df1.frame_equal(df2) # True
And we can even target all of them using a regular expression:
# ^col\w+$ is a regular expression matching `col<anything>`
df3 = df.with_column(transformation(r"^col\w+$"))
assert df1.frame_equal(df3) # True

Lightweight syntax for filtering a polars DataFrame on a multi-column key?

I'm wondering if there's a lightweight syntax for filtering a polars DataFrame against a multi-column key, other than inner/anti joins. (There's nothing wrong with the joins, but it would be nice if there's something more compact).
Using the following frame as an example:
import polars as pl
df = pl.DataFrame(
data = [
["x",123, 4.5, "misc"],
["y",456,10.0,"other"],
["z",789,99.5,"value"],
],
columns = ["a","b","c","d"],
)
A PostgreSQL statement could use a VALUES expression, like so...
(("a","b") IN (VALUES ('x',123),('y',456)))
...and a pandas equivalent might set a multi-column index.
pf.set_index( ["a","b"], inplace=True )
pf[ pf.index.isin([('x',123),('y',456)]) ]
The polars syntax would look like this:
df.join(
pl.DataFrame(
data = [('x',123),('y',456)],
columns = {col:tp for col,tp in df.schema.items() if col in ("a","b")},
orient = 'row',
),
on = ["a","b"],
how = "inner", # or 'anti' for "not in"
)
Is a multi-column is_in construct, or equivalent expression, currently available with polars? Something like the following would be great if it exists (or could be added):
df.filter( pl.cols("a","b").is_in([('x',123),('y',456)]) )
In the next polars release >0.13.44 this will work on the struct datatype.
We convert the 2 (or more) columns we want to check to a struct with pl.struct and call the is_in expression. (A conversion to struct is a free operation)
df = pl.DataFrame(
data=[
["x", 123, 4.5, "misc"],
["y", 456, 10.0, "other"],
["z", 789, 99.5, "value"],
],
columns=["a", "b", "c", "d"],
)
df.filter(
pl.struct(["a", "b"]).is_in([{"a": "x", "b": 123}, {"a": "y", "b": 456}])
)
shape: (2, 4)
┌─────┬─────┬──────┬───────┐
│ a ┆ b ┆ c ┆ d │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ f64 ┆ str │
╞═════╪═════╪══════╪═══════╡
│ x ┆ 123 ┆ 4.5 ┆ misc │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ y ┆ 456 ┆ 10.0 ┆ other │
└─────┴─────┴──────┴───────┘
Filtering by data in another DataFrame.
The idiomatic way to filter data by presence in another DataFrame is a semi or anti join. Inner joins also filter by presence, but they include the columns of the right-hand DataFrame, whereas a semi join does not and only filters the left-hand side.
semi: keep rows/keys that are in both DataFrames
anti: remove rows/keys that are in both DataFrames
The reason why these joins are preferred over is_in is that they are much faster and currently allow for more optimization.

Expression in Polars select context that refers to earlier alias

Is there a way to allow an expression in Polars to refer to a previous aliased expression? For example, this code that defines two new columns errors because the second new column refers to the first:
import polars as pl
df = pl.DataFrame(dict(x=[0, 0, 1]))
df.select([
(pl.col('x') + 1).alias('y'),
(pl.col('y') * 2).alias('z')],
)
# pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value:
# NotFound("Unable to get field named \"y\". Valid fields: [\"x\"]")
The error makes it obvious that the failure is caused by the first alias not being visible to the second expression. Is there a straightforward way to make this work?
All polars expressions within a context are executed in parallel. So they cannot refer to a column that does not yet exist.
A context is:
df.with_columns
df.select
df.groupby(..).agg
This means you need to enforce sequential execution for expressions that refer to the outputs of other expressions.
In your case I would do:
(df.with_column(
(pl.col('x') + 1).alias('y')
).select([
pl.col('y'),
(pl.col('y') * 2).alias('z')
]))
One workaround is to pull out each new column into its own with_column call and then do a final select to keep the columns you were supposed to keep. You will probably want to make sure this is done lazily.
import polars as pl
df = pl.DataFrame(dict(x=[0, 0, 1]))
(df
.lazy()
.with_column((pl.col("x") + 1).alias("y"))
.with_column((pl.col("y") * 2).alias("z"))
.select(["y", "z"])
.collect()
)
# shape: (3, 2)
# ┌─────┬─────┐
# │ y ┆ z │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 1 ┆ 2 │
# ├╌╌╌╌╌┼╌╌╌╌╌┤
# │ 1 ┆ 2 │
# ├╌╌╌╌╌┼╌╌╌╌╌┤
# │ 2 ┆ 4 │
# └─────┴─────┘

Groupby aggregate two columns into a dictionary in Polars

Given the following data, I'm looking to group by and combine two columns into one that holds a dictionary. One column supplies the keys, while the values stem from another column that is first aggregated into a list.
import polars as pl
data = pl.DataFrame(
{
"names": ["foo", "ham", "spam", "cheese", "egg", "foo"],
"dates": ["1", "1", "2", "3", "3", "4"],
"groups": ["A", "A", "B", "B", "B", "C"],
}
)
>>> print(data)
names dates groups
0 foo 1 A
1 ham 1 A
2 spam 2 B
3 cheese 3 B
4 egg 3 B
5 foo 4 C
# This is what i'm trying to do:
groups combined
0 A {'1': ['foo', 'ham']}
1 B {'2': ['spam'], '3': ['cheese', 'egg']}
2 C {'4': ['foo']}
In pandas I can do this using two groupby statements, and in PySpark using a set of operations around "map_from_entries", but despite various attempts I haven't figured out a way in Polars. So far I use agg_list(), convert to pandas and use a lambda. While this works, it certainly doesn't feel right.
data = data.groupby(["groups", "dates"])["names"].agg_list()
data = (
data.to_pandas()
.groupby(["groups"])
.apply(lambda x: dict(zip(x["dates"], x["names_agg_list"])))
.reset_index(name="combined")
)
Alternatively, inspired by this post, I've tried a number of variations similar to the following, including converting the dict to JSON strings, among other things.
data = data.groupby(["groups"]).agg(
pl.apply(exprs=["dates", "names_agg_list"], f=build_dict).alias("combined")
)
With the release of polars>=0.12.10 you can do this:
print(data
    .groupby(["groups", "dates"]).agg(pl.col("names").list().keep_name())
    .groupby("groups")
    .agg([
        pl.apply([pl.col("dates"), pl.col("names")], lambda s: dict(zip(s[0], s[1].to_list())))
    ])
)
shape: (3, 2)
┌────────┬─────────────────────────────────────┐
│ groups ┆ dates │
│ --- ┆ --- │
│ str ┆ object │
╞════════╪═════════════════════════════════════╡
│ A ┆ {'1': ['foo', 'ham']} │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ C ┆ {'4': ['foo']} │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ B ┆ {'3': ['cheese', 'egg'], '2': ['... │
└────────┴─────────────────────────────────────┘
This is not really how you should be using DataFrames, though. There is likely a solution that lets you deal with more flattened dataframes and doesn't require you to put slow Python objects in dataframes.