How can I call a numpy ufunc with two positional arguments in polars? - python-polars

I would like to call a numpy universal function (ufunc) that has two positional arguments in polars.
df.with_column(
numpy.left_shift(pl.col('col1'), 8)
)
The attempt above results in the following error message:
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "/usr/local/lib/python3.8/dist-packages/polars/internals/expr.py", line 181, in __array_ufunc__
out_type = ufunc(np.array([1])).dtype
TypeError: left_shift() takes from 2 to 3 positional arguments but 1 were given
There are other ways to perform this computation, e.g.,
df['col1'] = numpy.left_shift(df['col1'], 8)
... but I'm trying to use this with a polars.LazyFrame.
I'm using polars 0.13.13 and Python 3.8.

Edit: as of Polars 0.13.19, the apply method converts Numpy datatypes to Polars datatypes without requiring the Numpy item method.
When you need to pass only one column from Polars to the ufunc (as in your example), the easiest method is to use the apply function on that column.
import numpy as np
import polars as pl
df = pl.DataFrame({"col1": [2, 4, 8, 16]}).lazy()
df.with_column(
pl.col("col1").apply(lambda x: np.left_shift(x, 8).item()).alias("result")
).collect()
shape: (4, 2)
┌──────┬────────┐
│ col1 ┆ result │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞══════╪════════╡
│ 2 ┆ 512 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 4 ┆ 1024 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 8 ┆ 2048 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 16 ┆ 4096 │
└──────┴────────┘
If you need to pass multiple columns from Polars to the ufunc, then use the struct expression with apply.
df = pl.DataFrame({"col1": [2, 4, 8, 16], "shift": [1, 1, 2, 2]}).lazy()
df.with_column(
pl.struct(["col1", "shift"])
.apply(lambda cols: np.left_shift(cols["col1"], cols["shift"]).item())
.alias("result")
).collect()
shape: (4, 3)
┌──────┬───────┬────────┐
│ col1 ┆ shift ┆ result │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞══════╪═══════╪════════╡
│ 2 ┆ 1 ┆ 4 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 4 ┆ 1 ┆ 8 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 8 ┆ 2 ┆ 32 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 16 ┆ 2 ┆ 64 │
└──────┴───────┴────────┘
One Note: the use of the numpy item method may no longer be needed in future releases of Polars. (Presently, the apply method does not always automatically translate between numpy dtypes and Polars dtypes.)
Does this help?

Related

Select columns from LazyFrame by condition

polars.LazyFrame.var will return variance value for each column in a table as below:
>>> df = pl.DataFrame({"a": [1, 2, 3, 4], "b": [1, 2, 1, 1], "c": [1, 1, 1, 1]}).lazy()
>>> df.collect()
shape: (4, 3)
┌─────┬─────┬─────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 1 ┆ 1 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ 2 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 3 ┆ 1 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 4 ┆ 1 ┆ 1 │
└─────┴─────┴─────┘
>>> df.var().collect()
shape: (1, 3)
┌──────────┬──────┬─────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 │
╞══════════╪══════╪═════╡
│ 1.666667 ┆ 0.25 ┆ 0.0 │
└──────────┴──────┴─────┘
I wish to select columns with value > 0 from LazyFrame but couldn't find the solution.
I can iterate over columns in polars dataframe then filter columns by condition as below:
>>> data.var()
shape: (1, 3)
┌──────────┬──────┬─────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 │
╞══════════╪══════╪═════╡
│ 1.666667 ┆ 0.25 ┆ 0.0 │
└──────────┴──────┴─────┘
>>> cols = pl.select([s for s in data.var() if (s > 0).all()]).columns
>>> cols
['a', 'b']
>>> data.select(cols)
shape: (4, 2)
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 3 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 4 ┆ 1 │
└─────┴─────┘
But it doesn't work in LazyFrame:
>>> data = data.lazy()
>>> data
<polars.internals.lazyframe.frame.LazyFrame object at 0x7f0e3d9966a0>
>>> cols = pl.select([s for s in data.var() if (s > 0).all()]).columns
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 1, in <listcomp>
File "/home/jasmine/miniconda3/envs/jupyternb/lib/python3.9/site-packages/polars/internals/lazyframe/frame.py", line 421, in __getitem__
raise TypeError(
TypeError: 'LazyFrame' object is not subscriptable (aside from slicing). Use 'select()' or 'filter()' instead.
The reason for doing this in LazyFrame is that we want to maximize the performance. Any advice would be much appreciated. Thanks!
Polars doesn't know what the variance is until after it has been calculated, and that is the same point at which it displays the results. So, at least with respect to the Polars calculation, there is no way to filter the reported columns that is more performant than computing and displaying all of them. It could be that Python/Jupyter takes longer to display more results than fewer.
With that said you could do something like this:
df.var().melt().filter(pl.col('value')>0).collect()
which gives you what you want in one line but it's a different shape.
You could also do something like this:
dfvar=df.var()
dfvar.select(dfvar.melt().filter(pl.col('value')>0).select('variable').collect().to_series().to_list()).collect()
Building on the answer from @Dean MacGregor, we:
do the var calculation
melt
apply the filter
extract the variable column with column names
pass it as a list to select
df.select(
(
df.var().melt().filter(pl.col('value')>0).collect()
["variable"]
)
.to_list()
).collect()
shape: (4, 2)
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 3 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 4 ┆ 1 │
└─────┴─────┘

In Polars how do I print all elements of a list column?

I have a Polars DataFrame with a list column. I want to control how many elements of a pl.List column are printed.
I've tried pl.Config.set_fmt_str_lengths(), but this only restricts the number of elements if set to a small value; it doesn't show more elements for a large value.
I'm working in Jupyterlab but I think it's a general issue.
import polars as pl
N = 5
df = (
pl.DataFrame(
{
'id': range(N)
}
)
.with_row_count("value")
.groupby_rolling(
"id",period=f"{N}i"
)
.agg(
pl.col("value")
)
)
df
shape: (5, 2)
┌─────┬───────────────┐
│ id ┆ value │
│ --- ┆ --- │
│ i64 ┆ list[u32] │
╞═════╪═══════════════╡
│ 0 ┆ [0] │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ [0, 1] │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ [0, 1, 2] │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ [0, 1, ... 3] │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ [0, 1, ... 4] │
└─────┴───────────────┘
pl.Config.set_tbl_rows(100)
And more generally, I would try looking at dir(pl.Config)
You can use the following config parameter from the Polars Documentation to set the length of the output e.g. 100.
import polars as pl
pl.Config.set_fmt_str_lengths(100)
Currently I do not think you can, directly; the documentation for Config does not list any such method, and for me (in VSCode at least) set_fmt_str_lengths does not affect lists.
However, if your goal is simply to be able to see what you're working on and you don't mind a slightly hacky workaround, you can simply add a column next to it where you convert your list to a string representation of itself, at which point pl.Config.set_fmt_str_lengths(<some large n>) will then display however much of it you like. For example:
import polars as pl
pl.Config.set_fmt_str_lengths(100)
N = 5
df = (
pl.DataFrame(
{
'id': range(N)
}
)
.with_row_count("value")
.groupby_rolling(
"id",period=f"{N}i"
)
.agg(
pl.col("value")
).with_column(
pl.col("value").apply(lambda x: "[" + ", ".join(str(i) for i in x) + "]").alias("string_repr")
)
)
df
shape: (5, 3)
┌─────┬───────────────┬─────────────────┐
│ id ┆ value ┆ string_repr │
│ --- ┆ --- ┆ --- │
│ i64 ┆ list[u32] ┆ str │
╞═════╪═══════════════╪═════════════════╡
│ 0 ┆ [0] ┆ [0] │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ [0, 1] ┆ [0, 1] │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ [0, 1, 2] ┆ [0, 1, 2] │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ [0, 1, ... 3] ┆ [0, 1, 2, 3] │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ [0, 1, ... 4] ┆ [0, 1, 2, 3, 4] │
└─────┴───────────────┴─────────────────┘

python-polars create new column by dividing by two existing columns

In pandas, the following creates a new column in a dataframe by dividing one existing column by another. How do I do this in Polars? Bonus if done in the fastest way using polars.LazyFrame.
df = pd.DataFrame({"col1":[10,20,30,40,50], "col2":[5,2,10,10,25]})
df["ans"] = df["col1"]/df["col2"]
print(df)
You want to avoid Pandas-style coding and use Polars Expressions API. Expressions are the heart of Polars and yield the best performance.
Here's how we would code this using Expressions, including using Lazy mode:
(
df
.lazy()
.with_column(
(pl.col('col1') / pl.col('col2')).alias('result')
)
.collect()
)
shape: (5, 3)
┌──────┬──────┬────────┐
│ col1 ┆ col2 ┆ result │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ f64 │
╞══════╪══════╪════════╡
│ 10 ┆ 5 ┆ 2.0 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 20 ┆ 2 ┆ 10.0 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 30 ┆ 10 ┆ 3.0 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 40 ┆ 10 ┆ 4.0 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 50 ┆ 25 ┆ 2.0 │
└──────┴──────┴────────┘
Here's a section of the User Guide that may help transitioning from Pandas-style coding to using Polars Expressions.

Python-Polars: How to filter categorical column with string list

I have a Polars dataframe like below:
df_cat = pl.DataFrame(
[
pl.Series("a_cat", ["c", "a", "b", "c", "b"], dtype=pl.Categorical),
pl.Series("b_cat", ["F", "G", "E", "G", "G"], dtype=pl.Categorical)
])
print(df_cat)
shape: (5, 2)
┌───────┬───────┐
│ a_cat ┆ b_cat │
│ --- ┆ --- │
│ cat ┆ cat │
╞═══════╪═══════╡
│ c ┆ F │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ a ┆ G │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b ┆ E │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ c ┆ G │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b ┆ G │
└───────┴───────┘
The following filter runs perfectly fine:
print(df_cat.filter(pl.col('a_cat') == 'c'))
shape: (2, 2)
┌───────┬───────┐
│ a_cat ┆ b_cat │
│ --- ┆ --- │
│ cat ┆ cat │
╞═══════╪═══════╡
│ c ┆ F │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ c ┆ G │
└───────┴───────┘
What I want is to use a list of strings to run the filter more efficiently. So I tried the following and ended up with this error message:
print(df_cat.filter(pl.col('a_cat').is_in(['a', 'c'])))
---------------------------------------------------------------------------
ComputeError Traceback (most recent call last)
d:\GitRepo\Test2\stockEMD3.ipynb Cell 9 in <cell line: 1>()
----> 1 print(df_cat.filter(pl.col('a_cat').is_in(['c'])))
File c:\ProgramData\Anaconda3\envs\charm3.9\lib\site-packages\polars\internals\dataframe\frame.py:2185, in DataFrame.filter(self, predicate)
2181 if _NUMPY_AVAILABLE and isinstance(predicate, np.ndarray):
2182 predicate = pli.Series(predicate)
2184 return (
-> 2185 self.lazy()
2186 .filter(predicate) # type: ignore[arg-type]
2187 .collect(no_optimization=True, string_cache=False)
2188 )
File c:\ProgramData\Anaconda3\envs\charm3.9\lib\site-packages\polars\internals\lazyframe\frame.py:660, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, string_cache, no_optimization, slice_pushdown)
650 projection_pushdown = False
652 ldf = self._ldf.optimization_toggle(
653 type_coercion,
654 predicate_pushdown,
(...)
658 slice_pushdown,
659 )
--> 660 return pli.wrap_df(ldf.collect())
ComputeError: joins/or comparisons on categorical dtypes can only happen if they are created under the same global string cache
From this Stackoverflow link I understand "You need to set a global string cache to compare categoricals created in different columns/lists." but my question is
Why does the == filter with a single string work?
What is the proper way to filter a categorical column with a list of strings?
Thanks!
Actually, you don't need to set a global string cache to compare strings to Categorical variables. You can use cast to accomplish this.
Let's use this data. I've included the integer values that underlie the Categorical variables to demonstrate something later.
import polars as pl
df_cat = (
pl.DataFrame(
[
pl.Series("a_cat", ["c", "a", "b", "c", "X"], dtype=pl.Categorical),
pl.Series("b_cat", ["F", "G", "E", "S", "X"], dtype=pl.Categorical),
]
)
.with_column(
pl.all().to_physical().suffix('_phys')
)
)
df_cat
shape: (5, 4)
┌───────┬───────┬────────────┬────────────┐
│ a_cat ┆ b_cat ┆ a_cat_phys ┆ b_cat_phys │
│ --- ┆ --- ┆ --- ┆ --- │
│ cat ┆ cat ┆ u32 ┆ u32 │
╞═══════╪═══════╪════════════╪════════════╡
│ c ┆ F ┆ 0 ┆ 0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ a ┆ G ┆ 1 ┆ 1 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ b ┆ E ┆ 2 ┆ 2 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ c ┆ S ┆ 0 ┆ 3 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ X ┆ X ┆ 3 ┆ 4 │
└───────┴───────┴────────────┴────────────┘
Comparing a categorical variable to a string
If we cast a Categorical variable back to its string values, we can make any comparison we need. For example:
df_cat.filter(pl.col('a_cat').cast(pl.Utf8).is_in(['a', 'c']))
shape: (3, 4)
┌───────┬───────┬────────────┬────────────┐
│ a_cat ┆ b_cat ┆ a_cat_phys ┆ b_cat_phys │
│ --- ┆ --- ┆ --- ┆ --- │
│ cat ┆ cat ┆ u32 ┆ u32 │
╞═══════╪═══════╪════════════╪════════════╡
│ c ┆ F ┆ 0 ┆ 0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ a ┆ G ┆ 1 ┆ 1 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ c ┆ S ┆ 0 ┆ 3 │
└───────┴───────┴────────────┴────────────┘
Or in a filter step comparing the string values of two Categorical variables that do not share the same string cache.
df_cat.filter(pl.col('a_cat').cast(pl.Utf8) == pl.col('b_cat').cast(pl.Utf8))
shape: (1, 4)
┌───────┬───────┬────────────┬────────────┐
│ a_cat ┆ b_cat ┆ a_cat_phys ┆ b_cat_phys │
│ --- ┆ --- ┆ --- ┆ --- │
│ cat ┆ cat ┆ u32 ┆ u32 │
╞═══════╪═══════╪════════════╪════════════╡
│ X ┆ X ┆ 3 ┆ 4 │
└───────┴───────┴────────────┴────────────┘
Notice that it is the string values being compared (not the integers underlying the two Categorical variables).
The equality operator on Categorical variables
The following statements are equivalent:
df_cat.filter((pl.col('a_cat') == 'a'))
df_cat.filter((pl.col('a_cat').cast(pl.Utf8) == 'a'))
The former is syntactic sugar for the latter, as the former is a common use case.
As the error states: ComputeError: joins/or comparisons on categorical dtypes can only happen if they are created under the same global string cache.
Comparisons of categorical values are only allowed under a global string cache. You really want to set this in such a case as it speeds up comparisons and prevents expensive casts to strings.
Setting this on the start of your query will ensure it runs:
import polars as pl
pl.Config.set_global_string_cache()
This is a new answer based on the one from @ritchie46.
As of Polars 0.15.15, it is now:
import polars as pl
pl.toggle_string_cache(True)
A StringCache() context manager can also be used (see the Polars documentation):
with pl.StringCache():
print(df_cat.filter(pl.col('a_cat').is_in(['a', 'c'])))

How to assign Exponential Moving Averages after groupby in python polars

I have just started using polars in python and I'm coming from pandas.
I would like to know how can I replicate the below pandas code in python polars
import pandas as pd
import polars as pl
df['exp_mov_avg_col'] = df.groupby('agg_col')['ewm_col'].transform(lambda x : x.ewm(span=14).mean())
I have tried the following:
df.groupby('agg_col').agg([pl.col('ewm_col').ewm_mean().alias('exp_mov_avg_col')])
but this gives me a list of exponential moving averages per provider, I want that list to be assigned to a column in original dataframe to the correct indexes, just like the pandas code does.
You can use window functions which apply an expression within a group defined by .over("group").
df = pl.DataFrame({
"agg_col": [1, 1, 2, 3, 3, 3],
"ewm_col": [1, 2, 3, 4, 5, 6]
})
(df.select([
pl.all().exclude("ewm_col"),
pl.col("ewm_col").ewm_mean(alpha=0.5).over("agg_col")
]))
Outputs:
shape: (6, 2)
┌─────────┬──────────┐
│ agg_col ┆ ewm_col │
│ --- ┆ --- │
│ i64 ┆ f64 │
╞═════════╪══════════╡
│ 1 ┆ 1.0 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 1.666667 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 3.0 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 4.0 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 4.666667 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 5.428571 │
└─────────┴──────────┘