Polars: use left shift operator in an apply expression - python-polars

I have a python-polars dataframe:
┌───────┬──────┐
│ val1  ┆ val2 │
│ ---   ┆ ---  │
│ f32   ┆ i32  │
╞═══════╪══════╡
│ 0.0   ┆ 3    │
│ 1.0   ┆ 3    │
│ 1.0   ┆ 3    │
│ 1.0   ┆ 3    │
│ ...   ┆ ...  │
│ 0.667 ┆ 3    │
│ 1.0   ┆ 3    │
│ 0.333 ┆ 3    │
│ 0.333 ┆ 3    │
└───────┴──────┘
I'd like to do this operation to replace the two columns above with one column obtained by:
((val1 * 100) << 16) + val2
I have a function that can take values from the two columns and do this operation, but it is a Python lambda, which is not expressive and makes it quite slow.
df.select(((pl.col("val1") * 100) << 16) + pl.col("val2"))
# Traceback (most recent call last):
# File "<stdin>", line 1, in <module>
# TypeError: unsupported operand type(s) for <<: 'Expr' and 'int'
I can run +, -, * or / in place of <<, but it does not work with bitwise shift operators.
Is this possible in polars? I saw that there is a function in polars-rust that does a left shift, but I don't know how or if that can be used with python.

You can use pyarrow, since converting from Polars to Arrow is free and, more importantly, because pyarrow does have bit shifting (although only for integers).
# start with
import polars as pl
import pyarrow as pa
import pyarrow.compute as pc

df = pl.DataFrame({'val1': [0.0, 1.0, 1.0, 0.667, 1, 0, 0.33, 0.33], 'val2': [3, 3, 3, 3, 3, 3, 3, 3]})
# multiply val1 by 100, cast it to Int64, and convert to an Arrow table
padf = df.with_columns([(pl.col('val1') * 100).cast(pl.Int64())]).to_arrow()
# use pyarrow.compute's `shift_left`, bring the result back into polars, and add val2
df = df.with_columns(
    pl.from_arrow(pc.shift_left(padf['val1'], pa.scalar(16))).alias('shifted')
).with_columns(
    (pl.col('shifted') + pl.col('val2')).alias('result')
)
I just noticed @jqurious's comment. That's actually a good approach because np.left_shift is a ufunc.
It just needs a couple of tweaks:
import numpy as np

df.with_columns(
    np.left_shift((pl.col('val1') * 100).cast(pl.Int64()), 16).alias('npres')
)
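For completeness, here is a rough sketch (not from the original answer) that builds the full ((val1 * 100) << 16) + val2 result in one expression, assuming a Polars version where numpy ufuncs with extra arguments work on expressions as in the tweak above. Note that for non-negative integers a left shift by 16 bits is the same as multiplying by 2**16, so a plain arithmetic expression also works.
import numpy as np
import polars as pl

df = pl.DataFrame({'val1': [0.0, 1.0, 0.667, 0.333], 'val2': [3, 3, 3, 3]})

# ufunc route: shift the scaled val1 left by 16 bits, then add val2
df.select(
    (np.left_shift((pl.col('val1') * 100).cast(pl.Int64()), 16) + pl.col('val2')).alias('result')
)

# pure-expression route: << 16 on non-negative integers equals * 2**16
df.select(
    ((pl.col('val1') * 100).cast(pl.Int64()) * 2**16 + pl.col('val2')).alias('result')
)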

Related

Finding first index of where value in column B is greater than a value in column A

I'd like to know the first occurrence (index) at which a value in column B is greater than the value in column A. Currently I use a for loop (and it's super slow), but I'd imagine it's possible to do that in a rolling window.
df = polars.DataFrame({"idx": [i for i in range(5)], "col_a": [1,2,3,4,4], "col_b": [1,1,5,5,3]})
# apply some window function?
# result of first indices where a value in column B is greater than the value in column A
result = polars.Series([2,2,2,3,None])
I'm still trying to understand Polars' concept of windows, but I imagine the pseudocode would look something like this:
for window length compare values in both columns, use arg_min() to get the index
if the resulting index is not found (e.g. value None or 0), increase window length and make a second pass
make passes until some max window_len
Current for loop implementation:
df = polars.DataFrame({"col_a": [1,2,3,4,4], "col_b": [1,1,5,5,3]})
for i in range(0, df.shape[0]):
    # `arg_max()` returns 0 when there's no such index or if the index is actually 0
    series = (df.select("col_a")[i,0] < df.select("col_b")[i:])[:,0]
    idx_found = True in series
    if idx_found:
        print(i + series.arg_max())
    else:
        print("None")
# output:
2
2
2
3
None
Edit 1:
This almost solves the problem. But we still don't know whether arg_max found an actual True value or didn't find an index, since it returns 0 in both cases.
One idea is that we're never satisfied with the answer 0 and make a second scan, with a longer window, over all values where the result was 0.
df_res = df.select(polars.col("idx")) + \
    df.groupby_dynamic("idx", every="1i", period="5i").agg(
        [
            (polars.col("col_a").head(1) < polars.col("col_b")).arg_max().alias("res")
        ]
    )
Edit 2:
This is the final solution: the first pass is made from the code in Edit 1. The following passes (with increasingly wider windows/periods) can be made with:
increase_window_size = "10i"
df_res.groupby_dynamic("idx", every="1i", period=increase_window_size).agg(
[
(polars.col("col_a").head(1) < polars.col("col_b")).filter(polars.col("res").head(1) == 0).arg_max().alias("res")
]
)
Starting from...
df=pl.DataFrame({"idx": [i for i in range(5)], "col_a": [1,2,3,4,4], "col_b": [1,1,5,5,3]})
For each row, you want the min idx where the current row's col_a is less than every subsequent row's col_b.
The first step is to add two columns that will contain all the data as a list and then we want to explode those into a much longer DataFrame.
df.with_columns([
    pl.col('col_b').list(),
    pl.col('idx').list().alias('b_indx')]) \
    .explode(['col_b','b_indx'])
From here, we want to apply a filter so we're only keeping rows where the b_indx is at least as big as the idx AND the col_a is less than col_b
df.with_columns([
    pl.col('col_b').list(),
    pl.col('idx').list().alias('b_indx')]) \
    .explode(['col_b','b_indx']) \
    .filter((pl.col('b_indx')>=pl.col('idx')) & (pl.col('col_a')<pl.col('col_b')))
There are a couple of ways you could clean that up; one is groupby+agg+sort:
df.with_columns([
    pl.col('col_b').list(),
    pl.col('idx').list().alias('b_indx')]) \
    .explode(['col_b','b_indx']) \
    .filter((pl.col('b_indx')>=pl.col('idx')) & (pl.col('col_a')<pl.col('col_b'))) \
    .groupby(['idx']).agg([pl.col('b_indx').min()]).sort('idx')
The other way is to just do unique by idx
df.with_columns([
    pl.col('col_b').list(),
    pl.col('idx').list().alias('b_indx')]) \
    .explode(['col_b','b_indx']) \
    .filter((pl.col('b_indx')>=pl.col('idx')) & (pl.col('col_a')<pl.col('col_b'))) \
    .unique(subset='idx')
Lastly, to get the null values back, you have to join the result back to the original df. To keep with the theme of adding to the end of the chain we'd want a right join, but right joins aren't an option, so we have to put the join back at the beginning.
df.join(
    df.with_columns([
        pl.col('col_b').list(),
        pl.col('idx').list().alias('b_indx')]) \
        .explode(['col_b','b_indx']) \
        .filter((pl.col('b_indx')>=pl.col('idx')) & (pl.col('col_a')<pl.col('col_b'))) \
        .unique(subset='idx'),
    on='idx', how='left').get_column('b_indx')
shape: (5,)
Series: 'b_indx' [i64]
[
2
2
2
3
null
]
Note:
I was curious about the performance difference between my approach and jqurious's approach, so I did
df=pl.DataFrame({'col_a':np.random.randint(1,10,10000), 'col_b':np.random.randint(1,10,10000)}).with_row_count('idx')
then ran each code chunk. Mine took 1.7s while jqurious's took just 0.7s BUT his answer isn't correct...
For instance...
df.join(
    df.with_columns([
        pl.col('col_b').list(),
        pl.col('idx').list().alias('b_indx')]) \
        .explode(['col_b','b_indx']) \
        .filter((pl.col('b_indx')>=pl.col('idx')) & (pl.col('col_a')<pl.col('col_b'))) \
        .unique(subset='idx'),
    on='idx', how='left').select(['idx','col_a','col_b', pl.col('b_indx').alias('result')]).head(5)
yields...
shape: (5, 4)
┌─────┬───────┬───────┬────────┐
│ idx ┆ col_a ┆ col_b ┆ result │
│ --- ┆ ---   ┆ ---   ┆ ---    │
│ u32 ┆ i64   ┆ i64   ┆ u32    │
╞═════╪═══════╪═══════╪════════╡
│ 0   ┆ 4     ┆ 1     ┆ 2      │ 4<5 at indx2
│ 1   ┆ 1     ┆ 4     ┆ 1      │ 1<4 at indx1
│ 2   ┆ 3     ┆ 5     ┆ 2      │ 3<5 at indx2
│ 3   ┆ 4     ┆ 2     ┆ 5      │ off the page
│ 4   ┆ 5     ┆ 4     ┆ 5      │ off the page
└─────┴───────┴───────┴────────┘
whereas
df.with_columns(
    pl.when(pl.col("col_a") < pl.col("col_b"))
    .then(1)
    .cumsum()
    .backward_fill()
    .alias("result") + 1
).head(5)
yields
shape: (5, 4)
┌─────┬───────┬───────┬────────┐
│ idx ┆ col_a ┆ col_b ┆ result │
│ --- ┆ ---   ┆ ---   ┆ ---    │
│ u32 ┆ i64   ┆ i64   ┆ i32    │
╞═════╪═══════╪═══════╪════════╡
│ 0   ┆ 4     ┆ 1     ┆ 2      │ 4<5 at indx2
│ 1   ┆ 1     ┆ 4     ┆ 2      │ not right
│ 2   ┆ 3     ┆ 5     ┆ 3      │ not right
│ 3   ┆ 4     ┆ 2     ┆ 4      │ off the page
│ 4   ┆ 5     ┆ 4     ┆ 4      │ off the page
└─────┴───────┴───────┴────────┘
Performance
This scales pretty terribly: bumping the df from 10,000 rows to 100,000 made my kernel crash. Going from 10,000 to 20,000 made it take 5.7s, which makes sense since we're squaring the size of the df. To mitigate this, you can do overlapping chunks.
First let's make a function
def idx_finder(df):
    return (df.join(
        df.with_columns([
            pl.col('col_b').list(),
            pl.col('idx').list().alias('b_indx')]) \
            .explode(['col_b','b_indx']) \
            .filter((pl.col('b_indx')>=pl.col('idx')) & (pl.col('col_a')<pl.col('col_b'))) \
            .unique(subset='idx'),
        on='idx', how='left').select(['idx','col_a','col_b', pl.col('b_indx').alias('result')]))
Let's get some summary stats:
print(df.groupby('col_a').agg(pl.col('col_b').max()).sort('col_a'))
shape: (9, 2)
┌───────┬───────┐
│ col_a ┆ col_b │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═══════╪═══════╡
│ 1 ┆ 9 │
│ 2 ┆ 9 │
│ 3 ┆ 9 │
│ 4 ┆ 9 │
│ ... ┆ ... │
│ 6 ┆ 9 │
│ 7 ┆ 9 │
│ 8 ┆ 9 │
│ 9 ┆ 9 │
└───────┴───────┘
This tells us that, for any value of col_a, the biggest value of col_b is 9, which means that anytime the result is null and col_a is 9, it's a true null.
With that, we do
chunks=[]
chunks.append(idx_finder(df[0:10000])) # arbitrarily picking 10,000 per chunk
Then take a look at
chunks[-1].filter((pl.col('result').is_null()) & (pl.col('col_a')<9))
shape: (2, 4)
┌──────┬───────┬───────┬────────┐
│ idx ┆ col_a ┆ col_b ┆ result │
│ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ i64 ┆ i64 ┆ u32 │
╞══════╪═══════╪═══════╪════════╡
│ 9993 ┆ 8 ┆ 6 ┆ null │
│ 9999 ┆ 3 ┆ 1 ┆ null │
└──────┴───────┴───────┴────────┘
Let's cut off this chunk at idx=9992 and then start the next chunk at 9993:
curindx=chunks[-1].filter((pl.col('result').is_null()) & (pl.col('col_a')<9))[0,0]
chunks[-1]=chunks[-1].filter(pl.col('idx')<curindx)
With that we can reformulate this logic into a while loop
curindx = 0
chunks = []
while curindx <= df.shape[0]:
    print(curindx)
    chunks.append(idx_finder(df[curindx:(curindx + 10000)]))
    curchunkfilt = chunks[-1].filter((pl.col('result').is_null()) & (pl.col('col_a') < 9))
    if curchunkfilt.shape[0] == 0:
        curindx += 10000
    elif curchunkfilt[0,0] > curindx:
        curindx = chunks[-1].filter((pl.col('result').is_null()) & (pl.col('col_a') < 9))[0,0]
    else:
        print("curindx not advancing")
        break
    chunks[-1] = chunks[-1].filter(pl.col('idx') < curindx)
Finally just
pl.concat(chunks)
As long as we're looping, here's another approach.
If the gaps between the A/B matches are small, then this will end up being fast, as it scales according to the gap size rather than the size of the df. It just uses shift:
df = df.with_columns(pl.lit(None).alias('result'))
y = 0
while True:
    print(y)
    maxB = df.filter(pl.col('result').is_null()).select(pl.col('col_b').max())[0,0]
    df = df.with_columns((
        pl.when(
            (pl.col('result').is_null()) & (pl.col('col_a') < pl.col('col_b').shift(-y))
        ).then(pl.col('idx').shift(-y)).otherwise(pl.col('result'))).alias('result'))
    y += 1
    if df.filter((pl.col('result').is_null()) & (pl.col('col_a') < maxB) & ~(pl.col('col_b').shift(-y).is_null())).shape[0] == 0:
        break
With my random data of 1.2m rows it only took 2.6s with a max row offset of 86. If, in your real data, the gaps are on the order of, let's just say, 100,000 then it'd be close to an hour.

python-polars create new column by dividing by two existing columns

In pandas, the following creates a new column in a dataframe by dividing two existing columns. How do I do this in Polars? Bonus if done in the fastest way using polars.LazyFrame.
df = pd.DataFrame({"col1":[10,20,30,40,50], "col2":[5,2,10,10,25]})
df["ans"] = df["col1"]/df["col2"]
print(df)
You want to avoid Pandas-style coding and use the Polars Expressions API. Expressions are the heart of Polars and yield the best performance.
Here's how we would code this using Expressions, including using Lazy mode:
(
    df
    .lazy()
    .with_column(
        (pl.col('col1') / pl.col('col2')).alias('result')
    )
    .collect()
)
shape: (5, 3)
┌──────┬──────┬────────┐
│ col1 ┆ col2 ┆ result │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ f64 │
╞══════╪══════╪════════╡
│ 10 ┆ 5 ┆ 2.0 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 20 ┆ 2 ┆ 10.0 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 30 ┆ 10 ┆ 3.0 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 40 ┆ 10 ┆ 4.0 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 50 ┆ 25 ┆ 2.0 │
└──────┴──────┴────────┘
Here's a section of the User Guide that may help transitioning from Pandas-style coding to using Polars Expressions.

Filtering selected columns based on column aggregate

I wish to select only columns with fewer than 3 unique values. I can generate a boolean mask via pl.all().n_unique() < 3, but I don't know if I can use that mask via the polars API for this.
Currently, I am solving it via python. Is there a more idiomatic way?
import polars as pl, pandas as pd
df = pl.DataFrame({"col1":[1,1,2], "col2":[1,2,3], "col3":[3,3,3]})
# target is:
# df_few_unique = pl.DataFrame({"col1":[1,1,2], "col3":[3,3,3]})
# my attempt:
mask = df.select(pl.all().n_unique() < 3).to_numpy()[0]
cols = [col for col, m in zip(df.columns, mask) if m]
df_few_unique = df.select(cols)
df_few_unique
Equivalent in pandas:
df_pandas = df.to_pandas()
mask = (df_pandas.nunique() < 3)
df_pandas.loc[:, mask]
Edit: after some thinking, I discovered an even easier way to do this, one that doesn't rely on boolean masking at all.
pl.select(
    [s for s in df
     if s.n_unique() < 3]
)
shape: (3, 2)
┌──────┬──────┐
│ col1 ┆ col3 │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞══════╪══════╡
│ 1 ┆ 3 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 1 ┆ 3 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2 ┆ 3 │
└──────┴──────┘
Previous answer
One easy way is to use the compress function from Python's itertools.
from itertools import compress
df.select(compress(df.columns, df.select(pl.all().n_unique() < 3).row(0)))
shape: (3, 2)
┌──────┬──────┐
│ col1 ┆ col3 │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞══════╪══════╡
│ 1 ┆ 3 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 1 ┆ 3 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2 ┆ 3 │
└──────┴──────┘
compress allows us to apply a boolean mask to a list, which in this case is a list of column names.
list(compress(df.columns, df.select(pl.all().n_unique() < 3).row(0)))
['col1', 'col3']

Is there a good way to do `zfill` in polars?

Is it proper to use pl.Expr.apply to throw the python function zfill at my data? I'm not looking for a performant solution.
pl.col("column").apply(lambda x: str(x).zfill(5))
Is there a better way to do this?
And to follow up, I'd love to chat on the Discord about what a good implementation could look like, if you have some insight (assuming one doesn't currently exist).
Edit: Polars 0.13.43 and later
With version 0.13.43 and later, Polars has a str.zfill expression to accomplish this. It will be faster than the answer below and should be preferred.
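For example, a minimal sketch of the built-in expression (assuming Polars >= 0.13.43; the cast mirrors the integer input used below):
import polars as pl

df = pl.DataFrame({"num": [-10, 0, 1, 100, None]})

# cast to string, then left-pad with zeros to a width of 5
df.with_column(
    pl.col("num").cast(pl.Utf8).str.zfill(5).alias("result")
)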
Prior to Polars 0.13.43
From your question, I'm assuming that you are starting with a column of integers.
lambda x: str(x).zfill(5)
If so, here's one that adheres to pandas rather strictly:
import polars as pl
df = pl.DataFrame({"num": [-10, -1, 0, 1, 10, 100, 1000, 10000, 100000, 1000000, None]})
z = 5
df.with_column(
    pl.when(pl.col("num").cast(pl.Utf8).str.lengths() > z)
    .then(pl.col("num").cast(pl.Utf8))
    .otherwise(pl.concat_str([pl.lit("0" * z), pl.col("num").cast(pl.Utf8)]).str.slice(-z))
    .alias("result")
)
shape: (11, 2)
┌─────────┬─────────┐
│ num ┆ result │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════════╪═════════╡
│ -10 ┆ 00-10 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ -1 ┆ 000-1 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 0 ┆ 00000 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 00001 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 10 ┆ 00010 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 100 ┆ 00100 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 1000 ┆ 01000 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 10000 ┆ 10000 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 100000 ┆ 100000 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 1000000 ┆ 1000000 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ null ┆ null │
└─────────┴─────────┘
Comparing the output to pandas:
df.with_column(pl.col('num').cast(pl.Utf8)).get_column('num').to_pandas().str.zfill(z)
0 00-10
1 000-1
2 00000
3 00001
4 00010
5 00100
6 01000
7 10000
8 100000
9 1000000
10 None
dtype: object
If you are starting with strings, then you can simplify the code by getting rid of any calls to cast.
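For instance, a sketch of the same expression on a column that already holds strings (the df_str name and its data here are made up for illustration):
import polars as pl

df_str = pl.DataFrame({"num": ["-10", "0", "1", "10000", None]})
z = 5

df_str.with_column(
    pl.when(pl.col("num").str.lengths() > z)
    .then(pl.col("num"))
    .otherwise(pl.concat_str([pl.lit("0" * z), pl.col("num")]).str.slice(-z))
    .alias("result")
)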
Edit: On a dataset with 550 million records, this took about 50 seconds on my machine. (Note: this runs single-threaded)
Edit2: To shave off some time, you can use the following:
result = df.lazy().with_column(
    pl.col('num').cast(pl.Utf8).alias('tmp')
).with_column(
    pl.when(pl.col("tmp").str.lengths() > z)
    .then(pl.col("tmp"))
    .otherwise(pl.concat_str([pl.lit("0" * z), pl.col("tmp")]).str.slice(-z))
    .alias("result")
).drop('tmp').collect()
but it didn't save that much time.

How can I call a numpy ufunc with two positional arguments in polars?

I would like to call a numpy universal function (ufunc) that has two positional arguments in polars.
df.with_column(
    numpy.left_shift(pl.col('col1'), 8)
)
The above attempt results in the following error message:
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "/usr/local/lib/python3.8/dist-packages/polars/internals/expr.py", line 181, in __array_ufunc__
out_type = ufunc(np.array([1])).dtype
TypeError: left_shift() takes from 2 to 3 positional arguments but 1 were given
There are other ways to perform this computation, e.g.,
df['col1'] = numpy.left_shift(df['col1'], 8)
... but I'm trying to use this with a polars.LazyFrame.
I'm using polars 0.13.13 and Python 3.8.
Edit: as of Polars 0.13.19, the apply method converts Numpy datatypes to Polars datatypes without requiring the Numpy item method.
When you need to pass only one column from Polars to the ufunc (as in your example), the easiest method is to use the apply function on the particular column.
import numpy as np
import polars as pl
df = pl.DataFrame({"col1": [2, 4, 8, 16]}).lazy()
df.with_column(
    pl.col("col1").apply(lambda x: np.left_shift(x, 8).item()).alias("result")
).collect()
shape: (4, 2)
┌──────┬────────┐
│ col1 ┆ result │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞══════╪════════╡
│ 2 ┆ 512 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 4 ┆ 1024 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 8 ┆ 2048 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 16 ┆ 4096 │
└──────┴────────┘
If you need to pass multiple columns from Polars to the ufunc, then use the struct expression with apply.
df = pl.DataFrame({"col1": [2, 4, 8, 16], "shift": [1, 1, 2, 2]}).lazy()
df.with_column(
    pl.struct(["col1", "shift"])
    .apply(lambda cols: np.left_shift(cols["col1"], cols["shift"]).item())
    .alias("result")
).collect()
shape: (4, 3)
┌──────┬───────┬────────┐
│ col1 ┆ shift ┆ result │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞══════╪═══════╪════════╡
│ 2 ┆ 1 ┆ 4 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 4 ┆ 1 ┆ 8 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 8 ┆ 2 ┆ 32 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 16 ┆ 2 ┆ 64 │
└──────┴───────┴────────┘
One note: the use of the numpy item method may no longer be needed in future releases of Polars. (Presently, the apply method does not always automatically translate between numpy dtypes and Polars dtypes.)
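As a rough sketch of what that looks like once the conversion is automatic (assuming Polars 0.13.19 or later, per the Edit above), the .item() call can simply be dropped:
import numpy as np
import polars as pl

df = pl.DataFrame({"col1": [2, 4, 8, 16], "shift": [1, 1, 2, 2]}).lazy()

# same struct + apply pattern, without the numpy .item() call
df.with_column(
    pl.struct(["col1", "shift"])
    .apply(lambda cols: np.left_shift(cols["col1"], cols["shift"]))
    .alias("result")
).collect()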
Does this help?