Finding first index of where value in column B is greater than a value in column A - python-polars

I'd like to know the first occurrence (index) at which a value in column B is greater than the value in column A. Currently I use a for loop (and it's super slow), but I'd imagine it's possible to do this with a rolling window.
df = polars.DataFrame({"idx": [i for i in range(5)], "col_a": [1,2,3,4,4], "col_b": [1,1,5,5,3]})
# apply some window function?
# result of first indices where a value in column B is greater than the value in column A
result = polars.Series([2,2,2,3,None])
I'm still trying to understand Polars' concept of windows, but I imagine the pseudocode would look something like this:
for a given window length, compare the values in both columns and use arg_max() to get the index of the first match
if no index is found (e.g. the value is None or 0), increase the window length and make a second pass
keep making passes until some max window_len
Current for loop implementation:
df = polars.DataFrame({"col_a": [1,2,3,4,4], "col_b": [1,1,5,5,3]})
for i in range(0, df.shape[0]):
    # `arg_max()` returns 0 when there's no such index or if the index is actually 0
    series = (df.select("col_a")[i,0] < df.select("col_b")[i:])[:,0]
    idx_found = True in series
    if idx_found:
        print(i + series.arg_max())
    else:
        print("None")
# output:
2
2
2
3
None
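To see the ambiguity mentioned in the comment above, a quick sketch:
polars.Series([False, False]).arg_max()  # 0: there is no True at all
polars.Series([True, False]).arg_max()   # 0: the first True really is at index 0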
Edit 1:
This almost solves the problem, but we still don't know whether arg_max found an actual True value or found no index at all, since it returns 0 in both cases.
One idea is that we're never satisfied with the answer 0 and make a second pass over all values where the result was 0, but with a longer window.
df_res = df.groupby_dynamic("idx", every="1i", period="5i").agg(
    [
        (polars.col("col_a").head(1) < polars.col("col_b")).arg_max().alias("res")
    ]
)
# arg_max() is relative to each window, so the window's starting index still has
# to be added to "res" to turn it into an absolute index:
df_res.select(polars.col("idx") + polars.col("res"))
Edit 2:
This is the final solution: the first pass is made from the code in Edit 1. The following passes (with increasingly wider windows/periods) can be made with:
increase_window_size = "10i"
df_res.groupby_dynamic("idx", every="1i", period=increase_window_size).agg(
    [
        (polars.col("col_a").head(1) < polars.col("col_b")).filter(polars.col("res").head(1) == 0).arg_max().alias("res")
    ]
)

Starting from...
df=pl.DataFrame({"idx": [i for i in range(5)], "col_a": [1,2,3,4,4], "col_b": [1,1,5,5,3]})
For each row, you want the minimum idx among the current and subsequent rows whose col_b is greater than the current row's col_a.
The first step is to add two columns that contain the entire col_b and idx columns as lists on every row, and then explode those into a much longer DataFrame.
df.with_columns([
pl.col('col_b').list(),
pl.col('idx').list().alias('b_indx')]) \
.explode(['col_b','b_indx'])
From here, we want to apply a filter so we're only keeping rows where the b_indx is at least as big as the idx AND the col_a is less than col_b
df.with_columns([
pl.col('col_b').list(),
pl.col('idx').list().alias('b_indx')]) \
.explode(['col_b','b_indx']) \
.filter((pl.col('b_indx')>=pl.col('idx')) & (pl.col('col_a')<pl.col('col_b')))
There are a couple ways you could clean that up, one is to groupby+agg+sort
df.with_columns([
pl.col('col_b').list(),
pl.col('idx').list().alias('b_indx')]) \
.explode(['col_b','b_indx']) \
.filter((pl.col('b_indx')>=pl.col('idx')) & (pl.col('col_a')<pl.col('col_b'))) \
.groupby(['idx']).agg([pl.col('b_indx').min()]).sort('idx')
The other way is to just do unique by idx
df.with_columns([
pl.col('col_b').list(),
pl.col('idx').list().alias('b_indx')]) \
.explode(['col_b','b_indx']) \
.filter((pl.col('b_indx')>=pl.col('idx')) & (pl.col('col_a')<pl.col('col_b'))) \
.unique(subset='idx')
Lastly, to get the null values back you have to join it back to the original df. To keep with the theme of adding to the end of the chain we'd want a right join but right joins aren't an option so we have to put the join back at the beginning.
df.join(
df.with_columns([
pl.col('col_b').list(),
pl.col('idx').list().alias('b_indx')]) \
.explode(['col_b','b_indx']) \
.filter((pl.col('b_indx')>=pl.col('idx')) & (pl.col('col_a')<pl.col('col_b'))) \
.unique(subset='idx'),
on='idx',how='left').get_column('b_indx')
shape: (5,)
Series: 'b_indx' [i64]
[
2
2
2
3
null
]
Note:
I was curious about the performance difference between my approach and jqurious's approach, so I did
df=pl.DataFrame({'col_a':np.random.randint(1,10,10000), 'col_b':np.random.randint(1,10,10000)}).with_row_count('idx')
then ran each code chunk. Mine took 1.7s while jqurious's took just 0.7s BUT his answer isn't correct...
For instance...
df.join(
df.with_columns([
pl.col('col_b').list(),
pl.col('idx').list().alias('b_indx')]) \
.explode(['col_b','b_indx']) \
.filter((pl.col('b_indx')>=pl.col('idx')) & (pl.col('col_a')<pl.col('col_b'))) \
.unique(subset='idx'),
on='idx',how='left').select(['idx','col_a','col_b',pl.col('b_indx').alias('result')]).head(5)
yields...
shape: (5, 4)
┌─────┬───────┬───────┬────────┐
│ idx ┆ col_a ┆ col_b ┆ result │
│ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ i64 ┆ i64 ┆ u32 │
╞═════╪═══════╪═══════╪════════╡
│ 0 ┆ 4 ┆ 1 ┆ 2 │ 4<5 at indx2
│ 1 ┆ 1 ┆ 4 ┆ 1 │ 1<4 at indx1
│ 2 ┆ 3 ┆ 5 ┆ 2 │ 3<5 at indx2
│ 3 ┆ 4 ┆ 2 ┆ 5 │ off the page
│ 4 ┆ 5 ┆ 4 ┆ 5 │ off the page
└─────┴───────┴───────┴────────┘
whereas
df.with_columns(
    pl.when(pl.col("col_a") < pl.col("col_b"))
    .then(1)
    .cumsum()
    .backward_fill()
    .alias("result") + 1
).head(5)
yields
shape: (5, 4)
┌─────┬───────┬───────┬────────┐
│ idx ┆ col_a ┆ col_b ┆ result │
│ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ i64 ┆ i64 ┆ i32 │
╞═════╪═══════╪═══════╪════════╡
│ 0 ┆ 4 ┆ 1 ┆ 2 │ 4<5 at indx2
│ 1 ┆ 1 ┆ 4 ┆ 2 │ not right
│ 2 ┆ 3 ┆ 5 ┆ 3 │ not right
│ 3 ┆ 4 ┆ 2 ┆ 4 │ off the page
│ 4 ┆ 5 ┆ 4 ┆ 4 │ off the page
└─────┴───────┴───────┴────────┘
Performance
This scales pretty terribly: bumping the df from 10,000 rows to 100,000 made my kernel crash. Going from 10,000 to 20,000 made it take 5.7s, which makes sense since we're squaring the size of the df. To mitigate this, you can do overlapping chunks.
First let's make a function
def idx_finder(df):
    return (df.join(
        df.with_columns([
            pl.col('col_b').list(),
            pl.col('idx').list().alias('b_indx')])
        .explode(['col_b','b_indx'])
        .filter((pl.col('b_indx')>=pl.col('idx')) & (pl.col('col_a')<pl.col('col_b')))
        .unique(subset='idx'),
        on='idx', how='left').select(['idx','col_a','col_b',pl.col('b_indx').alias('result')]))
Let's get some summary stats:
print(df.groupby('col_a').agg(pl.col('col_b').max()).sort('col_a'))
shape: (9, 2)
┌───────┬───────┐
│ col_a ┆ col_b │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═══════╪═══════╡
│ 1 ┆ 9 │
│ 2 ┆ 9 │
│ 3 ┆ 9 │
│ 4 ┆ 9 │
│ ... ┆ ... │
│ 6 ┆ 9 │
│ 7 ┆ 9 │
│ 8 ┆ 9 │
│ 9 ┆ 9 │
└───────┴───────┘
This tells us that the biggest value of col_b is 9 for every value of col_a, which means that whenever the result is null and col_a is 9 it's a true null (no later row can ever satisfy col_a < col_b); a null for a row with col_a < 9 might just mean the match lies beyond the end of the chunk.
With that, we do
chunks=[]
chunks.append(idx_finder(df[0:10000])) # arbitrarily picking 10,000 per chunk
Then take a look at
chunks[-1].filter((pl.col('result').is_null()) & (pl.col('col_a')<9))
shape: (2, 4)
┌──────┬───────┬───────┬────────┐
│ idx ┆ col_a ┆ col_b ┆ result │
│ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ i64 ┆ i64 ┆ u32 │
╞══════╪═══════╪═══════╪════════╡
│ 9993 ┆ 8 ┆ 6 ┆ null │
│ 9999 ┆ 3 ┆ 1 ┆ null │
└──────┴───────┴───────┴────────┘
Let's cut off this chunk at idx=9992 and then start the next chunk at 9993
curindx=chunks[-1].filter((pl.col('result').is_null()) & (pl.col('col_a')<9))[0,0]
chunks[-1]=chunks[-1].filter(pl.col('idx')<curindx)
With that we can reformulate this logic into a while loop
curindx=0
chunks=[]
while curindx <= df.shape[0]:
    print(curindx)
    chunks.append(idx_finder(df[curindx:(curindx+10000)]))
    curchunkfilt = chunks[-1].filter((pl.col('result').is_null()) & (pl.col('col_a')<9))
    if curchunkfilt.shape[0] == 0:
        curindx += 10000
    elif curchunkfilt[0,0] > curindx:
        curindx = curchunkfilt[0,0]
    else:
        print("curindx not advancing")
        break
    chunks[-1] = chunks[-1].filter(pl.col('idx')<curindx)
Finally just
pl.concat(chunks)
As long as we're looping, here's another approach.
If the gaps between the A/B matches are small then this will end up being fast, since it scales with the gap size rather than with the size of the df. It just uses shift.
df=df.with_columns(pl.lit(None).alias('result'))
y=0
while True:
    print(y)
    maxB = df.filter(pl.col('result').is_null()).select(pl.col('col_b').max())[0,0]
    df = df.with_columns((
        pl.when(
            (pl.col('result').is_null()) & (pl.col('col_a') < pl.col('col_b').shift(-y))
        ).then(pl.col('idx').shift(-y)).otherwise(pl.col('result'))).alias('result'))
    y += 1
    if df.filter((pl.col('result').is_null()) & (pl.col('col_a') < maxB) & ~(pl.col('col_b').shift(-y).is_null())).shape[0] == 0:
        break
With my random data of 1.2m rows it only took 2.6s with a max row offset of 86. If, in your real data, the gaps are on the order of, let's just say, 100,000 then it'd be close to an hour.

Related

python-polars create new column by dividing by two existing columns

In pandas, the following creates a new column in a dataframe by dividing one existing column by another. How do I do this in Polars? Bonus if done in the fastest way using polars.LazyFrame
df = pd.DataFrame({"col1":[10,20,30,40,50], "col2":[5,2,10,10,25]})
df["ans"] = df["col1"]/df["col2"]
print(df)
You want to avoid pandas-style coding and use the Polars Expressions API. Expressions are the heart of Polars and yield the best performance.
Here's how we would code this using Expressions, including using Lazy mode:
(
    df
    .lazy()
    .with_column(
        (pl.col('col1') / pl.col('col2')).alias('result')
    )
    .collect()
)
shape: (5, 3)
┌──────┬──────┬────────┐
│ col1 ┆ col2 ┆ result │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ f64 │
╞══════╪══════╪════════╡
│ 10 ┆ 5 ┆ 2.0 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 20 ┆ 2 ┆ 10.0 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 30 ┆ 10 ┆ 3.0 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 40 ┆ 10 ┆ 4.0 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 50 ┆ 25 ┆ 2.0 │
└──────┴──────┴────────┘
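If you don't need lazy mode, the same expression also works eagerly (a small sketch):
df.with_column(
    (pl.col('col1') / pl.col('col2')).alias('result')
)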
Here's a section of the User Guide that may help transitioning from Pandas-style coding to using Polars Expressions.

Python-Polars: How to filter categorical column with string list

I have a Polars dataframe like below:
df_cat = pl.DataFrame(
    [
        pl.Series("a_cat", ["c", "a", "b", "c", "b"], dtype=pl.Categorical),
        pl.Series("b_cat", ["F", "G", "E", "G", "G"], dtype=pl.Categorical)
    ])
print(df_cat)
shape: (5, 2)
┌───────┬───────┐
│ a_cat ┆ b_cat │
│ --- ┆ --- │
│ cat ┆ cat │
╞═══════╪═══════╡
│ c ┆ F │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ a ┆ G │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b ┆ E │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ c ┆ G │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b ┆ G │
└───────┴───────┘
The following filter runs perfectly fine:
print(df_cat.filter(pl.col('a_cat') == 'c'))
shape: (2, 2)
┌───────┬───────┐
│ a_cat ┆ b_cat │
│ --- ┆ --- │
│ cat ┆ cat │
╞═══════╪═══════╡
│ c ┆ F │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ c ┆ G │
└───────┴───────┘
What I want is to use a list of strings to run the filter more efficiently. So I tried the following and ended up with this error message:
print(df_cat.filter(pl.col('a_cat').is_in(['a', 'c'])))
---------------------------------------------------------------------------
ComputeError Traceback (most recent call last)
d:\GitRepo\Test2\stockEMD3.ipynb Cell 9 in <cell line: 1>()
----> 1 print(df_cat.filter(pl.col('a_cat').is_in(['c'])))
File c:\ProgramData\Anaconda3\envs\charm3.9\lib\site-packages\polars\internals\dataframe\frame.py:2185, in DataFrame.filter(self, predicate)
2181 if _NUMPY_AVAILABLE and isinstance(predicate, np.ndarray):
2182 predicate = pli.Series(predicate)
2184 return (
-> 2185 self.lazy()
2186 .filter(predicate) # type: ignore[arg-type]
2187 .collect(no_optimization=True, string_cache=False)
2188 )
File c:\ProgramData\Anaconda3\envs\charm3.9\lib\site-packages\polars\internals\lazyframe\frame.py:660, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, string_cache, no_optimization, slice_pushdown)
650 projection_pushdown = False
652 ldf = self._ldf.optimization_toggle(
653 type_coercion,
654 predicate_pushdown,
(...)
658 slice_pushdown,
659 )
--> 660 return pli.wrap_df(ldf.collect())
ComputeError: joins/or comparisons on categorical dtypes can only happen if they are created under the same global string cache
From this Stack Overflow link I understand that "You need to set a global string cache to compare categoricals created in different columns/lists.", but my questions are:
Why does the == filter with one single string work?
What is the proper way to filter a categorical column with a list of strings?
Thanks!
Actually, you don't need to set a global string cache to compare strings to Categorical variables. You can use cast to accomplish this.
Let's use this data. I've included the integer values that underlie the Categorical variables to demonstrate something later.
import polars as pl
df_cat = (
    pl.DataFrame(
        [
            pl.Series("a_cat", ["c", "a", "b", "c", "X"], dtype=pl.Categorical),
            pl.Series("b_cat", ["F", "G", "E", "S", "X"], dtype=pl.Categorical),
        ]
    )
    .with_column(
        pl.all().to_physical().suffix('_phys')
    )
)
df_cat
shape: (5, 4)
┌───────┬───────┬────────────┬────────────┐
│ a_cat ┆ b_cat ┆ a_cat_phys ┆ b_cat_phys │
│ --- ┆ --- ┆ --- ┆ --- │
│ cat ┆ cat ┆ u32 ┆ u32 │
╞═══════╪═══════╪════════════╪════════════╡
│ c ┆ F ┆ 0 ┆ 0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ a ┆ G ┆ 1 ┆ 1 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ b ┆ E ┆ 2 ┆ 2 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ c ┆ S ┆ 0 ┆ 3 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ X ┆ X ┆ 3 ┆ 4 │
└───────┴───────┴────────────┴────────────┘
Comparing a categorical variable to a string
If we cast a Categorical variable back to its string values, we can make any comparison we need. For example:
df_cat.filter(pl.col('a_cat').cast(pl.Utf8).is_in(['a', 'c']))
shape: (3, 4)
┌───────┬───────┬────────────┬────────────┐
│ a_cat ┆ b_cat ┆ a_cat_phys ┆ b_cat_phys │
│ --- ┆ --- ┆ --- ┆ --- │
│ cat ┆ cat ┆ u32 ┆ u32 │
╞═══════╪═══════╪════════════╪════════════╡
│ c ┆ F ┆ 0 ┆ 0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ a ┆ G ┆ 1 ┆ 1 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ c ┆ S ┆ 0 ┆ 3 │
└───────┴───────┴────────────┴────────────┘
Or in a filter step comparing the string values of two Categorical variables that do not share the same string cache.
df_cat.filter(pl.col('a_cat').cast(pl.Utf8) == pl.col('b_cat').cast(pl.Utf8))
shape: (1, 4)
┌───────┬───────┬────────────┬────────────┐
│ a_cat ┆ b_cat ┆ a_cat_phys ┆ b_cat_phys │
│ --- ┆ --- ┆ --- ┆ --- │
│ cat ┆ cat ┆ u32 ┆ u32 │
╞═══════╪═══════╪════════════╪════════════╡
│ X ┆ X ┆ 3 ┆ 4 │
└───────┴───────┴────────────┴────────────┘
Notice that it is the string values being compared (not the integers underlying the two Categorical variables).
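For contrast, here is a quick sketch (using the same frame) of what comparing the underlying physical codes would do instead:
df_cat.filter(pl.col('a_cat').to_physical() == pl.col('b_cat').to_physical())
# keeps the first three rows, where the codes happen to coincide (0/0, 1/1, 2/2),
# and drops the "X"/"X" row whose strings are equal but whose codes are 3 and 4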
The equality operator on Categorical variables
The following statements are equivalent:
df_cat.filter((pl.col('a_cat') == 'a'))
df_cat.filter((pl.col('a_cat').cast(pl.Utf8) == 'a'))
The former is syntactic sugar for the latter, as the former is a common use case.
As the error states: ComputeError: joins/or comparisons on categorical dtypes can only happen if they are created under the same global string cache.
Comparisons of categorical values are only allowed under a global string cache. You really want to set this in such a case, as it speeds up comparisons and prevents expensive casts to strings.
Setting this at the start of your query will ensure it runs:
import polars as pl
pl.Config.set_global_string_cache()
This is a new answer based on the one from ritchie46 above.
As of Polars 0.15.15 it is now:
import polars as pl
pl.toggle_string_cache(True)
A StringCache() context manager can also be used; see the Polars documentation:
with pl.StringCache():
    print(df_cat.filter(pl.col('a_cat').is_in(['a', 'c'])))

Filtering selected columns based on column aggregate

I wish to select only columns with fewer than 3 unique values. I can generate a boolean mask via pl.all().n_unique() < 3, but I don't know if I can use that mask via the polars API for this.
Currently, I am solving it via python. Is there a more idiomatic way?
import polars as pl, pandas as pd
df = pl.DataFrame({"col1":[1,1,2], "col2":[1,2,3], "col3":[3,3,3]})
# target is:
# df_few_unique = pl.DataFrame({"col1":[1,1,2], "col3":[3,3,3]})
# my attempt:
mask = df.select(pl.all().n_unique() < 3).to_numpy()[0]
cols = [col for col, m in zip(df.columns, mask) if m]
df_few_unique = df.select(cols)
df_few_unique
Equivalent in pandas:
df_pandas = df.to_pandas()
mask = (df_pandas.nunique() < 3)
df_pandas.loc[:, mask]
Edit: after some thinking, I discovered an even easier way to do this, one that doesn't rely on boolean masking at all.
pl.select(
    [s for s in df if s.n_unique() < 3]
)
shape: (3, 2)
┌──────┬──────┐
│ col1 ┆ col3 │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞══════╪══════╡
│ 1 ┆ 3 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 1 ┆ 3 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2 ┆ 3 │
└──────┴──────┘
Previous answer
One easy way is to use the compress function from Python's itertools.
from itertools import compress
df.select(compress(df.columns, df.select(pl.all().n_unique() < 3).row(0)))
>>> df.select(compress(df.columns, df.select(pl.all().n_unique() < 3).row(0)))
shape: (3, 2)
┌──────┬──────┐
│ col1 ┆ col3 │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞══════╪══════╡
│ 1 ┆ 3 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 1 ┆ 3 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2 ┆ 3 │
└──────┴──────┘
compress allows us to apply a boolean mask to a list, which in this case is a list of column names.
list(compress(df.columns, df.select(pl.all().n_unique() < 3).row(0)))
>>> list(compress(df.columns, df.select(pl.all().n_unique() < 3).row(0)))
['col1', 'col3']

Best way to get percentage counts in Polars

I frequently need to calculate the percentage counts of a variable. For example for the dataframe below
df = pl.DataFrame({"person": ["a", "a", "b"],
"value": [1, 2, 3]})
I want to return a dataframe like this:
person  percent
a       0.667
b       0.333
What I have been doing is the following, but I can't help but think there must be a more efficient / polars way to do this
n_rows = len(df)
(
    df
    .with_column(pl.lit(1).alias('percent'))
    .groupby('person')
    .agg([pl.sum('percent') / n_rows])
)
polars.count will help here. When called without arguments, polars.count returns the number of rows in a particular context.
(
    df
    .groupby("person")
    .agg([pl.count().alias("count")])
    .with_column((pl.col("count") / pl.sum("count")).alias("percent_count"))
)
shape: (2, 3)
┌────────┬───────┬───────────────┐
│ person ┆ count ┆ percent_count │
│ --- ┆ --- ┆ --- │
│ str ┆ u32 ┆ f64 │
╞════════╪═══════╪═══════════════╡
│ a ┆ 2 ┆ 0.666667 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ b ┆ 1 ┆ 0.333333 │
└────────┴───────┴───────────────┘
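If you only want the person and percent columns from the question, a small variation (a sketch) of the same idea is:
(
    df
    .groupby("person")
    .agg([pl.count().alias("count")])
    .select([
        pl.col("person"),
        (pl.col("count") / pl.sum("count")).alias("percent"),
    ])
)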

Is there a good way to do `zfill` in polars?

Is it proper to use pl.Expr.apply to throw the python function zfill at my data? I'm not looking for a performant solution.
pl.col("column").apply(lambda x: str(x).zfill(5))
Is there a better way to do this?
And to follow up I'd love to chat about what a good implementation could look like in the discord if you have some insight (assuming one doesn't currently exist).
Edit: Polars 0.13.43 and later
With version 0.13.43 and later, Polars has a str.zfill expression to accomplish this. str.zfill will be faster than the answer below and thus str.zfill should be preferred.
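A minimal sketch of what that might look like (assuming an integer column named num, as in the example below; the cast isn't needed if the column is already a string):
import polars as pl
df = pl.DataFrame({"num": [1, 22, 333, None]})
df.with_column(pl.col("num").cast(pl.Utf8).str.zfill(5).alias("result"))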
Prior to Polars 0.13.43
From your question, I'm assuming that you are starting with a column of integers.
lambda x: str(x).zfill(5)
If so, here's one that adheres to pandas rather strictly:
import polars as pl
df = pl.DataFrame({"num": [-10, -1, 0, 1, 10, 100, 1000, 10000, 100000, 1000000, None]})
z = 5
df.with_column(
    pl.when(pl.col("num").cast(pl.Utf8).str.lengths() > z)
    .then(pl.col("num").cast(pl.Utf8))
    .otherwise(pl.concat_str([pl.lit("0" * z), pl.col("num").cast(pl.Utf8)]).str.slice(-z))
    .alias("result")
)
shape: (11, 2)
┌─────────┬─────────┐
│ num ┆ result │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════════╪═════════╡
│ -10 ┆ 00-10 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ -1 ┆ 000-1 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 0 ┆ 00000 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 00001 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 10 ┆ 00010 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 100 ┆ 00100 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 1000 ┆ 01000 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 10000 ┆ 10000 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 100000 ┆ 100000 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 1000000 ┆ 1000000 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ null ┆ null │
└─────────┴─────────┘
Comparing the output to pandas:
df.with_column(pl.col('num').cast(pl.Utf8)).get_column('num').to_pandas().str.zfill(z)
0 00-10
1 000-1
2 00000
3 00001
4 00010
5 00100
6 01000
7 10000
8 100000
9 1000000
10 None
dtype: object
If you are starting with strings, then you can simplify the code by getting rid of any calls to cast.
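For example, a sketch of the same expression for a hypothetical Utf8 column named num (with the same z = 5):
df_str = pl.DataFrame({"num": ["-10", "1", "100000"]})
z = 5
df_str.with_column(
    pl.when(pl.col("num").str.lengths() > z)
    .then(pl.col("num"))
    .otherwise(pl.concat_str([pl.lit("0" * z), pl.col("num")]).str.slice(-z))
    .alias("result")
)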
Edit: On a dataset with 550 million records, this took about 50 seconds on my machine. (Note: this runs single-threaded)
Edit2: To shave off some time, you can use the following:
result = df.lazy().with_column(
    pl.col('num').cast(pl.Utf8).alias('tmp')
).with_column(
    pl.when(pl.col("tmp").str.lengths() > z)
    .then(pl.col("tmp"))
    .otherwise(pl.concat_str([pl.lit("0" * z), pl.col("tmp")]).str.slice(-z))
    .alias("result")
).drop('tmp').collect()
but it didn't save that much time.