How to filter record "sequences" from a Polars dataframe using multiple threads? - python-polars

I have a data set with multiple records on each individual - one record for each time period.
Where an individual is missing a record for a time period, I need to remove any later records for that individual.
So given an example dataset like this:
import polars as pl
df = pl.DataFrame({'Id': [1,1,2,2,2,2,3,3,4,4,4,5,5,5,6,6,6],
                   'Age': [1,4,1,2,3,4,1,2,1,2,3,1,2,4,2,3,4],
                   'Value': [1,142,4,73,109,145,6,72,-8,67,102,-1,72,150,72,111,149]})
df
Id Age Value
i64 i64 i64
1 1 1
1 4 142
2 1 4
2 2 73
2 3 109
2 4 145
3 1 6
3 2 72
4 1 -8
4 2 67
4 3 102
5 1 -1
5 2 72
5 4 150
6 2 72
6 3 111
6 4 149
I need to filter it as follows:
Id Age Value Keep
i64 i64 i64 bool
1 1 1 true
2 1 4 true
2 2 73 true
2 3 109 true
2 4 145 true
3 1 6 true
3 2 72 true
4 1 -8 true
4 2 67 true
4 3 102 true
5 1 -1 true
5 2 72 true
So an individual with an age record profile of 1,3,4 would end up with only the 1 record. An individual like Id 6 with an age record profile of 2,3,4 would end up with no records after filtering.
I can achieve this using the approach below. However, when the data set contains millions of individuals, the code does not appear to run in parallel and performance is very slow: the steps prior to the final filter expression complete in ~22 seconds on a data set with 16.5 million records, while the last filter expression takes another 12.5 minutes. Is there an alternative approach that is not single-threaded, or an adjustment to the code below that would achieve that?
df2 = (
    df.sort(by=["Id", "Age"])
    .with_column(
        (pl.col("Age").diff(1).fill_null(pl.col("Age") == 1) == 1)
        .over("Id")
        .alias("Keep")
    )
    .filter(
        (pl.col("Keep").cumprod() == 1).over("Id")
    )
)

I propose the following (revised) code:
df2 = df.filter(pl.col('Age').rank().over('Id') == pl.col('Age'))
This code yields the following result on your test dataset:
shape: (12, 3)
┌─────┬─────┬───────┐
│ Id ┆ Age ┆ Value │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═══════╡
│ 1 ┆ 1 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2 ┆ 1 ┆ 4 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2 ┆ 2 ┆ 73 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2 ┆ 3 ┆ 109 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2 ┆ 4 ┆ 145 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3 ┆ 1 ┆ 6 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3 ┆ 2 ┆ 72 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 4 ┆ 1 ┆ -8 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 4 ┆ 2 ┆ 67 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 4 ┆ 3 ┆ 102 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 5 ┆ 1 ┆ -1 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 5 ┆ 2 ┆ 72 │
└─────┴─────┴───────┘
Basically, when an Age is skipped (for a particular Id), the rank of the Age falls out of step with the Age variable itself, and remains out of step for all higher Age values for that Id.
This code has several advantages over my prior answer. It is more concise, it's far easier to follow, and best of all ... it makes excellent use of the Polars API, particularly the over window function.
Even if this code benchmarks slightly slower in the upcoming Polars release, I recommend it for the reasons above.
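To make the mechanism visible, here is a small sketch (purely illustrative, not part of the benchmarked code) that materializes the per-Id rank next to Age for the two Ids that lose records:
(
    df.with_columns([
        pl.col("Age").rank().over("Id").alias("AgeRank")  # illustrative helper column
    ])
    .filter(pl.col("Id").is_in([5, 6]))
)
# Id 5 has ages 1, 2, 4 with ranks 1, 2, 3: the age-4 row falls out of step and is dropped.
# Id 6 has ages 2, 3, 4 with ranks 1, 2, 3: no row matches its rank, so every row is dropped.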
Edit - Benchmarks on Polars 0.13.15
ok, wow wow wow ... I just downloaded the newly released Polars (0.13.15), and re-benchmarked the code on my machine with the 17 million records generated as in my prior answer.
The results?
The revised code listed in the question: 13.6 seconds
The (ugly) code in my prior answer: 4.8 seconds
The one-line code in this answer: 3.3 seconds
And from watching the htop command while the code runs, it's clear that the newly released Polars code utilized all 64 logical cores on my machine. Massively parallel.
Impressive!

Note that window functions are very powerful, but also relatively expensive. So you could already start by doing less work.
df.sort(by=["Id", "Age"]).filter(
    (pl.col("Age").diff(1).fill_null(1) == 1).over("Id")
)
And very likely, you can also ditch the expensive sort:
df.filter(
    (pl.col("Age").diff(1).fill_null(1) == 1).over("Id")
)
Multithreading
The filter operation already uses many forms of parallelism. The materialization of the columns is parallel, and in this case the computation of the mask is parallel as well. A window expression (the over() syntax) is multithreaded in computing the groups as well as in doing the join operation.
Squeezing out maximum performance of a window function
If your data is already sorted, you can often make a window expression faster by explicitly adding a list aggregation and then flattening that result. The list aggregation is free, because we already have a list in aggregation (an implementation detail), and the flatten is often free as well. It's a somewhat complicated implementation detail, but it means that Polars doesn't have to compute the location of every aggregation relative to the original DataFrame.
This only makes sense if the DataFrame is already sorted by the groups.
# note that this only makes sense if the df is sorted by the groups
sorted_df.filter(
    (pl.col("Age").diff(1).fill_null(1) == 1).list().over("Id").flatten()
)

Related

Finding first index of where value in column B is greater than a value in column A

For each row, I'd like to know the first index (at or after that row) where the value in column B is greater than that row's value in column A. Currently I use a for loop (and it's super slow), but I imagine it's possible to do this with a rolling window.
df = polars.DataFrame({"idx": [i for i in range(5)], "col_a": [1,2,3,4,4], "col_b": [1,1,5,5,3]})
# apply some window function?
# result of first indices where a value in column B is greater than the value in column A
result = polars.Series([2,2,2,3,None])
I'm still trying to understand Polars' concept of windows, but I imagine the pseudocode would look something like this:
for a given window length, compare the values in both columns and use arg_min() to get the index
if the resulting index is not found (e.g. value None or 0), increase the window length and make a second pass
keep making passes until some max window_len
Current for loop implementation:
df = polars.DataFrame({"col_a": [1,2,3,4,4], "col_b": [1,1,5,5,3]})
for i in range(0, df.shape[0]):
    # `arg_max()` returns 0 when there's no such index or if the index is actually 0
    series = (df.select("col_a")[i,0] < df.select("col_b")[i:])[:,0]
    idx_found = True in series
    if idx_found:
        print(i + series.arg_max())
    else:
        print("None")
# output:
2
2
2
3
None
Edit 1:
This almost solves the problem, but we still don't know whether arg_max found an actual True value or didn't find an index, since it returns 0 in both cases.
One idea is that we're never satisfied with the answer 0 and make a second scan, with a longer window, for all values where the result was 0.
# "res" is an offset within each window; add the window start back
# (df.select(polars.col("idx")) + ...) to get an absolute index
df_res = df.groupby_dynamic("idx", every="1i", period="5i").agg(
    [
        (polars.col("col_a").head(1) < polars.col("col_b")).arg_max().alias("res")
    ]
)
Edit 2:
This is the final solution: the first pass is made from the code in Edit 1. The following passes (with increasingly wider windows/periods) can be made with:
increase_window_size = "10i"
df_res.groupby_dynamic("idx", every="1i", period=increase_window_size).agg(
    [
        (polars.col("col_a").head(1) < polars.col("col_b"))
        .filter(polars.col("res").head(1) == 0)
        .arg_max()
        .alias("res")
    ]
)
Starting from...
df=pl.DataFrame({"idx": [i for i in range(5)], "col_a": [1,2,3,4,4], "col_b": [1,1,5,5,3]})
For each row, you want the min idx (at or after that row) where the current row's col_a is less than that row's col_b.
The first step is to add two columns that contain all the data as lists, and then explode those into a much longer DataFrame.
df.with_columns([
    pl.col('col_b').list(),
    pl.col('idx').list().alias('b_indx')]) \
    .explode(['col_b', 'b_indx'])
From here, we want to apply a filter so we're only keeping rows where b_indx is at least as big as idx AND col_a is less than col_b:
df.with_columns([
    pl.col('col_b').list(),
    pl.col('idx').list().alias('b_indx')]) \
    .explode(['col_b', 'b_indx']) \
    .filter((pl.col('b_indx') >= pl.col('idx')) & (pl.col('col_a') < pl.col('col_b')))
There are a couple of ways you could clean that up; one is groupby + agg + sort:
df.with_columns([
    pl.col('col_b').list(),
    pl.col('idx').list().alias('b_indx')]) \
    .explode(['col_b', 'b_indx']) \
    .filter((pl.col('b_indx') >= pl.col('idx')) & (pl.col('col_a') < pl.col('col_b'))) \
    .groupby(['idx']).agg([pl.col('b_indx').min()]).sort('idx')
The other way is to just do unique by idx
df.with_columns([
    pl.col('col_b').list(),
    pl.col('idx').list().alias('b_indx')]) \
    .explode(['col_b', 'b_indx']) \
    .filter((pl.col('b_indx') >= pl.col('idx')) & (pl.col('col_a') < pl.col('col_b'))) \
    .unique(subset='idx')
Lastly, to get the null values back, you have to join the result back to the original df. To keep with the theme of adding to the end of the chain we'd want a right join, but right joins aren't an option, so we have to put the join back at the beginning.
df.join(
    df.with_columns([
        pl.col('col_b').list(),
        pl.col('idx').list().alias('b_indx')])
    .explode(['col_b', 'b_indx'])
    .filter((pl.col('b_indx') >= pl.col('idx')) & (pl.col('col_a') < pl.col('col_b')))
    .unique(subset='idx'),
    on='idx', how='left').get_column('b_indx')
shape: (5,)
Series: 'b_indx' [i64]
[
2
2
2
3
null
]
Note:
I was curious about the performance difference between my approach and jqurious's approach, so I did
import numpy as np
df = pl.DataFrame({'col_a': np.random.randint(1, 10, 10000), 'col_b': np.random.randint(1, 10, 10000)}).with_row_count('idx')
and then ran each code chunk. Mine took 1.7s while jqurious's took just 0.7s, BUT his answer isn't correct...
For instance...
df.join(
    df.with_columns([
        pl.col('col_b').list(),
        pl.col('idx').list().alias('b_indx')])
    .explode(['col_b', 'b_indx'])
    .filter((pl.col('b_indx') >= pl.col('idx')) & (pl.col('col_a') < pl.col('col_b')))
    .unique(subset='idx'),
    on='idx', how='left').select(['idx', 'col_a', 'col_b', pl.col('b_indx').alias('result')]).head(5)
yields...
shape: (5, 4)
┌─────┬───────┬───────┬────────┐
│ idx ┆ col_a ┆ col_b ┆ result │
│ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ i64 ┆ i64 ┆ u32 │
╞═════╪═══════╪═══════╪════════╡
│ 0 ┆ 4 ┆ 1 ┆ 2 │ 4<5 at indx2
│ 1 ┆ 1 ┆ 4 ┆ 1 │ 1<4 at indx1
│ 2 ┆ 3 ┆ 5 ┆ 2 │ 3<5 at indx2
│ 3 ┆ 4 ┆ 2 ┆ 5 │ off the page
│ 4 ┆ 5 ┆ 4 ┆ 5 │ off the page
└─────┴───────┴───────┴────────┘
whereas
df.with_columns(
    pl.when(pl.col("col_a") < pl.col("col_b"))
    .then(1)
    .cumsum()
    .backward_fill()
    .alias("result") + 1
).head(5)
yields
shape: (5, 4)
┌─────┬───────┬───────┬────────┐
│ idx ┆ col_a ┆ col_b ┆ result │
│ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ i64 ┆ i64 ┆ i32 │
╞═════╪═══════╪═══════╪════════╡
│ 0 ┆ 4 ┆ 1 ┆ 2 │ 4<5 at indx2
│ 1 ┆ 1 ┆ 4 ┆ 2 │ not right
│ 2 ┆ 3 ┆ 5 ┆ 3 │ not right
│ 3 ┆ 4 ┆ 2 ┆ 4 │ off the page
│ 4 ┆ 5 ┆ 4 ┆ 4 │ off the page
└─────┴───────┴───────┴────────┘
Performance
This scales pretty terribly: bumping the df from 10,000 rows to 100,000 made my kernel crash, and going from 10,000 to 20,000 made it take 5.7s, which makes sense since the work grows with the square of the df's size. To mitigate this, you can do overlapping chunks.
First let's make a function
def idx_finder(df):
    return (
        df.join(
            df.with_columns([
                pl.col('col_b').list(),
                pl.col('idx').list().alias('b_indx')])
            .explode(['col_b', 'b_indx'])
            .filter((pl.col('b_indx') >= pl.col('idx')) & (pl.col('col_a') < pl.col('col_b')))
            .unique(subset='idx'),
            on='idx', how='left'
        ).select(['idx', 'col_a', 'col_b', pl.col('b_indx').alias('result')])
    )
Let's get some summary stats:
print(df.groupby('col_a').agg(pl.col('col_b').max()).sort('col_a'))
shape: (9, 2)
┌───────┬───────┐
│ col_a ┆ col_b │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═══════╪═══════╡
│ 1 ┆ 9 │
│ 2 ┆ 9 │
│ 3 ┆ 9 │
│ 4 ┆ 9 │
│ ... ┆ ... │
│ 6 ┆ 9 │
│ 7 ┆ 9 │
│ 8 ┆ 9 │
│ 9 ┆ 9 │
└───────┴───────┘
This tells us that for every value of col_a, the biggest value of col_b is 9, which means that whenever the result is null and col_a is 9, it is a true null (no col_b can ever be strictly greater).
With that, we do
chunks=[]
chunks.append(idx_finder(df[0:10000])) # arbitrarily picking 10,000 per chunk
Then take a look at
chunks[-1].filter((pl.col('result').is_null()) & (pl.col('col_a')<9))
shape: (2, 4)
┌──────┬───────┬───────┬────────┐
│ idx ┆ col_a ┆ col_b ┆ result │
│ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ i64 ┆ i64 ┆ u32 │
╞══════╪═══════╪═══════╪════════╡
│ 9993 ┆ 8 ┆ 6 ┆ null │
│ 9999 ┆ 3 ┆ 1 ┆ null │
└──────┴───────┴───────┴────────┘
Let's cut off this chunk at idx=9992 and then start the next chunk at 9993.
curindx=chunks[-1].filter((pl.col('result').is_null()) & (pl.col('col_a')<9))[0,0]
chunks[-1]=chunks[-1].filter(pl.col('idx')<curindx)
With that we can reformulate this logic into a while loop
curindx = 0
chunks = []
while curindx <= df.shape[0]:
    print(curindx)
    chunks.append(idx_finder(df[curindx:(curindx + 10000)]))
    curchunkfilt = chunks[-1].filter((pl.col('result').is_null()) & (pl.col('col_a') < 9))
    if curchunkfilt.shape[0] == 0:
        curindx += 10000
    elif curchunkfilt[0,0] > curindx:
        curindx = chunks[-1].filter((pl.col('result').is_null()) & (pl.col('col_a') < 9))[0,0]
    else:
        print("curindx not advancing")
        break
    chunks[-1] = chunks[-1].filter(pl.col('idx') < curindx)
Finally just
pl.concat(chunks)
As long as we're looping, here's another approach.
If the gaps between the A/B matches are small, then this will end up being fast, as it scales with the gap size rather than with the size of the df. It just uses shift:
df = df.with_columns(pl.lit(None).alias('result'))
y = 0
while True:
    print(y)
    maxB = df.filter(pl.col('result').is_null()).select(pl.col('col_b').max())[0,0]
    df = df.with_columns(
        pl.when(
            (pl.col('result').is_null()) & (pl.col('col_a') < pl.col('col_b').shift(-y))
        ).then(pl.col('idx').shift(-y)).otherwise(pl.col('result')).alias('result'))
    y += 1
    if df.filter((pl.col('result').is_null()) & (pl.col('col_a') < maxB) & ~(pl.col('col_b').shift(-y).is_null())).shape[0] == 0:
        break
With my random data of 1.2m rows it only took 2.6s with a max row offset of 86. If, in your real data, the gaps are on the order of, let's just say, 100,000 then it'd be close to an hour.

Is it possible to reference a different dataframe when using Polars expression without using Lambda?

Is there a way to reference another Polars Dataframe in Polars expressions without using lambdas?
Just to use a simple example - suppose I have two dataframes:
import polars as pl
from datetime import date

df_1 = pl.DataFrame(
    {
        "time": pl.date_range(
            low=date(2021, 1, 1),
            high=date(2022, 1, 1),
            interval="1d",
        ),
        "x": pl.arange(0, 366, eager=True),
    }
)
df_2 = pl.DataFrame(
    {
        "time": pl.date_range(
            low=date(2021, 1, 1),
            high=date(2021, 2, 1),
            interval="1mo",
        ),
        "y": [50, 100],
    }
)
For each y value in df_2, I would like to find the maximum date in df_1, conditional on the x value being lower than the y.
I am able to perform this using apply/lambda (see below), but just wondering whether there is a more idiomatic way of performing this operation?
df_2.groupby("y").agg(
    pl.col("y")
    .apply(lambda s: df_1.filter(pl.col("x") < s).select(pl.col("time")).max()[0, 0])
    .alias('latest')
)
Edit:
Is it possible to pre-filter df_1 prior to using join_asof? Switching the question to look for the min instead of the max, this is what I would do for an individual case:
(
    df_2
    .filter(pl.col('y') == 50)
    .join_asof(
        df_1
        .sort("x")
        .filter(pl.col('time') > date(2021, 11, 1))
        .select([
            pl.col("time").cummin().alias("time_min"),
            pl.col("x").alias("original_x"),
            (pl.col("x") + 1).alias("x"),
        ]),
        left_on="y",
        right_on="x",
        strategy="forward",
    )
)
Is there a way to generalise this merge without using a loop / apply function?
Edit: Generalizing a join
One somewhat-dangerous approach to generalizing a join (so that you can run any sub-queries and filters that you like) is to use a "cross" join.
I say "somewhat-dangerous" because the number of row combinations considered in a cross join is M x N, where M and N are the number of rows in your two DataFrames. So if your two DataFrames are 1 million rows each, you have (1 million x 1 million) row combinations that are being considered. This process can exhaust your RAM or simply take a long time.
If you'd like to try it, here's how it would work (along with some arbitrary filters that I constructed, just to show the ultimate flexibility that a cross-join creates).
(
    df_2.lazy()
    .join(
        df_1.lazy(),
        how="cross"
    )
    .filter(pl.col('time_right') >= pl.col('time'))
    .groupby('y')
    .agg([
        pl.col('time').first(),
        pl.col('x')
        .filter(pl.col('y') > pl.col('x'))
        .max()
        .alias('max(x) for(y>x)'),
        pl.col('time_right')
        .filter(pl.col('y') > pl.col('x'))
        .max()
        .alias('max(time_right) for(y>x)'),
        pl.col('time_right')
        .filter(pl.col('y') <= pl.col('x'))
        .filter(pl.col('time_right') > pl.col('time'))
        .min()
        .alias('min(time_right) for(two filters)'),
    ])
    .collect()
)
shape: (2, 5)
┌─────┬────────────┬─────────────────┬──────────────────────────┬──────────────────────────────────┐
│ y ┆ time ┆ max(x) for(y>x) ┆ max(time_right) for(y>x) ┆ min(time_right) for(two filters) │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ date ┆ i64 ┆ date ┆ date │
╞═════╪════════════╪═════════════════╪══════════════════════════╪══════════════════════════════════╡
│ 100 ┆ 2021-02-01 ┆ 99 ┆ 2021-04-10 ┆ 2021-04-11 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 50 ┆ 2021-01-01 ┆ 49 ┆ 2021-02-19 ┆ 2021-02-20 │
└─────┴────────────┴─────────────────┴──────────────────────────┴──────────────────────────────────┘
Couple of suggestions:
I strongly recommend running the cross-join in Lazy mode.
Try to filter directly after the cross-join, to eliminate row combinations that you will never need. This reduces the burden on the later groupby step.
Given the explosive potential of row combinations for cross-joins, I tried to steer you toward a join_asof (which did solve the original sample question). But if you need the flexibility beyond what a join_asof can provide, the cross-join will provide ultimate flexibility -- at a cost.
join_asof
We can use a join_asof to accomplish this, with two wrinkles.
The Algorithm
(
    df_2
    .sort("y")
    .join_asof(
        (
            df_1
            .sort("x")
            .select([
                pl.col("time").cummax().alias("time_max"),
                (pl.col("x") + 1),
            ])
        ),
        left_on="y",
        right_on="x",
        strategy="backward",
    )
    .drop(['x'])
)
shape: (2, 3)
┌────────────┬─────┬────────────┐
│ time ┆ y ┆ time_max │
│ --- ┆ --- ┆ --- │
│ date ┆ i64 ┆ date │
╞════════════╪═════╪════════════╡
│ 2021-01-01 ┆ 50 ┆ 2021-02-19 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2021-02-01 ┆ 100 ┆ 2021-04-10 │
└────────────┴─────┴────────────┘
This matches the output of your code.
In steps
Let's add some extra information to our query, to elucidate how it works.
(
    df_2
    .sort("y")
    .join_asof(
        (
            df_1
            .sort("x")
            .select([
                pl.col("time").cummax().alias("time_max"),
                pl.col("x").alias("original_x"),
                (pl.col("x") + 1).alias("x"),
            ])
        ),
        left_on="y",
        right_on="x",
        strategy="backward",
    )
)
shape: (2, 5)
┌────────────┬─────┬────────────┬────────────┬─────┐
│ time ┆ y ┆ time_max ┆ original_x ┆ x │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ date ┆ i64 ┆ date ┆ i64 ┆ i64 │
╞════════════╪═════╪════════════╪════════════╪═════╡
│ 2021-01-01 ┆ 50 ┆ 2021-02-19 ┆ 49 ┆ 50 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 2021-02-01 ┆ 100 ┆ 2021-04-10 ┆ 99 ┆ 100 │
└────────────┴─────┴────────────┴────────────┴─────┘
Getting the maximum date
Instead of attempting a "non-equi" join or sub-queries to obtain the maximum date for x or any lesser value of x, we can use a simpler approach: sort df_1 by x and calculate the cumulative maximum date for each x. That way, when we join, we join to a single row in df_1 and can be certain that for any x, we are getting the maximum date for that x and all lesser values of x. The cumulative maximum is displayed above as time_max.
less-than (and not less-than-or-equal-to)
From the documentation for join_asof:
A “backward” search selects the last row in the right DataFrame whose ‘on’ key is less than or equal to the left’s key.
Since you want "less than" and not "less than or equal to", we can simply increase each value of x by 1. Since x and y are integers, this will work. The result above displays both the original value of x (original_x), and the adjusted value (x) used in the join_asof.
If x and y are floats, you can add an arbitrarily small amount to x (e.g., x + 0.000000001) to force the non-equality.
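As a rough sketch of that float adjustment (the epsilon here is an arbitrary choice; it only needs to be smaller than any real gap between x values in your data):
df_1.sort("x").select([
    pl.col("time").cummax().alias("time_max"),
    (pl.col("x") + 1e-9).alias("x"),  # nudge x so the "backward" search acts like a strict less-than
])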

Failing to understand example in documentation about window functions (operations per group) in polars

In the example under the section Operations per group the author writes:
col("value").sort().over("group")
But he doesn't say which value or group he picked. My assumption is that in this example he selected the 'Speed' column as the value and grouped over 'Type 1'.
The resulting frame is :
│ Name ┆ Type 1 ┆ Speed │
│ str ┆ str ┆ i64 │
│ Slowpoke ┆ Water ┆ 15 │
│ Slowbro ┆ Water ┆ 30 │
│ SlowbroMega Slowbro ┆ Water ┆ 30 │
│ Exeggcute ┆ Grass ┆ 40 │
│ Exeggutor ┆ Grass ┆ 55 │
│ Starmie ┆ Water ┆ 115 │
│ Jynx ┆ Ice ┆ 95 │
He mentions that the column 'Type 1' isn't contiguous, but he doesn't explain why, and I am failing to grasp the hints. :(
In the example he sorted 'Speed' (my best guess), so Speed should be in order, but it isn't, because of the last row with the value 95. On the other hand, he only sorted within the group 'Type 1', so how does the column get added back to the DataFrame?
For aggregations it's clear because:
The results of the aggregation are projected back to the original rows.
but what about operations within a group?
What am I missing? Is it just sorting the rows of each group? For example:
if I have Type 1 == 'Water' in rows 1, 3 and 7, will it just swap the values at those positions?
What am I missing? Is it just sorting the rows of each group? For example:
if I have Type 1 == 'Water' in rows 1, 3 and 7, will it just swap the values at those positions?
Yes! :)
So the operations work within a group, no matter at which row positions the group's elements are located. The window function will find them.
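Here is a minimal sketch (using a made-up frame rather than the docs' Pokémon data) that shows the values being reordered only within each group's own row positions:
import polars as pl

df = pl.DataFrame({
    "group": ["a", "b", "a", "b", "a"],
    "value": [5, 9, 1, 7, 3],
})
df.with_column(pl.col("value").sort().over("group").alias("sorted_in_group"))
# Rows 0, 2, 4 (group "a") receive 1, 3, 5; rows 1, 3 (group "b") receive 7, 9.
# The row positions belonging to each group never move; only the values within them are reordered.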

Is there a good way to do `zfill` in polars?

Is it proper to use pl.Expr.apply to throw the python function zfill at my data? I'm not looking for a performant solution.
pl.col("column").apply(lambda x: str(x).zfill(5))
Is there a better way to do this?
And to follow up I'd love to chat about what a good implementation could look like in the discord if you have some insight (assuming one doesn't currently exist).
Edit: Polars 0.13.43 and later
With version 0.13.43 and later, Polars has a str.zfill expression to accomplish this. str.zfill will be faster than the answer below and should be preferred.
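As a usage sketch (mirroring the integer column and width z = 5 from the answer below; the keyword name of the argument has varied between versions, so it is passed positionally here):
df.with_column(
    pl.col("num").cast(pl.Utf8).str.zfill(5).alias("result")
)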
Prior to Polars 0.13.43
From your question, I'm assuming that you are starting with a column of integers.
lambda x: str(x).zfill(5)
If so, here's one that adheres to pandas rather strictly:
import polars as pl
df = pl.DataFrame({"num": [-10, -1, 0, 1, 10, 100, 1000, 10000, 100000, 1000000, None]})
z = 5
df.with_column(
    pl.when(pl.col("num").cast(pl.Utf8).str.lengths() > z)
    .then(pl.col("num").cast(pl.Utf8))
    .otherwise(pl.concat_str([pl.lit("0" * z), pl.col("num").cast(pl.Utf8)]).str.slice(-z))
    .alias("result")
)
shape: (11, 2)
┌─────────┬─────────┐
│ num ┆ result │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════════╪═════════╡
│ -10 ┆ 00-10 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ -1 ┆ 000-1 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 0 ┆ 00000 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 00001 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 10 ┆ 00010 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 100 ┆ 00100 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 1000 ┆ 01000 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 10000 ┆ 10000 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 100000 ┆ 100000 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 1000000 ┆ 1000000 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ null ┆ null │
└─────────┴─────────┘
Comparing the output to pandas:
df.with_column(pl.col('num').cast(pl.Utf8)).get_column('num').to_pandas().str.zfill(z)
0 00-10
1 000-1
2 00000
3 00001
4 00010
5 00100
6 01000
7 10000
8 100000
9 1000000
10 None
dtype: object
If you are starting with strings, then you can simplify the code by getting rid of any calls to cast.
Edit: On a dataset with 550 million records, this took about 50 seconds on my machine. (Note: this runs single-threaded)
Edit2: To shave off some time, you can use the following:
result = df.lazy().with_column(
    pl.col('num').cast(pl.Utf8).alias('tmp')
).with_column(
    pl.when(pl.col("tmp").str.lengths() > z)
    .then(pl.col("tmp"))
    .otherwise(pl.concat_str([pl.lit("0" * z), pl.col("tmp")]).str.slice(-z))
    .alias("result")
).drop('tmp').collect()
but it didn't save that much time.

How to the get the first n% of a group in polars?

Q1: In polars-rust, when you do .groupby().agg(), we can use .head(10) to get the first 10 elements in a column. But if the groups have different lengths and I need to get the first 20% of elements in each group (e.g. the first 24 elements of a 120-element group), how do I make that work?
Q2: With a dataframe sample like the one below, my goal is to loop over the dataframe. Because Polars is column-major, I downcasted the df into several ChunkedArrays and iterated via iter().zip(). I found this is faster than the same action after groupby(col("date")), which loops over some list elements. How is that?
In my view, the df is shorter after the groupby, which should mean a shorter loop.
Date        Stock  Price
2010-01-01  IBM    1000
2010-01-02  IBM    1001
2010-01-03  IBM    1002
2010-01-01  AAPL   2900
2010-01-02  AAPL   2901
2010-01-03  AAPL   2902
I don't really understand your 2nd question. Maybe you can create another question with a small example.
I will answer the 1st question:
we can use head(10) to get the first 10 elements in a column. But if the groups have different lengths and I need to get the first 20% of elements in each group (e.g. the first 24 elements of a 120-element group), how do I make it work?
We can use expressions to take a head(n) where n = 0.2 * group_size.
df = pl.DataFrame({
    "groups": ["a"] * 10 + ["b"] * 20,
    "values": range(30)
})
(df.groupby("groups")
   .agg(pl.all().head(pl.count() * 0.2))
   .explode(pl.all().exclude("groups"))
)
which outputs:
shape: (6, 2)
┌────────┬────────┐
│ groups ┆ values │
│ --- ┆ --- │
│ str ┆ i64 │
╞════════╪════════╡
│ a ┆ 0 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ a ┆ 1 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ b ┆ 10 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ b ┆ 11 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ b ┆ 12 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ b ┆ 13 │
└────────┴────────┘