I'm trying to find a "polaric" way of aggregating data per row. This isn't strictly about the .sum function; it's about all aggregations where an axis argument makes sense.
Take a look at these pandas examples:
df[df.sum(axis=1) > 5]
df.assign(median=df.median(axis=1))
df[df.rolling(3, axis=1).mean() > 0]
However, with polars, problems start really quickly:
df.filter(df.sum(axis=1)>5)
df.with_column(df.mean(axis=1).alias('mean')) - can't do median
df... - can't do rolling, rank, or anything more complex.
I saw the page where the polars authors suggest doing everything by hand with folds, but there are cases where the logic doesn't fit into one input and one accumulator variable (e.g. a simple median).
Moreover, this approach doesn't seem to work at all when using expressions: pl.all().sum(axis=1), for example, is not valid because for some reason the axis argument is absent.
So the question is: how do I deal with these situations? I'd like to have the full polars API at my fingertips, instead of whatever suboptimal solutions I can come up with.
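To make the fold limitation concrete, here is a plain-Python sketch (standard library only, not Polars). A row sum fits a fold because a single running total is all the state needed, whereas a median needs every value of the row at once. The sample rows and variable names are illustrative assumptions.

```python
from functools import reduce
from statistics import median

# Illustrative rows; each inner list is one row of a small table.
rows = [[1, 4, 1], [2, 5, 8], [3, 3, 9]]

# Sum fits a fold: one accumulator, one input value at a time.
row_sums = [reduce(lambda acc, value: acc + value, row, 0) for row in rows]

# Median does not: it is computed on the whole collected row.
row_medians = [median(row) for row in rows]

print(row_sums)     # [6, 15, 15]
print(row_medians)  # [1, 5, 3]
```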
Row-wise computations:
You can create a list and access the .arr namespace for row wise computations.
@ritchie46's answer regarding rank(axis=1) is also useful reading.
.arr.eval() can be used for more complex computations.
df = pl.DataFrame([[1, 2, 3], [4, 5, 3], [1, 8, 9]])
(df.with_column(pl.concat_list(pl.all()).alias("row"))
   .with_columns([
       pl.col("row").arr.sum().alias("sum"),
       pl.col("row").arr.mean().alias("mean"),
       pl.col("row").arr.eval(pl.all().median(), parallel=True).alias("median"),
       pl.col("row").arr.eval(pl.all().rank(), parallel=True).alias("rank"),
   ])
)
shape: (3, 8)
┌──────────┬──────────┬──────────┬───────────┬─────┬──────┬───────────┬─────────────────┐
│ column_0 | column_1 | column_2 | row | sum | mean | median | rank │
│ --- | --- | --- | --- | --- | --- | --- | --- │
│ i64 | i64 | i64 | list[i64] | i64 | f64 | list[f64] | list[f32] │
╞══════════╪══════════╪══════════╪═══════════╪═════╪══════╪═══════════╪═════════════════╡
│ 1 | 4 | 1 | [1, 4, 1] | 6 | 2.0 | [1.0] | [1.5, 3.0, 1.5] │
├──────────┼──────────┼──────────┼───────────┼─────┼──────┼───────────┼─────────────────┤
│ 2 | 5 | 8 | [2, 5, 8] | 15 | 5.0 | [5.0] | [1.0, 2.0, 3.0] │
├──────────┼──────────┼──────────┼───────────┼─────┼──────┼───────────┼─────────────────┤
│ 3 | 3 | 9 | [3, 3, 9] | 15 | 5.0 | [3.0] | [1.5, 1.5, 3.0] │
└──────────┴──────────┴──────────┴───────────┴─────┴──────┴───────────┴─────────────────┘
pl.sum()
Can be given a list of columns.
>>> df.select(pl.sum(pl.all()))
shape: (3, 1)
┌─────┐
│ sum │
│ --- │
│ i64 │
╞═════╡
│ 6 │
├─────┤
│ 15 │
├─────┤
│ 15 │
└─────┘
.rolling_mean()
Can be accessed inside .arr.eval()
pdf = df.to_pandas()
pdf[pdf.rolling(2, axis=1).mean() > 3]
column_0 column_1 column_2
0 NaN NaN NaN
1 NaN 5.0 8.0
2 NaN NaN 9.0
(df.with_column(pl.concat_list(pl.all()).alias("row"))
   .with_column(
       pl.col("row").arr.eval(
           pl.when(pl.all().rolling_mean(2) > 3)
             .then(pl.all()),
           parallel=True)
       .alias("rolling[mean] > 3"))
)
shape: (3, 5)
┌──────────┬──────────┬──────────┬───────────┬────────────────────┐
│ column_0 | column_1 | column_2 | row | rolling[mean] > 3 │
│ --- | --- | --- | --- | --- │
│ i64 | i64 | i64 | list[i64] | list[i64] │
╞══════════╪══════════╪══════════╪═══════════╪════════════════════╡
│ 1 | 4 | 1 | [1, 4, 1] | [null, null, null] │
├──────────┼──────────┼──────────┼───────────┼────────────────────┤
│ 2 | 5 | 8 | [2, 5, 8] | [null, 5, 8] │
├──────────┼──────────┼──────────┼───────────┼────────────────────┤
│ 3 | 3 | 9 | [3, 3, 9] | [null, null, 9] │
└──────────┴──────────┴──────────┴───────────┴────────────────────┘
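As a cross-check of the result above, the same row-wise rule can be written in plain Python (not Polars): keep a value only where the mean of the trailing window exceeds the threshold. The helper name `rolling_mean_mask` is a hypothetical one chosen for this sketch.

```python
from typing import List, Optional


def rolling_mean_mask(
    row: List[float], window: int, threshold: float
) -> List[Optional[float]]:
    """Keep row[i] where the mean of the trailing `window` values exceeds `threshold`."""
    out: List[Optional[float]] = []
    for i in range(len(row)):
        if i + 1 < window:
            out.append(None)  # not enough values yet for a full window
            continue
        mean = sum(row[i + 1 - window : i + 1]) / window
        out.append(row[i] if mean > threshold else None)
    return out


# Same rows as the example frame (one inner list per row).
rows = [[1, 4, 1], [2, 5, 8], [3, 3, 9]]
masked = [rolling_mean_mask(row, window=2, threshold=3) for row in rows]
print(masked)  # [[None, None, None], [None, 5, 8], [None, None, 9]]
```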
If you want to "expand" the lists into columns:
Turn the list into a struct with .arr.to_struct()
.unnest() the struct.
Rename the columns (if needed)
(df.with_column(pl.concat_list(pl.all()).alias("row"))
   .select(
       pl.col("row").arr.eval(
           pl.when(pl.all().rolling_mean(2) > 3)
             .then(pl.all()),
           parallel=True)
       .arr.to_struct()
       .alias("rolling[mean]"))
   .unnest("rolling[mean]")
)
shape: (3, 3)
┌─────────┬─────────┬─────────┐
│ field_0 | field_1 | field_2 │
│ --- | --- | --- │
│ i64 | i64 | i64 │
╞═════════╪═════════╪═════════╡
│ null | null | null │
├─────────┼─────────┼─────────┤
│ null | 5 | 8 │
├─────────┼─────────┼─────────┤
│ null | null | 9 │
└─────────┴─────────┴─────────┘
.transpose()
You could always transpose the dataframe to switch the axis and use the "regular" api.
(df.transpose()
   .select(
       pl.when(pl.all().rolling_mean(2) > 3)
         .then(pl.all())
         .keep_name())
   .transpose())
shape: (3, 3)
┌──────────┬──────────┬──────────┐
│ column_0 | column_1 | column_2 │
│ --- | --- | --- │
│ i64 | i64 | i64 │
╞══════════╪══════════╪══════════╡
│ null | null | null │
├──────────┼──────────┼──────────┤
│ null | 5 | 8 │
├──────────┼──────────┼──────────┤
│ null | null | 9 │
└──────────┴──────────┴──────────┘
Let's say I have a list of dataframes like this:
Ldfs=[
pl.DataFrame({'a':[1.0,2.0,3.1], 'b':[2,3,4]}),
pl.DataFrame({'b':[1,2,3], 'c':[2,3,4]}),
pl.DataFrame({'a':[1,2,3], 'c':[2,3,4]})
]
I can't do pl.concat(Ldfs) because they don't all have the same columns, and even the ones that have column a in common don't have the same data type for it.
What I'd like to do is concat them together but just add a column of Nones whenever a column isn't there and to cast columns to a fixed datatype.
For instance, just taking the first element of the list I'd like to have something like this work:
Ldfs[0].select(pl.when(pl.col('c')).then(pl.col('c').cast(pl.Float64())).otherwise(pl.lit(None).cast(pl.Float64())).alias('c'))
Of course, this results in NotFoundError: c
Would an approach like this work for you? (I'll convert your DataFrames to LazyFrames for added fun.)
Ldfs = [
pl.DataFrame({"a": [1.0, 2.0, 3.1], "b": [2, 3, 4]}).lazy(),
pl.DataFrame({"b": [1, 2, 3], "c": [2, 3, 4]}).lazy(),
pl.DataFrame({"a": [1, 2, 3], "c": [2, 3, 4]}).lazy(),
]
my_schema = {
"a": pl.Float64,
"b": pl.Int64,
"c": pl.UInt32,
}
def fix_schema(ldf: pl.LazyFrame) -> pl.LazyFrame:
    ldf = (
        # Cast the columns that are already present to the target types.
        ldf.with_columns(
            [
                pl.col(col_nm).cast(col_type)
                for col_nm, col_type in my_schema.items()
                if col_nm in ldf.columns
            ]
        )
        # Add the missing columns as typed nulls.
        .with_columns(
            [
                pl.lit(None, dtype=col_type).alias(col_nm)
                for col_nm, col_type in my_schema.items()
                if col_nm not in ldf.columns
            ]
        )
        # Put the columns in a fixed order so the frames can be stacked.
        .select(my_schema.keys())
    )
    return ldf

pl.concat([fix_schema(next_frame)
           for next_frame in Ldfs], how="vertical").collect()
shape: (9, 3)
┌──────┬──────┬──────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ f64 ┆ i64 ┆ u32 │
╞══════╪══════╪══════╡
│ 1.0 ┆ 2 ┆ null │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2.0 ┆ 3 ┆ null │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 3.1 ┆ 4 ┆ null │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ null ┆ 1 ┆ 2 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ null ┆ 2 ┆ 3 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ null ┆ 3 ┆ 4 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 1.0 ┆ null ┆ 2 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2.0 ┆ null ┆ 3 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 3.0 ┆ null ┆ 4 │
└──────┴──────┴──────┘
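The align-then-stack idea behind fix_schema can also be sketched in plain Python on dict-of-list tables (not Polars). This is only an illustration of the logic: the names `schema`, `align`, and `concat` are hypothetical, and the Python callables `float` and `int` stand in for the Polars dtypes.

```python
from typing import Any, Callable, Dict, List

Table = Dict[str, List[Any]]

# Target schema: column name -> conversion function (stand-ins for dtypes).
schema: Dict[str, Callable[[Any], Any]] = {"a": float, "b": int, "c": int}


def align(table: Table) -> Table:
    """Cast present columns and add missing ones as columns of None."""
    n_rows = len(next(iter(table.values())))
    return {
        name: [convert(v) for v in table[name]] if name in table else [None] * n_rows
        for name, convert in schema.items()
    }


def concat(tables: List[Table]) -> Table:
    """Stack aligned tables vertically, column by column."""
    aligned = [align(t) for t in tables]
    return {name: [v for t in aligned for v in t[name]] for name in schema}


tables = [
    {"a": [1.0, 2.0, 3.1], "b": [2, 3, 4]},
    {"b": [1, 2, 3], "c": [2, 3, 4]},
    {"a": [1, 2, 3], "c": [2, 3, 4]},
]
combined = concat(tables)
print(combined["a"])  # [1.0, 2.0, 3.1, None, None, None, 1.0, 2.0, 3.0]
```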
.from_dicts() can infer the types and column names:
>>> df = pl.from_dicts([frame.to_dict() for frame in Ldfs])
>>> df
shape: (3, 3)
┌─────────────────┬───────────┬───────────┐
│ a | b | c │
│ --- | --- | --- │
│ list[f64] | list[i64] | list[i64] │
╞═════════════════╪═══════════╪═══════════╡
│ [1.0, 2.0, 3.1] | [2, 3, 4] | null │
├─────────────────┼───────────┼───────────┤
│ null | [1, 2, 3] | [2, 3, 4] │
├─────────────────┼───────────┼───────────┤
│ [1.0, 2.0, 3.0] | null | [2, 3, 4] │
└─────────────────┴───────────┴───────────┘
With the right sized [null, ...] lists - you could .explode() all columns.
>>> nulls = pl.Series([[None] * len(frame) for frame in Ldfs])
... (
... pl.from_dicts([
... frame.to_dict() for frame in Ldfs
... ])
... .with_columns(
... pl.all().fill_null(nulls))
... .explode(pl.all())
... )
shape: (9, 3)
┌──────┬──────┬──────┐
│ a | b | c │
│ --- | --- | --- │
│ f64 | i64 | i64 │
╞══════╪══════╪══════╡
│ 1.0 | 2 | null │
├──────┼──────┼──────┤
│ 2.0 | 3 | null │
├──────┼──────┼──────┤
│ 3.1 | 4 | null │
├──────┼──────┼──────┤
│ null | 1 | 2 │
├──────┼──────┼──────┤
│ null | 2 | 3 │
├──────┼──────┼──────┤
│ null | 3 | 4 │
├──────┼──────┼──────┤
│ 1.0 | null | 2 │
├──────┼──────┼──────┤
│ 2.0 | null | 3 │
├──────┼──────┼──────┤
│ 3.0 | null | 4 │
└──────┴──────┴──────┘
I will try to explain this as well as possible, because I am unfortunately quite new to polars. I have a large time series dataset where each separate time series is identified by a group_idx. Additionally, there is a time_idx column that identifies which of the possible time series steps are present and have a corresponding target value. As a minimal example, consider the following:
min_df = pl.DataFrame(
    {"group_idx": [0, 1, 2, 3], "time_idx": [[0, 1, 2, 3], [2, 3], [0, 2, 3], [0, 3]]}
)
┌───────────┬───────────────┐
│ group_idx ┆ time_idx      │
│ ---       ┆ ---           │
│ i64       ┆ list[i64]     │
╞═══════════╪═══════════════╡
│ 0         ┆ [0, 1, 2, 3]  │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1         ┆ [2, 3]        │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2         ┆ [0, 2, 3]     │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3         ┆ [0, 3]        │
└───────────┴───────────────┘
Here, the time range in the dataset is 4 steps long, but not all time steps are present for every individual series. So while group_idx=0 has all steps present, group_idx=3 only has steps 0 and 3, meaning that no target value was recorded for steps 1 and 2.
Now, I would like to obtain all possible sub sequences so that we start from each possible time step for a given sequence length and maximally go to the max_time_step (in this case 3). For example, for sequence_length=3, the expected output would be:
result_df = pl.DataFrame(
{
"group_idx": [0, 0, 1, 1, 2, 2, 3, 3],
"time_idx": [[0, 1, 2, 3], [0, 1, 2, 3], [2, 3], [2, 3], [0,2,3], [0,2,3], [0,3], [0,3]],
"sub_sequence": [[0,1,2], [1,2,3], [None, None, 2], [None, 2, 3], [0, None, 2], [None, 2, 3], [0, None, None], [None, None, 3]]
}
)
┌───────────┬───────────────┬─────────────────┐
│ group_idx ┆ time_idx ┆ sub_sequence │
│ --- ┆ --- ┆ --- │
│ i64 ┆ list[i64] ┆ list[i64] │
╞═══════════╪═══════════════╪═════════════════╡
│ 0 ┆ [0, 1, ... 3] ┆ [0, 1, 2] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0 ┆ [0, 1, ... 3] ┆ [1, 2, 3] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ [2, 3] ┆ [null, null, 2] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ [2, 3] ┆ [null, 2, 3] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ [0, 2, 3] ┆ [0, null, 2] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ [0, 2, 3] ┆ [null, 2, 3] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ [0, 3] ┆ [0, null, null] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ [0, 3] ┆ [null, null, 3] │
└───────────┴───────────────┴─────────────────┘
All of this should be computed within polars, because the real dataset is much larger both in terms of the number of time series and time series length.
Edit:
Based on the suggestion by @ΩΠΟΚΕΚΡΥΜΜΕΝΟΣ, I have tried the following on the actual dataset (~200 million rows after .explode()). I forgot to say that we can assume that group_idx and time_idx are already sorted. However, this gets killed.
(
min_df.lazy()
.with_column(
pl.col("time_idx").alias("time_idx_nulls")
)
.groupby_rolling(
index_column='time_idx',
by='group_idx',
period=str(max_sequence_length) + 'i',
)
.agg(pl.col("time_idx_nulls"))
.filter(pl.col('time_idx_nulls').arr.lengths() == max_sequence_length)
)
Here's an algorithm that needs only the desired sub-sequence length as input. It uses groupby_rolling to create your sub-sequences.
period = 3
min_df = min_df.explode('time_idx')
(
    min_df.get_column('group_idx').unique().to_frame()
    .join(
        min_df.get_column('time_idx').unique().to_frame(),
        how='cross'
    )
    .join(
        min_df.with_column(pl.col('time_idx').alias('time_idx_nulls')),
        on=['group_idx', 'time_idx'],
        how='left',
    )
    .groupby_rolling(
        index_column='time_idx',
        by='group_idx',
        period=str(period) + 'i',
    )
    .agg(pl.col("time_idx_nulls"))
    .filter(pl.col('time_idx_nulls').arr.lengths() == period)
    .sort('group_idx')
)
shape: (8, 3)
┌───────────┬──────────┬─────────────────┐
│ group_idx ┆ time_idx ┆ time_idx_nulls │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ list[i64] │
╞═══════════╪══════════╪═════════════════╡
│ 0 ┆ 2 ┆ [0, 1, 2] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0 ┆ 3 ┆ [1, 2, 3] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 2 ┆ [null, null, 2] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 3 ┆ [null, 2, 3] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2 ┆ [0, null, 2] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 3 ┆ [null, 2, 3] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 2 ┆ [0, null, null] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 3 ┆ [null, null, 3] │
└───────────┴──────────┴─────────────────┘
And for example, with period = 2:
shape: (12, 3)
┌───────────┬──────────┬────────────────┐
│ group_idx ┆ time_idx ┆ time_idx_nulls │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ list[i64] │
╞═══════════╪══════════╪════════════════╡
│ 0 ┆ 1 ┆ [0, 1] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0 ┆ 2 ┆ [1, 2] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0 ┆ 3 ┆ [2, 3] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 1 ┆ [null, null] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 2 ┆ [null, 2] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 3 ┆ [2, 3] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 1 ┆ [0, null] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2 ┆ [null, 2] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 3 ┆ [2, 3] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 1 ┆ [0, null] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 2 ┆ [null, null] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 3 ┆ [null, 3] │
└───────────┴──────────┴────────────────┘
Edit: managing RAM requirements
One way that we can manage RAM requirements (for this, or any other algorithm on large datasets) is to find ways to divide-and-conquer.
If we look at our particular problem, each line in the input dataset leads to results that are independent of any other line. We can use this fact to apply our algorithm in batches.
But first, let's create some data that leads to a large problem:
min_time = 0
max_time = 1_000
nbr_groups = 400_000
min_df = (
pl.DataFrame({"time_idx": [list(range(min_time, max_time, 2))]})
.join(
pl.arange(0, nbr_groups, eager=True).alias("group_idx").to_frame(),
how="cross"
)
)
min_df.explode('time_idx')
shape: (200000000, 2)
┌──────────┬───────────┐
│ time_idx ┆ group_idx │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞══════════╪═══════════╡
│ 0 ┆ 0 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 0 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ 0 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 6 ┆ 0 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ... │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 992 ┆ 399999 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 994 ┆ 399999 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 996 ┆ 399999 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 998 ┆ 399999 │
└──────────┴───────────┘
The input dataset, when exploded, is 200 million records, so roughly the size you describe. (Of course, using divide-and-conquer, we won't explode the full dataset.)
To divide-and-conquer this, we'll slice our input dataset into smaller datasets, run the algorithm on the smaller datasets, and then concat the results into one large dataset. (One nice feature of slice is that it's very cheap - it's simply a window into the original dataset, so it consumes very little additional RAM.)
Notice the slice_size variable. You'll need to experiment with this value on your particular computing platform. You want to set this as large as your RAM requirements allow. If set too low, your program will take too long. If set too high, your program will crash. (I've arbitrarily set this to 10,000 as a starting value.)
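The batching pattern itself does not depend on Polars. Here is a minimal plain-Python sketch of it, assuming (as stated above) that each input row can be processed independently of the others; `batch_bounds` and `process_in_batches` are hypothetical helper names used only for illustration.

```python
from typing import Any, Callable, Iterator, List, Sequence, Tuple


def batch_bounds(n_rows: int, slice_size: int) -> Iterator[Tuple[int, int]]:
    """Yield (offset, length) pairs that cover range(n_rows) without overlap."""
    for offset in range(0, n_rows, slice_size):
        yield offset, min(slice_size, n_rows - offset)


def process_in_batches(
    rows: Sequence[Any],
    slice_size: int,
    fn: Callable[[Sequence[Any]], List[Any]],
) -> List[Any]:
    """Apply `fn` to each slice and concatenate the per-slice results."""
    results: List[Any] = []
    for offset, length in batch_bounds(len(rows), slice_size):
        results.extend(fn(rows[offset : offset + length]))
    return results


bounds = list(batch_bounds(25, 10))
print(bounds)  # [(0, 10), (10, 10), (20, 5)]

# Only one slice's intermediate data exists at a time; the final result
# still has to fit in memory.
doubled = process_in_batches(list(range(25)), 10, lambda batch: [2 * x for x in batch])
```

Peak memory is then driven by the largest slice plus the accumulated output, which is why slice_size is the knob to tune.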
time_index_df = (
    pl.arange(min_time, max_time, eager=True, dtype=pl.Int64)
    .alias("time_idx")
    .to_frame()
    .lazy()
)
period = 3
slice_size = 10_000
result = pl.concat(
    [
        (
            time_index_df
            .join(
                min_df
                .lazy()
                .slice(next_index, slice_size)
                .select("group_idx"),
                how="cross",
            )
            .join(
                min_df
                .lazy()
                .slice(next_index, slice_size)
                .explode('time_idx')
                .with_column(pl.col("time_idx").alias("time_idx_nulls")),
                on=["group_idx", "time_idx"],
                how="left",
            )
            .groupby_rolling(
                index_column='time_idx',
                by='group_idx',
                period=str(period) + 'i',
            )
            .agg(pl.col("time_idx_nulls"))
            .filter(pl.col('time_idx_nulls').arr.lengths() == period)
            .select(['group_idx', 'time_idx_nulls'])
            .collect()
        )
        for next_index in range(0, min_df.height, slice_size)
    ]
)
result.sort('group_idx')
shape: (399200000, 2)
┌───────────┬───────────────────┐
│ group_idx ┆ time_idx_nulls │
│ --- ┆ --- │
│ i64 ┆ list[i64] │
╞═══════════╪═══════════════════╡
│ 0 ┆ [0, null, 2] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0 ┆ [null, 2, null] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0 ┆ [2, null, 4] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0 ┆ [null, 4, null] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ... │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 399999 ┆ [994, null, 996] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 399999 ┆ [null, 996, null] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 399999 ┆ [996, null, 998] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 399999 ┆ [null, 998, null] │
└───────────┴───────────────────┘
Some other things
You do actually need to use the joins. The joins are used to fill in the "holes" in your sequences with null values.
Also, notice that I've put each slice/batch into lazy mode, but not the entire algorithm. Depending on your computing platform, using lazy mode for the entire algorithm may again overwhelm your system, as Polars attempts to spread the work across multiple processors which could lead to more out-of-memory situations for you.
Also note the humongous size of my output dataset: almost 400 million records. I did this purposely as a reminder that your output dataset may be the ultimate problem. That is, any algorithm would fail if the result dataset is larger than your RAM can hold.
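To illustrate why the joins matter, here is a plain-Python sketch (not Polars) of the three steps for a single group: fill the holes, build a trailing window at every position, and drop the short windows at the start. The function name `rolling_windows` and its parameters are assumptions made for this example.

```python
from typing import Iterable, List, Optional


def rolling_windows(
    present: Iterable[int], time_steps: range, period: int
) -> List[List[Optional[int]]]:
    present_set = set(present)
    # "Join" step: one slot per possible time step, None where the step is missing.
    filled = [t if t in present_set else None for t in time_steps]
    # Rolling step: the window ending at each position, looking back `period` slots.
    windows = [
        filled[max(0, end - period + 1) : end + 1] for end in range(len(filled))
    ]
    # Filter step: the first windows are shorter than `period`, so drop them.
    return [w for w in windows if len(w) == period]


# Group 3 of the small example: only steps 0 and 3 are present.
print(rolling_windows([0, 3], range(0, 4), 3))
# [[0, None, None], [None, None, 3]]
```

Without the fill step there would be no slot for a missing time step, so the None placeholders could not appear in the windows.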
Here's another approach using duckdb
It seems to perform much better in both runtime and memory in my local benchmark.
You can .explode("subsequence") afterwards to get a row per subsequence - this seems to be quite memory intensive though.
Update: polars can perform single column .explode() for "free". https://github.com/pola-rs/polars/pull/5676
You can unnest([ ... ] subsequence) to do the explode in duckdb - it seems to be a bit slower currently.
Update: Using the explode_table() from https://issues.apache.org/jira/plugins/servlet/mobile#issue/ARROW-12099 seems to add very little overhead.
>>> import duckdb
...
... min_df = pl.DataFrame({
... "group_idx": [0, 1, 2, 3],
... "time_idx": [[0, 1, 2, 3], [2, 3], [0, 2, 3], [0,3]]
... })
... max_time_step = 3
... sequence_length = 2
... # Number of windows of `sequence_length` over time steps 0..max_time_step,
... # i.e. the last 1-based start index used in the query below.
... upper_bound = max_time_step - sequence_length + 2
... tbl = min_df.to_arrow()
... pl.from_arrow(
... duckdb.connect().execute(f"""
... select
... group_idx, [
... time_idx_nulls[n: n + {sequence_length - 1}]
... for n in range(1, {upper_bound + 1})
... ] subsequence
... from (
... from tbl select group_idx, list_transform(
... range(0, {max_time_step + 1}),
... n -> case when list_has(time_idx, n) then n end
... ) time_idx_nulls
... )
... """)
... .arrow()
... )
shape: (4, 2)
┌───────────┬─────────────────────────────────────┐
│ group_idx | subsequence │
│ --- | --- │
│ i64 | list[list[i64]] │
╞═══════════╪═════════════════════════════════════╡
│ 0 | [[0, 1], [1, 2], [2, 3]] │
├───────────┼─────────────────────────────────────┤
│ 1 | [[null, null], [null, 2], [2, 3]... │
├───────────┼─────────────────────────────────────┤
│ 2 | [[0, null], [null, 2], [2, 3]] │
├───────────┼─────────────────────────────────────┤
│ 3 | [[0, null], [null, null], [null,... │
└───────────┴─────────────────────────────────────┘
I suspect there should be a cleaner way to do this but you could create range/mask list columns:
>>> max_time_step = 3
>>> sequence_length = 3
>>> (
... min_df
... .with_columns([
... pl.arange(0, max_time_step + 1).list().alias("range"),
... pl.col("time_idx").arr.eval(
... pl.arange(0, max_time_step + 1).is_in(pl.element()),
... parallel=True
... ).alias("mask")
... ])
... )
shape: (4, 4)
┌───────────┬───────────────┬───────────────┬──────────────────────────┐
│ group_idx | time_idx | range | mask │
│ --- | --- | --- | --- │
│ i64 | list[i64] | list[i64] | list[bool] │
╞═══════════╪═══════════════╪═══════════════╪══════════════════════════╡
│ 0 | [0, 1, ... 3] | [0, 1, ... 3] | [true, true, ... true] │
├───────────┼───────────────┼───────────────┼──────────────────────────┤
│ 1 | [2, 3] | [0, 1, ... 3] | [false, false, ... true] │
├───────────┼───────────────┼───────────────┼──────────────────────────┤
│ 2 | [0, 2, 3] | [0, 1, ... 3] | [true, false, ... true] │
├───────────┼───────────────┼───────────────┼──────────────────────────┤
│ 3 | [0, 3] | [0, 1, ... 3] | [true, false, ... true] │
└───────────┴───────────────┴───────────────┴──────────────────────────┘
You can then .explode() those columns, replace true with the number and group them back together.
Update #1: Use @ΩΠΟΚΕΚΡΥΜΜΕΝΟΣ's .groupby_rolling() technique to generate correct sub-sequences.
Update #2: Use regular .groupby() and .list().slice() to generate sub-sequences.
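The explode / replace / regroup idea can be checked on a single group in plain Python (not Polars). This sketch uses group 2 of the example; the variable names are illustrative.

```python
max_time_step = 3
sequence_length = 2
time_idx = [0, 2, 3]  # steps present for group_idx == 2

# The "range" and "mask" columns for this one group.
steps = list(range(max_time_step + 1))
present = set(time_idx)
mask = [step in present for step in steps]

# Replace True with the step number and False with None.
values = [step if keep else None for step, keep in zip(steps, mask)]

# Slice every window of `sequence_length` out of the filled sequence.
subsequences = [
    values[start : start + sequence_length]
    for start in range(len(values) - sequence_length + 1)
]
print(values)        # [0, None, 2, 3]
print(subsequences)  # [[0, None], [None, 2], [2, 3]]
```

Note that the number of windows is `len(values) - sequence_length + 1`, which equals `max_time_step - sequence_length + 2`.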
>>> min_df = pl.DataFrame({
... "group_idx": [0, 1, 2, 3],
... "time_idx": [[0, 1, 2, 3], [2, 3], [0, 2, 3], [0,3]]
... })
... max_time_step = 3
... sequence_length = 2
... (
... min_df
... .with_columns([
... pl.arange(0, max_time_step + 1).list().alias("range"),
... pl.col("time_idx").arr.eval(
... pl.arange(0, max_time_step + 1).is_in(pl.element()),
... parallel=True
... ).alias("mask")
... ])
... .explode(["range", "mask"])
... .with_column(
... pl.when(pl.col("mask"))
... .then(pl.col("range"))
... .alias("value"))
... .groupby("group_idx", maintain_order=True)
... .agg([
... pl.col("value")
... .list()
... .slice(length=sequence_length, offset=n)
... .suffix(f"{n}")
... for n in range(0, max_time_step - sequence_length + 2)
... ])
... .melt("group_idx", value_name="subsequence")
... .drop("variable")
... .sort("group_idx")
... )
shape: (12, 2)
┌───────────┬──────────────┐
│ group_idx | subsequence │
│ --- | --- │
│ i64 | list[i64] │
╞═══════════╪══════════════╡
│ 0 | [0, 1] │
├───────────┼──────────────┤
│ 0 | [1, 2] │
├───────────┼──────────────┤
│ 0 | [2, 3] │
├───────────┼──────────────┤
│ 1 | [null, null] │
├───────────┼──────────────┤
│ 1 | [null, 2] │
├───────────┼──────────────┤
│ ... | ... │
├───────────┼──────────────┤
│ 2 | [null, 2] │
├───────────┼──────────────┤
│ 2 | [2, 3] │
├───────────┼──────────────┤
│ 3 | [0, null] │
├───────────┼──────────────┤
│ 3 | [null, null] │
├───────────┼──────────────┤
│ 3 | [null, 3] │
└───────────┴──────────────┘
It feels like you should be able to use pl.element() inside .then() here to avoid the explode/groupby but it fails:
>>> (
... min_df
... .with_column(
... pl.col("time_idx").arr.eval(
... pl.when(pl.arange(0, max_time_step + 1).is_in(pl.element()))
... .then(pl.element()),
... parallel=True)
... .alias("subsequence")
... )
... )
---------------------------------------------------------------------------
ShapeError Traceback (most recent call last)
I have a Polars DataFrame with a list column. I want to control how many elements of a pl.List column are printed.
I've tried pl.Config.set_fmt_str_lengths(), but this only restricts the number of elements when set to a small value; it doesn't show more elements for a large value.
I'm working in JupyterLab, but I think it's a general issue.
import polars as pl
N = 5
df = (
pl.DataFrame(
{
'id': range(N)
}
)
.with_row_count("value")
.groupby_rolling(
"id",period=f"{N}i"
)
.agg(
pl.col("value")
)
)
df
shape: (5, 2)
┌─────┬───────────────┐
│ id ┆ value │
│ --- ┆ --- │
│ i64 ┆ list[u32] │
╞═════╪═══════════════╡
│ 0 ┆ [0] │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ [0, 1] │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ [0, 1, 2] │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ [0, 1, ... 3] │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ [0, 1, ... 4] │
└─────┴───────────────┘
pl.Config.set_tbl_rows(100)
And more generally, I would try looking at dir(pl.Config)
You can use the following config parameter from the Polars Documentation to set the length of the output e.g. 100.
import polars as pl
pl.Config.set_fmt_str_lengths(100)
Currently I do not think you can, directly; the documentation for Config does not list any such method, and for me (in VSCode at least) set_fmt_str_lengths does not affect lists.
However, if your goal is simply to be able to see what you're working on and you don't mind a slightly hacky workaround, you can simply add a column next to it where you convert your list to a string representation of itself, at which point pl.Config.set_fmt_str_lengths(<some large n>) will then display however much of it you like. For example:
import polars as pl
pl.Config.set_fmt_str_lengths(100)
N = 5
df = (
pl.DataFrame(
{
'id': range(N)
}
)
.with_row_count("value")
.groupby_rolling(
"id",period=f"{N}i"
)
.agg(
pl.col("value")
).with_column(
pl.col("value").apply(lambda x: "[" + ", ".join(str(i) for i in x) + "]").alias("string_repr")
)
)
df
shape: (5, 3)
┌─────┬───────────────┬─────────────────┐
│ id ┆ value ┆ string_repr │
│ --- ┆ --- ┆ --- │
│ i64 ┆ list[u32] ┆ str │
╞═════╪═══════════════╪═════════════════╡
│ 0 ┆ [0] ┆ [0] │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ [0, 1] ┆ [0, 1] │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ [0, 1, 2] ┆ [0, 1, 2] │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ [0, 1, ... 3] ┆ [0, 1, 2, 3] │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ [0, 1, ... 4] ┆ [0, 1, 2, 3, 4] │
└─────┴───────────────┴─────────────────┘
Assuming I already have a predicate expression, how do I filter with that predicate, but apply it only within groups? For example, the predicate might be to keep all rows equal to the maximum within a group. (There could be multiple rows kept in a group if there is a tie.)
With my dplyr experience, I thought that I could just .groupby and then .filter, but that does not work.
import polars as pl
df = pl.DataFrame(dict(x=[0, 0, 1, 1], y=[1, 2, 3, 3]))
expression = pl.col("y") == pl.col("y").max()
df.groupby("x").filter(expression)
# AttributeError: 'GroupBy' object has no attribute 'filter'
I then thought I could apply .over to the expression, but that does not work either.
import polars as pl
df = pl.DataFrame(dict(x=[0, 0, 1, 1], y=[1, 2, 3, 3]))
expression = pl.col("y") == pl.col("y").max()
df.filter(expression.over("x"))
# RuntimeError: Any(ComputeError("this binary expression is not an aggregation:
# [(col(\"y\")) == (col(\"y\").max())]
# pherhaps you should add an aggregation like, '.sum()', '.min()', '.mean()', etc.
# if you really want to collect this binary expression, use `.list()`"))
For this particular problem, I can invoke .over on the max, but I don't know how to apply this to an arbitrary predicate I don't have control over.
import polars as pl
df = pl.DataFrame(dict(x=[0, 0, 1, 1], y=[1, 2, 3, 3]))
expression = pl.col("y") == pl.col("y").max().over("x")
df.filter(expression)
# shape: (3, 2)
# ┌─────┬─────┐
# │ x ┆ y │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 0 ┆ 2 │
# ├╌╌╌╌╌┼╌╌╌╌╌┤
# │ 1 ┆ 3 │
# ├╌╌╌╌╌┼╌╌╌╌╌┤
# │ 1 ┆ 3 │
# └─────┴─────┘
If you had updated to polars>=0.13.0 your second try would have worked. :)
df = pl.DataFrame(dict(
    x=[0, 0, 1, 1],
    y=[1, 2, 3, 3],
))
df.filter(pl.col("y") == pl.max("y").over("x"))
shape: (3, 2)
┌─────┬─────┐
│ x ┆ y │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 0 ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 1 ┆ 3 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 1 ┆ 3 │
└─────┴─────┘
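As a sanity check of what the window filter computes, here is a plain-Python equivalent (not Polars) on the same data: first find each group's maximum, then keep the rows that equal the maximum of their own group. The variable names are illustrative.

```python
# (x, y) pairs from the example frame.
rows = [(0, 1), (0, 2), (1, 3), (1, 3)]

# First pass: the per-group maximum, i.e. the value that a windowed max
# over "x" would broadcast back to every row of the group.
group_max = {}
for x, y in rows:
    group_max[x] = max(y, group_max.get(x, y))

# Second pass: keep rows equal to their own group's maximum; ties are all kept.
kept = [(x, y) for x, y in rows if y == group_max[x]]
print(kept)  # [(0, 2), (1, 3), (1, 3)]
```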