I will try to explain this as well as possible, because I am unfortunately quite new to polars. I have a large time series dataset where each separate time series is identified by a group_idx. Additionally, there is a time_idx column that identifies which of the possible time steps are present; each present step has a corresponding target value. As a minimal example, consider the following:
min_df = pl.DataFrame(
{"grop_idx": [0, 1, 2, 3], "time_idx": [[0, 1, 2, 3], [2, 3], [0, 2, 3], [0,3]]}
)
┌──────────┬───────────────┐
│ group_idx ┆ time_idx │
│ --- ┆ --- │
│ i64 ┆ list[i64] │
╞══════════╪═══════════════╡
│ 0 ┆ [0, 1, 2, 3] │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ [2, 3] │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ [0, 2, 3] │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ [0, 3] │
└──────────┴───────────────┘
Here, the time range in the dataset is 4 steps long, but not all time steps are present in every individual series. So while group_idx=0 has all steps present, group_idx=3 only has steps 0 and 3, meaning that for steps 1 and 2 no recorded target value exists.
Now, I would like to obtain all possible subsequences of a given sequence length, starting from each possible time step and going at most up to the maximum time step (in this case 3). For example, for sequence_length=3, the expected output would be:
result_df = pl.DataFrame(
    {
        "group_idx": [0, 0, 1, 1, 2, 2, 3, 3],
        "time_idx": [[0, 1, 2, 3], [0, 1, 2, 3], [2, 3], [2, 3], [0, 2, 3], [0, 2, 3], [0, 3], [0, 3]],
        "sub_sequence": [[0, 1, 2], [1, 2, 3], [None, None, 2], [None, 2, 3], [0, None, 2], [None, 2, 3], [0, None, None], [None, None, 3]],
    }
)
┌───────────┬───────────────┬─────────────────┐
│ group_idx ┆ time_idx ┆ sub_sequence │
│ --- ┆ --- ┆ --- │
│ i64 ┆ list[i64] ┆ list[i64] │
╞═══════════╪═══════════════╪═════════════════╡
│ 0 ┆ [0, 1, ... 3] ┆ [0, 1, 2] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0 ┆ [0, 1, ... 3] ┆ [1, 2, 3] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ [2, 3] ┆ [null, null, 2] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ [2, 3] ┆ [null, 2, 3] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ [0, 2, 3] ┆ [0, null, 2] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ [0, 2, 3] ┆ [null, 2, 3] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ [0, 3] ┆ [0, null, null] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ [0, 3] ┆ [null, null, 3] │
└───────────┴───────────────┴─────────────────┘
All of this should be computed within polars, because the real dataset is much larger both in terms of the number of time series and time series length.
Edit:
Based on the suggestion by #ΩΠΟΚΕΚΡΥΜΜΕΝΟΣ I have tried the following on the actual dataset (~200 million rows after .explode()). I forgot to say that we can assume that group_idx and time_idx are already sorted. However, this gets killed.
(
    min_df.lazy()
    .with_column(
        pl.col("time_idx").alias("time_idx_nulls")
    )
    .groupby_rolling(
        index_column='time_idx',
        by='group_idx',
        period=str(max_sequence_length) + 'i',
    )
    .agg(pl.col("time_idx_nulls"))
    .filter(pl.col('time_idx_nulls').arr.lengths() == max_sequence_length)
)
Here's an algorithm that needs only the desired sub-sequence length as input. It uses groupby_rolling to create your sub-sequences.
period = 3
min_df = min_df.explode('time_idx')
(
    min_df.get_column('group_idx').unique().to_frame()
    .join(
        min_df.get_column('time_idx').unique().to_frame(),
        how='cross'
    )
    .join(
        min_df.with_column(pl.col('time_idx').alias('time_idx_nulls')),
        on=['group_idx', 'time_idx'],
        how='left',
    )
    .groupby_rolling(
        index_column='time_idx',
        by='group_idx',
        period=str(period) + 'i',
    )
    .agg(pl.col("time_idx_nulls"))
    .filter(pl.col('time_idx_nulls').arr.lengths() == period)
    .sort('group_idx')
)
shape: (8, 3)
┌───────────┬──────────┬─────────────────┐
│ group_idx ┆ time_idx ┆ time_idx_nulls │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ list[i64] │
╞═══════════╪══════════╪═════════════════╡
│ 0 ┆ 2 ┆ [0, 1, 2] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0 ┆ 3 ┆ [1, 2, 3] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 2 ┆ [null, null, 2] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 3 ┆ [null, 2, 3] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2 ┆ [0, null, 2] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 3 ┆ [null, 2, 3] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 2 ┆ [0, null, null] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 3 ┆ [null, null, 3] │
└───────────┴──────────┴─────────────────┘
And for example, with period = 2:
shape: (12, 3)
┌───────────┬──────────┬────────────────┐
│ group_idx ┆ time_idx ┆ time_idx_nulls │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ list[i64] │
╞═══════════╪══════════╪════════════════╡
│ 0 ┆ 1 ┆ [0, 1] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0 ┆ 2 ┆ [1, 2] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0 ┆ 3 ┆ [2, 3] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 1 ┆ [null, null] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 2 ┆ [null, 2] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 3 ┆ [2, 3] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 1 ┆ [0, null] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2 ┆ [null, 2] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 3 ┆ [2, 3] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 1 ┆ [0, null] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 2 ┆ [null, null] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 3 ┆ [null, 3] │
└───────────┴──────────┴────────────────┘
Edit: managing RAM requirements
One way that we can manage RAM requirements (for this, or any other algorithm on large datasets) is to find ways to divide-and-conquer.
If we look at our particular problem, each line in the input dataset leads to results that are independent of any other line. We can use this fact to apply our algorithm in batches.
But first, let's create some data that leads to a large problem:
min_time = 0
max_time = 1_000
nbr_groups = 400_000
min_df = (
    pl.DataFrame({"time_idx": [list(range(min_time, max_time, 2))]})
    .join(
        pl.arange(0, nbr_groups, eager=True).alias("group_idx").to_frame(),
        how="cross"
    )
)
min_df.explode('time_idx')
shape: (200000000, 2)
┌──────────┬───────────┐
│ time_idx ┆ group_idx │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞══════════╪═══════════╡
│ 0 ┆ 0 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 0 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ 0 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 6 ┆ 0 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ... │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 992 ┆ 399999 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 994 ┆ 399999 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 996 ┆ 399999 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 998 ┆ 399999 │
└──────────┴───────────┘
The exploded input dataset is 200 million records, so roughly the size you describe. (Of course, using divide-and-conquer, we won't explode the full dataset.)
To divide-and-conquer this, we'll slice our input dataset into smaller datasets, run the algorithm on the smaller datasets, and then concat the results into one large dataset. (One nice feature of slice is that it's very cheap - it's simply a window into the original dataset, so it consumes very little additional RAM.)
Notice the slice_size variable. You'll need to experiment with this value on your particular computing platform. You want to set this as large as your RAM requirements allow. If set too low, your program will take too long. If set too high, your program will crash. (I've arbitrarily set this to 10,000 as a starting value.)
time_index_df = (
    pl.arange(min_time, max_time, eager=True, dtype=pl.Int64)
    .alias("time_idx")
    .to_frame()
    .lazy()
)
period = 3
slice_size = 10_000
result = pl.concat(
    [
        (
            time_index_df
            .join(
                min_df
                .lazy()
                .slice(next_index, slice_size)
                .select("group_idx"),
                how="cross",
            )
            .join(
                min_df
                .lazy()
                .slice(next_index, slice_size)
                .explode('time_idx')
                .with_column(pl.col("time_idx").alias("time_idx_nulls")),
                on=["group_idx", "time_idx"],
                how="left",
            )
            .groupby_rolling(
                index_column='time_idx',
                by='group_idx',
                period=str(period) + 'i',
            )
            .agg(pl.col("time_idx_nulls"))
            .filter(pl.col('time_idx_nulls').arr.lengths() == period)
            .select(['group_idx', 'time_idx_nulls'])
            .collect()
        )
        for next_index in range(0, min_df.height, slice_size)
    ]
)
result.sort('group_idx')
shape: (399200000, 2)
┌───────────┬───────────────────┐
│ group_idx ┆ time_idx_nulls │
│ --- ┆ --- │
│ i64 ┆ list[i64] │
╞═══════════╪═══════════════════╡
│ 0 ┆ [0, null, 2] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0 ┆ [null, 2, null] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0 ┆ [2, null, 4] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 0 ┆ [null, 4, null] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ... │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 399999 ┆ [994, null, 996] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 399999 ┆ [null, 996, null] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 399999 ┆ [996, null, 998] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 399999 ┆ [null, 998, null] │
└───────────┴───────────────────┘
Some other things
You do actually need to use the joins. The joins are used to fill in the "holes" in your sequences with null values.
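To illustrate what the left join contributes, here is a minimal sketch (using group_idx=3 from the example above): the time steps with no match come back as null in time_idx_nulls, and those nulls are what later show up inside the sub-sequences.
import polars as pl

# Full 4-step range for one group versus the steps actually observed.
full_range = pl.DataFrame({"group_idx": [3, 3, 3, 3], "time_idx": [0, 1, 2, 3]})
observed = pl.DataFrame({
    "group_idx": [3, 3],
    "time_idx": [0, 3],
    "time_idx_nulls": [0, 3],
})

# Left join: steps 1 and 2 have no match, so time_idx_nulls is null there.
print(full_range.join(observed, on=["group_idx", "time_idx"], how="left"))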
Also, notice that I've put each slice/batch into lazy mode, but not the entire algorithm. Depending on your computing platform, using lazy mode for the entire algorithm may again overwhelm your system, as Polars attempts to spread the work across multiple processors which could lead to more out-of-memory situations for you.
Also note the humongous size of my output dataset: almost 400 million records. I did this purposely as a reminder that your output dataset may be the ultimate problem. That is, any algorithm would fail if the result dataset is larger than your RAM can hold.
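If even the concatenated result threatens to outgrow RAM, one variation on the same batching idea is to write each batch to disk instead of keeping every batch in memory. A rough sketch, assuming min_df, time_index_df, period and slice_size are defined as above (the file name pattern is made up, and this assumes a Polars version where DataFrame.write_parquet exists):
for batch_no, next_index in enumerate(range(0, min_df.height, slice_size)):
    (
        time_index_df
        .join(
            min_df.lazy().slice(next_index, slice_size).select("group_idx"),
            how="cross",
        )
        .join(
            min_df.lazy().slice(next_index, slice_size).explode('time_idx')
            .with_column(pl.col("time_idx").alias("time_idx_nulls")),
            on=["group_idx", "time_idx"],
            how="left",
        )
        .groupby_rolling(
            index_column='time_idx',
            by='group_idx',
            period=str(period) + 'i',
        )
        .agg(pl.col("time_idx_nulls"))
        .filter(pl.col('time_idx_nulls').arr.lengths() == period)
        .select(['group_idx', 'time_idx_nulls'])
        .collect()
        # Each batch goes to its own file; nothing accumulates in memory.
        .write_parquet(f"subsequences_{batch_no:05d}.parquet")
    )
The pieces can then be read back lazily (for example with pl.scan_parquet) or processed file by file, depending on what the downstream step needs.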
Here's another approach using duckdb
It seems to perform much better in both runtime and memory in my local benchmark.
You can .explode("subsequence") afterwards to get a row per subsequence - this seems to be quite memory intensive though.
Update: polars can perform single column .explode() for "free". https://github.com/pola-rs/polars/pull/5676
You can unnest([ ... ] subsequence) to do the explode in duckdb - it seems to be a bit slower currently.
Update: Using the explode_table() from https://issues.apache.org/jira/plugins/servlet/mobile#issue/ARROW-12099 seems to add very little overhead.
>>> import duckdb
...
... min_df = pl.DataFrame({
... "group_idx": [0, 1, 2, 3],
... "time_idx": [[0, 1, 2, 3], [2, 3], [0, 2, 3], [0,3]]
... })
... max_time_step = 3
... sequence_length = 2
... upper_bound = (
... max_time_step - (
... 1 if max_time_step % sequence_length == 0 else 0
... )
... )
... tbl = min_df.to_arrow()
... pl.from_arrow(
... duckdb.connect().execute(f"""
... select
... group_idx, [
... time_idx_nulls[n: n + {sequence_length - 1}]
... for n in range(1, {upper_bound + 1})
... ] subsequence
... from (
... from tbl select group_idx, list_transform(
... range(0, {max_time_step + 1}),
... n -> case when list_has(time_idx, n) then n end
... ) time_idx_nulls
... )
... """)
... .arrow()
... )
shape: (4, 2)
┌───────────┬─────────────────────────────────────┐
│ group_idx | subsequence │
│ --- | --- │
│ i64 | list[list[i64]] │
╞═══════════╪═════════════════════════════════════╡
│ 0 | [[0, 1], [1, 2], [2, 3]] │
├───────────┼─────────────────────────────────────┤
│ 1 | [[null, null], [null, 2], [2, 3]... │
├───────────┼─────────────────────────────────────┤
│ 2 | [[0, null], [null, 2], [2, 3]] │
├───────────┼─────────────────────────────────────┤
│ 3 | [[0, null], [null, null], [null,... │
└─//────────┴─//──────────────────────────────────┘
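For reference, this is roughly what exploding the subsequence column looks like; a minimal sketch that rebuilds two of the rows above by hand instead of re-running the query (as noted, on the full dataset this step is the memory-intensive part):
import polars as pl

# Two groups from the result above, each holding a list of subsequences.
out = pl.DataFrame({
    "group_idx": [0, 2],
    "subsequence": [
        [[0, 1], [1, 2], [2, 3]],
        [[0, None], [None, 2], [2, 3]],
    ],
})

# One row per subsequence; the inner lists stay intact.
print(out.explode("subsequence"))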
I suspect there should be a cleaner way to do this but you could create range/mask list columns:
>>> max_time_step = 3
>>> sequence_length = 3
>>> (
... min_df
... .with_columns([
... pl.arange(0, max_time_step + 1).list().alias("range"),
... pl.col("time_idx").arr.eval(
... pl.arange(0, max_time_step + 1).is_in(pl.element()),
... parallel=True
... ).alias("mask")
... ])
... )
shape: (4, 4)
┌───────────┬───────────────┬───────────────┬──────────────────────────┐
│ group_idx | time_idx | range | mask │
│ --- | --- | --- | --- │
│ i64 | list[i64] | list[i64] | list[bool] │
╞═══════════╪═══════════════╪═══════════════╪══════════════════════════╡
│ 0 | [0, 1, ... 3] | [0, 1, ... 3] | [true, true, ... true] │
├───────────┼───────────────┼───────────────┼──────────────────────────┤
│ 1 | [2, 3] | [0, 1, ... 3] | [false, false, ... true] │
├───────────┼───────────────┼───────────────┼──────────────────────────┤
│ 2 | [0, 2, 3] | [0, 1, ... 3] | [true, false, ... true] │
├───────────┼───────────────┼───────────────┼──────────────────────────┤
│ 3 | [0, 3] | [0, 1, ... 3] | [true, false, ... true] │
└─//────────┴─//────────────┴─//────────────┴─//───────────────────────┘
You can then .explode() those columns, replace true with the number and group them back together.
Update #1: Use #ΩΠΟΚΕΚΡΥΜΜΕΝΟΣ's .groupby_rolling() technique to generate correct sub-sequences.
Update #2: Use regular .groupby() and .list().slice() to generate sub-sequences.
>>> min_df = pl.DataFrame({
... "group_idx": [0, 1, 2, 3],
... "time_idx": [[0, 1, 2, 3], [2, 3], [0, 2, 3], [0,3]]
... })
... max_time_step = 3
... sequence_length = 2
... (
... min_df
... .with_columns([
... pl.arange(0, max_time_step + 1).list().alias("range"),
... pl.col("time_idx").arr.eval(
... pl.arange(0, max_time_step + 1).is_in(pl.element()),
... parallel=True
... ).alias("mask")
... ])
... .explode(["range", "mask"])
... .with_column(
... pl.when(pl.col("mask"))
... .then(pl.col("range"))
... .alias("value"))
... .groupby("group_idx", maintain_order=True)
... .agg([
... pl.col("value")
... .list()
... .slice(length=sequence_length, offset=n)
... .suffix(f"{n}")
... for n in range(0, max_time_step - (1 if max_time_step % sequence_length == 0 else 0))
... ])
... .melt("group_idx", value_name="subsequence")
... .drop("variable")
... .sort("group_idx")
... )
shape: (12, 2)
┌───────────┬──────────────┐
│ group_idx | subsequence │
│ --- | --- │
│ i64 | list[i64] │
╞═══════════╪══════════════╡
│ 0 | [0, 1] │
├───────────┼──────────────┤
│ 0 | [1, 2] │
├───────────┼──────────────┤
│ 0 | [2, 3] │
├───────────┼──────────────┤
│ 1 | [null, null] │
├───────────┼──────────────┤
│ 1 | [null, 2] │
├───────────┼──────────────┤
│ ... | ... │
├───────────┼──────────────┤
│ 2 | [null, 2] │
├───────────┼──────────────┤
│ 2 | [2, 3] │
├───────────┼──────────────┤
│ 3 | [0, null] │
├───────────┼──────────────┤
│ 3 | [null, null] │
├───────────┼──────────────┤
│ 3 | [null, 3] │
└─//────────┴─//───────────┘
It feels like you should be able to use pl.element() inside .then() here to avoid the explode/groupby but it fails:
>>> (
... min_df
... .with_column(
... pl.col("time_idx").arr.eval(
... pl.when(pl.arange(0, max_time_step + 1).is_in(pl.element()))
... .then(pl.element()),
... parallel=True)
... .alias("subsequence")
... )
... )
---------------------------------------------------------------------------
ShapeError Traceback (most recent call last)
Let's say I have a list of dataframes like this:
Ldfs = [
    pl.DataFrame({'a': [1.0, 2.0, 3.1], 'b': [2, 3, 4]}),
    pl.DataFrame({'b': [1, 2, 3], 'c': [2, 3, 4]}),
    pl.DataFrame({'a': [1, 2, 3], 'c': [2, 3, 4]})
]
I can't do pl.concat(Ldfs) because they don't all have the same columns, and even the frames that have 'a' in common don't use the same data type for it.
What I'd like to do is concat them together but just add a column of Nones whenever a column isn't there and to cast columns to a fixed datatype.
For instance, just taking the first element of the list, I'd like to have something like this work:
Ldfs[0].select(pl.when(pl.col('c')).then(pl.col('c').cast(pl.Float64())).otherwise(pl.lit(None).cast(pl.Float64())).alias('c'))
Of course, this results in NotFoundError: c
Would an approach like this work for you? (I'll convert your DataFrames to LazyFrames for added fun.)
Ldfs = [
    pl.DataFrame({"a": [1.0, 2.0, 3.1], "b": [2, 3, 4]}).lazy(),
    pl.DataFrame({"b": [1, 2, 3], "c": [2, 3, 4]}).lazy(),
    pl.DataFrame({"a": [1, 2, 3], "c": [2, 3, 4]}).lazy(),
]
my_schema = {
    "a": pl.Float64,
    "b": pl.Int64,
    "c": pl.UInt32,
}
def fix_schema(ldf: pl.LazyFrame) -> pl.LazyFrame:
    ldf = (
        ldf.with_columns(
            [
                pl.col(col_nm).cast(col_type)
                for col_nm, col_type in my_schema.items()
                if col_nm in ldf.columns
            ]
        )
        .with_columns(
            [
                pl.lit(None, dtype=col_type).alias(col_nm)
                for col_nm, col_type in my_schema.items()
                if col_nm not in ldf.columns
            ]
        )
        .select(my_schema.keys())
    )
    return ldf
pl.concat(
    [fix_schema(next_frame) for next_frame in Ldfs],
    how="vertical"
).collect()
shape: (9, 3)
┌──────┬──────┬──────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ f64 ┆ i64 ┆ u32 │
╞══════╪══════╪══════╡
│ 1.0 ┆ 2 ┆ null │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2.0 ┆ 3 ┆ null │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 3.1 ┆ 4 ┆ null │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ null ┆ 1 ┆ 2 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ null ┆ 2 ┆ 3 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ null ┆ 3 ┆ 4 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 1.0 ┆ null ┆ 2 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2.0 ┆ null ┆ 3 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 3.0 ┆ null ┆ 4 │
└──────┴──────┴──────┘
.from_dicts() can infer the types and column names:
>>> df = pl.from_dicts([frame.to_dict() for frame in Ldfs])
>>> df
shape: (3, 3)
┌─────────────────┬───────────┬───────────┐
│ a | b | c │
│ --- | --- | --- │
│ list[f64] | list[i64] | list[i64] │
╞═════════════════╪═══════════╪═══════════╡
│ [1.0, 2.0, 3.1] | [2, 3, 4] | null │
├─────────────────┼───────────┼───────────┤
│ null | [1, 2, 3] | [2, 3, 4] │
├─────────────────┼───────────┼───────────┤
│ [1.0, 2.0, 3.0] | null | [2, 3, 4] │
└─//──────────────┴─//────────┴─//────────┘
With the right sized [null, ...] lists - you could .explode() all columns.
>>> nulls = pl.Series([[None] * len(frame) for frame in Ldfs])
... (
... pl.from_dicts([
... frame.to_dict() for frame in Ldfs
... ])
... .with_columns(
... pl.all().fill_null(nulls))
... .explode(pl.all())
... )
shape: (9, 3)
┌──────┬──────┬──────┐
│ a | b | c │
│ --- | --- | --- │
│ f64 | i64 | i64 │
╞══════╪══════╪══════╡
│ 1.0 | 2 | null │
├──────┼──────┼──────┤
│ 2.0 | 3 | null │
├──────┼──────┼──────┤
│ 3.1 | 4 | null │
├──────┼──────┼──────┤
│ null | 1 | 2 │
├──────┼──────┼──────┤
│ null | 2 | 3 │
├──────┼──────┼──────┤
│ null | 3 | 4 │
├──────┼──────┼──────┤
│ 1.0 | null | 2 │
├──────┼──────┼──────┤
│ 2.0 | null | 3 │
├──────┼──────┼──────┤
│ 3.0 | null | 4 │
└─//───┴─//───┴─//───┘
I am working with multiple parquet datasets that were written with nested structs (sometimes multiple levels deep). I need to output a flattened (no struct) schema. Right now the only way I can think to do that is to use for loops to iterate through the columns. Here is a simplified example where I'm for looping.
while len([x.name for x in df if x.dtype == pl.Struct]) > 0:
    for col in df:
        if col.dtype == pl.Struct:
            df = df.unnest(col.name)
This works, maybe that is the only way to do it, and if so it would be helpful to know that. But Polars is pretty neat and I'm wondering if there is a more functional way to do this without all the looping and reassigning the df to itself.
If you have a df like this:
df=pl.DataFrame({'a':[1,2,3], 'b':[2,3,4], 'c':[3,4,5], 'd':[4,5,6], 'e':[5,6,7]}).select([pl.struct(['a','b']).alias('ab'), pl.struct(['c','d']).alias('cd'),'e'])
You can unnest the ab and cd at the same time by just doing
df.unnest(['ab','cd'])
If you don't know in advance what your column names and types are, then you can just use a list comprehension like this:
[col_name for col_name,dtype in zip(df.columns, df.dtypes) if dtype==pl.Struct]
We can now just put that list comprehension in the unnest method.
df=df.unnest([col_name for col_name,dtype in zip(df.columns, df.dtypes) if dtype==pl.Struct])
If you have structs inside structs like:
df=pl.DataFrame({'a':[1,2,3], 'b':[2,3,4], 'c':[3,4,5], 'd':[4,5,6], 'e':[5,6,7]}).select([pl.struct(['a','b']).alias('ab'), pl.struct(['c','d']).alias('cd'),'e']).select([pl.struct(['ab','cd']).alias('abcd'),'e'])
then I don't think you can get away from some kind of while loop but this might be more concise:
while any([x == pl.Struct for x in df.dtypes]):
    df = df.unnest([col_name for col_name, dtype in zip(df.columns, df.dtypes) if dtype == pl.Struct])
This is a minor addition. If you're concerned about constantly re-looping through a large number of columns, you can create a recursive function that addresses only structs (and nested structs).
def unnest_all(self: pl.DataFrame):
    cols = []
    for next_col in self:
        if next_col.dtype != pl.Struct:
            cols.append(next_col)
        else:
            cols.extend(next_col.struct.to_frame().unnest_all().get_columns())
    return pl.DataFrame(cols)

pl.DataFrame.unnest_all = unnest_all
So, using the second example by #Dean MacGregor above:
df = (
    pl.DataFrame(
        {"a": [1, 2, 3], "b": [2, 3, 4], "c": [3, 4, 5], "d": [4, 5, 6], "e": [5, 6, 7]}
    )
    .select([pl.struct(["a", "b"]).alias("ab"), pl.struct(["c", "d"]).alias("cd"), "e"])
    .select([pl.struct(["ab", "cd"]).alias("abcd"), "e"])
)
df
df.unnest_all()
>>> df
shape: (3, 2)
┌───────────────┬─────┐
│ abcd ┆ e │
│ --- ┆ --- │
│ struct[2] ┆ i64 │
╞═══════════════╪═════╡
│ {{1,2},{3,4}} ┆ 5 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ {{2,3},{4,5}} ┆ 6 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ {{3,4},{5,6}} ┆ 7 │
└───────────────┴─────┘
>>> df.unnest_all()
shape: (3, 5)
┌─────┬─────┬─────┬─────┬─────┐
│ a ┆ b ┆ c ┆ d ┆ e │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╪═════╪═════╡
│ 1 ┆ 2 ┆ 3 ┆ 4 ┆ 5 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ 3 ┆ 4 ┆ 5 ┆ 6 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 3 ┆ 4 ┆ 5 ┆ 6 ┆ 7 │
└─────┴─────┴─────┴─────┴─────┘
And using the first example:
df = pl.DataFrame(
    {"a": [1, 2, 3], "b": [2, 3, 4], "c": [3, 4, 5], "d": [4, 5, 6], "e": [5, 6, 7]}
).select([pl.struct(["a", "b"]).alias("ab"), pl.struct(["c", "d"]).alias("cd"), "e"])
df
df.unnest_all()
>>> df
shape: (3, 3)
┌───────────┬───────────┬─────┐
│ ab ┆ cd ┆ e │
│ --- ┆ --- ┆ --- │
│ struct[2] ┆ struct[2] ┆ i64 │
╞═══════════╪═══════════╪═════╡
│ {1,2} ┆ {3,4} ┆ 5 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ {2,3} ┆ {4,5} ┆ 6 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ {3,4} ┆ {5,6} ┆ 7 │
└───────────┴───────────┴─────┘
>>> df.unnest_all()
shape: (3, 5)
┌─────┬─────┬─────┬─────┬─────┐
│ a ┆ b ┆ c ┆ d ┆ e │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╪═════╪═════╡
│ 1 ┆ 2 ┆ 3 ┆ 4 ┆ 5 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ 3 ┆ 4 ┆ 5 ┆ 6 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 3 ┆ 4 ┆ 5 ┆ 6 ┆ 7 │
└─────┴─────┴─────┴─────┴─────┘
In the end, I'm not sure that this saves you much wall-clock time (or RAM).
The other answers taught me a lot. I encountered a new situation where I wanted to easily get each column labeled with all the structs it came from, i.e. for
pl.col("my").struct.field("test").struct.field("thing")
I wanted to recover
my.test.thing
as a string which I could easily use when reading a subset of columns with pyarrow via
pq.ParquetDataset(path).read(columns = ["my.test.thing"])
Since there are many hundreds of columns and the nesting can go quite deep, I wrote functions to do a depth-first search on the schema and extract the columns in that pyarrow-friendly format, so I can then use those to select each column, unnested, all in one go.
First, I worked with the pyarrow schema because I couldn't figure out how to drill into the structs in the polars schema:
schema = df.to_arrow().schema
Navigating structs in that schema is quirky; at the top level the structure behaves differently from deeper in. I ended up writing two functions, the first to navigate the top-level structure and the second to continue the search below:
import pyarrow as pa

def schema_top_level_DFS(pa_schema):
    top_level_stack = list(range(len(pa_schema)))
    while top_level_stack:
        working_top_level_index = top_level_stack.pop()
        working_element_name = pa_schema.names[working_top_level_index]
        if type(pa_schema.types[working_top_level_index]) == pa.lib.StructType:
            second_level_stack = list(range(len(pa_schema.types[working_top_level_index])))
            while second_level_stack:
                working_second_level_index = second_level_stack.pop()
                schema_DFS(pa_schema.types[working_top_level_index][working_second_level_index], working_element_name)
        else:
            column_paths.append(working_element_name)

def schema_DFS(incoming_element, upstream_names):
    current_name = incoming_element.name
    combined_names = ".".join([upstream_names, current_name])
    if type(incoming_element.type) == pa.lib.StructType:
        stack = list(range(len(incoming_element.type)))
        while stack:
            working_index = stack.pop()
            working_element = incoming_element.type[working_index]
            schema_DFS(working_element, combined_names)
    else:
        column_paths.append(combined_names)
So that running
column_paths = []
schema_top_level_DFS(schema)
gives me column paths like
['struct_name_1.inner_struct_name_2.thing1', 'struct_name_1.inner_struct_name_2.thing2']
To actually do the unnesting, I wasn't sure how to do better than a function with a case statement:
def return_pl_formatting(col_string):
    col_list = col_string.split(".")
    match len(col_list):
        case 1:
            return pl.col(col_list[0]).alias(col_string)
        case 2:
            return pl.col(col_list[0]).struct.field(col_list[1]).alias(col_string)
        case 3:
            return pl.col(col_list[0]).struct.field(col_list[1]).struct.field(col_list[2]).alias(col_string)
        case 4:
            return pl.col(col_list[0]).struct.field(col_list[1]).struct.field(col_list[2]).struct.field(col_list[3]).alias(col_string)
        case 5:
            return pl.col(col_list[0]).struct.field(col_list[1]).struct.field(col_list[2]).struct.field(col_list[3]).struct.field(col_list[4]).alias(col_string)
        case 6:
            return pl.col(col_list[0]).struct.field(col_list[1]).struct.field(col_list[2]).struct.field(col_list[3]).struct.field(col_list[4]).struct.field(col_list[5]).alias(col_string)
Then get my unnested and nicely named df with:
df.select([return_pl_formatting(x) for x in column_paths])
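If you'd rather not enumerate the depths by hand, the same chain of .struct.field() calls can be built for any depth with a small reduce. This is just a sketch of a drop-in alternative to return_pl_formatting above, not tested beyond the examples in this post:
from functools import reduce

import polars as pl

def return_pl_formatting_any_depth(col_string: str) -> pl.Expr:
    # Turn "my.test.thing" into pl.col("my").struct.field("test").struct.field("thing").
    head, *rest = col_string.split(".")
    expr = reduce(lambda acc, part: acc.struct.field(part), rest, pl.col(head))
    return expr.alias(col_string)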
To show the output on the example from #Dean MacGregor
test = (
    pl.DataFrame(
        {"a": [1, 2, 3], "b": [2, 3, 4], "c": [3, 4, 5], "d": [4, 5, 6], "e": [5, 6, 7]}
    )
    .select([pl.struct(["a", "b"]).alias("ab"), pl.struct(["c", "d"]).alias("cd"), "e"])
    .select([pl.struct(["ab", "cd"]).alias("abcd"), "e"])
)
column_paths = []
schema_top_level_DFS(test.to_arrow().schema)
print(test.select([return_pl_formatting(x) for x in column_paths]))
┌─────┬───────────┬───────────┬───────────┬───────────┐
│ e ┆ abcd.cd.d ┆ abcd.cd.c ┆ abcd.ab.b ┆ abcd.ab.a │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═══════════╪═══════════╪═══════════╪═══════════╡
│ 5 ┆ 4 ┆ 3 ┆ 2 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 6 ┆ 5 ┆ 4 ┆ 3 ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 7 ┆ 6 ┆ 5 ┆ 4 ┆ 3 │
└─────┴───────────┴───────────┴───────────┴───────────┘
I have a Polars DataFrame with a list column. I want to control how many elements of a pl.List column are printed.
I've tried pl.Config.set_fmt_str_lengths() but this only restricts the number of elements if set to a small value; it doesn't show more elements for a large value.
I'm working in Jupyterlab but I think it's a general issue.
import polars as pl
N = 5
df = (
    pl.DataFrame(
        {
            'id': range(N)
        }
    )
    .with_row_count("value")
    .groupby_rolling(
        "id", period=f"{N}i"
    )
    .agg(
        pl.col("value")
    )
)
df
shape: (5, 2)
┌─────┬───────────────┐
│ id ┆ value │
│ --- ┆ --- │
│ i64 ┆ list[u32] │
╞═════╪═══════════════╡
│ 0 ┆ [0] │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ [0, 1] │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ [0, 1, 2] │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ [0, 1, ... 3] │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ [0, 1, ... 4] │
└─────┴───────────────┘
pl.Config.set_tbl_rows(100)
And more generally, I would try looking at dir(pl.Config)
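For example, a quick way to list the setters available on your installed version:
import polars as pl

# Show every Config setter the installed Polars version exposes.
print([name for name in dir(pl.Config) if name.startswith("set_")])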
You can use the following config parameter from the Polars documentation to set the length of the output, e.g. 100.
import polars as pl
pl.Config.set_fmt_str_lengths(100)
Currently I do not think you can, directly; the documentation for Config does not list any such method, and for me (in VSCode at least) set_fmt_str_lengths does not affect lists.
However, if your goal is simply to be able to see what you're working on and you don't mind a slightly hacky workaround, you can simply add a column next to it where you convert your list to a string representation of itself, at which point pl.Config.set_fmt_str_lengths(<some large n>) will then display however much of it you like. For example:
import polars as pl
pl.Config.set_fmt_str_lengths(100)
N = 5
df = (
    pl.DataFrame(
        {
            'id': range(N)
        }
    )
    .with_row_count("value")
    .groupby_rolling(
        "id", period=f"{N}i"
    )
    .agg(
        pl.col("value")
    )
    .with_column(
        pl.col("value").apply(lambda x: ["[" + ", ".join([f'{i}' for i in x]) + "]"][0]).alias("string_repr")
    )
)
df
shape: (5, 3)
┌─────┬───────────────┬─────────────────┐
│ id ┆ value ┆ string_repr │
│ --- ┆ --- ┆ --- │
│ i64 ┆ list[u32] ┆ str │
╞═════╪═══════════════╪═════════════════╡
│ 0 ┆ [0] ┆ [0] │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ [0, 1] ┆ [0, 1] │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ [0, 1, 2] ┆ [0, 1, 2] │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ [0, 1, ... 3] ┆ [0, 1, 2, 3] │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ [0, 1, ... 4] ┆ [0, 1, 2, 3, 4] │
└─────┴───────────────┴─────────────────┘
Suppose I have a mapping dataframe that I would like to join to an original dataframe:
df = pl.DataFrame({
    'A': [1, 2, 3, 2, 1],
})
mapper = pl.DataFrame({
    'key': [1, 2, 3, 4, 5],
    'value': ['a', 'b', 'c', 'd', 'e']
})
I can map A to value directly via df.join(mapper, ...), but is there a way to do this in an expression context, i.e. while building columns? As in:
df.with_columns([
    (pl.col('A')+1).join(mapper, left_on='A', right_on='key')
])
Which would furnish:
shape: (5, 2)
┌─────┬───────┐
│ A ┆ value │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═══════╡
│ 1 ┆ b │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 1 ┆ b │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2 ┆ c │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2 ┆ c │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3 ┆ d │
└─────┴───────┘
Probably, yes. I just put df.select(col('A')+1) inside.
df = df.with_columns([
    col('A'),
    df.select(col('A')+1).join(mapper, left_on='A', right_on='key')['value']
])
print(df)
df
┌─────┬───────┐
│ A ┆ value │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═══════╡
│ 1 ┆ b │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2 ┆ b │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3 ┆ c │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2 ┆ c │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 1 ┆ d │
└─────┴───────┘
Assuming I already have a predicate expression, how do I filter with that predicate, but apply it only within groups? For example, the predicate might be to keep all rows equal to the maximum within a group. (There could be multiple rows kept in a group if there is a tie.)
With my dplyr experience, I thought that I could just .groupby and then .filter, but that does not work.
import polars as pl
df = pl.DataFrame(dict(x=[0, 0, 1, 1], y=[1, 2, 3, 3]))
expression = pl.col("y") == pl.col("y").max()
df.groupby("x").filter(expression)
# AttributeError: 'GroupBy' object has no attribute 'filter'
I then thought I could apply .over to the expression, but that does not work either.
import polars as pl
df = pl.DataFrame(dict(x=[0, 0, 1, 1], y=[1, 2, 3, 3]))
expression = pl.col("y") == pl.col("y").max()
df.filter(expression.over("x"))
# RuntimeError: Any(ComputeError("this binary expression is not an aggregation:
# [(col(\"y\")) == (col(\"y\").max())]
# pherhaps you should add an aggregation like, '.sum()', '.min()', '.mean()', etc.
# if you really want to collect this binary expression, use `.list()`"))
For this particular problem, I can invoke .over on the max, but I don't know how to apply this to an arbitrary predicate I don't have control over.
import polars as pl
df = pl.DataFrame(dict(x=[0, 0, 1, 1], y=[1, 2, 3, 3]))
expression = pl.col("y") == pl.col("y").max().over("x")
df.filter(expression)
# shape: (3, 2)
# ┌─────┬─────┐
# │ x ┆ y │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 0 ┆ 2 │
# ├╌╌╌╌╌┼╌╌╌╌╌┤
# │ 1 ┆ 3 │
# ├╌╌╌╌╌┼╌╌╌╌╌┤
# │ 1 ┆ 3 │
# └─────┴─────┘
If you had updated to polars>=0.13.0 your second try would have worked. :)
df = pl.DataFrame(dict(
    x=[0, 0, 1, 1],
    y=[1, 2, 3, 3])
)
df.filter((pl.col("y") == pl.max("y").over("x")))
shape: (3, 2)
┌─────┬─────┐
│ x ┆ y │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 0 ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 1 ┆ 3 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 1 ┆ 3 │
└─────┴─────┘
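So, for the arbitrary-predicate case from the question, the window can simply wrap the whole expression; a short sketch assuming polars >= 0.13.0 as noted above:
import polars as pl

df = pl.DataFrame(dict(x=[0, 0, 1, 1], y=[1, 2, 3, 3]))

# The predicate you were handed, unchanged; .over("x") scopes it per group.
expression = pl.col("y") == pl.col("y").max()
print(df.filter(expression.over("x")))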