Nested time-based groupby operations/sub-groups without apply()? - python-polars

I want to understand the polars way to create temporal sub-groups out of the groups from a groupby_rolling() operation.
I'm looking to do this while keeping things parallel, i.e. without using apply() (see that approach) and without using secondary/merged dataframes.
Example input:
┌─────┬─────────────────────┬───────┐
│ row ┆ date ┆ price │
│ --- ┆ --- ┆ --- │
│ i64 ┆ datetime[μs] ┆ i64 │
╞═════╪═════════════════════╪═══════╡
│ 1 ┆ 2022-01-01 10:00:00 ┆ 10 │
│ 2 ┆ 2022-01-01 10:05:00 ┆ 20 │
│ 3 ┆ 2022-01-01 10:10:00 ┆ 30 │
│ 4 ┆ 2022-01-01 10:15:00 ┆ 40 │
│ 5 ┆ 2022-01-01 10:20:00 ┆ 50 │
│ 6 ┆ 2022-01-01 10:25:00 ┆ 60 │
│ 7 ┆ 2022-01-01 10:30:00 ┆ 70 │
│ 8 ┆ 2022-01-01 10:35:00 ┆ 80 │
│ 9 ┆ 2022-01-01 10:40:00 ┆ 90 │
│ 10 ┆ 2022-01-01 10:45:00 ┆ 100 │
│ 11 ┆ 2022-01-01 10:50:00 ┆ 110 │
│ 12 ┆ 2022-01-01 10:55:00 ┆ 120 │
│ 13 ┆ 2022-01-01 11:00:00 ┆ 130 │
└─────┴─────────────────────┴───────┘
Desired output:
┌─────┬─────────────────────┬───────┬──────────────────────────────────┐
│ row ┆ date ┆ price ┆ 10_min_groups_mean_price_history │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ datetime[μs] ┆ i64 ┆ list[i64] │
╞═════╪═════════════════════╪═══════╪══════════════════════════════════╡
│ 1 ┆ 2022-01-01 10:00:00 ┆ 10 ┆ [10] │
│ 2 ┆ 2022-01-01 10:05:00 ┆ 20 ┆ [15] │
│ 3 ┆ 2022-01-01 10:10:00 ┆ 30 ┆ [25, 10] │
│ 4 ┆ 2022-01-01 10:15:00 ┆ 40 ┆ [35, 15] │
│ 5 ┆ 2022-01-01 10:20:00 ┆ 50 ┆ [45, 25, 10] │
│ 6 ┆ 2022-01-01 10:25:00 ┆ 60 ┆ [55, 35, 15] │
│ 7 ┆ 2022-01-01 10:30:00 ┆ 70 ┆ [65, 45, 25] │
│ 8 ┆ 2022-01-01 10:35:00 ┆ 80 ┆ [75, 55, 35] │
│ 9 ┆ 2022-01-01 10:40:00 ┆ 90 ┆ [85, 65, 45] │
│ 10 ┆ 2022-01-01 10:45:00 ┆ 100 ┆ [95, 75, 55] │
│ 11 ┆ 2022-01-01 10:50:00 ┆ 110 ┆ [105, 85, 65] │
│ 12 ┆ 2022-01-01 10:55:00 ┆ 120 ┆ [115, 95, 75] │
│ 13 ┆ 2022-01-01 11:00:00 ┆ 130 ┆ [125, 105, 85] │
└─────┴─────────────────────┴───────┴──────────────────────────────────┘
What is happening above?
A rolling window is applied over the dataframe, producing a window per row.
Each window includes all rows within the last 30 minutes (including the current row).
Each 30-minute window is then divided into 10-minute sub-groups.
The mean price is calculated for each 10-minute sub-group.
All mean prices from the sub-groups are returned as a list (most recent first) in the "10_min_groups_mean_price_history" column.
Worked example (using row 5):
The rolling window for row 5 captures the previous 30 minutes of data, which is rows 1 to 5.
These rows are sub-grouped into 10-minute windows, creating three sub-groups that capture rows [[5,4],[3,2],[1]].
The mean price of the rows in each sub-group is calculated and produced as a list → [45, 25, 10]
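(The same arithmetic in plain Python, as a quick check of the figures above:)
# prices of the rows in each sub-group, most recent sub-group first:
# rows [5, 4], rows [3, 2], row [1]
sub_groups = [[50, 40], [30, 20], [10]]
print([sum(g) / len(g) for g in sub_groups])  # [45.0, 25.0, 10.0]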
Mental model:
I'm conceptualising this as treating each window from a groupby_rolling() operation as a dataframe that can be computed on as needed (in this case by performing a groupby_dynamic() operation on it, with the intent of returning aggregations on those sub-groups as a list), but I'm not sure if that is the right way to think about it.
If the sub-group data were categorical it would be a simple case of using over(); however, I'm not aware of an equivalent when the requirement is to sub-group by a time window.
I am also under the impression that this operation should be parallelisable, as each window is independent of the others (it's just more calculation steps), but please point out if there's a reason it can't be.
Thanks in advance!
Full dummy data set:
If you want to run this with a realistically sized dataset, you can use:
import numpy as np
import polars as pl
from datetime import datetime, timedelta

df_dummy = pl.DataFrame({
    'date': pl.date_range(
        datetime(2000, 1, 1, 9),
        datetime(2000, 1, 1, 16, 59, 59),
        timedelta(seconds=1),
    )
})
df_dummy = df_dummy.with_column(
    pl.Series(np.random.uniform(0.5, 0.95, len(df_dummy)) * 100).alias('price')
)
Other ways that people might ask this question (for others searching):
groupby_dynamic() within groupby_rolling()
How to access polars RollingGroupBy[Dataframe] Object
Treat each groupby_rolling() window as a dataframe to aggregate on
Nested dataframes within groupby context
Nested groupby contexts

Could you .explode() the .groupby_rolling() - then use the resulting column for your .groupby_dynamic()?
(df.groupby_rolling(index_column="date", period="30m", closed="both")
   .agg(pl.col("date").alias("window"))
   .explode("window"))
shape: (70, 2)
┌─────────────────────┬─────────────────────┐
│ date | window │
│ --- | --- │
│ datetime[μs] | datetime[μs] │
╞═════════════════════╪═════════════════════╡
│ 2022-01-01 10:00:00 | 2022-01-01 10:00:00 │
│ 2022-01-01 10:05:00 | 2022-01-01 10:00:00 │
│ 2022-01-01 10:05:00 | 2022-01-01 10:05:00 │
│ 2022-01-01 10:10:00 | 2022-01-01 10:00:00 │
│ 2022-01-01 10:10:00 | 2022-01-01 10:05:00 │
│ 2022-01-01 10:10:00 | 2022-01-01 10:10:00 │
│ 2022-01-01 10:15:00 | 2022-01-01 10:00:00 │
│ 2022-01-01 10:15:00 | 2022-01-01 10:05:00 │
│ 2022-01-01 10:15:00 | 2022-01-01 10:10:00 │
│ 2022-01-01 10:15:00 | 2022-01-01 10:15:00 │
│ ... | ... │
│ 2022-01-01 10:55:00 | 2022-01-01 10:45:00 │
│ 2022-01-01 10:55:00 | 2022-01-01 10:50:00 │
│ 2022-01-01 10:55:00 | 2022-01-01 10:55:00 │
│ 2022-01-01 11:00:00 | 2022-01-01 10:30:00 │
│ 2022-01-01 11:00:00 | 2022-01-01 10:35:00 │
│ 2022-01-01 11:00:00 | 2022-01-01 10:40:00 │
│ 2022-01-01 11:00:00 | 2022-01-01 10:45:00 │
│ 2022-01-01 11:00:00 | 2022-01-01 10:50:00 │
│ 2022-01-01 11:00:00 | 2022-01-01 10:55:00 │
│ 2022-01-01 11:00:00 | 2022-01-01 11:00:00 │
└─────────────────────┴─────────────────────┘
Something along the lines of:
[Edit: Removed the unneeded .join() per #ΩΠΟΚΕΚΡΥΜΜΕΝΟΣ's help.]
(df.groupby_rolling(index_column="date", period="30m", closed="both")
   .agg([pl.col("date").alias("window"), pl.col("price")])
   .explode(["window", "price"])
   .groupby_dynamic(by="date", index_column="window", every="10m", closed="right")
   .agg(pl.col("price"))  # pl.col("price").mean()
   .groupby("date", maintain_order=True)
   .agg(pl.all()))
shape: (13, 3)
┌─────────────────────┬─────────────────────────────────────┬──────────────────────────────────┐
│ date | window | price │
│ --- | --- | --- │
│ datetime[μs] | list[datetime[μs]] | list[list[i64]] │
╞═════════════════════╪═════════════════════════════════════╪══════════════════════════════════╡
│ 2022-01-01 10:00:00 | [2022-01-01 09:50:00] | [[10]] │
│ 2022-01-01 10:05:00 | [2022-01-01 09:50:00, 2022-01-01... | [[10], [20]] │
│ 2022-01-01 10:10:00 | [2022-01-01 09:50:00, 2022-01-01... | [[10], [20, 30]] │
│ 2022-01-01 10:15:00 | [2022-01-01 09:50:00, 2022-01-01... | [[10], [20, 30], [40]] │
│ 2022-01-01 10:20:00 | [2022-01-01 09:50:00, 2022-01-01... | [[10], [20, 30], [40, 50]] │
│ 2022-01-01 10:25:00 | [2022-01-01 09:50:00, 2022-01-01... | [[10], [20, 30], ... [60]] │
│ 2022-01-01 10:30:00 | [2022-01-01 09:50:00, 2022-01-01... | [[10], [20, 30], ... [60, 70]] │
│ 2022-01-01 10:35:00 | [2022-01-01 10:00:00, 2022-01-01... | [[20, 30], [40, 50], ... [80]] │
│ 2022-01-01 10:40:00 | [2022-01-01 10:00:00, 2022-01-01... | [[30], [40, 50], ... [80, 90]] │
│ 2022-01-01 10:45:00 | [2022-01-01 10:10:00, 2022-01-01... | [[40, 50], [60, 70], ... [100]] │
│ 2022-01-01 10:50:00 | [2022-01-01 10:10:00, 2022-01-01... | [[50], [60, 70], ... [100, 110]] │
│ 2022-01-01 10:55:00 | [2022-01-01 10:20:00, 2022-01-01... | [[60, 70], [80, 90], ... [120]] │
│ 2022-01-01 11:00:00 | [2022-01-01 10:20:00, 2022-01-01... | [[70], [80, 90], ... [120, 130]] │
└─────────────────────┴─────────────────────────────────────┴──────────────────────────────────┘
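To get the single list column from the desired output, the commented-out .mean() can be slotted in and each per-row list reversed so the most recent sub-group comes first. A sketch using the same (older) API as above; note that these sub-groups stay aligned to clock 10-minute boundaries, exactly as in the output above, so they only match the asker's row-anchored sub-groups where those boundaries coincide:
(df.groupby_rolling(index_column="date", period="30m", closed="both")
   .agg([pl.col("date").alias("window"), pl.col("price")])
   .explode(["window", "price"])
   .groupby_dynamic(by="date", index_column="window", every="10m", closed="right")
   .agg(pl.col("price").mean())  # mean price per 10-minute sub-group
   .groupby("date", maintain_order=True)
   .agg(pl.col("price").reverse()  # most recent sub-group first
        .alias("10_min_groups_mean_price_history"))
   .join(df, on="date"))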

Related

Polars get count of events prior to "this" event, but within given duration

I have been struggling to create a feature: a counter that counts the number of events prior to each event, where each prior event must have occurred within a given duration (dt). I know how to do it for all previous events; that is easy using cumsum and over on the given column. But if I want to count only events within, e.g., the last 2 days, how do I do that?
Below is how I do it (the wrong way) with cumsum.
import polars as pl
from datetime import date

df = pl.DataFrame(
    data={
        "Event": ["Rain", "Sun", "Rain", "Sun", "Rain", "Sun", "Rain", "Sun"],
        "Date": [
            date(2022, 1, 1),
            date(2022, 1, 2),
            date(2022, 1, 2),
            date(2022, 1, 3),
            date(2022, 1, 3),
            date(2022, 1, 5),
            date(2022, 1, 5),
            date(2022, 1, 8),
        ],
    }
)
df.with_columns(
    pl.col("Date").cumcount().over("Event").alias("cum_sum")
)
outputting
shape: (8, 3)
┌───────┬────────────┬─────────┐
│ Event ┆ Date ┆ cum_sum │
│ --- ┆ --- ┆ --- │
│ str ┆ date ┆ u32 │
╞═══════╪════════════╪═════════╡
│ Rain ┆ 2022-01-01 ┆ 0 │
│ Sun ┆ 2022-01-02 ┆ 0 │
│ Rain ┆ 2022-01-02 ┆ 1 │
│ Sun ┆ 2022-01-03 ┆ 1 │
│ Rain ┆ 2022-01-03 ┆ 2 │
│ Sun ┆ 2022-01-05 ┆ 2 │
│ Rain ┆ 2022-01-05 ┆ 3 │
│ Sun ┆ 2022-01-08 ┆ 3 │
└───────┴────────────┴─────────┘
What I would like to output is this:
shape: (8, 3)
┌───────┬────────────┬─────────┐
│ Event ┆ Date ┆ cum_sum │
│ --- ┆ --- ┆ --- │
│ str ┆ date ┆ u32 │
╞═══════╪════════════╪═════════╡
│ Rain ┆ 2022-01-01 ┆ 0 │
│ Sun ┆ 2022-01-02 ┆ 0 │
│ Rain ┆ 2022-01-02 ┆ 1 │
│ Sun ┆ 2022-01-03 ┆ 1 │
│ Rain ┆ 2022-01-03 ┆ 2 │
│ Sun ┆ 2022-01-05 ┆ 1 │
│ Rain ┆ 2022-01-05 ┆ 1 │
│ Sun ┆ 2022-01-08 ┆ 0 │
└───────┴────────────┴─────────┘
(Preferably, a solution that scales somewhat well.)
Thanks
Tried this without success
You can try a groupby_rolling for this.
(
    df
    .groupby_rolling(
        index_column="Date",
        period="2d",
        by="Event",
        closed="both",
    )
    .agg([
        pl.count() - 1
    ])
    .sort(["Date", "Event"], reverse=[False, True])
)
shape: (8, 3)
┌───────┬────────────┬───────┐
│ Event ┆ Date ┆ count │
│ --- ┆ --- ┆ --- │
│ str ┆ date ┆ u32 │
╞═══════╪════════════╪═══════╡
│ Rain ┆ 2022-01-01 ┆ 0 │
│ Sun ┆ 2022-01-02 ┆ 0 │
│ Rain ┆ 2022-01-02 ┆ 1 │
│ Sun ┆ 2022-01-03 ┆ 1 │
│ Rain ┆ 2022-01-03 ┆ 2 │
│ Sun ┆ 2022-01-05 ┆ 1 │
│ Rain ┆ 2022-01-05 ┆ 1 │
│ Sun ┆ 2022-01-08 ┆ 0 │
└───────┴────────────┴───────┘
We subtract one in the agg because we do not want to count the current event, only prior events. (The sort at the end is just to order the rows to match the original data.)
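If you need the count attached back onto the original frame in its original row order, one option is a left join on the grouping keys. A sketch, assuming each (Event, Date) pair is unique as in this example (duplicate pairs would need a row index to disambiguate):
counts = (
    df.groupby_rolling(index_column="Date", period="2d", by="Event", closed="both")
    .agg((pl.count() - 1).alias("cum_sum"))
)
# the left join keeps df's original row order
df.join(counts, on=["Event", "Date"], how="left")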

Join between Polars dataframes with inequality conditions

I would like to do a join between two dataframes, using an inequality condition (i.e. greater than) as the join condition.
Given two dataframes, I would like to get the result equivalent to the SQL written below.
import polars as pl
from datetime import date

stock_market_value = pl.DataFrame(
    {
        "date": [date(2022, 1, 1), date(2022, 2, 1), date(2022, 3, 1)],
        "price": [10.00, 12.00, 14.00],
    }
)
my_stock_orders = pl.DataFrame(
    {
        "date": [date(2022, 1, 15), date(2022, 2, 15)],
        "quantity": [2, 5],
    }
)
I have read that Polars supports asof joins, but I don't think they apply to my case (maybe with tolerance set to infinity?).
For the sake of clarity, I wrote the join in the form of an SQL statement.
SELECT m.date, m.price * o.quantity AS portfolio_value
FROM stock_market_value m LEFT JOIN my_stock_orders o
ON m.date >= o.date
Example query/output:
duckdb.sql("""
    SELECT
        m.date market_date,
        o.date order_date,
        price,
        quantity,
        price * quantity AS portfolio_value
    FROM stock_market_value m LEFT JOIN my_stock_orders o
    ON m.date >= o.date
""").pl()
shape: (4, 5)
┌─────────────┬────────────┬───────┬──────────┬─────────────────┐
│ market_date | order_date | price | quantity | portfolio_value │
│ --- | --- | --- | --- | --- │
│ date | date | f64 | i64 | f64 │
╞═════════════╪════════════╪═══════╪══════════╪═════════════════╡
│ 2022-01-01 | null | 10.0 | null | null │
│ 2022-02-01 | 2022-01-15 | 12.0 | 2 | 24.0 │
│ 2022-03-01 | 2022-01-15 | 14.0 | 2 | 28.0 │
│ 2022-03-01 | 2022-02-15 | 14.0 | 5 | 70.0 │
└─────────────┴────────────┴───────┴──────────┴─────────────────┘
Why asof() is not the solution
Comments suggested using asof, but it does not actually work the way I expect.
Forward asof
result_fwd = stock_market_value.join_asof(
    my_stock_orders, left_on="date", right_on="date", strategy="forward"
)
print(result_fwd)
shape: (3, 3)
┌────────────┬───────┬──────────┐
│ date ┆ price ┆ quantity │
│ --- ┆ --- ┆ --- │
│ date ┆ f64 ┆ i64 │
╞════════════╪═══════╪══════════╡
│ 2022-01-01 ┆ 10.0 ┆ 2 │
│ 2022-02-01 ┆ 12.0 ┆ 5 │
│ 2022-03-01 ┆ 14.0 ┆ null │
└────────────┴───────┴──────────┘
Backward asof
result_bwd = stock_market_value.join_asof(
    my_stock_orders, left_on="date", right_on="date", strategy="backward"
)
print(result_bwd)
shape: (3, 3)
┌────────────┬───────┬──────────┐
│ date ┆ price ┆ quantity │
│ --- ┆ --- ┆ --- │
│ date ┆ f64 ┆ i64 │
╞════════════╪═══════╪══════════╡
│ 2022-01-01 ┆ 10.0 ┆ null │
│ 2022-02-01 ┆ 12.0 ┆ 2 │
│ 2022-03-01 ┆ 14.0 ┆ 5 │
└────────────┴───────┴──────────┘
Thanks!
You can do a join_asof. If you want to look forward, you should use the forward strategy:
stock_market_value.join_asof(
    my_stock_orders,
    on='date',
    strategy='forward',
).with_columns((pl.col("price") * pl.col("quantity")).alias("value"))
┌────────────┬───────┬──────────┬───────┐
│ date ┆ price ┆ quantity ┆ value │
│ --- ┆ --- ┆ --- ┆ --- │
│ date ┆ f64 ┆ i64 ┆ f64 │
╞════════════╪═══════╪══════════╪═══════╡
│ 2022-01-01 ┆ 10.0 ┆ 2 ┆ 20.0 │
│ 2022-02-01 ┆ 12.0 ┆ 5 ┆ 60.0 │
│ 2022-03-01 ┆ 14.0 ┆ null ┆ null │
└────────────┴───────┴──────────┴───────┘
You can use join_asof to determine which records to exclude from the date logic, then perform a cartesian product + filter yourself on the remainder, then merge everything back together. The following implements what you want, although it's a little bit hacky.
Update: Using polars' native cross join instead of a self-defined cartesian product function.
import polars as pl
from polars import col
from datetime import date
stock_market_value = pl.DataFrame({
    "market_date": [date(2022, 1, 1), date(2022, 2, 1), date(2022, 3, 1)],
    "price": [10.00, 12.00, 14.00],
})
stock_market_orders = pl.DataFrame({
    "order_date": [date(2022, 1, 15), date(2022, 2, 15)],
    "quantity": [2, 5],
})
# use a backwards join-asof to find rows in market_value that have no rows in orders with order date < market date
stock_market_value = stock_market_value.with_columns(
    stock_market_value.join_asof(
        stock_market_orders,
        left_on="market_date",
        right_on="order_date",
    )["order_date"].is_not_null().alias("has_match")
)
nonmatched_rows = stock_market_value.filter(col("has_match")==False).drop("has_match")
# keep all other rows and perform a cartesian product
matched_rows = stock_market_value.filter(col("has_match")==True).drop("has_match")
df = matched_rows.join(stock_market_orders, how="cross")
# filter based on our join condition
df = df.filter(col("market_date") > col("order_date"))
# concatenate the unmatched with the filtered result for our final answer
df = pl.concat((nonmatched_rows, df), how="diagonal")
print(df)
Output:
shape: (4, 4)
┌─────────────┬───────┬────────────┬──────────┐
│ market_date ┆ price ┆ order_date ┆ quantity │
│ --- ┆ --- ┆ --- ┆ --- │
│ date ┆ f64 ┆ date ┆ i64 │
╞═════════════╪═══════╪════════════╪══════════╡
│ 2022-01-01 ┆ 10.0 ┆ null ┆ null │
│ 2022-02-01 ┆ 12.0 ┆ 2022-01-15 ┆ 2 │
│ 2022-03-01 ┆ 14.0 ┆ 2022-01-15 ┆ 2 │
│ 2022-03-01 ┆ 14.0 ┆ 2022-02-15 ┆ 5 │
└─────────────┴───────┴────────────┴──────────┘

iterate through groupby like pandas with a tuple

So when I iterate through a pandas groupby(), what I get back is a tuple. This was important because I could do [x for x in df_pandas.sort('date').groupby('grouping_column')] and then sort this list of tuples based on x[0].
In pandas the groups are also auto-sorted after a groupby.
I did that to have a consistent output in plotly (area chart).
Now with polars I can't do the same; I just get the dataframe back. Is there any way to accomplish the same thing?
I tried adding a sort([pl.col('date'), pl.col('grouping_column')]) but it had no effect.
What's in my mind for polars is this:
for value in df.select('grouping_column').unique().to_numpy():
    sub_df = df.filter(pl.col('grouping_column') == value)
    ...
This will in fact give the desired results, because it always iterates through the same sequence, while the groupby order is effectively random and doesn't seem to matter at all.
My problem is that this solution does not seem very efficient.
The other thing I could do is
[(sub_df['some_col'].to_numpy()[0], sub_df) for sub_df in df.groupby('some_col')]
Then use Python's sort to order the list based on the key in the tuple x[0] and iterate through the list again. However, this solution seems quite ugly as well.
You can use the partition_by function to create a dictionary of key-value pairs, where the keys are your grouping_column values and the values are DataFrames.
For example, let's say we have this data:
import polars as pl
from datetime import datetime
df = pl.DataFrame({"grouping_column": [1, 2, 3], }).join(
pl.DataFrame(
{
"date": pl.date_range(datetime(2020, 1, 1), datetime(2020, 3, 1), "1mo"),
}
),
how="cross",
)
df
shape: (9, 2)
┌─────────────────┬─────────────────────┐
│ grouping_column ┆ date │
│ --- ┆ --- │
│ i64 ┆ datetime[ns] │
╞═════════════════╪═════════════════════╡
│ 1 ┆ 2020-01-01 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 2020-02-01 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 2020-03-01 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2020-01-01 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ... │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2020-03-01 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 2020-01-01 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 2020-02-01 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 2020-03-01 00:00:00 │
└─────────────────┴─────────────────────┘
We can split the DataFrame into a dictionary.
df.partition_by(groups='grouping_column', maintain_order=True, as_dict=True)
{1: shape: (3, 2)
┌─────────────────┬─────────────────────┐
│ grouping_column ┆ date │
│ --- ┆ --- │
│ i64 ┆ datetime[ns] │
╞═════════════════╪═════════════════════╡
│ 1 ┆ 2020-01-01 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 2020-02-01 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 2020-03-01 00:00:00 │
└─────────────────┴─────────────────────┘,
2: shape: (3, 2)
┌─────────────────┬─────────────────────┐
│ grouping_column ┆ date │
│ --- ┆ --- │
│ i64 ┆ datetime[ns] │
╞═════════════════╪═════════════════════╡
│ 2 ┆ 2020-01-01 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2020-02-01 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2020-03-01 00:00:00 │
└─────────────────┴─────────────────────┘,
3: shape: (3, 2)
┌─────────────────┬─────────────────────┐
│ grouping_column ┆ date │
│ --- ┆ --- │
│ i64 ┆ datetime[ns] │
╞═════════════════╪═════════════════════╡
│ 3 ┆ 2020-01-01 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 2020-02-01 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 2020-03-01 00:00:00 │
└─────────────────┴─────────────────────┘}
From there, you can create the tuples using the items() method of Python's dictionaries.
for x in df.partition_by(groups='grouping_column', maintain_order=True, as_dict=True).items():
    print("next item")
    print(x)
next item
(1, shape: (3, 2)
┌─────────────────┬─────────────────────┐
│ grouping_column ┆ date │
│ --- ┆ --- │
│ i64 ┆ datetime[ns] │
╞═════════════════╪═════════════════════╡
│ 1 ┆ 2020-01-01 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 2020-02-01 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 2020-03-01 00:00:00 │
└─────────────────┴─────────────────────┘)
next item
(2, shape: (3, 2)
┌─────────────────┬─────────────────────┐
│ grouping_column ┆ date │
│ --- ┆ --- │
│ i64 ┆ datetime[ns] │
╞═════════════════╪═════════════════════╡
│ 2 ┆ 2020-01-01 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2020-02-01 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2020-03-01 00:00:00 │
└─────────────────┴─────────────────────┘)
next item
(3, shape: (3, 2)
┌─────────────────┬─────────────────────┐
│ grouping_column ┆ date │
│ --- ┆ --- │
│ i64 ┆ datetime[ns] │
╞═════════════════╪═════════════════════╡
│ 3 ┆ 2020-01-01 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 2020-02-01 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 2020-03-01 00:00:00 │
└─────────────────┴─────────────────────┘)
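To mirror the pandas pattern of iterating (key, sub-frame) tuples in a fixed order, the dictionary items can simply be sorted by key. A small sketch:
parts = df.partition_by(groups='grouping_column', maintain_order=True, as_dict=True)
for key, sub_df in sorted(parts.items()):
    # key is the grouping_column value, sub_df the matching DataFrame,
    # visited in the same (sorted) order every time - like iterating a pandas groupby
    print(key, sub_df.shape)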

convert a pandas loc operation that needed the index to assign values to polars

In this example I have three columns: 'DayOfWeek', 'Time' and 'Risk'.
I want to group by 'DayOfWeek', take only the first element, and assign a high risk to it. This means the first known hour in each day of the week is the one with the highest risk. The rest are initialized to 'Low' risk.
In pandas I had an additional column for the index, but in polars I do not. I could artificially create one, but is it even necessary?
Can I do this in a smarter way with polars?
df['risk'] = "Low"
df = df.sort('Time')
df.loc[df.groupby("DayOfWeek").head(1).index, "risk"] = "High"
The index is unique in this case and goes to range(n)
Here is my solution, by the way (I don't really like it):
df = df.with_column(pl.arange(0, df.shape[0]).alias('pseudo_index'))
# find the lowest time for each day
indexes_df = df.sort('Time').groupby('DayOfWeek').head(1)
# set 'High' as the risk for all rows from the groupby
indexes_df = indexes_df.select('pseudo_index').with_column(pl.lit('High').alias('risk'))
# a left join will generate null values for all rows whose 'pseudo_index' is not in indexes_df
df = df.join(indexes_df, how='left', on='pseudo_index').select([
    pl.all().exclude(['pseudo_index', 'risk']),
    pl.col('risk').fill_null(pl.lit('Low')),
])
You can use window functions to find where the first "index" of the "DayOfWeek" group equals the "index" column.
For that we only need to set an "index" column. We can do that easily with:
A method: df.with_row_count(<name>)
An expression: pl.arange(0, pl.count()).alias(<name>)
After that we can use this predicate:
pl.first("index").over("DayOfWeek") == pl.col("index")
Finally we use a when -> then -> otherwise expression to use that condition and create our new "Risk" column.
Example
Let's start with some data. In the snippet below I create an hourly date range and then determine the weekdays from that.
Preparing data
import polars as pl
from datetime import datetime

df = pl.DataFrame({
    "Time": pl.date_range(datetime(2022, 6, 1), datetime(2022, 6, 30), "1h").sample(frac=1.5, with_replacement=True).sort(),
}).select([
    pl.arange(0, pl.count()).alias("index"),
    pl.all(),
    pl.col("Time").dt.weekday().alias("DayOfWeek"),
])
print(df)
shape: (1045, 3)
┌───────┬─────────────────────┬───────────┐
│ index ┆ Time ┆ DayOfWeek │
│ --- ┆ --- ┆ --- │
│ i64 ┆ datetime[ns] ┆ u32 │
╞═══════╪═════════════════════╪═══════════╡
│ 0 ┆ 2022-06-29 22:00:00 ┆ 3 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 2022-06-14 11:00:00 ┆ 2 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2022-06-11 21:00:00 ┆ 6 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 2022-06-27 20:00:00 ┆ 1 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ... ┆ ... │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 1041 ┆ 2022-06-11 09:00:00 ┆ 6 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 1042 ┆ 2022-06-18 22:00:00 ┆ 6 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 1043 ┆ 2022-06-18 01:00:00 ┆ 6 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 1044 ┆ 2022-06-23 18:00:00 ┆ 4 │
└───────┴─────────────────────┴───────────┘
Computing Risk values
df = df.with_column(
    pl.when(
        pl.first("index").over("DayOfWeek") == pl.col("index")
    ).then(
        "High"
    ).otherwise(
        "Low"
    ).alias("Risk")
).drop("index")
print(df)
shape: (1045, 3)
┌─────────────────────┬───────────┬──────┐
│ Time ┆ DayOfWeek ┆ Risk │
│ --- ┆ --- ┆ --- │
│ datetime[ns] ┆ u32 ┆ str │
╞═════════════════════╪═══════════╪══════╡
│ 2022-06-29 22:00:00 ┆ 3 ┆ High │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2022-06-14 11:00:00 ┆ 2 ┆ High │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2022-06-11 21:00:00 ┆ 6 ┆ High │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2022-06-27 20:00:00 ┆ 1 ┆ High │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ ... ┆ ... ┆ ... │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2022-06-11 09:00:00 ┆ 6 ┆ Low │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2022-06-18 22:00:00 ┆ 6 ┆ Low │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2022-06-18 01:00:00 ┆ 6 ┆ Low │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2022-06-23 18:00:00 ┆ 4 ┆ Low │
└─────────────────────┴───────────┴──────┘

In Polars how can I display a single row from a dataframe vertically like a pandas series?

I have a polars dataframe with many columns. I want to look at all the data from a single row aligned vertically so that I can see the values in many different columns without it going off the edge of the screen. How can I do this?
E.g. define a dataframe
df = pl.DataFrame({'a':[0,1],'b':[2,3]})
Printing df[0] in IPython/Jupyter shows the row laid out horizontally, but if I convert df to pandas and print df.iloc[0] I get the row displayed vertically, one value per line.
The latter is very handy when you've got many columns.
I've tried things like df[0].to_series(), but it only prints the first element, not the first row.
My suspicion is that there isn't a direct replacement because the pandas method relies on the series having an index. I think the polars solution will be more like making a two column dataframe where one column is the column names and the other is a value. I'm not sure if there's a method to do that though.
Thanks for any help you can offer!
import polars as pl
import numpy as np
# Create dataframe with lots of columns.
df = pl.DataFrame(np.random.randint(0, 1000, (5, 100)))
df
shape: (5, 100)
┌──────────┬──────────┬──────────┬──────────┬─────┬───────────┬───────────┬───────────┬───────────┐
│ column_0 ┆ column_1 ┆ column_2 ┆ column_3 ┆ ... ┆ column_96 ┆ column_97 ┆ column_98 ┆ column_99 │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞══════════╪══════════╪══════════╪══════════╪═════╪═══════════╪═══════════╪═══════════╪═══════════╡
│ 285 ┆ 366 ┆ 886 ┆ 981 ┆ ... ┆ 63 ┆ 326 ┆ 882 ┆ 564 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 735 ┆ 269 ┆ 381 ┆ 78 ┆ ... ┆ 556 ┆ 737 ┆ 741 ┆ 768 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 543 ┆ 729 ┆ 915 ┆ 901 ┆ ... ┆ 48 ┆ 21 ┆ 277 ┆ 818 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 264 ┆ 424 ┆ 285 ┆ 540 ┆ ... ┆ 602 ┆ 584 ┆ 888 ┆ 836 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 269 ┆ 701 ┆ 483 ┆ 817 ┆ ... ┆ 579 ┆ 873 ┆ 192 ┆ 734 │
└──────────┴──────────┴──────────┴──────────┴─────┴───────────┴───────────┴───────────┴───────────┘
# Display row 3, by creating a tuple of column name and value for row 3.
tuple(zip(df.columns, df.row(2)))
(('column_0', 543),
('column_1', 729),
('column_2', 915),
('column_3', 901),
('column_4', 332),
('column_5', 156),
('column_6', 624),
('column_7', 37),
('column_8', 341),
('column_9', 503),
('column_10', 135),
('column_11', 183),
('column_12', 651),
('column_13', 910),
('column_14', 625),
('column_15', 129),
('column_16', 604),
('column_17', 671),
('column_18', 976),
('column_19', 558),
('column_20', 159),
('column_21', 314),
('column_22', 460),
('column_23', 49),
('column_24', 944),
('column_25', 6),
('column_26', 470),
('column_27', 228),
('column_28', 615),
('column_29', 230),
('column_30', 217),
('column_31', 66),
('column_32', 999),
('column_33', 440),
('column_34', 519),
('column_35', 851),
('column_36', 37),
('column_37', 859),
('column_38', 560),
('column_39', 870),
('column_40', 892),
('column_41', 192),
('column_42', 541),
('column_43', 136),
('column_44', 631),
('column_45', 22),
('column_46', 522),
('column_47', 225),
('column_48', 610),
('column_49', 191),
('column_50', 886),
('column_51', 454),
('column_52', 312),
('column_53', 956),
('column_54', 473),
('column_55', 851),
('column_56', 760),
('column_57', 224),
('column_58', 859),
('column_59', 442),
('column_60', 234),
('column_61', 788),
('column_62', 53),
('column_63', 999),
('column_64', 473),
('column_65', 237),
('column_66', 247),
('column_67', 307),
('column_68', 916),
('column_69', 94),
('column_70', 714),
('column_71', 233),
('column_72', 995),
('column_73', 335),
('column_74', 454),
('column_75', 801),
('column_76', 742),
('column_77', 386),
('column_78', 196),
('column_79', 239),
('column_80', 723),
('column_81', 59),
('column_82', 929),
('column_83', 852),
('column_84', 722),
('column_85', 328),
('column_86', 59),
('column_87', 710),
('column_88', 238),
('column_89', 823),
('column_90', 75),
('column_91', 307),
('column_92', 472),
('column_93', 822),
('column_94', 582),
('column_95', 802),
('column_96', 48),
('column_97', 21),
('column_98', 277),
('column_99', 818))
Pandas does not display all values either if you have many columns.
In [121]: df.to_pandas().iloc[0]
Out[121]:
column_0 285
column_1 366
column_2 886
column_3 981
column_4 464
...
column_95 862
column_96 63
column_97 326
column_98 882
column_99 564
Name: 0, Length: 100, dtype: int64
You can try using melt. For example:
df = pl.DataFrame(
    [
        pl.Series(name="col_str", values=["string1", "string2"]),
        pl.Series(name="col_bool", values=[False, True]),
        pl.Series(name="col_int", values=[1, 2]),
        pl.Series(name="col_float", values=[10.0, 20.0]),
        *[pl.Series(name=f"col_other_{idx}", values=[idx] * 2)
          for idx in range(1, 25)],
    ]
)
print(df)
shape: (2, 28)
┌─────────┬──────────┬─────────┬───────────┬─────┬──────────────┬──────────────┬──────────────┬──────────────┐
│ col_str ┆ col_bool ┆ col_int ┆ col_float ┆ ... ┆ col_other_21 ┆ col_other_22 ┆ col_other_23 ┆ col_other_24 │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ bool ┆ i64 ┆ f64 ┆ ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════════╪══════════╪═════════╪═══════════╪═════╪══════════════╪══════════════╪══════════════╪══════════════╡
│ string1 ┆ false ┆ 1 ┆ 10.0 ┆ ... ┆ 21 ┆ 22 ┆ 23 ┆ 24 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ string2 ┆ true ┆ 2 ┆ 20.0 ┆ ... ┆ 21 ┆ 22 ┆ 23 ┆ 24 │
└─────────┴──────────┴─────────┴───────────┴─────┴──────────────┴──────────────┴──────────────┴──────────────┘
To print the first row:
pl.Config.set_tbl_rows(100)
df[0,].melt()
shape: (28, 2)
┌──────────────┬─────────┐
│ variable ┆ value │
│ --- ┆ --- │
│ str ┆ str │
╞══════════════╪═════════╡
│ col_str ┆ string1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_bool ┆ false │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_int ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_float ┆ 10.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_1 ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_2 ┆ 2 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_3 ┆ 3 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_4 ┆ 4 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_5 ┆ 5 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_6 ┆ 6 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_7 ┆ 7 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_8 ┆ 8 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_9 ┆ 9 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_10 ┆ 10 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_11 ┆ 11 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_12 ┆ 12 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_13 ┆ 13 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_14 ┆ 14 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_15 ┆ 15 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_16 ┆ 16 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_17 ┆ 17 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_18 ┆ 18 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_19 ┆ 19 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_20 ┆ 20 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_21 ┆ 21 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_22 ┆ 22 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_23 ┆ 23 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_24 ┆ 24      │
└──────────────┴─────────┘
If needed, set the polars.Config.set_tbl_rows option to the number of rows you find acceptable. (This only needs to be done once per session, not every time you print.)
Notice that all values have been cast to super-type str. (One caution: this approach won't work if any of your columns are of dtype list.)
You may want to check the Polars Cookbook section about indexing.
It states that:
| pandas     | polars   |
|------------|----------|
| select row |          |
| df.iloc[2] | df[2, :] |
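For example, a minimal sketch of that polars indexing:
import polars as pl

df = pl.DataFrame({"a": [0, 1, 2], "b": [3, 4, 5]})
df[2, :]  # third row, returned as a one-row DataFrame - analogous to df.iloc[2] in pandas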
Cheers!