How to select the top_k of one column, sorted by another column, within a third column in Polars? - python-polars

I was writing some code and realized this might be a reasonably common operation. I also realized I don't know a clean way to do it.
The question is: Get the top 5 entries in column1, sorted by column2, within groups given by column3.
If I had to intuit how this would be written in Polars, it'd be:
df.select(pl.col('column1').top_k(n=5, by='column2').over('column3'))
But note that this is made-up code; it does not work.
Consider this sample data:
import numpy as np
import pandas as pd
import polars as pl
data_size = 10_000_000
np.random.seed(1)  # note: seed() must be called; assigning `np.random.seed = 1` would silently replace the function
saleValue = np.random.randint(0, 100, data_size)
storeId = np.random.choice([f'Store: {i}' for i in range(200_000)], replace=True, size=data_size)
customerId = np.random.choice([f'Customer: {i}' for i in range(1_000)], replace=True, size=data_size)
df = pd.DataFrame(
    dict(storeId=storeId, customerId=customerId, saleValue=saleValue)
).pipe(pl.from_pandas)
It generates a dataframe of the form:
┌───────────────┬───────────────┬───────────┐
│ storeId ┆ customerId ┆ saleValue │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i32 │
╞═══════════════╪═══════════════╪═══════════╡
│ Store: 161472 ┆ Customer: 960 ┆ 29 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ Store: 168620 ┆ Customer: 814 ┆ 21 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ Store: 37904 ┆ Customer: 80 ┆ 61 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ Store: 166077 ┆ Customer: 516 ┆ 23 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ Store: 141748 ┆ Customer: 549 ┆ 58 │
└─///───────────┴─///───────────┴─///───────┘
I'm curious how one would get the top 5 customers per store, sorted by their total spend.
One solution is:
(df
    # This part is essential; we need to get the total spend (sales)
    .groupby(['storeId', 'customerId'])
    .agg(pl.col('saleValue').sum().alias('totalSales'))
    # This is the part I think could be cleaner
    .sort('totalSales', reverse=True)
    .groupby('storeId')
    .agg(pl.col('customerId').head(5).list().alias('customerIds'))
)
┌───────────────┬─────────────────────────────────────┐
│ storeId ┆ customerIds │
│ --- ┆ --- │
│ str ┆ list[str] │
╞═══════════════╪═════════════════════════════════════╡
│ Store: 78152 ┆ ["Customer: 753", "Customer: 170... │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Store: 67676 ┆ ["Customer: 957", "Customer: 896... │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Store: 45152 ┆ ["Customer: 118", "Customer: 127... │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Store: 183339 ┆ ["Customer: 370", "Customer: 227... │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Store: 144688 ┆ ["Customer: 328", "Customer: 294... │
└─///───────────┴─///─────────────────────────────────┘
But I wonder if there is something cleaner using .top_k.

It can indeed be done a bit more cleanly, although you are close.
Define the first step as df_agg (just for the sake of this answer; in practice you can chain):
df_agg = df.groupby(['storeId','customerId']).agg(pl.col('saleValue').sum().alias('totalSales'))
Then we can do:
df_agg.groupby('storeId').agg(pl.col('customerId').sort_by('totalSales', reverse=True).slice(0,5))
Which reads as:
group by store
take the customerId column
sort the values of this column by totalSales, from high to low (reverse=True)
take the first 5 values of customerId
So we do the groupby like you proposed, but the sorting is done inside the aggregation using sort_by, rather than on the full dataframe. Also, I use slice rather than head + list.
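For completeness, the two steps can also be written as one chain (a sketch of mine, equivalent to the above):
(
    df
    .groupby(['storeId', 'customerId'])
    .agg(pl.col('saleValue').sum().alias('totalSales'))
    .groupby('storeId')
    .agg(
        pl.col('customerId')
        .sort_by('totalSales', reverse=True)
        .slice(0, 5)
        .alias('customerIds')
    )
)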
On your point about top_k: this function returns the largest elements of the column it is called on; it cannot rank by another column. Polars has sufficient ways of achieving what you want, notably sort_by, so I don't think there is a need to complicate the implementation of top_k by adding a by argument.
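For reference, a minimal illustration of that point (a snippet of mine): top_k ranks the values of the series it is called on and nothing else:
s = pl.Series('x', [3, 8, 1, 5, 2])
s.top_k(k=2)  # the two largest values of x itself: 8 and 5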

Would using top_k in a filter work for you? For example:
(
    df.groupby(["storeId", "customerId"])
    .agg(pl.col("saleValue").sum().alias("totalSales"))
    .filter(
        pl.col("totalSales")
        >= pl.col("totalSales").top_k(k=5).list().over("storeId").arr.last()
    )
)
shape: (1048652, 3)
┌───────────────┬───────────────┬────────────┐
│ storeId ┆ customerId ┆ totalSales │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 │
╞═══════════════╪═══════════════╪════════════╡
│ Store: 92626 ┆ Customer: 829 ┆ 98 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Store: 56532 ┆ Customer: 840 ┆ 93 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Store: 159073 ┆ Customer: 684 ┆ 88 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Store: 131292 ┆ Customer: 836 ┆ 98 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Store: 73245 ┆ Customer: 545 ┆ 93 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Store: 31163 ┆ Customer: 554 ┆ 91 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Store: 128047 ┆ Customer: 971 ┆ 89 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Store: 41563 ┆ Customer: 85 ┆ 92 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Store: 157951 ┆ Customer: 45 ┆ 97 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Store: 7677 ┆ Customer: 390 ┆ 88 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ... ┆ ... │
└───────────────┴───────────────┴────────────┘
Adding a sort to make the result easier to inspect:
(
    df.groupby(["storeId", "customerId"])
    .agg(pl.col("saleValue").sum().alias("totalSales"))
    .filter(
        pl.col("totalSales")
        >= pl.col("totalSales").top_k(k=5).list().over("storeId").arr.last()
    )
    .sort(["storeId", "totalSales"], reverse=[False, True])
)
shape: (1048652, 3)
┌──────────────┬───────────────┬────────────┐
│ storeId ┆ customerId ┆ totalSales │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 │
╞══════════════╪═══════════════╪════════════╡
│ Store: 0 ┆ Customer: 46 ┆ 151 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Store: 0 ┆ Customer: 267 ┆ 102 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Store: 0 ┆ Customer: 354 ┆ 94 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Store: 0 ┆ Customer: 416 ┆ 93 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Store: 0 ┆ Customer: 729 ┆ 93 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Store: 1 ┆ Customer: 459 ┆ 99 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Store: 1 ┆ Customer: 417 ┆ 90 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Store: 1 ┆ Customer: 982 ┆ 89 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Store: 1 ┆ Customer: 337 ┆ 86 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Store: 1 ┆ Customer: 202 ┆ 84 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ... ┆ ... │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Store: 99998 ┆ Customer: 536 ┆ 99 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Store: 99998 ┆ Customer: 295 ┆ 99 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Store: 99998 ┆ Customer: 841 ┆ 98 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Store: 99998 ┆ Customer: 782 ┆ 94 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Store: 99999 ┆ Customer: 29 ┆ 96 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Store: 99999 ┆ Customer: 84 ┆ 96 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Store: 99999 ┆ Customer: 557 ┆ 96 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Store: 99999 ┆ Customer: 885 ┆ 91 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Store: 99999 ┆ Customer: 866 ┆ 89 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Store: 99999 ┆ Customer: 695 ┆ 89 │
└──────────────┴───────────────┴────────────┘
From the above, you can proceed to create a list for each store (or whatever you need).
Note: this approach may produce more than 5 customers per store. (It does not break ties, as shown for Store 99999 at the bottom).
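If you need exactly five customers per store even when totals tie, one variation (a sketch of mine, not from the answer) is to sort first and keep a per-store row index:
(
    df.groupby(["storeId", "customerId"])
    .agg(pl.col("saleValue").sum().alias("totalSales"))
    .sort("totalSales", reverse=True)
    # 0-based position within each store, in descending-spend order;
    # ties are broken arbitrarily by the sort.
    .with_column(pl.col("totalSales").cumcount().over("storeId").alias("rank"))
    .filter(pl.col("rank") < 5)
)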

Related

Polars get count of events prior to "this" event, but within given duration

I have been struggling with creating a feature: a counter that counts the number of events prior to each event, where each prior event must have occurred within a given duration (dt). I know how to do it for all previous events; that is easy using cumcount and over on the given column. But if I only want to count events within e.g. the last 2 days, how do I do that?
Below is how I do it (the wrong way) with cumcount.
import polars as pl
from datetime import date
df = pl.DataFrame(
    data={
        "Event": ["Rain", "Sun", "Rain", "Sun", "Rain", "Sun", "Rain", "Sun"],
        "Date": [
            date(2022, 1, 1),
            date(2022, 1, 2),
            date(2022, 1, 2),
            date(2022, 1, 3),
            date(2022, 1, 3),
            date(2022, 1, 5),
            date(2022, 1, 5),
            date(2022, 1, 8),
        ],
    }
)
df.with_column(
    pl.col("Date").cumcount().over("Event").alias("cum_sum")
)
outputting
shape: (8, 3)
┌───────┬────────────┬─────────┐
│ Event ┆ Date ┆ cum_sum │
│ --- ┆ --- ┆ --- │
│ str ┆ date ┆ u32 │
╞═══════╪════════════╪═════════╡
│ Rain ┆ 2022-01-01 ┆ 0 │
│ Sun ┆ 2022-01-02 ┆ 0 │
│ Rain ┆ 2022-01-02 ┆ 1 │
│ Sun ┆ 2022-01-03 ┆ 1 │
│ Rain ┆ 2022-01-03 ┆ 2 │
│ Sun ┆ 2022-01-05 ┆ 2 │
│ Rain ┆ 2022-01-05 ┆ 3 │
│ Sun ┆ 2022-01-08 ┆ 3 │
└───────┴────────────┴─────────┘
What I would like to output is this:
shape: (8, 3)
┌───────┬────────────┬─────────┐
│ Event ┆ Date ┆ cum_sum │
│ --- ┆ --- ┆ --- │
│ str ┆ date ┆ u32 │
╞═══════╪════════════╪═════════╡
│ Rain ┆ 2022-01-01 ┆ 0 │
│ Sun ┆ 2022-01-02 ┆ 0 │
│ Rain ┆ 2022-01-02 ┆ 1 │
│ Sun ┆ 2022-01-03 ┆ 1 │
│ Rain ┆ 2022-01-03 ┆ 2 │
│ Sun ┆ 2022-01-05 ┆ 1 │
│ Rain ┆ 2022-01-05 ┆ 1 │
│ Sun ┆ 2022-01-08 ┆ 0 │
└───────┴────────────┴─────────┘
(Preferably, a solution that scales somewhat well.)
Thanks
You can try a groupby_rolling for this.
(
    df
    .groupby_rolling(
        index_column="Date",
        period="2d",
        by="Event",
        closed='both',
    )
    .agg([
        pl.count() - 1
    ])
    .sort(["Date", "Event"], reverse=[False, True])
)
shape: (8, 3)
┌───────┬────────────┬───────┐
│ Event ┆ Date ┆ count │
│ --- ┆ --- ┆ --- │
│ str ┆ date ┆ u32 │
╞═══════╪════════════╪═══════╡
│ Rain ┆ 2022-01-01 ┆ 0 │
│ Sun ┆ 2022-01-02 ┆ 0 │
│ Rain ┆ 2022-01-02 ┆ 1 │
│ Sun ┆ 2022-01-03 ┆ 1 │
│ Rain ┆ 2022-01-03 ┆ 2 │
│ Sun ┆ 2022-01-05 ┆ 1 │
│ Rain ┆ 2022-01-05 ┆ 1 │
│ Sun ┆ 2022-01-08 ┆ 0 │
└───────┴────────────┴───────┘
We subtract one in the agg because we do not want to count the current event, only prior events. (The sort at the end is just to order the rows to match the original data.)
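A small variation (my addition, same logic) aliases the expression inside the agg so the column name matches the desired output:
(
    df
    .groupby_rolling(
        index_column="Date",
        period="2d",
        by="Event",
        closed='both',
    )
    .agg((pl.count() - 1).alias("cum_sum"))
    .sort(["Date", "Event"], reverse=[False, True])
)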

How to group items by difference in percent in Polars?

I'd like to group values so that the difference between the items in each group stays within a certain percentage. E.g. each time an item is 5% over the first element of the current group, it goes into a new group. I also need the first value of each group returned. Example with a 5% threshold, where 'a' is given and 'group' and 'groupFirst' must be calculated:
import polars as pl
df = pl.DataFrame({'a': [100, 103, 105, 106, 105, 104, 103, 106, 100, 102],
                   'group': [0, 0, 1, 1, 1, 1, 1, 1, 2, 2],
                   'groupFirst': [100, 100, 105, 105, 105, 105, 105, 105, 100, 100]})
print(df)
shape: (10, 3)
┌─────┬───────┬────────────┐
│ a ┆ group ┆ groupFirst │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═══════╪════════════╡
│ 100 ┆ 0 ┆ 100 │
│ 103 ┆ 0 ┆ 100 │
│ 105 ┆ 1 ┆ 105 │
│ 106 ┆ 1 ┆ 105 │
│ ... ┆ ... ┆ ... │
│ 103 ┆ 1 ┆ 105 │
│ 106 ┆ 1 ┆ 105 │
│ 100 ┆ 2 ┆ 100 │
│ 102 ┆ 2 ┆ 100 │
└─────┴───────┴────────────┘
Say you want to reset cummax when your values exceed the value 6. You could do:
(
    df.with_columns(
        (pl.col("a") >= 6)
        .shift(1)
        .fill_null(False)
        .cumsum()
        .alias("group")
    ).with_columns(
        pl.col("a")
        .cummax()
        .over(pl.col("group"))
        .alias("cummax")
    )
)
Example:
In [73]: df = pl.DataFrame({'a': [1, 3, 5, 6, 1, 4, 3, 6, 5, 6]})
In [74]: (
    ...:     df.with_columns(
    ...:         (pl.col("a") >= 6)
    ...:         .shift(1)
    ...:         .fill_null(False)
    ...:         .cumsum()
    ...:         .alias("group")
    ...:     ).with_columns(
    ...:         pl.col("a")
    ...:         .cummax()
    ...:         .over(pl.col("group"))
    ...:         .alias("cummax")
    ...:     )
    ...: )
Out[74]:
shape: (10, 3)
┌─────┬───────┬────────┐
│ a ┆ group ┆ cummax │
│ --- ┆ --- ┆ --- │
│ i64 ┆ u32 ┆ i64 │
╞═════╪═══════╪════════╡
│ 1 ┆ 0 ┆ 1 │
│ 3 ┆ 0 ┆ 3 │
│ 5 ┆ 0 ┆ 5 │
│ 6 ┆ 0 ┆ 6 │
│ ... ┆ ... ┆ ... │
│ 3 ┆ 1 ┆ 4 │
│ 6 ┆ 1 ┆ 6 │
│ 5 ┆ 2 ┆ 5 │
│ 6 ┆ 2 ┆ 6 │
└─────┴───────┴────────┘
I don't think there is a way to use polars expressions to generate the groups, since each group always depends on the previous one. That being said, the groups can easily be generated in O(n), so the penalty for doing that in Python should be minor.
import numpy as np

def make_groups(a, threshold=1.05):
    # Assign 0-based group ids: start a new group whenever the current value
    # deviates from the first value of the current group by more than
    # `threshold` in either direction.
    a = np.array(a)
    outarray = np.empty(len(a), dtype=a.dtype)
    outarray[0] = 0
    curgroup = a[0]
    for indx, cur_a in enumerate(a[1:], 1):
        if cur_a >= threshold * curgroup or cur_a * threshold <= curgroup:
            outarray[indx] = outarray[indx - 1] + 1
            curgroup = cur_a
        else:
            outarray[indx] = outarray[indx - 1]
    return pl.Series(outarray)
So now let's take that to our data.
df = pl.DataFrame({'a': [100, 103, 105, 106, 105, 104, 103, 106, 100, 102]})
We just do a map (incidentally, I tried making make_groups into a np.ufunc but couldn't get it to work).
df \
    .with_columns(pl.col('a').map(lambda x: make_groups(x, 1.05)).alias('group')) \
    .with_columns(pl.col('a').list().over('group').arr.first().alias('groupFirst'))
shape: (10, 3)
┌─────┬───────┬────────────┐
│ a ┆ group ┆ groupFirst │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═══════╪════════════╡
│ 100 ┆ 0 ┆ 100 │
│ 103 ┆ 0 ┆ 100 │
│ 105 ┆ 1 ┆ 105 │
│ 106 ┆ 1 ┆ 105 │
│ ... ┆ ... ┆ ... │
│ 103 ┆ 1 ┆ 105 │
│ 106 ┆ 1 ┆ 105 │
│ 100 ┆ 2 ┆ 100 │
│ 102 ┆ 2 ┆ 100 │
└─────┴───────┴────────────┘
By the way, if you want to use the default threshold, you can simply do...
df \
    .with_columns(pl.col('a').map(make_groups).alias('group')) \
    .with_columns(pl.col('a').list().over('group').arr.first().alias('groupFirst'))

Computing and retrieving operations at the group level without collapsing data frame in polars?

I am trying to compute a stat (or several) at the group level without having to create a second data frame. Currently I do this by generating a second data frame with the desired aggregation, which I then merge back onto the original one.
A silly example:
import polars as pl
df = pl.DataFrame({'name': ['Steve', 'Larry', 'Tom', 'Steve', 'Tom', 'Steve'],
                   'points': range(6)})
print(df)
shape: (6, 2)
┌───────┬────────┐
│ name ┆ points │
│ --- ┆ --- │
│ str ┆ i64 │
╞═══════╪════════╡
│ Steve ┆ 0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ Larry ┆ 1 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ Tom ┆ 2 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ Steve ┆ 3 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ Tom ┆ 4 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ Steve ┆ 5 │
└───────┴────────┘
We created a simple data frame above in which some groups have more entries than others. In a second step we compute an additional data frame to keep track of the size of each group.
entries = df.groupby('name').agg(pl.count().alias('entries'))
print(entries)
shape: (3, 2)
┌───────┬─────────┐
│ name ┆ entries │
│ --- ┆ --- │
│ str ┆ u32 │
╞═══════╪═════════╡
│ Steve ┆ 3 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ Tom ┆ 2 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ Larry ┆ 1 │
└───────┴─────────┘
Now we bring back this information to the original data frame in a third step.
print(df.join(entries, left_on='name', right_on='name', how='left'))
shape: (6, 3)
┌───────┬────────┬─────────┐
│ name ┆ points ┆ entries │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ u32 │
╞═══════╪════════╪═════════╡
│ Steve ┆ 0 ┆ 3 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ Larry ┆ 1 ┆ 1 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ Tom ┆ 2 ┆ 2 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ Steve ┆ 3 ┆ 3 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ Tom ┆ 4 ┆ 2 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ Steve ┆ 5 ┆ 3 │
└───────┴────────┴─────────┘
Is there a way to avoid this triangulation? I have the feeling that using over might be a solution but I can't figure it out yet.
Well ... I managed. Posting the question helped me organize my thoughts and indeed, over was the solution.
df.with_column(pl.col('name').count().over('name').alias('entries'))
shape: (6, 3)
┌───────┬────────┬─────────┐
│ name ┆ points ┆ entries │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ u32 │
╞═══════╪════════╪═════════╡
│ Steve ┆ 0 ┆ 3 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ Larry ┆ 1 ┆ 1 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ Tom ┆ 2 ┆ 2 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ Steve ┆ 3 ┆ 3 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ Tom ┆ 4 ┆ 2 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ Steve ┆ 5 ┆ 3 │
└───────┴────────┴─────────┘
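The same window pattern extends to several group-level stats in one pass (a sketch of mine; the new column names are made up):
df.with_columns([
    pl.col('name').count().over('name').alias('entries'),
    pl.col('points').sum().over('name').alias('total_points'),
    pl.col('points').mean().over('name').alias('avg_points'),
])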

polars equivalent to pandas groupby shift()

Is there an equivalent way to do df.groupby().shift in Polars? See: Use pandas.shift() within a group
You can use the over expression to accomplish this in Polars. Using the example from the link...
import polars as pl
df = pl.DataFrame({
    'object': [1, 1, 1, 2, 2],
    'period': [1, 2, 4, 4, 23],
    'value': [24, 67, 89, 5, 23],
})
df.with_column(
    pl.col('value').shift().over('object').alias('prev_value')
)
shape: (5, 4)
┌────────┬────────┬───────┬────────────┐
│ object ┆ period ┆ value ┆ prev_value │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 │
╞════════╪════════╪═══════╪════════════╡
│ 1 ┆ 1 ┆ 24 ┆ null │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 2 ┆ 67 ┆ 24 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 4 ┆ 89 ┆ 67 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 4 ┆ 5 ┆ null │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 23 ┆ 23 ┆ 5 │
└────────┴────────┴───────┴────────────┘
To perform this on more than one column, you can specify the columns in the pl.col expression, and then use a prefix/suffix to name the new columns. For example:
df.with_columns(
    pl.col(['period', 'value']).shift().over('object').prefix("prev_")
)
shape: (5, 5)
┌────────┬────────┬───────┬─────────────┬────────────┐
│ object ┆ period ┆ value ┆ prev_period ┆ prev_value │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞════════╪════════╪═══════╪═════════════╪════════════╡
│ 1 ┆ 1 ┆ 24 ┆ null ┆ null │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 2 ┆ 67 ┆ 1 ┆ 24 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 4 ┆ 89 ┆ 2 ┆ 67 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 4 ┆ 5 ┆ null ┆ null │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 23 ┆ 23 ┆ 4 ┆ 5 │
└────────┴────────┴───────┴─────────────┴────────────┘
Using multiple values with over
Let's use this data.
df = pl.DataFrame(
    {
        "id": [1] * 5 + [2] * 5,
        "date": ["2020-01-01", "2020-01-01", "2020-02-01", "2020-02-01", "2020-02-01"] * 2,
        "value1": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        "value2": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
    }
).with_column(pl.col('date').str.strptime(pl.Date))
df
shape: (10, 4)
┌─────┬────────────┬────────┬────────┐
│ id ┆ date ┆ value1 ┆ value2 │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ date ┆ i64 ┆ i64 │
╞═════╪════════════╪════════╪════════╡
│ 1 ┆ 2020-01-01 ┆ 1 ┆ 10 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1 ┆ 2020-01-01 ┆ 2 ┆ 20 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1 ┆ 2020-02-01 ┆ 3 ┆ 30 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1 ┆ 2020-02-01 ┆ 4 ┆ 40 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1 ┆ 2020-02-01 ┆ 5 ┆ 50 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2020-01-01 ┆ 6 ┆ 60 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2020-01-01 ┆ 7 ┆ 70 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2020-02-01 ┆ 8 ┆ 80 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2020-02-01 ┆ 9 ┆ 90 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2020-02-01 ┆ 10 ┆ 100 │
└─────┴────────────┴────────┴────────┘
We can place a list of our grouping variables in the over expression (as well as a list in our pl.col expression). Polars will run them all in parallel.
df.with_columns([
    pl.col(["value1", "value2"]).shift().over(['id', 'date']).prefix("prev_"),
    pl.col(["value1", "value2"]).diff().over(['id', 'date']).suffix("_diff"),
])
shape: (10, 8)
┌─────┬────────────┬────────┬────────┬─────────────┬─────────────┬─────────────┬─────────────┐
│ id ┆ date ┆ value1 ┆ value2 ┆ prev_value1 ┆ prev_value2 ┆ value1_diff ┆ value2_diff │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ date ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════╪════════════╪════════╪════════╪═════════════╪═════════════╪═════════════╪═════════════╡
│ 1 ┆ 2020-01-01 ┆ 1 ┆ 10 ┆ null ┆ null ┆ null ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 2020-01-01 ┆ 2 ┆ 20 ┆ 1 ┆ 10 ┆ 1 ┆ 10 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 2020-02-01 ┆ 3 ┆ 30 ┆ null ┆ null ┆ null ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 2020-02-01 ┆ 4 ┆ 40 ┆ 3 ┆ 30 ┆ 1 ┆ 10 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 2020-02-01 ┆ 5 ┆ 50 ┆ 4 ┆ 40 ┆ 1 ┆ 10 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2020-01-01 ┆ 6 ┆ 60 ┆ null ┆ null ┆ null ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2020-01-01 ┆ 7 ┆ 70 ┆ 6 ┆ 60 ┆ 1 ┆ 10 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2020-02-01 ┆ 8 ┆ 80 ┆ null ┆ null ┆ null ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2020-02-01 ┆ 9 ┆ 90 ┆ 8 ┆ 80 ┆ 1 ┆ 10 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2020-02-01 ┆ 10 ┆ 100 ┆ 9 ┆ 90 ┆ 1 ┆ 10 │
└─────┴────────────┴────────┴────────┴─────────────┴─────────────┴─────────────┴─────────────┘
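As an aside (my own note, not from the answer): shift also accepts a negative period, which gives a within-group lead instead of a lag:
df.with_column(
    pl.col('value1').shift(-1).over('id').alias('next_value1')  # next row's value1 within each id
)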

In Polars how can I display a single row from a dataframe vertically like a pandas series?

I have a polars dataframe with many columns. I want to look at all the data from a single row aligned vertically so that I can see the values in many different columns without it going off the edge of the screen. How can I do this?
E.g. define a dataframe
df = pl.DataFrame({'a':[0,1],'b':[2,3]})
Print df[0] in ipython/jupyter and I get the usual wide, horizontal table layout. But if I convert df to pandas and print df.iloc[0], I get the row printed vertically, one column name and value per line.
The latter is very handy when you've got many columns.
I've tried things like df[0].to_series(), but it only prints the first element, not the first row.
My suspicion is that there isn't a direct replacement because the pandas method relies on the series having an index. I think the polars solution will be more like making a two column dataframe where one column is the column names and the other is a value. I'm not sure if there's a method to do that though.
Thanks for any help you can offer!
import polars as pl
import numpy as np
# Create dataframe with lots of columns.
df = pl.DataFrame(np.random.randint(0, 1000, (5, 100)))
df
shape: (5, 100)
┌──────────┬──────────┬──────────┬──────────┬─────┬───────────┬───────────┬───────────┬───────────┐
│ column_0 ┆ column_1 ┆ column_2 ┆ column_3 ┆ ... ┆ column_96 ┆ column_97 ┆ column_98 ┆ column_99 │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞══════════╪══════════╪══════════╪══════════╪═════╪═══════════╪═══════════╪═══════════╪═══════════╡
│ 285 ┆ 366 ┆ 886 ┆ 981 ┆ ... ┆ 63 ┆ 326 ┆ 882 ┆ 564 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 735 ┆ 269 ┆ 381 ┆ 78 ┆ ... ┆ 556 ┆ 737 ┆ 741 ┆ 768 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 543 ┆ 729 ┆ 915 ┆ 901 ┆ ... ┆ 48 ┆ 21 ┆ 277 ┆ 818 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 264 ┆ 424 ┆ 285 ┆ 540 ┆ ... ┆ 602 ┆ 584 ┆ 888 ┆ 836 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 269 ┆ 701 ┆ 483 ┆ 817 ┆ ... ┆ 579 ┆ 873 ┆ 192 ┆ 734 │
└──────────┴──────────┴──────────┴──────────┴─────┴───────────┴───────────┴───────────┴───────────┘
# Display row 3 (index 2) by pairing each column name with its value in that row.
tuple(zip(df.columns, df.row(2)))
(('column_0', 543),
('column_1', 729),
('column_2', 915),
('column_3', 901),
('column_4', 332),
('column_5', 156),
('column_6', 624),
('column_7', 37),
('column_8', 341),
('column_9', 503),
('column_10', 135),
('column_11', 183),
('column_12', 651),
('column_13', 910),
('column_14', 625),
('column_15', 129),
('column_16', 604),
('column_17', 671),
('column_18', 976),
('column_19', 558),
('column_20', 159),
('column_21', 314),
('column_22', 460),
('column_23', 49),
('column_24', 944),
('column_25', 6),
('column_26', 470),
('column_27', 228),
('column_28', 615),
('column_29', 230),
('column_30', 217),
('column_31', 66),
('column_32', 999),
('column_33', 440),
('column_34', 519),
('column_35', 851),
('column_36', 37),
('column_37', 859),
('column_38', 560),
('column_39', 870),
('column_40', 892),
('column_41', 192),
('column_42', 541),
('column_43', 136),
('column_44', 631),
('column_45', 22),
('column_46', 522),
('column_47', 225),
('column_48', 610),
('column_49', 191),
('column_50', 886),
('column_51', 454),
('column_52', 312),
('column_53', 956),
('column_54', 473),
('column_55', 851),
('column_56', 760),
('column_57', 224),
('column_58', 859),
('column_59', 442),
('column_60', 234),
('column_61', 788),
('column_62', 53),
('column_63', 999),
('column_64', 473),
('column_65', 237),
('column_66', 247),
('column_67', 307),
('column_68', 916),
('column_69', 94),
('column_70', 714),
('column_71', 233),
('column_72', 995),
('column_73', 335),
('column_74', 454),
('column_75', 801),
('column_76', 742),
('column_77', 386),
('column_78', 196),
('column_79', 239),
('column_80', 723),
('column_81', 59),
('column_82', 929),
('column_83', 852),
('column_84', 722),
('column_85', 328),
('column_86', 59),
('column_87', 710),
('column_88', 238),
('column_89', 823),
('column_90', 75),
('column_91', 307),
('column_92', 472),
('column_93', 822),
('column_94', 582),
('column_95', 802),
('column_96', 48),
('column_97', 21),
('column_98', 277),
('column_99', 818))
Pandas does not display all values either if you have many columns.
In [121]: df.to_pandas().iloc[0]
Out[121]:
column_0 285
column_1 366
column_2 886
column_3 981
column_4 464
...
column_95 862
column_96 63
column_97 326
column_98 882
column_99 564
Name: 0, Length: 100, dtype: int64
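If you reach for this often, the zip is easy to wrap in a small helper (a sketch of mine, not part of the answer):
def print_row(df: pl.DataFrame, idx: int) -> None:
    # Print one row vertically, pandas-Series style: column name, then value.
    width = max(len(name) for name in df.columns)
    for name, value in zip(df.columns, df.row(idx)):
        print(f"{name:<{width}}  {value}")

print_row(df, 2)  # same information as the tuple above, one line per column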
You can try using melt. For example:
df = pl.DataFrame(
    [
        pl.Series(name="col_str", values=["string1", "string2"]),
        pl.Series(name="col_bool", values=[False, True]),
        pl.Series(name="col_int", values=[1, 2]),
        pl.Series(name="col_float", values=[10.0, 20.0]),
        *[pl.Series(name=f"col_other_{idx}", values=[idx] * 2)
          for idx in range(1, 25)],
    ]
)
print(df)
shape: (2, 28)
┌─────────┬──────────┬─────────┬───────────┬─────┬──────────────┬──────────────┬──────────────┬──────────────┐
│ col_str ┆ col_bool ┆ col_int ┆ col_float ┆ ... ┆ col_other_21 ┆ col_other_22 ┆ col_other_23 ┆ col_other_24 │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ bool ┆ i64 ┆ f64 ┆ ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════════╪══════════╪═════════╪═══════════╪═════╪══════════════╪══════════════╪══════════════╪══════════════╡
│ string1 ┆ false ┆ 1 ┆ 10.0 ┆ ... ┆ 21 ┆ 22 ┆ 23 ┆ 24 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ string2 ┆ true ┆ 2 ┆ 20.0 ┆ ... ┆ 21 ┆ 22 ┆ 23 ┆ 24 │
└─────────┴──────────┴─────────┴───────────┴─────┴──────────────┴──────────────┴──────────────┴──────────────┘
To print the first row:
pl.Config.set_tbl_rows(100)
df[0,].melt()
shape: (28, 2)
┌──────────────┬─────────┐
│ variable ┆ value │
│ --- ┆ --- │
│ str ┆ str │
╞══════════════╪═════════╡
│ col_str ┆ string1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_bool ┆ false │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_int ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_float ┆ 10.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_1 ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_2 ┆ 2 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_3 ┆ 3 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_4 ┆ 4 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_5 ┆ 5 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_6 ┆ 6 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_7 ┆ 7 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_8 ┆ 8 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_9 ┆ 9 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_10 ┆ 10 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_11 ┆ 11 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_12 ┆ 12 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_13 ┆ 13 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_14 ┆ 14 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_15 ┆ 15 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_16 ┆ 16 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_17 ┆ 17 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_18 ┆ 18 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_19 ┆ 19 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_20 ┆ 20 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_21 ┆ 21 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_22 ┆ 22 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_23 ┆ 23 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_24 ┆ 24 │
└──────────────┴─────────┘
If needed, use polars.Config.set_tbl_rows to set the number of rows you find acceptable. (This only needs to be done once per session, not every time you print.)
Notice that all values have been cast to super-type str. (One caution: this approach won't work if any of your columns are of dtype list.)
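As an aside (my addition; it assumes a reasonably recent Polars release): newer versions also ship DataFrame.glimpse, which prints one line per column and covers much of this use case without melting:
df.glimpse()  # column name, dtype and the first few values, one line per column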
You may want to check the Polars Cookbook section about indexing here.
It states:
|            | pandas     | polars   |
|------------|------------|----------|
| select row | df.iloc[2] | df[2, :] |
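For completeness, a quick sketch of those two forms (my own example):
df = pl.DataFrame({'a': [0, 1, 2], 'b': ['x', 'y', 'z']})
df[2, :]   # a one-row DataFrame holding the third row
df.row(2)  # the same row as a plain Python tuple: (2, 'z')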
Cheers!