How to group items by difference in percent in Polars? - python-polars

I'd like to group values so that the difference between each item and the first element of its group stays within a certain percentage. E.g. with a 5% threshold, whenever an item differs from the first element of the current group by 5% or more, a new group starts. As a result I need the first value of each group. Example with a 5% threshold, where 'a' is given and 'group' and 'groupFirst' must be calculated:
import polars as pl

df = pl.DataFrame({
    'a': [100, 103, 105, 106, 105, 104, 103, 106, 100, 102],
    'group': [0, 0, 1, 1, 1, 1, 1, 1, 2, 2],
    'groupFirst': [100, 100, 105, 105, 105, 105, 105, 105, 100, 100],
})
print(df)
shape: (10, 3)
┌─────┬───────┬────────────┐
│ a ┆ group ┆ groupFirst │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═══════╪════════════╡
│ 100 ┆ 0 ┆ 100 │
│ 103 ┆ 0 ┆ 100 │
│ 105 ┆ 1 ┆ 105 │
│ 106 ┆ 1 ┆ 105 │
│ ... ┆ ... ┆ ... │
│ 103 ┆ 1 ┆ 105 │
│ 106 ┆ 1 ┆ 105 │
│ 100 ┆ 2 ┆ 100 │
│ 102 ┆ 2 ┆ 100 │
└─────┴───────┴────────────┘

Say you want to reset cummax whenever your values reach or exceed 6. You could do:
(
    df.with_columns(
        (pl.col("a") >= 6)
        .shift(1)
        .fill_null(False)
        .cumsum()
        .alias("group")
    ).with_columns(
        pl.col("a")
        .cummax()
        .over(pl.col("group"))
        .alias("cummax")
    )
)
Example:
In [73]: df = pl.DataFrame({'a': [1, 3, 5, 6, 1, 4, 3, 6, 5, 6]})
In [74]: (
    ...:     df.with_columns(
    ...:         (pl.col("a") >= 6)
    ...:         .shift(1)
    ...:         .fill_null(False)
    ...:         .cumsum()
    ...:         .alias("group")
    ...:     ).with_columns(
    ...:         pl.col("a")
    ...:         .cummax()
    ...:         .over(pl.col("group"))
    ...:         .alias("cummax")
    ...:     )
    ...: )
Out[74]:
shape: (10, 3)
┌─────┬───────┬────────┐
│ a ┆ group ┆ cummax │
│ --- ┆ --- ┆ --- │
│ i64 ┆ u32 ┆ i64 │
╞═════╪═══════╪════════╡
│ 1 ┆ 0 ┆ 1 │
│ 3 ┆ 0 ┆ 3 │
│ 5 ┆ 0 ┆ 5 │
│ 6 ┆ 0 ┆ 6 │
│ ... ┆ ... ┆ ... │
│ 3 ┆ 1 ┆ 4 │
│ 6 ┆ 1 ┆ 6 │
│ 5 ┆ 2 ┆ 5 │
│ 6 ┆ 2 ┆ 6 │
└─────┴───────┴────────┘

I don't think there is a way to use polars expressions to generate the groups, since each group boundary depends on the first value of the previous group. That said, the groups can easily be generated in O(n), so the penalty for doing that in Python should be minor.
import numpy as np

def make_groups(a, threshold=1.05):
    a = np.array(a)
    outarray = np.empty(len(a), dtype=a.dtype)
    outarray[0] = 0
    curgroup = a[0]  # first value of the current group
    for indx, cur_a in enumerate(a[1:], 1):
        # Start a new group when the value moves at least `threshold`
        # above or below the current group's first value.
        if cur_a >= threshold * curgroup or cur_a * threshold <= curgroup:
            outarray[indx] = outarray[indx - 1] + 1
            curgroup = cur_a
        else:
            outarray[indx] = outarray[indx - 1]
    return pl.Series(outarray)
Now let's apply that to our data.
df = pl.DataFrame({'a': [100, 103, 105, 106, 105, 104, 103, 106, 100, 102]})
We just do a map (incidentally, I tried making make_groups into a np.ufunc but couldn't get it to work).
df \
    .with_columns(pl.col('a').map(lambda x: make_groups(x, 1.05)).alias('group')) \
    .with_columns(pl.col('a').list().over('group').arr.first().alias('groupFirst'))
shape: (10, 3)
┌─────┬───────┬────────────┐
│ a ┆ group ┆ groupFirst │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═══════╪════════════╡
│ 100 ┆ 0 ┆ 100 │
│ 103 ┆ 0 ┆ 100 │
│ 105 ┆ 1 ┆ 105 │
│ 106 ┆ 1 ┆ 105 │
│ ... ┆ ... ┆ ... │
│ 103 ┆ 1 ┆ 105 │
│ 106 ┆ 1 ┆ 105 │
│ 100 ┆ 2 ┆ 100 │
│ 102 ┆ 2 ┆ 100 │
└─────┴───────┴────────────┘
By the way, if you want to use the default threshold, you can simply do...
df \
    .with_columns(pl.col('a').map(make_groups).alias('group')) \
    .with_columns(pl.col('a').list().over('group').arr.first().alias('groupFirst'))
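As a quick sanity check (a sketch reusing the answer's own API), the result should match the expected groupFirst column from the question:
expected = [100, 100, 105, 105, 105, 105, 105, 105, 100, 100]
out = (
    df
    .with_columns(pl.col('a').map(make_groups).alias('group'))
    .with_columns(pl.col('a').list().over('group').arr.first().alias('groupFirst'))
)
# If the grouping is correct, this holds for the example data.
assert out['groupFirst'].to_list() == expected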

Related

Select columns from LazyFrame by condition

polars.LazyFrame.var returns the variance of each column in a table, as below:
>>> df = pl.DataFrame({"a": [1, 2, 3, 4], "b": [1, 2, 1, 1], "c": [1, 1, 1, 1]}).lazy()
>>> df.collect()
shape: (4, 3)
┌─────┬─────┬─────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 1 ┆ 1 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ 2 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 3 ┆ 1 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 4 ┆ 1 ┆ 1 │
└─────┴─────┴─────┘
>>> df.var().collect()
shape: (1, 3)
┌──────────┬──────┬─────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 │
╞══════════╪══════╪═════╡
│ 1.666667 ┆ 0.25 ┆ 0.0 │
└──────────┴──────┴─────┘
I wish to select only the columns whose variance is > 0 from a LazyFrame, but couldn't find a solution.
With an eager DataFrame, I can iterate over the columns and filter them by condition, as below:
>>> data.var()
shape: (1, 3)
┌──────────┬──────┬─────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 │
╞══════════╪══════╪═════╡
│ 1.666667 ┆ 0.25 ┆ 0.0 │
└──────────┴──────┴─────┘
>>> cols = pl.select([s for s in data.var() if (s > 0).all()]).columns
>>> cols
['a', 'b']
>>> data.select(cols)
shape: (4, 2)
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 3 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 4 ┆ 1 │
└─────┴─────┘
But it doesn't work in LazyFrame:
>>> data = data.lazy()
>>> data
<polars.internals.lazyframe.frame.LazyFrame object at 0x7f0e3d9966a0>
>>> cols = pl.select([s for s in data.var() if (s > 0).all()]).columns
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 1, in <listcomp>
File "/home/jasmine/miniconda3/envs/jupyternb/lib/python3.9/site-packages/polars/internals/lazyframe/frame.py", line 421, in __getitem__
raise TypeError(
TypeError: 'LazyFrame' object is not subscriptable (aside from slicing). Use 'select()' or 'filter()' instead.
The reason for doing this on a LazyFrame is that we want to maximize performance. Any advice would be much appreciated. Thanks!
Polars doesn't know what the variance is until after it is calculated, and that's the same moment it displays the results. So there's no way to filter the reported columns that is more performant than simply displaying all of them, at least with respect to the polars calculation itself. (It may be that python/jupyter takes longer to render more results than fewer.)
With that said you could do something like this:
df.var().melt().filter(pl.col('value')>0).collect()
which gives you what you want in one line but it's a different shape.
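Given the example frame above (variances 1.666667, 0.25 and 0.0), that should produce something like:
shape: (2, 2)
┌──────────┬──────────┐
│ variable ┆ value    │
│ ---      ┆ ---      │
│ str      ┆ f64      │
╞══════════╪══════════╡
│ a        ┆ 1.666667 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ b        ┆ 0.25     │
└──────────┴──────────┘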
You could also do something like this:
dfvar = df.var()
dfvar.select(
    dfvar.melt().filter(pl.col('value') > 0).select('variable').collect().to_series().to_list()
).collect()
Building on the answer from @dean MacGregor, we:
1. do the var calculation
2. melt
3. apply the filter
4. extract the variable column with the column names
5. pass it as a list to select
df.select(
    df.var()
    .melt()
    .filter(pl.col('value') > 0)
    .collect()["variable"]
    .to_list()
).collect()
shape: (4, 2)
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 3 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 4 ┆ 1 │
└─────┴─────┘

Polars unable to compute over() when an input was a list

I have a dataframe like this:
import polars as pl

orig_df = pl.DataFrame({
    "primary_key": [1, 1, 2, 2, 3, 3],
    "simple_foreign_keys": [1, 1, 2, 4, 4, 4],
    "fancy_foreign_keys": [[1], [1], [2], [1], [3, 4], [3, 4]],
})
I have some logic that computes when the secondary column changes, then additional logic to rank those changes per primary column. It works fine on the simple data type:
foreign_key_changed = pl.col('simple_foreign_keys') != pl.col('simple_foreign_keys').shift_and_fill(1, 0)
df = orig_df.sort('primary_key').with_column(foreign_key_changed.cumsum().alias('raw_changes'))
print(df)
shape: (6, 4)
┌─────────────┬─────────────────────┬────────────────────┬─────────────┐
│ primary_key ┆ simple_foreign_keys ┆ fancy_foreign_keys ┆ raw_changes │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ list[i64] ┆ u32 │
╞═════════════╪═════════════════════╪════════════════════╪═════════════╡
│ 1 ┆ 1 ┆ [1] ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 1 ┆ [1] ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2 ┆ [2] ┆ 2 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 4 ┆ [1] ┆ 3 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 4 ┆ [3, 4] ┆ 3 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 4 ┆ [3, 4] ┆ 3 │
└─────────────┴─────────────────────┴────────────────────┴─────────────┘
df = orig_df.sort('primary_key').with_column(foreign_key_changed.cumsum().rank('dense').over('primary_key').alias('ranked_changes'))
print(df)
shape: (6, 4)
┌─────────────┬─────────────────────┬────────────────────┬────────────────┐
│ primary_key ┆ simple_foreign_keys ┆ fancy_foreign_keys ┆ ranked_changes │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ list[i64] ┆ u32 │
╞═════════════╪═════════════════════╪════════════════════╪════════════════╡
│ 1 ┆ 1 ┆ [1] ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 1 ┆ [1] ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2 ┆ [2] ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 4 ┆ [1] ┆ 2 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 4 ┆ [3, 4] ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 4 ┆ [3, 4] ┆ 1 │
└─────────────┴─────────────────────┴────────────────────┴────────────────┘
But if I try the exact same logic on the List column, it blows up on me. Note that the intermediate columns (the cumsum, the rank) are still plain integers:
foreign_key_changed = pl.col('fancy_foreign_keys') != pl.col('fancy_foreign_keys').shift_and_fill(1, [])
df = orig_df.sort('primary_key').with_column(foreign_key_changed.cumsum().alias('raw_changes'))
print(df)
shape: (6, 4)
┌─────────────┬─────────────────────┬────────────────────┬─────────────┐
│ primary_key ┆ simple_foreign_keys ┆ fancy_foreign_keys ┆ raw_changes │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ list[i64] ┆ u32 │
╞═════════════╪═════════════════════╪════════════════════╪═════════════╡
│ 1 ┆ 1 ┆ [1] ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 1 ┆ [1] ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2 ┆ [2] ┆ 2 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 4 ┆ [1] ┆ 3 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 4 ┆ [3, 4] ┆ 4 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 4 ┆ [3, 4] ┆ 4 │
└─────────────┴─────────────────────┴────────────────────┴─────────────┘
df = orig_df.sort('primary_key').with_column(foreign_key_changed.cumsum().rank('dense').over('primary_key').alias('ranked_changes'))
thread '<unnamed>' panicked at 'implementation error, cannot get ref List(Null) from Int64', /Users/runner/work/polars/polars/polars/polars-core/src/series/mod.rs:945:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
File "polars_example.py", line 18, in <module>
df = orig_df.sort('primary_key').with_column(foreign_key_changed.cumsum().rank('dense').over('primary_key').alias('ranked_changes'))
File "venv/lib/python3.9/site-packages/polars/internals/frame.py", line 4144, in with_column
return self.with_columns([column])
File "venv/lib/python3.9/site-packages/polars/internals/frame.py", line 5308, in with_columns
self.lazy()
File "venv/lib/python3.9/site-packages/polars/internals/lazy_frame.py", line 652, in collect
return self._dataframe_class._from_pydf(ldf.collect())
pyo3_runtime.PanicException: implementation error, cannot get ref List(Null) from Int64
Am I doing this wrong? Is there some massaging that has to happen with the types?
This is with polars 0.13.62.
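One possible workaround (a sketch of my own, untested against 0.13.62, not from the thread) is to materialize the cumsum as a plain integer column first, so that the rank().over() expression no longer contains the List comparison inside it:
df = (
    orig_df
    .sort('primary_key')
    .with_column(foreign_key_changed.cumsum().alias('raw_changes'))
    # 'raw_changes' is a plain u32 column at this point, so ranking it
    # per primary_key should not involve the list dtype at all.
    .with_column(
        pl.col('raw_changes').rank('dense').over('primary_key').alias('ranked_changes')
    )
)
print(df)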

polars equivalent to pandas groupby shift()

Is there an equivalent way to do df.groupby().shift in polars? Use pandas.shift() within a group
You can use the over expression to accomplish this in Polars. Using the example from the link...
import polars as pl

df = pl.DataFrame({
    'object': [1, 1, 1, 2, 2],
    'period': [1, 2, 4, 4, 23],
    'value': [24, 67, 89, 5, 23],
})
df.with_column(
    pl.col('value').shift().over('object').alias('prev_value')
)
shape: (5, 4)
┌────────┬────────┬───────┬────────────┐
│ object ┆ period ┆ value ┆ prev_value │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 │
╞════════╪════════╪═══════╪════════════╡
│ 1 ┆ 1 ┆ 24 ┆ null │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 2 ┆ 67 ┆ 24 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 4 ┆ 89 ┆ 67 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 4 ┆ 5 ┆ null │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 23 ┆ 23 ┆ 5 │
└────────┴────────┴───────┴────────────┘
To perform this on more than one column, you can specify the columns in the pl.col expression, and then use a prefix/suffix to name the new columns. For example:
df.with_columns(
    pl.col(['period', 'value']).shift().over('object').prefix("prev_")
)
shape: (5, 5)
┌────────┬────────┬───────┬─────────────┬────────────┐
│ object ┆ period ┆ value ┆ prev_period ┆ prev_value │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞════════╪════════╪═══════╪═════════════╪════════════╡
│ 1 ┆ 1 ┆ 24 ┆ null ┆ null │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 2 ┆ 67 ┆ 1 ┆ 24 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 4 ┆ 89 ┆ 2 ┆ 67 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 4 ┆ 5 ┆ null ┆ null │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 23 ┆ 23 ┆ 4 ┆ 5 │
└────────┴────────┴───────┴─────────────┴────────────┘
Using multiple values with over
Let's use this data.
df = pl.DataFrame(
    {
        "id": [1] * 5 + [2] * 5,
        "date": ["2020-01-01", "2020-01-01", "2020-02-01", "2020-02-01", "2020-02-01"] * 2,
        "value1": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
        "value2": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
    }
).with_column(pl.col('date').str.strptime(pl.Date))
df
shape: (10, 4)
┌─────┬────────────┬────────┬────────┐
│ id ┆ date ┆ value1 ┆ value2 │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ date ┆ i64 ┆ i64 │
╞═════╪════════════╪════════╪════════╡
│ 1 ┆ 2020-01-01 ┆ 1 ┆ 10 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1 ┆ 2020-01-01 ┆ 2 ┆ 20 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1 ┆ 2020-02-01 ┆ 3 ┆ 30 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1 ┆ 2020-02-01 ┆ 4 ┆ 40 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1 ┆ 2020-02-01 ┆ 5 ┆ 50 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2020-01-01 ┆ 6 ┆ 60 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2020-01-01 ┆ 7 ┆ 70 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2020-02-01 ┆ 8 ┆ 80 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2020-02-01 ┆ 9 ┆ 90 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2020-02-01 ┆ 10 ┆ 100 │
└─────┴────────────┴────────┴────────┘
We can place a list of our grouping variables in the over expression (as well as a list in our pl.col expression). Polars will run them all in parallel.
df.with_columns([
    pl.col(["value1", "value2"]).shift().over(['id', 'date']).prefix("prev_"),
    pl.col(["value1", "value2"]).diff().over(['id', 'date']).suffix("_diff"),
])
shape: (10, 8)
┌─────┬────────────┬────────┬────────┬─────────────┬─────────────┬─────────────┬─────────────┐
│ id ┆ date ┆ value1 ┆ value2 ┆ prev_value1 ┆ prev_value2 ┆ value1_diff ┆ value2_diff │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ date ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════╪════════════╪════════╪════════╪═════════════╪═════════════╪═════════════╪═════════════╡
│ 1 ┆ 2020-01-01 ┆ 1 ┆ 10 ┆ null ┆ null ┆ null ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 2020-01-01 ┆ 2 ┆ 20 ┆ 1 ┆ 10 ┆ 1 ┆ 10 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 2020-02-01 ┆ 3 ┆ 30 ┆ null ┆ null ┆ null ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 2020-02-01 ┆ 4 ┆ 40 ┆ 3 ┆ 30 ┆ 1 ┆ 10 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 2020-02-01 ┆ 5 ┆ 50 ┆ 4 ┆ 40 ┆ 1 ┆ 10 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2020-01-01 ┆ 6 ┆ 60 ┆ null ┆ null ┆ null ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2020-01-01 ┆ 7 ┆ 70 ┆ 6 ┆ 60 ┆ 1 ┆ 10 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2020-02-01 ┆ 8 ┆ 80 ┆ null ┆ null ┆ null ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2020-02-01 ┆ 9 ┆ 90 ┆ 8 ┆ 80 ┆ 1 ┆ 10 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2020-02-01 ┆ 10 ┆ 100 ┆ 9 ┆ 90 ┆ 1 ┆ 10 │
└─────┴────────────┴────────┴────────┴─────────────┴─────────────┴─────────────┴─────────────┘

Cumulative sum that resets when turning negative/positive

I am trying to add a column (column C) to my polars dataframe that counts how many consecutive times the value of one column (column A) has been greater or less than the value of another column (column B). Once the comparison flips from less to greater (or vice versa), the cumulative sum should reset and start counting from 1/-1 again.
The data
I'm going to change the data in the example you provided.
df = pl.DataFrame(
    {
        "a": [11, 10, 10, 10, 9, 8, 8, 8, 8, 8, 15, 15, 15],
        "b": [11, 9, 9, 9, 9, 9, 10, 8, 8, 10, 11, 11, 15],
    }
)
print(df)
shape: (13, 2)
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 11 ┆ 11 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 10 ┆ 9 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 10 ┆ 9 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 10 ┆ 9 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 9 ┆ 9 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 8 ┆ 9 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 8 ┆ 10 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 8 ┆ 8 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 8 ┆ 8 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 8 ┆ 10 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 15 ┆ 11 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 15 ┆ 11 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 15 ┆ 15 │
└─────┴─────┘
Notice the cases where the two columns are the same. Your post didn't address what to do in these cases, so I made some assumptions as to what should happen. (You can adapt the code to handle those cases differently.)
The algorithm
df = (
    df
    .with_column((pl.col("a") - pl.col("b")).sign().alias("sign_a_minus_b"))
    .with_column(
        pl.when(pl.col("sign_a_minus_b") == 0)
        .then(None)
        .otherwise(pl.col("sign_a_minus_b"))
        .forward_fill()
        .alias("run_type")
    )
    .with_column(
        (pl.col("run_type") != pl.col("run_type").shift_and_fill(1, 0))
        .cumsum()
        .alias("run_id")
    )
    .with_column(pl.col("sign_a_minus_b").cumsum().over("run_id").alias("result"))
)
print(df)
shape: (13, 6)
┌─────┬─────┬────────────────┬──────────┬────────┬────────┐
│ a ┆ b ┆ sign_a_minus_b ┆ run_type ┆ run_id ┆ result │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ u32 ┆ i64 │
╞═════╪═════╪════════════════╪══════════╪════════╪════════╡
│ 11 ┆ 11 ┆ 0 ┆ null ┆ 1 ┆ 0 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 10 ┆ 9 ┆ 1 ┆ 1 ┆ 2 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 10 ┆ 9 ┆ 1 ┆ 1 ┆ 2 ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 10 ┆ 9 ┆ 1 ┆ 1 ┆ 2 ┆ 3 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 9 ┆ 9 ┆ 0 ┆ 1 ┆ 2 ┆ 3 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 8 ┆ 9 ┆ -1 ┆ -1 ┆ 3 ┆ -1 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 8 ┆ 10 ┆ -1 ┆ -1 ┆ 3 ┆ -2 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 8 ┆ 8 ┆ 0 ┆ -1 ┆ 3 ┆ -2 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 8 ┆ 8 ┆ 0 ┆ -1 ┆ 3 ┆ -2 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 8 ┆ 10 ┆ -1 ┆ -1 ┆ 3 ┆ -3 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 15 ┆ 11 ┆ 1 ┆ 1 ┆ 4 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 15 ┆ 11 ┆ 1 ┆ 1 ┆ 4 ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 15 ┆ 15 ┆ 0 ┆ 1 ┆ 4 ┆ 2 │
└─────┴─────┴────────────────┴──────────┴────────┴────────┘
I've left the intermediate calculations in the output, merely to show how the algorithm works. (You can drop them.)
The basic idea is to calculate a run_id for each run of positive or negative values. We will then use the cumsum function and the over windowing expression to create a running count of positives/negatives over each run_id.
Key assumption: ties in columns a and b do not interrupt a run, but they do not contribute to the total for that run of positive/negative values.
sign_a_minus_b does two things: it identifies whether a run is positive/negative, and whether there is a tie in columns a and b.
run_type extends any run to include any cases where a tie occurs in columns a and b. The null value at the top of the column was intended - it shows what happens when a tie occurs in the first row.
result is the output column. Note that tied rows do not interrupt a run, but they don't contribute to the totals for that run.
One final note: if ties in columns a and b are not allowed, then this algorithm can be simplified ... and run faster.
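For example, if a and b are never equal, sign_a_minus_b is never 0, so the run_type/forward_fill step drops out entirely (a sketch under that assumption):
df = (
    df
    .with_column((pl.col("a") - pl.col("b")).sign().alias("sign_a_minus_b"))
    # With no zeros, every change of sign directly starts a new run.
    .with_column(
        (pl.col("sign_a_minus_b") != pl.col("sign_a_minus_b").shift_and_fill(1, 0))
        .cumsum()
        .alias("run_id")
    )
    .with_column(pl.col("sign_a_minus_b").cumsum().over("run_id").alias("result"))
)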
Not very elegant or Pythonic, but something like the below should work:
import pandas as pd

df = pd.DataFrame({'a': [10, 10, 10, 8, 8, 8, 15, 15],
                   'b': [9, 9, 9, 9, 10, 10, 11, 11]})
# 'c' counts consecutive rows where a > b; 'd' counts the rest (negatively).
df['c'] = df.apply(lambda row: 1 if row['a'] > row['b'] else 0, axis=1)
df['d'] = df.apply(lambda row: 0 if row['a'] > row['b'] else -1, axis=1)
for i in range(1, len(df)):
    if df.loc[i, 'a'] > df.loc[i, 'b']:
        df.loc[i, 'c'] = df.loc[i - 1, 'c'] + 1
        df.loc[i, 'd'] = 0
    else:
        df.loc[i, 'd'] = df.loc[i - 1, 'd'] - 1
        df.loc[i, 'c'] = 0
df['ans'] = df['c'] + df['d']
print(df)
Also you might need to think about what the value should be for the specific case when column a and b are equal.
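A more vectorized pandas variant of the same idea (a sketch; like the loop above, it lumps ties in a and b in with the "less than" branch):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [10, 10, 10, 8, 8, 8, 15, 15],
                   'b': [9, 9, 9, 9, 10, 10, 11, 11]})
sign = pd.Series(np.where(df['a'] > df['b'], 1, -1), index=df.index)
run_id = (sign != sign.shift()).cumsum()   # new id whenever the sign flips
df['ans'] = sign.groupby(run_id).cumsum()  # running +1/-1 count per run
print(df)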

In Polars how can I display a single row from a dataframe vertically like a pandas series?

I have a polars dataframe with many columns. I want to look at all the data from a single row aligned vertically so that I can see the values in many different columns without it going off the edge of the screen. How can I do this?
E.g. define a dataframe
df = pl.DataFrame({'a': [0, 1], 'b': [2, 3]})
Print df[0] in ipython/jupyter and I get the usual wide, horizontal table. But if I convert df to pandas and print df.iloc[0], the row is displayed vertically, with one column name and value per line.
The latter is very handy when you've got many columns.
I've tried things like df[0].to_series(), but it only prints the first element, not the first row.
My suspicion is that there isn't a direct replacement because the pandas method relies on the series having an index. I think the polars solution will be more like making a two column dataframe where one column is the column names and the other is a value. I'm not sure if there's a method to do that though.
Thanks for any help you can offer!
import polars as pl
import numpy as np
# Create dataframe with lots of columns.
df = pl.DataFrame(np.random.randint(0, 1000, (5, 100)))
df
shape: (5, 100)
┌──────────┬──────────┬──────────┬──────────┬─────┬───────────┬───────────┬───────────┬───────────┐
│ column_0 ┆ column_1 ┆ column_2 ┆ column_3 ┆ ... ┆ column_96 ┆ column_97 ┆ column_98 ┆ column_99 │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞══════════╪══════════╪══════════╪══════════╪═════╪═══════════╪═══════════╪═══════════╪═══════════╡
│ 285 ┆ 366 ┆ 886 ┆ 981 ┆ ... ┆ 63 ┆ 326 ┆ 882 ┆ 564 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 735 ┆ 269 ┆ 381 ┆ 78 ┆ ... ┆ 556 ┆ 737 ┆ 741 ┆ 768 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 543 ┆ 729 ┆ 915 ┆ 901 ┆ ... ┆ 48 ┆ 21 ┆ 277 ┆ 818 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 264 ┆ 424 ┆ 285 ┆ 540 ┆ ... ┆ 602 ┆ 584 ┆ 888 ┆ 836 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 269 ┆ 701 ┆ 483 ┆ 817 ┆ ... ┆ 579 ┆ 873 ┆ 192 ┆ 734 │
└──────────┴──────────┴──────────┴──────────┴─────┴───────────┴───────────┴───────────┴───────────┘
# Display row 3, by creating a tuple of column name and value for row 3.
tuple(zip(df.columns, df.row(2)))
(('column_0', 543),
('column_1', 729),
('column_2', 915),
('column_3', 901),
('column_4', 332),
('column_5', 156),
('column_6', 624),
('column_7', 37),
('column_8', 341),
('column_9', 503),
('column_10', 135),
('column_11', 183),
('column_12', 651),
('column_13', 910),
('column_14', 625),
('column_15', 129),
('column_16', 604),
('column_17', 671),
('column_18', 976),
('column_19', 558),
('column_20', 159),
('column_21', 314),
('column_22', 460),
('column_23', 49),
('column_24', 944),
('column_25', 6),
('column_26', 470),
('column_27', 228),
('column_28', 615),
('column_29', 230),
('column_30', 217),
('column_31', 66),
('column_32', 999),
('column_33', 440),
('column_34', 519),
('column_35', 851),
('column_36', 37),
('column_37', 859),
('column_38', 560),
('column_39', 870),
('column_40', 892),
('column_41', 192),
('column_42', 541),
('column_43', 136),
('column_44', 631),
('column_45', 22),
('column_46', 522),
('column_47', 225),
('column_48', 610),
('column_49', 191),
('column_50', 886),
('column_51', 454),
('column_52', 312),
('column_53', 956),
('column_54', 473),
('column_55', 851),
('column_56', 760),
('column_57', 224),
('column_58', 859),
('column_59', 442),
('column_60', 234),
('column_61', 788),
('column_62', 53),
('column_63', 999),
('column_64', 473),
('column_65', 237),
('column_66', 247),
('column_67', 307),
('column_68', 916),
('column_69', 94),
('column_70', 714),
('column_71', 233),
('column_72', 995),
('column_73', 335),
('column_74', 454),
('column_75', 801),
('column_76', 742),
('column_77', 386),
('column_78', 196),
('column_79', 239),
('column_80', 723),
('column_81', 59),
('column_82', 929),
('column_83', 852),
('column_84', 722),
('column_85', 328),
('column_86', 59),
('column_87', 710),
('column_88', 238),
('column_89', 823),
('column_90', 75),
('column_91', 307),
('column_92', 472),
('column_93', 822),
('column_94', 582),
('column_95', 802),
('column_96', 48),
('column_97', 21),
('column_98', 277),
('column_99', 818))
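Building the two-column frame the question describes is also straightforward (a sketch; it assumes the row's values share a dtype, as they do here):
pl.DataFrame({"column": df.columns, "value": list(df.row(2))})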
Pandas does not display all values either if you have many columns.
In [121]: df.to_pandas().iloc[0]
Out[121]:
column_0 285
column_1 366
column_2 886
column_3 981
column_4 464
...
column_95 862
column_96 63
column_97 326
column_98 882
column_99 564
Name: 0, Length: 100, dtype: int64
You can try using melt. For example:
df = pl.DataFrame(
    [
        pl.Series(name="col_str", values=["string1", "string2"]),
        pl.Series(name="col_bool", values=[False, True]),
        pl.Series(name="col_int", values=[1, 2]),
        pl.Series(name="col_float", values=[10.0, 20.0]),
        *[pl.Series(name=f"col_other_{idx}", values=[idx] * 2)
          for idx in range(1, 25)],
    ]
)
print(df)
shape: (2, 28)
┌─────────┬──────────┬─────────┬───────────┬─────┬──────────────┬──────────────┬──────────────┬──────────────┐
│ col_str ┆ col_bool ┆ col_int ┆ col_float ┆ ... ┆ col_other_21 ┆ col_other_22 ┆ col_other_23 ┆ col_other_24 │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ bool ┆ i64 ┆ f64 ┆ ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════════╪══════════╪═════════╪═══════════╪═════╪══════════════╪══════════════╪══════════════╪══════════════╡
│ string1 ┆ false ┆ 1 ┆ 10.0 ┆ ... ┆ 21 ┆ 22 ┆ 23 ┆ 24 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ string2 ┆ true ┆ 2 ┆ 20.0 ┆ ... ┆ 21 ┆ 22 ┆ 23 ┆ 24 │
└─────────┴──────────┴─────────┴───────────┴─────┴──────────────┴──────────────┴──────────────┴──────────────┘
To print the first row:
pl.Config.set_tbl_rows(100)
df[0,].melt()
shape: (28, 2)
┌──────────────┬─────────┐
│ variable ┆ value │
│ --- ┆ --- │
│ str ┆ str │
╞══════════════╪═════════╡
│ col_str ┆ string1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_bool ┆ false │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_int ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_float ┆ 10.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_1 ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_2 ┆ 2 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_3 ┆ 3 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_4 ┆ 4 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_5 ┆ 5 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_6 ┆ 6 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_7 ┆ 7 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_8 ┆ 8 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_9 ┆ 9 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_10 ┆ 10 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_11 ┆ 11 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_12 ┆ 12 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_13 ┆ 13 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_14 ┆ 14 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_15 ┆ 15 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_16 ┆ 16 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_17 ┆ 17 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_18 ┆ 18 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_19 ┆ 19 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_20 ┆ 20 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_21 ┆ 21 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_22 ┆ 22 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_23 ┆ 23 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_24 ┆ 24 │
└──────────────┴─────────┘
If needed, set the polars.Config.set_tbl_rows option to the number of rows you find acceptable. (This only needs to be done once per session, not every time you print.)
Notice that all values have been cast to super-type str. (One caution: this approach won't work if any of your columns are of dtype list.)
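Depending on your polars version, DataFrame.transpose may offer a similar one-step vertical view (the same common-supertype caveat applies):
df[0,].transpose(include_header=True)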
You may want to check the Polars Cookbook section about indexing here.
It's stated that:
| operation  | pandas     | polars   |
|------------|------------|----------|
| select row | df.iloc[2] | df[2, :] |
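For the example frame from the question, that indexing looks like (a quick sketch):
import polars as pl

df = pl.DataFrame({'a': [0, 1], 'b': [2, 3]})
print(df[1, :])  # second row, all columns (pandas: df.iloc[1])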
Cheers!