Create duplicates of rows based on column values - python-polars

I'm trying to build a histogram of some data in polars. As part of my histogram code, I need to duplicate some rows. I've got a column of values, where each row also has a weight that says how many times the row should be added to the histogram.
How can I duplicate my value rows according to the weight column?
Here is some example data, with a target series:
import polars as pl
df = pl.DataFrame({"value":[1,2,3], "weight":[2, 2, 1]})
print(df)
# shape: (3, 2)
# ┌───────┬────────┐
# │ value ┆ weight │
# │ ---   ┆ ---    │
# │ i64   ┆ i64    │
# ╞═══════╪════════╡
# │ 1     ┆ 2      │
# │ 2     ┆ 2      │
# │ 3     ┆ 1      │
# └───────┴────────┘
s_target = pl.Series(name="value", values=[1,1,2,2,3])
print(s_target)
# shape: (5,)
# Series: 'value' [i64]
# [
#     1
#     1
#     2
#     2
#     3
# ]

How about
(
    df.with_columns(
        pl.col("value").repeat_by(pl.col("weight"))
    )
    .select(pl.col("value").arr.explode())
)
In [11]: df.with_columns(pl.col('value').repeat_by(pl.col('weight'))).select(pl.col('value').arr.explode())
Out[11]:
shape: (5, 1)
┌───────┐
│ value │
│ ---   │
│ i64   │
╞═══════╡
│ 1     │
│ 1     │
│ 2     │
│ 2     │
│ 3     │
└───────┘
I didn't know you could do this so easily, I only learned about it while writing the answer. Polars is so nice :)

Turns out repeat_by and a subsequent explode are the perfect building blocks for this transformation:
>>> df.select(pl.col('value').repeat_by('weight').arr.explode())
shape: (5, 1)
┌───────┐
│ value │
│ ---   │
│ i64   │
╞═══════╡
│ 1     │
│ 1     │
│ 2     │
│ 2     │
│ 3     │
└───────┘
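Note that in more recent Polars versions the arr namespace for list operations was renamed to list, and a list expression can be exploded directly. A minimal sketch of the same idea (assuming a newer release; check against your installed version):

import polars as pl

df = pl.DataFrame({"value": [1, 2, 3], "weight": [2, 2, 1]})

# repeat_by produces a list column; exploding it yields one row per repetition
df.select(pl.col("value").repeat_by("weight").explode())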

Related

How to select the last non-null value from one column and also the value from another column on the same row in Polars?

Below is a non-working example in which I retrieve the last available 'Open', but how do I get the corresponding 'Time'?
sel = self.data.select([
    pl.col('Time'),
    pl.col('Open').drop_nulls().last()
])
For instance, you can use .filter() to select the rows that do not contain null and then take the last row.
Here is an example:
df = pl.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": ["cat", None, "owl", None, None]
})
┌─────┬──────┐
│ a   ┆ b    │
│ --- ┆ ---  │
│ i64 ┆ str  │
╞═════╪══════╡
│ 1   ┆ cat  │
│ 2   ┆ null │
│ 3   ┆ owl  │
│ 4   ┆ null │
│ 5   ┆ null │
└─────┴──────┘
df.filter(
    pl.col("b").is_not_null()
).select(pl.all().last())
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═════╡
│ 3   ┆ owl │
└─────┴─────┘
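The same idea can also be written as a single select, filtering every column by the not-null mask before taking the last element. A sketch built on the example frame above (the result should match the two-step version, but verify on your version):

# pl.all() expands per column; each column is filtered by b's not-null mask
df.select(pl.all().filter(pl.col("b").is_not_null()).last())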

Is it possible in Polars to "reset" cumsum() at a certain condition?

I need to cumsum column b until a becomes True. After that, the cumsum should restart from that row, and so on.
a     | b
------+---
False | 1
False | 2
True  | 3
False | 4
Can I do this in Polars without looping over each row?
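For reference, a minimal construction of the example frame, assuming the dtypes implied by the table (boolean a, integer b):

import polars as pl

df = pl.DataFrame({
    "a": [False, False, True, False],
    "b": [1, 2, 3, 4],
})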
You could use the .cumsum() of the a column as the "group number".
>>> df.select(pl.col("a").cumsum())
shape: (4, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 0   │
├╌╌╌╌╌┤
│ 0   │
├╌╌╌╌╌┤
│ 1   │
├╌╌╌╌╌┤
│ 1   │
└─────┘
And use that with .over()
>>> df.select(pl.col("b").cumsum().over(pl.col("a").cumsum()))
shape: (4, 1)
┌─────┐
│ b   │
│ --- │
│ i64 │
╞═════╡
│ 1   │
├╌╌╌╌╌┤
│ 3   │
├╌╌╌╌╌┤
│ 3   │
├╌╌╌╌╌┤
│ 7   │
└─────┘
You can .shift().backward_fill() the group ids to include the True row in the group being summed:
>>> df.select(pl.col("b").cumsum().over(
... pl.col("a").cumsum().shift().backward_fill()))
shape: (4, 1)
┌─────┐
│ b   │
│ --- │
│ i64 │
╞═════╡
│ 1   │
├╌╌╌╌╌┤
│ 3   │
├╌╌╌╌╌┤
│ 6   │
├╌╌╌╌╌┤
│ 4   │
└─────┘
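To see why the shift-and-backfill works, it helps to inspect the intermediate group ids for the example (values worked out by hand from the data above):

# a.cumsum()                         -> 0, 0, 1, 1
# a.cumsum().shift()                 -> null, 0, 0, 1
# a.cumsum().shift().backward_fill() -> 0, 0, 0, 1
df.select(pl.col("a").cumsum().shift().backward_fill().alias("group_id"))

The True row now falls in the same group as the rows before it, so its value is the last one added before the sum resets.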

Ordinal encoding of column in polars

I would like to do an ordinal encoding of a column. Pandas has the nice and convenient pd.factorize() method; I would like to achieve the same in Polars.
df = pl.DataFrame({"a": [5, 8, 10], "b": ["hi", "hello", "hi"]})
┌─────┬───────┐
│ a   ┆ b     │
│ --- ┆ ---   │
│ i64 ┆ str   │
╞═════╪═══════╡
│ 5   ┆ hi    │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 8   ┆ hello │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 10  ┆ hi    │
└─────┴───────┘
desired result:
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 0   ┆ 0   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 1   ┆ 1   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ 0   │
└─────┴─────┘
You can join with a dummy DataFrame that contains the unique values and the ordinal encoding you are interested in:
df = pl.DataFrame({"a": [5, 8, 10], "b": ["hi", "hello", "hi"]})
unique = df.select(
    pl.col("b").unique(maintain_order=True)
).with_row_count(name="ordinal")

df.join(unique, on="b")
Or you could "misuse" the fact that categorical values are backed by u32 integers.
df.with_column(
    pl.col("b").cast(pl.Categorical).to_physical().alias("ordinal")
)
Both methods output:
shape: (3, 3)
┌─────┬───────┬─────────┐
│ a   ┆ b     ┆ ordinal │
│ --- ┆ ---   ┆ ---     │
│ i64 ┆ str   ┆ u32     │
╞═════╪═══════╪═════════╡
│ 5   ┆ hi    ┆ 0       │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 8   ┆ hello ┆ 1       │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 10  ┆ hi    ┆ 0       │
└─────┴───────┴─────────┘
Here's another way to do it, although I doubt it's better than the dummy DataFrame approach from @ritchie46:
(
    df.with_columns([
        pl.col('b').unique().list().alias('uniq'),
        pl.col('b').unique().list().arr.eval(pl.element().rank()).alias('uniqid'),
    ])
    .explode(['uniq', 'uniqid'])
    .filter(pl.col('b') == pl.col('uniq'))
    .select(pl.exclude('uniq'))
    .with_column(pl.col('uniqid') - 1)
)
There's almost certainly a way to improve this, but basically it creates a new column called uniq, which is a list of all the unique values of the column, as well as uniqid, which (I think, and it seems to be) is the 1-indexed rank order of those values. It then explodes those, creating a row for every value in uniq, and then filters out the rows that don't equal column b. Since rank gives a 1-index (rather than a 0-index), you have to subtract 1, and you can exclude the uniq column, which we don't care about since it's the same as b.
If the order is not important, you could use .rank(method="dense"):
>>> df.select(pl.all().rank(method="dense") - 1)
shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ u32 ┆ u32 │
╞═════╪═════╡
│ 0   ┆ 1   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 1   ┆ 0   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ 1   │
└─────┴─────┘
If it is, you could:
>>> (
...     df.with_row_count()
...     .with_columns([
...         pl.col("row_nr").first()
...         .over(col)
...         .rank(method="dense")
...         .alias(col) - 1
...         for col in df.columns
...     ])
...     .drop("row_nr")
... )
shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ u32 ┆ u32 │
╞═════╪═════╡
│ 0   ┆ 0   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 1   ┆ 1   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ 0   │
└─────┴─────┘

Sort within groups on entire table

If I have a single column, I can sort that column within groups using the over method. For example,
import polars as pl
df = pl.DataFrame({'group': [2,2,1,1,2,2], 'value': [3,4,3,1,1,3]})
df.with_column(pl.col('value').sort().over('group'))
# shape: (6, 2)
# ┌───────┬───────┐
# │ group ┆ value │
# │ ---   ┆ ---   │
# │ i64   ┆ i64   │
# ╞═══════╪═══════╡
# │ 2     ┆ 1     │
# ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
# │ 2     ┆ 3     │
# ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
# │ 1     ┆ 1     │
# ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
# │ 1     ┆ 3     │
# ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
# │ 2     ┆ 3     │
# ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
# │ 2     ┆ 4     │
# └───────┴───────┘
What is nice about this operation is that it maintains the order of the groups (e.g. group=1 is still rows 3 and 4; group=2 is still rows 1, 2, 5, and 6).
But this only works to sort a single column. How do I sort an entire table like this? I tried the things below, but none of them worked:
import polars as pl
df = pl.DataFrame({'group': [2,2,1,1,2,2], 'value': [3,4,3,1,1,3], 'value2': [5,4,3,2,1,0]})
df.groupby('group').sort(['value', 'value2'])
# errors
df.sort([pl.col('value').over('group'), pl.col('value2').over('group')])
# does not sort with groups
# Looking for this:
# shape: (6, 3)
# ┌───────┬───────┬────────┐
# │ group ┆ value ┆ value2 │
# │ ---   ┆ ---   ┆ ---    │
# │ i64   ┆ i64   ┆ i64    │
# ╞═══════╪═══════╪════════╡
# │ 2     ┆ 1     ┆ 1      │
# ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
# │ 2     ┆ 3     ┆ 0      │
# ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
# │ 1     ┆ 1     ┆ 2      │
# ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
# │ 1     ┆ 3     ┆ 3      │
# ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
# │ 2     ┆ 3     ┆ 5      │
# ├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
# │ 2     ┆ 4     ┆ 4      │
# └───────┴───────┴────────┘
The solution to sorting an entire table in a grouped situation is pl.all().sort_by(sort_columns).over(group_columns).
import polars as pl
df = pl.DataFrame({
    'group': [2,2,1,1,2,2],
    'value': [3,4,3,1,1,3],
    'value2': [5,4,3,2,1,0],
})
df.select(pl.all().sort_by(['value','value2']).over('group'))
If you also want the result ordered by group,
df.select(
    pl.all().sort_by(['value','value2']).over('group').sort_by(['group'])
)
may be helpful.
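For the example frame, the first expression should reproduce the target shown in the question; a quick check, with the expected rows worked out from the data above:

df.select(pl.all().sort_by(['value', 'value2']).over('group'))
# Group 2 rows stay at positions 1, 2, 5, 6 and group 1 rows at 3, 4;
# within each group the rows are ordered by (value, value2):
# (2, 1, 1), (2, 3, 0), (1, 1, 2), (1, 3, 3), (2, 3, 5), (2, 4, 4)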

Cluster rows with same values without sorting

Sorting by particular columns brings together all rows that share the same tuple of values in those columns. I want to cluster all rows with the same value in that way, but keep the groups in the order in which their first member appeared.
Something like this:
import polars as pl
df = pl.DataFrame(dict(x=[1,0,1,0], y=[3,1,2,4]))
df.cluster('x')
# shape: (4, 2)
# ┌─────┬─────┐
# │ x   ┆ y   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 1   ┆ 3   │
# ├╌╌╌╌╌┼╌╌╌╌╌┤
# │ 1   ┆ 2   │
# ├╌╌╌╌╌┼╌╌╌╌╌┤
# │ 0   ┆ 1   │
# ├╌╌╌╌╌┼╌╌╌╌╌┤
# │ 0   ┆ 4   │
# └─────┴─────┘
This can be done by:
1. Storing the row index temporarily
2. Setting the row index to the lowest value within a window over the columns of interest
3. Sorting by that minimum index
4. Deleting the temporary row index column
import polars as pl
df = pl.DataFrame(dict(x=[1,0,1,0], y=[3,1,2,4]))
(
    df
    .with_column(pl.arange(0, pl.count()).alias('_index'))
    .with_column(pl.min('_index').over('x'))
    .sort('_index')
    .drop('_index')
)
# shape: (4, 2)
# ┌─────┬─────┐
# │ x   ┆ y   │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 1   ┆ 3   │
# ├╌╌╌╌╌┼╌╌╌╌╌┤
# │ 1   ┆ 2   │
# ├╌╌╌╌╌┼╌╌╌╌╌┤
# │ 0   ┆ 1   │
# ├╌╌╌╌╌┼╌╌╌╌╌┤
# │ 0   ┆ 4   │
# └─────┴─────┘
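The temporary index can also come from .with_row_count() (used in the ordinal-encoding answer above) instead of pl.arange; a sketch of the same recipe:

(
    df
    .with_row_count(name='_index')        # u32 row index column
    .with_column(pl.min('_index').over('x'))  # lowest index within each x group
    .sort('_index')
    .drop('_index')
)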