I need to take a cumulative sum of column b until a becomes True; after that, the cumsum should restart from that row, and so on.
a | b
-------------
False | 1
False | 2
True | 3
False | 4
Can I do this in Polars without looping over each row?
You could use the .cumsum() of the a column as the "group number".
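For a reproducible setup (the question only shows the data as a table, so this construction is a sketch of the sample data):
>>> import polars as pl
>>> df = pl.DataFrame({"a": [False, False, True, False], "b": [1, 2, 3, 4]})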
>>> df.select(pl.col("a").cumsum())
shape: (4, 1)
┌─────┐
│ a │
│ --- │
│ i64 │
╞═════╡
│ 0 │
├╌╌╌╌╌┤
│ 0 │
├╌╌╌╌╌┤
│ 1 │
├╌╌╌╌╌┤
│ 1 │
└─────┘
And use that with .over():
>>> df.select(pl.col("b").cumsum().over(pl.col("a").cumsum()))
shape: (4, 1)
┌─────┐
│ b │
│ --- │
│ i64 │
╞═════╡
│ 1 │
├╌╌╌╌╌┤
│ 3 │
├╌╌╌╌╌┤
│ 3 │
├╌╌╌╌╌┤
│ 7 │
└─────┘
You can .shift().backward_fill() the group numbers to include the True row in the preceding group:
>>> df.select(pl.col("b").cumsum().over(
... pl.col("a").cumsum().shift().backward_fill()))
shape: (4, 1)
┌─────┐
│ b │
│ --- │
│ i64 │
╞═════╡
│ 1 │
├╌╌╌╌╌┤
│ 3 │
├╌╌╌╌╌┤
│ 6 │
├╌╌╌╌╌┤
│ 4 │
└─────┘
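If you want to keep a and b alongside the running sum, the same expression can be aliased inside .with_column() (a sketch using the same older API; the column name b_cumsum is mine):
>>> df.with_column(
...     pl.col("b").cumsum()
...     .over(pl.col("a").cumsum().shift().backward_fill())
...     .alias("b_cumsum")
... )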
I have a Polars dataframe in the form:
df = pl.DataFrame({'a':[1,2,3], 'b':[['a','b'],['a'],['c','d']]})
┌─────┬────────────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ list[str] │
╞═════╪════════════╡
│ 1 ┆ ["a", "b"] │
│ 2 ┆ ["a"] │
│ 3 ┆ ["c", "d"] │
└─────┴────────────┘
I want to convert it to the following form. I plan to save it to a Parquet file and query the file (with SQL).
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═════╡
│ 1 ┆ "a" │
│ 1 ┆ "b" │
│ 2 ┆ "a" │
│ 3 ┆ "c" │
│ 3 ┆ "d" │
└─────┴─────┘
I have seen an answer that works on struct columns, but df.unnest('b') on my data results in the error:
SchemaError: Series of dtype: List(Utf8) != Struct
I also found a GitHub issue showing that a list can be converted to a struct, but I can't work out how to do that, or whether it applies here.
To decompose a column of lists, you can use the .explode() method (docs):
df = pl.DataFrame({'a':[1,2,3], 'b':[['a','b'],['a'],['c','d']]})
df.explode("b")
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═════╡
│ 1 ┆ a │
│ 1 ┆ b │
│ 2 ┆ a │
│ 3 ┆ c │
│ 3 ┆ d │
└─────┴─────┘
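Since the stated goal is to save to Parquet and query it with SQL, the exploded frame can be written out directly (a sketch; "b_exploded.parquet" is a hypothetical path):
df.explode("b").write_parquet("b_exploded.parquet")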
I'm trying to build a histogram of some data in Polars. As part of my histogram code, I need to duplicate some rows. I've got a column of values, where each row also has a weight that says how many times that row should be added to the histogram.
How can I duplicate my value rows according to the weight column?
Here is some example data, with a target series:
import polars as pl
df = pl.DataFrame({"value":[1,2,3], "weight":[2, 2, 1]})
print(df)
# shape: (3, 2)
# ┌───────┬────────┐
# │ value ┆ weight │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═══════╪════════╡
# │ 1 ┆ 2 │
# │ 2 ┆ 2 │
# │ 3 ┆ 1 │
# └───────┴────────┘
s_target = pl.Series(name="value", values=[1,1,2,2,3])
print(s_target)
# shape: (5,)
# Series: 'value' [i64]
# [
# 1
# 1
# 2
# 2
# 3
# ]
How about
(
    df.with_columns(
        pl.col("value").repeat_by(pl.col("weight"))
    )
    .select(pl.col("value").arr.explode())
)
In [11]: df.with_columns(pl.col('value').repeat_by(pl.col('weight'))).select(pl.col('value').arr.explode())
Out[11]:
shape: (5, 1)
┌───────┐
│ value │
│ --- │
│ i64 │
╞═══════╡
│ 1 │
│ 1 │
│ 2 │
│ 2 │
│ 3 │
└───────┘
I didn't know you could do this so easily; I only learned about it while writing the answer. Polars is so nice :)
Turns out repeat_by and a subsequent explode are the perfect building blocks for this transformation:
>>> df.select(pl.col('value').repeat_by('weight').arr.explode())
shape: (5, 1)
┌───────┐
│ value │
│ --- │
│ i64 │
╞═══════╡
│ 1 │
│ 1 │
│ 2 │
│ 2 │
│ 3 │
└───────┘
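Note that in later Polars releases the .arr namespace for list columns was renamed to .list, so on a recent version the equivalent would be (a sketch):
>>> df.select(pl.col("value").repeat_by("weight").list.explode())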
Starting with
df = pl.DataFrame({'group': [1, 1, 1, 3, 3, 3, 4, 4]})
how can I get a column which numbers the 'group' column?
Here's what df looks like:
shape: (8, 1)
┌───────┐
│ group │
│ --- │
│ i64 │
╞═══════╡
│ 1 │
├╌╌╌╌╌╌╌┤
│ 1 │
├╌╌╌╌╌╌╌┤
│ 1 │
├╌╌╌╌╌╌╌┤
│ 3 │
├╌╌╌╌╌╌╌┤
│ 3 │
├╌╌╌╌╌╌╌┤
│ 3 │
├╌╌╌╌╌╌╌┤
│ 4 │
├╌╌╌╌╌╌╌┤
│ 4 │
└───────┘
and here's my expected output:
shape: (8, 2)
┌───────┬─────────┐
│ group ┆ group_i │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═══════╪═════════╡
│ 1 ┆ 0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 1 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 1 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 1 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ 2 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ 2 │
└───────┴─────────┘
Here's one way I came up with; it just feels a bit complex for this task... is there a simpler way?
df.with_column(((pl.col('group')!=pl.col('group').shift()).cast(pl.Int64).cumsum()-1).alias('group_i'))
You're looking to .rank() your data - in particular, a "dense" ranking. (I think the terms come from SQL.)
>>> df.with_column(pl.col("group").alias("group_i").rank("dense") - 1)
shape: (8, 2)
┌───────┬─────────┐
│ group ┆ group_i │
│ ---   ┆ ---     │
│ i64   ┆ u32     │
╞═══════╪═════════╡
│ 1     ┆ 0       │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 1     ┆ 0       │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 1     ┆ 0       │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 3     ┆ 1       │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 3     ┆ 1       │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 3     ┆ 1       │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 4     ┆ 2       │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 4     ┆ 2       │
└───────┴─────────┘
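The rank comes back as u32 rather than the i64 shown in the expected output; if the exact dtype matters, you can cast the result (a sketch):
>>> df.with_column(
...     (pl.col("group").rank("dense") - 1).cast(pl.Int64).alias("group_i")
... )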
Say I have
df = pl.DataFrame({'group': [1, 1, 1, 3, 3, 3, 4, 4], 'value': [1, 4, 2, 5, 3, 4, 2, 3]})
I'd like to get a rolling sum with a window of 2 for each group.
Expected output is:
┌───────┐
│ value │
│ --- │
│ i64 │
╞═══════╡
│ 1 │
├╌╌╌╌╌╌╌┤
│ 5 │
├╌╌╌╌╌╌╌┤
│ 6 │
├╌╌╌╌╌╌╌┤
│ 5 │
├╌╌╌╌╌╌╌┤
│ 8 │
├╌╌╌╌╌╌╌┤
│ 7 │
├╌╌╌╌╌╌╌┤
│ 2 │
├╌╌╌╌╌╌╌┤
│ 5 │
└───────┘
Use .rolling_sum().over("group"). min_periods=1 prevents nulls at the start of each group, where the window is not yet full:
>>> df.select(pl.col("value").rolling_sum(2, min_periods=1).over("group"))
shape: (8, 1)
┌───────┐
│ value │
│ --- │
│ i64 │
╞═══════╡
│ 1 │
├╌╌╌╌╌╌╌┤
│ 5 │
├╌╌╌╌╌╌╌┤
│ 6 │
├╌╌╌╌╌╌╌┤
│ 5 │
├╌╌╌╌╌╌╌┤
│ 8 │
├╌╌╌╌╌╌╌┤
│ 7 │
├╌╌╌╌╌╌╌┤
│ 2 │
├╌╌╌╌╌╌╌┤
│ 5 │
└───────┘
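Without min_periods=1 the first row of each group comes back null, because a window of 2 is not yet full there (a sketch of the default behaviour):
>>> df.select(pl.col("value").rolling_sum(2).over("group"))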
Consider the following example table:
CREATE TABLE rndtbl AS
SELECT
generate_series(1, 10) AS id,
random() AS val;
and I want to find, for each id, a cluster_id such that the clusters are at least 0.1 apart from each other. How would I calculate such a cluster assignment?
A specific example would be:
select * from rndtbl ;
id | val
----+-------------------
1 | 0.485714662820101
2 | 0.185201027430594
3 | 0.368477711919695
4 | 0.687312887981534
5 | 0.978742253035307
6 | 0.961830694694072
7 | 0.10397826647386
8 | 0.644958863966167
9 | 0.912827260326594
10 | 0.196085536852479
(10 rows)
The result would be: ids (2, 7, 10) in one cluster, (5, 6, 9) in another, (4, 8) in another, and (1) and (3) as singleton clusters.
Starting from:
SELECT * FROM rndtbl ;
┌────┬────────────────────┐
│ id │ val │
├────┼────────────────────┤
│ 1 │ 0.153776332736015 │
│ 2 │ 0.572575284633785 │
│ 3 │ 0.998213059268892 │
│ 4 │ 0.654628816060722 │
│ 5 │ 0.692200613208115 │
│ 6 │ 0.572836415842175 │
│ 7 │ 0.0788379465229809 │
│ 8 │ 0.390280921943486 │
│ 9 │ 0.611408909317106 │
│ 10 │ 0.555164183024317 │
└────┴────────────────────┘
(10 rows)
Use the LAG window function to determine whether the current row starts a new cluster:
SELECT *, val - LAG(val) OVER (ORDER BY val) > 0.1 AS new_cluster
FROM rndtbl ;
┌────┬────────────────────┬─────────────┐
│ id │ val │ new_cluster │
├────┼────────────────────┼─────────────┤
│ 7 │ 0.0788379465229809 │ (null) │
│ 1 │ 0.153776332736015 │ f │
│ 8 │ 0.390280921943486 │ t │
│ 10 │ 0.555164183024317 │ t │
│ 2 │ 0.572575284633785 │ f │
│ 6 │ 0.572836415842175 │ f │
│ 9 │ 0.611408909317106 │ f │
│ 4 │ 0.654628816060722 │ f │
│ 5 │ 0.692200613208115 │ f │
│ 3 │ 0.998213059268892 │ t │
└────┴────────────────────┴─────────────┘
(10 rows)
Finally, you can SUM the number of true values (still ordering by val) to get the cluster number of each row (counting from 0):
SELECT *, SUM(COALESCE(new_cluster::int, 0)) OVER (ORDER BY val) AS nb_cluster
FROM (
SELECT *, val - LAG(val) OVER (ORDER BY val) > 0.1 AS new_cluster
FROM rndtbl
) t
;
┌────┬────────────────────┬─────────────┬────────────┐
│ id │ val │ new_cluster │ nb_cluster │
├────┼────────────────────┼─────────────┼────────────┤
│ 7 │ 0.0788379465229809 │ (null) │ 0 │
│ 1 │ 0.153776332736015 │ f │ 0 │
│ 8 │ 0.390280921943486 │ t │ 1 │
│ 10 │ 0.555164183024317 │ t │ 2 │
│ 2 │ 0.572575284633785 │ f │ 2 │
│ 6 │ 0.572836415842175 │ f │ 2 │
│ 9 │ 0.611408909317106 │ f │ 2 │
│ 4 │ 0.654628816060722 │ f │ 2 │
│ 5 │ 0.692200613208115 │ f │ 2 │
│ 3 │ 0.998213059268892 │ t │ 3 │
└────┴────────────────────┴─────────────┴────────────┘
(10 rows)
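As an aside, the same shift-and-running-sum pattern maps directly onto the Polars idiom from the first answer above (a sketch, assuming the table has been loaded into a DataFrame named rndtbl_df with id and val columns):
>>> rndtbl_df.sort("val").with_column(
...     (pl.col("val").diff() > 0.1)   # null on the first row, like LAG
...     .fill_null(False)              # plays the role of COALESCE(..., 0)
...     .cast(pl.Int64)
...     .cumsum()
...     .alias("nb_cluster")
... )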