Count consecutive True (or 1) values in a Boolean (or numeric) column with Polars?

I am hoping to count consecutive values in a column, preferably using Polars expressions.
import polars as pl

df = pl.DataFrame(
    {"values": [True, True, True, False, False, True, False, False, True, True]}
)
With the example data frame above, I would like to count the number of consecutive True values.
Below is example output using R's data.table package.
library(data.table)
dt <- data.table(value = c(T,T,T,F,F,T,F,F,T,T))
dt[, value2 := fifelse((1:.N) == .N & value == 1, .N, NA_integer_), by = rleid(value)]
dt
    value value2
 1:  TRUE     NA
 2:  TRUE     NA
 3:  TRUE      3
 4: FALSE     NA
 5: FALSE     NA
 6:  TRUE      1
 7: FALSE     NA
 8: FALSE     NA
 9:  TRUE     NA
10:  TRUE      2
Any ideas how this could be done efficiently using Polars?
[EDIT with a new approach]
I got it working with the code below, but I'm hoping there is a more efficient way. Does anyone know the default struct/dict field names returned by value_counts?
(
    df.lazy()
    .with_row_count()
    .with_column(
        pl.when(pl.col("values") == False)
        .then(pl.col("row_nr"))
        .fill_null(strategy="forward")
        .alias("id_consecutive_Trues")
    )
    .with_column(
        pl.col("id_consecutive_Trues").value_counts(sort=True)
    )
    .with_column(
        (
            pl.col("id_consecutive_Trues").arr.eval(
                pl.element().struct.rename_fields(["value", "count"]).struct.field("count")
            ).arr.max()
            - pl.lit(1)
        ).alias("max_consecutive_true_values")
    )
    .collect()
)
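As an aside, the struct field names that value_counts produces can differ between Polars releases; one way to check them on your installed version is to inspect the schema of the result:
import polars as pl

df = pl.DataFrame({"values": [True, True, False]})
# value_counts returns a Struct column; its field names show up in the schema
print(df.select(pl.col("values").value_counts()).schema)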

It can be thought of as a groupby operation.
A common way to generate group IDs for consecutive values is to compare each value with its predecessor using != .shift() and take the .cumsum() of the result.
I've put each step into its own .with_columns() here, but it could be simplified:
(
    df
    .with_columns(
        (pl.col("values") != pl.col("values").shift(+1))
        .alias("change"))
    .with_columns(
        pl.col("change").cumsum()
        .alias("group"))
    .with_columns(
        pl.count().over("group"))
    .with_columns(
        (pl.col("values") != pl.col("values").shift(-1))
        .alias("keep"))
    .with_columns(
        pl.when(pl.col("values") & pl.col("keep"))
        .then(pl.col("count"))
        .alias("values2"))
)
shape: (10, 6)
┌────────┬────────┬───────┬───────┬───────┬─────────┐
│ values | change | group | count | keep | values2 │
│ --- | --- | --- | --- | --- | --- │
│ bool | bool | u32 | u32 | bool | u32 │
╞════════╪════════╪═══════╪═══════╪═══════╪═════════╡
│ true | true | 1 | 3 | false | null │
│ true | false | 1 | 3 | false | null │
│ true | false | 1 | 3 | true | 3 │
│ false | true | 2 | 2 | false | null │
│ false | false | 2 | 2 | true | null │
│ true | true | 3 | 1 | true | 1 │
│ false | true | 4 | 2 | false | null │
│ false | false | 4 | 2 | true | null │
│ true | true | 5 | 2 | false | null │
│ true | false | 5 | 2 | true | 2 │
└────────┴────────┴───────┴───────┴───────┴─────────┘
One possible way to write it in a "less verbose" manner:
df.with_columns(
    pl.when(
        pl.col("values") &
        pl.col("values").shift_and_fill(-1, False).is_not())
    .then(
        pl.count().over(
            (pl.col("values") != pl.col("values").shift())
            .cumsum()))
)
shape: (10, 2)
┌────────┬───────┐
│ values | count │
│ --- | --- │
│ bool | u32 │
╞════════╪═══════╡
│ true | null │
│ true | null │
│ true | 3 │
│ false | null │
│ false | null │
│ true | 1 │
│ false | null │
│ false | null │
│ true | null │
│ true | 2 │
└────────┴───────┘
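For readers on a more recent Polars release, a rough sketch of the same idea (assuming a version that provides rle_id, pl.len and shift(fill_value=...); these are not available in the 0.16-era API used above):
import polars as pl

df = pl.DataFrame(
    {"values": [True, True, True, False, False, True, False, False, True, True]}
)

out = df.with_columns(
    pl.when(
        # last row of a run of consecutive True values
        pl.col("values") & pl.col("values").shift(-1, fill_value=False).not_()
    )
    .then(
        # length of the run this row belongs to
        pl.len().over(pl.col("values").rle_id())
    )
    .alias("count")
)
print(out)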


Failed to determine supertype of Datetime

Polars: 0.16.2
Python: 3.11.1
Windows 10
Attempting to filter a column using a time range via .is_between()
I couldn't find anything on Stack Overflow, but found something (maybe?) similar in the GitHub issues (though it has been solved): https://github.com/pola-rs/polars/issues/5236
To reproduce
import polars as pl
from datetime import datetime, time

df = pl.date_range(low=datetime(2023, 2, 7), high=datetime(2023, 2, 8), interval="30m", name="date").to_frame()

# Attempt to filter by time
df.filter(
    pl.col('date').is_between(time(9, 30), time(14, 30))
)
Traceback:
PanicException Traceback (most recent call last)
Cell In[11], line 1
----> 1 df.filter(
2 pl.col('date').is_between(time(9, 30, 0, 0), time(14, 30, 0, 0))
3 )
File d:\My_Path\venv\Lib\site-packages\polars\internals\dataframe\frame.py:2747, in DataFrame.filter(self, predicate)
2741 if _check_for_numpy(predicate) and isinstance(predicate, np.ndarray):
2742 predicate = pli.Series(predicate)
2744 return (
2745 self.lazy()
2746 .filter(predicate) # type: ignore[arg-type]
-> 2747 .collect(no_optimization=True)
2748 )
File d:\My_Path\venv\Lib\site-packages\polars\internals\lazyframe\frame.py:1146, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, no_optimization, slice_pushdown, common_subplan_elimination, streaming)
1135 common_subplan_elimination = False
1137 ldf = self._ldf.optimization_toggle(
1138 type_coercion,
1139 predicate_pushdown,
(...)
1144 streaming,
1145 )
-> 1146 return pli.wrap_df(ldf.collect())
PanicException: cannot coerce datatypes: ComputeError(Owned("Failed to determine supertype of Datetime(Microseconds, None) and Time"))
Not sure if I'm doing something wrong, or if this is a bug.
I tried to filter a series using a time range and expected a filtered series containing just those times. Instead, I got a PanicException (listed above).
You are trying to filter a Datetime column with a Time. You need to cast to pl.Time before calling is_between:
df.filter(
    pl.col('date').cast(pl.Time).is_between(time(9, 30), time(14, 30))
)
┌─────────────────────┐
│ date │
│ --- │
│ datetime[μs] │
╞═════════════════════╡
│ 2023-02-07 10:00:00 │
│ 2023-02-07 10:30:00 │
│ 2023-02-07 11:00:00 │
│ 2023-02-07 11:30:00 │
│ 2023-02-07 12:00:00 │
│ 2023-02-07 12:30:00 │
│ 2023-02-07 13:00:00 │
│ 2023-02-07 13:30:00 │
│ 2023-02-07 14:00:00 │
└─────────────────────┘
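An equivalent approach, assuming a Polars version that exposes the .dt.time() accessor, is to extract the time-of-day component explicitly before comparing:
df.filter(
    pl.col('date').dt.time().is_between(time(9, 30), time(14, 30))
)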

Finding (unique) set of values across columns

I have a Polars DataFrame that looks like the one below:
id | attribute | val1 | val2 | val3
---+-----------+------+------+-----
1  | True      | A    | A    | A
2  | True      | A    | A    | B
2  | False     | A    | B    | C
I would like to create a new column that is the set of the values in val1, val2, and val3. For example,
id | attribute | val1 | val2 | val3 | set
---+-----------+------+------+------+--------
1  | True      | A    | A    | A    | A
2  | True      | A    | A    | B    | A, B
2  | False     | A    | B    | C    | A, B, C
I can do something like this,
import polars as pl

df = pl.DataFrame({
    'id': [1, 2, 3],
    'attribute': [True, True, False],
    'val1': ['A', 'A', 'A'],
    'val2': ['A', 'A', 'B'],
    'val3': ['A', 'B', 'C'],
})
df = df.with_columns([pl.struct('^val.*$').alias('set')])
df = df.with_columns(pl.col('set').apply(lambda x: set(x.values())))
However, with the apply, it is predictably slow. Is there a way to do this using native Polars functionality?
It's essentially a variation of https://stackoverflow.com/a/75387840/ as @ΩΠΟΚΕΚΡΥΜΜΕΝΟΣ pointed out.
df.with_columns(
    pl.concat_list(pl.col("^val.*$"))
    .arr.eval(
        pl.element().unique(maintain_order=True).drop_nulls(),
        parallel=True)
    .alias("set")
)
shape: (3, 6)
┌─────┬───────────┬──────┬──────┬──────┬─────────────────┐
│ id | attribute | val1 | val2 | val3 | set │
│ --- | --- | --- | --- | --- | --- │
│ i64 | bool | str | str | str | list[str] │
╞═════╪═══════════╪══════╪══════╪══════╪═════════════════╡
│ 1 | true | A | A | A | ["A"] │
│ 2 | true | A | A | B | ["A", "B"] │
│ 3 | false | A | B | C | ["A", "B", "C"] │
└─────┴───────────┴──────┴──────┴──────┴─────────────────┘
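On newer Polars versions, where the list namespace replaced arr for List columns, roughly the same result can be written without arr.eval. A sketch, assuming list.unique supports maintain_order on your version:
df.with_columns(
    pl.concat_list(pl.col("^val.*$"))
    .list.drop_nulls()
    .list.unique(maintain_order=True)
    .alias("set")
)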

Polars Python write_csv error: "pyo3_runtime.PanicException: should not be here"

I am new to using Polars for Python. I am taking a dataframe as an input, converting each column to a numpy array, reassigning values to certain indices in these arrays, deleting specific rows from all of these arrays, and then converting each array to a dataframe and performing a pl.concat (horizontally) on these dataframes. I know that the operation is working because I am able to print the dataframe on Terminal. However, when I try to write the outputDF to a csv file, I get the error below. Any help fixing the error would be greatly appreciated.
P.S.: Here is the link to the sample input data:
https://mega.nz/file/u0Z0GS6b#uSD6PDqyHXIEfWDLNQR2VgaqBcBSgeLdSL8lSjTSq3M
thread '<unnamed>' panicked at 'should not be here', /Users/runner/work/polars/polars/polars/polars-core/src/chunked_array/ops/any_value.rs:103:32
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
File "/Users/username/Desktop/my_code/segmentation/gpt3_embeddings.py", line 89, in <module>
result = process_data(df)
File "/Users/username/Desktop/my_code/segmentation/gpt3_embeddings.py", line 80, in process_data
outputDF.write_csv("PostProcessing_output.csv")
File "/Users/username/opt/anaconda3/envs/segmentation/lib/python3.9/site-packages/polars/internals/dataframe/frame.py", line 2033, in write_csv
self._df.write_csv(
pyo3_runtime.PanicException: should not be here
My code looks as follows:
import numpy as np
import polars as pl

# PROCESSING THE TRANSCRIBED TEXT
def process_data(inputDF):
    # Convert relevant columns in the dataframe to numpy arrays
    phraseArray = inputDF["phrase"].to_numpy()
    actorArray = inputDF["actor"].to_numpy()
    startTimeArray = inputDF["start_time"].to_numpy()
    endTimeArray = inputDF["end_time"].to_numpy()
    # get indicators marking where two consecutive rows have the same actor
    speaker_change = inputDF.select(pl.col("actor").diff())
    speaker_change = speaker_change.rename({"actor": "change"})
    inputDF = inputDF.with_column(speaker_change.to_series(0))
    zero_indices = inputDF.filter(pl.col("change") == 0).select(["sentence_index"]).to_series().to_list()  # indices where diff() gave 0
    if len(zero_indices) > 0:
        for index in reversed(zero_indices):
            extract_phrase = phraseArray[index]
            extract_endTime = endTimeArray[index]
            joined_phrases = phraseArray[index - 1] + extract_phrase
            phraseArray[index - 1] = joined_phrases
            endTimeArray[index - 1] = extract_endTime
            phraseArray = np.delete(phraseArray, index)
            actorArray = np.delete(actorArray, index)
            startTimeArray = np.delete(startTimeArray, index)
            endTimeArray = np.delete(endTimeArray, index)
        outputDF = pl.concat(
            [
                pl.DataFrame(actorArray, columns=["actor"], orient="col"),
                pl.DataFrame(phraseArray, columns=["phrase"], orient="col"),
                pl.DataFrame(startTimeArray, columns=["start_time"], orient="col"),
                pl.DataFrame(endTimeArray, columns=["end_time"], orient="col"),
            ],
            rechunk=True,
            how="horizontal",
        )
        outputDF = outputDF.with_row_count(name="sentence_index")
        outputDF = outputDF[["sentence_index", "actor", "phrase", "start_time", "end_time"]]
        print(outputDF[342:348])
        outputDF.write_csv("PostProcessing_output.csv")
        return outputDF
    else:
        return inputDF
I tried using df.hstack instead of concat but that did not work either. I also tried rechunk on the dataframe but that did not help either. I think the issue has to do with me converting the columns into numpy arrays and then converting them back to dataframes, but I am not sure.
It looks like you're trying to group consecutive rows based on the actor column and "combine" them.
A common approach is to create "group ids" by taking the .cumsum() of an inequality comparison with the previous row:
>>> df.head(8).with_columns((pl.col("actor") != pl.col("actor").shift()).cumsum().alias("id"))
shape: (8, 6)
┌────────────────┬───────┬─────────────────────────────────────┬────────────────┬────────────────┬─────┐
│ sentence_index | actor | phrase | start_time | end_time | id │
│ --- | --- | --- | --- | --- | --- │
│ str | str | str | str | str | u32 │
╞════════════════╪═══════╪═════════════════════════════════════╪════════════════╪════════════════╪═════╡
│ 0 | 1 | companies. So I don't have any ... | 0:00:00 | 0:00:28.125000 | 1 │
├────────────────┼───────┼─────────────────────────────────────┼────────────────┼────────────────┼─────┤
│ 1 | 0 | Oh yeah, that's fine. | 0:00:28.125000 | 0:00:29.625000 | 2 │
├────────────────┼───────┼─────────────────────────────────────┼────────────────┼────────────────┼─────┤
│ 2 | 1 | Okay, good. And so I have a few... | 0:00:29.625000 | 0:00:38.625000 | 3 │
├────────────────┼───────┼─────────────────────────────────────┼────────────────┼────────────────┼─────┤
│ 3 | 0 | I'm in the parking lot, yeah? Y... | 0:00:38.625000 | 0:00:41.375000 | 4 │
├────────────────┼───────┼─────────────────────────────────────┼────────────────┼────────────────┼─────┤
│ 4 | 1 | Thank you. | 0:00:41.375000 | 0:00:42.125000 | 5 │
├────────────────┼───────┼─────────────────────────────────────┼────────────────┼────────────────┼─────┤
│ 5 | 1 | And so when we get ready for th... | 0:00:42.375000 | 0:01:44.125000 | 5 │
├────────────────┼───────┼─────────────────────────────────────┼────────────────┼────────────────┼─────┤
│ 6 | 0 | Yeah, let's just get started. | 0:01:44.125000 | 0:01:45.375000 | 6 │
├────────────────┼───────┼─────────────────────────────────────┼────────────────┼────────────────┼─────┤
│ 7 | 1 | Okay, let's do it. So first of ... | 0:01:45.375000 | 0:01:52.625000 | 7 │
└────────────────┴───────┴─────────────────────────────────────┴────────────────┴────────────────┴─────┘
Note that rows with sentence_index 4 and 5 end up with the same id - this can be used to .groupby()
It looks like you want the .last() end_time
.str.concat() can be used to "combine" the phrase values
.first() can be used for the remaining columns
output_df = (
    df.groupby((pl.col("actor") != pl.col("actor").shift()).cumsum().alias("id"))
    .agg([
        pl.col("actor").first(),
        pl.col("sentence_index").first(),
        pl.col("phrase").str.concat(delimiter=""),
        pl.col("start_time").first(),
        pl.col("end_time").last()
    ])
    .sort("sentence_index")
    .drop(["id", "sentence_index"])
    .with_row_count("sentence_index")
)
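With the combining done entirely inside Polars (no detour through NumPy object arrays), writing the result out should then behave as usual:
output_df.write_csv("PostProcessing_output.csv")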

Using unnest to join in Postgres

I appreciate this is a simple use case, but I'm having difficulty doing a join in Postgres using an array.
I have two tables:
table: shares
id         | likes_id_array  | timestamp | share_site
-----------+-----------------+-----------+-----------
12345_6789 | [xxx, yyy, zzz] | date1     | fb
abcde_wxyz | [vbd, fka, fhx] | date2     | tw
table: likes
likes_id | name | location
---------+------+---------
xxx      | aaaa | nice
fpg      | bbbb | dfpb
yyy      | mmmm | place
dhf      | cccc | fiwk
zzz      | dddd | here
desired - a result set based on shares.id = 12345_6789:
likes_id | name | location | timestamp
---------+------+----------+----------
xxx      | aaaa | nice     | date1
yyy      | mmmm | place    | date1
zzz      | dddd | here     | date1
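For reference, a minimal schema matching these tables could look like the following (this is an assumption: plain text columns and a native text[] array for likes_id_array; one of the answers below instead assumes jsonb):
CREATE TABLE shares (
    id             text PRIMARY KEY,
    likes_id_array text[],
    "timestamp"    timestamptz,
    share_site     text
);

CREATE TABLE likes (
    likes_id text PRIMARY KEY,
    name     text,
    location text
);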
the first step is using unnest() for the likes_id_array:
SELECT unnest(likes_id_array) as i FROM shares
WHERE id = '12345_6789'
but I can't figure out how to join the result set this produces with the likes table on likes_id. Any help would be much appreciated!
You can create a CTE with the query that returns the likes identifiers, and then do a regular inner join with the likes table:
with like_ids as (
select
unnest(likes_id_array) as like_id
from shares
where id = '12345_6789'
)
select
likes_id,
name,
location
from likes
inner join like_ids
on likes.likes_id = like_ids.like_id
You can use ANY:
SELECT a.*, b.timestamp FROM likes a JOIN shares b ON a.likes_id = ANY(b.likes_id_array) WHERE id = '12345_6789';
You could do this with subqueries or a CTE, but the easiest way is to call the unnest function not in the SELECT clause but as a table expression in the FROM clause:
SELECT likes.*, shares.timestamp
FROM shares, unnest(likes_id_array) as arr(likes_id)
JOIN likes USING (likes_id)
WHERE shares.id = '12345_6789'
You can use jsonb_array_elements_text with an (implicit) lateral join:
SELECT
likes.likes_id,
likes.name,
likes.location,
shares.timestamp
FROM
shares,
jsonb_array_elements_text(shares.likes_id_array) AS share_likes(id),
likes
WHERE
likes.likes_id = share_likes.id AND
shares.id = '12345_6789';
Output:
┌──────────┬──────┬──────────┬─────────────────────┐
│ likes_id │ name │ location │ timestamp │
├──────────┼──────┼──────────┼─────────────────────┤
│ xxx │ aaaa │ nice │ 2022-10-12 11:32:39 │
│ yyy │ mmmm │ place │ 2022-10-12 11:32:39 │
│ zzz │ dddd │ here │ 2022-10-12 11:32:39 │
└──────────┴──────┴──────────┴─────────────────────┘
(3 rows)
Or if you want to make the lateral join explicit (notice the addition of the LATERAL keyword):
SELECT
likes.likes_id,
likes.name,
likes.location,
shares.timestamp
FROM
shares,
LATERAL jsonb_array_elements_text(shares.likes_id_array) AS share_likes(id),
likes
WHERE
likes.likes_id = share_likes.id AND
shares.id = '12345_6789';

PostgreSQL: detecting the first/last rows of result set

Is there any way to embed a flag in a select that indicates that it is the first or the last row of a result set? I'm thinking something to the effect of:
> SELECT is_first_row() AS f, is_last_row() AS l FROM blah;
f | l
-----------
t | f
f | f
f | f
f | f
f | t
The answer might be in window functions but I've only just learned about them, and I question their efficiency.
SELECT first_value(unique_column) OVER () = unique_column, last_value(unique_column) OVER () = unique_column, * FROM blah;
seems to do what I want. Unfortunately, I don't even fully understand that syntax, but since unique_column is unique and NOT NULL it should deliver unambiguous results. But if it does sorting, then the cure might be worse than the disease. (Actually, in my tests, unique_column is not sorted, so that's something.)
EXPLAIN ANALYZE doesn't indicate there's an efficiency problem, but when has it ever told me what I needed to know?
And I might need to use this in an aggregate function, but I've just been told window functions aren't allowed there. 😕
Edit:
Actually, I just added ORDER BY unique_column to the above query and the rows identified as first and last were thrown into the middle of the result set. It's as if first_value()/last_value() really means "the first/last value I picked up before I began sorting." I don't think I can safely do this optimally. Not unless a much better understanding of the use of the OVER keyword is to be had.
I'm running PostgreSQL 9.6 in a Debian 9.5 environment.
This isn't a duplicate, because I'm trying to get the first row and last row of the result set to identify themselves, while Postgres: get min, max, aggregate values in one select is just going for the minimum and maximum values for a column in a result set.
You can use the lead() and lag() window functions (over the appropriate window) and check whether they are NULL:
-- \i tmp.sql
CREATE TABLE ztable
( id SERIAL PRIMARY KEY
, starttime TIMESTAMP
);
INSERT INTO ztable (starttime) VALUES ( now() - INTERVAL '1 minute');
INSERT INTO ztable (starttime) VALUES ( now() - INTERVAL '2 minute');
INSERT INTO ztable (starttime) VALUES ( now() - INTERVAL '3 minute');
INSERT INTO ztable (starttime) VALUES ( now() - INTERVAL '4 minute');
INSERT INTO ztable (starttime) VALUES ( now() - INTERVAL '5 minute');
INSERT INTO ztable (starttime) VALUES ( now() - INTERVAL '6 minute');
SELECT id, starttime
, ( lag(id) OVER www IS NULL) AS is_first
, ( lead(id) OVER www IS NULL) AS is_last
FROM ztable
WINDOW www AS (ORDER BY id )
ORDER BY id
;
SELECT id, starttime
, ( lag(id) OVER www IS NULL) AS is_first
, ( lead(id) OVER www IS NULL) AS is_last
FROM ztable
WINDOW www AS (ORDER BY starttime )
ORDER BY id
;
SELECT id, starttime
, ( lag(id) OVER www IS NULL) AS is_first
, ( lead(id) OVER www IS NULL) AS is_last
FROM ztable
WINDOW www AS (ORDER BY starttime )
ORDER BY random()
;
Result:
INSERT 0 1
INSERT 0 1
INSERT 0 1
INSERT 0 1
INSERT 0 1
INSERT 0 1
id | starttime | is_first | is_last
----+----------------------------+----------+---------
1 | 2018-08-31 18:38:45.567393 | t | f
2 | 2018-08-31 18:37:45.575586 | f | f
3 | 2018-08-31 18:36:45.587436 | f | f
4 | 2018-08-31 18:35:45.592316 | f | f
5 | 2018-08-31 18:34:45.600619 | f | f
6 | 2018-08-31 18:33:45.60907 | f | t
(6 rows)
id | starttime | is_first | is_last
----+----------------------------+----------+---------
1 | 2018-08-31 18:38:45.567393 | f | t
2 | 2018-08-31 18:37:45.575586 | f | f
3 | 2018-08-31 18:36:45.587436 | f | f
4 | 2018-08-31 18:35:45.592316 | f | f
5 | 2018-08-31 18:34:45.600619 | f | f
6 | 2018-08-31 18:33:45.60907 | t | f
(6 rows)
id | starttime | is_first | is_last
----+----------------------------+----------+---------
2 | 2018-08-31 18:37:45.575586 | f | f
4 | 2018-08-31 18:35:45.592316 | f | f
6 | 2018-08-31 18:33:45.60907 | t | f
5 | 2018-08-31 18:34:45.600619 | f | f
1 | 2018-08-31 18:38:45.567393 | f | t
3 | 2018-08-31 18:36:45.587436 | f | f
(6 rows)
[updated: added a randomly sorted case]
It is simple using window functions with particular frames:
with t(x, y) as (select generate_series(1,5), random())
select *,
count(*) over (rows between unbounded preceding and current row),
count(*) over (rows between current row and unbounded following)
from t;
┌───┬───────────────────┬───────┬───────┐
│ x │ y │ count │ count │
├───┼───────────────────┼───────┼───────┤
│ 1 │ 0.543995119165629 │ 1 │ 5 │
│ 2 │ 0.886343683116138 │ 2 │ 4 │
│ 3 │ 0.124682310037315 │ 3 │ 3 │
│ 4 │ 0.668972567655146 │ 4 │ 2 │
│ 5 │ 0.266671542543918 │ 5 │ 1 │
└───┴───────────────────┴───────┴───────┘
As you can see, count(*) over (rows between unbounded preceding and current row) returns the number of rows from the beginning of the data set up to the current row, and count(*) over (rows between current row and unbounded following) returns the number of rows from the current row to the end of the data set. A value of 1 marks the first/last row.
This works as long as you do not reorder the data set with order by. If you do, you need to repeat the ordering in the frame definitions:
with t(x, y) as (select generate_series(1,5), random())
select *,
count(*) over (order by y rows between unbounded preceding and current row),
count(*) over (order by y rows between current row and unbounded following)
from t order by y;
┌───┬───────────────────┬───────┬───────┐
│ x │ y │ count │ count │
├───┼───────────────────┼───────┼───────┤
│ 1 │ 0.125781774986535 │ 1 │ 5 │
│ 4 │ 0.25046408502385 │ 2 │ 4 │
│ 5 │ 0.538880597334355 │ 3 │ 3 │
│ 3 │ 0.802807193249464 │ 4 │ 2 │
│ 2 │ 0.869908029679209 │ 5 │ 1 │
└───┴───────────────────┴───────┴───────┘
PS: As mentioned by a_horse_with_no_name in the comment:
there is no such thing as the "first" or "last" row without sorting.
In fact, window functions are a great approach, and for this requirement they work very well.
Regarding efficiency, window functions operate over the data set already at hand, which means the DBMS only adds some extra processing to infer the first/last values.
Just one thing I'd like to suggest: I like to put an ORDER BY criterion inside the OVER clause, just to ensure the data set order is the same between multiple executions, thus returning the same values to you.
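As a sketch (reusing blah and unique_column from the question, and assuming unique_column defines the order you care about), the two flags can also be computed with row_number() and count():
SELECT b.*,
       row_number() OVER w = 1                AS is_first,
       row_number() OVER w = count(*) OVER () AS is_last
FROM blah b
WINDOW w AS (ORDER BY unique_column)
ORDER BY unique_column;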
Try using:
(
  SELECT columns
  FROM mytable
  -- join conditions
  WHERE conditions
  ORDER BY date DESC
  LIMIT 1
)
UNION ALL
(
  SELECT columns
  FROM mytable
  -- join conditions
  WHERE conditions
  ORDER BY date ASC
  LIMIT 1
)
In Postgres, each SELECT needs its own parentheses when it carries ORDER BY/LIMIT inside a UNION. Each branch only returns a single row, which cuts down the processing time; you can also add an index on the date column.