Finding (unique) set of values across columns - python-polars

I have a Polars Dataframe that looks like the below
id | attribute | val1 | val2 | val3
1  | True      | A    | A    | A
2  | True      | A    | A    | B
3  | False     | A    | B    | C
I would like to create a new column that is the set of the values in val1, val2, and val3. For example,
id | attribute | val1 | val2 | val3 | set
1  | True      | A    | A    | A    | A
2  | True      | A    | A    | B    | A, B
3  | False     | A    | B    | C    | A, B, C
I can do something like this,
import polars as pl
df = pl.DataFrame({
    'id': [1, 2, 3],
    'attribute': [True, True, False],
    'val1': ['A', 'A', 'A'],
    'val2': ['A', 'A', 'B'],
    'val3': ['A', 'B', 'C'],
})
df = df.with_columns([pl.struct('^val.*$').alias('set')])
df = df.with_columns(pl.col('set').apply(lambda x: set(x.values())))
However, with the apply, it is predictably slow. Is there a way to do this using native Polars functionality?

It's essentially a variation of https://stackoverflow.com/a/75387840/, as @ΩΠΟΚΕΚΡΥΜΜΕΝΟΣ pointed out.
df.with_columns(
    pl.concat_list(pl.col("^val.*$"))
    .arr.eval(
        pl.element().unique(maintain_order=True).drop_nulls(),
        parallel=True)
    .alias("set")
)
shape: (3, 6)
┌─────┬───────────┬──────┬──────┬──────┬─────────────────┐
│ id | attribute | val1 | val2 | val3 | set │
│ --- | --- | --- | --- | --- | --- │
│ i64 | bool | str | str | str | list[str] │
╞═════╪═══════════╪══════╪══════╪══════╪═════════════════╡
│ 1 | true | A | A | A | ["A"] │
│ 2 | true | A | A | B | ["A", "B"] │
│ 3 | false | A | B | C | ["A", "B", "C"] │
└─────┴───────────┴──────┴──────┴──────┴─────────────────┘
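As a side note, more recent Polars releases (an assumption about the version in use; the list namespace was renamed from .arr to .list around 0.18) let the same idea be written without eval at all, using list.unique directly. A minimal sketch:
import polars as pl

df = pl.DataFrame({
    'id': [1, 2, 3],
    'attribute': [True, True, False],
    'val1': ['A', 'A', 'A'],
    'val2': ['A', 'A', 'B'],
    'val3': ['A', 'B', 'C'],
})

# Concatenate the val* columns into one list per row, then keep only the
# unique values in order of appearance. On older releases the namespace is
# .arr rather than .list.
out = df.with_columns(
    pl.concat_list(pl.col('^val.*$'))
    .list.unique(maintain_order=True)
    .alias('set')
)
print(out)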

Count consecutive True (or 1) values in a Boolean (or numeric) column with Polars?

I am hoping to count consecutive values in a column, preferably using Polars expressions.
import polars as pl

df = pl.DataFrame(
    {"values": [True, True, True, False, False, True, False, False, True, True]}
)
With the example data frame above, I would like to count the number of consecutive True values.
Below is example output using R's Data.Table package.
library(data.table)
dt <- data.table(value = c(T,T,T,F,F,T,F,F,T,T))
dt[, value2 := fifelse((1:.N) == .N & value == 1, .N, NA_integer_), by = rleid(value)]
dt
value | value2
TRUE  | NA
TRUE  | NA
TRUE  | 3
FALSE | NA
FALSE | NA
TRUE  | 1
FALSE | NA
FALSE | NA
TRUE  | NA
TRUE  | 2
Any ideas how this could be done efficiently using Polars?
[EDIT with a new approach]
I got it working with the code below, but I am hoping there is a more efficient way. Does anyone know the default struct/dictionary field names from value_counts?
(
    df.lazy()
    .with_row_count()
    .with_column(
        pl.when(pl.col("value") == False).then(
            pl.col("row_nr")
        ).fill_null(
            strategy="forward"
        ).alias("id_consecutive_Trues")
    )
    .with_column(
        pl.col("id_consecutive_Trues").value_counts(sort=True)
    )
    .with_column(
        (
            pl.col("id_consecutive_Trues").arr.eval(
                pl.element().struct.rename_fields(["value", "count"]).struct.field("count")
            ).arr.max()
            - pl.lit(1)
        ).alias("max_consecutive_true_values")
    )
    .collect()
)
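As an aside on the field-name question: value_counts returns a struct named after the original column plus a count field, but the exact name of the count field has varied between Polars releases, so it is safer to inspect the schema than to hard-code it. A small sketch, reusing the example frame from the question:
import polars as pl

df = pl.DataFrame(
    {"values": [True, True, True, False, False, True, False, False, True, True]}
)

# Print the dtype of the value_counts() result to see the actual struct
# field names for the installed Polars version instead of guessing them.
vc = df.select(pl.col("values").value_counts())
print(vc.schema)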
It can be thought of as a groupby operation.
A common way to generate group IDs for consecutive values is to check != .shift() and take the .cumsum() of the result.
I've put each step into its own .with_columns() here but it could be simplified:
(
    df
    .with_columns(
        (pl.col("values") != pl.col("values").shift(+1))
        .alias("change"))
    .with_columns(
        pl.col("change").cumsum()
        .alias("group"))
    .with_columns(
        pl.count().over("group"))
    .with_columns(
        (pl.col("values") != pl.col("values").shift(-1))
        .alias("keep"))
    .with_columns(
        pl.when(pl.col("values") & pl.col("keep"))
        .then(pl.col("count"))
        .alias("values2"))
)
shape: (10, 6)
┌────────┬────────┬───────┬───────┬───────┬─────────┐
│ values | change | group | count | keep | values2 │
│ --- | --- | --- | --- | --- | --- │
│ bool | bool | u32 | u32 | bool | u32 │
╞════════╪════════╪═══════╪═══════╪═══════╪═════════╡
│ true | true | 1 | 3 | false | null │
│ true | false | 1 | 3 | false | null │
│ true | false | 1 | 3 | true | 3 │
│ false | true | 2 | 2 | false | null │
│ false | false | 2 | 2 | true | null │
│ true | true | 3 | 1 | true | 1 │
│ false | true | 4 | 2 | false | null │
│ false | false | 4 | 2 | true | null │
│ true | true | 5 | 2 | false | null │
│ true | false | 5 | 2 | true | 2 │
└────────┴────────┴───────┴───────┴───────┴─────────┘
One possible way to write it in a "less verbose" manner:
df.with_columns(
    pl.when(
        pl.col("values") &
        pl.col("values").shift_and_fill(-1, False).is_not())
    .then(
        pl.count().over(
            (pl.col("values") != pl.col("values").shift())
            .cumsum()))
)
shape: (10, 2)
┌────────┬───────┐
│ values | count │
│ --- | --- │
│ bool | u32 │
╞════════╪═══════╡
│ true | null │
│ true | null │
│ true | 3 │
│ false | null │
│ false | null │
│ true | 1 │
│ false | null │
│ false | null │
│ true | null │
│ true | 2 │
└────────┴───────┘
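If this pattern comes up repeatedly, the same expressions can be wrapped in a small helper. A sketch using the same API as the answers above (the name consecutive_true_count is just illustrative, and some functions have since been renamed in newer Polars, e.g. cumsum to cum_sum and pl.count to pl.len):
import polars as pl

def consecutive_true_count(values: pl.Expr) -> pl.Expr:
    # Group id that increments whenever the value changes (!= shift + cumsum).
    group = (values != values.shift()).fill_null(True).cumsum()
    # The last row of a run is where the next value differs (or is missing).
    last_of_run = values & (values != values.shift(-1)).fill_null(True)
    # Emit the run length only on the last True of each run, null elsewhere.
    return pl.when(last_of_run).then(pl.count().over(group)).otherwise(None)

df = pl.DataFrame(
    {"values": [True, True, True, False, False, True, False, False, True, True]}
)
print(df.with_columns(consecutive_true_count(pl.col("values")).alias("count")))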

Polars Python write_csv error: "pyo3_runtime.PanicException: should not be here"

I am new to using Polars for Python. I am taking a dataframe as input, converting each column to a numpy array, reassigning values at certain indices in these arrays, deleting specific rows from all of these arrays, and then converting each array back to a dataframe and performing a horizontal pl.concat on these dataframes. I know that the operation is working because I am able to print the dataframe in the terminal. However, when I try to write the outputDF to a csv file, I get the error below. Any help fixing the error would be greatly appreciated.
P.S.: Here is the link to the sample input data:
https://mega.nz/file/u0Z0GS6b#uSD6PDqyHXIEfWDLNQR2VgaqBcBSgeLdSL8lSjTSq3M
thread '<unnamed>' panicked at 'should not be here', /Users/runner/work/polars/polars/polars/polars-core/src/chunked_array/ops/any_value.rs:103:32
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
File "/Users/username/Desktop/my_code/segmentation/gpt3_embeddings.py", line 89, in <module>
result = process_data(df)
File "/Users/username/Desktop/my_code/segmentation/gpt3_embeddings.py", line 80, in process_data
outputDF.write_csv("PostProcessing_output.csv")
File "/Users/username/opt/anaconda3/envs/segmentation/lib/python3.9/site-packages/polars/internals/dataframe/frame.py", line 2033, in write_csv
self._df.write_csv(
pyo3_runtime.PanicException: should not be here
My code looks as follows:
# PROCESSING THE TRANSCRIBED TEXT:
import numpy as np
import polars as pl

def process_data(inputDF):
    # Convert relevant columns in the dataframe to numpy arrays
    phraseArray = inputDF["phrase"].to_numpy()
    actorArray = inputDF["actor"].to_numpy()
    startTimeArray = inputDF["start_time"].to_numpy()
    endTimeArray = inputDF["end_time"].to_numpy()
    # get indicators marking where two consecutive rows have the same actor
    speaker_change = inputDF.select(pl.col("actor").diff())
    speaker_change = speaker_change.rename({"actor": "change"})
    inputDF = inputDF.with_column(speaker_change.to_series(0))
    # indices where diff() gave 0
    zero_indices = inputDF.filter(pl.col("change") == 0).select(["sentence_index"]).to_series().to_list()
    if len(zero_indices) > 0:
        for index in reversed(zero_indices):
            extract_phrase = phraseArray[index]
            extract_endTime = endTimeArray[index]
            joined_phrases = phraseArray[index - 1] + extract_phrase
            phraseArray[index - 1] = joined_phrases
            endTimeArray[index - 1] = extract_endTime
            phraseArray = np.delete(phraseArray, index)
            actorArray = np.delete(actorArray, index)
            startTimeArray = np.delete(startTimeArray, index)
            endTimeArray = np.delete(endTimeArray, index)
        outputDF = pl.concat(
            [
                pl.DataFrame(actorArray, columns=["actor"], orient="col"),
                pl.DataFrame(phraseArray, columns=["phrase"], orient="col"),
                pl.DataFrame(startTimeArray, columns=["start_time"], orient="col"),
                pl.DataFrame(endTimeArray, columns=["end_time"], orient="col"),
            ],
            rechunk=True,
            how="horizontal",
        )
        outputDF = outputDF.with_row_count(name="sentence_index")
        outputDF = outputDF[["sentence_index", "actor", "phrase", "start_time", "end_time"]]
        print(outputDF[342:348])
        outputDF.write_csv("PostProcessing_output.csv")
        return outputDF
    else:
        return inputDF
I tried using df.hstack instead of concat, and I also tried rechunking the dataframe, but neither helped. I think the issue has to do with converting the columns into numpy arrays and then back into dataframes, but I am not sure.
It looks like you're trying to group consecutive rows based on the actor column and "combine" them.
A common approach to this is to create "group ids" using the .cumsum() of comparing inequality to the previous row:
>>> df.head(8).with_columns((pl.col("actor") != pl.col("actor").shift()).cumsum().alias("id"))
shape: (8, 6)
┌────────────────┬───────┬─────────────────────────────────────┬────────────────┬────────────────┬─────┐
│ sentence_index | actor | phrase | start_time | end_time | id │
│ --- | --- | --- | --- | --- | --- │
│ str | str | str | str | str | u32 │
╞════════════════╪═══════╪═════════════════════════════════════╪════════════════╪════════════════╪═════╡
│ 0 | 1 | companies. So I don't have any ... | 0:00:00 | 0:00:28.125000 | 1 │
├────────────────┼───────┼─────────────────────────────────────┼────────────────┼────────────────┼─────┤
│ 1 | 0 | Oh yeah, that's fine. | 0:00:28.125000 | 0:00:29.625000 | 2 │
├────────────────┼───────┼─────────────────────────────────────┼────────────────┼────────────────┼─────┤
│ 2 | 1 | Okay, good. And so I have a few... | 0:00:29.625000 | 0:00:38.625000 | 3 │
├────────────────┼───────┼─────────────────────────────────────┼────────────────┼────────────────┼─────┤
│ 3 | 0 | I'm in the parking lot, yeah? Y... | 0:00:38.625000 | 0:00:41.375000 | 4 │
├────────────────┼───────┼─────────────────────────────────────┼────────────────┼────────────────┼─────┤
│ 4 | 1 | Thank you. | 0:00:41.375000 | 0:00:42.125000 | 5 │
├────────────────┼───────┼─────────────────────────────────────┼────────────────┼────────────────┼─────┤
│ 5 | 1 | And so when we get ready for th... | 0:00:42.375000 | 0:01:44.125000 | 5 │
├────────────────┼───────┼─────────────────────────────────────┼────────────────┼────────────────┼─────┤
│ 6 | 0 | Yeah, let's just get started. | 0:01:44.125000 | 0:01:45.375000 | 6 │
├────────────────┼───────┼─────────────────────────────────────┼────────────────┼────────────────┼─────┤
│ 7 | 1 | Okay, let's do it. So first of ... | 0:01:45.375000 | 0:01:52.625000 | 7 │
└────────────────┴───────┴─────────────────────────────────────┴────────────────┴────────────────┴─────┘
Note that rows with sentence_index 4 and 5 end up with the same id - this can be used to .groupby()
It looks like you want the .last() end_time
.str.concat() can be used to "combine" the phrase values
.first() can be used for the remaining columns
output_df = (
    df.groupby((pl.col("actor") != pl.col("actor").shift()).cumsum().alias("id"))
    .agg([
        pl.col("actor").first(),
        pl.col("sentence_index").first(),
        pl.col("phrase").str.concat(delimiter=""),
        pl.col("start_time").first(),
        pl.col("end_time").last()
    ])
    .sort("sentence_index")
    .drop(["id", "sentence_index"])
    .with_row_count("sentence_index")
)
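Since this stays entirely within Polars expressions (no numpy round-trip), writing the result out should then work as in the original code. A sketch that wraps the aggregation above into a drop-in replacement for process_data, assuming the same column names:
import polars as pl

def process_data(input_df: pl.DataFrame) -> pl.DataFrame:
    # Merge consecutive rows from the same actor with the groupby/agg shown
    # above, then write the result straight to CSV.
    output_df = (
        input_df
        .groupby((pl.col("actor") != pl.col("actor").shift()).cumsum().alias("id"))
        .agg([
            pl.col("actor").first(),
            pl.col("sentence_index").first(),
            pl.col("phrase").str.concat(delimiter=""),
            pl.col("start_time").first(),
            pl.col("end_time").last(),
        ])
        .sort("sentence_index")
        .drop(["id", "sentence_index"])
        .with_row_count("sentence_index")
    )
    output_df.write_csv("PostProcessing_output.csv")
    return output_df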

How can I use a lateral join to flatten the jsonb in Postgres?

I have data like this, which I need to flatten so that each id has the corresponding KEY and size as two separate columns.
I was watching a tutorial on Snowflake, which has this function:
select distinct json.key as column_name
from raw.public.table_name,
lateral flatten(input => table_name) json
I was trying to find something similar in a Postgres query.
id | json_data
1 | {"KEY": "mekq1232314342134434", "size": 0}
2 | {"KEY": "meksaq12323143421344", "size": 2}
3 | {"KEY": "meksaq12323324421344", "size": 3}
So I need two things here: first, the distinct keys from these jsonb columns; second, to flatten the jsonb into columns like this:
id | KEY | size
1 | mekq1232314342134434 | 0
Another option, beside ->>, would be to use jsonb_to_record:
with
  sample_data (id, json) as (values
    (1, '{"KEY": "mekq1232314342134434", "size": 0}' :: jsonb),
    (2, '{"KEY": "meksaq12323143421344", "size": 2}' :: jsonb),
    (3, '{"KEY": "meksaq12323324421344", "size": 3}' :: jsonb)
  )
select
  id, "KEY", size
from
  sample_data,
  lateral jsonb_to_record(sample_data.json) as ("KEY" text, size int);
-- `lateral` is optional in this case
┌────┬──────────────────────┬──────┐
│ id │ KEY │ size │
├────┼──────────────────────┼──────┤
│ 1 │ mekq1232314342134434 │ 0 │
│ 2 │ meksaq12323143421344 │ 2 │
│ 3 │ meksaq12323324421344 │ 3 │
└────┴──────────────────────┴──────┘
(3 rows)

How do I select a postgres Many-to-One relationship as a single row? [duplicate]

This question already has answers here:
PostgreSQL Crosstab Query
(7 answers)
Closed 3 years ago.
I have a many-to-one relationship between Animals and their attributes. Because different Animals have different attributes, I want to be able to select all animals with their attribute name as a column header and NULL values where that animal does not have that attribute.
Like so...
TABLE_ANIMALS
ID | ANIMAL | DATE | MORE COLS....
1 | CAT | 2012-01-10 | ....
2 | DOG | 2012-01-10 | ....
3 | FROG | 2012-01-10 | ....
...
TABLE_ATTRIBUTES
ID | ANIMAL_ID | ATTRIBUE_NAME | ATTRIBUTE_VALUE
1 | 1 | noise | meow
2 | 1 | legs | 4
3 | 1 | has_fur | TRUE
4 | 2 | noise | woof
5 | 2 | legs | 4
6 | 3 | noise | croak
7 | 3 | legs | 2
8 | 3 | has_fur | FALSE
...
QUERY RESULT
ID | ANIMAL | NOISE | LEGS | HAS_FUR
1 | CAT | meow | 4 | TRUE
2 | DOG | woof | 4 | NULL
3 | FROG | croak | 2 | FALSE
How would I do this? To reiterate, it's important that all the columns are there even if one Animal doesn't have that attribute, such as "DOG" and "HAS_FUR" in this example. If it doesn't have the attribute, it should just be null.
How about a simple join, aggregation and group by?
create table table_animals(id int, animal varchar(10), date date);
create table table_attributes(id varchar(10), animal_id int, attribute_name varchar(10), attribute_value varchar(10));
insert into table_animals values (1, 'CAT', '2012-01-10'),
(2, 'DOG', '2012-01-10'),
(3, 'FROG', '2012-01-10');
insert into table_attributes values (1, 1, 'noise', 'meow'),
(2, 1, 'legs', 4),
(3, 1, 'has_fur', TRUE),
(4, 2, 'noise', 'woof'),
(5, 2, 'legs', 4),
(6, 3, 'noise', 'croak'),
(7, 3, 'legs', 2),
(8, 3, 'has_fur', FALSE);
select ta.animal,
max(attribute_value) filter (where attribute_name = 'noise') as noise,
max(attribute_value) filter (where attribute_name = 'legs') as legs,
max(attribute_value) filter (where attribute_name = 'has_fur') as has_fur
from table_animals ta
left join table_attributes tat on tat.animal_id = ta.id
group by ta.animal
Here's a rextester sample
Additionally, you can change the aggregation to MAX(CASE WHEN ...), but MAX() FILTER (WHERE ...) has better performance.

Spark - Iterating through all rows in dataframe comparing multiple columns for each row against another

| match_id | player_id | team | win |
| 0 | 1 | A | A |
| 0 | 2 | A | A |
| 0 | 3 | B | A |
| 0 | 4 | B | A |
| 1 | 1 | A | B |
| 1 | 4 | A | B |
| 1 | 8 | B | B |
| 1 | 9 | B | B |
| 2 | 8 | A | A |
| 2 | 4 | A | A |
| 2 | 3 | B | A |
| 2 | 2 | B | A |
I have a dataframe that looks like above.
I need to create a map of (key, value) pairs such that:
(k => (player_id_1, player_id_2), v => 1), if player_id_1 wins against player_id_2 in a match
and
(k => (player_id_1, player_id_2), v => 0), if player_id_1 loses against player_id_2 in a match.
I will thus have to iterate through the entire data frame, comparing each player_id to the others based on the other three columns.
I am planning to achieve this as follows:
Group by match_id.
In each group, for a player_id, check the following against the other player_ids:
a. If the match_id is the same and the team is different, then
if team = win:
(k => (player_id_1, player_id_2), v => 0)
else (team != win):
(k => (player_id_1, player_id_2), v => 1)
For example, after partitioning by matches, consider match 1.
player_id 1 needs to be compared to player_id 2, 3 and 4.
While iterating, the record for player_id 2 will be skipped, as the team is the same;
for player_id 3, as the team is different, the team and win columns will be compared.
As player_id 1 was in team A, player_id 3 was in team B, and team A won, the key-value pair formed would be
((1,3),1)
I have a fair idea of how to achieve this in imperative programming, but I am really new to Scala and functional programming and can't figure out how, while iterating through every row, to create a (key, value) pair based on checks against the other fields.
I tried my best to explain the problem. Please let me know if any part of my question is unclear; I would be happy to elaborate. Thank you.
P.S: I am using Spark 1.6
This can be achieved using the DataFrame API as shown below.
Dataframe API version:
val df = Seq(
  (0,1,"A","A"), (0,2,"A","A"), (0,3,"B","A"), (0,4,"B","A"),
  (1,1,"A","B"), (1,4,"A","B"), (1,8,"B","B"), (1,9,"B","B"),
  (2,8,"A","A"), (2,4,"A","A"), (2,3,"B","A"), (2,2,"B","A")
).toDF("match_id", "player_id", "team", "win")
val result = df.alias("left")
.join(df.alias("right"), $"left.match_id" === $"right.match_id" && not($"right.team" === $"left.team"))
.select($"left.player_id", $"right.player_id", when($"left.team" === $"left.win", 1).otherwise(0).alias("flag"))
scala> result.collect().map(x => (x.getInt(0),x.getInt(1)) -> x.getInt(2)).toMap
res4: scala.collection.immutable.Map[(Int, Int),Int] = Map((1,8) -> 0, (3,4) -> 0, (3,1) -> 0, (9,1) -> 1, (4,1) -> 0, (8,1) -> 1, (2,8) -> 0, (8,3) -> 1, (1,9) -> 0, (1,4) -> 1, (8,2) -> 1, (4,9) -> 0, (3,2) -> 0, (1,3) -> 1, (4,8) -> 0, (4,2) -> 1, (2,4) -> 1, (8,4) -> 1, (2,3) -> 1, (4,3) -> 1, (9,4) -> 1, (3,8) -> 0)
SPARK SQL version:
df.registerTempTable("data_table")
val result = sqlContext.sql("""
  SELECT DISTINCT t0.player_id, t1.player_id,
         CASE WHEN t0.team == t0.win THEN 1 ELSE 0 END AS flag
  FROM data_table t0
  INNER JOIN data_table t1
    ON t0.match_id = t1.match_id
   AND t0.team != t1.team
""")