window agg over one value, but return another via Polars - python-polars

I am trying to use polars to do a window aggregate over one value, but map it back to another.
For example, if i wanted to get the name of the max value in a group, instead of (or in combination to) just the max value.
assuming an input of something like this.
|label|name|value|
|a. | foo| 1. |
|a. | bar| 2. |
|b. | baz| 1.5. |
|b. | boo| -1 |
# 'max_by' is not a real method, just using it to express what i'm trying to achieve.
df.select(col('label'), col('name').max_by('value').over('label'))
i want an output like this
|label|name|
|a. | bar|
|b. | baz|
ideally with the value too. But i know i can easily add that in via col('value').max().over('label').
|label|name|value|
|a. | bar| 2. |
|b. | baz| 1.5.|

You were close. There is a sort_by expression that can be used.
df.groupby('label').agg(pl.all().sort_by('value').last())
shape: (2, 3)
┌───────┬──────┬───────┐
│ label ┆ name ┆ value │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 │
╞═══════╪══════╪═══════╡
│ a. ┆ bar ┆ 2.0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b. ┆ baz ┆ 1.5 │
└───────┴──────┴───────┘
If you need a windowed version of this:
df.with_columns([
pl.col(['name','value']).sort_by('value').last().over('label').suffix("_max")
])
shape: (4, 5)
┌───────┬──────┬───────┬──────────┬───────────┐
│ label ┆ name ┆ value ┆ name_max ┆ value_max │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 ┆ str ┆ f64 │
╞═══════╪══════╪═══════╪══════════╪═══════════╡
│ a. ┆ foo ┆ 1.0 ┆ bar ┆ 2.0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ a. ┆ bar ┆ 2.0 ┆ bar ┆ 2.0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ b. ┆ baz ┆ 1.5 ┆ baz ┆ 1.5 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ b. ┆ boo ┆ -1.0 ┆ baz ┆ 1.5 │
└───────┴──────┴───────┴──────────┴───────────┘

Related

Given a data frame with n columns of numbers, how could you calculate the Pearson correlation of all column-pair combinations?

Let's say I have a Polars data frame like this:
=> shape: (19, 5)
┌───────────────┬─────────┬───────────┬───────────┬──────────┐
│ date ┆ open_AA ┆ open_AADI ┆ open_AADR ┆ open_AAL │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞═══════════════╪═════════╪═══════════╪═══════════╪══════════╡
│ 1674777600000 ┆ 51.39 ┆ 12.84 ┆ 50.0799 ┆ 16.535 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1674691200000 ┆ 52.43 ┆ 13.14 ┆ 49.84 ┆ 16.54 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1674604800000 ┆ 51.87 ┆ 12.88 ┆ 49.75 ┆ 15.97 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1674518400000 ┆ 51.22 ┆ 12.81 ┆ 50.1 ┆ 16.01 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ... ┆ ... ┆ ... ┆ ... │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1672876800000 ┆ 45.3 ┆ 12.7 ┆ 47.185 ┆ 13.5 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1672790400000 ┆ 44.77 ┆ 12.355 ┆ 47.32 ┆ 12.86 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1672704000000 ┆ 45.77 ┆ 12.91 ┆ 47.84 ┆ 12.91 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1672358400000 ┆ 46.01 ┆ 12.57 ┆ 47.29 ┆ 12.55 │
└───────────────┴─────────┴───────────┴───────────┴──────────┘
I'm looking to calculate the Pearson correlation between each pair-combination of all columns (except the date one). The result would look something like this:
=> shape: (5, 5)
┌───────────────┬─────────┬───────────┬───────────┬──────────┐
│ symbol ┆ open_AA ┆ open_AADI ┆ open_AADR ┆ open_AAL │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ utf8 ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞═══════════════╪═════════╪═══════════╪═══════════╪══════════╡
│ open_AA ┆ 1 ┆ 1 ┆ .1 ┆ -.5 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ open_AADI ┆ .2 ┆ 1 ┆ .2 ┆ .4 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ open_AADR ┆ .4 ┆ .2 ┆ 1 ┆ .3 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ open_AAL. ┆ -.45 ┆ -.6 ┆ 50.1 ┆ 1 │
└───────────────┴─────────┴───────────┴───────────┴──────────┘
My hunch is that I need to do the following:
Get the cartesian product of columns [1..] as a new data frame.
Using Polars expressions, calculate the pearson_corr of each of each series pair.
I'm new to Polars and am having trouble with the syntax. Can anyone point me in the right direction?
Say you start with:
df = pl.DataFrame({"date":[5,6,7],"foo": [1, 3, 9], "bar": [4, 1, 3], "ham": [2, 18, 9]})
You want to exclude some cols, so let's put those in a variable
excl_cols=['date']
Then...
(
df.drop(excl_cols) # Use drop to exclude the date column (or whatever columns you don't want)
.pearson_corr() # this is the meat and potatos of the request but it's missing your symbol column on left
.select(
[
pl.Series(df.drop(excl_cols).columns).alias('symbol'), # This just creates a Series out of the column names to become its own column
pl.all() #then just every other column
])
)
shape: (3, 4)
┌────────┬───────────┬───────────┬───────────┐
│ symbol ┆ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ f64 ┆ f64 │
╞════════╪═══════════╪═══════════╪═══════════╡
│ foo ┆ 1.0 ┆ -0.052414 ┆ 0.169695 │
│ bar ┆ -0.052414 ┆ 1.0 ┆ -0.993036 │
│ ham ┆ 0.169695 ┆ -0.993036 ┆ 1.0 │
└────────┴───────────┴───────────┴───────────┘
Use DataFrame.pearson_corr
In [9]: df.drop('date').pearson_corr()
Out[9]:
shape: (2, 2)
┌─────────┬───────────┐
│ open_AA ┆ open_AADI │
│ --- ┆ --- │
│ f64 ┆ f64 │
╞═════════╪═══════════╡
│ 1.0 ┆ 1.0 │
│ 1.0 ┆ 1.0 │
└─────────┴───────────┘

python-polars is there a np.where equivalent?

Polars is there a np.where equivalent? trying to replicate the following code in polars.
If the value is above a certain threshold column called Is_Acceptable is 1 or if it is below it is 0
import pandas as pd
import numpy as np
df = pd.DataFrame({"fruit":["orange","apple","mango","kiwi"], "value":[1,0.8,0.7,1.2]})
df["Is_Acceptable?"] = np.where(df["value"].lt(0.9), 1, 0)
print(df)
Yes, there is pl.when().then().otherwise() expression
import polars as pl
from polars import col
df = pl.DataFrame({
"fruit": ["orange","apple","mango","kiwi"],
"value": [1, 0.8, 0.7, 1.2]
})
df = df.with_column(
pl.when(col('value') < 0.9).then(1).otherwise(0).alias('Is_Acceptable?')
)
print(df)
┌────────┬───────┬────────────────┐
│ fruit ┆ value ┆ Is_Acceptable? │
│ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ i64 │
╞════════╪═══════╪════════════════╡
│ orange ┆ 1.0 ┆ 0 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ apple ┆ 0.8 ┆ 1 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ mango ┆ 0.7 ┆ 1 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ kiwi ┆ 1.2 ┆ 0 │
└────────┴───────┴────────────────┘
The when/then/otherwise expression is a good general-purpose answer. However, in this case, one shortcut is to simply create a boolean expression.
(
df
.with_column(
(pl.col('value') < 0.9).alias('Is_Acceptable')
)
)
shape: (4, 3)
┌────────┬───────┬───────────────┐
│ fruit ┆ value ┆ Is_Acceptable │
│ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ bool │
╞════════╪═══════╪═══════════════╡
│ orange ┆ 1.0 ┆ false │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ apple ┆ 0.8 ┆ true │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ mango ┆ 0.7 ┆ true │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ kiwi ┆ 1.2 ┆ false │
└────────┴───────┴───────────────┘
In numeric computations, False will be upcast to 0, and True will be upcast to 1. Or, if you prefer, you can upcast them explicitly to a different type.
(
df
.with_column(
(pl.col('value') < 0.9).cast(pl.Int64).alias('Is_Acceptable')
)
)
shape: (4, 3)
┌────────┬───────┬───────────────┐
│ fruit ┆ value ┆ Is_Acceptable │
│ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ i64 │
╞════════╪═══════╪═══════════════╡
│ orange ┆ 1.0 ┆ 0 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ apple ┆ 0.8 ┆ 1 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ mango ┆ 0.7 ┆ 1 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ kiwi ┆ 1.2 ┆ 0 │
└────────┴───────┴───────────────┘

polars groupby and pivot converting code from pyspark

Currently converting some code from pyspark to polars which i need some help with
In pyspark i am grouping by col1 and col2 then pivoting on column called VariableName using
the Value column.how would I do this in polars?
pivotDF = df.groupBy("col1","col2").pivot("VariableName").max("Value")
Let's start with this data:
import polars as pl
from pyspark.sql import SparkSession
df = pl.DataFrame(
{
"col1": ["A", "B"] * 12,
"col2": ["x", "y", "z"] * 8,
"VariableName": ["one", "two", "three", "four"] * 6,
"Value": pl.arange(0, 24, eager=True),
}
)
df
shape: (24, 4)
┌──────┬──────┬──────────────┬───────┐
│ col1 ┆ col2 ┆ VariableName ┆ Value │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ i64 │
╞══════╪══════╪══════════════╪═══════╡
│ A ┆ x ┆ one ┆ 0 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ B ┆ y ┆ two ┆ 1 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ A ┆ z ┆ three ┆ 2 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ B ┆ x ┆ four ┆ 3 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ ... ┆ ... ┆ ... ┆ ... │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ A ┆ z ┆ one ┆ 20 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ B ┆ x ┆ two ┆ 21 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ A ┆ y ┆ three ┆ 22 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ B ┆ z ┆ four ┆ 23 │
└──────┴──────┴──────────────┴───────┘
Running your query on pyspark yields:
spark = SparkSession.builder.getOrCreate()
(
spark
.createDataFrame(df.to_pandas())
.groupBy("col1", "col2")
.pivot("VariableName")
.max("Value")
.sort(["col1", "col2"])
.show()
)
+----+----+----+----+-----+----+
|col1|col2|four| one|three| two|
+----+----+----+----+-----+----+
| A| x|null| 12| 18|null|
| A| y|null| 16| 22|null|
| A| z|null| 20| 14|null|
| B| x| 15|null| null| 21|
| B| y| 19|null| null| 13|
| B| z| 23|null| null| 17|
+----+----+----+----+-----+----+
In Polars, we would code this using pivot.
(
df
.pivot(
index=["col1", "col2"],
values="Value",
columns="VariableName",
aggregate_fn="max",
)
.sort(["col1", "col2"])
)
shape: (6, 6)
┌──────┬──────┬──────┬──────┬───────┬──────┐
│ col1 ┆ col2 ┆ one ┆ two ┆ three ┆ four │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞══════╪══════╪══════╪══════╪═══════╪══════╡
│ A ┆ x ┆ 12 ┆ null ┆ 18 ┆ null │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ A ┆ y ┆ 16 ┆ null ┆ 22 ┆ null │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ A ┆ z ┆ 20 ┆ null ┆ 14 ┆ null │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ B ┆ x ┆ null ┆ 21 ┆ null ┆ 15 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ B ┆ y ┆ null ┆ 13 ┆ null ┆ 19 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ B ┆ z ┆ null ┆ 17 ┆ null ┆ 23 │
└──────┴──────┴──────┴──────┴───────┴──────┘

Python-Polars: How to filter categorical column with string list

I have a Polars dataframe like below:
df_cat = pl.DataFrame(
[
pl.Series("a_cat", ["c", "a", "b", "c", "b"], dtype=pl.Categorical),
pl.Series("b_cat", ["F", "G", "E", "G", "G"], dtype=pl.Categorical)
])
print(df_cat)
shape: (5, 2)
┌───────┬───────┐
│ a_cat ┆ b_cat │
│ --- ┆ --- │
│ cat ┆ cat │
╞═══════╪═══════╡
│ c ┆ F │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ a ┆ G │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b ┆ E │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ c ┆ G │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b ┆ G │
└───────┴───────┘
The following filter runs perfectly fine:
print(df_cat.filter(pl.col('a_cat') == 'c'))
shape: (2, 2)
┌───────┬───────┐
│ a_cat ┆ b_cat │
│ --- ┆ --- │
│ cat ┆ cat │
╞═══════╪═══════╡
│ c ┆ F │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ c ┆ G │
└───────┴───────┘
What I want is to use a list of string to run the filter more efficiently. So I tried and ended up with the following error message:
print(df_cat.filter(pl.col('a_cat').is_in(['a', 'c'])))
---------------------------------------------------------------------------
ComputeError Traceback (most recent call last)
d:\GitRepo\Test2\stockEMD3.ipynb Cell 9 in <cell line: 1>()
----> 1 print(df_cat.filter(pl.col('a_cat').is_in(['c'])))
File c:\ProgramData\Anaconda3\envs\charm3.9\lib\site-packages\polars\internals\dataframe\frame.py:2185, in DataFrame.filter(self, predicate)
2181 if _NUMPY_AVAILABLE and isinstance(predicate, np.ndarray):
2182 predicate = pli.Series(predicate)
2184 return (
-> 2185 self.lazy()
2186 .filter(predicate) # type: ignore[arg-type]
2187 .collect(no_optimization=True, string_cache=False)
2188 )
File c:\ProgramData\Anaconda3\envs\charm3.9\lib\site-packages\polars\internals\lazyframe\frame.py:660, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, string_cache, no_optimization, slice_pushdown)
650 projection_pushdown = False
652 ldf = self._ldf.optimization_toggle(
653 type_coercion,
654 predicate_pushdown,
(...)
658 slice_pushdown,
659 )
--> 660 return pli.wrap_df(ldf.collect())
ComputeError: joins/or comparisons on categorical dtypes can only happen if they are created under the same global string cache
From this Stackoverflow link I understand "You need to set a global string cache to compare categoricals created in different columns/lists." but my question is
Why the == one single string filter case works?
What is the proper way to filter a categorical column with a list of string?
Thanks!
Actually, you don't need to set a global string cache to compare strings to Categorical variables. You can use cast to accomplish this.
Let's use this data. I've included the integer values that underlie the Categorical variables to demonstrate something later.
import polars as pl
df_cat = (
pl.DataFrame(
[
pl.Series("a_cat", ["c", "a", "b", "c", "X"], dtype=pl.Categorical),
pl.Series("b_cat", ["F", "G", "E", "S", "X"], dtype=pl.Categorical),
]
)
.with_column(
pl.all().to_physical().suffix('_phys')
)
)
df_cat
shape: (5, 4)
┌───────┬───────┬────────────┬────────────┐
│ a_cat ┆ b_cat ┆ a_cat_phys ┆ b_cat_phys │
│ --- ┆ --- ┆ --- ┆ --- │
│ cat ┆ cat ┆ u32 ┆ u32 │
╞═══════╪═══════╪════════════╪════════════╡
│ c ┆ F ┆ 0 ┆ 0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ a ┆ G ┆ 1 ┆ 1 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ b ┆ E ┆ 2 ┆ 2 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ c ┆ S ┆ 0 ┆ 3 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ X ┆ X ┆ 3 ┆ 4 │
└───────┴───────┴────────────┴────────────┘
Comparing a categorical variable to a string
If we cast a Categorical variable back to its string values, we can make any comparison we need. For example:
df_cat.filter(pl.col('a_cat').cast(pl.Utf8).is_in(['a', 'c']))
shape: (3, 4)
┌───────┬───────┬────────────┬────────────┐
│ a_cat ┆ b_cat ┆ a_cat_phys ┆ b_cat_phys │
│ --- ┆ --- ┆ --- ┆ --- │
│ cat ┆ cat ┆ u32 ┆ u32 │
╞═══════╪═══════╪════════════╪════════════╡
│ c ┆ F ┆ 0 ┆ 0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ a ┆ G ┆ 1 ┆ 1 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ c ┆ S ┆ 0 ┆ 3 │
└───────┴───────┴────────────┴────────────┘
Or in a filter step comparing the string values of two Categorical variables that do not share the same string cache.
df_cat.filter(pl.col('a_cat').cast(pl.Utf8) == pl.col('b_cat').cast(pl.Utf8))
shape: (1, 4)
┌───────┬───────┬────────────┬────────────┐
│ a_cat ┆ b_cat ┆ a_cat_phys ┆ b_cat_phys │
│ --- ┆ --- ┆ --- ┆ --- │
│ cat ┆ cat ┆ u32 ┆ u32 │
╞═══════╪═══════╪════════════╪════════════╡
│ X ┆ X ┆ 3 ┆ 4 │
└───────┴───────┴────────────┴────────────┘
Notice that it is the string values being compared (not the integers underlying the two Categorical variables).
The equality operator on Categorical variables
The following statements are equivalent:
df_cat.filter((pl.col('a_cat') == 'a'))
df_cat.filter((pl.col('a_cat').cast(pl.Utf8) == 'a'))
The former is syntactic sugar for the latter, as the former is a common use case.
As the error states: ComputeError: joins/or comparisons on categorical dtypes can only happen if they are created under the same global string cache.
Comparisons of categorical values are only allowed under a global string cache. You really want to set this in such a case as it speeds up comparisons and prevents expensive casts to strings.
Setting this on the start of your query will ensure it runs:
import polars as pl
pl.Config.set_global_string_cache()
This is a new answer based on the one from #ritchie46.
Polar 0.15.15 it now is
import polars as pl
pl.toggle_string_cache(True)
Also a StringCache() Context manager can be used, see polars documentation:
with pl.StringCache():
print(df_cat.filter(pl.col('a_cat').is_in(['a', 'c'])))

How to form dynamic expressions without breaking on types

Any way to make the dynamic polars expressions not break with errors?
Currently I'm just excluding the columns by type, but just wondering if there is a better way.
For example, i have a df coming from parquet, if i just execute an expression on all columns it might break for certain types. Instead I want to contain these errors and possibly return a default value like None or -1 or something else.
import polars as pl
df = pl.scan_parquet("/path/to/data/*.parquet")
print(df.schema)
# Prints: {'date_time': <class 'polars.datatypes.Datetime'>, 'incident': <class 'polars.datatypes.Utf8'>, 'address': <class 'polars.datatypes.Utf8'>, 'city': <class 'polars.datatypes.Utf8'>, 'zipcode': <class 'polars.datatypes.Int32'>}
Now if i form generic expression on top of this, there are chances it may fail. For example,
# Finding positive count across all columns
# Fails due to: exceptions.ComputeError: cannot compare Utf8 with numeric data
print(df.select((pl.all() > 0).count().prefix("__positive_count_")).collect())
# Finding positive count across all columns
# Fails due to: pyo3_runtime.PanicException: 'unique_counts' not implemented for datetime[ns] data types
print(df.select(pl.all().unique_counts().prefix("__unique_count_")).collect())
# Finding positive count across all columns
# Fails due to: exceptions.SchemaError: Series dtype Int32 != utf8
# Note: this could have been avoided by doing an explict cast to string first
print(df.select((pl.all().str.lengths() > 0).count().prefix("__empty_count_")).collect())
I'll keep to things that work in lazy mode, as it appears that you are working in lazy mode with Parquet files.
Let's use this data as an example:
import polars as pl
from datetime import datetime
df = pl.DataFrame(
{
"col_int": [-2, -2, 0, 2, 2],
"col_float": [-20.0, -10, 10, 20, 20],
"col_date": pl.date_range(datetime(2020, 1, 1), datetime(2020, 5, 1), "1mo"),
"col_str": ["str1", "str2", "", None, "str5"],
"col_bool": [True, False, False, True, False],
}
).lazy()
df.collect()
shape: (5, 5)
┌─────────┬───────────┬─────────────────────┬─────────┬──────────┐
│ col_int ┆ col_float ┆ col_date ┆ col_str ┆ col_bool │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ datetime[ns] ┆ str ┆ bool │
╞═════════╪═══════════╪═════════════════════╪═════════╪══════════╡
│ -2 ┆ -20.0 ┆ 2020-01-01 00:00:00 ┆ str1 ┆ true │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ -2 ┆ -10.0 ┆ 2020-02-01 00:00:00 ┆ str2 ┆ false │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 0 ┆ 10.0 ┆ 2020-03-01 00:00:00 ┆ ┆ false │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 20.0 ┆ 2020-04-01 00:00:00 ┆ null ┆ true │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 20.0 ┆ 2020-05-01 00:00:00 ┆ str5 ┆ false │
└─────────┴───────────┴─────────────────────┴─────────┴──────────┘
Using the col Expression
One feature of the col expression is that you can supply a datatype, or even a list of datatypes. For example, if we want to contain our queries to floats, we can do the following:
df.select((pl.col(pl.Float64) > 0).sum().suffix("__positive_count_")).collect()
shape: (1, 1)
┌────────────────────────────┐
│ col_float__positive_count_ │
│ --- │
│ u32 │
╞════════════════════════════╡
│ 3 │
└────────────────────────────┘
(Note: (pl.col(...) > 0) creates a series of boolean values that need to be summed, not counted)
To include more than one datatype, you can supply a list of datatypes to col.
df.select(
(pl.col([pl.Int64, pl.Float64]) > 0).sum().suffix("__positive_count_")
).collect()
shape: (1, 2)
┌──────────────────────────┬────────────────────────────┐
│ col_int__positive_count_ ┆ col_float__positive_count_ │
│ --- ┆ --- │
│ u32 ┆ u32 │
╞══════════════════════════╪════════════════════════════╡
│ 2 ┆ 3 │
└──────────────────────────┴────────────────────────────┘
You can also combine these into the same select statement if you'd like.
df.select(
[
(pl.col(pl.Utf8).str.lengths() == 0).sum().suffix("__empty_count"),
pl.col(pl.Utf8).is_null().sum().suffix("__null_count"),
(pl.col([pl.Float64, pl.Int64]) > 0).sum().suffix("_positive_count"),
]
).collect()
shape: (1, 4)
┌──────────────────────┬─────────────────────┬──────────────────────────┬────────────────────────┐
│ col_str__empty_count ┆ col_str__null_count ┆ col_float_positive_count ┆ col_int_positive_count │
│ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ u32 ┆ u32 ┆ u32 │
╞══════════════════════╪═════════════════════╪══════════════════════════╪════════════════════════╡
│ 1 ┆ 1 ┆ 3 ┆ 2 │
└──────────────────────┴─────────────────────┴──────────────────────────┴────────────────────────┘
The Cookbook has a handy list of datatypes.
Using the exclude expression
Another handy trick is to use the exclude expression. With this, we can select all columns except columns of certain datatypes. For example:
df.select(
[
pl.exclude(pl.Utf8).max().suffix("_max"),
pl.exclude([pl.Utf8, pl.Boolean]).min().suffix("_min"),
]
).collect()
shape: (1, 7)
┌─────────────┬───────────────┬─────────────────────┬──────────────┬─────────────┬───────────────┬─────────────────────┐
│ col_int_max ┆ col_float_max ┆ col_date_max ┆ col_bool_max ┆ col_int_min ┆ col_float_min ┆ col_date_min │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ datetime[ns] ┆ u32 ┆ i64 ┆ f64 ┆ datetime[ns] │
╞═════════════╪═══════════════╪═════════════════════╪══════════════╪═════════════╪═══════════════╪═════════════════════╡
│ 2 ┆ 20.0 ┆ 2020-05-01 00:00:00 ┆ 1 ┆ -2 ┆ -20.0 ┆ 2020-01-01 00:00:00 │
└─────────────┴───────────────┴─────────────────────┴──────────────┴─────────────┴───────────────┴─────────────────────┘
Unique counts
One caution: unique_counts results in Series of varying lengths.
df.select(pl.col("col_int").unique_counts().prefix(
"__unique_count_")).collect()
shape: (3, 1)
┌────────────────────────┐
│ __unique_count_col_int │
│ --- │
│ u32 │
╞════════════════════════╡
│ 2 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 │
└────────────────────────┘
df.select(pl.col("col_float").unique_counts().prefix(
"__unique_count_")).collect()
shape: (4, 1)
┌──────────────────────────┐
│ __unique_count_col_float │
│ --- │
│ u32 │
╞══════════════════════════╡
│ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 │
└──────────────────────────┘
As such, these should not be combined into the same results. Each column/Series of a DataFrame must have the same length.