How to create fields dynamically - python-polars

Is there any way to create fields dynamically? I know there are a few ways, but it would be good to know the best approach in Polars. For example, I want to add 12 shifted columns (lag1, lag2, lag3, ..., lagN) to an existing dataframe. How can I achieve this?
Thanks.

You can just use the Python language for that. Polars expressions are lazily evaluated, so you can create them anywhere: in a for loop, a function, a list comprehension, you name it.
Below is an example of dynamically created lag columns: one built by calling a function and assigning the result to a variable, and several built with a list comprehension.
import polars as pl

# some initial dataframe
df = pl.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [5, 4, 3, 2, 1],
})

# a function that returns a lazily evaluated expression
def lag(name: str, n: int) -> pl.Expr:
    return pl.col(name).shift(n).suffix(f"_lag_{n}")

# a lazily evaluated expression assigned to a variable
lag_foo = lag("a", 1)

out = df.select(
    [lag_foo]
    + [lag("b", i) for i in range(5)]  # create exprs with a list comprehension
)
print(out)
This outputs:
shape: (5, 6)
┌─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ a_lag_1 ┆ b_lag_0 ┆ b_lag_1 ┆ b_lag_2 ┆ b_lag_3 ┆ b_lag_4 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════════╪═════════╪═════════╪═════════╪═════════╪═════════╡
│ null ┆ 5 ┆ null ┆ null ┆ null ┆ null │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 4 ┆ 5 ┆ null ┆ null ┆ null │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 3 ┆ 4 ┆ 5 ┆ null ┆ null │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 2 ┆ 3 ┆ 4 ┆ 5 ┆ null │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ 1 ┆ 2 ┆ 3 ┆ 4 ┆ 5 │
└─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘
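To map this back onto the original question, here is a minimal sketch (assuming the lags are taken of column "a") that adds all twelve lag columns in one pass:
n_lags = 12
df_lagged = df.with_columns(
    # lag 1 ... lag 12 of "a", built with a list comprehension
    [pl.col("a").shift(i).suffix(f"_lag_{i}") for i in range(1, n_lags + 1)]
)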

Related

Polars solution to normalise groups by per-group reference value

I'm trying to use Polars to normalise the values of groups of entries by a single reference value per group.
In the example data below, I want to generate the column normalised, which contains each value divided by its group's ref reference-state value, i.e.:
group_id  reference_state  value  normalised
1         ref              5      1.0
1         a                3      0.6
1         b                1      0.2
2         ref              4      1.0
2         a                8      2.0
2         b                2      0.5
This is straightforward in Pandas:
for (i, x) in df.groupby("group_id"):
    ref_val = x.loc[x["reference_state"] == "ref"]["value"]
    df.loc[df["group_id"] == i, "normalised"] = x["value"] / ref_val.to_list()[0]
Is there a way to do this in Polars?
Thanks in advance!
You can use a window function to make an expression operate on different groups via:
.over("group_id")
and then you can write the logic that divides each value by the group's "ref" value with:
pl.col("value") / pl.col("value").filter(pl.col("reference_state") == "ref").first()
Putting it all together:
df = pl.DataFrame({
    "group_id": [1, 1, 1, 2, 2, 2],
    "reference_state": ["ref", "a", "b", "ref", "a", "b"],
    "value": [5, 3, 1, 4, 8, 2],
})
(df.with_columns([
    (
        pl.col("value") /
        pl.col("value").filter(pl.col("reference_state") == "ref").first()
    ).over("group_id").alias("normalised")
]))
shape: (6, 4)
┌──────────┬─────────────────┬───────┬────────────┐
│ group_id ┆ reference_state ┆ value ┆ normalised │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ i64 ┆ f64 │
╞══════════╪═════════════════╪═══════╪════════════╡
│ 1 ┆ ref ┆ 5 ┆ 1.0 │
│ 1 ┆ a ┆ 3 ┆ 0.6 │
│ 1 ┆ b ┆ 1 ┆ 0.2 │
│ 2 ┆ ref ┆ 4 ┆ 1.0 │
│ 2 ┆ a ┆ 8 ┆ 2.0 │
│ 2 ┆ b ┆ 2 ┆ 0.5 │
└──────────┴─────────────────┴───────┴────────────┘
Here's one way to do it:
- create a temporary dataframe which, for each group_id, tells you the value where reference_state is 'ref'
- join with that temporary dataframe
(
    df.join(
        df.filter(pl.col("reference_state") == "ref").select(["group_id", "value"]),
        on="group_id",
    )
    .with_column((pl.col("value") / pl.col("value_right")).alias("normalised"))
    .drop("value_right")
)
This gives you:
shape: (6, 4)
┌──────────┬─────────────────┬───────┬────────────┐
│ group_id ┆ reference_state ┆ value ┆ normalised │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ i64 ┆ f64 │
╞══════════╪═════════════════╪═══════╪════════════╡
│ 1 ┆ ref ┆ 5 ┆ 1.0 │
│ 1 ┆ a ┆ 3 ┆ 0.6 │
│ 1 ┆ b ┆ 1 ┆ 0.2 │
│ 2 ┆ ref ┆ 4 ┆ 1.0 │
│ 2 ┆ a ┆ 8 ┆ 2.0 │
│ 2 ┆ b ┆ 2 ┆ 0.5 │
└──────────┴─────────────────┴───────┴────────────┘
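For larger frames, the same join can be written against the lazy API so the optimizer can push projections and predicates down; a minimal sketch under the same column names:
out = (
    df.lazy()
    .join(
        df.lazy()
        .filter(pl.col("reference_state") == "ref")
        .select(["group_id", "value"]),
        on="group_id",
    )
    .with_column((pl.col("value") / pl.col("value_right")).alias("normalised"))
    .drop("value_right")
    .collect()
)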

Polars: Unnesting columns algorithmically without a for loop

I am working with multiple parquet datasets that were written with nested structs (sometimes multiple levels deep). I need to output a flattened (no struct) schema. Right now the only way I can think to do that is to use for loops to iterate through the columns. Here is a simplified example of the looping.
while len([x.name for x in df if x.dtype == pl.Struct]) > 0:
    for col in df:
        if col.dtype == pl.Struct:
            df = df.unnest(col.name)
This works; maybe it is the only way to do it, and if so it would be helpful to know that. But Polars is pretty neat, and I'm wondering if there is a more functional way to do this without all the looping and reassigning the df to itself.
If you have a df like this:
df = pl.DataFrame(
    {'a': [1, 2, 3], 'b': [2, 3, 4], 'c': [3, 4, 5], 'd': [4, 5, 6], 'e': [5, 6, 7]}
).select([pl.struct(['a', 'b']).alias('ab'), pl.struct(['c', 'd']).alias('cd'), 'e'])
You can unnest the ab and cd at the same time by just doing
df.unnest(['ab','cd'])
If you don't know in advance what your column names and types are, then you can just use a list comprehension like this:
[col_name for col_name,dtype in zip(df.columns, df.dtypes) if dtype==pl.Struct]
We can now just put that list comprehension in the unnest method.
df = df.unnest([col_name for col_name, dtype in zip(df.columns, df.dtypes) if dtype == pl.Struct])
If you have structs inside structs like:
df = pl.DataFrame(
    {'a': [1, 2, 3], 'b': [2, 3, 4], 'c': [3, 4, 5], 'd': [4, 5, 6], 'e': [5, 6, 7]}
).select(
    [pl.struct(['a', 'b']).alias('ab'), pl.struct(['c', 'd']).alias('cd'), 'e']
).select(
    [pl.struct(['ab', 'cd']).alias('abcd'), 'e']
)
then I don't think you can get away from some kind of while loop but this might be more concise:
while any(x == pl.Struct for x in df.dtypes):
    df = df.unnest([col_name for col_name, dtype in zip(df.columns, df.dtypes) if dtype == pl.Struct])
This is a minor addition. If you're concerned about repeatedly re-looping over a large number of columns, you can write a recursive function that addresses only the structs (and nested structs).
def unnest_all(self: pl.DataFrame) -> pl.DataFrame:
    cols = []
    for next_col in self:
        if next_col.dtype != pl.Struct:
            # plain column: keep it as-is
            cols.append(next_col)
        else:
            # struct column: recurse into its fields and collect the leaves
            cols.extend(next_col.struct.to_frame().unnest_all().get_columns())
    return pl.DataFrame(cols)

# attach as a method so the recursive call above resolves
pl.DataFrame.unnest_all = unnest_all
So, using the second example by #Dean MacGregor above:
df = (
pl.DataFrame(
{"a": [1, 2, 3], "b": [2, 3, 4], "c": [
3, 4, 5], "d": [4, 5, 6], "e": [5, 6, 7]}
)
.select([pl.struct(["a", "b"]).alias("ab"), pl.struct(["c", "d"]).alias("cd"), "e"])
.select([pl.struct(["ab", "cd"]).alias("abcd"), "e"])
)
df
df.unnest_all()
>>> df
shape: (3, 2)
┌───────────────┬─────┐
│ abcd ┆ e │
│ --- ┆ --- │
│ struct[2] ┆ i64 │
╞═══════════════╪═════╡
│ {{1,2},{3,4}} ┆ 5 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ {{2,3},{4,5}} ┆ 6 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ {{3,4},{5,6}} ┆ 7 │
└───────────────┴─────┘
>>> df.unnest_all()
shape: (3, 5)
┌─────┬─────┬─────┬─────┬─────┐
│ a ┆ b ┆ c ┆ d ┆ e │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╪═════╪═════╡
│ 1 ┆ 2 ┆ 3 ┆ 4 ┆ 5 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ 3 ┆ 4 ┆ 5 ┆ 6 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 3 ┆ 4 ┆ 5 ┆ 6 ┆ 7 │
└─────┴─────┴─────┴─────┴─────┘
And using the first example:
df = pl.DataFrame(
{"a": [1, 2, 3], "b": [2, 3, 4], "c": [
3, 4, 5], "d": [4, 5, 6], "e": [5, 6, 7]}
).select([pl.struct(["a", "b"]).alias("ab"), pl.struct(["c", "d"]).alias("cd"), "e"])
df
df.unnest_all()
>>> df
shape: (3, 3)
┌───────────┬───────────┬─────┐
│ ab ┆ cd ┆ e │
│ --- ┆ --- ┆ --- │
│ struct[2] ┆ struct[2] ┆ i64 │
╞═══════════╪═══════════╪═════╡
│ {1,2} ┆ {3,4} ┆ 5 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ {2,3} ┆ {4,5} ┆ 6 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ {3,4} ┆ {5,6} ┆ 7 │
└───────────┴───────────┴─────┘
>>> df.unnest_all()
shape: (3, 5)
┌─────┬─────┬─────┬─────┬─────┐
│ a ┆ b ┆ c ┆ d ┆ e │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╪═════╪═════╡
│ 1 ┆ 2 ┆ 3 ┆ 4 ┆ 5 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ 3 ┆ 4 ┆ 5 ┆ 6 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 3 ┆ 4 ┆ 5 ┆ 6 ┆ 7 │
└─────┴─────┴─────┴─────┴─────┘
In the end, I'm not sure that this saves you much wall-clock time (or RAM).
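If you'd rather not monkey-patch pl.DataFrame, the same recursion works as a plain function; a minimal sketch:
def unnest_all_fn(df: pl.DataFrame) -> pl.DataFrame:
    cols = []
    for s in df:
        if s.dtype == pl.Struct:
            # recurse into the struct's own frame
            cols.extend(unnest_all_fn(s.struct.to_frame()).get_columns())
        else:
            cols.append(s)
    return pl.DataFrame(cols)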
The other answers taught me a lot. I encountered a new situation where I wanted each column labeled with all the structs it came from, i.e. for
pl.col("my").struct.field("test").struct.field("thing")
I wanted to recover
my.test.thing
as a string which I could easily use when reading a subset of columns with pyarrow via
pq.ParquetDataset(path).read(columns = ["my.test.thing"])
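For context, a minimal sketch of that read (the file path and column name here are hypothetical):
import pyarrow.parquet as pq

# hypothetical file; "my.test.thing" addresses the nested field directly
table = pq.ParquetDataset("data/example.parquet").read(columns=["my.test.thing"])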
Since there are many hundreds of columns and the nesting can go quite deep, I wrote functions to do a depth-first search on the schema and extract the columns in that pyarrow-friendly format; I can then use those to select each column, fully unnested, in one go.
First, I worked with the pyarrow schema, because I couldn't figure out how to drill into the structs in the polars schema:
schema = df.to_arrow().schema
Navigating structs in that schema is quirky; at the top level the structure behaves differently from deeper in. I ended up writing two functions: the first navigates the top-level structure, and the second continues the search below it:
import pyarrow as pa

# both functions append their results to the module-level list column_paths
def schema_top_level_DFS(pa_schema):
    top_level_stack = list(range(len(pa_schema)))
    while top_level_stack:
        working_top_level_index = top_level_stack.pop()
        working_element_name = pa_schema.names[working_top_level_index]
        if type(pa_schema.types[working_top_level_index]) == pa.lib.StructType:
            second_level_stack = list(range(len(pa_schema.types[working_top_level_index])))
            while second_level_stack:
                working_second_level_index = second_level_stack.pop()
                schema_DFS(
                    pa_schema.types[working_top_level_index][working_second_level_index],
                    working_element_name,
                )
        else:
            column_paths.append(working_element_name)

def schema_DFS(incoming_element, upstream_names):
    current_name = incoming_element.name
    combined_names = ".".join([upstream_names, current_name])
    if type(incoming_element.type) == pa.lib.StructType:
        stack = list(range(len(incoming_element.type)))
        while stack:
            working_index = stack.pop()
            working_element = incoming_element.type[working_index]
            schema_DFS(working_element, combined_names)
    else:
        column_paths.append(combined_names)
So that running
column_paths = []
schema_top_level_DFS(schema)
gives me column paths like
['struct_name_1.inner_struct_name_2.thing1', 'struct_name_1.inner_struct_name_2.thing2']
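As an aside, a similar walk may be possible directly on df.schema, assuming a polars version where Struct dtypes expose a .fields list of pl.Field objects with .name and .dtype (worth verifying against your installed version); a hedged sketch:
def polars_schema_paths(schema) -> list:
    paths = []

    def walk(name, dtype):
        if isinstance(dtype, pl.Struct):
            for field in dtype.fields:
                walk(f"{name}.{field.name}", field.dtype)
        else:
            paths.append(name)

    for name, dtype in schema.items():
        walk(name, dtype)
    return paths

column_paths = polars_schema_paths(df.schema)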
To actually do the unnesting, I wasn't sure how to do better than a function with a match statement:
# match statements require Python 3.10+
def return_pl_formatting(col_string):
    col_list = col_string.split(".")
    match len(col_list):
        case 1:
            return pl.col(col_list[0]).alias(col_string)
        case 2:
            return pl.col(col_list[0]).struct.field(col_list[1]).alias(col_string)
        case 3:
            return pl.col(col_list[0]).struct.field(col_list[1]).struct.field(col_list[2]).alias(col_string)
        case 4:
            return pl.col(col_list[0]).struct.field(col_list[1]).struct.field(col_list[2]).struct.field(col_list[3]).alias(col_string)
        case 5:
            return pl.col(col_list[0]).struct.field(col_list[1]).struct.field(col_list[2]).struct.field(col_list[3]).struct.field(col_list[4]).alias(col_string)
        case 6:
            return pl.col(col_list[0]).struct.field(col_list[1]).struct.field(col_list[2]).struct.field(col_list[3]).struct.field(col_list[4]).struct.field(col_list[5]).alias(col_string)
Then get my unnested and nicely named df with:
df.select([return_pl_formatting(x) for x in column_paths])
To show the output on the example from #Dean MacGregor:
test = (
pl.DataFrame(
{"a": [1, 2, 3], "b": [2, 3, 4], "c": [
3, 4, 5], "d": [4, 5, 6], "e": [5, 6, 7]}
)
.select([pl.struct(["a", "b"]).alias("ab"), pl.struct(["c", "d"]).alias("cd"), "e"])
.select([pl.struct(["ab", "cd"]).alias("abcd"), "e"])
)
column_paths = []
schema_top_level_DFS(test.to_arrow().schema)
print(test.select([return_pl_formatting(x) for x in column_paths]))
┌─────┬───────────┬───────────┬───────────┬───────────┐
│ e ┆ abcd.cd.d ┆ abcd.cd.c ┆ abcd.ab.b ┆ abcd.ab.a │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═══════════╪═══════════╪═══════════╪═══════════╡
│ 5 ┆ 4 ┆ 3 ┆ 2 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 6 ┆ 5 ┆ 4 ┆ 3 ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 7 ┆ 6 ┆ 5 ┆ 4 ┆ 3 │
└─────┴───────────┴───────────┴───────────┴───────────┘
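Since the match statement caps the nesting depth at six, here is a hedged sketch of a depth-agnostic variant that folds the struct.field calls over the path segments:
from functools import reduce

def return_pl_formatting_any_depth(col_string: str) -> pl.Expr:
    head, *rest = col_string.split(".")
    # fold .struct.field(...) over the remaining path segments
    expr = reduce(lambda e, f: e.struct.field(f), rest, pl.col(head))
    return expr.alias(col_string)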

Python-Polars: How to filter categorical column with string list

I have a Polars dataframe like below:
df_cat = pl.DataFrame(
    [
        pl.Series("a_cat", ["c", "a", "b", "c", "b"], dtype=pl.Categorical),
        pl.Series("b_cat", ["F", "G", "E", "G", "G"], dtype=pl.Categorical),
    ]
)
print(df_cat)
shape: (5, 2)
┌───────┬───────┐
│ a_cat ┆ b_cat │
│ --- ┆ --- │
│ cat ┆ cat │
╞═══════╪═══════╡
│ c ┆ F │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ a ┆ G │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b ┆ E │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ c ┆ G │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b ┆ G │
└───────┴───────┘
The following filter runs perfectly fine:
print(df_cat.filter(pl.col('a_cat') == 'c'))
shape: (2, 2)
┌───────┬───────┐
│ a_cat ┆ b_cat │
│ --- ┆ --- │
│ cat ┆ cat │
╞═══════╪═══════╡
│ c ┆ F │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ c ┆ G │
└───────┴───────┘
What I want is to use a list of strings to run the filter more efficiently. I tried, but ended up with the following error message:
print(df_cat.filter(pl.col('a_cat').is_in(['a', 'c'])))
---------------------------------------------------------------------------
ComputeError Traceback (most recent call last)
d:\GitRepo\Test2\stockEMD3.ipynb Cell 9 in <cell line: 1>()
----> 1 print(df_cat.filter(pl.col('a_cat').is_in(['c'])))
File c:\ProgramData\Anaconda3\envs\charm3.9\lib\site-packages\polars\internals\dataframe\frame.py:2185, in DataFrame.filter(self, predicate)
2181 if _NUMPY_AVAILABLE and isinstance(predicate, np.ndarray):
2182 predicate = pli.Series(predicate)
2184 return (
-> 2185 self.lazy()
2186 .filter(predicate) # type: ignore[arg-type]
2187 .collect(no_optimization=True, string_cache=False)
2188 )
File c:\ProgramData\Anaconda3\envs\charm3.9\lib\site-packages\polars\internals\lazyframe\frame.py:660, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, string_cache, no_optimization, slice_pushdown)
650 projection_pushdown = False
652 ldf = self._ldf.optimization_toggle(
653 type_coercion,
654 predicate_pushdown,
(...)
658 slice_pushdown,
659 )
--> 660 return pli.wrap_df(ldf.collect())
ComputeError: joins/or comparisons on categorical dtypes can only happen if they are created under the same global string cache
From this Stackoverflow link I understand "You need to set a global string cache to compare categoricals created in different columns/lists.", but my questions are:
Why does the == single-string filter case work?
What is the proper way to filter a categorical column with a list of strings?
Thanks!
Actually, you don't need to set a global string cache to compare strings to Categorical variables. You can use cast to accomplish this.
Let's use this data. I've included the integer values that underlie the Categorical variables to demonstrate something later.
import polars as pl
df_cat = (
    pl.DataFrame(
        [
            pl.Series("a_cat", ["c", "a", "b", "c", "X"], dtype=pl.Categorical),
            pl.Series("b_cat", ["F", "G", "E", "S", "X"], dtype=pl.Categorical),
        ]
    )
    .with_column(
        pl.all().to_physical().suffix('_phys')
    )
)
df_cat
shape: (5, 4)
┌───────┬───────┬────────────┬────────────┐
│ a_cat ┆ b_cat ┆ a_cat_phys ┆ b_cat_phys │
│ --- ┆ --- ┆ --- ┆ --- │
│ cat ┆ cat ┆ u32 ┆ u32 │
╞═══════╪═══════╪════════════╪════════════╡
│ c ┆ F ┆ 0 ┆ 0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ a ┆ G ┆ 1 ┆ 1 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ b ┆ E ┆ 2 ┆ 2 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ c ┆ S ┆ 0 ┆ 3 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ X ┆ X ┆ 3 ┆ 4 │
└───────┴───────┴────────────┴────────────┘
Comparing a categorical variable to a string
If we cast a Categorical variable back to its string values, we can make any comparison we need. For example:
df_cat.filter(pl.col('a_cat').cast(pl.Utf8).is_in(['a', 'c']))
shape: (3, 4)
┌───────┬───────┬────────────┬────────────┐
│ a_cat ┆ b_cat ┆ a_cat_phys ┆ b_cat_phys │
│ --- ┆ --- ┆ --- ┆ --- │
│ cat ┆ cat ┆ u32 ┆ u32 │
╞═══════╪═══════╪════════════╪════════════╡
│ c ┆ F ┆ 0 ┆ 0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ a ┆ G ┆ 1 ┆ 1 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ c ┆ S ┆ 0 ┆ 3 │
└───────┴───────┴────────────┴────────────┘
Or in a filter step comparing the string values of two Categorical variables that do not share the same string cache.
df_cat.filter(pl.col('a_cat').cast(pl.Utf8) == pl.col('b_cat').cast(pl.Utf8))
shape: (1, 4)
┌───────┬───────┬────────────┬────────────┐
│ a_cat ┆ b_cat ┆ a_cat_phys ┆ b_cat_phys │
│ --- ┆ --- ┆ --- ┆ --- │
│ cat ┆ cat ┆ u32 ┆ u32 │
╞═══════╪═══════╪════════════╪════════════╡
│ X ┆ X ┆ 3 ┆ 4 │
└───────┴───────┴────────────┴────────────┘
Notice that it is the string values being compared (not the integers underlying the two Categorical variables).
The equality operator on Categorical variables
The following statements are equivalent:
df_cat.filter((pl.col('a_cat') == 'a'))
df_cat.filter((pl.col('a_cat').cast(pl.Utf8) == 'a'))
The former is syntactic sugar for the latter, as the former is a common use case.
As the error states: ComputeError: joins/or comparisons on categorical dtypes can only happen if they are created under the same global string cache.
Comparisons of categorical values are only allowed under a global string cache. You really want to set this in such a case as it speeds up comparisons and prevents expensive casts to strings.
Setting this at the start of your query will ensure it runs:
import polars as pl
pl.Config.set_global_string_cache()
This is a new answer based on the one from #ritchie46.
As of Polars 0.15.15, it is now:
import polars as pl
pl.toggle_string_cache(True)
Alternatively, a StringCache() context manager can be used; see the polars documentation:
with pl.StringCache():
    print(df_cat.filter(pl.col('a_cat').is_in(['a', 'c'])))
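As an aside (an assumption worth verifying against your installed version): later polars releases renamed the toggle, so the equivalent call is roughly:
# later polars releases (roughly 0.19+) renamed the toggle:
pl.enable_string_cache()
print(df_cat.filter(pl.col('a_cat').is_in(['a', 'c'])))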

polars outer join default null value

https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.DataFrame.join.html
Can I specify the default NULL value for outer joins? Like 0?
The join method does not currently have an option for setting a default value for nulls. However, there is an easy way to accomplish this.
Let's say we have this data:
import polars as pl
df1 = pl.DataFrame({"key": ["a", "b", "d"], "var1": [1, 1, 1]})
df2 = pl.DataFrame({"key": ["a", "b", "c"], "var2": [2, 2, 2]})
df1.join(df2, on="key", how="outer")
shape: (4, 3)
┌─────┬──────┬──────┐
│ key ┆ var1 ┆ var2 │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════╪══════╪══════╡
│ a ┆ 1 ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ b ┆ 1 ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ c ┆ null ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ d ┆ 1 ┆ null │
└─────┴──────┴──────┘
To fill the null values with a different default, simply chain a fill_null:
df1.join(df2, on="key", how="outer").with_column(pl.all().fill_null(0))
shape: (4, 3)
┌─────┬──────┬──────┐
│ key ┆ var1 ┆ var2 │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════╪══════╪══════╡
│ a ┆ 1 ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ b ┆ 1 ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ c ┆ 0 ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ d ┆ 1 ┆ 0 │
└─────┴──────┴──────┘
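Note that pl.all().fill_null(0) fills nulls in every column, including "key". If that matters, a sketch restricting the fill to the value columns (names as in the example above):
df1.join(df2, on="key", how="outer").with_columns([
    # only fill the value columns; leave "key" untouched
    pl.col(["var1", "var2"]).fill_null(0),
])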

Polars: how to add a column in front?

What would be the most idiomatic (and efficient) way to add a column at the front of a polars dataframe? The same thing as .with_column, but adding it at index 0?
You can select in the order you want your new DataFrame.
df = pl.DataFrame({
    "a": [1, 2, 3],
    "b": [True, None, False],
})
df.select([
    pl.lit("foo").alias("z"),
    pl.all(),
])
shape: (3, 3)
┌─────┬─────┬───────┐
│ z ┆ a ┆ b │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ bool │
╞═════╪═════╪═══════╡
│ foo ┆ 1 ┆ true │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ foo ┆ 2 ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ foo ┆ 3 ┆ false │
└─────┴─────┴───────┘
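If the column already exists as a Series, a hedged alternative (assuming a polars version that provides DataFrame.insert_at_idx; later releases renamed it insert_column) inserts it in place at position 0:
s = pl.Series("z", ["foo", "foo", "foo"])
df.insert_at_idx(0, s)  # modifies df in place, putting "z" first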