Is it possible to reference a different DataFrame in a Polars expression without using a lambda? - python-polars

Is there a way to reference another Polars Dataframe in Polars expressions without using lambdas?
Just to use a simple example - suppose I have two dataframes:
from datetime import date
import polars as pl

df_1 = pl.DataFrame(
    {
        "time": pl.date_range(
            low=date(2021, 1, 1),
            high=date(2022, 1, 1),
            interval="1d",
        ),
        "x": pl.arange(0, 366, eager=True),
    }
)
df_2 = pl.DataFrame(
    {
        "time": pl.date_range(
            low=date(2021, 1, 1),
            high=date(2021, 2, 1),
            interval="1mo",
        ),
        "y": [50, 100],
    }
)
For each y value in df_2, I would like to find the maximum date in df_1, conditional on the x value being lower than y.
I can do this with apply and a lambda (see below), but is there a more idiomatic way of performing this operation?
df_2.groupby("y").agg(
pl.col("y").apply(lambda s: df_1.filter(pl.col("x") < s).select(pl.col("time")).max()[0,0]).alias('latest')
)
Edit:
Is it possible to pre-filter df_1 prior to using join_asof? Switching the question to look for the min instead of the max, this is what I would do for an individual case:
(
    df_2
    .filter(pl.col("y") == 50)
    .join_asof(
        df_1
        .sort("x")
        .filter(pl.col("time") > date(2021, 11, 1))
        .select([
            pl.col("time").cummin().alias("time_min"),
            pl.col("x").alias("original_x"),
            (pl.col("x") + 1).alias("x"),
        ]),
        left_on="y",
        right_on="x",
        strategy="forward",
    )
)
Is there a way to generalise this merge without using a loop / apply function?

Edit: Generalizing a join
One somewhat-dangerous approach to generalizing a join (so that you can run any sub-queries and filters that you like) is to use a "cross" join.
I say "somewhat-dangerous" because the number of row combinations considered in a cross join is M x N, where M and N are the number of rows in your two DataFrames. So if your two DataFrames are 1 million rows each, you have (1 million x 1 million) row combinations that are being considered. This process can exhaust your RAM or simply take a long time.
If you'd like to try it, here's how it would work (along with some arbitrary filters that I constructed, just to show the ultimate flexibility that a cross-join creates).
(
    df_2.lazy()
    .join(
        df_1.lazy(),
        how="cross",
    )
    .filter(pl.col("time_right") >= pl.col("time"))
    .groupby("y")
    .agg([
        pl.col("time").first(),
        pl.col("x")
        .filter(pl.col("y") > pl.col("x"))
        .max()
        .alias("max(x) for(y>x)"),
        pl.col("time_right")
        .filter(pl.col("y") > pl.col("x"))
        .max()
        .alias("max(time_right) for(y>x)"),
        pl.col("time_right")
        .filter(pl.col("y") <= pl.col("x"))
        .filter(pl.col("time_right") > pl.col("time"))
        .min()
        .alias("min(time_right) for(two filters)"),
    ])
    .collect()
)
shape: (2, 5)
┌─────┬────────────┬─────────────────┬──────────────────────────┬──────────────────────────────────┐
│ y ┆ time ┆ max(x) for(y>x) ┆ max(time_right) for(y>x) ┆ min(time_right) for(two filters) │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ date ┆ i64 ┆ date ┆ date │
╞═════╪════════════╪═════════════════╪══════════════════════════╪══════════════════════════════════╡
│ 100 ┆ 2021-02-01 ┆ 99 ┆ 2021-04-10 ┆ 2021-04-11 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 50 ┆ 2021-01-01 ┆ 49 ┆ 2021-02-19 ┆ 2021-02-20 │
└─────┴────────────┴─────────────────┴──────────────────────────┴──────────────────────────────────┘
A couple of suggestions:
I strongly recommend running the cross-join in Lazy mode.
Try to filter directly after the cross-join, to eliminate row combinations that you will never need. This reduces the burden on the later groupby step.
Given the explosive potential of row combinations for cross-joins, I tried to steer you toward a join_asof (which did solve the original sample question). But if you need the flexibility beyond what a join_asof can provide, the cross-join will provide ultimate flexibility -- at a cost.
join_asof
We can use a join_asof to accomplish this, with two wrinkles.
The Algorithm
(
    df_2
    .sort("y")
    .join_asof(
        (
            df_1
            .sort("x")
            .select([
                pl.col("time").cummax().alias("time_max"),
                (pl.col("x") + 1),
            ])
        ),
        left_on="y",
        right_on="x",
        strategy="backward",
    )
    .drop(["x"])
)
shape: (2, 3)
┌────────────┬─────┬────────────┐
│ time ┆ y ┆ time_max │
│ --- ┆ --- ┆ --- │
│ date ┆ i64 ┆ date │
╞════════════╪═════╪════════════╡
│ 2021-01-01 ┆ 50 ┆ 2021-02-19 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2021-02-01 ┆ 100 ┆ 2021-04-10 │
└────────────┴─────┴────────────┘
This matches the output of your code.
In steps
Let's add some extra information to our query, to elucidate how it works.
(
    df_2
    .sort("y")
    .join_asof(
        (
            df_1
            .sort("x")
            .select([
                pl.col("time").cummax().alias("time_max"),
                pl.col("x").alias("original_x"),
                (pl.col("x") + 1).alias("x"),
            ])
        ),
        left_on="y",
        right_on="x",
        strategy="backward",
    )
)
shape: (2, 5)
┌────────────┬─────┬────────────┬────────────┬─────┐
│ time ┆ y ┆ time_max ┆ original_x ┆ x │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ date ┆ i64 ┆ date ┆ i64 ┆ i64 │
╞════════════╪═════╪════════════╪════════════╪═════╡
│ 2021-01-01 ┆ 50 ┆ 2021-02-19 ┆ 49 ┆ 50 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 2021-02-01 ┆ 100 ┆ 2021-04-10 ┆ 99 ┆ 100 │
└────────────┴─────┴────────────┴────────────┴─────┘
Getting the maximum date
Instead of attempting a "non-equi" join or sub-queries to obtain the maximum date for x or any lesser value of x, we can use a simpler approach: sort df_1 by x and calculate the cumulative maximum date for each x. That way, when we join, we can join to a single row in df_1 and be certain that for any x, we are getting the maximum date for that x and all lesser values of x. The cumulative maximum is displayed above as time_max.
less-than (and not less-than-or-equal-to)
From the documentation for join_asof:
A “backward” search selects the last row in the right DataFrame whose ‘on’ key is less than or equal to the left’s key.
Since you want "less than" and not "less than or equal to", we can simply increase each value of x by 1. Since x and y are integers, this will work. The result above displays both the original value of x (original_x), and the adjusted value (x) used in the join_asof.
If x and y are floats, you can add an arbitrarily small amount to x (e.g., x + 0.000000001) to force the non-equality.
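For the float case, a minimal sketch of the same adjustment might look like the following (the epsilon of 1e-9 is an arbitrary assumption; pick something smaller than the precision of your data):
# Hypothetical float variant: nudge x upward by a tiny epsilon so that the
# "backward" join_asof behaves like a strict less-than comparison.
epsilon = 1e-9
right = (
    df_1
    .sort("x")
    .select([
        pl.col("time").cummax().alias("time_max"),
        (pl.col("x") + epsilon).alias("x"),
    ])
)
df_2.sort("y").join_asof(right, left_on="y", right_on="x", strategy="backward")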

Related

Best way to get percentage counts in Polars

I frequently need to calculate the percentage counts of a variable. For example for the dataframe below
df = pl.DataFrame({"person": ["a", "a", "b"],
                   "value": [1, 2, 3]})
I want to return a dataframe like this:
person | percent
a      | 0.667
b      | 0.333
What I have been doing is the following, but I can't help but think there must be a more efficient / polars way to do this
n_rows = len(df)
(
    df
    .with_column(pl.lit(1).alias('percent'))
    .groupby('person')
    .agg([pl.sum('percent') / n_rows])
)
polars.count will help here. When called without arguments, polars.count returns the number of rows in a particular context.
(
    df
    .groupby("person")
    .agg([pl.count().alias("count")])
    .with_column((pl.col("count") / pl.sum("count")).alias("percent_count"))
)
shape: (2, 3)
┌────────┬───────┬───────────────┐
│ person ┆ count ┆ percent_count │
│ --- ┆ --- ┆ --- │
│ str ┆ u32 ┆ f64 │
╞════════╪═══════╪═══════════════╡
│ a ┆ 2 ┆ 0.666667 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ b ┆ 1 ┆ 0.333333 │
└────────┴───────┴───────────────┘
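If you want the result trimmed to just the two requested columns, a small variation (only a sketch of the same idea) renames the ratio and selects it at the end:
(
    df
    .groupby("person")
    .agg([pl.count().alias("count")])
    .with_column((pl.col("count") / pl.sum("count")).alias("percent"))
    .select(["person", "percent"])
)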

Fast apply of a function to Polars Dataframe

What are the fastest ways to apply functions to polars DataFrames - pl.DataFrame or pl.internals.lazy_frame.LazyFrame? This question is piggy-backing off Apply Function to all columns of a Polars-DataFrame
I am trying to concat all columns and hash the value using hashlib in python standard library. The function I am using is below:
import hashlib
import os

def hash_row(row):
    os.environ['PYTHONHASHSEED'] = "0"
    row = str(row).encode('utf-8')
    return hashlib.sha256(row).hexdigest()
However, given that this function requires a string as input, it needs to be applied to every cell within a pl.Series. This should be okay when working with a small amount of data, but when we have closer to 100m rows it becomes very problematic. The question for this thread is: how can we apply such a function in the most performant way across an entire Polars Series?
Pandas
It offers a few options for creating new columns, some more performant than others.
df['new_col'] = df['some_col'] * 100 # vectorized calls
Another option is to create custom functions for row-wise operations.
def apply_func(row):
    return row['some_col'] + row['another_col']

df['new_col'] = df.apply(lambda row: apply_func(row), axis=1)  # using apply operations
From my experience, the fastest way is to create numpy vectorized solutions.
import numpy as np

def np_func(some_col, another_col):
    return some_col + another_col

vec_func = np.vectorize(np_func)
df['new_col'] = vec_func(df['some_col'].values, df['another_col'].values)
Polars
What is the best solution for Polars?
Let's start with this data of various types:
import polars as pl
df = pl.DataFrame(
    {
        "col_int": [1, 2, 3, 4],
        "col_float": [10.0, 20, 30, 40],
        "col_bool": [True, False, True, False],
        "col_str": pl.repeat("2020-01-01", 4, eager=True),
    }
).with_column(pl.col("col_str").str.strptime(pl.Date).alias("col_date"))
df
shape: (4, 5)
┌─────────┬───────────┬──────────┬────────────┬────────────┐
│ col_int ┆ col_float ┆ col_bool ┆ col_str ┆ col_date │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ bool ┆ str ┆ date │
╞═════════╪═══════════╪══════════╪════════════╪════════════╡
│ 1 ┆ 10.0 ┆ true ┆ 2020-01-01 ┆ 2020-01-01 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 20.0 ┆ false ┆ 2020-01-01 ┆ 2020-01-01 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 30.0 ┆ true ┆ 2020-01-01 ┆ 2020-01-01 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ 40.0 ┆ false ┆ 2020-01-01 ┆ 2020-01-01 │
└─────────┴───────────┴──────────┴────────────┴────────────┘
Polars: DataFrame.hash_rows
I should first point out that Polars itself has a hash_rows function that will hash the rows of a DataFrame, without first needing to cast each column to a string.
df.hash_rows()
shape: (4,)
Series: '' [u64]
[
16206777682454905786
7386261536140378310
3777361287274669406
675264002871508281
]
If you find this acceptable, then this would be the most performant solution. You can cast the resulting unsigned int to a string if you need to. Note: hash_rows is available only on a DataFrame, not a LazyFrame.
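As a quick sketch of that cast (the column name row_hash is just something I made up), you could attach the hashes as a string column:
# hash_rows returns a u64 Series; cast it to Utf8 and add it as a column
row_hash = df.hash_rows().cast(pl.Utf8).alias("row_hash")
df.with_column(row_hash)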
Using polars.concat_str and apply
If you need to use your own hashing solution, then I recommend using the polars.concat_str function to concatenate the values in each row to a string. From the documentation:
polars.concat_str(exprs: Union[Sequence[Union[polars.internals.expr.Expr, str]], polars.internals.expr.Expr], sep: str = '') → polars.internals.expr.Expr
Horizontally Concat Utf8 Series in linear time. Non utf8 columns are cast to utf8.
So, for example, here is the resulting concatenation on our dataset.
df.with_column(
    pl.concat_str(pl.all()).alias('concatenated_cols')
)
shape: (4, 6)
┌─────────┬───────────┬──────────┬────────────┬────────────┬────────────────────────────────┐
│ col_int ┆ col_float ┆ col_bool ┆ col_str ┆ col_date ┆ concatenated_cols │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ bool ┆ str ┆ date ┆ str │
╞═════════╪═══════════╪══════════╪════════════╪════════════╪════════════════════════════════╡
│ 1 ┆ 10.0 ┆ true ┆ 2020-01-01 ┆ 2020-01-01 ┆ 110.0true2020-01-012020-01-01 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 20.0 ┆ false ┆ 2020-01-01 ┆ 2020-01-01 ┆ 220.0false2020-01-012020-01-01 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 30.0 ┆ true ┆ 2020-01-01 ┆ 2020-01-01 ┆ 330.0true2020-01-012020-01-01 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ 40.0 ┆ false ┆ 2020-01-01 ┆ 2020-01-01 ┆ 440.0false2020-01-012020-01-01 │
└─────────┴───────────┴──────────┴────────────┴────────────┴────────────────────────────────┘
Taking the next step and using the apply method and your function would yield:
df.with_column(
    pl.concat_str(pl.all()).apply(hash_row).alias('hash')
)
shape: (4, 6)
┌─────────┬───────────┬──────────┬────────────┬────────────┬─────────────────────────────────────┐
│ col_int ┆ col_float ┆ col_bool ┆ col_str ┆ col_date ┆ hash │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ bool ┆ str ┆ date ┆ str │
╞═════════╪═══════════╪══════════╪════════════╪════════════╪═════════════════════════════════════╡
│ 1 ┆ 10.0 ┆ true ┆ 2020-01-01 ┆ 2020-01-01 ┆ 1826eb9c6aeb0abcdd2999a76eee576e... │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 20.0 ┆ false ┆ 2020-01-01 ┆ 2020-01-01 ┆ ea50f5b11957bfc92b5ab7545b3ac12c... │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 30.0 ┆ true ┆ 2020-01-01 ┆ 2020-01-01 ┆ eef039d8dedadcc282d6fa9473e071e8... │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ 40.0 ┆ false ┆ 2020-01-01 ┆ 2020-01-01 ┆ dcc5c57e0b5fdf15320a84c6839b0e3d... │
└─────────┴───────────┴──────────┴────────────┴────────────┴─────────────────────────────────────┘
Please remember that any time Polars calls external libraries or runs Python bytecode, you are subject to the Python GIL, which means single-threaded performance - no matter how you code it. From the Polars Cookbook section Do Not Kill The Parallelization!:
We have all heard that Python is slow, and does "not scale." Besides the overhead of running "slow" bytecode, Python has to remain within the constraints of the Global Interpreter Lock (GIL). This means that if you were to use a lambda or a custom Python function to apply during a parallelized phase, Polars speed is capped running Python code preventing any multiple threads from executing the function.
This all feels terribly limiting, especially because we often need those lambda functions in a .groupby() step, for example. This approach is still supported by Polars, but keeping in mind bytecode and the GIL costs have to be paid.
To mitigate this, Polars implements a powerful syntax defined not only in its lazy API, but also in its eager API.
Thanks cbilot - I was unaware of hash_rows. Your solution is nearly identical to what I wrote. The one thing I have to mention is that concat_str did not work for me when there are nulls in the series, so I had to cast to Utf8 and fill_null first. Then I am able to concat_str and apply hash_row on the result.
def set_datatypes_and_replace_nulls(df, idcol="factset_person_id"):
    return (
        df
        .with_columns([
            pl.col("*").cast(pl.Utf8, strict=False),
            pl.col("*").fill_null(pl.lit("")),
        ])
        .with_columns([
            # keep the ID column out of the concatenation so it is not hashed
            pl.concat_str(pl.col("*").exclude(idcol)).alias("concatstr"),
        ])
    )

def hash_concat(df):
    return (
        df
        .with_columns([
            pl.col("concatstr").apply(hash_row).alias("sha256hash"),
        ])
    )
After this we need to aggregate the hashes by ID.
df = (
    df
    .pipe(set_datatypes_and_replace_nulls)
    .pipe(hash_concat)
)

# something like the below...
part1 = (
    df.lazy()
    .groupby("id")
    .agg(
        [
            pl.col("concatstr").unique().list(),
        ]
    )
)
Thanks for improving with pl.hash_rows.

Polars: Pivoting by Int64 column not keeping numeric order

I have a column called VERSION_INDEX which is Int64 and is a proxy for keeping a list of semantic software versions ordered such that 0.2.0 comes after 0.13.0. When I pivot, the column names created from the pivot are sorted alphanumerically.
pivot_df = merged_df.pivot(index=test_events_key_columns, columns='VERSION_INDEX', values='Status')
print(pivot_df)
Is it possible to keep the column order numeric during the pivot such that 9 comes before 87?
thx
In Polars, column names are always stored as strings, and hence you have the alphanumeric sorting rather than numeric. There is no way around the strings, so I think the best you can do is to compute the column order you want, and select the columns:
import polars as pl

df = pl.DataFrame(
    {
        "version": [9, 85, 87],
        "testsuite": ["scan1", "scan2", "scan3"],
        "status": ["ok"] * 3,
    }
)
wide = df.pivot(index="testsuite", columns="version", values="status")
cols = df["version"].cast(pl.Utf8).to_list()
wide[["testsuite"] + cols]
┌───────────┬──────┬──────┬──────┐
│ testsuite ┆ 9 ┆ 85 ┆ 87 │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str │
╞═══════════╪══════╪══════╪══════╡
│ scan1 ┆ ok ┆ null ┆ null │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ scan2 ┆ null ┆ ok ┆ null │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ scan3 ┆ null ┆ null ┆ ok │
└───────────┴──────┴──────┴──────┘
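Note that this relies on df["version"] already being in ascending numeric order. If that is not guaranteed, a small sketch that sorts the numeric values before building the column list would be:
# sort the version numbers numerically, then convert to the string column names
cols = [str(v) for v in sorted(df["version"].unique().to_list())]
wide[["testsuite"] + cols]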

How to get the first n% of a group in polars?

Q1: In polars-rust, when you do .groupby().agg(), you can use .head(10) to get the first 10 elements in a column. But what if the groups have different lengths and I need to get the first 20% of elements in each group (like elements 0-24 in a 120-element group)? How can I make that work?
Q2: With a dataframe sample like the one below, my goal is to loop over the dataframe. Because Polars is column-major, I downcast the df into several ChunkedArrays and iterated via iter().zip(). I found this to be faster than the same action after groupby(col("date")), which loops over some list elements. Why is that?
In my opinion, the dataframe is shorter after the groupby, which should mean a shorter loop.
Date       | Stock | Price
2010-01-01 | IBM   | 1000
2010-01-02 | IBM   | 1001
2010-01-03 | IBM   | 1002
2010-01-01 | AAPL  | 2900
2010-01-02 | AAPL  | 2901
2010-01-03 | AAPL  | 2902
I don't really understand your 2nd question. Maybe you can create another question with a small example.
I will answer the 1st question:
we can use head(10) to get the first 10 elements in a column. But if the groups have different lengths and I need to get the first 20% of elements in each group (like elements 0-24 in a 120-element group), how do I make that work?
We can use expressions to take a head(n) where n = 0.2 * group_size.
df = pl.DataFrame({
    "groups": ["a"] * 10 + ["b"] * 20,
    "values": range(30),
})

(df.groupby("groups")
    .agg(pl.all().head(pl.count() * 0.2))
    .explode(pl.all().exclude("groups"))
)
which outputs:
shape: (6, 2)
┌────────┬────────┐
│ groups ┆ values │
│ --- ┆ --- │
│ str ┆ i64 │
╞════════╪════════╡
│ a ┆ 0 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ a ┆ 1 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ b ┆ 10 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ b ┆ 11 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ b ┆ 12 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ b ┆ 13 │
└────────┴────────┘
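One caveat, as an assumption about how the fractional count is handled (and assuming your Polars version has Expr.ceil): very small groups could contribute zero rows if pl.count() * 0.2 truncates below one. A sketch that rounds up to guarantee at least one element per group:
(df.groupby("groups")
    .agg(pl.all().head((pl.count() * 0.2).ceil()))
    .explode(pl.all().exclude("groups"))
)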

In polars, can I create a categorical type with levels myself?

In Pandas, I can specify the levels of a Categorical type myself:
MyCat = pd.CategoricalDtype(categories=['A','B','C'], ordered=True)
my_data = pd.Series(['A','A','B'], dtype=MyCat)
This means that:
1) I can make sure that different columns and sets use the same dtype
2) I can specify an ordering for the levels.
Is there a way to do this with Polars? I know you can use the string cache feature to achieve 1) in a different way, however I'm interested if my dtype/levels can be specified directly. I'm not aware of any way to achieve 2), however I think the categorical dtypes in Arrow do allow an optional ordering, so maybe it's possible?
Not directly, but we can influence how the global string cache is filled. The global string cache simply increments a counter for every new category added.
So if we start with an empty cache and we do a pre-fill in the order that we think is important, the later categories use the cached integer.
Here is an example:
import string
import polars as pl

with pl.StringCache():
    # the first run will fill the global string cache counting from 0..25
    # for all 26 letters in the alphabet
    pl.Series(list(string.ascii_uppercase)).cast(pl.Categorical)

    # now the global string cache is populated with all categories
    # we cast the string columns
    df = (
        pl.DataFrame({
            "letters": ["A", "B", "D"],
            "more_letters": ["Z", "B", "J"],
        })
        .with_column(pl.col(pl.Utf8).cast(pl.Categorical))
        .with_column(pl.col(pl.Categorical).to_physical().suffix("_real_category"))
    )

print(df)
shape: (3, 4)
┌─────────┬──────────────┬───────────────────────┬────────────────────────────┐
│ letters ┆ more_letters ┆ letters_real_category ┆ more_letters_real_category │
│ --- ┆ --- ┆ --- ┆ --- │
│ cat ┆ cat ┆ u32 ┆ u32 │
╞═════════╪══════════════╪═══════════════════════╪════════════════════════════╡
│ A ┆ Z ┆ 0 ┆ 25 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ B ┆ B ┆ 1 ┆ 1 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ D ┆ J ┆ 3 ┆ 9 │
└─────────┴──────────────┴───────────────────────┴────────────────────────────┘
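To actually make use of that ordering (point 2), one option is to sort on the physical representation. This is only a sketch, and it assumes your Polars version accepts an expression in sort; if not, add the to_physical() column first and sort on that:
# sort by the underlying cache integer, i.e. your pre-defined level order,
# rather than lexically by the string value
df.sort(pl.col("letters").to_physical())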