python-polars create new column by dividing by two existing columns - python-polars

in pandas the following creates a new column in dataframe by dividing by two existing columns. How do I do this in polars? Bonus if done in the fastest way using polars.LazyFrame
df = pd.DataFrame({"col1":[10,20,30,40,50], "col2":[5,2,10,10,25]})
df["ans"] = df["col1"]/df["col2"]
print(df)

You want to avoid Pandas-style coding and use Polars Expressions API. Expressions are the heart of Polars and yield the best performance.
Here's how we would code this using Expressions, including using Lazy mode:
(
df
.lazy()
.with_column(
(pl.col('col1') / pl.col('col2')).alias('result')
)
.collect()
)
shape: (5, 3)
┌──────┬──────┬────────┐
│ col1 ┆ col2 ┆ result │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ f64 │
╞══════╪══════╪════════╡
│ 10 ┆ 5 ┆ 2.0 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 20 ┆ 2 ┆ 10.0 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 30 ┆ 10 ┆ 3.0 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 40 ┆ 10 ┆ 4.0 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 50 ┆ 25 ┆ 2.0 │
└──────┴──────┴────────┘
Here's a section of the User Guide that may help transitioning from Pandas-style coding to using Polars Expressions.

Related

Python-Polars: How to filter categorical column with string list

I have a Polars dataframe like below:
df_cat = pl.DataFrame(
[
pl.Series("a_cat", ["c", "a", "b", "c", "b"], dtype=pl.Categorical),
pl.Series("b_cat", ["F", "G", "E", "G", "G"], dtype=pl.Categorical)
])
print(df_cat)
shape: (5, 2)
┌───────┬───────┐
│ a_cat ┆ b_cat │
│ --- ┆ --- │
│ cat ┆ cat │
╞═══════╪═══════╡
│ c ┆ F │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ a ┆ G │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b ┆ E │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ c ┆ G │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b ┆ G │
└───────┴───────┘
The following filter runs perfectly fine:
print(df_cat.filter(pl.col('a_cat') == 'c'))
shape: (2, 2)
┌───────┬───────┐
│ a_cat ┆ b_cat │
│ --- ┆ --- │
│ cat ┆ cat │
╞═══════╪═══════╡
│ c ┆ F │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ c ┆ G │
└───────┴───────┘
What I want is to use a list of string to run the filter more efficiently. So I tried and ended up with the following error message:
print(df_cat.filter(pl.col('a_cat').is_in(['a', 'c'])))
---------------------------------------------------------------------------
ComputeError Traceback (most recent call last)
d:\GitRepo\Test2\stockEMD3.ipynb Cell 9 in <cell line: 1>()
----> 1 print(df_cat.filter(pl.col('a_cat').is_in(['c'])))
File c:\ProgramData\Anaconda3\envs\charm3.9\lib\site-packages\polars\internals\dataframe\frame.py:2185, in DataFrame.filter(self, predicate)
2181 if _NUMPY_AVAILABLE and isinstance(predicate, np.ndarray):
2182 predicate = pli.Series(predicate)
2184 return (
-> 2185 self.lazy()
2186 .filter(predicate) # type: ignore[arg-type]
2187 .collect(no_optimization=True, string_cache=False)
2188 )
File c:\ProgramData\Anaconda3\envs\charm3.9\lib\site-packages\polars\internals\lazyframe\frame.py:660, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, string_cache, no_optimization, slice_pushdown)
650 projection_pushdown = False
652 ldf = self._ldf.optimization_toggle(
653 type_coercion,
654 predicate_pushdown,
(...)
658 slice_pushdown,
659 )
--> 660 return pli.wrap_df(ldf.collect())
ComputeError: joins/or comparisons on categorical dtypes can only happen if they are created under the same global string cache
From this Stackoverflow link I understand "You need to set a global string cache to compare categoricals created in different columns/lists." but my question is
Why the == one single string filter case works?
What is the proper way to filter a categorical column with a list of string?
Thanks!
Actually, you don't need to set a global string cache to compare strings to Categorical variables. You can use cast to accomplish this.
Let's use this data. I've included the integer values that underlie the Categorical variables to demonstrate something later.
import polars as pl
df_cat = (
pl.DataFrame(
[
pl.Series("a_cat", ["c", "a", "b", "c", "X"], dtype=pl.Categorical),
pl.Series("b_cat", ["F", "G", "E", "S", "X"], dtype=pl.Categorical),
]
)
.with_column(
pl.all().to_physical().suffix('_phys')
)
)
df_cat
shape: (5, 4)
┌───────┬───────┬────────────┬────────────┐
│ a_cat ┆ b_cat ┆ a_cat_phys ┆ b_cat_phys │
│ --- ┆ --- ┆ --- ┆ --- │
│ cat ┆ cat ┆ u32 ┆ u32 │
╞═══════╪═══════╪════════════╪════════════╡
│ c ┆ F ┆ 0 ┆ 0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ a ┆ G ┆ 1 ┆ 1 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ b ┆ E ┆ 2 ┆ 2 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ c ┆ S ┆ 0 ┆ 3 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ X ┆ X ┆ 3 ┆ 4 │
└───────┴───────┴────────────┴────────────┘
Comparing a categorical variable to a string
If we cast a Categorical variable back to its string values, we can make any comparison we need. For example:
df_cat.filter(pl.col('a_cat').cast(pl.Utf8).is_in(['a', 'c']))
shape: (3, 4)
┌───────┬───────┬────────────┬────────────┐
│ a_cat ┆ b_cat ┆ a_cat_phys ┆ b_cat_phys │
│ --- ┆ --- ┆ --- ┆ --- │
│ cat ┆ cat ┆ u32 ┆ u32 │
╞═══════╪═══════╪════════════╪════════════╡
│ c ┆ F ┆ 0 ┆ 0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ a ┆ G ┆ 1 ┆ 1 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ c ┆ S ┆ 0 ┆ 3 │
└───────┴───────┴────────────┴────────────┘
Or in a filter step comparing the string values of two Categorical variables that do not share the same string cache.
df_cat.filter(pl.col('a_cat').cast(pl.Utf8) == pl.col('b_cat').cast(pl.Utf8))
shape: (1, 4)
┌───────┬───────┬────────────┬────────────┐
│ a_cat ┆ b_cat ┆ a_cat_phys ┆ b_cat_phys │
│ --- ┆ --- ┆ --- ┆ --- │
│ cat ┆ cat ┆ u32 ┆ u32 │
╞═══════╪═══════╪════════════╪════════════╡
│ X ┆ X ┆ 3 ┆ 4 │
└───────┴───────┴────────────┴────────────┘
Notice that it is the string values being compared (not the integers underlying the two Categorical variables).
The equality operator on Categorical variables
The following statements are equivalent:
df_cat.filter((pl.col('a_cat') == 'a'))
df_cat.filter((pl.col('a_cat').cast(pl.Utf8) == 'a'))
The former is syntactic sugar for the latter, as the former is a common use case.
As the error states: ComputeError: joins/or comparisons on categorical dtypes can only happen if they are created under the same global string cache.
Comparisons of categorical values are only allowed under a global string cache. You really want to set this in such a case as it speeds up comparisons and prevents expensive casts to strings.
Setting this on the start of your query will ensure it runs:
import polars as pl
pl.Config.set_global_string_cache()
This is a new answer based on the one from #ritchie46.
Polar 0.15.15 it now is
import polars as pl
pl.toggle_string_cache(True)
Also a StringCache() Context manager can be used, see polars documentation:
with pl.StringCache():
print(df_cat.filter(pl.col('a_cat').is_in(['a', 'c'])))

Fast apply of a function to Polars Dataframe

What are the fastest ways to apply functions to polars DataFrames - pl.DataFrame or pl.internals.lazy_frame.LazyFrame? This question is piggy-backing off Apply Function to all columns of a Polars-DataFrame
I am trying to concat all columns and hash the value using hashlib in python standard library. The function I am using is below:
import hashlib
def hash_row(row):
os.environ['PYTHONHASHSEED'] = "0"
row = str(row).encode('utf-8')
return hashlib.sha256(row).hexdigest()
However given that this function requires a string as input, means this function needs to be applied to every cell within a pl.Series. Working with a small amount of data, should be okay, but when we have closer to 100m rows this becomes very problematic. The question for this thread is how can we apply such a function in the most-performant way across an entire Polars Series?
Pandas
Offers a few options to create new columns, and some are more performant than others.
df['new_col'] = df['some_col'] * 100 # vectorized calls
Another option is to create custom functions for row-wise operations.
def apply_func(row):
return row['some_col'] + row['another_col']
df['new_col'] = df.apply(lambda row: apply_func(row), axis=1) # using apply operations
From my experience, the fastest way is to create numpy vectorized solutions.
import numpy as np
def np_func(some_col, another_col):
return some_col + another_col
vec_func = np.vectorize(np_func)
df['new_col'] = vec_func(df['some_col'].values, df['another_col'].values)
Polars
What is the best solution for Polars?
Let's start with this data of various types:
import polars as pl
df = pl.DataFrame(
{
"col_int": [1, 2, 3, 4],
"col_float": [10.0, 20, 30, 40],
"col_bool": [True, False, True, False],
"col_str": pl.repeat("2020-01-01", 4, eager=True),
}
).with_column(pl.col("col_str").str.strptime(pl.Date).alias("col_date"))
df
shape: (4, 5)
┌─────────┬───────────┬──────────┬────────────┬────────────┐
│ col_int ┆ col_float ┆ col_bool ┆ col_str ┆ col_date │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ bool ┆ str ┆ date │
╞═════════╪═══════════╪══════════╪════════════╪════════════╡
│ 1 ┆ 10.0 ┆ true ┆ 2020-01-01 ┆ 2020-01-01 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 20.0 ┆ false ┆ 2020-01-01 ┆ 2020-01-01 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 30.0 ┆ true ┆ 2020-01-01 ┆ 2020-01-01 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ 40.0 ┆ false ┆ 2020-01-01 ┆ 2020-01-01 │
└─────────┴───────────┴──────────┴────────────┴────────────┘
Polars: DataFrame.hash_rows
I should first point out that Polars itself has a hash_rows function that will hash the rows of a DataFrame, without first needing to cast each column to a string.
df.hash_rows()
shape: (4,)
Series: '' [u64]
[
16206777682454905786
7386261536140378310
3777361287274669406
675264002871508281
]
If you find this acceptable, then this would be the most performant solution. You can cast the resulting unsigned int to a string if you need to. Note: hash_rows is available only on a DataFrame, not a LazyFrame.
Using polars.concat_str and apply
If you need to use your own hashing solution, then I recommend using the polars.concat_str function to concatenate the values in each row to a string. From the documentation:
polars.concat_str(exprs: Union[Sequence[Union[polars.internals.expr.Expr, str]], polars.internals.expr.Expr], sep: str = '') → polars.internals.expr.Expr
Horizontally Concat Utf8 Series in linear time. Non utf8 columns are cast to utf8.
So, for example, here is the resulting concatenation on our dataset.
df.with_column(
pl.concat_str(pl.all()).alias('concatenated_cols')
)
shape: (4, 6)
┌─────────┬───────────┬──────────┬────────────┬────────────┬────────────────────────────────┐
│ col_int ┆ col_float ┆ col_bool ┆ col_str ┆ col_date ┆ concatenated_cols │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ bool ┆ str ┆ date ┆ str │
╞═════════╪═══════════╪══════════╪════════════╪════════════╪════════════════════════════════╡
│ 1 ┆ 10.0 ┆ true ┆ 2020-01-01 ┆ 2020-01-01 ┆ 110.0true2020-01-012020-01-01 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 20.0 ┆ false ┆ 2020-01-01 ┆ 2020-01-01 ┆ 220.0false2020-01-012020-01-01 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 30.0 ┆ true ┆ 2020-01-01 ┆ 2020-01-01 ┆ 330.0true2020-01-012020-01-01 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ 40.0 ┆ false ┆ 2020-01-01 ┆ 2020-01-01 ┆ 440.0false2020-01-012020-01-01 │
└─────────┴───────────┴──────────┴────────────┴────────────┴────────────────────────────────┘
Taking the next step and using the apply method and your function would yield:
df.with_column(
pl.concat_str(pl.all()).apply(hash_row).alias('hash')
)
shape: (4, 6)
┌─────────┬───────────┬──────────┬────────────┬────────────┬─────────────────────────────────────┐
│ col_int ┆ col_float ┆ col_bool ┆ col_str ┆ col_date ┆ hash │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ bool ┆ str ┆ date ┆ str │
╞═════════╪═══════════╪══════════╪════════════╪════════════╪═════════════════════════════════════╡
│ 1 ┆ 10.0 ┆ true ┆ 2020-01-01 ┆ 2020-01-01 ┆ 1826eb9c6aeb0abcdd2999a76eee576e... │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 20.0 ┆ false ┆ 2020-01-01 ┆ 2020-01-01 ┆ ea50f5b11957bfc92b5ab7545b3ac12c... │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 30.0 ┆ true ┆ 2020-01-01 ┆ 2020-01-01 ┆ eef039d8dedadcc282d6fa9473e071e8... │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ 40.0 ┆ false ┆ 2020-01-01 ┆ 2020-01-01 ┆ dcc5c57e0b5fdf15320a84c6839b0e3d... │
└─────────┴───────────┴──────────┴────────────┴────────────┴─────────────────────────────────────┘
Please remember that any time Polars calls external libraries or runs Python bytecode, you are subject to the Python GIL, which means single-threaded performance - no matter how you code it. From the Polars Cookbook section Do Not Kill The Parallelization!:
We have all heard that Python is slow, and does "not scale." Besides the overhead of running "slow" bytecode, Python has to remain within the constraints of the Global Interpreter Lock (GIL). This means that if you were to use a lambda or a custom Python function to apply during a parallelized phase, Polars speed is capped running Python code preventing any multiple threads from executing the function.
This all feels terribly limiting, especially because we often need those lambda functions in a .groupby() step, for example. This approach is still supported by Polars, but keeping in mind bytecode and the GIL costs have to be paid.
To mitigate this, Polars implements a powerful syntax defined not only in its lazy API, but also in its eager API.
Thanks cbilot - was unaware of hash_rows. Your solution is nearly identical to what I have wrote. The one thing that I have to mention is that --
concat_str did not work for me if there are Nulls in your series. Thus I had to cast to Utf8 before fill_null. Then I am able to concat_str and apply hash_row on the result.
def set_datatypes_and_replace_nulls(df, idcol="factset_person_id"):
return (
df
.with_columns([
pl.col("*").cast(pl.Utf8, strict=False),
pl.col("*").fill_null(pl.lit(""))
])
.with_columns([
pl.concat_str(pl.col("*").exclude(exclude_cols)).alias('concatstr'),
])
)
def hash_concat(df):
return (
df
.with_columns([
pl.col("concatstr").apply(hash_row).alias('sha256hash')
])
)
After this we need to aggregate the hashes by ID.
df = (
df
.pipe(set_datatypes_and_replace_nulls)
.pipe(hash_concat)
)
# something like the below...
part1= (
df.lazy()
.groupby("id")
.agg(
[
pl.col("concatstr").unique().list(),
]
)
)
Thanks for improving with pl.hash_rows.

How to form dynamic expressions without breaking on types

Any way to make the dynamic polars expressions not break with errors?
Currently I'm just excluding the columns by type, but just wondering if there is a better way.
For example, i have a df coming from parquet, if i just execute an expression on all columns it might break for certain types. Instead I want to contain these errors and possibly return a default value like None or -1 or something else.
import polars as pl
df = pl.scan_parquet("/path/to/data/*.parquet")
print(df.schema)
# Prints: {'date_time': <class 'polars.datatypes.Datetime'>, 'incident': <class 'polars.datatypes.Utf8'>, 'address': <class 'polars.datatypes.Utf8'>, 'city': <class 'polars.datatypes.Utf8'>, 'zipcode': <class 'polars.datatypes.Int32'>}
Now if i form generic expression on top of this, there are chances it may fail. For example,
# Finding positive count across all columns
# Fails due to: exceptions.ComputeError: cannot compare Utf8 with numeric data
print(df.select((pl.all() > 0).count().prefix("__positive_count_")).collect())
# Finding positive count across all columns
# Fails due to: pyo3_runtime.PanicException: 'unique_counts' not implemented for datetime[ns] data types
print(df.select(pl.all().unique_counts().prefix("__unique_count_")).collect())
# Finding positive count across all columns
# Fails due to: exceptions.SchemaError: Series dtype Int32 != utf8
# Note: this could have been avoided by doing an explict cast to string first
print(df.select((pl.all().str.lengths() > 0).count().prefix("__empty_count_")).collect())
I'll keep to things that work in lazy mode, as it appears that you are working in lazy mode with Parquet files.
Let's use this data as an example:
import polars as pl
from datetime import datetime
df = pl.DataFrame(
{
"col_int": [-2, -2, 0, 2, 2],
"col_float": [-20.0, -10, 10, 20, 20],
"col_date": pl.date_range(datetime(2020, 1, 1), datetime(2020, 5, 1), "1mo"),
"col_str": ["str1", "str2", "", None, "str5"],
"col_bool": [True, False, False, True, False],
}
).lazy()
df.collect()
shape: (5, 5)
┌─────────┬───────────┬─────────────────────┬─────────┬──────────┐
│ col_int ┆ col_float ┆ col_date ┆ col_str ┆ col_bool │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ datetime[ns] ┆ str ┆ bool │
╞═════════╪═══════════╪═════════════════════╪═════════╪══════════╡
│ -2 ┆ -20.0 ┆ 2020-01-01 00:00:00 ┆ str1 ┆ true │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ -2 ┆ -10.0 ┆ 2020-02-01 00:00:00 ┆ str2 ┆ false │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 0 ┆ 10.0 ┆ 2020-03-01 00:00:00 ┆ ┆ false │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 20.0 ┆ 2020-04-01 00:00:00 ┆ null ┆ true │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 20.0 ┆ 2020-05-01 00:00:00 ┆ str5 ┆ false │
└─────────┴───────────┴─────────────────────┴─────────┴──────────┘
Using the col Expression
One feature of the col expression is that you can supply a datatype, or even a list of datatypes. For example, if we want to contain our queries to floats, we can do the following:
df.select((pl.col(pl.Float64) > 0).sum().suffix("__positive_count_")).collect()
shape: (1, 1)
┌────────────────────────────┐
│ col_float__positive_count_ │
│ --- │
│ u32 │
╞════════════════════════════╡
│ 3 │
└────────────────────────────┘
(Note: (pl.col(...) > 0) creates a series of boolean values that need to be summed, not counted)
To include more than one datatype, you can supply a list of datatypes to col.
df.select(
(pl.col([pl.Int64, pl.Float64]) > 0).sum().suffix("__positive_count_")
).collect()
shape: (1, 2)
┌──────────────────────────┬────────────────────────────┐
│ col_int__positive_count_ ┆ col_float__positive_count_ │
│ --- ┆ --- │
│ u32 ┆ u32 │
╞══════════════════════════╪════════════════════════════╡
│ 2 ┆ 3 │
└──────────────────────────┴────────────────────────────┘
You can also combine these into the same select statement if you'd like.
df.select(
[
(pl.col(pl.Utf8).str.lengths() == 0).sum().suffix("__empty_count"),
pl.col(pl.Utf8).is_null().sum().suffix("__null_count"),
(pl.col([pl.Float64, pl.Int64]) > 0).sum().suffix("_positive_count"),
]
).collect()
shape: (1, 4)
┌──────────────────────┬─────────────────────┬──────────────────────────┬────────────────────────┐
│ col_str__empty_count ┆ col_str__null_count ┆ col_float_positive_count ┆ col_int_positive_count │
│ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ u32 ┆ u32 ┆ u32 │
╞══════════════════════╪═════════════════════╪══════════════════════════╪════════════════════════╡
│ 1 ┆ 1 ┆ 3 ┆ 2 │
└──────────────────────┴─────────────────────┴──────────────────────────┴────────────────────────┘
The Cookbook has a handy list of datatypes.
Using the exclude expression
Another handy trick is to use the exclude expression. With this, we can select all columns except columns of certain datatypes. For example:
df.select(
[
pl.exclude(pl.Utf8).max().suffix("_max"),
pl.exclude([pl.Utf8, pl.Boolean]).min().suffix("_min"),
]
).collect()
shape: (1, 7)
┌─────────────┬───────────────┬─────────────────────┬──────────────┬─────────────┬───────────────┬─────────────────────┐
│ col_int_max ┆ col_float_max ┆ col_date_max ┆ col_bool_max ┆ col_int_min ┆ col_float_min ┆ col_date_min │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ datetime[ns] ┆ u32 ┆ i64 ┆ f64 ┆ datetime[ns] │
╞═════════════╪═══════════════╪═════════════════════╪══════════════╪═════════════╪═══════════════╪═════════════════════╡
│ 2 ┆ 20.0 ┆ 2020-05-01 00:00:00 ┆ 1 ┆ -2 ┆ -20.0 ┆ 2020-01-01 00:00:00 │
└─────────────┴───────────────┴─────────────────────┴──────────────┴─────────────┴───────────────┴─────────────────────┘
Unique counts
One caution: unique_counts results in Series of varying lengths.
df.select(pl.col("col_int").unique_counts().prefix(
"__unique_count_")).collect()
shape: (3, 1)
┌────────────────────────┐
│ __unique_count_col_int │
│ --- │
│ u32 │
╞════════════════════════╡
│ 2 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 │
└────────────────────────┘
df.select(pl.col("col_float").unique_counts().prefix(
"__unique_count_")).collect()
shape: (4, 1)
┌──────────────────────────┐
│ __unique_count_col_float │
│ --- │
│ u32 │
╞══════════════════════════╡
│ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 │
└──────────────────────────┘
As such, these should not be combined into the same results. Each column/Series of a DataFrame must have the same length.

Polars: How to reorder columns in a specific order?

I cannot find how to reorder columns in a polars dataframe in the polars DataFrame docs.
thx
Using the select method is the recommended way to sort columns in polars.
Example:
Input:
df
┌─────┬───────┬─────┐
│Col1 ┆ Col2 ┆Col3 │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞═════╪═══════╪═════╡
│ a ┆ x ┆ p │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ b ┆ y ┆ q │
└─────┴───────┴─────┘
Output:
df.select(['Col3', 'Col2', 'Col1'])
or
df.select([pl.col('Col3'), pl.col('Col2'), pl.col('Col1)])
┌─────┬───────┬─────┐
│Col3 ┆ Col2 ┆Col1 │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞═════╪═══════╪═════╡
│ p ┆ x ┆ a │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ q ┆ y ┆ b │
└─────┴───────┴─────┘
Note:
While df[['Col3', 'Col2', 'Col1']] gives the same result (version 0.14), it is recommended (link) that you use the select method instead.
We strongly recommend selecting data with expressions for almost all
use cases. Square bracket indexing is perhaps useful when doing
exploratory data analysis in a terminal or notebook when you just want
a quick look at a subset of data.
For all other use cases we recommend using expressions because:
expressions can be parallelized
the expression approach can be used in lazy and eager mode while the indexing approach can only be used in eager mode
in lazy mode the query optimizer can optimize expressions
Turns out it is the same as pandas:
df = df[['PRODUCT', 'PROGRAM', 'MFG_AREA', 'VERSION', 'RELEASE_DATE', 'FLOW_SUMMARY', 'TESTSUITE', 'MODULE', 'BASECLASS', 'SUBCLASS', 'Empty', 'Color', 'BINNING', 'BYPASS', 'Status', 'Legend']]
That seems like a special case of projection to me.
df = pl.DataFrame({
"c": [1, 2],
"a": ["a", "b"],
"b": [True, False]
})
df.select(sorted(df.columns))
shape: (2, 3)
┌─────┬───────┬─────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ str ┆ bool ┆ i64 │
╞═════╪═══════╪═════╡
│ a ┆ true ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ b ┆ false ┆ 2 │
└─────┴───────┴─────┘

in polars, how could i use rank() to get most popular category per user

Let's say I have a csv
transaction_id,user,book
1,bob,bookA
2,bob,bookA
3,bob,bookB
4,tim,bookA
5,lucy,bookA
6,lucy,bookC
7,lucy,bookC
8,lucy,bookC
per user, i want to find the book they have shown the most preference towards. For example, the output should be;
shape: (3, 2)
┌──────┬──────────┐
│ user ┆ fav_book │
│ --- ┆ --- │
│ str ┆ str │
╞══════╪══════════╡
│ bob ┆ bookA │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ tim ┆ bookA │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ lucy ┆ bookC │
└──────┴──────────┘
now i've worked out how to do it like so
import polars as pl
df = pl.read_csv("book_aggs.csv")
print(df)
df2 = df.groupby(["user", "book"]).agg([
pl.col("book").count(),
pl.col("transaction_id") # just so we can double check where it all came from - TODO: how to output this to csv?
])
print(df2)
df3 = df2.sort(["user", "book_count"], reverse=True).groupby("user").agg([
pl.col("book").first().alias("fav_book")
])
print(df3)
but really the normal sql way of doing it is a dense_rank sorted by book count descending where rank = 1. I have tried for hours to get this to work but i can't find a relevant example in the docs.
the issue is that in the docs, none of the agg examples reference the output of another agg - in this case it needs to reference the count of each book per user, and then sort those counts descending and then rank based on that sort order.
Please provide an example that explains how to use rank to perform this task, and also how to nest aggregations efficiently.
Approach 1
We could first groupby user and 'book' to get all user -> book combinations and count the most occurring.
This would give this intermediate DataFrame:
shape: (5, 3)
┌──────┬───────┬────────────┐
│ user ┆ book ┆ book_count │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ u32 │
╞══════╪═══════╪════════════╡
│ lucy ┆ bookC ┆ 3 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ lucy ┆ bookA ┆ 1 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ bob ┆ bookB ┆ 1 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ tim ┆ bookA ┆ 1 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ bob ┆ bookA ┆ 2 │
└──────┴───────┴────────────┘
Then we can do another groupby user where we compute the index of the maximum book_count and use that index to take the correct book.
The whole query looks like this:
df = pl.DataFrame({'book': ['bookA',
'bookA',
'bookB',
'bookA',
'bookA',
'bookC',
'bookC',
'bookC'],
'transaction_id': [1, 2, 3, 4, 5, 6, 7, 8],
'user': ['bob', 'bob', 'bob', 'tim', 'lucy', 'lucy', 'lucy', 'lucy']
})
(df.groupby(["user", "book"])
.agg([
pl.col("book").count()
])
.groupby("user")
.agg([
pl.col("book").take(pl.col("book_count").arg_max()).alias("fav_book")
])
)
And creates this output:
shape: (3, 2)
┌──────┬──────────┐
│ user ┆ fav_book │
│ --- ┆ --- │
│ str ┆ str │
╞══════╪══════════╡
│ tim ┆ bookA │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ bob ┆ bookA │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ lucy ┆ bookC │
└──────┴──────────┘
Approach 2
Another approach would be creating a book_count column with a window_expression and then use the index of the maximum to take the correct book in aggregation:
(df
.with_column(pl.count("book").over(["user", "book"]).alias("book_count"))
.groupby("user")
.agg([
pl.col("book").take(pl.col("book_count").arg_max())
])
)