Best way to get percentage counts in Polars - python-polars

I frequently need to calculate the percentage counts of a variable. For example, for the dataframe below
df = pl.DataFrame({"person": ["a", "a", "b"],
                   "value": [1, 2, 3]})
I want to return a dataframe like this:
person  percent
a       0.667
b       0.333
What I have been doing is the following, but I can't help but think there must be a more efficient / polars way to do this
n_rows = len(df)
(
    df
    .with_column(pl.lit(1).alias('percent'))
    .groupby('person')
    .agg([pl.sum('percent') / n_rows])
)

polars.count will help here. When called without arguments, polars.count returns the number of rows in a particular context.
(
    df
    .groupby("person")
    .agg([pl.count().alias("count")])
    .with_column((pl.col("count") / pl.sum("count")).alias("percent_count"))
)
shape: (2, 3)
┌────────┬───────┬───────────────┐
│ person ┆ count ┆ percent_count │
│ --- ┆ --- ┆ --- │
│ str ┆ u32 ┆ f64 │
╞════════╪═══════╪═══════════════╡
│ a ┆ 2 ┆ 0.666667 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ b ┆ 1 ┆ 0.333333 │
└────────┴───────┴───────────────┘
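If you only need the percentage column, the division can also use the frame's height directly, since df.height is just a Python integer that becomes a literal inside the expression. A sketch along the same lines as the answer above:
(
    df
    .groupby("person")
    .agg([(pl.count() / df.height).alias("percent")])
)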

Related

Polars: Unnesting columns algorithmically without a for loop

I am working with multiple parquet datasets that were written with nested structs (sometimes multiple levels deep). I need to output a flattened (no struct) schema. Right now the only way I can think to do that is to use for loops to iterate through the columns. Here is a simplified example where I'm for looping.
while len([x.name for x in df if x.dtype == pl.Struct]) > 0:
    for col in df:
        if col.dtype == pl.Struct:
            df = df.unnest(col.name)
This works, maybe that is the only way to do it, and if so it would be helpful to know that. But Polars is pretty neat and I'm wondering if there is a more functional way to do this without all the looping and reassigning the df to itself.
If you have a df like this:
df=pl.DataFrame({'a':[1,2,3], 'b':[2,3,4], 'c':[3,4,5], 'd':[4,5,6], 'e':[5,6,7]}).select([pl.struct(['a','b']).alias('ab'), pl.struct(['c','d']).alias('cd'),'e'])
You can unnest the ab and cd at the same time by just doing
df.unnest(['ab','cd'])
If you don't know in advance what your column names and types are, then you can just use a list comprehension like this:
[col_name for col_name,dtype in zip(df.columns, df.dtypes) if dtype==pl.Struct]
We can now just put that list comprehension in the unnest method.
df=df.unnest([col_name for col_name,dtype in zip(df.columns, df.dtypes) if dtype==pl.Struct])
If you have structs inside structs like:
df=pl.DataFrame({'a':[1,2,3], 'b':[2,3,4], 'c':[3,4,5], 'd':[4,5,6], 'e':[5,6,7]}).select([pl.struct(['a','b']).alias('ab'), pl.struct(['c','d']).alias('cd'),'e']).select([pl.struct(['ab','cd']).alias('abcd'),'e'])
then I don't think you can get away from some kind of while loop but this might be more concise:
while any([x==pl.Struct for x in df.dtypes]):
    df=df.unnest([col_name for col_name,dtype in zip(df.columns, df.dtypes) if dtype==pl.Struct])
This is a minor addition. If you're concerned about constantly re-looping through a large number of columns, you can create a recursive function that addresses only structs (and nested structs).
def unnest_all(self: pl.DataFrame):
    cols = []
    for next_col in self:
        if next_col.dtype != pl.Struct:
            cols.append(next_col)
        else:
            cols.extend(next_col.struct.to_frame().unnest_all().get_columns())
    return pl.DataFrame(cols)

pl.DataFrame.unnest_all = unnest_all
So, using the second example by #Dean MacGregor above:
df = (
pl.DataFrame(
{"a": [1, 2, 3], "b": [2, 3, 4], "c": [
3, 4, 5], "d": [4, 5, 6], "e": [5, 6, 7]}
)
.select([pl.struct(["a", "b"]).alias("ab"), pl.struct(["c", "d"]).alias("cd"), "e"])
.select([pl.struct(["ab", "cd"]).alias("abcd"), "e"])
)
df
df.unnest_all()
>>> df
shape: (3, 2)
┌───────────────┬─────┐
│ abcd ┆ e │
│ --- ┆ --- │
│ struct[2] ┆ i64 │
╞═══════════════╪═════╡
│ {{1,2},{3,4}} ┆ 5 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ {{2,3},{4,5}} ┆ 6 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ {{3,4},{5,6}} ┆ 7 │
└───────────────┴─────┘
>>> df.unnest_all()
shape: (3, 5)
┌─────┬─────┬─────┬─────┬─────┐
│ a ┆ b ┆ c ┆ d ┆ e │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╪═════╪═════╡
│ 1 ┆ 2 ┆ 3 ┆ 4 ┆ 5 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ 3 ┆ 4 ┆ 5 ┆ 6 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 3 ┆ 4 ┆ 5 ┆ 6 ┆ 7 │
└─────┴─────┴─────┴─────┴─────┘
And using the first example:
df = pl.DataFrame(
{"a": [1, 2, 3], "b": [2, 3, 4], "c": [
3, 4, 5], "d": [4, 5, 6], "e": [5, 6, 7]}
).select([pl.struct(["a", "b"]).alias("ab"), pl.struct(["c", "d"]).alias("cd"), "e"])
df
df.unnest_all()
>>> df
shape: (3, 3)
┌───────────┬───────────┬─────┐
│ ab ┆ cd ┆ e │
│ --- ┆ --- ┆ --- │
│ struct[2] ┆ struct[2] ┆ i64 │
╞═══════════╪═══════════╪═════╡
│ {1,2} ┆ {3,4} ┆ 5 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ {2,3} ┆ {4,5} ┆ 6 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ {3,4} ┆ {5,6} ┆ 7 │
└───────────┴───────────┴─────┘
>>> df.unnest_all()
shape: (3, 5)
┌─────┬─────┬─────┬─────┬─────┐
│ a ┆ b ┆ c ┆ d ┆ e │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╪═════╪═════╡
│ 1 ┆ 2 ┆ 3 ┆ 4 ┆ 5 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ 3 ┆ 4 ┆ 5 ┆ 6 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 3 ┆ 4 ┆ 5 ┆ 6 ┆ 7 │
└─────┴─────┴─────┴─────┴─────┘
In the end, I'm not sure that this saves you much wall-clock time (or RAM).
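If you would rather not monkey-patch pl.DataFrame, the same recursion can live in a plain helper function. A sketch of the idea above (unnest_all_df is just an illustrative name):
def unnest_all_df(df: pl.DataFrame) -> pl.DataFrame:
    cols = []
    for s in df:
        if s.dtype == pl.Struct:
            # recurse into the struct's fields and collect their columns
            cols.extend(unnest_all_df(s.struct.to_frame()).get_columns())
        else:
            cols.append(s)
    return pl.DataFrame(cols)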
The other answers taught me a lot. I encountered a new situation where I wanted each column labeled with all the structs it came from, i.e. for
pl.col("my").struct.field("test").struct.field("thing")
I wanted to recover
my.test.thing
as a string which I could easily use when reading a subset of columns with pyarrow via
pq.ParquetDataset(path).read(columns = ["my.test.thing"])
Since there are many hundreds of columns and the nesting can go quite deep, I wrote functions to do a depth-first search on the schema and extract the columns in that pyarrow-friendly format; I can then use those paths to select every column, unnested, in one go.
First, I worked with the pyarrow schema because I couldn't figure out how to drill into the structs in the polars schema:
schema = df.to_arrow().schema
Navigating structs in that schema is quirky: at the top level the structure behaves differently from deeper in. I ended up writing two functions, the first to navigate the top-level structure and the second to continue the search below:
import pyarrow as pa

def schema_top_level_DFS(pa_schema):
    top_level_stack = list(range(len(pa_schema)))
    while top_level_stack:
        working_top_level_index = top_level_stack.pop()
        working_element_name = pa_schema.names[working_top_level_index]
        if type(pa_schema.types[working_top_level_index]) == pa.lib.StructType:
            second_level_stack = list(range(len(pa_schema.types[working_top_level_index])))
            while second_level_stack:
                working_second_level_index = second_level_stack.pop()
                schema_DFS(pa_schema.types[working_top_level_index][working_second_level_index], working_element_name)
        else:
            column_paths.append(working_element_name)

def schema_DFS(incoming_element, upstream_names):
    current_name = incoming_element.name
    combined_names = ".".join([upstream_names, current_name])
    if type(incoming_element.type) == pa.lib.StructType:
        stack = list(range(len(incoming_element.type)))
        while stack:
            working_index = stack.pop()
            working_element = incoming_element.type[working_index]
            schema_DFS(working_element, combined_names)
    else:
        column_paths.append(combined_names)
So that running
column_paths = []
schema_top_level_DFS(schema)
gives me column paths like
['struct_name_1.inner_struct_name_2.thing1', 'struct_name_1.inner_struct_name_2.thing2']
To actually do the unnesting, I wasn't sure how to do better than a function with a case statement:
def return_pl_formatting(col_string):
    col_list = col_string.split(".")
    match len(col_list):
        case 1:
            return pl.col(col_list[0]).alias(col_string)
        case 2:
            return pl.col(col_list[0]).struct.field(col_list[1]).alias(col_string)
        case 3:
            return pl.col(col_list[0]).struct.field(col_list[1]).struct.field(col_list[2]).alias(col_string)
        case 4:
            return pl.col(col_list[0]).struct.field(col_list[1]).struct.field(col_list[2]).struct.field(col_list[3]).alias(col_string)
        case 5:
            return pl.col(col_list[0]).struct.field(col_list[1]).struct.field(col_list[2]).struct.field(col_list[3]).struct.field(col_list[4]).alias(col_string)
        case 6:
            return pl.col(col_list[0]).struct.field(col_list[1]).struct.field(col_list[2]).struct.field(col_list[3]).struct.field(col_list[4]).struct.field(col_list[5]).alias(col_string)
Then get my unnested and nicely named df with:
df.select([return_pl_formatting(x) for x in column_paths])
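If the nesting can go deeper than six levels, a depth-agnostic variant could fold the struct.field calls with functools.reduce instead of enumerating each depth in a case statement. A sketch (return_pl_formatting_deep is an illustrative name, not tested against the original data):
from functools import reduce

def return_pl_formatting_deep(col_string):
    # walk down one struct field per path segment, however deep the nesting goes
    first, *rest = col_string.split(".")
    expr = reduce(lambda e, field: e.struct.field(field), rest, pl.col(first))
    return expr.alias(col_string)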
To show the output on the example from #Dean MacGregor
test = (
pl.DataFrame(
{"a": [1, 2, 3], "b": [2, 3, 4], "c": [
3, 4, 5], "d": [4, 5, 6], "e": [5, 6, 7]}
)
.select([pl.struct(["a", "b"]).alias("ab"), pl.struct(["c", "d"]).alias("cd"), "e"])
.select([pl.struct(["ab", "cd"]).alias("abcd"), "e"])
)
column_paths = []
schema_top_level_DFS(test.to_arrow().schema)
print(test.select([return_pl_formatting(x) for x in column_paths]))
┌─────┬───────────┬───────────┬───────────┬───────────┐
│ e ┆ abcd.cd.d ┆ abcd.cd.c ┆ abcd.ab.b ┆ abcd.ab.a │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═══════════╪═══════════╪═══════════╪═══════════╡
│ 5 ┆ 4 ┆ 3 ┆ 2 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 6 ┆ 5 ┆ 4 ┆ 3 ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 7 ┆ 6 ┆ 5 ┆ 4 ┆ 3 │
└─────┴───────────┴───────────┴───────────┴───────────┘

Python-Polars: How to filter categorical column with string list

I have a Polars dataframe like below:
df_cat = pl.DataFrame(
    [
        pl.Series("a_cat", ["c", "a", "b", "c", "b"], dtype=pl.Categorical),
        pl.Series("b_cat", ["F", "G", "E", "G", "G"], dtype=pl.Categorical)
    ])
print(df_cat)
shape: (5, 2)
┌───────┬───────┐
│ a_cat ┆ b_cat │
│ --- ┆ --- │
│ cat ┆ cat │
╞═══════╪═══════╡
│ c ┆ F │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ a ┆ G │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b ┆ E │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ c ┆ G │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b ┆ G │
└───────┴───────┘
The following filter runs perfectly fine:
print(df_cat.filter(pl.col('a_cat') == 'c'))
shape: (2, 2)
┌───────┬───────┐
│ a_cat ┆ b_cat │
│ --- ┆ --- │
│ cat ┆ cat │
╞═══════╪═══════╡
│ c ┆ F │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ c ┆ G │
└───────┴───────┘
What I want is to use a list of strings to run the filter more efficiently. So I tried and ended up with the following error message:
print(df_cat.filter(pl.col('a_cat').is_in(['a', 'c'])))
---------------------------------------------------------------------------
ComputeError Traceback (most recent call last)
d:\GitRepo\Test2\stockEMD3.ipynb Cell 9 in <cell line: 1>()
----> 1 print(df_cat.filter(pl.col('a_cat').is_in(['c'])))
File c:\ProgramData\Anaconda3\envs\charm3.9\lib\site-packages\polars\internals\dataframe\frame.py:2185, in DataFrame.filter(self, predicate)
2181 if _NUMPY_AVAILABLE and isinstance(predicate, np.ndarray):
2182 predicate = pli.Series(predicate)
2184 return (
-> 2185 self.lazy()
2186 .filter(predicate) # type: ignore[arg-type]
2187 .collect(no_optimization=True, string_cache=False)
2188 )
File c:\ProgramData\Anaconda3\envs\charm3.9\lib\site-packages\polars\internals\lazyframe\frame.py:660, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, string_cache, no_optimization, slice_pushdown)
650 projection_pushdown = False
652 ldf = self._ldf.optimization_toggle(
653 type_coercion,
654 predicate_pushdown,
(...)
658 slice_pushdown,
659 )
--> 660 return pli.wrap_df(ldf.collect())
ComputeError: joins/or comparisons on categorical dtypes can only happen if they are created under the same global string cache
From this Stackoverflow link I understand "You need to set a global string cache to compare categoricals created in different columns/lists.", but my questions are:
Why does the == single-string filter case work?
What is the proper way to filter a categorical column with a list of strings?
Thanks!
Actually, you don't need to set a global string cache to compare strings to Categorical variables. You can use cast to accomplish this.
Let's use this data. I've included the integer values that underlie the Categorical variables to demonstrate something later.
import polars as pl
df_cat = (
    pl.DataFrame(
        [
            pl.Series("a_cat", ["c", "a", "b", "c", "X"], dtype=pl.Categorical),
            pl.Series("b_cat", ["F", "G", "E", "S", "X"], dtype=pl.Categorical),
        ]
    )
    .with_column(
        pl.all().to_physical().suffix('_phys')
    )
)
df_cat
shape: (5, 4)
┌───────┬───────┬────────────┬────────────┐
│ a_cat ┆ b_cat ┆ a_cat_phys ┆ b_cat_phys │
│ --- ┆ --- ┆ --- ┆ --- │
│ cat ┆ cat ┆ u32 ┆ u32 │
╞═══════╪═══════╪════════════╪════════════╡
│ c ┆ F ┆ 0 ┆ 0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ a ┆ G ┆ 1 ┆ 1 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ b ┆ E ┆ 2 ┆ 2 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ c ┆ S ┆ 0 ┆ 3 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ X ┆ X ┆ 3 ┆ 4 │
└───────┴───────┴────────────┴────────────┘
Comparing a categorical variable to a string
If we cast a Categorical variable back to its string values, we can make any comparison we need. For example:
df_cat.filter(pl.col('a_cat').cast(pl.Utf8).is_in(['a', 'c']))
shape: (3, 4)
┌───────┬───────┬────────────┬────────────┐
│ a_cat ┆ b_cat ┆ a_cat_phys ┆ b_cat_phys │
│ --- ┆ --- ┆ --- ┆ --- │
│ cat ┆ cat ┆ u32 ┆ u32 │
╞═══════╪═══════╪════════════╪════════════╡
│ c ┆ F ┆ 0 ┆ 0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ a ┆ G ┆ 1 ┆ 1 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ c ┆ S ┆ 0 ┆ 3 │
└───────┴───────┴────────────┴────────────┘
Or in a filter step comparing the string values of two Categorical variables that do not share the same string cache.
df_cat.filter(pl.col('a_cat').cast(pl.Utf8) == pl.col('b_cat').cast(pl.Utf8))
shape: (1, 4)
┌───────┬───────┬────────────┬────────────┐
│ a_cat ┆ b_cat ┆ a_cat_phys ┆ b_cat_phys │
│ --- ┆ --- ┆ --- ┆ --- │
│ cat ┆ cat ┆ u32 ┆ u32 │
╞═══════╪═══════╪════════════╪════════════╡
│ X ┆ X ┆ 3 ┆ 4 │
└───────┴───────┴────────────┴────────────┘
Notice that it is the string values being compared (not the integers underlying the two Categorical variables).
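To see the difference, compare the physical columns created above instead:
df_cat.filter(pl.col('a_cat_phys') == pl.col('b_cat_phys'))
This keeps the c/F, a/G and b/E rows (whose codes are 0/0, 1/1 and 2/2), rather than the X/X row that the string comparison returns.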
The equality operator on Categorical variables
The following statements are equivalent:
df_cat.filter((pl.col('a_cat') == 'a'))
df_cat.filter((pl.col('a_cat').cast(pl.Utf8) == 'a'))
The former is syntactic sugar for the latter, as the former is a common use case.
As the error states: ComputeError: joins/or comparisons on categorical dtypes can only happen if they are created under the same global string cache.
Comparisons of categorical values are only allowed under a global string cache. You really want to set this in such a case as it speeds up comparisons and prevents expensive casts to strings.
Setting this on the start of your query will ensure it runs:
import polars as pl
pl.Config.set_global_string_cache()
This is a new answer based on the one from #ritchie46.
As of Polars 0.15.15 it is now:
import polars as pl
pl.toggle_string_cache(True)
Also, a StringCache() context manager can be used; see the polars documentation:
with pl.StringCache():
    print(df_cat.filter(pl.col('a_cat').is_in(['a', 'c'])))
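Depending on your version, the cache may also need to be active when the Categorical columns themselves are created, so a fully self-contained sketch builds and filters the frame under one cache (this is an assumption, not taken from the answers above):
import polars as pl

with pl.StringCache():
    df_cat = pl.DataFrame([
        pl.Series("a_cat", ["c", "a", "b", "c", "b"], dtype=pl.Categorical),
        pl.Series("b_cat", ["F", "G", "E", "G", "G"], dtype=pl.Categorical),
    ])
    print(df_cat.filter(pl.col("a_cat").is_in(["a", "c"])))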

Polars: How to reorder columns in a specific order?

I cannot find how to reorder columns in a polars dataframe in the polars DataFrame docs.
thx
Using the select method is the recommended way to reorder columns in polars.
Example:
Input:
df
┌─────┬───────┬─────┐
│Col1 ┆ Col2 ┆Col3 │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞═════╪═══════╪═════╡
│ a ┆ x ┆ p │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ b ┆ y ┆ q │
└─────┴───────┴─────┘
Output:
df.select(['Col3', 'Col2', 'Col1'])
or
df.select([pl.col('Col3'), pl.col('Col2'), pl.col('Col1')])
┌─────┬───────┬─────┐
│Col3 ┆ Col2 ┆Col1 │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞═════╪═══════╪═════╡
│ p ┆ x ┆ a │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ q ┆ y ┆ b │
└─────┴───────┴─────┘
Note:
While df[['Col3', 'Col2', 'Col1']] gives the same result (version 0.14), the polars documentation recommends that you use the select method instead:
We strongly recommend selecting data with expressions for almost all use cases. Square bracket indexing is perhaps useful when doing exploratory data analysis in a terminal or notebook when you just want a quick look at a subset of data.
For all other use cases we recommend using expressions because:
expressions can be parallelized
the expression approach can be used in lazy and eager mode while the indexing approach can only be used in eager mode
in lazy mode the query optimizer can optimize expressions
Turns out it is the same as pandas:
df = df[['PRODUCT', 'PROGRAM', 'MFG_AREA', 'VERSION', 'RELEASE_DATE', 'FLOW_SUMMARY', 'TESTSUITE', 'MODULE', 'BASECLASS', 'SUBCLASS', 'Empty', 'Color', 'BINNING', 'BYPASS', 'Status', 'Legend']]
That seems like a special case of projection to me.
df = pl.DataFrame({
"c": [1, 2],
"a": ["a", "b"],
"b": [True, False]
})
df.select(sorted(df.columns))
shape: (2, 3)
┌─────┬───────┬─────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ str ┆ bool ┆ i64 │
╞═════╪═══════╪═════╡
│ a ┆ true ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ b ┆ false ┆ 2 │
└─────┴───────┴─────┘
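If you only need to move one column to the front rather than spell out a full ordering, a wildcard exclusion saves typing every name. A sketch, assuming pl.all().exclude is available in your version:
df.select([pl.col("b"), pl.all().exclude("b")])
This puts "b" first and keeps the remaining columns in their existing order.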

in polars, how could i use rank() to get most popular category per user

Let's say I have a csv
transaction_id,user,book
1,bob,bookA
2,bob,bookA
3,bob,bookB
4,tim,bookA
5,lucy,bookA
6,lucy,bookC
7,lucy,bookC
8,lucy,bookC
Per user, I want to find the book they have shown the most preference towards. For example, the output should be:
shape: (3, 2)
┌──────┬──────────┐
│ user ┆ fav_book │
│ --- ┆ --- │
│ str ┆ str │
╞══════╪══════════╡
│ bob ┆ bookA │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ tim ┆ bookA │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ lucy ┆ bookC │
└──────┴──────────┘
Now I've worked out how to do it like so:
import polars as pl
df = pl.read_csv("book_aggs.csv")
print(df)
df2 = df.groupby(["user", "book"]).agg([
    pl.col("book").count(),
    pl.col("transaction_id")  # just so we can double check where it all came from - TODO: how to output this to csv?
])
print(df2)
df3 = df2.sort(["user", "book_count"], reverse=True).groupby("user").agg([
    pl.col("book").first().alias("fav_book")
])
print(df3)
But really the normal SQL way of doing it is a dense_rank sorted by book count descending where rank = 1. I have tried for hours to get this to work but I can't find a relevant example in the docs.
The issue is that in the docs, none of the agg examples reference the output of another agg - in this case it needs to reference the count of each book per user, then sort those counts descending and rank based on that sort order.
Please provide an example that explains how to use rank to perform this task, and also how to nest aggregations efficiently.
Approach 1
We could first groupby 'user' and 'book' to get all user -> book combinations and count how often each occurs.
This would give this intermediate DataFrame:
shape: (5, 3)
┌──────┬───────┬────────────┐
│ user ┆ book ┆ book_count │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ u32 │
╞══════╪═══════╪════════════╡
│ lucy ┆ bookC ┆ 3 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ lucy ┆ bookA ┆ 1 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ bob ┆ bookB ┆ 1 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ tim ┆ bookA ┆ 1 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ bob ┆ bookA ┆ 2 │
└──────┴───────┴────────────┘
Then we can do another groupby user where we compute the index of the maximum book_count and use that index to take the correct book.
The whole query looks like this:
df = pl.DataFrame({
    'book': ['bookA', 'bookA', 'bookB', 'bookA', 'bookA', 'bookC', 'bookC', 'bookC'],
    'transaction_id': [1, 2, 3, 4, 5, 6, 7, 8],
    'user': ['bob', 'bob', 'bob', 'tim', 'lucy', 'lucy', 'lucy', 'lucy']
})
(df.groupby(["user", "book"])
    .agg([
        pl.col("book").count()
    ])
    .groupby("user")
    .agg([
        pl.col("book").take(pl.col("book_count").arg_max()).alias("fav_book")
    ])
)
And creates this output:
shape: (3, 2)
┌──────┬──────────┐
│ user ┆ fav_book │
│ --- ┆ --- │
│ str ┆ str │
╞══════╪══════════╡
│ tim ┆ bookA │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ bob ┆ bookA │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ lucy ┆ bookC │
└──────┴──────────┘
Approach 2
Another approach would be creating a book_count column with a window expression and then using the index of the maximum to take the correct book in the aggregation:
(df
    .with_column(pl.count("book").over(["user", "book"]).alias("book_count"))
    .groupby("user")
    .agg([
        pl.col("book").take(pl.col("book_count").arg_max())
    ])
)
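Since the question asks about rank specifically, the SQL dense_rank idea can also be translated with window expressions: rank the per-book counts within each user and keep rank 1. This is only a sketch, assuming Expr.rank(method="dense", reverse=True) and DataFrame.unique are available in your version; ties for the top count would keep more than one book per user:
(df
    .with_column(pl.count("book").over(["user", "book"]).alias("book_count"))
    .filter(pl.col("book_count").rank(method="dense", reverse=True).over("user") == 1)
    .select(["user", pl.col("book").alias("fav_book")])
    .unique()
)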

How to get row_count for a group in polars?

The usage might look like the code below:
out_df = df.select([
    pl.col("*"),
    pl.col("md5").row_count().over("md5").alias("row_count"),
])
print(out_df)
The data should be like this:
before:
md5
a
a
b
after:
md5 row_count
a 1
a 2
b 1
Maybe I'm misunderstanding, as your output has both values 1 and 2 for a. Assuming you meant 2 for both:
You are very close, Polars has .count():
import polars as pl
df = pl.DataFrame({"md5": ["a", "a", "b"]})
out_df = df.select([
    pl.col("*"),
    pl.col("md5").count().over("md5").alias("row_count"),
])
print(out_df)
Which prints out this:
shape: (3, 2)
┌─────┬───────────┐
│ md5 ┆ row_count │
│ --- ┆ --- │
│ str ┆ u32 │
╞═════╪═══════════╡
│ a ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ a ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ b ┆ 1 │
└─────┴───────────┘
If I understand correctly, you want a running count per seen value in the group.
You can do this:
df = pl.DataFrame({"md5": ["a", "a", "b"]})
(df
    .with_column(pl.lit(1).alias("ones"))
    .select([
        pl.all().exclude("ones"),
        pl.col("ones").cumsum().over("md5").flatten().alias("row_count")
    ]))
shape: (3, 2)
┌─────┬───────────┐
│ md5 ┆ row_count │
│ --- ┆ --- │
│ str ┆ i32 │
╞═════╪═══════════╡
│ a ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ a ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ b ┆ 1 │
└─────┴───────────┘
We still have to add a dummy column "ones", because (as of polars==0.10.23) we cannot apply a window function over literals. We will add this functionality.
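If your version has Expr.cumcount, the dummy column can be avoided entirely; cumcount is zero-based, so add 1. A sketch (the behaviour of cumulative expressions over windows has changed across versions, so older releases may still need .flatten()):
df.select([
    pl.all(),
    (pl.col("md5").cumcount().over("md5") + 1).alias("row_count"),
])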