Divide one DataFrame by another (element by element) - python-polars

In pandas, I can set several named columns as an index and then divide one DataFrame by another element-wise, like this:
import pandas as pd

df_1 = pd.DataFrame({
    'Name': ['a', 'a', 'a', 'b', 'b', 'c'],
    'Name_2': ['first', 'second', 'third', 'first', 'second', 'first'],
    'Value': [20, 40, 50, 100, 150, 400]
})
df_2 = pd.DataFrame({
    'Name': ['a', 'a', 'a', 'b', 'b', 'c'],
    'Name_2': ['first', 'second', 'third', 'first', 'second', 'first'],
    'Value': [10, 20, 25, 50, 75, 200]
})
df_1 = df_1.set_index(['Name', 'Name_2'])
df_2 = df_2.set_index(['Name', 'Name_2'])
df_1 / df_2
How can something like this be implemented in python-polars?
I can't find an answer to this question in the documentation.

You just use a join, then do the math on the appropriate column(s).
import polars as pl

df_1 = pl.DataFrame({
    'Name': ['a', 'a', 'a', 'b', 'b', 'c'],
    'Name_2': ['first', 'second', 'third', 'first', 'second', 'first'],
    'Value': [20, 40, 50, 100, 150, 400]
})
df_2 = pl.DataFrame({
    'Name': ['a', 'a', 'a', 'b', 'b', 'c'],
    'Name_2': ['first', 'second', 'third', 'first', 'second', 'first'],
    'Value': [10, 20, 25, 50, 75, 200]
})
That's the setup. The solution is:
df_1.join(df_2, on=['Name', 'Name_2']) \
    .select(['Name', 'Name_2', pl.col('Value') / pl.col('Value_right')])
If you have a bunch of value columns and different index columns, you can do something like:
myindxcols = ['Name', 'Name_2']
myvalcols = [x for x in df_1.columns if x in df_2.columns and x not in myindxcols]
df_1.join(df_2, on=myindxcols) \
    .select(myindxcols + [pl.col(x) / pl.col(f"{x}_right") for x in myvalcols])
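For example, a minimal runnable sketch with a hypothetical second value column "Value_2" (names and data invented for illustration):
import polars as pl

df_1 = pl.DataFrame({
    'Name': ['a', 'b'],
    'Name_2': ['first', 'first'],
    'Value': [20, 100],
    'Value_2': [8, 10],   # hypothetical extra value column
})
df_2 = pl.DataFrame({
    'Name': ['a', 'b'],
    'Name_2': ['first', 'first'],
    'Value': [10, 50],
    'Value_2': [2, 5],
})

myindxcols = ['Name', 'Name_2']
# every column present in both frames that is not a key column
myvalcols = [x for x in df_1.columns if x in df_2.columns and x not in myindxcols]

result = (df_1.join(df_2, on=myindxcols)
    .select(myindxcols + [pl.col(x) / pl.col(f"{x}_right") for x in myvalcols]))
print(result)  # Value -> 2.0, 2.0; Value_2 -> 4.0, 2.0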

@Dean MacGregor beat me to it. Please accept his answer.
df_1 = pl.DataFrame({
    "Name": ["a", "a", "a", "b", "b", "c"],
    "Name_2": ["first", "second", "third", "first", "second", "first"],
    "Value": [20, 40, 50, 100, 150, 400]
})
df_2 = pl.DataFrame({
    "Name": ["a", "a", "a", "b", "b", "c"],
    "Name_2": ["first", "second", "third", "first", "second", "first"],
    "Value": [10, 20, 25, 50, 75, 200]
})
keys = ["Name", "Name_2"]
(df_1
    .join(df_2, on=keys, suffix="_right")
    .select([
        *keys,
        pl.col("Value") / pl.col("Value_right")
    ])
)
shape: (6, 3)
┌──────┬────────┬───────┐
│ Name ┆ Name_2 ┆ Value │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 │
╞══════╪════════╪═══════╡
│ a ┆ first ┆ 2.0 │
│ a ┆ second ┆ 2.0 │
│ a ┆ third ┆ 2.0 │
│ b ┆ first ┆ 2.0 │
│ b ┆ second ┆ 2.0 │
│ c ┆ first ┆ 2.0 │
└──────┴────────┴───────┘
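The same join-and-divide pattern also works lazily, if that fits your pipeline; a minimal sketch reusing df_1, df_2, and keys from above:
result = (
    df_1.lazy()
    .join(df_2.lazy(), on=keys, suffix="_right")
    .select([*keys, pl.col("Value") / pl.col("Value_right")])
    .collect()  # execute the lazy query
)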

Related

Python-Polars: How to filter categorical column with string list

I have a Polars dataframe like below:
df_cat = pl.DataFrame(
    [
        pl.Series("a_cat", ["c", "a", "b", "c", "b"], dtype=pl.Categorical),
        pl.Series("b_cat", ["F", "G", "E", "G", "G"], dtype=pl.Categorical)
    ])
print(df_cat)
shape: (5, 2)
┌───────┬───────┐
│ a_cat ┆ b_cat │
│ --- ┆ --- │
│ cat ┆ cat │
╞═══════╪═══════╡
│ c ┆ F │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ a ┆ G │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b ┆ E │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ c ┆ G │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b ┆ G │
└───────┴───────┘
The following filter runs perfectly fine:
print(df_cat.filter(pl.col('a_cat') == 'c'))
shape: (2, 2)
┌───────┬───────┐
│ a_cat ┆ b_cat │
│ --- ┆ --- │
│ cat ┆ cat │
╞═══════╪═══════╡
│ c ┆ F │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ c ┆ G │
└───────┴───────┘
What I want is to use a list of strings to run the filter more efficiently. So I tried it and ended up with the following error message:
print(df_cat.filter(pl.col('a_cat').is_in(['a', 'c'])))
---------------------------------------------------------------------------
ComputeError Traceback (most recent call last)
d:\GitRepo\Test2\stockEMD3.ipynb Cell 9 in <cell line: 1>()
----> 1 print(df_cat.filter(pl.col('a_cat').is_in(['c'])))
File c:\ProgramData\Anaconda3\envs\charm3.9\lib\site-packages\polars\internals\dataframe\frame.py:2185, in DataFrame.filter(self, predicate)
2181 if _NUMPY_AVAILABLE and isinstance(predicate, np.ndarray):
2182 predicate = pli.Series(predicate)
2184 return (
-> 2185 self.lazy()
2186 .filter(predicate) # type: ignore[arg-type]
2187 .collect(no_optimization=True, string_cache=False)
2188 )
File c:\ProgramData\Anaconda3\envs\charm3.9\lib\site-packages\polars\internals\lazyframe\frame.py:660, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, string_cache, no_optimization, slice_pushdown)
650 projection_pushdown = False
652 ldf = self._ldf.optimization_toggle(
653 type_coercion,
654 predicate_pushdown,
(...)
658 slice_pushdown,
659 )
--> 660 return pli.wrap_df(ldf.collect())
ComputeError: joins/or comparisons on categorical dtypes can only happen if they are created under the same global string cache
From this Stack Overflow link I understand "You need to set a global string cache to compare categoricals created in different columns/lists", but my questions are:
Why does the == single-string filter work?
What is the proper way to filter a categorical column with a list of strings?
Thanks!
Actually, you don't need to set a global string cache to compare strings to Categorical variables. You can use cast to accomplish this.
Let's use this data. I've included the integer values that underlie the Categorical variables to demonstrate something later.
import polars as pl

df_cat = (
    pl.DataFrame(
        [
            pl.Series("a_cat", ["c", "a", "b", "c", "X"], dtype=pl.Categorical),
            pl.Series("b_cat", ["F", "G", "E", "S", "X"], dtype=pl.Categorical),
        ]
    )
    .with_column(
        pl.all().to_physical().suffix('_phys')
    )
)
df_cat
shape: (5, 4)
┌───────┬───────┬────────────┬────────────┐
│ a_cat ┆ b_cat ┆ a_cat_phys ┆ b_cat_phys │
│ --- ┆ --- ┆ --- ┆ --- │
│ cat ┆ cat ┆ u32 ┆ u32 │
╞═══════╪═══════╪════════════╪════════════╡
│ c ┆ F ┆ 0 ┆ 0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ a ┆ G ┆ 1 ┆ 1 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ b ┆ E ┆ 2 ┆ 2 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ c ┆ S ┆ 0 ┆ 3 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ X ┆ X ┆ 3 ┆ 4 │
└───────┴───────┴────────────┴────────────┘
Comparing a categorical variable to a string
If we cast a Categorical variable back to its string values, we can make any comparison we need. For example:
df_cat.filter(pl.col('a_cat').cast(pl.Utf8).is_in(['a', 'c']))
shape: (3, 4)
┌───────┬───────┬────────────┬────────────┐
│ a_cat ┆ b_cat ┆ a_cat_phys ┆ b_cat_phys │
│ --- ┆ --- ┆ --- ┆ --- │
│ cat ┆ cat ┆ u32 ┆ u32 │
╞═══════╪═══════╪════════════╪════════════╡
│ c ┆ F ┆ 0 ┆ 0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ a ┆ G ┆ 1 ┆ 1 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ c ┆ S ┆ 0 ┆ 3 │
└───────┴───────┴────────────┴────────────┘
Or, in a filter step, we can compare the string values of two Categorical variables that do not share the same string cache:
df_cat.filter(pl.col('a_cat').cast(pl.Utf8) == pl.col('b_cat').cast(pl.Utf8))
shape: (1, 4)
┌───────┬───────┬────────────┬────────────┐
│ a_cat ┆ b_cat ┆ a_cat_phys ┆ b_cat_phys │
│ --- ┆ --- ┆ --- ┆ --- │
│ cat ┆ cat ┆ u32 ┆ u32 │
╞═══════╪═══════╪════════════╪════════════╡
│ X ┆ X ┆ 3 ┆ 4 │
└───────┴───────┴────────────┴────────────┘
Notice that it is the string values being compared (not the integers underlying the two Categorical variables).
The equality operator on Categorical variables
The following statements are equivalent:
df_cat.filter((pl.col('a_cat') == 'a'))
df_cat.filter((pl.col('a_cat').cast(pl.Utf8) == 'a'))
The former is syntactic sugar for the latter, as the former is a common use case.
As the error states: ComputeError: joins/or comparisons on categorical dtypes can only happen if they are created under the same global string cache.
Comparisons of categorical values are only allowed under a global string cache. You really want to set this in such a case as it speeds up comparisons and prevents expensive casts to strings.
Setting this at the start of your query will ensure it runs:
import polars as pl
pl.Config.set_global_string_cache()
This is a new answer based on the one from @ritchie46.
As of Polars 0.15.15 it is now:
import polars as pl
pl.toggle_string_cache(True)
A StringCache() context manager can also be used; see the polars documentation:
with pl.StringCache():
    print(df_cat.filter(pl.col('a_cat').is_in(['a', 'c'])))
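For completeness, a minimal sketch that creates and filters the frame under one cache, so the Categorical columns and the filter list share the same mapping (columns reused from the question):
import polars as pl

with pl.StringCache():
    # Categoricals built inside the cache share one string-to-id mapping
    df_cat = pl.DataFrame([
        pl.Series("a_cat", ["c", "a", "b", "c", "b"], dtype=pl.Categorical),
        pl.Series("b_cat", ["F", "G", "E", "G", "G"], dtype=pl.Categorical),
    ])
    # is_in now works because the string list is cast under the same cache
    print(df_cat.filter(pl.col("a_cat").is_in(["a", "c"])))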

Polars: how to add a column in front?

What would be the most idiomatic (and efficient) way to add a column in front of a polars DataFrame? The same thing as .with_column, but adding it at index 0?
You can select in the order you want your new DataFrame.
df = pl.DataFrame({
    "a": [1, 2, 3],
    "b": [True, None, False]
})
df.select([
    pl.lit("foo").alias("z"),
    pl.all()
])
shape: (3, 3)
┌─────┬─────┬───────┐
│ z ┆ a ┆ b │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ bool │
╞═════╪═════╪═══════╡
│ foo ┆ 1 ┆ true │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ foo ┆ 2 ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ foo ┆ 3 ┆ false │
└─────┴─────┴───────┘
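If the column already exists as a Series, DataFrame.insert_at_idx (available in the polars versions of this era) is another option; a sketch:
import polars as pl

df = pl.DataFrame({
    "a": [1, 2, 3],
    "b": [True, None, False]
})
# Insert a ready-made Series at column index 0 (in place)
df.insert_at_idx(0, pl.Series("z", ["foo", "foo", "foo"]))
print(df)  # "z" is now the first column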

Polars: How to reorder columns in a specific order?

I cannot find how to reorder columns in a polars dataframe in the polars DataFrame docs.
Thanks!
Using the select method is the recommended way to reorder columns in polars.
Example:
Input:
df
┌──────┬──────┬──────┐
│ Col1 ┆ Col2 ┆ Col3 │
│ ---  ┆ ---  ┆ ---  │
│ str  ┆ str  ┆ str  │
╞══════╪══════╪══════╡
│ a    ┆ x    ┆ p    │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ b    ┆ y    ┆ q    │
└──────┴──────┴──────┘
Output:
df.select(['Col3', 'Col2', 'Col1'])
or
df.select([pl.col('Col3'), pl.col('Col2'), pl.col('Col1')])
┌──────┬──────┬──────┐
│ Col3 ┆ Col2 ┆ Col1 │
│ ---  ┆ ---  ┆ ---  │
│ str  ┆ str  ┆ str  │
╞══════╪══════╪══════╡
│ p    ┆ x    ┆ a    │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ q    ┆ y    ┆ b    │
└──────┴──────┴──────┘
Note:
While df[['Col3', 'Col2', 'Col1']] gives the same result (version 0.14), it is recommended (link) that you use the select method instead.
We strongly recommend selecting data with expressions for almost all
use cases. Square bracket indexing is perhaps useful when doing
exploratory data analysis in a terminal or notebook when you just want
a quick look at a subset of data.
For all other use cases we recommend using expressions because:
expressions can be parallelized
the expression approach can be used in lazy and eager mode while the indexing approach can only be used in eager mode
in lazy mode the query optimizer can optimize expressions
Turns out it is the same as pandas:
df = df[['PRODUCT', 'PROGRAM', 'MFG_AREA', 'VERSION', 'RELEASE_DATE', 'FLOW_SUMMARY', 'TESTSUITE', 'MODULE', 'BASECLASS', 'SUBCLASS', 'Empty', 'Color', 'BINNING', 'BYPASS', 'Status', 'Legend']]
That seems like a special case of projection to me.
df = pl.DataFrame({
    "c": [1, 2],
    "a": ["a", "b"],
    "b": [True, False]
})
df.select(sorted(df.columns))
shape: (2, 3)
┌─────┬───────┬─────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ str ┆ bool ┆ i64 │
╞═════╪═══════╪═════╡
│ a ┆ true ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ b ┆ false ┆ 2 │
└─────┴───────┴─────┘
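A related sketch: to move a single column (here "b", just as an example) to the front while leaving the rest in their existing order:
df.select(["b"] + [col for col in df.columns if col != "b"])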

in polars, how could i use rank() to get most popular category per user

Let's say I have a csv
transaction_id,user,book
1,bob,bookA
2,bob,bookA
3,bob,bookB
4,tim,bookA
5,lucy,bookA
6,lucy,bookC
7,lucy,bookC
8,lucy,bookC
Per user, I want to find the book they have shown the most preference towards. For example, the output should be:
shape: (3, 2)
┌──────┬──────────┐
│ user ┆ fav_book │
│ --- ┆ --- │
│ str ┆ str │
╞══════╪══════════╡
│ bob ┆ bookA │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ tim ┆ bookA │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ lucy ┆ bookC │
└──────┴──────────┘
Now I've worked out how to do it like so:
import polars as pl

df = pl.read_csv("book_aggs.csv")
print(df)

df2 = df.groupby(["user", "book"]).agg([
    pl.col("book").count(),
    pl.col("transaction_id")  # just so we can double check where it all came from - TODO: how to output this to csv?
])
print(df2)

df3 = df2.sort(["user", "book_count"], reverse=True).groupby("user").agg([
    pl.col("book").first().alias("fav_book")
])
print(df3)
But really the normal SQL way of doing it is a dense_rank sorted by book count descending, where rank = 1. I have tried for hours to get this to work but I can't find a relevant example in the docs.
The issue is that, in the docs, none of the agg examples reference the output of another agg; in this case it needs to reference the count of each book per user, then sort those counts descending and rank based on that sort order.
Please provide an example that explains how to use rank to perform this task, and also how to nest aggregations efficiently.
Approach 1
We could first groupby 'user' and 'book' to get all user -> book combinations and count how often each occurs.
This would give this intermediate DataFrame:
shape: (5, 3)
┌──────┬───────┬────────────┐
│ user ┆ book ┆ book_count │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ u32 │
╞══════╪═══════╪════════════╡
│ lucy ┆ bookC ┆ 3 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ lucy ┆ bookA ┆ 1 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ bob ┆ bookB ┆ 1 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ tim ┆ bookA ┆ 1 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ bob ┆ bookA ┆ 2 │
└──────┴───────┴────────────┘
Then we can do another groupby user where we compute the index of the maximum book_count and use that index to take the correct book.
The whole query looks like this:
df = pl.DataFrame({
    'book': ['bookA', 'bookA', 'bookB', 'bookA', 'bookA', 'bookC', 'bookC', 'bookC'],
    'transaction_id': [1, 2, 3, 4, 5, 6, 7, 8],
    'user': ['bob', 'bob', 'bob', 'tim', 'lucy', 'lucy', 'lucy', 'lucy']
})
(df.groupby(["user", "book"])
    .agg([
        pl.col("book").count()
    ])
    .groupby("user")
    .agg([
        pl.col("book").take(pl.col("book_count").arg_max()).alias("fav_book")
    ])
)
And creates this output:
shape: (3, 2)
┌──────┬──────────┐
│ user ┆ fav_book │
│ --- ┆ --- │
│ str ┆ str │
╞══════╪══════════╡
│ tim ┆ bookA │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ bob ┆ bookA │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ lucy ┆ bookC │
└──────┴──────────┘
Approach 2
Another approach would be creating a book_count column with a window expression and then using the index of the maximum to take the correct book in the aggregation:
(df
    .with_column(pl.count("book").over(["user", "book"]).alias("book_count"))
    .groupby("user")
    .agg([
        pl.col("book").take(pl.col("book_count").arg_max())
    ])
)
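Since the question specifically asks about rank, here is a sketch of the SQL-style dense_rank approach, assuming your polars version supports rank(..., reverse=True) as a window expression. Note that ties for the top count are all kept, just as with rank = 1 in SQL:
(df.groupby(["user", "book"])
    .agg([pl.count().alias("book_count")])
    # dense rank of the counts, descending, per user; keep only rank 1
    .filter(pl.col("book_count").rank("dense", reverse=True).over("user") == 1)
    .select(["user", pl.col("book").alias("fav_book")])
)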

How to get row_count for a group in polars?

The usage might look like the code below:
out_df = df.select([
    pl.col("*"),
    pl.col("md5").row_count().over("md5").alias("row_count"),
])
print(out_df)
The data should be like this:
before:
md5
a
a
b
after:
md5 row_count
a 1
a 2
b 1
Maybe I'm misunderstanding, as your output has both values 1 and 2 for a. Assuming you meant 2 for both:
You are very close, Polars has .count():
import polars as pl
df = pl.DataFrame({"md5": ["a", "a", "b"]})
out_df = df.select([
    pl.col("*"),
    pl.col("md5").count().over("md5").alias("row_count"),
])
print(out_df)
Which prints out this:
shape: (3, 2)
┌─────┬───────────┐
│ md5 ┆ row_count │
│ --- ┆ --- │
│ str ┆ u32 │
╞═════╪═══════════╡
│ a ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ a ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ b ┆ 1 │
└─────┴───────────┘
If I understand correctly, you want a running count per value seen in the group.
You can do this:
df = pl.DataFrame({"md5": ["a", "a", "b"]})
(df
    .with_column(pl.lit(1).alias("ones"))
    .select([
        pl.all().exclude("ones"),
        pl.col("ones").cumsum().over("md5").flatten().alias("row_count")
    ]))
shape: (3, 2)
┌─────┬───────────┐
│ md5 ┆ row_count │
│ --- ┆ --- │
│ str ┆ i32 │
╞═════╪═══════════╡
│ a ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ a ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ b ┆ 1 │
└─────┴───────────┘
We still have to add a dummy column "ones", because (as of polars==0.10.23) we cannot apply a window function over literals. We will add this functionality.
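In later versions, a cumulative count can replace the dummy column entirely; a sketch, assuming Expr.cumcount is available in your polars version:
(df
    .select([
        pl.all(),
        # 0-based running count within each "md5" group, shifted to start at 1
        (pl.col("md5").cumcount().over("md5") + 1).alias("row_count"),
    ]))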