Is there an equivalent to Power BI desktop measure definitions in python-polars? - python-polars

I want to use python-polars to replace the Power BI desktop tools. The data model can be a multi-column dataset including all tabular model columns, but how can I use a dynamic measure definition in python-polars? For example:
sum('amount') filter ('Country' = 'USA')
I want to define this measure in configuration files.

(disclaimer: I am not familiar with Power BI, so I am answering this based on the remainder of your question).
Let's say the data is
import polars as pl
df = pl.DataFrame({"Country": ["USA", "USA", "Japan"],
                   "amount": [100, 200, 400]})
shape: (3, 2)
┌─────────┬────────┐
│ Country ┆ amount │
│ --- ┆ --- │
│ str ┆ i64 │
╞═════════╪════════╡
│ USA ┆ 100 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ USA ┆ 200 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ Japan ┆ 400 │
└─────────┴────────┘
You could define your expression in the configuration file as follows:
expr = pl.col('amount').filter(pl.col('Country')=='USA').sum()
and in your code base you could do
from <expression file above> import expr
df.select(expr)
shape: (1, 1)
┌────────┐
│ amount │
│ --- │
│ i64 │
╞════════╡
│ 300 │
└────────┘
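For instance, the configuration file could simply be a Python module holding several named measures. Here is a minimal sketch (the module name measures.py and the measure names are illustrative, not from the question):
# measures.py - hypothetical configuration module of named measures
import polars as pl

MEASURES = {
    "usa_amount": pl.col("amount").filter(pl.col("Country") == "USA").sum(),
    "total_amount": pl.col("amount").sum(),
}
Your code base would then select whichever measure it needs:
from measures import MEASURES
df.select(MEASURES["usa_amount"])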
If the configuration file should be plain text (I would strongly advise against this, because it is unsafe, not testable, and has no tooling support), you could store the following in the config file:
pl.col('amount').filter(pl.col('Country')=='USA').sum()
and read it into your Python code as a string, then use eval:
expr = <read config file as string>
df.select(eval(expr))
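A minimal sketch of that eval route, assuming the expression string lives in a file called measure.txt (a hypothetical name) and reusing the df from above:
import polars as pl

# measure.txt is assumed to contain a single line such as:
# pl.col('amount').filter(pl.col('Country')=='USA').sum()
with open("measure.txt") as f:
    expr_source = f.read().strip()

# eval turns the string into a polars expression; never do this with untrusted input
expr = eval(expr_source)
print(df.select(expr))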

Related

Sort polars DataFrame using column with text and numericals

If I have a DataFrame like
┌────────┬──────────────────────┐
│ Name ┆ Value │
│ --- ┆ --- │
│ str ┆ list[str] │
╞════════╪══════════════════════╡
│ No. 1 ┆ ["None", "!!!"] │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ No. 10 ┆ ["0.3", "OK"] │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ No. 2  ┆ ["1.1", "OK"]        │
└────────┴──────────────────────┘
How can I sort it by numerical value? That is, I want to pull the string from the Name column and use only its numerical part as the sort key, giving:
┌────────┬──────────────────────┐
│ Name ┆ Value │
│ --- ┆ --- │
│ str ┆ list[str] │
╞════════╪══════════════════════╡
│ No. 1 ┆ ["None", "!!!"] │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ No. 2 ┆ ["1.1", "OK"] │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ No. 10 ┆ ["0.3", "OK"]        │
└────────┴──────────────────────┘
I can't see the polars expression needed, and I'm not sure whether you can pass a custom Python function.
Thanks
You can use str.extract to get the number from the string using a regular expression, then cast it to int and sort:
pl.DataFrame({"Name": ["No. 1", "No. 12", "No. 2"]}).sort(
pl.col("Name").str.extract(r"No\. ([0-9]*)", 1).cast(int)
)
Also, if you want to sort by the numbers inside the Value list column:
df.sort(
    pl.col("Value").arr.get(0).cast(pl.Float32, strict=False),
    nulls_last=False
)

Can I use indexing [start:end] in expressions instead of offset and length in polars

In Expr.slice, the only supported syntax is expr.slice(offset, length), but in some cases something like expr[start:end] would be more convenient and consistent. So how do I write expr[1:6] for expr.slice(1, 5), just like in pandas?
You can add a __getitem__ method to polars.internals.expr.expr.Expr:
import polars as pl
from polars.internals.expr.expr import Expr

def expr_slice(self, subscript):
    if isinstance(subscript, slice):
        if subscript.start is not None:
            offset = subscript.start
            length = subscript.stop - offset + 1 if subscript.stop is not None else None
        else:
            offset = None
            length = subscript.stop
        return Expr.slice(self, offset, length)
    else:
        # a single integer index selects one element
        return Expr.slice(self, subscript, 1)

Expr.__getitem__ = expr_slice

df = pl.DataFrame({'a': range(10)})
print(df.select(pl.col('a')[2:5]))
print(df.select(pl.col('a')[3]))
shape: (4, 1)
┌─────┐
│ a │
│ --- │
│ i64 │
╞═════╡
│ 2 │
├╌╌╌╌╌┤
│ 3 │
├╌╌╌╌╌┤
│ 4 │
├╌╌╌╌╌┤
│ 5 │
└─────┘
shape: (1, 1)
┌─────┐
│ a │
│ --- │
│ i64 │
╞═════╡
│ 3 │
└─────┘
Note that my example doesn't take into account the logic for negative indexing; you can implement that yourself.
Also note that this implementation treats the stop index as inclusive (length = stop - offset + 1), unlike Python's usual half-open slicing; that is why expr[2:5] above returns four rows.

python-polars: create a new column by dividing two existing columns

In pandas, the following creates a new column in a dataframe by dividing two existing columns. How do I do this in polars? Bonus points if it's done in the fastest way using polars.LazyFrame.
import pandas as pd

df = pd.DataFrame({"col1": [10, 20, 30, 40, 50], "col2": [5, 2, 10, 10, 25]})
df["ans"] = df["col1"] / df["col2"]
print(df)
You want to avoid Pandas-style coding and use the Polars Expressions API instead. Expressions are the heart of Polars and yield the best performance.
Here's how we would code this using Expressions, including using Lazy mode:
(
    df
    .lazy()
    .with_column(
        (pl.col('col1') / pl.col('col2')).alias('result')
    )
    .collect()
)
shape: (5, 3)
┌──────┬──────┬────────┐
│ col1 ┆ col2 ┆ result │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ f64 │
╞══════╪══════╪════════╡
│ 10 ┆ 5 ┆ 2.0 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 20 ┆ 2 ┆ 10.0 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 30 ┆ 10 ┆ 3.0 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 40 ┆ 10 ┆ 4.0 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 50 ┆ 25 ┆ 2.0 │
└──────┴──────┴────────┘
Here's a section of the User Guide that may help you transition from Pandas-style coding to using Polars Expressions.
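If the data starts on disk, the whole pipeline can stay lazy from the file read onward, so the query optimizer sees every step. A minimal sketch (data.csv is a hypothetical file containing col1 and col2):
import polars as pl

result = (
    pl.scan_csv("data.csv")  # lazily scans the hypothetical input file
    .with_column((pl.col("col1") / pl.col("col2")).alias("result"))
    .collect()
)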

Sum columns based on column names in a list for polars

So in python Polars, I can add two or more columns together to make a new column, using an expression something like:
frame.with_column((pl.col('colname1') + pl.col('colname2')).alias('new_colname'))
However, if I have all the column names in a list, is there a way to sum all the columns in that list and create a new column based on the result?
Thanks
The sum expression supports horizontal summing. From the docs:
List[Expr] -> aggregate the sum value horizontally.
Sample code for reference:
import polars as pl
df = pl.DataFrame({"a": [1, 2, 3], "b": [1, 2, None]})
print(df)
This results in something like:
shape: (3, 2)
┌─────┬──────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪══════╡
│ 1 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2 ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 3 ┆ null │
└─────┴──────┘
On this, you can do something like:
cols = ["a", "b"]
df2 = df.select(pl.sum([pl.col(i) for i in cols]).alias('new_colname'))
print(df2)
This will result in:
shape: (3, 1)
┌─────────────┐
│ new_colname │
│ ---         │
│ i64         │
╞═════════════╡
│ 2           │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4           │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ null        │
└─────────────┘
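The same horizontal sum can also be written as a fold, which generalizes to other combining functions. A sketch using the same df and cols as above (pl.fold starts from the accumulator and applies f column by column):
df3 = df.select(
    pl.fold(acc=pl.lit(0), f=lambda acc, x: acc + x, exprs=[pl.col(c) for c in cols])
    .alias("new_colname")
)
print(df3)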

In polars, how could I use rank() to get the most popular category per user?

Let's say I have a CSV:
transaction_id,user,book
1,bob,bookA
2,bob,bookA
3,bob,bookB
4,tim,bookA
5,lucy,bookA
6,lucy,bookC
7,lucy,bookC
8,lucy,bookC
Per user, I want to find the book they have shown the most preference towards. For example, the output should be:
shape: (3, 2)
┌──────┬──────────┐
│ user ┆ fav_book │
│ --- ┆ --- │
│ str ┆ str │
╞══════╪══════════╡
│ bob ┆ bookA │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ tim ┆ bookA │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ lucy ┆ bookC │
└──────┴──────────┘
Now, I've worked out how to do it like so:
import polars as pl
df = pl.read_csv("book_aggs.csv")
print(df)
df2 = df.groupby(["user", "book"]).agg([
    pl.col("book").count(),
    pl.col("transaction_id")  # just so we can double-check where it all came from - TODO: how to output this to csv?
])
print(df2)
df3 = df2.sort(["user", "book_count"], reverse=True).groupby("user").agg([
    pl.col("book").first().alias("fav_book")
])
print(df3)
But really, the normal SQL way of doing it is a dense_rank over book count descending, keeping the rows where rank = 1. I have tried for hours to get this to work, but I can't find a relevant example in the docs.
The issue is that in the docs, none of the agg examples reference the output of another agg - in this case, it needs to reference the count of each book per user, then sort those counts descending and rank based on that sort order.
Please provide an example that explains how to use rank to perform this task, and also how to nest aggregations efficiently.
Approach 1
We could first groupby user and book to get all user -> book combinations and count how often each occurs.
This would give this intermediate DataFrame:
shape: (5, 3)
┌──────┬───────┬────────────┐
│ user ┆ book ┆ book_count │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ u32 │
╞══════╪═══════╪════════════╡
│ lucy ┆ bookC ┆ 3 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ lucy ┆ bookA ┆ 1 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ bob ┆ bookB ┆ 1 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ tim ┆ bookA ┆ 1 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ bob ┆ bookA ┆ 2 │
└──────┴───────┴────────────┘
Then we can do another groupby on user, where we compute the index of the maximum book_count and use that index to take the correct book.
The whole query looks like this:
df = pl.DataFrame({
    'book': ['bookA', 'bookA', 'bookB', 'bookA', 'bookA', 'bookC', 'bookC', 'bookC'],
    'transaction_id': [1, 2, 3, 4, 5, 6, 7, 8],
    'user': ['bob', 'bob', 'bob', 'tim', 'lucy', 'lucy', 'lucy', 'lucy'],
})
(df.groupby(["user", "book"])
   .agg([
       pl.col("book").count()
   ])
   .groupby("user")
   .agg([
       pl.col("book").take(pl.col("book_count").arg_max()).alias("fav_book")
   ])
)
This creates the following output:
shape: (3, 2)
┌──────┬──────────┐
│ user ┆ fav_book │
│ --- ┆ --- │
│ str ┆ str │
╞══════╪══════════╡
│ tim ┆ bookA │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ bob ┆ bookA │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ lucy ┆ bookC │
└──────┴──────────┘
Approach 2
Another approach would be to create a book_count column with a window expression, and then use the index of the maximum to take the correct book in the aggregation:
(df
 .with_column(pl.count("book").over(["user", "book"]).alias("book_count"))
 .groupby("user")
 .agg([
     pl.col("book").take(pl.col("book_count").arg_max())
 ])
)
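Since the question asked about rank specifically: below is a sketch of a dense_rank-style formulation on the same df. It is an alternative I am adding here, not part of the original answer; it ranks the per-user book counts descending and keeps the rows ranked first (ties would yield several books per user), then drops the duplicated rows the filter leaves behind:
(
    df
    .with_column(pl.count("book").over(["user", "book"]).alias("book_count"))
    # keep only the rows whose count ranks first within each user
    .filter(pl.col("book_count").rank("dense", reverse=True).over("user") == 1)
    .select([pl.col("user"), pl.col("book").alias("fav_book")])
    .unique()  # collapse repeated user/fav_book rows
)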