In polars, how could I use rank() to get the most popular category per user - python-polars

Let's say I have a CSV:
transaction_id,user,book
1,bob,bookA
2,bob,bookA
3,bob,bookB
4,tim,bookA
5,lucy,bookA
6,lucy,bookC
7,lucy,bookC
8,lucy,bookC
Per user, I want to find the book they have shown the most preference towards. For example, the output should be:
shape: (3, 2)
┌──────┬──────────┐
│ user ┆ fav_book │
│ ---  ┆ ---      │
│ str  ┆ str      │
╞══════╪══════════╡
│ bob  ┆ bookA    │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ tim  ┆ bookA    │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ lucy ┆ bookC    │
└──────┴──────────┘
Now I've worked out how to do it like so:
import polars as pl
df = pl.read_csv("book_aggs.csv")
print(df)
df2 = df.groupby(["user", "book"]).agg([
    pl.col("book").count(),
    pl.col("transaction_id")  # just so we can double check where it all came from - TODO: how to output this to csv?
])
print(df2)
df3 = df2.sort(["user", "book_count"], reverse=True).groupby("user").agg([
    pl.col("book").first().alias("fav_book")
])
print(df3)
But really the normal SQL way of doing it is a dense_rank over the book counts, sorted descending, filtered to rank = 1. I have tried for hours to get this to work, but I can't find a relevant example in the docs.
The issue is that in the docs, none of the agg examples reference the output of another agg; in this case it needs to reference the count of each book per user, then sort those counts descending, and then rank based on that sort order.
Please provide an example that explains how to use rank to perform this task, and also how to nest aggregations efficiently.

Approach 1
We could first groupby "user" and "book" to get all user -> book combinations and count how often each occurs.
This would give this intermediate DataFrame:
shape: (5, 3)
┌──────┬───────┬────────────┐
│ user ┆ book  ┆ book_count │
│ ---  ┆ ---   ┆ ---        │
│ str  ┆ str   ┆ u32        │
╞══════╪═══════╪════════════╡
│ lucy ┆ bookC ┆ 3          │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ lucy ┆ bookA ┆ 1          │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ bob  ┆ bookB ┆ 1          │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ tim  ┆ bookA ┆ 1          │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ bob  ┆ bookA ┆ 2          │
└──────┴───────┴────────────┘
Then we can do another groupby on "user", where we compute the index of the maximum book_count and use that index to take the correct book.
The whole query looks like this:
df = pl.DataFrame({
    "book": ["bookA", "bookA", "bookB", "bookA", "bookA", "bookC", "bookC", "bookC"],
    "transaction_id": [1, 2, 3, 4, 5, 6, 7, 8],
    "user": ["bob", "bob", "bob", "tim", "lucy", "lucy", "lucy", "lucy"],
})
(df.groupby(["user", "book"])
    .agg([
        pl.col("book").count()  # counting a group key yields a column named "book_count"
    ])
    .groupby("user")
    .agg([
        pl.col("book").take(pl.col("book_count").arg_max()).alias("fav_book")
    ])
)
And creates this output:
shape: (3, 2)
┌──────┬──────────┐
│ user ┆ fav_book │
│ ---  ┆ ---      │
│ str  ┆ str      │
╞══════╪══════════╡
│ tim  ┆ bookA    │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ bob  ┆ bookA    │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ lucy ┆ bookC    │
└──────┴──────────┘
Approach 2
Another approach would be creating a book_count column with a window expression, and then using the index of the maximum to take the correct book in the aggregation:
(df
    .with_column(pl.count("book").over(["user", "book"]).alias("book_count"))
    .groupby("user")
    .agg([
        pl.col("book").take(pl.col("book_count").arg_max())
    ])
)
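For completeness, here is the dense_rank-style query the question asks about, written with rank over a window. This is a sketch assuming a polars version where rank accepts a reverse flag (newer releases call it descending); note that, unlike arg_max, it keeps every book tied for first place:
(df.groupby(["user", "book"])
    .agg([pl.col("book").count().alias("book_count")])
    .filter(
        # dense rank of the counts within each user, largest count first;
        # rank 1 marks that user's most-bought book(s)
        pl.col("book_count").rank("dense", reverse=True).over("user") == 1
    )
    .select([pl.col("user"), pl.col("book").alias("fav_book")])
)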

Related

Sort polars DataFrame using column with text and numericals

If I have a DataFrame like
┌────────┬──────────────────────┐
│ Name   ┆ Value                │
│ ---    ┆ ---                  │
│ str    ┆ list[str]            │
╞════════╪══════════════════════╡
│ No. 1  ┆ ["None", "!!!"]      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ No. 10 ┆ ["0.3", "OK"]        │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ No. 2  ┆ ["1.1", "OK"]        │
└────────┴──────────────────────┘
How can I sort it by numerical value? I.e., I want to pull the string from the Name column and extract only the numerical elements when sorting, giving:
┌────────┬──────────────────────┐
│ Name   ┆ Value                │
│ ---    ┆ ---                  │
│ str    ┆ list[str]            │
╞════════╪══════════════════════╡
│ No. 1  ┆ ["None", "!!!"]      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ No. 2  ┆ ["1.1", "OK"]        │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ No. 10 ┆ ["0.3", "OK"]        │
└────────┴──────────────────────┘
I can't see the polars expression needed, and I'm not sure you can pass a custom Python function.
Thanks
You can use str.extract to get the number from the string, using a regular expression.
Then cast it to an integer and sort:
pl.DataFrame({"Name": ["No. 1", "No. 12", "No. 2"]}).sort(
pl.col("Name").str.extract(r"No\. ([0-9]*)", 1).cast(int)
)
Also, if you want to sort by the numbers in the Value list:
df.sort(
    pl.col("Value").arr.get(0).cast(pl.Float32, strict=False),
    nulls_last=False
)

python-polars create new column by dividing by two existing columns

In pandas, the following creates a new column in a dataframe by dividing two existing columns. How do I do this in polars? Bonus if done in the fastest way using polars.LazyFrame.
df = pd.DataFrame({"col1":[10,20,30,40,50], "col2":[5,2,10,10,25]})
df["ans"] = df["col1"]/df["col2"]
print(df)
You want to avoid Pandas-style coding and use the Polars Expressions API. Expressions are the heart of Polars and yield the best performance.
Here's how we would code this using Expressions, including using Lazy mode:
(
    df
    .lazy()
    .with_column(
        (pl.col('col1') / pl.col('col2')).alias('result')
    )
    .collect()
)
shape: (5, 3)
┌──────┬──────┬────────┐
│ col1 ┆ col2 ┆ result │
│ ---  ┆ ---  ┆ ---    │
│ i64  ┆ i64  ┆ f64    │
╞══════╪══════╪════════╡
│ 10   ┆ 5    ┆ 2.0    │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 20   ┆ 2    ┆ 10.0   │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 30   ┆ 10   ┆ 3.0    │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 40   ┆ 10   ┆ 4.0    │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 50   ┆ 25   ┆ 2.0    │
└──────┴──────┴────────┘
Here's a section of the User Guide that may help with transitioning from Pandas-style coding to using Polars Expressions.

In Python polars convert a json string column to dict for filtering

Hi, I have a dataframe with a column called tags which is a JSON string.
I want to filter this dataframe on the tags column so it only contains rows where a certain tag key is present, or where a tag has a particular value.
I guess I could do a string contains match, but I think it may be more robust to convert the JSON into a dict first and use has_key etc.?
What would be the recommended way to do this in python polars?
Thanks
Polars does not have a generic dictionary type. Instead, dictionaries are imported/mapped as structs. Each dictionary key is mapped to a struct 'field name', and the corresponding dictionary value becomes the value of this field.
However, there are some constraints for creating a Series of type struct. Two of them are:
all structs must have the same field names.
the field names must be listed in the same order.
In your description, you mention has_key, which indicates that the dictionaries will not have the same keys. As such, creating a column of struct from your dictionaries will not work. (For more information, you can see this Stack Overflow response.)
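For illustration, here is a minimal sketch (with made-up data) of how dictionaries that do satisfy both constraints map to a struct column:
import polars as pl

# Both dicts share the same field names in the same order,
# so polars can map them onto a single struct column.
df = pl.DataFrame({"tag": [
    {"key": "office", "value": "Seattle"},
    {"key": "position", "value": "developer"},
]})
print(df.schema)  # {'tag': Struct(...)} rather than a generic dict type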
json_path_match
I suggest using json_path_match, which extracts values based on some simple JSONPath syntax. Using JSONPath syntax, you should be able to query whether a key exists and retrieve its value. (For simple unnested dictionaries, these are the same query.)
For example, let's start with this data:
import polars as pl
json_list = [
    """{"name": "Maria",
        "position": "developer",
        "office": "Seattle"}""",
    """{"name": "Josh",
        "position": "analyst",
        "termination_date": "2020-01-01"}""",
    """{"name": "Jorge",
        "position": "architect",
        "office": "",
        "manager_st_dt": "2020-01-01"}""",
]
df = pl.DataFrame(
    {
        "tags": json_list,
    }
).with_row_count("id", 1)
df
df
shape: (3, 2)
┌─────┬────────────────────┐
│ id  ┆ tags               │
│ --- ┆ ---                │
│ u32 ┆ str                │
╞═════╪════════════════════╡
│ 1   ┆ {"name": "Maria",  │
│     ┆  "posit...         │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2   ┆ {"name": "Josh",   │
│     ┆  "positi...        │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3   ┆ {"name": "Jorge",  │
│     ┆  "posit...         │
└─────┴────────────────────┘
To query for values:
df.with_columns([
    pl.col('tags').str.json_path_match(r"$.name").alias('name'),
    pl.col('tags').str.json_path_match(r"$.office").alias('location'),
    pl.col('tags').str.json_path_match(r"$.manager_st_dt").alias('manager start date'),
])
shape: (3, 5)
┌─────┬────────────────────┬───────┬──────────┬────────────────────┐
│ id  ┆ tags               ┆ name  ┆ location ┆ manager start date │
│ --- ┆ ---                ┆ ---   ┆ ---      ┆ ---                │
│ u32 ┆ str                ┆ str   ┆ str      ┆ str                │
╞═════╪════════════════════╪═══════╪══════════╪════════════════════╡
│ 1   ┆ {"name": "Maria",  ┆ Maria ┆ Seattle  ┆ null               │
│     ┆  "posit...         ┆       ┆          ┆                    │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2   ┆ {"name": "Josh",   ┆ Josh  ┆ null     ┆ null               │
│     ┆  "positi...        ┆       ┆          ┆                    │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3   ┆ {"name": "Jorge",  ┆ Jorge ┆          ┆ 2020-01-01         │
│     ┆  "posit...         ┆       ┆          ┆                    │
└─────┴────────────────────┴───────┴──────────┴────────────────────┘
Notice the null values. This is the return value when a key is not found. We'll use this fact for the has_key functionality you mentioned.
Also, if we look at the "location" column, you'll see that json_path_match does distinguish between an empty string ("office": "") and a key not found.
To filter on the presence of a key, we simply keep the rows where the result is not null:
df.filter(
    pl.col('tags').str.json_path_match(r"$.manager_st_dt").is_not_null()
)
shape: (1, 2)
┌─────┬───────────────────┐
│ id  ┆ tags              │
│ --- ┆ ---               │
│ u32 ┆ str               │
╞═════╪═══════════════════╡
│ 3   ┆ {"name": "Jorge", │
│     ┆  "posit...        │
└─────┴───────────────────┘
The json_path_match will also work with nested structures. (See the Syntax page for details.)
One limitation, however: json_path_match will only return the first match for a query, rather than a list of matches. If your JSON strings are not lists or nested dictionaries, this won't be a problem.
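For example, a nested key can be reached with a dotted path. A hypothetical sketch, assuming a tags string that contains a nested address object:
df_nested = pl.DataFrame({
    "tags": ["""{"name": "Maria", "address": {"city": "Seattle"}}"""],
})
df_nested.with_columns([
    # "$.address.city" descends one level into the nested object
    pl.col("tags").str.json_path_match(r"$.address.city").alias("city"),
])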

Sum columns based on column names in a list for polars

So in python Polars, I can add one or more columns together to make a new column, using an expression something like:
frame.with_column((pl.col('colname1') + pl.col('colname2')).alias('new_colname'))
However, if I have all the column names in a list, is there a way to sum all the columns in that list and create a new column based on the result?
Thanks
The sum expression supports horizontal summing. From the docs:
List[Expr] -> aggregate the sum value horizontally.
Sample code for ref,
import polars as pl
df = pl.DataFrame({"a": [1, 2, 3], "b": [1, 2, None]})
print(df)
This results in something like,
shape: (3, 2)
┌─────┬──────┐
│ a   ┆ b    │
│ --- ┆ ---  │
│ i64 ┆ i64  │
╞═════╪══════╡
│ 1   ┆ 1    │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2   ┆ 2    │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 3   ┆ null │
└─────┴──────┘
On this you can do something like,
cols = ["a", "b"]
df2 = df.select(pl.sum([pl.col(i) for i in cols]).alias('new_colname'))
print(df2)
Which will result in,
shape: (3, 1)
┌─────────────┐
│ new_colname │
│ ---         │
│ i64         │
╞═════════════╡
│ 2           │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4           │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ null        │
└─────────────┘
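As an alternative, the same horizontal sum can be spelled with pl.fold, which folds the column expressions into an accumulator. A sketch producing the same (null-propagating) result:
cols = ["a", "b"]
df.select(
    pl.fold(
        acc=pl.lit(0),             # starting value of the accumulator
        f=lambda acc, x: acc + x,  # combine accumulator with the next column
        exprs=[pl.col(c) for c in cols],
    ).alias("new_colname")
)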

How to get row_count for a group in polars?

The usage might seem like the code below:
out_df = df.select([
    pl.col("*"),
    pl.col("md5").row_count().over("md5").alias("row_count"),
])
print(out_df)
The data should be like this:
before:
md5
a
a
b
after:
md5  row_count
a    1
a    2
b    1
Maybe I'm misunderstanding, as your output has both values 1 and 2 for a. Assuming you meant 2 for both:
You are very close; Polars has .count():
import polars as pl
df = pl.DataFrame({"md5": ["a", "a", "b"]})
out_df = df.select([
    pl.col("*"),
    pl.col("md5").count().over("md5").alias("row_count"),
])
print(out_df)
Which prints out this:
shape: (3, 2)
┌─────┬───────────┐
│ md5 ┆ row_count │
│ --- ┆ ---       │
│ str ┆ u32       │
╞═════╪═══════════╡
│ a   ┆ 2         │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ a   ┆ 2         │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ b   ┆ 1         │
└─────┴───────────┘
If I understand correctly, you want a running count per seen value in the group.
You can do this:
df = pl.DataFrame({"md5": ["a", "a", "b"]})
(df
    .with_column(pl.lit(1).alias("ones"))
    .select([
        pl.all().exclude("ones"),
        pl.col("ones").cumsum().over("md5").flatten().alias("row_count")
    ]))
shape: (3, 2)
┌─────┬───────────┐
│ md5 ┆ row_count │
│ --- ┆ ---       │
│ str ┆ i32       │
╞═════╪═══════════╡
│ a   ┆ 1         │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ a   ┆ 2         │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ b   ┆ 1         │
└─────┴───────────┘
We still have to add a dummy column "ones", because (as of polars==0.10.23) we cannot apply a window function over literals. We will add this functionality.
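In later polars versions the dummy column should not be necessary: a cumulative count over the window gives the running count directly. A sketch, assuming a version where Expr.cumcount is available (it counts from 0, hence the + 1):
(df
    .with_column(
        (pl.col("md5").cumcount().over("md5") + 1).alias("row_count")
    ))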