general question about polars memory management - python-polars

I have some general questions about memory management in Polars. It would be great if you could spend a few sentences on how it works, for example when memory is allocated and when it is reclaimed.
In particular, I would like to know how to release the memory held by part of a DataFrame. I want to do it in a way that is immediate and doesn't go through the Python garbage collection mechanism, if possible. Having to call gc.collect() immediately afterwards would be acceptable, but not preferable.

I don't really understand your question, but I'll have a go at it.
In python-polars, the deletion of a Series or a DataFrame is determined by Python's reference counting, just like for any other Python object.
Next, there is the fact that Polars' memory is also reference counted. So if we create a new DataFrame that reuses data from an already existing DataFrame/Series, that data is not copied; only a reference count is incremented.
For instance, in the example below we have 2 DataFrames totalling 4 columns, but only 3 columns in memory, because column "a" is shared between both DataFrames and will only get deleted once its reference count drops to 0.
The same principle applies to slicing a Series. A slice never copies data, but merely increments a reference count and updates an offset and length field (see the short sketch after the example).
import polars as pl

df_a = pl.DataFrame({
    "a": [1, 2, 3],
    "b": ["a", "b", "c"]
})
df_b = df_a.select(["a", pl.col("b") + "py"])
print(df_a)
print(df_b)
shape: (3, 2)
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═════╡
│ 1 ┆ a │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ b │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 3 ┆ c │
└─────┴─────┘
shape: (3, 2)
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═════╡
│ 1 ┆ apy │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ bpy │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 3 ┆ cpy │
└─────┴─────┘
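To illustrate the slicing point, here is a minimal sketch. The zero-copy behaviour happens inside Polars, so the code only shows the API, not the sharing itself:
s = pl.Series("a", [1, 2, 3, 4, 5])

# per the explanation above, slicing only updates an offset/length on the
# same underlying buffer; no data is copied by this operation
s_slice = s.slice(1, 3)
print(s_slice)  # [2, 3, 4]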

Related

python-polars create new column by dividing two existing columns

In Pandas, the following creates a new column in a dataframe by dividing one existing column by another. How do I do this in Polars? Bonus points if it's done in the fastest way, using polars.LazyFrame.
import pandas as pd

df = pd.DataFrame({"col1": [10, 20, 30, 40, 50], "col2": [5, 2, 10, 10, 25]})
df["ans"] = df["col1"] / df["col2"]
print(df)
You want to avoid Pandas-style coding and use the Polars Expressions API instead. Expressions are the heart of Polars and yield the best performance.
Here's how we would code this using Expressions, including using Lazy mode:
import polars as pl

df = pl.DataFrame({"col1": [10, 20, 30, 40, 50], "col2": [5, 2, 10, 10, 25]})

(
    df
    .lazy()
    .with_column(
        (pl.col('col1') / pl.col('col2')).alias('result')
    )
    .collect()
)
shape: (5, 3)
┌──────┬──────┬────────┐
│ col1 ┆ col2 ┆ result │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ f64 │
╞══════╪══════╪════════╡
│ 10 ┆ 5 ┆ 2.0 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 20 ┆ 2 ┆ 10.0 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 30 ┆ 10 ┆ 3.0 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 40 ┆ 10 ┆ 4.0 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 50 ┆ 25 ┆ 2.0 │
└──────┴──────┴────────┘
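If you don't need Lazy mode, a quick sketch of the eager equivalent uses the same expression without the lazy()/collect() pair:
# eager mode: same expression, evaluated immediately
df.with_column(
    (pl.col('col1') / pl.col('col2')).alias('result')
)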
Here's a section of the User Guide that may help with transitioning from Pandas-style coding to using Polars Expressions.

Sum columns based on column names in a list for polars

So, in Python Polars, I can add one or more columns together to make a new column by using an expression, something like:
frame.with_column((pl.col('colname1') + pl.col('colname2')).alias('new_colname'))
However, if I have all the column names in a list, is there a way to sum all the columns in that list and create a new column based on the result?
Thanks
The sum expression supports horizontal summing. From the docs:
List[Expr] -> aggregate the sum value horizontally.
Sample code for reference:
import polars as pl
df = pl.DataFrame({"a": [1, 2, 3], "b": [1, 2, None]})
print(df)
This results in something like,
shape: (3, 2)
┌─────┬──────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪══════╡
│ 1 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2 ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 3 ┆ null │
└─────┴──────┘
On this you can do something like,
cols = ["a", "b"]
df2 = df.select(pl.sum([pl.col(i) for i in cols]).alias('new_colname'))
print(df2)
Which will result in,
shape: (3, 1)
┌─────────────┐
│ new_colname │
│ ---         │
│ i64         │
╞═════════════╡
│ 2           │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4           │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ null        │
└─────────────┘
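If you want to keep the original columns and just append the horizontal sum, a small sketch of the same idea with with_column (the name "new_colname" is only for illustration):
cols = ["a", "b"]
df3 = df.with_column(pl.sum([pl.col(c) for c in cols]).alias("new_colname"))
print(df3)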

In polars, can I create a categorical type with levels myself?

In Pandas, I can specify the levels of a Categorical type myself:
MyCat = pd.CategoricalDtype(categories=['A','B','C'], ordered=True)
my_data = pd.Series(['A','A','B'], dtype=MyCat)
This means that:
1) I can make sure that different columns and datasets use the same dtype
2) I can specify an ordering for the levels
Is there a way to do this with Polars? I know you can use the string cache feature to achieve 1) in a different way, but I'm interested in whether my dtype/levels can be specified directly. I'm not aware of any way to achieve 2); however, I think the categorical dtypes in Arrow do allow an optional ordering, so maybe it's possible?
Not directly, but we can influence how the global string cache is filled. The global string cache simply increments a counter for every new category added.
So if we start with an empty cache and pre-fill it in the order that we think is important, later casts will reuse those cached integers, and the physical representation will follow our order.
Here is an example:
import string
import polars as pl
with pl.StringCache():
    # the first run will fill the global string cache counting from 0..25
    # for all 26 letters in the alphabet
    pl.Series(list(string.ascii_uppercase)).cast(pl.Categorical)

    # now the global string cache is populated with all categories
    # we cast the string columns
    df = (
        pl.DataFrame({
            "letters": ["A", "B", "D"],
            "more_letters": ["Z", "B", "J"]
        })
        .with_column(pl.col(pl.Utf8).cast(pl.Categorical))
        .with_column(pl.col(pl.Categorical).to_physical().suffix("_real_category"))
    )
print(df)
shape: (3, 4)
┌─────────┬──────────────┬───────────────────────┬────────────────────────────┐
│ letters ┆ more_letters ┆ letters_real_category ┆ more_letters_real_category │
│ --- ┆ --- ┆ --- ┆ --- │
│ cat ┆ cat ┆ u32 ┆ u32 │
╞═════════╪══════════════╪═══════════════════════╪════════════════════════════╡
│ A ┆ Z ┆ 0 ┆ 25 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ B ┆ B ┆ 1 ┆ 1 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ D ┆ J ┆ 3 ┆ 9 │
└─────────┴──────────────┴───────────────────────┴────────────────────────────┘
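Regarding point 2): because the physical representation now reflects the order in which we pre-filled the cache, you can realise the custom ordering by sorting on it. A rough sketch on the DataFrame above:
# sort by the physical (cache) order instead of the lexical string order
print(df.sort("letters_real_category"))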

in polars, how could I use rank() to get the most popular category per user

Let's say I have a csv
transaction_id,user,book
1,bob,bookA
2,bob,bookA
3,bob,bookB
4,tim,bookA
5,lucy,bookA
6,lucy,bookC
7,lucy,bookC
8,lucy,bookC
Per user, I want to find the book they have shown the most preference towards. For example, the output should be:
shape: (3, 2)
┌──────┬──────────┐
│ user ┆ fav_book │
│ --- ┆ --- │
│ str ┆ str │
╞══════╪══════════╡
│ bob ┆ bookA │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ tim ┆ bookA │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ lucy ┆ bookC │
└──────┴──────────┘
Now I've worked out how to do it like so:
import polars as pl

df = pl.read_csv("book_aggs.csv")
print(df)

df2 = df.groupby(["user", "book"]).agg([
    pl.col("book").count(),
    pl.col("transaction_id")  # just so we can double check where it all came from - TODO: how to output this to csv?
])
print(df2)

df3 = df2.sort(["user", "book_count"], reverse=True).groupby("user").agg([
    pl.col("book").first().alias("fav_book")
])
print(df3)
But really, the normal SQL way of doing it is a dense_rank sorted by book count descending, where rank = 1. I have tried for hours to get this to work but I can't find a relevant example in the docs.
The issue is that, in the docs, none of the agg examples reference the output of another agg; in this case it needs to reference the count of each book per user, then sort those counts descending and rank based on that sort order.
Please provide an example that explains how to use rank to perform this task, and also how to nest aggregations efficiently.
Approach 1
We could first groupby "user" and "book" to get all user -> book combinations and count how often each one occurs.
This would give this intermediate DataFrame:
shape: (5, 3)
┌──────┬───────┬────────────┐
│ user ┆ book ┆ book_count │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ u32 │
╞══════╪═══════╪════════════╡
│ lucy ┆ bookC ┆ 3 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ lucy ┆ bookA ┆ 1 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ bob ┆ bookB ┆ 1 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ tim ┆ bookA ┆ 1 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ bob ┆ bookA ┆ 2 │
└──────┴───────┴────────────┘
Then we can do another groupby user where we compute the index of the maximum book_count and use that index to take the correct book.
The whole query looks like this:
df = pl.DataFrame({
    'book': ['bookA', 'bookA', 'bookB', 'bookA', 'bookA', 'bookC', 'bookC', 'bookC'],
    'transaction_id': [1, 2, 3, 4, 5, 6, 7, 8],
    'user': ['bob', 'bob', 'bob', 'tim', 'lucy', 'lucy', 'lucy', 'lucy']
})

(df.groupby(["user", "book"])
   .agg([
       pl.col("book").count()
   ])
   .groupby("user")
   .agg([
       pl.col("book").take(pl.col("book_count").arg_max()).alias("fav_book")
   ])
)
And creates this output:
shape: (3, 2)
┌──────┬──────────┐
│ user ┆ fav_book │
│ --- ┆ --- │
│ str ┆ str │
╞══════╪══════════╡
│ tim ┆ bookA │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ bob ┆ bookA │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ lucy ┆ bookC │
└──────┴──────────┘
Approach 2
Another approach would be creating a book_count column with a window expression and then using the index of the maximum to take the correct book in the aggregation:
(df
 .with_column(pl.count("book").over(["user", "book"]).alias("book_count"))
 .groupby("user")
 .agg([
     pl.col("book").take(pl.col("book_count").arg_max())
 ])
)
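If you specifically want the dense_rank formulation from the question, a rough sketch along the same lines is below. It assumes your Polars version exposes Expr.rank with a "dense" method and a flag for descending order (the keyword has been reverse in older releases), so treat it as an outline rather than a drop-in answer:
(df
 .with_column(pl.count("book").over(["user", "book"]).alias("book_count"))
 .with_column(
     # dense rank of the counts per user; highest count -> rank 1
     pl.col("book_count").rank("dense", reverse=True).over("user").alias("book_rank")
 )
 .filter(pl.col("book_rank") == 1)
 .groupby("user")
 .agg([pl.col("book").first().alias("fav_book")])
)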

How to get row_count for a group in polars?

The usage might look like the code below:
out_df = df.select([
    pl.col("*"),
    pl.col("md5").row_count().over("md5").alias("row_count"),
])
print(out_df)
The data should be like this:
before:
md5
a
a
b
after:
md5 row_count
a 1
a 2
b 1
Maybe I'm misunderstanding, as your output has both values 1 and 2 for a. Assuming you meant 2 for both: you are very close, Polars has .count():
import polars as pl

df = pl.DataFrame({"md5": ["a", "a", "b"]})

out_df = df.select([
    pl.col("*"),
    pl.col("md5").count().over("md5").alias("row_count"),
])
print(out_df)
Which prints out this:
shape: (3, 2)
┌─────┬───────────┐
│ md5 ┆ row_count │
│ --- ┆ --- │
│ str ┆ u32 │
╞═════╪═══════════╡
│ a ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ a ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ b ┆ 1 │
└─────┴───────────┘
If I understand correctly, you want a running count per seen value within the group.
You can do this:
df = pl.DataFrame({"md5": ["a", "a", "b"]})

(df
 .with_column(pl.lit(1).alias("ones"))
 .select([
     pl.all().exclude("ones"),
     pl.col("ones").cumsum().over("md5").flatten().alias("row_count")
 ]))
shape: (3, 2)
┌─────┬───────────┐
│ md5 ┆ row_count │
│ --- ┆ --- │
│ str ┆ i32 │
╞═════╪═══════════╡
│ a ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ a ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ b ┆ 1 │
└─────┴───────────┘
We still have to add a dummy column "ones", because (as of polars==0.10.23) we cannot apply a window function over literals. We will add this functionality.