Computing and retrieving operations at the group level without collapsing data frame in polars? - python-polars

I am trying to compute a stat (or more) at the group level without having to create a second data frame. The current way I do it is by relying on the generation of a second data frame with the desired aggregation that I then merge back to the original one.
A silly example:
import polars as pl
df = pl.DataFrame({'name': ['Steve', 'Larry', 'Tom', 'Steve', 'Tom', 'Steve'],
                   'points': range(6)})
print(df)
shape: (6, 2)
┌───────┬────────┐
│ name ┆ points │
│ --- ┆ --- │
│ str ┆ i64 │
╞═══════╪════════╡
│ Steve ┆ 0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ Larry ┆ 1 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ Tom ┆ 2 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ Steve ┆ 3 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ Tom ┆ 4 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ Steve ┆ 5 │
└───────┴────────┘
We created a simple data frame above in which some groups have more entries than others. In a second step, we compute an additional data frame to keep track of the size of each group.
entries = df.groupby('name').agg(pl.count().alias('entries'))
print(entries)
shape: (3, 2)
┌───────┬─────────┐
│ name ┆ entries │
│ --- ┆ --- │
│ str ┆ u32 │
╞═══════╪═════════╡
│ Steve ┆ 3 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ Tom ┆ 2 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ Larry ┆ 1 │
└───────┴─────────┘
Now we bring back this information to the original data frame in a third step.
print(df.join(entries, left_on='name', right_on='name', how='left'))
shape: (6, 3)
┌───────┬────────┬─────────┐
│ name ┆ points ┆ entries │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ u32 │
╞═══════╪════════╪═════════╡
│ Steve ┆ 0 ┆ 3 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ Larry ┆ 1 ┆ 1 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ Tom ┆ 2 ┆ 2 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ Steve ┆ 3 ┆ 3 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ Tom ┆ 4 ┆ 2 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ Steve ┆ 5 ┆ 3 │
└───────┴────────┴─────────┘
Is there a way to avoid this triangulation? I have the feeling that using over might be a solution but I can't figure it out yet.

Well ... I managed. Posting the question helped me organize my thoughts and indeed, over was the solution.
df.with_column(pl.col('name').count().over('name').alias('entries'))
shape: (6, 3)
┌───────┬────────┬─────────┐
│ name ┆ points ┆ entries │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ u32 │
╞═══════╪════════╪═════════╡
│ Steve ┆ 0 ┆ 3 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ Larry ┆ 1 ┆ 1 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ Tom ┆ 2 ┆ 2 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ Steve ┆ 3 ┆ 3 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ Tom ┆ 4 ┆ 2 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ Steve ┆ 5 ┆ 3 │
└───────┴────────┴─────────┘
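For anyone wondering what the window expression is doing conceptually, here is a minimal plain-Python sketch (standard library only, not polars) of the same idea: count each group once, then broadcast that count back onto every row. The variable names are only illustrative.

```python
from collections import Counter

# Same toy data as in the question.
names = ["Steve", "Larry", "Tom", "Steve", "Tom", "Steve"]
points = list(range(6))

# Step 1: a single pass to count the rows in each group.
group_sizes = Counter(names)

# Step 2: broadcast each group's size back to its rows.
# The number of rows and their order stay exactly as they were.
entries = [group_sizes[name] for name in names]

print(entries)  # → [3, 1, 2, 3, 2, 3]
```

This is the same result as the groupby-then-join route, just without materializing the intermediate table.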

Related

Polars table convert a list column to separate rows i.e. unnest a list column to multiple rows

I have a Polars dataframe in the form:
df = pl.DataFrame({'a':[1,2,3], 'b':[['a','b'],['a'],['c','d']]})
┌─────┬────────────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ list[str] │
╞═════╪════════════╡
│ 1 ┆ ["a", "b"] │
│ 2 ┆ ["a"] │
│ 3 ┆ ["c", "d"] │
└─────┴────────────┘
I want to convert it to the following form. I plan to save to a parquet file, and query the file (with sql).
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═════╡
│ 1 ┆ "a" │
│ 1 ┆ "b" │
│ 2 ┆ "a" │
│ 3 ┆ "c" │
│ 3 ┆ "d" │
└─────┴─────┘
I have seen an answer that works on struct columns, but df.unnest('b') on my data results in the error:
SchemaError: Series of dtype: List(Utf8) != Struct
I also found a github issue that shows list can be converted to a struct, but I can't work out how to do that, or if it applies here.
To decompose a column of lists into separate rows, you can use the .explode() method:
df = pl.DataFrame({'a':[1,2,3], 'b':[['a','b'],['a'],['c','d']]})
df.explode("b")
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═════╡
│ 1 ┆ a │
│ 1 ┆ b │
│ 2 ┆ a │
│ 3 ┆ c │
│ 3 ┆ d │
└─────┴─────┘
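As a rough illustration of what exploding a list column means, here is a plain-Python sketch (not polars) that emits one output row per list element and repeats the value of the other column. Note one assumption: in this sketch an empty list would simply produce no row, which may differ from how explode treats empty lists in your polars version.

```python
# Same toy data as in the question: column "b" holds one list per row.
a = [1, 2, 3]
b = [["a", "b"], ["a"], ["c", "d"]]

# One output row per list element; the matching value of "a" is repeated.
rows = [(a_val, item) for a_val, items in zip(a, b) for item in items]

print(rows)  # → [(1, 'a'), (1, 'b'), (2, 'a'), (3, 'c'), (3, 'd')]
```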

Polars get count of events prior to "this" event, but within given duration

I have been struggling to create a feature: a counter that, for each event, counts the number of prior events that occurred within a given duration (dt). I know how to do it for all previous events; that is easy using cumcount and over on the given column. But how do I do this for only the events within, e.g., the last 2 days?
Below is how I do it (the wrong way) with cumcount.
import polars as pl
from datetime import date
df = pl.DataFrame(
data = {
"Event":["Rain","Sun","Rain","Sun","Rain","Sun","Rain","Sun"],
"Date":[
date(2022,1,1),
date(2022,1,2),
date(2022,1,2),
date(2022,1,3),
date(2022,1,3),
date(2022,1,5),
date(2022,1,5),
date(2022,1,8)
]
}
)
df.with_columns(
    pl.col("Date").cumcount().over("Event").alias("cum_sum")
)
outputting
shape: (8, 3)
┌───────┬────────────┬─────────┐
│ Event ┆ Date ┆ cum_sum │
│ --- ┆ --- ┆ --- │
│ str ┆ date ┆ u32 │
╞═══════╪════════════╪═════════╡
│ Rain ┆ 2022-01-01 ┆ 0 │
│ Sun ┆ 2022-01-02 ┆ 0 │
│ Rain ┆ 2022-01-02 ┆ 1 │
│ Sun ┆ 2022-01-03 ┆ 1 │
│ Rain ┆ 2022-01-03 ┆ 2 │
│ Sun ┆ 2022-01-05 ┆ 2 │
│ Rain ┆ 2022-01-05 ┆ 3 │
│ Sun ┆ 2022-01-08 ┆ 3 │
└───────┴────────────┴─────────┘
What I would like to output is this:
shape: (8, 3)
┌───────┬────────────┬─────────┐
│ Event ┆ Date ┆ cum_sum │
│ --- ┆ --- ┆ --- │
│ str ┆ date ┆ u32 │
╞═══════╪════════════╪═════════╡
│ Rain ┆ 2022-01-01 ┆ 0 │
│ Sun ┆ 2022-01-02 ┆ 0 │
│ Rain ┆ 2022-01-02 ┆ 1 │
│ Sun ┆ 2022-01-03 ┆ 1 │
│ Rain ┆ 2022-01-03 ┆ 2 │
│ Sun ┆ 2022-01-05 ┆ 1 │
│ Rain ┆ 2022-01-05 ┆ 1 │
│ Sun ┆ 2022-01-08 ┆ 0 │
└───────┴────────────┴─────────┘
(Preferably, a solution that scales somewhat well..)
Thanks
Tried this without success
You can try a groupby_rolling for this.
(
df
.groupby_rolling(
index_column="Date",
period="2d",
by="Event",
closed='both',
)
.agg([
pl.count() - 1
])
.sort(["Date", "Event"], reverse=[False, True])
)
shape: (8, 3)
┌───────┬────────────┬───────┐
│ Event ┆ Date ┆ count │
│ --- ┆ --- ┆ --- │
│ str ┆ date ┆ u32 │
╞═══════╪════════════╪═══════╡
│ Rain ┆ 2022-01-01 ┆ 0 │
│ Sun ┆ 2022-01-02 ┆ 0 │
│ Rain ┆ 2022-01-02 ┆ 1 │
│ Sun ┆ 2022-01-03 ┆ 1 │
│ Rain ┆ 2022-01-03 ┆ 2 │
│ Sun ┆ 2022-01-05 ┆ 1 │
│ Rain ┆ 2022-01-05 ┆ 1 │
│ Sun ┆ 2022-01-08 ┆ 0 │
└───────┴────────────┴───────┘
We subtract one in the agg because we do not want to count the current event, only prior events. (The sort at the end is just to order the rows to match the original data.)
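To make the window semantics concrete, here is a plain-Python sketch (standard library only, not polars) of "count prior events of the same type within the last 2 days, both ends inclusive". It sorts each group by date and slides a start pointer forward, so each group is processed in roughly linear time after sorting. The function name prior_counts is made up for this example, and the sketch assumes there are no duplicate dates within a group; with ties, only the rows that sort earlier would be counted.

```python
from collections import defaultdict
from datetime import date, timedelta

events = ["Rain", "Sun", "Rain", "Sun", "Rain", "Sun", "Rain", "Sun"]
dates = [
    date(2022, 1, 1), date(2022, 1, 2), date(2022, 1, 2), date(2022, 1, 3),
    date(2022, 1, 3), date(2022, 1, 5), date(2022, 1, 5), date(2022, 1, 8),
]


def prior_counts(events, dates, period=timedelta(days=2)):
    # Collect the row positions that belong to each event type.
    groups = defaultdict(list)
    for i, event in enumerate(events):
        groups[event].append(i)

    out = [0] * len(events)
    for idx in groups.values():
        idx.sort(key=lambda i: dates[i])  # oldest first within the group
        start = 0
        for pos, i in enumerate(idx):
            # Move the window start until it lies inside [date - period, date].
            while dates[idx[start]] < dates[i] - period:
                start += 1
            # Rows between start and pos (excluding pos itself) are the prior events.
            out[i] = pos - start
    return out


counts = prior_counts(events, dates)
print(counts)  # → [0, 0, 1, 1, 2, 1, 1, 0]
```

The result is in the original row order and matches the desired output in the question.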

Given a data frame with n columns of numbers, how could you calculate the Pearson correlation of all column-pair combinations?

Let's say I have a Polars data frame like this:
=> shape: (19, 5)
┌───────────────┬─────────┬───────────┬───────────┬──────────┐
│ date ┆ open_AA ┆ open_AADI ┆ open_AADR ┆ open_AAL │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞═══════════════╪═════════╪═══════════╪═══════════╪══════════╡
│ 1674777600000 ┆ 51.39 ┆ 12.84 ┆ 50.0799 ┆ 16.535 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1674691200000 ┆ 52.43 ┆ 13.14 ┆ 49.84 ┆ 16.54 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1674604800000 ┆ 51.87 ┆ 12.88 ┆ 49.75 ┆ 15.97 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1674518400000 ┆ 51.22 ┆ 12.81 ┆ 50.1 ┆ 16.01 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ... ┆ ... ┆ ... ┆ ... │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1672876800000 ┆ 45.3 ┆ 12.7 ┆ 47.185 ┆ 13.5 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1672790400000 ┆ 44.77 ┆ 12.355 ┆ 47.32 ┆ 12.86 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1672704000000 ┆ 45.77 ┆ 12.91 ┆ 47.84 ┆ 12.91 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1672358400000 ┆ 46.01 ┆ 12.57 ┆ 47.29 ┆ 12.55 │
└───────────────┴─────────┴───────────┴───────────┴──────────┘
I'm looking to calculate the Pearson correlation between each pair-combination of all columns (except the date one). The result would look something like this:
=> shape: (5, 5)
┌───────────────┬─────────┬───────────┬───────────┬──────────┐
│ symbol ┆ open_AA ┆ open_AADI ┆ open_AADR ┆ open_AAL │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ utf8 ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞═══════════════╪═════════╪═══════════╪═══════════╪══════════╡
│ open_AA ┆ 1 ┆ 1 ┆ .1 ┆ -.5 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ open_AADI ┆ .2 ┆ 1 ┆ .2 ┆ .4 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ open_AADR ┆ .4 ┆ .2 ┆ 1 ┆ .3 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ open_AAL ┆ -.45 ┆ -.6 ┆ 50.1 ┆ 1 │
└───────────────┴─────────┴───────────┴───────────┴──────────┘
My hunch is that I need to do the following:
1. Get the cartesian product of columns [1..] as a new data frame.
2. Using Polars expressions, calculate the pearson_corr of each series pair.
I'm new to Polars and am having trouble with the syntax. Can anyone point me in the right direction?
Say you start with:
df = pl.DataFrame({"date":[5,6,7],"foo": [1, 3, 9], "bar": [4, 1, 3], "ham": [2, 18, 9]})
You want to exclude some cols, so let's put those in a variable
excl_cols=['date']
Then...
(
df.drop(excl_cols) # Use drop to exclude the date column (or whatever columns you don't want)
.pearson_corr() # this is the meat and potatoes of the request, but it's missing your symbol column on the left
.select(
[
pl.Series(df.drop(excl_cols).columns).alias('symbol'), # This just creates a Series out of the column names to become its own column
pl.all() #then just every other column
])
)
shape: (3, 4)
┌────────┬───────────┬───────────┬───────────┐
│ symbol ┆ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ f64 ┆ f64 │
╞════════╪═══════════╪═══════════╪═══════════╡
│ foo ┆ 1.0 ┆ -0.052414 ┆ 0.169695 │
│ bar ┆ -0.052414 ┆ 1.0 ┆ -0.993036 │
│ ham ┆ 0.169695 ┆ -0.993036 ┆ 1.0 │
└────────┴───────────┴───────────┴───────────┘
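If you want to sanity-check the numbers in that matrix, here is a plain-Python sketch (standard library only, not polars) that computes the Pearson correlation for every column pair of the same toy data. The helper name pearson is made up for this example.

```python
from math import sqrt

# Same toy columns as above, with the date column already excluded.
columns = {"foo": [1, 3, 9], "bar": [4, 1, 3], "ham": [2, 18, 9]}


def pearson(x, y):
    # Pearson r = sum of co-deviations / sqrt(product of squared deviations).
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Full matrix: one row per column name, one entry per column pair.
corr = {r: {c: pearson(columns[r], columns[c]) for c in columns} for r in columns}

for name, row in corr.items():
    print(name, [round(value, 6) for value in row.values()])
```

The matrix is symmetric with ones on the diagonal, so only the upper triangle carries new information.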
Use DataFrame.pearson_corr
In [9]: df.drop('date').pearson_corr()
Out[9]:
shape: (2, 2)
┌─────────┬───────────┐
│ open_AA ┆ open_AADI │
│ --- ┆ --- │
│ f64 ┆ f64 │
╞═════════╪═══════════╡
│ 1.0 ┆ 1.0 │
│ 1.0 ┆ 1.0 │
└─────────┴───────────┘

Ordinal encoding of column in polars

I would like to do an ordinal encoding of a column. Pandas has the nice and convenient method of pd.factorize(), however, I would like to achieve the same in polars.
df = pl.DataFrame({"a": [5, 8, 10], "b": ["hi", "hello", "hi"]})
┌─────┬───────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═══════╡
│ 5 ┆ hi │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 8 ┆ hello │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 10 ┆ hi │
└─────┴───────┘
desired result:
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 0 ┆ 0 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 1 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ 0 │
└─────┴─────┘
You can join with a dummy DataFrame that contains the unique values and the ordinal encoding you are interested in:
df = pl.DataFrame({"a": [5, 8, 10], "b": ["hi", "hello", "hi"]})
unique = df.select(
pl.col("b").unique(maintain_order=True)
).with_row_count(name="ordinal")
df.join(unique, on="b")
Or you could "misuse" the fact that categorical values are backed by u32 integers.
df.with_column(
pl.col("b").cast(pl.Categorical).to_physical().alias("ordinal")
)
Both methods output:
shape: (3, 3)
┌─────┬───────┬─────────┐
│ a ┆ b ┆ ordinal │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ u32 │
╞═════╪═══════╪═════════╡
│ 5 ┆ hi ┆ 0 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 8 ┆ hello ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 10 ┆ hi ┆ 0 │
└─────┴───────┴─────────┘
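For reference, the encoding both methods produce here (codes assigned in order of first appearance) can be sketched in plain Python as follows. This is only an illustration of the logic, not polars code, and the variable names are made up.

```python
values = ["hi", "hello", "hi"]

codes_by_value = {}  # value -> ordinal code, in order of first appearance
ordinal = []
for value in values:
    # setdefault hands out the next free code the first time a value is seen
    # and returns the existing code on every later occurrence.
    ordinal.append(codes_by_value.setdefault(value, len(codes_by_value)))

print(ordinal)  # → [0, 1, 0]
```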
Here's another way to do it, although I doubt it's better than the dummy DataFrame from @ritchie46
(
    df.with_columns([
        pl.col('b').unique().list().alias('uniq'),
        pl.col('b').unique().list().arr.eval(pl.element().rank()).alias('uniqid'),
    ])
    .explode(['uniq', 'uniqid'])
    .filter(pl.col('b') == pl.col('uniq'))
    .select(pl.exclude('uniq'))
    .with_column(pl.col('uniqid') - 1)
)
There's almost certainly a way to improve this, but basically it creates a new column called uniq, which is a list of all the unique values of the column, as well as uniqid, which appears to be the 1-based rank of each of those values. It then explodes both columns, creating a row for every value in uniq, and filters out the rows where uniq doesn't equal column b. Since rank is 1-based (rather than 0-based), you have to subtract 1, and the uniq column is excluded since it's the same as b.
If the order is not important you could use .rank(method="dense")
>>> df.select(pl.all().rank(method="dense") - 1)
shape: (3, 2)
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ u32 ┆ u32 │
╞═════╪═════╡
│ 0 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 1 ┆ 0 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ 1 │
└─────┴─────┘
If the order is important, you could:
>>> (
... df.with_row_count()
... .with_columns([
... pl.col("row_nr").first()
... .over(col)
... .rank(method="dense")
... .alias(col) - 1
... for col in df.columns
... ])
... .drop("row_nr")
... )
shape: (3, 2)
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ u32 ┆ u32 │
╞═════╪═════╡
│ 0 ┆ 0 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 1 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ 0 │
└─────┴─────┘
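To see why the dense-rank variant gives different codes for column b, here is a plain-Python sketch (not polars) of dense ranking: the distinct values are sorted, and each value gets its position in that sorted order. The function name dense_rank_codes is made up for this example.

```python
def dense_rank_codes(values):
    # Sorted distinct values get consecutive codes starting at 0,
    # so the codes follow sort order rather than order of appearance.
    code = {value: i for i, value in enumerate(sorted(set(values)))}
    return [code[value] for value in values]


codes_a = dense_rank_codes([5, 8, 10])
codes_b = dense_rank_codes(["hi", "hello", "hi"])

print(codes_a)  # → [0, 1, 2]
print(codes_b)  # → [1, 0, 1]
```

"hello" sorts before "hi", which is why "hi" gets code 1 here but code 0 when codes follow the order of first appearance.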

How to form dynamic expressions without breaking on types

Is there any way to keep dynamic polars expressions from breaking with errors?
Currently I'm just excluding the columns by type, but I'm wondering if there is a better way.
For example, I have a df coming from parquet. If I just execute an expression on all columns, it might break for certain types. Instead, I want to contain these errors and possibly return a default value like None, -1, or something else.
import polars as pl
df = pl.scan_parquet("/path/to/data/*.parquet")
print(df.schema)
# Prints: {'date_time': <class 'polars.datatypes.Datetime'>, 'incident': <class 'polars.datatypes.Utf8'>, 'address': <class 'polars.datatypes.Utf8'>, 'city': <class 'polars.datatypes.Utf8'>, 'zipcode': <class 'polars.datatypes.Int32'>}
Now if I form a generic expression on top of this, there is a chance it will fail. For example,
# Finding positive count across all columns
# Fails due to: exceptions.ComputeError: cannot compare Utf8 with numeric data
print(df.select((pl.all() > 0).count().prefix("__positive_count_")).collect())
# Finding unique counts across all columns
# Fails due to: pyo3_runtime.PanicException: 'unique_counts' not implemented for datetime[ns] data types
print(df.select(pl.all().unique_counts().prefix("__unique_count_")).collect())
# Finding empty count across all columns
# Fails due to: exceptions.SchemaError: Series dtype Int32 != utf8
# Note: this could have been avoided by doing an explicit cast to string first
print(df.select((pl.all().str.lengths() > 0).count().prefix("__empty_count_")).collect())
I'll keep to things that work in lazy mode, as it appears that you are working in lazy mode with Parquet files.
Let's use this data as an example:
import polars as pl
from datetime import datetime
df = pl.DataFrame(
{
"col_int": [-2, -2, 0, 2, 2],
"col_float": [-20.0, -10, 10, 20, 20],
"col_date": pl.date_range(datetime(2020, 1, 1), datetime(2020, 5, 1), "1mo"),
"col_str": ["str1", "str2", "", None, "str5"],
"col_bool": [True, False, False, True, False],
}
).lazy()
df.collect()
shape: (5, 5)
┌─────────┬───────────┬─────────────────────┬─────────┬──────────┐
│ col_int ┆ col_float ┆ col_date ┆ col_str ┆ col_bool │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ datetime[ns] ┆ str ┆ bool │
╞═════════╪═══════════╪═════════════════════╪═════════╪══════════╡
│ -2 ┆ -20.0 ┆ 2020-01-01 00:00:00 ┆ str1 ┆ true │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ -2 ┆ -10.0 ┆ 2020-02-01 00:00:00 ┆ str2 ┆ false │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 0 ┆ 10.0 ┆ 2020-03-01 00:00:00 ┆ ┆ false │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 20.0 ┆ 2020-04-01 00:00:00 ┆ null ┆ true │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 20.0 ┆ 2020-05-01 00:00:00 ┆ str5 ┆ false │
└─────────┴───────────┴─────────────────────┴─────────┴──────────┘
Using the col Expression
One feature of the col expression is that you can supply a datatype, or even a list of datatypes. For example, if we want to contain our queries to floats, we can do the following:
df.select((pl.col(pl.Float64) > 0).sum().suffix("__positive_count_")).collect()
shape: (1, 1)
┌────────────────────────────┐
│ col_float__positive_count_ │
│ --- │
│ u32 │
╞════════════════════════════╡
│ 3 │
└────────────────────────────┘
(Note: (pl.col(...) > 0) creates a series of boolean values that need to be summed, not counted)
To include more than one datatype, you can supply a list of datatypes to col.
df.select(
(pl.col([pl.Int64, pl.Float64]) > 0).sum().suffix("__positive_count_")
).collect()
shape: (1, 2)
┌──────────────────────────┬────────────────────────────┐
│ col_int__positive_count_ ┆ col_float__positive_count_ │
│ --- ┆ --- │
│ u32 ┆ u32 │
╞══════════════════════════╪════════════════════════════╡
│ 2 ┆ 3 │
└──────────────────────────┴────────────────────────────┘
You can also combine these into the same select statement if you'd like.
df.select(
[
(pl.col(pl.Utf8).str.lengths() == 0).sum().suffix("__empty_count"),
pl.col(pl.Utf8).is_null().sum().suffix("__null_count"),
(pl.col([pl.Float64, pl.Int64]) > 0).sum().suffix("_positive_count"),
]
).collect()
shape: (1, 4)
┌──────────────────────┬─────────────────────┬──────────────────────────┬────────────────────────┐
│ col_str__empty_count ┆ col_str__null_count ┆ col_float_positive_count ┆ col_int_positive_count │
│ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ u32 ┆ u32 ┆ u32 │
╞══════════════════════╪═════════════════════╪══════════════════════════╪════════════════════════╡
│ 1 ┆ 1 ┆ 3 ┆ 2 │
└──────────────────────┴─────────────────────┴──────────────────────────┴────────────────────────┘
The Cookbook has a handy list of datatypes.
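The idea behind selecting by datatype can be sketched in plain Python (not polars): keep a schema next to the data, pick column names by declared type, and only then apply the type-specific computation. Everything here (the schema dict, the type labels, the helper select_by_dtype) is made up for illustration, and the date column is left out for brevity.

```python
# Declared type per column; a hypothetical stand-in for a real schema.
schema = {"col_int": "int", "col_float": "float", "col_str": "str", "col_bool": "bool"}
data = {
    "col_int": [-2, -2, 0, 2, 2],
    "col_float": [-20.0, -10.0, 10.0, 20.0, 20.0],
    "col_str": ["str1", "str2", "", None, "str5"],
    "col_bool": [True, False, False, True, False],
}


def select_by_dtype(dtypes):
    # Return the names of the columns whose declared type is in dtypes.
    return [name for name, dtype in schema.items() if dtype in dtypes]


stats = {}
# Numeric-only statistic: never sees strings or booleans, so it cannot fail on them.
for name in select_by_dtype({"int", "float"}):
    stats[name + "_positive_count"] = sum(value > 0 for value in data[name])
# String-only statistics.
for name in select_by_dtype({"str"}):
    stats[name + "__empty_count"] = sum(value == "" for value in data[name])
    stats[name + "__null_count"] = sum(value is None for value in data[name])

print(stats)
```

The counts agree with the combined select shown above: 2 and 3 positive values, 1 empty string and 1 null.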
Using the exclude expression
Another handy trick is to use the exclude expression. With this, we can select all columns except columns of certain datatypes. For example:
df.select(
[
pl.exclude(pl.Utf8).max().suffix("_max"),
pl.exclude([pl.Utf8, pl.Boolean]).min().suffix("_min"),
]
).collect()
shape: (1, 7)
┌─────────────┬───────────────┬─────────────────────┬──────────────┬─────────────┬───────────────┬─────────────────────┐
│ col_int_max ┆ col_float_max ┆ col_date_max ┆ col_bool_max ┆ col_int_min ┆ col_float_min ┆ col_date_min │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ datetime[ns] ┆ u32 ┆ i64 ┆ f64 ┆ datetime[ns] │
╞═════════════╪═══════════════╪═════════════════════╪══════════════╪═════════════╪═══════════════╪═════════════════════╡
│ 2 ┆ 20.0 ┆ 2020-05-01 00:00:00 ┆ 1 ┆ -2 ┆ -20.0 ┆ 2020-01-01 00:00:00 │
└─────────────┴───────────────┴─────────────────────┴──────────────┴─────────────┴───────────────┴─────────────────────┘
Unique counts
One caution: unique_counts results in Series of varying lengths.
df.select(pl.col("col_int").unique_counts().prefix("__unique_count_")).collect()
shape: (3, 1)
┌────────────────────────┐
│ __unique_count_col_int │
│ --- │
│ u32 │
╞════════════════════════╡
│ 2 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 │
└────────────────────────┘
df.select(pl.col("col_float").unique_counts().prefix("__unique_count_")).collect()
shape: (4, 1)
┌──────────────────────────┐
│ __unique_count_col_float │
│ --- │
│ u32 │
╞══════════════════════════╡
│ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 │
└──────────────────────────┘
As such, these should not be combined into the same results. Each column/Series of a DataFrame must have the same length.
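A plain-Python sketch (standard library only, not polars) shows the length mismatch directly. One possible way to keep such results together, shown here purely as an assumption about what might suit your use case, is a dict of lists rather than a single frame.

```python
from collections import Counter

data = {
    "col_int": [-2, -2, 0, 2, 2],
    "col_float": [-20.0, -10.0, 10.0, 20.0, 20.0],
}

# Count occurrences of each distinct value, in order of first appearance.
unique_counts = {name: list(Counter(values).values()) for name, values in data.items()}

print(unique_counts)  # → {'col_int': [2, 1, 2], 'col_float': [1, 1, 1, 2]}

# The lists have different lengths, so they cannot be columns of one frame.
lengths = {name: len(counts) for name, counts in unique_counts.items()}
print(lengths)  # → {'col_int': 3, 'col_float': 4}
```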