Let's say I have a Polars data frame like this:
=> shape: (19, 5)
┌───────────────┬─────────┬───────────┬───────────┬──────────┐
│ date ┆ open_AA ┆ open_AADI ┆ open_AADR ┆ open_AAL │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞═══════════════╪═════════╪═══════════╪═══════════╪══════════╡
│ 1674777600000 ┆ 51.39 ┆ 12.84 ┆ 50.0799 ┆ 16.535 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1674691200000 ┆ 52.43 ┆ 13.14 ┆ 49.84 ┆ 16.54 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1674604800000 ┆ 51.87 ┆ 12.88 ┆ 49.75 ┆ 15.97 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1674518400000 ┆ 51.22 ┆ 12.81 ┆ 50.1 ┆ 16.01 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ... ┆ ... ┆ ... ┆ ... │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1672876800000 ┆ 45.3 ┆ 12.7 ┆ 47.185 ┆ 13.5 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1672790400000 ┆ 44.77 ┆ 12.355 ┆ 47.32 ┆ 12.86 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1672704000000 ┆ 45.77 ┆ 12.91 ┆ 47.84 ┆ 12.91 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1672358400000 ┆ 46.01 ┆ 12.57 ┆ 47.29 ┆ 12.55 │
└───────────────┴─────────┴───────────┴───────────┴──────────┘
I'm looking to calculate the Pearson correlation between each pair-combination of all columns (except the date one). The result would look something like this:
=> shape: (5, 5)
┌───────────────┬─────────┬───────────┬───────────┬──────────┐
│ symbol ┆ open_AA ┆ open_AADI ┆ open_AADR ┆ open_AAL │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ utf8 ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞═══════════════╪═════════╪═══════════╪═══════════╪══════════╡
│ open_AA ┆ 1 ┆ 1 ┆ .1 ┆ -.5 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ open_AADI ┆ .2 ┆ 1 ┆ .2 ┆ .4 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ open_AADR ┆ .4 ┆ .2 ┆ 1 ┆ .3 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ open_AAL ┆ -.45 ┆ -.6 ┆ .1 ┆ 1 │
└───────────────┴─────────┴───────────┴───────────┴──────────┘
My hunch is that I need to do the following:
Get the cartesian product of columns [1..] as a new data frame.
Using Polars expressions, calculate the pearson_corr of each series pair.
I'm new to Polars and am having trouble with the syntax. Can anyone point me in the right direction?
Say you start with:
df = pl.DataFrame({"date":[5,6,7],"foo": [1, 3, 9], "bar": [4, 1, 3], "ham": [2, 18, 9]})
You want to exclude some columns, so let's put those in a variable:
excl_cols = ['date']
Then...
(
    df.drop(excl_cols)   # Use drop to exclude the date column (or whatever columns you don't want)
    .pearson_corr()      # this is the meat and potatoes of the request, but it's missing your symbol column on the left
    .select(
        [
            pl.Series(df.drop(excl_cols).columns).alias('symbol'),  # creates a Series out of the column names to become its own column
            pl.all(),    # then just every other column
        ]
    )
)
shape: (3, 4)
┌────────┬───────────┬───────────┬───────────┐
│ symbol ┆ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ f64 ┆ f64 │
╞════════╪═══════════╪═══════════╪═══════════╡
│ foo ┆ 1.0 ┆ -0.052414 ┆ 0.169695 │
│ bar ┆ -0.052414 ┆ 1.0 ┆ -0.993036 │
│ ham ┆ 0.169695 ┆ -0.993036 ┆ 1.0 │
└────────┴───────────┴───────────┴───────────┘
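As an aside, the pairwise route sketched in the question also works without building a cartesian-product frame: you can generate one correlation expression per column pair. A minimal sketch, assuming the 0.13-era pl.pearson_corr function (renamed pl.corr in later releases):

from itertools import combinations

import polars as pl

df = pl.DataFrame({"foo": [1, 3, 9], "bar": [4, 1, 3], "ham": [2, 18, 9]})

# One pearson_corr expression per unordered pair of (non-date) columns.
df.select(
    [
        pl.pearson_corr(a, b).alias(f"{a}~{b}")
        for a, b in combinations(df.columns, 2)
    ]
)

This yields a single row with one column per pair, rather than the square matrix above, which can be handier if you only want the distinct pairs.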
Use DataFrame.pearson_corr
In [9]: df.drop('date').pearson_corr()
Out[9]:
shape: (2, 2)
┌─────────┬───────────┐
│ open_AA ┆ open_AADI │
│ --- ┆ --- │
│ f64 ┆ f64 │
╞═════════╪═══════════╡
│ 1.0 ┆ 1.0 │
│ 1.0 ┆ 1.0 │
└─────────┴───────────┘
I have multiple columns whose names start with "ts", like "ts_1, ts_2, ts_3", and so on. I want to sum these f64 values row by row, but I don't know the exact column names in advance. I can select them with a regex like pl.col('^ts.*$'), but how do I sum their values?
Update: Polars >= 0.13.60
As of Polars 0.13.60, you can now use polars.sum with a regex argument to polars.col.
df.with_column(
    pl.sum(pl.col("^ts_.*$")).alias('ts_sum')
)
shape: (4, 6)
┌──────┬──────┬──────┬───────┬──────┬────────┐
│ ts_1 ┆ a ┆ ts_2 ┆ b ┆ ts_3 ┆ ts_sum │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞══════╪══════╪══════╪═══════╪══════╪════════╡
│ 1 ┆ -100 ┆ 10 ┆ -1000 ┆ 100 ┆ 111 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2 ┆ -100 ┆ 10 ┆ -1000 ┆ 100 ┆ 112 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 3 ┆ -100 ┆ 10 ┆ -1000 ┆ 100 ┆ 113 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 4 ┆ -100 ┆ 10 ┆ -1000 ┆ 100 ┆ 114 │
└──────┴──────┴──────┴───────┴──────┴────────┘
Other methods
There are two ways to accomplish this. Starting with this data:
import polars as pl
df = pl.DataFrame({
    'ts_1': [1, 2, 3, 4],
    'a': [-100] * 4,
    'ts_2': [10] * 4,
    'b': [-1000] * 4,
    'ts_3': [100] * 4,
})
df
shape: (4, 5)
┌──────┬──────┬──────┬───────┬──────┐
│ ts_1 ┆ a ┆ ts_2 ┆ b ┆ ts_3 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞══════╪══════╪══════╪═══════╪══════╡
│ 1 ┆ -100 ┆ 10 ┆ -1000 ┆ 100 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2 ┆ -100 ┆ 10 ┆ -1000 ┆ 100 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 3 ┆ -100 ┆ 10 ┆ -1000 ┆ 100 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 4 ┆ -100 ┆ 10 ┆ -1000 ┆ 100 │
└──────┴──────┴──────┴───────┴──────┘
As of Polars 0.13.59, you can use a regex expression in fold.
df.with_column(
    pl.fold(acc=pl.lit(0),
            f=lambda acc, x: acc + x,
            exprs=pl.col("^ts_.*$"))
    .alias("ts_sum")
)
shape: (4, 6)
┌──────┬──────┬──────┬───────┬──────┬────────┐
│ ts_1 ┆ a ┆ ts_2 ┆ b ┆ ts_3 ┆ ts_sum │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞══════╪══════╪══════╪═══════╪══════╪════════╡
│ 1 ┆ -100 ┆ 10 ┆ -1000 ┆ 100 ┆ 111 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2 ┆ -100 ┆ 10 ┆ -1000 ┆ 100 ┆ 112 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 3 ┆ -100 ┆ 10 ┆ -1000 ┆ 100 ┆ 113 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 4 ┆ -100 ┆ 10 ┆ -1000 ┆ 100 ┆ 114 │
└──────┴──────┴──────┴───────┴──────┴────────┘
The other option is to use the fold-like properties of polars.sum. polars.sum will sum horizontally when passed a list of Expressions. For example:
df.with_column(
    pl.sum(
        [col_nm for col_nm in df.columns
         if col_nm.startswith("ts_")]
    ).alias("ts_sum")
)
shape: (4, 6)
┌──────┬──────┬──────┬───────┬──────┬────────┐
│ ts_1 ┆ a ┆ ts_2 ┆ b ┆ ts_3 ┆ ts_sum │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞══════╪══════╪══════╪═══════╪══════╪════════╡
│ 1 ┆ -100 ┆ 10 ┆ -1000 ┆ 100 ┆ 111 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2 ┆ -100 ┆ 10 ┆ -1000 ┆ 100 ┆ 112 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 3 ┆ -100 ┆ 10 ┆ -1000 ┆ 100 ┆ 113 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 4 ┆ -100 ┆ 10 ┆ -1000 ┆ 100 ┆ 114 │
└──────┴──────┴──────┴───────┴──────┴────────┘
If needed, you can filter the column names with a regex, using Python's re module.
import re

df.with_column(
    pl.sum(
        [col_nm for col_nm in df.columns
         if re.search(r"^ts_.*$", col_nm)]
    ).alias("ts_sum")
)
shape: (4, 6)
┌──────┬──────┬──────┬───────┬──────┬────────┐
│ ts_1 ┆ a ┆ ts_2 ┆ b ┆ ts_3 ┆ ts_sum │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞══════╪══════╪══════╪═══════╪══════╪════════╡
│ 1 ┆ -100 ┆ 10 ┆ -1000 ┆ 100 ┆ 111 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 2 ┆ -100 ┆ 10 ┆ -1000 ┆ 100 ┆ 112 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 3 ┆ -100 ┆ 10 ┆ -1000 ┆ 100 ┆ 113 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 4 ┆ -100 ┆ 10 ┆ -1000 ┆ 100 ┆ 114 │
└──────┴──────┴──────┴───────┴──────┴────────┘
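Note for readers on newer Polars versions: pl.sum no longer sums horizontally, and with_column has become with_columns; the horizontal behaviour moved to a dedicated function. A minimal sketch, assuming a release recent enough to have pl.sum_horizontal (roughly >= 0.18):

df.with_columns(
    pl.sum_horizontal(pl.col("^ts_.*$")).alias("ts_sum")  # horizontal, row-wise sum
)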
I have a polars dataframe with many columns. I want to look at all the data from a single row aligned vertically so that I can see the values in many different columns without it going off the edge of the screen. How can I do this?
E.g. define a dataframe
df = pl.DataFrame({'a':[0,1],'b':[2,3]})
If I print df[0] in IPython/Jupyter, I get the usual horizontal table layout. But if I convert df to pandas and print df.iloc[0], I get the row laid out vertically, one column name and value per line. The latter is very handy when you've got many columns.
I've tried things like df[0].to_series(), but it only prints the first element, not the first row.
My suspicion is that there isn't a direct replacement because the pandas method relies on the series having an index. I think the polars solution will be more like making a two column dataframe where one column is the column names and the other is a value. I'm not sure if there's a method to do that though.
Thanks for any help you can offer!
import polars as pl
import numpy as np
# Create dataframe with lots of columns.
df = pl.DataFrame(np.random.randint(0, 1000, (5, 100)))
df
shape: (5, 100)
┌──────────┬──────────┬──────────┬──────────┬─────┬───────────┬───────────┬───────────┬───────────┐
│ column_0 ┆ column_1 ┆ column_2 ┆ column_3 ┆ ... ┆ column_96 ┆ column_97 ┆ column_98 ┆ column_99 │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞══════════╪══════════╪══════════╪══════════╪═════╪═══════════╪═══════════╪═══════════╪═══════════╡
│ 285 ┆ 366 ┆ 886 ┆ 981 ┆ ... ┆ 63 ┆ 326 ┆ 882 ┆ 564 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 735 ┆ 269 ┆ 381 ┆ 78 ┆ ... ┆ 556 ┆ 737 ┆ 741 ┆ 768 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 543 ┆ 729 ┆ 915 ┆ 901 ┆ ... ┆ 48 ┆ 21 ┆ 277 ┆ 818 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 264 ┆ 424 ┆ 285 ┆ 540 ┆ ... ┆ 602 ┆ 584 ┆ 888 ┆ 836 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 269 ┆ 701 ┆ 483 ┆ 817 ┆ ... ┆ 579 ┆ 873 ┆ 192 ┆ 734 │
└──────────┴──────────┴──────────┴──────────┴─────┴───────────┴───────────┴───────────┴───────────┘
# Display the third row (index 2) by pairing each column name with its value.
tuple(zip(df.columns, df.row(2)))
(('column_0', 543),
('column_1', 729),
('column_2', 915),
('column_3', 901),
('column_4', 332),
('column_5', 156),
('column_6', 624),
('column_7', 37),
('column_8', 341),
('column_9', 503),
('column_10', 135),
('column_11', 183),
('column_12', 651),
('column_13', 910),
('column_14', 625),
('column_15', 129),
('column_16', 604),
('column_17', 671),
('column_18', 976),
('column_19', 558),
('column_20', 159),
('column_21', 314),
('column_22', 460),
('column_23', 49),
('column_24', 944),
('column_25', 6),
('column_26', 470),
('column_27', 228),
('column_28', 615),
('column_29', 230),
('column_30', 217),
('column_31', 66),
('column_32', 999),
('column_33', 440),
('column_34', 519),
('column_35', 851),
('column_36', 37),
('column_37', 859),
('column_38', 560),
('column_39', 870),
('column_40', 892),
('column_41', 192),
('column_42', 541),
('column_43', 136),
('column_44', 631),
('column_45', 22),
('column_46', 522),
('column_47', 225),
('column_48', 610),
('column_49', 191),
('column_50', 886),
('column_51', 454),
('column_52', 312),
('column_53', 956),
('column_54', 473),
('column_55', 851),
('column_56', 760),
('column_57', 224),
('column_58', 859),
('column_59', 442),
('column_60', 234),
('column_61', 788),
('column_62', 53),
('column_63', 999),
('column_64', 473),
('column_65', 237),
('column_66', 247),
('column_67', 307),
('column_68', 916),
('column_69', 94),
('column_70', 714),
('column_71', 233),
('column_72', 995),
('column_73', 335),
('column_74', 454),
('column_75', 801),
('column_76', 742),
('column_77', 386),
('column_78', 196),
('column_79', 239),
('column_80', 723),
('column_81', 59),
('column_82', 929),
('column_83', 852),
('column_84', 722),
('column_85', 328),
('column_86', 59),
('column_87', 710),
('column_88', 238),
('column_89', 823),
('column_90', 75),
('column_91', 307),
('column_92', 472),
('column_93', 822),
('column_94', 582),
('column_95', 802),
('column_96', 48),
('column_97', 21),
('column_98', 277),
('column_99', 818))
Pandas does not display all values either if you have many columns.
In [121]: df.to_pandas().iloc[0]
Out[121]:
column_0 285
column_1 366
column_2 886
column_3 981
column_4 464
...
column_95 862
column_96 63
column_97 326
column_98 882
column_99 564
Name: 0, Length: 100, dtype: int64
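For what it's worth, pandas can be told to show every row of that vertical view by raising its display limit (a standard pandas option, included here only for comparison):

import pandas as pd

pd.set_option("display.max_rows", None)  # stop pandas truncating the middle with "..."
df.to_pandas().iloc[0]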
You can try using melt. For example:
df = pl.DataFrame(
    [
        pl.Series(name="col_str", values=["string1", "string2"]),
        pl.Series(name="col_bool", values=[False, True]),
        pl.Series(name="col_int", values=[1, 2]),
        pl.Series(name="col_float", values=[10.0, 20.0]),
        *[pl.Series(name=f"col_other_{idx}", values=[idx] * 2)
          for idx in range(1, 25)],
    ]
)
print(df)
shape: (2, 28)
┌─────────┬──────────┬─────────┬───────────┬─────┬──────────────┬──────────────┬──────────────┬──────────────┐
│ col_str ┆ col_bool ┆ col_int ┆ col_float ┆ ... ┆ col_other_21 ┆ col_other_22 ┆ col_other_23 ┆ col_other_24 │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ bool ┆ i64 ┆ f64 ┆ ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════════╪══════════╪═════════╪═══════════╪═════╪══════════════╪══════════════╪══════════════╪══════════════╡
│ string1 ┆ false ┆ 1 ┆ 10.0 ┆ ... ┆ 21 ┆ 22 ┆ 23 ┆ 24 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ string2 ┆ true ┆ 2 ┆ 20.0 ┆ ... ┆ 21 ┆ 22 ┆ 23 ┆ 24 │
└─────────┴──────────┴─────────┴───────────┴─────┴──────────────┴──────────────┴──────────────┴──────────────┘
To print the first row:
pl.Config.set_tbl_rows(100)
df[0,].melt()
shape: (28, 2)
┌──────────────┬─────────┐
│ variable ┆ value │
│ --- ┆ --- │
│ str ┆ str │
╞══════════════╪═════════╡
│ col_str ┆ string1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_bool ┆ false │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_int ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_float ┆ 10.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_1 ┆ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_2 ┆ 2 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_3 ┆ 3 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_4 ┆ 4 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_5 ┆ 5 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_6 ┆ 6 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_7 ┆ 7 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_8 ┆ 8 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_9 ┆ 9 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_10 ┆ 10 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_11 ┆ 11 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_12 ┆ 12 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_13 ┆ 13 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_14 ┆ 14 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_15 ┆ 15 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_16 ┆ 16 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_17 ┆ 17 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_18 ┆ 18 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_19 ┆ 19 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_20 ┆ 20 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_21 ┆ 21 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_22 ┆ 22 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_23 ┆ 23 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ col_other_24 ┆ 24 │
└──────────────┴─────────┘
If needed, set the polars.Config.set_tbl_rows option to the number of rows you find acceptable. (This only needs to be done once per session, not every time you print.)
Notice that all values have been cast to the common supertype str. (One caution: this approach won't work if any of your columns are of dtype list.)
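A hedged alternative to melt, assuming your Polars version supports DataFrame.transpose with include_header (values are likewise cast to a common supertype, so the same list-dtype caution applies):

# Transpose the single-row slice; include_header keeps the original
# column names as their own "column" column.
df[0,].transpose(include_header=True)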
You may want to check the Polars Cookbook section on indexing here.
It states:
| operation  | pandas     | polars   |
|------------|------------|----------|
| select row | df.iloc[2] | df[2, :] |
Cheers!
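A quick sketch of the Polars side of that table, using the frame from the question:

import polars as pl

df = pl.DataFrame({'a': [0, 1], 'b': [2, 3]})
df[1, :]  # second row as a one-row DataFrame, analogous to pandas df.iloc[1]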
I'm learning Polars (as a substitute for pandas) and I would like to replicate some Excel functions, in particular an average-if over a rolling window.
Suppose we have a column with positive and negative values: how can I create a new column containing the rolling average, computed only if all the values in the window are positive?
import polars as pl
df = pl.DataFrame(
    {
        "Date": ["12/04/98", "19/04/98", "26/04/98", "03/05/98", "10/05/98", "17/05/98", "24/05/98", "31/05/98", "07/06/98"],
        "Close": [15.46, 15.09, 16.13, 15.13, 14.47, 14.78, 15.20, 15.07, 12.59],
    }
)
df = df.with_columns([
    pl.col("Close").pct_change().alias("Close Returns")
])
This creates a data frame with the column "Close Returns"; the new column should hold its average over a fixed window, computed only if the values in the window are all positive.
And what if I want to create a new column as the quotient of the positive average over the negative one?
For example, with a window of two elements: the first window contains only the leading null, so nothing is computed. The next window contains one positive and one negative value, so it returns zero (I need 2 positive values), while the last window contains two negative values, so the mean can be computed.
Here is my solution, but I'm not satisfied with it:
import polars as pl
dataset = pl.DataFrame(
    {
        "Date": ["12/04/98", "19/04/98", "26/04/98", "03/05/98", "10/05/98", "17/05/98", "24/05/98", "31/05/98", "07/06/98"],
        "Close": [15.46, 15.09, 16.13, 15.13, 14.47, 14.78, 15.20, 15.07, 12.59],
    }
)

q = dataset.lazy().with_column(pl.col("Date").str.strptime(pl.Date, fmt="%d/%m/%y"))
df = q.collect()

df = df.with_columns([
    pl.col("Close").pct_change().alias("Close Returns")
])

lag_vector = [2, 6, 7, 10, 12, 13]
for lag in lag_vector:
    out = df.groupby_rolling(
        index_column="Date", period=f"{lag}w"
    ).agg([
        pl.col("Close Returns").filter(pl.col("Close Returns") >= 0).mean().alias("positive mean"),
        pl.col("Close Returns").filter(pl.col("Close Returns") < 0).mean().alias("negative mean"),
    ])
    out["negative mean"] = out["negative mean"].fill_null("zero")
    out["positive mean"] = out["positive mean"].fill_null("zero")
    out = out.with_columns([
        (pl.col("positive mean") / (pl.col("positive mean") - pl.col("negative mean"))).alias(f"{lag} lag mean"),
    ])
    df = df.join(out.select(["Date", f"{lag} lag mean"]), left_on="Date", right_on="Date")
Edit: I've tweaked my answer to use the any expression so that the non-negative windowed mean is calculated if any (rather than all) of the values in the window is non-negative. Likewise, for the negative windowed mean.
lag_vector = [1, 2, 3]
for lag in lag_vector:
    out = (
        df.groupby_rolling(index_column="Date", period=f"{lag}w").agg(
            [
                pl.col("Close Returns").alias("Close Returns list"),
                pl.when((pl.col("Close Returns") >= 0).any())
                .then(pl.col("Close Returns").filter(pl.col("Close Returns") >= 0).mean())
                .otherwise(0)
                .alias("positive mean"),
                pl.when((pl.col("Close Returns") < 0).any())
                .then(pl.col("Close Returns").filter(pl.col("Close Returns") < 0).mean())
                .otherwise(0)
                .alias("negative mean"),
            ]
        )
    )
    print(out)
Window size 1 week:
shape: (9, 4)
┌────────────┬────────────────────┬───────────────┬───────────────┐
│ Date ┆ Close Returns list ┆ positive mean ┆ negative mean │
│ --- ┆ --- ┆ --- ┆ --- │
│ date ┆ list [f64] ┆ f64 ┆ f64 │
╞════════════╪════════════════════╪═══════════════╪═══════════════╡
│ 1998-04-12 ┆ [null] ┆ 0.0 ┆ 0.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1998-04-19 ┆ [-0.023933] ┆ 0.0 ┆ -0.023933 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1998-04-26 ┆ [0.0689] ┆ 0.0689 ┆ 0.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1998-05-03 ┆ [-0.061996] ┆ 0.0 ┆ -0.061996 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1998-05-10 ┆ [-0.043622] ┆ 0.0 ┆ -0.043622 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1998-05-17 ┆ [0.021424] ┆ 0.021424 ┆ 0.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1998-05-24 ┆ [0.028417] ┆ 0.028417 ┆ 0.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1998-05-31 ┆ [-0.008553] ┆ 0.0 ┆ -0.008553 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1998-06-07 ┆ [-0.164565] ┆ 0.0 ┆ -0.164565 │
└────────────┴────────────────────┴───────────────┴───────────────┘
Window size 2 weeks:
shape: (9, 4)
┌────────────┬────────────────────────┬───────────────┬───────────────┐
│ Date ┆ Close Returns list ┆ positive mean ┆ negative mean │
│ --- ┆ --- ┆ --- ┆ --- │
│ date ┆ list [f64] ┆ f64 ┆ f64 │
╞════════════╪════════════════════════╪═══════════════╪═══════════════╡
│ 1998-04-12 ┆ [null] ┆ 0.0 ┆ 0.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1998-04-19 ┆ [null, -0.023933] ┆ 0.0 ┆ -0.023933 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1998-04-26 ┆ [-0.023933, 0.0689] ┆ 0.0689 ┆ -0.023933 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1998-05-03 ┆ [0.0689, -0.061996] ┆ 0.0689 ┆ -0.061996 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1998-05-10 ┆ [-0.061996, -0.043622] ┆ 0.0 ┆ -0.052809 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1998-05-17 ┆ [-0.043622, 0.021424] ┆ 0.021424 ┆ -0.043622 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1998-05-24 ┆ [0.021424, 0.028417] ┆ 0.0249 ┆ 0.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1998-05-31 ┆ [0.028417, -0.008553] ┆ 0.028417 ┆ -0.008553 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1998-06-07 ┆ [-0.008553, -0.164565] ┆ 0.0 ┆ -0.086559 │
└────────────┴────────────────────────┴───────────────┴───────────────┘
Window size 3 weeks:
shape: (9, 4)
┌────────────┬──────────────────────────────────┬───────────────┬───────────────┐
│ Date ┆ Close Returns list ┆ positive mean ┆ negative mean │
│ --- ┆ --- ┆ --- ┆ --- │
│ date ┆ list [f64] ┆ f64 ┆ f64 │
╞════════════╪══════════════════════════════════╪═══════════════╪═══════════════╡
│ 1998-04-12 ┆ [null] ┆ 0.0 ┆ 0.0 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1998-04-19 ┆ [null, -0.023933] ┆ 0.0 ┆ -0.023933 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1998-04-26 ┆ [null, -0.023933, 0.0689] ┆ 0.0689 ┆ -0.023933 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1998-05-03 ┆ [-0.023933, 0.0689, -0.061996] ┆ 0.0689 ┆ -0.042965 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1998-05-10 ┆ [0.0689, -0.061996, -0.043622] ┆ 0.0689 ┆ -0.052809 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1998-05-17 ┆ [-0.061996, -0.043622, 0.021424] ┆ 0.021424 ┆ -0.052809 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1998-05-24 ┆ [-0.043622, 0.021424, 0.028417] ┆ 0.0249 ┆ -0.043622 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1998-05-31 ┆ [0.021424, 0.028417, -0.008553] ┆ 0.0249 ┆ -0.008553 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1998-06-07 ┆ [0.028417, -0.008553, -0.164565] ┆ 0.028417 ┆ -0.086559 │
└────────────┴──────────────────────────────────┴───────────────┴───────────────┘
Is this closer to what you are looking for?
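To get the quotient column the question ultimately asks for, these aggregates can be combined with the same formula as in the question's own attempt. A sketch, reusing the out frame from the last loop iteration above (lag is still bound by the loop):

out = out.with_column(
    # Note: when both means are 0.0, the denominator is zero and the result is NaN.
    (pl.col("positive mean") / (pl.col("positive mean") - pl.col("negative mean")))
    .alias(f"{lag} lag mean")
)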
You can use groupby_rolling and then, in the aggregation, filter out the values that are negative.
In the example below, we parse the dates, group over a rolling window of 10 days ("10d"), and finally aggregate with our condition.
df = pl.DataFrame(
    {
        "Date": ["12/04/98", "19/04/98", "26/04/98", "03/05/98", "10/05/98", "17/05/98", "24/05/98"],
        "Close": [15.46, 15.09, 16.13, 15.13, 14.47, 14.78, 15.20],
    }
)

(df.with_column(pl.col("Date").str.strptime(pl.Date, fmt="%d/%m/%y"))
   .groupby_rolling(index_column="Date", period="10d")
   .agg([
       pl.col("Close").filter(pl.col("Close") > 0).mean().alias("mean")
   ])
)
shape: (7, 2)
┌────────────┬────────┐
│ Date ┆ mean │
│ --- ┆ --- │
│ date ┆ f64 │
╞════════════╪════════╡
│ 1998-04-12 ┆ 15.46 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1998-04-19 ┆ 15.275 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1998-04-26 ┆ 15.61 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1998-05-03 ┆ 15.63 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1998-05-10 ┆ 14.8 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1998-05-17 ┆ 14.625 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 1998-05-24 ┆ 14.99 │
└────────────┴────────┘
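If you need the stricter condition from the original question (compute the mean only when every value in the window is positive), the filter can be swapped for a when/.all() guard, the counterpart of the .any() pattern in the earlier answer. A sketch in the same 0.13-era API:

(df.with_column(pl.col("Date").str.strptime(pl.Date, fmt="%d/%m/%y"))
   .groupby_rolling(index_column="Date", period="10d")
   .agg([
       pl.when((pl.col("Close") > 0).all())   # every value in the window positive?
       .then(pl.col("Close").mean())
       .otherwise(0.0)
       .alias("mean")
   ])
)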