Float Decimal Point Display Setting in Polars - python-polars

Is there a way to adjust or change a setting so that Polars shows the same number of decimal places for all values?
And if so, can I save it as the default for all new notebooks in Jupyter, for instance?
For example,
pl.DataFrame({"a":[0.1213, 0.4244, 0.1000, 0.4242]})
Output:
shape: (4, 1)
┌────────┐
│ a      │
│ ---    │
│ f64    │
╞════════╡
│ 0.1213 │
│ 0.4244 │
│ 0.1    │
│ 0.4242 │
└────────┘
I'd like to see the 0.1 as 0.1000

I think right now we do not have this kind of option in the Polars API.
If you take a look at https://pola-rs.github.io/polars/py-polars/html/reference/config.html,
we only have limited options for formatting tables and strings.
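That said, newer Polars releases have since added float-formatting options to pl.Config. A minimal sketch, assuming you are on a version that ships pl.Config.set_float_precision (added well after this question was asked):
import polars as pl
# Assumption: this Config option exists in your Polars version; it fixes
# the number of decimal places used when displaying floats.
pl.Config.set_float_precision(4)
pl.DataFrame({"a": [0.1213, 0.4244, 0.1000, 0.4242]})  # 0.1 now displays as 0.1000
As for Jupyter defaults: one option is to put the Config call in an IPython startup file (e.g. under ~/.ipython/profile_default/startup/), which runs for every new notebook session.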

This doesn't exactly answer your question, but if the intention is just to print values with the same number of decimal places:
df = pl.DataFrame({"a": [0.1213, 0.4244, 0.1000, 0.4242]})
(
    df.lazy()
    .with_column(
        # round to 4 decimals, then split "0.1" into ["0", "1"]
        pl.col("a").round(4).cast(pl.Utf8).str.split(".")
    )
    .with_column(
        # right-pad the fractional part to 4 digits and rejoin with "."
        pl.col("a").arr.first()
        .arr.concat(
            pl.col("a").arr.last().str.ljust(width=4, fillchar="0")
        )
        .arr.join(".")
    )
    .collect()
)
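The idea is to split each rounded value into its integer and fractional parts, right-pad the fractional part with zeros, and rejoin with ".". A simpler (if slower) sketch using a per-element Python function, assuming display strings are all you need:
df.with_column(
    # format every float as a fixed 4-decimal string (yields Utf8, not f64)
    pl.col("a").apply(lambda x: f"{x:.4f}").alias("a_formatted")
)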

Related

Can I use indexing [start:end] in expressions instead of offset and length in polars

In Expr.slice, the only supported syntax is expr.slice(offset, length), but in some cases something like expr[start:end] would be more convenient and consistent. So how can I write expr[1:6] for expr.slice(1, 5), just like in pandas?
You can add a polars.internals.expr.expr.Expr.__getitem__() method:
import polars as pl
from polars.internals.expr.expr import Expr

def expr_slice(self, subscript):
    if isinstance(subscript, slice):
        if subscript.start is not None:
            offset = subscript.start
            # note: the stop is treated as inclusive here
            length = subscript.stop - offset + 1 if subscript.stop is not None else None
        else:
            offset = None
            length = subscript.stop
        return Expr.slice(self, offset, length)
    else:
        # a single integer selects one element
        return Expr.slice(self, subscript, 1)

Expr.__getitem__ = expr_slice

df = pl.DataFrame({'a': range(10)})
print(df.select(pl.col('a')[2:5]))
print(df.select(pl.col('a')[3]))
shape: (4, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 2   │
├╌╌╌╌╌┤
│ 3   │
├╌╌╌╌╌┤
│ 4   │
├╌╌╌╌╌┤
│ 5   │
└─────┘
shape: (1, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 3   │
└─────┘
Note that my example doesn't take into account the logic for negative indexing; you can implement that yourself.
Also note that this implementation treats the stop as inclusive, so in your example expr[1:6] would map to expr.slice(1, 6) rather than expr.slice(1, 5).
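If you do want negative indices, a minimal sketch, assuming the frame height is known up front (an expression cannot know the series length by itself); it keeps the inclusive-stop convention used above:
def expr_slice_signed(self, subscript, height):
    # hypothetical variant: normalize negative positions against a known height
    def normalize(i):
        return i + height if i is not None and i < 0 else i
    if isinstance(subscript, slice):
        offset = normalize(subscript.start)
        stop = normalize(subscript.stop)
        length = stop - (offset or 0) + 1 if stop is not None else None
        return Expr.slice(self, offset or 0, length)
    return Expr.slice(self, normalize(subscript), 1)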

Filtering selected columns based on column aggregate

I wish to select only columns with fewer than 3 unique values. I can generate a boolean mask via pl.all().n_unique() < 3, but I don't know if I can use that mask via the polars API for this.
Currently, I am solving it in Python. Is there a more idiomatic way?
import polars as pl, pandas as pd
df = pl.DataFrame({"col1":[1,1,2], "col2":[1,2,3], "col3":[3,3,3]})
# target is:
# df_few_unique = pl.DataFrame({"col1":[1,1,2], "col3":[3,3,3]})
# my attempt:
mask = df.select(pl.all().n_unique() < 3).to_numpy()[0]
cols = [col for col, m in zip(df.columns, mask) if m]
df_few_unique = df.select(cols)
df_few_unique
Equivalent in pandas:
df_pandas = df.to_pandas()
mask = (df_pandas.nunique() < 3)
df_pandas.loc[:, mask]
Edit: after some thinking, I discovered an even easier way to do this, one that doesn't rely on boolean masking at all.
pl.select(
    [s for s in df if s.n_unique() < 3]
)
shape: (3, 2)
┌──────┬──────┐
│ col1 ┆ col3 │
│ ---  ┆ ---  │
│ i64  ┆ i64  │
╞══════╪══════╡
│ 1    ┆ 3    │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 1    ┆ 3    │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2    ┆ 3    │
└──────┴──────┘
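This works because iterating a DataFrame yields its columns as Series, so each s.n_unique() is computed per column:
for s in df:
    # each s is a polars Series holding one column of df
    print(s.name, s.n_unique())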
Previous answer
One easy way is to use the compress function from Python's itertools.
from itertools import compress
>>> df.select(compress(df.columns, df.select(pl.all().n_unique() < 3).row(0)))
shape: (3, 2)
┌──────┬──────┐
│ col1 ┆ col3 │
│ ---  ┆ ---  │
│ i64  ┆ i64  │
╞══════╪══════╡
│ 1    ┆ 3    │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 1    ┆ 3    │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2    ┆ 3    │
└──────┴──────┘
compress allows us to apply a boolean mask to a list, which in this case is a list of column names.
>>> list(compress(df.columns, df.select(pl.all().n_unique() < 3).row(0)))
['col1', 'col3']

Are there the same functions as Power BI desktop measure definitions in python-polars?

I want to use python-polars to replace Power BI desktop tools. The data model can be a multi-column dataset including all tabular model columns, but how can I use a dynamic measure definition in python-polars? For example:
sum('amount') filter ('Country' = 'USA')
I want to define this measure in configuration files.
(disclaimer: I am not familiar with Power BI, so I am answering this based on the remainder of your question).
Let's say the data is
import polars as pl
df = pl.DataFrame({"Country": ["USA", "USA", "Japan"],
                   "amount": [100, 200, 400]})
shape: (3, 2)
┌─────────┬────────┐
│ Country ┆ amount │
│ ---     ┆ ---    │
│ str     ┆ i64    │
╞═════════╪════════╡
│ USA     ┆ 100    │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ USA     ┆ 200    │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ Japan   ┆ 400    │
└─────────┴────────┘
You could define your expression in the configuration file as follows:
expr = pl.col('amount').filter(pl.col('Country')=='USA').sum()
and in your code base you could do
from <expression file above> import expr
df.select(expr)
shape: (1, 1)
┌────────┐
│ amount │
│ ---    │
│ i64    │
╞════════╡
│ 300    │
└────────┘
If the configuration file should be plain text (I would strongly advise against this: it is unsafe, not testable, and has no tooling support), you could store the following in the config file
pl.col('amount').filter(pl.col('Country')=='USA').sum()
and read this as expr in your Python code, and use eval:
expr = <read config file as string>
df.select(eval(expr))
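A safer variant of the same idea is to keep the configuration declarative (e.g. JSON) and build the expression in code. A sketch, assuming a hypothetical config layout (field names are illustrative, not a standard):
import json
import polars as pl

measure = json.loads(
    '{"agg": "sum", "column": "amount", "filter_col": "Country", "filter_val": "USA"}'
)
expr = pl.col(measure["column"]).filter(
    pl.col(measure["filter_col"]) == measure["filter_val"]
)
expr = getattr(expr, measure["agg"])()  # dispatch "sum" -> .sum()
df.select(expr)  # same result as above: 300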

Sum columns based on column names in a list for polars

So, in Python Polars, I can add one or more columns to make a new column by using an expression, something like
frame.with_column((pl.col('colname1') + pl.col('colname2')).alias('new_colname'))
However, if I have all the column names in a list, is there a way to sum all the columns in that list and create a new column based on the result?
Thanks
The sum expression supports horizontal summing. From the docs:
List[Expr] -> aggregate the sum value horizontally.
Sample code for reference:
import polars as pl
df = pl.DataFrame({"a": [1, 2, 3], "b": [1, 2, None]})
print(df)
This results in something like:
shape: (3, 2)
┌─────┬──────┐
│ a   ┆ b    │
│ --- ┆ ---  │
│ i64 ┆ i64  │
╞═════╪══════╡
│ 1   ┆ 1    │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2   ┆ 2    │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 3   ┆ null │
└─────┴──────┘
On this you can do something like:
cols = ["a", "b"]
df2 = df.select(pl.sum([pl.col(i) for i in cols]).alias('new_colname'))
print(df2)
Which will result in:
shape: (3, 1)
┌─────────────┐
│ new_colname │
│ ---         │
│ i64         │
╞═════════════╡
│ 2           │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4           │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ null        │
└─────────────┘
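Note from the output that the horizontal sum propagates nulls: the row containing None sums to null. If nulls should count as zero instead, a sketch:
df3 = df.select(
    # fill nulls per column before the horizontal sum
    pl.sum([pl.col(c).fill_null(0) for c in cols]).alias("new_colname")
)
print(df3)  # the last row becomes 3 instead of null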

Calculating the row-wise minimum of two series?

Hi,
Is there any function that can generate a series by calculating the row-wise minimum of two series? The functionality would be similar to np.minimum:
a = [1, 4, 2, 5, 2]
b = [5, 1, 4, 2, 5]
np.minimum(a, b)  # -> [1, 1, 2, 2, 2]
Thanks.
q = df.lazy().with_column(
    pl.when(pl.col("a") > pl.col("b"))
    .then(pl.col("b"))
    .otherwise(pl.col("a"))
    .alias("minimum")
)
df = q.collect()
I didn't test it, but I think this should work.
As the accepted answer states, you can use a pl.when -> then -> otherwise expression.
If you have a wider DataFrame, you can use the DataFrame.min() method, the pl.min expression, or a pl.fold for more control.
import polars as pl

df = pl.DataFrame({
    "a": [1, 4, 2, 5, 2],
    "b": [5, 1, 4, 2, 5],
    "c": [3, 2, 5, 7, 2],
})
df.min(axis=1)
This outputs:
shape: (5,)
Series: 'a' [i64]
[
	1
	1
	2
	2
	2
]
Min expression
When given multiple expression inputs, pl.min determines the minimum row-wise instead of column-wise.
df.select(pl.min(["a", "b", "c"]))
Outputs:
shape: (5, 1)
┌─────┐
│ min │
│ --- │
│ i64 │
╞═════╡
│ 1   │
├╌╌╌╌╌┤
│ 1   │
├╌╌╌╌╌┤
│ 2   │
├╌╌╌╌╌┤
│ 2   │
├╌╌╌╌╌┤
│ 2   │
└─────┘
Fold expression
Or with a fold expression:
df.select(
    pl.fold(
        # the accumulator starts at a large sentinel; each step keeps the
        # element-wise smaller of the accumulator and the next column
        int(1e9),
        lambda acc, a: pl.when(acc > a).then(a).otherwise(acc),
        ["a", "b", "c"],
    )
)
shape: (5, 1)
┌─────────┐
│ literal │
│ ---     │
│ i64     │
╞═════════╡
│ 1       │
├╌╌╌╌╌╌╌╌╌┤
│ 1       │
├╌╌╌╌╌╌╌╌╌┤
│ 2       │
├╌╌╌╌╌╌╌╌╌┤
│ 2       │
├╌╌╌╌╌╌╌╌╌┤
│ 2       │
└─────────┘
The fold allows for cooler things, because you operate over expressions.
So we could, for instance, compute the min of the squared columns:
pl.fold(int(1e9), lambda acc, a: pl.when(acc > a).then(a).otherwise(acc), [pl.all()**2])
Or we could compute the min of the square root of column "a" and the squares of the rest of the columns:
pl.fold(int(1e9), lambda acc, a: pl.when(acc > a).then(a).otherwise(acc), [pl.col("a").sqrt(), pl.all().exclude("a")**2])
You get the idea.
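For completeness: much newer Polars releases split the horizontal aggregations into dedicated functions, so if the pl.min([...]) form above errors for you, the current spelling is likely (check your version's docs):
df.select(pl.min_horizontal("a", "b", "c"))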