polars df.sum(axis=1) for Expressions (and other functions, maybe median?) - python-polars

I'm trying to find an idiomatic polars way of aggregating data per row. It's not strictly about the .sum function; it's about all aggregations where an axis makes sense.
Take a look at these pandas examples:
df[df.sum(axis=1) > 5]
df.assign(median=df.median(axis=1))
df[df.rolling(3, axis=1).mean() > 0]
However, with polars, the problems start very quickly:
df.filter(df.sum(axis=1) > 5)
df.with_column(df.mean(axis=1).alias('mean')) - can't do median
df... - can't do rolling, rank, or anything more complex.
I saw the page where the polars authors suggest doing everything by hand with folds, but there are cases where the logic doesn't fit into one input and one accumulator variable (e.g. a simple median).
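For reference, a plain row-wise sum with a fold looks roughly like this (fine for a sum, but it doesn't generalize to something like a median):
import polars as pl

df = pl.DataFrame({"a": [1, 2], "b": [4, 5], "c": [7, 8]})

# horizontal sum: one accumulator, fed one column at a time
df.select(
    pl.fold(acc=pl.lit(0), f=lambda acc, x: acc + x, exprs=pl.all()).alias("sum")
)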
Moreover, this approach doesn't seem to work at all when using Expressions, e.g. pl.all().sum(axis=1) is not valid since, for some reason, the axis argument is absent.
So the question is: how do I deal with these situations? I'd like to have the full polars API at my fingertips, instead of whatever suboptimal solutions I can come up with.

Row-wise computations:
You can create a list and access the .arr namespace for row-wise computations.
@ritchie46's answer regarding rank(axis=1) is also useful reading.
.arr.eval() can be used for more complex computations.
df = pl.DataFrame([[1, 2, 3], [4, 5, 3], [1, 8, 9]])

(df.with_column(pl.concat_list(pl.all()).alias("row"))
   .with_columns([
       pl.col("row").arr.sum().alias("sum"),
       pl.col("row").arr.mean().alias("mean"),
       pl.col("row").arr.eval(pl.all().median(), parallel=True).alias("median"),
       pl.col("row").arr.eval(pl.all().rank(), parallel=True).alias("rank"),
   ])
)
shape: (3, 8)
┌──────────┬──────────┬──────────┬───────────┬─────┬──────┬───────────┬─────────────────┐
│ column_0 | column_1 | column_2 | row | sum | mean | median | rank │
│ --- | --- | --- | --- | --- | --- | --- | --- │
│ i64 | i64 | i64 | list[i64] | i64 | f64 | list[f64] | list[f32] │
╞══════════╪══════════╪══════════╪═══════════╪═════╪══════╪═══════════╪═════════════════╡
│ 1 | 4 | 1 | [1, 4, 1] | 6 | 2.0 | [1.0] | [1.5, 3.0, 1.5] │
├──────────┼──────────┼──────────┼───────────┼─────┼──────┼───────────┼─────────────────┤
│ 2 | 5 | 8 | [2, 5, 8] | 15 | 5.0 | [5.0] | [1.0, 2.0, 3.0] │
├──────────┼──────────┼──────────┼───────────┼─────┼──────┼───────────┼─────────────────┤
│ 3 | 3 | 9 | [3, 3, 9] | 15 | 5.0 | [3.0] | [1.5, 1.5, 3.0] │
└──────────┴──────────┴──────────┴───────────┴─────┴──────┴───────────┴─────────────────┘
pl.sum()
Can be given a list of columns.
>>> df.select(pl.sum(pl.all()))
shape: (3, 1)
┌─────┐
│ sum │
│ --- │
│ i64 │
╞═════╡
│ 6 │
├─────┤
│ 15 │
├─────┤
│ 15 │
└─────┘
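An explicit list of expressions should also work for a subset of columns (a sketch, assuming the auto-generated column names from the example frame):
df.select(
    pl.sum([pl.col("column_0"), pl.col("column_1")]).alias("partial_sum")
)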
.rolling_mean()
Can be accessed inside .arr.eval()
pdf = df.to_pandas()
pdf[pdf.rolling(2, axis=1).mean() > 3]
column_0 column_1 column_2
0 NaN NaN NaN
1 NaN 5.0 8.0
2 NaN NaN 9.0
(df.with_column(pl.concat_list(pl.all()).alias("row"))
   .with_column(
       pl.col("row").arr.eval(
           pl.when(pl.all().rolling_mean(2) > 3)
             .then(pl.all()),
           parallel=True)
       .alias("rolling[mean] > 3"))
)
shape: (3, 5)
┌──────────┬──────────┬──────────┬───────────┬────────────────────┐
│ column_0 | column_1 | column_2 | row | rolling[mean] > 3 │
│ --- | --- | --- | --- | --- │
│ i64 | i64 | i64 | list[i64] | list[i64] │
╞══════════╪══════════╪══════════╪═══════════╪════════════════════╡
│ 1 | 4 | 1 | [1, 4, 1] | [null, null, null] │
├──────────┼──────────┼──────────┼───────────┼────────────────────┤
│ 2 | 5 | 8 | [2, 5, 8] | [null, 5, 8] │
├──────────┼──────────┼──────────┼───────────┼────────────────────┤
│ 3 | 3 | 9 | [3, 3, 9] | [null, null, 9] │
└──────────┴──────────┴──────────┴───────────┴────────────────────┘
If you want to "expand" the lists into columns:
Turn the list into a struct with .arr.to_struct()
.unnest() the struct.
Rename the columns if needed (see the rename sketch after the output below).
(df.with_column(pl.concat_list(pl.all()).alias("row"))
   .select(
       pl.col("row").arr.eval(
           pl.when(pl.all().rolling_mean(2) > 3)
             .then(pl.all()),
           parallel=True)
       .arr.to_struct()
       .alias("rolling[mean]"))
   .unnest("rolling[mean]")
)
shape: (3, 3)
┌─────────┬─────────┬─────────┐
│ field_0 | field_1 | field_2 │
│ --- | --- | --- │
│ i64 | i64 | i64 │
╞═════════╪═════════╪═════════╡
│ null | null | null │
├─────────┼─────────┼─────────┤
│ null | 5 | 8 │
├─────────┼─────────┼─────────┤
│ null | null | 9 │
└─────────┴─────────┴─────────┘
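The rename step from the list above isn't shown; a sketch, assuming the unnested frame is bound to a variable named unnested, could map the struct fields back to the source column names:
# hypothetical: restore the original column names on the unnested result
renamed = unnested.rename(dict(zip(unnested.columns, df.columns)))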
.transpose()
You could always transpose the dataframe to switch the axis and use the "regular" api.
(df.transpose()
   .select(
       pl.when(pl.all().rolling_mean(2) > 3)
       .then(pl.all())
       .keep_name())
   .transpose())
shape: (3, 3)
┌──────────┬──────────┬──────────┐
│ column_0 | column_1 | column_2 │
│ --- | --- | --- │
│ i64 | i64 | i64 │
╞══════════╪══════════╪══════════╡
│ null | null | null │
├──────────┼──────────┼──────────┤
│ null | 5 | 8 │
├──────────┼──────────┼──────────┤
│ null | null | 9 │
└──────────┴──────────┴──────────┘

Related

Polars, sum column based on other column values in `groupby`

I want to calculate the sum of a column in a groupby based on values of another column. Pretty much what pl.Expr.value_counts does (see example), but I want to apply a function (e.g. sum) to a specific column, in this case the Price column.
I know that I could do the groupby on Weather + Windy and then aggregate, but I can't do that since I have plenty of other aggregations I need to compute on only the Weather groupby.
import polars as pl

df = pl.DataFrame(
    data={
        "Weather": ["Rain", "Sun", "Rain", "Sun", "Rain", "Sun", "Rain", "Sun"],
        "Price": [1, 2, 3, 4, 5, 6, 7, 8],
        "Windy": ["Y", "Y", "Y", "Y", "N", "N", "N", "N"],
    }
)
I can get the number of counts per windy day with value_counts:
df_agg = (df
    .groupby("Weather")
    .agg([
        pl.col("Windy")
        .value_counts()
        .alias("Price")
    ])
)
shape: (2, 2)
┌─────────┬────────────────────┐
│ Weather ┆ Price │
│ --- ┆ --- │
│ str ┆ list[struct[2]] │
╞═════════╪════════════════════╡
│ Sun ┆ [{"Y",2}, {"N",2}] │
│ Rain ┆ [{"Y",2}, {"N",2}] │
└─────────┴────────────────────┘
I would like to do something like this:
df_agg = (df
    .groupby("Weather")
    .agg([
        pl.col("Windy")
        .custom_fun_on_other_col("Price", sum)
        .alias("Price")
    ])
)
and this is the result I want:
shape: (2, 2)
┌─────────┬────────────────────┐
│ Weather ┆ Price │
│ --- ┆ --- │
│ str ┆ list[struct[2]] │
╞═════════╪════════════════════╡
│ Sun ┆ [{"Y",6},{"N",14}] │
│ Rain ┆ [{"Y",4},{"N",12}] │
└─────────┴────────────────────┘
(Using polars version 0.15.15)
Inside a groupby context you could combine .repeat_by().flatten() with .value_counts() - repeating each Windy value Price times means that counting the repeats is the same as summing Price per group:
df.groupby("Weather").agg(
pl.col("Windy").repeat_by("Price").flatten().value_counts()
.alias("Price")
)
shape: (2, 2)
┌─────────┬─────────────────────┐
│ Weather | Price │
│ --- | --- │
│ str | list[struct[2]] │
╞═════════╪═════════════════════╡
│ Sun | [{"N",14}, {"Y",6}] │
├─────────┼─────────────────────┤
│ Rain | [{"Y",4}, {"N",12}] │
└─────────┴─────────────────────┘
Do you know about Window functions?
df.with_columns(
    pl.sum("Price").over(["Weather", "Windy"]).alias("sum")
)
shape: (8, 4)
┌─────────┬───────┬───────┬─────┐
│ Weather | Price | Windy | sum │
│ --- | --- | --- | --- │
│ str | i64 | str | i64 │
╞═════════╪═══════╪═══════╪═════╡
│ Rain | 1 | Y | 4 │
├─────────┼───────┼───────┼─────┤
│ Sun | 2 | Y | 6 │
├─────────┼───────┼───────┼─────┤
│ Rain | 3 | Y | 4 │
├─────────┼───────┼───────┼─────┤
│ Sun | 4 | Y | 6 │
├─────────┼───────┼───────┼─────┤
│ Rain | 5 | N | 12 │
├─────────┼───────┼───────┼─────┤
│ Sun | 6 | N | 14 │
├─────────┼───────┼───────┼─────┤
│ Rain | 7 | N | 12 │
├─────────┼───────┼───────┼─────┤
│ Sun | 8 | N | 14 │
└─────────┴───────┴───────┴─────┘
You could also create the struct if desired:
pl.struct(["Windy", pl.sum("Price").over(["Weather", "Windy"])])
For instance, you can create a temporary dataframe and then join it with the main dataframe.
tmp = (
    df.groupby(["Weather", "Windy"]).agg(pl.col("Price").sum())
      .select([pl.col("Weather"), pl.struct(["Windy", "Price"])])
      .groupby("Weather").agg(pl.list("Windy"))
)

df.groupby("Weather").agg([
    # your other aggregations ...
]).join(tmp, on="Weather")
┌─────────┬─────────────────────┐
│ Weather ┆ Windy │
│ --- ┆ --- │
│ str ┆ list[struct[2]] │
╞═════════╪═════════════════════╡
│ Rain ┆ [{"Y",4}, {"N",12}] │
│ Sun ┆ [{"N",14}, {"Y",6}] │
└─────────┴─────────────────────┘

Nested Tables in Python Polars

One of the key data analytics outputs is tables. To analyse large research databases, we frequently create nested tables with two or more levels of nesting in rows and/or columns. I can create nested tables in Pandas, but I don't know if they can be created in Polars.
I am using the database from Kaggle called 'Home Loan Approval'. Its URL is https://www.kaggle.com/datasets/rishikeshkonapure/home-loan-approval.
I use the following code in Pandas to create a nested table with 'Gender' and 'Married' in rows and 'Property_Area' and 'Self_Employed' in columns:
(
    pd.read_csv('loan_sanction_test.csv')
    .groupby(['Gender', 'Married', 'Property_Area', 'Self_Employed'])['LoanAmount'].sum()
    .unstack([2, 3])
)
In Polars, we don't have multi-indexes like Pandas. The Polars pivot function only pivots to a single level. For example, if I use ['Property_Area', 'Self_Employed'] in columns, pivot will reproduce these variables one after another and not nest them. Here's the code which illustrates this (using Polars version 0.16):
(
    pl.read_csv('loan_sanction_test.csv')
    .groupby(['Gender', 'Married', 'Property_Area', 'Self_Employed']).agg(pl.col('LoanAmount').sum())
    .pivot(index=['Gender', 'Married'], columns=['Property_Area', 'Self_Employed'], values='LoanAmount')
)
We frequently use three level deep nesting in rows as well as columns. Is there a way to generate nested tables in Polars like Pandas example above?
One thing you can try when nesting columns is concat_str. For example: (My firewall blocks Kaggle downloads so I've simulated data.)
(
    df
    .sort(['Property_Area', 'Self_Employed'])
    .with_columns(
        pl.concat_str(['Property_Area', 'Self_Employed'], "|").alias('PA_SE')
    )
    .pivot(
        index=["Gender", "Married"],
        columns=["PA_SE"],
        values="LoanAmount",
        aggregate_fn="count",
    )
)
shape: (4, 8)
+--------+---------+----------+-----------+--------------+---------------+----------+-----------+
| Gender | Married | Rural|No | Rural|Yes | Semiurban|No | Semiurban|Yes | Urban|No | Urban|Yes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| str | str | u32 | u32 | u32 | u32 | u32 | u32 |
+===============================================================================================+
| Female | No | 38 | 35 | 54 | 37 | 40 | 41 |
| Female | Yes | 46 | 45 | 38 | 45 | 45 | 38 |
| Male | Yes | 40 | 39 | 47 | 44 | 39 | 40 |
| Male | No | 34 | 48 | 44 | 48 | 45 | 30 |
+--------+---------+----------+-----------+--------------+---------------+----------+-----------+
It's not as slick as Pandas, but it does get you to an answer.
I don't think you can make nested tables, but you could do something kinda similar with
(
    df.groupby(
        [
            "Gender",
            "Married",
            "Property_Area",
            "Self_Employed",
        ]
    )
    .agg(pl.col("LoanAmount").sum())
    .pivot(
        index=["Gender", "Married", "Self_Employed"],
        columns=["Property_Area"],
        values="LoanAmount",
        aggregate_fn="sum",
    )
)
which would give you
shape: (15, 6)
┌────────┬─────────┬───────────────┬───────┬───────┬───────────┐
│ Gender ┆ Married ┆ Self_Employed ┆ Rural ┆ Urban ┆ Semiurban │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ i64 ┆ i64 ┆ i64 │
╞════════╪═════════╪═══════════════╪═══════╪═══════╪═══════════╡
│ Female ┆ No ┆ Yes ┆ 389 ┆ null ┆ 295 │
│ Female ┆ Yes ┆ No ┆ 596 ┆ 1610 ┆ 1195 │
│ Male ┆ Yes ┆ Yes ┆ 1692 ┆ 1011 ┆ 1073 │
│ null ┆ Yes ┆ No ┆ null ┆ null ┆ 270 │
│ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... │
│ Male ┆ Yes ┆ null ┆ 465 ┆ 1437 ┆ 626 │
│ null ┆ No ┆ Yes ┆ null ┆ 138 ┆ null │
│ null ┆ Yes ┆ Yes ┆ 68 ┆ null ┆ null │
│ Female ┆ Yes ┆ null ┆ null ┆ null ┆ 94 │
└────────┴─────────┴───────────────┴───────┴───────┴───────────┘

Copying row values based on Config file from polars dataframe

I have a dataframe consisting of ID, Local, Entity, Field and Global columns.
# Creating a dictionary with the data
data = {
    'ID': [4, 4, 4, 4, 4],
    'Local': ['A', 'B', 'C', 'D', 'E'],
    'Field': ['P', 'Q', 'R', 'S', 'T'],
    'Entity': ['K', 'L', 'M', 'N', 'O'],
    'Global': ['F', 'G', 'H', 'I', 'J'],
}

# Creating the dataframe
table = pl.DataFrame(data)
print(table)
shape: (5, 5)
┌─────┬───────┬───────┬──────────┬────────┐
│ ID ┆ Local ┆ Field ┆ Entity ┆ Global │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ str ┆ str │
╞═════╪═══════╪═══════╪══════════╪════════╡
│ 4 ┆ A ┆ P ┆ K ┆ F │
│ 4 ┆ B ┆ Q ┆ L ┆ G │
│ 4 ┆ C ┆ R ┆ M ┆ H │
│ 4 ┆ D ┆ S ┆ N ┆ I │
│ 4 ┆ E ┆ T ┆ O ┆ J │
└─────┴───────┴───────┴──────────┴────────┘
Within the dataset, certain rows need to be copied. For this purpose, the following config file is provided:
copying:
  - column_name: P
    source_table: K
    destination_table: X
  - column_name: S
    source_table: N
    destination_table: W
In the config file there is a column_name value which refers to the Field column, a source_table which refers to the given Entity, and a destination_table which should be the future entry in the Entity column. The goal is to enrich the data based on existing rows (just with other tables).
The solution should look like this:
shape: (7, 5)
┌─────┬───────┬───────┬──────────┬────────┐
│ ID ┆ Local ┆ Field ┆ Entity ┆ Global │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ str ┆ str │
╞═════╪═══════╪═══════╪══════════╪════════╡
│ 4 ┆ A ┆ P ┆ K ┆ F │
│ 4 ┆ B ┆ Q ┆ L ┆ G │
│ 4 ┆ C ┆ R ┆ M ┆ H │
│ 4 ┆ D ┆ S ┆ N ┆ I │
│ 4 ┆ E ┆ T ┆ O ┆ J │
│ 4 ┆ A ┆ P ┆ X ┆ F │
│ 4 ┆ D ┆ S ┆ W ┆ I │
└─────┴───────┴───────┴──────────┴────────┘
The dataset is a polars DataFrame and the config file is loaded with omegaconf. I tried it with this code:
conf.copying = [
    {"column_name": "P", "source_table": "K", "destination_table": "X"},
    {"column_name": "S", "source_table": "N", "destination_table": "W"},
]

# Iterate through the config file
for i in range(len(conf.copying)):
    # Select the rows from the table dataframe that match the column_name
    # and source_table fields in the config
    match_rows = table.filter(
        (pl.col("Field") == conf.copying[i]["column_name"])
        & (pl.col("Entity") == conf.copying[i]["source_table"])
    )
    # Keep every column except Entity
    match_rows = match_rows.select(
        [
            "ID",
            "Local",
            "Field",
            "Global",
        ]
    )
    # Add the column Entity with the destination_table
    match_rows = match_rows.with_columns(
        pl.lit(conf.copying[i]["destination_table"]).alias("Entity")
    )
    match_rows = match_rows[
        [
            "ID",
            "Local",
            "Field",
            "Entity",
            "Global",
        ]
    ]
    # Append the new rows to the original table dataframe
    df_copy = match_rows.vstack(table)
However, the data is not copied as expected and added to the existing dataset. What am I doing wrong?
It's probably better if you extract all of the rows you want in one go.
One possible approach is to build a list of when/then expressions:
rules = [
    pl.when(
        (pl.col("Field") == rule["column_name"]) &
        (pl.col("Entity") == rule["source_table"]))
    .then(rule["destination_table"])
    .alias("Entity")
    for rule in conf.copying
]
You can use pl.coalesce and .drop_nulls() to remove non-matches.
>>> table.with_columns(pl.coalesce(rules)).drop_nulls()
shape: (2, 5)
┌─────┬───────┬───────┬────────┬────────┐
│ ID | Local | Field | Entity | Global │
│ --- | --- | --- | --- | --- │
│ i64 | str | str | str | str │
╞═════╪═══════╪═══════╪════════╪════════╡
│ 4 | A | P | X | F │
├─────┼───────┼───────┼────────┼────────┤
│ 4 | D | S | W | I │
└─────┴───────┴───────┴────────┴────────┘
This can then be added to the original dataframe.
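A minimal sketch of that last step, assuming the coalesced rows are bound to a variable named matched:
matched = table.with_columns(pl.coalesce(rules)).drop_nulls()
df_new = pl.concat([table, matched])  # original rows followed by the copied rows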
Another style I've seen is to start with an "empty" when/then to build a .when().then() chain e.g.
rules = pl.when(False).then(None)

for rule in conf.copying:
    rules = (
        rules.when(
            (pl.col("Field") == rule["column_name"]) &
            (pl.col("Entity") == rule["source_table"]))
        .then(rule["destination_table"])
    )
>>> table.with_columns(rules.alias("Entity")).drop_nulls()
shape: (2, 5)
┌─────┬───────┬───────┬────────┬────────┐
│ ID | Local | Field | Entity | Global │
│ --- | --- | --- | --- | --- │
│ i64 | str | str | str | str │
╞═════╪═══════╪═══════╪════════╪════════╡
│ 4 | A | P | X | F │
├─────┼───────┼───────┼────────┼────────┤
│ 4 | D | S | W | I │
└─────┴───────┴───────┴────────┴────────┘
I'm not sure it would make much difference to filter the rows first and then do the Entity modification - pl.any() could be used if you were to do that:
table.filter(
    pl.any([
        (pl.col("Field") == rule["column_name"]) &
        (pl.col("Entity") == rule["source_table"])
        for rule in conf.copying
    ])
)
shape: (2, 5)
┌─────┬───────┬───────┬────────┬────────┐
│ ID | Local | Field | Entity | Global │
│ --- | --- | --- | --- | --- │
│ i64 | str | str | str | str │
╞═════╪═══════╪═══════╪════════╪════════╡
│ 4 | A | P | K | F │
├─────┼───────┼───────┼────────┼────────┤
│ 4 | D | S | N | I │
└─────┴───────┴───────┴────────┴────────┘
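If you did filter first, a sketch of the rest could reuse the list of when/then expressions from the first example to rewrite Entity on the filtered rows before appending them:
filtered = table.filter(
    pl.any([
        (pl.col("Field") == rule["column_name"]) &
        (pl.col("Entity") == rule["source_table"])
        for rule in conf.copying
    ])
)
pl.concat([table, filtered.with_columns(pl.coalesce(rules))])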

How to rename column names with first row in polars?

I have a polars dataframe and I want to set the first row as the header. I thought about renaming the columns one by one with the value of the first row of the corresponding column. How can I do this in polars?
[UPDATE]: @kirara0048's suggestion of .to_dicts() is a much simpler approach.
>>> df.head(1).to_dicts().pop()
{'column_0': 'one', 'column_1': 'two', 'column_2': 'three'}
Which can be passed directly to .rename()
df.rename(df.head(1).to_dicts().pop())
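If you also want to drop that first row afterwards, the two steps chain together (a sketch):
df.rename(df.head(1).to_dicts().pop()).slice(1)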
Perhaps there is a more direct method but you could take the first row and .transpose().to_series()
>>> df = pl.DataFrame([["one", "four"], ["two", "five"], ["three", "six"]])
>>> df.head(1).transpose().to_series()
shape: (3,)
Series: 'column_0' [str]
[
"one"
"two"
"three"
]
This can be used to create a dict of old: new and passed to .rename()
>>> df.rename(dict(zip(df.columns, df.head(1).transpose().to_series())))
shape: (2, 3)
┌──────┬──────┬───────┐
│ one | two | three │
│ --- | --- | --- │
│ str | str | str │
╞══════╪══════╪═══════╡
│ one | two | three │
├──────┼──────┼───────┤
│ four | five | six │
└──────┴──────┴───────┘
.slice(1) can be used to "remove" the first row if desired:
>>> df.rename(dict(zip(df.columns, df.head(1).transpose().to_series()))).slice(1)
shape: (1, 3)
┌──────┬──────┬───────┐
│ one | two | three │
│ --- | --- | --- │
│ str | str | str │
╞══════╪══════╪═══════╡
│ four | five | six │
└──────┴──────┴───────┘
You can also assign to .columns - I'm unsure if this is considered "bad style" or not.
>>> df.columns = df.head(1).transpose().to_series()
>>> df
shape: (2, 3)
┌──────┬──────┬───────┐
│ one | two | three │
│ --- | --- | --- │
│ str | str | str │
╞══════╪══════╪═══════╡
│ one | two | three │
├──────┼──────┼───────┤
│ four | five | six │
└──────┴──────┴───────┘

Can we/how to conditionally select columns?

Let's say I have a list of dataframes like this:
Ldfs = [
    pl.DataFrame({'a': [1.0, 2.0, 3.1], 'b': [2, 3, 4]}),
    pl.DataFrame({'b': [1, 2, 3], 'c': [2, 3, 4]}),
    pl.DataFrame({'a': [1, 2, 3], 'c': [2, 3, 4]})
]
I can't do pl.concat(Ldfs) because they don't all have the same columns, and even the ones that have 'a' in common don't have the same data type.
What I'd like to do is concat them together, adding a column of Nones whenever a column isn't there, and cast the columns to a fixed datatype.
For instance, just taking the first element of the list I'd like to have something like this work:
Ldfs[0].select(pl.when(pl.col('c')).then(pl.col('c').cast(pl.Float64())).otherwise(pl.lit(None).cast(pl.Float64())).alias('c'))
of course, this results in NotFoundError: c
Would an approach like this work for you? (I'll convert your DataFrames to LazyFrames for added fun.)
Ldfs = [
    pl.DataFrame({"a": [1.0, 2.0, 3.1], "b": [2, 3, 4]}).lazy(),
    pl.DataFrame({"b": [1, 2, 3], "c": [2, 3, 4]}).lazy(),
    pl.DataFrame({"a": [1, 2, 3], "c": [2, 3, 4]}).lazy(),
]

my_schema = {
    "a": pl.Float64,
    "b": pl.Int64,
    "c": pl.UInt32,
}
def fix_schema(ldf: pl.LazyFrame) -> pl.LazyFrame:
    ldf = (
        ldf.with_columns(
            [
                pl.col(col_nm).cast(col_type)
                for col_nm, col_type in my_schema.items()
                if col_nm in ldf.columns
            ]
        )
        .with_columns(
            [
                pl.lit(None, dtype=col_type).alias(col_nm)
                for col_nm, col_type in my_schema.items()
                if col_nm not in ldf.columns
            ]
        )
        .select(my_schema.keys())
    )
    return ldf

pl.concat([fix_schema(next_frame) for next_frame in Ldfs],
          how="vertical").collect()
shape: (9, 3)
┌──────┬──────┬──────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ f64 ┆ i64 ┆ u32 │
╞══════╪══════╪══════╡
│ 1.0 ┆ 2 ┆ null │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2.0 ┆ 3 ┆ null │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 3.1 ┆ 4 ┆ null │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ null ┆ 1 ┆ 2 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ null ┆ 2 ┆ 3 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ null ┆ 3 ┆ 4 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 1.0 ┆ null ┆ 2 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2.0 ┆ null ┆ 3 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 3.0 ┆ null ┆ 4 │
└──────┴──────┴──────┘
.from_dicts() can infer the types and column names:
>>> df = pl.from_dicts([frame.to_dict() for frame in Ldfs])
>>> df
shape: (3, 3)
┌─────────────────┬───────────┬───────────┐
│ a | b | c │
│ --- | --- | --- │
│ list[f64] | list[i64] | list[i64] │
╞═════════════════╪═══════════╪═══════════╡
│ [1.0, 2.0, 3.1] | [2, 3, 4] | null │
├─────────────────┼───────────┼───────────┤
│ null | [1, 2, 3] | [2, 3, 4] │
├─────────────────┼───────────┼───────────┤
│ [1.0, 2.0, 3.0] | null | [2, 3, 4] │
└─────────────────┴───────────┴───────────┘
With the right sized [null, ...] lists - you could .explode() all columns.
>>> nulls = pl.Series([[None] * len(frame) for frame in Ldfs])
... (
... pl.from_dicts([
... frame.to_dict() for frame in Ldfs
... ])
... .with_columns(
... pl.all().fill_null(nulls))
... .explode(pl.all())
... )
shape: (9, 3)
┌──────┬──────┬──────┐
│ a | b | c │
│ --- | --- | --- │
│ f64 | i64 | i64 │
╞══════╪══════╪══════╡
│ 1.0 | 2 | null │
├──────┼──────┼──────┤
│ 2.0 | 3 | null │
├──────┼──────┼──────┤
│ 3.1 | 4 | null │
├──────┼──────┼──────┤
│ null | 1 | 2 │
├──────┼──────┼──────┤
│ null | 2 | 3 │
├──────┼──────┼──────┤
│ null | 3 | 4 │
├──────┼──────┼──────┤
│ 1.0 | null | 2 │
├──────┼──────┼──────┤
│ 2.0 | null | 3 │
├──────┼──────┼──────┤
│ 3.0 | null | 4 │
└──────┴──────┴──────┘