I have a polars dataframe and I want to set the first row as the header. I thought about renaming each column, one by one, with the value of the first row in the corresponding column. How can I do this in polars?
[UPDATE]: @kirara0048's suggestion of .to_dicts() is a much simpler approach.
>>> df.head(1).to_dicts().pop()
{'column_0': 'one', 'column_1': 'two', 'column_2': 'three'}
Which can be passed directly to .rename()
df.rename(df.head(1).to_dicts().pop())
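Put together, a minimal end-to-end sketch (using the same example frame as below): promote the first row to the header, then drop it with .slice(1):
import polars as pl

df = pl.DataFrame([["one", "four"], ["two", "five"], ["three", "six"]])
df = df.rename(df.head(1).to_dicts().pop()).slice(1)
# columns are now "one", "two", "three"; the remaining row is ("four", "five", "six")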
Perhaps there is a more direct method, but you could take the first row and .transpose().to_series():
>>> df = pl.DataFrame([["one", "four"], ["two", "five"], ["three", "six"]])
>>> df.head(1).transpose().to_series()
shape: (3,)
Series: 'column_0' [str]
[
"one"
"two"
"three"
]
This can be used to create a dict of old: new and passed to .rename()
>>> df.rename(dict(zip(df.columns, df.head(1).transpose().to_series())))
shape: (2, 3)
┌──────┬──────┬───────┐
│ one | two | three │
│ --- | --- | --- │
│ str | str | str │
╞══════╪══════╪═══════╡
│ one | two | three │
├──────┼──────┼───────┤
│ four | five | six │
└──────┴──────┴───────┘
.slice(1) can be used to "remove" the first row if desired:
>>> df.rename(dict(zip(df.columns, df.head(1).transpose().to_series()))).slice(1)
shape: (1, 3)
┌──────┬──────┬───────┐
│ one | two | three │
│ --- | --- | --- │
│ str | str | str │
╞══════╪══════╪═══════╡
│ four | five | six │
└──────┴──────┴───────┘
You can also assign to .columns - I'm unsure if this is considered "bad style" or not.
>>> df.columns = df.head(1).transpose().to_series()
>>> df
shape: (2, 3)
┌──────┬──────┬───────┐
│ one | two | three │
│ --- | --- | --- │
│ str | str | str │
╞══════╪══════╪═══════╡
│ one | two | three │
├──────┼──────┼───────┤
│ four | five | six │
└──────┴──────┴───────┘
I want to calculate the sum of a column in a groupby based on values of another column. Pretty much what pl.Expr.value_counts does (see example), but I want to apply a function (e.g. sum) to a specific column, in this case the Price column.
I know that I could do the groupby on Weather + Windy and then aggregate, but I can't do that, since I have plenty of other aggregations that need to be computed on the Weather groupby alone.
import polars as pl
df = pl.DataFrame(
    data={
        "Weather": ["Rain", "Sun", "Rain", "Sun", "Rain", "Sun", "Rain", "Sun"],
        "Price": [1, 2, 3, 4, 5, 6, 7, 8],
        "Windy": ["Y", "Y", "Y", "Y", "N", "N", "N", "N"],
    }
)
I can get the number of counts per windy day with value_counts:
df_agg = (df
    .groupby("Weather")
    .agg([
        pl.col("Windy")
        .value_counts()
        .alias("Price")
    ])
)
shape: (2, 2)
┌─────────┬────────────────────┐
│ Weather ┆ Price │
│ --- ┆ --- │
│ str ┆ list[struct[2]] │
╞═════════╪════════════════════╡
│ Sun ┆ [{"Y",2}, {"N",2}] │
│ Rain ┆ [{"Y",2}, {"N",2}] │
└─────────┴────────────────────┘
I would like to do something like this:
df_agg = (df
    .groupby("Weather")
    .agg([
        pl.col("Windy")
        .custom_fun_on_other_col("Price", sum)
        .alias("Price")
    ])
)
and this is the result I want:
shape: (2, 2)
┌─────────┬────────────────────┐
│ Weather ┆ Price │
│ --- ┆ --- │
│ str ┆ list[struct[2]] │
╞═════════╪════════════════════╡
│ Sun ┆ [{"Y",6},{"N",14}] │
│ Rain ┆ [{"Y",4},{"N",12}] │
└─────────┴────────────────────┘
(Using polars version 15.15)
Inside a groupby context you could combine .repeat_by().flatten() with .value_counts(): repeating each Windy value Price times means the resulting counts are exactly the per-value sums of Price.
df.groupby("Weather").agg(
pl.col("Windy").repeat_by("Price").flatten().value_counts()
.alias("Price")
)
shape: (2, 2)
┌─────────┬─────────────────────┐
│ Weather | Price │
│ --- | --- │
│ str | list[struct[2]] │
╞═════════╪═════════════════════╡
│ Sun | [{"N",14}, {"Y",6}] │
├─────────┼─────────────────────┤
│ Rain | [{"Y",4}, {"N",12}] │
└─────────┴─────────────────────┘
Do you know about Window functions?
df.with_columns(
    pl.sum("Price").over(["Weather", "Windy"]).alias("sum")
)
shape: (8, 4)
┌─────────┬───────┬───────┬─────┐
│ Weather | Price | Windy | sum │
│ --- | --- | --- | --- │
│ str | i64 | str | i64 │
╞═════════╪═══════╪═══════╪═════╡
│ Rain | 1 | Y | 4 │
├─────────┼───────┼───────┼─────┤
│ Sun | 2 | Y | 6 │
├─────────┼───────┼───────┼─────┤
│ Rain | 3 | Y | 4 │
├─────────┼───────┼───────┼─────┤
│ Sun | 4 | Y | 6 │
├─────────┼───────┼───────┼─────┤
│ Rain | 5 | N | 12 │
├─────────┼───────┼───────┼─────┤
│ Sun | 6 | N | 14 │
├─────────┼───────┼───────┼─────┤
│ Rain | 7 | N | 12 │
├─────────┼───────┼───────┼─────┤
│ Sun | 8 | N | 14 │
└─────────┴───────┴───────┴─────┘
You could also create the struct if desired:
pl.struct(["Windy", pl.sum("Price").over(["Weather", "Windy"])])
For instance, you can create a temporary dataframe and then join it with the main dataframe.
tmp = (
    df.groupby(["Weather", "Windy"]).agg(pl.col("Price").sum())
    .select([pl.col("Weather"), pl.struct(["Windy", "Price"])])
    .groupby("Weather").agg(pl.list("Windy"))
)
df.groupby("Weather").agg([
# your another aggregations ...
]).join(tmp, on="Weather")
┌─────────┬─────────────────────┐
│ Weather ┆ Windy │
│ --- ┆ --- │
│ str ┆ list[struct[2]] │
╞═════════╪═════════════════════╡
│ Rain ┆ [{"Y",4}, {"N",12}] │
│ Sun ┆ [{"N",14}, {"Y",6}] │
└─────────┴─────────────────────┘
I have a dataframe consisting of ID, Local, Entity, Field and Global columns.
# Creating a dictionary with the data
data = {'ID': [4, 4, 4, 4, 4],
        'Local': ['A', 'B', 'C', 'D', 'E'],
        'Field': ['P', 'Q', 'R', 'S', 'T'],
        'Entity': ['K', 'L', 'M', 'N', 'O'],
        'Global': ['F', 'G', 'H', 'I', 'J']}
# Creating the dataframe
table = pl.DataFrame(data)
print(table)
shape: (5, 5)
┌─────┬───────┬───────┬──────────┬────────┐
│ ID ┆ Local ┆ Field ┆ Entity ┆ Global │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ str ┆ str │
╞═════╪═══════╪═══════╪══════════╪════════╡
│ 4 ┆ A ┆ P ┆ K ┆ F │
│ 4 ┆ B ┆ Q ┆ L ┆ G │
│ 4 ┆ C ┆ R ┆ M ┆ H │
│ 4 ┆ D ┆ S ┆ N ┆ I │
│ 4 ┆ E ┆ T ┆ O ┆ J │
└─────┴───────┴───────┴──────────┴────────┘
Within the dataset, certain rows need to be copied. For this purpose, the following config file is provided:
copying:
  - column_name: P
    source_table: K
    destination_table: X
  - column_name: S
    source_table: N
    destination_table: W
In the config file there is a column_name value which refers to the Field column, a source_table which refers to the given Entity and a destination_table which should be the future entry in the Entity column. The goal is to enrich data based on existing rows (just with other tables).
The solution should look like this:
shape: (7, 5)
┌─────┬───────┬───────┬──────────┬────────┐
│ ID ┆ Local ┆ Field ┆ Entity ┆ Global │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ str ┆ str │
╞═════╪═══════╪═══════╪══════════╪════════╡
│ 4 ┆ A ┆ P ┆ K ┆ F │
│ 4 ┆ B ┆ Q ┆ L ┆ G │
│ 4 ┆ C ┆ R ┆ M ┆ H │
│ 4 ┆ D ┆ S ┆ N ┆ I │
│ 4 ┆ E ┆ T ┆ O ┆ J │
│ 4 ┆ A ┆ P ┆ X ┆ F │
│ 4 ┆ D ┆ S ┆ W ┆ I │
└─────┴───────┴───────┴──────────┴────────┘
The dataset is a polars DataFrame and the config file is loaded as omegaconf. I tried it with this code:
conf.copying = [
    {"column_name": "P", "source_table": "K", "destination_table": "X"},
    {"column_name": "S", "source_table": "N", "destination_table": "W"},
]
# Iterate through the config file
for i in range(len(conf.copying)):
    # Select the rows from the table dataframe that match the column_name
    # and source_table fields in the config
    match_rows = table.filter(
        (pl.col("Field") == conf.copying[i]["column_name"])
        & (pl.col("Entity") == conf.copying[i]["source_table"])
    )

    # Keep all columns except Entity
    match_rows = match_rows.select(
        [
            "ID",
            "Local",
            "Field",
            "Global",
        ]
    )

    # Add the Entity column with the destination_table value
    match_rows = match_rows.with_columns(
        pl.lit(conf.copying[i]["destination_table"]).alias("Entity")
    )

    # Restore the original column order
    match_rows = match_rows[
        [
            "ID",
            "Local",
            "Field",
            "Entity",
            "Global",
        ]
    ]

    # Append the new rows to the original table dataframe
    df_copy = match_rows.vstack(table)
However, the data is not copied as expected and added to the existing dataset. What am I doing wrong?
In the loop shown, df_copy is reassigned on every iteration, so only the matches for the last rule (stacked onto the original table) survive. It's probably better if you extract all of the rows you want in one go.
One possible approach is to build a list of when/then expressions:
rules = [
    pl.when(
        (pl.col("Field") == rule["column_name"]) &
        (pl.col("Entity") == rule["source_table"]))
    .then(rule["destination_table"])
    .alias("Entity")
    for rule in conf.copying
]
You can use pl.coalesce and .drop_nulls() to remove non-matches.
>>> table.with_columns(pl.coalesce(rules)).drop_nulls()
shape: (2, 5)
┌─────┬───────┬───────┬────────┬────────┐
│ ID | Local | Field | Entity | Global │
│ --- | --- | --- | --- | --- │
│ i64 | str | str | str | str │
╞═════╪═══════╪═══════╪════════╪════════╡
│ 4 | A | P | X | F │
├─────┼───────┼───────┼────────┼────────┤
│ 4 | D | S | W | I │
└─────┴───────┴───────┴────────┴────────┘
This can then be added to the original dataframe.
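For example, a minimal sketch (new_rows is an illustrative name), using pl.concat or .vstack() as in the question:
new_rows = table.with_columns(pl.coalesce(rules)).drop_nulls()
df_copy = pl.concat([table, new_rows])  # or table.vstack(new_rows)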
Another style I've seen is to start with an "empty" when/then to build a .when().then() chain e.g.
rules = pl.when(False).then(None)

for rule in conf.copying:
    rules = (
        rules.when(
            (pl.col("Field") == rule["column_name"]) &
            (pl.col("Entity") == rule["source_table"]))
        .then(rule["destination_table"])
    )
>>> table.with_columns(rules.alias("Entity")).drop_nulls()
shape: (2, 5)
┌─────┬───────┬───────┬────────┬────────┐
│ ID | Local | Field | Entity | Global │
│ --- | --- | --- | --- | --- │
│ i64 | str | str | str | str │
╞═════╪═══════╪═══════╪════════╪════════╡
│ 4 | A | P | X | F │
├─────┼───────┼───────┼────────┼────────┤
│ 4 | D | S | W | I │
└─────┴───────┴───────┴────────┴────────┘
I'm not sure whether it would make much difference to filter the rows first and then do the Entity modification - pl.any() could be used if you were to do that:
table.filter(
    pl.any([
        (pl.col("Field") == rule["column_name"]) &
        (pl.col("Entity") == rule["source_table"])
        for rule in conf.copying
    ])
)
shape: (2, 5)
┌─────┬───────┬───────┬────────┬────────┐
│ ID | Local | Field | Entity | Global │
│ --- | --- | --- | --- | --- │
│ i64 | str | str | str | str │
╞═════╪═══════╪═══════╪════════╪════════╡
│ 4 | A | P | K | F │
├─────┼───────┼───────┼────────┼────────┤
│ 4 | D | S | N | I │
└─────┴───────┴───────┴────────┴────────┘
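If you did filter first, a sketch of the follow-up Entity modification could reuse the rules list from above (every remaining row matches some rule, so pl.coalesce fills each Entity; new_rows is an illustrative name):
new_rows = (
    table
    .filter(pl.any([
        (pl.col("Field") == rule["column_name"]) &
        (pl.col("Entity") == rule["source_table"])
        for rule in conf.copying
    ]))
    .with_columns(pl.coalesce(rules))
)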
I'm trying to improve the performance of my polars code by converting my tags column - a list of strings - into a list of categoricals (column b in the example below):
shape: (3, 2)
┌─────┬────────────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ list[str] │
╞═════╪════════════╡
│ 1 ┆ ["a", "b"] │
│ 2 ┆ ["a"] │
│ 3 ┆ ["c", "d"] │
└─────┴────────────┘
df = pl.DataFrame({'a':[1,2,3], 'b':[['a','b'],['a'],['c','d']]})
df.with_column(pl.col('b').cast(pl.list(pl.Categorical)))
However I get the following error:
ValueError: could not convert value 'Unknown' as a Literal
Does polars support lists of categoricals?
Polars does support lists of Categoricals.
The issue is you're using pl.list() instead of pl.List() - datatypes start with uppercase letters.
>>> df.with_columns(pl.col('b').cast(pl.List(pl.Categorical)))
shape: (3, 2)
┌─────┬────────────┐
│ a | b │
│ --- | --- │
│ i64 | list[cat] │
╞═════╪════════════╡
│ 1 | ["a", "b"] │
│ 2 | ["a"] │
│ 3 | ["c", "d"] │
└─────┴────────────┘
pl.list() is something different - it appears to be shorthand syntax for pl.col().list()
I'm trying to find a polaric way of aggregating data per row. It's not strictly about the .sum function; it's about all aggregations where an axis makes sense.
Take a look at these pandas examples:
df[df.sum(axis=1) > 5]
df.assign(median=df.median(axis=1))
df[df.rolling(3, axis=1).mean() > 0]
However, with polars, problems start really quickly:
df.filter(df.sum(axis=1)>5)
df.with_column(df.mean(axis=1).alias('mean')) - can't do median
df... - can't do rolling, rank, or anything more complex.
I saw the page where the polars authors suggest doing everything by hand with folds, but there are cases where the logic doesn't fit into one input and one accumulator variable (e.g. a simple median).
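For reference, a sketch of the fold approach being referred to, covering only the row-wise-sum filter case:
# the fold starts from 0 and adds each column in turn, giving a per-row sum
df.filter(pl.fold(pl.lit(0), lambda acc, x: acc + x, pl.all()) > 5)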
Moreover, this approach seems not to work at all when using expressions, i.e. pl.all().sum(axis=1) is not valid since, for some reason, the axis argument is absent.
So the question is: how do I deal with these situations? I hope to have the full polars API at my fingertips, instead of some suboptimal solutions I can come up with.
Row-wise computations:
You can create a list and access the .arr namespace for row-wise computations.
@ritchie46's answer regarding rank(axis=1) is also useful reading.
.arr.eval() can be used for more complex computations.
df = pl.DataFrame([[1, 2, 3], [4, 5, 3], [1, 8, 9]])

(df.with_column(pl.concat_list(pl.all()).alias("row"))
   .with_columns([
       pl.col("row").arr.sum().alias("sum"),
       pl.col("row").arr.mean().alias("mean"),
       pl.col("row").arr.eval(pl.all().median(), parallel=True).alias("median"),
       pl.col("row").arr.eval(pl.all().rank(), parallel=True).alias("rank"),
   ])
)
shape: (3, 8)
┌──────────┬──────────┬──────────┬───────────┬─────┬──────┬───────────┬─────────────────┐
│ column_0 | column_1 | column_2 | row | sum | mean | median | rank │
│ --- | --- | --- | --- | --- | --- | --- | --- │
│ i64 | i64 | i64 | list[i64] | i64 | f64 | list[f64] | list[f32] │
╞══════════╪══════════╪══════════╪═══════════╪═════╪══════╪═══════════╪═════════════════╡
│ 1 | 4 | 1 | [1, 4, 1] | 6 | 2.0 | [1.0] | [1.5, 3.0, 1.5] │
├──────────┼──────────┼──────────┼───────────┼─────┼──────┼───────────┼─────────────────┤
│ 2 | 5 | 8 | [2, 5, 8] | 15 | 5.0 | [5.0] | [1.0, 2.0, 3.0] │
├──────────┼──────────┼──────────┼───────────┼─────┼──────┼───────────┼─────────────────┤
│ 3 | 3 | 9 | [3, 3, 9] | 15 | 5.0 | [3.0] | [1.5, 1.5, 3.0] │
└──────────┴──────────┴──────────┴───────────┴─────┴──────┴───────────┴─────────────────┘
pl.sum()
Can be given a list of columns.
>>> df.select(pl.sum(pl.all()))
shape: (3, 1)
┌─────┐
│ sum │
│ --- │
│ i64 │
╞═════╡
│ 6 │
├─────┤
│ 15 │
├─────┤
│ 15 │
└─────┘
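This also gives a direct counterpart to the df[df.sum(axis=1) > 5] example from the question, e.g. a sketch:
>>> df.filter(pl.sum(pl.all()) > 5)  # keep rows whose row-wise sum exceeds 5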
.rolling_mean()
Can be accessed inside .arr.eval()
pdf = df.to_pandas()
pdf[pdf.rolling(2, axis=1).mean() > 3]
column_0 column_1 column_2
0 NaN NaN NaN
1 NaN 5.0 8.0
2 NaN NaN 9.0
(df.with_column(pl.concat_list(pl.all()).alias("row"))
   .with_column(
       pl.col("row").arr.eval(
           pl.when(pl.all().rolling_mean(2) > 3)
           .then(pl.all()),
           parallel=True)
       .alias("rolling[mean] > 3"))
)
shape: (3, 5)
┌──────────┬──────────┬──────────┬───────────┬────────────────────┐
│ column_0 | column_1 | column_2 | row | rolling[mean] > 3 │
│ --- | --- | --- | --- | --- │
│ i64 | i64 | i64 | list[i64] | list[i64] │
╞══════════╪══════════╪══════════╪═══════════╪════════════════════╡
│ 1 | 4 | 1 | [1, 4, 1] | [null, null, null] │
├──────────┼──────────┼──────────┼───────────┼────────────────────┤
│ 2 | 5 | 8 | [2, 5, 8] | [null, 5, 8] │
├──────────┼──────────┼──────────┼───────────┼────────────────────┤
│ 3 | 3 | 9 | [3, 3, 9] | [null, null, 9] │
└──────────┴──────────┴──────────┴───────────┴────────────────────┘
If you want to "expand" the lists into columns:
Turn the list into a struct with .arr.to_struct()
.unnest() the struct.
Rename the columns if needed (a sketch of this step follows the output below)
(df.with_column(pl.concat_list(pl.all()).alias("row"))
   .select(
       pl.col("row").arr.eval(
           pl.when(pl.all().rolling_mean(2) > 3)
           .then(pl.all()),
           parallel=True)
       .arr.to_struct()
       .alias("rolling[mean]"))
   .unnest("rolling[mean]")
)
shape: (3, 3)
┌─────────┬─────────┬─────────┐
│ field_0 | field_1 | field_2 │
│ --- | --- | --- │
│ i64 | i64 | i64 │
╞═════════╪═════════╪═════════╡
│ null | null | null │
├─────────┼─────────┼─────────┤
│ null | 5 | 8 │
├─────────┼─────────┼─────────┤
│ null | null | 9 │
└─────────┴─────────┴─────────┘
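A sketch of the optional rename step, assuming out holds the unnested frame produced above - it maps field_0/field_1/field_2 back to the original column names:
out.rename(dict(zip(out.columns, df.columns)))
# columns become column_0, column_1, column_2 again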
.transpose()
You could always transpose the dataframe to switch the axis and use the "regular" api.
(df.transpose()
   .select(
       pl.when(pl.all().rolling_mean(2) > 3)
       .then(pl.all())
       .keep_name())
   .transpose())
shape: (3, 3)
┌──────────┬──────────┬──────────┐
│ column_0 | column_1 | column_2 │
│ --- | --- | --- │
│ i64 | i64 | i64 │
╞══════════╪══════════╪══════════╡
│ null | null | null │
├──────────┼──────────┼──────────┤
│ null | 5 | 8 │
├──────────┼──────────┼──────────┤
│ null | null | 9 │
└──────────┴──────────┴──────────┘
Issue: I have limitations in the code that I am writing that won't read in columns > 4K bytes.
Want: turn a single row into multiple rows with a new max length, with an ordinal to keep them in order.
I have done this in DB2 previously using a WITH clause and a recursive query, and I am trying to convert the code to work with Postgres. I don't have enough Postgres experience to know if there is a better way to do this.
To start, I have a table:
Create table test_long_text (
    test_id numeric(12,0) NOT NULL,
    long_text varchar(64000)
)
Here is the query I have been trying to write, without success:
WITH RECURSIVE rec(test_id, len, ord, pos) as (
    select test_id, octet_length(long_text), 1, 1 from test_long_text
    union all
    select test_id, len, ord+1, pos+4096 from rec where len >= 4096
)
select A.test_id, ord, substr(long_text, pos, 4096) from test_long_text A
inner join rec using (test_id)
order by A.test_id, ord
Currently, I either get the error "negative substring length not allowed" or it hangs indefinitely.
Expected results: the text split into chunks of at most 4096 bytes each (pretend ABC, DEF, GHI stand for longer strings).
+----+-----+------+
| ID | ORD | TEXT |
+----+-----+------+
| 1  | 1   | ABC  |
| 2  | 1   | ABC  |
| 2  | 2   | DEF  |
| 3  | 1   | ABC  |
| 3  | 2   | DEF  |
| 3  | 3   | GHI  |
+----+-----+------+
This example shows how to split the values of a text column into 3-character parts:
with t(x) as (values('1234567890abcdef'::text), ('qwertyuiop'))
select *, substring(x, f.p, 3)
from t, generate_series(1, length(x), 3) with ordinality as f(p, i);
┌──────────────────┬────┬───┬───────────┐
│ x │ p │ i │ substring │
├──────────────────┼────┼───┼───────────┤
│ 1234567890abcdef │ 1 │ 1 │ 123 │
│ 1234567890abcdef │ 4 │ 2 │ 456 │
│ 1234567890abcdef │ 7 │ 3 │ 789 │
│ 1234567890abcdef │ 10 │ 4 │ 0ab │
│ 1234567890abcdef │ 13 │ 5 │ cde │
│ 1234567890abcdef │ 16 │ 6 │ f │
│ qwertyuiop │ 1 │ 1 │ qwe │
│ qwertyuiop │ 4 │ 2 │ rty │
│ qwertyuiop │ 7 │ 3 │ uio │
│ qwertyuiop │ 10 │ 4 │ p │
└──────────────────┴────┴───┴───────────┘
You can simply adapt it to your data.