Polars: convert a list column to separate rows, i.e. unnest a list column into multiple rows - python-polars

I have a Polars dataframe in the form:
df = pl.DataFrame({'a':[1,2,3], 'b':[['a','b'],['a'],['c','d']]})
┌─────┬────────────┐
│ a   ┆ b          │
│ --- ┆ ---        │
│ i64 ┆ list[str]  │
╞═════╪════════════╡
│ 1   ┆ ["a", "b"] │
│ 2   ┆ ["a"]      │
│ 3   ┆ ["c", "d"] │
└─────┴────────────┘
I want to convert it to the following form. I plan to save it to a Parquet file and query the file (with SQL).
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═════╡
│ 1   ┆ "a" │
│ 1   ┆ "b" │
│ 2   ┆ "a" │
│ 3   ┆ "c" │
│ 3   ┆ "d" │
└─────┴─────┘
I have seen an answer that works on struct columns, but df.unnest('b') on my data results in the error:
SchemaError: Series of dtype: List(Utf8) != Struct
I also found a GitHub issue that shows a list can be converted to a struct, but I can't work out how to do that, or whether it applies here.

To decompose a column of lists, you can use the .explode() method (docs):
df = pl.DataFrame({'a':[1,2,3], 'b':[['a','b'],['a'],['c','d']]})
df.explode("b")
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═════╡
│ 1   ┆ a   │
│ 1   ┆ b   │
│ 2   ┆ a   │
│ 3   ┆ c   │
│ 3   ┆ d   │
└─────┴─────┘
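Since the end goal is a Parquet file queried with SQL, the exploded frame can be written straight out. A minimal sketch (the file name here is just an example):
import polars as pl

df = pl.DataFrame({'a': [1, 2, 3], 'b': [['a', 'b'], ['a'], ['c', 'd']]})

# Explode the list column so each list element gets its own row, then
# persist the flattened frame to Parquet for SQL engines to query.
df.explode("b").write_parquet("exploded.parquet")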

Related

Sort polars DataFrame using column with text and numericals

If I have a DataFrame like
┌────────┬──────────────────────┐
│ Name   ┆ Value                │
│ ---    ┆ ---                  │
│ str    ┆ list[str]            │
╞════════╪══════════════════════╡
│ No. 1  ┆ ["None", "!!!"]      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ No. 10 ┆ ["0.3", "OK"]        │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ No. 2  ┆ ["1.1", "OK"]        │
└────────┴──────────────────────┘
How can I sort it by numerical value?
I.e. I want to pull the string from the Name column and extract only the numerical part when sorting, giving:
┌────────┬──────────────────────┐
│ Name   ┆ Value                │
│ ---    ┆ ---                  │
│ str    ┆ list[str]            │
╞════════╪══════════════════════╡
│ No. 1  ┆ ["None", "!!!"]      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ No. 2  ┆ ["1.1", "OK"]        │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ No. 10 ┆ ["0.3", "OK"]        │
└────────┴──────────────────────┘
I can't see the Polars expression needed, and I'm not sure you can pass a custom Python function.
Thanks
You can use str.extract to get the number from the string using a regular expression.
Then cast it to int and sort:
pl.DataFrame({"Name": ["No. 1", "No. 12", "No. 2"]}).sort(
pl.col("Name").str.extract(r"No\. ([0-9]*)", 1).cast(int)
)
Also, if you want to sort by the numbers inside the list column:
df.sort(
    pl.col("Value").arr.get(0).cast(pl.Float32, strict=False),
    nulls_last=False
)
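Putting this together on the original frame (a sketch; the data below is retyped from the question):
import polars as pl

df = pl.DataFrame({
    "Name": ["No. 1", "No. 10", "No. 2"],
    "Value": [["None", "!!!"], ["0.3", "OK"], ["1.1", "OK"]],
})

# Natural sort: pull the digits out of "No. N" and compare them as integers.
df.sort(pl.col("Name").str.extract(r"No\. ([0-9]*)", 1).cast(pl.Int64))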

How to select the last non-null value from one column and also the value from another column on the same row in Polars?

Below is a non-working example in which I retrieve the last available 'Open', but how do I get the corresponding 'Time'?
sel = self.data.select([pl.col('Time'),
                        pl.col('Open').drop_nulls().last()])
For instance, you can use .filter() to select the rows that do not contain null and then take the last row.
Here's an example:
df = pl.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": ["cat", None, "owl", None, None]
})
┌─────┬──────┐
│ a   ┆ b    │
│ --- ┆ ---  │
│ i64 ┆ str  │
╞═════╪══════╡
│ 1   ┆ cat  │
│ 2   ┆ null │
│ 3   ┆ owl  │
│ 4   ┆ null │
│ 5   ┆ null │
└─────┴──────┘
df.filter(
    pl.col("b").is_not_null()
).select(pl.all().last())
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═════╡
│ 3   ┆ owl │
└─────┴─────┘
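Mapped onto the question's 'Time' and 'Open' columns, it would look like this (a sketch with made-up data, since the original frame isn't shown):
import polars as pl

data = pl.DataFrame({
    "Time": ["09:00", "09:01", "09:02"],
    "Open": [100.0, 101.5, None],
})

# Keep only the rows where 'Open' is non-null, then take the last such row;
# the matching 'Time' comes along because whole rows are filtered first.
data.filter(pl.col("Open").is_not_null()).select(pl.all().last())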

Ordinal encoding of column in polars

I would like to do an ordinal encoding of a column. Pandas has the nice and convenient pd.factorize() method; however, I would like to achieve the same in Polars.
df = pl.DataFrame({"a": [5, 8, 10], "b": ["hi", "hello", "hi"]})
┌─────┬───────┐
│ a   ┆ b     │
│ --- ┆ ---   │
│ i64 ┆ str   │
╞═════╪═══════╡
│ 5   ┆ hi    │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 8   ┆ hello │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 10  ┆ hi    │
└─────┴───────┘
desired result:
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 0   ┆ 0   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 1   ┆ 1   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ 0   │
└─────┴─────┘
You can join with a dummy DataFrame that contains the unique values and the ordinal encoding you are interested in:
df = pl.DataFrame({"a": [5, 8, 10], "b": ["hi", "hello", "hi"]})
unique = df.select(
    pl.col("b").unique(maintain_order=True)
).with_row_count(name="ordinal")

df.join(unique, on="b")
Or you could "misuse" the fact that categorical values are backed by u32 integers.
df.with_column(
    pl.col("b").cast(pl.Categorical).to_physical().alias("ordinal")
)
Both methods output:
shape: (3, 3)
┌─────┬───────┬─────────┐
│ a   ┆ b     ┆ ordinal │
│ --- ┆ ---   ┆ ---     │
│ i64 ┆ str   ┆ u32     │
╞═════╪═══════╪═════════╡
│ 5   ┆ hi    ┆ 0       │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 8   ┆ hello ┆ 1       │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 10  ┆ hi    ┆ 0       │
└─────┴───────┴─────────┘
Here's another way to do it, although I doubt it's better than the dummy DataFrame approach from @ritchie46:
df.with_columns([
    pl.col('b').unique().list().alias('uniq'),
    pl.col('b').unique().list().arr.eval(pl.element().rank()).alias('uniqid')
]).explode(['uniq', 'uniqid']).filter(
    pl.col('b') == pl.col('uniq')
).select(pl.exclude('uniq')).with_column(pl.col('uniqid') - 1)
There's almost certainly a way to improve this, but essentially it creates a new column called uniq, which is a list of all the unique values of the column, as well as uniqid, which (I think, and it seems to be) is the 1-indexed rank order of those values. It then explodes both, creating a row for every value in uniq, and then filters out the rows whose uniq doesn't equal column b. Since rank is 1-indexed (rather than 0-indexed) you have to subtract 1, and you exclude the uniq column that we don't care about since it's the same as b.
If the order is not important, you could use .rank(method="dense"):
>>> df.select(pl.all().rank(method="dense") - 1)
shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ u32 ┆ u32 │
╞═════╪═════╡
│ 0   ┆ 1   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 1   ┆ 0   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ 1   │
└─────┴─────┘
If it is, you could:
>>> (
... df.with_row_count()
... .with_columns([
... pl.col("row_nr").first()
... .over(col)
... .rank(method="dense")
... .alias(col) - 1
... for col in df.columns
... ])
... .drop("row_nr")
... )
shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ u32 ┆ u32 │
╞═════╪═════╡
│ 0   ┆ 0   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 1   ┆ 1   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ 0   │
└─────┴─────┘
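If you want a reusable pd.factorize()-style helper, the join approach from the first answer wraps up neatly. A sketch; the helper name and the first-appearance ordering are my own choices:
import polars as pl

def factorize(df: pl.DataFrame, column: str, name: str = "ordinal") -> pl.DataFrame:
    # Map each unique value (in order of first appearance) to a row index,
    # then join that mapping back onto the original frame.
    mapping = df.select(
        pl.col(column).unique(maintain_order=True)
    ).with_row_count(name=name)
    return df.join(mapping, on=column, how="left")

df = pl.DataFrame({"a": [5, 8, 10], "b": ["hi", "hello", "hi"]})
print(factorize(df, "b"))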

Computing and retrieving operations at the group level without collapsing data frame in polars?

I am trying to compute a stat (or more) at the group level without having to create a second data frame. The way I currently do it relies on generating a second data frame with the desired aggregation, which I then merge back into the original one.
A silly example:
import polars as pl
df = pl.DataFrame({'name': ['Steve', 'Larry', 'Tom', 'Steve', 'Tom', 'Steve'],
                   'points': range(6)})
print(df)
shape: (6, 2)
┌───────┬────────┐
│ name  ┆ points │
│ ---   ┆ ---    │
│ str   ┆ i64    │
╞═══════╪════════╡
│ Steve ┆ 0      │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ Larry ┆ 1      │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ Tom   ┆ 2      │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ Steve ┆ 3      │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ Tom   ┆ 4      │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ Steve ┆ 5      │
└───────┴────────┘
We created a simple data frame above in which some groups have more entries than others. In a second step, we compute an additional data frame to keep track of the size of each group.
entries = df.groupby('name').agg(pl.count().alias('entries'))
print(entries)
shape: (3, 2)
┌───────┬─────────┐
│ name  ┆ entries │
│ ---   ┆ ---     │
│ str   ┆ u32     │
╞═══════╪═════════╡
│ Steve ┆ 3       │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ Tom   ┆ 2       │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ Larry ┆ 1       │
└───────┴─────────┘
Now, in a third step, we bring this information back to the original data frame.
print(df.join(entries, left_on='name', right_on='name', how='left'))
shape: (6, 3)
┌───────┬────────┬─────────┐
│ name  ┆ points ┆ entries │
│ ---   ┆ ---    ┆ ---     │
│ str   ┆ i64    ┆ u32     │
╞═══════╪════════╪═════════╡
│ Steve ┆ 0      ┆ 3       │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ Larry ┆ 1      ┆ 1       │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ Tom   ┆ 2      ┆ 2       │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ Steve ┆ 3      ┆ 3       │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ Tom   ┆ 4      ┆ 2       │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ Steve ┆ 5      ┆ 3       │
└───────┴────────┴─────────┘
Is there a way to avoid this triangulation? I have the feeling that using over might be a solution but I can't figure it out yet.
Well ... I managed. Posting the question helped me organize my thoughts and indeed, over was the solution.
df.with_column(pl.col('name').count().over('name').alias('entries'))
shape: (6, 3)
┌───────┬────────┬─────────┐
│ name  ┆ points ┆ entries │
│ ---   ┆ ---    ┆ ---     │
│ str   ┆ i64    ┆ u32     │
╞═══════╪════════╪═════════╡
│ Steve ┆ 0      ┆ 3       │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ Larry ┆ 1      ┆ 1       │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ Tom   ┆ 2      ┆ 2       │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ Steve ┆ 3      ┆ 3       │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ Tom   ┆ 4      ┆ 2       │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ Steve ┆ 5      ┆ 3       │
└───────┴────────┴─────────┘
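The same pattern extends to several group-level statistics in one pass, still without collapsing the frame. A sketch; the extra column name avg_points is my own:
import polars as pl

df = pl.DataFrame({'name': ['Steve', 'Larry', 'Tom', 'Steve', 'Tom', 'Steve'],
                   'points': range(6)})

# Window expressions: each aggregate is computed per group and broadcast
# back to every row of that group.
df.with_columns([
    pl.col('points').count().over('name').alias('entries'),
    pl.col('points').mean().over('name').alias('avg_points'),
])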

polars outer join default null value

https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.DataFrame.join.html
Can I specify the default NULL value for outer joins? Like 0?
The join method does not currently have an option for setting a default value for nulls. However, there is an easy way to accomplish this.
Let's say we have this data:
import polars as pl
df1 = pl.DataFrame({"key": ["a", "b", "d"], "var1": [1, 1, 1]})
df2 = pl.DataFrame({"key": ["a", "b", "c"], "var2": [2, 2, 2]})
df1.join(df2, on="key", how="outer")
shape: (4, 3)
┌─────┬──────┬──────┐
│ key ┆ var1 ┆ var2 │
│ --- ┆ ---  ┆ ---  │
│ str ┆ i64  ┆ i64  │
╞═════╪══════╪══════╡
│ a   ┆ 1    ┆ 2    │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ b   ┆ 1    ┆ 2    │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ c   ┆ null ┆ 2    │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ d   ┆ 1    ┆ null │
└─────┴──────┴──────┘
To fill the null values with a different default, simply use this:
df1.join(df2, on="key", how="outer").with_column(pl.all().fill_null(0))
shape: (4, 3)
┌─────┬──────┬──────┐
│ key ┆ var1 ┆ var2 │
│ --- ┆ ---  ┆ ---  │
│ str ┆ i64  ┆ i64  │
╞═════╪══════╪══════╡
│ a   ┆ 1    ┆ 2    │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ b   ┆ 1    ┆ 2    │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ c   ┆ 0    ┆ 2    │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ d   ┆ 1    ┆ 0    │
└─────┴──────┴──────┘
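If only certain columns should get the default (say var1 but not var2), fill_null can be applied per column instead of via pl.all(). A sketch reusing the frames above:
# Fill nulls in var1 only; var2 keeps its nulls.
df1.join(df2, on="key", how="outer").with_column(pl.col("var1").fill_null(0))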