Polars: convert a list column to separate rows, i.e. unnest a list column into multiple rows - python-polars

I have a Polars dataframe in the form:
df = pl.DataFrame({'a':[1,2,3], 'b':[['a','b'],['a'],['c','d']]})
┌─────┬────────────┐
│ a   ┆ b          │
│ --- ┆ ---        │
│ i64 ┆ list[str]  │
╞═════╪════════════╡
│ 1   ┆ ["a", "b"] │
│ 2   ┆ ["a"]      │
│ 3   ┆ ["c", "d"] │
└─────┴────────────┘
I want to convert it to the following form. I plan to save it to a Parquet file and query the file (with SQL).
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═════╡
│ 1   ┆ "a" │
│ 1   ┆ "b" │
│ 2   ┆ "a" │
│ 3   ┆ "c" │
│ 3   ┆ "d" │
└─────┴─────┘
I have seen an answer that works on struct columns, but df.unnest('b') on my data results in the error:
SchemaError: Series of dtype: List(Utf8) != Struct
I also found a GitHub issue that shows a list can be converted to a struct, but I can't work out how to do that, or whether it applies here.

To decompose a column of lists, you can use the .explode() method (docs):
df = pl.DataFrame({'a':[1,2,3], 'b':[['a','b'],['a'],['c','d']]})
df.explode("b")
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═════╡
│ 1   ┆ a   │
│ 1   ┆ b   │
│ 2   ┆ a   │
│ 3   ┆ c   │
│ 3   ┆ d   │
└─────┴─────┘
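Since the end goal is a Parquet file queried with SQL, the exploded frame can be written straight out. A minimal sketch (the file name here is just an example):
import polars as pl

df = pl.DataFrame({'a': [1, 2, 3], 'b': [['a', 'b'], ['a'], ['c', 'd']]})

# Explode the list column so each list element gets its own row, then
# persist the flattened frame to Parquet for SQL engines to query.
df.explode("b").write_parquet("exploded.parquet")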

Related

Sort polars DataFrame using column with text and numericals

If I have a DataFrame like
┌────────┬──────────────────────┐
│ Name   ┆ Value                │
│ ---    ┆ ---                  │
│ str    ┆ list[str]            │
╞════════╪══════════════════════╡
│ No. 1  ┆ ["None", "!!!"]      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ No. 10 ┆ ["0.3", "OK"]        │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ No. 2  ┆ ["1.1", "OK"]        │
└────────┴──────────────────────┘
How can I sort it by numerical value?
I.e. I want to pull the string from the Name column and extract only the numerical part when sorting, giving:
┌────────┬──────────────────────┐
│ Name   ┆ Value                │
│ ---    ┆ ---                  │
│ str    ┆ list[str]            │
╞════════╪══════════════════════╡
│ No. 1  ┆ ["None", "!!!"]      │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ No. 2  ┆ ["1.1", "OK"]        │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ No. 10 ┆ ["0.3", "OK"]        │
└────────┴──────────────────────┘
I can't see the Polars expression needed, and I'm not sure you can pass a custom Python function.
Thanks
You can use str.extract to get the number from the string using a regular expression.
Then cast it to int and sort:
pl.DataFrame({"Name": ["No. 1", "No. 12", "No. 2"]}).sort(
pl.col("Name").str.extract(r"No\. ([0-9]*)", 1).cast(int)
)
Also, if you want to sort by the numbers inside the list column:
df.sort(
    pl.col("Value").arr.get(0).cast(pl.Float32, strict=False),
    nulls_last=False
)
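Putting this together on the original frame (a sketch; the data below is retyped from the question):
import polars as pl

df = pl.DataFrame({
    "Name": ["No. 1", "No. 10", "No. 2"],
    "Value": [["None", "!!!"], ["0.3", "OK"], ["1.1", "OK"]],
})

# Natural sort: pull the digits out of "No. N" and compare them as integers.
df.sort(pl.col("Name").str.extract(r"No\. ([0-9]*)", 1).cast(pl.Int64))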

How to select the last non-null value from one column and also the value from another column on the same row in Polars?

Below is a non-working example in which I retrieve the last available 'Open', but how do I get the corresponding 'Time'?
sel = self.data.select([pl.col('Time'),
                        pl.col('Open').drop_nulls().last()])
For instance, you can use .filter() to select the rows that do not contain null and then take the last row.
Here's an example:
df = pl.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": ["cat", None, "owl", None, None]
})
┌─────┬──────┐
│ a   ┆ b    │
│ --- ┆ ---  │
│ i64 ┆ str  │
╞═════╪══════╡
│ 1   ┆ cat  │
│ 2   ┆ null │
│ 3   ┆ owl  │
│ 4   ┆ null │
│ 5   ┆ null │
└─────┴──────┘
df.filter(
    pl.col("b").is_not_null()
).select(pl.all().last())
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═════╡
│ 3   ┆ owl │
└─────┴─────┘
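Mapped onto the question's 'Time' and 'Open' columns, it would look like this (a sketch with made-up data, since the original frame isn't shown):
import polars as pl

data = pl.DataFrame({
    "Time": ["09:00", "09:01", "09:02"],
    "Open": [100.0, 101.5, None],
})

# Keep only the rows where 'Open' is non-null, then take the last such row;
# the matching 'Time' comes along because whole rows are filtered first.
data.filter(pl.col("Open").is_not_null()).select(pl.all().last())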

Ordinal encoding of column in polars

I would like to do an ordinal encoding of a column. Pandas has the nice and convenient pd.factorize() method; however, I would like to achieve the same in Polars.
df = pl.DataFrame({"a": [5, 8, 10], "b": ["hi", "hello", "hi"]})
┌─────┬───────┐
│ a   ┆ b     │
│ --- ┆ ---   │
│ i64 ┆ str   │
╞═════╪═══════╡
│ 5   ┆ hi    │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 8   ┆ hello │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 10  ┆ hi    │
└─────┴───────┘
desired result:
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 0   ┆ 0   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 1   ┆ 1   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ 0   │
└─────┴─────┘
You can join with a dummy DataFrame that contains the unique values and the ordinal encoding you are interested in:
df = pl.DataFrame({"a": [5, 8, 10], "b": ["hi", "hello", "hi"]})
unique = df.select(
    pl.col("b").unique(maintain_order=True)
).with_row_count(name="ordinal")

df.join(unique, on="b")
Or you could "misuse" the fact that categorical values are backed by u32 integers.
df.with_column(
    pl.col("b").cast(pl.Categorical).to_physical().alias("ordinal")
)
Both methods output:
shape: (3, 3)
┌─────┬───────┬─────────┐
│ a   ┆ b     ┆ ordinal │
│ --- ┆ ---   ┆ ---     │
│ i64 ┆ str   ┆ u32     │
╞═════╪═══════╪═════════╡
│ 5   ┆ hi    ┆ 0       │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 8   ┆ hello ┆ 1       │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 10  ┆ hi    ┆ 0       │
└─────┴───────┴─────────┘
Here's another way to do it, although I doubt it's better than the dummy DataFrame approach from @ritchie46:
df.with_columns([
    pl.col('b').unique().list().alias('uniq'),
    pl.col('b').unique().list().arr.eval(pl.element().rank()).alias('uniqid')
]).explode(['uniq', 'uniqid']).filter(
    pl.col('b') == pl.col('uniq')
).select(pl.exclude('uniq')).with_column(pl.col('uniqid') - 1)
There's almost certainly a way to improve this, but essentially it creates a new column called uniq, which is a list of all the unique values of the column, as well as uniqid, which (I think, and it seems to be) is the 1-indexed rank order of those values. It then explodes both, creating a row for every value in uniq, and then filters out the rows whose uniq doesn't equal column b. Since rank is 1-indexed (rather than 0-indexed) you have to subtract 1, and you exclude the uniq column that we don't care about since it's the same as b.
If the order is not important, you could use .rank(method="dense"):
>>> df.select(pl.all().rank(method="dense") - 1)
shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ u32 ┆ u32 │
╞═════╪═════╡
│ 0   ┆ 1   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 1   ┆ 0   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ 1   │
└─────┴─────┘
If it is, you could:
>>> (
... df.with_row_count()
... .with_columns([
... pl.col("row_nr").first()
... .over(col)
... .rank(method="dense")
... .alias(col) - 1
... for col in df.columns
... ])
... .drop("row_nr")
... )
shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ u32 ┆ u32 │
╞═════╪═════╡
│ 0   ┆ 0   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 1   ┆ 1   │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 2   ┆ 0   │
└─────┴─────┘
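If you want a reusable pd.factorize()-style helper, the join approach from the first answer wraps up neatly. A sketch; the helper name and the first-appearance ordering are my own choices:
import polars as pl

def factorize(df: pl.DataFrame, column: str, name: str = "ordinal") -> pl.DataFrame:
    # Map each unique value (in order of first appearance) to a row index,
    # then join that mapping back onto the original frame.
    mapping = df.select(
        pl.col(column).unique(maintain_order=True)
    ).with_row_count(name=name)
    return df.join(mapping, on=column, how="left")

df = pl.DataFrame({"a": [5, 8, 10], "b": ["hi", "hello", "hi"]})
print(factorize(df, "b"))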

Computing and retrieving operations at the group level without collapsing data frame in polars?

I am trying to compute a stat (or more) at the group level without having to create a second data frame. The way I currently do it relies on generating a second data frame with the desired aggregation, which I then merge back into the original one.
A silly example:
import polars as pl
df = pl.DataFrame({'name': ['Steve', 'Larry', 'Tom', 'Steve', 'Tom', 'Steve'],
                   'points': range(6)})
print(df)
shape: (6, 2)
┌───────┬────────┐
│ name  ┆ points │
│ ---   ┆ ---    │
│ str   ┆ i64    │
╞═══════╪════════╡
│ Steve ┆ 0      │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ Larry ┆ 1      │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ Tom   ┆ 2      │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ Steve ┆ 3      │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ Tom   ┆ 4      │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ Steve ┆ 5      │
└───────┴────────┘
We created a simple data frame above in which some groups have more entries than others. In a second step, we compute an additional data frame to keep track of the size of each group.
entries = df.groupby('name').agg(pl.count().alias('entries'))
print(entries)
shape: (3, 2)
┌───────┬─────────┐
│ name  ┆ entries │
│ ---   ┆ ---     │
│ str   ┆ u32     │
╞═══════╪═════════╡
│ Steve ┆ 3       │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ Tom   ┆ 2       │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ Larry ┆ 1       │
└───────┴─────────┘
Now, in a third step, we bring this information back to the original data frame.
print(df.join(entries, left_on='name', right_on='name', how='left'))
shape: (6, 3)
┌───────┬────────┬─────────┐
│ name  ┆ points ┆ entries │
│ ---   ┆ ---    ┆ ---     │
│ str   ┆ i64    ┆ u32     │
╞═══════╪════════╪═════════╡
│ Steve ┆ 0      ┆ 3       │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ Larry ┆ 1      ┆ 1       │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ Tom   ┆ 2      ┆ 2       │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ Steve ┆ 3      ┆ 3       │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ Tom   ┆ 4      ┆ 2       │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ Steve ┆ 5      ┆ 3       │
└───────┴────────┴─────────┘
Is there a way to avoid this triangulation? I have the feeling that using over might be a solution but I can't figure it out yet.
Well ... I managed. Posting the question helped me organize my thoughts and indeed, over was the solution.
df.with_column(pl.col('name').count().over('name').alias('entries'))
shape: (6, 3)
┌───────┬────────┬─────────┐
│ name  ┆ points ┆ entries │
│ ---   ┆ ---    ┆ ---     │
│ str   ┆ i64    ┆ u32     │
╞═══════╪════════╪═════════╡
│ Steve ┆ 0      ┆ 3       │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ Larry ┆ 1      ┆ 1       │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ Tom   ┆ 2      ┆ 2       │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ Steve ┆ 3      ┆ 3       │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ Tom   ┆ 4      ┆ 2       │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ Steve ┆ 5      ┆ 3       │
└───────┴────────┴─────────┘
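The same pattern extends to several group-level statistics in one pass, still without collapsing the frame. A sketch; the extra column name avg_points is my own:
import polars as pl

df = pl.DataFrame({'name': ['Steve', 'Larry', 'Tom', 'Steve', 'Tom', 'Steve'],
                   'points': range(6)})

# Window expressions: each aggregate is computed per group and broadcast
# back to every row of that group.
df.with_columns([
    pl.col('points').count().over('name').alias('entries'),
    pl.col('points').mean().over('name').alias('avg_points'),
])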

polars outer join default null value

https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.DataFrame.join.html
Can I specify the default NULL value for outer joins? Like 0?
The join method does not currently have an option for setting a default value for nulls. However, there is an easy way to accomplish this.
Let's say we have this data:
import polars as pl
df1 = pl.DataFrame({"key": ["a", "b", "d"], "var1": [1, 1, 1]})
df2 = pl.DataFrame({"key": ["a", "b", "c"], "var2": [2, 2, 2]})
df1.join(df2, on="key", how="outer")
shape: (4, 3)
┌─────┬──────┬──────┐
│ key ┆ var1 ┆ var2 │
│ --- ┆ ---  ┆ ---  │
│ str ┆ i64  ┆ i64  │
╞═════╪══════╪══════╡
│ a   ┆ 1    ┆ 2    │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ b   ┆ 1    ┆ 2    │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ c   ┆ null ┆ 2    │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ d   ┆ 1    ┆ null │
└─────┴──────┴──────┘
To fill the null values with a different default, simply use this:
df1.join(df2, on="key", how="outer").with_column(pl.all().fill_null(0))
shape: (4, 3)
┌─────┬──────┬──────┐
│ key ┆ var1 ┆ var2 │
│ --- ┆ ---  ┆ ---  │
│ str ┆ i64  ┆ i64  │
╞═════╪══════╪══════╡
│ a   ┆ 1    ┆ 2    │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ b   ┆ 1    ┆ 2    │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ c   ┆ 0    ┆ 2    │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ d   ┆ 1    ┆ 0    │
└─────┴──────┴──────┘
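If only certain columns should get the default (say var1 but not var2), fill_null can be applied per column instead of via pl.all(). A sketch reusing the frames above:
# Fill nulls in var1 only; var2 keeps its nulls.
df1.join(df2, on="key", how="outer").with_column(pl.col("var1").fill_null(0))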