Merge list column with constant column in polars - python-polars

I have a dataframe like:
df = pl.DataFrame({'a': [['a', 'b'], None, ['c', 'd', 'e'], None], 't': ['x', 'y', None, None]})
shape: (4, 2)
┌─────────────────┬──────┐
│ a ┆ t │
│ --- ┆ --- │
│ list[str] ┆ str │
╞═════════════════╪══════╡
│ ["a", "b"] ┆ x │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ null ┆ y │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ ["c", "d", "e"] ┆ null │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ null ┆ null │
└─────────────────┴──────┘
I'd like to have a transformation that results in:
┌─────────────────┐
│ a │
│ --- │
│ list[str] │
╞═════════════════╡
│ ["a", "b", "x"] │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ["y"] │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ["c", "d", "e"] │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ null │
└─────────────────┘
However, the obvious solutions which come to mind don't seem to work.
df.with_column(
    col('a').arr.concat(col('t'))
)
results in
┌──────────────────────┬──────┐
│ a ┆ t │
│ --- ┆ --- │
│ list[str] ┆ str │
╞══════════════════════╪══════╡
│ ["a", "b", "x"] ┆ x │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ null ┆ y │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ ["c", "d", ... null] ┆ null │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ null ┆ null │
└──────────────────────┴──────┘
Strangely, somehow
df.with_column(
    col('t').apply(lambda s: [s]).arr.concat(col('a'))
)
results in an error saying that the DataFrame length has changed:
ShapeError: Could not add column. The Series length 5 differs from the DataFrame height: 4
I don't understand why concatenating the two Series together should produce a new series of a different length. Is this a bug?
I have tried a number of ways to produce a solution but continue to run into errors. For example, using a list comprehension to combine the lists works, but .append does not.
def combine(d):
    x, y = d['a'], d['t']
    if x and y:
        # return x.append(y)  # produces error
        return [a for a in x] + [b for b in y]
    if x and not y:
        return [a for a in x]
    if y and not x:
        return [b for b in y]
    else:
        # return None  # produces error
        return ['None']

df.with_column(
    pl.struct([col('a'), col('t')]).apply(combine).alias('combined')
)
gives
┌─────────────────┬──────┬─────────────────┐
│ a ┆ t ┆ combined │
│ --- ┆ --- ┆ --- │
│ list[str] ┆ str ┆ list[str] │
╞═════════════════╪══════╪═════════════════╡
│ ["a", "b"] ┆ x ┆ ["a", "b", "x"] │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ null ┆ y ┆ ["y"] │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ["c", "d", "e"] ┆ null ┆ ["c", "d", "e"] │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ null ┆ null ┆ ["None"] │
└─────────────────┴──────┴─────────────────┘
This gets part of the way there but now we have to deal with ["None"] at some point.
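For completeness, the closest I've gotten so far is to branch on the null cases explicitly with when/then/otherwise, reusing the pieces from the attempts above (a sketch, not thoroughly tested):
import polars as pl
from polars import col

df.with_column(
    pl.when(col('t').is_null())
      .then(col('a'))                        # nothing to append; keep 'a' (possibly null)
      .when(col('a').is_null())
      .then(col('t').apply(lambda v: [v]))   # wrap the lone string in a one-element list
      .otherwise(col('a').arr.concat(col('t')))
      .alias('a')
)
This avoids appending a null element and leaves the all-null row as null, but it is verbose and I'm hoping there is a more idiomatic expression for it.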

Related

Polars convert list of strings to list of categoricals

I'm trying to improve the performance of my polars code by converting a list of strings to a list of categoricals for my tags column ('b' in the example below):
shape: (3, 2)
┌─────┬────────────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ list[str] │
╞═════╪════════════╡
│ 1 ┆ ["a", "b"] │
│ 2 ┆ ["a"] │
│ 3 ┆ ["c", "d"] │
└─────┴────────────┘
df = pl.DataFrame({'a': [1, 2, 3], 'b': [['a', 'b'], ['a'], ['c', 'd']]})
df.with_column(pl.col('b').cast(pl.list(pl.Categorical)))
However I get the following error:
ValueError: could not convert value 'Unknown' as a Literal
Does polars support lists of categoricals?
Polars does support lists of Categoricals.
The issue is that you're using pl.list() instead of pl.List(): datatypes start with an uppercase letter.
>>> df.with_columns(pl.col('b').cast(pl.List(pl.Categorical)))
shape: (3, 2)
┌─────┬────────────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ list[cat] │
╞═════╪════════════╡
│ 1 ┆ ["a", "b"] │
│ 2 ┆ ["a"] │
│ 3 ┆ ["c", "d"] │
└─────┴────────────┘
pl.list() is something different: it appears to be shorthand syntax for pl.col().list().
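One caveat, based on my general understanding of Categorical rather than anything in the question: if you later need to compare, join or concatenate these categorical columns across different DataFrames, you may need a global string cache so the underlying category codes line up, along the lines of:
import polars as pl

with pl.StringCache():
    df1 = pl.DataFrame({'b': [['a', 'b'], ['a']]}).with_columns(
        pl.col('b').cast(pl.List(pl.Categorical))
    )
    df2 = pl.DataFrame({'b': [['c'], ['a', 'd']]}).with_columns(
        pl.col('b').cast(pl.List(pl.Categorical))
    )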

How to select the last non-null value from one column and also the value from another column on the same row in Polars?

Below is a non-working example in which I retrieve the last available 'Open', but how do I get the corresponding 'Time'?
sel = self.data.select([
    pl.col('Time'),
    pl.col('Open').drop_nulls().last()
])
You can use .filter() to select the rows where the column is not null and then take the last row.
Here is an example:
df = pl.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": ["cat", None, "owl", None, None]
})
┌─────┬──────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪══════╡
│ 1 ┆ cat │
│ 2 ┆ null │
│ 3 ┆ owl │
│ 4 ┆ null │
│ 5 ┆ null │
└─────┴──────┘
df.filter(
    pl.col("b").is_not_null()
).select(pl.all().last())
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═════╡
│ 3 ┆ owl │
└─────┴─────┘
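Applied to the columns from the question (assuming self.data is the DataFrame and 'Time'/'Open' are its column names), that would look something like:
sel = self.data.filter(
    pl.col('Open').is_not_null()
).select([
    pl.col('Time').last(),
    pl.col('Open').last()
])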

How to use join in expression context?

Suppose I have a mapping dataframe that I would like to join to an original dataframe:
df = pl.DataFrame({
    'A': [1, 2, 3, 2, 1],
})
mapper = pl.DataFrame({
    'key': [1, 2, 3, 4, 5],
    'value': ['a', 'b', 'c', 'd', 'e']
})
I can map A to value directly via df.join(mapper, ...), but is there a way to do this in an expression context, i.e. while building columns? As in:
df.with_columns([
    (pl.col('A') + 1).join(mapper, left_on='A', right_on='key')
])
which would furnish:
shape: (5, 2)
┌─────┬───────┐
│ A ┆ value │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═══════╡
│ 1 ┆ b │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 1 ┆ b │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2 ┆ c │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2 ┆ c │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3 ┆ d │
└─────┴───────┘
Probably, yes. I just put df.select(col('A') + 1) inside.
df = df.with_columns([
    col('A'),
    df.select(col('A') + 1).join(mapper, left_on='A', right_on='key')['value']
])
print(df)
┌─────┬───────┐
│ A ┆ value │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═══════╡
│ 1 ┆ b │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2 ┆ b │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3 ┆ c │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2 ┆ c │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 1 ┆ d │
└─────┴───────┘
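One thing to watch out for: a join does not necessarily return rows in the order of the left frame, which is presumably why the value column above does not line up row-for-row with A. A sketch that guards against this (assuming with_row_count is available in your polars version) is to carry an explicit row index through the join and sort on it afterwards:
result = (
    df.with_row_count('row_nr')                      # remember the original row order
      .with_column((col('A') + 1).alias('key'))
      .join(mapper, on='key', how='left')
      .sort('row_nr')                                # restore the original order
      .select([col('A'), col('value')])
)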

polars outer join default null value

https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.DataFrame.join.html
Can I specify the default NULL value for outer joins? Like 0?
The join method does not currently have an option for setting a default value for nulls. However, there is an easy way to accomplish this.
Let's say we have this data:
import polars as pl
df1 = pl.DataFrame({"key": ["a", "b", "d"], "var1": [1, 1, 1]})
df2 = pl.DataFrame({"key": ["a", "b", "c"], "var2": [2, 2, 2]})
df1.join(df2, on="key", how="outer")
shape: (4, 3)
┌─────┬──────┬──────┐
│ key ┆ var1 ┆ var2 │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════╪══════╪══════╡
│ a ┆ 1 ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ b ┆ 1 ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ c ┆ null ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ d ┆ 1 ┆ null │
└─────┴──────┴──────┘
To fill the null values with a different value, simply use this:
df1.join(df2, on="key", how="outer").with_column(pl.all().fill_null(0))
shape: (4, 3)
┌─────┬──────┬──────┐
│ key ┆ var1 ┆ var2 │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════╪══════╪══════╡
│ a ┆ 1 ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ b ┆ 1 ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ c ┆ 0 ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ d ┆ 1 ┆ 0 │
└─────┴──────┴──────┘
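If you want different defaults per column rather than 0 everywhere, you can presumably list one fill_null expression per column instead of using pl.all():
df1.join(df2, on="key", how="outer").with_columns([
    pl.col("var1").fill_null(0),
    pl.col("var2").fill_null(-1)
])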

Polars: how to add a column in front?

What would be the most idiomatic (and efficient) way to add a column at the front of a polars DataFrame? The same thing as .with_column, but adding it at index 0?
You can select the columns in the order you want them in your new DataFrame.
df = pl.DataFrame({
    "a": [1, 2, 3],
    "b": [True, None, False]
})
df.select([
    pl.lit("foo").alias("z"),
    pl.all()
])
shape: (3, 3)
┌─────┬─────┬───────┐
│ z ┆ a ┆ b │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ bool │
╞═════╪═════╪═══════╡
│ foo ┆ 1 ┆ true │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ foo ┆ 2 ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ foo ┆ 3 ┆ false │
└─────┴─────┴───────┘
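If the new column already exists as a Series, another option (if I remember the API correctly) is DataFrame.insert_at_idx, which inserts it at the given position in place:
df.insert_at_idx(0, pl.Series("z", ["foo", "foo", "foo"]))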