Merge list column with constant column in polars - python-polars

I have a dataframe like:
df = pl.DataFrame({'a': [['a', 'b'], None, ['c', 'd', 'e'], None], 't': ['x', 'y', None, None]})
shape: (4, 2)
┌─────────────────┬──────┐
│ a ┆ t │
│ --- ┆ --- │
│ list[str] ┆ str │
╞═════════════════╪══════╡
│ ["a", "b"] ┆ x │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ null ┆ y │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ ["c", "d", "e"] ┆ null │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ null ┆ null │
└─────────────────┴──────┘
I'd like to have a transformation that results in:
┌─────────────────┐
│ a │
│ --- │
│ list[str] │
╞═════════════════╡
│ ["a", "b", "x"] │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ["y"] │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ["c", "d", "e"] │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ null │
└─────────────────┘
However, the obvious solutions which come to mind don't seem to work.
df.with_column(
    col('a').arr.concat(col('t'))
)
results in
┌──────────────────────┬──────┐
│ a ┆ t │
│ --- ┆ --- │
│ list[str] ┆ str │
╞══════════════════════╪══════╡
│ ["a", "b", "x"] ┆ x │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ null ┆ y │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ ["c", "d", ... null] ┆ null │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ null ┆ null │
└──────────────────────┴──────┘
Strangely, somehow
df.with_column(
    col('t').apply(lambda s: [s]).arr.concat(col('a'))
)
results in an error saying that the DataFrame length has changed:
ShapeError: Could not add column. The Series length 5 differs from the DataFrame height: 4
I don't understand why concatenating the two Series together should produce a new series of a different length. Is this a bug?
I have tried a number of ways to produce a solution but continue to run into errors. For example, using a list comprehension to combine the lists works, but .append does not.
def combine(d):
    x, y = d['a'], d['t']
    if x and y:
        # return x.append(y)  # produces error
        return [a for a in x] + [b for b in y]
    if x and not y:
        return [a for a in x]
    if y and not x:
        return [b for b in y]
    else:
        # return None  # produces error
        return ['None']

df.with_column(
    pl.struct([col('a'), col('t')]).apply(combine).alias('combined')
)
gives
┌─────────────────┬──────┬─────────────────┐
│ a ┆ t ┆ combined │
│ --- ┆ --- ┆ --- │
│ list[str] ┆ str ┆ list[str] │
╞═════════════════╪══════╪═════════════════╡
│ ["a", "b"] ┆ x ┆ ["a", "b", "x"] │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ null ┆ y ┆ ["y"] │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ["c", "d", "e"] ┆ null ┆ ["c", "d", "e"] │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ null ┆ null ┆ ["None"] │
└─────────────────┴──────┴─────────────────┘
This gets part of the way there but now we have to deal with ["None"] at some point.
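For completeness, the closest I've gotten so far is to branch on the null cases explicitly with when/then/otherwise, reusing the pieces from the attempts above (a sketch, not thoroughly tested):
import polars as pl
from polars import col

df.with_column(
    pl.when(col('t').is_null())
      .then(col('a'))                        # nothing to append; keep 'a' (possibly null)
      .when(col('a').is_null())
      .then(col('t').apply(lambda v: [v]))   # wrap the lone string in a one-element list
      .otherwise(col('a').arr.concat(col('t')))
      .alias('a')
)
This avoids appending a null element and leaves the all-null row as null, but it is verbose and I'm hoping there is a more idiomatic expression for it.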

Related

Polars convert list of strings to list of categoricals

I'm trying to improve the performance of my polars code by converting a list of strings to a list of categoricals for my tags column ('b' in the example below):
shape: (3, 2)
┌─────┬────────────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ list[str] │
╞═════╪════════════╡
│ 1 ┆ ["a", "b"] │
│ 2 ┆ ["a"] │
│ 3 ┆ ["c", "d"] │
└─────┴────────────┘
df = pl.DataFrame({'a': [1, 2, 3], 'b': [['a', 'b'], ['a'], ['c', 'd']]})
df.with_column(pl.col('b').cast(pl.list(pl.Categorical)))
However I get the following error:
ValueError: could not convert value 'Unknown' as a Literal
Does polars support lists of categoricals?
Polars does support lists of Categoricals.
The issue is that you're using pl.list() instead of pl.List(): datatypes start with an uppercase letter.
>>> df.with_columns(pl.col('b').cast(pl.List(pl.Categorical)))
shape: (3, 2)
┌─────┬────────────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ list[cat] │
╞═════╪════════════╡
│ 1 ┆ ["a", "b"] │
│ 2 ┆ ["a"] │
│ 3 ┆ ["c", "d"] │
└─────┴────────────┘
pl.list() is something different: it appears to be shorthand syntax for pl.col().list().
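One caveat, based on my general understanding of Categorical rather than anything in the question: if you later need to compare, join or concatenate these categorical columns across different DataFrames, you may need a global string cache so the underlying category codes line up, along the lines of:
import polars as pl

with pl.StringCache():
    df1 = pl.DataFrame({'b': [['a', 'b'], ['a']]}).with_columns(
        pl.col('b').cast(pl.List(pl.Categorical))
    )
    df2 = pl.DataFrame({'b': [['c'], ['a', 'd']]}).with_columns(
        pl.col('b').cast(pl.List(pl.Categorical))
    )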

How to select the last non-null value from one column and also the value from another column on the same row in Polars?

Below is a non-working example in which I retrieve the last available 'Open', but how do I get the corresponding 'Time'?
sel = self.data.select([
    pl.col('Time'),
    pl.col('Open').drop_nulls().last()
])
You can use .filter() to select the rows where the column is not null and then take the last row.
Here is an example:
df = pl.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": ["cat", None, "owl", None, None]
})
┌─────┬──────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪══════╡
│ 1 ┆ cat │
│ 2 ┆ null │
│ 3 ┆ owl │
│ 4 ┆ null │
│ 5 ┆ null │
└─────┴──────┘
df.filter(
    pl.col("b").is_not_null()
).select(pl.all().last())
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═════╡
│ 3 ┆ owl │
└─────┴─────┘
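Applied to the columns from the question (assuming self.data is the DataFrame and 'Time'/'Open' are its column names), that would look something like:
sel = self.data.filter(
    pl.col('Open').is_not_null()
).select([
    pl.col('Time').last(),
    pl.col('Open').last()
])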

How to use join in expression context?

Suppose I have a mapping dataframe that I would like to join to an original dataframe:
df = pl.DataFrame({
    'A': [1, 2, 3, 2, 1],
})
mapper = pl.DataFrame({
    'key': [1, 2, 3, 4, 5],
    'value': ['a', 'b', 'c', 'd', 'e']
})
I can map A to value directly via df.join(mapper, ...), but is there a way to do this in an expression context, i.e. while building columns? As in:
df.with_columns([
    (pl.col('A') + 1).join(mapper, left_on='A', right_on='key')
])
which would furnish:
shape: (5, 2)
┌─────┬───────┐
│ A ┆ value │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═══════╡
│ 1 ┆ b │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 1 ┆ b │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2 ┆ c │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2 ┆ c │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3 ┆ d │
└─────┴───────┘
Probably, yes. I just put df.select(col('A') + 1) inside.
df = df.with_columns([
    col('A'),
    df.select(col('A') + 1).join(mapper, left_on='A', right_on='key')['value']
])
print(df)
┌─────┬───────┐
│ A ┆ value │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═══════╡
│ 1 ┆ b │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2 ┆ b │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 3 ┆ c │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 2 ┆ c │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 1 ┆ d │
└─────┴───────┘
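One thing to watch out for: a join does not necessarily return rows in the order of the left frame, which is presumably why the value column above does not line up row-for-row with A. A sketch that guards against this (assuming with_row_count is available in your polars version) is to carry an explicit row index through the join and sort on it afterwards:
result = (
    df.with_row_count('row_nr')                      # remember the original row order
      .with_column((col('A') + 1).alias('key'))
      .join(mapper, on='key', how='left')
      .sort('row_nr')                                # restore the original order
      .select([col('A'), col('value')])
)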

polars outer join default null value

https://pola-rs.github.io/polars/py-polars/html/reference/api/polars.DataFrame.join.html
Can I specify the default NULL value for outer joins? Like 0?
The join method does not currently have an option for setting a default value for nulls. However, there is an easy way to accomplish this.
Let's say we have this data:
import polars as pl
df1 = pl.DataFrame({"key": ["a", "b", "d"], "var1": [1, 1, 1]})
df2 = pl.DataFrame({"key": ["a", "b", "c"], "var2": [2, 2, 2]})
df1.join(df2, on="key", how="outer")
shape: (4, 3)
┌─────┬──────┬──────┐
│ key ┆ var1 ┆ var2 │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════╪══════╪══════╡
│ a ┆ 1 ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ b ┆ 1 ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ c ┆ null ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ d ┆ 1 ┆ null │
└─────┴──────┴──────┘
To fill the null values with a different value, simply use this:
df1.join(df2, on="key", how="outer").with_column(pl.all().fill_null(0))
shape: (4, 3)
┌─────┬──────┬──────┐
│ key ┆ var1 ┆ var2 │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 │
╞═════╪══════╪══════╡
│ a ┆ 1 ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ b ┆ 1 ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ c ┆ 0 ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ d ┆ 1 ┆ 0 │
└─────┴──────┴──────┘
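If you want different defaults per column rather than 0 everywhere, you can presumably list one fill_null expression per column instead of using pl.all():
df1.join(df2, on="key", how="outer").with_columns([
    pl.col("var1").fill_null(0),
    pl.col("var2").fill_null(-1)
])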

Polars: how to add a column in front?

What would be the most idiomatic (and efficient) way to add a column at the front of a polars DataFrame? The same thing as .with_column, but adding it at index 0?
You can select the columns in the order you want them in your new DataFrame.
df = pl.DataFrame({
    "a": [1, 2, 3],
    "b": [True, None, False]
})
df.select([
    pl.lit("foo").alias("z"),
    pl.all()
])
shape: (3, 3)
┌─────┬─────┬───────┐
│ z ┆ a ┆ b │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ bool │
╞═════╪═════╪═══════╡
│ foo ┆ 1 ┆ true │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ foo ┆ 2 ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ foo ┆ 3 ┆ false │
└─────┴─────┴───────┘
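If the new column already exists as a Series, another option (if I remember the API correctly) is DataFrame.insert_at_idx, which inserts it at the given position in place:
df.insert_at_idx(0, pl.Series("z", ["foo", "foo", "foo"]))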