python-polars string concatenation of two existing columns

I want to concatenate the last letter from two existing columns and create a new column from the result, using polars.LazyFrame.
For example, in pandas I can achieve this with the following code:
import pandas as pd
df = pd.DataFrame({"col1":["abc","def"], "col2":["ghi","jkl"]})
df["last_letters_concat"]=df["col1"].str.strip().str[-1]+df["col2"].str.strip().str[-1]
print(df)
My attempt in polars
import polars as pl
from polars import col
#using same df
df.lazy().with_column(
    (pl.col("col1")[-1] + pl.col('col2'))[-1].alias("last_letters_concat")
).collect()
How can I do this?

You can use the str.slice expression for that. Below I show two examples that produce the same result.
df = pl.DataFrame({
    "col1": ["abc", "def"],
    "col2": ["ghi", "jkl"],
})
# concat all last letters
out1 = df.select(
    pl.concat_str([pl.col("col1").str.slice(-1), pl.col("col2").str.slice(-1)])
)
# concat only two specific columns
out2 = df.select(
    pl.col("col1").str.slice(-1) + pl.col("col2").str.slice(-1)
)
assert out1.frame_equal(out2)
print(out1)
shape: (2, 1)
┌──────┐
│ col1 │
│ --- │
│ str │
╞══════╡
│ ci │
├╌╌╌╌╌╌┤
│ fl │
└──────┘
I recommend using the concat_str expression, as it has O(n) complexity, where n is the number of columns you add, whereas the addition operator has O(n^2) complexity.
EDIT: as of polars >= 0.14.12 the optimizer will ensure it is always linear complexity.
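For example, scaling the same idea to any number of columns still only needs a single concat_str call. A minimal sketch, assuming the frame also had columns col3 and col4 (which are not part of the original example):
cols = ["col1", "col2", "col3", "col4"]
df.select(
    pl.concat_str([pl.col(c).str.slice(-1) for c in cols]).alias("last_letters_concat")
)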


replace part of string with values from dictionary?

Loving the Polars library for its fantastic speed and easy syntax!
Struggling with this question - is there an analogue in Polars for the Pandas code below? Would like to replace strings using a dictionary.
I tried using this expression, but it returns TypeError: 'dict' object is not callable:
pl.col("List").str.replace_all(lambda key: key, dict())
This is the working Pandas code I am trying to replace with a Polars expression:
df = pd.DataFrame({'List': [
    'Systems',
    'Software',
    'Cleared'
]})
dic = {
    'Systems': 'Sys',
    'Software': 'Soft',
    'Cleared': 'Clr'
}
df["List"] = df["List"].replace(dic, regex=True)
Output:
List
0 Sys
1 Soft
2 Clr
You could build an expression by chaining multiple .replace_all() calls.
>>> replacements = pl.col("List")
>>> for old, new in dic.items():
...     replacements = replacements.str.replace_all(old, new)
>>> df.select(replacements)
shape: (3, 1)
┌──────┐
│ List │
│ --- │
│ str │
╞══════╡
│ Sys │
├╌╌╌╌╌╌┤
│ Soft │
├╌╌╌╌╌╌┤
│ Clr │
└──────┘
You can pass literal=True to .replace_all() if you don't need/want regex matching.
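For example, the same chained loop with the literal flag added (just a sketch of the idea):
>>> replacements = pl.col("List")
>>> for old, new in dic.items():
...     replacements = replacements.str.replace_all(old, new, literal=True)
>>> df.select(replacements)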
I think your best bet would be to turn your dic into a DataFrame and join the two.
You need to convert your dic into a format that will make a nice DataFrame. You can do that as a list of dicts, so that you have
dicdf = pl.DataFrame([{'List': x, 'newList': y} for x, y in dic.items()])
where List is your existing column name and newList is an arbitrary new column name that we'll get rid of later.
You'll want to join that with your original df and then select all columns except the old List, plus newList renamed to List:
df = df.join(
    dicdf,
    on='List'
).select([
    pl.exclude(['List', 'newList']),
    pl.col('newList').alias('List')
])
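Put together with the question's data, the whole join approach might look like this (a sketch; note that the default inner join keeps only rows whose List value actually appears in dicdf):
import polars as pl

df = pl.DataFrame({'List': ['Systems', 'Software', 'Cleared']})
dic = {'Systems': 'Sys', 'Software': 'Soft', 'Cleared': 'Clr'}

# build the lookup frame and swap in the mapped values
dicdf = pl.DataFrame([{'List': x, 'newList': y} for x, y in dic.items()])
df = df.join(dicdf, on='List').select([
    pl.exclude(['List', 'newList']),
    pl.col('newList').alias('List')
])
# the List column now holds the mapped values: Sys, Soft, Clr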

How do I pipe an expression in Polars?

How do you "pipe" an expression in Polars?
Consider this code:
def transformation(col: pl.Series) -> pl.Series:
    return col.tanh().suffix('_tanh')
It'd be nice to be able to do this:
df.with_columns([
    pl.col('colA').pipe(transformation),
    pl.col('colB').pipe(transformation),
    pl.col('colC').pipe(transformation),
    pl.col('colD').pipe(transformation),
])
But I don't think Polars supports .pipe for Series / expressions.
The alternative is
df.with_columns([
    transformation(pl.col('colA')),
    transformation(pl.col('colB')),
    transformation(pl.col('colC')),
    transformation(pl.col('colD')),
])
But this gets messy (IMO) when you have arguments to the transformation function
Edit:
I implemented this and it "works" for me
def _pipe(self, func, *args, **kwargs):
    return func(self, *args, **kwargs)
pl.Expr.pipe = _pipe
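With that patch in place, the with_columns example from above works as written. A minimal self-contained sketch (note this monkey-patch is not an official polars API):
import polars as pl

def _pipe(self, func, *args, **kwargs):
    return func(self, *args, **kwargs)

pl.Expr.pipe = _pipe  # monkey-patch, not an official polars API

def transformation(col: pl.Expr) -> pl.Expr:
    return col.tanh().suffix('_tanh')

df = pl.DataFrame({'colA': [-4.0], 'colB': [-2.0]})
df.with_columns([
    pl.col('colA').pipe(transformation),
    pl.col('colB').pipe(transformation),
])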
Typically (like pandas) you'd apply pipe at the DataFrame level.
Especially in conjunction with lazy evaluation, this is equivalent to chaining expressions: your function receives the underlying eager/lazy frame along with any optional *args and **kwargs, and by making the frame lazy() you ensure that your chain of operations can still take advantage of the query optimiser and parallelisation.
For example:
import polars as pl
# define some UDFs
def extend_with_tan(df):
    return df.with_columns(pl.all().tanh().suffix("_tanh"))

def mul_in_place(df, n):
    return df.select((pl.all() * n).suffix(f"_x{n}"))

# init lazyframe
df = pl.DataFrame({
    "colA": [-4],
    "colB": [-2],
    "colC": [10],
}).lazy()

# pipe/result
dfx = df.pipe(extend_with_tan).pipe(mul_in_place, n=3)
dfx.collect()
# ┌─────────┬─────────┬─────────┬──────────────┬──────────────┬──────────────┐
# │ colA_x3 ┆ colB_x3 ┆ colC_x3 ┆ colA_tanh_x3 ┆ colB_tanh_x3 ┆ colC_tanh_x3 │
# │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ i64 ┆ f64 ┆ f64 ┆ f64 │
# ╞═════════╪═════════╪═════════╪══════════════╪══════════════╪══════════════╡
# │ -12 ┆ -6 ┆ 30 ┆ -2.997988 ┆ -2.892083 ┆ 3.0 │
# └─────────┴─────────┴─────────┴──────────────┴──────────────┴──────────────┘
API Docs: polars "pipe" method
As you've probably realized, adding custom methods to enable method chaining is unfortunately not well supported as a first-class pattern in Python.
In polars, a canonical way that hopefully satisfies you is to instead write a function that returns an expression. You already do this (although the type hint is incorrectly set to pl.Series), but you can save some space by giving a string argument to the transformation function:
import polars as pl
df = pl.DataFrame({"colA": [-4], "colB": [-2], "colC": [0], "colD": [2]})
def transformation(name: str | list[str]) -> pl.Expr:
    return pl.col(name).tanh().suffix("_tanh")
df1 = df.with_columns(
    [
        transformation("colA"),
        transformation("colB"),
        transformation("colC"),
        transformation("colD"),
    ]
)
I realise this doesn't quite do what you wanted, but perhaps the following will cheer you up a bit. Since pl.col() can take a list of column names, we can do the following:
df2 = df.with_column(transformation(["colA", "colB", "colC", "colD"]))
assert df1.frame_equal(df2) # True
And we can even target all of them using a regular expression:
# ^col\w+$ is a regular expression matching `col<anything>`
df3 = df.with_column(transformation(r"^col\w+$"))
assert df1.frame_equal(df3) # True
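And if the transformation takes extra arguments (the "messy with arguments" case from the question), the same function-returning-an-expression pattern still reads cleanly. A sketch with a hypothetical scale parameter (not part of the original question):
def scaled_tanh(name: str | list[str], scale: float = 1.0) -> pl.Expr:
    # `scale` is a made-up extra argument, purely for illustration
    return (pl.col(name).tanh() * scale).suffix("_tanh")

df4 = df.with_column(scaled_tanh(["colA", "colB", "colC", "colD"], scale=2.0))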

Polars - how to parallelize lambda that uses only Polars expressions?

This runs on a single core, despite (seemingly) not using any non-Polars stuff. What am I doing wrong?
(The goal is to convert the list in the doc_ids field of every row into its string representation, such that [1, 2, 3] (list[int]) -> '[1, 2, 3]' (string).)
import polars as pl
df = pl.DataFrame(dict(ent = ['a', 'b'], doc_ids = [[2,3], [3]]))
df = (df.lazy()
    .with_column(
        pl.concat_str([
            pl.lit('['),
            pl.col('doc_ids').apply(lambda x: x.cast(pl.Utf8)).arr.join(', '),
            pl.lit(']')
        ])
        .alias('docs_str')
    )
    .drop('doc_ids')
).collect()
In general, we want to avoid apply at all costs. It acts like a black-box function that Polars cannot optimize, leading to single-threaded performance.
Here's one way that we can eliminate apply: replace it with arr.eval. arr.eval allows us to treat a list as if it were an Expression/Series, which allows us to use standard expressions on it.
(
    df.lazy()
    .with_column(
        pl.concat_str(
            [
                pl.lit("["),
                pl.col("doc_ids")
                .arr.eval(pl.element().cast(pl.Utf8))
                .arr.join(", "),
                pl.lit("]"),
            ]
        ).alias("docs_str")
    )
    .drop("doc_ids")
    .collect()
)
shape: (2, 2)
┌─────┬──────────┐
│ ent ┆ docs_str │
│ --- ┆ --- │
│ str ┆ str │
╞═════╪══════════╡
│ a ┆ [2, 3] │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ b ┆ [3] │
└─────┴──────────┘

Lightweight syntax for filtering a polars DataFrame on a multi-column key?

I'm wondering if there's a lightweight syntax for filtering a polars DataFrame against a multi-column key, other than inner/anti joins. (There's nothing wrong with the joins, but it would be nice if there's something more compact).
Using the following frame as an example:
import polars as pl
df = pl.DataFrame(
    data=[
        ["x", 123, 4.5, "misc"],
        ["y", 456, 10.0, "other"],
        ["z", 789, 99.5, "value"],
    ],
    columns=["a", "b", "c", "d"],
)
A PostgreSQL statement could use a VALUES expression, like so...
(("a","b") IN (VALUES ('x',123),('y',456)))
...and a pandas equivalent might set a multi-column index.
pf.set_index( ["a","b"], inplace=True )
pf[ pf.index.isin([('x',123),('y',456)]) ]
The polars syntax would look like this:
df.join(
    pl.DataFrame(
        data=[('x', 123), ('y', 456)],
        columns={col: tp for col, tp in df.schema.items() if col in ("a", "b")},
        orient='row',
    ),
    on=["a", "b"],
    how="inner",  # or 'anti' for "not in"
)
Is a multi-column is_in construct, or equivalent expression, currently available with polars? Something like the following would be great if it exists (or could be added):
df.filter( pl.cols("a","b").is_in([('x',123),('y',456)]) )
In the next polars release (>0.13.44) this will work on the struct datatype.
We convert the 2 (or more) columns we want to check into a struct with pl.struct and call the is_in expression. (A conversion to struct is a free operation.)
df = pl.DataFrame(
    data=[
        ["x", 123, 4.5, "misc"],
        ["y", 456, 10.0, "other"],
        ["z", 789, 99.5, "value"],
    ],
    columns=["a", "b", "c", "d"],
)
df.filter(
    pl.struct(["a", "b"]).is_in([{"a": "x", "b": 123}, {"a": "y", "b": 456}])
)
shape: (2, 4)
┌─────┬─────┬──────┬───────┐
│ a ┆ b ┆ c ┆ d │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ f64 ┆ str │
╞═════╪═════╪══════╪═══════╡
│ x ┆ 123 ┆ 4.5 ┆ misc │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ y ┆ 456 ┆ 10.0 ┆ other │
└─────┴─────┴──────┴───────┘
Filtering by data in another DataFrame.
The idiomatic way to filter data by presence in another DataFrame is with semi and anti joins. Inner joins also filter by presence, but they include the columns of the right-hand DataFrame, whereas a semi join does not and only filters the left-hand side.
semi: keep rows/keys that are in both DataFrames
anti: remove rows/keys that are in both DataFrames
The reason why these joins are preferred over is_in is that they are much faster and currently allow for more optimization.
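As a minimal sketch of what that looks like with the question's data (assuming a small keys frame holding the wanted ('a', 'b') pairs):
keys = pl.DataFrame(
    data=[('x', 123), ('y', 456)],
    columns=['a', 'b'],
    orient='row',
)
# semi: keep only rows of df whose ('a', 'b') key appears in keys
df.join(keys, on=['a', 'b'], how='semi')
# anti: keep only rows of df whose ('a', 'b') key does not appear in keys
df.join(keys, on=['a', 'b'], how='anti')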

Expression in Polars select context that refers to earlier alias

Is there a way to allow an expression in Polars to refer to a previous aliased expression? For example, this code that defines two new columns errors because the second new column refers to the first:
import polars as pl
df = pl.DataFrame(dict(x=[0, 0, 1]))
df.select([
    (pl.col('x') + 1).alias('y'),
    (pl.col('y') * 2).alias('z'),
])
# pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value:
# NotFound("Unable to get field named \"y\". Valid fields: [\"x\"]")
The error makes it obvious that the failure is caused by the first alias not being visible to the second expression. Is there a straightforward way to make this work?
All polars expressions within a context are executed in parallel. So they cannot refer to a column that does not yet exist.
A context is:
df.with_columns
df.select
df.groupby(..).agg
This means you need to enforce sequential execution for expressions that refer to the output of other expressions.
In your case I would do:
(df.with_column(
(pl.col('x') + 1).alias('y')
).select([
pl.col('y'),
(pl.col('y') * 2).alias('z')
]))
One workaround is to pull out each new column into its own with_column call and then do a final select to keep only the columns you want. You will probably want to make sure this is done lazily.
import polars as pl
df = pl.DataFrame(dict(x=[0, 0, 1]))
(df
    .lazy()
    .with_column((pl.col("x") + 1).alias("y"))
    .with_column((pl.col("y") * 2).alias("z"))
    .select(["y", "z"])
    .collect()
)
# shape: (3, 2)
# ┌─────┬─────┐
# │ y ┆ z │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 1 ┆ 2 │
# ├╌╌╌╌╌┼╌╌╌╌╌┤
# │ 1 ┆ 2 │
# ├╌╌╌╌╌┼╌╌╌╌╌┤
# │ 2 ┆ 4 │
# └─────┴─────┘