replace part of string with values from dictionary? - python-polars

Loving the Polars library for its fantastic speed and easy syntax!
Struggling with this question - is there an analogue in Polars for the Pandas code below? Would like to replace strings using a dictionary.
Tried using this expression, but it returns 'TypeError: 'dict' object is not callable'
pl.col("List").str.replace_all(lambda key: key,dict())
Trying to replace the Working Pandas code below with a Polars expression
df = pd.DataFrame({'List':[
'Systems',
'Software',
'Cleared'
]})
dic = {
'Systems':'Sys'
,'Software':'Soft'
,'Cleared':'Clr'
}
df["List"] = df["List"].replace(dic, regex=True)
Output:
List
0 Sys
1 Soft
2 Clr

You could build an expression by chaining multiple .replace_all() calls.
>>> replacements = pl.col("List")
>>> for old, new in dic.items():
... replacements = replacements.str.replace_all(old, new)
>>> df.select(replacements)
shape: (3, 1)
┌──────┐
│ List │
│ --- │
│ str │
╞══════╡
│ Sys │
├╌╌╌╌╌╌┤
│ Soft │
├╌╌╌╌╌╌┤
│ Clr │
└──────┘
You can pass literal=True to .replace_all() if you don't need/want regex matching.

I think your best bet would be to turn your dic into a dataframe and join the two.
You need to convert your dic to the format which will make a nice DataFrame. You can do that as a list of dicts so that you have
dicdf=pl.DataFrame([{'List':x, 'newList':y} for x,y in dic.items()])
where List is what your column name is and we're arbitrary making newList our new column name that we'll get rid of later
You'll want to join that with your original df and then select all columns except the old List plus newList but renamed to List
df=df.join(
dicdf,
on='List') \
.select([
pl.exclude(['List','newList']),
pl.col('newList').alias('List')
])

Related

Polars - how to parallelize lambda that uses only Polars expressions?

This runs on a single core, despite not using (seemingly) any non-Polars stuff. What am I doing wrong?
(the goal is to convert a list in doc_ids field in every row into its string representation, s.t. [1, 2, 3] (list[int]) -> '[1, 2, 3]' (string))
import polars as pl
df = pl.DataFrame(dict(ent = ['a', 'b'], doc_ids = [[2,3], [3]]))
df = (df.lazy()
.with_column(
pl.concat_str([
pl.lit('['),
pl.col('doc_ids').apply(lambda x: x.cast(pl.Utf8)).arr.join(', '),
pl.lit(']')
])
.alias('docs_str')
)
.drop('doc_ids')
).collect()
In general, we want to avoid apply at all costs. It acts like a black-box function that Polars cannot optimize, leading to single-threaded performance.
Here's one way that we can eliminate apply: replace it with arr.eval. arr.eval allows us to treat a list as if it were an Expression/Series, which allows us to use standard expressions on it.
(
df.lazy()
.with_column(
pl.concat_str(
[
pl.lit("["),
pl.col("doc_ids")
.arr.eval(pl.element().cast(pl.Utf8))
.arr.join(", "),
pl.lit("]"),
]
).alias("docs_str")
)
.drop("doc_ids")
.collect()
)
shape: (2, 2)
┌─────┬──────────┐
│ ent ┆ docs_str │
│ --- ┆ --- │
│ str ┆ str │
╞═════╪══════════╡
│ a ┆ [2, 3] │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ b ┆ [3] │
└─────┴──────────┘

python-polars string concatenation of two existing columns

I want to concatenate the last letter from two existing columns and create a new column from this using polars.LazyFrame
for example in pandas can achieve this with the following code
import pandas as pd
df = pd.DataFrame({"col1":["abc","def"], "col2":["ghi","jkl"]})
df["last_letters_concat"]=df["col1"].str.strip().str[-1]+df["col2"].str.strip().str[-1]
print(df)
My attempt in polars
import polars as pl
from polars import col
#using same df
df.lazy().with_column(
(pl.col("col1")[-1] + pl.col('col2'))[-1].alias("last_letters_concat")).collect()
How can i do this?
You can use the str.slice expression for that. Below I show to examples that produce the same result.
pl.DataFrame({
"col1": ["abc","def"],
"col2":["ghi","jkl"]
})
# concat all last letters
out1 = df.select(
pl.concat_str([pl.col("col1").str.slice(-1), pl.col("col2").str.slice(-1)])
)
# concat only two specific columns
out2 = df.select(
pl.col("col1").str.slice(-1) + pl.col("col2").str.slice(-1)
)
assert out1.frame_equal(out2)
print(out1)
shape: (2, 1)
┌──────┐
│ col1 │
│ --- │
│ str │
╞══════╡
│ ci │
├╌╌╌╌╌╌┤
│ fl │
└──────┘
I recommend using the concat_str expression as this has O(n) complexity where n is the number of columns you add, whereas the addition operator has O(n^2) complexity.
EDIT: as of polars >= 0.14.12 the optimizer will ensure it always is linear complexity

Lightweight syntax for filtering a polars DataFrame on a multi-column key?

I'm wondering if there's a lightweight syntax for filtering a polars DataFrame against a multi-column key, other than inner/anti joins. (There's nothing wrong with the joins, but it would be nice if there's something more compact).
Using the following frame as an example:
import polars as pl
df = pl.DataFrame(
data = [
["x",123, 4.5, "misc"],
["y",456,10.0,"other"],
["z",789,99.5,"value"],
],
columns = ["a","b","c","d"],
)
A PostgreSQL statement could use a VALUES expression, like so...
(("a","b") IN (VALUES ('x',123),('y',456)))
...and a pandas equivalent might set a multi-column index.
pf.set_index( ["a","b"], inplace=True )
pf[ pf.index.isin([('x',123),('y',456)]) ]
The polars syntax would look like this:
df.join(
pl.DataFrame(
data = [('x',123),('y',456)],
columns = {col:tp for col,tp in df.schema.items() if col in ("a","b")},
orient = 'row',
),
on = ["a","b"],
how = "inner", # or 'anti' for "not in"
)
Is a multi-column is_in construct, or equivalent expression, currently available with polars? Something like the following would be great if it exists (or could be added):
df.filter( pl.cols("a","b").is_in([('x',123),('y',456)]) )
In the next polars release >0.13.44 this will work on the struct datatype.
We convert the 2 (or more) columns we want to check to a struct with pl.struct and call the is_in expression. (A conversion to struct is a free operation)
df = pl.DataFrame(
data=[
["x", 123, 4.5, "misc"],
["y", 456, 10.0, "other"],
["z", 789, 99.5, "value"],
],
columns=["a", "b", "c", "d"],
)
df.filter(
pl.struct(["a", "b"]).is_in([{"a": "x", "b": 123}, {"a": "y", "b": 456}])
)
shape: (2, 4)
┌─────┬─────┬──────┬───────┐
│ a ┆ b ┆ c ┆ d │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ f64 ┆ str │
╞═════╪═════╪══════╪═══════╡
│ x ┆ 123 ┆ 4.5 ┆ misc │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ y ┆ 456 ┆ 10.0 ┆ other │
└─────┴─────┴──────┴───────┘
Filtering by data in another DataFrame.
The idiomatic way to filter data by presence in another DataFrame are semi and anti joins. Inner joins also filter by presence, but they include the columns of the right hand DataFrame, where a semi join does not and only filters the left hand side.
semi: keep rows/keys that are in both DataFrames
anti: remove rows/keys that are in both DataFrames
The reason why these joins are preferred over is_in is that they are much faster and currently allow for more optimization.

Expression in Polars select context that refers to earlier alias

Is there a way to allow an expression in Polars to refer to a previous aliased expression? For example, this code that defines two new columns errors because the second new column refers to the first:
import polars as pl
df = pl.DataFrame(dict(x=[0, 0, 1]))
df.select([
(pl.col('x') + 1).alias('y'),
(pl.col('y') * 2).alias('z')],
)
# pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value:
# NotFound("Unable to get field named \"y\". Valid fields: [\"x\"]")
The error makes it obvious that the failure is caused by the first alias not being visible to the second expression. Is there a straightforward way to make this work?
All polars expressions within a context are executed in parallel. So they cannot refer to a column that does not yet exist.
A context is:
df.with_columns
df.select
df.groupby(..).agg
This means you need to enforce sequential execution for expressions that reference to other expression outputs.
In your case I would do:
(df.with_column(
(pl.col('x') + 1).alias('y')
).select([
pl.col('y'),
(pl.col('y') * 2).alias('z')
]))
One workaround is to pull out each new column into its own with_column call and then do a final select to keep the columns you were supposed to keep. You will probably want to make sure this is done lazily.
import polars as pl
df = pl.DataFrame(dict(x=[0, 0, 1]))
(df
.lazy()
.with_column((pl.col("x") + 1).alias("y"))
.with_column((pl.col("y") * 2).alias("z"))
.select(["y", "z"])
.collect()
)
# shape: (3, 2)
# ┌─────┬─────┐
# │ y ┆ z │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 1 ┆ 2 │
# ├╌╌╌╌╌┼╌╌╌╌╌┤
# │ 1 ┆ 2 │
# ├╌╌╌╌╌┼╌╌╌╌╌┤
# │ 2 ┆ 4 │
# └─────┴─────┘

Is it possible to create a 2d json array with PostgreSQL

tab:
num │ value_two │ value_three │ value_four
─────┼───────────┼─────────────┼────────────
1 │ a │ A │ 4.0
2 │ a │ A2 │ 75.0
3 │ b │ A3 │ 7.0
I want to create a 2D json array like this
[[1,"a","A",4.0],[2,"a","A2",75.0],[3,"b","A3",7.0]]
I have tried two things:
First SELECT json_agg(tab) FROM tab but it returns an array of objects.
The second thing that I tried kinda works, the only detail is that it returns a 2d string array.
SELECT json_agg(ARRAY[num::TEXT,value_two,value_three,value_four::TEXT]) FROM tab
[["1","a","A","4.0"],["2","a","A2",75.0],["3","b","A3","7.0"]]
Short answer:
=# select json_agg(json_build_array(num, value_two, value_three, value_four)) as answer
from tab;
answer
-----------------------------------------------------------------
[[1, "a", "A", 4.0], [2, "a", "A2", 75.0], [3, "b", "A3", 7.0]]
(1 row)
Native PostgreSQL arrays like the one you created with
ARRAY[num::TEXT,value_two,value_three,value_four::TEXT]
are strictly typed, which is why you had to cast num and value_four to text.
To get the type mixing allowed in JSON, use json_build_array(), instead.