polars DataFrame - search strings from list - python-polars

I need to find which substrings from a list occur in each string of a column.
I am looking for an efficient way to do so.
Slow version:
import polars as pl
def search_text(queries, text):
    return [query for query in queries if query in text]
pl_df = pl.DataFrame({
    "Title": ["I am aa", "I am bbob"]
})
queries = ['aa', 'bb']
pl_df = pl_df.with_column(pl.col('Title').apply(lambda text: search_text(queries, text)).alias('Title_match'))
print(pl_df)
shape: (2, 2)
┌───────────┬─────────────┐
│ Title ┆ Title_match │
│ --- ┆ --- │
│ str ┆ list[str] │
╞═══════════╪═════════════╡
│ I am aa ┆ ["aa"] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ I am bbob ┆ ["bb"] │
└───────────┴─────────────┘

You could try .extract_all()
You can combine your query strings into a single regex:
>>> import re
...
... queries = "aa", "bb", "am"
... query = "|".join(map(re.escape, sorted(queries, key=len, reverse=True)))
...
... pl_df.with_column(
...     pl.col("Title").str.extract_all(query)
...     .alias("Title_match")
... )
shape: (2, 2)
┌───────────┬──────────────┐
│ Title | Title_match │
│ --- | --- │
│ str | list[str] │
╞═══════════╪══════════════╡
│ I am aa | ["am", "aa"] │
├───────────┼──────────────┤
│ I am bbob | ["am", "bb"] │
└───────────┴──────────────┘
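To see why the queries are sorted longest-first before joining, here is a minimal sketch that uses Python's own `re` module as a stand-in for the regex engine behind `.str.extract_all()`. The helper name `build_pattern` is hypothetical, and the sketch assumes both engines try the alternatives left to right at each position, so a shorter query listed first would shadow a longer one that starts with it.

```python
import re

def build_pattern(queries):
    """Join literal query strings into one alternation, longest first."""
    ordered = sorted(queries, key=len, reverse=True)
    # re.escape keeps characters such as "." or "+" from acting as regex syntax
    return "|".join(re.escape(q) for q in ordered)

queries = ["aa", "bb", "am"]
pattern = build_pattern(queries)

titles = ["I am aa", "I am bbob"]
matches = [re.findall(pattern, title) for title in titles]
print(matches)  # [['am', 'aa'], ['am', 'bb']]

# With overlapping queries the order of alternatives matters:
print(re.findall(build_pattern(["a", "ab"]), "abc"))  # ['ab']
print(re.findall("a|ab", "abc"))                      # ['a']
```

Regex matching returns non-overlapping matches in order of appearance, so a query that occurs twice in a title is listed twice, unlike the `in` check of the slow version.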

Related

How can I use when, then and otherwise with multiple conditions in polars?

I have a data set with three columns. Column A is to be checked for strings. If the string matches foo or spam, the values in the same row for the other two columns L and G should be changed to XX. For this I have tried the following.
df = pl.DataFrame(
{
"A": ["foo", "ham", "spam", "egg",],
"L": ["A54", "A12", "B84", "C12"],
"G": ["X34", "C84", "G96", "L6",],
}
)
print(df)
shape: (4, 3)
┌──────┬─────┬─────┐
│ A ┆ L ┆ G │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞══════╪═════╪═════╡
│ foo ┆ A54 ┆ X34 │
│ ham ┆ A12 ┆ C84 │
│ spam ┆ B84 ┆ G96 │
│ egg ┆ C12 ┆ L6 │
└──────┴─────┴─────┘
expected outcome
shape: (4, 3)
┌──────┬─────┬─────┐
│ A ┆ L ┆ G │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str │
╞══════╪═════╪═════╡
│ foo ┆ XX ┆ XX │
│ ham ┆ A12 ┆ C84 │
│ spam ┆ XX ┆ XX │
│ egg ┆ C12 ┆ L6 │
└──────┴─────┴─────┘
I tried this
df = df.with_column(
pl.when((pl.col("A") == "foo") | (pl.col("A") == "spam"))
.then((pl.col("L")= "XX") & (pl.col( "G")= "XX"))
.otherwise((pl.col("L"))&(pl.col( "G")))
)
However, this does not work. Can someone help me with this?
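For setting multiple columns to the same value you could use:
df.with_columns(
    # pl.lit() marks "XX" as a string literal; newer Polars versions read a bare string as a column name
    pl.when(pl.col("A").is_in(["foo", "spam"]))
    .then(pl.lit("XX"))
    .otherwise(pl.col(["L", "G"]))
    .keep_name()
)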
shape: (4, 3)
┌──────┬─────┬─────┐
│ A | L | G │
│ --- | --- | --- │
│ str | str | str │
╞══════╪═════╪═════╡
│ foo | XX | XX │
├──────┼─────┼─────┤
│ ham | A12 | C84 │
├──────┼─────┼─────┤
│ spam | XX | XX │
├──────┼─────┼─────┤
│ egg | C12 | L6 │
└──────┴─────┴─────┘
.is_in() can be used instead of multiple == x | == y chains.
To update multiple columns at once with different values you could use .map() and a dictionary:
df.with_columns(
    pl.when(pl.col("A").is_in(["foo", "spam"]))
    .then(pl.col(["L", "G"]).map(
        lambda column: {
            "L": "XX",
            "G": "YY",
        }.get(column.name)))
    .otherwise(pl.col(["L", "G"]))
)
shape: (4, 3)
┌──────┬─────┬─────┐
│ A | L | G │
│ --- | --- | --- │
│ str | str | str │
╞══════╪═════╪═════╡
│ foo | XX | YY │
├──────┼─────┼─────┤
│ ham | A12 | C84 │
├──────┼─────┼─────┤
│ spam | XX | YY │
├──────┼─────┼─────┤
│ egg | C12 | L6 │
└──────┴─────┴─────┘

Polars convert list of strings to list of categoricals

I'm trying to improve performance of my polars code by converting a list of string to a list of categorical type for my tags column:
shape: (3, 2)
┌─────┬────────────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ list[str] │
╞═════╪════════════╡
│ 1 ┆ ["a", "b"] │
│ 2 ┆ ["a"] │
│ 3 ┆ ["c", "d"] │
└─────┴────────────┘
df = pl.DataFrame({'a':[1,2,3], 'b':[['a','b'],['a'],['c','d']]})
df.with_column(pl.col('tags').cast(pl.list(pl.Categorical)))
However I get the following error:
ValueError: could not convert value 'Unknown' as a Literal
Does polars support lists of categoricals?
Polars does support lists of Categoricals.
The issue is that you're using pl.list() instead of pl.List(): data type names start with an uppercase letter.
>>> df.with_columns(pl.col('b').cast(pl.List(pl.Categorical)))
shape: (3, 2)
┌─────┬────────────┐
│ a | b │
│ --- | --- │
│ i64 | list[cat] │
╞═════╪════════════╡
│ 1 | ["a", "b"] │
│ 2 | ["a"] │
│ 3 | ["c", "d"] │
└─────┴────────────┘
pl.list() is something different - it appears to be shorthand syntax for pl.col().list()

How to select the last non-null value from one column and also the value from another column on the same row in Polars?

Below is a non working example in which I retrieve the last available 'Open' but how do I get corresponding 'Time'?
sel = self.data.select([pl.col('Time'),
pl.col('Open').drop_nulls().last()])
For instance, you can use .filter() to keep only the rows where the column is not null and then take the last row.
Here is an example:
df = pl.DataFrame({
"a": [1,2,3,4,5],
"b": ["cat", None, "owl", None, None]
})
┌─────┬──────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪══════╡
│ 1 ┆ cat │
│ 2 ┆ null │
│ 3 ┆ owl │
│ 4 ┆ null │
│ 5 ┆ null │
└─────┴──────┘
df.filter(
    pl.col("b").is_not_null()
).select(pl.all().last())
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═════╡
│ 3 ┆ owl │
└─────┴─────┘

Sum columns based on column names in a list for polars

In Python Polars, I can add two or more columns to make a new column with an expression like
frame.with_column((pl.col('colname1') + pl.col('colname2')).alias('new_colname'))
However, if I have all the column names in a list, is there a way to sum all the columns in that list and create a new column from the result?
Thanks
The sum expression supports horizontal summing. From the docs:
List[Expr] -> aggregate the sum value horizontally.
Sample code for reference:
import polars as pl
df = pl.DataFrame({"a": [1, 2, 3], "b": [1, 2, None]})
print(df)
This results in something like,
shape: (3, 2)
┌─────┬──────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪══════╡
│ 1 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2 ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 3 ┆ null │
└─────┴──────┘
On this you can do something like,
cols = ["a", "b"]
df2 = df.select(pl.sum([pl.col(i) for i in cols]).alias('new_colname'))
print(df2)
Which will result in,
shape: (3, 1)
┌─────────────┐
│ new_colname │
│ --- │
│ i64 │
╞═════════════╡
│ 2 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ null │
└─────────────┘

How to get row_count for a group in polars?

The usage might look like the code below:
out_df = df.select([
    pl.col("*"),
    pl.col("md5").row_count().over("md5").alias("row_count"),
])
print(out_df)
The data should be like this:
before:
md5
a
a
b
after:
md5 row_count
a 1
a 2
b 1
Maybe I'm misunderstanding, as your output has both values 1 and 2 for a. Assuming you meant 2 for both:
You are very close; Polars has .count():
import polars as pl
df = pl.DataFrame({"md5": ["a", "a", "b"]})
out_df = df.select([
    pl.col("*"),
    pl.col("md5").count().over("md5").alias("row_count"),
])
print(out_df)
Which prints out this:
shape: (3, 2)
┌─────┬───────────┐
│ md5 ┆ row_count │
│ --- ┆ --- │
│ str ┆ u32 │
╞═════╪═══════════╡
│ a ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ a ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ b ┆ 1 │
└─────┴───────────┘
If I understand correctly, you want a running count of the values seen so far within each group.
You can do this:
df = pl.DataFrame({"md5": ["a", "a", "b"]})
(df
    .with_column(pl.lit(1).alias("ones"))
    .select([
        pl.all().exclude("ones"),
        pl.col("ones").cumsum().over("md5").flatten().alias("row_count")
    ]))
shape: (3, 2)
┌─────┬───────────┐
│ md5 ┆ row_count │
│ --- ┆ --- │
│ str ┆ i32 │
╞═════╪═══════════╡
│ a ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ a ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ b ┆ 1 │
└─────┴───────────┘
We still have to add a dummy column "ones" because (as of polars==0.10.23) we cannot apply a window function over literals. We will add this functionality.
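The two answers compute different things: the first gives the size of each group, the second a running index within each group. A plain-Python reference (no Polars involved, helper names are hypothetical) makes the difference explicit and can be used to check the output of either expression.

```python
from collections import Counter

def group_sizes(values):
    """Total number of occurrences of each value, repeated per row."""
    totals = Counter(values)
    return [totals[v] for v in values]

def running_counts(values):
    """1-based position of each row within its group, in input order."""
    seen = Counter()
    result = []
    for v in values:
        seen[v] += 1
        result.append(seen[v])
    return result

md5 = ["a", "a", "b"]
print(group_sizes(md5))     # [2, 2, 1]
print(running_counts(md5))  # [1, 2, 1]
```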