Given the following data, I'm looking to group by and combine two columns into one that holds a dictionary. One column supplies the keys, while the values come from another column that is first aggregated into a list.
import polars as pl
data = pl.DataFrame(
    {
        "names": ["foo", "ham", "spam", "cheese", "egg", "foo"],
        "dates": ["1", "1", "2", "3", "3", "4"],
        "groups": ["A", "A", "B", "B", "B", "C"],
    }
)
>>> print(data)
names dates groups
0 foo 1 A
1 ham 1 A
2 spam 2 B
3 cheese 3 B
4 egg 3 B
5 foo 4 C
# This is what I'm trying to do:
groups combined
0 A {'1': ['foo', 'ham']}
1 B {'2': ['spam'], '3': ['cheese', 'egg']}
2 C {'4': ['foo']}
In pandas I can do this using two groupby statements, and in pyspark using a set of operations around "map_from_entries", but despite various attempts I haven't figured out a way in polars. So far I use agg_list(), convert to pandas and use a lambda. While this works, it certainly doesn't feel right.
data = data.groupby(["groups", "dates"])["names"].agg_list()
data = (
    data.to_pandas()
    .groupby(["groups"])
    .apply(lambda x: dict(zip(x["dates"], x["names_agg_list"])))
    .reset_index(name="combined")
)
Alternatively, inspired by this post, I've tried a number of variations similar to the following, including converting the dict to JSON strings, among other things.
data = data.groupby(["groups"]).agg(
    pl.apply(exprs=["dates", "names_agg_list"], f=build_dict).alias("combined")
)
With the release of polars>=0.12.10 you can do this:
print(
    data
    .groupby(["groups", "dates"]).agg(pl.col("names").list().keep_name())
    .groupby("groups")
    .agg([
        pl.apply([pl.col("dates"), pl.col("names")], lambda s: dict(zip(s[0], s[1].to_list())))
    ])
)
shape: (3, 2)
┌────────┬─────────────────────────────────────┐
│ groups ┆ dates │
│ --- ┆ --- │
│ str ┆ object │
╞════════╪═════════════════════════════════════╡
│ A ┆ {'1': ['foo', 'ham']} │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ C ┆ {'4': ['foo']} │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ B ┆ {'3': ['cheese', 'egg'], '2': ['... │
└────────┴─────────────────────────────────────┘
This is not really how you should be using DataFrames though. There is likely a solution that lets you work with flatter DataFrames and doesn't require you to put slow Python objects in them.
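If the nested dictionary is only needed as a final Python object, one option is to build it from plain rows outside the DataFrame. The following is a minimal, library-independent sketch rather than a Polars solution; the `rows` list simply restates the example data, and the helper name `combine` is made up for illustration.

```python
from collections import defaultdict

# Example data from the question, restated as (name, date, group) tuples.
rows = [
    ("foo", "1", "A"),
    ("ham", "1", "A"),
    ("spam", "2", "B"),
    ("cheese", "3", "B"),
    ("egg", "3", "B"),
    ("foo", "4", "C"),
]


def combine(rows):
    """Return {group: {date: [names, ...]}}, keeping first-seen order."""
    out = defaultdict(dict)
    for name, date, group in rows:
        # Create the list for this (group, date) pair on first sight.
        out[group].setdefault(date, []).append(name)
    return dict(out)


combined = combine(rows)
for group, mapping in combined.items():
    print(group, mapping)
```

This keeps the DataFrame itself flat and only produces Python objects at the very end.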
Consider the following dataframe:
df = pl.DataFrame({
"letters": ["A", "B", "C", "D", "E", "F", "G", "H"],
"values": ["aa", "bb", "cc", "dd", "ee", "ff", "gg", "hh"]
})
print(df)
shape: (8, 2)
┌─────────┬────────┐
│ letters ┆ values │
│ --- ┆ --- │
│ str ┆ str │
╞═════════╪════════╡
│ A ┆ aa │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ B ┆ bb │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ C ┆ cc │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ D ┆ dd │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ E ┆ ee │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ F ┆ ff │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ G ┆ gg │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ H ┆ hh │
└─────────┴────────┘
How do I take a window of size +/- N around any row that satisfies a given condition? For example, the condition is pl.col("letters").str.contains("D|F") and N = 2. Then, the output should be:
┌─────────┬────────────────────────────────┐
│ letters ┆ output │
│ --- ┆ --- │
│ str ┆ list[str] │
╞═════════╪════════════════════════════════╡
│ D ┆ ["bb", "cc", "dd", "ee", "ff"] │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ F ┆ ["dd", "ee", "ff", "gg", "hh"] │
└─────────┴────────────────────────────────┘
Note that the windows overlap in this case (the F window also contains dd and the D window also contains ff). Also, N = 2 here for the sake of simplicity but, in reality, it'll be larger (~10 - 20). The dataset is relatively large, so I'd like to do this as efficiently as possible without exploding memory usage.
EDIT: To make the ask more explicit, here's the query in DuckDB's SQL syntax that gives the right answer (and I'd like to know how to translate it to Polars):
import duckdb
df_table = df.to_arrow()
con = duckdb.connect()
query = """
SELECT
letters,
list(values) OVER (
ROWS BETWEEN 2 PRECEDING
AND 2 FOLLOWING
) as combined
FROM df_table
QUALIFY letters in ('D', 'F')
"""
print(pl.from_arrow(con.execute(query).arrow()))
shape: (2, 2)
┌─────────┬────────────────────────┐
│ letters ┆ combined │
│ --- ┆ --- │
│ str ┆ list[str] │
╞═════════╪════════════════════════╡
│ D ┆ ["bb", "cc", ... "ff"] │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ F ┆ ["dd", "ee", ... "hh"] │
└─────────┴────────────────────────┘
Benchmarks of suggested solutions
I ran the suggested solutions in a Jupyter notebook on one of Amazon's ml.c5.xlarge machines. While the notebook was running, I also kept htop open in a terminal to observe CPU and memory use. The dataset had 12M+ rows.
I ran both solutions via both the eager and lazy APIs. For good measure, I also tried DuckDB and a simple Python for loop that extracts the slices after identifying the rows of interest.
Summary Table
Polars had really robust performance and judicious memory use (with @jqurious' method) because of the clever, no-copy implementation of .shift(). Surprisingly, a well-thought-out Python for loop did just as well. DuckDB performed rather poorly in both speed and memory use.
Neither Polars nor DuckDB uses more than one core for the operation. Not sure if that's due to a lack of optimization or if this problem is just not amenable to parallelization. I suppose we're only filtering over one column and then taking slices of that same column, so there's not much multiple threads can do.
 method           │ cpu use     │ memory use     │ time
──────────────────┼─────────────┼────────────────┼────────
 ΩΠΟΚΕΚΡΥΜΜΕΝΟΣ   │ single core │ explosion      │
 jqurious         │ single core │ 2.53G to 2.53G │ 4.63 s
 (smart) for loop │ single core │ 2.53G to 2.58G │ 4.91 s
 DuckDB           │ single core │ 1.62G to 6.13G │ 38.6 s
cpu use shows whether multiple cores were taxed during the operation.
memory use shows how much memory was in use before the operation and the maximum memory use during the operation.
@ΩΠΟΚΕΚΡΥΜΜΕΝΟΣ's solution:
preceding = 2
following = 2
look_around = [pl.col("body").shift(-i)
               for i in range(-preceding, following + 1)]
(
    df
    .with_column(
        pl.when(pl.col('body').str.contains(regex))
        .then(pl.concat_list(look_around))
        .alias('combined')
    )
    .filter(pl.col('combined').is_not_null())
)
Unfortunately, on my rather large dataset, this solution caused the memory use to explode and the kernel to crash with both the eager and lazy APIs.
@jqurious' solution:
preceding = 2
following = 2
look_around = [
    pl.col("body").shift(-i).alias(f"lag_{i}") for i in range(-preceding, following + 1)
]
(
    df
    .with_columns(
        look_around
    )
    .filter(pl.col("body").str.contains(regex))
    .select([
        pl.col("body"),
        # Use the same range as look_around so the column names always match.
        pl.concat_list([f"lag_{i}" for i in range(-preceding, following + 1)]).alias("output")
    ])
)
eager:
cpu use: single-core
memory use: 2.53G -> 2.53G
time: 4.63 s ± 6.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
lazy:
cpu use: single-core
memory use: 2.53G -> 2.53G
time: 4.63 s ± 3.85 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
(Smart) Python for loop
preceding = 2
following = 2
output = []
indices = df.with_row_count().select(
    pl.col("row_nr").filter(pl.col("body").str.contains(regex))
)["row_nr"]
for idx, x in enumerate(indices):
    offset = max(0, x - preceding)
    length = preceding + following + 1
    output.append(df["body"].slice(offset, length))
cpu use: single-core
memory use: 2.53G -> 2.58G
time: 4.91 s ± 24.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
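One detail of the loop above: offset is clamped at 0 but length stays fixed, so a match within `preceding` rows of the start produces a window that reaches further to the right than intended. Below is a plain-Python sketch of slicing that clips the window at both ends instead. It uses the small letters example rather than the benchmark data, and the helper name `windows` is hypothetical.

```python
def windows(values, is_match, preceding=2, following=2):
    """Return (index, window) pairs for every matching row.

    Windows are clipped at both ends of the list instead of being shifted.
    """
    out = []
    for i, value in enumerate(values):
        if is_match(value):
            start = max(0, i - preceding)
            stop = min(len(values), i + following + 1)
            out.append((i, values[start:stop]))
    return out


values = ["aa", "bb", "cc", "dd", "ee", "ff", "gg", "hh"]

# Matches in the middle get the full 2 + 1 + 2 window.
result = windows(values, lambda v: v in ("dd", "ff"))

# A match near the start gets a shorter window: one row before, two after.
edge = windows(values, lambda v: v == "bb")

print(result)
print(edge)
```

Note that this differs from the .shift()-based solutions, which pad windows near the edges with nulls rather than shortening them.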
DuckDB
Note that I first converted the df to an Arrow Table before running the query so DuckDB could act on it directly. Also, I'm not sure whether converting the result back to Arrow takes up a large share of the computation, which would make the comparison unfair to DuckDB.
preceding = 2
following = 2
query = f"""
SELECT
body,
list(body) OVER (
ROWS BETWEEN {preceding} PRECEDING
AND {following} FOLLOWING
) as combined
FROM df_table
QUALIFY regexp_matches(body, '{regex}')
"""
result = con.execute(query).arrow()
With DuckDB, my first attempt to run the computation crashed. I had to retry by reading to an Arrow Table directly without using Polars (this saved about 1GB of memory) to give DuckDB more memory to use.
first try:
cpu: single-core
memory: 2.53G -> 6.93G -> crash!
time: NA
second try:
cpu: single-core
memory: 1.62G -> 6.13G
time: 38.6 s ± 311 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
A modification of "Use the rolling function of polars to get a list of all values in the rolling windows":
>>> (
... df
... .with_columns(
... [pl.col("values").shift(i).alias(f"lag_{i}") for i in range(-2, 3)])
... .filter(pl.col("letters").str.contains("D|F"))
... .select([
... pl.col("letters"),
... pl.concat_list(reversed([f"lag_{i}" for i in range(-2, 3)])).alias("output")
... ])
... )
shape: (2, 2)
┌─────────┬────────────────────────────────┐
│ letters ┆ output                         │
│ ---     ┆ ---                            │
│ str     ┆ list[str]                      │
╞═════════╪════════════════════════════════╡
│ D       ┆ ["bb", "cc", "dd", "ee", "ff"] │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ F       ┆ ["dd", "ee", "ff", "gg", "hh"] │
└─────────┴────────────────────────────────┘
You can try this:
preceding = 2
following = 2
look_around = [pl.col("values").shift(-i)
               for i in range(-preceding, following + 1)]
(
    df
    .with_column(
        pl.when(pl.col('letters').str.contains('D|F'))
        .then(pl.concat_list(look_around))
        .alias('combined')
    )
    .filter(pl.col('combined').is_not_null())
)
shape: (2, 3)
┌─────────┬────────┬────────────────────────┐
│ letters ┆ values ┆ combined │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ list[str] │
╞═════════╪════════╪════════════════════════╡
│ D ┆ dd ┆ ["bb", "cc", ... "ff"] │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ F ┆ ff ┆ ["dd", "ee", ... "hh"] │
└─────────┴────────┴────────────────────────┘
Loving the Polars library for its fantastic speed and easy syntax!
Struggling with this question - is there an analogue in Polars for the Pandas code below? Would like to replace strings using a dictionary.
Tried using this expression, but it returns 'TypeError: 'dict' object is not callable'
pl.col("List").str.replace_all(lambda key: key,dict())
Trying to replace the working pandas code below with a Polars expression:
import pandas as pd
df = pd.DataFrame({'List':[
'Systems',
'Software',
'Cleared'
]})
dic = {
'Systems':'Sys'
,'Software':'Soft'
,'Cleared':'Clr'
}
df["List"] = df["List"].replace(dic, regex=True)
Output:
List
0 Sys
1 Soft
2 Clr
You could build an expression by chaining multiple .replace_all() calls.
>>> replacements = pl.col("List")
>>> for old, new in dic.items():
...     replacements = replacements.str.replace_all(old, new)
>>> df.select(replacements)
shape: (3, 1)
┌──────┐
│ List │
│ --- │
│ str │
╞══════╡
│ Sys │
├╌╌╌╌╌╌┤
│ Soft │
├╌╌╌╌╌╌┤
│ Clr │
└──────┘
You can pass literal=True to .replace_all() if you don't need/want regex matching.
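The difference between regex and literal matching matters once a key contains a regex metacharacter. Here is a plain-Python sketch using the standard `re` module (an analogy, not the Polars API; the helper `replace_all` and the `"a.c"` key are made up for illustration) that applies chained replacements in both modes:

```python
import re


def replace_all(text, mapping, literal=False):
    """Apply each old -> new replacement in turn, like chained replace calls."""
    for old, new in mapping.items():
        # In literal mode, escape the key so metacharacters lose their meaning.
        pattern = re.escape(old) if literal else old
        # A function replacement avoids backslash processing in `new`.
        text = re.sub(pattern, lambda _m, new=new: new, text)
    return text


dic = {"Systems": "Sys", "Software": "Soft", "Cleared": "Clr"}
print([replace_all(s, dic) for s in ["Systems", "Software", "Cleared"]])

# A key with a metacharacter behaves differently in the two modes.
special = {"a.c": "X"}
print(replace_all("abc a.c", special))                # "." matches any character
print(replace_all("abc a.c", special, literal=True))  # only the exact text matches
```

For keys like the ones in the question, which contain no metacharacters, both modes give the same result.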
I think your best bet would be to turn your dic into a dataframe and join the two.
You need to convert your dic to the format which will make a nice DataFrame. You can do that as a list of dicts so that you have
dicdf=pl.DataFrame([{'List':x, 'newList':y} for x,y in dic.items()])
where List is your column name and newList is an arbitrary name for a new column that we'll get rid of later.
You'll want to join that with your original df and then select all columns except the old List plus newList but renamed to List
df=df.join(
dicdf,
on='List') \
.select([
pl.exclude(['List','newList']),
pl.col('newList').alias('List')
])
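Two things are worth keeping in mind with the join approach: it matches whole cell values rather than substrings, and an inner join keeps only rows whose value appears in the lookup table. The plain-Python sketch below (not Polars; the extra `"Other"` row is an assumption added for illustration) contrasts inner-join behaviour with a lookup that falls back to the original value:

```python
dic = {"Systems": "Sys", "Software": "Soft", "Cleared": "Clr"}
column = ["Systems", "Software", "Cleared", "Other"]

# Inner-join semantics: rows without a match in the lookup disappear.
inner = [dic[value] for value in column if value in dic]

# Lookup with fallback: unmatched rows keep their original value.
fallback = [dic.get(value, value) for value in column]

print(inner)
print(fallback)
```

If every value is guaranteed to be a key of dic, the two results are identical.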
This runs on a single core, despite not using (seemingly) any non-Polars stuff. What am I doing wrong?
(the goal is to convert a list in doc_ids field in every row into its string representation, s.t. [1, 2, 3] (list[int]) -> '[1, 2, 3]' (string))
import polars as pl
df = pl.DataFrame(dict(ent = ['a', 'b'], doc_ids = [[2,3], [3]]))
df = (df.lazy()
.with_column(
pl.concat_str([
pl.lit('['),
pl.col('doc_ids').apply(lambda x: x.cast(pl.Utf8)).arr.join(', '),
pl.lit(']')
])
.alias('docs_str')
)
.drop('doc_ids')
).collect()
In general, we want to avoid apply at all costs. It acts like a black-box function that Polars cannot optimize, leading to single-threaded performance.
Here's one way that we can eliminate apply: replace it with arr.eval. arr.eval allows us to treat a list as if it were an Expression/Series, which allows us to use standard expressions on it.
(
df.lazy()
.with_column(
pl.concat_str(
[
pl.lit("["),
pl.col("doc_ids")
.arr.eval(pl.element().cast(pl.Utf8))
.arr.join(", "),
pl.lit("]"),
]
).alias("docs_str")
)
.drop("doc_ids")
.collect()
)
shape: (2, 2)
┌─────┬──────────┐
│ ent ┆ docs_str │
│ --- ┆ --- │
│ str ┆ str │
╞═════╪══════════╡
│ a ┆ [2, 3] │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ b ┆ [3] │
└─────┴──────────┘
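For reference, the three steps in that expression (cast each element to a string, join with ", ", wrap in brackets) can be stated in plain Python. This is only a sketch of the target format for checking results, not a replacement for the expression above; the helper name `list_repr` is hypothetical.

```python
def list_repr(ids):
    """Cast each element to str, join with ', ', and wrap in brackets."""
    return "[" + ", ".join(str(x) for x in ids) + "]"


doc_ids = [[2, 3], [3]]
docs_str = [list_repr(ids) for ids in doc_ids]
print(docs_str)
```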
I'm wondering if there's a lightweight syntax for filtering a polars DataFrame against a multi-column key, other than inner/anti joins. (There's nothing wrong with the joins, but it would be nice if there's something more compact).
Using the following frame as an example:
import polars as pl
df = pl.DataFrame(
data = [
["x",123, 4.5, "misc"],
["y",456,10.0,"other"],
["z",789,99.5,"value"],
],
columns = ["a","b","c","d"],
)
A PostgreSQL statement could use a VALUES expression, like so...
(("a","b") IN (VALUES ('x',123),('y',456)))
...and a pandas equivalent might set a multi-column index.
pf.set_index( ["a","b"], inplace=True )
pf[ pf.index.isin([('x',123),('y',456)]) ]
The polars syntax would look like this:
df.join(
pl.DataFrame(
data = [('x',123),('y',456)],
columns = {col:tp for col,tp in df.schema.items() if col in ("a","b")},
orient = 'row',
),
on = ["a","b"],
how = "inner", # or 'anti' for "not in"
)
Is a multi-column is_in construct, or equivalent expression, currently available with polars? Something like the following would be great if it exists (or could be added):
df.filter( pl.cols("a","b").is_in([('x',123),('y',456)]) )
In the next polars release >0.13.44 this will work on the struct datatype.
We convert the 2 (or more) columns we want to check to a struct with pl.struct and call the is_in expression. (A conversion to struct is a free operation)
df = pl.DataFrame(
data=[
["x", 123, 4.5, "misc"],
["y", 456, 10.0, "other"],
["z", 789, 99.5, "value"],
],
columns=["a", "b", "c", "d"],
)
df.filter(
pl.struct(["a", "b"]).is_in([{"a": "x", "b": 123}, {"a": "y", "b": 456}])
)
shape: (2, 4)
┌─────┬─────┬──────┬───────┐
│ a ┆ b ┆ c ┆ d │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ f64 ┆ str │
╞═════╪═════╪══════╪═══════╡
│ x ┆ 123 ┆ 4.5 ┆ misc │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ y ┆ 456 ┆ 10.0 ┆ other │
└─────┴─────┴──────┴───────┘
Filtering by data in another DataFrame.
The idiomatic way to filter data by presence in another DataFrame is a semi or anti join. Inner joins also filter by presence, but they include the columns of the right-hand DataFrame, whereas a semi join only filters the left-hand side.
semi: keep rows/keys that are in both DataFrames
anti: remove rows/keys that are in both DataFrames
The reason why these joins are preferred over is_in is that they are much faster and currently allow for more optimization.
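What the two joins return can be sketched in plain Python with a set of key tuples. This illustrates the semantics on the example frame from the question; it is not how Polars implements them, and the helper `key_of` is made up for illustration.

```python
rows = [
    {"a": "x", "b": 123, "c": 4.5, "d": "misc"},
    {"a": "y", "b": 456, "c": 10.0, "d": "other"},
    {"a": "z", "b": 789, "c": 99.5, "d": "value"},
]
# The multi-column keys to filter against (the "right-hand" side).
keys = {("x", 123), ("y", 456)}


def key_of(row, on=("a", "b")):
    """Extract the join key of a row as a tuple."""
    return tuple(row[col] for col in on)


# semi: keep left rows whose key exists on the right; no columns are added.
semi = [row for row in rows if key_of(row) in keys]

# anti: keep left rows whose key does not exist on the right.
anti = [row for row in rows if key_of(row) not in keys]

print(semi)
print(anti)
```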
tab:
num │ value_two │ value_three │ value_four
─────┼───────────┼─────────────┼────────────
1 │ a │ A │ 4.0
2 │ a │ A2 │ 75.0
3 │ b │ A3 │ 7.0
I want to create a 2D JSON array like this:
[[1,"a","A",4.0],[2,"a","A2",75.0],[3,"b","A3",7.0]]
I have tried two things:
First SELECT json_agg(tab) FROM tab but it returns an array of objects.
The second thing that I tried kinda works, the only detail is that it returns a 2d string array.
SELECT json_agg(ARRAY[num::TEXT,value_two,value_three,value_four::TEXT]) FROM tab
[["1","a","A","4.0"],["2","a","A2","75.0"],["3","b","A3","7.0"]]
Short answer:
=# select json_agg(json_build_array(num, value_two, value_three, value_four)) as answer
from tab;
answer
-----------------------------------------------------------------
[[1, "a", "A", 4.0], [2, "a", "A2", 75.0], [3, "b", "A3", 7.0]]
(1 row)
Native PostgreSQL arrays like the one you created with
ARRAY[num::TEXT,value_two,value_three,value_four::TEXT]
are strictly typed, which is why you had to cast num and value_four to text.
To get the type mixing allowed in JSON, use json_build_array(), instead.
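The same distinction shows up outside the database. Here is a small Python sketch with the standard `json` module (an analogy, not PostgreSQL itself): casting every field to text mirrors the strictly typed ARRAY[...], while keeping the native values mirrors json_build_array().

```python
import json

# The rows of tab from the question.
tab = [
    (1, "a", "A", 4.0),
    (2, "a", "A2", 75.0),
    (3, "b", "A3", 7.0),
]

# Homogeneous text array: every element is forced to a string.
as_text = json.dumps([[str(value) for value in row] for row in tab])

# Mixed-type array: numbers stay numbers, strings stay strings.
mixed = json.dumps([list(row) for row in tab])

print(as_text)
print(mixed)
```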