Polars: 0.16.2
Python: 3.11.1
Windows 10
Attempting to filter a column using a time range via .is_between()
Couldn't find anything on StackOverflow, but found (maybe?) something similar in the GitHub issues (though that one has been solved): https://github.com/pola-rs/polars/issues/5236
To reproduce
import polars as pl
from datetime import datetime, time
df = pl.date_range(low=datetime(2023, 2, 7), high=datetime(2023, 2, 8), interval="30m", name="date").to_frame()
# Attempt to filter by time
df.filter(
pl.col('date').is_between(time(9, 30), time(14, 30))
)
Traceback:
PanicException Traceback (most recent call last)
Cell In[11], line 1
----> 1 df.filter(
2 pl.col('date').is_between(time(9, 30, 0, 0), time(14, 30, 0, 0))
3 )
File d:\My_Path\venv\Lib\site-packages\polars\internals\dataframe\frame.py:2747, in DataFrame.filter(self, predicate)
2741 if _check_for_numpy(predicate) and isinstance(predicate, np.ndarray):
2742 predicate = pli.Series(predicate)
2744 return (
2745 self.lazy()
2746 .filter(predicate) # type: ignore[arg-type]
-> 2747 .collect(no_optimization=True)
2748 )
File d:\My_Path\venv\Lib\site-packages\polars\internals\lazyframe\frame.py:1146, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, no_optimization, slice_pushdown, common_subplan_elimination, streaming)
1135 common_subplan_elimination = False
1137 ldf = self._ldf.optimization_toggle(
1138 type_coercion,
1139 predicate_pushdown,
(...)
1144 streaming,
1145 )
-> 1146 return pli.wrap_df(ldf.collect())
PanicException: cannot coerce datatypes: ComputeError(Owned("Failed to determine supertype of Datetime(Microseconds, None) and Time"))
Not sure if I'm doing something wrong, or if this is a bug.
Tried to filter a series using a time range, and expected a filtered series for just those times. Instead, I got a PanicException (listed above).
You are trying to filter a Datetime column with a Time. You need to cast the column to pl.Time before calling is_between:
df.filter(
pl.col('date').cast(pl.Time).is_between(time(9, 30), time(14, 30))
)
┌─────────────────────┐
│ date │
│ --- │
│ datetime[μs] │
╞═════════════════════╡
│ 2023-02-07 10:00:00 │
│ 2023-02-07 10:30:00 │
│ 2023-02-07 11:00:00 │
│ 2023-02-07 11:30:00 │
│ 2023-02-07 12:00:00 │
│ 2023-02-07 12:30:00 │
│ 2023-02-07 13:00:00 │
│ 2023-02-07 13:30:00 │
│ 2023-02-07 14:00:00 │
└─────────────────────┘
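The same type mismatch can be seen without polars at all. The following is a minimal plain-Python sketch (standard library only, not the polars implementation) showing that a full datetime cannot be compared with a bare time of day, and that reducing each timestamp to its time component first makes the range check work. The output above excludes both 09:30 and 14:30, so the sketch uses strict comparisons to match it; the variable names are purely illustrative.

```python
from datetime import datetime, time, timedelta

# Half-hourly timestamps from 2023-02-07 00:00 to 2023-02-08 00:00 inclusive.
start = datetime(2023, 2, 7)
stamps = [start + timedelta(minutes=30 * i) for i in range(49)]

# A full datetime and a bare time of day are not comparable.
try:
    stamps[0] < time(9, 30)
    comparable = True
except TypeError:
    comparable = False

# Reduce each timestamp to its time of day first, then compare.
lower, upper = time(9, 30), time(14, 30)
kept = [ts for ts in stamps if lower < ts.time() < upper]

print(comparable)         # False
print(kept[0], kept[-1])  # 2023-02-07 10:00:00 2023-02-07 14:00:00
```

Casting the column to pl.Time inside the filter plays the role of ts.time() here: it only affects the predicate, which is why the filtered frame still shows full datetimes.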
Say I have
df = pl.DataFrame({'group': [1, 1, 1, 3, 3, 3, 4, 4]})
I have a numpy array of values, which I'd like to replace 'group' 3 with
values = np.array([9, 8, 7])
Here's what I've tried:
(
df
.with_column(
pl.when(pl.col('group')==3)
.then(values)
.otherwise(pl.col('group'))
.alias('group')
)
)
In [4]: df.with_column(pl.when(pl.col('group')==3).then(values).otherwise(pl.col('group')).alias('group'))
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In [4], line 1
----> 1 df.with_column(pl.when(pl.col('group')==3).then(values).otherwise(pl.col('group')).alias('group'))
File ~/tmp/.venv/lib/python3.8/site-packages/polars/internals/whenthen.py:132, in When.then(self, expr)
111 def then(
112 self,
113 expr: (
(...)
121 ),
122 ) -> WhenThen:
123 """
124 Values to return in case of the predicate being `True`.
125
(...)
130
131 """
--> 132 expr = pli.expr_to_lit_or_expr(expr)
133 pywhenthen = self._pywhen.then(expr._pyexpr)
134 return WhenThen(pywhenthen)
File ~/tmp/.venv/lib/python3.8/site-packages/polars/internals/expr/expr.py:118, in expr_to_lit_or_expr(expr, str_to_lit)
116 return expr.otherwise(None)
117 else:
--> 118 raise ValueError(
119 f"did not expect value {expr} of type {type(expr)}, maybe disambiguate with"
120 " pl.lit or pl.col"
121 )
ValueError: did not expect value [9 8 7] of type <class 'numpy.ndarray'>, maybe disambiguate with pl.lit or pl.col
How can I do this correctly?
A few things to consider.
One is that you should always convert your numpy arrays to polars Series, as we use the Arrow memory specification underneath and not numpy's.
Second is that when -> then -> otherwise operates on columns of equal length. We nudge the API in such a direction that you define a logical statement based on columns in your DataFrame, and therefore you should not need to know the indices (nor the length of a group) that you want to replace. This allows for many optimizations, because if you do not define indices to replace we can push down a filter before that expression.
Anyway, your specific situation does know the length of the group, so we must use something different. We can first compute the indices where the condition holds and then modify based on those indices.
df = pl.DataFrame({
"group": [1, 1, 1, 3, 3, 3, 4, 4]
})
values = np.array([9, 8, 7])
# compute indices of the predicate
idx = df.select(
pl.arg_where(pl.col("group") == 3)
).to_series()
# mutate on those locations
df.with_column(
df["group"].set_at_idx(idx, pl.Series(values))
)
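The two steps in the snippet above (find the positions where the predicate holds, then write the replacement values at those positions) can be sketched in plain Python. This is only an illustration of the logic, not of how polars implements arg_where or set_at_idx; the length check is an added assumption that makes the "known group length" requirement explicit.

```python
group = [1, 1, 1, 3, 3, 3, 4, 4]
values = [9, 8, 7]

# Step 1: positions where the predicate (group == 3) holds.
idx = [i for i, g in enumerate(group) if g == 3]

# The replacement only makes sense if there is exactly one value per match.
if len(idx) != len(values):
    raise ValueError(f"expected {len(idx)} replacement values, got {len(values)}")

# Step 2: write the new values at those positions, leaving the rest untouched.
result = group.copy()
for i, v in zip(idx, values):
    result[i] = v

print(idx)     # [3, 4, 5]
print(result)  # [1, 1, 1, 9, 8, 7, 4, 4]
```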
Here's all I could come up with
df.with_column(
pl.col("group")
.cumcount()
.over(pl.col("group"))
.alias("idx")
).apply(
lambda x: values[x[1]] if x[0] == 3 else x[0]
).select(
pl.col("apply").alias("group")
)
Surely there's a simpler way?
In [28]: df.with_column(pl.col('group').cumcount().over(pl.col('group')).alias('idx')).apply(lambda x:
...: values[x[1]] if x[0] == 3 else x[0]).select(pl.col('apply').alias('group'))
Out[28]:
shape: (8, 1)
┌───────┐
│ group │
│ --- │
│ i64 │
╞═══════╡
│ 1 │
├╌╌╌╌╌╌╌┤
│ 1 │
├╌╌╌╌╌╌╌┤
│ 1 │
├╌╌╌╌╌╌╌┤
│ 9 │
├╌╌╌╌╌╌╌┤
│ 8 │
├╌╌╌╌╌╌╌┤
│ 7 │
├╌╌╌╌╌╌╌┤
│ 4 │
├╌╌╌╌╌╌╌┤
│ 4 │
└───────┘
Consider the following dataframe:
df = pl.DataFrame({
"letters": ["A", "B", "C", "D", "E", "F", "G", "H"],
"values": ["aa", "bb", "cc", "dd", "ee", "ff", "gg", "hh"]
})
print(df)
shape: (8, 2)
┌─────────┬────────┐
│ letters ┆ values │
│ --- ┆ --- │
│ str ┆ str │
╞═════════╪════════╡
│ A ┆ aa │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ B ┆ bb │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ C ┆ cc │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ D ┆ dd │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ E ┆ ee │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ F ┆ ff │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ G ┆ gg │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ H ┆ hh │
└─────────┴────────┘
How do I take a window of size +/- N around any row that satisfies a given condition? For example, the condition is pl.col("letters").str.contains("D|F") and N = 2. Then, the output should be:
┌─────────┬────────────────────────────────┐
│ letters ┆ output │
│ --- ┆ --- │
│ str ┆ list[str] │
╞═════════╪════════════════════════════════╡
│ D ┆ ["bb", "cc", "dd", "ee", "ff"] │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ F ┆ ["dd", "ee", "ff", "gg", "hh"] │
└─────────┴────────────────────────────────┘
Note that the windows are overlapping in this case (the F window also contains dd and the D window also contains ff). Also, note that N = 2 for the sake of simplicity here but, in reality, it'll be larger (~10 - 20). And the dataset is relatively large, so I'd like to do this as efficiently as possible without exploding memory usage.
EDIT: To make the ask more explicit, here's the query in DuckDB's SQL syntax that gives the right answer (and I'd like to know how to translate it to Polars):
import duckdb
df_table = df.to_arrow()
con = duckdb.connect()
query = """
SELECT
letters,
list(values) OVER (
ROWS BETWEEN 2 PRECEDING
AND 2 FOLLOWING
) as combined
FROM df_table
QUALIFY letters in ('D', 'F')
"""
print(pl.from_arrow(con.execute(query).arrow()))
shape: (2, 2)
┌─────────┬────────────────────────┐
│ letters ┆ combined │
│ --- ┆ --- │
│ str ┆ list[str] │
╞═════════╪════════════════════════╡
│ D ┆ ["bb", "cc", ... "ff"] │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ F ┆ ["dd", "ee", ... "hh"] │
└─────────┴────────────────────────┘
Benchmarks of suggested solutions
I ran the suggested solutions in a Jupyter notebook on one of Amazon's ml.c5.xlarge machines. While the notebook was running, I also kept htop open in a terminal to observe CPU and memory use. The dataset had 12M+ rows.
I ran both solutions via both the eager and lazy APIs. For good measure, I also tried using a simple Python for loop to extract the slices after identifying the rows of interest and also DuckDB.
Summary Table
Polars had really robust performance and judicious memory use (with @jqurious' method) because of the clever, no-copy implementation of .shift(). Surprisingly, a well-thought-out Python for loop did just as well. DuckDB performed rather poorly in both speed and memory use.
Neither Polars nor DuckDB used more than one core for the operation. I'm not sure whether that's due to a lack of optimization or whether this problem just isn't amenable to parallelization. I suppose we're only filtering over one column and then taking slices of that same column, so there's not much that multiple threads can do.
method            cpu use      memory use      time
ΩΠΟΚΕΚΡΥΜΜΕΝΟΣ    single core  explosion
jqurious          single core  2.53G to 2.53G  4.63 s
(smart) for loop  single core  2.53G to 2.58G  4.91 s
DuckDB            single core  1.62G to 6.13G  38.6 s
cpu use shows whether multiple cores were taxed during the operation.
memory use shows how much memory was being used before the operation and the maximum memory use during the operation.
@ΩΠΟΚΕΚΡΥΜΜΕΝΟΣ's solution:
preceding = 2
following = 2
look_around = [pl.col("body").shift(-i)
for i in range(-preceding, following + 1)]
(
df
.with_column(
pl.when(pl.col('body').str.contains(regex))
.then(pl.concat_list(look_around))
.alias('combined')
)
.filter(pl.col('combined').is_not_null())
)
Unfortunately, on my rather large dataset, this solution caused the memory use to explode and the kernel to crash with both the eager and lazy APIs.
@jqurious' solution:
preceding = 2
following = 2
look_around = [
pl.col("body").shift(-i).alias(f"lag_{i}") for i in range(-preceding, following + 1)
]
(
df
.with_columns(
look_around
)
.filter(pl.col("body").str.contains(regex))
.select([
pl.col("body"),
pl.concat_list([f"lag_{i}" for i in range(-preceding, following + 1)]).alias("output")
])
)
eager:
cpu use: single-core
memory use: 2.53G -> 2.53G
time: 4.63 s ± 6.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
lazy:
cpu use: single-core
memory use: 2.53G -> 2.53G
time: 4.63 s ± 3.85 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
(Smart) Python for loop
preceding = 2
following = 2
output = []
indices = df.with_row_count().select(
pl.col("row_nr").filter(pl.col("body").str.contains(regex))
)["row_nr"]
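for x in indices:
    # clamp the window start at row 0 and shorten the slice accordingly,
    # so a match near the top does not pull in extra following rows
    offset = max(0, x - preceding)
    length = x + following + 1 - offset
    output.append(df["body"].slice(offset, length))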
cpu use: single-core
memory use: 2.53G -> 2.58G
time: 4.91 s ± 24.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
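One detail worth checking in any slice-based approach is what happens to matches near the first or last row. Below is a small plain-Python sketch of the same "find the matching rows, then slice around each one" idea on the toy letters/values data from the question. The helper name window and the choice to shorten (rather than pad) windows at the edges are assumptions made for illustration.

```python
def window(xs, row, preceding, following):
    """Return the values around `row`, clamped to the bounds of `xs`."""
    start = max(0, row - preceding)
    stop = min(len(xs), row + following + 1)
    return xs[start:stop]

letters = ["A", "B", "C", "D", "E", "F", "G", "H"]
values = ["aa", "bb", "cc", "dd", "ee", "ff", "gg", "hh"]

# Rows of interest: one at each edge and one in the middle.
matches = [i for i, letter in enumerate(letters) if letter in ("A", "D", "H")]
output = [window(values, i, preceding=2, following=2) for i in matches]

print(output[0])  # ['aa', 'bb', 'cc']
print(output[1])  # ['bb', 'cc', 'dd', 'ee', 'ff']
print(output[2])  # ['ff', 'gg', 'hh']
```

Only the matching rows are ever sliced, which is consistent with the small memory increase observed for the loop in the benchmark.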
DuckDB
Note that I first converted the df to an Arrow Table before running the query so that DuckDB could act on it directly. Also, I'm not sure whether converting the result back to Arrow takes up a large share of the computation, which would make the comparison unfair to DuckDB.
preceding = 2
following = 2
query = f"""
SELECT
body,
list(body) OVER (
ROWS BETWEEN {preceding} PRECEDING
AND {following} FOLLOWING
) as combined
FROM df_table
QUALIFY regexp_matches(body, '{regex}')
"""
result = con.execute(query).arrow()
With DuckDB, my first attempt to run the computation crashed. I had to retry by reading to an Arrow Table directly without using Polars (this saved about 1GB of memory) to give DuckDB more memory to use.
first try:
cpu: single-core
memory: 2.53G -> 6.93G -> crash!
time: NA
second try:
cpu: single-core
memory: 1.62G -> 6.13G
time: 38.6 s ± 311 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
A modification of the answer to "Use the rolling function of polars to get a list of all values in the rolling windows":
>>> (
... df
... .with_columns(
... [pl.col("values").shift(i).alias(f"lag_{i}") for i in range(-2, 3)])
... .filter(pl.col("letters").str.contains("D|F"))
... .select([
... pl.col("letters"),
... pl.concat_list(reversed([f"lag_{i}" for i in range(-2, 3)])).alias("output")
... ])
... )
shape: (2, 2)
┌─────────┬────────────────────────────────┐
│ letters ┆ output                         │
│ ---     ┆ ---                            │
│ str     ┆ list[str]                      │
╞═════════╪════════════════════════════════╡
│ D       ┆ ["bb", "cc", "dd", "ee", "ff"] │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ F       ┆ ["dd", "ee", "ff", "gg", "hh"] │
└─────────┴────────────────────────────────┘
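To see why the shifted columns have to be concatenated in reversed order, here is a toy plain-Python model of the approach. The shift helper is a hypothetical stand-in that mimics the semantics used above (a positive offset moves values down and pads the top with None); it is not the polars implementation and assumes the offset is smaller than the column length.

```python
def shift(xs, n):
    """Toy column shift: move values n rows down (n > 0) or up (n < 0), padding with None."""
    if n > 0:
        return [None] * n + xs[:-n]
    if n < 0:
        return xs[-n:] + [None] * (-n)
    return list(xs)

letters = ["A", "B", "C", "D", "E", "F", "G", "H"]
values = ["aa", "bb", "cc", "dd", "ee", "ff", "gg", "hh"]
n = 2

# One shifted copy of the column per offset, like the lag_{i} columns.
lags = {i: shift(values, i) for i in range(-n, n + 1)}

# lags[i][row] holds values[row - i], so reading offsets from +n down to -n
# yields the window in its natural top-to-bottom order.
windows = {
    letter: [lags[i][row] for i in range(n, -n - 1, -1)]
    for row, letter in enumerate(letters)
    if letter in ("D", "F")
}

print(windows["D"])  # ['bb', 'cc', 'dd', 'ee', 'ff']
print(windows["F"])  # ['dd', 'ee', 'ff', 'gg', 'hh']
```

In this model a match in the first or last rows would get None entries in its window instead of a shorter list.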
You can try this:
preceding = 2
following = 2
look_around = [pl.col("values").shift(-i)
for i in range(-preceding, following + 1)]
(
df
.with_column(
pl.when(pl.col('letters').str.contains('D|F'))
.then(pl.concat_list(look_around))
.alias('combined')
)
.filter(pl.col('combined').is_not_null())
)
shape: (2, 3)
┌─────────┬────────┬────────────────────────┐
│ letters ┆ values ┆ combined │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ list[str] │
╞═════════╪════════╪════════════════════════╡
│ D ┆ dd ┆ ["bb", "cc", ... "ff"] │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ F ┆ ff ┆ ["dd", "ee", ... "hh"] │
└─────────┴────────┴────────────────────────┘
How do you "pipe" an expression in Polars?
Consider this code:
def transformation(col: pl.Series) -> pl.Series:
    return col.tanh().suffix('_tanh')
It'd be nice to be able to do this:
df.with_columns([
pl.col('colA').pipe(transformation),
pl.col('colB').pipe(transformation),
pl.col('colC').pipe(transformation),
pl.col('colD').pipe(transformation),
])
But I don't think Polars supports .pipe for Series / expressions.
The alternative is
df.with_columns([
transformation(pl.col('colA')),
transformation(pl.col('colB')),
transformation(pl.col('colC')),
transformation(pl.col('colD')),
])
But this gets messy (IMO) when you have arguments to the transformation function
Edit:
I implemented this and it "works" for me
def _pipe(self, func, *args, **kwargs):
    return func(self, *args, **kwargs)

pl.Expr.pipe = _pipe
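For readers unfamiliar with the pattern, a pipe method of this shape simply turns nested function calls into a left-to-right chain: x.pipe(f, a) is f(x, a). The sketch below demonstrates that on a toy Value class, a hypothetical stand-in used only for illustration and not a polars type; scale and shift_by are likewise made-up helpers.

```python
class Value:
    """Toy stand-in for an expression-like object; not part of polars."""

    def __init__(self, x):
        self.x = x

    def pipe(self, func, *args, **kwargs):
        # Forward self as the first argument, plus any extra arguments.
        return func(self, *args, **kwargs)


def scale(v, factor):
    return Value(v.x * factor)


def shift_by(v, *, offset):
    return Value(v.x + offset)


# Chained form: reads in the order the steps are applied.
chained = Value(2).pipe(scale, 3).pipe(shift_by, offset=1)

# Equivalent nested form: reads inside-out.
nested = shift_by(scale(Value(2), 3), offset=1)

print(chained.x, nested.x)  # 7 7
```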
Typically (like pandas) you'd apply pipe at the DataFrame level.
Especially in conjunction with lazy-eval, this would be equivalent to chaining expressions; your function will receive the underlying eager/lazy frame, along with any optional *args and **kwargs, and by making it lazy() you ensure that your chain of operations can still take advantage of the query optimiser and parallelisation.
For example:
import polars as pl
# define some UDFs
def extend_with_tan( df ):
    return df.with_columns( pl.all().tanh().suffix("_tanh") )
def mul_in_place( df, n ):
    return df.select( (pl.all() * n).suffix(f"_x{n}") )
# init lazyframe
df = pl.DataFrame({
"colA": [-4],
"colB": [-2],
"colC": [10],
}).lazy()
# pipe/result
dfx = df.pipe( extend_with_tan ).pipe( mul_in_place,n=3 )
dfx.collect()
# ┌─────────┬─────────┬─────────┬──────────────┬──────────────┬──────────────┐
# │ colA_x3 ┆ colB_x3 ┆ colC_x3 ┆ colA_tanh_x3 ┆ colB_tanh_x3 ┆ colC_tanh_x3 │
# │ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ i64 ┆ f64 ┆ f64 ┆ f64 │
# ╞═════════╪═════════╪═════════╪══════════════╪══════════════╪══════════════╡
# │ -12 ┆ -6 ┆ 30 ┆ -2.997988 ┆ -2.892083 ┆ 3.0 │
# └─────────┴─────────┴─────────┴──────────────┴──────────────┴──────────────┘
API Docs: polars "pipe" method
As you've probably realized, adding custom methods in order to be able to do method chaining is unfortunately not a first-class citizen in Python.
In polars, a canonical way that hopefully satisfies you is to instead write a function that returns an expression. You do this already (although the type hint is incorrectly set to pl.Series), but can save some space by giving a string argument to our transformation function:
import polars as pl
df = pl.DataFrame({"colA": [-4], "colB": [-2], "colC": [0], "colD": [2]})
def transformation(name: str | list[str]) -> pl.Expr:
    return pl.col(name).tanh().suffix("_tanh")
df1 = df.with_columns(
[
transformation("colA"),
transformation("colB"),
transformation("colC"),
transformation("colD"),
]
)
I realise this doesn't quite do what you wanted, but perhaps the following will cheer you up a bit. Since pl.col() can take a list of column names, we can do the following:
df2 = df.with_column(transformation(["colA", "colB", "colC", "colD"]))
assert df1.frame_equal(df2) # True
And we can even target all of them using a regular expression:
# ^col\w+$ is a regular expression matching `col<anything>`
df3 = df.with_column(transformation(r"^col\w+$"))  # raw string, so \w is not treated as a string escape
assert df1.frame_equal(df3) # True
Is there a way to allow an expression in Polars to refer to a previous aliased expression? For example, this code that defines two new columns errors because the second new column refers to the first:
import polars as pl
df = pl.DataFrame(dict(x=[0, 0, 1]))
df.select([
(pl.col('x') + 1).alias('y'),
(pl.col('y') * 2).alias('z')],
)
# pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value:
# NotFound("Unable to get field named \"y\". Valid fields: [\"x\"]")
The error makes it obvious that the failure is caused by the first alias not being visible to the second expression. Is there a straightforward way to make this work?
All polars expressions within a context are executed in parallel. So they cannot refer to a column that does not yet exist.
A context is:
df.with_columns
df.select
df.groupby(..).agg
This means you need to enforce sequential execution for expressions that refer to the outputs of other expressions.
In your case I would do:
(df.with_column(
(pl.col('x') + 1).alias('y')
).select([
pl.col('y'),
(pl.col('y') * 2).alias('z')
]))
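The rule can be modelled with a small plain-Python sketch. This is a toy model of the visibility rule only and says nothing about how polars schedules work internally: each "expression" is a function of the input columns, and every expression in one context sees the same input, never its siblings' outputs. The names run_context, add_one, and double_y are hypothetical.

```python
def run_context(columns, exprs):
    """Evaluate every expression against the same input columns."""
    return {name: fn(columns) for name, fn in exprs.items()}

def add_one(cols):
    return [v + 1 for v in cols["x"]]

def double_y(cols):
    return [v * 2 for v in cols["y"]]

df = {"x": [0, 0, 1]}

# One context: "z" looks for "y", which does not exist in the input yet.
try:
    run_context(df, {"y": add_one, "z": double_y})
    failed = False
except KeyError:
    failed = True

# Two contexts in sequence: the first one materialises "y" for the second.
step1 = {**df, **run_context(df, {"y": add_one})}
step2 = run_context(step1, {"y": lambda cols: cols["y"], "z": double_y})

print(failed)  # True
print(step2)   # {'y': [1, 1, 2], 'z': [2, 2, 4]}
```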
One workaround is to pull out each new column into its own with_column call and then do a final select to keep the columns you were supposed to keep. You will probably want to make sure this is done lazily.
import polars as pl
df = pl.DataFrame(dict(x=[0, 0, 1]))
(df
.lazy()
.with_column((pl.col("x") + 1).alias("y"))
.with_column((pl.col("y") * 2).alias("z"))
.select(["y", "z"])
.collect()
)
# shape: (3, 2)
# ┌─────┬─────┐
# │ y ┆ z │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 1 ┆ 2 │
# ├╌╌╌╌╌┼╌╌╌╌╌┤
# │ 1 ┆ 2 │
# ├╌╌╌╌╌┼╌╌╌╌╌┤
# │ 2 ┆ 4 │
# └─────┴─────┘
This should be an easy one but I can't find any documentation or prior Q&A on this. Using Julia to subset is easy, especially with the @chain macro. But I haven't for the life of me figured out a way to subset on a date:
maindf = @chain rawdf begin
    @subset(Dates.year(:travel_date) .== 2019)
end
In all of the documentation Dates.year(today()) should produce (2021) but this ends up tossing me an error:
ERROR: MethodError: no method matching +(::Vector{Date}, ::Int64)
Closest candidates are:
+(::Any, ::Any, ::Any, ::Any...) at operators.jl:560
+(::T, ::T) where T<:Union{Int128, Int16, Int32, Int64, Int8, UInt128, UInt16, UInt32, UInt64, UInt8} at int.jl:87
+(::T, ::Integer) where T<:AbstractChar at char.jl:223
Not sure exactly why I am getting a method error..
In R using dplyr this would simply be:
maindf = rawdf %>%
filter(., year(travel_date) == 2019)
Any ideas?
Use:
julia> using DataFramesMeta, Dates
julia> df = DataFrame(travel_date=repeat([Date(2019,1,1), Date(2020,1,1)],3), id=1:6)
6×2 DataFrame
Row │ travel_date id
│ Date Int64
─────┼────────────────────
1 │ 2019-01-01 1
2 │ 2020-01-01 2
3 │ 2019-01-01 3
4 │ 2020-01-01 4
5 │ 2019-01-01 5
6 │ 2020-01-01 6
julia> @rsubset(df, year(:travel_date) == 2019)
3×2 DataFrame
Row │ travel_date id
│ Date Int64
─────┼────────────────────
1 │ 2019-01-01 1
2 │ 2019-01-01 3
3 │ 2019-01-01 5
julia> @subset(df, year.(:travel_date) .== 2019)
3×2 DataFrame
Row │ travel_date id
│ Date Int64
─────┼────────────────────
1 │ 2019-01-01 1
2 │ 2019-01-01 3
3 │ 2019-01-01 5
The difference is that @rsubset works by row and @subset works on whole columns.
Your problem was that in Dates.year(:travel_date) .== 2019 you mixed a non-broadcasted call of the year function with a broadcasted comparison .== 2019. You always need to make sure that you either work row-wise (using @rsubset in this case) or on whole columns (using @subset).
Different scenarios might require a different approach. Here is an example where the whole-column approach is useful:
julia> using Statistics
julia> @subset(df, :id .> mean(:id))
3×2 DataFrame
Row │ travel_date id
│ Date Int64
─────┼────────────────────
1 │ 2020-01-01 4
2 │ 2019-01-01 5
3 │ 2020-01-01 6
where you want mean to operate on a whole column.
EDIT
Here is the same with @chain:
julia> @chain df begin
           @subset year.(:travel_date) .== 2019
end
3×2 DataFrame
Row │ travel_date id
│ Date Int64
─────┼────────────────────
1 │ 2019-01-01 1
2 │ 2019-01-01 3
3 │ 2019-01-01 5