Can I use indexing [start:end] in expressions instead of offset and length in polars - python-polars

Expr.slice only supports the syntax exp.slice(offset, length), but in some cases something like exp[start:end] would be more convenient and consistent. How can I write exp[1:6] for exp.slice(1, 5), as in pandas?

You can add polars.internals.expr.expr.Expr.__getitem__():
import polars as pl
from polars.internals.expr.expr import Expr

def expr_slice(self, subscript):
    if isinstance(subscript, slice):
        # Translate [start:stop] into Expr.slice(offset, length).
        offset = subscript.start if subscript.start is not None else 0
        # The stop index is exclusive; a missing stop means "to the end".
        length = subscript.stop - offset if subscript.stop is not None else None
        return Expr.slice(self, offset, length)
    else:
        # A single integer selects exactly one row.
        return Expr.slice(self, subscript, 1)

Expr.__getitem__ = expr_slice

df = pl.DataFrame({'a': range(10)})
print(df.select(pl.col('a')[2:5]))
print(df.select(pl.col('a')[3]))
shape: (3, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 2   │
├╌╌╌╌╌┤
│ 3   │
├╌╌╌╌╌┤
│ 4   │
└─────┘
shape: (1, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 3   │
└─────┘
Note that my example doesn't take into account the logic for negative indexing; you can implement that yourself.
Also note that Python uses 0-based indexing, so in your example exp[1:6] would start at the second element rather than the first.
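For the negative-indexing case mentioned above, here is a minimal plain-Python sketch of the arithmetic. The helper name slice_to_offset_length is hypothetical, and it assumes the column length n is known up front, which is not the case when an expression is merely being built. It is not part of polars.

```python
def slice_to_offset_length(subscript, n):
    """Convert a Python slice into an (offset, length) pair for n rows.

    Hypothetical helper for illustration only; not a polars API.
    """
    if subscript.step not in (None, 1):
        raise ValueError("only contiguous slices (step 1) map to offset/length")
    # slice.indices resolves negative values and clamps start/stop to [0, n].
    start, stop, _ = subscript.indices(n)
    return start, max(stop - start, 0)


print(slice_to_offset_length(slice(1, 6), 10))      # (1, 5)
print(slice_to_offset_length(slice(-3, None), 10))  # (7, 3)
print(slice_to_offset_length(slice(None, -2), 10))  # (0, 8)
```

The resulting pair could then be passed to a slice(offset, length) call.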

Related

Create duplicates of row based column values

I'm trying to build a histogram of some data in polars. As part of my histogram code, I need to duplicate some rows. I've got a column of values, where each row also has a weight that says how many times the row should be added to the histogram.
How can I duplicate my value rows according to the weight column?
Here is some example data, with a target series:
import polars as pl
df = pl.DataFrame({"value":[1,2,3], "weight":[2, 2, 1]})
print(df)
# shape: (3, 2)
# ┌───────┬────────┐
# │ value ┆ weight │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═══════╪════════╡
# │ 1 ┆ 2 │
# │ 2 ┆ 2 │
# │ 3 ┆ 1 │
# └───────┴────────┘
s_target = pl.Series(name="value", values=[1,1,2,2,3])
print(s_target)
# shape: (5,)
# Series: 'value' [i64]
# [
# 1
# 1
# 2
# 2
# 3
# ]
How about
(
    df.with_columns(
        pl.col("value").repeat_by(pl.col("weight"))
    )
    .select(pl.col("value").arr.explode())
)
In [11]: df.with_columns(pl.col('value').repeat_by(pl.col('weight'))).select(pl.col('value').arr.explode())
Out[11]:
shape: (5, 1)
┌───────┐
│ value │
│ --- │
│ i64 │
╞═══════╡
│ 1 │
│ 1 │
│ 2 │
│ 2 │
│ 3 │
└───────┘
I didn't know you could do this so easily, I only learned about it while writing the answer. Polars is so nice :)
Turns out repeat_by and a subsequent explode are the perfect building blocks for this transformation:
>>> df.select(pl.col('value').repeat_by('weight').arr.explode())
shape: (5, 1)
┌───────┐
│ value │
│ --- │
│ i64 │
╞═══════╡
│ 1 │
│ 1 │
│ 2 │
│ 2 │
│ 3 │
└───────┘
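To make the two steps explicit, here is a rough plain-Python sketch of what repeat_by followed by explode computes on the example data. It only mirrors the semantics with the standard library; it is not how polars implements them.

```python
from itertools import chain, repeat

values = [1, 2, 3]
weights = [2, 2, 1]

# Step 1 (the repeat_by part): one list per row, value repeated weight times.
repeated = [list(repeat(v, w)) for v, w in zip(values, weights)]

# Step 2 (the explode part): flatten the per-row lists back into one column.
exploded = list(chain.from_iterable(repeated))

print(repeated)  # [[1, 1], [2, 2], [3]]
print(exploded)  # [1, 1, 2, 2, 3]
```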

Is it possible in Polars to "reset" cumsum() at a certain condition?

I need to cumsum column b until a becomes True. After that, the cumsum should restart from that row, and so on.
a     | b
-------------
False | 1
False | 2
True  | 3
False | 4
Can I do this in Polars without looping over each row?
You could use the .cumsum() of the a column as the "group number".
>>> df.select(pl.col("a").cumsum())
shape: (4, 1)
┌─────┐
│ a │
│ --- │
│ i64 │
╞═════╡
│ 0 │
├╌╌╌╌╌┤
│ 0 │
├╌╌╌╌╌┤
│ 1 │
├╌╌╌╌╌┤
│ 1 │
└─────┘
And use that with .over()
>>> df.select(pl.col("b").cumsum().over(pl.col("a").cumsum()))
shape: (4, 1)
┌─────┐
│ b │
│ --- │
│ i64 │
╞═════╡
│ 1 │
├╌╌╌╌╌┤
│ 3 │
├╌╌╌╌╌┤
│ 3 │
├╌╌╌╌╌┤
│ 7 │
└─────┘
You can add .shift().backward_fill() to the group number so that each True row is included in the group it closes:
>>> df.select(pl.col("b").cumsum().over(
... pl.col("a").cumsum().shift().backward_fill()))
shape: (4, 1)
┌─────┐
│ b │
│ --- │
│ i64 │
╞═════╡
│ 1 │
├╌╌╌╌╌┤
│ 3 │
├╌╌╌╌╌┤
│ 6 │
├╌╌╌╌╌┤
│ 4 │
└─────┘
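The grouping trick can be checked by hand with a small plain-Python sketch. The helper cumsum_within_groups is a hypothetical name used only for illustration, and the shift/backward-fill step is written out for this specific case, where only the first value is missing after the shift.

```python
from itertools import accumulate

a = [False, False, True, False]
b = [1, 2, 3, 4]


def cumsum_within_groups(values, group_ids):
    """Running sum of values that restarts for every distinct group id."""
    totals = {}
    out = []
    for v, g in zip(values, group_ids):
        totals[g] = totals.get(g, 0) + v
        out.append(totals[g])
    return out


# Running count of True values serves as the group number.
group_ids = list(accumulate(int(flag) for flag in a))
print(group_ids)                           # [0, 0, 1, 1]
print(cumsum_within_groups(b, group_ids))  # [1, 3, 3, 7]

# Shift by one row; the gap at the front is back-filled with the next value,
# which is the original first group id.
shifted = [group_ids[0]] + group_ids[:-1]
print(shifted)                             # [0, 0, 0, 1]
print(cumsum_within_groups(b, shifted))    # [1, 3, 6, 4]
```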

Sum columns based on column names in a list for polars

In Python Polars, I can add two or more columns to make a new column using an expression like
frame.with_column((pl.col('colname1') + pl.col('colname2')).alias('new_colname'))
However, if I have all the column names in a list, is there a way to sum all the columns in that list and create a new column from the result?
Thanks
The pl.sum expression supports horizontal summing. From the docs:
List[Expr] -> aggregate the sum value horizontally.
Sample code for reference:
import polars as pl
df = pl.DataFrame({"a": [1, 2, 3], "b": [1, 2, None]})
print(df)
This results in something like,
shape: (3, 2)
┌─────┬──────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪══════╡
│ 1 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2 ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 3 ┆ null │
└─────┴──────┘
On this you can do something like,
cols = ["a", "b"]
df2 = df.select(pl.sum([pl.col(i) for i in cols]).alias('new_colname'))
print(df2)
Which will result in,
shape: (3, 1)
┌─────────────┐
│ new_colname │
│ ---         │
│ i64         │
╞═════════════╡
│ 2           │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4           │
├╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ null        │
└─────────────┘
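As a sanity check on the result above, here is a plain-Python sketch of a row-wise sum over a list of column names. It assumes a null in any input makes the row's sum null, which is what the output above shows. The helper add_nullable is hypothetical and only mirrors that behaviour.

```python
from functools import reduce

columns = {"a": [1, 2, 3], "b": [1, 2, None]}
cols = ["a", "b"]


def add_nullable(x, y):
    """Add two values, propagating None the way the output above propagates null."""
    return None if x is None or y is None else x + y


# zip(*...) walks the selected columns row by row.
row_sums = [reduce(add_nullable, row) for row in zip(*(columns[c] for c in cols))]
print(row_sums)  # [2, 4, None]
```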

calculating rowwise minimum of two series?

Hi;
Is there a function that can generate a series by calculating the row-wise minimum of two series? The functionality would be similar to np.minimum:
a = [1,4,2,5,2]
b= [5,1,4,2,5]
np.minimum(a,b) -> [1,1,2,2,2]
Thanks.
q = df.lazy().with_column(
    pl.when(pl.col("a") > pl.col("b"))
    .then(pl.col("b"))
    .otherwise(pl.col("a"))
    .alias("minimum")
)
df = q.collect()
I haven't tested it, but I think this should work.
As the accepted answer states, you can use pl.when -> then -> otherwise expression.
If you have a wider DataFrame, you can use the DataFrame.min() method, pl.min expression, or a pl.fold for more control.
import polars as pl
df = pl.DataFrame({
    "a": [1,4,2,5,2],
    "b": [5,1,4,2,5],
    "c": [3,2,5,7,2]
})
df.min(axis=1)
This outputs:
shape: (5,)
Series: 'a' [i64]
[
1
1
2
2
2
]
Min expression
When pl.min is given multiple expression inputs, the minimum is determined row-wise instead of column-wise.
df.select(pl.min(["a", "b", "c"]))
Outputs:
shape: (5, 1)
┌─────┐
│ min │
│ --- │
│ i64 │
╞═════╡
│ 1 │
├╌╌╌╌╌┤
│ 1 │
├╌╌╌╌╌┤
│ 2 │
├╌╌╌╌╌┤
│ 2 │
├╌╌╌╌╌┤
│ 2 │
└─────┘
Fold expression
Or with a fold expression:
df.select(
    pl.fold(int(1e9), lambda acc, a: pl.when(acc > a).then(a).otherwise(acc), ["a", "b", "c"])
)
shape: (5, 1)
┌─────────┐
│ literal │
│ --- │
│ i64 │
╞═════════╡
│ 1 │
├╌╌╌╌╌╌╌╌╌┤
│ 1 │
├╌╌╌╌╌╌╌╌╌┤
│ 2 │
├╌╌╌╌╌╌╌╌╌┤
│ 2 │
├╌╌╌╌╌╌╌╌╌┤
│ 2 │
└─────────┘
The fold allows for more cool things, because you operate over expressions.
So we could for instance compute the min of the squared columns:
pl.fold(int(1e9), lambda acc, a: pl.when(acc > a).then(a).otherwise(acc), [pl.all()**2])
Or we could compute the min of the square root of column "a" and the squares of the remaining columns:
pl.fold(int(1e9), lambda acc, a: pl.when(acc > a).then(a).otherwise(acc), [pl.col("a").sqrt(), pl.all().exclude("a")**2])
You get the idea.
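If the fold is unfamiliar, it behaves much like functools.reduce applied column by column. The sketch below is a plain-Python analogy on the same data, not polars code: the accumulator starts at a large sentinel and is combined with each column in turn. The names elementwise_min and init are made up for illustration.

```python
from functools import reduce

columns = {
    "a": [1, 4, 2, 5, 2],
    "b": [5, 1, 4, 2, 5],
    "c": [3, 2, 5, 7, 2],
}


def elementwise_min(acc, col):
    """Mirror of when(acc > a).then(a).otherwise(acc), applied row by row."""
    return [x if acc_v > x else acc_v for acc_v, x in zip(acc, col)]


# Same starting accumulator as the fold above: a large sentinel per row.
init = [int(1e9)] * 5

row_min = reduce(elementwise_min, columns.values(), init)
print(row_min)  # [1, 1, 2, 2, 2]

# Transforming the inputs first corresponds to folding over pl.all()**2.
squared = [[x ** 2 for x in col] for col in columns.values()]
print(reduce(elementwise_min, squared, init))  # [1, 1, 4, 4, 4]
```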

Create lag / lead time series with by groups in Julia?

I am wondering if there is an easy way to create a lag (or lead) of a time series variable in Julia according to a by group or condition? For example: I have a dataset of the following form
julia> df1 = DataFrame(var1=["a","a","a","a","b","b","b","b"],
var2=[0,1,2,3,0,1,2,3])
8×2 DataFrame
│ Row │ var1 │ var2 │
│ │ String │ Int64 │
├─────┼────────┼───────┤
│ 1 │ a │ 0 │
│ 2 │ a │ 1 │
│ 3 │ a │ 2 │
│ 4 │ a │ 3 │
│ 5 │ b │ 0 │
│ 6 │ b │ 1 │
│ 7 │ b │ 2 │
│ 8 │ b │ 3 │
And I want to create a variable lag2 that contains the values in var2 lagged by 2. However, this should be done grouped by var1 so that the first two observations in the 'b' group do not get the last two values of the 'a' group. Rather they should be set to missing or zero or some default value.
I have tried the following code which produces the following error.
julia> df2 = df1 |> @groupby(_.var1) |> @mutate(lag2 = lag(_.var2,2)) |> DataFrame
ERROR: MethodError: no method matching merge(::Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}}, ::NamedTuple{(:lag2,),Tuple{ShiftedArray{Int64,Missing,1,QueryOperators.GroupColumnArrayView{Int64,Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}},:var2}}}})
Closest candidates are:
merge(::NamedTuple{(),T} where T<:Tuple, ::NamedTuple) at namedtuple.jl:245
merge(::NamedTuple{an,T} where T<:Tuple, ::NamedTuple{bn,T} where T<:Tuple) where {an, bn} at namedtuple.jl:233
merge(::NamedTuple, ::NamedTuple, ::NamedTuple...) at namedtuple.jl:249
...
Stacktrace:
[1] (::var"#437#442")(::Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}}) at /Users/kayvon/.julia/packages/Query/AwBtd/src/query_translation.jl:58
[2] iterate at /Users/kayvon/.julia/packages/QueryOperators/g4G21/src/enumerable/enumerable_map.jl:25 [inlined]
[3] iterate at /Users/kayvon/.julia/packages/Tables/TjjiP/src/tofromdatavalues.jl:45 [inlined]
[4] buildcolumns at /Users/kayvon/.julia/packages/Tables/TjjiP/src/fallbacks.jl:185 [inlined]
[5] columns at /Users/kayvon/.julia/packages/Tables/TjjiP/src/fallbacks.jl:237 [inlined]
[6] #DataFrame#453(::Bool, ::Type{DataFrame}, ::QueryOperators.EnumerableMap{Union{},QueryOperators.EnumerableIterable{Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}},QueryOperators.EnumerableGroupBy{Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}},String,NamedTuple{(:var1, :var2),Tuple{String,Int64}},QueryOperators.EnumerableIterable{NamedTuple{(:var1, :var2),Tuple{String,Int64}},Tables.DataValueRowIterator{NamedTuple{(:var1, :var2),Tuple{String,Int64}},Tables.Schema{(:var1, :var2),Tuple{String,Int64}},Tables.RowIterator{NamedTuple{(:var1, :var2),Tuple{Array{String,1},Array{Int64,1}}}}}},var"#434#439",var"#435#440"}},var"#437#442"}) at /Users/kayvon/.julia/packages/DataFrames/S3ZFo/src/other/tables.jl:40
[7] DataFrame(::QueryOperators.EnumerableMap{Union{},QueryOperators.EnumerableIterable{Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}},QueryOperators.EnumerableGroupBy{Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}},String,NamedTuple{(:var1, :var2),Tuple{String,Int64}},QueryOperators.EnumerableIterable{NamedTuple{(:var1, :var2),Tuple{String,Int64}},Tables.DataValueRowIterator{NamedTuple{(:var1, :var2),Tuple{String,Int64}},Tables.Schema{(:var1, :var2),Tuple{String,Int64}},Tables.RowIterator{NamedTuple{(:var1, :var2),Tuple{Array{String,1},Array{Int64,1}}}}}},var"#434#439",var"#435#440"}},var"#437#442"}) at /Users/kayvon/.julia/packages/DataFrames/S3ZFo/src/other/tables.jl:31
[8] |>(::QueryOperators.EnumerableMap{Union{},QueryOperators.EnumerableIterable{Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}},QueryOperators.EnumerableGroupBy{Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}},String,NamedTuple{(:var1, :var2),Tuple{String,Int64}},QueryOperators.EnumerableIterable{NamedTuple{(:var1, :var2),Tuple{String,Int64}},Tables.DataValueRowIterator{NamedTuple{(:var1, :var2),Tuple{String,Int64}},Tables.Schema{(:var1, :var2),Tuple{String,Int64}},Tables.RowIterator{NamedTuple{(:var1, :var2),Tuple{Array{String,1},Array{Int64,1}}}}}},var"#434#439",var"#435#440"}},var"#437#442"}, ::Type) at ./operators.jl:854
[9] top-level scope at none:0
Appreciate any help with this approach or alternate approaches. Thanks.
EDIT
I'm putting this edit at the top because it works in DataFrames 1.0 and so reflects the stable API.
Under DataFrames.jl 0.22.2 the correct syntax is:
julia> combine(groupby(df1, :var1), :var2 => Base.Fix2(lag, 2) => :var2_l2)
8×2 DataFrame
Row │ var1 var2_l2
│ String Int64?
─────┼─────────────────
1 │ a missing
2 │ a missing
3 │ a 0
4 │ a 1
5 │ b missing
6 │ b missing
7 │ b 0
8 │ b 1
As an alternative to the slightly arcane Base.Fix2 syntax, you could use an anonymous function (x -> lag(x, 2)) (note that the enclosing parentheses are required due to operator precedence).
Original answer:
You definitely had the right idea - I don't work with Query.jl but this can easily be done with basic DataFrames syntax:
julia> using DataFrames
julia> import ShiftedArrays: lag
julia> df1 = DataFrame(var1=["a","a","a","a","b","b","b","b"],
var2=[0,1,2,3,0,1,2,3]);
julia> by(df1, :var1, var2_l2 = :var2 => Base.Fix2(lag, 2))
8×2 DataFrame
│ Row │ var1 │ var2_l2 │
│ │ String │ Int64⍰ │
├─────┼────────┼─────────┤
│ 1 │ a │ missing │
│ 2 │ a │ missing │
│ 3 │ a │ 0 │
│ 4 │ a │ 1 │
│ 5 │ b │ missing │
│ 6 │ b │ missing │
│ 7 │ b │ 0 │
│ 8 │ b │ 1 │
Note that I used Base.Fix2 here to get a single argument version of lag. This is essentially the same as defining your own l2(x) = lag(x, 2) and then using l2 in the by call. If you do define your own l2 function you can also set the default value like l2(x) = lag(x, 2, default = -1000) if you want to avoid missing values:
julia> l2(x) = lag(x, 2, default = -1000)
l2 (generic function with 1 method)
julia> by(df1, :var1, var2_l2 = :var2 => l2)
8×2 DataFrame
│ Row │ var1 │ var2_l2 │
│ │ String │ Int64 │
├─────┼────────┼─────────┤
│ 1 │ a │ -1000 │
│ 2 │ a │ -1000 │
│ 3 │ a │ 0 │
│ 4 │ a │ 1 │
│ 5 │ b │ -1000 │
│ 6 │ b │ -1000 │
│ 7 │ b │ 0 │
│ 8 │ b │ 1 │