Create lag / lead time series with by groups in Julia? - group-by

I am wondering if there is an easy way to create a lag (or lead) of a time series variable in Julia according to a by group or condition? For example: I have a dataset of the following form
julia> df1 = DataFrame(var1=["a","a","a","a","b","b","b","b"],
var2=[0,1,2,3,0,1,2,3])
8×2 DataFrame
│ Row │ var1 │ var2 │
│ │ String │ Int64 │
├─────┼────────┼───────┤
│ 1 │ a │ 0 │
│ 2 │ a │ 1 │
│ 3 │ a │ 2 │
│ 4 │ a │ 3 │
│ 5 │ b │ 0 │
│ 6 │ b │ 1 │
│ 7 │ b │ 2 │
│ 8 │ b │ 3 │
And I want to create a variable lag2 that contains the values in var2 lagged by 2. However, this should be done grouped by var1 so that the first two observations in the 'b' group do not get the last two values of the 'a' group. Rather they should be set to missing or zero or some default value.
I have tried the following code which produces the following error.
julia> df2 = df1 |> #groupby(_.var1) |> #mutate(lag2 = lag(_.var2,2)) |> DataFrame
ERROR: MethodError: no method matching merge(::Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}}, ::NamedTuple{(:lag2,),Tuple{ShiftedArray{Int64,Missing,1,QueryOperators.GroupColumnArrayView{Int64,Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}},:var2}}}})
Closest candidates are:
merge(::NamedTuple{,T} where T<:Tuple, ::NamedTuple) at namedtuple.jl:245
merge(::NamedTuple{an,T} where T<:Tuple, ::NamedTuple{bn,T} where T<:Tuple) where {an, bn} at namedtuple.jl:233
merge(::NamedTuple, ::NamedTuple, ::NamedTuple...) at namedtuple.jl:249
...
Stacktrace:
[1] (::var"#437#442")(::Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}}) at /Users/kayvon/.julia/packages/Query/AwBtd/src/query_translation.jl:58
[2] iterate at /Users/kayvon/.julia/packages/QueryOperators/g4G21/src/enumerable/enumerable_map.jl:25 [inlined]
[3] iterate at /Users/kayvon/.julia/packages/Tables/TjjiP/src/tofromdatavalues.jl:45 [inlined]
[4] buildcolumns at /Users/kayvon/.julia/packages/Tables/TjjiP/src/fallbacks.jl:185 [inlined]
[5] columns at /Users/kayvon/.julia/packages/Tables/TjjiP/src/fallbacks.jl:237 [inlined]
[6] #DataFrame#453(::Bool, ::Type{DataFrame}, ::QueryOperators.EnumerableMap{Union{},QueryOperators.EnumerableIterable{Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}},QueryOperators.EnumerableGroupBy{Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}},String,NamedTuple{(:var1, :var2),Tuple{String,Int64}},QueryOperators.EnumerableIterable{NamedTuple{(:var1, :var2),Tuple{String,Int64}},Tables.DataValueRowIterator{NamedTuple{(:var1, :var2),Tuple{String,Int64}},Tables.Schema{(:var1, :var2),Tuple{String,Int64}},Tables.RowIterator{NamedTuple{(:var1, :var2),Tuple{Array{String,1},Array{Int64,1}}}}}},var"#434#439",var"#435#440"}},var"#437#442"}) at /Users/kayvon/.julia/packages/DataFrames/S3ZFo/src/other/tables.jl:40
[7] DataFrame(::QueryOperators.EnumerableMap{Union{},QueryOperators.EnumerableIterable{Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}},QueryOperators.EnumerableGroupBy{Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}},String,NamedTuple{(:var1, :var2),Tuple{String,Int64}},QueryOperators.EnumerableIterable{NamedTuple{(:var1, :var2),Tuple{String,Int64}},Tables.DataValueRowIterator{NamedTuple{(:var1, :var2),Tuple{String,Int64}},Tables.Schema{(:var1, :var2),Tuple{String,Int64}},Tables.RowIterator{NamedTuple{(:var1, :var2),Tuple{Array{String,1},Array{Int64,1}}}}}},var"#434#439",var"#435#440"}},var"#437#442"}) at /Users/kayvon/.julia/packages/DataFrames/S3ZFo/src/other/tables.jl:31
[8] |>(::QueryOperators.EnumerableMap{Union{},QueryOperators.EnumerableIterable{Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}},QueryOperators.EnumerableGroupBy{Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}},String,NamedTuple{(:var1, :var2),Tuple{String,Int64}},QueryOperators.EnumerableIterable{NamedTuple{(:var1, :var2),Tuple{String,Int64}},Tables.DataValueRowIterator{NamedTuple{(:var1, :var2),Tuple{String,Int64}},Tables.Schema{(:var1, :var2),Tuple{String,Int64}},Tables.RowIterator{NamedTuple{(:var1, :var2),Tuple{Array{String,1},Array{Int64,1}}}}}},var"#434#439",var"#435#440"}},var"#437#442"}, ::Type) at ./operators.jl:854
[9] top-level scope at none:0
Appreciate any help with this approach or alternate approaches. Thanks.

EDIT
Putting this edit to the top as it works in DataFrames 1.0 so reflects the stable API:
Under DataFrames.jl 0.22.2 the correct syntax is:
julia> combine(groupby(df1, :var1), :var2 => Base.Fix2(lag, 2) => :var2_l2)
8×2 DataFrame
Row │ var1 var2_l2
│ String Int64?
─────┼─────────────────
1 │ a missing
2 │ a missing
3 │ a 0
4 │ a 1
5 │ b missing
6 │ b missing
7 │ b 0
8 │ b 1
Another alternative to the maybe slightly arcane Base.Fix2 syntax you could use an anonymous function (x -> lag(x, 2)) (note the enclosing parens are required due to operator precedence).
Original answer:
You definitely had the right idea - I don't work with Query.jl but this can easily be done with basic DataFrames syntax:
julia> using DataFrames
julia> import ShiftedArrays: lag
julia> df1 = DataFrame(var1=["a","a","a","a","b","b","b","b"],
var2=[0,1,2,3,0,1,2,3]);
julia> by(df1, :var1, var2_l2 = :var2 => Base.Fix2(lag, 2)))
8×2 DataFrame
│ Row │ var1 │ var2_l2 │
│ │ String │ Int64⍰ │
├─────┼────────┼─────────┤
│ 1 │ a │ missing │
│ 2 │ a │ missing │
│ 3 │ a │ 0 │
│ 4 │ a │ 1 │
│ 5 │ b │ missing │
│ 6 │ b │ missing │
│ 7 │ b │ 0 │
│ 8 │ b │ 1 │
Note that I used Base.Fix2 here to get a single argument version of lag. This is essentially the same as defining your own l2(x) = lag(x, 2) and then using l2 in the by call. If you do define your own l2 function you can also set the default value like l2(x) = lag(x, 2, default = -1000) if you want to avoid missing values:
julia> l2(x) = lag(x, 2, default = -1000)
l2 (generic function with 1 method)
julia> by(df1, :var1, var2_l2 = :var2 => l2)
8×2 DataFrame
│ Row │ var1 │ var2_l2 │
│ │ String │ Int64 │
├─────┼────────┼─────────┤
│ 1 │ a │ -1000 │
│ 2 │ a │ -1000 │
│ 3 │ a │ 0 │
│ 4 │ a │ 1 │
│ 5 │ b │ -1000 │
│ 6 │ b │ -1000 │
│ 7 │ b │ 0 │
│ 8 │ b │ 1 │

Related

Create duplicates of row based column values

I'm trying to build a histogram of some data in polars. As part of my histogram code, I need to duplicate some rows. I've got a column of values, where each row also has a weight that says how many times the row should be added to the histogram.
How can I duplicate my value rows according to the weight column?
Here is some example data, with a target series:
import polars as pl
df = pl.DataFrame({"value":[1,2,3], "weight":[2, 2, 1]})
print(df)
# shape: (3, 2)
# ┌───────┬────────┐
# │ value ┆ weight │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═══════╪════════╡
# │ 1 ┆ 2 │
# │ 2 ┆ 2 │
# │ 3 ┆ 1 │
# └───────┴────────┘
s_target = pl.Series(name="value", values=[1,1,2,2,3])
print(s_target)
# shape: (5,)
# Series: 'value' [i64]
# [
# 1
# 1
# 2
# 2
# 3
# ]
How about
(
df.with_columns(
pl.col("value").repeat_by(pl.col("weight"))
)
.select(pl.col("value").arr.explode())
)
In [11]: df.with_columns(pl.col('value').repeat_by(pl.col('weight'))).select(pl.col('value').arr.explode())
Out[11]:
shape: (5, 1)
┌───────┐
│ value │
│ --- │
│ i64 │
╞═══════╡
│ 1 │
│ 1 │
│ 2 │
│ 2 │
│ 3 │
└───────┘
I didn't know you could do this so easily, I only learned about it while writing the answer. Polars is so nice :)
Turns out repeat_by and a subsequent explode are the perfect building blocks for this transformation:
>>> df.select(pl.col('value').repeat_by('weight').arr.explode())
shape: (5, 1)
┌───────┐
│ value │
│ --- │
│ i64 │
╞═══════╡
│ 1 │
│ 1 │
│ 2 │
│ 2 │
│ 3 │
└───────┘

Can I use indexing [start:end] in expressions instead of offset and length in polars

In Exp.slice, the only supported syntax is exp.slice(offset,length), but in some cases, something like exp[start:end] would be more convenient and consistent. So how to write exp[1:6] for exp.slice(1,5) just like it in pandas?
You can add polars.internals.expr.expr.Expr.__getitem__():
import polars as pl
from polars.internals.expr.expr import Expr
def expr_slice(self, subscript):
if isinstance(subscript, slice):
if slice.start is not None:
offset = subscript.start
length = subscript.stop - offset + 1 if subscript.stop is not None else None
else:
offset = None
length = subscript.stop
return Expr.slice(self, offset, length)
else:
return Expr.slice(self, subscript, 1)
Expr.__getitem__ = expr_slice
df = pl.DataFrame({'a': range(10)})
print(df.select(pl.col('a')[2:5]))
print(df.select(pl.col('a')[3]))
shape: (4, 1)
┌─────┐
│ a │
│ --- │
│ i64 │
╞═════╡
│ 2 │
├╌╌╌╌╌┤
│ 3 │
├╌╌╌╌╌┤
│ 4 │
├╌╌╌╌╌┤
│ 5 │
└─────┘
shape: (1, 1)
┌─────┐
│ a │
│ --- │
│ i64 │
╞═════╡
│ 3 │
└─────┘
Note that my example doesn't take into account the logic for negative indexing; you can implement that yourself.
Also note that python uses 0-based indexing, so in your example exp[1:6] would raise an error.

clickhouse : left join using IN

I wish to perform a left join based on two conditions :
SELECT
...
FROM
sometable AS a
LEFT JOIN someothertable AS b ON a.some_id = b.some_id
AND b.other_id IN (1, 2, 3, 4)
I got the error:
Supported syntax: JOIN ON Expr([table.]column, ...) = Expr([table.]column, ...) [AND Expr([table.]column, ...) = Expr([table.]column, ...) ...]```
It seems that the condition for a join must be = and can't be in
Any idea ?
Consider moving IN-operator into subquery:
SELECT
a.number,
b.number
FROM numbers(8) AS a
LEFT JOIN
(
SELECT *
FROM numbers(234)
WHERE number IN (1, 2, 3, 4)
) AS b USING (number)
/*
┌─number─┬─b.number─┐
│ 0 │ 0 │
│ 1 │ 1 │
│ 2 │ 2 │
│ 3 │ 3 │
│ 4 │ 4 │
│ 5 │ 0 │
│ 6 │ 0 │
│ 7 │ 0 │
└────────┴──────────┘
*/
or
SELECT
a.number,
b.number
FROM numbers(8) AS a
LEFT JOIN
(
SELECT *
FROM numbers(234)
WHERE number IN (1, 2, 3, 4)
) AS b USING (number)
SETTINGS join_use_nulls = 1 /* 1 is 'JOIN behaves the same way as in standard SQL. The type of the corresponding field is converted to Nullable, and empty cells are filled with NULL.' */
/*
┌─number─┬─b.number─┐
│ 0 │ ᴺᵁᴸᴸ │
│ 1 │ 1 │
│ 2 │ 2 │
│ 3 │ 3 │
│ 4 │ 4 │
│ 5 │ ᴺᵁᴸᴸ │
│ 6 │ ᴺᵁᴸᴸ │
│ 7 │ ᴺᵁᴸᴸ │
└────────┴──────────┘
*/

Counting the same positional bits in postgresql bitmasks

I am trying to count each same position bit of multiple bitmasks in postgresql, here is an example of the problem:
Suppose i have three bitmasks (in binary) like:
011011011100110
100011010100101
110110101010101
Now what I want to do is to get the total count of bits in each separate column, considering the above masks as three rows and multiple columns.
e.g The first column have count 2, the second one have count 2, the third one have count of 1 and so on...
In actual i have total of 30 bits in each bitmasks in my database. I want to do it in PostgreSQL. I am open for further explanation of the problem if needed.
You could do it by using the get_bit functoin and a couple of joins:
SELECT sum(bit) FILTER (WHERE i = 0) AS count_0,
sum(bit) FILTER (WHERE i = 1) AS count_1,
...
sum(bit) FILTER (WHERE i = 29) AS count_29
FROM bits
CROSS JOIN generate_series(0, 29) AS i
CROSS JOIN LATERAL get_bit(b, i) AS bit;
The column with the bit string is b in my example.
You could use the bitwise and & operator and bigint arithmetic so long as your bitstrings contain 63 bits or fewer:
# create table bmasks (mask bit(15));
CREATE TABLE
# insert into bmasks values ('011011011100110'), ('100011010100101'), ('110110101010101');
INSERT 0 3
# with masks as (
select (2 ^ x)::bigint::bit(15) as mask, x as posn
from generate_series(0, 14) as gs(x)
)
select m.posn, m.mask, sum((b.mask & m.mask > 0::bit(15))::int) as set_bits
from masks m
cross join bmasks b
group by m.posn, m.mask;
┌──────┬─────────────────┬──────────┐
│ posn │ mask │ set_bits │
├──────┼─────────────────┼──────────┤
│ 0 │ 000000000000001 │ 2 │
│ 1 │ 000000000000010 │ 1 │
│ 2 │ 000000000000100 │ 3 │
│ 3 │ 000000000001000 │ 0 │
│ 4 │ 000000000010000 │ 1 │
│ 5 │ 000000000100000 │ 2 │
│ 6 │ 000000001000000 │ 2 │
│ 7 │ 000000010000000 │ 2 │
│ 8 │ 000000100000000 │ 1 │
│ 9 │ 000001000000000 │ 2 │
│ 10 │ 000010000000000 │ 3 │
│ 11 │ 000100000000000 │ 1 │
│ 12 │ 001000000000000 │ 1 │
│ 13 │ 010000000000000 │ 2 │
│ 14 │ 100000000000000 │ 2 │
└──────┴─────────────────┴──────────┘
(15 rows)

Find clusters of values using Postgresql

Consider the following example table:
CREATE TABLE rndtbl AS
SELECT
generate_series(1, 10) AS id,
random() AS val;
and I want to find for each id a cluster_id such that the clusters are far away from each other at least 0.1. How would I calculate such a cluster assignment?
A specific example would be:
select * from rndtbl ;
id | val
----+-------------------
1 | 0.485714662820101
2 | 0.185201027430594
3 | 0.368477711919695
4 | 0.687312887981534
5 | 0.978742253035307
6 | 0.961830694694072
7 | 0.10397826647386
8 | 0.644958863966167
9 | 0.912827260326594
10 | 0.196085536852479
(10 rows)
The result would be: ids (2,7,10) in a cluster and (5,6,9) in another cluster and (4,8) in another, and (1) and (3) as singleton clusters.
From
SELECT * FROM rndtbl ;
┌────┬────────────────────┐
│ id │ val │
├────┼────────────────────┤
│ 1 │ 0.153776332736015 │
│ 2 │ 0.572575284633785 │
│ 3 │ 0.998213059268892 │
│ 4 │ 0.654628816060722 │
│ 5 │ 0.692200613208115 │
│ 6 │ 0.572836415842175 │
│ 7 │ 0.0788379465229809 │
│ 8 │ 0.390280921943486 │
│ 9 │ 0.611408909317106 │
│ 10 │ 0.555164183024317 │
└────┴────────────────────┘
(10 rows)
Use the LAG window function to know whether the current row is in a new cluster or not:
SELECT *, val - LAG(val) OVER (ORDER BY val) > 0.1 AS new_cluster
FROM rndtbl ;
┌────┬────────────────────┬─────────────┐
│ id │ val │ new_cluster │
├────┼────────────────────┼─────────────┤
│ 7 │ 0.0788379465229809 │ (null) │
│ 1 │ 0.153776332736015 │ f │
│ 8 │ 0.390280921943486 │ t │
│ 10 │ 0.555164183024317 │ t │
│ 2 │ 0.572575284633785 │ f │
│ 6 │ 0.572836415842175 │ f │
│ 9 │ 0.611408909317106 │ f │
│ 4 │ 0.654628816060722 │ f │
│ 5 │ 0.692200613208115 │ f │
│ 3 │ 0.998213059268892 │ t │
└────┴────────────────────┴─────────────┘
(10 rows)
Finally you can SUM the number of true (still ordering by val) to get the cluster of the row (counting from 0):
SELECT *, SUM(COALESCE(new_cluster::int, 0)) OVER (ORDER BY val) AS nb_cluster
FROM (
SELECT *, val - LAG(val) OVER (ORDER BY val) > 0.1 AS new_cluster
FROM rndtbl
) t
;
┌────┬────────────────────┬─────────────┬────────────┐
│ id │ val │ new_cluster │ nb_cluster │
├────┼────────────────────┼─────────────┼────────────┤
│ 7 │ 0.0788379465229809 │ (null) │ 0 │
│ 1 │ 0.153776332736015 │ f │ 0 │
│ 8 │ 0.390280921943486 │ t │ 1 │
│ 10 │ 0.555164183024317 │ t │ 2 │
│ 2 │ 0.572575284633785 │ f │ 2 │
│ 6 │ 0.572836415842175 │ f │ 2 │
│ 9 │ 0.611408909317106 │ f │ 2 │
│ 4 │ 0.654628816060722 │ f │ 2 │
│ 5 │ 0.692200613208115 │ f │ 2 │
│ 3 │ 0.998213059268892 │ t │ 3 │
└────┴────────────────────┴─────────────┴────────────┘
(10 rows)