We have an app which displays a table like this :
this is what it looks like in database :
┌──────────┬──────────────┬─────────────┬────────────┬──────────┬──────────────────┐
│ BatchId │ ProductCode │ StageValue │ StageUnit │ StageId │ StageLineNumber │
├──────────┼──────────────┼─────────────┼────────────┼──────────┼──────────────────┤
│ 0B001 │ 150701 │ LEDI2B4015 │ │ 37222 │ 1 │
│ 0B001 │ 150701 │ 16.21 │ KG │ 37222 │ 1 │
│ 0B001 │ 150701 │ 73.5 │ │ 37222 │ 2 │
│ 0B001 │ 150701 │ LEDI2B6002 │ KG │ 37222 │ 2 │
└──────────┴──────────────┴─────────────┴────────────┴──────────┴──────────────────┘
I would like to query the database so that the output looks like this :
┌──────────┬──────────────┬────────────────────┬─────────────┬────────────┬──────────┬──────────────────┐
│ BatchId │ ProductCode │ LoadedProductCode │ StageValue │ StageUnit │ StageId │ StageLineNumber │
├──────────┼──────────────┼────────────────────┼─────────────┼────────────┼──────────┼──────────────────┤
│ 0B001 │ 150701 │ LEDI2B4015 │ 16.21 │ KG │ 37222 │ 1 │
│ 0B001 │ 150701 │ LEDI2B6002 │ 73.5 │ KG │ 37222 │ 2 │
└──────────┴──────────────┴────────────────────┴─────────────┴────────────┴──────────┴──────────────────┘
Is that even possible ?
My PostgreSQL Server version is 14.X
I have looked for many threads with "merge two columns and add new one" but none of them seem to be what I want.
DB Fiddle link
SQL Fiddle (in case) link
It's possible to get your output, but it's going to be prone to errors. You should seriously rethink your data model, if at all possible. Storing floats as text and trying to parse them is going to lead to many problems.
That said, here's a query that will work, at least for your sample data:
SELECT batchid,
productcode,
max(stagevalue) FILTER (WHERE stagevalue ~ '^[a-zA-Z].*') as loadedproductcode,
max(stagevalue::float) FILTER (WHERE stagevalue !~ '^[a-zA-Z].*') as stagevalue,
max(stageunit),
stageid,
stagelinenumber
FROM datas
GROUP BY batchid, productcode, stageid, stagelinenumber;
Note that max is just used because you need an aggregate function to combine with the filter. You could replace it with min and get the same result, at least for these data.
I wish to perform a left join based on two conditions :
SELECT
...
FROM
sometable AS a
LEFT JOIN someothertable AS b ON a.some_id = b.some_id
AND b.other_id IN (1, 2, 3, 4)
I got the error:
Supported syntax: JOIN ON Expr([table.]column, ...) = Expr([table.]column, ...) [AND Expr([table.]column, ...) = Expr([table.]column, ...) ...]```
It seems that the condition for a join must be = and can't be in
Any idea ?
Consider moving IN-operator into subquery:
SELECT
a.number,
b.number
FROM numbers(8) AS a
LEFT JOIN
(
SELECT *
FROM numbers(234)
WHERE number IN (1, 2, 3, 4)
) AS b USING (number)
/*
┌─number─┬─b.number─┐
│ 0 │ 0 │
│ 1 │ 1 │
│ 2 │ 2 │
│ 3 │ 3 │
│ 4 │ 4 │
│ 5 │ 0 │
│ 6 │ 0 │
│ 7 │ 0 │
└────────┴──────────┘
*/
or
SELECT
a.number,
b.number
FROM numbers(8) AS a
LEFT JOIN
(
SELECT *
FROM numbers(234)
WHERE number IN (1, 2, 3, 4)
) AS b USING (number)
SETTINGS join_use_nulls = 1 /* 1 is 'JOIN behaves the same way as in standard SQL. The type of the corresponding field is converted to Nullable, and empty cells are filled with NULL.' */
/*
┌─number─┬─b.number─┐
│ 0 │ ᴺᵁᴸᴸ │
│ 1 │ 1 │
│ 2 │ 2 │
│ 3 │ 3 │
│ 4 │ 4 │
│ 5 │ ᴺᵁᴸᴸ │
│ 6 │ ᴺᵁᴸᴸ │
│ 7 │ ᴺᵁᴸᴸ │
└────────┴──────────┘
*/
I am wondering if there is an easy way to create a lag (or lead) of a time series variable in Julia according to a by group or condition? For example: I have a dataset of the following form
julia> df1 = DataFrame(var1=["a","a","a","a","b","b","b","b"],
var2=[0,1,2,3,0,1,2,3])
8×2 DataFrame
│ Row │ var1 │ var2 │
│ │ String │ Int64 │
├─────┼────────┼───────┤
│ 1 │ a │ 0 │
│ 2 │ a │ 1 │
│ 3 │ a │ 2 │
│ 4 │ a │ 3 │
│ 5 │ b │ 0 │
│ 6 │ b │ 1 │
│ 7 │ b │ 2 │
│ 8 │ b │ 3 │
And I want to create a variable lag2 that contains the values in var2 lagged by 2. However, this should be done grouped by var1 so that the first two observations in the 'b' group do not get the last two values of the 'a' group. Rather they should be set to missing or zero or some default value.
I have tried the following code which produces the following error.
julia> df2 = df1 |> #groupby(_.var1) |> #mutate(lag2 = lag(_.var2,2)) |> DataFrame
ERROR: MethodError: no method matching merge(::Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}}, ::NamedTuple{(:lag2,),Tuple{ShiftedArray{Int64,Missing,1,QueryOperators.GroupColumnArrayView{Int64,Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}},:var2}}}})
Closest candidates are:
merge(::NamedTuple{,T} where T<:Tuple, ::NamedTuple) at namedtuple.jl:245
merge(::NamedTuple{an,T} where T<:Tuple, ::NamedTuple{bn,T} where T<:Tuple) where {an, bn} at namedtuple.jl:233
merge(::NamedTuple, ::NamedTuple, ::NamedTuple...) at namedtuple.jl:249
...
Stacktrace:
[1] (::var"#437#442")(::Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}}) at /Users/kayvon/.julia/packages/Query/AwBtd/src/query_translation.jl:58
[2] iterate at /Users/kayvon/.julia/packages/QueryOperators/g4G21/src/enumerable/enumerable_map.jl:25 [inlined]
[3] iterate at /Users/kayvon/.julia/packages/Tables/TjjiP/src/tofromdatavalues.jl:45 [inlined]
[4] buildcolumns at /Users/kayvon/.julia/packages/Tables/TjjiP/src/fallbacks.jl:185 [inlined]
[5] columns at /Users/kayvon/.julia/packages/Tables/TjjiP/src/fallbacks.jl:237 [inlined]
[6] #DataFrame#453(::Bool, ::Type{DataFrame}, ::QueryOperators.EnumerableMap{Union{},QueryOperators.EnumerableIterable{Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}},QueryOperators.EnumerableGroupBy{Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}},String,NamedTuple{(:var1, :var2),Tuple{String,Int64}},QueryOperators.EnumerableIterable{NamedTuple{(:var1, :var2),Tuple{String,Int64}},Tables.DataValueRowIterator{NamedTuple{(:var1, :var2),Tuple{String,Int64}},Tables.Schema{(:var1, :var2),Tuple{String,Int64}},Tables.RowIterator{NamedTuple{(:var1, :var2),Tuple{Array{String,1},Array{Int64,1}}}}}},var"#434#439",var"#435#440"}},var"#437#442"}) at /Users/kayvon/.julia/packages/DataFrames/S3ZFo/src/other/tables.jl:40
[7] DataFrame(::QueryOperators.EnumerableMap{Union{},QueryOperators.EnumerableIterable{Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}},QueryOperators.EnumerableGroupBy{Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}},String,NamedTuple{(:var1, :var2),Tuple{String,Int64}},QueryOperators.EnumerableIterable{NamedTuple{(:var1, :var2),Tuple{String,Int64}},Tables.DataValueRowIterator{NamedTuple{(:var1, :var2),Tuple{String,Int64}},Tables.Schema{(:var1, :var2),Tuple{String,Int64}},Tables.RowIterator{NamedTuple{(:var1, :var2),Tuple{Array{String,1},Array{Int64,1}}}}}},var"#434#439",var"#435#440"}},var"#437#442"}) at /Users/kayvon/.julia/packages/DataFrames/S3ZFo/src/other/tables.jl:31
[8] |>(::QueryOperators.EnumerableMap{Union{},QueryOperators.EnumerableIterable{Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}},QueryOperators.EnumerableGroupBy{Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}},String,NamedTuple{(:var1, :var2),Tuple{String,Int64}},QueryOperators.EnumerableIterable{NamedTuple{(:var1, :var2),Tuple{String,Int64}},Tables.DataValueRowIterator{NamedTuple{(:var1, :var2),Tuple{String,Int64}},Tables.Schema{(:var1, :var2),Tuple{String,Int64}},Tables.RowIterator{NamedTuple{(:var1, :var2),Tuple{Array{String,1},Array{Int64,1}}}}}},var"#434#439",var"#435#440"}},var"#437#442"}, ::Type) at ./operators.jl:854
[9] top-level scope at none:0
Appreciate any help with this approach or alternate approaches. Thanks.
EDIT
Putting this edit to the top as it works in DataFrames 1.0 so reflects the stable API:
Under DataFrames.jl 0.22.2 the correct syntax is:
julia> combine(groupby(df1, :var1), :var2 => Base.Fix2(lag, 2) => :var2_l2)
8×2 DataFrame
Row │ var1 var2_l2
│ String Int64?
─────┼─────────────────
1 │ a missing
2 │ a missing
3 │ a 0
4 │ a 1
5 │ b missing
6 │ b missing
7 │ b 0
8 │ b 1
Another alternative to the maybe slightly arcane Base.Fix2 syntax you could use an anonymous function (x -> lag(x, 2)) (note the enclosing parens are required due to operator precedence).
Original answer:
You definitely had the right idea - I don't work with Query.jl but this can easily be done with basic DataFrames syntax:
julia> using DataFrames
julia> import ShiftedArrays: lag
julia> df1 = DataFrame(var1=["a","a","a","a","b","b","b","b"],
var2=[0,1,2,3,0,1,2,3]);
julia> by(df1, :var1, var2_l2 = :var2 => Base.Fix2(lag, 2)))
8×2 DataFrame
│ Row │ var1 │ var2_l2 │
│ │ String │ Int64⍰ │
├─────┼────────┼─────────┤
│ 1 │ a │ missing │
│ 2 │ a │ missing │
│ 3 │ a │ 0 │
│ 4 │ a │ 1 │
│ 5 │ b │ missing │
│ 6 │ b │ missing │
│ 7 │ b │ 0 │
│ 8 │ b │ 1 │
Note that I used Base.Fix2 here to get a single argument version of lag. This is essentially the same as defining your own l2(x) = lag(x, 2) and then using l2 in the by call. If you do define your own l2 function you can also set the default value like l2(x) = lag(x, 2, default = -1000) if you want to avoid missing values:
julia> l2(x) = lag(x, 2, default = -1000)
l2 (generic function with 1 method)
julia> by(df1, :var1, var2_l2 = :var2 => l2)
8×2 DataFrame
│ Row │ var1 │ var2_l2 │
│ │ String │ Int64 │
├─────┼────────┼─────────┤
│ 1 │ a │ -1000 │
│ 2 │ a │ -1000 │
│ 3 │ a │ 0 │
│ 4 │ a │ 1 │
│ 5 │ b │ -1000 │
│ 6 │ b │ -1000 │
│ 7 │ b │ 0 │
│ 8 │ b │ 1 │
I need to aggregate time-series data (with average functions) on different timeslots, like:
Today
last X days
Last weekend
This week
Last X weeks
This month
List item
etc...
Q1: Can it be done within GROUP BY statement or at least with a single query?
Q2: Do I need any Materialized View for that?
The table is partitioned by Month and sharded by UserID
All queries are within UserID (single shard)
group by with ROLLUP
create table xrollup(metric Int64, b date, v Int64 ) engine=MergeTree partition by tuple() order by tuple();
insert into xrollup values (1,'2018-01-01', 1), (1,'2018-01-02', 1), (1,'2018-02-01', 1), (1,'2017-03-01', 1);
insert into xrollup values (2,'2018-01-01', 1), (2,'2018-02-02', 1);
SELECT metric, toYear(b) y, toYYYYMM(b) m, SUM(v) AS val
FROM xrollup
GROUP BY metric, y, m with ROLLUP
ORDER BY metric, y, m
┌─metric─┬────y─┬──────m─┬─val─┐
│ 0 │ 0 │ 0 │ 6 │ overall
│ 1 │ 0 │ 0 │ 4 │ overall by metric1
│ 1 │ 2017 │ 0 │ 1 │ overall by metric1 for 2017
│ 1 │ 2017 │ 201703 │ 1 │ overall by metric1 for march 2017
│ 1 │ 2018 │ 0 │ 3 │
│ 1 │ 2018 │ 201801 │ 2 │
│ 1 │ 2018 │ 201802 │ 1 │
│ 2 │ 0 │ 0 │ 2 │
│ 2 │ 2018 │ 0 │ 2 │
│ 2 │ 2018 │ 201801 │ 1 │
│ 2 │ 2018 │ 201802 │ 1 │
└────────┴──────┴────────┴─────┘
Although there's an accepted answer I had to do something similar but found an alternative route using aggregate function combinators, specially -If to select specific date ranges.
I needed to group by a content ID but retrieve unique views for the whole time range and also for specific buckets to generate a histogram (ClickHouse's histogram() function wasn't suitable because there's no option for sub aggregation).
You could do something along these lines:
SELECT
group_field,
avgIf(metric, date BETWEEN toDate('2022-09-03') AND toDate('2022-09-10')) AS week_avg
avgIf(metric, date BETWEEN toDate('2022-08-10') AND toDate('2022-09-10')) AS month_avg
FROM data
GROUP BY group_field
Consider the following example table:
CREATE TABLE rndtbl AS
SELECT
generate_series(1, 10) AS id,
random() AS val;
and I want to find for each id a cluster_id such that the clusters are far away from each other at least 0.1. How would I calculate such a cluster assignment?
A specific example would be:
select * from rndtbl ;
id | val
----+-------------------
1 | 0.485714662820101
2 | 0.185201027430594
3 | 0.368477711919695
4 | 0.687312887981534
5 | 0.978742253035307
6 | 0.961830694694072
7 | 0.10397826647386
8 | 0.644958863966167
9 | 0.912827260326594
10 | 0.196085536852479
(10 rows)
The result would be: ids (2,7,10) in a cluster and (5,6,9) in another cluster and (4,8) in another, and (1) and (3) as singleton clusters.
From
SELECT * FROM rndtbl ;
┌────┬────────────────────┐
│ id │ val │
├────┼────────────────────┤
│ 1 │ 0.153776332736015 │
│ 2 │ 0.572575284633785 │
│ 3 │ 0.998213059268892 │
│ 4 │ 0.654628816060722 │
│ 5 │ 0.692200613208115 │
│ 6 │ 0.572836415842175 │
│ 7 │ 0.0788379465229809 │
│ 8 │ 0.390280921943486 │
│ 9 │ 0.611408909317106 │
│ 10 │ 0.555164183024317 │
└────┴────────────────────┘
(10 rows)
Use the LAG window function to know whether the current row is in a new cluster or not:
SELECT *, val - LAG(val) OVER (ORDER BY val) > 0.1 AS new_cluster
FROM rndtbl ;
┌────┬────────────────────┬─────────────┐
│ id │ val │ new_cluster │
├────┼────────────────────┼─────────────┤
│ 7 │ 0.0788379465229809 │ (null) │
│ 1 │ 0.153776332736015 │ f │
│ 8 │ 0.390280921943486 │ t │
│ 10 │ 0.555164183024317 │ t │
│ 2 │ 0.572575284633785 │ f │
│ 6 │ 0.572836415842175 │ f │
│ 9 │ 0.611408909317106 │ f │
│ 4 │ 0.654628816060722 │ f │
│ 5 │ 0.692200613208115 │ f │
│ 3 │ 0.998213059268892 │ t │
└────┴────────────────────┴─────────────┘
(10 rows)
Finally you can SUM the number of true (still ordering by val) to get the cluster of the row (counting from 0):
SELECT *, SUM(COALESCE(new_cluster::int, 0)) OVER (ORDER BY val) AS nb_cluster
FROM (
SELECT *, val - LAG(val) OVER (ORDER BY val) > 0.1 AS new_cluster
FROM rndtbl
) t
;
┌────┬────────────────────┬─────────────┬────────────┐
│ id │ val │ new_cluster │ nb_cluster │
├────┼────────────────────┼─────────────┼────────────┤
│ 7 │ 0.0788379465229809 │ (null) │ 0 │
│ 1 │ 0.153776332736015 │ f │ 0 │
│ 8 │ 0.390280921943486 │ t │ 1 │
│ 10 │ 0.555164183024317 │ t │ 2 │
│ 2 │ 0.572575284633785 │ f │ 2 │
│ 6 │ 0.572836415842175 │ f │ 2 │
│ 9 │ 0.611408909317106 │ f │ 2 │
│ 4 │ 0.654628816060722 │ f │ 2 │
│ 5 │ 0.692200613208115 │ f │ 2 │
│ 3 │ 0.998213059268892 │ t │ 3 │
└────┴────────────────────┴─────────────┴────────────┘
(10 rows)