ClickHouse: Efficient way to aggregate data by different time ranges

I need to aggregate time-series data (with average functions) over different time slots, such as:
Today
Last X days
Last weekend
This week
Last X weeks
This month
etc.
Q1: Can this be done within a GROUP BY statement, or at least with a single query?
Q2: Do I need a Materialized View for that?
The table is partitioned by Month and sharded by UserID
All queries are within UserID (single shard)

GROUP BY with ROLLUP
create table xrollup(metric Int64, b date, v Int64 ) engine=MergeTree partition by tuple() order by tuple();
insert into xrollup values (1,'2018-01-01', 1), (1,'2018-01-02', 1), (1,'2018-02-01', 1), (1,'2017-03-01', 1);
insert into xrollup values (2,'2018-01-01', 1), (2,'2018-02-02', 1);
SELECT metric, toYear(b) y, toYYYYMM(b) m, SUM(v) AS val
FROM xrollup
GROUP BY metric, y, m with ROLLUP
ORDER BY metric, y, m
┌─metric─┬────y─┬──────m─┬─val─┐
│      0 │    0 │      0 │   6 │ overall
│      1 │    0 │      0 │   4 │ overall by metric 1
│      1 │ 2017 │      0 │   1 │ overall by metric 1 for 2017
│      1 │ 2017 │ 201703 │   1 │ overall by metric 1 for March 2017
│      1 │ 2018 │      0 │   3 │
│      1 │ 2018 │ 201801 │   2 │
│      1 │ 2018 │ 201802 │   1 │
│      2 │    0 │      0 │   2 │
│      2 │ 2018 │      0 │   2 │
│      2 │ 2018 │ 201801 │   1 │
│      2 │ 2018 │ 201802 │   1 │
└────────┴──────┴────────┴─────┘
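To make the ROLLUP levels concrete, the same subtotals can be sketched in Python (illustration only, not how ClickHouse computes them; a 0 in the output above stands for a rolled-up y or m key, which the empty tuple prefixes model here):

```python
from collections import defaultdict

# Sample rows from xrollup: (metric, date, v)
rows = [
    (1, "2018-01-01", 1), (1, "2018-01-02", 1),
    (1, "2018-02-01", 1), (1, "2017-03-01", 1),
    (2, "2018-01-01", 1), (2, "2018-02-02", 1),
]

# WITH ROLLUP emits one aggregate per prefix of the GROUP BY key list:
# (metric, y, m), (metric, y), (metric,) and the grand total ().
totals = defaultdict(int)
for metric, d, v in rows:
    key = (metric, d[:4], d[:7])   # metric, year, year-month
    for i in range(len(key) + 1):
        totals[key[:i]] += v

print(totals[()])            # grand total: 6
print(totals[(1,)])          # metric 1 overall: 4
print(totals[(1, "2017")])   # metric 1 in 2017: 1
```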

Although there's an accepted answer, I had to do something similar and found an alternative route using aggregate function combinators, specifically -If, to select specific date ranges.
I needed to group by a content ID but retrieve unique views both for the whole time range and for specific buckets to generate a histogram (ClickHouse's histogram() function wasn't suitable because it doesn't support sub-aggregation).
You could do something along these lines:
SELECT
    group_field,
    avgIf(metric, date BETWEEN toDate('2022-09-03') AND toDate('2022-09-10')) AS week_avg,
    avgIf(metric, date BETWEEN toDate('2022-08-10') AND toDate('2022-09-10')) AS month_avg
FROM data
GROUP BY group_field
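If it helps to see the -If combinator logic outside of SQL, here is a minimal Python sketch of the same conditional averages (the rows, group, and values are made up for illustration):

```python
from datetime import date

# Hypothetical rows: (group_field, date, metric)
rows = [
    ("a", date(2022, 9, 5), 10.0),   # inside both ranges
    ("a", date(2022, 8, 20), 20.0),  # inside the month range only
    ("a", date(2022, 7, 1), 99.0),   # outside both ranges
]

def avg_if(rows, lo, hi):
    """Average of metric over rows whose date falls in [lo, hi] -- like avgIf."""
    vals = [m for _, d, m in rows if lo <= d <= hi]
    return sum(vals) / len(vals) if vals else None

week_avg = avg_if(rows, date(2022, 9, 3), date(2022, 9, 10))    # 10.0
month_avg = avg_if(rows, date(2022, 8, 10), date(2022, 9, 10))  # 15.0
```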

Related

PostgreSQL: Merge two rows and add the difference to a new column

We have an app that displays this data in a table. This is what it looks like in the database:
┌──────────┬──────────────┬─────────────┬────────────┬──────────┬──────────────────┐
│ BatchId  │ ProductCode  │ StageValue  │ StageUnit  │ StageId  │ StageLineNumber  │
├──────────┼──────────────┼─────────────┼────────────┼──────────┼──────────────────┤
│ 0B001    │ 150701       │ LEDI2B4015  │            │ 37222    │ 1                │
│ 0B001    │ 150701       │ 16.21       │ KG         │ 37222    │ 1                │
│ 0B001    │ 150701       │ 73.5        │            │ 37222    │ 2                │
│ 0B001    │ 150701       │ LEDI2B6002  │ KG         │ 37222    │ 2                │
└──────────┴──────────────┴─────────────┴────────────┴──────────┴──────────────────┘
I would like to query the database so that the output looks like this :
┌──────────┬──────────────┬────────────────────┬─────────────┬────────────┬──────────┬──────────────────┐
│ BatchId  │ ProductCode  │ LoadedProductCode  │ StageValue  │ StageUnit  │ StageId  │ StageLineNumber  │
├──────────┼──────────────┼────────────────────┼─────────────┼────────────┼──────────┼──────────────────┤
│ 0B001    │ 150701       │ LEDI2B4015         │ 16.21       │ KG         │ 37222    │ 1                │
│ 0B001    │ 150701       │ LEDI2B6002         │ 73.5        │ KG         │ 37222    │ 2                │
└──────────┴──────────────┴────────────────────┴─────────────┴────────────┴──────────┴──────────────────┘
Is that even possible?
My PostgreSQL server version is 14.x.
I have looked at many threads about "merge two columns and add a new one" but none of them seem to be what I want.
It's possible to get your output, but it's going to be prone to errors. You should seriously rethink your data model, if at all possible. Storing floats as text and trying to parse them is going to lead to many problems.
That said, here's a query that will work, at least for your sample data:
SELECT batchid,
       productcode,
       max(stagevalue) FILTER (WHERE stagevalue ~ '^[a-zA-Z].*') AS loadedproductcode,
       max(stagevalue::float) FILTER (WHERE stagevalue !~ '^[a-zA-Z].*') AS stagevalue,
       max(stageunit) AS stageunit,
       stageid,
       stagelinenumber
FROM datas
GROUP BY batchid, productcode, stageid, stagelinenumber;
Note that max is just used because you need an aggregate function to combine with the filter. You could replace it with min and get the same result, at least for these data.
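To illustrate what the two FILTER clauses are doing, here is a hypothetical Python sketch of the same pivot over the sample rows (names mirror the query; "starts with a letter" is the same heuristic as the '^[a-zA-Z].*' regex):

```python
from collections import defaultdict

# (batchid, productcode, stagevalue, stageunit, stageid, stagelinenumber)
rows = [
    ("0B001", "150701", "LEDI2B4015", "",   37222, 1),
    ("0B001", "150701", "16.21",      "KG", 37222, 1),
    ("0B001", "150701", "73.5",       "",   37222, 2),
    ("0B001", "150701", "LEDI2B6002", "KG", 37222, 2),
]

groups = defaultdict(lambda: {"loadedproductcode": None,
                              "stagevalue": None,
                              "stageunit": ""})
for batchid, productcode, value, unit, stageid, lineno in rows:
    g = groups[(batchid, productcode, stageid, lineno)]
    if value[:1].isalpha():           # text value -> loaded product code
        g["loadedproductcode"] = value
    else:                             # numeric string -> parse as the stage value
        g["stagevalue"] = float(value)
    g["stageunit"] = max(g["stageunit"], unit)   # mirrors max(stageunit)
```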

ClickHouse: LEFT JOIN using IN

I wish to perform a left join based on two conditions :
SELECT
...
FROM
sometable AS a
LEFT JOIN someothertable AS b ON a.some_id = b.some_id
AND b.other_id IN (1, 2, 3, 4)
I got the error:
Supported syntax: JOIN ON Expr([table.]column, ...) = Expr([table.]column, ...) [AND Expr([table.]column, ...) = Expr([table.]column, ...) ...]
It seems that the condition for a join must be = and can't be IN.
Any ideas?
Consider moving the IN operator into a subquery:
SELECT
a.number,
b.number
FROM numbers(8) AS a
LEFT JOIN
(
SELECT *
FROM numbers(234)
WHERE number IN (1, 2, 3, 4)
) AS b USING (number)
/*
┌─number─┬─b.number─┐
│      0 │        0 │
│      1 │        1 │
│      2 │        2 │
│      3 │        3 │
│      4 │        4 │
│      5 │        0 │
│      6 │        0 │
│      7 │        0 │
└────────┴──────────┘
*/
or
SELECT
a.number,
b.number
FROM numbers(8) AS a
LEFT JOIN
(
SELECT *
FROM numbers(234)
WHERE number IN (1, 2, 3, 4)
) AS b USING (number)
SETTINGS join_use_nulls = 1 /* 1 is 'JOIN behaves the same way as in standard SQL. The type of the corresponding field is converted to Nullable, and empty cells are filled with NULL.' */
/*
┌─number─┬─b.number─┐
│      0 │     ᴺᵁᴸᴸ │
│      1 │        1 │
│      2 │        2 │
│      3 │        3 │
│      4 │        4 │
│      5 │     ᴺᵁᴸᴸ │
│      6 │     ᴺᵁᴸᴸ │
│      7 │     ᴺᵁᴸᴸ │
└────────┴──────────┘
*/
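The difference between the two settings can be sketched in plain Python: pre-filter the right-hand side (the subquery), then fill non-matching rows with either the column type's default (0) or None standing in for NULL.

```python
# Left side: numbers(8); right side: numbers(234) filtered by the IN list.
left = range(8)
right = {n for n in range(234) if n in {1, 2, 3, 4}}  # the filtered subquery

# join_use_nulls = 0: non-matching rows get the column type's default (0)
result_default = [(a, a if a in right else 0) for a in left]
# join_use_nulls = 1: non-matching rows get NULL (None here)
result_nulls = [(a, a if a in right else None) for a in left]
```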

Counting the same positional bits in PostgreSQL bitmasks

I am trying to count each same-position bit of multiple bitmasks in PostgreSQL. Here is an example of the problem:
Suppose I have three bitmasks (in binary) like:
011011011100110
100011010100101
110110101010101
Now what I want to do is get the total count of set bits in each separate column, considering the above masks as three rows and multiple columns.
E.g. the first column has a count of 2, the second a count of 2, the third a count of 1, and so on...
In actuality I have a total of 30 bits in each bitmask in my database. I want to do this in PostgreSQL. I am open to further explanation of the problem if needed.
You could do it using the get_bit function and a couple of joins:
SELECT sum(bit) FILTER (WHERE i = 0) AS count_0,
sum(bit) FILTER (WHERE i = 1) AS count_1,
...
sum(bit) FILTER (WHERE i = 29) AS count_29
FROM bits
CROSS JOIN generate_series(0, 29) AS i
CROSS JOIN LATERAL get_bit(b, i) AS bit;
The column with the bit string is b in my example.
You could use the bitwise and & operator and bigint arithmetic so long as your bitstrings contain 63 bits or fewer:
# create table bmasks (mask bit(15));
CREATE TABLE
# insert into bmasks values ('011011011100110'), ('100011010100101'), ('110110101010101');
INSERT 0 3
# with masks as (
select (2 ^ x)::bigint::bit(15) as mask, x as posn
from generate_series(0, 14) as gs(x)
)
select m.posn, m.mask, sum((b.mask & m.mask > 0::bit(15))::int) as set_bits
from masks m
cross join bmasks b
group by m.posn, m.mask;
┌──────┬─────────────────┬──────────┐
│ posn │      mask       │ set_bits │
├──────┼─────────────────┼──────────┤
│    0 │ 000000000000001 │        2 │
│    1 │ 000000000000010 │        1 │
│    2 │ 000000000000100 │        3 │
│    3 │ 000000000001000 │        0 │
│    4 │ 000000000010000 │        1 │
│    5 │ 000000000100000 │        2 │
│    6 │ 000000001000000 │        2 │
│    7 │ 000000010000000 │        2 │
│    8 │ 000000100000000 │        1 │
│    9 │ 000001000000000 │        2 │
│   10 │ 000010000000000 │        3 │
│   11 │ 000100000000000 │        1 │
│   12 │ 001000000000000 │        1 │
│   13 │ 010000000000000 │        2 │
│   14 │ 100000000000000 │        2 │
└──────┴─────────────────┴──────────┘
(15 rows)
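As a quick cross-check of both SQL approaches, here is a small Python sketch that counts the set bits per position for the three sample masks (posn 0 is the least significant bit, matching the output above):

```python
masks = ["011011011100110", "100011010100101", "110110101010101"]

width = len(masks[0])
# posn 0 = rightmost character; sum that bit across all masks
set_bits = [sum(int(m[width - 1 - posn]) for m in masks)
            for posn in range(width)]
```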

Create lag / lead time series by groups in Julia?

I am wondering if there is an easy way to create a lag (or lead) of a time-series variable in Julia according to a by-group or condition. For example, I have a dataset of the following form:
julia> df1 = DataFrame(var1=["a","a","a","a","b","b","b","b"],
var2=[0,1,2,3,0,1,2,3])
8×2 DataFrame
│ Row │ var1   │ var2  │
│     │ String │ Int64 │
├─────┼────────┼───────┤
│ 1   │ a      │ 0     │
│ 2   │ a      │ 1     │
│ 3   │ a      │ 2     │
│ 4   │ a      │ 3     │
│ 5   │ b      │ 0     │
│ 6   │ b      │ 1     │
│ 7   │ b      │ 2     │
│ 8   │ b      │ 3     │
And I want to create a variable lag2 that contains the values of var2 lagged by 2. However, this should be done grouped by var1, so that the first two observations in the 'b' group do not get the last two values of the 'a' group. Rather, they should be set to missing, zero, or some default value.
I have tried the following code, which produces the error below:
julia> df2 = df1 |> @groupby(_.var1) |> @mutate(lag2 = lag(_.var2, 2)) |> DataFrame
ERROR: MethodError: no method matching merge(::Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}}, ::NamedTuple{(:lag2,),Tuple{ShiftedArray{Int64,Missing,1,QueryOperators.GroupColumnArrayView{Int64,Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}},:var2}}}})
Closest candidates are:
merge(::NamedTuple{,T} where T<:Tuple, ::NamedTuple) at namedtuple.jl:245
merge(::NamedTuple{an,T} where T<:Tuple, ::NamedTuple{bn,T} where T<:Tuple) where {an, bn} at namedtuple.jl:233
merge(::NamedTuple, ::NamedTuple, ::NamedTuple...) at namedtuple.jl:249
...
Stacktrace:
[1] (::var"#437#442")(::Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}}) at /Users/kayvon/.julia/packages/Query/AwBtd/src/query_translation.jl:58
[2] iterate at /Users/kayvon/.julia/packages/QueryOperators/g4G21/src/enumerable/enumerable_map.jl:25 [inlined]
[3] iterate at /Users/kayvon/.julia/packages/Tables/TjjiP/src/tofromdatavalues.jl:45 [inlined]
[4] buildcolumns at /Users/kayvon/.julia/packages/Tables/TjjiP/src/fallbacks.jl:185 [inlined]
[5] columns at /Users/kayvon/.julia/packages/Tables/TjjiP/src/fallbacks.jl:237 [inlined]
[6] #DataFrame#453(::Bool, ::Type{DataFrame}, ::QueryOperators.EnumerableMap{Union{},QueryOperators.EnumerableIterable{Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}},QueryOperators.EnumerableGroupBy{Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}},String,NamedTuple{(:var1, :var2),Tuple{String,Int64}},QueryOperators.EnumerableIterable{NamedTuple{(:var1, :var2),Tuple{String,Int64}},Tables.DataValueRowIterator{NamedTuple{(:var1, :var2),Tuple{String,Int64}},Tables.Schema{(:var1, :var2),Tuple{String,Int64}},Tables.RowIterator{NamedTuple{(:var1, :var2),Tuple{Array{String,1},Array{Int64,1}}}}}},var"#434#439",var"#435#440"}},var"#437#442"}) at /Users/kayvon/.julia/packages/DataFrames/S3ZFo/src/other/tables.jl:40
[7] DataFrame(::QueryOperators.EnumerableMap{Union{},QueryOperators.EnumerableIterable{Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}},QueryOperators.EnumerableGroupBy{Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}},String,NamedTuple{(:var1, :var2),Tuple{String,Int64}},QueryOperators.EnumerableIterable{NamedTuple{(:var1, :var2),Tuple{String,Int64}},Tables.DataValueRowIterator{NamedTuple{(:var1, :var2),Tuple{String,Int64}},Tables.Schema{(:var1, :var2),Tuple{String,Int64}},Tables.RowIterator{NamedTuple{(:var1, :var2),Tuple{Array{String,1},Array{Int64,1}}}}}},var"#434#439",var"#435#440"}},var"#437#442"}) at /Users/kayvon/.julia/packages/DataFrames/S3ZFo/src/other/tables.jl:31
[8] |>(::QueryOperators.EnumerableMap{Union{},QueryOperators.EnumerableIterable{Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}},QueryOperators.EnumerableGroupBy{Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}},String,NamedTuple{(:var1, :var2),Tuple{String,Int64}},QueryOperators.EnumerableIterable{NamedTuple{(:var1, :var2),Tuple{String,Int64}},Tables.DataValueRowIterator{NamedTuple{(:var1, :var2),Tuple{String,Int64}},Tables.Schema{(:var1, :var2),Tuple{String,Int64}},Tables.RowIterator{NamedTuple{(:var1, :var2),Tuple{Array{String,1},Array{Int64,1}}}}}},var"#434#439",var"#435#440"}},var"#437#442"}, ::Type) at ./operators.jl:854
[9] top-level scope at none:0
Appreciate any help with this approach or alternate approaches. Thanks.
EDIT
Putting this edit to the top as it works in DataFrames 1.0 so reflects the stable API:
Under DataFrames.jl 0.22.2 the correct syntax is:
julia> combine(groupby(df1, :var1), :var2 => Base.Fix2(lag, 2) => :var2_l2)
8×2 DataFrame
 Row │ var1    var2_l2
     │ String  Int64?
─────┼─────────────────
   1 │ a       missing
   2 │ a       missing
   3 │ a             0
   4 │ a             1
   5 │ b       missing
   6 │ b       missing
   7 │ b             0
   8 │ b             1
As an alternative to the slightly arcane Base.Fix2 syntax, you could use an anonymous function (x -> lag(x, 2)) (note that the enclosing parentheses are required due to operator precedence).
Original answer:
You definitely had the right idea - I don't work with Query.jl but this can easily be done with basic DataFrames syntax:
julia> using DataFrames
julia> import ShiftedArrays: lag
julia> df1 = DataFrame(var1=["a","a","a","a","b","b","b","b"],
var2=[0,1,2,3,0,1,2,3]);
julia> by(df1, :var1, var2_l2 = :var2 => Base.Fix2(lag, 2))
8×2 DataFrame
│ Row │ var1   │ var2_l2 │
│     │ String │ Int64⍰  │
├─────┼────────┼─────────┤
│ 1   │ a      │ missing │
│ 2   │ a      │ missing │
│ 3   │ a      │ 0       │
│ 4   │ a      │ 1       │
│ 5   │ b      │ missing │
│ 6   │ b      │ missing │
│ 7   │ b      │ 0       │
│ 8   │ b      │ 1       │
Note that I used Base.Fix2 here to get a single-argument version of lag. This is essentially the same as defining your own l2(x) = lag(x, 2) and then using l2 in the by call. If you do define your own l2 function, you can also set the default value, e.g. l2(x) = lag(x, 2, default = -1000), if you want to avoid missing values:
julia> l2(x) = lag(x, 2, default = -1000)
l2 (generic function with 1 method)
julia> by(df1, :var1, var2_l2 = :var2 => l2)
8×2 DataFrame
│ Row │ var1   │ var2_l2 │
│     │ String │ Int64   │
├─────┼────────┼─────────┤
│ 1   │ a      │ -1000   │
│ 2   │ a      │ -1000   │
│ 3   │ a      │ 0       │
│ 4   │ a      │ 1       │
│ 5   │ b      │ -1000   │
│ 6   │ b      │ -1000   │
│ 7   │ b      │ 0       │
│ 8   │ b      │ 1       │
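For readers coming from other languages, the grouped-lag semantics can be sketched in plain Python (a hypothetical helper, unrelated to DataFrames.jl):

```python
from collections import defaultdict

def grouped_lag(keys, values, n, default=None):
    """Lag values by n positions within each group; early rows get default."""
    history = defaultdict(list)   # per-group values seen so far
    out = []
    for k, v in zip(keys, values):
        h = history[k]
        out.append(h[-n] if len(h) >= n else default)
        h.append(v)
    return out

var1 = ["a", "a", "a", "a", "b", "b", "b", "b"]
var2 = [0, 1, 2, 3, 0, 1, 2, 3]
lag2 = grouped_lag(var1, var2, 2)           # [None, None, 0, 1, None, None, 0, 1]
filled = grouped_lag(var1, var2, 2, -1000)  # like the default = -1000 example
```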

Subtract value from previous row value if it is greater than the max value

I'm using PostgreSQL & Sequelize. I have to find the consumption from the reading table. Currently I have a query that subtracts the value from the previous row. The problem is that if a value is less than the previous value, I have to ignore that row and wait for a greater value before making the calculation.
Current Query
select "readingValue",
"readingValue" - coalesce(lag("readingValue") over (order by "id")) as consumption
from public."EnergyReadingTbl";
Example Record & Current Output
id readingValue consumption
65479 "35.8706703186035" "3.1444168090820"
65480 "39.0491638183594" "3.1784934997559"
65481 "42.1287002563477" "3.0795364379883"
65482 "2.38636064529419" "-39.74233961105351"
65483 "5.91744041442871" "3.53107976913452"
65484 "9.59204387664795" "3.67460346221924"
65485 "14.3925561904907" "4.80051231384275"
65486 "19.4217891693115" "5.0292329788208"
65487 "24.2393398284912" "4.8175506591797"
65488 "29.2515335083008" "5.0121936798096"
65489 "34.2519302368164" "5.0003967285156"
65490 "38.6513633728027" "4.3994331359863"
65491 "43.7513643778087" "5.1000010050060"
In this data, the last max value was 42.1287002563477. I have to wait until I get a value greater than 42.1287002563477 to make the calculation (the next greater value minus 42.1287002563477). Here, that is 43.7513643778087 - 42.1287002563477.
Expected Output
id readingValue consumption
65479 "35.8706703186035" "3.1444168090820"
65480 "39.0491638183594" "3.1784934997559"
65481 "42.1287002563477" "3.0795364379883"
65482 "2.38636064529419" "0"
65483 "5.91744041442871" "0"
65484 "9.59204387664795" "0"
65485 "14.3925561904907" "0"
65486 "19.4217891693115" "0"
65487 "24.2393398284912" "0"
65488 "29.2515335083008" "0"
65489 "34.2519302368164" "0"
65490 "38.6513633728027" "0"
65491 "43.7513643778087" "1.1226641214710"
Is there any chance to resolve this issue in the query?
You can use ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING to limit the frame of the window function, so you can subtract the MAX of the rows up to but excluding the current row from the MAX up to and including the current row:
SELECT readingValue,
MAX(readingValue) OVER (ORDER BY id) - MAX(readingValue) OVER (ORDER BY id ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
FROM e;
┌──────────────────┬─────────────────┐
│ readingvalue │ ?column? │
├──────────────────┼─────────────────┤
│ 35.8706703186035 │          (null) │
│ 39.0491638183594 │ 3.1784934997559 │
│ 42.1287002563477 │ 3.0795364379883 │
│ 2.38636064529419 │               0 │
│ 5.91744041442871 │               0 │
│ 9.59204387664795 │               0 │
│ 14.3925561904907 │               0 │
│ 19.4217891693115 │               0 │
│ 24.2393398284912 │               0 │
│ 29.2515335083008 │               0 │
│ 34.2519302368164 │               0 │
│ 38.6513633728027 │               0 │
│ 43.7513643778087 │  1.622664121461 │
└──────────────────┴─────────────────┘
(13 rows)
Time: 0,430 ms
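To verify the running-max logic outside the database, here is a minimal Python sketch over the sample readings (the first row has no preceding rows, hence None, matching the (null) in the SQL output):

```python
readings = [
    35.8706703186035, 39.0491638183594, 42.1287002563477, 2.38636064529419,
    5.91744041442871, 9.59204387664795, 14.3925561904907, 19.4217891693115,
    24.2393398284912, 29.2515335083008, 34.2519302368164, 38.6513633728027,
    43.7513643778087,
]

# consumption = running max including the row minus running max excluding it,
# which is 0 whenever a reading drops below the previous maximum
consumption = [None]           # first row: the excluding-current-row frame is empty
running_max = readings[0]
for r in readings[1:]:
    new_max = max(running_max, r)
    consumption.append(new_max - running_max)
    running_max = new_max
```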