Druid TopNMetricSpec - Performance Clarification

Problem statement:
I need to find all the distinct userIds in a huge dataset (~100 million rows), so I am experimenting with TopNMetricSpec to page through userIds based on a threshold.
Can anyone help me understand how TopNMetricSpec works?
If I run the following TopN query with a TopNMetricSpec repeatedly, n times, using the same HTTP client, will it scan all the records on every request once previousStop is set?
Consider the following data:
┌──────────────────────────┬─────────┬────────┬────────┐
│ __time                   │ movieId │ rating │ userId │
├──────────────────────────┼─────────┼────────┼────────┤
│ 2015-02-05T00:10:09.000Z │    2011 │    3.5 │    215 │
│ 2015-02-05T00:10:26.000Z │   38061 │    3.5 │    215 │
│ 2015-02-05T00:10:32.000Z │    8981 │    2.0 │    215 │
│ 2015-02-05T00:11:00.000Z │   89864 │    4.0 │    215 │
│ 2015-02-23T23:55:08.000Z │   56587 │    1.5 │     31 │
│ 2015-02-23T23:55:33.000Z │   51077 │    4.0 │     31 │
│ 2015-02-23T23:55:35.000Z │   49274 │    4.0 │     31 │
│ 2015-02-23T23:55:37.000Z │   30816 │    2.0 │     31 │
│ 2015-03-19T14:24:01.000Z │    5066 │    5.0 │    176 │
│ 2015-03-19T14:26:23.000Z │    6776 │    5.0 │    176 │
│ 2015-03-29T16:19:58.000Z │    2337 │    2.0 │     96 │
└──────────────────────────┴─────────┴────────┴────────┘
For example, in the following query:
Initially, I set previousStop to null and the threshold to 2, so it fetches the first two values (because threshold = 2), namely 215 and 176.
Next, I pass previousStop = 176. The question is: will the broker scan all the records again, or will it continue from where it stopped after step 1, i.e. after 176?
{
  "queryType": "topN",
  "dataSource": "ratings30K",
  "intervals": "2015-02-05T00:00:00.000Z/2015-03-30T00:00:00.000Z",
  "granularity": "all",
  "dimension": "userId",
  "threshold": 2,
  "metric": {
    "type": "inverted",
    "metric": {
      "type": "dimension",
      "ordering": "Numeric",
      "previousStop": null
    }
  }
}

It doesn't quite work the way you describe. You have two options for what to give as the metric in this query. You can either find the users that have given the highest ratings:
"dimension":"userId",
"metric":"rating"
Or you can sort by user id in ascending order, in which case you can provide a previousStop:
"dimension":"userId",
"metric": {
"type": "dimension",
"ordering": "Numeric",
"previousStop": null
}
(For either of those two options, you can invert the sort order by wrapping the metric in an inverted metric spec.)
But when you sort by rating, you can't give a previousStop value. So if you want pagination that is guaranteed to return all rows, you can't sort by rating.
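To make that pagination loop concrete, here is a rough Python sketch (my own illustration, not from the answer): it repeatedly posts the topN query and feeds the last returned userId back in as previousStop. The broker URL, page size, and the count aggregation are assumptions to adjust for your cluster.

# Rough sketch (not from the answer above): page through distinct userIds with a
# "dimension" topN metric and previousStop. The broker URL, page size, and the
# count aggregation are assumptions; adjust them to your cluster and datasource.
import requests

BROKER_URL = "http://localhost:8082/druid/v2"   # assumed Broker endpoint
PAGE_SIZE = 1000

def fetch_page(previous_stop):
    query = {
        "queryType": "topN",
        "dataSource": "ratings30K",
        "intervals": "2015-02-05T00:00:00.000Z/2015-03-30T00:00:00.000Z",
        "granularity": "all",
        "dimension": "userId",
        "threshold": PAGE_SIZE,
        "metric": {
            "type": "dimension",            # sort by the dimension itself (ascending)
            "ordering": "numeric",
            "previousStop": previous_stop   # resume after this value; null on the first page
        },
        "aggregations": [{"type": "count", "name": "rows"}]
    }
    resp = requests.post(BROKER_URL, json=query)
    resp.raise_for_status()
    result = resp.json()
    return [row["userId"] for row in result[0]["result"]] if result else []

previous_stop = None
all_user_ids = []
while True:
    page = fetch_page(previous_stop)
    if not page:
        break
    all_user_ids.extend(page)
    previous_stop = page[-1]   # last value on this page becomes the next previousStop
print(len(all_user_ids), "distinct userIds")

As far as I can tell, each request here is still an independent topN query: previousStop only controls which dimension values are returned, so every page runs over the full interval on the Druid side rather than resuming an earlier scan.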

Related

PostgreSql : Merge two rows and add the difference to new column

We have an app which displays a table like this:
This is what it looks like in the database:
┌──────────┬──────────────┬─────────────┬────────────┬──────────┬──────────────────┐
│ BatchId │ ProductCode │ StageValue │ StageUnit │ StageId │ StageLineNumber │
├──────────┼──────────────┼─────────────┼────────────┼──────────┼──────────────────┤
│ 0B001 │ 150701 │ LEDI2B4015 │ │ 37222 │ 1 │
│ 0B001 │ 150701 │ 16.21 │ KG │ 37222 │ 1 │
│ 0B001 │ 150701 │ 73.5 │ │ 37222 │ 2 │
│ 0B001 │ 150701 │ LEDI2B6002 │ KG │ 37222 │ 2 │
└──────────┴──────────────┴─────────────┴────────────┴──────────┴──────────────────┘
I would like to query the database so that the output looks like this :
┌──────────┬──────────────┬────────────────────┬─────────────┬────────────┬──────────┬──────────────────┐
│ BatchId │ ProductCode │ LoadedProductCode │ StageValue │ StageUnit │ StageId │ StageLineNumber │
├──────────┼──────────────┼────────────────────┼─────────────┼────────────┼──────────┼──────────────────┤
│ 0B001 │ 150701 │ LEDI2B4015 │ 16.21 │ KG │ 37222 │ 1 │
│ 0B001 │ 150701 │ LEDI2B6002 │ 73.5 │ KG │ 37222 │ 2 │
└──────────┴──────────────┴────────────────────┴─────────────┴────────────┴──────────┴──────────────────┘
Is that even possible?
My PostgreSQL server version is 14.X.
I have looked at many threads about "merge two columns and add a new one", but none of them seem to be what I want.
DB Fiddle link
SQL Fiddle (in case) link
It's possible to get your output, but it's going to be prone to errors. You should seriously rethink your data model, if at all possible. Storing floats as text and trying to parse them is going to lead to many problems.
That said, here's a query that will work, at least for your sample data:
SELECT batchid,
       productcode,
       max(stagevalue) FILTER (WHERE stagevalue ~ '^[a-zA-Z].*') as loadedproductcode,
       max(stagevalue::float) FILTER (WHERE stagevalue !~ '^[a-zA-Z].*') as stagevalue,
       max(stageunit) as stageunit,
       stageid,
       stagelinenumber
FROM datas
GROUP BY batchid, productcode, stageid, stagelinenumber;
Note that max is just used because you need an aggregate function to combine with the filter. You could replace it with min and get the same result, at least for these data.
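If it helps to see the split-and-pivot logic outside SQL, here is a small Python sketch (illustration only, not part of the query above) that applies the same idea to the sample rows: values starting with a letter become the loaded product code, the rest are parsed as numbers, and rows are collapsed per (BatchId, ProductCode, StageId, StageLineNumber) group.

# Illustration only: mirrors the SQL's regex split and group-merge on the sample data.
import re
from collections import defaultdict

rows = [
    # (batchid, productcode, stagevalue, stageunit, stageid, stagelinenumber)
    ("0B001", "150701", "LEDI2B4015", "",   37222, 1),
    ("0B001", "150701", "16.21",      "KG", 37222, 1),
    ("0B001", "150701", "73.5",       "",   37222, 2),
    ("0B001", "150701", "LEDI2B6002", "KG", 37222, 2),
]

merged = defaultdict(lambda: {"loadedproductcode": None, "stagevalue": None, "stageunit": ""})
for batchid, productcode, stagevalue, stageunit, stageid, stagelinenumber in rows:
    key = (batchid, productcode, stageid, stagelinenumber)
    entry = merged[key]
    if re.match(r"[a-zA-Z]", stagevalue):       # text value -> loaded product code
        entry["loadedproductcode"] = stagevalue
    else:                                       # numeric value -> stage value
        entry["stagevalue"] = float(stagevalue)
    entry["stageunit"] = entry["stageunit"] or stageunit  # keep the non-empty unit

for key, entry in sorted(merged.items()):
    print(key, entry)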

Counting the same positional bits in postgresql bitmasks

I am trying to count the set bits at each position across multiple bitmasks in PostgreSQL. Here is an example of the problem:
Suppose I have three bitmasks (in binary) like:
011011011100110
100011010100101
110110101010101
Now what I want is the total count of set bits in each column, treating the masks above as three rows with one column per bit position.
For example, the first column has a count of 2, the second has a count of 2, the third has a count of 1, and so on...
In reality, each bitmask in my database has 30 bits. I want to do this in PostgreSQL. I am happy to explain the problem further if needed.
You could do it by using the get_bit function and a couple of joins:
SELECT sum(bit) FILTER (WHERE i = 0) AS count_0,
       sum(bit) FILTER (WHERE i = 1) AS count_1,
       ...
       sum(bit) FILTER (WHERE i = 29) AS count_29
FROM bits
CROSS JOIN generate_series(0, 29) AS i
CROSS JOIN LATERAL get_bit(b, i) AS bit;
The column with the bit string is b in my example.
You could use the bitwise AND operator (&) and bigint arithmetic, as long as your bit strings contain 63 bits or fewer:
# create table bmasks (mask bit(15));
CREATE TABLE
# insert into bmasks values ('011011011100110'), ('100011010100101'), ('110110101010101');
INSERT 0 3
# with masks as (
      select (2 ^ x)::bigint::bit(15) as mask, x as posn
      from generate_series(0, 14) as gs(x)
  )
  select m.posn, m.mask, sum((b.mask & m.mask > 0::bit(15))::int) as set_bits
  from masks m
  cross join bmasks b
  group by m.posn, m.mask;
┌──────┬─────────────────┬──────────┐
│ posn │ mask │ set_bits │
├──────┼─────────────────┼──────────┤
│ 0 │ 000000000000001 │ 2 │
│ 1 │ 000000000000010 │ 1 │
│ 2 │ 000000000000100 │ 3 │
│ 3 │ 000000000001000 │ 0 │
│ 4 │ 000000000010000 │ 1 │
│ 5 │ 000000000100000 │ 2 │
│ 6 │ 000000001000000 │ 2 │
│ 7 │ 000000010000000 │ 2 │
│ 8 │ 000000100000000 │ 1 │
│ 9 │ 000001000000000 │ 2 │
│ 10 │ 000010000000000 │ 3 │
│ 11 │ 000100000000000 │ 1 │
│ 12 │ 001000000000000 │ 1 │
│ 13 │ 010000000000000 │ 2 │
│ 14 │ 100000000000000 │ 2 │
└──────┴─────────────────┴──────────┘
(15 rows)
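As a quick sanity check of the per-position counts outside the database, here is a tiny Python sketch (illustration only); position 0 is the rightmost bit, matching the posn column in the output above.

# Illustration only: per-position set-bit counts for the three sample masks.
# Position 0 is the rightmost bit, matching the posn column of the query above.
masks = [
    "011011011100110",
    "100011010100101",
    "110110101010101",
]

counts = [0] * len(masks[0])
for mask in masks:
    for posn, ch in enumerate(reversed(mask)):  # reversed so index 0 = rightmost bit
        counts[posn] += int(ch)

for posn, count in enumerate(counts):
    print(posn, count)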

Create lag / lead time series with by groups in Julia?

I am wondering if there is an easy way to create a lag (or lead) of a time-series variable in Julia according to a grouping variable or condition. For example, I have a dataset of the following form:
julia> df1 = DataFrame(var1=["a","a","a","a","b","b","b","b"],
var2=[0,1,2,3,0,1,2,3])
8×2 DataFrame
│ Row │ var1 │ var2 │
│ │ String │ Int64 │
├─────┼────────┼───────┤
│ 1 │ a │ 0 │
│ 2 │ a │ 1 │
│ 3 │ a │ 2 │
│ 4 │ a │ 3 │
│ 5 │ b │ 0 │
│ 6 │ b │ 1 │
│ 7 │ b │ 2 │
│ 8 │ b │ 3 │
And I want to create a variable lag2 that contains the values in var2 lagged by 2. However, this should be done grouped by var1 so that the first two observations in the 'b' group do not get the last two values of the 'a' group. Rather they should be set to missing or zero or some default value.
I have tried the following code, which produces the error below.
julia> df2 = df1 |> @groupby(_.var1) |> @mutate(lag2 = lag(_.var2,2)) |> DataFrame
ERROR: MethodError: no method matching merge(::Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}}, ::NamedTuple{(:lag2,),Tuple{ShiftedArray{Int64,Missing,1,QueryOperators.GroupColumnArrayView{Int64,Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}},:var2}}}})
Closest candidates are:
merge(::NamedTuple{,T} where T<:Tuple, ::NamedTuple) at namedtuple.jl:245
merge(::NamedTuple{an,T} where T<:Tuple, ::NamedTuple{bn,T} where T<:Tuple) where {an, bn} at namedtuple.jl:233
merge(::NamedTuple, ::NamedTuple, ::NamedTuple...) at namedtuple.jl:249
...
Stacktrace:
[1] (::var"#437#442")(::Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}}) at /Users/kayvon/.julia/packages/Query/AwBtd/src/query_translation.jl:58
[2] iterate at /Users/kayvon/.julia/packages/QueryOperators/g4G21/src/enumerable/enumerable_map.jl:25 [inlined]
[3] iterate at /Users/kayvon/.julia/packages/Tables/TjjiP/src/tofromdatavalues.jl:45 [inlined]
[4] buildcolumns at /Users/kayvon/.julia/packages/Tables/TjjiP/src/fallbacks.jl:185 [inlined]
[5] columns at /Users/kayvon/.julia/packages/Tables/TjjiP/src/fallbacks.jl:237 [inlined]
[6] #DataFrame#453(::Bool, ::Type{DataFrame}, ::QueryOperators.EnumerableMap{Union{},QueryOperators.EnumerableIterable{Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}},QueryOperators.EnumerableGroupBy{Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}},String,NamedTuple{(:var1, :var2),Tuple{String,Int64}},QueryOperators.EnumerableIterable{NamedTuple{(:var1, :var2),Tuple{String,Int64}},Tables.DataValueRowIterator{NamedTuple{(:var1, :var2),Tuple{String,Int64}},Tables.Schema{(:var1, :var2),Tuple{String,Int64}},Tables.RowIterator{NamedTuple{(:var1, :var2),Tuple{Array{String,1},Array{Int64,1}}}}}},var"#434#439",var"#435#440"}},var"#437#442"}) at /Users/kayvon/.julia/packages/DataFrames/S3ZFo/src/other/tables.jl:40
[7] DataFrame(::QueryOperators.EnumerableMap{Union{},QueryOperators.EnumerableIterable{Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}},QueryOperators.EnumerableGroupBy{Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}},String,NamedTuple{(:var1, :var2),Tuple{String,Int64}},QueryOperators.EnumerableIterable{NamedTuple{(:var1, :var2),Tuple{String,Int64}},Tables.DataValueRowIterator{NamedTuple{(:var1, :var2),Tuple{String,Int64}},Tables.Schema{(:var1, :var2),Tuple{String,Int64}},Tables.RowIterator{NamedTuple{(:var1, :var2),Tuple{Array{String,1},Array{Int64,1}}}}}},var"#434#439",var"#435#440"}},var"#437#442"}) at /Users/kayvon/.julia/packages/DataFrames/S3ZFo/src/other/tables.jl:31
[8] |>(::QueryOperators.EnumerableMap{Union{},QueryOperators.EnumerableIterable{Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}},QueryOperators.EnumerableGroupBy{Grouping{String,NamedTuple{(:var1, :var2),Tuple{String,Int64}}},String,NamedTuple{(:var1, :var2),Tuple{String,Int64}},QueryOperators.EnumerableIterable{NamedTuple{(:var1, :var2),Tuple{String,Int64}},Tables.DataValueRowIterator{NamedTuple{(:var1, :var2),Tuple{String,Int64}},Tables.Schema{(:var1, :var2),Tuple{String,Int64}},Tables.RowIterator{NamedTuple{(:var1, :var2),Tuple{Array{String,1},Array{Int64,1}}}}}},var"#434#439",var"#435#440"}},var"#437#442"}, ::Type) at ./operators.jl:854
[9] top-level scope at none:0
Appreciate any help with this approach or alternate approaches. Thanks.
EDIT
Putting this edit at the top, as it works in DataFrames 1.0 and so reflects the stable API.
Under DataFrames.jl 0.22.2 the correct syntax is:
julia> combine(groupby(df1, :var1), :var2 => Base.Fix2(lag, 2) => :var2_l2)
8×2 DataFrame
Row │ var1 var2_l2
│ String Int64?
─────┼─────────────────
1 │ a missing
2 │ a missing
3 │ a 0
4 │ a 1
5 │ b missing
6 │ b missing
7 │ b 0
8 │ b 1
As an alternative to the maybe slightly arcane Base.Fix2 syntax, you could use an anonymous function, (x -> lag(x, 2)) (note that the enclosing parentheses are required due to operator precedence).
Original answer:
You definitely had the right idea - I don't work with Query.jl but this can easily be done with basic DataFrames syntax:
julia> using DataFrames
julia> import ShiftedArrays: lag
julia> df1 = DataFrame(var1=["a","a","a","a","b","b","b","b"],
var2=[0,1,2,3,0,1,2,3]);
julia> by(df1, :var1, var2_l2 = :var2 => Base.Fix2(lag, 2))
8×2 DataFrame
│ Row │ var1 │ var2_l2 │
│ │ String │ Int64⍰ │
├─────┼────────┼─────────┤
│ 1 │ a │ missing │
│ 2 │ a │ missing │
│ 3 │ a │ 0 │
│ 4 │ a │ 1 │
│ 5 │ b │ missing │
│ 6 │ b │ missing │
│ 7 │ b │ 0 │
│ 8 │ b │ 1 │
Note that I used Base.Fix2 here to get a single argument version of lag. This is essentially the same as defining your own l2(x) = lag(x, 2) and then using l2 in the by call. If you do define your own l2 function you can also set the default value like l2(x) = lag(x, 2, default = -1000) if you want to avoid missing values:
julia> l2(x) = lag(x, 2, default = -1000)
l2 (generic function with 1 method)
julia> by(df1, :var1, var2_l2 = :var2 => l2)
8×2 DataFrame
│ Row │ var1 │ var2_l2 │
│ │ String │ Int64 │
├─────┼────────┼─────────┤
│ 1 │ a │ -1000 │
│ 2 │ a │ -1000 │
│ 3 │ a │ 0 │
│ 4 │ a │ 1 │
│ 5 │ b │ -1000 │
│ 6 │ b │ -1000 │
│ 7 │ b │ 0 │
│ 8 │ b │ 1 │

Apache Druid: Issue while updating the data in a datasource

I am currently using the druid-incubating-0.16.0 version. As described in the tutorial at https://druid.apache.org/docs/latest/tutorials/tutorial-update-data.html, we can use the combining firehose to update and merge the data for a datasource.
Step 1:
I am using the same sample data, with this initial structure:
┌──────────────────────────┬──────────┬───────┬────────┐
│ __time │ animal │ count │ number │
├──────────────────────────┼──────────┼───────┼────────┤
│ 2018-01-01T01:01:00.000Z │ tiger │ 1 │ 100 │
│ 2018-01-01T03:01:00.000Z │ aardvark │ 1 │ 42 │
│ 2018-01-01T03:01:00.000Z │ giraffe │ 1 │ 14124 │
└──────────────────────────┴──────────┴───────┴────────┘
Step 2:
I updated the data for tiger with {"timestamp":"2018-01-01T01:01:35Z","animal":"tiger", "number":30}, using appendToExisting = false and rollup = true, and got this result:
┌──────────────────────────┬──────────┬───────┬────────┐
│ __time │ animal │ count │ number │
├──────────────────────────┼──────────┼───────┼────────┤
│ 2018-01-01T01:01:00.000Z │ tiger │ 2 │ 130 │
│ 2018-01-01T03:01:00.000Z │ aardvark │ 1 │ 42 │
│ 2018-01-01T03:01:00.000Z │ giraffe │ 1 │ 14124 │
└──────────────────────────┴──────────┴───────┴────────┘
Step 3:
Now I am updating giraffe with {"timestamp":"2018-01-01T03:01:35Z","animal":"giraffe", "number":30}, again with appendToExisting = false and rollup = true, and I get the following result:
┌──────────────────────────┬──────────┬───────┬────────┐
│ __time │ animal │ count │ number │
├──────────────────────────┼──────────┼───────┼────────┤
│ 2018-01-01T01:01:00.000Z │ tiger │ 1 │ 130 │
│ 2018-01-01T03:01:00.000Z │ aardvark │ 1 │ 42 │
│ 2018-01-01T03:01:00.000Z │ giraffe │ 2 │ 14154 │
└──────────────────────────┴──────────┴───────┴────────┘
My doubt is: in step 3 the count for tiger drops back to 1, but I think it should not change, since step 3 makes no changes to tiger and its number does not change either.
FYI, count and number are metrics in the metricsSpec, of type count and longSum respectively.
Please clarify.
When using the ingestSegment firehose with initial data like
┌──────────────────────────┬──────────┬───────┬────────┐
│ __time │ animal │ count │ number │
├──────────────────────────┼──────────┼───────┼────────┤
│ 2018-01-01T00:00:00.000Z │ aardvark │ 1 │ 9999 │
│ 2018-01-01T00:00:00.000Z │ bear │ 1 │ 111 │
│ 2018-01-01T00:00:00.000Z │ lion │ 2 │ 200 │
└──────────────────────────┴──────────┴───────┴────────┘
and adding new data {"timestamp":"2018-01-01T03:01:35Z","animal":"giraffe", "number":30} with appendToExisting = true, I am getting:
┌──────────────────────────┬──────────┬───────┬────────┐
│ __time │ animal │ count │ number │
├──────────────────────────┼──────────┼───────┼────────┤
│ 2018-01-01T00:00:00.000Z │ aardvark │ 1 │ 9999 │
│ 2018-01-01T00:00:00.000Z │ bear │ 1 │ 111 │
│ 2018-01-01T00:00:00.000Z │ lion │ 2 │ 200 │
│ 2018-01-01T00:00:00.000Z │ aardvark │ 1 │ 9999 │
│ 2018-01-01T00:00:00.000Z │ bear │ 1 │ 111 │
│ 2018-01-01T00:00:00.000Z │ giraffe │ 1 │ 30 │
│ 2018-01-01T00:00:00.000Z │ lion │ 1 │ 200 │
└──────────────────────────┴──────────┴───────┴────────┘
Is this the correct and expected output? Why didn't the rollup happen?
Druid actually has only two modes: overwrite or append.
With appendToExisting = true, your data is appended to the existing data, which causes the "number" field to increase (and the count as well).
With appendToExisting = false, all the data in the segment is overwritten. I think that is what is happening here.
This is different from "normal" databases, where you can update specific rows.
In Druid you can update specific rows only by re-indexing your data, which is not a very easy process.
This re-indexing is done with an ingestSegment firehose, which reads your data from a segment and writes it back to a segment (possibly the same one). During this process you can add a transform, which performs a specific action, such as updating certain field values.
We have built a PHP library to make these processes easier to work with. See this example of how to re-index a segment and apply a transformation during the re-indexing:
https://github.com/level23/druid-client#reindex
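For a rough idea of what the re-indexing route looks like without the PHP library, here is a hedged Python sketch that submits a native-batch re-indexing task reading the datasource back through an ingestSegment firehose and applying an example transform. The Overlord URL, the example transform, and the exact spec fields are assumptions based on my reading of the Druid 0.16 tutorials, so check them against the documentation for your version before using this.

# Hedged sketch: submit a native-batch re-indexing task that reads the existing
# datasource via an ingestSegment firehose and rewrites a field with a transform.
# The Overlord URL, spec field names, and the transform are assumptions drawn from
# the Druid 0.16 tutorials; verify them against the docs for your version.
import requests

OVERLORD_URL = "http://localhost:8081/druid/indexer/v1/task"  # assumed Overlord (or Router) address

task = {
    "type": "index",
    "spec": {
        "dataSchema": {
            "dataSource": "updates-tutorial",          # assumed datasource name
            "parser": {
                "type": "string",
                "parseSpec": {
                    "format": "json",
                    "timestampSpec": {"column": "__time", "format": "auto"},
                    "dimensionsSpec": {"dimensions": ["animal"]}
                }
            },
            "metricsSpec": [
                # sum the existing count column instead of counting rows again,
                # otherwise re-ingesting rolled-up data would reset the counts
                {"type": "longSum", "name": "count", "fieldName": "count"},
                {"type": "longSum", "name": "number", "fieldName": "number"}
            ],
            "granularitySpec": {
                "type": "uniform",
                "segmentGranularity": "day",
                "queryGranularity": "minute",
                "rollup": True,
                "intervals": ["2018-01-01/2018-01-03"]
            },
            "transformSpec": {
                # example transform: overwrite "number" for every row it reads back
                "transforms": [
                    {"type": "expression", "name": "number", "expression": "number * 2"}
                ]
            }
        },
        "ioConfig": {
            "type": "index",
            "firehose": {
                "type": "ingestSegment",
                "dataSource": "updates-tutorial",
                "interval": "2018-01-01/2018-01-03"
            }
        },
        "tuningConfig": {"type": "index"}
    }
}

resp = requests.post(OVERLORD_URL, json=task)
print(resp.status_code, resp.text)  # the Overlord returns a task id if the spec is accepted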

ClickHouse: Efficient way to aggregate data by different time ranges at once

I need to aggregate time-series data (with average functions) over different time slots, such as:
Today
last X days
Last weekend
This week
Last X weeks
This month
etc...
Q1: Can it be done within a GROUP BY statement, or at least with a single query?
Q2: Do I need any materialized view for that?
The table is partitioned by Month and sharded by UserID
All queries are within UserID (single shard)
You can use GROUP BY with ROLLUP:
create table xrollup(metric Int64, b date, v Int64 ) engine=MergeTree partition by tuple() order by tuple();
insert into xrollup values (1,'2018-01-01', 1), (1,'2018-01-02', 1), (1,'2018-02-01', 1), (1,'2017-03-01', 1);
insert into xrollup values (2,'2018-01-01', 1), (2,'2018-02-02', 1);
SELECT metric, toYear(b) y, toYYYYMM(b) m, SUM(v) AS val
FROM xrollup
GROUP BY metric, y, m with ROLLUP
ORDER BY metric, y, m
┌─metric─┬────y─┬──────m─┬─val─┐
│ 0 │ 0 │ 0 │ 6 │ overall
│ 1 │ 0 │ 0 │ 4 │ overall by metric1
│ 1 │ 2017 │ 0 │ 1 │ overall by metric1 for 2017
│ 1 │ 2017 │ 201703 │ 1 │ overall by metric1 for march 2017
│ 1 │ 2018 │ 0 │ 3 │
│ 1 │ 2018 │ 201801 │ 2 │
│ 1 │ 2018 │ 201802 │ 1 │
│ 2 │ 0 │ 0 │ 2 │
│ 2 │ 2018 │ 0 │ 2 │
│ 2 │ 2018 │ 201801 │ 1 │
│ 2 │ 2018 │ 201802 │ 1 │
└────────┴──────┴────────┴─────┘
Although there's an accepted answer, I had to do something similar and found an alternative route using aggregate function combinators, specifically -If, to select specific date ranges.
I needed to group by a content ID but retrieve unique views for the whole time range, as well as for specific buckets, to generate a histogram (ClickHouse's histogram() function wasn't suitable because there's no option for sub-aggregation).
You could do something along these lines:
SELECT
    group_field,
    avgIf(metric, date BETWEEN toDate('2022-09-03') AND toDate('2022-09-10')) AS week_avg,
    avgIf(metric, date BETWEEN toDate('2022-08-10') AND toDate('2022-09-10')) AS month_avg
FROM data
GROUP BY group_field
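To extend that pattern to the whole list of time slots from the question (today, last X days, this month, and so on) without hand-writing every column, here is a small Python sketch (illustration only; the data, group_field, metric, and date names are placeholders) that builds one avgIf column per named date range and prints the resulting query.

# Sketch (illustration only): build one ClickHouse query with an avgIf(...) column per
# named time range, so all the buckets come back in a single GROUP BY pass.
# The table and column names (data, group_field, metric, date) are placeholders.
from datetime import date, timedelta

today = date.today()
ranges = {
    "today_avg":      (today, today),
    "last_7d_avg":    (today - timedelta(days=7), today),
    "this_month_avg": (today.replace(day=1), today),
    "last_4w_avg":    (today - timedelta(weeks=4), today),
}

columns = ",\n    ".join(
    f"avgIf(metric, date BETWEEN toDate('{start}') AND toDate('{end}')) AS {name}"
    for name, (start, end) in ranges.items()
)

query = f"""SELECT
    group_field,
    {columns}
FROM data
GROUP BY group_field"""

print(query)

ClickHouse can also compute these range boundaries server-side with functions such as today() and toStartOfMonth(), if you would rather not build the dates in the client.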