Druid GroupBy query gives different response when changing the order by fields

I have a question about an Apache Druid (incubating) query.
I have a simple groupBy that selects the number of calls per operator. Here is my query:
{
"queryType": "groupBy",
"dataSource": "ivr-calls",
"intervals": [
"2019-12-06T00:00:00.000Z/2019-12-07T00:00:00.000Z"
],
"dimensions": [
{
"type": "lookup",
"dimension": "operator_id",
"outputName": "value",
"name": "ivr_operator",
"replaceMissingValueWith": "Unknown"
},
{
"type": "default",
"dimension": "operator_id",
"outputType": "long",
"outputName": "id"
}
],
"granularity": "all",
"aggregations": [
{
"type": "longSum",
"name": "calls",
"fieldName": "calls"
}
],
"limitSpec": {
"type": "default",
"limit": 999999,
"columns": [
{
"dimension": "value",
"direction": "ascending",
"dimensionOrder": "numeric"
}
]
}
}
In this query I order the result by the "value" dimension and receive 218 results.
I noticed that some of the records are duplicates (I see some operators twice in my result set). This is strange, because in my experience every dimension you select is also used for grouping, so the rows should be unique.
If I also add an order by on the "id" dimension, I receive 183 results (which is what I expect):
"columns": [
{
"dimension": "value",
"direction": "ascending",
"dimensionOrder": "numeric"
},
{
"dimension": "id",
"direction": "ascending",
"dimensionOrder": "numeric"
}
]
The documentation (https://druid.apache.org/docs/latest/querying/limitspec.html) tells me nothing about this behavior.
My previous experience with Druid is that the order by is just "ordering".
I am running Druid version 0.15.0-incubating-iap9.
Can anybody tell me why the result set differs depending on the column sorting?

I resolved this problem for now by specifying all columns in my order by.
The issue seems to be caused by a bug in Druid. See: https://github.com/apache/incubator-druid/issues/9000

Related

Sum(distinct metric) in apache druid

How do we write sum(distinct col) in Druid? If I try to write it in Druid, it says the plan can't be built, even though the same query is possible in plain SQL. I tried converting it to a subquery approach, but my inner query returns a lot of item-level data and therefore times out.
A distinct count or sum is not something that Druid supports out of the box.
There are, however, several methods that give you a similar result.
Option 1: Theta Sketch extension (recommended)
If you enable the Theta Sketch extension (see https://druid.apache.org/docs/latest/development/extensions-core/datasketches-theta.html), you can use it to get the same result.
Example:
{
"queryType": "groupBy",
"dataSource": "hits",
"intervals": [
"2020-08-14T11:00:00.000Z/2020-08-14T12:00:00.000Z"
],
"dimensions": [],
"granularity": "all",
"aggregations": [
{
"type": "thetaSketch",
"name": "domain",
"fieldName": "domain",
"isInputThetaSketch": false
}
]
}
Result:
+--------+
| domain |
+--------+
| 22 |
+--------+
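Note that the thetaSketch aggregator is only available when the DataSketches extension is loaded on your cluster. On a typical installation this means adding it to the extension load list in the common runtime properties, roughly like this:
druid.extensions.loadList=["druid-datasketches"]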
Option 2: cardinality
The cardinality() aggregation computes the cardinality of a set of Apache Druid (incubating) dimensions, using HyperLogLog to estimate the cardinality.
Example:
{
"queryType": "groupBy",
"dataSource": "hits",
"intervals": [
"2020-08-14T11:00:00.000Z/2020-08-14T12:00:00.000Z"
],
"dimensions": [],
"granularity": "all",
"aggregations": [
{
"type": "cardinality",
"name": "domain",
"fields": [
{
"type": "default",
"dimension": "domain",
"outputType": "string",
"outputName": "domain"
}
],
"byRow": false,
"round": false
}
]
}
Response:
+-----------------+
| domain |
+-----------------+
| 22.119017166376 |
+-----------------+
Option 3: hyperUnique
This option requires that you keep track of the unique values at indexing time, by defining a hyperUnique metric in your ingestion spec (a sketch of such a metric follows after the example). If you have done that, you can use it in your query:
{
"queryType": "groupBy",
"dataSource": "hits",
"intervals": [
"2020-08-14T11:00:00.000Z/2020-08-14T12:00:00.000Z"
],
"dimensions": [],
"granularity": "all",
"aggregations": [
{
"type": "hyperUnique",
"name": "domain",
"fieldName": "domain",
"isInputHyperUnique": false,
"round": false
}
],
"context": {
"groupByStrategy": "v2"
}
}
As I have no hyperUnique metric in my dataset, I cannot show an exact example response.
This page explains the method very well: https://blog.mshimul.com/getting-unique-counts-from-druid-using-hyperloglog/
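For reference, such a hyperUnique metric is defined at indexing time in the metricsSpec of your ingestion spec. A minimal sketch, assuming the source column is called domain:
"metricsSpec": [
{
"type": "hyperUnique",
"name": "domain",
"fieldName": "domain"
}
]
The hyperUnique query aggregation above then refers to this metric by its name via fieldName.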
Conclusion
In my opinion the Theta Sketch extension is the best and easiest way to get this result. Please read the documentation carefully.
If you are a PHP user, you could take a look at these; maybe they help:
https://github.com/level23/druid-client#hyperunique
https://github.com/level23/druid-client#cardinality
https://github.com/level23/druid-client#distinctcount

Comparing dimensions in druid

I recently started experimenting with Druid. I have a use case which I'm not able to solve. I have 3 date columns: primary_date, date_1 and date_2, plus amount and client.
I want to calculate sum(amount) where date_1 > date_2, with month granularity. I want to calculate this for each month in a 6-month interval, for each client.
I also want to calculate sum(amount) where date_1 > max(bucket date) for each bucket, for 6 months, for each client.
{
"queryType" : "groupBy",
"dataSource" : "data_source_xxx",
"granularity" : "month",
"dimensions" : ["client"],
"intervals": ["2019-01-01/2019-07-01"],
"aggregations":[{"type": "doubleSum", "name": "total_amount", "fieldName": "amount"}],
"filter" : {
"type": "select",
"dimension": "client",
"value": "client"
}
}
I want to modify the above query to add the additional filters I mentioned.
Any help is highly appreciated.
Thanks
I think you can achieve this by using a virtual column which does the date comparison. You should then be able to use that virtual column in a filtered aggregation, which only applies the aggregation when the filter matches.
This is not tested, but I think something like this should work:
{
"queryType": "groupBy",
"dataSource": "data_source_xxx",
"intervals": [
"2019-01-01T00:00:00.000Z/2019-07-01T00:00:00.000Z"
],
"dimensions": [
{
"type": "default",
"dimension": "client",
"outputType": "string",
"outputName": "client"
}
],
"granularity": "month",
"aggregations": [
{
"type": "filtered",
"filter": {
"type": "selector",
"dimension": "isOlder",
"value": "1"
},
"aggregator": {
"type": "doubleSum",
"name": "sumAmount",
"fieldName": "amount"
}
}
],
"virtualColumns": [
{
"type": "expression",
"name": "isOlder",
"expression": "if( date_1 > date_2, '1', '0')",
"outputType": "string"
}
],
"context": {
"groupByStrategy": "v2"
}
}
I created this query with the following PHP code, using this package: https://github.com/level23/druid-client
$client = new DruidClient(['router_url' => 'http://127.0.0.1:8888']);
// Build a select query
$builder = $client->query('data_source_xxx', Granularity::MONTH)
->interval("2019-01-01/2019-07-01")
->select(['client'])
->virtualColumn("if( date_1 > date_2, '1', '0')", 'isOlder')
->sum('amount', 'sumAmount', DataType::DOUBLE, function(FilterBuilder $filterBuilder){
$filterBuilder->where('isOlder', '=', '1');
});
echo $builder->toJson();

Ingesting multi-valued dimension from comma sep string

I have event data from Kafka with the following structure that I want to ingest in Druid
{
"event": "some_event",
"id": "1",
"parameters": {
"campaigns": "campaign1, campaign2",
"other_stuff": "important_info"
}
}
Specifically, I want to transform the dimension "campaigns" from a comma-separated string into an array / multi-valued dimension so that it can be nicely filtered and grouped by.
My ingestion spec so far looks as follows:
{
"type": "kafka",
"dataSchema": {
"dataSource": "event-data",
"parser": {
"type": "string",
"parseSpec": {
"format": "json",
"timestampSpec": {
"column": "timestamp",
"format": "posix"
},
"flattenSpec": {
"fields": [
{
"type": "root",
"name": "parameters"
},
{
"type": "jq",
"name": "campaigns",
"expr": ".parameters.campaigns"
}
]
},
"dimensionsSpec": {
"dimensions": [
"event",
"id",
"campaigns"
]
}
}
},
"metricsSpec": [
{
"type": "count",
"name": "count"
}
],
"granularitySpec": {
"type": "uniform",
...
}
},
"tuningConfig": {
"type": "kafka",
...
},
"ioConfig": {
"topic": "production-tracking",
...
}
}
This, however, leads to campaigns being ingested as a string.
I could neither find a way to generate an array out of it with a jq expression in the flattenSpec, nor did I find something like a string-split expression that could be used in a transformSpec.
Any suggestions?
Try setting useFieldDiscovery: false in the flattenSpec of your ingestion spec. When this flag is set to true (the default), Druid interprets all fields with singular values (not a map or list) and flat lists (lists of singular values) at the root level as columns.
Here is a good example and reference link to use flatten spec:
https://druid.apache.org/docs/latest/ingestion/flatten-json.html
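For illustration, a minimal sketch of how that flag sits inside the flattenSpec (the jq field is taken from your question, the rest is assumed):
"flattenSpec": {
"useFieldDiscovery": false,
"fields": [
{
"type": "jq",
"name": "campaigns",
"expr": ".parameters.campaigns"
}
]
}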
Looks like since Druid 0.17.0, Druid expressions support typed constructors for creating arrays, so using the expression function string_to_array should do the trick!
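A rough, untested sketch of what that could look like as a transform in the transformSpec (the ", " delimiter is assumed from your sample event):
"transformSpec": {
"transforms": [
{
"type": "expression",
"name": "campaigns",
"expression": "string_to_array(campaigns, ', ')"
}
]
}
The transform shadows the original campaigns field, so the dimension should then be ingested as an array / multi-value string.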

Apache Druid sql query conversion to json based query

I am trying to convert the following Druid SQL query to a Druid JSON-based query, because one of my columns is a multi-value dimension, which Druid does not support in SQL-style queries.
My SQL query:
SELECT date_dt, source, type_labels, COUNT(DISTINCT unique_p_hll)
FROM "test"
WHERE
type_labels = 'z' AND
(a_id IN ('a', 'b', 'c') OR b_id IN ('m', 'n', 'p'))
GROUP BY date_dt, source, type_labels;
unique_p_hll is an HLL column with uniques.
The Druid JSON query I came up with is the following:
{
"queryType": "groupBy",
"dataSource": "test",
"granularity": "day",
"dimensions": ["source", "type_labels"],
"limitSpec": {},
"filter": {
"type": "and",
"fields": [
{ "type": "selector", "dimension": "type_labels", "value": "z" },
{ "type": "or", "fields": [
{ "type": "in", "dimension": "a_id", "values": ["a", "b", "c"] },
{ "type": "in", "dimension": "b_id", "values": ["m", "n", "p"] }
]}
]
},
"aggregations": [
{ "type": "longSum", "name": "unique_p_hll", "fieldName": "p_id" }
],
"intervals": [ "2018-08-01/2018-08-02" ]
}
But the JSON query seems to return an empty result set.
I can see the output correctly in the Pivot UI, though the array column type_labels values show up as {"array_element": "z"} instead of simply "z".
Does the query return an empty string, or does it return formatted JSON with zero records?
If the former, I can suggest a couple of leads for debugging this issue:
Make sure that the query is properly sent to the Broker, as shown in Druid's query tutorial:
curl -X 'POST' -H 'Content-Type:application/json' -d @query-file.json http://<BROKER-IP>:<BROKER-PORT>/druid/v2?pretty
Also, check the Broker's logs for errors.
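Separately, note that a longSum aggregation will not give you the COUNT(DISTINCT unique_p_hll) from your SQL. Assuming unique_p_hll was ingested as a hyperUnique metric, the native equivalent would be roughly:
"aggregations": [
{
"type": "hyperUnique",
"name": "unique_p",
"fieldName": "unique_p_hll"
}
]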

Is there a possibility to have another timestamp as dimension in Druid?

Is it possible to have a Druid datasource with 2 (or multiple) timestamps in it?
I know that Druid is a time-based DB and I have no problem with that concept, but I'd like to add another dimension that I can work with like a timestamp.
For example, user retention: the metric is of course tied to a certain date, but I also need to create cohorts based on the users' registration date and roll those dates up to weeks or months, or filter on certain time periods...
If the functionality is not supported, are there any plug-ins? Any dirty solutions?
Although I'd rather wait for an official implementation of full timestamp-dimension support in Druid, I've found the 'dirty' hack I was looking for.
DataSource schema
First things first, I wanted to know how many users logged in on each day, while being able to aggregate by date/month/year cohorts.
Here's the data schema I used:
"dataSchema": {
"dataSource": "ds1",
"parser": {
"parseSpec": {
"format": "json",
"timestampSpec": {
"column": "timestamp",
"format": "iso"
},
"dimensionsSpec": {
"dimensions": [
"user_id",
"platform",
"register_time"
],
"dimensionExclusions": [],
"spatialDimensions": []
}
}
},
"metricsSpec": [
{ "type" : "hyperUnique", "name" : "users", "fieldName" : "user_id" }
],
"granularitySpec": {
"type": "uniform",
"segmentGranularity": "HOUR",
"queryGranularity": "DAY",
"intervals": ["2015-01-01/2017-01-01"]
}
},
So the sample data should look something like this (each record is a login event):
{"user_id": 4151948, "platform": "portal", "register_time": "2016-05-29T00:45:36.000Z", "timestamp": "2016-06-29T22:18:11.000Z"}
{"user_id": 2871923, "platform": "portal", "register_time": "2014-05-24T10:28:57.000Z", "timestamp": "2016-06-29T22:18:25.000Z"}
As you can see, my "main" timestamp, against which I calculate these metrics, is the timestamp field, whereas register_time is only a dimension in string form (ISO 8601 UTC format).
Aggregating
And now for the fun part: I've been able to aggregate by timestamp (date) and register_time (date again) thanks to the Time Format Extraction Function.
The query looks like this:
{
"intervals": "2016-01-20/2016-07-01",
"dimensions": [
{
"type": "extraction",
"dimension": "register_time",
"outputName": "reg_date",
"extractionFn": {
"type": "timeFormat",
"format": "YYYY-MM-dd",
"timeZone": "Europe/Bratislava" ,
"locale": "sk-SK"
}
}
],
"granularity": {"timeZone": "Europe/Bratislava", "period": "P1D", "type": "period"},
"aggregations": [{"fieldName": "users", "name": "users", "type": "hyperUnique"}],
"dataSource": "ds1",
"queryType": "groupBy"
}
Filtering
The solution for filtering is based on the JavaScript Extraction Function, with which I can transform the date to UNIX time and use it inside (for example) a bound filter:
{
"intervals": "2016-01-20/2016-07-01",
"dimensions": [
"platform",
{
"type": "extraction",
"dimension": "register_time",
"outputName": "reg_date",
"extractionFn": {
"type": "javascript",
"function": "function(x) {return Date.parse(x)/1000}"
}
}
],
"granularity": {"timeZone": "Europe/Bratislava", "period": "P1D", "type": "period"},
"aggregations": [{"fieldName": "users", "name": "users", "type": "hyperUnique"}],
"dataSource": "ds1",
"queryType": "groupBy"
"filter": {
"type": "bound",
"dimension": "register_time",
"outputName": "reg_date",
"alphaNumeric": "true"
"extractionFn": {
"type": "javascript",
"function": "function(x) {return Date.parse(x)/1000}"
}
}
}
I tried to filter it 'directly' with a JavaScript filter, but I haven't been able to convince Druid to return the correct records, although I've double-checked it with various JavaScript REPLs. But hey, I'm no JavaScript expert.
Unfortunately Druid has only one timestamp column that can be used for rollup; currently Druid treats all other columns as strings (except metrics, of course), so you can add another string column with timestamp values, but the only thing you can do with it is filtering.
I guess you might be able to hack it that way.
Hopefully in the future Druid will allow different column types, and maybe timestamp will be one of them.
Another solution is to add a longMin (or similar) metric for the timestamp and store the epoch time in that field, or to convert the datetime to a number and store that (e.g. 31st March 2021 08:00 as 310320210800).
As of Druid 0.22, the documentation states that secondary timestamps should be handled/parsed as dimensions of type long. Secondary timestamps can be parsed to longs at ingestion time with a transformSpec and transformed back if needed at query time.
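A rough sketch of that approach, reusing the register_time field from the example above (the exact spec layout depends on your ingestion method and is assumed here):
"transformSpec": {
"transforms": [
{
"type": "expression",
"name": "register_time",
"expression": "timestamp_parse(register_time)"
}
]
},
"dimensionsSpec": {
"dimensions": [
{
"type": "long",
"name": "register_time"
}
]
}
At query time the long value can be formatted back into a readable date with, for example, the timestamp_format expression function.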