Sum(distinct metric) in apache druid - druid

how do we write sum(distinct col) in druid ? if i try to write in druid, it says plans can't be build, but same is possible in Druid. I tried to convert to subquery approach, but my inner query returns lot of item level data, hence timing out.

The distinct count or sum is not something which is by default supported by druid.
There are actually several methods which give you a similar result.
Option 1. Theta Sketch extension (recommended)
If you enable the Theta Sketch extension (See https://druid.apache.org/docs/latest/development/extensions-core/datasketches-theta.html) you can use this to get the same result.
Example:
{
"queryType": "groupBy",
"dataSource": "hits",
"intervals": [
"2020-08-14T11:00:00.000Z/2020-08-14T12:00:00.000Z"
],
"dimensions": [],
"granularity": "all",
"aggregations": [
{
"type": "cardinality",
"name": "col",
"fields": [
{
"type": "default",
"dimension": "domain",
"outputType": "string",
"outputName": "domain"
}
],
"byRow": false,
"round": false
}
]
}
Result:
+--------+
| domain |
+--------+
| 22 |
+--------+
Option 2: cardinality
The cardinality() aggregation computes the cardinality of a set of Apache Druid (incubating) dimensions, using HyperLogLog to estimate the cardinality.
Example:
{
"queryType": "groupBy",
"dataSource": "hits",
"intervals": [
"2020-08-14T11:00:00.000Z/2020-08-14T12:00:00.000Z"
],
"dimensions": [],
"granularity": "all",
"aggregations": [
{
"type": "cardinality",
"name": "domain",
"fields": [
{
"type": "default",
"dimension": "domain",
"outputType": "string",
"outputName": "domain"
}
],
"byRow": false,
"round": false
}
]
}
Response:
+-----------------+
| domain |
+-----------------+
| 22.119017166376 |
+-----------------+
Option 3. use hyperUnique
This option requires that you keep track of the counts at indexation time. If you have applied this, you can use this in your query:
{
"queryType": "groupBy",
"dataSource": "hits",
"intervals": [
"2020-08-14T11:00:00.000Z/2020-08-14T12:00:00.000Z"
],
"dimensions": [],
"granularity": "all",
"aggregations": [
{
"type": "hyperUnique",
"name": "domain",
"fieldName": "domain",
"isInputHyperUnique": false,
"round": false
}
],
"context": {
"groupByStrategy": "v2"
}
}
As I have no hyperUnique metric in my data set, I have no exact example response.
This page explains this method very well: https://blog.mshimul.com/getting-unique-counts-from-druid-using-hyperloglog/
Conclusion
In my opinion the Theta Sketch extension is the best and most easy way to get the result. Please read the documentation carefully.
If you are an PHP user you could take a look at this, maybe it helps:
https://github.com/level23/druid-client#hyperunique
https://github.com/level23/druid-client#cardinality
https://github.com/level23/druid-client#distinctcount

Related

pydruid: use a query as datasource

I am trying to write a query in pydruid, which uses another query as datasource. Druid itself supports is by setting datasource like this:
"dataSource": {
"type": "query",
"query": {
"queryType": "groupBy",
"dataSource": "site_traffic",
"intervals": ["0000/3000"],
"granularity": "all",
"dimensions": ["page"],
"aggregations": [
{ "type": "count", "name": "hits" }
]
}...
}
but I dont know how to implement it in pydruid. I tried doing this:
source = self.client.groupby( ... )
top_n = self.client.topn(
datasource=source,...
)
but it give me an error, saying that datasource can only be a string or list of strings. How can I do this? Can it be done at all? if not, what are my options?

Comparing dimensions in druid

I recently started experimenting with druid. I have a use case which I'm not able to solve. I have 3 date columns primary_date, date_1 and date_2, amount and client.
I wanted to calulate sum(amount) when date_1 > date_2 when granularity is month. I wanted to calculate this for each month in 6 month interval for each client.
I also wanted to calcutate sum(amount) when date_1 > max(bucket date) for each bucket for 6 months for each client.
{
"queryType" : "groupBy",
"dataSource" : "data_source_xxx",
"granularity" : "month",
"dimensions" : ["client"],
"intervals": ["2019-01-01/2019-07-01"],
"aggregations":[{"type": "doubleSum", "name": "total_amount", "fieldName": "amount"}],
"filter" : {
"type": "select",
"dimension": "client",
"value": "client"
}
}
I wanted to modify the above query to have additional filters I have mentioned.
Any help is highly appreciated.
Thanks
I think you can realize this by using a virtual column, which does the date comparison. Then you should be able to use the virtual column in a filtered aggregation, which only applies the aggregation if the filter matches.
This is not tested, but I think something like this should work:
{
"queryType": "groupBy",
"dataSource": "data_source_xxx",
"intervals": [
"2019-01-01T00:00:00.000Z/2019-07-01T00:00:00.000Z"
],
"dimensions": [
{
"type": "default",
"dimension": "client",
"outputType": "string",
"outputName": "client"
}
],
"granularity": "month",
"aggregations": [
{
"type": "filtered",
"filter": {
"type": "selector",
"dimension": "isOlder",
"value": "1"
},
"aggregator": {
"type": "doubleSum",
"name": "sumAmount",
"fieldName": "amount"
}
}
],
"virtualColumns": [
{
"type": "expression",
"name": "isOlder",
"expression": "if( date_1 > date_2, '1', '0')",
"outputType": "string"
}
],
"context": {
"groupByStrategy": "v2"
}
}
I have created this using this PHP code using this package: https://github.com/level23/druid-client
$client = new DruidClient(['router_url' => 'http://127.0.0.1:8888']);
// Build a select query
$builder = $client->query('data_source_xxx', Granularity::MONTH)
->interval("2019-01-01/2019-07-01")
->select(['client'])
->virtualColumn("if( date_1 > date_2, '1', '0')", 'isOlder')
->sum('amount', 'sumAmount', DataType::DOUBLE, function(FilterBuilder $filterBuilder){
$filterBuilder->where('isOlder', '=', '1');
});
echo $builder->toJson();

Druid GroupBy query gives different response when changing the order by fields

I have a question regarding an Apache Druid incubating query.
I have a simple group by to select the number of calls per operator. See here my query:
{
"queryType": "groupBy",
"dataSource": "ivr-calls",
"intervals": [
"2019-12-06T00:00:00.000Z/2019-12-07T00:00:00.000Z"
],
"dimensions": [
{
"type": "lookup",
"dimension": "operator_id",
"outputName": "value",
"name": "ivr_operator",
"replaceMissingValueWith": "Unknown"
},
{
"type": "default",
"dimension": "operator_id",
"outputType": "long",
"outputName": "id"
}
],
"granularity": "all",
"aggregations": [
{
"type": "longSum",
"name": "calls",
"fieldName": "calls"
}
],
"limitSpec": {
"type": "default",
"limit": 999999,
"columns": [
{
"dimension": "value",
"direction": "ascending",
"dimensionOrder": "numeric"
}
]
}
}
In this query I order the result by the "value" dimension, I receive 218 results.
I noticed that some of the records are duplicate. (I see some operators two times in my resultset). This is strange because in my experience all dimensions which you select are also used for grouping by. So, they should be unique.
If I add an order by to the "id" dimension, I receive 183 results (which is expected):
"columns": [
{
"dimension": "value",
"direction": "ascending",
"dimensionOrder": "numeric"
},
{
"dimension": "id",
"direction": "ascending",
"dimensionOrder": "numeric"
}
]
The documentation tells me nothing about this strange behavior (https://druid.apache.org/docs/latest/querying/limitspec.html).
My previous experience with druid is that the order by is just "ordering".
I am running druid version 0.15.0-incubating-iap9.
Can anybody tell me why there is a difference in the result set based on the column sorting?
I resolved this problem for now by specifying all columns in my order by.
Issue seems to be related to a bug in druid. See: https://github.com/apache/incubator-druid/issues/9000

Ingesting multi-valued dimension from comma sep string

I have event data from Kafka with the following structure that I want to ingest in Druid
{
"event": "some_event",
"id": "1",
"parameters": {
"campaigns": "campaign1, campaign2",
"other_stuff": "important_info"
}
}
Specifically, I want to transform the dimension "campaigns" from a comma-separated string into an array / multi-valued dimension so that it can be nicely filtered and grouped by.
My ingestion so far looks as follows
{
"type": "kafka",
"dataSchema": {
"dataSource": "event-data",
"parser": {
"type": "string",
"parseSpec": {
"format": "json",
"timestampSpec": {
"column": "timestamp",
"format": "posix"
},
"flattenSpec": {
"fields": [
{
"type": "root",
"name": "parameters"
},
{
"type": "jq",
"name": "campaigns",
"expr": ".parameters.campaigns"
}
]
}
},
"dimensionSpec": {
"dimensions": [
"event",
"id",
"campaigns"
]
}
},
"metricsSpec": [
{
"type": "count",
"name": "count"
}
],
"granularitySpec": {
"type": "uniform",
...
}
},
"tuningConfig": {
"type": "kafka",
...
},
"ioConfig": {
"topic": "production-tracking",
...
}
}
Which however leads to campaigns being ingested as a string.
I could neither find a way to generate an array out of it with a jq expression in flattenSpec nor did I find something like a string split expression that may be used as a transformSpec.
Any suggestions?
Try setting useFieldDiscover: false in your ingestion spec. when this flag is set to true (which is default case) then it interprets all fields with singular values (not a map or list) and flat lists (lists of singular values) at the root level as columns.
Here is a good example and reference link to use flatten spec:
https://druid.apache.org/docs/latest/ingestion/flatten-json.html
Looks like since Druid 0.17.0, Druid expressions support typed constructors for creating arrays, so using expression string_to_array should do the trick!

Apache Druid sql query conversion to json based query

I am trying to convert the following druid sql query to a druid json query, as one of the columns i have is a multi-value dimension for which druid does not support a sql style query.
My sql query:
SELECT date_dt, source, type_labels, COUNT(DISTINCT unique_p_hll)
FROM "test"
WHERE
type_labels = 'z' AND
(a_id IN ('a', 'b', 'c') OR b_id IN ('m', 'n', 'p'))
GROUP BY date_dt, source, type_labels;
unique_p_hll is an hll column with uniques.
The druid json query i came up with is following:
{
"queryType": "groupBy",
"dataSource": "test",
"granularity": "day",
"dimensions": ["source", "type_labels"],
"limitSpec": {},
"filter": {
"type": "and",
"fields": [
{ "type": "selector", "dimension": "type_labels", "value": "z" },
{ "type": "or", "fields": [
{ "type": "in", "dimension": "a_id", "values": ["a", "b", "c"] },
{ "type": "in", "dimension": "b_id", "values": ["m", "n", "p"] }
]}
]
},
"aggregations": [
{ "type": "longSum", "name": "unique_p_hll", "fieldName": "p_id" }
],
"intervals": [ "2018-08-01/2018-08-02" ]
}
But the json query seems to be returning empty resultset.
I can see the output correctly in Pivot UI. Though the array column type_labels values show up as {"array_element": "z"} instead of simply "z".
Does the query return empty string, or does it return a formatted JSON with zero records?
If the former, I can suggest a couple of leads for debugging this issue:
Make sure that the query is properly sent to the Broker, as shown in Druid's query tutorial:
curl -X 'POST' -H 'Content-Type:application/json' -d #query-file.json http://<BROKER-IP>:<BROKER-PORT>/druid/v2?pretty
Also, check the Broker's log for errors.