Comparing dimensions in Druid

I recently started experimenting with Druid and have a use case I'm not able to solve. I have three date columns (primary_date, date_1 and date_2), plus amount and client.
I want to calculate sum(amount) where date_1 > date_2 at month granularity, for each month in a 6-month interval, for each client.
I also want to calculate sum(amount) where date_1 > max(bucket date) for each monthly bucket over the 6 months, for each client.
{
  "queryType": "groupBy",
  "dataSource": "data_source_xxx",
  "granularity": "month",
  "dimensions": ["client"],
  "intervals": ["2019-01-01/2019-07-01"],
  "aggregations": [
    {"type": "doubleSum", "name": "total_amount", "fieldName": "amount"}
  ],
  "filter": {
    "type": "selector",
    "dimension": "client",
    "value": "client"
  }
}
I want to modify the above query to add the additional filters I mentioned.
Any help is highly appreciated.
Thanks

I think you can achieve this by using a virtual column that does the date comparison. Then you should be able to use the virtual column in a filtered aggregation, which only applies the aggregation to rows where the filter matches.
This is not tested, but I think something like this should work:
{
  "queryType": "groupBy",
  "dataSource": "data_source_xxx",
  "intervals": [
    "2019-01-01T00:00:00.000Z/2019-07-01T00:00:00.000Z"
  ],
  "dimensions": [
    {
      "type": "default",
      "dimension": "client",
      "outputType": "string",
      "outputName": "client"
    }
  ],
  "granularity": "month",
  "aggregations": [
    {
      "type": "filtered",
      "filter": {
        "type": "selector",
        "dimension": "isOlder",
        "value": "1"
      },
      "aggregator": {
        "type": "doubleSum",
        "name": "sumAmount",
        "fieldName": "amount"
      }
    }
  ],
  "virtualColumns": [
    {
      "type": "expression",
      "name": "isOlder",
      "expression": "if( date_1 > date_2, '1', '0')",
      "outputType": "string"
    }
  ],
  "context": {
    "groupByStrategy": "v2"
  }
}
I built this query with the following PHP code, using this package: https://github.com/level23/druid-client
$client = new DruidClient(['router_url' => 'http://127.0.0.1:8888']);

// Build the groupBy query
$builder = $client->query('data_source_xxx', Granularity::MONTH)
    ->interval("2019-01-01/2019-07-01")
    ->select(['client'])
    ->virtualColumn("if( date_1 > date_2, '1', '0')", 'isOlder')
    ->sum('amount', 'sumAmount', DataType::DOUBLE, function (FilterBuilder $filterBuilder) {
        $filterBuilder->where('isOlder', '=', '1');
    });

echo $builder->toJson();
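The second part of your question (sum(amount) where date_1 is later than the maximum date of each monthly bucket) is not covered above. As a rough, untested sketch, and assuming date_1 is stored as a parseable date string: you could add another virtual column that compares date_1 against the end of the month bucket derived from __time (timestamp_ceil(__time, 'P1M') gives the first instant of the next month, i.e. the bucket's upper boundary), and feed it into a second filtered aggregator. The names isAfterBucketEnd and sumAmountAfterBucket are just examples:
{
  "queryType": "groupBy",
  "dataSource": "data_source_xxx",
  "intervals": [
    "2019-01-01T00:00:00.000Z/2019-07-01T00:00:00.000Z"
  ],
  "dimensions": [
    {
      "type": "default",
      "dimension": "client",
      "outputType": "string",
      "outputName": "client"
    }
  ],
  "granularity": "month",
  "virtualColumns": [
    {
      "type": "expression",
      "name": "isAfterBucketEnd",
      "expression": "if( timestamp_parse(date_1) > timestamp_ceil(__time, 'P1M'), '1', '0')",
      "outputType": "string"
    }
  ],
  "aggregations": [
    {
      "type": "filtered",
      "filter": {
        "type": "selector",
        "dimension": "isAfterBucketEnd",
        "value": "1"
      },
      "aggregator": {
        "type": "doubleSum",
        "name": "sumAmountAfterBucket",
        "fieldName": "amount"
      }
    }
  ]
}
Both virtual columns and both filtered aggregators can of course be combined in a single query.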

Related

How to check 'IN' clause in Druid Transformation Spec Expression

I'm looking to check an 'IN' clause in a Druid transformation expression.
I want to derive a field based on the following condition:
when severity_analysis_period IN ('Critical') then '>4hr Deviation',
when severity_analysis_period IN ('High', 'Medium', 'Low') then '<4hr Deviation',
else 'No Deviation'.
I wrote the transform spec like this, but it doesn't work:
"transformSpec": {
"transforms": [
{
"type": "expression",
"name": "deviation",
"expression": "if(\"severity_analysis_period\"='Critical'),'>4hr Deviation',if(\"severity_analysis_period\" in ('High','Medium','Low'),'<4hr Deviation','No Deviation'))"
},
{
"type": "expression",
"name": "deviation_day",
"expression": "if((\"severity_day\"='Critical'),'>4hr Deviation',if(\"severity_day in\" ('High','Medium','Low'),'<4hr Deviation','No Deviation'))"
},
{
"type": "expression",
"name": "deviation_hour",
"expression": "if((\"severity_day_hour\"='Critical'),'>4hr Deviation',if(\"severity_day_hour\" in ('High','Medium','Low'),'<4hr Deviation','No Deviation'))"
}
],
}
You could try to use a case statement:
{
  "type": "expression",
  "name": "deviation",
  "expression": "case_simple(\"severity_analysis_period\", 'Critical', '>4hr Deviation', 'High', '<4hr Deviation', 'Medium', '<4hr Deviation', 'Low', '<4hr Deviation', 'No Deviation')"
}
The case_simple function works like:
case_simple(expr, value1, result1, [[value2, result2, ...], else-result])
See also: https://druid.apache.org/docs/latest/misc/math-expr.html
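Untested, but applying the same pattern to all three fields from the question, the transformSpec might look roughly like this (note that in Druid expressions string literals are single-quoted, while double quotes reference columns):
"transformSpec": {
  "transforms": [
    {
      "type": "expression",
      "name": "deviation",
      "expression": "case_simple(\"severity_analysis_period\", 'Critical', '>4hr Deviation', 'High', '<4hr Deviation', 'Medium', '<4hr Deviation', 'Low', '<4hr Deviation', 'No Deviation')"
    },
    {
      "type": "expression",
      "name": "deviation_day",
      "expression": "case_simple(\"severity_day\", 'Critical', '>4hr Deviation', 'High', '<4hr Deviation', 'Medium', '<4hr Deviation', 'Low', '<4hr Deviation', 'No Deviation')"
    },
    {
      "type": "expression",
      "name": "deviation_hour",
      "expression": "case_simple(\"severity_day_hour\", 'Critical', '>4hr Deviation', 'High', '<4hr Deviation', 'Medium', '<4hr Deviation', 'Low', '<4hr Deviation', 'No Deviation')"
    }
  ]
}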

Sum(distinct metric) in Apache Druid

How do we write sum(distinct col) in Druid? If I try to write it in Druid, it says the plan can't be built, even though the same query is possible in SQL. I tried converting it to a subquery approach, but my inner query returns a lot of item-level data and hence times out.
A distinct count or sum is not something that Druid supports out of the box.
There are actually several methods that give you a similar result.
Option 1. Theta Sketch extension (recommended)
If you enable the Theta Sketch extension (see https://druid.apache.org/docs/latest/development/extensions-core/datasketches-theta.html), you can use it to get this result.
Example:
{
  "queryType": "groupBy",
  "dataSource": "hits",
  "intervals": [
    "2020-08-14T11:00:00.000Z/2020-08-14T12:00:00.000Z"
  ],
  "dimensions": [],
  "granularity": "all",
  "aggregations": [
    {
      "type": "thetaSketch",
      "name": "domain",
      "fieldName": "domain"
    }
  ]
}
Result:
+--------+
| domain |
+--------+
|     22 |
+--------+
Option 2: cardinality
The cardinality() aggregation computes the cardinality of a set of Apache Druid (incubating) dimensions, using HyperLogLog to estimate the cardinality.
Example:
{
  "queryType": "groupBy",
  "dataSource": "hits",
  "intervals": [
    "2020-08-14T11:00:00.000Z/2020-08-14T12:00:00.000Z"
  ],
  "dimensions": [],
  "granularity": "all",
  "aggregations": [
    {
      "type": "cardinality",
      "name": "domain",
      "fields": [
        {
          "type": "default",
          "dimension": "domain",
          "outputType": "string",
          "outputName": "domain"
        }
      ],
      "byRow": false,
      "round": false
    }
  ]
}
Response:
+-----------------+
| domain          |
+-----------------+
| 22.119017166376 |
+-----------------+
Option 3. hyperUnique
This option requires that you keep track of the counts at indexing time with a hyperUnique metric (a sketch of the required metricsSpec entry follows below). If you have applied this, you can use it in your query:
{
  "queryType": "groupBy",
  "dataSource": "hits",
  "intervals": [
    "2020-08-14T11:00:00.000Z/2020-08-14T12:00:00.000Z"
  ],
  "dimensions": [],
  "granularity": "all",
  "aggregations": [
    {
      "type": "hyperUnique",
      "name": "domain",
      "fieldName": "domain",
      "isInputHyperUnique": false,
      "round": false
    }
  ],
  "context": {
    "groupByStrategy": "v2"
  }
}
As I have no hyperUnique metric in my data set, I have no exact example response.
This page explains this method very well: https://blog.mshimul.com/getting-unique-counts-from-druid-using-hyperloglog/
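For completeness, an untested sketch of the metricsSpec entry you would add to the ingestion spec to build such a hyperUnique metric (the metric name domain_unique is just an example):
"metricsSpec": [
  {
    "type": "hyperUnique",
    "name": "domain_unique",
    "fieldName": "domain"
  }
]
At query time, the fieldName of the hyperUnique aggregator would then point at this metric rather than at the raw dimension.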
Conclusion
In my opinion, the Theta Sketch extension is the best and easiest way to get this result. Please read the documentation carefully.
If you are a PHP user, you could take a look at these, maybe they help:
https://github.com/level23/druid-client#hyperunique
https://github.com/level23/druid-client#cardinality
https://github.com/level23/druid-client#distinctcount

Druid GroupBy query gives different response when changing the order by fields

I have a question regarding an Apache Druid incubating query.
I have a simple group by to select the number of calls per operator. See here my query:
{
  "queryType": "groupBy",
  "dataSource": "ivr-calls",
  "intervals": [
    "2019-12-06T00:00:00.000Z/2019-12-07T00:00:00.000Z"
  ],
  "dimensions": [
    {
      "type": "lookup",
      "dimension": "operator_id",
      "outputName": "value",
      "name": "ivr_operator",
      "replaceMissingValueWith": "Unknown"
    },
    {
      "type": "default",
      "dimension": "operator_id",
      "outputType": "long",
      "outputName": "id"
    }
  ],
  "granularity": "all",
  "aggregations": [
    {
      "type": "longSum",
      "name": "calls",
      "fieldName": "calls"
    }
  ],
  "limitSpec": {
    "type": "default",
    "limit": 999999,
    "columns": [
      {
        "dimension": "value",
        "direction": "ascending",
        "dimensionOrder": "numeric"
      }
    ]
  }
}
In this query I order the result by the "value" dimension, and I receive 218 results.
I noticed that some of the records are duplicated (I see some operators twice in my result set). This is strange, because in my experience all dimensions that you select are also used for grouping, so they should be unique.
If I add an order by to the "id" dimension, I receive 183 results (which is expected):
"columns": [
{
"dimension": "value",
"direction": "ascending",
"dimensionOrder": "numeric"
},
{
"dimension": "id",
"direction": "ascending",
"dimensionOrder": "numeric"
}
]
The documentation tells me nothing about this strange behavior (https://druid.apache.org/docs/latest/querying/limitspec.html).
My previous experience with Druid is that the order by is just "ordering".
I am running druid version 0.15.0-incubating-iap9.
Can anybody tell me why there is a difference in the result set based on the column sorting?
I resolved this problem for now by specifying all columns in my order by.
The issue seems to be related to a bug in Druid. See: https://github.com/apache/incubator-druid/issues/9000

Ingesting multi-valued dimension from comma sep string

I have event data from Kafka with the following structure that I want to ingest into Druid:
{
  "event": "some_event",
  "id": "1",
  "parameters": {
    "campaigns": "campaign1, campaign2",
    "other_stuff": "important_info"
  }
}
Specifically, I want to transform the dimension "campaigns" from a comma-separated string into an array / multi-valued dimension so that it can be nicely filtered and grouped by.
My ingestion spec so far looks as follows:
{
  "type": "kafka",
  "dataSchema": {
    "dataSource": "event-data",
    "parser": {
      "type": "string",
      "parseSpec": {
        "format": "json",
        "timestampSpec": {
          "column": "timestamp",
          "format": "posix"
        },
        "flattenSpec": {
          "fields": [
            {
              "type": "root",
              "name": "parameters"
            },
            {
              "type": "jq",
              "name": "campaigns",
              "expr": ".parameters.campaigns"
            }
          ]
        }
      },
      "dimensionSpec": {
        "dimensions": [
          "event",
          "id",
          "campaigns"
        ]
      }
    },
    "metricsSpec": [
      {
        "type": "count",
        "name": "count"
      }
    ],
    "granularitySpec": {
      "type": "uniform",
      ...
    }
  },
  "tuningConfig": {
    "type": "kafka",
    ...
  },
  "ioConfig": {
    "topic": "production-tracking",
    ...
  }
}
This, however, leads to campaigns being ingested as a single string.
I could neither find a way to generate an array out of it with a jq expression in the flattenSpec, nor did I find something like a string-split expression that could be used in a transformSpec.
Any suggestions?
Try setting useFieldDiscovery: false in your ingestion spec. When this flag is set to true (which is the default), the parser interprets all fields with singular values (not a map or list) and flat lists (lists of singular values) at the root level as columns.
Here is a good example and reference link for using the flattenSpec:
https://druid.apache.org/docs/latest/ingestion/flatten-json.html
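An untested sketch of where that flag would go in your parseSpec; note that with useFieldDiscovery set to false, the root-level fields you still need (event, id) have to be listed explicitly:
"flattenSpec": {
  "useFieldDiscovery": false,
  "fields": [
    { "type": "root", "name": "event" },
    { "type": "root", "name": "id" },
    { "type": "jq", "name": "campaigns", "expr": ".parameters.campaigns" }
  ]
}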
It looks like Druid expressions support typed constructors for creating arrays since Druid 0.17.0, so using the string_to_array expression should do the trick!
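This is untested, but a transformSpec along these lines might do it, assuming the flattened campaigns field holds the comma-separated string (the transform shadows the original field, and the ', ' delimiter matches the example event):
"transformSpec": {
  "transforms": [
    {
      "type": "expression",
      "name": "campaigns",
      "expression": "string_to_array(\"campaigns\", ', ')"
    }
  ]
}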

How to perform date arithmetic between nested and unnested dates in Elasticsearch?

Consider the following Elasticsearch (v5.4) object (an "award" doc type):
{
  "name": "Gold 1000",
  "date": "2017-06-01T16:43:00.000+00:00",
  "recipient": {
    "name": "James Conroy",
    "date_of_birth": "1991-05-30"
  }
}
The mapping type for both award.date and award.recipient.date_of_birth is "date".
I want to perform a range aggregation to get a list of the age ranges of the recipients of this award ("Under 18", "18-24", "24-30", "30+"), at the time of their award. I tried the following aggregation query:
{
  "size": 0,
  "query": {"match_all": {}},
  "aggs": {
    "recipients": {
      "nested": {
        "path": "recipient"
      },
      "aggs": {
        "age_ranges": {
          "range": {
            "script": {
              "inline": "doc['date'].date - doc['recipient.date_of_birth'].date"
            },
            "keyed": true,
            "ranges": [{
              "key": "Under 18",
              "from": 0,
              "to": 18
            }, {
              "key": "18-24",
              "from": 18,
              "to": 24
            }, {
              "key": "24-30",
              "from": 24,
              "to": 30
            }, {
              "key": "30+",
              "from": 30,
              "to": 100
            }]
          }
        }
      }
    }
  }
}
Problem 1
But I get the following error due to the comparison of dates in the script portion:
Cannot apply [-] operation to types [org.joda.time.DateTime] and [org.joda.time.MutableDateTime].
The DateTime object is the award.date field, and the MutableDateTime object is the award.recipient.date_of_birth field. I've tried doing something like doc['recipient.date_of_birth'].date.toDateTime() (which doesn't work despite the Joda docs claiming that MutableDateTime has this method inherited from a parent class). I've also tried doing something further like this:
"script": "ChronoUnit.YEARS.between(doc['date'].date, doc['recipient.date_of_birth'].date)"
Which sadly also doesn't work :(
Problem 2
I notice if I do this:
"aggs": {
"recipients": {
"nested": {
"path": "recipient"
},
"aggs": {
"award_years": {
"terms": {
"script": {
"inline": "doc['date'].date.year"
}
}
}
}
}
}
I get 1970 with a doc_count that happens to equal the total number of docs in ES. This leads me to believe that accessing a property outside of the nested object simply does not work and gives me back some default like the epoch datetime. And if I do the opposite (aggregating dates of birth without nesting), I get the exact same thing for all the dates of birth instead (1970, epoch datetime). So how can I compare those two dates?
I am racking my brain here, and I feel like there's some clever solution that is just beyond my current expertise with Elasticsearch. Help!
If you want to set up a quick environment for this to help me out, here is some curl goodness:
curl -XDELETE http://localhost:9200/joelinux
curl -XPUT http://localhost:9200/joelinux -d "{\"mappings\": {\"award\": {\"properties\": {\"name\": {\"type\": \"string\"}, \"date\": {\"type\": \"date\", \"format\": \"yyyy-MM-dd'T'HH:mm:ss.SSSSSSZ\"}, \"recipient\": {\"type\": \"nested\", \"properties\": {\"name\": {\"type\": \"string\"}, \"date_of_birth\": {\"type\": \"date\", \"format\": \"yyyy-MM-dd\"}}}}}}}"
curl -XPUT http://localhost:9200/joelinux/award/1 -d '{"name": "Gold 1000", "date": "2016-06-01T16:43:00.000000+00:00", "recipient": {"name": "James Conroy", "date_of_birth": "1991-05-30"}}'
curl -XPUT http://localhost:9200/joelinux/award/2 -d '{"name": "Gold 1000", "date": "2017-02-28T13:36:00.000000+00:00", "recipient": {"name": "Martin McNealy", "date_of_birth": "1983-01-20"}}'
That should give you a "joelinux" index with two "award" docs to test this out ("James Conroy" and "Martin McNealy"). Thanks in advance!
Unfortunately, you can't access nested and non-nested fields within the same context. As a workaround, you can change your mapping to automatically copy date from nested document to root context using copy_to option:
{
  "mappings": {
    "award": {
      "properties": {
        "name": {
          "fields": {
            "keyword": {
              "ignore_above": 256,
              "type": "keyword"
            }
          },
          "type": "text"
        },
        "date": {
          "type": "date"
        },
        "date_of_birth": {
          "type": "date" // will be automatically filled when indexing documents
        },
        "recipient": {
          "properties": {
            "name": {
              "fields": {
                "keyword": {
                  "ignore_above": 256,
                  "type": "keyword"
                }
              },
              "type": "text"
            },
            "date_of_birth": {
              "type": "date",
              "copy_to": "date_of_birth" // copy value to root document
            }
          },
          "type": "nested"
        }
      }
    }
  }
}
After that you can access the date of birth via the root-level date_of_birth field, though the calculation to get the number of years between the dates is slightly tricky:
Period.between(LocalDate.ofEpochDay(doc['date_of_birth'].date.getMillis() / 86400000L), LocalDate.ofEpochDay(doc['date'].date.getMillis() / 86400000L)).getYears()
Here I convert the original Joda-Time date objects to java.time.LocalDate objects:
1. Get the number of milliseconds from 1970-01-01.
2. Convert to the number of days from 1970-01-01 by dividing by 86400000L (the number of ms in one day).
3. Convert to a LocalDate object.
4. Create a date-based Period object from the two dates.
5. Get the number of years between the two dates.
So, the final aggregation query looks like this:
{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggs": {
    "age_ranges": {
      "range": {
        "script": {
          "inline": "Period.between(LocalDate.ofEpochDay(doc['date_of_birth'].date.getMillis() / 86400000L), LocalDate.ofEpochDay(doc['date'].date.getMillis() / 86400000L)).getYears()"
        },
        "keyed": true,
        "ranges": [
          {
            "key": "Under 18",
            "from": 0,
            "to": 18
          },
          {
            "key": "18-24",
            "from": 18,
            "to": 24
          },
          {
            "key": "24-30",
            "from": 24,
            "to": 30
          },
          {
            "key": "30+",
            "from": 30,
            "to": 100
          }
        ]
      }
    }
  }
}