Post aggregation example query for druid in json - druid

I am trying to use post aggregation. I have used an aggregation to count the number of rows that match the given filter. Following is the post-aggregation query:
{
  "queryType": "groupBy",
  "dataSource": "datasrc1",
  "intervals": ["2020-09-16T21:15/2020-09-16T22:30"],
  "pagingSpec": { "threshold": 100 },
  "dimensions": ["city", "zip_code", "country"],
  "filter": {
    "fields": [
      {
        "type": "selector",
        "dimension": "bankId",
        "value": "<bank id>"
      }
    ]
  },
  "granularity": "all",
  "aggregations": [
    { "type": "count", "name": "row" }
  ],
  "postAggregations": [
    {
      "type": "arithmetic",
      "name": "sum_rows",
      "fn": "+",
      "fields": [
        { "type": "fieldAccess", "fieldName": "row" }
      ]
    }
  ]
}
If I remove the post aggregation part, it returns a result like:
[ {
  "version" : "v1",
  "timestamp" : "2020-09-16T21:15:00.000Z",
  "event" : {
    "city": "Sunnyvale",
    "zip_code": "94085",
    "country": "US",
    "row" : 1
  }
}, {
  "version" : "v1",
  "timestamp" : "2020-09-16T21:15:00.000Z",
  "event" : {
    "city": "Sunnyvale",
    "zip_code": "94080",
    "country": "US",
    "row" : 1
  }
} ]
If I add the post aggregations part, I get a parser exception:
{
  "error" : "Unknown exception",
  "errorMessage" : "Instantiation of [simple type, class io.druid.query.aggregation.post.ArithmeticPostAggregator] value failed: Illegal number of fields[%s], must be > 1 (through reference chain: java.util.ArrayList[0])",
  "errorClass" : "com.fasterxml.jackson.databind.JsonMappingException",
  "host" : null
}
I want to add up all the rows (column 'row') in the response of the aggregation query and put the output in "sum_rows".
I don't understand what I am missing in postAggregations. Any help is appreciated.

I confess that I spend most of my time in the SQL API, not the native API(!), but I think your issue is that you're only supplying one field to your post aggregator. See these examples:
https://druid.apache.org/docs/latest/querying/post-aggregations.html#example-usage
If you need the sum of rows, perhaps you need a normal aggregator to sum the row count?

The error message says the ArithmeticPostAggregator requires more than one argument; the example code supplies only one. There's an example of this post aggregator at the bottom of this answer.
However, the example query doesn't have multiple numeric aggregations to perform arithmetic post-aggregation against. Maybe the goal is to "combine" the two output rows into one?
To change the two-row result into a single row with the total row count (for all database records matching the query filter and interval), removing zip_code from the dimension list would be one way.
Removing zip_code from dimensions would produce one result like this:
[
  {
    "version" : "v1",
    "timestamp" : "2020-09-16T21:15:00.000Z",
    "event" : {
      "city": "Sunnyvale",
      "country": "US",
      "row" : 2
    }
  }
]
As you can see, by submitting a groupBy query with aggregations, Druid will do this aggregation for you dynamically (based on the dimension values in the database at the time the query is run) without needing post aggregations.
Example arithmetic post aggregator:
{
  "type": "arithmetic",
  "name": "my_output_sum",
  "fn": "+",
  "fields": [
    { "fieldName": "input_addend_1", "type": "fieldAccess" },
    { "fieldName": "input_addend_2", "type": "fieldAccess" }
  ]
}
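To see why dropping zip_code collapses the two rows into one, here is a small Python sketch (the sample rows are invented to mirror the result above) that re-groups by the remaining dimensions and sums the row counts, which is what Druid's groupBy does for you:

```python
from collections import defaultdict

# Invented sample rows mirroring the two-row groupBy result above.
rows = [
    {"city": "Sunnyvale", "zip_code": "94085", "country": "US", "row": 1},
    {"city": "Sunnyvale", "zip_code": "94080", "country": "US", "row": 1},
]

# Re-grouping by only (city, country) merges the two rows and sums their
# counts, exactly what Druid does when zip_code is dropped from "dimensions".
totals = defaultdict(int)
for r in rows:
    totals[(r["city"], r["country"])] += r["row"]

print(dict(totals))  # {('Sunnyvale', 'US'): 2}
```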

Related

Comparing dimensions in druid

I recently started experimenting with Druid. I have a use case which I'm not able to solve. I have 3 date columns (primary_date, date_1 and date_2), an amount, and a client.
I want to calculate sum(amount) where date_1 > date_2 with month granularity. I want to calculate this for each month in a 6-month interval for each client.
I also want to calculate sum(amount) where date_1 > max(bucket date) for each bucket over 6 months for each client.
{
  "queryType" : "groupBy",
  "dataSource" : "data_source_xxx",
  "granularity" : "month",
  "dimensions" : ["client"],
  "intervals": ["2019-01-01/2019-07-01"],
  "aggregations": [{ "type": "doubleSum", "name": "total_amount", "fieldName": "amount" }],
  "filter" : {
    "type": "selector",
    "dimension": "client",
    "value": "client"
  }
}
I want to modify the above query to include the additional filters I mentioned.
Any help is highly appreciated.
Thanks
I think you can achieve this by using a virtual column that does the date comparison. Then you can use that virtual column in a filtered aggregation, which only applies the aggregation when the filter matches.
This is not tested, but I think something like this should work:
{
  "queryType": "groupBy",
  "dataSource": "data_source_xxx",
  "intervals": [
    "2019-01-01T00:00:00.000Z/2019-07-01T00:00:00.000Z"
  ],
  "dimensions": [
    {
      "type": "default",
      "dimension": "client",
      "outputType": "string",
      "outputName": "client"
    }
  ],
  "granularity": "month",
  "aggregations": [
    {
      "type": "filtered",
      "filter": {
        "type": "selector",
        "dimension": "isOlder",
        "value": "1"
      },
      "aggregator": {
        "type": "doubleSum",
        "name": "sumAmount",
        "fieldName": "amount"
      }
    }
  ],
  "virtualColumns": [
    {
      "type": "expression",
      "name": "isOlder",
      "expression": "if( date_1 > date_2, '1', '0')",
      "outputType": "string"
    }
  ],
  "context": {
    "groupByStrategy": "v2"
  }
}
I generated this query with the following PHP code, using this package: https://github.com/level23/druid-client
$client = new DruidClient(['router_url' => 'http://127.0.0.1:8888']);

// Build a groupBy query
$builder = $client->query('data_source_xxx', Granularity::MONTH)
    ->interval("2019-01-01/2019-07-01")
    ->select(['client'])
    ->virtualColumn("if( date_1 > date_2, '1', '0')", 'isOlder')
    ->sum('amount', 'sumAmount', DataType::DOUBLE, function (FilterBuilder $filterBuilder) {
        $filterBuilder->where('isOlder', '=', '1');
    });

echo $builder->toJson();
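The filtered-aggregation idea can also be sketched outside Druid. This Python snippet (all sample rows and values are invented for illustration) applies the same date_1 > date_2 test per row and sums amount into monthly buckets per client, mirroring what the virtual column plus filtered doubleSum does:

```python
from collections import defaultdict
from datetime import date

# Invented rows; date objects stand in for the date_1/date_2 dimensions.
rows = [
    {"client": "acme", "primary_date": date(2019, 1, 5),
     "date_1": date(2019, 1, 10), "date_2": date(2019, 1, 3), "amount": 100.0},
    {"client": "acme", "primary_date": date(2019, 1, 20),
     "date_1": date(2019, 1, 1), "date_2": date(2019, 1, 15), "amount": 50.0},
    {"client": "acme", "primary_date": date(2019, 2, 2),
     "date_1": date(2019, 2, 9), "date_2": date(2019, 2, 1), "amount": 25.0},
]

# The "virtual column" is the date_1 > date_2 test; the sum only accumulates
# rows where it holds, bucketed by (client, month).
sums = defaultdict(float)
for r in rows:
    if r["date_1"] > r["date_2"]:  # isOlder == '1'
        bucket = (r["client"], r["primary_date"].strftime("%Y-%m"))
        sums[bucket] += r["amount"]

print(dict(sums))  # {('acme', '2019-01'): 100.0, ('acme', '2019-02'): 25.0}
```

The second row (date_1 earlier than date_2) is excluded from the sum, just as the selector filter on isOlder excludes it in the Druid query.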

Druid GroupBy query gives different response when changing the order by fields

I have a question regarding an Apache Druid (incubating) query.
I have a simple groupBy to select the number of calls per operator. Here is my query:
{
  "queryType": "groupBy",
  "dataSource": "ivr-calls",
  "intervals": [
    "2019-12-06T00:00:00.000Z/2019-12-07T00:00:00.000Z"
  ],
  "dimensions": [
    {
      "type": "lookup",
      "dimension": "operator_id",
      "outputName": "value",
      "name": "ivr_operator",
      "replaceMissingValueWith": "Unknown"
    },
    {
      "type": "default",
      "dimension": "operator_id",
      "outputType": "long",
      "outputName": "id"
    }
  ],
  "granularity": "all",
  "aggregations": [
    {
      "type": "longSum",
      "name": "calls",
      "fieldName": "calls"
    }
  ],
  "limitSpec": {
    "type": "default",
    "limit": 999999,
    "columns": [
      {
        "dimension": "value",
        "direction": "ascending",
        "dimensionOrder": "numeric"
      }
    ]
  }
}
In this query I order the result by the "value" dimension, and I receive 218 results.
I noticed that some of the records are duplicated (I see some operators twice in my result set). This is strange because, in my experience, all dimensions you select are also used for grouping, so they should be unique.
If I add an order by to the "id" dimension, I receive 183 results (which is expected):
"columns": [
  {
    "dimension": "value",
    "direction": "ascending",
    "dimensionOrder": "numeric"
  },
  {
    "dimension": "id",
    "direction": "ascending",
    "dimensionOrder": "numeric"
  }
]
The documentation tells me nothing about this strange behavior (https://druid.apache.org/docs/latest/querying/limitspec.html).
My previous experience with druid is that the order by is just "ordering".
I am running druid version 0.15.0-incubating-iap9.
Can anybody tell me why there is a difference in the result set based on the column sorting?
I have resolved this problem for now by specifying all columns in my order by.
The issue seems to be related to a bug in Druid. See: https://github.com/apache/incubator-druid/issues/9000

Can we use world updates for a France-only server? It seems to produce wrong query results

On my local Overpass API server, which holds only French data and has the hourly planet diff applied, some of the query responses are wrong.
It doesn't happen for every query: only about once every 200 requests, sometimes more.
For example:
[timeout:360][out:json];way(48.359900103518235,5.708088852670471,48.360439696481784,5.708900947329539)[highway];out ;
returns 3 ways:
{
  "version": 0.6,
  "generator": "Overpass API 0.7.54.13 ff15392f",
  "osm3s": {
    "timestamp_osm_base": "2019-09-23T15:00:00Z"
  },
  "elements": [
    {
      "type": "way",
      "id": 53290349,
      "nodes": [...],
      "tags": {
        "highway": "secondary",
        "maxspeed": "100",
        "ref": "L 385"
      }
    },
    {
      "type": "way",
      "id": 238493649,
      "nodes": [...],
      "tags": {
        "highway": "residential",
        "name": "Rue du Stand",
        "ref": "C 3",
        "source": "..."
      }
    },
    {
      "type": "way",
      "id": 597978369,
      "nodes": [...],
      "tags": {
        "highway": "service"
      }
    }
  ]
}
The first one is in Germany, far to the east...
My question:
On an Overpass API server, is there a way to apply diffs only for a defined area? It is not documented (neither here: https://wiki.openstreetmap.org/wiki/Overpass_API/Installation
nor here: https://wiki.openstreetmap.org/wiki/User:Breki/Overpass_API_Installation#Configuring_Diffs).
If not, how can I get rid of those wrong results?
Thanks,
Two questions, so two answers:
I found that there are French diff files available: http://download.openstreetmap.fr/replication/europe/france/minute/ so I will restart my server with those diffs.
The best way to get rid of those wrong results is to have a consistent server: no world diffs for France-only data.

Ingesting multi-valued dimension from comma sep string

I have event data from Kafka with the following structure that I want to ingest in Druid
{
  "event": "some_event",
  "id": "1",
  "parameters": {
    "campaigns": "campaign1, campaign2",
    "other_stuff": "important_info"
  }
}
Specifically, I want to transform the dimension "campaigns" from a comma-separated string into an array / multi-valued dimension so that it can be nicely filtered and grouped by.
My ingestion so far looks as follows
{
  "type": "kafka",
  "dataSchema": {
    "dataSource": "event-data",
    "parser": {
      "type": "string",
      "parseSpec": {
        "format": "json",
        "timestampSpec": {
          "column": "timestamp",
          "format": "posix"
        },
        "flattenSpec": {
          "fields": [
            {
              "type": "root",
              "name": "parameters"
            },
            {
              "type": "jq",
              "name": "campaigns",
              "expr": ".parameters.campaigns"
            }
          ]
        },
        "dimensionsSpec": {
          "dimensions": [
            "event",
            "id",
            "campaigns"
          ]
        }
      }
    },
    "metricsSpec": [
      {
        "type": "count",
        "name": "count"
      }
    ],
    "granularitySpec": {
      "type": "uniform",
      ...
    }
  },
  "tuningConfig": {
    "type": "kafka",
    ...
  },
  "ioConfig": {
    "topic": "production-tracking",
    ...
  }
}
This, however, leads to campaigns being ingested as a string.
I could neither find a way to generate an array out of it with a jq expression in the flattenSpec, nor did I find something like a string-split expression that could be used in a transformSpec.
Any suggestions?
Try setting useFieldDiscovery: false in your flattenSpec. When this flag is set to true (the default), Druid interprets all fields with singular values (not a map or list) and flat lists (lists of singular values) at the root level as columns.
Here is a good example and reference link for the flatten spec:
https://druid.apache.org/docs/latest/ingestion/flatten-json.html
It looks like, since Druid 0.17.0, Druid expressions support typed constructors for creating arrays, so an expression transform using string_to_array should do the trick!
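What string_to_array would produce at ingestion time can be sketched locally. This Python snippet (using the sample event from the question) splits the comma-separated string into a list, which is the shape a multi-valued dimension needs:

```python
# Sample event from the question.
event = {
    "event": "some_event",
    "id": "1",
    "parameters": {
        "campaigns": "campaign1, campaign2",
        "other_stuff": "important_info",
    },
}

# Split on ", " (comma plus the space present in the sample data), turning
# the scalar string into a list of campaign names.
campaigns = event["parameters"]["campaigns"].split(", ")
print(campaigns)  # ['campaign1', 'campaign2']
```

In Druid the equivalent separator argument would be passed to string_to_array in the transform expression.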

MongoDB - Project specific element from array (big data)

I got a big array with data in the following format:
{
  "application": "myapp",
  "buildSystem": {
    "counter": 2361.1,
    "hostname": "host.com",
    "jobName": "job_name",
    "label": "2361",
    "systemType": "sys"
  },
  "creationTime": 1517420374748,
  "id": "123",
  "stack": "OTHER",
  "testStatus": "PASSED",
  "testSuites": [
    {
      "errors": 0,
      "failures": 0,
      "hostname": "some_host",
      "properties": [
        {
          "name": "some_name",
          "value": "UnicodeLittle"
        },
        <MANY MORE PROPERTIES>,
        {
          "name": "sun",
          "value": ""
        }
      ],
      "skipped": 0,
      "systemError": "",
      "systemOut": "",
      "testCases": [
        {
          "classname": "IdTest",
          "name": "has correct representation",
          "status": "PASSED",
          "time": "0.001"
        },
        <MANY MORE TEST CASES>,
        {
          "classname": "IdTest",
          "name": "normalized values",
          "status": "PASSED",
          "time": "0.001"
        }
      ],
      "tests": 8,
      "time": 0.005,
      "timestamp": "2018-01-31T17:35:15",
      "title": "IdTest"
    },
    <MANY MORE TEST SUITES>
  ]
}
Here I can distinguish three main structures with big data: testSuites, properties, and testCases. My task is to sum all times from each test suite so that I can get the total duration of the test. Since properties and testCases are huge, the query cannot complete. I would like to select only the "time" value from testSuites, but it conflicts with the "time" of testCases in my query:
db.my_tests.find(
  {
    application: application,
    creationTime: {
      $gte: start_date.valueOf(),
      $lte: end_date.valueOf()
    }
  },
  {
    application: 1,
    creationTime: 1,
    buildSystem: 1,
    "testSuites.time": 1,
    _id: 1
  }
)
Is it possible to project only the "time" properties from testSuites without loading the whole schema? I already tried testSuites: 1 and testSuites.$.time: 1 without success. Please note that testSuites is an array of one element containing a dictionary.
I already checked this similar post without success:
Mongodb update the specific element from subarray
The following code prints the duration of each test suite:
db.my_collection.aggregate(
  [
    {
      $match: {
        application: application,
        creationTime: {
          $gte: start_date.valueOf(),
          $lte: end_date.valueOf()
        }
      }
    },
    {
      $project: {
        duration: { $sum: "$testSuites.time" }
      }
    }
  ]
).forEach(function (doc) {
  print(doc._id)
  print(doc.duration)
})
Is it possible to project only the "time" properties from TestSuites
without loading the whole schema? I already tried testSuites: 1,
testSuites.$.time
Answering your problem of projecting only the time property of the testSuites documents: you can simply project it with "testSuites.time": 1 (you need to add the quotes for dot-notation property references).
My task is to sum all times from each TestSuite so that I can get the
total duration of the test. Since the properties and TestCases are
huge, the query cannot complete
As for your task, I suggest you try out MongoDB's aggregation framework for your calculations and document transformations. The aggregation framework option { allowDiskUse: true } will also help you if you are processing "large" documents.
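To illustrate what the $project stage with $sum over "$testSuites.time" computes, here is a Python sketch on a trimmed, invented sample document (suite times chosen as exact binary fractions so the sum is exact):

```python
# Invented, trimmed document in the shape of the question's data; only the
# per-suite "time" fields matter for the duration calculation.
doc = {
    "_id": "123",
    "application": "myapp",
    "testSuites": [
        {"title": "IdTest", "time": 0.5},
        {"title": "OtherTest", "time": 0.25},
    ],
}

# $sum over "$testSuites.time" adds the "time" value of every array element;
# suites without a numeric "time" would contribute 0.
duration = sum(suite.get("time", 0) for suite in doc["testSuites"])
print(duration)  # 0.75
```

This mirrors the aggregation pipeline's per-document result: one duration per matched document, without ever touching the large properties and testCases subdocuments.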