How to check an 'IN' clause in a Druid transformation spec expression

I'm looking for a way to express an 'IN' clause in a Druid transformation expression.
I want to derive a field with the following condition:
when severity_analysis_period IN ('Critical')
then '>4hr Deviation',
when severity_analysis_period IN ('High', 'Medium', 'Low')
then '<4hr Deviation', else 'No Deviation'.
I wrote the transformation spec like this, but it doesn't work:
"transformSpec": {
"transforms": [
{
"type": "expression",
"name": "deviation",
"expression": "if(\"severity_analysis_period\"='Critical'),'>4hr Deviation',if(\"severity_analysis_period\" in ('High','Medium','Low'),'<4hr Deviation','No Deviation'))"
},
{
"type": "expression",
"name": "deviation_day",
"expression": "if((\"severity_day\"='Critical'),'>4hr Deviation',if(\"severity_day in\" ('High','Medium','Low'),'<4hr Deviation','No Deviation'))"
},
{
"type": "expression",
"name": "deviation_hour",
"expression": "if((\"severity_day_hour\"='Critical'),'>4hr Deviation',if(\"severity_day_hour\" in ('High','Medium','Low'),'<4hr Deviation','No Deviation'))"
}
],
}

You could try to use a case statement. Note that in Druid expressions, string literals are single-quoted and column references are double-quoted:
{
  "type": "expression",
  "name": "deviation",
  "expression": "case_simple(\"severity_analysis_period\", 'Critical', '>4hr Deviation', 'High', '<4hr Deviation', 'Medium', '<4hr Deviation', 'Low', '<4hr Deviation', 'No Deviation')"
}
The case_simple function works like:
case_simple(expr, value1, result1, [[value2, result2, ...], else-result])
See also: https://druid.apache.org/docs/latest/misc/math-expr.html
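Untested, but the full transformSpec from the question could then look like this, with case_simple applied to all three fields (column and output names taken from the question):
"transformSpec": {
  "transforms": [
    {
      "type": "expression",
      "name": "deviation",
      "expression": "case_simple(\"severity_analysis_period\", 'Critical', '>4hr Deviation', 'High', '<4hr Deviation', 'Medium', '<4hr Deviation', 'Low', '<4hr Deviation', 'No Deviation')"
    },
    {
      "type": "expression",
      "name": "deviation_day",
      "expression": "case_simple(\"severity_day\", 'Critical', '>4hr Deviation', 'High', '<4hr Deviation', 'Medium', '<4hr Deviation', 'Low', '<4hr Deviation', 'No Deviation')"
    },
    {
      "type": "expression",
      "name": "deviation_hour",
      "expression": "case_simple(\"severity_day_hour\", 'Critical', '>4hr Deviation', 'High', '<4hr Deviation', 'Medium', '<4hr Deviation', 'Low', '<4hr Deviation', 'No Deviation')"
    }
  ]
}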

Related

Comparing dimensions in druid

I recently started experimenting with Druid. I have a use case which I'm not able to solve. I have 3 date columns primary_date, date_1 and date_2, plus amount and client.
I want to calculate sum(amount) where date_1 > date_2, with month granularity. I want to calculate this for each month in a 6-month interval, for each client.
I also want to calculate sum(amount) where date_1 > max(bucket date), for each bucket over the 6 months, for each client.
{
  "queryType": "groupBy",
  "dataSource": "data_source_xxx",
  "granularity": "month",
  "dimensions": ["client"],
  "intervals": ["2019-01-01/2019-07-01"],
  "aggregations": [{"type": "doubleSum", "name": "total_amount", "fieldName": "amount"}],
  "filter": {
    "type": "selector",
    "dimension": "client",
    "value": "client"
  }
}
I want to modify the above query to add the additional conditions I have mentioned.
Any help is highly appreciated.
Thanks
I think you can achieve this by using a virtual column which does the date comparison. Then you should be able to use the virtual column in a filtered aggregation, which only applies the aggregation if the filter matches.
This is not tested, but I think something like this should work:
{
  "queryType": "groupBy",
  "dataSource": "data_source_xxx",
  "intervals": [
    "2019-01-01T00:00:00.000Z/2019-07-01T00:00:00.000Z"
  ],
  "dimensions": [
    {
      "type": "default",
      "dimension": "client",
      "outputType": "string",
      "outputName": "client"
    }
  ],
  "granularity": "month",
  "aggregations": [
    {
      "type": "filtered",
      "filter": {
        "type": "selector",
        "dimension": "isOlder",
        "value": "1"
      },
      "aggregator": {
        "type": "doubleSum",
        "name": "sumAmount",
        "fieldName": "amount"
      }
    }
  ],
  "virtualColumns": [
    {
      "type": "expression",
      "name": "isOlder",
      "expression": "if( date_1 > date_2, '1', '0')",
      "outputType": "string"
    }
  ],
  "context": {
    "groupByStrategy": "v2"
  }
}
I created this using the following PHP code, with this package: https://github.com/level23/druid-client
$client = new DruidClient(['router_url' => 'http://127.0.0.1:8888']);

// Build a select query
$builder = $client->query('data_source_xxx', Granularity::MONTH)
    ->interval("2019-01-01/2019-07-01")
    ->select(['client'])
    ->virtualColumn("if( date_1 > date_2, '1', '0')", 'isOlder')
    ->sum('amount', 'sumAmount', DataType::DOUBLE, function (FilterBuilder $filterBuilder) {
        $filterBuilder->where('isOlder', '=', '1');
    });

echo $builder->toJson();

Druid GroupBy query gives different response when changing the order by fields

I have a question regarding an Apache Druid incubating query.
I have a simple group by to select the number of calls per operator. See here my query:
{
  "queryType": "groupBy",
  "dataSource": "ivr-calls",
  "intervals": [
    "2019-12-06T00:00:00.000Z/2019-12-07T00:00:00.000Z"
  ],
  "dimensions": [
    {
      "type": "lookup",
      "dimension": "operator_id",
      "outputName": "value",
      "name": "ivr_operator",
      "replaceMissingValueWith": "Unknown"
    },
    {
      "type": "default",
      "dimension": "operator_id",
      "outputType": "long",
      "outputName": "id"
    }
  ],
  "granularity": "all",
  "aggregations": [
    {
      "type": "longSum",
      "name": "calls",
      "fieldName": "calls"
    }
  ],
  "limitSpec": {
    "type": "default",
    "limit": 999999,
    "columns": [
      {
        "dimension": "value",
        "direction": "ascending",
        "dimensionOrder": "numeric"
      }
    ]
  }
}
In this query I order the result by the "value" dimension, and I receive 218 results.
I noticed that some of the records are duplicated (I see some operators twice in my result set). This is strange because, in my experience, all dimensions that you select are also used for grouping, so they should be unique.
If I add an order by on the "id" dimension, I receive 183 results (which is expected):
"columns": [
{
"dimension": "value",
"direction": "ascending",
"dimensionOrder": "numeric"
},
{
"dimension": "id",
"direction": "ascending",
"dimensionOrder": "numeric"
}
]
The documentation tells me nothing about this strange behavior (https://druid.apache.org/docs/latest/querying/limitspec.html).
My previous experience with Druid is that the order by is just "ordering".
I am running Druid version 0.15.0-incubating-iap9.
Can anybody tell me why there is a difference in the result set based on the column sorting?
I resolved this problem for now by specifying all columns in my order by.
The issue seems to be related to a bug in Druid. See: https://github.com/apache/incubator-druid/issues/9000
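For reference, the workaround simply lists every selected dimension in the limitSpec of the query above:
"limitSpec": {
  "type": "default",
  "limit": 999999,
  "columns": [
    {
      "dimension": "value",
      "direction": "ascending",
      "dimensionOrder": "numeric"
    },
    {
      "dimension": "id",
      "direction": "ascending",
      "dimensionOrder": "numeric"
    }
  ]
}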

Ingesting a multi-valued dimension from a comma-separated string

I have event data from Kafka with the following structure that I want to ingest into Druid:
{
  "event": "some_event",
  "id": "1",
  "parameters": {
    "campaigns": "campaign1, campaign2",
    "other_stuff": "important_info"
  }
}
Specifically, I want to transform the dimension "campaigns" from a comma-separated string into an array / multi-valued dimension so that it can be nicely filtered and grouped by.
My ingestion spec so far looks as follows:
{
  "type": "kafka",
  "dataSchema": {
    "dataSource": "event-data",
    "parser": {
      "type": "string",
      "parseSpec": {
        "format": "json",
        "timestampSpec": {
          "column": "timestamp",
          "format": "posix"
        },
        "flattenSpec": {
          "fields": [
            {
              "type": "root",
              "name": "parameters"
            },
            {
              "type": "jq",
              "name": "campaigns",
              "expr": ".parameters.campaigns"
            }
          ]
        }
      },
      "dimensionSpec": {
        "dimensions": [
          "event",
          "id",
          "campaigns"
        ]
      }
    },
    "metricsSpec": [
      {
        "type": "count",
        "name": "count"
      }
    ],
    "granularitySpec": {
      "type": "uniform",
      ...
    }
  },
  "tuningConfig": {
    "type": "kafka",
    ...
  },
  "ioConfig": {
    "topic": "production-tracking",
    ...
  }
}
This, however, leads to campaigns being ingested as a string.
I could neither find a way to generate an array out of it with a jq expression in the flattenSpec, nor did I find something like a string split expression that could be used in a transformSpec.
Any suggestions?
Try setting useFieldDiscovery: false in the flattenSpec of your ingestion spec. When this flag is set to true (the default), all fields with singular values (not a map or list) and flat lists (lists of singular values) at the root level are interpreted as columns.
Here is a good example and reference link for using a flatten spec:
https://druid.apache.org/docs/latest/ingestion/flatten-json.html
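Untested, but with field discovery disabled, the flattenSpec from the question might look something like this. Note that root-level fields such as event and id then have to be listed explicitly:
"flattenSpec": {
  "useFieldDiscovery": false,
  "fields": [
    { "type": "root", "name": "event" },
    { "type": "root", "name": "id" },
    { "type": "jq", "name": "campaigns", "expr": ".parameters.campaigns" }
  ]
}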
It looks like, since Druid 0.17.0, Druid expressions support typed constructors for creating arrays, so using the string_to_array expression function should do the trick!
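A sketch of what such a transform might look like (untested; it assumes the values are separated by a comma followed by a space, as in the example event):
"transformSpec": {
  "transforms": [
    {
      "type": "expression",
      "name": "campaigns",
      "expression": "string_to_array(\"campaigns\", ', ')"
    }
  ]
}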

How to use script arguments in AWS DataPipeline SQLActivity?

I am trying to execute an unload command on Redshift via Data Pipeline. The script looks something like:
unload ($$ SELECT *, count(*) FROM (SELECT APP_ID, CAST(record_date AS DATE) WHERE len(APP_ID)>0 AND CAST(record_date as DATE)=$1) GROUP BY APP_ID $$) to 's3://test/unload/' iam_role 'arn:aws:iam::xxxxxxxxxxx:role/Test' delimiter ',' addquotes;
The pipeline looks something like this:
{
  "objects": [
    {
      "role": "DataPipelineDefaultRole",
      "subject": "SuccessNotification",
      "name": "SNS",
      "id": "ActionId_xxxxx",
      "message": "SUCCESS: #{format(minusDays(node.#scheduledStartTime,1),'MM-dd-YYYY')}",
      "type": "SnsAlarm",
      "topicArn": "arn:aws:sns:us-west-2:xxxxxxxxxx:notification"
    },
    {
      "connectionString": "connection-url",
      "password": "password",
      "name": "Test",
      "id": "DatabaseId_xxxxx",
      "type": "RedshiftDatabase",
      "username": "username"
    },
    {
      "subnetId": "subnet-xxxxxx",
      "resourceRole": "DataPipelineDefaultResourceRole",
      "role": "DataPipelineDefaultRole",
      "name": "EC2",
      "id": "ResourceId_xxxxx",
      "type": "Ec2Resource"
    },
    {
      "failureAndRerunMode": "CASCADE",
      "resourceRole": "DataPipelineDefaultResourceRole",
      "role": "DataPipelineDefaultRole",
      "pipelineLogUri": "s3://test/logs/",
      "scheduleType": "ONDEMAND",
      "name": "Default",
      "id": "Default"
    },
    {
      "database": {
        "ref": "DatabaseId_xxxxxx"
      },
      "scriptUri": "s3://test/script.sql",
      "name": "SqlActivity",
      "scriptArgument": "#{format(minusDays(node.#scheduledStartTime,1),'MM-dd-YYYY')}",
      "id": "SqlActivityId_xxxxx",
      "runsOn": {
        "ref": "ResourceId_xxxx"
      },
      "type": "SqlActivity",
      "onSuccess": {
        "ref": "ActionId_xxxxx"
      }
    }
  ],
  "parameters": []
}
However, I keep getting the error: The column index is out of range: 1, number of columns: 0.
I just can't get it to work. I have tried using ?, $1, and I even tried putting the expression #{format(minusDays(node.#scheduledStartTime,1),'MM-dd-YYYY')} directly in the script. None of them works.
I have looked at the answers to Amazon Data Pipline: How to use a script argument in a SqlActivity? but none of them are helpful.
Does anyone have an idea how to use a script argument in a SQL script in AWS Data Pipeline?

json schema issue on required property

I need to write a JSON Schema based on the specification defined by http://json-schema.org/, but I'm struggling with the required/mandatory property validation. Below is the JSON schema I have written, in which all 3 properties are mandatory; in my case, however, only at least one of them should be mandatory. How can I do this?
{
  "id": "http://example.com/searchShops-schema#",
  "$schema": "http://json-schema.org/draft-04/schema#",
  "title": "searchShops Service",
  "description": "",
  "type": "object",
  "properties": {
    "city": {
      "type": "string"
    },
    "address": {
      "type": "string"
    },
    "zipCode": {
      "type": "integer"
    }
  },
  "required": ["city", "address", "zipCode"]
}
If your goal is to say "I want at least one member to exist", then use minProperties:
{
  "type": "object",
  "etc": "etc",
  "minProperties": 1
}
Note also that you can use "dependencies" to great effect if you also want additional constraints to exist when this or that member is present.
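For example (a hypothetical sketch, not part of the schema above), requiring city whenever zipCode is present can be expressed in draft-04 like this:
{
  "type": "object",
  "properties": {
    "city":    { "type": "string" },
    "address": { "type": "string" },
    "zipCode": { "type": "integer" }
  },
  "dependencies": {
    "zipCode": ["city"]
  }
}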
{
  ...
  "anyOf": [
    { "required": ["city"] },
    { "required": ["address"] },
    { "required": ["zipCode"] }
  ]
}
Or use "oneOf" if exactly one property should be present