Unable to load data in Druid

I am a newbie to Druid and am trying to load a very simple dataset in JSON format. The data contains just one dimension, one metric, and a timestamp. I have successfully loaded data into Druid for a different dataset, but somehow I am getting errors for this one.
This is my index file:
{
"type" : "index",
"spec" : {
"dataSchema" : {
"dataSource" : "datatemplate",
"parser" : {
"type" : "string",
"parseSpec" : {
"format" : "json",
"dimensionsSpec" : {
"dimensions" : [
"Loc"
]
},
"timestampSpec" : {
"format" : "auto",
"column" : "Timestamp"
}
}
},
"metricsSpec" : [{"name" : "Qty","type" : "doubleSum","fieldName" : "Qty"}],
"granularitySpec" : {
"type" : "uniform",
"segmentGranularity" : "day",
"queryGranularity" : "none",
"intervals" : ["2016-01-01T00:00:00Z/2030-06-30T00:00:00Z"],
"rollup" : true
}
},
"ioConfig" : {
"type" : "index",
"firehose" : {
"type" : "local",
"baseDir" : "datatemplate/",
"filter" : "datatemplate.json"
},
"appendToExisting" : false
},
"tuningConfig" : {
"type" : "index",
"targetPartitionSize" : 10000000,
"maxRowsInMemory" : 40000,
"forceExtendableShardSpecs" : true
}
}
}
Also here is my dataset in JSON format:
{"Loc": "A", "Qty": "1", "Timestamp": "2017-12-01T00:00:00Z"}
{"Loc": "A", "Qty": "1", "Timestamp": "2017-12-01T00:00:00Z"}
{"Loc": "B", "Qty": "2", "Timestamp": "2017-12-01T00:00:00Z"}
{"Loc": "B", "Qty": "1", "Timestamp": "2017-12-01T00:00:00Z"}

How to send data to Druid using the Tranquility core API?

I have set up Druid and was able to run the tutorial "Tutorial: Loading a file". I was also able to execute native JSON queries and get the results as described at http://druid.io/docs/latest/tutorials/tutorial-query.html. The Druid setup is working fine.
I now want to ingest additional data from a Java program into this datasource. Is it possible to send data into Druid using Tranquility from a Java program for a datasource created using batch load?
I tried the example program at https://github.com/druid-io/tranquility/blob/master/core/src/test/java/com/metamx/tranquility/example/JavaExample.java
But the program just keeps running and doesn't show any output. How can Druid be set up to accept data using the Tranquility core APIs?
Following are the ingestion spec and config file for Tranquility:
wikipedia-index.json
{
"type" : "index",
"spec" : {
"dataSchema" : {
"dataSource" : "wikipedia",
"parser" : {
"type" : "string",
"parseSpec" : {
"format" : "json",
"dimensionsSpec" : {
"dimensions" : [
"channel",
"cityName",
"comment",
"countryIsoCode",
"countryName",
"isAnonymous",
"isMinor",
"isNew",
"isRobot",
"isUnpatrolled",
"metroCode",
"namespace",
"page",
"regionIsoCode",
"regionName",
"user",
{ "name": "added", "type": "long" },
{ "name": "deleted", "type": "long" },
{ "name": "delta", "type": "long" }
]
},
"timestampSpec": {
"column": "time",
"format": "iso"
}
}
},
"metricsSpec" : [],
"granularitySpec" : {
"type" : "uniform",
"segmentGranularity" : "day",
"queryGranularity" : "none",
"intervals" : ["2015-09-12/2015-09-13"],
"rollup" : false
}
},
"ioConfig" : {
"type" : "index",
"firehose" : {
"type" : "local",
"baseDir" : "quickstart/",
"filter" : "wikiticker-2015-09-12-sampled.json.gz"
},
"appendToExisting" : false
},
"tuningConfig" : {
"type" : "index",
"targetPartitionSize" : 5000000,
"maxRowsInMemory" : 25000,
"forceExtendableShardSpecs" : true
}
}
}
example.json (tranquility config):
{
"dataSources" : [
{
"spec" : {
"dataSchema" : {
"dataSource" : "wikipedia",
"metricsSpec" : [
{ "type" : "count", "name" : "count" }
],
"granularitySpec" : {
"segmentGranularity" : "hour",
"queryGranularity" : "none",
"type" : "uniform"
},
"parser" : {
"type" : "string",
"parseSpec" : {
"format" : "json",
"timestampSpec" : { "column": "time", "format": "iso" },
"dimensionsSpec" : {
"dimensions" : ["channel",
"cityName",
"comment",
"countryIsoCode",
"countryName",
"isAnonymous",
"isMinor",
"isNew",
"isRobot",
"isUnpatrolled",
"metroCode",
"namespace",
"page",
"regionIsoCode",
"regionName",
"user",
{ "name": "added", "type": "long" },
{ "name": "deleted", "type": "long" },
{ "name": "delta", "type": "long" }]
}
}
}
},
"tuningConfig" : {
"type" : "realtime",
"windowPeriod" : "PT10M",
"intermediatePersistPeriod" : "PT10M",
"maxRowsInMemory" : "100000"
}
},
"properties" : {
"task.partitions" : "1",
"task.replicants" : "1"
}
}
],
"properties" : {
"zookeeper.connect" : "localhost"
}
}
I did not find any example of setting up a datasource in Druid that continuously accepts data from a Java program. I don't want to use Kafka. Any pointers on this would be greatly appreciated.
You need to create the data files with the additional data first and then run the ingestion task with the new fields. You can't edit an existing record in Druid; it is overwritten by the new record.
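As a minimal sketch of that re-ingestion approach (the file name additional-data.json is a placeholder, not from the original post), the batch index task's ioConfig can point at the file containing the new data. With appendToExisting set to true the new rows are added to the datasource, while false, as in the spec above, replaces the segments for the covered interval:
"ioConfig" : {
  "type" : "index",
  "firehose" : {
    "type" : "local",
    "baseDir" : "quickstart/",
    "filter" : "additional-data.json"
  },
  "appendToExisting" : true
}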

Swagger file for all possible 'properties'

I need to call an API to load data on an ongoing basis. The API returns different properties for each event. When I create the swagger file it contains the properties from a sample return, but in the long run more properties can be added by the source system, and they will not be in the swagger file.
Is there any way to dynamically recreate the swagger file with the additional properties before each data load?
The swagger file is generated by Informatica Cloud based on a sample return while testing the connection.
The properties list has a different number of entries depending on the event type.
swagger file:
{"swagger" : "2.0",
"info" : {
"description" : null,
"version" : "1.0.0",
"title" : null,
"termsOfService" : null,
"contact" : null,
"license" : null
},
"host" : "<host>.com",
"basePath" : "/api",
"schemes" : [ "https" ],
"paths" : {
"/2.0" : {
"post" : {
"tags" : [ "events" ],
"summary" : null,
"description" : null,
"operationId" : "events",
"produces" : [ "application/json" ],
"consumes" : [ "application/json" ],
"parameters" : [ {
"name" : "script",
"in" : "query",
"description" : null,
"required" : false,
"type" : "string"
}, {
"name" : "Authorization",
"in" : "header",
"description" : null,
"required" : false,
"type" : "string"
} ],
"responses" : {
"200" : {
"description" : "successful operation",
"schema" : {
"$ref" : "#/definitions/events"
}
}
}
}
}
},
"definitions" : {
"events##properties" : {
"properties" : {
"$app_build_number" : {
"type" : "string"
},
"$app_version_string" : {
"type" : "string"
},
"$carrier" : {
"type" : "string"
},
"$lib_version" : {
"type" : "string"
},
"$manufacturer" : {
"type" : "string"
},
"$model" : {
"type" : "string"
},
"$os" : {
"type" : "string"
},
"$os_version" : {
"type" : "string"
},
"$radio" : {
"type" : "string"
},
"$region" : {
"type" : "string"
},
"$screen_height" : {
"type" : "number",
"format" : "int32"
},
"$screen_width" : {
"type" : "number",
"format" : "int32"
},
"Home Step Enabled" : {
"type" : "string"
},
"Number Of Lifetime Logins" : {
"type" : "number",
"format" : "int32"
},
"Sessions" : {
"type" : "number",
"format" : "int32"
},
"mp_country_code" : {
"type" : "string"
},
"mp_lib" : {
"type" : "string"
}
}
},
"events" : {
"properties" : {
"name" : {
"type" : "string"
},
"distinct_id" : {
"type" : "string"
},
"labels" : {
"type" : "string"
},
"time" : {
"type" : "number",
"format" : "int64"
},
"sampling_factor" : {
"type" : "number",
"format" : "int32"
},
"dataset" : {
"type" : "string"
},
"properties" : {
"$ref" : "#/definitions/events##properties"
}
}
}
}
}
sample return:
{
"name": "Session",
"distinct_id": "1234567890",
"labels": [],
"time": 1520072505000,
"sampling_factor": 1,
"dataset": "$event_data_set",
"properties": {
"$app_build_number": "900",
"$app_version_string": "1.9",
"$carrier": "AT&T",
"$lib_version": "2.0.1",
"$manufacturer": "Apple",
"$model": "iPhone10,6",
"$os": "iOS",
"$os_version": "11.2.6",
"$radio": "LTE",
"$region": "Florida",
"$screen_height": 667,
"$screen_width": 375,
"Number Of Lifetime Logins": 2,
"Session Length": "00h:00m:08s",
"Sessions": 43,
"mp_country_code": "US",
"mp_lib": "swift"
}
}
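For reference, Swagger 2.0 lets a definition accept keys that are not listed explicitly via additionalProperties. Below is a rough sketch of the events##properties definition with that flag; the string typing is purely illustrative, and whether the Informatica Cloud connector honors additionalProperties is an assumption that would need to be verified:
"events##properties" : {
  "properties" : {
    "$os" : { "type" : "string" },
    "mp_lib" : { "type" : "string" }
  },
  "additionalProperties" : { "type" : "string" }
}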

Druid: how to add numeric data as a metric without an aggregation function

The scenario is that I want to set up a stock quote server and save the quote data into Druid.
My requirement is to get the latest price of all the stocks with a single query.
But I notice that the query interfaces of Druid, such as timeseries, only work on metric fields, not on dimension fields.
So I am considering making the price field one of the metrics, but without aggregating it.
How can I do that?
Any suggestions?
Here is my Tranquility config file.
{
"dataSources" : {
"stock-index-topic" : {
"spec" : {
"dataSchema" : {
"dataSource" : "stock-index-topic",
"parser" : {
"type" : "string",
"parseSpec" : {
"timestampSpec" : {
"column" : "timestamp",
"format" : "auto"
},
"dimensionsSpec" : {
"dimensions" : ["code","name","acronym","market","tradeVolume","totalValueTraded","preClosePx","openPrice","highPrice","lowPrice","latestPrice","closePx"],
"dimensionExclusions" : [
"timestamp",
"value"
]
},
"format" : "json"
}
},
"granularitySpec" : {
"type" : "uniform",
"segmentGranularity" : "HOUR",
"queryGranularity" : "SECOND",
},
"metricsSpec" : [
{
"name" : "firstPrice",
"type" : "doubleFirst",
"fieldName" : "tradePrice"
},{
"name" : "lastPrice",
"type" : "doubleLast",
"fieldName" : "tradePrice"
}, {
"name" : "minPrice",
"type" : "doubleMin",
"fieldName" : "tradePrice"
}, {
"name" : "maxPrice",
"type" : "doubleMax",
"fieldName" : "tradePrice"
}
]
},
"ioConfig" : {
"type" : "realtime"
},
"tuningConfig" : {
"type" : "realtime",
"maxRowsInMemory" : "100000",
"intermediatePersistPeriod" : "PT10M",
"windowPeriod" : "PT10M"
}
},
"properties" : {
"task.partitions" : "1",
"task.replicants" : "1",
"topicPattern" : "stock-index-topic"
}
}
},
"properties" : {
"zookeeper.connect" : "localhost:2181",
"druid.discovery.curator.path" : "/druid/discovery",
"druid.selectors.indexing.serviceName" : "druid/overlord",
"commit.periodMillis" : "15000",
"consumer.numThreads" : "2",
"kafka.zookeeper.connect" : "localhost:2181",
"kafka.group.id" : "tranquility-kafka"
}
}
I think you should make latest_price a new numeric dimension; it would be much better from a performance and querying standpoint, considering how Druid works.
Metrics are meant to be aggregated at their core, so they won't be helpful in your use case.
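A rough sketch of that suggestion against the dimensionsSpec above (assuming a Druid version with numeric dimension support; only a few of the original dimensions are repeated here for brevity):
"dimensionsSpec" : {
  "dimensions" : [
    "code",
    "name",
    "market",
    { "name" : "latestPrice", "type" : "double" }
  ],
  "dimensionExclusions" : [
    "timestamp",
    "value"
  ]
}
Declared this way, latestPrice is stored as a numeric column that can be selected and filtered directly, without going through an aggregator.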

Exception thrown when inserting data into Druid via Tranquility

I'm pushing a Kafka stream into Druid via Tranquility.
The Kafka version is 0.9.1, Tranquility is 0.8, and Druid is 0.10.
Tranquility starts fine when no messages are produced, but when the producer sends a message I get a JsonMappingException like this:
java.lang.IllegalArgumentException: Can not deserialize instance of java.util.ArrayList out of VALUE_STRING token
at [Source: N/A; line: -1, column: -1]
at com.fasterxml.jackson.databind.ObjectMapper._convert(ObjectMapper.java:2774) ~[com.fasterxml.jackson.core.jackson-databind-2.4.6.jar:2.4.6]
at com.fasterxml.jackson.databind.ObjectMapper.convertValue(ObjectMapper.java:2700) ~[com.fasterxml.jackson.core.jackson-databind-2.4.6.jar:2.4.6]
at com.metamx.tranquility.druid.DruidBeams$.makeFireDepartment(DruidBeams.scala:406) ~[io.druid.tranquility-core-0.8.0.jar:0.8.0]
at com.metamx.tranquility.druid.DruidBeams$.fromConfigInternal(DruidBeams.scala:291) ~[io.druid.tranquility-core-0.8.0.jar:0.8.0]
at com.metamx.tranquility.druid.DruidBeams$.fromConfig(DruidBeams.scala:199) ~[io.druid.tranquility-core-0.8.0.jar:0.8.0]
at com.metamx.tranquility.kafka.KafkaBeamUtils$.createTranquilizer(KafkaBeamUtils.scala:40) ~[io.druid.tranquility-kafka-0.8.0.jar:0.8.0]
at com.metamx.tranquility.kafka.KafkaBeamUtils.createTranquilizer(KafkaBeamUtils.scala) ~[io.druid.tranquility-kafka-0.8.0.jar:0.8.0]
at com.metamx.tranquility.kafka.writer.TranquilityEventWriter.<init>(TranquilityEventWriter.java:64) ~[io.druid.tranquility-kafka-0.8.0.jar:0.8.0]
at com.metamx.tranquility.kafka.writer.WriterController.createWriter(WriterController.java:171) ~[io.druid.tranquility-kafka-0.8.0.jar:0.8.0]
at com.metamx.tranquility.kafka.writer.WriterController.getWriter(WriterController.java:98) ~[io.druid.tranquility-kafka-0.8.0.jar:0.8.0]
at com.metamx.tranquility.kafka.KafkaConsumer$2.run(KafkaConsumer.java:231) ~[io.druid.tranquility-kafka-0.8.0.jar:0.8.0]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [na:1.7.0_67]
at java.util.concurrent.FutureTask.run(FutureTask.java:262) [na:1.7.0_67]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_67]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_67]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_67]
And my kafka.json is:
{
"dataSources" : {
"stock-index-topic" : {
"spec" : {
"dataSchema" : {
"dataSource" : "stock-index-topic",
"parser" : {
"type" : "string",
"parseSpec" : {
"timestampSpec" : {
"column" : "timestamp",
"format" : "auto"
},
"dimensionsSpec" : {
"dimensions" : ["code","name","acronym","market","tradeVolume","totalValueTraded","preClosePx","openPrice","highPrice","lowPrice","tradePrice","closePx","timestamp"],
"dimensionExclusions" : [
"timestamp",
"value"
]
},
"format" : "json"
}
},
"granularitySpec" : {
"type" : "uniform",
"segmentGranularity" : "DAY",
"queryGranularity" : "none",
"intervals":"no"
},
"metricsSpec" : [
{
"name" : "firstPrice",
"type" : "doubleFirst",
"fieldName" : "tradePrice"
},{
"name" : "lastPrice",
"type" : "doubleLast",
"fieldName" : "tradePrice"
}, {
"name" : "minPrice",
"type" : "doubleMin",
"fieldName" : "tradePrice"
}, {
"name" : "maxPrice",
"type" : "doubleMax",
"fieldName" : "tradePrice"
}
]
},
"ioConfig" : {
"type" : "realtime"
},
"tuningConfig" : {
"type" : "realtime",
"maxRowsInMemory" : "100000",
"intermediatePersistPeriod" : "PT10M",
"windowPeriod" : "PT10M"
}
},
"properties" : {
"task.partitions" : "1",
"task.replicants" : "1",
"topicPattern" : "stock-index-topic"
}
}
},
"properties" : {
"zookeeper.connect" : "localhost:2181",
"druid.discovery.curator.path" : "/druid/discovery",
"druid.selectors.indexing.serviceName" : "druid/overlord",
"commit.periodMillis" : "15000",
"consumer.numThreads" : "2",
"kafka.zookeeper.connect" : "localhost:2181",
"kafka.group.id" : "tranquility-kafka"
}
}
I used kafka-console-consumer to check the data; it looks like this:
{"code": "399982", "name": "500等权", "acronym": "500DQ", "market": "102", "tradeVolume": 0, "totalValueTraded": 0.0, "preClosePx": 0.0, "openPrice": 0.0, "highPrice": 0.0, "lowPrice": 0.0, "tradePrice": 7184.7142, "closePx": 0.0, "timestamp": "2017-05-16T09:06:39.000+08:00"}
Any idea why? Thanks.
"metricsSpec" : [
{
"name" : "firstPrice",
"type" : "doubleFirst",
"fieldName" : "tradePrice"
},{
"name" : "lastPrice",
"type" : "doubleLast",
"fieldName" : "tradePrice"
}, {
"name" : "minPrice",
"type" : "doubleMin",
"fieldName" : "tradePrice"
}, {
"name" : "maxPrice",
"type" : "doubleMax",
"fieldName" : "tradePrice"
}
]
},
It's wrong. The documentation says:
First and Last aggregator cannot be used in ingestion spec, and should only be specified as part of queries.
So the issue is solved.
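For completeness, a minimal sketch of using doubleLast at query time instead (the dataSource and column names are taken from the spec above, the interval is just an example, and this assumes tradePrice is available as a numeric column, e.g. ingested as a metric or numeric dimension):
{
  "queryType" : "timeseries",
  "dataSource" : "stock-index-topic",
  "granularity" : "all",
  "intervals" : [ "2017-05-16T00:00:00+08:00/2017-05-17T00:00:00+08:00" ],
  "aggregations" : [
    { "type" : "doubleLast", "name" : "lastPrice", "fieldName" : "tradePrice" }
  ]
}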

Elasticsearch index operation fails on complex object

I am indexing a data stream into Elasticsearch and I cannot figure out how to normalize the incoming data so that it indexes without error. I have a mapping type "getdatavalues", which is a meta-data query. This meta-data query can return very different-looking responses, but I'm not seeing the difference. The error I get:
{"index":{"_index":"ens_event-2016.03.11","_type":"getdatavalues","_id":"865800029798177_2016_03_11_03_18_12_100037","status":400,"error":"MapperParsingException[object mapping for [getdatavalues] tried to parse field [output] as object, but got EOF, has a concrete value been provided to it?]"}}
when performing:
curl -XPUT 'http://192.168.99.100:80/es/ens_event-2016.03.11/getdatavalues/865800029798177_2016_03_11_03_18_12_100037' -d '{
"type": "getDataValues",
"input": {
"deviceID": {
"IMEI": "865800029798177",
"serial-number": "64180258"
},
"handle": 644,
"exprCode": "200000010300140000080001005f00a700000000000000",
"noRollHandle": "478669308-578452",
"transactionID": 290
},
"timestamp": "2016-03-11T03:18:12.000Z",
"handle": 644,
"output": {
"noRollPubSessHandle": "478669308-578740",
"publishSessHandle": 1195,
"status": true,
"matchFilter": {
"prefix": "publicExpr.operatorDefined.commercialIdentifier.FoodSvcs.Restaurant.\"A&C Kabul Curry\".\"Rooster Street\"",
"argValues": {
"event": "InternationalEvent",
"hasEvent": "anyEvent"
}
},
"transactionID": 290,
"validFor": 50
}
}'
Here's what Elasticsearch has for the mapping:
"getdatavalues" : {
"dynamic_templates" : [ {
"strings" : {
"mapping" : {
"index" : "not_analyzed",
"type" : "string"
},
"match_mapping_type" : "string"
}
} ],
"properties" : {
"handle" : {
"type" : "long"
},
"input" : {
"properties" : {
"deviceID" : {
"properties" : {
"IMEI" : {
"type" : "string",
"index" : "not_analyzed"
},
"serial-number" : {
"type" : "string",
"index" : "not_analyzed"
}
}
},
"exprCode" : {
"type" : "string",
"index" : "not_analyzed"
},
"handle" : {
"type" : "long"
},
"noRollHandle" : {
"type" : "string",
"index" : "not_analyzed"
},
"serviceVersion" : {
"type" : "string",
"index" : "not_analyzed"
},
"transactionID" : {
"type" : "long"
}
}
},
"output" : {
"properties" : {
"matchFilter" : {
"properties" : {
"argValues" : {
"properties" : {
"Interests" : {
"type" : "object"
},
"MerchantId" : {
"type" : "string",
"index" : "not_analyzed"
},
"Queue" : {
"type" : "string",
"index" : "not_analyzed"
},
"Vibe" : {
"type" : "string",
"index" : "not_analyzed"
},
"event" : {
"properties" : {
"event" : {
"type" : "string",
"index" : "not_analyzed"
},
"hasEvent" : {
"type" : "string",
"index" : "not_analyzed"
}
}
},
"hasEvent" : {
"type" : "string",
"index" : "not_analyzed"
},
"interests" : {
"type" : "string",
"index" : "not_analyzed"
}
}
},
"prefix" : {
"type" : "string",
"index" : "not_analyzed"
},
"transactionID" : {
"type" : "long"
},
"validFor" : {
"type" : "long"
}
}
},
"noRollPubSessHandle" : {
"type" : "string",
"index" : "not_analyzed"
},
"publishSessHandle" : {
"type" : "long"
},
"status" : {
"type" : "boolean"
},
"transactionID" : {
"type" : "long"
},
"validFor" : {
"type" : "long"
}
}
},
"timestamp" : {
"type" : "date",
"format" : "dateOptionalTime"
},
"type" : {
"type" : "string",
"index" : "not_analyzed"
}
}
},
Looks like the argValues object doesn't quite agree with your mapping:
"argValues": {
"event": "InternationalEvent",
"hasEvent": "anyEvent"
}
Either this:
"argValues": {
"event": {
"event": "InternationalEvent"
},
"hasEvent": "anyEvent"
}
Or this:
"argValues": {
"event": {
"event": "InternationalEvent"
"hasEvent": "anyEvent"
},
}
Both would seem to be valid.
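If argValues can legitimately carry arbitrarily shaped keys, another option (a sketch only; it uses the same 1.x/2.x mapping syntax as above and requires reindexing) is to map it with enabled set to false, so the object is kept in _source but its contents are not parsed or indexed, and type conflicts inside it no longer cause mapper parsing exceptions:
"argValues" : {
  "type" : "object",
  "enabled" : false
}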