Kafka connect Stream data from mongodb to elasticsearch - mongodb

I am trying to stream data from mongodb to elasticsearch using kafka connect.
The data that is stream into kafka by mongodb connector is as given below
{
"updatedAt" : {
"$date" : 1591596275939
},
"createdAt" : {
"$date" : 1362162600000
},
"name" : "my name",
"_id" : {
"$oid" : "5ee0cc7e0c3273f3d4a3c20f"
},
"documentId" : "mydoc1",
"age" : 20,
"language" : "English",
"validFrom" : {
"$date" : 978307200000
},
"remarks" : [
"remarks"
],
"married" : false
}
I have below two problem while saving data to elasticsearch
_id is an object and I want to use "documentId" key as _id instead in elasticsearch
Dates are an object with $date key which I cant't figure out how we can convert to normal date.
Can anyone please point me to the right direction regarding above two issues.
Mongodb source config
{
"tasks.max" : "5",
"change.stream.full.document" : "updateLookup",
"name" : "mongodb-source",
"value.converter" : "org.apache.kafka.connect.storage.StringConverter",
"collection" : "collection",
"poll.max.batch.size" : "1000",
"connector.class" : "com.mongodb.kafka.connect.MongoSourceConnector",
"batch.size" : "1000",
"key.converter" : "org.apache.kafka.connect.storage.StringConverter",
"key.converter.schemas.enable":"false",
"value.converter.schemas.enable":"false",
"connection.uri" : "mongodb://connection",
"publish.full.document.only" : "true",
"database" : "databasename",
"poll.await.time.ms" : "5000",
"topic.prefix" : "mongodb"
}
Elastic sink config
{
"write.method" : "upsert",
"errors.deadletterqueue.context.headers.enable" : "true",
"name" : "elasticsearch-sink",
"connection.password" : "password",
"topic.index.map" : "mongodb.databasename.collection:elasticindexname",
"connection.url" : "http://localhost:9200",
"errors.log.enable" : "true",
"flush.timeout.ms" : "20000",
"errors.log.include.messages" : "true",
"key.ignore" : "false",
"type.name" : "_doc",
"key.converter" : "org.apache.kafka.connect.json.JsonConverter",
"value.converter" : "org.apache.kafka.connect.json.JsonConverter",
"key.converter.schemas.enable":"false",
"value.converter.schemas.enable":"false",
"tasks.max" : "1",
"batch.size" : "100",
"schema.ignore" : "true",
"schema.enable" : "false",
"connector.class" : "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
"read.timeout.ms" : "6000",
"connection.username" : "elastic",
"topics" : "mongodb.databasename.collection",
"proxy.host": "localhost",
"proxy.port": "8080"
}
Exception
Caused by: org.apache.kafka.connect.errors.DataException: MAP is not supported as the document id.
at io.confluent.connect.elasticsearch.DataConverter.convertKey(DataConverter.java:107)
at io.confluent.connect.elasticsearch.DataConverter.convertRecord(DataConverter.java:182)
at io.confluent.connect.elasticsearch.ElasticsearchWriter.tryWriteRecord(ElasticsearchWriter.java:291)
at io.confluent.connect.elasticsearch.ElasticsearchWriter.write(ElasticsearchWriter.java:276)
at io.confluent.connect.elasticsearch.ElasticsearchSinkTask.put(ElasticsearchSinkTask.java:174)
at org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:538)
... 10 more
Connector Links :
https://docs.mongodb.com/kafka-connector/master/kafka-source/
https://docs.confluent.io/current/connect/kafka-connect-elasticsearch

Related

Debezium mongodb kafka connector not producing some of records in topic as it is in mongodb

In my mongodb there i have this data
mongo01:PRIMARY> db.col.find({"_id" : ObjectId("5d8777f188fef5555b")})
{ "_id" : ObjectId("5d8777f188fef5555b"), "attachments" : [ { "name" : "Je", "src" : "https://google.co", "type" : "image/png" } ], "tags" : [ 51, 52 ], "last_comment" : [ ], "hashtags" : [ "Je" ], "badges" : [ ], "feed_id" : "1", "company_id" : 1, "message" : "aJsm9LtK", "group_id" : "106", "feed_type" : "post", "thumbnail" : "", "group_tag" : false, "like_count" : 0, "clap_count" : 0, "comment_count" : 0, "created_by" : 520, "created_at" : "1469577278628", "updated_at" : "1469577278628", "status" : 1, "__v" : 0 }
mongo01:PRIMARY> db.col.find({"_id" : ObjectId("5d285b4554e3b584bf97759")})
{ "_id" : ObjectId("5d285b4554e3b584bf97759"), "attachments" : [ ], "tags" : [ ], "last_comment" : [ ], "company_id" : 1, "group_id" : "00e35289", "feed_type" : "post", "group_tag" : false, "status" : 1, "feed_id" : "3dc44", "thumbnail" : "{}", "message" : "s2np1HYrPuFF", "created_by" : 1, "html_content" : "", "created_at" : "144687057949", "updated_at" : "144687057949", "like_count" : 0, "clap_count" : 0, "comment_count" : 0, "__v" : 0, "badges" : [ ], "hashtags" : [ ] }
I am using this debezium mongodb connector in order to get the mongodb data in kafka topic.
curl -i -X POST -H "Accept:application/json" -H "Content-Type:application/json"
http://localhost:8083/connectors/ -d '{
"name": "mongo_connector-4",
"config": {
"connector.class": "io.debezium.connector.mongodb.MongoDbConnector",
"mongodb.hosts": "mongo01/localhost:27017",
"mongodb.name": "mongo_1",
"collection.whitelist": "data.col",
"key.converter.schemas.enable": false,
"value.converter.schemas.enable": false,
"key.converter": "org.apache.kafka.connect.json.JsonConverter",
"value.converter": "org.apache.kafka.connect.json.JsonConverter",
"transforms" : "unwrap",
"transforms.unwrap.type" : "io.debezium.connector.mongodb.transforms.UnwrapFromMongoDbEnvelope",
"transforms.unwrap.drop.tombstones" : "false",
"transforms.unwrap.delete.handling.mode" : "drop",
"transforms.unwrap.operation.header" : "true",
"errors.tolerance" : "all",
"snapshot.delay.ms":"120000",
"poll.interval.ms":"3000",
"heartbeat.interval.ms":"90000"
}
}'
now while printing the topic in ksql i am getting that for some records data came with all columns(as it was in mongodb) while for some records some columns
are missing.
ksql> print 'mongo_1.data.col' from beginning;
Format:JSON
{"ROWTIME":1571148520736,"ROWKEY":"{\"id\":\"5d8777f188fef5555b\"}","attachments":[{"name":"Je","src":"https://google.co","type":"image/png"}],"tags":[51,52],"last_comment":[],"hashtags":[],"badges":[],"feed_id":"1","company_id":1,"message":"aJsm9LtK","group_id":"106","feed_type":"post","thumbnail":"","group_tag":false,"like_count":0,"clap_count":0,"comment_count":0,"created_by":520,"created_at":"1469577278628","updated_at":"1469577278628","status":1,"__v":0,"id":"5d8777f188fef5555b"}
{"ROWTIME":1571148520736,"ROWKEY":"{\"id\":\"5d285b4554e3b584bf97759\"}","badges":[],"hashtags":[],"id":"5d285b4554e3b584bf97759"}
Why this is happening and how to resolve this issue?
PS: the only difference i found that both records have different order of columns.
While searching about this issue only close thing i found here https://github.com/hpgrahsl/kafka-connect-mongodb
something they are saying about post-processing and redacting fields which have sensitive data. But as you can see both mine records are similar and have no sensitive data(by sensitive data i mean encrypted data, maybe they meant something else).
Are not the missing values after updates? Don't forget that MongoDB connector provides patch for updates not after - https://debezium.io/documentation/reference/0.10/connectors/mongodb.html#change-events-value
If you need to construct full format after in case of MongoDB you need to introduce a Kafka Streams pipeline that would store the event after insert into a persistent store and then merge the patch with the original insert to create the final event.

Unable to load data in druid

I am a newbie in druid. Trying to load a very simple data in JSON format to druid. The data contains just one dimension, one metric and timestamp. I have been successfully able to load data to druid for a different dataset but somehow I am getting errors for this dataset.
This is my index file :
{
"type" : "index",
"spec" : {
"dataSchema" : {
"dataSource" : "datatemplate",
"parser" : {
"type" : "string",
"parseSpec" : {
"format" : "json",
"dimensionsSpec" : {
"dimensions" : [
"Loc"
]
},
"timestampSpec" : {
"format" : "auto",
"column" : "Timestamp"
}
}
},
"metricsSpec" : [{"name" : "Qty","type" : "doubleSum","fieldName" : "Qty"}],
"granularitySpec" : {
"type" : "uniform",
"segmentGranularity" : "day",
"queryGranularity" : "none",
"intervals" : ["2016-01-01T00:00:00Z/2030-06-30T00:00:00Z"],
"rollup" : true
}
},
"ioConfig" : {
"type" : "index",
"firehose" : {
"type" : "local",
"baseDir" : "datatemplate/",
"filter" : "datatemplate.json"
},
"appendToExisting" : false
},
"tuningConfig" : {
"type" : "index",
"targetPartitionSize" : 10000000,
"maxRowsInMemory" : 40000,
"forceExtendableShardSpecs" : true
}
}
}
Also here is my dataset in JSON format:
{"Loc": "A", "Qty": "1", "Timestamp": "2017-12-01T00:00:00Z"}
{"Loc": "A", "Qty": "1", "Timestamp": "2017-12-01T00:00:00Z"}
{"Loc": "B", "Qty": "2", "Timestamp": "2017-12-01T00:00:00Z"}
{"Loc": "B", "Qty": "1", "Timestamp": "2017-12-01T00:00:00Z"}

Druid:how to add a numeric data to metric without aggregation function

The scenario is i want to setup a stock quote server and save the quote data into druid.
my requirement is to get the latest price of all the stock by a query.
But i notice that the query interface of druid such as time series only work on metrics filed ,not the dimension fields.
so i consider to make the price filed one of the metrics,but no need to aggregated.
how can i do it?
Any suggestions?
here is my tranquility config file.
{
"dataSources" : {
"stock-index-topic" : {
"spec" : {
"dataSchema" : {
"dataSource" : "stock-index-topic",
"parser" : {
"type" : "string",
"parseSpec" : {
"timestampSpec" : {
"column" : "timestamp",
"format" : "auto"
},
"dimensionsSpec" : {
"dimensions" : ["code","name","acronym","market","tradeVolume","totalValueTraded","preClosePx","openPrice","highPrice","lowPrice","latestPrice","closePx"],
"dimensionExclusions" : [
"timestamp",
"value"
]
},
"format" : "json"
}
},
"granularitySpec" : {
"type" : "uniform",
"segmentGranularity" : "HOUR",
"queryGranularity" : "SECOND",
},
"metricsSpec" : [
{
"name" : "firstPrice",
"type" : "doubleFirst",
"fieldName" : "tradePrice"
},{
"name" : "lastPrice",
"type" : "doubleLast",
"fieldName" : "tradePrice"
}, {
"name" : "minPrice",
"type" : "doubleMin",
"fieldName" : "tradePrice"
}, {
"name" : "maxPrice",
"type" : "doubleMax",
"fieldName" : "tradePrice"
}
]
},
"ioConfig" : {
"type" : "realtime"
},
"tuningConfig" : {
"type" : "realtime",
"maxRowsInMemory" : "100000",
"intermediatePersistPeriod" : "PT10M",
"windowPeriod" : "PT10M"
}
},
"properties" : {
"task.partitions" : "1",
"task.replicants" : "1",
"topicPattern" : "stock-index-topic"
}
}
},
"properties" : {
"zookeeper.connect" : "localhost:2181",
"druid.discovery.curator.path" : "/druid/discovery",
"druid.selectors.indexing.serviceName" : "druid/overlord",
"commit.periodMillis" : "15000",
"consumer.numThreads" : "2",
"kafka.zookeeper.connect" : "localhost:2181",
"kafka.group.id" : "tranquility-kafka"
}
}
I think you should make [latest_price] as new numeric dimension, it would be much better from performance and querying standpoint considering how druid works.
Metrics and meant to perform aggregation functions as core so won't be helpful in your use case.

Exception throws when insert data into druid via tranquility

i'm pushing kafka stream into druid via tranquility.
kafka version is 0.9.1 , tranquility is 0.8 , druid is 0.10.
tranquility is started fine when no message produced,but when producer sending message i will get JsonMappingException like this:
ava.lang.IllegalArgumentException: Can not deserialize instance of java.util.ArrayList out of VALUE_STRING token
at [Source: N/A; line: -1, column: -1]
at com.fasterxml.jackson.databind.ObjectMapper._convert(ObjectMapper.java:2774) ~[com.fasterxml.jackson.core.jackson-databind-2.4.6.jar:2.4.6]
at com.fasterxml.jackson.databind.ObjectMapper.convertValue(ObjectMapper.java:2700) ~[com.fasterxml.jackson.core.jackson-databind-2.4.6.jar:2.4.6]
at com.metamx.tranquility.druid.DruidBeams$.makeFireDepartment(DruidBeams.scala:406) ~[io.druid.tranquility-core-0.8.0.jar:0.8.0]
at com.metamx.tranquility.druid.DruidBeams$.fromConfigInternal(DruidBeams.scala:291) ~[io.druid.tranquility-core-0.8.0.jar:0.8.0]
at com.metamx.tranquility.druid.DruidBeams$.fromConfig(DruidBeams.scala:199) ~[io.druid.tranquility-core-0.8.0.jar:0.8.0]
at com.metamx.tranquility.kafka.KafkaBeamUtils$.createTranquilizer(KafkaBeamUtils.scala:40) ~[io.druid.tranquility-kafka-0.8.0.jar:0.8.0]
at com.metamx.tranquility.kafka.KafkaBeamUtils.createTranquilizer(KafkaBeamUtils.scala) ~[io.druid.tranquility-kafka-0.8.0.jar:0.8.0]
at com.metamx.tranquility.kafka.writer.TranquilityEventWriter.<init>(TranquilityEventWriter.java:64) ~[io.druid.tranquility-kafka-0.8.0.jar:0.8.0]
at com.metamx.tranquility.kafka.writer.WriterController.createWriter(WriterController.java:171) ~[io.druid.tranquility-kafka-0.8.0.jar:0.8.0]
at com.metamx.tranquility.kafka.writer.WriterController.getWriter(WriterController.java:98) ~[io.druid.tranquility-kafka-0.8.0.jar:0.8.0]
at com.metamx.tranquility.kafka.KafkaConsumer$2.run(KafkaConsumer.java:231) ~[io.druid.tranquility-kafka-0.8.0.jar:0.8.0]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) [na:1.7.0_67]
at java.util.concurrent.FutureTask.run(FutureTask.java:262) [na:1.7.0_67]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_67]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_67]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_67]
and my kafka.json is :
{
"dataSources" : {
"stock-index-topic" : {
"spec" : {
"dataSchema" : {
"dataSource" : "stock-index-topic",
"parser" : {
"type" : "string",
"parseSpec" : {
"timestampSpec" : {
"column" : "timestamp",
"format" : "auto"
},
"dimensionsSpec" : {
"dimensions" : ["code","name","acronym","market","tradeVolume","totalValueTraded","preClosePx","openPrice","highPrice","lowPrice","tradePrice","closePx","timestamp"],
"dimensionExclusions" : [
"timestamp",
"value"
]
},
"format" : "json"
}
},
"granularitySpec" : {
"type" : "uniform",
"segmentGranularity" : "DAY",
"queryGranularity" : "none",
"intervals":"no"
},
"metricsSpec" : [
{
"name" : "firstPrice",
"type" : "doubleFirst",
"fieldName" : "tradePrice"
},{
"name" : "lastPrice",
"type" : "doubleLast",
"fieldName" : "tradePrice"
}, {
"name" : "minPrice",
"type" : "doubleMin",
"fieldName" : "tradePrice"
}, {
"name" : "maxPrice",
"type" : "doubleMax",
"fieldName" : "tradePrice"
}
]
},
"ioConfig" : {
"type" : "realtime"
},
"tuningConfig" : {
"type" : "realtime",
"maxRowsInMemory" : "100000",
"intermediatePersistPeriod" : "PT10M",
"windowPeriod" : "PT10M"
}
},
"properties" : {
"task.partitions" : "1",
"task.replicants" : "1",
"topicPattern" : "stock-index-topic"
}
}
},
"properties" : {
"zookeeper.connect" : "localhost:2181",
"druid.discovery.curator.path" : "/druid/discovery",
"druid.selectors.indexing.serviceName" : "druid/overlord",
"commit.periodMillis" : "15000",
"consumer.numThreads" : "2",
"kafka.zookeeper.connect" : "localhost:2181",
"kafka.group.id" : "tranquility-kafka"
}
}
i use the kafka-console-consumer to get the data ,it looks like
{"code": "399982", "name": "500等权", "acronym": "500DQ", "market": "102", "tradeVolume": 0, "totalValueTraded": 0.0, "preClosePx": 0.0, "openPrice": 0.0, "highPrice": 0.0, "lowPrice": 0.0, "tradePrice": 7184.7142, "closePx": 0.0, "timestamp": "2017-05-16T09:06:39.000+08:00"}
Any idea why? Thanks.
"metricsSpec" : [
{
"name" : "firstPrice",
"type" : "doubleFirst",
"fieldName" : "tradePrice"
},{
"name" : "lastPrice",
"type" : "doubleLast",
"fieldName" : "tradePrice"
}, {
"name" : "minPrice",
"type" : "doubleMin",
"fieldName" : "tradePrice"
}, {
"name" : "maxPrice",
"type" : "doubleMax",
"fieldName" : "tradePrice"
}
]
},
it's wrong.The document said :
First and Last aggregator cannot be used in ingestion spec, and should only be specified as part of queries.
So,the issue is solved.

How to insert data into druid via tranquility

By following tutorial at http://druid.io/docs/latest/tutorials/tutorial-loading-streaming-data.html , I was able to insert data into druid via Kafka console
Kafka console
The spec file looks as following
examples/indexing/wikipedia.spec
[
{
"dataSchema" : {
"dataSource" : "wikipedia",
"parser" : {
"type" : "string",
"parseSpec" : {
"format" : "json",
"timestampSpec" : {
"column" : "timestamp",
"format" : "auto"
},
"dimensionsSpec" : {
"dimensions": ["page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"],
"dimensionExclusions" : [],
"spatialDimensions" : []
}
}
},
"metricsSpec" : [{
"type" : "count",
"name" : "count"
}, {
"type" : "doubleSum",
"name" : "added",
"fieldName" : "added"
}, {
"type" : "doubleSum",
"name" : "deleted",
"fieldName" : "deleted"
}, {
"type" : "doubleSum",
"name" : "delta",
"fieldName" : "delta"
}],
"granularitySpec" : {
"type" : "uniform",
"segmentGranularity" : "DAY",
"queryGranularity" : "NONE"
}
},
"ioConfig" : {
"type" : "realtime",
"firehose": {
"type": "kafka-0.8",
"consumerProps": {
"zookeeper.connect": "localhost:2181",
"zookeeper.connection.timeout.ms" : "15000",
"zookeeper.session.timeout.ms" : "15000",
"zookeeper.sync.time.ms" : "5000",
"group.id": "druid-example",
"fetch.message.max.bytes" : "1048586",
"auto.offset.reset": "largest",
"auto.commit.enable": "false"
},
"feed": "wikipedia"
},
"plumber": {
"type": "realtime"
}
},
"tuningConfig": {
"type" : "realtime",
"maxRowsInMemory": 500000,
"intermediatePersistPeriod": "PT10m",
"windowPeriod": "PT10m",
"basePersistDirectory": "\/tmp\/realtime\/basePersist",
"rejectionPolicy": {
"type": "messageTime"
}
}
}
]
I start realtime via
java -Xmx512m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -Ddruid.realtime.specFile=examples/indexing/wikipedia.spec -classpath config/_common:config/realtime:lib/* io.druid.cli.Main server realtime
In Kafka console, I paste and enter the following
{"timestamp": "2013-08-10T01:02:33Z", "page": "Good Bye", "language" : "en", "user" : "catty", "unpatrolled" : "true", "newPage" : "true", "robot": "false", "anonymous": "false", "namespace":"article", "continent":"North America", "country":"United States", "region":"Bay Area", "city":"San Francisco", "added": 57, "deleted": 200, "delta": -143}
Then I tend to perform query by creating select.json and run curl -X POST 'http://localhost:8084/druid/v2/?pretty' -H 'content-type: application/json' -d #select.json
select.json
{
"queryType": "select",
"dataSource": "wikipedia",
"dimensions":[],
"metrics":[],
"granularity": "all",
"intervals": [
"2000-01-01/2020-01-02"
],
"filter" : {"type":"and",
"fields" : [
{ "type": "selector", "dimension": "user", "value": "catty" }
]
},
"pagingSpec":{"pagingIdentifiers": {}, "threshold":500}
}
I was able to get the following result.
[ {
"timestamp" : "2013-08-10T01:02:33.000Z",
"result" : {
"pagingIdentifiers" : {
"wikipedia_2013-08-10T00:00:00.000Z_2013-08-11T00:00:00.000Z_2013-08-10T00:00:00.000Z" : 0
},
"events" : [ {
"segmentId" : "wikipedia_2013-08-10T00:00:00.000Z_2013-08-11T00:00:00.000Z_2013-08-10T00:00:00.000Z",
"offset" : 0,
"event" : {
"timestamp" : "2013-08-10T01:02:33.000Z",
"continent" : "North America",
"robot" : "false",
"country" : "United States",
"city" : "San Francisco",
"newPage" : "true",
"unpatrolled" : "true",
"namespace" : "article",
"anonymous" : "false",
"language" : "en",
"page" : "Good Bye",
"region" : "Bay Area",
"user" : "catty",
"deleted" : 200.0,
"added" : 57.0,
"count" : 1,
"delta" : -143.0
}
} ]
}
} ]
It seem that I had setup Druid correctly.
Now, I would like to insert data via HTTP endpoint. According to How realtime data input to Druid?, it seems like recommended way is to use tranquility
tranquility
I have indexing service started via
java -Xmx2g -Duser.timezone=UTC -Dfile.encoding=UTF-8 -classpath config/_common:config/overlord:lib/*: io.druid.cli.Main server overlord
conf/server.json looks like
{
"dataSources" : [
{
"spec" : {
"dataSchema" : {
"dataSource" : "wikipedia",
"parser" : {
"type" : "string",
"parseSpec" : {
"format" : "json",
"timestampSpec" : {
"column" : "timestamp",
"format" : "auto"
},
"dimensionsSpec" : {
"dimensions": ["page","language","user","unpatrolled","newPage","robot","anonymous","namespace","continent","country","region","city"],
"dimensionExclusions" : [],
"spatialDimensions" : []
}
}
},
"metricsSpec" : [{
"type" : "count",
"name" : "count"
}, {
"type" : "doubleSum",
"name" : "added",
"fieldName" : "added"
}, {
"type" : "doubleSum",
"name" : "deleted",
"fieldName" : "deleted"
}, {
"type" : "doubleSum",
"name" : "delta",
"fieldName" : "delta"
}],
"granularitySpec" : {
"type" : "uniform",
"segmentGranularity" : "DAY",
"queryGranularity" : "NONE"
}
},
"tuningConfig" : {
"windowPeriod" : "PT10M",
"type" : "realtime",
"intermediatePersistPeriod" : "PT10M",
"maxRowsInMemory" : "100000"
}
},
"properties" : {
"task.partitions" : "1",
"task.replicants" : "1"
}
}
],
"properties" : {
"zookeeper.connect" : "localhost",
"http.port" : "8200",
"http.threads" : "8"
}
}
Then, I start the server using
bin/tranquility server -configFile conf/server.json
I perform post to http://xx.xxx.xxx.xxx:8200/v1/post/wikipedia, with content-type equals application/json
{"timestamp": "2013-08-10T01:02:33Z", "page": "Selamat Pagi", "language" : "en", "user" : "catty", "unpatrolled" : "true", "newPage" : "true", "robot": "false", "anonymous": "false", "namespace":"article", "continent":"North America", "country":"United States", "region":"Bay Area", "city":"San Francisco", "added": 57, "deleted": 200, "delta": -143}
I get the the following respond
{"result":{"received":1,"sent":0}}
It seems that tranquility has received our data, but failed to send it to druid!
I try to run curl -X POST 'http://localhost:8084/druid/v2/?pretty' -H 'content-type: application/json' -d #select.json, but doesn't get the output I inserted via tranquility.
Any idea why? Thanks.
This generally happens when the data you send is out of the window period. If you are inserting data manually, give the exact current timestamp (UTC) in milliseconds. Else it can be easily done if you are using any script to generate data. Make sure it is UTC current time.
It is extremely difficult to setup druid to work properly with real-time data insertion.
The best bet I found is, use https://github.com/implydata . Imply is a set of wrappers around druid, to make it easy to use.
However, the real-time insertion in imply is not perfect either. I had experiment OutOfMemoryException, after inserting 30 millions items via real-time. This will caused data loss on previous inserted 30 millions rows.
The detailed regarding data loss can be found here : https://groups.google.com/forum/#!topic/imply-user-group/95xpYojxiOg
An issue ticket has been filed : https://github.com/implydata/distribution/issues/8
Druid streaming windowPeriod is very short (10 minutes). Outside this period, your event will be ignored.
As you got {"result":{"received":1,"sent":0}}, your worker threads are working fine. Tranquility decides what data is sent to the druid based on the timestamp associated with the data.
This period is decided by the "windowPeriod" configuration. So if your type is realtime ("type":"realtime") and window period is PT10M ("windowPeriod" : "PT10M"), tranquility will send any data between t-10, t+10 and not send anything outside this period.
I disagree with the insertion efficiency problems, we have been sending 3million rows every 15 minutes since June 2016 and has been running beautifully. Of course, we have a stronger infrastructure deemed for the scale.
Another reason for not inserting, is out memory on the coordinador/overloard are running