Consume from Kafka and send to Druid issue - druid

I have a Java application that supposed to read from Kafka, do some magic and send the data to Druid.
I have Kafka workers threads (about 15 per topic) that consume the data to from Kafka and eventually send it to Druid using Tranquillity.
This is the problem:
If I work with one thread - all is fine. If I work with more than one I get exceptions.
I tried working the following way:
Spring Druid service with several Tranquillity objects.
No Spring, Just create several Tranquillity objects for each thread.
I thought it might be concurrency issue.
When I say "several Tranquillity" I mean that I am sending the data to different tables.
I get :
java.lang.AssertionError: assertion failed: List(package nio, package nio, package nio, package nio, package nio, package nio, package nio)
at scala.reflect.internal.Symbols$Symbol.suchThat(Symbols.scala:1678) ~[scala-reflect-2.10.5.jar!/:na]
at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:44) ~[scala-reflect-2.10.5.jar!/:na]
at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:40) ~[scala-reflect-2.10.5.jar!/:na]
at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:61) ~[scala-reflect-2.10.5.jar!/:na]
at scala.reflect.internal.Mirrors$RootsBase.staticModuleOrClass(Mirrors.scala:72) ~[scala-reflect-2.10.5.jar!/:na]
at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:119) ~[scala-reflect-2.10.5.jar!/:na]
at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:21) ~[scala-reflect-2.10.5.jar!/:na]
at com.metamx.tranquility.druid.DruidBeams$$anonfun$1$$typecreator3$1.apply(DruidBeams.scala:166) ~[tranquility-core_2.10-0.8.2.jar!/:0.8.2]
at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231) ~[scala-reflect-2.10.5.jar!/:na]
at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231) ~[scala-reflect-2.10.5.jar!/:na]
at scala.reflect.api.TypeTags$TypeTag$class.equals(TypeTags.scala:256) ~[scala-reflect-2.10.5.jar!/:na]
at scala.reflect.api.TypeTags$TypeTagImpl.equals(TypeTags.scala:291) ~[scala-reflect-2.10.5.jar!/:na]
at com.metamx.tranquility.druid.DruidBeams$$anonfun$1.apply(DruidBeams.scala:166) ~[tranquility-core_2.10-0.8.2.jar!/:0.8.2]
at com.metamx.tranquility.druid.DruidBeams$$anonfun$1.apply(DruidBeams.scala:152) ~[tranquility-core_2.10-0.8.2.jar!/:0.8.2]
at com.metamx.tranquility.druid.DruidBeams$.fromConfigInternal(DruidBeams.scala:341) ~[tranquility-core_2.10-0.8.2.jar!/:0.8.2]
at com.metamx.tranquility.druid.DruidBeams$.fromConfig(DruidBeams.scala:204) ~[tranquility-core_2.10-0.8.2.jar!/:0.8.2]
at com.metamx.tranquility.druid.DruidBeams$.fromConfig(DruidBeams.scala:123) ~[tranquility-core_2.10-0.8.2.jar!/:0.8.2]
at com.metamx.tranquility.druid.DruidBeams.fromConfig(DruidBeams.scala) ~[tranquility-core_2.10-0.8.2.jar!/:0.8.2]
at com.cooladata.etl.rt.connector.DruidConnector.registerSender(DruidConnector.java:57)
at com.cooladata.etl.rt.util.DruidUtil.emmitEvent(DruidUtil.java:26)

what I will suggest is taking other pay. First of all, you should know that tranquility drops events which are older than some specified period of time, as I remember one hour by default. In case you care only for recent events it is ok.
Since you already have Kafka I would suggest you push processed data back to Kafka and use either Tranquility or Kafka indexer service to send data to Druid. Benefits of this approach are that not only Druid but other services can use this data. Also, in contrast to Tranquility Kafka indexer service does not drop any event.

Related

How to reprocess messages in Kafka Connect?

I am working with a Kafka Sink Connector which reads from a Kafka topic and puts the data into a target database (in my case it is a Neo4j instance) .The messages need to be processed strictly sequentially since they are not idempotent. My question is if for some reason an exception occurs, for e.g. 1. Datbase goes down , 2. Connectivity to DB lost , 3. Schema parsing failure , how can we reprocess the message ?
I understand we can run with error.tolerance=none configuration and redirect failure message to a dead letter queue. But my question is there any way we can process a selected message again ? Also , is there any audit mechanism to track how many messages are processed, to seek from a given offset (without manual offset reset).
Below is my connector configuration . Also suggest if there are better data integration technologies apart from the kafka connectors to sink the data into a target database.
{
"topics": "mytopic",
"connector.class": "streams.kafka.connect.sink.Neo4jSinkConnector",
"tasks.max":"1",
"key.converter.schemas.enable":"true",
"values.converter.schemas.enable":"true",
"errors.retry.timeout": "-1",
"errors.retry.delay.max.ms": "1000",
"errors.tolerance": "none",
"errors.deadletterqueue.topic.name": "deadletter-topic",
"errors.deadletterqueue.topic.replication.factor":1,
"errors.deadletterqueue.context.headers.enable":true,
"key.converter":"org.apache.kafka.connect.storage.StringConverter",
"key.converter.enhanced.avro.schema.support":true,
"value.converter.enhanced.avro.schema.support":true,
"value.converter":"io.confluent.connect.avro.AvroConverter",
"value.converter.schema.registry.url":"https://schema-url/",
"value.converter.basic.auth.credentials.source":"USER_INFO",
"value.converter.basic.auth.user.info":"user:pass",
"errors.log.enable": true,
"schema.ignore":"false",
"errors.log.include.messages": true,
"neo4j.server.uri": "neo4j://my-ip:7687/neo4j",
"neo4j.authentication.basic.username": "neo4j",
"neo4j.authentication.basic.password": "neo4j",
"neo4j.encryption.enabled": false,
"neo4j.topic.cypher.mytopic": "MERGE (p:Loc_Con{name: event.geography.name})"
}
For non fatal exceptions, the connector will write to a dead letter topic.
You'd need another connector or some other consumer to read that other topic to process that data. Since it's a topic, there's no straightforward way to "process a selected message"
JMX metrics or Neo4j database metrics should both be able to tell you approximately how many messages have been processed over time

Kafka Streams shutdown after IllegalStateException: No current assignment for partition

I have a Kafka Streams application that launches and runs successfully. We have 4 instances of the application running. Occasionally one of our instance of the application is legitimately killed which causes several rounds of rebalancing until the old node is replaced.
Sometimes during the rebalance, one ore more previously healthy nodes fail. The logs are indicating that the Streams application transitions into a PENDING_SHUTDOWN state directly after receiving the following exception:
java.lang.IllegalStateException: No current assignment for partition public.chat.message-28
at org.apache.kafka.clients.consumer.internals.SubscriptionState.assignedState(SubscriptionState.java:256)
at org.apache.kafka.clients.consumer.internals.SubscriptionState.resetFailed(SubscriptionState.java:418)
at org.apache.kafka.clients.consumer.internals.Fetcher$2.onFailure(Fetcher.java:621)
at org.apache.kafka.clients.consumer.internals.RequestFuture.fireFailure(RequestFuture.java:177)
at org.apache.kafka.clients.consumer.internals.RequestFuture.raise(RequestFuture.java:147)
at org.apache.kafka.clients.consumer.internals.RequestFutureAdapter.onFailure(RequestFutureAdapter.java:30)
at org.apache.kafka.clients.consumer.internals.RequestFuture$1.onFailure(RequestFuture.java:209)
at org.apache.kafka.clients.consumer.internals.RequestFuture.fireFailure(RequestFuture.java:177)
at org.apache.kafka.clients.consumer.internals.RequestFuture.raise(RequestFuture.java:147)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler.fireCompletion(ConsumerNetworkClient.java:571)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.firePendingCompletedRequests(ConsumerNetworkClient.java:389)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:297)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:236)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:215)
at org.apache.kafka.clients.consumer.internals.Fetcher.getTopicMetadata(Fetcher.java:292)
at org.apache.kafka.clients.consumer.internals.Fetcher.getAllTopicMetadata(Fetcher.java:275)
at org.apache.kafka.clients.consumer.KafkaConsumer.listTopics(KafkaConsumer.java:1849)
at org.apache.kafka.clients.consumer.KafkaConsumer.listTopics(KafkaConsumer.java:1827)
at org.apache.kafka.streams.processor.internals.StoreChangelogReader.refreshChangelogInfo(StoreChangelogReader.java:259)
at org.apache.kafka.streams.processor.internals.StoreChangelogReader.initialize(StoreChangelogReader.java:133)
at org.apache.kafka.streams.processor.internals.StoreChangelogReader.restore(StoreChangelogReader.java:79)
at org.apache.kafka.streams.processor.internals.TaskManager.updateNewAndRestoringTasks(TaskManager.java:328)
at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:866)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:804)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:773)
Prior to this error we often seem to also recieve some informational logs reporting a disconnect exception:
Error sending fetch request (sessionId=568252460, epoch=7) to node 4: org.apache.kafka.common.errors.DisconnectException
I have a feeling the two are related but I'm unable to reason why at present.
Is anyone able to give me some hints as to what may be causing this issue and any possible solutions?
Additional Info:
Kafka 2.2.1
32 partitions spread evenly across the 4 worker nodes
StreamsConfig settings:
kafkaStreamProps.put(StreamsConfig.REPLICATION_FACTOR_CONFIG, 2);
kafkaStreamProps.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);
kafkaStreamProps.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 4);
kafkaStreamProps.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 120000);
kafkaStreamProps.put(StreamsConfig.TOPOLOGY_OPTIMIZATION, StreamsConfig.OPTIMIZE);
This looks like it could be related to https://issues.apache.org/jira/browse/KAFKA-9073, which has been fixed in Kafka Streams 2.3.2.
If you can't wait for that release, you could try creating a private build using the changeset from this pull request: https://github.com/apache/kafka/pull/7630/files

UnknownProducerIdException in Kafka streams when enabling exactly once

After enabling exactly once processing on a Kafka streams application, the following error appears in the logs:
ERROR o.a.k.s.p.internals.StreamTask - task [0_0] Failed to close producer
due to the following error:
org.apache.kafka.streams.errors.StreamsException: task [0_0] Abort
sending since an error caught with a previous record (key 222222 value
some-value timestamp 1519200902670) to topic exactly-once-test-topic-
v2 due to This exception is raised by the broker if it could not
locate the producer metadata associated with the producerId in
question. This could happen if, for instance, the producer's records
were deleted because their retention time had elapsed. Once the last
records of the producerId are removed, the producer's metadata is
removed from the broker, and future appends by the producer will
return this exception.
at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.recordSendError(RecordCollectorImpl.java:125)
at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.access$500(RecordCollectorImpl.java:48)
at org.apache.kafka.streams.processor.internals.RecordCollectorImpl$1.onCompletion(RecordCollectorImpl.java:180)
at org.apache.kafka.clients.producer.KafkaProducer$InterceptorCallback.onCompletion(KafkaProducer.java:1199)
at org.apache.kafka.clients.producer.internals.ProducerBatch.completeFutureAndFireCallbacks(ProducerBatch.java:204)
at org.apache.kafka.clients.producer.internals.ProducerBatch.done(ProducerBatch.java:187)
at org.apache.kafka.clients.producer.internals.Sender.failBatch(Sender.java:627)
at org.apache.kafka.clients.producer.internals.Sender.failBatch(Sender.java:596)
at org.apache.kafka.clients.producer.internals.Sender.completeBatch(Sender.java:557)
at org.apache.kafka.clients.producer.internals.Sender.handleProduceResponse(Sender.java:481)
at org.apache.kafka.clients.producer.internals.Sender.access$100(Sender.java:74)
at org.apache.kafka.clients.producer.internals.Sender$1.onComplete(Sender.java:692)
at org.apache.kafka.clients.ClientResponse.onComplete(ClientResponse.java:101)
at org.apache.kafka.clients.NetworkClient.completeResponses(NetworkClient.java:482)
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:474)
at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:239)
at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:163)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.kafka.common.errors.UnknownProducerIdException
We've reproduced the issue with a minimal test case where we move messages from a source stream to another stream without any transformation. The source stream contains millions of messages produced over several months. The KafkaStreams object is created with the following StreamsConfig:
StreamsConfig.PROCESSING_GUARANTEE_CONFIG = "exactly_once"
StreamsConfig.APPLICATION_ID_CONFIG = "Some app id"
StreamsConfig.NUM_STREAM_THREADS_CONFIG = 1
ProducerConfig.BATCH_SIZE_CONFIG = 102400
The app is able to process some messages before the exception occurs.
Context information:
we're running a 5 node Kafka 1.1.0 cluster with 5 zookeeper nodes.
there are multiple instances of the app running
Has anyone seen this problem before or can give us any hints about what might be causing this behaviour?
Update
We created a new 1.1.0 cluster from scratch and started to process new messages without problems. However, when we imported old messages from the old cluster, we hit the same UnknownProducerIdException after a while.
Next we tried to set the cleanup.policy on the sink topic to compact while keeping the retention.ms at 3 years. Now the error did not occur. However, messages seem to have been lost. The source offset is 106 million and the sink offset is 100 million.
As explained in the comments, there currently seems to be a bug that may cause problems when replaying messages older than the (maximum configurable?) retention time.
At time of writing this is unresolved, the latest status can always be seen here:
https://issues.apache.org/jira/browse/KAFKA-6817

Kafka org.apache.kafka.connect.converters.ByteArrayConverter doesn't work as values for key.converter and value.converter

I'm trying to build a pipeline where I have to move binary data from kafka topic to kinesis stream with out transforming. So I'm planning to use ByteArrayConverter for worker properties setup. But I'm getting the following error! Although I could see the ByteArrayConverter class in here
on 0.11.0 version. I cannot find the same class under 3.2.x :(
Any help would be much appreciated.
key.converter=io.confluent.connect.replicator.util.ByteArrayConverter
value.converter=io.confluent.connect.replicator.util.ByteArrayConverter
Exception in thread "main" org.apache.kafka.common.config.ConfigException: Invalid value io.confluent.connect.replicator.util.ByteArrayConverter for configuration key.converter: Class io.confluent.connect.replicator.util.ByteArrayConverter could not be found.
at org.apache.kafka.common.config.ConfigDef.parseType(ConfigDef.java:672)
at org.apache.kafka.common.config.ConfigDef.parse(ConfigDef.java:418)
at org.apache.kafka.common.config.AbstractConfig.<init>(AbstractConfig.java:55)
at org.apache.kafka.common.config.AbstractConfig.<init>(AbstractConfig.java:62)
at org.apache.kafka.connect.runtime.WorkerConfig.<init>(WorkerConfig.java:156)
at org.apache.kafka.connect.runtime.distributed.DistributedConfig.<init>(DistributedConfig.java:198)
at org.apache.kafka.connect.cli.ConnectDistributed.main(ConnectDistributed.java:65)
org.apache.kafka.connect.converters.ByteArrayConverter was only added to Apache Kafka 0.11 (which is Confluent 3.3). If you are running a Confluent distro earlier than 3.3 then you will need the Confluent Enterprise distro (not Confluent Open Source) and use the io.confluent.connect.replicator.util.ByteArrayConverter converter

storm KafkaSpout stopped to consume message from kafka (NOT_LEADER_FOR_PARTITION)

I have three zookeeper and three kafka and storm in cluster enviroinment
storm nimbus is working on machine1 and storm supervisor working on machine 2 and 3
at some point KafkaSpout stopped consuming data from kafka and i found error
2016-11-16 04:02:07.470 c.e.m.s.k.KafkaSpout [WARN] Fetch failed
com.monitor.storm.kafka.FailedFetchException: Error fetching data from [Partition{host=<machine1_ip>:9092, partition=1}] for topic [test_topic]: [NOT_LEADER_FOR_PARTITION]
at com.monitor.storm.kafka.KafkaUtils.fetchMessages(KafkaUtils.java:193) ~[storm-kafka-0.8-plus.jar:newtrunk.10.25.2016]
at com.monitor.storm.kafka.PartitionManager.fill(PartitionManager.java:175) ~[storm-kafka-0.8-plus.jar:newtrunk.10.25.2016]
at com.monitor.storm.kafka.PartitionManager.next(PartitionManager.java:132) ~[storm-kafka-0.8-plus.jar:newtrunk.10.25.2016]
at com.monitor.storm.kafka.KafkaSpout.nextTuple(KafkaSpout.java:153) [storm-kafka-0.8-plus.jar:newtrunk.10.25.2016]
at backtype.storm.daemon.executor$fn__5624$fn__5639$fn__5670.invoke(executor.clj:607) [storm-core-0.10.0.jar:0.10.0]
at backtype.storm.util$async_loop$fn__545.invoke(util.clj:479) [storm-core-0.10.0.jar:0.10.0]
at clojure.lang.AFn.run(AFn.java:22) [clojure-1.6.0.jar:?]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_73]
2016-11-16 04:02:07.471 c.e.m.s.k.ZkCoordinator [INFO] Task [2/4] Refreshing partition manager connections
I have checked all the services were up and running fine.
I have tried telnet all the ports are reachable.
If i delete all the topic and restart the services(zookeeper, kafka , storm (nimbus, supervisor, workers) ) it will work fine.
But, i can't deleted topics. Because, it has some data in it.And after few hours it seem started to working fine without restart.But, it started to happen frequently. So, their is some delay in actions.
Can any one help me to find out what is the problem and is the their
any way i can prevent it.