ClickHouse: Kafka broken messages error handling

ClickHouse version 21.12.3.32. I'm following this PR (https://github.com/ClickHouse/ClickHouse/pull/21850) to handle incorrect messages from a Kafka topic, but after some investigation I've found that if a single message contains broken data, the whole batch of received messages cannot be parsed, which can lead to data loss.
Kafka engine table:
CREATE TABLE default.kafka_engine (message String)
ENGINE = Kafka
SETTINGS
    kafka_broker_list = 'kafka:9092',
    kafka_topic_list = 'topic',
    kafka_group_name = 'group',
    kafka_format = 'JSONAsString',
    kafka_row_delimiter = '\n',
    kafka_num_consumers = 1,
    kafka_handle_error_mode = 'stream';
Example of broken message: [object Object]
First message error: JSON object must begin with '{'.: (at row 1).
Other messages error: Cannot parse input: expected ']' at end of stream..
Is it possible to skip just that broken message and correctly parse the other messages in a batch received from the Kafka topic?

Changing the kafka_format to RawBLOB fixed my issue.

Related

Getting "java.lang.IllegalStateException: Tried to lookup lag for unknown task 3_0" after upgrading Kafka Stream from 2.5.1 to 2.6.2

I just upgraded our Kafka Stream application from 2.5.1 to 2.6.2. It used to work, now it doesn't.
Here is the troublesome topology (I have omitted the irrelevant Serdes):
val builder = new StreamsBuilder()

val contractEventStream: KStream[TariffId, ContractEvent] =
  builder.stream[String, ContractUpsertAvro](settings.contractsTopicName)
    .flatMap { (_, contractAvro) =>
      ContractEvent.from(contractAvro)
        .map(contractEvent => (contractEvent.tariffId, contractEvent))
    }

val tariffsTable: KTable[TariffId, Tariff] =
  builder.stream[String, TariffUpdateEventAvro](settings.tariffTopicName)
    .flatMapValues(Tariff.fromAvro(_))
    .selectKey((_, tariff) => tariff.tariffId)
    .toTable(Materialized.`with`(tariffIdSerde, tariffSerde)) // Materialized.as also throws the same IllegalStateException

contractEventStream
  .join(tariffsTable)(JourneyStep.from(_, _).asInstanceOf[ContractCreated])(Joined.`with`(tariffIdSerde, contractEventSerde, tariffSerde))
  .selectKey((_, contractUpdated) => contractUpdated.accountId)
  .foreach((_, journeyStep) => println(journeyStep))
The join gives the following exception:
java.lang.IllegalStateException: Tried to lookup lag for unknown task 3_0
at org.apache.kafka.streams.processor.internals.assignment.ClientState.lagFor(ClientState.java:306)
at java.util.Comparator.lambda$comparingLong$6043328a$1(Comparator.java:511)
at java.util.Comparator.lambda$thenComparing$36697e65$1(Comparator.java:216)
at java.util.TreeMap.compare(TreeMap.java:1295)
at java.util.TreeMap.put(TreeMap.java:538)
at java.util.TreeSet.add(TreeSet.java:255)
at java.util.AbstractCollection.addAll(AbstractCollection.java:344)
at java.util.TreeSet.addAll(TreeSet.java:312)
at org.apache.kafka.streams.processor.internals.StreamsPartitionAssignor.getPreviousTasksByLag(StreamsPartitionAssignor.java:1275)
at org.apache.kafka.streams.processor.internals.StreamsPartitionAssignor.assignTasksToThreads(StreamsPartitionAssignor.java:1189)
at org.apache.kafka.streams.processor.internals.StreamsPartitionAssignor.computeNewAssignment(StreamsPartitionAssignor.java:940)
at org.apache.kafka.streams.processor.internals.StreamsPartitionAssignor.assign(StreamsPartitionAssignor.java:399)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.performAssignment(ConsumerCoordinator.java:589)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.onJoinLeader(AbstractCoordinator.java:684)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.access$1000(AbstractCoordinator.java:111)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$JoinGroupResponseHandler.handle(AbstractCoordinator.java:597)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$JoinGroupResponseHandler.handle(AbstractCoordinator.java:560)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onSuccess(AbstractCoordinator.java:1160)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onSuccess(AbstractCoordinator.java:1135)
at org.apache.kafka.clients.consumer.internals.RequestFuture$1.onSuccess(RequestFuture.java:206)
at org.apache.kafka.clients.consumer.internals.RequestFuture.fireSuccess(RequestFuture.java:169)
at org.apache.kafka.clients.consumer.internals.RequestFuture.complete(RequestFuture.java:129)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler.fireCompletion(ConsumerNetworkClient.java:602)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.firePendingCompletedRequests(ConsumerNetworkClient.java:412)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:297)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:236)
at org.apache.kafka.clients.consumer.KafkaConsumer.pollForFetches(KafkaConsumer.java:1296)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1237)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1210)
at org.apache.kafka.streams.processor.internals.StreamThread.pollRequests(StreamThread.java:767)
at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:624)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:551)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:510)
I can't see what I am doing wrong. The code above works with Kafka 2.5.1. Does anyone have any idea what is going on?
The problem is caused by the Kafka Streams cache, which it keeps on disk. This cache is specific to the Kafka version and to the Kafka Streams topology you use (i.e. a change in your topology could also lead to this error).
The cache is usually found in /tmp, or elsewhere if you passed the "state.dir" property to Kafka Streams. Clear the directory containing the cache and you should be able to start cleanly again.
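If deleting the directory by hand is awkward, here is a minimal sketch (reusing the builder from the question; the application id, broker address and state directory are hypothetical placeholders) of wiping the local state programmatically with KafkaStreams#cleanUp(), which removes this instance's local state directory and may only be called while the instance is not running:

import java.util.Properties
import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}

val props = new Properties()
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "contract-journey-app") // hypothetical
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")    // hypothetical
props.put(StreamsConfig.STATE_DIR_CONFIG, "/var/lib/kafka-streams")    // defaults to /tmp/kafka-streams

val streams = new KafkaStreams(builder.build(), props)

// Deletes this instance's local state stores (RocksDB data and checkpoints);
// state is rebuilt from the changelog topics on the next start.
streams.cleanUp()
streams.start()

Note that cleanUp() wipes all local state for the instance, so the first start afterwards will restore state from Kafka, which can take a while for large stores.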

node-rdkafka - debug set to all but I only see broker transport failure

I am trying to connect to a Kafka server. Authentication is based on GSSAPI.
/opt/app-root/src/server/node_modules/node-rdkafka/lib/error.js:411
return new LibrdKafkaError(e);
^
Error: broker transport failure
at Function.createLibrdkafkaError (/opt/app-root/src/server/node_modules/node-rdkafka/lib/error.js:411:10)
at /opt/app-root/src/server/node_modules/node-rdkafka/lib/client.js:350:28
This is my test_kafka.js:
const Kafka = require('node-rdkafka');

const kafkaConf = {
  'group.id': 'espdev2',
  'enable.auto.commit': true,
  'metadata.broker.list': 'br01',
  'security.protocol': 'SASL_SSL',
  'sasl.kerberos.service.name': 'kafka',
  'sasl.kerberos.keytab': 'svc_esp_kafka_nonprod.keytab',
  'sasl.kerberos.principal': 'svc_esp_kafka_nonprod@INT.LOCAL',
  'debug': 'all',
  'enable.ssl.certificate.verification': true,
  //'ssl.certificate.location': 'some-root-ca.cer',
  'ssl.ca.location': 'some-root-ca.cer',
  //'ssl.key.location': 'svc_esp_kafka_nonprod.keytab',
};

const topics = 'hello1';
console.log(Kafka.features);

let readStream = new Kafka.KafkaConsumer.createReadStream(kafkaConf, { "auto.offset.reset": "earliest" }, { topics })

readStream.on('data', function (message) {
  const messageString = message.value.toString();
  console.log(`Consumed message on Stream: ${messageString}`);
});
You can look at this issue for the explanation of this error:
https://github.com/edenhill/librdkafka/issues/1987
Taken from @edenhill:
As a general rule for librdkafka-based clients: given that the cluster and client are correctly configured, all errors can be ignored as they are most likely temporary and librdkafka will attempt to recover automatically. In this specific case, if a group coordinator request fails it will be retried (using any broker in state Up) within 500ms. The current assignment and group membership will not be affected if a new coordinator is found before the missing heartbeats time out the membership (session.timeout.ms).
Auto offset commits will be stalled until a new coordinator is found. In a future version we'll extend the error type to include a severity, allowing applications to happily ignore non-terminal errors. At this time an application should consider all errors informational, and not terminal.

WSO2 SP - Kafka source with JSON attributes

I'm trying to read JSON data from Kafka using the following code:
#source(type = 'kafka', bootstrap.servers = 'localhost:9092', topic.list = 'TestTopic',
group.id = 'test', threading.option = 'single.thread', #map(type = 'json'))
define stream myDataStream (json object);
But it failed with the following error:
[2019-03-27_11-39-32_103] ERROR
{org.wso2.extension.siddhi.map.json.sourcemapper.JsonSourceMapper} -
Stream "myDataStream" does not have an attribute named "ABC",
but the received event {"event":{"ABC":"1"}} does. Hence dropping the message.
Check whether the json string is in a correct format for default mapping.
I've tried adding the attribute mapping:
#source(type = 'kafka', bootstrap.servers = 'localhost:9092',
topic.list = 'TestTopic', group.id = 'test',
threading.option = 'single.thread',
#map(type = 'json', #attributes(ABC = '$.ABC')))
Syntax error:
Error at 'json' defined at stream 'myDataStream', attribute 'json' is
not mapped
Any help would be greatly appreciated.
There is an error in the stream definition. It should be:
define stream myDataStream (ABC string);
Here the attribute name must match the key in the JSON messages, in this case ABC.

Producer lost some message on kafka restart

Kafka Client : 0.11.0.0-cp1
Kafka Broker :
During a Kafka broker rolling restart, our application lost some messages while sending to the broker. I believe a rolling restart should not cause any message loss. These are the producer settings we are using (producer with asynchronous send() and no callback/future):
val acksConfig: String = "all",
val retriesConfig: Int = Int.MAX_VALUE,
val retriesBackOffConfig: Int = 1000,
val batchSize: Int = 32768,
val lingerTime: Int = 1,
val maxBlockTime: Int = Int.MAX_VALUE,
val requestTimeOut: Int = 420000,
val bufferMemory: Int = 33_554_432,
val compressionType: String = "gzip",
val keySerializer: Class<StringSerializer> = StringSerializer::class.java,
val valueSerializer: Class<ByteArraySerializer> = ByteArraySerializer::class.java
I am seeing these exceptions in the logs
2019-03-19 17:30:59,224 [org.apache.kafka.clients.producer.internals.Sender] [kafka-producer-network-thread | producer-1] (Sender.java:511) WARN org.apache.kafka.clients.producer.internals.Sender - Got error produce response with correlation id 1105790 on topic-partition catapult_on_entitlement_updates_prod-67, retrying (2147483643 attempts left). Error: NOT_LEADER_FOR_PARTITION
But the log says there are retry attempts left, so I am curious why it didn't retry. Does anyone have any idea?
Two things to note:
What is the replication factor of the topic you are producing to, and what is the configured min.insync.replicas?
What do you mean by "producer lost some messages"? If the producer cannot successfully produce to min.insync.replicas brokers, it will throw an exception and fail (for synchronous production). It is up to the producer/client to retry in case of failure (synchronous or asynchronous production).
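To actually see (and act on) those failures instead of firing and forgetting, the send result has to be checked. A minimal sketch in Scala against the plain Kafka producer API (the broker address, topic name, key and value are hypothetical) showing both options:

import java.util.Properties
import org.apache.kafka.clients.producer.{Callback, KafkaProducer, ProducerConfig, ProducerRecord, RecordMetadata}
import org.apache.kafka.common.serialization.{ByteArraySerializer, StringSerializer}

val props = new Properties()
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092") // hypothetical broker
props.put(ProducerConfig.ACKS_CONFIG, "all")
props.put(ProducerConfig.RETRIES_CONFIG, Int.MaxValue.toString)
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer])
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[ByteArraySerializer])

val producer = new KafkaProducer[String, Array[Byte]](props)
val record = new ProducerRecord[String, Array[Byte]]("my-topic", "key", "value".getBytes) // hypothetical

// Option 1: block on the returned future; a failure that exhausts the
// retries is rethrown here instead of being silently dropped.
producer.send(record).get()

// Option 2: stay asynchronous but pass a callback, so failed sends can be
// logged or re-queued by the application.
producer.send(record, new Callback {
  override def onCompletion(metadata: RecordMetadata, exception: Exception): Unit =
    if (exception != null) println(s"send failed: ${exception.getMessage}")
})

With the fire-and-forget style in the question, a NOT_LEADER_FOR_PARTITION error is retried internally, but anything that ultimately cannot be delivered is only reported through the future or the callback, so without them the loss stays invisible to the application.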

Akka streams with gilt aws kinesis exception: Stream is terminated. SourceQueue is detached

I'm using the gilt aws kinesis stream consumer library to connect to a single-shard Kinesis stream.
Specifically:
...
val streamConfig = KinesisStreamConsumerConfig[String](
  streamName = queueName,
  applicationName = kinesisConsumerApp,
  regionName = Some(awsRegion),
  checkPointInterval = 5.minutes,
  retryConfig = RetryConfig(initialDelay = 1.second, retryDelay = 1.second, maxRetries = 3),
  initialPositionInStream = InitialPositionInStream.LATEST
)

implicit val mat = ActorMaterializer()

val flow = Source.queue[String](0, OverflowStrategy.backpressure)
  .to(Sink.foreach { msgBody =>
    log.info(s"Flow got message: $msgBody")
    try {
      val workAsJson = parse(msgBody)
      frontEnd ! workAsJson
    } catch {
      case th: Throwable => log.error(s"Exception thrown trying to parse message from Kinesis stream, e.cause: ${th.getCause}, e.message: ${th.getMessage}")
    }
  })
  .run()

val consumer = new KinesisStreamConsumer[String](
  streamConfig,
  KinesisStreamHandler(
    KinesisStreamSource.pumpKinesisStreamTo(flow, 10.second)
  )
)

val ec = Executors.newSingleThreadExecutor()
ec.submit(new Runnable {
  override def run(): Unit = consumer.run()
})
The application runs fine for about 24 hours (I verify occasionally by pushing records with the aws kinesis put-record command line and watching them get consumed by my application), but then it suddenly starts receiving exceptions each time a new record is pushed to the stream.
Here is the console logging when that happens:
INFO: Sleeping ... [863/1962]
DEBUG[RecordProcessor-0000] KCLRecordProcessorFactory$IRecordProcessorFactoryImpl - Processing 1 records from shard shardId-000000000000
WARN [RecordProcessor-0000] KCLRecordProcessorFactory$IRecordProcessorFactoryImpl - Kinesis shard: shardId-000000000000 :: Stream is terminated. SourceQueue is detached
WARN [RecordProcessor-0000] KCLRecordProcessorFactory$IRecordProcessorFactoryImpl - Kinesis shard: shardId-000000000000 :: Stream is terminated. SourceQueue is detached
WARN [RecordProcessor-0000] KCLRecordProcessorFactory$IRecordProcessorFactoryImpl - Kinesis shard: shardId-000000000000 :: Stream is terminated. SourceQueue is detached
ERROR[RecordProcessor-0000] KCLRecordProcessorFactory$IRecordProcessorFactoryImpl - SKIPPING 1 records from shard shardId-000000000000 :: Kinesis shard: shardId-000000000000 :: Stream is termi
nated. SourceQueue is detached
com.gilt.gfc.aws.kinesis.client.KCLRecordProcessorFactory$KCLProcessorException: Kinesis shard: shardId-000000000000 :: Stream is terminated. SourceQueue is detached
at com.gilt.gfc.aws.kinesis.client.KCLRecordProcessorFactory$IRecordProcessorFactoryImpl$$anon$1.$anonfun$doRetry$2(KCLRecordProcessorFactory.scala:156)
at com.gilt.gfc.util.Retry$.retryWithExponentialDelay(Retry.scala:67)
at com.gilt.gfc.aws.kinesis.client.KCLRecordProcessorFactory$IRecordProcessorFactoryImpl$$anon$1.doRetry(KCLRecordProcessorFactory.scala:151)
at com.gilt.gfc.aws.kinesis.client.KCLRecordProcessorFactory$IRecordProcessorFactoryImpl$$anon$1.processRecords(KCLRecordProcessorFactory.scala:120)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.V1ToV2RecordProcessorAdapter.processRecords(V1ToV2RecordProcessorAdapter.java:42)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ProcessTask.call(ProcessTask.java:176)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:49)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:24)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IllegalStateException: Stream is terminated. SourceQueue is detached
at akka.stream.impl.QueueSource$$anon$1.$anonfun$postStop$1(Sources.scala:57)
at akka.stream.impl.QueueSource$$anon$1.$anonfun$postStop$1$adapted(Sources.scala:56)
at akka.stream.stage.CallbackWrapper.$anonfun$invoke$1(GraphStage.scala:1373)
at akka.stream.stage.CallbackWrapper.locked(GraphStage.scala:1379)
at akka.stream.stage.CallbackWrapper.invoke(GraphStage.scala:1370)
at akka.stream.stage.CallbackWrapper.invoke$(GraphStage.scala:1369)
at akka.stream.impl.QueueSource$$anon$1.invoke(Sources.scala:47)
at akka.stream.impl.QueueSource$$anon$2.offer(Sources.scala:180)
at com.gilt.gfc.aws.kinesis.akka.KinesisStreamSource$.$anonfun$pumpKinesisStreamTo$1(KinesisStreamSource.scala:20)
at com.gilt.gfc.aws.kinesis.akka.KinesisStreamSource$.$anonfun$pumpKinesisStreamTo$1$adapted(KinesisStreamSource.scala:20)
at com.gilt.gfc.aws.kinesis.akka.KinesisStreamHandler$$anon$1.onRecord(KinesisStreamHandler.scala:29)
at com.gilt.gfc.aws.kinesis.akka.KinesisStreamConsumer.$anonfun$run$1(KinesisStreamConsumer.scala:40)
at com.gilt.gfc.aws.kinesis.akka.KinesisStreamConsumer.$anonfun$run$1$adapted(KinesisStreamConsumer.scala:40)
at com.gilt.gfc.aws.kinesis.client.KCLWorkerRunner.$anonfun$runSingleRecordProcessor$2(KCLWorkerRunner.scala:159)
at com.gilt.gfc.aws.kinesis.client.KCLWorkerRunner.$anonfun$runSingleRecordProcessor$2$adapted(KCLWorkerRunner.scala:159)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:52)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at com.gilt.gfc.aws.kinesis.client.KCLWorkerRunner.$anonfun$runSingleRecordProcessor$1(KCLWorkerRunner.scala:159)
at com.gilt.gfc.aws.kinesis.client.KCLWorkerRunner.$anonfun$runSingleRecordProcessor$1$adapted(KCLWorkerRunner.scala:159)
at com.gilt.gfc.aws.kinesis.client.KCLWorkerRunner.$anonfun$runBatchProcessor$1(KCLWorkerRunner.scala:121)
at com.gilt.gfc.aws.kinesis.client.KCLWorkerRunner.$anonfun$runBatchProcessor$1$adapted(KCLWorkerRunner.scala:116)
at com.gilt.gfc.aws.kinesis.client.KCLRecordProcessorFactory$IRecordProcessorFactoryImpl$$anon$1.$anonfun$processRecords$2(KCLRecordProcessorFactory.scala:120)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
at com.gilt.gfc.aws.kinesis.client.KCLRecordProcessorFactory$IRecordProcessorFactoryImpl$$anon$1.$anonfun$doRetry$2(KCLRecordProcessorFactory.scala:153)
... 11 common frames omitted
I'm wondering if that answer might be related. If so, I'd appreciate a simpler explanation/how-to-fix that suits a newbie like myself.
Notes:
This is still in the testing/staging phase, so there is no real load on the stream except for the occasional manual pushes I'm making.
The 24h duration in which the application runs fine has not been accurately measured; it was an observation.
I'm running the test for a third time (started at 8:42 UTC), but with the difference of increasing the Source.queue buffer size to 100.
If 24h turns out to be accurate, could that be related to the Kinesis default 24h retention period of stream records?
Update:
The application is still working fine after 24+ hours of operation.
Update 2:
So the application has been running fine for the past 48+ hours; again, the only difference is increasing the stream's Source.queue buffer size to 100.
Could that be the proper fix for the issue?
Will I face a similar issue with increased load once we go to production?
Is 100 enough, too much, or too few?
Can someone please explain how this change fixed/suppressed/mitigated the error?
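For reference, here is a minimal, self-contained Akka Streams sketch (plain Source.queue rather than the gilt library; the system name and messages are hypothetical) of a bounded queue with backpressure where each offer result is checked. With the original buffer size of 0, every offer has to be consumed downstream before the next one completes, which is presumably why the larger buffer copes better with bursts:

import akka.actor.ActorSystem
import akka.stream.scaladsl.{Sink, Source}
import akka.stream.{ActorMaterializer, OverflowStrategy, QueueOfferResult}

object QueueOfferDemo extends App {
  implicit val system: ActorSystem = ActorSystem("queue-demo")
  implicit val mat: ActorMaterializer = ActorMaterializer()
  import system.dispatcher

  // Buffer of 100 elements; offers beyond that back-pressure instead of failing.
  val queue = Source
    .queue[String](100, OverflowStrategy.backpressure)
    .to(Sink.foreach(msg => println(s"Flow got message: $msg")))
    .run()

  // Each offer returns a Future[QueueOfferResult] worth checking:
  // Dropped, QueueClosed and Failure all mean the element never entered the stream.
  queue.offer("hello").foreach {
    case QueueOfferResult.Enqueued    => println("enqueued")
    case QueueOfferResult.Dropped     => println("dropped")
    case QueueOfferResult.QueueClosed => println("queue already completed")
    case QueueOfferResult.Failure(t)  => println(s"offer failed: ${t.getMessage}")
  }
}

Whether 100 is the right size depends on the expected burst rate. The "Stream is terminated. SourceQueue is detached" error in the stack trace comes from offering to a queue whose stream has already stopped, so it may also be worth attaching a completion/failure handler (e.g. via watchTermination) to learn why the stream stopped in the first place.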