Connector fails when schema registry's master changes

My source connector throws
Caused by: io.confluent.kafka.schemaregistry.client.rest.exceptions.RestClientException: Error while forwarding register schema request to the master; error code: 50003
or
Caused by: io.confluent.kafka.schemaregistry.client.rest.exceptions.RestClientException: Master not known
I found this happens when the schema registry's master changes. I have two replicas of schema-registry running under the same service on Kubernetes.
The top exception is org.apache.kafka.connect.errors.ConnectException: Tolerance exceeded in error handler
How can I increase the tolerance so that the connector retries more times, until the new master is elected?

Just because you have two replicas doesn't mean they know about each other.
See how to fix this - https://github.com/confluentinc/cp-helm-charts/issues/375
Regarding the error handler, you can configure retry timeouts (setting errors.retry.timeout to -1 makes it retry indefinitely). Example from the docs:
# retry for at most 10 minutes, waiting up to 30 seconds between consecutive failures
errors.retry.timeout=600000
errors.retry.delay.max.ms=30000
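For reference, a minimal sketch of where those settings sit in a full connector configuration, expressed here as the map you would submit (as JSON) to the Connect REST API; the connector class and other keys are hypothetical placeholders, only the errors.* entries are the point:

import java.util.Map;

public class ConnectorRetrySketch {
    public static void main(String[] args) {
        // Hypothetical connector config; only the errors.* keys matter here.
        Map<String, String> config = Map.of(
                "connector.class", "io.confluent.connect.jdbc.JdbcSourceConnector", // placeholder
                "tasks.max", "1",
                "errors.retry.timeout", "600000",      // keep retrying for up to 10 minutes (-1 = retry forever)
                "errors.retry.delay.max.ms", "30000"   // back off up to 30 s between attempts
        );
        config.forEach((k, v) -> System.out.println(k + "=" + v));
    }
}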

Related

Running multiple Debezium connectors on the same source MariaDB

We have multiple MariaDB schemas, and for each of them we run two Debezium connectors. Everything runs fine for a while, but then every 1-2 weeks or so a Debezium error occurs on a random connector:
2022-10-31 06:18:55,106 ERROR MySQL|scheme_1|binlog Error during binlog processing. Last offset stored = {transaction_id=null, ts_sec=1667155787, file=mysql-bin.075628, pos=104509320, server_id=1, event=32}, binlog reader near position = mysql-bin.075628/300573885 [io.debezium.connector.mysql.MySqlStreamingChangeEventSource]
2022-10-31 06:18:55,107 ERROR MySQL|scheme_1|binlog Producer failure [io.debezium.pipeline.ErrorHandler]
io.debezium.DebeziumException: Connection reset
at io.debezium.connector.mysql.MySqlStreamingChangeEventSource.wrap(MySqlStreamingChangeEventSource.java:1189)
at io.debezium.connector.mysql.MySqlStreamingChangeEventSource$ReaderThreadLifecycleListener.onCommunicationFailure(MySqlStreamingChangeEventSource.java:1234)
at com.github.shyiko.mysql.binlog.BinaryLogClient.listenForEventPackets(BinaryLogClient.java:980)
at com.github.shyiko.mysql.binlog.BinaryLogClient.connect(BinaryLogClient.java:599)
at com.github.shyiko.mysql.binlog.BinaryLogClient$7.run(BinaryLogClient.java:857)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.net.SocketException: Connection reset
at java.base/java.net.SocketInputStream.read(SocketInputStream.java:186)
at java.base/java.net.SocketInputStream.read(SocketInputStream.java:140)
at com.github.shyiko.mysql.binlog.io.BufferedSocketInputStream.read(BufferedSocketInputStream.java:59)
at com.github.shyiko.mysql.binlog.io.ByteArrayInputStream.readWithinBlockBoundaries(ByteArrayInputStream.java:261)
at com.github.shyiko.mysql.binlog.io.ByteArrayInputStream.read(ByteArrayInputStream.java:245)
at com.github.shyiko.mysql.binlog.io.ByteArrayInputStream.fill(ByteArrayInputStream.java:112)
at com.github.shyiko.mysql.binlog.io.ByteArrayInputStream.read(ByteArrayInputStream.java:105)
at com.github.shyiko.mysql.binlog.BinaryLogClient.readPacketSplitInChunks(BinaryLogClient.java:995)
at com.github.shyiko.mysql.binlog.BinaryLogClient.listenForEventPackets(BinaryLogClient.java:953)
... 3 more
2022-10-31 06:18:55,113 INFO MySQL|scheme_1|binlog Stopped reading binlog after 0 events, last recorded offset: {transaction_id=null, ts_sec=1667155787, file=mysql-bin.075628, pos=104509320, server_id=1, event=32} [io.debezium.connector.mysql.MySqlStreamingChangeEventSource]
2022-10-31 06:18:55,123 ERROR || WorkerSourceTask{id=scheme_1-connector-1666100046785939106-0} Task threw an uncaught and unrecoverable exception. Task is being killed and will not recover until manually restarted [org.apache.kafka.connect.runtime.WorkerTask]
org.apache.kafka.connect.errors.ConnectException: An exception occurred in the change event producer. This connector will be stopped.
at io.debezium.pipeline.ErrorHandler.setProducerThrowable(ErrorHandler.java:50)
at io.debezium.connector.mysql.MySqlStreamingChangeEventSource$ReaderThreadLifecycleListener.onCommunicationFailure(MySqlStreamingChangeEventSource.java:1234)
at com.github.shyiko.mysql.binlog.BinaryLogClient.listenForEventPackets(BinaryLogClient.java:980)
at com.github.shyiko.mysql.binlog.BinaryLogClient.connect(BinaryLogClient.java:599)
at com.github.shyiko.mysql.binlog.BinaryLogClient$7.run(BinaryLogClient.java:857)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: io.debezium.DebeziumException: Connection reset
at io.debezium.connector.mysql.MySqlStreamingChangeEventSource.wrap(MySqlStreamingChangeEventSource.java:1189)
... 5 more
Caused by: java.net.SocketException: Connection reset
at java.base/java.net.SocketInputStream.read(SocketInputStream.java:186)
at java.base/java.net.SocketInputStream.read(SocketInputStream.java:140)
at com.github.shyiko.mysql.binlog.io.BufferedSocketInputStream.read(BufferedSocketInputStream.java:59)
at com.github.shyiko.mysql.binlog.io.ByteArrayInputStream.readWithinBlockBoundaries(ByteArrayInputStream.java:261)
at com.github.shyiko.mysql.binlog.io.ByteArrayInputStream.read(ByteArrayInputStream.java:245)
at com.github.shyiko.mysql.binlog.io.ByteArrayInputStream.fill(ByteArrayInputStream.java:112)
at com.github.shyiko.mysql.binlog.io.ByteArrayInputStream.read(ByteArrayInputStream.java:105)
at com.github.shyiko.mysql.binlog.BinaryLogClient.readPacketSplitInChunks(BinaryLogClient.java:995)
at com.github.shyiko.mysql.binlog.BinaryLogClient.listenForEventPackets(BinaryLogClient.java:953)
... 3 more
2022-10-31 06:18:55,132 INFO || Stopping down connector [io.debezium.connector.common.BaseSourceTask]
This must be related to the fact that we have two connectors attached, because there are no problems if there's one connector per schema.
The MariaDB server didn't go down, because we have another connector on the same server and it wasn't affected.
It seems unlikely that two independent connectors would crash at exactly the same binlog position because of each other's presence.
ts_sec=1667155787, file=mysql-bin.075628, pos=104509320, server_id=1, event=32
Take the output of mariadb-binlog --start-position=104509320 mysql-bin.075628 from that position (just one full entry is probably sufficient) and raise a bug report (if one doesn't already exist).

IOError(Stalefile) exception being thrown by Kafka Streams RocksDB

When running my stateful Kafka Streams applications I'm coming across various RocksDB Disk I/O Stalefile exceptions. The exception only occurs when I have at least one KTable implementation, and it happens at various different times. I've tried countless times to reproduce it but haven't been able to.
App/Environment details:
Runtime: Java
Kafka library: org.apache.kafka:kafka-streams:2.5.1
Deployment: OpenShift
Volume type: NFS
RAM: 2000 - 8000 MiB
CPU: 200 Millicores to 2 Cores
Threads: 1
Partitions: 1 - many
Exceptions encountered:
Caused by: org.apache.kafka.streams.errors.ProcessorStateException: Error while getting value for key from at org.apache.kafka.streams.state.internals.RocksDBStore.get(RocksDbStore.java:301)
Caused by: org.apache.kafka.streams.errors.ProcessorStateException: Error restoring batch to store at org.apache.kafka.streams.state.internals.RocksDBStore$RocksDBBatchingRestoreCallback.restoreAll(RocksDbStore.java:636)
Caused by: org.apache.kafka.streams.errors.ProcessorStateException: Error while range compacting during restoring at org.apache.kafka.streams.state.internals.RocksDBStore$SingleColumnFamilyAccessor.toggleDbForBulkLoading(RocksDbStore.java:616)
Caused by: org.apache.kafka.streams.errors.ProcessorStateException: Error while executing flush from store at org.apache.kafka.streams.state.internals.RocksDBStore.flush(RocksDbStore.java:616)
Apologies for not being able to post the entire stack trace, but all of the above exceptions seem to reference the org.rocksdb.RocksDBException: IOError(Stalefile) exception.
Additional info:
Using a persisted state directory
Kafka topic settings are created with defaults
Running a single instance on a single thread
Exception is raised during gets and writes
Exception is raised when consuming valid data
Exception also occurs on internal repartition topics
I'd really appreciate any help and please let me know if I can provide any further information.
If you are using a POSIX file system, this error means that the file system returned ESTALE. See the description of that error code at https://man7.org/linux/man-pages/man3/errno.3.html
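The answer above only identifies the errno. A common mitigation, implied by the NFS volume type listed in the question but not stated in the answer, is to keep the RocksDB state directory on a node-local volume rather than NFS, so the file system cannot hand back stale handles. A minimal sketch, assuming a hypothetical local path and placeholder application/bootstrap settings:

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

public class LocalStateDirConfig {
    public static Properties streamsProps() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");     // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");      // placeholder
        // Keep RocksDB state on a node-local volume (e.g. an emptyDir or local PV on OpenShift)
        // instead of NFS, which is where ESTALE / IOError(Stalefile) typically comes from.
        props.put(StreamsConfig.STATE_DIR_CONFIG, "/var/lib/kafka-streams");  // hypothetical local path
        return props;
    }
}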

Does Kafka Connect restart a failed task?

We have a source connector that reads from an RDBMS and writes to Kafka. It uses Schema Registry with an Avro schema.
I am finding the following exceptions in the Kafka Connect log and the Schema Registry log, respectively.
1.
Committing offsets (org.apache.kafka.connect.runtime.WorkerSourceTask:426)
WorkerSourceTask{id=A-0} flushing 0 outstanding messages for offset commit (org.apache.kafka.connect.runtime.WorkerSourceTask:443)
Task threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerTask:186)
org.apache.kafka.connect.errors.ConnectException: Tolerance exceeded in error handler
.
.
Caused by: org.apache.kafka.connect.errors.DataException: Failed to serialize Avro data from topic A :
at io.confluent.connect.avro.AvroConverter.fromConnectData(AvroConverter.java:91)
at org.apache.kafka.connect.storage.Converter.fromConnectData(Converter.java:63)
.
.
Caused by: org.apache.kafka.common.errors.SerializationException: Error registering Avro schema:
.
.
Caused by: io.confluent.kafka.schemaregistry.client.rest.exceptions.RestClientException: Register operation timed out; error code: 50002
.
.
Task is being killed and will not recover until manually restarted (org.apache.kafka.connect.runtime.WorkerTask:187)
Stopping JDBC source task (io.confluent.connect.jdbc.source.JdbcSourceTask:314)
Closing the Kafka producer with timeoutMillis = 30000 ms.
(org.apache.kafka.clients.producer.KafkaProducer:1182)
2.
Wait to catch up until the offset at 1 (io.confluent.kafka.schemaregistry.storage.KafkaStore:304)
Request Failed with exception (io.confluent.rest.exceptions.DebuggableExceptionMapper:62)
io.confluent.kafka.schemaregistry.rest.exceptions.RestSchemaRegistryTimeoutException: Register operation timed out
at io.confluent.kafka.schemaregistry.rest.exceptions.Errors.operationTimeoutException(Errors.java:132)
.
.
Caused by: io.confluent.kafka.schemaregistry.exceptions.SchemaRegistryTimeoutException: Write to the Kafka store timed out while
at io.confluent.kafka.schemaregistry.storage.KafkaSchemaRegistry.register(KafkaSchemaRegistry.java:508)
at io.confluent.kafka.schemaregistry.storage.KafkaSchemaRegistry.registerOrForward(KafkaSchemaRegistry.java:553)
.
.
Caused by: io.confluent.kafka.schemaregistry.storage.exceptions.StoreTimeoutException: KafkaStoreReaderThread failed to reach target offset within the timeout interval. targetOffset: 3, offsetReached: 1, timeout(ms): 500
So basically, before registering a schema, the schema registry first waits to catch up to the latest offset of its backing Kafka store, and that wait times out after 500 ms.
My questions are these:
How can I find out why it is not able to read from Kafka?
Does the source connector task restart, or does the connector poll data again for the failed task? I ask because in a later section of the log I see this:
Committing offsets (org.apache.kafka.connect.runtime.WorkerSourceTask:426)
WorkerSourceTask{id=A-0} flushing 0 outstanding messages for offset commit (org.apache.kafka.connect.runtime.WorkerSourceTask:443)
So earlier it failed after this point, but now it does not print the exception, which means it passed.
The key thing to note is that when it failed earlier, only the task for connector A failed and the others passed. Later I did not find the exception for connector A.
If the task is not restarting or the connector is not polling again, I need to restart the task using the REST API.
Any help will be greatly appreciated.
Thanks in advance.
Regarding your question title, read the error.
task will not recover until manually restarted
If you have more than one task, you would still expect to see logs from other tasks.
As far as offset commits go, source task offsets are not committed until the task succeeds, and none of the logs shown indicate anything "moving to latest".
The error has nothing to do with reading from Kafka. The error is a timeout in your schema registry client in the AvroConverter, which isn't required for Kafka Connect.
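Since the task will not recover until manually restarted, the restart itself goes through the Connect REST API. A minimal sketch of that call; the worker URL, connector name, and task number are placeholders:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RestartConnectTask {
    public static void main(String[] args) throws Exception {
        // Placeholders: adjust the worker URL, connector name, and task number.
        String workerUrl = "http://localhost:8083";
        String connector = "A";
        int task = 0;

        // POST /connectors/{name}/tasks/{id}/restart restarts a single failed task.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(workerUrl + "/connectors/" + connector + "/tasks/" + task + "/restart"))
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        // A 2xx status means the restart request was accepted.
        System.out.println("Restart returned HTTP " + response.statusCode());
    }
}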

Kafka fails to publish message when any node in a 3 node cluster is down

I have Kafka 0.10.2.1 deployed in a 3-node cluster with mostly default configuration. The producer config is as follows:
"bootstrap.servers", 3;
"retry.backoff.ms", "1000"
"reconnect.backoff.ms", "1000"
"max.request.size", "5242880"
"key.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer"
"value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer"
What I am seeing is that when one node in the cluster is down, I cannot publish messages to Kafka anymore. I get the following exception when I do so:
05-Apr-2018 22:29:33,362 PDT ERROR [vm18] [KafkaMessageBroker] (default task-43) |default| Failed to publish message for topic deviceConfigRequest: java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.TimeoutException: Expiring 1 record(s) for deviceConfigRequest-1: 30967 ms has passed since batch creation plus linger time
at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.valueOrError(FutureRecordMetadata.java:70) [kafka-clients-0.10.2.1.jar:]
at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.get(FutureRecordMetadata.java:65) [kafka-clients-0.10.2.1.jar:]
at org.apache.kafka.clients.producer.internals.FutureRecordMetadata.get(FutureRecordMetadata.java:25) [kafka-clients-0.10.2.1.jar:]
at x.y.x.KafkaMessageBroker.publishMessage(KafkaMessageBroker.java:151) [classes:]
What am I missing?

org.apache.kafka.common.errors.TimeoutException

I have a two-broker Kafka 1.0.0 cluster and I am running a Kafka Streams 1.0.0 application against it. I increased the producer request.timeout.ms to 5 minutes to fix a producer TimeoutException.
Currently I am getting the two types of exceptions below after running for some time. I am trying to fix them as suggested in "Apache Kafka: TimeoutException and then nothing works", but the solution there was incomplete. Is that solution (decreasing the producer batch.size) recommendable? Please help.
Exception 1
2017-12-08 13:11:55,129 ERROR o.a.k.s.p.i.RecordCollectorImpl [sample-app-0.0.1-156ec0d4-6d7c-40b0-a493-370f8d9a092c-StreamThread-1] task [2_0] Error sending record (key 5a12c529e532af0b84f5d937 value com.kafka.streams.SampleEvent#54a6900d timestamp 1512536799387) to topic abc due to org.apache.kafka.common.errors.TimeoutException: Failed to allocate memory within the configured max blocking time 60000 ms.; No more records will be sent and no more offsets will be recorded for this task.
2017-12-08 13:11:55,131 ERROR o.a.k.s.p.i.AssignedTasks [sample-app-0.0.1-156ec0d4-6d7c-40b0-a493-370f8d9a092c-StreamThread-1] stream-thread [sample-app-0.0.1-156ec0d4-6d7c-40b0-a493-370f8d9a092c-StreamThread-1] Failed to process stream task 2_0 due to the following error: org.apache.kafka.streams.errors.StreamsException: Exception caught in process. taskId=2_0, processor=KSTREAM-SOURCE-0000000004, topic=Sample-Event, partition=0, offset=508417
at org.apache.kafka.streams.processor.internals.StreamTask.process(StreamTask.java:232)
at org.apache.kafka.streams.processor.internals.AssignedTasks.process(AssignedTasks.java:403)
at org.apache.kafka.streams.processor.internals.TaskManager.process(TaskManager.java:317)
at org.apache.kafka.streams.processor.internals.StreamThread.processAndMaybeCommit(StreamThread.java:942)
at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:822)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:774)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:744)
Caused by: org.apache.kafka.streams.errors.StreamsException: task [2_0] Abort sending since an error caught with a previous record (key 5a12c529e532af0b84f5d937 value com.kafka.streams.SampleEvent#54a6900d timestamp 1512536799387) to topic abc due to org.apache.kafka.common.errors.TimeoutException: Failed to allocate memory within the configured max blocking time 60000 ms..
at org.apache.kafka.streams.processor.internals.RecordCollectorImpl$1.onCompletion(RecordCollectorImpl.java:118)
at org.apache.kafka.clients.producer.KafkaProducer.doSend(KafkaProducer.java:819)
at org.apache.kafka.clients.producer.KafkaProducer.send(KafkaProducer.java:760)
at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.send(RecordCollectorImpl.java:100)
at org.apache.kafka.streams.processor.internals.RecordCollectorImpl.send(RecordCollectorImpl.java:78)
at org.apache.kafka.streams.processor.internals.SinkNode.process(SinkNode.java:87)
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:85)
at org.apache.kafka.streams.kstream.internals.KStreamTransform$KStreamTransformProcessor.process(KStreamTransform.java:56)
at org.apache.kafka.streams.processor.internals.ProcessorNode$1.run(ProcessorNode.java:46)
at org.apache.kafka.streams.processor.internals.StreamsMetricsImpl.measureLatencyNs(StreamsMetricsImpl.java:208)
at org.apache.kafka.streams.processor.internals.ProcessorNode.process(ProcessorNode.java:124)
at org.apache.kafka.streams.processor.internals.ProcessorContextImpl.forward(ProcessorContextImpl.java:85)
at org.apache.kafka.streams.processor.internals.SourceNode.process(SourceNode.java:80)
at org.apache.kafka.streams.processor.internals.StreamTask.process(StreamTask.java:216)
... 6 more
Caused by: org.apache.kafka.common.errors.TimeoutException: Failed to allocate memory within the configured max blocking time 60000 ms.
Exception 2
2017-12-11 11:08:35,257 ERROR o.a.k.s.p.i.RecordCollectorImpl [kafka-producer-network-thread | sample-app-0.0.1-030b5133-df00-4abd-a3de-8bfab114f626-StreamThread-1-producer] task [2_0] Error sending record (key 5a12c529e532af0b84f5d937 value com.kafka.streams.SampleEvent#1758de61 timestamp 1512795449471) to topic abc due to org.apache.kafka.common.errors.TimeoutException: Expiring 14 record(s) for abc-0: 122597 ms has passed since last append; No more records will be sent and no more offsets will be recorded for this task.
2017-12-11 11:08:56,001 ERROR o.a.k.s.p.i.AssignedTasks [sample-app-0.0.1-030b5133-df00-4abd-a3de-8bfab114f626-StreamThread-1] stream-thread [sample-app-0.0.1-030b5133-df00-4abd-a3de-8bfab114f626-StreamThread-1] Failed to commit stream task 2_0 due to the following error: org.apache.kafka.streams.errors.StreamsException: task [2_0] Abort sending since an error caught with a previous record (key 5a12c529e532af0b84f5d937 value com.kafka.streams.SampleEvent#1758de61 timestamp 1512795449471) to topic abc due to org.apache.kafka.common.errors.TimeoutException: Expiring 14 record(s) for abc-0: 122597 ms has passed since last append.
at org.apache.kafka.streams.processor.internals.RecordCollectorImpl$1.onCompletion(RecordCollectorImpl.java:118)
at org.apache.kafka.clients.producer.internals.ProducerBatch.completeFutureAndFireCallbacks(ProducerBatch.java:204)
at org.apache.kafka.clients.producer.internals.ProducerBatch.done(ProducerBatch.java:187)
at org.apache.kafka.clients.producer.internals.Sender.failBatch(Sender.java:627)
at org.apache.kafka.clients.producer.internals.Sender.sendProducerData(Sender.java:287)
at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:238)
at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:163)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.kafka.common.errors.TimeoutException: Expiring 14 record(s) for abc-0: 122597 ms has passed since last append
We faced a similar issue, which we resolved as follows.
First issue: by setting max.block.ms to something higher than the currently configured value.
Second issue: by increasing batch.size and decreasing linger.ms on the Kafka producer side (which may affect latency), as sketched below. Note that increasing batch.size results in fewer, larger batches with more messages in each, not more batches with fewer messages.
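A minimal sketch of where those knobs sit in the producer configuration (the values shown are illustrative, not tuned recommendations):

import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class ProducerTuningSketch {
    public static Properties tunedProducerProps() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092"); // placeholders
        // First exception: allow send()/metadata fetches to block longer than the 60 s default.
        props.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, "120000");
        // Second exception: larger batches, flushed sooner.
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, "32768");   // bytes per partition batch
        props.put(ProducerConfig.LINGER_MS_CONFIG, "5");
        // In a Kafka Streams app these would be passed under StreamsConfig.producerPrefix(...) keys.
        return props;
    }
}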
This looks like something that often happens when an expected topic hasn't been created. Try looking further back in the log files.
You can also explicitly use the admin client to check which topics exist, as sketched below.
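A minimal sketch of such a check using the Java AdminClient (the bootstrap server address is a placeholder):

import java.util.Properties;
import java.util.Set;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;

public class ListTopics {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder
        try (AdminClient admin = AdminClient.create(props)) {
            // Print every topic name the cluster knows about, so you can confirm
            // the topics your application writes to actually exist.
            Set<String> topics = admin.listTopics().names().get();
            topics.forEach(System.out::println);
        }
    }
}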
The first issue is due to this reason: the producer will wait up to 60,000 ms (the default value) for topic metadata. If the metadata isn't available within that time, it throws the Streams timeout exception. To fix this, set the Kafka producer config ProducerConfig.MAX_BLOCK_MS_CONFIG to some value greater than 60000 ms. This should resolve the issue.
If you are using NiFi and Kafka with SASL_SSL (without Kerberos) and you are providing the Kafka client JAAS config, then increase the Metadata Wait Time to 100 sec and the Acknowledgment Wait Time to 100 sec; this should work for you.