Running multiple Debezium connectors on the same source MariaDB

We have multiple MariaDB schemas, and for each of them we run two Debezium connectors. Everything runs fine for a while, but every 1-2 weeks or so a Debezium error occurs on a random connector:
2022-10-31 06:18:55,106 ERROR MySQL|scheme_1|binlog Error during binlog processing. Last offset stored = {transaction_id=null, ts_sec=1667155787, file=mysql-bin.075628, pos=104509320, server_id=1, event=32}, binlog reader near position = mysql-bin.075628/300573885 [io.debezium.connector.mysql.MySqlStreamingChangeEventSource]
2022-10-31 06:18:55,107 ERROR MySQL|scheme_1|binlog Producer failure [io.debezium.pipeline.ErrorHandler]
io.debezium.DebeziumException: Connection reset
at io.debezium.connector.mysql.MySqlStreamingChangeEventSource.wrap(MySqlStreamingChangeEventSource.java:1189)
at io.debezium.connector.mysql.MySqlStreamingChangeEventSource$ReaderThreadLifecycleListener.onCommunicationFailure(MySqlStreamingChangeEventSource.java:1234)
at com.github.shyiko.mysql.binlog.BinaryLogClient.listenForEventPackets(BinaryLogClient.java:980)
at com.github.shyiko.mysql.binlog.BinaryLogClient.connect(BinaryLogClient.java:599)
at com.github.shyiko.mysql.binlog.BinaryLogClient$7.run(BinaryLogClient.java:857)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.net.SocketException: Connection reset
at java.base/java.net.SocketInputStream.read(SocketInputStream.java:186)
at java.base/java.net.SocketInputStream.read(SocketInputStream.java:140)
at com.github.shyiko.mysql.binlog.io.BufferedSocketInputStream.read(BufferedSocketInputStream.java:59)
at com.github.shyiko.mysql.binlog.io.ByteArrayInputStream.readWithinBlockBoundaries(ByteArrayInputStream.java:261)
at com.github.shyiko.mysql.binlog.io.ByteArrayInputStream.read(ByteArrayInputStream.java:245)
at com.github.shyiko.mysql.binlog.io.ByteArrayInputStream.fill(ByteArrayInputStream.java:112)
at com.github.shyiko.mysql.binlog.io.ByteArrayInputStream.read(ByteArrayInputStream.java:105)
at com.github.shyiko.mysql.binlog.BinaryLogClient.readPacketSplitInChunks(BinaryLogClient.java:995)
at com.github.shyiko.mysql.binlog.BinaryLogClient.listenForEventPackets(BinaryLogClient.java:953)
... 3 more
2022-10-31 06:18:55,113 INFO MySQL|scheme_1|binlog Stopped reading binlog after 0 events, last recorded offset: {transaction_id=null, ts_sec=1667155787, file=mysql-bin.075628, pos=104509320, server_id=1, event=32} [io.debezium.connector.mysql.MySqlStreamingChangeEventSource]
2022-10-31 06:18:55,123 ERROR || WorkerSourceTask{id=scheme_1-connector-1666100046785939106-0} Task threw an uncaught and unrecoverable exception. Task is being killed and will not recover until manually restarted [org.apache.kafka.connect.runtime.WorkerTask]
org.apache.kafka.connect.errors.ConnectException: An exception occurred in the change event producer. This connector will be stopped.
at io.debezium.pipeline.ErrorHandler.setProducerThrowable(ErrorHandler.java:50)
at io.debezium.connector.mysql.MySqlStreamingChangeEventSource$ReaderThreadLifecycleListener.onCommunicationFailure(MySqlStreamingChangeEventSource.java:1234)
at com.github.shyiko.mysql.binlog.BinaryLogClient.listenForEventPackets(BinaryLogClient.java:980)
at com.github.shyiko.mysql.binlog.BinaryLogClient.connect(BinaryLogClient.java:599)
at com.github.shyiko.mysql.binlog.BinaryLogClient$7.run(BinaryLogClient.java:857)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: io.debezium.DebeziumException: Connection reset
at io.debezium.connector.mysql.MySqlStreamingChangeEventSource.wrap(MySqlStreamingChangeEventSource.java:1189)
... 5 more
Caused by: java.net.SocketException: Connection reset
at java.base/java.net.SocketInputStream.read(SocketInputStream.java:186)
at java.base/java.net.SocketInputStream.read(SocketInputStream.java:140)
at com.github.shyiko.mysql.binlog.io.BufferedSocketInputStream.read(BufferedSocketInputStream.java:59)
at com.github.shyiko.mysql.binlog.io.ByteArrayInputStream.readWithinBlockBoundaries(ByteArrayInputStream.java:261)
at com.github.shyiko.mysql.binlog.io.ByteArrayInputStream.read(ByteArrayInputStream.java:245)
at com.github.shyiko.mysql.binlog.io.ByteArrayInputStream.fill(ByteArrayInputStream.java:112)
at com.github.shyiko.mysql.binlog.io.ByteArrayInputStream.read(ByteArrayInputStream.java:105)
at com.github.shyiko.mysql.binlog.BinaryLogClient.readPacketSplitInChunks(BinaryLogClient.java:995)
at com.github.shyiko.mysql.binlog.BinaryLogClient.listenForEventPackets(BinaryLogClient.java:953)
... 3 more
2022-10-31 06:18:55,132 INFO || Stopping down connector [io.debezium.connector.common.BaseSourceTask]
This must be related to the fact that we have two connectors attached, because there are no problems when there is only one connector per schema.
The MariaDB server didn't go down, because we have another connector on the same server and it wasn't affected.
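For reference, each of the two connectors per schema is configured along these lines. This is only a sketch with placeholder hostnames, credentials, and IDs, not our real values; Debezium requires database.server.id to be unique for every client reading the binlog from the same server, so in the sketch each connector would get its own ID:
# Sketch of one of the two connectors attached to scheme_1 (all values are placeholders)
name=scheme_1-connector-1
connector.class=io.debezium.connector.mysql.MySqlConnector
database.hostname=mariadb.example.internal
database.port=3306
database.user=debezium
database.password=********
# must be unique for every binlog client connecting to the same server
database.server.id=184051
database.server.name=scheme_1
database.include.list=scheme_1
database.history.kafka.bootstrap.servers=kafka:9092
database.history.kafka.topic=schema-changes.scheme_1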

It seems unlikely that two independent connectors would crash at exactly the same binlog position because of each other's presence.
ts_sec=1667155787, file=mysql-bin.075628, pos=104509320, server_id=1, event=32
Take the output of mariadb-binlog --start-position=104509320 mysql-bin.075628 from that position (just one full entry is probably sufficient) and raise a bug report (if one doesn't already exist).

Related

IOError(Stalefile) exception being thrown by Kafka Streams RocksDB

When running my stateful Kafka Streams applications I'm coming across various RocksDB Disk I/O Stalefile exceptions. The exception only occurs when I have at least one KTable and it happens at various times. I've tried countless times to reproduce it but haven't been able to.
App/Environment details:
Runtime: Java
Kafka library: org.apache.kafka:kafka-streams:2.5.1
Deployment: OpenShift
Volume type: NFS
RAM: 2000 - 8000 MiB
CPU: 200 Millicores to 2 Cores
Threads: 1
Partitions: 1 - many
Exceptions encountered:
Caused by: org.apache.kafka.streams.errors.ProcessorStateException: Error while getting value for key from at org.apache.kafka.streams.state.internals.RocksDBStore.get(RocksDbStore.java:301)
Caused by: org.apache.kafka.streams.errors.ProcessorStateException: Error restoring batch to store at org.apache.kafka.streams.state.internals.RocksDBStore$RocksDBBatchingRestoreCallback.restoreAll(RocksDbStore.java:636)
Caused by: org.apache.kafka.streams.errors.ProcessorStateException: Error while range compacting during restoring at org.apache.kafka.streams.state.internals.RocksDBStore$SingleColumnFamilyAccessor.toggleDbForBulkLoading(RocksDbStore.java:616)
Caused by: org.apache.kafka.streams.errors.ProcessorStateException: Error while executing flush from store at org.apache.kafka.streams.state.internals.RocksDBStore.flush(RocksDbStore.java:616)
Apologies for not being able to post the entire stack trace, but all of the above exceptions seem to reference the org.rocksdb.RocksDBException: IOError(Stalefile) exception.
Additional info:
Using a persisted state directory
Kafka topic settings are created with defaults
Running a single instance on a single thread
Exception is raised during gets and writes
Exception is raised when consuming valid data
Exception also occurs on internal repartition topics
I'd really appreciate any help and please let me know if I can provide any further information.
If you are using a POSIX file system, this error means that the file system returned ESTALE. See the description of that error code at https://man7.org/linux/man-pages/man3/errno.3.html
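For reference, the file I/O in question happens under the Kafka Streams state directory, which, given the persisted state directory and NFS volume mentioned above, presumably sits on the NFS mount. A minimal sketch of the relevant Streams configuration (the path and application id below are placeholders, not the asker's actual values):
# Kafka Streams state directory; the RocksDB state stores, and hence the file I/O
# that can return ESTALE, live under this path. In the setup above it would resolve
# to the NFS-backed volume.
state.dir=/var/lib/kafka-streams
application.id=my-streams-app
bootstrap.servers=kafka:9092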

Does Kafka Connect restart a failed task

We have a source connector that reads from an RDBMS and writes to Kafka. It uses Schema Registry with an Avro schema.
I am finding the following exceptions in the Kafka Connect log and the Schema Registry log, respectively.
1.
Committing offsets (org.apache.kafka.connect.runtime.WorkerSourceTask:426)
WorkerSourceTask{id=A-0} flushing 0 outstanding messages for offset commit (org.apache.kafka.connect.runtime.WorkerSourceTask:443)
Task threw an uncaught and unrecoverable exception (org.apache.kafka.connect.runtime.WorkerTask:186)
org.apache.kafka.connect.errors.ConnectException: Tolerance exceeded in error handler
.
.
Caused by: org.apache.kafka.connect.errors.DataException: Failed to serialize Avro data from topic A :
at io.confluent.connect.avro.AvroConverter.fromConnectData(AvroConverter.java:91)
at org.apache.kafka.connect.storage.Converter.fromConnectData(Converter.java:63)
.
.
Caused by: org.apache.kafka.common.errors.SerializationException: Error registering Avro schema:
.
.
Caused by: io.confluent.kafka.schemaregistry.client.rest.exceptions.RestClientException: Register operation timed out; error code: 50002
.
.
Task is being killed and will not recover until manually restarted (org.apache.kafka.connect.runtime.WorkerTask:187)
Stopping JDBC source task (io.confluent.connect.jdbc.source.JdbcSourceTask:314)
Closing the Kafka producer with timeoutMillis = 30000 ms.
(org.apache.kafka.clients.producer.KafkaProducer:1182)
2.
Wait to catch up until the offset at 1 (io.confluent.kafka.schemaregistry.storage.KafkaStore:304)
Request Failed with exception (io.confluent.rest.exceptions.DebuggableExceptionMapper:62)
io.confluent.kafka.schemaregistry.rest.exceptions.RestSchemaRegistryTimeoutException: Register operation timed out
at io.confluent.kafka.schemaregistry.rest.exceptions.Errors.operationTimeoutException(Errors.java:132)
.
.
Caused by: io.confluent.kafka.schemaregistry.exceptions.SchemaRegistryTimeoutException: Write to the Kafka store timed out while
at io.confluent.kafka.schemaregistry.storage.KafkaSchemaRegistry.register(KafkaSchemaRegistry.java:508)
at io.confluent.kafka.schemaregistry.storage.KafkaSchemaRegistry.registerOrForward(KafkaSchemaRegistry.java:553)
.
.
Caused by: io.confluent.kafka.schemaregistry.storage.exceptions.StoreTimeoutException: KafkaStoreReaderThread failed to reach target offset within the timeout interval. targetOffset: 3, offsetReached: 1, timeout(ms): 500
So basically, before registering the schema, the Schema Registry moves its offset to the latest and that times out after 500 ms.
My questions are these:
How can I find out why it is not able to read from Kafka?
Does the source connector restart the failed task or poll data for it again? I ask because in a later section of the log I see this:
Committing offsets (org.apache.kafka.connect.runtime.WorkerSourceTask:426)
WorkerSourceTask{id=A-0} flushing 0 outstanding messages for offset commit (org.apache.kafka.connect.runtime.WorkerSourceTask:443)
So earlier it failed right after this point, but now it does not print the exception, which suggests it passed.
The key thing to note is that when it failed earlier, only the task of connector A failed and the others passed. Later I did not find the exception for connector A again.
If the task does not restart or the connector does not poll again, I need to restart the task using the REST API.
Any help will be greatly appreciated.
Thanks in advance.
Regarding your question title, read the error.
task will not recover until manually restarted
If you have more than one task, you would still expect to see logs from other tasks.
As for offset commits, source task offsets are not committed until the task succeeds, and none of the logs shown indicate anything "moving to latest".
The error has nothing to do with reading from Kafka. The error is a timeout in your schema registry client in the AvroConverter, which isn't required for Kafka Connect.
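To make that concrete, the registry client lives behind the converter settings on the worker or connector; a minimal sketch (the registry URL is a placeholder), including the registry-free JsonConverter alternative alluded to above:
# Current setup: Avro values serialized through Schema Registry (the call that is timing out)
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://schema-registry:8081
# Registry-free alternative: JSON with inline schemas, no Schema Registry involved
value.converter=org.apache.kafka.connect.json.JsonConverter
value.converter.schemas.enable=true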

Connector fails when schema registry's master changes

My source connector throws
Caused by: io.confluent.kafka.schemaregistry.client.rest.exceptions.RestClientException: Error while forwarding register schema request to the master; error code: 50003
or
Caused by: io.confluent.kafka.schemaregistry.client.rest.exceptions.RestClientException: Master not known
I found this happens when the Schema Registry master changes. I have two replicas of Schema Registry under the same service on k8s.
The top exception is org.apache.kafka.connect.errors.ConnectException: Tolerance exceeded in error handler
How can I increase the tolerance so the connector retries more times until the new master is elected?
Just because you have two replicas doesn't mean they know about each other.
See how to fix this - https://github.com/confluentinc/cp-helm-charts/issues/375
Regarding the error handler, you can give it retry timeouts. Example from the docs:
# retry for at most 10 minutes, waiting up to 30 seconds between consecutive failures
errors.retry.timeout=600000
errors.retry.delay.max.ms=30000
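These are connector-level properties, so they go in the source connector's configuration rather than in the worker config. A minimal sketch of where they sit (the connector name and class are placeholders for whatever source connector is in use):
name=my-source-connector
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
errors.retry.timeout=600000
errors.retry.delay.max.ms=30000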

Debezium postgres ERROR: parameter "include-unchanged-toast" was deprecated

I am currently load testing the Debezium Postgres connector to find out whether it can handle very large volumes (billions) of change logs in Aurora Postgres.
When I insert 1 million records into the Postgres table, the Debezium Postgres connector fails with the following error messages:
org.apache.kafka.connect.errors.ConnectException: An exception occurred in the change event producer. This connector will be stopped.
at io.debezium.connector.base.ChangeEventQueue.throwProducerFailureIfPresent(ChangeEventQueue.java:170)
at io.debezium.connector.base.ChangeEventQueue.poll(ChangeEventQueue.java:151)
at io.debezium.connector.postgresql.PostgresConnectorTask.poll(PostgresConnectorTask.java:188)
at org.apache.kafka.connect.runtime.WorkerSourceTask.poll(WorkerSourceTask.java:259)
at org.apache.kafka.connect.runtime.WorkerSourceTask.execute(WorkerSourceTask.java:226)
at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:177)
at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:227)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: org.postgresql.util.PSQLException: ERROR: parameter "include-unchanged-toast" was deprecated
Where: slot "wal2json_dbz5", output plugin "wal2json", in the startup callback
at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2440)
at org.postgresql.core.v3.QueryExecutorImpl.processCopyResults(QueryExecutorImpl.java:1116)
at org.postgresql.core.v3.QueryExecutorImpl.readFromCopy(QueryExecutorImpl.java:1035)
at org.postgresql.core.v3.CopyDualImpl.readFromCopy(CopyDualImpl.java:41)
at org.postgresql.core.v3.replication.V3PGReplicationStream.receiveNextData(V3PGReplicationStream.java:155)
at org.postgresql.core.v3.replication.V3PGReplicationStream.readInternal(V3PGReplicationStream.java:124)
at org.postgresql.core.v3.replication.V3PGReplicationStream.readPending(V3PGReplicationStream.java:78)
at io.debezium.connector.postgresql.connection.PostgresReplicationConnection$1.readPending(PostgresReplicationConnection.java:401)
at io.debezium.connector.postgresql.PostgresStreamingChangeEventSource.execute(PostgresStreamingChangeEventSource.java:94)
at io.debezium.pipeline.ChangeEventSourceCoordinator.lambda$start$0(ChangeEventSourceCoordinator.java:91)
It seems the connector no longer supports include-unchanged-toast. Is there any workaround for this issue?
You can either get Debezium fixed or you can use an old version of wal2json from before the option was removed.
The Git snapshot of wal2json from just before the option was removed is here.
Be warned, though, that the option was removed for a good reason.
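For context, the output plugin named in the error is selected in the connector configuration via plugin.name, and it is the connector that passes the (now deprecated) startup parameters to the slot when it begins streaming. A minimal sketch of that configuration; the slot name is taken from the error above, everything else is a placeholder:
# Debezium Postgres connector sketch; the logical decoding output plugin is chosen here
name=postgres-load-test-connector
connector.class=io.debezium.connector.postgresql.PostgresConnector
database.hostname=aurora-postgres.example.internal
database.port=5432
database.user=debezium
database.password=********
database.dbname=loadtest
database.server.name=loadtest
plugin.name=wal2json
slot.name=wal2json_dbz5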

KSQL query execution fails

I have a Confluent Platform setup on AWS with two workers, three ZooKeeper nodes, and three brokers.
Can somebody explain why I am getting an error while executing the following query in KSQL with all the servers up?
create STREAM SINK_STREAM WITH (VALUE_FORMAT='AVRO', KAFKA_TOPIC='sink-topic') AS select * from YP_USER_STREAM;
Here YP_USER_STREAM is created as follows:
CREATE STREAM YP_USER_STREAM (ID INT(11), EMAIL VARCHAR(64)) WITH (KAFKA_TOPIC='kafkaTopic', VALUE_FORMAT='JSON');
This error is generated at the time the query is executed and at KSQL server startup:
[2019-07-18 09:01:45,766] INFO Retrying admin request due to retriable exception. Retry no: 1 (io.confluent.ksql.util.KafkaTopicClient:351)
java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.UnknownTopicOrPartitionException: This server does not host this topic-partition.
at org.apache.kafka.common.internals.KafkaFutureImpl.wrapAndThrow(KafkaFutureImpl.java:45)
at org.apache.kafka.common.internals.KafkaFutureImpl.access$000(KafkaFutureImpl.java:32)
at org.apache.kafka.common.internals.KafkaFutureImpl$SingleWaiter.await(KafkaFutureImpl.java:89)
Can somebody tell me what configuration mistake I might have made?