How can I know if I'm suffering data loss during intermittent Kafka Connect read-from-source issues? - apache-kafka

We are running Kafka Connect using the Confluent JDBC Source Connector to read from a DB2 database. Periodically, we see issues like this in our Kafka Connect logs:
kafkaconnect-deploy-prod-967ddfffb-5l4cm 2021-04-23 10:39:43.770 ERROR Failed to run query for table TimestampIncrementingTableQuerier{table="PRODSCHEMA"."VW_PRODVIEW", query='null', topicPrefix='some-topic-prefix-', incrementingColumn='', timestampColumns=[UPDATEDATETIME]}: {} (io.confluent.connect.jdbc.source.JdbcSourceTask:404)
com.ibm.db2.jcc.am.SqlException: DB2 SQL Error: SQLCODE=-668, SQLSTATE=57007, SQLERRMC=1;PRODSCHEMA.SOURCE_TABLE, DRIVER=4.28.11
at com.ibm.db2.jcc.am.b7.a(b7.java:815)
...
at com.ibm.db2.jcc.am.k7.bd(k7.java:785)
at com.ibm.db2.jcc.am.k7.executeQuery(k7.java:750)
at io.confluent.connect.jdbc.source.TimestampIncrementingTableQuerier.executeQuery(TimestampIncrementingTableQuerier.java:200)
at io.confluent.connect.jdbc.source.TimestampIncrementingTableQuerier.maybeStartQuery(TimestampIncrementingTableQuerier.java:159)
at io.confluent.connect.jdbc.source.JdbcSourceTask.poll(JdbcSourceTask.java:371)
This appears to be an intermittent issue connecting to DB2, and is semi-expected; for reasons outside the scope of this question, we know that the network between the two is unreliable.
However, what we are trying to establish is whether in this circumstance data loss is likely to have occurred. I've found this article about error handling in Kafka Connect, but it only covers errors caused by broken messages, not connectivity problems between Kafka Connect and the data source.
In this case, how would we know if the failure to connect had caused data loss (i.e. records in our data source that were never produced to the target topic)? Would there be errors in the Kafka Connect log? Will Kafka Connect always retry indefinitely when it has a connectivity issue? Are there any controls over its retries?
(If it matters, Kafka Connect is version 2.5; it is deployed in a Kubernetes cluster, in distributed mode, but with only one actual running worker/container.)

Related

Unable to start debezium connector in distributed mode

Trying to deploy Debezium using Kafka Connect distributed mode is causing issues.
The Connect worker shut down without a clear exception.
The group-id declaration along with topic.regex is not forcing Connect to read only the topics matching the regex; it is trying to consume from all the topics in the cluster.
Has anyone been able to run Debezium on connect-distributed?
I followed the instructions to run Debezium using connect-distributed.

Kafka connect jdbc sink SQL error handling

I am currently configuring a Kafka JDBC sink connector to write my Kafka messages to a Postgres table. Everything works fine except the error handling. Sometimes, messages in my topic have bad data, and so the database constraints fail with an expected SQL exception (duplicate key...).
I would like to put these bad messages in a DLQ and commit the offset so the next messages can be processed, so I configured the connector with
"errors.tolerance": "all"
"errors.deadletterqueue.topic.name": "myDLQTopicName"
but it does not change anything; the connector retries until it crashes.
Is there another configuration I'm missing? I saw only these two in the Confluent documentation.
(I see in the JDBC connector changelog that error handling in the put stage was implemented in version 10.1.0 (CCDB-192), and I'm using the latest version of the connector, 10.5.1.)
"The Kafka Connect framework provides generic error handling and dead-letter queue capabilities which are available for problems with [de]serialisation and Single Message Transforms. When it comes to errors that a connector may encounter doing the actual pull or put of data from the source/target system, it’s down to the connector itself to implement logic around that."
If duplicate keys are the only type of bad record you need to deal with, you might consider using upsert as the insert.mode.
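For illustration, a minimal sketch of what that could look like in the sink connector config, assuming the primary key lives in the record value as a field called id; the connection URL and field name here are placeholders, not taken from the question:
"connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector"
"connection.url": "jdbc:postgresql://db-host:5432/mydb"
"insert.mode": "upsert"
"pk.mode": "record_value"
"pk.fields": "id"
With upsert, a record whose key already exists updates the existing row instead of violating the duplicate-key constraint.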

Kafka connect-distributed mode fault tolerance not working

I have created a Kafka Connect cluster with 3 EC2 machines and started 3 connectors (debezium-postgres source), one on each machine, each reading a different set of tables from the Postgres source. On one of the machines, I started the S3 sink connector as well. So the changed data from Postgres is moved to the Kafka broker via the source connectors (3), and the S3 sink connector consumes these messages and pushes them to an S3 bucket.
The cluster is working fine and so are the connectors. When I pause one of the connectors running on one of the EC2 machines, I was expecting its task to be taken over by another connector (postgres-debezium) running on another machine. But that's not happening.
I installed Kafdrop as well to monitor the brokers. I see the 3 internal topics connect-offsets, connect-status and connect-configs are populated with the necessary offsets, configs, and status (when I pause, a paused status message appears).
But somehow the connectors are not taking over the task when I pause.
In what scenario does a connector take over the task of a failed one? Is pausing the right way to test this, or should we produce some error on one of the connectors so another takes over?
Please guide.
Sounds like it's working as expected.
Pausing has nothing to do with the fault tolerance settings and it'll completely stop the tasks. There's nothing to rebalance until unpaused.
The fault tolerance settings for dead letter queue, skip+log, or halt are for when there are actual runtime exceptions in the connector that you cannot control through the API: for example, a database or S3 network/authentication exception, or a serialization error in the Kafka client.
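A rough sketch of how those behaviours map onto the standard error-handling properties (the DLQ topic name below is a placeholder, and the dead letter queue options only apply to sink connectors):
"errors.tolerance": "none" (fail the task on the first error; this is the default, i.e. halt)
"errors.tolerance": "all" plus "errors.log.enable": "true" (skip bad records and log them)
"errors.tolerance": "all" plus "errors.deadletterqueue.topic.name": "my-connector-dlq" (skip bad records and route them to a DLQ)
None of these settings change how a paused connector behaves; pausing stops the tasks regardless.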

Is it possible to re-ingest (sink connector) data into the db if some messages got missed entering

I currently have 2 separate connectors running the JDBC Sink Connector to ingest topics from the producer into the database. Sometimes, I see errors in the logs which cause produced messages to fail to get stored in the database.
The errors I constantly see are
Caused by: org.apache.kafka.common.errors.SerializationException: Error retrieving Avro schema for id:11
Caused by: io.confluent.kafka.schemaregistry.client.rest.exceptions.RestClientException: Subject 'topic-io..models.avro.Topic' not found; error code: 404
Which is true, because the TopicRecordName strategy is not supposed to be applied to this topic but to another topic that I directed it to; this one is just supposed to use models.avro.Topic.
I was wondering, if this happens constantly, whether there is a way to re-ingest those produced records/messages into the database after they were produced. For example, if messages were produced between 12am and 1am and some kind of error in the logs meant they failed to be consumed during that timeframe, could the configuration or offsets be used to restore them by re-ingesting them into the database? The error is due to the Schema Registry lookup failing to resolve the correct schema. It failed because it read the incorrect worker file: one of my worker files has value.converter.value.subject.name.strategy=io.confluent.kafka.serializers.subject.TopicRecordNameStrategy while the other connector does not use that subject name strategy.
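(For context, a simplified sketch of the kind of converter settings involved; the converter class and Schema Registry URL here are assumptions/placeholders, and in my case only one of the two worker files sets the last line:)
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://schema-registry:8081
value.converter.value.subject.name.strategy=io.confluent.kafka.serializers.subject.TopicRecordNameStrategy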
Currently, I set consumer.auto.offset.reset=earliest to start reading messages.
Is there a way to get that data back, for example into a file, so that I can restore it? I am deploying to production, and data must be consumed into the database at all times without any errors.
Rather than mess with the consumer group offsets, which would eventually cause correctly processed data to get consumed again and duplicated, you could use the dead letter queue configurations to send error records to a new topic, which you'd need to monitor and consume before the topic retention completely drops the events
https://www.confluent.io/blog/kafka-connect-deep-dive-error-handling-dead-letter-queues/
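A minimal sketch of the relevant sink connector settings (the DLQ topic name and replication factor are placeholders you would choose yourself):
"errors.tolerance": "all"
"errors.deadletterqueue.topic.name": "jdbc-sink-dlq"
"errors.deadletterqueue.topic.replication.factor": "3"
"errors.deadletterqueue.context.headers.enable": "true"
With the context headers enabled, each DLQ record carries the original topic, partition, offset, and error details, so once the subject-name-strategy mismatch is fixed you could point a second sink connector (or a small consumer) at the DLQ topic to replay those records into the database.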
one of my worker files has a [different config]
This is why configuration management software is important. Don't modify one server in a distributed system without a process that updates them all. Ansible/Terraform are most common if you're not running the connectors in Kubernetes

Kafka scheduler in Vertica 7.2 is running and working, but produces errors

When I run /opt/vertica/packages/kafka/bin/vkconfig launch I get this warning:
Unable to determine hostname, defaulting to 'unknown' in scheduler history
But the scheduler continues working fine and consuming messages from Kafka. What does it mean?
The next strange thing is that I find the following records in /home/dbadmin/events/dbLog (I think it is the Kafka consumer log file):
%3|1446726706.945|FAIL|vertica#consumer-1| localhost:4083/bootstrap: Failed to connect to broker at [localhost]:4083: Connection refused
%3|1446726706.945|ERROR|vertica#consumer-1| localhost:4083/bootstrap: Failed to connect to broker at [localhost]:4083: Connection refused
%3|1446726610.267|ERROR|vertica#consumer-1| 1/1 brokers are down
As I mentioned, the scheduler does eventually start, but these records periodically appear in the logs. What is this localhost:4083? Normally my broker runs on port 9092 on a separate server, which is described in the kafka_config.kafka_scheduler table.
In the scheduler history table it attempts to get the hostname using Java:
InetAddress.getLocalHost().getHostAddress();
This will sometimes result in an UnknownHostException for various reasons (you can check documentation here: https://docs.oracle.com/javase/7/docs/api/java/net/UnknownHostException.html)
If this occurs, the hostname will default to "unknown" in that table. Luckily, the schedulers coordinate by locking through your Vertica database, so knowing exactly which host a scheduler is running on is unnecessary for functionality (it only matters for monitoring).
The Kafka-related logging in dbLog is probably the standard output from rdkafka (https://github.com/edenhill/librdkafka). I'm not sure what is going on with that log message, unfortunately. Vertica should only be using the configured broker list.