Kafka Connect - JDBC sink SQL exception - apache-kafka

I am using the Confluent Community edition for a simple setup consisting of a REST client calling the Kafka REST Proxy, with the data then being pushed into an Oracle database using the provided JDBC sink connector.
I noticed that if there is an SQL exception, for instance if the incoming data is longer than the length defined for the column, the task stops, and if I restart it the same thing happens: it tries to insert the erroneous entry and stops again. It does not insert the other entries.
Is there a way I can log the erroneous entry and let the task continue inserting the other data?

The Kafka Connect framework can only skip problematic records for sink connectors when the exception is thrown during:
- Conversion of keys or values (Converter::toConnectData(...))
- Transformation (Transformation::apply)
For that you can use the errors.tolerance property:
"errors.tolerance": "all"
There are some additional properties for logging details about errors: errors.log.enable and errors.log.include.messages.
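As a sketch, these properties sit alongside the usual sink settings in the connector configuration (the values below are only an illustration):
"errors.tolerance": "all",
"errors.log.enable": "true",
"errors.log.include.messages": "true"
With errors.tolerance set to all, records that fail during conversion or transformation are skipped instead of killing the task; errors.log.enable writes the failure to the Connect worker log, and errors.log.include.messages additionally logs the content of the failed record.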
Original answer: Apache Kafka JDBC Connector - SerializationException: Unknown magic byte
If an exception is thrown while delivering messages, the sink task is killed.
If you need to handle communication errors (or others) with an external system, you have to add that support to your connector.
The JDBC connector retries when an SQLException is thrown, but it doesn't skip any records.
The number of retries and the interval between them are managed by the following properties:
- max.retries (default value 10)
- retry.backoff.ms (default value 3000)
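As a sketch, if you want the connector to hold out longer before the task fails, you can raise these in the connector configuration (the values here are illustrative, not recommendations):
"max.retries": "20",
"retry.backoff.ms": "10000"
Note that once the retries are exhausted the task still fails; these settings only delay that, they do not skip the offending record.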

The sink cannot currently ignore bad records, but you can manually skip them, using the kafka-consumer-groups tool:
kafka-consumer-groups \
--bootstrap-server kafka:29092 \
--group connect-sink_postgres_foo_00 \
--reset-offsets \
--topic foo \
--to-offset 2 \
--execute
For more info see here.
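Before running the reset, you can find the group name and the offset it is stuck at with --describe (a sketch; sink connector groups follow the connect-<connector-name> convention):
kafka-consumer-groups \
--bootstrap-server kafka:29092 \
--group connect-sink_postgres_foo_00 \
--describe
The CURRENT-OFFSET column shows where the group is stuck; pick a --to-offset just past the bad record.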

Currently, there is no way to stop this from failing the sink connector, specifically.
However, there is another approach that might be worth looking into. You can apply a Single Message Transform (SMT) on the connector that checks the length of the incoming columns and then decides either to throw an exception, which bubbles up to the errors.tolerance configuration, or to return null, which filters the record out entirely.
Since this is a sink connector, the SMT is applied before the record is passed on to the connector, so records that are skipped by the transform never reach the tasks to be synced into the database.
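There is no built-in length-checking transform, so this would be a custom SMT. As a sketch, wiring such a transform into the sink configuration could look like the following, where the class name and the max.length property are made up for illustration:
"transforms": "lengthCheck",
"transforms.lengthCheck.type": "com.example.connect.smt.MaxLengthFilter",
"transforms.lengthCheck.max.length": "255",
"errors.tolerance": "all"
If the transform throws, errors.tolerance decides whether the record is skipped or the task fails; if it returns null, the record is silently dropped before it ever reaches the JDBC sink.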

Related

Kafka connect jdbc sink SQL error handling

I am currently configuring a Kafka JDBC sink connector to write my Kafka messages into a Postgres table. Everything works fine except the error handling part. Sometimes messages in my topic contain bad data, so the database constraints fail with an expected SQL exception (duplicate key...).
I would like to put these wrong messages in a DLQ and commit the offset so the next messages are processed, so I configured the connector with
"errors.tolerance": "all"
"errors.deadletterqueue.topic.name": "myDLQTopicName"
but it does not change a thing: the connector retries until it crashes.
Is there another configuration I'm missing? I only saw these two in the Confluent documentation.
(I see in the JDBC connector changelog that error handling in the put stage was implemented in version 10.1.0 (CCDB-192), and I'm using the latest version of the connector, 10.5.1.)
"The Kafka Connect framework provides generic error handling and dead-letter queue capabilities which are available for problems with [de]serialisation and Single Message Transforms. When it comes to errors that a connector may encounter doing the actual pull or put of data from the source/target system, it’s down to the connector itself to implement logic around that."
If duplicate keys are the only type of bad record you need to deal with, you might consider using upsert as the insert.mode.
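As a sketch (the key column name is a placeholder), the relevant JDBC sink settings would be:
"insert.mode": "upsert",
"pk.mode": "record_value",
"pk.fields": "id"
With upsert, a record whose key already exists in the table updates the existing row instead of violating the unique constraint, so the duplicate-key exception never happens in the first place.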

Kafka: "Broker failed to validate record" after increasing partition

I increased the partition count of an existing Kafka topic via Terraform. The partition count increased successfully; however, when I test the connection to the topic, I'm getting a "Broker failed to validate record" error.
Testing method:
echo "test" | kcat -b ...
**sensitive content has been removed**
...
% Auto-selecting Producer mode (use -P or -C to override)
% Delivery failed for message: Broker: Broker failed to validate record
I searched online and came across something called broker-side schema validation: https://docs.confluent.io/cloud/current/sr/broker-side-schema-validation.html
Is there something I need to do after increasing the partitions, i.e. flush some cache?
You need to ask your Kafka cluster administrator if they have schema validation enabled, but increasing partitions shouldn't cause that. (This is a feature of Confluent Server, not Apache Kafka).
If someone changed the schema in the schema registry for your topic, or validation has suddenly been enabled, and you are sending a record from an "old" schema (or not correct schema), then the broker would "fail to validate" the record.
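If you have access, one way to check (a sketch; the topic name and bootstrap server are placeholders, and you may need a --command-config file with your cluster credentials) is to describe the topic configuration and look for the confluent.*.schema.validation settings, which only exist on Confluent Server / Confluent Cloud:
kafka-configs \
--bootstrap-server kafka:9092 \
--entity-type topics \
--entity-name my-topic \
--describe
If confluent.value.schema.validation is true, every produced value must carry a schema ID registered for that topic's subject, so a plain string like "test" sent from kcat would be rejected with exactly this error.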

Flink doesn't consume Data from Kafka publisher

What I have: http://prntscr.com/szmkn4
That's the most bare-bones version of it. Some things will come later, but for now the issue is that data is arriving properly in my consumer in the form of a JSON string.
I want to throw it into a Flink table, which I create with this statement: http://prntscr.com/szmll3
I then check that it got created, just to be sure, and get this: http://prntscr.com/szmn79
Next I want to turn on the machine and check my data with "SELECT * FROM RawData", but I get the following error:
[ERROR] Could not execute SQL statement. Reason:
org.apache.flink.kafka.shaded.org.apache.kafka.common.config.ConfigException: No resolvable bootstrap urls given in bootstrap.servers
I assume it's an issue with how I created my table, but am honestly not sure where/what/how.
My publisher's properties in NiFi are:
https://prnt.sc/szoe6z
and
http://prntscr.com/szoeka
If you need any additional information from me, feel free to ask.
Thanks in advance,
Psy
[ERROR] Could not execute SQL statement. Reason: org.apache.flink.kafka.shaded.org.apache.kafka.common.config.ConfigException: No resolvable bootstrap urls given in bootstrap.servers
That likely means that the Kafka bootstrap servers you've specified to Flink cannot have their hostnames resolved on the Flink servers. You'd know if it's a NiFi issue because you'd see errors in the NiFi flow saying it couldn't produce to Kafka. It might be producing to the wrong topic or even the wrong set of brokers if you have multiple Kafka clusters, but the error you posted isn't a NiFi issue.
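Since the CREATE TABLE statement is only available as a screenshot, here is a sketch of where the bootstrap servers belong in a Flink SQL Kafka table definition (column names, topic, and broker address are placeholders, and the exact option keys depend on your Flink version; this is the Flink 1.11+ form):
CREATE TABLE RawData (
  `message` STRING
) WITH (
  'connector' = 'kafka',
  'topic' = 'rawdata',
  'properties.bootstrap.servers' = 'broker-1:9092',
  'properties.group.id' = 'flink-rawdata',
  'scan.startup.mode' = 'earliest-offset',
  'format' = 'json'
);
Whatever host you put in properties.bootstrap.servers has to be resolvable from the machines running the Flink SQL client and task managers; that is what the "No resolvable bootstrap urls" error is complaining about.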

Kafka Connect - Delete Connector with configs?

I know how to delete a Kafka connector, as mentioned here: Kafka Connect - How to delete a connector
But I am not sure whether it also deletes/erases that connector's configs, offsets, and status from the *.storage.topic topics for that worker.
For example:
Let's say I delete a connector named "connector-abc-1.0.0", and the Kafka Connect worker was started with the following config:
offset.storage.topic=<topic.name>.internal.offsets
config.storage.topic=<topic.name>.internal.configs
status.storage.topic=<topic.name>.internal.status
Now, after the DELETE call for that connector, will all records for that specific connector be erased from the above internal topics?
So that I can create a new connector with the "same name" on the same worker but with a different config (different offset.start or connector.class)?
When you delete a connector, the offsets are retained in the offsets topic.
If you recreate the connector with the same name, it will re-use the offsets from the previous execution (even if the connector was deleted in between).
Since Kafka is append-only, the only way messages in those Connect topics would be removed is if a record were published with the connector name as the message key and null as the value.
You could inspect those topics using the console consumer (including --property print.key=true) to see what data is in them, and keep the consumer running while you delete a connector.
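As a sketch (the bootstrap server is a placeholder and the topic name matches the config above), inspecting the config topic looks like this:
kafka-console-consumer \
--bootstrap-server kafka:9092 \
--topic <topic.name>.internal.configs \
--from-beginning \
--property print.key=true
When you delete the connector, you should see tombstones appear: keys containing the connector name with null values.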
You can PUT a new config at /connectors/{name}/config, but any specific offsets that are used are dependent upon the actual connector type (sink / source); for example, there is the internal Kafka __consumer_offsets topic, used by Sink connectors, as well as the offset.storage.topic, optionally used by source connectors.
"same name" on same worker but different config(different offset.start or connector.class)?
I'm not sure changing connector.class would be a good idea with the above in mind since it'd change the connector behavior completely. offset.start isn't a property I'm aware of, so you'll need to see the documentation of that specific connector class to know what it does.
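If the goal is for a recreated sink connector with the same name to start from scratch, one option (a sketch; the broker is a placeholder, and the connector must be deleted first so the group is inactive) is to reset its consumer group, which for sink connectors is named connect-<connector-name>:
kafka-consumer-groups \
--bootstrap-server kafka:9092 \
--group connect-connector-abc-1.0.0 \
--reset-offsets \
--to-earliest \
--all-topics \
--execute
For source connectors the equivalent state lives in the offset.storage.topic instead.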

Problems with Avro deserialization in Kafka sink connectors

I'm trying to read data from DB2 using Kafka and then write it to HDFS. I use the distributed Confluent Platform with the standard JDBC and HDFS connectors.
As the HDFS connector needs to know the schema, it requires Avro data as input. Thus, I have to specify the following Avro converters for the data fed to Kafka (in etc/kafka/connect-distributed.properties):
key.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://localhost:8081
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081
I then run my JDBC connector and check with the console-avro-consumer that I can successfully read the data fetched from DB2.
However, when I launch the HDFS Connector, it does not work anymore. Instead, it outputs SerializationException:
Error deserializing Avro message for id -1
... Unknown magic byte!
To check if this is a problem with the HDFS connector, I tried to use a simple FileSink connector instead. However, I saw exactly the same exception when using the FileSink (and the file itself was created but stayed empty).
I then carried out the following experiment: instead of using the Avro converter for the key and value, I used JSON converters:
key.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schemas.enable=false
value.converter=org.apache.kafka.connect.json.JsonConverter
value.converter.schemas.enable=false
This fixed the problem with the FileSink connector, i.e., the whole pipeline from DB2 to the file worked fine. However, for the HDFS connector this solution is infeasible, as that connector needs the schema and consequently Avro format as input.
It feels to me that the deserialization of the Avro format in the sink connectors is not implemented properly, since the console-avro-consumer can still successfully read the data.
Does anyone have any idea of what could be the reason of this behavior? I'd also appreciate an idea of a simple fix for this!
check with the console-avro-consumer that I can successfully read the data fetched
I'm guessing you didn't add --property print.key=true --from-beginning when you did that.
It's possible that the latest values are Avro, but Connect is clearly failing somewhere on the topic, so you need to scan it to find out where that happens.
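As a sketch (the topic name and addresses are placeholders), scanning the topic with the Avro console consumer from the beginning will stop with the same "Unknown magic byte!" error at the first record that is not actually Avro, which tells you where the bad data starts:
kafka-avro-console-consumer \
--bootstrap-server localhost:9092 \
--topic <topic-name> \
--from-beginning \
--property schema.registry.url=http://localhost:8081 \
--property print.key=true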
If using the JsonConverter works, and the data is actually readable JSON on disk, then it sounds like the JDBC connector actually wrote JSON, not Avro.
If you are able to pinpoint the offset of the bad message, you can use the regular console consumer with the connector group id set, then add --max-messages along with a partition and offset specified to skip those events.