How can I manage Kafka connect schema errors? - apache-kafka

I'm using Kafka Connect (Confluent distribution) to connect an MQTT broker to a Kafka topic (https://docs.lenses.io/connectors/source/mqtt.html), but when a message arrives that doesn't conform to the expected schema, the connector stops!
How can I prevent this from happening?
I'd also like to handle the error and, for example, keep track of it!

If you are using a ready-made connector, your messages need to satisfy the schema it expects. If an error occurs, it stops the connector. So the best way forward is to identify the schema error from the error message.
If it's impossible to use the existing connector, write your own connector that satisfies your needs.
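Depending on where the failure happens, the error-handling settings of the Kafka Connect framework (available since Kafka 2.0) may also let you keep track of bad records without the task stopping. Here is a minimal sketch of those settings, built as a Python dict. They only cover failures in the converter and transform stages, so whether they apply to this MQTT connector's schema error is an assumption you would need to verify. The connector name "mqtt-source" is a placeholder.

```python
# Framework-level error-handling settings (Kafka Connect 2.0+).
# They cover converter and transform failures only, not errors
# raised inside the connector's own code.
ERROR_SETTINGS = {
    "errors.tolerance": "all",              # skip bad records instead of failing the task
    "errors.log.enable": "true",            # write each failure to the Connect worker log
    "errors.log.include.messages": "true",  # include details of the failed record in the log
}


def with_error_logging(connector_config):
    """Return a copy of a connector config with the error settings merged in."""
    merged = dict(connector_config)
    merged.update(ERROR_SETTINGS)
    return merged


def to_properties(config):
    """Render a config dict as the text of a .properties file."""
    return "\n".join(f"{key}={value}" for key, value in sorted(config.items()))


if __name__ == "__main__":
    # "mqtt-source" is a placeholder name, not taken from the question.
    print(to_properties(with_error_logging({"name": "mqtt-source"})))
```

Note that the dead letter queue settings are not included here because the DLQ is only available for sink connectors.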

Related

Kafka connect jdbc sink SQL error handling

I am currently configuring a Kafka JDBC sink connector to write my Kafka messages to a Postgres table. Everything works fine except the error handling. Sometimes messages in my topic contain bad data, so the database constraints fail with an expected SQL exception (duplicate key...).
I would like to put these bad messages in a DLQ and commit the offset so that the next messages are processed, so I configured the connector with
"errors.tolerance": "all"
"errors.deadletterqueue.topic.name": "myDLQTopicName"
but it changes nothing: the connector retries until it crashes.
Is there another configuration I'm missing? I saw only these two in the Confluent documentation.
(I see in the JDBC connector changelog that error handling in the put stage was implemented in version 10.1.0 (CCDB-192), and I'm using the latest version of the connector, 10.5.1.)
"The Kafka Connect framework provides generic error handling and dead-letter queue capabilities which are available for problems with [de]serialisation and Single Message Transforms. When it comes to errors that a connector may encounter doing the actual pull or put of data from the source/target system, it’s down to the connector itself to implement logic around that."
If duplicate keys are the only type of bad record you need to deal with, you might consider setting insert.mode to upsert.
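As a rough illustration of the upsert suggestion, the sketch below builds a JDBC sink configuration as a Python dict and prints it as JSON, the form the Connect REST API accepts. The topic name "orders", the key column "id", and the choice of pk.mode are assumptions made for the example. The connection settings are left out on purpose.

```python
import json


def jdbc_sink_config(topic, dlq_topic, key_fields):
    """Build a JDBC sink config that upserts on the given key columns.

    connection.url and credentials are deliberately omitted; add your own.
    """
    return {
        "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
        "topics": topic,
        # Upsert replaces the row on a key collision instead of failing
        # with a duplicate key error.
        "insert.mode": "upsert",
        # Upsert needs a primary key; here it is taken from fields of the
        # record value (an assumption, record_key is another option).
        "pk.mode": "record_value",
        "pk.fields": ",".join(key_fields),
        # Settings from the question, kept for other kinds of bad records.
        "errors.tolerance": "all",
        "errors.deadletterqueue.topic.name": dlq_topic,
        "errors.deadletterqueue.context.headers.enable": "true",
    }


if __name__ == "__main__":
    # "orders" and "id" are placeholders, not values from the question.
    print(json.dumps(jdbc_sink_config("orders", "myDLQTopicName", ["id"]), indent=2))
```

With upsert, a message that repeats an existing key overwrites the existing row, so check that this is acceptable for your data before using it to hide duplicates.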

Sending Avro messages to Kafka

I have an app that periodically produces an array of messages in raw JSON. I was able to convert that to Avro using avro-tools. I did that because I needed the messages to include a schema, due to the limitations of the Kafka Connect JDBC sink. I can open this file in Notepad++ and see that it includes the schema and a few lines of data.
Now I would like to send this to my central Kafka Broker and then use Kafka Connect JDBC sink to put the data in a database. I am having a hard time understanding how I should be sending these Avro files I have to my Kafka Broker. Do I need a schema registry for my purposes? I believe Kafkacat does not support Avro so I suppose I will have to stick with the kafka-producer.sh that comes with the Kafka installation (please correct me if I am wrong).
The question is: can someone please share the steps to produce my Avro file to a Kafka broker without getting Confluent involved?
Thanks,
To use the Kafka Connect JDBC Sink, your data needs an explicit schema. The converter that you specify in your connector configuration determines where the schema is held. This can either be embedded within the JSON message (org.apache.kafka.connect.json.JsonConverter with schemas.enabled=true) or held in the Schema Registry (one of io.confluent.connect.avro.AvroConverter, io.confluent.connect.protobuf.ProtobufConverter, or io.confluent.connect.json.JsonSchemaConverter).
To learn more about this see https://www.confluent.io/blog/kafka-connect-deep-dive-converters-serialization-explained
To write an Avro message to Kafka you should serialise it as Avro and store the schema in the Schema Registry. There is a Go client library, with examples, that you can use for this.
"without getting Confluent involved."
It's not entirely clear what you mean by this. The Kafka Connect JDBC Sink is written by Confluent. The best way to manage schemas is with the Schema Registry. If you don't want to use the Schema Registry then you can embed the schema in your JSON message but it's a suboptimal way of doing things.
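For the embedded-schema route, this is roughly what the message has to look like for org.apache.kafka.connect.json.JsonConverter with schemas.enabled=true: a JSON object with a "schema" part and a "payload" part. The sketch below wraps a flat record in that envelope. The helper name, the sample record, and the partial type mapping are assumptions made for illustration.

```python
import json

# Connect JSON schema type names for a few Python types (partial mapping).
_TYPES = {bool: "boolean", int: "int64", float: "double", str: "string"}


def with_embedded_schema(record, name="record"):
    """Wrap a flat dict in the schema/payload envelope JsonConverter expects.

    Only bool, int, float and str values are handled; anything else
    (including None) raises KeyError in this sketch.
    """
    fields = [
        {"field": key, "type": _TYPES[type(value)], "optional": False}
        for key, value in record.items()
    ]
    return {
        "schema": {
            "type": "struct",
            "name": name,
            "optional": False,
            "fields": fields,
        },
        "payload": record,
    }


if __name__ == "__main__":
    # Sample record invented for the example.
    message = with_embedded_schema({"id": 1, "name": "widget", "price": 9.5})
    print(json.dumps(message))
```

Each line printed this way can be sent as one message value with any plain producer, since the value is ordinary JSON text. The cost is that every message carries the full schema.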

PDI transformation does not send messages to Kafka server

I have a transformation in Pentaho Data Integration (PDI) that makes a query to NetSuite, builds JSON strings for each row and finally these strings are sent to Kafka. This is the transformation:
When I test the transform against my local Kafka it works like a charm, as you can see below:
The problem appears when I substitute the connection parameters with those of an AWS EC2 instance where I also run Kafka. The transformation reports no errors, but the messages do not reach Kafka, as can be seen here:
This is the configuration of the Kafka Producer step of the transformation:
The strange thing is that although it does not send the messages to Kafka, it seems that it does connect to the server because the combobox is displayed with the names of the topics that I have:
In addition, this error is observed in the PDI terminal:
ERROR [NamedClusterServiceLocatorImpl] Could not find service for interface org.pentaho.hadoop.shim.api.jaas.JaasConfigService associated with named cluster null
Which doesn't make sense to me because I'm using a direct connection and not a connection to a Hadoop Cluster.
So I wanted to ask the members of this community if anyone has used PDI to send messages to Kafka, and if they had to configure anything in PDI or Kafka to achieve it, since I cannot think what could be happening.
Thanks in advance for any ideas or comments to help me solve this!

Flink doesn't consume Data from Kafka publisher

What I have: http://prntscr.com/szmkn4
That's the most bare-bones version of it. Some more stuff is going to come later, but for now the point is that the data is arriving properly in my consumer in the form of a JSON string.
I want to throw it into a flink table, which I create with this statement: http://prntscr.com/szmll3
I then check if it got created, just to be sure and get this: http://prntscr.com/szmn79
Next I wanna turn on the machine and check my data with "SELECT * FROM RawData" and get the following error:
[ERROR] Could not execute SQL statement. Reason:
org.apache.flink.kafka.shaded.org.apache.kafka.common.config.ConfigException: No resolvable bootstrap urls given in bootstrap.servers
I assume it's an issue with how I created my table, but am honestly not sure where/what/how.
My publisher's properties in NiFi are:
https://prnt.sc/szoe6z
and
http://prntscr.com/szoeka
If you need any additional information from me, feel free to ask.
Thanks in advance,
Psy
[ERROR] Could not execute SQL statement. Reason: org.apache.flink.kafka.shaded.org.apache.kafka.common.config.ConfigException: No resolvable bootstrap urls given in bootstrap.servers
That likely means that the Kafka bootstrap servers you've specified to Flink cannot have their hostnames resolved on the Flink servers. You'd know if it's a NiFi issue because you'd see errors in the NiFi flow saying it couldn't produce to Kafka. It might be producing to the wrong topic or even the wrong set of brokers if you have multiple Kafka clusters, but the error you posted isn't a NiFi issue.
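One way to confirm this is to run a small check on the Flink machine itself and see which broker hostnames fail to resolve there. This is a minimal sketch using only the Python standard library. The function names are made up for the example, and the address in the usage line is a placeholder for whatever your table definition uses.

```python
import socket


def split_host_port(entry):
    """Split one 'host:port' entry; raises ValueError if the port is missing."""
    host, _, port = entry.strip().rpartition(":")
    return host.strip("[]"), int(port)


def unresolvable_brokers(bootstrap_servers, resolve=socket.getaddrinfo):
    """Return the entries of a bootstrap.servers string whose host does not resolve.

    `resolve` is injectable so the function can be exercised without DNS.
    """
    failed = []
    for entry in bootstrap_servers.split(","):
        host, port = split_host_port(entry)
        try:
            resolve(host, port)
        except socket.gaierror:
            failed.append(entry.strip())
    return failed


if __name__ == "__main__":
    # Placeholder address; replace with the value from your table definition.
    print(unresolvable_brokers("localhost:9092"))
```

An empty list means every hostname resolved on that machine. It does not prove the brokers are reachable, only that name resolution is not the problem.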

In a Kafka connector, how do I get the bootstrap-server address my Kafka Connect is currently using?

I'm developing my own Kafka sink connector. My deserializer is JsonConverter. However, when someone sends invalid JSON data to my connector's topic, I want to skip the record and send it to a specific topic of my company's.
My problem is that I can't find any API for getting my Connect worker's bootstrap.servers. (I know it's in Confluent's etc directory, but hard-coding the path to "connect-distributed.properties" just to read bootstrap.servers is not a good idea.)
So my question is: is there another, more convenient way to get the value of bootstrap.servers in my connector program?
Rather than trying to send the "bad" records from a SinkTask to Kafka yourself, use the dead letter queue feature that was added in Kafka Connect 2.0.
You can configure the Connect runtime to automatically dump records that failed to be processed to a configured topic acting as a DLQ.
For more details, see the KIP that added this feature.
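Once the DLQ is enabled together with errors.deadletterqueue.context.headers.enable=true, each record in the DLQ topic carries headers prefixed with `__connect.errors.` that describe what failed. The sketch below pulls those headers into a dict. It assumes your Kafka client exposes headers as (key, bytes) pairs, and the sample values are invented.

```python
PREFIX = "__connect.errors."


def error_context(headers):
    """Collect the Connect error headers of one DLQ record into a dict.

    `headers` is an iterable of (key, bytes-or-None) pairs; the prefix is
    stripped from the keys and the values are decoded as UTF-8 text.
    """
    context = {}
    for key, value in headers:
        if key.startswith(PREFIX):
            text = None if value is None else value.decode("utf-8", "replace")
            context[key[len(PREFIX):]] = text
    return context


if __name__ == "__main__":
    # Invented sample headers for one failed record.
    sample = [
        ("__connect.errors.topic", b"my-input-topic"),
        ("__connect.errors.partition", b"0"),
        ("__connect.errors.offset", b"42"),
        ("__connect.errors.exception.message", b"Converting byte[] to Kafka Connect data failed"),
        ("some.other.header", b"ignored"),
    ]
    for name, value in error_context(sample).items():
        print(f"{name}: {value}")
```

This keeps the connector itself free of any producer code: the Connect runtime writes to the DLQ using its own connection settings, so the task never needs to know bootstrap.servers.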