I'm trying to use Kafka connect sink to write files from Kafka to HDFS.
My properties look like this:
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
flush.size=3
format.class=io.confluent.connect.hdfs.parquet.ParquetFormat
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
schema.compatibility=BACKWARD
key.converter.schemas.enabled=false
value.converter.schemas.enabled=false
schemas.enable=false
When I try to run the connector, I get the following exception:
org.apache.kafka.connect.errors.DataException: JsonConverter with schemas.enable requires "schema" and "payload" fields and may not contain additional fields. If you are trying to deserialize plain JSON data, set schemas.enable=false in your converter configuration.
I'm using Confluent version 4.0.0.
Any suggestions please?
My understanding of this issue: if you set schemas.enable=true, you tell Kafka Connect that you want the schema included in every message it transfers. In that case a Kafka message is not plain JSON. Instead, it first describes the schema and then attaches the payload (i.e., the actual data) that corresponds to the schema, similar in spirit to how Avro embeds schemas. This leads to the conflict: on the one hand you've specified JsonConverter for your plain JSON data, and on the other hand you've asked Kafka to include the schema in the messages. To fix this, either use AvroConverter (the schema is then handled via the Schema Registry) or JsonConverter with schemas.enable=false. Note also that the property is spelled schemas.enable, not schemas.enabled, so the settings in your configuration are silently ignored and JsonConverter falls back to its default of schemas.enable=true, which is exactly what the exception is complaining about.
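To make the difference concrete, here is a minimal sketch in plain Python (field names are illustrative, not from your topic) of the two message shapes JsonConverter can consume, depending on schemas.enable:

```python
import json

# Plain JSON: readable by JsonConverter only when schemas.enable=false.
plain = json.dumps({"id": 1, "name": "alice"})

# Envelope JSON: what JsonConverter expects when schemas.enable=true.
# The "schema" field describes the types; "payload" carries the actual data.
enveloped = json.dumps({
    "schema": {
        "type": "struct",
        "fields": [
            {"field": "id", "type": "int32", "optional": False},
            {"field": "name", "type": "string", "optional": True},
        ],
        "optional": False,
    },
    "payload": {"id": 1, "name": "alice"},
})

record = json.loads(enveloped)
# The converter checks for exactly these two top-level fields:
print(sorted(record.keys()))  # ['payload', 'schema']
```

Feeding the first shape to a converter configured with schemas.enable=true produces exactly the DataException quoted above.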
Related
I have an app that periodically produces an array of messages in raw JSON. I was able to convert that to Avro using avro-tools. I did that because I needed the messages to include a schema, due to the limitations of the Kafka Connect JDBC sink. I can open this file in Notepad++ and see that it includes the schema and a few lines of data.
Now I would like to send this to my central Kafka broker and then use the Kafka Connect JDBC sink to put the data in a database. I am having a hard time understanding how I should send these Avro files to my Kafka broker. Do I need a Schema Registry for my purposes? I believe kafkacat does not support Avro, so I suppose I will have to stick with the kafka-console-producer.sh that comes with the Kafka installation (please correct me if I am wrong).
Question is: can someone please share the steps to produce my Avro file to a Kafka broker without getting Confluent involved?
Thanks,
To use the Kafka Connect JDBC Sink, your data needs an explicit schema. The converter that you specify in your connector configuration determines where the schema is held. It can either be embedded within the JSON message itself (org.apache.kafka.connect.json.JsonConverter with schemas.enable=true) or held in the Schema Registry (one of io.confluent.connect.avro.AvroConverter, io.confluent.connect.protobuf.ProtobufConverter, or io.confluent.connect.json.JsonSchemaConverter).
To learn more about this see https://www.confluent.io/blog/kafka-connect-deep-dive-converters-serialization-explained
To write an Avro message to Kafka you should serialise it as Avro and store the schema in the Schema Registry. There are client libraries with examples, including one for Go.
"without getting Confluent involved"
It's not entirely clear what you mean by this. The Kafka Connect JDBC Sink is written by Confluent. The best way to manage schemas is with the Schema Registry. If you don't want to use the Schema Registry, you can embed the schema in each JSON message, but it's a suboptimal way of doing things.
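If you do go the embedded-schema route, every single message must carry the full schema alongside the payload. A sketch of one such self-describing message (the table and field names here are made up for illustration):

```python
import json

# One self-describing message for the JDBC Sink when using JsonConverter
# with schemas.enable=true. The schema is repeated in every message, which
# is the overhead that makes the Schema Registry route preferable.
message = {
    "schema": {
        "type": "struct",
        "name": "orders",  # hypothetical record name
        "fields": [
            {"field": "order_id", "type": "int64",  "optional": False},
            {"field": "amount",   "type": "double", "optional": False},
            {"field": "note",     "type": "string", "optional": True},
        ],
        "optional": False,
    },
    "payload": {"order_id": 42, "amount": 9.99, "note": None},
}

encoded = json.dumps(message)
decoded = json.loads(encoded)
print(decoded["schema"]["fields"][0]["field"])  # order_id
```

The field types give the sink enough information to create and populate the target table (auto.create=true), which plain schemaless JSON cannot do.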
I have a Kafka topic where the values are MessagePack-encoded.
Is there any way to sink the records from this topic into MongoDB using the MongoDB Kafka connector, or must the record values simply be stored as JSON?
You will need to find or write your own Kafka Connect converter, add that package to each Connect worker's classpath, and then set it as your key/value converter. The existing MongoDB sink connector can then deserialize the messages into Connect's Struct-and-Schema form and handle them correctly.
JSON was never a requirement; Avro and Protobuf should work as well.
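To my knowledge no MessagePack converter ships with Kafka or is published on Confluent Hub, so the class name below is purely hypothetical; it only illustrates where such a converter would be wired in once you have written and packaged one:

```properties
# Hypothetical class name: you would need to implement and package
# this converter yourself and place its JAR on the worker's plugin.path.
value.converter=com.example.connect.msgpack.MessagePackConverter
key.converter=org.apache.kafka.connect.storage.StringConverter
```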
I have created a NiFi flow that eventually publishes JSON records as records with Avro-encoded values and string keys, using a schema in the Confluent Schema Registry for the value schema. Here is the configuration for the AvroRecordSetWriter in NiFi.
I am now trying to use Kafka Connect (connect-standalone) to move the messages to a PostgreSQL database using JdbcSinkConnector, but am getting the following error: Error retrieving Avro schema for id 1
I have confirmed that I have a schema in my Confluent Schema Registry with an ID of 1. Following are my configs for the Connect task.
Worker Config:
bootstrap.servers=localhost:29092
key.converter.schemas.enable=false
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081
internal.key.converter=org.apache.kafka.connect.json.JsonConverter
internal.value.converter=org.apache.kafka.connect.json.JsonConverter
offset.storage.file.filename=/tmp/connect.offsets
rest.host.name=localhost
rest.port=8083
plugin.path=share/java
Connector Config:
name=pg-sink
connector.class=io.confluent.connect.jdbc.JdbcSinkConnector
tasks.max=1
topics=rds
connection.url=jdbc:postgresql://localhost:5432/test
connection.user=postgres
connection.password=xxxxxxxx
insert.mode=upsert
table.name.format=test_data
auto.create=true
I created a flow in NiFi that consumes the messages properly, and I have also successfully consumed messages (which are formatted as JSON in the output) with kafka-avro-console-consumer by specifying --property schema.registry.url=http://schema-registry:8081. Note that I'm running the consumer inside a Docker container, which is why the URL is not localhost.
I'm not sure what I am missing. My only thought is that I am using the wrong class for the key converter, but that would not make sense given the error. Can anyone see what I am doing wrong?
I don't know much about NiFi, but I see that the name of the schema is "rds", and the error log says the subject was not found in the Schema Registry.
Kafka uses KafkaAvroSerializer to serialize Avro records, registering the associated Avro schema in the Schema Registry at the same time, and KafkaAvroDeserializer to deserialize them, retrieving the associated schema from the registry.
The Schema Registry stores schemas in categories called "subjects", and the default behavior for naming the subject of a record is topic_name-value for the value and topic_name-key for the key.
In your case you registered the schema with NiFi, not through the Kafka serializer, so my guess is that "rds" appears in, or is, the subject name in the Schema Registry.
How did you verify that your schema was correctly stored?
Normally, the correct subject in your case would be rds-value, because you're using the Schema Registry only for value records.
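The default naming scheme (Confluent calls it TopicNameStrategy) is simple enough to sketch in a few lines, which shows why the subject the connector looks up for your topic is rds-value:

```python
# Default subject naming used by the Confluent serializers (TopicNameStrategy):
# a topic's value schema is registered under "<topic>-value",
# and its key schema under "<topic>-key".
def default_subject(topic: str, is_key: bool = False) -> str:
    return f"{topic}-{'key' if is_key else 'value'}"

print(default_subject("rds"))        # rds-value
print(default_subject("rds", True))  # rds-key
```

If NiFi registered the schema under the bare name "rds" instead, the AvroConverter's lookup for rds-value would fail even though the schema ID itself exists.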
I'm trying to read data from DB2 using Kafka and then write it to HDFS. I use the distributed Confluent platform with the standard JDBC and HDFS connectors.
As the HDFS connector needs to know the schema, it requires Avro data as input. Thus, I have to specify the following Avro converters for the data fed to Kafka (in etc/kafka/connect-distributed.properties):
key.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://localhost:8081
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081
I then run my JDBC connector and check with kafka-avro-console-consumer that I can successfully read the data fetched from DB2.
However, when I launch the HDFS connector, it no longer works. Instead, it outputs a SerializationException:
Error deserializing Avro message for id -1
... Unknown magic byte!
To check whether this was a problem with the HDFS connector, I tried a simple FileSink connector instead. However, I saw exactly the same exception with the FileSink (and the file itself was created but stayed empty).
I then carried out the following experiment: instead of using the Avro converters for the key and value, I used JSON converters:
key.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schemas.enable=false
value.converter=org.apache.kafka.connect.json.JsonConverter
value.converter.schemas.enable=false
This fixed the problem with the FileSink connector, i.e., the whole pipeline from DB2 to the file worked fine. However, for the HDFS connector this solution is not an option, as that connector needs the schema and consequently Avro-formatted input.
It feels to me that the deserialization of the Avro format in the sink connectors is not implemented properly, since kafka-avro-console-consumer can still successfully read the data.
Does anyone have any idea of what could be the reason of this behavior? I'd also appreciate an idea of a simple fix for this!
"check with kafka-avro-console-consumer that I can successfully read the data fetched"
I'm guessing you didn't add --property print.key=true --from-beginning when you did that.
It's possible that the latest values are Avro, but Connect is clearly failing somewhere on the topic, so you need to scan it to find out where that happens.
If using JsonConverter works, and the data is actually readable JSON on disk, then it sounds like the JDBC Connector actually wrote JSON, not Avro
If you are able to pinpoint the offset of the bad message, you can use the regular console consumer with the connector's group id set, then add --max-messages along with a partition and offset to skip those events.
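For context on why the error reads "Unknown magic byte!": messages written by KafkaAvroSerializer follow the Confluent wire format, a zero magic byte, then a 4-byte big-endian schema ID, then the Avro-encoded body. A minimal sketch of that framing (the payload bytes here are illustrative):

```python
import struct

MAGIC_BYTE = 0

def frame(schema_id: int, avro_payload: bytes) -> bytes:
    # magic byte (0) + 4-byte big-endian schema id + Avro-encoded body
    return struct.pack(">bI", MAGIC_BYTE, schema_id) + avro_payload

def parse(message: bytes):
    magic, schema_id = struct.unpack(">bI", message[:5])
    if magic != MAGIC_BYTE:
        # This is what the Avro deserializer reports as "Unknown magic byte!"
        raise ValueError("Unknown magic byte!")
    return schema_id, message[5:]

msg = frame(1, b"\x02a")  # framed record referencing schema id 1
print(parse(msg))         # (1, b'\x02a')

# Plain JSON bytes fail the magic-byte check, which is what Connect hit:
try:
    parse(b'{"k": "v"}')
except ValueError as e:
    print(e)              # Unknown magic byte!
```

So any plain-JSON record sitting in the topic (e.g., from an earlier run with JSON converters) will trip the AvroConverter with exactly this exception.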
I have Filebeat, which outputs to a Kafka topic, and I would like to make sure that the messages are in the correct format by using an Avro schema.
In the Filebeat documentation, two possible output codecs are mentioned, json and format.
In Kafka, the Schema Registry can be used to store Avro schemas.
My questions:
Is it possible for Filebeat to send Avro messages (validated against an Avro schema); and if so, can the schema be linked from the Schema Registry so it won't need to be physically copied over on each new version of the schema?
Can Filebeat's json or format output be treated as Avro by Kafka?
If Filebeat can't produce validated Avro messages, can the JSON it sends be validated on Kafka's side when written to the topic? If so, can invalid messages be dropped or logged somewhere?