Kafka Connect Schema evolution when columns are removed - apache-kafka

Let's say we have the following setup.
Schema evolution compatibility is set to BACKWARD.
A JDBC Source Connector polls data from a DB and writes it to a Kafka topic. An HDFS Sink Connector reads messages from the Kafka topic and writes them to HDFS in Avro format.
The following is the flow as I understand it.
The JDBC Source Connector queries the DB and generates Schema V1 from the JDBC metadata of the ResultSet. V1 has col1, col2, col3. Schema V1 is registered in Schema Registry.
The source connector polls data from the DB and writes messages to the Kafka topic in the V1 schema.
(Question 1) When the HDFS Sink Connector reads messages from the topic, does it validate the messages against the V1 schema from the Schema Registry?
Next DB schema is changed. Column "col3" is removed from the table.
The next time the JDBC Source Connector polls the DB, it sees that the schema has changed, generates a new Schema V2 (col1, col2), and registers V2 in Schema Registry.
The source connector continues polling data and writes to the Kafka topic in the V2 schema.
Now the Kafka Topic can have messages in both V1 and V2 schema.
(Question 2) When the HDFS Sink Connector reads messages, does it now validate them against Schema V2?
Is this the case addressed in the Confluent documentation under Backward Compatibility?
[https://docs.confluent.io/current/schema-registry/avro.html#schema-evolution-and-compatibility]
An example of a backward compatible change is a removal of a field. A
consumer that was developed to process events without this field will
be able to process events written with the old schema and contain the
field – the consumer will just ignore that field.

The registry only validates when a new schema is registered.
Therefore, validation occurs on the registry side if and when the source connector detects a change and registers the new schema.
As for HDFS connector, there is a separate schema.compatibility property that applies a projection over records held in memory and any new records. When you get a record with a new schema, and have a backwards compatible update, then all messages not yet flushed will be updated to hold the new schema when an Avro container file is written.
Aside: just because the registry considers a change backward compatible doesn't guarantee that the sink connector does. The validation within the connector's source code is different, and we've had multiple issues with it :/
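To make the "removed column" case concrete, here is a minimal Python sketch of how a reader schema is applied to a record written with an older schema. It is a simplified illustration for flat records only, not the code used by the registry or the HDFS connector; the names `project` and `can_read` are hypothetical, and real Avro resolution also covers type promotion, aliases, unions, and nested types.

```python
# Simplified illustration of Avro reader/writer resolution for flat records.
# V1 and V2 mirror the schemas described in the question.
V1 = {"type": "record", "name": "row", "fields": [
    {"name": "col1", "type": "int"},
    {"name": "col2", "type": "string"},
    {"name": "col3", "type": "string"},
]}
V2 = {"type": "record", "name": "row", "fields": [
    {"name": "col1", "type": "int"},
    {"name": "col2", "type": "string"},
]}


def can_read(reader_schema, writer_schema):
    """True if every reader field exists in the writer schema or has a default."""
    writer_fields = {f["name"] for f in writer_schema["fields"]}
    return all(
        f["name"] in writer_fields or "default" in f
        for f in reader_schema["fields"]
    )


def project(record, writer_schema, reader_schema):
    """Project a decoded record (a dict) onto the reader schema."""
    writer_fields = {f["name"] for f in writer_schema["fields"]}
    out = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in writer_fields:
            out[name] = record[name]        # field present in both schemas
        elif "default" in field:
            out[name] = field["default"]    # reader-only field with a default
        else:
            raise ValueError(f"no value or default for reader field {name!r}")
    return out  # writer-only fields (such as col3) are simply dropped


old_row = {"col1": 1, "col2": "a", "col3": "x"}
print(project(old_row, V1, V2))
print(can_read(V2, V1), can_read(V1, V2))
```

Under these assumptions, V2 can read V1 data (col3 is ignored), which is the BACKWARD case, while V1 cannot read V2 data because col3 has no default.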

Related

Where do Kafka consumers get the first version of Avro schema from?

Most Quarkus examples that illustrate the use of Avro and Service registry combine consumer and producer in the same project. In doing so, the schema and the generated classes are available to the producer and the consumer.
I understand that the role of the schema registry is to maintain various versions of a schema and make them available to consumers.
What I do not fully understand is how and when the consumer pulls the schema from the registry. For instance, how does the consumer's team get the initial version of the schema? Do they just go to the registry and download it manually? Do they use the maven plugin to download it and generate the sources?
In the case of Quarkus, the Avro extension automatically generates the Java sources from the Avro schema. I wonder if it also downloads the initial schema for consumers.
The consumers can optionally download the schema from the registry. Then the consumer should be configured to use a SpecificRecord deserialization.
For the Confluent serialization format, the schema ID is embedded in each record. When the consumer deserializer processes the bytes, it looks up that ID in the registry (the client caches the result, so the HTTP call is only needed the first time an ID is seen).
Avro requires both a writer schema (the one that's downloaded from that ID), and a reader schema (either the one that the consumer app developer has downloaded, or defaults to the writer schema), and then schema evolution is evaluated.
Quarkus/Kafka doesn't really matter here. It's done entirely within the Avro library after the HTTP calls to the Registry.
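As a rough illustration of the lookup described above, the following Python sketch splits a Confluent-framed message into its schema ID and Avro payload (one magic byte of 0, then a 4-byte big-endian schema ID) and resolves the writer schema through a cache. The names `split_confluent_message` and `WriterSchemaLookup` are hypothetical, and the `fetch` callable stands in for the HTTP call to the registry; this is not the actual deserializer implementation.

```python
import struct

MAGIC_BYTE = 0  # first byte of the Confluent wire format


def split_confluent_message(raw):
    """Return (schema_id, avro_payload) for a Confluent-framed message."""
    if len(raw) < 5:
        raise ValueError("message too short for the Confluent wire format")
    magic, schema_id = struct.unpack(">bI", raw[:5])
    if magic != MAGIC_BYTE:
        raise ValueError(f"unexpected magic byte {magic}")
    return schema_id, raw[5:]


class WriterSchemaLookup:
    """Resolve writer schemas by ID, calling `fetch` only once per ID."""

    def __init__(self, fetch):
        self._fetch = fetch   # assumed to wrap GET /schemas/ids/{id}
        self._cache = {}

    def get(self, schema_id):
        if schema_id not in self._cache:
            self._cache[schema_id] = self._fetch(schema_id)
        return self._cache[schema_id]
```

The reader schema is then either the one the consumer team compiled into its classes or, if none is given, the writer schema returned by the lookup.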

Sending Avro messages to Kafka

I have an app that periodically produces an array of messages in raw JSON. I was able to convert that to Avro using avro-tools. I did that because I needed the messages to include a schema, due to the limitations of the Kafka Connect JDBC sink. I can open this file in Notepad++ and see that it includes the schema and a few lines of data.
Now I would like to send this to my central Kafka Broker and then use Kafka Connect JDBC sink to put the data in a database. I am having a hard time understanding how I should be sending these Avro files I have to my Kafka Broker. Do I need a schema registry for my purposes? I believe Kafkacat does not support Avro so I suppose I will have to stick with the kafka-producer.sh that comes with the Kafka installation (please correct me if I am wrong).
The question is: can someone please share the steps to produce my Avro file to a Kafka broker without getting Confluent involved?
Thanks,
To use the Kafka Connect JDBC Sink, your data needs an explicit schema. The converter that you specify in your connector configuration determines where the schema is held. This can either be embedded within the JSON message (org.apache.kafka.connect.json.JsonConverter with schemas.enabled=true) or held in the Schema Registry (one of io.confluent.connect.avro.AvroConverter, io.confluent.connect.protobuf.ProtobufConverter, or io.confluent.connect.json.JsonSchemaConverter).
To learn more about this see https://www.confluent.io/blog/kafka-connect-deep-dive-converters-serialization-explained
To write an Avro message to Kafka you should serialise it as Avro and store the schema in the Schema Registry. There is a Go client library to use, with examples.
"without getting Confluent involved."
It's not entirely clear what you mean by this. The Kafka Connect JDBC Sink is written by Confluent. The best way to manage schemas is with the Schema Registry. If you don't want to use the Schema Registry then you can embed the schema in your JSON message but it's a suboptimal way of doing things.
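For the embedded-schema option, a minimal Python sketch of the message envelope that JsonConverter expects with schemas.enabled=true is shown below: a top-level "schema" describing a struct and a "payload" holding the values. The helper name `with_embedded_schema`, the record name, and the example fields are assumptions for illustration only.

```python
import json


def with_embedded_schema(payload, fields, name="example.Record"):
    """Wrap a flat dict in the schema/payload envelope used by JsonConverter.

    `fields` is a list of (field_name, connect_type) pairs, for example
    ("id", "int32") or ("label", "string").
    """
    schema = {
        "type": "struct",
        "name": name,          # hypothetical record name
        "optional": False,
        "fields": [
            {"field": field, "type": connect_type, "optional": False}
            for field, connect_type in fields
        ],
    }
    return json.dumps({"schema": schema, "payload": payload})


message = with_embedded_schema(
    {"id": 1, "label": "first"},
    [("id", "int32"), ("label", "string")],
)
print(message)
```

Every message carries the full schema this way, which is why it is the bulkier of the two options.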

How to auto-save Avro schema in Confluent Schema Registry from Apache NiFi flow?

How to auto-save Avro schema in Confluent Schema Registry from Apache NiFi flow?
That's basically the question.
I cannot find a way to automatically store the Avro schema of a record in the Confluent Schema Registry from a NiFi flow. It is possible to flexibly read messages and populate them with a reference to the schema in the Confluent Schema Registry, but there should be a way to auto-create the schema in the registry, instead of requiring the Confluent Schema Registry to be initialized upfront before the NiFi flow starts.
Update
Here is my current Flow:
I'm reading from a Postgres table using QueryDatabaseTableRecord processor (version 1.10) and publishing [new] records to a Kafka topic using PublishKafkaRecord_2_0 (version 1.10.0).
I want to publish to Kafka in Avro format storing (and passing around) the Avro schema in the Confluent Schema Registry (that works well in other places of my NiFi setup).
For that, I am using AvroRecordSetWriter in the "Record Writer" property on the QueryDatabaseTableRecord processor with the following properties:
PublishKafkaRecord processor is configured to read Avro schema from the input message (using the Confluent schema registry, the schema is not embedded into each FlowFile) and uses same AvroRecordSetWriter as QueryDatabaseTableRecord processor to write to Kafka.
That's basically it.
I am trying to replace the first AvroRecordSetWriter with one that embeds the schema, in the hope that the second AvroRecordSetWriter could auto-generate the schema in the Confluent Schema Registry on publish, since I don't want to bloat each message with an embedded Avro schema.
Update
I've tried to follow the advice from the comment as follows
With that, I was trying to make the first access to the Confluent Schema Registry the last step in the chain. Unfortunately, my attempts were unsuccessful. The only option that worked was the initial one described in this question, which requires the schema to be in the registry in advance.
Both other cases that I tried ended up with the exception:
org.apache.nifi.schema.access.SchemaNotFoundException: Cannot write Confluent Schema Registry Reference because the Schema Identifier is not known
Please note that I cannot use "Inherit Schema from record" as the last writer's schema access strategy, since that is an invalid combination and NiFi's config validation rejects it.

How to fetch Kafka source connector schema based on connector name

I am using the Confluent JDBC Kafka connector to publish messages to a topic. The source connector sends data to the topic along with a schema on each poll. I want to retrieve this schema.
Is it possible? If so, how? Any suggestions are welcome.
My intention is to create a KSQL stream or table based on schema build by Kafka connector on poll.
The best way to do this is to use Avro, in which the schema is stored separately and automatically used by Kafka Connect and KSQL.
You can use Avro by configuring Kafka Connect to use the AvroConverter. In your Kafka Connect worker configuration set:
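key.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://schema-registry:8081
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://schema-registry:8081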
(Update schema-registry to the hostname of where your Schema Registry is running)
From there, in KSQL you just use
CREATE STREAM my_stream WITH (KAFKA_TOPIC='source_topic', VALUE_FORMAT='AVRO');
You don't need to specify the schema itself here, because KSQL fetches it from the Schema Registry.
You can read more about Converters and serialisers here.
Disclaimer: I work for Confluent, and wrote the referenced blog post.
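If you want to look at the schema yourself rather than let KSQL fetch it, a minimal Python sketch against the Schema Registry REST API could look like the following. It assumes the default subject naming (topic name plus "-value" or "-key") and reuses the registry hostname from the config above; the function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def latest_schema_url(registry_url, topic, is_key=False):
    """Build the URL of the latest schema version for a topic's key or value."""
    subject = f"{topic}-{'key' if is_key else 'value'}"  # default naming assumed
    quoted = urllib.parse.quote(subject, safe="")
    return f"{registry_url.rstrip('/')}/subjects/{quoted}/versions/latest"


def fetch_latest_schema(registry_url, topic, is_key=False):
    """Fetch and parse the latest registered schema (performs an HTTP GET)."""
    with urllib.request.urlopen(latest_schema_url(registry_url, topic, is_key)) as resp:
        body = json.load(resp)
    # The registry returns the schema as a JSON-encoded string in "schema".
    return json.loads(body["schema"])


print(latest_schema_url("http://schema-registry:8081", "source_topic"))
```

Calling `fetch_latest_schema("http://schema-registry:8081", "source_topic")` would then return the Avro schema that the source connector registered for the topic's values, provided the registry is reachable.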

How is a schema from Schema Registry propagated by Replicator?

How do schemas from Confluent Schema-Registry get propagated by Confluent-Replicator to destination Kafka-Cluster and Schema-Registry?
Is each replicated message schema contained in it or are schemas replicated somehow separately through a separate topic?
I didn't see any configuration possibilities in Confluent-Replicator regarding this.
It sounds like you are asking how the schema registry can be used in a multi data center environment. There's a pretty good doc on this https://docs.confluent.io/current/schema-registry/docs/multidc.html
Replicator can be used to keep the schema registry data in sync on the backend as shown in the doc.
Schemas are not stored with the topic, only their IDs. And the _schemas topic is not replicated; only the IDs stored within the replicated topics are.
At a high level, if you use the AvroConverter with Replicator, it deserializes the message from the source cluster, optionally renames the topic as per the Replicator configuration, then serializes the message and registers the schema under the new subject name in the destination registry as it writes to the destination cluster.
Otherwise, if you use the ByteArrayConverter, it will not inspect the message, and it just copies it along to the destination cluster with no registration.
A small optimization of the Avro approach would be to check only the first 5 bytes to confirm that the message is Avro encoded, as per the Schema Registry wire format, then look up the schema in the source registry using the Schema Registry REST API (GET /schemas/ids/:id), rename the topic for the destination cluster if needed, and POST the schema to the destination registry. A similar approach can work in any Consumer/Producer pair, such as a MirrorMaker MessageHandler implementation.
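The optimization above could be sketched in Python roughly as follows. This is an illustration under stated assumptions, not Replicator's or MirrorMaker's code: `translate_message` is a hypothetical name, `get_schema_by_id` is assumed to wrap GET /schemas/ids/{id} on the source registry, and `register_schema` is assumed to wrap POST /subjects/{subject}/versions on the destination registry and return the ID assigned there. Because each registry assigns its own IDs, the sketch also rewrites the ID in the 5-byte header.

```python
import struct


def translate_message(raw, dest_topic, get_schema_by_id, register_schema, id_map):
    """Copy a message's schema to the destination registry and fix its header.

    `id_map` caches (source_id, subject) -> destination_id so each schema is
    looked up and registered only once.
    """
    if len(raw) < 5 or raw[0] != 0:
        return raw  # not Confluent-framed; pass the bytes through untouched
    (source_id,) = struct.unpack(">I", raw[1:5])
    subject = f"{dest_topic}-value"  # default subject naming assumed
    key = (source_id, subject)
    if key not in id_map:
        schema_str = get_schema_by_id(source_id)
        id_map[key] = register_schema(subject, schema_str)
    # Keep the Avro payload as-is; only the schema ID in the header changes.
    return b"\x00" + struct.pack(">I", id_map[key]) + raw[5:]
```

The payload is never deserialized, which is the point of the optimization: only the header is inspected, and the registry calls happen once per schema rather than once per message.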