Where do Kafka consumers get the first version of Avro schema from? - apache-kafka

Most Quarkus examples that illustrate the use of Avro and Service registry combine consumer and producer in the same project. In doing so, the schema and the generated classes are available to the producer and the consumer.
I understand that the role of the schema registry is to maintain various versions of a schema and make them available to consumers.
What I do not fully understand is how and when the consumer pulls the schema from the registry. For instance, how does the consumer's team get the initial version of the schema? Do they just go to the registry and download it manually? Do they use the maven plugin to download it and generate the sources?
In the case of Quarkus, the avro extension automatically generates the Java source from avro schema. I wonder if it also downloads the initial schema for consumers.

The consumers can optionally download the schema from the registry. Then the consumer should be configured to use a SpecificRecord deserialization.
For the Confluent serialization format, the Schema ID is embedded in each record. When the consumer deserializer processes the bytes, it'll always perform a lookup from the registry for that ID.
Avro requires both a writer schema (the one that's downloaded from that ID), and a reader schema (either the one that the consumer app developer has downloaded, or defaults to the writer schema), and then schema evolution is evaluated.
Quarkus/Kafka doesn't really matter, here. It's entirely done within the Avro library after HTTP calls to the Registry.

Related

Sending Avro messages to Kafka

I have an app that produces an array of messages in raw JSON periodically. I was able to convert that to Avro using the avro-tools. I did that because I needed the messages to include schema due to the limitations of Kafka-Connect JDBC sink. I can open this file on notepad++ and see that it includes the schema and a few lines of data.
Now I would like to send this to my central Kafka Broker and then use Kafka Connect JDBC sink to put the data in a database. I am having a hard time understanding how I should be sending these Avro files I have to my Kafka Broker. Do I need a schema registry for my purposes? I believe Kafkacat does not support Avro so I suppose I will have to stick with the kafka-producer.sh that comes with the Kafka installation (please correct me if I am wrong).
Question is: Can someone please share the steps to produce my Avro file to a Kafka broker without getting Confluent getting involved.
Thanks,
To use the Kafka Connect JDBC Sink, your data needs an explicit schema. The converter that you specify in your connector configuration determines where the schema is held. This can either be embedded within the JSON message (org.apache.kafka.connect.json.JsonConverter with schemas.enabled=true) or held in the Schema Registry (one of io.confluent.connect.avro.AvroConverter, io.confluent.connect.protobuf.ProtobufConverter, or io.confluent.connect.json.JsonSchemaConverter).
To learn more about this see https://www.confluent.io/blog/kafka-connect-deep-dive-converters-serialization-explained
To write an Avro message to Kafka you should serialise it as Avro and store the schema in the Schema Registry. There is a Go client library to use with examples
without getting Confluent getting involved.
It's not entirely clear what you mean by this. The Kafka Connect JDBC Sink is written by Confluent. The best way to manage schemas is with the Schema Registry. If you don't want to use the Schema Registry then you can embed the schema in your JSON message but it's a suboptimal way of doing things.

How to auto-save Avro schema in Confluent Schema Registry from Apache NiFi flow?

How to auto-save Avro schema in Confluent Schema Registry from Apache NiFi flow?
That's basically the question.
I am not finding the way of automatically storing the Avro schema of the record in the Confluent Schema Registry from a NiFi flow. It is possible to flexibly read and populate message with the reference to the schema in the Confluent Schema-Registry, but there should be a way of auto-creating one in the registry instead of demanding Confluent Schema-Registry to be initialized upfront before NiFi flow starts.
Update
Here is my current Flow:
I'm reading from a Postgres table using QueryDatabaseTableRecord processor (version 1.10) and publishing [new] records to a Kafka topic using PublishKafkaRecord_2_0 (version 1.10.0).
I want to publish to Kafka in Avro format storing (and passing around) the Avro schema in the Confluent Schema Registry (that works well in other places of my NiFi setup).
For that, I am using AvroRecordSetWriter in the "Record Writer" property on the QueryDatabaseTableRecord processor with the following properties:
PublishKafkaRecord processor is configured to read Avro schema from the input message (using the Confluent schema registry, the schema is not embedded into each FlowFile) and uses same AvroRecordSetWriter as QueryDatabaseTableRecord processor to write to Kafka.
That's basically it.
Trying to replace the first AvroRecordSetWriter with one that embeds the schema with the hope that the second AvroRecordSetWriter could auto generate schema in the Confluent Schema Registry on publish, since I don't want to bloat each message with my embedded Avro schema.
Update
I've tried to follow the advice from the comment as follows
With that I was trying to make first access to the Confluent Schema Registry the last step in the chain. Unfortunately, my attempts were unsuccessful. The only option that worked was my initial described in this question that required a schema in the registry upfront/in advance to work.
Both other cases that I tried ended up with the exception:
org.apache.nifi.schema.access.SchemaNotFoundException: Cannot write Confluent Schema Registry Reference because the Schema Identifier is not known
Please note, that I cannot use "Inherit Schema from record" in the last writer's schema access, since I'm getting an invalid combination and the NiFi config validation doesn't pass such combination through.

How to handle Kafka schema evolution

I'm new to kafka .Here is the question i have on ever changing kafka schema.
How can we handle schema changes at kafka consumer end?
If we change the payload structure at kafka publisher end, how can I make sure nothing breaks on the kafka consumer end?
I would be interested to know industry wide practises to handle this scenario.
I won't be using Confluent's schema registry for avro. Are there any other tried and tested options?
Schema Registry is the solution for a centralized schema management and compatibility checks as schemas evolve
Configure the schema registry into kafka producers and consumers
kafkaProducerProps.put(KafkaAvroSerializerConfig.SCHEMA_REGISTRY_URL_CONFIG,"http://localhost:8081");
kafkaConsumerProps.put(KafkaAvroDeserializerConfig.SPECIFIC_AVRO_READER_CONFIG, "true");
//Schema registry location.
kafkaConsumerProps.put(KafkaAvroDeserializerConfig.SCHEMA_REGISTRY_URL_CONFIG, "http://localhost:8081");
Run Schema Registry on 8081
Refer below URL for sample code
https://dzone.com/articles/kafka-avro-serialization-and-the-schema-registry

How to dump avro data from Kafka topic and read it back in Java/Scala

We need to export production data from a Kafka topic to use it for testing purposes: the data is written in Avro and the schema is placed on the Schema registry.
We tried the following strategies:
Using kafka-console-consumer with StringDeserializer or BinaryDeserializer. We were unable to obtain a file which we could parse in Java: we always got exceptions when parsing it, suggesting the file was in the wrong format.
Using kafka-avro-console-consumer: it generates a json which includes also some bytes, for example when deserializing BigDecimal. We didn't even know which parsing option to choose (it is not avro, it is not json)
Other unsuitable strategies:
deploying a special kafka consumer would require us to package and place that code in some production server, since we are talking about our production cluster. It is just too long. After all, isn't kafka console consumer already a consumer with configurable options?
Potentially suitable strategies
Using a kafka connect Sink. We didn't find a simple way to reset the consumer offset since apparently the connector created consumer is still active even when we delete the sink
Isn't there a simply, easy way to dump the content of the value (not the schema) of a Kafka topic containing avro data to a file so that it can be parsed? I expect this to be achievable using kafka-console-consumer with the right options, plus using the correct Java Api of Avro.
for example, using kafka-console-consumer... We were unable to obtain a file which we could parse in Java: we always got exceptions when parsing it, suggesting the file was in the wrong format.
You wouldn't use regular console consumer. You would use kafka-avro-console-consumer which deserializes the binary avro data into json for you to read on the console. You can redirect > topic.txt to the console to read it.
If you did use the console consumer, you can't parse the Avro immediately because you still need to extract the schema ID from the data (4 bytes after the first "magic byte"), then use the schema registry client to retrieve the schema, and only then will you be able to deserialize the messages. Any Avro library you use to read this file as the console consumer writes it expects one entire schema to be placed at the header of the file, not only an ID pointing to anything in the registry at every line. (The basic Avro library doesn't know anything about the registry either)
The only thing configurable about the console consumer is the formatter and the registry. You can add decoders by additionally exporting them into the CLASSPATH
in such a format that you can re-read it from Java?
Why not just write a Kafka consumer in Java? See Schema Registry documentation
package and place that code in some production server
Not entirely sure why this is a problem. If you could SSH proxy or VPN into the production network, then you don't need to deploy anything there.
How do you export this data
Since you're using the Schema Registry, I would suggest using one of the Kafka Connect libraries
Included ones are for Hadoop, S3, Elasticsearch, and JDBC. I think there's a FileSink Connector as well
We didn't find a simple way to reset the consumer offset
The connector name controls if a new consumer group is formed in distributed mode. You only need a single consumer, so I would suggest standalone connector, where you can set offset.storage.file.filename property to control how the offsets are stored.
KIP-199 discusses reseting consumer offsets for Connect, but feature isn't implemented.
However, did you see Kafka 0.11 how to reset offsets?
Alternative options include Apache Nifi or Streamsets, both integrate into the Schema Registry and can parse Avro data to transport it to numerous systems
One option to consider, along with cricket_007's, is to simply replicate data from one cluster to another. You can use Apache Kafka Mirror Maker to do this, or Replicator from Confluent. Both give the option of selecting certain topics to be replicated from one cluster to another- such as a test environment.

Avro messages with schema

So we are planning to use Avro for communication over a confluent kafka-based ecosystem. My current understanding of Avro is that each message carries its schema. If that is the case, we need schema registry just for resolving version updates?
I ask since carrying the schema with each message prevents the need for something like a schema registry to map a message id to a schema. Or am I missing something here?
When you run the Confluent Schema Registry, the Kafka messages published with the Confluent Avro Serdes library do not contain the avro schema. They only contain a numeric Schema id that is used by the consumers deserializer to fetch the Schema from the Confluent Schema Registry. These schemas are cached by the serializer and deserializer as a further performance optimization.