Avro DataFileWriter API with Confluent Schema Registry

Can I use the Avro DataFileWriter with the Schema Registry?

You cannot write files into the registry, so not really.
You can use DataFileReader to get a schema from a file and POST it to the Registry.
You can also use the Registry to get a different schema than the one included in the file (for example, when you're doing schema evolution).
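As a sketch of that second point, assuming Java 11+ and placeholder file, subject, and URL values, you could read the schema embedded in an Avro data file and register it over the Registry's REST API:

import java.io.File;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class RegisterFileSchema {
    public static void main(String[] args) throws Exception {
        // Read the writer schema embedded in the Avro data file
        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<>(new File("data.avro"), new GenericDatumReader<>())) {
            Schema schema = reader.getSchema();

            // The Registry expects a JSON body of the form {"schema": "<schema as a JSON string>"}
            String body = "{\"schema\":" + quote(schema.toString()) + "}";

            HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8081/subjects/my-topic-value/versions"))
                .header("Content-Type", "application/vnd.schemaregistry.v1+json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

            HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.body()); // e.g. {"id":1}
        }
    }

    // Minimal JSON string escaping; enough for typical schema JSON
    private static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }
}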

Related

What is the use of confluent schema registry if Kafka can use Avro without it

Is the difference between vanilla Apache Avro and Avro with the Confluent Schema Registry that with plain Apache Avro we send schema+message to the Kafka topic, whereas with the Confluent Schema Registry we send schemaID+message to the Kafka topic? So here, the Schema Registry helps improve performance via schema lookup in the registry. Is there any other benefit of using the Confluent Schema Registry? Also, does Apache Avro support the compatibility rules of schema evolution like the Schema Registry does?
Note: there are other implementations of a "Schema Registry" that can be used with Kafka.
Here is a list of reasons:
Clients can discover schemas without interacting with Kafka. For example, Apache Hive / Presto / Spark can download schemas from the Registry to perform analytics.
The Registry is centrally responsible for compatibility checks rather than requiring each client to perform them on its own (which answers your second question).
The same applies to any serialization format, not only Avro.
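To make the ID-based flow concrete, here is a minimal sketch of a producer using Confluent's KafkaAvroSerializer (broker address, registry URL, topic, and schema are placeholders); the serializer looks up or registers the schema in the Registry and sends only its numeric ID with each message:

import java.util.Properties;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AvroProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // Confluent's serializer: registers/looks up the schema and sends only its ID
        props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081");

        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":[{\"name\":\"name\",\"type\":\"string\"}]}");
        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "alice");

        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("users", "key1", user));
        }
    }
}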

Does Confluent schema registry only support AVRO?

Can we use the Schema Registry for JSON messages and JSON schemas? Or do we have to use Avro serialization for the value serialization of messages?
The Confluent Schema Registry also supports JSON Schema and Protobuf as of the Confluent Platform 5.5 release, as announced in the Confluent blog.
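As a minimal sketch of what changes versus an Avro producer (broker and registry addresses are placeholders), the value serializer is simply swapped for the JSON Schema one:

import java.util.Properties;

public class JsonSchemaProducerConfig {
    // Minimal sketch: same producer setup as an Avro producer, but with the
    // JSON Schema serializer that ships with Confluent Platform 5.5+
    public static Properties jsonSchemaProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder
        props.put("key.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
            "io.confluent.kafka.serializers.json.KafkaJsonSchemaSerializer");
        // Protobuf is also supported:
        // io.confluent.kafka.serializers.protobuf.KafkaProtobufSerializer
        props.put("schema.registry.url", "http://localhost:8081"); // placeholder
        return props;
    }
}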

How to auto-save Avro schema in Confluent Schema Registry from Apache NiFi flow?

That's basically the question.
I have not found a way of automatically storing the Avro schema of the record in the Confluent Schema Registry from a NiFi flow. It is possible to flexibly read and populate messages with a reference to a schema in the Confluent Schema Registry, but there should be a way of auto-creating one in the registry instead of requiring the Confluent Schema Registry to be initialized upfront, before the NiFi flow starts.
Update
Here is my current Flow:
I'm reading from a Postgres table using QueryDatabaseTableRecord processor (version 1.10) and publishing [new] records to a Kafka topic using PublishKafkaRecord_2_0 (version 1.10.0).
I want to publish to Kafka in Avro format storing (and passing around) the Avro schema in the Confluent Schema Registry (that works well in other places of my NiFi setup).
For that, I am using an AvroRecordSetWriter in the "Record Writer" property of the QueryDatabaseTableRecord processor.
The PublishKafkaRecord processor is configured to read the Avro schema from the input message (using the Confluent Schema Registry; the schema is not embedded in each FlowFile) and uses the same AvroRecordSetWriter as the QueryDatabaseTableRecord processor to write to Kafka.
That's basically it.
I tried replacing the first AvroRecordSetWriter with one that embeds the schema, hoping that the second AvroRecordSetWriter could auto-create the schema in the Confluent Schema Registry on publish, since I don't want to bloat each message with an embedded Avro schema.
Update
I've tried to follow the advice from the comments.
With that, I was trying to make the first access to the Confluent Schema Registry the last step in the chain. Unfortunately, my attempts were unsuccessful. The only option that worked was the one initially described in this question, which requires the schema to be in the registry upfront.
Both other cases that I tried ended up with the exception:
org.apache.nifi.schema.access.SchemaNotFoundException: Cannot write Confluent Schema Registry Reference because the Schema Identifier is not known
Please note that I cannot use "Inherit Schema from record" as the last writer's schema access strategy, since that results in an invalid combination and NiFi's configuration validation doesn't pass it through.
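Failing a NiFi-native way to auto-create the schema, one workaround is to register it programmatically before the flow starts. A minimal sketch, assuming Confluent Platform 5.5+ client libraries and placeholder subject, URL, and schema values:

import io.confluent.kafka.schemaregistry.avro.AvroSchema;
import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient;
import io.confluent.kafka.schemaregistry.client.SchemaRegistryClient;

public class PreRegisterSchema {
    public static void main(String[] args) throws Exception {
        SchemaRegistryClient client =
            new CachedSchemaRegistryClient("http://localhost:8081", 100);

        String schemaJson = "{\"type\":\"record\",\"name\":\"MyTable\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"long\"},"
            + "{\"name\":\"name\",\"type\":[\"null\",\"string\"],\"default\":null}]}";

        // The subject name must match what the NiFi readers/writers expect (placeholder here)
        int id = client.register("my-topic-value", new AvroSchema(schemaJson));
        System.out.println("Registered schema id: " + id);
    }
}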

Kafka Connect HDFS (Azure) Persist Avro Values AND String Keys

I have configured Kafka Connect HDFS to work on Azure Data Lake; however, I just noticed that the keys (Strings) are not being persisted in any way, only the Avro values.
When I think about it, this makes sense, as the partitioning I want to apply in the data lake is not related to the key, and I have not specified a new Avro schema that incorporates the key String into the existing Avro value schema.
Now within the configurations I supply when running the connect-distributed.sh script, I have (among other configurations)
...
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://<ip>:<port>
...
But within the actual sink connector that I set up using curl, I simply specify the output format as
...
"format.class": "io.confluent.connect.hdfs.avro.AvroFormat"
...
so the connector just assumes that the Avro value is to be written.
So I have two questions. How do I tell the connector that it should save the key along with the value as part of a new Avro schema, and where do I define this schema?
Note that this is an Azure HDInsight cluster and so is not a Confluent Kafka solution (though I would have access to open source Confluent code such as Kafka Connect HDFS)
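There is no stock Connect transform that merges the key into the value, so one approach is a custom Single Message Transform (SMT) that copies the key into a new field of the value before the sink writes it; the AvroConverter then registers the widened value schema with the Registry for you. A minimal sketch (the class, package, and field names are hypothetical, and it assumes the value is a Connect Struct):

package example.smt; // hypothetical package

import java.util.Map;
import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.ConnectRecord;
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.data.SchemaBuilder;
import org.apache.kafka.connect.data.Struct;
import org.apache.kafka.connect.transforms.Transformation;

// Hypothetical SMT: copies the String key into a "message_key" field of the value
// so the HDFS sink persists it alongside the Avro value fields.
public class KeyToValueField<R extends ConnectRecord<R>> implements Transformation<R> {

    @Override
    public R apply(R record) {
        Struct value = (Struct) record.value();
        Schema oldSchema = value.schema();

        // New value schema = old fields + an optional "message_key" string field.
        // (A real implementation would cache this instead of rebuilding per record.)
        SchemaBuilder builder = SchemaBuilder.struct().name(oldSchema.name());
        oldSchema.fields().forEach(f -> builder.field(f.name(), f.schema()));
        builder.field("message_key", Schema.OPTIONAL_STRING_SCHEMA);
        Schema newSchema = builder.build();

        Struct newValue = new Struct(newSchema);
        oldSchema.fields().forEach(f -> newValue.put(f.name(), value.get(f)));
        newValue.put("message_key", record.key() == null ? null : record.key().toString());

        return record.newRecord(record.topic(), record.kafkaPartition(),
                record.keySchema(), record.key(), newSchema, newValue, record.timestamp());
    }

    @Override public ConfigDef config() { return new ConfigDef(); }
    @Override public void configure(Map<String, ?> configs) { }
    @Override public void close() { }
}

It would then be attached to the sink connector's config, e.g. "transforms": "keyToValue" and "transforms.keyToValue.type": "example.smt.KeyToValueField".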

Avro messages with schema

So we are planning to use Avro for communication over a Confluent Kafka-based ecosystem. My current understanding of Avro is that each message carries its schema. If that is the case, do we need the Schema Registry just for resolving version updates?
I ask since carrying the schema with each message removes the need for something like a schema registry to map a message ID to a schema. Or am I missing something here?
When you run the Confluent Schema Registry, Kafka messages published with the Confluent Avro Serdes library do not contain the Avro schema. They contain only a numeric schema ID that the consumer's deserializer uses to fetch the schema from the Confluent Schema Registry. These schemas are cached by the serializer and deserializer as a further performance optimization.
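For illustration, a minimal sketch that extracts the schema ID from a raw message value, assuming the standard Confluent wire format (magic byte 0, then a 4-byte big-endian schema ID, then the Avro-encoded payload):

import java.nio.ByteBuffer;

public class WireFormat {
    // Confluent wire format: [0x0 magic byte][4-byte schema id][Avro binary data]
    public static int schemaIdOf(byte[] messageValue) {
        ByteBuffer buffer = ByteBuffer.wrap(messageValue);
        byte magic = buffer.get();
        if (magic != 0) {
            throw new IllegalArgumentException("Not Confluent-framed: magic byte " + magic);
        }
        return buffer.getInt(); // the registry ID the deserializer uses to fetch the schema
    }

    public static void main(String[] args) {
        // Example framing: schema id 42, empty Avro payload (illustrative only)
        byte[] framed = ByteBuffer.allocate(5).put((byte) 0).putInt(42).array();
        System.out.println(schemaIdOf(framed)); // prints 42
    }
}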