Kafka Connect HDFS (Azure) Persist Avro Values AND String Keys

I have configured Kafka Connect HDFS to work on Azure Data Lake; however, I just noticed that the keys (Strings) are not being persisted in any way, only the Avro values.
When I think about this, I suppose it makes sense: the partitioning I want to apply in the data lake is not related to the key, and I have not specified a new Avro schema that incorporates the key String into the existing Avro value schema.
Within the configuration I supply when running the connect-distributed.sh script, I have (among other settings)
...
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://<ip>:<port>
...
But within the actual sink connector that I set up using curl I simply specify the output format as
...
"format.class": "io.confluent.connect.hdfs.avro.AvroFormat"
...
so the connector just assumes that the Avro value is to be written.
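For context, the definition I POST via curl looks roughly like the following (the connector name, topic, and hdfs.url here are placeholders rather than my real values):
curl -X POST -H "Content-Type: application/json" http://localhost:8083/connectors -d '{
  "name": "datalake-sink",
  "config": {
    "connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
    "topics": "my-topic",
    "hdfs.url": "adl://<datalake-account>.azuredatalakestore.net",
    "flush.size": "3",
    "format.class": "io.confluent.connect.hdfs.avro.AvroFormat"
  }
}'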
So I have two questions: how do I tell the connector that it should save the key along with the value as part of a new Avro schema, and where do I define this schema?
Note that this is an Azure HDInsight cluster, so it is not a Confluent Kafka solution (though I would have access to open-source Confluent code such as Kafka Connect HDFS).

Related

Using Glue schema registry with MSK Connector

I've been trying to create an MSK Connector and use the Glue Schema Registry with it.
The configuration is as follows.
connector.class=io.confluent.connect.s3.S3SinkConnector
s3.region=eu-west-1
topics.dir=topics/dir
flush.size=200
tasks.max=2
s3.part.size=5242880
timezone=GMT
# value.converter.schema.registry.url=http://someIP:8081
key.converter.schemaName=my-topic-schema
locale=US
format.class=io.confluent.connect.s3.format.parquet.ParquetFormat
value.converter.schemaName=my-topic-schema
value.converter=io.confluent.connect.avro.StringConverter
s3.bucket.name=my-bucket
key.converter=io.confluent.connect.avro.StringConverter
# key.converter.schema.registry.url==http://someIP:8081
partition.duration.ms=3600000
schema.compatibility=BACKWARD
topics=osb
value.converter.registry.name=Glue-Schema-Registry
key.converter.registry.name=Glue-Schema-Registry
key.converter.schemas.enable=true
partitioner.class=io.confluent.connect.storage.partitioner.TimeBasedPartitioner
value.converter.schemas.enable=true
storage.class=io.confluent.connect.s3.storage.S3Storage
rotate.schedule.interval.ms=0
path.format='year'=YYYY/'month'=MM/'day'=dd/'hour'=HH
timestamp.extractor=RecordField
timestamp.field=timestamp
At first I was using the Confluent Schema Registry running on an EC2 instance, whose IP I added in the "key/value.converter.schema.registry.url" fields, and it was working fine. Now I'm trying to use the Glue Schema Registry, but I don't know how to connect the connector to it.
These classes don't exist
key.converter=io.confluent.connect.avro.StringConverter
value.converter=io.confluent.connect.avro.StringConverter
The StringConverter class name starts with org.apache.kafka (the full name is org.apache.kafka.connect.storage.StringConverter).
Similarly, it looks like you've added a number of converter properties that aren't valid for either the String converter or the Confluent Avro converter, apart from the URL.
To use Glue, you'll need to use AWSKafkaAvroConverter, which is part of this repo:
https://github.com/awslabs/aws-glue-schema-registry/tree/master/avro-kafkaconnect-converter
and is documented here:
https://docs.aws.amazon.com/glue/latest/dg/schema-registry-integrations.html#schema-registry-integrations-apache-kafka-connect
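For example, the converter section of the config could then look roughly like this (property names follow the Glue Schema Registry converter documentation; the region, registry name, and flags are placeholders to adjust for your setup):
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=com.amazonaws.services.schemaregistry.kafkaconnect.AWSKafkaAvroConverter
value.converter.region=eu-west-1
value.converter.registry.name=Glue-Schema-Registry
value.converter.schemaAutoRegistrationEnabled=true
value.converter.avroRecordType=GENERIC_RECORD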

Is there a way of telling a sink connector in Kafka Connect how to look for schema entries

I have successfully set up Kafka Connect in distributed mode locally with the Confluent BigQuery connector. The topics are being made available to me by another party; I am simply moving these topics into my Kafka Connect on my local machine, and then to the sink connector (and thus into BigQuery).
Because of the topics being created by someone else, the schema registry is also being managed by them. So in my config, I set "schema.registry.url":https://url-to-schema-registry, but we have multiple topics which all use the same schema entry, which is located at, let's say, https://url-to-schema-registry/subjects/generic-entry-value/versions/1.
What is happening, however, is that Connect is looking for the schema entry based on the topic name. So let's say my topic is my-topic. Connect is looking for the entry at this URL: https://url-to-schema-registry/subjects/my-topic-value/versions/1. But instead, I want to use the entry located at https://url-to-schema-registry/subjects/generic-entry-value/versions/1, and I want to do so for any and all topics.
How can I make this change? I have tried looking at this doc: https://docs.confluent.io/platform/current/schema-registry/serdes-develop/index.html#configuration-details as well as this class: https://github.com/confluentinc/schema-registry/blob/master/schema-serializer/src/main/java/io/confluent/kafka/serializers/subject/TopicRecordNameStrategy.java
but this looks to be a config parameter for the schema registry itself (which I have no control over), not the sink connector. Unless I'm not configuring something correctly.
Is there a way for me to configure my sink connector to look for a specified schema entry like generic-entry-value/versions/..., instead of the default format topic-name-value/versions/...?
The strategy is configurable at the connector level.
e.g.
value.converter.value.subject.name.strategy=...
The only built-in strategies, however, are for Topic and/or RecordName lookups. You'll need to write your own class for static lookups of "generic-entry" if you cannot otherwise copy this "generic-entry-value" schema into new subjects, e.g.:
# get the raw schema and save it to a file
curl ... https://url-to-schema-registry/subjects/generic-entry-value/versions/1/schema
# wrap it as {"schema": "<escaped schema string>"} in schema.json, then upload it again,
# where "new-entry" is the name of the other topic
curl -XPOST -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  -d @schema.json https://url-to-schema-registry/subjects/new-entry-value/versions
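For reference, the connector-level override with one of the built-in strategies would look roughly like this (class names assume the standard Confluent serializer package):
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=https://url-to-schema-registry
# built-in options: TopicNameStrategy (default), RecordNameStrategy, TopicRecordNameStrategy
value.converter.value.subject.name.strategy=io.confluent.kafka.serializers.subject.RecordNameStrategy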

What is the use of confluent schema registry if Kafka can use Avro without it

Is the difference between vanilla Apache Avro and Avro with the Confluent Schema Registry that, when using plain Apache Avro, we send schema+message in the Kafka topic, whereas with the Confluent Schema Registry we send schemaID+message in the Kafka topic? So here, the Schema Registry helps performance via a schema lookup in the registry. Is there any other benefit of using the Confluent Schema Registry? Also, does Apache Avro support compatibility rules for schema evolution like the Schema Registry does?
Note: there are other implementations of a "Schema Registry" that can be used with Kafka.
Here is a list of reasons:
Clients can discover schemas without interacting with Kafka. For example, Apache Hive / Presto / Spark can download schemas from the Registry to perform analytics.
The registry is centrally responsible for compatibility checks, rather than requiring each client to perform them on its own (which answers your second question; see the example below).
The same applies to any serialization format, not only Avro.
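As a sketch of what those centralized checks look like in practice with the Confluent Schema Registry REST API (the host, subject, and schema below are placeholders):
# set the compatibility level for one subject
curl -X PUT -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  -d '{"compatibility": "BACKWARD"}' \
  http://schema-registry:8081/config/my-topic-value
# test a candidate schema against the latest registered version
curl -X POST -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  -d '{"schema": "{\"type\":\"record\",\"name\":\"Example\",\"fields\":[{\"name\":\"id\",\"type\":\"string\"}]}"}' \
  http://schema-registry:8081/compatibility/subjects/my-topic-value/versions/latest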

Sending Avro messages to Kafka

I have an app that periodically produces an array of messages in raw JSON. I was able to convert that to Avro using avro-tools. I did that because I needed the messages to include a schema, due to the limitations of the Kafka Connect JDBC sink. I can open this file in Notepad++ and see that it includes the schema and a few lines of data.
Now I would like to send this to my central Kafka broker and then use the Kafka Connect JDBC sink to put the data in a database. I am having a hard time understanding how I should send these Avro files to my Kafka broker. Do I need a schema registry for my purposes? I believe kafkacat does not support Avro, so I suppose I will have to stick with the kafka-console-producer.sh that comes with the Kafka installation (please correct me if I am wrong).
The question is: can someone please share the steps to produce my Avro file to a Kafka broker without getting Confluent involved?
Thanks,
To use the Kafka Connect JDBC Sink, your data needs an explicit schema. The converter that you specify in your connector configuration determines where the schema is held. It can either be embedded within the JSON message (org.apache.kafka.connect.json.JsonConverter with schemas.enable=true) or held in the Schema Registry (one of io.confluent.connect.avro.AvroConverter, io.confluent.connect.protobuf.ProtobufConverter, or io.confluent.connect.json.JsonSchemaConverter).
To learn more about this see https://www.confluent.io/blog/kafka-connect-deep-dive-converters-serialization-explained
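For example, the relevant converter settings in the worker or connector configuration would look something like this (the registry URL is a placeholder):
# Option 1: schema embedded in each JSON message
value.converter=org.apache.kafka.connect.json.JsonConverter
value.converter.schemas.enable=true
# Option 2: schema held in the Schema Registry (Avro)
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://<schema-registry-host>:8081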
To write an Avro message to Kafka you should serialise it as Avro and store the schema in the Schema Registry. There is a Go client library to use, with examples.
"without getting Confluent involved"
It's not entirely clear what you mean by this. The Kafka Connect JDBC Sink is written by Confluent. The best way to manage schemas is with the Schema Registry. If you don't want to use the Schema Registry, then you can embed the schema in your JSON message, but it's a suboptimal way of doing things.
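For what it's worth, the embedded-schema JSON that the JsonConverter expects when schemas.enable=true looks like this (the struct name and fields are just an example):
{
  "schema": {
    "type": "struct",
    "name": "example.Record",
    "optional": false,
    "fields": [
      { "field": "id", "type": "int32", "optional": false },
      { "field": "name", "type": "string", "optional": true }
    ]
  },
  "payload": { "id": 1, "name": "first record" }
}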

How to auto-save Avro schema in Confluent Schema Registry from Apache NiFi flow?

How to auto-save Avro schema in Confluent Schema Registry from Apache NiFi flow?
That's basically the question.
I cannot find a way of automatically storing the Avro schema of a record in the Confluent Schema Registry from a NiFi flow. It is possible to flexibly read and populate messages with a reference to a schema in the Confluent Schema Registry, but there should be a way of auto-creating one in the registry instead of requiring the Confluent Schema Registry to be initialized upfront, before the NiFi flow starts.
Update
Here is my current Flow:
I'm reading from a Postgres table using the QueryDatabaseTableRecord processor (version 1.10) and publishing [new] records to a Kafka topic using PublishKafkaRecord_2_0 (version 1.10.0).
I want to publish to Kafka in Avro format, storing (and passing around) the Avro schema in the Confluent Schema Registry (this works well in other places of my NiFi setup).
For that, I am using AvroRecordSetWriter in the "Record Writer" property on the QueryDatabaseTableRecord processor with the following properties:
The PublishKafkaRecord processor is configured to read the Avro schema from the input message (using the Confluent Schema Registry; the schema is not embedded into each FlowFile) and uses the same AvroRecordSetWriter as the QueryDatabaseTableRecord processor to write to Kafka.
That's basically it.
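In other words, the first writer uses roughly the following combination (this is the general shape rather than a paste of my exact settings, and property names may differ slightly between NiFi versions):
Schema Access Strategy : Use 'Schema Name' Property
Schema Name            : my-record-schema   (placeholder; must already exist in the registry)
Schema Registry        : ConfluentSchemaRegistry controller service
Schema Write Strategy  : Confluent Schema Registry Reference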
I am trying to replace the first AvroRecordSetWriter with one that embeds the schema, in the hope that the second AvroRecordSetWriter could auto-generate the schema in the Confluent Schema Registry on publish, since I don't want to bloat each message with an embedded Avro schema.
Update
I've tried to follow the advice from the comment, as follows:
With that, I was trying to make the first access to the Confluent Schema Registry the last step in the chain. Unfortunately, my attempts were unsuccessful. The only option that worked was the one initially described in this question, which requires the schema to exist in the registry upfront.
Both of the other cases I tried ended with the exception:
org.apache.nifi.schema.access.SchemaNotFoundException: Cannot write Confluent Schema Registry Reference because the Schema Identifier is not known
Please note that I cannot use "Inherit Schema from record" in the last writer's schema access strategy, since that results in an invalid combination and NiFi's configuration validation does not let such a combination through.