Kafka Connect: How can I send protobuf data from Kafka topics to HDFS using the HDFS sink connector?

I have a producer that's producing protobuf messages to a topic. I have a consumer application which deserializes the protobuf messages. But the HDFS sink connector picks up messages from the Kafka topics directly. What would the key and value converters in etc/schema-registry/connect-avro-standalone.properties be set to? What's the best way to do this? Thanks in advance!

Kafka Connect is designed to separate the concern of serialization format in Kafka from individual connectors with the concept of converters. As you seem to have found, you'll need to adjust the key.converter and value.converter classes to implementations that support protobufs. These classes are commonly implemented as a normal Kafka Deserializer followed by a step which performs a conversion from serialization-specific runtime formats (e.g. Message in protobufs) to Kafka Connect's runtime API (which doesn't have any associated serialization format -- it's just a set of Java types and a class to define Schemas).
I'm not aware of an existing implementation. The main challenge in implementing this is that protobuf's wire format can be decoded without the original schema, but since its fields are identified only by integer IDs, you wouldn't get useful schema information (field names, logical types) without either a) requiring that the specific schema is available to the converter, e.g. via config (which makes migrating schemas more complicated), or b) a schema registry service + wrapper format for your data that allows you to look up the schema dynamically.
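For illustration only, the settings in etc/schema-registry/connect-avro-standalone.properties would end up looking roughly like the sketch below once a protobuf-capable converter exists on the worker's classpath. The converter class name here is hypothetical (as noted above, I don't know of a real implementation), and the registry properties only apply if you go with option b:

key.converter=com.example.connect.protobuf.ProtobufConverter
value.converter=com.example.connect.protobuf.ProtobufConverter
# Only needed if the hypothetical converter resolves schemas via a registry service (option b)
key.converter.schema.registry.url=http://localhost:8081
value.converter.schema.registry.url=http://localhost:8081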

Related

How does a Kafka Connect connector know which schema to use?

Let's say I have a bunch of different topics, each with their own JSON schema. In the schema registry, I indicated which schemas exist within the different topics, without directly referring to which topic a schema applies to. Then, in my sink connector, I only refer to the endpoint (URL) of the schema registry. So to my knowledge, I never indicated which registered schema a Kafka connector (e.g., JDBC sink) should use in order to deserialize a message from a certain topic.
Asking here as I can't seem to find anything online.
I am trying to decrease my Kafka message size by removing the overhead of specifying the schema in each message and using the schema registry instead. However, I cannot seem to understand how this could work.
Your producer serializes the schema ID directly into the bytes of each record. Connect (or consumers using the registry-aware JSON deserializer) reads that ID from the record and fetches the corresponding schema from the registry.
https://docs.confluent.io/platform/current/schema-registry/serdes-develop/index.html#wire-format
If you're trying to decrease message size, don't use JSON, but rather a binary format and enable topic compression such as ZSTD
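As a rough sketch of why that's enough: assuming Confluent's JSON Schema converter is installed on the Connect worker, the sink connector config only names the registry endpoint, and the per-record lookup is driven by the ID embedded in each message:

value.converter=io.confluent.connect.json.JsonSchemaConverter
value.converter.schema.registry.url=http://schema-registry:8081
# The converter reads the record's 5-byte prefix (magic byte + 4-byte schema ID, per the
# wire format linked above) and fetches that exact schema from the registry, so no
# topic-to-schema mapping needs to be configured on the connector.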

How to deserialize an Avro message using MirrorMaker?

I want to replicate a Kafka topic to an Azure Event Hub.
The messages are in Avro format and use a schema that is behind a schema registry with USER_INFO authentication.
Using a Java client to connect to Kafka, I can use a KafkaAvroDeserializer to deserialize the messages correctly.
But this configuration doesn't seem to work with MirrorMaker.
Is it possible to deserialize the Avro messages using MirrorMaker before sending them?
Cheers
For MirrorMaker 1, the consumer deserializer properties are hard-coded.
Unless you plan on re-serializing the data into a different format when the producer sends data to EventHub, you should stick to using the default ByteArrayDeserializer.
If you did want to manipulate the messages in any way, that would need to be done with a MirrorMakerMessageHandler subclass.
For MirrorMaker 2, you can use AvroConverter followed by some transforms properties, but ByteArrayConverter would still be preferred for a one-to-one byte copy.
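If you did want MirrorMaker 2 to decode the Avro payloads (for example to re-serialize them or apply schema-aware transforms) rather than do a byte-for-byte copy, the converter section would look roughly like this. This assumes Confluent's AvroConverter, and the auth property names may differ depending on your registry client version:

value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=https://my-registry:8081
# USER_INFO credentials for the registry, as described in the question
value.converter.basic.auth.credentials.source=USER_INFO
value.converter.basic.auth.user.info=<api-key>:<api-secret>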

Why does Kafka Connect work?

I'm trying to wrap my head around how Kafka Connect works and I can't understand one particular thing.
From what I have read and watched, I understand that Kafka Connect allows you to send data into Kafka using Source Connectors and read data from Kafka using Sink Connectors. And the great thing about this is that Kafka Connect somehow abstracts away all the platform-specific things and all you have to care about is having proper connectors. E.g. you can use a PostgreSQL Source Connector to write to Kafka and then use Elasticsearch and Neo4J Sink Connectors in parallel to read the data from Kafka.
My question is: how does this abstraction work? Why are Source and Sink connectors written by different people able to work together? In order to read data from Kafka and write it anywhere, you have to expect some fixed message structure/schema, right? E.g. how does an Elasticsearch Sink know in advance what kind of messages a PostgreSQL Source would produce? What if I replaced the PostgreSQL Source with a MySQL Source? Would the produced messages have the same structure?
It would be logical to assume that Kafka requires some kind of a fixed message structure, but according to the documentation the SourceRecord which is sent to Kafka does not necessarily have a fixed structure:
...can have arbitrary structure and should be represented using org.apache.kafka.connect.data objects (or primitive values). For example, a database connector might specify the sourcePartition as a record containing { "db": "database_name", "table": "table_name"} and the sourceOffset as a Long containing the timestamp of the row.
In order to read data from Kafka and write it anywhere, you have to expect some fixed message structure/schema, right?
Exactly. Refer to the Javadoc on the Struct and Schema classes of the Connect API, as well as the Converter interface.
Of course, those are not strict requirements, but without them the framework doesn't work across different sources and sinks. This is no different from the contract between producers and consumers regarding serialization.
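A minimal sketch of that shared runtime representation (the field names are just an example): every source connector, whatever the upstream system, hands the framework data shaped like this, and every sink connector reads it back the same way:

import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.data.SchemaBuilder;
import org.apache.kafka.connect.data.Struct;

public class ConnectDataExample {
    public static void main(String[] args) {
        // A connector-agnostic description of one record, independent of PostgreSQL/MySQL/etc.
        Schema valueSchema = SchemaBuilder.struct().name("example.Customer")
                .field("id", Schema.INT64_SCHEMA)
                .field("name", Schema.STRING_SCHEMA)
                .build();

        // The value carries its schema with it; a sink connector inspects the schema
        // to decide how to map each field onto its target system.
        Struct value = new Struct(valueSchema)
                .put("id", 42L)
                .put("name", "alice");

        System.out.println(value.schema().fields() + " -> " + value);
    }
}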

Kafka: Replicate topic A to topic B while applying a transformation to the records

I need to mirror records from a topic on cluster A to a topic on cluster B while adding a field onto the records as they are proxied (e.g. InsertField).
I am not controlling cluster A (but could require changes) and have full control of cluster B.
I know that cluster A is sending serialised JSON.
I am using the MirrorMaker API with Kafka Connect to do the mirroring, and I am trying to use the InsertField transformation to add data to the records as they are proxied.
My configuration looks like this:
connector.class=org.apache.kafka.connect.mirror.MirrorSourceConnector
topics=.*
source.cluster.alias=upstream
source.cluster.bootstrap.servers=source:9092
target.cluster.bootstrap.servers=target:9092
# ByteArrayConverter to keep MirrorMaker from re-encoding messages
key.converter=org.apache.kafka.connect.converters.ByteArrayConverter
value.converter=org.apache.kafka.connect.converters.ByteArrayConverter
transforms=InsertSource1
transforms.InsertSource1.type=org.apache.kafka.connect.transforms.InsertField$Value
transforms.InsertSource1.static.field=test_inser
transforms.InsertSource1.static.value=test_value
name=somerandomname
This configuration fails with an error stating:
org.apache.kafka.connect.errors.DataException: Only Struct objects supported for [field insertion]
Is there a way to achieve this without writing a custom transform (I am using Python and am not familiar with Java)?
Thanks a lot
In the current version of Apache Kafka (2.6.0), you cannot apply the InsertField single message transformation (SMT) to MirrorMaker 2.0 records.
Explanation
MirrorMaker 2.0 is based on the Kafka Connect framework and, internally, the MirrorMaker 2.0 driver sets up a MirrorSourceConnector.
Source connectors apply SMTs immediately after polling records; no converters (e.g. ByteArrayConverter or JsonConverter) are involved at this step: they are used after the SMTs have been applied.
The SourceRecord values are represented as byte arrays with the BYTES_SCHEMA schema, while the InsertField transformation requires Type.STRUCT for records with a schema.
So, since the record cannot be treated as a Struct, the transformation cannot be applied.
References
KIP-382: MirrorMaker 2.0
How to Use Single Message Transforms in Kafka Connect
Additional resources
Docker-compose playground for MirrorMaker 2.0
As commented, the ByteArrayConverter carries no Struct/Schema information, so the transform you're using (field insertion) cannot be applied.
This does not mean that no transforms can be used, however.
If you're sending JSON messages, you must send schema and payload information.
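For reference, a sketch of what that looks like with the stock JsonConverter (field names are just an example). The connector would be configured with:

value.converter=org.apache.kafka.connect.json.JsonConverter
value.converter.schemas.enable=true

and each record value on the topic would then be an envelope carrying both schema and payload, which gives InsertField the Struct it needs:

{
  "schema": {
    "type": "struct",
    "name": "example.record",
    "optional": false,
    "fields": [
      {"field": "user", "type": "string", "optional": false}
    ]
  },
  "payload": {"user": "alice"}
}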

Why do we need to specify a Serializer in Apache Kafka?

I got this doubt from this question.
When I am not using Kafka Streams, why do I need to use a Serializer while creating a ZkClient?
Kafka heavily uses ZooKeeper for storing metadata (topics). For that, the com.101tec zkclient library is used. According to the ZkClient source code, it requires a ZkSerializer for serializing/deserializing data sent to and retrieved from ZooKeeper. Kafka ships its own implementation of ZkSerializer: ZKStringSerializer (defined in ZkUtils).
However, for usual interaction with Kafka (producing/consuming) you do not need to create a ZkClient; it is required only for 'administrative' work.
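A rough sketch of what such a serializer looks like with the com.101tec zkclient API; this approximates what Kafka's ZKStringSerializer does, and the exact constructor overload shown is illustrative:

import java.nio.charset.StandardCharsets;

import org.I0Itec.zkclient.ZkClient;
import org.I0Itec.zkclient.exception.ZkMarshallingError;
import org.I0Itec.zkclient.serialize.ZkSerializer;

public class StringZkSerializer implements ZkSerializer {
    @Override
    public byte[] serialize(Object data) throws ZkMarshallingError {
        // ZooKeeper znodes store raw bytes, so every value written must be converted to bytes
        return data == null ? null : data.toString().getBytes(StandardCharsets.UTF_8);
    }

    @Override
    public Object deserialize(byte[] bytes) throws ZkMarshallingError {
        // Everything read back comes out as bytes that must be turned into an object again
        return bytes == null ? null : new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The serializer is handed to ZkClient at construction time
        ZkClient zkClient = new ZkClient("localhost:2181", 10000, 10000, new StringZkSerializer());
        zkClient.close();
    }
}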