I want to replicate a Kafka topic to an Azure Event Hub.
The messages are in Avro format and use a schema that is behind a Schema Registry with USER_INFO authentication.
Using a Java client to connect to Kafka, I can use a KafkaAvroDeserializer to deserialize the messages correctly.
But this configuration doesn't seem to work with MirrorMaker.
Is it possible to deserialize the Avro messages using MirrorMaker before sending them?
Cheers
For MirrorMaker 1, the consumer deserializer properties are hard-coded.
Unless you plan on re-serializing the data into a different format when the producer sends data to EventHub, you should stick to using the default ByteArrayDeserializer.
If you did want to manipulate the messages in any way, that would need to be done with a MirrorMakerMessageHandler subclass.
For MirrorMaker 2, you can use the AvroConverter followed by some transforms properties, but the ByteArrayConverter would still be preferred for a one-to-one byte copy.
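For illustration, a MirrorMaker 2 / MirrorSourceConnector config for this could look roughly like the sketch below (cluster aliases, bootstrap servers, topic names, and registry credentials are placeholders, and the Event Hubs SASL settings are omitted). The ByteArrayConverter lines give the one-to-one byte copy; the commented AvroConverter lines show where the USER_INFO credentials would go if you really did want to re-serialize:

connector.class=org.apache.kafka.connect.mirror.MirrorSourceConnector
source.cluster.alias=source
target.cluster.alias=eventhub
source.cluster.bootstrap.servers=source-kafka:9092
target.cluster.bootstrap.servers=my-namespace.servicebus.windows.net:9093
topics=my-avro-topic

# One-to-one byte copy: no deserialization, the schema id bytes pass through untouched
key.converter=org.apache.kafka.connect.converters.ByteArrayConverter
value.converter=org.apache.kafka.connect.converters.ByteArrayConverter

# Only if you need to re-serialize instead (placeholders for the USER_INFO credentials):
# value.converter=io.confluent.connect.avro.AvroConverter
# value.converter.schema.registry.url=https://schema-registry:8081
# value.converter.basic.auth.credentials.source=USER_INFO
# value.converter.basic.auth.user.info=API_KEY:API_SECRET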
Let's say I have a bunch of different topics, each with their own JSON schema. In Schema Registry, I indicated which schemas exist within the different topics, without directly referring to which topic a schema applies to. Then, in my sink connector, I only refer to the endpoint (URL) of the schema registry. So to my knowledge, I never indicated which registered schema a Kafka connector (e.g., a JDBC sink) should use in order to deserialize a message from a certain topic.
Asking here as I can't seem to find anything online.
I am trying to decrease my Kafka message size by removing the overhead of having to specify the schema in each message, and using the schema registry instead. However, I cannot seem to understand how this could work.
Your producer serializes the schema id directly into the bytes of the record. Connect (or consumers with the JSON deserializer) use the schema that's part of each record.
https://docs.confluent.io/platform/current/schema-registry/serdes-develop/index.html#wire-format
If you're trying to decrease message size, don't use JSON; use a binary format instead, and enable topic compression such as ZSTD.
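To make the wire format concrete: the Confluent serializers prepend a magic byte and a 4-byte schema id to every record, so only the id (not the schema itself) travels with each message. A minimal Java sketch of reading that id back from raw record bytes (class and method names are made up):

import java.nio.ByteBuffer;

public class WireFormatPeek {
    // Returns the Schema Registry id embedded at the front of a Confluent-serialized record
    public static int schemaIdOf(byte[] value) {
        ByteBuffer buffer = ByteBuffer.wrap(value);
        byte magicByte = buffer.get();       // always 0 in the Confluent wire format
        if (magicByte != 0) {
            throw new IllegalArgumentException("Not in Confluent wire format");
        }
        return buffer.getInt();              // 4-byte big-endian schema id; the encoded payload follows
    }
}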
I use Kafka Connect to take data from RabbitMQ into a Kafka topic. The data comes without a schema, so in order to associate a schema I use a ksql stream. On top of the stream I create a new topic that now has a defined schema. At the end I take the data to a BigQuery database. My question is: how do I monitor messages that have not passed the stream stage? In this way, do I support schema evolution? And if not, how can I use the Schema Registry functionality?
Thanks
use Kafka Connect to take data ... data comes without schema
I'm not familiar specifically with the RabbitMQ connector, but if you use the Confluent converter classes that do use schemas, then it would have one, although maybe only a string or bytes schema.
If ksql is consuming the non-schema topic, then there's a consumer group associated with that process. You can monitor its lag to know how many messages have not yet been processed by ksql. If ksql is unable to parse a message because it's "bad", then I assume it's either skipped or the stream stops consuming completely; this is likely configurable.
If you've set the output topic format to Avro, for example, then the schema will automatically be registered in the Registry. There will be no evolution until you modify the fields of the stream.
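As a rough ksql sketch of that flow (topic and column names are invented): declare a stream over the schemaless RabbitMQ topic, then create a derived Avro stream, which is the step that registers the schema:

CREATE STREAM rabbit_raw (id VARCHAR, amount DOUBLE)
  WITH (KAFKA_TOPIC='rabbitmq-data', VALUE_FORMAT='JSON');

CREATE STREAM rabbit_avro
  WITH (KAFKA_TOPIC='rabbitmq-data-avro', VALUE_FORMAT='AVRO') AS
  SELECT * FROM rabbit_raw;

The persistent query behind rabbit_avro has its own consumer group, which is what you would watch for lag.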
I need to mirror records from a topic on cluster A to a topic on cluster B, while adding a field to the records as they are proxied (e.g. InsertField).
I do not control cluster A (though I could request changes) and have full control of cluster B.
I know that cluster A is sending serialised JSON.
I am using the MirrorMaker API with Kafka Connect to do the mirroring, and I am trying to use the InsertField transformation to add data to the records as they are proxied.
My configuration looks like this:
connector.class=org.apache.kafka.connect.mirror.MirrorSourceConnector
topics=.*
source.cluster.alias=upstream
source.cluster.bootstrap.servers=source:9092
target.cluster.bootstrap.servers=target:9092
# ByteArrayConverter, to prevent MirrorMaker from re-encoding messages
key.converter=org.apache.kafka.connect.converters.ByteArrayConverter
value.converter=org.apache.kafka.connect.converters.ByteArrayConverter
transforms=InsertSource1
transforms.InsertSource1.type=org.apache.kafka.connect.transforms.InsertField$Value
transforms.InsertSource1.static.field=test_inser
transforms.InsertSource1.static.value=test_value
name=somerandomname
This configuration will fail with an error stating:
org.apache.kafka.connect.errors.DataException: Only Struct objects
supported for [field insertion]
Is there a way to achieve this without writing a custom transform (I am using Python and I am not familiar with Java)?
Thanks a lot
In the current version of Apache Kafka (2.6.0), you cannot apply the InsertField single message transformation (SMT) to MirrorMaker 2.0 records.
Explanation
MirrorMaker 2.0 is based on the Kafka Connect framework and, internally, the MirrorMaker 2.0 driver sets up a MirrorSourceConnector.
Source connectors apply SMTs immediately after polling records; converters (e.g. ByteArrayConverter or JsonConverter) are not involved at this step, since they are applied only after the SMTs have run.
The SourceRecord value is represented as a byte array with the BYTES_SCHEMA schema, while the InsertField transformation requires Type.STRUCT for records with a schema.
So, since the record value cannot be treated as a Struct, the transformation fails with the DataException above.
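To illustrate what the SMT chain actually sees (a hypothetical sketch, not MirrorMaker code), the record handed to the transforms carries BYTES_SCHEMA, which is why InsertField$Value throws the DataException:

import java.util.Collections;
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.source.SourceRecord;

public class MirrorRecordShape {
    public static void main(String[] args) {
        byte[] keyBytes = new byte[]{1};        // illustrative raw key bytes
        byte[] valueBytes = new byte[]{2, 3};   // illustrative raw value bytes

        // MirrorSourceConnector emits records whose key/value schemas are BYTES_SCHEMA,
        // so the transform never sees a Struct it could add a field to.
        SourceRecord record = new SourceRecord(
                Collections.singletonMap("cluster", "upstream"),  // source partition (illustrative)
                Collections.singletonMap("offset", 42L),          // source offset (illustrative)
                "upstream.some-topic", null,
                Schema.BYTES_SCHEMA, keyBytes,
                Schema.BYTES_SCHEMA, valueBytes);

        System.out.println(record.valueSchema().type());          // BYTES, not STRUCT
    }
}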
References
KIP-382: MirrorMaker 2.0
How to Use Single Message Transforms in Kafka Connect
Additional resources
Docker-compose playground for MirrorMaker 2.0
As commented, the ByteArrayConverter carries no Struct/Schema information, so the transform you're using (adding a field) cannot be applied.
This does not mean that no transforms can be used, however.
If you're sending JSON messages, you must send schema and payload information in each message.
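For reference, the schema-and-payload envelope that the JsonConverter reads when schemas are enabled looks roughly like this (field names invented):

{
  "schema": {
    "type": "struct",
    "name": "example.Record",
    "optional": false,
    "fields": [
      { "field": "id", "type": "string", "optional": false },
      { "field": "amount", "type": "int64", "optional": true }
    ]
  },
  "payload": { "id": "abc", "amount": 42 }
}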
I have messages which are being streamed to Kafka. I would like to convert the messages into Avro binary format (that is, to encode them).
I'm using the Confluent Platform. I have a Kafka ProducerRecord[String, String] which sends the messages to the Kafka topic.
Can someone provide a (short) example, or recommend a website with examples?
Does anyone know how I can pass an instance of KafkaAvroSerializer into the KafkaProducer?
Can I use an Avro GenericRecord instance inside the ProducerRecord?
Kind regards
Nika
You need to use the KafkaAvroSerializer in your producer config for either serializer config (key and/or value), and also set the schema registry URL in the producer config (AbstractKafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG).
That serializer will Avro-encode primitives and strings, but if you need complex objects, you could try adding Avro4s, for example. Otherwise, GenericRecord will work as well.
Java example is here - https://docs.confluent.io/current/schema-registry/serializer-formatter.html
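A minimal sketch of the producer side (topic name, schema, and registry URL are placeholders), showing the serializer config and a GenericRecord value:

import java.util.Properties;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AvroProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081"); // AbstractKafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG

        Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"User\",\"fields\":[{\"name\":\"name\",\"type\":\"string\"}]}");
        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "Nika");

        // The record value type becomes GenericRecord rather than String once the Avro serializer is used
        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("users-avro", "key-1", user));
        }
    }
}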
I have a producer that's producing protobuf messages to a topic. I have a consumer application which deserializes the protobuf messages. But the HDFS sink connector picks up messages from the Kafka topics directly. What should the key and value converters in etc/schema-registry/connect-avro-standalone.properties be set to? What's the best way to do this? Thanks in advance!
Kafka Connect is designed to separate the concern of serialization format in Kafka from individual connectors with the concept of converters. As you seem to have found, you'll need to adjust the key.converter and value.converter classes to implementations that support protobufs. These classes are commonly implemented as a normal Kafka Deserializer followed by a step which performs a conversion from serialization-specific runtime formats (e.g. Message in protobufs) to Kafka Connect's runtime API (which doesn't have any associated serialization format -- it's just a set of Java types and a class to define Schemas).
I'm not aware of an existing implementation. The main challenge in implementing this is that protobufs is self-describing (i.e. you can deserialize it without access to the original schema), but since its fields are simply integer IDs, you probably wouldn't get useful schema information without either a) requiring that the specific schema is available to the converter, e.g. via config (which makes migrating schemas more complicated) or b) a schema registry service + wrapper format for your data that allows you to look up the schema dynamically.
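As a rough illustration of the shape such a converter would take (this is a hypothetical skeleton, not an existing implementation), Connect's Converter interface pairs the byte-level (de)serialization with a mapping to and from Connect's Schema/Struct types:

import java.util.Map;
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.data.SchemaAndValue;
import org.apache.kafka.connect.storage.Converter;

public class MyProtobufConverter implements Converter {

    @Override
    public void configure(Map<String, ?> configs, boolean isKey) {
        // e.g. read a configured message class or a registry URL here (assumption)
    }

    @Override
    public byte[] fromConnectData(String topic, Schema schema, Object value) {
        // map a Connect Struct back to a protobuf Message and return its serialized bytes
        throw new UnsupportedOperationException("sketch only");
    }

    @Override
    public SchemaAndValue toConnectData(String topic, byte[] value) {
        // parse the protobuf bytes, then build a Connect Schema and Struct from the message
        throw new UnsupportedOperationException("sketch only");
    }
}

A class like this is what the key.converter and value.converter properties in connect-avro-standalone.properties would point at.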