Flink KafkaSource with multiple topics, each topic with different avro schema of data - apache-kafka

If I create multiple of the below, one for each topic:
KafkaSource<T> kafkaDataSource = KafkaSource.<T>builder()
        .setBootstrapServers(consumerProps.getProperty("bootstrap.servers"))
        .setTopics(topic)
        .setDeserializer(deserializer)
        .setGroupId(identifier)
        .setProperties(consumerProps)
        .build();
then the deserializer seems to run into an issue: it ends up reading data from a different topic, with a different schema than the one it was meant for, and fails!
If I provide all topics in the same KafkaSource, then the watermarks seem to progress across the topics together.
DataStream<T> dataSource = environment.fromSource(kafkaDataSource,
        WatermarkStrategy.<T>forBoundedOutOfOrderness(Duration.ofMillis(2000))
                .withTimestampAssigner((event, timestamp) -> {...}),
        "");
Also the avro data in the kafka itself holds the first magic byte for schema (schema info is embedded), not using any external avro registry (it's all in the libraries).
It works fine with FlinkKafkaConsumer (created multiple instances of it).
FlinkKafkaConsumer<T> kafkaConsumer = new FlinkKafkaConsumer<>(topic, deserializer, consumerProps);
kafkaConsumer.assignTimestampsAndWatermarks(
        WatermarkStrategy.<T>forBoundedOutOfOrderness(Duration.ofMillis(2000))
                .withTimestampAssigner((event, timestamp) -> {...}));
Not sure if the problem is in the way I am using it? Any pointers on how to solve this would be appreciated. Also, FlinkKafkaConsumer is deprecated.

Figured it out based on the code in Custom avro message deserialization with Flink: I implemented the open method and made the instance fields of the deserializer transient.
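For reference, a minimal sketch of that pattern, assuming a GenericRecord-producing deserializer and a writer schema supplied up front as a string (the class name and the schema handling are illustrative, not the exact code from the question, where the schema is embedded in each message). The point of the fix is that the non-serializable Avro objects live in transient fields and are created in open(), so each source gets its own fresh state:

import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.flink.api.common.serialization.DeserializationSchema;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.connector.kafka.source.reader.deserializer.KafkaRecordDeserializationSchema;
import org.apache.flink.formats.avro.typeutils.GenericRecordAvroTypeInfo;
import org.apache.flink.util.Collector;
import org.apache.kafka.clients.consumer.ConsumerRecord;

public class AvroRecordDeserializer implements KafkaRecordDeserializationSchema<GenericRecord> {

    private final String schemaString;                           // serializable schema definition

    private transient Schema schema;                              // rebuilt in open(), never serialized
    private transient GenericDatumReader<GenericRecord> reader;   // Avro objects are not Serializable
    private transient BinaryDecoder decoder;                      // reused between records

    public AvroRecordDeserializer(String schemaString) {
        this.schemaString = schemaString;
    }

    @Override
    public void open(DeserializationSchema.InitializationContext context) {
        // Build the per-instance, non-serializable state here rather than in the constructor.
        schema = new Schema.Parser().parse(schemaString);
        reader = new GenericDatumReader<>(schema);
    }

    @Override
    public void deserialize(ConsumerRecord<byte[], byte[]> record, Collector<GenericRecord> out) throws IOException {
        decoder = DecoderFactory.get().binaryDecoder(record.value(), decoder);
        out.collect(reader.read(null, decoder));
    }

    @Override
    public TypeInformation<GenericRecord> getProducedType() {
        return new GenericRecordAvroTypeInfo(new Schema.Parser().parse(schemaString));
    }
}

A separate instance of such a deserializer would then be passed to each per-topic KafkaSource via setDeserializer(...), so no state is shared between topics.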

Related

Kafka Avro deserializer with local schema file

Background :
I used Spring Kafka to implement an Avro-based consumer and producer. Other important components (Kafka broker, ZooKeeper, and Schema Registry) run in a Docker container, and this works perfectly fine for me.
What I Want :
I want to have a Kafka Avro deserializer (in the consumer) which is independent of the Schema Registry. In my case, I have an Avro schema file which will not change, so I want to get rid of the additional step of using the Schema Registry on the consumer side and instead go for a local schema file.
If the Producer serializer uses the Schema Registry, then the Consumer should as well. Avro requires you to have a reader and writer schema.
If the consumer, for whatever reason, cannot access the Registry over the network, you would need to use ByteArrayDeserializer and then take the byte slice after position 5 (the 0x0 magic byte plus a 4-byte schema ID integer) of the byte[] from the part of the record you want to parse.
Then, from the Avro API, you can use GenericDatumReader along with your local Schema reference to get a GenericRecord instance, but this assumes your reader and writer schema are exactly the same (which you cannot rely on, as the producer could change the schema at any time).
Or you can create a SpecificRecord from the schema you have, and configure the KafkaAvroDeserializer to use that.
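A rough sketch of the "skip the wire-format header and decode locally" approach, assuming the records follow the Confluent wire format (one 0x0 magic byte plus a 4-byte schema ID) and that the local schema really does match the writer schema; the class and method names are just for illustration:

import java.io.IOException;
import java.nio.ByteBuffer;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.DecoderFactory;

public class LocalSchemaDecoder {

    private final GenericDatumReader<GenericRecord> reader;

    public LocalSchemaDecoder(Schema localSchema) {
        this.reader = new GenericDatumReader<>(localSchema);
    }

    public GenericRecord decode(byte[] confluentEncodedValue) throws IOException {
        ByteBuffer buffer = ByteBuffer.wrap(confluentEncodedValue);
        if (buffer.get() != 0x0) {
            throw new IllegalArgumentException("Unknown magic byte, not Confluent wire format");
        }
        int schemaId = buffer.getInt();  // read but ignored; we trust the local schema instead
        // Decode the remaining bytes (everything after the 5-byte header) with the local schema.
        BinaryDecoder decoder = DecoderFactory.get()
                .binaryDecoder(confluentEncodedValue, 5, confluentEncodedValue.length - 5, null);
        return reader.read(null, decoder);
    }
}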

How to consume and parse different Avro messages in kafka consumer

In my application, Kafka topics are dedicated to a domain (I can't change that), and multiple different types of events (1 event = 1 Avro schema message) related to that domain are produced into that one topic by different microservices.
Now I have only one consumer app, in which I should be able to apply different schemas dynamically (by inspecting the event name in the message) and transform each message into the appropriate POJO (generated from the specific Avro schema) for further event-specific actions.
Every example I find on the net is about a single-schema-type message consumer, so I need some help.
Related blog post: https://www.confluent.io/blog/multiple-event-types-in-the-same-kafka-topic/
How to configure the consumer:
https://docs.confluent.io/platform/current/schema-registry/serdes-develop/serdes-avro.html#avro-deserializer
https://github.com/openweb-nl/kafka-graphql-examples/blob/307bbad6f10e4aaa6b797a3bbe3b6620d3635263/graphql-endpoint/src/main/java/nl/openweb/graphql_endpoint/service/AccountCreationService.java#L47
https://github.com/openweb-nl/kafka-graphql-examples/blob/307bbad6f10e4aaa6b797a3bbe3b6620d3635263/graphql-endpoint/src/main/resources/application.yml#L20
You need the generated Avro classes on the classpath, most likely by adding a dependency.
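As a hedged sketch of what such a consumer can look like with the Confluent deserializer (the topic name, group id, registry URL, and the OrderCreated / OrderShipped event classes are hypothetical placeholders, not from the question):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;
import io.confluent.kafka.serializers.AbstractKafkaAvroSerDeConfig;
import io.confluent.kafka.serializers.KafkaAvroDeserializer;
import io.confluent.kafka.serializers.KafkaAvroDeserializerConfig;

public class MultiEventConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "domain-events-consumer");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, KafkaAvroDeserializer.class);
        props.put(AbstractKafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG, "http://localhost:8081");
        // Return instances of the generated SpecificRecord classes instead of GenericRecord.
        props.put(KafkaAvroDeserializerConfig.SPECIFIC_AVRO_READER_CONFIG, true);

        try (KafkaConsumer<String, Object> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("domain-events"));
            while (true) {
                for (ConsumerRecord<String, Object> record : consumer.poll(Duration.ofMillis(500))) {
                    Object event = record.value();
                    if (event instanceof OrderCreated) {          // hypothetical generated class
                        // handle the OrderCreated event
                    } else if (event instanceof OrderShipped) {   // hypothetical generated class
                        // handle the OrderShipped event
                    }
                }
            }
        }
    }
}

Note that for multiple schemas to coexist in one topic, the producer side typically needs a subject naming strategy other than the default (e.g. RecordNameStrategy), as described in the Confluent blog post linked above.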

how can I pass KafkaAvroSerializer into a Kafka ProducerRecord?

I have messages which are being streamed to Kafka. I would like to convert the messages to Avro binary format (i.e. encode them).
I'm using the confluent platform. I have a Kafka ProducerRecord[String,String] which sends the messages to the Kafka topic.
Can someone provide with a (short) example? Or recommend a website with examples?
Does anyone know how I can pass an instance of a KafkaAvroSerializer into the KafkaProducer?
Can I use an Avro GenericRecord instance inside the ProducerRecord?
Kind regards
Nika
You need to use KafkaAvroSerializer in your producer config for either serializer config (key and/or value serializer), and also set the schema registry URL in the producer config (AbstractKafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG).
That serializer will Avro-encode primitives and strings, but if you need complex objects, you could try adding Avro4s, for example. Otherwise, GenericRecord will work as well.
Java example is here - https://docs.confluent.io/current/schema-registry/serializer-formatter.html
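A short, hedged example of that configuration (the topic name, registry URL, and schema are made up for illustration), showing a GenericRecord value sent through KafkaAvroSerializer:

import java.util.Properties;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import io.confluent.kafka.serializers.AbstractKafkaAvroSerDeConfig;
import io.confluent.kafka.serializers.KafkaAvroSerializer;

public class AvroProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, KafkaAvroSerializer.class);
        props.put(AbstractKafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG, "http://localhost:8081");

        // A tiny example schema; in practice this would come from an .avsc file or generated class.
        Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
                + "{\"name\":\"name\",\"type\":\"string\"}]}");
        GenericRecord value = new GenericData.Record(schema);
        value.put("name", "Nika");

        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            // The GenericRecord goes straight into the ProducerRecord; the serializer handles encoding.
            producer.send(new ProducerRecord<>("user-events", "key-1", value));
        }
    }
}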

Kafka multiple Topics into the same avro file

I am new to the Kafka world and I would like to ask you for some important information related to my project.
I am using an Avro file for producing and consuming messages, and I want to know if I can use the same Avro file for multiple topics, for example by using a different "name" attribute in the producer and a specific "name" attribute in the consumer.
Thanks a lot.
Stefano
You can use one file to send data to multiple topics, yes, although I'm not sure why one would do that.
I would be cautious about merging multiple topics into one Avro file, because the schema must match in every topic for that file.
I would suggest you use the Confluent Schema Registry, for example, rather than sending individual Avro events with the schema embedded: if you are not using a registry, you're likely sending the Avro schema as part of every message, which will reduce the possible throughput of your topic. With the registry, the name of the Avro schema record corresponds to the topic name.
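To illustrate the first point, here is a small sketch (the file path, field name, and topic names are hypothetical) in which a single schema parsed from one .avsc file backs records sent to two different topics; with the default subject naming strategy it is registered under each topic's value subject:

import java.io.File;
import java.io.IOException;
import java.util.Properties;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import io.confluent.kafka.serializers.AbstractKafkaAvroSerDeConfig;
import io.confluent.kafka.serializers.KafkaAvroSerializer;

public class OneSchemaTwoTopics {
    public static void main(String[] args) throws IOException {
        // Parse the schema once from a single local file (hypothetical path).
        Schema schema = new Schema.Parser().parse(new File("event.avsc"));

        GenericRecord record = new GenericData.Record(schema);
        record.put("name", "example");  // assumes the schema has a string field "name"

        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, KafkaAvroSerializer.class);
        props.put(AbstractKafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG, "http://localhost:8081");

        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            // The same schema and record type go to two different topics.
            producer.send(new ProducerRecord<>("orders", record));
            producer.send(new ProducerRecord<>("audit", record));
        }
    }
}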

Kafka Connect: How can I send protobuf data from Kafka topics to HDFS using hdfs sink connector?

I have a producer that's producing protobuf messages to a topic. I have a consumer application which deserializes the protobuf messages. But the HDFS sink connector picks up messages from the Kafka topics directly. What should the key and value converters in etc/schema-registry/connect-avro-standalone.properties be set to? What's the best way to do this? Thanks in advance!
Kafka Connect is designed to separate the concern of serialization format in Kafka from individual connectors with the concept of converters. As you seem to have found, you'll need to adjust the key.converter and value.converter classes to implementations that support protobufs. These classes are commonly implemented as a normal Kafka Deserializer followed by a step which performs a conversion from serialization-specific runtime formats (e.g. Message in protobufs) to Kafka Connect's runtime API (which doesn't have any associated serialization format -- it's just a set of Java types and a class to define Schemas).
I'm not aware of an existing implementation. The main challenge in implementing this is that protobufs is self-describing (i.e. you can deserialize it without access to the original schema), but since its fields are simply integer IDs, you probably wouldn't get useful schema information without either a) requiring that the specific schema is available to the converter, e.g. via config (which makes migrating schemas more complicated) or b) a schema registry service + wrapper format for your data that allows you to look up the schema dynamically.