Why do we need to specify Serializer in Apache Kafka?

This doubt came up from another question.
When I am not using Kafka Streams, why do I need to specify a Serializer when creating a ZkClient?

Kafka heavily uses ZooKeeper for storing metadata (e.g. topics). For that, the library com.101tec::zkClient is used. According to the ZkClient source code, it requires a ZkSerializer for serializing/deserializing data sent to and retrieved from ZooKeeper. Kafka itself ships an implementation of ZkSerializer: ZKStringSerializer (defined in ZkUtils).
However, for the usual interaction with Kafka (producing/consuming) you do not need to create a ZkClient. It is required only for 'administrative' work.
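
To make the role of the serializer concrete, here is a minimal sketch of a string serializer that turns values into UTF-8 bytes and back. It is not the Kafka source: BytesSerializer is a hypothetical stand-in for the ZkSerializer contract, and Utf8StringSerializer is an illustrative name.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the ZkSerializer contract: object -> bytes and back.
interface BytesSerializer {
    byte[] serialize(Object data);

    Object deserialize(byte[] bytes);
}

// Illustrative string serializer: values are stored as UTF-8 encoded bytes.
final class Utf8StringSerializer implements BytesSerializer {
    @Override
    public byte[] serialize(Object data) {
        // Only strings are supported in this sketch; anything else fails the cast.
        return ((String) data).getBytes(StandardCharsets.UTF_8);
    }

    @Override
    public Object deserialize(byte[] bytes) {
        // An absent payload is passed through as null.
        return bytes == null ? null : new String(bytes, StandardCharsets.UTF_8);
    }
}
```

The client only moves bytes around, so something has to decide how objects map to bytes; that is the part the serializer supplies.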

Related

Do I really need avro4s when using kafka schema registry?

I noticed Confluent has a Kafka serializer that will let me serialize and deserialize my case classes to and from my Kafka topic, and it will pull the schema from the registry.
If this is the case, what benefit would I get by using avro4s?

You have no obligation to use avro4s. In fact, you do not have to use Avro at all; Kafka does not care about the format you use for serialization. That said, Avro is the de facto serialization format for Kafka, and the serializer you noticed in the Confluent suite (Schema Registry?) is also Avro-based. The only thing you need is to add a dependency on Avro: https://mvnrepository.com/artifact/org.apache.avro/avro/1.10.1
Also, use the sbt-avro plugin. It is not strictly necessary, but your life will be very hard without it: https://github.com/sbt/sbt-avro

Kafka Connect - Connector with same Kafka cluster as a source?

I only found references to MirrorMaker v2.
Can I reuse org.apache.kafka.connect.mirror.MirrorSourceConnector as if it were a "plain" Connector with Kafka as a source, or is there something else, hopefully simpler, available?
I'm trying to use Kafka Connect and (a combination of) its SMTs to simulate the message routing behaviour found in other message brokers.
For example, I would like to consume from a topic, extract values from the message (either headers or payload), and route the message to another topic within the same cluster depending on the data found in the message.
Thank you

within the same cluster
Then that's what Kafka Streams or ksqlDB are for. You can import and use SMT methods directly in code, although you also need to use the Connect Converter classes to get the Schema/Struct types that most of the SMTs require.
While you could use MirrorMaker, moving data around within a single cluster is not its purpose; it is meant for replication between clusters.
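
As a rough sketch of the routing decision itself, the function below picks a destination topic from a message header. Everything here is hypothetical (the "event-type" header, the "events." prefix, the TopicRouter name), and it is plain Java with no Kafka dependency. In a Kafka Streams application, a function like this could be plugged into KStream#to via a TopicNameExtractor.

```java
import java.util.Locale;
import java.util.Map;

// Illustrative routing rule: choose a destination topic from message headers.
final class TopicRouter {
    private TopicRouter() {}

    // "event-type" and the "events." prefix are made-up names for this sketch.
    static String route(Map<String, String> headers, String defaultTopic) {
        String type = headers.get("event-type");
        if (type == null || type.isEmpty()) {
            // No usable header: fall back to the caller-supplied topic.
            return defaultTopic;
        }
        return "events." + type.toLowerCase(Locale.ROOT);
    }
}
```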

how can I pass KafkaAvroSerializer into a Kafka ProducerRecord?

I have messages which are being streamed to Kafka. I would like to convert the messages to Avro binary format (that is, encode them).
I'm using the Confluent Platform. I have a Kafka ProducerRecord[String,String] which sends the messages to the Kafka topic.
Can someone provide a (short) example? Or recommend a website with examples?
Does anyone know how I can pass an instance of a KafkaAvroSerializer into the KafkaProducer?
Can I use an Avro GenericRecord instance inside the ProducerRecord?
Kind regards
Nika

You need to set KafkaAvroSerializer in your producer config as the key serializer, the value serializer, or both, and also set the Schema Registry URL in the producer config (AbstractKafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG).
That serializer will Avro-encode primitives and strings, but if you need complex objects, you could try adding Avro4s, for example. Otherwise, GenericRecord will work as well.
A Java example is here: https://docs.confluent.io/current/schema-registry/serializer-formatter.html
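
A minimal sketch of such a producer config, using only java.util.Properties so it runs without the Kafka jars. The serializer is named by class rather than passed as an instance. The AvroProducerConfig class name, the String key serializer, and the example addresses are assumptions for illustration.

```java
import java.util.Properties;

// Illustrative helper that assembles producer settings for Avro-encoded values.
final class AvroProducerConfig {
    private AvroProducerConfig() {}

    static Properties build(String bootstrapServers, String schemaRegistryUrl) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        // Keys stay plain strings in this sketch; values are Avro-encoded.
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
        // The Avro serializer needs to know where the Schema Registry lives.
        props.put("schema.registry.url", schemaRegistryUrl);
        return props;
    }
}
```

With the Kafka client and Confluent serializer on the classpath, the result of something like AvroProducerConfig.build("localhost:9092", "http://localhost:8081") would be handed to the KafkaProducer constructor, and the record type would become ProducerRecord<String, GenericRecord> instead of ProducerRecord<String, String>.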

Create new Producer from Kafka consumer?

How do I create a new Kafka Producer from an existing Consumer with Java?

You can't create a KafkaProducer from a KafkaConsumer instance.
You have to explicitly create a KafkaProducer using the same connection settings as your consumer.
Considering the use case you mentioned (copying data from a topic to another), I'd recommend using Kafka Streams. There's actually an example in Kafka that does exactly that: https://github.com/apache/kafka/blob/trunk/streams/examples/src/main/java/org/apache/kafka/streams/examples/pipe/PipeDemo.java

I would recommend using the Kafka Streams library. It reads data from Kafka topics, does some processing, and writes the results back to other topics.
That could be a simpler approach for you.
https://kafka.apache.org/documentation/streams/
The current limitation is that the source and destination clusters must be the same with Kafka Streams.
Otherwise you need to use the Processor API to define another destination cluster.
Another approach is to simply define a producer in the consumer program. Wherever your rule matches (based on offset or any other condition), call the producer.send() method.
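
Since the producer has to be built separately, one way to keep it pointed at the same cluster is to derive its settings from the consumer's. The sketch below is plain Java over java.util.Properties; the class name and the choice of String serializers are assumptions, and it copies only connection-level keys (bootstrap.servers plus the security., ssl. and sasl. prefixes).

```java
import java.util.Properties;

// Illustrative helper: build producer settings from an existing consumer config.
final class ProducerConfigFromConsumer {
    private ProducerConfigFromConsumer() {}

    static Properties derive(Properties consumerProps) {
        Properties producerProps = new Properties();
        // stringPropertyNames() only sees entries whose key and value are strings.
        for (String name : consumerProps.stringPropertyNames()) {
            boolean connectionSetting = name.equals("bootstrap.servers")
                    || name.startsWith("security.")
                    || name.startsWith("ssl.")
                    || name.startsWith("sasl.");
            if (connectionSetting) {
                producerProps.setProperty(name, consumerProps.getProperty(name));
            }
        }
        // Consumer-only keys (group.id, deserializers) are dropped; serializers are added.
        producerProps.setProperty("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.setProperty("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        return producerProps;
    }
}
```

The returned Properties would then be passed to a KafkaProducer created alongside the consumer, and producer.send() called from inside the poll loop wherever the rule matches.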

Kafka Connect: How can I send protobuf data from Kafka topics to HDFS using hdfs sink connector?

I have a producer that's producing protobuf messages to a topic. I have a consumer application which deserializes the protobuf messages. But the HDFS sink connector picks up messages from the Kafka topics directly. What should the key and value converters in etc/schema-registry/connect-avro-standalone.properties be set to? What's the best way to do this? Thanks in advance!

Kafka Connect is designed to separate the concern of serialization format in Kafka from individual connectors with the concept of converters. As you seem to have found, you'll need to adjust the key.converter and value.converter classes to implementations that support protobufs. These classes are commonly implemented as a normal Kafka Deserializer followed by a step which performs a conversion from serialization-specific runtime formats (e.g. Message in protobufs) to Kafka Connect's runtime API (which doesn't have any associated serialization format -- it's just a set of Java types and a class to define Schemas).
I'm not aware of an existing implementation. The main challenge in implementing this is that protobufs is self-describing (i.e. you can deserialize it without access to the original schema), but since its fields are simply integer IDs, you probably wouldn't get useful schema information without either a) requiring that the specific schema is available to the converter, e.g. via config (which makes migrating schemas more complicated) or b) a schema registry service + wrapper format for your data that allows you to look up the schema dynamically.
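
To illustrate option b), here is a minimal sketch of a wrapper format that prefixes each payload with a schema ID, so a converter could look the schema up before decoding. The layout (one marker byte followed by a four-byte big-endian ID) and all names are assumptions for illustration, not an existing converter's format.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Illustrative envelope: [marker byte][4-byte schema id][serialized payload].
final class SchemaIdEnvelope {
    private static final byte MARKER = 0x0;
    private static final int HEADER_LENGTH = 1 + Integer.BYTES;

    private SchemaIdEnvelope() {}

    // Prepends the header to an already-serialized message.
    static byte[] wrap(int schemaId, byte[] payload) {
        return ByteBuffer.allocate(HEADER_LENGTH + payload.length)
                .put(MARKER)
                .putInt(schemaId) // ByteBuffer writes big-endian by default
                .put(payload)
                .array();
    }

    // Reads the schema id that a registry lookup would be keyed on.
    static int schemaId(byte[] framed) {
        if (framed.length < HEADER_LENGTH || framed[0] != MARKER) {
            throw new IllegalArgumentException("not a recognised envelope");
        }
        return ByteBuffer.wrap(framed, 1, Integer.BYTES).getInt();
    }

    // Returns the original serialized message without the header.
    static byte[] payload(byte[] framed) {
        schemaId(framed); // validates the header first
        return Arrays.copyOfRange(framed, HEADER_LENGTH, framed.length);
    }
}
```

The producer side would call wrap() after serializing the protobuf message, and a converter would call schemaId() to fetch the matching schema before decoding payload().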