Who is responsible for Avro Schema serialization in Kafka queue? - apache-kafka

I am learning the basics of Avro schema serialization. I understand that both keys and values can have their own Avro schemas. However, what I am confused about is how the serialization process actually works.
Do you specify the Avro schemas to use at the time of creating a topic? That way, the producer could post a message as plain JSON text and the Kafka server would know how to serialize/deserialize it. Likewise, the consumer could obtain a record as plain JSON text.
Or, do you specify the schemas to use at the time of posting a message to a topic?
Finally, let's say I define my schemas in mykeyschema.avsc and myvalueschema.avsc. I would appreciate an example of how to use the schemas, either from the command-line Kafka tools or as curl scripts (for the REST Proxy). Thanks.

Topics are independent of schemas; they need not be (and often are not) defined together.
Most importantly, Kafka only knows about byte arrays; the clients decide the serialization format. Only if you choose to pay for Confluent Server, for example, can you force Kafka to accept only Avro bytes (this adds latency to your requests because the broker deserializes records to do the validation, but that is the trade-off for "protecting the topic from bad actors").
That being said, producers are the ones sending data, so they are usually responsible for registering the schema, based on what is sent. Consumers can then decide to use that schema or define their own projection of its fields (Avro requires both a reader and a writer schema for deserialization).
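For example, here is a minimal Python sketch of that flow, assuming the confluent-kafka package and a Schema Registry running at localhost:8081 (the topic name foobar and the field name field1 are placeholders, not anything from your question). The serializer registers the schema under the subject foobar-value the first time a record is produced.
from confluent_kafka import SerializingProducer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer

registry = SchemaRegistryClient({"url": "http://localhost:8081"})
value_schema = open("myvalueschema.avsc").read()   # the schema file from your question

producer = SerializingProducer({
    "bootstrap.servers": "localhost:9092",
    # Registers the schema in the Registry (subject "foobar-value") on first use
    "value.serializer": AvroSerializer(registry, value_schema),
})
producer.produce("foobar", value={"field1": "hello"})  # dict must match the schema
producer.flush()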
example of how to use the schemas either from the command line kafka tools
kafka-avro-console-producer --topic foobar \
--property value.schema="$(jq -rc . < example-schema.avsc)" \
--bootstrap-server localhost:9092 --sync
And rather than type out long JSON payloads, you can redirect a JSON file with line-separated records into it:
kafka-avro-console-producer ... < records.json
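For concreteness, example-schema.avsc could be a small record schema like this (the record name and fields are made-up placeholders):
{"type": "record", "name": "Foobar", "fields": [
    {"name": "field1", "type": "string"},
    {"name": "field2", "type": "int"}
]}
and records.json would then hold one matching JSON object per line:
{"field1": "hello", "field2": 1}
{"field1": "world", "field2": 2}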
REST Proxy
When you POST data, you can provide the key/value schema as JSON-encoded values (not ideal, since it makes every request much larger than necessary), or you can pre-register the schema, which returns an ID that you can use instead.
https://docs.confluent.io/platform/current/kafka-rest/api.html
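For example, a minimal curl sketch against a locally running REST Proxy (port 8082 is the default; the topic, record name, and field are assumptions):
curl -X POST http://localhost:8082/topics/foobar \
  -H "Content-Type: application/vnd.kafka.avro.v2+json" \
  -H "Accept: application/vnd.kafka.v2+json" \
  -d '{"value_schema": "{\"type\": \"record\", \"name\": \"Foobar\", \"fields\": [{\"name\": \"field1\", \"type\": \"string\"}]}", "records": [{"value": {"field1": "hello"}}]}'
If the schema is already registered, replace value_schema with "value_schema_id": <id> to keep each request small.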

Related

How does a kafka connect connector know which schema to use?

Let's say I have a bunch of different topics, each with their own JSON schema. In Schema Registry, I registered which schemas exist, without directly indicating which topic each schema applies to. Then, in my sink connector, I only refer to the endpoint (URL) of the Schema Registry. So to my knowledge, I never indicated which registered schema a Kafka connector (e.g., JDBC sink) should use to deserialize a message from a certain topic?
Asking here as I can't seem to find anything online.
I am trying to decrease my Kafka message size by removing the overhead of specifying the schema in each message and using Schema Registry instead. However, I cannot seem to understand how this could work.
Your producer serializes the schema ID directly into the bytes of the record. Connect (or consumers with the JSON deserializer) then use the schema referenced by each record.
https://docs.confluent.io/platform/current/schema-registry/serdes-develop/index.html#wire-format
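As a rough illustration of that wire format (a sketch, not a full deserializer), the schema ID can be read straight off a record's raw bytes:
import struct

def confluent_schema_id(raw_value: bytes) -> int:
    # Wire format: 1 magic byte (0x00), a 4-byte big-endian schema ID, then the payload.
    magic, schema_id = struct.unpack(">bI", raw_value[:5])
    if magic != 0:
        raise ValueError("value is not in the Confluent wire format")
    return schema_id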
If you're trying to decrease message size, don't use JSON; use a binary format instead, and enable topic compression such as zstd.

How to monitor 'bad' messages written to kafka topic with no schema

I use Kafka Connect to take data from RabbitMQ into a Kafka topic. The data comes without a schema, so in order to associate a schema I use a ksqlDB stream. On top of the stream I create a new topic that now has a defined schema. At the end I take the data to a BigQuery database. My question is: how do I monitor messages that have not passed the stream stage? Also, does this approach support schema evolution? And if not, how can I use the Schema Registry functionality?
Thanks
use Kafka Connect to take data ... data comes without schema
I'm not familiar with the RabbitMQ connector specifically, but if you use the Confluent converter classes that do use schemas, then the data would have one, although maybe only a string or bytes schema.
If ksql is consuming the schema-less topic, then there's a consumer group associated with that process. You can monitor its lag to know how many messages have not yet been processed by ksql. If ksql is unable to parse a message because it's "bad", then I assume it's either skipped or the stream stops consuming completely; this is likely configurable.
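To check that lag from the command line, something along these lines should work; the group name shown is a placeholder, so list the groups first to find the one ksql created (they are typically prefixed with _confluent-ksql-):
kafka-consumer-groups --bootstrap-server localhost:9092 --list
kafka-consumer-groups --bootstrap-server localhost:9092 \
  --describe --group _confluent-ksql-default_query_CSAS_MYSTREAM_1
The LAG column in the output is the number of records the query has not yet consumed.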
If you've set the output topic format to Avro, for example, then the schema will automatically be registered in the Registry. There will be no evolution until you modify the fields of the stream.

Processing data streams in kafka

I am trying to set up a Kafka consumer to process data from Kafka Streams. I was able to set up the connection to the stream, and the data is visible, but it's a mixture of special characters and ASCII.
I am using the built-in Kafka console consumer, but have also tried the Python confluent-kafka client. The only requirement is to use the SASL_PLAINTEXT security protocol with SCRAM-SHA-256. I am open to using other methods to parse the output (not Java, if possible).
Kafka Console
bin/kafka-console-consumer.sh --bootstrap-server server:9092 \
--topic TOPIC --from-beginning --consumer.config=consumer.properties
Confluent Kafka Python
from confluent_kafka import Consumer

topics = "TOPIC"
conf = {
    "bootstrap.servers": "server:9092",
    "group.id": "group",
    "security.protocol": "SASL_PLAINTEXT",
    "sasl.mechanisms": "SCRAM-SHA-256",
}
c = Consumer(conf)
c.subscribe([topics])
running = True
while running:
    message = c.poll()
    print(message.value())
c.close()
Output
PLE9K1PKH3S0MAY38ChangeRequest : llZYMEgVmq2CHG:Infra RequestKSUSMAINCHANGEKC-10200-FL01DATA_MISSINGCHGUSD
DATA_MISSINGDATA_MISSINGUSD
CANCEL
▒▒12SLM:Measurement"Schedule(1 = 0)USDUSD▒▒▒
l▒l▒V?▒▒▒
llZYMEgVmq
company_team team_nameTEAM###SGP000000140381PPL000002020234
Latha M▒>▒>▒ChangeRequest
hello:1234543534 cloud abcdef▒▒▒
▒Ի▒
▒▒▒
John Smithjs12345SGP000000140381▒NPPL000002020234
▒Ի▒
I am trying to parse the data on the standard output initially, but the expectation at the end is to get the parsed data in a database. Any advice would be appreciated.
It seems like your messages are encoded in a binary format. To print them you will need to set up a binary decoder and pass them through it. If they were produced using a specific schema, you might also need to deserialize the objects using the Schema Registry, which contains the schema for the given topic. You are looking at something along the lines of:
import io
from avro.io import BinaryDecoder
message_bytes = io.BytesIO(message.value())
decoder = BinaryDecoder(message_bytes)
As jaivalis has mentioned, there appears to be a mismatch between your producers and the consumer you are using to ingest the data. Kafka Streams exposes two properties for controlling the serialization and deserialization of data that passes through the topology: default.value.serde and default.key.serde. I recommend reviewing your streams application's configuration to find a suitable deserializer for the consumer to use.
https://kafka.apache.org/documentation/#streamsconfigs
Do note, however, that these serdes may be overridden by your streams application's implementation. Be sure to review the implementation as well to ensure you have found the correct serialization format.
https://kafka.apache.org/21/documentation/streams/developer-guide/datatypes.html#overriding-default-serdes
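If it turns out the upstream application is writing Confluent-serialized Avro, a Python sketch along these lines (using the confluent-kafka Schema Registry client; the URLs, credentials, and topic name are placeholders) would let you decode it without Java:
from confluent_kafka import DeserializingConsumer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroDeserializer

registry = SchemaRegistryClient({"url": "http://localhost:8081"})

consumer = DeserializingConsumer({
    "bootstrap.servers": "server:9092",
    "group.id": "group",
    "security.protocol": "SASL_PLAINTEXT",
    "sasl.mechanisms": "SCRAM-SHA-256",
    "sasl.username": "user",        # SCRAM also needs credentials
    "sasl.password": "secret",
    # The deserializer fetches the writer schema from the registry per record
    "value.deserializer": AvroDeserializer(registry),
})
consumer.subscribe(["TOPIC"])
while True:
    msg = consumer.poll(1.0)
    if msg is not None and msg.error() is None:
        print(msg.value())          # a plain dict built from the Avro record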

Kafka Connect: How can I send protobuf data from Kafka topics to HDFS using hdfs sink connector?

I have a producer that's producing protobuf messages to a topic. I have a consumer application which deserializes the protobuf messages. But the HDFS sink connector picks up messages from the Kafka topics directly. What would the key and value converters in etc/schema-registry/connect-avro-standalone.properties be set to? What's the best way to do this? Thanks in advance!
Kafka Connect is designed to separate the concern of serialization format in Kafka from individual connectors with the concept of converters. As you seem to have found, you'll need to adjust the key.converter and value.converter classes to implementations that support protobufs. These classes are commonly implemented as a normal Kafka Deserializer followed by a step which performs a conversion from serialization-specific runtime formats (e.g. Message in protobufs) to Kafka Connect's runtime API (which doesn't have any associated serialization format -- it's just a set of Java types and a class to define Schemas).
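For comparison, the Avro settings that ship in connect-avro-standalone.properties look like this (the registry URL shown is the local default); a protobuf-capable converter would be wired in through the same two keys:
key.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://localhost:8081
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081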
I'm not aware of an existing implementation. The main challenge in implementing this is that protobuf's wire format is only partially self-describing: you can decode it without access to the original schema, but since its fields are identified only by integer IDs, you probably wouldn't get useful schema information without either a) requiring that the specific schema is available to the converter, e.g. via config (which makes migrating schemas more complicated), or b) a schema registry service plus a wrapper format for your data that allows you to look up the schema dynamically.

How can i get the Avro schema object from the received message in kafka?

I am trying to publish/consume my Java objects to/from Kafka using an Avro schema.
My basic program works fine. In my program I use my schema in the producer (for encoding) and the consumer (for decoding).
If I publish different objects to different topics (e.g. 100 topics), then at the receiver I do not know what type of message I received. I would like to get the Avro schema from the received bytes and use that for decoding.
Is my understanding correct? If so, how can I retrieve the schema from the received object?
You won't receive the Avro schema in the received bytes -- and you don't really want to. The whole idea with Avro is to separate the schema from the record, so that it is a much more compact format. The way I do it, I have a topic called Schema. The first thing a Kafka consumer process does is to listen to this topic from the beginning and to parse all of the schemas.
Avro schemas are just JSON string objects -- you can just store one schema per record in the Schema topic.
As to figuring out which schema goes with which topic, as I said in a previous answer, you want one schema per topic, no more. So when you parse a message from a specific topic you know exactly what schema applies, because there can be only one.
If you never re-use the schema, you can just name the schema the same as the topic. However, in practice you probably will use the same schema on multiple topics. In which case, you want to have a separate topic that maps Schemas to Topics. You could create an Avro schema like this:
{"name":"SchemaMapping", "type":"record", "fields":[
{"name":"schemaName", "type":"string"},
{"name":"topicName", "type":"string"}
]}
You would publish a single record per topic with your Avro-encoded mapping into a special topic -- for example called SchemaMapping -- and after consuming the Schema topic from the beginning, a consumer would listen to SchemaMapping and after that it would know exactly which schema to apply for each topic.
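A rough Python sketch of that start-up step (the topic name Schema is as described above; the bootstrap server, group id, and single-partition assumption are mine) could look like this:
from avro.schema import parse
from confluent_kafka import Consumer, TopicPartition, OFFSET_BEGINNING

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "schema-loader",
    "enable.auto.commit": False,
})
# Re-read the Schema topic from the beginning on every start-up
consumer.assign([TopicPartition("Schema", 0, OFFSET_BEGINNING)])

schemas = {}
while True:
    msg = consumer.poll(1.0)
    if msg is None:                                  # caught up with the end of the topic
        break
    schema = parse(msg.value().decode("utf-8"))      # one JSON schema string per record
    schemas[schema.name] = schema
consumer.close()
After that, the consumer would read the SchemaMapping topic the same way to learn which schema to apply to each topic.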