Specify KSQL Stream Subject names explicitly

I have two Kafka topics, my-topic-1 and my-topic-2, with messages serialised via AVRO. For historical reasons, the schema for my-topic-1 is not registered under the recommended <topic>-value subject name, but under my-custom-subject-name instead.
I want to move records from one topic to the other via KSQL.
First up, let's create a stream:
CREATE STREAM my_stream_1
WITH (KAFKA_TOPIC='my-topic-1', VALUE_FORMAT='AVRO');
oops:
Avro schema for message values on topic my-topic-1 does not exist in the Schema Registry.
Subject: my-topic-1-value
Possible causes include:
- The topic itself does not exist
-> Use SHOW TOPICS; to check
- Messages on the topic are not Avro serialized
-> Use PRINT 'my-topic-1' FROM BEGINNING; to verify
- Messages on the topic have not been serialized using the Confluent Schema Registry Avro serializer
-> See https://docs.confluent.io/current/schema-registry/docs/serializer-formatter.html
- The schema is registered on a different instance of the Schema Registry
-> Use the REST API to list available subjects
https://docs.confluent.io/current/schema-registry/docs/api.html#get--subjects
It's looking for the subject my-topic-1-value
Does anyone have any idea if this is possible? VALUE_AVRO_SCHEMA_FULL_NAME mentioned here doesn't do what I want it to.

This appears to be a bug. I've updated https://github.com/confluentinc/ksql/issues/3188 with an example to reproduce. I suggest we track it there.

Related

How does a kafka connect connector know which schema to use?

Let's say I have a bunch of different topics, each with their own JSON schema. In the Schema Registry, I registered which schemas exist, without directly indicating which topic each schema applies to. Then, in my sink connector, I only refer to the endpoint (URL) of the Schema Registry. So to my knowledge, I never indicated which registered schema a Kafka connector (e.g., a JDBC sink) should use in order to deserialize a message from a certain topic?
Asking here as I can't seem to find anything online.
I am trying to decrease my Kafka message size by removing the overhead of specifying the schema in each message and using the Schema Registry instead. However, I cannot seem to understand how this could work.
Your producer serializes the schema id directly into the bytes of the record. Connect (or consumers with the JSON deserializer) use the schema that's part of each record.
https://docs.confluent.io/platform/current/schema-registry/serdes-develop/index.html#wire-format
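A minimal sketch of what that wire format looks like in practice (Python; the framing is the standard Confluent one, the function name is just illustrative): each value starts with a magic byte 0 and a 4-byte big-endian schema id, followed by the Avro payload.
import struct

def split_confluent_wire_format(value: bytes):
    # Confluent wire format: byte 0 = magic byte (0), bytes 1-4 = schema id, rest = Avro payload
    if len(value) < 5 or value[0] != 0:
        raise ValueError("not in Confluent wire format (missing magic byte 0)")
    schema_id = struct.unpack(">I", value[1:5])[0]
    return schema_id, value[5:]

# e.g. schema_id, payload = split_confluent_wire_format(record_value_bytes)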
If you're trying to decrease message size, don't use JSON; use a binary format instead, and enable topic compression such as ZSTD.

ksqlDB can't get data from Schema Registry

Case: I have a topic in Kafka named some_table_prefix.table_name. Data is serialized with AVRO, but for historical reasons the subject in the Schema Registry is named table_name-value.
When I try to set up a ksqlDB stream
CREATE STREAM some_stream_name
WITH (KAFKA_TOPIC='some_table_prefix.table_name', VALUE_FORMAT='AVRO');
I get the error Schema for message values on topic some_table_prefix.table_name does not exist in the Schema Registry. Subject: some_table_prefix.table_name-value.
I have the Schema Registry integrated correctly; for other topics everything works fine.
So, is it possible to specify the Schema Registry subject name when creating a ksqlDB stream, or to resolve this issue some other way?
If you have a topic named table_name that has Avro being produced to it (which would automatically create table_name-value in the Registry), then that's what ksqlDB should consume from. If you'd manually created that subject by posting the schema on your own, without matching the topic name, then that's part of the problem.
As the error says, it's looking for a specific subject in the Registry based on the topic you've provided. To my knowledge, it's not possible to point ksqlDB at another subject name, so the workaround is to POST the old subject's schemas into the new one.
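A minimal sketch of that workaround against the Schema Registry REST API, assuming the Registry runs at http://localhost:8081 and using the subject names from this question as placeholders (in practice you may want to copy every version, not just the latest):
import requests

REGISTRY = "http://localhost:8081"                    # assumed Schema Registry URL
OLD_SUBJECT = "table_name-value"                      # subject that already exists
NEW_SUBJECT = "some_table_prefix.table_name-value"    # subject ksqlDB is looking for

# fetch the latest schema registered under the old subject
old = requests.get(f"{REGISTRY}/subjects/{OLD_SUBJECT}/versions/latest").json()

# register the same schema text under the topic-derived subject name
resp = requests.post(
    f"{REGISTRY}/subjects/{NEW_SUBJECT}/versions",
    headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
    json={"schema": old["schema"]},
)
print(resp.json())  # e.g. {"id": 1}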

How to monitor 'bad' messages written to kafka topic with no schema

I use Kafka Connect to move data from RabbitMQ into a Kafka topic. The data arrives without a schema, so in order to associate a schema with it I use a ksqlDB stream. On top of the stream I create a new topic that now has a defined schema. At the end I sink the data into a BigQuery database. My question is: how do I monitor messages that have not passed the stream stage? Also, does this approach support schema evolution, and if not, how can I use the Schema Registry functionality?
Thanks
use Kafka Connect to take data ... data comes without schema
I'm not familiar specifically with the RabbitMQ connector, but if you use the Confluent converter classes that do use schemas, then it would have one, although maybe only a string or bytes schema.
If ksql is consuming the non-schema topic, then there's a consumer group associated with that process. You can monitor its lag to know how many messages have not yet been processed by ksql. If ksql is unable to parse a message because it's "bad", then I assume it's either skipped or the stream stops consuming completely; this is likely configurable
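As a rough sketch of that lag check with the confluent-kafka Python client (the group id, topic name, broker address, and partition count below are placeholders; creating the consumer with the query's group id but never subscribing only reads its committed offsets):
from confluent_kafka import Consumer, TopicPartition

GROUP_ID = "the-ksql-query-consumer-group"   # placeholder: the ksqlDB query's consumer group
TOPIC = "rabbitmq-raw-topic"                 # placeholder: the non-schema input topic

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",   # assumed broker address
    "group.id": GROUP_ID,                    # same group, but we never subscribe or poll
})

partitions = [TopicPartition(TOPIC, p) for p in range(3)]   # assuming 3 partitions
for tp in consumer.committed(partitions, timeout=10):
    low, high = consumer.get_watermark_offsets(tp, timeout=10)
    committed = tp.offset if tp.offset >= 0 else low        # a negative offset means nothing committed yet
    print(f"partition {tp.partition}: lag = {high - committed}")

consumer.close()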
If you've set the output topic format to Avro, for example, then the schema will automatically be registered to the Registry. There will be no evolution until you modify the fields of the stream
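If you want to confirm that the registration happened, listing the Registry's subjects is enough (a sketch; the URL is a placeholder):
import requests

# every auto-registered subject, e.g. "MY_AVRO_STREAM-value", shows up here
print(requests.get("http://localhost:8081/subjects").json())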

Why would an Apache Kafka consumer use a different version of the schema to deserialize a record than the one sent along with the data?

Let us assume I am using Avro serialization while sending data to Kafka.
While consuming a record from Apache Kafka, I get both the schema and the record, and I can use the schema to parse the record. I don't understand the scenario in which a consumer would use a different version of the schema to deserialize the record. Can someone help?
The message is serialized with the unique id of a specific version of the schema when it is produced onto Kafka. The consumer then uses that unique schema id to deserialize it.
Taken from https://docs.confluent.io/current/schema-registry/avro.html#schema-evolution
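For illustration, this is roughly the lookup the deserializer performs with that id (a sketch; the Registry URL and the id value are placeholders, and the real deserializer caches the result):
import requests

REGISTRY = "http://localhost:8081"    # assumed Schema Registry URL
schema_id = 42                        # id read from the record's wire-format header

# fetch the exact writer schema version the producer registered
writer_schema = requests.get(f"{REGISTRY}/schemas/ids/{schema_id}").json()["schema"]
print(writer_schema)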

How can I get the Avro schema object from the received message in Kafka?

I am trying to publish/consume my Java objects to Kafka, using an Avro schema.
My basic program works fine. In my program I use my schema in the producer (for encoding) and the consumer (for decoding).
If I publish different objects to different topics (e.g. 100 topics), then at the receiver I do not know what type of message I have received. I would like to get the Avro schema from the received bytes and use that for decoding.
Is my understanding correct? If so, how can I retrieve the schema from the received bytes?
You won't receive the Avro schema in the received bytes -- and you don't really want to. The whole idea with Avro is to separate the schema from the record, so that it is a much more compact format. The way I do it, I have a topic called Schema. The first thing a Kafka consumer process does is to listen to this topic from the beginning and to parse all of the schemas.
Avro schemas are just JSON string objects -- you can just store one schema per record in the Schema topic.
As to figuring out which schema goes with which topic, as I said in a previous answer, you want one schema per topic, no more. So when you parse a message from a specific topic you know exactly what schema applies, because there can be only one.
If you never re-use the schema, you can just name the schema the same as the topic. However, in practice you probably will use the same schema on multiple topics. In which case, you want to have a separate topic that maps Schemas to Topics. You could create an Avro schema like this:
{"name":"SchemaMapping", "type":"record", "fields":[
{"name":"schemaName", "type":"string"},
{"name":"topicName", "type":"string"}
]}
You would publish a single record per topic with your Avro-encoded mapping into a special topic, for example called SchemaMapping. After consuming the Schema topic from the beginning, a consumer would then listen to SchemaMapping, and after that it would know exactly which schema to apply for each topic.
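A rough sketch of that startup step with the confluent-kafka and fastavro Python libraries, assuming a single-partition Schema topic whose records each hold one schema as a JSON string (the broker address and group id are placeholders):
import json
from confluent_kafka import Consumer, TopicPartition
from fastavro import parse_schema

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",   # assumed broker address
    "group.id": "schema-loader",
    "auto.offset.reset": "earliest",
})
consumer.assign([TopicPartition("Schema", 0, 0)])   # read the Schema topic from offset 0

schemas_by_name = {}
while True:
    msg = consumer.poll(timeout=2.0)
    if msg is None:                                   # caught up for now
        break
    schema_dict = json.loads(msg.value())             # one schema per record, as a JSON string
    schemas_by_name[schema_dict["name"]] = parse_schema(schema_dict)

consumer.close()
print(sorted(schemas_by_name))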