How does a Kafka Connect connector know which schema to use? - apache-kafka

Let's say I have a bunch of different topics, each with their own JSON schema. In Schema Registry, I indicated which schemas exist within the different topics, without directly referring to which topic a schema applies to. Then, in my sink connector, I only refer to the endpoint (URL) of the schema registry. So, to my knowledge, I never indicated which registered schema a Kafka connector (e.g., the JDBC sink) should use in order to deserialize a message from a certain topic?
Asking here as I can't seem to find anything online.
I am trying to decrease my Kafka message size by removing the overhead of specifying the schema in each message and using the schema registry instead. However, I cannot seem to understand how this could work.

Your producer serializes the schema ID directly into the bytes of each record. Connect (or consumers with the corresponding deserializer) reads that ID from the record and fetches the matching schema from the registry.
https://docs.confluent.io/platform/current/schema-registry/serdes-develop/index.html#wire-format
If you're trying to decrease message size, don't use JSON; use a binary format instead, and enable compression on the topic, such as zstd.
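To make that wire format concrete, here is a minimal Java sketch (not specific to any connector) that pulls the schema ID out of a serialized value; the registry-aware deserializer inside Connect does essentially this, then fetches the schema for that ID from the registry, which is how the connector "knows" which schema to use:

import java.nio.ByteBuffer;

public class WireFormatPeek {
    // Confluent wire format: 1 magic byte (0), then a 4-byte schema ID, then the payload.
    public static int schemaIdOf(byte[] serializedValue) {
        ByteBuffer buf = ByteBuffer.wrap(serializedValue);
        if (buf.get() != 0) {                      // magic byte, currently always 0
            throw new IllegalArgumentException("Not in Confluent wire format");
        }
        return buf.getInt();                       // big-endian 4-byte schema ID
    }
}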

Related

Is it possible to consume data using the JDBC connector with Schema Registry, if a Java producer is producing data without a schema?

My requirement is that when a producer is producing data without a schema, I need to register a new schema in the Schema Registry in order to consume the data with the JDBC connector.
I have found this, but is there any other solution?
Schema Registry is not a requirement for using the JDBC connector, but the JDBC Sink connector does require a schema in the record payload, as the linked answer says.
The source connector can read data and generate records without a schema, but this has no interaction with any external producer client.
If you have producers that generate records without any schema, then it's unclear what schema you would be registering anywhere. But you can try to use a ProducerInterceptor to intercept and inspect those records to do whatever you need to.
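If you do go the interceptor route, a bare-bones sketch might look like the following (class name and the logging are just placeholders); you would then register it on the producer via the interceptor.classes config:

import org.apache.kafka.clients.producer.ProducerInterceptor;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

import java.util.Map;

public class SchemaInspectionInterceptor implements ProducerInterceptor<String, String> {

    @Override
    public ProducerRecord<String, String> onSend(ProducerRecord<String, String> record) {
        // Inspect (or rewrite) the record here, before it is serialized and sent.
        System.out.printf("Sending to %s: %s%n", record.topic(), record.value());
        return record;
    }

    @Override
    public void onAcknowledgement(RecordMetadata metadata, Exception exception) {
        // Called when the broker acknowledges the record, or when the send fails.
    }

    @Override
    public void close() { }

    @Override
    public void configure(Map<String, ?> configs) { }
}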

ksqlDB can't get data from Schema Registry

Case: I have a topic in Kafka named some_table_prefix.table_name. Data is serialized with Avro, but for historical reasons I have a record in the Schema Registry named table_name-value.
When I'm trying to set up a ksqlDB stream
CREATE STREAM some_stream_name
WITH (KAFKA_TOPIC='some_table_prefix.table_name', VALUE_FORMAT='AVRO');
I'm getting the error: Schema for message values on topic some_table_prefix.table_name does not exist in the Schema Registry. Subject: some_table_prefix.table_name-value.
I have Schema Registry integrated correctly; for other topics everything works OK.
So, is it possible to specify the Schema Registry record name when creating the ksqlDB stream, or to resolve this issue some other way?
If you have a topic named table_name that has Avro being produced to it (which would automatically create table_name-value in the Registry), then that's what ksqlDB should consume from. If you manually created that subject by posting the schema on your own, without matching the topic name, then that's part of the problem.
As the error says, ksqlDB looks for a specific subject in the Registry based on the topic you've provided. To my knowledge, it's not possible to use another subject name, so the workaround is to POST the old subject's schemas into the new one.
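As a rough sketch of that workaround (the registry URL is a placeholder, and this assumes the standard Schema Registry REST endpoints), you could copy the latest schema from the old subject into the subject ksqlDB expects:

import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ObjectNode;

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CopySubjectSchema {
    public static void main(String[] args) throws Exception {
        String registry = "http://localhost:8081";                 // assumed registry URL
        String oldSubject = "table_name-value";
        String newSubject = "some_table_prefix.table_name-value";

        HttpClient http = HttpClient.newHttpClient();
        ObjectMapper mapper = new ObjectMapper();

        // Fetch the latest version registered under the legacy subject.
        HttpResponse<String> latest = http.send(
            HttpRequest.newBuilder(URI.create(registry + "/subjects/" + oldSubject + "/versions/latest"))
                       .GET().build(),
            HttpResponse.BodyHandlers.ofString());
        String schema = mapper.readTree(latest.body()).get("schema").asText();

        // Re-register the same schema under the subject ksqlDB looks up.
        ObjectNode body = mapper.createObjectNode().put("schema", schema);
        HttpResponse<String> posted = http.send(
            HttpRequest.newBuilder(URI.create(registry + "/subjects/" + newSubject + "/versions"))
                       .header("Content-Type", "application/vnd.schemaregistry.v1+json")
                       .POST(HttpRequest.BodyPublishers.ofString(mapper.writeValueAsString(body)))
                       .build(),
            HttpResponse.BodyHandlers.ofString());
        System.out.println("Registered: " + posted.body());        // e.g. {"id": 42}
    }
}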

How to monitor 'bad' messages written to kafka topic with no schema

I use Kafka Connect to take data from RabbitMQ into a Kafka topic. The data comes without a schema, so in order to associate a schema I use a ksqlDB stream. On top of the stream I create a new topic that now has a defined schema. At the end I take the data to a BQ database. My question is: how do I monitor messages that have not passed the stream stage? Also, does this approach support schema evolution, and if not, how can I use the Schema Registry functionality?
Thanks
use Kafka Connect to take data ... data comes without schema
I'm not familiar with the RabbitMQ connector specifically, but if you use the Confluent converter classes that do use schemas, then the records would have one, although maybe only a string or bytes schema.
If ksqlDB is consuming the non-schema topic, then there's a consumer group associated with that process. You can monitor its lag to know how many messages have not yet been processed by ksqlDB. If ksqlDB is unable to parse a message because it's "bad", then I assume it's either skipped or the stream stops consuming completely; this is likely configurable.
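For example, a small lag check using the Java AdminClient might look like the sketch below; the bootstrap server is assumed and the ksqlDB consumer group ID is a placeholder you would look up for your particular query:

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;

public class KsqlLagCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // assumed broker

        try (AdminClient admin = AdminClient.create(props)) {
            String groupId = "some-ksql-query-consumer-group";                     // placeholder group ID

            // Offsets the ksqlDB query has committed so far.
            Map<TopicPartition, OffsetAndMetadata> committed =
                admin.listConsumerGroupOffsets(groupId).partitionsToOffsetAndMetadata().get();

            // Latest offsets of the same partitions, to compute the lag.
            Map<TopicPartition, OffsetSpec> latestSpec = committed.keySet().stream()
                .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest()));
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> endOffsets =
                admin.listOffsets(latestSpec).all().get();

            committed.forEach((tp, offset) -> {
                long lag = endOffsets.get(tp).offset() - offset.offset();
                System.out.printf("%s lag=%d%n", tp, lag);
            });
        }
    }
}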
If you've set the output topic format to Avro, for example, then the schema will automatically be registered to the Registry. There will be no evolution until you modify the fields of the stream

Kafka multiple Topics into the same avro file

I am new to the world of the Kafka protocol and I would like to ask you for some important information related to my project.
I am using an Avro file for producing and consuming messages. I want to know if I can use the same Avro file for multiple topics, maybe, for example, by using a different "name" attribute in the producer and a specific "name" attribute in the consumer.
Thanks a lot.
Stefano
You can use one file to send data to multiple topics, yes, although I'm not sure why one would do that
I would be cautious about merging multiple topics into one Avro file because the schema must match in every topic for that file
I would suggest that you use the Confluent Schema Registry, for example, rather than sending self-describing Avro events: if you are not using a registry, then you're likely sending the Avro schema as part of every message, which will reduce the possible throughput of your topic. With the registry's default naming strategy, the subject of the Avro schema will correspond to the topic name.
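As a rough illustration (the topic names, schema, broker, and registry URL are all made up), the same parsed schema can back records sent to two different topics; with the default subject naming strategy, each topic gets its own <topic>-value subject holding that schema:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class OneSchemaTwoTopics {
    public static void main(String[] args) {
        // One schema, loaded once, reused for every topic we produce to.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Order\",\"fields\":[{\"name\":\"id\",\"type\":\"string\"}]}");

        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");                              // assumed broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081");                     // assumed registry

        GenericRecord order = new GenericData.Record(schema);
        order.put("id", "o-1");

        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("orders", "o-1", order));               // hypothetical topic
            producer.send(new ProducerRecord<>("orders_audit", "o-1", order));         // hypothetical topic
        }
    }
}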

How does Avro for Kafka work with Schema registry?

I am working on Kafka and as a beginner the following question popped out of my mind.
Every time we design an Avro schema, we generate the Java class from it using the Avro jars.
Now we use that object to populate data and send it from Producer.
For consuming the message we generate the class again on the consumer side. Now, the classes generated in both places (producer & consumer) contain a field
"public static final org.apache.avro.Schema SCHEMA$", which actually stores the schema as a String.
If that is the case, then why should Kafka use a schema registry at all? The schema is already available as part of the Avro objects.
Hope my question is clear. If someone can answer me, It would be of great help.
Schema Registry is the repository which stores the schemas of all the records sent to Kafka. When a Kafka producer sends records using KafkaAvroSerializer, the schema of each record is extracted and stored in the Schema Registry, and the actual record in Kafka only contains the schema ID.
The consumer, when deserializing a record, reads the schema ID and uses it to fetch the actual schema from the Schema Registry. The record is then deserialized using the fetched schema.
So, in short, Kafka does not keep a copy of the schema in every record; instead, the schema is stored in the Schema Registry and referenced via its schema ID.
This saves space when storing records, and also helps detect schema compatibility issues between the various clients.
https://docs.confluent.io/current/schema-registry/docs/serializer-formatter.html
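To illustrate the point above, here is a rough consumer-side sketch (broker, registry URL, group ID, and topic are placeholders): nothing in it depends on the generated class or its SCHEMA$ field, because the deserializer resolves the writer's schema from the registry using the ID embedded in each record:

import io.confluent.kafka.serializers.KafkaAvroDeserializer;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class RegistryConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");           // assumed broker
        props.put("group.id", "demo-group");                        // placeholder group
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", KafkaAvroDeserializer.class.getName());
        props.put("schema.registry.url", "http://localhost:8081");  // assumed registry

        try (KafkaConsumer<String, GenericRecord> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("orders"));            // hypothetical topic
            ConsumerRecords<String, GenericRecord> records = consumer.poll(Duration.ofSeconds(1));
            records.forEach(r -> System.out.println(r.value()));                // deserialized with the fetched schema
        }
    }
}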
Schema Registry is a central repository for all the schemas and helps enforce schema compatibility rules when registering new schemas, without which schema evolution would be difficult.
Based on the configured compatibility (backward, forward, full, etc.), the Schema Registry will reject any new schema that does not conform to the configured compatibility.
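For example (registry URL and subject are placeholders), the compatibility level for a single subject can be set through the Schema Registry REST API; after that, registrations that break the chosen compatibility are rejected with an HTTP 409:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SetCompatibility {
    public static void main(String[] args) throws Exception {
        String registry = "http://localhost:8081";     // assumed registry URL
        String subject = "orders-value";               // hypothetical subject

        // PUT /config/{subject} sets the compatibility level for that subject only.
        HttpRequest request = HttpRequest.newBuilder(URI.create(registry + "/config/" + subject))
            .header("Content-Type", "application/vnd.schemaregistry.v1+json")
            .PUT(HttpRequest.BodyPublishers.ofString("{\"compatibility\": \"BACKWARD\"}"))
            .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());           // e.g. {"compatibility":"BACKWARD"}
    }
}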