For AvroProducer to Kafka, where are avro schema for "key" and "value"? - apache-kafka

From the AvroProducer example in the confluent-kafka-python repo, it appears that the key/value schema are loaded from files. That is, from this code:
from confluent_kafka import avro
from confluent_kafka.avro import AvroProducer
value_schema = avro.load('ValueSchema.avsc')
key_schema = avro.load('KeySchema.avsc')
value = {"name": "Value"}
key = {"name": "Key"}
avroProducer = AvroProducer({
    'bootstrap.servers': 'mybroker,mybroker2',
    'schema.registry.url': 'http://schema_registry_host:port'
}, default_key_schema=key_schema, default_value_schema=value_schema)
avroProducer.produce(topic='my_topic', value=value, key=key)
it appears that the files ValueSchema.avsc and KeySchema.avsc are loaded independently of the Avro Schema Registry.
Is this right? What's the point of referencing the URL of the Avro Schema Registry, but then loading the schemas for the key and value from disk?
Please clarify.

I ran into the same issue; it was initially unclear what the point of the local files is. As mentioned in the other answers, for the first write to an Avro topic, or for an update to the topic's schema, you need the schema string - you can see this in the Kafka REST documentation here.
Once you have the schema in the registry, you can read it with REST (I used the requests Python module in this case) and use the avro.loads() method to get it. I found this useful because the produce() function requires that you have a value schema for your AvroProducer, and this code will work without that local file being present:
import requests
from confluent_kafka import avro
from confluent_kafka.avro import AvroProducer

# Fetch the latest registered value schema from the Schema Registry
get_schema_req_data = requests.get("http://1.2.3.4:8081/subjects/sample_value_schema/versions/latest")
get_schema_req_data.raise_for_status()
schema_string = get_schema_req_data.json()['schema']
value_schema = avro.loads(schema_string)

avroProducer = AvroProducer({
    'bootstrap.servers': '1.2.3.4:9092',
    'schema.registry.url': 'http://1.2.3.4:8081'
}, default_value_schema=value_schema)
avroProducer.produce(topic='my_topic', value={"data": "that matches your schema"})
Hope this helps.

That is just one way to create a key and value schema in the Schema Registry in the first place. You can create it in the SR first using the SR REST API or you can create new schemas or new versions of existing schemas in the SR by publishing them with new messages. It's entirely your choice which method is preferred.
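To illustrate the REST route, registering a schema directly with the Schema Registry can be sketched like this. The registry URL and subject name are assumptions; the payload shape is the documented `{"schema": ...}` envelope:

```python
import json

# Assumed registry location and subject name - adjust to your setup.
SCHEMA_REGISTRY_URL = "http://localhost:8081"
SUBJECT = "my_topic-value"

# The Avro schema itself, serialized to a JSON string.
value_schema_str = json.dumps({
    "type": "record",
    "name": "Value",
    "fields": [{"name": "name", "type": "string"}],
})

# The REST API expects the schema wrapped in a {"schema": "..."} envelope,
# i.e. the schema is a JSON-escaped string inside another JSON document.
payload = json.dumps({"schema": value_schema_str})
url = f"{SCHEMA_REGISTRY_URL}/subjects/{SUBJECT}/versions"

# With a running registry, this POST registers the schema (or returns the
# existing ID if it is already registered):
# import requests
# resp = requests.post(
#     url, data=payload,
#     headers={"Content-Type": "application/vnd.schemaregistry.v1+json"})
```

From then on, new messages published with an evolved schema create new versions under the same subject, as the answer above describes.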

Take a look at the code and consider that the schema from the registry is needed by a consumer rather than a producer. MessageSerializer registers the schema in the Schema Registry for you :)

Related

How to fix the deserialization error when merging 2 kstreams topics using leftJoin?

I am new to Kafka. I am working on a personal project where I want to write to 2 different Avro topics and merge them using leftJoin. Once I merge them, I want to produce the same messages to a KSQL DB as well. (I haven't implemented that part yet).
I am using KafkaTemplate to produce to the 2 Avro topics and converting them into KStreams to merge them. I am also using a KafkaListener to print any messages in them, and that part is working. Here's where I am having issues, 2 of them actually. In either case, it doesn't produce any messages to the merged topic.
If I remove the Consumed.with() from the KStream, it throws a default key Serde error.
But if I keep it, then it throws a deserialization error.
I have even provided the default serialization and deserialization in both my application.properties and in the streamConfig inside main() but it's still not working.
Can somebody please help me with how to merge the 2 Avro topics? Is the error occurring because I am using an Avro schema? Should I use JSON instead? I want to use a schema because the value part of my message will have multiple values in it.
For eg: {Key : Value} = {company : {inventory_id, company, color, inventory}} = {Toyota : {0, RAV4, 50,000}}
Here's a link to all the files: application.properties, DefaultKeySerdeError.txt, DeserializationError.txt, FilterStreams.java, Inventory.avsc, Pricing.avsc, and MergedAvro.avsc. Let me know if y'all want me to put them below. Thank you very much for your help in advance!
https://gist.github.com/Arjun13/b76f53c9c2b4e88225ef71a18eb08e2f
Looking at the DeserializationError.txt file, it looks like the problem is you haven't provided the credentials for schema registry. Even though you have provided them in the application.properties file, they're not getting into the serdes configuration, so if you add the basic.auth.user.info configs to the serdeConfig map you should be all set.

Glue avro schema registry with flink and kafka for any object

I am trying to register and serialize an object with Flink, Kafka, Glue and Avro. I've seen this method, which I'm trying:
Schema schema = parser.parse(new File("path/to/avro/file"));
GlueSchemaRegistryAvroSerializationSchema<GenericRecord> test =
    GlueSchemaRegistryAvroSerializationSchema.forGeneric(schema, topic, configs);
FlinkKafkaProducer<GenericRecord> producer = new FlinkKafkaProducer<GenericRecord>(
        kafkaTopic,
        test,
        properties);
My problem is that this approach doesn't allow me to include an object other than a GenericRecord. The object that I want to send is of a different type and is very big - so big that it is too complex to transform into a GenericRecord.
I can't find much documentation. How can I send an object other than a GenericRecord, or is there any way to include my object inside a GenericRecord?
I'm not sure if I understand correctly, but GlueSchemaRegistryAvroSerializationSchema has another method called forSpecific that accepts a SpecificRecord. So you can use an Avro code-generation plugin for your build tool, depending on what you use (e.g. for sbt, here), to generate classes from your Avro schema that can then be passed to the forSpecific method.

How can Confluent SchemaRegistry help ensuring the read (projection) Avro schema evolution?

SchemaRegistry helps with sharing the write Avro schema, which is used to encode a message, with the consumers that need the write schema to decode the received message.
Another important feature is assisting the schema evolution.
Let's say a producer P defines a write Avro schema v1 that is stored under the logical schema S, a consumer C1 defines a read (projection) schema v1,
and another consumer C2 defines its own read (projection) schema. The read schemas are not shared, as they are used locally by Avro to translate messages from the writer schema into the reader schema.
Imagine the schema evolution without any breaking changes:
The consumer C1 requests a new property by adding a new optional field to its schema. This is a backward-compatible change.
Messages encoded without this field will still be translated into the read schema.
Now we've got v2 of the C1's read schema.
The producer P satisfies the consumer C1's need by adding the new field to its schema. The field doesn't have to be required, as this is a forward-compatible change.
The consumer C1 will access the data encoded in the newly added field. The consumer C2 will simply ignore it, as it is a tolerant reader.
Now we've got v2 of the P's write schema.
Consumers need to know the exact schema with which the messages were written, so the new version is stored under the logical schema S.
Now imagine some schema breaking changes:
The producer P decides to delete a non-optional field that one of the consumers might be using. This is not a forward-compatible change.
Assuming the subject S is configured with FORWARD_TRANSITIVE compatibility type, the attempt to store the new write schema will fail. We are safe.
The consumer C2 requests a new property by adding a new field to its schema. Since the field is not written by the producer, this is not a backward-compatible change.
The question is how can the SchemaRegistry come in handy to prevent any breaking changes on the consumer side?
Note that the compatibility check of the read schema has to be done against all versions of the write schema.
There is an endpoint that allows checking the compatibility against the versions in the subject.
The issue is that it uses the compatibility type that is set on the subject.
The subject which contains versions of the write schema can not be used, because it is configured with FORWARD_TRANSITIVE compatibility type, but the read schema has to be backward compatible.
Creating another subject with the compatibility type BACKWARD_TRANSITIVE will not work, because a new version of the write schema with a forwards-compatible change (e.g. add a non-optional field) will fail to be stored in this subject.
One option that came to mind is to have some unit tests written using the CompatibilityChecker. It's an ugly solution because each consumer must hold locally all versions of the write schema.
It's going to be a pain to sync all the consumers when the producer's schema changes.
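For reference, the compatibility-check endpoint mentioned above takes the candidate schema in the same `{"schema": ...}` envelope as registration. A minimal sketch, where the registry URL, subject name, and sample read schema are all assumptions:

```python
import json

SCHEMA_REGISTRY_URL = "http://localhost:8081"  # assumed
SUBJECT = "S"  # the subject holding the write-schema versions

# A hypothetical read schema we want to test against the registered
# write schemas.
read_schema_str = json.dumps({
    "type": "record",
    "name": "LogEntry",
    "fields": [{"name": "line", "type": "string"}],
})

# Checking against one version ("latest") uses this endpoint. As noted
# above, the check is performed with the compatibility type configured
# on the subject, which is exactly the limitation being discussed.
url = f"{SCHEMA_REGISTRY_URL}/compatibility/subjects/{SUBJECT}/versions/latest"
payload = json.dumps({"schema": read_schema_str})

# With a running registry:
# import requests
# resp = requests.post(url, data=payload,
#     headers={"Content-Type": "application/vnd.schemaregistry.v1+json"})
# resp.json() contains an "is_compatible" boolean.
```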
Schema Registry lets us keep track of schemas that are currently in use, both by producers and consumers.
Creating another subject with the compatibility type BACKWARD_TRANSITIVE will not work, because a new version of the write schema with a forwards-compatible change (e.g. add a non-optional field) will fail to be stored in this subject.
You were very close. Indeed, adding a non-optional field to the write schema is forward-compatible, but not backward-compatible because you may have data already produced that don't have values for this field. But we don't apply the same changes both to the write and read schemas. This only works when the change is both forward and backward compatible (aka full compatibility), e.g., adding/removing optional fields. In our case, we'd have to add the new field as optional to the read schema.
You can push the write schema to this new subject initially, but from that point on it is a separate read schema, and it would have to evolve separately from the write schema.
You can apply whatever approach you're currently using for checking the write schema changes. For example, make each consumer push the schema it's about to use to a subject with a BACKWARD_TRANSITIVE compatibility type before being allowed to use it.
There's also Schema Registry Maven Plugin for use in a CI/CD environment.
An alternative would be to use a single subject with FULL_TRANSITIVE compatibility.
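The per-consumer-subject idea above can be sketched against the Schema Registry REST API like this. The subject name is a made-up convention; the endpoints are the documented `/config/{subject}` and `/subjects/{subject}/versions`:

```python
import json

SCHEMA_REGISTRY_URL = "http://localhost:8081"  # assumed
READ_SUBJECT = "S-read-C1"  # hypothetical: one subject per consumer's read schema

# 1. Configure the read-schema subject for backward-transitive checks.
config_url = f"{SCHEMA_REGISTRY_URL}/config/{READ_SUBJECT}"
config_payload = json.dumps({"compatibility": "BACKWARD_TRANSITIVE"})

# 2. Register each new read-schema version under that subject; the
#    registry rejects the POST if the new version breaks backward
#    compatibility with earlier read-schema versions.
versions_url = f"{SCHEMA_REGISTRY_URL}/subjects/{READ_SUBJECT}/versions"

# With a running registry:
# import requests
# headers = {"Content-Type": "application/vnd.schemaregistry.v1+json"}
# requests.put(config_url, data=config_payload, headers=headers)
# requests.post(versions_url,
#               data=json.dumps({"schema": "<your read schema as a string>"}),
#               headers=headers)
```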

Why use Subject in Confluent's Schema Registry?

I'm starting to work with the Confluent Schema Registry. I realized that there is only one Schema per Subject.
What is exactly the use of the entity Subject in the registry, isn't it just a schema, that would be used for a topic in Kafka for example.
You can't really place more than one Schema in a Subject right?
A subject is a name under which a schema is registered. In Schema Registry, how the subject name is derived depends on the strategy type.
When schemas evolve, they are still associated with the same subject but get a new schema ID and version.
The default strategy is TopicNameStrategy.
TopicNameStrategy - a good strategy when grouping messages by topic is required.
RecordNameStrategy and TopicRecordNameStrategy are used when grouping by topic name is not desired, e.g. when a topic carries multiple record types.
Yes, we can't put more than one schema in one subject, only new versions of the same schema.
More detailed information can be found at the following URL:
https://docs.confluent.io/current/schema-registry/serdes-develop/index.html#sr-schemas-subject-name-strategy
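As a quick illustration of how the strategies map messages to subjects (the helper functions below are hypothetical, but the naming patterns are the documented ones):

```python
def topic_name_strategy(topic: str) -> tuple:
    """Default TopicNameStrategy: one subject for the key schema and one
    for the value schema, both derived from the topic name."""
    return (f"{topic}-key", f"{topic}-value")

def record_name_strategy(record_fullname: str) -> str:
    """RecordNameStrategy: the subject is the record's fully-qualified
    name, regardless of topic - this lets one topic carry several
    record types, each evolving under its own subject."""
    return record_fullname

print(topic_name_strategy("my_topic"))       # ('my_topic-key', 'my_topic-value')
print(record_name_strategy("com.example.LogEntry"))  # com.example.LogEntry
```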

Confluent Platform: Schema Registry Subjects

Working with Confluent Platform, the platform offered by the creators of Apache Kafka, and I have a question:
In the documentation of the Schema Registry API Reference, they mention the abstraction of a "Subject". You register a schema under a "subject" which is of the form topicName-key, or topicName-value, yet there is no explanation as to why you need (as it implies) a separate schema for the key and value of messages on a given topic. Nor is there any direct statement to the effect that registration with a "subject" necessarily associates the schema with that topic, other than mnemonically.
Further confusing matters, the subsequent examples ("get schema version for subject" and "register new schema under subject") on that page do not use that format for the subject name, and instead use just a topic name for the "subject" value. If anyone has any insight into a) why there are these two "subjects" per topic, and b) what the proper usage is, it would be greatly appreciated.
Confluent Schema Registry is actually a bit inconsistent with subject names :)
Indeed, the KafkaAvroSerializer (used for new Kafka 0.8.2 producer) uses topic-key|value pattern for subjects (link) whereas KafkaAvroEncoder (for old producer) uses schema.getName()-value pattern (link).
The reason why one would have 2 different subjects per topic (one for key, one for value) is pretty simple:
say I have an Avro schema representing a log entry, and each log entry has source information attached to it:
{
  "type": "record",
  "name": "LogEntry",
  "fields": [
    {
      "name": "line",
      "type": "string"
    },
    {
      "name": "source",
      "type": {
        "type": "record",
        "name": "SourceInfo",
        "fields": [
          {
            "name": "host",
            "type": "string"
          },
          {
            "name": "...",
            "type": "string"
          }
        ]
      }
    }
  ]
}
A common use case would be that I want to partition entries by source, and thus would like to have two subjects associated with the topic (subjects basically hold revisions of Avro schemas) - one for the key (which is SourceInfo) and one for the value (LogEntry).
Having these two subjects would allow partitioning and storing the data as long as I have a schema registry running and my producers/consumers can talk to it. Any modifications to these schemas would be reflected in the schema registry and as long as they satisfy compatibility settings everything should just serialize/deserialize without you having to care about this.
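Tying this back to the AvroProducer code at the top of the page, producing with both a key and a value schema would populate the two subjects. This is a sketch; the topic name, broker address, and field choices are assumptions, with only `host` kept from the SourceInfo record above:

```python
import json

# Key schema: just the SourceInfo record.
key_schema_str = json.dumps({
    "type": "record",
    "name": "SourceInfo",
    "fields": [{"name": "host", "type": "string"}],
})

# Value schema: the LogEntry record embedding SourceInfo as a nested record.
value_schema_str = json.dumps({
    "type": "record",
    "name": "LogEntry",
    "fields": [
        {"name": "line", "type": "string"},
        {"name": "source", "type": json.loads(key_schema_str)},
    ],
})

# With confluent-kafka installed, the first produce() registers the key
# schema under "logs-key" and the value schema under "logs-value":
# from confluent_kafka import avro
# from confluent_kafka.avro import AvroProducer
# producer = AvroProducer(
#     {"bootstrap.servers": "localhost:9092",
#      "schema.registry.url": "http://localhost:8081"},
#     default_key_schema=avro.loads(key_schema_str),
#     default_value_schema=avro.loads(value_schema_str))
# producer.produce(topic="logs",
#                  key={"host": "web-1"},
#                  value={"line": "GET /", "source": {"host": "web-1"}})
```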
Note: any further information is just my personal thoughts and maybe I just don't yet fully understand how this is supposed to work so I might be wrong.
I actually like how KafkaAvroEncoder is implemented more than KafkaAvroSerializer. KafkaAvroEncoder does not in any way force you to use ONE schema per topic key/value, whereas KafkaAvroSerializer does. This might be an issue when you plan to produce data for multiple Avro schemas into one topic. In that case KafkaAvroSerializer would try to update the topic-key and topic-value subjects, and 99% of the time this would break because compatibility is violated (if you have multiple Avro schemas, they are almost always different from and incompatible with each other).
On the other hand, KafkaAvroEncoder cares only about schema names, so you may safely produce data for multiple Avro schemas into one topic and everything should work just fine (you will have as many subjects as schemas).
This inconsistency is still unclear to me and I hope Confluent guys can explain this if they see this question/answer.
Hope that helps you