How to fix the deserialization error when merging 2 kstreams topics using leftJoin? - apache-kafka

I am new to Kafka. I am working on a personal project where I want to write to 2 different Avro topics and merge them using leftJoin. Once I merge them, I want to produce the same messages to a KSQL DB as well. (I haven't implemented that part yet).
I am using KafkaTemplate to produce to the 2 Avro topics and convert them into KStreams to merge them. I am also using a KafkaListener to print any messages in them, and that part is working. Here's where I am having issues, 2 of them actually: in either case, no messages are produced to the merged topic.
If I remove the Consumed.with() from the KStream, it throws a default key Serde error.
But if I keep it, it throws a deserialization error.
I have even provided the default serialization and deserialization in both my application.properties and in the streamConfig inside main(), but it's still not working.
Can somebody please help me with how to merge the 2 Avro topics? Is the error occurring because I am using an Avro schema? Should I use JSON instead? I want to use a schema because the value part of my message will have multiple values in it.
For example: {Key : Value} = {company : {inventory_id, company, color, inventory}} = {Toyota : {0, RAV4, 50,000}}
Here's a link to all the files: application.properties, DefaultKeySerdeError.txt, DeserializationError.txt, FilterStreams.java, Inventory.avsc, Pricing.avsc, and MergedAvro.avsc. Let me know if y'all want me to put them below. Thank you very much in advance for your help!
https://gist.github.com/Arjun13/b76f53c9c2b4e88225ef71a18eb08e2f

Looking at the DeserializationError.txt file, it looks like the problem is that you haven't provided the credentials for the Schema Registry. Even though you have provided them in application.properties, they're not getting into the serde configuration, so if you add the basic.auth.user.info configs to the serdeConfig map you should be all set.
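For reference, a minimal sketch of wiring those credentials into the serdes rather than only into application.properties. The topic names are placeholders, and Inventory/Pricing are assumed to be the classes generated from your .avsc files; adjust to your actual FilterStreams code.

import java.util.HashMap;
import java.util.Map;
import io.confluent.kafka.streams.serdes.avro.SpecificAvroSerde;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;

public class FilterStreamsSketch {
    public static void main(String[] args) {
        // Schema Registry connection details, including the basic auth credentials
        // that were missing from the serde configuration.
        Map<String, String> serdeConfig = new HashMap<>();
        serdeConfig.put("schema.registry.url", "https://<your-schema-registry-endpoint>");
        serdeConfig.put("basic.auth.credentials.source", "USER_INFO");
        serdeConfig.put("basic.auth.user.info", "<sr-api-key>:<sr-api-secret>");

        // One serde per Avro value type; "false" configures them as value serdes, not key serdes.
        SpecificAvroSerde<Inventory> inventorySerde = new SpecificAvroSerde<>();
        inventorySerde.configure(serdeConfig, false);
        SpecificAvroSerde<Pricing> pricingSerde = new SpecificAvroSerde<>();
        pricingSerde.configure(serdeConfig, false);

        // Pass the configured serdes explicitly so the topology does not fall back
        // to the default Serde (which is what triggers the default key Serde error).
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, Inventory> inventoryStream =
                builder.stream("inventory", Consumed.with(Serdes.String(), inventorySerde));
        KStream<String, Pricing> pricingStream =
                builder.stream("pricing", Consumed.with(Serdes.String(), pricingSerde));
    }
}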

Related

How to rename the id header of a debezium mongodb connector outbox message

I am trying to use the outbox event router of Debezium for MongoDB. The consumer is a Spring Cloud Stream application. I cannot deserialize the message because Spring Cloud Stream expects the message id header to be a UUID, but it receives a byte[]. I have tried different deserializers to no avail. I am thinking of renaming the id header in order to skip this check, or removing it altogether. I have tried the ReplaceField SMT, but it does not seem to modify the header fields.
Also, is there a way to overcome this in Spring?
The solution to the initial question is to use the DropHeaders SMT (https://docs.confluent.io/platform/current/connect/transforms/dropheaders.html).
This will remove the id header that is populated by Debezium.
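Assuming the header is literally named id and the connector is configured via a properties file, the SMT configuration would look roughly like this:

transforms=dropDebeziumId
transforms.dropDebeziumId.type=org.apache.kafka.connect.transforms.DropHeaders
transforms.dropDebeziumId.headers=id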
But as Oleg Zhurakousky mentioned, moving to a newer version of spring-cloud-stream without @StreamListener solves the underlying problem.
Apparently @StreamListener checks whether a message has an id header and demands that it be of type UUID. With the new functional way of working with spring-cloud-stream, the id header is actually overwritten with a newly generated value, which means that the value populated by Debezium (the id column from the outbox table) is ignored. I guess if you need to check for duplicate delivery, it may be better to create your own header instead of using id. I do not know whether spring-cloud-stream generates the same id for the same message if it is redelivered.
Also keep in mind that even in the newer versions of spring-cloud-stream, if you use the deprecated @StreamListener, you will have the same problem.
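For illustration, a rough sketch of the functional style mentioned above (the bean name, payload type, and binding name are placeholders, not taken from the original setup):

import java.util.function.Consumer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.messaging.Message;

@Configuration
public class OutboxEventConsumer {

    // Bind it with: spring.cloud.stream.bindings.outboxEvents-in-0.destination=<outbox-topic>
    @Bean
    public Consumer<Message<byte[]>> outboxEvents() {
        return message -> {
            // In the functional model the framework replaces the id header with its own
            // generated value, so the byte[] id written by Debezium no longer breaks consumption.
            System.out.println(message.getHeaders());
        };
    }
}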

Glue avro schema registry with flink and kafka for any object

I am trying to register and serialize an object with Flink, Kafka, Glue and Avro. I've seen this method, which I'm trying:
Schema schema = parser.parse(new File("path/to/avro/file"));
GlueSchemaRegistryAvroSerializationSchema<GenericRecord> test =
        GlueSchemaRegistryAvroSerializationSchema.forGeneric(schema, topic, configs);
FlinkKafkaProducer<GenericRecord> producer = new FlinkKafkaProducer<GenericRecord>(
        kafkaTopic,
        test,
        properties);
My problem is that this approach doesn't allow me to include anything other than a GenericRecord, and the object that I want to send is a different class and is very big, so big that it is too complex to transform into a GenericRecord.
I can't find much documentation. How can I send an object other than GenericRecord, or is there any way to include my object inside a GenericRecord?
I'm not sure if I understand correctly, but basically GlueSchemaRegistryAvroSerializationSchema has another method called forSpecific that accepts a SpecificRecord. So you can use an Avro code-generation plugin for your build tool, depending on what you use (e.g. for sbt, here), which will generate classes from your Avro schema that can then be passed to the forSpecific method.
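A rough sketch of what that could look like, reusing kafkaTopic, configs and properties from the snippet above. MyRecord stands in for a class generated from your Avro schema; double-check the forSpecific signature against the version of the Glue Schema Registry library you use.

// MyRecord is a placeholder for a generated class that implements
// org.apache.avro.specific.SpecificRecord.
GlueSchemaRegistryAvroSerializationSchema<MyRecord> serializationSchema =
        GlueSchemaRegistryAvroSerializationSchema.forSpecific(MyRecord.class, kafkaTopic, configs);

FlinkKafkaProducer<MyRecord> producer = new FlinkKafkaProducer<>(
        kafkaTopic,
        serializationSchema,
        properties);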

How to deserialize Avro schema and then abandon schema before write to ES Sink Connector using SMT

Use Case and Description
My use case is described more here, but the gist of the issue is:
I am making a custom SMT and want to make sure the Elasticsearch sink connector deserializes incoming records properly, but after that point I don't need any sort of schema at all. Each record has a dynamic set of fields, so I don't want to have any makeUpdatedSchema step (e.g., this code) at all. This both keeps the code simpler and, I would assume, improves performance, since I don't have to recreate the schema for each record.
What I tried
I tried doing something like the applySchemaless code shown here, even when the record has a schema, by returning something like this, with null for the schema:
return newRecord(record, null, updatedValue);
However, at runtime it errors out, saying I have an incompatible schema.
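Concretely, the shape of what I'm attempting in apply() is roughly this (a sketch, not my real code; the field handling is illustrative, and newRecord is the usual abstract helper from the built-in SMT base classes):

import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.connect.data.Field;
import org.apache.kafka.connect.data.Struct;
import org.apache.kafka.connect.transforms.util.Requirements;

@Override
public R apply(R record) {
    // The converter attached a schema, so the value arrives as a Struct.
    Struct value = Requirements.requireStruct(record.value(), "schema-dropping SMT");

    // Copy the Struct's fields into a plain Map so no updated schema has to be rebuilt.
    Map<String, Object> updatedValue = new HashMap<>();
    for (Field field : value.schema().fields()) {
        updatedValue.put(field.name(), value.get(field));
    }

    // Passing null for the schema is what leads to the incompatible-schema error at runtime.
    return newRecord(record, null, updatedValue);
}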
Key Question
I might be misunderstanding the role of the schema at this point in the process (is it needed at all once we're in the Elasticsearch sink connector?) or how it works; if so, that would be helpful to know as well. But is there some way to write a custom SMT like this?

For AvroProducer to Kafka, where are avro schema for "key" and "value"?

From the AvroProducer example in the confluent-kafka-python repo, it appears that the key and value schemas are loaded from files. That is, from this code:
from confluent_kafka import avro
from confluent_kafka.avro import AvroProducer
value_schema = avro.load('ValueSchema.avsc')
key_schema = avro.load('KeySchema.avsc')
value = {"name": "Value"}
key = {"name": "Key"}
avroProducer = AvroProducer({'bootstrap.servers': 'mybroker,mybroker2', 'schema.registry.url': 'http://schem_registry_host:port'}, default_key_schema=key_schema, default_value_schema=value_schema)
avroProducer.produce(topic='my_topic', value=value, key=key)
it appears that the files ValueSchema.avsc and KeySchema.avsc are loaded independently of the Avro Schema Registry.
Is this right? What's the point of referencing the URL of the Avro Schema Registry, but then loading the key and value schemas from disk?
Please clarify.
I ran into the same issue, where it was initially unclear what the point of the local files is. As mentioned in the other answers, for the first write to an Avro topic, or for an update to the topic's schema, you need the schema string; you can see this in the Kafka REST documentation here.
Once you have the schema in the registry, you can read it with REST (I used the requests Python module in this case) and use the avro.loads() method to get it. I found this useful because the produce() function requires that you have a value schema for your AvroProducer, and this code will work without that local file being present:
get_schema_req_data = requests.get("http://1.2.3.4:8081/subjects/sample_value_schema/versions/latest")
get_schema_req_data.raise_for_status()
schema_string = get_schema_req_data.json()['schema']
value_schema = avro.loads(schema_string)
avroProducer = AvroProducer({'bootstrap.servers': '1.2.3.4:9092', 'schema.registry.url': 'http://1.2.3.4:8081'}, default_value_schema=value_schema)
avroProducer.produce(topic='my_topic', value={"data" : "that matches your schema" })
Hope this helps.
That is just one way to get a key and value schema into the Schema Registry in the first place. You can create it in the SR first using the SR REST API, or you can create new schemas or new versions of existing schemas in the SR by publishing them with new messages. It's entirely your choice which method you prefer.
Take a look at the code and consider that the schema from the registry is needed by a consumer rather than a producer. The MessageSerializer registers the schema in the Schema Registry for you :)

How to find out Avro schema from binary data that comes in via Spark Streaming?

I set up a Spark Streaming pipeline that gets measurement data via Kafka. This data was serialized using Avro. The data can be of two types, EquidistantData and DiscreteData. I created these using an avdl file and the sbt-avrohugger plugin, using the variant that generates Scala case classes inheriting from SpecificRecord.
In my receiving application, I can get the two schemas by querying EquidistantData.SCHEMA$ and DiscreteData.SCHEMA$.
Now, my Kafka stream gives me RDDs whose value class is Array[Byte]. So far so good.
How can I find out from the byte array which schema was used when serializing it, i.e., whether to use EquidistantData.SCHEMA$ or DiscreteData.SCHEMA$?
I thought of sending appropriate info in the message key. Currently, I don't use the message key. Would this be a feasible way, or can I get the schema somehow from the serialized byte array I received?
Followup:
Another possibility would be to use separate topics for discrete and equidistant data. Would this be feasible?