Apache Kafka with centralized Avro schema

We use Apache Kafka 0.10 (not Confluent Kafka) and would like to set up an Avro schema with Kafka.
I have the following Avro schema:
{
"namespace": "Rule",
"type": "record",
"name": "RuleMessage",
"fields": [
{
"name": "station",
"type": "string"
},
{
"name": "model",
"type": "string"
}
]
}
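I serialize messages like this:
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.specific.SpecificDatumWriter;

// Serializes an Avro-generated EventMessage to raw Avro binary (no header or schema ID is written).
public byte[] serializeMessage(EventMessage eventMessage) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    DatumWriter<EventMessage> writer = new SpecificDatumWriter<>(EventMessage.getClassSchema());
    writer.write(eventMessage, encoder);
    encoder.flush(); // push buffered bytes into the stream before reading them back
    out.close();
    return out.toByteArray();
}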
This is working as expected.
However, I would like to set up an Avro schema at the topic level, so that the topic rejects any message that does not conform to the schema.
Is there any way to do this with Apache Kafka 0.10?
Thanks

You can use Confluent's Schema Registry (it's open source and Apache licensed) with Apache Kafka 0.10.0 to associate a schema with a topic. It ships with Avro serializers/deserializers that automatically validate the Avro schemas in exactly the way you requested.
Please note that there is no such thing as "Confluent Kafka"; it would be a trademark violation to have it. Confluent simply packages Apache Kafka in its distribution for convenience, but since the Schema Registry is on GitHub, you can use it without the Confluent packaging if you prefer.
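As a rough sketch of the client side, assuming Confluent's Avro serializer library is on the producer's classpath: the properties below are what a producer would typically be given so that record values go through the registry-aware Avro serializer. The broker address, the registry URL, and the RegistryProducerConfig class name are placeholders for illustration, not taken from the question. Note that with this setup the check happens in the serializer on the client, so it only covers producers that are configured this way.

```java
import java.util.Properties;

final class RegistryProducerConfig {

    // Builds producer properties that route record values through
    // Confluent's registry-aware Avro serializer.
    static Properties build(String bootstrapServers, String registryUrl) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", registryUrl);
        return props;
    }

    public static void main(String[] args) {
        // Placeholder addresses; replace with your own broker and registry.
        Properties props = build("localhost:9092", "http://localhost:8081");
        props.forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

The resulting Properties object would be passed to a KafkaProducer constructor in place of the hand-written serializeMessage call.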

Related

Kafka Connect - From JSON records to Avro files in HDFS

My current setup contains Kafka, HDFS, Kafka Connect, and a Schema Registry all in networked docker containers.
The Kafka topic contains simple JSON data without a Schema:
{
"repo_name": "ironbee/ironbee"
}
The Schema Registry contains a JSON Schema describing the data in the Kafka Topic:
{"$schema": "https://json-schema.org/draft/2019-09/schema",
"$id": "http://example.com/example.json",
"type": "object",
"title": "Root Schema",
"required": [
"repo_name"
],
"properties": {
"repo_name": {
"type": "string",
"default": "",
"title": "The repo_name Schema",
"examples": [
"ironbee/ironbee"
]
}
}}
What I am trying to achieve is a connector that reads JSON data from the topic and writes it to files in HDFS (Avro or Parquet). This is my connector configuration:
{
"name": "kafka to hdfs",
"connector.class": "io.confluent.connect.hdfs3.Hdfs3SinkConnector",
"topics": "repo",
"hdfs.url": "hdfs://namenode:9000",
"flush.size": 3,
"confluent.topic.bootstrap.servers": "kafka-1:19092,kafka-2:29092,kafka-3:39092",
"key.converter": "org.apache.kafka.connect.storage.StringConverter",
"key.converter.schemas.enable": "false",
"value.converter": "io.confluent.connect.json.JsonSchemaConverter",
"value.converter.schemas.enable": "false",
"value.converter.schema.registry.url": "http://schema-registry:8081"
}
If I read the raw JSON value via the StringConverter (no schema used) and dump it into Avro files, it works, but the result is
Key=null Value={my json} tuples,
so there is no usable structure at all.
When I try to use my schema via the JsonSchemaConverter I get the errors
“Converting byte[] to Kafka Connect data failed due to serialization error of topic”
“Unknown magic byte”
I think that there is something wrong with the configuration of my connection, but after a week of trying everything my google-skills have reached their limits.
All the code is available here: https://github.com/SDU-minions/7-Scalable-Systems-Project/tree/dev/Kafka
"raw JSON value via the StringConverter (no schema used)"
The schemas.enable property only exists on JsonConverter. Strings don't have schemas, and JsonSchemaConverter always has a schema, so the property doesn't exist there either.
"When I try to use my schema via the JsonSchemaConverter I get the errors"
Your producer needs to use Confluent's JSON Schema serializer (KafkaJsonSchemaSerializer). Otherwise, the records are not written to Kafka with the "magic byte" that your error refers to.
I personally haven't tried converting JSON schema records to Avro directly in Connect. Usually the pattern is to either produce Avro directly, or convert within ksqlDB, for example to a new Avro topic, which is then consumed by Connect.
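To make the "magic byte" concrete: Confluent's registry-aware serializers prepend a small header to every record value, a single zero byte followed by the 4-byte big-endian ID of the schema in the registry, and the registry-aware converters expect to find it. The sketch below only illustrates that framing with plain java.nio; it is not the serializer itself, and the class and method names are made up for this example. A plain JSON value starts with { (0x7B) rather than 0x00, which is why the converter rejects it.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class ConfluentFraming {

    static final byte MAGIC_BYTE = 0x0;

    // Prepends the header: magic byte 0x00 + 4-byte big-endian schema ID.
    static byte[] frame(int schemaId, byte[] payload) {
        return ByteBuffer.allocate(1 + 4 + payload.length)
                .put(MAGIC_BYTE)
                .putInt(schemaId)   // ByteBuffer is big-endian by default
                .put(payload)
                .array();
    }

    // True if the value is long enough for a header and starts with the magic byte.
    static boolean looksFramed(byte[] value) {
        return value.length >= 5 && value[0] == MAGIC_BYTE;
    }

    public static void main(String[] args) {
        byte[] rawJson = "{\"repo_name\": \"ironbee/ironbee\"}".getBytes(StandardCharsets.UTF_8);
        System.out.println(looksFramed(rawJson));            // false: raw JSON from a plain producer
        System.out.println(looksFramed(frame(1, rawJson)));  // true: schema ID 1 is an arbitrary example
    }
}
```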

Exception in Flink Streaming to Kafka Avro Sink java.lang.IllegalAccessException: Class org.apache.avro.specific.SpecificData

I'm using Flink streaming to read events from a Kafka source topic and, after de-duplication, write them to a separate Kafka topic in Avro format.
Flow:
Kafka topic (JSON format) -> Flink streaming (de-duplication) -> Scala case class objects -> Kafka topic (Avro format)
val sink = sinkProvider.getKafkaSink(brokerURL, targetTopic, kafkaTransactionMaxTimeoutMs, kafkaTransactionTimeoutMs)
messageStream
  .map { record => convertJsonToExample(record) }
  .sinkTo(sink)
  .name("Example Kafka Avro Sink")
  .uid("Example-Kafka-Avro-Sink")
Here are the steps I followed:
I created avro schema for my output schema
{
"type":"record",
"name":"Example",
"namespace":"ca.ix.dcn.test",
"fields":[
{
"name":"x",
"type":"string"
},
{
"name":"y",
"type":"long"
}
]
}
From the Avro schema I generated a case class using avro-hugger (version 1.2.1) for SpecificRecord.
I used Flink's AvroSerializationSchema.forSpecific because the Flink Kafka Avro sink lets you use either the specific-record or the generic-record constructor for serialization to Avro.
def getKafkaSink(brokers: String, targetTopic: String, transactionMaxTimeoutMs: String, transactionTimeoutMs: String) = {
  val schema = ReflectData.get.getSchema(classOf[Example])
  val sink = KafkaSink.builder()
    .setBootstrapServers(brokers)
    .setProperty("transaction.max.timeout.ms", transactionMaxTimeoutMs)
    .setProperty("transaction.timeout.ms", transactionTimeoutMs)
    .setRecordSerializer(KafkaRecordSerializationSchema.builder()
      .setTopic(targetTopic)
      .setValueSerializationSchema(AvroSerializationSchema.forSpecific[Example](classOf[Example]))
      .setPartitioner(new FlinkFixedPartitioner())
      .build()
    )
    .setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
    .build()
  sink
}
Now when I run it I get the exception:
Caused by: org.apache.avro.AvroRuntimeException: java.lang.IllegalAccessException: Class org.apache.avro.specific.SpecificData can not access a member of class ca.ix.dcn.test with modifiers "private final"
at org.apache.avro.specific.SpecificData.createSchema(SpecificData.java:405)
at org.apache.avro.reflect.ReflectData.createSchema(ReflectData.java:734)
I saw there is a bug opened on flink for this:
https://issues.apache.org/jira/browse/FLINK-18478
But I didn't find any workaround for it.
Is there a workaround? Also, are there detailed examples that explain how to use a Flink streaming sink for Avro with AvroSerializationSchema (specific/generic)?
I appreciate any help on this.
A comment on the Flink ticket you linked points out that avro-hugger is not really compatible with the Apache Avro Java library; see https://issues.apache.org/jira/browse/FLINK-18478?focusedCommentId=17164456&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17164456
The solution would be to generate Avro Java POJOs and use them in your Scala application.

FlinkKafkaConsumer/Producer & Confluent Avro schema registry: Validation failed & Compatibility mode writes invalid schema

Hello everyone, I'm struggling with (de-)serializing a simple Avro schema together with the Schema Registry.
The setup:
2 Flink jobs written in java (one consumer, one producer)
1 confluent schema registry for schema validation
1 kafka cluster for messaging
The target:
The producer should send a message serialized with ConfluentRegistryAvroSerializationSchema which includes updating and validating the schema.
The consumer should then deserialize the message into an object with the received schema. Using ConfluentRegistryAvroDeserializationSchema.
So far so good:
If I configure my subject in the Schema Registry to be FORWARD-compatible, the producer writes the correct Avro schema to the registry, but it still fails with the following error (even if I completely and permanently delete the subject first):
Failed to send data to Kafka: Schema being registered is incompatible with an earlier schema for subject "my.awesome.MyExampleEntity-value"
The schema was successfully written:
{
"subject": "my.awesome.MyExampleEntity-value",
"version": 1,
"id": 100028,
"schema": "{\"type\":\"record\",\"name\":\"input\",\"namespace\":\"my.awesome.MyExampleEntity\",\"fields\":[{\"name\":\"serialNumber\",\"type\":\"string\"},{\"name\":\"editingDate\",\"type\":\"int\",\"logicalType\":\"date\"}]}"
}
After this, I could try setting the compatibility to NONE.
If I do so, I can produce my data to Kafka, but
the Schema Registry then has a new version of my schema that looks like this:
{
"subject": "my.awesome.MyExampleEntity-value",
"version": 2,
"id": 100031,
"schema": "\"bytes\""
}
Now I can produce data, but the consumer is not able to deserialize with this schema and emits the following error:
Caused by: org.apache.avro.AvroTypeException: Found bytes, expecting my.awesome.MyExampleEntity
...
I'm currently not sure where exactly the problem is.
Even if I completely and permanently delete the subject (including its schemas), my producer should work from scratch and register a whole "new" subject with its schema.
On the other hand, if I set the compatibility to "NONE", the schema exchange should still work by registering a schema that the consumer can read.
Can anybody help me out here?
According to the latest Confluent docs, NONE means that schema compatibility checks are disabled.
The whole serialization problem came down to using the following flags in the Kafka config:
"schema.registry.url"
"key.serializer"
"key.deserializer"
"value.serializer"
"value.deserializer"
Setting these flags in Flink, even if they are logically correct, leads to undebuggable schema validation and serialization chaos.
I omitted all of these flags and it works fine.
The registry URL needs to be set in ConfluentRegistryAvro(De)serializationSchema only.
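One way to apply this, sketched under the assumption that the Kafka client settings are collected in a java.util.Properties object before being handed to the Flink Kafka connector: drop the five keys listed above and pass the registry URL only to ConfluentRegistryAvroSerializationSchema / ConfluentRegistryAvroDeserializationSchema. The FlinkKafkaProps class and its withoutSerdeKeys helper are hypothetical names for illustration; nothing here is a Flink API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

final class FlinkKafkaProps {

    // Keys the answer above says to leave out of the Kafka config in Flink.
    static final List<String> SERDE_KEYS = Arrays.asList(
            "schema.registry.url",
            "key.serializer",
            "key.deserializer",
            "value.serializer",
            "value.deserializer");

    // Returns a copy of the given properties without the serde-related keys.
    static Properties withoutSerdeKeys(Properties original) {
        Properties cleaned = new Properties();
        cleaned.putAll(original);
        SERDE_KEYS.forEach(cleaned::remove);
        return cleaned;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder address
        props.put("schema.registry.url", "http://localhost:8081");
        props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
        System.out.println(withoutSerdeKeys(props)); // only bootstrap.servers remains
    }
}
```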

Debezium, Kafka connect: is there a way to send only payload and not schema?

I have an outbox postgresql table and debezium connector in kafka connect that creates kafka messages based on the added rows to the table.
The problem I am facing is with the message format. This is the created message value:
{
"schema": {
"type": "string",
"optional": true,
"name": "io.debezium.data.Json",
"version": 1
},
"payload": "{\"foo\": \"bar\"}"
}
But, because of the consumer, I need the message to contain only the payload, like this:
{
"\"foo\": \"bar\""
}
This is part of my kafka connector configuration:
"transforms": "outbox",
"transforms.outbox.type": "io.debezium.transforms.outbox.EventRouter",
"transforms.outbox.route.topic.replacement": "${routedByValue}",
"transforms.outbox.route.by.field": "aggregate_type",
"transforms.outbox.table.field.event.payload.id": "aggregate_id",
"transforms.outbox.table.fields.additional.placement": "payload_type:header:__TypeId__"
Is there any way to achieve this without creating custom transformer?
It looks like you're using org.apache.kafka.connect.json.JsonConverter with schemas.enable=true for your value converter. When you do this it embeds the schema alongside the payload in the message.
If you set value.converter.schemas.enable=false you should get just the payload in your message.
Ref: Kafka Connect: Converters and Serialization Explained — JSON and Schemas

Exception while Deserialize avro data using ConfluentSchemaRegistry?

I am new to flink and Kafka. I am trying to deserialize avro data using Confluent Schema registry. I have already installed flink and Kafka on ec2 machine. Also, the "test" topic has been created before running code.
Code Path: https://gist.github.com/mandar2174/5dc13350b296abf127b92d0697c320f2
The code does the following operation as part of implementation:
1) Create a Flink DataStream object from a list of User elements. (User is an Avro-generated class.)
2) Write the Datastream source to Kafka using AvroSerializationSchema.
3) Read the data from Kafka using ConfluentRegistryAvroDeserializationSchema by reading the schema from Confluent Schema registry.
Command to run flink executable jar:
./bin/flink run -c com.streaming.example.ConfluentSchemaRegistryExample /opt/flink-1.7.2/kafka-flink-stream-processing-assembly-0.1.jar
Exception while running code:
java.io.IOException: Unknown data format. Magic number does not match
at org.apache.flink.formats.avro.registry.confluent.ConfluentSchemaRegistryCoder.readSchema(ConfluentSchemaRegistryCoder.java:55)
at org.apache.flink.formats.avro.RegistryAvroDeserializationSchema.deserialize(RegistryAvroDeserializationSchema.java:66)
at org.apache.flink.streaming.util.serialization.KeyedDeserializationSchemaWrapper.deserialize(KeyedDeserializationSchemaWrapper.java:44)
at org.apache.flink.streaming.connectors.kafka.internal.KafkaFetcher.runFetchLoop(KafkaFetcher.java:140)
at org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase.run(FlinkKafkaConsumerBase.java:665)
at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:94)
at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:58)
at org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:99)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:300)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:704)
at java.lang.Thread.run(Thread.java:748)
Avro schema which I am using for User class is as below:
{
"type": "record",
"name": "User",
"namespace": "com.streaming.example",
"fields": [
{
"name": "name",
"type": "string"
},
{
"name": "favorite_number",
"type": [
"int",
"null"
]
},
{
"name": "favorite_color",
"type": [
"string",
"null"
]
}
]
}
Can someone point out what steps I am missing as part of deserializing avro data using confluent Kafka schema registry?
The Avro data also needs to be written using the Registry for a deserializer that depends on the Registry to work.
However, adding a ConfluentRegistryAvroSerializationSchema class is still an open PR in Flink.
The workaround, I believe, would be to use AvroDeserializationSchema, which does not depend on the Registry.
If you did want to use the Registry in the producer code, then you'd have to do so outside of Flink until that PR is merged.
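The exception message hints at what the registry-based deserializer is checking. As an illustration only (this is a simplified stand-in, not Flink's code, and the class and method names are invented), a reader of Confluent-framed records expects a zero byte followed by a 4-byte schema ID before the Avro payload. Bytes written by the plain AvroSerializationSchema begin directly with the Avro encoding of the record, so the check typically fails on the very first byte.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

final class RegistryHeaderCheck {

    // Reads the schema ID from a Confluent-framed message, or fails if the
    // leading magic byte (0x00) is missing.
    static int readSchemaId(byte[] message) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(message))) {
            if (in.readByte() != 0) {
                throw new IOException("Unknown data format. Magic number does not match");
            }
            return in.readInt(); // 4-byte big-endian schema ID
        }
    }

    public static void main(String[] args) throws IOException {
        // Header for schema ID 42 (arbitrary), followed by two payload bytes.
        byte[] framed = {0, 0, 0, 0, 42, 0x0a, 0x41};
        System.out.println(readSchemaId(framed)); // prints 42

        // Plain Avro binary: a string field starts with its zigzag-encoded length (5 -> 0x0a).
        byte[] plainAvro = {0x0a, 'A', 'l', 'i', 'c', 'e'};
        try {
            readSchemaId(plainAvro);
        } catch (IOException e) {
            System.out.println(e.getMessage());
        }
    }
}
```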