How do I deserialize Avro from Kafka with an embedded schema? - apache-kafka

I receive binary Avro files from a Kafka topic and must deserialize them. In the messages received from Kafka, I can see a schema at the start of every message. I know it is better practice not to embed the schema and to keep it separate from the actual Avro payload, but I don't have control over the producer and can't change that.
My code runs on top of Apache Storm. First I create a reader:
mDatumReader = new GenericDatumReader<GenericRecord>();
Later I try to deserialize the message without declaring a schema:
Decoder decoder = DecoderFactory.get().binaryDecoder(messageBytes, null);
GenericRecord payload = mDatumReader.read(null, decoder);
But then I get an error when a message arrives:
Caused by: java.lang.NullPointerException: writer cannot be null!
at org.apache.avro.io.ResolvingDecoder.resolve(ResolvingDecoder.java:77) ~[stormjar.jar:?]
at org.apache.avro.io.ResolvingDecoder.<init>(ResolvingDecoder.java:46) ~[stormjar.jar:?]
at org.apache.avro.io.DecoderFactory.resolvingDecoder(DecoderFactory.java:307) ~[stormjar.jar:?]
at org.apache.avro.generic.GenericDatumReader.getResolver(GenericDatumReader.java:122) ~[stormjar.jar:?]
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:137) ~[stormjar.jar:?]
All the answers I've seen are about using other formats, changing the messages delivered to Kafka, or something else. I don't have control over those things.
My question is: given a message as a byte[] with the schema embedded in the binary payload, how do I deserialize the Avro data without declaring the schema, so I can read it?

With the DatumReader/Writer, there is no such thing as an embedded schema. That was my misunderstanding when I first looked at Avro & Kafka as well. But the source code of the Avro serializer clearly shows that no schema is embedded when the GenericDatumWriter is used.
It is the DataFileWriter that writes a schema at the beginning of the file and then appends GenericRecords using the GenericDatumWriter.
Since you said there is a schema at the beginning, I assume you can read it, turn it into a Schema object and then pass that into the GenericDatumReader(schema) constructor.
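A sketch of that approach, assuming you have already split the payload into the schema JSON and the record bytes (schemaJson and recordBytes are hypothetical variables for those two parts):
Schema schema = new Schema.Parser().parse(schemaJson); // schema text extracted from the message
GenericDatumReader<GenericRecord> reader = new GenericDatumReader<>(schema);
Decoder decoder = DecoderFactory.get().binaryDecoder(recordBytes, null); // the bytes after the schema
GenericRecord payload = reader.read(null, decoder);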
It would be interesting to know how the message is serialized. Maybe the DataFileWriter is used to write into a byte[] instead of an actual file; in that case you could use the DataFileReader to deserialize the data.
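If that is the case, a sketch of the DataFileReader approach (assuming messageBytes is a complete Avro object-container file, i.e. schema header plus records, rather than a bare binary-encoded record):
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.SeekableByteArrayInput;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

try (DataFileReader<GenericRecord> fileReader = new DataFileReader<>(
        new SeekableByteArrayInput(messageBytes),
        new GenericDatumReader<GenericRecord>())) {
    while (fileReader.hasNext()) {
        GenericRecord record = fileReader.next();
        System.out.println(record); // the embedded schema itself is available via fileReader.getSchema()
    }
}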

Add the Maven plugin
Add the avro-maven-plugin to the <build><plugins> section of your POM (it is a build plugin, not a regular dependency; you also need the org.apache.avro:avro runtime dependency):
<plugin>
    <groupId>org.apache.avro</groupId>
    <artifactId>avro-maven-plugin</artifactId>
    <version>1.9.1</version>
    <executions>
        <execution>
            <phase>generate-sources</phase>
            <goals><goal>schema</goal></goals>
        </execution>
    </executions>
</plugin>
Create a schema file like the one below:
{
  "namespace": "tachyonis.space",
  "type": "record",
  "name": "Avro",
  "fields": [
    {"name": "Id", "type": "string"}
  ]
}
Save the above as Avro.avsc in src/main/avro (the plugin's default schema directory; it can be changed with the plugin's sourceDirectory setting).
In Eclipse or any IDE, run Maven generate-sources, which creates Avro.java in the package folder for the namespace, tachyonis.space.
Then configure the consumer (the schema registry URL here is a placeholder for your environment):
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, KafkaAvroDeserializer.class);
props.put(AbstractKafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG, "http://localhost:8081");
props.put(KafkaAvroDeserializerConfig.SPECIFIC_AVRO_READER_CONFIG, true);
KafkaConsumer<String, Avro> consumer = new KafkaConsumer<>(props);
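A minimal poll loop to go with this configuration (a sketch; the topic name is an assumption):
import java.time.Duration;
import java.util.Collections;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;

consumer.subscribe(Collections.singletonList("avro-topic")); // hypothetical topic name
while (true) {
    ConsumerRecords<String, Avro> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, Avro> record : records) {
        System.out.println(record.value().getId()); // getter generated from the "Id" field
    }
}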
The consumer/producer has to run on the same machine. Otherwise, you need to configure the hosts file in Windows/Linux and change all components' configuration properties from localhost to the actual IP address so the brokers are reachable by the producers/consumers. If not, you get network connection errors like:
Connection to node -3 (/127.0.0.1:9092) could not be established. Broker may not be available

Related

FlinkKafkaConsumer/Producer & Confluent Avro schema registry: Validation failed & Compatibility mode writes invalid schema

Hello, I'm struggling with (de-)serializing a simple Avro schema together with Schema Registry.
The setup:
2 Flink jobs written in java (one consumer, one producer)
1 confluent schema registry for schema validation
1 kafka cluster for messaging
The target:
The producer should send a message serialized with ConfluentRegistryAvroSerializationSchema which includes updating and validating the schema.
The consumer should then deserialize the message into an object with the received schema, using ConfluentRegistryAvroDeserializationSchema.
So far so good:
If I configure my subject on the schema registry to be FORWARD-compatible, the producer writes the correct Avro schema to the registry, but it ends with the error (even if I completely and permanently delete the subject first):
Failed to send data to Kafka: Schema being registered is incompatible with an earlier schema for subject "my.awesome.MyExampleEntity-value"
The schema was successfully written:
{
"subject": "my.awesome.MyExampleEntity-value",
"version": 1,
"id": 100028,
"schema": "{\"type\":\"record\",\"name\":\"input\",\"namespace\":\"my.awesome.MyExampleEntity\",\"fields\":[{\"name\":\"serialNumber\",\"type\":\"string\"},{\"name\":\"editingDate\",\"type\":\"int\",\"logicalType\":\"date\"}]}"
}
Following this, I could try to set the compatibility to NONE.
If I do so, I can produce my data on Kafka, but:
The schema registry has a new version of my schema looking like this:
{
"subject": "my.awesome.MyExampleEntity-value",
"version": 2,
"id": 100031,
"schema": "\"bytes\""
}
Now I can produce data, but the consumer is not able to deserialize this schema, emitting the following error:
Caused by: org.apache.avro.AvroTypeException: Found bytes, expecting my.awesome.MyExampleEntity
...
I'm currently not sure where exactly the problem is.
Even if I completely and permanently delete the subject (including schemas), my producer should work fine from scratch, registering a whole "new" subject with its schema.
On the other hand, if I set the compatibility to "NONE", the schema exchange should work anyway, registering a schema which can be read by the consumer.
Can anybody help me out here?
According to the latest Confluent docs, NONE means schema compatibility checks are disabled.
The whole problem with serialisation was caused by setting the following properties in the Kafka config:
"schema.registry.url"
"key.serializer"
"key.deserializer"
"value.serializer"
"value.deserializer"
Setting these flags in Flink, even if they are logically correct, leads to undebuggable schema-validation and serialisation chaos.
Omit all of these flags and it works fine.
The registry URL needs to be set in ConfluentRegistryAvro(De)serializationSchema only.
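For example (a sketch: MyExampleEntity stands for the generated SpecificRecord class from the question's subject; the topic name and URLs are assumptions):
import org.apache.flink.formats.avro.registry.confluent.ConfluentRegistryAvroDeserializationSchema;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import java.util.Properties;

Properties kafkaProps = new Properties();
kafkaProps.setProperty("bootstrap.servers", "kafka:9092"); // no (de)serializer or registry entries here
kafkaProps.setProperty("group.id", "my-consumer-group");

FlinkKafkaConsumer<MyExampleEntity> consumer = new FlinkKafkaConsumer<>(
        "my-topic", // hypothetical topic name
        ConfluentRegistryAvroDeserializationSchema.forSpecific(
                MyExampleEntity.class, "http://schema-registry:8081"),
        kafkaProps);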

Getting error while publishing message to kafka topic

I am new to Kafka. I have written a simple Java program to generate a message using an Avro schema. I have generated a specific record, and the record is generated successfully. My schema is not yet registered with my local environment; it is currently registered with some other environment.
I am using the Apache Kafka producer library to publish the message to a topic in my local environment. Can I publish the message to the local topic, or does the schema need to be registered with the local schema registry as well?
Below are the producer properties:
properties.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
properties.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, KafkaAvroSerializer.class);
properties.put(KafkaAvroSerializerConfig.SCHEMA_REGISTRY_URL_CONFIG, "https://schema-registry.xxxx.service.dev:443");
The error I am getting while publishing the message:
org.apache.kafka.common.errors.SerializationException: Error registering Avro schema:
Caused by: io.confluent.kafka.schemaregistry.client.rest.exceptions.RestClientException: User is denied operation Write on Subject: xxx.avro-value; error code: 40301
The issue was that the Kafka producer tries to register the schema for the topic by default. So I added the line below:
properties.put(KafkaAvroSerializerConfig.AUTO_REGISTER_SCHEMAS, false);
and it resolved the issue.
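In context, the producer properties then look like this (a sketch combining the lines above):
properties.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
properties.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, KafkaAvroSerializer.class);
properties.put(KafkaAvroSerializerConfig.SCHEMA_REGISTRY_URL_CONFIG, "https://schema-registry.xxxx.service.dev:443");
properties.put(KafkaAvroSerializerConfig.AUTO_REGISTER_SCHEMAS, false); // the schema must already exist in the registry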

how to resolve java.lang.IllegalArgumentException Unsupported Avro type

private KafkaTemplate<String, KafkaMessage> kafkaTemplate;
Message<KafkaMessage> message = MessageBuilder
        .withPayload(kafkaMessage)
        .setHeader(KafkaHeaders.TOPIC, targetTopic)
        .setHeader(KafkaHeaders.MESSAGE_KEY, "someStringValue" )
        .setHeader("X-Custom-Header", headerCreator.generateHeader(source, type)).build();
ListenableFuture<SendResult<String, KafkaMessage>> listenableFuture = kafkaTemplate.send(message);
This is my code, and the exception occurs at the send method.
The exception is: java.lang.IllegalArgumentException: Unsupported Avro type. Supported types are null, Boolean, Integer, Long, Float, Double, String, byte[] and IndexedRecord
Assuming that the Kafka topic expects an Avro-serialized object, you can add the avro-maven-plugin to the project POM and let Maven generate the Avro classes for you.
This plugin reads the Avro schema files and automatically (once the project is built) generates the POJO classes. If a schema contains an error or is not valid, you will be warned before executing any code.
The KafkaTemplate should then use this generated POJO instead of KafkaMessage.
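For instance, if the plugin generates a class named KafkaPayload from your schema (a hypothetical name), the template and message would be typed to it:
private KafkaTemplate<String, KafkaPayload> kafkaTemplate; // KafkaPayload: class generated by avro-maven-plugin

Message<KafkaPayload> message = MessageBuilder
        .withPayload(kafkaPayload) // an instance of the generated class
        .setHeader(KafkaHeaders.TOPIC, targetTopic)
        .build();
kafkaTemplate.send(message);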
I recommend reading How to Use Schema Registry and Avro in Spring Boot Applications for a complete consumer and producer example, using Confluent components, for the overall project configuration (SERDEs, schema registry, etc.).

Configure Apache Kafka sink jdbc connector

I want to send the data published to the topic to a PostgreSQL database. So I followed this guide and configured the properties file like this:
name=transaction-sink
connector.class=io.confluent.connect.jdbc.JdbcSinkConnector
tasks.max=1
topics=transactions
connection.url=jdbc:postgresql://localhost:5432/db
connection.user=db-user
connection.password=
auto.create=true
insert.mode=insert
table.name.format=transaction
pk.mode=none
I start the connector with
./bin/connect-standalone etc/schema-registry/connect-avro-standalone.properties etc/kafka-connect-jdbc/sink-quickstart-postgresql.properties
The sink-connector is created but does not start due to this error:
Caused by: org.apache.kafka.common.errors.SerializationException: Error deserializing Avro message for id -1
Caused by: org.apache.kafka.common.errors.SerializationException: Unknown magic byte!
The schema is in Avro format and registered, and I can send (produce) messages to the topic and read (consume) from it. But I can't seem to send it to the database.
This is my ./etc/schema-registry/connect-avro-standalone.properties
key.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://localhost:8081
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081
This is the producer feeding the topic using the Java API:
properties.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
properties.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, KafkaAvroSerializer.class);
properties.put(AbstractKafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG, "http://localhost:8081");
try (KafkaProducer<String, Transaction> producer = new KafkaProducer<>(properties)) {
    Transaction transaction = new Transaction();
    transaction.setFoo("foo");
    transaction.setBar("bar");
    UUID uuid = UUID.randomUUID();
    final ProducerRecord<String, Transaction> record = new ProducerRecord<>(TOPIC, uuid.toString(), transaction);
    producer.send(record);
}
I'm verifying data is properly serialized and deserialized using
./bin/kafka-avro-console-consumer --bootstrap-server localhost:9092 \
--property schema.registry.url=http://localhost:8081 \
--topic transactions \
--from-beginning --max-messages 1
The database is up and running.
This is not correct:
The unknown magic byte can be due to a id-field not part of the schema
What that error means is that the message on the topic was not serialised using the Schema Registry Avro serialiser.
How are you putting data on the topic?
Maybe all the messages have the problem, maybe only some; but by default this will halt the Kafka Connect task.
You can set
"errors.tolerance":"all",
to get it to ignore messages that it can't deserialise. But if none of them are correctly Avro-serialised this won't help; you need to serialise them correctly, or choose a different converter (e.g. if they're actually JSON, use the JsonConverter).
These references should help you more:
https://www.confluent.io/blog/kafka-connect-deep-dive-converters-serialization-explained
https://www.confluent.io/blog/kafka-connect-deep-dive-error-handling-dead-letter-queues
http://rmoff.dev/ksldn19-kafka-connect
Edit:
If you are serialising the key with StringSerializer then you need to use this in your Connect config:
key.converter=org.apache.kafka.connect.storage.StringConverter
You can set it at the worker (a global property that applies to all connectors you run on it), or just for this connector (i.e. put it in the connector properties itself; it will override the worker settings), as sketched below.
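For this connector, a sketch of the additions to the sink properties file:
# in etc/kafka-connect-jdbc/sink-quickstart-postgresql.properties
key.converter=org.apache.kafka.connect.storage.StringConverter
# optional: skip records that cannot be deserialised instead of halting the task
errors.tolerance=all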

There's no avro data in hdfs using kafka connect

I am using Kafka Connect in distributed mode.
The command is: bin/connect-distributed etc/schema-registry/connect-avro-distributed.properties
The worker configuration is:
bootstrap.servers=kafka1:9092,kafka2:9092,kafka3:9092
group.id=connect-cluster
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schemas.enable=false
value.converter.schemas.enable=false
Kafka Connect starts up with no errors!
The topics connect-configs, connect-offsets and connect-statuses have been created.
The topic mysiteview has been created.
Then I create a Kafka connector using the REST API like this:
curl -X POST -H "Content-Type: application/json" --data '{"name":"hdfs-sink-mysiteview","config":{"connector.class":"io.confluent.connect.hdfs.HdfsSinkConnector","tasks.max":"3","topics":"mysiteview","hdfs.url":"hdfs://master1:8020","topics.dir":"/kafka/topics","logs.dir":"/kafka/logs","format.class":"io.confluent.connect.hdfs.avro.AvroFormat","flush.size":"1000","rotate.interval.ms":"1000","partitioner.class":"io.confluent.connect.hdfs.partitioner.DailyPartitioner","path.format":"YYYY-MM-dd","schema.compatibility":"BACKWARD","locale":"zh_CN","timezone":"Asia/Shanghai"}}' http://kafka1:8083/connectors
And then I produce data to the topic "mysiteview", something like this:
{"f1":"192.168.1.1","f2":"aa.example.com"}
The Java code is as follows:
Properties props = new Properties();
props.put("bootstrap.servers","kafka1:9092");
props.put("acks","all");
props.put("retries",3);
props.put("batch.size", 16384);
props.put("linger.ms",30);
props.put("buffer.memory",33554432);
props.put("key.serializer","org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
Producer<String, String> producer = new KafkaProducer<String,String>(props);
Random rnd = new Random();
for (long nEvents = 0; nEvents < events; nEvents++) {
    long runtime = new Date().getTime();
    String site = "www.example.com";
    String ipString = "192.168.2." + rnd.nextInt(255);
    String key = "" + rnd.nextInt(255);
    User u = new User();
    u.setF1(ipString);
    u.setF2(site + " " + rnd.nextInt(255));
    System.out.println(JSON.toJSONString(u));
    producer.send(new ProducerRecord<String, String>("mysiteview", JSON.toJSONString(u)));
    Thread.sleep(50);
}
producer.flush();
producer.close();
Then something weird occurred.
I get data in the Kafka logs, but there is no data in HDFS (no topic directory).
I try the connector command:
curl -X GET http://kafka1:8083/connectors/hdfs-sink-mysiteview/status
output is:
{"name":"hdfs-sink-mysiteview","connector":{"state":"RUNNING","worker_id":"10.255.223.178:8083"},"tasks":[{"state":"RUNNING","id":0,"worker_id":"10.255.223.178:8083"},{"state":"RUNNING","id":1,"worker_id":"10.255.223.178:8083"},{"state":"RUNNING","id":2,"worker_id":"10.255.223.178:8083"}]}
But when I inspect the task status using the following command:
curl -X GET http://kafka1:8083/connectors/hdfs-sink-mysiteview/hdfs-sink-siteview-1
I get the result: "Error 404". All three tasks give the same error!
What's going wrong?
Without seeing the worker's log, I'm not sure exactly which exception your HDFS connector instances are failing with when you use the settings described above. However, I can spot a few issues with the configuration:
You mention that you start your Connect worker with: bin/connect-distributed etc/schema-registry/connect-avro-distributed.properties. These properties default to having key and value converters set to AvroConverter and require you to run the schema-registry service. If indeed you've edited the configuration in connect-avro-distributed.properties to use the JsonConverter instead, your HDFS connector will probably fail during the conversion of Kafka records to Connect's SinkRecord data type, just before it tries to export your data to HDFS.
Until recently, the HDFS connector was able to export only Avro records, to files of Avro or Parquet format. That requires using the AvroConverter as mentioned above. The capability to export records to text files as JSON was added recently, and will appear in version 4.0.0 of the connector (you may try this capability by checking out and building the connector from source).
At this point, my first suggestion would be to try and import your data with bin/kafka-avro-console-producer. Define their schema, confirm that the data are imported successfully with bin/kafka-avro-console-consumer and then set your HDFS Connector to use AvroFormat as above. The quickstart at the connector's page describes a very similar process, and maybe it would be a great starting point for your use case.
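For example, a sketch of such a test run (the schema name and registry URL are assumptions based on your sample record):
bin/kafka-avro-console-producer \
  --broker-list kafka1:9092 --topic mysiteview \
  --property schema.registry.url=http://kafka1:8081 \
  --property value.schema='{"type":"record","name":"SiteView","fields":[{"name":"f1","type":"string"},{"name":"f2","type":"string"}]}'
Each line you then type, e.g. {"f1":"192.168.1.1","f2":"aa.example.com"}, is published to the topic Avro-serialised against that schema.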
Maybe you are just using the REST API wrong.
According to the documentation, the call should be:
/connectors/:connector_name/tasks/:task_id
https://docs.confluent.io/3.3.1/connect/restapi.html#get--connectors-(string-name)-tasks-(int-taskid)-status
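Applied to your connector, the status check for task 0 would be:
curl -X GET http://kafka1:8083/connectors/hdfs-sink-mysiteview/tasks/0/status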