Cannot print Kafka Avro decoded message - apache-kafka

I have a legacy C++ based system that spits out binary-encoded Avro data in the Confluent Avro schema registry format. In my Java application, I successfully deserialized the message using the KafkaAvroDeserializer class, but I cannot print out the message.
private void consumeAvroData() {
    String group = "group1";
    Properties props = new Properties();
    props.put("bootstrap.servers", "http://1.2.3.4:9092");
    props.put("group.id", group);
    props.put("enable.auto.commit", "true");
    props.put("auto.commit.interval.ms", "1000");
    props.put("session.timeout.ms", "30000");
    props.put("key.deserializer", LongDeserializer.class.getName());
    props.put("value.deserializer", KafkaAvroDeserializer.class.getName());
    // props.put(KafkaAvroDeserializerConfig.SPECIFIC_AVRO_READER_CONFIG, "false");
    props.put("schema.registry.url", "http://1.2.3.4:8081");
    KafkaConsumer<String, GenericRecord> consumer = new KafkaConsumer<String, GenericRecord>(props);
    consumer.subscribe(Arrays.asList(TOPIC_NAME));
    System.out.println("Subscribed to topic " + TOPIC_NAME);
    while (true) {
        ConsumerRecords<String, GenericRecord> records = consumer.poll(100);
        for (ConsumerRecord<String, GenericRecord> record : records) {
            System.out.printf("value = %s\n", record.value());
        }
    }
}
The output I get is
{"value":"�"}
Why is it that I cannot print the deserialized data? Any help is appreciated!

The wire format for the Confluent Avro Serializer is documented here in the section entitled "Wire Format"
http://docs.confluent.io/current/schema-registry/docs/serializer-formatter.html
It's a single magic byte (currently always 0) followed by a 4 byte Schema ID as returned by the Schema Registry, followed by a set of bytes which are the Avro serialized data in Avro’s binary encoding.
If you read the message as a byte array and print out the first 5 bytes, you will know whether this is really a Confluent Avro serialized message or not. It should be 0 followed by a 4-byte schema ID (for example 0 0 0 1 for ID 1), which you can check against the Schema Registry for this topic.
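For example, here is a minimal sketch of such a check, reusing the broker from the question; the ByteArrayDeserializer consumer and its group id are purely illustrative (needed imports: java.nio.ByteBuffer, org.apache.kafka.common.serialization.ByteArrayDeserializer):

Properties rawProps = new Properties();
rawProps.put("bootstrap.servers", "1.2.3.4:9092");
rawProps.put("group.id", "wire-format-check");   // hypothetical group id for this check
rawProps.put("key.deserializer", ByteArrayDeserializer.class.getName());
rawProps.put("value.deserializer", ByteArrayDeserializer.class.getName());

try (KafkaConsumer<byte[], byte[]> rawConsumer = new KafkaConsumer<>(rawProps)) {
    rawConsumer.subscribe(Arrays.asList(TOPIC_NAME));
    for (ConsumerRecord<byte[], byte[]> record : rawConsumer.poll(1000)) {
        ByteBuffer buffer = ByteBuffer.wrap(record.value());
        byte magic = buffer.get();       // should be 0 for Confluent-serialized messages
        int schemaId = buffer.getInt();  // the 4-byte schema ID that follows the magic byte
        System.out.printf("magic=%d, schemaId=%d%n", magic, schemaId);
    }
}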
If it's not in this format, then the message was likely serialized another way (without the Confluent Schema Registry), and you need to use a different deserializer, or perhaps extract the full schema from the message value, or even obtain the original schema file from some other source to be able to decode it.

Related

How does the producer communicate with the registry and what does it send to the registry?

I'm trying to understand this by reading the documentation, but maybe because I'm not an advanced programmer I do not really understand it.
For example, I'm looking at this example in the documentation:
https://docs.confluent.io/current/schema-registry/serdes-develop/serdes-protobuf.html#protobuf-serializer
Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
    "org.apache.kafka.common.serialization.StringSerializer");
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
    "io.confluent.kafka.serializers.protobuf.KafkaProtobufSerializer");
props.put("schema.registry.url", "http://127.0.0.1:8081");

Producer<String, MyRecord> producer = new KafkaProducer<String, MyRecord>(props);

String topic = "testproto";
String key = "testkey";
OtherRecord otherRecord = OtherRecord.newBuilder()
    .setOtherId(123).build();
MyRecord myrecord = MyRecord.newBuilder()
    .setF1("value1").setF2(otherRecord).build();
ProducerRecord<String, MyRecord> record
    = new ProducerRecord<String, MyRecord>(topic, key, myrecord);
producer.send(record).get();
producer.close();
I see here that you define the schema registry URL, and then somehow the producer knows that it should contact the registry to provide some metadata about the messages.
Now I would like to understand better how this actually works and what is exchanged between the producer and the registry (or is it Kafka that contacts the registry)?
Anyway, my question is this: imagine I have a record that is in protobuf format.
I'm putting that protobuf into Kafka in a certain topic.
Now I want to activate the schema registry, so would the producer just send the proto definition to the schema registry?
Does the producer just get the metadata definition directly from the record?
Would it try to update it for every new message sent to the topic? Would this not increase the latency a bit when pushing the data to Kafka?
Sorry if these are all very basic questions, but I'm just trying to get the bigger picture and I'm missing this piece.
Thanks for any information, sorry if this is already clear from the documentation.
(I need to have this so that I can use ksql to deserialize my messages in kafka)
best regards,
would the producer just send the proto definition to the schema registry?
The Serializer does, not the Producer directly.
MyRecord is serialized to binary, the schema is sent over HTTP to the Registry, which returns an ID, then the sent message contains 0x0 + ID + binary-protobuf-value
Source code here
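To make the exchange concrete, here is a hedged sketch of roughly what the serializer does behind the scenes, written against the Schema Registry REST API: it POSTs the schema to /subjects/<topic>-value/versions and reads back an ID. The subject name, registry URL, the trimmed-down proto definition, and the use of Java 11's java.net.http client are all assumptions for illustration; the real serializer uses its own cached REST client.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegisterSchemaSketch {
    public static void main(String[] args) throws Exception {
        String registryUrl = "http://127.0.0.1:8081";
        String subject = "testproto-value";   // <topic>-value under the default subject naming strategy
        // Trimmed-down, illustrative schema payload (not the full MyRecord from the docs example).
        String body = "{\"schemaType\":\"PROTOBUF\",\"schema\":\"syntax = \\\"proto3\\\"; message MyRecord { string f1 = 1; }\"}";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(registryUrl + "/subjects/" + subject + "/versions"))
                .header("Content-Type", "application/vnd.schemaregistry.v1+json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        // The registry responds with JSON like {"id":1}; the serializer caches this ID
        // and prepends it (after the magic byte) to every message it produces.
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}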
would it try to update it for every new message sent to the topic?
Schema gets sent before any messages are sent. Existing messages are untouched
would this not increase the latency a bit when pushing the data to Kafka?
Only for the first message since the schema gets cached

Deserialization in Flink datastream

Here I wrote a string to a Kafka topic and Flink consumes this topic. The deserialization is done using SimpleStringSchema. When I need to consume an integer value, what deserialization method should be used instead of SimpleStringSchema?
DataStream<String> messageStream = env.addSource(new FlinkKafkaConsumer09<String>("test2", new SimpleStringSchema(), properties));
You need to define your own DeserializationSchema (the consuming counterpart of the SerializationSchema linked below)
https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/api/common/serialization/SerializationSchema.java#L32
Or leave it as a string then map the stream out into your required types
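For example, here is a minimal sketch of a custom DeserializationSchema for 4-byte big-endian integers; the class name and the encoding are assumptions, so adjust them to however the integers were actually written, and note that on older Flink versions the interface lives in org.apache.flink.streaming.util.serialization instead:

import java.io.IOException;
import java.nio.ByteBuffer;
import org.apache.flink.api.common.serialization.DeserializationSchema;
import org.apache.flink.api.common.typeinfo.TypeInformation;

// Hypothetical schema: assumes each Kafka record value is a 4-byte big-endian int.
public class IntDeserializationSchema implements DeserializationSchema<Integer> {

    @Override
    public Integer deserialize(byte[] message) throws IOException {
        return ByteBuffer.wrap(message).getInt();
    }

    @Override
    public boolean isEndOfStream(Integer nextElement) {
        return false;   // the stream never terminates based on a record value
    }

    @Override
    public TypeInformation<Integer> getProducedType() {
        return TypeInformation.of(Integer.class);
    }
}

Usage: env.addSource(new FlinkKafkaConsumer09<>("test2", new IntDeserializationSchema(), properties)). The simpler alternative mentioned above, if the integers arrive as text, is messageStream.map(Integer::parseInt).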

How to send json message from Kinesis to MSK and then to elastic search using kafka connect

I have the setup ready and the flow is also working.
I am sending my data from a Kinesis stream to MSK using a lambda function, and the format of the message is below:
{
    "data": {
        "RequestID": 517082653,
        "ContentTypeID": 9,
        "OrgID": 16145,
        "UserID": 4,
        "PromotionStartDateTime": "2019-12-14T16:06:21Z",
        "PromotionEndDateTime": "2019-12-14T16:16:04Z",
        "SystemStartDatetime": "2019-12-14T16:17:45.507000000Z"
    },
    "metadata": {
        "timestamp": "2019-12-29T10:37:31.502042Z",
        "record-type": "data",
        "operation": "insert",
        "partition-key-type": "schema-table",
        "schema-name": "dbo",
        "table-name": "TRFSDIQueue"
    }
}
I am sending this JSON message to the Kafka topic like below:
props.put("acks", "all");
props.put("retries", 0);
props.put("batch.size", 16384);
props.put("linger.ms", 1);
props.put("buffer.memory", 33554432);
props.put("producer.type", "async");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
Producer<String, String> producer = new KafkaProducer<String, String>(props);
System.out.println("Inside loop successfully");
try {
producer.send(
new ProducerRecord<String, String>(topicName, new String(rec.getKinesis().getData().array())));
Thread.sleep(1000);
System.out.println("Message sent successfully");
} catch (Exception e) {
System.out.println("------------Exception message=-------------" + e.toString());
}
finally {
producer.flush();
producer.close();
}
When I start Kafka Connect for Elasticsearch, I get an error like:
DataException: Converting byte[] to Kafka Connect data failed due to serialization error
I have also modified quickstart-elasticsearch.properties and changed the key and value serializer to string.
When it was set to JSON, it was throwing an error.
I can see that an index with the Kafka topic name is getting created in Elasticsearch, but there are no records.
So please help me with a few points of confusion:
1. Am I sending the message correctly from the Kinesis stream producer?
I am using
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
or should I use JSON here? But there is no JSON as such.
Or do I have to use a JSON serializer in quickstart-elasticsearch.properties?
2. If the event is an insert, it will insert the record into Elasticsearch, but what about delete and update? Does Kafka Connect handle delete and update for us in Elasticsearch?
Thanks in Advance
For a 30-day free trial, you can use the Kinesis Source Connector, or you can learn how to write your own Source Connector and deploy it alongside the Elasticsearch sink rather than using a lambda at all...
Secondly, work backwards. Can you create a fake topic and send records of the same format outside of lambda? Do those end up in Kafka? How about Elasticsearch? And remove Kibana from the equation, too, if you're using it and it's not working
Then focus on the lambda integration
To answer your questions
1) You're sending JSON as a string. You don't need a separate serializer for JSON unless you are sending POJO classes that get mapped into JSON strings within the serializer interface.
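As an illustration of that last point, here is a hedged sketch of such a POJO-to-JSON serializer using Jackson; the class is hypothetical and not something you need here, since you already send JSON strings:

import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.kafka.common.errors.SerializationException;
import org.apache.kafka.common.serialization.Serializer;

import java.util.Map;

public class JsonPojoSerializer<T> implements Serializer<T> {
    private final ObjectMapper mapper = new ObjectMapper();

    @Override
    public void configure(Map<String, ?> configs, boolean isKey) { }

    @Override
    public byte[] serialize(String topic, T data) {
        try {
            // Map the POJO to a JSON byte array inside the serializer itself.
            return data == null ? null : mapper.writeValueAsBytes(data);
        } catch (JsonProcessingException e) {
            throw new SerializationException("Failed to serialize value for topic " + topic + " as JSON", e);
        }
    }

    @Override
    public void close() { }
}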
You're sending JSON records, so you should be using the JsonConverter in Connect, yes. However, I don't think Elasticsearch mappings will be created automatically unless you have a schema and payload, so a simple workaround is to create the ES index mapping ahead of time (however, if you already know the mapping, you have effectively designed a schema, so it ultimately is the responsibility of the producer code to send the right record).
If you define the mapping ahead of time, you should be able to simply use StringConverter in Connect
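For reference, a hedged sketch of what the converter settings in quickstart-elasticsearch.properties could look like under each option; these are the standard Connect converter class names, but verify them against your Connect version:

# Option A: value is kept as a plain string (ES mapping created ahead of time)
value.converter=org.apache.kafka.connect.storage.StringConverter

# Option B: value is parsed as JSON that has no embedded schema envelope
value.converter=org.apache.kafka.connect.json.JsonConverter
value.converter.schemas.enable=false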
The only things I would change about your producer code are setting retries higher than 0, and using try-with-resources rather than explicitly closing the producer:
//... parse input
try (Producer<String, String> producer = new KafkaProducer<>(props)) {
//... send record
}
2) You can search the GitHub issues for the connector, but last I checked, it does full document updates and inserts, with no partial updates or deletes.

Reading latest data from Kafka broker in Apache Flink

I want to receive the latest data from Kafka in my Flink program, but Flink is reading the historical data.
I have set auto.offset.reset to latest as shown below, but it did not work:
properties.setProperty("auto.offset.reset", "latest");
The Flink program receives the data from Kafka using the code below:
//getting stream from Kafka and giving it assignTimestampsAndWatermarks
DataStream<JoinedStreamEvent> raw_stream = envrionment.addSource(new FlinkKafkaConsumer09<JoinedStreamEvent>("test",
new JoinSchema(), properties)).assignTimestampsAndWatermarks(new IngestionTimeExtractor<>());
I was following the discussion on https://issues.apache.org/jira/browse/FLINK-4280, which suggests adding the source in the way shown below:
Properties props = new Properties();
...
FlinkKafkaConsumer kafka = new FlinkKafkaConsumer("topic", schema, props);
kafka.setStartFromEarliest();
kafka.setStartFromLatest();
kafka.setEnableCommitOffsets(boolean); // if true, commits on checkpoint if checkpointing is enabled, otherwise, periodically.
kafka.setForwardMetrics(boolean);
...
env.addSource(kafka)
I did the same, however, I was not able to access the setStartFromLatest()
FlinkKafkaConsumer09 kafka = new FlinkKafkaConsumer09<JoinedStreamEvent>( "test", new JoinSchema(),properties);
What should I do to receive the latest values that are being sent to Kafka rather than receiving values from history?
The problem was solved by creating a new group id named test1 for both the sender and the consumer, while keeping the topic name the same (test).
Now I am wondering: is this the best way to solve this issue? Because every time I need to give a new group id.
Is there some way I can just read the data that is currently being sent to Kafka?
I believe this could work for you. It did for me. Modify the properties and your kafka topic.
public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    Properties properties = new Properties();
    properties.setProperty("bootstrap.servers", "ip:port");
    properties.setProperty("zookeeper.connect", "ip:port");
    properties.setProperty("group.id", "your-group-id");

    DataStream<String> stream = env
            .addSource(new FlinkKafkaConsumer09<>("your-topic", new SimpleStringSchema(), properties));

    stream.writeAsText("your-path", FileSystem.WriteMode.OVERWRITE)
            .setParallelism(1);

    env.execute();
}
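As an alternative to rotating the group id: if you are on a Flink release that already contains the FLINK-4280 changes (1.3.0 or later), a sketch like the following, using the types from the original question, makes the start position explicit instead of relying on a fresh group id:

FlinkKafkaConsumer09<JoinedStreamEvent> kafka =
        new FlinkKafkaConsumer09<>("test", new JoinSchema(), properties);
kafka.setStartFromLatest();   // ignore committed offsets and read only newly produced records

DataStream<JoinedStreamEvent> raw_stream = envrionment
        .addSource(kafka)
        .assignTimestampsAndWatermarks(new IngestionTimeExtractor<>());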

Can Apache Kafka send non-string messages through a topic?

An MLlib model is trained somewhere, and I want it to be sent somewhere else. When I try to send it through a Kafka topic like this:
val model = LogisticRegressionModel.load(sc, "/PATH/To/Model")
val producer = new Producer[String, LogisticRegressionModel](config)
val data = new KeyedMessage[String, LogisticRegressionModel](topic2, key, model)
producer.send(data)
producer.close()
I would encounter an error like this:
org.apache.spark.mllib.classification.LogisticRegressionModel cannot be cast to java.lang.String
So, is it possible for kafka to send non-string messages through a topic?
You can send non-string messages to a Kafka topic using the Kafka Producer. From version 0.9.0 onwards it is better to use the Java client instead of the Scala client.
All you need to do is specify the correct key and value serializers in the Properties, like below.
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");