Kafka Consumer Error - apache-kafka

I am using a Kafka producer and a Spring Kafka consumer, with a JSON serializer and deserializer. Whenever I try to read messages from the topic in the consumer, I get the following error:
org.apache.kafka.common.errors.SerializationException: Error deserializing key/value for partition fan_topic-0 at offset 154. If needed, please seek past the record to continue consumption.
Caused by: java.lang.IllegalStateException: No type information in headers and no default type provided
I have not configured anything about headers in either the producer or the consumer. What am I missing here?

I believe you are missing the fact that the JsonDeserializer has to be configured on the ConsumerFactory with an appropriate default type to deserialize into, not via the Kafka properties.
All the info is presented in the Docs: https://docs.spring.io/spring-kafka/docs/2.1.7.RELEASE/reference/html/_reference.html#serdes
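For example, a minimal sketch of such a ConsumerFactory bean (my own sketch, assuming a @Configuration class, a hypothetical Fan payload class and a local broker; adjust to your setup):

import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.core.ConsumerFactory;
import org.springframework.kafka.core.DefaultKafkaConsumerFactory;
import org.springframework.kafka.support.serializer.JsonDeserializer;

@Configuration
public class KafkaConsumerConfig {

    @Bean
    public ConsumerFactory<String, Fan> consumerFactory() {
        Map<String, Object> config = new HashMap<>();
        config.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        config.put(ConsumerConfig.GROUP_ID_CONFIG, "fan-group");
        // The default target type is given to the JsonDeserializer itself, not via Kafka properties
        return new DefaultKafkaConsumerFactory<>(config,
                new StringDeserializer(),
                new JsonDeserializer<>(Fan.class));
    }
}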

Just adding to the answer above, the changes below solved it for me.
On the producer side, disabling type info headers:
config.put(JsonSerializer.ADD_TYPE_INFO_HEADERS, false);
and on the consumer side, creating the factory as
return new DefaultKafkaConsumerFactory<>(config, new StringDeserializer(), new JsonDeserializer<>(String.class));
instead of
return new DefaultKafkaConsumerFactory<String, String>(config);
For reference, the deserialize method below expects the type information in the headers; the Assert.state(...) call is what throws the IllegalStateException:
@Override
public T deserialize(String topic, Headers headers, byte[] data) {
    JavaType javaType = this.typeMapper.toJavaType(headers);
    if (javaType == null) {
        Assert.state(this.targetType != null, "No type information in headers and no default type provided");
        return deserialize(topic, data);
    }
    else {
        try {
            return this.objectMapper.readerFor(javaType).readValue(data);
        }
        catch (IOException e) {
            throw new SerializationException("Can't deserialize data [" + Arrays.toString(data) +
                    "] from topic [" + topic + "]", e);
        }
    }
}

Related

remote-partitioning in spring-batch with Kafka as middleware

I'm trying to use spring-batch remote partitioning to scale the job, with Apache Kafka as the middleware.
Here is a brief configuration of the manager step:
@Bean
public Step managerStep() {
    return managerStepBuilderFactory.get("managerStep")
            .partitioner("workerStep", filePartitioner)
            .outputChannel(requestForWorkers())
            .inputChannel(repliesFromWorkers())
            .build();
}
So I'm using channels both for sending requests to the workers and for receiving responses from them. I know the other option is to poll the JobRepository (which works fine in my case), but I would rather not use it.
Here are also some of the Kafka configs:
spring.kafka.producer.key-serializer= org.apache.kafka.common.serialization.StringSerializer
spring.kafka.producer.value-serializer=org.springframework.kafka.support.serializer.JsonSerializer
spring.kafka.consumer.key-deserializer= org.apache.kafka.common.serialization.StringDeserializer
spring.kafka.consumer.value-deserializer= org.springframework.kafka.support.serializer.JsonDeserializer
spring.kafka.producer.properties.spring.json.add.type.headers=true
spring.kafka.consumer.properties.spring.json.trusted.packages = org.springframework.batch.integration.partition,org.springframework.batch.core
The master and the workers are configured, and the master can send requests through Kafka to the workers. The workers start processing, and everything is fine until the workers try to send the response back through Kafka.
As you can see, I'm using the JsonSerializer and JsonDeserializer for sending/receiving the messages. The problem is that when Jackson tries to serialize the StepExecution, it falls into infinite recursion, since the StepExecution has a JobExecution in it and the JobExecution in turn has a List of StepExecutions:
Caused by: org.apache.kafka.common.errors.SerializationException: Can't serialize data [StepExecution: id=3001, version=6, name=workerStep:61127a319d6caf656442ff53, status=COMPLETED, exitStatus=COMPLETED, readCount=10, filterCount=0, writeCount=10 readSkipCount=0, writeSkipCount=0, processSkipCount=0, commitCount=4, rollbackCount=0, exitDescription=] for topic [repliesFromWorkers]
Caused by: com.fasterxml.jackson.databind.JsonMappingException: Infinite recursion (StackOverflowError) (through reference chain: org.springframework.batch.core.JobExecution["stepExecutions"]->java.util.Collections$UnmodifiableRandomAccessList[0]->org.springframework.batch.core.StepExecution["jobExecution"]->org.springframework.batch.core.JobExecution["stepExecutions"]->java.util.Collections$UnmodifiableRandomAccessList[0]->org.springframework.batch.core.StepExecution["jobExecution"]->org.springframework.batch.core.JobExecution["stepExecutions"]-....
So I thought maybe I could customize the serialization of the StepExecution so that it ignores the List of StepExecutions inside its JobExecution. But even in that case, it fails on the master side while deserializing the StepExecution:
Caused by: com.fasterxml.jackson.databind.exc.MismatchedInputException: Cannot construct instance of `org.springframework.batch.core.StepExecution` (although at least one Creator exists): cannot deserialize from Object value (no delegate- or property-based Creator)
Is there any way to make this work?
I'm using Spring Boot 2.4.2 and the corresponding versions of spring-boot-starter-batch, spring-batch-integration, spring-integration-kafka and spring-kafka.
You can create a custom (de)serializer and handle it manually. Something like this will help:
import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.apache.kafka.common.serialization.StringSerializer;
import org.springframework.core.serializer.DefaultSerializer;
import org.springframework.core.serializer.Serializer;
import org.springframework.kafka.support.serializer.JsonSerializer;
import org.springframework.messaging.converter.MessageConversionException;

public class KafkaStringOrByteSerializer<T> extends JsonSerializer<T> {

    private final Serializer<Object> byteSerializer = new DefaultSerializer();
    private final org.apache.kafka.common.serialization.Serializer<String> stringSerializer = new StringSerializer();

    @Override
    public byte[] serialize(String topic, T data) {
        if (needsBinarySerializer(data)) {
            return this.serializeBinary(data);
        } else {
            return stringSerializer.serialize(topic, (String) data);
        }
    }

    private boolean needsBinarySerializer(Object data) {
        if (data instanceof byte[] || data instanceof Byte[] || data instanceof Byte) {
            return true;
        }
        if (data != null && data.getClass() != null) {
            // Spring Batch types (StepExecutionRequest, StepExecution, ...) are Java-serialized
            return data.getClass().getName().startsWith("org.springframework.batch");
        }
        return false;
    }

    private byte[] serializeBinary(Object data) {
        try (ByteArrayOutputStream output = new ByteArrayOutputStream()) {
            byteSerializer.serialize(data, output);
            return output.toByteArray();
        } catch (IOException e) {
            throw new MessageConversionException("Cannot convert object to bytes", e);
        }
    }
}
A similar approach can be taken for the deserializer.
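For illustration only, a rough sketch of what that deserializer counterpart could look like (my own sketch, not taken from the original answer; it assumes the same convention as above, i.e. Spring Batch payloads are Java-serialized and everything else is a plain UTF-8 string):

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.kafka.common.serialization.Deserializer;
import org.springframework.core.serializer.DefaultDeserializer;
import org.springframework.messaging.converter.MessageConversionException;

public class KafkaStringOrByteDeserializer implements Deserializer<Object> {

    private final DefaultDeserializer byteDeserializer = new DefaultDeserializer();

    @Override
    public Object deserialize(String topic, byte[] data) {
        if (data == null) {
            return null;
        }
        if (looksLikeJavaSerialization(data)) {
            try (ByteArrayInputStream input = new ByteArrayInputStream(data)) {
                return byteDeserializer.deserialize(input);
            } catch (IOException e) {
                throw new MessageConversionException("Cannot convert bytes to object", e);
            }
        }
        // Anything else is treated as a plain string payload
        return new String(data, StandardCharsets.UTF_8);
    }

    private boolean looksLikeJavaSerialization(byte[] data) {
        // Java serialization streams start with the magic header 0xACED
        return data.length >= 2 && (data[0] & 0xFF) == 0xAC && (data[1] & 0xFF) == 0xED;
    }
}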

Pause Kafka Consumer with spring-cloud-stream and Functional Style

I'm trying to implement a retry mechanism for my Kafka stream application. The idea is that I would get the consumer, the partition ID, and the topic name from the input topic, and then pause the consumer for the duration stored in the payload.
I've searched the documentation and examples, but all I found are examples based on the classic bindings provided by spring-cloud-stream. I'm trying to see if there's a way to get access to this information with the functional style.
For example, the following code gives me access to the consumer with the classic binding style.
@StreamListener(Sink.INPUT)
public void in(String in, @Header(KafkaHeaders.CONSUMER) Consumer<?, ?> consumer) {
    System.out.println(in);
    consumer.pause(Collections.singleton(new TopicPartition("myTopic", 0)));
}
How do I get the equivalent with the functional style?
I tried the following code, but I get an exception saying no such binding is found.
@Bean
public Function<Message<?>, KStream<String, String>> process() {
    message -> {
        Consumer<?, ?> consumer = message.getHeaders().get(KafkaHeaders.Consumer, Consumer.class);
        String topic = message.getHeaders().get(KafkaHeaders.Topic, String.class);
        Integer partitionId = message.getHeaders().get(KafkaHeaders.RECEIVED_PARTITION_ID, Integer.class);
        CustomPayload payload = (CustomPayload) message.getPayload();
        if (payload.getRetryTime() < System.currentTimeMillis()) {
            consumer.pause(Collections.singleton(new TopicPartition(topic, partitionId)));
        }
    }
}
The exception I got:
Caused by: java.lang.IllegalStateException: No factory found for binding target type: org.springframework.messaging.Message among registered factories: channelFactory,messageSourceFactory,kStreamBoundElementFactory,kTableBoundElementFactory,globalKTableBoundElementFactory
at org.springframework.cloud.stream.binding.AbstractBindableProxyFactory.getBindingTargetFactory(AbstractBindableProxyFactory.java:82)
at org.springframework.cloud.stream.binder.kafka.streams.function.KafkaStreamsBindableProxyFactory.bindInput(KafkaStreamsBindableProxyFactory.java:191)
at org.springframework.cloud.stream.binder.kafka.streams.function.KafkaStreamsBindableProxyFactory.afterPropertiesSet(KafkaStreamsBindableProxyFactory.java:111)
at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.invokeInitMethods(AbstractAutowireCapableBeanFactory.java:1853)
at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.initializeBean(AbstractAutowireCapableBeanFactory.java:1790)
... 96 more
In your functional bean example, you are mixing both Message and KStream. That is the reason for that specific exception. The functional bean could be rewritten as below.
@Bean
public java.util.function.Consumer<Message<?>> process() {
    return message -> {
        Consumer<?, ?> consumer = message.getHeaders().get(KafkaHeaders.CONSUMER, Consumer.class);
        String topic = message.getHeaders().get(KafkaHeaders.RECEIVED_TOPIC, String.class);
        Integer partitionId = message.getHeaders().get(KafkaHeaders.RECEIVED_PARTITION_ID, Integer.class);
        CustomPayload payload = (CustomPayload) message.getPayload();
        if (payload.getRetryTime() < System.currentTimeMillis()) {
            consumer.pause(Collections.singleton(new TopicPartition(topic, partitionId)));
        }
    };
}
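One extra note (not part of the original answer): with the functional model, the binding name is derived from the function name, so the input binding for this bean is process-in-0. A sketch of the matching configuration, with a placeholder destination:
spring.cloud.function.definition=process
spring.cloud.stream.bindings.process-in-0.destination=myTopic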

Kafka Consumer Poll runs indefinitely and doesn't return anything

I am facing difficulty with KafkaConsumer.poll(Duration timeout): it runs indefinitely and never comes out of the method. I understand that this could be related to the connection, and I have seen it be a bit inconsistent at times. How should I handle it if poll stops responding? Given below is the snippet from KafkaConsumer.poll():
public ConsumerRecords<K, V> poll(final Duration timeout) {
    return poll(time.timer(timeout), true);
}
and I am calling the above from here:
Duration timeout = Duration.ofSeconds(30);
while (true) {
    final ConsumerRecords<recordID, topicName> records = consumer.poll(timeout);
    System.out.println("record count is" + records.count());
}
I am getting the below error:
org.apache.kafka.common.errors.SerializationException: Error deserializing key/value for partition at offset 2. If needed, please seek past the record to continue consumption.
I stumbled upon some useful information while trying to fix the problem I was facing above. I will provide the piece of code that should be able to handle this, but before that it is important to know what causes it.
When producing or consuming messages to or from Apache Kafka, we need a schema for that message or data, in my case an Avro schema. If a message that conflicts with that schema is produced to Kafka, it will affect consumption.
Add the code below to your consumer, in the method where it consumes records. Remember to import the following packages:
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.errors.SerializationException;
try {
    while (true) {
        ConsumerRecords<String, GenericRecord> records = null;
        try {
            records = consumer.poll(10000);
        } catch (SerializationException e) {
            // Parse the topic, partition and offset of the bad record out of the exception message
            String s = e.getMessage().split("Error deserializing key/value for partition ")[1]
                    .split(". If needed, please seek past the record to continue consumption.")[0];
            String topics = s.split("-")[0];
            int offset = Integer.valueOf(s.split("offset ")[1]);
            int partition = Integer.valueOf(s.split("-")[1].split(" at")[0]);
            TopicPartition topicPartition = new TopicPartition(topics, partition);
            // log.info("Skipping " + topics + "-" + partition + " offset " + offset);
            consumer.seek(topicPartition, offset + 1);
            // Skip the bad record and poll again
            continue;
        }
        for (ConsumerRecord<String, GenericRecord> record : records) {
            System.out.printf("value = %s \n", record.value());
        }
    }
} finally {
    consumer.close();
}
I ran into this while setting up a test environment.
Running the following command on the broker printed out the stored records as one would expect:
bin/kafka-console-consumer.sh --bootstrap-server="localhost:9092" --topic="foo" --from-beginning
It turned out that the Kafka server was misconfigured. To connect from an external IP address, listeners must have a valid value in kafka/config/server.properties, e.g.
# The address the socket server listens on. It will get the value returned from
# java.net.InetAddress.getCanonicalHostName() if not configured.
# FORMAT:
# listeners = listener_name://host_name:port
# EXAMPLE:
# listeners = PLAINTEXT://your.host.name:9092
listeners=PLAINTEXT://:9092

Kafka KStream to GlobalKTable join does not work with same key used

I have a very frustrating problem trying to join a KStream, which is populated by a Java driver program using KafkaProducer, to a GlobalKTable that is populated from a topic which, in turn, is populated using the JDBC connector pulling data from a MySQL table. No matter what I try, the join between the KStream and the GlobalKTable, both of which are keyed on the same value, will not work. What I mean is that the ValueJoiner is never called. I'll try to explain by showing the relevant config and code below. I appreciate any help.
I am using the latest version of the confluent platform.
The topic that the GlobalKTable is populated from is pulled from a single MySQL table:
Column Name/Type:
pk/bigint(20)
org_name/varchar(255)
orgId/varchar(10)
The JDBCConnector configuration for this is:
name=my-demo
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
key.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://localhost:8081
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081
connection.url=jdbc:mysql://localhost:3306/reporting?user=root&password=XXX
table.whitelist=organisation
mode=incrementing
incrementing.column.name=pk
topic.prefix=my-
transforms=keyaddition
transforms.keyaddition.type=org.apache.kafka.connect.transforms.ValueToKey
transforms.keyaddition.fields=orgId
I am running the JDBC connector using the command line:
connect-standalone /home/jim/platform/confluent/etc/schema-registry/connect-avro-standalone.properties /home/jim/prg/kafka/config/my.mysql.properties
This gives me a topic called my-organisation that is keyed on orgId..... so far so good!
(Note: the namespace does not seem to be set by the JDBC connector, but I don't think this is an issue; I don't know for sure.)
Now, the code. Here is how I initialise and create the GlobalKTable (relevant code shown):
final Map<String, String> serdeConfig =
        Collections.singletonMap(AbstractKafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG, schemaRegistryUrl);

final StreamsBuilder builder = new StreamsBuilder();

final SpecificAvroSerde<Organisation> orgSerde = new SpecificAvroSerde<>();
orgSerde.configure(serdeConfig, false);

// Create the GlobalKTable from the topic that was populated using the connect-standalone command line
final GlobalKTable<String, Organisation> orgs =
        builder.globalTable(ORG_TOPIC, Materialized.<String, Organisation, KeyValueStore<Bytes, byte[]>>as(ORG_STORE)
                .withKeySerde(Serdes.String())
                .withValueSerde(orgSerde));
The Avro schema, from which the Organisation class is generated, is defined as:
{"namespace": "io.confluent.examples.streams.avro",
"type":"record",
"name":"Organisation",
"fields":[
{"name": "pk", "type":"long"},
{"name": "org_name", "type":"string"},
{"name": "orgId", "type":"string"}
]
}
Note: as described above the orgId is set as the key on the topic using the single message transform (SMT) operation.
So, that is the GlobalKTable setup.
Now for the KStream setup (the stream side of the join). This has the same key (orgId) as the GlobalKTable. I use a simple driver program for this:
(The use case is that this topic will contain events associated with each organisation)
public class UploadGenerator {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, io.confluent.kafka.serializers.KafkaAvroSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, io.confluent.kafka.serializers.KafkaAvroSerializer.class);
        props.put("schema.registry.url", "http://localhost:8081");
        KafkaProducer producer = new KafkaProducer(props);

        // This schema is also used in the consumer application, or more specifically a class generated from it.
        String mySchema = "{\"namespace\": \"io.confluent.examples.streams.avro\"," +
                "\"type\":\"record\"," +
                "\"name\":\"DocumentUpload\"," +
                "\"fields\":[{\"name\":\"orgId\",\"type\":\"string\"}," +
                "{\"name\":\"date\",\"type\":\"long\",\"logicalType\":\"timestamp-millis\"}]}";
        Schema.Parser parser = new Schema.Parser();
        Schema schema = parser.parse(mySchema);

        // Just using three fictional organisations with the following orgIds/keys
        String[] ORG_ARRAY = {"002", "003", "004"};

        long count = 0;
        String key = ""; // key is the realm
        while (true) {
            count++;
            try {
                TimeUnit.SECONDS.sleep(5);
            } catch (InterruptedException e) {
            }
            GenericRecord avroRecord = new GenericData.Record(schema);
            int orgId = ThreadLocalRandom.current().nextInt(0, 2 + 1);
            avroRecord.put("orgId", ORG_ARRAY[orgId]);
            avroRecord.put("date", new Date().getTime());
            key = ORG_ARRAY[orgId];
            ProducerRecord<Object, Object> record = new ProducerRecord<>("topic_uploads", key, avroRecord);
            try {
                producer.send(record);
                producer.flush();
            } catch (SerializationException e) {
                System.out.println("Exccccception was generated! + " + e.getMessage());
            } catch (Exception el) {
                System.out.println("Exception: " + el.getMessage());
            }
        }
    }
}
So, this generates a new event representing an upload for an organisation, identified by the orgId, which is also explicitly set as the key in the ProducerRecord.
Here is the code that sets up the KStream for these events:
final SpecificAvroSerde<DocumentUpload> uploadSerde = new SpecificAvroSerde<>();
uploadSerde.configure(serdeConfig, false);

// Get the stream of uploads
final KStream<String, DocumentUpload> uploadStream = builder.stream(UPLOADS_TOPIC, Consumed.with(Serdes.String(), uploadSerde));

// Debug output to see the contents of the stream
uploadStream.foreach((k, v) -> System.out.println("uploadStream: Key: " + k + ", Value: " + v));

// Note, I tried to re-key the stream with the orgId field (even though it was set as the key in the driver, but same problem)
final KStream<String, DocumentUpload> keyedUploadStream = uploadStream.selectKey((key, value) -> value.getOrgId());
keyedUploadStream.foreach((k, v) -> System.out.println("keyedUploadStream: Key: " + k + ", Value: " + v));

// Java 7 form used as it was easier to put in debug statements
// OrgPk is just a helper class defined in the same class
KStream<String, OrgPk> joined = keyedUploadStream.leftJoin(orgs,
        new KeyValueMapper<String, DocumentUpload, String>() { /* derive a (potentially) new key by which to lookup against the table */
            @Override
            public String apply(String key, DocumentUpload value) {
                System.out.println("1. The key passed in is: " + key);
                System.out.println("2. The upload realm passed in is: " + value.getOrgId());
                return value.getOrgId();
            }
        },
        // THIS IS NEVER CALLED WITH A join() AND WHEN CALLED WITH A leftJoin() HAS A NULL ORGANISATION
        new ValueJoiner<DocumentUpload, Organisation, OrgPk>() {
            @Override
            public OrgPk apply(DocumentUpload leftValue, Organisation rightValue) {
                System.out.println("3. Value joiner has been called...");
                if (null == rightValue) {
                    // THIS IS ALWAYS CALLED, SO THERE IS NEVER A "MATCH"
                    System.out.println("   3.1. Orgnisation is NULL");
                    return new OrgPk(leftValue.getRealm(), 1L);
                }
                System.out.println("   3.1. Org is OK");
                // Never reaches here - this is the issue, i.e. there is never a match
                return new OrgPk(leftValue.getOrgId(), rightValue.getPk());
            }
        });
So, the above join (or leftJoin) never matches, even though the two keys are the same! This is the main issue.
Finally, the avro schema for the DocumentUpload is:
{"namespace": "io.confluent.examples.streams.avro",
"type":"record",
"name":"DocumentUpload",
"fields":[
{"name": "orgId", "type":"string"},
{"name":"date", "type":"long", "logicalType":"timestamp-millis"}
]
}
So, in summary:
I have a KStream on a topic with a String key of OrgId
I have a GlobalKTable on a topic with a String key of OrgId also.
The join never works, even though the keys are in the GlobalKTable (at least they are in the topic underlying the GlobalKTable)
Can someone help me? I am pulling my hair out trying to figure this out.
I was able to solve this issue on Windows/IntelliJ by providing a state dir config:
StreamsConfig.STATE_DIR_CONFIG
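For illustration, a minimal sketch of setting that property (my own sketch; the application id and the state directory path are placeholders, and builder is the StreamsBuilder from the question):

import java.util.Properties;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsConfig;

Properties streamsConfig = new Properties();
streamsConfig.put(StreamsConfig.APPLICATION_ID_CONFIG, "org-join-app");
streamsConfig.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
// Point the state store directory at a writable location (the default can be problematic on Windows)
streamsConfig.put(StreamsConfig.STATE_DIR_CONFIG, "C:/tmp/kafka-streams-state");

KafkaStreams streams = new KafkaStreams(builder.build(), streamsConfig);
streams.start();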

how to send avro schema ONLY once in kafka

I'm using the following code (not exactly, but let's assume it) to create a schema and send it to Kafka with a producer.
public static final String USER_SCHEMA = "{"
        + "\"type\":\"record\","
        + "\"name\":\"myrecord\","
        + "\"fields\":["
        + " { \"name\":\"str1\", \"type\":\"string\" },"
        + " { \"name\":\"str2\", \"type\":\"string\" },"
        + " { \"name\":\"int1\", \"type\":\"int\" }"
        + "]}";

public static void main(String[] args) throws InterruptedException {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092");
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");

    Schema.Parser parser = new Schema.Parser();
    Schema schema = parser.parse(USER_SCHEMA);
    Injection<GenericRecord, byte[]> recordInjection = GenericAvroCodecs.toBinary(schema);

    KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props);

    for (int i = 0; i < 1000; i++) {
        GenericData.Record avroRecord = new GenericData.Record(schema);
        avroRecord.put("str1", "Str 1-" + i);
        avroRecord.put("str2", "Str 2-" + i);
        avroRecord.put("int1", i);

        byte[] bytes = recordInjection.apply(avroRecord);

        ProducerRecord<String, byte[]> record = new ProducerRecord<>("mytopic", bytes);
        producer.send(record);
        Thread.sleep(250);
    }
    producer.close();
}
The thing is, the code allows me to send only 1 message with this schema. Then I need to change the schema name in order to send the next message... so right now the name string is randomly generated so that I can send more messages. This is a hack, so I'd like to know the proper way to do this.
I've also looked at how to send messages without a schema (i.e. after one message with the schema has been sent to Kafka, all other messages shouldn't need the schema anymore), but new GenericData.Record(..) expects a schema parameter. If it's null, it throws an error.
So what's the correct way to send Avro messages to Kafka?
Here is another code sample, pretty much identical to mine:
https://github.com/confluentinc/examples/blob/kafka-0.10.0.1-cp-3.0.1/kafka-clients/producer/src/main/java/io/confluent/examples/producer/ProducerExample.java
It also doesn't show how to send without setting a schema.
I didn't understand the line:
The thing is the code allows me to send only 1 message with this schema. Then I need to change the schema name in order to send the next message.
In both of the examples, yours and the Confluent example you supplied, the schema is not sent to Kafka.
In the example you supplied, the schema is used to create a GenericRecord object. You supply the schema because you want to validate the record against it (for example, to validate that you can only put an integer into the int1 field of the GenericRecord object).
In your code, the only difference is that you decided to serialize the data to byte[] yourself, which is probably not needed, since you can delegate this responsibility to the KafkaAvroSerializer, as you can see in the Confluent example.
GenericRecord is an Avro object; it is not something enforced by Kafka. If you want to send any kind of object to Kafka (with a schema or without), you just need to create (or use an existing) serializer that converts your object to byte[] and set that serializer in the properties you create for the producer.
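For illustration, a minimal sketch of delegating the Avro work to the Confluent KafkaAvroSerializer instead of serializing to byte[] yourself (my own sketch, not from the original answer; the topic name and schema registry URL are assumptions):

import java.util.Properties;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AvroProducerExample {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // The Confluent serializer does the Avro encoding and registers the schema for you
        props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081");

        Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"myrecord\",\"fields\":["
                + "{\"name\":\"str1\",\"type\":\"string\"},"
                + "{\"name\":\"str2\",\"type\":\"string\"},"
                + "{\"name\":\"int1\",\"type\":\"int\"}]}");

        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            GenericRecord avroRecord = new GenericData.Record(schema);
            avroRecord.put("str1", "Str 1");
            avroRecord.put("str2", "Str 2");
            avroRecord.put("int1", 1);
            producer.send(new ProducerRecord<>("mytopic", avroRecord));
        }
    }
}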
Usually it is a good practice to send a pointer to the schema with the Avro message itself. You can find the reasoning for it at the following links:
http://www.confluent.io/blog/schema-registry-kafka-stream-processing-yes-virginia-you-really-need-one/