Processor topology with error handling and state store rollback - apache-kafka

I have the following topology with a source from a topic, a processor, and a sink to another topic:
StoreBuilder<KeyValueStore<String, String>> storeBuilder = Stores.keyValueStoreBuilder(
    Stores.persistentKeyValueStore("store"),
    Serdes.String(),
    Serdes.String());
Topology topology = new Topology();
topology.addSource("incoming", Serdes.String().deserializer(), Serdes.String().deserializer(), "topic");
topology.addProcessor("incoming_first", () -> new MyProcessor(), "incoming");
topology.addStateStore(storeBuilder, "incoming_first");
topology.addSink("sink", "sink", "incoming_first");
public class MyProcessor implements Processor<String, String> {
    private ProcessorContext context;
    private KeyValueStore<String, String> stateStore;

    @Override
    public void init(ProcessorContext context) {
        this.context = context;
        this.stateStore = (KeyValueStore<String, String>) context.getStateStore("store");
    }

    @Override
    public void process(String key, String value) {
        stateStore.put(key, value);
        ....
        throw new RuntimeException();
        ....
        context.forward(); //forward to sink
    }

    @Override
    public void close() {
    }
}
My question is how to handle an exception that occurs in the processor after the write to the state store. Does Kafka have an error handling mechanism with state store rollback, so that the message can be reprocessed or forwarded to an error topic?
Currently, without any handling, my application dies entirely and I need to restart it.
Also, if I add a try-catch, the message is treated as OK: my state store is updated and the update is sent to the changelog topic.
Do I need a rollback mechanism for the state store?
https://issues.apache.org/jira/browse/KAFKA-7192 KIP says that if exceptions occurred the state store should not be processed with EOS, but this is valid only for the case when my entire application dies.
Thanks in advance!

For any exception that is thrown from a Processor, the corresponding thread will always die. The only way to prevent this is to catch all exceptions and handle them accordingly (whatever the right handling is for your application).
If a thread dies and you restart your application to recover the thread, whether the store is rolled back depends on your configuration. By default, the store is not rolled back. Only if you enable exactly-once semantics by setting the configuration parameter processing.guarantee="exactly_once" is the store rolled back on restart.
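As a rough illustration of the configuration mentioned above, the setting could be placed in the properties handed to Kafka Streams. This is a minimal sketch using only java.util.Properties; the application id and bootstrap address are placeholder values, and the class and method names are made up for the example. Newer Kafka Streams releases also offer an "exactly_once_v2" value, so check which one your version expects.

```java
import java.util.Properties;

class EosConfigSketch {
    /** Builds a properties object with exactly-once processing enabled. */
    static Properties streamsProperties() {
        Properties props = new Properties();
        // Placeholder values, replace with your own application id and brokers.
        props.setProperty("application.id", "my-streams-app");
        props.setProperty("bootstrap.servers", "localhost:9092");
        // Without this line the default guarantee is at-least-once,
        // and the store is not rolled back after a crash.
        props.setProperty("processing.guarantee", "exactly_once");
        return props;
    }
}
```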
If you catch any exception in your Processor code and your business logic requires rolling back the store, you need to implement this yourself: first get the old values from the store, then update the store, and in case of an exception put the old values back into the store to overwrite/undo all your writes.
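The manual undo described above can be sketched roughly as follows. To keep the example self-contained, a plain Map stands in for the KeyValueStore (with a real store the corresponding calls would be get, put and delete), and the riskyStep parameter is a hypothetical placeholder for whatever processing follows the write. Treat it as an illustration of the idea, not as a drop-in Processor.

```java
import java.util.Map;
import java.util.function.BiConsumer;

class ManualRollbackSketch {
    /**
     * Writes value under key, runs the remaining processing, and restores
     * the previous value if that processing throws.
     * Returns true when processing succeeded, false when it was undone.
     */
    static boolean processWithRollback(Map<String, String> store,
                                       String key,
                                       String value,
                                       BiConsumer<String, String> riskyStep) {
        String oldValue = store.get(key);      // 1. remember the old value (null if absent)
        store.put(key, value);                 // 2. update the store
        try {
            riskyStep.accept(key, value);      // 3. the rest of the processing
            return true;
        } catch (RuntimeException e) {
            // 4. undo the write: remove the key if it did not exist before,
            //    otherwise put the old value back
            if (oldValue == null) {
                store.remove(key);
            } else {
                store.put(key, oldValue);
            }
            return false;                      // caller can route the record to an error topic
        }
    }
}
```

Note that the undo is itself a write, so with a changelog-backed store both the update and the restoring write would be expected to reach the changelog topic.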

Related

Error handling in Spring Cloud Kafka Streams

I'm using Spring Cloud Stream with Kafka Streams. Let's say I have a processor which is a Function that converts a KStream of Strings to a KStream of CityProgrammes. It invokes an API to find the City by name and another transformation which finds any events near that city.
Now the problem is that if any error happens during the transformation, the whole application stops. I want to send that one particular message to a DLQ and move along. I've been reading for days and everyone suggests handling errors within the called services, but that is nonsense in my opinion, plus I still need to return a KStream: how do I do that within a catch?
I also looked at UncaughtExceptionHandler, but it is not aware of the message and is only able to restart the processing, which won't skip this invalid message.
This might sound like an A-B problem, so the question rephrased: how do I maintain the flow in a KStream when an exception occurs and send the invalid item to the DLQ?
When it comes to the application-level errors you have, it is up to the application itself how the error is handled. Kafka Streams and the Spring Cloud Stream binder mainly support deserialization and serialization errors at the framework level. Although that is the case, I think your scenario can be handled. If you are using Kafka Client prior to 2.8, here is an SO answer I gave before on something similar: https://stackoverflow.com/a/66749750/2070861
If you are using Kafka/Streams 2.8, here is an idea that you can use. However, the code below should only be used as a starting point. Adjust it according to your use case. Read more on how branching works in Kafka Streams 2.8. The branching API is significantly refactored in 2.8 from the prior versions.
public Function<KStream<?, String>, KStream<?, Foo>> convert() {
    // Single-element holder so the lambda below can assign the converted value.
    Foo[] foo = new Foo[1];
    return input -> {
        final Map<String, ? extends KStream<?, String>> branches =
            input.split(Named.as("foo-")).branch((key, value) -> {
                try {
                    foo[0] = new Foo(); // your API call for the CityProgramme conversion here, possibly.
                    return true;
                }
                catch (Exception e) {
                    // Conversion failed: send the raw value to the DLT and keep it out of the "bar" branch.
                    Message<?> message = MessageBuilder.withPayload(value).build();
                    streamBridge.send("to-my-dlt", message);
                    return false;
                }
            }, Branched.as("bar"))
            .defaultBranch();
        final KStream<?, String> kStream = branches.get("foo-bar");
        return kStream.map((key, value) -> new KeyValue<>("", foo[0]));
    };
}
The default branch is ignored in this code because that only contains the records that threw exceptions. Those were handled by the catch statement above in which we send the records to a DLT programmatically. Finally, we get the good records and map them to a new KStream and send it through the outbound.
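Stripped of the Kafka Streams and Spring types, the pattern above amounts to "try the conversion per record, divert failures, keep going". The following is a plain-Java sketch of that idea only: lists stand in for the stream and the DLT, and the class, method and field names are invented for the illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

class DeadLetterSketch {
    /** Outcome of a run: converted records plus the inputs that failed. */
    static final class Result<I, O> {
        final List<O> converted = new ArrayList<>();
        final List<I> deadLetters = new ArrayList<>();
    }

    /** Converts each input; a failing input is diverted instead of stopping the run. */
    static <I, O> Result<I, O> convertAll(List<I> inputs, Function<I, O> converter) {
        Result<I, O> result = new Result<>();
        for (I input : inputs) {
            try {
                result.converted.add(converter.apply(input));
            } catch (RuntimeException e) {
                // Stand-in for sending the raw record to a dead-letter topic.
                result.deadLetters.add(input);
            }
        }
        return result;
    }
}
```

Because each record carries its own conversion result here, no shared holder is needed between the "try" step and the mapping step.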

Artemis message routing

I'm using ActiveMQ Artemis 2.17.0 and I'm facing routing issues.
I've implemented a plugin that logs in beforeMessageRoute, and I see that some messages are routed from topic.private.abc.task.V1 to topic.abc.rawmessage.V1.
There is no divert set up, and topics and queues are created dynamically by the producers and consumers. There is a setup to map the destination clustered.*.> to virtual topics:
private TransportConfiguration getServerTransportConfiguration() {
Map<String, Object> extraProps = new HashMap<>();
extraProps.put("virtualTopicConsumerWildcards", "clustered.*.>;2");
Map<String, Object> params = new HashMap<>();
params.put("scheme", "tcp");
params.put("port", port);
params.put("host", hostname);
return new TransportConfiguration("org.apache.activemq.artemis.core.remoting.impl.netty.NettyAcceptorFactory", params, "netty-acceptor", extraProps);
}
Both topic.private.abc.task.V1 and topic.abc.rawmessage.V1 are valid topics but they are not supposed to be linked.
What could explain that behavior?
Here is the plugin code:
@Override
public void beforeMessageRoute(Message message, RoutingContext context, boolean direct, boolean rejectDuplicates) throws ActiveMQException {
    Map<String, Object> map = new HashMap<>();
    map.put("RoutingContext", new RoutingContextLogView(context));
    logger.info(mapper.writeValueAsString(map));
    ActiveMQServerPlugin.super.beforeMessageRoute(message, context, direct, rejectDuplicates);
}
public class RoutingContextLogView {
private RoutingContext routingContext;
public RoutingContextLogView(RoutingContext routingContext) {
this.routingContext = routingContext;
}
public String getAddress() {
return routingContext.getAddress() != null ? routingContext.getAddress().toString() : null;
}
public String getPreviousAddress() {
return routingContext.getPreviousAddress() != null ? routingContext.getPreviousAddress().toString() : null;
}
public String getRoutingType() {
return routingContext.getRoutingType() != null ? routingContext.getRoutingType().name() : null;
}
public String getPreviousRoutingType() {
return routingContext.getPreviousRoutingType() != null ? routingContext.getPreviousRoutingType().name() : null;
}
}
Despite the odd logging, the flow followed by the message seems to be OK (i.e. the message is produced to topic.abc.rawmessage.V1 and consumed from topic.abc.rawmessage.V1). I'm just wondering why there is message routing and why the previousAddress in the RoutingContext is wrong.
The RoutingContext object, which is used internally by the broker, is reusable. This is done for performance reasons to prevent having to re-create the RoutingContext for every routing operation no matter what. As one might guess, routing messages is a very common operation in the broker so it pays to optimize it as much as possible. Reusing the RoutingContext means fewer objects are created and thrown away which means less garbage needs to be cleaned up which means fewer pauses and better overall performance by the broker.
The fact that the previousAddress is different here from the address where the current message is going to be routed is not a problem. It just means that the context won't be re-used for this routing operation and therefore will be cleared. As the name suggests, the beforeMessageRoute method is invoked before any routing logic is performed (e.g. clearing the RoutingContext). If you inspect the RoutingContext using afterMessageRoute then you should see that it was cleared and populated with the proper details.
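To make the timing concrete, here is a toy model of a reusable context. It is not Artemis code and does not use the broker's classes; ToyContext and route are invented names, and the logic only mirrors the behaviour described above (a "before" observation sees whatever the previous routing left behind, an "after" observation sees the refreshed state).

```java
class ReusableContextSketch {
    /** Toy stand-in for a reusable routing context. */
    static final class ToyContext {
        String address;          // address of the routing in progress
        String previousAddress;  // address left over from the last routing

        void clear() {
            address = null;
            previousAddress = null;
        }
    }

    /**
     * Simulates one routing operation and returns what a "before" hook
     * and an "after" hook would observe as previousAddress.
     */
    static String[] route(ToyContext ctx, String messageAddress) {
        String seenBefore = ctx.previousAddress;        // before hook: possibly stale
        if (!messageAddress.equals(ctx.previousAddress)) {
            ctx.clear();                                // different address: cannot reuse
        }
        ctx.address = messageAddress;
        ctx.previousAddress = messageAddress;
        String seenAfter = ctx.previousAddress;         // after hook: refreshed
        return new String[] {seenBefore, seenAfter};
    }
}
```

Routing a message for topic.abc.rawmessage.V1 through a context last used for topic.private.abc.task.V1 yields the stale address in the "before" slot and the correct one in the "after" slot, without any message ever moving between the two addresses.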
Message "sending" and message "routing" (both of which have plugin hooks) are related but distinct operations. A message is "sent" in response to a client operation. Sends always result in a route. However, not all routes are the results of sends. A message can be routed due to internal broker operations which do not involve a send (e.g. moving messages around a cluster, expiring a message, sending an undeliverable message to a dead-letter address, using a divert, etc.).
I would caution you against inspecting internal broker state (which can be subtle and nuanced) and assuming a problem exists when everything else indicates that the broker is functioning normally. In this case you said that you were "facing routing issues" and that "some message are routed from topic.private.abc.task.V1 to topic.abc.rawmessage.V1" when, in fact, there was no routing issue and messages were not actually being routed from topic.private.abc.task.V1 to topic.abc.rawmessage.V1. From what I can see everything is in fact functioning normally.

Cannot get custom store connected to a Transformer with Spring Cloud Stream Binder Kafka 3.x

Cannot get custom store connected to my Transformer in Spring Cloud Stream Binder Kafka 3.x (functional style) following examples from here.
I am defining a KeyValueStore as a @Bean with type StoreBuilder<KeyValueStore<String,Long>>:
@Bean
public StoreBuilder<KeyValueStore<String,Long>> myStore() {
    return Stores.keyValueStoreBuilder(
        Stores.persistentKeyValueStore("my-store"), Serdes.String(),
        Serdes.Long());
}
@Bean
@DependsOn({"myStore"})
public MyTransformer myTransformer() {
    return new MyTransformer("my-store");
}
In debugger I can see that the beans get initialised.
In my stream processor function then:
return myStream -> {
return myStream
.peek(..)
.transform(() -> myTransformer())
...
MyTransformer is declared as
public class MyTransformer implements Transformer<String, MyEvent, KeyValue<KeyValue<String,Long>, MyEvent>> {
...
    @Override
    public void init(final ProcessorContext context) {
        this.context = context;
        this.myStore = context.getStateStore(storeName);
    }
Getting the following error when application context starts up from my unit test:
Caused by: org.apache.kafka.streams.errors.StreamsException: Processor KSTREAM-TRANSFORM-0000000002 has no access to StateStore my-store as the store is not connected to the processor. If you add stores manually via '.addStateStore()' make sure to connect the added store to the processor by providing the processor name to '.addStateStore()' or connect them via '.connectProcessorAndStateStores()'. DSL users need to provide the store name to '.process()', '.transform()', or '.transformValues()' to connect the store to the corresponding operator, or they can provide a StoreBuilder by implementing the stores() method on the Supplier itself. If you do not add stores manually, please file a bug report at https://issues.apache.org/jira/projects/KAFKA.
In the application startup logs when running my unit test, I can see that the store seems to get created:
2021-04-06 00:44:43.806 INFO [ main] .k.s.AbstractKafkaStreamsBinderProcessor : state store my-store added to topology
I'm already using pretty much every feature of the Spring Cloud Stream Binder Kafka in my app and from my unit test, everything works very well. Unexpectedly, I got stuck at adding the custom KeyValueStore to my Transformer. It would be great, if you could spot an error in my setup.
The versions I'm using right now:
org.springframework.boot:spring-boot:jar:2.4.4
org.springframework.kafka:spring-kafka:jar:2.6.7
org.springframework.kafka:spring-kafka-test:jar:2.6.7
org.springframework.cloud:spring-cloud-stream-binder-kafka-streams:jar:3.0.4.RELEASE
org.apache.kafka:kafka-streams:jar:2.7.0
I've just tried with
org.springframework.cloud:spring-cloud-stream-binder-kafka-streams:jar:3.1.3-SNAPSHOT
and the issue seems to persist.
In your processor function, when you call .transform(() -> myTransformer()), you also need to provide the state store names so that the store is connected to that transformer. There are overloaded transform methods in the KStream API that take state store names as varargs. I wonder if this is the issue that you are running into. You may want to change that call to .transform(() -> myTransformer(), "my-store"), using the store name passed to Stores.persistentKeyValueStore rather than the bean name.

How shutdown KafkaListener when error occurs

I wrote a Listener in this way
@Autowired
private KafkaListenerEndpointRegistry kafkaListenerEndpointRegistry;

@KafkaListener(containerFactory = "cdcKafkaListenerContainerFactory", errorHandler = "errorHandler")
public void consume(@Payload String message) throws Exception {
    ...
}

@Bean
public KafkaListenerErrorHandler errorHandler() {
    return ((message, e) -> {
        kafkaListenerEndpointRegistry.stop();
        return null;
    });
}
In the @KafkaListener annotation I specified my error handler, which simply stops the consumer.
It seems to work, but I have some questions.
Is there a built-in error handler for this purpose? I've read that ContainerStoppingErrorHandler can be used, but I cannot set it because @KafkaListener's errorHandler accepts only beans of type KafkaListenerErrorHandler.
I see that kafkaListenerEndpointRegistry.stop(); performs a graceful stop, so the partition offset of the consumed message is committed before stopping.
What I would like to know is what happens when kafkaListenerEndpointRegistry.stop(); is called and, before the listener is fully stopped, another message arrives on the topic.
Is this message consumed?
I imagine this scenario:
time0: kafkaListenerEndpointRegistry.stop() is called
time1: a message is pushed to the listened topic
time2: kafkaListenerEndpointRegistry.stop() completes the graceful stop
I'm worried about a message arriving at time1. What would happen in this scenario?
Do not stop the container within the listener.
ContainerStoppingErrorHandler is set on the container factory, not the annotation.
If you are using Spring Boot, just declare the error handler as a bean and boot will wire it in.
Otherwise, add the error handler to the container factory bean.
With this error handler, throwing an exception will immediately stop the container.

Kafka Producer : Handle Exception in Async Send with Callback

I need to catch exceptions from asynchronous sends to Kafka. The Kafka producer API provides the function send(ProducerRecord record, Callback callback). But when I tested this against the following two scenarios:
Kafka Broker Down
Topic not pre created
The callbacks are not getting called. Rather I am getting warning in the code for unsuccessful send (as shown below).
Questions :
So are the callbacks called only for specific exceptions ?
When does Kafka Client try to connect to Kafka broker while async send : on every batch send or periodically ?
[Image: Kafka warning log output]
Note : I am also using linger.ms setting of 25 sec to batch send my records.
import java.io.IOException;
import java.util.Properties;
import org.apache.kafka.clients.producer.Callback;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class ProducerDemo {
    static KafkaProducer<String, String> producer;

    public static void main(String[] args) throws IOException {
        final Logger logger = LoggerFactory.getLogger(ProducerDemo.class);
        Properties properties = new Properties();
        properties.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "127.0.0.1:9092");
        properties.setProperty(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        properties.setProperty(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        properties.setProperty(ProducerConfig.ACKS_CONFIG, "1");
        properties.setProperty(ProducerConfig.LINGER_MS_CONFIG, "30000");
        producer = new KafkaProducer<String, String>(properties);
        String topic = "first_topic";
        for (int i = 0; i < 5; i++) {
            String value = "hello world " + Integer.toString(i);
            String key = "id_" + Integer.toString(i);
            ProducerRecord<String, String> record = new ProducerRecord<String, String>(topic, key, value);
            producer.send(record, new Callback() {
                public void onCompletion(RecordMetadata recordMetadata, Exception e) {
                    // executed every time a record is successfully sent or an exception is thrown
                    if (e == null) {
                        // No Exception
                    } else {
                        // Exception Handling
                    }
                }
            });
        }
        producer.close();
    }
}
You will get those warnings for a non-existing topic as a resilience mechanism provided by KafkaProducer. If you wait a bit longer (should be 60 seconds by default), the callback will eventually be called:
Here's my snippet:
So, when something goes wrong and an async send is not successful, it will eventually fail with a failed future and/or a callback with an exception.
If you are not running it transactionally, it can still mean that some messages from the batch have found their way to the broker while others haven't.
It will most certainly be a problem if you need a blocking-style acknowledgement to the upstream system (like an HTTP ingestion interface, etc.) for every message that is sent to Kafka. The only way to do that is by blocking on every message with the future's get, as described in the documentation:
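The blocking style mentioned above can be sketched with plain java.util.concurrent types. Here fakeSend is a hypothetical stand-in for producer.send(record), which likewise returns a Future; with a real KafkaProducer the blocking call would be producer.send(record).get(). The sketch only shows how a failed send surfaces when you block on the future.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;

class BlockingSendSketch {
    /** Hypothetical stand-in for an async send: the future succeeds or fails. */
    static Future<String> fakeSend(String value, boolean brokerAvailable) {
        CompletableFuture<String> future = new CompletableFuture<>();
        if (brokerAvailable) {
            future.complete("acked:" + value);
        } else {
            future.completeExceptionally(new IllegalStateException("broker unavailable"));
        }
        return future;
    }

    /** Blocks until the outcome of the send is known and reports it to the caller. */
    static String sendAndWait(String value, boolean brokerAvailable) {
        try {
            return fakeSend(value, brokerAvailable).get();   // blocks here
        } catch (ExecutionException e) {
            // The send failed; the original error is the cause.
            return "failed:" + e.getCause().getMessage();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();              // preserve the interrupt flag
            return "interrupted";
        }
    }
}
```

Blocking per record gives the upstream caller a definite answer for each message, at the cost of giving up most of the batching benefit.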
In general, I've noticed a lot of questions related to KafkaProducer delivery semantics and guarantees. It can definitely be documented better.
One more thing, since you mentioned linger.ms:
Note that records that arrive close together in time will generally
batch together even with linger.ms=0 so under heavy load batching will
occur regardless of the linger configuration
For the first question, here is the answer.
As per the Apache Kafka documentation, you can capture the exceptions listed at the link below using the onCompletion method when you implement the Callback interface:
https://kafka.apache.org/25/javadoc/org/apache/kafka/clients/producer/Callback.html
For the second question, the combination of the properties below controls when the records are sent and, as far as I understand, it's the same for synchronous and asynchronous calls.
linger.ms
max.block.ms
https://kafka.apache.org/documentation/#linger.ms
So are the callbacks called only for specific exceptions ?
Yes, that's how it works. From the documentation (2.5.0):
* Fully non-blocking usage can make use of the {@link Callback} parameter to provide a callback that
* will be invoked when the request is complete.
Notice the important part: when the request is complete, which means that the producer must have accepted the record and sent the ProduceRequest to the Kafka broker. Without digging too deep into internals, this means that broker metadata must be present and the partition must exist.
For a formal specification, you'd need to take a good look at send()'s Javadoc and possibly at KafkaProducer's implementation of the doSend method. There you will see that several exceptions can be thrown directly by the submitting call (instead of returning a future and invoking the callback), e.g.:
if broker metadata is not available in timeout given,
if data could not be serialized,
if serialized form was too large, etc.
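Since a failure can therefore arrive on two paths, calling code usually needs to cover both. The sketch below models that with a hypothetical fakeSend (not the Kafka API): a serialization problem is raised by the call itself, while a later failure is delivered to the callback. Which real failures land on which path depends on the client version, so check the Javadoc of the version you use.

```java
import java.util.function.BiConsumer;

class SendFailurePathsSketch {
    /** Hypothetical stand-in for send(record, callback). */
    static void fakeSend(String value, boolean serializable, boolean requestSucceeds,
                         BiConsumer<String, Exception> callback) {
        if (!serializable) {
            // Path 1: the call itself throws, the callback is never invoked.
            throw new IllegalArgumentException("cannot serialize value");
        }
        // Path 2: the outcome is reported through the callback.
        if (requestSucceeds) {
            callback.accept("acked:" + value, null);
        } else {
            callback.accept(null, new RuntimeException("request failed"));
        }
    }

    /** Handles both paths and returns a description of what happened. */
    static String trySend(String value, boolean serializable, boolean requestSucceeds) {
        StringBuilder outcome = new StringBuilder();
        try {
            fakeSend(value, serializable, requestSucceeds, (metadata, e) ->
                outcome.append(e == null ? metadata : "callback error: " + e.getMessage()));
        } catch (RuntimeException e) {
            outcome.append("thrown by send: ").append(e.getMessage());
        }
        return outcome.toString();
    }
}
```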