Spring cloud stream, kafka binder - seek on demand - apache-kafka

I use spring cloud stream with kafka. I have a topic X, with partition Y and consumer group Z. Spring boot starter parent 2.7.2, spring kafka version 2.8.8:
@StreamListener("input-channel-name")
public void processMessage(final DomainObject domainObject) {
    // some processing
}
It works fine.
I would like to have an endpoint in the app, that allows me to re-read/re-process (seek right?) all the messages in X.Y (again). But not after rebalancing (ConsumerSeekAware#onPartitionsAssigned) or after app restart (KafkaConsumerProperties#resetOffsets) but on demand like this:
@RestController
@Slf4j
@RequiredArgsConstructor
public class SeekController {
    @GetMapping
    public void seekToBeginningForDomainObject() {
        /**
         * seekToBeginning for X, Y, input-channel-name
         */
    }
}
I just can't achieve that. Is it even possible? I understand that I have to do that on the consumer level, probably on the consumer that is created after the @StreamListener("input-channel-name") subscription, right? But I have no clue how to obtain that consumer. How can I execute a seek on demand to make Kafka send the messages to the consumer again? I just want to reset the offset for X.Y.Z to 0 so that the app loads and processes all the messages again.

https://docs.spring.io/spring-cloud-stream/docs/current/reference/html/spring-cloud-stream-binder-kafka.html#rebalance-listener
KafkaBindingRebalanceListener.onPartitionsAssigned() provides a boolean that indicates whether this is an initial assignment or a rebalance assignment.
Spring cloud stream does not currently support arbitrary seeks at runtime, even though the underlying KafkaMessageDrivenChannelAdapter does support getting access to a ConsumerSeekCallback (which allows arbitrary seeks between polls). It would need an enhancement to the binder to allow access to this code.
It is possible, though, to consume idle container events in an event listener; the event contains the consumer, so you could do arbitrary seeks under those conditions.
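As a rough sketch of the idle-event route: the Kafka binder has an idleEventInterval consumer property that controls how often idle container events are published. The fragment below assumes the binding is named input-channel-name (taken from the question) and uses 5000 ms as an arbitrary value; the exact property name should be checked against the binder version in use.

```properties
# Publish an idle event after 5 s without records on this binding (arbitrary example value)
spring.cloud.stream.kafka.bindings.input-channel-name.consumer.idleEventInterval=5000
```

A listener for ListenerContainerIdleEvent could then check a flag set by the REST endpoint and, when the flag is set, seek on the consumer carried by the event. The obvious limitation is that the seek only happens once the container has been idle for the configured interval.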

Related

How to disable a Spring Cloud Stream Kafka consumer

Here is my situation:
We have a Spring cloud Stream 3 Kafka service connected to multiple topics in the same broker but I want to control connecting to a specific topic based on properties.
Every topic has its own binder and binding but the broker is the same for all.
I tried disabling the binding (that was the only solution I found so far) by using the property below and that works for the StreamListener to not receive messages but the connection to the topic and rebalancing is still happening.
spring:
  cloud:
    stream:
      bindings:
        ...
        anotherBinding:
          consumer:
            ...
            autostartup: false
I wonder if there is any setting on binder level that prevents it from starting. One of the topics consumer should only be available in one of the environments.
Thanks
Disabling the bindings by setting autoStartup to false should work; I am not sure what the issue is.
It doesn't look like you are using the new functional model, but the StreamListener. If you are using the functional model, here is another thing that you can try. You can disable the bindings by not including the corresponding functions at runtime. For example, assume you have the following two consumers.
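@Bean
public Consumer<String> one() { return message -> { /* handle records for the first topic */ }; }
@Bean
public Consumer<String> two() { return message -> { /* handle records for the second topic */ }; }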
When running this app, you can provide the property spring.cloud.function.definition to include/exclude functions. For instance, when you run it with spring.cloud.function.definition=one, then the consumer two will not be activated at all. When running with spring.cloud.function.definition=two, then the consumer one will not be activated.
The downside to the above approach is that if you decide to start the other function once the app started (given autoStartup is false on the other function), it will not work as it was not part of the original bindings through spring.cloud.function.definition. However, based on your requirements, this is probably not an issue as you know which environments are targeted for the corresponding topics. In other words, if you know that consumer one needs to always consume from the topic one, then you don't include consumer two as part of the definition.
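One way this could look in practice, assuming one Spring profile per environment (the profile and file names below are illustrative and not from the question):

```properties
# application-envA.properties (hypothetical profile): only consumer "one" is bound
spring.cloud.function.definition=one

# application-envB.properties (hypothetical profile): both consumers are bound
spring.cloud.function.definition=one;two
```

With this layout the binding for consumer two is never created in the first environment, so there is no connection to its topic and no group rebalancing for it there.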

Apache Kafka Streams : Out-of-Order messages

I have an Apache Kafka 2.6 Producer which writes to topic-A (TA).
I also have a Kafka streams application which consumes from TA and writes to topic-B (TB).
In the streams application, I have a custom timestamp extractor which extracts the timestamp from the message payload.
For one of my failure-handling test cases, I shut down the Kafka cluster while my applications are running.
When the producer application tries to write messages to TA, it cannot because the cluster is down and hence (I assume) buffers the messages.
Let's say it receives 4 messages m1,m2,m3,m4 in increasing time order. (i.e. m1 is first and m4 is last).
When I bring the Kafka cluster back online, the producer sends the buffered messages to the topic, but they are not in order. I receive for example, m2 then m3 then m1 and then m4.
Why is that ? Is it because the buffering in the producer is multi-threaded with each producing to the topic at the same time ?
I assumed that the custom timestamp extractor would help in ordering messages when consuming them. But they do not. Or maybe my understanding of the timestamp extractor is wrong.
I got one solution from SO here, to just stream all events from tA to another intermediate topic (say tA') which will use the TimeStamp extractor to another topic. But I am not sure if this will cause the events to get reordered based on the extracted timestamp.
My code for the Producer is as shown below (I am using Spring Cloud for creating the Producer):
Producer.java
@Service
public class Producer {
    private String topicName = "input-topic";
    private ApplicationProperties appProps;
    @Autowired
    private KafkaTemplate<String, MyEvent> kafkaTemplate;
    public Producer() {
        super();
    }
    @Autowired
    public void setAppProps(ApplicationProperties appProps) {
        this.appProps = appProps;
        this.topicName = appProps.getInput().getTopicName();
    }
    public void sendMessage(String key, MyEvent ce) {
        ListenableFuture<SendResult<String,MyEvent>> future = this.kafkaTemplate.send(this.topicName, key, ce);
    }
}
Why is that ? Is it because the buffering in the producer is multi-threaded with each producing to the topic at the same time ?
By default, the producer allows up to 5 parallel in-flight requests to a broker, so if some requests fail and are retried, the request order might change.
To avoid this re-ordering issue, you can either set max.in.flight.requests.per.connection = 1 (which may have a performance hit) or set enable.idempotence = true.
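The two options can be written down as plain producer settings. The sketch below only builds java.util.Properties objects, so it runs without a Kafka client on the classpath; the class and method names are made up for illustration, while the keys are the standard Kafka producer config names mentioned above. Setting acks=all alongside idempotence is an explicit choice here, since idempotence requires it.

```java
import java.util.Properties;

/** Builds producer settings for the two ordering options (illustrative helper, not a Kafka API). */
class OrderedProducerConfig {

    /** Option 1: at most one in-flight request, so a retried batch cannot be overtaken. */
    static Properties singleInFlight() {
        Properties props = new Properties();
        props.setProperty("max.in.flight.requests.per.connection", "1");
        return props;
    }

    /** Option 2: idempotent producer, which keeps retried batches in order. */
    static Properties idempotent() {
        Properties props = new Properties();
        props.setProperty("enable.idempotence", "true");
        // Idempotence requires acks=all; set explicitly here for clarity.
        props.setProperty("acks", "all");
        return props;
    }
}
```

With a Spring KafkaTemplate, the same keys would go into the producer factory's configuration rather than a hand-built Properties object.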
Btw: you did not say whether your topic has a single partition or multiple partitions, or whether your messages have a key. If your topic has more than one partition and your messages are sent to different partitions, there is no ordering guarantee on read anyway, because offset ordering is only guaranteed within a partition.
I assumed that the custom timestamp extractor would help in ordering messages when consuming them. But they do not. Or maybe my understanding of the timestamp extractor is wrong.
The timestamp extractor only extracts a timestamp. Kafka Streams does not re-order any messages; it always processes messages in offset order.
If not, then what are the specific uses of the timestamp extractor ? Just to associate a timestamp with an event ?
Correct.
I got one solution from SO here, to just stream all events from tA to another intermediate topic (say tA') which will use the TimeStamp extractor to another topic. But I am not sure if this will cause the events to get reordered based on the extracted timestamp.
No, it won't do any reordering. The other SO question is only about changing the timestamp: if you read messages in order a,b,c, the result is written in order a,b,c (just with different timestamps; offset order should be preserved).
This talk explains some more details: https://www.confluent.io/kafka-summit-san-francisco-2019/whats-the-time-and-why/

Roll back mechanism in kafka processor api?

I am using kafka processor api (not DSL)
public class StreamProcessor implements Processor<String, String>
{
    public ProcessorContext context;
    public void init(ProcessorContext context)
    {
        this.context = context;
        context.commit();
        // state store initialized with key, value
    }
    public void process(String key, String val)
    {
        // String.split takes a regex, so the pipe has to be escaped
        String[] topicList = stateStore.get(key).split("\\|");
        for (String topic : topicList)
        {
            context.forward(key, val, To.child(topic));
        } // forward same message to list of topics (1..n topics), rollback if write to some topics failed?
    }
}
Scenario: we are reading data from a source topic, and the stream processor writes data to multiple sink topics (topicList above).
Question: How can a rollback mechanism be implemented using the Kafka Streams Processor API when one or more of the topics in the topicList above fail to receive the message?
What I understand is that the Processor API has a rollback mechanism for each record it failed to send. Can a rollback for an entire batch of failed messages be achieved as well? Since the process method in the Processor interface is called per record rather than per batch, I would surmise it can only be done per record. Is this a correct assumption? If not, please suggest how to achieve per-record and per-batch rollbacks for failed topics using the Processor API.
You would need to implement it yourself. For example, you could use two stores, a main store and a "buffer" store: first update only the buffer store, then call context.forward() to make sure all writes are in the output topic, and afterward merge the buffer store into the main store.
If you need to roll back, you drop the content from the buffer store.
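The two-store idea can be sketched without any Kafka dependency. In the example below plain in-memory maps stand in for the main and buffer state stores, and the class and method names are hypothetical; how a failed write to an output topic is detected is left outside the sketch.

```java
import java.util.HashMap;
import java.util.Map;

/** In-memory stand-in for the "main store + buffer store" idea; maps replace real state stores. */
class BufferedStore {
    private final Map<String, String> main = new HashMap<>();
    private final Map<String, String> buffer = new HashMap<>();

    /** Record a pending update in the buffer only. */
    void stage(String key, String value) {
        buffer.put(key, value);
    }

    /** On success: merge the buffer into the main store and clear it. */
    void commit() {
        main.putAll(buffer);
        buffer.clear();
    }

    /** On failure: drop the buffered updates; the main store is untouched. */
    void rollback() {
        buffer.clear();
    }

    /** Reads only see committed data. */
    String get(String key) {
        return main.get(key);
    }
}
```

In a processor, stage() would be called first, then context.forward() for each target topic, and finally commit() or rollback() depending on the outcome.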

Seek to end of partition while running kafka spring using Group management?

I'm using a kafka spring consumer that is under group management.
I have the following code in my consumer class
public class ConsumerHandler implements Receiver<String, Message>, ConsumerSeekAware {
    @Value("${topic}")
    protected String topic;
    public ConsumerHandler(){}
    @KafkaListener(topics = "${topic}")
    public Message receive(List<ConsumerRecord<String, Message>> messages, Acknowledgment acknowledgment) {
        for (ConsumerRecord<String, Message> message : messages) {
            Message msg = message.value();
            this.handleMessage(any, message);
        }
        acknowledgment.acknowledge();
        return null;
    }
    @Override
    public void registerSeekCallback(ConsumerSeekCallback callback) {
    }
    @Override
    public void onPartitionsAssigned(Map<TopicPartition, Long> assignments, ConsumerSeekCallback callback) {
        for (Entry<TopicPartition, Long> pair : assignments.entrySet()) {
            TopicPartition tp = pair.getKey();
            callback.seekToEnd(tp.topic(),tp.partition());
        }
    }
    @Override
    public void onIdleContainer(Map<TopicPartition, Long> assignments, ConsumerSeekCallback callback) {}
}
This code works great while my consumer is running. However, sometimes the amount of messages being processed is too much and the messages stack up. I've implemented concurrency on my consumers and still sometimes there's delay in the messages over time.
So as a workaround, before I figure out why the delay is happening, I'm trying to keep my consumer up to the latest messages.
I'm having to restart my app to get partition assigned invoked so that my consumer seeks to end and starts processing the latest messages.
Is there a way to seek to end without having to bounce my application?
Thanks.
As explained in the JavaDocs and the reference manual, you can save off the ConsumerSeekCallback passed into registerSeekCallback in a ThreadLocal<ConsumerSeekCallback>.
Then, you can perform arbitrary seek operations whenever you want; however, since the consumer is not thread-safe, you must perform the seeks within your @KafkaListener so they run on the consumer thread - hence the need to store the callback in a ThreadLocal.
In version 2.0 and later, you can add the consumer as a parameter to the @KafkaListener method and perform the seeks directly thereon.
public Message receive(List<ConsumerRecord<String, Message>> messages, Acknowledgment acknowledgment,
        Consumer<?, ?> consumer) {
The current version is 2.1.6.
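To trigger such a seek on demand, something still has to tell the listener thread that a seek is wanted. One possible shape, sketched here with no Spring dependency, is a flag that any thread can set and that the listener checks at the start of each invocation. SeekOps is a hypothetical stand-in for whatever seek handle the listener has (the stored callback or the Consumer parameter), and OnDemandSeeker is an invented name.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical stand-in for the seek handle available on the consumer thread. */
interface SeekOps {
    void seekToEnd(String topic, int partition);
}

/** Hands a seek request from any thread over to the consumer thread. */
class OnDemandSeeker {
    private final AtomicBoolean seekRequested = new AtomicBoolean(false);

    /** Safe to call from any thread, for example from a REST controller. */
    void requestSeekToEnd() {
        seekRequested.set(true);
    }

    /**
     * Call at the top of the listener method, i.e. on the consumer thread.
     * Performs the seek at most once per request and returns true if it did.
     */
    boolean applyPendingSeek(SeekOps ops, String topic, List<Integer> partitions) {
        if (!seekRequested.compareAndSet(true, false)) {
            return false;
        }
        for (int partition : partitions) {
            ops.seekToEnd(topic, partition);
        }
        return true;
    }
}
```

A single flag is only consumed by one thread, so with listener concurrency greater than one this sketch would need one flag per consumer thread. The seek also only takes effect the next time the listener is invoked.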
I have never found or seen a definitive solution for this kind of problem. My approach is to push performance as high as possible, based on the number of messages to be processed and the Kafka parameters.
Say you have an online shopping app where you can bound the number of transactions per day by some N. Then you should make the app work well in the scenario where 1.5*N or 2*N transactions need to sync to the Kafka cluster. You keep this setup until the day your shopping app reaches a new level and you need to upgrade your Kafka system again. Online shopping apps see especially high transaction volumes on promotion or mega-sale days, so those are the days to size your system for.

MDB consume several messages on same transaction

I have an application under a high level of load, and performance is critical.
Now, I'm migrating the application to use EJB. I'm very worried about using EJB to consume messages on queues because transactionality can decrease the performance.
Now, I'm consuming X messages in the same transaction, but I don't know how to do the same using MDBs.
Is it possible to consume a block of messages in an MDB using only one transaction?
It is not guaranteed that the same MDB will process the stream of messages.
I think you can achieve what you want by using a stateless bean with an @Asynchronous invocation, and passing your set of messages.
Something like that:
@Stateless
public class AsynchProcessor {
    @Asynchronous
    public void processMessages(Set<MyMessage> messages) {....}
}
Have the method return a Future if necessary. Then, in your client:
Set<MyMessage> messages = ...
asynchProcessor.processMessages(messages);