Seek to end of partition while running Spring Kafka using group management? - apache-kafka

I'm using a Spring Kafka consumer that is under group management.
I have the following code in my consumer class:
public class ConsumerHandler implements Receiver<String, Message>, ConsumerSeekAware {

    @Value("${topic}")
    protected String topic;

    public ConsumerHandler() {}

    @KafkaListener(topics = "${topic}")
    public Message receive(List<ConsumerRecord<String, Message>> messages, Acknowledgment acknowledgment) {
        for (ConsumerRecord<String, Message> message : messages) {
            Message msg = message.value();
            this.handleMessage(msg, message);
        }
        acknowledgment.acknowledge();
        return null;
    }

    @Override
    public void registerSeekCallback(ConsumerSeekCallback callback) {
    }

    @Override
    public void onPartitionsAssigned(Map<TopicPartition, Long> assignments, ConsumerSeekCallback callback) {
        for (Entry<TopicPartition, Long> pair : assignments.entrySet()) {
            TopicPartition tp = pair.getKey();
            callback.seekToEnd(tp.topic(), tp.partition());
        }
    }

    @Override
    public void onIdleContainer(Map<TopicPartition, Long> assignments, ConsumerSeekCallback callback) {}
}
This code works great while my consumer is running. However, sometimes the volume of messages is too high and they stack up. I've added concurrency to my consumers, but the messages still fall behind over time.
So as a workaround, until I figure out why the delay is happening, I'm trying to keep my consumer on the latest messages.
Right now I have to restart my app to get onPartitionsAssigned invoked, so that my consumer seeks to the end and starts processing the latest messages.
Is there a way to seek to the end without having to bounce my application?
Thanks.

As explained in the JavaDocs and the reference manual, you can save off the ConsumerSeekCallback passed into registerSeekCallback in a ThreadLocal<ConsumerSeekCallback>.
Then, you can perform arbitrary seek operations whenever you want; however, since the consumer is not thread-safe, you must perform the seeks within your @KafkaListener method so they run on the consumer thread - hence the need to store the callback in a ThreadLocal.
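For illustration, a minimal sketch of the ThreadLocal approach; the seekRequested flag and requestSeekToEnd() method are assumptions (e.g. driven by a REST endpoint), not part of the framework:

public class ConsumerHandler implements ConsumerSeekAware {

    // one callback per consumer thread
    private final ThreadLocal<ConsumerSeekCallback> seekCallback = new ThreadLocal<>();
    private final AtomicBoolean seekRequested = new AtomicBoolean();

    @Override
    public void registerSeekCallback(ConsumerSeekCallback callback) {
        this.seekCallback.set(callback); // invoked on the consumer thread
    }

    // called from any thread to request a seek before the next batch
    public void requestSeekToEnd() {
        this.seekRequested.set(true);
    }

    @KafkaListener(topics = "${topic}")
    public void receive(List<ConsumerRecord<String, Message>> messages, Acknowledgment acknowledgment) {
        if (this.seekRequested.compareAndSet(true, false)) {
            // safe here: this code runs on the consumer thread
            this.seekCallback.get().seekToEnd("myTopic", 0); // hypothetical topic/partition
        }
        // ... normal processing ...
        acknowledgment.acknowledge();
    }
}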
In version 2.0 and later, you can add the consumer as a parameter to the @KafkaListener method and perform the seeks directly on it:
public Message receive(List<ConsumerRecord<String, Message>> messages, Acknowledgment acknowledgment,
        Consumer<?, ?> consumer) {
The current version is 2.1.6.
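For illustration, a minimal sketch of that variant; the rewindRequested flag is an assumption (set from another thread, e.g. a REST endpoint), not part of the framework:

private final AtomicBoolean rewindRequested = new AtomicBoolean();

@KafkaListener(topics = "${topic}")
public void receive(List<ConsumerRecord<String, Message>> messages,
        Acknowledgment acknowledgment, Consumer<?, ?> consumer) {
    if (rewindRequested.compareAndSet(true, false)) {
        // jump all assigned partitions to their latest offsets;
        // the stale batch is skipped and the next poll starts from the end
        consumer.seekToEnd(consumer.assignment());
        return;
    }
    // ... normal processing ...
    acknowledgment.acknowledge();
}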

I have never found a fixed solution for this kind of problem. The way I handle it is to boost performance as much as possible based on the expected message volume and the Kafka parameters.
Say you have an online shopping app: you can estimate an upper bound on the number of transactions per day, call it N. You should then make the app work well in a scenario where 1.5*N or 2*N transactions need to be synced to the Kafka cluster. You keep this setup until the day your shopping app reaches a new level, at which point you upgrade your Kafka system again. Online shopping apps see an exceptionally high number of transactions on promotion or mega-sale days, so those are the days you prepare your system for.

Related

Spring Cloud Stream, Kafka binder - seek on demand

I use Spring Cloud Stream with Kafka. I have a topic X with partition Y and consumer group Z. Spring Boot starter parent 2.7.2, Spring Kafka version 2.8.8:
@StreamListener("input-channel-name")
public void processMessage(final DomainObject domainObject) {
    // some processing
}
It works fine.
I would like to have an endpoint in the app that allows me to re-read/re-process (seek, right?) all the messages in X.Y again - not after rebalancing (ConsumerSeekAware#onPartitionsAssigned) or after an app restart (KafkaConsumerProperties#resetOffsets), but on demand, like this:
@RestController
@Slf4j
@RequiredArgsConstructor
public class SeekController {

    @GetMapping
    public void seekToBeginningForDomainObject() {
        /*
         * seekToBeginning for X, Y, input-channel-name
         */
    }
}
I just can't achieve that. Is it even possible? I understand that I have to do it at the consumer level, probably on the consumer that is created after the @StreamListener("input-channel-name") subscription, right? But I have no clue how to obtain that consumer. How can I execute a seek on demand to make Kafka send the messages to the consumer again? I just want to reset the offset for X.Y.Z to 0 to make the app load and process all the messages again.
https://docs.spring.io/spring-cloud-stream/docs/current/reference/html/spring-cloud-stream-binder-kafka.html#rebalance-listener
KafkaBindingRebalanceListener.onPartitionsAssigned() provides a boolean to indicate whether this is an initial assignment vs. a rebalance assignment.
Spring Cloud Stream does not currently support arbitrary seeks at runtime, even though the underlying KafkaMessageDrivenChannelAdapter does support getting access to a ConsumerSeekCallback (which allows arbitrary seeks between polls). It would need an enhancement to the binder to allow access to this capability.
It is possible, though, to consume idle container events in an event listener; the event contains the consumer, so you could do arbitrary seeks under those conditions, as sketched below.
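A minimal sketch of that event-listener approach, assuming the container's idleEventInterval is configured so idle events are actually published; the seekRequested flag and requestSeekToBeginning() method are assumptions (e.g. driven by the REST endpoint above):

@Component
public class IdleSeekListener {

    private final AtomicBoolean seekRequested = new AtomicBoolean();

    public void requestSeekToBeginning() {
        this.seekRequested.set(true);
    }

    // Idle events are published on the consumer thread, so it is safe to use the consumer here.
    @EventListener
    public void onIdle(ListenerContainerIdleEvent event) {
        if (this.seekRequested.compareAndSet(true, false)) {
            event.getConsumer().seekToBeginning(event.getTopicPartitions());
        }
    }
}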

Rollback mechanism in the Kafka Processor API?

I am using the Kafka Processor API (not the DSL):
public class StreamProcessor implements Processor<String, String> {

    private ProcessorContext context;
    private KeyValueStore<String, String> stateStore; // state store initialized with key -> "|"-delimited topic list

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        this.context = context;
        this.stateStore = (KeyValueStore<String, String>) context.getStateStore("state-store"); // illustrative store name
    }

    @Override
    public void process(String key, String val) {
        // "|" is a regex metacharacter, so it must be escaped in split()
        String[] topicList = stateStore.get(key).split("\\|");
        // forward the same message to a list of topics (1..n topics); roll back if the write to some topics fails?
        for (String topic : topicList) {
            context.forward(key, val, To.child(topic));
        }
    }

    @Override
    public void close() {}
}
Scenario: we are reading data from a source topic, and the stream processor writes data to multiple sink topics (topicList above).
Question: How do I implement a rollback mechanism using the Kafka Streams Processor API when one or more of the topics in topicList fails to receive the message?
What I understand is that the Processor API has a rollback mechanism for each record it failed to send; but can a rollback for an entire batch of failed messages be achieved as well? Since process() in the Processor interface is called per record rather than per batch, I would surmise it can only be done per record. Is this assumption correct? If not, please suggest how to achieve per-record and per-batch rollbacks for failed topics using the Processor API.
You would need to implement it yourself. For example, you could use two stores: a main store and a "buffer" store. First update only the buffer store, then call context.forward() to make sure all writes reach the output topic, and afterwards merge the "buffer" store into the main store.
If you need to roll back, you drop the content from the buffer store.
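A minimal sketch of that two-store idea, using the same old Processor API as in the question; the store names, the child topic, and the rollback() helper are illustrative, not a definitive implementation:

import org.apache.kafka.streams.processor.Processor;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.To;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;

import java.util.ArrayList;
import java.util.List;

public class BufferedProcessor implements Processor<String, String> {

    private ProcessorContext context;
    private KeyValueStore<String, String> mainStore;
    private KeyValueStore<String, String> bufferStore;

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        this.context = context;
        this.mainStore = (KeyValueStore<String, String>) context.getStateStore("main-store");
        this.bufferStore = (KeyValueStore<String, String>) context.getStateStore("buffer-store");
    }

    @Override
    public void process(String key, String val) {
        bufferStore.put(key, val);                   // 1. stage the update in the buffer first
        context.forward(key, val, To.child("sink")); // 2. then forward to the output topic(s)
        mainStore.put(key, val);                     // 3. on success, merge into the main store
        bufferStore.delete(key);
    }

    // Rolling back means dropping whatever is still staged in the buffer store.
    private void rollback() {
        List<String> staged = new ArrayList<>();
        try (KeyValueIterator<String, String> it = bufferStore.all()) {
            while (it.hasNext()) {
                staged.add(it.next().key);
            }
        }
        staged.forEach(bufferStore::delete);
    }

    @Override
    public void close() {}
}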

Is there any example of a Spring scheduler that reads a Kafka topic?

We're trying to read data from Kafka only during a specified time window (so we have a Kafka consumer), which means avoiding reads at other times. However, we're not sure how to shut down the consumer after the time period has expired. I wonder if there are any examples of how to do that? Many thanks in advance for helping us.
You can disable autoStartup and then manually start and stop the Kafka containers using the KafkaListenerEndpointRegistry start and stop methods.
@KafkaListener Lifecycle Management
public class KafkaConsumer {

    @Autowired
    private KafkaListenerEndpointRegistry registry;

    @KafkaListener(id = "myContainer", topics = "myTopic", autoStartup = "false")
    public void listen(...) { ... }

    @Scheduled(cron = "")
    public void scheduledMethod() {
        registry.start();
        registry.stop();
    }
}
But with the above approach there is no guarantee that all messages from Kafka will be consumed within that time frame (it depends on load and processing speed).
I had the same use case, but I wrote a scheduler that specifies the max poll records for one batch and keeps a counter; if the counter matches the max polled records, I consider processing of the batch finished, as it has processed the records it got during that one poll.
Then I unsubscribe from the topic and close the consumer. The next time the scheduler runs, it will again process up to the specified max poll records limit.
fixedDelayString serves the purpose of starting the scheduler again only after the specified delay, once the previous run has finished.
@EnableScheduling
public class MessageScheduler {

    @Scheduled(initialDelayString = "${fixedInitialDelay.in.milliseconds}", fixedDelayString = "${fixedDelay.in.milliseconds}")
    public void run() {
        /* write your Kafka consumer here, with manual commit */
        /* once your batch has finished processing, unsubscribe and close the consumer */
        kafkaConsumer.unsubscribe();
        kafkaConsumer.close();
    }
}
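For illustration, a minimal sketch of what the body of run() could look like; the bootstrap servers, group id, topic name, max poll records value, and the process() helper are assumptions:

@Scheduled(initialDelayString = "${fixedInitialDelay.in.milliseconds}", fixedDelayString = "${fixedDelay.in.milliseconds}")
public void run() {
    Properties props = new Properties();
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    props.put(ConsumerConfig.GROUP_ID_CONFIG, "scheduled-group");
    props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 500);
    props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, false);
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);

    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
        consumer.subscribe(Collections.singletonList("myTopic"));
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
        for (ConsumerRecord<String, String> record : records) {
            process(record); // hypothetical per-record processing
        }
        consumer.commitSync();   // manual commit once the whole batch is processed
        consumer.unsubscribe();
    } // try-with-resources closes the consumer
}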

Kafka exactly_once processing - do you need your Streams app to produce to a Kafka topic as well?

I have a Kafka Streams app consuming from a Kafka topic. It only consumes and processes the data but doesn't produce anything.
For Kafka's exactly_once processing to work, do you also need your Streams app to write to a Kafka topic?
How can you achieve exactly_once if your Streams app wants to process the message only once but not produce anything?
Providing “exactly-once” processing semantics really means that distinct updates to the state of an operator that is managed by the stream processing engine are only reflected once. “Exactly-once” by no means guarantees that processing of an event, i.e. execution of arbitrary user-defined logic, will happen only once.
The above is an explanation of "exactly-once" semantics.
It is not necessary to always publish the output to a topic in a Kafka Streams application.
When you are using Kafka Streams applications, you have to define an application.id for each one, which uses a consumer in the backend. In the application, you have to configure a few parameters, like setting processing.guarantee to exactly_once and enable.idempotence.
Here are the details:
https://kafka.apache.org/22/documentation/streams/developer-guide/config-streams#processing-guarantee
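For reference, a minimal sketch of enabling it; the application id, bootstrap servers, and the surrounding StreamsBuilder are assumptions:

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-eos-app");        // illustrative id
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
// turns on the transactional/idempotent producer settings under the hood
props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);
KafkaStreams streams = new KafkaStreams(builder.build(), props);     // assumes a StreamsBuilder named builder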
I am not disputing the exactly-once stream pattern - that's the beauty of Kafka Streams - but it is possible to use Kafka Streams without producing to other topics.
The exactly-once stream pattern is simply the ability to execute a read-process-write operation exactly one time. You consume one message at a time, process it, publish to another topic, and commit. The commit is handled by Streams automatically, one message at a time.
Kafka Streams achieves this by setting the following parameters, which cannot be overridden:
isolation.level (read_committed) - consumers will only ever read committed data
enable.idempotence (true) - the producer will always have idempotency enabled
max.in.flight.requests.per.connection (5) - the producer will have at most five in-flight requests per connection
In case of any error in the consumer or producer, Kafka Streams retries a configured number of attempts.
Kafka Streams gives no guarantees about your internal processing logic; you still need to handle that yourself. For example, if there is a DB operation and the DB connection fails, Kafka is not aware of it, so you have to handle that on your own.
As per the pattern definition, yes, we need a consumer, processing, and a producer topic, but in general nothing stops you from not producing output to another topic. You can still consume exactly one item at a time with the default commit interval (DEFAULT_COMMIT_INTERVAL_MS), and again you have to handle transaction failures in your own logic yourself.
Here is a sample example:
StreamsBuilder builder = new StreamsBuilder();
Properties props = getStreamProperties();
KStream<String, String> textLines = builder.stream(Pattern.compile("topic"));
textLines.process(() -> new ProcessInternal());
KafkaStreams streams = new KafkaStreams(builder.build(), props);

final CountDownLatch latch = new CountDownLatch(1);
Runtime.getRuntime().addShutdownHook(new Thread(() -> {
    logger.info("Completed VQM stream");
    streams.close();
    latch.countDown(); // release the main thread on shutdown
}));

logger.info("Streaming start...");
try {
    streams.start();
    latch.await();
} catch (Throwable e) {
    System.exit(1);
}

class ProcessInternal implements Processor<String, String> {

    private ProcessorContext context;

    @Override
    public void init(ProcessorContext context) {
        this.context = context;
    }

    @Override
    public void close() {
        // Any code for clean up would go here.
    }

    @Override
    public void process(String key, String value) {
        // Your transactional process business logic goes here.
    }
}

Apache Kafka: Exactly Once in Version 0.10

To achieve exactly-once processing of messages by a Kafka consumer, I am committing one message at a time, like below:
public void commitOneRecordConsumer(long seconds) {
    KafkaConsumer<String, String> consumer = consumerConfigFactory.getConsumerConfig();
    try {
        while (running) {
            ConsumerRecords<String, String> records = consumer.poll(1000);
            try {
                for (ConsumerRecord<String, String> record : records) {
                    processingService.process(record);
                    consumer.commitSync(Collections.singletonMap(new TopicPartition(record.topic(), record.partition()),
                            new OffsetAndMetadata(record.offset() + 1)));
                    System.out.println("Committed Offset" + ": " + record.offset());
                }
            } catch (CommitFailedException e) {
                // application-specific failure handling
            }
        }
    } finally {
        consumer.close();
    }
}
The above code delegates the processing of each message asynchronously to the class below.
@Service
public class ProcessingService {

    @Async
    public void process(ConsumerRecord<String, String> record) throws InterruptedException {
        Thread.sleep(5000L);
        Map<String, Object> map = new HashMap<>();
        map.put("partition", record.partition());
        map.put("offset", record.offset());
        map.put("value", record.value());
        System.out.println("Processed" + ": " + map);
    }
}
However, this still does not guarantee exactly-once delivery, because if the processing fails, it might still commit other messages, and the previous messages will never be processed and committed. What are my options here?
Original answer for 0.10.2 and older releases (for 0.11 and later releases, see the answer below)
Currently, Kafka cannot provide exactly-once processing out of the box. You can either have at-least-once processing if you commit messages after you have successfully processed them, or you can have at-most-once processing if you commit messages directly after poll(), before you start processing.
(see also paragraph "Delivery Guarantees" in http://docs.confluent.io/3.0.0/clients/consumer.html#synchronous-commits)
However, at-least-once guarantee is "good enough" if your processing is idempotent, i.e., the final result will be the same even if you process a record twice. Examples for idempotent processing would be adding a message to a key-value store. Even if you add the same record twice, the second insert will just replace the first current key-value-pair and the KV-store will still have the correct data in it.
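As a toy illustration of this idempotence argument (the map and record are assumptions, not from the question's code):

Map<String, String> store = new HashMap<>();
store.put(record.key(), record.value()); // first processing
store.put(record.key(), record.value()); // reprocessing after a failure leaves the same final state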
In your example code above, you update a HashMap, and this would be an idempotent operation. You might end up with an inconsistent state in case of failure, for example if only two put calls are executed before the crash; however, this inconsistent state would be fixed on reprocessing the same record again.
The call to println() is not idempotent though, because it is an operation with a "side effect". But I guess the print is for debugging purposes only.
As an alternative, you would need to implement transaction semantics in your user code, which requires you to "undo" (partially executed) operations in case of failure. In general, this is a hard problem.
Update for Apache Kafka 0.11+ (for pre-0.11 releases, see the answer above)
Since 0.11, Apache Kafka supports idempotent producers, transactional producers, and exactly-once processing using Kafka Streams. It also adds a "read_committed" mode to the consumer so it only reads committed messages (and drops/filters aborted messages).
https://kafka.apache.org/documentation/#semantics
https://www.confluent.io/blog/exactly-once-semantics-are-possible-heres-how-apache-kafka-does-it/
https://www.confluent.io/blog/transactions-apache-kafka/
https://www.confluent.io/blog/enabling-exactly-kafka-streams/
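For illustration, a minimal sketch of the transactional producer introduced in 0.11; the transactional id, topic, and bootstrap servers are illustrative:

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "my-tx-id"); // implies enable.idempotence=true
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);

KafkaProducer<String, String> producer = new KafkaProducer<>(props);
producer.initTransactions();
try {
    producer.beginTransaction();
    producer.send(new ProducerRecord<>("output-topic", "key", "value"));
    producer.commitTransaction(); // read_committed consumers only see the record after this
} catch (KafkaException e) {
    producer.abortTransaction();  // aborted records are filtered out for read_committed consumers
}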
Apache Kafka 0.11.0.0 has just been released; it supports exactly-once delivery now.
http://kafka.apache.org/documentation/#upgrade_11_exactly_once_semantics
https://cwiki.apache.org/confluence/display/KAFKA/KIP-98+-+Exactly+Once+Delivery+and+Transactional+Messaging
I think exactly-once processing can be achieved with Kafka 0.10.x itself, but there's a catch. I'm sharing the high-level idea from this book. The relevant content can be found in the section "Seek and Exactly Once Processing" in chapter 4 ("Kafka Consumers - Reading Data from Kafka"). You can view the contents of that book with a (free) safaribooksonline account, or buy it once it's out, or maybe get it from other sources, which we shall not speak about.
Idea:
Think about this common scenario: Your application reads events from Kafka, processes the data, and then stores the results in a database. Suppose that we really don’t want to lose any data, nor do we want to store the same results in the database twice.
It's doable if there is a way to store both the record and the offset in one atomic action. Either both the record and the offset are committed, or neither of them are committed.
To achieve that, we need to write both the record and the offset to the database, in one transaction. Then we’ll know that either we are done with the record and the offset is committed or we are not, and the record will be reprocessed.
Now the only problem is: if the record is stored in a database and not in Kafka, how will our consumer know where to start reading when it is assigned a partition? This is exactly what seek() can be used for. When the consumer starts or when new partitions are assigned, it can look up the offset in the database and seek() to that location.
Sample code from the book:
public class SaveOffsetsOnRebalance implements ConsumerRebalanceListener {

    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        commitDBTransaction();
    }

    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        for (TopicPartition partition : partitions)
            consumer.seek(partition, getOffsetFromDB(partition));
    }
}

consumer.subscribe(topics, new SaveOffsetsOnRebalance(consumer));
consumer.poll(0);

for (TopicPartition partition : consumer.assignment())
    consumer.seek(partition, getOffsetFromDB(partition));

while (true) {
    ConsumerRecords<String, String> records = consumer.poll(100);
    for (ConsumerRecord<String, String> record : records) {
        processRecord(record);
        storeRecordInDB(record);
        storeOffsetInDB(record.topic(), record.partition(), record.offset());
    }
    commitDBTransaction();
}