Is it possible to reset offsets to a topic for a kafka consumer group in a kafka connector? - apache-kafka

My kafka sink connector reads from multiple topics (configured with 10 tasks) and processes upwards of 300 records from all topics. Based on the information held in each record, the connector may perform certain operations.
Here is an example of the key:value pair in a trigger record:
"REPROCESS":"my-topic-1"
Upon reading this record, I would then need to reset the offsets of the topic 'my-topic-1' to 0 in each of its partitions.
I have read in many places that the recommended approach is to create a new KafkaConsumer, call subscribe(...) on the topic, and reset the offsets in a ConsumerRebalanceListener. For example,
public class MyTask extends SinkTask {
    @Override
    public void put(Collection<SinkRecord> records) {
        records.forEach(record -> {
            if (record.key().toString().equals("REPROCESS")) {
                reprocessTopicRecords(record);
            } else {
                // do something else
            }
        });
    }

    private void reprocessTopicRecords(SinkRecord record) {
        KafkaConsumer<JsonNode, JsonNode> reprocessorConsumer =
                new KafkaConsumer<>(reprocessorProps, deserializer, deserializer);
        reprocessorConsumer.subscribe(Arrays.asList(record.value().toString()),
                new ConsumerRebalanceListener() {
                    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {}
                    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                        // do offset reset here
                    }
                }
        );
    }
}
However, the above strategy does not work for my case because:
1. It depends on a group rebalance taking place (does not always happen)
2. The 'partitions' passed to the onPartitionsAssigned method are dynamically assigned partitions, meaning these are only a subset of the full set of partitions that will need to have their offsets reset. For example, this SinkTask will be assigned only 2 of the 8 partitions that hold the records for 'my-topic-1'.
I've also looked into using assign() but this is not compatible with the distributed consumer model (consumer groups) in the SinkConnector/SinkTask implementation.
I am aware that the kafka command line tool kafka-consumer-groups can do exactly what I want (I think):
https://gist.github.com/marwei/cd40657c481f94ebe273ecc16601674b
To summarize, I want to reset the offsets of all partitions for a given topic using Java APIs and let the Sink Connector pick up the offset changes and continue to do what it has been doing (processing records).
Thanks in advance.

I was able to achieve resetting offsets for a kafka connect consumer group by using a series of Confluent's kafka-rest-proxy APIs: https://docs.confluent.io/current/kafka-rest/api.html
This implementation no longer requires the 'trigger record' approach first described in the original post and is purely REST API based.
1. Temporarily delete the Kafka connector (this deletes the connector's consumers)
2. Create a consumer instance for the same consumer group ("connect-<connector name>")
3. Have the instance subscribe to the topic you want to reset
4. Do a dummy poll ('subscribe' is evaluated lazily)
5. Reset the consumer group's offsets for the specified topic
6. Do a dummy poll ('seek' is evaluated lazily) and commit the current offset state (in the proxy) for the consumer
7. Re-create the Kafka connector (with the same connector name); after rebalancing, the consumers will join the group and read from the last committed offset (starting from 0)
8. Delete the temporary consumer instance
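For illustration, here is a rough Java sketch of steps 2-6 and 8 above against the REST proxy's v2 consumer API. The proxy URL, group name, topic, instance name and the single listed partition are assumptions for this example, not values from the original setup; check the REST proxy docs linked above for the exact request/response formats.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RestProxyOffsetReset {
    static final HttpClient CLIENT = HttpClient.newHttpClient();
    static final String V2 = "application/vnd.kafka.v2+json";
    // assumed values, adjust to your environment
    static final String BASE = "http://localhost:8082/consumers/connect-my-sink-connector";

    public static void main(String[] args) throws Exception {
        // 2. create a temporary consumer instance in the connector's group
        post(BASE, "{\"name\":\"offset-reset\",\"format\":\"json\"}");
        String instance = BASE + "/instances/offset-reset";
        // 3. subscribe it to the topic whose offsets should be reset
        post(instance + "/subscription", "{\"topics\":[\"my-topic-1\"]}");
        // 4. dummy poll, because the subscription is evaluated lazily
        get(instance + "/records");
        // 5. seek the instance to the beginning of each partition (list them all here)
        post(instance + "/positions/beginning",
             "{\"partitions\":[{\"topic\":\"my-topic-1\",\"partition\":0}]}");
        // 6. dummy poll (seek is also lazy), then commit the proxy's current offsets, per the steps above
        get(instance + "/records");
        post(instance + "/offsets", "");
        // 8. delete the temporary consumer instance (step 7 is re-creating the connector)
        CLIENT.send(HttpRequest.newBuilder(URI.create(instance)).DELETE().build(),
                HttpResponse.BodyHandlers.ofString());
    }

    static void post(String url, String body) throws Exception {
        CLIENT.send(HttpRequest.newBuilder(URI.create(url))
                .header("Content-Type", V2)
                .POST(HttpRequest.BodyPublishers.ofString(body)).build(),
                HttpResponse.BodyHandlers.ofString());
    }

    static void get(String url) throws Exception {
        CLIENT.send(HttpRequest.newBuilder(URI.create(url))
                .header("Accept", "application/vnd.kafka.json.v2+json").GET().build(),
                HttpResponse.BodyHandlers.ofString());
    }
}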
If you are able to use the CLI, Steps 2-6 can be replaced with:
kafka-consumer-groups --bootstrap-server <kafkahost:port> --group <group_id> --topic <topic_name> --reset-offsets --to-earliest --execute
As for those of you trying to do this in the kafka connector code through native Java APIs, you're out of luck :-(

You're looking for the seek method. Either to an offset
consumer.seek(new TopicPartition("topic-name", partition), offset);
Or seekToBeginning
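For example, a minimal sketch that rewinds a whole topic to its earliest offsets and commits those positions for a given group id; the topic, group id and broker address are placeholders:
Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-reprocessing-group");

try (KafkaConsumer<byte[], byte[]> consumer =
         new KafkaConsumer<>(props, new ByteArrayDeserializer(), new ByteArrayDeserializer())) {
    List<TopicPartition> partitions = consumer.partitionsFor("my-topic-1").stream()
            .map(pi -> new TopicPartition(pi.topic(), pi.partition()))
            .collect(Collectors.toList());
    consumer.assign(partitions);
    // resolve the earliest offset per partition, seek to it explicitly,
    // then commit so the new positions are stored for this group id
    consumer.beginningOffsets(partitions).forEach(consumer::seek);
    consumer.commitSync();
}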
However, I feel like you'd be competing with the Connect Sink API's consumer group. In other words, assuming you set up the consumer with a separate group id, you're essentially consuming records from the source topic twice: once by Connect, and once by your own consumer instance.
Unless you explicitly seek Connect's own consumer instance as well (which is not exposed), you'd end up in a weird state. For example, your task would only execute on new records in the topic while your own consumer is looking at an old offset, or you'd keep receiving even newer events while still processing old ones.
Also, you might eventually get a reprocess event at the very beginning of the topic, due to retention policies expiring old records, for example, causing your consumer to make no progress at all and to constantly rebalance its group by seeking to the beginning.

We had to do a very similar offset resetting exercise.
KafkaConsumer.seek() combined with KafkaConsumer.commitSync() worked well.
There is another option that is worth mentioning, if you are dealing with lots of topics and partitions (javadoc):
AdminClient.alterConsumerGroupOffsets(
String groupId,
Map<TopicPartition,OffsetAndMetadata> offsets
)
We were lucky because we had the luxury to stop the Kafka Connect instance for a while, so there's no consumer group competing.
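A rough sketch of that approach, assuming the consumer group is stopped; the broker address, topic and group id are placeholders, and alterConsumerGroupOffsets requires kafka-clients 2.5 or newer:
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class ResetGroupOffsets {
    public static void main(String[] args) throws Exception {
        try (AdminClient admin = AdminClient.create(
                Map.of(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"))) {

            // all partitions of the topic whose offsets should be rewound
            List<TopicPartition> partitions = admin.describeTopics(List.of("my-topic-1"))
                    .all().get().get("my-topic-1").partitions().stream()
                    .map(p -> new TopicPartition("my-topic-1", p.partition()))
                    .collect(Collectors.toList());

            // earliest available offset per partition
            Map<TopicPartition, OffsetSpec> request = partitions.stream()
                    .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.earliest()));
            Map<TopicPartition, OffsetAndMetadata> newOffsets = new HashMap<>();
            admin.listOffsets(request).all().get()
                    .forEach((tp, info) -> newOffsets.put(tp, new OffsetAndMetadata(info.offset())));

            // overwrite the (stopped) group's committed offsets
            admin.alterConsumerGroupOffsets("connect-my-sink-connector", newOffsets).all().get();
        }
    }
}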

Related

Re-read a Kafka topic

I'd like to re-read all Kafka events programmatically. I know there is an Application Reset Tool, but from what I understand, that requires me to shut down my application. I can't shut the application down in my use-case.
Is there a way to make my application re-read all events on a Kafka topic? Examples or code snippets would be much appreciated. Preferably but not necessarily using Kafka Streams.
You cannot re-read a topic with Kafka Streams, but with a "plain" Kafka consumer you can position it at any valid offset.
Something like
final Map<Integer, TopicPartition> partitions = new HashMap<>();
// get all partitions for the topic
for (final PartitionInfo partition : consumer.partitionsFor("your_topic")) {
    final TopicPartition tp = new TopicPartition("your_topic", partition.partition());
    partitions.put(partition.partition(), tp);
}
consumer.assign(partitions.values());
consumer.seekToBeginning(partitions.values());
Consumers are required to stop in order to avoid running into race conditions between consumers committing offsets and AdminClient altering offsets.
If you wish to keep the consumer group id, you can use the Kafka consumer seek APIs to find the earliest offsets, and then use AdminClient to alter the consumer group offsets.
The kafka-consumer-groups --reset-offsets implementation should be a good example of how to accomplish this: https://github.com/apache/kafka/blob/85b6545b8159885c57ab67e08b7185be8a607988/core/src/main/scala/kafka/admin/ConsumerGroupCommand.scala#L446-L469
Otherwise, using another consumer group id should be enough to consume from the beginning, if your auto.offset.reset is set to earliest.
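For that last option, a minimal sketch of the consumer configuration (the group id prefix and topic name are placeholders):
Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ConsumerConfig.GROUP_ID_CONFIG, "replay-" + UUID.randomUUID()); // brand-new group id
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");           // no committed offsets -> start from the beginning
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Collections.singletonList("your_topic"));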

How to handle exceptions and message reprocessing in apache kafka

I have a Kafka cluster. There is only one topic, and 3 different consumer groups take the same messages from this topic and process them differently, according to their own logic.
Is there any problem with creating the same topic for multiple consumer groups?
I have this doubt because I am trying to implement an exception topic and reprocess failed messages.
Suppose I have the message "secret" in topic A.
All 3 of my consumer groups took the message "secret".
2 of my consumer groups successfully completed processing of the message,
but one of my consumer groups failed to process the message,
so I kept the message in the topic "failed_topic".
I want to try to process this message again for the failed consumer. But if I keep this message in my actual topic A, the other 2 consumer groups will process it a second time.
Can someone please let me know how I can implement proper reprocessing for this scenario?
First of all, in Kafka each consumer group has its own offset for each subscribed topic-partition, and these offsets are managed separately by each consumer group. So a failure in one consumer group doesn't affect the other consumer groups.
You can check current offsets for a consumer group with this cli command:
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group my-group
is there any problem with creating same topic for multiple consumer groups
No, there is no problem. Actually this is the normal behaviour of the topic-based publisher/subscriber pattern.
To implement re-processing logic there are some important points to consider:
You should keep calling poll() even while you are re-processing the same message. Otherwise, after max.poll.interval.ms your consumer will be considered dead and its partitions will be revoked.
By calling poll() you get messages that your consumer group has not read yet. Each poll() returns up to max.poll.records messages; the next poll() returns the next batch. So to reprocess failed messages you need to call the seek method:
public void seek(TopicPartition partition, long offset): Overrides the fetch offsets that the consumer will use on the next poll(timeout)
Ideally the number of consumers in your consumer group should be equal to the number of partitions of the subscribed topic. Kafka will take care of assigning partitions to consumers evenly (one partition per consumer). But even if this condition is satisfied at the very beginning, a consumer may die after some time and Kafka may assign more than one partition to a single consumer. This can lead to problems: suppose your consumer is responsible for two partitions; when you poll() you will get messages from both of these partitions, and when a message cannot be consumed you should seek all of the assigned partitions (not just the one the failed message came from). Otherwise you may skip some messages.
Let's try to write some pseudocode to implement re-process logic in case of exception by using these informations:
public void consumeLoop() {
    while (true) {
        // max.poll.records = 1, so each poll returns at most one record
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(1000));
        for (ConsumerRecord<String, String> currentRecord : records) {
            TopicPartition topicPartition =
                    new TopicPartition(currentRecord.topic(), currentRecord.partition());
            try {
                processMessage(currentRecord);
            } catch (Exception e) {
                // rewind so the same record is fetched again on the next poll
                consumer.seek(topicPartition, currentRecord.offset());
                continue;
            }
            consumer.commitSync(Collections.singletonMap(
                    topicPartition, new OffsetAndMetadata(currentRecord.offset() + 1)));
        }
    }
}
Notes about the code:
max.poll.records is set to 1 to keep the seek logic simple.
On every exception we seek and then poll again to get the same message (we have to keep polling to be considered alive by Kafka).
auto.commit is disabled.
is there any problem with creating same topic for multiple consumer groups?
Not at all
if i keep this message in my actual topic A, the other 2 consumer groups process this message second time.
Exactly, and you would create a loop (third group would fail, put it back, 2 accept it, third fails again, etc, etc)
Basically, you are asking about a "dead-letter queue" which would be a specific topic for each consumer group. Kafka can hold tens of thousands of topics, so this shouldn't be an issue in your use-case.
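As a rough illustration of that idea (the dead-letter topic name and the surrounding consume loop are assumptions, process() stands in for your handler, and `producer` is a pre-configured KafkaProducer):
for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(100))) {
    try {
        process(record); // this group's own processing logic
    } catch (Exception e) {
        // only this group's failed records land here; the other two groups never see them
        producer.send(new ProducerRecord<>("failed_topic_group_3", record.key(), record.value()));
    }
}
consumer.commitSync();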

How to implement Exactly-Once Kafka Consumer without manually assigning partitions

I was going through this article which explains how to ensure message is processed exactly once by doing following:
Read (topic, partition, offset) from database on start/restart
Read message from specific (topic, partition, offset)
Atomically do following things:
Processing message
Commit offset to database as (topic, partition, offset)
As you can see, it explicitly specifies which partition to read messages from. I feel this is not a good idea, as it does not allow Kafka to assign a fair share of the partitions to the active consumers. I am not able to come up with logic to implement similar functionality without explicitly specifying partitions while polling the Kafka topic inside the consumer. Is it possible to do this?
Good analysis. You have a very good point, and if possible you should certainly let Kafka handle the partition assignment to consumers.
There is an alternative to consumer.Assign(Partition[]). The Kafka brokers will notify your consumers when a partition is revoked from or assigned to them. For example, the dotnet client library has 'SetPartitionsRevokedHandler' and 'SetPartitionsAssignedHandler' handlers that consumers can use to manage their offsets.
When a partition is revoked, persist your last processed offset for each partition being revoked to the database. When a new partition is assigned, get the last processed offset for that partition from the database and use that.
C# Example:
public class Program
{
    public static void Main(string[] args)
    {
        using (
            var consumer = new ConsumerBuilder<string, string>(config)
                .SetErrorHandler(ErrorHandler)
                .SetPartitionsRevokedHandler(HandlePartitionsRevoked)
                .SetPartitionsAssignedHandler(HandlePartitionsAssigned)
                .Build()
        )
        {
            while (true)
            {
                consumer.Consume();
            }
        }
    }

    public static IEnumerable<TopicPartitionOffset> HandlePartitionsRevoked(
        IConsumer<string, string> consumer,
        List<TopicPartitionOffset> currentTopicPartitionOffsets)
    {
        // persist the last processed offset for each partition in
        // 'currentTopicPartitionOffsets' to the database here
        Persist(currentTopicPartitionOffsets);
        return currentTopicPartitionOffsets;
    }

    public static IEnumerable<TopicPartitionOffset> HandlePartitionsAssigned(
        IConsumer<string, string> consumer,
        List<TopicPartition> tps)
    {
        // look up the last processed offset for each newly assigned partition
        List<TopicPartitionOffset> tpos = FetchOffsetsFromDbForTopicPartitions(tps);
        return tpos;
    }
}
Java Example from the ConsumerRebalanceListener Docs:
If writing in Java, there is a 'ConsumerRebalanceListener' interface that you can implement. You then pass your implementation of the interface into the consumer.subscribe(topics, listener) method. The example below is taken verbatim from the Kafka docs linked above:
public class SaveOffsetsOnRebalance implements ConsumerRebalanceListener {
    private Consumer<?,?> consumer;

    public SaveOffsetsOnRebalance(Consumer<?,?> consumer) {
        this.consumer = consumer;
    }

    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        // save the offsets in an external store using some custom code not described here
        for (TopicPartition partition : partitions)
            saveOffsetInExternalStore(consumer.position(partition));
    }

    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        // read the offsets from an external store using some custom code not described here
        for (TopicPartition partition : partitions)
            consumer.seek(partition, readOffsetFromExternalStore(partition));
    }
}
If my understanding is correct, you would call the Java version like this: consumer.subscribe(Collections.singletonList("my-topic"), new SaveOffsetsOnRebalance(consumer)).
For more information, see the 'Storing Offsets Outside Kafka' section of the kafka docs.
Here's an excerpt from those docs that summarizes how to store the partitions and offsets for exactly-once processing:
Each record comes with its own offset, so to manage your own offset you just need to do the following:
Configure enable.auto.commit=false
Use the offset provided with each ConsumerRecord to save your position.
On restart restore the position of the consumer using seek(TopicPartition, long).
This type of usage is simplest when the partition assignment is also done manually (this would be likely in the search index use case described above). If the partition assignment is done automatically, special care is needed to handle the case where partition assignments change. This can be done by providing a ConsumerRebalanceListener instance in the call to subscribe(Collection, ConsumerRebalanceListener) and subscribe(Pattern, ConsumerRebalanceListener). For example, when partitions are taken from a consumer, the consumer will want to commit its offset for those partitions by implementing ConsumerRebalanceListener.onPartitionsRevoked(Collection). When partitions are assigned to a consumer, the consumer will want to look up the offset for those new partitions and correctly initialize the consumer to that position by implementing ConsumerRebalanceListener.onPartitionsAssigned(Collection). Another common use for ConsumerRebalanceListener is to flush any caches the application maintains for partitions that are moved elsewhere.

Kafka Consumer API jumping offsets

I am using Kafka version 2.0 and the Java consumer API to consume messages from a topic. We are using a single-node Kafka server with one consumer per partition. I have observed that the consumer is losing some of the messages.
The scenario is:
The consumer polls the topic.
I have created one consumer per thread.
It fetches the messages and gives them to a handler to handle the message.
Then it commits the offsets, following at-least-once Kafka consumer semantics.
In parallel, I have another consumer running with a different group-id. In this consumer, I'm simply increasing the message counter and committing the offset. There's no message loss in this consumer.
try {
    //kafkaConsumer.registerTopic();
    consumerThread = new Thread(() -> {
        final String topicName1 = "topic-0";
        final String topicName2 = "topic-1";
        final String topicName3 = "topic-2";
        final String topicName4 = "topic-3";
        String groupId = "group-0";
        final Properties consumerProperties = new Properties();
        consumerProperties.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "192.168.13.49:9092");
        consumerProperties.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        consumerProperties.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        consumerProperties.put(ConsumerConfig.GROUP_ID_CONFIG, groupId);
        consumerProperties.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "100");
        consumerProperties.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        consumerProperties.put(ConsumerConfig.AUTO_COMMIT_INTERVAL_MS_CONFIG, 1000);
        try {
            consumer = new KafkaConsumer<>(consumerProperties);
            consumer.subscribe(Arrays.asList(topicName1, topicName2, topicName3, topicName4));
        } catch (KafkaException ke) {
            logTrace(MODULE, ke);
        }
        while (service.isServiceStateRunning()) {
            ConsumerRecords<String, byte[]> records = consumer.poll(Duration.ofMillis(100));
            for (TopicPartition partition : records.partitions()) {
                List<ConsumerRecord<String, byte[]>> partitionRecords = records.records(partition);
                for (ConsumerRecord<String, byte[]> record : partitionRecords) {
                    processMessage(record);
                }
            }
            consumer.commitSync();
        }
        kafkaConsumer.closeResource();
    }, "KAKFA_CONSUMER");
} catch (Exception e) {
}
There seems to be a problem with usage of subscribe() here.
subscribe() is used to subscribe to topics, not to partitions. To use specific partitions you need to use assign(). Here is an extract from the documentation:
public void subscribe(java.util.Collection topics)
Subscribe to the given list of topics to get dynamically assigned partitions. Topic subscriptions are not incremental. This list will replace the current assignment (if there is one). It is not possible to combine topic subscription with group management with manual partition assignment through assign(Collection). If the given list of topics is empty, it is treated the same as unsubscribe(). This is a short-hand for subscribe(Collection, ConsumerRebalanceListener), which uses a noop listener. If you need the ability to seek to particular offsets, you should prefer subscribe(Collection, ConsumerRebalanceListener), since group rebalances will cause partition offsets to be reset. You should also provide your own listener if you are doing your own offset management since the listener gives you an opportunity to commit offsets before a rebalance finishes.
public void assign(java.util.Collection partitions)
Manually assign a list of partitions to this consumer. This interface does not allow for incremental assignment and will replace the previous assignment (if there is one). If the given list of topic partitions is empty, it is treated the same as unsubscribe(). Manual topic assignment through this method does not use the consumer's group management functionality. As such, there will be no rebalance operation triggered when group membership or cluster and topic metadata change. Note that it is not possible to use both manual partition assignment with assign(Collection) and group assignment with subscribe(Collection, ConsumerRebalanceListener).
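For completeness, manual assignment (as opposed to the subscribe() call in the question) looks roughly like this; the topic and partition number are only illustrative:
TopicPartition partition0 = new TopicPartition("topic-0", 0);
consumer.assign(Arrays.asList(partition0));
// the consumer now fetches only this partition; group management and rebalancing are not used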
You probably shouldn't do what you're doing. You should use subscribe, and use multiple partitions per topic, and multiple consumers in the group for high availability, and allow the consumer to handle the offsets for you.
You don't describe why you're trying to process your topics in this custom way. It's advanced and leads to issues.
The timestamps on your instances should not have to be synchronised to do normal topic processing.
If you're looking for more performance or to isolate records more carefully to avoid "head of line blocking" consider something like Parallel Consumer (PC).
It also tracks per record acknowledgement, among other things. Check out Parallel Consumer on GitHub (it's open source BTW, and I'm the author).

Questions about Kafka Consumer of Transient messages via Akka-Streams on Multiple Nodes

We are using Kafka to store messages that are produced by a node in our cluster and distributed to all nodes in the cluster. I have it mostly working with akka-streams, but there are a couple of questions I have to tie this up. There are some constraints to this.
First of all, the message has to be consumed by every node in the cluster but produced by only one node. I understand I can assign each node a group id, probably its node ID, which means each node will get the message. That's sorted. But here are the questions.
The data is extremely transient and fairly large (just under a meg) and cannot be compressed further or broken up. If there is a new message on the topic, the old one is pretty much trash. How can I limit the topic to basically just one current message maximum?
Given that the data is necessary for the node to start, I need to consume the latest message on the topic no matter whether I have consumed it before, and hopefully without creating a unique group id every time I start the server. Is this possible, and if so, how can it be done?
Finally, the data is usually on the topic but on occasion it is not there and I, ideally, need to be able to check if there is a message there and if not ask the producer to create the message. Is this possible?
This is the code I am currently using to start the consumer:
private Control startMatrixConsumer() {
    final ConsumerSettings<Long, byte[]> matrixConsumerSettings = ConsumerSettings
        .create(services.actorSystem(), new LongDeserializer(), new ByteArrayDeserializer())
        .withBootstrapServers(services.config().getString("kafka.bootstrapServers"))
        .withGroupId("group1") // todo put in the conf ??
        .withProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest");
    final String topicName = Matrix.class.getSimpleName() + '-' + eventId;
    final AutoSubscription subscription = Subscriptions.topics(topicName);
    return Consumer.plainSource(matrixConsumerSettings, subscription)
        .named(Matrix.class.getSimpleName() + "-Kafka-Consumer-" + eventId)
        .map(data -> {
            final Matrix matrix = services.kryoDeserialize(data.value(), Matrix.class);
            log.debug(format("Received %s for event %d from Kafka", Matrix.class.getSimpleName(), matrix.getEventId()));
            return matrix;
        })
        .filter(Objects::nonNull)
        .to(Sink.actorRef(getSelf(), NotUsed.getInstance()))
        .run(ActorMaterializer.create(getContext()));
}
Thanks a bunch.
All the message has to be consumed by every node in the cluster but produced by only one.
You are correct, you can achieve this by having a unique group id per node.
How can I limit the topic to basically just one message currently maximum?
Kafka provides compacted topics.
A compacted topic maintains only the most recent message for a given key. For instance, Kafka consumers store their offsets in a compacted topic.
In your case, produce every message with the same key, and the Kafka Log Cleaner will delete old messages. Please be aware that compaction is performed periodically, so you can end up with two (or more) messages with the same key for a short period of time (depending on your Log Cleaner configuration).
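A small sketch of that setup, assuming an existing AdminClient (`admin`) and producer; the topic name, key and replication settings are placeholders:
// create the topic with log compaction enabled
NewTopic topic = new NewTopic("matrix-topic", 1, (short) 1)
        .configs(Map.of(TopicConfig.CLEANUP_POLICY_CONFIG, TopicConfig.CLEANUP_POLICY_COMPACT));
admin.createTopics(List.of(topic)).all().get();

// always produce with the same key so compaction keeps only the latest value
producer.send(new ProducerRecord<>("matrix-topic", "matrix", payloadBytes));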
I need to consume the latest message on the topic no matter whether I have consumed it before.
You can achieve this by not committing the consumer offset (enable.auto.commit set to false) and setting auto.offset.reset to earliest. With one message in your compacted topic and a consumer that starts from the beginning of the topic, that message is always consumed after the node starts.
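Applied to the ConsumerSettings from the question, that suggestion would look roughly like this (the per-node group id scheme is an assumption):
final ConsumerSettings<Long, byte[]> matrixConsumerSettings = ConsumerSettings
    .create(services.actorSystem(), new LongDeserializer(), new ByteArrayDeserializer())
    .withBootstrapServers(services.config().getString("kafka.bootstrapServers"))
    .withGroupId("matrix-consumer-" + nodeId)                           // stable per-node group id
    .withProperty(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false")    // never commit offsets
    .withProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest"); // always re-read from the beginning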
I need to be able to check if there is a message there and if not ask the producer to create the message.
Unfortunately, I am not aware of any Kafka functionality that could help you with that. Most of the time Kafka is used to decouple producers and consumers.