I'd like to re-read all Kafka events programmatically. I know there is an Application Reset Tool, but from what I understand, that requires me to shut down my application. I can't shut the application down in my use case.
Is there a way to make my application re-read all events on a Kafka topic? Examples or code snippets would be much appreciated. Preferably but not necessarily using Kafka Streams.
You cannot re-read a topic with Kafka Streams, but with a "plain" Kafka consumer you can position the consumer at any valid offset.
Something like
import java.util.ArrayList;
import java.util.List;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.TopicPartition;

// assumes an already-configured KafkaConsumer instance named "consumer"
final List<TopicPartition> partitions = new ArrayList<>();
// get all partitions for the topic
for (final PartitionInfo partition : consumer.partitionsFor("your_topic")) {
    partitions.add(new TopicPartition("your_topic", partition.partition()));
}
consumer.assign(partitions); // assign() bypasses consumer-group rebalancing
consumer.seekToBeginning(partitions);
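The rewind takes effect lazily, on the next poll(); a minimal continuation (assuming String deserializers):
while (true) {
    final ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
    // the first poll() after seekToBeginning() starts from offset 0
    records.forEach(r ->
            System.out.printf("%s-%d@%d: %s%n", r.topic(), r.partition(), r.offset(), r.value()));
}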
Consumers are required to stop in order to avoid race conditions between consumers committing offsets and AdminClient altering offsets.
If you wish to keep the consumer group id, you can use the Kafka consumer's offset-lookup APIs (for example, beginningOffsets) to find the earliest offsets. AdminClient can then be used to alter the consumer group's offsets.
The kafka-consumer-groups --reset-offsets implementation is a good example of how to accomplish this: https://github.com/apache/kafka/blob/85b6545b8159885c57ab67e08b7185be8a607988/core/src/main/scala/kafka/admin/ConsumerGroupCommand.scala#L446-L469
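For illustration, a minimal sketch of that approach (topic and group names are placeholders; consumerProps is an assumed config with bootstrap servers and byte-array deserializers; the group must have no active members while its offsets are altered):
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.TopicPartition;

Properties adminProps = new Properties();
adminProps.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
try (AdminClient admin = AdminClient.create(adminProps);
     KafkaConsumer<byte[], byte[]> lookup = new KafkaConsumer<>(consumerProps)) {
    // find every partition of the topic and its earliest offset
    final List<TopicPartition> partitions = new ArrayList<>();
    for (PartitionInfo p : lookup.partitionsFor("your_topic")) {
        partitions.add(new TopicPartition("your_topic", p.partition()));
    }
    final Map<TopicPartition, OffsetAndMetadata> newOffsets = new HashMap<>();
    lookup.beginningOffsets(partitions).forEach(
            (tp, offset) -> newOffsets.put(tp, new OffsetAndMetadata(offset)));
    // rewrite the group's committed offsets; fails if the group still has active members
    admin.alterConsumerGroupOffsets("your_group", newOffsets).all().get();
}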
Otherwise, using another consumer group id should be enough to consume from the beginning, if your auto.offset.reset is set to earliest.
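A minimal sketch of this variant (group name and bootstrap address are placeholders):
Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ConsumerConfig.GROUP_ID_CONFIG, "fresh-group-id"); // a group id with no committed offsets
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest"); // no offsets found -> start from the beginning
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());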
This is a follow-up question to "Where do zookeeper store Kafka cluster and related information?", based on the answer provided by Armando Ballaci.
Now it's clear that consumer offsets are stored in the Kafka cluster in a special topic called __consumer_offsets. That's fine; I am just wondering how the retrieval of these offsets works.
Topics are not like an RDBMS that we can query for arbitrary data based on some predicate. For example, if the data were stored in an RDBMS, a query like the one below would get the consumer offset for a particular partition of a topic for a particular consumer in some consumer group.
select consumer_offset__read, consumer_offset__committed from consumer_offset_table where consumer_grp_id = 'x' and partition_id = 'y'
But clearly this kind of retrieval is not possible on Kafka topics. So how does retrieval from the topic work? Could someone elaborate?
(Data from Kafka partitions is read in FIFO order, and if the Kafka consumer model were followed to retrieve a particular offset, a lot of additional data would have to be processed and it would be slow. So I am wondering if it's done in some other way...)
Here is a description I found on the web when I stumbled upon this for my day job:
In Kafka releases through 0.8.1.1, consumers commit their offsets to ZooKeeper. ZooKeeper does not scale extremely well (especially for writes) when there are a large number of offsets (i.e., consumer-count * partition-count). Fortunately, Kafka now provides an ideal mechanism for storing consumer offsets. Consumers can commit their offsets in Kafka by writing them to a durable (replicated) and highly available topic. Consumers can fetch offsets by reading from this topic (although we provide an in-memory offsets cache for faster access). i.e., offset commits are regular producer requests (which are inexpensive) and offset fetches are fast memory look ups.
The official Kafka documentation describes how the feature works and how to migrate offsets from ZooKeeper to Kafka. This wiki provides sample code that shows how to use the new Kafka-based offset storage mechanism.
try {
    BlockingChannel channel = new BlockingChannel("localhost", 9092,
            BlockingChannel.UseDefaultBufferSize(),
            BlockingChannel.UseDefaultBufferSize(),
            5000 /* read timeout in millis */);
    channel.connect();
    final String MY_GROUP = "demoGroup";
    final String MY_CLIENTID = "demoClientId";
    int correlationId = 0;
    // partitions used by the subsequent offset commit/fetch requests (not shown here)
    final TopicAndPartition testPartition0 = new TopicAndPartition("demoTopic", 0);
    final TopicAndPartition testPartition1 = new TopicAndPartition("demoTopic", 1);
    channel.send(new ConsumerMetadataRequest(MY_GROUP, ConsumerMetadataRequest.CurrentVersion(), correlationId++, MY_CLIENTID));
    ConsumerMetadataResponse metadataResponse = ConsumerMetadataResponse.readFrom(channel.receive().buffer());
    if (metadataResponse.errorCode() == ErrorMapping.NoError()) {
        Broker offsetManager = metadataResponse.coordinator();
        // if the coordinator is different from the above channel's host, then reconnect
        channel.disconnect();
        channel = new BlockingChannel(offsetManager.host(), offsetManager.port(),
                BlockingChannel.UseDefaultBufferSize(),
                BlockingChannel.UseDefaultBufferSize(),
                5000 /* read timeout in millis */);
        channel.connect();
    } else {
        // retry (after backoff)
    }
} catch (IOException e) {
    // retry the query (after backoff)
}
The idea is that if you need the kind of functionality you describe, you should store the data in an RDBMS, a NoSQL database, or an ELK stack. A good pattern would be Kafka Connect with a sink connector. Normal message processing in Kafka is done through consumers or stream definitions that react to the events as they come. You can certainly seek to an offset or a timestamp in some cases, and that is completely possible...
In the latest versions of Kafka the offsets are not kept in ZooKeeper anymore, so ZooKeeper is not involved in consumer offset handling.
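For the seek-to-a-timestamp case, a minimal sketch (assuming a consumer that already has an assignment):
long oneHourAgo = System.currentTimeMillis() - 3_600_000L;
final Map<TopicPartition, Long> query = new HashMap<>();
for (TopicPartition tp : consumer.assignment()) {
    query.put(tp, oneHourAgo);
}
// offsetsForTimes() returns, per partition, the earliest offset whose timestamp is >= the query time
for (Map.Entry<TopicPartition, OffsetAndTimestamp> entry : consumer.offsetsForTimes(query).entrySet()) {
    if (entry.getValue() != null) { // null when the partition has no such record
        consumer.seek(entry.getKey(), entry.getValue().offset());
    }
}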
My Kafka sink connector reads from multiple topics (configured with 10 tasks) and processes upwards of 300 records from all topics. Based on the information held in each record, the connector may perform certain operations.
Here is an example of the key:value pair in a trigger record:
"REPROCESS":"my-topic-1"
Upon reading this record, I would then need to reset the offsets of the topic 'my-topic-1' to 0 in each of its partitions.
I have read in many places that creating a new KafkaConsumer and calling subscribe(...) with a ConsumerRebalanceListener is the recommended way. For example,
public class MyTask extends SinkTask {
    @Override
    public void put(Collection<SinkRecord> records) {
        records.forEach(record -> {
            if (record.key().toString().equals("REPROCESS")) {
                reprocessTopicRecords(record);
            } else {
                // do something else
            }
        });
    }

    private void reprocessTopicRecords(SinkRecord record) {
        // reprocessorProps and deserializer are assumed to be fields of this task
        KafkaConsumer<JsonNode, JsonNode> reprocessorConsumer =
                new KafkaConsumer<>(reprocessorProps, deserializer, deserializer);
        // the record value holds the topic to reprocess, e.g. "my-topic-1"
        reprocessorConsumer.subscribe(Arrays.asList(record.value().toString()),
                new ConsumerRebalanceListener() {
                    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {}
                    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                        // do offset reset here
                    }
                });
    }
}
However, the above strategy does not work for my case because:
1. It depends on a group rebalance taking place (does not always happen)
2. The 'partitions' passed to the onPartitionsAssigned method are dynamically assigned partitions, meaning they are only a subset of the full set of partitions that need their offsets reset. For example, this SinkTask will be assigned only 2 of the 8 partitions that hold the records for 'my-topic-1'.
I've also looked into using assign() but this is not compatible with the distributed consumer model (consumer groups) in the SinkConnector/SinkTask implementation.
I am aware that the Kafka command-line tool kafka-consumer-groups can do exactly what I want (I think):
https://gist.github.com/marwei/cd40657c481f94ebe273ecc16601674b
To summarize, I want to reset the offsets of all partitions for a given topic using Java APIs and let the Sink Connector pick up the offset changes and continue to do what it has been doing (processing records).
Thanks in advance.
I was able to achieve resetting offsets for a kafka connect consumer group by using a series of Confluent's kafka-rest-proxy APIs: https://docs.confluent.io/current/kafka-rest/api.html
This implementation no longer requires the 'trigger record' approach first described in the original post and is purely REST API based.
1. Temporarily delete the Kafka connector (this also deletes the connector's consumers)
2. Create a consumer instance for the same consumer group ("connect-<connector_name>")
3. Have the instance subscribe to the requested topic you want to reset
4. Do a dummy poll ('subscribe' is evaluated lazily)
5. Reset the consumer group's offsets for the specified topic
6. Do a dummy poll ('seek' is evaluated lazily), then commit the current offset state (in the proxy) for the consumer
7. Re-create the Kafka connector (with the same connector name); after rebalancing, consumers will join the group and read the last committed offsets (starting from 0)
8. Delete the temporary consumer instance
If you are able to use the CLI, Steps 2-6 can be replaced with:
kafka-consumer-groups --bootstrap-server <kafkahost:port> --group <group_id> --topic <topic_name> --reset-offsets --to-earliest --execute
As for those of you trying to do this in the Kafka connector code through native Java APIs, you're out of luck :-(
You're looking for the seek method. Either seek to an offset:
consumer.seek(new TopicPartition("topic-name", partition), offset);
Or use seekToBeginning.
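For example (assuming the consumer already has an active assignment):
// rewind every partition currently assigned to this consumer
consumer.seekToBeginning(consumer.assignment());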
However, I feel like you'd be competing with the Connect Sink API's consumer group. In other words, assuming you set up the consumer with a separate group id, you'd essentially be consuming records from the source topic twice: once by Connect, and once by your own consumer instance.
Unless you explicitly seek Connect's own consumer instance as well (which is not exposed), you'd get into a weird state. For example, your task would only execute on new records in the topic even though your own consumer was looking at an old offset, or you'd keep receiving even newer events while still processing old ones.
Also, a reprocess event might eventually sit at the very beginning of the topic as retention policies expire old records, causing your consumer to make no progress at all and to constantly rebalance its group by seeking to the beginning.
We had to do a very similar offset resetting exercise.
KafkaConsumer.seek() combined with KafkaConsumer.commitSync() worked well.
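A minimal sketch of that combination (here resetting to offset 0; assumes an assigned consumer):
final Map<TopicPartition, OffsetAndMetadata> toCommit = new HashMap<>();
for (TopicPartition tp : consumer.assignment()) {
    consumer.seek(tp, 0L); // move the in-memory position
    toCommit.put(tp, new OffsetAndMetadata(0L)); // and record it for the group
}
consumer.commitSync(toCommit); // persist the new positions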
There is another option that is worth mentioning, if you are dealing with lots of topics and partitions (javadoc):
AdminClient.alterConsumerGroupOffsets(
        String groupId,
        Map<TopicPartition, OffsetAndMetadata> offsets
)
We were lucky because we had the luxury of stopping the Kafka Connect instance for a while, so there was no consumer group competing.
We are using processors with exactly-once delivery (committing the consumer offset through the producer) and need to understand whether this is possible when consuming a message from a topic in kafka-cluster-1 and producing to a topic on kafka-cluster-2 (and vice versa).
This is a snippet from the transactional processor:
messageProducer.beginTransaction(partitionId)
resultPublisher.publish(partitionId, resultTopic, messageRecord.key(), result)
val offsetAndMetadata = messageConsumer.getUncommittedOffsets(listenTopic, messageRecord)
messageProducer.sendOffsetsToTransaction(partitionId, offsetAndMetadata, consumerGroupId)
messageProducer.commitTransaction(partitionId)
My understanding is that the producer will try to commit the offset to a consumer offsets topic in the same cluster.
I did some research but can't really find anything related to multiple clusters.
Is it possible at all?
It is possible: you can "manually" send offsets to your own topic on the same cluster where the produced message is sent. This way you keep the guarantees provided by transactions.
You would need to create your own topic for offsets, similar to Kafka's internal __consumer_offsets, using the groupId, topic, and partition as the key and the most recent offset (already read or to be read) as the value. Remember to use log compaction.
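A minimal sketch of the idea (the topic names, the key layout, and a transactional KafkaProducer<String, String> named "producer" are assumptions):
producer.beginTransaction();
// the result write and the offset bookkeeping commit or abort together
producer.send(new ProducerRecord<>("result-topic", key, result));
String offsetKey = groupId + "|" + record.topic() + "|" + record.partition();
String nextOffset = Long.toString(record.offset() + 1); // next offset to read
producer.send(new ProducerRecord<>("my-offsets", offsetKey, nextOffset));
producer.commitTransaction();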
AFAIK there is no way to have a transaction span two different clusters.
The use case is: I have a Kafka Streams app that consumes from an input topic and outputs to an intermediate topic; then, within the same Streams app, another topology consumes from this intermediate topic.
Whenever the application id is updated, both topics start being consumed from earliest. I want to change auto.offset.reset for the intermediate topic to latest while keeping earliest for the input topic.
Yes. You can set the reset strategy for each topic via:
// Processor API
topology.addSource(AutoOffsetReset offsetReset, String name, String... topics);
// DSL
builder.stream(String topic, Consumed.with(AutoOffsetReset offsetReset));
builder.table(String topic, Consumed.with(AutoOffsetReset offsetReset));
All those methods have overloads that allow you to set it.
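For example, in the DSL (topic names and serdes here are placeholders):
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;

final StreamsBuilder builder = new StreamsBuilder();
// input topic: start from the beginning when there are no committed offsets
builder.stream("input-topic",
        Consumed.with(Serdes.String(), Serdes.String())
                .withOffsetResetPolicy(Topology.AutoOffsetReset.EARLIEST));
// intermediate topic: start from the end when there are no committed offsets
builder.stream("intermediate-topic",
        Consumed.with(Serdes.String(), Serdes.String())
                .withOffsetResetPolicy(Topology.AutoOffsetReset.LATEST));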
I have developed a Kafka consumer, and there will be multiple instances of this consumer running in production. I know how we can use group.id so as not to duplicate the processing of data. Is there a way to have all the consumers receive the message but designate one consumer as the leader?
Is there a way to have a group.id per topic or even per key in a topic?
Looks like this has nothing to do with Kafka. You already know that by providing a unique group.id for each consumer, all consumer instances will get all messages from the topic. As for the push to the DB: you can factor out that logic and try using a distributed lock, so that the push-to-DB part of your application can only be executed by one of the consumers. Is this a Java-based setup?
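If it is Java, here is a minimal sketch of the distributed-lock idea using Apache Curator (an assumption, since no particular lock implementation was named; the lock path and pushToDatabase() are hypothetical):
import java.util.concurrent.TimeUnit;
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.locks.InterProcessMutex;
import org.apache.curator.retry.ExponentialBackoffRetry;

CuratorFramework client = CuratorFrameworkFactory.newClient(
        "localhost:2181", new ExponentialBackoffRetry(1000, 3));
client.start();
InterProcessMutex lock = new InterProcessMutex(client, "/locks/db-push");
if (lock.acquire(100, TimeUnit.MILLISECONDS)) { // only one instance wins the lock
    try {
        pushToDatabase(record); // hypothetical single-writer DB push
    } finally {
        lock.release();
    }
}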