How to consume all messages from the beginning of a topic in Apache Kafka using Java - apache-kafka

I am trying to consume all messages from the beginning of a topic in Apache Kafka. I can only consume messages that are being produced at that moment. Here is my code for fetching messages:
public void consume() {
    KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
    consumer.subscribe(Arrays.asList(topic));
    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(1000);
        for (ConsumerRecord<String, String> record : records) {
            System.out.printf("\"%s\"\n", record.value());
        }
    }
}

After subscribing to the topic, you can use the seekToBeginning method to move the offset to the start of each topic partition. Note that this works per partition, because each partition of a topic can have a different beginning offset (for example, if old messages have been deleted).
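For example, a minimal sketch using a rebalance listener (reusing the props and topic from the question):
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Arrays.asList(topic), new ConsumerRebalanceListener() {
    @Override
    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        // nothing to do before the rebalance
    }

    @Override
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        // rewind every assigned partition to its earliest available offset
        consumer.seekToBeginning(partitions);
    }
});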

Besides setting auto.offset.reset=earliest, try using a new/random value for the group.id property. Also, if you are not interested in keeping track of the consumer position and always want to start from the beginning, you can set enable.auto.commit=false to avoid polluting the offsets topic.
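For example, a minimal sketch of such a configuration (the bootstrap server address is a placeholder):
Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ConsumerConfig.GROUP_ID_CONFIG, "reader-" + UUID.randomUUID());    // fresh, random group id
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");              // start from the beginning for a new group
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");                // don't write to the offsets topic
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());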
Hope it helps.
Thanks.

The console consumer accomplishes this by generating a random consumer group id and setting auto.offset.reset=earliest.
Make sure that you close the consumer object, to avoid accumulating lots of temporary consumer group ids on the cluster.

Related

Re-read a Kafka topic

I'd like to re-read all Kafka events programmatically. I know there is an Application Reset Tool, but from what I understand, that requires me to shut down my application. I can't shut the application down in my use-case.
Is there a way to make my application re-read all events on a Kafka topic? Examples or code snippets would be much appreciated. Preferably but not necessarily using Kafka Streams.
You cannot re-read a topic with Kafka Streams, but with the "plain" Kafka consumer you can position the consumer at any valid offset.
Something like this:
// consumer is an existing KafkaConsumer instance
final Map<Integer, TopicPartition> partitions = new HashMap<>();
// get all partitions for the topic
for (final PartitionInfo partition : consumer.partitionsFor("your_topic")) {
    final TopicPartition tp = new TopicPartition("your_topic", partition.partition());
    partitions.put(partition.partition(), tp);
}
// manually assign all partitions and rewind them to the earliest available offset
consumer.assign(partitions.values());
consumer.seekToBeginning(partitions.values());
Consumers are required to stop in order to avoid race conditions between consumers committing offsets and the AdminClient altering offsets.
If you wish to keep the consumer group id, you can use the Kafka Consumer seek APIs to find the earliest offsets, and then use the AdminClient to alter the consumer group offsets.
kafka-consumer-groups --reset-offsets implementation should be a good example on how to accomplish this: https://github.com/apache/kafka/blob/85b6545b8159885c57ab67e08b7185be8a607988/core/src/main/scala/kafka/admin/ConsumerGroupCommand.scala#L446-L469
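If you'd rather stay in Java, a rough sketch of the keep-the-group-id idea is shown below; instead of the AdminClient it simply commits the rewound positions from the consumer itself (reusing the partitions map from the snippet above, and assuming enable.auto.commit=false and that no other member of the group is running):
consumer.assign(partitions.values());
Map<TopicPartition, Long> earliest = consumer.beginningOffsets(partitions.values());
Map<TopicPartition, OffsetAndMetadata> newOffsets = new HashMap<>();
for (Map.Entry<TopicPartition, Long> entry : earliest.entrySet()) {
    consumer.seek(entry.getKey(), entry.getValue());
    newOffsets.put(entry.getKey(), new OffsetAndMetadata(entry.getValue()));
}
consumer.commitSync(newOffsets);   // persist the rewound positions for the consumer group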
Otherwise, using another consumer group id should be enough to consume from the beginning, if your auto.offset.reset is set to earliest.

Kafka Consumer API jumping offsets

I am using Kafka version 2.0 and the Java consumer API to consume messages from a topic. We are using a single-node Kafka server with one consumer per partition. I have observed that the consumer is losing some of the messages.
The scenario is:
The consumer polls the topic.
I have created one consumer per thread.
It fetches the messages and gives them to a handler to process.
Then it commits the offsets using "at-least-once" Kafka consumer semantics.
In parallel, I have another consumer running with a different group-id. In this consumer, I'm simply increasing the message counter and committing the offset. There's no message loss in this consumer.
try {
    //kafkaConsumer.registerTopic();
    consumerThread = new Thread(() -> {
        final String topicName1 = "topic-0";
        final String topicName2 = "topic-1";
        final String topicName3 = "topic-2";
        final String topicName4 = "topic-3";
        String groupId = "group-0";
        final Properties consumerProperties = new Properties();
        consumerProperties.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "192.168.13.49:9092");
        consumerProperties.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        consumerProperties.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        consumerProperties.put(ConsumerConfig.GROUP_ID_CONFIG, groupId);
        consumerProperties.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "100");
        consumerProperties.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        consumerProperties.put(ConsumerConfig.AUTO_COMMIT_INTERVAL_MS_CONFIG, 1000);
        try {
            consumer = new KafkaConsumer<>(consumerProperties);
            consumer.subscribe(Arrays.asList(topicName1, topicName2, topicName3, topicName4));
        } catch (KafkaException ke) {
            logTrace(MODULE, ke);
        }
        while (service.isServiceStateRunning()) {
            ConsumerRecords<String, byte[]> records = consumer.poll(Duration.ofMillis(100));
            for (TopicPartition partition : records.partitions()) {
                List<ConsumerRecord<String, byte[]>> partitionRecords = records.records(partition);
                for (ConsumerRecord<String, byte[]> record : partitionRecords) {
                    processMessage(record);
                }
            }
            consumer.commitSync();
        }
        kafkaConsumer.closeResource();
    }, "KAFKA_CONSUMER");
} catch (Exception e) {
}
There seems to be a problem with the usage of subscribe() here.
subscribe() is used to subscribe to topics, not to partitions. To work with specific partitions you need to use assign(). Here is the relevant extract from the documentation:
public void subscribe(java.util.Collection topics)
Subscribe to the given list of topics to get dynamically assigned partitions. Topic subscriptions are not incremental. This list will replace the current assignment (if there is one). It is not possible to combine topic subscription with group management with manual partition assignment through assign(Collection). If the given list of topics is empty, it is treated the same as unsubscribe(). This is a short-hand for subscribe(Collection, ConsumerRebalanceListener), which uses a noop listener. If you need the ability to seek to particular offsets, you should prefer subscribe(Collection, ConsumerRebalanceListener), since group rebalances will cause partition offsets to be reset. You should also provide your own listener if you are doing your own offset management since the listener gives you an opportunity to commit offsets before a rebalance finishes.
public void assign(java.util.Collection partitions)
Manually assign a list of partitions to this consumer. This interface does not allow for incremental assignment and will replace the previous assignment (if there is one). If the given list of topic partitions is empty, it is treated the same as unsubscribe(). Manual topic assignment through this method does not use the consumer's group management functionality. As such, there will be no rebalance operation triggered when group membership or cluster and topic metadata change. Note that it is not possible to use both manual partition assignment with assign(Collection) and group assignment with subscribe(Collection, ConsumerRebalanceListener).
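For illustration, a minimal sketch of manual assignment (topic name, partition number and offset are placeholders):
// manual assignment: no group management, no rebalances
TopicPartition partition0 = new TopicPartition("topic-0", 0);
consumer.assign(Arrays.asList(partition0));
consumer.seek(partition0, 42L);   // with assign() you are free to seek to any valid offset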
You probably shouldn't do what you're doing. You should use subscribe(), with multiple partitions per topic and multiple consumers in the group for high availability, and let the consumer handle the offsets for you.
You don't describe why you're trying to process your topics in this custom way; it's advanced and leads to issues.
The timestamps on your instances should not have to be synchronised to do normal topic processing.
If you're looking for more performance, or want to isolate records more carefully to avoid "head of line blocking", consider something like Parallel Consumer (PC).
It also tracks per-record acknowledgement, among other things. Check out Parallel Consumer on GitHub (it's open source BTW, and I'm the author).

Is it possible to reset offsets to a topic for a kafka consumer group in a kafka connector?

My kafka sink connector reads from multiple topics (configured with 10 tasks) and processes upwards of 300 records from all topics. Based on the information held in each record, the connector may perform certain operations.
Here is an example of the key:value pair in a trigger record:
"REPROCESS":"my-topic-1"
Upon reading this record, I would then need to reset the offsets of the topic 'my-topic-1' to 0 in each of its partitions.
I have read in many places that the recommended way is to create a new KafkaConsumer, subscribe it to the topic, and then reset the offsets in a ConsumerRebalanceListener. For example:
public class MyTask extends SinkTask {

    @Override
    public void put(Collection<SinkRecord> records) {
        records.forEach(record -> {
            if (record.key().toString().equals("REPROCESS")) {
                reprocessTopicRecords(record);
            } else {
                // do something else
            }
        });
    }

    private void reprocessTopicRecords(SinkRecord record) {
        // reprocessorProps and deserializer are defined elsewhere
        KafkaConsumer<JsonNode, JsonNode> reprocessorConsumer =
                new KafkaConsumer<>(reprocessorProps, deserializer, deserializer);
        reprocessorConsumer.subscribe(Arrays.asList(record.value().toString()),
                new ConsumerRebalanceListener() {
                    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {}

                    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                        // do offset reset here
                    }
                }
        );
    }
}
However, the above strategy does not work for my case because:
1. It depends on a group rebalance taking place (does not always happen)
2. 'partitions' passed to the onPartitionsAssigned method are dynamically assigned partitions, meaning these are only a subset of the full set of partitions that will need to have their offsets reset. For example, this SinkTask will be assigned only 2 of the 8 partitions that hold the records for 'my-topic-1'.
I've also looked into using assign() but this is not compatible with the distributed consumer model (consumer groups) in the SinkConnector/SinkTask implementation.
I am aware that the kafka command line tool kafka-consumer-groups can do exactly what I want (I think):
https://gist.github.com/marwei/cd40657c481f94ebe273ecc16601674b
To summarize, I want to reset the offsets of all partitions for a given topic using Java APIs and let the Sink Connector pick up the offset changes and continue to do what it has been doing (processing records).
Thanks in advance.
I was able to achieve resetting offsets for a kafka connect consumer group by using a series of Confluent's kafka-rest-proxy APIs: https://docs.confluent.io/current/kafka-rest/api.html
This implementation no longer requires the 'trigger record' approach first described in the original post and is purely REST API based.
1. Temporarily delete the kafka connector (this deletes the connector's consumers and ...)
2. Create a consumer instance for the same consumer group ("connect-")
3. Have the instance subscribe to the requested topic you want to reset
4. Do a dummy poll ('subscribe' is evaluated lazily)
5. Reset the consumer group's topic offsets for the specified topic
6. Do a dummy poll ('seek' is evaluated lazily) and commit the current offset state (in the proxy) for the consumer
7. Re-create the kafka connector (with the same connector name) - after rebalancing, consumers will join the group and read the last committed offsets (starting from 0)
8. Delete the temporary consumer instance
If you are able to use the CLI, Steps 2-6 can be replaced with:
kafka-consumer-groups --bootstrap-server <kafkahost:port> --group <group_id> --topic <topic_name> --reset-offsets --to-earliest --execute
As for those of you trying to do this in the kafka connector code through native Java APIs, you're out of luck :-(
You're looking for the seek method. Either to a specific offset:
consumer.seek(new TopicPartition("topic-name", partition), offset);
Or seekToBeginning
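For example (the topic name and partition variable mirror the seek() call above):
consumer.seekToBeginning(Collections.singletonList(new TopicPartition("topic-name", partition)));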
However, I feel like you'd be competing with the Connect Sink API's consumer group. In other words, assuming you set up the consumer with a separate group id, you're essentially consuming records from the source topic twice: once by Connect, and again by your own consumer instance.
Unless you explicitly seek Connect's own consumer instance as well (which is not exposed), you'd end up in a weird state. For example, your task would only execute on new records in the topic even though your own consumer is looking at an old offset, or you'd keep receiving newer events while still processing old ones.
Also, a reprocess event might eventually land at the very beginning of the topic (due to retention policies expiring old records, for example), causing your consumer to make no progress at all and to constantly seek its group back to the beginning.
We had to do a very similar offset resetting exercise.
KafkaConsumer.seek() combined with KafkaConsumer.commitSync() worked well.
There is another option that is worth mentioning, if you are dealing with lots of topics and partitions (javadoc):
AdminClient.alterConsumerGroupOffsets(
    String groupId,
    Map<TopicPartition, OffsetAndMetadata> offsets
)
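A rough sketch of that approach, assuming Kafka clients 2.5+; the topic, partition and group id below are placeholders, and the consumer group must be empty while its offsets are altered:
void resetToEarliest(Properties adminProps) throws Exception {
    try (AdminClient admin = AdminClient.create(adminProps)) {
        TopicPartition tp = new TopicPartition("my-topic-1", 0);
        // look up the earliest available offset for the partition
        long earliest = admin.listOffsets(Map.of(tp, OffsetSpec.earliest()))
                .partitionResult(tp).get().offset();
        // rewind the (inactive) consumer group to that offset
        admin.alterConsumerGroupOffsets("connect-my-sink-connector",
                Map.of(tp, new OffsetAndMetadata(earliest))).all().get();
    }
}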
We were lucky because we had the luxury to stop the Kafka Connect instance for a while, so there's no consumer group competing.

How to re-send (read) an old kafka message from given topic and partition at specific offset using spring-kafka?

Given topic name, partition number and offset, how can I read just one record from the topic?
In my Spring Boot based application I use Kafka for importing business data.
Import records are sent to import_queue and consumed by one or more business modules. Records are always acknowledged, even if the consumer fails to import the data from a record, so that the import can continue with the following records.
Later a user (after he/she has fixed some dependent business data) can decide to re-send one or more failed (but acknowledged) import records.
The offset, partition number and topic name of every record are stored in my application internally in an SQL database.
From the reference documentation and some StackOverflow questions I found out that I have to:
set-up a container (consumer/listener)
rewind (seek) to desired offset
read one record
skip reading remaining records
Is this the only way to read just one old record from a kafka topic?
Or is there an easier solution?
Solution
As suggested by @Gary:
ConsumerRecord<byte[], byte[]> read(String topic, int partition, long offset) {
    Map<String, Object> configs = Map.of(
            "bootstrap.servers", "localhost:9092",
            "group.id", "incubator_retry",
            "max.poll.records", 1);
    DefaultKafkaConsumerFactory<byte[], byte[]> consumerFactory = new DefaultKafkaConsumerFactory<>(
            configs, new ByteArrayDeserializer(), new ByteArrayDeserializer());
    try (Consumer<byte[], byte[]> consumer = consumerFactory.createConsumer()) {
        TopicPartition topicPartition = new TopicPartition(topic, partition);
        consumer.assign(List.of(topicPartition));
        consumer.seek(topicPartition, offset);
        ConsumerRecords<byte[], byte[]> consumerRecords = consumer.poll(Duration.ofMillis(5000));
        if (consumerRecords.isEmpty()) {
            throw new RuntimeException(String.format("Timeout polling from topic %s partition %d at offset %d",
                    topicPartition.topic(), topicPartition.partition(), offset));
        }
        return consumerRecords.iterator().next();
    }
}
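For example, a caller could look like this (the partition and offset values are placeholders read from the SQL database mentioned above):
ConsumerRecord<byte[], byte[]> failedRecord = read("import_queue", 0, 12345L);
// re-publish or re-process failedRecord as needed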
There is an easier solution.
Use the DefaultKafkaConsumerFactory to create a KafkaConsumer (or simply create one directly)
Use a different group.id
Set the max.poll.records property to 1
consumer.assign(...) the desired topic/partition
seek(...) to the required offset
poll(...) until you get the record
close() the consumer
If you are using any message conversion (aside from Kafka deserializers) you will have to invoke the converter manually.

Is there a way to stop Kafka consumer at a specific offset?

I can seek to a specific offset. Is there a way to stop the consumer at a specific offset? In other words, consume till my given offset. As far as I know, Kafka does not offer such a function. Please correct me if I am wrong.
E.g. a partition has offsets 1-10 and I only want to consume from 3 to 8. After consuming the message at offset 8, the program should exit.
Yes, Kafka does not offer this function, but you can achieve it in your consumer code. You could try using commitSync() to control this.
public void commitSync(Map offsets)
Commit the specified offsets for the specified list of topics and partitions.
This commits offsets to Kafka. The offsets committed using this API will be used on the first fetch after every rebalance and also on startup. As such, if you need to store offsets in anything other than Kafka, this API should not be used. The committed offset should be the next message your application will consume, i.e. lastProcessedMessageOffset + 1.
This is a synchronous commit and will block until either the commit succeeds or an unrecoverable error is encountered (in which case it is thrown to the caller).
Something like this:
boolean goAhead = true;
while (goAhead) {
    ConsumerRecords<String, String> records = consumer.poll(100);
    for (ConsumerRecord<String, String> record : records) {
        if (record.offset() > OFFSET_BOUND) {
            // commit the offset of the first record past the bound, then stop consuming
            consumer.commitSync(Collections.singletonMap(
                    new TopicPartition(record.topic(), record.partition()),
                    new OffsetAndMetadata(record.offset())));
            goAhead = false;
            break;
        }
        process(record);
    }
}
You should set "enable.auto.commit" to false in the code above. In your case, OFFSET_BOUND could be set to 8. The committed offset is then 9 in your example, so the next time the consumer will fetch from that position.
Assuming that partition offsets are continuous (i.e. the topic is not log compacted), you could configure your consumer (using the max.poll.records config) so that it reads a certain number of records in each poll. This would let you stop at the offset you want.
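A rough sketch of that idea (the topic name and offset bounds are placeholders; offsets are assumed continuous):
// set before constructing the consumer
props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "1");   // at most one record per poll()

TopicPartition partition = new TopicPartition("my-topic", 0);
consumer.assign(Collections.singletonList(partition));
consumer.seek(partition, 3);                               // first offset we want
long nextOffset = 3;
while (nextOffset <= 8) {                                  // 8 is the last offset we want
    for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(100))) {
        process(record);
        nextOffset = record.offset() + 1;
    }
}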
As far as I know, max.poll.records is a client-side feature; the Kafka fetch protocol only has byte-based limits (https://kafka.apache.org/protocol#The_Messages_Fetch).
So in general more messages will be fetched under the hood.