I am new to writing the Kafka consumer, I have scenario in case I have two consumer running under a same group id and I have two partitions.
Suppose that;
Consumer 1===>Linked to ====>Partition 1
Consumer 2===>Linked to ====>Partition 2
I case my consumer-2 is down how can I ensure that my Consumer-1 re-read all the event which came to partition 2 again, I just came across some thing regarding setConsumerRebalanceListener so I have set my container property for this, and for the onPartitionsAssigned method I am setting consumer.seekToBeginning(consumer.assignment())
Is this correct, does this line means my consumer-1 will read all the event from partition-2 as well when the consumer-2 is down and the partition-2 is reassigned to consumer?
I also will request if someone could share some good links where i can read the basics about ConsumerRebalanceListener.
public ConcurrentKafkaListenerContainerFactory<String, MultiTenancyOrgDataMessage> kafkaListenerContainerFactory() {
LOG.debug("ConcurrentKafkaListenerContainerFactory executing");
ConcurrentKafkaListenerContainerFactory<String, MultiTenancyOrgDataMessage> factory = new ConcurrentKafkaListenerContainerFactory<>();
factory.setConsumerFactory(consumerFactory());
factory.getContainerProperties().setAckMode(AckMode.MANUAL_IMMEDIATE);
factory.getContainerProperties().setConsumerRebalanceListener(new ConsumerAwareRebalanceListener() {
#Override
public void onPartitionsAssigned(Consumer<?, ?> consumer, Collection<TopicPartition> partitions) {
consumer.seekToBeginning(consumer.assignment()); // read topic from beginning on service restart
}
});
This is what committing is for - if consumer 2 goes down then any records it has consumed but not committed will be picked up by consumer 1 after a rebalance.
This is one reason that Kafka supports at least once semantics - after rebalance consumer 1 will pick up from the last committed offset and hence may process records that have been successfully processed by consumer 2 if it died before committing.
An example of why you might use a ConsumerRebalanceListener is to deal with pausing across a rebalance - I have written about this at https://chrisg23.blogspot.com/2020/02/why-is-pausing-kafka-consumer-so.html?m=1
Related
I'm using #KafkaListener and ConcurrentKafkaListenerContainerFactory to listen to 3 kafka topics and each topic has 10 partitions. I have few questions on how this works.
ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerContainerFactory(
ConsumerFactory<String, String> consumerFactory) {
ConcurrentKafkaListenerContainerFactory<String, String> factory =
new ConcurrentKafkaListenerContainerFactory<>();
factory.setConsumerFactory(consumerFactory);
factory.setConcurrency(30);
factory.getContainerProperties().setSyncCommits(true);
return factory;
}
#KafkaListener(topics = "topic1", containerFactory="kafkaListenerContainerFactory")
public void handleMessage(final ConsumerRecord<Object, String> arg0) throws Exception {
}
#KafkaListener(topics = "topic2", containerFactory="kafkaListenerContainerFactory")
public void handleMessage(final ConsumerRecord<Object, String> arg0) throws Exception {
}
#KafkaListener(topics = "topic3", containerFactory="kafkaListenerContainerFactory")
public void handleMessage(final ConsumerRecord<Object, String> arg0) throws Exception {
}
my listener.ackmode is return and enable.auto.commit is set to false and partition.assignment.strategy: org.apache.kafka.clients.consumer.RoundRobinAssignor
1) my understanding about concurrency is, since i set my concurrency (at factory level) to 30, and i have a total of 30 partitions (for all three topic together) to read from, each thread will be assigned one partition. Is my understanding correct? how does it impact if i override concurrency again inside #KafkaListener annotation?
2) When spring call the poll() method, does it poll from all three topics?
3) since i set listener.ackmode is set to return, will it wait until all of the records that were returned in a single poll() to completed before issuing a next poll()? Also what happens if my records are taking longer than max.poll.interval.ms to process? Lets say 1-100 offsets are returned in a single poll() call and my code is only able to process 50 before max.poll.interval.ms is hit, will spring issue another poll at this time because it already hit max.poll.interval.ms? if so will the next poll() return records from offset 51?
really appreciate for your time and help
my listener.ackmode is return
There is no such ackmode; since you don't set it on the factory, your actual ack mode is BATCH (the default). To use ack mode record (if that's what you mean), you must so configure the factory container properties.
my understanding about concurrency is ...
Your understanding is incorrect; concurrency can not be greater than the number of partitions in the topic with the most partitions (if a listener listens to multiple topics). Since you only have 10 partitions in each topic, your actual concurrency is 10.
Overriding the concurrency on a listener simply overrides the factory setting; you always need at least as many partitions as concurrency.
When spring call the poll() method, does it poll from all three topics?
Not with that configuration; you have 3 concurrent containers, each with 30 consumers listening to one topic. You have 90 consumers.
If you have a single listener for all 3 topics, the poll will return records from all 3; but you still may have 20 idle consumers, depending on how the partition assignor allocates the partitions - see the logs "partitions assigned" for exactly how the partitions are allocated. The round robin assignor should distribute them ok.
will spring issue another poll at this time
Spring has no control - if you are taking too long, the Consumer thread is in the listener - the Consumer is not thread-safe so we can't issue an asynchronous poll.
You must process max.poll.records within max.poll.interval.ms to avoid Kafka from rebalancing the partitions.
The ack mode makes no difference; it's all about processing the results of the poll in a timely fashion.
I was going through this article which explains how to ensure message is processed exactly once by doing following:
Read (topic, partition, offset) from database on start/restart
Read message from specific (topic, partition, offset)
Atomically do following things:
Processing message
Commit offset to database as (topic, partition, offset)
As you can see, it explicitly specified from which partition to read messages. I feel its not good idea as it does not let allow Kafka to assign fair share of partition to active consumers. I am not able to come with logic to implement similar functionality without explicitly specifying partitions while polling kafka topic inside consumer. Is it possible to do?
Good analysis. You have a very good point and if possible you should certainly let kafka handle the partition assignment to consumers.
There is an alternative to consumer.Assign(Partition[]). The kafka brokers will notify your consumers when a partition is revoked or assigned to the consumer. For example, the dotnet client library has a 'SetPartitionsRevoked' and 'SetPartitionsAssigned' handler, that consumers can use to manage their offsets.
When a partition is revoked, persist your last processed offset for each partition being revoked to the database. When a new partition is assigned, get the last processed offset for that partition from the database and use that.
C# Example:
public class Program
{
public void Main(string[] args)
{
using (
var consumer = new ConsumerBuilder<string, string>(config)
.SetErrorHandler(ErrorHandler)
.SetPartitionsRevokedHandler(HandlePartitionsRevoked)
.SetPartitionsAssigned(HandlePartitionsAssigned)
.Build()
)
{
while (true)
{
consumer.Consume()//.Poll()
}
}
}
public IEnumerable<TopicPartitionOffset>
HandlePartitionsRevoked
(
IConsumer<string, string> consumer,
List<TopicPartitionOffset> currentTopicPartitionOffsets
)
{
Persist(<last processed offset for each partition in
'currentTopicPartitionOffsets'>);
return tpos;
}
public IEnumerable<TopicPartitionOffset> HandlePartitionsAssigned
(
IConsumer<string, string> consumer,
List<TopicPartition> tps
)
{
List<TopicPartitionOffset> tpos = FetchOffsetsFromDbForTopicPartitions(tps);
return tpos
}
}
Java Example from the ConsumerRebalanceListener Docs:
If writing in Java, there is a 'ConsumerRebalanceListener' interface that you can implement. You then pass your implementation of the interface into the consumer.Subscribe(topic, listener) method. The example below is taken verbatim from the kafka docs linked above:
public class SaveOffsetsOnRebalance implements ConsumerRebalanceListener {
private Consumer<?,?> consumer;
public SaveOffsetsOnRebalance(Consumer<?,?> consumer) {
this.consumer = consumer;
}
public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
// save the offsets in an external store using some custom code not described here
for(TopicPartition partition: partitions)
saveOffsetInExternalStore(consumer.position(partition));
}
public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
// read the offsets from an external store using some custom code not described here
for(TopicPartition partition: partitions)
consumer.seek(partition, readOffsetFromExternalStore(partition));
}
}
If my understanding is correct, you would call the java version like this: consumer.Subscribe("My topic", new SaveOffsetsOnRebalance(consumer)).
For more information, see the 'Storing Offsets Outside Kafka' section of the kafka docs.
Here's an excerpt from those docs that summarizes how to store the partitions and offsets for exactly-once processing:
Each record comes with its own offset, so to manage your own offset
you just need to do the following:
Configure enable.auto.commit=false
Use the offset provided with each ConsumerRecord to save your position.
On restart restore the position of the consumer using seek(TopicPartition, long).
This type of usage is simplest when the partition assignment is also
done manually (this would be likely in the search index use case
described above). If the partition assignment is done automatically
special care is needed to handle the case where partition assignments
change. This can be done by providing a ConsumerRebalanceListener
instance in the call to subscribe(Collection,
ConsumerRebalanceListener) and subscribe(Pattern,
ConsumerRebalanceListener). For example, when partitions are taken
from a consumer the consumer will want to commit its offset for those
partitions by implementing
ConsumerRebalanceListener.onPartitionsRevoked(Collection). When
partitions are assigned to a consumer, the consumer will want to look
up the offset for those new partitions and correctly initialize the
consumer to that position by implementing
ConsumerRebalanceListener.onPartitionsAssigned(Collection).
Another common use for ConsumerRebalanceListener is to flush any
caches the application maintains for partitions that are moved
elsewhere.
I am using Kafka Version 2.0 and java consumer API to consume messages from a topic. We are using a single node Kafka server with one consumer per partition. I have observed that the consumer is loosing some of the messages.
The scenario is:
Consumer polls the topic.
I have created One Consumer Per Thread.
Fetches the messages and gives it to a handler to handle the message.
Then it commits the offsets using "At-least-once" Kafka Consumer semantics to commit Kafka offset.
In parallel, I have another consumer running with a different group-id. In this consumer, I'm simply increasing the message counter and committing the offset. There's no message loss in this consumer.
try {
//kafkaConsumer.registerTopic();
consumerThread = new Thread(() -> {
final String topicName1 = "topic-0";
final String topicName2 = "topic-1";
final String topicName3 = "topic-2";
final String topicName4 = "topic-3";
String groupId = "group-0";
final Properties consumerProperties = new Properties();
consumerProperties.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "192.168.13.49:9092");
consumerProperties.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.ByteArrayDeserializer");
consumerProperties.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.ByteArrayDeserializer");
consumerProperties.put(ConsumerConfig.GROUP_ID_CONFIG, groupId);
consumerProperties.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "100");
consumerProperties.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
consumerProperties.put(ConsumerConfig.AUTO_COMMIT_INTERVAL_MS_CONFIG, 1000);
try {
consumer = new KafkaConsumer<>(consumerProperties);
consumer.subscribe(Arrays.asList(topicName1, topicName2, topicName3, topicName4));
} catch (KafkaException ke) {
logTrace(MODULE, ke);
}
while (service.isServiceStateRunning()) {
ConsumerRecords<String, byte[]> records = consumer.poll(Duration.ofMillis(100));
for (TopicPartition partition : records.partitions()) {
List<ConsumerRecord<String, byte[]>> partitionRecords = records.records(partition);
for (ConsumerRecord<String, byte[]> record : partitionRecords) {
processMessage(simpleMessage);
}
}
consumer.commitSync();
}
kafkaConsumer.closeResource();
}, "KAKFA_CONSUMER");
} catch (Exception e) {
}
There seems to be a problem with usage of subscribe() here.
Subscribe is used to subscribe to topics and not to partitions. To use specific partitions you need to use assign(). Read up the extract from the documentation:
public void subscribe(java.util.Collection topics)
Subscribe to the given list of topics to get dynamically assigned
partitions. Topic subscriptions are not incremental. This list will
replace the current assignment (if there is one). It is not possible
to combine topic subscription with group management with manual
partition assignment through assign(Collection). If the given list of
topics is empty, it is treated the same as unsubscribe(). This is a
short-hand for subscribe(Collection, ConsumerRebalanceListener), which
uses a noop listener. If you need the ability to seek to particular
offsets, you should prefer subscribe(Collection,
ConsumerRebalanceListener), since group rebalances will cause
partition offsets to be reset. You should also provide your own
listener if you are doing your own offset management since the
listener gives you an opportunity to commit offsets before a rebalance
finishes.
public void assign(java.util.Collection partitions)
Manually assign a list of partitions to this consumer. This interface
does not allow for incremental assignment and will replace the
previous assignment (if there is one). If the given list of topic
partitions is empty, it is treated the same as unsubscribe(). Manual
topic assignment through this method does not use the consumer's group
management functionality. As such, there will be no rebalance
operation triggered when group membership or cluster and topic
metadata change. Note that it is not possible to use both manual
partition assignment with assign(Collection) and group assignment with
subscribe(Collection, ConsumerRebalanceListener).
You probably shouldn't do what you're doing. You should use subscribe, and use multiple partitions per topic, and multiple consumers in the group for high availability, and allow the consumer to handle the offsets for you.
You don't describe why you're trying to process your topics in this custom way? It's advanced and leads to issues.
The timestamps on your instances should not have to be synchronised to do normal topic processing.
If you're looking for more performance or to isolate records more carefully to avoid "head of line blocking" consider something like Parallel Consumer (PC).
It also tracks per record acknowledgement, among other things. Check out Parallel Consumer on GitHub (it's open source BTW, and I'm the author).
My kafka sink connector reads from multiple topics (configured with 10 tasks) and processes upwards of 300 records from all topics. Based on the information held in each record, the connector may perform certain operations.
Here is an example of the key:value pair in a trigger record:
"REPROCESS":"my-topic-1"
Upon reading this record, I would then need to reset the offsets of the topic 'my-topic-1' to 0 in each of its partitions.
I have read in many places that creating a new KafkaConsumer, subscribing to the topic's partitions, then calling the subscribe(...) method is the recommended way. For example,
public class MyTask extends SinkTask {
#Override
public void put(Collection<SinkRecord> records) {
records.forEach(record -> {
if (record.key().toString().equals("REPROCESS")) {
reprocessTopicRecords(record);
} else {
// do something else
}
});
}
private void reprocessTopicRecords(SinkRecord record) {
KafkaConsumer<JsonNode, JsonNode> reprocessorConsumer =
new KafkaConsumer<>(reprocessorProps, deserializer, deserializer);
reprocessorConsumer.subscribe(Arrays.asList(record.value().toString()),
new ConsumerRebalanceListener() {
public void onPartitionsRevoked(Collection<TopicPartition> partitions) {}
public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
// do offset reset here
}
}
);
}
}
However, the above strategy does not work for my case because:
1. It depends on a group rebalance taking place (does not always happen)
2. 'partitions' passed to the onPartitionsAssigned method are dynamically assigned partitions, meaning these are only a subset to the full set of partitions that will need to have their offset reset. For example, this SinkTask will be assigned only 2 of the 8 partitions that hold the records for 'my-topic-1'.
I've also looked into using assign() but this is not compatible with the distributed consumer model (consumer groups) in the SinkConnector/SinkTask implementation.
I am aware that the kafka command line tool kafka-consumer-groups can do exactly what I want (I think):
https://gist.github.com/marwei/cd40657c481f94ebe273ecc16601674b
To summarize, I want to reset the offsets of all partitions for a given topic using Java APIs and let the Sink Connector pick up the offset changes and continue to do what it has been doing (processing records).
Thanks in advance.
I was able to achieve resetting offsets for a kafka connect consumer group by using a series of Confluent's kafka-rest-proxy APIs: https://docs.confluent.io/current/kafka-rest/api.html
This implementation no longer requires the 'trigger record' approach firs described in the original post and is purely Rest API based.
Temporarily delete the kafka connector (this deletes the connector's consumers and )
Create a consumer instance for the same consumer group ("connect-")
Have the instance subscribe to the requested topic you want to reset
Do a dummy poll ('subscribe' is evaluated lazily')
Reset consumer group topic offsets for specified topic
Do a dummy poll ('seek' is evaluated lazily') Commit the current offset state (in the proxy) for the consumer
Re-create kafka connector (with same connector name) - after re-balancing, consumers will join the group and read the last committed offset (starting from 0)
Delete the temporary consumer instance
If you are able to use the CLI, Steps 2-6 can be replaced with:
kafka-consumer-groups --bootstrap-server <kafkahost:port> --group <group_id> --topic <topic_name> --reset-offsets --to-earliest --execute
As for those of you trying to do this in the kafka connector code through native Java APIs, you're out of luck :-(
You're looking for the seek method. Either to an offset
consumer.seek(new TopicPartition("topic-name", partition), offset);
Or seekToBeginning
However, I feel like you'd be competing with the Connect Sink API's consumer group. In other words, assuming you setup the consumer with a separate group id, then you're essentially consuming records twice here from the source topic, once by Connect, and then your own consumer instance.
Unless you explicitly seek Connect's own consumer instance as well (which is not exposed), you'd be getting into a weird state. For example, your task only executes on new records to the topic, despite the fact your own consumer would be looking at an old offset, or you'd still be getting even newer events while still processing old ones
Also, eventually you might get a reprocess event at the very beginning of the topic due to retention policies, expiring old records, for example, causing your consumer to not progress at all and constantly rebalancing its group by seeking to the beginning
We had to do a very similar offset resetting exercise.
KafkaConsumer.seek() combined with KafkaConsumer.commitSync() worked well.
There is another option that is worth mentioning, if you are dealing with lots of topics and partitions (javadoc):
AdminClient.alterConsumerGroupOffsets(
String groupId,
Map<TopicPartition,OffsetAndMetadata> offsets
)
We were lucky because we had the luxury to stop the Kafka Connect instance for a while, so there's no consumer group competing.
I'm using Kafka 0.8.1.1
Is there any API (callback etc.) using which the lost partitions or newly added partitions to a consumer could be found?
I'm using Kafka 0.9.1 API and there is an interface ConsumerRebalanceListener with two methods
public void onPartitionsRevoked(Collection partitions)
where
partitions The list of partitions that were assigned to the consumer on the last rebalance
public void onPartitionsAssigned(Collection partitions)
partitions The list of partitions that are now assigned to the consumer (may include partitions previously assigned to the consumer)