Making Kafka consumers consume existing messages before subscription - apache-kafka

With a publisher and N consumers, if the consumers use auto.offset.reset=latest, they miss all the messages that were published to the topic before they subscribed to it. It is a known fact that a consumer with auto.offset.reset=latest doesn't replay messages that existed in the topic before it subscribed.
So I would need either:
1. Make the publisher wait until all subscribers start consuming messages and only then start publishing. I don't know how to do that without leveraging ZooKeeper, for instance. Does Kafka provide a means to do that?
2. Keep auto.offset.reset=latest on the consumers, but make them explicitly consume all existing messages first, in case they are subscribing to a topic that already contains messages.
What is the best practice for this case?
I guess the consumer must check the topic for existing messages, consume them if there are any, and then start consuming with auto.offset.reset=latest. That sounds like the best way to me.

When a high-level consumer starts, it does the following:
look for committed offsets for its consumer group
a. if valid offsets are found, resume from there
b. if no valid offsets are found, set offsets according to auto.offset.reset
Thus, auto.offset.reset only takes effect if no valid offset was committed. This behavior is intended and necessary to provide at-least-once processing guarantees in case of failure.
Thus, if you want to read a topic from its beginning, you can either use a new consumer group.id and set auto.offset.reset=earliest, or explicitly modify the offsets on startup using seekToBeginning() before you start your poll() loop.
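For the second option, a minimal sketch with the Java consumer might look like the following (broker address, topic, and group.id are placeholders):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ReadFromBeginning {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "my-group");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));

            // poll until partitions have been assigned, then rewind explicitly
            while (consumer.assignment().isEmpty()) {
                consumer.poll(Duration.ofMillis(100));
            }
            consumer.seekToBeginning(consumer.assignment());

            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                    System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                }
            }
        }
    }
}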

We do option (1) using the service discovery feature provided by Eureka (any other service discovery tool would do the job) plus aliasing. Basically, a publisher does not register itself (and does not start processing requests or publishing notifications) until at least one subscriber is available.

Related

Kafka exactly-once producer consumer

I am implementing exactly-once semantics for a simple data pipeline, with Kafka as the message broker. I can force the Kafka producer to write each produced record exactly once by setting enable.idempotence=true.
However, on the consumption front I need to guarantee that the consumer reads each record exactly once (I am not interested in storing the consumed records to an external system or to another Kafka topic, just in processing them). To achieve this, I have to ensure that polled records are processed and their offsets are committed to the __consumer_offsets topic atomically/transactionally (both succeed or fail together).
In such a case, do I need to resort to the Kafka transaction APIs to create a transactional producer in the consumer polling loop, where inside the transaction I perform (1) processing of the consumed records and (2) committing their offsets, before closing the transaction? Or would the normal commitSync/commitAsync serve in such a case?
"on the consumption front I need to guarantee that the consumer reads each record exactly once"
The answer from Gopinath explains well how you can achieve exactly-once between a KafkaProducer and a KafkaConsumer. These configurations (together with the use of the Transaction API in the KafkaProducer) guarantee that all data sent by the producer will be stored in Kafka exactly once. However, they do not guarantee that the Consumer reads the data exactly once. That, of course, depends on your offset management.
Anyway, I understand your question as asking how the Consumer itself can process a consumed message exactly once.
For this you need to manage your offsets on your own, in an atomic way. That means you need to build your own "transaction" around:
fetching data from Kafka,
processing data, and
storing processed offsets externally.
The methods commitSync and commitAsync will not get you far here as they can only ensure at-most-once or at-least-once processing within the Consumer. In addition, it is beneficial that your processing is idempotent.
There is a nice blog that explains such an implementation making use of the ConsumerRebalanceListener and storing the offsets in your local file system. A full code example is also provided.
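For illustration only (this is not the blog's code), here is a minimal sketch of the idea, with a plain HashMap standing in for the external, atomically updated offset store; topic and group names are placeholders:

import java.time.Duration;
import java.util.*;
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.TopicPartition;

public class ExternalOffsetsSketch {
    // stand-in for an external store (local file, database row, ...)
    static final Map<TopicPartition, Long> offsetStore = new HashMap<>();

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "external-offsets-demo");
        props.put("enable.auto.commit", "false");   // offsets are NOT committed to Kafka
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("my-topic"), new ConsumerRebalanceListener() {
            public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                // nothing to do: the offset of every processed record was already persisted below
            }
            public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                for (TopicPartition tp : partitions) {
                    Long offset = offsetStore.get(tp);
                    if (offset != null) consumer.seek(tp, offset);   // resume from our own store
                }
            }
        });

        while (true) {
            for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                // in a real implementation, processing the record and persisting
                // record.offset() + 1 must happen in ONE atomic write (same file write
                // or same database transaction) to get effectively exactly-once behavior
                System.out.println("processing " + record.value());
                offsetStore.put(new TopicPartition(record.topic(), record.partition()), record.offset() + 1);
            }
        }
    }
}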
"do I need to resort to Kafka transaction APIs to create a transactional producer in the consumer polling loop"
The Transaction API is only available for KafkaProducers and as far as I am aware cannot be used for your offset management.
'Exactly-once' functionality in Kafka can be achieved by a combination of these 3 settings:
isolation.level = read_committed
transactional.id = <unique_id>
processing.guarantee = exactly_once
More information on enabling the exactly-once functionality:
https://www.confluent.io/blog/simplified-robust-exactly-one-semantics-in-kafka-2-5/
https://www.confluent.io/blog/exactly-once-semantics-are-possible-heres-how-apache-kafka-does-it/
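As a sketch of where each of these three settings lives (values are placeholders; note that processing.guarantee is a Kafka Streams configuration rather than a plain producer/consumer setting):

import java.util.Properties;

public class ExactlyOnceConfigs {
    public static void main(String[] args) {
        // producer side: setting transactional.id enables transactions
        // (and implicitly idempotence and acks=all)
        Properties producerProps = new Properties();
        producerProps.put("transactional.id", "my-app-producer-1");

        // consumer side: only return records from committed transactions
        Properties consumerProps = new Properties();
        consumerProps.put("isolation.level", "read_committed");

        // Kafka Streams only: turns on exactly-once processing for the whole topology
        Properties streamsProps = new Properties();
        streamsProps.put("processing.guarantee", "exactly_once");
    }
}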

Difference between kafka idempotent and transactional producer setup?

When setting up a Kafka producer to use idempotent behaviour and transactional behaviour:
I understand that for idempotency we set:
enable.idempotence=true
and that by changing this one flag on our producer, we are guaranteed exactly-once event delivery?
and for transactions, we must go further and set transactional.id=<some value>
but by setting this value, it also sets idempotence to true?
Also, by setting one or both of the above to true, the producer will also set acks=all.
With the above, should I be able to get 'exactly once delivery' by simply changing the enable.idempotence setting? If I wanted to go further and enable transactional support, on the consumer side would I only need to change their setting to isolation.level=read_committed? Does this image reflect how to set up the producer in terms of EOS?
Yes you understood the main concepts.
By enabling idempotence, the producer automatically sets acks to all and guarantees message delivery for the lifetime of the Producer instance.
By enabling transactions, the producer automatically enables idempotence (and acks=all). Transactions allow grouping produce requests and offset commits together and ensure that either all of them or none of them get committed to Kafka.
When using transactions, you can configure whether consumers should only see records from committed transactions by setting isolation.level to read_committed; otherwise, by default, they see all records, including those from aborted transactions.
Actually, idempotence by itself does not always guarantee exactly-once event delivery. Say you have a consumer that consumes an event, processes it, and produces a new event. Somewhere in this process the offset that the consumer uses must be incremented and persisted. Without a transactional producer, if that happens before the producer sends the message, the message might never be sent and you get at-most-once delivery. If it happens after the message is sent, persisting the offset might fail, the consumer would then read the same message again, and the producer would send a duplicate, giving you at-least-once delivery. The all-or-nothing mechanism of a transactional producer prevents this scenario: provided you store your offsets in Kafka, producing the new message and advancing the consumer's offset become a single atomic action.
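A sketch of that consume-process-produce loop (not taken from either answer; broker address, topics, group, and transactional.id values are placeholders, and the "processing" step is just an upper-casing stand-in). The consumer's offsets are sent through sendOffsetsToTransaction, so the output record and the offset advance are committed or aborted together:

import java.time.Duration;
import java.util.*;
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.clients.producer.*;
import org.apache.kafka.common.TopicPartition;

public class TransactionalPipelineSketch {
    public static void main(String[] args) {
        Properties cProps = new Properties();
        cProps.put("bootstrap.servers", "localhost:9092");
        cProps.put("group.id", "pipeline-group");
        cProps.put("enable.auto.commit", "false");            // offsets go through the transaction instead
        cProps.put("isolation.level", "read_committed");
        cProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        cProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        Properties pProps = new Properties();
        pProps.put("bootstrap.servers", "localhost:9092");
        pProps.put("transactional.id", "pipeline-producer-1"); // also enables idempotence and acks=all
        pProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        pProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cProps);
        KafkaProducer<String, String> producer = new KafkaProducer<>(pProps);
        consumer.subscribe(Collections.singletonList("input-topic"));
        producer.initTransactions();

        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
            if (records.isEmpty()) continue;

            producer.beginTransaction();
            try {
                Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
                for (ConsumerRecord<String, String> record : records) {
                    producer.send(new ProducerRecord<>("output-topic", record.key(), record.value().toUpperCase()));
                    offsets.put(new TopicPartition(record.topic(), record.partition()),
                                new OffsetAndMetadata(record.offset() + 1));
                }
                // offsets are committed inside the same transaction as the output records
                producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
                producer.commitTransaction();
            } catch (Exception e) {
                producer.abortTransaction();                   // neither output nor offsets become visible
            }
        }
    }
}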

How to make a consumer leave and enter a consumer group in kafka

So, I have a consumer group and a few distinct nodes, each acting as a consumer. Each node is supposed to perform some computation-intensive task. I want a consumer to join this consumer group only when it has available CPU resources. Once it has joined, it will consume a message from the topic describing what computation it needs to perform and then start the computation. While this consumer is engaged in a computational task, I want it to leave the consumer group, as it has no further capacity to perform new computations. Is this possible to do in Kafka? Or maybe there is a better way to do this? I am using the kafka-python library.
In general, this is possible with any Kafka client. The way to do it is simply to subscribe to the topic, consume the message you want to process, acknowledge only that specific message, and close the consumer.
Specifically in the Kafka Python client, the method you want is KafkaConsumer.close. Make sure to set auto-commit to false, because your poll might have fetched more messages than the one you want to compute, and you only want to acknowledge the one you're actually going to work on.
Alternatively, you can set your consumer properties (specifically max.poll.records) to fetch only 1 message per poll, and then you can use the .close method with auto-commit set to true.
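A rough sketch of that flow, shown as a fragment with the Java client for consistency with the rest of this page (kafka-python's KafkaConsumer exposes equivalent poll/commit/close calls; broker address, topic, group, and startComputation are placeholders):

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "compute-workers");
props.put("enable.auto.commit", "false");   // acknowledge only the message we actually take
props.put("max.poll.records", "1");         // fetch at most one task per poll
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

String task = null;
try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
    consumer.subscribe(Collections.singletonList("compute-tasks"));
    for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(5))) {
        // acknowledge exactly this message
        consumer.commitSync(Collections.singletonMap(
                new TopicPartition(record.topic(), record.partition()),
                new OffsetAndMetadata(record.offset() + 1)));
        task = record.value();
    }
}   // close() here makes the consumer leave the group

if (task != null) {
    startComputation(task);   // placeholder for the long-running, CPU-heavy work
}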
More info on all the KafkaConsumer configuration options here:
https://kafka.apache.org/documentation/#consumerconfigs
And here's a link to the official kafka-python client KafkaConsumer docs:
https://kafka-python.readthedocs.io/en/master/apidoc/KafkaConsumer.html#kafka.KafkaConsumer.close

How to read messages from kafka consumer group without consuming?

I'm managing a Kafka queue using a common consumer group across multiple machines. Now I also need to show the current content of the queue. How do I read only those messages within the group which haven't been read yet, while still leaving those messages readable by the other consumers in the group which actually process them? Any help would be appreciated.
In Kafka, the notion of "reading" messages from a topic and that of "consuming" them are the same thing. At a high level, the only thing that makes a "consumed" message unavailable to a consumer is that consumer setting its read offset to a value beyond that of the message in question. Thus, you can turn off the autocommit feature of your consumers and avoid committing offsets in cases where you'd like only to "read" but not to "consume".
A good proxy for getting "all messages which haven't been read" is to compare the latest committed offset to the highwater mark offset per partition. This provides a notion of "lag" that indicates how far behind a given consumer is in its consumption of a partition. The fetch_consumer_lag CLI function in pykafka is a good example of how to do this.
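The same idea, sketched as a fragment with the Java client instead of pykafka (assuming a consumer that already has partitions assigned; the names are placeholders):

// compare each assigned partition's committed offset with its end offset
Map<TopicPartition, Long> endOffsets = consumer.endOffsets(consumer.assignment());
for (TopicPartition tp : consumer.assignment()) {
    OffsetAndMetadata committed = consumer.committed(tp);
    long committedOffset = (committed == null) ? 0L : committed.offset();
    long lag = endOffsets.get(tp) - committedOffset;   // messages the group has not yet "consumed"
    System.out.printf("%s lag=%d%n", tp, lag);
}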
In Kafka, a partition can be consumed by only one consumer in a group, i.e. if your topic has 10 partitions and you spawn 20 consumers with the same groupId, then only 10 will be assigned partitions and the remaining 10 will sit idle. An idle consumer will be given partitions by Kafka only if one of the existing consumers dies or stops polling from the topic.
AFAIK, I don't think you can do what I understand you want to do within a consumer group. You can obviously create another groupId and process messages based on the information gathered by the first consumer group.
Kafka now has a KStream.peek() method
See proposal "Add KStream peek method".
It's not 100% clear to me from the docs whether this prevents consumption of the message that's peeked from the topic, but I can't see how you could use it in any crash-safe, robust way unless it does.
See also:
Handling consumer rebalance when implementing synchronous auto-offset commit
High-Level Consumer and peeking messages
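For reference, a minimal Kafka Streams fragment using peek() (topic names are placeholders and streamsProps stands for an already configured set of Streams properties): peek runs a side effect, such as logging, on every record as it passes through the topology, without transforming the record.

StreamsBuilder builder = new StreamsBuilder();
builder.stream("input-topic", Consumed.with(Serdes.String(), Serdes.String()))
       .peek((key, value) -> System.out.println("saw " + key + " -> " + value))
       .to("output-topic");

KafkaStreams streams = new KafkaStreams(builder.build(), streamsProps);
streams.start();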
I think that you can use the publish-subscribe model, where each consumer has its own offset (i.e. its own consumer group) and can therefore consume all messages for itself.

Apache Kafka Consumer group and Simple Consumer

I am new to Kafka. What I've understood so far regarding the consumer is that there are basically two types of implementation:
1) The High level consumer/consumer group
2) Simple Consumer
The most important point about the high-level abstraction is that it is used when you don't want to handle the offsets yourself, while the Simple Consumer provides much better control over offset management. What confuses me is: what if I want to run the consumer in a multithreaded environment and also want to have control over the offsets? If I use a consumer group, does that mean I must read from the last offset stored in ZooKeeper? Is that the only option I have?
For the most part, the high-level consumer API does not let you control the offset directly.
When the consumer group is first created, you can tell it whether to start with the oldest or newest message that kafka has stored using the auto.offset.reset property.
You can also control when the high-level consumer commits new offsets to zookeeper by setting auto.commit.enable to false.
Since the high-level consumer stores the offsets in zookeeper, your app could access zookeeper directly and manipulate the offsets - but it would be outside of the high-level consumer API.
Your question was a little confusing but you can use the simple consumer in a multi-threaded environment. That's what the high-level consumer does.
In Apache Kafka 0.9 and 0.10, consumer group management is handled entirely within Kafka itself, by a broker (for coordination) and an internal topic (for state storage).
When a consumer group first subscribes to a topic the setting of auto.offset.reset determines where consumers begin to consume messages (http://kafka.apache.org/documentation.html#newconsumerconfigs)
You can register a ConsumerRebalanceListener to receive a notification when a particular consumer is assigned topics/partitions.
Once the consumer is running, you can use seek, seekToBeginning and seekToEnd to get messages from a specific offset. seek affects the next poll for that consumer, and the resulting position is stored on the next commit (e.g. commitSync, commitAsync, or when auto.commit.interval.ms elapses, if auto-commit is enabled).
The consumer javadocs mention more specific situations: http://kafka.apache.org/0100/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html
You can combine the group management provided by Kafka with manual management of offsets via seek(..) once partitions are assigned.
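A minimal fragment of that combination (not from the answer; the topic name and the processing step are placeholders): subscribe normally so Kafka manages the group, then seek inside the rebalance callback once partitions have been assigned.

consumer.subscribe(Collections.singletonList("my-topic"), new ConsumerRebalanceListener() {
    public void onPartitionsRevoked(Collection<TopicPartition> partitions) { }
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        // manual offset control: e.g. jump to the end (seek and seekToBeginning work the same way)
        consumer.seekToEnd(partitions);
    }
});

while (true) {
    for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
        process(record);      // placeholder for application logic
    }
    consumer.commitSync();    // positions set via seek are what get committed
}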