How does Zookeeper retrive the consumer offsets from __consumer_offsets topic? - apache-kafka

This is a followup question to "Where do zookeeper store Kafka cluster and related information?" based on the answer provided by Armando Ballaci.
Now it's clear that consumer offsets are stored in the Kafka cluster in a special topic called __consumer_offsets. That's fine, I am just wondering how does the retrieval of these offsets work.
Topics are not like RDBS over which we can query for arbitrary data based on a certain predicate. Ex - if the data is stored in an RDBMS, probably a query like below will get the consumer offset for a particular partition of a topic for a particular consumer of some consumer group.
select consumer_offset__read, consumer_offset__commited from consumer_offset_table where consumer-grp-id="x" and partitionid="y"
But clearly this kind of retrieval is not possible o.n Kafka Topics. So how does the retrieval mechanism from topic work? Could someone elaborate?
(Data from Kafka partitions is read in FIFO, and if Kafka consumer model is followed to retrieve a particular offset, a lot of additional data has to be processed and it's going to be slow. So am wondering if it's done in some other way...)

Some description I could find on web regarding the same when I stumbled upon this for my day job is as follows:
In Kafka releases through 0.8.1.1, consumers commit their offsets to ZooKeeper. ZooKeeper does not scale extremely well (especially for writes) when there are a large number of offsets (i.e., consumer-count * partition-count). Fortunately, Kafka now provides an ideal mechanism for storing consumer offsets. Consumers can commit their offsets in Kafka by writing them to a durable (replicated) and highly available topic. Consumers can fetch offsets by reading from this topic (although we provide an in-memory offsets cache for faster access). i.e., offset commits are regular producer requests (which are inexpensive) and offset fetches are fast memory look ups.
The official Kafka documentation describes how the feature works and how to migrate offsets from ZooKeeper to Kafka. This wiki provides sample code that shows how to use the new Kafka-based offset storage mechanism.
try {
BlockingChannel channel = new BlockingChannel("localhost", 9092,
BlockingChannel.UseDefaultBufferSize(),
BlockingChannel.UseDefaultBufferSize(),
5000 /* read timeout in millis */);
channel.connect();
final String MY_GROUP = "demoGroup";
final String MY_CLIENTID = "demoClientId";
int correlationId = 0;
final TopicAndPartition testPartition0 = new TopicAndPartition("demoTopic", 0);
final TopicAndPartition testPartition1 = new TopicAndPartition("demoTopic", 1);
channel.send(new ConsumerMetadataRequest(MY_GROUP, ConsumerMetadataRequest.CurrentVersion(), correlationId++, MY_CLIENTID));
ConsumerMetadataResponse metadataResponse = ConsumerMetadataResponse.readFrom(channel.receive().buffer());
if (metadataResponse.errorCode() == ErrorMapping.NoError()) {
Broker offsetManager = metadataResponse.coordinator();
// if the coordinator is different, from the above channel's host then reconnect
channel.disconnect();
channel = new BlockingChannel(offsetManager.host(), offsetManager.port(),
BlockingChannel.UseDefaultBufferSize(),
BlockingChannel.UseDefaultBufferSize(),
5000 /* read timeout in millis */);
channel.connect();
} else {
// retry (after backoff)
}
}
catch (IOException e) {
// retry the query (after backoff)
}

In Kafka releases through 0.8.1.1, consumers commit their offsets to ZooKeeper. ZooKeeper does not scale extremely well (especially for writes) when there are a large number of offsets (i.e., consumer-count * partition-count). Fortunately, Kafka now provides an ideal mechanism for storing consumer offsets. Consumers can commit their offsets in Kafka by writing them to a durable (replicated) and highly available topic. Consumers can fetch offsets by reading from this topic (although we provide an in-memory offsets cache for faster access). i.e., offset commits are regular producer requests (which are inexpensive) and offset fetches are fast memory look ups.
The official Kafka documentation describes how the feature works and how to migrate offsets from ZooKeeper to Kafka.
The idea is that if you need such a functionality as you describe you need to store the data in a RDBS or a NoSQL database or an ELK Stack. A good pattern would be through Kafka Connect using a Sink connector. The normal message processing in Kafka is done through Consummers or Stream Definitions that react on the Events as they come. You can certainly seek to offset or timestamp in some cases and that is completely possible...
In the latest versions of Kafka the offsets are not kept in Zookeeper anymore. So Zookeeper is not involved in Consumer ofset handling.

Related

How to request data from producer at beginning position that does not exist in Kafka?

I have a database with time series data and this data is sent to Kafka.
Many consumers build aggregations and reporting based on this data.
My Kafka cluster stores data with TTL for 1 day.
But how I can build a new report and run a new consumer from 0th position that does not exist in Kafka but exists in source storage.
For example - some callback for the producer if I request an offset that does not exist in Kafka?
If it is not possible please advise other architectural solutions. I want to use the same codebase to aggregate this data.
For example - some callback for the producer if I request an offset
that does not exist in Kafka?
If the data does not exist in Kafka, you cannot consume it much less do any aggregation on top of it.
Moreover, there is no concept of a consumer requesting a producer. Producer sends data to Kafka broker(s) and consumers consume from those broker(s). There is no direct interaction between a producer and a consumer as such.
Since you say that the data still exists in the source DB, you can fetch your data from there and reproduce it to Kafka.
When you produce that data again, they will be new messages which will be eventually consumed by the consumers as usual.
In case you would like to differentiate between initial consumption and re-consumption, you can produce these messages to a new topic and have your consumers consume from them.
Other way is to increase your TTL (I suppose you mean retention in Kafka when you say TTL) and then you can seek back to a timestamp in the consumers using the offsetsForTimes(Map<TopicPartition,Long> timestampToSearch) and seek(TopicPartition topicPartition, long offset) methods.

How to determine topic has been read completely by Kafka Stream application from very first offset to last offset from Java application

I need some help in Kafka Streams. I have started a Kafka stream application, which is streaming one topic from the very first offset. Topic is very huge in data, so I want to implement a mechanism in my application, using Kafka streams, so that I can get notified when topic has been read completely to the very last offset.
I have read Kafka Streams 2.8.0 api, I have found an api method i-e allLocalStorePartitionLags, which is returning map of store names to another map of partition containing all the lag information against each partition. This method returns lag information for all store partitions (active or standby) local to this Streams. This method is quite useful for me, in above case, when I have one node running that stream application.
But in my case, system is distributed and application nodes are 3 and topic partitions are 10, which meaning each node have at least 3 partitions for the topic to read from.
I need help here. How I can implement this functionality where I can get notified when topic has been read completely from partition 0 to partition 9. Please note that I don't have option to use database here as of now.
Other approaches to achieve goal are also welcomed. Thank you.
I was able to achieve lag information from adminClient api. Below code results end offsets and current offsets for each partitions against topics read by given stream application i-e applicationId.
AdminClient adminClient = AdminClient.create(kafkaProperties);
ListConsumerGroupOffsetsResult listConsumerGroupOffsetsResult = adminClient.listConsumerGroupOffsets(applicationId);
// Current offsets.
Map<TopicPartition, OffsetAndMetadata> topicPartitionOffsetAndMetadataMap = listConsumerGroupOffsetsResult.partitionsToOffsetAndMetadata().get();
// all topic partitions.
Set<TopicPartition> topicPartitions = topicPartitionOffsetAndMetadataMap.keySet();
// list of end offsets for each partitions.
ListOffsetsResult listOffsetsResult = adminClient.listOffsets(topicPartitions.stream()
.collect(Collectors.toMap(Function.identity(), tp -> OffsetSpec.latest())));

Kafka excatly-once producer consumer

I am implementing Exactly-once semantics for a simple data pipeline, with Kafka as message broker. I can force Kafka producer to write each produced record exactly once by setting set enable.idempotence=true.
However, on the consumption front I need to guarantee that the consumer reads each record exactly once (I am not interested in storing the consumed records to external system or to another Kafka topic just processing). To achieve this, I have to ensure that polled records are processed and their offsets are committed to __consumer_offsets topic atomically/transactionally (both succeed/fail together).
In such case do I need to resort to Kafka transaction APIs to create a transactional producer in the consumer polling loop, where inside the transaction I perform: (1) processing of the consumed records and (2) committing their offsets, before closing the transaction. Would the normal commitSync/commitAsync serve in such case?
"on the consumption front I need to guarantee that the consumer reads each record exactly once"
The answer from Gopinath explains well how you can achieve exactly-once between a KafkaProducer and KafkaConsumer. These configurations (together with the application of Transaction API in the KafkaProducer) guarantees that all data send by the producer will be stored in Kafka exactly once. However, it does not guarantee that the Consumer is reading the data exactly once. This, of course, depends on your offset management.
Anyway, I understand your question that you want to know how the Consumer itself is processing a consumed message exactly once.
For this you need to manage your offsets on your own in a atomic way. That means, you need build your own "transaction" around
fetching data from Kafka,
processing data, and
storing processed offsets externally.
The methods commitSync and commitAsync will not get you far here as they can only ensure at-most-once or at-least-once processing within the Consumer. In addition, it is beneficial that your processing is idempotent.
There is a nice blog that explains such an implementation making use of the ConsumerRebalanceListener and storing the offsets in your local file system. A full code example is also provided.
"do I need to resort to Kafka transaction APIs to create a transactional producer in the consumer polling loop"
The Transaction API is only available for KafkaProducers and as far as I am aware cannot be used for your offset management.
'Exactly-once' functionality in Kafka can be achieved by a combination of these 3 settings:
isolation.level = read_committed
transactional.id = <unique_id>
processing.guarantee = exactly_once
More information on enabling the exactly-once functionality:
https://www.confluent.io/blog/simplified-robust-exactly-one-semantics-in-kafka-2-5/
https://www.confluent.io/blog/exactly-once-semantics-are-possible-heres-how-apache-kafka-does-it/

Commit transactional offset across 2 different clusters

We are using processors with exactly-once delivery (committing consumer offset through producer) and need to understand whether it is possible for this to happen when consuming a message from a topic in kafka-cluster-1 and producing to a topic on kafka-cluster-2 (and vice-versa).
This is a snippet from the transactional processor:
messageProducer.beginTransaction(partitionId)
resultPublisher.publish(partitionId, resultTopic, messageRecord.key(), result)
val offsetAndMetadata = messageConsumer.getUncommittedOffsets(listenTopic, messageRecord)
messageProducer.sendOffsetsToTransaction(partitionId, offsetAndMetadata, consumerGroupId)
messageProducer.commitTransaction(partitionId)
My understanding is that a the producer will try to commit the offset on a consumer topic in the same cluster.
I did some research but can't really find anything related to multiple clusters.
Is it possible at all?
It is possible, you can "manually" send offsets to your own topic on the same cluster, where the produced message is sent. This way you can use guarantees provided by transactions.
You would need to create your own topic for offsets, similar to Kafka's internal __consumer_offsets, where as key you should use groupId, topic, partition, and as value the most recent offset (already read or to be read). Remember to use log compaction.
AFAIK there is no possibility to have transaction across two different clusters.

Kafka: Why broker isn't pull based like consumers

I was reading Kafka docs where it was mentioned that:-
Consumers pulls data from broker by requesting from offset.
Producer pushes messages to broker.
Making Kafka consumers pull based make sense that the consumers can drive the pace and broker can store the data for a really long time.
However with producers being push based, How does Kafka make sure that speed mismatch between producer and kafka won't happen? Also producers don't have persistance by design.This seems to be a bigger problem, when producers and brokers are separated over high latency network(internet).
As a distributed commit log, Kafka solves exactly this (impedance mismatch). You produce your events at the rate at which they occur into Kafka, and then you consume them at the rate at which your application can. The data is persisted in Kafka regardless. If your application needs to consume at a greater rate, you scale it out and partition your topic and consume in parallel. Because the data is persisted the only factor is how fast you want to consume the data.