Clarifications on Apache Kafka - apache-kafka

I have few questions on Apache Kafka.
Can a single partition be assigned to more than one consumer from the same group?
Where is the offset stored? Is it in the partition or at the consumer.
Just like the producer always post the record to the lead partition and the records gets replicated to other partitions, Does Kafka consumer reads the data from the lead partition?
Lets say, that a consumer is reading from a partition and the consumer is running a long process. In this case, the rate at which the producer is updating the partition will be faster than the rate at which the consumer is consuming from the same partition. Is there a way we can speed up the consumption from that partition?
Can we create a checkpoint in the commit log on the partition so that the consumer can start processing from that specific checkpoint? This would be useful, if I want to perform the audit from a specific checkpoint onward?

Can a single partition be assigned to more than one consumer from the same group?
No, one partition can be consumed at most from one consumer within the same consumer group as described here: "This is achieved by assigning the partitions in the topic to the consumers in the consumer group so that each partition is consumed by exactly one consumer in the group."
Where is the offset stored? Is it in the partition or at the consumer.
The offsets for each consumer group is stored in an internal kafka topic called __consumer_offsets as described here: "The coordinator of each group is chosen from the leaders of the internal offsets topic __consumer_offsets, which is used to store committed offsets."
Just like the producer always post the record to the lead partition and the records gets replicated to other partitions, Does Kafka consumer reads the data from the lead partition?
Yes it does. The leader partition is the only "client-facing" partition as described here: "'leader' is the node responsible for all reads and writes for the given partition.".
EDIT:
Is there a way we can speed up the consumption from that partition?
The measure to speed up consumption is to increase the partitions of the topic so you can have more consumer threads reading from that topic and process the data in parallel. At the same time you need to make sure that your data is evenly distributed accross partitions.

Related

kafka should the consumer and producer have knowledge of the partitions

I'm trying to wrap my head around kafka and the thing that confuses me are the partitions. From all/most of the examples I have seen the consumers/products seem to have implicit knowledge of the partitions, (which partition to write messages to, which partition to read messages from). Is this correct, I initially thought that partitions are internal to the system and the consumers/producers dont need to know partition information. If they need to know partition information then aren't we exposing the inner structure of the topic to a certain extent to the outside world?
In kafka every partition in a topic has a set of brokers, and at most one broker leader per partition. You cannot have more consumers of a topic than the number of partitions because otherwise some consumer would be inactive.You can have multiple partitions for a single consumer, but cannot have multiple consumers for a single partition. So the number of partitions must be chosen according to the throughput you expect. The number of partitions can be increased on a topic, but never decreased. When consumers connect to a partition they actually connect to the broker leader to consume messages.
Anyway the partition leader could change, so the consumer would get an error and should send the request for meta-data to the cluster controller in order to get the info on the new partition leader. At consumer startup partitions are assigned according to the kafka parameter partition.assignment.strategy. Of course if consumers start at different times on the same consumer group there will be partition rebalance.
Finally you need a lot of info on the kafka cluser structure as a client.

How multiple consumers from different consumer groups read from same partition?

I have a use case where i have 2 consumers in different consumer groups(cg1 and cg2) subscribing to same topic(Topic A) with 4 partitions.
What happens if both consumers are reading from same partition and one of them failed and other one commited the offset?
In Kafka the offset management is done by Consumer Group per Partition.
If you have two consumer groups reading the same topic and even partition a commit from one consumer group will not have any impact to the other consumer group. The consumer groups are completely discoupled.
One consumer of a consumer group can read data from a single topic partition. A single consumer can't read data from multiple partitions of a topic.
Example Consumer 1 of Consumer Group 1 can read data of only single topic partition.
Offset management is done by the zookeeper.
__consumer_offsets: Every consumer group maintains its offset per topic partitions. Since v0.9 the information of committed offsets for every consumer group is stored in this internal topic (prior to v0.9 this information was stored on Zookeeper).
When the offset manager receives an OffsetCommitRequest, it appends the request to a special compacted Kafka topic named __consumer_offsets. Finally, the offset manager will send a successful offset commit response to the consumer, only when all the replicas of the offsets topic receive the offsets.
simultaneously two consumers from two different consumer groups(cg1 and cg2) can read the data from same topic.
In kafka 1: Offset management is taken care by zookeeper.
In kafka 2: offsets of each consumer is stored at __Consumer_offsets topic
Offset used for keeping the track of consumers (how much records consumed by consumers), let say consumer-1 consume 10 records and consumer-2 consume-20 records and suddenly consumer-1 got died now whenever the consumer-1 will up then it will start reading from 11th record onward.

How multiple consumer group consumers work across partition on the same topic in Kafka?

I was reading this SO answer and many such blogs.
What I know:
Multiple consumers can run on a single partition when running multiple consumers with multiple consumer group id and only one consumer from a consumer group can consume at a given time from a partition.
My question is related to multiple consumers from multiple consumer groups consuming from the same topic:
What happens in the case of multiple consumers(different groups) consuming a single topic(eventually the same partition)?
Do they get the same data?
How offset is managed? Is it separate for each consumer?
(Might be opinion based) How do you or generally recommended way is to handle overlapping data across two consumers of a separate group operating on a single partition?
Edit:
"overlapping data": means two consumers of separate consumer groups operating on the same partition getting the same data.
Yes they get the same data. Kafka only stores one copy of the data in the topic partitions' commit log. If consumers are not in the same group then they can each get the same data using fetch requests from the clients' consumer library. The assignment of which partitions each group member will get is managed by the lead consumer of each group. The entire process in detailed steps is documented here https://community.hortonworks.com/articles/72378/understanding-kafka-consumer-partition-assignment.html
Offsets are "managed" by the consumers, but "stored" in a special __consumer_offsets topic on the Kafka brokers.
Offsets are stored for each (consumer group, topic, partition) tuple. This combination is also used as the key when publishing offsets to the __consumer_offsets topic so that log compaction can delete old unneeded offset commit messages and so that all offsets for the same (consumer group, topic, partition) tuple are stored in the same partition of the __consumer_offsets topic (which defaults to 50 partitions)
Each consumer group gets every message from a subscribed topic.
Yes
Offset are stored by partition. For example let's say you have a topic with 2 partitions and a consumer group named cg made up of 2 consumers. In that case Kafka assigns each of the consumers one of the partitions. Then the consumers fetch the offset for the partition they were assigned to from Kafka (e.g. consumer 'asks' Kafka: "What is the offset for this topic for consumer group cg partition 1", or partition 2 for the other consumer). After getting the correct offset the consumer polls some Kafka broker for the next message in that partition.
I'm not entirely sure what you mean by overlapping data, can you clarify a bit or give an example?

Kafka - Topic & Partitions & Consumer

Just wanna understand the basics properly.
Let's say I've a topic called "myTopic" that has 3 partitions P0, P1 & P2.
Each of these partitions will have a leader and the data (messages) for this topic is distributed across these partitions.
1. Producer will always writes to the leader of the partition in a round robin fashion based on the load on the broker. Is that right?
2. How do the producer know the leader of the partition?
3. Consumer reading a particular topic should read all partitions of that topic? Is that correct?
Appreciate your help.
Producer will always writes to the leader of the partition in a round robin fashion based on the load on the broker. Is that right?
By default, yes.
That said, a producer can also decide to use a custom partitioning scheme, i.e. a different strategy to which partitions data is being written to.
How do the producer know the leader of the partition?
Through the Kafka protocol.
Consumer reading a particular topic should read all partitions of that topic? Is that correct?
By default, yes.
That said, you can also implement e.g. consumer applications that implement custom logic, e.g. a "sampling" consumer that only reads from 1 out of N partitions.
Producer will always writes to the leader of the partition
Yes, always.
in a round robin fashion based on the load on the broker
No. If a partition is explicitly set on a ProducerRecord then that partition is used. Otherwise, if a custom partitioner implementation is provided, that determines the partition. Otherwise, if the msg key is not null, the hash of the key will be used to consistently send msgs with the same key to the same partition. If the msg key is null, only then the msg will indeed be sent to any partition in a round-robin fashion. However, this is irrespective of the load on the broker.
How do the producer know the leader of the partition?
By periodically asking the broker for metadata.
Consumer reading a particular topic should read all partitions of that topic? Is that correct?
Consumers form consumer groups. If there are multiple consumer instances in a consumer group, each consumes a subset of the partitions. But the consumer group as a whole consumes from all partitions. That is, unless you decide to go "low-level" and manage that yourself, which you can do.

Kafka partitions and consumer groups for at-least-once message delivery

I am trying to come up with a design using Kafka for a number of processing agents to process messages from a Kafka topic in parallel.
I would like to ensure close to exactly-once per message processing across the whole consumer group, although can tolerate at-least-once.
I find the documentation unclear in many regards, and there are a few specific questions I have to know if this is a viable approach:
if a message is published to a topic, does it exist once only across all partitions in the topic or is it replicated on possibly more than one partition? I have read statements that could support both possibilities.
is the "offset" per partition or per consumer/consumergroup/partition?
when I start a new consumer, does it look at the offset for the consumer group as a whole or for the partition it is assigned?
if I want to scale up new consumers and there are no free partitions (I believe there can be not more than one consumer per partition), will kafka rebalance existing messages from the existing partitions, and how does that affect the offsets and consumers of existing partitions?
Or are there any other points I am missing that may help my understanding of this?
if a message is published to a topic, does it exist once only across all partitions in the topic or is it replicated on possibly more than one partition? I have read statements that could support both possibilities.
[A]: the partition is replicated across nodes depending on replication factor. if you have partition P1 in a broker with 2 nodes and replication factor of 2, then, node1 will be primary leader for P1 and node2 will also have the P1 contents/messaged but it will be the replica (and replication happens in async manner)
is the "offset" per partition or per consumer/consumergroup/partition?
[A]: per partition from a broker standpoint. its also per consumer since 'offset' is explicitly tracked/managed on the consumer end. The consumer code can delegate this work to Kafka or manage the offsets manually
when I start a new consumer, does it look at the offset for the consumer group as a whole or for the partition it is assigned?
[A]: kafka would trigger a rebalance when a new consumer enters the group and assign certain partitions to it. from there on, the consumer will only care about the offsets of the partitions which it is responsible for
if I want to scale up new consumers and there are no free partitions (I believe there can be not more than one consumer per partition), will kafka rebalance existing messages from the existing partitions, and how does that affect the offsets and consumers of existing partitions?
[A] for parallelism, the ideal scenario is to have 1-1 mapping b/w consumer and partition e.g. if you have 10 partitions, you can have at max 10 consumers. If you bring in the 11th one, kafka wont assign partitions to it unless an existing consumer leaves the group