Is an offset singular for a kafka's topic replicas? - apache-kafka

A Kafka consumer reads from a partition, which has replicas. After the consumer acknowledges (commits) its position, the offset is updated.
Is the offset updated for all of the partition's replicas?

The offset is updated only for the partition, not for its replicas.
Let's say we have a topic with 5 partitions and a replication factor (RF) of 3, which is being consumed by a consumer.
The consumer connects to only one replica of each partition to read data (by default, the leader of each of the 5 partitions). Once it has read messages from the partitions, the offset is updated for each of the 5 partitions, not for every replica of every partition.

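No, an offset is not singular across all replicas of a partition. Each replica keeps its own log position (its log end offset), and a follower's position can lag behind the leader's. The committed offset of a consumer group, however, is recorded once per partition, not once per replica.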
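To make the distinction between replica logs and committed offsets concrete, here is a minimal toy model in plain Python. It does not use the Kafka API; the broker names, the topic name `orders` and the helper `commit` are made up for illustration.

```python
# Toy model, not the Kafka API: one partition with a replication factor of 3.
leader_log = ["m0", "m1", "m2", "m3", "m4"]   # offsets 0..4 on the leader

replicas = {
    "broker-1 (leader)": leader_log,
    "broker-2": leader_log[:5],   # follower that is fully caught up
    "broker-3": leader_log[:3],   # follower that is lagging behind
}

# Committed offsets are kept once per (group, topic, partition),
# independently of how many replicas the partition has.
committed = {}

def commit(group, topic, partition, next_offset):
    """Record the next offset the group will read (hypothetical helper)."""
    committed[(group, topic, partition)] = next_offset

# The consumer read offsets 0..3 from the leader and commits "next = 4".
commit("cg1", "orders", 0, 4)

# A record at a given offset is the same on every replica that already has it;
# replicas differ only in how far they have caught up.
for name, log in replicas.items():
    assert log == leader_log[:len(log)]

print(committed)
```

The commit produces a single entry, regardless of the number of replicas.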

Related

Why does the Kafka topic __consumer_offsets have unbalanced partition sizes?

I am using Kafka 2.0.0.
In the __consumer_offsets topic, most partitions are about 30 MB, but some partitions are very big. For example, one partition is 15 GB, another is 250 GB, and so on.
What could be the problem?
The topic __consumer_offsets stores the latest committed offset for each subscribed TopicPartition of a Kafka consumer group. In this topic, the consumer group serves as the key.
Apparently, the consumer groups that fall into the same partition (by the hash(key) % #partitions logic) are much more active (consuming more messages, more frequently) than your other consumer groups.
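The hash(key) % #partitions logic can be illustrated with a small sketch. To my understanding, the broker takes the absolute value of the Java hashCode of the group ID modulo offsets.topic.num.partitions (50 by default); treat the exact formula as an assumption rather than a specification. The group names below are hypothetical.

```python
def java_string_hash(s: str) -> int:
    """Mimic Java's String.hashCode() as a signed 32-bit int (BMP characters only)."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h

def offsets_partition(group_id: str, num_partitions: int = 50) -> int:
    """Assumed mapping of a consumer group to a __consumer_offsets partition."""
    h = java_string_hash(group_id)
    # Assumption: the minimum int value maps to 0, everything else to abs(h).
    h = 0 if h == -0x80000000 else abs(h)
    return h % num_partitions

# Hypothetical group names; every commit of a group lands in "its" partition.
groups = ["billing", "search-indexer", "audit", "hello"]
placement = {g: offsets_partition(g) for g in groups}
print(placement)
```

Because all commits of a group go to the same partition, a few very busy groups are enough to make a handful of partitions much larger than the rest.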

Does a Kafka producer write keyless messages to another partition when a partition is down?

For example, I have a topic that has 2 partitions, and a producer using the DefaultPartitioner (round-robin, I assume) writes to the topic. At some point, partition 1 becomes unavailable because all of the replica brokers go offline. Assuming the messages have no specified keys, will the producer resend the messages to partition 2, or will it simply get stuck?
That is an interesting question and we should look at it from a broader (cluster) perspective.
"At some point, partition 1 becomes unavailable because all of the replica brokers go offline."
I see the following scenarios:
1. All replica brokers of partition one are different from the replica brokers of partition two.
2. All replica brokers of partition one are the same as those of partition two.
3. Some replica brokers are the same as those of partition two.
In scenario 1, you still have enough brokers alive, as the replication factor is a topic-wide, not a partition-based, configuration. In that case, as soon as the first broker goes down, its data will be moved to another broker to ensure that your partition always has enough in-sync replicas.
In scenario 2, both partitions become unavailable and your KafkaProducer will eventually time out. What happens next depends on whether you have other brokers that are alive and can take on the data of the partitions.
In scenario 3, the dead replicas would be shifted to running brokers. During that time, the KafkaProducer will only write to partition 2, as this is the only available partition in the topic. As soon as partition 1 has enough in-sync replicas, the producer will start producing to both partitions again.
Actually, I could think of many more scenarios. If you need a more concrete answer you need to specify
how many brokers you have,
what your replication factor actually is and
in which order the brokers go down.
"Assuming the messages have no specified keys, will the producer resend the messages to partition 2?"
The KafkaProducer will not re-send data that was previously sent to partition 1 to partition 2. Whatever was written to partition 1 will stay in partition 1.
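The behaviour for keyless records can be sketched with a toy partitioner. This is a simplification under stated assumptions: the class `KeylessPartitioner` is hypothetical, and the real default partitioner differs between Kafka versions (plain round-robin in older clients, sticky batching in newer ones). The sketch only shows the idea of choosing among the partitions that are currently available.

```python
import itertools

class KeylessPartitioner:
    """Toy round-robin partitioner for records without a key (not Kafka's class)."""

    def __init__(self, partitions):
        self.partitions = list(partitions)
        self.counter = itertools.count()

    def choose(self, available):
        # Prefer partitions that currently have a leader; if none is
        # available, fall back to the full list (the send would then time out).
        candidates = [p for p in self.partitions if p in available] or self.partitions
        return candidates[next(self.counter) % len(candidates)]

partitioner = KeylessPartitioner([1, 2])

# Both partitions available: new records alternate between them.
both_up = [partitioner.choose({1, 2}) for _ in range(4)]

# Partition 1 unavailable: new records go to partition 2 only.
p1_down = [partitioner.choose({2}) for _ in range(4)]

print(both_up, p1_down)
```

Note that this only affects where new records go. Records already written to partition 1 are not moved.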

How do multiple consumers from different consumer groups read from the same partition?

I have a use case where I have 2 consumers in different consumer groups (cg1 and cg2) subscribing to the same topic (Topic A) with 4 partitions.
What happens if both consumers are reading from the same partition, one of them fails, and the other one commits the offset?
In Kafka, offset management is done per consumer group, per partition.
If you have two consumer groups reading the same topic, and even the same partition, a commit from one consumer group will not have any impact on the other consumer group. The consumer groups are completely decoupled.
Within a consumer group, each topic partition is read by at most one consumer. A single consumer can, however, be assigned several partitions of a topic.
Example: partition 0 of the topic is read by only one consumer of Consumer Group 1, for instance Consumer 1.
Offset management was originally done by ZooKeeper.
__consumer_offsets: Every consumer group maintains its offset per topic partitions. Since v0.9 the information of committed offsets for every consumer group is stored in this internal topic (prior to v0.9 this information was stored on Zookeeper).
When the offset manager receives an OffsetCommitRequest, it appends the request to a special compacted Kafka topic named __consumer_offsets. Finally, the offset manager will send a successful offset commit response to the consumer, only when all the replicas of the offsets topic receive the offsets.
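Since __consumer_offsets is a compacted topic, older commits for the same key can eventually be discarded. The following is a minimal sketch of that idea in plain Python, assuming the key is the (group, topic, partition) triple; the function `compact` is illustrative and is not how the broker's log cleaner is implemented.

```python
def compact(log):
    """Keep only the most recent value for each key, preserving log order."""
    latest = {}
    for position, (key, value) in enumerate(log):
        latest[key] = (position, value)   # later entries overwrite earlier ones
    survivors = sorted(latest.items(), key=lambda item: item[1][0])
    return [(key, value) for key, (position, value) in survivors]

# Each offset commit is appended as (key, committed_offset).
commits = [
    (("cg1", "TopicA", 0), 5),
    (("cg2", "TopicA", 0), 3),
    (("cg1", "TopicA", 0), 9),   # supersedes the first entry for the same key
    (("cg1", "TopicA", 1), 2),
]

compacted = compact(commits)
for key, offset in compacted:
    print(key, offset)
```

After compaction, only the latest committed offset per group and partition remains, which is all a restarting consumer needs.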
Two consumers from two different consumer groups (cg1 and cg2) can read data from the same topic simultaneously.
Before Kafka 0.9: offset management is taken care of by ZooKeeper.
Since Kafka 0.9: the offsets of each consumer group are stored in the __consumer_offsets topic.
The offset is used to keep track of consumers (how many records each has consumed). Let's say consumer-1 has consumed 10 records and consumer-2 has consumed 20 records, and suddenly consumer-1 dies. Whenever consumer-1 comes back up, it will start reading from the 11th record onward.
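The example with 10 and 20 consumed records can be replayed with a toy model. This is plain Python rather than a Kafka client, and the helper `consume` and the record names are made up for illustration.

```python
# One partition holding 30 records; list index = offset.
log = [f"record-{i}" for i in range(1, 31)]

# Committed position per (group, partition): the next offset to read.
committed = {}

def consume(group, partition, count):
    """Read up to `count` records from the group's position, then commit."""
    start = committed.get((group, partition), 0)
    batch = log[start:start + count]
    committed[(group, partition)] = start + len(batch)
    return batch

consume("cg1", 0, 10)   # consumer-1 (group cg1) reads 10 records
consume("cg2", 0, 20)   # consumer-2 (group cg2) reads 20 records

# consumer-1 crashes and restarts: it resumes from cg1's committed position,
# and cg2's commits play no role in that.
after_restart = consume("cg1", 0, 1)
print(after_restart, committed)
```

The restarted consumer continues with the 11th record, while the position of the other group stays at 20.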

Kafka: multiple producers writing to the same topic?

Say I have a topic T1 with 3 partitions, i.e. P1, P2 and P3, where P1 is the leader and the rest are followers.
Now there are 2 producers that want to push to the same topic T1. I believe P1 will be the leader for both of them? Also, will a single offset be maintained for both of them, or is the offset maintained per partition per producer?
Now I have a single consumer which is polling from T1. Will it get messages from both producers by default, or does it have to explicitly mention the producer name if it wants messages from a specific producer?
The leader does not depend on the producers or consumers, so P1 will always be returned as the leader. Offsets are not important for producers; they are defined per consumer group. The offset determines which messages have been read and committed by a consumer group.
The consumer will always read all the messages, no matter which producer published them.
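A small toy model shows why the consumer sees messages from every producer. It is plain Python, not the Kafka API; the producer names and the `send` helper are hypothetical. The producer name is carried inside the record here purely for illustration, since a consumer that wants to filter by producer has to rely on something the application puts into the record.

```python
from collections import defaultdict

# partition id -> append-only log; the list index is the record's offset
partitions = defaultdict(list)

def send(producer, partition, value):
    """Append a record and return the offset the partition assigned to it."""
    log = partitions[partition]
    log.append((producer, value))
    return len(log) - 1

offsets = [
    send("producer-A", 0, "a1"),
    send("producer-B", 0, "b1"),   # same partition, next offset in the same log
    send("producer-A", 0, "a2"),
    send("producer-B", 1, "b2"),   # different partition, offsets start at 0
]

# A consumer reads every record of its partitions, whoever produced it.
consumed = [record for p in sorted(partitions) for record in partitions[p]]

# Filtering by producer is application logic on the consumer side.
only_a = [value for producer, value in consumed if producer == "producer-A"]
print(offsets, only_a)
```

Offsets belong to the partition log, so records from both producers share one sequence per partition.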
You may be mixing up replicas and partitions. When you say you have a topic with 3 partitions, it means your records will be distributed among them according to the record key (or the partitioner algorithm).
There is no "leader partition". However, each partition has a leader broker that handles it. In your case you will have 3 leaders, each of them managing one of your 3 partitions.
An interesting post regarding Kafka partitions:
Understanding Kafka Topics and Partitions
Yannick

Kafka partitions and consumer groups for at-least-once message delivery

I am trying to come up with a design using Kafka for a number of processing agents to process messages from a Kafka topic in parallel.
I would like to ensure close to exactly-once processing per message across the whole consumer group, although I can tolerate at-least-once.
I find the documentation unclear in many regards, and there are a few specific questions I have to know if this is a viable approach:
if a message is published to a topic, does it exist once only across all partitions in the topic or is it replicated on possibly more than one partition? I have read statements that could support both possibilities.
is the "offset" per partition or per consumer/consumergroup/partition?
when I start a new consumer, does it look at the offset for the consumer group as a whole or for the partition it is assigned?
if I want to scale up new consumers and there are no free partitions (I believe there can be no more than one consumer per partition), will Kafka rebalance existing messages from the existing partitions, and how does that affect the offsets and consumers of existing partitions?
Or are there any other points I am missing that may help my understanding of this?
if a message is published to a topic, does it exist once only across all partitions in the topic or is it replicated on possibly more than one partition? I have read statements that could support both possibilities.
[A]: A message is written to exactly one partition of the topic. That partition is then replicated across nodes according to the replication factor. If you have partition P1 in a cluster with 2 nodes and a replication factor of 2, then node1 will be the leader for P1 and node2 will also hold P1's contents/messages, but as a replica (replication happens asynchronously).
is the "offset" per partition or per consumer/consumergroup/partition?
[A]: Per partition from a broker standpoint. It is also per consumer, since the offset is explicitly tracked/managed on the consumer end. The consumer code can delegate this work to Kafka or manage the offsets manually.
when I start a new consumer, does it look at the offset for the consumer group as a whole or for the partition it is assigned?
[A]: Kafka triggers a rebalance when a new consumer enters the group and assigns certain partitions to it. From then on, the consumer only cares about the offsets of the partitions it is responsible for.
if I want to scale up new consumers and there are no free partitions (I believe there can be no more than one consumer per partition), will Kafka rebalance existing messages from the existing partitions, and how does that affect the offsets and consumers of existing partitions?
[A]: For parallelism, the ideal scenario is a 1-to-1 mapping between consumers and partitions, e.g. if you have 10 partitions, you can have at most 10 consumers. If you bring in an 11th one, Kafka won't assign partitions to it unless an existing consumer leaves the group.
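The 10 partitions and 11 consumers case can be sketched with a simple round-robin assignment. This is a rough illustration, not one of Kafka's actual assignor implementations; the function `assign` and the consumer names are hypothetical.

```python
def assign(partitions, consumers):
    """Round-robin sketch: every partition goes to exactly one consumer."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment

partitions = list(range(10))

# 10 consumers for 10 partitions: one partition each.
ten = assign(partitions, [f"c{i}" for i in range(1, 11)])

# An 11th consumer joins: the group is rebalanced, but one member stays idle.
eleven = assign(partitions, [f"c{i}" for i in range(1, 12)])
idle = [consumer for consumer, owned in eleven.items() if not owned]
print(idle)
```

Only the ownership of partitions changes in a rebalance. The messages stay in their partitions, and the committed offsets of the group carry over to whichever consumer takes over a partition.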