Kafka partitions and consumer groups for at-least-once message delivery - apache-kafka

I am trying to come up with a design using Kafka for a number of processing agents to process messages from a Kafka topic in parallel.
I would like to ensure close to exactly-once per message processing across the whole consumer group, although can tolerate at-least-once.
I find the documentation unclear in many regards, and there are a few specific questions I have to know if this is a viable approach:
If a message is published to a topic, does it exist once only across all partitions in the topic or is it replicated on possibly more than one partition? I have read statements that could support both possibilities.
Is the "offset" per partition or per consumer/consumer group/partition?
When I start a new consumer, does it look at the offset for the consumer group as a whole or for the partition it is assigned?
If I want to scale up new consumers and there are no free partitions (I believe there can be no more than one consumer per partition), will Kafka rebalance existing messages from the existing partitions, and how does that affect the offsets and consumers of existing partitions?
Or are there any other points I am missing that may help my understanding of this?

If a message is published to a topic, does it exist once only across all partitions in the topic or is it replicated on possibly more than one partition? I have read statements that could support both possibilities.
[A]: A message is written to exactly one partition of the topic. What gets replicated is the partition itself, across brokers, according to the replication factor. If you have partition P1 in a cluster with 2 brokers and a replication factor of 2, then broker 1 will be the leader for P1 and broker 2 will hold a replica of P1's messages (followers replicate from the leader asynchronously).
Is the "offset" per partition or per consumer/consumer group/partition?
[A]: Both, in different senses. From the broker's standpoint, every record has an offset that is its position within its partition. Separately, each consumer group tracks how far it has read in each partition, so the consumed offset is per consumer group and partition. The consumer code can delegate this tracking to Kafka or manage the offsets manually.
When I start a new consumer, does it look at the offset for the consumer group as a whole or for the partition it is assigned?
[A]: Kafka triggers a rebalance when a new consumer joins the group and assigns certain partitions to it. From then on, the consumer only cares about the offsets of the partitions it is responsible for, and it resumes each of them from the offset the group last committed for that partition.
If I want to scale up new consumers and there are no free partitions (I believe there can be no more than one consumer per partition), will Kafka rebalance existing messages from the existing partitions, and how does that affect the offsets and consumers of existing partitions?
[A]: For parallelism, the ideal scenario is a 1-to-1 mapping between consumers and partitions. For example, if you have 10 partitions, you can have at most 10 active consumers in the group. If you bring in an 11th, Kafka won't assign any partitions to it unless an existing consumer leaves the group. Existing messages are never moved between partitions; a rebalance only changes which consumer reads which partition, and the committed offsets stay with the group.
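To make the at-least-once part concrete, here is a minimal sketch in plain Python. It is a toy in-memory model, not the Kafka client API: the names Partition, GroupOffsets, consume and crash_before_commit_at are all made up for illustration. It assumes the consumer commits an offset only after the handler has finished, which is what gives at-least-once behaviour when a consumer dies between processing and committing.

```python
from collections import defaultdict


class Partition:
    """Append-only log; a record's offset is simply its index in the list."""

    def __init__(self):
        self.records = []

    def append(self, value):
        self.records.append(value)
        return len(self.records) - 1  # offset assigned on append


class GroupOffsets:
    """Committed offsets keyed by (group, partition): the next offset to read."""

    def __init__(self):
        self.committed = defaultdict(int)

    def commit(self, group, partition_id, next_offset):
        self.committed[(group, partition_id)] = next_offset

    def position(self, group, partition_id):
        return self.committed[(group, partition_id)]


def consume(partition, partition_id, group, offsets, handler,
            crash_before_commit_at=None):
    """Handle records one by one, committing only after the handler returns."""
    offset = offsets.position(group, partition_id)
    while offset < len(partition.records):
        handler(partition.records[offset])
        if offset == crash_before_commit_at:
            return  # simulated crash: record was handled, offset not committed
        offsets.commit(group, partition_id, offset + 1)
        offset += 1


partition = Partition()
for value in ["a", "b", "c"]:
    partition.append(value)

offsets = GroupOffsets()
seen = []
# The first consumer crashes after handling "b" but before committing it.
consume(partition, 0, "workers", offsets, seen.append, crash_before_commit_at=1)
# A replacement consumer in the same group resumes from the committed offset.
consume(partition, 0, "workers", offsets, seen.append)
print(seen)  # ['a', 'b', 'b', 'c']
```

The duplicate "b" is the price of at-least-once: if your handler is idempotent, the overall effect gets close to exactly-once. A different group name would start from its own position, independent of "workers".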

Related

kafka should the consumer and producer have knowledge of the partitions

I'm trying to wrap my head around Kafka, and the thing that confuses me is the partitions. In all/most of the examples I have seen, the consumers/producers seem to have implicit knowledge of the partitions (which partition to write messages to, which partition to read messages from). Is this correct? I initially thought that partitions are internal to the system and that consumers/producers don't need to know partition information. If they need to know partition information, then aren't we exposing the inner structure of the topic to a certain extent to the outside world?
In Kafka, every partition of a topic is stored on a set of brokers (its replicas), and at most one of those brokers is the leader for the partition at any time. There is no point in having more consumers in a group than the topic has partitions, because the extra consumers would be inactive. One consumer can read multiple partitions, but within a consumer group a partition cannot be read by multiple consumers. So the number of partitions must be chosen according to the throughput you expect. The number of partitions of a topic can be increased, but never decreased. When consumers read a partition, they actually connect to the leader broker for that partition to consume messages.
The partition leader can change, in which case the consumer gets an error and has to send a metadata request to the cluster to get the info on the new partition leader. At consumer startup, partitions are assigned according to the consumer parameter partition.assignment.strategy. Of course, if consumers in the same consumer group start at different times, there will be a partition rebalance.
So, as a client, you do need a fair amount of information about the Kafka cluster structure.
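As a rough illustration of what an assignment strategy does during a rebalance, here is a small sketch in plain Python. It is a simplified stand-in and not one of Kafka's actual assignors: assign_round_robin and the consumer names are hypothetical, and real assignors work on topic-partitions and group metadata.

```python
def assign_round_robin(partitions, consumers):
    """Deal partitions out to consumers in turn; returns {consumer: [partitions]}."""
    assignment = {consumer: [] for consumer in consumers}
    if not consumers:
        return assignment
    ordered = sorted(consumers)
    for index, partition in enumerate(sorted(partitions)):
        assignment[ordered[index % len(ordered)]].append(partition)
    return assignment


partitions = [0, 1, 2, 3]

# Two consumers in the group: each one owns two partitions.
before = assign_round_robin(partitions, ["c1", "c2"])
print(before)  # {'c1': [0, 2], 'c2': [1, 3]}

# Three more consumers join: the whole assignment is recomputed,
# and the fifth consumer has nothing to read.
after = assign_round_robin(partitions, ["c1", "c2", "c3", "c4", "c5"])
print(after["c5"])  # []
```

Note that only ownership changes between the two calls. The messages stay in the partitions they were written to.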

Clarifications on Apache Kafka

I have a few questions on Apache Kafka.
Can a single partition be assigned to more than one consumer from the same group?
Where is the offset stored? Is it in the partition or at the consumer?
Just like the producer always posts the record to the lead partition and the records get replicated to other partitions, does the Kafka consumer read the data from the lead partition?
Let's say that a consumer is reading from a partition and the consumer is running a long process. In this case, the rate at which the producer is updating the partition will be faster than the rate at which the consumer is consuming from the same partition. Is there a way we can speed up the consumption from that partition?
Can we create a checkpoint in the commit log on the partition so that the consumer can start processing from that specific checkpoint? This would be useful if I want to perform an audit from a specific checkpoint onward.
Can a single partition be assigned to more than one consumer from the same group?
No, within the same consumer group a partition can be consumed by at most one consumer, as described here: "This is achieved by assigning the partitions in the topic to the consumers in the consumer group so that each partition is consumed by exactly one consumer in the group."
Where is the offset stored? Is it in the partition or at the consumer?
The offsets for each consumer group are stored in an internal Kafka topic called __consumer_offsets, as described here: "The coordinator of each group is chosen from the leaders of the internal offsets topic __consumer_offsets, which is used to store committed offsets."
Just like the producer always posts the record to the lead partition and the records get replicated to other partitions, does the Kafka consumer read the data from the lead partition?
Yes, by default it does. The leader replica is the only "client-facing" copy of the partition, as described here: "'leader' is the node responsible for all reads and writes for the given partition."
EDIT:
Is there a way we can speed up the consumption from that partition?
The way to speed up consumption is to increase the number of partitions of the topic, so that you can have more consumer threads reading from that topic and processing the data in parallel. At the same time, you need to make sure that your data is evenly distributed across partitions.
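To see why the distribution matters, here is a small sketch in plain Python of key-based partitioning. It only illustrates the idea: choose_partition and distribution are made-up helpers, and zlib.crc32 is used as a deterministic stand-in for whatever hash your producer's partitioner actually applies.

```python
import zlib
from collections import Counter


def choose_partition(key, num_partitions):
    """Map a key to a partition with a deterministic hash (CRC32 as a stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def distribution(keys, num_partitions):
    """Count how many records land in each partition."""
    counts = Counter(choose_partition(key, num_partitions) for key in keys)
    return [counts.get(partition, 0) for partition in range(num_partitions)]


# Many distinct keys spread over the partitions.
many_keys = [f"user-{i}" for i in range(1000)]
print(distribution(many_keys, 4))

# One hot key: every record lands in the same partition, so adding
# partitions (and consumers) does not speed anything up.
skewed_keys = ["user-1"] * 1000
print(distribution(skewed_keys, 4))
```

With a single hot key, three of the four consumers would sit idle no matter how many partitions you add, so check your key choice before increasing the partition count.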

Kafka, will different partitions have the same offset number

I have one Kafka topic and five partitions for that one topic. There will be 5 consumer groups. Each consumer group has one service instance consuming from that topic.
Will the offset be the same in each consumer for the same record in Kafka?
If by offset you mean the ordering of messages, then yes, within each partition. It is the same for all consumers, because the ordering is determined by producers and brokers. So, if a partition holds msg-1, msg-2, ..., msg-1000, all 5 consumers will consume them in that specific order; across different partitions no ordering is guaranteed. The rate of consumption might vary, though. Many variables (e.g. network latency, network topology, consumer logic) determine the rate of consumption.
The offset is assigned by the broker when the message comes into the partition so it's unique and it's not related to the consumers (and consumer groups). It identifies the unique position that the record has inside the partition.
On the other hand, each consumer (in a consumer group) reading from a specific partition tracks its own offset, which can differ from that of consumers in other consumer groups. In this case the offset is used to track the position inside the partition from which the next messages will be read. Of course, it is always a message offset.
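The two meanings of "offset" can be shown with a toy model in plain Python. This is only a sketch, not Kafka code: Topic, positions, next_record and the group names "billing" and "audit" are invented for the example.

```python
class Topic:
    """Toy topic: one append-only list per partition."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, partition_id, value):
        log = self.partitions[partition_id]
        log.append(value)
        return len(log) - 1  # the record's offset within this partition


topic = Topic(2)
first = topic.append(0, "x")   # offset 0 in partition 0
second = topic.append(1, "y")  # offset 0 again: partition 1 counts separately
third = topic.append(0, "z")   # offset 1 in partition 0

# Each group keeps its own position per partition.
positions = {("billing", 0): 2, ("audit", 0): 0}


def next_record(group, partition_id):
    """Return (offset, value) for the group's next record, or None if caught up."""
    offset = positions[(group, partition_id)]
    log = topic.partitions[partition_id]
    return (offset, log[offset]) if offset < len(log) else None


print(next_record("audit", 0))    # (0, 'x')
print(next_record("billing", 0))  # None
```

The record "x" has offset 0 in partition 0 no matter which group reads it; what differs between the groups is only how far each has read.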

Are Kafka partitions consumed evenly?

I have a consumer group with several consumers. Each consumer is assigned a set of partitions. When the consumer polls for messages, where is the partition to consume from selected? Is it done on the consumer side, or does the Kafka server decide which partition's turn it is to get consumed?
Some of my partitions have a lot of messages, but some have none or very few. But I still need my consumers to consume each of their assigned partitions equally. So I need my consumer to loop through the partitions fast, preferably polling x messages from each assigned partition.
I'm using https://github.com/appsignal/rdkafka-ruby in case it matters.
Kafka selects the partitions to be consumed using a round-robin strategy, giving each partition a fair chance of being consumed. That way, starvation of partitions is avoided.
On the other hand, Kafka does not guarantee that the data will be consumed proportionally across the partitions.
Please see the details about this here.
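If you need a fixed number of messages per partition per pass, one option is to enforce it in your own code after fetching. Here is a sketch of that idea in plain Python. It does not use the rdkafka-ruby API: drain_fairly and the in-memory buffers are hypothetical and stand in for messages your consumer has already fetched per partition.

```python
from collections import deque


def drain_fairly(buffers, batch_size):
    """Yield (partition, message), taking up to batch_size per partition per pass."""
    queues = {partition: deque(messages) for partition, messages in buffers.items()}
    while any(queues.values()):  # an empty deque is falsy
        for partition in sorted(queues):
            queue = queues[partition]
            for _ in range(min(batch_size, len(queue))):
                yield partition, queue.popleft()


# Partition 0 is busy, partition 1 has one message, partition 2 is empty.
buffers = {0: ["a1", "a2", "a3", "a4", "a5"], 1: ["b1"], 2: []}
order = list(drain_fairly(buffers, 2))
print(order)
# [(0, 'a1'), (0, 'a2'), (1, 'b1'), (0, 'a3'), (0, 'a4'), (0, 'a5')]
```

The busy partition never blocks the quiet one for more than batch_size messages, which is the behaviour asked for in the question.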

Kafka Issues on consumer group

I'm a newbie in Kafka. I had a glance at the Kafka documentation. It seems that dispatching messages to a subscribing consumer group is implemented by binding each partition to a consumer instance.
One important thing we should remember when working with Apache Kafka is that the number of consumers in the same consumer group should be less than or equal to the number of partitions in the consumed topic. Otherwise, the excess consumers will not receive any messages from the topic.
In a non-prod environment, I didn't configure the topic's partitions. In that case, is there only a single partition in Kafka? And if I start multiple consumers sharing the same group and subscribe them to the topic, will messages always be dispatched to the same instance in the group? In other words, do I have to partition the topic to get the load-balancing feature of a consumer group?
Thanks!
You are absolutely right. A single partition cannot be processed in parallel (by one consumer group). You can treat a partition as atomic; it cannot be split.
If you configure the non-prod and prod environments with the same number of partitions per topic, that should help you find the correct number of consumers and catch problems before moving to prod.