How to find the offset of a message in a Kafka topic

How do you find the offset of a message in a Kafka topic? Does an offset contain multiple messages or a single message?

An offset refers to a single message within a single partition of a topic.
The same offset value can exist in many partitions and many topics, but there is no correlation between those records unless you explicitly made the producers arrange it.
There is no easy way to find the offset of a given message; you need to scan the whole topic (or at least a single partition).

Assuming you've read the message using consume() or poll(), and you're interested in the details of this message, you can find the corresponding partition and offset using the following code:
# assuming a confluent-kafka Consumer is already configured and subscribed;
# poll() returns a single Message (or None on timeout)
msg = consumer.poll(timeout=1.0)
if msg is not None and msg.error() is None:
    partition = msg.partition()  # partition the record was read from
    offset = msg.offset()        # position of the record in that partition
A Kafka topic is divided into partitions. Kafka distributes incoming messages across partitions in a round-robin fashion (unless you've specified a key on which to partition). So, for a random message that you haven't read yet, you may have to scan all partitions to find the message and its corresponding offset, as in the sketch below.
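A minimal brute-force scan with the confluent-kafka Python client might look like this; the topic name, broker address, group id, and the match-on-value predicate are all assumptions for the example:
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # assumed broker address
    "group.id": "offset-scan",              # throwaway group just for the scan
    "auto.offset.reset": "earliest",        # start from the beginning
})
consumer.subscribe(["my-topic"])            # assumed topic; covers all partitions

target = b"needle"                          # assumed: we match on the message value
while True:
    msg = consumer.poll(timeout=5.0)
    if msg is None:                         # nothing more within the timeout
        break
    if msg.error() is None and msg.value() == target:
        print(f"found at partition={msg.partition()} offset={msg.offset()}")
        break
consumer.close()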

Related

Kafka - Message versus Record versus offset

I am new to streaming brokers (like Kafka), coming from queueing messaging systems (like JMS, RabbitMQ).
I read in the Kafka docs that messages are stored in Kafka partitions as records, each at an offset, and that the consumer reads from an offset.
What is the difference between a message and a record (do multiple/partial messages constitute a record)?
When the consumer reads from an offset, is there a possibility that it reads a partial message? Is there a need for the consumer to stitch these partial messages together based on some logic?
OR
1 message = 1 record = 1 offset
EDIT1:
The question came up because the batch size decides how many bytes of messages should be published to the broker. Let's say there are 2 messages with message1 = 100 bytes and message2 = 200 bytes, and the batch size is set to 150 bytes. Does this mean 100 bytes from message1 and 50 bytes from message2 are sent to the broker at once? If yes, how are these 2 messages stored at offsets?
In Kafka, a Producer sends messages or records (both terms can be used interchangeably) to Topics. A topic is divided into one or more Partitions that are distributed among the Kafka Cluster, which is generally composed of at least three Brokers.
A message/record is sent to a leader partition (which is owned by a single broker) and associated with an offset. An offset is a monotonically increasing numerical identifier used to uniquely identify a record inside a topic/partition; e.g. the first record stored in a partition will have offset 0, and so on.
Offsets are used both to identify the position of a message in a topic/partition as well as for the position of a Consumer Group.
For optimisation purposes, a producer will batch messages per partition. A batch is considered ready when either the configured batch.size or linger.ms threshold is reached. For example, if batch.size is set to 200KB and you send two messages (150KB and 100KB), they may well end up in the same batch. But the producer will never fragment a single message into chunks.
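As an illustration, here is a minimal confluent-kafka producer sketch with these batching settings; the broker address, topic name, and concrete values are assumptions:
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "localhost:9092",  # assumed broker address
    "batch.size": 200_000,  # upper bound on a batch, in bytes
    "linger.ms": 10,        # wait up to 10 ms to fill a batch before sending
})

# These two records may be batched together, but neither is ever split.
producer.produce("my-topic", value=b"x" * 150_000)
producer.produce("my-topic", value=b"y" * 100_000)
producer.flush()  # block until all queued messages are delivered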
No, a consumer cannot read partial messages.

Kafka: Who maintains up to which offset number messages have been read by a consumer group?

I know that every message in a Kafka topic partition has its own offset number, and that offsets preserve the sequence of messages.
But if I have a Kafka consumer group (or a single Kafka consumer) reading a particular Kafka topic partition, how does it keep track of the offset up to which messages have been read, and who maintains this offset counter?
If the consumer goes down, how will a new consumer start reading from the next unread (or unacknowledged) offset?
The information about consumer groups is all stored in the internal Kafka topic __consumer_offsets. Whenever a new group tries to read data from a topic, it checks its offset position in that internal topic, which has its cleanup policy set to compact. Compaction keeps this topic small.
Kafka comes with a command line tool kafka-consumer-groups.sh that helps you understand which information is stored for each consumer group.
More information is given in the Kafka Documentation on offset tracking.
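For example, you can inspect the committed offsets of a group like this (the group name here is hypothetical); the output lists the current offset, log-end offset, and lag per partition:
bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group my-consumer-group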

Kafka message partitioning by key

We have a business process/workflow that is started when an initial event message is received and closed when the last message is processed. We have up to 100,000 processes executed each day. My problem is that the messages belonging to a specific process have to be processed in the same order they were received. If one of the messages fails, that process has to freeze until the problem is fixed, while all other processes continue. For this kind of situation I am thinking of using Kafka. The first solution that came to my mind was to partition the topic by message key, with the ProcessId as the key. This way I could be sure that all messages of a process would land in the same partition and Kafka would guarantee their order. As I am new to Kafka, what I managed to figure out is that partitions have to be created in advance, and that makes everything too difficult. So my questions are:
1) When I produce a message to a Kafka topic that does not exist, the topic is created at runtime. Is it possible to have the same behavior for topic partitions?
2) There can be more than 100,000 active partitions on the topic; is that a problem?
3) Can a partition be deleted after all messages from it were read?
4) Can you suggest other approaches to my problem?
When I produce a message to a Kafka topic that does not exist, the topic is created at runtime. Is it possible to have the same behavior for topic partitions?
You need to specify the number of partitions while creating a topic. New partitions won't be created automatically (as is the case with topic creation); you have to change the number of partitions using the topic tool.
More info: https://kafka.apache.org/documentation/#basic_ops_modify_topic
As soon as you increase the number of partitions, producers and consumers are notified of the new partitions, leading them to rebalance. Once rebalanced, producers and consumers will start producing to and consuming from the new partitions.
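For instance, in recent Kafka versions, increasing the partition count of an existing topic with the topic tool looks like this (topic name and broker address assumed):
bin/kafka-topics.sh --bootstrap-server localhost:9092 \
  --alter --topic my-topic --partitions 10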
There can be more than 100,000 active partitions on the topic; is that a problem?
Yes, having that many partitions will increase overall latency.
See how-choose-number-topics-partitions-kafka-cluster for how to decide on the number of partitions.
Can a partition be deleted after all messages from that topic were read?
Deleting a partition would lead to data loss, and the remaining data's keys would no longer be distributed correctly, so new messages would not be directed to the same partitions as existing messages with the same key. That's why Kafka does not support decreasing the partition count of a topic.
Also, the Kafka docs state that
Kafka does not currently support reducing the number of partitions for a topic.
I suppose you chose the wrong feature to solve your task.
In general, partitioning is used for load balancing.
Incoming messages are distributed over the given number of partitions according to the partitioning strategy configured on the producer. In short, the default strategy computes i = hash(key) mod number_of_partitions and puts the message in the i-th partition.
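A minimal sketch of that idea (illustration only; real clients use a well-defined hash such as murmur2 in the Java client, not the toy byte sum below):
# Illustrative key-based partition selection, not Kafka's actual partitioner.
def pick_partition(key: bytes, num_partitions: int) -> int:
    return sum(key) % num_partitions  # same key always maps to the same partition

print(pick_partition(b"process-42", 5))  # deterministic partition for this key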
Message ordering is guaranteed only within a partition. With two messages from different partitions, you have no guarantee which one reaches the consumer first.
You could probably use consumer groups instead. A group is an option on the consumer side.
Each group consumes all messages from the topic independently.
A group can consist of one consumer, or more if you need it.
You can assign many groups and add a new group (in fact, add a new consumer with a new groupId) dynamically.
As you can stop/pause any consumer, you can manually stop all consumers related to a specific group. I suppose there is no single command to do that, but I'm not sure. In any case, if you have a single consumer in each group, you can stop it easily.
If you want to remove a group, you just shut down and drop the related consumers. No action on the broker side is needed. A minimal consumer sketch for this setup follows.
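A rough sketch of a per-process group with the confluent-kafka Python client; the group naming scheme, topic name, and broker address are assumptions:
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # assumed broker address
    "group.id": "process-42",               # hypothetical: one group per process
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["workflow-events"])     # assumed topic name
# This consumer commits its own offsets, independent of every other group.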
As a drawback, you'll end up with 100,000 consumers reading a (single) topic. That is a heavy network load, at the very least.

Kafka, will different partitions have the same offset number

I have one Kafka topic and five partitions for that one topic. There will be 5 consumer groups, and each consumer group has one service instance consuming from that topic.
Will the offset be the same in each consumer for the same record in Kafka?
By offset, if you mean the ordering of messages, then yes: it is the same for all consumers, because the ordering is determined by producers and brokers. So, if you have msg-1, msg-2, ..., msg-1000 in the topic, all 5 consumers will consume them in that specific order. But the rate of consumption might vary; it depends on many variables (e.g. network latency, network topology, consumer logic, etc.).
The offset is assigned by the broker when the message arrives in the partition, so it is unique and not related to consumers (or consumer groups). It identifies the unique position the record has inside the partition.
On the other hand, each consumer (in a consumer group) reading from a specific partition tracks its own offset, which will differ from the offsets of consumers in other consumer groups; the offset concept in this case is used for tracking the position inside the partition from which to read messages. Of course, it is always a message offset. The sketch below illustrates the difference.
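A small sketch of the distinction with the confluent-kafka Python client; the group names, topic name, and broker address are assumptions:
from confluent_kafka import Consumer

def make_consumer(group_id):
    c = Consumer({
        "bootstrap.servers": "localhost:9092",  # assumed broker address
        "group.id": group_id,
        "auto.offset.reset": "earliest",
    })
    c.subscribe(["my-topic"])  # assumed topic name
    return c

# Both groups read the very same first record: msg.offset() is identical for
# both, because it is a property of the record, while each group's committed
# position advances independently of the other group.
for group in ("group-a", "group-b"):
    msg = make_consumer(group).poll(timeout=10.0)
    if msg is not None and msg.error() is None:
        print(group, msg.partition(), msg.offset())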

How is message sequence preserved for topic with many partitions?

I would like some information/explanation on how Kafka maintains message sequence when messages are written to a topic with multiple partitions.
For example, I have multiple message producers, each producing messages sequentially and writing to a Kafka topic with more than one partition. In this case, how will a consumer group consume the messages?
Kafka only provides a total order over records within a partition, not between different partitions in a topic. Even within one partition, you could still encounter out-of-order events if retries are enabled and max.in.flight.requests.per.connection is larger than 1 (see the config sketch below).
A workaround is to create a topic with only one partition, although that means only one consumer process per consumer group.
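One way to keep retries from reordering records within a partition is to enable idempotence on the producer; a minimal sketch, assuming a local broker:
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "localhost:9092",  # assumed broker address
    "enable.idempotence": True,  # retries can no longer reorder a partition
})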
Kafka stores messages in partitions according to the message key given to the producer. If none is given, messages are written to the partitions in a round-robin style. To keep ordering for a topic, you need to make sure the ordered sequence of messages uses the same key, or that the topic has only one partition, as the keyed-producer sketch below shows.
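For example, giving all related messages the same key keeps them in one partition and hence in order; the topic name, key, and broker address are assumptions:
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})  # assumed address
# All records with key "order-123" land in the same partition, so a consumer
# sees "created", "paid", "shipped" in exactly this order.
for step in ("created", "paid", "shipped"):
    producer.produce("orders", key="order-123", value=step)
producer.flush()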