Number of commits and offset in each partition of a kafka topic - apache-kafka

How can I find the number of commits and the current offset in each partition of a known Kafka topic? I am using Kafka v0.8.1.1.

Your question does not make clear which kind of offset you are interested in.
There are actually three types of offsets:
1. The offset of the first available message in the topic's partition. Use -2 (earliest) as the --time parameter of the GetOffsetShell tool.
2. The offset of the last available message in the topic's partition. Use -1 (latest) as the --time parameter.
3. The last read/processed message offset maintained by the Kafka consumer. The high-level consumer stores this information, for every consumer group, in an internal Kafka topic (it used to be ZooKeeper) and takes care of keeping it up to date when you call commit() or when the auto-commit setting is true. With the simple consumer, your code has to manage offsets itself.
In addition to the command-line utility, the offset information for #1 and #2 is also available via SimpleConsumer.earliestOrLatestOffset().
If the number of messages is not too large, you can pass a large --offsets value to GetOffsetShell and then count the lines returned by the tool. Otherwise, you can write a simple loop in Scala/Java that iterates over all available offsets starting from the earliest.
From Kafka documentation:
Get Offset Shell
get offsets for a topic
bin/kafka-run-class.sh kafka.tools.GetOffsetShell
required argument [broker-list], [topic]
Option                                   Description
------                                   -----------
--broker-list <hostname:port,...,        REQUIRED: The list of hostname and
  hostname:port>                           port of the server to connect to.
--max-wait-ms <Integer: ms>              The max amount of time each fetch
                                           request waits. (default: 1000)
--offsets <Integer: count>               number of offsets returned (default: 1)
--partitions <partition ids>             comma separated list of partition ids.
                                           If not specified, will find offsets
                                           for all partitions (default)
--time <Long: timestamp in milliseconds  timestamp; offsets will come before
  / -1(latest) / -2 (earliest)>            this timestamp, as in getOffsetsBefore
--topic <topic>                          REQUIRED: The topic to get offsets from.

For the offsets of a topic and its partitions you can use kafka.tools.GetOffsetShell. For example, with this command (my topic is called games):
bin/kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list localhost:9092 --topic games --time -1
I get games:0:47841, which means that for topic games, partition 0, the latest offset is 47841. This is the first unused offset, i.e. the one the next message will receive.
You can use -2 to see the offset of the first available message.
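If what you need is the number of messages currently retained in each partition, one option is to run GetOffsetShell twice (once with --time -2, once with --time -1) and subtract the results. The following is a minimal sketch, assuming every output line has the topic:partition:offset form shown above. The function name count_messages and its two file arguments are hypothetical, and the sample data is made up for illustration.

```shell
#!/bin/sh
# count_messages EARLIEST_FILE LATEST_FILE
# Both files hold GetOffsetShell output, one "topic:partition:offset" line
# per partition (EARLIEST_FILE from --time -2, LATEST_FILE from --time -1).
# Prints "topic:partition:count", where count = latest - earliest.
count_messages() {
    awk -F ":" '
        NR == FNR { earliest[$1 ":" $2] = $3; next }   # first file: remember earliest offsets
        {
            key = $1 ":" $2                            # second file: subtract
            if (key in earliest) print key ":" ($3 - earliest[key])
        }
    ' "$1" "$2"
}

# Illustrative sample data (not real broker output).
tmpdir=$(mktemp -d)
printf 'games:0:100\ngames:1:0\n'     > "$tmpdir/earliest.txt"
printf 'games:0:47841\ngames:1:250\n' > "$tmpdir/latest.txt"
count_messages "$tmpdir/earliest.txt" "$tmpdir/latest.txt"
rm -rf "$tmpdir"
```

The difference is only an upper bound on compacted topics, where offsets inside the range may no longer exist.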

Starting with version 0.9.0.x, you should use the kafka.admin.ConsumerGroupCommand tool. Below are the arguments the tool takes:
List all consumer groups, describe a consumer group, or delete consumer group info.
Option                                  Description
------                                  -----------
--bootstrap-server <server to connect   REQUIRED (only when using new-
  to>                                     consumer): The server to connect to.
--command-config <command config        Property file containing configs to be
  property file>                          passed to Admin Client and Consumer.
--delete                                Pass in groups to delete topic
                                          partition offsets and ownership
                                          information over the entire consumer
                                          group. For instance --group g1
                                          --group g2
                                        Pass in groups with a single topic to
                                          just delete the given topic's
                                          partition offsets and ownership
                                          information for the given consumer
                                          groups. For instance --group g1
                                          --group g2 --topic t1
                                        Pass in just a topic to delete the
                                          given topic's partition offsets and
                                          ownership information for every
                                          consumer group. For instance
                                          --topic t1
                                        WARNING: Group deletion only works for
                                          old ZK-based consumer groups, and
                                          one has to use it carefully to only
                                          delete groups that are not active.
--describe                              Describe consumer group and list
                                          offset lag related to given group.
--group <consumer group>                The consumer group we wish to act on.
--list                                  List all consumer groups.
--new-consumer                          Use new consumer.
--topic <topic>                         The topic whose consumer group
                                          information should be deleted.
--zookeeper <urls>                      REQUIRED (unless new-consumer is
                                          used): The connection string for the
                                          zookeeper connection in the form
                                          host:port. Multiple URLS can be
                                          given to allow fail-over.
To get the offsets of Topic_X for consumerGroup_Y, use the following command:
bin/kafka-run-class.sh kafka.admin.ConsumerGroupCommand --zookeeper <zookeeper urls> --describe --group consumerGroup_Y
The response looks like this:
GROUP, TOPIC, PARTITION, CURRENT OFFSET, LOG END OFFSET, LAG, OWNER
consumerGroup, Topic_X, 0, 3030460, 3168412, 137952, none
consumerGroup, Topic_X, 1, 3030903, 3168884, 137981, none
consumerGroup, Topic_X, 2, 801564, 939540, 137976, none
consumerGroup, Topic_X, 3, 737290, 875262, 137972, none
consumerGroup, Topic_X, 4, 737288, 875254, 137966, none
consumerGroup, Topic_X, 5, 737276, 875241, 137965, none
consumerGroup, Topic_X, 6, 737290, 875251, 137961, none
consumerGroup, Topic_X, 7, 737290, 875248, 137958, none
consumerGroup, Topic_X, 8, 737288, 875246, 137958, none
consumerGroup, Topic_X, 9, 737293, 875251, 137958, none
consumerGroup, Topic_X, 10, 737289, 875244, 137955, none
consumerGroup, Topic_X, 11, 737273, 875226, 137953, none
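To turn that listing into a single number, the LAG column can be summed with awk. This is only a sketch, assuming the comma-separated layout shown above, with LAG as the sixth field. The function name total_lag is hypothetical, and the input is copied from the first two data rows of the listing.

```shell
#!/bin/sh
# total_lag: reads ConsumerGroupCommand --describe output on stdin, in the
# comma-separated layout shown above, and prints the sum of the LAG column.
total_lag() {
    awk -F ", *" '
        NR == 1 { next }          # skip the header row
        NF >= 6 { sum += $6 }     # LAG is the sixth field
        END     { print sum + 0 } # print 0 when there were no data rows
    '
}

# Illustrative input: header plus the first two data rows from the listing.
printf '%s\n' \
    'GROUP, TOPIC, PARTITION, CURRENT OFFSET, LOG END OFFSET, LAG, OWNER' \
    'consumerGroup, Topic_X, 0, 3030460, 3168412, 137952, none' \
    'consumerGroup, Topic_X, 1, 3030903, 3168884, 137981, none' |
    total_lag
```

Newer Kafka versions print this table with whitespace-separated columns, so the field separator would need adjusting there.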

This information was also helpful in creating a script to view the number of messages on a partition for a topic (from the command line). While tools like Kafka-Web-Console are nice, some of us live in a non-GUI world.
Here is the script ... use and modify it at your own risk :-)
#!/bin/bash
# Print the latest offset of every partition of the given topic.
topic=$1
if [[ -z "${topic}" ]] ; then
    echo "Usage: ${0} <topic>"
    exit 1
fi
# Fall back to /kafka when $KAFKA_HOME is not set.
KAFKA_HOME="${KAFKA_HOME:-/kafka}"
if [[ ! -d "${KAFKA_HOME}" ]] ; then
    echo "\$KAFKA_HOME does not point to a valid directory [${KAFKA_HOME}]"
    exit 1
fi
cd "${KAFKA_HOME}" || exit 1
echo
echo "Topic: ${topic}: "
printf " Partition        Count\n"
printf "~~~~~~~~~~ ~~~~~~~~~~~~\n"
idx=0
# GetOffsetShell prints one "topic:partition:offset" line per partition.
while IFS=: read -r _ partition total ; do
    printf "%10d %12d\n" "${partition}" "${total}"
    idx=$((idx + 1))
done < <(bin/kafka-run-class.sh kafka.tools.GetOffsetShell --topic "${topic}" --broker-list localhost:9092 --time -1)
if [[ ${idx} -eq 0 ]] ; then
    echo "Topic name not found!"
    exit 1
fi
echo
exit 0

Suppose we have a topic named tomorrowis27, and our requirements are:
Req 1: We want to know the partition and offset details of the topic.
Ans: We can use the GetOffsetShell command.
Req 2: We want to know the number of offsets consumed by a consumer group.
Ans: We can use the ConsumerGroupCommand tool.

Related

How do we know we reached the last message in the kafka topic using KafkaMessageListenerContainer

I am running a consumer using KafkaMessageListenerContainer. I need to stop the consumer at the topic's last message. How can I identify that a particular message is the last message in the topic?
You can get it with the following shell command (remove --partition parameter to get offsets for all topic's partitions):
./bin/kafka-run-class kafka.tools.GetOffsetShell --broker-list <host>:<port> --topic <topic-name> --partition <partition-number> --time -1
As you can see, this is using the GetOffsetShell [0] object that you may use to get the last message offset and compare it with record's offset.
[0] https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/tools/GetOffsetShell.scala
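When comparing, keep in mind that the value reported for --time -1 is the log end offset, i.e. the offset the next message will receive, so the last existing record sits one below it. A minimal sketch of that comparison follows; the function name is_last_record is hypothetical and the numbers are illustrative.

```shell
#!/bin/sh
# is_last_record RECORD_OFFSET LOG_END_OFFSET
# Succeeds (exit status 0) when the record is the last one currently in the
# partition. LOG_END_OFFSET is the value GetOffsetShell prints for --time -1,
# which is the offset the NEXT message will receive.
is_last_record() {
    [ "$1" -eq $(( $2 - 1 )) ]
}

# Illustrative values: a partition whose log end offset is 47841.
if is_last_record 47840 47841; then
    echo "offset 47840 is the last record"
fi
```

The end offset is only a snapshot: if producers are still writing, a record that was last at the time of the check may not stay last.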

delete topic-messages in Apache kafka

I'm testing how Kafka topics work, but I don't understand how deletion works.
I have created a simple topic with
retention.ms = 60000
and
segment.ms = 60000
and
cleanup.policy=delete.
After this I created a producer and I sent some messages.
A consumer receives the messages without problems.
But I expect that, after one minute, if I run the consumer again, it won't show the messages because they should have been deleted. This doesn't happen.
If I create a query in KSQL, it's the same. The messages always appear.
I think that I don't understand how the deletion works.
Example:
1) Topic
./kafka-topics --create --zookeeper localhost:2181 --topic test --replication-factor 2 --partitions 1 --config "cleanup.policy=delete" --config "delete.retention.ms=60000" --config "segment.ms=60000"
2) producer
./kafka-avro-console-producer --broker-list broker:29092 --topic test --property parse.key=true --property key.schema='{"type":"long"}' --property "key.separator=:" --property value.schema='{"type": "record","name": "ppp","namespace": "test.topic","fields": [{"name": "id","type": "long"}]}'
3) messages from producer
1:{"id": 1}
2:{"id": 2}
4:{"id": 4}
5:{"id": 5}
4) Consumer
./kafka-avro-console-consumer \
--bootstrap-server broker:29092 \
--property schema.registry.url=http://localhost:8081 \
--topic test --from-beginning --property print.key=true
The consumer shows the four messages.
But I expect that if I run the consumer again after one minute (I have also waited longer, even hours), the messages won't show, because retention.ms and segment.ms are one minute.
When are messages actually deleted?
Another important thing to know about the deletion process in Kafka is the log segment file.
Topics are divided into partitions, which is what allows parallelism, scaling and so on.
Each partition is divided into log segment files. Why? Kafka writes data to disk, and we don't want it to keep an entire topic/partition in one huge file, so the data is split into smaller files (segments).
Breaking data into smaller files has many advantages that are not really related to the question.
The key thing to notice here is:
The retention policy looks at the log segment file's timestamp.
"Retention by time is performed by examining the last modified time (mtime) on each log segment file on disk. Under normal cluster operations, this is the time that the log segment was closed, and represents the timestamp of the last message in the file"
(From Kafka: The Definitive Guide, page 26)
Since version 0.10.1.0:
The log retention time is no longer based on last modified time of the log segments. Instead it will be based on the largest timestamp of the messages in a log segment.
This means that only closed log segment files are considered.
Make sure your 'segment' config parameters are right.
Note that the topic in the question was created with delete.retention.ms, not retention.ms. delete.retention.ms controls how long delete tombstones are kept on compacted topics, so retention.ms stayed at its default.
Change retention.ms, as mentioned by Ajay Srivastava above, using kafka-topics --zookeeper localhost:2181 --alter --topic test --config retention.ms=60000 and test again.

Current offset behavior when set by kafka-consumer-groups to earliest?

I have a kafka topic with 25 partitions and the cluster has been running for 5 months.
As I understand it, the offsets in each partition of a given topic start at 0 and increase without bound (0, 1, 2, ...).
I see the log-end-offset at a very high value (right now 1230628032).
I created a new consumer group with its offset set to earliest, so I expected a client in that consumer group to start from offset 0.
This is the command I used to create the new consumer group with its offset set to earliest:
kafka-consumer-groups --bootstrap-server <IP_address>:9092 --reset-offsets --to-earliest --topic some-topic --group to-earliest-cons --execute
I see the consumer group being created. I expected the current offset to be 0; however, when I described the consumer group, the current offset was very high (at the moment 1143755193).
The record retention period is set to 7 days (the default).
My question is: why isn't 0 the first offset from which a consumer in this group will read? Does it have something to do with data retention?
Can anyone help me understand this?
It is exactly data retention. Kafka has most likely already removed the old messages, including offset 0, from your partitions, so it doesn't make sense to start from 0. Instead, Kafka sets the offset to the earliest available message in each partition. You can check those offsets using:
./kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list <IP_address>:9092 --topic some-topic --time -2
You will probably see values very close to the new consumer group's offsets.
You can also try to set the offset explicitly to 0:
./kafka-consumer-groups.sh --bootstrap-server <IP_address>:9092 --reset-offsets --to-offset 0 --topic some-topic --group to-earliest-cons --execute
However, you will see a warning that offset 0 does not exist, and the tool will use a higher value (the aforementioned earliest available message):
New offset (0) is lower than earliest offset for topic partition some-topic. Value will be set to 1143755193
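The adjustment described by that warning amounts to raising the requested offset to the earliest retained one. As a rough illustration only (this is not the tool's code, and the function name effective_offset is hypothetical):

```shell
#!/bin/sh
# effective_offset REQUESTED EARLIEST
# Prints the offset a reset would end up with under the behaviour described
# above: a requested offset below the earliest retained one is raised to it.
effective_offset() {
    if [ "$1" -lt "$2" ]; then
        echo "$2"
    else
        echo "$1"
    fi
}

# Values taken from the warning message above.
effective_offset 0 1143755193
```

This is also why --to-earliest and --to-offset 0 land on the same value once retention has removed the beginning of the partition.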

How to get log end offset of all partitions for a given kafka topic using kafka command line?

When I describe a Kafka topic, it doesn't show the log end offset of any partition, but it shows all the other metadata such as ISR, Replicas and Leader.
How do I see the log end offset of each partition of a given topic?
Ran this: ./kafka-topics.sh --zookeeper zk-service:2181 --describe --topic "__consumer_offsets"
The output doesn't have an offset column.
Note: I need only the log end offset.
Since you're only looking for the log end offset for a topic, you can use kafka-run-class with the kafka.tools.GetOffsetShell class.
Assuming your topic is __consumer_offsets, you would get the end offset by running:
./kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list localhost:9092 --time -1 --topic __consumer_offsets
Change the --broker-list localhost:9092 to your desired Kafka address. This will list all of the log end offsets for each partition in the topic.
Install kafkacat; it's an easy-to-use Kafka tool:
sudo apt-get update
sudo apt-get install kafkacat
kafkacat -C -b <kafka-broker-ip-and-port> -t <topic> -o -1
This will not consume anything because the offset is incremented after a message is added. But it will give you the offsets for all the partitions. Note however that this isn't the current offset that you are consuming at... The above answers will help you more in terms of looking into partition lag.
Following is the command you would need to get the offset of all partitions for a given kafka topic for a given consumer group:
kafka-consumer-groups --bootstrap-server <kafka-broker-list-with-ports> --describe --group <consumer-group-name>
Please note that the <consumer-group-name> at the end is important as the offsets are committed by consumers that are typically a part of a consumer group.
The output of this command may look something like:
TOPIC         PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG  CONSUMER-ID    HOST    CLIENT-ID
<topic-name>  0          62              62              0    <consumer-id>  <host>  <client>
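Since the question asks only for the log end offset, the describe output can be filtered down to that column. The sketch below assumes the whitespace-separated layout shown above and locates the columns by header name. The function name log_end_offsets and the sample row values (my-topic, consumer-1, client-1, the host) are made up for illustration.

```shell
#!/bin/sh
# log_end_offsets: reads `kafka-consumer-groups --describe` output on stdin
# and prints "topic partition log-end-offset" for every data row. Columns
# are located by header name rather than by a fixed position.
log_end_offsets() {
    awk '
        !found {                              # still looking for the header row
            for (i = 1; i <= NF; i++) {
                if ($i == "TOPIC")          t = i
                if ($i == "PARTITION")      p = i
                if ($i == "LOG-END-OFFSET") e = i
            }
            if (t && p && e) found = 1
            next
        }
        NF >= e { print $t, $p, $e }          # data rows after the header
    '
}

# Illustrative input in the layout shown above.
printf '%s\n' \
    'TOPIC     PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG  CONSUMER-ID  HOST       CLIENT-ID' \
    'my-topic  0          62              62              0    consumer-1   /10.0.0.1  client-1' |
    log_end_offsets
```

In practice the printf would be replaced by the kafka-consumer-groups command itself, piped into the function.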
In your post however, you're trying to get this information for the internal topic __consumer_offsets so you would need a consumer group which would have consumers consuming from this internal topic. You could perhaps do the following:
kafka-console-consumer --bootstrap-server <kafka-broker-list-with-ports> --topic __consumer_offsets --formatter "kafka.coordinator.group.GroupMetadataManager\$OffsetsMessageFormatter" --max-messages 5
Output of the above command:
[<consumer-group-name>,<topic-name>,0]::[OffsetMetadata[481690879,NO_METADATA],CommitTime 1479708539051,ExpirationTime 1480313339051]
Just use the <consumer-group-name> from the output and put it in the kafka-consumer-groups command mentioned in the beginning and you'll get the offset details for all the 50 partitions for the given consumer group only.
I hope this helps.

How to fetch recent messages from Kafka topic

Is there an option for fetching the most recent 10, 20, etc. messages from a Kafka topic? I can see the --from-beginning option to fetch all messages from the topic, but what if I want to fetch only a few messages: the first, last, middle, or latest 10? Are there options for that?
First N messages
You can use --max-messages N in order to fetch the first N messages of a topic.
For example, to get the first 10 messages, run
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning --max-messages 10
Next N messages
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --max-messages 10
Last N messages
To get the last N messages, you need to define a specific partition and the offset:
bin/kafka-simple-consumer-shell.sh --broker-list localhost:9092 --topic test --partition testPartition --offset yourOffset
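The offset to pass for "the last N messages" can be derived from the partition's earliest and latest offsets (as reported by GetOffsetShell with --time -2 and --time -1). A small sketch of that arithmetic follows; the function name start_offset_for_last_n is hypothetical and the numbers are illustrative.

```shell
#!/bin/sh
# start_offset_for_last_n EARLIEST LATEST N
# Prints the offset to start from in order to read the last N messages of a
# partition. EARLIEST and LATEST are the values GetOffsetShell reports for
# --time -2 and --time -1; the result never goes below EARLIEST.
start_offset_for_last_n() {
    start=$(( $2 - $3 ))
    if [ "$start" -lt "$1" ]; then
        start=$1
    fi
    echo "$start"
}

start_offset_for_last_n 0 47841 10      # plenty of messages retained
start_offset_for_last_n 47835 47841 10  # fewer than 10 messages retained
```

The clamp to the earliest offset matters when the partition holds fewer than N messages, for example after retention has removed older segments.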
M to N messages
Again, for this case you'd have to define both the partition and the offset.
For example, you can run the following in order to get N messages starting from an offset of your choice:
bin/kafka-simple-consumer-shell.sh --broker-list localhost:9092 --topic test --partition testPartition --offset yourOffset --max-messages 10
If you don't want to stick to the binaries, I would suggest using kt, a Kafka command line tool with more options and functionality.
For more details refer to the article How to fetch specific messages in Apache Kafka
Without specifying an offset and partition, you'll only be able to consume the next N or the first N messages. To consume from the "middle" of the unbounded stream, you need to give the offset.
Other than the console consumer, there's kafkacat.
First twenty:
kafkacat -C -b <broker> -t topic -o earliest -c 20
And the last twenty (from partition zero):
kafkacat -C -b <broker> -t topic -p 0 -o -20