Kafka: What is Current Offset or Record Count of Topic? - apache-kafka

How do I get the current offset, or offset by partition, or record count for a given topic? It doesn't need to be perfect, but I want a ballpark idea of how much data is in a Kafka topic.

To get the latest offset of each partition of a topic, you can use kafka.tools.GetOffsetShell:
./bin/kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list localhost:9092 --topic your_topic_name --time -1
If you want to get the current offset for a particular consumer group, you can also use:
./bin/kafka-run-class.sh kafka.tools.ConsumerOffsetChecker --topic your_topic_name --zookeeper localhost:2181 --group your_group_id
To count the entries in a topic, you can either consume the whole topic (when you stop the consumer, the total number of consumed messages is reported), or sum the per-partition end offsets:
./bin/kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list <broker>:<port> --topic <topic-name> --time -1 --offsets 1 | awk -F ":" '{sum += $3} END {print sum}'
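If you'd rather compute the same sum from code instead of the awk pipeline above, here is a minimal Java sketch (broker address and topic name are placeholders; it assumes a reasonably recent Java client). It sums log-end minus log-start offsets per partition, which is only a ballpark figure for compacted or retention-trimmed topics.
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

public class TopicRecordCount {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");                      // adjust to your brokers
        props.put("key.deserializer", ByteArrayDeserializer.class.getName());
        props.put("value.deserializer", ByteArrayDeserializer.class.getName());

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            String topic = "your_topic_name";                                   // placeholder topic
            List<TopicPartition> partitions = new ArrayList<>();
            consumer.partitionsFor(topic)
                    .forEach(p -> partitions.add(new TopicPartition(topic, p.partition())));

            // Log-end and log-start offsets per partition
            Map<TopicPartition, Long> end = consumer.endOffsets(partitions);
            Map<TopicPartition, Long> begin = consumer.beginningOffsets(partitions);

            long total = 0;
            for (TopicPartition tp : partitions) {
                total += end.get(tp) - begin.get(tp);
            }
            // Ballpark count: exact only if nothing has been compacted or deleted
            System.out.println("Approximate record count: " + total);
        }
    }
}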

Related

How to get log end offset of all partitions for a given kafka topic using kafka command line?

When I describe a Kafka topic, it doesn't show the log end offset of any partition, but it does show all the other metadata such as ISR, Replicas, and Leader.
How do I see the log end offset of each partition for a given topic?
I ran this: ./kafka-topics.sh --zookeeper zk-service:2181 --describe --topic "__consumer_offsets"
The output doesn't have an offset column.
Note: I only need the log end offset.
Since you're only looking for the log end offset for a topic, you can use kafka-run-class with the kafka.tools.GetOffsetShell class.
Assuming your topic is __consumer_offsets, you would get the end offset by running:
./kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list localhost:9092 --time -1 --topic __consumer_offsets
Change --broker-list localhost:9092 to your Kafka broker address. This will list the log end offset of every partition in the topic.
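If you'd rather query this from code than from the shell, here is a rough sketch with the Java AdminClient (listOffsets is available in clients 2.5 and later; the broker address is a placeholder):
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartition;

public class LogEndOffsets {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // adjust to your brokers

        try (AdminClient admin = AdminClient.create(props)) {
            String topic = "__consumer_offsets";
            TopicDescription description = admin.describeTopics(Collections.singletonList(topic))
                    .all().get().get(topic);

            // Ask for the latest (log-end) offset of every partition
            Map<TopicPartition, OffsetSpec> request = new HashMap<>();
            description.partitions()
                    .forEach(p -> request.put(new TopicPartition(topic, p.partition()), OffsetSpec.latest()));

            admin.listOffsets(request).all().get()
                    .forEach((tp, info) -> System.out.println(tp + " -> " + info.offset()));
        }
    }
}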
Install kafkacat, an easy-to-use Kafka tool:
sudo apt-get update
sudo apt-get install kafkacat
kafkacat -C -b <kafka-broker-ip-and-port> -t <topic> -o -1
This will not consume anything, because the offset is incremented after a message is added, but it will give you the offsets for all the partitions. Note, however, that this isn't the current offset you are consuming at; the answers above will help you more if you're looking into partition lag.
The following command gives you the offsets of all partitions of a given Kafka topic for a given consumer group:
kafka-consumer-groups --bootstrap-server <kafka-broker-list-with-ports> --describe --group <consumer-group-name>
Please note that the <consumer-group-name> at the end is important as the offsets are committed by consumers that are typically a part of a consumer group.
The output of this command may look something like:
TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID HOST CLIENT-ID
<topic-name> 0 62 62 0 <consumer-id> <host> <client>
In your post, however, you're trying to get this information for the internal topic __consumer_offsets, so you need a consumer group with consumers reading from this internal topic. You could do the following:
kafka-console-consumer --bootstrap-server <kafka-broker-list-with-ports> --topic __consumer_offsets --formatter "kafka.coordinator.group.GroupMetadataManager\$OffsetsMessageFormatter" --max-messages 5
Output of the above command:
[<consumer-group-name>,<topic-name>,0]::[OffsetMetadata[481690879,NO_METADATA],CommitTime 1479708539051,ExpirationTime 1480313339051]
Just take the <consumer-group-name> from this output, put it into the kafka-consumer-groups command mentioned at the beginning, and you'll get the offset details for all 50 partitions of __consumer_offsets for that consumer group only.
I hope this helps.
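For completeness, the same CURRENT-OFFSET / LOG-END-OFFSET / LAG arithmetic can be done programmatically. A hedged Java sketch, assuming clients 2.0 or later and placeholder broker and group names:
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

public class GroupLag {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");                      // adjust to your brokers
        props.put("key.deserializer", ByteArrayDeserializer.class.getName());
        props.put("value.deserializer", ByteArrayDeserializer.class.getName());

        try (AdminClient admin = AdminClient.create(props);
             KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {

            // Committed offsets of the group (the CURRENT-OFFSET column)
            Map<TopicPartition, OffsetAndMetadata> committed =
                    admin.listConsumerGroupOffsets("your_group_id")
                         .partitionsToOffsetAndMetadata().get();

            // Log-end offsets of the same partitions (the LOG-END-OFFSET column)
            Map<TopicPartition, Long> end = consumer.endOffsets(committed.keySet());

            // LAG = LOG-END-OFFSET - CURRENT-OFFSET
            committed.forEach((tp, meta) ->
                    System.out.println(tp + " lag=" + (end.get(tp) - meta.offset())));
        }
    }
}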

How to fetch recent messages from Kafka topic

Is there an option to fetch only the most recent 10/20/etc. messages from a Kafka topic? I can see the --from-beginning option to fetch all messages from the topic, but what if I only want to fetch a few: the first, the last, some in the middle, or the latest 10? Do we have such options?
First N messages
You can use --max-messages N in order to fetch the first N messages of a topic.
For example, to get the first 10 messages, run
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning --max-messages 10
Next N messages
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --max-messages 10
Last N messages
To get the last N messages, you need to define a specific partition and the offset:
bin/kafka-simple-consumer-shell.sh --broker-list localhost:9092 --topic test --partition testPartition --offset yourOffset
M to N messages
Again, for this case you'd have to define both the partition and the offset.
For example, you can run the following in order to get N messages starting from an offset of your choice:
bin/kafka-simple-consumer-shell.sh --broker-list localhost:9092 --topic test --partition testPartition --offset yourOffset --max-messages 10
If you don't want to stick to the built-in scripts, I'd suggest using kt, a Kafka command-line tool with more options and functionality.
For more details refer to the article How to fetch specific messages in Apache Kafka
Without specifying an offset and partition, you'll only be able to consume the first N or the next N messages. To consume from the "middle" of the unbounded stream, you need to give the offset explicitly.
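If the shell tools don't fit your use case, the "last N" pattern can be sketched with the plain Java consumer: assign the partition, look up its end offset, seek back N, and poll. This is only a sketch, with placeholder broker, topic, and partition:
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class LastNMessages {
    public static void main(String[] args) {
        int n = 10;
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");                   // adjust to your brokers
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("test", 0);               // placeholder topic/partition
            consumer.assign(Collections.singletonList(tp));

            long end = consumer.endOffsets(Collections.singletonList(tp)).get(tp);
            long begin = consumer.beginningOffsets(Collections.singletonList(tp)).get(tp);
            consumer.seek(tp, Math.max(begin, end - n));                     // jump to "end minus N"

            int read = 0;
            while (read < n) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                if (records.isEmpty()) break;                                // nothing left to read
                for (ConsumerRecord<String, String> r : records) {
                    System.out.println(r.offset() + ": " + r.value());
                    if (++read == n) break;
                }
            }
        }
    }
}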
Other than the console consumer, there's kafkacat.
First twenty:
kafkacat -C -b <broker> -t <topic> -o beginning -c 20
And from the previous twenty (from partition zero):
kafkacat -C -b <broker> -t <topic> -p 0 -o -20

Before consumers for a new topic are attached, I create the topic and produce a message in Apache Kafka

I create a new topic and produce a first message in Apache Kafka before any consumers for the topic are attached.
Then a consumer for the new topic is attached, but the first message is not consumed.
Why?
In this case, the group already shows log-end offset=1, committed offset=1, lag=0.
Doesn't "committed offset=1" mean it has already been consumed?
My question is why it appears to have already been consumed.
Let me know if I've misunderstood anything.
This is my test case.
# create new topic
$ kafka/bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic NEW_TOPIC_NAME
# produce a first message
$ kafka/bin/kafka-console-producer.sh --broker-list localhost:9092 --topic NEW_TOPIC_NAME
> send a first message
# then execute consumer
$ kafka/bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic NEW_TOPIC_NAME
# the first message is not consumed
But after the consumer for the new topic is attached, when I produce a second message it is consumed normally.
By default, the kafka-console-consumer starts from the end of the topic.
If you want to consume messages produced before the consumer started, you can set --from-beginning, for example:
kafka/bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic NEW_TOPIC_NAME --from-beginning
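In the client API, the console consumer's --from-beginning flag corresponds to auto.offset.reset=earliest: a brand-new group has no committed offset, so this setting decides whether it starts from the beginning of the log or (the default) from the end, which is why the first message was skipped. A small Java sketch, with placeholder broker, group, and topic names:
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ReadFromBeginning {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");                   // adjust to your brokers
        props.put("group.id", "new-group");                                 // a group with no committed offsets yet
        props.put("auto.offset.reset", "earliest");                         // start at the beginning instead of the end
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("NEW_TOPIC_NAME"));
            while (true) {
                for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofSeconds(1))) {
                    System.out.println(r.offset() + ": " + r.value());      // the first message shows up too
                }
            }
        }
    }
}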

Read from kafka in a Spark batch job (fromOffset untilOffset)

I saw in this question that we can read messages from Kafka in Spark batch jobs using org.apache.spark.streaming.kafka.KafkaUtils#createRDD, but this method requires an offset range that needs a 'from offset' and an 'until offset'. I'm getting the 'from offset' from the org.apache.spark.streaming.kafka.KafkaCluster#getLatestLeaderOffsets method, but how can I get the 'until offset'? I'm using kafka-2.1.1-0.9.0.1.
You can use GetOffsetShell to fetch latest offset from any topic
bin/kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list localhost:9092 --topic myTopic --time -1
This will return:
myTopic:12341:47841
which means 47841 is the latest offset for that partition of topic myTopic (the output format is topic:partition:offset).
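If you need the 'until' offsets inside the job rather than from the shell, the consumer's endOffsets call returns the same number per partition. A rough Java sketch with placeholder names; note that endOffsets needs 0.10.1+ clients, so it may not apply to the 0.9 client mentioned in the question:
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

public class UntilOffsets {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");                   // adjust to your brokers
        props.put("key.deserializer", ByteArrayDeserializer.class.getName());
        props.put("value.deserializer", ByteArrayDeserializer.class.getName());

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            String topic = "myTopic";
            List<TopicPartition> partitions = new ArrayList<>();
            consumer.partitionsFor(topic)
                    .forEach(p -> partitions.add(new TopicPartition(topic, p.partition())));

            // The "until" offset of each partition is one past its last record
            consumer.endOffsets(partitions)
                    .forEach((tp, offset) -> System.out.println(tp + " untilOffset=" + offset));
        }
    }
}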

Kafka 0.10.1.0 change offset time

I have an Elasticsearch pipeline set up with a Kafka cluster between two Logstash instances.
I need to reset the offset back 12 hours for a topic and start the consumer again.
bin/kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list kfkserver:9092 --topic topicname --time 1488153601000
which returns topicname:0:3730858
1488153601000 <- 2017-02-27 00:00:01 in milliseconds
How can I set the consumer's offset back to that time?
If you're on 0.10.x and don't have the awesome offset management tool that was added in 0.11, there's a hack to use kafka-console-consumer.sh to change a consumer group's offset. This only works with the numeric offset though, not the timestamp.
First, stop whatever process is running that's using that consumer. Clean shutdown is best. Then, run a command that looks like this:
bin/kafka-console-consumer.sh --bootstrap-server :9092 \
--topic my-topic \
--partition 1 \
--consumer-property group.id=my-consumer-group \
--max-messages 0 \
--offset 12345
--max-messages 0 is important here; setting it to any other value, including 1, will consume that many messages and then commit the current latest offset in that topic/partition. This must be a bug in the console consumer.
Next, check your work with kafka-consumer-groups.sh:
./kafka-consumer-groups.sh --bootstrap-server :9092 \
--group my-consumer-group \
--describe
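If you can run a small client instead of the console-consumer hack, offsetsForTimes (added in the 0.10.1 clients, so it matches the broker version in the question) maps a timestamp directly to an offset, and committing those offsets under the group id performs the twelve-hour rewind. This is a hedged sketch with placeholder broker, topic, and group names; as above, stop the group's real consumers first:
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

public class RewindTwelveHours {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kfkserver:9092");                    // adjust to your brokers
        props.put("group.id", "my-consumer-group");                          // the group being rewound
        props.put("enable.auto.commit", "false");
        props.put("key.deserializer", ByteArrayDeserializer.class.getName());
        props.put("value.deserializer", ByteArrayDeserializer.class.getName());

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            String topic = "topicname";
            long twelveHoursAgo = System.currentTimeMillis() - 12L * 60 * 60 * 1000;

            List<TopicPartition> partitions = new ArrayList<>();
            consumer.partitionsFor(topic)
                    .forEach(p -> partitions.add(new TopicPartition(topic, p.partition())));
            consumer.assign(partitions);

            // Map the timestamp to the first offset at or after it, per partition
            Map<TopicPartition, Long> query = new HashMap<>();
            partitions.forEach(tp -> query.put(tp, twelveHoursAgo));

            Map<TopicPartition, OffsetAndMetadata> toCommit = new HashMap<>();
            consumer.offsetsForTimes(query).forEach((tp, ts) -> {
                if (ts != null) {                                            // null if no record is that recent
                    toCommit.put(tp, new OffsetAndMetadata(ts.offset()));
                }
            });

            // Commit the rewound offsets on behalf of the (currently empty) group
            consumer.commitSync(toCommit);
        }
    }
}
Afterwards, the kafka-consumer-groups.sh --describe command above should show the rewound CURRENT-OFFSET for each partition.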