Kafka: unable to consume events given an offset - apache-kafka

I was following the Kafka quick start guide and decided to experiment a little with offsets.
The only change I made to the default configuration was adding:
log.retention.minutes=5
My test topic was created as basic as possible, as suggested in the quick start guide (1 partition, replication factor 1):
$ bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
I produced two messages, m1 and m2 (running date before and after):
$ date
viernes, 21 de julio de 2017, 12:16:06 CEST
$ bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
>m1
>m2
>^C
$ date
viernes, 21 de julio de 2017, 12:16:25 CEST
The thing is I'm able to consume them from the beginning, but I'm not able to consume them given an offset (for instance, offset 0, which I understand points to the first message):
$ date
viernes, 21 de julio de 2017, 12:16:29 CEST
$ bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --offset 0 --partition 0
^CProcessed a total of 0 messages
$ date
viernes, 21 de julio de 2017, 12:17:25 CEST
$ bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning
m1
m2
^CProcessed a total of 2 messages
$ date
viernes, 21 de julio de 2017, 12:17:50 CEST
Most probably I have misunderstood this statement from the documentation:
In fact, the only metadata retained on a per-consumer basis is the offset or position of that consumer in the log. This offset is controlled by the consumer: normally a consumer will advance its offset linearly as it reads records, but, in fact, since the position is controlled by the consumer it can consume records in any order it likes. For example a consumer can reset to an older offset to reprocess data from the past or skip ahead to the most recent record and start consuming from "now".
Moreover, I've seen that if I produce a third message (m3) after starting the consumer as described above (i.e. pointing to offset 0), this third message is read:
$ bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --offset 0 --partition 0
m3
Could anybody explain this behavior, please? Thanks!

Alright, after a lot of comments and a bit of code searching I think that this is what is happening:
When you configured a retention period of 5 minutes, you caused Kafka to delete some of your old messages, most notably the one with offset 0. So at some point the smallest offset in partition 0 became, let's say, 4.
When you start a console consumer with --from-beginning, it internally calls a method that initializes the beginning offset to the smallest offset that can be found in the partition - 4 in this case. With that offset the consumer starts polling and receives that message and all subsequent ones, which is all messages for the partition.
If you start a consumer with --offset 0, that piece of code is bypassed and the consumer polls with an offset of 0, to which the broker responds with an OFFSET_OUT_OF_RANGE error. Upon receiving that error, the consumer resets the offset for the partition in question according to the auto.offset.reset parameter, which in theory can be earliest or latest.
However, due to the way the ConsoleConsumer is written, the only way to set this parameter to earliest is to pass the command line parameter --from-beginning, which cannot be combined with --offset. So effectively the only value auto.offset.reset can have here is latest.
So what happens when you poll with an offset of 0 that does not exist is an unsuccessful poll for data and after that the same behavior as if you hadn't passed any parameter at all.
Hope that helps and makes sense.
Update:
As of Kafka version 1.0 this behavior has been changed by KAFKA-5629 and should now behave a bit more in line with expectations.
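To make the pre-1.0 behavior described above concrete, here is a minimal Python sketch that models how the start position ends up being chosen. It is a simulation, not Kafka code: the function resolve_start_offset and its parameters are hypothetical, and it assumes that any fetch offset outside the retained range is rejected and that the consumer then falls back to auto.offset.reset.

```python
def resolve_start_offset(requested, log_start, log_end, auto_offset_reset="latest"):
    """Model where a consumer ends up reading from (hypothetical helper).

    requested: offset passed via --offset, or None if none was given
    log_start: smallest offset still retained in the partition
    log_end:   offset the next produced message will receive
    """
    if requested is not None and log_start <= requested <= log_end:
        return requested  # the offset exists, so the fetch proceeds normally
    # Out of range (or nothing requested): fall back to the reset policy.
    if auto_offset_reset == "earliest":
        return log_start
    return log_end  # "latest": only messages produced from now on are seen


# Assumed state: retention removed offsets 0..3, the partition holds 4 and 5.
print(resolve_start_offset(0, log_start=4, log_end=6))   # models --offset 0
print(resolve_start_offset(None, 4, 6, "earliest"))      # models --from-beginning
```

With these assumed numbers the first call resolves to the log end (6), which matches what the question observed: nothing is printed until a new message such as m3 arrives.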

You can try: --offset earliest

Related

How do we know we reached the last message in the kafka topic using KafkaMessageListenerContainer

I am running a consumer using KafkaMessageListenerContainer. I need to stop the consumer at the topic's last message. How can I identify whether a particular message is the last message in the topic?
You can get it with the following shell command (remove --partition parameter to get offsets for all topic's partitions):
./bin/kafka-run-class kafka.tools.GetOffsetShell --broker-list <host>:<port> --topic <topic-name> --partition <partition-number> --time -1
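As you can see, this uses the GetOffsetShell [0] tool, whose output you can compare with the record's offset. With --time -1 it reports the log end offset of each partition, that is, the offset the next message will receive, so the last existing message has that value minus one.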
[0] https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/tools/GetOffsetShell.scala
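As a rough illustration of that comparison, the Python sketch below parses the topic:partition:offset lines that GetOffsetShell prints and checks whether a record is the last one in its partition. The helper names and the sample output are made up for illustration. It assumes --time -1 reports the log end offset, so the last existing record sits at that value minus one.

```python
def parse_end_offsets(shell_output):
    """Parse 'topic:partition:offset' lines into {(topic, partition): offset}."""
    offsets = {}
    for line in shell_output.splitlines():
        line = line.strip()
        if not line:
            continue
        topic, partition, offset = line.rsplit(":", 2)
        offsets[(topic, int(partition))] = int(offset)
    return offsets


def is_last_record(topic, partition, record_offset, end_offsets):
    """True if the record was the newest one when the snapshot was taken."""
    return record_offset == end_offsets[(topic, partition)] - 1


sample = "my-topic:0:42\nmy-topic:1:17\n"  # made-up output, not from a real run
ends = parse_end_offsets(sample)
print(is_last_record("my-topic", 0, 41, ends))
print(is_last_record("my-topic", 1, 10, ends))
```

Keep in mind that the end offsets are only a snapshot: if producers are still writing, a record that was last at snapshot time may no longer be last.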

delete topic-messages in Apache kafka

I'm testing how Kafka topics work, but I don't understand how deletion works.
I have created a simple topic with
retention.ms = 60000
and
segment.ms = 60000
and
cleanup.policy=delete.
After this I created a producer and I sent some messages.
A consumer receives the messages without problems.
But I expected that if I ran the consumer again after one minute, it wouldn't show the messages because they should have been deleted. That is not what happens.
If I create a query in ksql it's the same. The messages always appear.
I think I don't understand how deletion works.
Example:
1) Topic
./kafka-topics --create --zookeeper localhost:2181 --topic test \
--replication-factor 2 --partitions 1 --config "cleanup.policy=delete" \
--config "delete.retention.ms=60000" --config "segment.ms=60000"
2) producer
./kafka-avro-console-producer --broker-list broker:29092 --topic test \
--property parse.key=true --property key.schema='{"type":"long"}' \
--property "key.separator=:" \
--property value.schema='{"type": "record","name": "ppp","namespace": "test.topic","fields": [{"name": "id","type": "long"}]}'
3) messages from producer
1:{"id": 1}
2:{"id": 2}
4:{"id": 4}
5:{"id": 5}
4) Consumer
./kafka-avro-console-consumer \
--bootstrap-server broker:29092 \
--property schema.registry.url=http://localhost:8081 \
--topic test --from-beginning --property print.key=true
The consumer shows the four messages.
But I expected that if I ran the consumer again after one minute (I have waited longer too, even hours), the messages wouldn't show up, because retention.ms and segment.ms are one minute.
When are messages actually deleted?
Another important thing to know about the deletion process in Kafka is the log segment file:
Topics are divided into partitions, right? This is what allows parallelism, scaling, etc.
Each partition is divided into log segment files. Why? Because Kafka writes data to disk, and we don't want it to keep an entire topic or partition in one huge file, so it splits the data into smaller files (segments).
Breaking data into smaller files has many advantages that are not really related to the question. You can read more here
The key thing to notice here is:
The retention policy looks at the log segment file's timestamp.
"Retention by time is performed by examining the last modified time (mtime) on each log segment file on disk. Under normal cluster operations, this is the time that the log segment was closed, and represents the timestamp of the last message in the file"
(From Kafka: The Definitive Guide, page 26)
Version 0.10.1.0
The log retention time is no longer based on last modified time of the log segments. Instead it will be based on the largest timestamp of the messages in a log segment.
This means retention only looks at closed log segment files.
Make sure your segment config parameters are right.
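The segment-based rule described above can be sketched in a few lines of Python. This is a simplified model of the answer's description, not broker code: the Segment class, its fields and segments_to_delete are hypothetical, and it assumes only closed segments are considered and that a segment becomes eligible once its largest message timestamp is older than the retention period. Real brokers apply further rules and only run the check periodically.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    base_offset: int
    largest_timestamp_ms: int  # timestamp of the newest message in the segment
    active: bool = False       # True for the segment currently being written


def segments_to_delete(segments, now_ms, retention_ms):
    """Return closed segments whose newest message is older than retention_ms."""
    return [
        s for s in segments
        if not s.active and now_ms - s.largest_timestamp_ms > retention_ms
    ]


now = 1_000_000  # arbitrary "current time" in milliseconds for the example
segments = [
    Segment(base_offset=0, largest_timestamp_ms=now - 300_000),               # closed, old
    Segment(base_offset=4, largest_timestamp_ms=now - 30_000),                # closed, recent
    Segment(base_offset=9, largest_timestamp_ms=now - 200_000, active=True),  # old, still active
]
for s in segments_to_delete(segments, now_ms=now, retention_ms=60_000):
    print("eligible for deletion: segment starting at offset", s.base_offset)
```

In this model a handful of messages that all sit in the still-active segment are never removed, which would be consistent with the messages remaining visible long after one minute.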
Set retention.ms, as mentioned by Ajay Srivastava above, using kafka-topics --zookeeper localhost:2181 --alter --topic test --config retention.ms=60000 and test again. Note that the create command in the question sets delete.retention.ms rather than retention.ms.

Current offset behavior when set by kafka-consumer-groups to earliest?

I have a kafka topic with 25 partitions and the cluster has been running for 5 months.
As I understand it, the offsets in each partition of a topic start at 0 and grow without bound (0, 1, 2, ...).
I see the log-end-offset at a very high value (right now -> 1230628032)
I created a new consumer group with its offset set to earliest, so I expected a client in that consumer group to start reading from offset 0.
The command which I used to create a new consumer group with offset to earliest:
kafka-consumer-groups --bootstrap-server <IP_address>:9092 --reset-offsets --to-earliest --topic some-topic --group to-earliest-cons --execute
I can see that the consumer group was created. I expected the current offset to be 0; however, when I described the consumer group, the current offset was very high (at the moment, 1143755193).
The record retention period is set to 7 days (the default).
My question is: why isn't 0 the first offset from which a consumer in this group will read? Does it have something to do with data retention?
Can anyone help me understand this?
It is exactly data retention. It is highly probable that Kafka already removed old messages with offset 0 from your partitions, so it doesn't make sense to start from 0. Instead, Kafka will set offset to the earliest available message on your partition. You can check those offsets using:
./kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list <IP_address>:9092 --topic some-topic --time -2
You will probably see values really close to what you're seeing as new consumer offset.
You can also try and set offset explicitly to 0:
./kafka-consumer-groups.sh --bootstrap-server <IP_address>:9092 --reset-offsets --to-offset 0 --topic some-topic --group to-earliest-cons --execute
However, you will see a warning that offset 0 does not exist and that a higher value (the aforementioned earliest available message) will be used:
New offset (0) is lower than earliest offset for topic partition some-topic. Value will be set to 1143755193
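The adjustment reported in that warning amounts to clamping the requested offset into the range that still exists. A minimal Python sketch, assuming the tool bounds the value by the earliest and latest offsets of the partition; the function name is hypothetical, the warning texts are not the tool's real messages, and the upper bound is an assumption since the warning above only shows the lower one:

```python
def clamp_reset_offset(requested, earliest, latest):
    """Return (offset_to_commit, warning_or_None) for one partition."""
    if requested < earliest:
        return earliest, f"requested {requested} is below earliest {earliest}"
    if requested > latest:
        return latest, f"requested {requested} is above latest {latest}"
    return requested, None


# Numbers from the question, treated here as one partition's earliest offset
# and log end offset purely for illustration.
offset, warning = clamp_reset_offset(0, earliest=1143755193, latest=1230628032)
print(offset, warning)
```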

How to get log end offset of all partitions for a given kafka topic using kafka command line?

When I describe a Kafka topic, the output doesn't show the log end offset of any partition, but it shows all the other metadata such as ISR, Replicas, and Leader.
How do I see the log end offset of each partition for a given topic?
I ran this: ./kafka-topics.sh --zookeeper zk-service:2181 --describe --topic "__consumer_offsets"
The output doesn't have an offset column.
Note: I need only the log end offset.
Since you're only looking for the log end offset for a topic, you can use kafka-run-class with the kafka.tools.GetOffsetShell class.
Assuming your topic is __consumer_offsets, you would get the end offset by running:
./kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list localhost:9092 --time -1 --topic __consumer_offsets
Change the --broker-list localhost:9092 to your desired Kafka address. This will list all of the log end offsets for each partition in the topic.
Install kafkacat; it's an easy-to-use Kafka tool:
sudo apt-get update
sudo apt-get install kafkacat
kafkacat -C -b <kafka-broker-ip-and-port> -t <topic> -o -1
This will not consume anything because the offset is incremented after a message is added. But it will give you the offsets for all the partitions. Note however that this isn't the current offset that you are consuming at... The above answers will help you more in terms of looking into partition lag.
Following is the command you would need to get the offset of all partitions for a given kafka topic for a given consumer group:
kafka-consumer-groups --bootstrap-server <kafka-broker-list-with-ports> --describe --group <consumer-group-name>
Please note that the <consumer-group-name> at the end is important as the offsets are committed by consumers that are typically a part of a consumer group.
The output of this command may look something like:
TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID HOST CLIENT-ID
<topic-name> 0 62 62 0 <consumer-id> <host> <client>
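If you want to post-process a table like this, a small Python sketch along the following lines could work. It assumes whitespace-separated columns with the header names shown above; the sample rows are made up, and the column layout may differ between Kafka versions. Lag is computed as LOG-END-OFFSET minus CURRENT-OFFSET.

```python
def parse_describe(output):
    """Parse the whitespace-separated table into a list of dict rows."""
    lines = [line for line in output.splitlines() if line.strip()]
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


def lag(row):
    """Lag is how far the committed offset trails the log end offset."""
    return int(row["LOG-END-OFFSET"]) - int(row["CURRENT-OFFSET"])


# Made-up sample rows in the layout shown above, not real tool output.
sample = """\
TOPIC     PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID HOST      CLIENT-ID
my-topic  0         62             62             0   consumer-1  /10.0.0.1 client-1
my-topic  1         40             55             15  consumer-1  /10.0.0.1 client-1
"""
for row in parse_describe(sample):
    print(row["TOPIC"], row["PARTITION"], row["LOG-END-OFFSET"], lag(row))
```

The sketch does not handle rows where an offset column holds a non-numeric placeholder, so real output may need extra checks.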
In your post however, you're trying to get this information for the internal topic __consumer_offsets so you would need a consumer group which would have consumers consuming from this internal topic. You could perhaps do the following:
kafka-console-consumer --bootstrap-server <kafka-broker-list-with-ports> --topic __consumer_offsets --formatter "kafka.coordinator.group.GroupMetadataManager\$OffsetsMessageFormatter" --max-messages 5
Output of the above command:
[<consumer-group-name>,<topic-name>,0]::[OffsetMetadata[481690879,NO_METADATA],CommitTime 1479708539051,ExpirationTime 1480313339051]
Just use the <consumer-group-name> from the output and put it in the kafka-consumer-groups command mentioned in the beginning and you'll get the offset details for all the 50 partitions for the given consumer group only.
I hope this helps.

Kafka GetOffsetShell timestamp doesn't seem to work

I would like to look at a Kafka topic starting from a particular time, using kafka-console-consumer and passing it an --offset corresponding to that time. In order to figure out what offset to specify, I tried using the command:
kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list ... --topic data.live --time 1533827402000
but all I get back is:
data.live:8:
data.live:2:
data.live:5:
data.live:4:
...
(i.e., no offsets). The command works fine if I specify a --time of -2 ("earliest") or -1 ("latest"), giving back results like:
data.live:8:765349205
data.live:2:766537956
data.live:5:759575128
data.live:4:761703674
...
(I assume that the numbers after the second colons are offsets).
My question is: How do I get offsets for an intermediate time? I tried using a millisecond timestamp that I think should have some data.
And a second question: Using the offsets for --time of -1 and -2, I guess at an intermediate offset, and then I pass that to the
kafka-console-consumer.sh --bootstrap-server ... --topic data.live --offset 77000000 --partition 1
but the --offset seems to make no difference in what I get back (my topic values contain a human-readable timestamp that indicates this).
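When I run date on my Mac with that value, here's what I get:
$ date -r 1533827402000
Fri Jan 20 17:13:20 CST 50575
date -r reads its argument as seconds, so this value lands in the year 50575. I think the command doesn't print the offsets because they don't exist (yet) for that time.
Try --time 1533827402 with the same command (that value falls in 2018).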
You can also do kafka-console-consumer ... --property print.timestamp=true, then grep out the first few digits you would be interested in, then use those values as examples for the GetOffsetShell
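Since seconds and milliseconds are easy to mix up here, a small Python sketch for converting between an epoch value and a readable UTC time may help when picking a --time value. It assumes the value is in epoch milliseconds, which is the unit Kafka uses for record timestamps (for example in the print.timestamp output); the helper names are illustrative.

```python
from datetime import datetime, timezone


def ms_to_utc(epoch_ms):
    """Interpret an epoch value as milliseconds and return a UTC datetime."""
    return datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)


def utc_to_ms(dt):
    """Convert a timezone-aware datetime to epoch milliseconds."""
    return round(dt.timestamp() * 1000)


print(ms_to_utc(1533827402000).isoformat())
print(utc_to_ms(datetime(2018, 8, 9, 15, 10, 2, tzinfo=timezone.utc)))
```

Read as milliseconds, the value from the question corresponds to 9 August 2018, 15:10:02 UTC.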
Maybe you are running with an older log format version?
We are seeing the same thing but we have log.message.format.version=0.9.0.1
I think that is why: log message timestamps weren't introduced until 0.10.