Identify oldest Kafka offset that has data for consumption - apache-kafka

I ran into a data issue today and, to solve it, I have to recalculate everything from the last 3 months. But when I run this command in Kafka:
./kafka-console-consumer.sh --bootstrap-server 10.8.95.21:9092 --topic backoffice --from-beginning
it fails with the error: The requested offset is not within the range of offsets maintained by the server
The --from-beginning option is trying to read from offsets whose data has already been purged by Kafka.
Can I list offsets along with the time they were created, so that I can estimate where to start consuming from? Alternatively, if I can identify the oldest Kafka offset that still has data, I can start reading from that offset.

Have you tried kt (fgeller/kt)? It is a great alternative to the Kafka console tools. It is written in Go, so it is also very fast, and another advantage is that it prints the offset of each message by default.
So you can simply write something like:
kt consume -brokers <broker-name> -topic <topic-name> oldest
and the output will be something like this:
{
  "partition": 0,
  "offset": <oldest-offset>,
  "key": "<your-key>",
  "value": "<value of the message>"
}
Edit: If you want a UI for this, Kafdrop is just what you are looking for. Setting it up is pretty easy and you can get all offset-related information quite easily. You can even view the message corresponding to an offset, which is pretty amazing.

The following command worked for me:
./bin/kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list <broker-name> --topic <topic-name> --time -2
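If you prefer to do this programmatically, below is a minimal sketch using the confluent-kafka Python client to query the earliest and latest available offsets per partition; the broker address, topic name and group id are placeholders to replace with your own.

from confluent_kafka import Consumer, TopicPartition

# group.id is required by the client, but no offsets are committed here.
consumer = Consumer({
    "bootstrap.servers": "10.8.95.21:9092",
    "group.id": "offset-inspector",
    "enable.auto.commit": False,
})

topic = "backoffice"
metadata = consumer.list_topics(topic, timeout=10)
for partition_id in metadata.topics[topic].partitions:
    # low = oldest offset that still has data, high = the next offset to be written
    low, high = consumer.get_watermark_offsets(TopicPartition(topic, partition_id), timeout=10)
    print("partition {}: oldest available offset = {}, latest = {}".format(partition_id, low, high))

consumer.close()

The low watermark here is the same value GetOffsetShell reports with --time -2, so either approach can feed the --offset parameter of kafka-console-consumer.sh.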

Related

How to move a topic from one broker to another broker in Kafka?

I first tried to see if I can create a topic on a particular broker, but it looks like this is not possible, even if I mention that broker's host in the bootstrap servers:
admin_client = AdminClient({
    "bootstrap.servers": "xxx1.com:9092,xxx2.com:9092"
})
futmap = admin_client.create_topics(topic_list)
The program arbitrarily picks one of my 5 brokers as the leader broker for the topic. I am trying to understand why this happens.
I am also trying to see if I can reassign the topic leader to another broker. I know it may be possible through the kafka-reassign-partitions command line script, but I wanted to do it programmatically using Python and the confluent-kafka package. Is this possible? I did not find a reassign-partitions function in the AdminClient class of the confluent-kafka package.
Thanks
I have finally got the solution for this. The documentation of the confluent-kafka Python package is not adequate here, but one good thing about open source is that you can read the code and figure it out. So, to create the topic on a particular broker, I had to write the topic-creation code as below. Please note that I used replica_assignment instead of replication_factor; these two are mutually exclusive. If you use replication_factor, the partitions are assigned by Kafka, whereas with replica_assignment you control the assignment yourself. However, I am sure this will get re-assigned in case of a rebalancing/re-assignment of partitions, but that can also be handled through the on_revoke event. For now, this works for me.
from confluent_kafka.admin import NewTopic

def createTopic(admin_client, topics):
    # topic_name = topics
    topic_name = ['rajib1_test_xxx_topic']
    # Partition 0 is placed on brokers 262 (preferred leader) and 261.
    replica_assignment = [[262, 261]]
    topic_list = [NewTopic(topic, num_partitions=1, replica_assignment=replica_assignment)
                  for topic in topic_name]
    futmap = admin_client.create_topics(topic_list)
    # Wait for each operation to finish.
    for topic, f in futmap.items():
        try:
            f.result()  # The result itself is None
            print("Topic {} created".format(topic))
        except Exception as e:
            print("Failed to create topic {}: {}".format(topic, e))
    # return futmap
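A minimal usage sketch (the bootstrap servers are the placeholders from the question; the describe step just verifies where the partition ended up):

from confluent_kafka.admin import AdminClient

admin_client = AdminClient({"bootstrap.servers": "xxx1.com:9092,xxx2.com:9092"})
createTopic(admin_client, None)  # the topics argument is unused in the snippet above

# Check which broker leads the new partition and where its replicas live.
metadata = admin_client.list_topics("rajib1_test_xxx_topic", timeout=10)
for pid, p in metadata.topics["rajib1_test_xxx_topic"].partitions.items():
    print("partition", pid, "leader", p.leader, "replicas", p.replicas)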
You could also use the kafka-reassign-partitions.sh tool that comes with Kafka to move a topic's replicas to another broker.
For example, if you want to have your (in this example single-replicated, and single-partitioned) topic "test" be located on broker "1", you can first define a plan (named replicachange.json):
{
  "partitions": [
    {"topic": "test", "partition": 0, "replicas": [1]}
  ],
  "version": 1
}
and then execute it using:
kafka-reassign-partitions.sh --zookeeper localhost:2181 --execute \
--reassignment-json-file replicachange.json

Kafka-Verifiable-Producer and Consumer Problem

I am experimenting with Kafka. I have already started
kafka-console-producer and
kafka-console-consumer.
I send messages with the console producer and successfully receive them at the console consumer.
Now I want to produce and consume around 5000 messages at once. Looking into the documentation, I learned that there are two commands:
kafka-verifiable-producer.sh
kafka-verifiable-consumer.sh
I tried to use these commands:
kafka-verifiable-producer.sh --broker-list localhost:9092 --max-messages 5000 --topic data-sending
kafka-verifiable-consumer.sh --group-instance-id 1 --group-id data-world --topic data-sending --broker-list localhost:9092
The producer output is as follows:
{"timestamp":1581268289761,"name":"producer_send_success","key":null,"value":"4996","offset":44630,"topic":"try_1","partition":0}
{"timestamp":1581268289761,"name":"producer_send_success","key":null,"value":"4997","offset":44631,"topic":"try_1","partition":0}
{"timestamp":1581268289761,"name":"producer_send_success","key":null,"value":"4998","offset":44632,"topic":"try_1","partition":0}
{"timestamp":1581268289761,"name":"producer_send_success","key":null,"value":"4999","offset":44633,"topic":"try_1","partition":0}
{"timestamp":1581268289769,"name":"shutdown_complete"}
{"timestamp":1581268289771,"name":"tool_data","sent":5000,"acked":5000,"target_throughput":-1,"avg_throughput":5285.412262156448}
On the consumer console the result is as follows:
{"timestamp":1581268089357,"name":"records_consumed","count":352,"partitions":[{"topic":"try_1","partition":0,"count":352,"minOffset":32777,"maxOffset":33128}]}
{"timestamp":1581268089359,"name":"offsets_committed","offsets":[{"topic":"try_1","partition":0,"offset":33129}],"success":true}
{"timestamp":1581268089384,"name":"records_consumed","count":500,"partitions":[{"topic":"try_1","partition":0,"count":500,"minOffset":33129,"maxOffset":33628}]}
{"timestamp":1581268089391,"name":"offsets_committed","offsets":[{"topic":"try_1","partition":0,"offset":33629}],"success":true}
{"timestamp":1581268089392,"name":"records_consumed","count":270,"partitions":[{"topic":"try_1","partition":0,"count":270,"minOffset":33629,"maxOffset":33898}]}
{"timestamp":1581268089394,"name":"offsets_committed","offsets":[{"topic":"try_1","partition":0,"offset":33899}],"success":true}
{"timestamp":1581268089415,"name":"records_consumed","count":500,"partitions":[{"topic":"try_1","partition":0,"count":500,"minOffset":33899,"maxOffset":34398}]}
{"timestamp":1581268089416,"name":"offsets_committed","offsets":[{"topic":"try_1","partition":0,"offset":34399}],"success":true}
{"timestamp":1581268089417,"name":"records_consumed","count":235,"partitions":[{"topic":"try_1","partition":0,"count":235,"minOffset":34399,"maxOffset":34633}]}
{"timestamp":1581268089419,"name":"offsets_committed","offsets":[{"topic":"try_1","partition":0,"offset":34634}],"success":true}
In the above results, the key is null.
How can I send a bulk of messages with this command?
I tried to look for an example of how to use them but didn't find any. The producer emits integer-like values, but where can I insert my own messages?
Is there any way I can use this command to produce messages in bulk? Also, is it possible to use such commands on Windows, or are they just for Linux?
Any link to examples would be greatly appreciated.
The script kafka-verifiable-producer.sh executes the class org.apache.kafka.tools.VerifiableProducer
(https://github.com/apache/kafka/blob/trunk/tools/src/main/java/org/apache/kafka/tools/VerifiableProducer.java).
Its program arguments --throughput, --repeating-keys and --value-prefix may fulfil your needs.
For example, the following produces messages with the value prefix 100 and with an incremental key that rotates every 5 messages. You can also configure the throughput of the messages with the --throughput option; in this example, it produces an average of 5 messages per second.
./kafka-verifiable-producer.sh --broker-list localhost:9092 --max-messages 10 --repeating-keys 5 --value-prefix 100 --throughput 5 --topic test
{"timestamp":1581271492652,"name":"startup_complete"}
{"timestamp":1581271492860,"name":"producer_send_success","key":"0","value":"100.0","offset":45,"topic":"test","partition":0}
{"timestamp":1581271492862,"name":"producer_send_success","key":"1","value":"100.1","offset":46,"topic":"test","partition":0}
{"timestamp":1581271493048,"name":"producer_send_success","key":"2","value":"100.2","offset":47,"topic":"test","partition":0}
{"timestamp":1581271493254,"name":"producer_send_success","key":"3","value":"100.3","offset":48,"topic":"test","partition":0}
{"timestamp":1581271493256,"name":"producer_send_success","key":"4","value":"100.4","offset":49,"topic":"test","partition":0}
{"timestamp":1581271493457,"name":"producer_send_success","key":"0","value":"100.5","offset":50,"topic":"test","partition":0}
{"timestamp":1581271493659,"name":"producer_send_success","key":"1","value":"100.6","offset":51,"topic":"test","partition":0}
{"timestamp":1581271493860,"name":"producer_send_success","key":"2","value":"100.7","offset":52,"topic":"test","partition":0}
{"timestamp":1581271494063,"name":"producer_send_success","key":"3","value":"100.8","offset":53,"topic":"test","partition":0}
{"timestamp":1581271494268,"name":"producer_send_success","key":"4","value":"100.9","offset":54,"topic":"test","partition":0}
{"timestamp":1581271494483,"name":"shutdown_complete"}
{"timestamp":1581271494484,"name":"tool_data","sent":10,"acked":10,"target_throughput":5,"avg_throughput":5.452562704471101}
If you are looking for more customized message keys and values, the easiest approach is to modify/extend the above class.
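If you would rather not touch the Java class, another option is a small producer of your own. Below is a minimal sketch with the confluent-kafka Python client that sends 5000 custom messages in bulk; the broker address, topic name and message contents are assumptions to adapt. Being a plain Python script, it also runs on Windows.

from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

def on_delivery(err, msg):
    # Called once per message after the broker acknowledges (or rejects) it.
    if err is not None:
        print("Delivery failed for key {}: {}".format(msg.key(), err))

for i in range(5000):
    producer.produce(
        "data-sending",
        key=str(i),
        value="my message number {}".format(i),
        on_delivery=on_delivery,
    )
    producer.poll(0)  # serve delivery callbacks without blocking

producer.flush()  # block until all outstanding messages are acknowledged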

Delete topic messages in Apache Kafka

I'm testing how kafka-topics works, but I don't understand how the deletion works.
I have created a simple topic with
retention.ms = 60000
and
segment.ms = 60000
and
cleanup.policy=delete.
After this I created a producer and sent some messages.
A consumer receives the messages without problems.
But I expected that, after one minute, if I rerun the consumer, it would not show the messages because they should have been deleted. This behaviour doesn't occur.
If I create a query in KSQL it's the same; the messages always appear.
I think that I don't understand how the deletion works.
Example:
1) Topic
./kafka-topics --create --zookeeper localhost:2181 --topic test \
  --replication-factor 2 --partitions 1 --config "cleanup.policy=delete" \
  --config "delete.retention.ms=60000" --config "segment.ms=60000"
2) producer
./kafka-avro-console-producer --broker-list broker:29092 --topic test \
  --property parse.key=true --property key.schema='{"type":"long"}' \
  --property "key.separator=:" \
  --property value.schema='{"type": "record","name": "ppp","namespace": "test.topic","fields": [{"name": "id","type": "long"}]}'
3) messages from producer
1:{"id": 1}
2:{"id": 2}
4:{"id": 4}
5:{"id": 5}
4) Consumer
./kafka-avro-console-consumer \
--bootstrap-server broker:29092 \
--property schema.registry.url=http://localhost:8081 \
--topic test --from-beginning --property print.key=true
The consumer shows the four messages.
But I expect that if I run the consumer again after one minute (I have waited longer too, even hours) the messages won't show, because retention.ms and segment.ms are one minute.
When are the messages actually deleted?
Another important thing to know about the deletion process in Kafka is the log segment file:
Topics are divided into partitions, right? This is what allows parallelism, scale, etc.
Each partition is divided into log segment files. Why? Because Kafka writes data to disk, right? We don't want it to keep the entire topic/partition in one huge file, but to split it into smaller files (segments).
Breaking data into smaller files has many advantages, not really related to the question. You can read more here.
The key thing to notice here is:
The retention policy looks at the log segment file's timestamp.
"Retention by time is performed by examining the last modified
time (mtime) on each log segment file on disk. Under normal clus‐
ter operations, this is the time that the log segment was closed, and
represents the timestamp of the last message in the file"
(From Kafka-definitive Guide, page 26)
Since version 0.10.1.0, the log retention time is no longer based on the last modified time of the log segments. Instead it is based on the largest timestamp of the messages in a log segment.
Which means it only looks at closed log segment files.
Make sure your 'segment' config params are right.
Change the retention.ms as mentioned by Ajay Srivastava above using kafka-topics --zookeeper localhost:2181 --alter --topic test --config retention.ms=60000 and test again.
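If you prefer to change the topic configuration programmatically rather than via kafka-topics, here is a minimal sketch using the confluent-kafka Python AdminClient; the broker address matches the one used above and the config values mirror the question. Note that alter_configs is not incremental, so it replaces the topic's dynamic config as a whole.

from confluent_kafka.admin import AdminClient, ConfigResource

admin = AdminClient({"bootstrap.servers": "broker:29092"})

# Replaces the topic's dynamic config, so include every override you want to keep.
resource = ConfigResource(
    ConfigResource.Type.TOPIC,
    "test",
    set_config={
        "cleanup.policy": "delete",
        "segment.ms": "60000",
        "retention.ms": "60000",
    },
)

futures = admin.alter_configs([resource])
for res, fut in futures.items():
    fut.result()  # raises on failure
    print("Updated config for", res)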

Kafka GetOffsetShell timestamp doesn't seem to work

I would like to look at a Kafka topic starting from a particular time, using kafka-console-consumer and passing it an --offset corresponding to that time. In order to figure out what offset to specify, I tried using the command:
kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list ... --topic data.live --time 1533827402000
but all I get back is:
data.live:8:
data.live:2:
data.live:5:
data.live:4:
...
(i.e., no offsets). The command works fine if I specify a --time of -2 ("earliest") or -1 ("latest"), giving back results like:
data.live:8:765349205
data.live:2:766537956
data.live:5:759575128
data.live:4:761703674
...
(I assume that the numbers after the second colons are offsets).
My question is: How do I get offsets for an intermediate time? I tried using a millisecond timestamp that I think should have some data.
And a second question: using the offsets returned for --time -1 and -2, I guess at an intermediate offset and then pass it to
kafka-console-consumer.sh --bootstrap-server ... --topic data.live --offset 77000000 --partition 1
but the --offset seems to make no difference in what I get back (my topic values contain a human-readable timestamp that indicates this).
When I run date on my Mac, here's what I get
$ date -r 1533827402000
Fri Jan 20 17:13:20 CST 50575
I think it doesn't print the offsets because they don't exist (yet).
Try --time 1533827402 (which corresponds to a date in 2018) with the same command.
You can also run kafka-console-consumer ... --property print.timestamp=true, then grep out the first few digits you are interested in, and use those values as examples for the GetOffsetShell.
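Another way to resolve a timestamp to an offset is to do it programmatically. Below is a minimal sketch with the confluent-kafka Python client; the broker address and group id are placeholders, and the timestamp is given in milliseconds, as with GetOffsetShell.

from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # placeholder broker
    "group.id": "ts-lookup",                # required by the client; no offsets are committed
    "enable.auto.commit": False,
})

topic = "data.live"
timestamp_ms = 1533827402000  # the point in time you want to start from, in milliseconds

# Build one TopicPartition per partition, with the offset field carrying the timestamp to look up.
metadata = consumer.list_topics(topic, timeout=10)
query = [TopicPartition(topic, p, timestamp_ms) for p in metadata.topics[topic].partitions]

# For each partition, returns the earliest offset whose timestamp is >= timestamp_ms (or -1 if none).
for tp in consumer.offsets_for_times(query, timeout=10):
    print("partition {}: offset {}".format(tp.partition, tp.offset))

consumer.close()

If this returns -1 even for times you know have data, the older log message format mentioned in the answer below is a likely cause.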
Maybe you are running with an older log format version?
We are seeing the same thing, but we have log.message.format.version=0.9.0.1.
I think that is why; log message timestamps weren't introduced until 0.10.

Kafka: unable to consume events given an offset

I was just following the quick start guide for Kafka and I decided to test offsets a little bit.
The only modification I did to the default configuration was adding:
log.retention.minutes=5
My test topic was created as basic as possible, as suggested in the quick start guide (1 partition, replication factor 1):
$ bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
I've produced some messages, m1 and m2 (adding date before and after):
$ date
viernes, 21 de julio de 2017, 12:16:06 CEST
$ bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
>m1
>m2
>^C
$ date
viernes, 21 de julio de 2017, 12:16:25 CEST
The thing is I'm able to consume them from the beginning, but I'm not able to consume them given an offset (for instance, offset 0, which I understand points to the first message):
$ date
viernes, 21 de julio de 2017, 12:16:29 CEST
$ bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --offset 0 --partition 0
^CProcessed a total of 0 messages
$ date
viernes, 21 de julio de 2017, 12:17:25 CEST
$ bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning
m1
m2
^CProcessed a total of 2 messages
$ date
viernes, 21 de julio de 2017, 12:17:50 CEST
Most probably I have not properly understood this statement from the documentation:
In fact, the only metadata retained on a per-consumer basis is the offset or position of that consumer in the log. This offset is controlled by the consumer: normally a consumer will advance its offset linearly as it reads records, but, in fact, since the position is controlled by the consumer it can consume records in any order it likes. For example a consumer can reset to an older offset to reprocess data from the past or skip ahead to the most recent record and start consuming from "now".
Moreover, I've seen that if I produce a third message (m3) after running the consumer as described above (i.e. pointing to offset 0), this third message is read:
$ bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --offset 0 --partition 0
m3
Could anybody explain this behavior, please? Thanks!
Alright, after a lot of comments and a bit of code searching I think that this is what is happening:
When you configured your retention period with 5 minutes, you caused Kafka to delete a few of your old messages - most notably the one with offset 0. So at some point in time the smallest offset in partition 0 became, let's say, 4.
When you start a console consumer with --from-beginning, it internally calls a method that initializes the beginning offset to the smallest offset that can be found in the partition - 4 in this case. With that offset the consumer starts polling and receives that message and all subsequent ones, which is all messages for the partition.
If you start a consumer with --offset 0, that piece of code is bypassed and the consumer polls with an offset of 0 - the broker responds to that with an OFFSET_OUT_OF_RANGE error. The consumer, upon receiving that error, resets the offset for the partition in question, and for this it uses the parameter auto.offset.reset, which in theory can be earliest or latest.
However, due to the way the ConsoleConsumer is written, the only way to have this parameter set to earliest is to pass the command line parameter --from-beginning, which cannot be combined with --offset, so effectively the only possible value that auto.offset.reset can have here is latest.
So when you poll with an offset of 0 that no longer exists, you get an unsuccessful poll for data, and after that the same behavior as if you hadn't passed any offset parameter at all.
Hope that helps and makes sense.
Update:
As of Kafka version 1.0 this behavior has been changed by KAFKA-5629 and should now behave a bit more in line with expectations.
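If you write your own consumer instead of using the console tool, you control auto.offset.reset yourself. Here is a minimal sketch with the confluent-kafka Python client (broker, topic and group id are placeholders) that asks for offset 0 and falls back to the earliest available offset if that offset has already been deleted.

from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "offset-test",
    "enable.auto.commit": False,
    # Unlike the console consumer with --offset, we choose what happens
    # when the requested offset is out of range.
    "auto.offset.reset": "earliest",
})

# Manually assign partition 0 starting at offset 0 (no --from-beginning needed).
consumer.assign([TopicPartition("test", 0, 0)])

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None:
            continue
        if msg.error():
            print("Error:", msg.error())
            continue
        print(msg.offset(), msg.value().decode("utf-8"))
except KeyboardInterrupt:
    pass
finally:
    consumer.close()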
You can try: --offset earliest