I am experimenting with Kafka. I have already started the
kafka-console-producer and
kafka-console-consumer.
I send messages with the producer and successfully receive them at the kafka-console-consumer.
Now I want to produce and consume around 5000 messages at once. Looking into the documentation, I learned that there are two commands:
kafka-verifiable-producer.sh
kafka-verifiable-consumer.sh
I tried to use these commands:
kafka-verifiable-producer.sh --broker-list localhost:9092 --max-messages 5000 --topic data-sending
kafka-verifiable-consumer.sh --group-instance-id 1 --group-id data-world --topic data-sending --broker-list localhost:9092
The result is as follows:
"timestamp":1581268289761,"name":"producer_send_success","key":null,"value":"4996","offset":44630,"topic":"try_1","partition":0}
{"timestamp":1581268289761,"name":"producer_send_success","key":null,"value":"4997","offset":44631,"topic":"try_1","partition":0}
{"timestamp":1581268289761,"name":"producer_send_success","key":null,"value":"4998","offset":44632,"topic":"try_1","partition":0}
{"timestamp":1581268289761,"name":"producer_send_success","key":null,"value":"4999","offset":44633,"topic":"try_1","partition":0}
{"timestamp":1581268289769,"name":"shutdown_complete"}
{"timestamp":1581268289771,"name":"tool_data","sent":5000,"acked":5000,"target_throughput":-1,"avg_throughput":5285.412262156448}
On the consumer console, the result is as follows:
{"timestamp":1581268089357,"name":"records_consumed","count":352,"partitions":[{"topic":"try_1","partition":0,"count":352,"minOffset":32777,"maxOffset":33128}]}
{"timestamp":1581268089359,"name":"offsets_committed","offsets":[{"topic":"try_1","partition":0,"offset":33129}],"success":true}
{"timestamp":1581268089384,"name":"records_consumed","count":500,"partitions":[{"topic":"try_1","partition":0,"count":500,"minOffset":33129,"maxOffset":33628}]}
{"timestamp":1581268089391,"name":"offsets_committed","offsets":[{"topic":"try_1","partition":0,"offset":33629}],"success":true}
{"timestamp":1581268089392,"name":"records_consumed","count":270,"partitions":[{"topic":"try_1","partition":0,"count":270,"minOffset":33629,"maxOffset":33898}]}
{"timestamp":1581268089394,"name":"offsets_committed","offsets":[{"topic":"try_1","partition":0,"offset":33899}],"success":true}
{"timestamp":1581268089415,"name":"records_consumed","count":500,"partitions":[{"topic":"try_1","partition":0,"count":500,"minOffset":33899,"maxOffset":34398}]}
{"timestamp":1581268089416,"name":"offsets_committed","offsets":[{"topic":"try_1","partition":0,"offset":34399}],"success":true}
{"timestamp":1581268089417,"name":"records_consumed","count":235,"partitions":[{"topic":"try_1","partition":0,"count":235,"minOffset":34399,"maxOffset":34633}]}
{"timestamp":1581268089419,"name":"offsets_committed","offsets":[{"topic":"try_1","partition":0,"offset":34634}],"success":true}
In the above results, the key is null.
How can I send a bulk of messages with this command?
I tried to find an example of how to use these commands but couldn't find any. The tool produces integer-like values, but where can I insert my own messages?
Is there any way I can use this command to produce messages in bulk? Also, is it possible to run these commands on Windows, or are they Linux-only?
Any link to examples would be greatly appreciated.
The script kafka-verifiable-producer.sh executes the class org.apache.kafka.tools.VerifiableProducer.
(https://github.com/apache/kafka/blob/trunk/tools/src/main/java/org/apache/kafka/tools/VerifiableProducer.java)
Its program arguments --throughput, --repeating-keys and --value-prefix may fulfil your needs.
For example, the following produces messages with the value prefix 100 and an incrementing key that rotates every 5 messages. You can also control the throughput of the messages with the --throughput option; in this example, it produces an average of 5 messages per second.
./kafka-verifiable-producer.sh --broker-list localhost:9092 --max-messages 10 --repeating-keys 5 --value-prefix 100 --throughput 5 --topic test
{"timestamp":1581271492652,"name":"startup_complete"}
{"timestamp":1581271492860,"name":"producer_send_success","key":"0","value":"100.0","offset":45,"topic":"test","partition":0}
{"timestamp":1581271492862,"name":"producer_send_success","key":"1","value":"100.1","offset":46,"topic":"test","partition":0}
{"timestamp":1581271493048,"name":"producer_send_success","key":"2","value":"100.2","offset":47,"topic":"test","partition":0}
{"timestamp":1581271493254,"name":"producer_send_success","key":"3","value":"100.3","offset":48,"topic":"test","partition":0}
{"timestamp":1581271493256,"name":"producer_send_success","key":"4","value":"100.4","offset":49,"topic":"test","partition":0}
{"timestamp":1581271493457,"name":"producer_send_success","key":"0","value":"100.5","offset":50,"topic":"test","partition":0}
{"timestamp":1581271493659,"name":"producer_send_success","key":"1","value":"100.6","offset":51,"topic":"test","partition":0}
{"timestamp":1581271493860,"name":"producer_send_success","key":"2","value":"100.7","offset":52,"topic":"test","partition":0}
{"timestamp":1581271494063,"name":"producer_send_success","key":"3","value":"100.8","offset":53,"topic":"test","partition":0}
{"timestamp":1581271494268,"name":"producer_send_success","key":"4","value":"100.9","offset":54,"topic":"test","partition":0}
{"timestamp":1581271494483,"name":"shutdown_complete"}
{"timestamp":1581271494484,"name":"tool_data","sent":10,"acked":10,"target_throughput":5,"avg_throughput":5.452562704471101}
If you are looking for more customized message keys and values, the easiest option is to modify/extend the above class.
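Alternatively, if all you need is to push a bulk of your own messages, a small standalone producer does the job. The following is only a rough sketch (not the verifiable producer itself), assuming the standard Java client, a broker at localhost:9092 and the data-sending topic from your question:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class BulkProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumed broker address
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Send 5000 messages with whatever keys and values you need
            for (int i = 0; i < 5000; i++) {
                producer.send(new ProducerRecord<>("data-sending",
                                                   String.valueOf(i),     // key
                                                   "my-message-" + i));   // custom value
            }
            producer.flush();  // make sure everything is sent before the client closes
        }
    }
}

As for Windows: the .sh scripts are thin wrappers around Java classes, so the same class can be run on Windows with bin\windows\kafka-run-class.bat org.apache.kafka.tools.VerifiableProducer, and a plain Java client like the sketch above runs anywhere a JVM does.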
I'm testing how kafka-topics works, but I don't understand how deletion works.
I have created a simple topic with
retention.ms = 60000
and
segment.ms = 60000
and
cleanup.policy=delete.
After this, I created a producer and sent some messages.
A consumer receives the messages without problems.
But I expected that, after one minute, if I ran the consumer again, it would not show the messages, because they should have been deleted. This behaviour does not occur.
If I create a query in KSQL, it's the same: the messages always appear.
I think that I don't understand how the deletion works.
Example:
1) Topic
./kafka-topics --create --zookeeper localhost:2181 --topic test \
  --replication-factor 2 --partitions 1 \
  --config "cleanup.policy=delete" \
  --config "delete.retention.ms=60000" \
  --config "segment.ms=60000"
2) producer
./kafka-avro-console-producer --broker-list broker:29092 --topic test \
  --property parse.key=true \
  --property key.schema='{"type":"long"}' \
  --property "key.separator=:" \
  --property value.schema='{"type": "record","name": "ppp","namespace": "test.topic","fields": [{"name": "id","type": "long"}]}'
3) messages from producer
1:{"id": 1}
2:{"id": 2}
4:{"id": 4}
5:{"id": 5}
4) Consumer
./kafka-avro-console-consumer \
--bootstrap-server broker:29092 \
--property schema.registry.url=http://localhost:8081 \
--topic test --from-beginning --property print.key=true
The consumer shows the four messages.
But I expected that if I ran the consumer again after one minute (I have also waited longer, even hours), the messages would not show, because retention.ms and segment.ms are one minute.
When are messages actually deleted?
Another important thing to know about the deletion process in Kafka is the log segment file:
Topics are divided into partitions, right? This is what allows parallelism, scaling, etc.
Each partition is divided into log segment files. Why? Because Kafka writes data to disk, and we don't want it to keep an entire topic/partition in one huge file; instead it is split into smaller files (segments).
Breaking data into smaller files has many advantages, not really related to the question. You can read more here.
The key thing to notice here is:
The retention policy looks at the log segment file's timestamp.
"Retention by time is performed by examining the last modified
time (mtime) on each log segment file on disk. Under normal clus‐
ter operations, this is the time that the log segment was closed, and
represents the timestamp of the last message in the file"
(From Kafka-definitive Guide, page 26)
As of version 0.10.1.0:
The log retention time is no longer based on the last modified time of the log segments. Instead, it is based on the largest timestamp of the messages in a log segment.
Which means it looks only at closed log segment files; the active (open) segment is not eligible for deletion until it has been rolled.
Make sure your segment config params are right; one way to double-check them is the sketch below.
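As a rough sketch for verifying the effective topic configuration programmatically, assuming a client recent enough to have the AdminClient API and a broker reachable at localhost:9092 (adjust to broker:29092 for the setup above):

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.common.config.ConfigResource;

public class ShowTopicConfigs {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "test");
            Config config = admin.describeConfigs(Collections.singleton(topic))
                                 .all().get()
                                 .get(topic);
            // Print only the configs relevant to deletion
            config.entries().stream()
                  .filter(e -> e.name().equals("retention.ms")
                            || e.name().equals("segment.ms")
                            || e.name().equals("cleanup.policy")
                            || e.name().equals("delete.retention.ms"))
                  .forEach(e -> System.out.println(e.name() + " = " + e.value()));
        }
    }
}

The kafka-configs tool with --describe --entity-type topics --entity-name test gives the same information from the command line.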
Change the retention.ms as mentioned by Ajay Srivastava above using kafka-topics --zookeeper localhost:2181 --alter --topic test --config retention.ms=60000 and test again.
I ran into a data issue today, and to solve it I have to recalculate everything from the last 3 months. But in Kafka, when I run this command:
./kafka-console-consumer.sh --bootstrap-server 10.8.95.21:9092 --topic backoffice --from-beginning
it encounters an error: The requested offset is not within the range of offsets maintained by the server.
--from-beginning is trying to get data from offsets whose data has already been purged by Kafka.
Can I list offsets along with the time they were created, so that I can estimate where to start consuming data from? Otherwise, if I can identify the oldest Kafka offset that still has data, I can start reading from that offset.
Have you tried kt (fgeller/kt)? It is a great alternative to the Kafka console tools. It is written in Go, so it is also very fast, and one other advantage is that you get the offset of each message by default.
So you can simply write something like:
kt consume -brokers <broker-name> -topic <topic-name> oldest
and the output will be something like this :
{
  "partition": 0,
  "offset": <oldest-offset>,
  "key": "<your-key>",
  "value": "<value of the message>"
}
Edit: If you want a UI for this, Kafdrop is just what you are looking for. Setting it up is pretty easy, and you can get all offset-related information quite easily. You can even view the message corresponding to an offset, which is pretty amazing.
The following command worked for me (--time -2 asks for the earliest available offsets; --time -1 would return the latest):
./bin/kafka-run-class.sh kafka.tools.GetOffsetShell --broker-list <broker-name> --topic <topic-name> --time -2
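If you prefer to do this programmatically, here is a rough sketch with the plain Java consumer (assuming partition 0 of the backoffice topic and the broker from the question; the group id is illustrative). beginningOffsets() returns the oldest offset that still has data, and offsetsForTimes() maps a timestamp, e.g. roughly 3 months ago, to the first offset with a record at or after it:

import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
import org.apache.kafka.common.TopicPartition;

public class FindStartOffset {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "10.8.95.21:9092");  // broker from the question
        props.put("group.id", "offset-inspector");           // hypothetical group id
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");

        TopicPartition tp = new TopicPartition("backoffice", 0);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Oldest offset that still has data (same as GetOffsetShell --time -2)
            Map<TopicPartition, Long> beginning =
                    consumer.beginningOffsets(Collections.singletonList(tp));
            System.out.println("earliest available offset: " + beginning.get(tp));

            // First offset whose record timestamp is >= "about 3 months ago"
            long threeMonthsAgo = System.currentTimeMillis() - 90L * 24 * 60 * 60 * 1000;
            Map<TopicPartition, OffsetAndTimestamp> byTime =
                    consumer.offsetsForTimes(Collections.singletonMap(tp, threeMonthsAgo));
            OffsetAndTimestamp oat = byTime.get(tp);
            if (oat != null) {
                System.out.println("offset for ~3 months ago: " + oat.offset()
                                   + " (record timestamp " + oat.timestamp() + ")");
            }
        }
    }
}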
I was just following the quick start guide for Kafka and I decided to test offsets a little bit.
The only modification I did to the default configuration was adding:
log.retention.minutes=5
My test topic was created as simply as possible, as suggested in the quick start guide (1 partition, replication factor 1):
$ bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
I've produced some messages, m1 and m2 (adding date before and after):
$ date
viernes, 21 de julio de 2017, 12:16:06 CEST
$ bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
>m1
>m2
>^C
$ date
viernes, 21 de julio de 2017, 12:16:25 CEST
The thing is I'm able to consume them from the beginning, but I'm not able to consume them given an offset (for instance, offset 0, which I understand points to the first message):
$ date
viernes, 21 de julio de 2017, 12:16:29 CEST
$ bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --offset 0 --partition 0
^CProcessed a total of 0 messages
$ date
viernes, 21 de julio de 2017, 12:17:25 CEST
$ bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning
m1
m2
^CProcessed a total of 2 messages
$ date
viernes, 21 de julio de 2017, 12:17:50 CEST
Most probably I have not properly understood this statement from the documentation:
In fact, the only metadata retained on a per-consumer basis is the offset or position of that consumer in the log. This offset is controlled by the consumer: normally a consumer will advance its offset linearly as it reads records, but, in fact, since the position is controlled by the consumer it can consume records in any order it likes. For example a consumer can reset to an older offset to reprocess data from the past or skip ahead to the most recent record and start consuming from "now".
Moreover, I've seen that if I produce a third message (m3) after running the consumer as described above (i.e. pointing to offset 0), this third message is read:
$ bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --offset 0 --partition 0
m3
Could anybody explain this behavior, please? Thanks!
Alright, after a lot of comments and a bit of code searching I think that this is what is happening:
When you configured your retention period to 5 minutes, you caused Kafka to delete a few of your old messages, most notably the one with offset 0. So at some point in time the smallest offset in partition 0 became, let's say, 4.
When you start a console consumer with --from-beginning, it internally calls a method that initializes the beginning offset to the smallest offset that can be found in the partition - 4 in this case. With that offset the consumer starts polling and receives that message and all subsequent ones, which is all messages for the partition.
If you start a consumer with --offset 0, that piece of code is bypassed and the consumer polls with an offset of 0; the broker responds to that with an OFFSET_OUT_OF_RANGE error. The consumer, upon receiving that error, resets the offset for the partition in question, and for this it uses the parameter auto.offset.reset, which in theory can be earliest or latest.
However, due to the way the ConsoleConsumer is written, the only way to have this parameter set to earliest is to pass the command line parameter --from-beginning, which cannot be combined with --offset; so effectively the only possible value auto.offset.reset can have here is latest.
So what happens when you poll with an offset of 0 that no longer exists is an unsuccessful poll for data, and after that the same behavior as if you hadn't passed any parameter at all.
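To see the mechanism described above outside the console consumer, here is a rough sketch with the plain Java client (assuming a 2.0+ client, a local broker and the test topic; the group id is illustrative). With auto.offset.reset=earliest, a seek to an offset that has been deleted falls back to the earliest offset that still exists instead of returning nothing:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class SeekZeroDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "seek-zero-demo");         // hypothetical group id
        props.put("auto.offset.reset", "earliest");      // what --from-beginning effectively gives you
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");

        TopicPartition tp = new TopicPartition("test", 0);

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.assign(Collections.singletonList(tp));
            consumer.seek(tp, 0L);  // offset 0 may already have been deleted by retention

            // If offset 0 is out of range, the broker returns OFFSET_OUT_OF_RANGE and the
            // consumer resets to the earliest *available* offset because of auto.offset.reset.
            // A single poll is enough for a demo; real code would poll in a loop.
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
            }
        }
    }
}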
Hope that helps and makes sense.
Update:
As of Kafka version 1.0 this behavior has been changed by KAFKA-5629 and should now behave a bit more in line with expectations.
You can try --offset earliest: besides a numeric offset, the console consumer's --offset option also accepts earliest and latest.