I'm new to Kafka and trying out a few small use cases for my new application. The use case is, basically:
Kafka producer -> Kafka consumer -> Flume Kafka source -> Flume HDFS sink.
When consuming (step 2), the sequence of steps is:
1. consumer.poll(1.0)
1.a. Produce to multiple topics (multiple Flume agents are listening)
1.b. producer.poll()
2. flush() every 25 msgs
3. commit() for every msg (asynchCommit=false)
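To make the sequence concrete, here is a rough sketch in terms of the Java client API (topic names and thresholds are just placeholders):

ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1)); // step 1
int count = 0;
for (ConsumerRecord<String, String> record : records) {
    // step 1.a: forward the record to the topics the Flume agents listen on
    producer.send(new ProducerRecord<>("flume-topic-a", record.value()));
    producer.send(new ProducerRecord<>("flume-topic-b", record.value()));
    // step 1.b: producer.poll() has no direct equivalent in the Java producer API
    count++;
    if (count % 25 == 0) {
        producer.flush();        // step 2: flush every 25 messages
    }
    consumer.commitSync();       // step 3: synchronous commit for every message
}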
Question 1: Is this sequence of actions right?
Question 2: Will this cause any data loss, given that the flush is every 25 msgs and the commit is for every msg?
Question 3: What is the difference between poll() for the producer and poll() for the consumer?
Question 4: What happens when messages are committed but not flushed?
I would really appreciate it if someone could help me understand poll, flush and commit, with offset examples for the producer and consumer.
Thanks in advance!
Let us first understand Kafka in short.
What is a Kafka producer?
t.turner@devs:~/developers/softwares/kafka_2.12-2.2.0$ bin/kafka-console-producer.sh --broker-list 100.102.1.40:9092,100.102.1.41:9092 --topic company_wallet_db_v3-V3_0_0-transactions
>{"created_at":1563415200000,"payload":{"action":"insert","entity":{"amount":40.0,"channel":"INTERNAL","cost_rate":1.0,"created_at":"2019-07-18T02:00:00Z","currency_id":1,"direction":"debit","effective_rate":1.0,"explanation":"Voucher expired","exchange_rate":null,"id":1563415200,"instrument":null,"instrument_id":null,"latitude":null,"longitude":null,"other_party":null,"primary_account_id":2,"receiver_phone":null,"secondary_account_id":362,"sequence":1,"settlement_id":null,"status":"success","type":"voucher_expiration","updated_at":"2019-07-18T02:00:00Z","primary_account_previous_balance":0.0,"secondary_account_previous_balance":0.0}},"track_id":"a011ad33-2cdd-48a5-9597-5c27c8193033"}
[2019-07-21 11:53:37,907] WARN [Producer clientId=console-producer] Error while fetching metadata with correlation id 7 : {company_wallet_db_v3-V3_0_0-transactions=LEADER_NOT_AVAILABLE} (org.apache.kafka.clients.NetworkClient)
You can ignore the warning. It appears because Kafka could not find the topic, so it auto-creates the topic.
Let us see how Kafka has stored this message:
The producer's write creates a directory on the broker server under /kafka-logs (for Apache Kafka) or /kafka-cf-data (for the Confluent version):
drwxr-xr-x 2 root root 4096 Jul 21 08:53 company_wallet_db_v3-V3_0_0-transactions-0
cd into this directory and then list the files. You will see the .log file that stores the actual data:
-rw-r--r-- 1 root root 10485756 Jul 21 08:53 00000000000000000000.timeindex
-rw-r--r-- 1 root root 10485760 Jul 21 08:53 00000000000000000000.index
-rw-r--r-- 1 root root 8 Jul 21 08:53 leader-epoch-checkpoint
drwxr-xr-x 2 root root 4096 Jul 21 08:53 .
-rw-r--r-- 1 root root 762 Jul 21 08:53 00000000000000000000.log
If you open the log file, you will see:
^#^#^#^#^#^#^#^#^#^#^Bî^#^#^#^#^B<96>T<88>ò^#^#^#^#^#^#^#^#^Al^S<85><98>k^#^#^Al^S<85><98>kÿÿÿÿÿÿÿÿÿÿÿÿÿÿ^#^#^#^Aö
^#^#^#^Aè
{"created_at":1563415200000,"payload":{"action":"insert","entity":{"amount":40.0,"channel":"INTERNAL","cost_rate":1.0,"created_at":"2019-07-18T02:00:00Z","currency_id":1,"direction":"debit","effective_rate":1.0,"explanation":"Voucher,"exchange_rate":null,expired","id":1563415200,"instrument":null,"instrument_id":null,"latitude":null,"longitude":null,"other_party":null,"primary_account_id":2,"receiver_phone":null,"secondary_account_id":362,"sequence":1,"settlement_id":null,"status":"success","type":"voucher_expiration","updated_at":"2019-07-18T02:00:00Z","primary_account_previous_balance":0.0,"secondary_account_previous_balance":0.0}},"track_id":"a011ad33-2cdd-48a5-9597-5c27c8193033"}^#
Let us understand how the consumer polls and reads records.
What is Kafka poll?
Kafka maintains a numerical offset for each record in a partition. This offset acts as a unique identifier of a record within that partition, and also denotes the position of the consumer in the partition. For example, a consumer which is at position 5 has consumed records with offsets 0 through 4 and will next receive the record with offset 5. There are actually two notions of position relevant to the user of the consumer: the position of the consumer gives the offset of the next record that will be given out. It will be one larger than the highest offset the consumer has seen in that partition. It automatically advances every time the consumer receives messages in a call to poll(long).
So, poll() takes a duration as input, reads the 00000000000000000000.log file for up to that duration, and returns the records to the consumer.
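To make this concrete, here is a minimal Java consumer sketch (the broker address is taken from the producer example above; the group id is illustrative):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class PollExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "100.102.1.40:9092");   // broker from the example above
        props.put("group.id", "example-group");                 // illustrative group id
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("enable.auto.commit", "false");               // commit manually (see the commit section below)

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("company_wallet_db_v3-V3_0_0-transactions"));
            while (true) {
                // poll() blocks for at most the given duration and returns the records fetched
                // from the partition's log segment; the consumer's position advances past them
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                }
            }
        }
    }
}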
When are messages removed?
Kafka takes care of removing messages for you.
There are 2 ways:
Time-based : Default is 7 days. Can be altered using
log.retention.ms=1680000
Size-based : Can be set like
log.retention.bytes=10487500
Now let us look at the consumer:
t.turner@devs:~/developers/softwares/kafka_2.12-2.2.0$ bin/kafka-console-consumer.sh --bootstrap-server 100.102.1.40:9092 --topic company_wallet_db_v3-V3_0_0-transactions --from-beginning
{"created_at":1563415200000,"payload":{"action":"insert","entity":{"amount":40.0,"channel":"INTERNAL","cost_rate":1.0,"created_at":"2019-07-18T02:00:00Z","currency_id":1,"direction":"debit","effective_rate":1.0,"explanation":"Voucher expired","exchange_rate":null,"id":1563415200,"instrument":null,"instrument_id":null,"latitude":null,"longitude":null,"other_party":null,"primary_account_id":2,"receiver_phone":null,"secondary_account_id":362,"sequence":1,"settlement_id":null,"status":"success","type":"voucher_expiration","updated_at":"2019-07-18T02:00:00Z","primary_account_previous_balance":0.0,"secondary_account_previous_balance":0.0}},"track_id":"a011ad33-2cdd-48a5-9597-5c27c8193033"}
^CProcessed a total of 1 messages
The above command instructs the consumer to read from offset = 0. Kafka assigns this console consumer a group_id and maintains the last offset that this group_id has read, so that it can push newer messages to this consumer group.
What is a Kafka commit?
A commit is a way to tell Kafka which messages the consumer has successfully processed. It can be thought of as updating a lookup of group-id : current_offset + 1.
You can manage this using the commitAsync() or commitSync() methods of the consumer object.
Reference: https://kafka.apache.org/10/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html
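For illustration, a minimal fragment (continuing a poll loop like the one sketched earlier; process() is a hypothetical processing step) showing both commit styles:

ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
for (ConsumerRecord<String, String> record : records) {
    process(record);   // hypothetical processing of the record
}
// Synchronous commit: blocks until the broker acknowledges the committed offsets
consumer.commitSync();
// Asynchronous commit: returns immediately; the outcome is delivered to the callback
consumer.commitAsync((offsets, exception) -> {
    if (exception != null) {
        System.err.println("Commit failed for " + offsets + ": " + exception);
    }
});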
When I describe one of my topics I get this status:
➜ local-kafka_2.12-2.0.0 bin/kafka-consumer-groups.sh --bootstrap-server myip:1025 --group mygroup --describe
Consumer group 'mygroup' has no active members.
TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID HOST CLIENT-ID
mytopic 0 858 858 0 - - -
When I try to reset it to the earliest, I get this status:
➜ local-kafka_2.12-2.0.0 bin/kafka-consumer-groups.sh --bootstrap-server myip:1025 --group mygroup --topic mytopic --reset-offsets --to-earliest --execute
TOPIC PARTITION NEW-OFFSET
mytopic 0 494
I would have expected the new offset to be at 0 rather than 494.
Questions
1 - In the describe output the current offset is shown as 858, but resetting to earliest shows 494, so there would be a lag of 364. My question is: what happened to the other 494 (858 - 364) offsets? Are they gone because of some configuration setting on this topic? My retention.ms is set to 1 week.
2 - If those 494 records are gone, is there a way to recover them somehow?
If you have access to the data directory of your Kafka cluster, you can see the data that is present in there using the command kafka-run-class.bat kafka.tools.DumpLogSegments.
For more information see e.g. here: https://medium.com/@durgaswaroop/a-practical-introduction-to-kafka-storage-internals-d5b544f6925f
Your data might have been deleted either due to the log retention time or due to the size limitation of the logs (the configuration property log.retention.bytes).
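For example (a sketch; substitute the path to one of your own segment files):

bin/kafka-run-class.sh kafka.tools.DumpLogSegments --deep-iteration --print-data-log --files /kafka-logs/mytopic-0/00000000000000000000.log

(On Windows, use bin\windows\kafka-run-class.bat with the same arguments.)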
I am using kafka_2.12-1.1.0 on Windows. Below is my retention configuration:
log.retention.ms=120000
log.segment.bytes=2800
log.retention.check.interval.ms=30000
I see the logs get rolled over, generating *0000.log, then *0069.log, then *00103.log, and so on, and I expected the old files to be deleted. But when the retention logic kicks in, I get the error below ("The process cannot access the file because it is being used by another process"), and the Kafka server crashes and does not recover from there. This happens when it is trying to delete 00000000000000000000.log. Can you let me know the commands to check whether these are active segments, and why Kafka is trying to delete this file and crashing?
Please find the log output below.
[ProducerStateManager partition=myLocalTopic-0] Writing producer snapshot at offset 39
[Log partition=myLocalTopic-0, dir=D:\tmp\kafka-logs] Rolled new log segment at offset 39 in 39 ms.
[ProducerStateManager partition=myLocalTopic-0] Writing producer snapshot at offset 69
[Log partition=myLocalTopic-0, dir=D:\tmp\kafka-logs] Rolled new log segment at offset 69 in 52 ms.
[ProducerStateManager partition=myLocalTopic-0] Writing producer snapshot at offset 91
[Log partition=myLocalTopic-0, dir=D:\tmp\kafka-logs] Rolled new log segment at offset 91 in 18 ms.
[Log partition=myLocalTopic-0, dir=D:\tmp\kafka-logs] Found deletable segments with base offsets [0,69] due to retention time 120000ms breach (kafka.log.Log)
[ProducerStateManager partition=myLocalTopic-0] Writing producer snapshot at offset 103 (kafka.log.ProducerStateManager)
[Log partition=myLocalTopic-0, dir=D:\tmp\kafka-logs] Rolled new log segment at offset 103 in 25 ms. (kafka.log.Log)
[Log partition=myLocalTopic-0, dir=D:\tmp\kafka-logs] Scheduling log segment [baseOffset 0, size 2746] for deletion. (kafka.log.Log)
ERROR Error while deleting segments for myLocalTopic-0 in dir D:\tmp\kafka-logs (kafka.server.LogDirFailureChannel)
java.nio.file.FileSystemException: D:\tmp\kafka-logs\myLocalTopic-0\00000000000000000000.log -> D:\tmp\kafka-logs\myLocalTopic-0\00000000000000000000.log.deleted: The process cannot access the file because it is being used by another process.
............. at sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:98)
.......................................................
ERROR Uncaught exception in scheduled task 'kafka-log-retention' (kafka.utils.KafkaScheduler)
org.apache.kafka.common.errors.KafkaStorageException: Error while deleting segments for myLocalTopic-0 in dir D:\tmp\kafka-logs
Caused by: java.nio.file.FileSystemException: D:\tmp\kafka-logs\myLocalTopic-0\00000000000000000000.log -> D:\tmp\kafka-logs\myLocalTopic-0\00000000000000000000.log.deleted: The process cannot access the file because it is being used by another process.
Apache Kafka provides us with the following retention policies:
Time-based retention
Size-based retention
For time-based retention, try the following commands.
Command to set retention time
./bin/kafka-topics.sh --zookeeper localhost:2181 --alter --topic my-topic --config retention.ms=1680000
Command to remove Topic level retention time
./bin/kafka-topics.sh --zookeeper localhost:2181 --alter --topic my-topic --delete-config retention.ms
For size-based retention, try the following commands.
To set retention size:
./bin/kafka-topics.sh --zookeeper localhost:2181 --alter --topic my-topic --config retention.bytes=104857600
To remove,
./bin/kafka-topics.sh --zookeeper localhost:2181 --alter --topic my-topic --delete-config retention.bytes
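Note that on newer Kafka versions, setting topic configs through kafka-topics.sh --alter is deprecated; the same topic-level overrides can be managed with kafka-configs.sh, for example:

./bin/kafka-configs.sh --zookeeper localhost:2181 --entity-type topics --entity-name my-topic --alter --add-config retention.ms=1680000
./bin/kafka-configs.sh --zookeeper localhost:2181 --entity-type topics --entity-name my-topic --alter --delete-config retention.ms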
I am trying to test the broker configuration offsets.retention.minutes=30. I have changed this config to 10 mins instead of the default of 24 hours.
However, after more than 10 mins have passed, the consumer group still shows offset information:
ldnpsr000001131$ bin/kafka-consumer-groups.sh --zookeeper localhost:2181 --describe -group rent_test
GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG OWNER
rent_test rent_test 0 44 44 0 none
Any idea why it is not getting deleted?
offsets.retention.minutes controls the retention window, in minutes, for the offsets topic (__consumer_offsets) into which new consumers store their offsets. In your case you are using the old consumer, whose offsets are stored in ZooKeeper, so setting offsets.retention.minutes has no effect on a ZK-based consumer group.
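For comparison, if the group stored its offsets in __consumer_offsets (i.e. a new consumer), you would describe it through the brokers rather than ZooKeeper, for example:

bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group rent_test

and the offsets.retention.minutes setting would then apply.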
How do you know when was a topic created in Kafka?
It seems that a few of the topics were created with a wrong number of partitions. Is there a way to know the date a topic was created? Suppose a topic with the name "test" was created with n partitions. How can I find the date and time when this "test" topic was created on Kafka?
You can see the Kafka topic creation time (ctime) and last modified time (mtime) in the ZooKeeper stat output.
First log in to the ZooKeeper shell and run the stat command on the topic's znode:
kafka % bin/zookeeper-shell.sh localhost:2181 stat /brokers/topics/test-events
It will return the details below:
Connecting to localhost:2181
WATCHER::
WatchedEvent state:SyncConnected type:None path:null
cZxid = 0x1007ac74c
ctime = Thu Nov 01 10:38:39 UTC 2018
mZxid = 0x4000f6e26
mtime = Mon Jan 07 05:22:25 UTC 2019
pZxid = 0x1007ac74d
cversion = 1
dataVersion = 8
aclVersion = 0
ephemeralOwner = 0x0
dataLength = 112
numChildren = 1
You can refer to this to understand the attributes: https://zookeeper.apache.org/doc/current/zookeeperProgrammers.html#sc_zkStatStructure
You can tell the topic creation time by checking the creation time of the topic's ZooKeeper node. Given that "zookeeper001:2181/foo" is the Kafka ZooKeeper connection string and "test_topic" is the topic name, you can check the stat of the znode to get the topic creation time:
/foo/brokers/topics/test_topic
I don't think there is a way to check the number of partitions a topic had at creation time, but you can always increase the number of partitions of a topic by using:
kafka-topics.sh --alter ...
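For example (the partition count here is illustrative; note that partitions can only be increased, never decreased):

kafka-topics.sh --zookeeper localhost:2181 --alter --topic test --partitions 6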