usually after kafka cluster scratch installation I saw this files under /data/kafka-logs ( kafka broker logs. where all topics should be located )
ls -ltr
-rw-r--r-- 1 kafka hadoop 0 Jan 9 10:07 cleaner-offset-checkpoint
-rw-r--r-- 1 kafka hadoop 57 Jan 9 10:07 meta.properties
drwxr-xr-x 2 kafka hadoop 4096 Jan 9 10:51 _schemas-0
-rw-r--r-- 1 kafka hadoop 17 Jan 10 07:39 recovery-point-offset-checkpoint
-rw-r--r-- 1 kafka hadoop 17 Jan 10 07:39 replication-offset-checkpoint
but on some other Kafka scratch installation we saw the folder - /data/kafka-logs is empty
is this indicate on problem ?
note - we still not create the topics
I'm not sure when each checkpoint file is created (though, they track log cleaner and replication offsets), but I assume that the meta properties is created at broker startup.
Otherwise, you would see one folder per Topic-partition, for example, looks like you had one topic created, _schemas.
If you only see one partition folder out of multiple brokers, then your replication factor for that topic is set to 1
I am getting the below on trying to produce to(/consume from) a kafka topic :
[2019-12-12 17:33:53,049] WARN [Producer clientId=console-producer] 1 partitions have leader brokers without a matching listener, including [wallet-V4_1_2-transactions-0] (org.apache.kafka.clients.NetworkClient)
On further investigation, I could see that the In-sync replicas is now 0.
Unfortunately, the partition and replication-factor was kept to 1 due to resource-constraints.
bin/kafka-topics --bootstrap-server 192.158.30.74:9092 --topic wallet-V4_1_2-transactions --describe
Topic:wallet-V4_1_2-transactions PartitionCount:1 ReplicationFactor:1 Configs:
Topic: wallet-V4_1_2-transactions Partition: 0 Leader: none Replicas: 2 Isr:
In the segment directory, I can see there is an additional file for leader-epoch-checkpoint
root#8436f52ec04c:/var/lib/kafka/data/wallet-V4_1_2-transactions-0# ls -lart
total 656
-rwxrwxrwx 1 nobody nogroup 10485756 Dec 4 06:50 00000000000000003235.timeindex
-rwxrwxrwx 1 nobody nogroup 10 Dec 4 06:50 00000000000000003235.snapshot
-rwxrwxrwx 1 nobody nogroup 18 Dec 10 04:49 leader-epoch-checkpoint
drwxrwxrwx 2 nobody nogroup 4096 Dec 10 04:49 .
-rwxrwxrwx 1 nobody nogroup 636034 Dec 10 11:55 00000000000000003235.log
-rwxrwxrwx 1 nobody nogroup 10485760 Dec 10 11:55 00000000000000003235.index
drwxrwxrwx 106 nobody nogroup 12288 Dec 15 06:22 ..
As mentioned here, the consumer/producer should not impacted by this.
Other topics have recovered from the Leader failure and I am able to query them.
However, this topic still gives the error:
[2019-12-15 09:54:12,303] WARN [Consumer clientId=consumer-1, groupId=console-consumer-83858] 1 partitions have leader brokers without a matching listener, including [wallet-V4_1_2-transactions-0] (org.apache.kafka.clients.NetworkClient)
How can I recover the already published messages from kafka and preserve this topic
I’m new to Kafka and trying out few small usecase for my new application. The use case is basically,
Kafka-producer —> Kafka-Consumer—> flume-Kafka source—>flume-hdfs-sink.
When Consuming(step2), below is the sequence of steps..
1. consumer.Poll(1.0)
1.a. Produce to multiple topics (multiple flume agents are listening)
1.b. Produce. Poll()
2. Flush() every 25 msgs
3. Commit() every msgs (asynchCommit=false)
Question 1: Is this sequence of action right!?!
Question2: Will this cause any data loss as the flush is every 25 msgs and commit is for every msg?!?
Question3 :Difference between poll() for producer and poll ()consumer?
Question4 :What happens when messages are committed but not flushed!?!
I will really appreciate if someone can help me understand with offset examples between producer/consumer for poll,flush and commit.
Thanks in advance!!
Let us first understand Kafka in short:
what is kafka producer:
t.turner#devs:~/developers/softwares/kafka_2.12-2.2.0$ bin/kafka-console-producer.sh --broker-list 100.102.1.40:9092,100.102.1.41:9092 --topic company_wallet_db_v3-V3_0_0-transactions
>{"created_at":1563415200000,"payload":{"action":"insert","entity":{"amount":40.0,"channel":"INTERNAL","cost_rate":1.0,"created_at":"2019-07-18T02:00:00Z","currency_id":1,"direction":"debit","effective_rate":1.0,"explanation":"Voucher,"exchange_rate":null,expired","id":1563415200,"instrument":null,"instrument_id":null,"latitude":null,"longitude":null,"other_party":null,"primary_account_id":2,"receiver_phone":null,"secondary_account_id":362,"sequence":1,"settlement_id":null,"status":"success","type":"voucher_expiration","updated_at":"2019-07-18T02:00:00Z","primary_account_previous_balance":0.0,"secondary_account_previous_balance":0.0}},"track_id":"a011ad33-2cdd-48a5-9597-5c27c8193033"}
[2019-07-21 11:53:37,907] WARN [Producer clientId=console-producer] Error while fetching metadata with correlation id 7 : {company_wallet_db_v3-V3_0_0-transactions=LEADER_NOT_AVAILABLE} (org.apache.kafka.clients.NetworkClient)
You can ignore the warning. It appears as Kafka could not find the topic and auto-creates the topic.
Let us see how kafka has stored this message:
The producer creates a directory in the broker server at /kafka-logs (for apache kafka) or /kafka-cf-data (for the confluent version)
drwxr-xr-x 2 root root 4096 Jul 21 08:53 company_wallet_db_v3-V3_0_0-transactions-0
cd into this directory and then list the files. You will see the .log file that stores the actual data:
-rw-r--r-- 1 root root 10485756 Jul 21 08:53 00000000000000000000.timeindex
-rw-r--r-- 1 root root 10485760 Jul 21 08:53 00000000000000000000.index
-rw-r--r-- 1 root root 8 Jul 21 08:53 leader-epoch-checkpoint
drwxr-xr-x 2 root root 4096 Jul 21 08:53 .
-rw-r--r-- 1 root root 762 Jul 21 08:53 00000000000000000000.log
If you open the log file, you will see:
^#^#^#^#^#^#^#^#^#^#^Bî^#^#^#^#^B<96>T<88>ò^#^#^#^#^#^#^#^#^Al^S<85><98>k^#^#^Al^S<85><98>kÿÿÿÿÿÿÿÿÿÿÿÿÿÿ^#^#^#^Aö
^#^#^#^Aè
{"created_at":1563415200000,"payload":{"action":"insert","entity":{"amount":40.0,"channel":"INTERNAL","cost_rate":1.0,"created_at":"2019-07-18T02:00:00Z","currency_id":1,"direction":"debit","effective_rate":1.0,"explanation":"Voucher,"exchange_rate":null,expired","id":1563415200,"instrument":null,"instrument_id":null,"latitude":null,"longitude":null,"other_party":null,"primary_account_id":2,"receiver_phone":null,"secondary_account_id":362,"sequence":1,"settlement_id":null,"status":"success","type":"voucher_expiration","updated_at":"2019-07-18T02:00:00Z","primary_account_previous_balance":0.0,"secondary_account_previous_balance":0.0}},"track_id":"a011ad33-2cdd-48a5-9597-5c27c8193033"}^#
Let us understand how the consumer would poll and read records :
What is Kafka Poll :
Kafka maintains a numerical offset for each record in a partition.
This offset acts as a unique identifier of a record within that
partition, and also denotes the position of the consumer in the
partition. For example, a consumer which is at position 5 has consumed
records with offsets 0 through 4 and will next receive the record with
offset 5. There are actually two notions of position relevant to the
user of the consumer: The position of the consumer gives the offset of
the next record that will be given out. It will be one larger than the
highest offset the consumer has seen in that partition. It
automatically advances every time the consumer receives messages in a
call to poll(long).
So, poll takes a duration as input, reads the 00000000000000000000.log file for that duration, and returns them to the consumer.
When are messages removed :
Kafka takes care of the flushing of messages.
There are 2 ways:
Time-based : Default is 7 days. Can be altered using
log.retention.ms=1680000
Size-based : Can be set like
log.retention.bytes=10487500
Now let us look at the consumer:
t.turner#devs:~/developers/softwares/kafka_2.12-2.2.0$ bin/kafka-console-consumer.sh --bootstrap-server 100.102.1.40:9092 --topic company_wallet_db_v3-V3_0_0-transactions --from-beginning
{"created_at":1563415200000,"payload":{"action":"insert","entity":{"amount":40.0,"channel":"INTERNAL","cost_rate":1.0,"created_at":"2019-07-18T02:00:00Z","currency_id":1,"direction":"debit","effective_rate":1.0,"explanation":"Voucher,"exchange_rate":null,expired","id":1563415200,"instrument":null,"instrument_id":null,"latitude":null,"longitude":null,"other_party":null,"primary_account_id":2,"receiver_phone":null,"secondary_account_id":362,"sequence":1,"settlement_id":null,"status":"success","type":"voucher_expiration","updated_at":"2019-07-18T02:00:00Z","primary_account_previous_balance":0.0,"secondary_account_previous_balance":0.0}},"track_id":"a011ad33-2cdd-48a5-9597-5c27c8193033"}
^CProcessed a total of 1 messages
The above command instructs the consumer to read from offset = 0. Kafka assigns this console consumer a group_id and maintains the last offset that this group_id has read. So, it can push newer messages to this consumer-group
What is Kafka Commit:
Commit is a way to tell kafka the messages the consumer has successfully processed. This can be thought as updating the lookup between group-id : current_offset + 1.
You can manage this using the commitAsync() or commitSync() methods of the consumer object.
Reference: https://kafka.apache.org/10/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html
I have a Kafka topic called retention and Below are the server configuration related to retention:
log.retention.hours=168
log.segment.bytes=1073741824
log.retention.check.interval.ms=3600000 (~ 1 hour)
log.cleaner.enable=true
And below is the topic specific config:
retention.ms=2592000000,retention.bytes=3298534883328
where retention.ms ~ 30d and retention.bytes = ~ 3.29 TB
I configured the retention.ms and retention.bytes using the below command recently (on 14th Jan 2019) using below commands:
./bin/kafka-topics.sh --zookeeper localhost:2181 --alter --topic retentions --config retention.bytes=219902325555
Here the configuration for the retntion.bytes seems to be working while retention.ms does not seem to be working. Here is the evidence that I could collect:
cd log_dir/retentions-0/
ls -lrt 00000000000000000000.*
-rw-r--r-- 1 root root 294387381 Nov 26 22:37 00000000000000000000.log
-rw-r--r-- 1 root root 3912 Jan 14 18:06 00000000000000000000.index
-rw-r--r-- 1 root root 5868 Jan 14 18:06 00000000000000000000.timeindex
If we look into the logs of older segments these are nearly 2 months old.
Can anybody tell which of these two configurations will take effect on priority Or, both can work whichever crosses the configured threshold.
In my assumptions, both configurations should work in conjunction. Plz, let me know if this is not the case.
Both work in conjunction.
From Kafka: The Definitive Guide book
If you have specified a value for both log.retention.bytes and log.retention.ms ... messages may be removed when either criteria is met.
I am trying to understand the kafka data logs. I can see the logs under the dir set in logs.dir as "Topicname_partitionnumber". However I would like to know what are the different logs captured under it. Below is the screenshot for a sample log.
In Kafka logs, each partition has a log.dir directory. Each partition is split into segments.
A segment is just a collection of messages. Instead of writing all messages into a single file, Kafka splits them into chunks of segments.
Whenever Kafka writes to a partition, it writes to an active segment. Each segment has defined size limit. When the segment size limit is reached, it closes the segment and opens a new one that becomes active. One partition can have one or more segment based on the configuration.
Each segment contains three files - segment.log,segment.index and segment.timeindex
There are three types of file for each Kafka topic partition:
-rw-r--r-- 1 kafka hadoop 10485760 Dec 3 23:57 00000000000000000000.index
-rw-r--r-- 1 kafka hadoop 148814230 Oct 11 06:50 00000000000000000000.log
-rw-r--r-- 1 kafka hadoop 10485756 Dec 3 23:57 00000000000000000000.timeindex
The 00000000000000000000 in front of log and index files is the name of the segments. It represents the offset of the first record written in that segment. If there are 2 segments i.e. Segment 1 containing message offset 0,1 and Segment 2 containing message offset 2 and 3.
-rw-r--r-- 1 kafka hadoop 10485760 Dec 3 23:57 00000000000000000000.index
-rw-r--r-- 1 kafka hadoop 148814230 Oct 11 06:50 00000000000000000000.log
-rw-r--r-- 1 kafka hadoop 10485756 Dec 3 23:57 00000000000000000000.timeindex
-rw-r--r-- 1 kafka hadoop 10485760 Dec 3 23:57 00000000000000000002.index
-rw-r--r-- 1 kafka hadoop 148814230 Oct 11 06:50 00000000000000000002.log
-rw-r--r-- 1 kafka hadoop 10485756 Dec 3 23:57 00000000000000000002.timeindex
.log file stores the offset, the physical position of the message, timestamp along with the message content. While reading the messages from Kafka at a particular offset, it becomes an expensive task to find the offset in a huge log file.
That's where .index the file becomes useful. It stores the offsets and physical position of the messages in the log file.
.timeindex the file is based on the timestamp of messages.
The files without a suffix are the segment files, i.e. the files the data is actually written to, named by earliest contained message offset. The latest of those is the active segment, meaning the one that messages are currently appended to.
.index are corresponding mappings from offset to positions in the segment file. .timeindex are mappings from timestamp to offset.
Below is the screenshot for a sample log
You should add your screenshot and sample log, then we could give your expected and specific answer.
before that, only can give you some common knowledge:
eg: in my CentOS, for folder:
/root/logs/kafka/kafka.log/storybook_add-0
storybook_add: is the topic name
in code, the real topic name is storybook-add
its contains:
[root#xxx storybook_add-0]# ll
total 8
-rw-r--r-- 1 root root 10485760 Aug 28 16:44 00000000000000000023.index
-rw-r--r-- 1 root root 700 Aug 28 16:45 00000000000000000023.log
-rw-r--r-- 1 root root 10485756 Aug 28 16:44 00000000000000000023.timeindex
-rw-r--r-- 1 root root 9 Aug 28 16:44 leader-epoch-checkpoint
00000000000000000023.log: log file
store the real data = kafka message
00000000000000000023.index: index file
00000000000000000023.timeindex: timeindex file
->
00000000000000000023 called segment name
why is 23?
in 00000000000000000023.log, which stored first message's postion is 23
kafka previously has totally received 23 messages
what the message data look like?
we can see through its content:
For further basic concept and logic of kafka, recommend to read this article:
A Practical Introduction to Kafka Storage Internals