kafka + which files should be created under kafka-logs

Usually, after a Kafka cluster scratch installation, I see these files under /data/kafka-logs (the Kafka broker data log directory, where all topics should be located):
ls -ltr
-rw-r--r-- 1 kafka hadoop 0 Jan 9 10:07 cleaner-offset-checkpoint
-rw-r--r-- 1 kafka hadoop 57 Jan 9 10:07 meta.properties
drwxr-xr-x 2 kafka hadoop 4096 Jan 9 10:51 _schemas-0
-rw-r--r-- 1 kafka hadoop 17 Jan 10 07:39 recovery-point-offset-checkpoint
-rw-r--r-- 1 kafka hadoop 17 Jan 10 07:39 replication-offset-checkpoint
But on some other Kafka scratch installations we saw that the folder /data/kafka-logs is empty.
Does this indicate a problem?
Note - we have not created the topics yet.

I'm not sure exactly when each checkpoint file is created (they track log cleaner and replication offsets), but I assume that meta.properties is created at broker startup.
Other than those, you would see one folder per topic-partition; for example, it looks like you had one topic created, _schemas.
If you only see one partition folder across multiple brokers, then the replication factor for that topic is set to 1.
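As an aside on meta.properties: a minimal sketch of what it typically contains after the broker first starts (the broker.id value here is purely illustrative):
# example meta.properties written by the broker at startup; broker.id is illustrative
version=0
broker.id=1001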

Related

Kafka log4j log rotator is not rotating logs

I have this snippet in my log4j configuration file:
log4j.appender.kafkaAppender=org.apache.log4j.RollingFileAppender
log4j.appender.kafkaAppender.MaxFileSize=50MB
log4j.appender.kafkaAppender.MaxBackupIndex=4
log4j.appender.kafkaAppender.File=/var/log/kafka/server.log
log4j.appender.kafkaAppender.layout=org.apache.log4j.PatternLayout
log4j.appender.kafkaAppender.layout.ConversionPattern=[%d] %p %m (%c)%n
And in /var/log/kafka, indeed I see the server.log file. However, under my path /opt/kafka/logs, I see the following:
(server.log.2021-02-04-00 and continuing on...)
-rw-r--r-- 1 kafka kafka 942 Mar 5 08:57 server.log.2021-03-05-08
-rw-r--r-- 1 kafka kafka 942 Mar 5 09:57 server.log.2021-03-05-09
-rw-r--r-- 1 kafka kafka 2361 Mar 5 10:57 server.log.2021-03-05-10
-rw-r--r-- 1 kafka kafka 942 Mar 5 11:57 server.log.2021-03-05-11
-rw-r--r-- 1 kafka kafka 942 Mar 5 12:57 server.log.2021-03-05-12
-rw-r--r-- 1 kafka kafka 942 Mar 5 13:57 server.log.2021-03-05-13
-rw-r--r-- 1 kafka kafka 942 Mar 5 14:57 server.log.2021-03-05-14
-rw-r--r-- 1 kafka kafka 942 Mar 5 15:57 server.log.2021-03-05-15
-rw-r--r-- 1 kafka kafka 942 Mar 5 16:57 server.log.2021-03-05-16
-rw-r--r-- 1 kafka kafka 2361 Mar 5 17:57 server.log.2021-03-05-17
(all the way to server.log.2021-03-15-20)
How can I get these logs to delete properly? Why does it seem to make a log file per hour? Why would Kafka be logging to another path?
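For comparison, the log4j.properties that ships with the Apache Kafka distribution typically defines kafkaAppender as an hourly DailyRollingFileAppender writing to the broker's own logs directory (e.g. /opt/kafka/logs), roughly like the sketch below. If that file is the one actually being loaded, it would explain both the hourly files and the second path; note that DailyRollingFileAppender ignores MaxFileSize/MaxBackupIndex, so old files are never deleted.
# rough excerpt resembling the stock config/log4j.properties shipped with Apache Kafka
log4j.appender.kafkaAppender=org.apache.log4j.DailyRollingFileAppender
log4j.appender.kafkaAppender.DatePattern='.'yyyy-MM-dd-HH
log4j.appender.kafkaAppender.File=${kafka.logs.dir}/server.log
log4j.appender.kafkaAppender.layout=org.apache.log4j.PatternLayout
log4j.appender.kafkaAppender.layout.ConversionPattern=[%d] %p %m (%c)%n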

kafka + how to add additional partitions to the existing XX partitions?

This is an example of how to create a new topic named test_test with 10 partitions:
kafka-topics.sh --create --zookeeper zookeeper01:2181 --replication-factor 3 --partitions 10 --topic test_test
Created topic "test_test".
[root@kafka01 kafka-data]# \ls -ltr | grep test_test
drwxr-xr-x 2 kafka hadoop 4096 Mar 22 16:53 test_test-8
drwxr-xr-x 2 kafka hadoop 4096 Mar 22 16:53 test_test-5
drwxr-xr-x 2 kafka hadoop 4096 Mar 22 16:53 test_test-2
drwxr-xr-x 2 kafka hadoop 4096 Mar 22 16:53 test_test-0
drwxr-xr-x 2 kafka hadoop 4096 Mar 22 16:53 test_test-7
drwxr-xr-x 2 kafka hadoop 4096 Mar 22 16:53 test_test-4
drwxr-xr-x 2 kafka hadoop 4096 Mar 22 16:53 test_test-1
drwxr-xr-x 2 kafka hadoop 4096 Mar 22 16:53 test_test-9
drwxr-xr-x 2 kafka hadoop 4096 Mar 22 16:53 test_test-6
drwxr-xr-x 2 kafka hadoop 4096 Mar 22 16:53 test_test-3
Now we want to add 10 additional partitions to the topic test_test.
How do we add additional partitions to the existing 10 partitions?
You can run this command:
./bin/kafka-topics.sh --alter --bootstrap-server localhost:9092 --topic test_test --partitions 20
By the way, there are two things to consider about changing partitions:
Decreasing the number of partitions is not allowed
If you add more partitions to a topic, key-based ordering of the messages cannot be guaranteed
Note: If your Kafka version is older than 2.2, you must use the --zookeeper parameter instead of --bootstrap-server.
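A minimal sketch of the pre-2.2 form, assuming ZooKeeper is reachable at zookeeper01:2181 as in the topic-creation example above:
./bin/kafka-topics.sh --alter --zookeeper zookeeper01:2181 --topic test_test --partitions 20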
Moreover, you should take into consideration that adding partitions triggers a rebalance, which makes all of this topic's consumers unavailable for a period of time.
A rebalance is the process of re-assigning partitions to consumers. It happens when new partitions are added, a new consumer joins, or a consumer leaves (which may happen due to an exception, network problems or a deliberate exit).
In order to preserve reading consistency, during a rebalance the consumer group entirely stops receiving messages until the new partition assignment takes place.
This relatively short answer explains rebalance very well.

What are the different logs under kafka data log dir

I am trying to understand the Kafka data logs. I can see the logs under the directory set in log.dirs, in folders named "topicname-partitionnumber". However, I would like to know what the different files captured under them are. Below is the screenshot for a sample log.
In the Kafka data directory, each topic partition has its own log directory. Each partition is split into segments.
A segment is just a collection of messages. Instead of writing all messages into a single file, Kafka splits them into chunks called segments.
Whenever Kafka writes to a partition, it writes to the active segment. Each segment has a defined size limit. When the segment size limit is reached, Kafka closes the segment and opens a new one that becomes active. One partition can have one or more segments depending on the configuration.
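A hedged sketch of the broker settings that control when a new segment is rolled (the values shown are the usual Kafka defaults, not a recommendation):
# server.properties excerpt
# roll a new segment once the active one reaches ~1 GB
log.segment.bytes=1073741824
# also roll after 7 days, even if the size limit has not been reached
log.roll.hours=168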
Each segment contains three files: segment.log, segment.index and segment.timeindex.
There are three types of file for each Kafka topic partition:
-rw-r--r-- 1 kafka hadoop 10485760 Dec 3 23:57 00000000000000000000.index
-rw-r--r-- 1 kafka hadoop 148814230 Oct 11 06:50 00000000000000000000.log
-rw-r--r-- 1 kafka hadoop 10485756 Dec 3 23:57 00000000000000000000.timeindex
The 00000000000000000000 in front of the log and index files is the name of the segment. It represents the offset of the first record written in that segment. For example, if there are 2 segments, segment 1 containing message offsets 0 and 1 and segment 2 containing message offsets 2 and 3, the files look like this:
-rw-r--r-- 1 kafka hadoop 10485760 Dec 3 23:57 00000000000000000000.index
-rw-r--r-- 1 kafka hadoop 148814230 Oct 11 06:50 00000000000000000000.log
-rw-r--r-- 1 kafka hadoop 10485756 Dec 3 23:57 00000000000000000000.timeindex
-rw-r--r-- 1 kafka hadoop 10485760 Dec 3 23:57 00000000000000000002.index
-rw-r--r-- 1 kafka hadoop 148814230 Oct 11 06:50 00000000000000000002.log
-rw-r--r-- 1 kafka hadoop 10485756 Dec 3 23:57 00000000000000000002.timeindex
The .log file stores the offset, the physical position of the message, the timestamp, and the message content. When reading messages from Kafka at a particular offset, it would be an expensive task to find that offset in a huge log file.
That's where the .index file becomes useful. It stores the offsets and physical positions of the messages in the log file.
The .timeindex file provides a similar index based on the timestamps of the messages.
The .log files are the segment files, i.e. the files the data is actually written to, named by the earliest contained message offset. The latest of those is the active segment, meaning the one that messages are currently appended to.
The .index files are the corresponding mappings from offset to position in the segment file. The .timeindex files are mappings from timestamp to offset.
Below is the screenshot for a sample log
You should add your screenshot and sample log; then we could give you a specific answer for your case.
Before that, here is some common background knowledge.
For example, on my CentOS machine, for the folder:
/root/logs/kafka/kafka.log/storybook_add-0
storybook_add is the topic name
(in code, the real topic name is storybook-add)
It contains:
[root@xxx storybook_add-0]# ll
total 8
-rw-r--r-- 1 root root 10485760 Aug 28 16:44 00000000000000000023.index
-rw-r--r-- 1 root root 700 Aug 28 16:45 00000000000000000023.log
-rw-r--r-- 1 root root 10485756 Aug 28 16:44 00000000000000000023.timeindex
-rw-r--r-- 1 root root 9 Aug 28 16:44 leader-epoch-checkpoint
00000000000000000023.log: log file - stores the real data, i.e. the Kafka messages
00000000000000000023.index: index file
00000000000000000023.timeindex: timeindex file
00000000000000000023 is called the segment name.
Why 23? Because the first message stored in 00000000000000000023.log has offset 23; Kafka had already received 23 messages in this partition before this segment was created.
What does the message data look like?
We can see it by inspecting the segment's content, for example with the dump-log tool sketched below:
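A minimal sketch, assuming a Kafka 2.0+ installation where kafka-dump-log.sh is available (older versions expose the same functionality via kafka-run-class.sh kafka.tools.DumpLogSegments with the same options):
# decode a segment file and print offsets, timestamps and message payloads
bin/kafka-dump-log.sh --files /root/logs/kafka/kafka.log/storybook_add-0/00000000000000000023.log --print-data-log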
For further basic concepts and the internal logic of Kafka, I recommend reading this article:
A Practical Introduction to Kafka Storage Internals

Why is Kafka not deleting data?

I have a two-node Kafka cluster with 48 GB of disk allotted to each node.
The server.properties is set to retain logs for up to 48 hours or log segments up to 1 GB. Here it is:
log.retention.hours=48
log.retention.bytes=1073741824
log.segment.bytes=1073741824
I have 30 partitions for a topic. Here are the disk usage stats for one of these partitions:
-rw-r--r-- 1 root root 1.9M Apr 14 00:06 00000000000000000000.index
-rw-r--r-- 1 root root 1.0G Apr 14 00:06 00000000000000000000.log
-rw-r--r-- 1 root root 0 Apr 14 00:06 00000000000000000000.timeindex
-rw-r--r-- 1 root root 10M Apr 14 12:43 00000000000001486744.index
-rw-r--r-- 1 root root 73M Apr 14 12:43 00000000000001486744.log
-rw-r--r-- 1 root root 10M Apr 14 00:06 00000000000001486744.timeindex
As you can clearly see, we have a log segment of 1 GB. But as per my understanding, it should already have been deleted. Also, it's been more than 48 hours since these logs were rolled by Kafka. Thoughts?
In your case, you set log.retention.bytes and log.segment.bytes to the same value, which means there is never a candidate segment eligible for deletion, so no deletion happens.
The algorithm is:
First, calculate the difference between the total log size and log.retention.bytes. In your case, the difference is 73 MB (73 MB + 1 GB - 1 GB).
Iterate over all the non-active log segments and compare each segment's size with the diff.
If diff > log segment size, mark this segment deletable and decrement the diff by that size.
Otherwise, mark this segment undeletable and try the next log segment.
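As a quick sanity check of the size-based rule, you can compare a partition directory's on-disk size with log.retention.bytes; the path below is purely illustrative:
# total bytes used by one partition directory (example path)
du -sb /var/lib/kafka/data/mytopic-0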
Answering my own question:
Let's say that log.retention.hours is set to 24 hours and log.retention.bytes and log.segment.bytes are both set to 1 GB. When the log reaches 1 GB (call this the Old Log), a new log segment is created (call this the New Log). The Old Log is then deleted 24 hours after the New Log is created.
In my case, the New Log was created about 25 hours before I posted this question. When I dynamically changed the retention.ms value for the topic (topic-level configuration is maintained in ZooKeeper, not by the Kafka brokers themselves, so this does not require a Kafka restart) to 24 hours, my old logs were immediately deleted.
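A hedged sketch of how such a dynamic topic-level override can be applied (the topic name and broker address are illustrative; older versions take --zookeeper instead of --bootstrap-server):
# lower retention for one topic to 24 hours (86400000 ms) without restarting brokers
kafka-configs.sh --bootstrap-server localhost:9092 --alter \
  --entity-type topics --entity-name my_topic \
  --add-config retention.ms=86400000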

How to find the kafka version in linux

How to find the kafka version in linux?
Is there a way to find the installed Kafka version other than knowing which version was downloaded?
Not sure if there's a convenient way, but you can just inspect your kafka/libs folder. You should see files like kafka_2.10-0.8.2-beta.jar, where 2.10 is the Scala version and 0.8.2-beta is the Kafka version.
Kafka 2.0 has a fix for this (KIP-278):
kafka-topics.sh --version
Or
kafka-topics --version
Using confluent utility:
The Kafka version check can be done with the confluent utility, which comes by default with the Confluent Platform (the confluent utility can also be added to a cluster separately - credits cricket_007).
${confluent.home}/bin/confluent version kafka
Checking the versions of other Confluent Platform components such as ksql, schema-registry and connect:
[confluent-4.1.0]$ ./bin/confluent version kafka
1.1.0-cp1
[confluent-4.1.0]$ ./bin/confluent version connect
4.1.0
[confluent-4.1.0]$ ./bin/confluent version schema-registry
4.1.0
[confluent-4.1.0]$ ./bin/confluent version ksql-server
4.1.0
There is nothing like kafka --version at this point, so you should either check the version from your kafka/libs/ folder or run
find ./libs/ -name \*kafka_\* | head -1 | grep -o '\kafka[^\n]*'
from your kafka folder (it will do the same for you). It will return something like kafka_2.9.2-0.8.1.1.jar.asc, where 0.8.1.1 is your Kafka version.
There are several methods to find the Kafka version.
Method 1 (simple):
ps -ef|grep kafka
It will display all running Kafka processes in the console...
Ex:- /usr/hdp/current/kafka-broker/bin/../libs/kafka-clients-0.10.0.2.5.3.0-37.jar
We are using Kafka version 0.10.0.2.5.3.0-37.
Method 2:
go to
cd /usr/hdp/current/kafka-broker/libs
ll |grep kafka
Ex:- kafka_2.10-0.10.0.2.5.3.0-37.jar
kafka-clients-0.10.0.2.5.3.0-37.jar
Same result as method 1: we can find the Kafka version from the jars in the Kafka libs directory.
You can grep the logs to see the version. Let's say kafka is installed under /usr/local/kafka, then:
$ grep "Kafka version" /usr/local/kafka/logs/*
/usr/local/kafka/logs/kafkaServer.out: INFO Kafka version : 0.9.0.1 (org.apache.kafka.common.utils.AppInfoParser)
will reveal the version
If you want to check the version of a specific Kafka broker, run this CLI on the broker*
kafka-broker-api-versions.sh --bootstrap-server localhost:9092 --version
where localhost:9092 is the accessible <hostname|IP address>:<port> that this API will check (localhost can be used if it's the same host you're running this command on). Example output:
2.4.0 (Commit:77a89fcf8d7fa018)
* Apache Kafka comes with a variety of console tools in the ./bin sub-directory of your Kafka download; e.g. ~/kafka/bin/
A simple way on macOS, e.g. when installed via Homebrew:
$ ls -l $(which kafka-topics)
/usr/local/bin/kafka-topics -> ../Cellar/kafka/0.11.0.1/bin/kafka-topics
For Debian/Ubuntu you can use:
dpkg -l|grep kafka
The expected result should look like:
ii confluent-kafka-2.11 0.11.0.1-1 all publish-subscribe messaging rethought as a distributed commit log
ii confluent-kafka-connect-elasticsearch 3.3.1-1 all Kafka Connect connector for copying data between Kafka and Elasticsearch
ii confluent-kafka-connect-hdfs 3.3.1-1 all Kafka Connect connector for copying data between Kafka and Hadoop HDFS
ii confluent-kafka-connect-jdbc 3.3.1-1 all Kafka Connect connector for JDBC-compatible databases
ii confluent-kafka-connect-replicator 3.3.1-1 all Kafka Connect connector for replicating topics between Kafka clusters
ii confluent-kafka-connect-s3 3.3.1-1 all Kafka Connect S3 connector for copying data between Kafka and
ii confluent-kafka-connect-storage-common 3.3.1-1 all Kafka Connect Storage Common contains packages used by storage
ii confluent-kafka-rest 3.3.1-1 all A REST proxy for Kafka
Go to the kafka/libs folder.
You can see multiple jars; search for something like kafka_2.11-0.10.1.1.jar.asc - in this case the Kafka version is 0.10.1.1.
I found an easy way to do this without searching directories or log files:
kafka-dump-log --version
Output looks like this:
5.3.0-ccs (Commit:6481debc2be778ee)
cd kafka
./bin/kafka-topics.sh --version
When you install Kafka on CentOS 7 with Confluent:
yum install confluent-platform-oss-2.11
You can see the version of Kafka with:
yum deplist confluent-platform-oss-2.11
In the output you can read: confluent-kafka-2.11 >= 0.10.2.1
To find the Kafka version, we can use the jps command, which shows all the Java processes running on the machine.
Step 1: Let's say you are running Kafka as the root user. Log in to your machine as root and use jps -m. It will show a result like:
4979 Jps -m
9434 Kafka config/server.properties
Step 2: From the above result, take the PID of the Kafka application and use pwdx 9434, which reports the current working directory of the process. The result will look like:
9434: /apps/kafka_2.12-2.4.0
Here you can see the Kafka version, which is 2.4.0 (the 2.12 is the Scala version).
cd confluent-7.2.0/share/java/kafka
then
$ ls -lha | grep kafka
-rw-r--r-- 1 root root 5.3M Jul 5 09:45 kafka_2.13-7.2.0-ccs.jar
-rw-r--r-- 1 root root 4.8M Jul 5 09:45 kafka-clients-7.2.0-ccs.jar
lrwxrwxrwx 1 root root 26 Jul 23 10:10 kafka.jar -> ./kafka_2.13-7.2.0-ccs.jar
-rw-r--r-- 1 root root 9.4K Jul 5 09:45 kafka-log4j-appender-7.2.0-ccs.jar
-rw-r--r-- 1 root root 458K Jul 5 09:45 kafka-metadata-7.2.0-ccs.jar
-rw-r--r-- 1 root root 182K Jul 5 09:45 kafka-raft-7.2.0-ccs.jar
-rw-r--r-- 1 root root 36K Jul 5 09:45 kafka-server-common-7.2.0-ccs.jar
-rw-r--r-- 1 root root 84K Jul 5 09:45 kafka-shell-7.2.0-ccs.jar
-rw-r--r-- 1 root root 151K Jul 5 09:45 kafka-storage-7.2.0-ccs.jar
-rw-r--r-- 1 root root 23K Jul 5 09:45 kafka-storage-api-7.2.0-ccs.jar
-rw-r--r-- 1 root root 1.6M Jul 5 09:45 kafka-streams-7.2.0-ccs.jar
-rw-r--r-- 1 root root 41K Jul 5 09:45 kafka-streams-examples-7.2.0-ccs.jar
-rw-r--r-- 1 root root 161K Jul 5 09:45 kafka-streams-scala_2.13-7.2.0-ccs.jar
-rw-r--r-- 1 root root 52K Jul 5 09:45 kafka-streams-test-utils-7.2.0-ccs.jar
-rw-r--r-- 1 root root 127K Jul 5 09:45 kafka-tools-7.2.0-ccs.jar
Inside the landoop/fast-data-dev Docker image (note the DOCKER_REPO line in the output below), you can also type
cat /build.info
This will give you an output like this
BUILD_BRANCH=master
BUILD_COMMIT=434160726dacc4a1a592fe6036891d6e646a3a4a
BUILD_TIME=2017-05-12T16:02:04Z
DOCKER_REPO=index.docker.io/landoop/fast-data-dev
KAFKA_VERSION=0.10.2.1
CP_VERSION=3.2.1
To check the Kafka version:
cd /usr/hdp/current/kafka-broker/libs
ls kafka_*.jar