Expiring the messages in Kafka Topic - apache-kafka

We are using Apache Kafka to perform load tests in our dev environment.
The Linux box where Confluent Kafka is installed has limited disk space, so to run the load test we added the retention.ms property to the topic.
The idea is to remove messages from the topic shortly after they have been consumed.
I have tried:
kafka-topics --zookeeper localhost:2181 --alter --topic myTopic --config retention.ms=10000
It didn't work, so we re-created the topic and tried the option below:
kafka-configs --alter --zookeeper localhost:2181 --entity-type topics --entity-name myTopic --add-config retention.ms=10000
After running the load for a few hours, the broker shuts down because it runs out of disk space.
What other options can I try, at the topic level as well as at the broker level, to keep expiring messages reliably and reclaim the disk space during a long-running load test?

You can define the deletion policy based on byte size in addition to time.
The topic configuration is called retention.bytes, and the documentation describes it as:
This configuration controls the maximum size a partition (which consists of log segments) can grow to before we will discard old log segments to free up space if we are using the "delete" retention policy. By default there is no size limit only a time limit. Since this limit is enforced at the partition level, multiply it by the number of partitions to compute the topic retention in bytes.
You can set it together with retention.ms, and whichever limit (bytes or time) is reached first triggers the cleanup.
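For example, both limits can be combined in a single alter call. This is only a sketch using the topic name from the question and an assumed per-partition cap of 1 GiB; remember that retention.bytes applies per partition, so the effective topic-wide limit is this value times the partition count:

```shell
# Whichever limit (time or size) is hit first triggers deletion
# under the "delete" cleanup policy. Requires a running cluster.
kafka-configs --alter --zookeeper localhost:2181 \
  --entity-type topics --entity-name myTopic \
  --add-config retention.ms=10000,retention.bytes=1073741824
```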

This might be because your log cleaner threads have not been triggered yet.
You did not provide much information about how much data accumulates on the topics, but it may not be in the GBs.
The cleaner threads only act on completed (rolled) log segments, and the default segment size is 1 GB.
Lower the topic configuration segment.bytes if you are expecting a heavy load,
or
set segment.ms to 1 minute or 10 minutes, as your requirements dictate.
This will roll segments more often, and based on your log retention time, the cleaner threads will delete the older, completed segments.
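A sketch of both options in one command; the exact values here (1 minute, 100 MB) are assumptions that you should tune to your load:

```shell
# Roll a new segment after 1 minute or 100 MB, whichever comes first,
# so that retention can delete completed segments promptly.
kafka-configs --alter --zookeeper localhost:2181 \
  --entity-type topics --entity-name myTopic \
  --add-config segment.ms=60000,segment.bytes=104857600
```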

Related

Kafka offset is stuck in consumer group

I'm running this kafka command:
/opt/kafka_2.11/bin/kafka-consumer-groups.sh --bootstrap-server xxxxx:9092 \
--describe --group flink-cg
The result looks like this:
TOPIC     PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG
my_topic  0          481239571       484028280       2788709
The offset stays stuck although my Flink job is running and there are no errors in the log file.
How can I check whether my offset number is correct? I'm afraid my current-offset has the wrong value and that is why it is stuck.
The sole fact that the Flink job is running doesn't necessarily mean that the offset should change. This depends on the configuration of your job, but by default the offset is only committed on checkpoint, so the first thing to check is whether your job is checkpointing properly (maybe you have configured a long interval between checkpoints).
If it is, or if you have enabled enable.auto.commit, then you should check whether there is backpressure on some operators that may be causing problems with reading records.
It would be easier to tell if you could provide more information about the configuration and the job itself.

Data still remains in Kafka topic even after retention time/size

We set the log retention to 1 hour as follows (the previous setting was 72 hours).
Using the Kafka command line tools, we set retention.ms to 1 hour. Our aim is to purge the data that is older than 1 hour from the topic topic_test, so we used the following command:
kafka-configs.sh --alter \
--zookeeper localhost:2181 \
--entity-type topics \
--entity-name topic_test \
--add-config retention.ms=3600000
and also
kafka-topics.sh --zookeeper localhost:2181 --alter \
--topic topic_test \
--config retention.ms=3600000
Both commands ran without errors.
But the problem is that Kafka data older than 1 hour still remains!
Actually, no data was removed from the partitions of topic_test. We have an HDP Kafka cluster, version 1.0.x, managed with Ambari.
We do not understand why the data in topic_test remained and did not decrease even after we ran both CLI commands described above.
What is wrong with the following Kafka CLI commands?
kafka-configs.sh --alter --zookeeper localhost:2181 --entity-type topics --entity-name topic_test --add-config retention.ms=3600000
kafka-topics.sh --zookeeper localhost:2181 --alter --topic topic_test --config retention.ms=3600000
From the Kafka server.log we can see the following:
[2020-07-28 14:47:27,394] INFO Processing override for entityPath: topics/topic_test with config: Map(retention.bytes -> 2165441552, retention.ms -> 3600000) (kafka.server.DynamicConfigManager)
[2020-07-28 14:47:27,397] WARN retention.ms for topic topic_test is set to 3600000. It is smaller than message.timestamp.difference.max.ms's value 9223372036854775807. This may result in frequent log rolling. (kafka.server.TopicConfigHandler)
reference - https://ronnieroller.com/kafka/cheat-sheet
The log cleaner will only work on inactive (sometimes also referred to as "old" or "clean") segments. As long as all data fits into the active ("dirty", "unclean") segment, whose size is defined by the segment.bytes limit, no cleaning will happen.
The configuration cleanup.policy is described as:
A string that is either "delete" or "compact" or both. This string designates the retention policy to use on old log segments. The default policy ("delete") will discard old segments when their retention time or size limit has been reached. The "compact" setting will enable log compaction on the topic.
In addition, the segment.bytes is:
This configuration controls the segment file size for the log. Retention and cleaning is always done a file at a time so a larger segment size means fewer files but less granular control over retention.
The configuration segment.ms can also be used to steer the deletion:
This configuration controls the period of time after which Kafka will force the log to roll even if the segment file isn't full to ensure that retention can delete or compact old data.
As it defaults to one week, you might want to reduce it to fit your needs.
Therefore, if you want to set the retention of a topic to e.g. one hour you could set:
cleanup.policy=delete
retention.ms=3600000
segment.ms=3600000
file.delete.delay.ms=1 (The time to wait before deleting a file from the filesystem)
segment.bytes=1024
Note: I am not referring to retention.bytes; segment.bytes is a very different thing, as described above. Also, be aware that log.retention.hours is a cluster-wide configuration, so if you plan to have different retention times for different topics, setting the topic-level properties above will solve it.
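As a sketch, all of the settings listed above can be applied to the topic in one call (the 1024-byte segment size is only for demonstration; such tiny segments would be impractical in production):

```shell
# Apply the full per-topic retention configuration in one command.
kafka-configs.sh --alter --zookeeper localhost:2181 \
  --entity-type topics --entity-name topic_test \
  --add-config cleanup.policy=delete,retention.ms=3600000,segment.ms=3600000,file.delete.delay.ms=1,segment.bytes=1024
```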

Why can it take so long to delete Kafka topics?

On a 3-node cluster, I created a few topics with thousands of messages. I have noticed that it takes a long time to delete a topic; it took me more than 14 minutes to delete 500 topics.
Are there any best practices for topic deletion?
Is there any document that explains why it takes so much time to delete a topic?
When I create a topic, Kafka creates a folder under log.dirs. I had 10,000 topics; I ran a command to delete all of them. Kafka deleted all 10,000 folders from log.dirs, but kafka-topics.sh still shows topics that no longer exist on the file system as "marked for deletion".
I don't think there are any particular best practices for deleting a topic in Kafka. As long as delete.topic.enable=true is set in server.properties, you can simply delete the topic using:
bin/kafka-topics.sh --delete --zookeeper localhost:2181 --topic myTopic
If your topics were large enough (and you might also have had a high replication factor), then that's normal. Essentially, the messages of a topic are stored in log files, and if your topics are extremely large it can take some time to get rid of all of those files. I think the proper metric here is the total size of the topics you attempted to delete, not their number (you can have 500 topics with 1 message each, as opposed to 500 topics with e.g. 1 TB of messages each).
Kafka topic deletion is not guaranteed to be instantaneous. When you 'delete' a topic, you are actually only marking it for deletion. The next time the TopicDeletionManager runs, it will start removing the topics that are marked for deletion. This also takes longer when the topic logs are large.
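To see which topics are still pending removal, you can list them; topics awaiting deletion are shown with a "marked for deletion" note. The topic names in this sketch are hypothetical:

```shell
# Delete a batch of test topics, then check which are still pending.
for t in load_test_topic_1 load_test_topic_2 load_test_topic_3; do
  bin/kafka-topics.sh --delete --zookeeper localhost:2181 --topic "$t"
done
bin/kafka-topics.sh --list --zookeeper localhost:2181 | grep "marked for deletion"
```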

kafka different topics set different partitions

As far as I know, num.partitions in the Kafka server.properties applies to all topics.
Now I want to set partitionNumber=1 for topicA and partitionNumber=2 for topicB.
Is it possible to implement this with the high-level API?
num.partitions is only used when a topic is created automatically. If you create a topic yourself, you can set any number of partitions you want.
You can create a topic yourself with the following command (replication factor 3 and 2 partitions; the capitalized words are what you have to replace):
bin/kafka-topics.sh --create --zookeeper ZOOKEEPER_HOSTNAME:ZOOKEEPER_PORT \
--replication-factor 3 --partitions 2 --topic TOPIC_NAME
There is a configuration value that can be set on a Kafka broker:
auto.create.topics.enable=true
True is actually the default setting; the documentation describes it as:
Enable auto creation of topic on the server. If this is set to true then attempts to produce data or fetch metadata for a non-existent topic will automatically create it with the default replication factor and number of partitions.
So if you read from or write to a non-existent topic as if it existed, it will automatically be created for you. I've never heard of using the high-level API to explicitly create one.
Looking over the Kafka Protocol Documentation, there doesn't seem to be a provided way to create topics.

Are broker nodes in kafka cluster configured to handle number of partition?

Kafka places partitions and replicas so that the brokers with the fewest existing partitions are used first. Does that mean brokers are pre-configured to handle a certain number of partitions?
When you create a topic, you set the number of partitions.
bin/kafka-topics.sh --create --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1 --topic test
Also, there is a num.partitions parameter you can use (it only applies when a topic is created automatically).
A broker can host as many partitions as you like, as long as it has enough disk space, memory, and network bandwidth.
If you look inside log.dirs on a broker, you will see a folder such as test-0 for the single partition of the topic test. If you create a topic with three partitions, there will be two more folders, test-1 and test-2.
Each partition has an index file, a timeindex file, and a log file. The log file keeps Kafka data for that partition.
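As an illustration, you can list a partition directory to see these files. The /var/lib/kafka/data path is an assumption; use whatever your broker's log.dirs points to:

```shell
# Typical contents of a fresh partition directory:
#   00000000000000000000.index
#   00000000000000000000.log
#   00000000000000000000.timeindex
ls /var/lib/kafka/data/test-0
```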