Kafka: Adjust retention automatically based on lag - apache-kafka

I am writing an application that collects data from IoT devices and lets my customers subscribe to this data by providing their HTTP endpoint credentials.
I have to deal with their endpoints not responding or being slow, therefore I will buffer the messages until they are sent (consumed), which requires storage.
To limit this storage, I am wondering if I can watch the lag of my consumers and, when it reaches a threshold, automatically increase the topic retention (and decrease it back later automatically).
This would let me set a short retention by default and yet be able to handle unavailable external endpoints without losing messages. (Of course, if the lag keeps growing, I'll have to take other actions.)
My question is: is this possible with Kafka, and are there things I should be careful about when doing it this way?
Many thanks

You can adjust the retention time of a topic via the Kafka command-line tools:
bin/kafka-configs.sh --bootstrap-server localhost:9092 --entity-type topics --entity-name test --alter --add-config retention.ms=55000
Or, if you want to do it from your code, take a look at the TopicCommand class, or preferably the Admin client's incrementalAlterConfigs API.
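The questioner's feedback loop can be sketched as plain decision logic. The thresholds and retention values below are made-up examples, and the actual config change would still go through kafka-configs.sh or the Admin API:

```python
# Sketch of lag-driven retention adjustment. All constants here are
# illustrative assumptions, not Kafka defaults. The hysteresis gap between
# LAG_HIGH and LAG_LOW avoids flapping the config back and forth.

DEFAULT_RETENTION_MS = 1 * 60 * 60 * 1000    # short default: 1 hour
EXTENDED_RETENTION_MS = 24 * 60 * 60 * 1000  # extended: 24 hours
LAG_HIGH = 10_000  # raise retention once the worst lag exceeds this
LAG_LOW = 1_000    # drop back once the worst lag falls below this

def next_retention_ms(current_retention_ms: int, max_consumer_lag: int) -> int:
    """Decide the topic's retention.ms from the worst consumer lag."""
    if max_consumer_lag >= LAG_HIGH:
        return EXTENDED_RETENTION_MS
    if max_consumer_lag <= LAG_LOW:
        return DEFAULT_RETENTION_MS
    return current_retention_ms  # in between: keep whatever is set
```

A monitoring loop would feed this function the lag reported for the consumer group and, whenever the returned value differs from the current one, apply it via the Admin API or kafka-configs.sh.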

Related

Control throughput in Apache Kafka

I want to run tests to measure the latency and throughput of different frameworks. I send messages to a Kafka topic using a script that reads from a txt file. The framework consumes events from this input topic and produces into an output topic. I have two questions about this:
A) Let's say I want to send 400 events per second. How do I control that? Is it something that is controlled by the script that sends the data, or can it be configured in Kafka?
B) If I can control throughput by tuning Kafka parameters, how do I gradually increment the amount of events sent (dynamically)?
Thank you very much!
Throughput can be controlled in Kafka by enforcing Quotas on clients. But the catch is it's enforced in terms of Bytes/sec and not number of messages per second.
Since it's for testing purposes you can define the quota like this in the Kafka config file (server.properties):
quota.producer.default=100
quota.consumer.default=100
Do note that this will apply the throttling to all topics.
Also, 100 here means 100 bytes per second.
If you want to enforce quotas on specific producer or consumer clients, you can do that using:
quota.producer.override="clientA:4M,clientB:6M"
This means that regardless of the default quota, the producer with client.id "clientA" can produce at a maximum of 4 MB/s (the values are bytes per second, so 4M is 4 megabytes, not megabits).
As for the dynamic part, instead of having to set these in the properties file manually, you can use the kafka-configs.sh script to set them:
bin/kafka-configs.sh --bootstrap-server localhost:9092 --alter --add-config 'producer_byte_rate=1024' --entity-name clientA --entity-type clients
This means: enforce a throttle on the client with client.id "clientA", capping its produce rate at a maximum of 1024 bytes per second.
You can invoke this script programmatically to increase or decrease quotas dynamically.
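Since quotas are in bytes/sec rather than messages/sec, you have to convert your target event rate using an estimated average message size. A trivial sketch, where the 250-byte average message size is an assumption:

```python
# Quotas throttle bytes/sec, not messages/sec. To target a given event
# rate, estimate the average message size and convert.

def quota_bytes_per_sec(target_msgs_per_sec: int, avg_msg_bytes: int) -> int:
    """Byte-rate quota that approximates the desired message rate."""
    return target_msgs_per_sec * avg_msg_bytes

# e.g. 400 messages/s at ~250 bytes each -> 100_000 bytes/s,
# which would then be passed as producer_byte_rate to kafka-configs.sh
```

Keep in mind the approximation only holds while your real average message size matches the estimate; variable-size payloads will make the effective message rate drift.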

How to avoid duplication in Kafka when the consumer gets disconnected; I am able to see previous messages every time

I am trying to implement a simple Producer --> Kafka --> Consumer application using the Kafka shell tools. I am able to produce as well as consume messages successfully, but the problem occurs when I restart the consumer: every time I restart it, the old messages get picked up again. Is there any way to avoid this, so that only the messages produced while the consumer was down get picked up?
From the shell... meaning kafka-console-consumer? If so:
Don't constantly use --from-beginning
Add a --group argument so Kafka keeps track of what has been consumed
It's worth mentioning that the default behavior is at-least-once delivery, so duplicates should be expected; to tighten the guarantees you'll need to write your own consumer that deduplicates, or pair idempotent/transactional producers with a consumer reading with isolation.level=read_committed.
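One common consumer-side guard against at-least-once duplicates is to track which (partition, offset) pairs have already been processed. A simplified in-memory sketch; a real service would persist this state alongside its output:

```python
# At-least-once delivery means the same record can arrive twice (e.g. after
# a restart before offsets were committed). Skip records already handled.

def process_once(seen: set, partition: int, offset: int, handler) -> bool:
    """Run handler only if this (partition, offset) was not seen before.

    Returns True if the record was processed, False if it was a duplicate.
    """
    key = (partition, offset)
    if key in seen:
        return False  # duplicate, skip
    handler()
    seen.add(key)
    return True
```

In practice you would bound this set (offsets below the committed watermark can be dropped) or use a natural business key instead of offsets.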

kafka __consumer_offsets topic logs rapidly growing in size reducing disk space

I find that the __consumer_offsets topic log size is growing rapidly, and after studying it further I found the topics with the highest volume. I changed the retention policy on these topics to stop the rate of growth, but would like to free disk space by deleting all the old logs for the __consumer_offsets topic.
But I worry this will cause all the other topics and consumers/producers to get corrupted or lose valuable metadata. Is there a way I can accomplish this safely? I'm looking at the parameters for the config, which include cleanup policy and compression, but I'm not sure how to specify this specifically for the topics that caused this rapid growth.
https://docs.confluent.io/current/installation/configuration/topic-configs.html
Appreciate any assistance here.
The topic "__consumer_offsets" is an internal topic which is used to manage the offsets of each consumer group. Producers will not be directly impacted by any change/modification to this topic.
That said, and echoing the concern in your question, you should be very careful about changing the configuration of this topic.
I suggest tweaking the topic configurations for compacted topics. The cleanup policy should be kept at "compact".
Reduce max.compaction.lag.ms (cluster-wide setting: log.cleaner.max.compaction.lag.ms), which defaults to MAX_LONG, to something like 60000.
Reduce the ratio at which a compaction is triggered, min.cleanable.dirty.ratio (cluster-wide setting: log.cleaner.min.cleanable.ratio), from its default of 0.5 to something like 0.1.
That way, compactions will run more often without losing any essential information.
Deleting old records in __consumer_offsets
The topic will pile up if you use many unique consumer groups (e.g. by using console-consumer, which by default creates a random consumer group each time it is executed).
To clean "old and un-needed" entries in the topic, you need to know how to delete a message out of a compacted topic. This is done by producing a message to the topic with a null value. That way you will eventually delete all messages with the same key. You just have to figure out the keys of the messages you want to get rid of.
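The tombstone mechanics can be modeled as "latest value per key". This sketch shows why producing a null value for a key eventually removes that key from the compacted topic:

```python
# Simplified model of log compaction: the topic converges to the latest
# value per key, and a record with a None (null) value is a tombstone that
# eventually removes every earlier record with that key.

def compact(records):
    """records: iterable of (key, value) pairs in log order."""
    latest = {}
    for key, value in records:
        if value is None:
            latest.pop(key, None)  # tombstone deletes the key
        else:
            latest[key] = value
    return latest
```

Real compaction works segment by segment and keeps tombstones around for a configurable period so that consumers can observe the deletion, but the end state is the same.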
In Kafka, there are two types of log retention; size and time retention. The former is triggered by log.retention.bytes while the latter by log.retention.hours.
In your case, you should pay attention to size retention, which can sometimes be quite tricky to configure. Assuming that you want a delete cleanup policy, you'd need to set the following parameters:
log.cleaner.enable=true
log.cleanup.policy=delete
Then you need to think about the configuration of log.retention.bytes, log.segment.bytes and log.retention.check.interval.ms. To do so, you have to take into consideration the following factors:
log.retention.bytes is a minimum guarantee for a single partition of a topic, meaning that if you set log.retention.bytes to 512MB, it means you will always have 512MB of data (per partition) in your disk.
Again, if you set log.retention.bytes to 512MB and log.retention.check.interval.ms to 5 minutes (which is the default value) at any given time, you will have at least 512MB of data + the size of data produced within the 5 minute window, before the retention policy is triggered.
A topic log on disk is made up of segments. The segment size depends on the log.segment.bytes parameter. For log.retention.bytes=1GB and log.segment.bytes=512MB, you will always have up to 3 segments on the disk (two segments that have reached the retention limit, plus the active segment where data is currently being written).
Finally, you should do the math and compute the maximum size that might be reserved by Kafka logs at any given time on your disk, and tune the aforementioned parameters accordingly. Of course, I would also advise setting a time retention policy as well and configuring log.retention.hours accordingly. If after 2 days you don't need your data anymore, then set log.retention.hours=48.
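The worst-case math above can be written out. This sketch assumes size-based retention only and ignores index files:

```python
import math

# Worst-case disk usage per partition under size-based retention: the full
# segments that can accumulate before retention kicks in, plus the active
# segment still being written.

def max_partition_bytes(retention_bytes: int, segment_bytes: int) -> int:
    full_segments = math.ceil(retention_bytes / segment_bytes)
    return (full_segments + 1) * segment_bytes

# retention.bytes = 1 GiB, segment.bytes = 512 MiB
# -> 2 full segments + 1 active segment = up to 1.5 GiB per partition
```

Multiply the result by the partition count (and replication factor) of the topic to size the disk.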
Now in order to change the retention policy just for the __consumer_offsets topic, you can simply run (on newer Kafka versions, use --bootstrap-server localhost:9092 instead of --zookeeper):
bin/kafka-configs.sh \
--zookeeper localhost:2181 \
--alter \
--entity-type topics \
--entity-name __consumer_offsets \
--add-config retention.bytes=...
As a side note, you must be very careful with the retention policy for the __consumer_offsets as this might mess up all your consumers.

Is there any api to find partition is balanced or not in kafka

Is there any API or client library that can tell me what percentage of a topic's data each partition holds, so that I can check whether the partitions are balanced?
This is a good strategy to discuss before designing and developing on Kafka.
The first point to consider is how you define your key and which partitioner you plan to use while producing messages to the topics.
Rule of thumb:
If you don't need to group messages by key, pass a null key to distribute your messages in a round-robin manner.
You can also use a custom partitioner in case you need more refined partitioning.
To check the partition distribution, the best approach is to check the lag on each partition and its byte rate.
There are many ways to monitor this:
1. You can use the simple API to get various metrics like lag, rate, etc.
You can refer here: Kafka Metrics
kafka.server:type=ReplicaFetcherManager,name=MaxLag,clientId=Replica
2. I generally prefer Grafana with the JMX exporter; it visualizes the metrics nicely.
Grafana
3. We can also use the CLI to identify each partition's offset and lag, which gives you the overall picture instantly:
bin/kafka-consumer-groups.sh --bootstrap-server broker1:9092 --describe --group consumer-group
TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID HOST CLIENT-ID
test 1 10 30 20 consumer-group
You can also do this programmatically:
How to identify partition lag
Confluent Control Center is a paid tool, but a very interesting one for monitoring Kafka overall, including consumers and their partitions:
Confluent Control Center
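As a sketch of checking balance programmatically, you can compare per-partition log-end offsets (as reported by kafka-consumer-groups.sh or the consumer API) against their mean; the 20% tolerance here is an arbitrary example threshold:

```python
# Flag partitions whose log-end offset deviates noticeably from the mean,
# a rough indicator that the partitioning key is skewed. The tolerance is
# an illustrative choice, not a Kafka setting.

def skewed_partitions(log_end_offsets: dict, tolerance: float = 0.2):
    """log_end_offsets: {partition_number: log_end_offset}.

    Returns the sorted list of partitions deviating more than `tolerance`
    (as a fraction of the mean) from the mean offset.
    """
    mean = sum(log_end_offsets.values()) / len(log_end_offsets)
    return sorted(
        p for p, end in log_end_offsets.items()
        if abs(end - mean) > tolerance * mean
    )
```

Comparing log-end offsets only works if partitions started empty at the same time; for long-lived topics, compare the offset growth over a window instead.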
Assume that you created a topic X. Your producers start pushing tons of data into the topic and it grows exponentially. Depending on the log.segment.bytes configuration, Kafka will create a new segment and start writing data into it. The old segment will be kept for log.retention.ms milliseconds. Because of this, "100% of a topic" is tricky to calculate.
However, if you are looking for a tool that can allocate partitions depending on the load on each broker then I would recommend looking into Kafka-kit (https://www.datadoghq.com/blog/engineering/introducing-kafka-kit-tools-for-scaling-kafka/).

Kafka System Tools for Replay

I have a case where we need to move data from one topic to another topic. I saw a utility in the Kafka documentation, "ReplayLogProducer". It's supposed to be run as indicated below.
bin/kafka-run-class.sh kafka.tools.ReplayLogProducer
Does this tool require the source topic to have the same number of partitions as the destination topic? How does the retention of data work on the new topic?
It would be great if anyone can provide any insight on any best practices to be followed or caveats to keep in mind while running this tool.
The command-line kafka.tools.ReplayLogProducer tool does not require the partitions to be the same. By default it uses the default partitioning strategy: hash of message's key if present, or round-robin if your messages don't have keys. One of the main use cases is to copy data from an old to a new topic after changing the number of partitions or the partitioning strategy.
It's still not documented, but the ability to specify a custom partitioner was apparently added by KAFKA-1281: you can now specify custom producer options with --property. So to use a different partitioning strategy, try:
bin/kafka-run-class.sh kafka.tools.ReplayLogProducer --property partitioner.class=my.custom.Partitioner
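The default strategy the answer describes (hash of the key if present, round-robin otherwise) can be sketched like this; note that Kafka's Java client actually uses murmur2 for keyed records, while Python's plain hash() here is only illustrative:

```python
import itertools

# Sketch of Kafka's default partitioning behavior: keyed records go to a
# stable hash-derived partition, keyless records are spread round-robin.
# hash() stands in for the real murmur2 hash used by the Java client.

def make_partitioner(num_partitions: int):
    rr = itertools.cycle(range(num_partitions))
    def partition(key=None) -> int:
        if key is None:
            return next(rr)                  # keyless: round-robin
        return hash(key) % num_partitions    # keyed: stable within a run
    return partition
```

This is why changing the partition count of a topic reshuffles keyed data: the modulus changes, so the same key can map to a different partition, which is exactly the situation ReplayLogProducer was used to fix.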
Retention of data in the new topic will be however the new topic is configured with cleanup.policy, retention.ms or retention.bytes. Note that if using retention.ms (the default), retention is relative to the time the messages were replayed, not the original creation time. This is an issue with regular replication or mirrormaker and ReplayLogProducer is no different. Proposals for KIP-32 and KIP-33 should make it possible to instead configure retention by "creation time" of your messages, but since Kafka 0.10 is not yet released, it's not yet clear if ReplayLogProducer would preserve message creation time.