Control throughput in Apache Kafka

I want to run tests to measure the latency and throughput of different frameworks. I send messages to a Kafka topic using a script that reads from a txt file. Each framework consumes events from this input topic and produces into an output topic. I have two questions about this:
A) Let's say I want to send 400 events per second. How do I control that? Is it something that is controlled by the script that sends the data, or can it be configured in Kafka?
B) If I can control throughput by tuning Kafka parameters, how do I gradually increase the number of events sent (dynamically)?
Thank you very much!

Throughput can be controlled in Kafka by enforcing quotas on clients. The catch is that quotas are enforced in bytes per second, not in number of messages per second.
Since it's for testing purposes, you can define the quota like this in the Kafka broker config file (server.properties):
quota.producer.default=100
quota.consumer.default=100
Do note that this applies the throttling to all clients by default, across all topics.
Also, 100 here means 100 bytes per second.
If you want to enforce quotas on specific producer or consumer clients, you can do that using:
quota.producer.override="clientA:4M,clientB:6M"
This means that, regardless of the default quota, the producer with client.id "clientA" can produce at a maximum of 4 MB/s.
As for the dynamic part, instead of setting these in the properties file manually, you can use the kafka-configs.sh tool to change them at runtime:
bin/kafka-configs.sh --bootstrap-server localhost:9092 --alter --add-config 'producer_byte_rate=1024' --entity-name clientA --entity-type clients
This enforces a throttle on the client with client.id "clientA", limiting its produce rate to a maximum of 1024 bytes per second.
You can invoke this tool programmatically to increase or decrease quotas dynamically.
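Regarding question A, the exact events-per-second rate is most easily controlled in the sending script itself, since Kafka quotas work in bytes. A minimal sketch (the class name and the 400/s figure are illustrative, and the clock/sleep parameters exist only to make the pacing testable):

```python
import time

class RatePacer:
    """Paces calls to a fixed number of events per second."""
    def __init__(self, events_per_sec, clock=time.monotonic, sleep=time.sleep):
        self.interval = 1.0 / events_per_sec
        self.clock = clock
        self.sleep = sleep
        self.next_send = self.clock()

    def wait(self):
        """Block until the next send slot, then reserve it."""
        now = self.clock()
        if now < self.next_send:
            self.sleep(self.next_send - now)
        self.next_send = max(self.next_send, now) + self.interval

# Hypothetical send loop wrapping a Kafka producer:
# pacer = RatePacer(400)
# for line in open("events.txt"):
#     pacer.wait()
#     producer.send("input-topic", line.encode())
```

For question B, ramping the rate gradually amounts to recomputing `self.interval` on a schedule from the script itself.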

Related

Kafka: Adjust retention automatically based on lag

I am writing an application that collects data from IoT devices and lets my customers subscribe to this data by providing me with their HTTP endpoint credentials.
I have to deal with their endpoints not responding or being slow; therefore I will buffer the messages until they are sent (consumed), which requires storage.
To limit this storage, I am wondering if I can watch the lag of my consumers and, when it reaches a threshold, automatically increase the topic retention (and decrease it back automatically later).
This would let me set a short retention by default and still handle unavailable external endpoints without losing messages. (Of course, if the lag keeps growing, I'll have to take other actions.)
My question is then: is this possible with Kafka? And are there things I should be careful with when doing it this way?
Many thanks
You can adjust the retention time of a topic via the Kafka command line tools:
bin/kafka-configs.sh --bootstrap-server localhost:9092 --entity-type topics --entity-name test --alter --add-config retention.ms=55000
Or, if you want to do it within your code, take a look at the TopicCommand class.
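The decision logic itself can live outside Kafka. A small sketch (the function name, thresholds, and retention values are all illustrative assumptions) that picks a retention.ms based on observed lag, with hysteresis so the value isn't flapped on every check:

```python
def choose_retention_ms(total_lag,
                        current_ms,
                        default_ms=3_600_000,      # 1 hour (illustrative)
                        extended_ms=86_400_000,    # 24 hours (illustrative)
                        raise_threshold=100_000,
                        lower_threshold=10_000):
    """Return the retention.ms to apply given the total consumer lag.
    Between the two thresholds the current value is kept (hysteresis)."""
    if total_lag >= raise_threshold:
        return extended_ms
    if total_lag <= lower_threshold:
        return default_ms
    return current_ms
```

You would run this on a timer and apply the returned value with a kafka-configs.sh invocation like the one above (or via Kafka's AdminClient API).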

kafka __consumer_offsets topic logs rapidly growing in size reducing disk space

I find that the __consumer_offsets topic log size is growing rapidly, and after studying it further I found the topics with the highest volume. I changed the retention policy on those topics to stop the rate of growth, but I would like to reclaim disk space and delete all the old logs for the __consumer_offsets topic.
But I'm concerned this will cause all the other topics and consumers/producers to get corrupted or to lose valuable metadata. Is there a way I can accomplish this safely? I'm looking at the topic config parameters, which include cleanup policy and compression, but I'm not sure how to specify these specifically for the topics that caused this rapid growth.
https://docs.confluent.io/current/installation/configuration/topic-configs.html
Appreciate any assistance here.
The topic "__consumer_offsets" is an internal topic which is used to manage the offsets of each Consumer Group. Producers will not be directly impacted by any change/modification in this topic.
That said, and echoing your own concern, you should be very careful about changing the configuration of this topic.
I suggest tweaking the topic configurations that govern compaction. The cleanup policy should be kept at "compact".
Reduce max.compaction.lag.ms (cluster-wide setting: log.cleaner.max.compaction.lag.ms), which defaults to MAX_LONG, to something like 60000.
Reduce the ratio at which a compaction is triggered through min.cleanable.dirty.ratio (cluster-wide setting: log.cleaner.min.cleanable.ratio), which defaults to 0.5, to something like 0.1.
That way, compaction will run more often without losing any essential information.
Deleting old records in __consumer_offsets
The topic will pile up if you use many unique Consumer Groups (e.g. by using the console consumer, which by default creates a random Consumer Group each time it is executed).
To clean "old and un-needed" entries in the topic, you need to know how to delete a message from a compacted topic. This is done by producing a message to the topic with a null value (a tombstone). That way you will eventually delete the messages for the same key; you just have to figure out the keys of the messages you want to get rid of.
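The tombstone semantics can be illustrated with a small simulation of the state that compaction eventually converges to (this is a model of the behavior, not Kafka code; keys and values are made up):

```python
def compacted_view(records):
    """Models log compaction: for each key the last value wins, and a None
    value (a tombstone) eventually removes the key entirely."""
    state = {}
    for key, value in records:
        if value is None:
            state.pop(key, None)  # tombstone: key disappears after compaction
        else:
            state[key] = value
    return state

# A stale group's offset entry is removed by producing a null-value record:
log = [("group-a", b"offset=5"), ("group-b", b"offset=9"), ("group-a", None)]
```

After compaction runs, only the keys that were never tombstoned survive.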
In Kafka, there are two types of log retention: size and time retention. The former is triggered by log.retention.bytes while the latter by log.retention.hours.
In your case, you should pay attention to size retention, which can sometimes be quite tricky to configure. Assuming that you want a delete cleanup policy, you'd need to configure the following parameters:
log.cleaner.enable=true
log.cleanup.policy=delete
Then you need to think about the configuration of log.retention.bytes, log.segment.bytes and log.retention.check.interval.ms. To do so, you have to take into consideration the following factors:
log.retention.bytes is a minimum guarantee for a single partition of a topic, meaning that if you set log.retention.bytes to 512MB, you will always have at least 512MB of data (per partition) on your disk.
Again, if you set log.retention.bytes to 512MB and log.retention.check.interval.ms to 5 minutes (which is the default value), then at any given time you will have at least 512MB of data plus the data produced within the 5-minute window, before the retention policy is triggered.
A topic log on disk is made up of segments. The segment size is determined by the log.segment.bytes parameter. For log.retention.bytes=1GB and log.segment.bytes=512MB, you will always have up to 3 segments on disk: 2 segments that have reached the retention limit, plus the 3rd, active segment that data is currently written to.
Finally, you should do the math and compute the maximum size that might be reserved by the Kafka logs at any given time on your disk, and tune the aforementioned parameters accordingly. Of course, I would also advise setting a time retention policy as well and configuring log.retention.hours accordingly. If after 2 days you don't need your data anymore, then set log.retention.hours=48.
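The worked example above (1 GB retention with 512 MB segments giving up to 3 segments) generalizes to a small formula; a sketch of the math:

```python
import math

def max_log_bytes_per_partition(retention_bytes, segment_bytes):
    """Worst case on disk for one partition: the closed segments that have not
    yet crossed the retention limit, plus the active segment being written."""
    closed = math.ceil(retention_bytes / segment_bytes)
    return (closed + 1) * segment_bytes

MB = 1024 ** 2
```

For the topic-level figure, multiply by the partition count (and by the replication factor for cluster-wide disk usage).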
Now in order to change the retention policy just for the __consumer_offsets topic, you can simply run:
bin/kafka-configs.sh \
--zookeeper localhost:2181 \
--alter \
--entity-type topics \
--entity-name __consumer_offsets \
--add-config retention.bytes=...
As a side note, you must be very careful with the retention policy for the __consumer_offsets as this might mess up all your consumers.

Is there any api to find partition is balanced or not in kafka

Is there any API or client library that can tell me what percentage of a topic's data each partition holds, so that I can check whether the partitions are balanced?
This is a good strategy to discuss before designing and developing on Kafka.
The first point to consider is how you define your key and exactly which partitioner you plan to use while producing messages to the topics.
Rule of thumb:
If you don't need to group messages by key, just pass a null key to distribute your messages across partitions in a round-robin manner.
You can also use a custom partitioner in case you need more refined partitioning.
To check the partition distribution, the best approach is to check the lag on each partition and the rate in bytes/sec.
There are many ways to monitor this:
1. You can use the simple API to get various metrics like lag, rate, etc.
You can refer here: Kafka Metrics
kafka.server:type=ReplicaFetcherManager,name=MaxLag,clientId=Replica
2. I generally prefer Grafana with the JMX exporter to visualize metrics:
Grafana
3. We can also use the CLI to identify each partition's offset and lag, which gives you the overall picture instantly:
bin/kafka-consumer-groups.sh --bootstrap-server broker1:9092 --describe --group consumer-group
TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID HOST CLIENT-ID
test 1 10 30 20 consumer-group
You can also do this programmatically:
How to identify partition lag
Confluent Control Center is a paid tool, but a very interesting one for monitoring Kafka overall, including consumers and their partitions:
Confluent control center
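The per-partition offsets from kafka-consumer-groups.sh (or a consumer's end-offsets lookup) can be turned into a quick balance check. A sketch, where the skew ratio is an illustrative metric of my own choosing, and end offsets only approximate message counts on non-compacted topics with no deletions:

```python
def partition_balance(end_offsets):
    """Given log-end offsets per partition, return each partition's share of
    the total and the max/min share ratio (1.0 means perfectly balanced)."""
    total = sum(end_offsets.values())
    shares = {p: off / total for p, off in end_offsets.items()}
    skew = max(shares.values()) / min(shares.values())
    return shares, skew
```

A skew well above 1.0 suggests hot keys or an uneven partitioner.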
Assume that you created a topic X. Your producers started to push tons of data into your topic, and the topic is growing exponentially. Depending on the log.segment.bytes configuration, Kafka will create a new segment and start writing data into it. Old segments are kept for log.retention.ms milliseconds. Because of this, "100% full" for a topic itself is tricky to calculate.
However, if you are looking for a tool that can allocate partitions depending on the load on each broker then I would recommend looking into Kafka-kit (https://www.datadoghq.com/blog/engineering/introducing-kafka-kit-tools-for-scaling-kafka/).

How can I know that a kafka topic is full?

Let's say I have one Kafka broker configured with one partition and
log.retention.bytes=80000
log.retention.hours=6
What will happen if I try to send a record with the producer API to a broker and the log of the topic gets full before the retention period ends?
Will my message get dropped?
Or will kafka free some space from the old messages and add mine?
How can I know if a topic is getting full and logs are being deleted before being consumed?
Is there a way to monitor or expose a metric when a topic is getting full?
What will happen if I try to send a record with the producer api to a
broker and the log of the topic got full before the retention period?
Will my message get dropped? Or will kafka free some space from the
old messages and add mine?
The cleanup.policy property from the topic config, which is delete by default, says that "The delete policy will discard old segments when their retention time or size limit has been reached."
So, if you send a record with the producer API and the topic gets full, Kafka will discard old segments.
How can I know if a topic is getting full and logs are being deleted
before being consumed?
Is there a way to monitor or expose a metric when a topic is getting full?
You can get the partition sizes using the script below:
bin/kafka-log-dirs.sh --describe --bootstrap-server <broker>:<port> --topic-list <topic>
You will need to develop a script that runs the above command periodically to fetch the current size of the topic and send it to Datadog.
In Datadog, you can create a monitor that triggers an appropriate action (e.g. sending email alerts) once the size reaches a particular threshold.
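kafka-log-dirs.sh prints its result as JSON, so the glue script mostly needs to parse and aggregate it. A sketch, where the sample string is my assumption of the tool's output shape, abridged to the fields actually used:

```python
import json

# Abridged sample of what kafka-log-dirs.sh is assumed to emit:
sample = '''
{"brokers": [{"broker": 0, "logDirs": [{"logDir": "/var/kafka-logs",
  "partitions": [{"partition": "test-0", "size": 52428800, "offsetLag": 0},
                 {"partition": "test-1", "size": 31457280, "offsetLag": 0}]}]}]}
'''

def topic_sizes(log_dirs_json):
    """Sum on-disk bytes per topic across brokers and log dirs."""
    sizes = {}
    for broker in json.loads(log_dirs_json)["brokers"]:
        for log_dir in broker["logDirs"]:
            for p in log_dir["partitions"]:
                # Partition names look like "<topic>-<partition number>"
                topic = p["partition"].rsplit("-", 1)[0]
                sizes[topic] = sizes.get(topic, 0) + p["size"]
    return sizes
```

The resulting per-topic byte counts are what you would ship to Datadog on each run.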
That's not exactly true: a topic is never full, at least by default.
I say by default because, as @Mukesh said, the cleanup.policy will discard old segments when their retention time or size limit is reached, but by default there is no size limit, only a time limit, and the property that handles size, retention.bytes, is set to -1 by default.
That leaves only a time limit on messages. Note that retention.bytes is set per partition, so to work out the limit for a topic you have to multiply it by the number of partitions of that topic.
EDIT:
There are tons of metrics that Kafka exports (via JMX), and among those you can find global metrics about segments (total number, per-topic number, size, rate of segment rolling, etc.).

Kafka Producer Quotas

Here is the inbound messaging flow in our IoT platform:
Device ---(MQTT)---> RabbitMQ Broker ---(AMQP)---> Apache Storm ---> Kafka
I'm looking to implement a solution which effectively limits/throttles the amount of data published to Kafka per second on a per-client basis.
The current strategy in place utilizes Guava's RateLimiter, where each device gets its own locally cached instance. When a device message is received, the RateLimiter mapped to that deviceId is fetched from the cache and the tryAcquire() method is invoked. If a permit was successfully acquired, the tuple is forwarded to Kafka as usual; otherwise the quota is exceeded and the message is discarded silently. This method is rather cumbersome and at some point doomed to fail or become a bottleneck.
I've been reading up on Kafka's byte-rate quotas and believe this would work perfectly in our case especially since Kafka clients can be configured dynamically. When a virtual device is created in our platform then a new client.id should be added where client.id == deviceId.
Let's assume the following use case as an example:
Admin creates 2 virtual devices: humidity & temp sensor
A rule is fired to create new user/clientId entries in Kafka for above devices
Set their producer quota values via Kafka CLI
Both devices emit an inbound event message
...?
Here's my question. If using a single Producer instance, is it possible to specify a client.id in the ProducerRecord or somewhere in the Producer prior to calling send()? If a Producer is allowed only a single client.id, does this mean each device must have its own Producer? If only a one-to-one mapping is allowed then would it be wise to cache potentially hundreds, if not thousands, of Producer instances, one for each device? Is there a better approach I'm not aware of yet?
Note: Our platform is an "open door system" meaning clients never get sent back an error response such as "Rate Exceeded" or any error for that matter. It's all transparent to the end user. For this reason, I can't interfere with data in RabbitMQ or re-route messages to different queues.. my only option to integrate this stuff lies in between Storm or Kafka.
You can configure the client.id per application: properties.put("client.id", "humidity") or properties.put("client.id", "temp").
For each client.id you can then set the quota values:
producer_byte_rate=1024, consumer_byte_rate=2048, request_percentage=200
One doubt I have about this configuration (producer_byte_rate=1024, consumer_byte_rate=2048, request_percentage=200): the producer does not seem to pick up the configured quota, while the consumer works properly.
While you can specify client.id on a Producer object, remember that producers are heavyweight, and you might not want to create multiple instances of them (especially on a one-per-device basis).
Regarding reducing the number of producers, have you considered creating one per user rather than per device, or even having a finite shared pool of them? Kafka message headers could then be used to discern which device actually produced the data. The drawback is that you would need to throttle message production on your side, so that one device does not grab all the resources from the other ones.
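A finite shared pool could be sketched like this (the class, sizing, and factory are illustrative; in practice the factory would build a real Kafka producer with a per-slot client.id so broker quotas apply per slot):

```python
class ProducerPool:
    """Maps users onto a fixed number of producer instances, so the count of
    heavyweight producers stays bounded no matter how many devices exist."""
    def __init__(self, size, factory):
        self.producers = [None] * size
        # factory(slot) would create e.g. a producer with client.id "pool-<slot>"
        self.factory = factory

    def for_user(self, user_id):
        slot = hash(user_id) % len(self.producers)
        if self.producers[slot] is None:
            self.producers[slot] = self.factory(slot)  # lazily created
        return self.producers[slot]
```

The device id would then travel in a message header, and any broker-side quota on a slot's client.id is shared by every user hashed onto that slot.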
However, you can limit the users on Kafka broker side, with configuration applying to default user/client:
> bin/kafka-configs.sh --zookeeper localhost:2181 --alter --add-config 'producer_byte_rate=1024,consumer_byte_rate=2048,request_percentage=200' --entity-type clients --entity-default
Updated config for entity: default client-id.
See https://kafka.apache.org/documentation/#design_quotas for more examples and an in-depth explanation.
How the messages are discerned depends on your architecture; possible solutions include:
a topic/partition per user (e.g. data-USERABCDEF)
if you decide to use common topics, then you can put the producer/device data into message headers - https://kafka.apache.org/0110/javadoc/index.html?org/apache/kafka/common/header/Headers.html - or into the payload itself