KAFKA message restriction for a Publisher? - apache-kafka

We are using Kafka 0.10.x. I am looking for a way to stop a Kafka publisher from sending messages once a certain message count/limit is reached in an hour. The goal here is to restrict a user to sending only a certain number of messages per hour/day.
If anyone has come across a similar use case, please share your findings.
Thanks in advance.

Kafka has a few throttling and quota mechanisms but none of them exactly match your requirement to strictly limit a producer based on message count on a daily basis.
From the Apache Kafka 0.11.0.0 documentation at https://kafka.apache.org/documentation/#design_quotas
Kafka cluster has the ability to enforce quotas on requests to control the broker resources used by clients. Two types of client quotas can be enforced by Kafka brokers for each group of clients sharing a quota:
Network bandwidth quotas define byte-rate thresholds (since 0.9)
Request rate quotas define CPU utilization thresholds as a percentage of network and I/O threads (since 0.11)
Client quotas were first introduced in Kafka 0.9.0.0. Rate limits on producers and consumers are enforced to prevent clients from saturating the network or monopolizing broker resources.
See KIP-13 for details: https://cwiki.apache.org/confluence/display/KAFKA/KIP-13+-+Quotas
The quota mechanism introduced in 0.9 was based on the client.id set in the client configuration, which can be changed easily. Ideally, the quota should be set on the authenticated user name so it is not easy to circumvent, so in 0.10.1.0 an additional Authenticated Quota feature was added.
See KIP-55 for details: https://cwiki.apache.org/confluence/display/KAFKA/KIP-55%3A+Secure+Quotas+for+Authenticated+Users
Both the quota mechanisms described above work on data volume (i.e. bandwidth throttling) and not on number of messages nor number of requests. If a client sends lots of small messages or makes lots of requests that return no messages (e.g., a consumer with fetch.min.bytes configured to 0), it can still overwhelm the broker. To address this issue, 0.11.0.0 additionally added support for throttling by request rate.
See KIP-124 for details: https://cwiki.apache.org/confluence/display/KAFKA/KIP-124+-+Request+rate+quotas
With all that as background then, if you know that your producer always publishes messages of a certain size, then you can compute a daily limit expressed in MB and also a rate limit expressed in MB/sec which you can configure as a quota. That's not a perfect fit for your need because a producer might send nothing for 12 hours and then try and send at a faster rate for a short time and the quota would still limit them to a lower publish rate because the limit is enforced per second and not per day.
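To make that concrete, here is a rough back-of-the-envelope sketch (the 1 KB average message size and the 1,000,000 messages/day limit are assumed values, not numbers from the question) of how a daily message-count target could be translated into the per-second byte rate that a quota actually enforces:

    // Back-of-the-envelope conversion from an assumed daily message-count limit
    // to the per-second byte rate that a Kafka quota actually enforces.
    public class QuotaEstimate {
        public static void main(String[] args) {
            long avgMessageBytes = 1024L;        // assumption: ~1 KB per message
            long maxMessagesPerDay = 1_000_000L; // assumption: the desired daily limit

            long bytesPerDay = avgMessageBytes * maxMessagesPerDay; // ~1 GB/day
            long secondsPerDay = 24L * 60 * 60;                     // 86,400 s
            long producerByteRate = bytesPerDay / secondsPerDay;    // ~11,851 bytes/sec

            // This value would be configured as producer_byte_rate. Because it is
            // enforced per second, an idle producer cannot "save up" its daily budget.
            System.out.println("producer_byte_rate ~= " + producerByteRate + " bytes/sec");
        }
    }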
If you don't know the message size, or it varies a lot, then since messages are published using a produce request, you could use request rate throttling to somewhat control the rate at which an authenticated user is allowed to publish messages. But again it would not be a message/day limit, nor even a bandwidth limit, but rather a "CPU utilization threshold as a percentage of network and I/O threads". This helps more for avoiding DoS problems and not really for limiting message counts.
If you would like to see message count quotas or message storage quotas added to Kafka then clearly the Kafka Improvement Proposal (KIP) process works and you are encouraged to submit improvement proposals in this or any other area.
See KIP process for details: https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals

You can make use of broker configs:
message.max.bytes (default: 1000000) – Maximum size of a message the broker will accept. This has to be smaller than the consumer fetch.message.max.bytes, or the broker will have messages that can't be consumed, causing consumers to hang.
log.segment.bytes (default: 1 GB) – Size of a Kafka data file. Make sure it's larger than one message. The default should be fine (i.e. large messages probably shouldn't exceed 1 GB in any case; it's a messaging system, not a file system).
replica.fetch.max.bytes (default: 1 MB) – Maximum size of data that a broker can replicate. This has to be larger than message.max.bytes, or a broker will accept messages and fail to replicate them, leading to potential data loss.
I think you can tweak these configs to do what you want.
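To make the size relationship in that list concrete, here is a minimal producer-side sketch (broker address and topic name are assumed placeholders) that keeps the client's max.request.size at or below the broker's message.max.bytes:

    // Minimal sketch (assumed broker address and topic name) keeping the producer's
    // request-size limit in line with the broker's message.max.bytes listed above.
    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class SizeLimitedProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed address
            props.put("key.serializer", StringSerializer.class.getName());
            props.put("value.serializer", StringSerializer.class.getName());
            // Keep the client-side limit at or below the broker's message.max.bytes
            // (1000000 by default) so oversized records fail fast on the client with
            // a RecordTooLargeException instead of being rejected by the broker.
            props.put("max.request.size", "1000000");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("example-topic", "key", "value"));
            }
        }
    }

Note that these settings bound the size of individual messages, not how many a client may send.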

Related

How to apply back pressure to kafka producer?

Back pressure can help by limiting the queue size, thereby maintaining a high throughput rate and good response times for jobs already in the queue.
In RabbitMQ it can be applied by setting a queue length limit.
How can it be done with Kafka?
Can this be done by keeping a rate limiter (token bucket) between the Kafka producer and broker, where the current bucket size and refill rate are set dynamically from the rate of consumption? For example, a REST API in the producer receiving from the consumer the rate at which the consumer is processing messages.
Load is typically distributed amongst brokers, so back-pressure from the producer client-side may not be necessary.
But you can add quotas per client to throttle their requests.
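There is no built-in producer-side message-rate limiter in the Kafka client, but the token-bucket idea from the question can be sketched on the application side. Everything below (class names, rates, topic, broker address) is invented for illustration and is not a Kafka API:

    // Illustrative token-bucket wrapper around a producer: send() is only called
    // once a token is available, so the publish rate is capped client-side.
    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class ThrottledProducer {

        // Simple token bucket: refillRatePerSec tokens are added per second, up to capacity.
        static class TokenBucket {
            private final double capacity;
            private double refillRatePerSec;
            private double tokens;
            private long lastRefillNanos = System.nanoTime();

            TokenBucket(double capacity, double refillRatePerSec) {
                this.capacity = capacity;
                this.refillRatePerSec = refillRatePerSec;
                this.tokens = capacity;
            }

            // Could be driven by feedback from consumers (e.g. the REST callback the
            // question describes) to match the publish rate to the consumption rate.
            synchronized void setRefillRate(double newRate) {
                this.refillRatePerSec = newRate;
            }

            // Block until one token is available, then consume it.
            void acquire() throws InterruptedException {
                while (!tryAcquire()) {
                    Thread.sleep(10); // wait for the bucket to refill
                }
            }

            private synchronized boolean tryAcquire() {
                long now = System.nanoTime();
                tokens = Math.min(capacity, tokens + (now - lastRefillNanos) / 1e9 * refillRatePerSec);
                lastRefillNanos = now;
                if (tokens >= 1.0) {
                    tokens -= 1.0;
                    return true;
                }
                return false;
            }
        }

        public static void main(String[] args) throws InterruptedException {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed address
            props.put("key.serializer", StringSerializer.class.getName());
            props.put("value.serializer", StringSerializer.class.getName());

            TokenBucket bucket = new TokenBucket(100, 50); // bursts of 100, ~50 msgs/sec sustained
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                for (int i = 0; i < 1000; i++) {
                    bucket.acquire(); // back-pressure point: blocks when the bucket is empty
                    producer.send(new ProducerRecord<>("example-topic", Integer.toString(i), "payload-" + i));
                }
            }
        }
    }

A wrapper like this only limits the producer instance it wraps; broker-side quotas, as mentioned above, remain the only enforcement that individual applications cannot bypass.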

Do 3k Kafka topics decrease performance?

I have a Kafka cluster (using Aiven on AWS):
Kafka Hardware
Startup-2 (2 CPU, 2 GB RAM, 90 GB storage, no backups) 3-node high availability set
Ping between my consumers and the Kafka Broker is 0.7ms.
Background
I have a topic such that:
It contains data about 3000 entities.
Entity lifetime is a week.
Each week there will be a different 3000 entities (on average).
Each entity may have between 15k to 50k messages in total.
There can be at most 500 messages per second.
Architecture
My team built an architecture such that there will be a group of consumers. They will parse this data, perform some transformations (without any filtering!!) and then send the final messages back to Kafka, to topic=<entity-id>.
It means I upload the data back to Kafka, to a topic that contains only the data of a specific entity.
Questions
At any given time, there can be up to 3-4k topics in Kafka (1 topic for each unique entity).
Can my Kafka handle it well? If not, what do I need to change?
Do I need to delete a topic, or is it fine to have (a lot of!!) unused topics over time?
Each consumer which consumes the final messages will consume 100 topics at the same time. I know Kafka clients can consume multiple topics concurrently, but I'm not sure what the best practices are for that.
Please share your concerns.
Requirements
Please focus on the potential problems of this architecture and try not to talk about alternative architectures (fewer topics, more consumers, etc.).
The number of topics is not so important in itself, but each Kafka topic is partitioned and the total number of partitions could impact performance.
The general recommendation from the Apache Kafka community is to have no more than 4,000 partitions per broker (this includes replicas). The linked KIP article explains some of the possible issues you may face if the limit is breached, and with 3,000 topics it would be easy to do so unless you choose a low partition count and/or replication factor for each topic.
Choosing a low partition count for a topic is sometimes not a good idea, because it limits the parallelism of reads and writes, leading to performance bottlenecks for your clients.
Choosing a low replication factor for a topic is also sometimes not a good idea, because it increases the chance of data loss upon failure.
Generally it's fine to have unused topics on the cluster but be aware that there is still a performance impact for the cluster to manage the metadata for all these partitions and some operations will still take longer than if the topics were not there at all.
There is also a per-cluster limit but that is much higher (200,000 partitions). So your architecture might be better served simply by increasing the node count of your cluster.
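As a rough illustration (the partition count per topic and the replication factor below are assumptions, not numbers from the question), the per-broker partition load for this kind of setup could be estimated like this:

    // Rough partition-count estimate for the setup described above. The partition
    // count per topic and the replication factor are assumed values for illustration.
    public class PartitionEstimate {
        public static void main(String[] args) {
            int topics = 4000;          // upper end of "3-4k topics"
            int partitionsPerTopic = 1; // assumed
            int replicationFactor = 3;  // assumed, matching the 3-node cluster
            int brokers = 3;

            int totalPartitions = topics * partitionsPerTopic * replicationFactor; // 12,000
            int partitionsPerBroker = totalPartitions / brokers;                   // ~4,000

            // ~4,000 partition replicas per broker sits right at the community
            // recommendation, and well under the ~200,000 per-cluster guideline.
            System.out.println("Total partition replicas: " + totalPartitions);
            System.out.println("Per broker: " + partitionsPerBroker);
        }
    }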

Kafka Byte Rate Quota for a client having multiple producers or consumers

I had one question on the byte-rate quota management in Confluent Kafka. When we use a config like:
kafka-configs --zookeeper host1:2181,host2:2181,host3:2181 --alter --add-config 'producer_byte_rate=1024,consumer_byte_rate=2048,request_percentage=50' --entity-type clients --entity-name client1
I have understood that if, say, request_percentage is 50, then each client will get 50% of the quota window for request handler and network threads.
In a scenario where there are 5 applications using the same client ID client1 for producing and consuming from the cluster, how would the producer_byte_rate, consumer_byte_rate and request_percentage parameters come into play?
Would the quota window get uniformly divided into 5 slices of 10% each?
Would the producer byte rate and consumer byte rate also get divided equally among the 5 producers and consumers?
When you define a client ID as a quota group, no matter how many applications are configured with that value, as the official documentation says, "all connections of a quota group share the quota configured for the group". So, there are no slices of quotas among the applications using the same client ID, and as soon as the quota value has been reached, all applications will get throttled in that quota window.
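In other words, the quota group is keyed purely on the configured client ID; here is a minimal sketch (broker address is an assumed placeholder) of the one property those five applications have in common:

    // Minimal sketch: any application whose client configuration carries
    // client.id=client1 joins the same quota group and shares the single
    // producer_byte_rate / consumer_byte_rate / request_percentage quota.
    import java.util.Properties;

    public class QuotaGroupMember {
        public static Properties baseProps() {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed address
            props.put("client.id", "client1"); // all 5 applications set this same value
            return props;
        }
    }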

Reducing network requests within Kafka cluster

Our team uses Kafka as our message queuing tech, and due to the high volume of small, fast messages, throttling is kicking in and causing a variety of issues within our producers.
I'm looking to reduce the number of requests being made to the broker in an attempt to remove this throttling. The main things I've looked into are the batch.size and linger.ms settings applied to each producer within the pipeline; I have set them to about 1 MB and 500 ms respectively. I'm just curious what things I should be looking into to fix this.
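For reference, here is a minimal sketch of the producer batching settings described above (the broker address is an assumed placeholder; the values mirror the ones in the question). Larger batches and a longer linger mean fewer, bigger produce requests:

    // Sketch of the producer batching settings described above. Broker address is
    // assumed; the batch.size / linger.ms values mirror the ones in the question.
    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class BatchingProducerConfig {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed address
            props.put("key.serializer", StringSerializer.class.getName());
            props.put("value.serializer", StringSerializer.class.getName());
            props.put("batch.size", "1048576"); // ~1 MB per partition batch, as in the question
            props.put("linger.ms", "500");      // wait up to 500 ms to fill a batch

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // ... send records; batches are flushed when full or after linger.ms
            }
        }
    }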

Does Kafka scale well for big number of clients?

I know Kafka can handle tons of traffic. However, how well does it scale for a large number of concurrent clients?
Each client would have their own unique group_id (and as a consequence Kafka would be keeping track of each one's offsets).
Would that be an issue for Kafka 0.9+ with offsets stored internally?
Would that be an issue for Kafka 0.8 with offsets stored in Zookeeper?
Some Kafka users such as LinkedIn have reported in the past that a single Kafka broker can support ~10K client connections. This number may vary depending on hardware, configuration, etc.
As long as the request rate is not too high, the limiting factor is probably just the open-file-descriptors limit as configured in the operating system; see e.g. http://docs.confluent.io/current/kafka/deployment.html#file-descriptors-and-mmap for more information.