Reducing network requests within Kafka cluster

Our team uses Kafka as our message queuing tech, and due to the high volume of small, fast messages, throttling is kicking in and causing a variety of issues within our producers.
I'm looking to reduce the number of requests made to the broker in an attempt to remove this throttling. The main things I've looked into are the batch.size and linger.ms settings applied to each producer in the pipeline, which I've set to about 1 MB and 500 ms respectively. I'm just curious what else I should be looking into to fix this.
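Here's roughly what that looks like, as a minimal sketch with the standard Java client (the broker address and topic name are placeholders; compression.type goes beyond what's described above and is included as an assumption worth testing, since it shrinks each batch on the wire):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class BatchingProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
            // Accumulate up to 1 MB per partition before sending a request...
            props.put(ProducerConfig.BATCH_SIZE_CONFIG, 1_048_576);
            // ...or wait at most 500 ms for the batch to fill, whichever comes first.
            props.put(ProducerConfig.LINGER_MS_CONFIG, 500);
            // Compress whole batches to cut bytes on the wire even further.
            props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("events", "key", "value")); // hypothetical topic
            }
        }
    }

Note that batch.size is only a ceiling; linger.ms is what actually forces small, frequent sends to wait and coalesce.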

Related

Consuming messages in a Kafka topic ASAP

Imagine a scenario in which a producer is producing 100 messages per second, and we're working on a system where consuming messages ASAP matters a lot; even a 5-second delay might mean we decide not to handle that message anymore. Also, the order of messages does not matter.
So I don't want a basic queue with a single pod listening on a single partition to consume messages, since consuming a message requires the consumer to make multiple remote API calls, and this might take time.
In such a scenario, I'm thinking of a single Kafka topic with 100 partitions, and for each partition a separate machine (pod) listening on it, covering partitions 0 to 99.
Am I thinking about this right? This is my first project with Kafka, and the design seems a little weird to me.
For your use case, think of partitions as the maximum number of instances of the service consuming data. If you'll only ever have 8 instances, don't create extra partitions: they will have a negative impact if consumers need to be rebalanced and probably won't give you any performance improvement. Also, 100 messages/s is very, very little; you can make this work with almost any technology.
To get the maximum performance I would suggest:
Use a round-robin partitioner
Find a parallel consumer implementation for your platform (for the JVM)
And there are a few producer and consumer properties that you'll need to change, but they depend on your environment: for example batch.size, linger.ms, etc. I would also question the need for acks=all, as it might be OK for you to lose data if a broker dies, given that old data is of no use to you; see the sketch below.
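A sketch of those producer settings with the Java client. The values are illustrative, not recommendations; RoundRobinPartitioner ships with Kafka 2.4+, and acks=1 here stands in for relaxing acks=all, on the assumption that losing stale data is acceptable:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.RoundRobinPartitioner;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class ProducerTuning {
        public static Properties producerProps() {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
            // Spread keyless records evenly across all partitions.
            props.put(ProducerConfig.PARTITIONER_CLASS_CONFIG, RoundRobinPartitioner.class);
            // Ack from the partition leader only: lower latency, but data is lost
            // if the leader dies before replication - fine if stale data is useless anyway.
            props.put(ProducerConfig.ACKS_CONFIG, "1");
            props.put(ProducerConfig.BATCH_SIZE_CONFIG, 16_384); // example values; tune
            props.put(ProducerConfig.LINGER_MS_CONFIG, 5);       // for your environment
            return props;
        }
    }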
One warning: in Java, the standard Kafka consumer is single-threaded. This surprises many people, and I'm not sure if the same is true for other platforms. So having hundreds of partitions won't give any performance benefit with these consumers, and that's why it's important to use a parallel consumer.
One more warning: Kafka is a complex broker. It's trivial to start using it, but it's a very bumpy journey to use it correctly.
And a note: one of the benefits of Kafka is that it keeps messages rather than deleting them once they are consumed. If messages older than 5 seconds are useless to you, Kafka might be the wrong technology, and a more traditional broker might be easier (ActiveMQ, RabbitMQ, or blazing-fast ones like ZeroMQ).
Your bottleneck is your application processing the event, not Kafka.
When you have ten consumers, there is overhead in connecting each consumer to Kafka, which will lower performance.
I advise focusing on your application's performance rather than on the message broker.
Kafka's p99 latency is 5 ms under a 200 MB/s load.
https://developer.confluent.io/learn/kafka-performance/

Dealing with Kafka's exactly-once processing edge cases

Folks,
Trying to do a POC for processing messages using Kafka, for an implementation that absolutely requires exactly-once processing. Example: as a payment system, process a credit card transaction exactly once.
What edge cases should we protect against?
One failure scenario covered here is:
1.) If a consumer fails, and does not commit that it has read through a particular offset, the message will be read again.
Let's say consumers live in Kubernetes pods, and one of the hosts goes offline. We will potentially have messages that have been processed but not marked as processed in Kafka before the pods went away due to an underlying hardware issue. Do I understand this failure scenario correctly?
Are there other failure scenarios we need to fully understand on the producer/consumer side when relying on Kafka for exactly-once processing?
Thanks!
I'm going to basically repeat and expand on an answer I gave here:
A few scenarios can result in duplication:
Consumers only periodically checkpoint their positions; a consumer crash can result in duplicate processing of some range of records.
Producers have client-side timeouts, which means a producer may think a request timed out and re-transmit it when, broker-side, it actually succeeded.
If you mirror data between Kafka clusters, that's usually done with a producer + consumer pair of some sort, which can lead to more duplication.
There are also scenarios that end in data loss: look up "unclean leader election" (disabling it trades off availability).
Also, Kafka's "exactly once" configurations only work if all your inputs, outputs, and side effects happen on the same Kafka cluster, which often makes them of limited use in real life.
There are a few Kafka features you could try to reduce the likelihood of this happening to you (a configuration sketch follows the list):
Set enable.idempotence to true in your producer configs (see https://kafka.apache.org/documentation/#producerconfigs) - incurs some overhead
Use transactions when producing - incurs overhead and adds latency
Set transactional.id on the producer in case you fail over across machines - gets complicated to manage at scale
Set isolation.level to read_committed on the consumer - adds latency (needs to be done in combination with 2 above)
Shorten auto.commit.interval.ms on the consumer - this just narrows the window of duplication; it doesn't really solve anything and incurs overhead at really low values
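Here's a minimal sketch of those knobs together with the Java client; the broker address, topic, group, and transactional id are all placeholders:

    import java.time.Duration;
    import java.util.List;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class ExactlyOnceSketch {
        public static void main(String[] args) {
            Properties p = new Properties();
            p.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
            p.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
            p.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
            // enable.idempotence: the broker de-dupes retried produce requests.
            p.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, true);
            // transactional.id: lets the broker fence zombie producers after a failover.
            p.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "payments-poc-1");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
                producer.initTransactions();
                producer.beginTransaction();
                producer.send(new ProducerRecord<>("payments", "txn-42", "charge"));
                producer.commitTransaction(); // or abortTransaction() on error
            }

            Properties c = new Properties();
            c.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            c.put(ConsumerConfig.GROUP_ID_CONFIG, "payments-processor");
            c.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
            c.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
            // read_committed: only hand back records from committed transactions.
            c.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed");
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(c)) {
                consumer.subscribe(List.of("payments"));
                consumer.poll(Duration.ofMillis(500))
                        .forEach(r -> System.out.println(r.key() + " -> " + r.value()));
            }
        }
    }

And remember: none of this covers side effects outside the cluster. Actually charging the card still needs its own idempotency key on your side.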
I have to say that, as someone who's been maintaining a VERY large Kafka installation for the past few years, I'd never use a bank that relied on Kafka for its core transaction processing, though ...

Kafka message restriction for a publisher?

We are using Kafka 0.10.x. I am looking for a way to stop a Kafka publisher from sending messages once a certain message count/limit is reached in an hour. The goal here is to restrict a user to sending only a certain number of messages per hour/day.
If anyone has come across a similar use case, please share your findings.
Thanks in advance.
Kafka has a few throttling and quota mechanisms but none of them exactly match your requirement to strictly limit a producer based on message count on a daily basis.
From the Apache Kafka 0.11.0.0 documentation at https://kafka.apache.org/documentation/#design_quotas
Kafka cluster has the ability to enforce quotas on requests to control the broker resources used by clients. Two types of client quotas can be enforced by Kafka brokers for each group of clients sharing a quota:
Network bandwidth quotas define byte-rate thresholds (since 0.9)
Request rate quotas define CPU utilization thresholds as a percentage of network and I/O threads (since 0.11)
Client quotas were first introduced in Kafka 0.9.0.0. Rate limits on producers and consumers are enforced to prevent clients from saturating the network or monopolizing broker resources.
See KIP-13 for details: https://cwiki.apache.org/confluence/display/KAFKA/KIP-13+-+Quotas
The quota mechanism introduced in 0.9 was based on the client.id set in the client configuration, which can be changed easily. Ideally, quotas should be set on the authenticated user name so they are not easy to circumvent, so in 0.10.1.0 an additional authenticated-user quota feature was added.
See KIP-55 for details: https://cwiki.apache.org/confluence/display/KAFKA/KIP-55%3A+Secure+Quotas+for+Authenticated+Users
Both of the quota mechanisms described above work on data volume (i.e., bandwidth throttling) and not on the number of messages or the number of requests. If a client sends lots of small messages or makes lots of requests that return no messages (e.g., a consumer with fetch.min.bytes configured to 0), it can still overwhelm the broker. To address this issue, 0.11.0.0 additionally added support for throttling by request rate.
See KIP-124 for details: https://cwiki.apache.org/confluence/display/KAFKA/KIP-124+-+Request+rate+quotas
With all that as background, if you know that your producer always publishes messages of a certain size, then you can compute a daily limit expressed in MB, and also a rate limit expressed in MB/sec, which you can configure as a quota. That's not a perfect fit for your need, because a producer might send nothing for 12 hours and then try to send at a faster rate for a short time, and the quota would still hold them to the lower publish rate, since the limit is enforced per second and not per day.
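For completeness: on much newer clusters (Kafka 2.6+, well past the 0.10.x/0.11.0.0 versions discussed here) that byte-rate quota can also be set programmatically through the Admin API. A hypothetical sketch, with the principal name and rate as placeholder values:

    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.Admin;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.common.quota.ClientQuotaAlteration;
    import org.apache.kafka.common.quota.ClientQuotaEntity;

    public class ProducerQuotaSketch {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

            try (Admin admin = Admin.create(props)) {
                // Key the quota on the authenticated user (the KIP-55-style quota),
                // which is harder to circumvent than client.id.
                ClientQuotaEntity user = new ClientQuotaEntity(
                        Map.of(ClientQuotaEntity.USER, "alice")); // hypothetical principal
                // 1 MB/s produce ceiling, derived from the expected message size.
                ClientQuotaAlteration limit = new ClientQuotaAlteration(
                        user,
                        List.of(new ClientQuotaAlteration.Op("producer_byte_rate", 1_048_576.0)));
                admin.alterClientQuotas(List.of(limit)).all().get();
            }
        }
    }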
If you don't know the message size, or it varies a lot, then since messages are published using a produce request, you could use request rate throttling to somewhat control the rate at which an authenticated user is allowed to publish messages. But again, it would not be a message/day limit, nor even a bandwidth limit, but rather a "CPU utilization threshold as a percentage of network and I/O threads". This helps more with avoiding DoS problems than with limiting message counts.
If you would like to see message count quotas or message storage quotas added to Kafka then clearly the Kafka Improvement Proposal (KIP) process works and you are encouraged to submit improvement proposals in this or any other area.
See KIP process for details: https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals
You can make use of these broker configs:
message.max.bytes (default: 1000000) – maximum size of a message the broker will accept. This has to be smaller than the consumer fetch.message.max.bytes, or the broker will have messages that can't be consumed, causing consumers to hang.
log.segment.bytes (default: 1 GB) – size of a Kafka data file. Make sure it's larger than one message. The default should be fine (large messages probably shouldn't exceed 1 GB in any case; it's a messaging system, not a file system).
replica.fetch.max.bytes (default: 1 MB) – maximum size of data that a broker can replicate. This has to be larger than message.max.bytes, or a broker will accept messages and fail to replicate them, leading to potential data loss.
I think you can tweak these configs to do what you want.

What makes Kafka high in throughput?

Most articles depict Kafka as having better read/write throughput than other message brokers (MBs) like ActiveMQ. Per my understanding, reading/writing with the help of offsets makes it faster, but I am not clear on how the offset makes it faster.
After reading about Kafka's architecture, I have some understanding, but it is not clear to me what makes Kafka scalable and high in throughput, based on the points below:
Probably with the offset, the client knows exactly which message it needs to read, which may be one factor making it high in performance.
In the case of other MBs, the broker needs to coordinate among consumers so that a message is delivered to only one consumer. But that is the case for queues only, not for topics. So what makes a Kafka topic faster than another MB's topic?
Kafka provides partitioning for scalability, but other MBs like ActiveMQ also provide clustering. So how is Kafka better for big data/high loads?
In other MBs we can have listeners, so as soon as a message comes in, the broker delivers it; in Kafka's case we need to poll. Doesn't that mean more load on both the broker and client side?
Lots of details on what makes Kafka different from and faster than other messaging systems are in Jay Kreps' blog post here:
https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines
There are actually a lot of differences that make Kafka perform well, including but not limited to (a consumer-side sketch follows the list):
Maximized use of sequential disk reads and writes
Zero-copy processing of messages
Use of Linux OS page cache rather than Java heap for caching
Partitioning of topics across multiple brokers in a cluster
Smart client libraries that offload certain functions from the brokers
Batching of multiple published messages to yield less frequent network round trips to the broker
Support for multiple in-flight messages
Prefetching data into client buffers for faster subsequent requests.
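To make the batching and prefetching items concrete, here's a sketch of the consumer-side knobs involved, assuming the Java client; the values are illustrative only:

    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;

    public class FetchTuning {
        public static Properties consumerProps() {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
            // Let the broker hold the fetch response until at least 64 KB have
            // accumulated, or 500 ms have passed - fewer, larger round trips.
            props.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, 65_536);
            props.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, 500);
            // Prefetch up to 1 MB per partition into the client's buffer, so the
            // next poll() is usually served from memory instead of the network.
            props.put(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, 1_048_576);
            return props;
        }
    }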
It's largely marketing that Kafka is fast for a message broker. For example, IBM MessageSight appliances did 13M msgs/sec with microsecond latency in 2013. On one machine. A year before Kreps even started the GitHub project:
https://www.zdnet.com/article/ibm-launches-messagesight-appliance-aimed-at-m2m/
Kafka is good for a lot of things. True low-latency messaging is not one of them. You flatly can't use batch delivery (e.g., a range of offsets) in any purely latency-centric environment. When an event arrives, delivery must be attempted immediately if you want the lowest latency. That doesn't mean waiting around for a couple of seconds to batch-read a block of events, or enduring the overhead of requesting every message. Try using Kafka with an offset range of 1 (so: 1 message) if you want to compare it to a normal push-based broker, and you'll see what I mean.
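If you want to run that comparison with the Java consumer, something like this approximates one-message-at-a-time delivery. A sketch under the assumption that you also shrink the fetch settings, since max.poll.records alone only caps what poll() hands back, not what the client fetches ahead:

    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;

    public class OneAtATime {
        public static void tune(Properties props) {
            // Hand back a single record per poll()...
            props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 1);
            // ...and tell the broker not to wait around accumulating a batch.
            props.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, 1);
            props.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, 0);
        }
    }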
Instead, I recommend focusing on the thing pull-based stream buffering does give you:
Replayability!!!
Personally, I think this makes downstream data engineering systems a bit easier to build in the face of failure, particularly since you don't have to rely on their built-in replication models (if they even have one). For example, it's very easy for me to consume messages, lose the disks, restore the machine, and replay the lost data. The data streams become the single source of truth against which other systems can synchronize and this is exceptionally useful!!!
There's no free lunch in messaging, pull and push each have their advantages and disadvantages vs. each other. It might not surprise you that people have also tried push-pull messaging and it's no free lunch either :).

How many Producers can I use to write to a single topic

I have a web application which puts messages into a Kafka topic. There are a lot of instances of this application (200), and each of them contains its own Kafka producer.
Questions:
Is there any upper bound on the number of producers per topic?
Does the number of producers impact Kafka's performance? If yes, how?
What is the best practice for producers: one synchronous producer per application, an asynchronous producer, or a custom pool of sync producers?
Is there any upper bound on the number of producers per topic?
The only limitation I am aware of is the number of available IP addresses. It is unlikely you'd bump into any practical limit in your described application.
Does the number of producers impact Kafka's performance? If yes, how?
No, all other things being equal (traffic volume, asynchronous vs. synchronous (including batch size / time constraints), etc.).
Presumably there's some overhead somewhere for each connection, but it's small enough that I've never managed to notice it.
What is the best practice for producers (one sync producer per application, an async producer, or a custom pool of sync producers)?
Depends a whole bunch on your use case, which I am not clear on. For the most part, asynchronous > synchronous. If you choose asynchronous, then you have to deal with the risks of batching on the producers (i.e., data loss) and the delays associated with building up enough messages for a batch / waiting for the batch timeout to trigger. Those delays could be significant if your use case is sufficiently demanding.
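For what it's worth, with the Java client the sync/async distinction is just how you treat the Future returned by send(); a quick sketch, with the topic and values as placeholders:

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.clients.producer.RecordMetadata;

    public class SendStyles {
        static void demo(KafkaProducer<String, String> producer) throws Exception {
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("events", "key", "value"); // hypothetical topic

            // Synchronous: block on every send. Simple, with immediate error
            // handling, but throughput is capped by round-trip latency.
            RecordMetadata meta = producer.send(record).get();
            System.out.println("written at offset " + meta.offset());

            // Asynchronous: let the producer batch in the background and handle
            // the outcome in a callback. This is where the data-loss risk
            // mentioned above lives, if the process dies with a batch in flight.
            producer.send(record, (metadata, exception) -> {
                if (exception != null) {
                    exception.printStackTrace(); // log, retry, or dead-letter here
                }
            });
        }
    }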