Handling a Large Kafka topic

Handling a Large Kafka topic - apache-kafka

I have a very very large(count of messages) Kafka topic, it might have more than 20M message per second, but, message size is small, it's just some plain text, each less than 1KB, I can use several partitions per topic, and also I can use several servers to work on one topic and they will consume one of the partitions in the topic...
what if I need +100 servers for a huge topic?
Is it logical to create +100 partitions or more on a single topic?

You should define "large" when mentioning Kafka topics:
Large means huge data in terms of volume size.
Message size is large that it takes time sending a message from queue to client for processing?
Intensive write to that topic? In that case, do you need to process read as fast as possible? (i.e: can we delay process data for about 1 hour)
...
In either case, you should better think on the consumer side for a better design topic and partition. For instances:
Processing time for each message is slow, and it better process fast between messages: In that case, you should create many partitions. It is like a load balancer and server relationship, you create many workers for doing your job.
If only some message types, the time processing is slow, you should consider moving to a new topic. There is a nice article: Should you put several event types in the same Kafka topic explains this decision.
Is the order of messages important? for example, message A happens before message B, message A should be processed first. In this case, you should make all messages of the same type going to the same partition (only the same partition can maintain message order), or move to a separate topic (with a single partition).
...
After you have a proper design for topic and partition, it is come to question: how many partitions should you have for each topic. Increasing total partitions will increase your throughput, but at the same time, it will affect availability or latency. There are some good topics here and here that explain carefully how will total partitions per topic affect the performance. In my opinion, you should benchmark directly on your system to choose the correct value. It depends on many factors of your system: processing power of server machine, network capacity, memory ...
And the last part, you don't need 100 servers for 100 partitions. Kafka will try to balance all partitions between servers, but it is just optional. For example, if you have 1 topic with 7 partitions running on 3 servers, there will be 2 servers store 2 partitions each and 1 server stores 3 partitions. (so 2*2 + 3*1 = 7). In the newer version of Kafka, the mapping between partition and server information will be stored on the zookeeper.

you will get better help, if you are more specific and provide some numbers like what is your expected load per second and what is each message size etc,
in general Kafka is pretty powerful and behind the seances it writes the data to buffer and periodically flush the data to disk. and as per the benchmark done by confluent a while back, Kafka cluster with 6 node supports around 0.8 million messages per second below is bench marking pic

Our friends were right, I refer you to this book
Kafka, The Definitive Guide
by Neha Narkhede, Gwen Shapira & Todd Palino
You can find the answer on page 47
How to Choose the Number of Partitions
There are several factors to consider when choosing the number of
partitions:
What is the throughput you expect to achieve for the topic?
For example, do you expect to write 100 KB per second or 1 GB per
second?
What is the maximum throughput you expect to achieve when consuming from a single partition? You will always have, at most, one consumer
reading from a partition, so if you know that your slower consumer
writes the data to a database and this database never handles more
than 50 MB per second from each thread writing to it, then you know
you are limited to 60MB throughput when consuming from a partition.
You can go through the same exercise to estimate the maxi mum throughput per producer for a single partition, but since producers
are typically much faster than consumers, it is usu‐ ally safe to skip
this.
If you are sending messages to partitions based on keys, adding partitions later can be very challenging, so calculate throughput
based on your expected future usage, not the cur‐ rent usage.
Consider the number of partitions you will place on each broker and available diskspace and network bandwidth per broker.
Avoid overestimating, as each partition uses memory and other resources on the broker and will increase the time for leader
elections. With all this in mind, it’s clear that you want many
partitions but not too many. If you have some estimate regarding the
target throughput of the topic and the expected throughput of the con‐
sumers, you can divide the target throughput by the expected con‐
sumer throughput and derive the number of partitions this way. So if I
want to be able to write and read 1 GB/sec from a topic, and I know
each consumer can only process 50 MB/s, then I know I need at least 20
partitions. This way, I can have 20 consumers reading from the topic
and achieve 1 GB/sec. If you don’t have this detailed information, our
experience suggests that limiting the size of the partition on the
disk to less than 6 GB per day of retention often gives satisfactory
results.

Related

Are 3k kafka topics decrease performance?

I have a Kafka Cluster (Using Aivan on AWS):
Kafka Hardware
Startup-2 (2 CPU, 2 GB RAM, 90 GB storage, no backups) 3-node high availability set
Ping between my consumers and the Kafka Broker is 0.7ms.
Backgroup
I have a topic such that:
It contains data about 3000 entities.
Entity lifetime is a week.
Each week there will be different 3000 entities (on avg).
Each entity may have between 15k to 50k messages in total.
There can be at most 500 messages per second.
Architecture
My team built an architecture such that there will be a group of consumers. They will parse this data, perform some transformations (without any filtering!!) and then sends the final messages back to the kafka to topic=<entity-id>.
It means I upload the data back to the kafka to a topic that contains only a data of a specific entity.
Questions
At any given time, there can be up to 3-4k topics in kafka (1 topic for each unique entity).
Can my kafka handle it well? If not, what do I need to change?
Do I need to delete a topic or it's fine to have (alot of!!) unused topics over time?
Each consumer which consumes the final messages, will consume 100 topics at the same time. I know kafka clients can consume multiple topics concurrenctly but I'm not sure what is the best practices for that.
Please share your concerns.
Requirements
Please focus on the potential problems of this architecture and try not to talk about alternative architectures (less topics, more consumers, etc).

The number of topics is not so important in itself, but each Kafka topic is partitioned and the total number of partitions could impact performance.
The general recommendation from the Apache Kafka community is to have no more than 4,000 partitions per broker (this includes replicas). The linked KIP article explains some of the possible issues you may face if the limit is breached, and with 3,000 topics it would be easy to do so unless you choose a low partition count and/or replication factor for each topic.
Choosing a low partition count for a topic is sometimes not a good idea, because it limits the parallelism of reads and writes, leading to performance bottlenecks for your clients.
Choosing a low replication factor for a topic is also sometimes not a good idea, because it increases the chance of data loss upon failure.
Generally it's fine to have unused topics on the cluster but be aware that there is still a performance impact for the cluster to manage the metadata for all these partitions and some operations will still take longer than if the topics were not there at all.
There is also a per-cluster limit but that is much higher (200,000 partitions). So your architecture might be better served simply by increasing the node count of your cluster.

Ideal number of partitions for Kafka topic

I am currently working on a setup which has 6 kafka-brokers, Data is being pushed into my topic from two producers at a rate of about 4000 messages per second, I have 5 Consumers for this topic working as a group. What should be the ideal number of partitions of my kafka topic?
Please feel free to tell me if any change is required in brokers/consumers/producers as well.

In general more the partitions - more the throughput. However there are other considerations too like the limits of hardware you are running on, whether you are using compression etc. There is a good enough information from Confluent here which provides you insight into rough calculation you can use to arrive at number of partitions.
A rough formula for picking the number of partitions is based on
throughput. You measure the throughout that you can achieve on a
single partition for production (call it p) and consumption (call it
c). Let’s say your target throughput is t. Then you need to have at
least max(t/p, t/c) partitions. The per-partition throughput that one
can achieve on the producer depends on configurations such as the
batching size, compression codec, type of acknowledgement, replication
factor, etc.
Moreover for consumer
The consumer throughput is often application dependent since it
corresponds to how fast the consumer logic can process each message
So the best way is to measure and benchmark for your own use case

Kafka: Is our number of partitions insane?

We have a 3 host Kafka cluster. We have 136 topics, each of which has 100 partitions, with a replication factor of 3. This makes for 13,600 partitions across our cluster.
Is this a sane configuration of our topics?

It's too many. You should ask yourself if you have (or plan to have soon) enough consumer instances to need that many partitions. Then, if you do plan to have 13k consumer instances, what sort of hardware are you running these brokers on such that they would be able to serve that many consumers? That's even before your consider the additional impact of many partitions pre-1.1 https://www.confluent.io/blog/how-to-choose-the-number-of-topicspartitions-in-a-kafka-cluster/
This to me looks like 100 was a round number and seemed future proof. I'd suggest starting at a much lower number per topic (like say 2 or 10) and see if you actually hit scale issues that demand more partitions before trying to jump to expert mode. You can always add more partitions later.

The short answer to your question is 'It depends'.
More partitions in a Kafka cluster leads to higher throughput however, you need to be aware that the number of partitions has an impact on availability and latency.
In general more partitions,
Lead to Higher Throughput
Require More Open File Handles
May Increase Unavailability
May Increase End-to-end Latency
May Require More Memory In the Client
You need to study the trade-offs and make sure that you've picked the number of partitions that satisfies your requirements regarding throughput, latency and required resources.
For further details refer to this blog post from Confluent.

Partitions = max(NP, NC)
where:
NP is the number of required producers determined by calculating: TT/TP.
NC is the number of required consumers determined by calculating: TT/TC.
TT is the total expected throughput for our system.
TP is the max throughput of a single producer to a single partition.
TC is the max throughput of a single consumer from a single partition.

How to choose the no of partitions for a kafka topic?

We have 3 zk nodes cluster and 7 brokers. Now we have to create a topic and have to create partitions for this topic.
But I did not find any formula to decide that how much partitions should I create for this topic.
Rate of producer is 5k messages/sec and size of each message is 130 Bytes.
Thanks In Advance

I can't give you a definitive answer, there are many patterns and constraints that can affect the answer, but here are some of the things you might want to take into account:
The unit of parallelism is the partition, so if you know the average processing time per message, then you should be able to calculate the number of partitions required to keep up. For example if each message takes 100ms to process and you receive 5k a second then you'll need at least 50 partitions. Add a percentage more that that to cope with peaks and variable infrastructure performance. Queuing Theory can give you the math to calculate your parallelism needs.
How bursty is your traffic and what latency constraints do you have? Considering the last point, if you also have latency requirements then you may need to scale out your partitions to cope with your peak rate of traffic.
If you use any data locality patterns or require ordering of messages then you need to consider future traffic growth. For example, you deal with customer data and use your customer id as a partition key, and depend on each customer always being routed to the same partition. Perhaps for event sourcing or simply to ensure each change is applied in the right order. Well, if you add new partitions later on to cope with a higher rate of messages, then each customer will likely be routed to a different partition now. This can introduce a few headaches regarding guaranteed message ordering as a customer exists on two partitions. So you want to create enough partitions for future growth.
Just remember that is easy to scale out and in consumers, but partitions need some planning, so go on the safe side and be future proof.
Having thousands of partitions can increase overall latency.

This old benchmark by Kafka co-founder is pretty nice to understand the magnitudes of scale - https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines
The immediate conclusion from this, like Vanlightly said here, is that the consumer handling time is the most important factor in deciding on number of partition (since you are not close to challenge the producer throughput).
maximal concurrency for consuming is the number of partitions, so you want to make sure that:
((processing time for one message in seconds x number of msgs per second) / num of partitions) << 1
if it equals to 1, you cannot read faster than writing, and this is without mentioning bursts of messages and failures\downtime of consumers. so you will need to it to be significantly lower than 1, how significant depends on the latency that your system can endure.

It depends on your required throughput, cluster size, hardware specifications:
There is a clear blog about this written by Jun Rao from Confluent:
How to choose the number of topics/partitions in a Kafka cluster?
Also this might be helpful to have an insight:
Apache Kafka Supports 200K Partitions Per Cluster

Partitions = max(NP, NC)
where:
NP is the number of required producers determined by calculating: TT/TP
NC is the number of required consumers determined by calculating: TT/TC
TT is the total expected throughput for our system
TP is the max throughput of a single producer to a single partition
TC is the max throughput of a single consumer from a single partition

For example, if you want to be able to read 1000MB/sec, but your consumer is only able process 50 MB/sec, then you need at least 20 partitions and 20 consumers in the consumer group. Similarly, if you want to achieve the same for producers, and 1 producer can only write at 100 MB/sec, you need 10 partitions. In this case, if you have 20 partitions, you can maintain 1 GB/sec for producing and consuming messages. You should adjust the exact number of partitions to number of consumers or producers, so that each consumer and producer achieve their target throughput.
So a simple formula could be:
#Partitions = max(NP, NC)
where:
NP is the number of required producers determined by calculating: TT/TP
NC is the number of required consumers determined by calculating: TT/TC
TT is the total expected throughput for our system
TP is the max throughput of a single producer to a single partition
TC is the max throughput of a single consumer from a single partition
source : https://docs.cloudera.com/runtime/7.2.10/kafka-performance-tuning/topics/kafka-tune-sizing-partition-number.html

You could choose the no of partitions equal to maximum of {throughput/#producer ; throughput/#consumer}. The throughput is calculated by message volume per second. Here you have:
Throughput = 5k * 130bytes = 650MB/s

Is there a way to further parallelize kstreams aside from partitions?

I understand that the fundamental approach to parallelization with kafka is to utilize partitioning. However, I have a special situation in that I have to leverage an existing infrastructure that only has 6 partitions, and I need to process millions and millions of records per second.
Is there a way to further optimize in a way that I could have each kstream consumer read and equally distribute load at the same time from a single partition?

The simplest way is to create a "helper" topic with the desired number of partitions. This topic can be configured with a very short retention time, because the original data is safely stored in the actual input topic. You use this helper topic to route all data through it and thus allow for more parallelism downstream:
builder.stream("input-topic")
.through("helper-topic-with-many-partitions")
... // actual processing

Partitions are the level of parallelization. With 6 partitions - you could maximum have 6 instances (of kstream) consuming data. If each instance is in a separate machine i.e. with 1 GBps network each, you could be reading in total with 600 Mbytes / sec
If that's not enough, you'd need to repartition data
Now for distributing your processing, you would need to run each kstream (with the same consumer group) on a different machine
Here's a short video that demonstrates how Kafka Streams (via Kafka SQL) are parallelized to 5 processes https://www.youtube.com/watch?v=denwxORF3pU
It all depends on partitions & executors. With 6 partitions, I usually can achieve 500K+ messages / second, depending on the complexity of the processing of course

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse