Are 3k kafka topics decrease performance? - apache-kafka

I have a Kafka Cluster (Using Aivan on AWS):
Kafka Hardware
Startup-2 (2 CPU, 2 GB RAM, 90 GB storage, no backups) 3-node high availability set
Ping between my consumers and the Kafka Broker is 0.7ms.
Backgroup
I have a topic such that:
It contains data about 3000 entities.
Entity lifetime is a week.
Each week there will be different 3000 entities (on avg).
Each entity may have between 15k to 50k messages in total.
There can be at most 500 messages per second.
Architecture
My team built an architecture such that there will be a group of consumers. They will parse this data, perform some transformations (without any filtering!!) and then sends the final messages back to the kafka to topic=<entity-id>.
It means I upload the data back to the kafka to a topic that contains only a data of a specific entity.
Questions
At any given time, there can be up to 3-4k topics in kafka (1 topic for each unique entity).
Can my kafka handle it well? If not, what do I need to change?
Do I need to delete a topic or it's fine to have (alot of!!) unused topics over time?
Each consumer which consumes the final messages, will consume 100 topics at the same time. I know kafka clients can consume multiple topics concurrenctly but I'm not sure what is the best practices for that.
Please share your concerns.
Requirements
Please focus on the potential problems of this architecture and try not to talk about alternative architectures (less topics, more consumers, etc).

The number of topics is not so important in itself, but each Kafka topic is partitioned and the total number of partitions could impact performance.
The general recommendation from the Apache Kafka community is to have no more than 4,000 partitions per broker (this includes replicas). The linked KIP article explains some of the possible issues you may face if the limit is breached, and with 3,000 topics it would be easy to do so unless you choose a low partition count and/or replication factor for each topic.
Choosing a low partition count for a topic is sometimes not a good idea, because it limits the parallelism of reads and writes, leading to performance bottlenecks for your clients.
Choosing a low replication factor for a topic is also sometimes not a good idea, because it increases the chance of data loss upon failure.
Generally it's fine to have unused topics on the cluster but be aware that there is still a performance impact for the cluster to manage the metadata for all these partitions and some operations will still take longer than if the topics were not there at all.
There is also a per-cluster limit but that is much higher (200,000 partitions). So your architecture might be better served simply by increasing the node count of your cluster.

Related

Advice on how I can decrease Kafka Lag

I'm relatively new to working with Kafka, below is a sample of what my current set up is.
Kafka Setup
Multiple topics that all have one partition each. 2 Consumer Groups with each group containing one consumer.
The issue I am seeing is that the Lag is enormous, sometimes upwards of 8-10 hours waiting for consuming, the load is about 100-200 million messages a day
What steps should I look at in order to address this? Is it as simple as reassigning partitions or creating new partitions for the 3 topics that are being consumed by the two consumers? - I've also looked at compressing the contents of the producer with gzip but it doesn't really help in terms of the lag. I've looked at network connections and don't feel that it is anything got to do with this. If anyone could point me in the direction of Kafka and Low Latency documents that would be good also.
Generally the flow is to parallelize your consumption through the increase on the number of partitions and consumers in consumer groups that subscribe to those topics with increased partitions (Nconsumers <= Npartitions).
And distribute your topics with increase on the number of brokers in your cluster.
So from topic considerations:
Less partition per topic result:
in producer and/or consumer lag
starved or overloaded brokers and consumers.
(But take into account) More partition per topic result in:
More broker resources – file handlers and memory.
There is an overhead with each additional partition and a number of partitions a broker can handle is limited.
Overhead of replication load
Then increase the number of consumers in that consumer groups.
Try increasing partition per topic, but by itself it should not help! You also will need to increase the number of consumers in your consumer group. Is that single consumers or consumer groups on your diagram? How many consumers in your consumer group vs partitions on the topic that they are subscibed to.
From this in your message:
I've also looked at compressing the contents of the producer with gzip but it doesn't really help in terms of the lag.
I get an idead that your messages may be huge! Is it so? In case yes, try to keep messages small (for example by excluding BLOBs and keep external links to them)
Still the issue may be somewhere else like bad configs, consumer commit messages (acknowledgment handling), etc.
So, I highly advice you to read article Fine-tune Kafka performance with the Kafka optimization theorem
I also advise you to go through Apache Kafka courses on Confluent web-page
This should be added as a comment, but I haven't had permissions to do so. The provided info is very limited with incorrect diagram, which limits the ability to provide an adequate helpfull answer. If possible please correct your diagram and add more details about your set-up like:
broker configuration, file attached;
consumer set-up (Consumer commit messages);
producer set-up;
topic set-up;
kafka version (the defaults differ with major/minor versions)
The provided diagram is not correct in the notion of topic - partition relationship, so I assume it is a mistype and Partition 0 must be substituded with Broker 0, right?
Kafka's topics are divided into several partitions. While the topic is a logical concept in Kafka, a partition is the smallest storage unit that holds a subset of records owned by a topic...
Then there is an open question on the number of partiotions in each topic and the number of topics in each broker, as well as the number of brokers in your cluster!

Ideal number of partitions for Kafka topic

I am currently working on a setup which has 6 kafka-brokers, Data is being pushed into my topic from two producers at a rate of about 4000 messages per second, I have 5 Consumers for this topic working as a group. What should be the ideal number of partitions of my kafka topic?
Please feel free to tell me if any change is required in brokers/consumers/producers as well.
In general more the partitions - more the throughput. However there are other considerations too like the limits of hardware you are running on, whether you are using compression etc. There is a good enough information from Confluent here which provides you insight into rough calculation you can use to arrive at number of partitions.
A rough formula for picking the number of partitions is based on
throughput. You measure the throughout that you can achieve on a
single partition for production (call it p) and consumption (call it
c). Let’s say your target throughput is t. Then you need to have at
least max(t/p, t/c) partitions. The per-partition throughput that one
can achieve on the producer depends on configurations such as the
batching size, compression codec, type of acknowledgement, replication
factor, etc.
Moreover for consumer
The consumer throughput is often application dependent since it
corresponds to how fast the consumer logic can process each message
So the best way is to measure and benchmark for your own use case

Does kafka support millions of partitions?

Will we have any problem if we have millions of partitions for one topic?
Due to our business requirement, we are thinking if we can make a partition for every user in kafka.
We have millions of users.
Any insight would be appreciated!
Yes, I think you will end up having problems if you have millions of partitions for several reasons:
(Most importantly!!) Customers come and go, so you will have the requirement to constantly change the number of partitions or have plenty of unused partitions (because you can not reduce the number of partitions within a topic).
More Partitions Requires More Open File Handles: More Partitions means more directories and segment files on disk.
More Partitions May Increase Unavailability: Planned failures move Leaders off of a Broker one at a time, with minimal downtime per partition. In a hard failure all the leaders are immediately unavailable.
More Partitions May Increase End-to-end Latency: For the message to be seen by a Consumer it must be committed. The Broker replicates data from the leader with a single thread, resulting in overhead per Partition.
More Partitions May Require More Memory In the Client
More details are provided in the blog from Confluent on How to choose the number of topics/partitions in a Kafka cluster?.
In addition, according to Confluent's training material for Kafka developers it is recommended:
"The current limits (2-4K Partitions/Broker, 100s K Partitions per cluster) are maximums. Most environments are well below these values (typically in the 1000-1500 range or less per Broker)."
This blog explains that "Apache Kafka Supports 200K Partitions Per Cluster".
This might change with the replacement of Zookeeper KIP-500 but, again, looking at the first bullet point above this will still be a unhealthy software design.

Handling a Large Kafka topic

I have a very very large(count of messages) Kafka topic, it might have more than 20M message per second, but, message size is small, it's just some plain text, each less than 1KB, I can use several partitions per topic, and also I can use several servers to work on one topic and they will consume one of the partitions in the topic...
what if I need +100 servers for a huge topic?
Is it logical to create +100 partitions or more on a single topic?
You should define "large" when mentioning Kafka topics:
Large means huge data in terms of volume size.
Message size is large that it takes time sending a message from queue to client for processing?
Intensive write to that topic? In that case, do you need to process read as fast as possible? (i.e: can we delay process data for about 1 hour)
...
In either case, you should better think on the consumer side for a better design topic and partition. For instances:
Processing time for each message is slow, and it better process fast between messages: In that case, you should create many partitions. It is like a load balancer and server relationship, you create many workers for doing your job.
If only some message types, the time processing is slow, you should consider moving to a new topic. There is a nice article: Should you put several event types in the same Kafka topic explains this decision.
Is the order of messages important? for example, message A happens before message B, message A should be processed first. In this case, you should make all messages of the same type going to the same partition (only the same partition can maintain message order), or move to a separate topic (with a single partition).
...
After you have a proper design for topic and partition, it is come to question: how many partitions should you have for each topic. Increasing total partitions will increase your throughput, but at the same time, it will affect availability or latency. There are some good topics here and here that explain carefully how will total partitions per topic affect the performance. In my opinion, you should benchmark directly on your system to choose the correct value. It depends on many factors of your system: processing power of server machine, network capacity, memory ...
And the last part, you don't need 100 servers for 100 partitions. Kafka will try to balance all partitions between servers, but it is just optional. For example, if you have 1 topic with 7 partitions running on 3 servers, there will be 2 servers store 2 partitions each and 1 server stores 3 partitions. (so 2*2 + 3*1 = 7). In the newer version of Kafka, the mapping between partition and server information will be stored on the zookeeper.
you will get better help, if you are more specific and provide some numbers like what is your expected load per second and what is each message size etc,
in general Kafka is pretty powerful and behind the seances it writes the data to buffer and periodically flush the data to disk. and as per the benchmark done by confluent a while back, Kafka cluster with 6 node supports around 0.8 million messages per second below is bench marking pic
Our friends were right, I refer you to this book
Kafka, The Definitive Guide
by Neha Narkhede, Gwen Shapira & Todd Palino
You can find the answer on page 47
How to Choose the Number of Partitions
There are several factors to consider when choosing the number of
partitions:
What is the throughput you expect to achieve for the topic?
For example, do you expect to write 100 KB per second or 1 GB per
second?
What is the maximum throughput you expect to achieve when consuming from a single partition? You will always have, at most, one consumer
reading from a partition, so if you know that your slower consumer
writes the data to a database and this database never handles more
than 50 MB per second from each thread writing to it, then you know
you are limited to 60MB throughput when consuming from a partition.
You can go through the same exercise to estimate the maxi mum throughput per producer for a single partition, but since producers
are typically much faster than consumers, it is usu‐ ally safe to skip
this.
If you are sending messages to partitions based on keys, adding partitions later can be very challenging, so calculate throughput
based on your expected future usage, not the cur‐ rent usage.
Consider the number of partitions you will place on each broker and available diskspace and network bandwidth per broker.
Avoid overestimating, as each partition uses memory and other resources on the broker and will increase the time for leader
elections. With all this in mind, it’s clear that you want many
partitions but not too many. If you have some estimate regarding the
target throughput of the topic and the expected throughput of the con‐
sumers, you can divide the target throughput by the expected con‐
sumer throughput and derive the number of partitions this way. So if I
want to be able to write and read 1 GB/sec from a topic, and I know
each consumer can only process 50 MB/s, then I know I need at least 20
partitions. This way, I can have 20 consumers reading from the topic
and achieve 1 GB/sec. If you don’t have this detailed information, our
experience suggests that limiting the size of the partition on the
disk to less than 6 GB per day of retention often gives satisfactory
results.

large number of Channels Kafka

I was wondering if Kafka has any limitation or starts slowing down (due to GC or other reasons) if we have large number of channels. We have a heavy volume of data that we will be sending through Kafka (Over 2B data points). We were thinking of having about 1600 channels to start with.
Has anyone come across issues when we have such large number of channels in Kafka? Similarly, do you see issues with local DC replication with these large number of channels and lastly any foreseeable issues if we are using MirrorMaker for cross DC replication with such large number of channels
Any pointers are highly appreciated
Thanks
I believe there is no hard limit on number of topics in Kafka itself. However, since Kafka stores topic info in Zookeeper (//brokers/topics/), and Zookeeper has a 1MB limitation on max node size, there can be only a finite number of topics. Also, Kafka brokers store data for different topics in /var/kafka/data/. Performance may suffer if there are too many subdirs in /var/kafka/data/.
I haven't tried thousands of topics but Kafka with a few hundred topics works ok for my purposes. The only area where I had problems was dynamic topic creation while using high level consumer. It required client re-connection to pick up the new topics on all consumer boxes. This caused time consuming consumer re-balancing (which sometimes failed, preventing reading from some topics). As a result I had to switch to simple consumer and take care about read coordination in my code.
I'd recommend to create a simple test app that generates some random data for the number of topics you expect going forward and verify that performance is acceptable.