I have 4 machines running a Kafka cluster with a topology in which each machine hosts one ZooKeeper node and two brokers.
With this configuration, what do you advise as the maximum number of topics and partitions for best performance?
Replication factor: 3
Using Kafka 0.10.XX
Thanks!
Each topic is restricted to 100,000 partitions no matter how many nodes (as of July 2017)
As to the number of topics, that depends on how much RAM the smallest machine has, because ZooKeeper keeps everything in memory for quick access (it also doesn't shard the znodes, it just replicates them across ZK nodes on write). This effectively means that once you exhaust one machine's memory, ZK will fail to add more topics. On the Kafka broker nodes you will most likely run out of file handles before reaching this limit.
To quote the Kafka docs (6.1 Basic Kafka Operations, https://kafka.apache.org/documentation/#basic_ops_add_topic):
Each sharded partition log is placed into its own folder under the Kafka log directory. The name of such folders consists of the topic name, appended by a dash (-) and the partition id. Since a typical folder name can not be over 255 characters long, there will be a limitation on the length of topic names. We assume the number of partitions will not ever be above 100,000. Therefore, topic names cannot be longer than 249 characters. This leaves just enough room in the folder name for a dash and a potentially 5 digit long partition id.
To quote the Zookeeper docs (https://zookeeper.apache.org/doc/trunk/zookeeperOver.html):
The replicated database is an in-memory database containing the entire data tree. Updates are logged to disk for recoverability, and writes are serialized to disk before they are applied to the in-memory database.
Performance:
Depending on your publishing and consumption semantics, the workable number of topics and partitions will change. The following is a set of questions you should ask yourself to gain insight into a potential solution (your question is very open ended):
Is the data I am publishing mission critical (i.e. cannot lose it, must be sure I published it, must have exactly once consumption)?
Should I make the producer.send() call as synchronous as possible or continue to use the asynchronous method with batching (do I trade-off publishing guarantees for speed)?
Are the messages I am publishing dependent on one another? Does message A have to be consumed before message B (implies A published before B)?
How do I choose which partition to send my message to?
Should I assign the message to a specific partition (extra producer logic), let the producer's default partitioner spread messages round-robin, or assign a key which will hash to one of the topic's partitions (you need an evenly distributed key to get good load balancing across partitions)? A short sketch after this list illustrates these options.
How many topics should you have? How is this connected to the semantics of your data? Will auto-creating topics for many distinct logical data domains be efficient (think of the effect on Zookeeper and administrative pain to delete stale topics)?
Partitions provide parallelism (more consumers possible) and possibly better load balancing (if the producer publishes correctly). Would you want to assign parts of your problem domain to specific partitions (when publishing, send data for client A to partition 1)? What side-effects does this have (think refactorability and maintainability)?
Will you want to make more partitions than you need so you can scale up later with more brokers/consumers? How realistic is automatic scaling of a Kafka cluster given your expertise? Will this be done manually? Is manual scaling viable for your problem domain (are you building Kafka around a fixed system with well-known characteristics, or are you required to handle severe spikes in messages)?
How will my consumers subscribe to topics? Will they use pre-configured configurations or use a regex to consume many topics? Are the messages between topics dependent or prioritized (need extra logic on consumer to implement priority)?
Should you use different network interfaces for replication between brokers (i.e. port 9092 for producers/consumers and 9093 for replication traffic)?
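To make the publishing trade-offs above concrete, here is a minimal Java sketch (the broker address, topic name, keys and payloads are placeholders, not anything from the question) contrasting a blocking send, an asynchronous send with a callback, and a keyed send that pins related messages to one partition:

import java.util.Properties;
import java.util.concurrent.ExecutionException;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SendStyles {
    public static void main(String[] args) throws ExecutionException, InterruptedException {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");   // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all");   // stronger publishing guarantee, lower throughput

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // 1) "As synchronous as possible": block on the returned Future.
            //    You trade speed for certainty that the record was acknowledged.
            producer.send(new ProducerRecord<>("events", "payload-1")).get();

            // 2) Asynchronous with batching: hand the record off, inspect the result in a callback.
            producer.send(new ProducerRecord<>("events", "payload-2"),
                    (metadata, exception) -> {
                        if (exception != null) {
                            exception.printStackTrace();   // decide whether to retry, log, or drop
                        }
                    });

            // 3) Keyed send: records with the same key hash to the same partition,
            //    preserving their relative order for that key.
            producer.send(new ProducerRecord<>("events", "client-A", "payload-3"));
        }
    }
}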
Good Links:
http://cloudurable.com/ppt/4-kafka-detailed-architecture.pdf
https://www.slideshare.net/ToddPalino/putting-kafka-into-overdrive
https://www.slideshare.net/JiangjieQin/no-data-loss-pipeline-with-apache-kafka-49753844
https://kafka.apache.org/documentation/
Related
I have built a microservice platform based on Kubernetes, and Kafka is used as the MQ between the services. Now a very confusing question has arisen: Kubernetes is designed to make it easy to scale microservices out, but when the scale-out exceeds the number of Kafka partitions, some microservice instances cannot consume anything. What should I do?
This is a Kafka limitation and has nothing to do with your service scheduler.
Kafka consumer groups simply cannot scale beyond the partition count. So, if you have a single-partition topic because you care about strict event ordering, then only one replica of your service can be active and consuming from the topic, and you'd need to handle failover in specific ways that are outside the scope of Kafka itself.
If your concern is the k8s autoscaler, then you can look into the KEDA autoscaler for Kafka services.
Kafka, as OneCricketeer notes, bounds the parallelism of consumption by the number of partitions.
If you couple processing with consumption, this limits the number of instances which will be performing work at any given time to the number of partitions to be consumed. Because the Kafka consumer group protocol includes support for reassigning partitions consumed by a crashed (or non-responsive...) consumer to a different consumer in the group, running more instances of the service than there are partitions at least allows for the other instances to be hot spares for fast failover.
It's possible to decouple processing from consumption. The broad outline could be to have every instance of your service join the consumer group; at most as many instances as there are partitions will actually consume from the topic. They can then make a load-balanced network request to another (or the same) instance, based on the message they consume, to do the processing. If you allow the consumer to have multiple requests in flight, this expands your scaling horizon to max-in-flight-requests * number-of-partitions.
If it happens that the messages in a partition don't need to be processed in order, simple round-robin load-balancing of the requests is sufficient.
Conversely, if it's the case that there are effectively multiple logical streams of messages multiplexed into a given partition (e.g. if messages are keyed by equipment ID; the second message for ID A needs to be processed after the first message, but could be processed in any order relative to messages from ID B), you can still do this, but it needs some care around ensuring ordering. Additionally, given the amount of throughput you should be able to get from a consumer of a single partition, needing to scale out to the point where you have more processing instances than partitions suggests that you'll want to investigate load-balancing approaches where if request B needs to be processed after request A (presumably because request A could affect the result of request B), A and B get routed to the same instance so they can leverage local in-memory state rather than do a read-from-db then write-to-db pas de deux.
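A rough Java sketch of that decoupling, assuming a hypothetical HTTP processing endpoint behind a load balancer and a fixed bound on in-flight requests (all names and URLs are placeholders; offsets are still auto-committed here, which ties into the delivery-guarantee discussion below):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import java.util.concurrent.Semaphore;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ForwardingConsumer {
    public static void main(String[] args) throws InterruptedException {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");   // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "processing-service");      // every instance joins this group
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        HttpClient http = HttpClient.newHttpClient();
        Semaphore inFlight = new Semaphore(10);   // max-in-flight-requests per consuming instance

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("events"));
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                    inFlight.acquire();   // bound the number of outstanding processing requests
                    HttpRequest request = HttpRequest.newBuilder(
                                    URI.create("http://processing-lb.internal/process"))   // hypothetical endpoint
                            .POST(HttpRequest.BodyPublishers.ofString(record.value()))
                            .build();
                    http.sendAsync(request, HttpResponse.BodyHandlers.ofString())
                            .whenComplete((response, error) -> inFlight.release());
                }
            }
        }
    }
}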
This sort of architecture can be implemented in any language, though maintaining a reasonable level of availability and consistency is going to be difficult. There are frameworks and toolkits which can deliver a lot of this functionality: Akka (JVM), Akka.Net, and Protoactor all implement useful primitives in this area (disclaimer: I'm employed by Lightbend, which maintains and provides commercial support for one of those, though I'd have (and actually have) made the same recommendations prior to my employment there).
When consuming messages from Kafka in this style of architecture, you will definitely have to make the choice between at-most-once and at-least-once delivery guarantees and that will drive decisions around when you commit offsets. Note particularly that you need to be careful, if doing at-least-once, to not commit until every message up to that offset has been processed (or discarded), lest you end up with "at-least-zero-times", which isn't a useful guarantee. If doing at-least-once, you may also want to try for effectively-once: at-least-once with idempotent processing.
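As an illustration of the at-least-once side of that choice, a minimal sketch (broker, group and topic names are placeholders) that disables auto-commit and only commits once the polled batch has been fully processed:

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class AtLeastOnceConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");   // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "billing-service");         // placeholder
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");         // we decide when to commit
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    process(record);   // must succeed (or be deliberately discarded) before we commit
                }
                // Commit only after everything up to these offsets has been handled;
                // crashing before this line means the batch is re-delivered (at-least-once).
                consumer.commitSync();
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        System.out.println(record.key() + " -> " + record.value());   // stand-in for real, idempotent processing
    }
}

Committing before processing instead would give you at-most-once.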
Will we have any problem if we have millions of partitions for one topic?
Due to our business requirements, we are thinking of making a partition for every user in Kafka.
We have millions of users.
Any insight would be appreciated!
Yes, I think you will end up having problems if you have millions of partitions for several reasons:
(Most importantly!!) Customers come and go, so you will have to constantly change the number of partitions or keep plenty of unused partitions around (because you cannot reduce the number of partitions within a topic).
More Partitions Requires More Open File Handles: More Partitions means more directories and segment files on disk.
More Partitions May Increase Unavailability: Planned failures move Leaders off of a Broker one at a time, with minimal downtime per partition. In a hard failure all the leaders are immediately unavailable.
More Partitions May Increase End-to-end Latency: For the message to be seen by a Consumer it must be committed. The Broker replicates data from the leader with a single thread, resulting in overhead per Partition.
More Partitions May Require More Memory In the Client
More details are provided in the Confluent blog post "How to choose the number of topics/partitions in a Kafka cluster?".
In addition, according to Confluent's training material for Kafka developers it is recommended:
"The current limits (2-4K Partitions/Broker, 100s K Partitions per cluster) are maximums. Most environments are well below these values (typically in the 1000-1500 range or less per Broker)."
This blog explains that "Apache Kafka Supports 200K Partitions Per Cluster".
This might change with the replacement of ZooKeeper (KIP-500) but, again, looking at the first bullet point above, this would still be an unhealthy software design.
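The usual alternative to a partition per user is to use the user ID as the message key, so all of a given user's events land in the same partition (preserving per-user ordering) while the partition count stays fixed and modest. A minimal Java sketch with placeholder broker, topic and user IDs:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class PerUserEvents {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");   // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The default partitioner hashes the key, so every event for "user-42"
            // goes to the same partition of a topic with, say, 50 partitions --
            // millions of users, but no need for millions of partitions.
            producer.send(new ProducerRecord<>("user-events", "user-42", "logged_in"));
            producer.send(new ProducerRecord<>("user-events", "user-42", "added_to_cart"));
            producer.send(new ProducerRecord<>("user-events", "user-7", "logged_in"));
        }
    }
}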
As usual, it's a bit confusing to see the benefits of one way of splitting the data over another.
I can't see the difference/pros-cons between having
Topic 1 -> P0 and Topic 2 -> P0
over Topic 1 -> P0, P1
and a consumer pulling from 2 topics versus a single topic with 2 partitions, where P0 and P1 hold different event types or entities.
The only benefit I can see is that if another consumer needs Topic 2's data, then it's easy to consume.
Regarding topic auto-generation, are there any benefits to that approach, or will it get out of hand after some time?
Thanks
I would say this decision depends on multiple factors:
Logic/Separation of Concerns: You can decide whether to use multiple topics or multiple partitions based on the logic you are trying to implement. Normally, you need distinct topics for distinct entities. For example, say you want to stream users and companies. It doesn't make much sense to create a single topic with two partitions where the first partition holds users and the second one holds companies. Also, a single topic with multiple partitions won't let you implement, e.g., message ordering for users, which can only be achieved using keyed messages (messages with the same key are placed in the same partition).
Host storage capabilities: A partition must fit in the storage of the host machine while a topic can be distributed across the whole Kafka Cluster by partitioning it across multiple partitions. Kafka Docs can shed some more light on this:
The partitions in the log serve several purposes. First, they allow the log to scale beyond a size that will fit on a single server. Each individual partition must fit on the servers that host it, but a topic may have many partitions so it can handle an arbitrary amount of data. Second they act as the unit of parallelism—more on that in a bit.
Throughput: If you have high throughput, it makes more sense to create different topics per entity and split them into multiple partitions so that multiple consumers can join the consumer group. Don't forget that the level of parallelism in Kafka is defined by the number of partitions (and obviously active consumers).
Retention Policy: Message retention in Kafka works on partition/segment level and you need to make sure that the partitioning you've made in conjunction with the desired retention policy you've picked will support your use case.
Coming to your second question now, I am not sure what your requirement is and how it relates to the first question. When a producer attempts to write a message to a Kafka topic that does not exist, the topic will be created automatically if auto.create.topics.enable is set to true. Otherwise, the topic won't get created and your producer will fail.
auto.create.topics.enable: Enable auto creation of topic on the server
Again, this decision should be dependent on your requirements and the desired behaviour. Normally, auto.create.topics.enable should be set to false in production environments in order to mitigate any risks.
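With auto-creation disabled, topics are then created explicitly, for example via the AdminClient (assuming a client/broker version that ships it; the topic names, counts and retention override below are placeholder values):

import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.ExecutionException;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopics {
    public static void main(String[] args) throws ExecutionException, InterruptedException {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");   // placeholder

        try (AdminClient admin = AdminClient.create(props)) {
            // Separate topics per entity, each with its own partition count,
            // replication factor and retention policy.
            NewTopic users = new NewTopic("users", 6, (short) 3)
                    .configs(Map.of("retention.ms", "604800000"));   // 7 days, example value
            NewTopic companies = new NewTopic("companies", 3, (short) 3);

            admin.createTopics(List.of(users, companies)).all().get();
        }
    }
}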
Just adding some things on top of Giorgos' answer:
By choosing the second approach over the first one, you would lose a lot of features that Kafka offers. Some of the features may be: data balancing per brokers, removing topics, consumer groups, ACLs, joins with Kafka Streams, etc.
I think that this flag can be easily compared with automatically creating tables in your database. It's handy to do it in your dev environments but you never want it to happen in production.
The essence of distributed computation is to co-locate execution with data, or in other words, to send your code to your data, not your data to your code. That is the core design of Hadoop, Spark, etc.
Does Kafka / Kafka Streams allow such setup? If yes, how? If no is there something planned, maybe as a subproject e.g. using Kubernetes or similar?
I know that we can define consumer groups for a topic but I don't understand how partitions are allocated to consuming application instances and if this allocation can be made to favour co-located instances.
Please let me know if there is a better term to search for as "kafka consumer co-location" didn't please the google gods :/
The Kafka model is different. The Kafka cluster itself only stores data streams. Computation happens outside the Kafka cluster. Thus, there is only a limited notion of co-location, i.e., data will always be sent over the network to the consumer/streams applications that do the processing.
For Kafka Streams, if you do joins for example, data sub-streams (based on Kafka partitions) of both input streams for the join will be co-located within a single Kafka Streams instance, though, to compute the correct result.
Note that data stream processing is a different model, and thus "shipping code to data" is not as important as it is for batch processing.
Why would we like to have that?
To minimize network traffic? Reduce delays?
The wish is to try to give each partition to a local consumer, if possible. Any of the following condition makes that impossible or undesirable:
Broker's host does not run any Consumers
Local consumers don't subscribe to broker's topics
Local consumer is overloaded, compared to some external consumers
Even in the case of the relatively simple StickyAssignor, this problem turns out to be a multi-objective optimization:
Optimize for evenly-distributed consumer load
Optimize for preserving previously-assigned partitions
All of this in a situation where topic distribution and consumer membership change dynamically!
The next step would be to introduce some numerical measure of locality. Ideal assignment would try to connect broker and consumer on the same host, rack, data center, or continent. For example, you might wish to use ping time as a measure of distance between processes; or a number of hops.
Another dimension is variation of host capabilities and load. How many more partitions consumer's host can handle?
There must be a way to aggregate all the requirements into a single number: how good is the assignment of Topic X to a consumer Y.
In the end, you might get a n * m matrix of assignment scores: for each consumer-broker pair you might compute an assignment penalty. By solving that assignment problem in O(n^3) time you'll get the best possible assignment, which favors all aspects, important for your application:
closeness to broker
closeness to end-user
consumer's cache state
CPU load and free disk space of consumer nodes
maybe some other criteria like: regulation requirements, scheduled maintenance, cost of running a node
Kafka has a PartitionAssignor class, which controls the relation between topics and consumers. The default is a very simple algorithm, but there are more sophisticated implementations like StickyAssignor, which tries to preserve consumers' caches. It is a pluggable interface, open for experimentation.
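For example, switching a consumer from the default assignor to the StickyAssignor is just a configuration change (the broker address, group and topic below are placeholders); a fully custom assignor would implement the same pluggable interface:

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.StickyAssignor;
import org.apache.kafka.common.serialization.StringDeserializer;

public class StickyConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");   // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "analytics");               // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // Prefer keeping previously-assigned partitions across rebalances, which preserves consumer-side caches.
        props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG, StickyAssignor.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("events"));
            while (!Thread.currentThread().isInterrupted()) {
                consumer.poll(Duration.ofMillis(500)).forEach(record ->
                        System.out.println(record.partition() + ": " + record.value()));
            }
        }
    }
}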
Kafka's philosophy favors robustness and universality. Maybe that's why such fragile and multi-faceted optimizations are not a part of a standard distribution.
While trying to get a deep understanding of the Kafka distribution model, one sentence here on Stack Overflow got me buzzing, and I can't get it confirmed or denied.
So, the more subscriber groups you have, the lower the performance is, as kafka needs to replicate the messages to all those groups and guarantee the total order.
As far as I understood from the Kafka docs, multiple consumer groups act similarly to individual consumers. There is no replication done within the brokers, since each consumer group has its own offset for a given partition. The number of groups should, then, not add any significant overhead; all of the data is in one place, only the offsets differ. Is that correct?
If this is correct, then there is no way of actually introducing multiple disjoint consumers without impacting throughput, since all consumers always query all of the partitions, and some kind of copying is introduced. Note that this is not related to the number of consumer threads, threads only improve consumer performance, they don't interfere with broker operations as far as I conclude.
I've found an answer myself, it's located within the new consumer API docs for Kafka 0.9 and after:
Conceptually you can think of a consumer group as being a single logical subscriber that happens to be made up of multiple processes. As a multi-subscriber system, Kafka naturally supports having any number of consumer groups for a given topic without duplicating data (additional consumers are actually quite cheap).
Bottom line: no, multiple consumer groups do not decrease performance, at least not significantly.
It does not affect the Kafka process's performance, but since 2 or more consumer groups mean 2 or more times as many reads from the Kafka servers, it does affect network utilization in outgoing traffic if you have lots of consumer groups. Besides that, data is mostly read from memory and does not affect performance, because RAM is way faster than network communication.
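To see this in practice: the only thing that distinguishes two consumer groups is the group.id. Both runs of the sketch below (broker, topic and group names are placeholders) independently receive every message in the topic, each tracking its own offsets:

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class TwoGroups {
    public static void main(String[] args) {
        // Run once with "billing" and once with "auditing": both processes get the full stream,
        // because each group only stores its own offsets -- the data itself is not duplicated.
        String groupId = args.length > 0 ? args[0] : "billing";

        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");   // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, groupId);
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            while (!Thread.currentThread().isInterrupted()) {
                consumer.poll(Duration.ofMillis(500)).forEach(record ->
                        System.out.println(groupId + " saw: " + record.value()));
            }
        }
    }
}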