How does Kafka message processing scale in publish-subscribe mode? - apache-kafka

All, forgive me, I am a newbie just getting started with Kafka. I was reading the Kafka documentation about the differences between a traditional messaging system like ActiveMQ and Kafka.
As the documentation puts it, a traditional messaging system cannot scale message processing, since:
Publish-subscribe allows you broadcast data to multiple processes, but
has no way of scaling processing since every message goes to every
subscriber.
This makes sense to me.
But for Kafka, the documentation says it can scale message processing even in publish-subscribe mode. (Please correct me if I am wrong. Thanks.)
The consumer group concept in Kafka generalizes these two concepts. As
with a queue the consumer group allows you to divide up processing
over a collection of processes (the members of the consumer group). As
with publish-subscribe, Kafka allows you to broadcast messages to
multiple consumer groups.
The advantage of Kafka's model is that every topic has both these
properties—it can scale processing and is also multi-subscriber—there
is no need to choose one or the other.
So my question is: how does Kafka make this possible? I mean, scaling the processing in publish-subscribe mode. Thanks.

The main features in Kafka that enable scalable pub/sub are:
1. Partitioning individual topics and spreading the active partitions across multiple brokers in the cluster to take advantage of more machines, disks, and cache memory. Producers and consumers often connect to many or all nodes in the cluster, not just a single master node for a given topic/queue.
2. Storing all messages in a sequential commit log and not deleting them when consumed. This leads to more sequential reads and writes, and offloads the broker from keeping track of different copies of messages, deleting individual messages, handling fragmentation, and tracking which consumer has acknowledged consuming which messages.
3. Enabling smart parallel processing of individual consumers and consumer groups, in a way that each parallel message stream can come from the distributed partitions mentioned in #1, while offloading the offset management and partition assignment logic onto the clients themselves. Kafka scales with more consumers because the consumers do some of the work (unlike most other pub/sub brokers, where the bulk of the work is done in the broker).
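As a small illustration, here is a minimal consumer sketch (the broker address, topic name, and group name are assumptions, not from the question). Every instance started with the same group.id splits the topic's partitions among themselves (queue-style scaling), while a copy started with a different group.id receives the full stream again (publish-subscribe):

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class GroupConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");  // assumed broker address
            // Same group.id => partitions are divided among instances (queue semantics);
            // different group.id => each group gets every message (pub/sub semantics).
            props.put("group.id", "billing-service");           // hypothetical group name
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("orders")); // hypothetical topic
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        System.out.printf("partition=%d offset=%d value=%s%n",
                                record.partition(), record.offset(), record.value());
                    }
                }
            }
        }
    }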

Related

kafka consume an undefined number of topics

The topics are dynamically created, and there could be thousands of them. I need a way to detect when messages are produced so I can consume them. Moreover, I need to consume each topic independently so that I can then bulk-insert a large number of messages into a database, each topic corresponding to a different table. So let's say I start consuming a topic: I would consume 1000 messages, bulk them into a database in one operation, then commit the read position in Kafka. If I have 10 topics, I could use 10 consumers in parallel. The problem is that if I end up with a large number of topics, and most of them are idle (empty), I need a way to be notified when some topics suddenly become active, so that I don't have to launch thousands of idle consumers that do nothing most of the time.
The only solution I have thought of so far is to use a single signal topic in addition to the real topics, to which the producers would write in addition to the real topic. But I was wondering if there is another solution, like polling the metadata in Kafka, maybe. From what I've seen, I would have to iterate through all the topics matching a regex, then check the offsets of the partitions for each. I don't think it's possible to do that efficiently, but maybe I'm wrong.
You could track JMX metrics from the broker for incoming bytes per topic using Prometheus JMX Exporter, for example, then combine that with AlertManager to send some event/webhook upon some threshold of data to a consuming REST service, which would then start some consumers (maybe Kafka Connect tasks for a database?).
Or, like you said, use a signal topic since producer requests can be made to multiple topics at once.
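Related to the regex idea in the question, a recent consumer client can also subscribe to a topic pattern; newly created topics that match are picked up automatically after a metadata refresh (metadata.max.age.ms). It does not by itself tell you which topics are active, but it avoids managing an explicit topic list. A rough sketch, with the pattern and group name as assumptions:

    import java.time.Duration;
    import java.util.Properties;
    import java.util.regex.Pattern;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class PatternConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
            props.put("group.id", "table-loader");              // hypothetical group
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("metadata.max.age.ms", "30000");           // how often new matching topics are discovered

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                // Subscribe to every topic matching the pattern; topics created later
                // that match are assigned automatically after a metadata refresh.
                consumer.subscribe(Pattern.compile("ingest\\..*")); // hypothetical naming scheme
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    records.forEach(r -> System.out.printf("%s[%d]@%d%n",
                            r.topic(), r.partition(), r.offset()));
                }
            }
        }
    }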
If I have 10 topics, I could use 10 consumers in parallel
You can have more parallel consumers if any of those topics have multiple partitions
could be thousands of them
There are reasonable limits on the number of topics a Kafka cluster can support, by the way, but it's upwards of hundreds of thousands as of the latest releases. Something to keep in mind, though.
launch thousands of idle consumers that do nothing most of the time.
You could also use solutions like AWS Lambda or Kubernetes KEDA to auto-scale up/down based on topic data (lag).

Kafka/Streams Consumer Scaling Architecture

Posting this question to discuss best approaches/practices for scaling out Kafka consumers. Here's my use case:
We have many (more than a million) applications pushing data to Kafka; each application has one assigned Kafka topic with six partitions. We need to process the data pushed by the applications in near real time, and we need to push a callback once processing is done, per Kafka record. The naive approach is to spawn one Kafka consumer per partition, but this can get expensive, especially when some applications do not produce data continuously. To address this, we decided to spawn consumers subscribing to multiple topics (spring-kafka regex topic-pattern matching): we can have some arbitrary number of consumers subscribe to the topics and let the Kafka framework handle partition distribution amongst the available consumers; all consumers belong to one consumer group.
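For reference, a stripped-down spring-kafka sketch of that pattern-subscription setup (the topic pattern, group id, and the processing/callback methods here are placeholders, not the poster's actual code):

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.springframework.kafka.annotation.KafkaListener;
    import org.springframework.stereotype.Component;

    @Component
    public class AppRecordListener {

        // Every instance of the service runs this listener with the same groupId,
        // so the partitions of all matching topics are spread across however many
        // instances are currently running.
        @KafkaListener(topicPattern = "app\\..*", groupId = "app-processors")
        public void onRecord(ConsumerRecord<String, String> record) {
            process(record);       // application-specific, per-record processing
            sendCallback(record);  // hypothetical per-record completion callback
        }

        private void process(ConsumerRecord<String, String> record) { /* ... */ }

        private void sendCallback(ConsumerRecord<String, String> record) { /* ... */ }
    }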
The approach seems to work under normal workloads/traffic patterns, but there are some obvious shortcomings we would like to address:
One slow consumer can potentially affect overall consumer lag, especially when record processing varies from application to application and consumers are subscribing to multiple topics.
Scale-in/scale-out strategies.
Individual consumer health-checking: identifying zombies?

Implementation of queues using kafka server

I want to implement a queue mechanism using Kafka, but I could not find anywhere whether it's possible to just peek at data in the queue created for a topic without moving forward in it.
I want to read data from the queue and, based on different conditions, either remove the existing message or add another message to this queue. Also, is it possible to use a single Kafka server from different machines?
I referred to tutorialspoint for learning more about it.
Thanks in advance. Any leads would be appreciated.
Keep in mind that Kafka scales with multiple partitions per topic, and it doesn't give any ordering guarantee between partitions, so don't use Kafka if you want strict global ordering. Within a consumer group, if you want n consumers per topic, you need at least n partitions.
Consumers don't remove messages; they commit the offset of a message. The default configuration in most clients is to auto-commit the offset on read. You can re-insert messages into the topic at any time, but you cannot skip a message and expect to process it later.
You can connect as many machines as you want to a Kafka server. Typically, you have multiple servers forming a Kafka cluster, with replication for fault tolerance.
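To illustrate the offset behaviour described above, a rough sketch with auto-commit disabled, so reading a record does not implicitly advance the committed position until you decide to commit (the topic name and the shouldProcess/handle helpers are made up):

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class PeekStyleConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
            props.put("group.id", "queue-worker");              // hypothetical group
            props.put("enable.auto.commit", "false");           // commit positions explicitly
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("work-queue")); // hypothetical topic
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, String> record : records) {
                        if (shouldProcess(record)) {   // hypothetical filtering condition
                            handle(record);            // hypothetical processing
                        }
                        // Nothing is ever "removed" from the log; the only state we own
                        // is which offset we commit as our current position.
                    }
                    consumer.commitSync();  // advance our committed position after the batch
                }
            }
        }

        // Hypothetical helpers standing in for real business logic.
        static boolean shouldProcess(ConsumerRecord<String, String> r) { return true; }
        static void handle(ConsumerRecord<String, String> r) { System.out.println(r.value()); }
    }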

Maximum subscription limit of Kafka Topics Per Consumer

What is the maximum number of topics a consumer can subscribe to in Kafka? I am not able to find this value documented anywhere.
If a consumer subscribes to 500,000 or more topics, will there be a degradation in performance?
500,000 or more topics in a single Kafka cluster would be a bad design from the broker point of view. You typically want to keep the number of topic partitions down to the low tens of thousands.
If you find yourself thinking you need that many topics in Kafka you might instead want to consider creating a smaller number of topics and having 500,000 or more keys instead. The number of keys in Kafka is unlimited.
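A sketch of the keyed approach (the topic and key names are invented): the default partitioner hashes the key, so all records for one entity land on the same partition and stay ordered, without needing a topic per entity.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class KeyedProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
            props.put("key.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                    "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // One shared topic; the per-entity identity travels as the record key,
                // so the default partitioner keeps each entity's records together.
                producer.send(new ProducerRecord<>("entity-events", "entity-42", "some payload"));
            }
        }
    }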
To be technical, the "maximum" number of topics you could be subscribed to would be constrained by the available memory space of your consumer process (if your topics are listed explicitly, then a very large portion of the Java String pool will be your topics). This seems the less likely limiting factor (listing that many topics explicitly is prohibitive anyway).
Another consideration is how the topic-assignment data structures are set up on the group coordinator brokers. They could run out of space to record the topic assignment, depending on how they do it.
Lastly, and most plausibly, there is the available memory on your Apache ZooKeeper node. ZK keeps ALL data in memory for fast retrieval. It is also not sharded, meaning all data MUST fit onto one node. This means there is a limit to the number of topics you can create, constrained by the available memory on a ZK node.
Consumption is initiated by the consumers. The act of subscribing to a topic does not mean the consumer will start receiving messages for that topic. So as long as the consumer can poll and process data for that many topics, Kafka should be fine as well.
A consumer is a fairly independent entity from the Kafka cluster, unless you are talking about the built-in command-line consumer that ships with Kafka.
That said, the logic of subscribing to a Kafka topic, how many topics to subscribe to, and how to handle that data is up to the consumer. So the scalability issue here lies with the consumer logic.
Last but not least, I am not sure it is a good idea to consume too many topics within a single consumer. The very purpose of the pub/sub mechanism that Kafka provides, through the segregation of messages into various topics, is to facilitate handling specific categories of messages with separate consumers. So if you want to consume many topics, say a few thousand of them, with a single consumer, why divide the data into separate topics in Kafka in the first place?

kafka log deletion and load balancing across consumers

Say a consumer does some time-intensive processing. In order to scale consumer-side processing, I would like to spawn multiple consumers and consume messages from the Kafka topic in a round-robin fashion. Based on the documentation, it seems that if I create multiple consumers and add them to one consumer group, only one consumer will get the messages. If I add consumers to different consumer groups, each consumer will get the same messages. So, in order to achieve the above objective, is the only solution to partition the topic? This seems like an odd design choice, because consumer scalability now bleeds into topic and even producer design. Ideally, if a topic does not need partitioning, there should be no need to partition it. This puts unnecessary logic on the producer and also causes other consumer types to consume from partitions that may only make sense to one type of consumer. Plus, it limits the use case where a certain consumer type may want ordering over the messages, so splitting a topic into partitions may not be possible.
Second, if I set "cleanup.policy" to compact, does it mean that the Kafka log will keep growing, since it will maintain the latest value for each key? If not, how can I get log deletion and compaction?
UPDATE:
It seems I have two options to achieve scalability on the consumer side, which are independent of topic scaling.
Create consumer groups and have them consume odd and even offsets. This logic would have to be built into the consumers to discard unneeded messages. It also doubles the network requirements.
Create a hierarchy of topics, where the root topic gets all the messages. Then some job classifies the logs and publishes them again to more fine-grained topics. In this case, strong ordering can be achieved at the root, and more fine-grained topics for consumer scaling can be constructed.
In 0.8, Kafka maintains the consumer offset, so delivering messages round-robin across various consumers is not too far-fetched a requirement given their design.
Partitions are the unit of parallelism in Kafka by design, and not just for consumption: Kafka distributes the partitions across the cluster, which has other benefits like sharing load among different servers, replication management to ensure no data loss, and letting the log scale beyond a size that will fit on a single server.
Ordering of messages is a key factor: if you do not need strong ordering, then dividing topics into multiple partitions will allow you to distribute the load evenly while producing (this is handled by the producer itself). And when using a consumer group, you just need to add more consumer instances to the same group in order to consume in parallel.
Plus it limits the usecase, where a certain consumer type may want ordering over the messages, so splitting a topic into partitions may not be possible.
True. From the docs:
However, if you require a total order over messages this can be achieved with a topic that has only one partition, though this will mean only one consumer process.
Maintaining ordering while consuming in a distributed manner requires the messaging system to maintain per-message state to keep track of message acknowledgements. But this involves a lot of expensive random I/O in the system, so clearly there is a trade-off.
Ideally, if a topic does not partitioning, there should be no need to partition it. This puts un-necessary logic on producer and also causes other consumer types to consume from these partitions that may only make sense to one type of consumer
Distributing messages across partitions is typically handled by the producer itself without any intervention from the programmer's end (assuming you don't want to categorize messages using a key). And for the consumers, as you just mentioned, the better choice would be the simple/low-level consumer, which allows you to consume only a subset of the partitions in a topic.
This seems like an odd design choice, because the consumer scalability is now bleeding into topic and even producer design
I believe a system like Kafka, which focuses on high throughput (handling hundreds of megabytes of reads and writes per second from thousands of clients) while ensuring scalability and strong durability and fault-tolerance guarantees, might not be a good fit for someone with totally different business requirements.
Topic partitioning is primarily a way to scale out consumers and brokers, so if you need many consumers to keep up, then you need to partition the topic and add multiple consumer instances in the same consumer group. The producer API will manage partitions transparently. If you need certain consumers to subscribe only to some partitions, then you need to use the simple consumer API instead of the high-level API, and in this case you don't have the consumer group concept and have to coordinate consumption yourself.
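As a sketch of that manual-assignment style, the modern client exposes it as assign(), which bypasses the consumer-group balancing entirely (topic name and partition numbers are arbitrary, and the consumer is assumed to be built as in the earlier example):

    import java.util.Arrays;
    import org.apache.kafka.common.TopicPartition;

    // assign() instead of subscribe(): this consumer reads exactly the partitions
    // listed here, and coordinating who reads what is left to the application.
    consumer.assign(Arrays.asList(
            new TopicPartition("orders", 0),   // hypothetical topic/partitions
            new TopicPartition("orders", 1)));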
Message ordering is guaranteed within partitions but not between partitions, so if this is a requirement it needs to be dealt with on the consumer side.
Setting cleanup.policy=compact means that the Kafka brokers will keep the latest version of each message key indefinitely. Use cases like that are more about recording data updates for things you intend to keep around, rather than the log-stream-buffering use case.
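For completeness, a rough sketch of creating a compacted topic with the AdminClient (the topic name, partition count, and replication factor are placeholders):

    import java.util.Collections;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewTopic;
    import org.apache.kafka.common.config.TopicConfig;

    public class CreateCompactedTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");  // assumed broker address
            try (AdminClient admin = AdminClient.create(props)) {
                // cleanup.policy=compact keeps the newest record per key instead of
                // deleting by age/size, so the topic acts like a changelog/table.
                NewTopic topic = new NewTopic("user-profiles", 6, (short) 3)
                        .configs(Map.of(TopicConfig.CLEANUP_POLICY_CONFIG,
                                        TopicConfig.CLEANUP_POLICY_COMPACT));
                admin.createTopics(Collections.singletonList(topic)).all().get();
            }
        }
    }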
You need to factor out the reading of Kafka messages from the subsequent processing of those messages. You can use partitions and consumer groups to make reading messages as fast as possible, but if you process the messages as part of your consumer logic then you'll just slow down your consumers. By streaming the messages from consumers to other classes that will perform your processing you can adjust the parallelism of the consumers and of the processors independently. You'll see this approach in technologies like Spark and Storm.
This approach does add one complication: the consumer has to commit the message offset before the message has been processed. You may have to track the messages in flight to ensure exactly-once execution.
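As one illustration of that separation (everything here is hypothetical: topic name, pool size, and the handle() method), this sketch hands records to a worker pool and, as a simpler alternative to the in-flight tracking mentioned above, commits only after the whole batch has finished, giving at-least-once processing:

    import java.time.Duration;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Properties;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class DecoupledProcessing {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");    // assumed broker address
            props.put("group.id", "processing-service");         // hypothetical group
            props.put("enable.auto.commit", "false");
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");

            // Processing parallelism, tuned independently of the number of consumers.
            ExecutorService workers = Executors.newFixedThreadPool(8);

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("events")); // hypothetical topic
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    List<CompletableFuture<Void>> inFlight = new ArrayList<>();
                    for (ConsumerRecord<String, String> record : records) {
                        inFlight.add(CompletableFuture.runAsync(() -> handle(record), workers));
                    }
                    // Commit only after the batch is fully processed, so the committed
                    // offset never runs ahead of the work (at-least-once).
                    CompletableFuture.allOf(inFlight.toArray(new CompletableFuture[0])).join();
                    consumer.commitSync();
                }
            }
        }

        // Stand-in for the real processing logic.
        static void handle(ConsumerRecord<String, String> record) {
            System.out.println(record.value());
        }
    }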