Producer-consumer with side constraints in Kafka (or others) - apache-kafka

We have a bunch of producers that send messages/events to a bunch of consumers. Each message must be consumed by exactly one consumer. We know that this common scenario can easily be achieved by using consumer groups in Kafka. However, we also have a couple of additional constraints: Not every consumer can consume every message. Messages have (arbitrary) requirements attached to them and only consumers that fulfil these requirements must process them. This would still be possible with a consumer group where a consumer first looks at the message and eventually re-submits it if it does not meet the requirements. However, there is no guarantee that messages will be seen by every consumers at least once so they may bounce around indefinitely although there may be a matching consumer. We also cannot set up multiple topics because the requirements for consumers are arbitrary complex boolean formulas defined by the user and not the application. This can result in a combinatorial explosion of topics.
Additionally we want to be able to dynamically add and remove consumers from the group in case more processing resources are needed. As far as I understood Kafka, this can lead to consumers not getting any messages if there are not enough partitions and dynamically re-partitioning is also not really possible (without admin interaction).
Is there any way to make this work in Kafka? Maybe Kafka is also not the right technology, are there others that are more suitable? We also looked at RabbitMQ but also there we did not find a way that guarantees that every consumer is seeing a message so that it can evaluate the requirements.

you could commit offsets manually when you after identifying the desired events by setting ENABLE_AUTO_COMMIT_CONFIG to false in your consumer configs but your use-case would trigger excessive rebalances which stops any consumption. i don't think Kafka is the appropriate infrastructure for this.
however if you could mark your events with finite number of keys, you can dictate which partition they are produced to. using the same key in your consumer guarantees to poll events from the same partition. note that you need to have the same number of partitions in your topic as the number of unique keys.

Related

How to scale to thousands of producer-consumer pairs in Kafka?

I have a usecase where I want to have thousands of producers writing messages which will be consumed by thousands of corresponding consumers. Each producer's message is meant for exactly one consumer.
Going through the core concepts here and here: it seems like each consumer-producer pair should have its own topic. Is this correct understanding? I also looked into consumer groups but it seems they are more for parallellizing consumption.
Right now I have multiple producer-consumer pairs sharing very few topics, but because of that (i think) I am having to read a lot of messages in the consumer and filter them out for the specific producer's messages by the key. As my system scales this might take a lot of time. Also in the event I have to delete the checkpoint this will be even more problematic as it starts reading from the very beginning.
Is creating thousands of topics the solution for this? Or is there any other way to use concepts like partitions, consumer groups etc? Both producers and consumers are spark streaming/batch applications. Thanks.
Each producer's message is meant for exactly one consumer
Assuming you commit the offsets, and don't allow retries, this is the expected behavior of all Kafka consumers (or rather, consumer groups)
seems like each consumer-producer pair should have its own topic
Not really. As you said, you have many-to-many relationship of clients. You do not need to have a known pair ahead of time; a producer could send data with no expected consumer, then any consumer application(s) in the future should be able to subscribe to that topic for the data they are interested in.
sharing very few topics, but because of that (i think) I am having to read a lot of messages in the consumer and filter them out for the specific producer's messages by the key. As my system scales this might take a lot of time
The consumption would take linearly more time on a higher production rate, yes, and partitions are the way to solve for that. Beyond that, you need faster network and processing. You still need to consume and deserialize in order to filter, so the filter is not the bottleneck here.
Is creating thousands of topics the solution for this?
Ultimately depends on your data, but I'm guessing not.
Is creating thousands of topics the solution for this? Or is there any
other way to use concepts like partitions, consumer groups etc? Both
producers and consumers are spark streaming/batch applications.
What's the reason you want to have thousands of consumers? or want to have a 1 to 1 explicit relationship? As mentioned earlier, only one consumer within a consumer group will process a message. This is normal.
If however you are trying to make your record processing extremely concurrent, instead of using very high partition counts or very large consumer groups, should use something like Parallel Consumer (PC).
By using PC, you can processing all your keys in parallel, regardless of how long it takes to process, and you can be as concurrent as you wish .
PC directly solves for this, by sub partitioning the input partitions by key and processing each key in parallel.
It also tracks per record acknowledgement. Check out Parallel Consumer on GitHub (it's open source BTW, and I'm the author).

If I use Kafka as simple message. Does it really worth

=== Assume everything from consumer point of view ===
I was reading couple of Kafka articles and I saw that the number of partitions is coupled to number of micro-service instances.... Ex: If I say 1topic 1partition for my serviceA.. Producer pushes message to topicT1, partitionP1, and from consumerSide(ServiceA1) I can read from t1,p1. If I spin new pod(ServiceA2) to have highThroughput then second instance will never receive any message because Kafka/ZooKeeper assigns id to each Consumer and partition1 is already taken by serviceA1. So serviceA2++ stays idle... To avoid such a hassle Kafka recommends to add more partition, so that number of consumers can be increased/decreased based on need.
I was also able to test through commandLine and service2 never consumed any message. If I shut service1 then service2 was able to pick new message... So if I spin more pod then FailSafe/Availability increases but throughput is same always...
Is my assumption is correct. Am I missing anything. Now I feel like any standard messaging will have the same problem...How to extend message-oriented systems itself.
Every topic has a partition, by default it comes with only one partition if you don't define the partition count value. In your case, you have a consumer group that consists of two consumers. Every consumer read the log from the partition. In your case, first consumer read the log from the first partition(we have the only partition), and for second consumer there will be no partition to the consumer the data so it become idle. Once first consumer gets down then only the second consumer starts reading the data from the first partition from the last committed offset.
Please check below blogs and videos. It explains the topic, consumer, and consumer group in kafka.
https://www.javatpoint.com/apache-kafka-consumer-and-consumer-groups
http://cloudurable.com/blog/kafka-architecture-consumers/index.html
https://docs.confluent.io/platform/current/clients/consumer.html
https://www.youtube.com/watch?v=lAdG16KaHLs
I hope this will give you idea about the consumer and consumer group.
A broad solution to this is to decouple consumption of a message (i.e. receiving a message from Kafka and perhaps deserializing it and validating that it conforms to the schema) and processing it (interpreting the message). If the consumption is simple enough, being limited to no more instances consuming than there are partitions need not constrain.
One way to accomplish this is to have a Kafka consumption service which sends an HTTP request (perhaps through a load balancer or whatever) to a processing service which has arbitrarily many members.
Note that depending on what you're using Kafka for, there may be a requirement that certain messages always be in the same partition as one another in order to ensure that they get handled in a deterministic order (since ordering across partitions is not guaranteed). A typical example of this would be if the messages are change events for a particular record. If you're accomplishing this via some hash of the message key (or a portion of the key if using a custom partitioner), then simply changing the number of partitions might not be viable (you would need to introduce some sort of migration or have the producers know which records have to be routed to the old partitions and only route to the new partitions if the record has never been seen before).
We just started replacing messaging with Kafka.
In a traditional MQ there will be a cluster and 1orMQ will be there inside.
So the MQ cluster/co-ordinator service will deliver the message to clients.
Now there can be 10 services/clients which can consume message from single MQ.
So if there are 10 messages in MQ then each service/consumer/client can read/process 1 message
Now this case is not possible in Kafka which I understood now as per design
To achieve similar functionality in Kafka I have add equal or more number of partition as client/consumer/pods.

Kafka - Synchronized Consumer Groups

i am trying to make my head regarding Kafka consumers and I'd like to know if the following use case can be solved using Kafka.
My use case is basically this one:
I have a stream that I'd like to be consumed in sync by several consumers. In other words, I have a first consumer that starts to consume the stream, then another consumer arrives later. I'd like this second consumer to start to consume the stream at the offset where is currently the first consumer.
I know that I need to have the consumers in two different groups. But it is not clear for me :
on how or if it is possible to coordinate the groups offset
if I would expect a latency for such coordination task
You do not need two different groups, all consumers can check one topic. Or as many as they like, for that matter.
offset
Messages typically are identified by their arrival date, so all the clients need to tell the producer "my last visit was at 10:00, give me all new messages". So all each client needs to keep track of is when which individual topic was checked last.
latency
this is kind of "of scope" at this point. Of course there will be latency, but it depends on the environment, like "how many consumers", "how many topics", "message format" etc.
so can your usecase be solved using kafka
In short: yes. "Can one consumer continue where another has left", the consumers could exchange the latest index between each other, of course that would require some internal synchronization. Kafka itself does not care about consumers, so it will not keep track itself about the latest index. You need to do the work. Another possibility would be to actually consume the messages (like, delete them from queue once consumed), so each time another consumer hits the queue it is guaranteed to receive the messages another consumer left off. Of course that would depend on your usecase, can you actually delete your messages from the queue.
This is not a problematic treated by kafka directly (consumer group is to distribute partitions among members, not to attribute the same offset), but you can do somehting for this. You could simply create an other topic, where consumer1 would post either offset or copy of the message read (so you would need bth consumer and producer for this), and your other synchronized consumer would react against this - of course there ould be some latency for this.
What is your use case behind this? Why can't you consume at different offset? Couldn't you rather having one consumer, which would then dispatch the message read to to different processes, so that they are indeed synchronized? (with no latency)
What do you mean by synchronized: should consumer2 (and 3 and more) only consume the same message than consumer1 (ie can't consume faster, what I assume in both previous solution) While this is possible, it would really be better to know the reason behind this, maybe there is a better way for you to process data

kafka topics and partitions decisions

I need to understand something about kafka:
When I have a single kafka broker on a single host - is there any sense to have it have more than one partition for the topics? I means even if my data can be distinguished with some key (say tenant id) - what is the benefit of doing it on a single kafka broker? does this give any parallelism , if so how?
When a key is used, is this means that each key is mapped to a given partition? Does the number of partitions for a topic must be equal to the number of possible values for the key I specified? OR is this just a hash and so the number of partitions doesnt have to be equal?
From what I read, topics are created due to types of messages to be places in kafka. But in my case, i have 2 topics I have created since I have 2 types of consumption: one for reading one by one message. the second in case of a bulk of messages comes into the queue (application reasons) and then it is being entered into the second topic. Is that a good design although the messages type is the same? any other practice for such a scansion?
Yes, it definitely makes sense to have more than one partition for a topic even when you have a single Kafka broker. A scenario when you can benefit from this is pretty simple:
you need to guarantee in-order processing by tenant id
processing logic for each message is rather complex and takes some time. Especially the case when the Kafka message itself is simple, but the logic behind processing this message takes time (simple example - message is an URL, and the processing logic is downloading the file from there and doing some processing)
Given these 2 conditions you may get into a situation where one consumer is not able to keep up processing all the messages if all the data goes to a single partition. Remember, you can process one partition with exactly one consumer (well, you can use 2 consumers if using different consumer groups, but that's not your case), so you'll start getting behind over time. But if you have more than one partition you'll either be able to use one consumer and process data in parallel (this could help to speed things up in some cases) or just add more consumers.
By default, Kafka uses hash-based partitioning. This is configurable by providing a custom Partitioner, for example you can use random partitioning if you don't care what partition your message ends up in.
It's totally up to you what purposes you have topics for
UPD, answers to questions in the comment:
Adding more consumers is usually done for adding more computing power, not for achieving desired parallelism. To add parallelism add partitions. Most consumer implementations process different partitions on different threads, so if you have enough computing power, you might just have a single consumer processing multiple partitions in parallel. Then, if you start bumping into situations where one consumer is not enough, you just add more consumers.
When you create a topic you just specify the number of partitions (and replication factor for this topic, but that's a different thing). The key and partition to send is completely up to producer. In fact, you could configure your producer to use random partitioner and it won't even care about keys, just pick the partition randomly. There's no direct relation between key -> partition, it's just convenient to benefit from having things setup like this.
Can you elaborate on this one? Not sure I understand this, but I guess your question is whether you can send just a value and Kafka will infer a key somehow itself. If so, then the answer is no - Kafka does not apply any transformation to messages and stores them as is, so if you want your message to contain a key, the producer must explicitly send the key.

kafka log deletion and load balancing across consumers

Say a consumer does a time intensive processing. In order to scale consumer side processing, i would like to spawn multiple consumers and consumer messages from kafka topic in a round robin fashion. Based on the documentation, it seems like if i create multiple consumers and add them in one consumer group, only one consumer will get the messages. If i add consumers to different consumer groups, each consumer will get the same message. So, in order to achieve the above objective, is the only solution to partition the topic ? This seems like an odd design choice, because the consumer scalability is now bleeding into topic and even producer design. Ideally, if a topic does not partitioning, there should be no need to partition it. This puts un-necessary logic on producer and also causes other consumer types to consume from these partitions that may only make sense to one type of consumer. Plus it limits the usecase, where a certain consumer type may want ordering over the messages, so splitting a topic into partitions may not be possible.
Second if i choose "cleanup.policy" to compact, does it mean that kafka log will keep increasing as it will maintain the latest value for each key? If not, how can i get log deletion and compaction?
UPDATE:
It seems like i have two options to achieve scalability on consumer side, which are independent of topic scaling.
Create consumer groups and have them consume odd and even offsets. This logic would have to be built into the consumers to discard un-needed messages. Also doubles the network requirements
Create a hierarchy of topics, where the root topic gets all the messages. Then some job classifies the logs and publish them again to more fine grained topics. In this case, the strong ordering can be achieved at root and more fine grained topics for consumer scaling can be constructed.
In 0.8, kafka maintains the consumer offset, so publishing messages in a round robin across various consumers is not a too far fetched requirement from their design.
Partitions are the unit of parallelism in Kafka by design. Not just for consumtion but kafka distributes the partiotions accross cluster which has different other benifits like sharing load among different servers, replication management for ensuring no Data loss, managing log to scale beyond a size that will fit on a single server etc.
Ordering of messages is a key factor as if you do not need a storng ordering then diving topics with multiple partitions will allow you to evenly distribute the load while producing (this will be handled by the producer itself). And while using consumer group you just need to add more consumer instances in the same group in order to consume them parallely.
Plus it limits the usecase, where a certain consumer type may want ordering over the messages, so splitting a topic into partitions may not be possible.
True,from the doc
However, if you require a total order over messages this can be achieved with a topic that has only one partition, though this will mean only one consumer process.
Maintaining ordering whiile consuming in distributed manner requires the messaging system to maintain per-message state to keep track of message acknowledgement. But this will involve a lot of expensive random I/O in the system. So clearly there is a trade-off.
Ideally, if a topic does not partitioning, there should be no need to partition it. This puts un-necessary logic on producer and also causes other consumer types to consume from these partitions that may only make sense to one type of consumer
Distributing messages across partitions is typically handled by the producer it self without any intervention from the programmers end (assuming you don't want to categories messages using key). And for the consumers as you just mentioned here the better choice would be to use Simple/Low level consumers which will allow you to consume only a subset of the partitions in a topic.
This seems like an odd design choice, because the consumer scalability is now bleeding into topic and even producer design
I believe for a system like Kafka which focuses on high throughput ( handle hundreds of megabytes of reads and writes per second from thousands of clients ), ensuring scalability and strong durability and fault-tolerance guarantees might not be a good fit for someone having totally a different business requirements.
Topic partitioning is primarily a way to scale out consumers and brokers so if you need many consumers to keep up then you need to partition the topic and add multiple consumer instances in the same consumer group. The producer API will manage partitions transparently. If you need to have certain consumers subscribing only to some partitions, then you need to use the simple consumer API instead of the high level API and in this case you don't have the consumer group concept and have to coordinate consumption yourself.
Message ordering is guaranteed within partitions but not between partitions so if this is a requirement it needs to be dealt with on consumer side.
Setting cleanup.policy=compact means that the Kafka brokers will keep the latest version of a message key indefinitely and use cases like that should be more for recording of data updates for things you intend to keep around rather than the log stream buffering use case.
You need to factor out the reading of Kafka messages from the subsequent processing of those messages. You can use partitions and consumer groups to make reading messages as fast as possible, but if you process the messages as part of your consumer logic then you'll just slow down your consumers. By streaming the messages from consumers to other classes that will perform your processing you can adjust the parallelism of the consumers and of the processors independently. You'll see this approach in technologies like Spark and Storm.
This approach does add one complication and that is that the consumer has to commit the message offset before the message has been processed. You may have to track the messages in flight to insure execute-exactly-once.