I am facing an issue with a KafkaConsumer. Our scenario is the following: we have 5 environments, each with a KafkaConsumer implemented, all of them pointing to the same Kafka server and topic. All the consumers also have the same config and group.id.
I noticed that some of the environments are missing messages, but those missing messages do reach other environments. I think this is somehow related to my use of the same group.id.
For example, if message 'A' is present in env1, it is not present in env2, 3, 4, or 5.
Could someone give me an idea of what the cause could be, or whether it is related to the group.id?
Please learn in detail what a Consumer Group really is. In a few words: each instance in the group consumes from a subset of the topic's partitions at a time. Other instances may consume from the same topic, but they will only read from different partitions; partition assignments do not overlap.
So, since you say that some of your messages don't reach one specific consumer but do reach others, that means those messages are being routed to different partitions.
I'm not sure what your business goal is, but here is a detailed article about Consumer Groups: https://dev.to/de_maric/what-is-a-consumer-group-in-kafka-49il
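If the goal is for every environment to receive every message, each environment needs its own group.id. A minimal sketch of such a consumer (the broker address, topic name, and ENV_NAME variable are placeholders):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class EnvConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
        // A distinct group.id per environment: each environment now gets its
        // own copy of every message instead of splitting partitions with the others.
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-app-" + System.getenv("ENV_NAME"));
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));
            while (true) {
                consumer.poll(Duration.ofMillis(500))
                        .forEach(r -> System.out.println("received: " + r.value()));
            }
        }
    }
}
```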
=== Assume everything from the consumer's point of view ===
I was reading a couple of Kafka articles and saw that the number of partitions is coupled to the number of micro-service instances... Ex: say I have 1 topic with 1 partition for my serviceA. The producer pushes a message to topic T1, partition P1, and on the consumer side (serviceA1) I can read from T1/P1. If I spin up a new pod (serviceA2) for higher throughput, the second instance will never receive any message, because Kafka/ZooKeeper assigns an id to each consumer and partition P1 is already taken by serviceA1. So serviceA2 and any further instances stay idle... To avoid this hassle, Kafka recommends adding more partitions, so that the number of consumers can be increased/decreased based on need.
I was also able to test this through the command line, and service2 never consumed any message. If I shut down service1, then service2 was able to pick up new messages... So if I spin up more pods, fail-safety/availability increases but throughput always stays the same...
Is my assumption correct? Am I missing anything? Now I feel like any standard messaging system will have the same problem... How does one scale message-oriented systems themselves?
Every topic has at least one partition; by default it comes with only one partition if you don't set the partition count. In your case, you have a consumer group consisting of two consumers. Each consumer reads the log from its assigned partitions. Here, the first consumer reads the log from the first (and only) partition, and there is no partition left for the second consumer to read from, so it becomes idle. Only once the first consumer goes down does the second consumer start reading data from the first partition, from the last committed offset.
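As a minimal sketch (broker address and topic name are placeholders), the topic could instead be created up front with two partitions, so that each of the two consumers gets one and neither sits idle:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // Two partitions so that each of the two consumers in the group
            // can be assigned one.
            NewTopic topic = new NewTopic("my-topic", 2, (short) 1);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```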
Please check the blogs and videos below. They explain topics, consumers, and consumer groups in Kafka.
https://www.javatpoint.com/apache-kafka-consumer-and-consumer-groups
http://cloudurable.com/blog/kafka-architecture-consumers/index.html
https://docs.confluent.io/platform/current/clients/consumer.html
https://www.youtube.com/watch?v=lAdG16KaHLs
I hope this gives you an idea about consumers and consumer groups.
A broad solution to this is to decouple consumption of a message (i.e. receiving a message from Kafka and perhaps deserializing it and validating that it conforms to the schema) from processing it (interpreting the message). If consumption is simple enough, being limited to no more consuming instances than there are partitions need not constrain overall throughput.
One way to accomplish this is to have a Kafka consumption service which sends an HTTP request (perhaps through a load balancer or whatever) to a processing service which has arbitrarily many members.
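A minimal sketch of that split, assuming Java 11+ for java.net.http; the topic name and the processing endpoint behind the load balancer are hypothetical:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ForwardingConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "forwarder");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");

        HttpClient http = HttpClient.newHttpClient();
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("jobs"));
            while (true) {
                // Consumption stays cheap: deserialize and hand off.
                consumer.poll(Duration.ofMillis(500)).forEach(record -> {
                    HttpRequest request = HttpRequest.newBuilder()
                            .uri(URI.create("http://processing-service/process"))
                            .POST(HttpRequest.BodyPublishers.ofString(record.value()))
                            .build();
                    // The load balancer behind this URL fans work out to
                    // arbitrarily many processing instances.
                    http.sendAsync(request, HttpResponse.BodyHandlers.discarding());
                });
            }
        }
    }
}
```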
Note that depending on what you're using Kafka for, there may be a requirement that certain messages always land in the same partition as one another, in order to ensure that they get handled in a deterministic order (since ordering across partitions is not guaranteed). A typical example of this is when the messages are change events for a particular record. If you're accomplishing this via some hash of the message key (or a portion of the key, if using a custom partitioner), then simply changing the number of partitions might not be viable: you would need to introduce some sort of migration, or have the producers know which records have to be routed to the old partitions and only route to the new partitions if the record has never been seen before.
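To illustrate why, here is a small sketch that mirrors the Java client's default keyed partitioning (a murmur2 hash of the key bytes, modulo the partition count); the key and the partition counts are made-up examples:

```java
import java.nio.charset.StandardCharsets;
import org.apache.kafka.common.utils.Utils;

public class PartitionForKey {
    // Mirrors the default partitioner's key-to-partition mapping:
    // murmur2 hash of the key bytes, modulo the partition count.
    static int partitionFor(String key, int numPartitions) {
        return Utils.toPositive(Utils.murmur2(key.getBytes(StandardCharsets.UTF_8)))
                % numPartitions;
    }

    public static void main(String[] args) {
        String key = "order-42";
        // The same key can land on a different partition once the count changes,
        // which is why simply adding partitions can break per-key ordering.
        System.out.println(partitionFor(key, 6));   // partition under 6 partitions
        System.out.println(partitionFor(key, 12));  // likely different under 12
    }
}
```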
We just started replacing traditional messaging with Kafka.
In a traditional MQ setup there is a cluster with one or more queues inside.
The MQ cluster/coordinator service delivers messages to clients.
There can be 10 services/clients consuming messages from a single queue.
So if there are 10 messages in the queue, each service/consumer/client can read/process 1 message.
This is not possible in Kafka, which I now understand is by design.
To achieve similar functionality in Kafka, I have to add at least as many partitions as there are clients/consumers/pods.
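For reference, a minimal sketch of growing an existing topic's partition count with the AdminClient (broker address, topic name, and target count are placeholders); note that the partition count can only be increased, never decreased:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewPartitions;

public class GrowPartitions {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // Grow the topic to 10 partitions so up to 10 consumers/pods
            // in the group can each own one.
            admin.createPartitions(
                    Collections.singletonMap("jobs", NewPartitions.increaseTo(10)))
                 .all().get();
        }
    }
}
```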
We have a bunch of producers that send messages/events to a bunch of consumers. Each message must be consumed by exactly one consumer. We know that this common scenario can easily be achieved by using consumer groups in Kafka. However, we also have a couple of additional constraints: not every consumer can consume every message. Messages have (arbitrary) requirements attached to them, and only consumers that fulfil these requirements must process them. This would still be possible with a consumer group where a consumer first looks at the message and re-submits it if it does not meet the requirements. However, there is no guarantee that a message will be seen by every consumer at least once, so messages may bounce around indefinitely even though there is a matching consumer. We also cannot set up multiple topics, because the requirements for consumers are arbitrarily complex boolean formulas defined by the user, not the application. This can result in a combinatorial explosion of topics.
Additionally, we want to be able to dynamically add and remove consumers from the group in case more processing resources are needed. As far as I understand Kafka, this can lead to consumers not getting any messages if there are not enough partitions, and dynamically re-partitioning is also not really possible (without admin interaction).
Is there any way to make this work in Kafka? Maybe Kafka is not the right technology; are there others that are more suitable? We also looked at RabbitMQ, but there too we did not find a way to guarantee that every consumer sees a message so that it can evaluate the requirements.
You could commit offsets manually after identifying the desired events, by setting ENABLE_AUTO_COMMIT_CONFIG to false in your consumer configs, but your use case would trigger excessive rebalances, which stop any consumption. I don't think Kafka is the appropriate infrastructure for this.
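A minimal sketch of that manual-commit pattern; meetsRequirements and process are placeholder methods, and the broker/topic/group names are made up:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ManualCommitConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "selective-workers");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false"); // commit by hand
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("jobs"));
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                    if (meetsRequirements(record)) {
                        process(record);
                    }
                }
                // Only mark progress once the batch has been looked at.
                consumer.commitSync();
            }
        }
    }

    static boolean meetsRequirements(ConsumerRecord<String, String> r) { return true; } // placeholder
    static void process(ConsumerRecord<String, String> r) { /* placeholder */ }
}
```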
However, if you can mark your events with a finite number of keys, you can dictate which partition they are produced to. Using the same key-to-partition mapping in your consumer guarantees that it polls events from the same partition. Note that you need to have the same number of partitions in your topic as you have unique keys.
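A rough sketch of that idea, with a made-up "gpu" requirement key mapped to a fixed partition: the producer targets the partition explicitly, and a matching consumer pins itself to it with assign() instead of joining a group's assignment:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.TopicPartition;

public class KeyedRouting {
    // Hypothetical fixed mapping: the "gpu" requirement key -> partition 0.
    static final int GPU_PARTITION = 0;

    public static void main(String[] args) {
        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "kafka:9092");
        producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            // Explicit partition: events with the "gpu" key always land on partition 0.
            producer.send(new ProducerRecord<>("jobs", GPU_PARTITION, "gpu", "payload"));
        }

        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "kafka:9092");
        consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
            // A consumer that fulfils the "gpu" requirement pins itself to that
            // partition, bypassing group assignment (and its rebalances).
            consumer.assign(Collections.singletonList(new TopicPartition("jobs", GPU_PARTITION)));
            while (true) {
                consumer.poll(Duration.ofMillis(500))
                        .forEach(r -> System.out.println("gpu job: " + r.value()));
            }
        }
    }
}
```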
Let's say I have a Kafka cluster with several topics spread over several partitions. Also, I have a cluster of applications acting as clients for Kafka. Each application in that cluster has a client subscribed to the same set of topics, identical across the whole cluster. All of these clients also share the same Kafka group ID.
Now, speaking of commit mode: I really do not want to specify offsets manually, but I do not want to use autocommit either, because I need to do some handling after I receive my data from Kafka.
With this setup, I expect the "same data received by different consumers" problem to occur, because I do not specify an offset before I read (consume), and I read data concurrently from different clients.
Now, my question: what are the solutions to get rid of multiple reads? Several options come to mind:
1) Exclusive (sequential) Kafka access: until one consumer has committed its read, no other consumer accesses Kafka.
2) Somehow specify the offset before each read. I do not even know how to do that under the assumption that a read might fail (and the offset will not be committed); we would need some complicated distributed offset storage.
I'd like to ask people experienced with Kafka to recommend something to achieve the behavior I need.
Every partition is consumed by only one client; another client with the same group ID won't get access to that partition, so concurrent reads won't occur...
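You can observe this by attaching a ConsumerRebalanceListener: starting several instances of this sketch with the same group.id shows each one owning a disjoint set of partitions (broker address, group, and topic names are placeholders):

```java
import java.time.Duration;
import java.util.Collection;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class AssignmentWatcher {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092");
        props.put("group.id", "shared-group");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("events"), new ConsumerRebalanceListener() {
                @Override
                public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                    // Each instance in the group owns a disjoint set of partitions,
                    // so no two instances read the same record concurrently.
                    System.out.println("assigned: " + partitions);
                }

                @Override
                public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                    System.out.println("revoked: " + partitions);
                }
            });
            while (true) {
                consumer.poll(Duration.ofMillis(500));
            }
        }
    }
}
```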
I am building a correlated system using Kafka. Suppose there's a service A that performs data processing, and there are thousands of clients B that submit jobs to it. Bs are short-lived: they appear on the network, push their data to A, and then two important things happen:
B will immediately receive a status from A;
B will then either drop out completely, stay online to receive further status updates, or sporadically pop back on to check the status.
(This is not dissimilar to grid computing or MPI.)
Both points should be achieved using the well-known concept of a correlation id: B possesses a unique id (a UUID in my case), which it sends to A in the message headers; A, in turn, uses it as a Reply-To topic to send status updates to. This means topics have to be created on the fly; they can't be predetermined.
I have auto.create.topics.enable switched on, and it does create topics dynamically, but existing consumers are not aware of them and need to be restarted [to fetch topic metadata, I suppose, if I understood the docs right]. I also checked the consumer's metadata.max.age.ms setting, but it doesn't seem to help, even when I set it to a very low value.
As far as I've read, this is as yet unanswered (see "kafka filtering/Dynamic topic creation", "kafka consumer to dynamically detect topics added", "Can a Kafka producer create topics and partitions?"), or answered unsatisfactorily.
As there are hundreds of As and thousands of Bs, I can't possibly use shared topics or anything like that, lest I overload my network. I could use Kafka's AdminTools, or whatever it's called, to pre-create topics, but I find that somewhat silly (even though I've seen real-life examples of people using it to talk to the ZooKeeper and Kafka infrastructure itself).
So the question is: is there a way to dynamically create Kafka topics such that both consumer and producer become aware of them without being restarted or anything? And, in the worst case, will AdminTools really help, and on which side must I use it, A or B?
Kafka 0.11, Java 8
UPDATE
Creating topics with AdminClient doesn't help for whatever reason; consumers still throw LEADER_NOT_AVAILABLE when I try to subscribe.
OK, so I'll answer my own question.
1) Creating topics with AdminClient only works if it is performed before the corresponding consumers are created.
2) I changed my topology, taking 1) into account and introducing an exchange of correlation ids in message headers (same as in JMS). I also had to implement certain topology management methodologies, grouping the Bs into containers.
It should be noted that, as many people have said, this only works when the Bs are in single-consumer groups and listen to topics with 1 partition.
To get some idea of the work involved, you might have a look at the middleware framework I've been working on: https://github.com/ikonkere/magic.
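A minimal sketch of the ordering described in 1), assuming a recent kafka-clients version: create the reply topic with AdminClient and wait for the call to complete before the consumer is ever constructed (broker address and naming scheme are placeholders):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import java.util.UUID;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ReplyTopicBootstrap {
    public static void main(String[] args) throws Exception {
        String replyTopic = "reply-" + UUID.randomUUID(); // B's correlation id

        Properties adminProps = new Properties();
        adminProps.put("bootstrap.servers", "kafka:9092");
        // Step 1: create the reply topic and wait for the broker to finish.
        try (AdminClient admin = AdminClient.create(adminProps)) {
            admin.createTopics(Collections.singletonList(new NewTopic(replyTopic, 1, (short) 1)))
                 .all().get();
        }

        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092");
        props.put("group.id", replyTopic); // single-consumer group per B
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        // Step 2: only now create the consumer, so its first metadata fetch
        // already knows about the topic.
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList(replyTopic));
            while (true) {
                consumer.poll(Duration.ofMillis(500))
                        .forEach(r -> System.out.println("status update: " + r.value()));
            }
        }
    }
}
```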
Creating an unbounded number of topics is not recommended. I'd advise redesigning your topology/system.
I've thought about making dynamic topics myself, but then realized that ZooKeeper will eventually fail as it runs out of memory due to stale topics (imagine how many topics could exist a year from now). Maybe this could work if you make sure you have some upper bound on the number of topics ever created. Overall, it's an administrative headache.
If you look up using Kafka for request-response, you will find others also saying it is awkward to do (see "Does Kafka support request response messaging").
I am new to Kafka and I am trying to build multiple-producer/multiple-subscriber functionality.
Let's say there are N producers called P1, P2, P3... and M consumers C1, C2, C3...
Now C1 needs to subscribe to P1 and P2, and at some point in time it needs to subscribe to P3 as well. Hence C1 has a dynamic list of topics it needs to subscribe to.
I was hoping this could be achieved using the high-level consumer, where we can name our consumer group and Kafka will store the offset up to which we have read. But what I noticed is that we also need to give the topic names when creating the high-level consumer. In my case I have around 1000 topics to subscribe to, and this list is updated dynamically.
Is there a way for the Kafka high-level consumer to remember the topics it has subscribed to and listen to them when brought up, rather than us providing the names of all the topics it subscribed to in the past?
I don't think the Kafka architecture you outlined would work. The main issue, given that a Kafka topic is a point of asynchrony between producers and consumers, is that you cannot do a clean-cut switch with your "dynamic list of topics you need to subscribe to" (as you put it), since some amount of messages will presumably always be in "the queue".
Besides that, it's not exactly trivial to dynamically change the topic (and partition) in consumer clients. AFAIK, Kafka is not meant to be used this way.
A better option would be to use a special message field that would tell your consumer clients whether the message is for them or not.
So you could use dedicated topics for messages that don't require this dynamic behaviour (to avoid doing the check for all messages, where possible) and a separate topic where you mix all the messages that do require it.
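A rough sketch of that check, assuming producers stamp a hypothetical "target" header on each message; note that each consumer runs in its own group here, so it sees every message on the mixed topic (broker address and names are placeholders):

```java
import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.header.Header;

public class FilteringConsumer {
    public static void main(String[] args) {
        String myId = "C1"; // this consumer's identity

        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092");
        props.put("group.id", "mixed-consumers-" + myId); // own group: sees every message
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("mixed"));
            while (true) {
                consumer.poll(Duration.ofMillis(500)).forEach(record -> {
                    // Producers stamp each message with its intended recipient;
                    // everyone else simply skips it.
                    Header target = record.headers().lastHeader("target");
                    if (target != null
                            && myId.equals(new String(target.value(), StandardCharsets.UTF_8))) {
                        System.out.println("for me: " + record.value());
                    }
                });
            }
        }
    }
}
```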