How to read messages from kafka consumer group without consuming? - apache-kafka

I'm managing a Kafka queue using a common consumer group across multiple machines. Now I also need to show the current content of the queue. How do I read only those messages within the group that haven't been read yet, while leaving them readable by the other consumers in the group that actually process them? Any help would be appreciated.

In Kafka, the notion of "reading" messages from a topic and that of "consuming" them are the same thing. At a high level, the only thing that makes a "consumed" message unavailable to a consumer is that consumer setting its read offset to a value beyond that of the message in question. Thus, you can turn off the autocommit feature of your consumers and avoid committing offsets in cases where you'd like only to "read" but not to "consume".
A good proxy for getting "all messages which haven't been read" is to compare the latest committed offset to the highwater mark offset per partition. This provides a notion of "lag" that indicates how far behind a given consumer is in its consumption of a partition. The fetch_consumer_lag CLI function in pykafka is a good example of how to do this.
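The lag calculation described above can be sketched without any client library. This is only a minimal illustration, assuming you have already fetched the last committed offset and the high watermark for each partition with whatever client you use; the function name and the sample numbers are hypothetical.

```python
from typing import Dict


def consumer_lag(committed: Dict[int, int], high_watermarks: Dict[int, int]) -> Dict[int, int]:
    """Return the number of not-yet-committed messages per partition.

    committed:        partition -> last committed offset (next offset the group will read)
    high_watermarks:  partition -> offset one past the newest available message
    A partition without a committed offset is treated as starting from 0,
    which is an assumption made for this sketch.
    """
    lag = {}
    for partition, high in high_watermarks.items():
        lag[partition] = max(high - committed.get(partition, 0), 0)
    return lag


# Hypothetical offsets for a three-partition topic.
committed = {0: 120, 1: 95}
high_watermarks = {0: 150, 1: 95, 2: 10}
print(consumer_lag(committed, high_watermarks))
```

The per-partition ranges from the committed offset up to the high watermark are the messages the group has not yet consumed; a separate viewer that reads them without committing leaves them available to the group.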

In Kafka, a partition can be consumed by only one consumer in a group, i.e. if your topic has 10 partitions and you spawn 20 consumers with the same groupId, then only 10 will be assigned a partition and the remaining 10 will sit idle. An idle consumer is assigned a partition only if one of the existing consumers dies or stops polling the topic.
AFAIK, you can't do what I understand you want to do within a single consumer group. You can, however, create another groupId and process messages based on the information gathered by the first consumer group.
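The 10-partitions, 20-consumers case can be illustrated with a toy assignment function. This is a simplified round-robin sketch, not the logic of any of Kafka's real assignors; the function and consumer names are made up for the example.

```python
from typing import Dict, List


def assign_partitions(partitions: List[int], consumers: List[str]) -> Dict[str, List[int]]:
    """Round-robin partitions over the consumers of one group.

    Each partition goes to exactly one consumer; consumers beyond the
    partition count end up with an empty list, i.e. they are idle.
    """
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


partitions = list(range(10))
consumers = [f"consumer-{i}" for i in range(20)]
assignment = assign_partitions(partitions, consumers)
idle = [c for c, owned in assignment.items() if not owned]
print(len(idle), "idle consumers")
```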

Kafka Streams now has a KStream.peek() method.
See proposal "Add KStream peek method".
It's not 100% clear to me from the docs that this prevents consuming of message that's peeked from the topic, but I can't see how you could use it in any crash-safe, robust way unless it does.
See also:
Handling consumer rebalance when implementing synchronous auto-offset commit
High-Level Consumer and peeking messages

I think you can use the publish-subscribe model. Then each consumer has its own offset and can consume all messages for itself.

Related

If I use Kafka as simple message. Does it really worth

=== Assume everything from consumer point of view ===
I was reading a couple of Kafka articles and saw that the number of partitions is coupled to the number of microservice instances. For example, say I have 1 topic with 1 partition for my serviceA. The producer pushes messages to topic T1, partition P1, and on the consumer side (ServiceA1) I can read from T1, P1. If I spin up a new pod (ServiceA2) to get higher throughput, the second instance will never receive any messages, because Kafka/ZooKeeper assigns an id to each consumer and partition P1 is already taken by ServiceA1. So ServiceA2 and any further instances stay idle. To avoid this, Kafka recommends adding more partitions, so that the number of consumers can be increased or decreased based on need.
I was also able to test this from the command line, and service2 never consumed any messages. If I shut down service1, then service2 was able to pick up new messages. So if I spin up more pods, fail-safety/availability increases but throughput always stays the same.
Is my assumption correct? Am I missing anything? Now I feel like any standard messaging system will have the same problem. How do I scale message-oriented systems themselves?
Every topic has at least one partition; by default it comes with only one partition if you don't define the partition count. In your case, you have a consumer group that consists of two consumers. Each consumer reads the log from a partition. Here, the first consumer reads from the first (and only) partition, and there is no partition left for the second consumer to read from, so it becomes idle. Only once the first consumer goes down does the second consumer start reading from that partition, from the last committed offset.
Please check the blogs and videos below. They explain topics, consumers, and consumer groups in Kafka.
https://www.javatpoint.com/apache-kafka-consumer-and-consumer-groups
http://cloudurable.com/blog/kafka-architecture-consumers/index.html
https://docs.confluent.io/platform/current/clients/consumer.html
https://www.youtube.com/watch?v=lAdG16KaHLs
I hope this gives you an idea about consumers and consumer groups.
A broad solution to this is to decouple consumption of a message (i.e. receiving a message from Kafka and perhaps deserializing it and validating that it conforms to the schema) from processing it (interpreting the message). If the consumption is simple enough, being limited to no more consuming instances than there are partitions need not constrain throughput.
One way to accomplish this is to have a Kafka consumption service which sends an HTTP request (perhaps through a load balancer or whatever) to a processing service which has arbitrarily many members.
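The idea of one consumer feeding many processors can be sketched in a single process. Note that this swaps the HTTP-based processing service for a local thread pool purely to keep the example self-contained; `process` and `consume_and_dispatch` are hypothetical names, and the message list stands in for records polled from Kafka.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, List


def process(message: str) -> str:
    """Stand-in for the processing service (hypothetical business logic)."""
    return message.upper()


def consume_and_dispatch(messages: Iterable[str], workers: int) -> List[str]:
    """One 'consumer' hands every message to a pool of processors.

    The pool size is independent of the partition count. Results are
    returned in submission order, but the processing itself runs
    concurrently, so the order of side effects is not guaranteed.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process, messages))


# A single consumer feeding four processors.
results = consume_and_dispatch(["a", "b", "c", "d", "e"], workers=4)
print(results)
```

In a real deployment the consumer would typically commit an offset only after the processors have confirmed the corresponding messages, otherwise a crash could lose work that was dispatched but not finished.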
Note that depending on what you're using Kafka for, there may be a requirement that certain messages always be in the same partition as one another in order to ensure that they get handled in a deterministic order (since ordering across partitions is not guaranteed). A typical example of this would be if the messages are change events for a particular record. If you're accomplishing this via some hash of the message key (or a portion of the key if using a custom partitioner), then simply changing the number of partitions might not be viable (you would need to introduce some sort of migration or have the producers know which records have to be routed to the old partitions and only route to the new partitions if the record has never been seen before).
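Why changing the partition count breaks key-based ordering can be seen with a small hashing sketch. CRC32 is used here only as a dependency-free stand-in and is not the hash Kafka's default partitioner uses; the key names and partition counts are invented.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition by hashing it.

    CRC32 is used only to keep the sketch dependency-free; it is not the
    hash Kafka's default partitioner uses.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


keys = [f"record-{i}" for i in range(100)]
before = {key: partition_for(key, 4) for key in keys}
after = {key: partition_for(key, 5) for key in keys}
moved = [key for key in keys if before[key] != after[key]]
print(f"{len(moved)} of {len(keys)} keys changed partition")
```

Any key in `moved` would have its older events in one partition and its newer events in another, which is exactly the situation where ordering per record can no longer be relied on.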
We just started replacing our messaging with Kafka.
In a traditional MQ setup there is a cluster with one or more MQs inside.
The MQ cluster/coordinator service delivers the messages to clients.
There can be 10 services/clients consuming messages from a single MQ.
So if there are 10 messages in the MQ, each service/consumer/client can read/process 1 message.
This is not possible in Kafka, which I now understand is by design.
To achieve similar functionality in Kafka, I have to add at least as many partitions as there are clients/consumers/pods.

What is the difference between pulsar and kafka in regards to consumption?

In order to consume data from Kafka, we can have multiple consumers on a topic, totally decoupled. Then, what is meant by no shared consumption on the page(https://streaml.io/blog/pulsar-streaming-queuing) which shares differences between kafka and pulsar?
In his blog, Sijie is referring to shared messaging as queuing. With queuing messaging, multiple consumers are created to receive messages from a single topic. Which consumer gets the message is completely random.
The issue with implementing the queuing messaging pattern with Kafka lies in the way that Kafka consumers mark that they’ve consumed a message. Kafka consumers use what’s called a high watermark for consumer offsets. That means that a consumer can only say, “I’ve processed up to this point” rather than, “I’ve processed this message.”
Consider the scenario in which multiple Kafka consumers from the same consumer group are processing from the same topic partition and one of the consumers fails due to an exception while the others succeed. Because Kafka does not have a built-in way to acknowledge only a single message, and only uses a high-water mark, the failed message would be erroneously marked as consumed when in fact it failed and needs to be either reprocessed or published to an error queue, etc.
In order to avoid this situation, you would need to have just a single consumer per partition, which limits the consumption throughput of the topic. That in turn requires you to increase the number of partitions in order to meet your throughput needs.
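The effect of a high-water-mark commit can be shown with a toy batch. This sketch uses a single loop rather than several competing consumers, and the handler and message values are invented, but the point is the same: the commit is one number, so it cannot skip a failed message.

```python
from typing import Callable, List, Tuple


def process_batch(messages: List[str], handler: Callable[[str], None]) -> Tuple[int, List[int]]:
    """Process a batch from one partition and return (offset_to_commit, failed_offsets).

    The commit is a single number: "everything below this offset is done".
    There is no way to exclude an individual offset from it.
    """
    failed = []
    for offset, message in enumerate(messages):
        try:
            handler(message)
        except ValueError:
            failed.append(offset)
    # Committing past the end of the batch also covers the failed offsets.
    return len(messages), failed


def handler(message: str) -> None:
    if message == "bad":
        raise ValueError("cannot process message")


commit_offset, failed = process_batch(["m0", "m1", "bad", "m3", "m4"], handler)
print(commit_offset, failed)
```

Here offset 2 failed, yet committing offset 5 marks it as consumed, so the application itself has to retry it or route it to an error queue.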
There is a detailed explanation in this blog post

Is manipulating the "read-offset" as kafka consumer bad-practice?

We have an ongoing discussion about the correct (or intended) usage of Kafka for events.
The arguing point is the ability of a consumer to not only subscribe (or resubscribe) to a topic but also to modify its own read offset.
Am I right in saying that "A consumer should be designed in such a way that it never modifies its own read offset"?
Reasoning behind this:
The consumer cannot know what events actually are stored inside a topic (log retention)
... So restoring a complete state from "delta"-events is not possible.
The consumer has consumed an event once and confirmed this to the broker. Why consume it again?
If your consumer instances belong to the same consumer group, the consumers need not keep the reading state of the topic themselves. The reading state is nothing but the offset, per partition, up to which your consumers have read so far. If your topic has multiple partitions, consumers belonging to the same consumer group can distribute the workload among themselves. If one of the consumers crashes or fails, the other consumers in the same group will know from which partition offset to continue consuming records.

Kafka - Synchronized Consumer Groups

I am trying to wrap my head around Kafka consumers, and I'd like to know if the following use case can be solved using Kafka.
My use case is basically this one:
I have a stream that I'd like to be consumed in sync by several consumers. In other words, I have a first consumer that starts to consume the stream, then another consumer arrives later. I'd like this second consumer to start consuming the stream at the offset where the first consumer currently is.
I know that I need to have the consumers in two different groups. But it is not clear to me:
how, or whether, it is possible to coordinate the groups' offsets
whether I should expect latency for such a coordination task
You do not need two different groups, all consumers can check one topic. Or as many as they like, for that matter.
offset
Messages typically are identified by their arrival date, so all the clients need to tell the producer "my last visit was at 10:00, give me all new messages". So all each client needs to keep track of is when which individual topic was checked last.
latency
This is kind of out of scope at this point. Of course there will be latency, but it depends on the environment: how many consumers, how many topics, the message format, etc.
so can your usecase be solved using kafka
In short: yes. "Can one consumer continue where another has left", the consumers could exchange the latest index between each other, of course that would require some internal synchronization. Kafka itself does not care about consumers, so it will not keep track itself about the latest index. You need to do the work. Another possibility would be to actually consume the messages (like, delete them from queue once consumed), so each time another consumer hits the queue it is guaranteed to receive the messages another consumer left off. Of course that would depend on your usecase, can you actually delete your messages from the queue.
This is not a problem Kafka handles directly (a consumer group exists to distribute partitions among its members, not to give them the same offset), but you can do something about it. You could simply create another topic, where consumer1 would post either the offset or a copy of each message it reads (so you would need both a consumer and a producer for this), and your other synchronized consumers would react to it. Of course, there would be some latency.
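The side-topic idea can be sketched with in-memory lists standing in for the two topics. Everything here is hypothetical (the `FollowerConsumer` class, the list-based "topics", the offsets); it only illustrates a follower that joins at the leader's current offset and never reads past what the leader has published.

```python
from typing import List


class FollowerConsumer:
    """Reads the data log but never past the leader's last published offset."""

    def __init__(self, log: List[str], start: int = 0) -> None:
        self.log = log
        self.position = start

    def poll(self, leader_offset: int) -> List[str]:
        # Only hand out messages the leader has already reported as read.
        limit = min(leader_offset, len(self.log))
        batch = self.log[self.position:limit]
        self.position = max(self.position, limit)
        return batch


data_topic = ["e0", "e1", "e2", "e3", "e4"]   # stands in for the main topic
offsets_topic: List[int] = []                 # stands in for the side topic

# The leader has read three messages and publishes its next offset.
offsets_topic.append(3)

# A follower arrives later and starts where the leader currently is.
follower = FollowerConsumer(data_topic, start=offsets_topic[-1])
nothing_yet = follower.poll(offsets_topic[-1])

# The leader reads the remaining messages; the follower catches up.
offsets_topic.append(5)
caught_up = follower.poll(offsets_topic[-1])
print(nothing_yet, caught_up)
```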
What is the use case behind this? Why can't you consume at different offsets? Couldn't you instead have one consumer that dispatches the messages it reads to different processes, so that they are indeed synchronized (with no latency)?
What do you mean by synchronized: should consumer2 (and 3 and more) only consume the same messages as consumer1 (i.e. never consume faster, which is what I assume in both previous solutions)? While this is possible, it would really be better to know the reason behind this; maybe there is a better way for you to process the data.

Kafka - Multiple consumers (only one active) on same group/topic

Is it possible to have multiple copies of an application listen to the same Kafka group/topic so that only one is reading it at a time, but the other ones will start working if the main one crashes/stops reading?
I need to make an application highly available but can't tolerate doubling the traffic to the data store on the other end of the application by having multiple copies actively running.
FYI - Technically I'm using MapR streams but it adheres to the Kafka API and functionality, in case anyone knows a MapR stream-specific feature that helps the situation.
It is possible. If multiple consumers are in the same consumer group, then when the group subscribes to a topic, Kafka assigns partitions to your consumers: within a group, one partition can be consumed by only one consumer.
So you could set your topic to have only one partition; then only one consumer will consume messages and the others will be idle. Once that consumer shuts down, it triggers a group rebalance: Kafka does the partition assignment again, and in your case a new consumer takes over the work. It will process messages from the last offset committed by the old consumer.
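The single-partition failover described above can be modelled with a toy group. This is a sketch under simplifying assumptions (one partition, offsets committed after every read, a member list in place of real group membership); the class and member names are invented.

```python
from typing import List, Optional


class SinglePartitionGroup:
    """Toy model of a consumer group on a one-partition topic."""

    def __init__(self, members: List[str]) -> None:
        self.members = list(members)
        self.committed_offset = 0  # shared by the whole group

    @property
    def active(self) -> Optional[str]:
        # With one partition, only one member can own it; the rest are idle.
        return self.members[0] if self.members else None

    def consume(self, log: List[str], count: int) -> List[str]:
        """The active member reads `count` messages and commits its new position."""
        batch = log[self.committed_offset:self.committed_offset + count]
        self.committed_offset += len(batch)
        return batch

    def remove(self, member: str) -> None:
        """A member leaving triggers a 'rebalance': the next member takes over."""
        self.members.remove(member)


log = ["m0", "m1", "m2", "m3"]
group = SinglePartitionGroup(["app-1", "app-2"])
first_batch = group.consume(log, 2)      # read by app-1
group.remove("app-1")                    # app-1 crashes
second_batch = group.consume(log, 2)     # app-2 resumes at the committed offset
print(group.active, first_batch, second_batch)
```

Because the new owner resumes from the committed offset, any message read but not yet committed by the crashed instance would be delivered again, so the downstream data store should tolerate an occasional duplicate.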
And if your case supports parallel processing, you could run many processes (apps) doing the same work and give the topic multiple partitions. They will be assigned different partitions and process different messages. This speeds up processing and also tolerates failover. As said above, if some consumers fail, Kafka takes care of it for you by assigning their partitions to other working consumers. So everything will be OK.