Is it possible in Kafka to read messages in reverse order? - apache-kafka

Can a new consumer group be created with a consumer assigned to an existing topic, but somehow configured to consume backward, so that the offset moves from the latest message at that moment to the earliest in every partition?

Kafka topics are meant to be consumed sequentially, in the order in which messages appear within each topic partition.
However, I see two options to solve your issue:
You can steer which data the consumer polls from the topic partition: have your consumer seek to the last offset in the partition (the end offset minus one), read only that one record, then seek to the offset before it, again read only one record, and so on. Although I have never seen it done, this should be possible with consumer.seek and the consumer configuration max.poll.records set to 1.
You could consume the topic into any kind of state store and order it descending by offset for each partition. Then have another consumer read the state store in the desired order.
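The first option can be sketched roughly as follows. This is a minimal model, not code from the answer: SeekablePartition, InMemoryPartition and BackwardReader are hypothetical names, and the in-memory partition exists only so the sketch runs without a broker. With the real Java client, seek would correspond to consumer.seek(partition, offset) and pollOne to a poll with max.poll.records=1. The sketch also assumes offsets are contiguous, which does not hold on a compacted topic.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for one partition assigned to a consumer.
interface SeekablePartition {
    long beginningOffset();   // first readable offset
    long endOffset();         // offset of the last record plus one
    void seek(long offset);   // real client: consumer.seek(partition, offset)
    String pollOne();         // real client: poll with max.poll.records=1
}

// In-memory partition, used only to make the sketch runnable.
class InMemoryPartition implements SeekablePartition {
    private final List<String> records;
    private long position = 0;

    InMemoryPartition(List<String> records) { this.records = records; }

    public long beginningOffset() { return 0; }
    public long endOffset() { return records.size(); }
    public void seek(long offset) { position = offset; }
    public String pollOne() { return records.get((int) position++); }
}

class BackwardReader {
    // Walks from the last record down to the first:
    // one seek and one single-record poll per offset.
    static List<String> readBackward(SeekablePartition partition) {
        List<String> result = new ArrayList<>();
        for (long offset = partition.endOffset() - 1;
             offset >= partition.beginningOffset();
             offset--) {
            partition.seek(offset);
            result.add(partition.pollOne());
        }
        return result;
    }
}
```

Each record costs its own seek and fetch, so this is likely to be much slower than reading forward and reversing in memory or in a state store.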

Related

Kafka current offset internals

Can somebody explain how Kafka's current offset mechanism works from the consumer's point of view? I have a huge topic (several gigabytes), divided into 2 partitions. In some (rare) business cases, I need to choose N random records within a partition and read them.
My colleague says that the Kafka consumer does not know anything about offsets; it just receives a bunch of records on every poll(), with the offset attached to every record as meta-information. I.e. the "seek" mechanism works as follows: the consumer requests records and ignores them until the target offset has been reached.
Is that true? In my understanding, such "rewinding" is a waste of consumer resources and network traffic. I think there MUST be a way to point at a specific offset, so that a broker could send the record with that specific offset immediately on poll() without that kinda "spinloop" stuff.
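You can seek to a specific offset: seek sets the position that the consumer's next fetch asks the broker for, so the consumer does not have to read through everything before it. The committed offsets, however, are stored per consumer group in the offsets topic, not in the consumer itself.
Hopping around to "random" offsets is indeed not efficient.
The size of the topic doesn't matter.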

Kafka multiple consumer

When we have multiple consumers reading from a topic with a single partition, is there any possibility that all the consumers will get all the messages?
I have created two consumers with manual offset commit. I started the first consumer and, after 2 minutes, started the 2nd consumer. The second consumer is reading from the message where the 1st consumer stopped reading. Is there any possibility that the 2nd consumer will read all the messages from the beginning? I'm new to Kafka, please help me out.
In your consumer, you are presumably using commitSync, which commits the offsets returned by the last poll. When you start your 2nd consumer, since it is in the same consumer group, it will read messages from the last committed offset.
Which messages your consumer consumes depends on the consumer group it belongs to. Suppose you have 2 partitions and 2 consumers in a single consumer group: then each consumer will read from a different partition, which helps to achieve parallelism.
So, if you want your 2nd consumer to read from beginning, you can do one of 2 things:
a) Try putting the 2nd consumer in a different consumer group. For this new consumer group, no offsets are stored anywhere yet, so the auto.offset.reset config decides the starting offset: earliest resets to the earliest offset, latest resets to the latest offset. To read from the beginning, set it to earliest.
b) Seek to start of all partitions your consumer is assigned by using: consumer.seekToBeginning(consumer.assignment())
Documentation: https://kafka.apache.org/11/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html#seekToBeginning-java.util.Collection-
https://kafka.apache.org/documentation/#consumerconfigs
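As a rough illustration of option a), the consumer settings for such a fresh group might look like the sketch below. The class name FreshGroupConfig, the broker address localhost:9092 and the choice of String deserializers are assumptions made for the example; the property keys themselves are standard consumer configs.

```java
import java.util.Properties;

class FreshGroupConfig {
    // Builds consumer settings for a group that has no committed offsets yet.
    static Properties forNewGroup(String groupId) {
        Properties props = new Properties();
        // Assumed broker address; replace with your own cluster.
        props.setProperty("bootstrap.servers", "localhost:9092");
        // A group id that has never committed offsets for the topic.
        props.setProperty("group.id", groupId);
        // With no committed offset, start from the earliest available record.
        props.setProperty("auto.offset.reset", "earliest");
        // Manual commits, as in the question.
        props.setProperty("enable.auto.commit", "false");
        // Assumed String keys and values.
        props.setProperty("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.setProperty("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        return props;
    }
}
```

These properties would then be passed to the KafkaConsumer constructor. Note that auto.offset.reset only applies while the group has no valid committed offset; once the new group commits, later restarts resume from the committed position.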
A partition is always assigned to exactly one consumer within a single consumer group, no matter how many consumers the group has. This means only that consumer can read the data, and the others won't consume it until the partition is assigned to them. When the consumer goes down, a partition rebalance happens and the partition is assigned to another consumer. Since you are performing manual commits, the new consumer will start reading from the committed offset.

Set topic offset to value back in time for all consumers

I have multiple consumers subscribed to one topic and all of them are in the same group. Sometimes it is necessary to force all the consumers to re-process some data which they have already processed. In this case I have the exact time to set the consumer offsets to.
Question: Is there any mechanism in Kafka to set all consumer offsets to a specified time? I mean that the consumers must not be restarted; their offsets should just be magically set to the new value, and the next poll requests would start fetching from the new offset.
If it is impossible with standard Kafka, are there any ready libraries providing such a mechanism?
Probably you are looking for a seek method:
public void seek(TopicPartition partition, long offset)
Overrides the fetch offsets that the consumer will use on the next poll(timeout)...
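Since seek takes an offset rather than a time, the timestamp has to be translated first. Newer Java clients offer offsetsForTimes for this, which returns, per partition, the earliest offset whose timestamp is greater than or equal to the requested one. The sketch below only models that lookup rule on an in-memory list so that it runs without a broker; TimeSeekSketch and firstOffsetAtOrAfter are hypothetical names, not part of the Kafka API.

```java
import java.util.List;
import java.util.OptionalLong;

class TimeSeekSketch {
    // timestamps.get(i) is the timestamp (ms) of the record at offset i
    // in one partition. Returns the earliest offset whose timestamp is
    // greater than or equal to targetMs, or empty if there is none.
    static OptionalLong firstOffsetAtOrAfter(List<Long> timestamps, long targetMs) {
        for (int offset = 0; offset < timestamps.size(); offset++) {
            if (timestamps.get(offset) >= targetMs) {
                return OptionalLong.of(offset);
            }
        }
        return OptionalLong.empty(); // no record that late: nothing to seek to
    }
}
```

With the real client, the offset found this way would be passed to seek for each assigned partition. Because seek only changes the position of the consumer instance it is called on, every running consumer in the group would need some trigger of its own to do this for its partitions.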

kafka subscribe commit offset manually

I am using Kafka 9 and am confused by the behavior of subscribe.
Why does it expect group.id with subscribe?
Do we need to commit the offset manually using commitSync? Even if I don't do that, I see that it always starts from the latest.
Is there a way to replay the messages from the beginning?
Why does it expect group.id with subscribe?
The concept of consumer groups is used by Kafka to enable parallel consumption of topics - every message will be delivered once per consumer group, no matter how many consumers actually are in that group. This is why the group parameter is mandatory, without a group Kafka would not know how this consumer should be treated in relation to other consumers that might subscribe to the same topic.
Whenever you start a consumer, it will join a consumer group; based on how many other consumers are in this consumer group, it will then be assigned partitions to read from. For these partitions it then checks whether a last read (committed) offset is known; if one is found, it will start reading messages from that point.
If no offset is found, the parameter auto.offset.reset controls whether reading starts at the earliest or latest message in the partition.
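The decision described above can be modelled in a few lines. This is only a sketch of the rule, not client code: StartPositionSketch and resolve are hypothetical names, and the offsets are passed in as plain numbers instead of being looked up from a broker.

```java
import java.util.OptionalLong;

class StartPositionSketch {
    // Models where a consumer starts reading one partition.
    // committed: the group's committed offset for the partition, if any.
    // beginning: first available offset; end: offset after the last record.
    static long resolve(OptionalLong committed, String autoOffsetReset,
                        long beginning, long end) {
        if (committed.isPresent()) {
            return committed.getAsLong(); // resume where the group left off
        }
        switch (autoOffsetReset) {
            case "earliest":
                return beginning;         // replay everything still retained
            case "latest":
                return end;               // only records written from now on
            default:
                // The real client raises an exception for "none"
                // when no committed offset exists.
                throw new IllegalStateException(
                        "no committed offset and reset policy " + autoOffsetReset);
        }
    }
}
```

For example, a group with a committed offset of 42 resumes at 42 regardless of auto.offset.reset, while a brand-new group with earliest starts at the beginning offset.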
Do we need to commit the offset manually using commitSync? Even if I don't do that, I see that it always starts from the latest.
Whether or not you need to commit the offset depends on the value you choose for the parameter enable.auto.commit. By default this is set to true, which means the consumer will automatically commit its offset regularly (how often is defined by auto.commit.interval.ms). If you set this to false, then you will need to commit the offsets yourself.
This default behavior is probably also what is causing your "problem" where your consumer always starts with the latest message. Since the offset was auto-committed it will use that offset.
Is there a way to replay the messages from the beginning?
If you want to start reading from the beginning every time, you can call seekToBeginning, which will reset to the first message in all subscribed partitions if called without parameters, or just those partitions that you pass in.

read kafka message starting from a specific offset using high level API

I hope I am not making a mistake, but I remember that the Kafka documentation mentioned that with the high level APIs you can't start reading messages from a specific offset, and that this would change.
Is it possible now using the high level APIs to read messages from a specific partition and a specific offset? Could you please give me an example how to do it?
I am using kafka 0.8.1.1.
Thanks in advance.
You can do that with kafka 0.9:
http://kafka.apache.org/090/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html
public void seek(TopicPartition partition, long offset)
Overrides the fetch offsets that the consumer will use on the next poll(timeout). If this API is invoked for the same partition more than once, the latest offset will be used on the next poll(). Note that you may lose data if this API is arbitrarily used in the middle of consumption, to reset the fetch offsets.
Kafka 0.8.1.1 can use Zookeeper to store offsets for each consumer group. If you configure your consumer to commit offsets to Zookeeper, then you just need to manually set the starting offset for the topic and partition under Zookeeper for your consumer group.
You need to connect to Zookeeper and use the set command:
set /consumers/[groupId]/offsets/[topic]/[partitionId] -> long (offset)
E.g. setting offset 10 for partition 0 of topicname for the spark-app consumer group:
set /consumers/spark-app/offsets/topicname/0 10
When a consumer starts to consume messages from Kafka, it always starts from the last committed offset. If this last committed offset is not valid for any reason, then the consumer applies the logic given by the configuration property auto.offset.reset.
Hope this helps.