Duplicate offsets in a Kafka topic with more than one partition - apache-kafka

I am using kafka_2.10-0.10.0.1 with zookeeper-3.4.10. I know that there are many types of offsets. I have two questions:
- I want to know the type of the offset returned by ConsumerRecord.offset().
- If I use a topic created with 10 partitions, can I obtain a set of records with the same offset value? In my program, I need to obtain a list of records with different offset values. I want to know whether I have to use a topic with a single partition to achieve this goal.

I want to know the type of the offset returned by ConsumerRecord.offset().
This is the offset of the record within the topic-partition the record came from.
If I use a topic created with 10 partitions, can I obtain a set of records with the same offset value?
Yes, you can seek to that offset in each partition and read the value. To do this, assign the topic-partitions you want to your consumer with Consumer#assign(), then use Consumer#seek() to seek to the offset you want to read. When you poll(), the consumer will start reading from that offset.
I want to know whether I have to use a topic with a single partition to achieve this goal.
You don't have to do this. You can read whatever offsets you want from whatever partitions you want.
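To make the per-partition nature of offsets concrete, here is a plain-Python sketch (no broker involved; a dict of lists stands in for a 3-partition topic) of what the assign() + seek() + poll() sequence achieves — the same offset value exists independently in every partition:

```python
# Conceptual model: a topic with 3 partitions, each an append-only log.
# Offsets are per-partition, so offset 1 exists in all three partitions.
topic = {
    0: ["a0", "a1", "a2"],   # partition 0, offsets 0..2
    1: ["b0", "b1", "b2"],   # partition 1, offsets 0..2
    2: ["c0", "c1", "c2"],   # partition 2, offsets 0..2
}

def seek_and_read(topic, offset):
    """Read the record at the same offset from every partition,
    mimicking Consumer#assign() + Consumer#seek() + a single poll()."""
    return {p: log[offset] for p, log in topic.items() if offset < len(log)}

print(seek_and_read(topic, 1))  # {0: 'a1', 1: 'b1', 2: 'c1'}
```

A record is therefore only identified by the pair (partition, offset), never by the offset alone.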

Related

How to Retrieve Kafka message fast based on key?

I have a scenario where I need to test a Kafka message when a transaction is completed. How can I retrieve the message fast using Java? I know the first 10 digits of the key, which are unique.
Currently I am reading every partition and offset for the relevant topic, which is not efficient (the worst case takes 2 minutes to find a key).
This is not really possible with Kafka; each Kafka partition is an append-only log that uses an offset to specify a record's position. The key isn't used when reading the partition.
The only way to "seek" to a specific message in a partition is through its offset, so instead of reading the whole partition, if you know that the message is roughly from one hour ago (or another timeframe) you can consume just that piece of the log.
See this answer on how to initialize a consumer at a specific offset based on a timestamp in Java.
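The timestamp lookup the answer refers to (offsetsForTimes in the Java client) resolves "earliest offset with timestamp >= target" against a per-partition time index. A stdlib-only sketch of that resolution logic, with a made-up index of (timestamp_ms, offset) pairs:

```python
import bisect

# Hypothetical time index for one partition: (timestamp_ms, offset),
# sorted by timestamp, as the broker maintains internally.
time_index = [(1000, 0), (2000, 1), (3500, 2), (7000, 3)]

def offset_for_time(index, target_ms):
    """Return the earliest offset whose timestamp is >= target_ms,
    mirroring what Consumer#offsetsForTimes() resolves broker-side.
    Returns None if no message is that recent."""
    timestamps = [ts for ts, _ in index]
    i = bisect.bisect_left(timestamps, target_ms)
    return index[i][1] if i < len(index) else None

print(offset_for_time(time_index, 3000))  # 2
```

You would then seek to the returned offset and consume forward, rather than scanning the partition from the beginning.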

Is there a common offset value that spans across Kafka partitions?

I am just experimenting with Kafka as an SSE holder on the server side and I want "replay capability". Say each Kafka topic is of the form events.<username> and is configured to delete items older than some time X.
Now what I want is an API that looks like
GET /events/offset=n
offset would be the last offset processed by the client; if not specified, it is the same as the latest offset + 1, which means no new results. It can also be earliest, which represents the earliest possible entry. The offset needs to exist, as a security-through-obscurity check.
My suspicion is that for this to work correctly the topic must remain in ONE partition and cannot scale horizontally. Though because the topics are tied to a username, distribution between brokers would be handled by the fact that the topics are different.
If you want to retain event sequence for each of the per-user topics, then yes, you have to use one partition per user. Kafka only guarantees message order within a single partition, not across partitions.
The earliest and latest options you mention are already supported in any basic Kafka consumer configuration. The specific offset one, you'd have to filter out manually by issuing a request for the given offset, and then returning nothing if the first message you receive does not match the requested offset.
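The offset-parameter handling described above can be sketched as a small resolver function (plain Python, no Kafka dependency; the names and the [earliest, latest) validity check are assumptions about the proposed API, not an existing implementation):

```python
def resolve_start_offset(requested, earliest, latest):
    """Map the API's offset parameter onto a concrete partition offset.
    'earliest'/'latest' mirror the standard auto.offset.reset options;
    a numeric offset must currently exist in the partition (i.e. lie in
    [earliest, latest)) or the request yields nothing, which doubles as
    the security-through-obscurity check."""
    if requested == "earliest":
        return earliest
    if requested is None or requested == "latest":
        return latest            # no new results until more events arrive
    if earliest <= requested < latest:
        return requested
    return None                  # unknown offset: reject the request

print(resolve_start_offset(7, earliest=3, latest=10))  # 7
```

The server would then seek the consumer to the resolved offset (or return an empty result for None) before streaming events.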

Is it possible in Kafka to read messages in reverse manner?

Can a new consumer group be created with a consumer assigned to an existing topic, but somehow set to consume backward, so that the offset moves from the latest message at that moment to the earliest in every partition?
Kafka topics are meant to be consumed sequentially, in order of appearance within the topic partitions.
However, I see two options to solve your issue:
You can steer which data the consumer polls from the topic partition: have your consumer seek to the latest offset, consume that record, then seek to the latest offset minus one and read a single record again, and so on back to the beginning. Although I have never seen it done, this should be possible with Consumer#seek() and the consumer configuration max.poll.records set to 1.
You could use any kind of state store and order it descending by the offset for each partition. Then have another consumer read the state store in the desired order.
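The seek-backwards approach in the first option boils down to generating a descending sequence of offsets, each consumed with a single-record poll. A minimal sketch of that sequence (plain Python; `latest` is the partition's next-to-be-written offset, so the newest record sits at latest - 1):

```python
def backward_offsets(latest, earliest=0):
    """Offsets to seek to, one single-record poll each
    (max.poll.records=1), walking from the newest record
    at latest - 1 down to the earliest retained offset."""
    return list(range(latest - 1, earliest - 1, -1))

print(backward_offsets(5))  # [4, 3, 2, 1, 0]
```

Each element would drive one seek() followed by one poll(), which is why this pattern is slow: it costs one round trip per record.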

How to find the offset of a message in Kafka Topic

How to find the offset of a message in a Kafka topic? Does an offset contain multiple messages or a single message?
Offsets only link to a single message within a single partition of a topic.
You could have the same offset value be present in many partitions and many topics, but there is almost no correlation between those values unless you explicitly made the producers do that.
There is no easy way to find the offset for a single message. You need to scan the whole topic (or at least a single partition).
Assuming you've read the message using poll() (or consume() in confluent-kafka, which returns a list), and you're interested in the details of this message, you can find the corresponding partition and offset using the following code:
# assuming you have a Kafka consumer in place
msg = consumer.poll(1.0)  # a single message, or None on timeout
partition = msg.partition()
offset = msg.offset()
A Kafka topic is divided into partitions. Kafka distributes incoming messages in a round-robin fashion across partitions (unless you've specified a key to partition on). So, for a random message that you haven't read, you may have to scan all partitions to find the message, and then the corresponding offset.
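The brute-force scan described above can be sketched without a broker (a dict of lists stands in for the topic's partitions; real code would poll each partition from its earliest offset instead):

```python
# Stand-in for a 2-partition topic; values are the message payloads.
topic = {0: ["x", "y"], 1: ["z", "target", "w"]}

def find_message(topic, value):
    """Locate a message by value the only way Kafka allows:
    walk every partition from offset 0 until the value appears.
    Returns (partition, offset) or None."""
    for partition, log in topic.items():
        for offset, record in enumerate(log):
            if record == value:
                return partition, offset
    return None

print(find_message(topic, "target"))  # (1, 1)
```

This is linear in the total number of retained messages, which is exactly why keyed lookups are better served by a database or an index built alongside the topic.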

How many records are stored in each offset of kafka partition?

I came across the official Kafka statement below:
For each topic, the Kafka cluster maintains a partitioned log
Each partition is an ordered, immutable sequence of records that is continually appended to—a structured commit log. The records in the partitions are each assigned a sequential id number called the offset that uniquely identifies each record within the partition.
So, let's say we have a Kafka topic called "emprecords"; assume for now that it has only one partition, and in that partition we have 10 offsets, from 0 to 9.
My question is
Can each offset store only one record?
Or
Can each offset store more than one record?
For each partition, each offset is assigned to exactly one record.
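This one-to-one mapping follows directly from the quoted definition: an offset is just the sequential id a record receives when it is appended. A minimal sketch of a partition as an append-only log:

```python
class PartitionLog:
    """A partition as an append-only log: appending a record assigns it
    the next sequential offset, so each offset holds exactly one record."""
    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1  # the new record's unique offset

log = PartitionLog()
offsets = [log.append(r) for r in ["e0", "e1", "e2"]]
print(offsets)  # [0, 1, 2] -- one offset per record, one record per offset
```

In the "emprecords" example, offsets 0 through 9 therefore mean the partition holds exactly 10 records.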