KafkaConsumer Position() vs Committed() - scala

position(TopicPartition partition)
Get the offset of the next record that will be fetched (if a record with that offset exists).
committed(TopicPartition partition): OffsetAndMetadata
Get the last committed offset for the given partition (whether the commit happened by this process or another).
If I need to use the latest committed offset of a particular consumer group (to be used as startingOffsets in Spark Structured Streaming), what should I use?
My code shows committed as deprecated.
// position(): offset of the next record this consumer instance will fetch
val latestOffset = consumer.position(partition)
// committed(): the group's last committed offset for the partition (deprecated single-partition overload in 2.4)
val last = consumer.committed(partition)
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka-clients</artifactId>
<version>2.4.1</version>
</dependency>
Official documentation:
Offsets and Consumer Position
Kafka maintains a numerical offset for each record in a partition. This offset acts as a unique identifier of a record within that partition, and also denotes the position of the consumer in the partition. For example, a consumer which is at position 5 has consumed records with offsets 0 through 4 and will next receive the record with offset 5. There are actually two notions of position relevant to the user of the consumer:
The position of the consumer gives the offset of the next record that will be given out. It will be one larger than the highest offset the consumer has seen in that partition. It automatically advances every time the consumer receives messages in a call to poll(long).
The committed position is the last offset that has been stored securely. Should the process fail and restart, this is the offset that the consumer will recover to. The consumer can either automatically commit offsets periodically; or it can choose to control this committed position manually by calling one of the commit APIs (e.g. commitSync and commitAsync).

You need to use the committed offset as startingOffsets in your Spark Structured Streaming job.
The position counter is advanced by the KafkaConsumer during its runtime, and its value can differ slightly from the result of the committed API, because the consumer may or may not commit offsets, and if it does commit, it might do so asynchronously.
In Kafka 2.4.1 the method committed(partition) is deprecated and it is recommended to use the newer API which takes a Set of TopicPartitions. Its signature is:
public Map<TopicPartition,OffsetAndMetadata> committed(Set<TopicPartition> partitions)
As you are using Scala, you need to convert your Scala Set into a Java Set, as in the sketch below.
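
A minimal sketch, assuming a topic my-topic with two partitions and a group my-group (all placeholder names) and Scala 2.13's scala.jdk.CollectionConverters (on 2.12, scala.collection.JavaConverters provides the same asJava/asScala): it fetches the group's committed offsets via the Set-based committed and renders them as the JSON string that Spark's startingOffsets option expects.

import java.util.Properties
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition
import scala.jdk.CollectionConverters._

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")  // placeholder: your brokers
props.put("group.id", "my-group")                 // placeholder: the group whose offsets you want
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
val partitions = Set(new TopicPartition("my-topic", 0), new TopicPartition("my-topic", 1))

// Set-based overload replacing the deprecated committed(TopicPartition);
// the map value is null for partitions the group has never committed.
val committed = consumer.committed(partitions.asJava).asScala

// Render the JSON Spark expects, e.g. {"my-topic":{"0":42,"1":17}};
// partitions without a committed offset are simply skipped here.
val startingOffsets = committed.collect {
  case (tp, om) if om != null => s""""${tp.partition}":${om.offset}"""
}.mkString("""{"my-topic":{""", ",", "}}")

consumer.close()

The resulting string can then be passed to the Spark reader via .option("startingOffsets", startingOffsets).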

Related

Does kafka distinguish between consumed offset and committed offset?

From what I understand a consumer reads messages off a particular topic, and the consumer client will periodically commit the offset.
So if for some reason the consumer fails to process a particular message, that offset won't be committed and you can then go back and reprocess the message.
Is there anything that tracks the offset you just consumed and the offset you then commit?
Does kafka distinguish between consumed offset and committed offset?
Yes, there is a big difference. The consumed offset is managed by the consumer in such a way that the consumer will fetch subsequent messages out of a topic partition.
The consumer can (but does not have to) commit a message's offset, either automatically or by calling the commit API. The information is stored in a Kafka-internal topic called __consumer_offsets, which stores the committed offset per ConsumerGroup, topic, and partition. It is used when the client is restarted or a new consumer joins/leaves the ConsumerGroup.
Just keep in mind that if your client does not commit offset n but later commits offset n+1, for Kafka it makes no difference compared to the case where you commit both offsets.
Edit: More details on consumed and committed offsets can be found in the JavaDocs of KafkaConsumer on Offsets and Consumer Position:
Kafka maintains a numerical offset for each record in a partition. This offset acts as a unique identifier of a record within that partition, and also denotes the position of the consumer in the partition. For example, a consumer which is at position 5 has consumed records with offsets 0 through 4 and will next receive the record with offset 5. There are actually two notions of position relevant to the user of the consumer:
The position of the consumer gives the offset of the next record that will be given out. It will be one larger than the highest offset the consumer has seen in that partition. It automatically advances every time the consumer receives messages in a call to poll(Duration).
The committed position is the last offset that has been stored securely. Should the process fail and restart, this is the offset that the consumer will recover to. The consumer can either automatically commit offsets periodically; or it can choose to control this committed position manually by calling one of the commit APIs (e.g. commitSync and commitAsync).
This distinction gives the consumer control over when a record is considered consumed. It is discussed in further detail below.
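
To see the two notions diverge in code, here is a hedged sketch (Scala, kafka-clients 2.4+, reusing a consumer built with auto-commit disabled; topic and group names are placeholders):

import java.time.Duration
import java.util.Collections
import org.apache.kafka.common.TopicPartition

val tp = new TopicPartition("my-topic", 0)    // placeholder topic
consumer.assign(Collections.singleton(tp))

consumer.poll(Duration.ofMillis(500))         // advances the consumed position
val consumedPos = consumer.position(tp)       // next offset poll() would hand out

// The committed position only moves when offsets are stored in __consumer_offsets.
val committedPos = Option(consumer.committed(Collections.singleton(tp)).get(tp))
  .map(_.offset)                              // None until the group commits once

println(s"position=$consumedPos committed=$committedPos")
consumer.commitSync()                         // after this call, the two notions agree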

Kafka consumer-client is not registering offset of consumer group on zookeeper

I'm trying to create multiple consumers with different consumer groups on a Kafka topic using kafka-clients v0.10.2.1. However, I'm not able to retrieve the last offset committed by a consumer group.
Currently my consumer properties look like this:
Properties cproperties = new Properties();
cproperties.put(ConsumerConfig.GROUP_ID_CONFIG, groupID);
cproperties.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, myBroker);
cproperties.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
cproperties.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
cproperties.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, taskDecoder.getClass());
cproperties.put(ConsumerConfig.REQUEST_TIMEOUT_MS_CONFIG, "60000");
Without the auto.offset.reset property I can't consume from the topic, but I can't use this config: I need the consumer group registered in ZooKeeper.
So I also need the consumer group to be created under ZooKeeper's /consumers path.
You need to set the property auto.offset.reset to earliest (or latest, depending on what you are trying to achieve) in order to avoid an exception being thrown when an offset is not found (probably because data has been deleted).
You also need to make sure that you manually commit offsets, since you've disabled auto-commit (a full poll/commit loop is sketched at the end of this answer):
cproperties.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
To do so, you can either use commitSync()
This commits offsets to Kafka. The offsets committed using this API will be used on the first fetch after every rebalance and also on startup. As such, if you need to store offsets in anything other than Kafka, this API should not be used. The committed offset should be the next message your application will consume, i.e. lastProcessedMessageOffset + 1.
or commitAsync()
This commits offsets only to Kafka. The offsets committed using this API will be used on the first fetch after every rebalance and also on startup. As such, if you need to store offsets in anything other than Kafka, this API should not be used.
This is an asynchronous call and will not block. Any errors encountered are either passed to the callback (if provided) or discarded.
Note that if you don't commit offsets, an exception will be thrown if auto.offset.reset is set to none:
What to do when there is no initial offset in Kafka or if the current offset does not exist any more on the server (e.g. because that data has been deleted):
earliest: automatically reset the offset to the earliest offset
latest: automatically reset the offset to the latest offset
none: throw exception to the consumer if no previous offset is found for the consumer's group
anything else: throw exception to the consumer.
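
Putting that together, a rough sketch of such a consumer loop (Scala for consistency with the rest of this page; broker, group, and topic names are placeholders, and on 0.10.x you would call poll(500L) instead of the Duration overload):

import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer}
import scala.jdk.CollectionConverters._

val cprops = new Properties()
cprops.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "my-broker:9092")  // placeholder
cprops.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group")                 // placeholder
cprops.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false")
cprops.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")        // avoids the no-offset exception
cprops.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
  "org.apache.kafka.common.serialization.StringDeserializer")
cprops.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
  "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](cprops)
consumer.subscribe(Collections.singletonList("my-topic"))

while (true) {
  val records = consumer.poll(Duration.ofMillis(500))
  records.asScala.foreach(r => println(s"${r.topic}-${r.partition}@${r.offset}: ${r.value}"))
  consumer.commitSync()  // stores offsets for everything returned by this poll
}

Note also that the Java consumer in kafka-clients 0.9+ stores its offsets in the internal __consumer_offsets topic, not under ZooKeeper's /consumers path, so the group will not show up there.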

how to get last committed offset from read_committed Kafka Consumer

I am using the transactional KafkaProducer to send messages to a topic. This works fine. I use a KafkaConsumer with read_committed isolation level, and I have an issue with the seek and seekToEnd methods. According to the documentation, the seek and seekToEnd methods give me the LSO (Last Stable Offset). But this is a bit confusing, as it always gives me the same value, the end of the topic, no matter whether the last entry is committed (by the producer) or part of an aborted transaction.
For example, after I abort the last 5 attempts to insert 20_000 messages each, the last 100_000 records should not be read by the consumer. Yet seekToEnd moves to the end of the topic (including those 100_000 messages), although poll() does not return them.
I am looking for a way to retrieve the Last Committed Offset (so the last successful committed message by the Producer). There seems to be no proper API method for this. So do I need to roll my own?
One option would be to move back and poll until no more records are retrieved; this would yield the last committed message. But I would assume that Kafka provides such a method.
We use Kafka 1.0.0.
The class KafkaConsumer has some useful methods such as partitionsFor, beginningOffsets and endOffsets, as well as committed and position.
Check which one fits your needs, and consider all four offset-related methods especially carefully.
The method partitionsFor returns a complete metadata object with other information, but it can be useful for enriching the logging.
To get the last committed offset of a topic partitions you can use the KafkaConsumer.committed(TopicPartition partition) function.
TopicPartition topicPartition = new TopicPartition(record.topic(), record.partition());
// committed() returns null when the group has no committed offset for the partition
OffsetAndMetadata committed = consumer.committed(topicPartition);
if (committed != null) {
    System.out.println("last committed offset: " + committed.offset());
}

Kafka manual offset management issue

While implementing manual offset management, I encountered the following issue: (using 0.9)
In order to manage the offsets manually, for each consumed record I retrieve the current offset of the record and commit the new offset (currentOffset + 1, since the committed offset should point to the next record to consume); the offset reset strategy is "latest".
When a new consumer group is created, it has no explicit offsets (offset is "unknown"), therefore, if it didn't consume messages from all existing partitions before it is stopped, it will have committed offsets for only part of the partitions (the ones the consumer got messages from), while the offset for the rest of the partitions will still be "unknown".
When the consumer is started again, it gets only some of the messages that were produced while it was down (only the ones from the partitions that had a committed offset), the messages from partitions with "unknown" offset are lost and will never be consumed due to the offset reset strategy.
Since it's unacceptable in my case to miss any messages once a consumer group is created, I'd like to explicitly commit an offset for each partition before starting consumption.
To do that I found two options:
Use low level consumer to send an offset request.
Use the high-level consumer: call consumer.poll(0) (to trigger the assignment), then call consumer.assignment(), and for each TopicPartition call consumer.committed(topicPartition), consumer.seekToEnd(topicPartition) and consumer.position(topicPartition), and eventually commit all offsets (sketched below).
Both are more complex and noisy than I'd expect (I'd expect a simpler API I could use to get the log end position for all partitions assigned to a consumer).
Any thoughts or ideas for a better implementation would be appreciated.
10x.
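
For reference, option (2) roughly sketched against the 0.9-era API (topic name is a placeholder; later clients changed some signatures, e.g. seekToEnd takes a Collection instead of varargs, and the single-partition committed is deprecated since 2.4; on older Scala versions use scala.collection.JavaConverters instead of scala.jdk.CollectionConverters):

import java.util.Collections
import org.apache.kafka.clients.consumer.OffsetAndMetadata
import org.apache.kafka.common.TopicPartition
import scala.jdk.CollectionConverters._

consumer.subscribe(Collections.singletonList("my-topic"))
consumer.poll(0)  // triggers the rebalance so that assignment() is populated

// For every assigned partition without a committed offset, pin the group to the
// current log-end offset so a later restart never falls back to the reset strategy.
val toCommit = consumer.assignment().asScala.collect {
  case tp if consumer.committed(tp) == null =>
    consumer.seekToEnd(tp)  // varargs form in 0.9
    tp -> new OffsetAndMetadata(consumer.position(tp))
}.toMap

if (toCommit.nonEmpty) consumer.commitSync(toCommit.asJava)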
Which consumer API to use depends on where you are committing offsets.
If your offsets are stored in the Kafka brokers, then you should definitely use the high-level consumer API; it will provide you with more control over offsets.
If you are keeping offsets in ZooKeeper, then you can use the old consumer API, for example:
List<KafkaStream<byte[], byte[]>> streams =
    consumer.createMessageStreamsByFilter(new Whitelist(topicRegex), 1);