Is there any way to get the current offset in Kafka 0.10.x? I do not want to use shell commands.
When I use the API's consumer.endOffsets(...), I can get the last offsets (logSize). However, consumer.position(...) does not get me the current offset!
All in all, I want to get the current offset, log size and lag in one partition.
You can use KafkaConsumer#committed() to get the latest committed position. Thus, if you disable auto-commit and do manual commits, you can compute the exact lag each time you commit.
On the other hand, each record you process provides its offset via ConsumerRecord#offset(), so you can also compute the lag after reading a single record (for that record's partition).
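For illustration, a minimal sketch that puts those pieces together (broker address, group id, topic, and partition are placeholders; endOffsets(...) needs a 0.10.1+ client):

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class PartitionLag {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("group.id", "my-group");                // placeholder group id
        props.put("enable.auto.commit", "false");         // manual commits, as described above
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("my-topic", 0); // placeholder partition
            long logSize = consumer.endOffsets(Collections.singleton(tp)).get(tp); // last offset
            OffsetAndMetadata committed = consumer.committed(tp); // null if nothing committed yet
            long current = committed != null ? committed.offset() : 0L;
            System.out.println("logSize=" + logSize + ", current=" + current + ", lag=" + (logSize - current));
        }
    }
}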
Related
My requirement is simple, yet I am not able to implement it using a plain consumer: I would like to consume records from the last committed offset position every time I poll. That is, after I poll a set of records, if I do not manually commit the offsets for those records, I expect the same set of records to be returned on the next poll. Is this possible with plain Kafka consumers? FYI, I have already configured my consumer not to auto-commit.
The current workaround I employ is to manually seek to the last committed offset before every poll, but this adds a needless round trip and extra latency to message processing. Is there an out-of-the-box configuration available in the Kafka consumer to achieve what I am expecting?
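For reference, a sketch of that workaround as I understand it (assumes a consumer with an assignment and auto-commit disabled; the poll timeout is arbitrary):

import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

// Rewind every assigned partition to its last committed offset, so records
// that were polled but never committed are delivered again on the next poll.
static ConsumerRecords<String, String> pollFromLastCommitted(KafkaConsumer<String, String> consumer) {
    for (TopicPartition tp : consumer.assignment()) {
        OffsetAndMetadata committed = consumer.committed(tp);
        consumer.seek(tp, committed != null ? committed.offset() : 0L);
    }
    return consumer.poll(1000); // commitSync() only after processing succeeds
}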
I know about configuring Kafka to read from the earliest or latest message.
How do we include an additional option in case I need to read from a previous offset?
The reason I need to do this is that the earlier messages which were read need to be processed again due to some mistake in the processing logic earlier.
In the Java Kafka client, there are methods on the consumer that can be used to specify the next consume position.
public void seek(TopicPartition partition, long offset)
Overrides the fetch offsets that the consumer will use on the next poll(timeout). If this API is invoked for the same partition more than once, the latest offset will be used on the next poll(). Note that you may lose data if this API is arbitrarily used in the middle of consumption, to reset the fetch offsets
This is enough, and there are also seekToBeginning and seekToEnd.
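For example, a minimal sketch of rewinding to an earlier offset to reprocess messages (consumer setup omitted; topic, partition, and offset 42 are placeholders):

import java.util.Collections;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.common.TopicPartition;

TopicPartition tp = new TopicPartition("my-topic", 0);
consumer.assign(Collections.singletonList(tp));
consumer.seek(tp, 42L); // the next poll() starts from offset 42
ConsumerRecords<String, String> records = consumer.poll(1000);
for (ConsumerRecord<String, String> record : records) {
    System.out.println(record.offset() + ": " + record.value());
}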
I'm trying to answer a similar but not quite identical question, so let's see if my information helps you.
First, I have been working from this other SO question/answer
In short, you want to commit your offsets and the most common solution for that is ZooKeeper. So if your consumer encounters an error or needs to shut down, it can resume where it left off.
I'm working with an extremely high-volume stream, and my consumer (for a test) needs to start from the very tail each time. The documentation indicates I must use KafkaConsumer's seek to declare my starting point.
I'll try to update my findings here once they are successful and reliable. For sure this is a solved problem.
I am using the Kafka Consumer Plugin for Pentaho CE and would appreciate your help with its usage. I would like to know whether any of you have been in a situation where Pentaho failed and you lost messages (based on the official docs there's no way to read a message twice, am I wrong?). If this situation occurs, how do you capture these messages so you can reprocess them?
reference:
http://wiki.pentaho.com/display/EAI/Apache+Kafka+Consumer
Kafka retains messages for the configured retention period whether they've been consumed or not, so it allows consumers to go back to an offset they previously processed and pick up there again.
I haven't used the Kafka plugin myself, but it looks like you can disable auto-commit and manage offsets yourself. You'll probably need the Kafka system tools from Apache and some command-line steps in the job: fetch the current offset at the start, track the last offset from the messages you consume, and if the job/batch finishes, commit that last offset to the cluster.
It could be that you can also provide the starting offset as a field (message key?) to the plugin, but I can't find any documentation on what that does. In that scenario, you could store the offset with your destination data and go back to the last offset there at the start of each run. A failed run wouldn't update the destination offset, so would not lose any messages.
If you go the second route, pay attention to the auto.offset.reset setting and behavior, as it may happen that the last offset in your destination has already disappeared from the cluster if it's been longer than the retention period.
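For reference, these are the relevant settings on the standard Java consumer (whether the Pentaho plugin exposes them is an assumption on my part):

Properties props = new Properties();
props.put("enable.auto.commit", "false");   // commit offsets yourself after a successful batch
props.put("auto.offset.reset", "earliest"); // fallback when the stored offset has expired;
                                            // "smallest"/"largest" on the old consumer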
Looking at the latest (v0.10) Kafka Consumer documentation:
"The position of the consumer gives the offset of the next record that will be given out. It will be one larger than the highest offset the consumer has seen in that partition. It automatically advances every time the consumer receives data calls poll(long) and receives messages."
Is there a way to query for the largest offset available for the partition on the server side, without retrieving all the messages?
The logic I am trying to implement is as follows:
query every second for the amount (A) of pending messages in a topic
if A > threshold, wake up a processor that retrieves all the messages and processes them
otherwise do nothing (sleep 1)
The motivation is that I need to do some batch processing, but I want the processor to wake up only when there is enough data (and I don't want to retrieve all the data twice).
You can use the Consumer.seekToEnd() method, run Consumer.poll(0) to make that take effect but return immediately, then Consumer.position() to find the positions for all subscribed (or assigned) topic partitions. These will be the current final offsets for all partitions. This will also start fetching some data from the brokers for those offsets, but any returned data will be ignored if you subsequently seek back to a different position.
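A sketch of that sequence (assumes an already-assigned 0.10 consumer, where seekToEnd takes a collection of partitions):

import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.common.TopicPartition;

consumer.seekToEnd(consumer.assignment()); // the seek is evaluated lazily
consumer.poll(0);                          // non-blocking; makes the seek take effect
Map<TopicPartition, Long> endOffsets = new HashMap<>();
for (TopicPartition tp : consumer.assignment()) {
    endOffsets.put(tp, consumer.position(tp)); // current end offset of each partition
}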
Currently the alternative, as mentioned by serejja, is to use the old simple consumer, although the process is quite a bit more complicated as you need to manually find the leader for each partition.
Sadly, I don't see how this is possible with 0.10 consumer.
However, this is doable if you have any lower level Kafka client (sorry but I'm not sure if one exists for JVM, but there are plenty of them for other languages).
So if you have some time and inspiration to implement this, here's the way to go - every FetchResponse (the response to each "give me messages" request) contains a field called HighwaterMarkOffset, which is essentially the offset at the end of the partition (https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol#AGuideToTheKafkaProtocol-FetchResponse). The trick here is to send a FetchRequest that returns immediately (i.e. doesn't block waiting) with nothing but HighwaterMarkOffset.
To do so, your FetchRequest should have:
MaxWaitTime set to 0, which means "return immediately instead of waiting for at least MinBytes bytes to accumulate".
MinBytes set to 0, which means "I'm OK if you return me an empty response".
FetchOffset doesn't matter in this case; if I'm not wrong, it might even be an invalid offset, but it's probably better to use a valid one.
MaxBytes set to 0, which means "give me no more than 0 bytes of data", i.e. nothing.
This way this request will return immediately with no data, but still with the highwatermark offset set to a proper value. Once you have the highwatermark offset, you can compare it to your current offset and figure out how much behind you are.
Hope this helps.
You can use the method public OffsetAndMetadata committed(TopicPartition partition) from the API below to get the last committed offset:
https://kafka.apache.org/0100/javadoc/index.html?org/apache/kafka/clients/consumer/KafkaConsumer.html
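For example (topic and partition are placeholders; committed(...) returns null when the group has no committed offset for that partition):

OffsetAndMetadata om = consumer.committed(new TopicPartition("my-topic", 0));
long lastCommitted = (om != null) ? om.offset() : -1L; // -1 just flags "nothing committed yet"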
When I consume a message in Kafka, how can I know when this message was produced/created?
There is no created method in MessageAndMetadata.
And how do I set the offset in Kafka when necessary? I prefer setting it in code or from the command line.
Thanks.
There is currently no per-message timestamping in Apache Kafka 0.7 or 0.8, but it would be handy.
What exists is a time-based offset-lookup in the Offset API:
Time - Used to ask for all messages before a certain time (ms). There are two special values. Specify -1 to receive the latest offset (i.e. the offset of the next coming message) and -2 to receive the earliest available offset. Note that because offsets are pulled in descending order, asking for the earliest offset will always return you a single element.
https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol#AGuideToTheKafkaProtocol-OffsetAPI
It relies on the filesystem ctime of the topic+partition segment files on the broker so it is a bit crude (depending on file rotation intervals, etc) and in the case of replication may be outright incorrect (timestamp of replication rather than when the original segment was created).
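A sketch of such a lookup with the 0.8 SimpleConsumer, adapted from the example on the Kafka wiki (host, port, topic, partition, and client id are placeholders):

import java.util.HashMap;
import java.util.Map;
import kafka.api.PartitionOffsetRequestInfo;
import kafka.common.TopicAndPartition;
import kafka.javaapi.OffsetResponse;
import kafka.javaapi.consumer.SimpleConsumer;

SimpleConsumer consumer = new SimpleConsumer("localhost", 9092, 100000, 64 * 1024, "offsetLookup");
TopicAndPartition tp = new TopicAndPartition("my-topic", 0);
Map<TopicAndPartition, PartitionOffsetRequestInfo> requestInfo = new HashMap<>();
// LatestTime() is the special value -1; use EarliestTime() for -2
requestInfo.put(tp, new PartitionOffsetRequestInfo(kafka.api.OffsetRequest.LatestTime(), 1));
kafka.javaapi.OffsetRequest request = new kafka.javaapi.OffsetRequest(
        requestInfo, kafka.api.OffsetRequest.CurrentVersion(), "offsetLookup");
OffsetResponse response = consumer.getOffsetsBefore(request);
long latestOffset = response.offsets("my-topic", 0)[0];
consumer.close();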
The solution is to embed the message creation time in the message itself.
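A minimal sketch of that approach with the 0.8.2+ Java producer, prefixing the payload with the creation time (broker, topic, and the "createdAt|payload" encoding are all placeholders; a field in your own serialization format works just as well):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

KafkaProducer<String, String> producer = new KafkaProducer<>(props);
String value = System.currentTimeMillis() + "|" + "actual payload"; // embed creation time
producer.send(new ProducerRecord<>("my-topic", value));
producer.close();

The consumer then splits the value on the first "|" to recover the creation timestamp.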