Pentaho Data Integration - Kafka Consumer - apache-kafka

I am using the Kafka Consumer Plugin for Pentaho CE and would appreciate your help in its usage. I would like to know if any of you were in a situation where pentaho failed and you lost any messages (based on the official docs there's no way to read the message twice, am I wrong ?). If this situation occurs how do you capture these messages so you can reprocess them?
reference:
http://wiki.pentaho.com/display/EAI/Apache+Kafka+Consumer

Kafka retains messages for the configured retention period whether they've been consumed or not, so it allows consumers to go back to an offset they previously processed and pick up there again.
I haven't used the Kafka plugin myself, but it looks like you can disable auto-commit and manage that yourself. You'll probably need the Kafka system tools from Apache and some command line steps in the job. You'd have to fetch the current offset at the start, get the last offset from the messages you consume and if the job/batch reaches the finish, commit that last offset to the cluster.
It could be that you can also provide the starting offset as a field (message key?) to the plugin, but I can't find any documentation on what that does. In that scenario, you could store the offset with your destination data and go back to the last offset there at the start of each run. A failed run wouldn't update the destination offset, so would not lose any messages.
If you go the second route, pay attention to the auto.offset.reset setting and behavior, as it may happen that the last offset in your destination has already disappeared from the cluster if it's been longer than the retention period.

Related

Apache Kafka messages got archived - is it possible to retrieve the messages

We are using Apache Kafka and we process more than 30 million messages per day. We have an retention policy of "30" days. However, before 30 days, our messages got archived.
Is there a way we could retrieve the deleted messages?
Is it possible to reset the "start index" to older index to retrieve the data through query?
What other options do we have?
If we have "disk backup", could we use that for retrieving the data?
Thank You
I'm assuming your messages got deleted by the Kafka cluster here.
In general, no - if the records got deleted due to duration / size related policies, then they have been removed.
Theoretically, if you have access to backups you might move the Kafka data-log files to server directory, but the behaviour is undefined. Trying that with a fresh cluster with infinite size/time policies (so nothing gets purged instantly) might work and let you consume again.
In my experience, until the general availability of Tiered Storage, there is no free/easy way to recover data (via the Kafka Consumer protocol).
For example, you can use some Kafka Connect Sink connector to write to some external, more persistent storage. Then, would you want to write a job that scrapes that data? Sure, you could have a SQL database table of STRING topic, INT timestamp, BLOB key, BLOB value, and maybe track "consumer offsets" separately from that? If you use that design, then Kafka doesn't really seem useful, as you'd be reimplementing various parts of it when you could've just added more storage to the Kafka cluster.
Is it possible to reset the "start index" to older index to retrieve the data through query?
That is what auto.offset.reset=earliest will do, or kafka-consumer-groups --reset-offsets --to-earliest
have "disk backup", could we use that
With caution, maybe. For example - you can copy old broker log segments into a server, but then there aren't any tools I know of that will retroactively discover the new "low watermark" of each topic (maybe the broker finds this upon restart, I haven't tested). You'd need to copy this data for each broker manually, I believe, since the replicas wouldn't know about old segments (again, maybe after a full cluster restart, they might).
Plus, the consumer offsets would already be reading way past that data, unless you stop all consumers and reset them.
I'm also not sure what happens if you had gaps in the segment files. E.g. your current oldest segment is N and you copy N-2, but not N-1... You might then run into an error or the consumer will simply apply auto.offset.reset policy, and seek to the next available offset or to the very end of the topic

Kafka consumer consuming older messages on restarting

I am consuming Kafka messages from a topic, but the issue is that every time the consumer restarts it reads older processed messages.
I have used auto.offset.reset=earliest. Will setting it manually using commit async help me overcome this issue?
I see that Kafka already has enabled auto commit to true by default.
I have used auto.offset.reset=earliest. Wwill setting it manually
using commit async help me overcome this issue?
When the setting auto.offset.reset=earliest is set the consumer will read from the earliest offset that is available instead of from the last offset. So, the first time you start your process with a new group.id and set this to earliest it will read from the starting offset.
Here is how we the issue can be debugged..
If your consumer group.id is same across every restart, you need to check if the commit is actually happening.
Cross check if you are manually overriding enable.auto.commit to false anywhere.
Next, check the auto commit interval (auto.commit.interval.ms) which is by default 5 sec and see if you have changed it to something higher and that you are restarting your process before the commit is getting triggered.
You can also use commitAsync() or even commitSync() to manually trigger. Use commitSync() (blocking call) for testing if there is any exception while committing. Few possible errors during committing are (from docs)
CommitFailedException - When you are trying to commit to partitions
that are no longer assigned to this consumer because the consumer is
for example no longer part of the group this exception would be thrown
RebalanceInProgressException - If the consumer instance is in the
middle of a rebalance so it is not yet determined which partitions
would be assigned to the consumer.
TimeoutException - if the timeout specified by
default.api.timeout.ms expires before successful completion of the
offset commit
Apart from this..
Also check if you are doing seek() or seekToBeginning() in your consumer code anywhere. If you are doing this and calling poll() you will likely get older messages also.
If you are using Embedded Kafka and doing some testing, the topic and the consumer groups will likely be created everytime you restart your test, there by reading from start. Check if it is a similar case.
Without looking into the code it is hard to tell what exactly is the error. This answer provides only an insight on debugging your scenario.

How to configure kafka such that we have an option to read from the earliest, latest and also from any given offset?

I know about configuring kafka to read from earliest or latest message.
How do we include an additional option in case I need to read from a previous offset?
The reason I need to do this is that the earlier messages which were read need to be processed again due to some mistake in the processing logic earlier.
In java kafka client, there is some methods about kafka consumer which could be used to specified next consume position.
public void seek(TopicPartition partition,
long offset)
Overrides the fetch offsets that the consumer will use on the next poll(timeout). If this API is invoked for the same partition more than once, the latest offset will be used on the next poll(). Note that you may lose data if this API is arbitrarily used in the middle of consumption, to reset the fetch offsets
This is enough, and there are also seekToBeginning and seekToEnd.
I'm trying to answer a similar but not quite the same question so let's see if my information may help you.
First, I have been working from this other SO question/answer
In short, you want to commit your offsets and the most common solution for that is ZooKeeper. So if your consumer encounters an error or needs to shut down, it can resume where it left off.
Myself I'm working with a high volume stream that is extremely large and my consumer (for a test) needs to start from the very tail each time. The documentation indicates I must use KafkaConsumer seek to declare my starting point.
I'll try to update my findings here once they are successful and reliable. For sure this is a solved problem.

Using Kafka Connect HOWTO "commit offsets" as soon as a "put" is completed in SinkTask

I am using Kafka Connect to get messages from a Kafka Broker (v0.10.2) and then sync it to a downstream service.
Currently, I have code in SinkTask#put that will process the SinkRecord & then persist it to the downstream service.
A couple of key requirements,
We need to make sure the messages are persisted to the downstream service AT LEAST once.
If the downstream service throws an error or says it didn't process the message then we need to make sure that the messages are re-read again.
So we thought we can rely on SinkTask#flush to effectively back out of committing offsets for that particular poll/cycle of received messages by throwing an exception or something that will tell Connect not to commit the offsets, but retry in the next poll.
But as we found out flush is actually time-based & is more or less independent of the polls & it will commit the offsets when it reaches a certain time threshold.
In 0.10.2 SinkTask#preCommit was introduced, so we thought we can use it for our purposes. But nowhere in the documentation it is mentioned that there is a 1:1 relationship between SinkTask#put & SinkTask#preCommit.
Since essentially we want to commit offsets as soon as a single put succeeds. And similarly, not commit the offsets, if that particular put failed.
How to accomplish this, if not via SinkTask#preCommit?
Getting data into and out of Kafka correctly can be challenging, and Kafka Connect makes this easier since it uses best practices and hides many of the complexities. For sink connectors, Kafka Connect reads messages from a topic, sends them to your connector, and then periodically commits the largest offsets for the various topic partitions that have been read and processed.
Note that "sending them to your connector" corresponds to the put(Collection<SinkRecord>) method, and this may be called many times before Kafka Connect commits the offsets. You can control how frequently Kafka Connect commits offsets, but Kafka Connect ensures that it will only commit an offset for a message when that message was successfully processed by the connector.
When the connector is operating nominally, everything is great and your connector sees each message once, even when the offsets are committed periodically. However, should the connector fail, then when it restarts the connector will start at the last committed offset. That might mean your connector sees some of the same messages that it processed just before the crash. This usually is not a problem if you carefully write your connector to have at least once semantics.
Why does Kafka Connect commit offsets periodically rather than with every record? Because it saves a lot of work and doesn't really matter when things are going nominally. It's only when things go wrong that the offset lag matters. And even then, if you're having Kafka Connect handle offsets your connector needs to be ready to handle messages at least once. Exactly once is possible, but your connector has to do more work (see below).
Writing Records
You have a lot of flexibility in writing a connector, and that's good because a lot will depend on the capabilities of the external system to which it's writing. Let's look at different ways of implementing put and flush.
If the system supports transactions or can handle a batch of updates, your connector's put(Collection<SinkRecord>) could write all of the records in that collection using a single transaction / batch, retrying as many times as needed until the transaction / batch completes or before finally throwing an error. In this case, put does all the work and will either succeed or will fail. If it succeeds, then Kafka Connect knows all of the records were handled properly and can thus (at some point) commit the offsets. If your put call fails, then Kafka Connect assumes doesn't know whether any of the records were processed, so it doesn't update its offsets and it stops your connector. Your connector's flush(...) would need to do nothing, since Kafka Connect is handling all the offsets.
If the system doesn't support transactions and instead you can only submit items one at a time, you might have have your connector's put(Collection<SinkRecord>) attempt to write out each record individually, blocking until it succeeds and retrying each as needed before throwing an error. Again, put does all the work, and the flush method might not need to do anything.
So far, my examples do all the work in put. You always have the option of having put simply buffer the records and to instead do all the work of writing to the external service in flush or preCommit. One reason you might do this is so that you're writes are time-based just like flush and preCommit. If you don't want your writes to be time-based, you probably don't want to do the writes in flush or preCommit.
To Record Offsets or Not To Record
As mentioned above, by default Kafka Connect will periodically record the offsets so that upon restart the connector can begin where it last left off.
However, sometimes it is desirable for a connector to record the offsets in the external system, especially when that can be done atomically. When such a connector starts up, it can look in the external system to find out the offset that was last written, and can then tell Kafka Connect where it wants to start reading. With this approach your connector may be able to do exactly once processing of messages.
When sink connectors do this, they actually don't need Kafka Connect to commit any offsets at all. The flush method is simply an opportunity for your connector to know which offsets that Kafka Connect is committing for you, and since it doesn't return anything it can't modify those offsets or tell Kafka Connect which offsets the connector is handling.
This is where the preCommit method comes in. It really is a replacement for flush (it actually takes the same parameters as flush), except that it is expected to return the offsets that Kafka Connect should commit. By default, preCommit just calls flush and then returns the same offsets that were passed to preCommit, which means Kafka Connect should commit all the offsets it passed to the connector via preCommit. But if your preCommit returns an empty set of offsets, then Kafka Connect will record no offsets at all.
So, if your connector is going to handle all offsets in the external system and doesn't need Kafka Connect to record anything, then you should override the preCommit method instead of flush, and return an empty set of offsets.

Simple-Kafka-consumer message delivery duplication

I am trying to implement a simple Producer-->Kafka-->Consumer application in Java. I am able to produce as well as consume the messages successfully, but the problem occurs when I restart the consumer, wherein some of the already consumed messages are again getting picked up by consumer from Kafka (not all messages, but a few of the last consumed messages).
I have set autooffset.reset=largest in my consumer and my autocommit.interval.ms property is set to 1000 milliseconds.
Is this 'redelivery of some already consumed messages' a known problem, or is there any other settings that I am missing here?
Basically, is there a way to ensure none of the previously consumed messages are getting picked up/consumed by the consumer?
Kafka uses Zookeeper to store consumer offsets. Since Zookeeper operations are pretty slow, it's not advisable to commit offset after consumption of every message.
It's possible to add shutdown hook to consumer that will manually commit topic offset before exit. However, this won't help in certain situations (like jvm crash or kill -9). To guard againts that situations, I'd advise implementing custom commit logic that will commit offset locally after processing each message (file or local database), and also commit offset to Zookeeper every 1000ms. Upon consumer startup, both these locations should be queried, and maximum of two values should be used as consumption offset.