How can I retrieve old data from Kafka to store in MongoDB - apache-kafka

How can I transfer old data from Kafka to a MongoDB database? I changed the Kafka topic, and because of that the old data did not arrive in MongoDB. How can I get it across?
If I try to retrieve data from the old topic, I get a poll timeout error.

Unclear how you define "old".
Kafka actively removes data from topics with retention policies. If data is deleted, you can never consume it.
While the data is still retained, you simply set auto.offset.reset=earliest (in Connect, add the consumer. prefix to this property and put it in the worker properties file, not in the connector config) and start polling. Timeout errors are a separate issue and have nothing to do with the topic's retention time. By default, any consumer or connector sets this property to latest and will not see the data already retained in the topic.
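For a plain (non-Connect) Java consumer, a minimal sketch of that configuration might look like the following; the bootstrap server, group id, and topic name are placeholders, and the setting only takes effect when the group has no committed offsets yet:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ReadFromEarliest {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "mongo-backfill");          // a NEW group id, so no offsets exist yet
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // only applies when the group has no committed offsets
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("old-topic")); // placeholder topic name
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                }
            }
        }
    }
}
```

The Connect equivalent is the worker-level consumer.auto.offset.reset=earliest property mentioned above, which applies to sink connectors such as the MongoDB sink.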

Related

Apache Kafka messages got archived - is it possible to retrieve the messages

We are using Apache Kafka and we process more than 30 million messages per day. We have a retention policy of "30" days. However, our messages got archived before 30 days had passed.
Is there a way we could retrieve the deleted messages?
Is it possible to reset the "start index" to an older index to retrieve the data through a query?
What other options do we have?
If we have "disk backup", could we use that for retrieving the data?
Thank You
I'm assuming your messages got deleted by the Kafka cluster here.
In general, no - if the records got deleted due to duration / size related policies, then they have been removed.
Theoretically, if you have access to backups you might move the Kafka log segment files back into the broker's data directory, but the behaviour is undefined. Trying that with a fresh cluster with infinite size/time retention policies (so nothing gets purged immediately) might work and let you consume again.
In my experience, until the general availability of Tiered Storage, there is no free/easy way to recover data (via the Kafka Consumer protocol).
For example, you can use some Kafka Connect sink connector to write to an external, more persistent store. Would you then want to write a job that scrapes that data back out? Sure, you could keep a SQL table of (STRING topic, INT timestamp, BLOB key, BLOB value) and maybe track "consumer offsets" separately from it, but with that design Kafka doesn't really seem useful, as you'd be reimplementing various parts of it when you could have just added more storage to the Kafka cluster.
Is it possible to reset the "start index" to older index to retrieve the data through query?
That is what auto.offset.reset=earliest will do, or kafka-consumer-groups --reset-offsets --to-earliest
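If you'd rather do that reset programmatically than with the CLI, a rough sketch using the AdminClient might look like this; the group, topic, and partition are placeholders, and the group must have no active members while you alter its offsets:

```java
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class ResetGroupToEarliest {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

        try (Admin admin = Admin.create(props)) {
            TopicPartition tp = new TopicPartition("my-topic", 0); // placeholder topic/partition

            // find the earliest offset still retained for the partition
            ListOffsetsResult earliest = admin.listOffsets(Map.of(tp, OffsetSpec.earliest()));
            long startOffset = earliest.partitionResult(tp).get().offset();

            // move the consumer group back to that offset, roughly equivalent to
            // kafka-consumer-groups --reset-offsets --to-earliest --execute
            admin.alterConsumerGroupOffsets("my-group",
                    Map.of(tp, new OffsetAndMetadata(startOffset))).all().get();
        }
    }
}
```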
have "disk backup", could we use that
With caution, maybe. For example, you can copy old broker log segments back onto a server, but there aren't any tools I know of that will retroactively discover the new "low watermark" of each topic (maybe the broker finds this upon restart; I haven't tested). You'd also need to copy this data for each broker manually, I believe, since the replicas wouldn't know about the old segments (again, maybe they would after a full cluster restart).
Plus, the consumer offsets would already be reading way past that data, unless you stop all consumers and reset them.
I'm also not sure what happens if you have gaps in the segment files, e.g. your current oldest segment is N and you copy N-2 but not N-1. You might then run into an error, or the consumer might simply apply its auto.offset.reset policy and seek to the next available offset or to the very end of the topic.

Kafka consumer is processing all messages at startup

I am new to Kafka and am developing a personal project with a few services; the communication between them is done through Kafka, and I am using Confluent to host Kafka remotely.
All works fine, but when I start up a server it tries to process all the old messages in the topics, which were generated while I was testing the system.
I would like to avoid this because it is time-consuming, and those messages were already processed the last time the server was up. Is there any way to prevent this in the development environment?
Am I even using Kafka correctly? Are there good practices that I missed?
By "server", I assume you mean consumer. The broker server doesn't process data, only stores it.
If you have auto.offset.reset=earliest, enable.auto.commit=false, and are not committing offsets in your code (or are simply using a new group.id each time), this is the expected behavior, since your group.id is not tracking already-consumed data.
Since you're now in a situation where you have processed data but no stored offsets, first set a static group.id; then your options include
re-process all the data again, accepting the duplicates, perhaps adding some conditional filter in your consumer code to skip records
skip all processed and un-processed data and only start consuming brand-new records after the consumer starts, by either setting a new group.id with auto.offset.reset=latest, or using consumer.seekToEnd() / the kafka-consumer-groups CLI tool (a sketch follows these options); the downside of relying on auto.offset.reset=latest is that if the consumer group stays idle too long, the group expires and you jump to the end of the topic even though there may still be un-processed data
manually find the offsets for all the partitions for the last processed data and consumer.seek() to those offsets
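A rough sketch of the seek-based options, with the group id, topic, and stored offsets as placeholders; seekToEnd() skips everything already in the topic, while seek() jumps to specific offsets you have recorded elsewhere:

```java
import java.time.Duration;
import java.util.Collection;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class SkipOldRecords {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-service");              // static group id
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");         // commit manually after processing
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("events"), new ConsumerRebalanceListener() { // placeholder topic
                @Override
                public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                    // Option: jump past everything currently in the topic
                    consumer.seekToEnd(partitions);
                    // Option: instead, seek each partition to an offset you recorded elsewhere
                    // partitions.forEach(tp -> consumer.seek(tp, 1234L));
                }
                @Override
                public void onPartitionsRevoked(Collection<TopicPartition> partitions) { }
            });

            while (true) {
                consumer.poll(Duration.ofSeconds(1)).forEach(record ->
                        System.out.printf("offset=%d value=%s%n", record.offset(), record.value()));
                consumer.commitSync(); // store progress under the static group id
            }
        }
    }
}
```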

Dealing with data loss in kafka connect

I understand that Kafka Connect can be deployed in cluster mode, and that workers move data between a data source and a Kafka topic. What I want to know is: if a worker fails while moving data from the data source to the Kafka topic, would there be data loss? If so, how can we get the data back from the connector, or will Kafka Connect automatically deal with it?
This depends on the source and if it supports offset tracking.
For example, lines in a file, rows in a database with a primary ID / timestamp, or an idempotent API call can be read repeatedly from the same starting position (although, in each case, the underlying data also needs to be immutable for this to work consistently).
The Kafka Connect SourceTask API has a mechanism to commit tracked "offsets" (different from Kafka topic offsets).
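As a rough illustration, a source task attaches a source-partition / source-offset map to each SourceRecord, and Connect stores those maps for it; on restart, the task reads the last stored offset back from the OffsetStorageReader and resumes from there. This is only a sketch, assuming a hypothetical file-like source where the offset is a line number:

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.source.SourceRecord;
import org.apache.kafka.connect.source.SourceTask;

public class LineSourceTask extends SourceTask {
    // hypothetical identifiers for this sketch
    private static final Map<String, String> SOURCE_PARTITION =
            Collections.singletonMap("filename", "/tmp/input.txt");
    private long nextLine = 0;

    @Override
    public void start(Map<String, String> props) {
        // ask Connect for the last offset it stored for this source partition
        Map<String, Object> stored = context.offsetStorageReader().offset(SOURCE_PARTITION);
        if (stored != null && stored.get("line") != null) {
            nextLine = (Long) stored.get("line");
        }
    }

    @Override
    public List<SourceRecord> poll() throws InterruptedException {
        String value = readLine(nextLine); // hypothetical read from the source system
        if (value == null) {
            return null; // nothing new right now
        }
        Map<String, Long> sourceOffset = Collections.singletonMap("line", ++nextLine);
        return Collections.singletonList(new SourceRecord(
                SOURCE_PARTITION, sourceOffset,
                "lines-topic", Schema.STRING_SCHEMA, value));
    }

    private String readLine(long line) {
        return null; // stub for the sketch
    }

    @Override
    public void stop() { }

    @Override
    public String version() { return "0.0.1"; }
}
```

If the worker fails before the offset for a record is stored, the record is simply read and produced again on restart, which is why the "same starting position" property above matters.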

In Kafka, how to handle deleted rows from source table that are already reflected in Kafka topic?

I am using a JDBC source connector with mode timestamp+incrementing to fetch a table from Postgres, using Kafka Connect. Updates to the data are reflected in the Kafka topic, but deleting records has no effect. So, my questions are:
Is there some way to handle deleted records?
How do I handle records that are deleted from the table but still present in the Kafka topic?
The recommendation is to either 1) adjust your source database to be append/update only as well, marking deleted rows via a boolean or timestamp column that is filtered out when Kafka Connect queries the table.
If your database is running out of space, you can then physically delete those old records, which should already have been processed into Kafka.
Option 2) Use a CDC tool to capture delete events immediately rather than missing them in a periodic table scan. Debezium is a popular option for Postgres.
A Kafka topic can be seen as an "append-only" log. It keeps all messages for as long as you like, but Kafka is not built to delete individual messages out of a topic.
In the scenario you are describing, it is common for the downstream application (consuming the topic) to handle the information about a deleted record.
As an alternative, you could set the cleanup.policy of your topic to compact, which means it will eventually keep only the latest value for each key. If you define the key of each message as the primary key of the Postgres table, the topic will eventually drop the record once you produce a message with the same key and a null value (a "tombstone") into the topic. However,
I am not sure if your connector is flexible enough to do this, and
depending on what you do with the data in the Kafka topic, this could still not be a solution to your problem, as the downstream application will still read both records: the original one and the null message marking the deletion.
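For completeness, a minimal sketch of producing such a tombstone by hand, assuming the topic has cleanup.policy=compact and the message key is the row's primary key; the bootstrap server, topic name, and key are placeholders, and in practice the connector or the producing application would emit this:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class TombstoneExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // key = primary key of the deleted row; a null value marks a tombstone
            producer.send(new ProducerRecord<>("postgres-table-topic", "42", null));
            producer.flush();
        }
    }
}
```

On a compacted topic, the tombstone eventually causes earlier records with the same key to be removed by the log cleaner, but consumers that read the topic before compaction runs will still see both the original record and the tombstone.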

Using Kafka Connect HOWTO "commit offsets" as soon as a "put" is completed in SinkTask

I am using Kafka Connect to get messages from a Kafka Broker (v0.10.2) and then sync it to a downstream service.
Currently, I have code in SinkTask#put that will process the SinkRecord & then persist it to the downstream service.
A couple of key requirements,
We need to make sure the messages are persisted to the downstream service AT LEAST once.
If the downstream service throws an error or says it didn't process the message then we need to make sure that the messages are re-read again.
So we thought we could rely on SinkTask#flush to effectively back out of committing offsets for that particular poll/cycle of received messages, by throwing an exception or something else that tells Connect not to commit the offsets but to retry on the next poll.
But as we found out, flush is actually time-based and more or less independent of the polls; it commits the offsets once a certain time threshold is reached.
In 0.10.2, SinkTask#preCommit was introduced, so we thought we could use it for our purposes. But nowhere in the documentation is it mentioned that there is a 1:1 relationship between SinkTask#put and SinkTask#preCommit.
Essentially, we want to commit offsets as soon as a single put succeeds, and similarly not commit the offsets if that particular put failed.
How can we accomplish this, if not via SinkTask#preCommit?
Getting data into and out of Kafka correctly can be challenging, and Kafka Connect makes this easier since it uses best practices and hides many of the complexities. For sink connectors, Kafka Connect reads messages from a topic, sends them to your connector, and then periodically commits the largest offsets for the various topic partitions that have been read and processed.
Note that "sending them to your connector" corresponds to the put(Collection<SinkRecord>) method, and this may be called many times before Kafka Connect commits the offsets. You can control how frequently Kafka Connect commits offsets, but Kafka Connect ensures that it will only commit an offset for a message when that message was successfully processed by the connector.
When the connector is operating nominally, everything is great and your connector sees each message once, even when the offsets are committed periodically. However, should the connector fail, then when it restarts the connector will start at the last committed offset. That might mean your connector sees some of the same messages that it processed just before the crash. This usually is not a problem if you carefully write your connector to have at least once semantics.
Why does Kafka Connect commit offsets periodically rather than with every record? Because it saves a lot of work and doesn't really matter when things are going nominally. It's only when things go wrong that the offset lag matters. And even then, if you're having Kafka Connect handle offsets your connector needs to be ready to handle messages at least once. Exactly once is possible, but your connector has to do more work (see below).
Writing Records
You have a lot of flexibility in writing a connector, and that's good because a lot will depend on the capabilities of the external system to which it's writing. Let's look at different ways of implementing put and flush.
If the system supports transactions or can handle a batch of updates, your connector's put(Collection<SinkRecord>) could write all of the records in that collection using a single transaction / batch, retrying as many times as needed until the transaction / batch completes or before finally throwing an error. In this case, put does all the work and will either succeed or fail. If it succeeds, then Kafka Connect knows all of the records were handled properly and can thus (at some point) commit the offsets. If your put call fails, then Kafka Connect doesn't know whether any of the records were processed, so it doesn't update its offsets and it stops your connector. Your connector's flush(...) would need to do nothing, since Kafka Connect is handling all the offsets.
If the system doesn't support transactions and you can only submit items one at a time, you might have your connector's put(Collection<SinkRecord>) attempt to write out each record individually, blocking until it succeeds and retrying each as needed before throwing an error. Again, put does all the work, and the flush method might not need to do anything.
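A rough sketch of the "do all the work in put" approach, assuming a hypothetical writeToDownstream() call; throwing RetriableException signals Connect not to commit this batch's offsets and to redeliver the records:

```java
import java.util.Collection;
import java.util.Map;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.connect.errors.RetriableException;
import org.apache.kafka.connect.sink.SinkRecord;
import org.apache.kafka.connect.sink.SinkTask;

public class DownstreamSinkTask extends SinkTask {

    @Override
    public void start(Map<String, String> props) {
        // create the client for the downstream service here
    }

    @Override
    public void put(Collection<SinkRecord> records) {
        for (SinkRecord record : records) {
            try {
                writeToDownstream(record); // hypothetical blocking call to the external service
            } catch (Exception e) {
                // Connect will not commit offsets for this batch and will redeliver it
                throw new RetriableException("Downstream write failed, will retry", e);
            }
        }
        // returning normally signals the whole batch was processed,
        // so Connect may commit these offsets on its next periodic flush
    }

    private void writeToDownstream(SinkRecord record) {
        // stub for the sketch
    }

    @Override
    public void flush(Map<TopicPartition, OffsetAndMetadata> currentOffsets) {
        // nothing to do: all the work happened in put()
    }

    @Override
    public void stop() { }

    @Override
    public String version() { return "0.0.1"; }
}
```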
So far, my examples do all the work in put. You always have the option of having put simply buffer the records and instead do all the work of writing to the external service in flush or preCommit. One reason you might do this is so that your writes are time-based, just like flush and preCommit. If you don't want your writes to be time-based, you probably don't want to do the writes in flush or preCommit.
To Record Offsets or Not To Record
As mentioned above, by default Kafka Connect will periodically record the offsets so that upon restart the connector can begin where it last left off.
However, sometimes it is desirable for a connector to record the offsets in the external system, especially when that can be done atomically. When such a connector starts up, it can look in the external system to find out the offset that was last written, and can then tell Kafka Connect where it wants to start reading. With this approach your connector may be able to do exactly once processing of messages.
When sink connectors do this, they actually don't need Kafka Connect to commit any offsets at all. The flush method is simply an opportunity for your connector to know which offsets Kafka Connect is committing for you, and since it doesn't return anything it can't modify those offsets or tell Kafka Connect which offsets the connector is handling.
This is where the preCommit method comes in. It really is a replacement for flush (it actually takes the same parameters as flush), except that it is expected to return the offsets that Kafka Connect should commit. By default, preCommit just calls flush and then returns the same offsets that were passed to preCommit, which means Kafka Connect should commit all the offsets it passed to the connector via preCommit. But if your preCommit returns an empty set of offsets, then Kafka Connect will record no offsets at all.
So, if your connector is going to handle all offsets in the external system and doesn't need Kafka Connect to record anything, then you should override the preCommit method instead of flush, and return an empty set of offsets.
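A minimal sketch of that pattern, assuming the connector persists offsets in the external system itself; the lookup and write calls here are hypothetical stand-ins for that system's API:

```java
import java.util.Collection;
import java.util.Collections;
import java.util.Map;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.connect.sink.SinkRecord;
import org.apache.kafka.connect.sink.SinkTask;

public class ExternallyTrackedSinkTask extends SinkTask {

    @Override
    public void start(Map<String, String> props) { }

    @Override
    public void open(Collection<TopicPartition> partitions) {
        for (TopicPartition tp : partitions) {
            // hypothetical: read the last offset written to the external system
            Long lastWritten = lookupOffsetInExternalSystem(tp);
            if (lastWritten != null) {
                // tell Connect where this task wants to resume reading
                context.offset(tp, lastWritten + 1);
            }
        }
    }

    @Override
    public void put(Collection<SinkRecord> records) {
        for (SinkRecord record : records) {
            // hypothetical: write the value and its offset atomically to the external system
            writeAtomically(record);
        }
    }

    @Override
    public Map<TopicPartition, OffsetAndMetadata> preCommit(
            Map<TopicPartition, OffsetAndMetadata> currentOffsets) {
        // offsets live in the external system, so ask Connect to commit nothing
        return Collections.emptyMap();
    }

    private Long lookupOffsetInExternalSystem(TopicPartition tp) { return null; } // stub

    private void writeAtomically(SinkRecord record) { } // stub

    @Override
    public void stop() { }

    @Override
    public String version() { return "0.0.1"; }
}
```

The exactly-once property here comes entirely from the external system writing the record and its offset atomically; Kafka Connect itself still only guarantees at-least-once delivery into put.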