Put() vs Flush() in Kafka Connector Sink Task - apache-kafka

I am trying to send data in batches to a NoSQL database using a Kafka sink connector. I am following the https://kafka.apache.org/documentation/#connect documentation and am confused about where the logic for sending records should be implemented. Please help me understand how the records are processed internally and which of put() or flush() should be used to process the records in a batch.

When a Kafka Connect worker is running a sink task, it will consume messages from the topic partition(s) assigned to the task. As it does so, it repeatedly passes a batch of messages to the sink task through the put(Collection<SinkRecord>) method. This will continue as long as the connector and its tasks are running.
Kafka Connect also will periodically record the progress of the sink tasks, namely the offset of the most recently processed message on each topic partition. This is called committing the offsets, and it does this so that if the connector stops unexpectedly and uncleanly, Kafka Connect knows where in each topic partition the task should resume processing messages. But just before Kafka Connect writes the offsets to Kafka, the Kafka Connect worker gives the sink connector an opportunity to do work during this stage via the flush(...) method.
A particular sink connector might not need to do anything (if put(...) did all of the work), or it might use this opportunity to submit all of the messages already processed via put(...) to the data store. For example, Confluent's JDBC sink connector writes each batch of messages passed through the put(...) method using a transaction (the size of which can be controlled via the connector's consumer settings), and thus the flush(...) method doesn't need to do anything. Confluent's Elasticsearch sink connector, on the other hand, simply accumulates all of the messages from a series of put(...) calls and only writes them to Elasticsearch during flush(...).
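For illustration, a minimal sketch of a sink task that takes the "accumulate in put(...), write in flush(...)" approach might look like the following; MyNoSqlClient and the nosql.endpoint property are hypothetical stand-ins for your actual NoSQL client and configuration:

    import java.util.ArrayList;
    import java.util.Collection;
    import java.util.List;
    import java.util.Map;

    import org.apache.kafka.clients.consumer.OffsetAndMetadata;
    import org.apache.kafka.common.TopicPartition;
    import org.apache.kafka.connect.sink.SinkRecord;
    import org.apache.kafka.connect.sink.SinkTask;

    public class MyNoSqlSinkTask extends SinkTask {

        // Hypothetical client for the target NoSQL store.
        private MyNoSqlClient client;
        // Records accumulated since the last offset commit.
        private final List<SinkRecord> buffered = new ArrayList<>();

        @Override
        public void start(Map<String, String> props) {
            // Open the connection to the data store using the task configuration.
            client = MyNoSqlClient.connect(props.get("nosql.endpoint"));
        }

        @Override
        public void put(Collection<SinkRecord> records) {
            // Option A: write each batch to the store right here and leave flush() empty.
            // Option B (shown): just accumulate and defer the actual write to flush().
            buffered.addAll(records);
        }

        @Override
        public void flush(Map<TopicPartition, OffsetAndMetadata> currentOffsets) {
            // Called just before Kafka Connect commits the consumer offsets.
            // Writing here means everything up to the committed offsets is in the store.
            client.writeBatch(buffered);
            buffered.clear();
        }

        @Override
        public void stop() {
            client.close();
        }

        @Override
        public String version() {
            return "1.0";
        }
    }

If your store can absorb each batch efficiently on its own, the simpler choice is to write directly in put(...) and leave flush(...) empty, as the JDBC sink connector described above does.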
The frequency with which the offsets are committed for source and sink connectors is controlled by the worker's offset.flush.interval.ms configuration property. The default is to commit offsets every 60 seconds, which is infrequent enough to improve performance and reduce overhead, but frequent enough to cap the potential amount of re-processing should the connector task die unexpectedly. Note that when the connector is shut down gracefully or experiences an exception, Kafka Connect will always have a chance to commit the offsets. It's only when the Kafka Connect worker is killed unexpectedly that it might not have a chance to commit the offsets identifying which messages had been processed. Thus, only after restarting from such a failure will the connector potentially re-process some messages that it had already handled just prior to the failure. And it's because messages may be seen at least once that processing them should be idempotent. Take all of this, plus your connectors' behavior, into account when determining appropriate values for this setting.
Have a look at the Confluent documentation for Kafka Connect as well as open source sink connectors for more examples and details.

Related

confluent kafka dotnet consumer retry

If any message in Kafka fails to be read, I want to save these messages in a new topic and retry the reading process a certain number of times. Meanwhile, I need to pause the reading process at the original location. How can I do that?

What is the relationship between connectors and tasks in Kafka Connect?

We've been using Kafka Connect for a while on a project, currently using only the Confluent Kafka Connect JDBC connector. I'm struggling to understand the role of 'tasks' in Kafka Connect, specifically with this connector. I understand 'connectors'; they encompass a bunch of configuration about a particular source/sink and the topics they connect from/to. I understand that there's a 1:many relationship between connectors and tasks, and the general principle that tasks are used to parallelize work. However, how can we understand when a connector will/might create multiple tasks?
In the source connector case, we are using the JDBC connector to pick up source data by timestamp and/or a primary key, and so this seems sequential in its very nature. Indeed, all of our source connectors only ever seem to have one task. What would ever trigger Kafka Connect to create more than one task? Currently we are running Kafka Connect in distributed mode, but only with one worker; if we had multiple workers, might we get multiple tasks per connector, or are the two not related?
In the sink connector case, we are explicitly configuring each of our sink connectors with tasks.max=1, and so unsurprisingly we only ever see one task for each connector there too. If we removed that configuration, presumably we could/would get more than one task. Would this mean the messages on our input topic might be consumed out of sequence? In which case, how is data consistency for changes assured?
Also, from time to time, we have seen situations where a single connector and task will both enter the FAILED state (because of input connectivity issues). Restarting the task will remove it from this state, and restart the flow of data, but the connector remains in FAILED state. How can this be - isn't the connector's state just the aggregate of all its child tasks?
A task is a thread that performs the actual sourcing or sinking of data.
The number of tasks per connector is determined by the implementation of the connector. Take the Debezium source connector for MySQL as an example: since a MySQL instance writes to exactly one binlog file at a time and that file has to be read sequentially, one connector generates exactly one task.
For sink connectors, on the other hand, the number of tasks can be scaled up to the number of partitions of the consumed topic(s), subject to the tasks.max setting.
The distribution of tasks among workers is determined by a task rebalance, which is a process very similar to a Kafka consumer group rebalance.
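To see where that number comes from, here is a minimal sketch of the connector side. Kafka Connect calls taskConfigs(maxTasks) with the configured tasks.max, and the connector itself decides how many task configurations to actually return; the task.id key below is just an illustrative extra setting:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import org.apache.kafka.connect.sink.SinkConnector;

    public abstract class MySinkConnector extends SinkConnector {

        private Map<String, String> connectorConfig;

        @Override
        public void start(Map<String, String> props) {
            connectorConfig = props;
        }

        @Override
        public List<Map<String, String>> taskConfigs(int maxTasks) {
            // maxTasks is the connector's tasks.max setting; the connector may
            // return fewer configurations (e.g. exactly one for an inherently
            // sequential source such as a MySQL binlog), but never more.
            List<Map<String, String>> configs = new ArrayList<>();
            for (int i = 0; i < maxTasks; i++) {
                Map<String, String> taskConfig = new HashMap<>(connectorConfig);
                taskConfig.put("task.id", Integer.toString(i)); // illustrative extra setting
                configs.add(taskConfig);
            }
            return configs;
        }
    }

A sequential source like the MySQL binlog example would simply return a single-element list regardless of maxTasks, which is why such connectors only ever show one task. For sink connectors, the worker spreads the consumed topic partitions across however many tasks are returned, so running more tasks than there are partitions does not add parallelism.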

Are exactly-once semantics maintained if KStream#process does all DB operations, forwards to the next consumer, and finally writes to a topic?

I want to do all stateful operations in an external database instead of RocksDB. To do that, wherever a stateful operation is required I am writing a custom Processor which performs the DB operation, and the context#forward method forwards the key-value pair written to the DB to downstream consumers, which finally write to a topic.
Kafka Streams provides exactly-once semantics only within Kafka topics.
You don't have such guarantees for writes to external systems such as databases.
The following scenario is possible:
The custom Processor gets a message and performs a write to the external system (DB).
A fatal error occurs: the offset commit on the source topic is not made and no record is passed downstream.
The application is restarted.
The same message is processed by the custom Processor again: it is written to the external system (DB) once more, passed downstream, and only then is the commit performed.
In the described scenario:
The external system (DB) gets the same message twice.
Within Kafka, exactly-once is still achieved.
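Here is a rough sketch of the pattern in question, using the older Processor API; MyDbClient and its upsert(...) call are hypothetical placeholders for your database access code. The DB write in process(...) sits outside the Kafka transaction, which is exactly why it can be repeated after a crash:

    import org.apache.kafka.streams.processor.Processor;
    import org.apache.kafka.streams.processor.ProcessorContext;

    public class DbWritingProcessor implements Processor<String, String> {

        private ProcessorContext context;
        // Hypothetical client for the external database.
        private final MyDbClient dbClient = MyDbClient.connect("jdbc:...");

        @Override
        public void init(ProcessorContext context) {
            this.context = context;
        }

        @Override
        public void process(String key, String value) {
            // 1. Write to the external DB: NOT covered by Kafka's exactly-once guarantee.
            dbClient.upsert(key, value);
            // 2. Forward downstream: covered by exactly-once, but only once the task
            //    commits. If the application dies between these two steps, the record
            //    is reprocessed and the DB sees the write a second time.
            context.forward(key, value);
        }

        @Override
        public void close() {
            dbClient.close();
        }
    }

If the external write can be made idempotent (for example an upsert keyed by the record key), the duplicate delivery becomes harmless.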
If you want to write messages to an external system (DB), it is better to use Kafka Connect (e.g. the JDBC sink connector). You can fully process the messages using Kafka Streams and then use Kafka Connect to copy the data from the output topic to the database.
You can find more about exactly-once semantics here:
Matthias J. Sax's presentation from Kafka Summit: https://kafka-summit.org/sessions/dont-repeat-introducing-exactly-semantics-apache-kafka
Guozhang Wang's blog post on the Confluent blog: https://www.confluent.io/blog/enabling-exactly-once-kafka-streams/
Neha Narkhede's blog post on the Confluent blog: https://www.confluent.io/blog/exactly-once-semantics-are-possible-heres-how-apache-kafka-does-it/

Google Pubsub vs Kafka comparison on the restart of pipeline

I am trying to write an ingestion application on GCP using Apache Beam. It should run in streaming mode, reading data from Kafka or Pub/Sub topics and then ingesting the data into a data source.
While it seems straightforward to write this with Pub/Sub and Apache Beam, my question is: what happens if my ingestion fails or has to be restarted? Does it read all data again from the start of the Pub/Sub topic, or can it, like Kafka, resume from the latest committed offsets in the topic?
Pub/Sub messages are persisted until they are delivered and acknowledged by the subscribers, which receive pending messages from their subscription. Once a message is acknowledged, it is removed from the subscription's queue.
For more information regarding the message flow, check this document
Hope it helps.
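For reference, this is roughly what the acknowledgement looks like with the Google Cloud Pub/Sub Java client; the project and subscription names are placeholders, and when you read through Beam's PubsubIO the connector takes care of acknowledging messages for you once they are durably processed. A message that is never ack()ed is redelivered, so a restarted pipeline resumes from the unacknowledged backlog rather than from the beginning of the topic:

    import com.google.cloud.pubsub.v1.AckReplyConsumer;
    import com.google.cloud.pubsub.v1.MessageReceiver;
    import com.google.cloud.pubsub.v1.Subscriber;
    import com.google.pubsub.v1.ProjectSubscriptionName;
    import com.google.pubsub.v1.PubsubMessage;

    public class IngestSubscriber {

        public static void main(String[] args) {
            // Placeholder project and subscription names.
            String subscription = ProjectSubscriptionName.format("my-project", "my-subscription");

            MessageReceiver receiver = (PubsubMessage message, AckReplyConsumer consumer) -> {
                try {
                    // Hand the message to the ingestion logic first...
                    ingest(message.getData().toStringUtf8());
                    // ...and only acknowledge once it has been processed successfully.
                    consumer.ack();
                } catch (Exception e) {
                    // nack() requests redelivery, so the message is not lost on failure.
                    consumer.nack();
                }
            };

            Subscriber subscriber = Subscriber.newBuilder(subscription, receiver).build();
            subscriber.startAsync().awaitRunning();
        }

        private static void ingest(String payload) {
            // Placeholder for the actual write to the downstream data source.
            System.out.println("ingesting: " + payload);
        }
    }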

When is the Flume memory channel not recoverable, and why?

I am confused about how Flume data is recoverable when using the file channel but not with the memory channel.
I tried a Kafka sink: when I took Flume down while it was reading, Kafka still received the data (file) in the channel properly, and when Flume was restarted, the pipeline continued delivering data reliably. So how is the memory channel not recoverable? In which case would I need to recover data in the channel, especially if Flume starts reading the file from a saved offset?
You can restart Kafka to check whether any messages are lost.
Offset concept:
This depends on Flume's transaction handling. During a Flume restart, some of the transactions might get committed even though the processing failed, for example due to a connection loss.
For example: you have a transaction that requires some processing, after which you store the result in a database, and transaction.commit() is called even when the Flume sink throws an exception. In that case you will lose data during the restart process, because your processing logic throws an exception while the transaction is still committed and the offset is advanced.
So it is safer to take the offsets before the restart process. You should follow these steps (a sketch of the export step follows the list):
Export the offsets
Stop Flume
Import the offsets
Start Flume
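If the offsets in question are Kafka consumer group offsets (for example for a Flume Kafka source or Kafka channel), one way to export them before a restart is with the Kafka AdminClient; the bootstrap server address and the flume-group group id below are placeholders:

    import java.util.Map;
    import java.util.Properties;

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.clients.consumer.OffsetAndMetadata;
    import org.apache.kafka.common.TopicPartition;

    public class ExportOffsets {

        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

            try (AdminClient admin = AdminClient.create(props)) {
                // "flume-group" is a placeholder for the consumer group used by
                // the Flume Kafka source/channel.
                Map<TopicPartition, OffsetAndMetadata> offsets = admin
                        .listConsumerGroupOffsets("flume-group")
                        .partitionsToOffsetAndMetadata()
                        .get();

                // Print (or persist) the offsets so they can be restored after the restart.
                offsets.forEach((tp, offset) ->
                        System.out.println(tp + " -> " + offset.offset()));
            }
        }
    }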