Is exactly-once semantics maintained if KStream#process does all DB operations and forwards to the next consumer, finally writing to a topic? - apache-kafka

I want to do all stateful operations in an external database instead of RocksDB. To do that, wherever a stateful operation is required, I am writing a custom Processor that performs the DB operation; the context#forward method then forwards the key-value pair written to the DB to downstream consumers, and the result is finally written to a topic.
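For concreteness, a minimal sketch of the pattern described above (using the newer Processor API; the writeToDatabase helper is hypothetical and stands in for whatever DB client is used):

import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;

public class DbWritingProcessor implements Processor<String, String, String, String> {

    private ProcessorContext<String, String> context;

    @Override
    public void init(ProcessorContext<String, String> context) {
        this.context = context;
    }

    @Override
    public void process(Record<String, String> record) {
        // stateful work goes to the external database instead of a RocksDB state store
        writeToDatabase(record.key(), record.value());
        // forward the same record downstream; it eventually ends up in the output topic
        context.forward(record);
    }

    private void writeToDatabase(String key, String value) {
        // hypothetical JDBC/DAO call
    }
}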

Kafka Streams enables exactly-once semantics only within Kafka topics.
You don't have such guarantees for writes to external systems such as databases.
The following scenario is possible:
The custom Processor gets a message and performs a write to the external system (DB).
A fatal error occurs: the offset commit on the source topic is not made, and no record is passed downstream.
The application is restarted.
The same message is processed by the custom Processor again: the same message is written to the external system (DB), passed downstream, and only then is the commit performed.
In the described scenario:
The external system (DB) gets the same message twice.
Within Kafka, exactly-once is achieved.
If you want to write messages to an external system (DB), it is better to use Kafka Connect (JDBC Sink Connector). You can fully process the message using Kafka Streams and then use Kafka Connect to copy the data from the output topic to the database.
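If you go the Kafka Connect route, a sketch of a JDBC Sink Connector configuration (connection details, topic, and primary-key settings are placeholders and depend on your schema) might look roughly like this:

{
  "name": "db-sink",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "topics": "output-topic",
    "connection.url": "jdbc:postgresql://db-host:5432/mydb",
    "connection.user": "user",
    "connection.password": "secret",
    "insert.mode": "upsert",
    "pk.mode": "record_key",
    "pk.fields": "id",
    "auto.create": "true"
  }
}

With insert.mode set to upsert and a primary key defined, a re-delivered message overwrites the same row, which also mitigates the duplicate-write scenario described above.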
More regarding exactly-once semantics can be found in:
Matthias J. Sax's presentation from Kafka Summit: https://kafka-summit.org/sessions/dont-repeat-introducing-exactly-semantics-apache-kafka
Guozhang Wang's blog post on the Confluent blog: https://www.confluent.io/blog/enabling-exactly-once-kafka-streams/
Neha Narkhede's blog post on the Confluent blog: https://www.confluent.io/blog/exactly-once-semantics-are-possible-heres-how-apache-kafka-does-it/

Related

Kafka Connect or Kafka Streams?

I have a requirement to read messages from a topic, enrich the messages based on provided configuration (the data required for enrichment is sourced from external systems), and publish the enriched messages to an output topic. Messages on both the source and output topics should be in Avro format.
Is this a good use case for a custom Kafka Connector or should I use Kafka Streams?
Why am I considering Kafka Connect?
Lightweight in terms of code and deployment
Configuration driven
Connection and error handling
Scalability
I like the plugin based approach in Connect. If there is a new type of message that needs to be handled I just deploy a new connector without having to deploy a full scale Java app.
Why am I not sure this is a good candidate for Kafka Connect?
Calls to external system
Can Kafka be both source and sink for a connector?
Can we use Avro schemas in connectors?
Performance under load
Cannot do stateful processing (currently there is no requirement)
I have experience with Kafka Streams but not with Connect
Use both?
Use Kafka Connect to source external database into a topic.
Use Kafka Streams to build that topic into a stream/table that can then be manipulated (a sketch of this step follows the list).
Use Kafka Connect to sink back into a database, or other system other than Kafka, as necessary.
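For the middle (Streams) step, a sketch of such an enrichment topology (topic names and the join logic are illustrative):

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

public class EnrichmentTopology {

    public static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();

        // raw messages to enrich
        KStream<String, String> events = builder.stream("input-events");
        // reference data that Kafka Connect sourced from the external system
        KTable<String, String> reference = builder.table("reference-data");

        events
            .join(reference, (event, ref) -> event + " | " + ref)  // enrichment logic
            .to("enriched-events");                                 // sunk onward via Connect if needed

        return builder.build();
    }
}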
Kafka Streams can also be config-driven, can use plugins (i.e. reflection), is just as scalable, and connects to Kafka in the same way. Performance should be similar. Error handling is really the only complex part. ksqlDB is entirely "config driven" via SQL statements, and can connect to external Connect clusters or embed its own.
Avro works for both, yes.
Some connectors are temporarily stateful, as they build in-memory batches, such as the S3 or JDBC sink connectors.

Can Kafka publish messages to AWS Lambda?

I have to publish messages from a Kafka topic to Lambda, process them, and store them in a database using a Spring Boot application. I did some research and found something to consume messages from Kafka:
public Function<KStream<String, String>, KStream<String, String>> process(){}
However, I'm not sure if this is only used to publish the consumed messages to another Kafka topic or if it can be used as an event source for Lambda. I need some guidance on consuming and converting the consumed Kafka message into an event source.
Brokers do not push. Consumers always poll.
The code shown is for the Kafka Streams API, which primarily writes to new Kafka topics. While you could fire HTTP events to start a Lambda, that's not recommended.
Alternatively, Kafka is already supported as an event source. You don't need to write any consumer code.
https://aws.amazon.com/about-aws/whats-new/2020/12/aws-lambda-now-supports-self-managed-apache-kafka-as-an-event-source/
This is possible from MSK or a self-managed Kafka cluster.
process them and store in a database
Your Lambda could process the data and send it to a new Kafka topic using a producer. You could then use MSK Connect, or run your own Kafka Connect cluster elsewhere, to dump the records into a database. No Spring/Java code would be necessary for that part.
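If you do keep the produce-to-Kafka step in Java inside the Lambda, a rough sketch (the broker address and topic are placeholders, and the exact shape of the incoming Kafka event payload is defined by AWS, so it is only hinted at here):

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Map;
import java.util.Properties;

public class KafkaForwardingHandler implements RequestHandler<Map<String, Object>, Void> {

    // reuse the producer across invocations of the same Lambda container
    private static final KafkaProducer<String, String> producer = createProducer();

    private static KafkaProducer<String, String> createProducer() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092"); // placeholder
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        return new KafkaProducer<>(props);
    }

    @Override
    public Void handleRequest(Map<String, Object> event, Context context) {
        // decode the records from the event-source payload and apply the processing logic here
        String processedValue = process(event);
        producer.send(new ProducerRecord<>("processed-topic", processedValue)); // placeholder topic
        producer.flush();
        return null;
    }

    private String process(Map<String, Object> event) {
        // hypothetical processing step
        return event.toString();
    }
}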

Build a data transformation service using Kafka Connect

Kafka Streams is good, but I have to do all the configuration very manually. Kafka Connect, by contrast, provides an API interface that is very useful for handling configuration, as well as tasks, workers, etc.
Thus, I'm thinking of using Kafka Connect for my simple data transformation service. Basically, the service will read data from a topic and send the transformed data to another topic. To do that, I would have to build a custom sink connector that sends the transformed data to a Kafka topic; however, those interface functions don't seem to be available in SinkConnector. If I could do it, that would be great, since I could manage tasks and workers via the REST API and run the tasks in distributed mode (multiple instances).
There are 2 options in my mind:
Figuring out how to send the message from SinkConnector to a kafka topic
Figuring out how to build a REST interface API like Kafka Connect which wraps up the Kafka Streams app
Any ideas?
Figuring out how to send the message from SinkConnector to a kafka topic
A sink connector consumes data/messages from a Kafka topic. If you want to send data to a Kafka topic, you are likely talking about a source connector.
Figuring out how to build a REST interface API like Kafka Connect which wraps up the Kafka Streams app.
Using the kafka-connect-archetype you can get a template for creating your own Kafka connector (source or sink). In your case, since you want to build a stream-processing pipeline after the connector, you are mostly talking about a connector for another stream-processing engine that is not Kafka Streams. There are connectors for Kafka <-> Spark, Kafka <-> Flink, ...
But you can build yours using the kafka-connect-archetype template if you want. Use the MySourceTask List<SourceRecord> poll() method or the MySinkTask put(Collection<SinkRecord> records) method to process the records as a stream. They extend org.apache.kafka.connect.source.SourceTask and org.apache.kafka.connect.sink.SinkTask from Kafka Connect.
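For orientation, a bare-bones source task skeleton along those lines (class, topic, and offset names are made up):

import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.source.SourceRecord;
import org.apache.kafka.connect.source.SourceTask;

import java.util.Collections;
import java.util.List;
import java.util.Map;

public class MySourceTask extends SourceTask {

    @Override
    public void start(Map<String, String> props) {
        // open connections / read the task configuration
    }

    @Override
    public List<SourceRecord> poll() throws InterruptedException {
        // in a real task you would block on, or batch-read, your upstream data here
        String transformedValue = "example";
        return Collections.singletonList(new SourceRecord(
                Collections.singletonMap("source", "demo"),  // source partition
                Collections.singletonMap("position", 0L),    // source offset
                "transformed-topic",                         // target topic
                Schema.STRING_SCHEMA,
                transformedValue));
    }

    @Override
    public void stop() {
        // close connections
    }

    @Override
    public String version() {
        return "0.1";
    }
}

The framework takes the records returned from poll() and writes them to the target topic, which is what the question is effectively asking for.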
a REST interface API like Kafka Connect which wraps up the Kafka Streams app
This is exactly what ksqlDB allows you to do.
Beyond creating streams and tables with SQL queries, it offers a REST API and can interact with Connect endpoints (or embed a Connect worker itself).
https://docs.ksqldb.io/en/latest/concepts/connectors/

Put() vs Flush() in Kafka Connector Sink Task

I am trying to send data in batches to a NoSQL database using a Kafka sink connector. I am following the https://kafka.apache.org/documentation/#connect documentation and am confused about where the logic for sending records has to be implemented. Please help me understand how the records are processed internally and whether put() or flush() should be used to process the records in a batch.
When a Kafka Connect worker is running a sink task, it will consume messages from the topic partition(s) assigned to the task. As it does so, it repeatedly passes a batch of messages to the sink task through the put(Collection<SinkRecord>) method. This will continue as long as the connector and its tasks are running.
Kafka Connect also will periodically record the progress of the sink tasks, namely the offset of the most recently processed message on each topic partition. This is called committing the offsets, and it does this so that if the connector stops unexpectedly and uncleanly, Kafka Connect knows where in each topic partition the task should resume processing messages. But just before Kafka Connect writes the offsets to Kafka, the Kafka Connect worker gives the sink connector an opportunity to do work during this stage via the flush(...) method.
A particular sink connector might not need to do anything (if put(...) did all of the work), or it might use this opportunity to submit all of the messages already processed via put(...) to the data store. For example, Confluent's JDBC sink connector writes each batch of messages passed through the put(...) method using a transaction (the size of which can be controlled via the connector's consumer settings), and thus the flush(...) method doesn't need to do anything. Confluent's ElasticSearch sink connector, on the other hand, simply accumulates all of the messages for a series of put(...) methods and only writes them to Elasticsearch during flush(...).
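To make the second style concrete, here is a sketch of a sink task that accumulates records in put() and writes them out in flush(); writeBatchToStore() is a hypothetical stand-in for the NoSQL client's bulk API:

import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.connect.sink.SinkRecord;
import org.apache.kafka.connect.sink.SinkTask;

import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;

public class MySinkTask extends SinkTask {

    private final List<SinkRecord> buffer = new ArrayList<>();

    @Override
    public void start(Map<String, String> props) {
        // initialise the database client from the task configuration
    }

    @Override
    public void put(Collection<SinkRecord> records) {
        // called repeatedly with small batches; just buffer them here
        buffer.addAll(records);
    }

    @Override
    public void flush(Map<TopicPartition, OffsetAndMetadata> currentOffsets) {
        // called just before offsets are committed: write everything buffered so far
        writeBatchToStore(buffer);
        buffer.clear();
    }

    @Override
    public void stop() {
        // close the database client
    }

    @Override
    public String version() {
        return "0.1";
    }

    private void writeBatchToStore(List<SinkRecord> batch) {
        // hypothetical bulk write to the NoSQL database
    }
}

The other style simply does all of the writing inside put() itself, in which case flush() can be left as a no-op.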
The frequency with which offsets are committed for source and sink connectors is controlled by the Connect worker's offset.flush.interval.ms configuration property. The default is to commit offsets every 60 seconds, which is infrequent enough to improve performance and reduce overhead, but frequent enough to cap the potential amount of re-processing should the connector task die unexpectedly. Note that when the connector is shut down gracefully or experiences an exception, Kafka Connect will always have a chance to commit the offsets. It's only when the Kafka Connect worker is killed unexpectedly that it might not have a chance to commit the offsets identifying which messages had been processed. Thus, only after restarting from such a failure will the connector potentially re-process some messages that it handled just prior to the failure. And because messages may be seen at least once, processing should be idempotent. Take all of this, plus your connector's behavior, into account when determining appropriate values for this setting.
Have a look at the Confluent documentation for Kafka Connect as well as open source sink connectors for more examples and details.

Kafka source vs Avro source for reading and writing data into a Kafka channel using Flume

In Flume, I have a Kafka channel from which I can read and write data.
What is the difference in performance when reading and writing data into the Kafka channel if I replace the Kafka source and Kafka sink with an Avro source and Avro sink?
In my opinion, by replacing the Kafka source with an Avro source, I will be unable to read data in parallel from multiple partitions of the Kafka broker, as there is no consumer group specified for an Avro source. Please correct me if I am wrong.
In Flume, the Avro RPC source binds to a specified TCP port of a network interface, so only one Avro source of one of the Flume agents running on a single machine can ever receive events sent to this port.
Avro source is meant to connect two or more Flume agents together: one or more Avro sinks connect to a single Avro source.
As you point out, using Kafka as a source allows for events to be received by several consumer groups. However, my experience with Flume 1.6.0 is that it is faster to push events from one Flume agent to another on a remote host through Avro RPC rather than through Kafka.
So I ended up with the following setup for log data collection:
[Flume agent on remote collected node] =Avro RPC=> [Flume agent in central cluster] =Kafka=> [multiple consumer groups in central cluster]
This way, I got better log ingestion and processing throughput, and I could also encrypt and compress log data between the remote sites and the central cluster. This may however change when Flume adds support for the new protocol introduced by Kafka 0.9.0 in a future version, possibly making Kafka more usable as the front interface of the central cluster with remote data collection nodes (see here).