Build a data transformation service using Kafka Connect - apache-kafka

Kafka Streams is good, but I have to do every configuration very manual. Instead Kafka Connect provides its API interface, which is very useful for handling the configuration, as well as Tasks, Workers, etc...
Thus, I'm thinking of using Kafka Connect for my simple data transforming service. Basically, the service will read the data from a topic and send the transformed data to another topic. In order to do that, I have to make a custom Sink Connector to send the transformed data to the kafka topic, however, it seems those interface functions aren't available in SinkConnector. If I can do it, that would be great since I can manage tasks, workers via the REST API and running the tasks under a distributed mode (multiple instances).
There are 2 options in my mind:
Figuring out how to send the message from SinkConnector to a kafka topic
Figuring out how to build a REST interface API like Kafka Connect which wraps up the Kafka Streams app
Any ideas?

Figuring out how to send the message from SinkConnector to a kafka topic
A sink connector consumes data/messages from a Kafka topic. If you want to send data to a Kafka topic you are likely talking about a source connector.
Figuring out how to build a REST interface API like Kafka Connect which wraps up the Kafka Streams app.
using the kafka-connect-archtype you can have a template to create your own Kafka connector (source or sink). In your case that you want to build some stream processing pipeline after the connector, you are mostly talking about a connector of another stream processing engine that is not Kafka-stream. There are connectors for Kafka <-> Spark, Kafka <-> Flink, ...
But you can build your using the template of kafka-connect-archtype if you want. Use the MySourceTask List<SourceRecord> poll() method or the MySinkTask put(Collection<SinkRecord> records) method to process the records as stream. They extend the org.apache.kafka.connect.[source.SourceTask|sink.SinkTask] from Kafka connect.

a REST interface API like Kafka Connect which wraps up the Kafka Streams app
This is exactly what KsqlDB allows you to do
Outside of creating streams and tables with SQL queries, it offers a REST API as well as can interact with Connect endpoints (or embed a Connect worker itself)
https://docs.ksqldb.io/en/latest/concepts/connectors/

Related

Kafka Connect or Kafka Streams?

I have a requirement to read messages from a topic, enrich the message based on provided configuration (data required for enrichment is sourced from external systems), and publish the enriched message to an output topic. Messages on both source and output topics should be Avro format.
Is this a good use case for a custom Kafka Connector or should I use Kafka Streams?
Why I am considering Kafka Connect?
Lightweight in terms of code and deployment
Configuration driven
Connection and error handling
Scalability
I like the plugin based approach in Connect. If there is a new type of message that needs to be handled I just deploy a new connector without having to deploy a full scale Java app.
Why I am not sure this is good candidate for Kafka Connect?
Calls to external system
Can Kafka be both source and sink for a connector?
Can we use Avro schemas in connectors?
Performance under load
Cannot do stateful processing (currently there is no requirement)
I have experience with Kafka Streams but not with Connect
Use both?
Use Kafka Connect to source external database into a topic.
Use Kafka Streams to build that topic into a stream/table that can then be manipulated.
Use Kafka Connect to sink back into a database, or other system other than Kafka, as necessary.
Kafka Streams can also be config driven, use plugins (i.e. reflection), is just as scalable, and has no different connection modes (to Kafka). Performance should be the similar. Error handling is really the only complex part. ksqlDB is entirely "config driven" via SQL statements, and can connect to external Connect clusters, or embed its own.
Avro works for both, yes.
Some connectors are temporarily stateful, as they build in-memory batches, such as S3 or JDBC sink connectors

Kafka Streams without Sink

I'm currently planning the architecture for an application that reads from a Kafka topic and after some conversion puts data to RabbitMq.
I'm kind new for Kafka Streams and they look a good choice for my task. But the problem is that Kafka server is hosted at another vendor's place, so I can't even install Cafka Connector to RabbitMq Sink plugin.
Is it possible to write Kafka steam application that doesn't have any Sink points, but just processes input stream? I can just push to RabbitMQ in foreach operations, but I'm not sure will Stream even work without a sink point.
foreach is a Sink action, so to answer your question directly, no.
However, Kafka Streams should be limited to only Kafka Communication.
Kafka Connect can be installed and ran anywhere, if that is what you wanted to use... You can also use other Apache tools like Camel, Spark, NiFi, Flink, etc to write to RabbitMQ after consuming from Kafka, or write any application in a language of your choice. For example, the Spring Integration or Cloud Streams frameworks allows a single contract between many communication channels

Kafka Producer vs Kafka Connector

I need to push data from database(say Oracle DB) to a kafka topic by calling a stored procedure.I need to do some validations too on message
Should i use a Producer API or Kafka Connector.
You should use Kafka Connect. Either that, or use the producer API and write some code that handles:
Scale-out
Failover
Restarts
Schemas
Serialisation
Transformations
and that integrates with hundreds of other technologies in a standardised manner for ease of portability and management.
At which point, you'll have reinvented Kafka Connect ;-)
To learn more about Kafka Connect see this talk. To learn about the specifics of database->Kafka ingestion see this talk and this blog.

Kafka 2.0 - Kafka Connect Sink - Creating a Kafka Producer

We are currently on HDF (Hortonworks Dataflow) 3.3.1 which bundles Kafka 2.0.0 and are trying to use Kafka Connect in distributed mode to launch a Google Cloud PubSub Sink connector.
We are planning on sending back some metadata into a Kafka Topic and need to integrate a Kafka producer into the flush() function of the Sink task java code.
Would this have a negative impact on the process where Kafka Connect commits back the offsets to Kafka (as we would be adding a overhead of running a Kafka producer before the flush).
Also, how does Kafka Connect get the Bootstrap servers list from the configuration when it is not specified in the Connector Properties for either the sink or the source? I need to use the same Bootstrap server list to start the producer.
Currently I am changing the config for the sink connector, adding bootstrap server list as a property and parsing it in the Java code for the connector. I would like to use bootstrap server list from the Kafka Connect worker properties if that is possible.
Kindly help on this.
Thanks in advance.
need to integrate a Kafka producer into the flush() function of the Sink task java code
There is no producer instance exposed in the SinkTask API...
Would this have a negative impact on the process where Kafka Connect commits back the offsets to Kafka (as we would be adding a overhead of running a Kafka producer before the flush).
I mean, you can add whatever code you want. As far as negative impacts go, that's up to you to benchmark on your own infrastructure. Obviously adding more blocking code makes the other processes slower overall
how does Kafka Connect get the Bootstrap servers list from the configuration when it is not specified in the Connector Properties for either the sink or the source?
Sinks and sources are not workers. Look at connect-distributed.properties
I would like to use bootstrap server list from the Kafka Connect worker properties if that is possible
It's not possible. Adding extra properties to the sink/source configs are the only way. (Feel free to make a Kafka JIRA requesting such a feature of exposing the worker configs, though)

REST endpoint as Kafka Sink

I am new to Kafka world. We are planning to set up Kafka to fulfill our data streaming needs. The sink in our case, is REST endpoint. What connectors are available to support Kafka => REST endpoint connectivity? This is similar to how AWS simple queues or topics work.
AFAIK there is no certified HTTP sink for Apache Kafka. Why not simply create a kafka consumer and for every message (or message batch) make a REST call to your service?
Nothing out of the box that I'm aware of but akka-streams feels well suited for this use-case.
The source would be a Kafka message stream created using akka-stream-kafka (aka reactive-kafka) and the sink the akka-http client (flow-based variant):
http://doc.akka.io/docs/akka-http/10.0.6/scala/http/client-side/request-level.html#flow-based-variant
Integration patterns for akka-streams are (starting to be) documented under the name alpakka:
http://developer.lightbend.com/docs/alpakka/current/