We are currently on HDF (Hortonworks Dataflow) 3.3.1 which bundles Kafka 2.0.0 and are trying to use Kafka Connect in distributed mode to launch a Google Cloud PubSub Sink connector.
We are planning on sending back some metadata into a Kafka Topic and need to integrate a Kafka producer into the flush() function of the Sink task java code.
Would this have a negative impact on the process where Kafka Connect commits back the offsets to Kafka (as we would be adding a overhead of running a Kafka producer before the flush).
Also, how does Kafka Connect get the Bootstrap servers list from the configuration when it is not specified in the Connector Properties for either the sink or the source? I need to use the same Bootstrap server list to start the producer.
Currently I am changing the config for the sink connector, adding bootstrap server list as a property and parsing it in the Java code for the connector. I would like to use bootstrap server list from the Kafka Connect worker properties if that is possible.
Kindly help on this.
Thanks in advance.
need to integrate a Kafka producer into the flush() function of the Sink task java code
There is no producer instance exposed in the SinkTask API...
Would this have a negative impact on the process where Kafka Connect commits back the offsets to Kafka (as we would be adding a overhead of running a Kafka producer before the flush).
I mean, you can add whatever code you want. As far as negative impacts go, that's up to you to benchmark on your own infrastructure. Obviously adding more blocking code makes the other processes slower overall
how does Kafka Connect get the Bootstrap servers list from the configuration when it is not specified in the Connector Properties for either the sink or the source?
Sinks and sources are not workers. Look at connect-distributed.properties
I would like to use bootstrap server list from the Kafka Connect worker properties if that is possible
It's not possible. Adding extra properties to the sink/source configs are the only way. (Feel free to make a Kafka JIRA requesting such a feature of exposing the worker configs, though)
Related
I have a requirement to read messages from a topic, enrich the message based on provided configuration (data required for enrichment is sourced from external systems), and publish the enriched message to an output topic. Messages on both source and output topics should be Avro format.
Is this a good use case for a custom Kafka Connector or should I use Kafka Streams?
Why I am considering Kafka Connect?
Lightweight in terms of code and deployment
Configuration driven
Connection and error handling
Scalability
I like the plugin based approach in Connect. If there is a new type of message that needs to be handled I just deploy a new connector without having to deploy a full scale Java app.
Why I am not sure this is good candidate for Kafka Connect?
Calls to external system
Can Kafka be both source and sink for a connector?
Can we use Avro schemas in connectors?
Performance under load
Cannot do stateful processing (currently there is no requirement)
I have experience with Kafka Streams but not with Connect
Use both?
Use Kafka Connect to source external database into a topic.
Use Kafka Streams to build that topic into a stream/table that can then be manipulated.
Use Kafka Connect to sink back into a database, or other system other than Kafka, as necessary.
Kafka Streams can also be config driven, use plugins (i.e. reflection), is just as scalable, and has no different connection modes (to Kafka). Performance should be the similar. Error handling is really the only complex part. ksqlDB is entirely "config driven" via SQL statements, and can connect to external Connect clusters, or embed its own.
Avro works for both, yes.
Some connectors are temporarily stateful, as they build in-memory batches, such as S3 or JDBC sink connectors
Kafka Streams is good, but I have to do every configuration very manual. Instead Kafka Connect provides its API interface, which is very useful for handling the configuration, as well as Tasks, Workers, etc...
Thus, I'm thinking of using Kafka Connect for my simple data transforming service. Basically, the service will read the data from a topic and send the transformed data to another topic. In order to do that, I have to make a custom Sink Connector to send the transformed data to the kafka topic, however, it seems those interface functions aren't available in SinkConnector. If I can do it, that would be great since I can manage tasks, workers via the REST API and running the tasks under a distributed mode (multiple instances).
There are 2 options in my mind:
Figuring out how to send the message from SinkConnector to a kafka topic
Figuring out how to build a REST interface API like Kafka Connect which wraps up the Kafka Streams app
Any ideas?
Figuring out how to send the message from SinkConnector to a kafka topic
A sink connector consumes data/messages from a Kafka topic. If you want to send data to a Kafka topic you are likely talking about a source connector.
Figuring out how to build a REST interface API like Kafka Connect which wraps up the Kafka Streams app.
using the kafka-connect-archtype you can have a template to create your own Kafka connector (source or sink). In your case that you want to build some stream processing pipeline after the connector, you are mostly talking about a connector of another stream processing engine that is not Kafka-stream. There are connectors for Kafka <-> Spark, Kafka <-> Flink, ...
But you can build your using the template of kafka-connect-archtype if you want. Use the MySourceTask List<SourceRecord> poll() method or the MySinkTask put(Collection<SinkRecord> records) method to process the records as stream. They extend the org.apache.kafka.connect.[source.SourceTask|sink.SinkTask] from Kafka connect.
a REST interface API like Kafka Connect which wraps up the Kafka Streams app
This is exactly what KsqlDB allows you to do
Outside of creating streams and tables with SQL queries, it offers a REST API as well as can interact with Connect endpoints (or embed a Connect worker itself)
https://docs.ksqldb.io/en/latest/concepts/connectors/
I need to understand when to use Kafka connect vs. own consumer/producer written by developer. We are getting Confluent Platform. Also to achieve fault tolerant design do we have to run the consumer/producer code ( jar file) from all the brokers ?
Kafka connect is typically used to connect external sources to Kafka i.e. to produce/consume to/from external sources from/to Kafka.
Anything that you can do with connector can be done through
Producer+Consumer
Readily available Connectors only ease connecting external sources to Kafka without requiring the developer to write the low-level code.
Some points to remember..
If the source and sink are both the same Kafka cluster, Connector doesn't make sense
If you are doing changed-data-capture (CDC) from a database and push them to Kafka, you can use a Database source connector.
Resource constraints: Kafka connect is a separate process. So double check what you can trade-off between resources and ease of development.
If you are writing your own connector, it is well and good, unless someone has not already written it. If you are using third-party connectors, you need to check how well they are maintained and/or if support is available.
do we have to run the consumer/producer code ( jar file) from all the brokers ?
Don't run client code on the brokers. Let all memory and disk access be reserved for the broker process.
when to use Kafka connect vs. own consumer/produce
In my experience, these factors should be taken into consideration
You're planning on deploying and monitoring Kafka Connect anyway, and have the available resources to do so. Again, these don't run on the broker machines
You don't plan on changing the Connector code very often, because you must restart the whole connector JVM, which would be running other connectors that don't need restarted
You aren't able to integrate your own producer/consumer code into your existing applications or simply would rather have a simpler produce/consume loop
Having structured data not tied to the a particular binary format is preferred
Writing your own or using a community connector is well tested and configurable for your use cases
Connect has limited options for fault tolerance compared to the raw producer/consumer APIs, with the drawbacks of more code, depending on other libraries being used
Note: Confluent Platform is still the same Apache Kafka
Kafka Connect:
Kafka Connect is an open-source platform which basically contains two types: Sink and Source. The Kafka Connect is used to fetch/put data from/to a database to/from Kafka. The Kafka connect helps to use various other systems with Kafka. It also helps in tracking the changes (as mentioned in one of the answers Changed Data Capture (CDC) ) from DB's to Kafka. The system maintains the offset, in order to read/write data from that particular offset to Kafka or any other database.
For more details, you can refer to https://docs.confluent.io/current/connect/index.html
The Producer/Consumer:
The Producer and Consumer are just an end system, which use the Kafka to produce and consume topics to/from Kafka. They are used where we want to broadcast the data to various consumers in a consumer group. This kind of system also maintains the lag and offsets of data for the consumer groups.
No, you don't need to run any producer/consumer while running Kafka connect. In case you want to check there is no data loss you can run the consumer while running Source Connectors. In case, of Sink Connectors, the already produced data can be verified in your database, by running their particular select queries.
I'm to build a Java based Kafka streaming application that will listen to a topic X continiously, fetch data, perform some basic cleansing and write to a Oracle database. The kafka cluster is outside my domain and have no ability to deploy any code or configurations in it.
What is the best way to design such a solution? I came across Kafka Streams but was confused as to if it can be used for 'Topic > Process > Topic' scenarios?
I came accross Kafka Streams but was confused as to if it can be used for 'Topic > Process > Topic' scenarios?
Absolutely.
For example, excluding the "process" step, it's two lines outside of the configuration setup.
final StreamsBuilder builder = new StreamsBuilder();
builder.stream("streams-plaintext-input").to("streams-pipe-output");
This code is straight from the documentation
If you want to write to any database, you should first check if there is a Kafka Connect plugin to do that for you. Kafka Streams shouldn't really be used to read/write from/to external systems outside of Kafka, as it is latency-sensitive.
In your case, the JDBC Sink Connector would work well.
The kafka cluster is outside my domain and have no ability to deploy any code or configurations in it.
Using either solution above, you don't need to, but you will need some machine with Java installed to run a continous Kafka Streams application and/or Kafka Connect worker.
In flume, I have Kafka-channel from where I can read and write data.
What is the difference between the performance of reading and writing data into Kafka channel if I replace Kafka source and Kafka sink with Avro source and Avro sink?
In my opinion, by replacing Kafka-source with Avro-source, I will be unable to read data in parallel from multiple partitions of Kafka broker, as there is no consumer group specified in case of Avro-source. Please correct me if I am wrong.
In Flume, the Avro RPC source binds to a specified TCP port of a network interface, so only one Avro source of one of the Flume agents running on a single machine can ever receive events sent to this port.
Avro source is meant to connect two or more Flume agents together: one or more Avro sinks connect to a single Avro source.
As you point out, using Kafka as a source allows for events to be received by several consumer groups. However, my experience with Flume 1.6.0 is that it is faster to push events from one Flume agent to another on a remote host through Avro RPC rather than through Kafka.
So I ended up with the following setup for log data collection:
[Flume agent on remote collected node] =Avro RPC=> [Flume agent in central cluster] =Kafka=> [multiple consumer groups in central cluster]
This way, I got better log ingestion and processing throughput and I also could encrypt and compress log data between remote sites and central cluster. This may however change when Flume adds support for the new protocol introduced by Kafka 0.9.0 in a future version, possibly making Kafka more usable as the front interface of the central cluster with remote data collection nodes (see here).