Is there any way to forward Kafka messages from topic on one server to topic on another server? - apache-kafka

I have a scenario where we are forwarding our application logs to Kafka topic using fluentD agents,
as Kafka team introduced Kerberos authentication and fluentD version not supporting this authentication, I cannot directly use forward logs.
Now we have introduced a new Kafka server without authentication and created a topic there, I want forward messages from this topic in the new server to another topic in another server using Kafka connectors,
want to know how I can achieve this?

There's several different tools that enable you to stream messages from a Kafka topic on one cluster to a different cluster, including:
MirrorMaker (open source, part of Apache Kafka)
Confluent's Replicator (commercial tool, 30 day free trial)
uReplicator (open sourced from Uber)
Mirus (open sourced from Salesforce)
Brucke (open source)
Disclaimer: I work for Confluent.

Related

How to transfer data from source Kafka cluster to target Kafka cluster using Kafka Connect?

As described in the title, I have a use case where I want to copy data from source Kafka topic (Cloudera Kafka cluster) to destination Kafka topic (AWS MSK Kafka cluster) using Kafka connect framework. I have already gone through some of the available options for e.g. KafkaCat Utility and Mirror Maker 2. But I am curious if there is any such connector available in opensource.
Links followed:
Kafkacat: https://rmoff.net/2019/09/29/copying-data-between-kafka-clusters-with-kafkacat/
MirrorMaker: https://cwiki.apache.org/confluence/display/KAFKA/KIP-382%3A+MirrorMaker+2.0
MirrorMaker 2 is open source, and it does exactly what you're asking
https://github.com/apache/kafka/tree/trunk/connect/mirror
kcat is also open-source, but doesn't exactly scale, and doesn't use Connect framework.

Kafka vs. Kafka Connect terminology - Sources & Sinks or Producers & Consumers

I'm reading up on Kafka and Kafka Connect. The documentation mentions 'Kafka sources' and 'Kafka sinks' in a generic sort of way in Kafka Connect documentation. I'm not certain if these two terms are specific to Kafka Connect or they are simply referring Producers and Consumers.
If you are in need to bring data into your kafka cluster or copy data outside of your kafka ( copy data from / into kafka ) there are many tools supporting you on that task ,
You might as well write and MAINTAIN your code with Kafka Consumer / Producer API
In order to avoid struggling to create new code for "already solved problem" kafka community developed the Kafka Connect framework.
the "kafka way" is by leveraging its internal ecosystem tool named kafka connect.
kafka connect is a distributed framework which has many connectors supported by community or vendor. open sourced or proprietary, there is big and growing hub "market place" for any need.
connector is piece of pluggable code (JAR files) that runs inside the framework, there are two types of connectors , sink connector is "read from kafka and sink to target", and source connector which is "read from data source and write to kafka".
in order to set up a connector you are just setting a configuration file with all the required parameters, without the need of any programming skills. no code. losing some flexibility in favor of simplicity

Kafka Producer vs Kafka Connector

I need to push data from database(say Oracle DB) to a kafka topic by calling a stored procedure.I need to do some validations too on message
Should i use a Producer API or Kafka Connector.
You should use Kafka Connect. Either that, or use the producer API and write some code that handles:
Scale-out
Failover
Restarts
Schemas
Serialisation
Transformations
and that integrates with hundreds of other technologies in a standardised manner for ease of portability and management.
At which point, you'll have reinvented Kafka Connect ;-)
To learn more about Kafka Connect see this talk. To learn about the specifics of database->Kafka ingestion see this talk and this blog.

Kafka 2.0 - Kafka Connect Sink - Creating a Kafka Producer

We are currently on HDF (Hortonworks Dataflow) 3.3.1 which bundles Kafka 2.0.0 and are trying to use Kafka Connect in distributed mode to launch a Google Cloud PubSub Sink connector.
We are planning on sending back some metadata into a Kafka Topic and need to integrate a Kafka producer into the flush() function of the Sink task java code.
Would this have a negative impact on the process where Kafka Connect commits back the offsets to Kafka (as we would be adding a overhead of running a Kafka producer before the flush).
Also, how does Kafka Connect get the Bootstrap servers list from the configuration when it is not specified in the Connector Properties for either the sink or the source? I need to use the same Bootstrap server list to start the producer.
Currently I am changing the config for the sink connector, adding bootstrap server list as a property and parsing it in the Java code for the connector. I would like to use bootstrap server list from the Kafka Connect worker properties if that is possible.
Kindly help on this.
Thanks in advance.
need to integrate a Kafka producer into the flush() function of the Sink task java code
There is no producer instance exposed in the SinkTask API...
Would this have a negative impact on the process where Kafka Connect commits back the offsets to Kafka (as we would be adding a overhead of running a Kafka producer before the flush).
I mean, you can add whatever code you want. As far as negative impacts go, that's up to you to benchmark on your own infrastructure. Obviously adding more blocking code makes the other processes slower overall
how does Kafka Connect get the Bootstrap servers list from the configuration when it is not specified in the Connector Properties for either the sink or the source?
Sinks and sources are not workers. Look at connect-distributed.properties
I would like to use bootstrap server list from the Kafka Connect worker properties if that is possible
It's not possible. Adding extra properties to the sink/source configs are the only way. (Feel free to make a Kafka JIRA requesting such a feature of exposing the worker configs, though)

Kafka design questions - Kafka Connect vs. own consumer/producer

I need to understand when to use Kafka connect vs. own consumer/producer written by developer. We are getting Confluent Platform. Also to achieve fault tolerant design do we have to run the consumer/producer code ( jar file) from all the brokers ?
Kafka connect is typically used to connect external sources to Kafka i.e. to produce/consume to/from external sources from/to Kafka.
Anything that you can do with connector can be done through
Producer+Consumer
Readily available Connectors only ease connecting external sources to Kafka without requiring the developer to write the low-level code.
Some points to remember..
If the source and sink are both the same Kafka cluster, Connector doesn't make sense
If you are doing changed-data-capture (CDC) from a database and push them to Kafka, you can use a Database source connector.
Resource constraints: Kafka connect is a separate process. So double check what you can trade-off between resources and ease of development.
If you are writing your own connector, it is well and good, unless someone has not already written it. If you are using third-party connectors, you need to check how well they are maintained and/or if support is available.
do we have to run the consumer/producer code ( jar file) from all the brokers ?
Don't run client code on the brokers. Let all memory and disk access be reserved for the broker process.
when to use Kafka connect vs. own consumer/produce
In my experience, these factors should be taken into consideration
You're planning on deploying and monitoring Kafka Connect anyway, and have the available resources to do so. Again, these don't run on the broker machines
You don't plan on changing the Connector code very often, because you must restart the whole connector JVM, which would be running other connectors that don't need restarted
You aren't able to integrate your own producer/consumer code into your existing applications or simply would rather have a simpler produce/consume loop
Having structured data not tied to the a particular binary format is preferred
Writing your own or using a community connector is well tested and configurable for your use cases
Connect has limited options for fault tolerance compared to the raw producer/consumer APIs, with the drawbacks of more code, depending on other libraries being used
Note: Confluent Platform is still the same Apache Kafka
Kafka Connect:
Kafka Connect is an open-source platform which basically contains two types: Sink and Source. The Kafka Connect is used to fetch/put data from/to a database to/from Kafka. The Kafka connect helps to use various other systems with Kafka. It also helps in tracking the changes (as mentioned in one of the answers Changed Data Capture (CDC) ) from DB's to Kafka. The system maintains the offset, in order to read/write data from that particular offset to Kafka or any other database.
For more details, you can refer to https://docs.confluent.io/current/connect/index.html
The Producer/Consumer:
The Producer and Consumer are just an end system, which use the Kafka to produce and consume topics to/from Kafka. They are used where we want to broadcast the data to various consumers in a consumer group. This kind of system also maintains the lag and offsets of data for the consumer groups.
No, you don't need to run any producer/consumer while running Kafka connect. In case you want to check there is no data loss you can run the consumer while running Source Connectors. In case, of Sink Connectors, the already produced data can be verified in your database, by running their particular select queries.