How to transfer data from a source Kafka cluster to a target Kafka cluster using Kafka Connect?

As described in the title, I have a use case where I want to copy data from a source Kafka topic (Cloudera Kafka cluster) to a destination Kafka topic (AWS MSK Kafka cluster) using the Kafka Connect framework. I have already gone through some of the available options, e.g. the kafkacat utility and MirrorMaker 2, but I am curious whether any such connector is available in open source.
Links followed:
Kafkacat: https://rmoff.net/2019/09/29/copying-data-between-kafka-clusters-with-kafkacat/
MirrorMaker: https://cwiki.apache.org/confluence/display/KAFKA/KIP-382%3A+MirrorMaker+2.0

MirrorMaker 2 is open source, and it does exactly what you're asking:
https://github.com/apache/kafka/tree/trunk/connect/mirror
kcat is also open source, but it doesn't really scale and doesn't use the Connect framework.
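For reference, here's a minimal sketch of a dedicated MirrorMaker 2 run using the driver script that ships with Apache Kafka (the cluster aliases, hostnames, and topic pattern below are illustrative):

    # mm2.properties -- aliases and bootstrap servers are placeholders
    clusters = source, target
    source.bootstrap.servers = cloudera-broker:9092
    target.bootstrap.servers = msk-broker:9092

    # replicate from source to target; mirror all topics
    source->target.enabled = true
    source->target.topics = .*

Then start it with:

    bin/connect-mirror-maker.sh mm2.properties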

Related

How to make a Data Pipeline from MQTT to KAFKA Broker to MongoDB?

How can I make a data pipeline? I am sending data from MQTT to a Kafka topic using a source connector, and on the other side I have also connected the Kafka broker to MongoDB using a sink connector. I am having trouble making a data pipeline that goes from MQTT to Kafka and then to MongoDB. Both connectors are working properly individually. How can I integrate them?
Here is my MQTT connector and here is my MongoDB connector (screenshots: MQTT source connector config on node 1, message published from MQTT, Kafka consumer output, MongoDB sink connector config on node 2, MongoDB collection).
It is hard to tell what exactly the problem is without more logs. Please provide your Connect config as well, and check the /status of your connector. I still don't understand exactly what issue you are facing: you are saying that the MQTT source connector sends messages successfully to the Kafka topic, and your MongoDB sink connector successfully reads this Kafka topic and writes to your MongoDB, hence your pipeline. Where is the error? Is your Kafka the same Kafka, or separate Kafka clusters? Both seem to be localhost, but is it the same machine?
Please elaborate and explain what you are expecting. What does "pipeline" mean in your words?
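As a side note, the connector status mentioned above can be fetched from the Connect REST API, e.g. (assuming the default worker port 8083 and a hypothetical connector name):

    curl -s http://localhost:8083/connectors/mqtt-source/status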
You need both connectors to share the same Kafka cluster. What do node 1 and node 2 mean: are they separate Kafka instances? Your connectors need to connect to the same Kafka node/cluster in order to share the data inside the Kafka topic, one for input and one for output. Share your bootstrap server parameters, and share the server.properties of the Kafka as well.
In order to run two different Connect clusters against the same Kafka, you need to set different internal topics for each Connect cluster:
config.storage.topic
offset.storage.topic
status.storage.topic
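A minimal sketch of what that looks like in the worker properties, with illustrative topic names (in distributed mode the group.id must also differ between the two Connect clusters):

    # worker properties for Connect cluster A
    group.id=connect-cluster-a
    config.storage.topic=connect-a-configs
    offset.storage.topic=connect-a-offsets
    status.storage.topic=connect-a-status

    # worker properties for Connect cluster B
    group.id=connect-cluster-b
    config.storage.topic=connect-b-configs
    offset.storage.topic=connect-b-offsets
    status.storage.topic=connect-b-status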

Can Kafka Connect consume data from a separate kerberized Kafka instance and then route to Splunk?

My pipeline is:
Kerberized Kafka --> Logstash (hosted on a different server) --> Splunk.
Can I replace the Logstash component with Kafka Connect?
Could you point me to a resource/guide where I can use kerberized Kafka as a source for my Kafka connect (which is hosted separately)?
From the documentation, what I understood is that if Kafka Connect is hosted on the same cluster as that of Kafka, that's quite possible. But I don't have that option right now, as our Kafka cluster is multi-tenant and hence not approved for additional processes on the cluster.
Kerberos keytabs aren't commonly machine/JVM-specific, so yes, Kafka Connect should be configurable very similarly to Logstash, since both are JVM processes using the native Kafka protocol.
You shouldn't run Connect on the brokers anyway.
If you can't add Kafka Connect to an existing Kafka cluster, you will have to spin up a separate Kafka Connect deployment (distributed cluster or standalone).
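As a rough sketch, that separate Connect worker would carry the standard Kerberos client settings in its worker properties (the keytab path and principal below are placeholders; connector tasks also pick up the consumer./producer.-prefixed variants):

    # connect worker properties -- values are illustrative
    security.protocol=SASL_PLAINTEXT
    sasl.mechanism=GSSAPI
    sasl.kerberos.service.name=kafka
    sasl.jaas.config=com.sun.security.auth.module.Krb5LoginModule required \
        useKeyTab=true \
        keyTab="/etc/security/keytabs/connect.keytab" \
        principal="connect@EXAMPLE.COM";

    # repeat with prefixes for the connector tasks, e.g.
    consumer.security.protocol=SASL_PLAINTEXT
    consumer.sasl.mechanism=GSSAPI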

Does Mirrormaker 2 need a third kafka for mirroring operation?

I have a question about using MirrorMaker 2.
MirrorMaker 2 is based on the Kafka Connect framework and can be viewed at its core as a combination of Kafka source and sink connectors. So in the MM2 architecture there are source and sink connectors. But is there any extra Kafka cluster for the connectors in MM2? I ask because, in the Kafka Connect design, source and sink connectors need a Kafka cluster to move data.
For example, MM2 needs source and target clusters; my question is whether MM2 needs a third Kafka for the mirroring operation, beyond the source and target clusters.
My other question is whether MM2 connectors can be run in distributed mode. I didn't see any configuration about this.
For example, in a Docker environment, is the configuration below enough to run MM2 in distributed mode?
mirrormaker:
  image: 'wpietri/mirror-maker:2'
  environment:
    - SOURCE=source_ip:9092
    - DESTINATION=dest_ip:9092
    - TOPICS=test-topic
  deploy:
    replicas: 3
    mode: replicated
Currently, MirrorMaker 2 is a set of source connectors.
A source connector grabs records from an external system and hands them to the Kafka Connect runtime, which writes them into Kafka.
For MirrorMaker 2, the "external system" is another Kafka cluster. So to work, MirrorMaker 2 only needs two Kafka clusters: one where the connectors get records (called the source cluster) and one that Kafka Connect is connected to (called the target cluster).
MirrorMaker 2 connectors are standard Kafka Connect connectors. They can be used directly with Kafka Connect in standalone or distributed mode.
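For example, a MirrorSourceConnector could be submitted to a distributed Connect cluster through its REST API roughly like this (the connector name, hosts, and topic pattern are placeholders):

    curl -X POST -H "Content-Type: application/json" http://localhost:8083/connectors -d '{
      "name": "mm2-source",
      "config": {
        "connector.class": "org.apache.kafka.connect.mirror.MirrorSourceConnector",
        "source.cluster.alias": "source",
        "target.cluster.alias": "target",
        "source.cluster.bootstrap.servers": "source_ip:9092",
        "target.cluster.bootstrap.servers": "dest_ip:9092",
        "topics": "test-topic"
      }
    }'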

How can I point my Confluent installation at a different Kafka?

I installed Confluent and it has its own Kafka.
I want to switch from its own Kafka to another one.
Which .properties (or other) file must I change so that it looks at a different Kafka?
Thanks in advance.
In your Kafka Connect worker configuration, you need to set bootstrap.servers to point to the broker(s) on your source Kafka cluster.
You can only connect to one source Kafka cluster per Kafka Connect worker. If you need to stream data from multiple Kafka clusters, you would run multiple Kafka Connect workers.
Edit: If you're using the Confluent CLI, then the Kafka Connect worker config is taken from etc/schema-registry/connect-avro-distributed.properties.
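A minimal sketch of the relevant line in that worker config (the broker hostnames are placeholders):

    # etc/schema-registry/connect-avro-distributed.properties
    bootstrap.servers=source-broker-1:9092,source-broker-2:9092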

Kafka and Kafka Connect deployment environment

If I already have Kafka running on premises, is Kafka Connect just a configuration on top of my existing Kafka, or does Kafka Connect require its own server/environment separate from that of my existing Kafka?
Kafka Connect is part of Apache Kafka, but it runs as a separate process, called a Kafka Connect Worker. Except in a sandbox environment, you would usually deploy it on a separate machine/node from your Kafka brokers.
Conceptually, it runs separately from your brokers (the original answer includes a diagram illustrating this).
You can run Kafka Connect on a single node, or as part of a cluster (for throughput and redundancy).
You can read more about the installation, configuration, and architecture of Kafka Connect in the Kafka Connect documentation.
Kafka Connect is its own configuration on top of your bootstrap-server's configuration.
For Kafka Connect you can choose between a standalone server or distributed connect servers and you'll have to update the corresponding properties file to point to your currently running Kafka server(s).
Look under {kafka-root}/config and you'll see connect-standalone.properties and connect-distributed.properties.
You'll basically update connect-standalone.properties or connect-distributed.properties based on your need.
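As a sketch, after pointing bootstrap.servers in the chosen file at your existing brokers, you start a worker with the scripts that ship with Kafka (the connector properties file below is hypothetical):

    # in config/connect-*.properties: bootstrap.servers=broker1:9092,broker2:9092

    # standalone mode
    bin/connect-standalone.sh config/connect-standalone.properties config/my-connector.properties

    # distributed mode (connectors are then submitted via the REST API)
    bin/connect-distributed.sh config/connect-distributed.properties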