Kafka-MongoDB Debezium Connector : distributed mode - apache-kafka

I am working on debezium mongodb source connector. Can I run connector in local machine in distributed mode by giving kafka bootstrap server address as remote machine (deployed in Kubernetes) and remote MongoDB url?
I tried this and I see connector starts successfully, no errors, just few warnings but no data is flowing from mongodb.
Using below command to run connector
./bin/connect-distributed ./etc/schema-registry/connect-avro-distributed.properties ./etc/kafka/connect-mongodb-source.properties
If not how else can I achieve this, I donot want to install local kafka or mondoDB as most of the tutorial suggest. I want to use our test servers for this.
Followed below tutorial for this
: https://medium.com/tech-that-works/cloud-kafka-connector-for-mongodb-source-8b525b779772
Below are more details for the issue
Connector works fine, I see below lines at the end of connector log
INFO [Worker clientId=connect-1, groupId=connect-cluster] Starting connectors and tasks using config offset -1 (org.apache.kafka.connect.runtime.distributed.DistributedHerder:1000)
] INFO [Worker clientId=connect-1, groupId=connect-cluster] Finished starting connectors and tasks (org.apache.kafka.connect.runtime.distributed.DistributedHerder:1021)
I have also defined MongoDB config in /etc/kafka/connect-mongodb-source.properties as follows
name=mongodb-source-connector
connector.class=io.debezium.connector.mongodb.MongoDbConnector
mongodb.hosts=/remoteserveraddress:27017
mongodb.name=mongo_conn
initial.sync.max.threads=1
tasks.max=1
But Data is not flowing between MongoDB and Kafka. I have also posted saperate question for this Kafka-MongoDB Debezium Connector : distributed mode
Any pointers are appriciated

connect-distributed only accepts a single property file.
You must use the REST API to configure Kafka Connect in Distributed mode.
https://docs.confluent.io/current/connect/references/restapi.html
Note: by default, the consumer will read the latest data off the topic, not existing data.
You would add this to the connect-avro-distributed.properties to fix it
consumer.auto.offset.reset=earliest

Related

Debezium fails when using it with Kerberos

I'm trying to configure the Oracle Connector (debezium 1.9) with a Kerberized Kafka cluster (from Cloudera Private CDP) and have some weird troubles.
I first tried to configure Debezium with a PLAINTEXT security protocol (using an Apache Kafka 3.1.0) to validate everything was fine (Oracle, Connect config... ) and everything runs perfectly.
Next, I deployed the same connector, using the same Oracle DB instance on my On Premises Cloudera CDP platform, which is kerberized, and updating the connector config by adding :
"database.history.kafka.topic": "schema-changes.oraclecdc",
"database.history.consumer.sasl.jaas.config": "com.sun.security.auth.module.Krb5LoginModule required useKeyTab=true keyTab=\"/tmp/debezium.keytab\" principal=\"debezium#MYREALM\";",
"database.history.consumer.security.protocol": "SASL_PLAINTEXT",
"database.history.consumer.sasl.kerberos.service.name": "kafka",
"database.history.producer.sasl.jaas.config": "com.sun.security.auth.module.Krb5LoginModule required useKeyTab=true keyTab=\"/tmp/debezium.keytab\" principal=\"debezium#MYREALM\";",
"database.history.producer.security.protocol": "SASL_PLAINTEXT",
"database.history.producer.sasl.kerberos.service.name": "kafka"
In this case, the topic schema-changes.oraclecdc is automatically created when the connector starts (auto creation enabled) and the DDL definitions are correctly reported. But that's it. So I suppose the JAAS config is OK and the producer config is correctly set as the connector has been able to create the topic and publish something in it.
But I can't get my updates/inserts/deletes being published. And the corresponding topics are not created. Instead kafka connect reports me the producer is disconnected, as soon as the connector starts.
Activating the TRACE level into kafka-connect, I can check that the updates/inserts/... are correctly detected by debezium from the redo log.
The fact the producer is being disconnected makes me think there's a problem of authentication. But if I understand the debezium documentation, the producer config is the same for either schema topic and tables cdc topics. So I can't understand why the "schema changes topic" is created with messages published, but the "CDC mechanism" doesn't create topics...
What am I missing here?

Using Amazon MSK and Debezium SQL Server Connector. Error while fetching metadata with correlation id 7 : {TestKafkaDB= UNKNOWN_TOPIC_OR_PARTITION}

I was trying to connect my RDS MS SQL server with Debezium SQL Server Connector to stream changes to Kafka Cluster on Amazon MSK.
I configured connector and Kafka Connect worker well run the Connect by
bin/connect-standalone.sh ../worker.properties connect/dbzmmssql.properties
Got WARN [Producer clientId=producer-1] Error while fetching metadata with correlation id 10 : {TestKafkaDB=UNKNOWN_TOPIC_OR_PARTITION} (org.apache.kafka.clients.NetworkClient:1031)
I've solved this problem and just want to share my possible solution with other fresher with Kafka.
TestKafkaDB=UNKNOWN_TOPIC_OR_PARTITION basically means the connector didn't find the a usable topic in Kafka broker. The reason I am facing this is the Kafka broker didn't automatically create a new topic for the stream.
To solving this, I changed Cluster Configuration in AWS MSK console, change auto.create.topics.enable from default false to true and update this configuration to the Cluster, then my problem solved.

How to make a Data Pipeline from MQTT to KAFKA Broker to MongoDB?

How can I make a data pipeline, I am sending data from MQTT to KAFKA topic using Source Connector. and on the other side, I have also connected Kafka Broker to MongoDB using Sink Connector. I am having trouble making a data pipeline that goes from MQTT to KAFKA and then MongoDB. Both connectors are working properly individually. How can I integrate them?
here is my MQTT Connector
MQTT Connector
Node 1 MQTT Connector
Message Published from MQTT
Kafka Consumer
Node 2 MongoDB Connector
MongoDB
that is my MongoDB Connector
MongoDB Connector
It is hard to tell what exactly the problem is without more logs, please provide your connect.config as well, please check /status of your connector, I still did not understand exactly what the issue you are facing, you are saying that , MQTT SOURCE CONNECTOR sending messages successfully to KAFKA TOPIC and your MONGO DB SINK CONNECTOR successfully reading this KAFKA TOPIC and write to your mobgodb, hence your pipeline, Where is the error? Is your KAFKA is the same KAFKA? Or separated different KAFKA CLUSTERS? Seems like both localhost, but is it the same machine?
Please elaborate and explain what are you expecting? What does "pipeline" means in your word?
You need both connectors to share same kafka cluster, what does node1 and node2 mean is it seperate kafka instance? Your connector need to connect to the same kafka "node" / cluster in order to share the data inside the kafka topic one for input and one for output, share your bootstrap service parameters, share your server.properties as well of the kafka
In order to run two different connect clusters inside same kafka , you need to set in different internal topics for each connect cluster
config.storage.topic
offset.storage.topic
status.storage.topic

Is Kafka topic linked with zookeeper and If zookeeper changed will topic disappeare

I was working with Kafka. I downloaded the zookeeper, extracted and started it.
Then I downloaded Kafka, extracted the zipped file and started Kafka. Everything was working good. I created few topics and I was able to send and receive messages. After that I stopped Kafka and Zookeeper. Then I read that Kafka itself provides Zookeeper. So I started Zookeeper that was provided with Kafka. However the data directory for it was different, and then I started Kafka from same configuration file and same data directory location. However after starting Kafka I could not find the topics that I had created.
I just want to know that, does this mean the meta data about the topics is maintained by Zookeeper. I searched Kafka documentation, however, I could not find anything in detail.
https://kafka.apache.org/documentation/
Check this documentation provided by confluent. According to this Apache Kafka® uses ZooKeeper to store persistent cluster metadata and is a critical component of the Confluent Platform deployment. For example, if you lost the Kafka data in ZooKeeper, the mapping of replicas to Brokers and topic configurations would be lost as well, making your Kafka cluster no longer functional and potentially resulting in total data loss.
So, the answer to your question is, yes, the purpose of zookeeper is to store relevant metadata about the kafka brokers, topics, etc,.
Also, since you have just started working on Kafka and Zookeeper, I would like to mention this. By default, Kafka stored it's data in a temp location which get's deleted on system reboot, so you should change that as well.
the answer to your question tag is yes,
1)Initially you started standalone zookeeper from zip file and you stopped the zookeeper, which means the topics that are created are stored in the zookeeper standalone are lost.Now you persistent cluster metadata related to Kafka is lost .
2)second time you started the zookeeper from the package that comes along with Kafka, now the new zookeeper instance does not have any topics information that you created previously, so you need to create newly .
3) suppose in case 1: if you close the terminal and start again the zookeeper from standalone , you no need to create the Topic again ,but if you stopped the zookeeper server from standalone then topics are lost.
in simple : you created two separate zookeeper instances, where topics will not be shared between them .

Kafka and Kafka Connect deployment environment

if I already have Kafka running on premises, is Kafka Connect just a configuration on top of my existing Kafka, or does Kafka Connect require it's own Server/Environment separate from that of my existing Kafka?
Kafka Connect is part of Apache Kafka, but it runs as a separate process, called a Kafka Connect Worker. Except in a sandbox environment, you would usually deploy it on a separate machine/node from your Kafka brokers.
This diagram shows conceptually how it runs, separate from your brokers:
You can run Kafka Connect on a single node, or as part of a cluster (for throughput and redundancy).
You can read more here about installation and configuration and architecture of Kafka Connect.
Kafka Connect is its own configuration on top of your bootstrap-server's configuration.
For Kafka Connect you can choose between a standalone server or distributed connect servers and you'll have to update the corresponding properties file to point to your currently running Kafka server(s).
Look under {kafka-root}/config and you'll see
You'll basically update connect-standalone or connect-distributed properties based on your need.