Confluent Kafka Connect distributed mode JDBC connector

We have successfully ingested data from MySQL into Kafka using the JDBC standalone connector, but we are now facing an issue using the same connector in distributed mode (as the Kafka Connect service).
The command used for the standalone connector, which works fine:
/usr/bin/connect-standalone /etc/kafka/connect-standalone.properties /etc/kafka-connect-jdbc/source-quickstart-mysql.properties
We have now stopped that and started the Kafka Connect service in distributed mode like this:
systemctl status confluent-kafka-connect
● confluent-kafka-connect.service - Apache Kafka Connect - distributed
Loaded: loaded (/usr/lib/systemd/system/confluent-kafka-connect.service; disabled; vendor preset: disabled)
Active: active (running) since Wed 2018-11-14 22:52:49 CET; 41min ago
Docs: http://docs.confluent.io/
Main PID: 130178 (java)
CGroup: /system.slice/confluent-kafka-connect.service
└─130178 java -Xms256M -Xmx2G -server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+ExplicitGCInvokesConcurrent -Djava.a...
Two nodes are currently running the Connect service with the same connect-distributed.properties file:
bootstrap.servers=node1IP:9092,node2IP:9092
group.id=connect-cluster
key.converter.schemas.enable=true
value.converter.schemas.enable=true
offset.storage.topic=connect-offsets
offset.storage.replication.factor=1
config.storage.topic=connect-configs
config.storage.replication.factor=1
status.storage.topic=connect-status
status.storage.replication.factor=1
offset.flush.interval.ms=10000
plugin.path=/usr/share/java
The Connect service is up and running, but it doesn't load the connectors defined under /etc/kafka/connect-standalone.properties.
What should be done so that whenever you run systemctl start confluent-kafka-connect, the service starts and loads the connectors defined under /etc/kafka-connect-*/, just as when you run a standalone connector manually and provide paths to the properties files?

it runs the service and starts the defined connectors under /etc/kafka-connect-*/
That's not how distributed mode works... It doesn't know which property files you want to load, and it doesn't scan those folders. [1]
In standalone mode the N+1 property files that you give are loaded immediately, yes, but for connect-distributed you must use HTTP POST calls to the Connect REST API.
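For example, the contents of a standalone JDBC source properties file can be wrapped into a JSON payload and posted to one of the workers. A rough sketch, assuming the Confluent JDBC source connector; the connector name, connection URL, and column/topic settings are placeholders you would adapt:
curl -X POST -H "Content-Type: application/json" http://localhost:8083/connectors -d '{
  "name": "mysql-source-quickstart",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:mysql://dbhost:3306/mydb?user=myuser&password=mypass",
    "mode": "incrementing",
    "incrementing.column.name": "id",
    "topic.prefix": "mysql-"
  }
}'
The worker that receives the request stores the config in the config.storage.topic, so every worker in the group sees it.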
Confluent Control Center or Landoop's Connect UI can provide a nice management web portal for these operations.
By the way, if you have more than one broker, I'd suggest increasing the replication factors on those Connect topics in the connect-distributed.properties file.
[1] It might be a nice feature if it did, but then you'd have to ensure connectors are never deleted/stopped in distributed mode, and you'd just end up in an inconsistent state between what's running and the files on the filesystem.

I can describe what I did to start the JDBC connector in distributed mode:
On my local machine I am using the Confluent CLI utility to boot up the services faster.
./confluent start
Afterwards I stopped Kafka Connect:
./confluent stop connect
and then I proceeded to manually start the customized connect-distributed workers on two different ports (18083 and 28083):
➜ bin ./connect-distributed ../etc/kafka/connect-distributed-worker1.properties
➜ bin ./connect-distributed ../etc/kafka/connect-distributed-worker2.properties
NOTE: Set the plugin.path setting to the full (not relative) path (e.g. plugin.path=/full/path/to/confluent-5.0.0/share/java).
Then I can easily add a new connector:
curl -s -X POST -H "Content-Type: application/json" --data @/full/path/to/confluent-5.0.0/etc/kafka-connect-jdbc/source-quickstart-sqlite.json http://localhost:18083/connectors
This should do the trick.
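To verify the connector was created and is running, the same REST API can be queried (port 18083 as above; replace <connector-name> with the name defined in your JSON file):
curl -s http://localhost:18083/connectors
curl -s http://localhost:18083/connectors/<connector-name>/status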
As already pointed out by cricket_007, consider a replication factor of at least 3 for the Connect internal topics (assuming you have at least three brokers) in case you're dealing with data that you don't want to lose if one of the brokers has an outage.

A connector in distributed mode cannot be deployed via a properties file as in standalone mode. Use the REST API instead; please refer to https://docs.confluent.io/current/connect/managing/configuring.html#connect-managing-distributed-mode
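As a sketch of what that looks like in practice (worker address, connector name, and config values are placeholders), PUT /connectors/{name}/config creates the connector if it doesn't exist and updates it if it does:
curl -X PUT -H "Content-Type: application/json" http://localhost:8083/connectors/my-jdbc-source/config -d '{
  "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
  "connection.url": "jdbc:mysql://dbhost:3306/mydb",
  "mode": "incrementing",
  "incrementing.column.name": "id",
  "topic.prefix": "mysql-"
}'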

Related

No active Drillbit endpoint found from ZooKeeper

I am currently working on a simple project to query messages from an Apache Kafka topic using Apache Drill. I am encountering an error when running the Apache Drill cluster with this command:
sqlline.bat -u "jdbc:drill:zk=localhost:2181"
And the error that I encountered is:
No active Drillbit endpoint found from ZooKeeper. Check connection parameters
I am using the single ZooKeeper instance that came with Apache Kafka.
Can anyone help me with this problem? Is it OK to use the ZooKeeper from the Apache Kafka installation with Drill?
The sqlline.bat -u "jdbc:drill:zk=localhost:2181" command only connects to a running Drillbit. If you have Drill running in distributed mode, replace localhost with the correct IP address of the node where ZooKeeper is running, and update the port if needed.
If you want to start Drill in embedded mode, you may try running drill-embedded.bat or sqlline.bat -u "jdbc:drill:zk=local" command.
For more details please refer to https://drill.apache.org/docs/starting-drill-on-windows/.

Confluent Replicator End to End Latency Metrics Required

Problem Statement: I am trying to figure out the end-to-end latency of Replicator for data replication from on-prem to AWS. We found that the consumer group for Replicator has the option to display end-to-end latency, but it is not showing any data in Control Center. I tried a few things, as explained below, but it's not working.
What we need to know is:
Which Metrics to be configured for this?
How can we configure these metrics?
Am I exploring the right thing as mentioned below?
Does anyone have any experience with this, or has anyone faced similar issues?
We verified that the consumer group shows the overall messages lagging behind in the destination cluster.
It is not showing any data in the end-to-end latency view, and I tried to figure out why.
In the Replicator connector I added a property to enable the interceptor (shown in the JSON below).
What Confluent says:
"To monitor production and consumption in Control Center, install the Confluent Monitoring Interceptors with your Apache Kafka® applications and configure your applications to use the interceptors on the Kafka messages produced and consumed, which are then sent to Control Center."
We configured this in Replicator as mentioned below:
{
  "name": "replicator",
  "config": {
    ....
    "src.consumer.interceptor.classes": "io.confluent.monitoring.clients.interceptor.MonitoringConsumerInterceptor",
    ....
  }
}
Then I checked the broker for the Confluent Metrics Reporter, which is already configured.
Thanks :)
The following is the configuration that worked for me.
FYI: Control Center and Replicator (the Connect worker) are running on my source cluster.
consumer.properties
zookeeper.connect=src-node1:2181,src-node2:2181,src-node3:2181
bootstrap.servers=src-node1:9092,src-node2:9092,src-node3:9092
interceptor.classes=io.confluent.monitoring.clients.interceptor.MonitoringConsumerInterceptor
producer.properties
zookeeper.connect=dst-node1:2181,dst-node2:2181,dst-node3:2181
bootstrap.servers=dst-node1:9092,dst-node2:9092,dst-node3:9092
interceptor.classes=io.confluent.monitoring.clients.interceptor.MonitoringProducerInterceptor
replicator.properties
#Replication configuration
name=replicator-onprem-to-aws
topic.rename.format=${topic}.replica
replication.factor=1
config.storage.replication.factor=1
offset.storage.replication.factor=1
status.storage.replication.factor=1
confluent.topic.replication.factor=1
If you look at the above configuration, it is the bare minimum needed to run Replicator; I am not using anything fancy here.
Then you can run the following command:
[root@src-node1 ~]$ /app/confluent-5.3.1/bin/replicator --cluster.id 1 --consumer.config config/consumer.properties --producer.config config/producer.properties --replication.config config/replicator.properties --whitelist 'test-topic' > replicator.log 2>&1 &
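To sanity-check that replication is flowing, you can consume from the renamed topic on the destination cluster; this sketch assumes the topic.rename.format=${topic}.replica setting above and a destination broker reachable at dst-node1:9092:
/app/confluent-5.3.1/bin/kafka-console-consumer --bootstrap-server dst-node1:9092 --topic test-topic.replica --from-beginning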

Kafka and Kafka Connect deployment environment

If I already have Kafka running on premises, is Kafka Connect just a configuration on top of my existing Kafka, or does Kafka Connect require its own server/environment separate from that of my existing Kafka?
Kafka Connect is part of Apache Kafka, but it runs as a separate process, called a Kafka Connect Worker. Except in a sandbox environment, you would usually deploy it on a separate machine/node from your Kafka brokers.
Conceptually, Kafka Connect runs as its own process, separate from your brokers.
You can run Kafka Connect on a single node, or as part of a cluster (for throughput and redundancy).
You can read more here about installation and configuration and architecture of Kafka Connect.
Kafka Connect has its own configuration on top of your bootstrap servers' configuration.
For Kafka Connect you can choose between a standalone worker or distributed Connect workers, and you'll have to update the corresponding properties file to point to your currently running Kafka server(s).
Look under {kafka-root}/config and you'll see connect-standalone.properties and connect-distributed.properties.
You'll basically update the connect-standalone or connect-distributed properties based on your need.
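As a rough sketch, the lines you would typically change in connect-distributed.properties to point a worker at an already running cluster look like this (broker host names, group id, and plugin path are placeholders):
bootstrap.servers=broker1:9092,broker2:9092,broker3:9092
group.id=connect-cluster
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
offset.storage.topic=connect-offsets
config.storage.topic=connect-configs
status.storage.topic=connect-status
plugin.path=/usr/share/java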

Kafka connect cluster setup or launching connect workers

I am going through Kafka Connect, and I am trying to understand the concepts.
Let us say I have a Kafka cluster (nodes k1, k2 and k3) set up and running, and now I want to run Kafka Connect workers on different nodes, say c1 and c2, in distributed mode.
A few questions:
1) To launch Kafka Connect in distributed mode I need to use the command ../bin/connect-distributed.sh, which is available on the Kafka cluster nodes. So do I need to launch Kafka Connect from one of the Kafka cluster nodes, or does any node from which I launch Kafka Connect need to have the Kafka binaries so that I can use ../bin/connect-distributed.sh?
2) Do I need to copy my connector plugins to the Kafka cluster node (or to all cluster nodes?) from which I do step 1?
3) How does Kafka copy these connector plugins to the worker node before starting the JVM process on the worker node? The plugin is what contains my task code, and it needs to be copied to the worker in order to start the process there.
4) Do I need to install anything on the Connect cluster nodes c1 and c2, like Java or anything Kafka Connect related?
5) In some places it says to use Confluent Platform, but I would like to start with Apache Kafka Connect alone first.
Can someone please throw some light on this? Even a pointer to some resources would help.
Thank you.
1) In order to have a highly available kafka-connect service you need to run at least two instances of connect-distributed.sh on two distinct machines that have the same group.id. You can find more details regarding the configuration of each worker here. For improved performance, Connect should be run independently of the broker and ZooKeeper machines.
2) Yes, you need to place all your connectors under plugin.path (normally under /usr/share/java/) on every machine on which you plan to run kafka-connect.
3) kafka-connect will load the connector plugins on startup; you don't need to handle this. Note that if your kafka-connect instance is already running and a new connector plugin is added, you need to restart the service (you can verify which plugins were picked up with the REST call shown after this list).
4) You need to have Java installed on all your machines. For Confluent Platform particularly:
Java 1.7 and 1.8 are supported in this version of Confluent Platform (Java 1.9 is currently not supported). You should run with the Garbage-First (G1) garbage collector. For more information, see the Supported Versions and Interoperability.
5) It depends. Confluent was founded by the original creators of Apache Kafka, and Confluent Platform comes as a more complete distribution, adding schema management, connectors and clients. It also comes with KSQL, which is quite useful if you need to act on certain events. Confluent Platform simply adds on top of the Apache Kafka distribution; it's not a modified version.
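To confirm that the jars you placed under plugin.path (points 2 and 3 above) were actually picked up by a worker, you can list the loaded plugins through the worker's REST API; this sketch assumes the default port 8083:
curl -s http://localhost:8083/connector-plugins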
The answer given by Giorgos is correct. I ran a few connectors and now I understand it better.
I am just trying to put it differently.
In Kafka Connect there are two things involved: one is the worker and the second is the connector. Below are details about running distributed Kafka Connect.
A Kafka Connect worker is a Java process on which the connector/connect task runs. So the first thing is to launch a worker. To run a worker we need Java installed on that machine, the Kafka Connect sh/bat files used to launch the worker, and the Kafka libraries the worker depends on; for this we simply copy/install Kafka on the worker machine. We also need to copy all the connector and connect-task related jars/dependencies into the "plugin.path" defined in the worker properties file (connect-distributed.properties). Now the worker machine is ready. To start the worker we invoke ./bin/connect-distributed.sh ./config/connect-distributed.properties, where connect-distributed.properties holds the worker configuration. The same thing has to be repeated on each machine where we want to run Kafka Connect.
Now the worker Java process is running on all machines. The worker config has a group.id property, and the workers that share the same value for this property form a group/cluster of workers.
Each worker process exposes a REST endpoint (default http://localhost:8083/connectors). To launch/start a connector on the running workers, we HTTP POST a connector config JSON; based on the given config, the worker will start the connector and distribute the configured number of tasks across the above group/cluster of workers.
Example Connect POST:
curl -X POST -H "Content-Type: application/json" --data '{"name": "local-file-sink", "config": {"connector.class":"FileStreamSinkConnector", "tasks.max":"3", "file":"test.sink.txt", "topics":"connect-test" }}' http://localhost:8083/connectors

Error running multiple kafka standalone hdfs connectors

We are trying to launch multiple standalone kafka hdfs connectors on a given node.
For each connector, we are setting rest.port and offset.storage.file.filename to a different port and path respectively.
Also, the Kafka broker JMX port is 9999.
When I start the kafka standalone connector, I get the error
Error: Exception thrown by the agent : java.rmi.server.ExportException: Port already in use: 9999; nested exception is:
java.net.BindException: Address already in use (Bind failed)
This happens even though rest.port is set to 9100.
kafka version: 2.12-0.10.2.1
kafka-connect-hdfs version: 3.2.1
Please help.
We are trying to launch multiple standalone kafka hdfs connectors on a given node.
Have you considered running these multiple connectors within a single instance of Kafka Connect? This might make things easier.
Kafka Connect itself can handle running multiple connectors within a single worker process. Kafka Connect in distributed mode can run on a single node, or across multiple ones.
For those who are trying to use the rest.port flag and are still getting the Address already in use error: that flag was marked as deprecated in KIP-208 and finally removed in a later PR.
From that point on, listeners can be used to change the default REST port.
Examples from the Javadoc:
listeners=HTTP://myhost:8083
listeners=HTTP://:8083
Configuring and Running Workers - Standalone mode
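Putting this together, here is a sketch of running two standalone workers side by side on one node; file names, ports, and paths are placeholders, and listeners only applies to Kafka versions that include KIP-208 (older versions such as 0.10.x still use rest.port):
worker-a.properties:
bootstrap.servers=localhost:9092
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
offset.storage.file.filename=/tmp/connect-worker-a.offsets
listeners=HTTP://:8084
worker-b.properties:
bootstrap.servers=localhost:9092
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
offset.storage.file.filename=/tmp/connect-worker-b.offsets
listeners=HTTP://:8085
Each worker is then started with its own connector properties file:
./bin/connect-standalone.sh worker-a.properties connector-a.properties
./bin/connect-standalone.sh worker-b.properties connector-b.properties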
You may have Kafka Connect processes already running that you don't know about. You can check for them with:
ps -ef | grep connect
If you find any, kill those processes.
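Also, since the error above is about the JMX port rather than the REST port, the workers may be inheriting the broker's JMX_PORT=9999 setting from the environment. A sketch of a workaround, assuming the Apache Kafka launcher scripts that read the JMX_PORT environment variable (the port values and property file names are arbitrary examples):
JMX_PORT=9981 ./bin/connect-standalone.sh worker-a.properties connector-a.properties
JMX_PORT=9982 ./bin/connect-standalone.sh worker-b.properties connector-b.properties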