How to run Kafka-connect-replicator in distributed mode? - apache-kafka

I want to replicate data from one system to another using Confluent's Replicator. I am using two Ubuntu 18.04 systems, where one acts as the source and the other as the destination.
I tried to run kafka-connect-replicator in distributed mode with the following configuration changes.
In confluent/etc/kafka/server.properties I made the following changes:
SOURCE
> advertised.listeners=PLAINTEXT://source.ip:9092
DESTINATION
> advertised.listeners=PLAINTEXT://destination.ip:9092
In confluent/etc/kafka-connect-replicator/replicator.connect.distributed.properties I made the following changes:
- group.id=connect-replicator
group.id is the same on the source and destination systems.
SOURCE
- bootstrap.servers=destination.ip:9092, source.ip:9092
DESTINATION
- bootstrap.servers=destination.ip:9092, source.ip:9092
In confluent/etc/kafka-connect-replicator/quickstart-replicator.properties I changed the following configurations
SOURCE
name=replicator-source
connector.class=io.confluent.connect.replicator.ReplicatorSourceConnector
# source cluster connection info
src.kafka.bootstrap.servers=source.ip:9092
# Set to use direct connection to Zookeeper by Replicator on the source
src.zookeeper.connect=localhost:2181
# destination cluster connection info
dest.kafka.bootstrap.servers=destination.ip:9092
# Set to use direct connection to Zookeeper by Replicator on the destination
dest.zookeeper.connect=destination.ip:2181
# configure topics to replicate
topic.whitelist= test-topic
topic.rename.format=${topic}.replica
DESTINATION
name=replicator-source
connector.class=io.confluent.connect.replicator.ReplicatorSourceConnector
# source cluster connection info
src.kafka.bootstrap.servers=source.ip:9092
# Set to use direct connection to Zookeeper by Replicator on the source
src.zookeeper.connect=source.ip:2181
# destination cluster connection info
dest.kafka.bootstrap.servers=destination.ip:9092
# Set to use direct connection to Zookeeper by Replicator on the destination
dest.zookeeper.connect=destination.ip:2181
# configure topics to replicate
topic.whitelist= test-topic
topic.rename.format=${topic}.replica
I then created the topic on the source system and launched the connector using the command below:
PATH_TO_CONFLUENT> sudo ./bin/connect-distributed ./etc/kafka-connect-replicator/replicator-connect-distributed.properties ./etc/kafka-connect-replicator/quickstart-replicator.properties
After this I produced data to the topic on the source system and tried to consume it on the destination system under the topic name {topic}.replica, but there is no such topic to consume from.

It's not clear what errors you're having, but here are some notes.
connect-distributed takes only one properties file, not two. You send the connector properties to the Connect cluster as JSON in an HTTP POST request; they are not loaded from a properties file at cluster startup. The quickstart file is meant to be used with connect-standalone.
The JSON would look like
{"name": "your-replicator-name", "config": {
"src.kafka.bootstrap.servers": "...",
...
}}
./etc/kafka/connect-distributed.properties should be the starting point for running any Connect or Replicator cluster in distributed mode, although there may be similar configurations in replicator-connect-distributed.properties.
bootstrap.servers should only ever point to a single cluster. The source and destination clusters are specified separately through src.kafka.bootstrap.servers and dest.kafka.bootstrap.servers.
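As a rough illustration of the first note, the sketch below turns the text of a connector properties file into the JSON body shown above and posts it to a running Connect worker. The function names are hypothetical, and the worker URL http://localhost:8083 is an assumption; adjust it to wherever your worker's REST interface listens.

```python
import json
import urllib.request


def properties_to_payload(text):
    """Turn connector .properties text into a {"name", "config"} JSON body."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition("=")  # split on the first "=" only
        config[key.strip()] = value.strip()
    name = config.pop("name")  # the connector name goes at the top level
    return {"name": name, "config": config}


def create_connector(payload, connect_url="http://localhost:8083"):
    """POST the payload to the worker's /connectors endpoint (URL is assumed)."""
    request = urllib.request.Request(
        connect_url + "/connectors",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())
```

With the worker started from a single properties file, something like create_connector(properties_to_payload(open("quickstart-replicator.properties").read())) would then register the Replicator connector.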

Related

Create kafka topic using predefined config files

Is there any way to create a Kafka topic in the Kafka/ZooKeeper configuration files before I run the services, so that the topics are in place once they start?
I looked inside the script bin/kafka-topics.sh and found that, in the end, it executes a command against the live server. But since the server, its config files, and ZooKeeper with its configs are all here, is there a way to predefine topics in advance?
Unfortunately, I haven't found any existing config keys for this.
The servers need to be running in order to allocate metadata and log directories for the topics, so no.

For Kafka, what IP values need to be set in listeners & advertised.listeners?

I have created a multi-node Azure Databricks cluster inside a VNET, and a multi-node Kafka HDInsight cluster inside a different VNET. I have peered these two VNETs. After peering, machines in the two clusters are able to ping each other.
I am trying to write messages to a Kafka topic from the Databricks cluster using Spark Structured Streaming, and I am getting a socket timeout error.
Upon research, I found that in Kafka we need to set listeners & advertised.listeners in the server.properties file.
In my scenario, what values should I use for listeners & advertised.listeners? It would be very helpful if anyone could suggest which changes I need to make in the server.properties file.
You need to create a listener for the host/IP on which your client machine (where Spark is running) can connect to your broker.
See https://rmoff.net/2018/08/02/kafka-listeners-explained/

How to run two instances of schema registry

I am trying to run Kafka in cluster mode using two instances of schema registry but I am not quite sure how to configure the second instance so that it takes over in case the first one is down.
Here's the properties file for the first schema-registry instance:
port=8081
# The address the socket server listens on.
# FORMAT:
# listeners = listener_name://host_name:port
# EXAMPLE:
# listeners = PLAINTEXT://your.host.name:9092
listeners=http://0.0.0.0:8081
# Zookeeper connection string for the Zookeeper cluster used by your Kafka cluster
# (see zookeeper docs for details).
# This is a comma separated host:port pairs, each corresponding to a zk
# server. e.g. "127.0.0.1:3000,127.0.0.1:3001,127.0.0.1:3002".
kafkastore.connection.url=localhost:2181,localhost:2182,localhost:2183
# Alternatively, Schema Registry can now operate without Zookeeper, handling all coordination via
# Kafka brokers. Use this setting to specify the bootstrap servers for your Kafka cluster and it
# will be used both for selecting the master schema registry instance and for storing the data for
# registered schemas.
# (Note that you cannot mix the two modes; use this mode only on new deployments or by shutting down
# all instances, switching to the new configuration, and then starting the schema registry
# instances again.)
#kafkastore.bootstrap.servers=localhost:9092
# The name of the topic to store schemas in
kafkastore.topic=_schemas
# If true, API requests that fail will include extra debugging information, including stack traces
debug=false
What should the second file look like so that both instances can communicate with ZooKeeper and achieve high availability?
You can use either the Kafka leader election or Zookeeper leader election.
The only things you need to change between two instances on the same machine connected to the same Kafka/ZooKeeper are the port and listeners properties.
To configure high availability properly, you need an HTTP load balancer that exposes a single address for all instances.
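One way to produce the second file is to copy the first and override only those two keys. The sketch below does that on the properties text; the function name is hypothetical and port 8082 is just an assumed free port, not something the answer prescribes.

```python
def override_properties(text, overrides):
    """Return properties text with the given keys replaced; other lines kept."""
    remaining = dict(overrides)
    out = []
    for line in text.splitlines():
        key = line.split("=", 1)[0].strip()
        is_setting = "=" in line and not line.lstrip().startswith("#")
        if is_setting and key in remaining:
            out.append(f"{key}={remaining.pop(key)}")
        else:
            out.append(line)  # comments and untouched settings pass through
    # Append any override that was not present in the original file.
    out.extend(f"{k}={v}" for k, v in remaining.items())
    return "\n".join(out) + "\n"


# Assumed values for the second instance on the same machine.
SECOND_INSTANCE = {"port": "8082", "listeners": "http://0.0.0.0:8082"}
```

Calling override_properties(first_file_text, SECOND_INSTANCE) leaves kafkastore.connection.url and kafkastore.topic unchanged, so both instances share the same ZooKeeper ensemble and schemas topic.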

Start multiple brokers in kafka

I am a beginner with Kafka and the Confluent package. I want to start multiple brokers and consume a topic from them.
It can be done via this setting -
{'bootstrap.server' : 'ip:your_host,...',}
This setting can be defined in the server config file or else in the script as well.
But how should I run them? If I just add multiple endpoints to the bootstrap servers, it gives this error:
java.lang.IllegalArgumentException: requirement failed: Each listener must have a different name, listeners: PLAINTEXT://:9092, PLAINTEXT://:9093
cp config/server.properties config/server-1.properties
cp config/server.properties config/server-2.properties
config/server-1.properties:
broker.id=1
listeners=PLAINTEXT://:9093
log.dirs=/tmp/kafka-logs-1
config/server-2.properties:
broker.id=2
listeners=PLAINTEXT://:9094
log.dirs=/tmp/kafka-logs-2
Reference: kafka_quickstart_multibroker
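The error quoted in the question appears when a single listeners value contains two entries with the same listener name, here PLAINTEXT twice, which is why each broker above gets its own file with one listener. The sketch below mimics that check on a listeners string; it is not Kafka's own code, and the function name is hypothetical.

```python
from collections import Counter


def duplicate_listener_names(listeners):
    """Return listener names that appear more than once in a listeners value."""
    names = [
        entry.split("://", 1)[0].strip()
        for entry in listeners.split(",")
        if entry.strip()
    ]
    return sorted(name for name, count in Counter(names).items() if count > 1)
```

For the value from the error message, duplicate_listener_names("PLAINTEXT://:9092, PLAINTEXT://:9093") reports PLAINTEXT, while each per-broker value such as "PLAINTEXT://:9093" yields an empty list.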
Done.
I had actually specified the same port for the producer and the consumer, and that was the issue.
Set up the brokers on different ports and it works fine, even if one broker goes down.

Error running multiple kafka standalone hdfs connectors

We are trying to launch multiple standalone kafka hdfs connectors on a given node.
For each connector, we are setting the rest.port and offset.storage.file.filename to different ports and path respectively.
Also, the Kafka broker's JMX port is 9999.
When I start the kafka standalone connector, I get the error
Error: Exception thrown by the agent : java.rmi.server.ExportException: Port already in use: 9999; nested exception is:
java.net.BindException: Address already in use (Bind failed)
Though the rest.port is set to 9100
kafka version: 2.12-0.10.2.1
kafka-connect-hdfs version: 3.2.1
Please help.
> We are trying to launch multiple standalone kafka hdfs connectors on a given node.
Have you considered running these multiple connectors within a single instance of Kafka Connect? This might make things easier.
Kafka Connect itself can handle running multiple connectors within a single worker process. Kafka Connect in distributed mode can run on a single node, or across multiple ones.
For those who are trying to use the rest.port flag and are still getting the Address already in use error: that flag was marked as deprecated in KIP-208 and finally removed in PR.
Since then, listeners can be used to change the default REST port.
Examples from Javadoc
listeners=HTTP://myhost:8083
listeners=HTTP://:8083
Reference: Configuring and Running Workers - Standalone mode
You may have open Kafka Connect connections that you don't know about. You can check this with:
ps -ef | grep connect
If you find any, kill those processes.
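Before starting another worker, it can also help to check whether the port named in the error (9999 in the question) is already bound. The sketch below tries to bind the port and reports whether that succeeds; the function name is hypothetical, and the result is only a snapshot, since another process could take the port right afterwards.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if binding host:port succeeds, False if it is already in use."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
            return True
        except OSError:
            # Typically "Address already in use": some process holds the port.
            return False
```

If port_is_free(9999) returns False, ps -ef shows which process is still running and holding the port.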