How to run two instances of schema registry - apache-kafka

I am trying to run Kafka in cluster mode using two instances of Schema Registry, but I am not quite sure how to configure the second instance so that it takes over if the first one goes down.
Here's the properties file for the first schema-registry instance:
port=8081
# The address the socket server listens on.
# FORMAT:
# listeners = listener_name://host_name:port
# EXAMPLE:
# listeners = PLAINTEXT://your.host.name:9092
listeners=http://0.0.0.0:8081
# Zookeeper connection string for the Zookeeper cluster used by your Kafka cluster
# (see zookeeper docs for details).
# This is a comma separated host:port pairs, each corresponding to a zk
# server. e.g. "127.0.0.1:3000,127.0.0.1:3001,127.0.0.1:3002".
kafkastore.connection.url=localhost:2181,localhost:2182,localhost:2183
# Alternatively, Schema Registry can now operate without Zookeeper, handling all coordination via
# Kafka brokers. Use this setting to specify the bootstrap servers for your Kafka cluster and it
# will be used both for selecting the master schema registry instance and for storing the data for
# registered schemas.
# (Note that you cannot mix the two modes; use this mode only on new deployments or by shutting down
# all instances, switching to the new configuration, and then starting the schema registry
# instances again.)
#kafkastore.bootstrap.servers=localhost:9092
# The name of the topic to store schemas in
kafkastore.topic=_schemas
# If true, API requests that fail will include extra debugging information, including stack traces
debug=false
What should the second properties file look like so that both instances can communicate with ZooKeeper and achieve high availability?

You can use either Kafka leader election or ZooKeeper leader election.
The only thing you need to change between two instances on the same machine, connected to the same Kafka/ZooKeeper, is the port and listeners properties.
To properly configure high availability, you also need an HTTP load balancer to provide a single address for all instances.
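For example, the second instance's properties file could look like this (a sketch, assuming the first instance keeps port 8081 and both connect to the same ZooKeeper ensemble; only the port and listeners differ):
port=8082
listeners=http://0.0.0.0:8082
# Same ZooKeeper ensemble and schemas topic as the first instance
kafkastore.connection.url=localhost:2181,localhost:2182,localhost:2183
kafkastore.topic=_schemas
debug=false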

Related

What is the difference between Kafka Cluster and Kafka Broker?

Do Kafka cluster and Kafka broker mean the same thing?
I know a cluster has multiple brokers (is this wrong?).
But when I write code to produce messages, I find the following option awkward.
props.put("bootstrap.servers", "kafka001:9092, kafka002:9092, kafka003:9092");
Is this a broker address or a cluster address? If it is a broker address, I think that is not good, because we have to modify the address above whenever the broker count changes.
(But it does seem to be a broker address...)
Additionally, I saw that in Amazon MSK we can add a broker to each AZ.
That means we cannot have many brokers (three or four at most?).
And they guide us to write these broker addresses in the bootstrap.servers option as a comma-separated list.
Why don't they guide us to use a cluster address or ARN?
A Kafka cluster is a group of Kafka brokers.
When using the Producer API it is not required to mention all brokers of the cluster in the bootstrap.servers property. The Producer configuration documentation for bootstrap.servers gives the full details:
A list of host/port pairs to use for establishing the initial connection to the Kafka cluster. The client will make use of all servers irrespective of which servers are specified here for bootstrapping—this list only impacts the initial hosts used to discover the full set of servers. This list should be in the form host1:port1,host2:port2,.... Since these servers are just used for the initial connection to discover the full cluster membership (which may change dynamically), this list need not contain the full set of servers (you may want more than one, though, in case a server is down).
All brokers within a cluster share meta information about the other brokers in the same cluster. Therefore, it is sufficient to mention even only one broker in the bootstrap.servers property. However, you should still mention more than one, in case that one broker is unavailable for whatever reason.
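For example (a sketch, reusing the broker host names from the question), it is enough to list only a subset of the brokers:
props.put("bootstrap.servers", "kafka001:9092,kafka002:9092"); // kafka003 and any new brokers are discovered from cluster metadata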

How to run Kafka-connect-replicator in distributed mode?

I want to replicate data from one system to another using Confluent's Replicator. I am using two Ubuntu 18.04 systems, where one acts as the source and the other as the destination.
I tried to run kafka-connect-replicator in distributed mode where I changed the following configurations:
In confluent/etc/kafka/server.properties I made the following changes
SOURCE
> advertised.listeners=PLAINTEXT://source.ip:9092
DESTINATION
> advertised.listeners=PLAINTEXT://destination.ip:9092
In confluent/etc/kafka-connect-replicator/replicator.connect.distributed.properties I made the following changes
- group.id=connect-replicator
group.id is the same on the source and destination systems
SOURCE
- bootstrap.servers=destination.ip:9092, source.ip:9092
DESTINATION
- bootstrap.servers=destination.ip:9092, source.ip:9092
In confluent/etc/kafka-connect-replicator/quickstart-replicator.properties I changed the following configurations
SOURCE
name=replicator-source
connector.class=io.confluent.connect.replicator.ReplicatorSourceConnector
# source cluster connection info
src.kafka.bootstrap.servers=source.ip:9092
# Set to use direct connection to Zookeeper by Replicator on the source
src.zookeeper.connect=localhost:2181
# destination cluster connection info
dest.kafka.bootstrap.servers=destination.ip:9092
# Set to use direct connection to Zookeeper by Replicator on the destination
dest.zookeeper.connect=destination.ip:2181
# configure topics to replicate
topic.whitelist= test-topic
topic.rename.format=${topic}.replica
DESTINATION
name=replicator-source
connector.class=io.confluent.connect.replicator.ReplicatorSourceConnector
# source cluster connection info
src.kafka.bootstrap.servers=source.ip:9092
# Set to use direct connection to Zookeeper by Replicator on the source
src.zookeeper.connect=source.ip:2181
# destination cluster connection info
dest.kafka.bootstrap.servers=destination.ip:9092
# Set to use direct connection to Zookeeper by Replicator on the destination
dest.zookeeper.connect=destination.ip:2181
# configure topics to replicate
topic.whitelist= test-topic
topic.rename.format=${topic}.replica
And then I created the topic on the source system and launched the connector using the command below
PATH_TO_CONFLUENT> sudo ./bin/connect-distributed ./etc/kafka-connect-replicator/replicator-connect-distributed.properties ./etc/kafka-connect-replicator/quickstart-replicator.properties
After this I produced data to the topic from the source system and tried to consume it on the destination system with the topic name {topic}.replica, but there is no topic present to consume from.
It's not clear what errors you're having, but some notes.
connect-distributed only takes one properties file, not two. You HTTP POST the connector properties to the Connect cluster as JSON; you do not load a connector properties file during cluster startup. The quickstart file is meant to be used with connect-standalone.
The JSON would look like:
{"name": "your-replicator-name", "config": {
  "src.kafka.bootstrap.servers": "...",
  ...
}}
./etc/kafka/connect-distributed.properties should be a starting point for running any Connect or Replicator cluster in Distributed mode, although there may be similar configurations in replicator-connect-distributed.properties
bootstrap.servers should only ever point to a single cluster. The source and destination clusters are configured separately via src.kafka.bootstrap.servers and dest.kafka.bootstrap.servers.
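For example, once the distributed worker is up, the connector configuration can be submitted to the Connect REST API like this (a sketch assuming the worker's REST API is on the default localhost:8083 and the JSON above is saved as replicator.json):
curl -X POST -H "Content-Type: application/json" --data @replicator.json http://localhost:8083/connectors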

kafka bootstrap.servers as DNS A-Record with multiple IPs

I have a Kafka cluster with 5 brokers and I'm using Consul service discovery to put their IPs into a DNS record.
kafka.service.domain.cc A 1.1.1.1 2.2.2.2 ... 5.5.5.5
Is it recommended to use only one domain name:
kafka.bootstrap.servers = kafka.service.domain.cc:30000
or is it better to have multiple domain names (at least 2), each one resolves to one broker
kafka1.service.domain.cc A 1.1.1.1
kafka2.service.domain.cc A 2.2.2.2
then use them in Kafka
kafka.bootstrap.servers = kafka1.service.domain.cc:30000,kafka2.service.domain.cc:30000
My concern with the first approach is that the domain name will be resolved only once, to a random broker, and if that broker is down, no new DNS resolution will take place.
From the book Mastering Apache Kafka:
bootstrap.servers is a comma-separated list of host and port pairs
that are the addresses of the Kafka brokers in a "bootstrap" Kafka
cluster that a Kafka client connects to initially to bootstrap itself.
bootstrap.servers provides the initial hosts that act as the
starting point for a Kafka client to discover the full set of alive
servers in the cluster. Since these servers are just used for the
initial connection to discover the full cluster membership (which may
change dynamically), this list does not have to contain the full set
of servers (you may want more than one, though, in case a server is
down).
Clients (producers or consumers) make use of all servers irrespective
of which servers are specified in bootstrap.servers for bootstrapping.
So, since the property bootstrap.servers provides the initial hosts that act as the starting point for a Kafka client to discover the full set of alive servers in the cluster, I think both approaches will work. But since the value of the property is kept as a comma-separated list, I guess the second approach is the recommended one. Another problem with approach 1 is that, while bootstrapping, the randomly chosen broker may be down and the client will not get the cluster information it needs to continue. So it is always better to provide more than one address as a fallback in case one broker is down during bootstrapping.
Kafka 2.1 included support for handling multiple DNS resource records in bootstrap.servers.
If you set client.dns.lookup="use_all_dns_ips" in your client configuration, it will use all of the IP addresses returned by DNS, not just the first (or a random one).
See KIP-235 and KIP-302 for more information.
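With that setting, the single-name approach from the question becomes viable (a sketch assuming Kafka 2.1+ clients and the Consul record above):
bootstrap.servers=kafka.service.domain.cc:30000
client.dns.lookup=use_all_dns_ips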

how to setup Confluent Kafka Schema Registry in Cluster mode

Setup: We have 3 Schema Registry instances behind an AWS ELB. How should the schema-registry.properties file be changed to set up Schema Registry in cluster mode?
We are calling the Schema Registry with the ELB endpoint.
The cluster of Schema Registry instances is established by each instance contacting the same ZooKeeper cluster, so you basically want each instance to have the same configuration. A single master is elected using the strategy in the docs, and any follower that receives a write request will just forward that request to the leader. If for some reason you only want certain instances to be master-eligible, you can set master.eligibility=false in their properties files. If you want to get fancy and set non-default advertised listeners for your instances, then those have to be unique per instance (they are host:port combinations, so this should be expected).
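For example, each of the three instances could use essentially the same properties (a sketch assuming a shared ZooKeeper ensemble reachable at zk1:2181,zk2:2181,zk3:2181; the hostnames are placeholders):
listeners=http://0.0.0.0:8081
kafkastore.connection.url=zk1:2181,zk2:2181,zk3:2181
kafkastore.topic=_schemas
# Optional: set to false on instances that should never be elected master
#master.eligibility=false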

Confluent server went down

I am a beginner with Confluent and Kafka.
When I am using the Confluent Platform on the slave node server (distributed mode, but only on one server), the Confluent server (only the server; Kafka itself is working properly) goes down from time to time. Because I am new to this, I make mistakes when creating the sources and sinks; does that have anything to do with the breakdown?
Here is my config:
# Sample configuration for a distributed Kafka Connect worker that uses Avro serialization and
# integrates with the Schema Registry. This sample configuration assumes a local installation of
# Confluent Platform with all services running on their default ports.
# Bootstrap Kafka servers. If multiple servers are specified, they should be comma-separated.
bootstrap.servers=localhost:9092
# The group ID is a unique identifier for the set of workers that form a single Kafka Connect
# cluster
group.id=connect-cluster
# The converters specify the format of data in Kafka and how to translate it into Connect data.
# Every Connect user will need to configure these based on the format they want their data in
# when loaded from or stored into Kafka
key.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://localhost:18081
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:18081
# The internal converter used for offsets and config data is configurable and must be specified,
# but most users will always want to use the built-in default. Offset and config data is never
# visible outside of Connect in this format.
internal.key.converter=org.apache.kafka.connect.json.JsonConverter
internal.value.converter=org.apache.kafka.connect.json.JsonConverter
internal.key.converter.schemas.enable=false
internal.value.converter.schemas.enable=false
# Kafka topic where connector configuration will be persisted. You should create this topic with a
# single partition and high replication factor (e.g. 3)
config.storage.topic=connect-configs
# Kafka topic where connector offset data will be persisted. You should create this topic with many
# partitions (e.g. 25) and high replication factor (e.g. 3)
offset.storage.topic=connect-offsets
# Kafka topic where connector status data will be persisted. You should create this topic with many
# partitions (e.g. 25) and high replication factor (e.g. 3)
status.storage.topic=connect-statuses
# Confluent Control Center integration -- uncomment these lines to enable Kafka client interceptors
# that will report audit data that can be displayed and analyzed in Confluent Control Center
producer.interceptor.classes=io.confluent.monitoring.clients.interceptor.MonitoringProducerInterceptor
consumer.interceptor.classes=io.confluent.monitoring.clients.interceptor.MonitoringConsumerInterceptor
I am curious about this, because Confluent Platform is a well-designed project supported by a lot of experts, and more importantly, it is commercial.
Feiran