What is the difference between Kafka Cluster and Kafka Broker? - apache-kafka

Has Kafka cluster and Kafka broker the same meaning?
I know cluster has multiple brokers (Is this wrong?).
But when I write code to produce messages, I find awkward option.
props.put("bootstrap.servers", "kafka001:9092, kafka002:9092, kafka003:9092");
Is this broker address or cluster address? If this is broker address, I think it is not good because we have to modify above address when brokers count changes.
(But it seems like broker address..)
Additionally, I saw in MSK in amazon, we can add broker to each AZ.
It means, we cannot have many broker. (Three or four at most?)
And they guided we should write this broker addresses to bootstrap.serveroption as a,` seperated list.
Why they don't guide us to use clusters address or ARN?

A Kafka cluster is a group of Kafka brokers.
When using the Producer API it is not required to mention all brokers within the cluster in the bootstrap.servers properties. The Producer configuration documentation on bootstrap.servers gives the full details:
A list of host/port pairs to use for establishing the initial connection to the Kafka cluster. The client will make use of all servers irrespective of which servers are specified here for bootstrapping—this list only impacts the initial hosts used to discover the full set of servers. This list should be in the form host1:port1,host2:port2,.... Since these servers are just used for the initial connection to discover the full cluster membership (which may change dynamically), this list need not contain the full set of servers (you may want more than one, though, in case a server is down).
All brokers within a cluster share meta information of other brokers in the same cluster. Therefore, it is sufficient to mention even only one broker in the bootstrap-servers properties. However, you should still mention more than one in case of the one broker being not available for whatever reason.

Related

what is bootstrap-server in kafka config?

I have just started learning kafka and continuously I am coming across a term bootstrap-server.
Which server does it represent in my kafka cluster?
It is the url of one of the Kafka brokers which you give to fetch the initial metadata about your Kafka cluster.
The metadata consists of the topics, their partitions, the leader brokers for those partitions etc.
Depending upon this metadata your producer or consumer produces or consumes the data.
You can have multiple bootstrap-servers in your producer or consumer configuration. So that if one of the broker is not accessible, then it falls back to other.
We know that a kafka cluster can have 100s or 1000nds of brokers (kafka servers). But how do we tell clients (producers or consumers) to which to connect? Should we specify all 1000nds of kafka brokers in the configuration of clients? no, that would be troublesome and the list will be very lengthy. Instead what we can do is, take two to three brokers and consider them as bootstrap servers where a client initially connects. And then depending on alive or spacing, those brokers will point to a good kafka broker.
So bootstrap.servers is a configuration we place within clients, which is a comma-separated list of host and port pairs that are the addresses of the Kafka brokers in a "bootstrap" Kafka cluster that a Kafka client connects to initially to bootstrap itself.
A host and port pair uses : as the separator.
localhost:9092
localhost:9092,another.host:9092
So as mentioned, bootstrap.servers provides the initial hosts that act as the starting point for a Kafka client to discover the full set of alive servers in the cluster.
Special Notes:
Since these servers are just used for the initial connection to discover the full cluster membership (which may change dynamically), this list does not have to contain the full set of servers (you may want more than one, though, in case a server is down).
Clients (producers or consumers) make use of all servers irrespective of which servers are specified in bootstrap.servers for bootstrapping.
bootstrap.servers is a comma-separated list of host and port pairs that are the addresses of the Kafka brokers in a "bootstrap" Kafka cluster that a Kafka client connects to initially to bootstrap itself.
Kafka broker
A Kafka cluster is made up of multiple Kafka Brokers. Each Kafka Broker has a unique ID (number). Kafka Brokers contain topic log partitions. Connecting to one broker bootstraps a client to the entire Kafka cluster. For failover, you want to start with at least three to five brokers. A Kafka cluster can have, 10, 100, or 1,000 brokers in a cluster if needed.
more information: check this, official doc

Clarification on the --broker-list option and --bootstrap-server option in Kafka-console producer and consumer respectively

In Kafka-console-producer, the --broker-list takes a list of servers.
Does the producer connect to all of them? (or)
Does the producer uses the list of servers to connect one of them and if that one fails, switches to the next and so on?
Similarly, in Kafka-console-consumer the --bootstrap-server takes a list of Kafka servers. If there are two Kafka servers, do I need to specify both of them in the --bootstrap-server?
I tried myself running the consumer with one server (Kafka-server1) and when I stopped Kafka-server1, it continued to receive data for the topic.
They both act/are the same.
If you look at the Kafka source code, you'll see both options lead to the same "bootstrap.servers" configuration property
def producerProps(config: ProducerConfig): Properties = {
val props =
if (config.options.has(config.producerConfigOpt))
Utils.loadProps(config.options.valueOf(config.producerConfigOpt))
else new Properties
props ++= config.extraProducerProps
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, config.brokerList) // <---- brokerList is passed as BOOTSTRAP_SERVER
Both consumer and producer will connect in a round-robin fashion to the list of provided addresses to create an initial "boostrap" connection to the Kafka Controller, which knows about all available brokers in the cluster at a given time. It is good practice to give at least 3 for high-availability.
If there are two Kafka servers, do I need to specify both of them in the --bootstrap-server?
With regards to having multiple addresses availble to use, in a cloud enviornment, where you might have brokers over availability zones, it is recommended to have at least 2 brokers listed per availability zone, so 6 total for 3 zones.
The address provided for clients could be similified using a load balancer / reverse proxy down to a single kafka.your.network:9092 address, but then you are introducing extra DNS and network hops to figure out the connection for the sake of having a single, well-known, address.
In any case, all available addresses for the brokers will be handed to the clients and then cached locally.
However, it is important to recognize all send/poll requests will only communicate with the singular leader of a TopicPartition, despite how many addresses you give and how many replicas a topic will have.

How many bootstrap servers to provide for large Kafka cluster

I have a use case where my Kafka cluster will have 1000 brokers and I am writing Kafka client.
In order to write client, i need to provide brokers list.
Question is, what are the recommended guidelines to provide brokers list in client?
Is there any proxy like service available in kafka which we can give to client?
- that proxy will know all the brokers in cluster and connect client to appropriate broker.
- like in redis world, we have twemproxy (nutcracker)
- confluent-rest-api can act as proxy?
Is it recommended to provide any specific number of brokers in client, for example provide list of 3 brokers even though cluster has 1000 nodes?
- what if provided brokers gets crashed?
- what if provided brokers restarts and there location/ip changes?
The list of broker URL you pass to the client are only to bootstrap the client. Thus, the client will automatically learn about all other available brokers automatically, and also connect to the correct brokers it need to "talk to".
Thus, if the client is already running, the those brokers go down, the client will not even notice. Only if all those brokers are down at the same time, and you startup the client, the client will "hang" as it cannot connect to the cluster and eventually time out.
It's recommended to provide at least 3 broker URLs to "survive" the outage of 2 brokers. But you can also provide more if you need a higher level of resilience.

Can a Kafka broker keep id but change advertised listener?

I have a cluster of 3 Kafka brokers. Most of the topics have replication factor of 2, while the consumer offsets all have a replication factor of 3.
I need to change where the individual brokers are listening, i.e. the IPs/hostnames on which they are listening. Is it possible to change the advertised listeners for a given broker ID? Or do I have to create a new broker with a different ID, repartition topics, and remove the old broker?
Assuming it does work, does the official Java Kafka client realize that the listener has changed and re-request the list of brokers for the topic(s)?
For the interested, I am running Kafka in Kubernetes. Originally, I needed access from both inside and outside the cluster, so I had services with nodePort (hostPort did not work with CNI prior to Kubernetes 1.7).
It worked, but was complex. I no longer need access from outside Kubernetes, so would like to keep it simple and have three brokers that advertise their hostname.
Can I bring down a broker and restart it with a different advertised listener? Or must I add a new broker, rebalance, and remove the old one?

Why do we need to add all zookeeper nodes in Kafka Consumer Configuration

Looks like we need to add the ip addresses of all zookeeper nodes in the property "zookeeper.connect" for configuring a consumer.
Now my understanding says the zookeeper cluster has a leader which is managed in a fail-safe way.
So, why cant we just provide a bootstrap list for zookeeper nodes like its done in Producer configuration(while providing bootstrap broker list) & they should provide metadata about the entire zookeeper cluster?
You can specify a subset of the nodes. The nodes in that list are only used to get an initial connection to the cluster of nodes and the client goes through the list until a connection is made. Usually the first node is up and available so the client doesn't have to go too far into the list. So you only need to add extra nodes to the list depending on how pessimistic you are.