How many bootstrap servers to provide for large Kafka cluster

I have a use case where my Kafka cluster will have 1000 brokers and I am writing a Kafka client.
In order to write the client, I need to provide a broker list.
The question is: what are the recommended guidelines for providing the broker list to the client?
Is there any proxy-like service available in Kafka that we can give to the client?
- the proxy would know all the brokers in the cluster and connect the client to the appropriate broker.
- like in the Redis world, where we have twemproxy (nutcracker)
- can the Confluent REST API act as a proxy?
Is it recommended to provide a specific number of brokers in the client, for example a list of 3 brokers even though the cluster has 1000 nodes?
- what if the provided brokers crash?
- what if the provided brokers restart and their location/IP changes?

The list of broker URLs you pass to the client is only used to bootstrap the client. From those, the client automatically learns about all other available brokers and connects to the brokers it needs to "talk to".
Thus, if the client is already running and those bootstrap brokers go down, the client will not even notice. Only if all of those brokers are down at the same time when you start up the client will the client "hang", as it cannot connect to the cluster, and eventually time out.
It's recommended to provide at least 3 broker URLs to "survive" the outage of 2 brokers, but you can also provide more if you need a higher level of resilience.
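For illustration, here is a minimal producer sketch with three bootstrap brokers (the kafka001–kafka003 host names and the topic name are placeholders, not anything from the question):

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Properties;

public class BootstrapExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Three bootstrap brokers out of the (hypothetical) 1000-broker cluster;
        // the client discovers all remaining brokers from cluster metadata.
        props.put("bootstrap.servers", "kafka001:9092,kafka002:9092,kafka003:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // "my-topic" is a placeholder topic name
            producer.send(new ProducerRecord<>("my-topic", "key", "value"));
        }
    }
}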

Related

What is the difference between Kafka Cluster and Kafka Broker?

Do Kafka cluster and Kafka broker have the same meaning?
I know a cluster has multiple brokers (is this wrong?).
But when I write code to produce messages, I find this option awkward.
props.put("bootstrap.servers", "kafka001:9092, kafka002:9092, kafka003:9092");
Is this a broker address or a cluster address? If this is a broker address, I think it is not good because we have to modify the address above when the broker count changes.
(But it seems like a broker address...)
Additionally, I saw that in Amazon MSK, we can add a broker to each AZ.
Does that mean we cannot have many brokers? (Three or four at most?)
And they advise that we should write these broker addresses in the bootstrap.servers option as a comma-separated list.
Why don't they guide us to use a cluster address or ARN?
A Kafka cluster is a group of Kafka brokers.
When using the Producer API, it is not required to mention all brokers within the cluster in the bootstrap.servers property. The Producer configuration documentation on bootstrap.servers gives the full details:
A list of host/port pairs to use for establishing the initial connection to the Kafka cluster. The client will make use of all servers irrespective of which servers are specified here for bootstrapping—this list only impacts the initial hosts used to discover the full set of servers. This list should be in the form host1:port1,host2:port2,.... Since these servers are just used for the initial connection to discover the full cluster membership (which may change dynamically), this list need not contain the full set of servers (you may want more than one, though, in case a server is down).
All brokers within a cluster share metadata about the other brokers in the same cluster. Therefore, it is sufficient to mention only one broker in the bootstrap.servers property. However, you should still mention more than one in case that one broker is unavailable for whatever reason.
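As a sketch, a consumer configured with a single bootstrap broker still discovers the full cluster (the host name, group id, and topic are placeholders):

import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class SingleBootstrapConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // One bootstrap broker is enough to discover the whole cluster,
        // but listing more protects against this broker being down at startup.
        props.put("bootstrap.servers", "kafka001:9092");
        props.put("group.id", "example-group");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            records.forEach(r -> System.out.printf("%s: %s%n", r.key(), r.value()));
        }
    }
}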

Is it possible to produce to a Kafka topic when only 1 of the brokers is reachable?

Is it possible to produce to a Kafka topic when only 1 of the brokers is reachable from the producer, none of the ZooKeeper nodes are reachable from the producer, but all of the brokers are healthy and reachable from each other?
For example, this would be required if I were to produce messages via an SSH tunnel. If this were a temporary push, I could possibly create the topic with replication factor 1 and have all partitions assigned to the broker in question, then reassign the partitions after the fact, but I'm hoping there is a more flexible setup.
This is all using the Java client.
Producers don't interact with Zookeeper so it's not an issue.
The only requirement for Producers is to be able to connect to the brokers that are leaders for the partitions they want to use.
If the broker you connect to is the leader for the partitions you want to use, then yes you can produce to it.
Otherwise it's not going to work. Creating a topic may not help either, as its partitions could be assigned to any brokers. Also, in order to create a topic, a client has to connect to the controller, which may not be the broker you can reach.
If you can only connect to 1 "thing", you may want to consider using something like a REST Proxy. Your "isolated" environment could send REST requests to the proxy which is able to connect to all brokers in the cluster.
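One way to check whether the reachable broker actually leads the partitions you need is to describe the topic over the tunnel first. A rough sketch with AdminClient (the tunnel address and topic name are placeholders):

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;
import java.util.Collections;
import java.util.Map;
import java.util.Properties;

public class LeaderCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // The one broker reachable through the SSH tunnel (placeholder address)
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            Map<String, TopicDescription> topics = admin
                .describeTopics(Collections.singletonList("my-topic"))
                .all()
                .get();
            for (TopicPartitionInfo p : topics.get("my-topic").partitions()) {
                // Producing to a partition only works if its leader is reachable
                System.out.printf("partition %d leader: %s%n", p.partition(), p.leader());
            }
        }
    }
}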

Kafka, configuring a single vs multiple broker servers for clients?

I used to configure bootstrap.servers in my Kafka producer/consumer/streams apps with a list of broker IPs. But I'd like to move to a single URL entry that will be resolved by a DNS lookup to a broker IP currently known to be up (the DNS actively checks the brokers in the cluster and responds to lookups with an IP with a short TTL [10s]). This gives me more flexibility to add brokers in the future, and I can keep the same config in my apps across all the environments/stages. Is this a recommended approach, or does this remove resiliency on the client side by not having a strict list of brokers? I assume this config would only be used to initially "discover" the cluster and the partition leader brokers.
If anything, I'd say this adds a single point of failure on the single address you're providing, unless it's actually a load-balanced reverse proxy.
Another possibility that's worked somewhat well internally is using Consul service discovery, with Consul agents running on each broker. This way, you can do service discovery as well as health checks and an easier monitoring setup, e.g. having the Prometheus jmx_exporter on the brokers, and Prometheus Server scraping those values for all kafka.service.consul addresses.
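If you do go with a single DNS name, note that reasonably recent Java clients can expand it to all IPs behind it via the client.dns.lookup setting. A hedged sketch, assuming a DNS alias kafka.example.com that resolves to several brokers:

import org.apache.kafka.clients.CommonClientConfigs;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Properties;

public class DnsBootstrap {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Single DNS alias (placeholder) that resolves to multiple broker IPs
        props.put(CommonClientConfigs.BOOTSTRAP_SERVERS_CONFIG, "kafka.example.com:9092");
        // Try every IP returned by the DNS lookup instead of only the first one
        props.put(CommonClientConfigs.CLIENT_DNS_LOOKUP_CONFIG, "use_all_dns_ips");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Fetching metadata forces the bootstrap connection ("my-topic" is a placeholder)
            System.out.println(producer.partitionsFor("my-topic"));
        }
    }
}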

Can a Kafka broker keep id but change advertised listener?

I have a cluster of 3 Kafka brokers. Most of the topics have replication factor of 2, while the consumer offsets all have a replication factor of 3.
I need to change where the individual brokers are listening, i.e. the IPs/hostnames on which they are listening. Is it possible to change the advertised listeners for a given broker ID? Or do I have to create a new broker with a different ID, repartition topics, and remove the old broker?
Assuming it does work, does the official Java Kafka client realize that the listener has changed and re-request the list of brokers for the topic(s)?
For the interested, I am running Kafka in Kubernetes. Originally, I needed access from both inside and outside the cluster, so I had services with nodePort (hostPort did not work with CNI prior to Kubernetes 1.7).
It worked, but was complex. I no longer need access from outside Kubernetes, so would like to keep it simple and have three brokers that advertise their hostname.
Can I bring down a broker and restart it with a different advertised listener? Or must I add a new broker, rebalance, and remove the old one?

How many connections are made by producer and consumer to a Kafka cluster?

Can anyone shed some light on the number of connections, and what type of connections, made by the Kafka Java producers and Kafka Java consumers to a Kafka cluster?
Is the number of connections based on the number of topics, partitions, or brokers in the cluster?
Each consumer/producer needs to be connected to the broker that is the leader for each partition the consumer/producer wants to read from or write to.
This means that a client doesn't need to be connected to all brokers inside a cluster, just to the brokers needed for reading/sending messages.
During the initial configuration, we provide a list of brokers to connect to (which could even be only one). Using such broker(s), the client gets metadata about the topics/partitions it wants to use and where they are placed (on which other brokers in the cluster). Those connections then need to stay in place for the client to work with the desired topics/partitions.
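You can observe this from the client itself via its connection-count metric. A small sketch (broker address and topic name are placeholders):

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Properties;

public class ConnectionCount {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka001:9092"); // placeholder bootstrap broker
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("my-topic", "value")); // placeholder topic
            producer.flush();
            // connection-count reflects connections to the bootstrap broker
            // plus the leaders of the partitions actually written to.
            producer.metrics().forEach((name, metric) -> {
                if (name.name().equals("connection-count")) {
                    System.out.println("open connections: " + metric.metricValue());
                }
            });
        }
    }
}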