Kafka, configuring a single vs multiple broker servers for clients? - apache-kafka

I currently configure bootstrap.servers in my Kafka producer/consumer/streams apps with a list of broker IPs. But I'd like to move to a single URL entry that a DNS lookup resolves to the IP of a broker currently known to be up (the DNS server actively checks the brokers in the cluster and responds to lookups with an IP and a short TTL [10s]). This gives me more flexibility to add brokers in the future, and I can keep the same config in my apps across all environments/stages. Is this a recommended approach, or does not having a strict list of brokers reduce resiliency on the client side? I assume this config would only be used to initially "discover" the cluster and the partition leader brokers.

If anything, I'd say this adds a single point of failure at the single address you're providing, unless it's actually a load-balanced reverse proxy.
Another possibility that's worked reasonably well internally is Consul service discovery, with Consul agents running on each broker. This way, you get service discovery as well as health checks and an easier monitoring setup, e.g. running the Prometheus jmx_exporter on the brokers and having the Prometheus server scrape those values for all kafka.service.consul addresses.

Related

Exposing Kafka brokers using Google click to deploy environment

I have used the Kafka cluster (with replication) click-to-deploy container from Google on Kubernetes. How do I expose the brokers so I can consume from an external consumer? I'm very new to Kubernetes.
I have tried exposing the broker nodes with a load balancer but the external ip given
I opened the port with a firewall rule, but when I connect from my consumer it throws an error about being disconnected.
Any help would be great; I can provide more info if asked.
You cannot use a load balancer. Kafka clients must talk directly to the brokers.
A firewall rule is a good step, but you need to ensure each of the brokers is exposed properly via advertised.listeners, both within and outside of the VPC. Refer to the blog post.
Alternatively, within GKE, you can run the Strimzi Operator, which manages the Kubernetes resources for the Kafka cluster for you.
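To make the advertised.listeners point more concrete, here is a minimal sketch of the listener-related settings for one broker, built as a java.util.Properties object. The listener names (INTERNAL, EXTERNAL), the ports, the host names, and the PLAINTEXT protocol are assumptions chosen for illustration, not values from the question; the property keys themselves are standard broker settings.

```java
import java.util.Properties;

// Sketch only: listener names, ports and hosts below are hypothetical placeholders.
class BrokerListenerSketch {

    static Properties listenerConfig(String internalHost, String externalHost) {
        Properties p = new Properties();
        // Sockets the broker binds to, one listener per network path.
        p.setProperty("listeners", "INTERNAL://0.0.0.0:9092,EXTERNAL://0.0.0.0:9094");
        // Addresses the broker hands back to clients in metadata responses.
        // Each client must be able to reach the address advertised on the listener it used.
        p.setProperty("advertised.listeners",
                "INTERNAL://" + internalHost + ":9092,EXTERNAL://" + externalHost + ":9094");
        // Maps each listener name to a security protocol (PLAINTEXT is an assumption here).
        p.setProperty("listener.security.protocol.map", "INTERNAL:PLAINTEXT,EXTERNAL:PLAINTEXT");
        // Brokers talk to each other over the internal listener.
        p.setProperty("inter.broker.listener.name", "INTERNAL");
        return p;
    }

    public static void main(String[] args) {
        // Every broker needs its own pair of addresses; these two are made up.
        listenerConfig("kafka-0.internal.example", "203.0.113.10")
                .forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

The printed key=value lines correspond to entries in that broker's server.properties. An external consumer bootstraps against the EXTERNAL address and is then told to keep using each broker's EXTERNAL address, which is why a single shared load balancer address for all brokers tends to fail.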

Kafka Post Deployment - Handling ever-growing clients

We have set up a Kafka cluster for high availability and distributed data load. The current consumers and producers specify all the broker IP addresses to connect to the cluster. In the future, we will need to continuously monitor the cluster and add new brokers based on collected metrics and overall system performance. If a broker crashes, we have to add a new broker with a different IP as soon as possible.
In these scenarios, we have to change all client configurations, which is a time-consuming and stressful operation.
I think we could set up a config server (e.g. Spring Cloud Config Server) to specify all the broker IP addresses centrally, so that we only have to change them in one place without touching all the clients, but I don't know if it is the best approach. Obviously, the clients must be programmed to get the broker list from the config server.
Is there a better approach?
It's worth pointing out that the "bootstrap" process doesn't require giving every single broker address to the clients. Only one reachable address from the list is needed for the initial connection; after that, the clients use the addresses that each broker in the cluster advertises via advertised.listeners.
The answer to your question is yes, use service discovery. That could be Spring Cloud Config, but the more general option would be HashiCorp Consul or another service that uses DNS (Kubernetes uses CoreDNS by default, for example, or AWS Route 53).
Then you edit the /etc/resolv.conf of each machine the client is running on (assuming Linux) to include the DNS servers, and you can simply refer to kafka.your.domain:9092 rather than using IP addresses.
You could use a load balancer (with a friendly DNS name like kafka.domain.com) that points to all of your brokers. We do this in our environment. Your clients then connect to kafka.domain.com:9092.
When you add new brokers, you only change the load balancer endpoints and not the client configuration.
Additionally, please note that you only need to connect to one bootstrap broker and don't have to list all of them in the client configuration.
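When moving to a single DNS name, it can help to check what that name actually resolves to from the client machine. The following is a small sketch using only the JDK; the class and method names are made up for illustration, and the real host name (e.g. kafka.domain.com from the answer above) would be passed as an argument.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.List;

// Sketch: lists every IP address a bootstrap host name resolves to.
class BootstrapDnsCheck {

    // Returns all resolved addresses, or an empty list if the name does not resolve.
    static List<String> resolveAll(String host) {
        List<String> ips = new ArrayList<>();
        try {
            for (InetAddress address : InetAddress.getAllByName(host)) {
                ips.add(address.getHostAddress());
            }
        } catch (UnknownHostException e) {
            // Unresolvable name: the client would not be able to bootstrap from it either.
        }
        return ips;
    }

    public static void main(String[] args) {
        // Pass the real bootstrap name as the first argument; localhost is just a default.
        String host = args.length > 0 ? args[0] : "localhost";
        System.out.println(host + " -> " + resolveAll(host));
    }
}
```

If the name returns several addresses, it is a multi-record DNS entry; if it always returns one, it is most likely a load balancer or a health-checked single record.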

What is the difference between Kafka Cluster and Kafka Broker?

Do "Kafka cluster" and "Kafka broker" have the same meaning?
I know a cluster has multiple brokers (is this wrong?).
But when I write code to produce messages, I find an awkward option:
props.put("bootstrap.servers", "kafka001:9092, kafka002:9092, kafka003:9092");
Is this a broker address or a cluster address? If it is a broker address, I think that is not good, because we have to modify the address above whenever the broker count changes.
(But it seems to be a broker address.)
Additionally, I saw that in Amazon MSK we can add brokers to each AZ.
That means we cannot have many brokers. (Three or four at most?)
And they advise that we should write these broker addresses in the bootstrap.servers option as a comma-separated list.
Why don't they guide us to use the cluster's address or ARN?
A Kafka cluster is a group of Kafka brokers.
When using the Producer API, it is not required to list all brokers in the cluster in the bootstrap.servers property. The producer configuration documentation on bootstrap.servers gives the full details:
A list of host/port pairs to use for establishing the initial connection to the Kafka cluster. The client will make use of all servers irrespective of which servers are specified here for bootstrapping—this list only impacts the initial hosts used to discover the full set of servers. This list should be in the form host1:port1,host2:port2,.... Since these servers are just used for the initial connection to discover the full cluster membership (which may change dynamically), this list need not contain the full set of servers (you may want more than one, though, in case a server is down).
All brokers within a cluster share metadata about the other brokers in the same cluster. Therefore, it is sufficient to list just one broker in the bootstrap.servers property. However, you should still list more than one in case that one broker is unavailable for whatever reason.
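As a minimal sketch of that advice, the producer configuration below lists only two brokers of a larger cluster. It uses plain string keys so that it runs without the Kafka client library; the helper name producerProps and the choice of String serializers are assumptions for illustration.

```java
import java.util.Properties;

// Sketch: a producer config that bootstraps from a subset of the cluster's brokers.
class ProducerBootstrapSketch {

    static Properties producerProps(String bootstrapServers) {
        Properties props = new Properties();
        // Only used for the initial connection; the client discovers the remaining brokers itself.
        props.put("bootstrap.servers", bootstrapServers);
        // String serializers are an assumption; use whatever matches your data.
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        return props;
    }

    public static void main(String[] args) {
        // Two entries so that bootstrapping still works if one of them is down.
        Properties props = producerProps("kafka001:9092,kafka002:9092");
        System.out.println(props.getProperty("bootstrap.servers"));
        // With kafka-clients on the classpath, these props would be passed to
        // new KafkaProducer<String, String>(props).
    }
}
```

Adding a broker to the cluster later does not require touching this list, as long as at least one of the listed brokers stays reachable when a client starts.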

kafka bootstrap.servers as DNS A-Record with multiple IPs

I have a Kafka cluster with 5 brokers and I'm using Consul service discovery to put their IPs into a DNS record:
kafka.service.domain.cc A 1.1.1.1 2.2.2.2 ... 5.5.5.5
Is it recommended to use only one domain name:
kafka.bootstrap.servers = kafka.service.domain.cc:30000
or is it better to have multiple domain names (at least 2), each one resolves to one broker
kafka1.service.domain.cc A 1.1.1.1
kafka2.service.domain.cc A 2.2.2.2
and then use them in Kafka:
kafka.bootstrap.servers = kafka1.service.domain.cc:30000,kafka2.service.domain.cc:30000
My concern with the first approach is that the domain name will be resolved only once, to a random broker, and if that broker is down, no new DNS resolution will take place.
From the book Mastering Apache Kafka:
bootstrap.servers is a comma-separated list of host and port pairs
that are the addresses of the Kafka brokers in a "bootstrap" Kafka
cluster that a Kafka client connects to initially to bootstrap itself.
bootstrap.servers provides the initial hosts that act as the
starting point for a Kafka client to discover the full set of alive
servers in the cluster. Since these servers are just used for the
initial connection to discover the full cluster membership (which may
change dynamically), this list does not have to contain the full set
of servers (you may want more than one, though, in case a server is
down).
Clients (producers or consumers) make use of all servers irrespective
of which servers are specified in bootstrap.servers for bootstrapping.
Since bootstrap.servers only provides the initial hosts that act as the starting point for a Kafka client to discover the full set of alive servers in the cluster, I think both approaches will work. But because the property's value is defined as a comma-separated list, I guess the second approach is the recommended one. The problem with approach 1 is that, while bootstrapping, the randomly chosen broker may be down, and the client will then not get the cluster information it needs to continue. So it is always better to provide more than one address as a fallback in case one broker is down during bootstrapping.
Kafka 2.1 included support for handling multiple DNS resource records in bootstrap.servers.
If you set client.dns.lookup="use_all_dns_ips" in your client configuration, it will use all of the IP addresses returned by DNS, not just the first (or a random one).
See KIP-235 and KIP-302 for more information.
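A minimal sketch of such a client configuration is shown below, again with plain string keys so that it runs without the Kafka client library. The helper name consumerProps is made up; the host name and port are the ones from the question. As far as I know, newer client versions (2.6 and later) already default to use_all_dns_ips, so setting it explicitly mainly matters for older clients.

```java
import java.util.Properties;

// Sketch: one DNS name with several A records, with the client told to try all of them.
class DnsBootstrapSketch {

    static Properties consumerProps(String dnsName, int port) {
        Properties props = new Properties();
        // A single bootstrap entry that resolves to several broker IPs.
        props.put("bootstrap.servers", dnsName + ":" + port);
        // Try every IP returned by DNS instead of only the first one.
        props.put("client.dns.lookup", "use_all_dns_ips");
        return props;
    }

    public static void main(String[] args) {
        Properties props = consumerProps("kafka.service.domain.cc", 30000);
        props.forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

With this setting, a single multi-record name behaves much like a list of brokers during bootstrapping: if the first resolved IP does not answer, the client moves on to the next one.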

How many bootstrap servers to provide for large Kafka cluster

I have a use case where my Kafka cluster will have 1000 brokers, and I am writing a Kafka client.
In order to write the client, I need to provide a list of brokers.
The question is: what are the recommended guidelines for providing the broker list in the client?
Is there any proxy-like service available in Kafka that we can give to the client?
- that proxy would know all the brokers in the cluster and connect the client to the appropriate broker.
- like in the Redis world, where we have twemproxy (nutcracker)
- can confluent-rest-api act as a proxy?
Is it recommended to provide a specific number of brokers in the client, for example a list of 3 brokers even though the cluster has 1000 nodes?
- what if the provided brokers crash?
- what if the provided brokers restart and their location/IP changes?
The list of broker URLs you pass to the client is only used to bootstrap the client. The client will automatically learn about all other available brokers and connect to the correct brokers it needs to "talk to".
Thus, if the client is already running and those brokers go down, the client will not even notice. Only if all those brokers are down at the same time and you start up the client will the client "hang", as it cannot connect to the cluster, and eventually time out.
It's recommended to provide at least 3 broker URLs to "survive" the outage of 2 brokers. But you can also provide more if you need a higher level of resilience.
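The reasoning above can be illustrated with a deliberately simplified model of bootstrapping. This is not how the Kafka client is implemented internally (it does not necessarily try the entries in order); the class, the method firstReachable, and the broker names are all hypothetical, and the "is up" check is just a set lookup.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;
import java.util.function.Predicate;

// Simplified model: bootstrapping succeeds if at least one listed broker is up.
class BootstrapFallbackSketch {

    // Returns the first listed address accepted by the (hypothetical) health check.
    static Optional<String> firstReachable(List<String> bootstrapServers, Predicate<String> isUp) {
        return bootstrapServers.stream().filter(isUp).findFirst();
    }

    public static void main(String[] args) {
        List<String> bootstrap = List.of("broker1:9092", "broker2:9092", "broker3:9092");

        // Two of the three listed brokers are down: bootstrapping still succeeds.
        Set<String> up = Set.of("broker3:9092", "broker742:9092");
        System.out.println(firstReachable(bootstrap, up::contains)); // Optional[broker3:9092]

        // All listed brokers are down: a starting client has nothing to bootstrap from,
        // even though other brokers in the cluster are alive.
        Set<String> onlyUnlisted = Set.of("broker742:9092");
        System.out.println(firstReachable(bootstrap, onlyUnlisted::contains)); // Optional.empty
    }
}
```

The second case is the one where a newly started client would hang and eventually time out, which is why listing three brokers lets a starting client tolerate two of them being down.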