Ingress for Kafka

We are exploring implementing multi-tenancy in Kafka for each of our dev teams; the cluster would be hosted on AWS EKS.
The initial thought is to do multi-tenancy at the topic level:
NLB-Nginx-Ingress: an ingress host route for each team, with all the brokers that currently hold that team's topic-partition leaders added as the backend.
Access restriction through ACLs at the broker level, based on a principal such as a user.
Sample flow:
Ingress book-keeping challenges:
When someone from the foobar team creates a new topic and it lands on a new broker, we need to add that broker to the backend of the respective ingress.
If a broker goes down, the ingress again needs to be updated.
Prune brokers from the backend when their partition leaders go away due to topic deletion.
What I'm Looking for:
Apart from writing an operator or app to do the above tasks, is there a better way to achieve this? I'm open to completely new suggestions as well, since this is still at the POC stage.
PS: I'm new to Kafka; if this exchange is not suitable for this question, please suggest the right one to post on. Thanks!

First of all, the ACL restrictions are cluster-level, not broker-level.
Secondly, for the bootstrapping process the client only needs to reach at least one active broker in the cluster; that broker sends back metadata describing where the partition leaders are, and for its ongoing connections the client then connects to those brokers directly.
There is no need to put a load balancer in front of Kafka for bootstrapping. The suggestion is to list at least two brokers in a comma-separated list; the client connects to the first one available and gets the metadata. For the subsequent connections, the client needs to be able to reach every broker in the cluster.
You can use ACLs to restrict which principals (users) can access which topics in the cluster, based on their needs.
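As an illustration only, here is a minimal sketch of creating such ACLs with the Java AdminClient; the principal name User:foobar, the broker addresses, and the foobar. topic-name prefix are assumptions made up for the example, not values from the question.

    import java.util.Arrays;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AdminClientConfig;
    import org.apache.kafka.common.acl.AccessControlEntry;
    import org.apache.kafka.common.acl.AclBinding;
    import org.apache.kafka.common.acl.AclOperation;
    import org.apache.kafka.common.acl.AclPermissionType;
    import org.apache.kafka.common.resource.PatternType;
    import org.apache.kafka.common.resource.ResourcePattern;
    import org.apache.kafka.common.resource.ResourceType;

    public class FoobarTopicAcls {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            // Any reachable broker is enough to bootstrap the admin client.
            props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092");
            try (AdminClient admin = AdminClient.create(props)) {
                // Let the team principal read and write every topic whose name starts with "foobar.".
                ResourcePattern teamTopics = new ResourcePattern(ResourceType.TOPIC, "foobar.", PatternType.PREFIXED);
                AccessControlEntry read = new AccessControlEntry("User:foobar", "*", AclOperation.READ, AclPermissionType.ALLOW);
                AccessControlEntry write = new AccessControlEntry("User:foobar", "*", AclOperation.WRITE, AclPermissionType.ALLOW);
                admin.createAcls(Arrays.asList(
                        new AclBinding(teamTopics, read),
                        new AclBinding(teamTopics, write))).all().get();
            }
        }
    }

Because the ACLs live in cluster metadata rather than on an individual broker, they keep applying no matter which broker ends up leading the team's partitions.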

Related

Kafka internal topic : Where are the internal topics created - source or target broker?

We are doing a stateful operation. Our cluster is managed, so every time internal topics need to be created we have to ask the admin folks to unlock things so that the Kafka Streams app can create them. We have control over the target cluster, not the source cluster.
So I wanted to understand on which cluster, source or target, the internal topics are created.
AFAIK, there is only one cluster that the kafka-streams app connects to, and all topics (source/target/internal) are created there.
So far, Kafka Streams applications can only connect to a single cluster, the one defined by BOOTSTRAP_SERVERS_CONFIG in the Streams configuration.
As answered above, all source topics reside on those brokers, and all internal topics (changelog/repartition topics) are created in the same cluster. The KStreams app will create the target topic in the same cluster as well.
It is worth looking into the broker logs to understand and analyze the actual root cause.
As the other answers suggest, there should be only one cluster that the Kafka Streams application connects to. Internal topics are created by the Kafka Streams application and will only be used by the application that created them. However, there could be some security-related configuration on the broker side that prevents the streaming application from creating these topics:
If security is enabled on the Kafka brokers, you must grant the underlying clients admin permissions so that they can create internal topics. For more information, see Streams Security.
Quoted from here
Another point to keep in mind is that the internal topics are created automatically by the Streams application; there is no explicit configuration for auto-creation of internal topics.
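To make the single-cluster point concrete, here is a rough Streams configuration sketch; the topic names, application id, and broker addresses are invented for illustration. Everything, including the internal changelog topic behind the state store, is created on the one cluster named in bootstrap.servers, and internal topic names are prefixed with the application.id.

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.Produced;

    public class SingleClusterStreamsSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            // The only cluster the Streams app knows about; source, internal, and target topics all live here.
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092");
            // Internal topics are named <application.id>-...-changelog / -repartition.
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-stateful-app");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            builder.<String, String>stream("source-topic")
                   .groupByKey()
                   .count()   // backed by a state store whose changelog topic is auto-created in the same cluster
                   .toStream()
                   .to("target-topic", Produced.with(Serdes.String(), Serdes.Long()));

            new KafkaStreams(builder.build(), props).start();
        }
    }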

What is the difference between Kafka Cluster and Kafka Broker?

Do "Kafka cluster" and "Kafka broker" mean the same thing?
I know a cluster has multiple brokers (is this wrong?).
But when I write code to produce messages, I find this option awkward:
props.put("bootstrap.servers", "kafka001:9092, kafka002:9092, kafka003:9092");
Are these broker addresses or a cluster address? If they are broker addresses, that seems bad, because we would have to modify the list whenever the number of brokers changes.
(But they do look like broker addresses..)
Additionally, I saw that in Amazon MSK we can add brokers to each AZ.
Does that mean we cannot have many brokers (three or four at most)?
And they advise writing these broker addresses into the bootstrap.servers option as a comma-separated list.
Why don't they tell us to use a cluster address or an ARN?
A Kafka cluster is a group of Kafka brokers.
When using the Producer API it is not required to mention all brokers within the cluster in the bootstrap.servers property. The producer configuration documentation on bootstrap.servers gives the full details:
A list of host/port pairs to use for establishing the initial connection to the Kafka cluster. The client will make use of all servers irrespective of which servers are specified here for bootstrapping—this list only impacts the initial hosts used to discover the full set of servers. This list should be in the form host1:port1,host2:port2,.... Since these servers are just used for the initial connection to discover the full cluster membership (which may change dynamically), this list need not contain the full set of servers (you may want more than one, though, in case a server is down).
All brokers within a cluster share metadata about the other brokers in the same cluster. Therefore, it is sufficient to mention even a single broker in the bootstrap.servers property. However, you should still list more than one in case that broker is unavailable for whatever reason.
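So, sticking with the hostnames from the question, a producer could list just a subset of the brokers and still use all three; this is only a sketch to illustrate the point.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class SubsetBootstrapProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            // Only two of the three brokers are listed; kafka003 (and any broker added later)
            // is discovered from the cluster metadata returned during bootstrapping.
            props.put("bootstrap.servers", "kafka001:9092,kafka002:9092");
            props.put("key.serializer", StringSerializer.class.getName());
            props.put("value.serializer", StringSerializer.class.getName());
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("my-topic", "key", "value"));
            }
        }
    }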

Kafka, configuring a single vs multiple broker servers for clients?

I used to configure bootstrap.servers in my Kafka producer/consumer/stream apps with a list of broker IPs. But I'd like to move to a single URL entry that is resolved by a DNS lookup to a broker IP currently known to be up (the DNS actively checks the brokers in the cluster and responds to lookups with an IP with a short TTL [10s]). This gives me more flexibility to add brokers in the future, and I can keep the same config in my apps across all environments/stages. Is this a recommended approach, or does it remove resiliency on the client side by not having a strict list of brokers? I assume this config is only used to initially "discover" the cluster and the partition-leader brokers.
If anything, I'd say this adds a single point of failure on the single address you're providing, unless it's actually a load-balanced reverse proxy.
Another possibility that's worked somewhat well internally is using Consul service discovery, with Consul agents running on each broker. This way, you can do service discovery as well as health checks and easier monitoring setup, e.g. having Prometheus jmx_exporter on the brokers, and Prometheus Server scraping those values for all kafka.service.consul addresses.
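If you do go the single-DNS-name route, the client configuration can stay as simple as the sketch below (the hostname is made up); note that newer clients (Kafka 2.1+) also have a client.dns.lookup setting that controls whether all IPs behind a bootstrap hostname are tried.

    Properties props = new Properties();
    // One DNS name with a short TTL that resolves to brokers currently known to be up (hypothetical hostname).
    props.put("bootstrap.servers", "kafka.example.com:9092");
    // Try every IP the bootstrap name resolves to before giving up (available since Kafka 2.1).
    props.put("client.dns.lookup", "use_all_dns_ips");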

How many bootstrap servers to provide for large Kafka cluster

I have a use case where my Kafka cluster will have 1000 brokers, and I am writing a Kafka client.
In order to write the client, I need to provide a broker list.
The question is: what are the recommended guidelines for providing the broker list in the client?
Is there any proxy-like service available in Kafka that we can give to the client?
- that proxy would know all the brokers in the cluster and connect the client to the appropriate broker
- like in the Redis world, we have twemproxy (nutcracker)
- can confluent-rest-api act as a proxy?
Is it recommended to provide a specific number of brokers in the client, for example a list of 3 brokers even though the cluster has 1000 nodes?
- what if the provided brokers crash?
- what if the provided brokers restart and their location/IP changes?
The list of broker URLs you pass to the client is only used to bootstrap the client. The client then automatically learns about all other available brokers and connects to the brokers it needs to "talk to".
Thus, if the client is already running and those bootstrap brokers go down, the client will not even notice. Only if all of the listed brokers are down at the same time when you start the client will it "hang": it cannot connect to the cluster and eventually times out.
It's recommended to provide at least 3 broker URLs to "survive" the outage of 2 brokers. But you can also provide more if you need a higher level of resilience.
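In other words, for a 1000-broker cluster the client configuration might just name three of them; the broker hostnames below are invented for illustration.

    // Three arbitrary brokers out of the 1000; the rest are discovered from the metadata
    // returned by whichever of these responds first.
    props.put("bootstrap.servers", "broker001:9092,broker007:9092,broker042:9092");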

Why do we need to add all zookeeper nodes in Kafka Consumer Configuration

It looks like we need to add the IP addresses of all ZooKeeper nodes to the "zookeeper.connect" property when configuring a consumer.
Now, my understanding is that the ZooKeeper ensemble has a leader and is managed in a fail-safe way.
So why can't we just provide a bootstrap list of ZooKeeper nodes, like it is done in the producer configuration (when providing the bootstrap broker list), and have them return metadata about the entire ZooKeeper ensemble?
You can specify a subset of the nodes. The nodes in that list are only used to get an initial connection to the cluster of nodes and the client goes through the list until a connection is made. Usually the first node is up and available so the client doesn't have to go too far into the list. So you only need to add extra nodes to the list depending on how pessimistic you are.
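So, for the old consumer that still uses zookeeper.connect, listing a couple of the ensemble members is enough; the hostnames below are made up, and the client simply walks the list until one node answers.

    // Two of the, say, five ZooKeeper nodes; add more entries the more pessimistic you are.
    props.put("zookeeper.connect", "zk1:2181,zk2:2181");
    props.put("group.id", "my-consumer-group");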