Exposing a public kafka cluster - apache-kafka

If I were to create a public Kafka cluster that accepts messages from multiple clients, with the messages processed purely by a separate backend, what would be the right way to design it?
As a more concrete example, let's say I have 50 Kafka brokers. How do I:
Configure clients without manually adding the IPs of all 50 Kafka brokers?
Load-balance messages across Kafka brokers based on load, if possible?
Set up additional clients with quotas in an easier or automated way?

You can use HashiCorp Consul, an open-source service discovery tool, to register your Kafka brokers. Clients then get a single endpoint, so you don't need to list multiple brokers in each client. Several other open-source tools are available as well.
For balancing there are a few options: use the Kafka Assigner tool to balance the traffic, or the open-source Kafka Cruise Control to balance the cluster for you automatically.

Related

Kafka Post Deployment - Handling ever-growing clients

We have set up a Kafka cluster for high availability and distributed data load. The current consumers and producers specify all the broker IP addresses to connect to the cluster. In the future, we will need to continuously monitor the cluster and add new brokers based on collected metrics and overall system performance. If a broker crashes, we have to add a new broker with a different IP as soon as possible.
In these scenarios, we have to change all client configurations, which is a time-consuming and stressful operation.
I think we could set up a config server (e.g. Spring Cloud Config Server) to specify all the broker IP addresses centrally, so that we only have to change them in one place without touching all the clients, but I don't know if it is the best approach. Obviously, the clients must be programmed to get the broker list from the config server.
Is there a better approach?
It's worth pointing out that the "bootstrap" process doesn't require giving every single broker address to the clients. Only the first available address in the list is used for the initial connection; after that, the clients use the addresses that each broker in the cluster publishes in its advertised.listeners config.
The answer to your question is yes, use service discovery. That could be Spring Cloud Config, but the more general option would be HashiCorp Consul or another service that uses DNS (Kubernetes uses CoreDNS by default, for example, or AWS Route 53).
Then you edit /etc/resolv.conf on each machine the client runs on (assuming Linux) to include the DNS servers, and you can simply refer to kafka.your.domain:9092 rather than using IP addresses.
You could use a load balancer (with a friendly DNS name like kafka.domain.com) that points to all of your brokers. We do this in our environment. Your clients then connect to kafka.domain.com:9092.
When you add new brokers, you change only the load balancer endpoints, not the client configuration.
Additionally, please note that you only need to connect to one bootstrap broker and don't have to list all of them in the client configuration.
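To make the DNS idea concrete, here is a minimal sketch of what a single name such as kafka.domain.com can fan out to. The function name expand_bootstrap and the injectable resolver parameter are illustrative assumptions, not part of any Kafka client; a real client resolves the name itself, so this is only a way to inspect what sits behind it.

```python
import socket


def expand_bootstrap(host, port, resolver=socket.getaddrinfo):
    """Resolve one DNS name into a sorted, de-duplicated 'ip:port' list.

    `resolver` is injectable so the function can be exercised without
    network access; it defaults to the standard library resolver.
    Only IPv4 (A records) is handled, to keep the sketch short.
    """
    infos = resolver(host, port, family=socket.AF_INET, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # for AF_INET, sockaddr is an (address, port) pair.
    addresses = sorted({info[4][0] for info in infos})
    return ",".join(f"{address}:{port}" for address in addresses)
```

Calling expand_bootstrap("kafka.domain.com", 9092) against your own DNS would list the addresses currently registered behind the name; the clients themselves keep using only kafka.domain.com:9092.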

How to expand confluent cloud kafka cluster?

I have set up a confluent cloud multizone cluster and it got created with just one bootstrap server. There was no setting for choosing number of servers while creating the cluster. Even after creation, I can’t edit the number of bootstrap servers.
I want to know how to increase the number of servers in confluent cloud kafka cluster.
Under the hood, the Confluent Cloud cluster is already running multiple brokers. Depending on your cluster configuration (specifically, whether you're running Standard or Dedicated, and what region and cloud you're in), the cluster will have between six and several dozen brokers.
The way a Kafka client bootstrap server config works is that the client reaches out to the bootstrap server and requests a list of all brokers, and then uses those broker endpoints to actually produce/consume from Kafka (reference: https://jaceklaskowski.gitbooks.io/apache-kafka/content/kafka-properties-bootstrap-servers.html)
In Confluent Cloud, the provided bootstrap server is actually a load balancer in front of all of the brokers; when the client connects to the bootstrap server it'll receive the actual endpoints for all of the actual brokers, and then use that for subsequent connections.
So TL;DR, in your client, you only need to specify the one bootstrap server; under the hood, the Kafka client will connect to the (many) brokers running in Confluent Cloud, and it should all just work.
Source: I work at Confluent.
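The bootstrap behaviour described above can be illustrated with a toy model. This is only a sketch of the idea, not the real client or wire protocol: the CLUSTER table, the host names, and the functions fetch_metadata and bootstrap are all hypothetical.

```python
# Toy model of client bootstrapping: one bootstrap address in,
# the full list of advertised broker endpoints out.

CLUSTER = {  # hypothetical metadata: broker id -> advertised listener
    1: "b1.example.internal:9092",
    2: "b2.example.internal:9092",
    3: "b3.example.internal:9092",
}


def fetch_metadata(bootstrap_address, reachable):
    """Return all advertised listeners if the bootstrap address answers."""
    if bootstrap_address not in reachable:
        raise ConnectionError(f"cannot reach {bootstrap_address}")
    return sorted(CLUSTER.values())


def bootstrap(bootstrap_servers, reachable):
    """Try each bootstrap address in order; return the full broker list."""
    for address in bootstrap_servers.split(","):
        try:
            return fetch_metadata(address.strip(), reachable)
        except ConnectionError:
            continue  # fall through to the next bootstrap address
    raise ConnectionError("no bootstrap server reachable")


# A single load-balanced bootstrap address is enough to learn every broker.
brokers = bootstrap("lb.example.com:9092", reachable={"lb.example.com:9092"})
print(brokers)
```

After this step the toy client would open connections to the returned endpoints directly, which is why those advertised addresses must be reachable from the client.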

Read/Write with Nifi to Kafka in Cloudera Data Platform CDP public cloud

NiFi and Kafka are now both available in Cloudera Data Platform (CDP) Public Cloud. NiFi is great at talking to everything and Kafka is a mainstream message bus, so I wondered:
What are the minimal steps needed to produce/consume data to Kafka from Apache NiFi within CDP Public Cloud?
Ideally, I am looking for steps that work in any cloud, for instance Amazon AWS and Microsoft Azure.
I am satisfied with answers that follow best practices and work with the default configuration of the platform, but if there are common alternatives these are welcome as well.
There will be multiple form factors available in the future. For now, I will assume you have an environment that contains one Data Hub with NiFi and one Data Hub with Kafka. (The answer still works if both are on the same Data Hub.)
Prerequisites
Data Hub(s) with NiFi and Kafka
Permission to access these (e.g. add processor, create Kafka topic)
Know your Workload User Name (CDP Management Console > click your name (bottom left) > click Profile)
You should have set your Workload Password in the same location
These steps allow you to Produce data from NiFi to Kafka in CDP Public Cloud
Unless mentioned otherwise, I have kept everything to its default settings.
In Kafka Data Hub Cluster:
Gather the FQDNs of the brokers and the ports they use.
If you have Streams Messaging Manager: Go to the brokers tab to see the FQDN and port already together.
If you cannot use Streams Messaging Manager: Go to the hardware tab of your Data Hub with Kafka and get the FQDN of the relevant nodes. (Currently these are called broker.) Then add :portnumber behind each one. The default port is 9093.
Combine the entries in this format: FQDN:port,FQDN:port,FQDN:port. It should now look something like this:
broker1.abc:9093,broker2.abc:9093,broker3.abc:9093
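If you have many brokers, combining the list by hand is error-prone. A minimal sketch of the same step follows; the helper name build_broker_list is hypothetical, and the default port is the 9093 mentioned above.

```python
def build_broker_list(fqdns, port=9093):
    """Join broker FQDNs into the 'FQDN:port,FQDN:port' form NiFi expects."""
    return ",".join(f"{fqdn.strip()}:{port}" for fqdn in fqdns)


print(build_broker_list(["broker1.abc", "broker2.abc", "broker3.abc"]))
# prints broker1.abc:9093,broker2.abc:9093,broker3.abc:9093
```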
In NiFi GUI:
Make sure you have some data in NiFi to produce, for example by using the GenerateFlowFile processor
Select the relevant processor for writing to Kafka, for example PublishKafka_2_0, and configure it as follows:
Settings
Automatically terminate relationships: Tick both success and failure
Properties
Kafka Brokers: The combined list we created earlier
Security Protocol: SASL_SSL
SASL Mechanism: PLAIN
SSL Context Service: Default NiFi SSL Context Service
Username: your Workload User Name (see prerequisites above)
Password: your Workload Password
Topic Name: dennis
Use Transactions: false
Max Metadata Wait Time: 30 sec
Connect your GenerateFlowFile processor to your PublishKafka_2_0 processor and start the flow
These are the minimal steps; a more extensive explanation can be found in the Cloudera documentation. Note that it is best practice to create topics explicitly (this example relies on Kafka's ability to create a topic automatically when it is first produced to).
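For comparison, the security-related processor properties above correspond roughly to the standard Kafka client settings below. This is a sketch under assumptions: the mapping from NiFi property names to client property names is my reading rather than something stated in the steps, the helper names are hypothetical, and truststore settings (handled in NiFi by the SSL Context Service) are left out.

```python
def sasl_plain_properties(brokers, username, password):
    """Build Kafka client properties for SASL_SSL with the PLAIN mechanism.

    Assumes the username and password contain no double quotes.
    """
    jaas = (
        "org.apache.kafka.common.security.plain.PlainLoginModule required "
        f'username="{username}" password="{password}";'
    )
    return {
        "bootstrap.servers": brokers,      # NiFi: Kafka Brokers
        "security.protocol": "SASL_SSL",   # NiFi: Security Protocol
        "sasl.mechanism": "PLAIN",         # NiFi: SASL Mechanism
        "sasl.jaas.config": jaas,          # NiFi: Username / Password
    }


def to_properties_text(props):
    """Render the settings as key=value lines for a .properties file."""
    return "\n".join(f"{key}={value}" for key, value in props.items())
```

Passing the combined broker list and your workload credentials to sasl_plain_properties and rendering the result with to_properties_text gives a file that command-line Kafka tools could use to check the same connection outside NiFi.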
These steps allow you to Consume data with NiFi from Kafka in CDP Public Cloud
A good way to check that data was written to Kafka is to consume it again.
In NiFi GUI:
Create a Kafka consumption processor, for instance ConsumeKafka_2_0, configure its Properties as follows:
Kafka Brokers, Security Protocol, SASL Mechanism, SSL Context Service, Username, Password, Topic Name: All the same as in our producer example above
Consumer Group: 1
Offset Reset: earliest
Create another processor, or a funnel to send the messages to, and start the consumption processor.
And that is it, within 30 seconds you should see that the data that you published to Kafka is now flowing into NiFi again.
Full disclosure: I am an employee of Cloudera, the driving force behind NiFi.

How many bootstrap servers to provide for large Kafka cluster

I have a use case where my Kafka cluster will have 1000 brokers, and I am writing a Kafka client.
To write the client, I need to provide a broker list.
The question is: what are the recommended guidelines for providing the broker list in the client?
Is there any proxy-like service available in Kafka that we can give to the client?
- that proxy would know all the brokers in the cluster and connect the client to the appropriate broker.
- like in the Redis world, where we have twemproxy (nutcracker)
- can confluent-rest-api act as a proxy?
Is it recommended to provide a specific number of brokers in the client, for example a list of 3 brokers even though the cluster has 1000 nodes?
- what if the provided brokers crash?
- what if the provided brokers restart and their location/IP changes?
The list of broker URLs you pass to the client is only used to bootstrap the client. The client will automatically learn about all other available brokers and connect to the correct brokers it needs to "talk to".
Thus, if the client is already running and those brokers go down, the client will not even notice. Only if all those brokers are down at the same time and you start up the client will the client "hang", as it cannot connect to the cluster, and eventually time out.
It's recommended to provide at least 3 broker URLs to "survive" the outage of 2 brokers. But you can also provide more if you need a higher level of resilience.
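The reasoning behind listing three brokers can be sketched as a simple check: a starting client needs at least one listed bootstrap broker to be up. The function can_start and the broker names are hypothetical, and real clients add retries and timeouts on top of this.

```python
def can_start(bootstrap_servers, down):
    """A new client can bootstrap if any listed broker is still up."""
    return any(broker not in down for broker in bootstrap_servers)


bootstrap_servers = ["b1:9092", "b2:9092", "b3:9092"]

# Two of the three listed brokers are down: startup still works.
print(can_start(bootstrap_servers, down={"b1:9092", "b2:9092"}))  # True

# All listed brokers are down: a starting client has nowhere to connect.
print(can_start(bootstrap_servers, down=set(bootstrap_servers)))  # False
```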

Apache Kafka consumer groups and microservices running on Kubernetes, are they compatible?

So far, I have been using Spring Boot apps (with Spring Cloud Stream) and Kafka running without any supporting infrastructure (PaaS).
Since our corporate platform is running on Kubernetes we need to move those Spring Boot apps into K8s to allow the apps to scale and so on. Obviously there will be more than one instance of every application so we will define a consumer group per application to ensure the unique delivery and processing of every message.
Kafka will be running outside Kubernetes.
Now my doubt is: since the apps deployed on k8s are accessed through the k8s service that abstracts the underlying pods, and individual application pods can't be accessed directly from outside the k8s cluster, Kafka won't know how to call individual instances of the consumer group to deliver the messages, will it?
How can I make them work together?
Kafka brokers do not push data to clients. Rather clients poll() and pull data from the brokers. As long as the consumers can connect to the bootstrap servers and you set the Kafka brokers to advertise an IP and port that the clients can connect to and poll() then it will all work fine.
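The pull model can be shown with a toy in-memory example. It is not the Kafka API: ToyPartition and ToyConsumer are hypothetical stand-ins that only demonstrate that the consumer initiates every request and tracks its own offset, so the broker never needs to reach the consumer.

```python
class ToyPartition:
    """Stand-in for a broker-side log; it only answers fetch requests."""

    def __init__(self):
        self.records = []

    def append(self, value):
        self.records.append(value)

    def fetch(self, offset, max_records):
        return self.records[offset:offset + max_records]


class ToyConsumer:
    """Stand-in for a consumer; it calls the partition, never the reverse."""

    def __init__(self, partition):
        self.partition = partition
        self.offset = 0  # position of the next record to read

    def poll(self, max_records=2):
        batch = self.partition.fetch(self.offset, max_records)
        self.offset += len(batch)
        return batch


partition = ToyPartition()
for value in ["a", "b", "c"]:
    partition.append(value)

consumer = ToyConsumer(partition)
print(consumer.poll())  # ['a', 'b']
print(consumer.poll())  # ['c']
print(consumer.poll())  # []
```

Because only outbound connections from the pods are needed, the k8s service in front of the pods plays no part in message delivery.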
Can Spring Cloud Data Flow solve your requirement to control the number of instances deployed?
There is also a community-released Spring Cloud Data Flow server for OpenShift:
https://github.com/donovanmuller/spring-cloud-dataflow-server-openshift