How to expose a headless service for a StatefulSet externally in Kubernetes - apache-kafka

Using kubernetes-kafka as a starting point with minikube.
This uses a StatefulSet and a headless service for service discovery within the cluster.
The goal is to expose the individual Kafka brokers externally, which are addressed internally as:
kafka-0.broker.kafka.svc.cluster.local:9092
kafka-1.broker.kafka.svc.cluster.local:9092
kafka-2.broker.kafka.svc.cluster.local:9092
The constraint is that this external service be able to address the brokers specifically.
What's the right (or one possible) way of going about this? Is it possible to expose an external service per kafka-x.broker.kafka.svc.cluster.local:9092?

We solved this in 1.7 by changing the headless service to Type=NodePort and setting externalTrafficPolicy=Local. This bypasses the internal load balancing of a Service: traffic destined for a specific node on that node port will only work if a Kafka pod is running on that node.
apiVersion: v1
kind: Service
metadata:
  name: broker
spec:
  externalTrafficPolicy: Local
  ports:
  - nodePort: 30000
    port: 30000
    protocol: TCP
    targetPort: 9092
  selector:
    app: broker
  type: NodePort
For example, if we have two nodes, nodeA and nodeB, and nodeB is running a Kafka pod, then nodeA:30000 will not connect but nodeB:30000 will connect to the Kafka pod running on nodeB.
https://kubernetes.io/docs/tutorials/services/source-ip/#source-ip-for-services-with-typenodeport
Note this was also available in 1.5 and 1.6 as a beta annotation; more on feature availability can be found here: https://kubernetes.io/docs/tasks/access-application-cluster/create-external-load-balancer/#preserving-the-client-source-ip
Note also that while this ties a Kafka pod to a specific external network identity, it does not guarantee that your storage volume will be tied to that network identity. If you are using volumeClaimTemplates in a StatefulSet then your volumes are tied to the pod, while Kafka expects the volume to be tied to the network identity.
For example, if the kafka-0 pod restarts and comes up on nodeC instead of nodeA, kafka-0's PVC (if using volumeClaimTemplates) still holds the data written for nodeA, and the broker running in kafka-0 starts rejecting requests because it thinks it is nodeA, not nodeC.
To fix this, we are looking forward to Local Persistent Volumes, but right now we have a single PVC for our Kafka StatefulSet, and data is stored under $NODENAME on that PVC to tie volume data to a particular node.
https://github.com/kubernetes/features/issues/121
https://kubernetes.io/docs/concepts/storage/volumes/#local
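For reference, on newer clusters one way to express the "data under $NODENAME on a shared PVC" workaround declaratively is a volumeMounts entry with subPathExpr driven by the downward API. This is only a sketch under the assumption that your Kubernetes version supports subPathExpr; the claim name and paths are illustrative, not part of the original setup:
# Sketch: mount a shared PVC with a per-node subdirectory so a broker that
# lands on a given node always sees that node's data.
containers:
- name: broker
  env:
  - name: NODE_NAME              # node name injected via the downward API
    valueFrom:
      fieldRef:
        fieldPath: spec.nodeName
  volumeMounts:
  - name: kafka-data
    mountPath: /var/lib/kafka/data
    subPathExpr: $(NODE_NAME)    # one subdirectory per node on the shared volume
volumes:
- name: kafka-data
  persistentVolumeClaim:
    claimName: kafka-data        # hypothetical single PVC shared by the StatefulSet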

Solutions so far weren't quite satisfying enough for me, so I'm going to post an answer of my own. My goals:
Pods should still be dynamically managed through a StatefulSet as much as possible.
Create an external service per pod (i.e. Kafka broker) for producer/consumer clients and avoid load balancing.
Create an internal headless service so that the brokers can communicate with each other.
Starting with Yolean/kubernetes-kafka, the only thing missing is exposing the service externally, and there are two challenges in doing so:
1. Generating unique labels per broker pod so that we can create an external Service for each of the broker pods.
2. Telling the brokers to communicate with each other using the internal Service, while configuring Kafka to tell producers/consumers to communicate over the external Service.
Per pod labels and external services:
To generate labels per pod, this issue was really helpful. Using it as a guide, we add the following line to the init.sh script in 10broker-config.yml:
kubectl label pods ${HOSTNAME} kafka-set-component=${HOSTNAME}
We keep the existing headless service, but we also generate an external Service per pod using the label (I added them to 20dns.yml):
apiVersion: v1
kind: Service
metadata:
  name: broker-0
  namespace: kafka
spec:
  type: NodePort
  ports:
  - port: 9093
    nodePort: 30093
  selector:
    kafka-set-component: kafka-0
Configure Kafka with internal/external listeners
I found this issue incredibly useful in trying to understand how to configure Kafka.
This again requires updating the init.sh and server.properties entries in 10broker-config.yml as follows:
Add the following to the server.properties to update the security protocols (currently using PLAINTEXT):
listener.security.protocol.map=INTERNAL_PLAINTEXT:PLAINTEXT,EXTERNAL_PLAINTEXT:PLAINTEXT
inter.broker.listener.name=INTERNAL_PLAINTEXT
Dynamically determine the external IP and external port for each pod in init.sh:
EXTERNAL_LISTENER_IP=<your external addressable cluster ip>
EXTERNAL_LISTENER_PORT=$((30093 + ${HOSTNAME##*-}))
Then configure the listeners and advertised.listeners addresses for EXTERNAL_PLAINTEXT and INTERNAL_PLAINTEXT (also in init.sh):
sed -i "s/#listeners=PLAINTEXT:\/\/:9092/listeners=INTERNAL_PLAINTEXT:\/\/0.0.0.0:9092,EXTERNAL_PLAINTEXT:\/\/0.0.0.0:9093/" /etc/kafka/server.properties
sed -i "s/#advertised.listeners=PLAINTEXT:\/\/your.host.name:9092/advertised.listeners=INTERNAL_PLAINTEXT:\/\/$HOSTNAME.broker.kafka.svc.cluster.local:9092,EXTERNAL_PLAINTEXT:\/\/$EXTERNAL_LISTENER_IP:$EXTERNAL_LISTENER_PORT/" /etc/kafka/server.properties
Obviously, this is not a full production solution (for example, it doesn't address security for the externally exposed brokers), and I'm still refining my understanding of how to let internal producers/consumers also communicate with the brokers.
However, so far this is the best approach given my understanding of Kubernetes and Kafka.

Note: I completely rewrote this post a year after the initial posting:
1. Some of what I wrote is no longer relevant given updates to Kubernetes, and I figured it should be deleted to avoid confusing people.
2. I now know more about both Kubernetes and Kafka and should be able to do a better explanation.
Background Contextual Understanding of Kafka on Kubernetes:
Let's say a Service of type ClusterIP and a StatefulSet are used to deploy a 5-pod Kafka cluster on a Kubernetes cluster. Because a StatefulSet was used to create the pods, each of them automatically gets one of the following 5 inner-cluster DNS names, and the Kafka Service of type ClusterIP gives one more inner-cluster DNS name.
M$* kafka-0.my-kafka-headless-service.my-namespace.svc.cluster.local
M$ kafka-1.my-kafka-headless-service.my-namespace.svc.cluster.local
M * kafka-2.my-kafka-headless-service.my-namespace.svc.cluster.local
M * kafka-3.my-kafka-headless-service.my-namespace.svc.cluster.local
M$ kafka-4.my-kafka-headless-service.my-namespace.svc.cluster.local
kafka-service.my-namespace.svc.cluster.local
^ Let's say you have 2 Kafka topics: $ and *
Each Kafka topic is replicated 3 times across the 5 pod Kafka cluster
(the ASCII diagram above shows which pods hold the replicas of the $ and * topics, M represents metadata)
4 useful bits of background knowledge:
1. .svc.cluster.local is the inner-cluster DNS FQDN suffix, but pods are automatically configured with DNS search domains that fill it in, so you can omit it when talking over inner-cluster DNS.
2. kafka-x.my-kafka-headless-service.my-namespace inner cluster DNS name resolves to a single pod.
3. kafka-service.my-namespace kubernetes service of type cluster IP acts like an inner cluster Layer 4 Load Balancer, and will round-robin traffic between the 5 kafka pods.
4. A critical Kafka-specific concept to realize is that when a Kafka client talks to a Kafka cluster, it does so in 2 phases. Let's say a Kafka client wants to read the $ topic from the Kafka cluster.
Phase 1: The client reads the Kafka cluster's metadata. This is synchronized across all 5 Kafka pods, so it doesn't matter which one the client talks to; therefore it can be useful to do the initial communication using kafka-service.my-namespace (which load-balances and forwards to a random healthy Kafka pod).
Phase 2: The metadata tells the Kafka client which Kafka brokers/nodes/servers/pods have the topic of interest, in this case $ exists on 0, 1, and 4. So for Phase 2 the client will only talk directly to the Kafka brokers that have the data it needs.
How to Externally Expose Pods of a Headless Service/Statefulset and Kafka specific Nuance:
Let's say I have a 3-pod HashiCorp Consul cluster spun up on a Kubernetes cluster, I configure it so the web UI is enabled, and I want to expose that web UI externally / see it from the LAN. There's nothing special about the fact that the pods are behind a headless service: you can use a Service of type NodePort or LoadBalancer to expose them as you normally would any pods, and the NP or LB will round-robin incoming traffic between the 3 Consul pods.
Because Kafka communication happens in 2 phases, this introduces some nuances where the normal method of externally exposing the statefulset's headless service using a single service of type LB or NP might not work when you have a Kafka Cluster of more than 1 Kafka pod.
1. The Kafka client expects to speak directly to the Kafka broker during Phase 2 communication. So instead of 1 Service of type NodePort, you might want 6 Services of type NodePort/LB: 1 that round-robin load-balances traffic for Phase 1, and 5 with a 1:1 mapping to individual pods for Phase 2 communication. (If you run kubectl get pods --show-labels against the 5 Kafka pods, you'll see that each pod of the StatefulSet has a unique label, statefulset.kubernetes.io/pod-name=kafka-0, and that allows you to manually create 1 NP/LB Service that maps to 1 pod of a StatefulSet; a sketch follows this list.) (Note this alone isn't enough.)
2. When you install a Kafka cluster on Kubernetes, it's common for its default configuration to only support Kafka clients inside the Kubernetes cluster. Remember the metadata from Phase 1 of a Kafka client talking to a Kafka cluster: the Kafka cluster may have been configured so that its "advertised.listeners" are made of inner-cluster DNS names. So when a LAN client talks to an externally exposed Kafka cluster via NP/LB, it succeeds in Phase 1 but fails in Phase 2, because the metadata returned by Phase 1 gave inner-cluster DNS names as the means of communicating directly with the pods during Phase 2 communication; those aren't resolvable by clients outside the cluster and thus only work for Kafka clients inside the cluster. So it's important to configure your Kafka cluster so the "advertised.listeners" returned by the Phase 1 metadata are resolvable by clients both external and internal to the cluster.
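To make the per-pod Services from point 1 concrete, here is a rough sketch of one of them, selecting a single pod by the label the StatefulSet controller adds automatically (the Service name, namespace, and nodePort are illustrative):
# Sketch: one NodePort Service per broker pod, selected via the automatic
# statefulset.kubernetes.io/pod-name label; repeat with a unique name and
# nodePort for kafka-1 ... kafka-4.
apiVersion: v1
kind: Service
metadata:
  name: kafka-0-external
  namespace: my-namespace
spec:
  type: NodePort
  selector:
    statefulset.kubernetes.io/pod-name: kafka-0
  ports:
  - port: 9092
    nodePort: 32000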
Clarity on where the Problem caused by Kafka Nuance Lies:
For Phase 2 of communication between Kafka Client -> Broker, you need to configure the "advertised.listeners" to be externally resolvable.
This is difficult to pull off using standard Kubernetes logic, because what you need is for kafka-0 ... kafka-4 to each have a unique configuration, i.e. a unique "advertised.listeners" that's externally reachable, whereas by default StatefulSets are meant to have cookie-cutter configurations that are more or less identical.
Solution to the Problem caused by Kafka Nuances:
The Bitnami Kafka Helm chart has some custom logic that allows each pod in the StatefulSet to have a unique "advertised.listeners" configuration.
Bitnami offers hardened containers; according to Quay.io, 2.5.0 has only a single High CVE. It runs as non-root, has reasonable documentation, and can be externally exposed*: https://quay.io/repository/bitnami/kafka?tab=tags
On the last project I was on I went with Bitnami, because security was the priority and we only had Kafka clients internal to the Kubernetes cluster. I ended up having to figure out how to externally expose it in a dev environment so someone could run some kind of test, and I remember being able to get it to work; I also remember it wasn't super simple. That being said, if I were to do another Kafka-on-Kubernetes project I'd recommend looking into the Strimzi Kafka Operator, as it's more flexible in terms of options for externally exposing Kafka, and it has a great 5-part deep-dive write-up covering the different options for externally exposing a Kafka cluster running on Kubernetes using Strimzi (via NP, LB, or Ingress). (I'm not sure what Strimzi's security looks like, though, so I'd recommend using something like AnchorCLI to do a left-shift CVE scan of the Strimzi images before trying a PoC.)
https://strimzi.io/blog/2019/04/17/accessing-kafka-part-1/
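For orientation, Strimzi configures external exposure declaratively on the Kafka custom resource; a rough sketch of a nodeport-type external listener follows (the exact schema varies across Strimzi API versions, so treat every field here as illustrative and check the docs for your operator version):
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster
spec:
  kafka:
    replicas: 3
    listeners:
    - name: plain
      port: 9092
      type: internal
      tls: false
    - name: external
      port: 9094
      type: nodeport      # loadbalancer, route, and ingress are the other external options
      tls: false
    storage:
      type: ephemeral     # illustrative; use persistent storage for anything real
  zookeeper:
    replicas: 3
    storage:
      type: ephemeral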

Change the service from a headless ClusterIP into a NodePort, which forwards requests arriving at any of the nodes on a set port (30092 in my example) to port 9092 on the Kafka brokers. You would hit one of the pods at random, but I guess that is fine.
20dns.yml becomes (something like this):
# A no longer headless service to create DNS records
---
apiVersion: v1
kind: Service
metadata:
  name: broker
  namespace: kafka
spec:
  type: NodePort
  ports:
  - port: 9092
    nodePort: 30092
  # [podname].broker.kafka.svc.cluster.local
  selector:
    app: kafka
Disclaimer: You might need two services. One headless for the internal DNS names and one NodePort for the external access. I haven't tried this myself.
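For what it's worth, a sketch of that two-service variant would look something like this (the name of the second Service is made up; the headless one keeps creating the per-pod DNS records):
# Headless service for [podname].broker.kafka.svc.cluster.local records
---
apiVersion: v1
kind: Service
metadata:
  name: broker
  namespace: kafka
spec:
  clusterIP: None
  ports:
  - port: 9092
  selector:
    app: kafka
---
# Separate NodePort service for external access
apiVersion: v1
kind: Service
metadata:
  name: outside-broker
  namespace: kafka
spec:
  type: NodePort
  ports:
  - port: 9092
    nodePort: 30092
  selector:
    app: kafka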

From the kubernetes-kafka documentation:
Outside access with hostport
An alternative is to use the hostport for the outside access. When
using this only one kafka broker can run on each host, which is a good
idea anyway.
In order to switch to hostport the kafka advertise address needs to be
switched to the ExternalIP or ExternalDNS name of the node running the
broker. In kafka/10broker-config.yml switch to
OUTSIDE_HOST=$(kubectl get node "$NODE_NAME" -o jsonpath='{.status.addresses[?(@.type=="ExternalIP")].address}')
OUTSIDE_PORT=${OutsidePort}
and in kafka/50kafka.yml add the hostport:
- name: outside
  containerPort: 9094
  hostPort: 9094
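In context, that port entry sits next to the broker's normal port in the container spec of 50kafka.yml; roughly like this, with surrounding fields elided and the "inside" port name assumed rather than copied from the repo:
# Sketch: container ports of the Kafka StatefulSet with the extra hostPort.
containers:
- name: broker
  ports:
  - name: inside
    containerPort: 9092
  - name: outside
    containerPort: 9094
    hostPort: 9094        # binds 9094 on the node itself, hence one broker per host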

I solved this problem by creating a separate StatefulSet for each broker and a separate Service of type NodePort for each broker. Internal communication can happen via each individual service name; external communication can happen via the NodePort address.

Related

Migrate Schema Registry from VM to k8s with zero downtime

I want to migrate 3 instances of Schema Registry from VMs (with Kafka leader election - not zookeeper) to docker containers running in kubernetes with zero downtime.
Is there any way to check which instances are part of the schema-registry cluster?
Should I expose the k8s instances as services + ingress for each pod?
How can I expose the schema registry so it is reachable from outside k8s?
Should I move Kafka into k8s first?
The problem is that Kafka can't reach the k8s network/nodes:
[2023-01-20 13:33:34,778] ERROR Failed to send HTTP request to endpoint: http://10.100.102.139:18081/subjects/Alarm/versions (io.confluent.kafka.schemaregistry.client.rest.RestService)
java.net.SocketTimeoutException: connect timed out
What env variable should I use in order to expose a DNS name instead of an IP (10.100.102.139)? Do I need one DNS name for each instance?
which instances are part of the schema-registry cluster
You would compare the kafkastore.bootstrap.servers + kafkastore.topic + schema.registry.group.id properties of each instance's schema-registry.properties (or env vars for the Docker container). If they match, they are part of the same Registry cluster. The latter two have default values, so they may not be set.
expose the k8s instances as services + ingress for each pod
Depends where you need to access the Registry from. If you don't need external-cluster access, then you don't need an Ingress.
expose schema registry so it can be reachable from outside k8s
See above. That's what an Ingress does, in combination with a LoadBalancer or ClusterIP + NodePort configuration spec.
move kafka first into k8s ?
Up to you. That's not a requirement for running the Registry.
kafka can't reach the k8s network/nodes
The broker doesn't need to communicate with the Registry, only the clients do.
what env variable to use in order to expose a DNS
You wouldn't. Your IngressController would be configured to a DNS server, such as ALB IngressController + ExternalDNS w/ AWS. Then you provide that FQDN as schema.registry.url in your apps.
I suggest trying a simpler HTTP server first.
do I need one DNS for each instance ?
Kubernetes does that internally for the pods, but your external DNS address would only be for the Ingress pod. E.g. for the nginx IngressController, the DNS entry would direct traffic at an nginx pod, which runs a reverse proxy to the other pods.
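As a concrete illustration of the Ingress approach described above, a minimal sketch follows; the hostname, Service name, and port are assumptions about your deployment, not something prescribed by Schema Registry itself:
# Sketch: route external HTTP traffic to the Schema Registry Service; clients
# outside the cluster then use this hostname as schema.registry.url.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: schema-registry
spec:
  ingressClassName: nginx               # whichever IngressController you run
  rules:
  - host: schema-registry.example.com   # backed by your external DNS setup
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: schema-registry       # illustrative Service name
            port:
              number: 8081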

Exposing Multiple Confluent Kafka Brokers Publicly using ingress-nginx Ingress Controller

I am trying to expose cp-kafka brokers publicly using ingress-nginx, and I happened to see this Stack Overflow question. The answers only show how one broker is exposed outside the cluster. Say I have 3 brokers running: how can I expose all 3 Kafka brokers using the nginx ingress controller?
I was able to fix it by changing the tcp-services configMap data to the one below.
31090: "default/demo-cp-kafka-0-nodeport:19092"
Kafka uses a binary protocol, so you cannot use HTTP routing.
You would need to expose the brokers on separate ports. Read this page in the ingress-nginx docs: Exposing TCP and UDP services. The answer you linked already explains how to do this for one port/service. Now all you have to do is expose two more ports. Since you cannot open a port number more than once, you need to expose every broker on a separate port.
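Putting that together, the tcp-services ConfigMap ends up with one entry per broker, each mapping a distinct port on the ingress-nginx controller to a broker-specific Service. The 31091/31092 ports and the -1/-2 Service names below are assumptions that simply mirror the pattern of the first entry; as the linked docs explain, the controller's own Service also needs those extra ports opened:
# Sketch: one TCP port per broker in ingress-nginx's tcp-services ConfigMap.
apiVersion: v1
kind: ConfigMap
metadata:
  name: tcp-services
  namespace: ingress-nginx
data:
  31090: "default/demo-cp-kafka-0-nodeport:19092"
  31091: "default/demo-cp-kafka-1-nodeport:19092"
  31092: "default/demo-cp-kafka-2-nodeport:19092"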

Can Spark Cassandra Connector resolve hostnames from a headless service in a K8S environment?

Datastax Spark Cassandra Connector takes "spark.cassandra.connection.host" for connecting to cassandra cluster.
Can we provide the headless service of the C* cluster in a K8s environment as the host for this parameter ("spark.cassandra.connection.host")?
Will it resolve the contact points?
What is the preferred way of connecting with C* cluster on the K8s environment with Spark Cassandra Connector?
By default, SCC resolves all provided contact points into IP addresses on the first connect, and then only uses these IP addresses for reconnection. After the initial connection has happened, it discovers the rest of the cluster. Usually this is not a problem, as SCC should receive notifications about nodes going up and down and track the nodes' IP addresses. But in practice it can happen that nodes are restarted too fast and notifications are not received, so Spark jobs that use SCC can get stuck trying to connect to IP addresses that aren't valid anymore - I hit this multiple times on DC/OS.
This problem is solved with the release of SCC 2.5.0, which includes a fix for SPARKC-571. It introduced a new configuration parameter, spark.cassandra.connection.resolveContactPoints, which, when set to false (it is true by default), will always use hostnames of the contact points for both the initial connection and reconnection, avoiding the problems with changed IP addresses.
So on K8s I would try to use this configuration parameter with just a normal Cassandra deployment.
Yes, why not? There is a good example in the official Kubernetes documentation. You create a headless service with a selector:
apiVersion: v1
kind: Service
metadata:
  labels:
    app: cassandra
  name: cassandra
spec:
  clusterIP: None
  ports:
  - port: 9042
  selector:
    app: cassandra
Basically, when you specify spark.cassandra.connection.host=cassandra (in the same K8s namespace; otherwise you have to provide cassandra.<namespace>.svc.cluster.local), it will resolve to the Cassandra contact points (the pod IP addresses where Cassandra is running).
✌️

Kafka on kubernetes cluster with Istio

I have k8s cluster with Istio v1.6.4. The sidecar injection is disabled by default.
I have a Kafka cluster running on this k8s, installed with the Strimzi Kafka operator.
The Kafka cluster works without any problems when neither the Kafka pods nor the client pods have the Istio proxy injected.
My problem:
When I create a pod with a Kafka client and the Istio proxy injected, I can't connect to the Kafka cluster.
The logs on client side:
java.io.IOException: Connection reset by peer
and on the server side:
org.apache.kafka.common.network.InvalidReceiveException: Invalid receive (size = 369295616 larger than 104857600)
After some googling and checking the Istio-proxy logs, it turns out the problem is that the Istio proxy connects to the Kafka plaintext endpoint with TLS.
I can work around this by setting the default PeerAuthentication with mtls.mode: DISABLE, but I don't want to set that globally.
What is strange is that if I create a simple k8s service and run a netcat "server" on the pod running the Kafka server and a netcat "client" on the pod running the Kafka client, everything works fine.
I have 2 questions:
1. Why does the Istio proxy behave differently when connecting to the Kafka cluster than for other TCP connections (like using nc)?
2. How can I disable mTLS for one host only? I was playing with PeerAuthentication but no luck...
With jt97's help I was able to solve this problem.
As I wrote, I'm using the Strimzi operator to install the Kafka cluster on k8s. It creates 2 services:
kafka-bootstrap - which is a regular service with ClusterIP
kafka-brokers - a headless service.
In my case the full name of the services are kafka-kafka-operated-kafka-bootstrap and kafka-kafka-operated-kafka-brokers, respectively.
I created a DestinationRule:
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: kafka-no-tls
spec:
  host: "kafka-kafka-operated-kafka-brokers"
  trafficPolicy:
    tls:
      mode: DISABLE
and used the headless service when connecting to kafka:
kafka-topics --bootstrap-server kafka-kafka-operated-kafka-brokers:9092 --list
__consumer_offsets
_schemas
and it worked as expected.
BTW, setting tls.mode to SIMPLE didn't help.
To be honest, I still don't understand why in this particular case the Istio proxy by default (without a DestinationRule) tries to connect with TLS - according to the documentation:
By default, Istio tracks the server workloads migrated to Istio proxies, and configures client proxies to send mutual TLS traffic to those workloads automatically, and to send plain text traffic to workloads without sidecars.

Setup statsd-exporter as daemon on Kubernetes and send metrics to it from pods

I want to set up statsd-exporter as a DaemonSet on my Kubernetes cluster. It exposes UDP port 9125, on which applications can send metrics using a statsD client library. Prometheus can scrape this exporter for application or system metrics. I want to send the metrics to the UDP server running in the exporter on port 9125. I have two options:
Expose a service as ClusterIP for the DaemonSet and then configure the statsD clients to use that IP and Port for sending metrics
Make the statsd-exporter run on hostNetwork, and somehow enable the pods to send metrics to exporter running on the same node.
Somehow, option 2 seems better, since my pods will be sending metrics to an exporter running on the same node, but I am not able to send metrics to the local pod of statsd-exporter since I don't have the IP of the node the pod is running on.
Can you please compare the pros and cons of both methods, and suggest how I can find the IP address of the node on which the pod is running alongside the exporter?
EDIT 1
I am able to get the node IP by adding the following environment variable:
- name: NODE_IP
  valueFrom:
    fieldRef:
      fieldPath: status.hostIP
I still need clarity on which will be the better approach to setting this up: exposing a service of type ClusterIP and then using the HOST:PORT from an environment variable in the pod, or using hostNetwork: true in the pod spec and then accessing it via the NODE_IP environment variable. Does ClusterIP guarantee that the packet will be routed to the same node as the pod sending the packet?
EDIT 2
I explored headless services and I think this comes closest to what I want, but I am not sure that DNS resolution will return the local node's IP as the first result in nslookup.
I would suggest one of the two approaches below; both have their pros and cons.
Fast, but may not be entirely secure
A DaemonSet using hostPort. It would be fast as both pods would be on the same node, but the statsd port would be exposed. (You need to secure statsd some other way.)
Not as fast as hostPort, but secure
Exposing a Service and using the service DNS name to connect (servicename.namespace.svc.cluster.local). Not as fast as hostPort, as there is no way to reach a specific pod, but secure, as no one from outside the cluster can hit statsd.
More details: https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/#communicating-with-daemon-pods
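For reference, a minimal sketch of the first (hostPort) approach, combining the DaemonSet port with the status.hostIP trick from the question; the image, names, and labels are illustrative:
# Sketch: statsd-exporter as a DaemonSet binding UDP 9125 on every node.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: statsd-exporter
spec:
  selector:
    matchLabels:
      app: statsd-exporter
  template:
    metadata:
      labels:
        app: statsd-exporter
    spec:
      containers:
      - name: statsd-exporter
        image: prom/statsd-exporter   # illustrative image
        ports:
        - containerPort: 9125
          hostPort: 9125              # exposed on each node; secure it separately
          protocol: UDP
Application pods can then reach their local exporter through the node IP from the downward API (the NODE_IP env var shown in EDIT 1) and send StatsD metrics to $(NODE_IP):9125 over UDP.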