Kubernetes: Kafka pod reachability issue from another pod

I know the information below may not be enough to trace the issue, but I would still like some direction.
We have an Amazon EKS cluster.
Currently, we are facing an issue with the reachability of the Kafka pods.
Environment:
10 nodes in total, across Availability Zones ap-south-1a and ap-south-1b
A three-replica Kafka cluster (Helm chart installation)
A three-replica ZooKeeper ensemble (Helm chart installation)
Kafka uses an external advertised listener on port 19092
Kafka has a Service fronted by an internal Network Load Balancer
I have deployed a test pod to check the reachability of the Kafka pods.
We are using Cloud Map based DNS for the advertised listener.
Working:
When I run a telnet command from an EC2 instance, like telnet 10.0.1.45 19092, it works as expected. IP 10.0.1.45 is a load balancer IP.
When I run a telnet command from an EC2 instance, like telnet 10.0.1.69 31899, it also works as expected. IP 10.0.1.69 is an actual node's IP and 31899 is the NodePort.
Problem:
When I run the same command from the test pod, like telnet 10.0.1.45 19092, it works sometimes, and sometimes it gives an error like telnet: Unable to connect to remote host: Connection timed out
The issue seems to be related to kube-proxy; we need help to resolve it.
Can anyone guide me?
Can I restart kube-proxy? Would that affect other pods/deployments?

I believe this problem is caused by AWS's NLB TCP-only nature (as mentioned in the comments).
In a nutshell, your pod-to-pod communication fails when hairpinning is needed.
To confirm this is the root cause, verify that when the telnet succeeds, the Kafka pod and the client pod are not on the same EC2 node, and that when they are on the same node, the telnet fails.
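A quick way to check the placement (the label below is an assumption based on a typical Helm install; adjust it to whatever labels your chart actually sets):
# show which node each pod runs on; compare the Kafka brokers with the test pod
kubectl get pods -o wide
# or narrow it down to the brokers by label
kubectl get pods -l app=kafka -o wide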
There are (at least) two approaches to tackle this issue:
Use K8s internal networking - refer to the K8s Service's internal URL
Every K8s Service has its own DNS FQDN for internal usage (meaning using the K8s network only, without reaching the LoadBalancer and coming back to K8s again). You can just telnet this name instead of going through the LB to the NodePort.
I.e. let's assume your Kafka Service is named kafka; then you can just telnet kafka.<namespace>.svc.cluster.local (on the port exposed by the kafka Service).
Use K8s anti-affinity to make sure the client and Kafka are never scheduled on the same node (see the sketch below).
Oh, and as indicated in this answer, you might need to make that Service headless.
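A minimal sketch of the anti-affinity approach, assuming the Kafka broker pods carry the label app: kafka and the client runs as a Deployment called kafka-client (both names are assumptions; check the labels your Helm chart actually sets):
# keep the client off any node that already runs a Kafka broker
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kafka-client
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kafka-client
  template:
    metadata:
      labels:
        app: kafka-client
    spec:
      affinity:
        podAntiAffinity:
          # never schedule this pod on a node that already hosts a pod labelled app=kafka
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: kafka
              topologyKey: kubernetes.io/hostname
      containers:
        - name: client
          image: busybox:1.36        # placeholder image, just for connectivity tests
          command: ["sleep", "infinity"]
EOF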

Related

How to make two local minikube clusters communicate with each other?

I have two minikube clusters (two separate profiles) running locally; call them minikube cluster A and minikube cluster B. Each of these clusters also has an ingress and a DNS associated with it locally. The DNS names are hello.dnsa and hello.dnsb. I am able to ping and nslookup both of them, just like this: https://minikube.sigs.k8s.io/docs/handbook/addons/ingress-dns/#testing
I want pod A in cluster A to be able to communicate with pod B in cluster B. How can I do that? I logged into pod A in cluster A and did telnet hello.dnsb 80, and it doesn't connect; I suspect there is no route. Similarly, I logged into pod B in cluster B and did telnet hello.dnsb 80, and it doesn't connect either. However, if I do telnet hello.dnsa 80 or telnet hello.dnsb 80 from my host machine, telnet works!
Is there any simple way to solve this problem for now? I am OK with any solution, even adding routes manually using ip route add if needed.
Skupper is a plugin available for performing these actions. It is a service interconnect that facilitates secured communication between the clusters; for more information on Skupper, go through this documentation.
There are multiple examples in which minikube is integrated with Skupper; go through this configuration documentation for more details. A rough outline of the flow is sketched below.
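Assuming two minikube profiles named cluster-a and cluster-b and a Deployment called hello running in cluster B (all of these names are assumptions, and minikube typically needs minikube tunnel running per profile so the Skupper router can obtain an external address), the flow looks roughly like this:
# site A
kubectl config use-context cluster-a
skupper init
skupper token create ~/cluster-a.token        # grants another site permission to link to A
# site B
kubectl config use-context cluster-b
skupper init
skupper link create ~/cluster-a.token         # link B to A using the token
skupper expose deployment/hello --port 80     # make B's workload reachable across the link
# pods in cluster A can now reach the exposed workload by its service name, e.g.:
# telnet hello 80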

OpenVPN access to a K8s cluster: the client can reach Pods on the host where the server is located, but cannot reach Pods on other hosts in the cluster

I deployed an OpenVPN server in the K8s cluster and an OpenVPN client on a host outside the cluster. However, from the client I can only access Pods on the host where the OpenVPN server is located; I cannot access Pods on other hosts in the cluster.
The network used by the cluster is Calico. I also added the following iptables rules to the OpenVPN server's host in the cluster:
When I captured packets on tun0 on the server, I found that the return packets never arrived.
When the server is deployed with hostNetwork, a FORWARD rule is missing from the iptables rules.
I'm not sure how you set up iptables inside the server pod, as iptables/netfilter was not accessible on most of the kube clusters I've seen.
If you want full access to cluster networking over that OpenVPN server, you probably want to use hostNetwork: true on your VPN server pod. The problem is that you still need a proper MASQUERADE/SNAT rule to get responses across to your client.
You should investigate the traffic going out of the server pod to see whether it has a properly rewritten source address; otherwise the other nodes in the cluster will have no knowledge of how to route the response (see the sketch below).
You probably have a common gateway for your nodes; depending on your kube implementation, you might get around this issue by setting a route back to your VPN, but that will likely require some scripting around the VPN server itself to make sure the route is updated each time the server pod is rescheduled.
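A minimal sketch of that MASQUERADE idea, assuming the OpenVPN client subnet is 10.8.0.0/24 and the pod CIDR is 192.168.0.0/16 (both ranges are assumptions; substitute your own), run on the node hosting the hostNetwork VPN server:
# allow traffic to be forwarded between the tun device and the node
iptables -A FORWARD -i tun0 -j ACCEPT
iptables -A FORWARD -o tun0 -j ACCEPT
# masquerade VPN client traffic destined for the pod network, so replies are
# routed back via this node instead of being dropped by the other nodes
iptables -t nat -A POSTROUTING -s 10.8.0.0/24 -d 192.168.0.0/16 -j MASQUERADE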

EKS internal service connection unreliable

I just set up a new EKS cluster (latest version available, using three nodes with the default AMI).
I deployed a Redis instance in it as a Kubernetes service and exposed it. I can access the Redis database through internal DNS, like mydatabase.redis (it's deployed in the redis namespace). From another pod I can connect to my Redis database; however, sometimes the connection takes more than 10 seconds.
It doesn't seem to be a DNS resolution issue, as host mydatabase.redis responds immediately with the service IP address. However, when I try to connect to it, for example with nc mydatabase.redis 6379 -v, it sometimes connects instantly and sometimes takes more than 10 seconds.
All my services are affected, and I don't know why. I didn't change any settings in my cluster; it's a basic EKS cluster.
How can I debug this?

K8s Istio-enabled pod can't reach regular services

I'm trying to use Istio in a K8s 1.6 cluster on AWS.
I have a Kafka pod/service running the old-fashioned way, with a "kafka-zk-broker-kafka.dev" service without a cluster IP (headless), so the kafka-zk-broker-kafka.dev service (I'm in the dev namespace) resolves to the internal names of my 3 Kafka pods. This works great.
~ # nslookup kafka-zk-broker-kafka.dev
Name: kafka-zk-broker-kafka.dev
Address 1: 10.33.0.11 kafka-zk-kafka-0.kafka-zk-broker-kafka.dev.svc.cluster.local
Address 2: 10.38.96.16 kafka-zk-kafka-2.kafka-zk-broker-kafka.dev.svc.cluster.local
Address 3: 10.40.128.13 kafka-zk-kafka-1.kafka-zk-broker-kafka.dev.svc.cluster.local
I deployed a Kafka producer application using the Istio sidecar, as it also exposes a gRPC port for internal use.
The deployment went fine, but my application can't connect to the "kafka-broker" service. DNS resolution is OK, but I can't reach the service port (TCP 9092) using either a Kafka client or telnet.
What I understand is that, when the Istio (Envoy) sidecar is deployed, everything going out of the pod goes through the Envoy proxy...
So the Envoy proxy does not know how to reach regular services?
Am I missing something? Is there a way to mix Istio/Envoy with regular K8s services?
What you are doing should work, but I think you're running into this known bug: https://github.com/istio/issues/issues/37

Kubernetes service IP isn't always accessible within cluster (with flannel)

I built a Kubernetes cluster using the flannel overlay network. The problem is that one of the service IPs isn't always accessible.
I tested within the cluster by telnetting to the service IP and port, and it ended in a connection timeout. Checking with netstat, the connection was always in the SYN_SENT state; it seemed the peer never accepted the connection.
But if I telnet directly to the IP and port of a pod backing the service, the connection is made successfully.
It only happens to one of the services; the other services are OK.
And if I scale the backend pods to a larger number, like 2, then some of the requests to the service IP succeed. It seems the service isn't able to connect to one of the backing pods.
Which component might be the cause of such a problem? My service configuration, kube-proxy, or flannel?
Check the discussion here: https://github.com/kubernetes/kubernetes/issues/38802
You need to set the sysctl net.bridge.bridge-nf-call-iptables=1 on the nodes; a sketch is below.
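A minimal sketch of applying and persisting that setting on each node (run as root; the module and file names below are the usual ones, adjust for your distribution):
modprobe br_netfilter                                   # the sysctl only exists once this module is loaded
sysctl -w net.bridge.bridge-nf-call-iptables=1          # apply immediately
# persist across reboots
echo 'net.bridge.bridge-nf-call-iptables = 1' > /etc/sysctl.d/99-bridge-nf.conf
sysctl --system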