Network connectivity/DNS issues on a GKE 1.10 kubernetes cluster

I'm running into DNS issues on a GKE 1.10 kubernetes cluster. Occasionally pods start without any network connectivity. Restarting the pod tends to fix the issue.
Here's the result of the same few commands inside a container without network, and one with.
BROKEN:
kc exec -it -n iotest app1-b67598997-p9lqk -c userapp sh
/app $ nslookup www.google.com
nslookup: can't resolve '(null)': Name does not resolve
/app $ cat /etc/resolv.conf
nameserver 10.63.240.10
search iotest.svc.cluster.local svc.cluster.local cluster.local c.myproj.internal google.internal
options ndots:5
/app $ curl -I 10.63.240.10
curl: (7) Failed to connect to 10.63.240.10 port 80: Connection refused
/app $ netstat -antp
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 127.0.0.1:8001 0.0.0.0:* LISTEN 1/python
tcp 0 0 ::1:50051 :::* LISTEN 1/python
tcp 0 0 ::ffff:127.0.0.1:50051 :::* LISTEN 1/python
WORKING:
kc exec -it -n iotest app1-7d985bfd7b-h5dbr -c userapp sh
/app $ nslookup www.google.com
nslookup: can't resolve '(null)': Name does not resolve
Name: www.google.com
Address 1: 74.125.206.147 wk-in-f147.1e100.net
Address 2: 74.125.206.105 wk-in-f105.1e100.net
Address 3: 74.125.206.99 wk-in-f99.1e100.net
Address 4: 74.125.206.104 wk-in-f104.1e100.net
Address 5: 74.125.206.106 wk-in-f106.1e100.net
Address 6: 74.125.206.103 wk-in-f103.1e100.net
Address 7: 2a00:1450:400c:c04::68 wk-in-x68.1e100.net
/app $ cat /etc/resolv.conf
nameserver 10.63.240.10
search iotest.svc.cluster.local svc.cluster.local cluster.local c.myproj.internal google.internal
options ndots:5
/app $ curl -I 10.63.240.10
HTTP/1.1 404 Not Found
date: Sun, 29 Jul 2018 15:13:47 GMT
server: envoy
content-length: 0
/app $ netstat -antp
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 127.0.0.1:15000 0.0.0.0:* LISTEN -
tcp 0 0 0.0.0.0:15001 0.0.0.0:* LISTEN -
tcp 0 0 127.0.0.1:8001 0.0.0.0:* LISTEN 1/python
tcp 0 0 10.60.2.6:56508 10.60.48.22:9091 ESTABLISHED -
tcp 0 0 127.0.0.1:57768 127.0.0.1:50051 ESTABLISHED -
tcp 0 0 10.60.2.6:43334 10.63.255.44:15011 ESTABLISHED -
tcp 0 0 10.60.2.6:15001 10.60.45.26:57160 ESTABLISHED -
tcp 0 0 10.60.2.6:48946 10.60.45.28:9091 ESTABLISHED -
tcp 0 0 127.0.0.1:49804 127.0.0.1:50051 ESTABLISHED -
tcp 0 0 ::1:50051 :::* LISTEN 1/python
tcp 0 0 ::ffff:127.0.0.1:50051 :::* LISTEN 1/python
tcp 0 0 ::ffff:127.0.0.1:50051 ::ffff:127.0.0.1:49804 ESTABLISHED 1/python
tcp 0 0 ::ffff:127.0.0.1:50051 ::ffff:127.0.0.1:57768 ESTABLISHED 1/python
These pods are identical, just one was restarted.
Does anyone have advice about how to analyse and fix this issue?
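One way to compare the two netstat listings above is to diff the sockets in LISTEN state. This is only a minimal sketch and not part of the original post: the helper name `listening_sockets` is made up for illustration, and the sample strings are shortened copies of the output shown above.

```python
def listening_sockets(netstat_output: str) -> set:
    """Return the local addresses of all TCP sockets in LISTEN state."""
    listeners = set()
    for line in netstat_output.splitlines():
        fields = line.split()
        # netstat -antp columns: Proto Recv-Q Send-Q Local Foreign State PID/Program
        if len(fields) >= 6 and fields[0].startswith("tcp") and fields[5] == "LISTEN":
            listeners.add(fields[3])
    return listeners


# Shortened copies of the listings from the broken and the working pod.
BROKEN = """\
tcp 0 0 127.0.0.1:8001 0.0.0.0:* LISTEN 1/python
tcp 0 0 ::1:50051 :::* LISTEN 1/python
tcp 0 0 ::ffff:127.0.0.1:50051 :::* LISTEN 1/python
"""

WORKING = """\
tcp 0 0 127.0.0.1:15000 0.0.0.0:* LISTEN -
tcp 0 0 0.0.0.0:15001 0.0.0.0:* LISTEN -
tcp 0 0 127.0.0.1:8001 0.0.0.0:* LISTEN 1/python
tcp 0 0 10.60.2.6:56508 10.60.48.22:9091 ESTABLISHED -
tcp 0 0 ::1:50051 :::* LISTEN 1/python
tcp 0 0 ::ffff:127.0.0.1:50051 :::* LISTEN 1/python
"""

# Listeners present in the working pod but absent from the broken one.
missing = sorted(listening_sockets(WORKING) - listening_sockets(BROKEN))
print(missing)
```

On these samples the difference is the two listeners on ports 15000 and 15001. Together with the `server: envoy` header in the working pod's curl output, that may point at a sidecar proxy that did not start in the broken pod, though this is an inference from the listings rather than a confirmed cause.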

Some steps to try:
1) Run ifconfig eth0 (or whatever the primary interface is).
Is the interface up? Are the TX and RX packet counts increasing?
2) If the interface is up, run tcpdump while you execute the nslookup command you posted, and check whether the DNS request packets are actually being sent out.
3) Check which node the pod is scheduled on when network connectivity breaks. Is it the same node every time? If so, are other pods on that node running into similar problems?
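If the container image has Python (the pods in the question run a python process), the resolver side of these checks can be sketched as follows. This is an illustrative sketch only: the function names `parse_resolv_conf` and `can_resolve` are assumptions, and the sample text is the resolv.conf content from the question.

```python
import socket


def parse_resolv_conf(text: str) -> dict:
    """Extract nameservers, search domains, and options from resolv.conf text."""
    config = {"nameservers": [], "search": [], "options": []}
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0].startswith(("#", ";")):
            continue  # skip blank lines and comments
        if fields[0] == "nameserver" and len(fields) > 1:
            config["nameservers"].append(fields[1])
        elif fields[0] == "search":
            config["search"] = fields[1:]
        elif fields[0] == "options":
            config["options"].extend(fields[1:])
    return config


def can_resolve(hostname: str) -> bool:
    """Return True if the system resolver returns at least one address."""
    try:
        return bool(socket.getaddrinfo(hostname, None))
    except socket.gaierror:
        return False


# resolv.conf content as posted in the question.
SAMPLE = """\
nameserver 10.63.240.10
search iotest.svc.cluster.local svc.cluster.local cluster.local c.myproj.internal google.internal
options ndots:5
"""

config = parse_resolv_conf(SAMPLE)
print(config["nameservers"], config["options"])
```

Inside a pod, calling `can_resolve("www.google.com")` in both the broken and the working container would show whether the failure is in name resolution itself, while the parsed nameserver address tells you which IP to watch for in tcpdump.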

I also faced the same problem, and I simply worked around it for now by switching to the 1.9.x GKE version (after spending many hours trying to debug why my app wasn't working).
Hope this helps!

Related

ConnectionError: ('Connection aborted.', error(104, 'Connection reset by peer')) in python while placing GET request

In my python script, when it tries to place a GET request from remote machine to a service running inside a kubernetes pod (test-pod), I am getting the below error:
unapiflaskapp.get_from_server: ERROR: GET from server: request 'https://test-pod:9906/statuses' got error
ConnectionError: ('Connection aborted.', error(104, 'Connection reset by peer'))
The service running inside test-pod listens on port 9906.
Netstat output inside the container where the service is bound to port 9906 (10.92.120.6 is the IP of the remote machine that sends the GET request, and 10.30.4.20 is the eth1 IP inside the container that the port is bound to):
[root@test-pod ]# netstat -talpn|grep 9906
tcp 0 0 127.0.0.1:9906 0.0.0.0:* LISTEN -
tcp 0 0 10.30.4.20:9906 0.0.0.0:* LISTEN -
tcp 0 1840 10.30.4.20:9906 10.92.120.6:33898 ESTABLISHED -
tcp 0 1841 10.30.4.20:9906 10.92.120.6:59972 LAST_ACK -
tcp 0 1841 10.30.4.20:9906 10.92.120.6:60544 LAST_ACK -
tcp 0 1841 10.30.4.20:9906 10.92.120.6:59452 LAST_ACK -
tcp 0 1841 10.30.4.20:9906 10.92.120.6:32840 LAST_ACK -
However, a telnet connection from the remote machine to test-pod succeeds:
root@remote_machine$ telnet 10.92.50.19 9906
Trying 10.92.50.19...
Connected to 10.92.50.19.
Escape character is '^]'.
10.92.50.19 is the eth1 interface IP on my worker node, which is bridged using ipvlan to 10.30.4.20, the eth1 IP inside the container that the port is bound to.
It would be really helpful if someone could help me understand this issue and find a fix for it. Thanks in advance!

Kubernetes node failed to join master due "Timeout exceeded while awaiting headers error"

I am trying to set up a k8s cluster with a master and two worker nodes on DigitalOcean.
My Config:
I have created three droplets as follows:
Master: 2cpu, 3GB Mem
Worker Node1: 1cpu, 2GB Mem
Worker Node2: 1cpu, 2GB Mem
I was able to set up the master node successfully:
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
master Ready master 139m v1.18.3
I am unable to add a worker to the master.
Command I ran to join:
$ kubeadm join <PUBLIC IP>:6443 --token <token> --discovery-token-ca-cert-hash <hash>
Token had 23h of validity left at the time of executing the above command.
Error that I got:
W0528 14:13:09.920404 25129 join.go:346] [preflight] WARNING: JoinControlPane.controlPlane settings will be ignored when control-plane flag is not set.
[preflight] Running pre-flight checks
error execution phase preflight: couldn't validate the identity of the API Server: Get https://PUBLIC_IP:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
To see the stack trace of this error execute with --v=5 or higher
My observations on this issue:
$ netstat -pnltu
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 127.0.0.1:40389 0.0.0.0:* LISTEN 25074/kubelet
tcp 0 0 127.0.0.1:10248 0.0.0.0:* LISTEN 25074/kubelet
tcp 0 0 127.0.0.1:10249 0.0.0.0:* LISTEN 25478/kube-proxy
tcp 0 0 127.0.0.1:9099 0.0.0.0:* LISTEN 29823/calico-node
tcp 0 0 127.0.0.1:10257 0.0.0.0:* LISTEN 24580/kube-controll
tcp 0 0 127.0.0.1:10259 0.0.0.0:* LISTEN 24742/kube-schedule
tcp6 0 0 :::10250 :::* LISTEN 25074/kubelet
tcp6 0 0 :::10251 :::* LISTEN 24742/kube-schedule
tcp6 0 0 :::6443 :::* LISTEN 24725/kube-apiserve
tcp6 0 0 :::10252 :::* LISTEN 24580/kube-controll
tcp6 0 0 :::10256 :::* LISTEN 25478/kube-proxy
Is it because the API server is listening on IPv6 instead of IPv4?
Here is the output of cluster-info:
$ kubectl cluster-info
Kubernetes master is running at https://<PUBLIC_IP>:6443
KubeDNS is running at https://<PUBLIC_IP>:6443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
Any help to fix this issue is much appreciated.
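A note on the IPv6 question: on Linux, a tcp6 listener on `:::6443` normally accepts IPv4 connections as well (unless `net.ipv6.bindv6only` is enabled), so the netstat output alone does not show a problem. The `Client.Timeout exceeded` error means the worker got no response at all, so a first check is whether a plain TCP connection to port 6443 succeeds from the worker. Below is a minimal sketch of such a probe; the function name `can_connect` and the address in the comment are illustrative placeholders, not values from the question.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP handshake with host:port completes within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False


# Example call from the worker node (203.0.113.10 is a placeholder address):
# can_connect("203.0.113.10", 6443)
```

If this returns False from the worker but True on the master itself, a firewall between the droplets is one likely candidate, but that is an assumption to verify rather than a diagnosis.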

How to enable listening 10255 in my kubelet service

I am learning to work with Kubernetes and trying to configure monitoring of my Kubernetes cluster. For this I use metricbeat and elk.
After deploying and configuring metricbeat, I get an error:
error making http request: Get http://172.16.0.205:10255/stats/summary: dial tcp 172.16.0.205:10255: connect: connection refused
I found that my Kubelet is not listening on port 10255:
[root@kube2 /]# netstat -ap | grep -i "listen" | grep "kubelet"
tcp 0 0 localhost:40450 0.0.0.0:* LISTEN 8560/kubelet
tcp 0 0 localhost:10248 0.0.0.0:* LISTEN 8560/kubelet
tcp6 0 0 [::]:10250 [::]:* LISTEN 8560/kubelet
How can I enable this port? I found information saying that I need to use the parameter --read-only-port=10255, but I do not quite understand how to apply it to my kubelet. For example:
[root@kube2 /]# kubelet --config --read-only-port=10255
F1010 13:32:48.592306 15851 server.go:196] failed to load Kubelet config file --read-only-port=10255, error failed to read kubelet config file "/--read-only-port=10255", error: open /--read-only-port=10255: no such file or directory
It doesn't work. Which file does it need?
Can anyone help me with a solution to this problem?

I resolved this issue. I added the flag in /var/lib/kubelet/kubelet-flags on each of my Kubernetes nodes:
KUBELET_KUBEADM_ARGS="--cgroup-driver=systemd --network-plugin=cni --pod-infra-container-image=k8s.gcr.io/pause:3.1 --read-only-port=10255"
and restarted the kubelet service.
Now port 10255 is open:
[root@kube2 7.1]# netstat -ap | grep -i "listen" | grep "kubelet"
tcp 0 0 localhost:44799 0.0.0.0:* LISTEN 6281/kubelet
tcp 0 0 localhost:10248 0.0.0.0:* LISTEN 6281/kubelet
tcp6 0 0 [::]:10250 [::]:* LISTEN 6281/kubelet
tcp6 0 0 [::]:10255 [::]:* LISTEN 6281/kubelet
And I can see Kubernetes logs in my Kibana.
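For several nodes, the edit to the `KUBELET_KUBEADM_ARGS` line could be scripted. The following is a rough sketch under the assumption that the file contains a single `VAR="..."` line like the one above; the helper name `add_kubelet_flag` is made up for illustration. Note that the kubelet documentation marks `--read-only-port` as deprecated in favor of the kubelet config file (the `readOnlyPort` field of the KubeletConfiguration), and that this port serves without authentication.

```python
def add_kubelet_flag(env_line: str, flag: str) -> str:
    """Return env_line (VAR="args...") with flag appended.

    An existing setting of the same flag name is replaced, so the
    function can be applied repeatedly without duplicating the flag.
    """
    name, _, raw_value = env_line.strip().partition("=")
    args = raw_value.strip().strip('"').split()
    flag_name = flag.split("=", 1)[0]
    # Drop any previous occurrence of this flag before appending the new one.
    args = [arg for arg in args if arg.split("=", 1)[0] != flag_name]
    args.append(flag)
    joined = " ".join(args)
    return f'{name}="{joined}"'


before = (
    'KUBELET_KUBEADM_ARGS="--cgroup-driver=systemd --network-plugin=cni '
    '--pod-infra-container-image=k8s.gcr.io/pause:3.1"'
)
after = add_kubelet_flag(before, "--read-only-port=10255")
print(after)
```

The kubelet still has to be restarted after the file is changed, as described above.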

How do GCP load balancers route traffic to GKE services?

I'm relatively new (< 1 year) to GCP, and I'm still in the process of mapping the various services onto my existing networking mental model.
One knowledge gap I'm struggling to fill is how HTTP requests are load balanced to services running in our GKE clusters.
On a test cluster, I created a service in front of pods that serve HTTP:
apiVersion: v1
kind: Service
metadata:
  name: contour
spec:
  ports:
  - port: 80
    name: http
    protocol: TCP
    targetPort: 8080
  - port: 443
    name: https
    protocol: TCP
    targetPort: 8443
  selector:
    app: contour
  type: LoadBalancer
The service is listening on node ports 30472 and 30816:
$ kubectl get svc contour
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
contour LoadBalancer 10.63.241.69 35.x.y.z 80:30472/TCP,443:30816/TCP 41m
A GCP network load balancer is automatically created for me. It has its own public IP at 35.x.y.z, and is listening on ports 80-443.
Curling the load balancer IP works:
$ curl -q -v 35.x.y.z
* TCP_NODELAY set
* Connected to 35.x.y.z (35.x.y.z) port 80 (#0)
> GET / HTTP/1.1
> Host: 35.x.y.z
> User-Agent: curl/7.62.0
> Accept: */*
>
< HTTP/1.1 404 Not Found
< date: Mon, 07 Jan 2019 05:33:44 GMT
< server: envoy
< content-length: 0
<
If I ssh into the GKE node, I can see the kube-proxy is listening on the service nodePorts (30472 and 30816) and nothing has a socket listening on ports 80 or 443:
# netstat -lntp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 127.0.0.1:20256 0.0.0.0:* LISTEN 1022/node-problem-d
tcp 0 0 127.0.0.1:10248 0.0.0.0:* LISTEN 1221/kubelet
tcp 0 0 127.0.0.1:10249 0.0.0.0:* LISTEN 1369/kube-proxy
tcp 0 0 0.0.0.0:5355 0.0.0.0:* LISTEN 297/systemd-resolve
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN 330/sshd
tcp6 0 0 :::30816 :::* LISTEN 1369/kube-proxy
tcp6 0 0 :::4194 :::* LISTEN 1221/kubelet
tcp6 0 0 :::30472 :::* LISTEN 1369/kube-proxy
tcp6 0 0 :::10250 :::* LISTEN 1221/kubelet
tcp6 0 0 :::5355 :::* LISTEN 297/systemd-resolve
tcp6 0 0 :::10255 :::* LISTEN 1221/kubelet
tcp6 0 0 :::10256 :::* LISTEN 1369/kube-proxy
Two questions:
Given nothing on the node is listening on ports 80 or 443, is the load balancer directing traffic to ports 30472 and 30816?
If the load balancer is accepting traffic on 80/443 and forwarding to 30472/30816, where can I see that configuration? Clicking around the load balancer screens I can't see any mention of ports 30472 and 30816.

I think I found the answer to my own question - can anyone confirm I'm on the right track?
The network load balancer redirects the traffic to a node in the cluster without modifying the packet - packets for port 80/443 still have port 80/443 when they reach the node.
There's nothing listening on ports 80/443 on the nodes. However kube-proxy has written iptables rules that match packets to the load balancer IP, and rewrite them with the appropriate ClusterIP and port:
You can see the iptables config on the node:
$ iptables-save | grep KUBE-SERVICES | grep loadbalancer
-A KUBE-SERVICES -d 35.x.y.z/32 -p tcp -m comment --comment "default/contour:http loadbalancer IP" -m tcp --dport 80 -j KUBE-FW-D53V3CDHSZT2BLQV
-A KUBE-SERVICES -d 35.x.y.z/32 -p tcp -m comment --comment "default/contour:https loadbalancer IP" -m tcp --dport 443 -j KUBE-FW-J3VGAQUVMYYL5VK6
$ iptables-save | grep KUBE-SEP-ZAA234GWNBHH7FD4
:KUBE-SEP-ZAA234GWNBHH7FD4 - [0:0]
-A KUBE-SEP-ZAA234GWNBHH7FD4 -s 10.60.0.30/32 -m comment --comment "default/contour:http" -j KUBE-MARK-MASQ
-A KUBE-SEP-ZAA234GWNBHH7FD4 -p tcp -m comment --comment "default/contour:http" -m tcp -j DNAT --to-destination 10.60.0.30:8080
$ iptables-save | grep KUBE-SEP-CXQOVJCC5AE7U6UC
:KUBE-SEP-CXQOVJCC5AE7U6UC - [0:0]
-A KUBE-SEP-CXQOVJCC5AE7U6UC -s 10.60.0.30/32 -m comment --comment "default/contour:https" -j KUBE-MARK-MASQ
-A KUBE-SEP-CXQOVJCC5AE7U6UC -p tcp -m comment --comment "default/contour:https" -m tcp -j DNAT --to-destination 10.60.0.30:8443
An interesting implication is that the nodePort is created but doesn't appear to be used. That matches this comment in the kube docs:
Google Compute Engine does not need to allocate a NodePort to make LoadBalancer work
It also explains why GKE creates an automatic firewall rule that allows traffic from 0.0.0.0/0 towards ports 80/443 on the nodes. The load balancer isn't rewriting the packets, so the firewall needs to allow traffic from anywhere to reach iptables on the node, and it's rewritten there.
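To see at a glance where each service port ends up, the DNAT rules can be pulled out of the `iptables-save` output. This is a small illustrative sketch, not something from kube-proxy itself: the regular expression and the helper name `dnat_targets` are assumptions based on the rule format shown above, and the sample text is copied from that output.

```python
import re

# Matches rules such as:
#   ... --comment "default/contour:http" -m tcp -j DNAT --to-destination 10.60.0.30:8080
DNAT_RULE = re.compile(
    r'--comment "(?P<service>[^"]+)".*-j DNAT --to-destination (?P<target>\S+)'
)


def dnat_targets(iptables_save_output: str) -> dict:
    """Map each service comment to the list of pod endpoints it is DNATed to."""
    targets = {}
    for line in iptables_save_output.splitlines():
        match = DNAT_RULE.search(line)
        if match:
            targets.setdefault(match["service"], []).append(match["target"])
    return targets


RULES = """\
-A KUBE-SEP-ZAA234GWNBHH7FD4 -s 10.60.0.30/32 -m comment --comment "default/contour:http" -j KUBE-MARK-MASQ
-A KUBE-SEP-ZAA234GWNBHH7FD4 -p tcp -m comment --comment "default/contour:http" -m tcp -j DNAT --to-destination 10.60.0.30:8080
-A KUBE-SEP-CXQOVJCC5AE7U6UC -s 10.60.0.30/32 -m comment --comment "default/contour:https" -j KUBE-MARK-MASQ
-A KUBE-SEP-CXQOVJCC5AE7U6UC -p tcp -m comment --comment "default/contour:https" -m tcp -j DNAT --to-destination 10.60.0.30:8443
"""

print(dnat_targets(RULES))
```

For the rules above this maps the http port to 10.60.0.30:8080 and the https port to 10.60.0.30:8443, which matches the targetPort values in the Service definition.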

To understand LoadBalancer services, you first have to grok NodePort services. The way those work is that there is a proxy (usually actually implemented in iptables or ipvs now for perf but that's an implementation detail) on every node in your cluster, and when you create a NodePort service it picks a port that is unused and sets every one of those proxies to forward packets to your Kubernetes pod. A LoadBalancer service builds on top of that, so on GCP/GKE it creates a GCLB forwarding rule mapping the requested port to a rotation of all those node-level proxies. So the GCLB listens on port 80, which proxies to some random port on a random node, which proxies to the internal port on your pod.
The process is a bit more customizable than that, but that's the basic defaults.

Cannot curl kubelet read-only port

I have a heapster pod running on one of the nodes in my Kubernetes cluster. It is able to get http://<node-with-heapster-pod>:10255/stats/summary just fine, but whenever it runs the same get request on another node, it cannot. When I run curl from within any given node I can access that port, but when I curl any node from another machine I get the following error:
Failed to connect to 128.180.120.229 port 10255: No route to host
The following is the netstat output for all ports on which the kubelet is listening:
netstat -ap | grep -i "listen" | grep "kubelet"
tcp 0 0 localhost:10248 0.0.0.0:* LISTEN 7562/kubelet
tcp6 0 0 [::]:4194 [::]:* LISTEN 7562/kubelet
tcp6 0 0 [::]:10250 [::]:* LISTEN 7562/kubelet
tcp6 0 0 [::]:10255 [::]:* LISTEN 7562/kubelet
unix 2 [ ACC ] STREAM LISTENING 621349 7562/kubelet /var/run/dockershim.sock
I apologize for the messy last column. Any ideas why this may be? My iptables rules are set up to accept all incoming connections, and any node can ping port 10250 fine, just not 10255.

You may not have ip_forward enabled on your system. Can you check this setting?
sysctl -n net.ipv4.ip_forward

If anybody still cares, port 10255 is the kubelet's read only port and may or may not be configured. You can confirm this by accessing the worker node in question then looking at the kubelet's startup command.
systemctl status kubelet-worker.service
Some on-prem kubernetes solutions set this to 0 as mentioned below
https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/
--read-only-port int32 The read-only port for the Kubelet to serve on with no authentication/authorization (set to 0 to disable) (default 10255) (DEPRECATED: This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/ for more information.)
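The exact error can help narrow this down: a disabled port usually shows up as "connection refused", whereas the "No route to host" in the question more often comes from routing or from a firewall REJECT rule on the target node. Below is a hedged sketch of a probe that separates these cases; the function names `diagnose` and `explain_errno` and the wording of the explanations are illustrative assumptions, not an authoritative mapping.

```python
import errno
import socket

# Rough interpretations of common connect() failures (illustrative wording).
EXPLANATIONS = {
    errno.ECONNREFUSED: "refused: host reachable, nothing listening on that port",
    errno.EHOSTUNREACH: "no route to host: routing problem or a firewall REJECT rule",
}


def explain_errno(code) -> str:
    """Translate a connect() errno into a short hint."""
    return EXPLANATIONS.get(code, f"other error (errno {code})")


def diagnose(host: str, port: int, timeout: float = 3.0) -> str:
    """Attempt a TCP connection and describe the outcome in plain words."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        return "timeout: packets are probably being dropped"
    except OSError as exc:
        return explain_errno(exc.errno)


# Example, run from another machine against the node in the question:
# diagnose("128.180.120.229", 10255)
# diagnose("128.180.120.229", 10250)
```

Comparing the result for 10255 with the result for 10250 from the same machine shows whether the two ports are treated differently on the network path or only by the kubelet.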
