Resolving external domains from within pods does not work - kubernetes

What happened
Resolving an external domain from within a pod fails with SERVFAIL message. In the logs, i/o timeout error is mentioned.
What I expected to happen
External domains should be successfully resolved from the pods.
How to reproduce it
apiVersion: v1
kind: Pod
metadata:
name: dnsutils
namespace: default
spec:
containers:
- name: dnsutils
image: gcr.io/kubernetes-e2e-test-images/dnsutils:1.3
command:
- sleep
- "3600"
imagePullPolicy: IfNotPresent
restartPolicy: Always
Create the pod above (from Debugging DNS Resolution help page).
Run kubectl exec dnsutils -it -- nslookup google.com
pig#pig202:~$ kubectl exec dnsutils -it -- nslookup google.com
Server: 10.152.183.10
Address: 10.152.183.10#53
** server can't find google.com.mshome.net: SERVFAIL
command terminated with exit code 1
Also run kubectl exec dnsutils -it -- nslookup google.com.
pig#pig202:~$ kubectl exec dnsutils -it -- nslookup google.com.
Server: 10.152.183.10
Address: 10.152.183.10#53
** server can't find google.com: SERVFAIL
command terminated with exit code 1
Additional information
I am using microk8s environment in a Hyper-V virtual machine.
Resolving DNS from the virtual machine works, and Kubernetes is able to pull container images. It's only from within the pods that the resolution is failing meaning I cannot communicate with the Internet from within the pods.
This is OK:
pig#pig202:~$ kubectl exec dnsutils -it -- nslookup kubernetes.default
Server: 10.152.183.10
Address: 10.152.183.10#53
Name: kubernetes.default.svc.cluster.local
Address: 10.152.183.1
Environment
The version of CoreDNS
image: 'coredns/coredns:1.6.6'
Corefile (taken from ConfigMap)
Corefile: |
.:53 {
errors
health {
lameduck 5s
}
ready
log . {
class error
}
kubernetes cluster.local in-addr.arpa ip6.arpa {
pods insecure
fallthrough in-addr.arpa ip6.arpa
}
prometheus :9153
forward . 8.8.8.8 8.8.4.4
cache 30
loop
reload
loadbalance
}
Logs
pig#pig202:~$ kubectl logs --namespace=kube-system -l k8s-app=kube-dns -f
[INFO] 10.1.99.26:47204 - 29832 "AAAA IN grafana.com. udp 29 false 512" NOERROR - 0 2.0002558s
[ERROR] plugin/errors: 2 grafana.com. AAAA: read udp 10.1.99.19:52008->8.8.8.8:53: i/o timeout
[INFO] 10.1.99.26:59350 - 50446 "A IN grafana.com. udp 29 false 512" NOERROR - 0 2.0002028s
[ERROR] plugin/errors: 2 grafana.com. A: read udp 10.1.99.19:60405->8.8.8.8:53: i/o timeout
[INFO] 10.1.99.26:43050 - 13676 "AAAA IN grafana.com. udp 29 false 512" NOERROR - 0 2.0002151s
[ERROR] plugin/errors: 2 grafana.com. AAAA: read udp 10.1.99.19:45624->8.8.8.8:53: i/o timeout
[INFO] 10.1.99.26:36997 - 30359 "A IN grafana.com. udp 29 false 512" NOERROR - 0 2.0002791s
[ERROR] plugin/errors: 2 grafana.com. A: read udp 10.1.99.19:37554->8.8.4.4:53: i/o timeout
[INFO] 10.1.99.32:57927 - 53858 "A IN google.com.mshome.net. udp 39 false 512" NOERROR - 0 2.0001987s
[ERROR] plugin/errors: 2 google.com.mshome.net. A: read udp 10.1.99.19:34079->8.8.4.4:53: i/o timeout
[INFO] 10.1.99.32:38403 - 36398 "A IN google.com.mshome.net. udp 39 false 512" NOERROR - 0 2.000224s
[ERROR] plugin/errors: 2 google.com.mshome.net. A: read udp 10.1.99.19:59835->8.8.8.8:53: i/o timeout
[INFO] 10.1.99.26:57447 - 20295 "AAAA IN grafana.com.mshome.net. udp 40 false 512" NOERROR - 0 2.0001892s
[ERROR] plugin/errors: 2 grafana.com.mshome.net. AAAA: read udp 10.1.99.19:51534->8.8.8.8:53: i/o timeout
[INFO] 10.1.99.26:41052 - 56059 "A IN grafana.com.mshome.net. udp 40 false 512" NOERROR - 0 2.0001879s
[ERROR] plugin/errors: 2 grafana.com.mshome.net. A: read udp 10.1.99.19:47378->8.8.8.8:53: i/o timeout
[INFO] 10.1.99.26:56748 - 51804 "AAAA IN grafana.com.mshome.net. udp 40 false 512" NOERROR - 0 2.0003226s
[INFO] 10.1.99.26:45442 - 61916 "A IN grafana.com.mshome.net. udp 40 false 512" NOERROR - 0 2.0001922s
[ERROR] plugin/errors: 2 grafana.com.mshome.net. AAAA: read udp 10.1.99.19:35528->8.8.8.8:53: i/o timeout
[ERROR] plugin/errors: 2 grafana.com.mshome.net. A: read udp 10.1.99.19:53568->8.8.8.8:53: i/o timeout
OS
pig#pig202:~$ cat /etc/os-release
NAME="Ubuntu"
VERSION="20.04 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04 LTS"
VERSION_ID="20.04"
Tried on Ubuntu 18.04.3 LTS, same issue.
Other
mshome.net search domain comes from Hyper-V network, I assume. Perhaps this will be of help:
pig#pig202:~$ nmcli device show eth0
GENERAL.DEVICE: eth0
GENERAL.TYPE: ethernet
GENERAL.HWADDR: 00:15:5D:88:26:02
GENERAL.MTU: 1500
GENERAL.STATE: 100 (connected)
GENERAL.CONNECTION: Wired connection 1
GENERAL.CON-PATH: /org/freedesktop/NetworkManager/ActiveConnection/1
WIRED-PROPERTIES.CARRIER: on
IP4.ADDRESS[1]: 172.19.120.188/28
IP4.GATEWAY: 172.19.120.177
IP4.ROUTE[1]: dst = 0.0.0.0/0, nh = 172.19.120.177, mt = 100
IP4.ROUTE[2]: dst = 172.19.120.176/28, nh = 0.0.0.0, mt = 100
IP4.ROUTE[3]: dst = 169.254.0.0/16, nh = 0.0.0.0, mt = 1000
IP4.DNS[1]: 172.19.120.177
IP4.DOMAIN[1]: mshome.net
IP6.ADDRESS[1]: fe80::6b4a:57e2:5f1b:f739/64
IP6.GATEWAY: --
IP6.ROUTE[1]: dst = fe80::/64, nh = ::, mt = 100
IP6.ROUTE[2]: dst = ff00::/8, nh = ::, mt = 256, table=255

Finally found the solution which was the combination of two changes. After applying both changes, my pods could finally resolve addresses properly.
Kubelet configuration
Based on known issues, change resolv-conf path for Kubelet to use.
# Add resolv-conf flag to Kubelet configuration
echo "--resolv-conf=/run/systemd/resolve/resolv.conf" >> /var/snap/microk8s/current/args/kubelet
# Restart Kubelet
sudo service snap.microk8s.daemon-kubelet restart
CoreDNS forward
Change forward address in CoreDNS config map from default (8.8.8.8 8.8.4.4) to DNS on eth0 device.
# Dump definition of CoreDNS
microk8s.kubectl get configmap -n kube-system coredns -o yaml > coredns.yaml
Partial content of coredns.yaml:
Corefile: |
.:53 {
errors
health {
lameduck 5s
}
ready
log . {
class error
}
kubernetes cluster.local in-addr.arpa ip6.arpa {
pods insecure
fallthrough in-addr.arpa ip6.arpa
}
prometheus :9153
forward . 8.8.8.8 8.8.4.4
cache 30
loop
reload
loadbalance
}
Fetch DNS:
# Fetch eth0 DNS address (this will print 172.19.120.177 in my case)
nmcli dev show 2>/dev/null | grep DNS | sed 's/^.*:\s*//'
Change the following line and save:
forward . 8.8.8.8 8.8.4.4 # From this
forward . 172.19.120.177 # To this (your DNS will probably be different)
Finally apply to change CoreDNS forwarding:
microk8s.kubectl apply -f coredns.yaml

Related

k3s debuging dns resolution

I'm new to kubernetes and I have some issues with my dns names in my k3s cluster on pc with arm architecture.
I've tried to debug as docs (https://kubernetes.io/docs/tasks/administer-cluster/dns-debugging-resolution/) suggest
I installed 3ks as follows:
sudo curl -sfL https://get.k3s.io | K3S_KUBECONFIG_MODE=”644” sh -
And applied manifest for debugging pod:
kubectl apply -f https://k8s.io/examples/admin/dns/dnsutils.yaml
I've checked that pod is running:
kubectl get pods dnsutils
and tried to run
kubectl exec -i -t dnsutils -- nslookup kubernetes.default
and expected smth like that:
Server: 10.0.0.10
Address 1: 10.0.0.10
Name: kubernetes.default
Address 1: 10.0.0.1
But get:
;; connection timed out; no servers could be reached
command terminated with exit code 1
Any thoughts to debug? It seems that I messing smth...
UPD. Tried to debug as rancher suggests (https://docs.ranchermanager.rancher.io/v2.5/troubleshooting/other-troubleshooting-tips/dns):
kubectl run -it --rm --restart=Never busybox --image=busybox:1.28 -- nslookup kubernetes.default
And there is the output:
If you don't see a command prompt, try pressing enter.
Address 1: 10.43.0.10
nslookup: can't resolve 'kubernetes.default'
pod "busybox" deleted
pod default/busybox terminated (Error)
So I tried next step:
for p in $(kubectl get pods --namespace=kube-system -l k8s-app=kube-dns -o name); do kubectl logs --namespace=kube-system $p; done
and logs are:
[WARNING] No files matching import glob pattern: /etc/coredns/custom/*.server
.:53
[WARNING] No files matching import glob pattern: /etc/coredns/custom/*.server
[INFO] plugin/reload: Running configuration SHA512 = b941b080e5322f6519009bb49349462c7ddb6317425b0f6a83e5451175b720703949e3f3b454a24e77f3ffe57fd5e9c6130e528a5a1dd00d9000e4afd6c1108d
CoreDNS-1.9.1
linux/arm64, go1.17.8, 4b597f8
[ERROR] plugin/errors: 2 4288512074117887106.1437335397389171032. HINFO: read udp 10.42.0.5:39581->8.8.8.8:53: i/o timeout
[ERROR] plugin/errors: 2 4288512074117887106.1437335397389171032. HINFO: read udp 10.42.0.5:52272->8.8.8.8:53: i/o timeout
[ERROR] plugin/errors: 2 4288512074117887106.1437335397389171032. HINFO: read udp 10.42.0.5:41480->8.8.8.8:53: i/o timeout
[ERROR] plugin/errors: 2 4288512074117887106.1437335397389171032. HINFO: read udp 10.42.0.5:52059->8.8.8.8:53: i/o timeout
[ERROR] plugin/errors: 2 4288512074117887106.1437335397389171032. HINFO: read udp 10.42.0.5:46821->8.8.8.8:53: i/o timeout
[ERROR] plugin/errors: 2 4288512074117887106.1437335397389171032. HINFO: read udp 10.42.0.5:35222->8.8.8.8:53: i/o timeout
[ERROR] plugin/errors: 2 4288512074117887106.1437335397389171032. HINFO: read udp 10.42.0.5:38013->8.8.8.8:53: i/o timeout
[ERROR] plugin/errors: 2 4288512074117887106.1437335397389171032. HINFO: read udp 10.42.0.5:42222->8.8.8.8:53: i/o timeout
[WARNING] No files matching import glob pattern: /etc/coredns/custom/*.server
[ERROR] plugin/errors: 2 4288512074117887106.1437335397389171032. HINFO: read udp 10.42.0.5:50612->8.8.8.8:53: i/o timeout
[ERROR] plugin/errors: 2 4288512074117887106.1437335397389171032. HINFO: read udp 10.42.0.5:50341->8.8.8.8:53: i/o timeout
[WARNING] No files matching import glob pattern: /etc/coredns/custom/*.server
...
UPD2
kubectl -n kube-system get cm coredns -o yaml
apiVersion: v1
data:
Corefile: |
.:53 {
errors
health
ready
kubernetes cluster.local in-addr.arpa ip6.arpa {
pods insecure
fallthrough in-addr.arpa ip6.arpa
}
hosts /etc/coredns/NodeHosts {
ttl 60
reload 15s
fallthrough
}
prometheus :9153
forward . /etc/resolv.conf
cache 30
loop
reload
loadbalance
}
import /etc/coredns/custom/*.server
NodeHosts: |
192.168.0.103 ubuntu
kind: ConfigMap
metadata:
annotations:
objectset.rio.cattle.io/applied: H4sIAAAAAAAA/4yQwWrzMBCEX0Xs2fEf20nsX9BDybH02lMva2kdq1Z2g6SkBJN3L8IUCiVtbyNGOzvfzoAn90IhOmHQcKmgAIsJQc+wl0CD8wQaSr1t1PzKSilFIUiIix4JfRoXHQjtdZHTuafAlCgq488xUSi9wK2AybEFDXvhwR2e8QQFHCnh50ZkloTJCcf8lP6NTIqUyuCkNJiSp9LJP5czoLjryztTWB0uE2iYmvjFuVSFenJsHx6tFf41gvGY6Y0Eshz/9D2e0OSZfIJVvMZExwzusSf/I9SIcQQNvaG6a+r/XVdV7abBddPtsN9W66Eedi0N7aberM22zaHf6t0tcPsIAAD//8Ix+PfoAQAA
objectset.rio.cattle.io/id: ""
objectset.rio.cattle.io/owner-gvk: k3s.cattle.io/v1, Kind=Addon
objectset.rio.cattle.io/owner-name: coredns
objectset.rio.cattle.io/owner-namespace: kube-system
creationTimestamp: "2022-09-23T09:06:05Z"
labels:
objectset.rio.cattle.io/hash: bce283298811743a0386ab510f2f67ef74240c57
name: coredns
namespace: kube-system
resourceVersion: "315"
uid: 33a8ccf6-511f-49c4-9752-424859d67d70
UPD3
kubectl -n kube-system get po -o wide
Output:
coredns-b96499967-sct84 1/1 Running 1 (17h ago) 20h 10.42.0.6 ubuntu <none> <none>
helm-install-traefik-crd-wrh5b 0/1 Completed 0 20h 10.42.0.3 ubuntu <none> <none>
helm-install-traefik-wx7s2 0/1 Completed 1 20h 10.42.0.5 ubuntu <none> <none>
local-path-provisioner-7b7dc8d6f5-qxjvs 1/1 Running 1 (17h ago) 20h 10.42.0.3 ubuntu <none> <none>
metrics-server-668d979685-ngbmr 1/1 Running 1 (17h ago) 20h 10.42.0.5 ubuntu <none> <none>
svclb-traefik-67fcd721-mz6sd 2/2 Running 2 (17h ago) 20h 10.42.0.2 ubuntu <none> <none>
traefik-7cd4fcff68-j74gd 1/1 Running 1 (17h ago) 20h 10.42.0.4 ubuntu <none> <none>
kubectl -n kube-system get svc
Output:
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kube-dns ClusterIP 10.43.0.10 <none> 53/UDP,53/TCP,9153/TCP 20h
metrics-server ClusterIP 10.43.178.64 <none> 443/TCP 20h
traefik LoadBalancer 10.43.36.41 192.168.0.103 80:30268/TCP,443:30293/TCP 20h
Actually I found workaround. When install k3s one should use flag flannel-backend=ipsec
curl -sfL https://get.k3s.io | sh -s - server --write-kubeconfig-mode 644 --flannel-backend=ipsec
By default it uses --flannel-backend=vxlan I've tried --flannel-backend=host-gw
But for me works well flannel-backend=ipsec

Can not connect to external location inside a Kubernetes Pod - DNS Issues

i have the following problem. I have a namespace "qa". Pods inside this namespace can communicate with each other.
For Example
kubectl exec -it qa-file-watcher-85575bd8f7-npkns -n qa /bin/bash
root#qa-file-watcher-85575bd8f7-npkns:/usr/src/app# nslookup qa-kafka-broker
root#qa-file-watcher-85575bd8f7-npkns:/usr/src/app# nslookup qa-kafka-broker
Server: 10.96.0.10
Address: 10.96.0.10#53
Name: qa-kafka-broker.qa.svc.cluster.local
Address: 10.102.218.167
But if i try to connect to an external service e.g. 8.8.8.8 oder security.debian.org for apt-get update i get the following errors
root#qa-file-watcher-85575bd8f7-npkns:/usr/src/app# nslookup 8.8.8.8
Server: 10.96.0.10
Address: 10.96.0.10#53
** server can't find 8.8.8.8.in-addr.arpa: SERVFAIL
root#qa-file-watcher-85575bd8f7-npkns:/usr/src/app# nslookup security.debian.org
Server: 10.96.0.10
Address: 10.96.0.10#53
** server can't find security.debian.org.eu-central-1.compute.internal: SERVFAIL
Here are some informations about the setup. I use a bitnami/kubernetes image on a EC2-instance on AWS
bitnami#ip-172-30-0-120:~/buildAgent/work/aad99852b1e5781f$ kubectl version
Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.3", GitCommit:"06ad960bfd03b39c8310aaf92d1e7c12ce618213", GitTreeState:"clean", BuildDate:"2020-02-11T18:14:22Z", GoVersion:"go1.13.6", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.3", GitCommit:"06ad960bfd03b39c8310aaf92d1e7c12ce618213", GitTreeState:"clean", BuildDate:"2020-02-11T18:07:13Z", GoVersion:"go1.13.6", Compiler:"gc", Platform:"linux/amd64"}
bitnami#ip-172-30-0-120:~/buildAgent/work/aad99852b1e5781f$ cat /etc/os-release
NAME="Ubuntu"
VERSION="16.04.6 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.6 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
VERSION_CODENAME=xenial
UBUNTU_CODENAME=xenial
bitnami#ip-172-30-0-120:~/buildAgent/work/aad99852b1e5781f$ cat /etc/resolv.conf
# Dynamic resolv.conf(5) file for glibc resolver(3) generated by resolvconf(8)
# DO NOT EDIT THIS FILE BY HAND -- YOUR CHANGES WILL BE OVERWRITTEN
nameserver 172.30.0.2
search xxxxxxxx.compute.internal default.svc.cluster.local svc.cluster.local cluster.local deb.debian.org
options ndots:5 single request-reopen
DNS=8.8.8.8
there are running coredns on the kubernetes with the following config
# Please edit the object below. Lines beginning with a '#' will be ignored,
# and an empty file will abort the edit. If an error occurs while saving this file will be
# reopened with the relevant failures.
#
apiVersion: v1
data:
Corefile: |
.:53 {
log
errors
health {
lameduck 5s
}
ready
kubernetes cluster.local in-addr.arpa ip6.arpa {
pods insecure
fallthrough in-addr.arpa ip6.arpa
ttl 30
}
prometheus :9153
forward . /etc/resolv.conf
cache 30
loop
reload
loadbalance
}
kind: ConfigMap
metadata:
creationTimestamp: "2020-02-25T12:52:17Z"
name: coredns
namespace: kube-system
resourceVersion: "31099780"
selfLink: /api/v1/namespaces/kube-system/configmaps/coredns
uid: 26a6800a-2ceb-4f29-ab85-82beaec0add8
Anyone has an idea what is going wrong here? If more detailed informations are needed pleas let me know.
Greetings and Thanks
EDIT:
this are the pods which are running on the namespace kube-system
bitnami#ip-172-30-0-120:~/deployments/qa-deployment$ kubectl get pods -n kube-system
NAME READY STATUS RESTARTS AGE
coredns-6955765f44-5glwz 1/1 Running 0 151m
coredns-6955765f44-hf2hd 1/1 Running 0 151m
etcd-ip-172-30-0-120 1/1 Running 4 9d
heapster-744b794df7-v2vz9 1/1 Running 1 9d
kube-apiserver-ip-172-30-0-120 1/1 Running 4 9d
kube-controller-manager-ip-172-30-0-120 1/1 Running 7 9d
kube-proxy-lfstn 1/1 Running 1 9d
kube-scheduler-ip-172-30-0-120 1/1 Running 6 9d
kubernetes-dashboard-8f7798644-m7r8x 1/1 Running 13 9d
kubernetes-metrics-scraper-6b97c6d857-nl98d 1/1 Running 0 8d
local-volume-provisioner-69vrv 1/1 Running 33 9d
monitoring-grafana-845bc8df5f-62d4x 1/1 Running 1 9d
monitoring-influxdb-56d9446bd9-wlrd5 1/1 Running 1 9d
nginx-ingress-controller-574d4c9dcf-fmdgm 1/1 Running 1 9d
registry-86c45b9d9b-pm6zj 1/1 Running 0 7d23h
weave-net-g78mj 2/2 Running 5 9d
and this is the log from the core-dns
...
...
...
[INFO] 10.32.0.35:49254 - 6294 "AAAA IN monitoring.xxxxxx.de.qa.svc.cluster.local. udp 66 false 512" NXDOMAIN qr,aa,rd 159 0.000297909s
[INFO] 10.32.0.35:55396 - 52809 "A IN monitoring.xxxxxx.de.svc.cluster.local. udp 63 false 512" NXDOMAIN qr,aa,rd 156 0.000152558s
[INFO] 10.32.0.35:55396 - 36432 "AAAA IN monitoring.xxxxxx.de.svc.cluster.local. udp 63 false 512" NXDOMAIN qr,aa,rd 156 0.000384192s
[INFO] 10.32.0.31:54436 - 61896 "AAAA IN xxxxxx.cq5rq6zjwmfc.eu-central-1.rds.amazonaws.com. udp 74 false 512" NOERROR - 0 2.000274796s
[ERROR] plugin/errors: 2 xxxxxx.cq5rq6zjwmfc.eu-central-1.rds.amazonaws.com. AAAA: read udp 10.32.0.30:41402->172.30.0.2:53: i/o timeout
[INFO] 10.32.0.31:54436 - 64312 "A IN xxxxxx.cq5rq6zjwmfc.eu-central-1.rds.amazonaws.com. udp 74 false 512" NOERROR - 0 2.000270418s
[ERROR] plugin/errors: 2 xxxxxx.cq5rq6zjwmfc.eu-central-1.rds.amazonaws.com. A: read udp 10.32.0.30:43606->172.30.0.2:53: i/o timeout
[INFO] 10.32.0.31:54436 - 8384 "AAAA IN postgres.qa.svc.cluster.local. udp 47 false 512" NOERROR qr,aa,rd 146 2.000560668s
[INFO] 10.32.0.31:54436 - 60087 "A IN postgres.qa.svc.cluster.local. udp 47 false 512" NOERROR qr,aa,rd 146 2.000566155s
EDIT2:
I cant go inside the coredns pod with
bitnami#ip-172-30-0-120:~/deployments/qa-deployment$ kubectl exec -it coredns-6955765f44-5glwz -n kube-system bash
error: Internal error occurred: error executing command in container: failed to exec in container: failed to start exec "2a604d5b8cfad5341acc0d548412f8376fdf063bf97d92d1aaa501841f959671": OCI runtime exec failed: exec failed: container_linux.go:349: starting container process caused "exec: \"bash\": executable file not found in $PATH": unknown
The resolve.conf inside the pod file-watcher-service in the namespace qa:
root#qa-file-watcher-service-7b7d47c67d-fjb8m:/etc# cat resolv.conf
search qa.svc.cluster.local svc.cluster.local cluster.local eu-central-1.compute.internal default.svc.cluster.local
nameserver 10.96.0.10
options ndots:5

DNS resolve problem in kubernetes cluster

We have a kubernetes cluster consist of four worker and one master node. On the worker1 and worker2 we can't resolve the DNS names but in the two other nodes everything is OK! I follow the instructions by official documentation here and I realized that the queries from worker1 and 2 are not received by the coredns pods.
I repeat all thing is good in the worker3 and worker4, I have a problem with worker1 and worker2. For example, when I run the busybox container in the worker1 and do nslookup kubernetes.default it doesn't return any thing but when it run in the worker3 the DNS resolving is OK.
Cluster information:
$ kubeadm version
kubeadm version: &version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.1", GitCommit:"4ed3216f3ec431b140b1d899130a69fc671678f4", GitTreeState:"clean", BuildDate:"2018-10-05T16:43:08Z", GoVersion:"go1.10.4", Compiler:"gc", Platform:"linux/amd64"}
$ kubectl get pod -n kube-system
NAME READY STATUS RESTARTS AGE
coredns-576cbf47c7-6dtrc 1/1 Running 5 82d
coredns-576cbf47c7-jvx5l 1/1 Running 6 82d
etcd-master 1/1 Running 35 298d
kube-apiserver-master 1/1 Running 14 135m
kube-controller-manager-master 1/1 Running 42 298d
kube-proxy-22f49 1/1 Running 9 91d
kube-proxy-2s9sx 1/1 Running 34 298d
kube-proxy-jh2m7 1/1 Running 5 81d
kube-proxy-rc5r8 1/1 Running 5 63d
kube-proxy-vg8jd 1/1 Running 6 104d
kube-scheduler-master 1/1 Running 39 298d
kubernetes-dashboard-65c76f6c97-7cwwp 1/1 Running 45 293d
tiller-deploy-779784fbd6-dzq7k 1/1 Running 5 87d
weave-net-556ml 2/2 Running 12 66d
weave-net-h9km9 2/2 Running 15 81d
weave-net-s88z4 2/2 Running 0 145m
weave-net-smrgc 2/2 Running 14 63d
weave-net-xf6ng 2/2 Running 15 82d
$ kubectl logs coredns-576cbf47c7-6dtrc -n kube-system | tail -20
10.44.0.28:32837 - [14/Dec/2019:12:22:51 +0000] 2957 "AAAA IN spark-master.default.svc.cluster.local. udp 56 false 512" NOERROR qr,aa,rd,ra 149 0.000661167s
10.44.0.28:51373 - [14/Dec/2019:12:25:09 +0000] 46278 "AAAA IN spark-master.default.svc.cluster.local. udp 56 false 512" NOERROR qr,aa,rd,ra 149 0.000440918s
10.44.0.28:51373 - [14/Dec/2019:12:25:09 +0000] 47697 "A IN spark-master.default.svc.cluster.local. udp 56 false 512" NOERROR qr,aa,rd,ra 110 0.00059741s
10.44.0.28:44969 - [14/Dec/2019:12:27:27 +0000] 33222 "AAAA IN spark-master.default.svc.cluster.local. udp 56 false 512" NOERROR qr,aa,rd,ra 149 0.00044739s
10.44.0.28:44969 - [14/Dec/2019:12:27:27 +0000] 52126 "A IN spark-master.default.svc.cluster.local. udp 56 false 512" NOERROR qr,aa,rd,ra 110 0.000310494s
10.44.0.28:39392 - [14/Dec/2019:12:29:11 +0000] 41041 "AAAA IN spark-master.default.svc.cluster.local. udp 56 false 512" NOERROR qr,aa,rd,ra 149 0.000481309s
10.44.0.28:40999 - [14/Dec/2019:12:29:11 +0000] 695 "AAAA IN spark-master.svc.cluster.local. udp 48 false 512" NXDOMAIN qr,aa,rd,ra 141 0.000247078s
10.44.0.28:54835 - [14/Dec/2019:12:29:12 +0000] 59604 "AAAA IN spark-master. udp 30 false 512" NXDOMAIN qr,rd,ra 106 0.020408006s
10.44.0.28:38604 - [14/Dec/2019:12:29:15 +0000] 53244 "A IN spark-master.default.svc.cluster.local. udp 56 false 512" NOERROR qr,aa,rd,ra 110 0.000209231s
10.44.0.28:38604 - [14/Dec/2019:12:29:15 +0000] 23079 "AAAA IN spark-master.default.svc.cluster.local. udp 56 false 512" NOERROR qr,rd,ra 149 0.000191722s
10.44.0.28:57478 - [14/Dec/2019:12:32:15 +0000] 15451 "AAAA IN spark-master.default.svc.cluster.local. udp 56 false 512" NOERROR qr,aa,rd,ra 149 0.000383919s
10.44.0.28:57478 - [14/Dec/2019:12:32:15 +0000] 45086 "A IN spark-master.default.svc.cluster.local. udp 56 false 512" NOERROR qr,aa,rd,ra 110 0.001197812s
10.40.0.34:54678 - [14/Dec/2019:12:52:31 +0000] 6509 "A IN kubernetes.default.svc.monitoring.svc.cluster.local. udp 69 false 512" NXDOMAIN qr,aa,rd,ra 162 0.000522769s
10.40.0.34:60234 - [14/Dec/2019:12:52:31 +0000] 15538 "AAAA IN kubernetes.default.svc.monitoring.svc.cluster.local. udp 69 false 512" NXDOMAIN qr,aa,rd,ra 162 0.000851171s
10.40.0.34:43989 - [14/Dec/2019:12:52:31 +0000] 2712 "AAAA IN kubernetes.default.svc.svc.cluster.local. udp 58 false 512" NXDOMAIN qr,aa,rd,ra 151 0.000306038s
10.40.0.34:59265 - [14/Dec/2019:12:52:31 +0000] 23765 "A IN kubernetes.default.svc.cluster.local. udp 54 false 512" NOERROR qr,aa,rd,ra 106 0.000274748s
10.40.0.34:45622 - [14/Dec/2019:13:26:31 +0000] 38766 "AAAA IN kubernetes.default.svc.monitoring.svc.cluster.local. udp 69 false 512" NXDOMAIN qr,aa,rd,ra 162 0.000436681s
10.40.0.34:42759 - [14/Dec/2019:13:26:31 +0000] 56753 "A IN kubernetes.default.svc.monitoring.svc.cluster.local. udp 69 false 512" NXDOMAIN qr,aa,rd,ra 162 0.000706638s
10.40.0.34:39563 - [14/Dec/2019:13:26:31 +0000] 37876 "AAAA IN kubernetes.default.svc.svc.cluster.local. udp 58 false 512" NXDOMAIN qr,aa,rd,ra 151 0.000445999s
10.40.0.34:57224 - [14/Dec/2019:13:26:31 +0000] 33157 "A IN kubernetes.default.svc.svc.cluster.local. udp 58 false 512" NXDOMAIN qr,aa,rd,ra 151 0.000536896s
$ kubectl get svc -n kube-system
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kube-dns ClusterIP 10.96.0.10 <none> 53/UDP,53/TCP 298d
kubernetes-dashboard ClusterIP 10.96.204.236 <none> 443/TCP 298d
tiller-deploy ClusterIP 10.110.41.66 <none> 44134/TCP 123d
$ kubectl get ep kube-dns --namespace=kube-system
NAME ENDPOINTS AGE
kube-dns 10.32.0.98:53,10.44.0.21:53,10.32.0.98:53 + 1 more... 298d
When busybox is in the worker1:
$ kubectl exec -ti busybox -- nslookup kubernetes.default
Server: 10.96.0.10
Address 1: 10.96.0.10
nslookup: can't resolve 'kubernetes.default'
command terminated with exit code 1
But when busybox is in the worker3:
$ kubectl exec -ti busybox -- nslookup kubernetes.default
Server: 10.96.0.10
Address 1: 10.96.0.10
Name: kubernetes.default
Address 1: 10.96.0.1 kubernetes.default.svc.cluster.local
All Nodes are : Ubuntu 16.04
The content of /etc/resolve.conf for all pods are same.
The only difference which I can find is in the kube-proxy logs:
The working node kube-proxy logs:
$ kubectl logs kube-proxy-vg8jd -n kube-system
W1214 06:12:19.201889 1 server_others.go:295] Flag proxy-mode="" unknown, assuming iptables proxy
I1214 06:12:19.321747 1 server_others.go:148] Using iptables Proxier.
W1214 06:12:19.332725 1 proxier.go:317] clusterCIDR not specified, unable to distinguish between internal and external traffic
I1214 06:12:19.332949 1 server_others.go:178] Tearing down inactive rules.
I1214 06:12:20.557875 1 server.go:447] Version: v1.12.1
I1214 06:12:20.601081 1 conntrack.go:98] Set sysctl 'net/netfilter/nf_conntrack_max' to 131072
I1214 06:12:20.601393 1 conntrack.go:52] Setting nf_conntrack_max to 131072
I1214 06:12:20.601958 1 conntrack.go:83] Setting conntrack hashsize to 32768
I1214 06:12:20.602234 1 conntrack.go:98] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_established' to 86400
I1214 06:12:20.602300 1 conntrack.go:98] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_close_wait' to 3600
I1214 06:12:20.602544 1 config.go:202] Starting service config controller
I1214 06:12:20.602561 1 controller_utils.go:1027] Waiting for caches to sync for service config controller
I1214 06:12:20.602585 1 config.go:102] Starting endpoints config controller
I1214 06:12:20.602619 1 controller_utils.go:1027] Waiting for caches to sync for endpoints config controller
I1214 06:12:20.702774 1 controller_utils.go:1034] Caches are synced for service config controller
I1214 06:12:20.702827 1 controller_utils.go:1034] Caches are synced for endpoints config controller
The not working node kube-proxy logs:
$ kubectl logs kube-proxy-fgzpf -n kube-system
W1215 12:47:12.660749 1 server_others.go:295] Flag proxy-mode="" unknown, assuming iptables proxy
I1215 12:47:12.679348 1 server_others.go:148] Using iptables Proxier.
W1215 12:47:12.679538 1 proxier.go:317] clusterCIDR not specified, unable to distinguish between internal and external traffic
I1215 12:47:12.679665 1 server_others.go:178] Tearing down inactive rules.
E1215 12:47:12.760702 1 proxier.go:529] Error removing iptables rules in ipvs proxier: error deleting chain "KUBE-MARK-MASQ": exit status 1: iptables: Too many links.
I1215 12:47:12.799926 1 server.go:447] Version: v1.12.1
I1215 12:47:12.832047 1 conntrack.go:98] Set sysctl 'net/netfilter/nf_conntrack_max' to 131072
I1215 12:47:12.833067 1 conntrack.go:52] Setting nf_conntrack_max to 131072
I1215 12:47:12.833266 1 conntrack.go:98] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_established' to 86400
I1215 12:47:12.833498 1 conntrack.go:98] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_close_wait' to 3600
I1215 12:47:12.833934 1 config.go:202] Starting service config controller
I1215 12:47:12.834061 1 controller_utils.go:1027] Waiting for caches to sync for service config controller
I1215 12:47:12.834253 1 config.go:102] Starting endpoints config controller
I1215 12:47:12.834338 1 controller_utils.go:1027] Waiting for caches to sync for endpoints config controller
I1215 12:47:12.934408 1 controller_utils.go:1034] Caches are synced for service config controller
I1215 12:47:12.934564 1 controller_utils.go:1034] Caches are synced for endpoints config controller
Line five doesn't appears in first one. I don't know that is related to the issue or not.
Any suggestions are welcomed.
The double svc.svc in kubernetes.default.svc.svc.cluster.local looks stange. Check if that is the same in the coredns-576cbf47c7-6dtrc pod.
Shutdown the coredns-576cbf47c7-6dtrc pod to guarantee that the single remaining DNS instance will be answering the DNS queries from all worker nodes.
According to the docs, problems like this "... indicate a problem with the coredns/kube-dns add-on or associated Services". Restarting coredns may solve the issue.
I'd add to the list of things to look into to check and compare /etc/resolv.conf on the nodes.
Looks like this commands can help:
sudo update-alternatives --set iptables /usr/sbin/iptables-legacy
sudo update-alternatives --set ip6tables /usr/sbin/ip6tables-legacy
sudo update-alternatives --set arptables /usr/sbin/arptables-legacy
sudo update-alternatives --set ebtables /usr/sbin/ebtables-legacy
Also if there is a failing flannel-pod, we can check logs for it container.
So
sudo ip link delete flannel.1 on the failing node allows the pod to recreate successfully after deleting the failing pod.

No route to host from some Kubernetes containers to other containers in same cluster

This is a Kubespray deployment using calico. All the defaults are were left as-is except for the fact that there is a proxy. Kubespray ran to the end without issues.
Access to Kubernetes services started failing and after investigation, there was no route to host to the coredns service. Accessing a K8S service by IP worked. Everything else seems to be correct, so I am left with a cluster that works, but without DNS.
Here is some background information:
Starting up a busybox container:
# nslookup kubernetes.default
Server: 169.254.25.10
Address: 169.254.25.10:53
** server can't find kubernetes.default: NXDOMAIN
*** Can't find kubernetes.default: No answer
Now the output while explicitly defining the IP of one of the CoreDNS pods:
# nslookup kubernetes.default 10.233.0.3
;; connection timed out; no servers could be reached
Notice that telnet to the Kubernetes API works:
# telnet 10.233.0.1 443
Connected to 10.233.0.1
kube-proxy logs:
10.233.0.3 is the service IP for coredns. The last line looks concerning, even though it is INFO.
$ kubectl logs kube-proxy-45v8n -nkube-system
I1114 14:19:29.657685 1 node.go:135] Successfully retrieved node IP: X.59.172.20
I1114 14:19:29.657769 1 server_others.go:176] Using ipvs Proxier.
I1114 14:19:29.664959 1 server.go:529] Version: v1.16.0
I1114 14:19:29.665427 1 conntrack.go:52] Setting nf_conntrack_max to 262144
I1114 14:19:29.669508 1 config.go:313] Starting service config controller
I1114 14:19:29.669566 1 shared_informer.go:197] Waiting for caches to sync for service config
I1114 14:19:29.669602 1 config.go:131] Starting endpoints config controller
I1114 14:19:29.669612 1 shared_informer.go:197] Waiting for caches to sync for endpoints config
I1114 14:19:29.769705 1 shared_informer.go:204] Caches are synced for service config
I1114 14:19:29.769756 1 shared_informer.go:204] Caches are synced for endpoints config
I1114 14:21:29.666256 1 graceful_termination.go:93] lw: remote out of the list: 10.233.0.3:53/TCP/10.233.124.23:53
I1114 14:21:29.666380 1 graceful_termination.go:93] lw: remote out of the list: 10.233.0.3:53/TCP/10.233.122.11:53
All pods are running without crashing/restarts etc. and otherwise services behave correctly.
IPVS looks correct. CoreDNS service is defined there:
# ipvsadm -ln
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
-> RemoteAddress:Port Forward Weight ActiveConn InActConn
TCP 10.233.0.1:443 rr
-> x.59.172.19:6443 Masq 1 0 0
-> x.59.172.20:6443 Masq 1 1 0
TCP 10.233.0.3:53 rr
-> 10.233.122.12:53 Masq 1 0 0
-> 10.233.124.24:53 Masq 1 0 0
TCP 10.233.0.3:9153 rr
-> 10.233.122.12:9153 Masq 1 0 0
-> 10.233.124.24:9153 Masq 1 0 0
TCP 10.233.51.168:3306 rr
-> x.59.172.23:6446 Masq 1 0 0
TCP 10.233.53.155:44134 rr
-> 10.233.89.20:44134 Masq 1 0 0
UDP 10.233.0.3:53 rr
-> 10.233.122.12:53 Masq 1 0 314
-> 10.233.124.24:53 Masq 1 0 312
Host routing also looks correct.
# ip r
default via x.59.172.17 dev ens3 proto dhcp src x.59.172.22 metric 100
10.233.87.0/24 via x.59.172.21 dev tunl0 proto bird onlink
blackhole 10.233.89.0/24 proto bird
10.233.89.20 dev calib88cf6925c2 scope link
10.233.89.21 dev califdffa38ed52 scope link
10.233.122.0/24 via x.59.172.19 dev tunl0 proto bird onlink
10.233.124.0/24 via x.59.172.20 dev tunl0 proto bird onlink
x.59.172.16/28 dev ens3 proto kernel scope link src x.59.172.22
x.59.172.17 dev ens3 proto dhcp scope link src x.59.172.22 metric 100
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
I have redeployed this same cluster in separate environments with flannel and calico with iptables instead of ipvs. I have also disabled the docker http proxy after deploy temporarily. None of which makes any difference.
Also:
kube_service_addresses: 10.233.0.0/18
kube_pods_subnet: 10.233.64.0/18
(They do not overlap)
What is the next step in debugging this issue?
I highly recommend you to avoid using latest busybox image to troubleshoot DNS. There are few issues reported regarding dnslookup on versions newer than 1.28.
v 1.28.4
user#node1:~$ kubectl exec -ti busybox busybox | head -1
BusyBox v1.28.4 (2018-05-22 17:00:17 UTC) multi-call binary.
user#node1:~$ kubectl exec -ti busybox -- nslookup kubernetes.default
Server: 169.254.25.10
Address 1: 169.254.25.10
Name: kubernetes.default
Address 1: 10.233.0.1 kubernetes.default.svc.cluster.local
v 1.31.1
user#node1:~$ kubectl exec -ti busyboxlatest busybox | head -1
BusyBox v1.31.1 (2019-10-28 18:40:01 UTC) multi-call binary.
user#node1:~$ kubectl exec -ti busyboxlatest -- nslookup kubernetes.default
Server: 169.254.25.10
Address: 169.254.25.10:53
** server can't find kubernetes.default: NXDOMAIN
*** Can't find kubernetes.default: No answer
command terminated with exit code 1
Going deeper and exploring more possibilities, I've reproduced your problem on GCP and after some digging I was able to figure out what is causing this communication problem.
GCE (Google Compute Engine) blocks traffic between hosts by default; we have to allow Calico traffic to flow between containers on different hosts.
According to calico documentation, you can do it by creating a firewall allowing this communication rule:
gcloud compute firewall-rules create calico-ipip --allow 4 --network "default" --source-ranges "10.128.0.0/9"
You can verify the rule with this command:
gcloud compute firewall-rules list
This is not present on the most recent calico documentation but it's still true and necessary.
Before creating firewall rule:
user#node1:~$ kubectl exec -ti busybox2 -- nslookup kubernetes.default
Server: 10.233.0.3
Address 1: 10.233.0.3 coredns.kube-system.svc.cluster.local
nslookup: can't resolve 'kubernetes.default'
command terminated with exit code 1
After creating firewall rule:
user#node1:~$ kubectl exec -ti busybox2 -- nslookup kubernetes.default
Server: 10.233.0.3
Address 1: 10.233.0.3 coredns.kube-system.svc.cluster.local
Name: kubernetes.default
Address 1: 10.233.0.1 kubernetes.default.svc.cluster.local
It doesn't matter if you bootstrap your cluster using kubespray or kubeadm, this problem will happen because calico needs to communicate between nodes and GCE is blocking it as default.
This is what works for me, I tried to install my k8s cluster using kubespray configured with calico as CNI and containerd as container runtime
iptables -P INPUT ACCEPT
iptables -P FORWARD ACCEPT
iptables -P OUTPUT ACCEPT
iptables -F
[delete coredns pod]

Kubernetes DNS intermittently failing with kube-dns service and CoreDNS pods seeming OK

We have a Kubernetes cluster with 1 master and 3 nodes managed by kops that we use for our application deployment. We have minimal pod-to-pod connectivity but like the autoscaling features in Kubernetes. We've been using this for the past few months but recently have started having issue where our pods randomly cannot connect to our redis or database with an error like:
Set state pending error: dial tcp: lookup redis.id.0001.use1.cache.amazonaws.com on 100.64.0.10:53: read udp 100.126.88.186:35730->100.64.0.10:53: i/o timeout
or
OperationalError: (psycopg2.OperationalError) could not translate host name “postgres.id.us-east-1.rds.amazonaws.com” to address: Temporary failure in name resolution
What's stranger is this only occurs some of the time, then when a pod is recreated it will work again and this will trip it up shortly after.
We have tried following all of Kube's kube-dns debugging instructions to no avail, tried countless solutions like changing the ndots configuration and have even experimented moving to CoreDNS, but still have the exact same intermittent issues. We use Calico for networking but it's hard to say if it's occurring at the network level as we haven't seen issues with any other services.
Does anyone have any ideas of where else to look for what could be causing this behavior, or if you've experienced this behavior before yourself could you please share how you resolved it?
Thanks
The pods for CoreDNS look OK
⇒ kubectl get pods --namespace=kube-system
NAME READY STATUS RESTARTS AGE
...
coredns-784bfc9fbd-xwq4x 1/1 Running 0 3h
coredns-784bfc9fbd-zpxhg 1/1 Running 0 3h
...
We have enabled logging on CoreDNS and seen requests actually coming through:
⇒ kubectl logs coredns-784bfc9fbd-xwq4x --namespace=kube-system
.:53
2019-04-09T00:26:03.363Z [INFO] CoreDNS-1.2.6
2019-04-09T00:26:03.364Z [INFO] linux/amd64, go1.11.2, 756749c
CoreDNS-1.2.6
linux/amd64, go1.11.2, 756749c
[INFO] plugin/reload: Running configuration MD5 = 7f2aea8cc82e8ebb0a62ee83a9771ab8
[INFO] Reloading
[INFO] plugin/reload: Running configuration MD5 = 73a93c15a3b7843ba101ff3f54ad8327
[INFO] Reloading complete
...
2019-04-09T02:41:08.412Z [INFO] 100.126.88.129:34958 - 18745 "AAAA IN sqs.us-east-1.amazonaws.com.cluster.local. udp 59 false 512" NXDOMAIN qr,aa,rd,ra 152 0.000182646s
2019-04-09T02:41:08.412Z [INFO] 100.126.88.129:51735 - 62992 "A IN sqs.us-east-1.amazonaws.com.cluster.local. udp 59 false 512" NXDOMAIN qr,aa,rd,ra 152 0.000203112s
2019-04-09T02:41:13.414Z [INFO] 100.126.88.129:33525 - 52399 "A IN sqs.us-east-1.amazonaws.com.ec2.internal. udp 58 false 512" NXDOMAIN qr,rd,ra 58 0.001017774s
2019-04-09T02:41:18.414Z [INFO] 100.126.88.129:44066 - 47308 "A IN sqs.us-east-1.amazonaws.com. udp 45 false 512" NOERROR qr,rd,ra 140 0.000983118s
...
Service and endpoints look OK
⇒ kubectl get svc --namespace=kube-system
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kube-dns ClusterIP 100.64.0.10 <none> 53/UDP,53/TCP 63d
...
⇒ kubectl get ep kube-dns --namespace=kube-system
NAME ENDPOINTS AGE
kube-dns 100.105.44.88:53,100.127.167.160:53,100.105.44.88:53 + 1 more... 63d
...
We also encounter this issue, but issue was with query timeout.
The best way after testing was to run dns on all nodes and all PODs referring to their own node DNS. It will save round trips to other node pods because you may run multiple pods for DNS but dns service will distribute traffic some how and PODs will end up having more network traffic across nodes. Not sure if possible on amazon eks.