GKE in-cluster DNS resolutions stopped working - kubernetes

So this has been working forever. I have a few simple services running in GKE and they refer to each other via the standard service.namespace DNS names.
Today all DNS name resolution stopped working. I haven't changed anything, although this may have been triggered by a master upgrade.
/ambassador # nslookup ambassador-monitor.default
nslookup: can't resolve '(null)': Name does not resolve
nslookup: can't resolve 'ambassador-monitor.default': Try again
/ambassador # cat /etc/resolv.conf
search default.svc.cluster.local svc.cluster.local cluster.local c.snowcloud-01.internal google.internal
nameserver 10.207.0.10
options ndots:5
Master version 1.14.7-gke.14
I can talk cross-service using their IP addresses, it's just DNS that's not working.
Not really sure what to do about this...

The easiest way to verify if there is a problem with your Kube DNS is to look at the logs StackDriver [https://cloud.google.com/logging/docs/view/overview].
You should be able to find DNS resolution failures in the logs for the pods, with a filter such as the following:
resource.type="container"
("UnknownHost" OR "lookup fail" OR "gaierror")
Be sure to check logs for each container. Because the exact names and numbers of containers can change with the GKE version, you can find them like so:
kubectl get pod -n kube-system -l k8s-app=kube-dns -o \
jsonpath='{range .items[*].spec.containers[*]}{.name}{"\n"}{end}' | sort -u kubectl get pods -n kube-system -l k8s-app=kube-dns
Has the pod been restarted frequently? Look for OOMs in the node console. The nodes for each pod can be found like so:
kubectl get pod -n kube-system -l k8s-app=kube-dns -o \
jsonpath='{range .items[*]}{.spec.nodeName} pod={.metadata.name}{"\n"}{end}'
The kube-dns pod contains four containers:
kube-dns process watches the Kubernetes master for changes in Services and Endpoints, and maintains in-memory lookup structures to serve DNS requests,
dnsmasq adds DNS caching to improve performance,
sidecar provides a single health check endpoint while performing dual health checks (for dnsmasq and kubedns). It also collects dnsmasq metrics and exposes them in the Prometheus format,
prometheus-to-sd scraping the metrics exposed by sidecar and sending them to Stackdriver.
By default, the dnsmasq container accepts 150 concurrent requests. Requests beyond this are simply dropped and result in failed DNS resolution, including resolution for metadata. To check for this, view the logs with the following filter:
resource.type="container"resource.labels.cluster_name="<cluster-name>"resource.labels.namespace_id="kube-system"logName="projects/<project-id>/logs/dnsmasq""Maximum number of concurrent DNS queries reached"
If legacy stackdriver logging of cluster is disabled, use the following filter:
resource.type="k8s_container"resource.labels.cluster_name="<cluster-name>"resource.labels.namespace_name="kube-system"resource.labels.container_name="dnsmasq""Maximum number of concurrent DNS queries reached"
If Stackdriver logging is disabled, execute the following:
kubectl logs --tail=1000 --namespace=kube-system -l k8s-app=kube-dns -c dnsmasq | grep 'Maximum number of concurrent DNS queries reached'
Additionally, you can try to use the command [dig ambassador-monitor.default #10.207.0.10] from each nodes to verify if this is only impacting one node. If it is, you can simple re-create the impacted node.

It appears that I hit a bug that caused the gke-metadata server to start crash pooling (which in turn prevented kube-dns from working).
Creating a new pool with a previous version (1.14.7-gke.10) and migrating to it fixed everything.
I am told a fix has already been submitted.
Thank you for your suggestions.

Start by debugging your kubernetes services [1]. This will tell you whether is a k8s resource issue or kubernetes itself is failing. Once you understand that, you can proceed to fix it. You can post results here if you want to follow up.
[1] https://kubernetes.io/docs/tasks/debug-application-cluster/debug-service/

Related

kubernetes pod cannot resolve local hostnames but can resolve external ones like google.com

I am trying kubernetes and seem to have hit bit of a hurdle. The problem is that from within my pod I can't curl local hostnames such as wrkr1 or wrkr2 (machine hostnames on my network) but can successfully resolve hostnames such as google.com or stackoverflow.com.
My cluster is a basic setup with one master and 2 worker nodes.
What works from within the pod:
curl to google.com from pod -- works
curl to another service(kubernetes) from pod -- works
curl to another machine on same LAN via its IP address such as 192.168.x.x -- works
curl to another machine on same LAN via its hostname such as wrkr1 -- does not work
What works from the node hosting pod:
curl to google.com --works
curl to another machine on same LAN via
its IP address such as 192.168.x.x -- works
curl to another machine
on same LAN via its hostname such as wrkr1 -- works.
Note: the pod cidr is completely different from the IP range used in
LAN
the node contains a hosts file with entry corresponding to wrkr1's IP address (although I've checked node is able to resolve hostname without it also but I read somewhere that a pod inherits its nodes DNS resolution so I've kept the entry)
Kubernetes Version: 1.19.14
Ubuntu Version: 18.04 LTS
Need help as to whether this is normal behavior and what can be done if I want pod to be able to resolve hostnames on local LAN as well?
What happens
Need help as to whether this is normal behavior
This is normal behaviour, because there's no DNS server in your network where virtual machines are hosted and kubernetes has its own DNS server inside the cluster, it simply doesn't know about what happens on your host, especially in /etc/hosts because pods simply don't have access to this file.
I read somewhere that a pod inherits its nodes DNS resolution so I've
kept the entry
This is a point where tricky thing happens. There are four available DNS policies which are applied per pod. We will take a look at two of them which are usually used:
"Default": The Pod inherits the name resolution configuration from the node that the pods run on. See related discussion for more details.
"ClusterFirst": Any DNS query that does not match the configured cluster domain suffix, such as "www.kubernetes.io", is forwarded to the upstream nameserver inherited from the node. Cluster administrators may have extra stub-domain and upstream DNS servers configured
The trickiest ever part is this (from the same link above):
Note: "Default" is not the default DNS policy. If dnsPolicy is not
explicitly specified, then "ClusterFirst" is used.
That means that all pods that do not have DNS policy set will be run with ClusterFirst and they won't be able to see /etc/resolv.conf on the host. I tried changing this to Default and indeed, it can resolve everything host can, however internal resolving stops working, so it's not an option.
For example coredns deployment is run with Default dnsPolicy which allows coredns to resolve hosts.
How this can be resolved
1. Add local domain to coreDNS
This will require to add A records per host. Here's a part from edited coredns configmap:
This should be within .:53 { block
file /etc/coredns/local.record local
This part is right after block above ends (SOA information was taken from the example, it doesn't make any difference here):
local.record: |
local. IN SOA sns.dns.icann.org. noc.dns.icann.org. 2015082541 7200 3600 1209600 3600
wrkr1. IN A 172.10.10.10
wrkr2. IN A 172.11.11.11
Then coreDNS deployment should be added to include this file:
$ kubectl edit deploy coredns -n kube-system
volumes:
- configMap:
defaultMode: 420
items:
- key: Corefile
path: Corefile
- key: local.record # 1st line to add
path: local.record # 2nd line to add
name: coredns
And restart coreDNS deployment:
$ kubectl rollout restart deploy coredns -n kube-system
Just in case check if coredns pods are running and ready:
$ kubectl get pods -A | grep coredns
kube-system coredns-6ddbbfd76-mk2wv 1/1 Running 0 4h46m
kube-system coredns-6ddbbfd76-ngrmq 1/1 Running 0 4h46m
If everything's done correctly, now newly created pods will be able to resolve hosts by their names. Please find an example in coredns documentation
2. Set up DNS server in the network
While avahi looks similar to DNS server, it does not act like a DNS server. It's not possible to setup requests forwarding from coredns to avahi, while it's possible to proper DNS server in the network and this way have everything will be resolved.
3. Deploy avahi to kubernetes cluster
There's a ready image with avahi here. If it's deployed into the cluster with dnsPolicy set to ClusterFirstWithHostNet and most importantly hostNetwork: true it will be able to use host adapter to discover all available hosts within the network.
Useful links:
Pods DNS policy
Custom DNS entries for kubernetes

Telemetry mixer logs

I deploy istio 1.2.5 on a K8s cluster.
According to documentation https://istio.io/faq/mixer/ in rules section:
kubectl get rules --all-namespaces
You will get the list. In my cluster I got No resources found
But if I use:
kubectl get rules.config.istio.io -n istio-system
I got the list:
NAME AGE
kubeattrgenrulerule 5h
promhttp 5h
promtcp 5h
promtcpconnectionclosed 5h
promtcpconnectionopen 5h
stdio 5h
stdiotcp 5h
Someone know the difference?
Also if I try:
kubectl -n istio-system logs -f istio-telemetry-7df96d454b-4kxs9 -c mixer
I didn't got the log of request in the log ( I found it work in another cluster). Do you know why?
I tried to reproduce your issue on both versions Istio 1.2.5 and Istio 1.3.0 and environments like GKE, Minikube and Kubeadm.
I have tried to install it manually and using HELM. Each time everything worked as should.
Based on the information you have provided: I found it work in another cluster and you are using bare metal I would guess that this cluster have some specific configuration or some of the kubernetes/Istio objects have Insufficient resources.
$ kubectl describe node [node-name]
Please keep in mind that you might install Istio Configuration Profile which requested too many resources. Each profile contain different amount of resources based on each object (citadel, egress, galley, pilot, telemetry, etc). For example if you will check Istio Docs
The Envoy proxy uses 0.6 vCPU and 50 MB memory per 1000 requests per second going through the proxy.
The istio-telemetry service uses 0.6 vCPU per 1000 mesh-wide requests per second.
Pilot uses 1 vCPU and 1.5 GB of memory.

How can I sniff all DNS records from kuberenetes?

I wish to sniff and extract all DNS records from kubernetes: clientIP,serverIP,date,QueryType etc...
I had set up a kuberenetes service.
It is online and running. There I created several containerized micro-services that generate DNS queries (HTTP requests to external addresses). How can I see sniff it ? Is there a way to extract logs with DNS records ?
Given that you use CoreDNS as your cluster DNS service you can configure it to log queries, errors etc. to stdout. CoreDNS have been available as an alternative to kube-dns since k8s version 1.11, so if you're running a cluster of version >1.11 there's a good chance that you're using CoreDNS.
The CoreDNS service usually™️ lives in the kube-system namespace and can be reconfigured using the provided ConfigMap.
Example on how to log everything to stdout, taken from the README:
. {
...
log
...
}
When you've reconfigured CoreDNS you can check the Pod logs with:
kubectl logs -n kube-system <POD NAME>
I have successfully extracted DNS logs , using answer above. My new problem is that I can't see resolution data, i.e. RRDATA, such as resolved IP or other response info?

Something seems to be catching TCP traffic to pods

I'm trying to deploy Kubernetes with Calico (IPIP) with Kubeadm. After deployment is done I'm deploying Calico using these manifests
kubectl apply -f https://docs.projectcalico.org/v3.3/getting-started/kubernetes/installation/hosted/rbac-kdd.yaml
kubectl apply -f https://docs.projectcalico.org/v3.3/getting-started/kubernetes/installation/hosted/kubernetes-datastore/calico-networking/1.7/calico.yaml
Before applying it, I'm editing CALICO_IPV4POOL_CIDR and setting it to 10.250.0.0/17 as well as using command kubeadm init --pod-cidr 10.250.0.0/17.
After few seconds CoreDNS pods (for example getting addr 10.250.2.2) starts restarting with error 10.250.2.2:8080 connection refused.
Now a bit of digging:
from any node in cluster ping 10.250.2.2 works and it reaches pod (tcpdump in pod net namespace shows it).
from different pod (on different node) curl 10.250.2.2:8080 works well
from any node to curl 10.250.2.2:8080 fails with connection refused
Because it's coredns pod it listens on 53 both udp and tcp, so I've tried netcat from nodes
nc 10.250.2.2 53 - connection refused
nc -u 10.250.2.2 55 - works
Now I've tcpdump each interface on source node for port 8080 and curl to CoreDNS pod doesn't even seem to leave node... sooo iptables?
I've also tried weave, canal and flannel, all seem to have same issue.
I've ran out of ideas by now...any pointers please?
Seems to be a problem with Calico implementation, CoreDNS Pods are sensitive on the CNI network Pods successful functioning.
For proper CNI network plugin implementation you have to include --pod-network-cidr flag to kubeadm init command and afterwards apply the same value to CALICO_IPV4POOL_CIDR parameter inside calico.yml.
Moreover, for a successful Pod network installation you have to apply some RBAC rules in order to make sufficient permissions in compliance with general cluster security restrictions, as described in official Kubernetes documentation:
For Calico to work correctly, you need to pass
--pod-network-cidr=192.168.0.0/16 to kubeadm init or update the calico.yml file to match your Pod network. Note that Calico works on
amd64 only.
kubectl apply -f https://docs.projectcalico.org/v3.3/getting-started/kubernetes/installation/hosted/rbac-kdd.yaml
kubectl apply -f https://docs.projectcalico.org/v3.3/getting-started/kubernetes/installation/hosted/kubernetes-datastore/calico-networking/1.7/calico.yaml
In your case I would switched to the latest Calico versions at least from v3.3 as given in the example.
If you've noticed that you run Pod network plugin installation properly, please take a chance and update the question with your current environment setup and Kubernetes components versions with a health statuses.

Communication Between Different Services in Kubernetes Deploy

I'm new to using Kubernetes (v 1.11.2-gke.18), and am trying to get Sentry running as a Kubernetes deployment. This is all being run in Google Cloud. Here's the YAML files for Postgres, Redis, and Sentry:
The Redis and Postgres parts seem to be up and running just fine, but Sentry fails because: could not translate host name "postgres-sentry:5432" to address: Name or service not known.
My question is this: How do I get these services to communicate properly? How do I get the Sentry container to communicate with Postgres and Redis?
It turns out that in this case, what I needed was to change the SENTRY_REDIS_HOST and SENTRY_POSTGRES_HOST values to not have the port numbers in the env declaration. Taking those out made everything talk together as expected.
In Kubernetes pods generally can find other pods through services using DNS.
Your configuration looks correct if postgres-sentry is in the same namespace as your sentry app pod/deployment (which looks like the default namespace).
So it points that you may have a problem with DNS. You can check by shelling into the sentry app container/pod and trying to ping postgres-sentry:
$ kubectl exec -it <pod-id-of-your-sentry-app> sh
# ping postgres-sentry
Also, check if you have you /etc/resolv.conf looks something like this:
# cat /etc/resolv.conf
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
Finally, check if your DNS pods are running:
$ kubectl -n kube-system get pods | grep dns
coredns-xxxxxxxxxxx-xxxxx 1/1 Running 15 116d
coredns-xxxxxxxxxxx-xxxxx 1/1 Running 15 116d