How to troubleshoot why the Endpoints in my service don't get updated?

I have a Kubernetes cluster running on the Google Kubernetes Engine.
I have a deployment that I manually scaled up (by editing the HPA object) from 100 replicas to 300 replicas to do some load testing. When I load tested the deployment by sending HTTP requests to the service, it seemed that not all pods were getting an equal amount of traffic: only around 100 pods showed that they were processing traffic (judging by their CPU load and our custom metrics). So my suspicion was that the service was not load balancing the requests among all the pods equally.
When I checked the deployment, I could see that all 300 replicas were ready.
$ k get deploy my-app --show-labels
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE LABELS
my-app 300 300 300 300 21d app=my-app
On the other hand, when I checked the service, I saw this:
$ k describe svc my-app
Name: my-app
Namespace: production
Labels: app=my-app
Selector: app=my-app
Type: ClusterIP
IP: 10.40.9.201
Port: http 80/TCP
TargetPort: http/TCP
Endpoints: 10.36.0.5:80,10.36.1.5:80,10.36.100.5:80 + 114 more...
Port: https 443/TCP
TargetPort: https/TCP
Endpoints: 10.36.0.5:443,10.36.1.5:443,10.36.100.5:443 + 114 more...
Session Affinity: None
Events: <none>
What seemed strange to me was this part:
Endpoints: 10.36.0.5:80,10.36.1.5:80,10.36.100.5:80 + 114 more...
I was expecting to see 300 endpoints there; is that assumption correct?
(I also found this post about a similar issue, but there the author experienced only a few minutes' delay until the endpoints were updated, whereas for me it didn't change even in half an hour.)
How could I troubleshoot what was going wrong? I read that this is handled by the endpoints controller, but I couldn't find any information about where to check its logs.
Update: We managed to reproduce this a couple more times. Sometimes it was less severe: for example, 381 endpoints instead of 445. One interesting thing we noticed was that when we retrieved the details of the endpoints:
$ k describe endpoints my-app
Name: my-app
Namespace: production
Labels: app=my-app
Annotations: <none>
Subsets:
Addresses: 10.36.0.5,10.36.1.5,10.36.10.5,...
NotReadyAddresses: 10.36.199.5,10.36.209.5,10.36.239.2,...
A bunch of IPs were "stuck" in the NotReadyAddresses state (though not the ones that were "missing" from the service: even the sum of the IPs in Addresses and NotReadyAddresses was less than the total number of ready pods). I don't know whether this is related at all; I couldn't find much info online about this NotReadyAddresses field.

It turned out that this was caused by using preemptible VMs in our node pools; it doesn't happen if the nodes are not preemptible.
We couldn't figure out more details of the root cause, but using preemptible VMs as nodes is not an officially supported scenario anyway, so we switched to regular VMs.

Pod IPs can be added to NotReadyAddresses if a readiness probe is failing. This in turn prevents the pod IP from being added to the ready endpoints, meaning that the Kubernetes Service won't route traffic to the pod.
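To illustrate the mechanism, here is a sketch of a container spec with a readiness probe; the image, path, port, and thresholds are hypothetical and would need to match your app:

```yaml
# Illustrative container snippet (names/paths are assumptions, not from
# the question). While this probe fails, the endpoints controller keeps
# the pod's IP under NotReadyAddresses and the Service skips the pod.
containers:
- name: my-app
  image: example/my-app:latest   # hypothetical image
  readinessProbe:
    httpGet:
      path: /healthz             # assumed health endpoint
      port: 80
    periodSeconds: 5
    failureThreshold: 3
```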

Regarding your first try with 300 pods, I would check the following:
kubectl get po -l app=my-app to see whether you get a 300-item list. Your service says you have 300 available pods, which makes your issue very interesting to analyze.
whether your pod/deployment manifest defines resource requests and limits; this helps the scheduler make better decisions
whether some of your nodes have taints incompatible with your pod/deployment manifest
whether your pod/deployment manifest has liveness and readiness probes (please post them)
whether you defined a ResourceQuota object that limits the creation of pods/deployments

Related

How to avoid coredns resolving overhead in kubernetes

I think the title is pretty much self-explanatory. I have done many experiments, and the sad truth is that CoreDNS adds a 20 ms overhead to all requests inside the cluster. At first we thought that by adding more replicas and balancing the resolving requests among more instances we could improve the response time, but it did not help at all (we scaled up from 2 pods to 4 pods).
There was some improvement in the fluctuation of resolving times after scaling up to 4 instances, but it wasn't what we were expecting, and the 20 ms overhead was still there.
We have some web services whose actual response time is < 30 ms, and with CoreDNS we are doubling the response time; it is not cool!
After coming to a conclusion about this matter, we did an experiment to double-check that this is not an OS-level overhead, and the results were no different from what we were expecting.
We thought maybe we could implement/deploy a solution based on putting a list of the needed hostname mappings inside each pod's /etc/hosts. So my final questions are as follows:
Has anyone else experienced something similar with coredns?
Can you please suggest alternative solutions to coredns that work in k8s environment?
Any thoughts or insights are appreciated. Thanks in advance.
There are several things to look at when running CoreDNS in your Kubernetes cluster:
Memory
AutoPath
Number of Replicas
Autoscaler
Other Plugins
Prometheus metrics
Separate Server blocks
Memory
The recommended amount of memory for CoreDNS replicas is:
MB required (default settings) = (Pods + Services) / 1000 + 54
Autopath
Autopath is a feature in CoreDNS that helps improve the response time for external queries.
Normally a DNS query goes through the search path:
<namespace>.svc.cluster.local
svc.cluster.local
cluster.local
Then the configured forward, usually the host search path (/etc/resolv.conf):
Trying "example.com.default.svc.cluster.local"
Trying "example.com.svc.cluster.local"
Trying "example.com.cluster.local"
Trying "example.com"
Trying "example.com"
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 55265
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
;; QUESTION SECTION:
;example.com. IN A
;; ANSWER SECTION:
example.com. 30 IN A 93.184.216.34
This requires more memory, so the calculation now becomes:
MB required (w/ autopath) = (Number of Pods + Services) / 250 + 56
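As a quick sanity check, the two formulas above can be evaluated directly. This is a small Python sketch of the arithmetic (the formulas are the ones quoted here from the CoreDNS scaling guidance; the function name is mine):

```python
def coredns_memory_mb(pods: int, services: int, autopath: bool = False) -> float:
    """Recommended memory per CoreDNS replica, in MB.

    Default settings:  (Pods + Services) / 1000 + 54
    With autopath:     (Pods + Services) / 250  + 56
    """
    objects = pods + services
    if autopath:
        return objects / 250 + 56
    return objects / 1000 + 54

# Example: a cluster with 5000 pods and 1000 services
print(coredns_memory_mb(5000, 1000))                 # 60.0 MB, default settings
print(coredns_memory_mb(5000, 1000, autopath=True))  # 80.0 MB, with autopath
```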
Number of replicas
The number of replicas defaults to 2, but enabling the autoscaler should help with load issues.
Autoscaler
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: coredns
  namespace: default
spec:
  maxReplicas: 20
  minReplicas: 2
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: coredns
  targetCPUUtilizationPercentage: 50
Node local cache
Beta in Kubernetes 1.15
NodeLocal DNSCache improves Cluster DNS performance by running a dns caching agent on cluster nodes as a DaemonSet. In today’s architecture, Pods in ClusterFirst DNS mode reach out to a kube-dns serviceIP for DNS queries. This is translated to a kube-dns/CoreDNS endpoint via iptables rules added by kube-proxy. With this new architecture, Pods will reach out to the dns caching agent running on the same node, thereby avoiding iptables DNAT rules and connection tracking. The local caching agent will query kube-dns service for cache misses of cluster hostnames(cluster.local suffix by default).
https://kubernetes.io/docs/tasks/administer-cluster/nodelocaldns/
Other Plugins
These will also help see what is going on inside CoreDNS
errors - any errors encountered during query processing will be printed to standard output
trace - enables OpenTracing of how a request flows through CoreDNS
log - query logging
health - while CoreDNS is up and running, this returns a 200 OK HTTP status code
ready - by enabling ready, an HTTP endpoint on port 8181 will return 200 OK when all plugins that are able to signal readiness have done so
ready and health should be used in the deployment:
livenessProbe:
  httpGet:
    path: /health
    port: 8080
    scheme: HTTP
  initialDelaySeconds: 60
  timeoutSeconds: 5
  successThreshold: 1
  failureThreshold: 5
readinessProbe:
  httpGet:
    path: /ready
    port: 8181
    scheme: HTTP
Prometheus Metrics
Prometheus Plugin
coredns_health_request_duration_seconds{} - the duration to process an HTTP query to the local /health endpoint. As this is a local operation, it should be fast. A (large) increase in this duration indicates that the CoreDNS process is having trouble keeping up with its query load.
https://github.com/coredns/deployment/blob/master/kubernetes/Scaling_CoreDNS.md
Separate Server blocks
One last bit of advice is to separate the cluster DNS server block from the external block:
CLUSTER_DOMAIN REVERSE_CIDRS {
    errors
    health
    kubernetes
    ready
    prometheus :9153
    loop
    reload
    loadbalance
}

. {
    errors
    autopath @kubernetes
    forward . UPSTREAMNAMESERVER
    cache
    loop
}
More information about the kubernetes plugin and other options here:
https://github.com/coredns/coredns/blob/master/plugin/kubernetes/README.md

Does Kubernetes support green-blue deployment?

I would like to ask about the mechanism for stopping pods in Kubernetes.
I read https://kubernetes.io/docs/concepts/workloads/pods/pod/#termination-of-pods before asking the question.
Suppose we have an application with graceful shutdown support
(for example, a simple HTTP server in Go: https://play.golang.org/p/5tmkPPMiSSt).
The server has two endpoints:
/fast, which always sends a 200 HTTP status code.
/slow, which waits 10 seconds and then sends a 200 HTTP status code.
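In case the playground link goes stale: here is a rough Python equivalent of such a server (a sketch, not the asker's actual Go code). /fast answers immediately, /slow sleeps 10 seconds, and SIGTERM (what the kubelet sends on pod deletion) stops accepting new connections while in-flight requests are allowed to finish:

```python
import signal
import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/slow":
            time.sleep(10)          # simulate slow work
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")

    def log_message(self, fmt, *args):
        pass                        # silence per-request logging

class GracefulServer(ThreadingHTTPServer):
    daemon_threads = False          # let in-flight request threads finish

def make_server(port=0):
    # port=0 picks a free port; the manifest below uses 10002
    return GracefulServer(("127.0.0.1", port), Handler)

def main():
    server = make_server(10002)
    # shutdown() must not be called from the serving thread (it would
    # deadlock), so the signal handler hands it off to a helper thread.
    signal.signal(
        signal.SIGTERM,
        lambda signum, frame: threading.Thread(target=server.shutdown).start(),
    )
    server.serve_forever()

# call main() to run the server
```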
There is a deployment/service resource with this configuration:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test
spec:
  replicas: 1
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app/name: test
  template:
    metadata:
      labels:
        app/name: test
    spec:
      terminationGracePeriodSeconds: 120
      containers:
      - name: service
        image: host.org/images/grace:v0.1
        livenessProbe:
          httpGet:
            path: /health
            port: 10002
          failureThreshold: 1
          initialDelaySeconds: 1
        readinessProbe:
          httpGet:
            path: /health
            port: 10002
          failureThreshold: 1
          initialDelaySeconds: 1
---
apiVersion: v1
kind: Service
metadata:
  name: test
spec:
  type: NodePort
  ports:
  - name: http
    port: 10002
    targetPort: 10002
  selector:
    app/name: test
To make sure the pods are deleted gracefully, I conducted two test options.
First option (slow endpoint) flow:
Create the deployment with a replicas value of 1.
Wait for pod readiness.
Send a request to the /slow endpoint (curl http://ip-of-some-node:nodePort/slow) and delete the pod (simultaneously, with a 1-second offset).
Expected:
The pod must not terminate before the HTTP server has completed my request.
Got:
Yes, the HTTP server processes for 10 seconds and returns the response to me.
(If we pass the --grace-period=1 option to kubectl, then curl prints: curl: (52) Empty reply from server.)
Everything works as expected.
Second option (fast endpoint) flow:
Create the deployment with a replicas value of 10.
Wait for pod readiness.
Start wrk with the "Connection: close" header.
Randomly delete one or two pods (kubectl delete pod/xxx).
Expected:
No socket errors.
Got:
$ wrk -d 2m --header "Connection: Close" http://ip-of-some-node:nodePort/fast
Running 2m test @ http://ip-of-some-node:nodePort/fast
Thread Stats Avg Stdev Max +/- Stdev
Latency 122.35ms 177.30ms 1.98s 91.33%
Req/Sec 66.98 33.93 160.00 65.83%
15890 requests in 2.00m, 1.83MB read
Socket errors: connect 0, read 15, write 0, timeout 0
Requests/sec: 132.34
Transfer/sec: 15.64KB
15 socket errors on read; that is, some pods were (maybe) disconnected from the service before all their requests were processed.
The problem also appears when a new deployment version is applied, on scale-down, and on rollout undo.
Questions:
What's reason of that behavior?
How to fix it?
Kubernetes version: v1.16.2
Edit 1.
The number of errors changes each time, but remains in the range of 10-20 when removing 2-5 pods over two minutes.
P.S. If we do not delete a pod, we get no errors.
Does Kubernetes support green-blue deployment?
Yes, it does. You can read about it on Zero-downtime Deployment in Kubernetes with Jenkins,
A blue/green deployment is a change management strategy for releasing software code. Blue/green deployments, which may also be referred to as A/B deployments, require two identical hardware environments that are configured exactly the same way. While one environment is active and serving end users, the other environment remains idle.
Container technology offers a stand-alone environment to run the desired service, which makes it super easy to create identical environments as required in the blue/green deployment. The loosely coupled Services - ReplicaSets, and the label/selector-based service routing in Kubernetes make it easy to switch between different backend environments.
I would also recommend reading Kubernetes Infrastructure Blue/Green deployments.
Here is a repository with examples from codefresh.io about blue green deployment.
This repository holds a bash script that allows you to perform blue/green deployments on a Kubernetes cluster. See also the respective blog post
Prerequisites
As a convention, the script expects:
the name of your deployment to be $APP_NAME-$VERSION
your deployment to have a label that shows its version
your service to point to the deployment by using a version selector, pointing to the corresponding label in the deployment
Notice that the new color deployment created by the script will follow the same conventions, so each subsequent pipeline you run will work in the same manner.
You can see examples of the tags with the sample application:
service
deployment
You might be also interested in Canary deployment:
Another deployment strategy is using Canaries (a.k.a. incremental rollouts). With canaries, the new version of the application is gradually deployed to the Kubernetes cluster while getting a very small amount of live traffic (i.e. a subset of live users are connecting to the new version while the rest are still using the previous version).
...
The small subset of live traffic to the new version acts as an early warning for potential problems that might be present in the new code. As our confidence increases, more canaries are created and more users are now connecting to the updated version. In the end, all live traffic goes to canaries, and thus the canary version becomes the new “production version”.
EDIT
Questions:
What's reason of that behavior?
When a new deployment is applied, old pods are removed and new ones are scheduled.
This is done by the control plane.
For example, when you use the Kubernetes API to create a Deployment, you provide a new desired state for the system. The Kubernetes Control Plane records that object creation, and carries out your instructions by starting the required applications and scheduling them to cluster nodes–thus making the cluster’s actual state match the desired state.
You have only set up a readinessProbe, which tells your service whether it should send traffic to the pod. This is not a good solution because, as you can see in your example, if you have 10 pods and remove one or two, there is a gap and you receive socket errors.
How to fix it?
You have to understand that this is not broken, so it doesn't need a fix.
This might be mitigated by implementing retries in your application to make sure it sends requests to a working address, or by utilizing other load-balancing features such as an Ingress.
Also, when you are updating a deployment, you can check whether a pod has any incoming/outgoing traffic before deleting it, and roll the update only to pods that are not in use.
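One commonly used mitigation, not mentioned above and therefore offered only as a sketch: delay SIGTERM with a preStop hook, giving kube-proxy and the endpoints controller time to take the pod out of rotation before it stops accepting connections. Applied to the question's pod template, it might look like:

```yaml
# Sketch of the question's container with a preStop delay added.
# The pod keeps serving during the sleep while its endpoint is removed,
# which shrinks the window that produces the wrk read errors.
spec:
  terminationGracePeriodSeconds: 120
  containers:
  - name: service
    image: host.org/images/grace:v0.1
    lifecycle:
      preStop:
        exec:
          command: ["sleep", "10"]   # assumes a sleep binary in the image
```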

Exposed Service and Replica Set Relation in Kubernetes

I have a question about how Kubernetes decides which pod serves a request when there are several replicas of the pod.
For instance, let's assume I have a web application running on a k8s cluster as multiple pod replicas, exposed by a service.
When a client sends a request, it goes to the service and kube-proxy. But where and when does Kubernetes make the decision about which pod should serve the request?
I want to know the internals of Kubernetes for this matter. Can we control this? Can we decide which pod should serve based on client requests and custom conditions?
can we decide which pod should serve based on client requests and custom conditions?
As kube-proxy does L4 load balancing, you can only control sessions based on the client IP; it does not read the headers of the client request.
You can control sessions with the service.spec.sessionAffinityConfig field in the Service object.
The following command provides the explanation:
kubectl explain service.spec.sessionAffinityConfig
The following paragraph provides a detailed answer:
Client-IP based session affinity can be selected by setting service.spec.sessionAffinity to "ClientIP" (the default is "None"), and you can set the max session sticky time by setting the field service.spec.sessionAffinityConfig.clientIP.timeoutSeconds if you have already set service.spec.sessionAffinity to "ClientIP" (the default is "10800").
The Service object would look like this:
kind: Service
apiVersion: v1
metadata:
  name: my-service
spec:
  selector:
    app: my-app
  ports:
  - name: http
    protocol: TCP
    port: 80
    targetPort: 80
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10000
A Kubernetes Service creates a load balancer (and an Endpoints object for it) and will use round robin by default to distribute requests among pods.
You can alter this behaviour.
As Suresh said, you can also use sessionAffinity to ensure that requests for a particular session value always go to the same pod.

How to select a specific pod for a service in Kubernetes

I have a Kubernetes cluster of 3 hosts where each host has a unique id label.
On this cluster, there is a software that has 3 instances (replicas).
Each replica requires to talk to all other replicas. In addition, there is a service that contains all pods so that this application is permanently available.
So I have:
Instance1 (with labels run: theTool,instanceid: 1)
Instance2 (with labels run: theTool,instanceid: 2)
Instance3 (with labels run: theTool,instanceid: 3)
and
Service1 (selecting pods with label instanceid=1)
Service2 (selecting pods with label instanceid=2)
Service3 (selecting pods with label instanceid=3)
Service (selecting pods with label run=theTool)
This approach works, but I cannot scale or use the rolling-update feature.
I would like to define a deployment with 3 replicas, where each replica gets a unique generic label (for instance, a replica id like 1/3, 2/3 and so on).
Within the services, I could use the selector to fetch this label which will exist even after an update.
Another solution might be to select the pod/deployment, depending on the host where it is running on. I could use a DaemonSet or just a pod/deployment with affinity to ensure that each host has an exact one replica of my deployment.
But I don't know how to select a pod based on a label of the host it runs on.
Using the hostname is not an option as hostnames will change in different environments.
I have searched the docs but didn't find anything matching this use case. Hopefully, someone here has an idea how to solve this.
The feature you're looking for is called StatefulSets, which just launched to beta with Kubernetes 1.5 (note that it was previously available in alpha under a different name, PetSets).
In a StatefulSet, each replica has a unique name that is persisted across restarts. In your example, these would be instance-0, instance-1, instance-2 (StatefulSet ordinals start at 0). Since the instance names are persisted (even if the pod is recreated on another node), you don't need a service per instance.
The documentation has more details:
Using StatefulSets
Scaling a StatefulSet
Deleting a StatefulSet
Debugging a StatefulSet
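A minimal sketch of what that could look like (the names and image below are hypothetical; StatefulSets originally shipped under a beta API group, while this manifest uses the current apps/v1):

```yaml
# Hypothetical minimal StatefulSet: pods get stable names instance-0,
# instance-1, instance-2, each reachable individually via the headless
# service (e.g. instance-0.the-tool.default.svc.cluster.local).
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: instance
spec:
  serviceName: the-tool          # headless Service providing per-pod DNS
  replicas: 3
  selector:
    matchLabels:
      run: theTool
  template:
    metadata:
      labels:
        run: theTool
    spec:
      containers:
      - name: the-tool
        image: example/the-tool:latest   # hypothetical image
```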
You can map NodeIP:NodePort to PodIP:PodPort. Your pod is running on some node (instance/VM).
Assign a label to your nodes:
http://kubernetes.io/docs/user-guide/node-selection/
Write a service for your pod, for example:
service.yaml:
apiVersion: v1
kind: Service
metadata:
  name: mysql-service
  labels:
    label: mysql-service
spec:
  type: NodePort
  ports:
  - port: 3306        # port on which your service is running
    nodePort: 32001   # node port on which you can access it statically
    targetPort: 3306
    protocol: TCP
    name: http
  selector:
    name: mysql-selector # bind pod here
Add a node selector (in the spec field) to your deployment.yaml
deployment.yaml:
spec:
  nodeSelector:
    nodename: mysqlnode # labelkey=labelvalue assigned in the first step
With this you will be able to access your pod's service with NodeIP:NodePort. If I labeled node 10.11.20.177 with
nodename=mysqlnode
I would add in the node selector:
nodeSelector:
  nodename: mysqlnode
Since I specified nodePort in the service, I can now access the pod's service (which is running in the container) at
10.11.20.177:32001
But the client should be in the same network so it can access the pod. For outside access, make port 32001 publicly accessible with a firewall configuration. It is static forever; the label will take care of your dynamic pod IPs.

Google Container Engine Auto deleting services/pods

I am testing Google Container Engine and everything was fine until I found this really weird issue.
bash-3.2# kubectl get services --namespace=es
NAME CLUSTER_IP EXTERNAL_IP PORT(S) SELECTOR AGE
elasticsearch-logging 10.67.244.176 <none> 9200/TCP name=elasticsearch-logging 5m
bash-3.2# kubectl describe service elasticsearch-logging --namespace=es
Name: elasticsearch-logging
Namespace: es
Labels: k8s-app=elasticsearch-logging,kubernetes.io/cluster-service=true,kubernetes.io/name=Elasticsearch
Selector: name=elasticsearch-logging
Type: ClusterIP
IP: 10.67.248.242
Port: <unnamed> 9200/TCP
Endpoints: <none>
Session Affinity: None
No events.
After exactly 5 minutes, the service was deleted automatically.
kubectl get events --namespace=es
1m 1m 1 elasticsearch-logging Service DeletingLoadBalancer {service-controller } Deleting load balancer
1m 1m 1 elasticsearch-logging Service DeletedLoadBalancer {service-controller } Deleted load balancer
Does anyone have a clue why? Thanks.
The label kubernetes.io/cluster-service=true is a special, reserved label that shouldn't be used by user resources. That's used by a system process that manages the cluster's addons, like the DNS and kube-ui pods that you'll see in your cluster's kube-system namespace.
The reason your service is being deleted is because the system process is checking for resources with that label, seeing one that it doesn't know about, and assuming that it's something that it started previously that isn't meant to exist anymore. This is explained a little more in this doc about cluster addons.
In general, you shouldn't have any labels that are prefixed with kubernetes.io/ on your resources, since that's a reserved namespace.
After removing the following from metadata/labels in the YAML file, the problem went away:
kubernetes.io/cluster-service: "true"
kubernetes.io/name: "Elasticsearch"