First request to a new ReplicaSet times out - kubernetes

I have a Kubernetes cluster on AWS, set up with kops.
I set up a Deployment that runs an Apache container and a Service for the Deployment (type: LoadBalancer).
When I update the deployment by running kubectl set image ..., as soon as the first pod of the new ReplicaSet becomes ready, the first couple of requests to the service time out.
Things I have tried:
I set up a readinessProbe on the pod, works.
I ran curl localhost on a pod, works.
I performed a DNS lookup for the service, works.
If I curl the IP returned by that DNS lookup from inside a pod, the first request times out. This tells me it's not an ELB issue.
It's really frustrating since otherwise our Kubernetes stack is working great, but every time we deploy our application we run the risk of a user timing out on a request.

After a lot of debugging, I think I've solved this issue.
TL;DR: Apache has to exit gracefully.
I found a couple of related issues:
https://github.com/kubernetes/kubernetes/issues/47725
https://github.com/kubernetes/ingress-nginx/issues/69
504 Gateway Timeout - Two EC2 instances with load balancer
Some more things I tried:
Increase the KeepAliveTimeout on Apache, didn't help.
Ran curl on the pod IP and node IPs, worked normally.
Set up an externalName selector-less service for a couple of external dependencies, thinking it might have something to do with DNS lookups, didn't help.
The solution:
I set up a preStop lifecycle hook on the pod that runs apachectl -k graceful-stop to terminate Apache gracefully.
The issue (at least from what I can tell) is that when pods are taken down during a deployment, they receive a TERM signal, which causes Apache to immediately kill all of its children. This can cause a race condition where kube-proxy still sends some traffic to pods that have received a TERM signal but have not yet terminated completely.
Also got some help from this blog post on how to set up the hook.
I also recommend increasing terminationGracePeriodSeconds in the PodSpec so Apache has enough time to exit gracefully.
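A minimal sketch of what the hook and the grace period described above could look like, assuming the stock httpd image. The resource name, labels, image tag, and the 60-second value are illustrative assumptions, not values from the original manifest.

```yaml
# Sketch only: name, labels, image tag, and timings are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: apache                 # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: apache
  template:
    metadata:
      labels:
        app: apache
    spec:
      # Upper bound on total shutdown time before the container is killed.
      terminationGracePeriodSeconds: 60
      containers:
        - name: apache
          image: httpd:2.4     # assumed image; apachectl must be on PATH
          ports:
            - containerPort: 80
          lifecycle:
            preStop:
              exec:
                # Runs before the container receives the TERM signal.
                command: ["/bin/sh", "-c", "apachectl -k graceful-stop"]
```

The time spent in the preStop hook counts against terminationGracePeriodSeconds, so the grace period should cover both the hook and the longest request Apache is expected to finish.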

Related

k8s container initialization and load balancing

I have a deployment with one pod running my custom image. After executing kubectl create -f deployment.yaml, the pod starts and shows the "running" state in kubectl's output. However, I have an initialization script that starts Apache Tomcat, and it takes around 40-45 seconds for the script to finish and the server inside to come up.
I also have a load balancer deployment with nginx. Nginx forwards incoming requests to Apache Tomcat via proxy_pass. When I scale my deployment to 2 replicas and shut down one of them, the application sometimes gets stuck and freezes.
I suspect that k8s load balancing is not working correctly: k8s is trying to use a pod that is still being initialized by the script.
How can I tell k8s that a pod in the deployment hasn't finished initializing, so that it is not used until it is fully up?
If I understand correctly, your problem is mostly that the application is not ready to accept requests because your initialization script hasn't finished.
For that situation, you can set up probes, such as liveness and readiness probes. With a readiness probe, your application isn't considered ready to accept requests until the pod has fully started and the probe succeeds.
Here you can read more about it: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/
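As a rough sketch, a readiness probe for a container whose startup takes 40-45 seconds could look like the fragment below. The container name, image, port 8080, probe path, and all timing values are assumptions chosen for illustration and need to be adapted to the actual application.

```yaml
# Fragment of spec.template.spec.containers in the Deployment.
# All names, ports, paths, and timings are illustrative assumptions.
- name: tomcat
  image: my-custom-image:latest   # placeholder for the custom image
  ports:
    - containerPort: 8080
  readinessProbe:
    httpGet:
      path: /                     # assumed endpoint that answers once Tomcat is up
      port: 8080
    initialDelaySeconds: 30       # wait before the first check
    periodSeconds: 5              # then check every 5 seconds
    failureThreshold: 6           # mark the pod unready after 6 consecutive failures
```

Until the probe succeeds, the pod is not listed in the endpoints of any Service that selects it, so traffic sent through the Service does not reach a pod that is still initializing.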

Web-Server running in an EKS cluster with spot-instances

I'm running a web-server deployment in an EKS cluster. The deployment is exposed behind a NodePort service, ingress resource, and AWS Load Balancer controller.
This deployment is configured to run on "always-on" nodes, using a Node Selector.
The EKS cluster runs additional auto-scaled workloads which can also use spot instances if needed (in the same namespace).
Since the Node-Port service exposes a static port across all nodes in the cluster, there are many targets in the said target group, which are being registered and de-registered whenever a new node is being added/removed from the cluster.
What exactly happens if a client request is routed to the service through a node that is about to be scaled down?
I'm asking since I'm getting many 504 Gateway Timeouts from the ALB. Specifically, these requests do not reach our FE/BE pods and terminate at the ALB level.
Welcome to the community @gil-shelef!
According to the AWS documentation, additional handlers should be used to get both resilience and cost savings.
Let's start with understanding how this works:
A dedicated node termination handler DaemonSet runs a pod on each spot instance and listens for spot instance interruption notifications. This makes it possible to gracefully terminate any running pods on that node, drain the node from the load balancer, and let the Kubernetes scheduler reschedule the removed pods on different instances.
The workflow (taken from the AWS documentation on Spot Instance Interruption Handling, which also has an example) can be summarized as:
Identify that a Spot Instance is about to be interrupted in two minutes.
Use the two-minute notification window to gracefully prepare the node for termination.
Taint the node and cordon it off to prevent new pods from being placed on it.
Drain connections on the running pods.
Once pods are removed from the endpoints, kube-proxy triggers an update in iptables, which takes a little time. To make this smoother for end users, consider adding a preStop pause of about 5-10 seconds. You can find more information about how this happens and how to mitigate it in my answer here.
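A minimal sketch of such a pause, assuming the container image ships /bin/sh and sleep. The container name, image, and the exact durations are hypothetical.

```yaml
# Pod template fragment (spec.template.spec in a Deployment).
# Name, image, and durations are illustrative assumptions.
terminationGracePeriodSeconds: 30   # must exceed the pause plus app shutdown time
containers:
  - name: web
    image: example/web:latest       # placeholder image
    lifecycle:
      preStop:
        exec:
          # Keep serving while endpoint removal propagates to kube-proxy,
          # then let the container receive TERM.
          command: ["/bin/sh", "-c", "sleep 10"]
```

The idea is that the pod keeps answering requests during the pause, so connections that arrive before iptables has been updated are still served.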
Also here are links for these handlers:
Node termination handler
Cluster autoscaler on AWS
For your last question, please check this AWS KB article on how to troubleshoot 504 errors on EKS.

Connection refused error in outbound request in k8s app container. Istio?

Updated
I have a script that initializes our service.
The script fails when it runs in the container because of a connection refused error on its first outbound request (to an external service).
We tried adding a loop that runs curl and retries if it fails; once it succeeds, the script continues.
Sometimes it succeeds on the first try, sometimes it fails 10-15 times in a row.
We recently started using Istio.
What could be the reason?
This is a well-known Istio bug: https://github.com/istio/istio/issues/11130 (App container unable to connect to network before Istio's sidecar is fully running). It seems the Istio proxy does not start in parallel with the app container but waits for it, because container startup is sequential. As one blogger put it (https://medium.com/@marko.luksa/delaying-application-start-until-sidecar-is-ready-2ec2d21a7b74): "most Kubernetes users assume that after a pod's init containers have finished, the pod's regular containers are started in parallel. It turns out that's not the case."
Containers start in the order defined in the Deployment spec YAML.
So the big question is whether the Istio proxy (Envoy) will start while the first container is stuck in a curl loop (a chicken-and-egg problem).
App container script performs:
until curl --head localhost:15000 ; do echo "Waiting for Istio Proxy to start" ; sleep 3 ; done
As far as I saw, that script doesn't help at all: the proxy is up, but connections to the external hostname still return "connection refused".
Istio 1.7 comes with a new feature that configures the pod to start the sidecar first and hold every other container until the sidecar has started.
Just set values.global.proxy.holdApplicationUntilProxyStarts to true.
Please note that the feature is still experimental.
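As a sketch, the same behaviour can also be requested per workload through the proxy.istio.io/config pod annotation, assuming an Istio version that supports the holdApplicationUntilProxyStarts proxy setting. The fragment below shows only the relevant part of a Deployment's pod template.

```yaml
# Pod template fragment of a Deployment (spec.template).
# Assumes an Istio release that honours this proxy config field.
metadata:
  annotations:
    # Ask the injected sidecar to start first and block the other
    # containers until the proxy is ready.
    proxy.istio.io/config: '{ "holdApplicationUntilProxyStarts": true }'
```

Setting it per pod limits the change to the workload whose init script needs outbound connectivity, instead of altering startup order mesh-wide.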

GKE: 502 when stopping instance

I'm having trouble with my Kubernetes ingress on GKE. I'm simulating termination of a preemptible instance by manually deleting it (through the GCP dashboard). I am running a regional GKE cluster (one VM in each availability zone in us-west1).
A few seconds after selecting delete on only one of the VMs I start receiving 502 errors through the load balancer. Stackdriver logs for the load balancer list the error as failed_to_connect_to_backend.
Monitoring the health of backend-service shows the backend being terminated go from HEALTHY to UNHEALTHY and then disappearing while the other two backends remain HEALTHY.
After a few seconds requests begin to succeed again.
I'm quite confused why the load balancer is unable to direct traffic to the healthy nodes while one goes down - or maybe this is a kubernetes issue? Could the load balancer be properly routing traffic to a healthy instance, but the kubernetes NodePort service on that instance proxies the request back to the unhealthy instance for some reason?
Well, I would say that if you kill a node from the GCP Console, you are killing it from the outside in. It takes time until the kubelet realizes what happened, so kube-proxy won't update the service endpoints and iptables immediately either.
Until that happens, the ingress controller keeps sending packets to the services specified by the ingress rule, and the services keep forwarding them to pods that no longer exist.
This is just speculation, and I might be wrong. But according to the GCP documentation, if you are using preemptible VMs, your app should be fault tolerant.
[EXTRA]
So, let's consider two general scenarios. In the first one we send a kubectl delete pod command, while in the second one we kill a node abruptly.
With kubectl delete pod ... you are telling the api-server that you want to kill a pod. The api-server tells the kubelet to kill the pod, and the pod is re-created on another node (if applicable). kube-proxy updates iptables so the services forward requests to the right pod.
If you kill the node, it is the kubelet that first realizes that something went wrong, so it reports this to the api-server. The api-server then re-schedules the pods on a different node (always). The rest is the same.
My point is that there is a difference between the api-server knowing from the beginning that no packets can be sent to a pod, and being notified only once the kubelet realizes that the node is unhealthy.
How to solve this? You can't. And this is actually logical: do you expect the same performance from preemptible machines, which cost about 5 times less than a normal VM? If that were possible, everyone would be using these VMs.
Finally, again, Google advises using preemptible VMs only if your application is fault tolerant.

Configure Kubernetes StatefulSet to start pods first restart failed containers after start?

Basic info
Hi, I'm encountering a problem with Kubernetes StatefulSets. I'm trying to spin up a set with 3 replicas.
These replicas/pods each have a container which pings a container in the other pods based on their network-id.
The container requires a response from all the pods. If it does not get a response the container will fail. In my situation I need 3 pods/replicas for my setup to work.
Problem description
What happens is the following. Kubernetes starts 2 pods rather fast. However since I need 3 pods for a fully functional cluster the first 2 pods keep crashing as the 3rd is not up yet.
For some reason Kubernetes opts to keep restarting both pods instead of adding the 3rd pod so my cluster will function.
I've seen my setup run properly after about 15 minutes because Kubernetes added the 3rd pod by then.
Question
So, my question.
Does anyone know a way to delay restarting failed containers until the desired amount of pods/replicas have been booted?
I've since found out the cause of this.
StatefulSets launch pods in a specific order. If one of the pods fails to launch it does not launch the next one.
You can add a podManagementPolicy: "Parallel" to launch the pods without waiting for previous pods to be Running.
See this documentation
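For illustration, a StatefulSet with this policy could look roughly like the sketch below. The names, labels, image, and headless service name are hypothetical placeholders, not taken from my actual setup.

```yaml
# Sketch only: names, labels, and image are illustrative assumptions.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: cluster-node
spec:
  serviceName: cluster-node        # assumed headless Service for stable network IDs
  replicas: 3
  # Launch (and terminate) all pods at once instead of one by one in order.
  podManagementPolicy: Parallel
  selector:
    matchLabels:
      app: cluster-node
  template:
    metadata:
      labels:
        app: cluster-node
    spec:
      containers:
        - name: node
          image: example/cluster-node:latest   # placeholder image
```

The default policy is OrderedReady, which is what makes the controller wait for each pod to be Running and Ready before it creates the next one.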
I think a better way to deal with your problem is to use a liveness probe, as described in the document, rather than delaying the restart (the restart delay is not configurable in the YAML).
Your pods respond to the liveness probe right after they start, which lets Kubernetes know they are alive and prevents them from being restarted. Meanwhile, your pods keep pinging the others until they are all up. Only when all your pods have started do they serve external requests. This is similar to creating a ZooKeeper ensemble.