dual Kubernetes Readiness probes?

I have a scenario where I need to 'prepare' Kubernetes for taking a container out of service and shutting it down, but allow it to keep serving some requests until that happens.
For example, let's assume that there are three methods: StartAction, ProcessAction, EndAction. I want to prevent clients from invoking StartAction when a container is about to be shut down. However, they should still be able to use ProcessAction and EndAction on that same container (after all Actions have been completed, the container will shut down).
I was thinking that this is some sort of 'dual' readiness probe, where I basically want to indicate a 'not ready' status but continue to serve requests for already started Actions.
I know that there is a PreStop hook, but I am not confident it serves this need, because according to the documentation I suspect that the pod has already been taken off the load balancer while PreStop is running:
(simultaneous with 3) Pod is removed from endpoints list for service, and are no longer considered part of the set of running Pods for replication controllers. Pods that shutdown slowly cannot continue to serve traffic as load balancers (like the service proxy) remove them from their rotations.
(https://kubernetes.io/docs/concepts/workloads/pods/pod/#termination-of-pods).
Assuming that I must rely on stickiness and must continue serving requests for Actions on containers where those actions were started, is there some recommended practice?

I think you can just implement 2 endpoints in your application:
Custom readiness probe
Shutdown preparation endpoint
So to make the shutdown graceful, I think you should first call the "Shutdown preparation endpoint", which makes the "Custom readiness probe" start returning an error. Kubernetes will then take that Pod out of the Service load balancer (no new clients will come in), but existing TCP connections will be kept, so existing clients can keep operating. Once you see in some custom metric (which your service should expose) that all Actions for existing clients are done, you can shut the containers down using standard Kubernetes actions. All of this should probably be automated using the Kubernetes API and your application's API.
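Something like this is what I have in mind for the probe side of it (just a sketch; the port, the /ready path, the image, and the /prepare-shutdown endpoint are all placeholders your application would have to provide):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: action-service                  # hypothetical name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: action-service
  template:
    metadata:
      labels:
        app: action-service
    spec:
      containers:
      - name: app
        image: example/action-service:1.0   # hypothetical image
        ports:
        - containerPort: 8080
        # Custom readiness probe: the app starts failing /ready once its
        # /prepare-shutdown endpoint has been called, so no new StartAction
        # calls are routed here while ProcessAction/EndAction keep working
        # over existing connections.
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          periodSeconds: 5
          failureThreshold: 1

The point is that /ready keeps returning 200 until /prepare-shutdown has been called, after which it returns an error, so the Pod drops out of the Service endpoints while established connections keep being served.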

Related

Using readiness probe to handle graceful shutdown

From the Kubernetes documentation: when a readiness probe fails, the Pod's IP address is removed from the endpoints of all Services that match the Pod.
We are thinking about implementing a SIGTERM handler that fails the health check and stops the pod from receiving future traffic. That's what we want: no more inbound traffic. The question is, if the pod has in-flight requests that depend on backend services which do not reside in the same pod, will the pod still be able to complete those outbound requests?
From the docs (emphasis mine):
Sometimes, applications are temporarily unable to serve traffic. For example, an application might need to load large data or configuration files during startup, or depend on external services after startup. In such cases, you don't want to kill the application, but you don't want to send it requests either. Kubernetes provides readiness probes to detect and mitigate these situations. A pod with containers reporting that they are not ready does not receive traffic through Kubernetes Services.
The pod can't be reached through Kubernetes services. You can still make outbound requests, and anyone using the pod name or IP directly will also still be able to reach it.
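If you go down the SIGTERM-handler route, the pod spec side of it could look roughly like this (untested sketch; the /healthz path, port, and grace period are assumptions):

apiVersion: v1
kind: Pod
metadata:
  name: my-app                          # hypothetical name
spec:
  # Give in-flight work, including outbound calls to backend services,
  # time to finish after SIGTERM before the container is force-killed.
  terminationGracePeriodSeconds: 120
  containers:
  - name: app
    image: example/my-app:1.0           # hypothetical image
    readinessProbe:
      httpGet:
        path: /healthz                  # the app returns non-2xx here after SIGTERM
        port: 8080
      periodSeconds: 5
      failureThreshold: 1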

Is the container where liveness or readiness probes are configured a dedicated "pod check" container?

I'm following this task Configure Liveness, Readiness and Startup Probes
and it's unclear to me whether the container on which the check is made is a container used only to check the availability of the pod. That would make sense: if the pod-check container fails, the API won't let any traffic into the pod.
Or must the health check signal come from the container where the actual image or app runs? (sorry, another question)
From the link you provided it seems they are speaking about containers, not Pods, so the probes are defined per container. When all containers are ready, the Pod is considered ready too, as written in the doc you provided:
The kubelet uses readiness probes to know when a Container is ready to start accepting traffic. A Pod is considered ready when all of its Containers are ready. One use of this signal is to control which Pods are used as backends for Services. When a Pod is not ready, it is removed from Service load balancers.
So yes, every container that runs an image or app is supposed to expose such a check.
Liveness and readiness probes, as described by Ko2r, are additional checks inside your containers, verified by the kubelet according to the settings for the particular probe:
If the command (defined by health-check) succeeds, it returns 0, and the kubelet considers the Container to be alive and healthy. If the command returns a non-zero value, the kubelet kills the Container and restarts it.
In addition:
The kubelet uses liveness probes to know when to restart a Container. For example, liveness probes could catch a deadlock, where an application is running, but unable to make progress. Restarting a Container in such a state can help to make the application more available despite bugs.
From another point of view:
Pod is a top-level resource in the Kubernetes REST API.
As per docs:
Pods are ephemeral. They are not designed to run forever, and when a Pod is terminated it cannot be brought back. In general, Pods do not disappear until they are deleted by a user or by a controller.
Information about controllers can be found here:
So the best practice is to use controllers as described above. You'll rarely create individual Pods directly in Kubernetes–even singleton Pods. This is because Pods are designed as relatively ephemeral, disposable entities. When a Pod gets created (directly by you, or indirectly by a Controller), it is scheduled to run on a Node in your cluster. The Pod remains on that Node until the process is terminated, the Pod object is deleted, the Pod is evicted for lack of resources, or the Node fails.
Note:
Restarting a container in a Pod should not be confused with restarting the Pod. The Pod itself does not run, but is an environment the containers run in and persists until it is deleted
Because Pods represent running processes on nodes in the cluster, it is important to allow those processes to gracefully terminate when they are no longer needed (vs being violently killed with a KILL signal and having no chance to clean up). Users should be able to request deletion and know when processes terminate, but also be able to ensure that deletes eventually complete. When a user requests deletion of a Pod, the system records the intended grace period before the Pod is allowed to be forcefully killed, and a TERM signal is sent to the main process in each container. Once the grace period has expired, the KILL signal is sent to those processes, and the Pod is then deleted from the API server. If the Kubelet or the container manager is restarted while waiting for processes to terminate, the termination will be retried with the full grace period.
The Kubernetes API server validates and configures data for the api objects which include pods, services, replicationcontrollers, and others. The API Server services REST operations and provides the frontend to the cluster’s shared state through which all other components interact.
For example, when you use the Kubernetes API to create a Deployment, you provide a new desired state for the system. The Kubernetes Control Plane records that object creation, and carries out your instructions by starting the required applications and scheduling them to cluster nodes–thus making the cluster’s actual state match the desired state.
Here you can find information about processing pod termination.
There are different types of probes. For example, for an HTTP probe:
even if your app isn’t an HTTP server, you can create a lightweight HTTP server inside your app to respond to the liveness probe.
Command
For command probes, Kubernetes runs a command inside your container. If the command returns with exit code 0 then the container is marked as healthy.
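To illustrate both types, a liveness HTTP probe and a readiness command probe could be declared like this (paths, port, image, and command are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: probe-demo                      # hypothetical name
spec:
  containers:
  - name: app
    image: example/app:1.0              # hypothetical image
    # HTTP probe: the kubelet GETs this path; a 2xx/3xx response means healthy.
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 10
      periodSeconds: 10
    # Command probe: the kubelet runs this inside the container;
    # exit code 0 means ready.
    readinessProbe:
      exec:
        command: ["cat", "/tmp/ready"]
      periodSeconds: 5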
More about probes and best practices.
Hope this helps.

Is there a way to configure Istio to route traffic to a POD which is in the terminating state?

I have a Kubernetes cluster with two services deployed: SvcA and SvcB - both in the service mesh.
SvcA is backed by a single Pod, SvcA_P1. The application in SvcA_P1 exposes a PreStop HTTP hook. When performing a "kubectl drain" on the node where SvcA_P1 resides, the Pod transitions into the "Terminating" state and remains in that state until the application has completed its work (the REST request returns and Kubernetes removes the Pod). The work for SvcA_P1 includes completing ongoing in-dialog HTTP requests/responses (belonging to established sessions). It can stay in the "Terminating" state for hours before completing.
When the Pod enters the "Terminating" phase, the Istio sidecar appears to remove SvcA_P1 from the pool. Requests sent to SvcA_P1 from, e.g., SvcB_P1 are rejected with "no healthy upstream".
Is there a way to configure Istio/Envoy to:
Continue to send traffic/sessions with affinity to SvcA_P1 while in "terminating" state?
Reject traffic without session affinity to SvcA_P1 (no JSESSIONID, cookies, or special HTTP headers)?
I have played around with the DestinationRule(s), modifying trafficPolicy.loadBalancer.consistentHash.[httpHeaderName|httpCookie] with no luck. Once the Envoy removes the upstream server, the new destination is re-hashed using the reduced set of servers.
Thanks,
Thor
According to the Kubernetes documentation, when a pod must be deleted, three things happen simultaneously:
Pod shows up as “Terminating” when listed in client commands.
When the Kubelet sees that a Pod has been marked as terminating because the "dead" timer for the Pod has been set in the API server, it begins the pod shutdown process. If the pod has defined a preStop hook, it is invoked inside of the pod. If the preStop hook is still running after the grace period expires, step 2 is then invoked with a small (2 second) extended grace period.
Pod is removed from endpoints list for service, and are no longer considered part of the set of running pods for replication controllers. Pods that shutdown slowly cannot continue to serve traffic as load balancers (like the service proxy) remove them from their rotations.
Since Istio works as a mesh network on top of Kubernetes Services, and Services no longer consider a Pod in the Terminating state as a destination for traffic, tweaking Istio policies doesn't help much.
Is there a way to configure Istio/Envoy to continue to send traffic/sessions with affinity to SvcA_P1 while in "terminating" state?
This problem is at Kubernetes level rather than Istio/Envoy level: by default, upon entering the "Terminating" state, Pods are removed from their corresponding Services.
You can change that behaviour by telling your Service to advertise Pods in the "Terminating" state: see that answer.
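I can't inline the linked answer here, but the field it most likely points at is publishNotReadyAddresses on the Service spec (formerly the service.alpha.kubernetes.io/tolerate-unready-endpoints annotation); take the following as an assumption rather than a verified recipe:

apiVersion: v1
kind: Service
metadata:
  name: svca
spec:
  selector:
    app: svca
  ports:
  - port: 80
    targetPort: 8080
  # Keep publishing endpoint addresses for Pods that are not ready,
  # which (with the endpoints controller) should also keep Terminating
  # Pods reachable through the Service.
  publishNotReadyAddresses: true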

Specify scheduling order of a Kubernetes DaemonSet

I have Consul running in my cluster and each node runs a consul-agent as a DaemonSet. I also have other DaemonSets that interact with Consul and therefore require a consul-agent to be running in order to communicate with the Consul servers.
My problem is, if my DaemonSet is started before the consul-agent, the application will error as it cannot connect to Consul and subsequently get restarted.
I also notice the same problem with other DaemonSets, e.g. Weave, which requires kube-proxy and kube-dns. If Weave is started first, it will constantly restart until the kube services are ready.
I know I could add retry logic to my application, but I was wondering if it was possible to specify the order in which DaemonSets are scheduled?
Kubernetes itself does not provide a way to specify dependencies between pods / deployments / services (e.g. "start pod A only if service B is available" or "start pod A after pod B").
The current approach (based on what I found while researching this) seems to be retry logic or an init container. To quote the docs:
They run to completion before any app Containers start, whereas app Containers run in parallel, so Init Containers provide an easy way to block or delay the startup of app Containers until some set of preconditions are met.
This means you can either add retry logic to your application (which I would recommend, as it might also help you in other situations such as a short service outage), or you can use an init container that polls a health endpoint via the Kubernetes service name until it gets a satisfying response (see the sketch below).
retry logic is preferred over startup dependency ordering, since it handles both the initial bringup case and recovery from post-start outages
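For the consul-agent example, such a polling init container might look roughly like this (the service name, port, path, and images are all assumptions):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: my-daemon                       # hypothetical name
spec:
  selector:
    matchLabels:
      app: my-daemon
  template:
    metadata:
      labels:
        app: my-daemon
    spec:
      # Blocks the app container until the consul-agent answers.
      initContainers:
      - name: wait-for-consul
        image: curlimages/curl:8.5.0
        command:
        - sh
        - -c
        - |
          until curl -sf http://consul-agent.default.svc.cluster.local:8500/v1/status/leader; do
            echo "waiting for consul-agent..."
            sleep 2
          done
      containers:
      - name: app
        image: example/my-daemon:1.0    # hypothetical image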

Does a Kubernetes rolling-update gracefully remove pods from a service load balancer

Standard practice for a rolling update of hosts behind a load balancer is to gracefully take the hosts out of rotation. This can be done by marking the host "unhealthy" and ensuring the host no longer receives requests from the load balancer.
Does Kubernetes do something similar for pods managed by a ReplicationController and servicing a LoadBalancer Service?
I.e., does Kubernetes take a pod out of the LoadBalancer rotation, ensure incoming traffic has died down, and only then issue the pod shutdown?
Actually, once you delete the pod, it will be in the "Terminating" state until it is destroyed (after terminationGracePeriodSeconds), which means it is removed from the service load balancer but still capable of serving existing requests.
We also use "readiness" health checks, and preStop is synchronous, so you could have your preStop hook mark the pod's readiness as false and then wait for it to be removed from the load balancer before having the preStop hook exit.
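One way to wire that up, assuming the container image has a shell (the marker file path, the sleep, and the grace period are arbitrary choices, not a recommendation):

apiVersion: v1
kind: Pod
metadata:
  name: drain-demo                      # hypothetical name
spec:
  terminationGracePeriodSeconds: 60
  containers:
  - name: app
    image: example/app:1.0              # hypothetical image
    # Ready only while the drain marker file is absent.
    readinessProbe:
      exec:
        command: ["sh", "-c", "[ ! -f /tmp/draining ]"]
      periodSeconds: 2
      failureThreshold: 1
    lifecycle:
      preStop:
        exec:
          # Mark the pod as draining, then wait long enough for the
          # endpoints controller and kube-proxy to stop sending new traffic.
          command: ["sh", "-c", "touch /tmp/draining && sleep 15"]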
Not quite. Kubernetes will send a TERM signal to the containers in the pod. If the application doesn't stop, it will force-kill the container (after the terminationGracePeriodSeconds parameter).
There are a bunch of bugs opened to take care of this: https://github.com/kubernetes/kubernetes/issues/2789
I can't think of anything elegant that will do this.
There is a preStop parameter for pods that will execute a script before termination. From there you could change the pod's label to something else. This will fool the replication controller: it will see that it now has a lower number of replicas and start a replacement.
For the pods carrying the new label you will have to implement your own logic to stop them once they have finished their work.
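A rough sketch of that relabelling trick (it assumes kubectl is available in the image and that the pod's service account is allowed to patch pods; all names are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: relabel-demo                    # hypothetical name
  labels:
    app: my-app                         # the label the replication controller selects on
spec:
  serviceAccountName: pod-relabeler     # assumed to have permission to patch pods
  containers:
  - name: app
    image: example/app-with-kubectl:1.0 # assumes kubectl is baked into the image
    env:
    - name: POD_NAME
      valueFrom:
        fieldRef:
          fieldPath: metadata.name
    lifecycle:
      preStop:
        exec:
          # Swap the selector label so the replication controller stops
          # counting this pod and brings up a replacement.
          command: ["sh", "-c", "kubectl label pod $POD_NAME app=draining --overwrite"]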