I have a simple OpenShift setup with a Service backed by 2 Pods. Each Pod has a readiness probe configured, and the Service is exposed via NodePort. All of this works as expected: once a readiness probe fails, the Service marks the Pod as unreachable and no NEW requests are routed to it.
Scenario 1:
I run a curl command against the Service. While it is running, I introduce a readiness failure on Pod-1. I see that no new requests are sent to Pod-1. This is FINE.
Scenario 2:
I have a Java client that uses the Apache Commons HttpClient library to open a connection to the Kubernetes Service. The connection is established and works fine. The problem appears when I introduce a readiness failure on Pod-1: the client keeps sending requests to Pod-1 only, even though the Service now has only Pod-2 as an endpoint.
My hunch: HttpClient uses persistent connections, and when the Service is exposed via NodePort, the destination of the HTTP connection is effectively Pod-1 itself. So even if the readiness probe fails, requests still go to Pod-1.
Can someone explain why it works the way described above?
kube-proxy (or rather the iptables rules it generates) intentionally does not shut down existing TCP connections when the endpoint mapping changes (which is what a failed readiness probe triggers). This has been discussed on many tickets over the years, with little consensus on whether the behavior should change. For now your best bet is to use an Ingress Controller for HTTP traffic instead, since those all update live and bypass kube-proxy. You could also send back a Keep-Alive header in your responses and terminate persistent connections after N seconds or requests, though that only shrinks the window for badness.
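To make the "terminate persistent connections after N requests" idea concrete, here is a minimal server-side sketch in Go. It is not the asker's Java setup, just an illustration of the technique. The names `limitRequestsPerConn`, `withConnCounter`, `connCounterKey` and `requestsUntilClose` are made up for this example. It relies on two things in `net/http`: `Server.ConnContext` lets you attach a value to each accepted connection, and a handler that sets `Connection: close` makes the server close that connection after the response.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// connCounterKey is a hypothetical context key for the per-connection counter.
type connCounterKey struct{}

// withConnCounter attaches a fresh request counter to every accepted connection.
func withConnCounter(ctx context.Context, _ net.Conn) context.Context {
	return context.WithValue(ctx, connCounterKey{}, new(int64))
}

// limitRequestsPerConn wraps next so that the max-th response on a connection
// carries "Connection: close"; net/http then closes the connection after it.
func limitRequestsPerConn(max int64, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if n, ok := r.Context().Value(connCounterKey{}).(*int64); ok {
			if atomic.AddInt64(n, 1) >= max {
				w.Header().Set("Connection", "close")
			}
		}
		next.ServeHTTP(w, r)
	})
}

// requestsUntilClose starts a local test server and reports how many requests
// a keep-alive client sends before the server asks to close the connection.
func requestsUntilClose(max int64) (int, error) {
	hello := http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		fmt.Fprintln(w, "ok")
	})
	srv := httptest.NewUnstartedServer(limitRequestsPerConn(max, hello))
	srv.Config.ConnContext = withConnCounter
	srv.Start()
	defer srv.Close()

	client := srv.Client()
	for i := 1; i <= 100; i++ {
		resp, err := client.Get(srv.URL)
		if err != nil {
			return 0, err
		}
		io.Copy(io.Discard, resp.Body) // drain so the connection can be reused
		resp.Body.Close()
		if resp.Close { // the server sent "Connection: close"
			return i, nil
		}
	}
	return 0, fmt.Errorf("server never asked to close the connection")
}

func main() {
	n, err := requestsUntilClose(3)
	fmt.Println("requests before close:", n, "error:", err)
}
```

After the close, a well-behaved client opens a new TCP connection, and that new connection goes through the current iptables rules, so it can only land on a ready Pod. As the answer says, this shrinks the window rather than removing it.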
Related
According to the Kubernetes documentation, when a readiness probe fails, the Pod's IP address is removed from the endpoints of all Services that match the Pod.
We are thinking about implementing a SIGTERM handler that fails the health check so the pod stops receiving future traffic. That's what we want: no more inbound traffic. The question is, if the pod is handling requests that depend on backend services that do not reside in the same pod, will the pod still be able to complete those outbound requests?
From the docs (emphasis mine):
Sometimes, applications are temporarily unable to serve traffic. For example, an application might need to load large data or configuration files during startup, or depend on external services after startup. In such cases, you don't want to kill the application, but you don't want to send it requests either. Kubernetes provides readiness probes to detect and mitigate these situations. A pod with containers reporting that they are not ready does not receive traffic through Kubernetes Services.
The pod can't be reached through Kubernetes services. You can still make outbound requests, and anyone using the pod name or IP directly will also still be able to reach it.
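For the SIGTERM idea from the question, a minimal sketch in Go could look like the following. The type `readiness`, its `drainOn` method and the `probe` helper are hypothetical names for this example, and the `/ready` path is an assumption. The handler only changes what the probe sees; nothing here touches outbound connections, which is why in-flight work that calls other backends can still finish.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"os"
	"os/signal"
	"sync/atomic"
	"syscall"
)

// readiness is a hypothetical readiness endpoint that can be switched off.
type readiness struct {
	draining atomic.Bool
}

// ServeHTTP answers the probe: 200 while serving, 503 once draining has begun.
func (r *readiness) ServeHTTP(w http.ResponseWriter, _ *http.Request) {
	if r.draining.Load() {
		http.Error(w, "draining", http.StatusServiceUnavailable)
		return
	}
	fmt.Fprintln(w, "ready")
}

// drainOn flips the pod to "not ready" when the first signal arrives on sigs.
func (r *readiness) drainOn(sigs <-chan os.Signal) {
	<-sigs
	r.draining.Store(true)
}

// probe calls the handler in-process and returns the HTTP status code.
func probe(h http.Handler) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/ready", nil))
	return rec.Code
}

func main() {
	r := &readiness{}
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM)
	done := make(chan struct{})
	go func() { r.drainOn(sigs); close(done) }()

	fmt.Println("before SIGTERM:", probe(r))
	sigs <- syscall.SIGTERM // stand-in for the real signal, for demo purposes only
	<-done
	fmt.Println("after SIGTERM:", probe(r))
}
```

In a real server you would mount the handler with something like `http.Handle("/ready", r)` and point the readiness probe at that path.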
The problem I'm trying to solve is horizontal scaling for a web application where some sessions lead to high CPU usage. The idea is to use a readiness probe to tell K8s that the pod is busy with its current task and that new traffic should be sent to another one (HPA will do the work and prepare a new pod).
But I want the session being processed on the initial pod to stay active, so that once the work is done the result is delivered to the user.
The question is: if the readiness probe fails, will K8s:
Stop routing ALL traffic to the pod and drop current sessions that are open through the ingress, or
Stop routing NEW traffic to the pod, while current sessions stay active for the specified timeout?
Thank you in advance.
UPDATE
It seems I was not quite right in my first edit. It is more correct to say that it will stop routing NEW traffic to the pod, but established TCP connections such as ssh will stay alive.
When the Endpoints controller is notified that the readiness probe failed, it removes the Pod from the Endpoints of the Service the Pod belongs to. The API server then sends this information to the kube-proxies running on the worker nodes, and each kube-proxy updates the iptables rules on its node, which is what prevents new connections from being forwarded to this Pod. However, it's worth knowing that TCP is a stateful protocol (unlike HTTP), so existing connections (e.g. ssh sessions) will still be active.
As we know, HTTP/1.1 uses persistent (long-lived) connections by default. A Kubernetes Service, for example in ClusterIP mode, is an L4 load balancer.
Suppose I have a Service for a web server that is backed by 3 pods. Can HTTP/1.1 requests be distributed across all 3 pods?
Could anybody help clarify it?
This webpage addresses your question perfectly: https://learnk8s.io/kubernetes-long-lived-connections
In the spirit of StackOverflow, let me summarize the webpage here:
TLDR: Kubernetes doesn't load balance long-lived connections, and some Pods might receive more requests than others.
Kubernetes Services do not exist. There's no process listening on the IP address and port of a Service.
The Service IP address is used only as a placeholder that will be translated by iptables rules into the IP addresses of one of the destination pods using cleverly crafted randomization.
Any connection from a client (whether inside or outside the cluster) is established directly with a Pod, so an HTTP/1.1 persistent connection is maintained between the client and one specific Pod until either side closes it.
Thus, all requests that use a single persistent connection will be routed to a single Pod (that is selected by the iptables rule when establishing connection) and not load-balanced to the other Pods.
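You can see the client side of this without a cluster. The sketch below (in Go; `countNewConns` is a made-up helper, and the local `httptest` server stands in for a Pod) uses `net/http/httptrace` to count how many TCP connections a client actually opens for a series of requests. The iptables rules only get a say when a new connection is opened, so the number of new connections is the number of chances a request has to land on a different Pod.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"net/http/httptrace"
)

// countNewConns sends `requests` sequential GETs to a local test server and
// returns how many of them had to open a new TCP connection.
func countNewConns(disableKeepAlives bool, requests int) (int, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		io.WriteString(w, "ok")
	}))
	defer srv.Close()

	client := &http.Client{Transport: &http.Transport{DisableKeepAlives: disableKeepAlives}}
	defer client.CloseIdleConnections()

	newConns := 0
	trace := &httptrace.ClientTrace{
		GotConn: func(info httptrace.GotConnInfo) {
			if !info.Reused { // a fresh connection, not one from the idle pool
				newConns++
			}
		},
	}
	for i := 0; i < requests; i++ {
		req, err := http.NewRequest(http.MethodGet, srv.URL, nil)
		if err != nil {
			return 0, err
		}
		req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))
		resp, err := client.Do(req)
		if err != nil {
			return 0, err
		}
		io.Copy(io.Discard, resp.Body) // drain so the connection can be reused
		resp.Body.Close()
	}
	return newConns, nil
}

func main() {
	withKeepAlive, _ := countNewConns(false, 5)
	withoutKeepAlive, _ := countNewConns(true, 5)
	fmt.Println("new connections with keep-alive:", withKeepAlive)
	fmt.Println("new connections without keep-alive:", withoutKeepAlive)
}
```

With keep-alive, the five requests share one connection, so behind a Service they would all reach the same Pod. With keep-alive disabled, every request opens its own connection and gets its own pass through the rules.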
Additional info:
Per RFC 2616, section 8.1.3 (https://www.w3.org/Protocols/rfc2616/rfc2616-sec8.html#sec8.1.3), a proxy server sitting between client and server must signal persistent connections separately with its clients and with the origin servers it connects to; each persistent connection applies to only one transport link.
I've seen scenarios where requests from one workload, sent to a ClusterIP service for another workload with no affinities set, only get routed to a subset of the associated pods. The Endpoints object for this service does show all of the pod IPs.
I did a little experiment to figure out what is happening.
Experiment
I set up minikube to have a "router" workload with 3 replicas sending requests to a "backend" workload also with 3 pods. The router just sends a request to the service name like http://backend.
I sent 100 requests to the router service via http://$MINIKUBE_IP:$NODE_PORT, since it's exposed as a NodePort service. Then I observed which backend pods actually handled requests. I repeated this test multiple times.
In most cases, only 2 backend pods handled any requests, with the occasional case where all 3 did. I didn't see any run where all requests went to a single pod in these experiments, though I have seen it happen before when running other tests in AKS.
This led me to the theory that the router is keeping a persistent connection to the backend pod it connects to. Given there are 3 routers and 3 backends, there's an 11% chance all 3 routers "stick" to a single backend, a 67% chance that between the 3 routers, they stick to 2 of the backends, and a 22% chance that each router sticks to a different backend pod (1-to-1).
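Those percentages can be checked by brute force. The sketch below (Go; `distinctBackendCounts` is a name made up for this example) assumes each router independently picks one backend with equal probability, enumerates all 3^3 = 27 assignments, and tallies how many distinct backends are in use in each.

```go
package main

import "fmt"

// distinctBackendCounts enumerates every way `routers` clients can each pin
// one connection to one of `backends` Pods, and tallies how many distinct
// Pods receive traffic in each assignment.
func distinctBackendCounts(routers, backends int) map[int]int {
	tally := make(map[int]int)
	total := 1
	for i := 0; i < routers; i++ {
		total *= backends
	}
	for code := 0; code < total; code++ {
		used := make(map[int]bool)
		rest := code
		for r := 0; r < routers; r++ {
			used[rest%backends] = true // router r sticks to this backend
			rest /= backends
		}
		tally[len(used)]++
	}
	return tally
}

func main() {
	tally := distinctBackendCounts(3, 3)
	for k := 1; k <= 3; k++ {
		fmt.Printf("%d backend(s) used: %d of 27 (%.0f%%)\n", k, tally[k], 100*float64(tally[k])/27)
	}
}
```

The tally comes out as 3, 18 and 6 assignments for one, two and three backends in use, which is where 11%, 67% and 22% come from.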
Here's one possible combination of router-to-backend connections (out of 27 possible):
Disabling HTTP Keep-Alive
If I use a Transport disabling HTTP Keep-Alives in the router's http client, then any requests I make to the router are uniformly distributed between the different backends on every test run as desired.
    // Disable keep-alives so every request opens a new TCP connection.
    client := http.Client{
        Transport: &http.Transport{
            DisableKeepAlives: true,
        },
    }
    resp, err := client.Get("http://backend")
So the theory seems accurate. But here's my question:
How does the router using HTTP KeepAlive / persistent connections actually result in a single connection between one router pod and one backend pod?
There is a kube-proxy in the middle, so I'd expect any persistent connections to be between the router pod and kube-proxy as well as between kube-proxy and the backend pods.
Also, when the router does a DNS lookup, it's going to find the Cluster IP of the backend service every time, so how can it "stick" to a Pod if it doesn't know the Pod IP?
Using Kubernetes 1.17.7.
This excellent article covers your question in detail.
Kubernetes Services indeed do not load balance long-lived TCP connections.
Under the hood, Services (in most cases) use iptables to distribute connections between pods. But iptables wasn't designed as a load balancer; it's a firewall. It isn't capable of high-level load balancing.
As a weak substitute, iptables can create (or not create) a connection to a certain target with some probability, and thus can be used as an L3/L4 balancer. This mechanism is what kube-proxy employs to somewhat imitate load balancing.
Does iptables use round-robin?
No, iptables is primarily used for firewalls, and it is not designed to do load balancing.
However, you could craft a smart set of rules that could make iptables behave like a load balancer.
And this is precisely what happens in Kubernetes.
If you have three Pods, kube-proxy writes the following rules:
select Pod 1 as the destination with a likelihood of 33%. Otherwise, move to the next rule
choose Pod 2 as the destination with a probability of 50%. Otherwise, move to the following rule
select Pod 3 as the destination (no probability)
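The 33% / 50% / "always" chain looks lopsided, but it works out to an equal share per Pod: Pod 2 is only considered in the two thirds of cases where Pod 1 was skipped, and half of two thirds is one third. Here is a small simulation of that rule chain in Go. It models the logic only, not iptables itself, and `pickPod` and `simulate` are names made up for this sketch.

```go
package main

import (
	"fmt"
	"math/rand"
)

// pickPod walks the rule chain for n Pods: rule i matches with probability
// 1/(n-i), and the last rule always matches.
func pickPod(n int, draw func() float64) int {
	for i := 0; i < n-1; i++ {
		if draw() < 1.0/float64(n-i) {
			return i
		}
	}
	return n - 1
}

// simulate opens `trials` new connections and counts how many land on each Pod.
func simulate(n, trials int, seed int64) []int {
	rng := rand.New(rand.NewSource(seed))
	counts := make([]int, n)
	for t := 0; t < trials; t++ {
		counts[pickPod(n, rng.Float64)]++
	}
	return counts
}

func main() {
	// Each of the three counts should come out close to 10000.
	fmt.Println(simulate(3, 30000, 1))
}
```

Note that the draw happens once per new connection, not once per HTTP request, which is the whole point of the keep-alive discussion that follows.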
What happens when you use keep-alive with a Kubernetes Service?
Let's imagine that front-end and backend support keep-alive.
You have a single instance of the front-end and three replicas for the backend.
The front-end makes the first request to the backend and opens the TCP connection.
The request reaches the Service, and one of the Pods is selected as the destination.
The backend Pod replies and the front-end receives the response.
But instead of closing the TCP connection, it is kept open for subsequent HTTP requests.
What happens when the front-end issues more requests?
They are sent to the same Pod.
Isn't iptables supposed to distribute the traffic?
It is.
There is a single TCP connection open, and the iptables rules were invoked only the first time, when it was established.
One of the three Pods was selected as the destination.
Since all subsequent requests are channelled through the same TCP connection, iptables isn't invoked anymore.
Also it's not quite correct to say that kube-proxy is in the middle.
It isn't: kube-proxy by itself doesn't handle any traffic.
All it does is create iptables rules.
It's iptables that actually distributes the connections, performs DNAT, etc.
Similar question here.
Is this a thing?
I have some legacy services which will never run in Kubernetes that I currently make available to my cluster by defining a service and manually uploading an endpoints object.
However, the service is horizontally sharded and we often need to restart one of the endpoints. My google-fu might be weak, but I can't figure out whether Kubernetes is clever enough to prevent the Service from repeatedly trying the dead endpoint.
The ideal behavior is that the proxy should detect the outage, mark the endpoint as failed, and at some point when the endpoint comes back re-admit it into the full list of working endpoints.
BTW, I understand that at present, liveness probes are HTTP only. This would need to be a TCP probe because it's a replicated database service that doesn't grok HTTP.
I think the design is for the thing managing the endpoint addresses to add/remove them based on liveness. For services backed by pods, the pod IPs are added to endpoints based on the pod's readiness check. If a pod's liveness check fails, its container is restarted, and the pod's IP is removed from the endpoints while the pod is not ready.
If you are manually managing endpoint addresses, the burden is currently on you (or your external health checker) to maintain the addresses/notReadyAddresses in the endpoint.
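As a rough idea of what such an external health checker could do for a TCP-only database, here is a sketch in Go that tries a plain TCP connect to each address and splits the list in two. `checkEndpoints` and `listenLocal` are hypothetical names; `listenLocal` only exists so the demo has something to connect to. Writing the result back into the Endpoints object (addresses vs. notReadyAddresses) is deliberately left out, and a successful TCP connect is only an assumption about what "healthy" means for your shards.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// checkEndpoints tries a TCP connect to every address and splits the list
// into reachable and unreachable ones.
func checkEndpoints(addrs []string, timeout time.Duration) (ready, notReady []string) {
	for _, addr := range addrs {
		conn, err := net.DialTimeout("tcp", addr, timeout)
		if err != nil {
			notReady = append(notReady, addr)
			continue
		}
		conn.Close()
		ready = append(ready, addr)
	}
	return ready, notReady
}

// listenLocal opens a throwaway listener on a free loopback port, standing in
// for one database shard in the demo.
func listenLocal() (addr string, stop func(), err error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", nil, err
	}
	return ln.Addr().String(), func() { ln.Close() }, nil
}

func main() {
	addr, stop, err := listenLocal()
	if err != nil {
		fmt.Println(err)
		return
	}
	ready, notReady := checkEndpoints([]string{addr}, time.Second)
	fmt.Println("while up:    ready =", ready, "notReady =", notReady)

	stop() // simulate the shard being restarted
	ready, notReady = checkEndpoints([]string{addr}, time.Second)
	fmt.Println("during restart: ready =", ready, "notReady =", notReady)
}
```

Run something like this on a timer and update the endpoints whenever the two lists change; kube-proxy then picks up the change the same way it does for pod-backed services.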