How does Kubernetes load balance HTTP/1.1 requests?

As we know, HTTP/1.1 uses persistent (long-lived) connections by default. A Kubernetes Service, for example one in ClusterIP mode, is an L4-based load balancer.
Suppose I have a Service in front of a web server backed by 3 Pods. Will HTTP/1.1 requests be distributed across the 3 Pods?
Could anybody help clarify this?

This webpage addresses your question perfectly: https://learnk8s.io/kubernetes-long-lived-connections
In the spirit of StackOverflow, let me summarize the webpage here:
TLDR: Kubernetes doesn't load balance long-lived connections, and some Pods might receive more requests than others.
Kubernetes Services do not exist. There's no process listening on the IP address and port of a Service.
The Service IP address is only a placeholder that iptables rules translate into the IP address of one of the destination Pods, using cleverly crafted randomization.
Any connection from a client (whether it originates inside or outside the cluster) is established directly with a Pod, so an HTTP/1.1 persistent connection is maintained between the client and that specific Pod until either side closes it.
Thus, all requests that use a single persistent connection are routed to the one Pod selected by the iptables rules when the connection was established, and are not load-balanced across the other Pods.
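As a quick way to observe this from a Go client (my own illustrative sketch, not from the article; the Service name http://backend is assumed), net/http/httptrace reports whether each request reused an existing connection:

package main

import (
    "fmt"
    "io"
    "net/http"
    "net/http/httptrace"
)

func main() {
    trace := &httptrace.ClientTrace{
        GotConn: func(info httptrace.GotConnInfo) {
            // Reused == true means this request rode an already-open
            // keep-alive connection, i.e. it reached the same Pod as before.
            fmt.Printf("remote=%s reused=%v\n", info.Conn.RemoteAddr(), info.Reused)
        },
    }
    for i := 0; i < 3; i++ {
        req, _ := http.NewRequest("GET", "http://backend", nil) // assumed Service name
        req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))
        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            fmt.Println(err)
            return
        }
        io.Copy(io.Discard, resp.Body) // drain the body so the connection can be reused
        resp.Body.Close()
    }
}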
Additional info:
Per RFC 2616, section 8.1.3 (https://www.w3.org/Protocols/rfc2616/rfc2616-sec8.html#sec8.1.3), persistent connections are hop-by-hop: a proxy server sitting between client and server maintains one HTTP/1.1 persistent connection from the client to itself and a separate one from itself to the server.

Related

How does kube-proxy handle persistent connections to a Service between pods?

I've seen scenarios where requests from one workload, sent to a ClusterIP service for another workload with no affinities set, only get routed to a subset of the associated pods. The Endpoints object for this service does show all of the pod IPs.
I did a little experiment to figure out what is happening.
Experiment
I set up minikube to have a "router" workload with 3 replicas sending requests to a "backend" workload also with 3 pods. The router just sends a request to the service name like http://backend.
I sent 100 requests to the router service via http://$MINIKUBE_IP:$NODE_PORT, since it's exposed as a NodePort service. Then I observed which backend pods actually handled requests. I repeated this test multiple times.
In most cases, only 2 backend pods handled any requests, with the occasional run where all 3 did. In these experiments I never saw all requests go to a single backend, though I have seen that happen before when running other tests in AKS.
This led me to the theory that each router keeps a persistent connection to the one backend pod it first connects to. Given there are 3 routers and 3 backends, there's an 11% chance that all 3 routers "stick" to a single backend, a 67% chance that between them they stick to only 2 of the backends, and a 22% chance that each router sticks to a different backend pod (1-to-1).
Here's one possible combination of router-to-backend connections (out of 27 possible): router-1 and router-2 both stick to backend-1, while router-3 sticks to backend-2.
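A quick enumeration (my addition, just to check the arithmetic) confirms those odds: of the 27 equally likely combinations, 3 use one backend, 18 use exactly two, and 6 use all three.

package main

import "fmt"

func main() {
    // Count how many of the 3^3 = 27 router-to-backend combinations
    // end up using 1, 2 or 3 distinct backends.
    counts := map[int]int{}
    for a := 0; a < 3; a++ {
        for b := 0; b < 3; b++ {
            for c := 0; c < 3; c++ {
                used := map[int]bool{a: true, b: true, c: true}
                counts[len(used)]++
            }
        }
    }
    fmt.Println(counts) // map[1:3 2:18 3:6] -> roughly 11%, 67% and 22%
}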
Disabling HTTP Keep-Alive
If I use a Transport that disables HTTP keep-alives in the router's HTTP client, then requests I make to the router are uniformly distributed between the different backends on every test run, as desired.
// Transport with keep-alives disabled: each request opens (and closes)
// its own TCP connection, so iptables picks a backend per request.
// Requires "net/http".
client := http.Client{
    Transport: &http.Transport{
        DisableKeepAlives: true,
    },
}
resp, err := client.Get("http://backend")
So the theory seems accurate. But here's my question:
How does the router using HTTP KeepAlive / persistent connections actually result in a single connection between one router pod and one backend pod?
There is a kube-proxy in the middle, so I'd expect any persistent connections to be between the router pod and kube-proxy as well as between kube-proxy and the backend pods.
Also, when the router does a DNS lookup, it's going to find the Cluster IP of the backend service every time, so how can it "stick" to a Pod if it doesn't know the Pod IP?
Using Kubernetes 1.17.7.
This excellent article covers your question in detail: https://learnk8s.io/kubernetes-long-lived-connections
Kubernetes Services indeed do not load balance long-lived TCP connections.
Under the hood, Services (in most cases) use iptables to distribute connections between pods. But iptables wasn't designed as a load balancer; it's a firewall, and it isn't capable of high-level load balancing.
As a weak substitute, iptables can send (or not send) a new connection to a certain target with some probability, and can thus act as an L3/L4 balancer. This mechanism is what kube-proxy employs to approximate load balancing.
Does iptables use round-robin?
No, iptables is primarily used for firewalls, and it is not designed to do load balancing.
However, you could craft a smart set of rules that could make iptables behave like a load balancer.
And this is precisely what happens in Kubernetes.
If you have three Pods, kube-proxy writes the following rules:
select Pod 1 as the destination with a likelihood of 33%. Otherwise, move to the next rule
choose Pod 2 as the destination with a probability of 50%. Otherwise, move to the following rule
select Pod 3 as the destination (no probability)
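The percentages look lopsided, but the cascade works out to an even split. Here is a small simulation of that rule chain (illustrative only; the real mechanism is iptables' statistic match, not Go):

package main

import (
    "fmt"
    "math/rand"
)

func main() {
    counts := make([]int, 3)
    for i := 0; i < 300000; i++ {
        switch {
        case rand.Float64() < 1.0/3.0: // rule 1: 33% of all new connections
            counts[0]++
        case rand.Float64() < 0.5: // rule 2: 50% of the remaining 67%
            counts[1]++
        default: // rule 3: whatever falls through
            counts[2]++
        }
    }
    fmt.Println(counts) // roughly [100000 100000 100000]
}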
What happens when you use keep-alive with a Kubernetes Service?
Let's imagine that front-end and backend support keep-alive.
You have a single instance of the front-end and three replicas for the backend.
The front-end makes the first request to the backend and opens the TCP connection.
The request reaches the Service, and one of the Pods is selected as the destination.
The backend Pod replies and the front-end receives the response.
But instead of closing the TCP connection, it is kept open for subsequent HTTP requests.
What happens when the front-end issues more requests?
They are sent to the same Pod.
Isn't iptables supposed to distribute the traffic?
It is.
There is a single TCP connection open, and the iptables rules were only invoked the first time, when the connection was established.
One of the three Pods was selected as the destination.
Since all subsequent requests are channelled through the same TCP connection, iptables isn't invoked anymore.
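If you want to see this with your own eyes, a backend that simply echoes its hostname makes it obvious which replica served each request (a minimal sketch, assuming the container hostname is the Pod name, which is the Kubernetes default):

package main

import (
    "fmt"
    "net/http"
    "os"
)

func main() {
    // Inside a Pod the hostname defaults to the Pod name, so each replica
    // identifies itself in the response body.
    hostname, _ := os.Hostname()
    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        fmt.Fprintln(w, hostname)
    })
    http.ListenAndServe(":8080", nil)
}

Repeated requests over one keep-alive connection will print the same name; only a new TCP connection gives iptables a chance to pick a different Pod.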
Also, it's not quite correct to say that kube-proxy is in the middle.
It isn't: kube-proxy by itself doesn't handle any traffic.
All it does is create iptables rules.
It's iptables that actually intercepts connections, distributes them, performs the DNAT, and so on.
Similar question here.

Sticky sessions considering src IP and src port in K8s

I've got a lift 'n shift deployment type (i.e. by no means cloud-native) and I'd like to set up sticky sessions so that requests keep being handled by the same pod if it's available (from the client's perspective).
Client --> LB --> Ingress --> Service --> Deployment
Because the LB does SNAT, I think service.spec.sessionAffinityConfig.clientIP will work, but since all requests would arrive with the same source IP (that of the load balancer), the workload won't be truly balanced across all the pods in the deployment.
Can you think of any way to consider source IP & port pair in the sticky session behavior?
Edit 1: The deployment runs in Oracle Cloud. We're using the Oracle Cloud Loadbalancer service in plain TCP mode (i.e. OSI Layer4).
What the question describes is actually the default traffic management behaviour in K8s: the packets within each TCP session target the same pod. A TCP session is initiated from a certain source IP (in our case the LB's) and source port (which is different for each session), and the session remains "sticky" to one pod for its whole duration.

Kubernetes how to load balance EXTERNAL persistent tcp connections?

I'm having an issue with load balancing persistent tcp connections to my kubernetes replicas.
I have Unity3D clients outside of the kubernetes cluster.
My cluster is a baremetal cluster with metallb installed composed out of 3 nodes: 1 master and 2 workers.
As I have read there are two approaches:
1) The client connects to all replicas, and each time it needs to send a request it does so on a random connection out of those it has previously established. Periodically, it refreshes its connections (in case autoscaling happened or some of the persistent connections died).
The problem here is that I'm not sure how to reach all replicas externally; headless services cannot be exposed outside the cluster.
2) A service mesh? I have vaguely read/understood that a mesh might establish persistent TCP connections on your behalf. So something like this:
unity3d client <----persistent connection ---> controller <---persistent connection----> replicas
However, I'm not sure how to accomplish this, and I'm not sure what will happen if the controller itself fails: will all the clients get their connections dropped? As I see it, it comes down to the same issue as in 1), which is allowing a client to connect to multiple different replicas at the same time with persistent TCP connections.
Part of question comes as a complement to this : https://learnk8s.io/kubernetes-long-lived-connections
In order to enable external traffic to your cluster you need an Ingress Gateway. Your ingress gateway could be the standard nginx Ingress, a gateway provided by a mesh like the Istio Gateway or a more specialized edge gateway like ambassador, traefik, kong, gloo, etc.
There are at least two ways you can perform load balancing in K8s:
Using a Service resource, which is just a set of iptables rules managed by the kube-proxy process. This is L4 load balancing only; it has no awareness of L7 application protocols like HTTP/2 or gRPC. Depending on your case, this type of LB might not be ideal for long-lived connections, as such connections are rarely closed and therefore rarely rebalanced.
Using the L7 load balancing offered by any of the ingress controllers which will skip the iptables routing (using a headless Service) and allow for more advanced load balancing algorithms.
In order to benefit from the latter case you still need to ensure that connections are eventually terminated, which is often done on the client-to-proxy leg (while connections from the proxy to the upstream are reused). I'm not familiar with Unity3D connections, but if terminating them is not an option you won't be able to do much load balancing after all.
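As a sketch of the "terminate from the client" idea for an ordinary Go HTTP client (not specific to Unity3D; the Service name and timeouts are made up), you can cap how long keep-alive connections are kept so that new connections, and therefore fresh iptables backend choices, happen regularly:

package main

import (
    "fmt"
    "net/http"
    "time"
)

func main() {
    // Idle keep-alive connections are dropped after 30s, so a new TCP
    // connection (and a new backend pick by iptables) is made regularly
    // instead of sticking to one Pod forever.
    transport := &http.Transport{
        IdleConnTimeout:     30 * time.Second,
        MaxIdleConnsPerHost: 2,
    }
    client := &http.Client{Transport: transport, Timeout: 10 * time.Second}

    resp, err := client.Get("http://backend") // assumed in-cluster Service name
    if err != nil {
        fmt.Println(err)
        return
    }
    resp.Body.Close()

    // CloseIdleConnections can also be called on a schedule to force
    // re-balancing even sooner.
    transport.CloseIdleConnections()
}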
When the controller fails, connections will be dropped, and your client could either gracefully re-attempt the connection or panic. It depends on how you code it.
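A bare-bones version of the "gracefully re-attempt" option might look like this (the gateway address and retry limits are made up):

package main

import (
    "fmt"
    "net"
    "time"
)

// dialWithRetry reconnects with exponential backoff when the gateway
// (or the connection through it) goes away.
func dialWithRetry(addr string) (net.Conn, error) {
    backoff := time.Second
    for attempt := 0; attempt < 5; attempt++ {
        conn, err := net.Dial("tcp", addr)
        if err == nil {
            return conn, nil
        }
        fmt.Printf("dial failed (%v), retrying in %s\n", err, backoff)
        time.Sleep(backoff)
        backoff *= 2
    }
    return nil, fmt.Errorf("giving up on %s", addr)
}

func main() {
    conn, err := dialWithRetry("gateway.example.com:7777") // hypothetical gateway endpoint
    if err != nil {
        fmt.Println(err)
        return
    }
    defer conn.Close()
}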

Using a static IP for both ingress/egress in Kubernetes

I have a program which I'm trying to run in a Kubernetes cluster.
The program is a server that speaks a non-standard UDP-based protocol.
The protocol mostly consists of short request/reply pairs, similar to DNS.
One major difference from DNS is that both the "server" and the "clients" can send requests, ie. the communication can be initiated by either party.
The clients are embedded devices configured with the server's IP address.
The clients send their requests to this IP.
They also check that incoming messages originate from this IP, discarding messages from other IPs.
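For concreteness, the server side of such a protocol boils down to something like this (a sketch, not my actual code; the port is illustrative): read a datagram, note the peer address, and reply through the same socket.

package main

import (
    "log"
    "net"
)

func main() {
    addr, err := net.ResolveUDPAddr("udp", ":9000") // illustrative port
    if err != nil {
        log.Fatal(err)
    }
    conn, err := net.ListenUDP("udp", addr)
    if err != nil {
        log.Fatal(err)
    }
    defer conn.Close()

    buf := make([]byte, 1500)
    for {
        n, peer, err := conn.ReadFromUDP(buf)
        if err != nil {
            log.Println(err)
            continue
        }
        log.Printf("request from %s: %q", peer, buf[:n])
        // Reply through the same socket. Whether the client outside the
        // cluster sees the same source IP it originally targeted is exactly
        // the question: it depends on how egress traffic is NATed.
        if _, err := conn.WriteToUDP([]byte("ack"), peer); err != nil {
            log.Println(err)
        }
    }
}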
My question is how I can use Kubernetes to set up the server such that
The server accepts incoming UDP messages on a specific IP.
Real client source IPs are seen by the server.
Any replies (or other messages) the servers sends have that same IP as their source (so that the clients will accept them).
One thing I have tried that doesn't work is to set up a Service with type: LoadBalancer and externalTrafficPolicy: Local (the latter to preserve source IPs for requirement 2).
This setup fulfills requirements 1 and 2 above, but since outbound messages don't pass through the load balancer, their source IP is that of whatever node the pod containing the server is running on.
I'm running Kubernetes on Google Cloud Platform (GKE).
Please verify the solution described in the Kubernetes documentation under "Source IP for Services with Type=LoadBalancer":
- expose the deployment with --type=LoadBalancer
- set the Service's externalTrafficPolicy to Local, e.g. kubectl patch svc <service-name> -p '{"spec":{"externalTrafficPolicy":"Local"}}'
Using the "echoserver" image as described in that example, the response returned my real public address.

How to get the real ip in the request in the pod in kubernetes

I need to get the real client IP from the request in my application. Currently I get 10.2.100.1 every time in my test environment. Is there any way to do this?
This is the same question as GCE + K8S - Accessing referral IP address and How to read client IP addresses from HTTP requests behind Kubernetes services?.
The answer, copied from them, is that this isn't yet possible in the released versions of Kubernetes.
Services go through kube-proxy, which answers the client connection and proxies through to the backend (your web server). The address that you'd see would be the IP of whichever kube-proxy the connection went through.
Work is being actively done on a solution that uses iptables as the proxy, which will cause your server to see the real client IP.
Try getting the IP of the Service that is associated with those pods.
One very roundabout way right now is to set up an HTTP liveness probe and watch the IP it originates from. Just be sure to also respond to it appropriately or it'll assume your pod is down.
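For completeness: when a proxy or ingress in front of the pod does forward the original address, it usually does so in the X-Forwarded-For header. A minimal Go handler along those lines (illustrative; it assumes your proxy actually sets that header and that you trust it):

package main

import (
    "fmt"
    "net/http"
    "strings"
)

// clientIP returns the best guess at the real client address: the first
// entry of X-Forwarded-For when a proxy/ingress sets it, otherwise the
// peer address of the TCP connection (which may be a proxy or node IP).
func clientIP(r *http.Request) string {
    if xff := r.Header.Get("X-Forwarded-For"); xff != "" {
        return strings.TrimSpace(strings.Split(xff, ",")[0])
    }
    return r.RemoteAddr
}

func main() {
    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        fmt.Fprintln(w, clientIP(r))
    })
    http.ListenAndServe(":8080", nil)
}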