How to manage persistent connections in Kubernetes

In Kubernetes, services talk to each other via a Service IP. Using iptables or something similar, each TCP connection is transparently routed to one of the pods that are available for the called service. If the calling service does not close the TCP connection (e.g. because it uses TCP keepalive or a connection pool), it stays connected to one pod and does not use other pods that may be spawned later.
What is the correct way to handle such a situation?
My own unsatisfying ideas:
Closing connection after each api call
Am I making every call slower only to be able to distribute requests to different pods? Doesn't feel right.
Minimum number of connections
I could force the caller to open multiple connections (assuming it would then distribute the requests across these connections), but how many should be open? The caller has no idea how many pods there are, and probably should not have.
Disable bursting
I could limit the resources of the called service so that it gets slow under multiple requests and the caller opens more connections (hopefully to other pods). Again, I don't like the idea of arbitrarily slowing down requests, and this will only work for CPU-bound services.

The keep-alive behavior can be tuned by options specified in the Keep-Alive general header:
E.g:
Connection: Keep-Alive
Keep-Alive: max=10, timeout=60
Thus, you could re-open a TCP connection after a specific timeout or after a maximum number of HTTP transactions, rather than on every API request.
Keep in mind that timeout and max are not guaranteed.
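To illustrate the "max number of HTTP transactions" idea from the server side, here is a minimal Go sketch. It does not parse the Keep-Alive header; instead it assumes the server itself counts requests per connection and answers with "Connection: close" once a limit is reached, which makes a well-behaved client re-dial (and therefore get re-balanced by the Service). The names limitRequestsPerConn, withConnCounter and connectionsUsed are made up for this example, and the httptest server only stands in for a real pod.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// connKey is the context key under which the per-connection counter is stored.
type connKey struct{}

// withConnCounter attaches a fresh request counter to every accepted connection.
func withConnCounter(ctx context.Context, _ net.Conn) context.Context {
	return context.WithValue(ctx, connKey{}, new(int64))
}

// limitRequestsPerConn wraps next so that the response to the maxRequests-th
// request on a connection carries "Connection: close".
func limitRequestsPerConn(maxRequests int64, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if n, ok := r.Context().Value(connKey{}).(*int64); ok {
			if atomic.AddInt64(n, 1) >= maxRequests {
				w.Header().Set("Connection", "close")
			}
		}
		next.ServeHTTP(w, r)
	})
}

// connectionsUsed starts a local test server with the given limit, sends
// `total` sequential requests through one keep-alive client, and returns how
// many TCP connections the server accepted.
func connectionsUsed(maxRequests int64, total int) (int64, error) {
	var accepted int64
	ok := http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		io.WriteString(w, "ok")
	})
	srv := httptest.NewUnstartedServer(limitRequestsPerConn(maxRequests, ok))
	srv.Config.ConnContext = withConnCounter
	srv.Config.ConnState = func(_ net.Conn, s http.ConnState) {
		if s == http.StateNew {
			atomic.AddInt64(&accepted, 1)
		}
	}
	srv.Start()
	defer srv.Close()

	client := srv.Client()
	for i := 0; i < total; i++ {
		resp, err := client.Get(srv.URL)
		if err != nil {
			return 0, err
		}
		// Drain and close the body so the connection can be reused.
		io.Copy(io.Discard, resp.Body)
		resp.Body.Close()
	}
	return atomic.LoadInt64(&accepted), nil
}

func main() {
	n, err := connectionsUsed(2, 5)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("connections accepted for 5 requests:", n)
}
```

The limit trades a little connection setup overhead for periodic re-balancing; the right value depends on how expensive a new connection is for your service.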
EDIT:
Note that if you use a Kubernetes Service, you can choose between two load-balancing modes:
iptables proxy mode (By default, kube-proxy in iptables mode chooses a backend at random.)
IPVS proxy mode where you have different load balancing options:
IPVS provides more options for balancing traffic to backend Pods; these are:
rr: round-robin
lc: least connection (smallest number of open connections)
dh: destination hashing
sh: source hashing
sed: shortest expected delay
nq: never queue
check this link
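As a rough illustration of why a scheduler such as lc can help here, the following Go sketch models the "least connection" idea only: each new connection goes to the backend with the fewest open connections. This is not IPVS code; the backend type, the pod names and the tie-breaking rule are assumptions made for the example.

```go
package main

import "fmt"

// backend is a hypothetical record of one Pod and its open connection count.
type backend struct {
	name string
	open int
}

// pickLeastConn returns the index of the backend with the fewest open
// connections. Ties go to the earliest entry; an empty slice yields -1.
func pickLeastConn(backends []backend) int {
	best := -1
	for i, b := range backends {
		if best == -1 || b.open < backends[best].open {
			best = i
		}
	}
	return best
}

// assign opens n new connections, one at a time, each to the backend that is
// least loaded at that moment.
func assign(backends []backend, n int) {
	for i := 0; i < n; i++ {
		if j := pickLeastConn(backends); j >= 0 {
			backends[j].open++
		}
	}
}

func main() {
	// pod-b was just spawned and has no connections yet.
	pods := []backend{{"pod-a", 4}, {"pod-b", 0}, {"pod-c", 1}}
	assign(pods, 5)
	fmt.Println(pods)
}
```

Note that this only affects where new connections land. Connections that are already open stay on their pod, so callers still need to re-connect from time to time.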

One mechanism to do this might be to load balance in a layer underneath the TCP connection termination. For example, you could split your service into two: a microservice (let's call it frontend-svc) that does connection handling and maybe some authn/authz, and a separate service (backend-svc) that does your business logic/processing.
clients <---persistent connection---> frontend-svc <----GRPC----> backend-svc
frontend-svc can maintain the persistent client connections and make calls to your backend in a more granular fashion, making use of gRPC for example, and really load balance among the workers in the layer below. This means the pods that are part of frontend-svc aren't doing much work and are completely stateless (and therefore have less need to be load balanced), which means you can also scale them with an HPA, provided you have some draining logic to ensure that you don't terminate existing connections.
This is a common approach that is used by SSL proxies etc to deal with connection termination separately from LB.
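A minimal Go sketch of that split, under heavy simplification: the frontend keeps one persistent client connection but picks a backend per request. No gRPC is involved here; roundRobin, handleClientConn and the backend names are hypothetical and only model the per-request selection.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// roundRobin hands out backend names in rotation and is safe for concurrent
// use. It assumes the backends slice is non-empty.
type roundRobin struct {
	backends []string
	next     uint64
}

// pick returns the next backend in the rotation.
func (r *roundRobin) pick() string {
	n := atomic.AddUint64(&r.next, 1)
	return r.backends[(n-1)%uint64(len(r.backends))]
}

// handleClientConn models frontend-svc serving `requests` requests that all
// arrive on ONE persistent client connection. Each request is dispatched to
// whichever backend the balancer picks; the result counts requests per backend.
func handleClientConn(lb *roundRobin, requests int) map[string]int {
	handled := make(map[string]int)
	for i := 0; i < requests; i++ {
		handled[lb.pick()]++
	}
	return handled
}

func main() {
	lb := &roundRobin{backends: []string{"backend-0", "backend-1", "backend-2"}}
	fmt.Println(handleClientConn(lb, 9))
}
```

The point of the design is that the balancing decision moves from "once per TCP connection" to "once per request", so a long-lived client connection no longer pins all work to one worker.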

Related

Migration of IP Multimedia Subsystem (IMS) for VoLTE in Kubernetes (Issue with Cross Port Services)

As per the 3GPP specifications, the UE (mobile) should send the first REGISTER request to the unsecured port of the P-CSCF, which is 5060, and the subsequent REGISTER request to a different port (the secured port of the P-CSCF), which is received in the response to the first REGISTER request.
This is a typical case of multi-port services. The Kubernetes issues explicitly mention that the iptables rules used for Services cannot handle multi-port services with session affinity, meaning that Kubernetes checks client affinity per SEP (service endpoint, i.e. port), not per Service. Switching to IPVS mode (where the sh source-hashing scheduler does the trick of maintaining cross-port session affinity per Service) breaks the distribution across the entire cluster, especially between internal IMS nodes (P-CSCF to I-CSCF, I-CSCF to S-CSCF and so on), because the scheduler becomes "sh" for every worker node. RR (round-robin) is the preferred scheduler but, as per the IPVS documentation, it does not guarantee cross-port session affinity per Service per client. Our aim was to use something like sh between the UE and the P-CSCF and RR for the rest. But given the kube-proxy implementation, this does not appear to be feasible. Do you have any suggestions for this problem?

Kubernetes services : request Assignment algorithm

What algorithm does a Kubernetes Service use to assign requests to the pods that it exposes? Can this algorithm be customized?
Thanks.
kube-proxy in userspace mode chooses a backend via a round-robin algorithm.
kube-proxy in iptables mode chooses a backend at random.
IPVS provides more options for balancing traffic to backend Pods; these are: rr: round-robin, lc: least connection (smallest number of open connections), dh: destination hashing, sh: source hashing, sed: shortest expected delay, nq: never queue.
As mentioned here: Service
For application-level routing you would need to use a service mesh such as Istio, Envoy, or Kong.
You can use the kube-proxy component. What is it?
kube-proxy is a network proxy that runs on each node in your cluster, implementing part of the Kubernetes Service concept.
kube-proxy maintains network rules on nodes. These network rules allow network communication to your Pods from network sessions inside or outside of your cluster.
kube-proxy uses the operating system packet filtering layer if there is one and it's available. Otherwise, kube-proxy forwards the traffic itself.
But why use a proxy when there is a round-robin DNS algorithm? There are a few reasons for using proxying for Services:
There is a long history of DNS implementations not respecting record TTLs, and caching the results of name lookups after they should have expired.
Some apps do DNS lookups only once and cache the results indefinitely.
Even if apps and libraries did proper re-resolution, the low or zero TTLs on the DNS records could impose a high load on DNS that then becomes difficult to manage.
kube-proxy has many modes:
User space proxy mode - In userspace mode, the iptables rule forwards to a local port where a Go binary (kube-proxy) is listening for connections. The binary (running in userspace) terminates the connection, establishes a new connection to a backend for the service, and then forwards requests to the backend and responses back to the local process. An advantage of userspace mode is that, because the connections are created from an application, if a connection is refused the application can retry with a different backend.
Iptables proxy mode - In iptables mode, the iptables rules are installed to directly forward packets that are destined for a service to a backend for the service. This is more efficient than moving the packets from the kernel to kube-proxy and then back to the kernel so it results in higher throughput and better tail latency. The main downside is that it is more difficult to debug, because instead of a local binary that writes a log to /var/log/kube-proxy you have to inspect logs from the kernel processing iptables rules.
IPVS proxy mode - IPVS is a Linux kernel feature that is specifically designed for load balancing. In IPVS mode, kube-proxy programs the IPVS load balancer instead of using iptables. This works well: it uses a mature kernel feature, and IPVS is designed for load balancing lots of services. It has an optimized API and an optimized look-up routine rather than a list of sequential rules.
You can read more here - good question about proxy mode on StackOverflow, here - comparing proxy modes and here - good article about proxy modes.
Like rohatgisanat mentioned in his answer, you can also use a service mesh. Here is also a good article comparing Kubernetes service meshes.

Kubernetes load balance HTTP/1.1 requests

As we know, HTTP/1.1 uses persistent (long-lived) connections by default. A Kubernetes Service, for example one of type ClusterIP, is an L4 load balancer.
Suppose I have a service running a web server, backed by 3 pods. Can HTTP/1.1 requests be distributed across the 3 pods?
Could anybody help clarify it?
This webpage addresses your question perfectly: https://learnk8s.io/kubernetes-long-lived-connections
In the spirit of StackOverflow, let me summarize the webpage here:
TLDR: Kubernetes doesn't load balance long-lived connections, and some Pods might receive more requests than others.
Kubernetes Services do not exist. There's no process listening on the IP address and port of a Service.
The Service IP address is used only as a placeholder that will be translated by iptables rules into the IP addresses of one of the destination pods using cleverly crafted randomization.
Any connections from clients (regardless of whether they come from inside or outside the cluster) are established directly with the Pods. Hence, an HTTP/1.1 persistent connection is maintained between the client and a specific Pod until it is closed by either side.
Thus, all requests that use a single persistent connection will be routed to a single Pod (that is selected by the iptables rule when establishing connection) and not load-balanced to the other Pods.
Additional info:
Per RFC 2616, section 8.1.3 (https://www.w3.org/Protocols/rfc2616/rfc2616-sec8.html#sec8.1.3), a proxy server must signal persistent connections separately with its clients and with the origin servers (or other proxies) that it connects to; each persistent connection applies to only one transport link.

How does kube-proxy handle persistent connections to a Service between pods?

I've seen scenarios where requests from one workload, sent to a ClusterIP service for another workload with no affinities set, only get routed to a subset of the associated pods. The Endpoints object for this service does show all of the pod IPs.
I did a little experiment to figure out what is happening.
Experiment
I set up minikube to have a "router" workload with 3 replicas sending requests to a "backend" workload also with 3 pods. The router just sends a request to the service name like http://backend.
I sent 100 requests to the router service via http://$MINIKUBE_IP:$NODE_PORT, since it's exposed as a NodePort service. Then I observed which backend pods actually handled requests. I repeated this test multiple times.
In most cases, only 2 backend pods handled any requests, with the occasional case where all 3 did. I didn't see any run where all requests went to a single pod in these experiments, though I have seen it happen before when running other tests in AKS.
This led me to the theory that the router is keeping a persistent connection to the backend pod it connects to. Given there are 3 routers and 3 backends, there's an 11% chance all 3 routers "stick" to a single backend, a 67% chance that between the 3 routers, they stick to 2 of the backends, and a 22% chance that each router sticks to a different backend pod (1-to-1).
[Figure: one possible combination of router-to-backend connections, out of 27 possible]
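The 11% / 67% / 22% split can be checked by brute force. The sketch below enumerates all 3^3 = 27 ways the routers can each stick to one backend and counts how many distinct backends end up in use; it assumes every router picks its backend independently and uniformly. The function name distribution is made up for this example.

```go
package main

import "fmt"

// distribution enumerates every way `routers` clients can each stick to one of
// `backends` pods and counts the assignments by number of distinct pods used.
func distribution(routers, backends int) map[int]int {
	total := 1
	for i := 0; i < routers; i++ {
		total *= backends
	}
	counts := make(map[int]int)
	for code := 0; code < total; code++ {
		// Decode `code` as one digit (backend index) per router.
		used := make(map[int]bool)
		c := code
		for r := 0; r < routers; r++ {
			used[c%backends] = true
			c /= backends
		}
		counts[len(used)]++
	}
	return counts
}

func main() {
	counts := distribution(3, 3)
	for k := 1; k <= 3; k++ {
		pct := 100 * float64(counts[k]) / 27
		fmt.Printf("%d backend(s) in use: %d of 27 (%.0f%%)\n", k, counts[k], pct)
	}
}
```

This gives 3, 18 and 6 assignments for one, two and three backends in use, which matches the percentages above.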
Disabling HTTP Keep-Alive
If I use a Transport disabling HTTP Keep-Alives in the router's http client, then any requests I make to the router are uniformly distributed between the different backends on every test run as desired.
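// Requires the "log" and "net/http" imports.
client := http.Client{
	Transport: &http.Transport{
		// Open a new TCP connection for every request, so that each
		// request goes through the Service's backend selection again.
		DisableKeepAlives: true,
	},
}
resp, err := client.Get("http://backend")
if err != nil {
	log.Fatal(err)
}
defer resp.Body.Close()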
So the theory seems accurate. But here's my question:
How does the router using HTTP KeepAlive / persistent connections actually result in a single connection between one router pod and one backend pod?
There is a kube-proxy in the middle, so I'd expect any persistent connections to be between the router pod and kube-proxy as well as between kube-proxy and the backend pods.
Also, when the router does a DNS lookup, it's going to find the Cluster IP of the backend service every time, so how can it "stick" to a Pod if it doesn't know the Pod IP?
Using Kubernetes 1.17.7.
This excellent article covers your question in detail.
Kubernetes Services indeed do not load balance long-lived TCP connections.
Under the hood, Services (in most cases) use iptables to distribute connections between pods. But iptables wasn't designed as a load balancer; it's a firewall. It is not capable of high-level load balancing.
As a weak substitute, iptables can select (or not select) a certain target for a new connection with some probability, and thus can be used as an L3/L4 balancer. This mechanism is what kube-proxy employs to somewhat imitate load balancing.
Does iptables use round-robin?
No, iptables is primarily used for firewalls, and it is not designed to do load balancing.
However, you could craft a smart set of rules that could make iptables behave like a load balancer.
And this is precisely what happens in Kubernetes.
If you have three Pods, kube-proxy writes the following rules:
select Pod 1 as the destination with a likelihood of 33%. Otherwise, move to the next rule
choose Pod 2 as the destination with a probability of 50%. Otherwise, move to the following rule
select Pod 3 as the destination (no probability)
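It may not be obvious that 33%, 50% and "always" add up to an even split. The Go sketch below works it out: a rule is only reached if all earlier rules did not match, so the overall chance of a pod is its rule probability times the chance of getting that far. The generalization to n pods (rule i matching with probability 1/(n-i)) and the function names are assumptions made for this example.

```go
package main

import "fmt"

// ruleProbabilities returns the per-rule match probability for n pods:
// rule i (0-based) matches with probability 1/(n-i), so the last rule
// always matches.
func ruleProbabilities(n int) []float64 {
	p := make([]float64, n)
	for i := range p {
		p[i] = 1 / float64(n-i)
	}
	return p
}

// overall returns the probability that each pod is finally chosen when the
// rules are evaluated in order and evaluation stops at the first match.
func overall(rules []float64) []float64 {
	out := make([]float64, len(rules))
	reach := 1.0 // probability that evaluation reaches the current rule
	for i, p := range rules {
		out[i] = reach * p
		reach *= 1 - p
	}
	return out
}

func main() {
	rules := ruleProbabilities(3)
	for i, p := range overall(rules) {
		fmt.Printf("Pod %d: rule probability %.2f, overall %.3f\n", i+1, rules[i], p)
	}
}
```

For three pods this is 1/3, then 2/3 * 1/2 = 1/3, then 2/3 * 1/2 * 1 = 1/3.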
What happens when you use keep-alive with a Kubernetes Service?
Let's imagine that front-end and backend support keep-alive.
You have a single instance of the front-end and three replicas for the backend.
The front-end makes the first request to the backend and opens the TCP connection.
The request reaches the Service, and one of the Pods is selected as the destination.
The backend Pod replies and the front-end receives the response.
But instead of closing the TCP connection, it is kept open for subsequent HTTP requests.
What happens when the front-end issues more requests?
They are sent to the same Pod.
Isn't iptables supposed to distribute the traffic?
It is.
There is a single TCP connection open, and the iptables rules were invoked only the first time, when it was established.
One of the three Pods was selected as the destination.
Since all subsequent requests are channelled through the same TCP connection, iptables isn't invoked anymore.
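Here is a toy Go model of that behaviour, not real networking: the pod is chosen once in dial(), and every request on an existing connection goes to the pod the connection is bound to. The service and conn types and the pick function are hypothetical stand-ins for the Service and the iptables random choice.

```go
package main

import (
	"fmt"
	"math/rand"
)

// service models a ClusterIP Service. pick is a hypothetical stand-in for the
// random backend choice and runs once per NEW connection.
type service struct {
	pods []string
	pick func(n int) int
}

// conn models an established TCP connection that is already bound to one pod.
type conn struct{ pod string }

// dial opens a new connection; this is the only place a pod is selected.
func (s *service) dial() *conn { return &conn{pod: s.pods[s.pick(len(s.pods))]} }

// request sends one request over the connection and reports which pod served it.
func (c *conn) request() string { return c.pod }

// countByPod sends `requests` requests, reusing a single connection when
// keepAlive is true and dialing a new one per request otherwise.
func countByPod(s *service, requests int, keepAlive bool) map[string]int {
	hits := make(map[string]int)
	var c *conn
	for i := 0; i < requests; i++ {
		if c == nil || !keepAlive {
			c = s.dial()
		}
		hits[c.request()]++
	}
	return hits
}

func main() {
	rng := rand.New(rand.NewSource(1))
	svc := &service{pods: []string{"pod-1", "pod-2", "pod-3"}, pick: rng.Intn}
	fmt.Println("keep-alive:   ", countByPod(svc, 99, true))
	fmt.Println("no keep-alive:", countByPod(svc, 99, false))
}
```

With keep-alive, all 99 requests land on whichever pod was picked at dial time; without it, they spread across the pods.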
Also it's not quite correct to say that kube-proxy is in the middle.
It isn't: kube-proxy by itself doesn't handle any traffic.
All it does is create iptables rules.
It is iptables that actually matches the packets, distributes them, performs DNAT, etc.
Similar question here.

Kubernetes how to load balance EXTERNAL persistent tcp connections?

I'm having an issue with load balancing persistent TCP connections to my Kubernetes replicas.
I have Unity3D clients outside of the Kubernetes cluster.
My cluster is a bare-metal cluster with MetalLB installed, composed of 3 nodes: 1 master and 2 workers.
As I have read there are two approaches:
1) client connects to all replicas and each time it needs to send a request it will do so on a random connection out of those that it has previously established. Periodically, it refreshes connections (in case autoscale happened or some of the persistent connections died).
The problem here is, I'm not sure how to access all replicas externally, headless services cannot be exposed externally.
2) service mesh? I have vaguely read/understood that a mesh might establish persistent TCP connections on your behalf. So something like this:
unity3d client <----persistent connection ---> controller <---persistent connection----> replicas
However, I'm not sure how to accomplish this, and I'm not sure what will happen if the controller itself fails. Will all the clients get their connections dropped? As I see it, it will come down to the same issue as in 1), which is allowing a client to connect to multiple different replicas at the same time with persistent TCP connections.
Part of this question is a follow-up to: https://learnk8s.io/kubernetes-long-lived-connections
In order to enable external traffic to your cluster you need an ingress gateway. Your ingress gateway could be the standard NGINX Ingress, a gateway provided by a mesh like the Istio Gateway, or a more specialized edge gateway like Ambassador, Traefik, Kong, Gloo, etc.
There are at least two ways you can perform load balancing in K8s:
Using a Service resource, which is just a set of iptables rules managed by the kube-proxy process. This is L4 load balancing only; it is not aware of L7 application protocols like HTTP/2 or gRPC. Depending on your case, this type of LB might not be ideal for long-lived connections, as connections will rarely be closed.
Using the L7 load balancing offered by any of the ingress controllers which will skip the iptables routing (using a headless Service) and allow for more advanced load balancing algorithms.
In order to benefit from the latter case you still need to ensure that connections are eventually terminated which is often done from the client to the proxy (while reusing connections from the proxy to the upstream). I'm not familiar with Unity3D connections but if terminating them is not an option you won't be able to do much load balancing after all.
When the controller fails, connections will be dropped and your client could either gracefully re-attempt the connection or panic. It depends on how you code it.