Sticky sessions considering src IP and src port in K8s - kubernetes

I've got a lift 'n shift deployment type (i.e. by no means cloud-native) and I'd like to set up sticky sessions so that, from the client's perspective, requests keep being handled by the same pod as long as it's available.
Client --> LB --> Ingress --> Service --> Deployment
Because the LB does SNAT, I think service.spec.sessionAffinityConfig.clientIP will work, but since all requests would arrive with the load balancer's source IP, the workload won't be truly balanced across the pods in the deployment.
Can you think of any way to consider source IP & port pair in the sticky session behavior?
Edit 1: The deployment runs in Oracle Cloud. We're using the Oracle Cloud Loadbalancer service in plain TCP mode (i.e. OSI Layer4).

What the question describes is actually the default traffic management behavior in K8s. The packets within each TCP session target the same pod. A TCP session is initiated from a certain source IP (in our case the LB) and source port (which is different for each session), and the session remains "sticky" for its whole duration.
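As a rough illustration of that behavior (a toy model, not how the kernel actually implements connection tracking), the sketch below picks a pod only for the first packet of a flow and reuses the choice for every later packet with the same source IP and port. The names flowKey, flowTable, the pod names, and the LB address are all made up for this example.

```go
package main

import (
	"fmt"
	"math/rand"
)

// flowKey identifies a TCP connection as seen by the Service: the source
// IP is always the load balancer's, but the source port differs per session.
type flowKey struct {
	srcIP   string
	srcPort int
}

// flowTable is a toy stand-in for the kernel's connection tracking table.
type flowTable struct {
	pods  []string
	flows map[flowKey]string
}

func newFlowTable(pods []string) *flowTable {
	return &flowTable{pods: pods, flows: make(map[flowKey]string)}
}

// route returns the pod for a packet. A pod is picked at random only for
// the first packet of a flow; later packets reuse the recorded choice.
func (ft *flowTable) route(k flowKey) string {
	if pod, ok := ft.flows[k]; ok {
		return pod
	}
	pod := ft.pods[rand.Intn(len(ft.pods))]
	ft.flows[k] = pod
	return pod
}

func main() {
	table := newFlowTable([]string{"pod-a", "pod-b", "pod-c"})
	lb := "10.0.0.5" // hypothetical SNAT address of the load balancer
	for _, port := range []int{40001, 40002, 40001, 40003, 40002} {
		fmt.Printf("%s:%d -> %s\n", lb, port, table.route(flowKey{lb, port}))
	}
}
```

Packets that repeat a source port always land on the same pod, while different source ports from the same LB address can land on different pods.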

Related

Kubernetes load balance HTTP/1.1 requests

As we know, HTTP/1.1 uses persistent (long-lived) connections by default. Any Service in Kubernetes, for example one in ClusterIP mode, is an L4 load balancer.
Suppose I have a Service backed by 3 pods running a web server. Can HTTP/1.1 requests be distributed across all 3 pods?
Could anybody help clarify it?
This webpage addresses your question perfectly: https://learnk8s.io/kubernetes-long-lived-connections
In the spirit of StackOverflow, let me summarize the webpage here:
TLDR: Kubernetes doesn't load balance long-lived connections, and some Pods might receive more requests than others.
Kubernetes Services do not exist. There's no process listening on the IP address and port of a Service.
The Service IP address is used only as a placeholder that will be translated by iptables rules into the IP addresses of one of the destination pods using cleverly crafted randomization.
Any connections from clients (whether inside or outside the cluster) are established directly with the Pods, hence for an HTTP 1.1 persistent connection, the connection will be maintained between the client and a specific Pod until it is closed by either side.
Thus, all requests that use a single persistent connection will be routed to a single Pod (that is selected by the iptables rule when establishing connection) and not load-balanced to the other Pods.
Additional info:
Per RFC 2616, section 8.1.3 (https://www.w3.org/Protocols/rfc2616/rfc2616-sec8.html#sec8.1.3), a proxy server must signal persistent connections separately with its clients and with the origin servers (or other proxies) it connects to; each persistent connection applies to only one transport link.

How does kube-proxy handle persistent connections to a Service between pods?

I've seen scenarios where requests from one workload, sent to a ClusterIP service for another workload with no affinities set, only get routed to a subset of the associated pods. The Endpoints object for this service does show all of the pod IPs.
I did a little experiment to figure out what is happening.
Experiment
I set up minikube to have a "router" workload with 3 replicas sending requests to a "backend" workload also with 3 pods. The router just sends a request to the service name like http://backend.
I sent 100 requests to the router service via http://$MINIKUBE_IP:$NODE_PORT, since it's exposed as a NodePort service. Then I observed which backend pods actually handled requests. I repeated this test multiple times.
In most cases, only 2 backend pods handled any requests, with the occasional case where all 3 did. I didn't see any where all requests went to one in these experiments, though I have seen it happen before running other tests in AKS.
This led me to the theory that the router is keeping a persistent connection to the backend pod it connects to. Given there are 3 routers and 3 backends, there's an 11% chance all 3 routers "stick" to a single backend, a 67% chance that between the 3 routers, they stick to 2 of the backends, and a 22% chance that each router sticks to a different backend pod (1-to-1).
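The percentages above can be double-checked by brute force. The following sketch (the function name distinctCounts is made up for this example) enumerates all 27 ways that 3 routers can each stick to one of 3 backends and counts how many distinct backends each assignment uses.

```go
package main

import "fmt"

// distinctCounts enumerates every way `routers` routers can each stick to
// one of `backends` backends and returns counts[k] = number of assignments
// that use exactly k distinct backends.
func distinctCounts(routers, backends int) map[int]int {
	counts := make(map[int]int)
	total := 1
	for i := 0; i < routers; i++ {
		total *= backends
	}
	for code := 0; code < total; code++ {
		used := make(map[int]bool)
		c := code
		for r := 0; r < routers; r++ {
			used[c%backends] = true // backend chosen by router r
			c /= backends
		}
		counts[len(used)]++
	}
	return counts
}

func main() {
	counts := distinctCounts(3, 3)
	for k := 1; k <= 3; k++ {
		fmt.Printf("%d backend(s) in use: %d of 27 (%.0f%%)\n",
			k, counts[k], 100*float64(counts[k])/27)
	}
}
```

This gives 3, 18, and 6 assignments out of 27, which round to the 11%, 67%, and 22% quoted above.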
Here's one possible combination of router-to-backend connections (out of 27 possible):
Disabling HTTP Keep-Alive
If I use a Transport disabling HTTP Keep-Alives in the router's http client, then any requests I make to the router are uniformly distributed between the different backends on every test run as desired.
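// Disable keep-alives so every request opens a new TCP connection
// and is therefore balanced independently by the Service.
client := http.Client{
	Transport: &http.Transport{
		DisableKeepAlives: true,
	},
}
resp, err := client.Get("http://backend")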
So the theory seems accurate. But here's my question:
How does the router using HTTP KeepAlive / persistent connections actually result in a single connection between one router pod and one backend pod?
There is a kube-proxy in the middle, so I'd expect any persistent connections to be between the router pod and kube-proxy as well as between kube-proxy and the backend pods.
Also, when the router does a DNS lookup, it's going to find the Cluster IP of the backend service every time, so how can it "stick" to a Pod if it doesn't know the Pod IP?
Using Kubernetes 1.17.7.
This excellent article covers your question in detail.
Kubernetes Services indeed do not load balance long-lived TCP connections.
Under the hood, Services (in most cases) use iptables to distribute connections between pods. But iptables wasn't designed as a balancer; it's a firewall. It isn't capable of high-level load balancing.
As a weak substitute, iptables can create (or not create) a connection to a certain target with some probability, and thus can be used as an L3/L4 balancer. This mechanism is what kube-proxy employs to somewhat imitate load balancing.
Does iptables use round-robin?
No, iptables is primarily used for firewalls, and it is not designed to do load balancing.
However, you could craft a smart set of rules that could make iptables behave like a load balancer.
And this is precisely what happens in Kubernetes.
If you have three Pods, kube-proxy writes the following rules:
select Pod 1 as the destination with a likelihood of 33%. Otherwise, move to the next rule
choose Pod 2 as the destination with a probability of 50%. Otherwise, move to the following rule
select Pod 3 as the destination (no probability)
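To see why those unequal-looking per-rule probabilities still give each Pod an equal share, here is a small sketch that walks the rule chain. The helper names ruleProbabilities and effectiveShares are made up for this example; it only models the arithmetic, not real iptables rules.

```go
package main

import "fmt"

// ruleProbabilities returns the per-rule match probability for n pods:
// 1/n, 1/(n-1), ..., 1/2, 1. The last rule always matches.
func ruleProbabilities(n int) []float64 {
	probs := make([]float64, n)
	for i := range probs {
		probs[i] = 1 / float64(n-i)
	}
	return probs
}

// effectiveShares walks the rule chain and returns the overall chance that
// a new connection ends up at each pod.
func effectiveShares(rules []float64) []float64 {
	shares := make([]float64, len(rules))
	remaining := 1.0 // probability that no earlier rule matched
	for i, p := range rules {
		shares[i] = remaining * p
		remaining *= 1 - p
	}
	return shares
}

func main() {
	rules := ruleProbabilities(3)
	for i, s := range effectiveShares(rules) {
		fmt.Printf("pod %d: rule probability %.2f, overall share %.3f\n", i+1, rules[i], s)
	}
}
```

Pod 2's rule fires with probability 50%, but it is only reached in the 67% of cases where Pod 1's rule did not match, so its overall share is also one third.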
What happens when you use keep-alive with a Kubernetes Service?
Let's imagine that front-end and backend support keep-alive.
You have a single instance of the front-end and three replicas for the backend.
The front-end makes the first request to the backend and opens the TCP connection.
The request reaches the Service, and one of the Pods is selected as the destination.
The backend Pod replies and the front-end receives the response.
But instead of closing the TCP connection, it is kept open for subsequent HTTP requests.
What happens when the front-end issues more requests?
They are sent to the same Pod.
Isn't iptables supposed to distribute the traffic?
It is.
There is a single TCP connection open, and the iptables rules were invoked only once, when it was first established.
One of the three Pods was selected as the destination.
Since all subsequent requests are channelled through the same TCP connection, iptables isn't invoked anymore.
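The connection reuse described above can be observed locally without a cluster. The sketch below is an assumption-laden stand-in: a local httptest server plays the backend, and the hypothetical helper countConnections counts how many TCP connections the server accepts for a series of sequential requests. Each new connection is a point where a Service would get a chance to pick a different Pod.

```go
package main

import (
	"fmt"
	"io"
	"net"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// countConnections sends n sequential GET requests to a local test server
// and reports how many TCP connections the server accepted.
func countConnections(n int, disableKeepAlives bool) (int64, error) {
	var accepted int64
	srv := httptest.NewUnstartedServer(http.HandlerFunc(
		func(w http.ResponseWriter, r *http.Request) { io.WriteString(w, "ok") }))
	// ConnState fires with StateNew once per accepted TCP connection.
	srv.Config.ConnState = func(c net.Conn, s http.ConnState) {
		if s == http.StateNew {
			atomic.AddInt64(&accepted, 1)
		}
	}
	srv.Start()
	defer srv.Close()

	client := &http.Client{Transport: &http.Transport{DisableKeepAlives: disableKeepAlives}}
	defer client.CloseIdleConnections()
	for i := 0; i < n; i++ {
		resp, err := client.Get(srv.URL)
		if err != nil {
			return 0, err
		}
		// Drain and close the body so the connection can be reused.
		io.Copy(io.Discard, resp.Body)
		resp.Body.Close()
	}
	return atomic.LoadInt64(&accepted), nil
}

func main() {
	for _, disabled := range []bool{false, true} {
		conns, err := countConnections(5, disabled)
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		fmt.Printf("DisableKeepAlives=%v: 5 requests used %d connection(s)\n", disabled, conns)
	}
}
```

With keep-alive, all five requests share one connection, so there is only one balancing decision; with keep-alives disabled, every request opens its own connection.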
Also, it's not quite correct to say that kube-proxy is in the middle.
It isn't: kube-proxy by itself doesn't handle any traffic.
All it does is create iptables rules.
It's iptables (netfilter in the kernel) that actually intercepts the packets, picks a destination, does DNAT etc.
Similar question here.

Kubernetes how to load balance EXTERNAL persistent tcp connections?

I'm having an issue with load balancing persistent tcp connections to my kubernetes replicas.
I have Unity3D clients outside of the kubernetes cluster.
My cluster is a baremetal cluster with metallb installed composed out of 3 nodes: 1 master and 2 workers.
As I have read there are two approaches:
1) client connects to all replicas and each time it needs to send a request it will do so on a random connection out of those that it has previously established. Periodically, it refreshes connections (in case autoscale happened or some of the persistent connections died).
The problem here is, I'm not sure how to access all replicas externally, headless services cannot be exposed externally.
2) service mesh ? I have vaguely read/understood that they might establish persistent tcp on your behalf. So something like this :
unity3d client <----persistent connection ---> controller <---persistent connection----> replicas
However, I'm not sure how to accomplish this and I'm not sure what will happen if the controller itself fails, will all the clients get their connections dropped ? As I see it, it will come down to the same issue as the one from 1), which is allowing a client to connect to multiple different replicas at the same time with a persistent TCP connection.
Part of question comes as a complement to this : https://learnk8s.io/kubernetes-long-lived-connections
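Approach 1) above can be sketched roughly as follows. This is only an illustration under stated assumptions: the replica addresses are hypothetical, the names pool, refresh, pick, and fakeDial are made up, and the dialer is faked with net.Pipe so the example runs without a cluster. It does not solve the problem of how to discover or reach the replicas externally, and it does not detect dead connections.

```go
package main

import (
	"fmt"
	"math/rand"
	"net"
)

// dialFunc opens a connection to one replica; it is injected so the sketch
// can run without a cluster.
type dialFunc func(addr string) (net.Conn, error)

// pool keeps one persistent connection per reachable replica.
type pool struct {
	dial  dialFunc
	conns map[string]net.Conn
}

func newPool(dial dialFunc) *pool {
	return &pool{dial: dial, conns: make(map[string]net.Conn)}
}

// refresh reconciles the pool with the current replica list: it dials new
// replicas and closes connections to replicas that have gone away.
func (p *pool) refresh(addrs []string) {
	want := make(map[string]bool, len(addrs))
	for _, a := range addrs {
		want[a] = true
		if _, ok := p.conns[a]; !ok {
			if c, err := p.dial(a); err == nil {
				p.conns[a] = c
			}
		}
	}
	for a, c := range p.conns {
		if !want[a] {
			c.Close()
			delete(p.conns, a)
		}
	}
}

// pick returns a random established connection for the next request.
func (p *pool) pick() (string, net.Conn, bool) {
	if len(p.conns) == 0 {
		return "", nil, false
	}
	n := rand.Intn(len(p.conns))
	for a, c := range p.conns {
		if n == 0 {
			return a, c, true
		}
		n--
	}
	return "", nil, false
}

// fakeDial stands in for net.Dial("tcp", addr) in this demo.
func fakeDial(addr string) (net.Conn, error) {
	client, _ := net.Pipe()
	return client, nil
}

func main() {
	p := newPool(fakeDial)
	// Hypothetical replica addresses.
	p.refresh([]string{"10.1.0.1:7777", "10.1.0.2:7777", "10.1.0.3:7777"})
	for i := 0; i < 5; i++ {
		addr, _, _ := p.pick()
		fmt.Println("request", i, "->", addr)
	}
	p.refresh([]string{"10.1.0.2:7777", "10.1.0.4:7777"}) // replica set changed
	fmt.Println("connections after refresh:", len(p.conns))
}
```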
In order to enable external traffic to your cluster you need an Ingress Gateway. Your ingress gateway could be the standard nginx Ingress, a gateway provided by a mesh like the Istio Gateway or a more specialized edge gateway like ambassador, traefik, kong, gloo, etc.
There are at least two ways you can perform load balancing in K8s:
Using a Service resource which is just a set of iptables rules managed by the kube-proxy process. This is L4 load balancing only. No L7 application protocols like HTTP2 or gRPC are supported. Depending on your case, this type of LB might not be ideal for long lived connections as connections will rarely be closed.
Using the L7 load balancing offered by any of the ingress controllers which will skip the iptables routing (using a headless Service) and allow for more advanced load balancing algorithms.
In order to benefit from the latter case you still need to ensure that connections are eventually terminated which is often done from the client to the proxy (while reusing connections from the proxy to the upstream). I'm not familiar with Unity3D connections but if terminating them is not an option you won't be able to do much load balancing after all.
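For HTTP traffic (not the raw TCP used by Unity3D, so this is an illustration of the principle rather than a fix for the question), one way to make sure connections are eventually terminated is for the server to close each connection after a fixed number of requests. The sketch below is a minimal example under that assumption: recycleAfter and connectionsUsed are made-up names, and a local httptest server stands in for the upstream.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

type connCounterKey struct{}

// recycleAfter wraps a handler so that the server closes each connection
// once it has served maxRequests requests.
func recycleAfter(maxRequests int64, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if served, ok := r.Context().Value(connCounterKey{}).(*int64); ok {
			if atomic.AddInt64(served, 1) >= maxRequests {
				// Asks both the client and Go's server to close this connection.
				w.Header().Set("Connection", "close")
			}
		}
		next.ServeHTTP(w, r)
	})
}

// connectionsUsed sends n sequential requests through one keep-alive client
// and reports how many TCP connections the server accepted.
func connectionsUsed(n int, maxRequests int64) (int64, error) {
	var accepted int64
	ok := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) { io.WriteString(w, "ok") })
	srv := httptest.NewUnstartedServer(recycleAfter(maxRequests, ok))
	// Attach a fresh request counter to every accepted connection.
	srv.Config.ConnContext = func(ctx context.Context, c net.Conn) context.Context {
		return context.WithValue(ctx, connCounterKey{}, new(int64))
	}
	srv.Config.ConnState = func(c net.Conn, s http.ConnState) {
		if s == http.StateNew {
			atomic.AddInt64(&accepted, 1)
		}
	}
	srv.Start()
	defer srv.Close()

	client := &http.Client{Transport: &http.Transport{}}
	defer client.CloseIdleConnections()
	for i := 0; i < n; i++ {
		resp, err := client.Get(srv.URL)
		if err != nil {
			return 0, err
		}
		io.Copy(io.Discard, resp.Body)
		resp.Body.Close()
	}
	return atomic.LoadInt64(&accepted), nil
}

func main() {
	conns, err := connectionsUsed(6, 2)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("6 requests, max 2 per connection: %d connections\n", conns)
}
```

Every reconnect is a fresh opportunity for the load balancer to choose a different upstream, at the cost of extra connection setup.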
When the controller fails, connections will be dropped and your client could either gracefully re-attempt the connection or panic. It depends on how you code it.

Performance considerations for NodePort vs. ClusterIP vs. Headless Service on Kubernetes

We have two types of services that we run on AWS EKS:
external-facing services which we expose through an application-level load balancer using aws-alb-ingress-controller
internal-facing services which we use both directly through the service name (for EKS applications) and through an internal application-level loadbalancer also using aws-alb-ingress-controller (for non-EKS applications)
I would like to understand the performance implications of choosing Nodeport, ClusterIP or Headless Service for both the external and internal services. I have the setup working with all three options.
If I understand the networking correctly, it seems that a Headless Service requires fewer hops and would hence be (slightly) faster? This article however seems to suggest that a Headless Service would not be properly load balanced when called directly. Is this correct? And would this still hold when called through the external (or internal) ALB?
Is there any difference in performance for NodePort vs ClusterIP?
Finally, what is the most elegant/performant way of using internal services from outside of the cluster (where we don't have access to the Kubernetes DNS) but within the same VPC? Would it be to use ClusterIp and specify the IP address in the service definition so it remains stable? Or are there better options?
I've put more detailed info on each of the connection forwarding types, and on how the services are forwarded, under the headings below for context to my answers.
If I understand the networking correctly, it seems that a Headless Service requires fewer hops and would hence be (slightly) faster?
Not substantially faster. The "extra hop" is the packet traversing local lookup tables which it traverses anyway so not a noticeable difference. The destination pod is still going to be the same number of actual network hops away.
If you have 1000's of services that run on a single pod and could be headless then you might use that to limit the number of iptables NAT rules and speed rule processing up (see iptables v ipvs below).
Is < a headless service not load balanced > correct? And would this still hold when called through the external (or internal) ALB?
Yes it is correct, the client (or ALB) would need to implement the load balancing across the Pod IP's.
Is there any difference in performance for NodePort vs ClusterIP?
A NodePort has a possible extra network hop from the entry node to the node running the pod. A ClusterIP avoids that hop, assuming the ClusterIP ranges are routed to the correct node (and routed at all).
If you happen to be using a service type: LoadBalancer, this behaviour can be changed by setting .spec.externalTrafficPolicy to Local (https://kubernetes.io/docs/concepts/services-networking/service/#aws-nlb-support), which means traffic will only be directed to a local pod.
Finally, what is the most elegant/performant way of using internal services from outside of the cluster
I would say use the AWS ALB Ingress Controller with the alb.ingress.kubernetes.io/target-type: ip annotation. The k8s config from the cluster will be pushed out to the ALB via the ingress controller and address pods directly without traversing any connection forwarding or extra hops. All cluster reconfig will be automatically pushed out.
There is a little bit of latency for config to get to the ALB compared to cluster kube-proxy reconfiguration. Something like a rolling deployment might not be as seamless as the updates arrive after a pod is gone. The ALB's are equipped to handle the outage themselves, eventually.
Kubernetes Connection Forwarding
There is a kube-proxy process running on each node which manages how and where connections are forwarded. There are 3 options for how kube-proxy does that: Userspace proxy, iptables or IPVS. Most clusters will be on iptables and that will cater for the vast majority of use cases.
Userspace proxy
The forwarding is via a process that runs in userspace to terminate and forward the connections. It's slow. It's unlikely you are using it, don't use it.
iptables
iptables forwards connections in kernel via NAT, which is fast. This is the most common setup and will cover 90% of use cases. New connections are distributed at random, with equal probability, across all pods backing a service.
IPVS
Runs in kernel; it is fast and scalable. If you route traffic to a large number of apps, this might improve the forwarding performance. It also supports different service load balancing modes:
- rr: round-robin
- lc: least connection (smallest number of open connections)
- dh: destination hashing
- sh: source hashing
- sed: shortest expected delay
- nq: never queue
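To make the difference between two of these modes concrete, here is a simplified sketch of the ideas behind rr and lc. It is not IPVS code (the real lc scheduler weighs active and inactive connections differently); the types and pod names are made up for illustration.

```go
package main

import "fmt"

// backend is a pod with a count of currently open connections.
type backend struct {
	name  string
	conns int
}

// roundRobin mimics the idea of "rr": hand out backends in turn.
type roundRobin struct{ next int }

func (r *roundRobin) pick(bs []*backend) *backend {
	b := bs[r.next%len(bs)]
	r.next++
	return b
}

// leastConn mimics the idea of "lc": choose the backend with the fewest
// open connections (the first one wins on ties).
func leastConn(bs []*backend) *backend {
	best := bs[0]
	for _, b := range bs[1:] {
		if b.conns < best.conns {
			best = b
		}
	}
	return best
}

func main() {
	pods := []*backend{{"pod-a", 5}, {"pod-b", 1}, {"pod-c", 3}}
	rr := &roundRobin{}
	for i := 0; i < 4; i++ {
		fmt.Println("rr picks", rr.pick(pods).name)
	}
	for i := 0; i < 4; i++ {
		b := leastConn(pods)
		b.conns++ // the new connection stays open
		fmt.Printf("lc picks %s (now %d open)\n", b.name, b.conns)
	}
}
```

Round-robin ignores existing load, while least-connection keeps steering new connections to pod-b until it has caught up with the others, which matters when connections are long-lived.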
Access to services
My explanations are iptables based as I haven't done much detailed work with ipvs clusters yet. I'm gonna handwave the ipvs complexity away and say it's basically the same as iptables, just with faster rule processing as the number of rules increases on huge clusters (i.e number of pods/services/network policies).
I'm also ignoring the userspace proxy in the description, due to the overhead just don't use it.
The basic thing to understand is that a "Service ClusterIP" is a virtual construct in the cluster that only exists as a rule for where the traffic should go. Every node maintains this rule mapping of all ClusterIP/port to PodIP/port (via kube-proxy).
Nodeport
The ALB routes to any node. The node/NodePort forwards the connection to a pod handling the service. This could be a remote pod, which would involve sending traffic back out over the "wire".
ALB > wire > Node > Kernel Forward to SVC ( > wire if remote node ) > Pod
ClusterIP
Using the ClusterIP for direct access depends on the Service cluster IP ranges being routed to the correct node. Sometimes they aren't routed at all.
ALB > wire > Node > Kernel Forward to SVC > Pod
The "Kernel Forward to SVC" step can be skipped with an ALB annotation without using a headless service.
Headless Service
Again, Pod IP's aren't always addressable from outside the cluster depending on the network setup. You should be fine on EKS.
ALB > wire > Node > Pod
Note
I'll caveat this by saying that requests are probably looking at < 1ms of additional latency if a connection is forwarded to a node in a VPC, with enhanced networking instances at the low end of that. Inter availability-zone comms might be a tad higher than intra-AZ. If you happened to have a geographically separated cluster, it might increase the importance of controlling traffic flow, for example with a tunnelled calico network that actually jumped over a number of real networks.
what is the most elegant/performant way of using internal services from outside of the cluster (where we don't have access to the Kubernetes DNS) but within the same VPC?
To achieve this, I think you should have a look at a service mesh, for example Istio (https://istio.io). It handles your internal service calls so that they don't have to go through Kubernetes DNS. Please have a look at Istio's docs (https://istio.io/docs) for more info.
Also, you can have a look at Istio at EKS (https://aws.amazon.com/blogs/opensource/getting-started-istio-eks)
Headless service will not have any load balancing at L4 layer but if you use it behind an ALB you are getting load balancing at L7 layer.
NodePort internally uses the cluster IP, but your request may be routed at random to a pod on another host when it could have been routed to a pod on the same host, which would have avoided the extra hop out to the network. NodePort is generally a bad idea for production usage.
IMHO best way to access internal services from outside of the cluster will be using ingress.
You can use nginx as ingress controller where you deploy the nginx ingress controller on your cluster and expose it via a LoadBalancer type service using ALB. Then you can configure path or host based routing using ingress api to route traffic between backend kubernetes services.

Is there some way to handle SIP, RTP, DIAMETER, M3UA traffic in Kubernetes?

From a quick read of the Kubernetes docs, I noticed that kube-proxy behaves as a Layer 4 proxy, and perhaps works well for TCP/IP traffic (such as typical HTTP traffic).
However, there are other protocols like SIP (that could be over TCP or UDP), RTP (that is over UDP), and core telecom network signaling protocols like DIAMETER (over TCP or SCTP) or likewise M3UA (over SCTP). Is there a way to handle such traffic in an application running on a Kubernetes minion?
In my reading, I have come across the notion of the Kubernetes Ingress API, but I understood that it is a way to extend the capabilities of the proxy. Is that correct?
Also, is it true that currently there is no known implementation (open-source or closed-source) of the Ingress API that can allow a Kubernetes cluster to handle the above listed types of traffic?
Finally, other than usage of the Ingress API, is there no way to deal with the above listed traffic, even if it has performance limitations ?
Also, is it true that currently there is no known implementation (open-source or closed-source) of the Ingress API that can allow a Kubernetes cluster to handle the above listed types of traffic?
Probably, and this IBM study on IBM Voice Gateway "Setting up high availability"
(here with SIPs (Session Initiation Protocol), like OpenSIPS)
Kubernetes deployments
In Kubernetes terminology, a single voice gateway instance equates to a single pod, which contains both a SIP Orchestrator container and a Media Relay container.
The voice gateway pods are installed into a Kubernetes cluster that is fronted by an external SIP load balancer.
Through Kubernetes, a voice gateway pod can be scheduled to run on a cluster of VMs. The framework also monitors pods and can be configured to automatically restart a voice gateway pod if a failure is detected.
Note: Because auto-scaling and auto-discovery of new pods by a SIP load balancer in Kubernetes are not currently supported, an external SIP.
And, to illustrate Kubernetes limitations:
Running IBM Voice Gateway in a Kubernetes environment requires special considerations beyond the deployment of a typical HTTP-based application because of the protocols that the voice gateway uses.
The voice gateway relies on the SIP protocol for call signaling and the RTP protocol for media, which both require affinity to a specific voice gateway instance. To avoid breaking session affinity, the Kubernetes ingress router must be bypassed for these protocols.
To work around the limitations of the ingress router, the voice gateway containers must be configured in host network mode.
In host network mode, when a port is opened in either of the voice gateway containers, those identical ports are also opened and mapped on the base virtual machine or node.
This configuration also eliminates the need to define media port ranges in the kubectl configuration file, which is not currently supported by Kubernetes. Deploying only one pod per node in host network mode ensures that the SIP and media ports are opened on the host VM and are visible to the SIP load balancer.
That network configuration put in place for Kubernetes is best illustrated in this answer, which describes the elements involved in pod/node-communication:
It is possible to handle TCP and UDP traffic from clients to your service, but it slightly depends where you run Kubernetes.
Solutions
A solution which works everywhere
It is possible to use Ingress for both TCP and UDP protocols, not only for HTTP. Some Ingress implementations support proxying these types of traffic.
Here is an example of that kind of configuration for Nginx Ingress controller for TCP:
apiVersion: v1
kind: ConfigMap
metadata:
  name: tcp-configmap-example
data:
  9000: "default/example-go:8080" # here is a "$namespace/$service_name:$port"
And UDP:
apiVersion: v1
kind: ConfigMap
metadata:
  name: udp-configmap-example
data:
  53: "kube-system/kube-dns:53" # here is a "$namespace/$service_name:$port"
So, actually, you can run an application which needs plain UDP and TCP connections, with some limitations (you need to manage load balancing somehow if you have more than one pod, etc.).
But if you already have an application which can do this today, without Kubernetes, I don't think you will have any problems with it after migrating to Kubernetes.
A Small example of a traffic flow
For SIP UDP traffic, for example, you can prepare a configuration like this:
Client -> Nginx Ingress (UDP) -> OpenSIPS Load balancer (UDP) -> Sip Servers (UDP).
So, the client will send packets to Ingress, which will forward them to OpenSIPS, which will manage the state of your SIP cluster and send the client's packets to the proper SIP server.
A solution only for Clouds
Also, if you run it in a cloud, you can use ServiceType LoadBalancer for your Service and get TCP and UDP traffic to your application directly through the external load balancer provided by the cloud platform.
About SCTP
As for SCTP: unfortunately, no, it is not supported yet, but you can track progress here.
With regard to SCTP support in k8s: it has been merged recently into k8s as alpha feature. SCTP is supported as a new protocol type in Service, NetworkPolicy and Pod definitions. See the PR here: https://github.com/kubernetes/kubernetes/pull/64973
Some restrictions exist:
the handling of multihomed SCTP associations was not in the scope of the PR. The support of multihomed SCTP associations for the cases when NAT is used is a much broader topic which affects also the current SCTP kernel modules that handle NAT for the protocol. See an example here: https://datatracker.ietf.org/doc/html/draft-ietf-tsvwg-natsupp-12
From the k8s perspective one would also need a CNI plugin that supports the assignment of multiple IP addresses (preferably on multiple interfaces) to pods, so that a pod can establish a multihomed SCTP association. One would also need an enhanced Service/Endpoint/DNS controller to handle those multiple IP addresses in the right way.
the support of SCTP as protocol for type=LoadBalancer Services is up to the load balancer implementation, which is not a k8s issue
in order to use SCTP in NetworkPolicy one needs a CNI plugin that supports SCTP in NetworkPolicies