Balancing traffic using least connection in Kubernetes - kubernetes

I have a Kubernetes cluster with a deployment like the next one:
The goal here is to deploy an application in multiple pods exposed through a ClusterIP service named my-app. The same deployment is made in multiple namespaces (A, B and C), each with a slightly different application config. Then, on some nodes I have an HAProxy using hostNetwork to bind to the node ports. These HAProxy instances are exposed to my clients through a DNS record pointing to them (my_app.com).
When a client connects to my app, they send a header specifying the namespace to which the request should be redirected (A, B or C), and the HAProxy resolves the IP of the service using do-resolve against a DNS entry like my_app.A.svc.cluster.local, which returns the IP of the service my_app in the namespace A. That way I can have a single entry point (single DNS record) and a single port (80) to my cluster, which is one of my requirements. I'm also able to create new namespaces and deploy other configs of my app without needing to modify the HAProxies, which is the second requirement.
Now, the requests I get are a mix of short and long requests, so I need to use least connection here. This is not possible in the HAProxies as I don't have a list of backends (the redirection is dynamic, as you can see in the code below). I'm trying to use kube-proxy with IPVS in least connection mode. What I noticed is that connections to the different pods are tracked per node, and this information is not shared between nodes. As a result, if two requests to my_app.com with Namespace: A are processed by two different nodes, both can go to the same pod (e.g. pod_1), because on each node the number of active connections to that pod is 0. The problem gets worse as I increase the number of HAProxies behind the DNS.
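To make the per-node tracking problem concrete, here is a minimal Go sketch of a toy least-connection scheduler. It is not how IPVS is implemented; the `balancer` type, the `pick` method and the "lowest index wins ties" rule are assumptions made purely for illustration.

```go
package main

import "fmt"

// balancer is a toy least-connection scheduler (hypothetical, not IPVS):
// it tracks active connections per pod and picks the pod with the fewest.
// Ties go to the lowest pod index, which is an assumption of this sketch.
type balancer struct {
	active []int
}

func newBalancer(pods int) *balancer {
	return &balancer{active: make([]int, pods)}
}

// pick chooses a pod index and records the new active connection.
func (b *balancer) pick() int {
	best := 0
	for i, n := range b.active {
		if n < b.active[best] {
			best = i
		}
	}
	b.active[best]++
	return best
}

func main() {
	// Two nodes, each with private counters: both see zero connections
	// everywhere, so both choose the same pod.
	nodeA, nodeB := newBalancer(3), newBalancer(3)
	fmt.Println("per-node state:", nodeA.pick(), nodeB.pick())

	// One shared set of counters: the second request sees the first one.
	shared := newBalancer(3)
	fmt.Println("shared state:", shared.pick(), shared.pick())
}
```

With private counters both requests should land on the same pod index, while the shared counters spread them over two pods, which is the imbalance described above.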
How can I solve this problem and have a better balance without having a single entry point to the cluster (having a single HAProxy behind the DNS)?
I'm adding here the code used in HAProxy to route based on headers:
resolvers dns
    hold nx 3s
    hold other 3s
    parse-resolv-conf

frontend my_app_frontend
    bind :80
    default_backend my_app_backend
    http-request set-var(sess.namespace) hdr(X-Namespace)
    http-request do-resolve(txn.service,dns,ipv4) str(),concat(my_app.,sess.namespace,.svc.cluster.local)

backend my_app_backend
    http-request set-dst var(txn.service)
    http-request set-dst-port int(80)
    server service 0.0.0.0:0

I would use the peers feature of HAProxy to share the session state for the namespaces across nodes.
https://www.haproxy.com/blog/introduction-to-haproxy-stick-tables/
In short (untested):
peers mypeers
    peer node1 192.168.122.64:10000
    peer node2 192.168.122.1:10000

backend my_app_backend
    stick-table type string len 32 size 100k expire 30m peers mypeers
    stick on hdr(X-Namespace)
    http-request set-dst var(txn.service)
    http-request set-dst-port int(80)
    server service 0.0.0.0:0

Related

Does sessionAffinity over ClientIP work with the UDP protocol in a Kubernetes setup?

Let's say we have two independent Kubernetes clusters, Cluster 1 and Cluster 2. Each of them has two replicas of the same application Pod:
Cluster 1 : Pod A & Pod B
Cluster 2 : Pod C & Pod D
Application code in Pod A (the client) wants to connect to any Pod running in cluster 2 via a NodePort/LoadBalancer service over UDP to send messages. The only requirement is to maintain affinity, so that all messages from Pod A go to one pod only (either Pod C or Pod D). Since UDP is a connectionless protocol, my concern is about session affinity based on ClientIP. Would setting sessionAffinity to ClientIP solve my issue?
sessionAffinity keeps each session on the same Pod based on the source IP, regardless of protocol, within a single cluster. But that does not mean your real session is preserved as you expect across the whole access path.
In other words, sessionAffinity alone does not guarantee that the session is preserved along the entire access path.
For example, Pod A's outbound IP is translated to the IP of the node it runs on (SNAT) if you do not use an egress IP solution for Pod A.
It also depends on how your NodePort and LoadBalancer Services in cluster 2 are configured with respect to source IP. Refer to Using Source IP for more details.
So you should consider how to preserve the session reliably when the clusters access each other. Personally, I think you should consider application-layer (Layer 7) sticky sessions rather than the Service's sessionAffinity.

Kubernetes load balance HTTP/1.1 requests

As we know, HTTP/1.1 uses persistent (long-lived) connections by default. A Kubernetes Service, for example in ClusterIP mode, is an L4 load balancer.
Suppose I have a service running a web server, backed by 3 pods. Can HTTP/1.1 requests be distributed across the 3 pods?
Could anybody help clarify it?
This webpage addresses your question perfectly: https://learnk8s.io/kubernetes-long-lived-connections
In the spirit of StackOverflow, let me summarize the webpage here:
TLDR: Kubernetes doesn't load balance long-lived connections, and some Pods might receive more requests than others.
Kubernetes Services do not exist. There's no process listening on the IP address and port of a Service.
The Service IP address is used only as a placeholder that will be translated by iptables rules into the IP addresses of one of the destination pods using cleverly crafted randomization.
Any connections from clients (regardless from inside or outside cluster) are established directly with the Pods, hence for an HTTP 1.1 persistent connection, the connection will be maintained between the client to a specific Pod until it is closed by either side.
Thus, all requests that use a single persistent connection will be routed to a single Pod (that is selected by the iptables rule when establishing connection) and not load-balanced to the other Pods.
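The connection reuse described above can be observed locally with Go's standard library alone. The sketch below is only an illustration under assumptions: it uses a single `httptest` server instead of a Kubernetes Service, and `countConnections` is a hypothetical helper name. It counts how many TCP connections the server accepts for a series of sequential requests.

```go
package main

import (
	"fmt"
	"io"
	"net"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// countConnections sends n sequential GET requests to a local test server
// and returns how many TCP connections the server accepted.
func countConnections(n int, disableKeepAlives bool) int64 {
	var opened int64
	srv := httptest.NewUnstartedServer(http.HandlerFunc(
		func(w http.ResponseWriter, r *http.Request) { fmt.Fprint(w, "ok") }))
	// ConnState fires with StateNew once for every accepted connection.
	srv.Config.ConnState = func(c net.Conn, state http.ConnState) {
		if state == http.StateNew {
			atomic.AddInt64(&opened, 1)
		}
	}
	srv.Start()
	defer srv.Close()

	client := &http.Client{
		Transport: &http.Transport{DisableKeepAlives: disableKeepAlives},
	}
	for i := 0; i < n; i++ {
		resp, err := client.Get(srv.URL)
		if err != nil {
			panic(err)
		}
		io.Copy(io.Discard, resp.Body) // drain the body so the connection can be reused
		resp.Body.Close()
	}
	return atomic.LoadInt64(&opened)
}

func main() {
	fmt.Println("keep-alive on: ", countConnections(5, false), "connection(s)")
	fmt.Println("keep-alive off:", countConnections(5, true), "connection(s)")
}
```

With keep-alive enabled, the five requests should share one connection. Behind a Service, a pod is chosen only when a connection is established, so that single connection would mean a single pod.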
Additional info:
Per RFC 2616 section 8.1.3 (https://www.w3.org/Protocols/rfc2616/rfc2616-sec8.html#sec8.1.3), a proxy server must signal persistent connections separately with its clients and with the origin servers (or other proxies) it connects to. Each persistent connection applies to only one transport link.

How does kube-proxy handle persistent connections to a Service between pods?

I've seen scenarios where requests from one workload, sent to a ClusterIP service for another workload with no affinities set, only get routed to a subset of the associated pods. The Endpoints object for this service does show all of the pod IPs.
I did a little experiment to figure out what is happening.
Experiment
I set up minikube to have a "router" workload with 3 replicas sending requests to a "backend" workload also with 3 pods. The router just sends a request to the service name like http://backend.
I sent 100 requests to the router service via http://$MINIKUBE_IP:$NODE_PORT, since it's exposed as a NodePort service. Then I observed which backend pods actually handled requests. I repeated this test multiple times.
In most cases, only 2 backend pods handled any requests, with the occasional case where all 3 did. I didn't see any run where all requests went to a single pod in these experiments, though I have seen it happen before when running other tests in AKS.
This led me to the theory that the router is keeping a persistent connection to the backend pod it connects to. Given there are 3 routers and 3 backends, there's an 11% chance all 3 routers "stick" to a single backend, a 67% chance that between the 3 routers, they stick to 2 of the backends, and a 22% chance that each router sticks to a different backend pod (1-to-1).
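The percentages can be checked by brute force. The following Go sketch enumerates all 27 router-to-backend assignments, assuming each router independently sticks to one backend with equal probability; `distinctCounts` is a hypothetical helper name.

```go
package main

import "fmt"

// distinctCounts enumerates every way `routers` routers can each stick to
// one of `backends` backends. Index k of the result holds the number of
// assignments that leave exactly k distinct backends in use.
func distinctCounts(routers, backends int) []int {
	counts := make([]int, backends+1)
	total := 1
	for i := 0; i < routers; i++ {
		total *= backends
	}
	for code := 0; code < total; code++ {
		// Decode `code` as one base-`backends` digit per router.
		used := make(map[int]bool)
		c := code
		for r := 0; r < routers; r++ {
			used[c%backends] = true
			c /= backends
		}
		counts[len(used)]++
	}
	return counts
}

func main() {
	counts := distinctCounts(3, 3)
	const total = 27
	for k := 1; k <= 3; k++ {
		fmt.Printf("%d backend(s) in use: %d/%d = %.0f%%\n",
			k, counts[k], total, 100*float64(counts[k])/total)
	}
}
```

The counts are 3, 18 and 6 out of 27, which match the 11%, 67% and 22% figures.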
Here's one possible combination of router-to-backend connections (out of 27 possible):
Disabling HTTP Keep-Alive
If I use a Transport disabling HTTP Keep-Alives in the router's http client, then any requests I make to the router are uniformly distributed between the different backends on every test run as desired.
// Disable keep-alives so that every request opens a new TCP connection.
client := http.Client{
    Transport: &http.Transport{
        DisableKeepAlives: true,
    },
}
resp, err := client.Get("http://backend")
So the theory seems accurate. But here's my question:
How does the router using HTTP KeepAlive / persistent connections actually result in a single connection between one router pod and one backend pod?
There is a kube-proxy in the middle, so I'd expect any persistent connections to be between the router pod and kube-proxy as well as between kube-proxy and the backend pods.
Also, when the router does a DNS lookup, it's going to find the Cluster IP of the backend service every time, so how can it "stick" to a Pod if it doesn't know the Pod IP?
Using Kubernetes 1.17.7.
This excellent article covers your question in detail.
Kubernetes Services indeed do not load balance long-lived TCP connections.
Under the hood, Services (in most cases) use iptables to distribute connections between pods. But iptables wasn't designed as a load balancer; it's a firewall. It is not capable of high-level load balancing.
As a weak substitute, iptables can send (or not send) a new connection to a certain target with some probability, and thus can be used as an L3/L4 balancer. This mechanism is what kube-proxy employs to approximate load balancing.
Does iptables use round-robin?
No, iptables is primarily used for firewalls, and it is not designed to do load balancing.
However, you could craft a smart set of rules that could make iptables behave like a load balancer.
And this is precisely what happens in Kubernetes.
If you have three Pods, kube-proxy writes the following rules:
select Pod 1 as the destination with a likelihood of 33%. Otherwise, move to the next rule
choose Pod 2 as the destination with a probability of 50%. Otherwise, move to the following rule
select Pod 3 as the destination (no probability)
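To see why this chain of rules ends up uniform, the arithmetic can be written out. This Go sketch assumes rule i of n matches with probability 1/(n-i), as in the three-Pod example above; `effectiveProbabilities` is a hypothetical name and the code models only the math, not iptables itself.

```go
package main

import "fmt"

// effectiveProbabilities models a chain of n rules where rule i (0-based)
// matches with probability 1/(n-i) and otherwise falls through to the next
// rule. It returns the overall chance that each rule is the one that matches.
func effectiveProbabilities(n int) []float64 {
	probs := make([]float64, n)
	remaining := 1.0 // chance that no earlier rule has matched
	for i := 0; i < n; i++ {
		p := 1.0 / float64(n-i)
		probs[i] = remaining * p
		remaining *= 1 - p
	}
	return probs
}

func main() {
	for i, p := range effectiveProbabilities(3) {
		fmt.Printf("Pod %d: %.3f\n", i+1, p)
	}
}
```

For three Pods this gives 1/3, then 2/3 × 1/2 = 1/3, then 2/3 × 1/2 × 1 = 1/3, so each new connection is equally likely to reach any Pod.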
What happens when you use keep-alive with a Kubernetes Service?
Let's imagine that front-end and backend support keep-alive.
You have a single instance of the front-end and three replicas for the backend.
The front-end makes the first request to the backend and opens the TCP connection.
The request reaches the Service, and one of the Pods is selected as the destination.
The backend Pod replies and the front-end receives the response.
But instead of closing the TCP connection, it is kept open for subsequent HTTP requests.
What happens when the front-end issues more requests?
They are sent to the same Pod.
Isn't iptables supposed to distribute the traffic?
It is.
There is a single TCP connection open, and the iptables rules were invoked only when it was first established.
One of the three Pods was selected as the destination.
Since all subsequent requests are channelled through the same TCP connection, iptables isn't invoked anymore.
Also, it's not quite correct to say that kube-proxy is in the middle.
It isn't: kube-proxy by itself doesn't handle any traffic.
All it does is create iptables rules.
It's iptables (in the kernel) that actually matches the packets, distributes the connections, performs DNAT, etc.
Similar question here.

Sticky sessions considering src IP and src port in K8s

I've got a lift 'n shift type of deployment (i.e. by no means cloud-native) and I'd like to set up sticky sessions so that, from the client's perspective, requests keep being handled by the same pod as long as it's available.
Client --> LB --> Ingress --> Service --> Deployment
Due to the fact that LB does SNAT, I think service.spec.sessionAffinityConfig.clientIP will work, but because all the requests would be coming with the same source IP of the loadbalancer, the workload won't be truly balanced across all the pods in the deployment.
Can you think of any way to consider source IP & port pair in the sticky session behavior?
Edit 1: The deployment runs in Oracle Cloud. We're using the Oracle Cloud Loadbalancer service in plain TCP mode (i.e. OSI Layer4).
What the question describes is actually the default traffic management behavior in K8s. The packets within each TCP session target the same pod. The TCP session is initiated from a given source IP (in our case the LB) and source port (which is different for each session), and this session remains "sticky" for its whole duration.
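The difference between keying stickiness on the source IP alone and on the source IP and port pair can be sketched with a toy model in Go. Everything here is an assumption for illustration: the `conn` type, the `assign` helper and the round-robin choice for unseen keys are hypothetical and do not reflect how kube-proxy picks pods.

```go
package main

import "fmt"

// conn is a simplified view of a connection as the Service sees it.
// After SNAT, SrcIP is the load balancer's address for every client.
type conn struct {
	SrcIP   string
	SrcPort int
}

// assign maps each connection to a pod index. Connections that share a key
// reuse the pod chosen earlier; unseen keys take the next pod in rotation
// (rotation is an assumption of this sketch).
func assign(conns []conn, pods int, key func(conn) string) []int {
	seen := map[string]int{}
	out := make([]int, 0, len(conns))
	next := 0
	for _, c := range conns {
		k := key(c)
		pod, ok := seen[k]
		if !ok {
			pod = next % pods
			next++
			seen[k] = pod
		}
		out = append(out, pod)
	}
	return out
}

func byIP(c conn) string     { return c.SrcIP }
func byIPPort(c conn) string { return fmt.Sprintf("%s:%d", c.SrcIP, c.SrcPort) }

func main() {
	// Three sessions that all arrive from the same (SNATed) LB address.
	conns := []conn{{"10.0.0.1", 40001}, {"10.0.0.1", 40002}, {"10.0.0.1", 40003}}
	fmt.Println("keyed on source IP:    ", assign(conns, 3, byIP))
	fmt.Println("keyed on IP:port pair: ", assign(conns, 3, byIPPort))
}
```

Keyed on the source IP, every session from the load balancer lands on one pod. Keyed on the IP and port pair, each session can go to a different pod while its own packets stay together.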

Unique external IP per kubernetes pod

I need to scale my application so that it won't get banned for exceeding the request rate limit of a site it uses frequently (which allows up to X requests per minute per IP).
I planned to use Kubernetes and split the requests between multiple workers, but I saw that all the pods get the same external IP.
So what can I do?
I used a Kubernetes DaemonSet to run a pod on each node, and instead of scaling by changing the deployment, I scale by adding new nodes.
If you run in the cloud, you can create worker nodes with public IP addresses. Your pods will then use the node's public IP address, and you can distribute the pods across nodes using multiple replicas or a DaemonSet.
Do not worry about getting a single external IP. Suppose you have 3 workers and one master like below:
worker1 192.168.1.10
worker2 192.168.1.11
worker3 192.168.1.12
master 192.168.1.13
For example, if you expose nginx on port 30000, Kubernetes opens this port on every node and you can access it with:
curl 192.168.1.10:30000
curl 192.168.1.11:30000
curl 192.168.1.12:30000
curl 192.168.1.13:30000
If you want every worker to run one pod, you can use a DaemonSet, or you can label the nodes you want to use.
This probably has less to do with your Kubernetes implementation and more to do with your network setup. It depends on the source of the "external IP" you're referencing: is it given to you by your ISP? If you google "what is my ip", does it match the single IP you're talking about? If so, you would need to negotiate with your ISP for additional external IPs.
Worth noting that @JamesJJ is correct. Using additional IPs to 'trick' the API into allowing more connections is most likely a violation of that site's TOS and may result in your access getting terminated.