HAProxy and DNS SRV priority and weight - haproxy

If a DNS SRV record has weight and priority equal to zero, is that record used by HAProxy or is it effectively disabled? I'm asking because all of my SRV records in the DNS server have priority
and weight set to zero, and on the stats page I see that all my servers are either active or soft disabled, and I get 503 Service Unavailable.

An SRV weight of 0 becomes an HAProxy weight of 0. An HAProxy weight of 0 means "don't send traffic to this server", aka the DRAIN state (though that server can still receive traffic in some cases, e.g. session stickiness).
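For reference, the fields of an SRV record after the record type are priority, weight, port and target; a hedged zone-file sketch (names, TTLs and ports are placeholders) showing a zero versus a non-zero weight:

    ; hypothetical zone entries - all names are placeholders
    _http._tcp.myapp.example.local. 60 IN SRV 0 0  8080 app1.example.local. ; weight 0 -> HAProxy weight 0 (DRAIN)
    _http._tcp.myapp.example.local. 60 IN SRV 0 10 8080 app2.example.local. ; weight 10 -> eligible for traffic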

Related

How can I have more than 64K connections per node in Kubernetes?

I have an EKS Kubernetes cluster. At a high level, the setup is:
a) There is an EC2 instance, let's call it "VM" or "Host"
b) In the VM, there is a POD running 2 containers: Side Car HAProxy Container + MyApp Container
What happens is that when external requests come in, inside the HAProxy container I can see that the source IP is the "Host" IP. As the Host has a single IP, there can be at most ~64K connections to HAProxy.
I'm curious to know how to work around this problem, as I want to be able to handle something like 256K connections per Host.
I'm not sure if you understand the reason for the 64k limit, so let me try to explain it.
First of all, that is a good answer about the 64k limitation.
Let's say that HAProxy (192.168.100.100) is listening on port 8080 and the free source ports on the Host (192.168.1.1) are 1,353~65,353, so you have combinations of:
source 192.168.1.1:1353~65353 → destination 192.168.100.100:8080
That is ~64k simultaneous connections. I don't know how often the NAT table is updated, but after an update unused ports get reused, so the key word is simultaneous.
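As a rough way to see where that number comes from on a given host (assuming Linux; the values below are typical defaults, not taken from the setup above):

    # inspect the ephemeral source-port range the kernel hands out to outgoing connections
    sysctl net.ipv4.ip_local_port_range          # e.g. 32768 60999 by default
    # widening it gets you closer to ~64k usable source ports per destination ip:port
    sysctl -w net.ipv4.ip_local_port_range="1024 65535"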
If your only problem is the limit of connections per source IP, here are a couple of solutions:
Run multiple HAProxy instances. Three containers increase the limit to 64,000 × 3 = 192,000.
Have HAProxy listen on multiple ports (look into SO_REUSEPORT). Three ports (8080, 8081, 8082) increase the maximum number of connections to 192,000; see the sketch below.
The host interface IP acts like a gateway for the Docker internal network, so I'm not sure whether it is possible to assign several IPs to the Host or to HAProxy; at least I didn't find information about it.
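A minimal sketch of the multi-port idea (frontend/backend names, ports and the server address are placeholders, not taken from the setup above):

    frontend fe_multi
        # each extra listening port adds another ~64k of distinct (src ip, src port, dst ip, dst port) tuples
        bind :8080
        bind :8081
        bind :8082
        default_backend be_app

    backend be_app
        balance roundrobin
        server app1 127.0.0.1:9000 check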
It turns out that in Kubernetes you can configure how clients access the service, and the choice we had made was nodePort. When we changed it to hostPort, the source IP was visible inside the HAProxy container, and hence the limitation I was hitting was removed.
If that option had failed, my next step would have been to try the recommendation in the other response, which was to have HAProxy listen on multiple ports. Thankfully that was not needed.
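For reference, a hedged sketch of what the hostPort change looks like in the pod spec (container names, image tags and port numbers are hypothetical):

    # fragment of the pod template spec
    containers:
      - name: haproxy-sidecar
        image: haproxy:2.4
        ports:
          - containerPort: 8080
            hostPort: 8080   # per the discussion above, the pod now sees the original client source IP
      - name: myapp
        image: myapp:latest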
Thanks!

How does kube-proxy handle persistent connections to a Service between pods?

I've seen scenarios where requests from one workload, sent to a ClusterIP service for another workload with no affinities set, only get routed to a subset of the associated pods. The Endpoints object for this service does show all of the pod IPs.
I did a little experiment to figure out what is happening.
Experiment
I set up minikube to have a "router" workload with 3 replicas sending requests to a "backend" workload also with 3 pods. The router just sends a request to the service name like http://backend.
I sent 100 requests to the router service via http://$MINIKUBE_IP:$NODE_PORT, since it's exposed as a NodePort service. Then I observed which backend pods actually handled requests. I repeated this test multiple times.
In most cases, only 2 backend pods handled any requests, with the occasional case where all 3 did. I didn't see any runs where all requests went to a single pod in these experiments, though I have seen it happen before when running other tests in AKS.
This led me to the theory that the router is keeping a persistent connection to the backend pod it connects to. Given there are 3 routers and 3 backends, there's an 11% chance all 3 routers "stick" to a single backend, a 67% chance that between the 3 routers, they stick to 2 of the backends, and a 22% chance that each router sticks to a different backend pod (1-to-1).
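Those figures are easy to sanity-check by enumerating all 27 assignments; a small self-contained sketch (not part of the original experiment):

    package main

    import "fmt"

    func main() {
        // count, over all 3^3 = 27 router-to-backend assignments, how many distinct backends get used
        counts := map[int]int{}
        for a := 0; a < 3; a++ {
            for b := 0; b < 3; b++ {
                for c := 0; c < 3; c++ {
                    used := map[int]bool{a: true, b: true, c: true}
                    counts[len(used)]++
                }
            }
        }
        // prints map[1:3 2:18 3:6]: 3/27 ≈ 11% all on one backend, 18/27 ≈ 67% on two, 6/27 ≈ 22% on all three
        fmt.Println(counts)
    }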
Here's one possible combination of router-to-backend connections (out of 27 possible):
Disabling HTTP Keep-Alive
If I use a Transport disabling HTTP Keep-Alives in the router's http client, then any requests I make to the router are uniformly distributed between the different backends on every test run as desired.
// an http.Client whose Transport disables keep-alives, so every request opens a fresh TCP connection
client := http.Client{
    Transport: &http.Transport{
        DisableKeepAlives: true,
    },
}
resp, err := client.Get("http://backend")
So the theory seems accurate. But here's my question:
How does the router using HTTP KeepAlive / persistent connections actually result in a single connection between one router pod and one backend pod?
There is a kube-proxy in the middle, so I'd expect any persistent connections to be between the router pod and kube-proxy as well as between kube-proxy and the backend pods.
Also, when the router does a DNS lookup, it's going to find the Cluster IP of the backend service every time, so how can it "stick" to a Pod if it doesn't know the Pod IP?
Using Kubernetes 1.17.7.
This excellent article covers your question in detail.
Kubernetes Services indeed do not load balance long-lived TCP connections.
Under the hood, Services (in most cases) use iptables to distribute connections between pods. But iptables wasn't designed as a balancer; it's a firewall, and it isn't capable of doing high-level load balancing.
As a weak substitute, iptables can create (or not create) a connection to a certain target with some probability, and thus can be used as an L3/L4 balancer. This mechanism is what kube-proxy employs to somewhat imitate load balancing.
Does iptables use round-robin?
No, iptables is primarily used for firewalls, and it is not designed to do load balancing.
However, you could craft a smart set of rules that could make iptables behave like a load balancer.
And this is precisely what happens in Kubernetes.
If you have three Pods, kube-proxy writes the following rules:
select Pod 1 as the destination with a likelihood of 33%. Otherwise, move to the next rule
choose Pod 2 as the destination with a probability of 50%. Otherwise, move to the following rule
select Pod 3 as the destination (no probability)
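For reference, the generated rules look roughly like this (the chain suffixes are hashes and differ per cluster; a trimmed sketch of iptables -t nat -S output):

    -A KUBE-SVC-EXAMPLEHASH -m statistic --mode random --probability 0.33333333349 -j KUBE-SEP-POD1HASH
    -A KUBE-SVC-EXAMPLEHASH -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-POD2HASH
    -A KUBE-SVC-EXAMPLEHASH -j KUBE-SEP-POD3HASH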
What happens when you use keep-alive with a Kubernetes Service?
Let's imagine that front-end and backend support keep-alive.
You have a single instance of the front-end and three replicas for the backend.
The front-end makes the first request to the backend and opens the TCP connection.
The request reaches the Service, and one of the Pods is selected as the destination.
The backend Pod replies and the front-end receives the response.
But instead of closing the TCP connection, it is kept open for subsequent HTTP requests.
What happens when the front-end issues more requests?
They are sent to the same Pod.
Isn't iptables supposed to distribute the traffic?
It is.
There is a single TCP connection open, and the iptables rules were only invoked the first time, when the connection was established.
One of the three Pods was selected as the destination.
Since all subsequent requests are channelled through the same TCP connection, iptables isn't invoked anymore.
Also, it's not quite correct to say that kube-proxy is in the middle.
It isn't - kube-proxy by itself doesn't handle any traffic.
All it does is create iptables rules.
It's iptables that actually listens, distributes, does DNAT, etc.
Similar question here.

HAProxy with Cloudmap with Fargate

I'm looking to migrate to AWS Fargate to host a number of containers load balanced via HAProxy. It seems an elegant method to use a combination of AWS Cloud Map for service discovery and HAProxy's DNS server-template syntax to auto-populate the backend servers.
However, it has come to my attention that Route 53, the underlying system of Cloud Map, returns at most 8 A or SRV records. From the HAProxy documentation it sounds like HAProxy will continuously mark the nodes not returned in the latest DNS response as unhealthy, which would lead to backends being constantly dropped and re-added to the HAProxy pool even if they're all healthy.
I can only assume this is something others have encountered before; is there a trick to getting HAProxy to accommodate more than the maximum of 8 returned backend servers?
HAProxy supports DNS service discovery with the server-template directive. Make sure you configure a resolvers section and reference it with the resolvers keyword on the server line. There's a blog post here. If you find that you need to accommodate more records, you can adjust the accepted_payload_size setting.
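A minimal, untested sketch of that combination (nameserver address, service name and slot count are placeholders; accepted_payload_size can be raised up to 8192):

    resolvers awsdns
        nameserver dns1 10.0.0.2:53        # VPC/Cloud Map resolver address is a placeholder
        accepted_payload_size 8192         # accept DNS responses larger than the 512-byte default
        hold valid 10s

    backend be_app
        balance roundrobin
        # 20 server slots, filled and updated from the SRV records returned for the discovered service
        server-template app 20 _myservice._tcp.service.local resolvers awsdns resolve-prefer ipv4 check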

Is the Istio metric "istio_requests_total" the number of served requests?

Istio has several default metrics, such as istio_requests_total, istio_request_bytes, istio_tcp_connections_opened_total. Istio envoy proxy computes and exposes these metrics. On the Istio website, it shows that istio_requests_total is a COUNTER incremented for every request handled by an Istio proxy.
We ran some experiments where we sent a lot of requests through the Istio envoy to reach a microservice behind it, and at the same time we monitored the metric exposed by the envoy. However, we found that istio_requests_total does not include requests that have gone through the Istio envoy to the backend microservice but whose responses have not yet come back from that microservice. In other words, istio_requests_total only counts served requests and does not include requests in flight.
My question is: is our observation right? Why does istio_requests_total not include the requests in flight?
As mentioned here
The default metrics are standard information about HTTP, gRPC and TCP requests and responses. Every request is reported by the source proxy and the destination proxy as well and these can provide a different view on the traffic. Some requests may not be reported by the destination (if the request didn't reach the destination at all), but some labels (like connection_security_policy) are only available on the destination side. Here are some of the most important HTTP metrics:
istio_requests_total is a COUNTER that aggregates request totals between Kubernetes workloads, and groups them by response codes, response flags and security policy.
As mentioned here
When Mixer collects metrics from Envoy, it assigns dimensions that downstream backends can use for grouping and filtering. In Istio’s default configuration, dimensions include attributes that indicate where in your cluster a request is traveling, such as the name of the origin and destination service. This gives you visibility into traffic anywhere in your cluster.
Metric to watch: requests_total
The request count metric indicates the overall throughput of requests between services in your mesh, and increments whenever an Envoy sidecar receives an HTTP or gRPC request. You can track this metric by both origin and destination service. If the count of requests between one service and another has plummeted, either the origin has stopped sending requests or the destination has failed to handle them. In this case, you should check for a misconfiguration in Pilot, the Istio component that routes traffic between services. If there’s a rise in demand, you can correlate this metric with increases in resource metrics like CPU utilization, and ensure that your system resources are scaled correctly.
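For instance, a hedged PromQL sketch of how this counter is commonly consumed (label names follow Istio's standard metric definitions; adjust the selector to your mesh):

    # per-destination request rate over 5 minutes, split by response code
    sum(rate(istio_requests_total{reporter="destination"}[5m]))
      by (destination_workload, destination_workload_namespace, response_code)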
Maybe it's worth checking the Envoy docs about that, because of what's written here
The queries above use the istio_requests_total metric, which is a standard Istio metric. You can observe other metrics, in particular, the ones of Envoy (Envoy is the sidecar proxy of Istio). You can see the collected metrics in the insert metric at cursor drop-down menu.
Based on the above docs, I agree with what @Joel mentioned in the comments:
I think you're correct, and I imagine the "why" is because of the response flags that are expected to be found among the metric labels. These can only be written when a response is received. If they wanted to do it differently, I guess it would mean having two different counters, one for requests sent and one for responses received.

Balancing traffic using least connection in Kubernetes

I have a Kubernetes cluster with a deployment like the next one:
The goal here is to deploy an application in multiple pods exposed through a ClusterIP service named my-app. The same deployment is made in multiple namespaces (A, B and C), slightly changing the config of the application. Then, on some nodes I have an HAProxy using hostNetwork to bind to the node ports. These HAProxy instances are exposed to my clients through a DNS record pointing to them (my_app.com).
When a client connects to my app, it sends a header specifying the namespace to which the request should be redirected (A, B or C), and the HAProxy resolves the IP of the service using do-resolve against a DNS entry like my_app.A.svc.cluster.local, which returns the IP of the service my_app in namespace A. That way I can have a single entry point (a single DNS record) and a single port (80) to my cluster, which is one of my requirements. I'm also able to create new namespaces and deploy other configs of my app without needing to modify the HAProxies, which is the second requirement.
Now, the requests I get are a mix of short and long requests, so I need to use least connection here. This is not possible in the HAProxies, as I don't have a static list of backends (the redirection is dynamic, as you can see in the code below). I'm trying to use kube-proxy with IPVS in least connection mode. What I noticed is that the tracking of connections to the different pods is per node, and this information is not shared between nodes. This way, if two requests to my_app.com with Namespace: A are processed by two different nodes, both can go to the same pod (e.g. pod_1) because on each node the number of active connections to that pod is 0. The problem becomes worse as I increase the number of HAProxies behind the DNS.
How can I solve this problem and have a better balance without having a single entry point to the cluster (having a single HAProxy behind the DNS)?
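For context, the IPVS least-connection mode referred to above is selected through the kube-proxy configuration, roughly like this sketch (values are illustrative); since every node runs its own kube-proxy, the connection counters are indeed per node:

    apiVersion: kubeproxy.config.k8s.io/v1alpha1
    kind: KubeProxyConfiguration
    mode: "ipvs"
    ipvs:
      scheduler: "lc"   # least connection, tracked independently by each node's IPVS tables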
I'm adding here the code used in HAProxy to route based on headers:
resolvers dns
    hold nx 3s
    hold other 3s
    parse-resolv-conf

frontend my_app_frontend
    bind :80
    default_backend my_app_backend
    http-request set-var(sess.namespace) hdr(X-Namespace)
    http-request do-resolve(txn.service,dns,ipv4) str(),concat(my_app.,sess.namespace,.svc.cluster.local)

backend my_app_backend
    http-request set-dst var(txn.service)
    http-request set-dst-port int(80)
    server service 0.0.0.0:0
I would use the peers feature of HAProxy to share the stickiness for the namespaces across node borders.
https://www.haproxy.com/blog/introduction-to-haproxy-stick-tables/
In short, and untested:
peers mypeers
    peer node1 192.168.122.64:10000
    peer node2 192.168.122.1:10000

backend my_app_backend
    stick-table type string len 32 size 100k expire 30m peers mypeers
    stick on hdr(X-Namespace)
    http-request set-dst var(txn.service)
    http-request set-dst-port int(80)
    server service 0.0.0.0:0