Kubernetes OPTIONS request load balancer latency of 10 seconds - how to debug?

I have an issue with the backend responding to OPTIONS requests.
Sometimes the server responds to the OPTIONS request in a few milliseconds, but often it takes 10 seconds or more. I have a feeling the load balancer / ingress is holding it back, since the backend server does not behave this way locally and it only uses a Node.js app.use(cors()).
The main question is: how can I debug where this goes wrong?
I can see the incoming requests, but I cannot tell for certain whether the delay comes from the load balancer itself or from the load balancer waiting on a response from the server.
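One way to narrow it down is to time the same OPTIONS request along two paths: through the ingress/load balancer, and directly to the pod via kubectl port-forward. If the direct path is consistently fast while the ingress path occasionally stalls, the delay is being added in front of the backend. A rough Go sketch (the URLs, port, and Origin header are placeholders, not from the question):

package main

import (
	"log"
	"net/http"
	"time"
)

func timeOptions(url string) {
	req, err := http.NewRequest(http.MethodOptions, url, nil)
	if err != nil {
		log.Fatal(err)
	}
	// Mimic a browser CORS preflight so the backend takes the same code path.
	req.Header.Set("Origin", "https://example.com")
	req.Header.Set("Access-Control-Request-Method", "POST")

	start := time.Now()
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Printf("%s: error after %v: %v", url, time.Since(start), err)
		return
	}
	resp.Body.Close()
	log.Printf("%s: %s in %v", url, resp.Status, time.Since(start))
}

func main() {
	for i := 0; i < 20; i++ {
		timeOptions("https://ingress.example.com/api") // placeholder: through the load balancer
		timeOptions("http://localhost:8080/api")       // placeholder: kubectl port-forward to the pod
	}
}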

Related

Delayed Unauthorized responses from AKS

We use AKS's kube-apiserver for leader election from a VM that is external to the k8s cluster but in the same VNet. We use a k8s client from the client-go package. The client tries to get/update the Lease object every 2 seconds. We observe occasional failures (every few hours) caused by delayed "Unauthorized" responses. The client-go HTTP client refreshes credentials/tokens when it receives an Unauthorized response. However, when the Unauthorized response is extremely delayed (sometimes up to 30 seconds!), the timeouts of the HTTP client or of the leader-election mechanism kick in.
We could tune the HTTP client and leader-election timeouts, but we do need fast failover (up to 30 sec), so it would be great to eliminate these delays.
What is the reason these Unauthorized responses get so delayed?
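For context, these are the timeouts the question refers to: a minimal client-go leader-election sketch against a Lease object. Names, kubeconfig path, and identity are placeholders, and the durations are illustrative (only the 2-second retry period comes from the question).

package main

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	// Placeholder kubeconfig path for the external VM.
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "my-lease", Namespace: "default"}, // placeholders
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: "vm-1"},
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second, // how long the lease is valid before another candidate may take over
		RenewDeadline: 10 * time.Second, // the leader must renew within this window or step down
		RetryPeriod:   2 * time.Second,  // the "get/update the lease every 2 sec" from the question
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) { /* start the work that requires leadership */ },
			OnStoppedLeading: func() { /* stop work, trigger failover */ },
		},
	})
}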

gRPC client shows UNAVAILABLE errors during server scale-up

I have two gRPC services that communicate with each other over an Istio service mesh, with Envoy sidecar proxies for all the services.
We have an issue where, while the server pods are scaling up due to high load, the client throws a few gRPC UNAVAILABLE (mostly), DEADLINE_EXCEEDED, and CANCELLED errors for a while as soon as a new pod becomes ready.
I don't see any CPU throttling in server pods at all.
What else could be the issue and how can I investigate this?
Without the concrete status message, it's hard to say what could be the cause of the errors mentioned above.
To reduce UNAVAILABLE, one option is to ask the RPC to wait-for-ready: https://github.com/grpc/grpc/blob/master/doc/wait-for-ready.md. This feature is available in all major gRPC languages (some rename it to fail-fast=false).
DEADLINE_EXCEEDED is caused by a timeout set by your application or the Envoy config; you should be able to tune it.
CANCELLED could mean: 1. the server is entering a graceful shutdown state; 2. the server is overloaded and rejecting new connections.
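For concreteness, here is a minimal Go sketch of the two knobs mentioned above, a per-call deadline and wait-for-ready, using the grpc-go hello-world stubs as a stand-in for the real service (the target address is a placeholder):

package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/status"

	pb "google.golang.org/grpc/examples/helloworld/helloworld"
)

func main() {
	// Placeholder target; in-mesh this would be the server's service address.
	conn, err := grpc.Dial("server.default.svc.cluster.local:50051",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	client := pb.NewGreeterClient(conn)

	// Bound the call so a stuck RPC surfaces as DEADLINE_EXCEEDED instead of hanging forever.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// WaitForReady queues the RPC until a connection is READY rather than
	// failing immediately with UNAVAILABLE while new pods are still warming up.
	_, err = client.SayHello(ctx, &pb.HelloRequest{Name: "ping"}, grpc.WaitForReady(true))
	if err != nil {
		st, _ := status.FromError(err)
		log.Printf("rpc failed: code=%s msg=%q", st.Code(), st.Message())
	}
}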

How does kube-proxy handle persistent connections to a Service between pods?

I've seen scenarios where requests from one workload, sent to a ClusterIP service for another workload with no affinities set, only get routed to a subset of the associated pods. The Endpoints object for this service does show all of the pod IPs.
I did a little experiment to figure out what is happening.
Experiment
I set up minikube to have a "router" workload with 3 replicas sending requests to a "backend" workload also with 3 pods. The router just sends a request to the service name like http://backend.
I sent 100 requests to the router service via http://$MINIKUBE_IP:$NODE_PORT, since it's exposed as a NodePort service. Then I observed which backend pods actually handled requests. I repeated this test multiple times.
In most cases, only 2 backend pods handled any requests, with the occasional run where all 3 did. I didn't see any run where every request went to a single pod in these experiments, though I have seen that happen before when running other tests in AKS.
This led me to the theory that each router keeps a persistent connection to the backend pod it first connects to. Given there are 3 routers and 3 backends, there's an 11% chance all 3 routers "stick" to a single backend, a 67% chance that, between them, they stick to 2 of the backends, and a 22% chance that each router sticks to a different backend pod (1-to-1).
For example, routers 1 and 2 might both stick to backend A while router 3 sticks to backend B - that's just one of the 27 possible router-to-backend combinations.
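As a quick sanity check on those percentages, a tiny Go program (illustrative only) can enumerate all 27 assignments of 3 routers to 3 backends and count how many distinct backends end up used:

package main

import "fmt"

func main() {
	counts := map[int]int{} // distinct backends used -> number of combinations
	for a := 0; a < 3; a++ {
		for b := 0; b < 3; b++ {
			for c := 0; c < 3; c++ {
				distinct := map[int]bool{a: true, b: true, c: true}
				counts[len(distinct)]++
			}
		}
	}
	// Prints map[1:3 2:18 3:6], i.e. 3/27 ≈ 11%, 18/27 ≈ 67%, 6/27 ≈ 22%.
	fmt.Println(counts)
}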
Disabling HTTP Keep-Alive
If I use a Transport that disables HTTP keep-alives in the router's HTTP client, then the requests I make to the router are uniformly distributed across the different backends on every test run, as desired.
// Disable connection reuse so each request opens a new TCP connection
// and therefore gets a fresh iptables (DNAT) routing decision.
client := http.Client{
	Transport: &http.Transport{
		DisableKeepAlives: true,
	},
}
resp, err := client.Get("http://backend")
So the theory seems accurate. But here's my question:
How does the router using HTTP KeepAlive / persistent connections actually result in a single connection between one router pod and one backend pod?
There is a kube-proxy in the middle, so I'd expect any persistent connections to be between the router pod and kube-proxy as well as between kube-proxy and the backend pods.
Also, when the router does a DNS lookup, it's going to find the Cluster IP of the backend service every time, so how can it "stick" to a Pod if it doesn't know the Pod IP?
Using Kubernetes 1.17.7.
This excellent article covers your question in detail.
Kubernetes Services indeed do not load balance long-lived TCP connections.
Under the hood, Services (in most cases) use iptables to distribute connections between pods. But iptables wasn't designed as a load balancer; it's a firewall tool, and it isn't capable of high-level load balancing.
As a weak substitute, iptables can route a connection to a given target with some probability, and thus can be used as an L3/L4 balancer. This mechanism is what kube-proxy employs to approximate load balancing.
Does iptables use round-robin?
No, iptables is primarily used for firewalls, and it is not designed to do load balancing.
However, you could craft a smart set of rules that could make iptables behave like a load balancer.
And this is precisely what happens in Kubernetes.
If you have three Pods, kube-proxy writes the following rules (a short simulation of this cascade follows the list):
select Pod 1 as the destination with a probability of 33%; otherwise, move to the next rule
select Pod 2 as the destination with a probability of 50%; otherwise, move to the next rule
select Pod 3 as the destination (no probability check)
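To see why that cascade is fair, here is a tiny simulation in Go (illustrative only, not actual iptables or kube-proxy code) of the three rules above; each pod is picked roughly a third of the time:

package main

import (
	"fmt"
	"math/rand"
)

// pickPod mirrors the rule cascade: 33% -> Pod 1, else 50% -> Pod 2, else Pod 3.
func pickPod(r *rand.Rand) int {
	if r.Float64() < 1.0/3.0 {
		return 1
	}
	if r.Float64() < 0.5 {
		return 2
	}
	return 3
}

func main() {
	r := rand.New(rand.NewSource(1))
	counts := map[int]int{}
	for i := 0; i < 100000; i++ {
		counts[pickPod(r)]++
	}
	fmt.Println(counts) // each pod ends up with roughly a third of the picks
}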
What happens when you use keep-alive with a Kubernetes Service?
Let's imagine that front-end and backend support keep-alive.
You have a single instance of the front-end and three replicas for the backend.
The front-end makes the first request to the backend and opens the TCP connection.
The request reaches the Service, and one of the Pods is selected as the destination.
The backend Pod replies and the front-end receives the response.
But instead of closing the TCP connection, it is kept open for subsequent HTTP requests.
What happens when the front-end issues more requests?
They are sent to the same Pod.
Isn't iptables supposed to distribute the traffic?
It is.
There is a single TCP connection open, and the iptables rules were only invoked when it was first established.
One of the three Pods was selected as the destination.
Since all subsequent requests are channelled through the same TCP connection, iptables isn't invoked anymore.
Also, it's not quite correct to say that kube-proxy is in the middle.
It isn't: kube-proxy by itself doesn't handle any traffic.
All it does is create iptables rules.
It's iptables that actually intercepts, distributes, and DNATs the traffic.
Similar question here.

GKE streaming large file download fails with partial response

I have an app hosted on GKE which, among many tasks, serves zip files to clients. These zips are constructed on the fly from many individual files on Google Cloud Storage.
The issue I'm facing is that when these zips get particularly large, the connection fails randomly partway through (anywhere between 1.4 GB and 2.5 GB). There doesn't seem to be any pattern with timing either - it can happen anywhere between 2 and 8 minutes in.
As far as I can tell, the connection is being dropped somewhere between the load balancer and my app. Is the GKE ingress (load balancer) known to close long/large connections?
GKE setup:
HTTP(S) load balancer ingress
NodePort backend service
Deployment (my app)
More details/debugging steps:
I can't reproduce it locally (without kubernetes).
The load balancer logs statusDetails: "backend_connection_closed_after_partial_response_sent" while the response has a 200 status code. A google of this gave nothing helpful.
Directly accessing the pod and downloading using k8s port-forward worked successfully
My app logs that the request was cancelled (by the requester)
I can verify none of the files are corrupt (can download all directly from storage)
I believe your "backend_connection_closed_after_partial_response_sent" issue is caused by the WebSocket connection being killed by the back-end prematurely. See the documentation on WebSocket proxying in nginx - it explains the nature of this process. In short, by default a WebSocket connection is killed after 10 minutes.
Why does it work when you download the file directly from the pod? Because you're bypassing the load balancer and the connection is kept alive properly. When the WebSocket is proxied, things start to break, because WebSocket relies on hop-by-hop headers, which are not proxied.
A similar case was discussed here; it was resolved by sending ping frames from the back-end to the client.
In my opinion, your best shot is to do the same. I've found many reports of similar issues when a WebSocket is proxied, and most of them suggest using pings, because they reset the connection timer and keep the connection alive.
Here's more about pinging the client using WebSocket and timeouts
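If the back-end really is serving the download over a WebSocket, here is a minimal sketch of the ping suggestion, assuming the gorilla/websocket library (the path, interval, and deadline are placeholders):

package main

import (
	"log"
	"net/http"
	"time"

	"github.com/gorilla/websocket"
)

var upgrader = websocket.Upgrader{}

func download(w http.ResponseWriter, r *http.Request) {
	conn, err := upgrader.Upgrade(w, r, nil)
	if err != nil {
		log.Println("upgrade:", err)
		return
	}
	defer conn.Close()

	// Ping the client periodically; control frames count as traffic and stop
	// intermediaries from treating the connection as idle.
	done := make(chan struct{})
	defer close(done)
	go func() {
		ticker := time.NewTicker(30 * time.Second)
		defer ticker.Stop()
		for {
			select {
			case <-done:
				return
			case <-ticker.C:
				if err := conn.WriteControl(websocket.PingMessage, nil, time.Now().Add(5*time.Second)); err != nil {
					return
				}
			}
		}
	}()

	// ... stream the zip payload to the client here (omitted) ...
}

func main() {
	http.HandleFunc("/download", download)
	log.Fatal(http.ListenAndServe(":8080", nil))
}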
I work for Google and this is as far as I can help you - if this doesn't resolve your issue you have to contact GCP support.

Azure Traffic Manager and Kubernetes Service showing Degraded

We're trying to put a Traffic Manager on top of our Azure Kubernetes Service clusters so we can run a cluster in 2 regions (UK West and UK South) and balance across both regions.
The Traffic Manager itself seems to be working OK, but in the Azure portal it shows as degraded, and in the ingress controller logs on the k8s cluster I can see a request that looks like this:
[18/Sep/2019:10:40:58 +0000] "GET / HTTP/1.1" 404 153 "-" "Azure Traffic Manager Endpoint Monitor" 407 0.000 [-]
So the Traffic Manager is firing off a request; it hits the ingress controller, which obviously can't resolve that path, so it returns a 404.
I've played around with the custom host header setting to point the probes at a health check endpoint in one of the pods. It did kind of work for a bit, but then it seemed to go back to doing a GET on /, so it went into degraded again (yes, I know that sounds odd).
Even if that worked, I don't really want to point it at a specific pod endpoint in case that pod is genuinely down for some reason. Is there something we can do in the ingress controller config to make it respond with a 200, so the Traffic Manager knows it's up?
Cheers
As a quick fix, I would suggest switching to TCP-based probing. You can change the probe protocol to TCP and choose the port your AKS endpoint is listening on.
If the 3-way handshake to that port fails, the probe is considered failed.
Beyond that, why not expose a simple health check endpoint on the same pod where the app is hosted, rather than on a different pod? If you deploy a workaround to return HTTP 200 from the ingress controller and the backend is down, traffic will still be routed there, which defeats the purpose of having a probe. A sketch of such an endpoint is below.
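A minimal sketch of that suggestion: a lightweight health endpoint served from the same container as the app, which the Traffic Manager probe (via the custom host header / path settings) can target. The /healthz path and port are placeholders, not from the question.

package main

import (
	"log"
	"net/http"
)

func main() {
	// Answer 200 only when this instance of the app can actually serve traffic;
	// add real readiness checks (DB, downstream dependencies) as needed.
	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
		w.Write([]byte("ok"))
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}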