gRPC client shows UNAVAILABLE errors during server scale-up - Kubernetes

I have two gRPC services that communicate with each other through an Istio service mesh, with Envoy proxies for all the services.
When the server pods scale up under high load, the client throws a few gRPC UNAVAILABLE (mostly), DEADLINE_EXCEEDED, and CANCELLED errors for a while, starting as soon as the new pod is ready.
I don't see any CPU throttling in server pods at all.
What else could be the issue and how can I investigate this?

Without the concrete status message, it's hard to say what could be the cause of the errors mentioned above.
To reduce UNAVAILABLE errors, one option is to ask the RPC to wait for ready: https://github.com/grpc/grpc/blob/master/doc/wait-for-ready.md. This feature is available in all major gRPC languages (some call it fail-fast=false instead).
DEADLINE_EXCEEDED is caused by a timeout set either by your application or in your Envoy configuration; you should be able to tune it.
CANCELLED could mean: 1. the server is entering a graceful shutdown state; 2. the server is overloaded and rejecting new connections.
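The wait-for-ready option itself lives inside the gRPC library (in Python's grpcio it is the `wait_for_ready=True` argument of a stub call). Where that option is not available, a rough application-level approximation is to retry UNAVAILABLE inside one overall deadline, so retries never stretch the time budget that DEADLINE_EXCEEDED enforces. This is a stdlib-only sketch, not the grpcio API: `RpcError`, `call_with_deadline`, and the string status codes are hypothetical stand-ins, and retrying is only safe for idempotent calls.

```python
import random
import time


class RpcError(Exception):
    """Hypothetical stand-in for a gRPC error that carries a status code name."""

    def __init__(self, code):
        super().__init__(code)
        self.code = code


# Only UNAVAILABLE is treated as transient (e.g. no ready backend yet).
RETRYABLE_CODES = {"UNAVAILABLE"}


def call_with_deadline(rpc, timeout_s, base_backoff_s=0.05, max_backoff_s=1.0):
    """Retry `rpc` on retryable codes until one overall deadline expires.

    `rpc(remaining_s)` receives the time left, so each attempt can use it as
    its own per-attempt timeout and the total never exceeds `timeout_s`.
    """
    deadline = time.monotonic() + timeout_s
    backoff = base_backoff_s
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise RpcError("DEADLINE_EXCEEDED")
        try:
            return rpc(remaining)
        except RpcError as err:
            if err.code not in RETRYABLE_CODES:
                raise  # e.g. INVALID_ARGUMENT: retrying will not help
            # Jittered exponential backoff, never sleeping past the deadline.
            time_left = max(0.0, deadline - time.monotonic())
            time.sleep(min(backoff * random.uniform(0.5, 1.0), time_left))
            backoff = min(backoff * 2, max_backoff_s)
```

In real code, `rpc` would wrap a stub call and pass `remaining` as that call's per-attempt `timeout`.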

Related

Readiness Probe and Apache Commons HttpClient

I have a simple OpenShift setup with a Service configured with 2 backend pods. The pods have readiness probes configured. The Service is exposed via NodePort. All of this configuration is fine and works as expected: once a readiness probe fails, the Service marks the pod as unreachable and no NEW requests get routed to that pod.
Scenario 1:
I execute a curl command to access the Service. While the curl command is executing, I introduce a readiness failure on Pod-1. I see that no new requests are sent to Pod-1. This is FINE.
Scenario 2:
I have a Java client that uses the Apache Commons HttpClient library to open a connection to the Kubernetes Service. The connection gets established and works fine. The problem comes when I introduce a readiness failure on Pod-1: I still see the client sending requests only to Pod-1, even though the Service has only the endpoint of Pod-2.
My hunch: because HttpClient uses persistent connections and the Service is exposed via NodePort, the destination address of the HTTP connection is Pod-1 itself. So even if the readiness probe fails, the client still sends requests to Pod-1.
Can someone explain why it works the way described above?
kube-proxy (or rather the iptables rules it generates) intentionally does not shut down existing TCP connections when changing the endpoint mapping (which is what a failed readiness probe triggers). This has been discussed on many tickets over the years, with little consensus on whether the behavior should be changed. For now, your best bet is to use an Ingress Controller for HTTP traffic instead, since those all update live and bypass kube-proxy. You could also send back a Keep-Alive header in your responses and terminate persistent connections after N seconds or N requests, though that only shrinks the window for badness.
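The Keep-Alive suggestion can be sketched on the server side with Python's standard library: each persistent connection is closed after a request budget or a maximum age, so the client has to reconnect and kube-proxy picks an endpoint again from the current set, which no longer includes a pod that failed its readiness probe. `RecyclingHandler`, `make_server`, and both limits are hypothetical values to tune, and this narrows the window rather than closing it.

```python
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class RecyclingHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"   # required for persistent (keep-alive) connections
    max_requests_per_conn = 100     # assumed budget; tune for your traffic
    max_conn_age_s = 30.0           # assumed budget; checked only when a request arrives

    def setup(self):
        super().setup()
        # One handler instance serves every request on a connection,
        # so these counters are per connection.
        self.requests_served = 0
        self.opened_at = time.monotonic()

    def do_GET(self):
        self.requests_served += 1
        age_s = time.monotonic() - self.opened_at
        body = b"ok\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        if self.requests_served >= self.max_requests_per_conn or age_s >= self.max_conn_age_s:
            # Tells the client to reconnect; send_header() also sets
            # self.close_connection, so the server closes the socket after
            # this response and the next connection is load-balanced afresh.
            self.send_header("Connection", "close")
        else:
            left = self.max_requests_per_conn - self.requests_served
            self.send_header("Keep-Alive", f"timeout={int(self.max_conn_age_s)}, max={left}")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence per-request logging in this sketch


def make_server(port=8080):
    """Build (but do not start) a threaded server using RecyclingHandler."""
    return ThreadingHTTPServer(("0.0.0.0", port), RecyclingHandler)
```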

GKE streaming large file download fails with partial response

I have an app hosted on GKE which, among many tasks, serves a zip file to clients. These zip files are constructed on the fly from many individual files on Google Cloud Storage.
The issue I'm facing is that when these zips get particularly large, the connection fails randomly partway through (anywhere between 1.4GB and 2.5GB). There doesn't seem to be any pattern in the timing either: it can happen anywhere between 2 and 8 minutes in.
AFAIK, the connection is disconnecting somewhere between the load balancer and my app. Is GKE ingress (load balancer) known to close long/large connections?
GKE setup:
HTTP(S) load balancer ingress
NodePort backend service
Deployment (my app)
More details/debugging steps:
I can't reproduce it locally (without Kubernetes).
The load balancer logs statusDetails: "backend_connection_closed_after_partial_response_sent" while the response has a 200 status code. Googling this turned up nothing helpful.
Directly accessing the pod and downloading using k8s port-forward worked successfully
My app logs that the request was cancelled (by the requester)
I can verify none of the files are corrupt (can download all directly from storage)
I believe your "backend_connection_closed_after_partial_response_sent" issue is caused by the WebSocket connection being killed by the back end prematurely. You can see the nginx documentation on WebSocket proxying, which explains the nature of this process. In short, by default a proxied WebSocket connection is closed once it has been idle for too long (nginx closes it if the proxied server transmits no data within 60 seconds, configurable with proxy_read_timeout).
Why does it work when you download the file directly from the pod? Because you're bypassing the load balancer, so the WebSocket connection is kept alive properly. Problems start when the WebSocket is proxied, because WebSocket relies on hop-by-hop headers, which are not proxied.
A similar case was discussed here. It was resolved by sending ping frames from the back end to the client.
In my opinion, your best shot is to do the same. I've found many similar cases where a WebSocket was proxied, and most of them suggest using pings, because each ping resets the connection timer and keeps the connection alive.
Here's more about pinging the client using WebSocket and timeouts
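For reference, a ping is a tiny control frame. This sketch encodes one per RFC 6455 (opcode 0x9, FIN bit set, unmasked because it travels from server to client) and sends it on a timer. It assumes you already hold a `send` callable that writes raw bytes to an upgraded WebSocket connection; most WebSocket libraries offer a built-in ping option that does this for you. `build_ping_frame`, `Pinger`, and the 30-second default interval are assumptions, and the interval must stay below the proxy's idle timeout.

```python
import threading

PING_OPCODE = 0x9


def build_ping_frame(payload=b""):
    """Encode an unmasked WebSocket ping frame (RFC 6455, section 5.5.2).

    Server-to-client frames are never masked, and control-frame payloads
    must be at most 125 bytes, so a single length byte is enough.
    """
    if len(payload) > 125:
        raise ValueError("control frame payload must be at most 125 bytes")
    first_byte = 0x80 | PING_OPCODE  # FIN bit set, RSV bits clear, opcode 0x9
    second_byte = len(payload)       # MASK bit clear, 7-bit payload length
    return bytes([first_byte, second_byte]) + payload


class Pinger:
    """Sends a ping through `send` every `interval_s` seconds until stopped."""

    def __init__(self, send, interval_s=30.0):
        self._send = send
        self._interval_s = interval_s
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

    def _run(self):
        # Event.wait() returns False on timeout (time to ping) and True once stop() is called.
        while not self._stop.wait(self._interval_s):
            self._send(build_ping_frame(b"keepalive"))
```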
I work for Google and this is as far as I can help you - if this doesn't resolve your issue you have to contact GCP support.

How to signal "bad" but not "fatal" health check from spring boot to Kubernetes?

What we're looking for is a way for an actuator health check to signal some intention like "I am limping but not dead. If there are X number of other pods claiming to be healthy, then you should restart me, otherwise, let me limp."
We have a REST service hosted in clustered Kubernetes containers that periodically calls out to fetch fresh data from an external resource. Occasionally we fail to reach those external resources, and sometimes, but not every time, restarting the pod resolves the issue.
The services can operate just fine on possibly stale data. Although we wouldn't want to continue operating on stale data, that's preferable to just going down entirely.
In the interim, we're planning on having a node unilaterally decide not to report any problems through actuator until X amount of time has passed since the last successful sync, but that really only delays the point at which all nodes would still report failure.
In Kubernetes you can use a LivenessProbe and a ReadinessProbe to let a controller heal your service, but some situations are better handled with HTTP response codes or an alternative degraded service.
LivenessProbe
Use a LivenessProbe to resolve a deadlock situation. When your pod does not respond to a LivenessProbe, it will be killed and a new pod will replace it.
ReadinessProbe
Use a ReadinessProbe when your pod is not prepared to serve requests, e.g. if your pod needs to read some files or connect to an external service before serving requests.
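A minimal sketch of application state behind the two probes, assuming each probe calls an HTTP endpoint (paths not shown, hypothetical) that returns these status codes: liveness fails only when the worker loop stops sending heartbeats (a likely deadlock), and readiness fails until the first successful data load. `ProbeState` and the 30-second heartbeat timeout are illustrative assumptions.

```python
import threading
import time


class ProbeState:
    """Hypothetical health state shared by the worker loop and the probe endpoints."""

    def __init__(self, heartbeat_timeout_s=30.0):
        self.heartbeat_timeout_s = heartbeat_timeout_s
        self._lock = threading.Lock()
        self._last_heartbeat = time.monotonic()
        self._initial_sync_done = False

    def heartbeat(self):
        # Called by the worker loop on every iteration; a deadlocked loop stops calling it.
        with self._lock:
            self._last_heartbeat = time.monotonic()

    def mark_synced(self):
        # Called once the first fetch from the external resource has succeeded.
        with self._lock:
            self._initial_sync_done = True

    def liveness_status(self):
        # 200 while the worker loop makes progress, 500 once it looks stuck.
        with self._lock:
            stuck = time.monotonic() - self._last_heartbeat > self.heartbeat_timeout_s
        return 500 if stuck else 200

    def readiness_status(self):
        # 503 until the first successful data load, 200 afterwards.
        with self._lock:
            return 200 if self._initial_sync_done else 503
```

Readiness stays 200 after the first sync even if later fetches fail, which keeps the pod serving (possibly stale) data instead of dropping out of the Service.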
Fault affecting all replicas
If you have a problem that all your replicas depend on, e.g. an external service is down, then you cannot solve it by restarting your pods. In this situation you may use an ops toggle or a circuit breaker, and notify other services that you are degraded or show a message about a temporary error.
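A circuit breaker for the external fetch could look like this minimal sketch (the common pattern, not tied to any library): after a few consecutive failures it stops calling the dependency for a cool-down period and serves a fallback, such as the last known data, instead of failing health checks and triggering restarts that cannot fix a shared outage. The class name, thresholds, and injectable clock are assumptions.

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker around a call to a shared external dependency."""

    def __init__(self, failure_threshold=3, reset_timeout_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout_s = reset_timeout_s      # how long to stay open
        self.clock = clock                          # injectable for testing
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, fn, fallback):
        if self.opened_at is not None and self.clock() - self.opened_at < self.reset_timeout_s:
            return fallback()  # open: skip the dependency, serve degraded data
        # Closed, or open long enough to allow one trial call (half-open).
        try:
            result = fn()
        except Exception:  # sketch: treat any error as a dependency failure
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            return fallback()
        self.failures = 0
        self.opened_at = None  # success closes the circuit
        return result
```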
For your situations
If there are X number of other pods claiming to be healthy, then you should restart me, otherwise, let me limp.
You cannot delegate that logic to Kubernetes. Your application needs to understand each fault situation, e.g. whether an error was a transient network error or one that will affect all replicas.

Kubernetes HPA and Scaling Down

I have a kubernetes HPA set up in my cluster, and it works as expected scaling up and down instances of pods as the cpu/memory increases and decreases.
The only thing is that my pods handle web requests, so it occasionally scales down a pod that's in the process of handling a web request. The web server never gets a response back from the pod that was scaled down and thus the caller of the web api gets an error back.
This all makes sense theoretically. My question is does anyone know of a best practice way to handle this? Is there some way I can wait until all requests are processed before scaling down? Or some other way to ensure that requests complete before HPA scales down the pod?
I can think of a few solutions, none of which I like:
Add retry mechanism to the caller and just leave the cluster as is.
Don't use HPA for web request pods (seems like it defeats the purpose).
Try to create some sort of custom metric and see if I can get that metric into Kubernetes (e.g. https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/#support-for-custom-metrics)
Any suggestions would be appreciated. Thanks in advance!
Graceful shutdown of pods
You must design your apps to support graceful shutdown. First, your pod receives a SIGTERM signal; after a grace period of 30 seconds (configurable via terminationGracePeriodSeconds) it receives a SIGKILL signal and is removed. See Termination of pods.
SIGTERM: when your app receives the termination signal, your pod will not receive new requests, but you should try to finish responding to the requests it has already received.
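With Python's standard library, graceful shutdown can be sketched as: on SIGTERM, stop accepting new connections, then wait for in-flight request threads to finish. `DrainingServer`, `SlowHandler`, `drain`, and `install_sigterm_handler` are hypothetical names, and the drain has to finish within the pod's termination grace period or SIGKILL ends it anyway.

```python
import signal
import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class DrainingServer(ThreadingHTTPServer):
    daemon_threads = False  # non-daemon request threads are not cut off at interpreter exit
    block_on_close = True   # server_close() waits for in-flight request threads


class SlowHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        time.sleep(0.5)  # stand-in for real request work
        body = b"done\n"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence per-request logging in this sketch


def drain(server):
    server.shutdown()      # stop the accept loop: no new requests are taken
    server.server_close()  # close the listener, then join in-flight request threads


def install_sigterm_handler(server):
    def on_sigterm(signum, frame):
        # shutdown() blocks until serve_forever() returns; calling it directly in the
        # handler (same thread as serve_forever) would deadlock, so use a thread.
        threading.Thread(target=drain, args=(server,)).start()

    signal.signal(signal.SIGTERM, on_sigterm)
```

A process would call `install_sigterm_handler(server)` before `server.serve_forever()`. After SIGTERM, `serve_forever()` returns, and because the drain thread and the request threads are non-daemon, the interpreter waits for them before exiting.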
Design for idempotency
Your apps should also be designed for idempotency so you can safely retry failed requests.
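Idempotency can be sketched as deduplication by a client-supplied key: the first request with a given key runs, and a retry with the same key gets the stored result back instead of repeating the side effect. `IdempotentExecutor` is hypothetical and keeps results in memory, so in a real deployment retries that land on a different (or restarted) pod would need a shared store.

```python
import threading


class IdempotentExecutor:
    """Runs each operation at most once per idempotency key and replays the result."""

    def __init__(self):
        self._results = {}
        # One global lock keeps the sketch simple but serializes all operations.
        self._lock = threading.Lock()

    def execute(self, key, operation):
        with self._lock:
            if key in self._results:
                return self._results[key]  # retry: replay the stored result
            result = operation()
            self._results[key] = result
            return result
```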

Application container unable to access network before sidecar ready

I was trying the fortio server/client application on Istio. I used istioctl to inject the Istio sidecar, and my server pod came up fine. But the client pod was getting a connection refused error because the proxy sidecar was not yet ready to handle the client's connection requests. Please help me address this issue. For reference, I am attaching my YAML files.
This is by design and there is no way around it.
The part responsible for configuring the iptables rules that capture the traffic runs as an init container, which ensures that the required rules are in place before any of the normal pod containers start up. If you use Istio for all the traffic, then until its container is ready, no network traffic will get in or out of the container.
You should make sure your application handles this correctly. Apps should be able to withstand the unavailability of their dependencies for a while, both at startup and during operation. In the worst case, you can introduce your own handling, e.g. a custom entrypoint that waits for communication to be up.
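A custom entrypoint of that kind can be sketched as a poll loop that blocks until the sidecar reports ready. The URL is an assumption: recent Istio versions expose the proxy's readiness on port 15021 at `/healthz/ready`, but verify it against your injected pod spec. `wait_for_sidecar` is a hypothetical helper, and proxy environment variables are bypassed because the target is localhost.

```python
import http.client
import time
import urllib.request

# Assumption: the Istio sidecar reports readiness here; verify for your Istio version.
SIDECAR_READY_URL = "http://127.0.0.1:15021/healthz/ready"

# Ignore http_proxy/https_proxy environment variables: the sidecar is on localhost.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def wait_for_sidecar(url=SIDECAR_READY_URL, timeout_s=60.0, interval_s=1.0, request_timeout_s=2.0):
    """Poll `url` until it answers HTTP 200 or `timeout_s` elapses.

    Returns True once the proxy is ready, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with _opener.open(url, timeout=request_timeout_s) as resp:
                if resp.status == 200:
                    return True
        except (OSError, http.client.HTTPException):
            # Connection refused, timeout, or an HTTP error such as 503
            # (urllib's HTTPError is an OSError subclass): not ready yet.
            pass
        time.sleep(interval_s)
    return False
```

A hypothetical entrypoint would call `wait_for_sidecar()`, exit non-zero if it returns False, and otherwise start the real application (for example with `os.execvp`).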