AKS Containers refusing connection following upgrade - kubernetes

We upgraded our AKS cluster to 1.24.3, and since then containers have been intermittently refusing connections.
No changes were made to the deployed microservices as part of the AKS upgrade, and the issue occurs at random intervals.
From what I can see, the container returns the error "The client closed the connection".
What I cannot trace is the connections within AKS, and the issue affects all services.
Has anyone experienced anything similar and can offer any advice?

I hit a similar issue upgrading from 1.23.5 to 1.24.3. The cause was a configuration mismatch between the Kubernetes load balancer health probe path and the ingress-nginx probe endpoints.
Adding this annotation to my ingress-nginx helm install command corrected the problem: --set controller.service.annotations."service.beta.kubernetes.io/azure-load-balancer-health-probe-request-path"=/healthz
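
To check whether a probe-path mismatch like this applies to your cluster, you can compare what the ingress controller answers on the candidate paths. The following is only a minimal sketch: it assumes the controller is reachable over plain HTTP from where you run it, and the function names and the default path list are made up for illustration (only /healthz comes from the annotation above).

```python
import urllib.error
import urllib.request

# Bypass any proxy environment variables so the probe talks to the target directly.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def probe_status(base_url, path, timeout=3.0):
    """Return the HTTP status code for GET base_url + path, similar to an HTTP health probe.

    Connection-level failures (urllib.error.URLError) are not caught and propagate.
    """
    request = urllib.request.Request(base_url.rstrip("/") + path, method="GET")
    try:
        with _opener.open(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # Non-2xx answers (for example 404) are raised as HTTPError; keep the code.
        return err.code


def compare_probe_paths(base_url, paths=("/", "/healthz")):
    """Map each candidate probe path (hypothetical defaults) to its status code."""
    return {path: probe_status(base_url, path) for path in paths}
```

Calling compare_probe_paths("http://10.0.0.4") (a placeholder address) and seeing a non-200 status for the path the load balancer probes, but 200 for /healthz, would be consistent with the mismatch described above.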

Related

Kubernetes cluster in Azure (AKS) upgrade 1.24.9 in fail state with pods facing intermittent DNS issues

I upgraded AKS from 1.23.5 to 1.24.9 using the Azure portal. This part finished properly (or so I assumed), based on the status shown in the Azure portal.
I then continued from 1.24.9 to 1.25.5. This time it only partly worked: the Azure portal shows 1.25.5 for the node pool with provisioning state "Failed", while the nodes are still at 1.24.9.
I found that some nodes had trouble connecting to the network, both to outside hosts (e.g. GitHub) and to internal "services". For some reason the issue is intermittent: on the same node it sometimes works and sometimes does not. (I had pods running on each node with Python.)
Each node has the cluster IP in resolv.conf.
One of the questions on SO had a hint about ingress-nginx compatibility. I found that I had an incompatible version, so I upgraded it to 1.6.4, which is compatible with both 1.24 and 1.25.
But the network issue still persists. I am not sure whether this is because of the AKS provisioning state of "Failed". The connectivity check for this cluster in the Azure portal reports Success. The only issue reported in the Azure portal diagnostics is the node pool provisioning state.
Is there anything I need to do after the ingress-nginx upgrade for all nodes/pods to get the new config?
Or is there a way to re-trigger this upgrade? I am not sure why it would help, but I am assuming it may reset the configs on all nodes and get things working.
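
Since the failures are intermittent, it can help to put a number on them per node. Below is a rough sketch of a check that could be run from a Python pod on each node: it resolves a name repeatedly and counts the failures. The function name, its parameters, and the injectable resolver argument are assumptions for illustration, not something from the original setup.

```python
import socket
import time


def check_dns(hostname, attempts=20, delay=0.5, resolver=socket.getaddrinfo):
    """Resolve hostname `attempts` times and return (failures, attempts).

    `resolver` defaults to socket.getaddrinfo; it is a parameter only so the
    loop can be exercised without real DNS.
    """
    failures = 0
    for _ in range(attempts):
        try:
            resolver(hostname, None)
        except socket.gaierror:
            # Name resolution failed on this attempt.
            failures += 1
        time.sleep(delay)
    return failures, attempts
```

Running it once against an external name such as github.com and once against an in-cluster service name (for example kubernetes.default.svc.cluster.local, assuming the default cluster domain) on every node should show whether only some nodes are affected and whether internal and external lookups fail at similar rates.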

Issues with outbound connections from pods on GKE cluster with NAT (and router)

I'm trying to investigate random 'Connection reset by peer' errors and long (up to 2 minutes) PDO connection initializations, but I am failing to find a solution.
Similar issue: https://kubernetes.io/blog/2019/03/29/kube-proxy-subtleties-debugging-an-intermittent-connection-reset/, but that is supposed to be fixed in the version of Kubernetes that I'm running.
GKE config details:
GKE is running version 1.20.12-gke.1500, with a NAT network configuration and a router. The cluster has 2 nodes, and the router has 2 static IPs assigned, with dynamic port allocation and a range of 32728-65536 ports per VM.
On Kubernetes:
deployments: docker image with local nginx, php-fpm, and google sql proxy
services: LoadBalancer to expose the deployment
To reproduce the issue, I created a simple script that connects to the database in a loop and runs a simple count query. I ruled out the database server by running the script on a standalone GCE VM, where I did not get any errors. When I run the script on any of the application pods in the cluster, I get random 'Connection reset by peer' errors. I tested the script both through the Google SQL proxy service and with the direct database IP, with the same random connection issues.
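
For anyone who wants to reproduce this kind of test without a database client, here is a simplified sketch of such a loop in Python. It is not the original script (which used PDO and ran a count query): it only opens and closes TCP connections, so it exercises the connection setup but not a query. The function name and parameters are made up for illustration.

```python
import socket
import time


def connect_loop(host, port, attempts=100, timeout=5.0, delay=0.1):
    """Open and close a TCP connection `attempts` times.

    Returns (counts, slowest): a dict of outcome counts and the longest
    time in seconds that a single attempt took.
    """
    counts = {"ok": 0, "reset": 0, "timeout": 0, "other": 0}
    slowest = 0.0
    for _ in range(attempts):
        started = time.monotonic()
        try:
            with socket.create_connection((host, port), timeout=timeout):
                counts["ok"] += 1
        except ConnectionResetError:
            # 'Connection reset by peer' while setting up the connection.
            counts["reset"] += 1
        except socket.timeout:
            counts["timeout"] += 1
        except OSError:
            # Anything else, e.g. connection refused or network unreachable.
            counts["other"] += 1
        slowest = max(slowest, time.monotonic() - started)
        time.sleep(delay)
    return counts, slowest
```

Running the same loop from a pod and from a standalone VM against the same host and port gives a direct comparison of failure counts and of the slowest connection setup.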
Any help would be appreciated.
Update
On https://cloud.google.com/kubernetes-engine/docs/release-notes I can see that a fix has been released for what may be the issue I'm seeing: "The following GKE versions fix a known issue in which random TCP connection resets might happen for GKE nodes that use Container-Optimized OS with Docker (cos). To fix the issue, upgrade your nodes to any of these versions:"
I'm updating the nodes this evening, so I hope that will solve the issue.
Update
Updating the nodes solved the random connection resets.
Updating the cluster and nodes to version 1.20.15-gke.3400 using the Google Cloud panel resolved the issue.

rancher/k8s cluster not accessible when rancher server down

I set up two clusters with Rancher 2.5.x: one single-node management cluster for running the Rancher server and one "production" cluster which handles the application stacks.
This all worked fine. Now, while updating the Rancher server to 2.6, something apparently failed, and the Rancher server has been down ever since. The management cluster itself is still up; only the Rancher server is not. However, since access is passed through the Rancher server, I cannot connect to any of the clusters via kubectl or helm.
I do see that all required containers on the management cluster are still up and running:
Also, I can ssh to this server. So I do have access to all resources, but since I cannot connect to the cluster itself, I cannot fix this issue. I guess it would be quite easy to just fix the Rancher helm release to make it work again, but I have no idea how I could do that. I thought about running kubectl or helm locally on the node in the management cluster, but I don't know how to get the kubeconfig for that. The kubeconfig I used before connects to the Rancher server, which happens to be the problem now.
Is there any chance to connect to the cluster without using the rancher generated kubeconfig?

Kubernetes breaks after OOM

I ran into an issue with Kubernetes after an OOM on the master node. The Kubernetes services looked OK, and there were no error or warning messages in the log. But Kubernetes failed to process a new deployment, which was created after the OOM happened.
I restarted Kubernetes with systemctl restart kube-*, and that solved the issue: Kubernetes began working normally.
I just wonder: is this expected behavior or a bug in Kubernetes?
It would be great if you could share the kube-controller log. When the API server crashes or is OOMKilled, there can be synchronization problems in early versions of Kubernetes (I remember we saw similar problems with daemonsets, and I filed a bug with the Kubernetes community), but they are rare.
Meanwhile, we did a lot of work to make Kubernetes production ready: both tuning Kubernetes and crafting other micro-services that need to talk to Kubernetes. I hope these blog entries help:
https://applatix.com/making-kubernetes-production-ready-part-2/ This is about 30+ knobs we used to tune kubernetes
https://applatix.com/making-kubernetes-production-ready-part-3/ This is about micro service behavior to ensure cluster stability
It seems the problem wasn't caused by the OOM. It was caused by kube-controller, regardless of whether an OOM happened or not.
If I restart kube-controller, Kubernetes begins processing deployments and pods normally.

How to restart unresponsive kubernetes master in GKE

The kubernetes master in one of my GKE clusters became unresponsive last night following the infrastructure issue in us-central1-a.
Whenever I run "kubectl get pods" in the default namespace I get the following error message:
Error from server: an error on the server has prevented the request from succeeding
If I run "kubectl get pods --namespace=kube-system", I only see the kube-proxy and the fluentd-logging daemon.
I have tried scaling the cluster down to 0 and then scaling it back up. I have also tried downgrading and upgrading the cluster, but that seems to apply only to the nodes (not the master). Is there any GKE/K8S API command to restart the Kubernetes master?
There is not a command that will allow you to restart the Kubernetes master in GKE (since the master is considered a part of the managed service). There is automated infrastructure (and then an oncall engineer from Google) that is responsible for restarting the master if it is unhealthy.
In this particular case, restarting the master had no effect on restoring it to normal behavior because Google Compute Engine Incident #16011 caused an outage on 2016-06-28 for GKE masters running in us-central1-a (even though that isn't indicated on the Google Cloud Status Dashboard). During the incident, many masters were unavailable.
If you had tried to create a GCE cluster using kube-up.sh during that time, you would have similarly seen that it would be unable to create a functional master VM due to the SSD Persistent disk latency issues.
I try to always have at least one version available to upgrade to: if you upgrade the master, it will restart and be working within a few minutes. Otherwise you have to wait around 3 days for the Google team to reboot it. Contacting them by e-mail or phone won't help. And unless you have paid support (switching to which takes a few days), they won't lift a finger.