Kubernetes cluster in Azure (AKS) upgrade to 1.24.9 in failed state with pods facing intermittent DNS issues - kubernetes

I upgraded AKS from 1.23.5 to 1.24.9 using the Azure portal. This part finished properly (or so I assumed) based on the status shown in the Azure portal.
I continued from 1.24.9 to 1.25.5. This time it only partly worked: the Azure portal shows 1.25.5 for the node pool with provisioning state "Failed", while the nodes are still at 1.24.9.
I found that some nodes were having trouble connecting to the network, both to external hosts (e.g. github.com) and to internal "services". For some reason the issue is intermittent: on the same node it sometimes works and sometimes doesn't. (I had Python pods running on each node.)
Each node has the cluster DNS IP in resolv.conf.
One of the questions on SO had a hint about ingress-nginx compatibility. I found that I had an incompatible version, so I upgraded it to 1.6.4, which is compatible with both 1.24 and 1.25.
But the network issue still persists. I am not sure whether this is because of the AKS provisioning state of "Failed". The connectivity check for this cluster in the Azure portal succeeds; the only issue reported in the Azure portal diagnostics is the node pool provisioning state.
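To narrow down whether cluster DNS itself is unhealthy, checks along these lines can help (a sketch, not specific to this cluster; the `busybox` image and the default `k8s-app=kube-dns` label are assumptions):

```shell
# Are the CoreDNS pods running and ready?
kubectl -n kube-system get pods -l k8s-app=kube-dns

# From a throwaway pod, try resolving an internal and an external name
kubectl run -it --rm dnstest --image=busybox:1.36 --restart=Never -- \
  nslookup kubernetes.default.svc.cluster.local
kubectl run -it --rm dnstest --image=busybox:1.36 --restart=Never -- \
  nslookup github.com
```

If resolution fails only on some nodes, that points at a node-level networking problem rather than at CoreDNS itself.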
Is there anything I need to do after the ingress-nginx upgrade for all nodes/pods to pick up the new config?
Or is there a way to re-trigger this upgrade? I am not sure why it would help, but I am assuming it might reset the configuration on all nodes.
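In case re-triggering is the route taken: a failed AKS upgrade can generally be re-requested from the Azure CLI by asking for the same target version again (a sketch; the resource group and cluster names are placeholders):

```shell
# Re-run the upgrade to the version the portal already reports as the target
az aks upgrade \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --kubernetes-version 1.25.5
```

A single node pool can also be retried on its own with `az aks nodepool upgrade`.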

Related

AKS Containers refusing connection following upgrade

We have upgraded our AKS cluster to 1.24.3, and since then we have been having an issue with containers refusing connections.
There have been no changes to the deployed microservices as part of the AKS upgrade, and the issue occurs at random intervals.
From what I can see, the container is returning the error: "The client closed the connection."
What I cannot manage to trace are the connections within AKS; the issue affects all services.
Has anyone experienced anything similar and is able to provide any advice?
I hit a similar issue upgrading from 1.23.5 to 1.24.3; the cause was a configuration mismatch between the Kubernetes load balancer health probe path and the ingress-nginx probe endpoints.
Adding this annotation to my ingress-nginx Helm install command corrected my problem: --set controller.service.annotations."service.beta.kubernetes.io/azure-load-balancer-health-probe-request-path"=/healthz
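For anyone wanting that flag in context, the full Helm invocation might look roughly like this (a sketch; the release name, namespace, and repo alias are assumptions; note that the dots inside the annotation key must be backslash-escaped so Helm does not treat them as key separators):

```shell
helm upgrade --install ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-nginx --create-namespace \
  --set controller.service.annotations."service\.beta\.kubernetes\.io/azure-load-balancer-health-probe-request-path"=/healthz
```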

Kubernetes on GCP, Stackdriver logging after update to v1.15

I have a Kubernetes cluster on GCP that hosts a Flask application and some other services.
Before upgrading the master node to version 1.15 (it was on 1.14.x) I saw every log from the Flask application under Stackdriver's "GKE Container" logs; now I don't get any logs.
Searching through the release notes I noticed that from 1.15 they:
disabled stackdriver logging agent to prevent node startup failures
I'm not entirely sure that's the reason, but I am sure that the logging stopped after upgrading the master and node versions to 1.15; there has been no code change in the application core.
My question is how can I reactivate the logs I saw before?
I actually found the solution: as stated in the release notes, the Stackdriver agent becomes disabled by default in 1.15.
To activate it again you need to edit the cluster following these instructions, enabling "System and workload logging and monitoring" under "Stackdriver Kubernetes Engine Monitoring".
After that, I could no longer use the legacy Stackdriver Monitoring, and I found that my logs were no longer under the resource type "GKE Container" but under "Kubernetes Container".
I also had to update every log-based metric that had a filter on resource.type="container", changing it to resource.type="k8s_container".
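The same console change can also be made from the command line; switching a cluster to Stackdriver Kubernetes Engine Monitoring looked roughly like this at the time (a sketch; cluster name and zone are placeholders):

```shell
gcloud container clusters update my-cluster \
  --zone us-central1-a \
  --logging-service logging.googleapis.com/kubernetes \
  --monitoring-service monitoring.googleapis.com/kubernetes
```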

Service Fabric Cluster Upgrade Failing

I have an on-premise, secure, development cluster that I wish to upgrade. The current version is 5.7.198.9494. I've followed the steps listed here.
At the time of writing, the latest version of SF is 6.2.283.9494. However, running Get-ServiceFabricRuntimeUpgradeVersion -BaseVersion 5.7.198.9494 shows that I must first upgrade to 6.0.232.9494 before upgrading to 6.2.283.9494.
I run the following in PowerShell, and the upgrade does start:
Copy-ServiceFabricClusterPackage -Code -CodePackagePath .\MicrosoftAzureServiceFabric.6.0.232.9494.cab -ImageStoreConnectionString "fabric:ImageStore"
Register-ServiceFabricClusterPackage -Code -CodePackagePath MicrosoftAzureServiceFabric.6.0.232.9494.cab
Start-ServiceFabricClusterUpgrade -Code -CodePackageVersion 6.0.232.9494 -Monitored -FailureAction Rollback
However, after a few minutes the following happens:
The PowerShell ISE crashes
The Service Fabric cluster becomes unreachable
Service Fabric Local Cluster Manager disappears from the task bar
Event Viewer logs the events; see below.
Quite some time later, the VM reboots. After that, Service Fabric Local Cluster Manager only offers options to Set Up or Restart the local cluster.
Event Viewer has logs under Applications and Services Logs -> Microsoft-Service Fabric -> Operational. Most are informational messages about opening, closing, and aborting one of the upgrade domains. There are some warnings about a VM failing to open an upgrade domain with the error "Lease Failed".
This behavior happens consistently, and I've not yet been able to upgrade the cluster. My guess is that a development cluster cannot be upgraded, but I've not found an article that states that.
Am I doing something incorrectly here, or is it impossible to upgrade a development cluster?
I will assume you have a development cluster with a single node, or with multiple nodes on a single VM.
As described in the first section of the documentation at the same link you provided:
service-fabric-cluster-upgrade-windows-server
You can upgrade your cluster to the new version only if you're using a
production-style node configuration, where each Service Fabric node is
allocated on a separate physical or virtual machine. If you have a
development cluster, where more than one Service Fabric node is on a
single physical or virtual machine, you must re-create the cluster
with the new version.

How to restart unresponsive kubernetes master in GKE

The kubernetes master in one of my GKE clusters became unresponsive last night following the infrastructure issue in us-central1-a.
Whenever I run "kubectl get pods" in the default namespace I get the following error message:
Error from server: an error on the server has prevented the request from succeeding
If I run "kubectl get pods --namespace=kube-system", I only see the kube-proxy and the fluentd-logging daemon.
I have tried scaling the cluster down to 0 and then scaling it back up. I have also tried downgrading and upgrading the cluster, but that seems to apply only to the nodes (not the master). Is there any GKE/K8s API command to issue a restart of the Kubernetes master?
There is not a command that will allow you to restart the Kubernetes master in GKE (since the master is considered a part of the managed service). There is automated infrastructure (and then an oncall engineer from Google) that is responsible for restarting the master if it is unhealthy.
In this particular case, restarting the master had no effect on restoring it to normal behavior because Google Compute Engine Incident #16011 caused an outage on 2016-06-28 for GKE masters running in us-central1-a (even though that isn't indicated on the Google Cloud Status Dashboard). During the incident, many masters were unavailable.
If you had tried to create a GCE cluster using kube-up.sh during that time, you would have similarly seen that it would be unable to create a functional master VM due to the SSD Persistent disk latency issues.
I try to keep at least one newer version available to upgrade to: if you upgrade the master, it will restart and be working again within a few minutes. Otherwise you may have to wait around three days for the Google team to reboot it. They won't help you by e-mail or phone, and unless you have paid support (switching to which itself takes a few days), they won't lift a finger.
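Triggering a master upgrade (and hence a master restart) from the CLI looks roughly like this (a sketch; the cluster name, zone, and target version are placeholders):

```shell
gcloud container clusters upgrade my-cluster \
  --zone us-central1-a \
  --master \
  --cluster-version 1.x.y   # pick an available version, e.g. one listed by
                            # `gcloud container get-server-config`
```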

Can't access pods behind service

I'm attempting to access a WebSphere App Server I have running on Kubernetes, and I can't seem to connect properly. I have a pod running behind a service that should map its admin console port to an external NodePort. The problem is that the connection is reset every time I try. I had the same issue with an SSH port on another pod, but that was fixed when I corrected a Weave networking error. At this point, I can't tell whether it's a Kubernetes error, a Weave error, or something else altogether. Any help would be appreciated.
I'm running Kubernetes v1.1.0 from the release-1.1 branch, Mesos 0.24.0, and Weave 1.0.1. My next step is to try a different version of K8s.
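For anyone debugging a similar setup, a few checks can separate a Service misconfiguration from a CNI (Weave) problem (a sketch; the service name `websphere-admin` and WebSphere's admin console port 9043 are assumptions):

```shell
# Does the Service's selector actually match the pod? Empty endpoints
# mean the NodePort has nothing to forward to.
kubectl describe svc websphere-admin
kubectl get endpoints websphere-admin

# Try the pod's port directly, bypassing the Service/NodePort path
curl -k https://<pod-ip>:9043/ibm/console

# Then try the NodePort on a node's external IP
curl -k https://<node-ip>:<node-port>/ibm/console
```

If the direct pod connection works but the NodePort resets, the problem is in the Service/proxy layer; if both reset, it is more likely the overlay network.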
The Kubernetes release-1.1 branch was cut before the master branch started passing conformance tests on Mesos. A lot of work has been done since then to make Kubernetes on Mesos more usable. Not all of those changes/fixes have been backported/cherry-picked to fix the v1.1 release branch. It's probably a better idea to try using the code from master branch for the time being, at least while the kubernetes-mesos integration is in alpha. We hope future point releases will be more stable.
(I work on the Kubernetes-Mesos team at Mesosphere.)