Single-node AWS OKD cluster not responding after restart

My OKD 4.8 single-node deployment had been up for more than a month. Then today it started acting up (pods were not being created), so I thought I would reboot the node. I shut it down via the AWS console and then started it again.
However, after the restart it is not responding. The node is running, but OKD is not accessible: neither the console nor the API can be reached. Any oc command results in "The connection to the server api.api1.hostname.info:6443 was refused - did you specify the right host or port?"
The domain name and all zones are hosted by AWS.
Any troubleshooting ideas?
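A connection refused on 6443 usually means the API server never came back up after the reboot. As a first-pass triage sketch (assuming SSH access to the node as the usual core user; none of these commands are taken from the post):

ssh core@<node-public-ip>
sudo systemctl status kubelet                     # is the kubelet running at all?
sudo crictl ps -a | grep -i apiserver             # did the kube-apiserver container start?
sudo journalctl -u kubelet --since "-30min"       # kubelet errors since the reboot

It is also worth comparing dig +short api.api1.hostname.info against the node's current public IP, since a stopped-and-started EC2 instance gets a new public IP unless an Elastic IP is attached.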

Related

Issues with outbound connections from pods on GKE cluster with NAT (and router)

I'm trying to investigate an issue with random 'Connection reset by peer' errors and long (up to 2 minutes) PDO connection initializations, but I'm failing to find a solution.
Similar issue: https://kubernetes.io/blog/2019/03/29/kube-proxy-subtleties-debugging-an-intermittent-connection-reset/, but that was supposedly fixed in the version of Kubernetes that I'm running.
GKE config details:
GKE is running version 1.20.12-gke.1500, with a NAT network configuration and a router. The cluster has 2 nodes, and the router has 2 static IPs assigned, with dynamic port allocation and a range of 32728-65536 ports per VM.
On the Kubernetes side:
deployments: a Docker image with local nginx, php-fpm, and the Google SQL proxy
services: a LoadBalancer to expose the deployment
To reproduce the issue, I created a simple script that connects to the database in a loop and runs a simple count query. I ruled out the database server itself by running the script on a standalone GCE VM, where I got no errors. When I run the script on any of the application pods in the cluster, I get random 'Connection reset by peer' errors. I have tested the script both through the Google SQL proxy service and against the database IP directly, with the same random connection issues.
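For reference, a minimal sketch of such a repro loop in bash (the original script was PHP/PDO and isn't shown; this assumes a MySQL-compatible database and the mysql client available in the pod, and the host, credentials, and table name are placeholders, not values from the post):

#!/usr/bin/env bash
# Open a fresh connection on every iteration to surface random TCP resets.
for i in $(seq 1 1000); do
  mysql -h "$DB_HOST" -u "$DB_USER" -p"$DB_PASS" "$DB_NAME" \
    -e 'SELECT COUNT(*) FROM some_table;' >/dev/null \
    || echo "attempt $i failed with exit code $?"
done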
Any help would be appreciated.
Update
On https://cloud.google.com/kubernetes-engine/docs/release-notes I can see that a fix has been released for what may be exactly what I'm hitting: "The following GKE versions fix a known issue in which random TCP connection resets might happen for GKE nodes that use Container-Optimized OS with Docker (cos). To fix the issue, upgrade your nodes to any of these versions:"
I'm updating the nodes this evening, so I hope that will solve the issue.
Update
The node update solved the random connection resets.
Upgrading the cluster and nodes to version 1.20.15-gke.3400 via the Google Cloud console resolved the issue.
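For anyone repeating this from the CLI instead of the console, the equivalent is roughly the following (cluster, node pool, and zone names are placeholders; the version is the one that fixed it above):

gcloud container clusters upgrade my-cluster --master --cluster-version 1.20.15-gke.3400 --zone europe-west1-b
gcloud container clusters upgrade my-cluster --node-pool default-pool --cluster-version 1.20.15-gke.3400 --zone europe-west1-b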

Telepresence with SSH Tunnel for Kubernetes

I have a remote, privately managed Kubernetes cluster that I reach by going via an intermediary VM. To use kubectl from my machine I have set up an SSH tunnel that hops onto my VM and then onto my master node - this works fine.
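For context, the tunnel itself is a plain SSH local forward; a sketch of the shape (user names, host names, and the API port are placeholders, not taken from the question):

ssh -J user@intermediary-vm -L 6443:localhost:6443 user@master-node
# kubeconfig then points the cluster's server at https://127.0.0.1:6443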
I am trying to configure Telepresence (https://www.telepresence.io/) which attempts to start up (correctly detecting that kubectl works) but then fails due to a timeout.
subprocess.TimeoutExpired: Command '['ssh', '-F', '/dev/null', '-oStrictHostKeyChecking=no', '-oUserKnownHostsFile=/dev/null', '-q', '-p', '65367', 'telepresence@127.0.0.1', '/bin/true']' timed out after 5 seconds
Is this a setup that telepresence should support or is the presence of an intermediary VM going to be a roadblock for me?
Telepresence 2 should support this better as it installs a sidecar container that makes it more resilient to interrupted connections. I would give the new version a try to see if you're still seeing timeout errors.
https://www.getambassador.io/docs/latest/telepresence/quick-start/
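With Telepresence 2 the basic flow is the standard quick-start sequence, roughly as follows (assuming kubectl already reaches the cluster through your tunnel):

telepresence connect     # joins your machine to the cluster network via the traffic manager
telepresence list        # shows workloads available for intercepts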

Service Fabric Cluster Upgrade Failing

I have an on-premises, secure development cluster that I wish to upgrade. The current version is 5.7.198.9494. I've followed the steps listed here.
At the time of writing, the latest version of SF is 6.2.283.9494. However, running Get-ServiceFabricRuntimeUpgradeVersion -BaseVersion 5.7.198.9494 shows that I must first upgrade to 6.0.232.9494 before upgrading to 6.2.283.9494.
I run the following in PowerShell, and the upgrade does start:
Copy-ServiceFabricClusterPackage -Code -CodePackagePath .\MicrosoftAzureServiceFabric.6.0.232.9494.cab -ImageStoreConnectionString "fabric:ImageStore"
Register-ServiceFabricClusterPackage -Code -CodePackagePath MicrosoftAzureServiceFabric.6.0.232.9494.cab
Start-ServiceFabricClusterUpgrade -Code -CodePackageVersion 6.0.232.9494 -Monitored -FailureAction Rollback
However, after a few minutes the following happens:
The PowerShell ISE crashes
The Service Fabric Cluster becomes unreachable
Service Fabric Local Cluster Manager disappears from the task bar
Event Viewer will log the events; see below.
Quite some time later, the VM reboots. Service Fabric Local Cluster Manager then only gives options to Setup or Restart the local cluster.
Event Viewer has logs under Applications and Services Logs -> Microsoft-Service Fabric -> Operational. Most are informational entries about opening, closing, and aborting one of the upgrade domains. There are some warnings about a VM failing to open an upgrade domain, stating error: Lease Failed.
This behavior happens consistently, and I've not yet been able to update the cluster. My guess is that we are not able to upgrade a development cluster, but I've not found an article that states that.
Am I doing something incorrectly here, or is it impossible to upgrade a development cluster?
I will assume you have a development cluster with a single node or multiple nodes in a single VM.
As described in the first section of the documentation at the same link you provided:
service-fabric-cluster-upgrade-windows-server
You can upgrade your cluster to the new version only if you're using a production-style node configuration, where each Service Fabric node is allocated on a separate physical or virtual machine. If you have a development cluster, where more than one Service Fabric node is on a single physical or virtual machine, you must re-create the cluster with the new version.

How to restart unresponsive kubernetes master in GKE

The kubernetes master in one of my GKE clusters became unresponsive last night following the infrastructure issue in us-central1-a.
Whenever I run "kubectl get pods" in the default namespace I get the following error message:
Error from server: an error on the server has prevented the request from succeeding
If I run "kubectl get pods --namespace=kube-system", I only see the kube-proxy and the fluentd-logging daemon.
I have tried scaling the cluster down to 0 and then scaling it back up. I have also tried downgrading and upgrading the cluster, but that seems to apply only to the nodes (not the master). Is there any GKE/K8s API command to issue a restart to the Kubernetes master?
There is not a command that will allow you to restart the Kubernetes master in GKE (since the master is considered part of the managed service). There is automated infrastructure (and then an on-call engineer from Google) responsible for restarting the master if it is unhealthy.
In this particular case, restarting the master had no effect on restoring it to normal behavior, because Google Compute Engine Incident #16011 caused an outage on 2016-06-28 for GKE masters running in us-central1-a (even though that isn't indicated on the Google Cloud Status Dashboard). During the incident, many masters were unavailable.
If you had tried to create a GCE cluster using kube-up.sh during that time, you would have similarly seen that it would be unable to create a functional master VM due to the SSD Persistent disk latency issues.
One workaround: try to keep at least one newer version available to upgrade to. If you upgrade the master, it will restart and be working again within a few minutes. Otherwise you may be waiting around 3 days for the Google team to reboot it. Reaching them by e-mail/phone won't help, and unless you have paid support (the transition to which takes a few days), they won't lift a finger.
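A sketch of the upgrade-to-force-a-restart workaround from the CLI (the cluster name and zone are placeholders):

gcloud container clusters upgrade my-cluster --master --zone us-central1-a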

How to re-connect to Amazon kubernetes cluster after stopping & starting instances?

I created a cluster for trying out Kubernetes using cluster/kube-up.sh on Amazon EC2. I then stop it to save money when not using it. The next time I start the master & minion instances in Amazon, ~/.kube/config has the old IPs for the cluster master, since EC2 assigns new public IPs to the instances.
So far I haven't found a way to provide Elastic IPs to cluster/kube-up.sh so that consistent IPs would be kept in place between stopping & starting the instances. The certificate in ~/.kube/config is also for the old IP, so manually changing the IP doesn't work either:
Running: ./cluster/../cluster/aws/../../cluster/../_output/dockerized/bin/darwin/amd64/kubectl get pods --context=aws_kubernetes
Error: Get https://52.24.72.124/api/v1beta1/pods?namespace=default: x509: certificate is valid for 54.149.120.248, not 52.24.72.124
How can I make kubectl run queries against the same Kubernetes master when it is on a different IP after its restart?
If the only thing that has changed about your cluster is the IP address of the master, you can manually modify the master location by editing the file ~/.kube/config (look for the line that says "server" with an IP address).
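A sketch of that edit from the command line (assuming the cluster entry in ~/.kube/config shares the aws_kubernetes name used by the context above, and reusing the new IP from the error message):

kubectl config set-cluster aws_kubernetes --server=https://52.24.72.124
# Note: as the error above shows, the master's certificate may still be rejected for the new IP.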
This use case (pausing/resuming a cluster) isn't something that we commonly test for so you may encounter other issues once your cluster is back up and running. If you do, please file an issue on the GitHub repository.
I'm not sure which version of Kubernetes you were using, but in v1.0.6 you can pass the MASTER_RESERVED_IP environment variable to kube-up.sh to assign a given Elastic IP to the Kubernetes master node.
You can check all the available options for kube-up.sh in the config-default.sh file for AWS in the Kubernetes repository.
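For illustration, an invocation might look like this (the Elastic IP is a placeholder that must already be allocated in your EC2 account):

export KUBERNETES_PROVIDER=aws
MASTER_RESERVED_IP=203.0.113.10 ./cluster/kube-up.sh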