How to drain a Backend Service on GCP - command-line

I am using GCP (Loadbalancer/Backend Service/Instance Group).
What I am trying to do is set up a shell-script process to push changes to a canary server before we actually roll out to production. The canary server should take a percentage (e.g. 1%) of production requests, so we can spot errors before rolling out fully.
The plan is:
drain the canary server
update the canary server
restore the connection for the canary server
I am able to use the following command to control the capacity of the backend in GCP
$ gcloud compute backend-services update-backend api-canary --instance-group api-canary --global --capacity-scaler 0.01 --instance-group-zone asia-east1-a
However, I am not able to actually drain the backend by setting the capacity to 0.0:
$ gcloud compute backend-services update-backend api-canary --instance-group api-canary --global --capacity-scaler 0.0 --instance-group-zone asia-east1-a
ERROR: (gcloud.compute.backend-services.update-backend) Could not fetch resource:
- Invalid value for field 'resource': ''. None of the backends have a valid capacity

You're getting the error because at least one of the backends in the backend service must have non-zero capacity.
Keep (or add) at least one backend with non-zero capacity and draining the canary backend should work.
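A minimal sketch of the drain/update/restore cycle from the question, assuming the same backend service also contains a production backend (for example a hypothetical instance group api-prod) whose capacity stays above zero:

# Drain the canary backend before updating it; this only succeeds while
# another backend in the same backend service has non-zero capacity.
$ gcloud compute backend-services update-backend api-canary \
    --instance-group api-canary --instance-group-zone asia-east1-a \
    --global --capacity-scaler 0.0

# ... update the canary server here ...

# Send ~1% of traffic back to the canary:
$ gcloud compute backend-services update-backend api-canary \
    --instance-group api-canary --instance-group-zone asia-east1-a \
    --global --capacity-scaler 0.01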

Set --capacity-scaler to 0 if you wish to drain a backend. For example (as in the Google Cloud documentation):
$ gcloud compute backend-services add-backend video-backend-service \
    --balancing-mode=UTILIZATION \
    --max-utilization=0.8 \
    --capacity-scaler=0 \
    --instance-group=ig-video-us \
    --instance-group-zone=us-central1-b \
    --global

Related

EKS: kubectl exec does not respect streamingConnectionIdleTimeout

Using EKS with Kubernetes 1.21, managed nodegroups in a private subnet. I'm trying to set the cluster up so that kubectl exec times out after inactivity regardless of the workload being execed into, and without any client configuration.
I'm aware of https://github.com/containerd/containerd/issues/5563, except we're on 1.21 with Docker runtime, not containerd yet.
I set streamingConnectionIdleTimeout: 3600s on the kubelet in the launch template:
jq '.streamingConnectionIdleTimeout = "3600s"' /etc/kubernetes/kubelet/kubelet-config.json > /tmp/kubelet-config.json \
  && mv /tmp/kubelet-config.json /etc/kubernetes/kubelet/kubelet-config.json
/etc/eks/bootstrap.sh {{CLUSTER_NAME}}
And confirmed with curl -sSL "http://localhost:8001/api/v1/nodes/(node name)/proxy/configz".
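Concretely, that check looks roughly like this (assuming kubectl proxy is running locally on the default port 8001 and jq is installed; the node name is a placeholder):

$ kubectl proxy --port=8001 &
$ curl -sSL "http://localhost:8001/api/v1/nodes/<node-name>/proxy/configz" \
    | jq '.kubeletconfig.streamingConnectionIdleTimeout'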
However, kubectl exec still does not time out.
I confirmed /proc/sys/net/ipv4/tcp_keepalive_time = 7200 on both the client and the node, so we should be hitting the streaming connection idle timeout before Linux starts sending keepalive probes.
Reading through How kubectl exec Works, it seems possible that the EKS managed control plane is keeping the connection alive. There are people online who have the opposite problem - their connection times out regardless of streamingConnectionIdleTimeout - and they solve it by adjusting the timeout on the load balancer in front of their k8s API server. However, there are no knobs (that I know of) to tweak in that regard on the EKS managed control plane.
I would appreciate any input on this topic.

Get a permanent certificate to access the cluster with kubectl

I have set up a cluster on AWS using kops. I want to connect to the cluster from my local machine.
I have to cat ~/.kube/config, copy the content, and replace my local kube config with it to access the cluster.
The problem is that it expires after a certain amount of time. Is there a way to get permanent access to the cluster?
Not sure if you can get permanent access to the cluster, but based on the official kOps documentation you can just run the kops update cluster command with the --admin={duration} flag and set the expiry to a very large value.
For example, let's set it to almost 10 years:
kops update cluster {your-cluster-name} --admin=87599h --yes
Then just copy your config file to the client as usual.
Based on the official release notes, to go back to the previous behaviour just use the value 87600h.
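A minimal sketch of that workflow with a hypothetical cluster name, assuming kOps >= 1.19 (where --admin accepts a duration):

$ kops update cluster my-cluster.example.com --admin=87599h --yes
$ kops export kubecfg my-cluster.example.com --admin=87599h
# Copy the resulting ~/.kube/config to the local machine, then:
$ kubectl get nodes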

Terraform dial tcp 192.xx.xx.xx:443: i/o timeout error

I am trying to implement CI/CD using GitLab + Terraform against a K8s cluster; the K8s control plane (master node) was set up on CentOS.
However, the pipeline job fails with the following error:
Error: Failed to get existing workspaces: Get "https://192.xx.xx.xx/api/v1/namespaces/default/secrets?labelSelector=tfstate%3Dtrue": dial tcp 192.xx.xx.xx:443: i/o timeout
From the error mentioned above (default/secrets?labelSelector=tfstate%3Dtrue), I assume the error is related to a missing Terraform state secret in the default namespace.
Example (Terraform state secret listed from my Windows machine):
PS C:\> kubectl get secret
NAME                    TYPE                                  DATA   AGE
default-token-7mzv6     kubernetes.io/service-account-token   3      27d
tfstate-default-state   Opaque                                1      15h
However, I am not sure which process would create the tfstate secret, or whether we should create it manually.
Kindly let me know if my understanding is wrong and whether I have missed anything else.
EDIT
The issue mentioned above occurred because the existing GitLab runner was on a different subnet (e.g. 172.xx.xx.xx instead of 192.xx.xx.xx).
I was asked to use a different GitLab runner which runs on the same subnet, and now it throws the following error:
Error: Failed to get existing workspaces: Get "https://192.xx.xx.xx:6443/api/v1/namespaces/default/secrets?labelSelector=tfstate%3Dtrue": x509: certificate signed by unknown authority
Now, I am a bit confused whether the certificate issue is between the GitLab runner and the GitLab server, between the GitLab server and the K8s cluster, or something else.
You have configured Kubernetes as the remote state backend for your Terraform configuration. The error shows that the backend is trying to query existing secrets to determine which workspaces are configured. The x509: certificate signed by unknown authority error indicates that the KUBECONFIG the remote state backend uses does not match the CA of the API server you're connecting to.
If the runners are K8s pods themselves, make sure you provide a KUBECONFIG that matches your target cluster, and that the remote state does not configure itself as in-cluster by reading the service account token that every K8s pod has - which in most cases will only work for the cluster the pod is running on.
You don't provide enough information to be more specific. But big picture, you have to configure the state backend, and any provider that connects to K8s. Theoretically, the state backend secrets and the K8s resources do not have to be on the same cluster. Meaning, you may need different configurations for the state backend and the K8s providers.
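As a minimal sketch of the first point, assuming the runner has been given a kubeconfig for the target cluster at a hypothetical path, you can point both the kubernetes state backend and the kubernetes provider at it via the environment before initializing:

# Hypothetical path; the kubeconfig must contain the correct endpoint
# (e.g. https://192.xx.xx.xx:6443) and CA for the target cluster.
$ export KUBE_CONFIG_PATH=/etc/gitlab-runner/target-cluster.kubeconfig
$ export KUBECONFIG=/etc/gitlab-runner/target-cluster.kubeconfig
$ terraform init -reconfigure
$ terraform plan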

GKE - Upgrading cluster master after cluster creation completes

Once we increase the load using a JMeter client, my deployed service is interrupted, and the GCP/GKE console says:
Upgrading cluster master
The values shown below are going to change soon.
And my kubectl client throws this error during the upgrade:
Unable to connect to the server: dial tcp 35.236.238.66:443: connectex: No connection could be made because the target machine actively refused it.
How can I stop this upgrade or prevent my service interruption? If the service is interrupted, there is no benefit to this autoscaling. I am new to GKE, so please let me know if I am missing any configuration or parameter here.
I am using this command to create my cluster:
gcloud container clusters create ajeet-gke --zone us-east4-b --node-locations us-east4-b --machine-type n1-standard-8 --num-nodes 1 --enable-autoscaling --min-nodes 4 --max-nodes 16
It is not upgrading the k8s version, because it works fine with a smaller load; but as I increase the load, the cluster starts an upgrade of the master. So it looks like the master is resizing itself for more nodes. After the upgrade I can see more nodes on the GCP console. https://github.com/terraform-providers/terraform-provider-google/issues/3385
The command below says autoscaling is not enabled on the instance group:
> gcloud compute instance-groups managed list
NAME                                   AUTOSCALED   LOCATION     SCOPE   ---
ajeet-gke-cluster-default-pool-4***0   no           us-east4-b   zone    ---
Workaround
Sorry, I forgot to update this here. I found a workaround to fix it: after splitting the cluster creation command into two steps, the cluster autoscales without restarting the master node:
gcloud container clusters create ajeet-ggs --zone us-east4-b --node-locations us-east4-b --machine-type n1-standard-8 --num-nodes 1
gcloud container clusters update ajeet-ggs --enable-autoscaling --min-nodes 1 --max-nodes 10 --zone us-east4-b --node-pool default-pool
To prevent this, you should always create your cluster with the cluster version pinned to the latest available version.
See the documentation: https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-architecture#master
This means that Google manages the master: if your master is not up to date, it will be upgraded to the latest version, which allows Google to limit the number of versions it currently has to manage. https://cloud.google.com/kubernetes-engine/docs/concepts/regional-clusters
Now, why do you have a service interruption during the update? Because you are in zonal mode with only one master. To prevent this, you should use a regional cluster with more than one master, allowing for a clean rolling update.
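As a sketch of both suggestions (the version placeholder is hypothetical; pick one listed by gcloud container get-server-config):

# Regional cluster (masters in multiple zones of the region) with a pinned version:
$ gcloud container clusters create ajeet-ggs \
    --region us-east4 \
    --num-nodes 1 \
    --cluster-version <supported-version>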
The master won't resize the nodes unless the autoscaling feature is enabled.
As mentioned in the answer above, this is a feature at the node-pool level. Looking at the description of the issue, it does seem like autoscaling is enabled on your node pool, and GKE's cluster autoscaler automatically resizes the cluster based on the demands of the workloads you want to run (i.e. when there are pods that cannot be scheduled due to resource shortages such as CPU).
Additionally, Kubernetes cluster autoscaling does not use the Managed Instance Group autoscaler. It runs a cluster-autoscaler controller on the Kubernetes master that uses Kubernetes-specific signals to scale your nodes.
It is therefore highly recommended not to use Compute Engine's autoscaling feature (or rely on the autoscaling status shown by the MIG) on instance groups created by Kubernetes Engine.
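To see where autoscaling is actually configured, inspect the GKE node pool rather than the underlying MIG; a sketch using the names from the question:

# The autoscaling block in the output shows minNodeCount/maxNodeCount when enabled.
$ gcloud container node-pools describe default-pool \
    --cluster ajeet-gke --zone us-east4-b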

TLS handshake timeout with kubernetes in GKE

I've created a cluster on Google Kubernetes Engine (previously Google Container Engine) and installed the Google Cloud SDK and the Kubernetes tools with it on my Windows machine.
It worked well for some time, and, out of nowhere, it stopped working. Every command I'm issuing with kubectl provokes the following:
Unable to connect to the server: net/http: TLS handshake timeout
I've searched Google, the Kubernetes Github Issues, Stack Overflow, Server Fault ... without success.
I've tried the following:
Restart my computer
Change wifi connection
Check that I'm not somehow using a proxy
Delete and re-create my cluster
Uninstall the Google Cloud SDK (and kubectl) from my machine and re-install them
Delete my .kube folder (config and cache)
Check my .kube/config
Change my cluster's version (tried 1.8.3-gke.0 and 1.7.8-gke.0)
Retry several hours later
Tried both on PowerShell and cmd.exe
Note that the cluster seems to work perfectly, since I have my application running on it and can interact with it normally through the Google Cloud Shell.
Running:
gcloud container clusters get-credentials cluster-2 --zone europe-west1-b --project ___
kubectl get pods
works in Google Cloud Shell but provokes the TLS handshake timeout on my machine.
For others seeing this issue, there is another cause to consider.
After doing:
gcloud config set project $PROJECT_NAME
gcloud config set container/cluster $CLUSTER_NAME
gcloud config set compute/zone europe-west2
gcloud beta container clusters get-credentials $CLUSTER_NAME --region europe-west2 --project $PROJECT_NAME
I was then seeing:
kubectl cluster-info
Unable to connect to the server: net/http: TLS handshake timeout
I tried everything suggested here and elsewhere. When the above worked without issue from my home desktop, I realized that the shared workspace wifi was disrupting TLS/VPN traffic in order to control internet access!
This is what I did to solve the above problem.
I simply ran the following commands:
> gcloud container clusters get-credentials {cluster_name} --zone {zone_name} --project {project_name}
> gcloud auth application-default login
Replace the placeholders appropriately.
So this MAY NOT work for you on GKE, but Azure AKS (managed Kubernetes) has a similar problem with the same error message so who knows — this might be helpful to someone.
The solution to this for me was to scale the nodes in my Cluster from the Azure Kubernetes service blade web console.
Workaround / Solution
Log into the Azure (or GKE) Console — Kubernetes Service UI.
Scale your cluster up by 1 node.
Wait for scale to complete and attempt to connect (you should be able to).
Scale your cluster back down to the normal size to avoid cost increases.
Total time it took me ~2 mins.
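If you prefer the CLI to the portal, the same scale-up/scale-down can be done with the Azure CLI (the resource group and cluster names below are placeholders):

$ az aks scale --resource-group my-rg --name my-aks-cluster --node-count 4
# ...wait for the scale operation to finish and reconnect, then scale back down:
$ az aks scale --resource-group my-rg --name my-aks-cluster --node-count 3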
More Background Info on the Issue
Added this to the full ticket description write up that I posted over here (if you want more info have a read):
'Unable to connect Net/http: TLS handshake timeout' — Why can't Kubectl connect to Azure AKS server?