Scaling a GKE K8s cluster breaks networking

Folks,
When trying to increase a GKE cluster from 1 to 3 nodes, with the nodes running in separate zones (us-central1-a, b, c), the following seems apparent:
Pods scheduled on the new nodes cannot access resources on the internet, i.e. they are not able to connect to the Stripe APIs, etc. (potentially kube-dns related; I have not tested traffic attempting to leave without a DNS lookup).
Similarly, I am not able to route between pods in K8s as expected, i.e. it seems cross-zone calls could be failing. When testing with OpenVPN, I am unable to connect to pods scheduled on the new nodes.
A separate issue I noticed is that the metrics server seems unreliable: kubectl top nodes shows unknown for the new nodes.
At the time of writing, the master Kubernetes version is 1.15.11-gke.9.
The settings I am paying attention to:
VPC-native (alias IP) - disabled
Intranode visibility - disabled
gcloud container clusters describe cluster-1 --zone us-central1-a
clusterIpv4Cidr: 10.8.0.0/14
createTime: '2017-10-14T23:44:43+00:00'
currentMasterVersion: 1.15.11-gke.9
currentNodeCount: 1
currentNodeVersion: 1.15.11-gke.9
endpoint: 35.192.211.67
initialClusterVersion: 1.7.8
instanceGroupUrls:
- https://www.googleapis.com/compute/v1/projects/skilful-frame-180217/zones/us-central1-a/instanceGroupManagers/gke-cluster-1-default-pool-ff24932a-grp
ipAllocationPolicy: {}
labelFingerprint: a9dc16a7
legacyAbac:
  enabled: true
location: us-central1-a
locations:
- us-central1-a
loggingService: none
....
masterAuthorizedNetworksConfig: {}
monitoringService: none
name: cluster-1
network: default
networkConfig:
  network: .../global/networks/default
  subnetwork: .../regions/us-central1/subnetworks/default
networkPolicy:
  provider: CALICO
nodeConfig:
  diskSizeGb: 100
  diskType: pd-standard
  imageType: COS
  machineType: n1-standard-2
  ...
nodeIpv4CidrSize: 24
nodePools:
- autoscaling: {}
  config:
    diskSizeGb: 100
    diskType: pd-standard
    imageType: COS
    machineType: n1-standard-2
    ...
  initialNodeCount: 1
  locations:
  - us-central1-a
  management:
    autoRepair: true
    autoUpgrade: true
  name: default-pool
  podIpv4CidrSize: 24
  status: RUNNING
  version: 1.15.11-gke.9
servicesIpv4Cidr: 10.11.240.0/20
status: RUNNING
subnetwork: default
zone: us-central1-a
My next troubleshooting step is creating a new pool and migrating to it (a rough sketch of what I have in mind is below). Maybe the answer is staring me right in the face... could it be the nodeIpv4CidrSize of /24?
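For reference, a minimal sketch of that migration, assuming the new pool keeps the same machine type (pool names and flags are placeholders, untested):
gcloud container node-pools create new-pool --cluster cluster-1 --zone us-central1-a \
    --machine-type n1-standard-2 --num-nodes 3
# Cordon and drain the old nodes so workloads move onto the new pool
for node in $(kubectl get nodes -l cloud.google.com/gke-nodepool=default-pool -o name); do
  kubectl cordon "$node"
  kubectl drain "$node" --ignore-daemonsets --delete-local-data
done
# Once everything is running on the new pool, remove the old one
gcloud container node-pools delete default-pool --cluster cluster-1 --zone us-central1-a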
Thanks!

In your question, the description of your cluster has the following network policy:
name: cluster-1
network: default
networkConfig:
  network: .../global/networks/default
  subnetwork: .../regions/us-central1/subnetworks/default
networkPolicy:
  provider: CALICO
I deployed a cluster as similar as I could:
gcloud beta container --project "PROJECT_NAME" clusters create "cluster-1" \
--zone "us-central1-a" \
--no-enable-basic-auth \
--cluster-version "1.15.11-gke.9" \
--machine-type "n1-standard-1" \
--image-type "COS" \
--disk-type "pd-standard" \
--disk-size "100" \
--metadata disable-legacy-endpoints=true \
--scopes "https://www.googleapis.com/auth/devstorage.read_only","https://www.googleapis.com/auth/logging.write","https://www.googleapis.com/auth/monitoring","https://www.googleapis.com/auth/servicecontrol","https://www.googleapis.com/auth/service.management.readonly","https://www.googleapis.com/auth/trace.append" \
--num-nodes "1" \
--no-enable-ip-alias \
--network "projects/owilliam/global/networks/default" \
--subnetwork "projects/owilliam/regions/us-central1/subnetworks/default" \
--enable-network-policy \
--no-enable-master-authorized-networks \
--addons HorizontalPodAutoscaling,HttpLoadBalancing \
--enable-autoupgrade \
--enable-autorepair
After that I got the same configuration as yours; I'll point out two parts:
addonsConfig:
  networkPolicyConfig: {}
...
name: cluster-1
network: default
networkConfig:
  network: projects/owilliam/global/networks/default
  subnetwork: projects/owilliam/regions/us-central1/subnetworks/default
networkPolicy:
  enabled: true
  provider: CALICO
...
In the comments you mention "in the UI, it says network policy is disabled...is there a command to drop calico?". I then gave you the command, for which you got an error stating that the Network Policy add-on is not enabled.
Which is odd, because it is applied but not enabled. I DISABLED it on my cluster and got this:
addonsConfig:
  networkPolicyConfig:
    disabled: true
...
name: cluster-1
network: default
networkConfig:
  network: projects/owilliam/global/networks/default
  subnetwork: projects/owilliam/regions/us-central1/subnetworks/default
nodeConfig:
...
networkPolicyConfig went from {} to disabled: true, and the networkPolicy section above nodeConfig is now gone. So I suggest you enable and disable it again to see if it updates the proper resources and fixes your network policy issue. Here is what we will do:
If your cluster is not in production, I'd suggest resizing it back to 1 node, making the changes, and then scaling up again; the update will be quicker. If it is in production, leave it as it is, but the update might take longer depending on your Pod disruption budgets. (default-pool is the name of my cluster's pool.) I'll resize it in my example:
$ gcloud container clusters resize cluster-1 --node-pool default-pool --num-nodes 1
Do you want to continue (Y/n)? y
Resizing cluster-1...done.
Then enable the network policy add-on itself (this does not activate it, it only makes it available):
$ gcloud container clusters update cluster-1 --update-addons=NetworkPolicy=ENABLED
Updating cluster-1...done.
Then we enable (activate) the network policy:
$ gcloud container clusters update cluster-1 --enable-network-policy
Do you want to continue (Y/n)? y
Updating cluster-1...done.
Now let's undo it:
$ gcloud container clusters update cluster-1 --no-enable-network-policy
Do you want to continue (Y/n)? y
Updating cluster-1...done.
After disabling it, wait until the pool is ready and run the last command:
$ gcloud container clusters update cluster-1 --update-addons=NetworkPolicy=DISABLED
Updating cluster-1...done.
Scale it back to 3 if you had downscaled:
$ gcloud container clusters resize cluster-1 --node-pool default-pool --num-nodes 3
Do you want to continue (Y/n)? y
Resizing cluster-1...done.
Finally, check the cluster description again to see if it matches the right configuration, and test the communication between the pods.
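For a quick cross-node connectivity test, something along these lines should work (the target pod IP and port are placeholders taken from your own workloads):
kubectl get pods -o wide          # note each pod's IP and the node it landed on
# Start a throwaway pod and try to reach a pod scheduled on another node
kubectl run test-shell --rm -it --image=busybox --restart=Never -- sh
# inside the test pod:
wget -qO- --timeout=5 http://<pod-ip-on-other-node>:<port>
nslookup kubernetes.default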
Here is the reference for this configuration:
Creating a Cluster Network Policy
If you still got the issue after that, update your question with the latest cluster description and we will dig further.

Related

How to migrate Persistent Volumes within GKE clusters in the same project?

I have a GKE cluster running with several persistent disks for storage.
To set up a staging environment, I created a second cluster inside the same project.
Now I want to use the data from the persistent disks of the production cluster in the staging cluster.
I already created persistent disks for the staging cluster. What is the best approach to move the production data over to the disks of the staging cluster?
You can use the open-source tool Velero, which is designed to migrate Kubernetes cluster resources.
Follow these steps to migrate a persistent disk within GKE clusters:
Create a GCS bucket:
BUCKET=<your_bucket_name>
gsutil mb gs://$BUCKET/
Create a Google Service Account and store the associated email in a variable for later use:
GSA_NAME=<your_service_account_name>
gcloud iam service-accounts create $GSA_NAME \
--display-name "Velero service account"
SERVICE_ACCOUNT_EMAIL=$(gcloud iam service-accounts list \
--filter="displayName:Velero service account" \
--format 'value(email)')
Create a custom role for the Service Account:
PROJECT_ID=<your_project_id>
ROLE_PERMISSIONS=(
compute.disks.get
compute.disks.create
compute.disks.createSnapshot
compute.snapshots.get
compute.snapshots.create
compute.snapshots.useReadOnly
compute.snapshots.delete
compute.zones.get
storage.objects.create
storage.objects.delete
storage.objects.get
storage.objects.list
)
gcloud iam roles create velero.server \
--project $PROJECT_ID \
--title "Velero Server" \
--permissions "$(IFS=","; echo "${ROLE_PERMISSIONS[*]}")"
gcloud projects add-iam-policy-binding $PROJECT_ID \
--member serviceAccount:$SERVICE_ACCOUNT_EMAIL \
--role projects/$PROJECT_ID/roles/velero.server
gsutil iam ch serviceAccount:$SERVICE_ACCOUNT_EMAIL:objectAdmin gs://${BUCKET}
Grant access to Velero:
gcloud iam service-accounts keys create credentials-velero \
--iam-account $SERVICE_ACCOUNT_EMAIL
Download and install Velero on the source cluster:
wget https://github.com/vmware-tanzu/velero/releases/download/v1.8.1/velero-v1.8.1-linux-amd64.tar.gz
tar -xvzf velero-v1.8.1-linux-amd64.tar.gz
sudo mv velero-v1.8.1-linux-amd64/velero /usr/local/bin/velero
velero install \
--provider gcp \
--plugins velero/velero-plugin-for-gcp:v1.4.0 \
--bucket $BUCKET \
--secret-file ./credentials-velero
Note: The download and installation were performed on a Linux system, which is the OS used by Cloud Shell. If you are managing your GCP resources via the Cloud SDK, the release and installation process could vary.
Confirm that the velero pod is running:
$ kubectl get pods -n velero
NAME READY STATUS RESTARTS AGE
velero-xxxxxxxxxxx-xxxx 1/1 Running 0 11s
Create a backup of the PVs and PVCs:
velero backup create <your_backup_name> --include-resources pvc,pv --selector app.kubernetes.io/<your_label_name>=<your_label_value>
Verify that your backup was successful with no errors or warnings:
$ velero backup describe <your_backup_name> --details
Name: your_backup_name
Namespace: velero
Labels: velero.io/storage-location=default
Annotations: velero.io/source-cluster-k8s-gitversion=v1.21.6-gke.1503
velero.io/source-cluster-k8s-major-version=1
velero.io/source-cluster-k8s-minor-version=21
Phase: Completed
Errors: 0
Warnings: 0
Now that the Persistent Volumes are backed up, you can proceed with the migration to the destination cluster following these steps:
Authenticate against the destination cluster:
gcloud container clusters get-credentials <your_destination_cluster> --zone <your_zone> --project <your_project>
Install Velero using the same parameters as in the installation step of the first part:
velero install \
--provider gcp \
--plugins velero/velero-plugin-for-gcp:v1.4.0 \
--bucket $BUCKET \
--secret-file ./credentials-velero
Confirm that the velero pod is running:
kubectl get pods -n velero
NAME READY STATUS RESTARTS AGE
velero-xxxxxxxxxx-xxxxx 1/1 Running 0 19s
To avoid the backup data being overwritten, change the bucket to read-only mode:
kubectl patch backupstoragelocation default -n velero --type merge --patch '{"spec":{"accessMode":"ReadOnly"}}'
Confirm that Velero is able to access the backup from the bucket:
velero backup describe <your_backup_name> --details
Restore the backed up Volumes:
velero restore create --from-backup <your_backup_name>
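If you want to follow the restore itself before checking the PVCs, Velero exposes its status directly (the restore name is whatever velero restore create generated for you):
velero restore get
velero restore describe <your_restore_name> --details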
Confirm that the persistent volumes have been restored on the destination cluster:
kubectl get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
redis-data-my-release-redis-master-0 Bound pvc-ae11172a-13fa-4ac4-95c5-d0a51349d914 8Gi RWO standard 79s
redis-data-my-release-redis-replicas-0 Bound pvc-f2cc7e07-b234-415d-afb0-47dd7b9993e7 8Gi RWO standard 79s
redis-data-my-release-redis-replicas-1 Bound pvc-ef9d116d-2b12-4168-be7f-e30b8d5ccc69 8Gi RWO standard 79s
redis-data-my-release-redis-replicas-2 Bound pvc-65d7471a-7885-46b6-a377-0703e7b01484 8Gi RWO standard 79s
Check out this tutorial as a reference.

"unable to retrieve the complete list of server APIs: tap.linkerd.io/v1alpha1" error using Linkerd on private cluster in GKE

Why does the following error occur when I install Linkerd 2.x on a private cluster in GKE?
Error: could not get apiVersions from Kubernetes: unable to retrieve the complete list of server APIs: tap.linkerd.io/v1alpha1: the server is currently unable to handle the request
The default firewall rules of a private cluster on GKE only permit traffic on ports 443 and 10250. This allows communication to the kube-apiserver and kubelet, respectively.
Linkerd uses ports 8443 and 8089 for communication between the control plane and the proxies deployed to the data plane.
The tap component uses port 8089 to handle requests to its apiserver.
The proxy injector and service profile validator components, both of which are types of admission controllers, use port 8443 to handle requests.
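You can confirm what the auto-created rules currently allow before adding anything (the gke-<your-cluster-name> name pattern is an assumption based on how GKE usually names these rules; adjust the filter to your own network):
# List the rules GKE created for this cluster and check their allowed ports
gcloud compute firewall-rules list --filter "name~gke-<your-cluster-name>"
# Inspect a single rule in detail (the master rule should show tcp:443,10250 by default)
gcloud compute firewall-rules describe <rule-name>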
The Linkerd 2 docs include instructions for configuring your firewall on a GKE private cluster: https://linkerd.io/2/reference/cluster-configuration/
They are included below:
Get the cluster name:
CLUSTER_NAME=your-cluster-name
gcloud config set compute/zone your-zone-or-region
Get the cluster MASTER_IPV4_CIDR:
MASTER_IPV4_CIDR=$(gcloud container clusters describe $CLUSTER_NAME \
| grep "masterIpv4CidrBlock: " \
| awk '{print $2}')
Get the cluster NETWORK:
NETWORK=$(gcloud container clusters describe $CLUSTER_NAME \
| grep "^network: " \
| awk '{print $2}')
Get the cluster auto-generated NETWORK_TARGET_TAG:
NETWORK_TARGET_TAG=$(gcloud compute firewall-rules list \
--filter network=$NETWORK --format json \
| jq ".[] | select(.name | contains(\"$CLUSTER_NAME\"))" \
| jq -r '.targetTags[0]' | head -1)
Verify the values:
echo $MASTER_IPV4_CIDR $NETWORK $NETWORK_TARGET_TAG
# example output
10.0.0.0/28 foo-network gke-foo-cluster-c1ecba83-node
Create the firewall rules for proxy-injector and tap:
gcloud compute firewall-rules create gke-to-linkerd-control-plane \
--network "$NETWORK" \
--allow "tcp:8443,tcp:8089" \
--source-ranges "$MASTER_IPV4_CIDR" \
--target-tags "$NETWORK_TARGET_TAG" \
--priority 1000 \
--description "Allow traffic on ports 8443, 8089 for linkerd control-plane components"
Finally, verify that the firewall rule was created:
gcloud compute firewall-rules describe gke-to-linkerd-control-plane
In my case, it was related to linkerd/linkerd2#3497: the Linkerd service had some internal problems and could not respond to the API service requests. It was fixed by restarting its pods.
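If you want to try the same fix, restarting the control-plane pods can be as simple as the following (assuming Linkerd is installed in its default linkerd namespace):
kubectl -n linkerd get pods                # see which components are unhealthy
kubectl -n linkerd rollout restart deploy  # recreate all control-plane pods
# or, on older kubectl versions without "rollout restart":
kubectl -n linkerd delete pods --all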
Solution:
The steps I followed are:
kubectl get apiservices: if the linkerd apiservice is down with the error CrashLoopBackOff, follow step 2; otherwise just try to restart the linkerd apiservice using kubectl delete apiservice/"service_name". For me it was v1alpha1.tap.linkerd.io.
kubectl get pods -n kube-system: here I found out that pods like metrics-server, linkerd, and kubernetes-dashboard were down because the main CoreDNS pod was down.
For me it was:
NAME READY STATUS RESTARTS AGE
pod/coredns-85577b65b-zj2x2 0/1 CrashLoopBackOff 7 13m
Use kubectl describe pod/"pod_name" to check the error in the CoreDNS pod. If it is down because of /etc/coredns/Corefile:10 - Error during parsing: Unknown directive proxy, then you need to use forward instead of proxy in the ConfigMap holding the CoreDNS configuration, because CoreDNS version 1.5.x used by the image no longer supports the proxy keyword. A sketch of that edit is below.
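Something like this, assuming CoreDNS keeps its configuration in the usual coredns ConfigMap in kube-system (names can differ per distribution):
kubectl -n kube-system edit configmap coredns
# in the Corefile, replace the line
#     proxy . /etc/resolv.conf
# with
#     forward . /etc/resolv.conf
# then recreate the CoreDNS pods so they pick up the change
# (kubeadm-style clusters label them k8s-app=kube-dns)
kubectl -n kube-system delete pods -l k8s-app=kube-dns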
This was a Linkerd issue for me. To diagnose any Linkerd-related issues, you can use the linkerd CLI and run linkerd check; this should show you whether there is an issue with Linkerd, with links to instructions on how to fix it.
For me, the issue was that the Linkerd root certificates had expired. In my case, Linkerd was experimental in a dev cluster, so I removed it. However, if you need to update your certificates you can follow the instructions at the following link:
https://linkerd.io/2.11/tasks/replacing_expired_certificates/
Thanks to https://stackoverflow.com/a/59644120/1212371 I was put on the right path.

How to get Kubernetes secret from one cluster to apply to another?

For my e2e tests I'm spinning up a separate cluster into which I'd like to import my production TLS certificate. I'm having trouble switching the context between the two clusters (export/get from one and import/apply (in)to the other) because the cluster doesn't seem to be visible.
I extracted an MVCE using GitLab CI and the following .gitlab-ci.yml where I create a secret for demonstration purposes:
stages:
  - main
  - tear-down
main:
  image: google/cloud-sdk
  stage: main
  script:
    - echo "$GOOGLE_KEY" > key.json
    - gcloud config set project secret-transfer
    - gcloud auth activate-service-account --key-file key.json --project secret-transfer
    - gcloud config set compute/zone us-central1-a
    - gcloud container clusters create secret-transfer-1-$CI_COMMIT_SHORT_SHA-$CI_PIPELINE_IID --project secret-transfer --machine-type=f1-micro
    - kubectl create secret generic secret-1 --from-literal=key=value
    - gcloud container clusters create secret-transfer-2-$CI_COMMIT_SHORT_SHA-$CI_PIPELINE_IID --project secret-transfer --machine-type=f1-micro
    - gcloud config set container/use_client_certificate True
    - gcloud config set container/cluster secret-transfer-1-$CI_COMMIT_SHORT_SHA-$CI_PIPELINE_IID
    - kubectl get secret letsencrypt-prod --cluster=secret-transfer-1-$CI_COMMIT_SHORT_SHA-$CI_PIPELINE_IID -o yaml > secret-1.yml
    - gcloud config set container/cluster secret-transfer-2-$CI_COMMIT_SHORT_SHA-$CI_PIPELINE_IID
    - kubectl apply --cluster=secret-transfer-2-$CI_COMMIT_SHORT_SHA-$CI_PIPELINE_IID -f secret-1.yml
tear-down:
  image: google/cloud-sdk
  stage: tear-down
  when: always
  script:
    - echo "$GOOGLE_KEY" > key.json
    - gcloud config set project secret-transfer
    - gcloud auth activate-service-account --key-file key.json
    - gcloud config set compute/zone us-central1-a
    - gcloud container clusters delete --quiet secret-transfer-1-$CI_COMMIT_SHORT_SHA-$CI_PIPELINE_IID
    - gcloud container clusters delete --quiet secret-transfer-2-$CI_COMMIT_SHORT_SHA-$CI_PIPELINE_IID
I added secret-transfer-[1/2]-$CI_COMMIT_SHORT_SHA-$CI_PIPELINE_IID before the kubectl statements in order to avoid error: no server found for cluster "secret-transfer-1-...-...", but it doesn't change the outcome.
I created a project secret-transfer, activated the Kubernetes API and got a JSON key for the Compute Engine service account which I'm providing in the environment variable GOOGLE_KEY. The output after checkout is
$ echo "$GOOGLE_KEY" > key.json
$ gcloud config set project secret-transfer
Updated property [core/project].
$ gcloud auth activate-service-account --key-file key.json --project secret-transfer
Activated service account credentials for: [131478687181-compute@developer.gserviceaccount.com]
$ gcloud config set compute/zone us-central1-a
Updated property [compute/zone].
$ gcloud container clusters create secret-transfer-1-$CI_COMMIT_SHORT_SHA-$CI_PIPELINE_IID --project secret-transfer --machine-type=f1-micro
WARNING: In June 2019, node auto-upgrade will be enabled by default for newly created clusters and node pools. To disable it, use the `--no-enable-autoupgrade` flag.
WARNING: Starting in 1.12, new clusters will have basic authentication disabled by default. Basic authentication can be enabled (or disabled) manually using the `--[no-]enable-basic-auth` flag.
WARNING: Starting in 1.12, new clusters will not have a client certificate issued. You can manually enable (or disable) the issuance of the client certificate using the `--[no-]issue-client-certificate` flag.
WARNING: Currently VPC-native is not the default mode during cluster creation. In the future, this will become the default mode and can be disabled using `--no-enable-ip-alias` flag. Use `--[no-]enable-ip-alias` flag to suppress this warning.
WARNING: Starting in 1.12, default node pools in new clusters will have their legacy Compute Engine instance metadata endpoints disabled by default. To create a cluster with legacy instance metadata endpoints disabled in the default node pool, run `clusters create` with the flag `--metadata disable-legacy-endpoints=true`.
WARNING: Your Pod address range (`--cluster-ipv4-cidr`) can accommodate at most 1008 node(s).
This will enable the autorepair feature for nodes. Please see https://cloud.google.com/kubernetes-engine/docs/node-auto-repair for more information on node autorepairs.
Creating cluster secret-transfer-1-9b219ea8-9 in us-central1-a...
...done.
Created [https://container.googleapis.com/v1/projects/secret-transfer/zones/us-central1-a/clusters/secret-transfer-1-9b219ea8-9].
To inspect the contents of your cluster, go to: https://console.cloud.google.com/kubernetes/workload_/gcloud/us-central1-a/secret-transfer-1-9b219ea8-9?project=secret-transfer
kubeconfig entry generated for secret-transfer-1-9b219ea8-9.
NAME LOCATION MASTER_VERSION MASTER_IP MACHINE_TYPE NODE_VERSION NUM_NODES STATUS
secret-transfer-1-9b219ea8-9 us-central1-a 1.12.8-gke.10 34.68.118.165 f1-micro 1.12.8-gke.10 3 RUNNING
$ kubectl create secret generic secret-1 --from-literal=key=value
secret/secret-1 created
$ gcloud container clusters create secret-transfer-2-$CI_COMMIT_SHORT_SHA-$CI_PIPELINE_IID --project secret-transfer --machine-type=f1-micro
WARNING: In June 2019, node auto-upgrade will be enabled by default for newly created clusters and node pools. To disable it, use the `--no-enable-autoupgrade` flag.
WARNING: Starting in 1.12, new clusters will have basic authentication disabled by default. Basic authentication can be enabled (or disabled) manually using the `--[no-]enable-basic-auth` flag.
WARNING: Starting in 1.12, new clusters will not have a client certificate issued. You can manually enable (or disable) the issuance of the client certificate using the `--[no-]issue-client-certificate` flag.
WARNING: Currently VPC-native is not the default mode during cluster creation. In the future, this will become the default mode and can be disabled using `--no-enable-ip-alias` flag. Use `--[no-]enable-ip-alias` flag to suppress this warning.
WARNING: Starting in 1.12, default node pools in new clusters will have their legacy Compute Engine instance metadata endpoints disabled by default. To create a cluster with legacy instance metadata endpoints disabled in the default node pool, run `clusters create` with the flag `--metadata disable-legacy-endpoints=true`.
WARNING: Your Pod address range (`--cluster-ipv4-cidr`) can accommodate at most 1008 node(s).
This will enable the autorepair feature for nodes. Please see https://cloud.google.com/kubernetes-engine/docs/node-auto-repair for more information on node autorepairs.
Creating cluster secret-transfer-2-9b219ea8-9 in us-central1-a...
...done.
Created [https://container.googleapis.com/v1/projects/secret-transfer/zones/us-central1-a/clusters/secret-transfer-2-9b219ea8-9].
To inspect the contents of your cluster, go to: https://console.cloud.google.com/kubernetes/workload_/gcloud/us-central1-a/secret-transfer-2-9b219ea8-9?project=secret-transfer
kubeconfig entry generated for secret-transfer-2-9b219ea8-9.
NAME LOCATION MASTER_VERSION MASTER_IP MACHINE_TYPE NODE_VERSION NUM_NODES STATUS
secret-transfer-2-9b219ea8-9 us-central1-a 1.12.8-gke.10 104.198.37.21 f1-micro 1.12.8-gke.10 3 RUNNING
$ gcloud config set container/use_client_certificate True
Updated property [container/use_client_certificate].
$ gcloud config set container/cluster secret-transfer-1-$CI_COMMIT_SHORT_SHA-$CI_PIPELINE_IID
Updated property [container/cluster].
$ kubectl get secret secret-1 --cluster=secret-transfer-1-$CI_COMMIT_SHORT_SHA-$CI_PIPELINE_IID -o yaml > secret-1.yml
error: no server found for cluster "secret-transfer-1-9b219ea8-9"
I'm expecting kubectl get secret to work because both clusters exist and the --cluster argument points to the right cluster.
Generally, gcloud commands are used to manage gcloud resources and handle how you authenticate with gcloud, whereas kubectl commands affect how you interact with Kubernetes clusters, whether or not they happen to be running on GCP and/or were created in GKE. As such, I would avoid doing:
$ gcloud config set container/use_client_certificate True
Updated property [container/use_client_certificate].
$ gcloud config set container/cluster \
secret-transfer-1-$CI_COMMIT_SHORT_SHA-$CI_PIPELINE_IID
Updated property [container/cluster].
It's not doing what you probably think it's doing (namely, changing anything about how kubectl targets clusters), and might mess with how future gcloud commands work.
Another consequence of gcloud and kubectl being separate, and in particular of kubectl not knowing intimately about your gcloud settings, is that the cluster name from the gcloud perspective is not the same as from the kubectl perspective. When you do things like gcloud config set compute/zone, kubectl doesn't know anything about that, so it has to be able to identify clusters uniquely; clusters may have the same name but live in different projects and zones, and may not even be in GKE (like minikube or some other cloud provider). That's why kubectl --cluster=<gke-cluster-name> <some_command> is not going to work, and it's why you're seeing the error message:
error: no server found for cluster "secret-transfer-1-9b219ea8-9"
As @coderanger pointed out, the cluster name that gets generated in your ~/.kube/config file after doing gcloud container clusters create ... has a more complex name, which currently has a pattern something like gke_[project]_[region]_[name].
So you could run commands with kubectl --cluster gke_[project]_[region]_[name] ... (or kubectl --context [project]_[region]_[name] ... which would be more idiomatic, although both will happen to work in this case since you're using the same service account for both clusters), however that requires knowledge of how gcloud generates these strings for context and cluster names.
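If you'd rather not guess those strings, you can list exactly what gcloud wrote into your kubeconfig, for example:
# List the context names gcloud generated (what --context expects)
kubectl config get-contexts -o name
# List the cluster entries (what --cluster expects)
kubectl config view -o jsonpath='{range .clusters[*]}{.name}{"\n"}{end}'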
An alternative would be to do something like:
$ KUBECONFIG=~/.kube/config1 gcloud container clusters create \
secret-transfer-1-$CI_COMMIT_SHORT_SHA-$CI_PIPELINE_IID \
--project secret-transfer --machine-type=f1-micro
$ KUBECONFIG=~/.kube/config1 kubectl create secret generic secret-1 --from-literal=key=value
$ KUBECONFIG=~/.kube/config2 gcloud container clusters create \
secret-transfer-2-$CI_COMMIT_SHORT_SHA-$CI_PIPELINE_IID \
--project secret-transfer --machine-type=f1-micro
$ KUBECONFIG=~/.kube/config1 kubectl get secret secret-1 -o yaml > secret-1.yml
$ KUBECONFIG=~/.kube/config2 kubectl apply -f secret-1.yml
By having separate KUBECONFIG files that you control, you don't have to guess any strings. Setting the KUBECONFIG variable when creating a cluster will result in creating that file, with gcloud putting the credentials for kubectl to access that cluster into it. Setting the KUBECONFIG environment variable when running a kubectl command will ensure kubectl uses the context set in that particular file.
You probably mean to be using --context rather than --cluster. The context sets both the cluster and the user in use. Additionally, the context and cluster (and user) names created by GKE are not just the cluster identifier; they follow the pattern gke_[project]_[region]_[name].

Node pool does not reduce its size to zero although autoscaling is enabled

I have created two node pools: a small one for all the Google system jobs and a bigger one for my tasks. The bigger one should reduce its size to 0 after the job is done.
The problem is: even if there are no cron jobs running, the node pool does not reduce its size to 0.
Creating cluster:
gcloud beta container --project "projectXY" clusters create "cluster" --zone "europe-west3-a" --username "admin" --cluster-version "1.9.6-gke.0" --machine-type "n1-standard-1" --image-type "COS" --disk-size "100" --scopes "https://www.googleapis.com/auth/cloud-platform" --num-nodes "1" --network "default" --enable-cloud-logging --enable-cloud-monitoring --subnetwork "default" --enable-autoscaling --enable-autoupgrade --min-nodes "1" --max-nodes "1"
Creating node pool:
The node pool should reduce its size to 0 after all tasks are done.
gcloud container node-pools create workerpool --cluster=cluster --machine-type=n1-highmem-8 --zone=europe-west3-a --disk-size=100 --enable-autoupgrade --num-nodes=0 --enable-autoscaling --max-nodes=2 --min-nodes=0
Create cron job:
kubectl create -f cronjob.yaml
Quoting from Google Documentation:
"Note: Beginning with Kubernetes version 1.7, you can specify a minimum size of zero for your node pool. This allows your node pool to scale down completely if the instances within aren't required to run your workloads. However, while a node pool can scale to a zero size, the overall cluster size does not scale down to zero nodes (as at least one node is always required to run system Pods)."
Notice also that:
"Cluster autoscaler also measures the usage of each node against the node pool's total demand for capacity. If a node has had no new Pods scheduled on it for a set period of time, and [this option does not work for you since it is the last node] all Pods running on that node can be scheduled onto other nodes in the pool , the autoscaler moves the Pods and deletes the node.
Note that cluster autoscaler works based on Pod resource requests, that is, how many resources your Pods have requested. Cluster autoscaler does not take into account the resources your Pods are actively using. Essentially, cluster autoscaler trusts that the Pod resource requests you've provided are accurate and schedules Pods on nodes based on that assumption."
Therefore I would check:
that the version of your Kubernetes cluster is at least 1.7
that there are no pods running on the last node (check every namespace; the pods that have to run on every node do not count: fluentd, kube-dns, kube-proxy); the fact that there are no cron jobs is not enough (see the command sketch after this list)
that the autoscaler is NOT enabled for the corresponding managed instance groups, since they are different tools
that there are no pods stuck in any weird state still assigned to that node
that there are no pods waiting to be scheduled in the cluster
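A quick way to run the second check (the node name is a placeholder; repeat it for each node you expect to scale away):
# Show every pod, in every namespace, still scheduled on that node
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=<node-name>
# on older kubectl versions without --field-selector, filter the wide output instead:
kubectl get pods --all-namespaces -o wide | grep <node-name>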
If everything above checks out, it is likely an issue with the autoscaler itself, and you can open a private issue with Google specifying your project ID, since there is not much the community can do.
If you are interested, post the link to the issue tracker in the comments and I will take a look at your project (I work for Google Cloud Platform Support).
I ran into the same issue and tested a number of different scenarios. I finally got it to work by doing the following:
Create your node pool with an initial size of 1 instead of 0:
gcloud container node-pools create ${NODE_POOL_NAME} \
--cluster ${CLUSTER_NAME} \
--num-nodes 1 \
--enable-autoscaling --min-nodes 0 --max-nodes 1 \
--zone ${ZONE} \
--machine-type ${MACHINE_TYPE}
Configure your CronJob in a similar fashion to this:
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: cronjob800m
spec:
  schedule: "7 * * * *"
  concurrencyPolicy: Forbid
  failedJobsHistoryLimit: 0
  successfulJobsHistoryLimit: 0
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: cronjob800m
            image: busybox
            args:
            - /bin/sh
            - -c
            - date; echo Hello from the Kubernetes cluster
            resources:
              requests:
                cpu: "800m"
          restartPolicy: Never
Note that the resource requests are set so that the job can only run on the large node pool, not on the small one. Also note that we set both failedJobsHistoryLimit and successfulJobsHistoryLimit to 0 so that the finished job is automatically cleaned up from the node pool after success/failure.
That should be it.

kubernetes allow privileged local testing cluster

I'm testing out Kubernetes on my local PC using https://github.com/kubernetes/kubernetes/blob/master/docs/getting-started-guides/docker.md
which launches a dockerized single-node k8s cluster. I need to run a privileged container inside k8s (it runs Docker in order to build images from Dockerfiles). What I've done so far is add a security context with privileged: true to the pod config, which returns forbidden when trying to create the pod. I know that you have to enable privileged mode on the node with --allow-privileged=true, and I've done this by adding the argument to step two (running the master and worker node), but it still returns forbidden when creating the pod.
Does anyone know how to enable privileged containers in this dockerized k8s setup for testing?
Here is how I run the k8s master:
docker run --privileged --net=host -d -v /var/run/docker.sock:/var/run/docker.sock gcr.io/google_containers/hyperkube:v1.0.1 /hyperkube kubelet --api-servers=http://localhost:8080 --v=2 --address=0.0.0.0 --allow-privileged=true --enable-server --hostname-override=127.0.0.1 --config=/etc/kubernetes/manifests
Update: Privileged mode is now enabled by default (both in the apiserver and in the kubelet) starting with the 1.1 release of Kubernetes.
To enable privileged containers, you need to pass the --allow-privileged flag to the Kubernetes apiserver in addition to the Kubelet when it starts up. The manifest file that you use to launch the Kubernetes apiserver in the single node docker example is bundled into the image (from master.json), but you can make a local copy of that file, add the --allow-privileged=true flag to the apiserver command line, and then change the --config flag you pass to the Kubelet in Step Two to a directory containing your modified file.
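Once both flags are in place, a pod spec along these lines should be accepted (a minimal sketch; the pod name and image are placeholders for whatever actually needs privileged mode):
cat <<EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: privileged-test
spec:
  containers:
  - name: docker-builder
    image: docker      # placeholder; substitute the image that needs privileged mode
    securityContext:
      privileged: true
EOF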