EKS - How to annotate some nodes in USERDATA? - kubernetes

To prevent the Cluster Autoscaler from terminating some of the nodes, I would need to annotate them with:
cluster-autoscaler.kubernetes.io/scale-down-disabled=true
Is there a way to do so in the USERDATA script?
For labeling the nodes, there is no issue, and it is possible to do so via:
--kubelet-extra-args \
"--node-labels=
Thanks

No, it is not possible.
The list of supported parameters for the bootstrap script:
--use-max-pods Sets --max-pods for the kubelet when true. (default: true)
--b64-cluster-ca The base64 encoded cluster CA content. Only valid when used with --apiserver-endpoint. Bypasses calling "aws eks describe-cluster"
--apiserver-endpoint The EKS cluster API Server endpoint. Only valid when used with --b64-cluster-ca. Bypasses calling "aws eks describe-cluster"
--kubelet-extra-args Extra arguments to add to the kubelet. Useful for adding labels or taints.
--enable-docker-bridge Restores the docker default bridge network. (default: false)
--aws-api-retry-attempts Number of retry attempts for AWS API call (DescribeCluster) (default: 3)
--docker-config-json The contents of the /etc/docker/daemon.json file. Useful if you want a custom config differing from the default one in the AMI
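Annotations are not a kubelet flag, which is why --kubelet-extra-args only covers labels and taints. If you still want to annotate the node at boot time, one possible workaround (a sketch only, not a supported bootstrap option; it assumes kubectl is installed on the node and that a kubeconfig with permission to annotate nodes exists at the hypothetical path below) is to wait for the node to register and then call kubectl annotate from the same USERDATA script:
# Sketch: annotate this node after it joins the cluster.
# KUBECONFIG_PATH is a placeholder; the EKS AMI does not ship such a file.
KUBECONFIG_PATH=/path/to/admin-kubeconfig
NODE_NAME=$(curl -s http://169.254.169.254/latest/meta-data/local-hostname)
until kubectl --kubeconfig "$KUBECONFIG_PATH" get node "$NODE_NAME" >/dev/null 2>&1; do
  sleep 10
done
kubectl --kubeconfig "$KUBECONFIG_PATH" annotate node "$NODE_NAME" \
  cluster-autoscaler.kubernetes.io/scale-down-disabled=true --overwrite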

You can add node labels, taints, etc. by using the --kubelet-extra-args option on the bootstrap.sh invocation, as you guessed. For an example, see the AWS blog post: Improvements for Amazon EKS Worker Node Provisioning.
Use a USERDATA script similar to the following:
UserData: !Base64
  "Fn::Sub": |
    #!/bin/bash
    set -o xtrace
    /etc/eks/bootstrap.sh ${ClusterName} ${BootstrapArguments}
    /opt/aws/bin/cfn-signal --exit-code $? \
      --stack ${AWS::StackName} \
      --resource NodeGroup \
      --region ${AWS::Region}
The above is a fragment from the CloudFormation template. Of course you can make your script more complex, with security hardening, etc. if you so desire.
For a complete CloudFormation template, download the sample from AWS:
curl -O https://amazon-eks.s3-us-west-2.amazonaws.com/cloudformation/2019-11-15/amazon-eks-nodegroup.yaml

It is absolutely possible. Here is part of my example userdata, which is specifically useful if you want to run both OnDemand and Spot instances. In my example I am adding a lifecycle node label that changes based on the instance type. See below:
--use-max-pods 'true' \
--kubelet-extra-args ' --node-labels=lifecycle=OnDemand \
--system-reserved cpu=250m,memory=0.2Gi,ephemeral-storage=1Gi \
--kube-reserved cpu=250m,memory=1Gi,ephemeral-storage=1Gi \
--eviction-hard memory.available<0.2Gi,nodefs.available<10% \
--event-qps 0'
I hope that gives you a nice example.
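To make the lifecycle label actually differ between OnDemand and Spot nodes, one approach (a sketch, assuming the EC2 instance metadata service is reachable and exposes the instance-life-cycle key) is to derive the label value in USERDATA before calling bootstrap.sh:
# Sketch: pick the node label based on EC2 instance metadata.
# instance-life-cycle returns "spot" for Spot instances, "on-demand" otherwise.
# ${ClusterName} mirrors the CloudFormation substitution used above.
LIFECYCLE=$(curl -s http://169.254.169.254/latest/meta-data/instance-life-cycle)
if [ "$LIFECYCLE" = "spot" ]; then
  LABEL="lifecycle=Spot"
else
  LABEL="lifecycle=OnDemand"
fi
/etc/eks/bootstrap.sh "${ClusterName}" \
  --kubelet-extra-args "--node-labels=${LABEL}"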

Related

How can I disable SSH on nodes in a GKE node pool?

I am running a regional GKE Kubernetes cluster in us-central1-b, us-central1-c, and us-central1-f. I am running 1.21.14-gke.700. I am adding a confidential node pool to the cluster with this command:
gcloud container node-pools create card-decrpyt-confidential-pool-1 \
--cluster=svcs-dev-1 \
--disk-size=100GB \
--disk-type=pd-standard \
--enable-autorepair \
--enable-autoupgrade \
--enable-gvnic \
--image-type=COS_CONTAINERD \
--machine-type="n2d-standard-2" \
--max-pods-per-node=8 \
--max-surge-upgrade=1 \
--max-unavailable-upgrade=1 \
--min-nodes=4 \
--node-locations=us-central1-b,us-central1-c,us-central1-f \
--node-taints=dedicatednode=card-decrypt:NoSchedule \
--node-version=1.21.14-gke.700 \
--num-nodes=4 \
--region=us-central1 \
--sandbox="type=gvisor" \
--scopes=https://www.googleapis.com/auth/cloud-platform \
--service-account="card-decrpyt-confidential#corp-dev-project.iam.gserviceaccount.com" \
--shielded-integrity-monitoring \
--shielded-secure-boot \
--tags=testingdonotuse \
--workload-metadata=GKE_METADATA \
--enable-confidential-nodes
This creates a node pool, but there is one problem... I can still SSH to the instances that the node pool creates. This is unacceptable for my use case, as these node pools need to be as secure as possible. So I created a new instance template with SSH turned off, based on the one generated for my node pool:
gcloud compute instance-templates create card-decrypt-instance-template \
--project=corp-dev-project \
--machine-type=n2d-standard-2 \
--network-interface=aliases=gke-svcs-dev-1-pods-10a0a3cd:/28,nic-type=GVNIC,subnet=corp-dev-project-private-subnet,no-address \
--metadata=block-project-ssh-keys=true,enable-oslogin=true \
--maintenance-policy=TERMINATE --provisioning-model=STANDARD \
--service-account=card-decrpyt-confidential@corp-dev-project.iam.gserviceaccount.com \
--scopes=https://www.googleapis.com/auth/cloud-platform \
--region=us-central1 --min-cpu-platform=AMD\ Milan \
--tags=testingdonotuse,gke-svcs-dev-1-10a0a3cd-node \
--create-disk=auto-delete=yes,boot=yes,device-name=card-decrpy-instance-template,image=projects/confidential-vm-images/global/images/cos-89-16108-766-5,mode=rw,size=100,type=pd-standard \
--shielded-secure-boot \
--shielded-vtpm \
--shielded-integrity-monitoring \
--labels=component=gke,goog-gke-node=,team=platform --reservation-affinity=any
When I change the instance template of the nodes in the node pool, the new instances come online but they do not attach to the node pool. The cluster keeps trying to repair itself, and I can't change any settings until I delete all the nodes in the pool. I don't receive any errors.
What do I need to do to disable SSH into the node pool nodes, either with the original node pool I created or with the new instance template? I have tried a bunch of different configurations with a new node pool and the cluster and have not had any luck. I've tried different tags, network configs, and images. None of these have worked.
Other info:
The cluster was not originally a confidential cluster. The confidential nodes are the first of their kind added to the cluster.
One option you have here is to enable private IP addresses for the nodes in your cluster. The --enable-private-nodes flag will make it so the nodes in your cluster get private IP addresses (rather than the default public, internet-facing IP addresses).
Note that in this case, you would still be able to SSH into these nodes, but only from within your VPC network.
Also note that this means you would not be able to access NodePort type services from outside of your VPC network. Instead, you would need to use a LoadBalancer type service (or provide some other way to route traffic to your service from outside of the cluster, if required).
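For illustration, a minimal way to expose a workload through a LoadBalancer-type service (a sketch; the Deployment name and ports are placeholders):
kubectl expose deployment my-app --type=LoadBalancer --port=80 --target-port=8080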
If you'd like to prevent SSH access even from within your VPC network, your easiest option would likely be to configure a firewall rule to deny SSH traffic to your nodes (TCP/UDP/SCTP port 22). Use network tags (the --tags flag) to target your GKE nodes.
Something along the lines of:
gcloud compute firewall-rules create fw-d-i-ssh-to-gke-nodes \
--network NETWORK_NAME \
--action deny \
--direction ingress \
--rules tcp:22,udp:22,sctp:22 \
--source-ranges 0.0.0.0/0 \
--priority 65534 \
--target-tags my-gke-node-network-tag
Finally, one last option I'll mention for creating a hardened GKE cluster is to use Google's safer-cluster Terraform module. This is an opinionated setup of a GKE cluster that follows many of the principles laid out in Google's cluster hardening guide and the Terraform module takes care of a lot of the nitty-gritty boilerplate here.
I needed the metadata flag when creating the node pool:
--metadata=block-project-ssh-keys=TRUE \
This blocked SSH.
However, enable-oslogin=false won't work, because that metadata key is reserved for use by Kubernetes Engine.
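For context, this is roughly how that flag slots into the original node-pool command (a sketch; all other flags from the question are elided):
gcloud container node-pools create card-decrpyt-confidential-pool-1 \
--cluster=svcs-dev-1 \
--region=us-central1 \
--metadata=block-project-ssh-keys=TRUE \
# ... remaining flags as in the original node-pools create command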

Cannot install Helm chart when accessing GKE cluster directly

I've set up a basic GKE cluster using Autopilot settings. I am able to install Helm charts on it using kubectl with proper kubeconfig pointing to the GKE cluster.
I'd like to do the same without the kubeconfig, by providing the cluster details with relevant parameters.
To do that I'm running a docker container using the alpine/helm image and passing a parametrised command which looks like this:
docker run --rm -v $(pwd):/chart alpine/helm install <my_chart_name> /chart --kube-apiserver <cluster_endpoint> --kube-ca-file /chart/<cluster_certificate_file> --kube-as-user <my_gke_cluster_username> --kube-token <token>
Unfortunately it returns:
Error: INSTALLATION FAILED: Kubernetes cluster unreachable: Get "http://<cluster_endpoint>/version": dial tcp <cluster_endpoint>:80: i/o timeout
Is this even doable with GKE?
One challenge will be that GKE leverages a plugin (currently built into kubectl itself but soon the standalone gke-gcloud-auth-plugin) to obtain an access token for the default gcloud user.
This token expires hourly.
If you can, it would be better to mount the kubeconfig (${HOME}/.kube/config) file into the container, as it should (!) then authenticate as kubectl does, which will not only leverage the access token correctly but will also renew it as appropriate.
https://github.com/alpine-docker/helm
docker run \
--interactive --tty --rm \
--volume=${PWD}/.kube:/root/.kube \
--volume=${PWD}/.helm:/root/.helm \
--volume=${PWD}/.config/helm:/root/.config/helm \
--volume=${PWD}/.cache/helm:/root/.cache/helm \
alpine/helm ...
NOTE: It appears several other local paths (.helm, .config, and .cache) may be required too.
Problem solved! A more experienced colleague found the solution.
I should have used the address including the "https://" protocol specification. That alone, however, still kept returning the "Kubernetes cluster unreachable" error, with "unknown" details instead.
I had also been using an incorrect username. Instead of the one from the kubeconfig file, a new service account should be created and its name used in the form system:serviceaccount:<namespace>:<service_account>. That still did not alter the error, though.
Finally, the service account lacked a proper role; the following command did the job: kubectl create rolebinding <binding_name> --clusterrole=cluster-admin --serviceaccount=<namespace>:<service_account>. Of course, cluster-admin might not be a role we want to give away freely.
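Putting those fixes together, the working invocation would look roughly like this (a sketch; the placeholder names follow the ones used in the question, and the token is one issued for that service account):
docker run --rm -v $(pwd):/chart alpine/helm install <my_chart_name> /chart \
  --kube-apiserver https://<cluster_endpoint> \
  --kube-ca-file /chart/<cluster_certificate_file> \
  --kube-as-user system:serviceaccount:<namespace>:<service_account> \
  --kube-token <service_account_token>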

Instance attributes are not available in metadata

I'm setting up a new K8s cluster (1.13.7-gke.8) on GKE, and I want the Google Cloud Logging API to properly report namespace and instance names.
This is executed in a new GKE cluster with workload-identity enabled.
I started a workload container to test access to the metadata service, and these are the results:
kubectl run -it --generator=run-pod/v1 --image google/cloud-sdk --namespace prod --rm workload-identity-test
And from the container after executing:
curl "http://metadata.google.internal/computeMetadata/v1/instance/attributes/" -H "Metadata-Flavor: Google"
I expect the output of cluster-name, container-name, and namespace-id, but the actual output is only cluster-name.
I was getting the same, but when I ran the following, the metadata showed up:
gcloud beta container node-pools create [NODEPOOL_NAME] \
--cluster=[CLUSTER_NAME] \
--workload-metadata-from-node=EXPOSED
However, you will still only get cluster-level attributes from the metadata. For example,
root@workload-identity-test:/# curl "http://metadata.google.internal/computeMetadata/v1/instance/attributes/" -H "Metadata-Flavor: Google"
cluster-location
cluster-name
cluster-uid
configure-sh
created-by
disable-legacy-endpoints
enable-oslogin
gci-ensure-gke-docker
gci-update-strategy
google-compute-enable-pcid
instance-template
kube-env
kube-labels
kubelet-config
user-data
If you are looking to get namespaces and containers, I suggest talking directly to the Kubernetes API, which is essentially what the 'Workloads' tab on GKE does. I'm not really sure what you are trying to do with the Google Cloud Logging API, but maybe you can elaborate in a different question.
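For example, the same in-cluster service-account pattern shown in the endpoints answer further down this page can be used to ask the Kubernetes API for pod and namespace information directly (a sketch; it assumes the pod's service account is allowed to list pods):
# Sketch: query the Kubernetes API from inside a pod for pods in the
# pod's own namespace, using the mounted service-account credentials.
TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
CACERT=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt
NAMESPACE=$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace)
curl -s --cacert "$CACERT" -H "Authorization: Bearer $TOKEN" \
  "https://kubernetes.default.svc/api/v1/namespaces/${NAMESPACE}/pods"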

Identify redundant GCP resources created by Kubernetes

When creating various Kubernetes objects in GKE, associated GCP resources are automatically created. I'm specifically referring to:
forwarding-rules
target-http-proxies
url-maps
backend-services
health-checks
These have names such as k8s-fw-service-name-tls-ingress--8473ea5ff858586b.
After deleting a cluster, these resources remain. How can I identify which of these are still in use (by other Kubernetes objects, or another cluster) and which are not?
There is no easy way to identify which added GCP resources (load balancers, backends, etc.) are linked to which cluster. You need to manually go into these resources to see what they are linked to.
If you delete a cluster with additional resources attached, you also have to delete these resources manually. For now, I would suggest taking note of which added GCP resources are related to which cluster, so that you will know which resources to delete when the time comes to delete the GKE cluster.
I would also suggest creating a feature request here to ask for either a more defined naming convention for the additional GCP resources created for a specific cluster, and/or the ability to automatically delete all additional resources linked to a cluster when deleting said cluster.
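As a starting point for that manual audit, the GKE-created resources can at least be listed by their k8s- name prefix so you can inspect each description field, which often records the originating Kubernetes object (a sketch; the exact description contents vary by GKE version):
gcloud compute forwarding-rules list --filter="name~'^k8s-'" \
  --format="table(name,description)"
gcloud compute backend-services list --filter="name~'^k8s-'" \
  --format="table(name,description)"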
I would recommend you look at https://github.com/kelseyhightower/kubernetes-the-hard-way/blob/master/docs/14-cleanup.md
You can easily delete all the objects by using the Google Cloud SDK in the following manner:
gcloud -q compute firewall-rules delete \
kubernetes-the-hard-way-allow-nginx-service \
kubernetes-the-hard-way-allow-internal \
kubernetes-the-hard-way-allow-external \
kubernetes-the-hard-way-allow-health-check
{
gcloud -q compute routes delete \
kubernetes-route-10-200-0-0-24 \
kubernetes-route-10-200-1-0-24 \
kubernetes-route-10-200-2-0-24
gcloud -q compute networks subnets delete kubernetes
gcloud -q compute networks delete kubernetes-the-hard-way
gcloud -q compute forwarding-rules delete kubernetes-forwarding-rule \
--region $(gcloud config get-value compute/region)
gcloud -q compute target-pools delete kubernetes-target-pool
gcloud -q compute http-health-checks delete kubernetes
gcloud -q compute addresses delete kubernetes-the-hard-way
}
This assumes you named your resources 'kubernetes-the-hard-way'. If you do not know the names, you can also use various filter mechanisms (by name, description, labels, etc.) to find and remove them.

Kubernetes get endpoints

I have a set of pods providing nsqlookupd service.
Now I need each nsqd container to have a list of all nsqlookupd servers to connect to simultaneously (whereas the service would point to a different one each time). I get something similar with
kubectl describe service nsqlookupd
...
Endpoints: ....
but I want to have it in a variable within my deployment definition, or somehow from within the nsqd container.
Sounds like you would need an extra service running either in your nsqd container or in a separate container in the same pod. The role of that service would be to poll the API regularly in order to fetch the list of endpoints.
Assuming that you enabled Service Accounts (enabled by default), here is a proof of concept on the shell using curl and jq from inside a pod:
# Read token and CA cert from Service Account
CACERT="/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"
TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
# Replace the namespace ("kube-system") and service name ("kube-dns")
ENDPOINTS=$(curl -s --cacert "$CACERT" -H "Authorization: Bearer $TOKEN" \
https://kubernetes.default.svc/api/v1/namespaces/kube-system/endpoints/kube-dns \
)
# Filter the JSON output
echo "$ENDPOINTS" | jq -r .subsets[].addresses[].ip
# output:
# 10.100.42.3
# 10.100.67.3
Take a look at the source code of Kube2sky for a good implementation of that kind of service in Go.
This could also be done with a StatefulSet: stable names plus stable storage.
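To expand on that: with a headless Service in front of a StatefulSet, each nsqlookupd pod gets a stable DNS name that can be listed statically in the nsqd arguments (a sketch; the Service name, namespace, and replica count are assumptions):
# Sketch: nsqd pointed at three nsqlookupd pods behind a headless Service
# named "nsqlookupd" in the "default" namespace (4160 is nsqlookupd's TCP port).
nsqd \
  --lookupd-tcp-address=nsqlookupd-0.nsqlookupd.default.svc.cluster.local:4160 \
  --lookupd-tcp-address=nsqlookupd-1.nsqlookupd.default.svc.cluster.local:4160 \
  --lookupd-tcp-address=nsqlookupd-2.nsqlookupd.default.svc.cluster.local:4160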