Kubernetes provisioner for PV in a StatefulSet with aws-ebs PV issue

I have followed the documentation on how to set up Kubernetes on AWS, including:
Adding provider=aws
Making sure the nodes have the correct IAM permissions
I keep getting the following error and am unsure where to find logs that would show the underlying cause of the failing AWS query:
Failed to provision volume with StorageClass "gp2": error querying for all zones: no instances returned

I faced the same issue and found the solution.
I hope this applies to your issue as well.
Every EC2 instance that is a worker node in your Kubernetes cluster should have the tag
kubernetes.io/cluster/CLUSTERNAME = owned
When you request a new PersistentVolume, Kubernetes will request it from AWS. AWS will then check which AZs have worker nodes, so that it doesn't create the volume in an AZ where there are no nodes.
It seems to do this by listing all EC2 instances with the tag kubernetes.io/cluster/CLUSTERNAME = owned
But if you have changed or removed this tag, so that it no longer matches your cluster name, you will get exactly the error message you got here.
Let's say you changed it to
kubernetes.io/cluster/CLUSTERNAME-default = owned
That would trigger the issue.
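As a quick check, you can inspect the tags on a node's EC2 instance with the AWS CLI and restore the tag if it has drifted. This is a sketch; the instance ID below is a placeholder, and CLUSTERNAME stands for your actual cluster name:
# Check whether the cluster tag is present on the instance
aws ec2 describe-tags --filters "Name=resource-id,Values=i-0123456789abcdef0" "Name=key,Values=kubernetes.io/cluster/CLUSTERNAME"
# Restore the tag if it is missing or was renamed
aws ec2 create-tags --resources i-0123456789abcdef0 --tags Key=kubernetes.io/cluster/CLUSTERNAME,Value=owned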

Related

401 Unauthorized error while trying to pull image from Google Container Registry

I am using Google Container Registry (GCR) to push and pull Docker images. I have created a deployment in Kubernetes with 3 replicas, which use a Docker image pulled from GCR.
Out of the 3 replicas, 2 are pulling the image and running fine. But the third replica shows the error below, and the pod's status remains "ImagePullBackOff" or "ErrImagePull":
"Failed to pull image "gcr.io/xxx:yyy": rpc error: code = Unknown desc
= failed to pull and unpack image "gcr.io/xxx:yyy": failed to resolve reference "gcr.io/xxx:yyy": unexpected status code: 401 Unauthorized"
I am confused why only one of the replicas shows the error while the other 2 run without any issue. Can anyone please clarify this?
Thanks in advance!
ImagePullBackOff and ErrImagePull indicate that the image used by a container cannot be loaded from the image registry.
401 unauthorized error might occur when you pull an image from a private Container Registry repository. For troubleshooting the error:
Identify the node that runs the pod by kubectl describe pod POD_NAME | grep "Node:"
Verify the node has the storage scope by running the command
gcloud compute instances describe NODE_NAME --zone=COMPUTE_ZONE --format="flattened(serviceAccounts[].scopes)"
The node's access scope should contain at least one of the following:
serviceAccounts[0].scopes[0]: https://www.googleapis.com/auth/devstorage.read_only
serviceAccounts[0].scopes[0]: https://www.googleapis.com/auth/cloud-platform
You cannot modify the access scopes of existing nodes; to fix this, recreate the node pool that the node belongs to with sufficient scope.
Create a new node pool with the gke-default scope using the following command:
gcloud container node-pools create NODE_POOL_NAME --cluster=CLUSTER_NAME --zone=COMPUTE_ZONE --scopes="gke-default"
Or create a new node pool with only the storage scope:
gcloud container node-pools create NODE_POOL_NAME --cluster=CLUSTER_NAME --zone=COMPUTE_ZONE --scopes="https://www.googleapis.com/auth/devstorage.read_only"
Refer to the link for more information on the troubleshooting process.
You will need to set up a role for the cluster to access GCR images for pulling and pushing; see https://github.com/GoogleContainerTools/skaffold/issues/336
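As an alternative to relying on node scopes, a common workaround is to pull with an explicit image pull secret backed by a service account key. This is a sketch, assuming a key file key.json with read access to the registry; the secret name and email are placeholders:
# _json_key is the documented username for key-based auth to gcr.io
kubectl create secret docker-registry gcr-json-key --docker-server=gcr.io --docker-username=_json_key --docker-password="$(cat key.json)" --docker-email=user@example.com
# Attach the secret to the default service account so new pods use it
kubectl patch serviceaccount default -p '{"imagePullSecrets": [{"name": "gcr-json-key"}]}'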

Missing edit permissions on a kubernetes cluster on GCP

This is a Google Cloud specific problem.
I returned from vacation and noticed I can no longer manage workloads or the cluster due to this error: "Missing edit permissions on account"
I am the sole person with access to this account (owner role), and yet I see this issue.
The troubleshooting guide suggests checking the system service account role, and it looks like it's set up correctly (why would it not be, if I haven't edited it?).
If it's not set up correctly, the guide suggests turning the Kubernetes API off and on again, but when you press "disable" there's a scary-looking prompt warning that your Kubernetes resources are going to be deleted, so obviously I can't do that.
Upon trying to connect to it I get
gcloud container clusters get-credentials cluster-1 --zone us-west1-b --project PROJECT_ID
Fetching cluster endpoint and auth data.
WARNING: cluster cluster-1 is not running. The kubernetes API may not be available.
In the logs I found a record (the last one) that is 4 days old:
"Readiness probe failed: Get http://10.20.0.5:44135/readiness: net/http: request canceled (Client.Timeout exceeded while awaiting headers)"
Anyone here has any ideas?
Thanks in advance.
The issue is solved:
I had to upgrade the node versions in the pool.
What a misleading error message.
Hopefully, this helps someone.
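For reference, a node pool upgrade can be started from the command line; a minimal sketch using the cluster name and zone from the question (default-pool is a placeholder for your node pool's name):
# Upgrades the pool's nodes to the cluster master's version
gcloud container clusters upgrade cluster-1 --node-pool=default-pool --zone=us-west1-b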

How to set a Kubernetes Node's name in EKS

I'm not having any luck changing the name of a worker node in AWS EKS. I have not been able to find any documentation on how the node is named by default.
Currently my nodes are named as follows, for example
NAME STATUS ROLES AGE VERSION
ip-10-241-111-216.us-west-2.compute.internal Ready 44m v1.14.7-eks-1861c5
I've tried passing in --hostname-override through the user data but it didn't seem to have any effect.
This is a known limitation in AWS EKS. It's discussed in this GitHub issue and it's still open.
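For context, kubelet flags on the EKS-optimized AMI are normally passed through the bootstrap script in the instance user data. A sketch of the attempt described in the question (the cluster name and hostname are placeholders); per the linked issue, the override does not take effect on EKS:
#!/bin/bash
# Passes extra kubelet arguments at node bootstrap; EKS ignores the hostname override
/etc/eks/bootstrap.sh CLUSTER_NAME --kubelet-extra-args '--hostname-override=my-custom-node-name'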

RouteController failed to create a route on GKE

I have a cluster on GKE whose node pool I create when I want to use the cluster, and delete when I'm done with it.
It's a two-node cluster with the master in europe-west2-a and node zones europe-west2-a and europe-west2-b.
The most recent creation resulted in the node in zone B failing with NetworkUnavailable, because RouteController failed to create a route. The reason given was Could not create route xxx 10.244.1.0/24 for node xxx after 342.263706ms: instance not found.
Why would this be happening all of a sudden, and what can I do to fix it?!
You didn't mention which version of GKE you are using, so just for clarification:
Changes in access scopes
Beginning with Kubernetes version 1.10, gcloud and the GCP Console no longer grant the compute-rw access scope on new clusters and new node pools by default. Furthermore, if --scopes is specified in gcloud container clusters create, gcloud no longer silently adds compute-rw or storage-ro.
In any case you can still revert to legacy access scopes, but this is not the recommended approach.
Hope this helps.
With GKE 1.13.6-gke.13, some of the default scopes were changed, including the compute-rw scope being removed. I think that, due to the age of the cluster, this scope was necessary for a route to be created correctly between nodes in a node pool.
In the end, my gcloud creation command had these scopes:
--scopes https://www.googleapis.com/auth/projecthosting,storage-rw,monitoring,trace,compute-rw
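For completeness, a sketch of how that flag fits into a full node pool creation command; the pool name and cluster name are placeholders, and the zone is taken from the question:
# Recreates the node pool with the legacy compute-rw scope included
gcloud container node-pools create NODE_POOL_NAME --cluster=CLUSTER_NAME --zone=europe-west2-a --scopes=https://www.googleapis.com/auth/projecthosting,storage-rw,monitoring,trace,compute-rw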

GCE Image suddenly not found

I'm running kubernetes on GCE. I used kube-up.sh to create the cluster and the nodes and masters are all running the image gci-stable-56-9000-84-2. I deleted a few nodes today which triggered the autoscaler to create new ones. But they failed with the following error.
Instance 'kubernetes-minion-30gb-20180131-9jwn' creation failed: The resource 'projects/google-containers/global/images/gci-stable-56-9000-84-2' was not found (when acting as 'REDACTED')
Is it possible this image was deleted somehow? I don't think I changed any access controls or permissions for any service accounts.
The image is present on this page:
https://cloud.google.com/container-optimized-os/docs/release-notes#cos-stable-56-9000-84-2
This error could be due to authentication issues. Re-authenticate to the gcloud command-line tool with the command gcloud auth login
It could also be that the Kubernetes Engine service account has been deleted or edited. Check this: https://cloud.google.com/kubernetes-engine/docs/troubleshooting#error_404
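To check whether the image itself is still available, you can describe it directly; this assumes the image name and project from the error message:
gcloud compute images describe gci-stable-56-9000-84-2 --project=google-containers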