Missing edit permissions on a kubernetes cluster on GCP - kubernetes

This is a Google Cloud specific problem.
I returned from vacation and noticed I can no longer manage workloads or the cluster due to this error: "Missing edit permissions on account".
I am the sole person with access to this account (owner role), and yet I see this issue.
The troubleshooting guide suggests checking the system service account's role, and it looks like it's set up correctly (why wouldn't it be, since I haven't edited it).
If it's not set up correctly, the guide suggests turning the Kubernetes API on GCP off and back on, but when you press "Disable" there's a scary-looking prompt saying that your Kubernetes resources are going to be deleted, so obviously I can't do that.
Upon trying to connect to the cluster I get:
gcloud container clusters get-credentials cluster-1 --zone us-west1-b --project PROJECT_ID
Fetching cluster endpoint and auth data.
WARNING: cluster cluster-1 is not running. The kubernetes API may not be available.
In the logs I found a record (the last one) that is 4 days old:
"Readiness probe failed: Get http://10.20.0.5:44135/readiness: net/http: request canceled (Client.Timeout exceeded while awaiting headers)"
Does anyone here have any ideas?
Thanks in advance.

The issue is solved: I had to upgrade the node versions in the pool.
What a misleading error message.
Hopefully, this helps someone.
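For reference, a node pool upgrade can be done from the gcloud CLI. A minimal sketch, assuming the cluster above (cluster-1 in us-west1-b) and a node pool named default-pool (substitute your own pool name):

# Check which versions the nodes and master are currently running
gcloud container clusters describe cluster-1 --zone us-west1-b --format="value(currentNodeVersion,currentMasterVersion)"

# Upgrade the node pool to the cluster's current master version
gcloud container clusters upgrade cluster-1 --node-pool default-pool --zone us-west1-b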

Related

kubectl command to GKE Autopilot sometimes returns forbidden error

Environment:
GKE Autopilot v1.22.12-gke.2300
kubectl commands run from an Ubuntu 20.04 VM
using the gke-gcloud-auth-plugin
What happens:
The kubectl command sometimes returns a (Forbidden) error, e.g.:
kubectl get pod
Error from server (Forbidden): pods is forbidden: User "my-mail#domain.com" cannot list resource "pods" in API group "" in the namespace "default": GKEAutopilot authz: the request was sent before policy enforcement is enabled
It doesn't happen every time (roughly 40% of requests), so it must not be an IAM problem.
Before, on GKE Autopilot v1.21.x I believe, this error didn't happen, or at least not nearly as frequently.
I couldn't find any helpful info even when I searched for "GKEAutopilot authz" or "the request was sent before policy enforcement is enabled".
I hope someone who has faced the same issue has an idea.
Thank you in advance
I asked Google Cloud support.
They said it was a bug on the GKE master, and they have fixed it.
This problem doesn't happen anymore.
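For anyone hitting similar transient authorization errors while waiting on a fix, a simple retry wrapper is one possible stopgap. A minimal sketch, assuming bash and kubectl are available:

# Retry the kubectl call up to 5 times with a short pause between attempts
for i in 1 2 3 4 5; do
  kubectl get pod && break
  echo "attempt $i failed, retrying..." >&2
  sleep 5
done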

Unable to enter a pod in the gke cluster

We have our k8s cluster set up with our app, including a neo4j DB deployment and other artifacts. Overnight, we've started facing an issue in our GKE cluster when trying to enter or otherwise interact with any pod running in the cluster. The following is a sample of the error we get when issuing a command against a pod:
error: unable to upgrade connection: Authorization error (user=kube-apiserver, verb=create, resource=nodes, subresource=proxy)
Our GKE cluster is created as standard (not Autopilot); the node pool details and cluster basics screenshots show the versions in use.
As said before, it was working fine despite the warning about the versions. However, we haven't yet been able to identify what could have changed between the last time it worked and now.
Any clue about what authorization setup might have changed to make it incompatible now is very welcome.
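The error suggests that the identity the API server uses (user=kube-apiserver) is being denied the nodes/proxy permission it needs to tunnel exec/attach requests to the kubelet. One starting point is to check what that identity is currently allowed to do and which cluster role bindings grant kubelet access. A diagnostic sketch, assuming cluster-admin access with kubectl (binding names vary between clusters, so treat the grep as a rough filter):

# Is the kube-apiserver user allowed to create nodes/proxy (needed for exec/attach)?
kubectl auth can-i create nodes/proxy --as=kube-apiserver

# Look for cluster role bindings that grant apiserver-to-kubelet access
kubectl get clusterrolebindings -o wide | grep -i kubelet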

FailedToUpdateEndpoint in kubernetes

I have a kubernetes cluster with some deployments and pods. I have experienced an issue with my deployments, with error messages like FailedToUpdateEndpoint and ReadinessProbeFailed.
These errors are unexpected and I have no idea about them. When we analysed our logs, it looked like someone was trying to hack our cluster (not sure about that).
Things to be clear about:
1. Is there any chance someone can illegally access our kubernetes cluster without having the kubeconfig?
2. Is there any chance that, by using the frontend IP, someone can access our apps and make changes to the cluster configuration (i.e. hack the cluster services via the web URL)?
3. Even if the cluster is accessed illegally via the frontend URL, is there any chance to change the configuration in the cluster?
4. Is there any mechanism to detect whether the kubernetes cluster is in a healthy state or has been hacked by someone?
The points above ask whether there are any security-related issues with the Kubernetes engine. If not, then:
5. I am still working on finding the reason for these errors. Please provide more information on what may be causing them.
Error Messages:
FailedToUpdateEndpoint: Failed to update endpoint default/job-store: Operation cannot be fulfilled on endpoints "job-store": the object has been modified; please apply your changes to the latest version and try again
The same error happens for all the pods in our cluster.
Readiness probe failed: Error verifying datastore: Get https://API_SERVER: context deadline exceeded; Error reaching apiserver: taking a long time to check apiserver
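As a note on the first message: "the object has been modified; please apply your changes to the latest version and try again" is an optimistic-concurrency conflict that controllers normally retry on their own, so on its own it is not evidence of tampering. A few commands that can help with the investigation, a sketch assuming the job-store service lives in the default namespace:

# Inspect the endpoints object and its recent events
kubectl describe endpoints job-store -n default

# Recent events in the namespace, newest last
kubectl get events -n default --sort-by=.lastTimestamp

# Review who has been granted access to the cluster
kubectl get clusterrolebindings
kubectl get rolebindings --all-namespaces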

kubernetes can't pull certain images from ibm cloud registry

My pod shows the following event:
Warning Failed 21m (x4 over 23m) kubelet, 10.76.199.35 Failed to pull image "registryname/image:version1.2": rpc error: code = Unknown desc = Error response from daemon: unauthorized: authentication required
but other images will work. The output of
ibmcloud cr images
doesn't show anything different about the images that don't work. What could be going wrong here?
Given this is in kubernetes and you can see the image in ibmcloud cr images, it is most likely a misconfiguration of your imagePullSecrets.
If you do kubectl get pod <pod-name> -o yaml you will be able to see what imagePullSecrets are in scope for the pod and check whether they look correct (it could be worth comparing them to a pod that is working).
It's worth noting that if your cluster is an instance of the IBM Cloud Kubernetes Service, a default imagePullSecret for your account is added to the default namespace, so if you are running the pod in a different Kubernetes namespace you will need to do additional steps to make that work. This is a good place to start for information on this topic:
https://console.bluemix.net/docs/containers/cs_images.html#other
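A minimal sketch of copying the default registry pull secret into another namespace, assuming the secret in the default namespace is named default-icr-io and the target namespace is my-namespace (both names are placeholders; check the real secret name with the first command):

# Find the registry pull secrets in the default namespace
kubectl get secrets -n default | grep icr

# Copy one into the target namespace
kubectl get secret default-icr-io -n default -o yaml \
  | sed 's/namespace: default/namespace: my-namespace/' \
  | kubectl create -f -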
It looks like you haven't logged in to the IBM Cloud Container Registry. If you haven't done this yet, you should log in with this command:
ibmcloud cr login
Other possible issues:
Docker is not installed.
The Docker client is not logged in to IBM Cloud Container Registry.
Your IBM Cloud access token might have expired.
You can find more troubleshooting instructions here

GCE Image suddenly not found

I'm running kubernetes on GCE. I used kube-up.sh to create the cluster, and the nodes and masters are all running the image gci-stable-56-9000-84-2. I deleted a few nodes today, which triggered the autoscaler to create new ones, but they failed with the following error.
Instance 'kubernetes-minion-30gb-20180131-9jwn' creation failed: The resource 'projects/google-containers/global/images/gci-stable-56-9000-84-2' was not found (when acting as 'REDACTED')
Is it possible this image was deleted somehow? I don't think I changed any access controls or permissions for any service accounts.
The image is present on this page:
https://cloud.google.com/container-optimized-os/docs/release-notes#cos-stable-56-9000-84-2
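One way to check directly whether the image is still visible from your project is to ask Compute Engine for it. A sketch assuming an authenticated gcloud CLI; note that newer Container-Optimized OS releases are published under the cos-cloud project rather than google-containers:

# Does the exact image the autoscaler is requesting still exist?
gcloud compute images describe gci-stable-56-9000-84-2 --project google-containers

# List currently published Container-Optimized OS stable images
gcloud compute images list --project cos-cloud --filter="family~cos-stable"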
This error could be due to authentication issues. Re-authenticate to the gcloud command-line tool with the command gcloud auth login.
It could also be that the Kubernetes Engine service account has been deleted or edited. Check this: https://cloud.google.com/kubernetes-engine/docs/troubleshooting#error_404
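A quick way to check whether the Kubernetes Engine service agent still holds its role, a sketch assuming the gcloud CLI and your own project ID (the account normally looks like service-PROJECT_NUMBER@container-engine-robot.iam.gserviceaccount.com):

# List IAM bindings for the Kubernetes Engine Service Agent role
gcloud projects get-iam-policy PROJECT_ID \
  --flatten="bindings[].members" \
  --filter="bindings.role:roles/container.serviceAgent" \
  --format="table(bindings.role,bindings.members)"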