GKE Metadata server errors - metadata

I have a GKE cluster with Workload Identity enabled.
Most of our workloads use the Cloud Storage or Cloud Logging client packages, which means they already rely on Workload Identity for GCP access.
Recently we started adding Secret Manager to the stack and began encountering random Metadata Server errors on workload startup. They happen across different frameworks.
Python:
  File "/venv/lib/python3.8/site-packages/google/auth/compute_engine/credentials.py", line 117, in refresh
    six.raise_from(new_exc, caught_exc)
  File "<string>", line 3, in raise_from
google.auth.exceptions.RefreshError: ("Failed to retrieve http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/?recursive=true from the Google Compute Enginemetadata service. Status: 404 Response:\nb'Not Found\\n'", <google.auth.transport.requests._Response object at 0x7f3a3084dd60>)
NodeJS:
failed to initialize. exiting. Error: 16 UNAUTHENTICATED: Failed to retrieve auth metadata with error: Could not refresh access token: network timeout at: http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token?scopes=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcloud-platform at Object
I’m trying to understand why it's happening.
First, 404 Not Found suggests we are requesting metadata that does not exist or was deleted. However, the workload recovers a few seconds later, so I'm not sure what is actually going on.
According to the documentation, the metadata server can take some time to become available, which would explain an error that "recovers" afterwards. The recommendation is to add delays in the app code or to use init containers that wait until the Metadata Server is operational.
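As a rough illustration of the in-app delay mentioned above, here is a minimal Python sketch that polls the token endpoint with a capped backoff before the application creates any GCP clients. The function names (`metadata_ready`, `wait_until_ready`), the attempt count, and the delays are assumptions for illustration, not values taken from the GKE documentation.

```python
import time
import urllib.error
import urllib.request

# Token endpoint of the metadata server (same path as in the NodeJS error above).
METADATA_URL = (
    "http://metadata.google.internal/computeMetadata/v1/"
    "instance/service-accounts/default/token"
)


def metadata_ready(url=METADATA_URL, timeout=2.0):
    """Return True if the metadata server answers the token request with HTTP 200."""
    request = urllib.request.Request(url, headers={"Metadata-Flavor": "Google"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Covers 404 responses, connection errors, and timeouts.
        return False


def wait_until_ready(probe=metadata_ready, attempts=10, delay=1.0, sleep=time.sleep):
    """Call probe() up to `attempts` times, doubling the delay (capped at 8s) between tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)
            delay = min(delay * 2, 8.0)
    return False
```

Calling `wait_until_ready()` once at startup, and exiting if it returns False, would let Kubernetes restart the container instead of the app failing later with a less obvious error.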
I wonder whether adding an init container to all of our workloads is really the best approach, and whether this is really our case, since the error code is a bit misleading. I'm also not sure why it only started when we added Secret Manager.

This sometimes happens because of OOM issues on the Metadata Server. You can check the status of the pod running the metadata server with:
kubectl -n kube-system describe pods <pod_name>
You can get the pod name with:
kubectl get pods --namespace kube-system
The pod name starts with the prefix gke-metadata-server-.
If you see something like the following in the output when you describe the pod:
Last State: Terminated
Reason: OOMKilled
then that indicates an OOM issue.
Some mitigations you can try:
Check whether you have unused ServiceAccounts in your cluster and whether you can remove them.
Check whether you are creating too many clients (for example, a new one for every API request). Sharing clients where possible reduces token refresh calls to the Metadata Server and so saves memory.
Check whether you can find the metadata server's definition under /etc/kubernetes/addons/. If you can, increase its memory in that definition and apply the updated config.
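To illustrate the client-sharing mitigation, here is a minimal Python sketch that builds a client once and reuses it for every later call. The helper name `shared_client` is hypothetical; the commented usage assumes the `google-cloud-secret-manager` package is installed.

```python
import functools


@functools.lru_cache(maxsize=None)
def shared_client(factory):
    """Build the client once per factory and return the same instance afterwards."""
    return factory()


# Hypothetical usage (requires google-cloud-secret-manager):
#   from google.cloud import secretmanager
#   client = shared_client(secretmanager.SecretManagerServiceClient)
```

Every call site then asks `shared_client(...)` for the client instead of constructing its own, so credentials are fetched and refreshed for one client object rather than one per request. Under concurrent first calls the factory may run more than once, so creating the client during startup is the safer pattern.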

GCP Alerting Policy for failed GKE CronJob

What would be the best way to set up a GCP monitoring alert policy for a Kubernetes CronJob failing? I haven't been able to find any good examples out there.
Right now, I have an OK solution based on monitoring logs in the Pod with ERROR severity. I've found this to be quite flaky, however. Sometimes a job will fail for some ephemeral reason outside my control (e.g., an external server returning a temporary 500) and on the next retry, the job runs successfully.
What I really need is an alert that is only triggered when a CronJob is in a persistent failed state. That is, Kubernetes has tried rerunning the whole thing, multiple times, and it's still failing. Ideally, it could also handle situations where the Pod wasn't able to come up either (e.g., downloading the image failed).
Any ideas here?
Thanks.
First of all, confirm the GKE version you are running. The following commands show the default GKE version and the available versions:
Default version.
gcloud container get-server-config --flatten="channels" --filter="channels.channel=RAPID" \
--format="yaml(channels.channel,channels.defaultVersion)"
Available versions.
gcloud container get-server-config --flatten="channels" --filter="channels.channel=RAPID" \
--format="yaml(channels.channel,channels.validVersions)"
Now that you know your GKE version: since you want an alert that is only triggered when a CronJob is in a persistent failed state, note that GKE Workload Metrics used to be GCP's fully managed, highly configurable way of sending all Prometheus-compatible metrics emitted by GKE workloads (such as a CronJob or a Deployment for an application) to Cloud Monitoring. It is deprecated as of GKE 1.24 and was replaced by Google Cloud Managed Service for Prometheus, so the latter is the best option inside GCP: it lets you monitor and alert on your workloads using Prometheus without having to manually manage and operate Prometheus at scale.
You also have two options outside GCP: Prometheus itself and Ranch’s Prometheus Push Gateway.
Finally, just FYI, this can also be done manually by querying the job, reading its start time, and comparing that to the current time. In bash:
START_TIME=$(kubectl -n=your-namespace get job your-job-name -o json | jq -r '.status.startTime')
echo "$START_TIME"
Or you can get the job's current status as a JSON blob:
kubectl -n=your-namespace get job your-job-name -o json | jq '.status'
You can see the following thread for more reference too.
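The "compare the start time to the current time" step is not shown in the bash snippet above. A minimal Python sketch of it, assuming the Job JSON comes from `kubectl get job ... -o json` and that `startTime` uses the usual `YYYY-MM-DDTHH:MM:SSZ` form, could look like this (`job_age_seconds` is a hypothetical helper name):

```python
import json
from datetime import datetime, timezone


def job_age_seconds(job_json, now=None):
    """Return seconds since .status.startTime, or None if the Job has not started."""
    status = json.loads(job_json).get("status", {})
    start = status.get("startTime")
    if start is None:
        return None
    # Kubernetes timestamps are UTC, e.g. "2023-05-01T10:00:00Z".
    started = datetime.strptime(start, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return (now - started).total_seconds()
```

You could pipe the kubectl output into a small script that calls this function and flags jobs that have been running longer than you expect.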
Taking the “Failed” state as the central point of your requirement, a bash script that uses kubectl and sends an email when a job is in the “Failed” state can be useful. Here are some examples:
while true; do if kubectl get jobs myjob -o jsonpath='{.status.conditions[?(@.type=="Failed")].status}' | grep -q True; then mail -s jobfailed email@address < /dev/null; else sleep 1; fi; done
For newer K8s:
while true; do kubectl wait --for=condition=failed job/myjob && mail -s jobfailed email@address < /dev/null; done
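The same "Failed" check can be done on the JSON status blob, which makes it easier to include the failure reason in the alert. A Job only gets a `Failed` condition once Kubernetes has given up on it (for example with reason `BackoffLimitExceeded` after the retries are exhausted), which matches the "persistent failed state" requirement. This is a minimal sketch; `failed_condition` is a hypothetical helper, and the input is assumed to be the output of `kubectl get job ... -o json`.

```python
import json


def failed_condition(job_json):
    """Return the Job's Failed condition (a dict) if its status is "True", else None."""
    status = json.loads(job_json).get("status", {})
    for cond in status.get("conditions") or []:
        if cond.get("type") == "Failed" and cond.get("status") == "True":
            return cond
    return None
```

The returned condition carries `reason` and `message` fields that can go straight into the email body or a log line that an alerting policy matches on.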

FailedToUpdateEndpoint in kubernetes

I have a Kubernetes cluster with some deployments and pods. I have run into an issue with my deployments, with error messages like FailedToUpdateEndpoint and Readiness probe failed.
These errors are unexpected and I have no idea what causes them. When we analysed our logs, it looked as if someone was trying to hack our cluster (though I'm not sure about that).
Things I'd like to clarify:
1. Is there any chance someone can illegally access our Kubernetes cluster without having the kubeconfig?
2. Is there any chance that someone could use the frontend IP to access our apps and change cluster configuration (that is, hack the cluster services via the web URL)?
3. Even if the cluster is accessed illegally via the frontend URL, is there any chance of changing the cluster configuration?
4. Is there any mechanism to detect whether the Kubernetes cluster is in a healthy state or has been hacked?
The points above focus on whether there are any security-related issues with Kubernetes Engine. If not,
then:
5. I am still working on finding the reason for these errors. Please provide more information on what may be causing them.
Error Messages:
FailedToUpdateEndpoint: Failed to update endpoint default/job-store: Operation cannot be fulfilled on endpoints "job-store": the object has been modified; please apply your changes to the latest version and try again
The same error happens for all the pods in our cluster.
Readiness probe failed: Error verifying datastore: Get https://API_SERVER: context deadline exceeded; Error reaching apiserver: taking a long time to check apiserver

How to find the reason of a pod crashing?

Is there a way to see why a Kubernetes pod is failing with the status "CrashLoopBackOff" under heavy load?
I have a HorizontalPodAutoscaler which never kicks in. Its status always shows low (under 50%) CPU and memory usage.
Tailing the application logs within the pods doesn't give any insights either.
Try looking at the Kubernetes events: kubectl get events --sort-by='.lastTimestamp'
If you don't get anything meaningful out of the events, go to the specific node and check the kubelet logs: journalctl -u kubelet
To get logs from a pod you should use:
kubectl logs [podname] -p
You can also check the kubelet logs, but those are mostly cluster-level logs.
If there are no logs, your application did not produce any before the crash. You would need to change the app, for example to write a memory dump on crash.
You mentioned that the pod dies under heavy load but the stats show only 50% utilization. You should log in to the pod and check the load yourself, and perhaps check how many files are open, because you may be hitting the limit.
You can read the Kubernetes docs about Application Introspection and Debugging and go over Debugging CrashLoopBackoffs with Init-Containers.
You can also try running your image in Docker and checking the logs there. There is good documentation on Logs and troubleshooting available.
If you provide more details we might be more helpful.
Below are some common reasons for CrashLoopBackOff that I have observed:
waiting for some condition to be fulfilled, e.g. some secrets, a failing health check, etc.
the pod is running with Burstable or BestEffort QoS and is getting killed because resources are unavailable on the node
You can run this script to find the possible issues for pods in a namespace: https://github.com/dguyhasnoname/k8s-day2-ops/blob/master/namespace_scripts/debug_app_namespace.sh
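As a small complement, the two reasons above can often be confirmed from the pod's own status. The following Python sketch (with a hypothetical helper name, `crash_summary`) reads the output of `kubectl get pod <pod> -o json` and reports the QoS class plus each container's restart count and last termination reason, for example `OOMKilled`.

```python
import json


def crash_summary(pod_json):
    """Summarise the QoS class and each container's last termination from pod JSON."""
    status = json.loads(pod_json).get("status", {})
    containers = []
    for cs in status.get("containerStatuses") or []:
        # lastState.terminated is only present if the container has exited before.
        terminated = (cs.get("lastState") or {}).get("terminated") or {}
        containers.append({
            "name": cs.get("name"),
            "restarts": cs.get("restartCount", 0),
            "reason": terminated.get("reason"),
            "exit_code": terminated.get("exitCode"),
        })
    return {"qos": status.get("qosClass"), "containers": containers}
```

A reason of `OOMKilled` together with a `Burstable` or `BestEffort` QoS class would point at memory pressure rather than an application bug.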

Creating a Kubernetes Service with Pulumi up results in error Could not create watcher for Endpoint objects associated with Service

I'm trying to use Pulumi to create a Deployment with a linked Service in a Kubesail cluster. The Deployment is created fine but when Pulumi tries to create the Service an error is returned:
kubernetes:core:Service (service):
error: Plan apply failed: resource service was not successfully created by the Kubernetes API server : Could not create watcher for Endpoint objects associated with Service "service": unknown
The Service is correctly created in Kubesail, and the error makes it glaringly obvious that Pulumi can't do its neat monitoring, but the unknown error isn't so neat!
What might be being denied on the Kubernetes cluster such that Pulumi can't do the monitoring that would be different between a Deployment and a Service? Is there a way to skip the watching that I missed in the docs to get me past this?
I dug a little into the Pulumi source code, found the resource kinds it tracks, and ran kubectl auth can-i against them. Lo and behold, watching an endpoint is currently denied, but watching replicaSets and the Service itself is not.
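For anyone wanting to repeat that check, here is a minimal Python sketch that runs `kubectl auth can-i` for a list of verb/resource pairs and returns the ones that are denied. The helper names (`can_i`, `denied`) and the list of pairs are assumptions for illustration; the exact set of resources Pulumi watches is not reproduced here.

```python
import subprocess


def can_i(verb, resource, namespace=None, run=subprocess.run):
    """Return True if `kubectl auth can-i <verb> <resource>` prints "yes"."""
    cmd = ["kubectl", "auth", "can-i", verb, resource]
    if namespace:
        cmd += ["--namespace", namespace]
    # kubectl exits non-zero on "no", so the exit code is deliberately not checked.
    result = run(cmd, capture_output=True, text=True)
    return result.stdout.strip() == "yes"


def denied(checks, **kwargs):
    """Return the (verb, resource) pairs from `checks` that are not allowed."""
    return [(verb, resource) for verb, resource in checks if not can_i(verb, resource, **kwargs)]
```

For example, `denied([("watch", "services"), ("watch", "replicasets"), ("watch", "endpoints")])` would list whichever of those watches the current credentials cannot perform.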

k8s API server is down due to misconfiguration, how to bring it up again?

I was trying to add a command line flag to the API server. In my setup, it was running as a daemon set inside the k8s cluster so I got the daemon set manifest using kubectl, updated it, and executed kubectl apply -f apiserver.yaml (I know, this was not a good idea).
Of course, the new YAML file I wrote had an error, so the API server no longer starts and I can't use kubectl to fix it. I have an SSH connection to the node where it was running, and I can see the kubelet trying to run the apiserver pod every few seconds with the ill-formed command. I am trying to configure the kubelet service to use the correct api-server command but have not been able to do so.
Any ideas?
The API server definition usually lives in /etc/kubernetes/manifests. Edit the configuration there rather than at the API level.