Creating a Kubernetes Service with pulumi up results in error "Could not create watcher for Endpoint objects associated with Service"

I'm trying to use Pulumi to create a Deployment with a linked Service in a Kubesail cluster. The Deployment is created fine but when Pulumi tries to create the Service an error is returned:
kubernetes:core:Service (service):
error: Plan apply failed: resource service was not successfully created by the Kubernetes API server : Could not create watcher for Endpoint objects associated with Service "service": unknown
The Service is actually created correctly in Kubesail, and the error makes it glaringly obvious that Pulumi can't do its neat monitoring, but the "unknown" at the end isn't so neat!
What might be denied on the Kubernetes cluster, such that Pulumi can't do its monitoring, that would differ between a Deployment and a Service? Is there a way to skip the watching that I missed in the docs to get me past this?

I dug a little into the Pulumi source code, found the resource kinds it uses for tracking, and checked them with kubectl auth can-i, and lo and behold, watching an Endpoint is currently denied, while watching ReplicaSets and the Services themselves is not.
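As a follow-up, granting watch on Endpoints to whatever identity Pulumi authenticates as should be enough to let its readiness tracking work. A minimal sketch of such a Role (the name and namespace here are hypothetical, not from the question):

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pulumi-endpoint-watcher   # hypothetical name
  namespace: my-namespace         # assumption: the namespace the Service lives in
rules:
- apiGroups: [""]
  resources: ["endpoints"]
  verbs: ["get", "list", "watch"]

If changing RBAC on a managed Kubesail cluster isn't an option, pulumi-kubernetes also has, if I remember correctly, a pulumi.com/skipAwait annotation that disables the await/watch logic for a resource.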

Related

How can I see which kubernetes user is creating the deployment and what type of authentication is used?

I am trying to see which kubernetes user is creating the deployment and what type of authentication is used (basic auth, token, etc).
I tried to do it using this:
kubectl describe deployment/my-workermole
but I am not finding that type of information in there.
The cluster is not managed by me, and I am not able to find it in the deployment Jenkinsfile either. Where and how can I find that type of information for my Kubernetes deployment, but after it has been deployed?
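For what it's worth, the Deployment object itself does not record which user created it; the closest thing on the object is metadata.managedFields, which records the client (manager) that wrote each field, not the user identity or the auth method. That information only shows up in the API server audit logs, if auditing is enabled on the cluster. A couple of read-only checks you can run anyway (the deployment name is taken from the question):

# shows which client/manager last wrote each field, not the user identity
kubectl get deployment my-workermole -o yaml --show-managed-fields
# shows who *you* are authenticated as right now, not who deployed the workload
kubectl config view --minify
kubectl auth whoami    # available in newer kubectl versions only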

Terraform dial tcp 192.xx.xx.xx:443: i/o timeout error

I am trying to implement CI/CD using GitLab + Terraform against a K8s cluster; the K8s control plane (master node) was set up on CentOS.
However, the pipeline job fails with the following error:
Error: Failed to get existing workspaces: Get "https://192.xx.xx.xx/api/v1/namespaces/default/secrets?labelSelector=tfstate%3Dtrue": dial tcp 192.xx.xx.xx:443: i/o timeout
From the error mentioned above (default/secrets?labelSelector=tfstate%3Dtrue), I assume it is related to a missing Terraform state secret in the default namespace.
Example (Terraform state secret, taken from my Windows machine):
PS C:\> kubectl get secret
NAME                    TYPE                                  DATA   AGE
default-token-7mzv6     kubernetes.io/service-account-token   3      27d
tfstate-default-state   Opaque                                1      15h
However, I am not sure which process would create the tfstate secret, or whether we should create it manually.
Kindly let me know if my understanding is wrong or if I have missed anything else.
EDIT
The issue mentioned above occurred because the existing GitLab runner was on a different subnet (e.g. 172.xx.xx.xx instead of 192.xx.xx.xx).
I was asked to use a different GitLab runner that runs on the same subnet, and now it throws the following error:
Error: Failed to get existing workspaces: Get "https://192.xx.xx.xx:6443/api/v1/namespaces/default/secrets?labelSelector=tfstate%3Dtrue": x509: certificate signed by unknown authority
Now I am a bit confused whether the certificate issue is between the GitLab runner and the GitLab server, between the GitLab server and the K8s cluster, or something else.
You have configured Kubernetes as the remote state backend for your Terraform configuration. The error means that the backend is trying to query existing secrets to determine which workspaces are configured. The x509: certificate signed by unknown authority indicates that the KUBECONFIG the remote state backend uses does not match the CA of the API server you're connecting to.
If the runners are K8s pods themselves, make sure you provide a KUBECONFIG that matches your target cluster, and that the remote state does not configure itself as in-cluster by reading the service account token every K8s pod has - which in most cases will only work for the cluster the pod is running on.
You don't provide enough information to be more specific. But big picture, you have to configure the state backend and any provider that connects to K8s. Theoretically, the state backend secrets and the K8s resources do not have to be on the same cluster, meaning you may need different configurations for the state backend and the K8s providers.
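For completeness: the tfstate-* secret in the question is created automatically by the kubernetes backend on the first init/apply; you don't create it by hand. A sketch of an explicit backend configuration that points at a specific kubeconfig instead of falling back to in-cluster credentials (the paths and names here are assumptions):

terraform {
  backend "kubernetes" {
    secret_suffix = "state"          # produces secrets named tfstate-<workspace>-state
    namespace     = "default"
    config_path   = "~/.kube/config" # assumption: a kubeconfig whose CA matches the target API server
  }
}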

Istio on GKE in Autopilot mode

Hi there. I was reviewing GKE Autopilot mode and noticed that in the cluster configuration Istio is disabled and I'm not able to change it. Installation via istioctl install also fails with the following error:
error installer failed to update resource with server-side apply for obj MutatingWebhookConfiguration//istio-sidecar-injector: mutatingwebhookconfigurations.admissionregistration.k8s.io "istio-sidecar-injector" is forbidden: User "something#example" cannot patch resource "mutatingwebhookconfigurations" in API group "admissionregistration.k8s.io" at the cluster scope: GKEAutopilot authz: cluster scoped resource "mutatingwebhookconfigurations/" is managed and access is denied
Am I correct that it's not possible to run Istio in GKE Autopilot mode?
TL;DR
It is not possible at the moment to run Istio in GKE Autopilot mode.
Conclusion
If you are using Autopilot, you don't need to manage your nodes. You don't have to worry about operations such as updating, scaling or changing the operating system. However, Autopilot has a number of limitations.
Even if you try to install Istio with the command istioctl install, it will not be installed. You will see the following output:
This will install the Istio profile into the cluster. Proceed? (y/N) y
✔ Istio core installed
✔ Istiod installed
✘ Ingress gateways encountered an error: failed to wait for resource: resources not ready after 5m0s: timed out waiting for the condition
Deployment/istio-system/istio-ingressgateway
Pruning removed resources 2021-05-07T08:24:40.974253Z warn installer retrieving resources to prune type admissionregistration.k8s.io/v1beta1, Kind=MutatingWebhookConfiguration: mutatingwebhookconfigurations.admissionregistration.k8s.io is forbidden: User "something#example" cannot list resource "mutatingwebhookconfigurations" in API group "admissionregistration.k8s.io" at the cluster scope: GKEAutopilot authz: cluster scoped resource "mutatingwebhookconfigurations/" is managed and access is denied not found
Error: failed to install manifests: errors occurred during operation
This command failed because, for sidecar injection, the installer tries to create a MutatingWebhookConfiguration called istio-sidecar-injector. This limitation is mentioned here.
For more information you can also read this page.
According to the documentation, it is not possible to create mutating admission webhooks:
You cannot create custom mutating admission webhooks for Autopilot clusters
Since Istio uses mutating webhooks to inject its sidecars, it will probably not work, which is also consistent with the error you get.
According to the documentation, this should be possible with GKE 1.21:
In GKE version 1.21.3-gke.900 and later, you can create validating and
mutating dynamic admission webhooks. However, Autopilot modifies the
admission webhooks objects to add a namespace selector which excludes the
resources in managed namespaces (currently, kube-system) from being
intercepted. Additionally, webhooks which specify one or more of following
resources (and any of their sub-resources) in the rules, will be rejected:
group: ""
resource: nodes
group: certificates.k8s.io
resource: certificatesigningrequests
group: authentication.k8s.io
resource: tokenreviews
https://cloud.google.com/kubernetes-engine/docs/concepts/autopilot-overview#webhooks_limitations
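If you are on one of those newer Autopilot versions, you can check what actually got created and how Autopilot rewrote the injector webhook after an install attempt, for example (the webhook name is taken from the error above):

# list the mutating webhooks, then inspect the injector for the namespaceSelector
# that Autopilot adds to exclude its managed namespaces
kubectl get mutatingwebhookconfigurations
kubectl get mutatingwebhookconfiguration istio-sidecar-injector -o yaml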

GKE Metadata server errors

I have a GKE with Workload identity enabled.
Most of our workloads use the Cloud Storage or Cloud Logging GCP packages, which means they actually use Workload Identity for GCP access.
Recently we've started adding Secret Manager to the stack and have started encountering random errors from the metadata server on workload startup. It happens across different frameworks.
Python:
File "/venv/lib/python3.8/site-packages/google/auth/compute_engine/credentials.py", line 117, in refresh six.raise_from(new_exc, caught_exc) File "<string>", line 3, in raise_from google.auth.exceptions.RefreshError: ("Failed to retrieve http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/?recursive=true from the Google Compute Enginemetadata service. Status: 404 Response:\nb'Not Found\\n'", <google.auth.transport.requests._Response object at 0x7f3a3084dd60>)
NodeJS:
failed to initialize. exiting. Error: 16 UNAUTHENTICATED: Failed to retrieve auth metadata with error: Could not refresh access token: network timeout at: http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token?scopes=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcloud-platform at Object
I'm trying to understand why this is happening.
First, 404 Not Found means we are trying to fetch metadata that does not exist or was deleted. The thing is, it recovers a few seconds later, so I'm not sure how exactly.
Based on the documentation, it can take some time for the metadata server to become available, hence the errors that 'recover' afterwards. So the recommendation is to add delays in the app code or to use init containers until the metadata server is up.
I wonder if adding an init container to all of our workloads is really the best approach, and whether that's really our case, as the error code is a bit misleading. Also, I'm not quite sure why it only started when we added Secret Manager.
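For context, the init-container approach the documentation suggests is essentially a loop that polls the metadata server until it responds; a minimal sketch (the image and timings below are my own assumptions, not from the docs):

initContainers:
- name: wait-for-gke-metadata-server
  image: curlimages/curl:8.5.0   # assumption: any small image with sh and curl works
  command:
  - sh
  - -c
  - |
    # keep polling the token endpoint until it returns a 2xx response
    until curl -sf -H "Metadata-Flavor: Google" \
      http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token > /dev/null; do
      echo "waiting for the metadata server..."
      sleep 2
    done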
This sometimes happens due to OOM issues on the metadata server. You can check the status of the pod running the metadata server using:
kubectl -n kube-system describe pods <pod_name>
You can get the pod_name using:
kubectl get pods --namespace kube-system
The pod name will start with the prefix gke-metadata-server-.
If you see something like the following in the output when you describe the pod:
Last State: Terminated
Reason: OOMKilled
then that would indicate an OOM issue.
Some mitigations that you can try:
Check if you have unused ServiceAccounts in your cluster and whether you can remove them.
Check if you are creating too many clients (a new one for every API request). Sharing clients where possible will reduce token refresh calls to the metadata server, thus saving memory.
Check if you can find the metadata server's definition under /etc/kubernetes/addons/. If you can, increase its memory limit and apply the updated config.
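A quick way to spot OOM kills across all of the metadata-server pods at once (the label selector is an assumption; adjust it if your DaemonSet uses a different one):

# list the metadata server pods and their restart counts
kubectl -n kube-system get pods -l k8s-app=gke-metadata-server
# look for "Reason: OOMKilled" in the last state of each pod
kubectl -n kube-system describe pods -l k8s-app=gke-metadata-server | grep -A 2 "Last State"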

Terraform Kubernetes provider with EKS fails on configmap

I've followed the instructions to create an EKS cluster in AWS using Terraform.
https://www.terraform.io/docs/providers/aws/guides/eks-getting-started.html
I've also copied the output for connecting to the cluster to ~/.kube/config-eks. I've verified this works, as I've been able to connect to the cluster and manually deploy containers. However, now I'm trying to use the Terraform Kubernetes provider to connect to the cluster, but cannot seem to configure the provider properly.
I've configured the provider to use my kubectl configuration, but when attempting to push a simple ConfigMap I get an error stating the following:
configmaps is forbidden: User "system:anonymous" cannot create configmaps in the namespace "kube-system"
I know that the provider is picking up part of the configuration, but I cannot seem to get it to authenticate. I suspect this is because EKS uses Heptio for authentication, and I'm not sure whether the K8s Go client used by Terraform supports Heptio. However, given that Terraform released their AWS EKS support when EKS went GA, I doubt they wouldn't also update their Kubernetes provider to work with it.
Is it possible to even do this now? Are there alternatives?
Exec auth was added here: https://github.com/kubernetes/client-go/commit/19c591bac28a94ca793a2f18a0cf0f2e800fad04
This is what is used for custom authentication plugins, and it was published on Feb 7th.
Right now, Terraform doesn't support the new exec-based authentication provider, but there is an issue open with a workaround: https://github.com/terraform-providers/terraform-provider-kubernetes/issues/161
That said, if I get some free time I will work on a PR.
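For reference, the workaround discussed in that issue is (as far as I remember) to bypass the exec plugin entirely and hand the provider a short-lived token from the AWS EKS data sources. A rough sketch, assuming a cluster named my-eks-cluster and present-day provider syntax (both assumptions):

# look up the cluster endpoint, CA, and an IAM-authenticated token
data "aws_eks_cluster" "cluster" {
  name = "my-eks-cluster"
}

data "aws_eks_cluster_auth" "cluster" {
  name = "my-eks-cluster"
}

provider "kubernetes" {
  host                   = data.aws_eks_cluster.cluster.endpoint
  cluster_ca_certificate = base64decode(data.aws_eks_cluster.cluster.certificate_authority[0].data)
  token                  = data.aws_eks_cluster_auth.cluster.token
}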