googleapi: Error 403: Quota 'BACKEND_SERVICES' exceeded. Limit: 9.0, quotaExceeded - kubernetes

I'm encountering the following error for my ingress controller.
Warning GCE googleapi: Error 403: Quota 'BACKEND_SERVICES' exceeded. Limit: 9.0, quotaExceeded
My limit is set as 9, and this has previously worked so I'm not sure why this error is being encountered now.
I did delete the cluster and then create a new one. What do these backend services refer to, and how can I remove any old ones that were not deleted?

You could also request a small increase to the BACKEND_SERVICES quota on the quotas page.
If the increase is small enough, it gets auto-accepted.

I had to delete the previously created load balancers and the related "backends" in the Google Cloud console.
The quota usage was updated shortly after that.
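If you want to see what is counting against the quota before deleting anything, here is a minimal Python sketch that lists the backend services in a project. It assumes the gcloud CLI is installed and authenticated; the helper names and the project ID are hypothetical, and it only lists, it does not delete anything.

```python
import json
import subprocess
from typing import List


def list_command(project: str) -> List[str]:
    """Build the gcloud invocation that lists backend services as JSON."""
    return [
        "gcloud", "compute", "backend-services", "list",
        f"--project={project}", "--format=json",
    ]


def backend_names(gcloud_json: str) -> List[str]:
    """Extract the sorted backend service names from gcloud's JSON output."""
    return sorted(item["name"] for item in json.loads(gcloud_json))


def fetch_backend_names(project: str) -> List[str]:
    """Run gcloud (assumed to be on PATH and authenticated) and return the names."""
    result = subprocess.run(
        list_command(project), check=True, capture_output=True, text=True
    )
    return backend_names(result.stdout)


# Hypothetical usage; "my-project" is a placeholder project ID:
#   names = fetch_backend_names("my-project")
#   print(len(names), "backend services in use (limit in the error was 9)")
```

Anything in that list that belongs to the deleted cluster is a candidate for cleanup in the console.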

Just a heads up: I ran into this quickly while trialing a multi-region GCE Ingress deployed using Kubemci. Since you are essentially duplicating your backend across many regions, the maximum number of regions you can use on a GCP trial account is 5.
GCP will force you to upgrade to a full account (and enter billing details if you haven't yet). Not a big deal, but in my case I had to do this in order to test a service being served from more than 5 regions at once, and the error was not immediately evident in the logs.
When troubleshooting the rest of the multi-region Ingress process, this one was tricky to track down, so hopefully this saves a bit of time for someone trying to deploy many clusters on a trial account (like I was!).

Related

AWS elastic search cluster becoming unresponsive

We have several AWS Elasticsearch domains that sometimes become unresponsive for no apparent reason. The ES endpoint and Kibana both return bad gateway errors after a few minutes of trying to load the resources.
The node status message is the following (not that it's any help):
/_cluster/health: {"code":"ProxyRequestServiceException","message":"Unable to execute HTTP request: Read timed out (Service: null; Status Code: 0; Error Code: null; Request ID: null)"}
Error logs are activated for the cluster but do not show anything relevant for the time at which the cluster became inactive.
I would like to at least be able to restart the cluster but the status remains "processing" seemingly forever.
Unfortunately, if you are using the AWS Elasticsearch Service (as in not building it on your own EC2 instances), many... well... MOST... of the admin APIs and capabilities are restricted, so you cannot dig into it as much as you could if you had built it from the ground up.
I have found that AWS Support does a pretty good job in getting to the bottom of things when needed, so I would suggest you open a support ticket.
I wish this wasn't the case. Using their service is nice and easy (as in you don't have to build and maintain the infra yourself), but you lose a LOT of capabilities from an admin or troubleshooting perspective. :(

How to detect GKE autoupgrading a node in Stackdriver logs

We have a GKE cluster with auto-upgrading nodes. We recently noticed a node become unschedulable and eventually get deleted, and we suspect it was being upgraded automatically. Is there a way to confirm (or rule out) in Stackdriver that this was indeed the cause?
You can use the following advanced logs queries with Cloud Logging (previously Stackdriver) to detect upgrades to node pools:
protoPayload.methodName="google.container.internal.ClusterManagerInternal.UpdateClusterInternal"
resource.type="gke_nodepool"
and for the master:
protoPayload.methodName="google.container.internal.ClusterManagerInternal.UpdateClusterInternal"
resource.type="gke_cluster"
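To narrow either query to the time when the node disappeared, you can add timestamp comparisons. Here is a minimal Python sketch that builds such a filter string; the function name is hypothetical, and it relies on Cloud Logging treating comparisons on separate lines as an implicit AND.

```python
from datetime import datetime, timezone

UPGRADE_METHOD = (
    "google.container.internal.ClusterManagerInternal.UpdateClusterInternal"
)


def upgrade_filter(resource_type: str, start: datetime, end: datetime) -> str:
    """Return a Cloud Logging filter for upgrades within [start, end].

    resource_type is "gke_nodepool" (node pools) or "gke_cluster" (master).
    start and end must be timezone-aware datetimes.
    """
    if resource_type not in ("gke_nodepool", "gke_cluster"):
        raise ValueError(f"unexpected resource type: {resource_type}")
    if start.tzinfo is None or end.tzinfo is None:
        raise ValueError("start and end must be timezone-aware")

    def rfc3339(ts: datetime) -> str:
        # Cloud Logging accepts RFC 3339 timestamps; normalise to UTC.
        return ts.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

    return "\n".join([
        f'protoPayload.methodName="{UPGRADE_METHOD}"',
        f'resource.type="{resource_type}"',
        f'timestamp>="{rfc3339(start)}"',
        f'timestamp<="{rfc3339(end)}"',
    ])
```

Paste the resulting string into the logs viewer's advanced filter box, using a window around the time the node became unschedulable.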
Additionally, you can control when the updates are applied with maintenance windows (as the user aurelius mentioned).
I think your question has already been answered in the comments. Just as an addition: automatic upgrades occur at regular intervals at the discretion of the GKE team. To get more control, you can create a maintenance window as explained here. This is basically a time frame that you choose in which automatic upgrades should occur.

Why shouldn't you run Kubernetes pods for longer than an hour from Composer?

The Cloud Composer documentation explicitly states that:
Due to an issue with the Kubernetes Python client library, your Kubernetes pods should be designed to take no more than an hour to run.
However, it doesn't provide any more context than that, and I can't find a definitively relevant issue on the Kubernetes Python client project.
To test it, I ran a pod for two hours and saw no problems. What issue creates this restriction, and how does it manifest?
I'm not deeply familiar with either the Cloud Composer or Kubernetes Python client library ecosystems, but sorting the GitHub issue tracker by most comments shows this open item near the top of the list: https://github.com/kubernetes-client/python/issues/492
It sounds like there is a token expiration issue:
@yliaog this is an issue for us, as we are running kubernetes pods as
batch processes and tracking the state of the pods with a static
client. Once the client object is initialized, it does no refresh, and
therefore any job that takes longer than 60 minutes will fail. Looking
through python-base, it seems like we could make a wrapper class that
generates a new client (or refreshes the config) every n minutes, or
checks status prior to every call (as @mvle suggested). The best fix
would be in swagger-codegen, but a temporary solution would probably
be very useful for a lot of people.
- @flylo, https://github.com/kubernetes-client/python/issues/492#issuecomment-376581140
https://issues.apache.org/jira/browse/AIRFLOW-3253 is the reason (and hopefully, my fix will be merged soon). As the others suggested, this affects anyone using the Kubernetes Python client with GCP auth. If you are authenticating with a Kubernetes service account, you should see no problem.
If you are authenticating via a GCP service account with gcloud (e.g. using the GKEPodOperator), you will generally see this problem with jobs that take longer than an hour because the auth token expires after an hour.
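The "wrapper class that generates a new client every n minutes" idea from the quoted comment could look roughly like this. It is only a sketch: the class name, the 45-minute default, and the injectable clock are my own assumptions, and I have not verified that rebuilding the client actually picks up a fresh GCP token in every setup.

```python
import time
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class RefreshingClient(Generic[T]):
    """Hands out a client and rebuilds it once it is older than max_age_seconds."""

    def __init__(
        self,
        factory: Callable[[], T],
        max_age_seconds: float = 45 * 60,  # assumption: stay well under the 1 h token lifetime
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._factory = factory
        self._max_age = max_age_seconds
        self._clock = clock
        self._client: Optional[T] = None
        self._created_at = 0.0

    def get(self) -> T:
        """Return the cached client, creating a new one if it is missing or too old."""
        now = self._clock()
        if self._client is None or now - self._created_at >= self._max_age:
            self._client = self._factory()
            self._created_at = now
        return self._client


# Hypothetical usage with the Kubernetes Python client (not executed here):
#
#   from kubernetes import client, config
#
#   def make_core_v1():
#       config.load_kube_config()   # re-read credentials on every rebuild
#       return client.CoreV1Api()
#
#   api = RefreshingClient(make_core_v1)
#   api.get().list_namespaced_pod("default")
```

Call `get()` before every API request instead of holding on to one client object for the lifetime of the job.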
There are more insights here too.
Currently, long-running jobs on GKE always eventually fail with a 404 error (https://bitbucket.org/snakemake/snakemake/issues/932/long-running-jobs-on-kubernetes-fail). We believe that the problem is in the Kubernetes client, as we determined that although _refresh_gcp_token is being called when the token is expired, the next API call still fails with a 404 error.
You can see here that Snakemake uses the Kubernetes Python client.

Kubernetes etcd HighNumberOfFailedHTTPRequests QGET

I run a Kubernetes cluster in AWS on CoreOS-stable-1745.6.0-hvm (ami-401f5e38), all deployed by kops 1.9.1 / terraform.
etcd_version = "3.2.17"
k8s_version = "1.10.2"
The Prometheus alert method=QGET alertname=HighNumberOfFailedHTTPRequests comes from the CoreOS kube-prometheus monitoring bundle. The alert started firing at the very beginning of the cluster's lifetime and has now been active for ~3 weeks without visible impact.
QGET fails for 33% of requests.
NOTE: I have a 2nd cluster in another region, built from scratch on the same versions, and it shows exactly the same behavior. So it's reproducible.
Does anyone know what the root cause might be, and what the impact is if I keep ignoring it?
EDIT:
Later I found this GH issue which describes my case precisely: https://github.com/coreos/etcd/issues/9596
From CoreOS documentation:
For alerts to not appear on arbitrary events it is typically better not to alert directly on a raw value that was sampled, but rather by aggregating and defining a relative threshold rather than a hardcoded value. For example: send a warning if 1% of the HTTP requests fail, instead of sending a warning if 300 requests failed within the last five minutes. A static value would also require a change whenever your traffic volume changes.
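To make the difference between a relative and a hardcoded threshold concrete, here is a small Python sketch. The function names are hypothetical and this is not the actual Prometheus rule; it only mirrors the 1% and 300-request figures from the quoted documentation.

```python
def failure_ratio(failed: float, total: float) -> float:
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    if total <= 0:
        return 0.0
    return failed / total


def warn_relative(failed: float, total: float, threshold: float = 0.01) -> bool:
    """Warn when more than `threshold` (1% by default) of requests failed."""
    return failure_ratio(failed, total) > threshold


def warn_absolute(failed: float, limit: float = 300) -> bool:
    """Warn on a hardcoded count, regardless of traffic volume."""
    return failed >= limit


# 300 failures out of 100,000 requests is 0.3%: the static rule fires,
# the relative rule does not.
busy = (300, 100_000)
# 33 failures out of 100 requests is 33%: only the relative rule fires.
quiet = (33, 100)

for failed, total in (busy, quiet):
    print(failed, total, warn_relative(failed, total), warn_absolute(failed))
```

By that measure, the 33% QGET failure rate in the question is far above the 1% example, which is why the alert keeps firing.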
Here you can find detailed information on how to Develop Prometheus alerts for etcd.
I got the explanation in the GitHub issue thread: the HTTP metrics/alerts should be replaced with gRPC ones.

Openshift says Quota limit reached

In OpenShift I have 4 projects, with 25 GB of storage allocated to them.
The database I use is MongoDB (version 3.2).
OpenShift shows the message "Quota limit reached", and when I check, OpenShift reports that all 25 GB has been used.
But when I check in MongoDB using db.stats() for all the projects, I have used only 5.7 GB.
I want to know where the remaining space is being used, or how to find the exact amount of space I am using.
I think you should double-check a few things about your resource usage.
Check which resource limit was reached. Is it storage?
Check the event logs, which provide more details.
Check which quota limits are configured for your cluster or project.
Have you experienced any trouble since the message appeared, such as the database hanging or pods not responding?
These are just troubleshooting pointers, but I hope they help.