Kubernetes etcd HighNumberOfFailedHTTPRequests QGET - kubernetes

I run kubernetes cluster in AWS, CoreOS-stable-1745.6.0-hvm (ami-401f5e38), all deployed by kops 1.9.1 / terraform.
etcd_version = "3.2.17"
k8s_version = "1.10.2"
This Prometheus alert method=QGET alertname=HighNumberOfFailedHTTPRequests is coming from coreos kube-prometheus monitoring bundle. The alert started to fire from the very beginning of the cluster lifetime and now exists for ~3 weeks without visible impact.
^ QGET fails - 33% requests.
NOTE: I have the 2nd cluster in other region built from scratch on the same versions and it has exact same behavior. So it's reproducible.
Anyone knows what might be the root cause, and what's the impact if ignored further?
EDIT:
Later I found this GH issue which describes my case precisely: https://github.com/coreos/etcd/issues/9596

From CoreOS documentation:
For alerts to not appear on arbitrary events it is typically better not to alert directly on a raw value that was sampled, but rather by aggregating and defining a relative threshold rather than a hardcoded value. For example: send a warning if 1% of the HTTP requests fail, instead of sending a warning if 300 requests failed within the last five minutes. A static value would also require a change whenever your traffic volume changes.
Here you can find detailed information on how to Develop Prometheus alerts for etcd.

I got the explanation in GitHub issue thread.
HTTP metrics/alerts should be replaced with GRPC.

Related

Regarding GKE optimised image hardening

We tried to harden the gke optimized image (gke-1.15.11) for our cluster. We took an ssh into the node instance and made the cis porposed changes in the /home/kubernetes/kubelet-config.yaml file and ran kubebench to check if all the conditions have passed around 8 condtions failed these where the exact conditions we changed in the file. But, then we made the exact argument changes in /etc/default/kubernetes and ran kubebench again the conditions passed. But, when we restarted the instance we all the changes we made in the /ect/default/kubernetes file where gone. Can someone let me know where we are going wrong or is there any other path where we have to make the cis benchmark suggested entries
GKE doesn't support user-provided node images as of April 2020. Recommended option is to create your own DaemonSet with host filesystem writes and/or host services restart to propagate all the required changes.

How to detect GKE autoupgrading a node in Stackdriver logs

We have a GKE cluster with auto-upgrading nodes. We recently noticed a node become unschedulable and eventually deleted that we suspect was being upgraded automatically for us. Is there a way to confirm (or otherwise) in Stackdriver that this was indeed the cause what was happening?
You can use the following advanced logs queries with Cloud Logging (previously Stackdriver) to detect upgrades to node pools:
protoPayload.methodName="google.container.internal.ClusterManagerInternal.UpdateClusterInternal"
resource.type="gke_nodepool"
and master:
protoPayload.methodName="google.container.internal.ClusterManagerInternal.UpdateClusterInternal"
resource.type="gke_cluster"
Additionally, you can control when the update are applied with Maintenance Windows (like the user aurelius mentioned).
I think your question has been already answered in the comments. Just as addition automatic upgrades occur at regular intervals at the discretion of the GKE team. To get more control you can create a Maintenance Windows as explained here. This is basically a time frame that you choose in which automatic upgrades should occur.

(How) do node pool autoupgrades in GKE actually work?

We have a fairly large kubernetes deployment on GKE, and we wanted to make our life a little easier by enabling auto-upgrades. The documentation on the topic tells you how to enable it, but not how it actually works.
We enabled the feature on a test cluster, but no nodes were ever upgraded (although the UI kept nagging us that "upgrades are available").
The docs say it would be updated to the "latest stable" version and that it occurs "at regular intervals at the discretion of the GKE team" - both of which is not terribly helpful.
The UI always says: "Next auto-upgrade: Not scheduled"
Has someone used this feature in production and can shed some light on what it'll actually do?
What I did:
I enabled the feature on the nodepools (not the cluster itself)
I set up a maintenance window
Cluster version was 1.11.7-gke.3
Nodepools had version 1.11.5-gke.X
The newest available version was 1.11.7-gke.6
What I expected:
The nodepool would be updated to either 1.11.7-gke.3 (the default cluster version) or 1.11.7-gke.6 (the most recent version)
The update would happen in the next maintenance window
The update would otherwise work like a "manual" update
What actually happened:
Nothing
The nodepools remained on 1.11.5-gke.X for more than a week
My question
Is the nodepool version supposed to update?
If so, at what time?
If so, to what version?
I'll finally answer this myself. The auto-upgrade does work, though it took several days to a week until the version was upgraded.
There is no indication of the planned upgrade date, or any feedback other than the version updating.
It will upgrade to the current master version of the cluster.
Addition: It still doesn't work reliably, and still no way to debug if it doesn't. One information I got was that the mechanism does not work if you initially provided a specific version for the node pool. As it is not possible to deduce the inner workings of the autoupdates, we had to resort to manually checking the status again.
I wanted to share two other possibilities as to why a node-pool may not be auto-upgrading or scheduled to upgrade.
One of our projects was having the similar issue where the master version had auto-upgraded to 1.14.10-gke.27 but our node-pool stayed stuck at 1.14.10-gke.24 for over a month.
Reaching a node quota
The node-pool upgrade might be failing due to a node quota (although I'm not sure the web console would say Next auto-upgrade: Not scheduled). From the node upgrades documentation, it suggests we can run the following to view any failed upgrade operations:
gcloud container operations list --filter="STATUS=DONE AND TYPE=UPGRADE_NODES AND targetLink:https://container.googleapis.com/v1/projects/[PROJECT_ID]/zones/[ZONE]/clusters/[CLUSTER_NAME]"
Automatic node upgrades are for minor+ versions only
After exhausting my troubleshooting steps, I reached out GCP Support and opened a case (Case 23113272 for anyone working at Google). They told me the following:
Automatic node upgrade:
The node version could not necessary upgrade automatically, let me explain, exists three upgrades in a node: Minor versions (1.X), Patch releases (1.X.Y) and Security updates and bug fixes (1.X.Y-gke.N), please take a look at this documentation [2] the automatic node upgrade works from a minor version and in your case the upgrade was a security update that can't upgrade automatically.
I responded back and they confirmed that automatic node upgrades will only happen for minor versions and above. I have requested that they submit a request to update their documentation because (at the time of this response) it is not outlined anywhere in their node auto-upgrade documentation.
This feature replaces the VMs (Kubernetes nodes) in your node pool running the "old" Kubernetes version with VMs running the "new" version.
The node pool "upgrade" operation is done in a rolling fashion: It's not like GKE deletes all your VMs and recreates them simultaneously (except when you have only 1 node in your cluster). By default, the nodes are replaced with newer nodes one-by-one (although this might change).
GKE internally uses mostly the features of managed instance groups to manage operations on node pools.
You can find documentation on how to schedule node upgrades by specifying certain "maintenance windows" so you are impacted minimally. (This article also gives a bit more insights on how upgrades happen.)
That said, you can disable auto-upgrades and upgrade your cluster manually (although this is not recommended). Some GKE users have thousands of nodes, therefore for them, upgrading VMs one-by-one are not feasible.
For that GKE offers an option that lets you choose "how many nodes are upgraded at a time":
gcloud container clusters upgrade \
--concurrent-node-count=CONCURRENT_NODE_COUNT
Documentation of this flag says:
The number of nodes to upgrade concurrently. Valid values are [1, 20]. It is a recommended best practice to set this value to no higher than 3% of your cluster size.'

Why shouldn't you run Kubernetes pods for longer than an hour from Composer?

The Cloud Composer documentation explicitly states that:
Due to an issue with the Kubernetes Python client library, your Kubernetes pods should be designed to take no more than an hour to run.
However, it doesn't provide any more context than that, and I can't find a definitively relevant issue on the Kubernetes Python client project.
To test it, I ran a pod for two hours and saw no problems. What issue creates this restriction, and how does it manifest?
I'm not deeply familiar with either the Cloud Composer or Kubernetes Python client library ecosystems, but sorting the GitHub issue tracker by most comments shows this open item near the top of the list: https://github.com/kubernetes-client/python/issues/492
It sounds like there is a token expiration issue:
#yliaog this is an issue for us, as we are running kubernetes pods as
batch processes and tracking the state of the pods with a static
client. Once the client object is initialized, it does no refresh, and
therefore any job that takes longer than 60 minutes will fail. Looking
through python-base, it seems like we could make a wrapper class that
generates a new client (or refreshes the config) every n minutes, or
checks status prior to every call (as #mvle suggested). The best fix
would be in swagger-codegen, but a temporary solution would probably
be very useful for a lot of people.
- #flylo, https://github.com/kubernetes-client/python/issues/492#issuecomment-376581140
https://issues.apache.org/jira/browse/AIRFLOW-3253 is the reason (and hopefully, my fix will be merged soon). As the others suggested, this affects anyone using the Kubernetes Python client with GCP auth. If you are authenticating with a Kubernetes service account, you should see no problem.
If you are authenticating via a GCP service account with gcloud (e.g. using the GKEPodOperator), you will generally see this problem with jobs that take longer than an hour because the auth token expires after an hour.
There are more insights here too.
Currently, long-running jobs on GKE always eventually fail with a 404 error (https://bitbucket.org/snakemake/snakemake/issues/932/long-running-jobs-on-kubernetes-fail). We believe that the problem is in the Kubernetes client, as we determined that although _refresh_gcp_token is being called when the token is expired, the next API call still fails with a 404 error.
You can see here that Snakemake uses the kubernetes python client.

How to use the Python Kubernetes client in a way resilient to GKE Kubernetes Master disruptions?

We sometimes use Python scripts to spin up and monitor Kubernetes Pods running on Google Kubernetes Engine using the Official Python client library for kubernetes. We also enable auto-scaling on several of our node pools.
According to this, "Master VM is automatically scaled, upgraded, backed up and secured". The post also seems to indicate that some automatic scaling of the control plane / Master VM occurs when the node count increases from 0-5 to 6+ and potentially at other times when more nodes are added.
It seems like the control plane can go down at times like this, when many nodes have been brought up. In and around when this happens, our Python scripts that monitor pods via the control plane often crash, seemingly unable to find the KubeApi/Control Plane endpoint triggering some of the following exceptions:
ApiException, urllib3.exceptions.NewConnectionError, urllib3.exceptions.MaxRetryError.
What's the best way to handle this situation? Are there any properties of the autoscaling events that might be helpful?
To clarify what we're doing with the Python client is that we are in a loop reading the status of the pod of interest via read_namespaced_pod every few minutes, and catching exceptions similar to the provided example (in addition we've tried also catching exceptions for the underlying urllib calls). We have also added retrying with exponential back-off, but things are unable to recover and fail after a specified max number of retries, even if that number is high (e.g. keep retrying for >5 minutes).
One thing we haven't tried is recreating the kubernetes.client.CoreV1Api object on each retry. Would that make much of a difference?
When a nodepool size changes, depending on the size, this can initiate a change in the size of the master. Here are the nodepool sizes mapped with the master sizes. In the case where the nodepool size requires a larger master, automatic scaling of the master is initiated on GCP. During this process, the master will be unavailable for approximately 1-5 minutes. Please note that these events are not available in Stackdriver Logging.
At this point all API calls to the master will fail, including the ones from the Python API client and kubectl. However after 1-5 minutes the master should be available and calls from both the client and kubectl should work. I was able to test this by scaling my cluster from 3 node to 20 nodes and for 1-5 minutes the master wasn't available .
I obtained the following errors from the Python API client:
Max retries exceeded with url: /api/v1/pods?watch=False (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at>: Failed to establish a new connection: [Errno 111] Connection refused',))
With kubectl I had :
“Unable to connect to the server: dial tcp”
After 1-5 minutes the master was available and the calls were successful. There was no need to recreate kubernetes.client.CoreV1Api object as this is just an API endpoint.
According to your description, your master wasn't accessible after 5 minutes which signals a potential issue with your master or setup of the Python script. To troubleshoot this further on side while your Python script runs, you can check for availability of master by running any kubectl command.