I am seeing continuous 8 to 15% CPU usage from Rancher-related processes even though it is not managing a single cluster and no user is interacting with it. What explains this CPU usage when idle? Also, there are several "rancher-agent" containers perpetually running and restarting, which does not look right: there is no Kubernetes cluster running on this machine (unless Rancher is creating its own single-node cluster for some reason).
I am using Rancher 2.3
docker stats:
docker ps:
htop:
I'm not sure I would call 15% "high", but Kubernetes has a lot of ongoing work even when the cluster looks entirely quiet: processing node heartbeats, etcd election traffic, controllers with time-based conditions that have to be evaluated, and so on. K3s probably streamlines that a bit, but 0% CPU usage at idle is not a design goal even in that fork.
Rancher (2.3.x) does not do anything involving k3s, so these screenshots are not "just Rancher".
k3s is separately installed and running.
The agents further suggest that this node has been added to a cluster (maybe one managed by the same Rancher instance running on it, maybe not).
An agent restarting all the time is not helping CPU usage, especially if it is registered to that local Rancher instance.
Also, you're running a completely random commit from HEAD instead of an actual release.
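To narrow down what is actually burning the CPU here (the Rancher server, k3s, or the restarting agents), a few quick checks along these lines may help, assuming Docker is the container runtime on this host:
# Point-in-time CPU/memory per container
docker stats --no-stream
# Agent containers, including stopped ones, with their restart status
docker ps -a --filter name=rancher-agent
# Top CPU consumers on the host, which will also show k3s and Rancher server processes
ps -eo pid,%cpu,%mem,cmd --sort=-%cpu | head -15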
FWIW, in my case I built the Raspberry Pi based Rancher/k3s lab as designed by Network Chuck on YouTube. The VM on my Linux host that runs Rancher starts off fairly quiet, then over the course of a couple of days the rancherd process consistently hits near 100% CPU usage (I gave it 3 vCPUs) and stays there, even though I have no pods running on either the Pi cluster or the local Rancher VM cluster. A reboot starts the cycle over, but within a few days it's back to 100% CPU usage.
While writing this I noticed that, due to a DHCP issue, my original external IP for the local Rancher cluster node got changed from 163 to 151 (I had reserved it in Pi-hole as 151 but never updated the Rancher config). I just fixed it in the Rancher GUI; we'll see if that clears up some of the errors I saw in the logs and keeps the CPU usage normal at idle.
If you run the taint command on a Kubernetes master:
kubectl taint nodes --all node-role.kubernetes.io/master-
it allows you to schedule pods on it.
So it acts as both a worker node and a master.
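To verify the taint is gone and workloads can actually land on the (former) master, something like this works, assuming a hypothetical node name master-1:
# master-1 is a placeholder; the Taints line should no longer list node-role.kubernetes.io/master:NoSchedule
kubectl describe node master-1 | grep Taints
# Check where pods actually end up being scheduled
kubectl get pods -o wide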
I have tried running a 3-server cluster where all nodes have both roles. I didn't notice any issues at first glance.
Do you think this setup can be used nowadays to run a small cluster for a production service? If not, what are the real downsides? In which situations does this setup fail compared with the standard setup?
Assume that etcd is running on all three servers.
Thank you
The standard reason to run separate master nodes and worker nodes is to keep a busy workload from interfering with the cluster proper.
Say you have three nodes as proposed. One winds up running a database; one runs a Web server; the third runs an asynchronous worker pod. Suddenly you get a bunch of traffic into your system: the Rails application is using 100% CPU, the Sidekiq worker is cranking away at 100% CPU, and the MySQL database is trying to handle some complicated joins, using lots of CPU and all of the available disk bandwidth. You run kubectl get pods: which node is actually able to service these requests? If your application triggers the Linux out-of-memory killer, can you guarantee that it won't kill etcd or the kubelet, both of which are critical to the cluster working?
If this is running in a cloud environment, you can often get away with smaller (cheaper) nodes to be the masters. (Kubernetes on its own doesn't need a huge amount of processing power, but it does need it to be reliably available.)
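If you do run workloads on the control-plane nodes anyway, you can at least reserve headroom for the system components so CPU contention and the OOM killer hit the workload first. A minimal sketch, assuming the kubelet reads a KubeletConfiguration file; the path and the values are illustrative, not recommendations:
# Reserve CPU/memory for kubelet and OS daemons so a busy workload cannot starve the control plane
cat <<'EOF' | sudo tee /etc/kubernetes/kubelet-config.yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
systemReserved:
  cpu: "500m"
  memory: "512Mi"
kubeReserved:
  cpu: "500m"
  memory: "1Gi"
evictionHard:
  memory.available: "200Mi"
EOF
# The kubelet must be started with --config=/etc/kubernetes/kubelet-config.yaml
# (or the equivalent kubeadm setting) for this to take effect.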
My Kubernetes Engine cluster keeps rebooting one of my nodes, even though all pods on the node are "well-behaved". I've tried to look at the cluster's Stackdriver logs, but was not able to find a reason. After a while, the continuous reboots usually stop, only to occur again a few hours or days later.
Usually only one single node is affected, while the other nodes are fine, but deleting that node and creating a new one in its place only helps temporarily.
I have already disabled node auto-repair to see if that makes a difference (it was turned on before), and if I recall correctly this started after upgrading my cluster to Kubernetes 1.13 (specifically version 1.13.5-gke). The issue has persisted after upgrading to 1.13.6-gke.0. Even creating a new node pool and migrating to it had no effect.
The cluster consists of four nodes with 1 CPU and 3 GB RAM each. I know that's small for a k8s cluster, but this has worked fine in the past.
I am using the new Stackdriver Kubernetes Monitoring as well as Istio on GKE.
Any pointers as to what could be the reason or where I look for possible causes would be appreciated.
Screenshots of the Node event list (happy to provide other logs; couldn't find anything meaningful in Stackdriver Logging yet):
Posting this answer as a community wiki to give some troubleshooting tips/steps as the underlying issue wasn't found.
Feel free to expand it.
After the steps below, the issue with the node rebooting was no longer present:
Updating the Kubernetes version (GKE)
Uninstalling Istio
Using e2-medium instances as nodes.
As pointed out by user @aurelius:
I would start from posting the kubectl describe node maybe there is something going on before your Node gets rebooted and unhealthy. Also do you use resources and limits? Can this restarts be a result of some burstable workload? Also have you tried checking system logs after the restart on the Node itself? Can you post the results? – aurelius Jun 7 '19 at 15:38
The above comment could be a good starting point for troubleshooting issues with the cluster.
Options for troubleshooting the cluster, as pointed out in the comment (a combined command sketch follows this list):
$ kubectl describe node, focusing on the output in:
Conditions - KubeletReady, KubeletHasSufficientMemory, KubeletHasNoDiskPressure, etc.
Allocated resources - Requests and Limits of scheduled workloads
Checking system logs after the restart on the node itself:
GCP Cloud Console (Web UI) -> Logging -> Legacy Logs Viewer/Logs Explorer -> VM Instance/GCE Instance
It could also be beneficial to check the CPU/RAM usage in:
GCP Cloud Console (Web UI) -> Monitoring -> Metrics Explorer
You can also check if there are any operations on the cluster:
gcloud container operations list
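A minimal sketch of the checks above, assuming a hypothetical node name gke-node-1 and that metrics-server (or GKE's built-in monitoring) is available:
# gke-node-1 is a placeholder; look for MemoryPressure, DiskPressure, NotReady in Conditions
kubectl describe node gke-node-1
# Current CPU/RAM usage per node and per pod
kubectl top node
kubectl top pod --all-namespaces
# Recent cluster-level operations (upgrades, auto-repairs, resizes)
gcloud container operations list
# Recent events across the cluster, oldest first
kubectl get events --all-namespaces --sort-by=.metadata.creationTimestamp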
Adding to the above points:
Creating a cluster with Istio on GKE
We suggest creating at least a 4 node cluster with the 2 vCPU machine type when using this add-on. You can deploy Istio itself with the default GKE new cluster setup but this may not provide enough resources to explore sample applications.
-- Cloud.google.com: Istio: Docs: Istio on GKE: Installing
Also, the official Istio docs state:
CPU and memory
Since the sidecar proxy performs additional work on the data path, it consumes CPU and memory. As of Istio 1.7, a proxy consumes about 0.5 vCPU per 1000 requests per second.
-- Istio.io: Docs: Performance and scalability: CPU and memory
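To see how much of that CPU the sidecars themselves consume on a given cluster, per-container metrics are enough. A minimal sketch, assuming metrics-server (or GKE monitoring) is enabled and the injected sidecar container keeps its default name istio-proxy:
# Per-container CPU/memory; this separates each application container from its istio-proxy sidecar
kubectl top pod --all-namespaces --containers | grep -E 'POD|istio-proxy'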
Additional resources:
Cloud.google.com: Kubernetes Engine: Docs: Troubleshooting
Kubernetes.io: Docs: Debug cluster
After deploying anything to minikube it seems as though the apiserver starts eating up all the CPU and makes the dashboard mostly unusable until the apiserver dies and gets restarted.
I've read through a bit of the references found in this post: kube-apiserver high CPU and requests
However, those seem to specifically target deployed k8s clusters on many machines, or at least where the master isn't on the same machine.
That's not how it would work with minikube, since it's a single-node cluster. Not to mention it typically isn't given a ton of resources (neither CPU nor memory).
Is there a way to curb or eliminate this behavior? Perhaps I've missed some important configuration for running on minikube?
I set up a Kubernetes cluster with 2 powerful physical servers (32 cores + 64 GB memory). Everything runs very smoothly except for the bad network performance I observed.
As a comparison: I ran my service directly on one of these physical machines (one instance), with a client machine in the same network subnet calling the service, and the rps easily reaches 10k. When I put the exact same service in Kubernetes 1.1.7, one pod (instance) of the service is launched and the service is exposed by externalIP in the service yaml file. With the same client, the rps drops to 4k. Even after I switched kube-proxy to iptables mode, it doesn't seem to help much.
When I search around, I saw this document https://www.percona.com/blog/2016/02/05/measuring-docker-cpu-network-overhead/
It seems that Docker port forwarding is the network bottleneck, while other Docker network modes, like --net=host, bridge networking, or containers sharing a network namespace, don't show such a performance drop. Is the Kubernetes team already aware of this network performance drop, given that the Docker containers are launched and managed by Kubernetes? Is there any way to tune Kubernetes to use another Docker network mode?
You can configure Kubernetes networking in a number of different ways when configuring the cluster, and a few different ways on a per-pod basis. If you want to try verifying whether the docker networking arrangement is the problem, set hostNetwork to true in your pod specification and give it another try (example here). This is the equivalent of the docker --net=host setting.
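A minimal sketch of such a pod spec, using a hypothetical nginx pod; the hostNetwork field is the only interesting part:
# Pod that uses the host's network namespace directly, bypassing docker-proxy / kube-proxy port forwarding
cat <<'EOF' | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: nginx-hostnet
spec:
  hostNetwork: true
  containers:
  - name: nginx
    image: nginx
    ports:
    - containerPort: 80
EOF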
This is a general question about GKE compared to GCE. If one is running a lightweight service on a single small GCE VM, is it a reasonable thing to do to try running that same service from a single GKE container on the same size instance? Or does the overhead of cluster management make this unfeasible?
Specifics: I'm serving a low-traffic website from a tiny (f1-micro) GCE VM. For various reasons I thought I'd try moving it to serve from an apache/nginx container, with the same hardware underneath. In practice though, I find that GKE won't even let you create a cluster of f1-micro instances unless it has at least 3 nodes - the release notes say this is so there will be enough memory to manage pods.
I'd supposed that the same service would take up similar resources whether in a VM or a container, but GKE's 3-node restriction makes it sound like simply managing the cluster eats more memory than serving my site does in the first place. Is that the case, or is the restriction meant for much heavier services than mine? (For reference, you can actually create a 3-node cluster of f1-micro instances and then change the size to 1 node, and it seems to run normally, but I haven't tried actually running a service this way.)
Thanks!
GKE enables logging and monitoring by default, which runs Fluentd and Heapster pods in your cluster. These eat up a good chunk of memory. Even if you disable logging/monitoring, you still have to run Docker, Kubelet, and the DNS pod. That chews through the f1-micro's 600MB pretty quickly.
I'd suggest a 1-node g1-small cluster over a 3-node (or 1-node) f1-micro cluster. The per-node cluster-management overhead is relatively smaller, so your service would still be able to run in the same (or a larger) footprint. But if the resize-to-1 workaround is working for you, it seems fine to just roll with that.
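For reference, a rough sketch of creating such a cluster; the cluster name and zone below are placeholders:
# Single-node cluster on a g1-small machine type (name and zone are placeholders)
gcloud container clusters create my-small-cluster \
  --zone us-central1-a \
  --machine-type g1-small \
  --num-nodes 1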