Kubernetes Heapster not working correctly - kubernetes

I set up kubernetes-dashboard and heapster according to the docs in the github. My kubernetes client and server version are both 1.5.4 and the whole cluster is deployed on 10 physical servers with ubuntu 14.04 OS. I used heapster 1.3.
I can access the dashboard and see some figures about cpu and memory usage, but I do not know why some pods do not have such figures, i.e., the cpu and memory usage of some pods are not displayed in figure format. The two pictures below are examples.
dashboard display 1
dashboard display 2
Also, I find that sometimes for a pod, it only displays memory, without cpu usage.
I checked cAdvisor on each node by accessing its web ui, and it works quite well on each node. This problem troubles me several days. Can someone help me figure out this issue or give some hints? I'd be very grateful to you.

Related

Rancher: High CPU utilization even with zero clusters under management

I am seeing a continuous 8 to 15% CPU usage on Rancher related processes while there is not a single cluster being managed by it. Nor is any user interacting with. What explains this high CPU usage when idle? Also, there are several "rancher-agent" containers perpetually running and restarting. Which does not look right. There is no Kubernetes cluster running on this machine. This machine (unless Rancher is creating its own single node cluster for whatever reason).
I am using Rancher 2.3
docker stats:
docker ps:
htop:
I'm not sure I would call 15% "high", but Kubernetes has a lot of ongoing stuff even if it looks like the cluster is entirely quiet. Stuff like processing node heartbeats, etcd election traffic, controllers with time-based conditions which have to be processed. K3s probably streamlines that a bit, but 0% CPU usage is not a design goal even in the fork.
Rancher (2.3.x) does not do anything involving k3s. These pictures are not "just Rancher".
k3s is separately installed and running.
The agents further suggest that this node is added to a cluster (maybe the same Rancher running on it, maybe not).
It restarting all the time is not helping CPU usage, especially if it is registered to that local Rancher instance.
Also you're running a completely random commit from head instead of an actual release.
FWIW...In my case, I built the raspberry pi based Rancher/k3 lab as designed by Network Chuck on youtube. The VM on my linux host that runs Rancher will start off fairly quiet, then over the course of a couple of days the rancherd process will consistently hit near 100% cpu usage (I gave it 3 vcpu's) and stay there, even though I have no pods running on either the pi cluster or the local Rancher VM cluster. A reboot starts the process over, but within a few days its back to 100% cpu usage.
On writing this I just noticed that due to a DHCP issue, my original external ip for the local rancher cluster node got changed from 163 to 151 (I reserved it in pihole to 151, just never updated rancher config). Just fixed it in the Rancher gui, we'll see if that clears up some of the errors I saw in the logs and keeps the CPU usage normal on idle.

Live monitoring of container, nodes and cluster

we are using k8s cluster for one of our application, cluster is owned by other team and we dont have full control over there… We are trying to find out metrics around resource utilization (CPU and memory), detail about running containers/pods/nodes etc. Need to find out how many parallel containers are running. Problem is they have exposed monitoring of cluster via Prometheus but with Prometheus we are not getting live data, it does not have info about running containers.
My query is , what is that API which is by default available in k8s cluster and can give all what we need. We dont want to read data form another client like Prometheus or anything else, we want to read metrics directly from cluster so that data is not stale. Any suggestions?
As you mentioned you will need metrics-server (or heapster) to get those information.
You can confirm if your metrics server is running kubectl top nodes/pods or just by checking if there is a heapster or metrics-server pod present in kube-system namespace.
Also the provided command would be able to show you the information you are looking for. I wont go into details as here you can find a lot of clues and ways of looking at cluster resource usage. You should probably take a look at cadvisor too which should be already present in the cluster. It exposes a web UI which exports live information about all the containers on the machine.
Other than that there are probably commercial ways of acheiving what you are looking for, for example SignalFx and other similar projects - but this will probably require the cluster administrator involvement.

Kubernetes Engine: Node keeps getting unhealthy and rebooted for no apparent reason

My Kubernetes Engine cluster keeps rebooting one of my nodes, even though all pods on the node are "well-behaved". I've tried to look at the cluster's Stackdriver logs, but was not able to find a reason. After a while, the continuous reboots usually stop, only to occur again a few hours or days later.
Usually only one single node is affected, while the other nodes are fine, but deleting that node and creating a new one in its place only helps temporarily.
I have already disabled node auto-repair to see if that makes a difference (it was turned on before), and if I recall correctly this started after upgrading my cluster to Kubernetes 1.13 (specifically version 1.13.5-gke). The issue has persisted after upgrading to 1.13.6-gke.0. Even creating a new node pool and migrating to it had no effect.
The cluster consists of four nodes with 1 CPU and 3 GB RAM each. I know that's small for a k8s cluster, but this has worked fine in the past.
I am using the new Stackdriver Kubernetes Monitoring as well as Istio on GKE.
Any pointers as to what could be the reason or where I look for possible causes would be appreciated.
Screenshots of the Node event list (happy to provide other logs; couldn't find anything meaningful in Stackdriver Logging yet):
Posting this answer as a community wiki to give some troubleshooting tips/steps as the underlying issue wasn't found.
Feel free to expand it.
After below steps, the issue with a node rebooting were not present anymore:
Updated the Kubernetes version (GKE)
Uninstalling Istio
Using e2-medium instances as nodes.
As pointed by user #aurelius:
I would start from posting the kubectl describe node maybe there is something going on before your Node gets rebooted and unhealthy. Also do you use resources and limits? Can this restarts be a result of some burstable workload? Also have you tried checking system logs after the restart on the Node itself? Can you post the results? – aurelius Jun 7 '19 at 15:38
Above comment could be a good starting point for troubleshooting issues with the cluster.
Options to troubleshoot the cluster pointed in comment:
$ kubectl describe node focusing on output in:
Conditions - KubeletReady, KubeletHasSufficientMemory, KubeletHasNoDiskPressure, etc.
Allocated resources - Requests and Limits of scheduled workloads
Checking system logs after the restart on the node itself:
GCP Cloud Console (Web UI) -> Logging -> Legacy Logs Viewer/Logs Explorer -> VM Instance/GCE Instance
It could be also beneficiary to check the CPU/RAM usage in:
GCP Cloud Console (Web UI) -> Monitoring -> Metrics Explorer
You can also check if there are any operations on the cluster:
gcloud container operations list
Adding to above points:
Creating a cluster with Istio on GKE
We suggest creating at least a 4 node cluster with the 2 vCPU machine type when using this add-on. You can deploy Istio itself with the default GKE new cluster setup but this may not provide enough resources to explore sample applications.
-- Cloud.google.com: Istio: Docs: Istio on GKE: Installing
Also, the official docs of Istio are stating:
CPU and memory
Since the sidecar proxy performs additional work on the data path, it consumes CPU and memory. As of Istio 1.7, a proxy consumes about 0.5 vCPU per 1000 requests per second.
-- Istio.io: Docs: Performance and scalability: CPU and memory
Additional resources:
Cloud.google.com: Kubernetes Engine: Docs: Troubleshooting
Kubernetes.io: Docs: Debug cluster

How can I retrieve the memory utilization of a pod in kubernetes via kubectl?

Inside a namespace, I have created a pod with its specs consisting of memory limit and memory requests parameters. Once up a and running, I would like to know how can I get the memory utilization of the pod in order to figure out if the memory utilization is within the specified limit or not. "kubectl top" command returns back with a services related error.
kubectl top pod <pod-name> -n <fed-name> --containers
FYI, this is on v1.16.2
You need to install metrics server to get the metrics. Follow the below thread
Error from server (NotFound): podmetrics.metrics.k8s.io "mem-example/memory-demo" not found
kubectl top pod POD_NAME --containers
shows metrics for a given pod and its containers.
If you want to see graphs of memory and cpu utilization then you can see them through the kubernetes dashboard.
A better solution would be to install a metrics server alongwith prometheus and grafana in your cluster. Prometheus will scrape the metrics which can be used by grafana for displaying as graphs.
This might be useful.
Instead of building ad-hoc metric snapshots, a much better way is to install and work with 3rd party data collector programs which if managed well gives you a great solution for monitoring systems and a neat Grafana UI (or likewise) you can play with. One of them is Prometheus and which comes highly recommended.
using such PnP systems, you can not only create a robust monitoring pipeline but also the consumption and hence the reaction to the problem is well managed and executed compared to only relying on TOP

Monitoring Rancher containers by hosts through Prometheus cAdvisor NodeExporter

I have a setup where I manage to monitor every container of my Rancher 1.6 environnement with a stack Prometheus(2.4.3)/Grafana (with cAdvisor v0.27.4 and NodeExporter v0.16.0).
Here is my issue. I manage to monitor every container consumption but I can't relate the consumption of a container based on the host.
For example, if I want to show information about CPU usage I use container_cpu_user_seconds_total from cAdvisor which provides cpu usage of the container in percentage related to its host but I can't find which host is concerned (I Have 4 hosts on this environnement) so the cpu consumption cumulative tends to go over 100%.
I would like to either show charts by host (I saw I could create dynamic charts in Grafana but it seems a bit hard so manually creating them doesn't bother me).
Should I try to create my own metrics in prom-conf file ? Seems a bit overkill for such stuff
I find this very strange that this information only interests me. That's why I ask it here.
I'm new to all of these tools.
Thank you in advance.