Problem getting pods stats from kubelet and cri-o - kubernetes

We are running Kubernetes with the following configuration:
On-premises Kubernetes 1.11.3, CRI-O 1.11.6, and CentOS 7 with UEK kernel 4.14.35.
I can't get crictl stats to return pod information; it only returns an empty list. Has anyone run into the same problem?
Another issue we have is that when I query the kubelet's stats/summary endpoint, it returns an empty pods list.
I think these two issues are related, although I am not sure which one is the cause.

I would recommend checking the kubelet service to verify its health status and to debug any suspicious events within the cluster. Even with CRI-O as the runtime engine, I assume the kubelet is still the main provider of Pod information, because of its role in managing the Pod lifecycle.
systemctl status kubelet -l
journalctl -u kubelet
If you find any errors or dubious events, share them in a comment below this answer.
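If the kubelet itself looks healthy, you can also query its stats/summary endpoint directly from the node to see whether the pods list is empty there too. A minimal sketch, assuming the kubelet's read-only port 10255 is still enabled (it is often disabled for security reasons; the alternative is the secure port 10250 with a bearer token):
curl -s http://127.0.0.1:10255/stats/summary
If that also shows an empty pods array, the problem is on the kubelet/CRI-O side rather than in crictl.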
Alternatively, you can use metrics-server, which will collect Pod metrics in the cluster; you will also need to enable the kube-apiserver flags for the Aggregation Layer. Here is a good article about Horizontal Pod Autoscaling and monitoring resources via Prometheus.
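A sketch of the metrics-server route, assuming the current upstream install manifest (for a 1.11 cluster you would need a correspondingly old metrics-server release, so treat the URL as illustrative):
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
kubectl top pods --all-namespaces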

Related

deployed a service on k8s but not showing any pods even when it failed

I have deployed a k8s service, however it's not showing any pods. This is what I see:
kubectl get deployments
It should be created in the default namespace.
kubectl get nodes (this shows me nothing)
How do I troubleshoot a failed deployment? The test-control-plane is the one deployed by kind; this is the k8s distribution I'm using.
kubectl get nodes
If the above command shows nothing, it means there are no nodes in your cluster, so where will your workload run?
You need at least one worker node in a K8s cluster so that the Deployment can schedule Pods on it and run the application.
You can check for worker nodes using the same command:
kubectl get nodes
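If you built your cluster with kind and it only has a control plane, you can recreate it with a worker node. A minimal sketch of a kind config (the file name kind-config.yaml is just a placeholder):
# kind-config.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
Then create the cluster with:
kind create cluster --config kind-config.yaml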
You can debug further and check the reason for the issue using:
kubectl describe deployment <name of your deployment>
To find out what really went wrong, first follow the steps described by Harsh Manvar in his answer. Perhaps that information can help you find the problem. If not, check the logs of your deployment: list your pods, see which ones did not start properly, and then check their logs.
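For example (pod and namespace names below are placeholders):
kubectl get pods --all-namespaces
kubectl logs <pod-name> -n <namespace>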
You can also use kubectl describe on the pods to see in more detail what went wrong. Since you are using kind, I'm including a list of known errors for you.
You can also see this visual guide on troubleshooting Kubernetes deployments and 5 Tips for Troubleshooting Kubernetes Deployments.

How to monitor pod preemption event

I have a bunch of Rancher clusters I take care of, and on some of them developers use PriorityClasses to ensure that some of the more important workloads get scheduled. The three PriorityClasses have values in the three-digit range, so they will not interfere with the default ones. However, at present none of the PriorityClasses is set as default, and neither is preemptionPolicy set, so it defaults to PreemptLowerPriority.
None of the Rancher, Longhorn, Prometheus, Grafana, etc. workloads have priorityClassName set.
Long story short, I believe this wreaks havoc on the cluster when resources are in short supply.
Before I take my opinion to the developers I would like to collect some data to back up my story.
The question: How do I detect if the pod was Terminated due to Preemption?
I tried to google the subject but couldn't find anything. I was hoping kube-state-metrics would have something, but I didn't find anything.
Any help would be greatly appreciated.
You can look for convincing data, such as the pod termination reason, with the help of kubectl.
You can see the logs from a container's previous run (before its last restart) using the following command:
kubectl logs podname -c containername --previous
You can also use the following command to check the lifecycle events sent by the kubelet to the apiserver about the pod:
kubectl describe pod podname
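You can also search cluster events directly. On recent Kubernetes versions the scheduler records an event with reason Preempted on the victim pod, so something like the following may surface preemptions, though you should verify the exact reason string on your version:
kubectl get events -A --field-selector reason=Preempted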
Finally, you can also write a final message to /dev/termination-log, and this will show up as described in the docs.
To use kubectl commands with Rancher, kindly refer to this documentation page.

How to find the pod that led to an error in GKE

If I look at my logs in GCP Logging, I see, for instance, that I got a request that returned a 500:
log_message: "Method: some_cloud_goo.Endpoint failed: INTERNAL_SERVER_ERROR"
I would like to quickly go to that pod and do a kubectl logs on it. But I did not find a way to do this.
I am fairly new to k8s and GKE; is there any way to trace back the pod that handled that request?
You could run "kubectl get pods" to check the status of all pods, and then figure out what went wrong by running "kubectl describe pod pod-name" for a detailed description of the error.
As mentioned in Neelam's answer, you can get the pod names with the command kubectl get pods -A and go through the logs of all your pods to find the error.
Or, alternatively, you could deploy a custom monitoring system like Elastic GKE Logging, available in GCP's GitHub Click-to-deploy repository.
See here to install it from the Marketplace with a few clicks.
It is a free alternative that gives you a complete monitoring system, and you can filter your logs in the Kibana dashboard once it is deployed.
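Alternatively, since GKE ships container logs to Cloud Logging by default, each log entry already carries the pod name as a resource label, so you can recover the pod without deploying anything extra. A sketch, assuming the default k8s_container resource type and that your error appears in textPayload (it may be in jsonPayload instead):
gcloud logging read 'resource.type="k8s_container" AND textPayload:"INTERNAL_SERVER_ERROR"' --limit=5 --format='value(resource.labels.pod_name)'
You can then run kubectl logs on the pod name it returns.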

Replacing dead master in Kubernetes 1.15 cluster with stacked control plane

I have a Kubernetes cluster with a 3-master stacked control plane - so each master also has its own etcd instance running locally. The problem I am trying to solve is this:
"If one master dies such that it cannot be restarted, how do I replace it?"
Currently, when I try to add the replacement master into the cluster, I get the following error while running kubeadm join:
[check-etcd] Checking that the etcd cluster is healthy
I0302 22:43:41.968068 9158 local.go:66] [etcd] Checking etcd cluster health
I0302 22:43:41.968089 9158 local.go:69] creating etcd client that connects to etcd pods
I0302 22:43:41.986715 9158 etcd.go:106] etcd endpoints read from pods: https://10.0.2.49:2379,https://10.0.225.90:2379,https://10.0.247.138:2379
error execution phase check-etcd: error syncing endpoints with etc: dial tcp 10.0.2.49:2379: connect: no route to host
The 10.0.2.49 node is the one that died. These nodes are all running in an AWS AutoScaling group, so I don't have control over the addresses.
I have drained and deleted the dead master node using kubectl drain and kubectl delete; and I have used etcdctl to make sure the dead node was not in the member list.
Why is it still trying to connect to that node's etcd?
It is still trying to connect to the member because etcd maintains a list of members in its store -- that's how it knows to vote on quorum decisions. I don't believe etcd is unique in that way -- most distributed key-value stores know their member list.
The fine manual shows how to remove a dead member, but it also warns to add a new member before removing unhealthy ones.
There is also a project, etcdadm, that is designed to smooth over some of the rough edges of etcd cluster management, but I haven't used it, so I can't say what it is good at versus not.
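For reference, a sketch of listing and removing the dead member with the etcd v3 API; the certificate paths assume a kubeadm-provisioned stacked etcd, and the member ID is a placeholder taken from the list output:
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/peer.crt \
  --key=/etc/kubernetes/pki/etcd/peer.key \
  member list
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/peer.crt \
  --key=/etc/kubernetes/pki/etcd/peer.key \
  member remove <member-id>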
The problem turned out to be that the failed node was still listed in the ConfigMap. Further investigation led me to the following thread, which discusses the same problem:
https://github.com/kubernetes/kubeadm/issues/1300
The solution that worked for me was to edit the ConfigMap manually.
kubectl -n kube-system get cm kubeadm-config -o yaml > tmp-kubeadm-config.yaml
manually edit tmp-kubeadm-config.yaml to remove the old server (see the sketch below)
kubectl -n kube-system apply -f tmp-kubeadm-config.yaml
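On kubeadm 1.15, the stale node typically appears under the ClusterStatus key of that ConfigMap. A sketch of the block to delete, where the hostname is a placeholder and the address is the dead node from the question:
apiEndpoints:
  dead-master-hostname:    # remove this entire entry
    advertiseAddress: 10.0.2.49
    bindPort: 6443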
I believe updating the etcd member list is still necessary to ensure cluster stability, but it wasn't the full solution.

Kubernetes liveness probe logging recovery

I am trying to test a liveness probe while learning kubernetes.
I have set up a minikube and configured a pod with a liveness probe.
Testing the script (e.g. via docker exec), it seems to report success and failure as required.
The probe leads to failure events, which I can view via kubectl describe pod podname, but it does not report recovery from failures.
This answer says that liveness probe successes are not reported by default.
I have been trying to increase the log level with no success by running variations like:
minikube start --extra-config=apiserver.v=4
minikube start --extra-config=kubectl.v=4
minikube start --v=4
As suggested here & here.
What is the proper way to configure the logging level for a kubelet?
Can it be modified without restarting the pod or minikube?
An event will be reported if a failure causes the pod to be restarted.
For Kubernetes itself, I understand that using the probe to decide whether to restart the pod is sufficient.
Why aren't events recorded for recovery from a failure which does not require a restart?
This is how I would expect probes to work in a health monitoring system.
How would recovery be visible if the same probe was used in prometheus or similar?
For an expensive probe I would not want it to be run multiple times.
(granted one probe could cache the output to a file making the second probe cheaper)
I have been trying to increase the log level with no success by running variations like:
minikube start --extra-config=apiserver.v=4
minikube start --extra-config=kubectl.v=4
minikube start --v=4
@Bruce, none of the options you mentioned will work, as they relate to other components of the Kubernetes cluster, and the answer you referred to clearly says:
The output of successful probes isn't recorded anywhere unless your Kubelet has a log level of at least --v=4, in which case it'll be in the Kubelet's logs.
So you need to set -v=4 specifically for the kubelet. In the official docs you can see that it can be started with specific flags, including one that changes the default verbosity of its logs:
-v, --v Level    number for the log level verbosity
The kubelet runs as a system service on each node, so you can check its status by simply issuing:
systemctl status kubelet.service
and if you want to see its logs, issue the command:
journalctl -xeu kubelet.service
Try:
minikube start --extra-config=kubelet.v=4
however I'm not 100% sure whether Minikube is able to pass this parameter, so you'll need to verify it on your own. If it doesn't work, you should still be able to set it in the kubelet configuration file that specifies the parameters it is started with (don't forget to restart kubelet.service after submitting the changes: simply run systemctl restart kubelet.service).
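For the non-minikube case, one common way to do this on a systemd-based install is a drop-in file. A sketch, assuming a kubeadm-style setup where the kubelet unit honours KUBELET_EXTRA_ARGS (the path and variable name may differ on your distribution):
# /etc/systemd/system/kubelet.service.d/20-verbosity.conf
[Service]
Environment="KUBELET_EXTRA_ARGS=--v=4"
followed by:
systemctl daemon-reload
systemctl restart kubelet.service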
Let me know if it helps and don't hesitate to ask additional questions if something is not completely clear.