Prometheus : list docker and docker compose version of each node - docker-compose

I would like to list each version of "docker" and "docker compose" of each node?
How can I do this with prometheus?
I have cadvisor and NodeExporter on each node. Is it enough? witch configuration ?
Thanks for your answers.
I ever try with
container_tasks_state{job="cadvisor", name="cadvisor",state="running"}
to extract container_label_com_docker_compose_version but I never succeeded because it's always the same result for each node and i know it's false.

Related

Kubectl connection refused existing cluster

Hope someone can help me.
To describe the situation in short, I have a self managed k8s cluster, running on 3 machines (1 master, 2 worker nodes). In order to make it HA, I attempted to add a second master to the cluster.
After some failed attempts, I found out that I needed to add controlPlaneEndpoint configuration to kubeadm-config config map. So I did, with masternodeHostname:6443.
I generated the certificate and join command for the second master, and after running it on the second master machine, it failed with
error execution phase control-plane-join/etcd: error creating local etcd static pod manifest file: timeout waiting for etcd cluster to be available
Checking the first master now, I get connection refused for the IP on port 6443. So I cannot run any kubectl commands.
Tried recreating the .kube folder, with all the config copied there, no luck.
Restarted kubelet, docker.
The containers running on the cluster seem ok, but I am locked out of any cluster configuration (dashboard is down, kubectl commands not working).
Is there any way I make it work again? Not losing any of the configuration or the deployments already present?
Thanks! Sorry if it’s a noob question.
Cluster information:
Kubernetes version: 1.15.3
Cloud being used: (put bare-metal if not on a public cloud) bare-metal
Installation method: kubeadm
Host OS: RHEL 7
CNI and version: weave 0.3.0
CRI and version: containerd 1.2.6
This is an old, known problem with Kubernetes 1.15 [1,2].
It is caused by short etcd timeout period. As far as I'm aware it is a hard coded value in source, and cannot be changed (feature request to make it configurable is open for version 1.22).
Your best bet would be to upgrade to a newer version, and recreate your cluster.

gpu worker node unable to join cluster

I've a EKS setup (v1.16) with 2 ASG: one for compute ("c5.9xlarge") and the other gpu ("p3.2xlarge").
Both are configured as Spot and set with desiredCapacity 0.
K8S CA works as expected and scale out each ASG when necessary, the issue is that the newly created gpu instance is not recognized by the master and running kubectl get nodes emits nothing.
I can see that the ec2 instance was in Running state and also I could ssh the machine.
I double checked the the labels and tags and compared them to the "compute".
Both are configured almost similarly, the only difference is that the gpu nodegroup has few additional tags.
Since I'm using eksctl tool (v.0.35.0) and the compute nodeGroup vs. gpu nodeGroup is basically copy&paste, I can't figured out what could be the problem.
UPDATE:
ssh the instance I could see the following error (/var/log/messages)
failed to run Kubelet: misconfiguration: kubelet cgroup driver: "systemd" is different from docker cgroup driver: "cgroupfs"
and the kubelet service crashed.
would it possible the my GPU uses wrong AMI (amazon-eks-gpu-node-1.18-v20201211)?
As a simple you can use this preBootstrapCommands in eksctl yaml config file:
- name: test-node-group
preBootstrapCommands:
- "sed -i 's/cgroupDriver:.*/cgroupDriver: cgroupfs/' /etc/eksctl/kubelet.yaml"
There is some issue with EKS 1.16, even the graviton processors machine won't join the cluster. To fix it first you try upgrading your CNI version. Please refer the documentation here:
https://docs.aws.amazon.com/eks/latest/userguide/cni-upgrades.html
And if that doesn't work, then upgrade your EKS version to the latest available version then should work.
I've found out the issue. It seems to be mis-alignment between eksctl (v0.35.0) and the AL2-GPU AMI.
AWS team change the control group in docker to be "systemd" instead of "cgroup" (github) while the eksctl tool I used didn't absorb the changes.
A temporary solution is to edit the /etc/eksctl/kubelet.yaml file using preBootstrapCommands

Taking snapshot of a running container via kubectl

Is it possible to take an image or a snapshot of container running inside pod using kubectl?
Via docker, it is possible to use the docker commit command that creates an image of a container from which we can spawn more containers. I wanted to understand if there was something similar that we could do with kubectl.
No, partially because that's not in the kubernetes mental model of anything one would wish to do to a cluster, and partially because docker is not the only container runtime kubernetes uses. Every runtime one could use underneath kubernetes would need to support that operation, and I doubt they do.
You are welcome to do your own docker commit either by getting a shell on the Node, or by running a privileged Pod then connecting to the docker.sock via a volumeMount and running it that way

Problem getting pods stats from kubelet and cri-o

We are running Kubernetes with the following configuration:
On-premise Kubernetes 1.11.3, cri-o 1.11.6 and CentOS7 with UEK-4.14.35
I can't make crictl stats to return pods information, it returns empty list only. Has anyone run into the same problem?
Another issue we have, is that when I query the kubelet for stats/summary it returns an empty pods list.
I think that these two issues are related, although I am not sure which one of them is the problem.
I would recommend checking kubelet service to verify health status and debug any suspicious events within the cluster. I assume that CRI-O runtime engine can select kubelet as the main Pods information provider because of its managing Pod lifecycle role.
systemctl status kubelet -l
journalctl -u kubelet
In case you found some errors or dubious events, share it in a comment below this answer.
However, you can use metrics-server, which will collect Pod metrics in the cluster and enable kube-apiserver flags for Aggregation Layer. Here is a good article about Horizontal Pod Autoscaling and monitoring resources via Prometheus.

kubernetes pods spawn across all servers but kubectl only shows 1 running and 1 pending

I have new setup of Kubernetes and I created replication with 2. However what I see when I do " kubectl get pods' is that one is running another is "pending". Yet when I go to my 7 test nodes and do docker ps I see that all of them are running.
What I think is happening is that I had to change the default insecure port from 8080 to 7080 (the docker app actually runs on 8080), however I don't know how to tell if I am right, or where else to look.
Along the same vein, is there any way to setup config for kubectl where I can specify the port. Doing kubectl --server="" is a bit annoying (yes I know I can alias this).
If you changed the API port, did you also update the nodes to point them at the new port?
For the kubectl --server=... question, you can use kubectl config set-cluster to set cluster info in your ~/.kube/config file to avoid having to use --server all the time. See the following docs for details:
http://kubernetes.io/v1.0/docs/user-guide/kubectl/kubectl_config.html
http://kubernetes.io/v1.0/docs/user-guide/kubectl/kubectl_config_set-cluster.html
http://kubernetes.io/v1.0/docs/user-guide/kubectl/kubectl_config_set-context.html
http://kubernetes.io/v1.0/docs/user-guide/kubectl/kubectl_config_use-context.html