Debug Alpine Image in K8s: No `netstat`, no `ip`, no `apk` - kubernetes

There is a container in my Kubernetes cluster which I want to debug.
But there is nonetstat, no ip and no apk.
Is there a way to upgrade this image, so that the common tools are installed?
In this case it is the nginx container image in a K8s 1.23 cluster.

Alpine is a stripped-down version of the image to reduce the footprint. So the absence of those tools is expected. Although since Kubernetes 1.23, you can use the kubectl debug command to attach a debug pod to the subject pod.
Syntax:
kubectl debug -it <POD_TO_DEBUG> --image=ubuntu --target=<CONTAINER_TO_DEBUG> --share-processes
Example:
In the below example, the ubuntu container is attached to the Nginx-alpine pod, requiring debugging. Also, note that the ps -eaf output shows nginx process running and the cat /etc/os-release shows ubuntu running. The indicating process is shared/visible between the two containers.
ps#kube-master:~$ kubectl debug -it nginx --image=ubuntu --target=nginx --share-processes
Targeting container "nginx". If you don't see processes from this container, the container runtime doesn't support this feature.
Defaulting debug container name to debugger-2pgtt.
If you don't see a command prompt, try pressing enter.
root#nginx:/# ps -eaf
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 19:50 ? 00:00:00 nginx: master process nginx -g daemon off;
101 33 1 0 19:50 ? 00:00:00 nginx: worker process
101 34 1 0 19:50 ? 00:00:00 nginx: worker process
101 35 1 0 19:50 ? 00:00:00 nginx: worker process
101 36 1 0 19:50 ? 00:00:00 nginx: worker process
root 248 0 1 20:00 pts/0 00:00:00 bash
root 258 248 0 20:00 pts/0 00:00:00 ps -eaf
root#nginx:/#
Debugging as ubuntu as seen here, this arm us with all sort of tools:
root#nginx:/# cat /etc/os-release
NAME="Ubuntu"
VERSION="20.04.3 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.3 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
root#nginx:/#
In case ephemeral containers need to be enabled in your cluster, then you can enable it via feature gates as described here.

The whole point of using containers is to optimize the resource utilization in your cluster. The images used should only include the packages that are needed to run your app.
The unwanted packages should be removed from your images (especially in prod) to reduce the compute utilization and to reduce the attack vector.
This appears to be a stripped down image that has only the libraries needed to run that application.
In order to debug, you will have to create a new container in the same pid and network namespace as the container you are trying to debug
Build container first
Dockerfile
FROM alpine
RUN apk update && apk add strace
CMD ["strace", "-p", "1"]
Build
$ docker build -t strace .
Run
docker run -t --pid=container:<targetContainer> \
--net=container:targetContainer \
--cap-add sys_admin \
--cap-add sys_ptrace \
strace
strace: Process 1 attached
futex(0xd72e90, FUTEX_WAIT, 0, NULL
https://rothgar.medium.com/how-to-debug-a-running-docker-container-from-a-separate-container-983f11740dc6

Related

The tdengine image (2.4.0.3) cannot be shut down normally when pulled up with kubernetes

I created a three node cluster through k8s deployment using the tdengine version 2.4.0.3 image. View the pod information:
kubectl get pods -n mytdengine
NAME READY STATUS RESTART AGE
tdengine-01 1/1 Running 0 1m45s
tdengine-02 1/1 Running 0 1m45s
tdengine-03 1/1 Running 0 1m45s
Everthing was going well.
However, when I tried to stop the pods with delete operation:
kubectl delete pod tdengine-03 -n mytdengine
The target pod is not deleted as expect. The status turns to:
NAME READY STATUS RESTART AGE
tdengine-01 1/1 Running 0 2m35s
tdengine-02 1/1 Running 0 2m35s
tdengine-03 1/1 Terminating 0 2m35s
After several tests, pod will successfully deleted until 3 mins, which is unnormal. I didn't actually use the tdengine instance, which means there are no excessive load or storage occupation. I cannot find a reason to explain why it cost 3mins to shut down.
After testing, I eliminated the problem of kubernetes configuration. Moreover, I found that the parameter ‘terminationgraceperiodseconds’ configured in the yaml file of Pod: 180
terminationgraceperiodseconds:180
This means that the pod was not shut down gracefully, but was forcibly removed after timeout.
Generally speaking, the stop of pod usually sends a signal of signterm. The container processes the signal correctly and makes an elegant shutdown. However, if it does not stop or the container does not respond to the signal and exceeds the timeout set by the above parameter 'termination graceriodseconds', the container will receive the signal of signkill and forcibly kill the container. Ref: https://tasdikrahman.me/2019/04/24/handling-singals-for-applications-in-kubernetes-docker/
The reason for this is tdengine2 4.0.3 the startup script of the image pulls up taosadapter first and then taosd, but it does not rewrite the processing method of signterm signal. Due to the particularity of Linux PID 1, only PID 1 receives the signterm signal after k8s sends it to the pod content container (as shown in the figure below, PID 1 is the startup script) and does not notify taosadapter and taosd (become zombie processes).
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
9 root 20 0 2873404 80144 2676 S 2.3 0.5 112:30.81 taosadapter
8 root 20 0 2439240 41364 2996 S 1.0 0.3 130:53.67 taosd
1 root 20 0 20044 1648 1368 S 0.0 0.0 0:00.01 run_taosd.sh
7 root 20 0 20044 476 200 S 0.0 0.0 0:00.00 run_taosd.sh
135 root 20 0 20176 2052 1632 S 0.0 0.0 0:00.00 bash
146 root 20 0 38244 1788 1356 R 0.0 0.0 0:00.00 top
I personally choose the way to rewrite hook function in k8s yaml file to immediately delete the container:
lifecycle:
preStop:
command:
- /bin/bash
- -c
- procnum=`ps aux | grep taosd | grep -v -e grep -e entrypoint -e run_taosd
| awk '{print $2}'`; kill -15 $procnum; if ["$?" -eq 0]; then echo "kill
Of course, once we know the cause of the problem, there are other solutions which are not discussed here.

Why does kubeadm not start even after disabling swap?

I am trying to install kubernetes with kubeadm in my laptop which has Ubuntu 16.04. I have disabled swap, since kubelet does not work with swap on. The command I used is :
swapoff -a
I also commented out the reference to swap in /etc/fstab.
# /etc/fstab: static file system information.
#
# Use 'blkid' to print the universally unique identifier for a
# device; this may be used with UUID= as a more robust way to name devices
# that works even if disks are added and removed. See fstab(5).
#
# <file system> <mount point> <type> <options> <dump> <pass>
# / was on /dev/sda1 during installation
UUID=1d343a19-bd75-47a6-899d-7c8bc93e28ff / ext4 errors=remount-ro 0 1
# swap was on /dev/sda5 during installation
#UUID=d0200036-b211-4e6e-a194-ac2e51dfb27d none swap sw 0 0
I confirmed swap is turned off by running the following:
free -m
total used free shared buff/cache available
Mem: 15936 2108 9433 954 4394 12465
Swap: 0 0 0
When I start kubeadm, I get the following error:
kubeadm init --pod-network-cidr=10.244.0.0/16
[init] Using Kubernetes version: v1.14.2
[preflight] Running pre-flight checks
[WARNING IsDockerSystemdCheck]: detected "cgroupfs" as the Docker cgroup driver. The recommended driver is "systemd". Please follow the guide at https://kubernetes.io/docs/setup/cri/
error execution phase preflight: [preflight] Some fatal errors occurred:
[ERROR Swap]: running with swap on is not supported. Please disable swap
[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`
I also tried restarting my laptop, but I get the same error. What could the reason be?
below was the root cause.
detected "cgroupfs" as the Docker cgroup driver. The recommended driver is "systemd".
you need to update the docker cgroup driver.
follow the below fix
cat > /etc/docker/daemon.json <<EOF
{
"exec-opts": ["native.cgroupdriver=systemd"],
"log-driver": "json-file",
"log-opts": {
"max-size": "100m"
},
"storage-driver": "overlay2",
"storage-opts": [
"overlay2.override_kernel_check=true"
]
}
EOF
mkdir -p /etc/systemd/system/docker.service.d
# Restart Docker
systemctl daemon-reload
systemctl restart docker
you could try kubeadm reset , then kubeadm init --ignore-preflight-errors Swap .
first try with sudo
sudo swapoff -a
then check if there's anything swapped
cat /proc/swaps
and
free -h

kubelet saying node "master01" not found

I try to stack up my kubeadm cluster with three masters. I receive this problem from my init command...
[kubelet-check] Initial timeout of 40s passed.
Unfortunately, an error has occurred:
timed out waiting for the condition
This error is likely caused by:
- The kubelet is not running
- The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)
If you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:
- 'systemctl status kubelet'
- 'journalctl -xeu kubelet'
Additionally, a control plane component may have crashed or exited when started by the container runtime.
To troubleshoot, list all containers using your preferred container runtimes CLI, e.g. docker.
Here is one example how you may list all Kubernetes containers running in docker:
- 'docker ps -a | grep kube | grep -v pause'
Once you have found the failing container, you can inspect its logs with:
- 'docker logs CONTAINERID'
error execution phase wait-control-plane: couldn't initialize a Kubernetes cluster
But I do not use no cgroupfs but systemd
And my kubelet complain for not knowing his nodename.
Jan 23 14:54:12 master01 kubelet[5620]: E0123 14:54:12.251885 5620 kubelet.go:2266] node "master01" not found
Jan 23 14:54:12 master01 kubelet[5620]: E0123 14:54:12.352932 5620 kubelet.go:2266] node "master01" not found
Jan 23 14:54:12 master01 kubelet[5620]: E0123 14:54:12.453895 5620 kubelet.go:2266] node "master01" not found
Please let me know where is the issue.
The issue can be because of docker version, as docker version < 18.6 is supported in latest kubernetes version i.e. v1.13.xx.
Actually I also got the same issue but it get resolved after downgrading the docker version from 18.9 to 18.6.
If the problem is not related to Docker it might be because the Kubelet service failed to establish connection to API server.
I would first of all check the status of Kubelet: systemctl status kubelet and consider restarting with systemctl restart kubelet.
If this doesn't help try re-installing kubeadm or running kubeadm init with other version (use the --kubernetes-version=X.Y.Z flag).
In my case,my k8s version is 1.21.1 and my docker version is 19.03. I solved this bug by upgrading docker to version 20.7.

Ran out of Docker disk space

I have this Docker command:
docker run -d mongo
this will build and run a mongodb server running in a docker container
However, I get an error:
no space left on device
I am on MacOS, and using the newer versions of Docker which use hyper-v instead of VirtualBox (I think that's correct).
Here is the exact error message from the mongo container:
$ docker logs efee16702c5756659d563b98d4ae0f58ecf1f1bba8a54f63443c0ae4b520ab4e
about to fork child process, waiting until server is ready for connections.
forked process: 21
2017-05-04T20:23:51.412+0000 I CONTROL [main] ***** SERVER RESTARTED *****
2017-05-04T20:23:51.430+0000 I CONTROL [main] ERROR: Cannot write pid file to /tmp/tmp.Lo035QkbfL: No space left on device
ERROR: child process failed, exited with error number 1
Any idea how to fix this and prevent it from happening in future?
As suggested, the output of df -h is:
Filesystem Size Used Avail Capacity iused ifree %iused Mounted on
/dev/disk1 465Gi 116Gi 349Gi 25% 1963838 4293003441 0% /
devfs 183Ki 183Ki 0Bi 100% 634 0 100% /dev
map -hosts 0Bi 0Bi 0Bi 100% 0 0 100% /net
map auto_home 0Bi 0Bi 0Bi 100% 0 0 100% /home
Output of docker info is:
$ docker info
Containers: 5
Running: 0
Paused: 0
Stopped: 5
Images: 741
Server Version: 17.03.1-ce
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 4ab9917febca54791c5f071a9d1f404867857fcc
runc version: 54296cf40ad8143b62dbcaa1d90e520a2136ddfe
init version: N/A (expected: 949e6facb77383876aeff8a6944dde66b3089574)
Security Options:
seccomp
Profile: default
Kernel Version: 4.9.13-moby
Operating System: Alpine Linux v3.5
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 1.952 GiB
Name: moby
ID: OR4L:WYWW:FFAP:IDX3:B6UK:O2AN:UVTO:EPH6:GYSV:4GV4:L5WP:BQTH
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): true
File Descriptors: 17
Goroutines: 30
System Time: 2017-05-04T20:45:27.056157913Z
EventsListeners: 1
No Proxy: *.local, 169.254/16
Registry: https://index.docker.io/v1/
Experimental: true
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
As you state in the comments to the question, ls -altrh ~/Library/Containers/com.docker.docker/Data/com.docker.driver.‌​amd64-linux/Docker.q‌​cow2 returns the following:
-rw-r--r--# 1 alexamil staff 53G
This is a known bug on MacOS (actually, not only) and an official dev comment could be found here. Except for one thing: I read, that different people get different size limit. In the comment it is 64Gb, but for another person it was 20Gb.
There are a couple walkarounds, but no definite solution that I could find.
The manual one
Run docker ps -a and manually remove all unused containers. Then run docker images and remove manually all the intermediate and unused images.
The simplest one
Delete the Docker.qcow2 file entirely. But you will lose all images and containers. Completely.
The less simple
Another way is to run docker volume prune, which will remove all unused volumes
The resizing one (keeps the data)
Another idea that comes to me is to expand the disk image size with QEMU or something like it:
$ brew install qemu
$ /Applications/Docker.app/Contents/MacOS/qemu-img resize ~/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux/Docker.qcow2 +5G
After you expanded the image, you will need to run a VM in which you should run GParted against Docker.qcow2 and expand the partition to use added space. You could use GParted Live ISO for that:
$ qemu-system-x86_64 -drive file=Docker.qcow2 -m 512 -cdrom ~/Downloads/gparted-live.iso -boot d -device usb-mouse -usb
Some people report this either doesn't work or doesn't help.
Yet another resizing one (wipes the data)
Create a substitute image with desired size (120G):
$ qemu-img create -f qcow2 ~/data.qcow2 120G
$ cp ~/data.qcow2 /Application/Docker.app/Contents/Resources/moby/data.qcow2
$ rm ~/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux/Docker.qcow2
data.qcow2 is copied to ~/Library/Containers/com.docker.docker/Data/com.docker.driver.amd64-linux/Docker.qcow2 when you restart docker.
This walkaround comes from this comment.
Hope this helps. Good luck!

Where are the Kubernetes kubelet logs located?

I installed Kubernetes on my Ubuntu machine. For some debugging purposes I need to look at the kubelet log file (if there is any such file).
I have looked in /var/logs but I couldn't find a such file. Where could that be?
If you run kubelet using systemd, then you could use the following method to see kubelet's logs:
# journalctl -u kubelet
If you are trying to go directly to the file you can find the kubelet logs in /var/log/syslog directory. This is for ubuntu 16.04 and above.
It depends how it was installed. I installed Kubernetes on some Ubuntu machines following the Docker-MultiNode instructions.
With this install, I find the logs using the logs command like this.
Find your container ID.
$ docker ps | egrep kubelet
Use that container ID to view the logs
$ docker logs `<container-id>`
Finally I could find it in /var/log/upstart directory. Kubernetes in my machine is started using upstart. That's why those log files are in upstart directory
I installed Kubernetes by kind (Kubernetes in docker).
find docker container of kind to enter
$ docker container ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
62588e4d284b kindest/node:v1.17.0 "/usr/local/bin/entr…" 2 weeks ago Up 2 weeks 127.0.0.1:32769->6443/tcp kind2-control-plane
$ docker container exec -it kind2-control-plane bash
root#kind2-control-plane:/#
Inside container kind2-control-plane, you could find logfiles in two place:
/var/log/containers/
/var/log/pods/
And then,you will find they are the same, you can see the example below:
root#kind2-control-plane:/# cat /var/log/containers/redis-master-7db7f6579f-scw95_default_master-f6374281c2c6afcfcd0ee1214d9bd51c1684c0b6c0ba1056295246ecd055563c.log | tail -n 5
2020-04-08T12:09:29.824252114Z stdout F
2020-04-08T12:09:29.824372278Z stdout F [1] 08 Apr 12:09:29.822 # Server started, Redis version 2.8.19
2020-04-08T12:09:29.824440661Z stdout F [1] 08 Apr 12:09:29.823 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.
2020-04-08T12:09:29.824459317Z stdout F [1] 08 Apr 12:09:29.823 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
2020-04-08T12:09:29.82446451Z stdout F [1] 08 Apr 12:09:29.824 * The server is now ready to accept connections on port 6379
root#kind2-control-plane:/# cat /var/log/pods/default_redis-master-7db7f6579f-scw95_094824e1-25aa-4e1e-ab23-d4bae861988a/master/0.log | tail -n 5
2020-04-08T12:09:29.824252114Z stdout F
2020-04-08T12:09:29.824372278Z stdout F [1] 08 Apr 12:09:29.822 # Server started, Redis version 2.8.19
2020-04-08T12:09:29.824440661Z stdout F [1] 08 Apr 12:09:29.823 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.
2020-04-08T12:09:29.824459317Z stdout F [1] 08 Apr 12:09:29.823 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
2020-04-08T12:09:29.82446451Z stdout F [1] 08 Apr 12:09:29.824 * The server is now ready to accept connections on port 6379
root#kind2-control-plane:/# ls -l /var/log/containers/ | grep redis
lrwxrwxrwx 1 root root 101 Apr 8 12:09 redis-master-7db7f6579f-scw95_default_master-f6374281c2c6afcfcd0ee1214d9bd51c1684c0b6c0ba1056295246ecd055563c.log -> /var/log/pods/default_redis-master-7db7f6579f-scw95_094824e1-25aa-4e1e-ab23-d4bae861988a/master/0.log
If you want to know more in detail about the directories, you can see 2019-2-merge-request in Github.