Eventual failure: kubectl exec fails with "operation not permitted: unknown" - kubernetes

I have some Pods that are running Python programs. Initially, I'm able to exec into the Pods and run simple commands. However, after some time (maybe hours?), I start to get the following error:
$ kubectl exec -it mypod -- bash
error: Internal error occurred: error executing command in container: failed to exec in container: failed to start exec "37a9f1042841590e48e1869f8b0ca13e64df02d25458783e74d8e8f2e33ad398": OCI runtime exec failed: exec failed: unable to start container process: open /dev/pts/0: operation not permitted: unknown
If I restart the Pods, this clears the condition. However, I'd like to figure out why this is happening, to avoid having to restart Pods each time.
The Pods are running a simple Python script, and the Python program is still running as normal (kubectl logs shows what I expect).
Also, I'm running K3s for Kubernetes across 4 nodes (1 master, 3 workers). I noticed that all Pods running on certain nodes started to experience this issue. For example, initially I found that all Pods running on worker2 and worker3 had this issue (but all Pods on worker1 did not). Eventually, all Pods across all worker nodes started to have this problem. So it appears to be related to a condition on the node that is preventing exec from running; however, restarting the Pods resets the condition.
As far as I can tell, the containers are running fine in containerd: I can log into the nodes, containerd shows the containers as running, I can check logs, etc.
What else should I check?
Why would the ability to exec stop working (while the containers are still running)?

There are a couple of GitHub issues about this from mid-August. They describe an SELinux issue that was fixed in runc v1.1.4. Check your runc version, and if it is older than that, update it.
Otherwise, if you are not working in production, you can disable SELinux:
setenforce 0
Or, if you want a more targeted solution, try this: https://github.com/moby/moby/issues/43969#issuecomment-1217629129
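For reference, here is a quick way to check both on a node (a sketch; note that K3s bundles its own runc, so the standalone binary may not be on your PATH):
# print the runc version; versions below 1.1.4 are affected
runc --version
# show the current SELinux mode (Enforcing / Permissive / Disabled)
getenforce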

Related

Why can't I get into the container running "kubernetes-dashboard"?

I was trying to get into the kubernetes-dashboard Pod, but I keep getting this error:
C:\Users\USER>kubectl exec -n kubernetes-dashboard kubernetes-dashboard-66c887f759-bljtc -it -- sh
OCI runtime exec failed: exec failed: unable to start container process: exec: "sh": executable file not found in $PATH: unknown
command terminated with exit code 126
The Pod is running normally and I can access the Kubernetes UI via the browser. But I was having some issues getting it running earlier, and I wanted to get inside the pod to run some commands, but I always get the same error mentioned above.
When I try the same command with a pod running nginx for example, it works:
C:\Users\USER>kubectl exec my-nginx -it -- sh
/ # ls
bin home proc sys
dev lib root tmp
docker-entrypoint.d media run usr
docker-entrypoint.sh mnt sbin var
etc opt srv
/ # exit
Any explanation, please?
Prefix the command to run with /bin, so your updated command will look like:
kubectl exec -n kubernetes-dashboard <POD_NAME> -it -- /bin/sh
The reason you're getting that error is that Git Bash on Windows uses MSYS path conversion, which rewrites command arguments that look like POSIX paths. Generally, using /bin/sh or /bin/bash works universally.
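If the MSYS path conversion is the culprit, a commonly cited Git Bash workaround is to disable the conversion for the one command, or to double the leading slash so the argument isn't treated as a POSIX path (MSYS_NO_PATHCONV is specific to Git for Windows):
# disable path conversion for this invocation only
MSYS_NO_PATHCONV=1 kubectl exec -n kubernetes-dashboard <POD_NAME> -it -- /bin/sh
# or escape the path with a double slash
kubectl exec -n kubernetes-dashboard <POD_NAME> -it -- //bin/sh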
That error message means literally what it says: there is no sh or any other shell in the container. There's no particular requirement that a container have a shell, and if a Docker image is built FROM scratch (as the Kubernetes dashboard image is) or from a "distroless" base, it may simply not contain one.
In most cases you shouldn't need to "enter a container", and you should use kubectl exec (or docker exec) sparingly, if at all. This is doubly true in Kubernetes: not only will changes you make manually be lost when the container exits, but you typically have multiple replicas that you can't manually edit all at once, and in some cases the cluster can delete and recreate a Pod outside of your control.

How can I keep a Pod from crashing so that I can debug it?

In Kubernetes, when a Pod repeatedly crashes and is in CrashLoopBackOff status, it is not possible to shell into the container and poke around to find the problem, because containers (unlike VMs) live only as long as their primary process. If I shell into a container and the Pod is restarted, I'm kicked out of the shell.
How can I keep a Pod from crashing so that I can investigate if my primary process is failing to boot properly?
Redefine the command
In development only, a temporary hack to keep a Kubernetes pod from crashing is to redefine it and specify the container's command (corresponding to a Docker ENTRYPOINT) and args to be a command that will not crash. For instance:
containers:
- name: something
  image: some-image
  # `sh -c` evaluates a string as shell input
  command: [ "sh", "-c" ]
  # loop forever, outputting "yo" every 5 seconds
  args: [ "while true; do echo 'yo' && sleep 5; done;" ]
This allows the container to run and gives you a chance to shell into it, like kubectl exec -it pod/some-pod -- sh, and investigate what may be wrong.
This needs to be undone after debugging so that the container will run the command it's actually meant to run.
Adapted from this blog post.
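If you'd rather not edit the manifest by hand, the same override can be applied in place with a JSON patch (a sketch; assumes the Pod is managed by a Deployment named some-deployment):
kubectl patch deployment some-deployment --type='json' -p='[
  {"op": "add", "path": "/spec/template/spec/containers/0/command", "value": ["sh", "-c"]},
  {"op": "add", "path": "/spec/template/spec/containers/0/args", "value": ["while true; do sleep 5; done"]}
]'
Remember to roll this back once you're done debugging.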
There are also other methods for debugging pods that are worth noting for your use case:
If your container has previously crashed, you can access the previous container's crash log with:
kubectl logs --previous ${POD_NAME} ${CONTAINER_NAME}
Debugging with an ephemeral debug container: ephemeral containers are useful for interactive troubleshooting when kubectl exec is insufficient because a container has crashed or a container image doesn't include debugging utilities, such as with distroless images. Beginning with version v1.18, kubectl has an alpha command that can create ephemeral containers for debugging. An example for this method can be found here.
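For instance, a minimal sketch of attaching a throwaway shell to a running Pod (busybox is an arbitrary image choice here, and --target assumes the runtime supports process namespace sharing):
kubectl debug -it mypod --image=busybox --target=mycontainer -- sh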
In my case, I built the image on a Mac M1 (Apple silicon). In this case, the pod crashes and there is no explicit message about why.
The problem was that I also debugged using Docker on the same M1, so I could not really see what was wrong.
I needed to build the image using docker build --platform linux/amd64.
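A minimal sketch of such a cross-platform build on Apple silicon (my-image is a placeholder tag; the build runs under emulation, so expect it to be slower):
docker build --platform linux/amd64 -t my-image .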

How to stop/start containers at k3s agent?

Docker provides the following functions to stop and start the same container.
OP46B1:/ # docker stop 18788407a60c
OP46B1:/ # docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
18788407a60c ubuntu:test "/bin/bash" 34 minutes ago Exited (0) 7 seconds ago charming_gagarin
OP46B1:/ # docker start 18788407a60c
But the k3s agent does not provide this functionality. A container stopped by "k3s crictl stop" cannot be restarted by "k3s crictl start"; the following error appears instead. How can I stop and start the same container on a k3s agent?
OP46B1:/data # ./k3s-arm64 crictl stop 5485f899c7bb6
5485f899c7bb6
OP46B1:/data # ./k3s-arm64 crictl ps -a
CONTAINER IMAGE CREATED STATE NAME ATTEMPT POD ID
5485f899c7bb6 b58be220837f0 3 days ago Exited pod-webapp86 0 92a94e8eec410
OP46B1:/data# ./k3s-arm64 crictl start 5485f899c7bb6
FATA[2020-10-20T00:54:04.520056930Z] Starting the container "5485f899c7bb6" failed: rpc error: code = Unknown desc = failed to set starting state for container "5485f899c7bb6f2d294a3a131b33d8f35c9cf84df73cacb7b8af1ee48a591dcf": container is in CONTAINER_EXITED state
k3s is a distribution of kubernetes. Kubernetes is an abstraction over the container framework (containerd/docker/etc.). As such, you shouldn't try to control the containers directly using k3s crictl, but instead use the pod abstraction provided by kubernetes.
k3s kubectl get pods -A will list all the pods that are currently running in the k3s instance.
k3s kubectl delete pod -n <namespace> <pod-selector> will delete the pod(s) specified, which will stop (and delete) their containers.
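If the goal is to temporarily stop a workload and start it again later, the closest Kubernetes-level equivalent is to scale its controller down and back up (a sketch; assumes the Pod is managed by a Deployment named webapp):
k3s kubectl scale deployment webapp --replicas=0 -n <namespace>
k3s kubectl scale deployment webapp --replicas=1 -n <namespace>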

Cannot shell into the container, rpc error: code = 5 desc ... shim-log.json: no such file or directory

Trying to shell into the container with kubectl exec -it xxxxxx, but it returns:
rpc error: code = 5 desc = open /var/run/docker/libcontainerd/containerd/faf3fd49262cc738e16368001eba5e1113abcb8a87e7b818cb84af3799906149/30fe901c16e0465aa15b596bf3e4f244fb12a7e4133b6e4da5aa35167a8dfb30/shim-log.json: no such file or directory
I tried rebooting the node, but it did not help.
Thanks @Prafull Ladha. Eventually, I restarted docker (systemctl restart docker) on the node where my pods could not be shelled into, and it returned to normal.
The problem is with containerd. Once containerd restarts in the background, the docker daemon still tries to process event streams against the old socket handles. The error handling when the client can't connect to containerd then leads to a CPU spike on the machine.
This is an open issue with docker, and currently the workaround is to restart docker:
sudo systemctl restart docker
It appears to be an issue with the docker daemon. It would help if you added the logs from the container to research the root cause.
Deploy an alpine pod and see if you can get into the container. This isolates whether the problem is with the platform or with the pod you are running:
kubectl run pingpong --image=alpine -- ping 8.8.8.8
kubectl exec -it <pingpong-pod-name> -- sh
Most likely something is wrong with the pod that you are running; share the container logs for further help.
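A minimal sketch of gathering that information (the pod name is a placeholder):
# dump the container's stdout/stderr
kubectl logs <pingpong-pod-name>
# show pod status, conditions, and recent events
kubectl describe pod <pingpong-pod-name>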

Docker Compose - one specific container randomly doesn't start properly

I have a Docker environment with 5 containers that are composed via docker-compose. Only on Mac machines, and only sometimes (it seems completely random), 1 of these 5 containers doesn't start.
The weird thing about it is that docker ps says the container is running and I can connect to it. Inside the container is a JBoss server, and ps says there is a process running the JBoss. BUT in fact the JBoss is not up and running: there is no logging in the docker-compose console and JBoss is not accessible.
There is also the problem that, if this happens, the whole docker-compose process cannot be cancelled properly anymore. All containers shut down, and can also be forced to shut down, except the JBoss container. Then the docker-machine hangs up.
I didn't find any hint on the web... please help!
It seems that the process running inside the container is in a weird state.
Try killing it without providing a grace period, or removing the container.
stop:  Stop a container by sending SIGTERM and then SIGKILL after a grace period
  --help=false          Print usage
  -t, --time=10         Seconds to wait for stop before killing it
kill:  Kill a running container using SIGKILL or a specified signal
  --help=false          Print usage
  -s, --signal="KILL"   Signal to send to the container
rm:    Remove one or more containers
  -f, --force=false     Force the removal of a running container (uses SIGKILL)
  --help=false          Print usage
  -l, --link=false      Remove the specified link
  -v, --volumes=false   Remove the volumes associated with the container
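For example, against the stuck container (the container ID comes from docker ps):
# kill immediately with SIGKILL
docker kill <container_id>
# or force-remove the container entirely
docker rm -f <container_id>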
Moreover, try checking the logs of the container:
docker logs --follow <container_name or container_id>
After updating to docker v1.10 the problem didn't occur anymore :)