Kubeflow pipeline fails to create container - Kubernetes

I'm running Kubeflow on a local machine that I deployed with Multipass using these steps, but when I tried running my pipeline it got stuck with the message ContainerCreating. When I ran kubectl describe pod train-pipeline-msmwc-1648946763 -n kubeflow I found this in the Events part of the output:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedMount 7m12s (x51 over 120m) kubelet, kubeflow-vm Unable to mount volumes for pod "train-pipeline-msmwc-1648946763_kubeflow(45889c06-87cf-4467-8cfa-3673c7633518)": timeout expired waiting for volumes to attach or mount for pod "kubeflow"/"train-pipeline-msmwc-1648946763". list of unmounted volumes=[docker-sock]. list of unattached volumes=[podmetadata docker-sock mlpipeline-minio-artifact pipeline-runner-token-dkvps]
Warning FailedMount 2m22s (x67 over 122m) kubelet, kubeflow-vm MountVolume.SetUp failed for volume "docker-sock" : hostPath type check failed: /var/run/docker.sock is not a socket file
It looks to me like there is a problem with my deployment, but I'm new to Kubernetes and can't figure out what I'm supposed to do right now. Any idea how to solve this? I don't know if it helps, but I'm pulling the containers from a private Docker registry and I've set up the secret according to this.

You don't need to use Docker. The problem is actually with the workflow-controller-configmap in the kubeflow namespace. You can edit it with
kubectl edit configmap workflow-controller-configmap -n kubeflow
and change containerRuntimeExecutor: docker to containerRuntimeExecutor: pns. You can also change some of the steps and install Kubeflow 1.3 on Kubernetes 1.21 in Multipass rather than 1.15. Do not use the kubeflow add-on (at least it didn't work for me). You need kustomize 3.2 to build the manifests, as mentioned in https://github.com/kubeflow/manifests#installation.
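If you prefer a one-liner over interactive editing, here is a minimal sketch of the same change using kubectl patch; it assumes containerRuntimeExecutor is a flat key under data and that the controller Deployment is named workflow-controller (in some Argo/Kubeflow versions the executor lives inside a multi-line config: block instead, in which case edit that block):
kubectl -n kubeflow patch configmap workflow-controller-configmap \
  --type merge -p '{"data":{"containerRuntimeExecutor":"pns"}}'
# restart the controller so it picks up the new executor (deployment name is an assumption)
kubectl -n kubeflow rollout restart deployment workflow-controller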

There was one step missing that is not mentioned in the tutorial: I had to install Docker. I installed Docker, rebooted the machine, and now everything works fine.
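For reference, a minimal sketch of installing Docker inside the VM, assuming it runs Ubuntu (package names and whether a reboot is needed can vary):
sudo apt-get update
sudo apt-get install -y docker.io    # Docker from the Ubuntu repositories
sudo systemctl enable --now docker   # start the daemon and enable it on boot
ls -l /var/run/docker.sock           # the hostPath socket the pipeline pod expects to mount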

Related

In a Google Cloud Kubernetes cluster my pods sometimes all restart; how do I find the reason for the restart?

From time to time all my pods restart and I'm not sure how to figure out why it's happening. Is there somewhere in Google Cloud where I can get that information, or a kubectl command to run? It happens every couple of months or so, maybe less frequently than that.
Use the methods below to check the reason for a pod restart:
Use kubectl describe deployment <deployment_name> and kubectl describe pod <pod_name>, which contain the relevant information.
# Events:
# Type Reason Age From Message
# ---- ------ ---- ---- -------
# Warning BackOff 40m kubelet, gke-xx Back-off restarting failed container
# ..
Here you can see that the pod is being restarted because its container keeps failing (back-off restarting failed container); that is the particular issue to troubleshoot.
Check for logs using : kubectl logs <pod_name>
To get the previous logs of your container (the restarted one), you can use the --previous flag, like this:
kubectl logs your_pod_name --previous
You can also write a final message to /dev/termination-log, and this will show up as described in the docs (see the sketch after these tips).
Attaching a troubleshooting doc for reference.
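As a small illustration of the termination-log tip above, a container's command can record why it is exiting before it dies. This is only a hypothetical sketch; my_job stands in for your real workload:
sh -c 'my_job || { echo "my_job failed at $(date)" > /dev/termination-log; exit 1; }'
# after the restart, the message appears under Last State in:
kubectl describe pod <pod_name>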
It's also a good thing to check your cluster and node-pool operations.
Check the cluster operations in Cloud Shell by running the command:
gcloud container operations list
Check the age of the nodes with the command:
kubectl get nodes
Check and analyze how your deployment reacts to operations such as cluster upgrades, node-pool upgrades, and node-pool auto-repair. You can check Cloud Logging for cluster or node-pool upgrades using the queries below.
Note that you have to fill in your cluster and node-pool names in the queries.
Control plane (master) upgraded:
resource.type="gke_cluster"
log_id("cloudaudit.googleapis.com/activity")
protoPayload.methodName:("UpdateCluster" OR "UpdateClusterInternal")
(protoPayload.metadata.operationType="UPGRADE_MASTER"
OR protoPayload.response.operationType="UPGRADE_MASTER")
resource.labels.cluster_name=""
Node-pool upgraded:
resource.type="gke_nodepool"
log_id("cloudaudit.googleapis.com/activity")
protoPayload.methodName:("UpdateNodePool" OR "UpdateClusterInternal")
protoPayload.metadata.operationType="UPGRADE_NODES"
resource.labels.cluster_name=""
resource.labels.nodepool_name=""
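To run such a query from the command line rather than the Logs Explorer, something like the following should work (my-cluster is a placeholder for your cluster name):
gcloud logging read \
  'resource.type="gke_cluster"
   log_id("cloudaudit.googleapis.com/activity")
   protoPayload.metadata.operationType="UPGRADE_MASTER"
   resource.labels.cluster_name="my-cluster"' \
  --limit=10 --format=json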

Failed to run istio-ingressgateway, got "Readiness probe failed: connection refused"

I failed to deploy Istio and ran into this problem when I tried to install it with istioctl install --set profile=default -y. The output looks like this:
➜ istio-1.11.4 istioctl install --set profile=default -y
✔ Istio core installed
✔ Istiod installed
✘ Ingress gateways encountered an error: failed to wait for resource: resources not ready after 5m0s: timed out waiting for the condition
Deployment/istio-system/istio-ingressgateway (containers with unready status: [istio-proxy])
- Pruning removed resources Error: failed to install manifests: errors occurred during operation
After running kubectl get pods -n=istio-system, I found that the pod for istio-ingressgateway was created. The Events section of its describe output is:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 4m36s default-scheduler Successfully assigned istio-system/istio-ingressgateway-8dbb57f65-vc85p to k8s-slave
Normal Pulled 4m35s kubelet Container image "docker.io/istio/proxyv2:1.11.4" already present on machine
Normal Created 4m35s kubelet Created container istio-proxy
Normal Started 4m35s kubelet Started container istio-proxy
Warning Unhealthy 3m56s (x22 over 4m34s) kubelet Readiness probe failed: Get "http://10.244.1.4:15021/healthz/ready": dial tcp 10.244.1.4:15021: connect: connection refused
And I can't get the logs of this pod:
➜ ~ kubectl logs pods/istio-ingressgateway-8dbb57f65-vc85p -n=istio-system
Error from server: Get "https://192.168.0.154:10250/containerLogs/istio-system/istio-ingressgateway-8dbb57f65-vc85p/istio-proxy": dial tcp 192.168.0.154:10250: i/o timeout
I ran all these commands on two VMs in Huawei Cloud, with a 2C8G master and a 2C4G slave running Ubuntu 18.04. I have reinstalled the environment and the Kubernetes cluster, but that doesn't help.
Without ingressgateway
I also tried istioctl install --set profile=minimal -y, which only runs istiod. But when I try to run httpbin (kubectl apply -f samples/httpbin/httpbin.yaml) with auto-injection on, the deployment can't create a pod.
➜ istio-1.11.4 kubectl get deployment
NAME READY UP-TO-DATE AVAILABLE AGE
httpbin 0/1 0 0 5m24s
➜ istio-1.11.4 kubectl describe deployment/httpbin
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal ScalingReplicaSet 6m6s deployment-controller Scaled up replica set httpbin-74fb669cc6 to 1
When I unlabel the default namespace (kubectl label namespace default istio-injection-), everything works fine.
I want to deploy the Istio ingress gateway and run demos behind it, but I have no idea how to resolve this situation. Thanks for any help.
I made a silly mistake Orz.
After communication with my cloud provider, I was informed that there was a network security policy on my cloud servers. Strangely, one server had full access while the other had only partial access (only ports like 80, 443 and so on were allowed). After I changed the policy, everything works fine.
For anyone who runs into a similar issue: after hours of searching on Google, I found that these problems usually come down to network issues such as DNS configuration, k8s configuration, or the server's network itself. As howardjohn said in this issue, this is not an Istio problem.
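If you want to confirm a blocked port before contacting your provider, a quick connectivity check between the nodes can help. The sketch below reuses the addresses from the errors above; your IPs and tooling will differ:
# from the master node
nc -vz 192.168.0.154 10250   # kubelet API on the slave, used by kubectl logs
nc -vz 10.244.1.4 15021      # readiness endpoint of the ingress gateway pod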

MountVolume.SetUp failed for volume "rook-ceph-crash-collector-keyring" : secret "rook-ceph-crash-collector-keyring" not found

I am trying to configure Ceph on a Kubernetes cluster using Rook. I have run the following commands:
kubectl apply -f common.yaml
kubectl apply -f operator.yaml
kubectl apply -f cluster.yaml
I have three worker nodes with attached volumes and one master node. All the created pods are running except the rook-ceph-crashcollector pods for the three nodes. When I describe these pods I get this message:
MountVolume.SetUp failed for volume "rook-ceph-crash-collector-keyring" : secret "rook-ceph-crash-collector-keyring" not found
However, all the nodes are up and working.
It is hard to tell exactly what the cause of this might be, but there are a few possibilities:
Cluster networking problem between nodes
Possible leftover sockets in the /var/lib/kubelet directory related to Rook Ceph.
A bug when connecting to an external Ceph cluster.
In order to fix your issue you can:
Use Flannel and make sure it is using the right interface. Check the kube-flannel.yml file and see if it uses the --iface= option, or alternatively try Calico (see the sketch after this list).
Clear the /var/lib/rook/, /var/lib/kubelet/plugins/ and /var/lib/kubelet/plugins_registry/ directories and reinstall the Rook service.
Create the rook-ceph-crash-collector-keyring secret manually by executing: kubectl -n rook-ceph create secret generic rook-ceph-crash-collector-keyring.
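A rough sketch of how to check the first and last points; the DaemonSet name kube-flannel-ds and the kube-system namespace are assumptions that may differ in your cluster:
kubectl -n kube-system get daemonset kube-flannel-ds -o yaml | grep -- '--iface'   # is flannel pinned to the right interface?
kubectl -n rook-ceph get secret rook-ceph-crash-collector-keyring                  # does the keyring secret exist now?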

pod with pvc stuck on container creating

My overall issue is that my pod, which has a PVC, is stuck in ContainerCreating after it was deleted. My guess as to why is the following:
So, I have a pod with a mounted PVC. I did a:
kubectl exec -it "name" bash
navigated to the path of the mounted PVC and wanted to create a tar gzip file of several directories. The reason was that I wanted to copy the folders to my local machine, but they were quite big. Anyway, I managed to create the tar file, but then someone else released to our dev environment and the pod was killed. After that, when recreating our environment, the pod with the PVC that holds the tar file is stuck in ContainerCreating. Is it because I created that file on the PVC? Based on the warnings, it seems like the PVC still points to the previous pod.
kubectl get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
graphite-pvc Bound xxxx 256Gi RWO managed-premium 12
and if I describe the pod, I get the following warnings:
kubectl describe pod xxx
Warning FailedAttachVolume 22m (x8 over 24m) attachdetach-controller
AttachVolume.Attach failed for volume "pvc-f65cb358-014b-11ea-b698-000d3a556597" : Attach volume "kubernetes-dynamic-pvc-f65cb358-014b-11ea-b698-000d3a556597" to instance "/subscriptions/1405bf18-bf7d-4a2f-9aa7-25ff73ba58a6/resourceGroups/cie-dev-2-1-eastus/providers/Microsoft.Compute/virtualMachineScaleSets/k8s-dev-nodes-2002/virtualMachines/6" failed with compute.VirtualMachineScaleSetVMsClient#Update: Failure sending request: StatusCode=0 -- Original Error: autorest/azure: Service returned an error. Status= Code="ConflictingUserInput" Message="Disk '/subscriptions/1405bf18-bf7d-4a2f-9aa7-25ff73ba58a6/resourceGroups/cie-dev-2-1-eastus/providers/Microsoft.Compute/disks/kubernetes-dynamic-pvc-f65cb358-014b-11ea-b698-000d3a556597' cannot be attached as the disk is already owned by VM '/subscriptions/1405bf18-bf7d-4a2f-9aa7-25ff73ba58a6/resourceGroups/cie-dev-2-1-eastus/providers/Microsoft.Compute/virtualMachineScaleSets/k8s-dev-nodes-2002/virtualMachines/k8s-dev-nodes-2002_111'."
and
Warning FailedMount 48s (x13 over 28m) kubelet, k8s-dev-nodes-2002000006 Unable to mount volumes for pod "xxxx": timeout expired waiting for volumes to attach or mount for pod "xxxxxx". list of unmounted volumes=[pvc_name]. list of unattached volumes=[pvc_name default-token-6tmkm]
So, first of all, do you think this has any correlation with the fact that I was inside the PVC creating a file when the pod was killed, or is it pure coincidence (it cannot be, right)?
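For what it's worth, the FailedAttachVolume message says the Azure disk is still attached to another VM in the scale set. A hedged way to check where Kubernetes thinks the volume is attached versus where the new pod was scheduled (resource names taken from the output above):
kubectl get volumeattachments | grep pvc-f65cb358-014b-11ea-b698-000d3a556597   # which node still holds the attachment
kubectl get pod xxx -o wide                                                     # which node the new pod landed on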

Minikube got stuck when creating container

I recently got started learning Kubernetes by using Minikube locally on my Mac. Previously, I was able to start a local Kubernetes cluster with Minikube 0.10.0, create a deployment, and view the Kubernetes dashboard.
Yesterday I tried to delete the cluster and redo everything from scratch. However, I found I could not get the assets deployed and could not view the dashboard. From what I saw, everything seemed to get stuck during container creation.
After I ran minikube start, it reported
Starting local Kubernetes cluster...
Kubectl is now configured to use the cluster.
When I ran kubectl get pods --all-namespaces, it reported (pay attention to the STATUS column):
kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system kube-addon-manager-minikube 0/1 ContainerCreating 0 51s
docker ps showed nothing:
docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
minikube status tells me the VM and cluster are running:
minikubeVM: Running
localkube: Running
If I tried to create a deployment and an autoscaler, I was told they were created successfully:
kubectl create -f configs
deployment "hello-minikube" created
horizontalpodautoscaler "hello-minikube-autoscaler" created
$ kubectl get pods --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE
default hello-minikube-661011369-1pgey 0/1 ContainerCreating 0 1m
default hello-minikube-661011369-91iyw 0/1 ContainerCreating 0 1m
kube-system kube-addon-manager-minikube 0/1 ContainerCreating 0 21m
When exposing the service, it said:
$ kubectl expose deployment hello-minikube --type=NodePort
service "hello-minikube" exposed
$ kubectl get service
NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE
hello-minikube 10.0.0.32 <nodes> 8080/TCP 6s
kubernetes 10.0.0.1 <none> 443/TCP 22m
When I tried to access the service, I was told:
curl $(minikube service hello-minikube --url)
Waiting, endpoint for service is not ready yet...
docker ps still showed nothing. It looked to me like everything got stuck when creating a container. I tried some other ways to work around this issue:
Upgraded to minikube 0.11.0
Used the xhyve driver instead of the VirtualBox driver
Deleted everything cached, like ~/.minikube, ~/.kube, and the cluster, and retried
None of them worked for me.
Kubernetes is still new to me and I would like to know:
How can I troubleshoot this kind of issue?
What could be the cause of this issue?
Any help is appreciated. Thanks.
It turned out to be a network problem in my case.
The pod status was ContainerCreating, and I found that during container creation the Docker image is pulled from gcr.io, which is inaccessible in China (blocked by the GFW). It had worked for me previously because I happened to be connected to it via a VPN.
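One way to confirm this from inside the Minikube VM is to try pulling one of the kube-system images by hand; the exact image name below is only an example and may differ for your Kubernetes version:
minikube ssh
docker pull gcr.io/google_containers/pause-amd64:3.0   # hangs or fails if gcr.io is unreachable from the VM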
I haven't tried Minikube, but I use Kubernetes. With the information provided it is difficult to say what the cause of the issue is. Your Minikube has no problem creating resources, but ContainerCreating points to a problem with the Docker daemon, improper communication between the kube-apiserver and the Docker daemon, or some problem with the kubelet.
You can try the following command:
kubectl describe po POD_NAME
This will give you the pod's events. Maybe this will provide a path to the root cause of the issue.
You may also check the logs of the kubelet to get the events.
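On a Minikube VM the kubelet's logs are usually reachable in one of the following ways, depending on the Minikube version (very old releases ran everything inside localkube, newer ones run the kubelet under systemd):
minikube logs                                    # aggregated cluster / localkube logs
minikube ssh "sudo journalctl -u kubelet -f"     # where the kubelet runs as a systemd unit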
I had this problem on Windows, but it was related to an NTLM proxy. I deleted the Minikube VM, then recreated it with the correct proxy settings for my CNTLM installation:
minikube start \
--docker-env http_proxy=http://10.0.2.2:3128 \
--docker-env https_proxy=http://10.0.2.2:3128 \
--docker-env no_proxy=localhost,127.0.0.1,::1,192.168.99.100
See https://blog.alexellis.io/minikube-behind-proxy/
The horizontalpodautoscaler (HPA) requires Heapster to work, so you'll need to run Heapster in Minikube. You can always debug these kinds of issues with minikube logs or interactively through the dashboard opened by minikube dashboard.
You can find the steps to run Heapster and Grafana at https://github.com/kubernetes/heapster
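On Minikube versions of that era there was also a built-in addon, which may be simpler than deploying the manifests by hand (addon availability depends on your Minikube version):
minikube addons list              # check whether a heapster addon is available
minikube addons enable heapster   # enable it if it is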
In my case, it took several minutes before the ContainerCreating problem appeared. After executing the following command:
systemctl status kube-controller-manager.service
I got this error:
Sync "default/redis-master-2229813293" failed with unable to create pods: No API token found for service account "default", retry after the token is automatically created and added to the service account.
There are two ways to solve this:
Set up the service account with a token
Remove ServiceAccount from the KUBE_ADMISSION_CONTROL setting of the API server (see the sketch below)
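A rough sketch of the second option, assuming a package-based installation that reads /etc/kubernetes/apiserver; paths and flag contents vary between setups, and dropping ServiceAccount weakens security, so prefer fixing the token if you can:
sudo grep KUBE_ADMISSION_CONTROL /etc/kubernetes/apiserver
# KUBE_ADMISSION_CONTROL="--admission-control=NamespaceLifecycle,...,ServiceAccount,ResourceQuota"
# remove ServiceAccount from that list, then restart the API server:
sudo systemctl restart kube-apiserver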