Pod with PVC stuck on ContainerCreating - Kubernetes

My overall issue is that my pod, which has a PVC, is stuck on ContainerCreating after it was deleted. My guess as to why is the following:
So, I have a pod with a mounted PVC. I did a:
kubectl exec -it <pod-name> -- bash
navigated to the path of the mounted PVC, and wanted to create a tar.gz file of several directories. The reason was that I wanted to copy the folders to my local machine, but they were quite big. Anyway, I managed to create the tar file, but someone else released to our dev environment and the pod was killed. After that, when recreating our environment, the pod with the PVC that has the tar file is stuck on ContainerCreating. Is it because I created that file on the PVC? Based on the warnings, it seems like the PVC points to the previous pod?
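For context, the copy workflow looked roughly like this (a sketch; <pod-name>, <namespace>, the mount path, and the directory names are placeholders rather than the actual values):

kubectl exec -it <pod-name> -- bash
# inside the pod, at the PVC mount path:
cd /path/to/pvc/mount
tar -czf backup.tar.gz dir1 dir2
exit
# back on the local machine, pull the archive down:
kubectl cp <namespace>/<pod-name>:/path/to/pvc/mount/backup.tar.gz ./backup.tar.gz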
kubectl get pvc
NAME           STATUS   VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS      AGE
graphite-pvc   Bound    xxxx     256Gi      RWO            managed-premium   12
and if I describe the pod, I get the following warnings:
kubectl describe pod xxx
Warning FailedAttachVolume 22m (x8 over 24m) attachdetach-controller
AttachVolume.Attach failed for volume "pvc-f65cb358-014b-11ea-b698-000d3a556597" : Attach volume "kubernetes-dynamic-pvc-f65cb358-014b-11ea-b698-000d3a556597" to instance "/subscriptions/1405bf18-bf7d-4a2f-9aa7-25ff73ba58a6/resourceGroups/cie-dev-2-1-eastus/providers/Microsoft.Compute/virtualMachineScaleSets/k8s-dev-nodes-2002/virtualMachines/6" failed with compute.VirtualMachineScaleSetVMsClient#Update: Failure sending request: StatusCode=0 -- Original Error: autorest/azure: Service returned an error. Status= Code="ConflictingUserInput" Message="Disk '/subscriptions/1405bf18-bf7d-4a2f-9aa7-25ff73ba58a6/resourceGroups/cie-dev-2-1-eastus/providers/Microsoft.Compute/disks/kubernetes-dynamic-pvc-f65cb358-014b-11ea-b698-000d3a556597' cannot be attached as the disk is already owned by VM '/subscriptions/1405bf18-bf7d-4a2f-9aa7-25ff73ba58a6/resourceGroups/cie-dev-2-1-eastus/providers/Microsoft.Compute/virtualMachineScaleSets/k8s-dev-nodes-2002/virtualMachines/k8s-dev-nodes-2002_111'."
and
Warning FailedMount 48s (x13 over 28m) kubelet, k8s-dev-nodes-2002000006 Unable to mount volumes for pod "xxxx": timeout expired waiting for volumes to attach or mount for pod "xxxxxx". list of unmounted volumes=[pvc_name]. list of unattached volumes=[pvc_name default-token-6tmkm]
So, first, do you think it has any correlation with the fact that I was inside the PVC and created a file when the pod was killed, or is it pure coincidence (it cannot be, right?)?
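For what it's worth, the ConflictingUserInput error above says the Azure disk is still attached to the old node (virtualMachines/k8s-dev-nodes-2002_111) while the attach-detach controller is trying to attach it to a new one (virtualMachines/6). A hedged way to inspect and clear the stale attachment (<volumeattachment-name> is a placeholder):

# find the VolumeAttachment that still references the old node
kubectl get volumeattachment | grep pvc-f65cb358-014b-11ea-b698-000d3a556597
# if it still points at the old node, deleting it lets the attach-detach
# controller retry the attach on the new node
kubectl delete volumeattachment <volumeattachment-name>
# if that is not enough, detach the disk from the old VMSS instance manually
# (via the Azure portal or the Azure CLI)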

Related

Failed to provision volume with StorageClass "rook-cephfs": rpc error: code = Aborted desc = an operation with the given Volume ID pvc- already exists

I am deploying an application with a Helm chart.
I am facing an issue with the StorageClass rook-cephfs: whenever I deploy the chart, my pods are in Pending state because the PVCs are not getting created. The logs for the PVC are:
Warning ProvisioningFailed 96s (x13 over 20m) rook-ceph.cephfs.csi.ceph.com_csi-cephfsplugin-provisioner-775dcbbc86-nt8tr_170456b2-6876-4a49-9077-05cd2395cfed failed to provision volume with StorageClass "rook-cephfs": rpc error: code = Aborted desc = an operation with the given Volume ID pvc-f65acb47-f145-449e-ba1c-e8a61efa67b0 already exists
Try to restart the MDS service. If you can copy the data of the PVC to a new one, that should also work.
I had this error, and what solved it for me was deleting the csi-cephfsplugin-provisioner and csi-rbdplugin-provisioner pods and letting the ReplicaSet recreate them. Once I did that, all of my PVCs created PVs and bound as expected. I may have only needed to kill the csi-rbdplugin-provisioner pods, so try that first.
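A sketch of that fix, assuming Rook is installed in the default rook-ceph namespace and the provisioner pods carry the standard app labels:

# recreate the RBD provisioner pods first
kubectl -n rook-ceph delete pod -l app=csi-rbdplugin-provisioner
# if the PVCs still do not bind, recreate the CephFS provisioner pods too
kubectl -n rook-ceph delete pod -l app=csi-cephfsplugin-provisioner
# the owning Deployments/ReplicaSets bring replacement pods back up automatically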

Kubeflow pipeline fail to create container

I'm running Kubeflow on a local machine that I deployed with multipass using these steps, but when I tried running my pipeline, it got stuck with the message ContainerCreating. When I ran kubectl describe pod train-pipeline-msmwc-1648946763 -n kubeflow, I found this in the Events part of the describe:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedMount 7m12s (x51 over 120m) kubelet, kubeflow-vm Unable to mount volumes for pod "train-pipeline-msmwc-1648946763_kubeflow(45889c06-87cf-4467-8cfa-3673c7633518)": timeout expired waiting for volumes to attach or mount for pod "kubeflow"/"train-pipeline-msmwc-1648946763". list of unmounted volumes=[docker-sock]. list of unattached volumes=[podmetadata docker-sock mlpipeline-minio-artifact pipeline-runner-token-dkvps]
Warning FailedMount 2m22s (x67 over 122m) kubelet, kubeflow-vm MountVolume.SetUp failed for volume "docker-sock" : hostPath type check failed: /var/run/docker.sock is not a socket file
It looks to me like there is a problem with my deployment, but I'm new to Kubernetes and can't figure out what I'm supposed to do right now. Any idea on how to solve this? I don't know if it helps, but I'm pulling the containers from a private Docker registry and I've set up the secret according to this.
You don't need to use Docker. In fact, the problem is with workflow-controller-configmap in the kubeflow namespace. You can edit it with
kubectl edit configmap workflow-controller-configmap -n kubeflow
and change containerRuntimeExecutor: docker to containerRuntimeExecutor: pns. You can also change some of the steps and install Kubeflow 1.3 on multipass (1.21 rather than 1.15). Do not use the kubeflow add-on (at least it didn't work for me). You need kustomize 3.2 to create the manifests, as mentioned in https://github.com/kubeflow/manifests#installation.
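For reference, after the edit the ConfigMap looks roughly like this (a sketch; the surrounding keys vary by Kubeflow/Argo version):

apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
  namespace: kubeflow
data:
  # pns (process namespace sharing) does not need the Docker socket
  containerRuntimeExecutor: pns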
There was one step missing which is not mentioned in the tutorial: I had to install Docker. I installed Docker, rebooted the machine, and now everything works fine.

Kubernetes Persistent Volume Claim FileSystemResizePending

I have a persistent volume claim for a Kubernetes pod which shows the message "Waiting for user to (re-)start a pod to finish file system resize of volume on node." if I check it with kubectl describe pvc ....
The resizing itself, done with Terraform in our deployments, worked, but this message still shows up and I'm not really sure how to get it fixed. The pod was already restarted several times - I tried kubectl delete pod and scaling it down with kubectl scale deployment.
Does anyone have an idea how to get rid of this message?
There are a few things to consider:
Instead of using Terraform, try resizing the PVC by editing it manually. After that, wait for the underlying volume to be expanded by the storage provider and verify that the FileSystemResizePending condition is present by executing kubectl get pvc <pvc_name> -o yaml. Then make sure that all the associated pods are restarted so the whole process can be completed. Once the file system resizing is done, the PVC will automatically be updated to reflect the new size. (The commands for this flow are sketched after this list.)
Make sure that your volume type is supported for expansion. You can expand the following types of volumes:
gcePersistentDisk
awsElasticBlockStore
Cinder
glusterfs
rbd
Azure File
Azure Disk
Portworx
FlexVolumes
CSI
Check that in your StorageClass the allowVolumeExpansion field is set to true; an example follows below.
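A hedged sketch of the manual resize flow from the first point (<pvc-name>, <deployment-name>, and the 300Gi size are placeholders):

kubectl patch pvc <pvc-name> -p '{"spec":{"resources":{"requests":{"storage":"300Gi"}}}}'
# wait for the storage provider to expand the volume, then look for the
# FileSystemResizePending condition under status.conditions:
kubectl get pvc <pvc-name> -o yaml
# restart the pods that mount the PVC so the file system resize can finish:
kubectl rollout restart deployment <deployment-name>

And an illustrative StorageClass with expansion enabled (the provisioner here is just an example; keep whatever your cluster already uses):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-premium
provisioner: kubernetes.io/azure-disk
allowVolumeExpansion: true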

Mount error while trying to connect encrypted AWS EFS with efs-csi-node in EKS

I have tried connecting an unencrypted EFS and it is working fine, but with encrypted EFS, the pod throws the error below:
Normal Scheduled 10m default-scheduler Successfully assigned default/jenkins-efs-test-8ffb4dc86-xnjdj to ip-10-100-4-249.ap-south-1.compute.internal
Warning FailedMount 6m33s (x2 over 8m49s) kubelet, ip-10-100-4-249.ap-south-1.compute.internal Unable to attach or mount volumes: unmounted volumes=[jenkins-home], unattached volumes=[sc-config-volume tmp jenkins-home jenkins-config secrets-dir plugins plugin-dir jenkins-efs-test-token-7nmkz]: timed out waiting for the condition
Warning FailedMount 4m19s kubelet, ip-10-100-4-249.ap-south-1.compute.internal Unable to attach or mount volumes: unmounted volumes=[jenkins-home], unattached volumes=[plugins plugin-dir jenkins-efs-test-token-7nmkz sc-config-volume tmp jenkins-home jenkins-config secrets-dir]: timed out waiting for the condition
Warning FailedMount 2m2s kubelet, ip-10-100-4-249.ap-south-1.compute.internal Unable to attach or mount volumes: unmounted volumes=[jenkins-home], unattached volumes=[tmp jenkins-home jenkins-config secrets-dir plugins plugin-dir jenkins-efs-test-token-7nmkz sc-config-volume]: timed out waiting for the condition
Warning FailedMount 35s (x13 over 10m) kubelet, ip-10-100-4-249.ap-south-1.compute.internal MountVolume.SetUp failed for volume "efs-pv" : kubernetes.io/csi: mounter.SetupAt failed: rpc error: code = Internal desc = Could not mount "" at "/var/lib/kubelet/pods/354800a1-dcf5-4812-aa91-0e84ca6fba59/volumes/kubernetes.io~csi/efs-pv/mount": mount failed: exit status 1
Mounting command: mount
Mounting arguments: -t efs /var/lib/kubelet/pods/354800a1-dcf5-4812-aa91-0e84ca6fba59/volumes/kubernetes.io~csi/efs-pv/mount
Output: mount: /var/lib/kubelet/pods/354800a1-dcf5-4812-aa91-0e84ca6fba59/volumes/kubernetes.io~csi/efs-pv/mount: can't find in /etc/fstab.
What am I missing here?
You didn't specify the K8s manifests or any configuration. There shouldn't be any difference between encrypted and non-encrypted volumes when it comes to mounting from the client side; in essence, AWS manages the encryption keys for you using KMS.
The error you are seeing is basically because the mount command is missing its source (note the empty "" in Could not mount "" at ...), so mount falls back to /etc/fstab and fails. Some other piece of configuration on the K8s side must have changed compared to the unencrypted EFS volume. Also, is the EFS mount helper available on the Kubernetes node where you are trying to mount the EFS volume?
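An empty mount source usually means the EFS file system ID never reached the driver. A minimal sketch of a static PV for the aws-efs-csi-driver, for comparison (fs-12345678 is a placeholder for your file system ID):

apiVersion: v1
kind: PersistentVolume
metadata:
  name: efs-pv
spec:
  capacity:
    storage: 5Gi               # EFS ignores this, but the API requires it
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: efs-sc
  csi:
    driver: efs.csi.aws.com
    volumeHandle: fs-12345678  # if this is empty, the mount source becomes ""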
Check the logs of the cloud-init agent (/var/log/cloud-init.log and /var/log/cloud-init-output.log) if the EFS filesystem mount does not work as expected. Check the /etc/fstab file.
Try to update the efs-csi-node DaemonSet from the amazon/aws-efs-csi-driver:v0.3.0 image to amazon/aws-efs-csi-driver:latest.
Here is an example EFS-mounting script (sketched after the notes below). Compare it to yours and note that:
Dependencies for this script:
Default ECS cluster configuration (Amazon Linux ECS AMI).
The ECS instance must have a IAM role that gives it at least read access to EFS (in order to locate the EFS filesystem ID).
The ECS instance must be in a security group that allows port tcp/2049 (NFS) inbound/outbound.
The security group that the ECS instance belongs to must be associated with the target EFS filesystem.
Notes on this script:
The EFS mount path is calculated on a per-instance basis as the EFS endpoint varies depending upon the region and availability zone where the instance is launched.
The EFS mount is added to /etc/fstab so that if the ECS instance is rebooted, the mount point will be re-created.
Docker is restarted to ensure it correctly detects the EFS filesystem mount.
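A hedged sketch of such a user-data script, assuming an Amazon Linux ECS instance with the AWS CLI available and a single EFS filesystem in the account (the filesystem lookup, mount path, and mount options are illustrative):

#!/bin/bash
# look up the EFS filesystem ID (assumes exactly one filesystem; adjust the query otherwise)
EFS_ID=$(aws efs describe-file-systems --query 'FileSystems[0].FileSystemId' --output text)
# the classic EFS endpoint is per-AZ, so derive AZ and region from instance metadata
AZ=$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone)
REGION=${AZ%?}    # strip the trailing AZ letter to get the region
MOUNT_POINT=/mnt/efs
mkdir -p "$MOUNT_POINT"
# add the mount to /etc/fstab so it is re-created if the instance reboots
echo "$AZ.$EFS_ID.efs.$REGION.amazonaws.com:/ $MOUNT_POINT nfs4 nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2 0 0" >> /etc/fstab
mount -a
# restart Docker so it correctly detects the EFS filesystem mount
service docker restart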
Restart Docker after mounting EFS with the command: service docker restart. At the end, try to reboot the EKS worker node.
Take a look: mounting-efs-in-eks-cluster-example-deployment-fails, efs-provisioner, dynamic-ip-in-etc-fstab.

How to define the uid, gid of a mounted volume in Pod

This is a question about our production environment. We use Kubernetes to deploy our application through Pods. The Pods may need some storage to store files.
We use a 'Persistent Volume' and 'Persistent Volume Claim' to represent the real backend storage server. Currently, the real backend storage server is NFS, but the NFS is not controlled by us and we cannot change its configuration.
Every time, the uid and gid of the volume mounted into the Pod are 'root root'. But the process in the Pod runs as a non-root user, so it cannot read/write the mounted volume.
Our current solution is to define an initContainer which runs as root and uses the command 'chown [uid] [gid] [folder]' to change the ownership (sketched below). The limitation is that the initContainer must run as root.
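A minimal sketch of that initContainer workaround (uid/gid 1000, the /data path, and the image and PVC names are hypothetical values):

apiVersion: v1
kind: Pod
metadata:
  name: app-with-nfs
spec:
  initContainers:
  - name: fix-perms
    image: busybox
    # chown the NFS mount so the non-root app user can read/write it
    command: ["sh", "-c", "chown -R 1000:1000 /data"]
    securityContext:
      runAsUser: 0     # the limitation: this initContainer must run as root
    volumeMounts:
    - name: data
      mountPath: /data
  containers:
  - name: app
    image: my-app:latest   # hypothetical application image
    securityContext:
      runAsUser: 1000
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: nfs-pvc   # hypothetical PVC name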
For now, we are trying to deploy our application on OpenShift. By default, Pods (containers) cannot run as root there; otherwise, the Pod fails to create.
So my question is: is there a k8s or OpenShift way to define/change the uid and gid of the mounted volume?
I have tried mountOptions, which is talked about in Kubernetes Persistent Volume Claim mounted with wrong gid:
mountOptions: # these options
- uid=1000
- gid=1000
But it failed with the error message below. It seems that the NFS server does not support the uid and gid mount options.
Warning FailedMount 11s kubelet, [xxxxx.net] MountVolume.SetUp failed for volume "nfs-gid-pv" : mount failed: exit status 32 Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for /opt/kubernetes/data/kubelet/pods/3c75930a-d3f7-4d55-9996-4d10dcac9549/volumes/kubernetes.io~nfs/nfs-gid-pv --scope -- mount -t nfs -o gid=1999,uid=1999 shc-sma-cd74.hpeswlab.net:/var/vols/itom/itsma/tzhong /opt/kubernetes/data/kubelet/pods/3c75930a-d3f7-4d55-9996-4d10dcac9549/volumes/kubernetes.io~nfs/nfs-gid-pv
Output: Running scope as unit run-22636.scope.
mount.nfs: an incorrect mount option was specified
Warning FailedMount 7s kubelet, [xxxxx.net] MountVolume.SetUp failed for volume "nfs-gid-pv" : mount failed: exit status 32
Mounting command: systemd-run
Mounting arguments: --description=Kubernetes transient mount for /opt/kubernetes/data/kubelet/pods/3c75930a-d3f7-4d55-9996-4d10dcac9549/volumes/kubernetes.io~nfs/nfs-gid-pv --scope -- mount -t nfs -o gid=1999,uid=1999 shc-sma-cd74.hpeswlab.net:/var/vols/itom/itsma/tzhong /opt/kubernetes/data/kubelet/pods/3c75930a-d3f7-4d55-9996-4d10dcac9549/volumes/kubernetes.io~nfs/nfs-gid-pv
Output: Running scope as unit run-22868.scope.
mount.nfs: an incorrect mount option was specified
If we speak about Kubernetes, you can set the group ID that owns the volume by using fsGroup, a feature of the Pod Security Context.
As for OpenShift, I do not know.
apiVersion: v1
kind: Pod
metadata:
  name: hello-world
spec:
  containers:
  # specification of the pod's containers
  # ...
  securityContext:
    fsGroup: 1000
The security context for a Pod applies to the Pod's Containers and also to the Pod's Volumes when applicable. Specifically fsGroup and seLinuxOptions are applied to Volumes as follows:
fsGroup: Volumes that support ownership management are modified to be owned and writable by the GID specified in fsGroup. See the Ownership Management design document for more details.
You can also read more about it here and follow the steps posted by #rajdeepbs29 here.