How does OpenEBS determine the location of my volume? - kubernetes

I am experimenting with OpenEBS as a storage provider for our Kubernetes cluster. OpenEBS is installed via Helm on a cluster consisting of 5 nodes, created by Rancher. It seems to work; however, I don't really understand how the volume itself is provisioned.
Each node is created with 2 disks, with logical volumes spanning over the disks. For example:
$ lsblk
NAME                                     MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda                                        8:0    0   20G  0 disk
├─sda1                                     8:1    0    1G  0 part /boot
└─sda2                                     8:2    0   19G  0 part
  ├─centos_intern--rancher--node05-root  253:0    0   50G  0 lvm  /
  └─centos_intern--rancher--node05-swap  253:1    0  7,9G  0 lvm  [SWAP]
sdb                                        8:16   0   80G  0 disk
└─sdb1                                     8:17   0   80G  0 part
  ├─centos_intern--rancher--node05-root  253:0    0   50G  0 lvm  /
  └─centos_intern--rancher--node05-home  253:2    0 41,1G  0 lvm  /home
The node device manager (NDM) is configured with a filter excluding loop,fd0,sr0,/dev/ram,/dev/dm-,/dev/md. So far, so good.
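For reference, that exclude list lives in the openebs-ndm-config ConfigMap under filterconfigs -> path-filter (the same keys referenced later on this page). A minimal sketch of the relevant section is below; the surrounding layout is assumed from a stock Helm install and may differ between chart versions.

# Excerpt of the node-disk-manager config (filterconfigs section only).
# Devices whose paths match "exclude" are ignored by NDM; "include",
# when set, narrows discovery to matching paths.
filterconfigs:
  - key: path-filter
    name: path filter
    state: true
    include: ""
    exclude: "loop,fd0,sr0,/dev/ram,/dev/dm-,/dev/md"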
When we list the block device resources created by NDM, it lists 2 resources for this node (other nodes are omitted)
> kubectl get blockdevice --all-namespaces
NAMESPACE   NAME                                           NODENAME                SIZE          CLAIMSTATE   STATUS   AGE
openebs     blockdevice-d7d2b90b000a8b2268faf07c9e0f7cc5   intern-rancher-node05   85899345920   Unclaimed    Active   18h
openebs     sparse-e4ea6423e7d139104049e67566a2b634        intern-rancher-node05   10737418240   Unclaimed    Active   18h
Exploring the created blockdevice, we see that it uses /dev/sdb as disk:
> kubectl describe blockdevice blockdevice-d7d2b90b000a8b2268faf07c9e0f7cc5 -n openebs
Name:          blockdevice-d7d2b90b000a8b2268faf07c9e0f7cc5
...
Node Attributes:
  Node Name:   intern-rancher-node05
Partitioned:   No
Path:          /dev/sdb
Status:
  Claim State: Unclaimed
  State:       Active
Events:        <none>
This is where my understanding stops. Why did NDM pick /dev/sdb and not /dev/sda? What is the difference between the disks such that one is used and the other is not? Should /dev/sdb not be skipped because it is in use by the logical volumes? If I create a persistent volume, does this limit the size of my logical volumes (/home)?
Also, if I create a persistent volume claim (using jiva), a persistent volume is created in /var/openebs, for example /var/openebs/pvc-cdc4c5a2-89e1-41ed-b9e7-c672f27a8bed. Does this mean it doesn't use the disk at all, but stores everything in the filesystem on the logical volume?
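For context on that last point: with the legacy jiva engine, the default StorageClass is typically backed by a hostPath-style StoragePool rather than by a blockdevice claimed through NDM, which is why the replica data lands under /var/openebs on the node's root filesystem. A rough sketch of what that default pool object looks like (names taken from a stock install; details may vary by OpenEBS version):

# Legacy jiva "default" storage pool: data is written to a directory on the
# node's filesystem, not to a raw blockdevice.
apiVersion: openebs.io/v1alpha1
kind: StoragePool
metadata:
  name: default
spec:
  path: /var/openebs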

Related

Unschedulable 0/1 nodes are available insufficient ephemeral-storage

I have one strange issue.
The error that I'm getting is:
unschedulable 0/1 nodes are available insufficient ephemeral-storage
My requests per workflow that I run in kubernetes are:
resources:
  requests:
    ephemeral-storage: 50Gi
    memory: 8Gi
And my node capacity is 100GiB per node.
When I run kubectl describe node <node-name> I get the following result:
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests     Limits
  --------                    --------     ------
  cpu                         125m (3%)    0 (0%)
  memory                      8Gi (55%)    0 (0%)
  ephemeral-storage           50Gi (56%)   0 (0%)
  hugepages-1Gi               0 (0%)       0 (0%)
  hugepages-2Mi               0 (0%)       0 (0%)
  attachable-volumes-aws-ebs  0            0
Do ephemeral-storage and memory overlap? What can the issue be? I cannot resolve it.
In the kubectl describe node output, Kubernetes reports 50 GiB of disk ("ephemeral-storage") already requested, and that this is 56% of the allocatable resources. That implies there's about 89 GiB of usable disk space and about 39 GiB left, which is less than your container claims it needs.
If the node has a 100 GiB disk, space required by the operating system, Kubernetes, and any pulled images counts against that disk capacity before being considered available for ephemeral storage. You probably will never be able to run two Pods that both require 50 GiB of disk; with the OS overhead they will not both fit at the same time.
(It's also not impossible that the node has 100 GB and not 100 GiB storage. 100 * 10^9 is only 93 * 2^30, which would make that overhead about 4 GiB, which feels a little more typical to me.)
The easiest and "most Kubernetes" option here is to get another node, maybe via the cluster autoscaler. If you do control the node configuration, changing nodes to more like 120 GB storage would make these Pods fit better. Especially in an AWS/EKS context, current Kubernetes also supports generic ephemeral volumes which would let you get per-pod storage backed by a volume (on AWS, most likely an EBS volume) rather than fixed-size local disk.

What does Kubelet use to determine the ephemeral-storage capacity of the node?

I have a Kubernetes cluster running on a VM. A truncated overview of the mounts is:
$ df -h
Filesystem                    Size  Used Avail Use% Mounted on
/dev/sda1                      20G  4.5G   15G  24% /
/dev/mapper/vg001-lv--docker  140G   33G  108G  23% /var/lib/docker
As you can see, I added an extra disk to store the docker images and its volumes. However, when querying the node's capacity, the following is returned
Capacity:
  cpu:                12
  ephemeral-storage:  20145724Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             65831264Ki
  nvidia.com/gpu:     1
  pods:               110
ephemeral-storage is 20145724Ki which is 20G, referring to the disk mounted at /.
How does Kubelet calculate its ephemeral-storage? Is it simply looking at the disk space available at /? Or is it looking at another folder like /var/log/containers?
This is a similar post, where the user eventually resorted to increasing the size of the disk mounted at /.
Some theory
By default, Capacity and Allocatable for ephemeral-storage in a standard Kubernetes environment are sourced from the filesystem mounted at /var/lib/kubelet.
This is the default location for the kubelet directory.
The kubelet supports the following filesystem partitions:
nodefs: The node's main filesystem, used for local disk volumes, emptyDir, log storage, and more. For example, nodefs contains /var/lib/kubelet/.
imagefs: An optional filesystem that container runtimes use to store container images and container writable layers.
Kubelet auto-discovers these filesystems and ignores other filesystems. Kubelet does not support other configurations.
From the Kubernetes website about volumes:
The storage media (such as Disk or SSD) of an emptyDir volume is determined by the medium of the filesystem holding the kubelet root dir (typically /var/lib/kubelet).
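In other words, a plain emptyDir draws from the node's ephemeral-storage (the filesystem holding the kubelet root dir) unless you explicitly pick a different medium. A small sketch for comparison (pod and container names are just placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: emptydir-demo
spec:
  containers:
  - name: app
    image: busybox
    command: ["sh", "-c", "sleep 3600"]
    volumeMounts:
    - name: scratch-disk
      mountPath: /scratch
    - name: scratch-ram
      mountPath: /scratch-ram
  volumes:
  - name: scratch-disk
    emptyDir: {}               # backed by the filesystem holding /var/lib/kubelet
  - name: scratch-ram
    emptyDir:
      medium: Memory           # backed by tmpfs instead; usage counts against memory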
The location of the kubelet directory can be configured by providing:
A command-line parameter during kubelet initialization:
--root-dir string
Default: /var/lib/kubelet
Via kubeadm with a config file, e.g.:
apiVersion: kubeadm.k8s.io/v1beta3
kind: InitConfiguration
nodeRegistration:
  kubeletExtraArgs:
    root-dir: "/data/var/lib/kubelet"
Customizing kubelet:
To customize the kubelet you can add a KubeletConfiguration next to the ClusterConfiguration or InitConfiguration separated by --- within the same configuration file. This file can then be passed to kubeadm init.
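Putting the two together, a single kubeadm config file could both move the kubelet root dir and set the eviction thresholds discussed below. A sketch (the path is only an example, and the thresholds simply repeat the defaults for illustration):

apiVersion: kubeadm.k8s.io/v1beta3
kind: InitConfiguration
nodeRegistration:
  kubeletExtraArgs:
    root-dir: "/data/var/lib/kubelet"   # kubelet root dir on the bigger disk
---
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  nodefs.available: "10%"     # drives the Capacity -> Allocatable reduction
  imagefs.available: "15%"
  memory.available: "100Mi"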
When bootstrapping a Kubernetes cluster using kubeadm, the Capacity reported by kubectl get node is equal to the capacity of the disk mounted at /var/lib/kubelet.
However Allocatable will be reported as:
Allocatable = Capacity - 10% of nodefs with the standard kubeadm configuration, since the kubelet has the following default hard eviction threshold:
nodefs.available<10%
This can be configured during kubelet initialization with:
--eviction-hard mapStringString
Default: imagefs.available<15%,memory.available<100Mi,nodefs.available<10%
Example
I set up a test environment for Kubernetes with a master node and two worker nodes (worker-1 and worker-2).
Both worker nodes have volumes of the same capacity: 50GB.
Additionally, I mounted a second volume with a capacity of 20GB for the worker-1 node at the path /var/lib/kubelet.
Then I created a cluster with kubeadm.
Result
From worker-1 node:
skorkin@worker-1:~# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        49G  2.8G   46G   6% /
...
/dev/sdb         20G   45M   20G   1% /var/lib/kubelet
and
Capacity:
  cpu:                2
  ephemeral-storage:  20511312Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             4027428Ki
  pods:               110
The size of ephemeral-storage is the same as the volume mounted at /var/lib/kubelet.
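Following the Allocatable = Capacity - 10% nodefs rule above, the allocatable ephemeral-storage for worker-1 should then come out to roughly 20511312Ki * 0.9 ≈ 18460180Ki (about 17.6 GiB).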
From worker-2 node:
skorkin@worker-2:~# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        49G  2.7G   46G   6% /
and
Capacity:
  cpu:                2
  ephemeral-storage:  50633164Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             4027420Ki
  pods:               110

Using Microk8s and OpenEBS cStor leads to an error when creating pool claims. Anybody know why this is occurring, and how to fix it?

I am using Microk8s (1.19, on Ubuntu 20.04.1 LTS) and am trying to use OpenEBS (cStor engine) for storage.
Since I'm running this locally before pushing to prod, I created virtual block devices with:
blockdevicedisk='/k8storage/diskimage'
blockdevicesize=10000
sudo dd if=/dev/zero of=$blockdevicedisk bs=1M count=$blockdevicesize
sudo mkfs.ext4 $blockdevicedisk
sudo losetup /dev/loop0 /k8storage/diskimage
$lsblk
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
loop0    7:0    0  9.8G  0 loop
loop1    7:1    0  7.8G  0 loop
sda      8:0    0  256G  0 disk
sdb      8:16   0  256G  0 disk /
I installed OpenEBS with Helm, then removed 'loop' from the openebs-ndm-config -> filterconfigs -> path-filter -> exclude list so that NDM would discover these loop devices as block devices.
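For clarity, the edited section of the openebs-ndm-config ConfigMap would then look roughly like this (other keys omitted; the exact layout is assumed from a stock Helm install):

filterconfigs:
  - key: path-filter
    name: path filter
    state: true
    include: ""
    exclude: "fd0,sr0,/dev/ram,/dev/dm-,/dev/md"   # 'loop' removed so loop0/loop1 are discovered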
$kubectl get blockdevices -n openebs
NAME                                           NODENAME   SIZE           CLAIMSTATE   STATUS   AGE
blockdevice-87ca7d6819eab3ea3af2884f2f6e9f8e   v          274877906944   Unclaimed    Active   19h
blockdevice-0a6c8d26081660a37f0a87dbb316c7ae   v          10485760000    Unclaimed    Active   19h
blockdevice-cd43d37664edd1c880e11f5b8e9cbe60   v          8388608000     Unclaimed    Active   19h
^ the last 2 are the ones I made.
I then wrote the config to create a cStor StoragePoolClaim
apiVersion: openebs.io/v1alpha1
kind: StoragePoolClaim
metadata:
  name: cstor-pool-claim
spec:
  name: cstor-pool-claim
  type: disk
  poolSpec:
    poolType: striped
  blockDevices:
    blockDeviceList:
    - blockdevice-0a6c8d26081660a37f0a87dbb316c7ae
    - blockdevice-cd43d37664edd1c880e11f5b8e9cbe60
When I apply it, both the blockdevices are claimed
$kubectl get blockdevices -n openebs
NAME                                           NODENAME   SIZE           CLAIMSTATE   STATUS   AGE
blockdevice-87ca7d6819eab3ea3af2884f2f6e9f8e   v          274877906944   Unclaimed    Active   19h
blockdevice-0a6c8d26081660a37f0a87dbb316c7ae   v          10485760000    Claimed      Active   19h
blockdevice-cd43d37664edd1c880e11f5b8e9cbe60   v          8388608000     Claimed      Active   19h
which is expected.
$kubectl get spc
NAME               AGE
cstor-pool-claim   18h
However, there is a problem!
$kubectl get csp
NAME                    ALLOCATED   FREE   CAPACITY   STATUS   READONLY   TYPE      AGE
cstor-pool-claim-nf0g                                 Init     false      striped   19h
It never changes from Init status. A pod is created for the pool, and describing it shows the error:
$kubectl describe pod cstor-pool-claim-nf0g-6cb75f8f49-sw6q2 -n openebs
This spews a lot of text, which I could include if it helps. The key part is the error message:
Error: failed to create containerd task: OCI runtime create failed:
container_linux.go:370: starting container process caused:
process_linux.go:459: container init caused: rootfs_linux.go:59:
mounting "/var/snap/microk8s/common/run/containerd/io.containerd.grpc.v1.cri/sandboxes/a9b84df9076c91b83982f157e9bacdc5a10f80846d32034dd15cdae1c1d4c4c1/shm" to rootfs at "/dev/shm" caused: secure join: too many levels of symbolic links: unknown
I've tried resetting my setup and re-entering the commands one by one to ensure that I was following the documentation and other examples correctly; however, I keep encountering this error.
Is this a limitation of microk8s? A fault of OpenEBS? Something weird about my setup? Or did I do something wrong?
More importantly: Is there a way to get this to work correctly?

GlusterFS volume with replica 3 arbiter 1 mounted in Kubernetes PODs contains zero size files

I was planning to migrate from replica 3 to replica 3 with arbiter 1, but ran into a strange issue on my third node (which acts as the arbiter).
When I mounted the new volume endpoint on the node where the Gluster arbiter POD is running, I saw strange behavior: some files are fine, but some have zero size. When I mount the same share on another node, all files are fine.
GlusterFS is running as a Kubernetes daemonset, and I'm using heketi to manage GlusterFS from Kubernetes automatically.
I'm using glusterfs 4.1.5 and Kubernetes 1.11.1.
gluster volume info vol_3ffdfde93880e8aa39c4b4abddc392cf
Type: Replicate
Volume ID: e67d2ade-991a-40f9-8f26-572d0982850d
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x (2 + 1) = 3
Transport-type: tcp
Bricks:
Brick1: 192.168.2.70:/var/lib/heketi/mounts/vg_426b28072d8d0a4c27075930ddcdd740/brick_35389ca30d8f631004d292b76d32a03b/brick
Brick2: 192.168.2.96:/var/lib/heketi/mounts/vg_3a9b2f229b1e13c0f639db6564f0d820/brick_953450ef6bc25bfc1deae661ea04e92d/brick
Brick3: 192.168.2.148:/var/lib/heketi/mounts/vg_7d1e57c2a8a779e69d22af42812dffd7/brick_b27af182cb69e108c1652dc85b04e44a/brick (arbiter)
Options Reconfigured:
user.heketi.id: 3ffdfde93880e8aa39c4b4abddc392cf
user.heketi.arbiter: true
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off
Status Output:
gluster volume status vol_3ffdfde93880e8aa39c4b4abddc392cf
Status of volume: vol_3ffdfde93880e8aa39c4b4abddc392cf
Gluster process TCP Port RDMA Port Online Pid
------------------------------------------------------------------------------
Brick 192.168.2.70:/var/lib/heketi/mounts/v
g_426b28072d8d0a4c27075930ddcdd740/brick_35
389ca30d8f631004d292b76d32a03b/brick 49152 0 Y 13896
Brick 192.168.2.96:/var/lib/heketi/mounts/v
g_3a9b2f229b1e13c0f639db6564f0d820/brick_95
3450ef6bc25bfc1deae661ea04e92d/brick 49152 0 Y 12111
Brick 192.168.2.148:/var/lib/heketi/mounts/
vg_7d1e57c2a8a779e69d22af42812dffd7/brick_b
27af182cb69e108c1652dc85b04e44a/brick 49152 0 Y 25045
Self-heal Daemon on localhost N/A N/A Y 25069
Self-heal Daemon on worker1-aws-va N/A N/A Y 12134
Self-heal Daemon on 192.168.2.70 N/A N/A Y 13919
Task Status of Volume vol_3ffdfde93880e8aa39c4b4abddc392cf
------------------------------------------------------------------------------
There are no active volume tasks
Heal output:
gluster volume heal vol_3ffdfde93880e8aa39c4b4abddc392cf info
Brick 192.168.2.70:/var/lib/heketi/mounts/vg_426b28072d8d0a4c27075930ddcdd740/brick_35389ca30d8f631004d292b76d32a03b/brick
Status: Connected
Number of entries: 0
Brick 192.168.2.96:/var/lib/heketi/mounts/vg_3a9b2f229b1e13c0f639db6564f0d820/brick_953450ef6bc25bfc1deae661ea04e92d/brick
Status: Connected
Number of entries: 0
Brick 192.168.2.148:/var/lib/heketi/mounts/vg_7d1e57c2a8a779e69d22af42812dffd7/brick_b27af182cb69e108c1652dc85b04e44a/brick
Status: Connected
Number of entries: 0
Any ideas how to resolve this issue?
The issues were fixed after updating glusterfs-client and glusterfs-common packages on Kubernetes Workers to a more recent version.

K8s Volume doesn't detach from host

We're using Kubernetes on-premise and it's currently running on VMWare. So far, we have been successful in provisioning volumes for the apps that we deploy. The problem comes if the pods, for whatever reason, switch to a different worker node. When that happens, the disk fails to mount on the second worker as it's already present on the first worker where the pod was originally running. See below:
As it stands, we have no app on either worker1 or worker2:
[root@worker01 ~]# lsblk
NAME                 MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
fd0                    2:0    1     4K  0 disk
sda                    8:0    0   200G  0 disk
├─sda1                 8:1    0   500M  0 part /boot
└─sda2                 8:2    0 199.5G  0 part
  ├─vg_root-lv_root  253:0    0    20G  0 lvm  /
  ├─vg_root-lv_swap  253:1    0     2G  0 lvm
  ├─vg_root-lv_var   253:2    0    50G  0 lvm  /var
  └─vg_root-lv_k8s   253:3    0    20G  0 lvm  /mnt/disks
sr0                   11:0    1  1024M  0 rom
[root@worker02 ~]# lsblk
NAME                 MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
fd0                    2:0    1     4K  0 disk
sda                    8:0    0   200G  0 disk
├─sda1                 8:1    0   500M  0 part /boot
└─sda2                 8:2    0 199.5G  0 part
  ├─vg_root-lv_root  253:0    0    20G  0 lvm  /
  ├─vg_root-lv_swap  253:1    0     2G  0 lvm
  ├─vg_root-lv_var   253:2    0    50G  0 lvm  /var
  └─vg_root-lv_k8s   253:3    0    20G  0 lvm  /mnt/disks
sr0                   11:0    1   4.5G  0 rom
Next we create our PVC with the following:
[root@master01 ~]$ cat app-pvc.yaml
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-pvc
  annotations:
    volume.beta.kubernetes.io/storage-class: thin-disk
  namespace: tools
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
[root@master01 ~]$ kubectl create -f app-pvc.yaml
persistentvolumeclaim "app-pvc" created
This works fine as the disk is created and bound:
[root@master01 ~]$ kubectl get pvc -n tools
NAME      STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
app-pvc   Bound    pvc-d4bf77cc-294e-11e9-9106-005056a4b1c7   10Gi       RWO            thin-disk      12s
[root@master01 ~]$ kubectl get pv -n tools
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM           STORAGECLASS   REASON   AGE
pvc-d4bf77cc-294e-11e9-9106-005056a4b1c7   10Gi       RWO            Delete           Bound    tools/app-pvc   thin-disk               12s
Now we can deploy our application, which creates the pod and sorts out the storage, etc.:
[centos@master01 ~]$ cat app.yaml
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: app
  namespace: tools
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: app
    spec:
      containers:
      - image: sonatype/app3:latest
        imagePullPolicy: IfNotPresent
        name: app
        ports:
        - containerPort: 8081
        - containerPort: 5000
        volumeMounts:
        - mountPath: /app-data
          name: app-data-volume
      securityContext:
        fsGroup: 2000
      volumes:
      - name: app-data-volume
        persistentVolumeClaim:
          claimName: app-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: app-service
  namespace: tools
spec:
  type: NodePort
  ports:
  - port: 80
    targetPort: 8081
    protocol: TCP
    name: http
  - port: 5000
    targetPort: 5000
    protocol: TCP
    name: docker
  selector:
    app: app
[centos@master01 ~]$ kubectl create -f app.yaml
deployment.extensions "app" created
service "app-service" created
This deploys fine:
[centos@master01 ~]$ kubectl get pods -n tools
NAME                   READY   STATUS              RESTARTS   AGE
app-6588cf4b87-wvwg2   0/1     ContainerCreating   0          6s
[centos@neb-k8s02-master01 ~]$ kubectl describe pod app-6588cf4b87-wvwg2 -n tools
Events:
  Type    Reason                  Age  From                     Message
  ----    ------                  ---- ----                     -------
  Normal  Scheduled               18s  default-scheduler        Successfully assigned nexus-6588cf4b87-wvwg2 to neb-k8s02-worker01
  Normal  SuccessfulMountVolume   18s  kubelet, worker01        MountVolume.SetUp succeeded for volume "default-token-7cv62"
  Normal  SuccessfulAttachVolume  15s  attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-d4bf77cc-294e-11e9-9106-005056a4b1c7"
  Normal  SuccessfulMountVolume   7s   kubelet, worker01        MountVolume.SetUp succeeded for volume "pvc-d4bf77cc-294e-11e9-9106-005056a4b1c7"
  Normal  Pulled                  7s   kubelet, worker01        Container image "acme/app:latest" already present on machine
  Normal  Created                 7s   kubelet, worker01        Created container
  Normal  Started                 6s   kubelet, worker01        Started container
We can also see the disk has been created and mounted in VMWare for Worker01 and not for Worker02:
[root@worker01 ~]# lsblk
NAME                 MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
fd0                    2:0    1     4K  0 disk
sda                    8:0    0   200G  0 disk
├─sda1                 8:1    0   500M  0 part /boot
└─sda2                 8:2    0 199.5G  0 part
  ├─vg_root-lv_root  253:0    0    20G  0 lvm  /
  ├─vg_root-lv_swap  253:1    0     2G  0 lvm
  ├─vg_root-lv_var   253:2    0    50G  0 lvm  /var
  └─vg_root-lv_k8s   253:3    0    20G  0 lvm  /mnt/disks
sdb                    8:16   0    10G  0 disk /var/lib/kubelet/pods/1e55ad6a-294f-11e9-9175-005056a47f18/volumes/kubernetes.io~vsphere-volume/pvc-d4bf77cc-294e-11e9-9106-005056a4b1c7
sr0                   11:0    1  1024M  0 rom
[root@worker02 ~]# lsblk
NAME                 MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
fd0                    2:0    1     4K  0 disk
sda                    8:0    0   200G  0 disk
├─sda1                 8:1    0   500M  0 part /boot
└─sda2                 8:2    0 199.5G  0 part
  ├─vg_root-lv_root  253:0    0    20G  0 lvm  /
  ├─vg_root-lv_swap  253:1    0     2G  0 lvm
  ├─vg_root-lv_var   253:2    0    50G  0 lvm  /var
  └─vg_root-lv_k8s   253:3    0    20G  0 lvm  /mnt/disks
sr0                   11:0    1   4.5G  0 rom
If Worker01 falls over then Worker02 kicks in and we can see the disk being attached to the other node:
[root@worker02 ~]# lsblk
NAME                 MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
fd0                    2:0    1     4K  0 disk
sda                    8:0    0   200G  0 disk
├─sda1                 8:1    0   500M  0 part /boot
└─sda2                 8:2    0 199.5G  0 part
  ├─vg_root-lv_root  253:0    0    20G  0 lvm  /
  ├─vg_root-lv_swap  253:1    0     2G  0 lvm
  ├─vg_root-lv_var   253:2    0    50G  0 lvm  /var
  └─vg_root-lv_k8s   253:3    0    20G  0 lvm  /mnt/disks
sdb                    8:16   0    10G  0 disk /var/lib/kubelet/pods/a0695030-2950-11e9-9175-005056a47f18/volumes/kubernetes.io~vsphere-volume/pvc-d4bf77cc-294e-11e9-9106-005056a4b1c7
sr0                   11:0    1   4.5G  0 rom
However, seeing as the disk is now attached to both Worker01 and Worker02, Worker01 will no longer start, citing the following error in vCenter:
Cannot open the disk '/vmfs/volumes/5ba35d3b-21568577-efd4-469e3c301eaa/kubevols/kubernetes-dynamic-pvc-e55ad6a-294f-11e9-9175-005056a47f18.vmdk' or one of the snapshot disks it depends on.
This error occurs because (I assume) Worker02 has access to the disk and is reading from and writing to it. Shouldn't Kubernetes detach the disk from nodes that don't need it once it's been attached to another node? How can we go about fixing this issue? If a pod moves to another host due to node failure, we currently have to manually detach the disk and then start the other worker manually.
Any and all help appreciated.
First, I'll assume you're running in-tree vSphere disks.
Second, in this case (and even more so with CSI), Kubernetes doesn't have control over all volume operations. The VMWare functionality for managing attachment and detachment of a disk is implemented in the volume plugin you are using; Kubernetes doesn't strictly control all volume attachment/detachment semantics as a generic function.
To see the in-tree implementation details, check out:
https://kubernetes.io/docs/concepts/storage/volumes/#vspherevolume
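For reference, an in-tree vSphere setup like the thin-disk class used in the question is typically defined along these lines (a sketch; the parameters are assumptions, not taken from the original post):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: thin-disk
provisioner: kubernetes.io/vsphere-volume   # in-tree vSphere volume plugin
parameters:
  diskformat: thin                          # assumed; zeroedthick / eagerzeroedthick are also accepted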
Overall, I think the way you are doing failover means that when your pod on worker1 dies, a replacement can schedule on worker2. At that point, worker1 should not be able to grab the same PVC, and a pod should not schedule there until the pod on worker2 dies.
However, if worker1 is scheduling anyway, it means that vSphere is trying to (erroneously) let worker1 start, and the kubelet is failing.
There is a chance that this is a bug in the VMWare driver, in that it will bind a persistent volume even though it is not ready to.
To elaborate further, details about how the workload on worker2 is being launched would help. Is it a separate replication controller, or is it running outside of Kubernetes? If the latter, then the volumes won't be managed the same way, and you can't use the same PVC as the locking mechanism.