I currently have a single-node Kubernetes instance running on a VM. The disk attached to this VM is 100GB, but 'df -h' shows that the / partition only has 88GB available (other stuff is used for OS overhead etc...)
I have a kubernetes manifest that creates a 100GB local Persistent Volume.
I also have a pod creating a 100GB Persistent Volume Claim.
Both of these deploy and come up normally even though the entire VM does not even have 100GB available.
To make things even more complicated, the VM is thin provisioned, and only using 20 GB on the actual disk right now...
HOW IS THIS WORKING !?!?
The local provisioner does no size checks, nor is the size enforced anyway. The final "volume" is just a bind mount like with hostPath. The main reason local PVs exist is because hostPath isn't something the scheduler understands so in a multi-node scenario it won't restrict topology.
Related
I'm a beginner in kubernetes, and when I was reading the book, I found that it is not recommended to use hostpath as the volume type for production environment, because it will lead to binding between pod and node, but if you don't use hostpath, then if you use other volume types, when reading and writing files, will it lead to extra network IO, and will this performance suffer? Will this have an additional performance impact?
hostpath is, as the name suggests, reading and writing from a place on the host where the pod is running. If the host goes down, or the pod gets evicted or otherwise removed from the node, that data is (normally) lost. This is why the "binding" is mentioned -- the pod must stay on that same node otherwise it will lose that data.
Using a volume type and having volumes provisioned is better as the disk and the pod can be reattached together on another node and you will not lose the data.
In terms of I/O, there would indeed be a miniscule difference, since you're no longer talking to the node's local disk but a mounted disk.
hostPath volumes are generally used for temporary files or storage that can be lost without impact to the pod, in much the same way you would use /tmp on a desktop machine/
To get a local volume you can use the volume type Local volume, but you need a local volume provisioner that can allocate and recycle volumes for you.
Since local volumes are disks on the host, there are no performance trade-offs. But it is more common to use network located volumes provided by a cloud provider, and they do have a latency trade-off.
According to the documentation:
A PersistentVolume (PV) is a piece of storage in the cluster that has been provisioned ... It is a resource in the cluster just like a node is a cluster resource...
So I was reading about all currently available plugins for PVs and I understand that for 3rd-party / out-of-cluster storage this doesn't matter (e.g. storing data in EBS, Azure or GCE disks) because there are no or very little implications when adding or removing nodes from a cluster. However, there are different ones such as (ignoring hostPath as that works only for single-node clusters):
csi
local
which (at least from what I've read in the docs) don't require 3rd-party vendors/software.
But also:
... local volumes are subject to the availability of the underlying node and are not suitable for all applications. If a node becomes unhealthy, then the local volume becomes inaccessible by the pod. The pod using this volume is unable to run. Applications using local volumes must be able to tolerate this reduced availability, as well as potential data loss, depending on the durability characteristics of the underlying disk.
The local PersistentVolume requires manual cleanup and deletion by the user if the external static provisioner is not used to manage the volume lifecycle.
Use-case
Let's say I have a single-node cluster with a single local PV and I want to add a new node to the cluster, so I have 2-node cluster (small numbers for simplicity).
Will the data from an already existing local PV be 1:1 replicated into the new node as in having one PV with 2 nodes of redundancy or is it strictly bound to the existing node only?
If the already existing PV can't be adjusted from 1 to 2 nodes, can a new PV (created from scratch) be created so it's 1:1 replicated between 2+ nodes on the cluster?
Alternatively if not, what would be the correct approach without using a 3rd-party out-of-cluster solution? Will using csi cause any change to the overall approach or is it the same with redundancy, just different "engine" under the hood?
Can a new PV be created so it's 1:1 replicated between 2+ nodes on the cluster?
None of the standard volume types are replicated at all. If you can use a volume type that supports ReadWriteMany access (most readily NFS) then multiple pods can use it simultaneously, but you would have to run the matching NFS server.
Of the volume types you reference:
hostPath is a directory on the node the pod happens to be running on. It's not a directory on any specific node, so if the pod gets recreated on a different node, it will refer to the same directory but on the new node, presumably with different content. Aside from basic test scenarios I'm not sure when a hostPath PersistentVolume would be useful.
local is a directory on a specific node, or at least following a node-affinity constraint. Kubernetes knows that not all storage can be mounted on every node, so this automatically constrains the pod to run on the node that has the directory (assuming the node still exists).
csi is an extremely generic extension mechanism, so that you can run storage drivers that aren't on the list you link to. There are some features that might be better supported by the CSI version of a storage backend than the in-tree version. (I'm familiar with AWS: the EBS CSI driver supports snapshots and resizing; the EFS CSI driver can dynamically provision NFS directories.)
In the specific case of a local test cluster (say, using kind) using a local volume will constrain pods to run on the node that has the data, which is more robust than using a hostPath volume. It won't replicate the data, though, so if the node with the data is deleted, the data goes away with it.
I'm teaching myself Kubernetes with a 5 Rpi cluster, and I'm a bit confused by the way Kubernetes treats Persistent Volumes with respect to Pod Scheduling.
I have 4 worker nodes using ext4 formatted 64GB micro SD cards. It's not going to give GCP or AWS a run for their money, but it's a side project.
Let's say I create a Persistent volume Claim requesting 10GB of storage on worker1, and I deploy a service which relies on this PVC, is that service then forced to be scheduled on worker1?
Should I be looking into distributed file systems like Ceph or Hdfs so that Pods aren't restricted to being scheduled on a particular node?
Sorry if this seems like a stupid question, I'm self taught and still trying to figure this stuff out! (Feel free to improve my tl;dr doc for kubernetes with a pull req)
just some examples, as already mentioned it depends on your storage system, as i see you use the local storage option
Local Storage:
yes the pod needs to be run on the same machine where the pv is located (your case)
ISCSI/Trident San:
no, the node will mount the iscsi block device where the pod will be scheduled
(as mentioned already volume binding mode is an important keyword, its possible you need to set this to 'WaitForFirstConsumer')
NFS/Trident Nas:
no, its nfs, mountable from everywhere if you can access and auth against it
VMWare VMDK's:
no, same as iscsi, the node which gets the pod scheduled mounts the vmdk from the datastore
ceph/rook.io:
no, you get 3 options for storage, file, block an object storage, every type is distributed so you can schedule a pod on every node.
also ceph is the ideal system for carrying a distributed software defined storage on commodity hardware, what i can recommend is https://rook.io/ basically an opensource ceph on 'container-steroids'
Let's say I create a Persistent volume Claim requesting 10GB of storage on worker1, and I deploy a service which relies on this PVC, is that service then forced to be scheduled on worker1?
This is a good question. How this works depends on your storage system. The StorageClass defined for your Persistent Volume Claim contains information about Volume Binding Mode. It is common to use dynamic provisioning volumes, so that the volume is first allocated when a user/consumer/Pod is scheduled. And typically this volume does not exist on the local Node but remote in the same data center. Kubernetes also has support for Local Persistent Volumes that are physical volumes located on the same Node, but they are typically more expensive and used when you need high disk performance and volume.
Kubernetes has pretty extensive volume and volume mounting support (many different volume types, subpaths, mounting single files).
Can the same be achieved with GCE VMs?
Update:
I have some Kubernetes workflow that uses NFS and GCE PD volumes.
Suppose I want to run the same workflow without Kubernetes (by just starting GCE VMs).
What volume-related features will I lose/keep?
Some examples of features:
Having the same volume shared between multiple producer Pods/VMs.
Mounting single files into container/VM (as opposed to mounting directories only).
The PVs and GCE PD volumes used by GKE use Google Persistent Disks and thus are bound by the same limitations. This also means that there isn't much you can do on k8s that you can't do on GCE. The major difference is the resources won't be as fluid.
You can attach a disk to a GCE VM and mount it as a subpath if you want at the OS level or just mount the entire disk normally. You can also use a single disk in readOnlyMany mode which can be shared by multiple VMs in the same zone (same restriction you have in GKE). If you need scalability, you can use a Managed Instance Group that uses a snapshot of your disk so that replication won't skew the data.
You can also mount NFS in GCE as in GKE.
Migrating from GKE to GCE generally does not have too many restrictions. The major difference is that you are moving from a managed orchestration system to an unmanaged VM so you may need to do some more leg work to make sure that there is scalability (if need be) and resiliency.
Aside from the benefits that k8s offers all around, I can't think of any major benefits you lose concerning the volumes specifically.
I tried to remain on the free-tier of google cloud platform and it only permits 3 nodes and 30Gb of Storage in which where the cluster created, each nodes are mapped to each storage 10Gb each.
And when I tried to mount persistentVolume and Claims to existing Disks, the error shows :
Attach failed for volume "myapp-pv" : googleapi: Error 400: The disk resource 'projects/myapp-dev/zones/us-central1-a/disks/gke-myapp-dev-clus-default-pool-64e30c4b-dvkc' is already being used by 'projects/myapp-dev/zones/us-central1-a/instances/gke-myapp-dev-clus-default-pool-64e30c4b-dvkc
The working sollution is for me to create another disks, but the problem is it is out of the free-tier, I wonder how can we stay in free-tier without creating another persistentDisk in GCP ?
And when I tried to mount persistentVolume and Claims to existing Disks, the error shows
This error is happening because of this constraint of PV on GCE:
Important! A volume can only be mounted using one access mode at a time,
even if it supports many. For example, a GCEPersistentDisk can be mounted as ReadWriteOnce
by a single node or ReadOnlyMany by many nodes, but not at the same time.
Table given in above link shown that GCEPersistentDisk can't be mounted as ReadWriteMany so if you need to connect it in that way you have to use some other volume plugin.
I wonder how can we stay in free-tier without creating another persistentDisk in GCP?
Just some thouhgts... With free-tier you are limited in a number of nodes and disk space available:
You can always 'simulate' ReadWriteMany with NFS volume plugin for example (installing your own provisioner for NFS) providing that your use case is not excluding NFS usage. Dowside is that you need to install NFS provisioner (squeeze it in you capacity) and it is not really well suited for fast io (database and stuff)
You can use hostPath on each of the nodes and manually juggle pods around, but that is prone to data loss and not really a proper kubernetes approach to PV handling. This is something to consider if you need fast io (you are testing with databases) and proper backup should be in place to avoid data loss if node dies.