Best way to access and store files on a Kubernetes cluster without cloud resources

I need to persist files of different formats and sizes on a Kubernetes cluster volume and access them simultaneously by several applications.
I know there are cloud resources like Azure Files that can help with this issue of simultaneous access to the same storage volume. However, one of my project requirements is not to use cloud resources to persist files.
So what can be the best way to persist files and access them simultaneously without using any cloud resources?

We are currently running NFS, which is performing really well and is pretty straightforward to set up. However, there are several options for non-cloud storage:
cephfs
A cephfs volume allows an existing CephFS volume to be mounted into your Pod. Unlike emptyDir, which is erased when a pod is removed, the contents of a cephfs volume are preserved and the volume is merely unmounted. This means that a cephfs volume can be pre-populated with data, and that data can be shared between pods. The cephfs volume can be mounted by multiple writers simultaneously.
More info
iscsi (does not meet your needs!)
An iscsi volume allows an existing iSCSI (SCSI over IP) volume to be mounted into your Pod. Unlike emptyDir, which is erased when a Pod is removed, the contents of an iscsi volume are preserved and the volume is merely unmounted. This means that an iscsi volume can be pre-populated with data, and that data can be shared between pods.
A feature of iSCSI is that it can be mounted as read-only by multiple consumers simultaneously. This means that you can pre-populate a volume with your dataset and then serve it in parallel from as many Pods as you need. Unfortunately, iSCSI volumes can only be mounted by a single consumer in read-write mode. Simultaneous writers are not allowed.
More info
nfs
An nfs volume allows an existing NFS (Network File System) share to be mounted into a Pod. Unlike emptyDir, which is erased when a Pod is removed, the contents of an nfs volume are preserved and the volume is merely unmounted. This means that an NFS volume can be pre-populated with data, and that data can be shared between pods. NFS can be mounted by multiple writers simultaneously.
More info
Further info: Kubernetes Volumes
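Since NFS is the option already in use, a minimal sketch of mounting an existing share directly in a Pod might look like the following (the server address and export path are hypothetical placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app-using-nfs
spec:
  containers:
    - name: app
      image: nginx
      volumeMounts:
        - name: shared-files
          mountPath: /data        # files appear here inside the container
  volumes:
    - name: shared-files
      nfs:
        server: nfs.example.com   # hypothetical NFS server address
        path: /exports/files      # hypothetical export path
```

Several Pods can declare the same nfs volume and all of them get read-write access to the same files simultaneously.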


Why should I use Kubernetes Persistent Volumes instead of Volumes

To use storage inside Kubernetes PODs I can use volumes and persistent volumes. While volumes like emptyDir are ephemeral, I could use hostPath and many other cloud-based volume plugins which provide a persistent solution as volumes themselves.
In that case why should I be using Persistent Volume then?
It is very important to understand the main differences between Volumes and PersistentVolumes. Both Volumes and PersistentVolumes are Kubernetes resources which provide an abstraction of a data storage facility.
Volumes: let your pod write to a filesystem that exists as long as the pod exists. They also let you share data between containers in the same pod, but data in that volume will be destroyed when the pod is restarted. A Volume decouples the storage from the container; its lifecycle is coupled to the pod.
PersistentVolumes: serve as long-term storage in your Kubernetes cluster. They exist beyond containers, pods, and nodes. A pod uses a PersistentVolumeClaim to get read and write access to the PersistentVolume. A PersistentVolume decouples the storage from the pod; its lifecycle is independent. This enables safe pod restarts and sharing data between pods.
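The contrast can be sketched in Pod spec fragments (the claim name below is hypothetical and would have to exist beforehand):

```yaml
# Ephemeral: the emptyDir data lives only as long as the Pod does
volumes:
  - name: scratch
    emptyDir: {}
```

```yaml
# Persistent: data survives Pod restarts and rescheduling
volumes:
  - name: data
    persistentVolumeClaim:
      claimName: my-app-data   # hypothetical claim, created separately
```

The Pod mounts both kinds the same way via volumeMounts; the difference is entirely in where the data lives and how long it survives.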
When it comes to hostPath:
A hostPath volume mounts a file or directory from the host node's
filesystem into your Pod.
hostPath has its usage scenarios, but in general it is not recommended, for several reasons:
Pods with identical configuration (such as created from a PodTemplate) may behave differently on different nodes due to different files on the nodes
The files or directories created on the underlying hosts are only writable by root. You either need to run your process as root in a privileged Container or modify the file permissions on the host to be able to write to a hostPath volume
You don't always directly control which node your pods will run on, so you're not guaranteed that the pod will actually be scheduled on the node that has the data volume.
If a node goes down, the pod needs to be scheduled on another node, where your locally provisioned volume will not be available.
The hostPath would be good if for example you would like to use it for log collector running in a DaemonSet.
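For that log-collector scenario, a hedged sketch of a DaemonSet using hostPath could look like this (the collector image is just an example; any agent that reads node logs would do):

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: log-collector
spec:
  selector:
    matchLabels:
      app: log-collector
  template:
    metadata:
      labels:
        app: log-collector
    spec:
      containers:
        - name: collector
          image: fluent/fluentd        # example collector image
          volumeMounts:
            - name: varlog
              mountPath: /var/log
              readOnly: true           # the collector only reads node logs
      volumes:
        - name: varlog
          hostPath:
            path: /var/log             # node-local directory
            type: Directory
```

This is one of the cases where node-coupled storage is actually desirable: each DaemonSet pod is supposed to see the logs of exactly the node it runs on.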
I recommend the Kubernetes Volumes Guide as a nice supplement to this topic.
PersistentVolumes are cluster-wide storage and allow you to manage the storage more centrally.
When you configure a volume (either using hostPath or any of the cloud-based volume plugins) then you need to do this configuration within the POD definition file. Every configuration information, required to configure storage for the volume, goes within the POD definition file.
When you have a large environment with a lot of users and a large number of PODs, users will have to configure storage every time for each POD they deploy. Whatever storage solution is used, the user who deploys the POD has to configure that storage in all of his/her POD definition files. If a change needs to be made, the user has to make it in all of his/her PODs. Beyond a certain scale, this is not an optimal way to manage storage.
Instead, you would like to manage this centrally. You would like to manage the storage in such a way that an Administrator can create a large pool of storage and users can carve out a part of this storage as required, and this is exactly what you can do using PersistentVolumes and PersistentVolumeClaims.
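That admin/user split can be sketched as a PV/PVC pair (server address, path and sizes are hypothetical; the backing store could equally be any other volume plugin):

```yaml
# Administrator side: a pool of storage exposed as a PersistentVolume
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-pool-01
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteMany
  nfs:
    server: nfs.example.com   # hypothetical NFS server
    path: /exports/pool
---
# User side: carve out a piece with a PersistentVolumeClaim
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-claim
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi           # request a slice of the pool
```

The POD definition then only references `my-claim`; if the administrator later changes the backing storage, no POD definition needs to be touched.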
Use PersistentVolumes when you need to set up a database like MongoDB, Redis, Postgres or MySQL: they provide long-term storage that is not deeply coupled to your pods, which makes them perfect for database applications, because the data will not die with the pods.
Avoid plain Volumes when you need long-term storage, because they die with the pods!
In my case, when I have to store something, I will always go for persistent volumes!

How kubernetes block, file and object storage types are used inside containers

In the context of Kubernetes, I've come across the terms Block Storage, File Storage and Object Storage but I don't understand how they are really used (mounted) inside a container. I have a few questions,
Are these all storage types backed by raw block devices?
Is Block Storage a term used to mean a logical abstraction of block devices?
Is Block Storage mounted to a path inside a container just like we mount a file system on linux? which also implies the question whether the Block Storage is a formatted file system?
How Object Storage is presented to a container? How does the container make use of it? Is it mounted to a path?
How File Storage is presented to a container? How does the container make use of it? Is it mounted to a path?
What are 3 example scenarios to use these 3 storage types?
Block storage is backed by a block device. It can be a physical disk, or it can be a network-attached device (iSCSI, FC or an AWS EBS volume), or even a Ceph RBD. In most cases pods don't need to work with raw block devices (with the exception of Kube-native storage systems like Ceph or Portworx); instead Kubernetes creates a filesystem on top of the device and mounts it into the pod. The main thing about block storage is that in most cases it is ReadWriteOnce (RWO), which means it can be mounted read-write by only a single node.
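For the rare case where a pod does want the raw device, Kubernetes supports `volumeMode: Block`; a sketch (names and sizes are hypothetical, and the backing StorageClass must support block mode):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: raw-block-claim
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Block            # raw device; Kubernetes creates no filesystem
  resources:
    requests:
      storage: 10Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: block-consumer
spec:
  containers:
    - name: app
      image: busybox
      command: ["sleep", "infinity"]
      volumeDevices:           # note: volumeDevices, not volumeMounts
        - name: data
          devicePath: /dev/xvda   # device node exposed inside the container
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: raw-block-claim
```

With the default `volumeMode: Filesystem`, the same claim would instead be mounted at a path via `volumeMounts`, exactly like mounting a formatted filesystem on Linux.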
File storage is backed by a filesystem. It can be a local filesystem, like hostPath, or a network share like NFS. In that case Kubernetes can directly mount it inside the pod without any additional preparation. The main thing about NFS is that it can be mounted ReadWriteMany (RWX), which means it can be mounted read-write by many pods. Also, a filesystem on one node can be attached to many pods on that particular node.
Object storage can be thought of as files-over-HTTP(S) (AWS S3, GCP GCS, Azure Blob Storage, Ceph RGW, MinIO). There is no officially supported Kubernetes way to mount object storage inside pods, though there are workarounds like s3fs, Ganesha NFS and possibly others. In most cases you will work with object storage directly from your app using provider-specific libraries, which is how it is meant to be used.
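Because the app talks to object storage itself, the Kubernetes side usually amounts to wiring credentials and an endpoint into the pod. A hedged sketch (image, endpoint, Secret name and keys are all hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: object-storage-client
spec:
  containers:
    - name: app
      image: my-app:latest                     # hypothetical application image
      env:
        - name: S3_ENDPOINT
          value: "https://minio.example.com"   # hypothetical S3-compatible endpoint
        - name: S3_ACCESS_KEY
          valueFrom:
            secretKeyRef:
              name: s3-credentials             # hypothetical Secret
              key: access-key
        - name: S3_SECRET_KEY
          valueFrom:
            secretKeyRef:
              name: s3-credentials
              key: secret-key
```

Nothing is mounted at a path; the application reads the environment variables and issues HTTP(S) requests through its S3 client library.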

Installing Postgres in GKE as NFS with multiple micro-services deployed

I have a GKE cluster with almost 6-7 micro-services deployed. I need a Postgres DB installed inside GKE (not Cloud SQL, because of cost). Looking at the different types of persistent volumes: if multiple micro-services access the same DB, should I go with NFS, or would a PVC backed by a normal disk be enough (not local storage anyway)?
Requesting your thoughts on this.
Everything depends on your scenario. In general you should consider the AccessMode when deciding which Volume Plugin you want to use.
A PersistentVolume can be mounted on a host in any way supported by the resource provider. As shown in the table below, providers will have different capabilities and each PV's access modes are set to the specific modes supported by that particular volume.
In the documentation linked below, you will find a table with the different Volume Plugins and their supported Access Modes.
According to the update from your comment, you have only one node. With that setup, you can use almost any Volume that supports the RWO access mode.
ReadWriteOnce -- the volume can be mounted as read-write by a single node.
There are two other Access Modes which should be considered if you would like to use the volume on more than one node.
ReadOnlyMany -- the volume can be mounted read-only by many nodes
ReadWriteMany -- the volume can be mounted as read-write by many nodes
So in your case you can use gcePersistentDisk, as it supports ReadWriteOnce and ReadOnlyMany.
Using NFS would be beneficial if you would like to access this PV from many nodes.
NFS can support multiple read/write clients, but a specific NFS PV might be exported on the server as read-only. Each PV gets its own set of access modes describing that specific PV's capabilities.
Just as an addition: if this is for learning purposes, you can also check Local Persistent Volumes. An example can be found in this tutorial, however it would require a few updates, like the image or apiVersion.
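For the single-node Postgres case described above, a minimal PVC sketch could look like this (assuming GKE's default `standard` StorageClass, which is backed by gcePersistentDisk; the claim name and size are hypothetical):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data
spec:
  accessModes:
    - ReadWriteOnce            # sufficient when everything runs on one node
  storageClassName: standard   # GKE default class, gcePersistentDisk-backed
  resources:
    requests:
      storage: 20Gi
```

The Postgres Deployment would then mount this claim at the data directory; the micro-services talk to Postgres over its Service, not the volume, so they never need shared access to the disk itself.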

Are Pods forced to run on nodes where their persistent volumes exist?

I'm teaching myself Kubernetes with a 5 Rpi cluster, and I'm a bit confused by the way Kubernetes treats Persistent Volumes with respect to Pod Scheduling.
I have 4 worker nodes using ext4 formatted 64GB micro SD cards. It's not going to give GCP or AWS a run for their money, but it's a side project.
Let's say I create a Persistent volume Claim requesting 10GB of storage on worker1, and I deploy a service which relies on this PVC, is that service then forced to be scheduled on worker1?
Should I be looking into distributed file systems like Ceph or Hdfs so that Pods aren't restricted to being scheduled on a particular node?
Sorry if this seems like a stupid question, I'm self taught and still trying to figure this stuff out! (Feel free to improve my tl;dr doc for kubernetes with a pull req)
Just some examples; as already mentioned, it depends on your storage system. As I see it, you are using the local storage option.
Local storage:
yes, the pod needs to run on the same machine where the PV is located (your case)
iSCSI/Trident SAN:
no, the node where the pod is scheduled will mount the iSCSI block device
(as mentioned already, volume binding mode is an important keyword; you may need to set it to 'WaitForFirstConsumer')
NFS/Trident NAS:
no, it's NFS, mountable from everywhere as long as you can reach it and authenticate against it
VMware VMDKs:
no, same as iSCSI; the node which gets the pod scheduled mounts the VMDK from the datastore
Ceph/Rook.io:
no, you get three options for storage: file, block and object storage; every type is distributed, so you can schedule a pod on every node.
Also, Ceph is an ideal system for running distributed, software-defined storage on commodity hardware. What I can recommend is https://rook.io/, basically open-source Ceph on 'container steroids'.
Let's say I create a Persistent volume Claim requesting 10GB of storage on worker1, and I deploy a service which relies on this PVC, is that service then forced to be scheduled on worker1?
This is a good question. How this works depends on your storage system. The StorageClass defined for your Persistent Volume Claim contains information about the Volume Binding Mode. It is common to use dynamically provisioned volumes, so that the volume is first allocated when a user/consumer/Pod is scheduled. And typically this volume does not exist on the local node but remotely in the same data center. Kubernetes also has support for Local Persistent Volumes that are physical volumes located on the same node, but they are typically more expensive and used when you need high disk performance and capacity.
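The 'WaitForFirstConsumer' setting mentioned above lives on the StorageClass. A sketch for statically provisioned local volumes (the class name is hypothetical):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner   # static, pre-created local PVs
volumeBindingMode: WaitForFirstConsumer     # delay PV binding until a pod is scheduled
```

With `WaitForFirstConsumer`, the scheduler picks a node for the pod first and only then binds a PV available on that node, instead of binding a PV up front and forcing the pod onto whichever node happens to hold it.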

persistent volume on openshift

I am new to OpenShift. I have deployed one application on OpenShift which uses a persistent volume to store files, and there is another application which picks up those files and processes them.
My challenge is that I don't understand how to use the same persistent volume for the two applications,
and how to pick up the file from the persistent volume: is it the mountPath where the files get stored?
To leverage shared storage for use by two separate containers (in two independent pods), configure a PV of type NFS, or other shared storage such as GlusterFS.
A basic example using NFS is available here: Sharing an NFS Persistent Volume (PV) Across Two Pods
You can address your requirement using one of the options below:
Run the two containers in the same pod; that way both containers can share the volume.
Use NFS or some other persistent storage that supports ReadWriteMany; that way multiple pods can share the same volume.
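The second option can be sketched as one RWX claim mounted by two pods (claim name, size and commands are hypothetical; the cluster must have an RWX-capable backend such as NFS behind the claim):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-files
spec:
  accessModes:
    - ReadWriteMany            # requires an RWX-capable backend, e.g. NFS
  resources:
    requests:
      storage: 5Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: writer
spec:
  containers:
    - name: writer
      image: busybox
      # periodically append to a file on the shared volume
      command: ["sh", "-c", "while true; do date >> /data/out.txt; sleep 5; done"]
      volumeMounts:
        - name: shared
          mountPath: /data     # this mountPath is where the files appear
  volumes:
    - name: shared
      persistentVolumeClaim:
        claimName: shared-files
```

A second pod (the processing application) mounts the same `claimName` at its own mountPath and reads the files from there; both pods see the same directory contents.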