I am working with Kubernetes and I need to pass Parquet files containing datasets between pods, but I don't know which option will work best.
As far as I know, a persistent disk allows me to mount a shared volume on my pods, but I could also share these files through cloud storage.
The whole process is hosted on Google Cloud.
If you want to persist the data, you should use Google Filestore, which supports ReadWriteMany.
Persistent Volumes in GKE are supported using Persistent Disks.
The problem with these disks is that they only support ReadWriteOnce (RWO) (the volume can be mounted as read-write by a single node) and ReadOnlyMany (ROX) (the volume can be mounted read-only by many nodes) access modes.
Read more at: https://medium.com/@Sushil_Kumar/readwritemany-persistent-volumes-in-google-kubernetes-engine-a0b93e203180
With a persistent disk, it won't be possible to share the data between pods on different nodes, as it only supports ReadWriteOnce: a single disk gets attached to a single node.
If you are looking to mount storage such as a cloud bucket behind the pod using a CSI driver, your file-write I/O will be very slow. Object storage gives better performance when accessed through its API.
You can also create an NFS server in Kubernetes and use that, which again gives you ReadWriteMany support.
GlusterFS and MinIO are other options to consider; however, if you are looking for a managed NFS service, use Google Filestore.
I would say go with a local persistent volume when you need to pass large datasets, as it is cost-effective and efficient.
You should use Google Filestore as a file share. Then you need to:
create a Persistent Volume (PV)
create a Persistent Volume Claim (PVC)
use the PVC with your pods
More details here
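A minimal sketch of those steps, assuming a Filestore instance whose IP address is 10.0.0.2 and whose share is named vol1 (both values are placeholders you'd replace with your own):

```yaml
# PersistentVolume backed by a Google Filestore (NFS) share.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: filestore-pv
spec:
  capacity:
    storage: 1Ti
  accessModes:
    - ReadWriteMany            # many pods on many nodes, read-write
  nfs:
    server: 10.0.0.2           # placeholder: Filestore instance IP
    path: /vol1                # placeholder: Filestore share name
---
# PersistentVolumeClaim that binds to the PV above.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: filestore-pvc
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""         # bind to the manually created PV, not a dynamic class
  resources:
    requests:
      storage: 1Ti
```

Each pod that produces or consumes the Parquet files then lists filestore-pvc under spec.volumes and mounts it via volumeMounts; because the access mode is ReadWriteMany, pods on different nodes can read and write the same share concurrently.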
In the context of Kubernetes, I've come across the terms Block Storage, File Storage and Object Storage, but I don't understand how they are really used (mounted) inside a container. I have a few questions:
Are these all storage types backed by raw block devices?
Is Block Storage a term used to mean a logical abstraction of block devices?
Is Block Storage mounted to a path inside a container just like we mount a file system on Linux? That also raises the question of whether Block Storage is a formatted file system.
How is Object Storage presented to a container? How does the container make use of it? Is it mounted to a path?
How is File Storage presented to a container? How does the container make use of it? Is it mounted to a path?
What are example scenarios for using these three storage types?
Block storage is backed by a block device. It can be a physical disk, a network-attached device (iSCSI, FC or an AWS EBS volume) or even a Ceph RBD. In most cases pods don't need to work with raw block devices (with the exception of Kube-native storage systems like Ceph and Portworx); instead, Kubernetes creates a filesystem on top of the device and mounts it into the pod. The main thing about block storage is that in most cases it's ReadWriteOnce (RWO), which means it can be mounted read-write by only a single node.
File storage is backed by a filesystem. It can be a local filesystem, like hostPath, or a network share like NFS. In that case Kubernetes can directly mount it inside the pod without any additional preparation. The main thing about NFS is that it can be mounted ReadWriteMany (RWX), which means it can be mounted read-write by many pods. Also, a filesystem on one node can be attached to many pods on that particular node.
Object storage can be imagined as files-over-HTTP(S) (AWS S3, GCP GCS, Azure Blob Storage, Ceph RGW, MinIO). There is no officially supported Kubernetes way to mount object storage inside pods, but there are some dirty workarounds like s3fs, Ganesha NFS and maybe others. In most cases you will work with object storage directly from your app using provider-specific libraries, which is how it's meant to work.
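To make the block-versus-file distinction concrete, here is a rough sketch of how a raw block PVC is consumed by a pod (names, image and device path are purely illustrative):

```yaml
# PVC requesting a raw block device instead of a mounted filesystem.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: raw-block-pvc
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Block            # no filesystem is created by Kubernetes
  resources:
    requests:
      storage: 10Gi
---
# Pod attaching the claim as a device rather than mounting it at a path.
apiVersion: v1
kind: Pod
metadata:
  name: block-consumer
spec:
  containers:
    - name: app
      image: busybox
      command: ["sleep", "3600"]
      volumeDevices:           # note: volumeDevices, not volumeMounts
        - name: data
          devicePath: /dev/xvda   # the container sees an unformatted device here
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: raw-block-pvc
```

With the default volumeMode: Filesystem, the same claim would instead be consumed through volumeMounts and show up as an already formatted, mounted directory inside the container, which answers the "is it a formatted file system?" question above.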
I've got a service that needs to scan large files, process them, and upload them back to the file server.
My problem is that the default available space in a pod is 10 GB, which is not enough.
I have 3 options:
use a hostPath/emptyDir volume, but this way I can't specify how much space I need, and my pods could be scheduled to a node that doesn't have enough disk space.
use a hostPath persistent volume, but the documentation says it is for "single node testing only"
use a local persistent volume, but according to the documentation dynamic provisioning is not supported yet; I would have to manually create a PV on each node, which doesn't seem acceptable to me, but if there are no other options this will be the only way to go.
Are there any simpler options than a local persistent volume?
Depending on your cloud provider, you can mount their block storage options,
e.g. GCE Persistent Disk on Google Cloud, Azure Disk on Azure, Elastic Block Store on AWS.
This way you won't depend on the availability of your node for storage. All of them are supported in Kubernetes via plugins and consumed through persistent volume claims. For example:
gcePersistentDisk
A gcePersistentDisk volume mounts a Google Compute Engine (GCE) Persistent Disk into your Pod. Unlike emptyDir, which is erased when a Pod is removed, the contents of a PD are preserved and the volume is merely unmounted. This means that a PD can be pre-populated with data, and that data can be "handed off" between Pods.
This is similar for awsElasticBlockStore or azureDisk.
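As a sketch, a pod consuming a pre-existing PD directly could look like this (the disk name my-data-disk is an assumption; the disk must already exist in the same zone as the node):

```yaml
# Pod mounting a pre-existing GCE Persistent Disk by name.
apiVersion: v1
kind: Pod
metadata:
  name: pd-example
spec:
  containers:
    - name: app
      image: busybox
      command: ["sleep", "3600"]
      volumeMounts:
        - name: data
          mountPath: /data         # where the PD contents appear in the container
  volumes:
    - name: data
      gcePersistentDisk:
        pdName: my-data-disk       # assumption: an existing PD in the same zone
        fsType: ext4
```

awsElasticBlockStore (keyed by volumeID) and azureDisk (keyed by diskName/diskURI) follow the same pattern.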
If you want to use AWS S3, there is an S3 Operator which you may find interesting.
The AWS S3 Operator will deploy the AWS S3 Provisioner, which will dynamically or statically provision AWS S3 Bucket storage and access.
I was trying local or hostPath volumes on bare-metal servers on a LAN.
I tried local volumes, but each node ended up with its own copy of the data.
How can I use volumes across all the nodes and pods?
Persistent Volumes have access semantics. For example, on GCE a Persistent Disk can either be mounted as writable by a single pod or as read-only by multiple pods. If you want multi-writer semantics, you need to set up NFS or some other storage backend that lets you write from multiple pods. NFS can support multiple read/write clients.
In case you are interested in running NFS, take a look: nfs-setup.
The NFS persistent volume and NFS claim give an indirection that allows multiple pods to refer to the NFS server using a symbolic name rather than a hardcoded server address.
Take a look: pv-multiple-pods.
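As a rough sketch of that indirection, assuming the NFS PV from the linked example is already bound to a ReadWriteMany claim named nfs-pvc (an assumed name), two pods can then mount it by claim name only:

```yaml
# Writer pod: drops a file onto the shared NFS volume.
apiVersion: v1
kind: Pod
metadata:
  name: writer
spec:
  containers:
    - name: writer
      image: busybox
      command: ["sh", "-c", "echo hello > /mnt/shared/hello.txt && sleep 3600"]
      volumeMounts:
        - name: shared
          mountPath: /mnt/shared
  volumes:
    - name: shared
      persistentVolumeClaim:
        claimName: nfs-pvc         # assumption: claim bound to the NFS PV
---
# Reader pod: sees the same files through the same claim.
apiVersion: v1
kind: Pod
metadata:
  name: reader
spec:
  containers:
    - name: reader
      image: busybox
      command: ["sleep", "3600"]
      volumeMounts:
        - name: shared
          mountPath: /mnt/shared
  volumes:
    - name: shared
      persistentVolumeClaim:
        claimName: nfs-pvc
```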
If you want to share data across your cluster, then you need to use network storage.
You can't expect Kubernetes to just share your data across all the nodes of your cluster, so local storage and hostPath won't work in that case.
As @MaggieO said, you can set up and use an NFS server.
If you just want to try it out, you can also use your favorite cloud provider's storage solution (AWS S3, a GCP bucket, Azure Disk, etc.). You can see the full list here.
I have a web application running on a Google Kubernetes cluster. My web app also uses persistent volumes for multiple MongoDB databases to store user and application data.
(1) Thus I am wondering if it is practical to store all data inside those persistent volumes in the long run?
(2) Are there any methods for safely backing up the persistent volumes e.g. on a weekly basis (automatically)?
(3) I am also planning to integrate some kind of file upload into the application. Are persistent volumes capable of storing many GB/TB of data, or should I opt for something like Google cloud storage in this case?
Deploying stateful apps on K8s is a bit painful, which is well known in the K8s community. Usually, if we need HA for databases, they are supposed to be deployed in cluster mode. In K8s, if you want to deploy in cluster mode, you need to look at the StatefulSets concept. Anyway, I'm pasting links for your questions so that you can start from there.
(1) Thus I am wondering if it is practical to store all data inside those persistent volumes in the long run?
Running MongoDB on Kubernetes with StatefulSets
(2) Are there any methods for safely backing up the persistent volumes e.g. on a weekly basis (automatically)?
Persistent Volume Snapshots
Volume Snapshot (Beta from K8s docs)
You can google even more docs.
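For (2), a volume snapshot is just another API object, so a weekly backup can be created on a schedule (for example from a CronJob). A minimal sketch, assuming your cluster has a CSI driver with a snapshot class and that the MongoDB data lives in a PVC named mongo-data (both names are assumptions):

```yaml
# Point-in-time snapshot of the PVC holding the MongoDB data.
# On newer clusters the apiVersion is snapshot.storage.k8s.io/v1.
apiVersion: snapshot.storage.k8s.io/v1beta1
kind: VolumeSnapshot
metadata:
  name: mongo-data-weekly
spec:
  volumeSnapshotClassName: csi-snapshot-class   # assumption: your VolumeSnapshotClass
  source:
    persistentVolumeClaimName: mongo-data       # assumption: PVC backing the database
```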
(3) I am also planning to integrate some kind of file upload into the application. Are persistent volumes capable of storing many GB/TB of data, or should I opt for something like Google cloud storage in this case?
I'm not sure whether a persistent volume can hold TBs, but if you are on the cloud, definitely consider using cloud storage for that.
Yes, you can use a PVC in Kubernetes to store the data. However, it depends on your application's use case and data size.
In Kubernetes you can deploy MongoDB as a cluster, storing its data inside a PVC. A MongoDB Helm chart is available for HA; you can look at that as well.
Helm chart : https://github.com/helm/charts/tree/master/stable/mongodb
It's suggested to run MongoDB on Kubernetes as a single pod or as a StatefulSet.
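A stripped-down sketch of what a MongoDB StatefulSet with per-replica storage looks like; the names, image tag and sizes are only illustrative, and the Helm chart above generates something equivalent for you:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mongodb
spec:
  serviceName: mongodb             # assumes a matching headless Service exists
  replicas: 3
  selector:
    matchLabels:
      app: mongodb
  template:
    metadata:
      labels:
        app: mongodb
    spec:
      containers:
        - name: mongodb
          image: mongo:4.2         # illustrative image tag
          ports:
            - containerPort: 27017
          volumeMounts:
            - name: data
              mountPath: /data/db  # MongoDB's default data directory
  volumeClaimTemplates:            # one PVC per replica (data-mongodb-0, data-mongodb-1, ...)
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 50Gi
```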
Backup:
For backing up the MongoDB database, you can take a snapshot of the disk storage (PVC) weekly; along with that, you can also use a MongoDB snapshot.
Most people choose a managed service, but it also depends on your organization.
Backup methods:
MongoDB snapshot
Disk storage snapshot
Filesystem:
Yes, it can handle TBs of data, as it's ultimately a disk volume or file system.
Yes, you can use a PVC as a file system, but later on you may run into scaling issues: a disk-backed PVC is ReadWriteOnce, so if you want to scale the application along with the PVC you have to implement ReadWriteMany.
There are several methods to achieve this; you can also directly mount a file system into the pod, such as AWS EFS, but you may find it slow for file operations.
For file systems there are various options available in Kubernetes, such as CSI drivers, GlusterFS, MinIO, and EFS.
I am new to OpenShift. I have deployed one application on OpenShift which uses a persistent volume to store files, and there is another application which picks those files up and processes them.
My challenge is that I don't understand how to use the same persistent volume for two applications,
and how to pick the files up from the persistent volume: is mountPath where the files get stored?
To leverage shared storage for use by two separate containers (in two independent pods), configure a PV of type NFS, or another shared storage backend such as GlusterFS.
A basic example using NFS is available here: Sharing an NFS Persistent Volume (PV) Across Two Pods.
You can address your requirement using one of the options below:
Run the two containers in the same pod; that way both containers can share the same volume (see the sketch after this list).
Use NFS or some other persistent storage that supports ReadWriteMany; that way multiple pods can share the same volume.
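A minimal sketch of the first option, with two containers in one pod sharing an emptyDir volume (the names, image and paths are just examples):

```yaml
# One pod, two containers, one shared volume:
# the producer writes files under /shared and the processor reads them
# from the same mountPath.
apiVersion: v1
kind: Pod
metadata:
  name: shared-volume-pod
spec:
  containers:
    - name: producer
      image: busybox
      command: ["sh", "-c", "echo report > /shared/report.txt && sleep 3600"]
      volumeMounts:
        - name: workdir
          mountPath: /shared
    - name: processor
      image: busybox
      command: ["sleep", "3600"]
      volumeMounts:
        - name: workdir
          mountPath: /shared       # the same files are visible here
  volumes:
    - name: workdir
      emptyDir: {}
```

For the second option, swap emptyDir for a persistentVolumeClaim bound to an NFS-type PV, as in the linked example. And yes, mountPath is exactly where the files from the persistent volume appear inside each container.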