Is it Appropriate to Store a Database in a Kubernetes Persistent Volume (And How to Back It Up)?

I have a web application running on a Google Kubernetes cluster. My web app also uses persistent volumes for multiple MongoDB databases to store user and application data.
(1) Thus I am wondering if it is practical to store all data inside those persistent volumes in the long run?
(2) Are there any methods for safely backing up the persistent volumes e.g. on a weekly basis (automatically)?
(3) I am also planning to integrate some kind of file upload into the application. Are persistent volumes capable of storing many GB/TB of data, or should I opt for something like Google Cloud Storage in this case?

Deploying stateful apps on K8s is a bit painful, which is well known in the K8s community. Usually, if you need HA for databases, you are supposed to deploy them in cluster mode, and in K8s that means looking at the StatefulSet concept. Anyway, I'm pasting links for each of your questions so that you can start from there.
(1) Thus I am wondering if it is practical to store all data inside those persistent volumes in the long run?
Running MongoDB on Kubernetes with StatefulSets
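To make the StatefulSet approach concrete, here is a minimal sketch, assuming a headless Service named mongodb already exists; the image tag, replica count, and storage size are placeholders to adapt, not values from the linked guide:

# Minimal MongoDB StatefulSet sketch; names, image tag, and sizes
# are assumptions, and a headless Service "mongodb" must exist.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mongodb
spec:
  serviceName: mongodb          # headless Service created separately
  replicas: 3
  selector:
    matchLabels:
      app: mongodb
  template:
    metadata:
      labels:
        app: mongodb
    spec:
      containers:
      - name: mongodb
        image: mongo:4.4        # assumed image tag
        ports:
        - containerPort: 27017
        volumeMounts:
        - name: data
          mountPath: /data/db   # MongoDB's default data directory
  volumeClaimTemplates:         # one PVC per replica, kept across restarts
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 10Gi

Note this only provisions the pods and their storage; wiring the mongod processes into an actual replica set (rs.initiate or a sidecar) is what the linked guide covers.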
(2) Are there any methods for safely backing up the persistent volumes e.g. on a weekly basis (automatically)?
Persistent Volume Snapshots
Volume Snapshot (Beta from K8s docs)
You can find even more docs with a quick search.
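For illustration, a snapshot request is a small object; this sketch uses the v1beta1 API from the linked docs, and the class and claim names are assumptions:

# Sketch of a VolumeSnapshot; needs the snapshot CRDs, the snapshot
# controller, and a CSI driver with snapshot support installed.
apiVersion: snapshot.storage.k8s.io/v1beta1
kind: VolumeSnapshot
metadata:
  name: mongodb-data-snapshot
spec:
  volumeSnapshotClassName: csi-snapshot-class   # assumed class name
  source:
    persistentVolumeClaimName: data-mongodb-0   # assumed PVC name

A VolumeSnapshot is a one-off object, so for a weekly cadence you would create one on a schedule, e.g. from a CronJob or a backup tool such as Velero.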
(3) I am also planning to integrate some kind of file upload into the application. Are persistent volumes capable of storing many GB/TB of data, or should I opt for something like Google Cloud Storage in this case?
A persistent volume can hold terabytes if the underlying disk does, but for large user uploads, object storage such as Google Cloud Storage is usually the better choice if you are already on a cloud.

Yes, you can use a PVC in Kubernetes to store the data; however, it depends on your application's use case and data size.
In Kubernetes you can deploy MongoDB as a cluster that stores its data inside a PVC. A MongoDB Helm chart is also available for HA; you can look at that as well.
Helm chart : https://github.com/helm/charts/tree/master/stable/mongodb
It's suggested to run MongoDB as a single pod or as a StatefulSet on Kubernetes.
Backup:
For backing up the MongoDB database, you can take a weekly snapshot of the disk storage (PVC); alongside that, you can also use MongoDB's own snapshot tooling.
Most people choose a managed service, but it still depends on your organization.
Backup methods:
MongoDB snapshot
Disk storage snapshot
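As one sketch of the weekly automation, a CronJob that runs mongodump into a separate backup PVC; the host, schedule, image tag, and claim name are assumptions:

# Weekly logical backup with mongodump; adapt host, schedule, and PVC.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: mongodb-backup
spec:
  schedule: "0 3 * * 0"                 # Sundays at 03:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: mongodump
            image: mongo:4.4            # assumed image tag
            command:
            - /bin/sh
            - -c
            - mongodump --host mongodb --gzip --archive=/backup/dump-$(date +%F).gz
            volumeMounts:
            - name: backup
              mountPath: /backup
          volumes:
          - name: backup
            persistentVolumeClaim:
              claimName: mongodb-backup-pvc   # assumed pre-created PVC

A logical dump like this complements disk snapshots: snapshots are fast and crash-consistent, while mongodump gives you a portable, restore-testable archive.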
Filesystem:
Yes, it can handle terabytes of data, since it is ultimately a disk volume or file system.
You can use a PVC as a file system, but you may later run into scaling issues: a PVC is typically ReadWriteOnce, so if you want to scale the application along with the PVC you have to move to ReadWriteMany, as sketched below.
There are several ways to achieve this. You can also mount a file system such as AWS EFS directly into the pod, but you may find it slow for file operations.
For shared file systems there are various options in Kubernetes, such as CSI drivers, GlusterFS, MinIO, and EFS.
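For reference, the ReadWriteMany claim itself is short; it will only bind if the storage class behind it (the name below is an assumption) actually supports shared access, e.g. NFS, Filestore, or GlusterFS:

# Sketch of an RWX claim; binds only with an RWX-capable backend.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-uploads
spec:
  accessModes:
  - ReadWriteMany
  storageClassName: nfs-client     # assumed RWX-capable class
  resources:
    requests:
      storage: 100Gi               # assumed size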

Related

What storage to use for passing data between pods?

I am working with Kubernetes and I need to pass Parquet files containing datasets between pods, but I don't know which option will work best.
As far as I know, a persistent disk lets me mount a shared volume on my pods, but with cloud storage I can share these files too.
The whole process is hosted on Google Cloud.
If you want to persist the data and share it across pods, you can use Google Filestore, which supports ReadWriteMany.
Persistent Volumes in GKE are backed by Persistent Disks.
The problem with these disks is that they only support ReadWriteOnce (RWO) (the volume can be mounted as read-write by a single node) and ReadOnlyMany (ROX) (the volume can be mounted read-only by many nodes) access modes.
Read more at: https://medium.com/@Sushil_Kumar/readwritemany-persistent-volumes-in-google-kubernetes-engine-a0b93e203180
With a disk, it won't be possible to share data between pods, as it only supports read-write once: a single disk gets attached to a single node.
If you mount object storage such as a cloud bucket behind the pod using a CSI driver, your file-write I/O will be very slow; object storage performs better when accessed through its API.
You can also create an NFS server in Kubernetes, which again gives you ReadWriteMany support.
GlusterFS and MinIO are options too; however, if you are looking for a managed NFS service, use Google Filestore.
I would say go with a local persistent volume when you need to pass large datasets; it is cost-effective and efficient.
You should use Google Filestore as a file share. Then you need to:
create a PersistentVolume (PV)
create a PersistentVolumeClaim (PVC)
use the PVC with your pods
More details here
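A rough sketch of those steps, assuming a Filestore instance already exists; its IP address and share name below are placeholders:

# Filestore exposed as an NFS PV plus a matching claim.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: filestore-pv
spec:
  capacity:
    storage: 1Ti
  accessModes:
  - ReadWriteMany
  nfs:
    server: 10.0.0.2        # Filestore instance IP (placeholder)
    path: /vol1             # Filestore share name (placeholder)
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: filestore-pvc
spec:
  accessModes:
  - ReadWriteMany
  storageClassName: ""      # bind to the pre-created PV above
  resources:
    requests:
      storage: 1Ti

Every pod that mounts filestore-pvc then sees the same files, which is what you need for passing Parquet datasets around.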

What is a preferable way to run a job that needs large tmp disk space?

I have a service that needs to scan large files, process them, and upload them back to the file server.
My problem is that the default available space in a pod is 10G, which is not enough.
I have 3 options:
use a hostPath/emptyDir volume, but that way I can't specify how much space I need; my pods could be scheduled to a node that doesn't have enough disk space.
use a hostPath persistent volume, but the docs say it is for "single node testing only".
use a local persistent volume, but according to the docs dynamic provisioning is not supported yet; I would have to manually create a PV on each node, which seems unacceptable to me, but if there is no other option it will be the only way to go.
Is there any simpler option than a local persistent volume?
Depending on your cloud provider, you can mount their block storage options, e.g. Persistent Disk on Google Cloud, Azure Disk on Azure, or Elastic Block Store on AWS.
This way you won't depend on a single node's local disk for storage. All of them are supported in Kubernetes via volume plugins and persistent volume claims. For example:
gcePersistentDisk
A gcePersistentDisk volume mounts a Google Compute Engine (GCE) Persistent Disk into your Pod. Unlike emptyDir, which is erased when a Pod is removed, the contents of a PD are preserved and the volume is merely unmounted. This means that a PD can be pre-populated with data, and that data can be "handed off" between Pods.
This is similar for awsElasticBlockStore or azureDisk.
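Building on that, a sketch of a Job claiming its own dynamically provisioned scratch disk (on GKE this would be a persistent disk), so the scheduler never lands it on a node without space; the size and image are placeholders:

# A Job with a dedicated scratch PVC from the default storage class.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: scan-scratch
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi                  # assumed scratch requirement
---
apiVersion: batch/v1
kind: Job
metadata:
  name: file-scan
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: scanner
        image: my-scanner:latest     # placeholder image
        volumeMounts:
        - name: scratch
          mountPath: /tmp/work       # large temp space lives here
      volumes:
      - name: scratch
        persistentVolumeClaim:
          claimName: scan-scratch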
If you want to use AWS S3 there is an S3 Operator which you may find interesting.
AWS S3 Operator will deploy the AWS S3 Provisioner which will
dynamically or statically provision AWS S3 Bucket storage and access.

Kubernetes cluster Mysql Nodes Storage

We have started setting up a Kubernetes cluster. In production we have 4 MySQL nodes (2 active masters, 2 active slaves). All servers are on-premises; there is no cloud provider involved.
Now how do I configure storage? I mean, should I use PV/PVC? How will it work? Should I use a local PV? Can someone explain this to me?
You need to use PersistentVolumes and PersistentVolumeClaims in order to achieve that.
A PersistentVolume (PV) is a piece of storage in the cluster that has been provisioned by an administrator or dynamically provisioned using Storage Classes.
A PersistentVolumeClaim (PVC) is a request for storage by a user. Claims can request specific size and access modes (e.g., they can be mounted once read/write or many times read-only).
Containers are ephemeral: when a container is restarted, all changes made before the restart are lost. Databases, however, expect data to persist, and therefore you need persistent volumes. You have to create a storage claim, and the pod must be configured to mount the claimed storage, as sketched below.
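Since your servers are on-premises with no cloud provisioner, a local PersistentVolume is one way to do this; a minimal sketch, assuming a disk already formatted and mounted at /mnt/mysql on a node named node1:

# Local PV pinned to one node; path, node name, and size are assumptions.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: mysql-pv-node1
spec:
  capacity:
    storage: 200Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage    # assumed class name
  local:
    path: /mnt/mysql                 # pre-mounted disk on the node
  nodeAffinity:                      # required for local volumes
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - node1                    # assumed node name

You would create one such PV per MySQL node, plus a matching PVC for each database pod.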
Here you will find a simple guide showing how to deploy MySQL with a PersistentVolume. However, I strongly recommend getting familiar with the official docs that I have linked in order to fully understand the concept and adjust the access mode, class, size, etc according to your needs.
Please let me know if that helped.

How do I mount data into persisted storage on Kubernetes and share the storage amongst multiple pods?

I am new to Kubernetes and am trying to understand the most efficient and secure way to handle sensitive persisted data that interacts with a k8s pod. I have the following requirements when I start a pod in a k8s cluster:
The pod should have persisted storage.
Data inside the pod should be persistent even if the pod crashes or restarts.
I should be able to easily add or remove data from hostPath into the pod. (Not sure if this is feasible, since I do not know how the data will behave if the pod starts on a new node in a multi-node environment. Do all nodes have access to the data on the same hostPath?)
Currently, I have been using StatefulSets with a persistent volume claim on GKE. The image that I am using has a couple of constraints as follows:
I have to mount a configuration file into the pod before it starts. (I am currently using configmaps to pass the configuration file)
The pod that comes up creates its own TLS certificates, which I need to pass to other pods. (Currently I do not have a process in place to do this and have been manually copy-pasting these certificates into other pods.)
So, how do I maintain a common persisted storage that handles sensitive data between multiple pods and how do I add pre-configured data to this storage? Any guidance or suggestions are appreciated.
I believe this documentation on creating a persistent disk with multiple readers [1] is what you are looking for. You will, however, only be able to have the pods read from the disk, since GCE persistent disks do not support ReadWriteMany [2].
Regarding hostPath: the mount point is in the pod, but the volume is a directory on the node, so a hostPath volume is confined to an individual node.
[1] https://cloud.google.com/kubernetes-engine/docs/how-to/persistent-volumes/readonlymany-disks
[2] https://kubernetes.io/docs/concepts/storage/persistent-volumes/#access-modes
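Following the approach in [1], a hedged sketch of exposing one pre-populated persistent disk to many pods read-only; the disk name and size are placeholders:

# Pre-populated GCE PD mounted ReadOnlyMany, per [1].
apiVersion: v1
kind: PersistentVolume
metadata:
  name: shared-data-pv
spec:
  capacity:
    storage: 50Gi
  accessModes:
  - ReadOnlyMany
  gcePersistentDisk:
    pdName: shared-data-disk     # existing, already-populated disk
    fsType: ext4
    readOnly: true
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-data-pvc
spec:
  accessModes:
  - ReadOnlyMany
  storageClassName: ""           # bind to the PV above
  resources:
    requests:
      storage: 50Gi

For the TLS certificates specifically, a Kubernetes Secret mounted into each pod is usually a better fit than a shared disk.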

How to manage file uploads with GKE?

I'm trying to run an API (based on Symfony) on Kubernetes using Google Container Engine (GKE).
This API also allows users to store and download files, which need to be saved somewhere.
I tried running it with 1 replica and noticed downtime of the service during the creation of the new container. It looks like at least 2 replicas are needed to avoid downtime.
Taking that into consideration, I'm interested in these options:
A volume based on a Google Persistent Disk. Would this mean that all my replicas would have to run on the same node (ReadWriteOnce access mode)? If so, in case of a node failure, my service would not be available.
A volume based on Flocker (with a Persistent Disk backend). What is the recommended way to install it on GKE?
Is there another interesting option? What would you recommend?
Using GCS (like tex mentioned) is probably the simplest solution (and will be very fast from a GKE cluster). Here is an answer that may help.
If you have a specific need for local persistent storage, you can use Google Persistent Disks, but they can only be mounted as writable in one place.
PetSets (currently alpha) will provide better support for distributed persistent in-cluster storage, so you can also look into that if GCS doesn't work for you.