Google Compute Engine + Google Cloud Storage + NFS VM Instance

Has anyone had good success setting up Google Compute Engine + Google Cloud Storage + an NFS VM instance?
The scenario I have in mind is to create a Google Cloud Storage bucket and present it to an NFS VM instance that runs on GCE, then configure that NFS VM instance to export the bucket to several web servers that need to read and write to it.
The reason I would prefer this approach, if possible, is that my Cloud Storage data would be more reliable in terms of backups, etc. I know I can just create a persistent disk for the NFS VM instance on GCE and dump my data on that persistent disk, but then I have to worry about backing up my own data. Snapshots of disks are fine within GCE, but I don't know if that is the best solution.
I am new to GCE and Google Cloud Platform overall, and I'm trying to determine how to mimic my current physical systems in the cloud using different methods.

If your scenario implies mounting a Cloud Storage bucket as a filesystem on the NFS server instance and then exporting it to clients, you won't get production-level performance and reliability. The reason is that there is no native way to mount Cloud Storage, so you will be limited to a userspace (FUSE) filesystem implementation such as gcsfuse or s3fuse. It might work for simple use cases and low load, though.
Note that you can mount a Persistent Disk to multiple instances simultaneously in read-only mode.
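For reference, a minimal sketch of what that FUSE-plus-NFS chain would look like on the NFS server VM. The bucket name, mount point, and client subnet are placeholders, gcsfuse flags vary between versions, and this illustrates the approach rather than a production recipe:

```bash
# On the NFS server VM (Debian/Ubuntu packages assumed; gcsfuse comes from Google's apt repo)
sudo apt-get install -y nfs-kernel-server

# Mount the bucket through FUSE; allow_other lets the NFS daemon read the mount
sudo mkdir -p /mnt/my-bucket
sudo gcsfuse -o allow_other my-bucket /mnt/my-bucket   # "my-bucket" is a placeholder bucket name

# Re-export the FUSE mount over NFS to the web-server subnet (placeholder CIDR).
# An explicit fsid is required because FUSE filesystems have no stable device ID.
echo "/mnt/my-bucket 10.240.0.0/16(rw,sync,no_subtree_check,fsid=1)" | sudo tee -a /etc/exports
sudo exportfs -ra
```

Every read and write still goes object-by-object through the Cloud Storage API underneath, which is where the performance and consistency caveats above come from.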

Related

What storage to use for passing data between pods?

I am working with Kubernetes and I need to pass Parquet files containing datasets between pods, but I don't know which option will work best.
As far as I know, a persistent disk lets me mount a shared volume on my pods, but I could also share these files through Cloud Storage.
The whole process is hosted on Google Cloud.
If you want to persist the data and share it, use Google Filestore, which supports ReadWriteMany.
Persistent Volumes in GKE are backed by Persistent Disks. The problem with these disks is that they only support the ReadWriteOnce (RWO) access mode (the volume can be mounted read-write by a single node) and ReadOnlyMany (ROX) (the volume can be mounted read-only by many nodes).
Read more at: https://medium.com/@Sushil_Kumar/readwritemany-persistent-volumes-in-google-kubernetes-engine-a0b93e203180
With a persistent disk it won't be possible to share data between pods across nodes, as it only supports ReadWriteOnce; a single disk gets attached to a single node.
If you are looking to mount storage like a Cloud Storage bucket behind the pod using a CSI driver, your file-write IO will be very slow; Cloud Storage gives better performance through its API.
You can also run an NFS server inside Kubernetes, which again gives you ReadWriteMany support.
GlusterFS and MinIO are other options to consider, but if you are looking for managed NFS, use Google Filestore.
I would say go with a local persistent volume when you need to pass large datasets; it is cost-effective and efficient.
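For what it's worth, a rough sketch of the manually created local PersistentVolume that the last suggestion implies; the node name, path, and sizes are placeholders, and local PVs need a no-provisioner StorageClass with WaitForFirstConsumer binding plus nodeAffinity pointing at the node that owns the disk:

```bash
kubectl apply -f - <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner   # local PVs cannot be dynamically provisioned
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: datasets-pv                         # hypothetical name
spec:
  capacity:
    storage: 100Gi                          # placeholder size
  accessModes: ["ReadWriteOnce"]
  storageClassName: local-storage
  local:
    path: /mnt/disks/datasets               # pre-formatted disk on the node
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values: ["node-1"]                # placeholder node name
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: datasets-pvc
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: local-storage
  resources:
    requests:
      storage: 100Gi
EOF
```

Keep in mind that a local PV pins the consuming pods to the node that owns the disk, so it only shares data between pods scheduled on that same node.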
You should use Google Filestore as a file share. Then you need to:
create a Persistent Volume (PV)
create a Persistent Volume Claim (PVC)
use the PVC with your pods
More details here
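A rough sketch of those three steps, assuming a Filestore instance already exists; the IP address 10.0.0.2 and share name share1 below are placeholders for the values your instance reports (Filestore is consumed by the cluster as a plain NFS export):

```bash
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolume
metadata:
  name: filestore-pv                  # hypothetical name
spec:
  capacity:
    storage: 1Ti                      # placeholder size
  accessModes: ["ReadWriteMany"]      # the reason to use Filestore here
  nfs:
    server: 10.0.0.2                  # placeholder: the Filestore instance IP
    path: /share1                     # placeholder: the Filestore file share
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: filestore-pvc
spec:
  accessModes: ["ReadWriteMany"]
  storageClassName: ""                # bind to the pre-created PV, not a dynamic class
  resources:
    requests:
      storage: 1Ti
---
apiVersion: v1
kind: Pod
metadata:
  name: dataset-consumer              # hypothetical pod; every pod mounting the PVC sees the same files
spec:
  containers:
  - name: app
    image: busybox
    command: ["sleep", "3600"]
    volumeMounts:
    - name: shared
      mountPath: /data                # the Parquet files are visible here in every pod
  volumes:
  - name: shared
    persistentVolumeClaim:
      claimName: filestore-pvc
EOF
```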

How can I use GCP NFS Filestore on k8 cluster with TPUs?

I'm using GKE to run K8s workloads and want to add TPU support. From the GCP docs, I "need" to attach a GCS bucket so the Job can read models and store logs. However, we already create shared NFS mounts for our K8s clusters. How hard a requirement is it to "need" GCS to use TPUs? Can shared Filestore NFS mounts work just fine? What about using GCS Fuse?
I'm trying to avoid having the cluster user know about the back-end file system (NFS vs GCS) and just know that the files they provide will be available at "/home/job". Since the linked docs show passing a gs://mybucket/some/path value for the file system parameters, I'm not sure a /home/job value will still work. Does the TPU access the filesystem directly, and is it only compatible with GCS? Or do the nodes access the filesystem (preferring GCS) and then share the data (in memory) with the TPUs?
I'll try it out to learn the hard way (and report back), but curious if others have experience with this already.

What is a preferable way to run a job that needs large tmp disk space?

I have a service that needs to scan large files, process them, and upload them back to the file server.
My problem is that the default available space in a pod is 10 GB, which is not enough.
I have 3 options:
use a hostPath/emptyDir volume, but this way I can't specify how much space I need, and my pods could be scheduled to a node that doesn't have enough disk space.
use a hostPath persistent volume, but the docs say it is for "single node testing only".
use a local persistent volume, but according to the docs dynamic provisioning is not supported yet; I would have to manually create a PV on each node, which doesn't seem acceptable to me, but if there are no other options this will be the only way to go.
Are there any simpler options than a local persistent volume?
Depending on your cloud provider, you can mount its block storage offering, e.g. Persistent Disk on Google Cloud, Azure Disk on Azure, or Elastic Block Store on AWS.
This way you won't depend on a single node's availability for storage. All of them are supported in Kubernetes via volume plugins and persistent volume claims. For example:
gcePersistentDisk
A gcePersistentDisk volume mounts a Google Compute Engine (GCE) Persistent Disk into your Pod. Unlike emptyDir, which is erased when a Pod is removed, the contents of a PD are preserved and the volume is merely unmounted. This means that a PD can be pre-populated with data, and that data can be "handed off" between Pods.
This is similar for awsElasticBlockStore or azureDisk.
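For illustration, a minimal pod spec using the in-tree gcePersistentDisk volume type described in the quote; the disk name is a placeholder and the disk must already exist in the same zone as the node (newer clusters would usually go through the PD CSI driver and a PVC instead):

```bash
# Assumes the disk was pre-created, e.g.: gcloud compute disks create my-data-disk --size=200GB
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: file-scanner                  # hypothetical job pod
spec:
  containers:
  - name: worker
    image: busybox
    command: ["sleep", "3600"]
    volumeMounts:
    - name: scratch
      mountPath: /scratch             # large working area backed by the PD
  volumes:
  - name: scratch
    gcePersistentDisk:
      pdName: my-data-disk            # placeholder disk name
      fsType: ext4
EOF
```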
If you want to use AWS S3 there is an S3 Operator which you may find interesting.
The AWS S3 Operator will deploy the AWS S3 Provisioner, which will dynamically or statically provision AWS S3 bucket storage and access.

What are the dangers of using InfluxDB with networked storage (SSD) instead of local disks for production?

We are thinking of running InfluxDB inside Kubernetes (on GKE) and using Google's network-attached Persistent Disk SSDs as the storage for our production database.
Has anyone done this before? InfluxDB's documentation says that it's not tested against networked storage devices.
If all the other sizing and performance numbers are within InfluxDB's recommendation, what are the foreseeable issues that could occur with running a database on Kubernetes with networked storage?
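Not an answer to the reliability question, but for context this is roughly how the pd-ssd backing would be declared on GKE, assuming the Compute Engine PD CSI driver that GKE ships with; the names and size are placeholders:

```bash
kubectl apply -f - <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: influxdb-ssd                  # hypothetical name
provisioner: pd.csi.storage.gke.io    # GKE's Compute Engine Persistent Disk CSI driver
parameters:
  type: pd-ssd                        # network-attached SSD Persistent Disk
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: influxdb-data                 # referenced from the InfluxDB StatefulSet
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: influxdb-ssd
  resources:
    requests:
      storage: 500Gi                  # placeholder size
EOF
```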

Google Kubernetes storage in EC2

I started to use Docker and I'm trying out Google's Kubernetes project for my container orchestration. It looks really good!
The only thing I'm curious of is how I would handle the volume storage.
I'm using EC2 instances, and the containers mount volumes from the EC2 instance's filesystem.
The only thing left is figuring out how to deploy my application code to all those EC2 instances, right? How can I handle this?
It's somewhat unclear what you're asking, but a good place to start would be reading about your options for volumes in Kubernetes.
The options include using local EC2 disk with a lifetime tied to the lifetime of your pod (emptyDir), local EC2 disk with a lifetime tied to the lifetime of the node VM (hostDir, now called hostPath), and an Elastic Block Store volume (awsElasticBlockStore).
The Kubernetes Container Storage Interface (CSI) project is reaching maturity and includes a volume driver for AWS EBS that allows you to attach EBS volumes to your containers.
The setup is relatively advanced, but does work smoothly once implemented. The advantage of using EBS rather than local storage is that the EBS storage is persistent and independent of the lifetime of the EC2 instance.
In addition, the CSI plugin takes care of the disk creation -> mounting -> unmounting -> deletion lifecycle for you.
The EBS CSI driver has a simple example that could get you started quickly.
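As a hedged sketch of what that looks like once the EBS CSI driver is installed in the cluster; the class name, volume type, and size below are placeholder choices:

```bash
kubectl apply -f - <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-gp3                       # hypothetical name
provisioner: ebs.csi.aws.com          # the AWS EBS CSI driver
parameters:
  type: gp3                           # EBS volume type (placeholder choice)
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: ebs-gp3
  resources:
    requests:
      storage: 50Gi                   # placeholder size
EOF
```

A pod then references app-data through a persistentVolumeClaim volume, and the driver handles the create, attach, detach, and delete lifecycle described above.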