list buckets and their content using ofek.dev/csi-gcs - google-cloud-storage

I deployed the static CSI https://ofek.dev/csi-gcs for google cloud
The writer works fine: anything written by the pod in the mounting directory shows up un the bucket.
However the reader only allows to access the data written by the pod but it cannot list the buckets and objects that have not been created by the pod itself.
I implemented the Static provisioning step by step as explained in the link.
By the way: if I dir a bucket that I know it exists, does not solve the problem.
And I also gave GSA role/editor so it should be able to list the buckets.

Related

Orchestration of an NLP model via airflow and kubernetes

This is more of an architecture question. I have a data engineering background and have been using airflow to orchestrate ETL tasks using airflow for a while. I have limited knowledge of containerization and kuberentes. I have a task to come up with a good practice framework for productionalizting our Data science models using an orchestration engine namely airflow.
Our Data science team creates many NLP models to process different text documents from various resources. Previously the model was created by an external team which requires us to create an anacoda environment install libraries on it and run the model. The running of model was very manual where a data engineer would spin us a EC2 instance, and setup the model download the files to the ec2 instance and process the files using the model and take the output for further processing.
We are trying to move away from this to an automated pipeline where we have an airflow dag that basically orchestrates this all. The point where I am struggling is the running the model part.
This is the logical step I am thinking of doing. Please let me know if you think this would be feasible. All of these will be down in airflow. Step 2,3,4 are the ones I am totally unsure how to achieve.
Download files from ftp to s3
**Dynamically spin up a kubernetes cluster and create parallel pod based on number of files to be process.
Split files between those pods so each pod can only process its subset of files
Collate output of model from each pod into s3 location**
Do post processing on them
I am unsure how I can spin up a kuberentes cluster in airflow on runtime and especially how I split files between pods so each pod only processes on its own chunk of files and pushes output to shared location.
The running of the model has two methods. Daily and Complete. Daily would be a delta of files that have been added since last run whereas complete is a historical reprocessing of the whole document catalogue that we run every 6 months. As you can imagine the back catalogue would require alot of parallel processing and pods in parallel to process the number of documents.
I know this is a very generic post but my lack of kuberentes is the issue and any help would be appreciated in pointing me in the right direction.
Normally people schedule the container or PODs as per need on top of k8s cluster, however, I am not sure how frequent you need to crate the k8s cluster.
K8s cluster setup :
You can create the K8s cluster in different ways that are more dependent on the cloud provider and options they provide like SDK, CLI, etc.
Here is one example you can use this option with airflow to create the AWS EKS clusters : https://leftasexercise.com/2019/04/01/python-up-an-eks-cluster-part-i/
Most cloud providers support the CLI option so maybe using just CLI also you can create the K8s cluster.
If you want to use GCP GKE you can also check for the operators to create cluster : https://airflow.apache.org/docs/apache-airflow-providers-google/stable/operators/cloud/kubernetes_engine.html
Split files between those pods so each pod can only process its subset
of files
This is more depends on the file structure, you can mount the S3 direct to all pods, or you can keep the file into NFS and mount it to POD but in all cases you have to manage the directory structure accordingly, you can mount it to POD.
Collate output of model from each pod into s3 location**
You can use boto3 to upload files to S3, Can also mount S3 bucket direct to POD.
it's more now on your structure how big files are generated, and stored.

GCP stackdriver logging logs format changed in bucket from folder per container to stdout\stderr

i have a question, similar as describe here: GKE kubernetes container stdout logs format changed
in old version of stackdriver i had 1 sink with filter like this:
resource.type=container,
resource.namespace_id=[NAMESPACE_NAME]
resource.pod_id=[POD_NAME]
and logs was stored in bucket pretty well, like this:
logName=projects/[PROJECT-NAME]/logs/[CONTAINER-NAME]
...so i had folders whith logs for each container.
But now i updated my stackdriver logging+monitoring to last version and now i have 2 folders stdout\stderr which contains all logs for all containers!
logName=projects/[PROJECT-NAME]/logs/stdout
logName=projects/[PROJECT-NAME]/logs/stderr
All logs from many containers stored in this single folders! This is pretty uncomfortable =(
I'v read about this in docs: https://cloud.google.com/monitoring/kubernetes-engine/migration#changes_in_log_entry_contents
The logName field might change. Stackdriver Kubernetes Engine Monitoring log entries use stdout or stderr in their log names whereas Legacy Stackdriver used a wider variety of names, including the container name. The container name is still available as a resource label.
...but i can't find solution! Please, help me, how to make container per folder logging, like it was in old version of stackdriver?
Here is a workaround that has been suggested:
Create a different sink for each of your containers filtered by
resource.labels.container_name
Export each sink to a different
bucket
Note: If you configure each separate sink to the same bucket the logs will be combined.
More details at Google Issue Tracker

Mount a shared volume to Kubernetes cluster so that all users can access same storage and share files

I am following Zero to JupyterHub with Kubernetes to create a jupyterHub environment for my team to use.
Using Google Kubernetes Engine and every user gets his/her own storage and files are stored - this setup works fine.
I am having trouble as how should I create a volume or shared database so that everyone in team can see each other's notebooks, share file's and data.
To explain more, in desired setup - when a user signs in and goes to his/her jupyter image - every user sees the same folder "shared" and one can create individual folders for themselves inside that folder but are able to reuse code that someone else has already written.
I looked into NFS with Firestore but that seems very expensive.
As in the documentation gcePersistenceDisk do not support multiple read and write.
There is alternative solution for the problem. Rook is a storage backend various storage provisioner available through it. One of them is Ceph which has shared filesystem solution on kubernetes.

Copying directories into minikube and persisting them

I am trying to copy some directories into the minikube VM to be used by some of the pods that are running. These include API credential files and template files used at run time by the application. I have found you can copy files using scp into the /home/docker/ directory, however these files are not persisted over reboots of the VM. I have read files/directories are persisted if stored in the /data/ directory on the VM (among others) however I get permission denied when trying to copy files to these directories.
Are there:
A: Any directories in minikube that will persist data that aren't protected in this way
B: Any other ways of doing the above without running into this issue (could well be going about this the wrong way)
To clarify, I have already been able to mount the files from /home/docker/ into the pods using volumes, so it's just the persisting data I'm unclear about.
Kubernetes has dedicated object types for these sorts of things. API credential files you might store in a Secret, and template files (if they aren't already built into your Docker image) could go into a ConfigMap. Both of them can either get translated to environment variables or mounted as artificial volumes in running containers.
In my experience, trying to store data directly on a node isn't a good practice. It's common enough to have multiple nodes, to not directly have login access to those nodes, and for them to be created and destroyed outside of your direct control (imagine an autoscaler running on a cloud provider that creates a new node when all of the existing nodes are 90% scheduled). There's a good chance your data won't (or can't) be on the host where you expect it.
This does lead to a proliferation of Kubernetes objects and associated resources, and you might find a Helm chart to be a good resource to tie them together. You can check the chart into source control along with your application, and deploy the whole thing in one shot. While it has a couple of useful features beyond just packaging resources together (a deploy-time configuration system, a templating language for the Kubernetes YAML itself) you can ignore these if you don't need them and just write a bunch of YAML files and a small control file.
For minikube, data kept in $HOME/.minikube/files directory is copied to / directory in VM host by minikube.

When creating a new cloud composer env, is it possible to set the bucket to preexisting one?

So I already have an empty storage bucket created for this and I don't want composer to create its own bucket for the dags - I'd like to use the one already created.
It's not ideal to have it just create a random bucket and then go
gcloud composer environments run test-environment --location europe-west1 variables -- --set gcs_bucket gs://my-bucket
I've dug around the docs but it seems you cannot go around it creating a brand new bucket every time?
Currently, it is not possible.
In the environment’s configuration in Cloud Composer API, the dagGcsPrefix parameter is output only, you cannot set it. Documentation also mentions a Cloud Storage bucket is always created along with the Composer environment, the name of the bucket is based on the environment’s region, name and a random Id.
You may want to “Star” this Feature Request for the mentioned functionality, to receive notifications whenever an update on this regard is published. You can also review or subscribe to the Cloud Composer release notes to be updated about recently added features.
You are right, this is currently not supported in Composer.