I'm writing a Cloud Composer plugin and I need to create a DAG at runtime. How can I create a DAG file from the webserver, or how can I access the bucket ID from plugin code (so I can use the GCS client and just upload the DAG)? I tried the code below and it doesn't work: I don't get any exceptions, but I also don't see any results:
import os
from airflow import settings

dag_path = os.path.join(settings.DAGS_FOLDER, dag_id + '.py')
with open(dag_path, 'w') as dag:
    dag.write(result)
A possible solution is to read the bucket ID from a Cloud Composer environment variable.
You may either use the environment variables, or you may make use of the APIs provided by the gcloud SDK.
gcloud composer environments describe --format=json --project=<project-name> --location=<region> <cluster-name>
This returns the details of the Cloud Composer environment.
The DAG location is under the key dagGcsPrefix.
Its format is gs://<GCSBucket>/dags
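As a rough sketch of both options in Python: the snippet below first looks for the bucket name in an environment variable and otherwise parses a dagGcsPrefix value like the one returned by the describe command. The variable name GCS_BUCKET is an assumption (check what your Composer environment actually exposes), and the helper names split_gcs_uri and dag_bucket are made up for illustration.

```python
import os
from typing import Optional, Tuple


def split_gcs_uri(uri: str) -> Tuple[str, str]:
    """Split 'gs://bucket/some/prefix' into ('bucket', 'some/prefix')."""
    if not uri.startswith("gs://"):
        raise ValueError(f"not a gs:// URI: {uri!r}")
    bucket, _, prefix = uri[len("gs://"):].partition("/")
    if not bucket:
        raise ValueError(f"no bucket name in {uri!r}")
    return bucket, prefix.strip("/")


def dag_bucket(dag_gcs_prefix: Optional[str] = None) -> str:
    """Return the bucket name, preferring the environment variable."""
    # ASSUMPTION: the environment exposes the bucket name as GCS_BUCKET.
    bucket = os.environ.get("GCS_BUCKET")
    if bucket:
        return bucket
    if dag_gcs_prefix is None:
        raise RuntimeError("bucket not found; pass the dagGcsPrefix value explicitly")
    return split_gcs_uri(dag_gcs_prefix)[0]
```

With the bucket name in hand, the google-cloud-storage client should be able to upload the generated file, along the lines of storage.Client().bucket(name).blob("dags/" + dag_id + ".py").upload_from_string(result).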
I'm trying to avoid race conditions with gcloud / gsutil authentication on the same system but different CI/CD jobs on my Gitlab-Runner on a Mac Mini.
I have tried setting the auth manually with
RUN gcloud auth activate-service-account --key-file="gitlab-runner.json"
RUN gcloud config set project $GCP_PROJECT_ID
for the Dockerfile (in which I'm performing a download operation from a Google Cloud Storage bucket).
A bash script runs the docker command, and in the same script I authenticate with
gcloud config configurations activate $TARGET
where I've previously run the two commands above to save them to that configuration.
The configurations are working fine if I start the CI/CD jobs one after the other has finished. But I want to trigger them for all clients at the same time, which causes race conditions with gcloud authentication and one of the jobs trying to download from the wrong project bucket.
How can I avoid the race condition? I'm already authenticating before each gsutil command, but it's still causing the race condition. Do I need something like Cloud Build to separate the runtime environments?
You can use Cloud Build to get separate execution environments, but this might be overkill for your use case: a Cloud Build worker uses an entire VM, which may be too heavy. Linux containers / Docker can provide the necessary isolation as well.
Make sure that each container you run has its own config file in the path gcloud expects. The issue may come from improper volume mounting (all the containers sharing the same location on the host). You could instead mount a directory containing a configuration file that is unique to each bucket when you run the image, or run gcloud config configurations activate in a Dockerfile step (thus creating image variants for different buckets, if that's feasible).
Alternatively, and I think this solution might be easier, you can switch from the Cloud SDK distribution to the standalone gsutil distribution. That way you can provide the path to a boto configuration file through an environment variable.
Such variables can be specified when running a Docker image.
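A minimal sketch of that idea, written in Python only to keep the example self-contained: it builds a docker run command that mounts one client's config directory and points gsutil at it through the BOTO_CONFIG environment variable. The directory layout (<config_root>/<client>/boto), the image name and the function name are hypothetical.

```python
import shlex
from pathlib import Path
from typing import List


def docker_run_args(client: str, image: str, config_root: Path) -> List[str]:
    """Build a `docker run` command that gives one client its own boto file."""
    # Hypothetical layout: <config_root>/<client>/boto holds that client's credentials.
    host_dir = (config_root / client).resolve()
    return [
        "docker", "run", "--rm",
        # Mount only this client's directory, read-only.
        "-v", f"{host_dir}:/config:ro",
        # Standalone gsutil reads its configuration from the file named by BOTO_CONFIG.
        "-e", "BOTO_CONFIG=/config/boto",
        image,
    ]


if __name__ == "__main__":
    cmd = docker_run_args("client-a", "downloader:latest", Path("/srv/ci-configs"))
    print(shlex.join(cmd))
```

Because every job passes its own directory, concurrent jobs never touch a shared credential file, which is what caused the race in the first place.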
I am trying to write a Python service, deployed on Kubernetes, that does something similar to a Cloud Function triggered by the google.storage.object.finalize event on a bucket. In essence I need to replace a Cloud Function that was created with the following parameters:
--trigger-resource YOUR_TRIGGER_BUCKET_NAME
--trigger-event google.storage.object.finalize
However, I can't find any resource online on how to do this. What would be the best way for a Python script deployed in Kubernetes to observe actions performed on a bucket and do something when a new file gets written into it? Thank you
You just need to enable Pub/Sub notifications on the bucket so that it publishes to a Pub/Sub topic: https://cloud.google.com/storage/docs/pubsub-notifications
Then have your Python application listen to a subscription on the topic you picked, in either a pull or a push setup: https://cloud.google.com/pubsub/docs/pull.
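A minimal pull-subscriber sketch, assuming the google-cloud-pubsub package is installed and a subscription already exists on the notification topic. The function names and the project/subscription IDs you pass in are placeholders; the attribute names (eventType, bucketId, objectId) are the ones Cloud Storage attaches to its notifications.

```python
from typing import Mapping, Optional


def finalized_object(attributes: Mapping[str, str]) -> Optional[str]:
    """Return 'gs://bucket/object' for OBJECT_FINALIZE notifications, else None."""
    if attributes.get("eventType") != "OBJECT_FINALIZE":
        return None
    return f"gs://{attributes['bucketId']}/{attributes['objectId']}"


def listen(project_id: str, subscription_id: str) -> None:
    """Block and handle notifications from an existing pull subscription."""
    # Imported here so the helper above works without the client library installed.
    from google.cloud import pubsub_v1

    subscriber = pubsub_v1.SubscriberClient()
    path = subscriber.subscription_path(project_id, subscription_id)

    def callback(message) -> None:
        uri = finalized_object(message.attributes)
        if uri is not None:
            print(f"new object: {uri}")  # replace with the real work
        message.ack()

    future = subscriber.subscribe(path, callback=callback)
    with subscriber:
        future.result()  # blocks until the stream fails or is cancelled
```

Filtering on eventType in the handler keeps the service equivalent to the finalize trigger even if the notification was configured to publish other event types as well.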
We have a requirement that, while provisioning the Databricks service through a CI/CD pipeline in Azure DevOps, we should be able to mount a blob storage to DBFS without connecting to a cluster. Is it possible to mount object storage to DBFS by using a bash script from Azure DevOps?
I looked through various forums, but they all mention doing this with dbutils.fs.mount, and the problem is that we cannot run this command in an Azure DevOps CI/CD pipeline.
Will appreciate any help on this.
Thanks
What you're asking is possible but it requires a bit of extra work. In our organisation we've tried various approaches and I've been working with Databricks for a while. The solution that works best for us is to write a bash script that makes use of the databricks-cli in your Azure Devops pipeline. The approach we have is as follows:
Retrieve a Databricks token using the token API
Configure the Databricks CLI in the CI/CD pipeline
Use Databricks CLI to upload a mount script
Create a Databricks job using the Jobs API and set the mount script as file to execute
The steps above are all contained in a bash script that is part of our Azure Devops pipeline.
Setting up the CLI
Setting up the Databricks CLI without any manual steps is now possible since you can generate a temporary access token using the Token API. We use a Service Principal for authentication.
https://learn.microsoft.com/en-US/azure/databricks/dev-tools/api/latest/tokens
Create a mount script
We have a Scala script that follows the mount instructions. This can be Python as well. See the following link for more information:
https://docs.databricks.com/data/data-sources/azure/azure-datalake-gen2.html#mount-azure-data-lake-storage-gen2-filesystem.
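For illustration, a Python version of such a mount script might look roughly like the sketch below. This is not the script we actually use; it assumes an ADLS Gen2 container and service principal (OAuth) credentials as in the linked instructions. dbutils is passed in as a parameter so the functions can be read and tested outside a notebook, and the function names are made up.

```python
def oauth_configs(client_id: str, client_secret: str, tenant_id: str) -> dict:
    """Spark configs for service principal (OAuth) access to ADLS Gen2."""
    return {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": client_id,
        "fs.azure.account.oauth2.client.secret": client_secret,
        "fs.azure.account.oauth2.client.endpoint":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }


def mount_if_missing(dbutils, container: str, account: str,
                     mount_point: str, configs: dict) -> bool:
    """Mount the container unless mount_point already exists; True if mounted."""
    # Skipping existing mounts makes the job safe to re-run from the pipeline.
    if any(m.mountPoint == mount_point for m in dbutils.fs.mounts()):
        return False
    dbutils.fs.mount(
        source=f"abfss://{container}@{account}.dfs.core.windows.net/",
        mount_point=mount_point,
        extra_configs=configs,
    )
    return True
```

In a notebook you would call mount_if_missing(dbutils, ...) with the notebook's own dbutils object, ideally reading the client secret from a secret scope rather than hard-coding it.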
Upload the mount script
In the Azure Devops pipeline the databricks-cli is configured by creating a temporary token using the token API. Once this step is done, we're free to use the CLI to upload our mount script to DBFS or import it as a notebook using the Workspace API.
https://learn.microsoft.com/en-US/azure/databricks/dev-tools/api/latest/workspace#--import
Configure the job that actually mounts your storage
We have a JSON file that defines the job that executes the "mount storage" script. You can define a job to use the script/notebook that you've uploaded in the previous step. Defining a job in JSON is straightforward; the Jobs API documentation shows how it's done:
https://learn.microsoft.com/en-US/azure/databricks/dev-tools/api/latest/jobs#--
At this point, triggering the job should create a temporary cluster that mounts the storage for you. You should not need to use the web interface, or perform any manual steps.
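To make the job step concrete, here is a hedged sketch of the create-and-trigger calls in Python rather than bash, assuming the Jobs API 2.0 endpoints and a notebook-based mount script. We template a JSON file with Jinja instead; the job name, function names and cluster sizing below are placeholders, and spark_version / node_type_id must be values valid in your workspace.

```python
import json
import urllib.request


def mount_job_payload(notebook_path: str, spark_version: str, node_type_id: str) -> dict:
    """Job definition that runs the mount notebook on a temporary cluster."""
    return {
        "name": "mount-storage",  # placeholder job name
        "new_cluster": {
            "spark_version": spark_version,
            "node_type_id": node_type_id,
            "num_workers": 1,
        },
        "notebook_task": {"notebook_path": notebook_path},
    }


def api_post(host: str, token: str, endpoint: str, payload: dict) -> dict:
    """POST a JSON payload to <host>/api/2.0/<endpoint> with a bearer token."""
    request = urllib.request.Request(
        f"{host}/api/2.0/{endpoint}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

A pipeline step could then create the job with api_post(host, token, "jobs/create", payload), read job_id from the response, and trigger it with api_post(host, token, "jobs/run-now", {"job_id": job_id}).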
You can apply this approach to different environments and resource groups, as do we. For this we make use of Jinja templating to fill out variables that are environment or project specific.
I hope this helps you out. Let me know if you have any questions!
I want to take an Oracle database backup using RMAN directly into Google Cloud Storage.
I am unable to find a plugin for taking RMAN backups into Cloud Storage. We have a plugin for Amazon S3 and I am looking for a similar one for Google Cloud Storage.
I don't believe there's an official way of doing this, although I did file a Feature Request for the Cloud Storage engineering team to look into, which you can find here.
I recommend that you star the Feature Request for easy visibility and access, so that you can follow its status updates. The Cloud Storage team might ask questions there too.
You can use gcsfuse to mount a GCS bucket as a file system on your machine and use RMAN to create backups there.
You can find more information about gcsfuse on its GitHub page. Here are the basic steps to mount a bucket and run RMAN:
Create a bucket oracle_bucket. Check that it doesn't have a retention policy defined on it (it looks like gcsfuse has some issues with retention policies).
Have a look at mounting.md, which describes credentials for GCS. For example, I created a service account with the Storage Admin role and created a JSON key for it.
Next, set up credentials for gcsfuse on your machine. In my case, I set GOOGLE_APPLICATION_CREDENTIALS to the path of the JSON key created in the previous step. Run:
sudo su - oracle
mkdir ./mnt_bucket
gcsfuse --dir-mode 755 --file-mode 777 --implicit-dirs --debug_fuse oracle_bucket ./mnt_bucket
From gcsfuse docs:
Important: You should run gcsfuse as the user who will be using the
file system, not as root. Do not use sudo.
Configure RMAN to create a backup in mnt_bucket. For example:
configure controlfile autobackup format for device type disk to '/home/oracle/mnt_bucket/%F';
configure channel device type disk format '/home/oracle/mnt_bucket/%U';
After you run backup database, you'll see the backup files created in your GCS bucket.
So I already have an empty storage bucket created for this and I don't want composer to create its own bucket for the dags - I'd like to use the one already created.
It's not ideal to have it just create a random bucket and then go
gcloud composer environments run test-environment --location europe-west1 variables -- --set gcs_bucket gs://my-bucket
I've dug around the docs, but it seems you cannot get around it creating a brand new bucket every time?
Currently, it is not possible.
In the environment's configuration in the Cloud Composer API, the dagGcsPrefix parameter is output only, so you cannot set it. The documentation also mentions that a Cloud Storage bucket is always created along with the Composer environment, and that the bucket name is based on the environment's region, name and a random ID.
You may want to “Star” this Feature Request for the mentioned functionality to receive notifications whenever an update is published. You can also review or subscribe to the Cloud Composer release notes to stay informed about recently added features.
You are right, this is currently not supported in Composer.