target multiple buckets with eventarc? - google-cloud-storage

Currently we are trying to use Eventarc to send us all finalized files for our buckets.
This works great; however, it currently looks like Eventarc can only target a single bucket, and we would need to enable it for every bucket on its own. Is there a way to target multiple buckets?
Currently we use the following to create the Eventarc trigger:
gcloud eventarc triggers create storage-events \
--location="$LOCATION" \
--destination-gke-cluster="CLUSTER-NAME" \
--destination-gke-location="$LOCATION" \
--destination-gke-namespace="$NAMESPACE" \
--destination-gke-service="$SERVICE" \
--destination-gke-path="api/events/receive" \
--event-filters="type=google.cloud.storage.object.v1.finalized" \
--event-filters="bucket=$BUCKET" \
--service-account=$SERVICEACCOUNT-compute@developer.gserviceaccount.com
The problem is that we generate a bucket per customer, so we would need to create the trigger for each bucket (which is a lot). Is there a simpler way?

You have several options.
If you want to use the native event google.cloud.storage.object.v1.finalized, you must select one and only one bucket. Therefore, you have to create one Eventarc trigger per bucket.
If you can use the Audit Logs event storage.objects.create, you have to activate the audit logs, but you cannot filter on buckets: ALL the buckets are listened to. If you don't want that, you can play with the Cloud Logging router to discard the logs you don't want (especially the audit logs of the buckets you don't care about).
A last solution, if you really want to use Eventarc, especially for the CloudEvents format of the messages, is to do the following (a sketch is shown below):
Create a Cloud Storage Pub/Sub notification for every bucket that you want to listen to, using the same Pub/Sub topic every time.
Create a custom Eventarc trigger on Pub/Sub and catch the messages published on that topic.
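A minimal sketch of that setup, with placeholder topic, project, and trigger names, and reusing the destination flags from the question, might look like this:
# Hypothetical topic; repeat the notification command for each customer bucket
gcloud pubsub topics create storage-events-topic
gsutil notification create -t storage-events-topic -f json -e OBJECT_FINALIZE gs://$BUCKET
# Custom Eventarc trigger that fires on every message published to that topic
gcloud eventarc triggers create storage-events-pubsub \
--location="$LOCATION" \
--destination-gke-cluster="$CLUSTER_NAME" \
--destination-gke-location="$LOCATION" \
--destination-gke-namespace="$NAMESPACE" \
--destination-gke-service="$SERVICE" \
--destination-gke-path="api/events/receive" \
--event-filters="type=google.cloud.pubsub.topic.v1.messagePublished" \
--transport-topic=projects/$PROJECT/topics/storage-events-topic \
--service-account=$SERVICEACCOUNT-compute@developer.gserviceaccount.com
Note that the event payload your service receives will then be the Pub/Sub-wrapped Cloud Storage notification rather than the native storage CloudEvent.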

Related

google cloud python kubernetes service performing action on bucket write

I am trying to write a Python service that will be deployed on Kubernetes and does something similar to a Cloud Function triggered by the google.storage.object.finalize action while listening on a bucket. In essence, I need to replace a Cloud Function that was created with the following parameters:
--trigger-resource YOUR_TRIGGER_BUCKET_NAME
--trigger-event google.storage.object.finalize
However, I can't find any resources online on how to do this. What would be the best way for a Python script deployed in Kubernetes to observe actions performed on a bucket and do something when a new file gets written into it? Thank you.
You just need to enable Pub/Sub notifications on the bucket so that it publishes to a Pub/Sub topic: https://cloud.google.com/storage/docs/pubsub-notifications
Then have your Python application listen to a subscription on the topic that you picked, either in a pull or push setup: https://cloud.google.com/pubsub/docs/pull.
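For example, a rough sketch of the setup (the topic and subscription names here are placeholders) could be:
# Publish OBJECT_FINALIZE events from the bucket to a Pub/Sub topic
gsutil notification create -t my-bucket-events -f json -e OBJECT_FINALIZE gs://YOUR_TRIGGER_BUCKET_NAME
# Create a pull subscription for the service running in Kubernetes to consume
gcloud pubsub subscriptions create my-bucket-events-sub --topic=my-bucket-events
Your Python service would then pull messages from my-bucket-events-sub, for instance with the google-cloud-pubsub client library, and react to each new object.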

IBM Cloud Object Storage: Get bucket size using CLI

I'm trying to find a way to automate the task of getting COS bucket sizes on IBM Cloud.
I have dozens of buckets across different accounts, but I still couldn't find a way to get this information using the IBM Cloud COS CLI, just other information, like bucket names, etc.
The COS S3 API does not return size information for buckets. Thus, the CLI, which is based on the API, won't return size information either.
But here's an indirect way to find the size of a bucket by looping through the sizes of the individual objects in the bucket
ibmcloud cos objects --bucket <BUCKET_NAME> --output JSON | jq 'reduce (.Contents[] | to_entries[]) as {$key,$value} ({}; .[$key] += $value) | .Size'
The output is in bytes.
You may have to loop through the bucket names, perhaps in a shell script (a sketch of such a loop is shown after the note below). To list all the buckets in an account and resource group, run the command below:
ibmcloud cos buckets --output JSON
Note: Before running the above commands, remember to add the COS service CRN to the config with the following command:
ibmcloud cos config crn --crn <SERVICE_CRN>
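Putting the pieces together, a rough shell sketch of that loop (assuming jq is installed; exact JSON field names and pagination behaviour may vary with your CLI version) could be:
#!/bin/bash
# List bucket names, then sum the object sizes of each bucket.
# Note: "ibmcloud cos objects" returns only the first page of objects,
# so very large buckets may need pagination handling.
for bucket in $(ibmcloud cos buckets --output JSON | jq -r '.Buckets[].Name'); do
  size=$(ibmcloud cos objects --bucket "$bucket" --output JSON | jq '[.Contents[]?.Size] | add // 0')
  echo "$bucket: $size bytes"
done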
The answer that loops through the individual objects is indeed the only (and likely best) way to use the IBM Cloud CLI to find that information, but there are a few other ways worth mentioning for completeness.
If you need to do this elegantly on the command line, the Minio Client provides a Linux-esque syntax:
mc du cos/$BUCKET
This returns the size of the bucket in MiB.
Additionally, the COS Resource Configuration API will directly return a bytes_used value, with no iterating over objects behind the scenes. While there's no official CLI implementation yet (although it's in the pipeline), it's relatively easy to use cURL or httpie to query the bucket.
curl "https://config.cloud-object-storage.cloud.ibm.com/v1/b/$BUCKET" \
-H 'Authorization: bearer $TOKEN'
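The $TOKEN here is an IAM bearer token; one way to obtain one, assuming you are already logged in with the IBM Cloud CLI, is:
ibmcloud iam oauth-tokens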

How can I use dataproc to pull data from bigquery that is not in the same project as my dataproc cluster?

I work for an organisation that needs to pull data from one of our client's BigQuery datasets using Spark, and given that both the client and ourselves use GCP, it makes sense to use Dataproc to achieve this.
I have read Use the BigQuery connector with Spark, which looks very useful; however, it seems to assume that the Dataproc cluster, the BigQuery dataset and the storage bucket for temporary BigQuery export are all in the same GCP project - that is not the case for me.
I have a service account key file that allows me to connect to and interact with our client's data stored in bigquery, how can I use that service account key file in conjunction with the BigQuery connector and dataproc in order to pull data from bigquery and interact with it in dataproc? To put it another way, how can I modify the code provided at Use the BigQuery connector with Spark to use my service account key file?
To use service account key file authorization, you need to set the mapred.bq.auth.service.account.enable property to true and point the BigQuery connector to a service account JSON key file using the mapred.bq.auth.service.account.json.keyfile property (at the cluster or job level). Note that this property value is a local path, which is why you need to distribute the key file to all the cluster nodes beforehand, for example using an initialization action.
Alternatively, you can use any authorization method described here, but you need to replace the fs.gs properties prefix with mapred.bq for the BigQuery connector.
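As a rough sketch (the cluster, bucket, and file names are placeholders, and the initialization action script is assumed to copy the key file onto every node), setting these properties at cluster creation time could look like this:
# copy-keyfile.sh (hypothetical init action) would run something like:
#   gsutil cp gs://my-staging-bucket/client-sa-key.json /etc/client-sa-key.json
gcloud dataproc clusters create my-cluster \
--region="$REGION" \
--initialization-actions=gs://my-staging-bucket/copy-keyfile.sh \
--properties='core:mapred.bq.auth.service.account.enable=true,core:mapred.bq.auth.service.account.json.keyfile=/etc/client-sa-key.json'
The core: prefix writes the properties into core-site.xml, which jobs running on the cluster (and therefore the BigQuery connector) should pick up; you can equally set the same properties per job instead.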

When creating a new Cloud Composer env, is it possible to set the bucket to a preexisting one?

So I already have an empty storage bucket created for this, and I don't want Composer to create its own bucket for the DAGs - I'd like to use the one already created.
It's not ideal to have it just create a random bucket and then go
gcloud composer environments run test-environment --location europe-west1 variables -- --set gcs_bucket gs://my-bucket
I've dug around the docs, but it seems you cannot get around it creating a brand-new bucket every time?
Currently, it is not possible.
In the environment's configuration in the Cloud Composer API, the dagGcsPrefix parameter is output only; you cannot set it. The documentation also mentions that a Cloud Storage bucket is always created along with the Composer environment, and that the name of the bucket is based on the environment's region, name and a random ID.
You may want to “Star” this Feature Request for the mentioned functionality to receive notifications whenever an update in this regard is published. You can also review or subscribe to the Cloud Composer release notes to stay up to date on recently added features.
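If you only need to find out which bucket Composer generated for an existing environment, you can read dagGcsPrefix from the environment description, for example (using the environment name and location from the question):
gcloud composer environments describe test-environment \
--location europe-west1 \
--format="value(config.dagGcsPrefix)"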
You are right, this is currently not supported in Composer.

Setting the Durable Reduced Availability (DRA) attribute for a bucket using Storage Console

When manually creating a new cloud storage bucket using the web-based storage console (https://console.developers.google.com/), is there a way to specify the DRA attribute? From the documentation, it appears that the only way to create buckets with that attribute is to either use Curl, gsutil or some other script, but not the console.
There is currently no way to do this.
At present, the storage console provides only a subset of the Cloud Storage API, so you'll need to use one of the tools you mentioned to create a DRA bucket.
For completeness, it's pretty easy to do this using gsutil (documentation at https://developers.google.com/storage/docs/gsutil/commands/mb):
gsutil mb -c DRA gs://some-bucket
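To verify the storage class afterwards, you can list the bucket metadata, which includes a Storage class field:
gsutil ls -L -b gs://some-bucket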