Run a Google Cloud Function for each file in a bucket - google-cloud-storage

I have a Google Cloud Function triggered by a Google Cloud Storage object.finalize event. When I deploy a new version of this function, I would like to run it for every existing file in the bucket (all of which have already been processed by the previous version of the function). Processing all the existing files in the bucket is a long-running task, so I don't think a single Google Cloud Function invocation that processes all the files in a row is an option.
The best option I can see for now is to create a Google Cloud Function I can trigger via HTTP that lists all the files in the bucket and publishes one event per file via Google Pub/Sub, and then to process each of these events with a slightly modified version of my initial Google Cloud Function that accepts a Pub/Sub event in place of the object.finalize storage event.
I think this can work, but I was wondering if there is an easier way to perform this operation.

If the operation you're trying to perform may take longer than the maximum time a Cloud Function can run, you will need to split that operation into multiple steps. Your approach of using a Pub/Sub trigger for each individual file sounds like a valid way to do that to me.

One option might be to write a small program that lists all of the objects in a bucket and, for each object, posts a message to Cloud Pub/Sub that triggers your function in the same way a GCS change would.
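If it helps, here is a minimal sketch of that small program in TypeScript, using the @google-cloud/storage and @google-cloud/pubsub clients. The bucket name, topic name and message shape ({ bucket, name }) are assumptions for illustration; the idea is simply to publish one message per existing object so your modified function can pick them up one at a time.

import { Storage } from "@google-cloud/storage";
import { PubSub } from "@google-cloud/pubsub";

// List every object in the bucket and publish one Pub/Sub message per object.
// "my-bucket" and "reprocess-files" are placeholders for your own names.
async function republishExistingFiles(): Promise<void> {
  const storage = new Storage();
  const topic = new PubSub().topic("reprocess-files");

  const [files] = await storage.bucket("my-bucket").getFiles();
  for (const file of files) {
    // Carry the same bucket/name fields the object.finalize event would have,
    // so the modified function can reuse most of its existing code.
    await topic.publishMessage({ json: { bucket: file.bucket.name, name: file.name } });
  }
}

republishExistingFiles().catch(console.error);

Since the listing itself is quick, this publisher can run anywhere (locally, Cloud Shell, or an HTTP-triggered function); the long-running work stays in the per-file function invocations.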

Related

Running periodic queries on a Google Cloud SQL instance

I have a Google Cloud SQL Postgres instance and I'd like to run periodic SQL queries on it and use the monitoring system to alert the user with the results.
How can I accomplish this using just the GCP platform, without having to develop a separate app?
As far as I am aware, there is no built-in feature for recurring queries in Cloud SQL at the moment, so you have to implement your own. You can use Cloud Scheduler to trigger a Cloud Function (via an HTTP/S endpoint) that runs the query on Cloud SQL and then notifies the user in whatever way suits your needs (I would recommend Pub/Sub).
You might also want to save the result in a GCS bucket and have the user pull it from there.
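As a rough sketch (in TypeScript/Node, with the instance connection name, credentials, query and topic name all made up for illustration), the Cloud Function that Cloud Scheduler calls could look like this:

import { Pool } from "pg";
import { PubSub } from "@google-cloud/pubsub";

// Connect to Cloud SQL Postgres over the unix socket that Cloud Functions
// exposes for an attached instance. All names here are placeholders.
const pool = new Pool({
  host: "/cloudsql/my-project:europe-west1:my-instance",
  user: "reporter",
  password: process.env.DB_PASSWORD,
  database: "mydb",
});

const pubsub = new PubSub();

// HTTP entry point for Cloud Scheduler.
export const runScheduledQuery = async (req: any, res: any): Promise<void> => {
  // Example query; replace with whatever you need to monitor.
  const { rows } = await pool.query(
    "SELECT count(*) AS overdue FROM invoices WHERE due_date < now()"
  );
  await pubsub.topic("query-results").publishMessage({ json: rows });
  res.status(200).send("done");
};

A Pub/Sub push subscription (or another small function) can then turn those messages into the actual alert to the user.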
Also, you might want to check out BigQuery, which has a built-in scheduled queries feature.

Reveal GCS bucket at specific date and time

I have a Google Cloud Storage bucket that I would like to reveal (make public) at a specific date and time. How can this be achieved?
I have tried the bucket permissions, only to find out that I cannot attach any condition to the allUsers principal.
Another way that comes up is a Google Compute Engine instance with a startup script, together with Cloud Scheduler; however, this has an unpredictable delay, which my use case cannot tolerate.
So is there any other way? I do not necessarily need to use GCS; any other service that will let me reveal a folder/files at a specific time would be enough.
You can execute your function that makes the objects public in Cloud Functions for Firebase with functions.pubsub.schedule().onRun():
In Cloud Functions for Firebase, scheduling logic resides in your functions code, with no special deploy-time requirements. To create a scheduled function, use functions.pubsub.schedule('your schedule').onRun((context)).
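A minimal sketch of such a scheduled function in TypeScript, assuming a hypothetical bucket name and reveal time (note that the cron expression recurs yearly, so you would delete the function after it has fired, or add a date check):

import * as functions from "firebase-functions";
import { Storage } from "@google-cloud/storage";

const storage = new Storage();

// Fires at 12:00 on 24 December (Cloud Scheduler cron syntax); the bucket name
// and schedule are placeholders.
export const revealBucket = functions.pubsub
  .schedule("0 12 24 12 *")
  .timeZone("Europe/Prague")
  .onRun(async () => {
    // Grants allUsers read access to the bucket and every existing object.
    await storage.bucket("my-reveal-bucket").makePublic({ includeFiles: true });
    return null;
  });

Cloud Scheduler fires with minute-level granularity, which should be far more predictable than booting a Compute Engine instance with a startup script.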

Automate / schedule a script

I have read a number of blogs and watched tutorials, but I cannot find anything to help me with my problem.
I have a stakeholder who drops files into Google Cloud Storage, and I have already written a script that performs ETL tasks on them.
It would be great if I could create a trigger that runs my script as soon as a file is dropped in a specific place in Google Cloud Storage.
Google Cloud Storage supports Google Cloud Pub/Sub Notifications. This allows you to programmatically receive notifications when new objects are uploaded to your bucket.
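For example, after routing those notifications to a Pub/Sub topic (e.g. with gsutil notification create -t etl-uploads -f json -e OBJECT_FINALIZE gs://my-upload-bucket, where the topic and bucket names are placeholders), a small Cloud Function subscribed to that topic can kick off your ETL script. A hedged TypeScript sketch, with runEtl standing in for your existing script:

// Background function subscribed to the "etl-uploads" topic.
// GCS notifications put the bucket and object names in the message attributes.
export const handleUpload = async (message: { attributes: { [key: string]: string } }) => {
  const bucket = message.attributes.bucketId;
  const object = message.attributes.objectId;
  console.log(`New object gs://${bucket}/${object}, starting ETL`);
  // await runEtl(bucket, object);  // hypothetical helper wrapping your existing ETL logic
};

Alternatively, if the script can live inside a Cloud Function, you can skip the explicit notification setup and deploy the function with a google.storage.object.finalize trigger on the bucket directly.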

Google Cloud Dataflow: while in PubSub streaming mode, TextIO.Read uses massive amounts of vCPU time

I'm using Google Cloud Platform to transfer data from an Azure server to a BigQuery table (working nice and smoothly, functionally speaking).
The pipeline looks like this:
[image: Dataflow streaming pipeline]
The 'FetchMetadata' part of the pipeline is a simple TextIO.Read implementation where I read a 66-line .csv file with metadata from a GCP Storage bucket:
PCollection<String> metaLine = p.apply(TextIO.Read.named("FetchMetadata")
        .from("gs://my-bucket"));
When I use my pipeline in Batch mode this works like a charm: first the metadata file is loaded into the pipeline in less than a second of vCPU time, and then the data itself is loaded. Now, when running in Streaming mode, I would love to replicate that behaviour to some extent, but when I just use the same code there is a problem: when the pipeline runs for just 15 minutes (actual time), the TextIO.Read block uses a whopping 4 hours of vCPU time. For a pipeline that will be permanently running for a low-budget project, this is unacceptable.
So my question: is it possible to change the code so that the file is periodically re-read (if the file changes I want the pipeline to pick it up, so let's say hourly re-reads) rather than continuously, like it's doing right now?
I've found some documentation that mentions TextIO.Read.Bound, which looks like a good place to start solving this issue, but it's not a solution for my periodic update problem (as far as I know).
I was stuck in a similar situation. The way I solved this problem is a bit different. I would like the community's insights into this solution.
I had files being updated every hour in a GCS bucket. I followed the blog post about Scheduling Dataflow Jobs from App Engine or Google Cloud Function.
I had the App Engine endpoint configured to receive the object change notifications from the GCS bucket which contained the files to be processed. For every file that was created (an update is also a create operation in an object store), the App Engine application would submit a job to Google Dataflow. The job would read the lines from the file (the file name comes in the HTTP request body) and publish them to a Google Pub/Sub topic.
A streaming pipeline was subscribed to that Google Pub/Sub topic and would process and output the relevant rows to BigQuery. This way, the streaming pipeline ran at the minimum worker count when idle, the ingest of the files happened through a batch pipeline, and the streaming pipeline scaled with the volume of publications on the Google Pub/Sub topic.
In the tutorial for submitting jobs to Google Dataflow, the jar is executed from the terminal. I modified the code to submit the job to Google Dataflow using templates, which can be executed with parameters. This way, the job submission operation becomes super lightweight while still creating a job for every new file uploaded to the GCS bucket. Please refer to this link for details about executing Google Dataflow job templates.
Note: Please mention in the comments if the answer needs to be modified for the code snippets of the dataflow job template and app engine application and I will update the answer accordingly.
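As an illustration of that template-launch call (not necessarily the exact setup described above), here is a rough TypeScript sketch using the googleapis client; the project id, template path and parameter name are assumptions, and the App Engine endpoint would call something like this for each notification it receives:

import { google } from "googleapis";

// Launch a previously staged Dataflow template for one uploaded file.
export async function launchIngestJob(bucket: string, file: string): Promise<void> {
  const auth = await google.auth.getClient({
    scopes: ["https://www.googleapis.com/auth/cloud-platform"],
  });
  const dataflow = google.dataflow({ version: "v1b3", auth });

  await dataflow.projects.templates.launch({
    projectId: "my-project",                      // placeholder project id
    gcsPath: "gs://my-templates/ingest-template", // placeholder staged template
    requestBody: {
      jobName: `ingest-${Date.now()}`,
      parameters: { inputFile: `gs://${bucket}/${file}` }, // parameter name defined by your template
    },
  });
}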

How to automatically create a new file based on an existing file within Google Cloud Storage

This is the first time I have used Google Cloud, so I might be asking the question in the wrong place.
 
An information provider uploads a new file to Google Cloud Storage every day.
The file contains the information of all my clients/departments.
I have to sort through the information and create a new file (or files) containing the relevant information for each department in my company, so that everyone only gets the information relevant to them (security).
I can't figure out what steps I need to follow to complete the task.
Can you help me?
You want to have a process that starts automatically and subsequently generates a new file once you upload something to Google Cloud Storage.
The easiest way to handle this is using Object Change Notifications. You can set up Object Change Notifications per bucket, and this will send a POST request to a URL that you can define.
You can then easily set up a server (or run it on App Engine) that executes an action based on the POST request it receives.
There is an even simpler option (although still in alpha) named Cloud Functions. Cloud Functions is a serverless service that provides event-based microservices (e.g. 'do this' if a new file is uploaded to GCS). This means you only have to write the code that defines what needs to happen when a new file is uploaded, and Cloud Functions will take care of executing it when a file lands in GCS. See this tutorial on using Cloud Functions with Google Cloud Storage.
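As a concrete (hypothetical) illustration of the Cloud Functions route, here is a TypeScript sketch of a function triggered by new uploads; the output bucket name, the CSV layout and the "department" column are all assumptions about your data:

import { Storage } from "@google-cloud/storage";

const storage = new Storage();

// Background function triggered by google.storage.object.finalize on the
// source bucket: split the daily CSV by department and write one file per
// department to a separate output bucket.
export const splitByDepartment = async (file: { bucket: string; name: string }) => {
  const [contents] = await storage.bucket(file.bucket).file(file.name).download();
  const [header, ...rows] = contents
    .toString("utf8")
    .split("\n")
    .filter((line) => line.trim() !== "");
  const deptIndex = header.split(",").indexOf("department"); // assumed column name

  const byDept = new Map<string, string[]>();
  for (const row of rows) {
    const dept = row.split(",")[deptIndex];
    if (!byDept.has(dept)) byDept.set(dept, [header]);
    byDept.get(dept)!.push(row);
  }

  await Promise.all(
    [...byDept.entries()].map(([dept, lines]) =>
      storage
        .bucket("department-files") // placeholder output bucket
        .file(`${dept}/${file.name}`)
        .save(lines.join("\n"))
    )
  );
};

Access control then comes from giving each department read access only to its own prefix (or its own bucket), rather than to the original upload bucket.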