Triggering a Dataflow job when new files are added to Cloud Storage

I'd like to trigger a Dataflow job when new files are added to a Storage bucket in order to process and add new data into a BigQuery table. I see that Cloud Functions can be triggered by changes in the bucket, but I haven't found a way to start a Dataflow job using the gcloud node.js library.
Is there a way to do this using Cloud Functions or is there an alternative way of achieving the desired result (inserting new data to BigQuery when files are added to a Storage bucket)?

This is supported in Apache Beam starting with 2.2. See Watching for new files matching a filepattern in Apache Beam.
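A minimal sketch with the Beam Java SDK, assuming hypothetical bucket and table names (the trivial one-column TableRow mapping is only illustrative, not your actual schema):

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.TableRowJsonCoder;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.Watch;
import org.apache.beam.sdk.values.TypeDescriptor;
import org.joda.time.Duration;

public class WatchGcsToBigQuery {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("ReadNewFiles", TextIO.read()
            // Hypothetical bucket/prefix -- replace with your own.
            .from("gs://my-bucket/incoming/*.csv")
            // Re-check the pattern every minute and never stop watching,
            // which turns this into an unbounded (streaming) source.
            .watchForNewFiles(Duration.standardMinutes(1), Watch.Growth.never()))
     .apply("ToTableRow", MapElements
            .into(TypeDescriptor.of(TableRow.class))
            .via((String line) -> new TableRow().set("raw_line", line)))
     .setCoder(TableRowJsonCoder.of())
     .apply("WriteToBQ", BigQueryIO.writeTableRows()
            // Hypothetical table; it must already exist since no schema is supplied here.
            .to("my-project:my_dataset.my_table")
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

    p.run();
  }
}
```

Run with the Dataflow runner in streaming mode and the job keeps watching the bucket; the Cloud Functions approach below instead launches a new job per notification.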

This post may help; it shows how to trigger Dataflow pipelines from App Engine or Cloud Functions:
https://cloud.google.com/blog/big-data/2016/04/scheduling-dataflow-pipelines-using-app-engine-cron-service-or-cloud-functions
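A rough sketch of what such a trigger can do, using the Dataflow REST API Java client (google-api-services-dataflow) to launch a pre-staged template. The project ID, template path, bucket, and parameter names are placeholders, and the wiring into an actual Cloud Functions or App Engine handler is omitted:

```java
import com.google.api.client.googleapis.javanet.GoogleNetHttpTransport;
import com.google.api.client.json.gson.GsonFactory;
import com.google.api.services.dataflow.Dataflow;
import com.google.api.services.dataflow.model.LaunchTemplateParameters;
import com.google.api.services.dataflow.model.RuntimeEnvironment;
import com.google.auth.http.HttpCredentialsAdapter;
import com.google.auth.oauth2.GoogleCredentials;
import java.util.Collections;

public class LaunchDataflowTemplate {
  public static void main(String[] args) throws Exception {
    // Placeholders -- substitute your own project, template, and file path.
    String project = "my-project";
    String templatePath = "gs://my-bucket/templates/gcs-to-bq";
    String newFile = "gs://my-bucket/incoming/new-file.csv";

    Dataflow dataflow = new Dataflow.Builder(
            GoogleNetHttpTransport.newTrustedTransport(),
            GsonFactory.getDefaultInstance(),
            new HttpCredentialsAdapter(GoogleCredentials.getApplicationDefault()))
        .setApplicationName("gcs-trigger")
        .build();

    LaunchTemplateParameters params = new LaunchTemplateParameters()
        .setJobName("gcs-to-bq-" + System.currentTimeMillis())
        .setParameters(Collections.singletonMap("inputFile", newFile))
        .setEnvironment(new RuntimeEnvironment().setTempLocation("gs://my-bucket/temp"));

    // Launches one Dataflow job per invocation from the pre-staged template.
    dataflow.projects().templates()
        .launch(project, params)
        .setGcsPath(templatePath)
        .execute();
  }
}
```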

Related

Cloud Data Fusion pipeline works well in Preview mode but throws an error in Deployment mode

I have the pipeline below, which ingests news data from RSS feeds. The pipeline is constructed using HTTPPoller, XMLMultiParser Transform, JavaScript, and MongoDB Sink. The pipeline works well in Preview mode but throws a "bucket not found" error in Deployment mode.
(Screenshots: RSS Ingest Pipeline; error message)
Cloud Data Fusion (CDF) creates a Google Cloud Storage (GCS) bucket in your GCP project, with a name similar to the one mentioned in the error message, when you create a CDF instance. Judging by the error message, it's possible that the GCS bucket has been deleted. Try deploying the same pipeline in a new CDF instance (with the bucket present this time) and it should not raise the same exception.
This bucket is used as a Hadoop Compatible File System (HCFS), which is required to run pipelines on Dataproc.

How to get Talend to wait for a file to land in S3

I have a file that lands in AWS S3 several times a day. I am using Talend as my ETL tool to populate a warehouse in Snowflake and need it to watch for the file to trigger my job. I've tried tWaitForFile but can't seem to get it to connect to S3. Has anyone done this before?
Check the link below on automating a pipeline that uses S3 and Lambda to trigger a Talend job when files arrive:
Automate S3 File Push
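A hedged sketch of the Lambda half of that pattern, in Java: an S3-triggered function that picks up the new object's location and calls a hypothetical HTTP endpoint that starts the Talend job. The endpoint URL and request shape are placeholders for whatever mechanism your Talend setup exposes (e.g. a job published as a web service or an orchestration API):

```java
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.S3Event;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class S3ToTalendTrigger implements RequestHandler<S3Event, String> {

  @Override
  public String handleRequest(S3Event event, Context context) {
    // One record per object-created notification.
    String bucket = event.getRecords().get(0).getS3().getBucket().getName();
    String key = event.getRecords().get(0).getS3().getObject().getKey();

    // Hypothetical endpoint that starts the Talend job with the file location.
    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("https://talend.example.com/start-job?file=s3://" + bucket + "/" + key))
        .GET()
        .build();
    try {
      HttpResponse<String> response =
          HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
      return "Triggered Talend job, status " + response.statusCode();
    } catch (Exception e) {
      throw new RuntimeException("Failed to trigger Talend job for " + key, e);
    }
  }
}
```

The Lambda is configured with an S3 "object created" event notification on the landing bucket, so Talend no longer has to poll for the file.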

Is there a way to deploy scheduled queries to GCP directly through a GitHub Action, with a configurable schedule?

Currently using the GCP BigQuery UI for scheduled queries; everything is done manually.
Wondering if there's a way to deploy them to GCP automatically through GitHub Actions, using a config JSON that contains the scheduled query's parameters and schedule times?
So far, this is one option I've found that makes it more "automated":
- store the query in a file on Cloud Storage; when the Cloud Function is invoked, read the file content and run a BigQuery job on it (a sketch of this is shown below)
- you have to update the file content to update the query
- con: reading the file from Storage and then calling BigQuery means two API calls, plus a query file to manage
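A minimal sketch of that workaround in Java, assuming hypothetical bucket, object, and table names; the function body would read the SQL from Cloud Storage and submit it as a BigQuery query job:

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;
import java.nio.charset.StandardCharsets;

public class RunQueryFromGcs {
  public static void main(String[] args) throws InterruptedException {
    // First API call: fetch the query text from Cloud Storage (hypothetical bucket/object).
    Storage storage = StorageOptions.getDefaultInstance().getService();
    String sql = new String(
        storage.readAllBytes(BlobId.of("my-config-bucket", "queries/daily_load.sql")),
        StandardCharsets.UTF_8);

    // Second API call: run the query as a BigQuery job.
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
    QueryJobConfiguration config = QueryJobConfiguration.newBuilder(sql)
        .setUseLegacySql(false)
        .build();
    bigquery.query(config);  // blocks until the job completes
  }
}
```

The "schedule" part then comes from whatever invokes the function (Cloud Scheduler, or a cron entry in a GitHub Actions workflow).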
Currently using DBT in other repos to automate and speed up this process: https://docs.getdbt.com/docs/introduction
Would prefer the GitHub Actions version though, just haven't found good documentation for it yet :)

Google Cloud Dataflow to Cloud Storage

The reference architecture above indicates the existence of a Cloud Storage sink from Cloud Dataflow; however, the Beam API, which seems to be the current default Dataflow API, has no Cloud Storage I/O connector listed.
Can anyone clarify whether one exists, and if not, what the alternative is for getting data from Dataflow into Cloud Storage?
Beam does support reading from and writing to GCS; you simply use the TextIO classes.
https://beam.apache.org/documentation/sdks/javadoc/0.2.0-incubating/org/apache/beam/sdk/io/TextIO.html
To read a PCollection from one or more text files, use TextIO.Read. You can instantiate a transform using TextIO.Read.from(String) to specify the path of the file(s) to read from (e.g., a local filename or filename pattern if running locally, or a Google Cloud Storage filename or filename pattern of the form "gs://<bucket>/<filepath>").
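That Javadoc link is for an old (0.2.0-incubating) SDK; with the current Beam Java SDK the calls are TextIO.read() / TextIO.write(). A minimal sketch that copies text files between two hypothetical GCS locations:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class GcsReadWrite {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Any path starting with gs:// is resolved by Beam's GCS FileSystem.
    p.apply("ReadFromGcs", TextIO.read().from("gs://my-bucket/input/*.txt"))
     .apply("WriteToGcs", TextIO.write().to("gs://my-bucket/output/part")
                                        .withSuffix(".txt"));

    p.run().waitUntilFinish();
  }
}
```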
You can use TextIO, AvroIO, or any other connector that reads from or writes to files to interact with GCS. Beam treats any file path that starts with "gs://" as a GCS path; it does this using the pluggable FileSystem [1] interface.
[1] https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/storage/GcsFileSystem.java

How to continuously write mongodb data into a running hdinsight cluster

I want to keep a Windows Azure HDInsight cluster always running so that I can periodically write updates from my master data store (which is MongoDB) and have it process map-reduce jobs on demand.
How can I periodically sync data from MongoDB with the HDInsight service? I'm trying to avoid uploading all the data whenever a new query is submitted (which can happen at any time), and instead have it somehow pre-warmed.
Is that possible on HDInsight? Is it even possible with Hadoop?
Thanks,
It is certainly possible to have that data pushed from Mongo into Hadoop.
Unfortunately, HDInsight does not support HBase (yet); otherwise, you could use something like ZeroWing, a solution from Stripe that reads the MongoDB oplog (used by Mongo for replication) and writes it out to HBase.
Another solution might be to write documents from your Mongo database out to Azure Blob storage. That way you wouldn't have to keep the cluster up all the time, but you would still be able to use it to run periodic map-reduce analytics against the files in the storage vault.
Your best method is undoubtedly to use the Mongo Hadoop connector. This can be installed in HDInsight, but it's a bit fiddly. I've blogged a method here.
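As a very rough sketch of what a connector-based job driver can look like (the class and property names come from the open-source mongo-hadoop connector; the connection string, output path, and job details are placeholders, not the blogged method):

```java
import com.mongodb.hadoop.MongoInputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class MongoToClusterStorage {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Placeholder connection string: point this at the MongoDB collection to pull from.
    conf.set("mongo.input.uri", "mongodb://mongo-host:27017/mydb.mycollection");

    Job job = Job.getInstance(conf, "mongo-export");
    job.setJarByClass(MongoToClusterStorage.class);
    job.setInputFormatClass(MongoInputFormat.class);   // reads documents straight from MongoDB
    job.setOutputFormatClass(TextOutputFormat.class);  // dumps them as text for later analytics
    job.setNumReduceTasks(0);                          // map-only copy with the identity mapper

    // wasb:// paths on HDInsight are backed by Azure Blob storage.
    FileOutputFormat.setOutputPath(job, new Path("wasb:///data/mongo-export"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

In practice you would replace the identity map-only copy with your actual map-reduce logic, or keep the copy step and run separate analytics jobs against the exported files.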