Google Cloud Dataflow to Cloud Storage - google-cloud-storage

The reference architecture above indicates the existence of a Cloud Storage sink from Cloud Dataflow; however, the Beam API, which appears to be the current default Dataflow API, has no Cloud Storage I/O connector listed.
Can anyone clarify whether one exists, and if not, what the alternative is for bringing data from Dataflow to Cloud Storage?

Beam does support reading from and writing to GCS. You simply use the TextIO classes.
https://beam.apache.org/documentation/sdks/javadoc/0.2.0-incubating/org/apache/beam/sdk/io/TextIO.html
To read a PCollection from one or more text files, use TextIO.Read. You can instantiate a transform using TextIO.Read.from(String) to specify the path of the file(s) to read from (e.g., a local filename or filename pattern if running locally, or a Google Cloud Storage filename or filename pattern of the form "gs://<bucket>/<filepath>").

You can use TextIO, AvroIO, or any other connector that reads from/writes to files to interact with GCS. Beam treats any file path that starts with "gs://" as a GCS path. Beam does this through the pluggable FileSystem interface [1].
[1] https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/storage/GcsFileSystem.java
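The scheme-based dispatch described above can be sketched in a few lines. This is a toy illustration of the idea, not Beam's actual API: the registry, function name, and handler names below are all made up for the example.

```python
# Toy sketch of scheme-based filesystem dispatch, loosely modeled on Beam's
# pluggable FileSystem interface. Names here are illustrative, not Beam's API.
from urllib.parse import urlparse

# Registry mapping URI schemes to filesystem handlers.
_FILESYSTEMS = {
    "gs": "GcsFileSystem",      # paths like gs://bucket/path
    "file": "LocalFileSystem",  # explicit file:// paths
}

def filesystem_for(path: str) -> str:
    """Return the handler name for a path; paths with no scheme are local."""
    scheme = urlparse(path).scheme
    return _FILESYSTEMS.get(scheme, "LocalFileSystem")
```

With this kind of lookup in place, the same TextIO-style transform can target `gs://my-bucket/output.txt` or `/tmp/output.txt` without any code change.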

Related

Google Cloud SQL PostgreSQL replication?

I want to make sure that there's not a better (easier, more elegant) way of emulating what I think is typically referred to as "logical replication" ("logical decoding"?) within the PostgreSQL community.
I've got a Cloud SQL instance (PostgreSQL v9.6) that contains two databases, A and B. I want B to mirror A, as closely as possible, but don't need to do so in real time or anything near that. Cloud SQL does not offer the capability of logical replication where write-ahead logs are used to mirror a database (or subset thereof) to another database. So I've cobbled together the following:
A Cloud Scheduler job publishes a message to a topic in Google Cloud Platform (GCP) Pub/Sub.
A Cloud Function kicks off an export. The exported file is in the form of a pg_dump file.
The dump file is written to a named bucket in Google Cloud Storage (GCS).
Another Cloud Function (the import function) is triggered by the writing of this export file to GCS.
The import function makes an API call to delete database B (the pg_dump file created by the export API call does not contain initial DROP statements and there is no documented facility for adding them via the API).
It creates database B anew.
It makes an API call to import the pg_dump file.
It deletes the old pg_dump file.
That's five different objects across four GCP services, just to obtain functionality that exists natively in PostgreSQL.
Is there a better way to do this within Google Cloud SQL?
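The export and import steps above go through the Cloud SQL Admin API. As a minimal sketch, the request bodies the two Cloud Functions would send could look like the following; the field names are based on the sqladmin API's exportContext/importContext as I understand them, and the bucket and database names are placeholders, so verify both against the current API reference.

```python
# Sketch of the JSON bodies for the Cloud SQL Admin API export/import calls
# made by the two Cloud Functions. Field names follow the sqladmin API's
# exportContext/importContext; bucket and database names are placeholders.

def export_body(source_db: str, gcs_uri: str) -> dict:
    """Body for POST .../instances/{instance}/export (pg_dump-style SQL file)."""
    return {
        "exportContext": {
            "fileType": "SQL",
            "uri": gcs_uri,            # e.g. gs://my-bucket/a.sql
            "databases": [source_db],  # export database A
        }
    }

def import_body(target_db: str, gcs_uri: str) -> dict:
    """Body for POST .../instances/{instance}/import into the recreated B."""
    return {
        "importContext": {
            "fileType": "SQL",
            "uri": gcs_uri,
            "database": target_db,  # import into database B
        }
    }
```

The drop/recreate of database B sits between these two calls, since the exported file contains no DROP statements.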

Dataproc: Hot data on HDFS, cold data on Cloud Storage?

I am studying for the Professional Data Engineer exam and I wonder what the "Google recommended best practice" is for hot data on Dataproc (given that cost is no concern)?
If cost is a concern, I found a recommendation to keep all data in Cloud Storage because it is cheaper.
Can a mechanism be set up, such that all data is on Cloud Storage and recent data is cached on HDFS automatically? Something like AWS does with FSx/Lustre and S3.
What to store in HDFS and what to store in GCS is a case-dependent question. Dataproc supports running Hadoop or Spark jobs against GCS via the GCS connector, which makes Cloud Storage HDFS-compatible without performance losses.
The Cloud Storage connector is installed by default on all Dataproc cluster nodes and is available in both Spark and PySpark environments.
After researching a bit: the performance of HDFS and Cloud Storage (or any other blob store) is not completely equivalent. For instance, a "mv" operation in a blob store is emulated as copy + delete.
What the ASF can do is warn that our own BlobStore filesystems (currently s3:, s3n: and swift:) are not complete replacements for hdfs:, as operations such as rename() are only emulated through copying then deleting all operations, and so a directory rename is not atomic -a requirement of POSIX filesystems which some applications (MapReduce) currently depend on.
Source: https://cwiki.apache.org/confluence/display/HADOOP2/HCFS
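The non-atomic rename can be illustrated with a toy blob store (a plain dict standing in for GCS): rename is emulated as a copy followed by a delete, so a failure between the two steps would leave the object visible under both names. The store contents and key names are invented for the example.

```python
# Toy blob store (a dict standing in for GCS) showing why rename is not
# atomic on object stores: it is emulated as copy + delete, so a crash
# between the two steps leaves the object under both names at once.
store = {"logs/part-0000": b"data"}

def rename(src: str, dst: str) -> None:
    store[dst] = store[src]  # step 1: copy the object to the new name
    # A crash here leaves both names visible -- the non-atomic window.
    del store[src]           # step 2: delete the original

rename("logs/part-0000", "archive/part-0000")
```

On a real POSIX filesystem (and on HDFS) the directory rename is a single metadata operation, which is what MapReduce-style commit protocols rely on.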

Fluentd daemonset alternative to S3 on Azure (Blob)

For log-intensive microservices, I was hoping to persist my logs as blobs in Azure Blob Storage (the S3 alternative). However, I noticed that Fluentd does not seem to support it out of the box.
Is there any alternative for persisting my logs in Azure the way I would in S3?
There are plugins that support Fluentd with Azure Blob Storage, specifically append blobs:
Azure Storage Append Blob output plugin buffers logs in local file and uploads them to Azure Storage Append Blob periodically.
There's a step-by-step guide available here, which is a Microsoft solution; there's also an external plugin with the same capabilities here.
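As a rough sketch, a match stanza for the append-blob output plugin might look like the following. The plugin type and parameter names are recalled from the Microsoft plugin's README and should be treated as assumptions to verify; the account, key, and container values are placeholders.

```text
# Sketch of a Fluentd match stanza for the Azure append-blob plugin.
# Plugin type and parameter names are assumptions to verify against the
# plugin's README; account/key/container values are placeholders.
<match **>
  @type azure-storage-append-blob
  azure_storage_account    mystorageaccount
  azure_storage_access_key MY_ACCESS_KEY
  azure_container          fluentd-logs
  <buffer>
    @type file
    path /var/log/fluent/azure-append-blob
    flush_interval 60s    # upload buffered logs periodically
  </buffer>
</match>
```

The `<buffer>` section is what gives the "buffers logs in a local file and uploads them periodically" behavior described above.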
There is an easy solution: use Vector, a lightweight log-forwarding agent from Datadog. It is, of course, free to use and a better alternative to Fluentd for a non-enterprise-level use case.
I recently set that up to forward logs from Azure AKS to a storage bucket in near real time. Feel free to check out my blog and YouTube video on the topic. I hope it helps.

Triggering a Dataflow job when new files are added to Cloud Storage

I'd like to trigger a Dataflow job when new files are added to a Storage bucket in order to process and add new data into a BigQuery table. I see that Cloud Functions can be triggered by changes in the bucket, but I haven't found a way to start a Dataflow job using the gcloud node.js library.
Is there a way to do this using Cloud Functions or is there an alternative way of achieving the desired result (inserting new data to BigQuery when files are added to a Storage bucket)?
This is supported in Apache Beam starting with 2.2. See Watching for new files matching a filepattern in Apache Beam.
Maybe this post would help; it covers how to trigger Dataflow pipelines from App Engine or Cloud Functions:
https://cloud.google.com/blog/big-data/2016/04/scheduling-dataflow-pipelines-using-app-engine-cron-service-or-cloud-functions
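The approach boils down to having the bucket-triggered function call the Dataflow REST API. A minimal sketch of building that request follows; the endpoint shape is based on the Dataflow `projects.templates.launch` method, and the project, template path, job name, and parameter names are placeholders for whatever your template actually expects.

```python
# Sketch of the request a Cloud Function could send to launch a Dataflow
# template when a new file lands in the bucket. Project, bucket, template
# path, and parameter names are placeholders; the endpoint follows the
# Dataflow projects.templates.launch REST method.

def launch_request(project: str, template_gcs_path: str, input_file: str):
    """Return (url, body) for the templates.launch call."""
    url = (
        f"https://dataflow.googleapis.com/v1b3/projects/{project}"
        f"/templates:launch?gcsPath={template_gcs_path}"
    )
    body = {
        "jobName": "gcs-to-bigquery",                # placeholder job name
        "parameters": {"inputFile": input_file},     # template's own parameter
    }
    return url, body

url, body = launch_request(
    "my-project", "gs://my-bucket/templates/my-template", "gs://data/new.csv"
)
```

The function would POST `body` to `url` with an OAuth token for a service account that has Dataflow permissions; the template itself then handles the load into BigQuery.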

how to rotate file while doing a streaming transfer to google cloud storage

We are working on a POC where we want to stream our web logs to Google Cloud Storage. We learnt that objects on Google Cloud Storage are immutable and cannot be appended to from the Java API. However, we can do streaming transfers using gsutil, according to this link: https://cloud.google.com/storage/docs/concepts-techniques?hl=en#streaming
Now we would like to write hourly files. Is there a way to change the file name every hour, like logrotate?
gsutil doesn't offer any logrotate-style features for object naming.
With a gsutil streaming transfer, the resulting cloud object is named according to the destination object in your gsutil cp command. To achieve rotation, your job that produces the stream could close the stream on an hourly basis, select a new filename, and issue a new streaming gsutil cp command.
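That rotation loop can be sketched as follows. The bucket name and prefix are placeholders, and the job structure (who closes the stream, how the producer reconnects) is an assumption about your setup; the `gsutil cp - gs://...` form is the streaming-transfer command from the docs.

```python
# Sketch of hourly rotation for a gsutil streaming transfer: pick a new
# destination object name each hour and start a fresh `gsutil cp -` for it.
# Bucket name and prefix are placeholders.
import subprocess
from datetime import datetime, timezone

def hourly_object(bucket: str, prefix: str, now: datetime) -> str:
    """logrotate-style naming: one object per hour."""
    return f"gs://{bucket}/{prefix}-{now:%Y%m%d-%H}.log"

def stream_for_current_hour(bucket: str) -> subprocess.Popen:
    """Start one streaming transfer; the log producer writes to its stdin."""
    dest = hourly_object(bucket, "weblogs", datetime.now(timezone.utc))
    # The producer closes proc.stdin when the hour rolls over, waits for
    # gsutil to finish, then calls this again for the next hour's object.
    return subprocess.Popen(["gsutil", "cp", "-", dest], stdin=subprocess.PIPE)
```

Each finished hour leaves behind an immutable object like `gs://my-bucket/weblogs-20240102-13.log`, so no append to an existing object is ever needed.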