How to rotate files while doing a streaming transfer to Google Cloud Storage - google-cloud-storage

We are working on a POC where we want to stream our web logs to Google Cloud Storage. We learned that objects in Google Cloud Storage are immutable and cannot be appended to from the Java API. However, we can do streaming transfers using gsutil, according to this link: https://cloud.google.com/storage/docs/concepts-techniques?hl=en#streaming
Now we would like to write hourly files. Is there a way to change the file name every hour, like logrotate does?

gsutil doesn't offer any logrotate-style features for object naming.
With a gsutil streaming transfer, the resulting cloud object is named according to the destination object in your gsutil cp command. To achieve rotation, your job that produces the stream could close the stream on an hourly basis, select a new filename, and issue a new streaming gsutil cp command.
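For example, a minimal rotation wrapper could look like the sketch below. The log source, bucket, and object-naming scheme are assumptions; adapt them to however your job actually produces the stream.

#!/bin/bash
# Sketch: restart the streaming transfer every hour so each hour's
# logs land in a separate, timestamped object.
BUCKET="gs://my-log-bucket"    # placeholder bucket name

while true; do
  OBJECT="$BUCKET/weblogs/$(date -u +%Y/%m/%d/%H).log"
  # `timeout 1h` stops the producer after an hour; gsutil then sees
  # EOF on stdin and finalizes the current object, and the next loop
  # iteration opens a fresh streaming transfer under a new name.
  timeout 1h tail -n 0 -F /var/log/nginx/access.log | gsutil cp - "$OBJECT"
done

Note that lines written during the brief handoff between iterations could be missed; if that matters, have the application itself rotate and hand complete chunks to gsutil instead.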

Related

How to upload larger than 5Tb object to Google Cloud Storage?

I'm trying to save a PostgreSQL backup (~20 TB) to Google Cloud Storage for the long term, and I am currently piping the pg_dump command into a streaming transfer through gsutil.
pg_dump -d $DB_NAME -b --format=t \
| gsutil cp - gs://$BUCKET_NAME/$BACKUP_FILE
However, I am worried that the process will crash because of GCS's 5 TB object size limit.
Is there any way to upload objects larger than 5 TB to Google Cloud Storage?
EDIT: using split?
I am considering piping pg_dump into Linux's split utility and then into gsutil cp.
pg_dump -d $DB -b --format=t \
| split -b 50G - \
| gsutil cp - gs://$BUCKET/$BACKUP
Would something like that work?
You generally don't want to upload a single object in the multi-terabyte range with a streaming transfer. Streaming transfers have two major downsides, and they're both very bad news for you:
Streaming Transfers don't use Cloud Storage's checksum support. You'll get regular HTTP data integrity checking, but that's it, and for periodic 5 TB uploads, there's a nonzero chance that this could eventually end up in a corrupt backup.
Streaming Transfers can't be resumed if they fail. Assuming you're uploading at 100 Mbps around the clock, a 5 TB upload would take at least 4 and a half days, and if your HTTP connection failed, you'd need to start over from scratch.
Instead, here's what I would suggest:
First, minimize the file size. pg_dump has a number of options for reducing the file size; it's possible that something like "--format=c -Z9" might produce a much smaller file.
Second, if possible, store the dump as a file (or, preferably, a series of split-up files) before uploading. This is good because you'll be able to calculate their checksums, which gsutil can take advantage of, and you'd also be able to manually verify that the files uploaded correctly if you wanted. Of course, this may not be practical because you'll need a spare 5 TB of hard drive space, but unless your database won't be changing for a few days, there may not be an easy way to retry if you lose your connection partway through.
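For illustration, here is a rough sketch of that split-and-upload approach, assuming GNU split and enough local disk; the 50 GB part size, the backup_part_ prefix, and the variable names are placeholders:

# Dump with the custom format and maximum compression, and split the
# result into ~50 GB parts on local disk.
pg_dump -d "$DB_NAME" --format=c -Z9 | split --bytes=50G - backup_part_

# Upload the parts in parallel; because these are regular files (not a
# stream), gsutil can validate each upload against a checksum.
gsutil -m cp backup_part_* "gs://$BUCKET_NAME/$BACKUP_FILE/"

Restoring would then be a matter of downloading the parts, concatenating them in order with cat, and feeding the result to pg_restore.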
As mentioned by Ferregina Pelona, guillaume blaquiere, and John Hanley, there is no way to bypass the 5 TB limit implemented by Google, as stated in this document:
Cloud Storage 5TB object size limit
Cloud Storage supports a maximum single-object size up to 5 terabytes. If you have objects larger than 5TB, the object transfer fails for those objects for either Cloud Storage or Transfer for on-premises.
If the file surpasses the limit (5 TB), the transfer fails.
You can use Google's issue tracker to request this feature; within the link provided, you can check the features that have already been requested or request one that satisfies your needs.

Dataproc: Hot data on HDFS, cold data on Cloud Storage?

I am studying for the Professional Data Engineer exam, and I wonder what the "Google recommended best practice" is for hot data on Dataproc (given that cost is no concern)?
If cost is a concern, I have found the recommendation to keep all data in Cloud Storage, because it is cheaper.
Can a mechanism be set up, such that all data is on Cloud Storage and recent data is cached on HDFS automatically? Something like AWS does with FSx/Lustre and S3.
What to store in HDFS and what to store in GCS is a case-dependent question. Dataproc supports running Hadoop or Spark jobs directly on GCS through the Cloud Storage connector, which makes Cloud Storage HDFS-compatible without significant performance loss.
The Cloud Storage connector is installed by default on all Dataproc cluster nodes, and it is available in both Spark and PySpark environments.
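As a quick illustration of what that looks like in practice (the cluster, bucket, class, and jar names below are placeholders), standard Hadoop tooling and job submission simply take gs:// paths where they would otherwise take hdfs:// paths:

# List a Cloud Storage "directory" from a Dataproc node using ordinary
# Hadoop tooling; the connector resolves the gs:// scheme.
hadoop fs -ls gs://my-bucket/warehouse/

# Submit a Spark job whose input and output live in Cloud Storage.
gcloud dataproc jobs submit spark \
    --cluster=my-cluster --region=us-central1 \
    --class=com.example.MyJob \
    --jars=gs://my-bucket/jars/my-job.jar \
    -- gs://my-bucket/input/ gs://my-bucket/output/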
After researching a bit: the performance of HDFS and Cloud Storage (or any other blob store) is not completely equivalent. For instance, a "mv" operation in a blob store is emulated as a copy followed by a delete.
What the ASF can do is warn that our own BlobStore filesystems (currently s3:, s3n: and swift:) are not complete replacements for hdfs:, as operations such as rename() are only emulated through copying then deleting all operations, and so a directory rename is not atomic - a requirement of POSIX filesystems which some applications (MapReduce) currently depend on.
Source: https://cwiki.apache.org/confluence/display/HADOOP2/HCFS

Google Cloud Dataflow to Cloud Storage

The reference architecture above indicates the existence of a Cloud Storage sink from Cloud Dataflow; however, the Beam API, which seems to be the current default Dataflow API, has no Cloud Storage I/O connector listed.
Can anyone clarify whether one exists, and if not, what the alternative is for getting data from Dataflow into Cloud Storage?
Beam does support writing/reading from GCS. You simply use the TextIO classes.
https://beam.apache.org/documentation/sdks/javadoc/0.2.0-incubating/org/apache/beam/sdk/io/TextIO.html
To read a PCollection from one or more text files, use TextIO.Read. You can instantiate a transform using TextIO.Read.from(String) to specify the path of the file(s) to read from (e.g., a local filename or filename pattern if running locally, or a Google Cloud Storage filename or filename pattern of the form "gs://<bucket>/<filepath>").
You can use TextIO, AvroIO, or any other connector that reads from/writes to files to interact with GCS. Beam treats any file path that starts with "gs://" as a GCS path; it does this through the pluggable FileSystem [1] interface.
[1] https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/storage/GcsFileSystem.java
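In either case, a gs:// path goes wherever a local path would. For example, the Beam WordCount quickstart can be pointed at Cloud Storage when run on the Dataflow runner; the project and bucket below are placeholders, and the command assumes a project generated from the Beam examples archetype:

# -Pdataflow-runner activates the Dataflow profile defined in the
# examples pom; inputFile and output are ordinary TextIO paths.
mvn compile exec:java -Dexec.mainClass=org.apache.beam.examples.WordCount \
    -Dexec.args="--runner=DataflowRunner --project=my-project \
    --gcpTempLocation=gs://my-bucket/tmp \
    --inputFile=gs://apache-beam-samples/shakespeare/* \
    --output=gs://my-bucket/output/counts" \
    -Pdataflow-runner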

Triggering a Dataflow job when new files are added to Cloud Storage

I'd like to trigger a Dataflow job when new files are added to a Storage bucket, in order to process them and add new data to a BigQuery table. I see that Cloud Functions can be triggered by changes in the bucket, but I haven't found a way to start a Dataflow job using the gcloud Node.js library.
Is there a way to do this using Cloud Functions, or is there an alternative way of achieving the desired result (inserting new data into BigQuery when files are added to a Storage bucket)?
This is supported in Apache Beam starting with 2.2. See Watching for new files matching a filepattern in Apache Beam.
Maybe this post will help with triggering Dataflow pipelines from App Engine or Cloud Functions:
https://cloud.google.com/blog/big-data/2016/04/scheduling-dataflow-pipelines-using-app-engine-cron-service-or-cloud-functions
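If you go the Cloud Functions route, the wiring is roughly: deploy a function with a bucket trigger, and have the function launch the pipeline (for example via the Dataflow templates API, as in the post above). A sketch of the deployment step; the function name, runtime, bucket, and entry point are placeholders:

# Deploy a function that fires whenever an object is finalized in the
# bucket; its entry point would call the Dataflow API (e.g. launch a
# template) to start the job that loads the new file into BigQuery.
gcloud functions deploy trigger-dataflow \
    --runtime=nodejs18 \
    --trigger-bucket=my-upload-bucket \
    --entry-point=onFileAdded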

Setting the Durable Reduced Availability (DRA) attribute for a bucket using Storage Console

When manually creating a new Cloud Storage bucket using the web-based storage console (https://console.developers.google.com/), is there a way to specify the DRA attribute? From the documentation, it appears that the only way to create buckets with that attribute is to use curl, gsutil, or some other script, but not the console.
There is currently no way to do this.
At present, the storage console provides only a subset of the Cloud Storage API, so you'll need to use one of the tools you mentioned to create a DRA bucket.
For completeness, it's pretty easy to do this using gsutil (documentation at https://developers.google.com/storage/docs/gsutil/commands/mb):
gsutil mb -c DRA gs://some-bucket