How to set file type when using TextIO.write to Google Cloud Storage - google-cloud-storage

I wrote a Dataflow pipeline that outputs a single small CSV file on Google Cloud Storage. The content type of that file is text/plain, but I want it to be application/csv.
This is the code I use:
TextIO.write()
.to("gs://bucket/path/to/filename").withoutSharding()
.withSuffix(".csv")
.withDelimiter(new char[]{'\r','\n'})
How do I specify the file type so that it will be application/csv after the pipeline completes?

TextIO always writes files with content type text/plain. This is configured here: https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextSink.java#L95
One option is to update the content type of the objects after they have been written to GCS. This can be done with the gsutil tool once the Dataflow pipeline that writes the files has finished. See here for more information.
https://cloud.google.com/storage/docs/gsutil/commands/setmeta
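As a rough sketch of that post-processing step, the Python helper below builds the gsutil setmeta command for a single object and can optionally run it. The function names are hypothetical, and the example object name (the path from the question plus the ".csv" suffix) is an assumption about what the pipeline produces; running the command also assumes gsutil is installed and authenticated.

```python
import subprocess


def build_setmeta_command(object_uri, content_type="application/csv"):
    """Return the gsutil argv that sets Content-Type on one GCS object."""
    if not object_uri.startswith("gs://"):
        raise ValueError("expected a gs:// URI, got %r" % object_uri)
    # -h passes a "Header:value" pair to set on the object's metadata.
    return ["gsutil", "setmeta", "-h", "Content-Type:" + content_type, object_uri]


def set_content_type(object_uri, content_type="application/csv"):
    """Run the command; assumes gsutil is installed and authenticated."""
    subprocess.run(build_setmeta_command(object_uri, content_type), check=True)


if __name__ == "__main__":
    # Hypothetical object name: the question's path plus the ".csv" suffix.
    print(" ".join(build_setmeta_command("gs://bucket/path/to/filename.csv")))
```

Calling set_content_type after the pipeline finishes would rewrite only the object's metadata, not its contents.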

Related

Azure data factory rest api x-ms-file-rename-source issues

I have a pipeline in Azure Data Factory that uses a web task to rename a file on a file share in one of our Azure storage accounts through the REST API.
The process almost works and creates a copy of the file with the new name, but the new file is empty. I’ve tried this with both xlsx and a standard txt file. These are the headers I’m using:
x-ms-date: <generating in ADF>
x-ms-version: 2021-08-06
x-ms-rename-source: <path to original file>
x-ms-type: file
x-ms-content-length: <?>
I put <?> for content length because I think this is the issue and I’m not sure what value I should use here. I tried not including the x-ms-content-length to preserve the file attributes but I get an error that the header is required. Any thoughts on why the file is empty/being resized?

Producing a CSV of Cloud Bucket files

What's the best way to create a CSV file listing images in a Google Cloud bucket to be imported into AutoML Vision?
If you want to track the files that are saved to a bucket, you can use a Google Cloud Function that listens for new files and creates the CSV file in another bucket.
For example, you can use this Python code as a starting point. It logs the details of a newly uploaded file:
def hello_gcs_generic(data, context):
    """Background Cloud Function to be triggered by Cloud Storage.
    This generic function logs relevant data when a file is changed.

    Args:
        data (dict): The Cloud Functions event payload.
        context (google.cloud.functions.Context): Metadata of triggering event.
    Returns:
        None; the output is written to Stackdriver Logging
    """
    print('Event ID: {}'.format(context.event_id))
    print('Event type: {}'.format(context.event_type))
    print('Bucket: {}'.format(data['bucket']))
    print('File: {}'.format(data['name']))
    print('Metageneration: {}'.format(data['metageneration']))
    print('Created: {}'.format(data['timeCreated']))
    print('Updated: {}'.format(data['updated']))
Basically, the function listens for the storage event "google.storage.object.finalize" (this happens when a file is uploaded).
To deploy this function to the cloud, you can use this command:
gcloud functions deploy hello_gcs_generic --runtime python37 --trigger-resource [your bucket name] --trigger-event google.storage.object.finalize
Alternatively, you can deploy the function from the GCP console (web UI) by:
selecting "Cloud Storage" in the trigger field
selecting "Finalize/create" as the event type
specifying your bucket
You can even process the files directly with AutoML from within a Cloud Function, as mentioned in this example.
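To go from logging to the CSV the question asks for, each event has to become a row. Here is a minimal sketch, assuming the import file takes one gs:// URI per row, optionally followed by a label (check the AutoML Vision documentation for the exact columns it expects). The names csv_row_for_event, rows_to_csv and IMAGE_EXTENSIONS are hypothetical, and uploading the resulting text to the second bucket is left out.

```python
import csv
import io

# Assumption: only these extensions count as images for the import file.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".bmp")


def csv_row_for_event(data, label=None):
    """Build one CSV row from a Cloud Storage event payload.

    Returns None when the object does not look like an image.
    """
    name = data["name"]
    if not name.lower().endswith(IMAGE_EXTENSIONS):
        return None
    uri = "gs://{}/{}".format(data["bucket"], name)
    return [uri] if label is None else [uri, label]


def rows_to_csv(rows):
    """Serialize the rows to CSV text with one image per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()
```

Inside hello_gcs_generic, the data argument could be passed straight to csv_row_for_event, since it already carries the bucket and name fields that the logging code reads.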

Directory Origin for Streamsets -- need only the filename to pass

I am trying to build a pipeline in StreamSets so that, when a file arrives in a directory, I invoke a REST API with just the file name. I don't want StreamSets to read the file or do any processing on it.
But whatever I try, it tries to send the whole file to the destination.
The file is in a special SEGD format, which is a kind of binary file.
It is trying to read the file and failing.
My requirement is to invoke a REST API as soon as a file comes to a folder.
As you've discovered, by default, StreamSets Data Collector's Directory origin will parse the contents of the file as JSON, delimited data etc. If you use the Whole File format, though, the origin will instead read only the file metadata, and pass a special record along the pipeline, with the following fields:
You can then use the HTTP Client processor or destination, referencing the filename with the expression ${record:value('/fileInfo/filename')}.

Is it possible to use "Custom Sources and Sinks" to write/append file during Dataflow pipeline execution?

My program relies on local system storage to write a file that the program itself generates, so I execute the job in "DirectPipelineRunner" mode. The flow is as follows:
One of my functions makes multiple REST API requests and creates/appends to a file (Output.txt) in local system storage.
Pipeline: a) Upload the generated file to GCS b) Read the file from GCS c) Perform transformation d) Write to BigQuery.
Since my program writes/appends the API responses to local system storage, I'm executing the pipeline in DirectPipelineRunner mode.
Is it possible to have temporary space in the cloud to remove the dependency on the local file system, so that I can execute the pipeline in DataflowPipelineRunner mode?
I guess Custom Sources and Sinks can be used here. Can someone shed some light on this problem?

Loading Amazon Redshift with a manifest, with an error in one file

When using the COPY command to load Amazon Redshift with a manifest, suppose one of the files contains an error.
Is there a way to just log the error for that file, but continue loading the other files?
The manifest file indicates whether a file is mandatory and whether an error should be generated if a file is not found. (Using a Manifest to Specify Data Files)
The COPY command will retry if it cannot read a file. (Errors When Reading Multiple Files)
The COPY command can specify a MAXERROR parameter that permits a certain number of errors before the COPY command fails. (MAXERROR)
When loading data from files, Amazon Redshift will report any errors in the STL_LOAD_ERRORS table. (STL_LOAD_ERRORS)
As noted above, the MAXERROR parameter should satisfy this requirement.
In addition, the NOLOAD parameter of COPY checks the validity of the data without loading it. Running with NOLOAD is much faster because it only parses the files.
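To make the pieces above concrete, here is a small sketch in Python that assembles a manifest, a COPY statement using MANIFEST and MAXERROR (optionally NOLOAD), and a query against STL_LOAD_ERRORS. The helper names, table name, bucket paths and IAM role are hypothetical placeholders, and the strings are built by plain formatting for illustration only, so inputs should be trusted values.

```python
import json


def build_manifest(urls, mandatory=True):
    """Return manifest JSON; 'mandatory' controls whether a missing file is an error."""
    entries = [{"url": url, "mandatory": mandatory} for url in urls]
    return json.dumps({"entries": entries}, indent=2)


def build_copy_sql(table, manifest_url, iam_role, maxerror=10, noload=False):
    """Return a COPY statement that reads a manifest and tolerates up to maxerror errors."""
    options = ["MANIFEST", "MAXERROR {}".format(int(maxerror))]
    if noload:
        # NOLOAD validates the files without loading any rows.
        options.append("NOLOAD")
    return "COPY {} FROM '{}' IAM_ROLE '{}' {};".format(
        table, manifest_url, iam_role, " ".join(options)
    )


# Query to inspect the most recent load errors after the COPY has run.
LOAD_ERRORS_SQL = (
    "SELECT filename, line_number, colname, err_reason "
    "FROM stl_load_errors ORDER BY starttime DESC LIMIT 20;"
)
```

A dry run with noload=True followed by the STL_LOAD_ERRORS query is one way to see which file carries the bad rows before running the real load.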