Is it possible to use "Custom Sources and Sinks" to write/append file during Dataflow pipeline execution? - google-cloud-storage

My program relies on local system storage for a file that the program itself generates, hence I am executing the job in "DirectPipelineRunner" mode. Below is the flow:
One of my functions makes multiple REST API requests and creates/appends to a file (Output.txt) in local system storage.
Pipeline: a) Upload the generated file to GCS b) Read the file from GCS c) Perform transformations d) Write to BigQuery.
Since my program writes/appends the API responses to local system storage, I'm executing the pipeline in DirectPipelineRunner mode.
Is it possible to have temporary space in the cloud to remove the dependency on the local file system, so that I can execute the pipeline in DataflowPipelineRunner mode?
I guess Custom Sources and Sinks can be used here. Can someone shed some light on this problem statement?
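For illustration, a minimal sketch of how the REST-fetching step could write straight to GCS with the google-cloud-storage Python client instead of local disk (the bucket name is a hypothetical placeholder, not from the question):

# Sketch: upload the collected API responses directly to GCS so the pipeline
# can read gs://my-bucket/Output.txt when running on the Dataflow service.
# "my-bucket" is a hypothetical placeholder.
from google.cloud import storage

def upload_responses(responses):
    client = storage.Client()
    blob = client.bucket("my-bucket").blob("Output.txt")
    # Join the REST API responses and upload them as a single object.
    blob.upload_from_string("\n".join(responses))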

Related

How to make Snakemake recognize Globus remote files using Globus CLI?

I am working in a high performance computing grid environment, where large-scale data transfers are done via Globus. I would like to use Snakemake to pull data from a Globus path, process the data, and then push the processed data to a different Globus path. Globus has a command-line interface.
Pulling the data is no problem, for I'd just create a rule that would run globus transfer to create the requisite local file. But for pushing the data back to Globus, I think I'll need a rule that can "see" that the file is missing at the remote location, and then work backwards to determine what needs to happen to create the file.
I could create local "proxy" files that represent the remote files. For example I could make a rule for creating 'processed_data_1234.tar.gz' output files in a directory. These files would just be created using touch (thus empty), and the same rule will run globus transfer to push the files remotely. But then there's the overhead of making sure that the proxy files don't get out of sync with the real Globus-hosted files.
Is there a more elegant way to do this akin to the Remote File capability? Is it difficult to add Globus CLI support to Snakemake? Thanks in advance for any advice!
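For concreteness, a rough sketch of the proxy-file push rule described above might look like this (the endpoint UUIDs and paths are hypothetical placeholders; the output is just a touched marker file):

# Sketch only: push a processed archive via the Globus CLI and record a local proxy file.
# <local-ep> / <remote-ep> are placeholder endpoint UUIDs; paths are placeholders too.
rule push_to_globus:
    input:
        "results/processed_data_{id}.tar.gz"
    output:
        # touch() creates an empty marker file once the transfer has been submitted
        touch("pushed/processed_data_{id}.tar.gz.pushed")
    shell:
        "globus transfer <local-ep>:$(pwd)/{input} <remote-ep>:/shared/processed_data_{wildcards.id}.tar.gz"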
Would it help to create a utility function that generates a list of all desired files and compares it against the list of files available on Globus? Something like this (pseudocode):
def return_needed_files(wildcards):
    list_needed_files = []  # either hard-coded or specified with some logic
    list_available = []  # as appropriate, e.g. using globus ls
    return [i for i in list_needed_files if i not in list_available]

# include all the needed files in the all rule
rule all:
    input: return_needed_files
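To fill in list_available, one option might be to shell out to the Globus CLI from the input function, roughly like this (a sketch; the endpoint UUID and remote path are hypothetical placeholders):

import subprocess

def globus_ls(endpoint, path):
    # Return the entries visible at endpoint:path according to `globus ls`.
    result = subprocess.run(
        ["globus", "ls", f"{endpoint}:{path}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()

# e.g. inside return_needed_files():
#   list_available = globus_ls("YOUR-ENDPOINT-UUID", "/shared/processed/")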

Reading Json file from Azure datalake as a file using Json.load in Azure databricks /Synapse notebooks

I am trying to parse JSON data with multiple nested levels. My approach is to pass the file name and use open(file_name) to load the data. When I provide a datalake path, it throws an error that the file path was not found. I am able to read the data into dataframes, but how can I read a file from the data lake without converting it to a dataframe, i.e. open and read it as a plain file?
Current code approach on my local machine, which works:
import json

f = open("File_Name.Json")
data = json.load(f)
Failing scenario when providing the datalake path:
f = open("Datalake path/File_Name.Json")
data = json.load(f)
You need to mount the data lake folder to a location in DBFS (in Databricks), although mounting is a security risk: anyone with access to the Databricks resource will have access to all mounted locations.
Documentation on mounting to dbfs: https://docs.databricks.com/data/databricks-file-system.html#mount-object-storage-to-dbfs
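For reference, a mount call might look roughly like the following sketch; the container, storage account, tenant, secret scope, and mount point are placeholders, and the exact OAuth settings are described in the linked documentation:

# Sketch of mounting an ADLS Gen2 container to DBFS (placeholders throughout).
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<application-id>",
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope>", key="<key>"),
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

dbutils.fs.mount(
    source="abfss://<container>@<storage-account>.dfs.core.windows.net/",
    mount_point="/mnt/datalake",
    extra_configs=configs,
)

# Once mounted, the file is visible to plain Python through the /dbfs fuse path.
import json
with open("/dbfs/mnt/datalake/File_Name.Json") as f:
    data = json.load(f)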
The open function works only with local files; it does not understand cloud file paths out of the box. You can of course try to mount the cloud storage, but as mentioned by @ARCrow, it would be a security risk (unless you create a so-called passthrough mount that controls access at the cloud storage level).
But if you're able to read the file into a dataframe, it means that the cluster has all the necessary settings for accessing the cloud storage. In that case you can just use the dbutils.fs.cp command to copy the file from cloud storage to the local disk, and then open it with the open function. Something like this:
dbutils.fs.cp("Datalake path/File_Name.Json", "file:///tmp/File_Name.Json")
with open("/tmp/File_Name.Json", "r") as f:
    data = json.load(f)

How to use file name prefix in Data Factory when importing data into Azure data lake from SAP BW Open Hub?

I have a source of SAP BW Open Hub in data factory and a sink of Azure data lake gen2 and am using a copy activity to move the data.
I am attempting to transfer the data to the lake and split it into numerous files, with 200000 rows per file. I would also like to be able to prefix all of the filenames, e.g. 'cust_', so the files would be something along the lines of cust_1, cust_2, cust_3, etc.
This only seems to be an issue when using SAP BW Open Hub as a source (it works fine when using SQL Server as a source). Please see the warning message below. After checking with our internal SAP BW team, they assure me that the data is in a tabular format and no explicit partitioning is enabled, so there shouldn't be an issue.
When executing the copy activity, the files are transferred to the lake, but the file name prefix setting is ignored and the filenames are instead set automatically, as below (the name seems to be made up of the SAP BW Open Hub table name and the request ID):
Here is the source config:
All other properties on the other tabs are set to their defaults and have not been changed.
QUESTION: without using a data flow, is there any way to split the files when pulling from SAP BW Open Hub and also be able to dictate the filenames in the lake?
I tried to reproduce the issue, and it works fine with a workaround. Instead of splitting the data while copying from SAP BW to Azure Data Lake Storage, you can simply copy the entire data set (without partitioning) into an Azure SQL Database. Please follow Copy data from SAP Business Warehouse by using Azure Data Factory (make sure to use Azure SQL Database as the sink).
Now that the data is in your Azure SQL Database, you can simply use the copy activity to copy the data to Azure Data Lake Storage.
In the source configuration, keep “Partition option” set to None.
Source Config:
Sink config:
Output:

deploy zip to aws lambda automatically

I have zipped my source code using Python and moved the zip file to an S3 bucket. How can I automatically deploy this zip file to my already existing Lambda function?
Could you please give me an idea on this?
Thanks in advance.
First, install Serverless:
npm install -g serverless
Check this repo for examples; I am pointing to a simple Python Lambda function example: serverless examples.
You can reference your Lambda function from the files, create the necessary roles and invoke permissions, and declare your resources in serverless.yml.
To deploy the CloudFormation stack, simply run the command below from the directory containing the serverless.yml file:
serverless deploy
To delete the resources you deployed, run the following command from the serverless.yml file's directory:
serverless remove
This saves you a lot of time compared to creating your resources through the console.
You can also see examples for Node.js and other runtimes in that repo.
You can set up S3 to trigger a different Lambda function whenever code is uploaded to the S3 bucket, and configure that Lambda function to deploy the zip from S3 to your desired Lambda function.
If your use case is only to make changes and update the code from the bucket, you can use Serverless instead of paying for another Lambda function.
Serverless uses CloudFormation under the hood.
See this reference on how to set up an S3 trigger: create s3 trigger. Write your logic using the boto3 client in this triggered Lambda to update the code of the other Lambda.
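A minimal sketch of what that triggered Lambda's handler could look like with boto3 (the target function name is a hypothetical placeholder):

import boto3

lambda_client = boto3.client("lambda")

def handler(event, context):
    # The S3 put event carries the bucket and key of the uploaded zip.
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]
    # Point the existing function ("my-existing-function" is a placeholder)
    # at the newly uploaded deployment package.
    lambda_client.update_function_code(
        FunctionName="my-existing-function",
        S3Bucket=bucket,
        S3Key=key,
    )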

How to set file type when using TextIO.write to Google Cloud Storage

I wrote a Dataflow pipeline that outputs a single small CSV file on Google Cloud Storage. The content type of that file is text/plain, but I want it to be application/csv.
This is the code I use:
TextIO.write()
    .to("gs://bucket/path/to/filename").withoutSharding()
    .withSuffix(".csv")
    .withDelimiter(new char[]{'\r','\n'})
How do I specify the content type so that it will be application/csv after the pipeline completes?
TextIO always writes content type text/plain. This is configured here: https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/TextSink.java#L95
One option for you might be to update the content type of the objects already written to GCS. This can be done with the gsutil tool after your Dataflow pipeline that writes the files has finished. See here for more information:
https://cloud.google.com/storage/docs/gsutil/commands/setmeta
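For example, assuming the single output file from the code above ends up at gs://bucket/path/to/filename.csv, the command after the pipeline completes would look something like:
gsutil setmeta -h "Content-Type:application/csv" gs://bucket/path/to/filename.csv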