Writing to cloud storage as a side effect in cloud dataflow - google-cloud-storage

I have a cloud dataflow job that does a bunch of processing for an appengine app. At one stage in the pipeline, I do a group by a particular key, and for each record that matches that key I would like to write a file to Cloud Storage (using that key as part of the file name).
I don't know in advance how many of these records there will be. So this usage pattern doesn't fit the standard Cloud Dataflow data sink pattern (where the sharding of that output stage determines the number of output files, and I have no control over the output file names per shard).
I am considering writing to Cloud Storage directly as a side-effect in a ParDo function, but have the following queries:
Is writing to cloud storage as a side-effect allowed at all?
If I were writing from outside a Dataflow pipeline, it seems I should use the Java client for the Cloud Storage JSON API. But that involves authenticating via OAuth to do any work, which seems inappropriate for a job already running on GCE machines as part of a Dataflow pipeline. Will this work?
Any advice gratefully received.

Answering the first part of your question:
While nothing directly prevents you from performing side-effects (such as writing to Cloud Storage) in our pipeline code, usually it is a very bad idea. You should consider the fact that your code is not running in a single-threaded fashion on a single machine. You'd need to deal with several problems:
Multiple writers could be writing at the same time. You need to find a way to avoid conflicts between writers. Since Cloud Storage doesn't support appending to an object directly, you might want to use composite objects.
Workers can be aborted, e.g. in case of transient failures or problems with the infrastructure, which means that you need to be able to handle the interrupted/incomplete writes issue.
Workers can be restarted (after they were aborted). That would cause the side-effect code to run again. Thus you need to be able to handle duplicate entries in your output in one way or another.
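One common way to cope with retries and duplicates is to make each write idempotent: derive the object name from the grouping key alone and always write the whole object, so a rerun replaces the earlier attempt instead of adding to it. The following is a minimal Python sketch of that idea, not Dataflow code. `InMemoryBucket`, `object_name_for_key`, `write_group` and the `grouped` prefix are hypothetical names used only for illustration, and the in-memory bucket merely stands in for a real storage client.

```python
from typing import Dict, List
from urllib.parse import quote


def object_name_for_key(key: str, prefix: str = "grouped") -> str:
    """Build a deterministic object name that embeds the grouping key."""
    # Percent-encode the key so characters such as "/" cannot change the path.
    return f"{prefix}/{quote(key, safe='')}.txt"


class InMemoryBucket:
    """Hypothetical stand-in for a bucket; each upload replaces the whole object."""

    def __init__(self) -> None:
        self.objects: Dict[str, str] = {}

    def upload(self, name: str, data: str) -> None:
        self.objects[name] = data


def write_group(bucket: InMemoryBucket, key: str, records: List[str]) -> str:
    """Write all records for one key as a single object; safe to repeat."""
    name = object_name_for_key(key)
    # Sorting makes the content independent of the order records arrive in.
    bucket.upload(name, "\n".join(sorted(records)))
    return name


bucket = InMemoryBucket()
first = write_group(bucket, "user/42", ["b", "a"])
retry = write_group(bucket, "user/42", ["a", "b"])  # simulated worker retry
```

Because the name depends only on the key, the simulated retry lands on the same object. This does not by itself address partially written objects, which depend on how the real storage client finalizes uploads.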

Nothing in Dataflow prevents you from writing to a GCS file in your ParDo.
You can use GcpOptions.getCredential() to obtain a credential to use for authentication. This will use a suitable mechanism for obtaining a credential depending on how the job is running. For example, it will use a service account when the job is executing on the Dataflow service.


AWS Glue output to stream

I'm just starting to get familiar with AWS and its tools and have been researching Glue/DataBrew. I'm trying to understand whether it would fit a streaming use case I have in mind. I can see plenty of documentation about consuming streaming data into Glue, but I can't find anything about publishing streaming data from a Glue job.
What I would like to do is pick up a file from some source, rip it apart into component records using Glue, and then publish each individual record onto a stream (Kinesis, SNS, Kafka, etc.). Is this possible with Glue yet, or am I barking up the wrong tree here?
Is there a better more appropriate AWS solution for this type of use case?
pick up a file from some source
Use S3, and hook an AWS Lambda trigger to S3 upload events.
Write a Lambda that downloads the file's contents and parses it.
As you parse, you can send events to SNS, MSK, or Kinesis, or write to Athena, RDS, other S3 files, etc.
Sure, Glue might piece some of these together, but you don't "need" it for simple ETL workloads.
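As a rough sketch of the Lambda approach described above, the core logic can be kept independent of any AWS client by passing in the download and publish steps as callables. The function names, the one-record-per-line format, and the shape of the published message are assumptions made for illustration. The event layout (`Records[].s3.bucket.name` and `Records[].s3.object.key`, with URL-encoded keys) follows the S3 event notification format.

```python
import json
from typing import Callable, List, Tuple
from urllib.parse import unquote_plus


def s3_objects_from_event(event: dict) -> List[Tuple[str, str]]:
    """Return (bucket, key) pairs from an S3 event notification payload."""
    pairs = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        # Object keys arrive URL-encoded in S3 event notifications.
        pairs.append((s3["bucket"]["name"], unquote_plus(s3["object"]["key"])))
    return pairs


def split_records(body: str) -> List[str]:
    """Assumed file format: one record per non-empty line."""
    return [line.strip() for line in body.splitlines() if line.strip()]


def process_event(
    event: dict,
    fetch: Callable[[str, str], str],
    publish: Callable[[str], None],
) -> int:
    """Download each uploaded file, split it, and publish every record."""
    count = 0
    for bucket, key in s3_objects_from_event(event):
        for record in split_records(fetch(bucket, key)):
            message = {"source": f"s3://{bucket}/{key}", "record": record}
            publish(json.dumps(message))
            count += 1
    return count
```

In a real Lambda, `fetch` would typically wrap an S3 `get_object` call and `publish` would wrap a call such as Kinesis `put_record` or SNS `publish`. Injecting them keeps the parsing logic testable without AWS credentials.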

GCP Dataflow vs Cloud Functions to automate merging scraping output with a cloud-stored file into JSON format to insert in a DB

I have two sources:
A csv that will be uploaded to a cloud storage service, probably GCP Cloud Storage.
The output of a scraping process done with Python.
When a user updates 1) (the cloud-stored file), an event should be triggered to execute 2) (the scraping process), and then some transformation should take place to merge these two sources into one in JSON format. Finally, the content of this JSON file should be stored in a DB that is easy to access and low cost. The files the user will update are at most 5 MB, and the updates will take place once a week.
From what I've read, I can use GCP Cloud Functions to accomplish this whole process or I can use Dataflow too. I've even considered using both. I've also thought of using MongoDB to store the JSON objects of the two sources final merge.
Why should I use Cloud Functions, Dataflow or both? What are your thoughts on the DB? I'm open to different approaches. Thanks.
Regarding the use of Cloud Functions and Dataflow: in your case I would go for Cloud Functions, as you don't have a big volume of data. Dataflow is more complex and more expensive, and you will have to use Apache Beam. If you are comfortable with Python, and taking your scenario into consideration, I would choose Cloud Functions. Easy, convenient...
To trigger a Cloud Function when a Cloud Storage object is updated, you will have to configure a trigger. Pretty easy.
https://cloud.google.com/functions/docs/calling/storage
Regarding the DB: MongoDB is a good option, but if you want something quick and inexpensive, consider Datastore.
As a managed service, it will make your life easy with a lot of native integrations. It also has a very interesting free tier.
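For data of this size, the merge step inside the function can be plain Python. Below is a minimal sketch using only the standard library. It assumes the CSV has a column (here called `id`, a hypothetical name) that also keys the scraped results, and it nests the scraped fields under a `scraped` field. Both choices are illustrative and not taken from the question.

```python
import csv
import io
import json
from typing import Dict, List


def merge_sources(
    csv_text: str, scraped: Dict[str, dict], key_column: str = "id"
) -> List[dict]:
    """Join CSV rows with scraped data on key_column."""
    merged = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        doc = dict(row)
        # Scraped fields are nested so they cannot overwrite CSV columns.
        # Rows without a scraped match get an empty object.
        doc["scraped"] = scraped.get(row[key_column], {})
        merged.append(doc)
    return merged


def to_json(documents: List[dict]) -> str:
    """Serialize the merged documents for storage."""
    return json.dumps(documents, sort_keys=True)
```

Each merged dictionary maps naturally onto one document or entity in a document-oriented store, which fits either of the databases mentioned above.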

Why does Snowflake recommend creating an external stage rather than loading it directly from a bucket?

The Snowflake documentation on bulk loading from AWS S3 says:
You can load directly from the bucket, but Snowflake recommends creating an external stage that references the bucket and using the external stage instead.
So my first question is:
Why does Snowflake recommend creating an external stage rather than loading it directly from a bucket?
Is there a reason for this? If you have any documentation explaining why, please let me know. :)
And my second question is:
In the architecture diagram for Bulk Loading from a Local File System, there are arrows (➡) from the data files to the stage, but in the diagram for Bulk Loading from Amazon S3, there are no arrows from the data files to the external stage. What does the presence or absence of the arrows mean?
Bulk Loading from Amazon S3:
https://docs.snowflake.com/en/user-guide/data-load-s3.html
Bulk Loading from a Local File System:
https://docs.snowflake.com/en/user-guide/data-load-local-file-system.html
The stage holds all the permissions for the bucket, so a security role can deal with the AWS tokens and then grant other roles access to the stage for reads/writes. This separates the two tasks of loading data and securing data.
It also allows the stage's tokens to be changed or updated without impacting the code or users that rely on it. You can even switch to methods where (the name escapes me) a dynamic key exchange happens, so key rotation between S3/AWS is fully automatic. That is how we do it. In fact, we have many stages for different sources of data, and the data engineers who build the ETL code do not need to know about or handle the security aspects of business policies.

Difference between DataFlow and Pipelines

I do not understand the difference between dataflow and pipeline in Azure Data Factory.
I have read and see DataFlow can Transform Data without writing any line of code.
But I have made a pipeline and this is exactly the same thing.
Thanks
A Pipeline is an orchestrator and does not transform data. It manages a series of one or more activities, such as Copy Data or Execute Stored Procedure. Data Flow is one of these activity types and is very different from a Pipeline.
Data Flow performs row and column level transformations, such as parsing values, calculations, adding/renaming/deleting columns, even adding or removing rows. At runtime a Data Flow is executed in a Spark environment, not the Data Factory execution runtime.
A Pipeline can run without a Data Flow, but a Data Flow cannot run without a Pipeline.
Firstly, a Data Flow activity needs to be executed within a pipeline. So I suspect you are comparing the Copy activity and the Data Flow activity, as both are used to transfer data from source to sink.
I have read and see DataFlow can Transform Data without writing any line of code.
You can refer to the overview of Data Flow. Data Flow allows data engineers to develop graphical data transformation logic without writing code. All data transformation steps are built through visual interfaces.
I have made a pipeline and this is exactly the same thing.
The Copy activity can be used for data transfer. However, it has many limitations around column mapping. So, if you just need simple, pure data transfer, the Copy activity is enough. To meet more specific needs, the Data Flow activity offers many built-in transformations, for example Derived Column, Aggregate, Sort, etc.

Concurrent file processing in data flow activity Azure Data Factory

When using control flow, it's possible to use a GetMetadata activity to retrieve a list of files in a blob storage account and then pass that list to a for each activity where the Sequential flag is false to process all files concurrently (in parallel) up to the max batch size according to the activities defined in the for each loop.
However, when reading about data flows in the following article from Microsoft (https://learn.microsoft.com/en-us/azure/data-factory/concepts-data-flow-column-pattern), they indicate the following:
A mapping data flow will execute better when the Source transformation iterates over multiple files instead of looping via the For Each activity. We recommend using wildcards or file lists in your source transformation. The Data Flow process will execute faster by allowing the looping to occur inside the Spark cluster. For more information, see Wildcarding in Source Transformation.
For example, if you have a list of data files from July 2019 that you wish to process in a folder in Blob Storage, below is a wildcard you can use in your Source transformation.
DateFiles/*_201907*.txt
By using wildcarding, your pipeline will only contain one Data Flow activity. This will perform better than a Lookup against the Blob Store that then iterates across all matched files using a ForEach with an Execute Data Flow activity inside.
Based on this finding, I have configured a data flow task where the source is a blob directory of files and it processes all files in that directory with no control loops. I do not, however, see any options to process files concurrently within the data flow. I do, however, see an Optimize tab where you can set the partitioning option.
Is this option only for processing a single large file into multiple threads or does this control how many files it processes concurrently within the directory where the source is pointing?
Is the documentation assuming the ForEach control loop is set to "Sequential"? I can see why the claim would be true in that case, but I'm having a hard time believing it if the data flow itself processes one file at a time.
Inside data flow, each source transformation will read all of the files indicated in the folder or wildcard path and store those contents into data frames in memory for processing.
Setting the partitioning manually from the Optimize tab will instruct ADF the partitioning scheme you wish to use inside Spark.
To process each file individually, one at a time, use the control flow capabilities in the pipeline.
Iterate over each file you wish to process inside a ForEach activity with execution set to Sequential, and pass the file name into the data flow as a parameter.