Why does Snowflake recommend creating an external stage rather than loading directly from a bucket? - snowflake-schema

In the Snowflake documentation about bulk loading from AWS S3, it says:
You can load directly from the bucket, but Snowflake recommends creating an external stage that references the bucket and using the external stage instead.
So my first question is:
Why does Snowflake recommend creating an external stage rather than loading directly from a bucket?
Is there a reason for this? If you have any documentation explaining why, please let me know. :)
And my second question is:
In the architecture diagram for Bulk Loading from a Local File System, there are arrows (➡) from the data files to the stage, but in the diagram for Bulk Loading from Amazon S3, there are no arrows from the data files to the external stage. What is the difference between the diagrams with and without arrows?
Bulk Loading from Amazon S3:
https://docs.snowflake.com/en/user-guide/data-load-s3.html
Bulk Loading from a Local File System:
https://docs.snowflake.com/en/user-guide/data-load-local-file-system.html

The stage holds all the permissions for the bucket, so a security role can create the stage, deal with the AWS tokens, and then grant read/write access on the stage to other roles. This separates the two tasks of loading data and securing data.
It also allows the stage's tokens to be changed or updated without impacting the code or users that reference it, or even a switch to a storage integration, where the key exchange happens dynamically and key rotation between Snowflake and AWS is fully automatic. That is how we do it: we have many stages for different sources of data, and the security aspects of business policies do not need to be known or handled by the data engineers who build the ETL code.
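A minimal sketch of that split of responsibilities, using the Snowflake Python connector. All object names (the account, SECURITY_ADMIN_ROLE, ETL_ROLE, MY_S3_INTEGRATION, MY_S3_STAGE, the bucket path, and the target table) are hypothetical placeholders, and it assumes a storage integration has already been set up for the bucket:

    # Sketch only: the security role creates the stage once, referencing a
    # storage integration so no AWS keys ever appear in ETL code; ETL roles
    # are merely granted usage on the stage.
    import snowflake.connector

    conn = snowflake.connector.connect(
        account="my_account",        # hypothetical account identifier
        user="security_admin",       # hypothetical user
        password="...",
    )
    cur = conn.cursor()

    # Security/admin side: owns the credentials and the stage definition.
    cur.execute("USE ROLE SECURITY_ADMIN_ROLE")
    cur.execute("""
        CREATE STAGE IF NOT EXISTS my_db.my_schema.MY_S3_STAGE
          URL = 's3://my-bucket/load/'
          STORAGE_INTEGRATION = MY_S3_INTEGRATION
    """)
    cur.execute("GRANT USAGE ON STAGE my_db.my_schema.MY_S3_STAGE TO ROLE ETL_ROLE")

    # ETL side (normally a different user/session): loads from the stage name,
    # never from the raw bucket URL, and is unaffected by key rotation.
    cur.execute("USE ROLE ETL_ROLE")
    cur.execute("""
        COPY INTO my_db.my_schema.my_table
        FROM @my_db.my_schema.MY_S3_STAGE
        FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
    """)

Rotating keys or swapping credentials then only touches the stage or integration definition; the COPY statements and the roles that run them stay the same.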

Related

How to export dataset from PostgreSQL to CSV on AWS so that users can download it?

I have an API where users can query some time-series data. But now I want to make the entire data set available for users to download for their own uses. How would I go about doing something like this? I have RDS and an EC2 instance set up. What would my next steps be?
In this scenario, and without any other data or restrictions given, I would put an S3 bucket at the center of this process.
Create an S3 bucket to hold the database/dataset dump.
Dump the database/dataset to S3 (for example from Docker or Lambda).
Manually transform the dataset to CSV, or use a Lambda triggered on every dataset dump (pg_dump itself does not produce CSV, but psql's \copy or COPY ... TO ... WITH CSV does).
Host those datasets in a bucket accessible to your users and allow access to them as each case requires:
You can create a publicly available bucket and share its HTTP URL.
You can create a pre-signed URL to allow limited, time-boxed access to your dataset (see the sketch below).
S3 is proposed since it's cheap and there is a lot of readily available tooling to work with it.
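For the pre-signed-URL variant, the link generation is a one-liner with boto3; the bucket name, object key, and expiry below are hypothetical examples:

    # Generate a time-limited download link for one exported CSV;
    # the bucket itself can stay private.
    import boto3

    s3 = boto3.client("s3")
    url = s3.generate_presigned_url(
        ClientMethod="get_object",
        Params={"Bucket": "my-dataset-exports", "Key": "exports/timeseries.csv"},
        ExpiresIn=3600,  # the link is valid for one hour
    )
    print(url)  # hand this URL to the user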

GCP Dataflow vs Cloud Functions to automate scraping output and file-on-cloud merge into JSON format to insert in DB

I have two sources:
A CSV that will be uploaded to a cloud storage service, probably GCP Cloud Storage.
The output of a scraping process done with Python.
When a user updates 1) (the cloud-stored file), an event should be triggered to execute 2) (the scraping process), and then some transformation should take place to merge these two sources into one in JSON format. Finally, the content of this JSON file should be stored in a DB that is easy to access and cheap to run. The files the user will update are at most 5 MB and the updates will take place once a week.
From what I've read, I can use GCP Cloud Functions to accomplish this whole process or I can use Dataflow too. I've even considered using both. I've also thought of using MongoDB to store the JSON objects of the two sources final merge.
Why should I use Cloud Functions, Dataflow or both? What are your thoughts on the DB? I'm open to different approaches. Thanks.
Regarding the use of Cloud Functions and Dataflow: in your case I would go for Cloud Functions, as you don't have a big volume of data. Dataflow is more complex, more expensive, and you would have to use Apache Beam. If you are comfortable with Python, and taking your scenario into consideration, I would choose Cloud Functions. Easy and convenient.
To trigger a Cloud Function when a Cloud Storage object is updated, you just have to configure a storage trigger. Pretty easy; a minimal sketch follows the link below.
https://cloud.google.com/functions/docs/calling/storage
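A minimal sketch of such a function, assuming a 1st-gen background Cloud Function on the Python runtime triggered by google.storage.object.finalize; the entry point name, bucket, and the scraper/merge stubs are hypothetical placeholders for the steps described in the question:

    import json
    from google.cloud import storage


    def run_scraper():
        # Placeholder for the Python scraping process from the question.
        return [{"source": "scraper", "value": "example"}]


    def merge_to_json(csv_text, scraped_rows):
        # Placeholder merge: wrap both sources in a single JSON document.
        return json.dumps({"csv": csv_text, "scraped": scraped_rows})


    def on_csv_upload(event, context):
        # Entry point: runs once per object created/updated in the bucket.
        client = storage.Client()
        blob = client.bucket(event["bucket"]).blob(event["name"])
        csv_text = blob.download_as_text()  # files are ~5 MB at most, in-memory is fine

        merged = merge_to_json(csv_text, run_scraper())
        # Next step (not shown): write `merged` to Datastore/Firestore or MongoDB.
        print(f"Merged document for {event['name']}: {len(merged)} bytes")

    # Deployed roughly with:
    #   gcloud functions deploy on_csv_upload --runtime python310 \
    #     --trigger-resource YOUR_BUCKET --trigger-event google.storage.object.finalize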
Regarding the DB: MongoDB is a good option, but if you want something quick and inexpensive, consider Datastore.
As a managed service it will make your life easy, with a lot of native integrations. It also has a very interesting free tier.

What is the difference between polybase and bulk insert, copy methods in azure data factory and when to use them?

Can someone please elaborate on when to use PolyBase versus bulk insert in Azure Data Factory? What are the differences between these two copy methods?
The two options labeled “PolyBase” and the “COPY command” are only applicable to Azure Synapse Analytics (formerly Azure SQL Data Warehouse). They are both fast loading methods that involve staging the data in Azure Storage (if it is not already there) and then using a highly parallel load from storage into each compute node. Especially on large tables these options are preferred for their scalability, but they do come with some restrictions, documented in the Azure Synapse loading documentation.
In contrast, on Azure Synapse Analytics a bulk insert is a slower load method which loads data through the control node and is not as highly parallel or performant. It is an order of magnitude slower on large files. But it can be more forgiving in terms of data types and file formatting.
On other Azure SQL databases, bulk insert is the preferred and fast method.
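To make the difference concrete, here is a hedged sketch of the kind of statement the fast path comes down to on Azure Synapse Analytics: a COPY statement loading staged files straight from Azure Storage, issued here from Python via pyodbc. The server, database, table, and storage paths are hypothetical placeholders, and authentication options are omitted:

    import pyodbc

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=myworkspace.sql.azuresynapse.net;"
        "DATABASE=mydw;UID=loader;PWD=..."
    )

    # The load runs in parallel across the compute nodes,
    # not row by row through the control node.
    copy_sql = """
    COPY INTO dbo.Trips
    FROM 'https://mystorageaccount.blob.core.windows.net/staging/trips/'
    WITH (
        FILE_TYPE = 'CSV',
        FIELDTERMINATOR = ',',
        FIRSTROW = 2
    )
    """

    cur = conn.cursor()
    cur.execute(copy_sql)
    conn.commit()

In the copy activity itself you would normally just pick the PolyBase or COPY command option in the sink and let the service drive the load, but seeing the statement helps explain why these paths scale so much better than a row-by-row bulk insert.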

AWS glue: Deploy model in aws environment

In our AWS environment, we have two different types of SAGs (service account groups) for data storage. One SAG is for generic storage, and the other SAG is for secure data that will only hold PII or restricted data. We are planning to deploy Glue in this environment. In that case:
Would we have one metastore over both secure and non-secure data?
If we needed two metastores, how would this work with Databricks?
If we have one metastore, how do we handle the secure data?
Please help us with more details on this.
If you are using a single region with one AWS Account, there will be only one metastore for both secure and generic data, and you will have to handle access with fine grained access policies.
A better approach would be to either use 2 different regions in a single AWS Account, or two different AWS accounts, so that access is easily managed for two different metastores.
To integrate your metastore with Databricks for (1), you will have to create two Glue Catalog instance profiles with resource level access. One instance profile will have access to generic database and tables while the other will have access to the secure databases and tables.
To integrate your metastores with Databricks for (2), you will simply create two Glue Catalog instance profiles with access to the respective metastore.
It is recommended to go with the second option, as it will save you a lot of maintenance cost and human error in the longer run. See the Databricks documentation for more details on Glue Catalog and Databricks integration.
Edit:
Based on the discussion in the comments, if we have to access both datasets inside the same Databricks runtime, option 2 won't work. Option 1 can be used with two permission sets: the first only for generic data, and the second for both generic and secure data.
In AWS Glue, each AWS account has one persistent metadata store per region (called the Glue Data Catalog).
It contains database definitions, table definitions, job definitions, and other control information to manage your AWS Glue environment. You manage permissions to those objects using IAM (e.g., who can make GetTable or GetDatabase API calls against them).
In addition to the AWS Glue permissions, you would also need to configure permissions on the data itself (e.g., who can make GetObject API calls to the data stored in S3).
So, answering your questions: yes, you would have a single Data Catalog.
However, depending on your security requirements, you would be able to define resource-based and role-based permissions on both the metadata and the content; a sketch of the resource-level approach follows the link below.
You can find a detailed overview here - https://aws.amazon.com/blogs/big-data/restrict-access-to-your-aws-glue-data-catalog-with-resource-level-iam-permissions-and-resource-based-policies
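As a hedged sketch of the resource-level approach for a single Data Catalog, the policy below lets one role read only the "generic" databases and tables; the role name, region, account ID, and database naming convention are hypothetical placeholders, and a second, tighter policy (plus matching S3 object permissions) would cover the secure/PII side:

    import json
    import boto3

    # Read-only catalog access for databases/tables prefixed "generic_".
    GENERIC_CATALOG_POLICY = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "glue:GetDatabase", "glue:GetDatabases",
                    "glue:GetTable", "glue:GetTables", "glue:GetPartitions",
                ],
                "Resource": [
                    "arn:aws:glue:eu-west-1:111122223333:catalog",
                    "arn:aws:glue:eu-west-1:111122223333:database/generic_*",
                    "arn:aws:glue:eu-west-1:111122223333:table/generic_*/*",
                ],
            }
        ],
    }

    iam = boto3.client("iam")
    iam.put_role_policy(
        RoleName="generic-data-instance-profile-role",  # hypothetical role
        PolicyName="glue-catalog-generic-read",
        PolicyDocument=json.dumps(GENERIC_CATALOG_POLICY),
    )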

Writing to cloud storage as a side effect in cloud dataflow

I have a cloud dataflow job that does a bunch of processing for an appengine app. At one stage in the pipeline, I do a group by a particular key, and for each record that matches that key I would like to write a file to Cloud Storage (using that key as part of the file name).
I don't know in advance how many of these records there will be. So this usage pattern doesn't fit the standard Cloud Dataflow data sink pattern (where the sharding of that output stage determines the number of output files, and I have no control over the output file names per shard).
I am considering writing to Cloud Storage directly as a side-effect in a ParDo function, but have the following queries:
Is writing to cloud storage as a side-effect allowed at all?
If I were writing from outside a Dataflow pipeline, it seems I should use the Java client for the JSON Cloud Storage API. But that involves authenticating via OAuth to do any work, and that seems inappropriate for a job already running on GCE machines as part of a Dataflow pipeline: will this work?
Any advice gratefully received.
Answering the first part of your question:
While nothing directly prevents you from performing side effects (such as writing to Cloud Storage) in your pipeline code, usually it is a very bad idea. You should consider the fact that your code is not running in a single-threaded fashion on a single machine. You would need to deal with several problems:
Multiple writers could be writing at the same time. You need to find a way to avoid conflicts between writers. Since Cloud Storage doesn't support appending to an object directly, you might want to use the composite objects technique.
Workers can be aborted, e.g. in case of transient failures or problems with the infrastructure, which means that you need to be able to handle interrupted or incomplete writes.
Workers can be restarted (after they were aborted). That would cause the side-effect code to run again, so you need to be able to handle duplicate entries in your output in one way or another.
Nothing in Dataflow prevents you from writing to a GCS file in your ParDo.
You can use GcpOptions.getCredential() to obtain a credential to use for authentication. This will use a suitable mechanism for obtaining a credential depending on how the job is running. For example, it will use a service account when the job is executing on the Dataflow service.
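As a rough sketch of that pattern in the Beam Python SDK (the question and answer mention the Java client, but the idea is the same): on Dataflow workers the google-cloud-storage client picks up the worker service account's default credentials, so there is no manual OAuth handling, and a deterministic object name per key helps with the duplicate-writes caveat from the first answer. The bucket name and naming scheme are hypothetical placeholders:

    import apache_beam as beam


    class WriteGroupToGcs(beam.DoFn):
        def __init__(self, bucket_name):
            self.bucket_name = bucket_name

        def setup(self):
            # One client per worker; created lazily so it is not pickled.
            from google.cloud import storage
            self.client = storage.Client()

        def process(self, element):
            key, records = element  # element comes out of the GroupByKey
            # Deterministic name: re-running the same key overwrites the same object,
            # which keeps retried bundles idempotent.
            blob = self.client.bucket(self.bucket_name).blob(f"per-key-output/{key}.txt")
            blob.upload_from_string("\n".join(str(r) for r in records))
            yield key  # emit something so the write stays part of the pipeline graph

    # Usage after the GroupByKey:
    #   grouped | "WritePerKeyFiles" >> beam.ParDo(WriteGroupToGcs("my-output-bucket"))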