Snowflake ingestion: Snowpipe/Stream/Tasks or External Tables/Stream/Tasks - merge

For ingesting data from an external storage location into Snowflake when de-duping is necessary, I came across two ways:
Option 1:
Create a Snowpipe for the storage location (Azure container or S3 bucket) that is automatically triggered by event notifications (Azure Event Grid and storage queues, or AWS SQS) and copies data into a staging table in Snowflake
Create a Stream for this staging table to capture change data
Periodically run a task that consumes the Stream data and merges (upserts) data into the destination table based on the primary key
Option 2:
Create an external table with automatic refresh through event notifications (Azure Event Grid and storage queues, or AWS SQS)
Create a Stream for this external table to capture change data
Periodically run a task that consumes the Stream data and merges (upserts) data into the destination table based on the primary key
I believe that if the MERGE statement weren't necessary to enforce the primary key and remove duplicates, Snowpipe would be the clear winner, because it copies changed data directly into a table in one step. However, since staging and merging the data is necessary, which option is better?
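For reference, the Stream + Task + MERGE piece that both options share would look roughly like this (shown via the Python connector just to keep the example self-contained; all object, column and warehouse names are placeholders):

import snowflake.connector

# Placeholder credentials / context.
conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="***",
    warehouse="MY_WH", database="MY_DB", schema="PUBLIC",
)

conn.execute_string("""
-- Change capture on the staging table that Snowpipe loads
-- (for Option 2 this would be ON EXTERNAL TABLE ... INSERT_ONLY = TRUE).
CREATE OR REPLACE STREAM raw_stage_stream ON TABLE raw_stage;

-- Periodic upsert into the destination, keyed on the primary key.
CREATE OR REPLACE TASK merge_raw_stage
  WAREHOUSE = MY_WH
  SCHEDULE  = '5 MINUTE'
WHEN SYSTEM$STREAM_HAS_DATA('RAW_STAGE_STREAM')
AS
MERGE INTO destination d
USING raw_stage_stream s
  ON d.id = s.id
WHEN MATCHED THEN UPDATE SET d.payload = s.payload, d.loaded_at = s.loaded_at
WHEN NOT MATCHED THEN INSERT (id, payload, loaded_at)
  VALUES (s.id, s.payload, s.loaded_at);

ALTER TASK merge_raw_stage RESUME;
""")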
Thank you!

Related

Partition data by multiple partition keys - Azure ADF

I have some data in an on-prem SQL table. The data is huge, ~100 GB. It has many columns, but the two important ones are d_type and d_date.
d_type's unique values are 1, 10 and 100, and d_date ranges from 2022-01-01 to 2022-03-30.
I want to load this data into Azure using a Copy activity or a Data Flow, but in a partitioned fashion, like the following layout:
someDir/d_type=1/
    2022-01/somedata.parquet
    2022-02/somedata.parquet
    2022-03/somedata.parquet
someDir/d_type=10/
    2022-01/somedata.parquet
    2022-02/somedata.parquet
    2022-03/somedata.parquet
someDir/d_type=100/
    2022-01/somedata.parquet
    2022-02/somedata.parquet
    2022-03/somedata.parquet
I have tried with copy activity:
Copy activity can only use one partition key
If I partition by d_type, it creates Parquet files with arbitrary bins, e.g. one file with bins 1-20 (which contains only data for d_type=1) while another file has bins 20-30 (which contains no data at all).
Data Flow allows multiple partition keys, but I cannot use it, since I'd have to copy the entire data from on-prem SQL to Azure first and then process it (a Data Flow can only use source linked services connected via the Azure IR, not a SHIR).
Anyone got tips on how to solve this?
We ended up using custom Python scripts, because the Copy activity doesn't support partitioning by multiple keys and we couldn't use a Data Flow for the business reasons explained in the question.
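As a rough illustration (not our exact script), such a custom Python script could read the source table in chunks and let pyarrow lay out the partition folders; the connection string, table name and output path below are made up:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import sqlalchemy

# Placeholder connection string for the on-prem SQL Server source.
engine = sqlalchemy.create_engine(
    "mssql+pyodbc://user:pass@onprem-host/mydb?driver=ODBC+Driver+17+for+SQL+Server"
)

# Stream the ~100 GB table in chunks instead of loading it all at once.
for chunk in pd.read_sql("SELECT * FROM dbo.big_table", engine, chunksize=500_000):
    chunk["d_month"] = pd.to_datetime(chunk["d_date"]).dt.strftime("%Y-%m")
    table = pa.Table.from_pandas(chunk, preserve_index=False)
    # Writes someDir/d_type=1/d_month=2022-01/<file>.parquet and so on
    # (Hive-style folders, so the month level is named d_month=2022-01
    # rather than plain 2022-01).
    pq.write_to_dataset(table, root_path="someDir",
                        partition_cols=["d_type", "d_month"])

The same write_to_dataset call can target ADLS/Blob directly through an fsspec filesystem (e.g. adlfs), or write to a local staging folder that is uploaded afterwards.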

AWS Glue with RDS SQL Server

I have created an AWS Glue job (a PySpark script) which pulls data from an S3 bucket and loads it into RDS (SQL Server). I have to perform a few pre-actions (deleting selective data) on the destination table before loading the data. For this I have used DataFrames, i.e. bring the entire destination table into a DataFrame first, perform the delete operation on it, and finally append (UNION) the source data from S3 and load the result into the RDS table.
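For context, that approach looks roughly like the following PySpark sketch (the JDBC URL, credentials, table names and the delete predicate are all illustrative):

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_ctx = GlueContext(SparkContext.getOrCreate())
spark = glue_ctx.spark_session

jdbc_url = "jdbc:sqlserver://my-rds-endpoint:1433;databaseName=mydb"
props = {"user": "admin", "password": "***",
         "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"}

# 1. Bring the whole destination table into a DataFrame (the expensive part).
dest_df = spark.read.jdbc(url=jdbc_url, table="dbo.target_table", properties=props)

# 2. "Delete" by keeping only the rows that should survive the pre-action.
kept_df = dest_df.filter("load_date <> '2024-01-01'")  # illustrative predicate

# 3. Union with the new data coming from S3.
src_df = spark.read.parquet("s3://my-bucket/incoming/")
final_df = kept_df.unionByName(src_df)

# 4. Overwrite the destination table with the combined result. Materialize the
# DataFrame first, because the overwrite truncates the very table it was read from.
final_df.cache().count()
final_df.write.jdbc(url=jdbc_url, table="dbo.target_table",
                    mode="overwrite", properties=props)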
This does not look like a feasible approach, as it has to load the entire destination table into memory first just to perform the pre-action. Also, two connections get established within the script (JDBC and the Glue context); what if one commit executes successfully but the other fails?
Can someone please suggest a better approach for performing these operations while maintaining proper TRANSACTION properties?

Azure Table Storage Sink in ADF Data Flow

Here is what my ADF pipeline looks like. In the Data Flow, I read some data from a source, perform a filter & join, and store the data to a sink. My plan was to use Azure Table Storage as the sink. However, according to https://github.com/MicrosoftDocs/azure-docs/issues/34981, ADF Data Flow does not support Azure Table Storage as a sink. Is there an alternative to using Azure Table Storage as the sink in a Data Flow?
No, it is not possible. Azure Table Storage cannot be the sink of a Data Flow.
Only these six dataset types are allowed as a sink (see the supported-sink list in the doc linked below).
And those are not the only limits: when used as the sink of a Data Flow, Azure Blob Storage and Azure Data Lake Storage Gen1 & Gen2 only support four formats: JSON, Avro, Text, and Parquet.
At least for now, your idea is not a viable solution.
For more information, have a look at this official doc:
https://learn.microsoft.com/en-us/azure/data-factory/data-flow-sink#supported-sink-connectors-in-mapping-data-flow
Even today it isn't possible. One option (we are currently solving a similar case) could be to use Blob Storage as a temporary destination.
The data flow stores its result in Blob Storage: the source data is processed by all the different transformations in the data flow and prepared for Table Storage, i.e. PartitionKey, RowKey and all the other columns are there.
A subsequent Copy Activity will move the data from Blob Storage into Table Storage easily.
The marked part of the pipeline is doing exactly this:
Full Orders runs the data flow
the "to Table Storage" Copy activity moves the data into Table Storage

How to get max of a given column from ADF Copy Data activity

I have a copy data activity with on-premises SQL Server as source and ADLS Gen2 as sink. There is a control table to pick up tableName, watermarkDateColumn and the watermarkDatetime to pull incremental data from the source database.
After data is pulled/loaded in sink, I want to get the max of the watermarkDateColumn in my dataset. Can it be obtained from #activity('copyActivity1').output?
I'm not allowed to use one extra lookup activity to query the source table for getting the max(watermarkDateColumn) in pipeline.
The Copy activity can only be used for data transmission, not for any aggregation. So #activity('copyActivity1').output won't help. Since you said you can't use a Lookup activity, I'm afraid your requirement can't be met so far.
If you prefer not to use additional activities, I suggest using a Data Flow activity instead, which is more flexible. There is a built-in aggregation feature in the Data Flow activity.

Cloud SQL: export data to CSV periodically avoiding duplicates

I want to export the data from Cloud SQL (Postgres) to a CSV file periodically (once a day, for example), and rows that have already been exported must not be exported again by the next export task.
I'm currently using a POST request to perform the export task via Cloud Scheduler. The problem here (or at least as far as I know) is that it isn't possible to export and delete (or update the rows to mark them as exported) in a single HTTP export request.
Is there any way to automatically delete (or update) the rows that have been exported, using some Cloud SQL parameter in the HTTP export request?
If not, I assume it should be done in a Cloud Function triggered by Pub/Sub (using Scheduler to publish once a day), but is there an optimal way to capture all the IDs of the rows retrieved by the SELECT statement (which will be used in the export) so they can be deleted (or updated) later?
You can export and delete (or update) at the same time using RETURNING.
\copy (DELETE FROM pgbench_accounts WHERE aid<1000 RETURNING *) to foo.txt
The problem would be in the face of crashes. How can you know that foo.txt has been written and flushed to disk before the DELETE is allowed to commit? Or the reverse: foo.txt is partially (or fully) written, but a crash prevents the DELETE from committing.
Can't you make the system idempotent, so that exporting the same row more than once doesn't create problems?
You could use a setup like the following to achieve what you are looking for:
1. Create a Cloud Function, subscribed to a Pub/Sub topic, that extracts the information from the database (a rough sketch follows after the links below).
2. Create a Pub/Sub topic to trigger that function.
3. Create a Cloud Scheduler job that invokes the Pub/Sub trigger.
4. Run the Cloud Scheduler job.
5. Then create a trigger which activates another Cloud Function to delete all the required data from the database once the CSV has been created.
Here I leave you some documents which could help you if you decide to follow this path.
Using Pub/Sub to trigger a Cloud Function: https://cloud.google.com/scheduler/docs/tut-pub-sub
Connecting to Cloud SQL from Cloud Functions: https://cloud.google.com/sql/docs/mysql/connect-functions
Cloud Storage tutorial: https://cloud.google.com/functions/docs/tutorials/storage
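To make step 1 a bit more concrete, a hypothetical sketch of a Pub/Sub-triggered Cloud Function that runs the export query and writes the CSV to a bucket could look like this (the instance connection name, credentials, table and bucket names are placeholders):

import csv
import io

import psycopg2
from google.cloud import storage


def export_to_csv(event, context):
    # Unix-socket connection, as in the "Connecting to Cloud SQL from Cloud
    # Functions" doc linked above.
    conn = psycopg2.connect(
        host="/cloudsql/my-project:europe-west1:my-instance",
        dbname="postgres", user="exporter", password="***",
    )
    with conn, conn.cursor() as cur:
        cur.execute("SELECT * FROM public.events WHERE exported = false")
        buf = io.StringIO()
        writer = csv.writer(buf)
        writer.writerow([col.name for col in cur.description])
        writer.writerows(cur.fetchall())
        # Step 5 could also happen here, e.g. an UPDATE ... SET exported = true
        # in the same transaction, instead of a separate Cloud Function.

    storage.Client().bucket("my-export-bucket").blob("export.csv") \
        .upload_from_string(buf.getvalue(), content_type="text/csv")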
Another method, aside from #jjanes's, would be to partition your database table by date. This would allow you to create an index on the date, making exporting or deleting a day's entries very easy. With this implementation, you could also create a cron job that deletes all tables (partitions) older than X days.
The documentation provided will walk you through setting up a ranged partition:
The table is partitioned into “ranges” defined by a key column or set of columns, with no overlap between the ranges of values assigned to different partitions. For example, one might partition by date ranges, or by ranges of identifiers for particular business objects.
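As a hypothetical illustration of that ranged-partition idea applied here, with one partition per month so a whole month can be exported and then dropped (table, column and connection details are made up):

import psycopg2

DDL = """
CREATE TABLE exported_rows (
    id             bigint,
    payload        text,
    insertion_date date NOT NULL
) PARTITION BY RANGE (insertion_date);

-- One partition per month; an export or cleanup job can then target (or simply
-- DROP) a single partition instead of deleting individual rows.
CREATE TABLE exported_rows_2022_01 PARTITION OF exported_rows
    FOR VALUES FROM ('2022-01-01') TO ('2022-02-01');
CREATE TABLE exported_rows_2022_02 PARTITION OF exported_rows
    FOR VALUES FROM ('2022-02-01') TO ('2022-03-01');
"""

with psycopg2.connect(host="localhost", dbname="postgres",
                      user="postgres", password="postgres") as conn:
    with conn.cursor() as cur:
        cur.execute(DDL)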
Thank you for all your answers. There are multiple ways of doing this, so I'm going to explain how I did it.
In the database I have included a column which contains the date when the data was inserted.
I used a Cloud Scheduler job with the following body:
{"exportContext":{"fileType": "CSV", "csvExportOptions" :{"selectQuery" : "select \"column1\", \"column2\",... , \"column n\" from public.\"tablename\" where \"Insertion_Date\" = CURRENT_DATE - 1" },"uri": "gs://bucket/filename.csv","databases": ["postgres"]}}
This scheduler is triggered once a day and exports only the previous day's data.
Also, note that in the query used in Cloud Scheduler you can choose which columns to export; this way you can avoid exporting the Insertion_Date column itself and use it only as an auxiliary column.
Finally, Cloud Scheduler will automatically create the CSV file in the bucket.