Partition data by multiple partition keys - Azure ADF - azure-data-factory

I have some data in an on-prem SQL table. The data is huge, ~100GB. The table has many columns, but the two important ones are d_type and d_date.
The unique values of d_type are 1, 10 and 100, and d_date ranges from 2022-01-01 to 2022-03-30.
I want to load this data into Azure using a Copy activity or Data Flow, but in a partitioned fashion, like the following layout:
someDir/d_type=1/
2022-01/somedata.parquet
2022-02/somedata.parquet
2022-03/somedata.parquet
someDir/d_type=10/
2022-01/somedata.parquet
2022-02/somedata.parquet
2022-03/somedata.parquet
someDir/d_type=100/
2022-01/somedata.parquet
2022-02/somedata.parquet
2022-03/somedata.parquet
I have tried the Copy activity:
Copy activity can only use one partition key.
If I partition by d_type, it creates parquet files with random bins, i.e. one file covers 1-20 (and contains only data for d_type=1), while another file could have bins 20-30 (which has no data at all).
Data Flow allows multiple partition keys, but I cannot use it, since I'd have to copy the entire data from on-prem SQL to Azure first and then process it (a Data Flow can only work with source linked services connected via an Azure IR, not a self-hosted IR).
Anyone got tips on how to solve this?

We ended up using custom Python scripts, because the Copy activity doesn't support partitioning by multiple keys and we couldn't use Data Flow for the business reasons explained in the question.
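In case it helps anyone, here is a rough sketch of what such a script could look like, assuming SQLAlchemy + pyodbc connectivity to the on-prem SQL Server and pandas/pyarrow for the Parquet output. The connection string, table name and chunk size are placeholders, and pyarrow writes key=value partition folders (d_month=2022-01) rather than the bare 2022-01 folders shown in the question.

```python
# Sketch: stream the on-prem table in chunks and write Hive-style
# partitioned Parquet (someDir/d_type=1/d_month=2022-01/chunk-0-0.parquet).
# Connection string, table name and chunk size are placeholders.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from sqlalchemy import create_engine

ENGINE = create_engine(
    "mssql+pyodbc://user:password@onprem-host/MyDb?driver=ODBC+Driver+17+for+SQL+Server"
)
OUTPUT_DIR = "someDir"   # local staging dir; upload to ADLS afterwards
QUERY = "SELECT * FROM dbo.MyBigTable"

for i, chunk in enumerate(pd.read_sql(QUERY, ENGINE, chunksize=500_000)):
    # Second partition key: year-month derived from d_date.
    chunk["d_month"] = pd.to_datetime(chunk["d_date"]).dt.strftime("%Y-%m")
    table = pa.Table.from_pandas(chunk, preserve_index=False)
    # basename_template keeps files from different chunks from colliding
    # (needs a reasonably recent pyarrow).
    pq.write_to_dataset(
        table,
        root_path=OUTPUT_DIR,
        partition_cols=["d_type", "d_month"],
        basename_template=f"chunk-{i}-{{i}}.parquet",
    )
    print(f"wrote chunk {i} ({len(chunk)} rows)")
```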

Related

Azure Data Factory Copy Multiple Dataset in One Pipeline

This is the current situation:
In Azure Data Factory I have more or less 59 different datasets. Each dataset comes from a different Data Lake container and folder. I want to copy these 59 datasets into different SQL tables in a single pipeline inside Data Factory. Is it possible to make a single pipeline that reads all 59 different datasets and copies them into the SQL tables, and how do you do that?
I want to avoid making 59 different pipelines, which would make maintaining Data Factory very difficult.
Thanks
Here is the sample procedure to copy multiple datasets in a single pipeline.
Create a config table that lists the datasets with their containers and folders.
Read the config table with a Lookup activity to get the details.
Get the config details from the output of the Lookup activity and iterate over the containers and folders with a ForEach activity using @activity('Lookup1').output.value.
Create a linked service to the ADLS account.
Create parameters for Containers, Folders and Files on the source dataset.
Provide values for those parameters from the ForEach to iterate over the containers, folders and files.
Create a TableName parameter on the sink dataset.
Add dynamic content for the TableName parameter with the expression @first(split(item().Files,'.')) to derive the table name from the file name.
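Purely for illustration (this is not ADF code), here is the shape of the config/Lookup output and the same file-name-to-table-name derivation that the expression above performs, written in plain Python. The column names Containers, Folders and Files are hypothetical and must match whatever your config table actually uses.

```python
# Illustration only: the shape of the Lookup output and the table-name
# derivation that the ForEach/dynamic-content expression performs in ADF.
# Column names (Containers, Folders, Files) are hypothetical.
lookup_output = [
    {"Containers": "raw",     "Folders": "sales/2022",   "Files": "orders.csv"},
    {"Containers": "raw",     "Folders": "sales/2022",   "Files": "customers.csv"},
    {"Containers": "archive", "Folders": "finance/2021", "Files": "invoices.csv"},
]

for item in lookup_output:                    # ForEach over @activity('Lookup1').output.value
    source_path = f"{item['Containers']}/{item['Folders']}/{item['Files']}"
    table_name = item["Files"].split(".")[0]  # @first(split(item().Files,'.'))
    print(f"copy {source_path} -> dbo.{table_name}")
```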

Migrating a huge Bigtable database in GCP from one account to another using DataFlow

I have a huge database stored in Bigtable in GCP. I am migrating the Bigtable data from one GCP account to another using Dataflow.
But when I created a job to export sequence files from Bigtable, it created 3000 sequence files in the destination bucket.
So it is not feasible to create a separate Dataflow job for each of the 3000 sequence files.
Is there any way to reduce the number of sequence files, or a way to provide all 3000 sequence files at once to a Dataflow job template in GCP?
We have two sequence files and wanted to upload the data sequentially one after another (10 rows and one column), but the actual uploaded result is 5 rows and 2 columns.
The sequence files should have some sort of pattern to their naming e.g. gs://mybucket/somefolder/output-1, gs://mybucket/somefolder/output-2, gs://mybucket/somefolder/output-3 etc.
When running the Cloud Storage SequenceFile to Bigtable Dataflow template, set the sourcePattern parameter to the prefix of that pattern, like gs://mybucket/somefolder/output-* or gs://mybucket/somefolder/*
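If you prefer scripting the launch, a hedged sketch using gcloud from Python is below. The template path and parameter names (bigtableProject, bigtableInstanceId, bigtableTableId, sourcePattern) are taken from my reading of the template docs and should be verified against the current template before use.

```python
# Sketch: launch the "Cloud Storage SequenceFile to Bigtable" Dataflow
# template once, pointing sourcePattern at all 3000 files via a wildcard.
# Template path and parameter names should be double-checked.
import subprocess

subprocess.run(
    [
        "gcloud", "dataflow", "jobs", "run", "import-sequencefiles",
        "--region", "us-central1",
        "--gcs-location", "gs://dataflow-templates/latest/GCS_SequenceFile_to_Cloud_Bigtable",
        "--parameters",
        ",".join([
            "bigtableProject=my-destination-project",
            "bigtableInstanceId=my-instance",
            "bigtableTableId=my-table",
            "sourcePattern=gs://mybucket/somefolder/output-*",
        ]),
    ],
    check=True,
)
```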

Azure Data Factory - Degree of copy parallelism

I'm running an Azure Data Factory that copies multiple tables from an on-prem SQL Server to an Azure Data Lake.
So I set up many Copy activities through the Data Factory designer to execute parallel copies (each activity carries out the extract of one table).
For better resource optimization, I would like to know if there is a way to copy multiple tables with one Copy activity.
I've heard of "degree of copy parallelism", but I don't know how to use it.
To use one Copy activity for multiple tables, you'd need to wrap a single parameterized Copy activity in a ForEach activity. The ForEach can scale to run multiple sources at one time by setting isSequential to false and setting the batchCount value to the number of threads you want. The default batch count is 20 and the max is 50. Copy Parallelism on a single Copy activity just uses more threads to concurrently copy partitions of data from the same data source.
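For illustration, this is roughly the shape of the ForEach fragment in the pipeline JSON (shown here as a Python dict); the item list, batch count and the inner Copy activity are placeholders, not a complete pipeline definition.

```python
# Rough shape of a ForEach wrapping one parameterized Copy activity,
# expressed as the pipeline-JSON fragment in Python form. Values are
# placeholders; the inner Copy activity definition is elided.
foreach_activity = {
    "name": "CopyAllTables",
    "type": "ForEach",
    "typeProperties": {
        "items": {"value": "@pipeline().parameters.tableList", "type": "Expression"},
        "isSequential": False,   # run iterations in parallel
        "batchCount": 10,        # number of parallel iterations (max 50)
        "activities": [
            {"name": "CopyOneTable", "type": "Copy"}  # parameterized Copy, details omitted
        ],
    },
}
```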

Cloud SQL: export data to CSV periodically avoiding duplicates

I want to export the data from Cloud SQL (Postgres) to a CSV file periodically (once a day, for example), and rows that have already been exported must not be exported again by the next export task.
I'm currently using a POST request triggered by Cloud Scheduler to perform the export task. The problem here (or at least as far as I know) is that it is not possible to export and delete (or update the rows to mark them as exported) in a single HTTP export request.
Is there any way to delete (or update) the exported rows automatically, with some Cloud SQL parameter in the HTTP export request?
If not, I assume it should be done in a Cloud Function triggered by Pub/Sub (using Scheduler to publish a message once a day), but is there an optimal way to collect the IDs of the rows retrieved by the select statement (which will be used in the export) so they can be deleted (or updated) later?
You can export and delete (or update) at the same time using RETURNING.
\copy (DELETE FROM pgbench_accounts WHERE aid<1000 RETURNING *) to foo.txt
The problem would be in the face of crashes. How can you know that foo.txt has been written and flushed to disk before the DELETE is allowed to commit? Or the reverse: foo.txt is partially (or fully) written, but a crash prevents the DELETE from committing.
Can't you make the system idempotent, so that exporting the same row more than once doesn't create problems?
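For what it's worth, a minimal sketch of the same DELETE ... RETURNING export done from Python with psycopg2 (table, column and connection details are placeholders; the crash-safety caveat above still applies, since the file write and the commit are not atomic with respect to each other):

```python
# Sketch: export and delete in one statement via COPY (DELETE ... RETURNING *)
# using psycopg2. Connection string and table/column names are placeholders.
import psycopg2

conn = psycopg2.connect("host=... dbname=postgres user=... password=...")
try:
    with conn, conn.cursor() as cur, open("foo.csv", "w") as f:
        cur.copy_expert(
            "COPY (DELETE FROM pgbench_accounts WHERE aid < 1000 RETURNING *) "
            "TO STDOUT WITH CSV HEADER",
            f,
        )
    # The file is closed before the `with conn` block commits the DELETE,
    # but OS-level durability of foo.csv is still not guaranteed (see above).
finally:
    conn.close()
```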
You could use a setup like the following to achieve what you are looking for:
1. Create a Cloud Function, subscribed to a Pub/Sub topic, that extracts the information from the database.
2. Create a Pub/Sub topic to trigger that function.
3. Create a Cloud Scheduler job that publishes to the Pub/Sub topic.
4. Run the Cloud Scheduler job.
5. Then create a trigger that activates another Cloud Function to delete all the required data from the database once the CSV has been created.
Here I leave you some documents which could help you if you decide to follow this path.
Using Pub/Sub to trigger a Cloud Function: https://cloud.google.com/scheduler/docs/tut-pub-sub
Connecting to Cloud SQL from Cloud Functions: https://cloud.google.com/sql/docs/mysql/connect-functions
Cloud Storage Tutorial: https://cloud.google.com/functions/docs/tutorials/storage
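If you follow this path, the export step of the first Cloud Function could look roughly like the sketch below, using the Cloud SQL Admin API through google-api-python-client. The project, instance, bucket and query are placeholders, and the entry point assumes a Pub/Sub-triggered background function.

```python
# Sketch of a Pub/Sub-triggered Cloud Function that kicks off a Cloud SQL
# CSV export via the Cloud SQL Admin API (google-api-python-client).
# Project, instance, bucket and query are placeholders.
from googleapiclient import discovery

PROJECT = "my-project"
INSTANCE = "my-postgres-instance"
BUCKET_URI = "gs://my-bucket/exports/export.csv"

def export_to_csv(event, context):
    """Background function entry point (Pub/Sub trigger)."""
    service = discovery.build("sqladmin", "v1beta4")
    body = {
        "exportContext": {
            "fileType": "CSV",
            "uri": BUCKET_URI,
            "databases": ["postgres"],
            "csvExportOptions": {
                "selectQuery": "SELECT * FROM public.mytable WHERE exported = false"
            },
        }
    }
    request = service.instances().export(project=PROJECT, instance=INSTANCE, body=body)
    response = request.execute()
    print(f"export operation started: {response.get('name')}")
```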
Another method, aside from @jjanes' suggestion, would be to partition your database by date. This would allow you to create an index on the date, making exporting or deleting a day's entries very easy. With this implementation, you could also create a cron job that drops all partitions older than X days.
The documentation provided will walk you through setting up a ranged partition:
The table is partitioned into “ranges” defined by a key column or set of columns, with no overlap between the ranges of values assigned to different partitions. For example, one might partition by date ranges, or by ranges of identifiers for particular business objects.
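As a concrete illustration of that approach (the table and column names here are made up), declarative range partitioning by month could be set up like this from Python:

```python
# Illustration of declarative range partitioning by month (made-up table
# and column names), executed with psycopg2. Each month gets its own
# partition, which makes exporting or dropping a month's data cheap.
import psycopg2

DDL = """
CREATE TABLE exports_log (
    id          bigint NOT NULL,
    payload     text,
    created_at  date   NOT NULL
) PARTITION BY RANGE (created_at);

CREATE TABLE exports_log_2022_01 PARTITION OF exports_log
    FOR VALUES FROM ('2022-01-01') TO ('2022-02-01');

CREATE TABLE exports_log_2022_02 PARTITION OF exports_log
    FOR VALUES FROM ('2022-02-01') TO ('2022-03-01');
"""

with psycopg2.connect("host=... dbname=postgres user=... password=...") as conn:
    with conn.cursor() as cur:
        cur.execute(DDL)  # psycopg2 accepts multiple statements in one execute
```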
Thank you for all your answers. There are multiple ways of doing this, so I'm going to explain how I did it.
In the database I have included a column which contains the date when the data was inserted.
I used a Cloud Scheduler job with the following body:
{"exportContext":{"fileType": "CSV", "csvExportOptions" :{"selectQuery" : "select \"column1\", \"column2\",... , \"column n\" from public.\"tablename\" where \"Insertion_Date\" = CURRENT_DATE - 1" },"uri": "gs://bucket/filename.csv","databases": ["postgres"]}}
This scheduler will be triggered once a day and it will export only the data of the previous day.
Also, note that in the query used in Cloud Scheduler you can choose which columns to export; this way you can avoid exporting the Insertion_Date column and use it only as an auxiliary column.
Finally, Cloud Scheduler will automatically create the CSV file in the bucket.

AWS Glue, data filtering before loading into a frame, naming s3 objects

I have 3 questions, for the following context:
I'm trying to migrate my historical data from RDS PostgreSQL to S3. I have about a billion rows of data in my database.
Q1) Is there a way for me to tell an AWS Glue job which rows to load? For example, I want it to load data from a certain date onwards. There is no bookmarking feature for a PostgreSQL data source.
Q2) Once my data is processed, the Glue job automatically creates a name for the S3 output objects. I know I can specify the path in the DynamicFrame write, but can I specify the object name? If so, how? I cannot find an option for this.
Q3) I tried my Glue job on a sample table with 100 rows of data, and it automatically separated the output into 20 files with 5 rows in each of those files. How can I specify the batch size in a job?
Thanks in advance
This is a question I have also posted in AWS Glue forum as well, here is a link to that: https://forums.aws.amazon.com/thread.jspa?threadID=280743
Glue supports a pushdown predicates feature, but currently it works with partitioned data on S3 only. There is a feature request to support it for JDBC connections, though.
It's not possible to specify the names of the output files. However, it looks like there is an option to rename the files afterwards (note that renaming on S3 means copying a file from one location to another, so it's a costly and non-atomic operation).
You can't really control the size of the output files. There is an option to reduce the number of files using coalesce, though. Also, starting from Spark 2.2 it is possible to set a maximum number of records per file with the config spark.sql.files.maxRecordsPerFile.
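Putting the pieces together, a Glue job that filters at the DataFrame level (since the JDBC pushdown predicate isn't available), reduces the number of output files with coalesce, and caps records per file might look roughly like this; the catalog database/table, date column and S3 path are placeholders.

```python
# Sketch of a Glue (PySpark) job: filter rows after loading (no JDBC
# pushdown predicate), limit the number of output files with coalesce,
# and cap records per file via spark.sql.files.maxRecordsPerFile.
# Catalog database/table, column names and S3 path are placeholders.
from awsglue.context import GlueContext
from pyspark.context import SparkContext
from pyspark.sql import functions as F

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session

# Spark 2.2+: cap the number of records written per output file.
spark.conf.set("spark.sql.files.maxRecordsPerFile", 1_000_000)

dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_catalog_db", table_name="my_postgres_table"
)

df = (
    dyf.toDF()
    .filter(F.col("created_at") >= "2021-01-01")  # "from a certain date onwards"
    .coalesce(10)                                  # at most 10 output files per write
)

# The part-0000x-<uuid> file names themselves still cannot be chosen here.
df.write.mode("append").parquet("s3://my-bucket/historical/")
```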