How to take data from 2 databases (with the same schema) and copy it into 1 database using Data Factory

I want to take data from 2 databases and copy (coalesce) it into 1 using Data Factory.
The issue is that multiple inputs don't seem to be allowed for copy activities.
So I resorted to having 2 different datasets which are exact copies but with different names, and then putting 2 different activities into the 1 pipeline, each using its own output dataset.
It just seems odd and wrong to do it this way.
Can I have some help?
This is what my diagram currently looks like:
Is there no way of just copying data from 2 separate databases (which have the same structure but different data) into the 1 database?

The short answer is yes. But you need to work within the constraints of how ADF handles this.
A couple of things to help...
You'll always need at least 2 activities to do this when using the copy activity. Microsoft of course charges per activity execution in ADF, so they aren't going to let you take a shortcut by having many inputs and outputs per single copy activity (a single charge).
The approach you show above is OK, and to pass ADF validation, as you've found, you simply need to create the output datasets separately and give them different names, even if they still refer to the same underlying target table. This is really only a problem for the copy activity. What you could do instead is first land the data into separate staging tables in the Azure target database, one copy activity per source (1:1), and then have a third downstream activity that executes a stored procedure to union the staging tables. In that case you could have 2 inputs and 1 output on the activity if you want that level of control in ADF.
Like this:
Final point: if you don't want the activities to execute in parallel, you could chain the datasets to enforce a fake dependency, or add a simple 'delay' to one of the copy operations. A delay on an activity would be simpler than provisioning a time slice offset.
Hope this helps

Related

Export a database to multiple files in the same job with Spring Batch

I need to export a database of around 180k objects to JSON files so I can retain the data structure in a certain way that suits me for a later import into another database. However, because of the amount of data, I want to separate and group the data based on some attribute value from the database records themselves. So all records that have attribute1=value1 should go to value1.json, those with value2 to value2.json, and so on.
However, I still haven't figured out how to do this kind of job. I am using RepositoryItemReader and JsonFileItemWriter.
I started by filtering the data on that attribute and running separate exports, just to verify that it works; however, I need a way to automate the whole process.
Can this be done?
There are several ways to do that. Here are a couple of options:
Option 1: parallel steps
You start by creating a tasklet that calculates the distinct values of the attribute you want to group items by, and you put this information in the job execution context.
After that, you create a flow with a chunk-oriented step for each value. Each chunk-oriented step would process a distinct value and generate an output file. The item reader and writer would be step-scoped beans, dynamically configured with the information from the job execution context.
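For example, that first tasklet could look roughly like this (a sketch that reads the distinct values with a plain JdbcTemplate; my_table and attribute1 are placeholders for your actual schema, and you could just as well go through your repository):

```java
import java.util.ArrayList;
import java.util.List;

import org.springframework.batch.core.step.tasklet.Tasklet;
import org.springframework.batch.repeat.RepeatStatus;
import org.springframework.jdbc.core.JdbcTemplate;

public class DistinctValuesTasklet {

    // Collects the distinct values of the grouping attribute and stores them
    // in the job execution context so downstream steps can be configured from it.
    // "my_table" and "attribute1" are placeholders for your actual schema.
    static Tasklet collectDistinctValues(JdbcTemplate jdbcTemplate) {
        return (contribution, chunkContext) -> {
            List<String> values = jdbcTemplate.queryForList(
                    "SELECT DISTINCT attribute1 FROM my_table", String.class);
            chunkContext.getStepContext().getStepExecution()
                    .getJobExecution().getExecutionContext()
                    .put("attributeValues", new ArrayList<>(values));
            return RepeatStatus.FINISHED;
        };
    }
}
```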
Option 2: partitioned step
Here, you would implement a Partitioner that creates a partition for each distinct value. Each worker step would then process a distinct value and generate an output file.
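For instance, a sketch of option 2 might look like the following. Names such as my_table, attribute1, MyRecord and the output path are placeholders, and the distinct values are fetched with a plain JdbcTemplate for brevity; the reader would be step-scoped in the same way and filter on #{stepExecutionContext['attributeValue']}:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.springframework.batch.core.configuration.annotation.StepScope;
import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;
import org.springframework.batch.item.json.JacksonJsonObjectMarshaller;
import org.springframework.batch.item.json.JsonFileItemWriter;
import org.springframework.batch.item.json.builder.JsonFileItemWriterBuilder;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.FileSystemResource;
import org.springframework.jdbc.core.JdbcTemplate;

@Configuration
public class ExportPartitioningConfig {

    // Placeholder for your exported entity type.
    public record MyRecord(String attribute1, String payload) {}

    // One partition per distinct attribute value; each partition carries the
    // value in its step execution context.
    @Bean
    public Partitioner attributePartitioner(JdbcTemplate jdbcTemplate) {
        return gridSize -> {
            Map<String, ExecutionContext> partitions = new HashMap<>();
            List<String> values = jdbcTemplate.queryForList(
                    "SELECT DISTINCT attribute1 FROM my_table", String.class);
            for (String value : values) {
                ExecutionContext context = new ExecutionContext();
                context.putString("attributeValue", value);
                partitions.put("partition-" + value, context);
            }
            return partitions;
        };
    }

    // Step-scoped writer: each worker step writes the records of "its" value
    // to <value>.json.
    @Bean
    @StepScope
    public JsonFileItemWriter<MyRecord> jsonWriter(
            @Value("#{stepExecutionContext['attributeValue']}") String value) {
        return new JsonFileItemWriterBuilder<MyRecord>()
                .name("jsonWriter")
                .resource(new FileSystemResource("output/" + value + ".json"))
                .jsonObjectMarshaller(new JacksonJsonObjectMarshaller<>())
                .build();
    }
}
```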
Both options should perform equally in your use-case. However, option 2 is easier to implement and configure in my opinion.

Azure Data Factory - run script on parquet files and output as parquet files

In Azure Data Factory I have a pipeline, created from the built-in copy data task, that copies data from 12 entities (campaign, lead, contact etc.) from Dynamics CRM (using a linked service) and outputs the contents as parquet files in a storage account. This runs every day, into a folder structure based on the date. The output structure in the container looks something like this:
Raw/CRM/2022/05/28/campaign.parquet
Raw/CRM/2022/05/28/lead.parquet
Raw/CRM/2022/05/29/campaign.parquet
Raw/CRM/2022/05/29/lead.parquet
That's just an example, but there is a folder structure for every year/month/day that the pipeline runs, and a parquet file for each of the 12 entities I'm retrieving.
This involved creating a pipeline, a dataset for the source and a dataset for the target. I modified the pipeline to add the pipeline's run date/time as a column in the parquet files, called RowStartDate (which I'll need in the next stage of processing).
My next step is to process the data into a staging area, which I'd like to output to a different folder in my container. My plan was to create 12 scripts (one for campaigns, one for leads, one for contact etc.) that essentially does the following:
accesses all of the correct files, using a wildcard path along the lines of Raw/CRM/*/*/*/campaign.parquet
selects the columns that I need
renames column headings
in some cases, just takes the most recent data (using the RowStartDate)
in some cases, creates a slowly changing dimension, ensuring every row has a RowEndDate
I made some progress figuring out how to do this in SQL, by running a query using OPENROWSET with wildcards in the path as per above - but I don't think I can use my SQL script in ADF to move/process the data into a separate folder in my container.
My question is, how can I do this (preferably in ADF pipelines):
for each of my 12 entities, access each occurrence in the container with some sort of Raw/CRM/*/*/*/campaign.parquet wildcard path
Process it as per the logic I've described above - a script of some sort
Output the contents back to a different folder in my container (each script would produce 1 output)
I've tried:
Using Azure Data Factory: when I tell it which dataset to use, I point it to the dataset I created in my original pipeline - but this dataset has all 12 entities in it, and the data flow activity produces the error "No value provided for Parameter 'cw_fileName'" - but I don't see any place when configuring the data flow to specify a parameter (it's not under source settings, source options, projection, optimize or inspect)
Using Azure Data Factory, I tried to add a script - but when trying to connect to my SQL script in Synapse, I don't know my service principal key for the Synapse workspace
Using a Databricks notebook, I tried to mount my container but got an error along the lines of "adding secret to Databricks scope doesn't work in Standard Tier", so I couldn't proceed
using Synapse, but as expected, it wants things in SQL whereas I'm trying to keep things in a container for now.
Could anybody point me in the right direction? What's the best approach that I should take? And if it's one that I've described above, how do I go about getting past the issue I've described?
Pass the data flow's dataset parameter values from the pipeline's data flow activity settings. The error means the dataset used by the data flow has a parameter (cw_fileName); its value is supplied on the Data Flow activity in the pipeline, not inside the data flow itself.

Parameter Variation: Tracking the Metadata

I am trying to use parameter variation in AnyLogic. My inputs are 3 parameters, each varying 5 times. My output is water demand. What I need from parameter variation is the way in which demand changes according to the different combinations of the three parameters. I imagine something like this: there are 10,950 rows (one for each day), the first column is time (in days), the second column holds the values for the first combination, the third column the second combination, and so on.

What would be the best way to track this output data so that I can then export it to Excel? I have added a "dataset" to my Main to track demand through each simulation, but I am not sure what to add to the parameter variation experiment interface to track the output across the different iterations. It would also be helpful to have a way to know which combination of inputs produced a given output (for example, have the combination be the name for each column). I see that there are Java actions, but I haven't been able to figure out the code to do what I need. I appreciate any help with this matter.
The easiest approach is just to track this in output database tables which are then exported to Excel at the end of your run. As long as these tables include outputs from multiple runs (and are, for example, only cleared at the start of the experiment not the run), your Parameter Variation experiment will end up with an Excel file having outcomes from all the runs. (You will probably need to turn off parallel execution in the PV experiment so you don't run into issues trying to write to the same Excel file in parallel.)
So, for example, you might have tables:
run_details with columns id, parm1, parm2 and parm3 (with proper column names given your actual parameters and some unique ID generated for each run)
output_demand with columns run_id, sim_time_hrs and demand_value (if, say, you're storing some demand value each hour of simulated time) where run_id cross-references the run's ID in run_details
(There is extra complexity in how you could allocate a unique run ID and how and when you write to/clear those tables, but I'm just presenting the core design. You can also get round the need-serial-execution point by programmatically controlling when you export to Excel, rather than using the built-in "Export tables at the end of model execution" capability, but that's also more complicated.)
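As a rough sketch, assuming you create the two tables above in the model's internal database (so AnyLogic generates its usual QueryDSL-style table references for them), and that runId, parm1..parm3 and demand are variables you maintain in your model, the writes could look something like this:

```java
// Once per run, e.g. in Main's "On startup" action: record which parameter
// combination this run is using (runId is whatever unique ID you generate).
insertInto(run_details)
    .columns(run_details.id, run_details.parm1, run_details.parm2, run_details.parm3)
    .values(runId, parm1, parm2, parm3)
    .execute();

// Then, from a cyclic event (e.g. firing once per simulated hour): append the
// current demand value, keyed by run ID and model time.
insertInto(output_demand)
    .columns(output_demand.run_id, output_demand.sim_time_hrs, output_demand.demand_value)
    .values(runId, time(), demand)
    .execute();
```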

Best practices for parameterizing load of multiple CSV files in Data Factory

I am experimenting with Azure Data Factory to replace some other data-load solutions we currently have, and I'm struggling with finding the best way to organize and parameterize the pipelines to provide the scalability we need.
Our typical pattern is that we build an integration for a particular Platform. This "integration" is essentially the mapping and transformation of fields from their data files (CSVs) into our Stage1 SQL database, and by the time the data lands in there, the data types should be set properly and the indexes set.
Within each Platform, we have Customers. Each Customer has their own set of data files that get processed in that Customer context -- within the scope of a Platform, all Customer files follow the same schema (or close to it), but they all get sent to us separately. If you looked at our incoming file store, it might look like (simplified, there are 20-30 source datasets per customer depending on platform):
Platform
  Customer A
    Employees.csv
    PayPeriods.csv
    etc.
  Customer B
    Employees.csv
    PayPeriods.csv
    etc.
Each customer lands in their own SQL schema. So after processing the above, I should have CustomerA.Employees and CustomerB.Employees tables. (This allows a little bit of schema drift between customers, which does happen on some platforms. We handle it later in our stage 2 ETL process.)
What I'm trying to figure out is:
What is the best way to set up ADF so I can effectively manage one set of mappings per platform, and automatically accommodate any new customers we add to that platform without having to change the pipeline/flow?
My current thinking is to have one pipeline per platform, and one dataflow per file per platform. The pipeline has a variable, "schemaname", which is set using the path of the file that triggered it (e.g. "CustomerA"). Then, depending on the file name, there is a branching conditional that fires the right dataflow: if it's "employees.csv" it runs one dataflow, and if it's "payperiods.csv" it runs a different one. They'd all use the same generic sink dataset, with the table name parameterized and those parameters set in the pipeline using the schema variable and the filename from the conditional branch.
Are there any pitfalls to setting it up this way? Am I thinking about this correctly?
This sounds solid. Just be aware that if you define column-specific mappings with expressions that expect those columns to be present, you may have data flow execution failures when those columns are not present in a customer's source files.
The way to protect against that in ADF data flows is to use column patterns. These allow you to define mappings that are generic and more flexible.

How to pass the result of my custom .NET activity to my stored procedure

I've created an ADF pipeline for my project, which consists of a custom activity and a stored procedure activity.
The thing I'm blocked on is this:
My custom activity obtains the most recently modified file - let's assume xx.txt is the file - from my Azure blob container.
My stored procedure has a single parameter, FileName. I want to pass it the file name obtained by the custom activity above.
(Put simply, my stored procedure activity's input depends on the custom activity's output.)
How can I do this in my ADF?
You will need to do this with two separate activities (maybe in two separate pipelines, if you want greater control) and using some middle ground storage that both processes can access.
For example:
1. Input dataset: blob.
2. Custom activity.
3. Output dataset: SQL DB.
4. Input dataset: SQL DB (same as 3).
5. Stored proc activity.
6. Output dataset: SQL DB.
It is the staging area dataset used for points 3 and 4 that is most important here.
This is how ADF will want to handle it, although I understand why you would want to pass the output of the custom activity straight to the stored procedure without using an additional dataset.
Hope this helps.