Currently I can pass one parameter to a U-SQL script in a Data Factory workflow, and with that parameter I can apply a pattern to generate file paths. Is there any way to pass a collection of datetime parameters to U-SQL and apply a pattern to generate the file paths?
You can pass multiple parameters. U-SQL also allows parameters of type SqlArray<>. I am not sure, though, whether ADF supports passing in such typed values; I think the PowerShell APIs do allow it.
I assume that passing the values as a file will not work, since you will not get compile-time partition elimination with it.
Pass a JSON parameter, then handle it with U-SQL.
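A minimal sketch of that idea, assuming the dates arrive as one delimited string parameter from the ADF activity (the parameter name, paths, and columns below are illustrative, not from the question):

```
// Hedged sketch: @dateList is a single string parameter supplied by the ADF activity,
// e.g. "2016-01-01,2016-01-02". All names and paths are illustrative.
DECLARE @dateList string = "2016-01-01,2016-01-02";

@log =
    EXTRACT userId string,
            amount int,
            date   DateTime          // virtual column bound by the file-set pattern
    FROM "/raw/{date:yyyy}/{date:MM}/{date:dd}/log.csv"
    USING Extractors.Csv();

@filtered =
    SELECT userId, amount, date
    FROM @log
    WHERE Array.IndexOf(@dateList.Split(','), date.ToString("yyyy-MM-dd")) >= 0;

OUTPUT @filtered
TO "/curated/filtered.csv"
USING Outputters.Csv();
```

Whether the compiler can still do partition elimination with this kind of filter is something to verify on your data; the SqlArray<> route mentioned above avoids the string splitting.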
I have a switch statement that looks at a variable value and, based on that, determines which data flow to execute. The problem with this is that I need to update the switch statement every time I add a new ID/data flow.
Is there an alternative design to this? What if my dataflows had the same name as the variable value - would it be possible to parameterize the name of the dataflow to execute?
e.g. variable value "1" executes the data flow named "1_dataflow", "2" executes "2_dataflow", and so on. How would I accomplish this?
Yes, you can parameterize the values in any activity in Azure Data Factory and make the pipeline dynamic instead of giving hard-coded values.
You can use parameters to pass external values into pipelines, datasets, linked services, and data flows. Once the parameter has been passed into the resource, it cannot be changed. By parameterizing resources, you can reuse them with different values each time.
You can also use a Set Variable activity to define a variable and assign it a value, which you can then use in the Switch activity and change later.
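As a small sketch (the variable name is an assumption), dynamic content can build the target data flow name from the value you already have, for example:

```
@concat(variables('flowId'), '_dataflow')
```

Note that not every activity property accepts dynamic content, so verify that the data flow reference on the Execute Data Flow activity can actually be parameterized in your factory before relying on this.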
Refer to: How to use parameters, expressions and functions in Azure Data Factory, and Set Variable Activity in Azure Data Factory and Azure Synapse Analytics.
I wanted to create a data curation framework using Data Flow that uses generic data flow pipelines.
I have multiple data feeds (raw tables) to validate (between 10 and 100) and write to the sink as curated tables:
For each raw data feed, I need to validate the expected schema (based on a parameterized file name).
For each raw data feed, I need to provide the Data Flow Script with validation logic (some columns should not be null, some columns should have specific data types and value ranges, etc.).
Using the Python SDK, create the Data Factory and Mapping Data Flow pipelines using the Data Flow Script prepared with the provided parameters (for schema validation).
Trigger the Python code that creates the pipelines for each feed, performs the validation, writes the issues to a Log Analytics workspace, and tears down the resources on specific schedules.
Has anyone done something like this? What is the best approach for the above please?
My overall goal is to reduce the time it takes to validate/curate the data feeds, so I want to prepare the validation logic quickly for each feed and create Python classes or PowerShell scripts scheduled to run them on generic data pipelines at specific times of the day.
Many thanks, CK
To validate the schema, you can keep a reference dataset that has the same schema (first row) as your main dataset. Then use a "Get Metadata" activity for each dataset to get its structure.
You can then use an "If Condition" activity to match the structures of both datasets using the equals logical function. Your equals expression will look something like this:
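A sketch of that expression, assuming the two Get Metadata activities are named Get Metadata1 and Get Metadata2 and both return the Structure field:

```
@equals(activity('Get Metadata1').output.structure, activity('Get Metadata2').output.structure)
```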
If both datasets' structures match, your next required activity (such as copying the dataset to another container) is performed.
The script that you want to run on your ingested dataset can be executed using a "Custom" activity. You again need to create the linked service and its corresponding dataset for the script that validates the raw data. Please refer to: https://learn.microsoft.com/en-us/azure/batch/tutorial-run-python-batch-azure-data-factory
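As a rough sketch of what such a validation script might look like (the column names, rule set, and file path handling below are assumptions for illustration, not part of the question):

```python
# Illustrative validation script for a single raw feed (all names are assumptions).
# It checks required non-null columns, expected dtypes, and a simple value range,
# printing the issues so the Custom activity's logs can pick them up.
import sys
import pandas as pd

REQUIRED_NOT_NULL = ["customer_id", "event_date"]          # assumed rules
EXPECTED_DTYPES = {"customer_id": "int64", "amount": "float64"}
VALUE_RANGES = {"amount": (0, 1_000_000)}

def validate(path):
    issues = []
    df = pd.read_csv(path)

    for col in REQUIRED_NOT_NULL:
        if col not in df.columns:
            issues.append(f"missing column: {col}")
        elif df[col].isnull().any():
            issues.append(f"nulls found in column: {col}")

    for col, dtype in EXPECTED_DTYPES.items():
        if col in df.columns and str(df[col].dtype) != dtype:
            issues.append(f"column {col} has dtype {df[col].dtype}, expected {dtype}")

    for col, (lo, hi) in VALUE_RANGES.items():
        if col in df.columns and not df[col].between(lo, hi).all():
            issues.append(f"column {col} has values outside [{lo}, {hi}]")

    return issues

if __name__ == "__main__":
    # The Custom activity passes the feed's file path as the first argument.
    for problem in validate(sys.argv[1]):
        print(problem)
```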
Scheduling the pipeline is taken care of by triggers in Azure Data Factory. A schedule trigger covers your requirement of automatically triggering the pipeline at a specific time.
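Since you plan to drive this from the Python SDK, here is a minimal sketch of starting a parameterized run with azure-mgmt-datafactory (the subscription, resource names, pipeline name, and parameter values are placeholders):

```python
# Minimal sketch using azure-mgmt-datafactory; all names/values are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

credential = DefaultAzureCredential()
adf_client = DataFactoryManagementClient(credential, "<subscription-id>")

# Kick off one run per feed, passing the feed-specific parameters.
run = adf_client.pipelines.create_run(
    resource_group_name="<resource-group>",
    factory_name="<data-factory-name>",
    pipeline_name="curation_pipeline",                 # assumed pipeline name
    parameters={"feedFileName": "raw_feed_01.csv"},    # per-feed parameters
)
print(run.run_id)
```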
What is the difference between linked task parameters (process parameters) and variables in a classic Azure DevOps build pipeline? Don't they both provide a single place to change values?
What I mean by "linked" task parameters is what you get by clicking the link icon when configuring a task, like below,
which leads to a textbox for the linked value being added to the settings page for the pipeline, as you see below.
Regarding parameters in the classic pipeline, we generally use process parameters. You can link all important arguments for tasks used across the build definition as process parameters, which are then shown in one place: the Pipeline view. This means you can quickly edit these arguments without needing to click through all the tasks. Templates come with a set of predefined process parameters.
Variables give you a convenient way to get key bits of data into various parts of the pipeline. The most common use of variables is to define a value that you can then use in your pipeline. All variables are stored as strings and are mutable. The value of a variable can change from run to run or job to job of your pipeline.
The difference between them is:
Variables can be a convenient way to collect information from the user up front. You can also use variables to pass data from step to step within a pipeline. Unlike variables, pipeline parameters can't be changed by a pipeline while it's running.
Parameters have data types such as number and string, and they can be restricted to a subset of values. Restricting the parameters is useful when a user-configurable part of the pipeline should take a value only from a constrained list. The setup ensures that the pipeline won't take arbitrary data.
Process parameters differ from variables in the kinds of input they support. Variables only take string inputs, while process parameters, in addition to string inputs, support additional input types such as check boxes and drop-down list boxes.
For detailed information, please refer to the following documents:
Define variables
Process parameters
Variables and parameters
I have a parameterized dataset that I used for a Copy Data activity, and it worked fine.
I am trying to replicate that using a mapping data flow, but I can't find where to input the value for the dataset parameter...
To clarify Joel's answer - you cannot assign parameter values to a Dataset from within the Data Flow settings. It is done from the Pipeline that executes the Data Flow. This means that you may get an error message if you attempt to 'Test connection' for a parameterised dataset.
Parameterized Data Sets work exactly the same way in a Data Flow activity. If a Data Set used in the Data Flow has parameters, they will have configuration points on the Settings tab of the Data Flow activity.
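For example (a hedged sketch; the pipeline-parameter name is an assumption), you supply the value there with dynamic content such as:

```
@pipeline().parameters.SourceFileName
```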
Our requirement is to read from a database, marshall the output into XML, and save to a file. Our prototype already does this.
The database SELECT takes a parameter which is a timestamp. Currently, this is stored in a properties file. After each run of the batch, the property file is updated with an incremented date. This is done in a tasklet that runs in a second step.
Is this the correct approach, or is there a better option to store parameters of a job?
You could use the org.springframework.batch.core.JobParametersIncrementer interface, overriding its only method, getNext, which allows you to modify the JobParameters object appropriately. You also need to reference it in the XML using the incrementer="..." attribute on the job tag. See section 4.6.4 of the official documentation: http://docs.spring.io/spring-batch/reference/html/configureJob.html
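A small sketch of that idea (the class and parameter names are illustrative): an incrementer that stamps each new run with the current timestamp, which your step can read instead of the properties file.

```java
// Illustrative incrementer; the parameter key "run.timestamp" is an assumption.
import java.util.Date;

import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.JobParametersIncrementer;

public class TimestampIncrementer implements JobParametersIncrementer {

    @Override
    public JobParameters getNext(JobParameters parameters) {
        // getNext may be called with null on the very first run.
        JobParameters existing = (parameters == null) ? new JobParameters() : parameters;
        // Carry over existing parameters and add/overwrite the timestamp.
        return new JobParametersBuilder(existing)
                .addDate("run.timestamp", new Date())
                .toJobParameters();
    }
}
```

Register it as a bean and point the job's incrementer="..." attribute at it, as described above.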
Bye, sigint76.