Increasing concurrency in Azure Data Factory

We have a parent pipeline that gets a list of tables and feeds it into a ForEach. Within the ForEach we call another pipeline, passing in some config; this child pipeline moves the data for the table it is passed as config.
When we run this at scale I often see 20 or so instances of the child pipeline created in the monitor. All but 4 will be "Queued"; the other 4 are executing as "In progress". I can't seem to find any setting for this limit of 4. We have several hundred pipelines to execute and I really could do with it doing more than 4 at a time. I have set concurrency to 20 throughout the pipelines and tasks, hence we get 20 instances fired up, but I can't figure out what I need to twiddle to get more than 4 executing at the same time.
many thanks

I think I have found it. On the child pipeline (the one being executed inside the ForEach loop), the General tab has a Concurrency setting. I had this set to 4. When I increased it to 8 I got 8 executing, and when I increased it to 20 I got 20 executing.
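For anyone looking for where that lives in the JSON: the pipeline-level concurrency property caps how many runs of that pipeline execute at once, and runs beyond it are queued, which matches the behaviour described above. A minimal sketch (pipeline and parameter names are made up):
{
    "name": "ChildTablePipeline",
    "properties": {
        "concurrency": 20,
        "parameters": {
            "tableName": { "type": "String" }
        },
        "activities": []
    }
}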

It seems a maximum of 20 loop iterations can be executed in parallel at once.
The documentation is, however, a bit unclear.
The batchCount setting that controls this has a maximum value of 50 and a default of 20, but the documentation for isSequential states the maximum is 20.
Under Limitations and workarounds, the documentation states:
"The ForEach activity has a maximum batchCount of 50 for parallel processing, and a maximum of 100,000 items."
https://learn.microsoft.com/en-us/azure/data-factory/control-flow-for-each-activity
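For reference, both knobs sit on the ForEach activity itself. A minimal sketch of the relevant JSON (activity and parameter names are made up):
{
    "name": "ForEachTable",
    "type": "ForEach",
    "typeProperties": {
        "isSequential": false,
        "batchCount": 50,
        "items": {
            "value": "@pipeline().parameters.tableList",
            "type": "Expression"
        },
        "activities": []
    }
}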

Related

ADF What happens if <Batch count> of <Foreach Activity> is too large?

What if I have 10 Copy activities and I run them inside nested ForEach activities (batch count = 50*50 = 2500)?
Does that mean the actual batch count is dynamic? From the Microsoft documentation:
"Batch count to be used for controlling the number of parallel execution (when isSequential is set to false). This is the upper concurrency limit, but the for-each activity will not always execute at this number."
What is suggested in the question is parallel execution.
Parallel execution
If isSequential is set to false, the activity iterates in parallel with a maximum of 50 concurrent iterations. This setting should be used with caution. If the concurrent iterations are writing to the same folder but to different files, this approach is fine. If the concurrent iterations are writing concurrently to the exact same file, this approach most likely causes an error.
As per the official documentation, there is a limitation of the ForEach activity in terms of maximum batchCount:
Limitation –
The ForEach activity has a maximum batchCount of 50 for parallel processing, and a maximum of 100,000 items.
Workaround –
Design a two-level pipeline where the outer pipeline with the ForEach activity iterates over an inner pipeline.
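A rough sketch of that two-level workaround, with an outer ForEach calling the inner pipeline through an Execute Pipeline activity (all names and parameters here are illustrative):
{
    "name": "ForEachOuterBatch",
    "type": "ForEach",
    "typeProperties": {
        "isSequential": false,
        "batchCount": 50,
        "items": {
            "value": "@pipeline().parameters.outerBatches",
            "type": "Expression"
        },
        "activities": [
            {
                "name": "RunInnerPipeline",
                "type": "ExecutePipeline",
                "typeProperties": {
                    "pipeline": {
                        "referenceName": "InnerForEachPipeline",
                        "type": "PipelineReference"
                    },
                    "waitOnCompletion": true,
                    "parameters": {
                        "innerItems": {
                            "value": "@item()",
                            "type": "Expression"
                        }
                    }
                }
            }
        ]
    }
}
The inner pipeline then runs its own ForEach over the items it receives, which is how you get past the single-ForEach limits.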

SelectOutput5 block to release agents in order once the batch size condition is satisfied

I need to put a condition or action in the SelectOutput5 block.
The model has a SelectOutput5 block, 5 Batch blocks, and 5 Delay blocks. Each Delay block has a delay of 105 hours. I need to control the movement of agents to fill each delay in sequence: if one delay becomes available, the SelectOutput5 block should release agents to that delay.
For example, the SelectOutput5 block controls the release of agents from each exit based on a condition. Condition one checks whether the batch capacity is filled. It should then start releasing agents from exit two to fill up batch 1; once batch 1's capacity is filled, the SelectOutput5 block should start releasing agents from exit 2 to fill batch 2's capacity, and so on.
Can I do the above using the SelectOutput5 block?
If I understand your question, you want to select the output based on which batches have available space. The problem is that the batches aren't really ever full, because as soon as one gets, say, 5 agents, it immediately makes a batch and passes that new batched agent on to the next process block. So really, you should be polling the queue in the delay block. For example, the exit condition for the first output (into batch) could be Curing_Drying.size() < Curing_Drying.capacity. This means there is capacity in that delay for more batched agents and you can continue sending agents down that line.
This also means that the batch line will be used more than, say, batch4, since that one will only be used whenever all the other Curing_Drying delays are full. And if that one fills up and there's no space anywhere else, you'll get an error saying "An agent was not able to leave the port...".
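Assuming the five Delay blocks are named curingDrying1 through curingDrying5 (hypothetical names for this sketch), the exit conditions on the SelectOutput5 block would follow the same pattern, one per output:
// Exit condition for output 1:
curingDrying1.size() < curingDrying1.capacity
// Exit condition for output 2:
curingDrying2.size() < curingDrying2.capacity
// ...and likewise for outputs 3, 4 and 5.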

Integration Runtime with TTL not helping with Cluster startup time

Hi, I have a pipeline with a ForEach loop, within which I have a Data Flow activity that runs on an integration runtime I have set up with a 10 minute time to live (TTL). When I triggered the pipeline with three files (i.e. the Data Flow activity within the ForEach would execute three times), I saw that the cluster startup time remained almost the same (4-6 minutes) for each data flow execution. I assumed the IR with a 10 minute TTL would reduce the cluster startup time substantially (at least for the second or third execution), but it doesn't seem that way.
Not sure if I am missing a setup/configuration on the pipeline or IR, or if this is intended behavior. Any insight would be appreciated.
When using a ForEach with a Data Flow activity in ADF, if you wish to take advantage of shortened cluster start-up times, you must set the ForEach to execute its iterations sequentially. Allowing the ForEach to execute in parallel will fire up new clusters for every iteration, even if you have a TTL set on the Azure IR.
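In the ForEach activity JSON that corresponds to the isSequential flag (fragment only, other properties omitted):
"typeProperties": {
    "isSequential": true
}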
I found the solution. Microsoft added a check box in the Integration Runtime creation process...

Stop running Azure Data Factory Pipeline when it is still running

I have an Azure Data Factory pipeline. My trigger has been set to run every 5 minutes.
Sometimes my pipeline takes more than 5 minutes to finish its jobs. In that case, the trigger runs again and creates another instance of my pipeline, and two instances of the same pipeline cause problems in my ETL.
How can I be sure that just one instance of my pipeline runs at a time?
As you can see, there are several instances of my pipeline running.
A few options I could think of:
OPT 1
Specify a 5 minute timeout on your pipeline activities:
https://learn.microsoft.com/en-us/azure/data-factory/concepts-pipelines-activities
https://learn.microsoft.com/en-us/azure/data-factory/concepts-pipelines-activities#activity-policy
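The timeout lives in each activity's policy block; a minimal sketch of a 5 minute timeout (the format is D.HH:MM:SS, and the other policy fields are shown with their usual defaults):
"policy": {
    "timeout": "0.00:05:00",
    "retry": 0,
    "retryIntervalInSeconds": 30,
    "secureOutput": false,
    "secureInput": false
}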
OPT 2
1) Create a one-row, one-column SQL RunStatus table: 1 will be our "completed" status, 0 "running".
2) At the end of your pipeline, add a Stored Procedure activity that sets the bit to 1.
3) At the start of your pipeline, add a Lookup activity to read that bit.
4) The output of this lookup is then used in an If Condition activity:
if 1 - start the pipeline's job, but before that add another Stored Procedure activity to set our status bit to 0.
if 0 - depending on the details of your project: do nothing, add a Wait activity, send an email, etc.
To make full use of this option, you can turn the table into a log, where a new row with start and end times is added after each successful run (before initiating a new run, you can check whether the previous run has an end time). Having this log might help you gather data on how long your pipeline takes to run, so you can either add more resources or increase the interval between runs.
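A very rough sketch of steps 3 and 4 as two entries in the pipeline's activities array (the dataset, linked service, procedure, and column names are all made up and depend on your project):
{
    "name": "CheckRunStatus",
    "type": "Lookup",
    "typeProperties": {
        "source": {
            "type": "AzureSqlSource",
            "sqlReaderQuery": "SELECT TOP 1 RunStatus FROM dbo.RunStatus"
        },
        "dataset": {
            "referenceName": "RunStatusDataset",
            "type": "DatasetReference"
        },
        "firstRowOnly": true
    }
},
{
    "name": "IfNotAlreadyRunning",
    "type": "IfCondition",
    "dependsOn": [
        {
            "activity": "CheckRunStatus",
            "dependencyConditions": [ "Succeeded" ]
        }
    ],
    "typeProperties": {
        "expression": {
            "value": "@equals(activity('CheckRunStatus').output.firstRow.RunStatus, 1)",
            "type": "Expression"
        },
        "ifTrueActivities": [
            {
                "name": "SetStatusToRunning",
                "type": "SqlServerStoredProcedure",
                "linkedServiceName": {
                    "referenceName": "AzureSqlLinkedService",
                    "type": "LinkedServiceReference"
                },
                "typeProperties": {
                    "storedProcedureName": "dbo.SetRunStatusRunning"
                }
            }
        ],
        "ifFalseActivities": []
    }
}
The real work of the pipeline would then follow inside ifTrueActivities (or after the If Condition), with the Stored Procedure activity from step 2 setting the bit back to 1 at the end.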
OPT 3
Monitor the pipeline run with SDKs (I have not tried this, so this is just to point you in a possible direction):
https://learn.microsoft.com/en-us/azure/data-factory/monitor-programmatically
Hopefully you can use at least one of these.
It sounds like you're trying to run a process more or less constantly, which is a good fit for tumbling window triggers. You can create a dependency such that the trigger is dependent on itself - so it won't run until the previous run has completed.
Start by creating a trigger that runs a pipeline on a tumbling window, then create a tumbling window trigger dependency. The section at the bottom of that article discusses "tumbling window self-dependency properties", which shows you what the code should look like once you've successfully set this up.
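Roughly, the trigger definition ends up looking like this: a 5 minute tumbling window that depends on its own previous window (names here are made up, and the exact properties are in the linked article):
{
    "name": "SelfDependentTrigger",
    "properties": {
        "type": "TumblingWindowTrigger",
        "runtimeState": "Started",
        "typeProperties": {
            "frequency": "Minute",
            "interval": 5,
            "startTime": "2021-01-01T00:00:00Z",
            "maxConcurrency": 1,
            "dependsOn": [
                {
                    "type": "SelfDependencyTumblingWindowTriggerReference",
                    "offset": "-00:05:00",
                    "size": "00:05:00"
                }
            ]
        },
        "pipeline": {
            "pipelineReference": {
                "referenceName": "MyEtlPipeline",
                "type": "PipelineReference"
            }
        }
    }
}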
Try changing the concurrency of the pipeline to 1.
Link: https://www.datastackpros.com/2020/05/prevent-azure-data-factory-from-running.html
My first thought is that the recurrence is too frequent under these circumstances. If the graph you shared is all for the same pipeline, then most runs take close to 5 minutes, but some take 30, 40, even 60 minutes. Situations like this are where a simple recurrence trigger probably isn't sufficient. What is supposed to happen while the 60 minute run is in progress? There will be 10-12 runs that won't start: do they still need to run, or can they be ignored?
To make sure all the pipelines run, and to manage concurrency, you're going to need to build a queue manager of some kind. ADF cannot handle this itself, so I have built such a system internally and rely on it extensively. I use a combination of Logic Apps, stored procedures (Azure SQL), and Azure Functions to queue, execute, and monitor pipeline executions. Here is a high-level breakdown of what you probably need:
Logic App 1: runs every 5 minutes and queues an ADF job in the SQL database.
Logic App 2: runs every 2-3 minutes and checks the queue to see whether a) there is no job currently running (status = 'InProgress') and b) there is a job in the queue waiting to run (I do this with a stored procedure). If this state is met, execute the next ADF pipeline and update its status to 'InProgress'.
I use an Azure Function to submit jobs instead of the built-in Logic App activity because I have better control over variable parameters. Also, the function can return the newly created ADF RunId, which I rely on in step 3.
Logic App 3: runs every minute and updates the status of any 'InProgress' jobs.
I use an Azure Function to check the status of the ADF pipeline based on RunId.

Using ForkManager and Perl properly?

My developer recently disappeared and I need to make a small change to my website.
Here's the code I'll be referring to:
my $process = scalar(@rows);                   # number of rows to work on
$process = 500 if $process > 500;              # cap the worker count at 500
my $pm = Parallel::ForkManager->new($process);
This is code from a Perl script that scrapes data from an API through a cron job. Every time the cron job runs, it opens a ton of processes for that file. For example, cron-job.pl will be running 100+ times.
The number of instances it opens is based on the amount of data that needs to be checked, so it's different every time, but it never seems to exceed 500. Is the code above what's causing this to happen?
I'm not familiar with ForkManager, but from the research I've done it looks like it runs the same file multiple times so that it can extract multiple streams of data from the API at the same time.
The problem is that the number of instances being run is significantly slowing down the server. To lower the number of instances, is it really as simple as changing 500 to a lower number, or am I missing something?
To lower the number of instances created, yes, just lower 500 (in both cases) to something else.
Parallel::ForkManager is a way of using fork (spawning new processes) to handle parallel processing. The parameter passed to new() specifies the maximum number of concurrent processes to create.
Your code simplifies to
my $pm = Parallel::ForkManager->new(500);
It means: limit the number of child processes to 500 at any given time.
If you have fewer than 500 jobs, only that many workers will be created.
If you have more than 500 jobs, the manager will start 500 jobs, wait for one to finish, then start the next job.
If you want fewer children executing at any given time, lower that number.
my $pm = Parallel::ForkManager->new(50);
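For completeness, a typical Parallel::ForkManager loop looks something like this. This is a sketch: get_rows_to_check() and process_row() are hypothetical stand-ins for however your script builds its work list and does the per-row API call.
use strict;
use warnings;
use Parallel::ForkManager;

my @rows = get_rows_to_check();             # hypothetical: build the list of work items
my $pm   = Parallel::ForkManager->new(50);  # at most 50 child processes at any one time

for my $row (@rows) {
    my $pid = $pm->start and next;          # parent: spawn a child, then move to the next row
    process_row($row);                      # child: hypothetical per-row work (the API call)
    $pm->finish;                            # child exits here
}

$pm->wait_all_children;                     # parent waits for every child to finish
Lowering the number passed to new() (500 in your script) only reduces how many of those children run at once; the total amount of work stays the same, it just takes longer.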