Can we limit the number of DAGs running at any time in Apache Airflow?

Can we limit the number of DAGs running at any time in Apache Airflow?
We have a limit on resources in the environment. Is there a configuration to limit the number of DAGs running in Airflow as a whole at a point in time?
The max_active_runs parameter only limits runs within a single DAG.
Is it possible that, if one DAG is running, all other scheduled DAGs wait for the first DAG to complete and then trigger sequentially?

By setting the parallelism configuration option in airflow.cfg, you can limit the total maximum number of tasks (not DAGs) allowed to run in parallel. Then, by setting the dag_concurrency configuration option, you can specify how many tasks a single DAG can run in parallel.
For example, setting parallelism=8 and dag_concurrency=1 gives you at most 8 DAGs running in parallel (with 1 running task each) at any time.
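If you want to verify what is actually in effect, here is a minimal sketch (assuming Airflow 1.10.x key names, where both options live under [core] in airflow.cfg):

    # Minimal sketch: read back the effective concurrency caps.
    # Key names assume Airflow 1.10.x ([core] section).
    from airflow.configuration import conf

    # Maximum task instances running at once across ALL DAGs.
    print(conf.getint("core", "parallelism"))       # e.g. 8

    # Maximum task instances running at once within a single DAG.
    print(conf.getint("core", "dag_concurrency"))   # e.g. 1

Combined with max_active_runs=1 on each DAG, this also keeps any one DAG down to a single active run at a time.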

Related

Cluster Resource Usage in Databricks

I was just wondering if anyone could explain if all compute resources in a Databricks cluster are shared or if the resources are tied to each worker. For example, if two users were connected to a cluster made up of 2 workers with 4 cores per worker and one user's job required 2 cores and the other's required 6 cores, would they be able to share the 8 total cores or would the full 4 cores from one worker be unavailable during the job that only required 2 cores?
TL;DR: Yes, the default behavior is to allow sharing, but you're going to have to tightly control the default parallelism with such a small cluster.
Take a look at Job Scheduling for Apache Spark. I'm assuming you are using an "all-purpose" / "interactive" cluster where users are working in notebooks, or you are submitting jobs to an existing all-purpose cluster, and it is NOT a job cluster with multiple Spark applications being deployed.
Databricks Runs in FAIR Scheduling Mode by Default
Under fair sharing, Spark assigns tasks between jobs in a “round robin” fashion, so that all jobs get a roughly equal share of cluster resources. This means that short jobs submitted while a long job is running can start receiving resources right away and still get good response times, without waiting for the long job to finish. This mode is best for multi-user settings.
By default, all queries started in a notebook run in the same fair scheduling pool
The Apache Spark scheduler in Azure Databricks automatically preempts tasks to enforce fair sharing.
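If you want some queries to run in their own pool rather than the shared default, a rough PySpark sketch (the pool name is arbitrary, and spark is assumed to be the session Databricks provides in the notebook):

    # Sketch: route queries issued from this notebook/thread to a dedicated
    # fair scheduler pool. "my_pool" is a hypothetical name; pools are created
    # on first use unless defined in an allocation file.
    spark.sparkContext.setLocalProperty("spark.scheduler.pool", "my_pool")

    # Any action run after setting the property is scheduled in that pool.
    spark.range(10_000_000).selectExpr("sum(id)").show()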
Apache Spark Defaults to FIFO
By default, Spark’s scheduler runs jobs in FIFO fashion. Each job is divided into “stages” (e.g. map and reduce phases), and the first job gets priority on all available resources while its stages have tasks to launch, then the second job gets priority, etc. If the jobs at the head of the queue don’t need to use the whole cluster, later jobs can start to run right away, but if the jobs at the head of the queue are large, then later jobs may be delayed significantly.
Keep in mind the word "job" is a Spark-specific term that represents an action being taken that launches one or more stages and tasks. See What is the concept of application, job, stage and task in spark?.
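To make the FIFO-vs-FAIR distinction concrete, here is a sketch of opting a plain Spark application into FAIR mode (Databricks already does this for you; the app name is made up):

    # Sketch: plain Apache Spark schedules jobs FIFO by default; FAIR mode is opt-in.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("fair-scheduling-demo")          # hypothetical app name
        .config("spark.scheduler.mode", "FAIR")   # default is FIFO
        .getOrCreate()
    )

    # One action == one Spark "job"; the job is split into stages, and each
    # stage into tasks that each occupy one core while they run.
    spark.range(1_000_000).count()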
So in your example you have...
2 Workers with 4 cores each == 8 cores == 8 tasks can be handled in parallel
One application (App A) that has a job that launches a stage with only 2 tasks.
One application (App B) that has a job that launches a stage with 6 tasks.
In this case, YES, you will be able to share the resources of the cluster. However, the devil is in the default behaviors. If you're reading from many files, performing a join, aggregating, etc., you're going to run into the fact that Spark partitions your data into chunks that can be acted on in parallel (see configuration like spark.default.parallelism).
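As a rough illustration of where those task counts come from (a sketch, assuming a spark session is already available):

    # Sketch: stage sizes are driven by partitioning, not by how small the
    # logical work looks.
    print(spark.sparkContext.defaultParallelism)   # typically equals the total cores

    df = spark.range(0, 10_000_000)
    print(df.rdd.getNumPartitions())               # tasks in the scan stage

    # Joins/aggregations shuffle into spark.sql.shuffle.partitions partitions
    # (200 by default), which is the usual source of a 200-task stage.
    print(spark.conf.get("spark.sql.shuffle.partitions"))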
So, in a more realistic example, you're going to have...
2 Workers with 4 cores each == 8 cores == 8 tasks can be handled in parallel
One application (App A) that has a job that launches a stage with 200 tasks.
One application (App B) that has a job that launches three stages with 8, 200, and 1 tasks respectively.
In a scenario like this, FIFO scheduling (the Apache Spark default) will result in one of these applications blocking the other, since the number of executor cores is completely overwhelmed by the number of tasks in just one stage.
In FAIR scheduling mode, there will still be some blocking since the number of executor cores is small, but some work will be done on each job since FAIR scheduling does a round-robin at the task level.
In Apache Spark, you have tighter control by creating different pools of resources and submitting work only to those pools, where it has "isolated" resources. The "better" way of doing this is with Databricks job clusters, which have isolated compute dedicated to the application being run.
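A sketch of that pool-based approach in plain Spark (the allocation file path and pool name below are hypothetical; the pools themselves are defined in that XML file):

    # Sketch: define named fair scheduler pools in an allocation file, then
    # submit work into a specific pool.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .config("spark.scheduler.mode", "FAIR")
        .config("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
        .getOrCreate()
    )

    # Everything run from this thread now lands in the "etl" pool.
    spark.sparkContext.setLocalProperty("spark.scheduler.pool", "etl")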

Does Dataproc have a resource allocation limit per job

Let's say I have a Dataproc cluster of 100 worker nodes with a certain spec.
When I submit a job to Dataproc, is there a usage allocation limit for each job,
e.g. job A cannot use more than 50% of all nodes?
Do we have this kind of limit, or can any job allocate all the resources of the cluster?
There is no such per-job limit on Dataproc. One job can use all of YARN's resources, and that's usually the default configuration for various job types on Dataproc. But users can set a per-job limit themselves, e.g., for Spark, by disabling dynamic allocation and setting the number of executors and the memory size of each executor.
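For example, a Spark job submitted to Dataproc could cap itself roughly like this (a sketch; the numbers are illustrative only, and the same properties can also be passed with --properties on gcloud dataproc jobs submit):

    # Sketch: cap one Spark job's footprint by disabling dynamic allocation
    # and fixing the executor count and size. Values are illustrative only.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("capped-job")                                # hypothetical name
        .config("spark.dynamicAllocation.enabled", "false")
        .config("spark.executor.instances", "50")             # e.g. ~50% of the nodes
        .config("spark.executor.cores", "4")
        .config("spark.executor.memory", "8g")
        .getOrCreate()
    )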

Total Datastage Jobs Greater Than My Max Jobs I Set

I have the below system policies defined in the InfoSphere DataStage Operations Console under "Workload Management (WLM)".
Sometimes, the total number of currently running jobs shoots up to 150 although I have defined the maximum running job count as 40 in WLM.
Whenever the currently running job count increases beyond 100, most of the DataStage jobs start showing increased startup times in the Director log and take a long time to run; if the job concurrency is less than 100, the same set of jobs run fine with startup times in seconds. Please suggest how to address this issue and how to enforce that the number of currently running jobs does not exceed, e.g., 100 at any point in time. Thanks a lot!
This is working as designed; generally, the WLM system is used to control the start of parallel and server jobs. It uses a set of user-defined queues, and when a job is started, it is submitted to a designated queue. In the figure above, the parallel jobs are in a queue named 'MediumPriorityJobs'.
Note that the sequence job is not in the queue, so it is not counted toward the total running workload controlled by the WLM Job Count system policy.
Source: https://www.ibm.com/support/pages/how-interpret-job-count-maximum-running-jobs-system-policy-ibm-infosphere-information-server-workload-management-wlm

Airflow Workers starving for queued tasks

I am using Airflow (1.10.3) with AWS RDS Postgres as the metastore and celery_backend, SQS as the queue service, and the CeleryExecutor. I have one master machine running the airflow webserver and scheduler services, and one worker machine.
The Airflow worker is always starving for more (queued) tasks, with a lot of unused resources (CPU and RAM usage is always below 20%). I've observed the worker pick up tasks in batches; for example, if there are 10 tasks in the queue and 2 running tasks, it will wait for the 2 running tasks to complete before picking up the next batch from the queue.
Parallelism settings in airflow.cfg on the worker instance:
parallelism = 32
dag_concurrency = 32
non_pooled_task_slot_count = 128
max_active_runs_per_dag = 32
max_threads = 2 (no issues in the scheduler though, as tasks are queued immediately)
One important thing to point out in my implementation: an Airflow task is not a single-process task; each individual task spawns multiple processes (3-5). Even after accounting for those process counts, my Airflow worker never reaches full parallelism.
Any suggestions on:
a) Is there a way to fully utilise parallel execution of tasks on an Airflow worker? Or is there some more info that I am missing in my Airflow setup?
b) Are the above-mentioned parallelism settings counted with the Airflow task as the atomic unit, or with the number of threads/processes that a task spawns?
Thanks!

Running Parallel Tasks in Batch

I have few questions about running tasks in parallel in Azure Batch. Per the official documentation, "Azure Batch allows you to set maximum tasks per node up to four times (4x) the number of node cores."
Is there any setup, other than specifying the max tasks per node when creating a pool, that needs to be done (in the code) to be able to run parallel tasks with Batch?
So if I am understanding this correctly, if I have a Standard_D1_v2 machine with 1 core, I can run up to 4 concurrent tasks in parallel on it. Is that right? If yes, I ran some tests and I am not quite sure about the behavior that I got. In a pool of D1_v2 machines set up to run 1 task per node, I get about 16 min for my job execution time. Then, using the same applications and same parameters, with the only change being a new pool with the same setup, also D1_v2, except running 4 tasks per node, I still get a job execution time of about 15 min. There wasn't any improvement in the job execution time from running tasks in parallel. What could be happening? What am I missing here?
I ran a test with a pool of D3_v2 machines with 4 cores, set up to run 2 tasks per core for a total of 8 tasks per node, and another test with a pool (same number of machines as the previous one) of D2_v2 machines with 2 cores, set up to run 2 tasks per core for a total of 4 parallel tasks per node. The run time/job execution time for both of these tests was the same. Isn't there supposed to be an improvement, considering that 8 tasks are running per node in the first test versus 4 tasks per node in the second test? If yes, what could be a reason why I'm not getting this improvement?
No, although you may want to look into the task scheduling policy (compute node fill type) to control how your tasks are distributed amongst the nodes in your pool.
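If you do want to experiment with the fill type, a pool definition might look roughly like this with the azure-batch Python SDK (a sketch: attribute names such as max_tasks_per_node follow the older SDK, newer versions call it task_slots_per_node, and the pool id, VM image, and sizes are illustrative):

    # Sketch: a pool allowing 4 concurrent tasks per node, spreading tasks
    # across nodes before packing them. All names/values are illustrative.
    import azure.batch.models as batchmodels

    pool = batchmodels.PoolAddParameter(
        id="parallel-pool",
        vm_size="STANDARD_D1_V2",
        target_dedicated_nodes=2,
        max_tasks_per_node=4,  # Batch allows up to 4x the core count
        task_scheduling_policy=batchmodels.TaskSchedulingPolicy(
            node_fill_type=batchmodels.ComputeNodeFillType.spread  # or .pack
        ),
        virtual_machine_configuration=batchmodels.VirtualMachineConfiguration(
            image_reference=batchmodels.ImageReference(
                publisher="canonical",
                offer="ubuntuserver",
                sku="18.04-lts",
                version="latest",
            ),
            node_agent_sku_id="batch.node.ubuntu 18.04",
        ),
    )
    # batch_client.pool.add(pool)  # batch_client: an authenticated BatchServiceClient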
How many tasks are in your job? Are your tasks compute-bound? If so, you won't see any improvement (perhaps even end-to-end performance degradation).
Batch merely schedules the tasks concurrently on the node. If the command/process that you're running utilizes all of the cores on the machine and is compute-bound, you won't see an improvement. You should double-check your tasks' start and end times within the job, and the node execution info, to see if they are actually being scheduled concurrently on the same node.