Airflow tasks stuck in queued state - Kubernetes

We're running Airflow 1.10.12, with KubernetesExecutor and KubernetesPodOperator.
In the past few days, we're seeing tasks getting stuck in the queued state for a long time (in fact, unless we restart the scheduler, they remain stuck in that state), while new tasks of the same DAG are getting scheduled properly.
The only thing that helps is either clearing the stuck task manually or restarting the scheduler service.
We usually see it happen when we run our E2E tests, which spawn ~20 DAG runs for every one of our 3 DAGs; due to limited parallelism, some will be queued (which is fine by us).
These are our parallelism params in airflow.cfg:
parallelism = 32
dag_concurrency = 16
max_active_runs_per_dag = 16
2 of our DAGs override max_active_runs and set it to 10, roughly as in the sketch below.
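A minimal sketch of that override (the DAG id and schedule are made up for illustration; only max_active_runs matters here):

# Hypothetical DAG showing the per-DAG override of max_active_runs
from datetime import datetime
from airflow import DAG

dag = DAG(
    dag_id="e2e_test_dag",           # illustrative name
    start_date=datetime(2020, 1, 1),
    schedule_interval=None,
    max_active_runs=10,              # overrides max_active_runs_per_dag = 16
)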
Any idea what could be causing it?

Related

Scheduling jobs fails with org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException

Thank you for reading this SO question. It may seem long, but I'll try to pack as much information as possible into it to help you get to the answer.
Summary
We are currently experiencing a scheduling issue with our Flink cluster.
The symptoms are that some, most, or all of our tasks (it depends; the symptoms are not always the same) are shown as SCHEDULED but fail after a timeout. The jobs are then shown as RUNNING.
The failing exception is the following one:
Caused by: java.util.concurrent.CompletionException: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Slot request bulk is not fulfillable! Could not allocate the required slot within slot request timeout
After analysis, we assume (we cannot prove it, as there are not many logs for that part of the code) that the failure is due to a deadlock/race condition that happens when several jobs are submitted to the Flink cluster at the same time, even though we have enough slots available in the cluster.
We actually hit the error with 52 available task slots and 12 jobs that are not scheduled.
Additional information
Flink version: 1.13.1 commit a7f3192
Flink cluster in session mode
2 job managers using k8s HA mode (resource requests: 2 CPUs, 4 GB RAM; a memory limit of 4 GB)
50 task managers with 2 slots each (resource requests: 2 CPUs, 2 GB RAM; no limits set).
Our Flink cluster is shut down every night and restarted every morning. The error seems to occur when a lot of jobs need to be scheduled. The jobs are configured to restore their state, and we see no issues for jobs that do get scheduled and run correctly; it really seems to be a scheduling issue.
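For context, our setup maps to flink-conf.yaml keys roughly as follows (a sketch, not our exact file; the timeout shown is the Flink default, not something we tuned):

# flink-conf.yaml (sketch)
taskmanager.numberOfTaskSlots: 2    # 50 TMs x 2 slots = ~100 slots total
slot.request.timeout: 300000        # ms; when this elapses, the
                                    # NoResourceAvailableException above is thrown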
Questions
Could it be that the issue described in FLINK-23409 is actually the same, but only occurs when there is a race condition while scheduling several jobs?
Is there any way to increase logging in the scheduler to debug this issue? (A sketch of what we are considering is below.)
Is it a known issue? If yes, is there any workaround/solution to resolve it?
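Regarding the logging question, what we are considering is bumping the log level for the scheduling-related packages in conf/log4j.properties (Flink 1.13 uses Log4j 2 property syntax; the package names below are our guess at the relevant loggers):

logger.scheduler.name = org.apache.flink.runtime.scheduler
logger.scheduler.level = DEBUG
logger.slotpool.name = org.apache.flink.runtime.jobmaster.slotpool
logger.slotpool.level = DEBUG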
P.S.: a while ago, I asked more or less the same question on the mailing list, but dropped it. I'm sorry if this is considered cross-posting; it's not intended. We are just opening a new thread as we have more information and the issue has re-occurred.

Total DataStage Jobs Greater Than the Max Jobs I Set

I have system policies defined in the InfoSphere DataStage Operations Console under "Workload Management (WLM)".
Sometimes the total number of currently running jobs shoots up to 150, although I have defined the maximum running job count as 40 in WLM.
Whenever the currently running job count increases beyond 100, most of the DataStage jobs start showing increased startup times in the Director log and take a long time to run; if job concurrency stays below 100, the same set of jobs runs fine with startup times in seconds. Please suggest how to address this issue and how to enforce that the number of currently running jobs never exceeds, e.g., 100 at any point in time. Thanks a lot!
This is working as designed. Generally, the WLM system is used to control the start of parallel and server jobs. It uses a set of user-defined queues, and when a job is started, it is submitted to a designated queue. In the example from the IBM support page linked below, the parallel jobs are in a queue named 'MediumPriorityJobs'.
Note that the sequence job is not in a queue, so it does not count toward the total running workload controlled by the WLM Job Count system policy.
Source: https://www.ibm.com/support/pages/how-interpret-job-count-maximum-running-jobs-system-policy-ibm-infosphere-information-server-workload-management-wlm

Airflow Workers starving for queued tasks

I am using Airflow (1.10.3) with AWS RDS Postgres as the metastore and Celery result backend, SQS as the queue service, and the CeleryExecutor. I have 1 master machine running the Airflow webserver and scheduler services, and 1 worker machine.
The Airflow worker is always starving for more (queued) tasks while leaving a lot of resources unused (CPU and RAM usage always below 20%). I've observed the worker pick up tasks in batches; for example, if there are 10 tasks in the queue and 2 tasks running, it will wait for the 2 tasks to complete before picking the next batch from the queue.
Parallelism settings in airflow.cfg on the worker instances:
parallelism = 32
dag_concurrency = 32
non_pooled_task_slot_count = 128
max_active_runs_per_dag = 32
max_threads = 2 (no issues in the scheduler though, as tasks are queued immediately)
One important thing to point out about my implementation: an Airflow task is not a single-process task; each individual task spawns multiple processes (3-5). Yet even after accounting for those process counts, my Airflow worker never reaches full parallelism.
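For completeness, one knob not listed above is the Celery worker concurrency; a sketch of the relevant airflow.cfg section, assuming the 1.10.x default value (I have not tuned it):

[celery]
# Max number of task instances a single Celery worker runs at once.
# With the default of 16, one worker never runs more than 16 tasks
# concurrently, regardless of the parallelism settings above.
worker_concurrency = 16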
Any suggestions on:
a) Is there a way to fully utilise parallel execution of tasks on an Airflow worker? Or is there some information I am missing in my Airflow setup?
b) Do the parallelism settings above treat an Airflow task as the atomic unit, or do they count the threads/processes that a task spawns?
Thanks!

Running Parallel Tasks in Batch

I have a few questions about running tasks in parallel in Azure Batch. Per the official documentation, "Azure Batch allows you to set maximum tasks per node up to four times (4x) the number of node cores."
Is there any setup, other than specifying the max tasks per node when creating a pool, that needs to be done (in the code) to be able to run parallel tasks with Batch?
So if I understand this correctly, on a Standard_D1_v2 machine with 1 core, I can run up to 4 concurrent tasks in parallel. Is that right? If so, I ran some tests and I'm not sure about the behaviour I got. In a pool of D1_v2 machines set up to run 1 task per node, my job execution time is about 16 minutes. Then, using the same applications and the same parameters, with the only change being a new pool with the same setup (also D1_v2) except running 4 tasks per node, I still get a job execution time of about 15 minutes. There was no improvement in job execution time from running the tasks in parallel. What could be happening? What am I missing here?
I ran a test with a pool of D3_v2 machines with 4 cores, set up to run 2 tasks per core for a total of 8 tasks per node, and another test with a pool (same number of machines as the previous one) of D2_v2 machines with 2 cores, set up to run 2 tasks per core for a total of 4 parallel tasks per node. The run time/job execution time for both tests was the same. Isn't there supposed to be an improvement, considering that 8 tasks run per node in the first test versus 4 tasks per node in the second? If yes, what could be the reason I'm not seeing this improvement?
No. Although you may want to look into the task scheduling policy (the compute node fill type) to control how your tasks are distributed amongst the nodes in your pool; a sketch is below.
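A minimal sketch of both settings using the azure-batch Python SDK (account details, pool sizing, and the VM image are placeholders; note that newer SDK versions rename max_tasks_per_node to task_slots_per_node):

# Sketch: a pool that runs up to 4 tasks per 1-core node and spreads
# tasks across nodes instead of packing them.
import azure.batch.models as batchmodels
from azure.batch import BatchServiceClient
from azure.batch.batch_auth import SharedKeyCredentials

credentials = SharedKeyCredentials("mybatchaccount", "<account-key>")
client = BatchServiceClient(
    credentials, batch_url="https://mybatchaccount.<region>.batch.azure.com")

pool = batchmodels.PoolAddParameter(
    id="parallel-pool",
    vm_size="standard_d1_v2",              # 1 core per node
    target_dedicated_nodes=2,
    max_tasks_per_node=4,                  # up to 4x the core count
    task_scheduling_policy=batchmodels.TaskSchedulingPolicy(
        node_fill_type=batchmodels.ComputeNodeFillType.spread),
    virtual_machine_configuration=batchmodels.VirtualMachineConfiguration(
        image_reference=batchmodels.ImageReference(
            publisher="canonical", offer="ubuntuserver", sku="18.04-lts"),
        node_agent_sku_id="batch.node.ubuntu 18.04"),
)
client.pool.add(pool)

No change to the tasks themselves is required; whether the concurrency helps depends on what the tasks do, as explained below.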
How many tasks are in your job? Are your tasks compute-bound? If so, you won't see any improvement (perhaps even end-to-end performance degradation).
Batch merely schedules the tasks concurrently on the node. If the command/process that you're running utilises all of the cores on the machine and is compute-bound, you won't see an improvement. You should double-check your tasks' start and end times within the job, and the node execution info, to confirm they are actually being scheduled concurrently on the same node.

Jobs in a queue are dropped unexpectedly in Gearman

I'm dealing with a very strange problem now.
Ever since I queued over 1,000 jobs at once, Gearman hasn't been working properly...
The problem is that when I submit the jobs in background mode, I can see the jobs correctly queued on the monitoring page (Gearman monitor), but they are drained right afterwards (within a few seconds) without being delivered to a worker.
In the end, the jobs are never executed by a worker; they just disappear from the queue (job server).
So I tried rebooting the server entirely, and reinstalling Gearman as well as the PHP library. (I'm using 1 CentOS and 1 Ubuntu machine with the PHP gearman library, versions 0.34 and 1.0.2.)
But no luck yet... the job server just misbehaves as I explained above.
What should I do for now?
Can I check the workers' state, or watch and monitor the whole process from queueing the jobs to delivering them to a worker?
When I tried gearmand with an option like 'gearmand -vvvv', it never printed anything on the screen while I registered a worker with the server and ran a job with the client code (PHP).
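Ideally I'm looking for something like the following to inspect state (gearadmin ships with gearmand, and the last line is just the raw admin protocol on port 4730; I'm not sure these cover the full delivery path):

# Per-function counts: jobs in queue, jobs running, capable workers
gearadmin --status
# Connected workers and the functions they have registered
gearadmin --workers
# Same information over the plain-text admin protocol
(echo status; sleep 1) | nc localhost 4730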
Any comment will be appreciated.
For your information, I'm not considering a persistent queue using MySQL or SQLite for now, because it sometimes causes performance issues with slow execution.