Flink CompletionException: Could not acquire the minimum required resources - Kubernetes

Hi, I have a Flink job where every exception is followed by around 7 additional exceptions of
java.util.concurrent.CompletionException: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Could not acquire the minimum required resources.
For example, the timing:
5:47:40 - Exception
5:50:40 - NoResourceAvailableException
5:50:47 - NoResourceAvailableException
5:50:56 - NoResourceAvailableException
5:51:10 - NoResourceAvailableException
5:51:31 - NoResourceAvailableException
5:52:04 - NoResourceAvailableException
5:52:57 - NoResourceAvailableException
Only after around 5 minutes did the job run again.
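To make the timing pattern easier to read, here is a minimal Python sketch that computes the gaps between the timestamps listed above. The helper name `gaps_in_seconds` is hypothetical, and the sketch assumes all timestamps fall on the same day.

```python
from datetime import datetime

# Timestamps copied from the list above: the first entry is the original
# exception, the rest are the NoResourceAvailableException occurrences.
timestamps = [
    "5:47:40", "5:50:40", "5:50:47", "5:50:56",
    "5:51:10", "5:51:31", "5:52:04", "5:52:57",
]


def gaps_in_seconds(stamps):
    """Return the seconds between each pair of consecutive timestamps."""
    parsed = [datetime.strptime(s, "%H:%M:%S") for s in stamps]
    return [int((b - a).total_seconds()) for a, b in zip(parsed, parsed[1:])]


gaps = gaps_in_seconds(timestamps)
print(gaps)       # [180, 7, 9, 14, 21, 33, 53]
print(sum(gaps))  # 317 seconds, a little over 5 minutes
```

The first NoResourceAvailableException arrives 3 minutes after the original exception, and the gaps between the following ones grow with every attempt rather than staying fixed.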
The environment has 15 task managers and 30 task slots.
The job runs on 28 task slots; we left 2 task slots free so that on failure the environment has standby task slots.
As you can see, it still doesn't help, and the job takes 5 minutes until it's up again.
The environment runs on Kubernetes.
My guess is that the pods restart because of the exception and that the JobManager waits for the same pod to be restored. But I don't understand why it won't use the standby task slots.
The uptime of the job is crucial, so every minute counts and we try to keep downtime to a minimum.
I tried changing the number of task slots and task managers so that there would be more, weaker instances. But the job won't run properly in any different configuration; we get backpressure instead.
And adding more standby task slots in the current state is extremely expensive because each pod (task manager) has a lot of resources.
Thanks to anyone who can help.

Related

Scheduling jobs fails with org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException

Thank you for reading this SO question. It may seem long, but I'll try to include as much information as possible to help get an answer.
Summary
We are currently experiencing a scheduling issue with our Flink cluster.
The symptoms are that some/most/all (it depends, the symptoms are not always the same) of our tasks are shown as SCHEDULED but fail after a timeout. The jobs are then shown as RUNNING.
The failing exception is the following one:
Caused by: java.util.concurrent.CompletionException: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Slot request bulk is not fulfillable! Could not allocate the required slot within slot request timeout
After analysis, we assume (we cannot prove it, as there are not many logs for that part of the code) that the failure is due to a deadlock/race condition that happens when several jobs are submitted to the Flink cluster at the same time, even though we have enough slots available in the cluster.
We actually have the error with 52 available task slots, and have 12 jobs that are not scheduled.
Additional information
Flink version: 1.13.1 commit a7f3192
Flink cluster in session mode
2 job managers using k8s HA mode (resource requests: 2 CPUs, 4 GB RAM; memory limit set to 4 GB)
50 task managers with 2 slots each (resource requests: 2 CPUs, 2 GB RAM; no limits set)
Our Flink cluster is shut down every night and restarted every morning. The error seems to occur when a lot of jobs need to be scheduled. The jobs are configured to restore their state, and we do not see any issues with jobs that are scheduled and run correctly, so it really seems to be a scheduling issue.
Questions
Could it be that the issue described in FLINK-23409 is actually the same, but occurs only when there is a race condition when scheduling several jobs?
Is there any way to increase logging in the scheduler to debug this issue?
Is it a known issue? If yes, is there any workaround/solution to resolve it?
P.S.: A while ago, I asked more or less the same question on the ML but dropped it. I'm sorry if this is considered cross-posting; it's not intended. We are just opening a new thread as we have more information and the issue has recurred.

Airflow long running job killed after 1 hr but the task is still in running state

I need help with a long-running DAG that keeps failing after an hour while the task still shows as running.
I have been using Airflow for the past 6-8 months. With the help of our infrastructure team, I have set up Airflow in our company. It's running on an AWS ECS cluster. The DAGs sit in an EFS instance with throughput set to provisioned. The logs are written to an S3 bucket.
For the worker AWS ECS service, we have an autoscaling policy that scales the cluster up at 1 AM and scales it down at 4 AM.
It runs fine for short-duration jobs. It was also successful with a long-duration job that was writing the results into a Redshift table intermittently.
But now I have a job that loops over a pandas DataFrame and updates two dictionaries.
Issue:
It takes about 4 hours for the job to finish, but at around 1 hour it automatically fails without any error. The task stays in the running state until I manually stop it. When I try to look at the logs, the actual log doesn't come up. It shows:
[2021-05-04 19:59:18,785] {taskinstance.py:664} INFO - Dependencies not met for <TaskInstance: app-doctor-utilisation.execute 2021-05-04T18:57:10.480384+00:00 [running]>, dependency 'Task Instance State' FAILED: Task is in the 'running' state which is not a valid state for execution. The task must be cleared in order to be run.
[2021-05-04 19:59:18,786] {local_task_job.py:90} INFO - Task is not able to be run
When I stop the task, I can see some of the logs, with the following lines at the end:
[2021-05-04 20:11:11,785] {helpers.py:325} INFO - Sending Signals.SIGTERM to GPID 38
[2021-05-04 20:11:11,787] {taskinstance.py:955} ERROR - Received SIGTERM. Terminating subprocesses.
[2021-05-04 20:11:11,959] {helpers.py:291} INFO - Process psutil.Process(pid=38, status='terminated', exitcode=0, started='18:59:13') (38) terminated with exit code 0
[2021-05-04 20:11:11,960] {local_task_job.py:102} INFO - Task exited with return code 0
Can someone please help me figure out the issue and if there is any solution for this?

Airflow tasks stuck in queued state

We're running Airflow 1.10.12, with KubernetesExecutor and KubernetesPodOperator.
In the past few days, we've been seeing tasks get stuck in the queued state for a long time (to be honest, unless we restart the scheduler, they remain stuck in that state), while new tasks of the same DAG are scheduled properly.
The only thing that helps is either clearing the task manually or restarting the scheduler service.
We usually see it happen when we run our E2E tests, which spawn ~20 DAG runs for every one of our 3 DAGs. Due to limited parallelism, some will be queued (which is fine by us).
These are our parallelism params in airflow.cfg:
parallelism = 32
dag_concurrency = 16
max_active_runs_per_dag = 16
Two of our DAGs override max_active_runs and set it to 10.
Any idea what could be causing it?
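For reference, a rough Python sketch of how the three settings above combine into an upper bound on simultaneously running tasks. This is a simplified model, not Airflow code: it ignores pools and executor-side limits, the DAG names are placeholders, and the "one ready task per run" load is an assumption made purely for illustration.

```python
# Values from the airflow.cfg excerpt above.
PARALLELISM = 32              # max task instances running across the installation
DAG_CONCURRENCY = 16          # max task instances running at once within one DAG
MAX_ACTIVE_RUNS_PER_DAG = 16  # default cap on active DAG runs per DAG


def running_task_ceiling(ready_tasks_per_dag,
                         parallelism=PARALLELISM,
                         dag_concurrency=DAG_CONCURRENCY):
    """Upper bound on simultaneously running tasks, given ready tasks per DAG."""
    per_dag = [min(n, dag_concurrency) for n in ready_tasks_per_dag.values()]
    return min(parallelism, sum(per_dag))


# Hypothetical load: about 20 runs of each of the 3 DAGs, one ready task per run.
ready = {"dag_a": 20, "dag_b": 20, "dag_c": 20}
# Two DAGs override max_active_runs to 10; the third keeps the default of 16.
max_active_runs = {"dag_a": 10, "dag_b": 10, "dag_c": MAX_ACTIVE_RUNS_PER_DAG}
active = {dag: min(n, max_active_runs[dag]) for dag, n in ready.items()}

ceiling = running_task_ceiling(active)
waiting = sum(ready.values()) - ceiling
print(active)   # {'dag_a': 10, 'dag_b': 10, 'dag_c': 16}
print(ceiling)  # 32
print(waiting)  # 28
```

Under these assumptions, at most 32 tasks run at once and the rest wait, which matches the queuing we expect. The problem is the tasks that never leave the queued state.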

How do I stop a CronJob from recreating failed Jobs?

When, for whatever reason, I delete the pod running the Job that was started by a CronJob, I immediately see a new pod being created. It is only after I delete the pod something like six times (the backoffLimit number) that new ones stop being created.
Of course, if I'm actively monitoring the process, I can delete the CronJob, but what if the Pod inside the job fails when I'm not looking? I would like it not to be recreated.
How can I stop the CronJob from persisting in creating new jobs (or pods?), and wait until the next scheduled time if the current job/pod failed? Is there something similar to Jobs' backoffLimit, but for CronJobs?
Set startingDeadlineSeconds to a large value or leave it unset (the default).
At the same time, set .spec.concurrencyPolicy to Forbid, and the CronJob skips the new job run while the previously created job is still running.
If startingDeadlineSeconds is set to a large value or left unset (the default) and concurrencyPolicy is set to Forbid, the job will not be run again if it failed.
The concurrency policy field (.spec.concurrencyPolicy) can be added to the specification of your CronJob definition, but it is optional.
It specifies how to treat concurrent executions of a job that is created by this CronJob. The spec may specify only one of these three concurrency policies:
Allow (default) - The cron job allows concurrently running jobs
Forbid - The cron job does not allow concurrent runs; if it is time for a new job run and the previous job run hasn’t finished yet, the cron job skips the new job run
Replace - If it is time for a new job run and the previous job run hasn’t finished yet, the cron job replaces the currently running job run with a new job run
It is good to know that the concurrency policy applies only to the jobs created by the same CronJob.
If there are multiple CronJobs, their respective jobs are always allowed to run concurrently.
A CronJob is counted as missed if it has failed to be created at its scheduled time. For example, if concurrencyPolicy is set to Forbid and a CronJob was attempted to be scheduled when there was a previous schedule still running, then it would count as missed.
For every CronJob, the CronJob controller checks how many schedules it missed in the duration from its last scheduled time until now. If there are more than 100 missed schedules, then it does not start the job and logs the error.
You can find more information here: CronJobs and AutomatedTask.
I hope it helps.
A CronJob creates a Job with a backoffLimit, which in your case has the default value (6). Note that a Job's pod template must set restartPolicy to Never or OnFailure; Always is not allowed there.
It is better to set backoffLimit to 0 and restartPolicy to Never, and to set startingDeadlineSeconds to a value lower than or equal to your schedule interval, or you can customize it as needed to control the run time of each CronJob run.
Additionally, you may set concurrencyPolicy to Forbid.
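As a rough illustration of how the settings named in the answers above fit together, here is a Python sketch that builds a CronJob manifest as a dict and prints it as JSON. The function name `build_cronjob`, the job name, the schedule and the image are placeholders rather than values from the question, and `batch/v1` is an assumption about the cluster version.

```python
import json


def build_cronjob(name, schedule, image, starting_deadline_seconds=None):
    """Build a CronJob manifest as a dict; all argument values are placeholders."""
    spec = {
        "schedule": schedule,
        # Skip a new run while the previous one is still running.
        "concurrencyPolicy": "Forbid",
        "jobTemplate": {
            "spec": {
                # Do not retry a failed pod within this Job.
                "backoffLimit": 0,
                "template": {
                    "spec": {
                        # Job pod templates accept only Never or OnFailure.
                        "restartPolicy": "Never",
                        "containers": [{"name": name, "image": image}],
                    }
                },
            }
        },
    }
    if starting_deadline_seconds is not None:
        spec["startingDeadlineSeconds"] = starting_deadline_seconds
    return {
        "apiVersion": "batch/v1",  # assumption: older clusters use batch/v1beta1
        "kind": "CronJob",
        "metadata": {"name": name},
        "spec": spec,
    }


manifest = build_cronjob("nightly-report", "0 2 * * *", "example.com/report:latest")
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied like any other manifest. Whether a manually deleted pod counts toward backoffLimit may depend on the Kubernetes version, so it is worth testing on your own cluster.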

Running Parallel Tasks in Batch

I have a few questions about running tasks in parallel in Azure Batch. Per the official documentation, "Azure Batch allows you to set maximum tasks per node up to four times (4x) the number of node cores."
Is there any setup, other than specifying the max tasks per node when creating a pool, that needs to be done (in the code) to be able to run parallel tasks with Batch?
So if I am understanding this correctly, if I have a Standard_D1_v2 machine with 1 core, I can run up to 4 concurrent tasks in parallel on it. Is that right? If yes, I ran some tests and I am not quite sure about the behavior that I got. In a pool of D1_v2 machines set up to run 1 task per node, I get about 16 min for my job execution time. Then, using the same applications and same parameters, with the only change being a new pool with the same setup, also D1_v2, except running 4 tasks per node, I still get a job execution time of about 15 min. There wasn't any improvement in the job execution time from running tasks in parallel. What could be happening? What am I missing here?
I ran a test with a pool of D3_v2 machines with 4 cores, set up to run 2 tasks per core for a total of 8 tasks per node, and another test with a pool (same number of machines as the previous one) of D2_v2 machines with 2 cores, set up to run 2 tasks per core for a total of 4 parallel tasks per node. The job execution time for both of these tests was the same. Isn't there supposed to be an improvement, considering that 8 tasks are running per node in the first test versus 4 tasks per node in the second test? If yes, what could be the reason why I'm not getting this improvement?
No, although you may want to look into the task scheduling policy (compute node fill type) to control how your tasks are distributed among the nodes in your pool.
How many tasks are in your job? Are your tasks compute-bound? If so, you won't see any improvement (perhaps even end-to-end performance degradation).
Batch merely schedules the tasks concurrently on the node. If the command/process that you're running utilizes all of the cores on the machine and is compute-bound, you won't see an improvement. You should double-check your tasks' start and end times within the job and the node execution info to see if they are actually being scheduled concurrently on the same node.
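One way to do that check, once you have each task's node, start time and end time, is sketched below in Python. This is only an illustration: the function name `max_concurrency_per_node` is hypothetical, the sample data is made up (minutes since job start), and fetching the real values from Batch is not shown.

```python
from collections import defaultdict


def max_concurrency_per_node(tasks):
    """tasks: iterable of (node_id, start, end) with comparable start/end values.

    Returns {node_id: largest number of tasks running at the same instant}.
    """
    events = defaultdict(list)
    for node, start, end in tasks:
        events[node].append((start, 1))   # a task begins
        events[node].append((end, -1))    # a task ends
    peaks = {}
    for node, node_events in events.items():
        running = peak = 0
        # At equal times, ends sort before starts, so back-to-back tasks
        # are not counted as overlapping.
        for _, delta in sorted(node_events):
            running += delta
            peak = max(peak, running)
        peaks[node] = peak
    return peaks


# Hypothetical data: minutes since job start, not real Batch output.
sample = [
    ("node-1", 0, 4), ("node-1", 0, 4), ("node-1", 4, 8),
    ("node-2", 0, 4), ("node-2", 4, 8),
]
print(max_concurrency_per_node(sample))  # {'node-1': 2, 'node-2': 1}
```

If the peak for every node is 1 even though the pool allows several tasks per node, the tasks are not actually running side by side. If the peak matches the configured limit and the job time is unchanged, the tasks are probably competing for the same cores.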