Airflow long-running job killed after 1 hour but the task is still in running state - amazon-ecs

I need help with a long-running DAG that keeps failing after an hour while the task stays in the running state.
I have been using Airflow for the past 6-8 months. With the help of our infrastructure team, I set up Airflow in our company. It runs on an AWS ECS cluster, the DAGs sit on an EFS volume with throughput mode set to provisioned, and the logs are written to an S3 bucket.
For the worker ECS service we have an autoscaling policy that scales the cluster up at 1 AM and back down at 4 AM.
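For reference, the scheduled scale-up/scale-down looks roughly like the following sketch, written here with boto3 and Application Auto Scaling for illustration (the cluster/service names and capacities are placeholders, not our exact values):

import boto3

# Hypothetical sketch of the scheduled scaling on the worker ECS service.
autoscaling = boto3.client("application-autoscaling")

autoscaling.put_scheduled_action(
    ServiceNamespace="ecs",
    ResourceId="service/airflow-cluster/airflow-worker",
    ScalableDimension="ecs:service:DesiredCount",
    ScheduledActionName="scale-up-1am",
    Schedule="cron(0 1 * * ? *)",  # every day at 01:00
    ScalableTargetAction={"MinCapacity": 4, "MaxCapacity": 4},
)

autoscaling.put_scheduled_action(
    ServiceNamespace="ecs",
    ResourceId="service/airflow-cluster/airflow-worker",
    ScalableDimension="ecs:service:DesiredCount",
    ScheduledActionName="scale-down-4am",
    Schedule="cron(0 4 * * ? *)",  # every day at 04:00
    ScalableTargetAction={"MinCapacity": 1, "MaxCapacity": 1},
)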
It runs fine for short-duration jobs. It was also successful with a long-duration job that wrote its results into a Redshift table intermittently.
But now I have a job that loops over a pandas DataFrame and updates two dictionaries.
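Simplified, the task's callable looks something like this (the column and dictionary names are placeholders; the real logic is more involved):

import pandas as pd

def build_lookup_dicts(df: pd.DataFrame):
    # Loop over the DataFrame and update two dictionaries (simplified placeholder logic).
    totals, latest = {}, {}
    for _, row in df.iterrows():
        key = row["entity_id"]
        totals[key] = totals.get(key, 0) + row["amount"]
        latest[key] = row["updated_at"]
    return totals, latest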
Issue:
The job takes about 4 hours to finish, but at around the 1-hour mark it fails without any error. The task stays in the running state until I manually stop it. And when I try to look at the logs, the actual log doesn't come up; instead it shows:
[2021-05-04 19:59:18,785] {taskinstance.py:664} INFO - Dependencies not met for <TaskInstance: app-doctor-utilisation.execute 2021-05-04T18:57:10.480384+00:00 [running]>, dependency 'Task Instance State' FAILED: Task is in the 'running' state which is not a valid state for execution. The task must be cleared in order to be run.
[2021-05-04 19:59:18,786] {local_task_job.py:90} INFO - Task is not able to be run
Now when I stop the task I can see some of the logs, with the following lines at the end:
[2021-05-04 20:11:11,785] {helpers.py:325} INFO - Sending Signals.SIGTERM to GPID 38
[2021-05-04 20:11:11,787] {taskinstance.py:955} ERROR - Received SIGTERM. Terminating subprocesses.
[2021-05-04 20:11:11,959] {helpers.py:291} INFO - Process psutil.Process(pid=38, status='terminated', exitcode=0, started='18:59:13') (38) terminated with exit code 0
[2021-05-04 20:11:11,960] {local_task_job.py:102} INFO - Task exited with return code 0
Can someone please help me figure out the issue and whether there is a solution for this?

Related

Flink CompletionException: Could not acquire the minimum required resources

Hi, I have a Flink job where every time there's an exception, it is followed by around 7 additional exceptions of
java.util.concurrent.CompletionException: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Could not acquire the minimum required resources.
For example, the timing:
5:47:40 - Exception
5:50:40 - NoResourceAvailableException
5:50:47 - NoResourceAvailableException
5:50:56 - NoResourceAvailableException
5:51:10 - NoResourceAvailableException
5:51:31 - NoResourceAvailableException
5:52:04 - NoResourceAvailableException
5:52:57 - NoResourceAvailableException
Only after around 5 minutes did the job run again.
The environment has 15 task managers and 30 task slots.
The job runs on 28 task slots; we left 2 task slots free so that on failure the environment has standby task slots.
As you can see, it still doesn't help, and the job takes 5 minutes until it's up again.
The environment runs on Kubernetes.
My guess is that the pods restart because of the exception and that the JobManager waits for the same pod to be restored. But I don't understand why it won't use the standby task slots.
The uptime of the job is crucial, so every minute counts and we try to have minimal downtime.
I tried changing the number of task slots and task managers so there would be more, weaker instances, but the job won't run in any different configuration; we get backpressure instead.
And adding more standby task slots in the current state is extremely expensive, because each pod (task manager) has a lot of resources.
Thanks to anyone who can help.

AWS ECS won't start tasks: http request timed out enforced after 4999ms

I have an ECS cluster (Fargate), task, and service that I have had set up in Terraform for at least a year. I haven't touched it for a long while. My normal deployment for updating the code is to push a new container to the registry and then stop all tasks on the cluster with a script. Today, my service did not run a new task in response to those tasks being stopped. Its desired count is fixed, so it should have.
I have gone in and tried to run this manually, and I'm seeing this error:
Unable to run task
Http request timed out enforced after 4999ms
When I try to do this, a new stopped task is added to my stopped tasks list. When I look into that task, the stopped reason is "Deployment restart", and two of them are now showing "Task provisioning failed.", which I think might be tasks the service tried to start. But these tasks do not show a started timestamp; the ones I start in the console have a started timestamp.
My site is now down and I can't get it back up. Does anyone know of a way to debug this? Is AWS ECS experiencing problems right now? I checked the health monitors and I see no issues.
This was an AWS outage affecting Fargate in us-east-1. It's fixed now.

Airflow example DAGs take a long time in the GUI and the scheduler shows this: DagFileProcessorManager (PID=...) last sent a heartbeat

While running the standard Airflow examples with Airflow 2.1.2, DAGs are taking a long time to complete. The problem occurs on every DAG run when triggered from the Airflow GUI; it isn't a problem when running a task as a test from the Airflow command line. Looking at the scheduler log as it runs, this is what is apparent: after a task runs, the DagFileProcessorManager apparently has to be restarted before the run continues to the next tasks, which takes 1 to 2 minutes. The restart happens after an absence of heartbeat responses, and this error shows:
{dag_processing.py:414} ERROR - DagFileProcessorManager (PID=67503) last sent a heartbeat 64.25 seconds ago! Restarting it
Question: How can I fix this?
This fixed the problem:
(1) Use PostgreSQL instead of SQLite.
(2) Switch from SequentialExecutor to LocalExecutor.
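For reference, those two changes map to the following Airflow settings, shown here in their AIRFLOW__SECTION__KEY environment-variable form as a minimal Python sketch (in Airflow 2.1.x both keys live under [core]; the connection string is a placeholder):

import os

# Equivalent to editing [core] in airflow.cfg; set these in the environment
# that launches the scheduler and webserver. The connection string is a placeholder.
os.environ["AIRFLOW__CORE__EXECUTOR"] = "LocalExecutor"
os.environ["AIRFLOW__CORE__SQL_ALCHEMY_CONN"] = (
    "postgresql+psycopg2://airflow:airflow@localhost:5432/airflow"
)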
Just to add to that: we had other similar reports, and we decided to add a very clear warning for such cases in the UI (it will be released in the next version):
https://github.com/apache/airflow/pull/17133

Batch account node restarted unexpectedly

I am using an Azure Batch account to run sqlpackage.exe in order to move databases from one server to another. A task that started 6 days ago was suddenly restarted from the beginning after 4 days of running (extremely large databases). The task had run uninterrupted up until then and should have continued to run for about another 1-2 days.
The PowerShell script that contains all the logic handles all the exceptions that could occur during execution. Also, the retry count for the task was set to 0 in case it fails.
Unfortunately, I did not have diagnostic settings configured, so I could only look at the metrics, and there was a short period when there wasn't any node.
What could be the causes of this behavior, a restart while the node is still running?
Thanks
Unfortunately, there is no way to give a definitive answer to this question. You will need to dig into the compute node (interactively log in) and check the system logs for details on why the node restarted. There is no guarantee that a compute node will have 100% uptime, as there may be hardware faults or other service interruptions.
In general, it's best practice to have long-running tasks checkpoint their progress, combined with a retry policy. Programs that can reload state can pick up from the checkpoint when the Batch service automatically reschedules the task execution. Please see the Batch best practices guide for more information.
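To illustrate that checkpoint-and-resume pattern with a generic, non-Batch-specific sketch (the checkpoint file name and the unit of work are hypothetical):

import json
import os

CHECKPOINT_FILE = "checkpoint.json"  # hypothetical path; persist it somewhere durable

def load_last_completed():
    # Index of the last completed work item, or -1 if starting fresh.
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE) as f:
            return json.load(f)["last_completed"]
    return -1

def save_last_completed(index):
    with open(CHECKPOINT_FILE, "w") as f:
        json.dump({"last_completed": index}, f)

def process(item):
    pass  # placeholder for the real unit of work (e.g. one database to move)

def run(items):
    start = load_last_completed() + 1
    for i in range(start, len(items)):
        process(items[i])
        save_last_completed(i)  # if the node restarts, the retried task resumes here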

AWS Fargate vs Batch vs ECS for a once-a-day batch process

I have a batch process, written in PHP and packaged in a Docker container. Basically, it loads data from several web services, does some computation on the data (for about 1 hour), and posts the computed data to another web service; then the container exits (with a return code of 0 if OK, 1 if something failed during the process). During the process, some logs are written to STDOUT or STDERR. The batch must be triggered once a day.
I was wondering what the best AWS service is to schedule, execute, and monitor my batch process:
At the very beginning, I used an EC2 machine with a crontab: no high availability here, so I decided to switch to a more PaaS-like approach.
Then I used Elastic Beanstalk for Docker, with a non-functional web server (only to reply to the health check) and a crontab inside the container to wake up my batch command once a day. With an autoscaling rule of min=1, max=1, I have HA (if the container or the VM crashes, it is restarted by AWS).
But now, to be more efficient, I decided to move to ECS, with an approach where I do not need to keep EC2 instances awake 23 hours a day for nothing. So I tried Fargate:
With Fargate I defined my task (Fargate launch type, not the EC2 type) and configured everything on it.
I created a cluster to run my task: I can run the task "by hand, one time", so I know all the settings are correct.
Now, going deeper into Fargate, I want to have my task executed once a day.
It seems to work fine when I use the Scheduled Task feature of ECS: the container starts on time, the process runs, then the container stops. But CloudWatch is missing some metrics: CPUReservation and CPUUtilization are not reported. Also, there is no way to know whether the batch exited with code 0 or 1 (every execution stops with status "STOPPED"), so I can't raise a CloudWatch alarm if the container execution failed. (A sketch of such a scheduled rule is shown after this list.)
I also used the "Service" feature of ECS, but it can't handle a batch process, because the container is restarted every time it stops. This is normal, because the container does not run any daemon, and there is no way to schedule a service; I want my container to be active only when it needs to work (once a day, for at most 1 hour). The metrics missing above, however, are correctly reported in CloudWatch.
Here is my question: what are the most suitable AWS managed services to trigger a container once a day, let it run to do its task, and provide reporting to track the execution (CPU usage, batch duration), including an alarm (SNS) when the task fails?
We had the same issue with identifying failed jobs. I propose you take a look at AWS Batch, where logs for FAILED jobs are available in CloudWatch Logs; take a look here.
One more thing you should consider is the total cost of ownership of whatever solution you eventually choose. Fargate, in this regard, is quite expensive.
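For what it's worth, running the daily container as an AWS Batch job boils down to a job definition plus a submit call; a minimal boto3 sketch, where the job, queue, and definition names are hypothetical and the job definition is assumed to point at the existing PHP batch image:

import boto3

batch = boto3.client("batch")

# Submit one run of the daily batch (names are placeholders).
response = batch.submit_job(
    jobName="php-daily-batch",
    jobQueue="daily-batch-queue",
    jobDefinition="php-batch-job-def",
)
print(response["jobId"])  # the job's logs land in CloudWatch Logs, including FAILED runs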
This may be too late for your project, but I thought it could still benefit others.
Have you had a look at AWS Step Functions? It is possible to define a workflow that starts tasks on ECS/Fargate (or jobs on EKS, for that matter), waits for the results, and raises alarms/sends emails...
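To illustrate that idea, a state machine can run the Fargate task synchronously and publish to an SNS topic when the run fails; a rough sketch of what such a definition could look like, where every ARN, subnet, and name is a placeholder:

import json

import boto3

# Run the Fargate task and wait for it; notify SNS if the run-task state fails.
definition = {
    "StartAt": "RunBatch",
    "States": {
        "RunBatch": {
            "Type": "Task",
            "Resource": "arn:aws:states:::ecs:runTask.sync",
            "Parameters": {
                "LaunchType": "FARGATE",
                "Cluster": "arn:aws:ecs:eu-west-1:123456789012:cluster/batch-cluster",
                "TaskDefinition": "arn:aws:ecs:eu-west-1:123456789012:task-definition/php-batch:1",
                "NetworkConfiguration": {
                    "AwsvpcConfiguration": {
                        "Subnets": ["subnet-0123456789abcdef0"],
                        "AssignPublicIp": "ENABLED",
                    }
                },
            },
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "NotifyFailure"}],
            "End": True,
        },
        "NotifyFailure": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sns:publish",
            "Parameters": {
                "TopicArn": "arn:aws:sns:eu-west-1:123456789012:batch-alerts",
                "Message": "Daily batch task failed",
            },
            "End": True,
        },
    },
}

sfn = boto3.client("stepfunctions")
sfn.create_state_machine(
    name="daily-batch-workflow",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/stepfunctions-ecs-role",
)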