Hi I have a flink Job that that every time there's an Exception, it will have after it around 7 additional Exceptions of
java.util.concurrent.CompletionException: org.apache.flink.runtime.jobmanager.schedular.NoResourceAvailableException: Could not acquire the minimum required resources.
For example the timing
5:47:40 - Exception
5:50:40 - NoResourceAvailableException
5:50:47 - NoResourceAvailableException
5:50:56 - NoResourceAvailableException
5:51:10 - NoResourceAvailableException
5:51:31 - NoResourceAvailableException
5:52:04 - NoResourceAvailableException
5:52:57 - NoResourceAvailableException
Only after around 5 min the job runned again.
The environment have 15 taskmanger and 30 taks slots.
And the job run on 28 task slots, we left 2 task slot so on fail the environment will have standby task slots.
As you can see it still doesn't help and the job takes 5 min until it's up again.
The environment run on kubernetes.
My guess is that the pods restart because of the Exception and that the jobmanger waits to the same pod to be restored. But I don't understand why it won't use the standby task slots.
The up time of the job is crucial so every minute count and we try to have minimal downtime.
I tried to change the number of task slots and task manager so it will have more weaker instances. But the job won't run in any diffrent configuration we get backpresure instead.
And adding more standby task slot in the current state is extremely expensive because each pods (task manager) have a lot of resources.
Tnx for anyone that can help.
I have an ECS cluster (fargate), task, and service I have had setup in Terraform for at least a year. I haven't touched it for a long while. My normal deployment for updating the code is to push a new container to the registry and then stop all tasks on the cluster with a script. Today, my service did not run a new task in response to that task being stopped. It's desired count is fixed at so it should.
I have go in an tried to manually run this and I'm seeing this error.
Unable to run task
Http request timed out enforced after 4999ms
When I try to do this, a new stopped task is added to my stopped tasks lists. When I look into that task the stopped reason is "Deployment restart" and two of them are now showing "Task provisioning failed." which I think might be tasks the service tried to start. But these tasks do not show a started timestamp. The ones I start in the console have a started timestamp.
My site is now down and I can't get it back up. Does anyone know of a way to debug this? Is AWS ECS experiencing problems right now? I checked the health monitors and I see no issues.
This was an AWS outage affecting Fargate in us-east-1. It's fixed now.
While running standard Airflow examples with airflow 2.1.2, DAGs are taking a long time to complete. On every DAG run, this problem occurs. The problem happens when running from the airflow GUI. It isn't a problem when running as a test from the airflow command line. Looking at scheduler log as it runs, this is what is apparent: after a task runs, apparently the DagFileProcessorManager has to be restarted for it to continue to the next tasks, which take 1 to 2 minutes. The restart happens after the absence of heartbeat responses, and this error shows:
{dag_processing.py:414} ERROR - DagFileProcessorManager (PID=67503) last sent a heartbeat 64.25 seconds ago! Restarting it
Question: How can I fix this?
This fixed the problem:
(1) Use postgresql instead of sqlite.
(2) Switch from SequentialExecutor to LocalExecutor.
Just to add to that - we had other similar reports and we decided to make a very clear warning in such case in the UI (will be released in the next version):
https://github.com/apache/airflow/pull/17133
I am using an Azure batch account to run sqlpackage.exe in order to move databases from a server to another. A task that has started 6 days ago has suddenly been restarted and started from the beginning after 4 days of running (extremely large databases). The task run uninterruptedly up until then and should have continued to run for about 1-2 days.
The PowerShell script that contains all the logic handles all the exceptions that could occur during the execution. Also, the retry count for the task was set to 0 in case it fails.
Unfortunately, I did not have diagnostics settings configured and I could only look at the metrics and there was a short period when there wasn't any node.
What can be the causes for this behavior? Restarting while the node is still running
Thanks
Unfortunately, there is no way to give a definitive answer to this question. You will need to dig into the compute node (interactively log in) and check system logs to give you details on why the node restarted. There is no guarantee that a compute node will have 100% uptime as there may be hardware faults or other service interruptions.
In general, it's best practice to have long running tasks checkpoint progress combined with a retry policy. Programs that can reload state can pick up at the time of the checkpoint when the Batch service automatically reschedules the task execution. Please see the Batch best practices guide for more information.
I have a batch process, written in PHP and embedded in a Docker container. Basically, it loads data from several webservices, do some computation on data (during ~1h), and post computed data to an other webservice, then the container exit (with a return code of 0 if OK, 1 if failure somewhere on the process). During the process, some logs are written on STDOUT or STDERR. The batch must be triggered once a day.
I was wondering what is the best AWS service to use to schedule, execute, and monitor my batch process :
at the very begining, I used a EC2 machine with a crontab : no high-availibilty function here, so I decided to switch to a more PaaS approach.
then, I was using Elastic Beanstalk for Docker, with a non-functional Webserver (only to reply to the Healthcheck), and a Crontab inside the container to wake-up my batch command once a day. With autoscalling rule min=1 max=1, I have HA (if the container crash or if the VM crash, it is restarted by AWS)
but now, to be more efficient, I decided to move to some ECS service, and have an approach where I do not need to have EC2 instances awake 23/24 for nothing. So I tried Fargate.
with Fargate I defined my task (Fargate type, not the EC2 type), and configure everything on it.
I create a Cluster to run my task : I can run "by hand, one time" my task, so I know every settings are corrects.
Now, going deeper in Fargate, I want to have my task executed once a day.
It seems to work fine when I used the Scheduled Task feature of ECS : the container start on time, the process run, then the container stop. But CloudWatch is missing some metrics : CPUReservation and CPUUtilization are not reported. Also, there is no way to know if the batch quit with exit code 0 or 1 (all execution stopped with status "STOPPED"). So i Cant send a CloudWatch alarm if the container execution failed.
I use the "Services" feature of Fargate, but it cant handle a batch process, because the container is started every time it stops. This is normal, because the container do not have any daemon. There is no way to schedule a service. I want my container to be active only when it needs to work (once a day during at max 1h). But the missing metrics are correctly reported in CloudWatch.
Here are my questions : what are the best suitable AWS managed services to trigger a container once a day, let it run to do its task, and have reporting facility to track execution (CPU usage, batch duration), including alarm (SNS) when task failed ?
We had the same issue with identifying failed jobs. I propose you take a look into AWS Batch where logs for FAILED jobs are available in CloudWatch Logs; Take a look here.
One more thing you should consider is total cost of ownership of whatever solution you choose eventually. Fargate, in this regard, is quite expensive.
may be too late for your projects but still I thought it could benefit others.
Have you had a look at AWS Step Functions? It is possible to define a workflow and start tasks on ECS/Fargate (or jobs on EKS for that matter), wait for the results and raise alarms/send emails...