How to make AWS Step Functions/ECS schedule tasks only when resources are available - amazon-ecs

I am using ECS/EC2 to run Step Functions (Standard type) tasks.
I trigger the Step Functions execution through an API, so I sometimes get a peak that exceeds the capacity of the ECS cluster, and I frequently get errors like this:
[{"Arn":"arn:aws:ecs:us-east-1:432214534264:container-instance/192a09715eea48828b798600b5c67532","Reason":"RESOURCE:GPU"}] (Service: AmazonECS; Status Code: 400; Error Code: AmazonECS.Unknown; Request ID: d33c3afa-d158-45f4-83f4-2567efb53017; Proxy: null)
This means I do not have enough GPU (sometimes MEMORY) to run more tasks.
So, how can I configure the Step Functions state machine and/or the ECS cluster to schedule tasks only when resources are available?
I tried PlacementConstraints and PlacementStrategies, but they are not fit for this: they guide how tasks are distributed when resources are sufficient, and they do not pause new task creation until resources become available.
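One possible approach, sketched here and not confirmed for this setup, is to let Step Functions retry the ECS task with backoff when placement fails, so the execution waits for capacity instead of failing. The sketch below builds an Amazon States Language Task state as a Python dict. The cluster and task definition names are placeholders, the retry numbers are arbitrary assumptions, and it assumes the placement failure surfaces as a task failure that `States.TaskFailed` matches.

```python
import json


def build_run_task_state(cluster, task_definition, max_attempts=10):
    """Return an Amazon States Language Task state that runs an ECS task
    and retries with exponential backoff when the task fails."""
    return {
        "Type": "Task",
        # .sync makes the state wait until the ECS task stops.
        "Resource": "arn:aws:states:::ecs:runTask.sync",
        "Parameters": {
            "LaunchType": "EC2",
            "Cluster": cluster,                 # placeholder name
            "TaskDefinition": task_definition,  # placeholder name
        },
        "Retry": [
            {
                # Assumption: the RESOURCE:GPU / RESOURCE:MEMORY failure
                # is reported as a task failure matched by this name.
                "ErrorEquals": ["States.TaskFailed"],
                "IntervalSeconds": 30,   # first wait, arbitrary
                "MaxAttempts": max_attempts,
                "BackoffRate": 2.0,      # each wait doubles
            }
        ],
        "End": True,
    }


state = build_run_task_state("my-gpu-cluster", "my-gpu-task")
print(json.dumps(state, indent=2))
```

Note that `States.TaskFailed` also matches genuine application failures, so a real state machine would probably want a narrower error name or a low `MaxAttempts`.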


Airflow - Async pod launch

We have a long-running task in Airflow (5 hours or more).
I'm looking for a way to fire and forget (once the pod has started running) and then monitor the status of the task using a sensor.
The rationale is to free the worker and reduce resource consumption.
This line seems to monitor for pod health.
Now, I know I can work around this limitation by using a Cloud Function to trigger the pod, but that would over-complicate the DAG.
I've also opened an issue on GitHub.
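As a rough illustration of the sensor half of this idea, the hypothetical helper below maps a Kubernetes pod phase to what a sensor poke would do. It is not an Airflow API: in a real DAG it would be called from a custom sensor's `poke` method after reading the pod's `status.phase`, and running that sensor in `reschedule` mode is what would free the worker slot between checks.

```python
def poke_pod_phase(phase):
    """Decide the outcome of one sensor poke for a Kubernetes pod phase.

    Returns True when the pod is done, False to keep waiting,
    and raises if the pod failed.
    """
    if phase == "Succeeded":
        return True
    if phase == "Failed":
        raise RuntimeError("pod finished in phase Failed")
    # Pending, Running, or Unknown: not finished yet, check again later.
    return False


for phase in ("Pending", "Running", "Succeeded"):
    print(phase, poke_pod_phase(phase))
```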

Task definitions AWS Fargate

Let us say I am defining a task definition in AWS Fargate. This task definition would be used to start tasks for a multi-container application consisting of 2 web servers. How many task definitions would I need, how many tasks would I pay for, and how many services are created?
I have read a lot of documentation, but it does not click for me. Is there anyone who can explain the correlation between: task definitions, task/s, Docker containers, services and ECS Fargate clusters?
A task definition is a specification. You use it to define one or more containers (with image URIs) that you want to run together, along with other details such as environment variables, CPU/memory requirements, etc. The task definition doesn't actually run anything; it's a description of how things will be set up when something does run.
A task is an actual thing that is running. ECS uses the task definition to run the task; it downloads the container images, configures the runtime environment based on other details in the task definition. You can run one or many tasks for any given task definition. Each running task is a set of one or more running containers - the containers in a task all run on the same instance.
A service in ECS is a way to run N tasks all using the same task definition, and keep those N tasks running if they happen to shut down unexpectedly. Those N tasks can run on different instances in EC2 (although some may run on the same instance depending on the placement strategy used for the service); on Fargate, there are no instances and the tasks "just run", so you don't have to think about placement strategies. You can also use services to connect those tasks to a load balancer, so that requests from a client inside or outside of AWS can be routed evenly across all N tasks. You can update the task definition used by a service, which will then trigger a rolling update (starting up and shutting down running tasks) so that all running tasks will be using the new version of the task definition after the deployment completes. This is used, for example, when you create a new container image and want your service to be updated to use the latest version.
A service is scoped to a cluster. A cluster is really just a name. Different clusters can have different IAM policies and roles, so that you can restrict who can create services in different clusters using IAM.
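To make the relationship concrete, here is a minimal sketch for the two-web-server case, written as plain Python dicts shaped like the arguments to the ECS `register_task_definition` and `create_service` calls. All names, images, and sizes are made up for illustration, and the service dict is not a complete request (a Fargate service also needs a network configuration, omitted here).

```python
# One task definition describing two containers that run together.
task_definition = {
    "family": "two-web-servers",          # hypothetical name
    "requiresCompatibilities": ["FARGATE"],
    "networkMode": "awsvpc",
    "cpu": "512",      # sized at the task level, shared by both containers
    "memory": "1024",
    "containerDefinitions": [
        {"name": "web-a", "image": "example/web-a:latest", "essential": True,
         "portMappings": [{"containerPort": 80}]},
        {"name": "web-b", "image": "example/web-b:latest", "essential": True,
         "portMappings": [{"containerPort": 8080}]},
    ],
}

# One service that keeps N copies (tasks) of that definition running.
service = {
    "cluster": "my-cluster",              # hypothetical name
    "serviceName": "web",
    "taskDefinition": task_definition["family"],
    "desiredCount": 3,
    "launchType": "FARGATE",
}

containers_per_task = len(task_definition["containerDefinitions"])
running_containers = service["desiredCount"] * containers_per_task
print(f"1 task definition, 1 service, {service['desiredCount']} tasks, "
      f"{running_containers} containers")
```

In this sketch a single task definition and a single service are enough. The number of running tasks is whatever `desiredCount` says, and each task carries both containers.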

Delay in Kubernetes Job status update when running many jobs in parallel

I have a bit of a unique use-case where I want to run a large number (thousands to tens of thousands) of Kubernetes Jobs at once. Each job consists of a single container, Parallelism 1 and Completions 1, with no side-car or agent. My cluster has plenty of capacity for the resources I'm requesting.
My problem is that the Job status is not transitioning to Complete for a significant period of time when I run many jobs concurrently.
My application submits Jobs and has a watcher on the namespace - as soon as a Job's status transitions to 'succeeded 1', we delete the Job and send information back to the application. The application needs this to happen as soon as possible in order to define and submit subsequent Jobs.
I'm able to submit new Job requests as fast as I want, and Pod scheduling happens without delay, but beyond about one or two hundred concurrent Jobs I get significant delay between a Job's Pod completing and the Job's status updating to Complete. At only around 1,000 jobs in the cluster, it can easily take 5-10 minutes for a Job status to update.
This tells me there is some process in the Kubernetes Control Plane that needs more resources to process Pod completion events more rapidly, or a configuration option that enables it to process more tasks in parallel. However, my system monitoring tools have not yet been able to identify any Control Plane services that are maxing out their available resources while the cluster processes the backlog, and all other operations on the cluster appear to be normal.
My question is - where should I look for system resource or configuration bottlenecks? I don't know enough about Kubernetes to know exactly what components are responsible for updating a Job's status.

AWS Fargate vs Batch vs ECS for a once a day batch process

I have a batch process, written in PHP and embedded in a Docker container. Basically, it loads data from several web services, does some computation on the data (for about 1 hour), and posts the computed data to another web service. Then the container exits (with a return code of 0 if OK, 1 if there was a failure somewhere in the process). During the process, some logs are written to STDOUT or STDERR. The batch must be triggered once a day.
I was wondering what the best AWS service is to schedule, execute, and monitor my batch process:
At the very beginning, I used an EC2 machine with a crontab: no high availability here, so I decided to switch to a more PaaS-like approach.
Then I used Elastic Beanstalk for Docker, with a non-functional web server (only there to reply to the health check) and a crontab inside the container to wake up my batch command once a day. With an autoscaling rule of min=1 max=1, I have HA (if the container or the VM crashes, it is restarted by AWS).
But now, to be more efficient, I decided to move to some ECS service, with an approach where I do not need to keep EC2 instances awake 23 hours out of 24 for nothing. So I tried Fargate.
With Fargate I defined my task (Fargate type, not the EC2 type) and configured everything on it.
I created a cluster to run my task: I can run my task "by hand, one time", so I know all the settings are correct.
Now, going deeper in Fargate, I want to have my task executed once a day.
It seems to work fine when I use the Scheduled Task feature of ECS: the container starts on time, the process runs, then the container stops. But CloudWatch is missing some metrics: CPUReservation and CPUUtilization are not reported. Also, there is no way to know whether the batch quit with exit code 0 or 1 (every execution ends with status "STOPPED"). So I can't send a CloudWatch alarm if the container execution failed.
I also tried the "Services" feature of Fargate, but it can't handle a batch process, because the container is restarted every time it stops. This is normal, because the container does not run any daemon. There is no way to schedule a service. I want my container to be active only when it needs to work (once a day, for 1 hour at most). With a service, however, the missing metrics are correctly reported in CloudWatch.
Here is my question: what are the most suitable AWS managed services to trigger a container once a day, let it run to do its task, and provide reporting to track execution (CPU usage, batch duration), including an alarm (SNS) when the task fails?
We had the same issue with identifying failed jobs. I suggest you take a look at AWS Batch, where logs for FAILED jobs are available in CloudWatch Logs. Take a look here.
One more thing you should consider is the total cost of ownership of whatever solution you eventually choose. Fargate, in this regard, is quite expensive.
This may be too late for your project, but I thought it could still benefit others.
Have you had a look at AWS Step Functions? It is possible to define a workflow and start tasks on ECS/Fargate (or jobs on EKS for that matter), wait for the results and raise alarms/send emails...
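A minimal sketch of such a workflow, built as a Python dict in Amazon States Language: it runs the Fargate task, waits for it to stop, and publishes to an SNS topic if the task fails. The cluster, task definition, and topic ARN are placeholders, the network configuration a Fargate task needs is omitted, and it assumes a non-zero container exit code surfaces as a failed Task state.

```python
import json


def build_daily_batch_definition(cluster, task_definition, topic_arn):
    """Return a state machine definition: run the batch, alert on failure."""
    return {
        "StartAt": "RunBatch",
        "States": {
            "RunBatch": {
                "Type": "Task",
                # .sync waits for the ECS task to stop before moving on.
                "Resource": "arn:aws:states:::ecs:runTask.sync",
                "Parameters": {
                    "LaunchType": "FARGATE",
                    "Cluster": cluster,
                    "TaskDefinition": task_definition,
                },
                # On any error, hand the error details to NotifyFailure.
                "Catch": [{"ErrorEquals": ["States.ALL"],
                           "Next": "NotifyFailure"}],
                "End": True,
            },
            "NotifyFailure": {
                "Type": "Task",
                "Resource": "arn:aws:states:::sns:publish",
                "Parameters": {
                    "TopicArn": topic_arn,
                    # The caught error object has "Error" and "Cause" fields.
                    "Message.$": "$.Cause",
                },
                "Next": "Failed",
            },
            # Mark the execution itself as failed after notifying.
            "Failed": {"Type": "Fail"},
        },
    }


definition = build_daily_batch_definition(
    "my-cluster", "my-batch-task",
    "arn:aws:sns:us-east-1:123456789012:batch-alerts",  # placeholder ARN
)
print(json.dumps(definition, indent=2))
```

The once-a-day trigger is not part of this definition. One option would be a scheduled EventBridge rule that starts the state machine.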

Kubernetes dynamic Job scaling

I’m finally dipping my toes in the kubernetes pool and wanted to get some advice on the best way to approach a problem I have:
Tech we are using:
GCP Pub/Sub
We need to do bursts of batch processing spread out across a fleet and have decided on the following approach:
New raw data flows in
A node analyses this and breaks the data up into manageable portions which are pushed onto a queue
We have a cluster with Autoscaling On and Min Size ‘0’
A Kubernetes job spins up a pod for each new message on this cluster
When pods can’t pull any more messages they terminate successfully
The question is:
What is the standard approach for triggering jobs such as this?
Do you create a new job each time or are jobs meant to be long lived and re-run?
I have only seen examples using a YAML file; however, we would probably want the node that did the portioning of work to create the job, as it knows how many parallel pods should be run. Would it be recommended to use the Python SDK to create the job spec programmatically? Or, if jobs are long-lived, would you simply hit the Kubernetes API, modify the number of parallel pods required, and then re-run the job?
Jobs in Kubernetes are meant to be short-lived and are not designed to be reused. Jobs are designed for run-once, run-to-completion workloads. Typically they are assigned a specific task, e.g. to process a single queue item.
However, if you want to process multiple items in a work queue with a single instance then it is generally advisable to instead use a Deployment to scale a pool of workers that continue to process items in the queue, scaling the number of pool workers dependent on the number of items in the queue. If there are no work items remaining then you can scale the deployment to 0 replicas, scaling back up when there is work to be done.
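The scaling rule described above can be sketched as a small function. This is only an illustration: the function name, the messages-per-worker ratio, and the cap are assumptions, and reading the queue depth from Pub/Sub and applying the result to the Deployment are left out.

```python
import math


def desired_workers(queue_depth, messages_per_worker, max_workers):
    """Return how many worker replicas to run for a given queue depth."""
    if queue_depth <= 0:
        return 0  # nothing to do: scale the Deployment to zero
    needed = math.ceil(queue_depth / messages_per_worker)
    return min(max_workers, needed)


for depth in (0, 25, 1000):
    print(depth, desired_workers(depth, messages_per_worker=10, max_workers=20))
```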
To create and control your workloads in Kubernetes, the best practice would be to use the Kubernetes SDK. While you can generate YAML files and shell out to another tool like kubectl, using the SDK simplifies configuration and error handling, and it also makes it easier to inspect resources in the cluster.
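As a minimal sketch of building the Job spec programmatically, the function below returns a Job manifest as a plain dict with the parallelism chosen by the caller. The job name, image, and `backoffLimit` value are placeholders. With the official Python client, such a dict could be submitted with `kubernetes.client.BatchV1Api().create_namespaced_job(namespace, body)`, which is not called here so that the example runs without a cluster.

```python
def build_job_manifest(name, image, parallelism):
    """Return a batch/v1 Job manifest that runs `parallelism` worker pods."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            # completions is left unset: work-queue style, the pods decide
            # among themselves when the work is finished.
            "parallelism": parallelism,
            "backoffLimit": 3,  # arbitrary retry budget
            "template": {
                "spec": {
                    # Jobs require Never or OnFailure here.
                    "restartPolicy": "Never",
                    "containers": [{"name": "worker", "image": image}],
                }
            },
        },
    }


manifest = build_job_manifest("process-batch-001", "example/worker:latest", 5)
print(manifest["metadata"]["name"], manifest["spec"]["parallelism"])
```

The node that portions the work would call this once per burst, with a fresh job name each time, since a Job is not meant to be re-run.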