How to give a job manager task permissions to resize the pool? - azure-batch

I'm running embarrassingly parallel workloads, but the number of parallel tasks is not known beforehand. Instead, my job manager task performs a simple computation to determine the number of parallel tasks and then adds those tasks to the job.
Now, as soon as I know the number of parallel tasks, I would like to immediately resize the pool I'm running in accordingly (I am running the job in an auto-pool). Here is how I try to do this.
When I create the JobManagerTask I supply
...
authentication_token_settings=AuthenticationTokenSettings(
    access=[AccessScope.job]),
...
At run time the task receives AZ_BATCH_AUTHENTICATION_TOKEN in its environment, uses it to create a BatchServiceClient, uses the client to add the worker tasks to the job, and ultimately calls client.pool.resize() to increase target_dedicated_nodes. At this stage the task gets an error from the service:
.../site-packages/azure/batch/operations/_pool_operations.py", line 1310, in resize
raise models.BatchErrorException(self._deserialize, response)
azure.batch.models._models_py3.BatchErrorException: Request encountered an exception.
Code: PermissionDenied
Message: {'additional_properties': {}, 'lang': 'en-US', 'value': 'Server failed to authorize the request.\nRequestId:4b34d8e5-7c28-4af2-9e1f-9cf88a486511\nTime:2020-11-26T17:32:55.7673310Z'}
AuthenticationErrorDetail: The supplied authentication token does not have permission to call the requested Url.
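For context, the runtime part of the job manager task is roughly the sketch below. AZ_BATCH_ACCOUNT_URL, AZ_BATCH_JOB_ID and AZ_BATCH_POOL_ID are the standard Batch task environment variables; the task count and command line are placeholders, and older azure-batch releases take base_url instead of batch_url:

import os
from msrest.authentication import BasicTokenAuthentication
from azure.batch import BatchServiceClient, models

number_of_tasks = 8  # result of the job manager's own computation (placeholder)

# Build a client from the job-scoped token handed to the job manager task.
credentials = BasicTokenAuthentication(
    {"access_token": os.environ["AZ_BATCH_AUTHENTICATION_TOKEN"]})
client = BatchServiceClient(
    credentials, batch_url=os.environ["AZ_BATCH_ACCOUNT_URL"])

# Adding the worker tasks to the job succeeds with the job-scoped token...
worker_tasks = [
    models.TaskAddParameter(id=f"worker-{i}", command_line="/bin/bash -c 'do_work.sh'")
    for i in range(number_of_tasks)]
client.task.add_collection(os.environ["AZ_BATCH_JOB_ID"], worker_tasks)

# ...but resizing the pool is rejected with PermissionDenied.
client.pool.resize(
    os.environ["AZ_BATCH_POOL_ID"],
    models.PoolResizeParameter(target_dedicated_nodes=number_of_tasks))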
How can I give the task permission to resize the pool?

Currently, AZ_BATCH_AUTHENTICATION_TOKEN is limited to permissions on the job itself. The pool is a separate resource, even in the auto-pool configuration, so it cannot be modified with that token.
There are two main approaches you can take. You can add a certificate to your account and to your pool, which allows the task to authenticate with a service principal that has permissions on your account, or you can set your pool to autoscale based on the number of pending tasks, which does not give you immediate resizes but instead evaluates and applies them at set intervals as needed.
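As a rough illustration of the autoscale option, the auto-pool's PoolSpecification can carry an autoscale formula instead of a fixed target_dedicated_nodes. This is only a sketch: the VM size, image, node agent SKU, node cap and evaluation interval are placeholder values, and the formula follows the pending-tasks pattern from the Batch autoscale documentation.

from datetime import timedelta
from azure.batch import models

# Illustrative formula: track the number of pending tasks, capped at 20 nodes.
# (Production formulas usually guard against missing samples with GetSamplePercent.)
autoscale_formula = """
maxNumberOfVMs = 20;
pendingTaskSamples = avg($PendingTasks.GetSample(180 * TimeInterval_Second));
$TargetDedicatedNodes = min(maxNumberOfVMs, pendingTaskSamples);
"""

pool_spec = models.PoolSpecification(
    vm_size="STANDARD_D2_V3",  # placeholder
    virtual_machine_configuration=models.VirtualMachineConfiguration(
        image_reference=models.ImageReference(
            publisher="canonical", offer="ubuntuserver", sku="18.04-lts"),
        node_agent_sku_id="batch.node.ubuntu 18.04"),
    enable_auto_scale=True,
    auto_scale_formula=autoscale_formula,
    auto_scale_evaluation_interval=timedelta(minutes=5),  # 5 minutes is the minimum
)

auto_pool = models.AutoPoolSpecification(
    auto_pool_id_prefix="autoscaled",
    pool_lifetime_option=models.PoolLifetimeOption.job,
    pool=pool_spec)

For the certificate/service principal route, the job manager task would instead build its BatchServiceClient from azure.common.credentials.ServicePrincipalCredentials scoped to the https://batch.core.windows.net/ resource; that token can carry account-level permissions and therefore allow pool.resize().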

Related

How to make AWS Stepfunctions/ECS schedule tasks only when resources are available

I am using ECS/EC2 to run Step Functions (Standard type) tasks.
I am using an API to trigger the Step Functions execution, so sometimes I have a peak beyond the capacity of the ECS cluster and I frequently get errors like this:
[{"Arn":"arn:aws:ecs:us-east-1:432214534264:container-instance/192a09715eea48828b798600b5c67532","Reason":"RESOURCE:GPU"}] (Service: AmazonECS; Status Code: 400; Error Code: AmazonECS.Unknown; Request ID: d33c3afa-d158-45f4-83f4-2567efb53017; Proxy: null)
This means I do not have enough GPU (sometimes MEMORY) to run more tasks.
So, how can I configure the Step Function and/or the ECS cluster to only schedule tasks when its resources are available?
I tried PlacementConstraints and PlacementStrategies, but they are not a fit for this, as these are guides for distributing tasks when resources are sufficient, not for pausing new task creation until resources become available.
Thanks

Complete parallel Kubernetes job when one worker pod succeeds

I have a simple containerised python script which I am trying to parallelise with Kubernetes. This script guesses hashes until it finds a hashed value below a certain threshold.
I am only interested in the first such value, so I wish to create a Kubernetes job that spawns n worker pods and completes as soon as one worker pod finds a suitable value.
By default, Kubernetes jobs wait until all worker pods complete before marking the job as complete. I have so far been unable to find a way around this (no mention of this job pattern in the documentation), and have been relying on checking the logs of bare pods via a bash script to determine whether one has completed.
Is there a native means to achieve this? And, if not, what would be the best approach?
Hi, have a look at this link: https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/#parallel-jobs.
I've never tried it, but it seems possible to launch several pods and configure the Job to end when x pods have finished. In your case x is 1.
We can define two specifications for parallel Jobs:
1. Parallel Jobs with a fixed completion count:
- specify a non-zero positive value for .spec.completions.
- the Job represents the overall task, and is complete when there is one successful Pod for each value in the range 1 to .spec.completions.
- not implemented yet: each Pod is passed a different index in the range 1 to .spec.completions.
2. Parallel Jobs with a work queue:
- do not specify .spec.completions; it defaults to .spec.parallelism.
- the Pods must coordinate amongst themselves or with an external service to determine what each should work on. For example, a Pod might fetch a batch of up to N items from the work queue.
- each Pod is independently capable of determining whether or not all its peers are done, and thus that the entire Job is done.
- when any Pod from the Job terminates with success, no new Pods are created.
- once at least one Pod has terminated with success and all Pods are terminated, then the Job is completed with success.
- once any Pod has exited with success, no other Pod should still be doing any work for this task or writing any output. They should all be in the process of exiting.
For a fixed completion count Job, you should set .spec.completions to the number of completions needed. You can set .spec.parallelism, or leave it unset and it will default to 1.
For a work queue Job, you must leave .spec.completions unset, and set .spec.parallelism to a non-negative integer.
For more information about how to make use of the different types of job, see the job patterns section.
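If it helps, here is a minimal sketch of the work-queue variant using the official Python kubernetes client; the Job name, image and parallelism are placeholders, and the worker containers themselves must notice that a result has been found (for example via a shared queue or flag) and exit:

from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() when running inside the cluster

job = client.V1Job(
    api_version="batch/v1",
    kind="Job",
    metadata=client.V1ObjectMeta(name="hash-search"),
    spec=client.V1JobSpec(
        parallelism=5,       # number of concurrent worker pods
        # .spec.completions left unset -> "work queue" style Job
        backoff_limit=4,
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[client.V1Container(
                    name="worker",
                    image="registry.example.com/hash-worker:latest")]))))  # placeholder image

client.BatchV1Api().create_namespaced_job(namespace="default", body=job)

With .spec.completions left unset, the Job is marked complete once one worker exits successfully and the remaining workers have terminated, which matches the "first suitable hash wins" behaviour described in the question.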
You can also take a look at a single Job which starts a controller Pod:
This pattern is for a single Job to create a Pod which then creates other Pods, acting as a sort of custom controller for those Pods. This allows the most flexibility, but may be somewhat complicated to get started with and offers less integration with Kubernetes.
One example of this pattern would be a Job which starts a Pod which runs a script that in turn starts a Spark master controller (see spark example), runs a spark driver, and then cleans up.
An advantage of this approach is that the overall process gets the completion guarantee of a Job object, but maintains complete control over what Pods are created and how work is assigned to them.
At the same time, take into consideration that by default a Job's completion status is only set when the specified number of successful completions is reached; this ensures that all tasks are processed properly. Marking the Job complete before all tasks have finished is not a safe solution.
You should also know that finished Jobs are usually no longer needed in the system, and keeping them around will put pressure on the API server. If the Jobs are managed directly by a higher-level controller, such as CronJobs, they can be cleaned up by the CronJobs based on the specified capacity-based cleanup policy.
Here is the official documentation: jobs-parallel-processing, parallel-jobs.
A useful blog post: article-parallel job.
EDIT:
Another option is to create a special script which continuously checks for the value you are looking for. A Job is then not necessary; you can simply use a Deployment.

AWS Fargate vs Batch vs ECS for a once a day batch process

I have a batch process, written in PHP and embedded in a Docker container. Basically, it loads data from several webservices, does some computation on the data (for about an hour), and posts the computed data to another webservice, then the container exits (with a return code of 0 if OK, 1 if something failed along the way). During the process, some logs are written to STDOUT or STDERR. The batch must be triggered once a day.
I was wondering what is the best AWS service to use to schedule, execute, and monitor my batch process :
at the very beginning, I used an EC2 machine with a crontab: no high availability here, so I decided to switch to a more PaaS approach.
then, I was using Elastic Beanstalk for Docker, with a non-functional webserver (only there to reply to the health check), and a crontab inside the container to wake up my batch command once a day. With an autoscaling rule of min=1 max=1, I have HA (if the container or the VM crashes, it is restarted by AWS).
but now, to be more efficient, I decided to move to some ECS service and have an approach where I do not need to keep EC2 instances awake 23 hours a day for nothing. So I tried Fargate.
with Fargate I defined my task (Fargate type, not the EC2 type) and configured everything on it.
I created a cluster to run my task: I can run my task "by hand, one time", so I know all the settings are correct.
Now, going deeper into Fargate, I want to have my task executed once a day.
It seems to work fine when I use the Scheduled Task feature of ECS: the container starts on time, the process runs, then the container stops. But CloudWatch is missing some metrics: CPUReservation and CPUUtilization are not reported. Also, there is no way to know whether the batch exited with code 0 or 1 (all executions stop with status "STOPPED"), so I can't raise a CloudWatch alarm if the container execution failed.
I also tried the "Services" feature of Fargate, but it can't handle a batch process, because the container is restarted every time it stops. This is normal, since the container does not run any daemon, and there is no way to schedule a service; I want my container to be active only when it needs to work (once a day, for at most 1h). On the other hand, with a service the metrics that were missing above are correctly reported in CloudWatch.
Here is my question: what is the most suitable AWS managed service to trigger a container once a day, let it run its task, and provide reporting facilities to track the execution (CPU usage, batch duration), including an alarm (SNS) when the task fails?
We had the same issue with identifying failed jobs. I suggest you take a look at AWS Batch, where logs for FAILED jobs are available in CloudWatch Logs; take a look here.
One more thing you should consider is total cost of ownership of whatever solution you choose eventually. Fargate, in this regard, is quite expensive.
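To make the AWS Batch suggestion a bit more concrete, here is a minimal boto3 sketch; the job queue and job definition names are placeholders, and the daily trigger itself can be a scheduled EventBridge/CloudWatch Events rule targeting the job queue:

import boto3

batch = boto3.client("batch")

# Submit the containerised PHP batch as an AWS Batch job.
response = batch.submit_job(
    jobName="daily-php-batch",
    jobQueue="my-job-queue",          # placeholder
    jobDefinition="my-php-batch:1",   # placeholder
)
job_id = response["jobId"]

# Later (for example from a monitoring script), check how the job ended.
job = batch.describe_jobs(jobs=[job_id])["jobs"][0]
print(job["status"])                              # e.g. SUCCEEDED or FAILED
print(job.get("container", {}).get("exitCode"))   # the batch's own exit code

describe_jobs exposes both the terminal status and the container exit code, which is the signal missing from a plain ECS scheduled task, and failed jobs keep their logs in CloudWatch Logs.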
This may be too late for your project, but I thought it could still benefit others.
Have you had a look at AWS Step Functions? It is possible to define a workflow and start tasks on ECS/Fargate (or jobs on EKS for that matter), wait for the results and raise alarms/send emails...

Azure Data Factory - Custom Activity never complete

I'm new to Azure and I am working on Data Factory and a custom activity. I am creating a pipeline with only one custom activity (the activity actually does nothing and returns immediately).
However, it seems that the custom activity is sent to the Batch account: I can see the Job and Task being created, but the Task remains "Active" and never completes.
Is there anything I missed?
Job: created and belongs to the desired application pool (screenshot of the Job omitted).
Task: not sure why, but the application pool is n/a and the task never completes (screenshots of the Job -> Task status and of the task's application pool omitted).
Code of the dummy activity: I'm using ADF v2, so it is just a simple console program (code screenshot omitted).
I figured it out.
The problem comes from the Batch account. The node in the pool failed at its start task, which blocked the node from taking any tasks. I changed the start task of the pool not to wait for success, so that even if the start task fails the node can still take tasks.
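For reference, when the pool is managed through the azure-batch Python SDK, the switch in question is the wait_for_success flag on the pool's start task. This is only a sketch and the command line is a placeholder:

from azure.batch import models

# A start task that does not block the node when it fails.
start_task = models.StartTask(
    command_line="/bin/bash -c 'install_dependencies.sh'",  # placeholder
    wait_for_success=False,   # the node can take tasks even if the start task fails
    max_task_retry_count=1)

With wait_for_success=False the node joins the pool and picks up tasks even if the start task fails; the better long-term fix is, of course, to repair the start task itself.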

matlab does not save variables on parallel batch job

I was running a batch job on a cluster, and at the end I was trying to save results using save(), but I got the following error:
ErrorMessage: The parallel job was cancelled because the task with ID 1
terminated abnormally for the following reason:
Cannot create 'results.mat' because '/home/myusername/experiments' does not exist.
Why is that happening? What is the correct way to save variables in a parallel job?
You can use SAVE in the normal way during execution of a parallel job, but you also need to be aware of where you are running. If you are running using the MathWorks jobmanager on the cluster, then depending on the security level set on the jobmanager, you might not have access to the same set of directories as you normally would. More about that stuff here: http://www.mathworks.co.uk/help/mdce/setting-job-manager-security.html