GitHub workflow job timeout-minutes is ignored. Why? - github

According to https://docs.github.com/en/actions/reference/workflow-syntax-for-github-actions#jobsjob_idtimeout-minutes
The timeout-minutes parameter defaults to 360 minutes (6 hours).
I parallelized my mutation testing so that my workflow takes around 6.5 hours to run (Stryker mutation testing of ~1600 mutants on just 2 cores, split across 9 jobs in parallel). Thus, I've set timeout-minutes to 420 minutes (7 hours) for the mutation job just in case: https://github.com/lbragile/TabMerger/blob/b53a668678b7dcde0dd8f8b06ae23ee668ff8f9e/.github/workflows/testing.yml#L53
This seems to be ignored, as the workflow still ends after 6 hours 23 min (without warnings or errors): https://github.com/lbragile/TabMerger/runs/2035483167?check_suite_focus=true
Why is my value being ignored?
Also, is there anything I can do to use more CPUs on the workflow's virtual machine?

GitHub-hosted runners are limited to a maximum of 6 hours per job.
Usage limits
There are some limits on GitHub Actions usage when using GitHub-hosted runners. These limits are subject to change.
[...]
Job execution time - Each job in a workflow can run for up to 6 hours of execution time. If a job reaches this limit, the job is terminated and fails to complete.
https://docs.github.com/en/actions/reference/usage-limits-billing-and-administration#usage-limits
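For reference, a minimal sketch of how such a job is declared (the workflow name, job name, runner, and steps are illustrative assumptions, not copied from the linked workflow); the larger timeout-minutes value is accepted without error, but on GitHub-hosted runners the 6-hour platform limit still applies first:

name: Testing                  # hypothetical workflow name
on: push
jobs:
  mutation:                    # hypothetical job name
    runs-on: ubuntu-latest
    timeout-minutes: 420       # requested, but GitHub-hosted runners are still capped at 360 minutes
    steps:
      - uses: actions/checkout@v2
      - run: npx stryker run   # placeholder for the actual mutation-testing command

In practice, the usual ways around the cap are a self-hosted runner or splitting the mutants across more parallel jobs so that each job finishes well under 6 hours.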

Related

Flink Completion Exception Couldn't acquire the minimum required resources

Hi, I have a Flink job where every time there's an exception, it is followed by around 7 additional exceptions of
java.util.concurrent.CompletionException: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Could not acquire the minimum required resources.
For example the timing
5:47:40 - Exception
5:50:40 - NoResourceAvailableException
5:50:47 - NoResourceAvailableException
5:50:56 - NoResourceAvailableException
5:51:10 - NoResourceAvailableException
5:51:31 - NoResourceAvailableException
5:52:04 - NoResourceAvailableException
5:52:57 - NoResourceAvailableException
Only after around 5 minutes did the job run again.
The environment has 15 task managers and 30 task slots.
The job runs on 28 task slots; we left 2 task slots free so that on failure the environment has standby task slots.
As you can see, this still doesn't help and the job takes 5 minutes until it's up again.
The environment runs on Kubernetes.
My guess is that the pods restart because of the exception and that the JobManager waits for the same pod to be restored, but I don't understand why it won't use the standby task slots.
The uptime of the job is crucial, so every minute counts and we try to have minimal downtime.
I tried changing the number of task slots and task managers so there would be more, weaker instances, but the job won't run in any different configuration; we get backpressure instead.
And adding more standby task slots in the current state is extremely expensive, because each pod (task manager) has a lot of resources.
Thanks to anyone who can help.

Why is my Azure DevOps Migration timing out after several hours?

I have a long-running Migration (don't ask) being run by an Azure DevOps Release pipeline.
Specifically, it's an "Azure SQL Database deployment" activity, running a "SQL Script File" Deployment Type.
Despite having configured maximum values for all the timeouts in the Invoke-Sql Additional Parameters settings, my migration is still timing out.
Specifically, I get:
We stopped hearing from agent Hosted Agent. Verify the agent machine is running and has a healthy network connection. Anything that terminates an agent process, starves it for CPU, or blocks its network access can cause this error.
So far it's timed out after:
6:13:15
6:13:18
6:14:41
6:10:19
So "after 6 and a bit hours". It's ~22,400 seconds, which doesn't seem like any obvious kind of number either :)
Why? And how do I fix it?
It turns out that Azure DevOps uses hosted agents to execute each task in a pipeline, and those agents have innate lifetimes, independent of whatever task they're running.
https://learn.microsoft.com/en-us/azure/devops/pipelines/troubleshooting/troubleshooting?view=azure-devops#job-time-out
A pipeline may run for a long time and then fail due to job time-out. Job timeout closely depends on the agent being used. Free Microsoft hosted agents have a max timeout of 60 minutes per job for a private repository and 360 minutes for a public repository. To increase the max timeout for a job, you can opt for any of the following.
Buy a Microsoft hosted agent which will give you 360 minutes for all jobs, irrespective of the repository used
Use a self-hosted agent to rule out any timeout issues due to the agent
Learn more about job timeout.
So I'm hitting the "360 minute" limit (presumably they give you a little extra on top, so that no-one complains?).
The solution is to use a self-hosted agent (or make my Migration run in under 6 hours, of course).
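In a YAML pipeline, switching to a self-hosted agent and lifting the timeout looks roughly like the sketch below (the trigger, pool name, and migration step are placeholder assumptions, not taken from the release pipeline in question); timeoutInMinutes: 0 means "use the maximum allowed", which is effectively unlimited on a self-hosted agent:

trigger: none                      # hypothetical trigger; run manually
jobs:
- job: RunMigration                # hypothetical job name
  timeoutInMinutes: 0              # 0 = maximum allowed; no 6-hour cap on self-hosted agents
  pool:
    name: MySelfHostedPool         # hypothetical self-hosted agent pool
  steps:
  - script: ./run-migration.sh     # placeholder for the SQL migration step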

How to control interval between kubernetes cronjobs (i.e. cooldown period after completion)

If I set up a kubernetes cronjob with for example
spec:
  schedule: "*/5 * * * *"
  concurrencyPolicy: Forbid
then it will create a job every 5 minutes.
However if the job takes e.g. 4 minutes, then it will create another job 1 minute after the previous job completed.
Is there a way to make it create a job every 5 minutes after the previous job finished?
You might say: just make the schedule */9 * * * * to account for the 4 minutes the job takes, but the job might not be predictable like that.
Unfortunately, Kubernetes CronJob has no way to start the timer only after a job is completed (for example, 5 minutes after completion).
A word about cron:
The software utility cron is a time-based job scheduler in Unix-like computer operating systems. Users that set up and maintain software environments use cron to schedule jobs (commands or shell scripts) to run periodically at fixed times, dates, or intervals.
-- Wikipedia.org: Cron
The behavior of your CronJob within Kubernetes environment can be modified by:
The schedule field in the spec definition, as already mentioned:
schedule: "*/5 * * * *"
The optional startingDeadlineSeconds field, which describes a deadline in seconds for starting a job. If the job doesn't start within that time period, it is counted as failed. After 100 missed schedules, it will no longer be scheduled.
The concurrencyPolicy field, which specifies how concurrent executions of the same job are handled:
Allow - concurrency will be allowed
Forbid - if previous Job wasn't finished the new one will be skipped
Replace - current Job will be replaced with a new one
The suspend field; if it is set to true, all subsequent executions are suspended. This setting does not apply to already started executions.
You can refer to the official documentation: CronJobs
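Putting those fields together, a minimal CronJob manifest might look like the sketch below (the name and container are placeholders; on clusters older than Kubernetes 1.21 the apiVersion would be batch/v1beta1):

apiVersion: batch/v1
kind: CronJob
metadata:
  name: example-cronjob              # hypothetical name
spec:
  schedule: "*/5 * * * *"
  startingDeadlineSeconds: 60        # count the run as failed if it cannot start within 60 s of its scheduled time
  concurrencyPolicy: Forbid          # skip the new run if the previous job is still running
  suspend: false                     # set to true to pause all future runs
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
          - name: job
            image: busybox           # placeholder image
            command: ["sh", "-c", "echo running the job"]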
As it's unknown what type of job you want to run, you could try to:
Write a shell script along the lines of:
#!/bin/sh
# Run the job to completion, then wait 5 minutes before starting the next run
while true
do
  HERE_RUN_YOUR_JOB_AND_WAIT_FOR_COMPLETION.sh   # placeholder for your actual job
  sleep 300   # 5 * 60 seconds
done
Create an image that runs the above script and use it as a pod in Kubernetes (a minimal pod manifest is sketched below).
Try to get logs from this pod if necessary, as described here.
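A rough sketch of such a pod (the pod name, image, and script path are placeholder assumptions; in practice you would bake the loop script above into your own image):

apiVersion: v1
kind: Pod
metadata:
  name: looping-job                          # hypothetical name
spec:
  restartPolicy: Always                      # restart the loop if the container ever exits
  containers:
  - name: looper
    image: my-registry/looping-job:latest    # placeholder image that contains the loop script
    command: ["/bin/sh", "/run-loop.sh"]     # hypothetical path of the script inside the image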
Another way would be to create a pod that could connect to Kubernetes API.
Take a look on additional resources about Jobs:
Kubernetes.io: Fine parallel processing work queue
Kubernetes.io: Coarse parallel processing work queue
Please let me know if you have any questions about that.

Total Datastage Jobs Greater Than My Max Jobs I Set

I have the below system policies defined in the InfoSphere DataStage Operations Console under "Workload Management (WLM)".
Sometimes, the total number of currently running jobs shoots up to 150, although I have defined the maximum running job count as 40 in WLM.
Whenever the currently running job count increases beyond 100, most of the DataStage jobs start showing increased startup times in the Director log and take a long time to run; if the job concurrency is less than 100, the same set of jobs runs fine with startup times in seconds. Please suggest how to address this issue and how to enforce that the number of currently running jobs does not exceed, e.g., 100 at any point in time. Thanks a lot!
This is working as designed; generally, the WLM system is used to control the start of parallel and server jobs. It uses a set of user-defined queues, and when a job is started, it is submitted to a designated queue. In the figure above, the parallel jobs are in a queue named 'MediumPriorityJobs'.
Note that the sequence job is not in the queue, so it is not counted toward the total running workloads controlled by the WLM Job Count system policy.
Source: https://www.ibm.com/support/pages/how-interpret-job-count-maximum-running-jobs-system-policy-ibm-infosphere-information-server-workload-management-wlm

Running Parallel Tasks in Batch

I have a few questions about running tasks in parallel in Azure Batch. Per the official documentation, "Azure Batch allows you to set maximum tasks per node up to four times (4x) the number of node cores."
Is there any setup, other than specifying the max tasks per node when creating a pool, that needs to be done (in the code) to be able to run parallel tasks with Batch?
If I am understanding this correctly, a Standard_D1_v2 machine with 1 core can run up to 4 concurrent tasks in parallel. Is that right? If yes, I ran some tests and I am not quite sure about the behavior I got. In a pool of D1_v2 machines set up to run 1 task per node, my job execution time was about 16 minutes. Then, using the same applications and the same parameters, with the only change being a new pool with the same setup (also D1_v2) except running 4 tasks per node, I still got a job execution time of about 15 minutes. There was no improvement in job execution time from running tasks in parallel. What could be happening? What am I missing here?
I ran a test with a pool of D3_v2 machines (4 cores), set up to run 2 tasks per core for a total of 8 tasks per node, and another test with a pool of the same number of D2_v2 machines (2 cores), set up to run 2 tasks per core for a total of 4 parallel tasks per node. The run time/job execution time for both tests was the same. Isn't there supposed to be an improvement, considering that 8 tasks run per node in the first test versus 4 per node in the second? If so, what could be the reason I'm not seeing it?
No, although you may want to look into the task scheduling policy (compute node fill type) to control how your tasks are distributed amongst the nodes in your pool.
How many tasks are in your job? Are your tasks compute-bound? If so, you won't see any improvement (perhaps even end-to-end performance degradation).
Batch merely schedules the tasks concurrently on the node. If the command/process you're running utilizes all of the cores on the machine and is compute-bound, you won't see an improvement. You should double-check your tasks' start and end times within the job, and the node execution info, to see if they are actually being scheduled concurrently on the same node.