How to query which node executed a step? How to schedule a step on the same node as a previous step? - argo-workflows

I have 3 argo workflow steps that I want to always schedule on the same node. Is it possible to limit execution of the 2nd and 3rd steps to the node that executed the 1st step? How can I get the executing node's name/label while a step is running? Is this possible? My idea is that I could put it into an output variable, then use that with a nodeSelector in the subsequent steps.
What I have currently is that all 3 steps are always scheduled on one designated node using:
spec:
  templates:
  - name: first-step
    nodeSelector:
      mycustomlabel: "true"
But the issue is this creates a CPU and disk space bottleneck.
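For illustration, here is a minimal, untested sketch of the output-variable idea. It assumes the node's kubernetes.io/hostname label matches its name (true on most clusters); the step names, images, and file path are placeholders. The first step captures its node name via the downward API and exports it as an output parameter, and later steps pin themselves to that node:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: same-node-
spec:
  entrypoint: main
  templates:
  - name: main
    steps:
    - - name: first
        template: record-node
    - - name: second
        template: pinned
        arguments:
          parameters:
          - name: node
            value: "{{steps.first.outputs.parameters.node}}"
  - name: record-node
    container:
      image: alpine:3.12
      command: [sh, -c]
      # MY_NODE is injected by the downward API below
      args: ["echo $MY_NODE > /tmp/node"]
      env:
      - name: MY_NODE
        valueFrom:
          fieldRef:
            fieldPath: spec.nodeName
    outputs:
      parameters:
      - name: node
        valueFrom:
          path: /tmp/node
  - name: pinned
    inputs:
      parameters:
      - name: node
    # pin this step to the node recorded by the first step
    nodeSelector:
      kubernetes.io/hostname: "{{inputs.parameters.node}}"
    container:
      image: alpine:3.12
      command: [sh, -c, "echo running on the same node as the first step"]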

Related

How to control interval between kubernetes cronjobs (i.e. cooldown period after completion)

If I set up a kubernetes cronjob with for example
spec:
  schedule: "*/5 * * * *"
  concurrencyPolicy: Forbid
then it will create a job every 5 minutes.
However if the job takes e.g. 4 minutes, then it will create another job 1 minute after the previous job completed.
Is there a way to make it create a job every 5 minutes after the previous job finished?
You might say: just make the schedule */9 * * * * to account for the 4 minutes the job takes, but the job might not be that predictable.
Unfortunately, a Kubernetes CronJob offers no way to start the timer (for example, 5 minutes) only after a job has completed.
A word about cron:
The software utility cron is a time-based job scheduler in Unix-like computer operating systems. Users that set up and maintain software environments use cron to schedule jobs (commands or shell scripts) to run periodically at fixed times, dates, or intervals.
-- Wikipedia.org: Cron
The behavior of your CronJob within a Kubernetes environment can be modified by:
The schedule in the spec definition, as mentioned:
schedule: "*/5 * * * *"
The optional startingDeadlineSeconds field, which describes a deadline in seconds for starting a job. If the job doesn't start within that time period, it is counted as failed. After 100 missed schedules, it will no longer be scheduled.
The concurrency policy, which specifies how concurrent executions of the same Job are handled:
Allow - concurrency is allowed
Forbid - if the previous Job hasn't finished, the new one will be skipped
Replace - the current Job will be replaced with a new one
The suspend parameter: if it is set to true, all subsequent executions are suspended. This setting does not apply to executions that have already started.
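For reference, a minimal CronJob sketch showing these fields together (the name, values, and image are examples only):

apiVersion: batch/v1beta1 # batch/v1 on Kubernetes 1.21+
kind: CronJob
metadata:
  name: example-cronjob
spec:
  schedule: "*/5 * * * *"
  startingDeadlineSeconds: 60
  concurrencyPolicy: Forbid
  suspend: false
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: task
            image: busybox
            command: ["sh", "-c", "echo running"]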
You can refer to the official documentation: CronJobs
As it's unknown what type of Job you want to run, you could try the following:
Write a shell script along the lines of:
while true
do
  HERE_RUN_YOUR_JOB_AND_WAIT_FOR_COMPLETION.sh
  sleep 300 # ( 5 * 60 seconds )
done
Create an image that runs the above script and use it as a pod in Kubernetes; a minimal sketch follows below.
If necessary, get logs from this pod as described here.
Another way would be to create a pod that can connect to the Kubernetes API.
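For illustration, a minimal Pod sketch of the looping approach (the image name and script path are assumptions):

apiVersion: v1
kind: Pod
metadata:
  name: looping-job
spec:
  restartPolicy: Always # restart the loop if the container ever dies
  containers:
  - name: worker
    image: my-registry/my-job:latest # hypothetical image containing run-job.sh
    command: ["/bin/sh", "-c"]
    args:
    - |
      while true; do
        /run-job.sh   # hypothetical script: run the job and wait for completion
        sleep 300     # wait 5 minutes after the previous run finished
      done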
Take a look at these additional resources about Jobs:
Kubernetes.io: Fine parallel processing work queue
Kubernetes.io: Coarse parallel processing work queue
Please let me know if you have any questions about this.

Complete parallel Kubernetes job when one worker pod succeeds

I have a simple containerised python script which I am trying to parallelise with Kubernetes. This script guesses hashes until it finds a hashed value below a certain threshold.
I am only interested in the first such value, so I wish to create a Kubernetes job that spawns n worker pods and completes as soon as one worker pod finds a suitable value.
By default, Kubernetes jobs wait until all worker pods complete before marking the job as complete. I have so far been unable to find a way around this (no mention of this job pattern in the documentation), and have been relying on checking the logs of bare pods via a bash script to determine whether one has completed.
Is there a native means to achieve this? And, if not, what would be the best approach?
Take a look at this link: https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/#parallel-jobs.
I've never tried it, but it seems possible to launch several pods and configure the job to finish when x pods have completed. In your case x is 1.
We can define two specifications for parallel Jobs:
1. Parallel Jobs with a fixed completion count:
- specify a non-zero positive value for .spec.completions.
- the Job represents the overall task, and is complete when there is one successful Pod for each value in the range 1 to .spec.completions.
- not implemented yet: each Pod is passed a different index in the range 1 to .spec.completions.
2. Parallel Jobs with a work queue:
- do not specify .spec.completions; it defaults to .spec.parallelism.
- the Pods must coordinate amongst themselves or with an external service to determine what each should work on. For example, a Pod might fetch a batch of up to N items from the work queue.
- each Pod is independently capable of determining whether or not all its peers are done, and thus that the entire Job is done.
- when any Pod from the Job terminates with success, no new Pods are created.
- once at least one Pod has terminated with success and all Pods are terminated, then the Job is completed with success.
- once any Pod has exited with success, no other Pod should still be doing any work for this task or writing any output. They should all be in the process of exiting.
For a fixed completion count Job, you should set .spec.completions to the number of completions needed. You can set .spec.parallelism, or leave it unset and it will default to 1.
For a work queue Job, you must leave .spec.completions unset, and set .spec.parallelism to a non-negative integer.
For more information about how to make use of the different types of job, see the job patterns section.
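For illustration, a minimal work-queue style Job sketch for the hash-guessing case (the image is a placeholder). With .spec.completions unset, the Job completes once one pod succeeds and the rest have terminated; the workers themselves still need a way to notice that a peer has succeeded and exit:

apiVersion: batch/v1
kind: Job
metadata:
  name: hash-search
spec:
  parallelism: 8 # n worker pods
  # .spec.completions deliberately left unset -> work-queue semantics
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: worker
        image: my-registry/hash-guesser:latest # hypothetical worker image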
You can also take a look at a single Job which starts a controller pod:
This pattern is for a single Job to create a Pod which then creates other Pods, acting as a sort of custom controller for those Pods. This allows the most flexibility, but may be somewhat complicated to get started with and offers less integration with Kubernetes.
One example of this pattern would be a Job which starts a Pod which runs a script that in turn starts a Spark master controller (see spark example), runs a spark driver, and then cleans up.
An advantage of this approach is that the overall process gets the completion guarantee of a Job object, but complete control over what Pods are created and how work is assigned to them.
At the same time, take into consideration that the completion status of a Job is, by default, set when the specified number of successful completions is reached; this ensures that all tasks are processed properly. Marking the Job complete before all tasks have finished is not a safe solution.
You should also know that finished Jobs are usually no longer needed in the system. Keeping them around in the system will put pressure on the API server. If the Jobs are managed directly by a higher level controller, such as CronJobs, the Jobs can be cleaned up by CronJobs based on the specified capacity-based cleanup policy.
Here is the official documentation: jobs-parallel-processing, parallel-jobs.
Useful blog: article-parallel job.
EDIT:
Another option is to create a special script that continuously checks for the value you are looking for. A Job would then not be necessary; you can simply use a Deployment, as sketched below.
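A minimal Deployment sketch of that approach (the image and script names are hypothetical):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: value-checker
spec:
  replicas: 1
  selector:
    matchLabels:
      app: value-checker
  template:
    metadata:
      labels:
        app: value-checker
    spec:
      containers:
      - name: checker
        image: my-registry/checker:latest # hypothetical image containing check.sh
        command: ["/bin/sh", "-c"]
        args:
        - |
          while true; do
            /check.sh   # hypothetical script that checks for the value
            sleep 60    # check every minute
          done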

Getting kubernetes cronjob history

I have a CronJob which runs every 15 minutes. Say it has been running for the last year.
Is it possible to get the complete history using the Kube API? Or is it possible to control the maximum history that can be stored?
Also, can we get the status (success/failure) of each run along with the total completion time?
Does the pod die after completing the Job?
A CronJob creates a Job object for each execution.
For regular Jobs you can configure .spec.ttlSecondsAfterFinished along with the TTLAfterFinished feature gate to configure which Job instances are retained.
For CronJob you can specify the .spec.successfulJobsHistoryLimit to configure the number of managed Job instances to be retained.
You can get the desired information from these objects.
The pod does not die when the job completes, it is the other way around: If the pod terminates without an error, the job is considered completed.
The .spec.successfulJobsHistoryLimit and .spec.failedJobsHistoryLimit fields are optional.
These fields specify how many completed and failed jobs should be kept.
By default, they are set to 3 and 1 respectively.
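For example, a sketch that raises the history limits above their defaults (the name, schedule, and image are illustrative):

apiVersion: batch/v1beta1 # batch/v1 on Kubernetes 1.21+
kind: CronJob
metadata:
  name: every-15-min
spec:
  schedule: "*/15 * * * *"
  successfulJobsHistoryLimit: 20 # keep more than the default 3
  failedJobsHistoryLimit: 5      # keep more than the default 1
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: task
            image: busybox
            command: ["sh", "-c", "echo run"]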

Schedule cron job to never happen?

Here is part of my CronJob spec:
kind: CronJob
spec:
  schedule: #{service.schedule}
For a specific environment a cron job is set up, but I never want it to run. Can I write some value into schedule: that will cause it to never run?
I haven't found any documentation for all supported syntax, but I am hoping for something like:
#never or #InABillionYears
#reboot doesn't guarantee that the job will never run. It will actually run every time your system is booted or rebooted, which may well happen. It will also run each time the cron daemon is restarted, so you would have to rely on "typically it should not happen" holding true on your system...
There are far more certain ways to ensure that a CronJob will never be run:
At the Kubernetes level, by suspending the job: set its .spec.suspend field to true.
You can easily set it using patch:
kubectl patch cronjobs <job-name> -p '{"spec" : {"suspend" : true }}'
At the cron level. Use a trick based on the fact that crontab syntax is not strictly validated, and set a date that you can be sure will never occur, like the 31st of February. Cron accepts this because it doesn't check the day of the month against the value set in the month field; it only requires valid numbers in both fields (1-31 and 1-12 respectively). You can set it to something like:
* * 31 2 *
which for cron is a perfectly valid value, but we know that such a date is impossible and will never happen.
kind: CronJob
spec:
  suspend: true
Why do you need this to be a CronJob in the first place? If you never want it to run, you could specify a simple Job: https://kubernetes.io/docs/concepts/workloads/controllers/job/
I think you can use #reboot,
see: https://en.wikipedia.org/wiki/Cron
#reboot configures a job to run once when the daemon is started. Since cron is typically never restarted, this typically corresponds to the machine being booted.

How do I stop a CronJob from recreating failed Jobs?

When, for whatever reason, I delete the pod running the Job that was started by a CronJob, I immediately see a new pod being created. It is only once I delete the pod something like six times (the backoffLimit number) that new ones stop being created.
Of course, if I'm actively monitoring the process, I can delete the CronJob, but what if the Pod inside the job fails when I'm not looking? I would like it not to be recreated.
How can I stop the CronJob from persisting in creating new jobs (or pods?), and wait until the next scheduled time if the current job/pod failed? Is there something similar to Jobs' backoffLimit, but for CronJobs?
Set startingDeadlineSeconds to a large value or leave it unset (the default).
At the same time, set .spec.concurrencyPolicy to Forbid, so the CronJob skips the new job run while the previously created job is still running.
If startingDeadlineSeconds is set to a large value or left unset (the default) and concurrencyPolicy is set to Forbid, the job will not be run again after a failure; it will wait for the next scheduled time.
You can add the concurrency policy field (.spec.concurrencyPolicy) to the definition of your CronJob, but this is optional.
It specifies how to treat concurrent executions of a job that is created by this CronJob. The spec may specify only one of these three concurrency policies:
Allow (default) - The cron job allows concurrently running jobs
Forbid - The cron job does not allow concurrent runs; if it is time for a new job run and the previous job run hasn’t finished yet, the cron job skips the new job run
Replace - If it is time for a new job run and the previous job run hasn’t finished yet, the cron job replaces the currently running job run with a new job run
It is good to know that the concurrency policy applies only to the jobs created by the same CronJob.
If there are multiple CronJobs, their respective jobs are always allowed to run concurrently.
A CronJob is counted as missed if it has failed to be created at its scheduled time. For example, If concurrencyPolicy is set to Forbid and a CronJob was attempted to be scheduled when there was a previous schedule still running, then it would count as missed.
For every CronJob, the CronJob controller checks how many schedules it missed in the duration from its last scheduled time until now. If there are more than 100 missed schedules, then it does not start the job and logs the error.
You can find more information here: CronJobs and AutomatedTask.
I hope it helps.
The CronJob creates Jobs with a backoffLimit that defaults to 6 in your case, and a restart policy that defaults to Always.
It is better to set backoffLimit to 0 and the restart policy to Never, and to set startingDeadlineSeconds lower than or equal to your interval (or tune it to your needs) to control the run time of each CronJob run.
Additionally, you may set concurrencyPolicy to Forbid.
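Putting those suggestions together, a hedged sketch (the name, schedule, and image are examples):

apiVersion: batch/v1beta1 # batch/v1 on newer clusters
kind: CronJob
metadata:
  name: no-retry
spec:
  schedule: "*/15 * * * *"
  concurrencyPolicy: Forbid
  startingDeadlineSeconds: 60 # example value; tune to your interval
  jobTemplate:
    spec:
      backoffLimit: 0 # do not recreate pods after a failure
      template:
        spec:
          restartPolicy: Never # let the Job controller, not the kubelet, handle failures
          containers:
          - name: task
            image: busybox
            command: ["sh", "-c", "exit 0"]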