Remove resource requirements of SGE jobs using qalter - queue

So I have pending jobs on an SGE queue. The job is running with the parameters
-l h_rt=1200,h_vmem=16g
I know that I can use qalter to change the parameters.
qalter -l h_rt=1300 jid
But how do I remove these parameters?

I am not aware of a method to reset these values to their default value.
However, you can manually check what the defaults are and then apply those.
This will show the default values for queue all.q:
qconf -sq all.q
If you did not apply complex modifications to your SGE setup, the hard resource limits will likely be INFINITY. So you can reset the limits via
qalter -l h_rt=INFINITY jid
Keep in mind that above command will also replace your other limits (e.g. h_vmem). If you wish to keep these other settings you will have to re-assign them in your qalter command.

Related

kubectl wait not working for creation of resources

How do you get around waiting on resources not yes created?
In script I get:
kubectl wait --for=condition=ready --timeout=60s -n <some namespace> --all pods
error: no matching resources found
This is a community wiki answer posted for better visibility. Feel free to expand it.
As documented:
Experimental: Wait for a specific condition on one or many resources.
The command takes multiple resources and waits until the specified
condition is seen in the Status field of every given resource.
Alternatively, the command can wait for the given set of resources to
be deleted by providing the "delete" keyword as the value to the --for
flag.
A successful message will be printed to stdout indicating when the
specified condition has been met. One can use -o option to change to
output destination.
This command will not work for the resources that hasn't been created yet. #EmruzHossain has posted two valid points:
Make sure you have provided a valid namespace.
First wait for the resource to get created. Probably a loop running kubectl get periodically. When the desired resource is found, break the loop. Then, run kubectl wait to wait for the resource to be ready.
Also, there is this open thread: kubectl wait for un-existed resource. #83242 which is still waiting (no pun intended) to be implemented.

Argo Workflows semaphore with value 0

I'm learning about semaphores in the Argo project workflows to avoid concurrent workflows using the same resource.
My use case is that I have several external resources which only one workflow can use one at a time. So far so good, but sometimes the resource needs some maintenance and during that period I don't want to Argo to start any workflow.
I guess I have two options:
I tested manually setting the semaphore value in the configMap to the value 0, but Argo started one workflow anyway.
I can start a workflow that runs forever, until it is deleted, claiming the synchronization lock, but that requires some extra overhead to have workflows running that don't do anything.
So I wonder how it is supposed to work if I set the semaphore value to 0, I think it should not start the workflow then since it says 0. Anyone have any info about this?
This is the steps I carried out:
First I apply my configmap with kubectl -f.
I then submit some workflows and since they all use the same semaphore Argo will start one and the rest will be executed in order one at a time.
I then change value of the semapore with kubectl edit configMap
Submit new job which then Argo will execute.
Perhaps Argo will not reload the configMap when I update the configMap through kubectl edit? I would like to update the configmap programatically in the future but used kubectl edit now for testing.
Quick fix: after applying the ConfigMap change, cycle the workflow-controller pod. That will force it to reload semaphore state.
I couldn't reproduce your exact issue. After using kubectl edit to set the semaphore to 0, any newly submitted workflows remained Pending.
I did encounter an issue where using kubectl edit to bump up the semaphore limit did not automatically kick off any of the Pending workflows. Cycling the workflow controller pod allowed the workflows to start running again.
Besides using the quick fix, I'd recommend submitting an issue. Synchronization is a newer feature, and it's possible it's not 100% robust yet.

Complete parallel Kubernetes job when one worker pod succeeds

I have a simple containerised python script which I am trying to parallelise with Kubernetes. This script guesses hashes until it finds a hashed value below a certain threshold.
I am only interested in the first such value, so I wish to create a Kubernetes job that spawns n worker pods and completes as soon as one worker pod finds a suitable value.
By default, Kubernetes jobs wait until all worker pods complete before marking the job as complete. I have so far been unable to find a way around this (no mention of this job pattern in the documentation), and have been relying on checking the logs of bare pods via a bash script to determine whether one has completed.
Is there a native means to achieve this? And, if not, what would be the best approach?
Hi look this link https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/#parallel-jobs.
I've never tried it but it seems possible to launch several pods and configure the end of the job when x pods have finished. In your case x is 1.
We can define two specifications for parallel Jobs:
1. Parallel Jobs with a fixed completion count:
specify a non-zero positive value for .spec.completions.
the Job represents the overall task, and is complete when there is
one successful Pod for each value in the range 1 to
.spec.completions
not implemented yet: Each Pod is passed a different index in the
range 1 to .spec.completions.
2. Parallel Jobs with a work queue:
do not specify .spec.completions, default to .spec.parallelism
the Pods must coordinate amongst themselves or an external service to
determine what each should work on.
For example, a Pod might fetch a batch of up to N items from the work queue.
each Pod is independently capable of determining whether or not all its peers are done, and thus that the entire Job is done.
when any Pod from the Job terminates with success, no new Pods are
created
once at least one Pod has terminated with success and all Pods are
terminated, then the Job is completed with success
once any Pod has exited with success, no other Pod should still be
doing any work for this task or writing any output. They should all
be in the process of exiting
For a fixed completion count Job, you should set .spec.completions to the number of completions needed. You can set .spec.parallelism, or leave it unset and it will default to 1.
For a work queue Job, you must leave .spec.completions unset, and set .spec.parallelism to a non-negative integer.
For more information about how to make use of the different types of job, see the job patterns section.
You can also take a look on single job which starts controller pod:
This pattern is for a single Job to create a Pod which then creates other Pods, acting as a sort of custom controller for those Pods. This allows the most flexibility, but may be somewhat complicated to get started with and offers less integration with Kubernetes.
One example of this pattern would be a Job which starts a Pod which runs a script that in turn starts a Spark master controller (see spark example), runs a spark driver, and then cleans up.
An advantage of this approach is that the overall process gets the completion guarantee of a Job object, but complete control over what Pods are created and how work is assigned to them.
At the same time take under consideration that completition status of Job set by dafault - when specified number of successful completions is reached it ensure that all tasks are processed properly. Applying this status before all tasks are finished is not secure solution.
You should also know that finished Jobs are usually no longer needed in the system. Keeping them around in the system will put pressure on the API server. If the Jobs are managed directly by a higher level controller, such as CronJobs, the Jobs can be cleaned up by CronJobs based on the specified capacity-based cleanup policy.
Here is official documentations: jobs-parallel-processing , parallel-jobs.
Useful blog: article-parallel job.
EDIT:
Another option is that you can create special script which will continuously check values you look for. Using job then will not be necessary, you can simply use deployment.

Is it possible to stop a job in Kubernetes without deleting it

Because Kubernetes handles situations where there's a typo in the job spec, and therefore a container image can't be found, by leaving the job in a running state forever, I've got a process that monitors job events to detect cases like this and deletes the job when one occurs.
I'd prefer to just stop the job so there's a record of it. Is there a way to stop a job?
1) According to the K8S documentation here.
Finished Jobs are usually no longer needed in the system. Keeping them around in the system will put pressure on the API server. If the Jobs are managed directly by a higher level controller, such as CronJobs, the Jobs can be cleaned up by CronJobs based on the specified capacity-based cleanup policy.
Here are the details for the failedJobsHistoryLimit property in the CronJobSpec.
This is another way of retaining the details of the failed job for a specific duration. The failedJobsHistoryLimit property can be set based on the approximate number of jobs run per day and the number of days the logs have to be retained. Agree that the Jobs will be still there and put pressure on the API server.
This is interesting. Once the job completes with failure as in the case of a wrong typo for image, the pod is getting deleted and the resources are not blocked or consumed anymore. Not sure exactly what kubectl job stop will achieve in this case. But, when the Job with a proper image is run with success, I can still see the pod in kubectl get pods.
2) Another approach without using the CronJob is to specify the ttlSecondsAfterFinished as mentioned here.
Another way to clean up finished Jobs (either Complete or Failed) automatically is to use a TTL mechanism provided by a TTL controller for finished resources, by specifying the .spec.ttlSecondsAfterFinished field of the Job.
Not really, no such mechanism exists in Kubernetes yet afaik.
You can workaround is to ssh into the machine and run a: (if you're are using Docker)
# Save the logs
$ docker log <container-id-that-is-running-your-job> 2>&1 > save.log
$ docker stop <main-container-id-for-your-job>
It's better to stream log with something like Fluentd, or logspout, or Filebeat and forward the logs to an ELK or EFK stack.
In any case, I've opened this
You can suspend cronjobs by using the suspend attribute. From the Kubernetes documentation:
https://kubernetes.io/docs/tasks/job/automated-tasks-with-cron-jobs/#suspend
Documentation says:
The .spec.suspend field is also optional. If it is set to true, all
subsequent executions are suspended. This setting does not apply to
already started executions. Defaults to false.
So, to pause a cron you could:
run and edit "suspend" from False to True.
kubectl edit cronjob CRON_NAME (if not in default namespace, then add "-n NAMESPACE_NAME" at the end)
you could potentially create a loop using "for" or whatever you like, and have them all changed at once.
you could just save the yaml file locally and then just run:
kubectl create -f cron_YAML
and this would recreate the cron.
The other answers hint around the .spec.suspend solution for the CronJob API, which works, but since the OP asked specifically about Jobs it is worth noting the solution that does not require a CronJob.
As of Kubernetes 1.21, there alpha support for the .spec.suspend field in the Job API as well, (see docs here). The feature is behind the SuspendJob feature gate.

What is a use case for kubernetes job?

I'm looking to fully understand the jobs in kubernetes.
I have successfully create and executed a job, but I do not see the use case.
Not being able to rerun a job or not being able to actively listen to it completion makes me think it is a bit difficult to manage.
Anyone using them? Which is the use case?
Thank you.
A job retries pods until they complete, so that you can tolerate errors that cause pods to be deleted.
If you want to run a job repeatedly and periodically, you can use CronJob alpha or cronetes.
Some Helm Charts use Jobs to run install, setup, or test commands on clusters, as part of installing services. (Example).
If you save the YAML for the job then you can re-run it by deleting the old job an creating it again, or by editing the YAML to change the name (or use e.g. sed in a script).
You can watch a job's status with this command:
kubectl get jobs myjob -w
The -w option watches for changes. You are looking for the SUCCESSFUL column to show 1.
Here is a shell command loop to wait for job completion (e.g. in a script):
until kubectl get jobs myjob -o jsonpath='{.status.conditions[?(#.type=="Complete")].status}' | grep True ; do sleep 1 ; done
One of the use case can be to take a backup of a DB. But as already mentioned that are some overheads to run a job e.g. When a Job completes the Pods are not deleted . so you need to manually delete the job(which will also delete the pods created by job). so recommended option will be to use Cron instead of Jobs