How can a Job with multiple steps run some steps on Node 1 and others on Node 2?
For example:
On Node 1, I have to copy a file to a folder: cp file.txt /var/www/htm/
On Node 2, I have to download this file: wget https://www.mywebsite.com/file.txt
I have tried creating three jobs:
JOB 1: the workflow has an Execute Command step that runs cp file.txt /var/www/htm/ on the remote node, with the NODES filter set to NODE 1.
JOB 2: the workflow has an Execute Command step that runs wget https://www.mywebsite.com/file.txt on the remote node, with the NODES filter set to NODE 2.
JOB 3: workflow step 1 is a Job Reference with the UUID of JOB 1 pasted in, step 2 is a Job Reference with the UUID of JOB 2, and in the node filter I wrote .* to get all nodes.
For now I have only tried running the command ls (in JOB 1 and JOB 2), but when I run JOB 3 each job's output appears three times, for example:
// Run Job 3
// Output from Job 1
test-folder
test.text
test-folder
test.text
test-folder
test.text
And the same for JOB 2.
How can I implement my job?
Using the job reference step is the right way to solve this, but instead of defining .* to get all nodes, use the node1 name in the first job reference step call and the node2 name in the second job reference call, in the "Override node filters?" section. Alternatively, you can define the node filter in each job and just call them from JOB 3 using job reference steps.
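For reference, a rough sketch of what JOB 3 could look like in Rundeck's YAML job-definition export, with the node filter overridden per job reference (UUIDs are placeholders and key names may vary slightly between Rundeck versions):

- name: 'JOB 3'
  sequence:
    keepgoing: false
    strategy: node-first
    commands:
    - jobref:
        name: 'JOB 1'
        uuid: 'UUID-OF-JOB-1'      # placeholder
        nodefilters:
          filter: 'node1'          # "Override node filters?" for the first call
    - jobref:
        name: 'JOB 2'
        uuid: 'UUID-OF-JOB-2'      # placeholder
        nodefilters:
          filter: 'node2'          # override for the second call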
I want to shut down specific celery workers. I was using app.control.broadcast('shutdown'); however, this shuts down all the workers, so I would like to pass the destination parameter.
When I run ps -ef | grep celery, I can see the --hostname on the process.
I know that the format is {CELERYD_NODES}{NODENAME_SEP}{hostname}, from the utility function nodename:
import socket

destination = ''.join(['celery',  # CELERYD_NODES defined at /etc/default/newfies-celeryd
                       '@',       # NODENAME_SEP: from celery.utils.__init__ import NODENAME_SEP
                       socket.gethostname()])
Is there a helper function which returns the nodename? I don't want to create it myself since I don't want to hardcode the value.
I am not sure if that's what you're looking for, but with control.inspect you can get info about the workers, for example:
from celery import Celery

app = Celery('app_name', broker=...)  # broker URL omitted here
app.control.inspect().stats()       # statistics per worker
app.control.inspect().registered()  # registered tasks per worker
app.control.inspect().active()      # active workers/tasks
so basically you can get the list of workers from each one of them:
app.control.inspect().stats().keys()
app.control.inspect().registered().keys()
app.control.inspect().active().keys()
for example:
>>> app.control.inspect().registered().keys()
dict_keys(['worker1@my-host-name', 'worker2@my-host-name', ...])
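Building on that, here is a minimal sketch of how those keys could be fed into the shutdown broadcast instead of hardcoding the nodename (the broker URL and the 'celery@' prefix filter are assumptions for illustration, and inspect() returns None when no worker replies, so real code should guard against that):

from celery import Celery

app = Celery('app_name', broker='redis://localhost:6379/0')  # placeholder broker URL

stats = app.control.inspect().stats() or {}   # {} if no workers replied
workers = list(stats.keys())                  # e.g. ['celery@my-host-name', ...]

# shut down only the workers we want to target (prefix is an assumption)
targets = [w for w in workers if w.startswith('celery@')]
app.control.broadcast('shutdown', destination=targets)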
My Azure DevOps pipeline tasks successfully complete without issues except for the final deployment step:
Job Issues - 1 Error
The job running on agent XXXX ran longer than the maximum time of 00:05:00 minutes. For more information, see https://go.microsoft.com/fwlink/?linkid=2077134
The build logs state the operation was canceled:
2021-03-02T20:50:00.4223027Z Folders: 695
2021-03-02T20:50:00.4223319Z Files: 10645
2021-03-02T20:50:00.4223589Z Size: 672611102
2021-03-02T20:50:00.4223851Z Compressed: 249144045
2021-03-02T20:50:03.6023001Z ##[warning]Unable to apply transformation for the given package. Verify the following.
2021-03-02T20:50:03.6032907Z ##[warning]1. Whether the Transformation is already applied for the MSBuild generated package during build. If yes, remove the <DependentUpon> tag for each config in the csproj file and rebuild.
2021-03-02T20:50:03.6034584Z ##[warning]2. Ensure that the config file and transformation files are present in the same folder inside the package.
2021-03-02T20:50:04.5268038Z Initiated variable substitution in config file : C:\azagent\A2\_work\_temp\temp_web_package_3012195912183888\Areas\Admin\sitemap.config
2021-03-02T20:50:04.5552027Z Skipped Updating file: C:\azagent\A2\_work\_temp\temp_web_package_3012195912183888\Areas\Admin\sitemap.config
2021-03-02T20:50:04.5553082Z Initiated variable substitution in config file : C:\azagent\A2\_work\_temp\temp_web_package_3012195912183888\web.config
2021-03-02T20:50:04.5642868Z Skipped Updating file: C:\azagent\A2\_work\_temp\temp_web_package_3012195912183888\web.config
2021-03-02T20:50:04.5643366Z XML variable substitution applied successfully.
2021-03-02T20:51:00.8934630Z ##[error]The operation was canceled.
2021-03-02T20:51:00.8938641Z ##[section]Finishing: Deploy IIS Website/App:
When I examine the deployment states, I notice one of my tasks takes quite a while for what should be a fairly simple operation:
The file transform portion takes over half of the allotted 5 minutes. Could this be the issue?
steps:
- task: FileTransform@1
  displayName: 'File Transform: '
  inputs:
    folderPath: '$(System.DefaultWorkingDirectory)/_site.com/drop/Release/Nop.Web.zip'
    fileType: json
    targetFiles: '**/dataSettings.json'
It may be inefficient, but the FileTransform log shows a significant amount of time spent after the variable has been substituted. I'm not sure what's causing the long delay, but the logs don't account for the time after the variable was successfully substituted:
2021-03-02T23:04:44.3796910Z Folders: 695
2021-03-02T23:04:44.3797285Z Files: 10645
2021-03-02T23:04:44.3797619Z Size: 672611002
2021-03-02T23:04:44.3797916Z Compressed: 249143976
2021-03-02T23:04:44.3970596Z Applying JSON variable substitution for **/App_Data/dataSettings.json
2021-03-02T23:04:45.2396016Z Applying JSON variable substitution for C:\azagent\A2\_work\_temp\temp_web_package_0182869515217865\App_Data\dataSettings.json
2021-03-02T23:04:45.2399264Z Substituting value on key DataConnectionString with (string) value: ***
2021-03-02T23:04:45.2446986Z JSON variable substitution applied successfully.
2021-03-02T23:07:25.4881687Z ##[section]Finishing: File Transform:
The job running on agent XXXX ran longer than the maximum time of 00:05:00 minutes.
Based on the error message, the cause of this issue is that the running time of the agent job has reached the configured maximum.
If you are using a release pipeline, you can set the timeout under Agent Job -> Execution plan -> Timeout.
If you are using a build pipeline, you can set the timeout under Agent Job -> Execution plan -> Timeout (for an agent job) and under Options -> Build job -> Build job timeout in minutes (for the whole build pipeline).
The file transform portion takes over half of the allotted 5 minutes
From the task log, the zip package contains many files and folders, so the transform task takes more time to traverse them to find the target files.
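If the deployment is defined in YAML rather than in the classic editor, the equivalent knob is timeoutInMinutes on the job; a minimal sketch reusing the task from the question (the job name is illustrative):

jobs:
- job: DeployWeb            # illustrative job name
  timeoutInMinutes: 15      # raise the per-job limit above 5 minutes
  steps:
  - task: FileTransform@1
    displayName: 'File Transform: '
    inputs:
      folderPath: '$(System.DefaultWorkingDirectory)/_site.com/drop/Release/Nop.Web.zip'
      fileType: json
      targetFiles: '**/dataSettings.json'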
I am trying to run two job arrays of identical size in SLURM and have each job in array #2 start after the corresponding job in array #1 has completed. As far as I understand, this is the exact use case of --dependency=aftercorr:#jobid in SLURM, as per "one-to-one dependency between two job arrays in SLURM".
However, when I do the below, the second array is left PENDING although many jobs in #1 have completed:
sbatch --mem=30g -c 2 --time 10:0:0 -e %A.%a.out -o %A.%a.out --array=1-977%200 ./step1.sh
# Store the array ID as $array1
# wait 1 hour, some of them show as COMPLETED
sbatch --mem=30g -c 2 --time 10:0:0 -e %A.%a.out -o %A.%a.out --array=1-977%200 --dependency=aftercorr:$array1 ./step2.sh
# This just remains PENDING forever
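For reference, the "Store the array ID as $array1" step above can be done with sbatch's --parsable flag, which makes sbatch print just the job ID; a small sketch using the same submission lines as in the question:

# --parsable prints only the job ID of the submitted array
array1=$(sbatch --parsable --mem=30g -c 2 --time 10:0:0 -e %A.%a.out -o %A.%a.out --array=1-977%200 ./step1.sh)

# later, submit the dependent array against that ID
sbatch --mem=30g -c 2 --time 10:0:0 -e %A.%a.out -o %A.%a.out --array=1-977%200 --dependency=aftercorr:$array1 ./step2.sh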
I have a pretty simple snakemake pipeline that takes an input file and does three subsequent steps to produce one output. Each individual job is very quick. Now I want to apply this pipeline to >10k files on an SGE cluster. Even if I use group to have one job for the three rules per input file, I would still submit >10k cluster jobs. Is there a way to instead submit a limited number of cluster jobs (let's say 100) and distribute all tasks equally between them?
An example would be something like
rule A:
    input: "{prefix}.start"
    output: "{prefix}.A"
    group: "mygroup"

rule B:
    input: "{prefix}.A"
    output: "{prefix}.B"
    group: "mygroup"

rule C:
    input: "{prefix}.B"
    output: "{prefix}.C"
    group: "mygroup"

rule runAll:
    input: expand("{prefix}.C", prefix=VERY_MANY_PREFIXES)
and then run it with
snakemake --cluster "qsub <some parameters>" runAll
You could process all the 10k files in the same rule using a for loop (not sure if this is what Manavalan Gajapathy has in mind). For example:
rule A:
    input:
        txt=expand('{prefix}.start', prefix=PREFIXES),
    output:
        out=expand('{prefix}.A', prefix=PREFIXES),
    run:
        io = zip(input.txt, output.out)
        for x in io:
            shell('some_command %s %s' % (x[0], x[1]))
and the same for rules B and C.
Look also at snakemake local-rules
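A minimal sketch of that directive, using the rule names from the question (local rules run inside the main Snakemake process instead of being submitted via --cluster):

# near the top of the Snakefile
localrules: A, B, C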
The only solution I can think of would be to declare rules A, B, and C to be local rules, so that they run in the main snakemake job instead of being submitted as a job. Then you can break up your runAll into batches:
rule runAll1:
    input: expand("{prefix}.C", prefix=VERY_MANY_PREFIXES[:1000])

rule runAll2:
    input: expand("{prefix}.C", prefix=VERY_MANY_PREFIXES[1000:2000])

rule runAll3:
    input: expand("{prefix}.C", prefix=VERY_MANY_PREFIXES[2000:3000])

...etc
Then you submit a snakemake job for runAll1, another for runAll2, and so on. You can do this fairly easily with a bash loop:
for i in {1..10}; do sbatch [sbatch params] snakemake runAll$i; done;
Another option which would be more scalable than creating multiple runAll rules would be to have a helper python script that does something like this:
import subprocess

for i in range(0, len(VERY_MANY_PREFIXES), 1000):
    subprocess.run(['sbatch', 'snakemake'] + [f'{prefix}.C' for prefix in VERY_MANY_PREFIXES[i:i+1000]])
Spring Batch jobs can be started from the command line by telling the JVM to run CommandLineJobRunner. According to the JavaDoc, running the same command with the added parameter -stop will stop the Job:
The arguments to this class can be provided on the command line (separated by spaces), or through stdin (separated by new line). They are as follows:
jobPath jobIdentifier (jobParameters)*
The command line options are as follows:
jobPath: the xml application context containing a Job
-restart: (optional) to restart the last failed execution
-stop: (optional) to stop a running execution
-abandon: (optional) to abandon a stopped execution
-next: (optional) to start the next in a sequence according to the JobParametersIncrementer in the Job
jobIdentifier: the name of the job or the id of a job execution (for -stop, -abandon or -restart).
jobParameters: 0 to many parameters that will be used to launch a job, specified in the form of key=value pairs.
However, in the JavaDoc for the main() method the -stop parameter is not specified. Looking through the code on docjar.com I can't see any use of the -stop parameter where I would expect it to be.
I suspect that it is possible to stop a batch that has been started from the command line, but only if the batch being run is backed by a non-transient jobRepository? If a batch run from the command line only stores its data in HSQL (i.e. in memory), is there no way to stop the job other than Ctrl-C, etc.?
The stop command is implemented; see the source for CommandLineJobRunner, line 300+:
if (opts.contains("-stop")) {
List<JobExecution> jobExecutions = getRunningJobExecutions(jobIdentifier);
if (jobExecutions == null) {
throw new JobExecutionNotRunningException("No running execution found for job=" + jobIdentifier);
}
for (JobExecution jobExecution : jobExecutions) {
jobExecution.setStatus(BatchStatus.STOPPING);
jobRepository.update(jobExecution);
}
return exitCodeMapper.intValue(ExitStatus.COMPLETED.getExitCode());
}
The stop switch will work, but it will only stop the job after the currently executing step completes. It won't kill the job immediately.
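For reference, a stop invocation from a second JVM might look roughly like this (class path, context file and job name are placeholders); it only works if both processes share the same persistent job repository, since the snippet above updates the JobExecution status that the running job then observes:

java -cp "lib/*:my-batch-job.jar" \
  org.springframework.batch.core.launch.support.CommandLineJobRunner \
  launch-context.xml -stop myJobName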