snakemake --max-jobs-per-second parameter ignored - hpc

I am currently running Snakemake on my department's cluster (SGE). I am using a profile template from a workshop to submit jobs to the scheduler and run my scripts within the different rules; the profile template is taken from this snakemake-gridengine repository.
However, I am running into an issue where Snakemake is not submitting the maximum number of jobs it should be able to run on the cluster at once.
snakemake --snakefile pop_split_imputation_pipeline.smk \
-j 1000 --max-status-checks-per-second 0.01 \
--profile ~/snakemake/profile -f --rerun-incomplete --use-conda
For instance, above is an example of a command used to submit a .smk pipeline, which in theory should generate up to 1000 jobs per rule. However, on my cluster only 10-50 jobs are ever submitted at any one time. In my config.yaml I have already set max-jobs-per-second: 1000, so it should be able to submit these jobs far more quickly, yet it doesn't.
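For reference, the profile's config.yaml contains settings along these lines (simplified; the cluster command shown here is illustrative, since the actual submission settings come from the snakemake-gridengine template):
# simplified profile config.yaml (cluster command is illustrative)
cluster: "qsub -cwd -V -o logs/ -e logs/"
jobs: 1000
max-jobs-per-second: 1000
latency-wait: 60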
Can anyone point to something to improve the submission speed of these jobs?

Related

Azure DevOps Agent - Custom Setup/Teardown Operations

We have a cloud full of self-hosted Azure Agents running on custom AMIs. In some cases, I have cleanup operations which I'd really like to run either before or after a job runs on the machine, but I don't want the developer who is waiting for the job to also wait at the beginning or the end of the job (which holds up other stages).
What I'd really like is for the Azure Agent itself to say, "after this job finishes, I will run a set of custom scripts that will prepare for the next job, and I won't accept more work until that set of scripts is done".
In a pinch, maybe just a "cooldown" setting would work -- wait 30 seconds before accepting another job. (Then at least a job could trigger some background work before finishing.)
Has anyone had experience with this, or knows of a workable solution?
I can suggest three solutions:
1. Create another pipeline to run the cleanup tasks on the agents. You can add a demand for a specific agent (see https://learn.microsoft.com/en-us/azure/devops/pipelines/process/demands?view=azure-devops&tabs=yaml) with Agent.Name -equals [Your Agent Name], and you can set the frequency to minutes, hours, or whatever you like using a cron pattern; a sketch follows this list. While this pipeline is running and occupying the agent, the agent being cleaned will not be available for other jobs. Do note that you can trigger this pipeline from another pipeline, but if both use the same agents they can deadlock.
2. Create a template containing script tasks with all the cleanup logic and use it at the end of every job (which you have already discounted).
3. Rather than using static VMs for agent hosting, use an Azure scale set for self-hosted agents: every time agents are scaled down they are discarded, and when they are scaled up they start fresh. This also saves a lot of money, since idle agents are no longer sitting around when no one is working. We use this option and moved away from static agents. We have also used Packer to rebuild the VM image/VHD overnight with patches, required software, and cached Docker images.
ref: https://learn.microsoft.com/en-us/azure/devops/pipelines/agents/scale-set-agents?view=azure-devops
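For option 1, a sketch of what such a scheduled cleanup pipeline could look like (the cron schedule, pool name, agent name, and cleanup script are all placeholders):
# hypothetical azure-pipelines.yml for a scheduled, agent-pinned cleanup run
schedules:
- cron: "0 */4 * * *"              # placeholder: every 4 hours
  displayName: Periodic agent cleanup
  branches:
    include:
    - main
  always: true                     # run even when nothing has changed

pool:
  name: SelfHostedPool             # placeholder pool name
  demands:
  - Agent.Name -equals MyAgent01   # pin the run to one specific agent

steps:
- script: ./cleanup.sh             # placeholder for the actual cleanup logic
  displayName: Run cleanup tasks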
For those discovering this question, there's a much better way: run your self-hosted agent with the --once flag, documented here.
You'll need to wrap it in your own bash script, but something like this works:
while :
do
    echo "Performing pre-job setup..."
    echo "Waiting for job..."
    ./run.sh --once
    echo "Cleaning up..."
    sleep 2
done
Another option would be to use a ScaleSet VM setup which preps a new agent for each run and discards the VM when the run is done. It can prepare new VMs in the background while the job is running.
And I suspect you could implement your own IMaintenanceProvider..
https://github.com/microsoft/azure-pipelines-agent/blob/master/src/Agent.Worker/Maintenance/MaintenanceJobExtension.cs#L53

Use Airflow to run parametrized jobs on-demand and with a schedule

I have a reporting application that uses Celery to process thousands of jobs per day. There is a Python module for each report type that encapsulates all the job steps. Jobs take customer-specific parameters and typically complete within a few minutes. Currently, jobs are triggered by customers on-demand when they create a new report or request a refresh of an existing one.
Now, I would like to add scheduling, so the jobs run daily, and reports get refreshed automatically. I understand that Airflow shines at task orchestration and scheduling. I also like the idea of expressing my jobs as DAGs and getting the benefit of task retries. I can see how I can use Airflow to run scheduled batch-processing jobs, but I am unsure about my use case.
If I express my jobs as Airflow DAGs, I will still need to run them parametrized for each customer. That means that if a customer creates a new report, I need a way to trigger a DAG with the customer-specific configuration. And for scheduled execution, I will need to enumerate all customers and create a parametrized (sub-)DAG for each of them. My understanding is that this should be possible, since Airflow supports dynamically created DAGs; however, I am not sure whether this is an efficient and correct way to use Airflow.
I wonder if anyone has considered using Airflow for a scenario similar to mine.
Celery workflows do literally the same thing, and you can create and run them at any point in time. Also, Celery has a pretty good scheduler (I have never seen it fail in 5 years of using Celery) - Celery Beat.
Sure, Airflow can be used to do what you need without any problems.
You can use Airflow to create DAGs dynamically, though I am not sure how well this works at a scale of 1000s of DAGs. There are some good examples on astronomer.io on Dynamically Generating DAGs in Airflow.
I have some DAGs and tasks that are dynamically generated from a yaml configuration, with different schedules and configurations. It all works without any issue.
The only thing that might be challenging is the "jobs are triggered by customers on-demand" part - I guess you could trigger any DAG with Airflow's REST API, but it's still in an experimental state.
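If it helps, here is a minimal sketch of the dynamic-generation pattern (Airflow 2.x import paths; the customer list, DAG IDs, and report callable are placeholders for illustration):
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

CUSTOMERS = ["acme", "globex"]  # placeholder; load from a yaml/DB source in practice

def refresh_report(customer, **context):
    # placeholder for the existing per-customer report logic
    print(f"Refreshing report for {customer}")

for customer in CUSTOMERS:
    dag_id = f"report_refresh_{customer}"
    with DAG(
        dag_id=dag_id,
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",  # scheduled refresh; can also be triggered on demand
        catchup=False,
    ) as dag:
        PythonOperator(
            task_id="refresh_report",
            python_callable=refresh_report,
            op_kwargs={"customer": customer},
        )
    # expose the DAG object at module level so the scheduler picks it up
    globals()[dag_id] = dag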

Snakemake: sub-workflow error when executing in LSF cluster

I have two Snakemake workflows that are very similar. Both of them share a sub-workflow and a couple of includes. Both of them work when doing dry runs. Both of them use the same cluster config file, and I'm running them with the same launch command. One of them fails when submitting to the LSF cluster with this error:
Executing subworkflow wf_common.
WorkflowError:
Config file __default__ not found.
I'm wondering whether it's "legal" in Snakemake for two workflows to share a sub-workflow, like in this case, and if not, whether the fact that I ran the workflow that does work first could have this effect.
Can you try Snakemake 3.12.0? It fixed a bug with passing the cluster config to a subworkflow. I would think that this solves your problem.

Make job submitted with bsub run in parallel with job that was submitted before and still running

I would like a newly submitted bsub job not to PEND but to start running immediately.
If possible I would like to limit this to N jobs.
If you want two jobs to run in parallel, it's probably best to submit a single parallel job (bsub -n) that runs two different processes, potentially on two different hosts.
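A minimal sketch of that approach, assuming both processes can share one execution host (the script names are placeholders; spreading them across hosts would need a distributed launcher such as blaunch):
# request 2 slots on a single host and run two processes inside one job
bsub -n 2 -R "span[hosts=1]" -o out.%J "./proc_a.sh & ./proc_b.sh & wait"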
The LSF admin can force a PENDing job to run with the brun command. However, this will cause the execution host to be temporarily overloaded.

Prioritizing Jobs on Sun Grid Engine Queue

I have a bunch of jobs lined up for processing on a Sun Grid Engine queue, but I have just submitted a new job that I would like to prioritize. (The new job is 163981 in the left-most column.) Is there a command I can run to ask the server to process the 163981 job next, rather than the next job in the 140522 job array? I would be grateful for any advice others can offer on this question.
With admin/manager access you could use:
qalter -p <positive number up to 1024> [job id of the job you want to run sooner]
Without admin/manager access you could use:
qalter -p <negative number down to -1023> [job id of the other job you don't want to run next]
These may not work, depending on how long the older job has been waiting and how much weight the administrator has given to waiting time in the scheduling policy.
Another option without admin/manager access would be to put the job you don't want to run on hold.
qalter -h u <job id of the job you don't want to run now>
This will leave the job you want to run as the only one eligible to start. Once it has started running, you can remove the hold on the other job with
qalter -h U <job id>
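Applied to the job IDs in the question (163981 is the new job, 140522 is the pending array), that would look something like:
qalter -p 1024 163981    # with admin/manager access: raise the new job's priority
qalter -p -100 140522    # without admin access: lower the pending array's priority
qalter -h u 140522       # or hold the array until 163981 has started, then:
qalter -h U 140522       # release the hold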
Does changing the job share (-js option of qsub) accomplish what you want?
Assuming the other jobs are running or queued with the default -js value of 0, submit the new job with a higher job share like so:
qsub -js 10 high_priority.sh
Source: http://www.lifesci.dundee.ac.uk/services/lsc/services/cluster/using-cluster