Snakemake: Cluster multiple jobs together - hpc

I have a pretty simple Snakemake pipeline that takes an input file and runs three subsequent steps to produce one output. Each individual job is very quick. Now I want to apply this pipeline to >10k files on an SGE cluster. Even if I use group to bundle the three rules per input file into one job, I would still submit >10k cluster jobs. Is there a way to instead submit a limited number of cluster jobs (let's say 100) and distribute all tasks equally between them?
An example would be something like
rule A:
    input: "{prefix}.start"
    output: "{prefix}.A"
    group: "mygroup"

rule B:
    input: "{prefix}.A"
    output: "{prefix}.B"
    group: "mygroup"

rule C:
    input: "{prefix}.B"
    output: "{prefix}.C"
    group: "mygroup"

rule runAll:
    input: expand("{prefix}.C", prefix = VERY_MANY_PREFIXES)
and then run it with
snakemake --cluster "qsub <some parameters>" runAll

You could process all the 10k files in the same rule using a for loop (not sure if this is what Manavalan Gajapathy has in mind). For example:
rule A:
    input:
        txt = expand('{prefix}.start', prefix=PREFIXES),
    output:
        out = expand('{prefix}.A', prefix=PREFIXES),
    run:
        io = zip(input.txt, output.out)
        for x in io:
            shell('some_command %s %s' % (x[0], x[1]))
and the same for rules B and C.
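For instance, rule B could follow the same pattern (the command name here is just a placeholder):

rule B:
    input:
        txt = expand('{prefix}.A', prefix=PREFIXES),
    output:
        out = expand('{prefix}.B', prefix=PREFIXES),
    run:
        io = zip(input.txt, output.out)
        for x in io:
            shell('another_command %s %s' % (x[0], x[1]))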
Look also at Snakemake's localrules directive.

The only solution I can think of would be to declare rules A, B, and C to be local rules, so that they run in the main snakemake job instead of being submitted as a job. Then you can break up your runAll into batches:
rule runAll1:
    input: expand("{prefix}.C", prefix = VERY_MANY_PREFIXES[:1000])

rule runAll2:
    input: expand("{prefix}.C", prefix = VERY_MANY_PREFIXES[1000:2000])

rule runAll3:
    input: expand("{prefix}.C", prefix = VERY_MANY_PREFIXES[2000:3000])

...etc
Then you submit a snakemake job for runAll1, another for runAll2, and so on. You can do this fairly easily with a bash loop:
for i in {1..10}; do sbatch [sbatch params] snakemake runAll$i; done;
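Since the question targets SGE rather than SLURM, the same loop should work with qsub; -b y tells SGE to treat snakemake as a binary command rather than a job script (scheduler options left elided as in the question):
for i in {1..10}; do qsub -b y <some parameters> snakemake runAll$i; done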
Another option which would be more scalable than creating multiple runAll rules would be to have a helper python script that does something like this:
import subprocess

for i in range(0, len(VERY_MANY_PREFIXES), 1000):
    targets = ['{}.C'.format(prefix) for prefix in VERY_MANY_PREFIXES[i:i + 1000]]
    subprocess.run(['sbatch', 'snakemake'] + targets)
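If your sbatch installation refuses to accept the snakemake executable directly (sbatch normally expects a batch script), a minimal sketch using sbatch --wrap could look like this; the prefix list and batch size are placeholders:

import subprocess

# Hypothetical prefix list; in practice this would come from your workflow config.
VERY_MANY_PREFIXES = ['sample{}'.format(i) for i in range(10000)]

BATCH = 1000
for i in range(0, len(VERY_MANY_PREFIXES), BATCH):
    targets = ['{}.C'.format(p) for p in VERY_MANY_PREFIXES[i:i + BATCH]]
    # --wrap lets sbatch submit a plain command line instead of a batch script.
    subprocess.run(['sbatch', '--wrap', 'snakemake ' + ' '.join(targets)], check=True)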

Related

snakemake only runs the first rule not all

My snakefile looks like this.
rule do00_download_step01_download_:
    input:
    output:
        "data/00_download/scores.pqt"
    run:
        from lib.do00_download import do00_download_step01_download_
        do00_download_step01_download_()

rule do00_download_step02_get_the_mean_:
    input:
        "data/00_download/scores.pqt"
    output:
        "data/00_download/cleaned.pqt"
    run:
        from lib.do00_download import do00_download_step02_get_the_mean_
        do00_download_step02_get_the_mean_()

rule do01_corr_step01_correlate:
    input:
        "data/00_download/cleaned.pqt"
    output:
        "data/01_corr/corr.pqt"
    run:
        from lib.do01_corr import do01_corr_step01_correlate
        do01_corr_step01_correlate()

rule do95_plot_step01_correlations:
    input:
        "data/01_corr/corr.pqt"
    output:
        "plot/heatmap.png"
    run:
        from lib.do95_plot import do95_plot_step01_correlations
        do95_plot_step01_correlations()

rule do95_plot_step02_plot_dist:
    input:
        "data/00_download/cleaned.pqt"
    output:
        "plot/dist.png"
    run:
        from lib.do95_plot import do95_plot_step02_plot_dist
        do95_plot_step02_plot_dist()

rule do99_figures_step01_make_figure:
    input:
        "plot/dist.png",
        "plot/heatmap.png"
    output:
        "figs/fig01.svg"
    run:
        from lib.do99_figures import do99_figures_step01_make_figure
        do99_figures_step01_make_figure()

rule all:
    input:
        "figs/fig01.svg"
I have arranged the rules sequentially, hoping that this would make sure all the steps run in that order. However, when I run snakemake, it only runs the first rule and then exits.
I have individually checked all the steps (the functions that I import) and the paths of the input and output files, and everything looks OK. So I am guessing that the issue is with how I structured the Snakefile. I am new to Snakemake (beginner level), so it would be very helpful if somebody could point out how to fix this issue.
This is the intended behavior. Here's the relevant section from the docs:
Moreover, if no target is given at the command line, Snakemake will define the first rule of the Snakefile as the target. Hence, it is best practice to have a rule all at the top of the workflow which has all typically desired target files as input files.
The output of the first rule is assumed to be the target.
If you move the rule all: to the top of your Snakefile, it should work as expected.
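In other words, the Snakefile would start like this, with the remaining rules left unchanged below it:

rule all:
    input:
        "figs/fig01.svg"

# ... all other rules follow unchanged ...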

rundeck make job with multiple steps on different nodes

How can a job with multiple steps run some steps on Node 1 and others on Node 2?
For example:
On Node 1, I have to copy a file to a folder: cp file.txt /var/www/htm/
On Node 2, I have to download this file: wget https://www.mywebsite.com/file.txt
I have tried creating three jobs:
JOB 1: the workflow has an "Execute Command" step that runs cp file.txt /var/www/htm/, with the node filter set to NODE 1.
JOB 2: the workflow has an "Execute Command" step that runs wget https://www.mywebsite.com/file.txt, with the node filter set to NODE 2.
JOB 3: step 1 is a Job Reference with the UUID of JOB 1, step 2 is a Job Reference with the UUID of JOB 2, and in the node filter I wrote .* to get all nodes.
For now I only tried running ls (in JOB 1 and JOB 2), but when I run JOB 3 the output shows each job's command output three times, for example:
// Run Job 3
// Output from Job 1
test-folder
test.text
test-folder
test.text
test-folder
test.text
And the same for JOB 2.
How can I implement my job?
Using job reference steps is the right way to solve this, but instead of defining .* to match all nodes, use the NODE 1 name in the first job reference step and the NODE 2 name in the second one, in the "Override node filters?" section. Alternatively, you can define the node filter in each referenced job and simply call them from JOB 3 via job reference steps.

pytest-parallel not honouring module-scope fixtures

Suppose I have the below test cases written in a file, test_something.py:
import os

import pytest

@pytest.fixture(scope="module")
def get_some_binary_file():
    # Some logic here that creates a path "/a/b/bin" and then downloads a binary into this path
    os.mkdir("/a/b/bin")  ### This line throws the error in pytest-parallel
    some_binary = os.path.join("/a/b/bin", "binary_file")
    download_bin("some_bin_url", some_binary)
    return some_binary

test_input = [
    {"some": "value"},
    {"foo": "bar"}
]

@pytest.mark.parametrize("test_input", test_input, ids=["Test_1", "Test_2"])
def test_1(get_some_binary_file, test_input):
    # Testing logic here
    pass

# Some other completely different tests below
def test_2():
    # Some other testing logic here
    pass
When I run the above using the pytest command below, the tests work without any issues.
pytest -s --disable-warnings test_something.py
However, I want to run these test cases in parallel. I know that test_1 and test_2 should run in parallel, so I looked into pytest-parallel and did the below:
pytest --workers auto -s --disable-warnings test_something.py
But as shown above in the code, when it goes to create the /a/b/bin folder, it throws an error saying that the directory already exists. So the module scope is not being honoured in pytest-parallel; it is trying to execute get_some_binary_file for every parametrized input to test_1. Is there a way for me to do this?
I have also looked into pytest-xdist with the --dist loadscope option, and ran the below command for it:
pytest -n auto --dist loadscope -s --disable-warnings test_something.py
But this gave me an output like below, where both test_1 and test_2 are being executed on the same worker.
tests/test_something.py::test_1[Test_1]
[gw1] PASSED tests/test_something.py::test_1[Test_1] ## Expected
tests/test_something.py::test_1[Test_2]
[gw1] PASSED tests/test_something.py::test_1[Test_2] ## Expected
tests/test_something.py::test_2
[gw1] PASSED tests/test_something.py::test_2 ## Not expected to run in gw1
As can be seen from the above output, test_2 is running on gw1. Why? Shouldn't it run on a different worker?
Group the tests with xdist_group so that each group runs in its own process. Run like this to assign the groups to processes: pytest xdistloadscope.py -n 2 --dist=loadgroup
import os

import pytest

@pytest.mark.xdist_group("group1")
@pytest.fixture(scope="module")
def get_some_binary_file():
    # Some logic here that creates a path "/a/b/bin" and then downloads a binary into this path
    os.mkdir("/a/b/bin")  ### This line throws the error in pytest-parallel
    some_binary = os.path.join("/a/b/bin", "binary_file")
    download_bin("some_bin_url", some_binary)
    return some_binary

test_input = [
    {"some": "value"},
    {"foo": "bar"}
]

@pytest.mark.xdist_group("group1")
@pytest.mark.parametrize("test_input", test_input, ids=["Test_1", "Test_2"])
def test_1(get_some_binary_file, test_input):
    # Testing logic here
    pass

# Some other completely different tests below
@pytest.mark.xdist_group("group2")
def test_2():
    # Some other testing logic here
    pass
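As an aside that is not part of the grouping approach above: if the immediate goal is only to stop the fixture from crashing when the directory already exists (for example because another worker already created it), a small defensive tweak is os.makedirs with exist_ok=True. This does not restore module-scope semantics across workers (the download may still run more than once); it only avoids the FileExistsError:

import os

import pytest

@pytest.fixture(scope="module")
def get_some_binary_file():
    # Create the directory only if it does not already exist, so a second
    # worker or a second module-scoped instantiation does not crash here.
    os.makedirs("/a/b/bin", exist_ok=True)
    some_binary = os.path.join("/a/b/bin", "binary_file")
    download_bin("some_bin_url", some_binary)
    return some_binary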

Pipeline Dependencies in Data Fusion

I have three pipelines in Data Fusion, say A, B, and C. I want Pipeline C to be triggered after both Pipeline A and Pipeline B complete. Pipeline triggers only let you put the dependency on one pipeline.
Can this be implemented in Data Fusion?
You can do it using Google Cloud Composer [1]. First you need to create a new environment in Google Cloud Composer [2]; once that is done, you need to install a new Python package in your environment [3], namely "apache-airflow-backport-providers-google" [4].
With this package installed you will be able to use these operators [5]; the one you need is "Start a DataFusion pipeline" [6], which lets you start a pipeline from Airflow.
An example of the Python code would be as follows:
import airflow
import datetime
from airflow import DAG
from airflow import models
from airflow.operators.bash_operator import BashOperator
from datetime import timedelta
from airflow.providers.google.cloud.operators.datafusion import (
    CloudDataFusionStartPipelineOperator
)

default_args = {
    'start_date': airflow.utils.dates.days_ago(0),
    'retries': 1,
    'retry_delay': timedelta(minutes=5)
}

with models.DAG(
        'composer_DF',
        schedule_interval=datetime.timedelta(days=1),
        default_args=default_args) as dag:

    # The operators.
    A = CloudDataFusionStartPipelineOperator(
        location="us-west1", pipeline_name="A",
        instance_name="instance_name", task_id="start_pipelineA",
    )
    B = CloudDataFusionStartPipelineOperator(
        location="us-west1", pipeline_name="B",
        instance_name="instance_name", task_id="start_pipelineB",
    )
    C = CloudDataFusionStartPipelineOperator(
        location="us-west1", pipeline_name="C",
        instance_name="instance_name", task_id="start_pipelineC",
    )

    # First A, then B, and then C
    A >> B >> C
You can set the time intervals by checking the Airflow documentation.
Once you have this code saved as a .py file, place it in the Google Cloud Storage DAG folder of your environment.
When the DAG starts, it will execute task A; when that finishes, it will execute task B, and so on.
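Note that the question asks for C to start only after both A and B have finished, rather than running A and B strictly one after the other. Assuming the same three operators as above, that fan-in dependency could instead be expressed as:

    # Run A and B in parallel; C starts only once both have completed.
    [A, B] >> C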
[1] https://cloud.google.com/composer
[2] https://cloud.google.com/composer/docs/how-to/managing/creating#:~:text=In%20the%20Cloud%20Console%2C%20open%20the%20Create%20Environment%20page.&text=Under%20Node%20configuration%2C%20click%20Add%20environment%20variable.&text=The%20From%3A%20email%20address%2C%20such,%40%20.&text=Your%20SendGrid%20API%20key.
[3] https://cloud.google.com/composer/docs/how-to/using/installing-python-dependencies
[4] https://pypi.org/project/apache-airflow-backport-providers-google/
[5] https://airflow.readthedocs.io/en/latest/_api/airflow/providers/google/cloud/operators/datafusion/index.html
[6] https://airflow.readthedocs.io/en/latest/howto/operator/google/cloud/datafusion.html#start-a-datafusion-pipeline
There is no direct way I can think of, but here are two workarounds.
Workaround 1: merge pipelines A and B into a single pipeline AB, then trigger pipeline C (AB > C).
Pipeline A - (GCS Copy > Decompress)
Pipeline B - (GCS2 > thrashsad)
Use a BigQueryExecute stage to mitigate the error "Invalid DAG. There is an island made up of stages...". In BigQueryExecute, use a valid dummy query.
Merging the two pipelines into one can make pipeline testing harder. To overcome this you can add a dummy condition so that you can run one pipeline at a time:
In BigQueryExecute, change the query to 'Select ${flag}' and pass the value of flag as a runtime argument, or use 'Select 1 as flag' and set "Row As Arguments" to true.
Add a Condition plugin after BigQueryExecute with the condition runtime['flag'] = 1.
The Condition plugin has two outlets; connect them to pipeline A and pipeline B.
Workaround 2: store a completion flag for both pipelines (A & B) in a BigQuery table, and create two flows A > C and B > C to trigger pipeline C. This triggers pipeline C twice, but with BigQueryExecute and the Condition plugin the rest of C will only run once both flags are present in the BigQuery table.
How?
In pipelines A and B, write a row to the BigQuery table 'Pipeline_Run'.
In pipeline C, add a BigQueryExecute stage with the query 'select count(*) as Cnt from ds.Pipeline_Run' and set "Row As Arguments" to true.
In pipeline C, add a Condition plugin, check whether the value of cnt is 2 (runtime['cnt'] = 2), and connect the rest of the pipeline's plugins to its "Yes" outlet.
You can also explore "schedules" set through the CDAP REST APIs. That allows parallel execution of pipelines with no dependency on Cloud Composer (except for a file-based trigger of the first pipeline in the workflow; for that you would need a Cloud Function, or perhaps a Cloud Composer file sensor).

Execute task on hosts which match a pattern

We use this pattern for our hosts (and linux users):
coreapp_customer_stageID#server
I want to run fabric commands on a list of hosts which match a pattern.
Example: I want to run "date" on all hosts of core app "foocms" of customer "c1".
I could use roles, but there are a lot of customers ... A glob matching way would be good.
You can use this:
import re

from fabric.api import env, run, task

@task
def set_hosts(lookup_param):
    hosts = get_all_systems()  # needs to be implemented by you
    regex = re.compile('^%s' % lookup_param.replace('*', '.*'))
    sub_hosts = [host for host in hosts if regex.match(host)]
    if not sub_hosts:
        raise ValueError('No host matches %s. Available: %s' % (lookup_param, hosts))
    env.hosts = sub_hosts

@task
def date():
    run('date')
Example: fab set_hosts:coreapp_customer1_* date
Taken from: http://docs.fabfile.org/en/1.7/usage/execution.html#the-alternate-approach
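get_all_systems() is left for you to implement; as a minimal sketch, assuming the host names live one per line in a plain text file (hosts.txt is just a placeholder name):

def get_all_systems(path='hosts.txt'):
    # Read one host name per line, skipping blank lines and comments.
    with open(path) as fh:
        return [line.strip() for line in fh
                if line.strip() and not line.startswith('#')]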