Force a certain rule to execute at the end - workflow

My question is very similar to this one.
I am writing a snakemake pipeline, and it does a lot of pre- and post-alignment quality control. At the end of the pipeline, I run multiQC on those QC results.
Basically, the workflow is: preprocessing -> fastqc -> alignment -> post-alignment QCs such as picard, qualimap, and preseq -> peak calling -> motif analysis -> multiQC.
MultiQC should generate a report on all those outputs, as long as multiQC supports them.
One way to force multiqc to run at the very end is to include all the output files from the above rules in the input directive of the multiqc rule, as below:
rule a:
    input: "a.input"
    output: "a.output"

rule b:
    input: "b.input"
    output: "b.output"

rule c:
    input: "b.output"
    output: "c.output"

rule multiqc:
    input: "a.output", "c.output"
    output: "multiqc.output"
However, I want a more flexible way that doesn't depend on specific upstream output files, so that when I change the pipeline (adding or removing rules), I don't need to change the dependencies of the multiqc rule. The input to multiqc should simply be a directory containing all the files that I want multiqc to scan.
In my situation, how can I force the multiQC rule to execute at the very end of the pipeline? Or is there a general way to force a certain rule in Snakemake to run as the last job, perhaps through some configuration such that, no matter how I change the pipeline, this rule will always execute at the end? I am not sure whether such a method exists.
Thanks very much for helping!

From your comments I gather that what you really want to do is run a flexibly configured number of QC methods and then summarise them at the end. The summary should only run once all the QC methods you want to run have completed.
Rather than manually forcing the MultiQC rule to be executed at the end, you can set up the MultiQC rule in such a way that it automatically gets executed last - by requiring the QC methods' output as its input.
Your goal of flexibly configuring which QC rules to run can easily be achieved by passing the names of the QC rules through a config file, or even more easily as a command-line argument.
Here is a minimal working example for you to extend:
###Snakefile###
rule end:
    input:
        'start.out',
        expand('opt_{qc}.out', qc=config['qc'])

rule start:
    output: 'start.out'

rule qc_a:
    input: 'start.out'
    output: 'opt_a.out'
    # shell: whatever qc method a needs here

rule qc_b:
    input: 'start.out'
    output: 'opt_b.out'
    # shell: whatever qc method b needs here
This is how you configure which QC method to run:
snakemake -npr end --config qc=['b'] #run just method b
snakemake -npr end --config qc=['a','b'] #run method a and b
snakemake -npr end --config qc=[] #run no QC method
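If you prefer the config-file route mentioned above, a minimal sketch could look like this (the file name config.yaml is just an assumption; the key qc matches the example):
# config.yaml
qc:
  - a
  - b
and then run, for example:
snakemake -npr end --configfile config.yaml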

It seems like the onsuccess handler in Snakemake is what I was looking for.
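For reference, a minimal sketch of that approach (the command line and the post-alignment-qc directory are placeholders from my pipeline): the onsuccess handler runs once after the whole workflow has finished successfully, so MultiQC sees all QC outputs at that point.
# at the bottom of the Snakefile
onsuccess:
    shell("multiqc -f -n multiqc-report.html post-alignment-qc")
Note that files produced inside onsuccess are not tracked as rule outputs by Snakemake.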

Related

Issues running commands via cloud-init

An open source project I'm working on has components that require Linux, so virtualization has generally been the best solution for developing and testing new features. I'm attempting to provide a simple cloud-init file for Multipass that will configure the VM with our code by pulling our files from Git and setting them up in the VM automatically. However, even though the extra time elapsed during launch seems to indicate the process is being run, no files actually seem to be saved to the home directory, even for simpler cases, i.e.
runcmd:
  - [ cd, ~ ]
  - [ touch test ]
  - [ echo 'test' > test ]
Am I just misconfiguring cloud-init or am I missing something crucial?
There are a couple of problems going on here.
First, your cloud config user data must begin with the line:
#cloud-config
Without that line, cloud-init doesn't know what to do with it. If you were to submit a user-data configuration like this:
#cloud-config
runcmd:
  - [ cd, ~ ]
  - [ touch test ]
  - [ echo 'test' > test ]
You would find the following errors in /var/log/cloud-init-output.log:
runcmd.0: ['cd', None] is not valid under any of the given schemas
/var/lib/cloud/instance/scripts/runcmd: 2: cd: can't cd to None
/var/lib/cloud/instance/scripts/runcmd: 3: touch test: not found
/var/lib/cloud/instance/scripts/runcmd: 4: echo 'test' > test: not found
You'll find the solution to these problems in the documentation, which includes this note about runcmd:
# run commands
# default: none
# runcmd contains a list of either lists or a string
# each item will be executed in order at rc.local like level with
# output to the console
# - runcmd only runs during the first boot
# - if the item is a list, the items will be properly executed as if
# passed to execve(3) (with the first arg as the command).
# - if the item is a string, it will be simply written to the file and
# will be interpreted by 'sh'
You passed a list of lists, so the behavior is governed by "if the item is a list, the items will be properly executed as if passed to execve(3) (with the first arg as the command)". In this case, the ~ in [ cd, ~ ] doesn't make any sense -- the command isn't being executed by a shell, so there's nothing to expand ~.
The second two commands are each a single list item, and there is no command on your system named touch test or echo 'test' > test.
The simplest solution here is to simply pass in a list of strings instead:
#cloud-config
runcmd:
  - cd /root
  - touch test
  - echo 'test' > test
I've replaced cd ~ here with cd /root, because it seems better to be explicit (and you know these commands are running as root anyway).
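If you do want to keep the list (exec) form, a sketch like the following should also work -- note that anything that needs the shell (redirection, ~ expansion) then has to be wrapped in an explicit sh -c:
#cloud-config
runcmd:
  - [ touch, /root/test ]
  - [ sh, -c, "echo 'test' > /root/test" ]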

Execute certain rule at the very end

I am currently writing a Snakefile, which does a lot of post-alignment quality control (CollectInsertSizeMetrics, CollectAlignmentSummaryMetrics, CollectGcBiasMetrics, ...).
At the very end of the Snakefile, I am running multiQC to combine all the metrics in one html report.
I know that if I use the output of rule A as input of rule B, rule B will only be executed after rule A is finished.
The problem in my case is that the input of multiQC is a directory, which exists right from the start. Inside of this directory, multiQC will search for certain files and then create the report.
When I execute my Snakemake file, multiQC is run before all quality controls have been performed (e.g. fastqc takes quite some time), so these are missing from the final report.
So my question is whether there is an option that specifies that a certain rule is executed last.
I know that I could use --wait-for-files to wait for a certain fastqc report, but that seems very inflexible.
The last rule currently looks like this:
rule multiQC:
    input:
        input_dir = "post-alignment-qc"
    output:
        output_html = "post-alignment-qc/multiQC/multiqc-report.html"
    log:
        err = 'post-alignment-qc/logs/fastQC/multiqc_stderr.err'
    benchmark:
        "post-alignment-qc/benchmark/multiQC/multiqc.tsv"
    shell:
        "multiqc -f -n {output.output_html} {input.input_dir} 2> {log.err}"
Any help is appreciated!
You could give the multiqc rule as input the files produced by the individual QC rules. In this way, multiqc will start only once all those files are available:
samples = ['a', 'b', 'c']

rule collectInsertSizeMetrics:
    input:
        '{sample}.bam',
    output:
        'post-alignment-qc/{sample}.insertSizeMetrics.txt' # <- some file produced by CollectInsertSizeMetrics
    shell:
        "CollectInsertSizeMetrics {input} > {output}"

rule CollectAlignmentSummaryMetrics:
    input:
        '{sample}.bam',
    output:
        'post-alignment-qc/{sample}.CollectAlignmentSummaryMetrics.txt'

rule multiqc:
    input:
        expand('post-alignment-qc/{sample}.insertSizeMetrics.txt', sample=samples),
        expand('post-alignment-qc/{sample}.CollectAlignmentSummaryMetrics.txt', sample=samples),
    output:
        output_html = 'post-alignment-qc/multiQC/multiqc-report.html'
    log:
        err = 'post-alignment-qc/logs/multiQC/multiqc_stderr.err'
    shell:
        "multiqc -f -n {output.output_html} post-alignment-qc 2> {log.err}"
This seems like a classic A-B-problem to me. Your hidden assumption is that because multiqc asks for a directory in its command line arguments, the Snakemake rule's input needs to be a directory. This is wrong.
Inputs should be all the things that are required for a rule to be able to run. Having a folder exist does not satisfy this requirement, because the folder can be empty. This is exactly the problem you're running into.
What you really need is for the files produced by the other commands to be present in the folder. So you should define those files as the required input. That's the Snakemake way. Stop thinking in terms of "this rule needs to run last"; think in terms of "these files need to be there for this rule to run".

Snakemake: how to realize a mechanism to copy input/output files to/from tmp folder and apply rule there

We use Slurm workload manager to submit jobs to our high performance cluster. During runtime of a job, we need to copy the input files from a network filesystem to the node's local filesystem, run our analysis there and then copy the output files back to the project directory on the network filesystem.
While the workflow management system Snakemake integrates with Slurm (by defining profiles) and allows running each rule/step in the workflow as a Slurm job, I haven't found a simple way to specify for each rule whether a tmp folder should be used (with all the implications stated above) or not.
I would be very happy about simple solutions for how to realise this behaviour.
I am not entirely sure if I understand correctly. I am guessing you do not want to copy the input of each rule to a certain directory, run the rule, then copy the output back to another filesystem, since that would mean a lot of unnecessary moving of files. So for the first half of the answer I assume that before execution you move your files to /scratch/mydir.
I believe you could use the --directory option (https://snakemake.readthedocs.io/en/stable/executing/cli.html). However, I find this works poorly, since Snakemake then has difficulty finding the config.yaml and samples.tsv.
The way I solve this is just by adding a working dir in front of my paths in each rule...
rule example:
    input:
        config["cwd"] + "{sample}.txt"
    output:
        config["cwd"] + "processed/{sample}.txt"
    shell:
        """
        touch {output}
        """
So all you then have to do is change cwd in your config.yaml.
local:
  cwd: ./
slurm:
  cwd: /scratch/mydir
You would then have to manually copy them back to your long-term filesystem or make a rule that would do that for you.
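A sketch of such a copy-back rule might look like this (the long-term path and the cwd config key are assumptions carried over from the example above):
rule copy_back:
    input:
        config["cwd"] + "processed/{sample}.txt"
    output:
        "/network-fs/project/processed/{sample}.txt"
    shell:
        "cp {input} {output}"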
Now if, however, you do want to copy your files from filesystem A -> B, do your rule, and then move the result from B -> A, then I think you want to make use of shadow rules. I think the docs explain properly how to use them, so I'll just give a link :).
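For completeness, a minimal sketch of a shadow rule (rule name and paths are placeholders): Snakemake runs the rule inside a temporary shadow directory and moves the declared outputs back once it finishes.
rule example_shadow:
    input:
        "{sample}.txt"
    output:
        "processed/{sample}.txt"
    shadow: "minimal"  # only the inputs are linked into the shadow directory
    shell:
        "touch {output}"
Combined with --shadow-prefix /scratch/mydir on the command line, the shadow directories are placed on the node-local disk.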

Informatica Session Failing

I created a mapping that pulls data from a flat file that shows me usage data for specific SSRS reports. The file is overwritten each day with the previous day's usage data. My issue is that sometimes the report doesn't have any usage for that day, and my ETL sends me a "Failed" email because there wasn't any data in the Source. Is there a way to prevent the job from running if there is no data in the source, or to prevent it from failing?
--Thanks
A simple way to solve this is to create a "Passthrough" mapping that only contains a flat file source, source qualifier, and a flat file target.
You would create a session that runs this mapping at the beginning of your workflow and have it read your flat file source. The target can just be a dummy flat file that you keep overwriting. Then you would have this condition in the link to your next session that would actually process the file:
$s_Passthrough.SrcSuccessRows > 0
Yes, there are several ways you can do this.
You can provide an empty file to the ETL job when there is no source data. To do this, use a pre-session command like touch <filename> in the Informatica workflow. This will create an empty file named <filename> if it is not present. The workflow will run successfully with 0 rows.
If you have a script that triggers the Informatica job, then you can put a check there as well like this:
if [ -e <filename> ]
then
pmcmd ...
fi
This will skip the job execution when the file is not present.
Have another session before the actual data load. Read the file, use a FALSE filter and some dummy target. Link this one to the session you already have and set the following link condition:
$yourDummySessionName.SrcSuccessRows > 0

Passing parameters to Capistrano

I'm looking into the possibility of using Capistrano as a generic deploy solution. By "generic", I mean not-rails. I'm not happy with the quality of the documentation I'm finding, though, granted, I'm not looking at the ones that presume you are deploying rails. So I'll just try to hack up something based on a few examples, but there are a couple of problems I'm facing right from the start.
My problem is that cap deploy doesn't have enough information to do anything. Importantly, it is missing the tag for the version I want to deploy, and this has to be passed on the command line.
The other problem is how I specify my git repository. Our git server is accessed by SSH on the user's account, but I don't know how to change deploy.rb to use the user's id as part of the scm URL.
So, how do I accomplish these things?
Example
I want to deploy the result of the first sprint of the second release. That's tagged in the git repository as r2s1. Also, let's say user "johndoe" gets the task of deploying the system. To access the repository, he has to use the URL johndoe@gitsrv.domain:app. So the remote URL for the repository depends on the user id.
The command lines to get the desired files would be these:
git clone johndoe@gitsrv.domain:app
cd app
git checkout r2s1
Update: For Capistrano 3, see scieslak's answer below.
As jarrad has said, capistrano-ash is a good basic set of helper modules to deploy other project types, though it's not required; at the end of the day it's just a scripting language, and most tasks are done with system commands, ending up almost like shell scripts.
To pass in parameters, you can use the -s flag when running cap to pass a key-value pair. First, create a task like this:
desc "Parameter Testing"
task :parameter do
puts "Parameter test #{branch} #{tag}"
end
Then start your task like so.
cap test:parameter -s branch=master -s tag=1.0.0
For the last part, I would recommend setting up passwordless access to your server using SSH keys. But if you want to take the user from whoever is currently logged in, you can do something like this.
desc "Parameter Testing"
task :parameter do
system("whoami", user)
puts "Parameter test #{user} #{branch} #{tag}"
end
UPDATE: Edited to work with the latest versions of Capistrano. The configuration array is no longer available.
Global parameters: see the comments. Use set :branch, fetch(:branch, 'a-default-value') to set parameters globally (and pass them with -S instead).
Update: regarding passing parameters to Capistrano 3 tasks only.
I know this question is quite old, but it still pops up first on Google when searching for passing parameters to a Capistrano task. Unfortunately, the fantastic answer provided by Jamie Sutherland is no longer valid with Capistrano 3. Before you waste your time trying it out, expect the results to be like below:
cap test:parameter -s branch=master
outputs :
cap aborted!
OptionParser::AmbiguousOption: ambiguous option: -s
OptionParser::InvalidOption: invalid option: s
and
cap test:parameter -S branch=master
outputs:
invalid option: -S
The valid answers for Capistrano 3, provided by @senz and Brad Dwyer, can be found by following this gold link:
Capistrano 3 pulling command line arguments
For completeness, see the code below to find out about the two options you have.
1st option:
You can pass parameters to tasks and access them by key, as you do with regular hashes:
desc "This task accepts optional parameters"
task :task_with_params, :first_param, :second_param do |task_name, parameter|
run_locally do
puts "Task name: #{task_name}"
puts "First parameter: #{parameter[:first_param]}"
puts "Second parameter: #{parameter[:second_param]}"
end
end
Make sure there is no space between parameters when you call cap:
cap production task_with_params[one,two]
2nd option:
When you call any task, you can assign environment variables and then read them from the code:
set :first_param, ENV['first_env'] || 'first default'
set :second_param, ENV['second_env'] || 'second default'

desc "This task accepts optional parameters"
task :task_with_env_params do
  run_locally do
    puts "First parameter: #{fetch(:first_param)}"
    puts "Second parameter: #{fetch(:second_param)}"
  end
end
To assign environment variables, call cap like below:
cap production task_with_env_params first_env=one second_env=two
Hope that will save you some time.
I'd suggest using ENV variables.
Something like this (command):
$ GIT_REPO="johndoe#gitsrv.domain:app" GIT_BRANCH="r2s1" cap testing
Cap config:
# deploy.rb:
task :testing, :roles => :app do
  puts ENV['GIT_REPO']
  puts ENV['GIT_BRANCH']
end
And take a look at https://github.com/capistrano/capistrano/wiki/2.x-Multistage-Extension; maybe this approach will be useful for you as well.
As Jamie already showed, you can pass parameters to tasks with the -s flag. I want to show you how you can additionally use a default value.
If you want to work with default values, you have to use fetch instead of ||= or checking for nil:
namespace :logs do
  task :tail do
    file = fetch(:file, 'production') # sets 'production' as default value
    puts "I would use #{file}.log now"
  end
end
You can either run this task by (uses the default value production for file)
$ cap logs:tail
or (uses the value cron for file)
$ cap logs:tail -s file=cron
Check out capistrano-ash for a library that helps with non-rails deployment. I use it to deploy a PyroCMS app and it works great.
Here is a snippet from my Capfile for that project:
# deploy from git repo
set :repository, "git#git.mygitserver.com:mygitrepo.git"
# tells cap to use git
set :scm, :git
I'm not sure I understand the last two parts of the question. Provide some more detail and I'd be happy to help.
EDIT after example given:
set :repository, "#{scm_user}#gitsrv.domain:app"
Then each person with deploy privileges can add the following to their local ~/.caprc file:
set :scm_user, 'someuser'