Execute a certain rule at the very end of a workflow

I am currently writing a Snakefile that does a lot of post-alignment quality control (CollectInsertSizeMetrics, CollectAlignmentSummaryMetrics, CollectGcBiasMetrics, ...).
At the very end of the Snakefile, I am running multiQC to combine all the metrics in one html report.
I know that if I use the output of rule A as input of rule B, rule B will only be executed after rule A is finished.
The problem in my case is that the input of multiQC is a directory, which exists right from the start. Inside this directory, multiQC searches for certain files and then creates the report.
When I execute my Snakefile, multiQC runs before all quality-control steps have finished (fastqc, for example, takes quite some time), so their results are missing from the final report.
So my question is whether there is an option that specifies that a certain rule is executed last.
I know that I could use --wait-for-files to wait for a certain fastqc report, but that seems very inflexible.
The last rule currently looks like this:
rule multiQC:
    input:
        input_dir = "post-alignment-qc"
    output:
        output_html = "post-alignment-qc/multiQC/multiqc-report.html"
    log:
        err = 'post-alignment-qc/logs/fastQC/multiqc_stderr.err'
    benchmark:
        "post-alignment-qc/benchmark/multiQC/multiqc.tsv"
    shell:
        "multiqc -f -n {output.output_html} {input.input_dir} 2> {log.err}"
Any help is appreciated!

You could give the multiqc rule the files produced by the individual QC rules as input. That way, multiqc will only start once all those files are available:
samples = ['a', 'b', 'c']

rule collectInsertSizeMetrics:
    input:
        '{sample}.bam',
    output:
        'post-alignment-qc/{sample}.insertSizeMetrics.txt'  # <- Some file produced by CollectInsertSizeMetrics
    shell:
        "CollectInsertSizeMetrics {input} > {output}"

rule CollectAlignmentSummaryMetrics:
    input:
        '{sample}.bam',
    output:
        'post-alignment-qc/{sample}.CollectAlignmentSummaryMetrics.txt'
    # shell: ... (analogous to the rule above)

rule multiqc:
    input:
        expand('post-alignment-qc/{sample}.insertSizeMetrics.txt', sample=samples),
        expand('post-alignment-qc/{sample}.CollectAlignmentSummaryMetrics.txt', sample=samples),
    output:
        output_html = 'post-alignment-qc/multiQC/multiqc-report.html'
    log:
        err = 'post-alignment-qc/logs/multiqc_stderr.err'
    shell:
        "multiqc -f -n {output.output_html} post-alignment-qc 2> {log.err}"

This seems like a classic A-B problem to me. Your hidden assumption is that, because multiqc asks for a directory in its command-line arguments, the Snakemake rule's input needs to be a directory. This is wrong.
Inputs should be all the things that are required for a rule to be able to run. Having a folder exist does not satisfy this requirement, because the folder can be empty. This is exactly the problem you're running into.
What you really need is for the files produced by the other commands to be present in the folder. So you should define those files as the required input. That's the Snakemake way. Stop thinking in terms of: this rule needs to run last. Think in terms of: these files need to be there for this rule to run.
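In concrete terms, that means listing the expected report files in the input of the multiQC rule. A minimal sketch along the lines of the other answer (the sample list and file names are illustrative, not taken from the question):

samples = ['a', 'b', 'c']

rule multiQC:
    input:
        # Every QC report that should end up in the MultiQC report goes here.
        expand('post-alignment-qc/{sample}.insertSizeMetrics.txt', sample=samples),
        expand('post-alignment-qc/fastQC/{sample}_fastqc.html', sample=samples),
    output:
        output_html = 'post-alignment-qc/multiQC/multiqc-report.html'
    shell:
        "multiqc -f -n {output.output_html} post-alignment-qc"

MultiQC is still pointed at the directory on the command line; the file list in the input directive only tells Snakemake when that directory is actually complete.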

Related

Force a certain rule to execute at the end

My question is very similar to this one.
I am writing a Snakemake pipeline, and it does a lot of pre- and post-alignment quality control. At the end of the pipeline, I run multiQC on those QC results.
Basically, the workflow is: preprocessing -> fastqc -> alignment -> post-alignment QCs such as picard, qualimap, and preseq -> peak calling -> motif analysis -> multiQC.
MultiQC should generate a report on all those outputs, as long as multiQC supports them.
One way to force multiqc to run at the very end is to include all the output files from the above rules in the input directive of the multiqc rule, as below:
rule a:
    input: "a.input"
    output: "a.output"

rule b:
    input: "b.input"
    output: "b.output"

rule c:
    input: "b.output"
    output: "c.output"

rule multiqc:
    input: "a.output", "c.output"
    output: "multiqc.output"
However, I want a more flexible way that doesn't depend on specific upstream output files. That way, when I change the pipeline (adding or removing rules), I don't need to change the dependencies of the multiqc rule. The input to multiqc should simply be a directory containing all the files that I want multiqc to scan over.
In my situation, how can I force the multiQC rule to execute at the very end of the pipeline? Or is there a general way to force a certain rule in Snakemake to run as the last job, perhaps through some configuration of Snakemake such that, no matter how I change the pipeline, this rule will execute at the end? I am not sure whether such a method exists.
Thanks very much for helping!
From your comments I gather that what you really want to do is run a flexibly configured number of QC methods and then summarise them at the end. The summary should only run once all the QC methods you want to run have completed.
Rather than manually forcing the MultiQC rule to be executed at the end, you can set up the MultiQC rule in such a way that it automatically gets executed at the end - by requiring the QC methods' output as input.
Your goal of flexibly configuring which QC rules to run can be easily achieved by passing the names of the QC rules through a config file, or, even easier, as a command-line argument.
Here is a minimal working example for you to extend:
###Snakefile###
rule end:
    input:
        'start.out',
        expand('opt_{qc}.out', qc=config['qc'])

rule start:
    output: 'start.out'

rule qc_a:
    input: 'start.out'
    output: 'opt_a.out'
    #shell: # whatever qc method a needs here

rule qc_b:
    input: 'start.out'
    output: 'opt_b.out'
    #shell: # whatever qc method b needs here
This is how you configure which QC method to run:
snakemake -npr end --config qc=['b'] #run just method b
snakemake -npr end --config qc=['a','b'] #run method a and b
snakemake -npr end --config qc=[] #run no QC method
It seems like the onsuccess handler in Snakemake is what I am looking for.
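For reference, a minimal sketch of what that could look like (the paths are illustrative, and note that anything produced inside onsuccess is not tracked by Snakemake as an output):

onsuccess:
    # Runs once, after the whole workflow has finished successfully.
    shell("multiqc -f -n post-alignment-qc/multiQC/multiqc-report.html post-alignment-qc")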

Snakemake: how to realize a mechanism to copy input/output files to/from tmp folder and apply rule there

We use Slurm workload manager to submit jobs to our high performance cluster. During runtime of a job, we need to copy the input files from a network filesystem to the node's local filesystem, run our analysis there and then copy the output files back to the project directory on the network filesystem.
While the workflow management system Snakemake integrates with Slurm (by defining profiles) and allows each rule/step in the workflow to be run as a Slurm job, I haven't found a simple way to specify for each rule whether a tmp folder should be used (with all the implications stated above) or not.
I would be very happy about simple solutions for how to realise this behaviour.
I am not entirely sure if I understand correctly. I am guessing you do not want to copy the input of each rule to a certain directory, run the rule, then copy the output back to another filesystem, since that would be a lot of unnecessary file shuffling. So for the first half of the answer I assume that, before execution, you move your files to /scratch/mydir.
I believe you could use the --directory option (https://snakemake.readthedocs.io/en/stable/executing/cli.html). However, I find this works poorly, since Snakemake then has difficulty finding the config.yaml and samples.tsv.
The way I solve this is just by adding a working directory in front of my paths in each rule:
rule example:
    input:
        config["cwd"] + "{sample}.txt"
    output:
        config["cwd"] + "processed/{sample}.txt"
    shell:
        """
        touch {output}
        """
So all you then have to do is change cwd in your config.yaml.
local:
  cwd: ./
slurm:
  cwd: /scratch/mydir
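One possible way to tie the nested config to the rules above (purely illustrative; the env key and the --config flag are assumptions, not part of the original answer) is to select the block that matches where the workflow runs:

# e.g. start the workflow with: snakemake --config env=slurm
env = config.get("env", "local")   # "local" or "slurm"
cwd = config[env]["cwd"]

rule example:
    input:
        cwd + "{sample}.txt"
    output:
        cwd + "processed/{sample}.txt"
    shell:
        "touch {output}"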
You would then have to manually copy them back to your long-term filesystem, or make a rule that does that for you.
Now if, however, you do want to copy your files from filesystem A -> B, run your rule, and then move the result from B -> A, then I think you want to make use of shadow rules. I think the docs explain properly how to use them, so I'll just give a link :).
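For orientation, a minimal sketch of a shadow rule (illustrative, not from the original answer): with shadow: "minimal", Snakemake links the declared inputs into a separate scratch directory, runs the rule there, and moves the declared outputs back when the rule finishes. The scratch location can be placed on the node-local filesystem via --shadow-prefix.

rule example:
    input:
        "{sample}.txt"
    output:
        "processed/{sample}.txt"
    # Run in an isolated scratch directory, e.g. under /scratch/mydir when the
    # workflow is started with: snakemake --shadow-prefix /scratch/mydir
    shadow: "minimal"
    shell:
        "touch {output}"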

Informatica Session Failing

I created a mapping that pulls data from a flat file that shows me usage data for specific SSRS reports. The file is overwritten each day with the previous day's usage data. My issue is that sometimes the report doesn't have any usage for that day, and my ETL sends me a "Failed" email because there wasn't any data in the source. Is there a way to keep the job from running if there is no data in the source, or to prevent it from failing?
--Thanks
A simple way to solve this is to create a "Passthrough" mapping that only contains a flat file source, source qualifier, and a flat file target.
You would create a session that runs this mapping at the beginning of your workflow and have it read your flat file source. The target can just be a dummy flat file that you keep overwriting. Then you would have this condition in the link to your next session that would actually process the file:
$s_Passthrough.SrcSuccessRows > 0
Yes, there are several ways you can do this.
You can provide an empty file to the ETL job when there is no source data. To do this, use a pre-session command like touch <filename> in the Informatica workflow. This will create an empty file with the given <filename> if it is not already present. The workflow will then run successfully with 0 rows.
If you have a script that triggers the Informatica job, then you can put a check there as well like this:
if [ -e <filename> ]
then
    pmcmd ...
fi
This will skip running the job when the file does not exist.
Have another session before the actual data load. Read the file, use a FALSE filter and some dummy target. Link it to the session you already have and set the following link condition:
$yourDummySessionName.SrcSuccessRows > 0

colorgcc perl script with output to non-tty enabled writing to C dependency files

Ok, so here's my issue. I have written a build script in bash that pipes output to tee and sorts different output to different log files (so I can summarize errors/warnings at the end and get some statistics on files built). I wanted to use the colorgcc perl script (colorgcc.1.3.2) to colorize the output from gcc and had found in other places that this won't work when piping to tee, since the script checks if it is writing to something that is not a tty. Having disabled this check, everything was working until I did a full build and discovered that some of the code we receive from another group builds C dependency files (we don't control this code; changing it or the build process for these isn't really an option).
The problem is that these .d files have the form as follows:
filename.o filename.d : filename.c \
dependant_file1.h \
dependant_file2.h (and so on for however many dependencies there are)
This output from GCC gets written into the .d file, but, since it is close enough to a warning/error message, colorgcc outputs color codes (I believe it's the check for filename:lineno:message, but I'm not 100% sure; it could be the filename:message check in the GCCOUT while loop). I've tried editing the regex to avoid matching this, but my perl-fu is admittedly pretty weak. So what I end up with is a color code on each line of these dependency files, which obviously causes the build to fail.
I ended up just replacing the check for ! -t STDOUT with a check for a NO_COLOR envar I set and unset in the build script for these directories (this emulates the previous behavior of no color for a non-tty). This works great if I run the full script, but doesn't if I cd into the directory and just run make (obviously setting and unsetting the variable manually would work, but this is a pain to do every time). Does anyone have any ideas how to prevent this script from writing color codes into the dependency files?
Here's how I worked around this. I added the following to colorgcc to search the gcc arguments for the flag that generates the .d files, and to just call the compiler directly in that case. This was inserted in place of the original TTY check.
foreach my $argnum (0 .. $#ARGV)
{
   # If gcc is being asked to generate dependency (.d) files (-M/-MM),
   # skip colorization and run the real compiler directly.
   if ($ARGV[$argnum] =~ m/-M{1,2}/)
   {
      exec $compiler, @ARGV
         or die("Couldn't exec");
   }
}
I don't know if this is the proper 'Perl' way of doing this sort of operation, but it seems to work. Compiling inside directories that build .d files no longer inserts color codes, while the source file builds still get colorized output (both to the terminal and to my log files, like I wanted). I guess sometimes the answer is more hacks instead of "hey, did you try giving up?".

Preventing the accidentally marking of all conflicts as resolved in Mercurial

It doesn't happen too often, but every once in a while I'll fumble with my typing and accidentally invoke "hg resolve -m" without a file argument. This then helpfully marks all the conflicts as resolved. Is there any way to prevent it from resolving without one or more file arguments?
You can do this with a pre-resolve hook, but you'd have to parse the arguments yourself to ensure that they are valid, which could be tricky.
The relevant environment variables that you might need to look at are:
HG_ARGS - the contents of the whole command line e.g. resolve -m
HG_OPTS - a dictionary object containing the options specified. This would have an entry called mark with a value of True if -m had been specified
HG_PATS - this is the list of files specified
Depending upon the scripting language you use, you should be able to test whether HG_OPTS contains a value of True for mark, and fail if it does while the HG_PATS list is empty.
It starts to get complicated when you take into account the --include and --exclude arguments.
If you specify the files to resolve as part of the --include option then the files to include would be in HG_OPTS, not HG_PATS. Also, I don't know what would happen if you specified hg resolve -m test.txt --exclude test.txt. I'd hope that it would not resolve anything but you'd need to test that.
Once you've parsed the command arguments, you'd return either 0 to allow the command or 1 to prevent it. You should echo a reason for the failure if you return 1 to avoid confusion later.
If you don't know how to do this then you'd need to specify what OS and shell you are using for anyone to provide more specific help.
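As a rough illustration only (not from the original answer): if the hook is wired up as an external Python script, the checks described above might look something like this. The exact format of HG_OPTS and HG_PATS can differ between Mercurial versions, so treat the parsing as an assumption to verify.

#!/usr/bin/env python3
# Hypothetical pre-resolve hook, e.g. configured in .hg/hgrc as:
#   [hooks]
#   pre-resolve = python3 /path/to/check_resolve.py
import ast
import os
import sys

# HG_OPTS / HG_PATS are assumed here to arrive as Python-literal strings.
opts = ast.literal_eval(os.environ.get("HG_OPTS", "{}"))
pats = ast.literal_eval(os.environ.get("HG_PATS", "[]"))

# Refuse "hg resolve -m" when no files (and no --include patterns) were given.
if opts.get("mark") and not pats and not opts.get("include"):
    sys.stderr.write("refusing to mark ALL conflicts as resolved; "
                     "pass one or more files to 'hg resolve -m'\n")
    sys.exit(1)  # a non-zero exit aborts the resolve command

sys.exit(0)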