Snakemake workflow, ChildIOException or MissingInputException - workflow

I am trying to add a file renaming step in my current workflow to make it easier on some of the other users. What I want to do is take the contigs.fasta file from a spades assembly directory and rename it to include the sample name. (i.e foo_de_novo/contigs.fasta to foo_de_novo/foo.fasta)
here is my code... well currently.
configfile: "config.yaml"
import os
def is_file_empty(file_path):
""" Check if file is empty by confirming if its size is 0 bytes"""
# Check if singleton file exist and it is empty from bbrepair output
return os.path.exists(file_path) and os.stat(file_path).st_size == 0
rule all:
input:
expand("{sample}_de_novo/{sample}.fasta", sample = config["names"]),
rule fastp:
input:
r1 = lambda wildcards: config["sample_reads_r1"][wildcards.sample],
r2 = lambda wildcards: config["sample_reads_r2"][wildcards.sample]
output:
r1 = temp("clean/{sample}_r1.trim.fastq.gz"),
r2 = temp("clean/{sample}_r2.trim.fastq.gz")
shell:
"fastp --in1 {input.r1} --in2 {input.r2} --out1 {output.r1} --out2 {output.r2} --trim_front1 20 --trim_front2 20"
rule bbrepair:
input:
r1 = "clean/{sample}_r1.trim.fastq.gz",
r2 = "clean/{sample}_r2.trim.fastq.gz"
output:
r1 = temp("clean/{sample}_r1.fixed.fastq"),
r2 = temp("clean/{sample}_r2.fixed.fastq"),
singles = temp("clean/{sample}.singletons.fastq")
shell:
"repair.sh -Xmx10g in1={input.r1} in2={input.r2} out1={output.r1} out2={output.r2} outs={output.singles}"
rule spades:
input:
r1 = "clean/{sample}_r1.fixed.fastq",
r2 = "clean/{sample}_r2.fixed.fastq",
s = "clean/{sample}.singletons.fastq"
output:
directory("{sample}_de_novo")
run:
isempty = is_file_empty("clean/{sample}.singletons.fastq")
if isempty == "False":
shell("spades.py --careful --phred-offset 33 -1 {input.r1} -2 {input.r2} -s {input.singletons} -o {output}")
else:
shell("spades.py --careful --phred-offset 33 -1 {input.r1} -2 {input.r2} -o {output}")
rule rename_spades:
input:
"{sample}_de_novo/contigs.fasta"
output:
"{sample}_de_novo/{sample}.fasta"
shell:
"cp {input} {output}"
When I have it written like this I get the MissingInputError and when I change it to this.
rule rename_spades:
input:
"{sample}_de_novo"
output:
"{sample}_de_novo/{sample}.fasta"
shell:
"cp {input} {output}"
I get the ChildIOException
I feel I understand why snakemake is unhappy with both versions. The first one is becasue I don't explicitly output the "{sample}_de_novo/contigs.fasta" file. Its just one of several files spades outputs. And the other error is because it doesn't like how I am asking it to look into the directory. I however am at a loss on how to fix this.
Is there a way to ask snakmake to look into a directory for a file and then perform the task requested?
Thank you,
Sean
EDIT File Structure of Spades output
Sample_de_novo
|-corrected/
|-K21/
|-K33/
|-K55/
|-K77/
|-misc/
|-mismatch_corrector/
|-tmp/
|-assembly_graph.fastg
|-assembly_graph_with_scaffolds.gfa
|-before_rr.fasta
|-contigs.fasta
|-contigs.paths
|-dataset.info
|-input_dataset.ymal
|-params.txt
|-scaffolds.fasta
|-scaffolds.paths
|spades.log

Make {sample}_de_novo/contigs.fasta to be the output of spades and parse its path to get the directory that will be the argument to spades -o. Snakemake won't mind if there are other files created in addition to contigs.fasta. This should run --dry-run mode:
rule all:
input:
expand('{sample}_de_novo/{sample}.fasta', sample=['A', 'B']),
rule spades:
output:
fasta='{sample}_de_novo/contigs.fasta',
run:
outdir=os.path.dirname(output.fasta)
shell(f'spades ... -o {outdir}')
rule rename:
input:
fasta='{sample}_de_novo/contigs.fasta',
output:
fasta='{sample}_de_novo/{sample}.fasta',
shell:
r"""
mv {input.fasta} {output.fasta}
"""

Nope, spoke too soon. It didn't name the output directory correctly, so I moved it to the params and, now, finailly is working the way I wanted.
rule spades:
input:
r1 = "clean/{sample}_r1.fixed.fastq",
r2 = "clean/{sample}_r2.fixed.fastq",
s = "clean/{sample}.singletons.fastq"
output:
"{sample}_de_novo/contigs.fasta"
params:
outdir = directory("{sample}_de_novo/")
run:
isempty = is_file_empty("clean/{sample}.singletons.fastq")
if isempty == "False":
shell("spades.py --isolate --phred-offset 33 -1 {input.r1} -2 {input.r2} -s {input.singletons} -o {params.outdir}")
else:
shell("spades.py --isolate --phred-offset 33 -1 {input.r1} -2 {input.r2} -o {params.outdir}")
rule rename_spades:
input:
"{sample}_de_novo/contigs.fasta"
output:
"{sample}_de_novo/{sample}.fasta"
shell:
"cp {input} {output}"

Related

Generate many files with wildcard, then merge into one

I have two rules on my Snakefile: one generates several sets of files using wildcards, the other one merges everything into a single file. This is how I wrote it:
chr = range(1,23)
rule generate:
input:
og_files = config["tmp"] + '/chr{chr}.bgen',
output:
out = multiext(config["tmp"] + '/plink/chr{{chr}}',
'.bed', '.bim', '.fam')
shell:
"""
plink \
--bgen {input.og_files} \
--make-bed \
--oxford-single-chr \
--out {config[tmp]}/plink/chr{chr}
"""
rule merge:
input:
plink_chr = expand(config["tmp"] + '/plink/chr{chr}.{ext}',
chr = chr,
ext = ['bed', 'bim', 'fam'])
output:
out = multiext(config["tmp"] + '/all',
'.bed', '.bim', '.fam')
shell:
"""
plink \
--pmerge-list-dir {config[tmp]}/plink \
--make-bed \
--out {config[tmp]}/all
"""
Unfortunately, this does not allow me to track the file coming from the first rule to the 2nd rule:
$ snakemake -s myfile.smk -c1 -np
Building DAG of jobs...
MissingInputException in line 17 of myfile.smk:
Missing input files for rule merge:
[list of all the files made by expand()]
What can I use to be able to generate the 22 sets of files with the wildcard chr in generate, but be able to track them in the input of merge? Thank you in advance for your help
In rule generate I think you don't want to escape the {chr} wildcard, otherwise it doesn't get replaced. I.e.:
out = multiext(config["tmp"] + '/plink/chr{{chr}}',
'.bed', '.bim', '.fam')
should be:
out = multiext(config["tmp"] + '/plink/chr{chr}',
'.bed', '.bim', '.fam')

Why are my wildcard attributes not being filled in Snakemake?

I am following the tutorial in the documentation (https://snakemake.readthedocs.io/en/stable/tutorial/advanced.html) and have been stuck on the "Step 4: Rule parameter" exercise. I would like to access a float from my config file using a wildcard in my params directive.
I seem to be getting the same error whenever I run snakemake -np in the command line:
InputFunctionException in line 46 of /mnt/c/Users/Matt/Desktop/snakemake-tutorial/Snakefile:
Error:
AttributeError: 'Wildcards' object has no attribute 'sample'
Wildcards:
Traceback:
File "/mnt/c/Users/Matt/Desktop/snakemake-tutorial/Snakefile", line 14, in get_bcftools_call_priors
This is my code so far
import time
configfile: "config.yaml"
rule all:
input:
"plots/quals.svg"
def get_bwa_map_input_fastqs(wildcards):
print(wildcards.__dict__, 1, time.time()) #I have this print as a check
return config["samples"][wildcards.sample]
def get_bcftools_call_priors(wildcards):
print(wildcards.__dict__, 2, time.time()) #I have this print as a check
return config["prior_mutation_rates"][wildcards.sample]
rule bwa_map:
input:
"data/genome.fa",
get_bwa_map_input_fastqs
#lambda wildcards: config["samples"][wildcards.sample]
output:
"mapped_reads/{sample}.bam"
params:
rg=r"#RG\tID:{sample}\tSM:{sample}"
threads: 2
shell:
"bwa mem -R '{params.rg}' -t {threads} {input} | samtools view -Sb - > {output}"
rule samtools_sort:
input:
"mapped_reads/{sample}.bam"
output:
"sorted_reads/{sample}.bam"
shell:
"samtools sort -T sorted_reads/{wildcards.sample} "
"-O bam {input} > {output}"
rule samtools_index:
input:
"sorted_reads/{sample}.bam"
output:
"sorted_reads/{sample}.bam.bai"
shell:
"samtools index {input}"
rule bcftools_call:
input:
fa="data/genome.fa",
bam=expand("sorted_reads/{sample}.bam", sample=config["samples"]),
bai=expand("sorted_reads/{sample}.bam.bai", sample=config["samples"])
#prior=get_bcftools_call_priors
params:
prior=get_bcftools_call_priors
output:
"calls/all.vcf"
shell:
"samtools mpileup -g -f {input.fa} {input.bam} | "
"bcftools call -P {params.prior} -mv - > {output}"
rule plot_quals:
input:
"calls/all.vcf"
output:
"plots/quals.svg"
script:
"scripts/plot-quals.py"
and here is my config.yaml
samples:
A: data/samples/A.fastq
#B: data/samples/B.fastq
#C: data/samples/C.fastq
prior_mutation_rates:
A: 1.0e-4
#B: 1.0e-6
I don't understand why my input function call in bcftools_call says that the wildcards object is empty of attributes, yet an almost identical function call in bwa_map has the attribute sample that I want. From the documentation it seems like the wildcards would be propogated before anything is run, so why is it missing?
This is the full output of the commandline call snakemake -np:
{'_names': {'sample': (0, None)}, '_allowed_overrides': ['index', 'sort'], 'index': functools.partial(<function Namedlist._used_attribute at 0x7f91b1a58f70>, _name='index'), 'sort': functools.partial(<function Namedlist._used_attribute at 0x7f91b1a58f70>, _name='sort'), 'sample': 'A'} 1 1628877061.8831172
Job stats:
job count min threads max threads
-------------- ------- ------------- -------------
all 1 1 1
bcftools_call 1 1 1
bwa_map 1 1 1
plot_quals 1 1 1
samtools_index 1 1 1
samtools_sort 1 1 1
total 6 1 1
[Fri Aug 13 10:51:01 2021]
rule bwa_map:
input: data/genome.fa, data/samples/A.fastq
output: mapped_reads/A.bam
jobid: 4
wildcards: sample=A
resources: tmpdir=/tmp
bwa mem -R '#RG\tID:A\tSM:A' -t 1 data/genome.fa data/samples/A.fastq | samtools view -Sb - > mapped_reads/A.bam
[Fri Aug 13 10:51:01 2021]
rule samtools_sort:
input: mapped_reads/A.bam
output: sorted_reads/A.bam
jobid: 3
wildcards: sample=A
resources: tmpdir=/tmp
samtools sort -T sorted_reads/A -O bam mapped_reads/A.bam > sorted_reads/A.bam
[Fri Aug 13 10:51:01 2021]
rule samtools_index:
input: sorted_reads/A.bam
output: sorted_reads/A.bam.bai
jobid: 5
wildcards: sample=A
resources: tmpdir=/tmp
samtools index sorted_reads/A.bam
[Fri Aug 13 10:51:01 2021]
rule bcftools_call:
input: data/genome.fa, sorted_reads/A.bam, sorted_reads/A.bam.bai
output: calls/all.vcf
jobid: 2
resources: tmpdir=/tmp
{'_names': {}, '_allowed_overrides': ['index', 'sort'], 'index': functools.partial(<function Namedlist._used_attribute at 0x7f91b1a58f70>, _name='index'), 'sort': functools.partial(<function Namedlist._used_attribute at 0x7f91b1a58f70>, _name='sort')} 2 1628877061.927639
InputFunctionException in line 46 of /mnt/c/Users/Matt/Desktop/snakemake-tutorial/Snakefile:
Error:
AttributeError: 'Wildcards' object has no attribute 'sample'
Wildcards:
Traceback:
File "/mnt/c/Users/Matt/Desktop/snakemake-tutorial/Snakefile", line 14, in get_bcftools_call_priors
If anyone knows what is going wrong I would really appreciate an explaination. Also if there is a better way of getting information out of the config.yaml into the different directives, I would gladly appreciate those tips.
Edit:
I have searched around the internet quite a bit, but have yet to understand this issue.
Wildcards for each rule are based on that rule's output file(s). The rule bcftools_call has one output file (calls/all.vcf), which has no wildcards. Because of this, when get_bcftools_call_priors is called, it throws an exception when it tries to access the unset wildcards.sample attribute.
You should probably set a global prior_mutation_rate in your config file and then access that in the bcftools_call rule:
rule bcftools_call:
...
params:
prior=config["prior_mutation_rate"],

WildcardError - No values given for wildcard - Snakemake

I am really lost what exactly I can do to fix this error.
I am running snakemake to perform some post alignment quality checks.
My code looks like this:
SAMPLES = ["Exome_Tumor_sorted_mrkdup_bqsr", "Exome_Norm_sorted_mrkdup_bqsr",
"WGS_Tumor_merged_sorted_mrkdup_bqsr", "WGS_Norm_merged_sorted_mrkdup_bqsr"]
rule all:
input:
expand("post-alignment-qc/flagstat/{sample}.txt", sample=SAMPLES),
expand("post-alignment-qc/CollectInsertSizeMetics/{sample}.txt", sample=SAMPLES),
expand("post-alignment-qc/CollectAlignmentSummaryMetrics/{sample}.txt", sample=SAMPLES),
expand("post-alignment-qc/CollectGcBiasMetrics/{sample}_summary.txt", samples=SAMPLES) # this is the problem causing line
rule flagstat:
input:
bam = "align/{sample}.bam"
output:
"post-alignment-qc/flagstat/{sample}.txt"
log:
err='post-alignment-qc/logs/flagstat/{sample}_stderr.err'
shell:
"samtools flagstat {input} > {output} 2> {log.err}"
rule CollectInsertSizeMetics:
input:
bam = "align/{sample}.bam"
output:
txt="post-alignment-qc/CollectInsertSizeMetics/{sample}.txt",
pdf="post-alignment-qc/CollectInsertSizeMetics/{sample}.pdf"
log:
err='post-alignment-qc/logs/CollectInsertSizeMetics/{sample}_stderr.err',
out='post-alignment-qc/logs/CollectInsertSizeMetics/{sample}_stdout.txt'
shell:
"gatk CollectInsertSizeMetrics -I {input} -O {output.txt} -H {output.pdf} 2> {log.err}"
rule CollectAlignmentSummaryMetrics:
input:
bam = "align/{sample}.bam",
genome= "references/genome/ref_genome.fa"
output:
txt="post-alignment-qc/CollectAlignmentSummaryMetrics/{sample}.txt",
log:
err='post-alignment-qc/logs/CollectAlignmentSummaryMetrics/{sample}_stderr.err',
out='post-alignment-qc/logs/CollectAlignmentSummaryMetrics/{sample}_stdout.txt'
shell:
"gatk CollectAlignmentSummaryMetrics -I {input.bam} -O {output.txt} -R {input.genome} 2> {log.err}"
rule CollectGcBiasMetrics:
input:
bam = "align/{sample}.bam",
genome= "references/genome/ref_genome.fa"
output:
txt="post-alignment-qc/CollectGcBiasMetrics/{sample}_metrics.txt",
CHART="post-alignment-qc/CollectGcBiasMetrics/{sample}_metrics.pdf",
S="post-alignment-qc/CollectGcBiasMetrics/{sample}_summary.txt"
log:
err='post-alignment-qc/logs/CollectGcBiasMetrics/{sample}_stderr.err',
out='post-alignment-qc/logs/CollectGcBiasMetrics/{sample}_stdout.txt'
shell:
"gatk CollectGcBiasMetrics -I {input.bam} -O {output.txt} -R {input.genome} -CHART = {output.CHART} "
"-S {output.S} 2> {log.err}"
The error message says the following:
WildcardError in line 9 of Snakefile:
No values given for wildcard 'sample'.
File "Snakefile", line 9, in <module>
In my code above I have indicated the problem causing line. When I simply remove this line everything runs perfekt. I am really confused, because I pretty much copy and pasted each rule, and this is the only rule that causes any problems.
If someone could point out what I did wrong, I would be very thankful!
Cheers!
Seems like it could be a spelling mistake - in the highlighted line, you write samples=SAMPLES, but the wildcard is called {sample} without the "s".

Merging several vcf files using snakemake

I am trying to merge several vcf files by chromosome using snakemake. My files are like this, and as you can see has various coordinates. What is the best way to merge all chr1A and all chr1B?
chr1A:0-2096.filtered.vcf
chr1A:2096-7896.filtered.vcf
chr1B:0-3456.filtered.vcf
chr1B:3456-8796.filtered.vcf
My pseudocode:
chromosomes=["chr1A","chr1B"]
rule all:
input:
expand("{sample}.vcf", sample=chromosomes)
rule merge:
input:
I1="path/to/file/{sample}.xxx.filtered.vcf",
I2="path/to/file/{sample}.xxx.filtered.vcf",
output:
outf ="{sample}.vcf"
shell:
"""
java -jar picard.jar GatherVcfs I={input.I1} I={input.I2} O={output.outf}
"""
EDIT:
workdir: "/media/prova/Maxtor2/vcf2/merged/"
import subprocess
d = {"chr1A": ["chr1A:0-2096.flanking.view.filtered.vcf", "chr1A:2096-7896.flanking.view.filtered.vcf"],
"chr1B": ["chr1B:0-3456.flanking.view.filtered.vcf", "chr1B:3456-8796.flanking.view.filtered.vcf"]}
rule all:
input:
expand("{sample}.vcf", sample=d)
def f(w):
return d.get(w.chromosome, "")
rule merge:
input:
f
output:
outf ="{chromosome}.vcf"
params:
lambda w: "I=" + " I=".join(d[w.chromosome])
shell:
"java -jar /home/Documents/Tools/picard.jar GatherVcfs {params[0]} O={output.outf}"
I was able to reproduce your bug. When constraining the wildcards, it works:
d = {"chr1A": ["chr1A:0-2096.flanking.view.filtered.vcf", "chr1A:2096-7896.flanking.view.filtered.vcf"],
"chr1B": ["chr1B:0-3456.flanking.view.filtered.vcf", "chr1B:3456-8796.flanking.view.filtered.vcf"]}
chromosomes = list(d)
rule all:
input:
expand("{sample}.vcf", sample=chromosomes)
# these tell Snakemake exactly what values the wildcards may take
# we use "|" to create the regex chr1A|chr1B
wildcard_constraints:
chromosome = "|".join(chromosomes)
rule merge:
input:
# a lambda is an unnamed function
# the first argument is the wildcards
# we merely use it to look up the appropriate files in the dict d
lambda w: d[w.chromosome]
output:
outf = "{chromosome}.vcf"
params:
# here we create the string
# "I=chr1A:0-2096.flanking.view.filtered.vcf I=chr1A:2096-7896.flanking.view.filtered.vcf"
# for use in our command
lambda w: "I=" + " I=".join(d[w.chromosome])
shell:
"java -jar /home/Documents/Tools/picard.jar GatherVcfs {params[0]} O={output.outf}"
It should have worked without the constraints too; this seems like a bug in Snakemake.

Shell variable name queried from Matlab has additional character

I'm working with the following script, run_test:
#!/bin/sh
temp=$1;
cat <<EOF | matlab
[status name] = unix('echo $temp');
disp(name);
% some Matlab code
test_complete = 1;
save(name)
exit
EOF
I want to pass a name to the script, run some code then save a .mat file with the name that was passed. However, there is a curious piece of behavior:
[energon2] ~ $ ./run_test 'run1'
Warning: No display specified. You will not be able to display graphics on the screen.
< M A T L A B (R) >
Copyright 1984-2010 The MathWorks, Inc.
Version 7.12.0.635 (R2011a) 64-bit (glnxa64)
March 18, 2011
To get started, type one of these: helpwin, helpdesk, or demo.
For product information, visit www.mathworks.com.
>> >> >> >> run1
>> >> >> >> >>
[energon2] ~ $ ls *.mat
run1?.mat
There is a "?" at the end of the file name when it's saved, but not when displayed on command line. This is acceptable for my needs, but a bit irritating to not know why it's occurring. Any explanation would be appreciated.
Edits, solution:
Yuk was correct below in the underlying cause and the use of save('$temp'). I'm now using the following script
#!/bin/sh
temp=$1;
cat <<EOF | matlab
% some Matlab code
test_complete = 1;
save('$temp')
exit
EOF
Thanks for the help.
You name variable has end-of-line as the last character. When you run echo run1 in unix this command display run1 and then "hit enter". In your script all the output of echo are saved to the name variable.
You can confirm it with the following:
>> format compact
>> [status, name] = unix('echo run1')
status =
0
name =
run1
>> numel(name)
ans =
5
>> int8(name(end))
ans =
10
>> int8(sprintf('\n'))
ans =
10
Apparently this character can be a part of a file name in unix, but shell displays it as ?.
Can't you do save($temp) instead?
EDIT: See my comments below for correction and more explanation.