Why are my wildcard attributes not being filled in Snakemake? - workflow

I am following the tutorial in the documentation (https://snakemake.readthedocs.io/en/stable/tutorial/advanced.html) and have been stuck on the "Step 4: Rule parameter" exercise. I would like to access a float from my config file using a wildcard in my params directive.
I seem to be getting the same error whenever I run snakemake -np in the command line:
InputFunctionException in line 46 of /mnt/c/Users/Matt/Desktop/snakemake-tutorial/Snakefile:
Error:
AttributeError: 'Wildcards' object has no attribute 'sample'
Wildcards:
Traceback:
File "/mnt/c/Users/Matt/Desktop/snakemake-tutorial/Snakefile", line 14, in get_bcftools_call_priors
This is my code so far
import time
configfile: "config.yaml"
rule all:
input:
"plots/quals.svg"
def get_bwa_map_input_fastqs(wildcards):
print(wildcards.__dict__, 1, time.time()) #I have this print as a check
return config["samples"][wildcards.sample]
def get_bcftools_call_priors(wildcards):
print(wildcards.__dict__, 2, time.time()) #I have this print as a check
return config["prior_mutation_rates"][wildcards.sample]
rule bwa_map:
input:
"data/genome.fa",
get_bwa_map_input_fastqs
#lambda wildcards: config["samples"][wildcards.sample]
output:
"mapped_reads/{sample}.bam"
params:
rg=r"#RG\tID:{sample}\tSM:{sample}"
threads: 2
shell:
"bwa mem -R '{params.rg}' -t {threads} {input} | samtools view -Sb - > {output}"
rule samtools_sort:
input:
"mapped_reads/{sample}.bam"
output:
"sorted_reads/{sample}.bam"
shell:
"samtools sort -T sorted_reads/{wildcards.sample} "
"-O bam {input} > {output}"
rule samtools_index:
input:
"sorted_reads/{sample}.bam"
output:
"sorted_reads/{sample}.bam.bai"
shell:
"samtools index {input}"
rule bcftools_call:
input:
fa="data/genome.fa",
bam=expand("sorted_reads/{sample}.bam", sample=config["samples"]),
bai=expand("sorted_reads/{sample}.bam.bai", sample=config["samples"])
#prior=get_bcftools_call_priors
params:
prior=get_bcftools_call_priors
output:
"calls/all.vcf"
shell:
"samtools mpileup -g -f {input.fa} {input.bam} | "
"bcftools call -P {params.prior} -mv - > {output}"
rule plot_quals:
input:
"calls/all.vcf"
output:
"plots/quals.svg"
script:
"scripts/plot-quals.py"
and here is my config.yaml
samples:
A: data/samples/A.fastq
#B: data/samples/B.fastq
#C: data/samples/C.fastq
prior_mutation_rates:
A: 1.0e-4
#B: 1.0e-6
I don't understand why my input function call in bcftools_call says that the wildcards object is empty of attributes, yet an almost identical function call in bwa_map has the attribute sample that I want. From the documentation it seems like the wildcards would be propogated before anything is run, so why is it missing?
This is the full output of the commandline call snakemake -np:
{'_names': {'sample': (0, None)}, '_allowed_overrides': ['index', 'sort'], 'index': functools.partial(<function Namedlist._used_attribute at 0x7f91b1a58f70>, _name='index'), 'sort': functools.partial(<function Namedlist._used_attribute at 0x7f91b1a58f70>, _name='sort'), 'sample': 'A'} 1 1628877061.8831172
Job stats:
job count min threads max threads
-------------- ------- ------------- -------------
all 1 1 1
bcftools_call 1 1 1
bwa_map 1 1 1
plot_quals 1 1 1
samtools_index 1 1 1
samtools_sort 1 1 1
total 6 1 1
[Fri Aug 13 10:51:01 2021]
rule bwa_map:
input: data/genome.fa, data/samples/A.fastq
output: mapped_reads/A.bam
jobid: 4
wildcards: sample=A
resources: tmpdir=/tmp
bwa mem -R '#RG\tID:A\tSM:A' -t 1 data/genome.fa data/samples/A.fastq | samtools view -Sb - > mapped_reads/A.bam
[Fri Aug 13 10:51:01 2021]
rule samtools_sort:
input: mapped_reads/A.bam
output: sorted_reads/A.bam
jobid: 3
wildcards: sample=A
resources: tmpdir=/tmp
samtools sort -T sorted_reads/A -O bam mapped_reads/A.bam > sorted_reads/A.bam
[Fri Aug 13 10:51:01 2021]
rule samtools_index:
input: sorted_reads/A.bam
output: sorted_reads/A.bam.bai
jobid: 5
wildcards: sample=A
resources: tmpdir=/tmp
samtools index sorted_reads/A.bam
[Fri Aug 13 10:51:01 2021]
rule bcftools_call:
input: data/genome.fa, sorted_reads/A.bam, sorted_reads/A.bam.bai
output: calls/all.vcf
jobid: 2
resources: tmpdir=/tmp
{'_names': {}, '_allowed_overrides': ['index', 'sort'], 'index': functools.partial(<function Namedlist._used_attribute at 0x7f91b1a58f70>, _name='index'), 'sort': functools.partial(<function Namedlist._used_attribute at 0x7f91b1a58f70>, _name='sort')} 2 1628877061.927639
InputFunctionException in line 46 of /mnt/c/Users/Matt/Desktop/snakemake-tutorial/Snakefile:
Error:
AttributeError: 'Wildcards' object has no attribute 'sample'
Wildcards:
Traceback:
File "/mnt/c/Users/Matt/Desktop/snakemake-tutorial/Snakefile", line 14, in get_bcftools_call_priors
If anyone knows what is going wrong I would really appreciate an explaination. Also if there is a better way of getting information out of the config.yaml into the different directives, I would gladly appreciate those tips.
Edit:
I have searched around the internet quite a bit, but have yet to understand this issue.

Wildcards for each rule are based on that rule's output file(s). The rule bcftools_call has one output file (calls/all.vcf), which has no wildcards. Because of this, when get_bcftools_call_priors is called, it throws an exception when it tries to access the unset wildcards.sample attribute.
You should probably set a global prior_mutation_rate in your config file and then access that in the bcftools_call rule:
rule bcftools_call:
...
params:
prior=config["prior_mutation_rate"],

Related

Snakemake workflow, ChildIOException or MissingInputException

I am trying to add a file renaming step in my current workflow to make it easier on some of the other users. What I want to do is take the contigs.fasta file from a spades assembly directory and rename it to include the sample name. (i.e foo_de_novo/contigs.fasta to foo_de_novo/foo.fasta)
here is my code... well currently.
configfile: "config.yaml"
import os
def is_file_empty(file_path):
""" Check if file is empty by confirming if its size is 0 bytes"""
# Check if singleton file exist and it is empty from bbrepair output
return os.path.exists(file_path) and os.stat(file_path).st_size == 0
rule all:
input:
expand("{sample}_de_novo/{sample}.fasta", sample = config["names"]),
rule fastp:
input:
r1 = lambda wildcards: config["sample_reads_r1"][wildcards.sample],
r2 = lambda wildcards: config["sample_reads_r2"][wildcards.sample]
output:
r1 = temp("clean/{sample}_r1.trim.fastq.gz"),
r2 = temp("clean/{sample}_r2.trim.fastq.gz")
shell:
"fastp --in1 {input.r1} --in2 {input.r2} --out1 {output.r1} --out2 {output.r2} --trim_front1 20 --trim_front2 20"
rule bbrepair:
input:
r1 = "clean/{sample}_r1.trim.fastq.gz",
r2 = "clean/{sample}_r2.trim.fastq.gz"
output:
r1 = temp("clean/{sample}_r1.fixed.fastq"),
r2 = temp("clean/{sample}_r2.fixed.fastq"),
singles = temp("clean/{sample}.singletons.fastq")
shell:
"repair.sh -Xmx10g in1={input.r1} in2={input.r2} out1={output.r1} out2={output.r2} outs={output.singles}"
rule spades:
input:
r1 = "clean/{sample}_r1.fixed.fastq",
r2 = "clean/{sample}_r2.fixed.fastq",
s = "clean/{sample}.singletons.fastq"
output:
directory("{sample}_de_novo")
run:
isempty = is_file_empty("clean/{sample}.singletons.fastq")
if isempty == "False":
shell("spades.py --careful --phred-offset 33 -1 {input.r1} -2 {input.r2} -s {input.singletons} -o {output}")
else:
shell("spades.py --careful --phred-offset 33 -1 {input.r1} -2 {input.r2} -o {output}")
rule rename_spades:
input:
"{sample}_de_novo/contigs.fasta"
output:
"{sample}_de_novo/{sample}.fasta"
shell:
"cp {input} {output}"
When I have it written like this I get the MissingInputError and when I change it to this.
rule rename_spades:
input:
"{sample}_de_novo"
output:
"{sample}_de_novo/{sample}.fasta"
shell:
"cp {input} {output}"
I get the ChildIOException
I feel I understand why snakemake is unhappy with both versions. The first one is becasue I don't explicitly output the "{sample}_de_novo/contigs.fasta" file. Its just one of several files spades outputs. And the other error is because it doesn't like how I am asking it to look into the directory. I however am at a loss on how to fix this.
Is there a way to ask snakmake to look into a directory for a file and then perform the task requested?
Thank you,
Sean
EDIT File Structure of Spades output
Sample_de_novo
|-corrected/
|-K21/
|-K33/
|-K55/
|-K77/
|-misc/
|-mismatch_corrector/
|-tmp/
|-assembly_graph.fastg
|-assembly_graph_with_scaffolds.gfa
|-before_rr.fasta
|-contigs.fasta
|-contigs.paths
|-dataset.info
|-input_dataset.ymal
|-params.txt
|-scaffolds.fasta
|-scaffolds.paths
|spades.log
Make {sample}_de_novo/contigs.fasta to be the output of spades and parse its path to get the directory that will be the argument to spades -o. Snakemake won't mind if there are other files created in addition to contigs.fasta. This should run --dry-run mode:
rule all:
input:
expand('{sample}_de_novo/{sample}.fasta', sample=['A', 'B']),
rule spades:
output:
fasta='{sample}_de_novo/contigs.fasta',
run:
outdir=os.path.dirname(output.fasta)
shell(f'spades ... -o {outdir}')
rule rename:
input:
fasta='{sample}_de_novo/contigs.fasta',
output:
fasta='{sample}_de_novo/{sample}.fasta',
shell:
r"""
mv {input.fasta} {output.fasta}
"""
Nope, spoke too soon. It didn't name the output directory correctly, so I moved it to the params and, now, finailly is working the way I wanted.
rule spades:
input:
r1 = "clean/{sample}_r1.fixed.fastq",
r2 = "clean/{sample}_r2.fixed.fastq",
s = "clean/{sample}.singletons.fastq"
output:
"{sample}_de_novo/contigs.fasta"
params:
outdir = directory("{sample}_de_novo/")
run:
isempty = is_file_empty("clean/{sample}.singletons.fastq")
if isempty == "False":
shell("spades.py --isolate --phred-offset 33 -1 {input.r1} -2 {input.r2} -s {input.singletons} -o {params.outdir}")
else:
shell("spades.py --isolate --phred-offset 33 -1 {input.r1} -2 {input.r2} -o {params.outdir}")
rule rename_spades:
input:
"{sample}_de_novo/contigs.fasta"
output:
"{sample}_de_novo/{sample}.fasta"
shell:
"cp {input} {output}"

Does the VSCode problem matcher strip ANSI escape sequences before matching?

I'm making a custom task to run Perl unit tests with yath. The output of that command contains details about failed tests, which I would like to filter and display as problems.
I've written the following matcher for the my output.
"problemMatcher": {
"owner": "yath",
"fileLocation": [ "relative", "${workspaceFolder}" ],
"severity": "error",
"pattern": [
{
"regexp": "\\[\\s*FAIL\\s*\\]\\s*job\\s*\\d+\\s*\\+?\\s*(.+)",
"message": 1,
},{
"regexp": "\\(\\s*DIAG\\s*\\)\\s*job\\s*\\d+\\s*\\+?\\s*at (.+) line (\\d+)\\.",
"file": 1,
"line": 2
}
]
}
This is supposed to match two different lines in the following output, which I will present as code for copying, and as a screenshot.
** Defaulting to the 'test' command **
( LAUNCH ) job 1 t/foo.t
( NOTE ) job 1 Seeded srand with seed '20220414' from local date.
[ PASS ] job 1 + passing test
[ FAIL ] job 1 + failing test
( DIAG ) job 1 Failed test 'failing test'
( DIAG ) job 1 at t/foo.t line 57.
[ PLAN ] job 1 Expected assertions: 2
( FAILED ) job 1 t/foo.t
( TIME ) job 1 Startup: 0.30841s | Events: 0.01992s | Cleanup: 0.00417s | Total: 0.33250s
< REASON > job 1 Test script returned error (Err: 1)
< REASON > job 1 Assertion failures were encountered (Count: 1)
The following jobs failed:
+--------------------------------------+-----------------------------------+
| Job ID | Test File |
+--------------------------------------+-----------------------------------+
| e7aee661-b49f-4b60-b815-f420d109457a | t/foo.t |
+--------------------------------------+-----------------------------------+
Yath Result Summary
-----------------------------------------------------------------------------------
Fail Count: 1
File Count: 1
Assertion Count: 2
Wall Time: 0.74 seconds
CPU Time: 0.76 seconds (usr: 0.20s | sys: 0.00s | cusr: 0.49s | csys: 0.07s)
CPU Usage: 103%
--> Result: FAILED <--
But it's actually pretty with colours.
I suspect there are ANSI escape sequences in this output. I could pass a flag to yath to make it not print colours, but I would like to be able to read this output as well, so that isn't ideal.
Do I have to change my pattern to match the escape sequences (I can read the source of the program that prints them, but it's annoying), or are they in fact stripped out and my pattern is wrong, but I can't see where?
Here's the first pattern as a regex101 match, and here's the second.

WildcardError - No values given for wildcard - Snakemake

I am really lost what exactly I can do to fix this error.
I am running snakemake to perform some post alignment quality checks.
My code looks like this:
SAMPLES = ["Exome_Tumor_sorted_mrkdup_bqsr", "Exome_Norm_sorted_mrkdup_bqsr",
"WGS_Tumor_merged_sorted_mrkdup_bqsr", "WGS_Norm_merged_sorted_mrkdup_bqsr"]
rule all:
input:
expand("post-alignment-qc/flagstat/{sample}.txt", sample=SAMPLES),
expand("post-alignment-qc/CollectInsertSizeMetics/{sample}.txt", sample=SAMPLES),
expand("post-alignment-qc/CollectAlignmentSummaryMetrics/{sample}.txt", sample=SAMPLES),
expand("post-alignment-qc/CollectGcBiasMetrics/{sample}_summary.txt", samples=SAMPLES) # this is the problem causing line
rule flagstat:
input:
bam = "align/{sample}.bam"
output:
"post-alignment-qc/flagstat/{sample}.txt"
log:
err='post-alignment-qc/logs/flagstat/{sample}_stderr.err'
shell:
"samtools flagstat {input} > {output} 2> {log.err}"
rule CollectInsertSizeMetics:
input:
bam = "align/{sample}.bam"
output:
txt="post-alignment-qc/CollectInsertSizeMetics/{sample}.txt",
pdf="post-alignment-qc/CollectInsertSizeMetics/{sample}.pdf"
log:
err='post-alignment-qc/logs/CollectInsertSizeMetics/{sample}_stderr.err',
out='post-alignment-qc/logs/CollectInsertSizeMetics/{sample}_stdout.txt'
shell:
"gatk CollectInsertSizeMetrics -I {input} -O {output.txt} -H {output.pdf} 2> {log.err}"
rule CollectAlignmentSummaryMetrics:
input:
bam = "align/{sample}.bam",
genome= "references/genome/ref_genome.fa"
output:
txt="post-alignment-qc/CollectAlignmentSummaryMetrics/{sample}.txt",
log:
err='post-alignment-qc/logs/CollectAlignmentSummaryMetrics/{sample}_stderr.err',
out='post-alignment-qc/logs/CollectAlignmentSummaryMetrics/{sample}_stdout.txt'
shell:
"gatk CollectAlignmentSummaryMetrics -I {input.bam} -O {output.txt} -R {input.genome} 2> {log.err}"
rule CollectGcBiasMetrics:
input:
bam = "align/{sample}.bam",
genome= "references/genome/ref_genome.fa"
output:
txt="post-alignment-qc/CollectGcBiasMetrics/{sample}_metrics.txt",
CHART="post-alignment-qc/CollectGcBiasMetrics/{sample}_metrics.pdf",
S="post-alignment-qc/CollectGcBiasMetrics/{sample}_summary.txt"
log:
err='post-alignment-qc/logs/CollectGcBiasMetrics/{sample}_stderr.err',
out='post-alignment-qc/logs/CollectGcBiasMetrics/{sample}_stdout.txt'
shell:
"gatk CollectGcBiasMetrics -I {input.bam} -O {output.txt} -R {input.genome} -CHART = {output.CHART} "
"-S {output.S} 2> {log.err}"
The error message says the following:
WildcardError in line 9 of Snakefile:
No values given for wildcard 'sample'.
File "Snakefile", line 9, in <module>
In my code above I have indicated the problem causing line. When I simply remove this line everything runs perfekt. I am really confused, because I pretty much copy and pasted each rule, and this is the only rule that causes any problems.
If someone could point out what I did wrong, I would be very thankful!
Cheers!
Seems like it could be a spelling mistake - in the highlighted line, you write samples=SAMPLES, but the wildcard is called {sample} without the "s".

How to query changes to IDB predicates?

I want to make changes to some EDBs and then find out how the IDBs changed as a result.
The docs say that the query stage comes after the final stage and "has access to the effects of stage FINAL". But if I run
query '_(id) <- ^level(id; _).'
(where level is an IDB) I get
block block_1Z7PZ61E: line 2: error: predicate level is an IDB, therefore deltas for it will not be available until stage final. (code: STAGE_INITIAL_IDB_DELTA)
^level(id; _).
^^^^^^^^^^^^^
1 ERROR
BloxCompiler reported 1 error in block 'block_1Z7PZ61E'
I also tried following the diff predicate example. But this
query '_(id) <- (level\level#prev)(id; _).'
causes a syntax error:
block block_1Z80EFSZ: line 2: error: illegal character 'U+005C' (code: ILLEGAL_CHARACTER)
(level\level#prev)(id; _).
^
block block_1Z80EFSZ: line 2: error: unexpected token 'level' (code: UNEXPECTED_TOKEN)
(level\level#prev)(id; _).
^^^^^
block block_1Z80EFSZ: line 2: error: unexpected token ')' (code: UNEXPECTED_TOKEN)
(level\level#prev)(id; _).
^
block block_1Z80EFSZ: line 2: error: unexpected token ';' (code: UNEXPECTED_TOKEN)
(level\level#prev)(id; _).
^
4 ERRORS
BloxCompiler reported 4 errors in block 'block_1Z80EFSZ'
Unfortunately the diff predicate functionality is not yet released (as of Dec 2016). We've noted that as a documentation bug to fix. The feature is expected in LB 4.4 in Q1 2017.
Here's a simple example for printing or querying the values in the IDB predicate after changes to the EDB. You don't need to use delta logic in this case.
[dan#glycerine Fri Dec 30 09:55:54 ~/Temp]
$ cat foo.logic
foo(x), foo_id(x: y) -> int(y).
edb[x] = y -> foo(x), int(y).
idb[x] = y -> foo(x), int(y).
idb[x] = y <- edb[x] + 10 = y.
[dan#glycerine Fri Dec 30 09:56:33 ~/Temp]
$ lb create --overwrite foo && lb addblock foo --name foo --file foo.logic && lb exec foo '+foo(x), +foo_id[x] = 1, +edb[x] = 5.'
created workspace 'foo'
added block 'foo'
[dan#glycerine Fri Dec 30 09:57:23 ~/Temp]
$ lb print foo edb
[10000000005] 1 5
[dan#glycerine Fri Dec 30 09:57:28 ~/Temp]
$ lb print foo idb
[10000000005] 1 15
[dan#glycerine Fri Dec 30 09:57:32 ~/Temp]
$ lb exec foo '^edb[x] = 15 <- foo_id[x] = 1.'
[dan#glycerine Fri Dec 30 09:57:54 ~/Temp]
$ lb print foo idb
[10000000005] 1 25
[dan#glycerine Fri Dec 30 09:57:55 ~/Temp]
$ lb query foo '_[x] = edb[x].'
/--------------- _ ---------------\
[10000000005] 1 15
\--------------- _ ---------------/
I can think of a couple options for watching changes more as they happen. In once case you could create a watcher predicate and add values to it like this.
watcher(x, y) -> int(x), int(y).
^watcher(x, y) <- foo_id[x] = id, ^idb[id] = y.
That could be enhanced to have its own entity and timestamp...
Another way, if you are using web services to make the EDB updates, is to use
lb web-server monitor -w foo idb
to watch a list of predicates.

AWK - filter file with not equal fields

I've been trying to pull a field from a row in a file although each row may have plus or minus 2 or 3 fields per row. They aren't always equal in the number of fields per row.
Here is a snippet:
A orarpp 45286124 1 1 0 20 60 Nov 25 9-16:42:32 01:04:58 11176 117056 0 - oracleXXX (LOCAL=NO)
A orarpp 45351560 1 1 3 20 61 Nov 30 5-03:54:42 02:24:48 4804 110684 0 - ora_w002_XXX
A orarpp 45548236 1 1 22 20 71 Nov 26 8-19:36:28 00:56:18 10628 116508 0 - oracleXXX (LOCAL=NO)
A orarpp 45679190 1 1 0 20 60 Nov 28 6-23:42:20 00:37:59 10232 116112 0 - oracleXXX (LOCAL=NO)
A orarpp 45744808 1 1 0 20 60 10:52:19 23:08:12 00:04:58 11740 117620 0 - oracleXXX (LOCAL=NO)
A root 45810380 1 1 0 -- 39 Nov 25 9-19:54:34 00:00:00 448 448 0 - garbage
In the case of the first line, I'm interested in 9-16:42:32 and the similar fields for each row.
I've tried to pull it by using ':' as the field separator and then filter from there however, what I am trying to accomplish is to do something if the number before the dash (in the example it's 9) is greater than one.
cat file.txt | grep oracle | awk -F: '{print substr($1, length($1)-5)}'
This is because the number of fields on either side of the actual field I need can be different from line to line.
Definitely not the most efficient but I've been trying to do this with an awk one liner.
Hints or a direction would be appreciated to get me moving again. I am not opposed to doing in a better way than awk.
Thanks.
Maybe cut is the right tool for this job? For example, with your snippet:
$ cut -c 62-71 file.txt
9-16:42:32
5-03:54:42
8-19:36:28
6-23:42:20
23:08:12
9-19:54:34
The arguments tell cut to snip columns (-c) 62 through 71.
For additional processing, you can pipe it to awk.
You can also accomplish the whole thing in awk by accepting entire lines and then using substr to extract the columns you want. For example, this awk command produces the same output as the cut command above:
awk '{ print substr($0, 62, 10) }' file.txt
Whether you create a pipeline or do the processing entirely in awk is at least in part a matter of personal taste / style.
Would this do?
awk -F: '/oracle/ {print substr($0,62,10)}' file.txt
9-16:42:32
8-19:36:28
6-23:42:20
23:08:12
This search for oracle and then print 10 characters starting from position 62
You can grab those identifiers with one of
grep -o '[[:digit:]]\+-[[:digit:]]\{2\}:[[:digit:]]\{2\}:[[:digit:]]\{2\}'
grep -oP '\d+-\d\d:\d\d:\d\d' # GNU grep
It sounds like you want to do something with the lines, not just find the ids. Please elaborate.
Using GNU awk:
gawk --re-interval '
/oracle/ && \
match($0, /([[:digit:]]+)-([[:digit:]]{2}:){2}[[:digit:]]{2}/, a) && \
a[1]>1 {
# do something with the matching line
print
}
' file