Generate many files with wildcard, then merge into one

Generate many files with wildcard, then merge into one - merge

I have two rules on my Snakefile: one generates several sets of files using wildcards, the other one merges everything into a single file. This is how I wrote it:
chr = range(1,23)
rule generate:
input:
og_files = config["tmp"] + '/chr{chr}.bgen',
output:
out = multiext(config["tmp"] + '/plink/chr{{chr}}',
'.bed', '.bim', '.fam')
shell:
"""
plink \
--bgen {input.og_files} \
--make-bed \
--oxford-single-chr \
--out {config[tmp]}/plink/chr{chr}
"""
rule merge:
input:
plink_chr = expand(config["tmp"] + '/plink/chr{chr}.{ext}',
chr = chr,
ext = ['bed', 'bim', 'fam'])
output:
out = multiext(config["tmp"] + '/all',
'.bed', '.bim', '.fam')
shell:
"""
plink \
--pmerge-list-dir {config[tmp]}/plink \
--make-bed \
--out {config[tmp]}/all
"""
Unfortunately, this does not allow me to track the file coming from the first rule to the 2nd rule:
$ snakemake -s myfile.smk -c1 -np
Building DAG of jobs...
MissingInputException in line 17 of myfile.smk:
Missing input files for rule merge:
[list of all the files made by expand()]
What can I use to be able to generate the 22 sets of files with the wildcard chr in generate, but be able to track them in the input of merge? Thank you in advance for your help

In rule generate I think you don't want to escape the {chr} wildcard, otherwise it doesn't get replaced. I.e.:
out = multiext(config["tmp"] + '/plink/chr{{chr}}',
'.bed', '.bim', '.fam')
should be:
out = multiext(config["tmp"] + '/plink/chr{chr}',
'.bed', '.bim', '.fam')

Related

Snakemake workflow, ChildIOException or MissingInputException

I am trying to add a file renaming step in my current workflow to make it easier on some of the other users. What I want to do is take the contigs.fasta file from a spades assembly directory and rename it to include the sample name. (i.e foo_de_novo/contigs.fasta to foo_de_novo/foo.fasta)
here is my code... well currently.
configfile: "config.yaml"
import os
def is_file_empty(file_path):
""" Check if file is empty by confirming if its size is 0 bytes"""
# Check if singleton file exist and it is empty from bbrepair output
return os.path.exists(file_path) and os.stat(file_path).st_size == 0
rule all:
input:
expand("{sample}_de_novo/{sample}.fasta", sample = config["names"]),
rule fastp:
input:
r1 = lambda wildcards: config["sample_reads_r1"][wildcards.sample],
r2 = lambda wildcards: config["sample_reads_r2"][wildcards.sample]
output:
r1 = temp("clean/{sample}_r1.trim.fastq.gz"),
r2 = temp("clean/{sample}_r2.trim.fastq.gz")
shell:
"fastp --in1 {input.r1} --in2 {input.r2} --out1 {output.r1} --out2 {output.r2} --trim_front1 20 --trim_front2 20"
rule bbrepair:
input:
r1 = "clean/{sample}_r1.trim.fastq.gz",
r2 = "clean/{sample}_r2.trim.fastq.gz"
output:
r1 = temp("clean/{sample}_r1.fixed.fastq"),
r2 = temp("clean/{sample}_r2.fixed.fastq"),
singles = temp("clean/{sample}.singletons.fastq")
shell:
"repair.sh -Xmx10g in1={input.r1} in2={input.r2} out1={output.r1} out2={output.r2} outs={output.singles}"
rule spades:
input:
r1 = "clean/{sample}_r1.fixed.fastq",
r2 = "clean/{sample}_r2.fixed.fastq",
s = "clean/{sample}.singletons.fastq"
output:
directory("{sample}_de_novo")
run:
isempty = is_file_empty("clean/{sample}.singletons.fastq")
if isempty == "False":
shell("spades.py --careful --phred-offset 33 -1 {input.r1} -2 {input.r2} -s {input.singletons} -o {output}")
else:
shell("spades.py --careful --phred-offset 33 -1 {input.r1} -2 {input.r2} -o {output}")
rule rename_spades:
input:
"{sample}_de_novo/contigs.fasta"
output:
"{sample}_de_novo/{sample}.fasta"
shell:
"cp {input} {output}"
When I have it written like this I get the MissingInputError and when I change it to this.
rule rename_spades:
input:
"{sample}_de_novo"
output:
"{sample}_de_novo/{sample}.fasta"
shell:
"cp {input} {output}"
I get the ChildIOException
I feel I understand why snakemake is unhappy with both versions. The first one is becasue I don't explicitly output the "{sample}_de_novo/contigs.fasta" file. Its just one of several files spades outputs. And the other error is because it doesn't like how I am asking it to look into the directory. I however am at a loss on how to fix this.
Is there a way to ask snakmake to look into a directory for a file and then perform the task requested?
Thank you,
Sean
EDIT File Structure of Spades output
Sample_de_novo
|-corrected/
|-K21/
|-K33/
|-K55/
|-K77/
|-misc/
|-mismatch_corrector/
|-tmp/
|-assembly_graph.fastg
|-assembly_graph_with_scaffolds.gfa
|-before_rr.fasta
|-contigs.fasta
|-contigs.paths
|-dataset.info
|-input_dataset.ymal
|-params.txt
|-scaffolds.fasta
|-scaffolds.paths
|spades.log

Make {sample}_de_novo/contigs.fasta to be the output of spades and parse its path to get the directory that will be the argument to spades -o. Snakemake won't mind if there are other files created in addition to contigs.fasta. This should run --dry-run mode:
rule all:
input:
expand('{sample}_de_novo/{sample}.fasta', sample=['A', 'B']),
rule spades:
output:
fasta='{sample}_de_novo/contigs.fasta',
run:
outdir=os.path.dirname(output.fasta)
shell(f'spades ... -o {outdir}')
rule rename:
input:
fasta='{sample}_de_novo/contigs.fasta',
output:
fasta='{sample}_de_novo/{sample}.fasta',
shell:
r"""
mv {input.fasta} {output.fasta}
"""

Nope, spoke too soon. It didn't name the output directory correctly, so I moved it to the params and, now, finailly is working the way I wanted.
rule spades:
input:
r1 = "clean/{sample}_r1.fixed.fastq",
r2 = "clean/{sample}_r2.fixed.fastq",
s = "clean/{sample}.singletons.fastq"
output:
"{sample}_de_novo/contigs.fasta"
params:
outdir = directory("{sample}_de_novo/")
run:
isempty = is_file_empty("clean/{sample}.singletons.fastq")
if isempty == "False":
shell("spades.py --isolate --phred-offset 33 -1 {input.r1} -2 {input.r2} -s {input.singletons} -o {params.outdir}")
else:
shell("spades.py --isolate --phred-offset 33 -1 {input.r1} -2 {input.r2} -o {params.outdir}")
rule rename_spades:
input:
"{sample}_de_novo/contigs.fasta"
output:
"{sample}_de_novo/{sample}.fasta"
shell:
"cp {input} {output}"

WildcardError - No values given for wildcard - Snakemake

I am really lost what exactly I can do to fix this error.
I am running snakemake to perform some post alignment quality checks.
My code looks like this:
SAMPLES = ["Exome_Tumor_sorted_mrkdup_bqsr", "Exome_Norm_sorted_mrkdup_bqsr",
"WGS_Tumor_merged_sorted_mrkdup_bqsr", "WGS_Norm_merged_sorted_mrkdup_bqsr"]
rule all:
input:
expand("post-alignment-qc/flagstat/{sample}.txt", sample=SAMPLES),
expand("post-alignment-qc/CollectInsertSizeMetics/{sample}.txt", sample=SAMPLES),
expand("post-alignment-qc/CollectAlignmentSummaryMetrics/{sample}.txt", sample=SAMPLES),
expand("post-alignment-qc/CollectGcBiasMetrics/{sample}_summary.txt", samples=SAMPLES) # this is the problem causing line
rule flagstat:
input:
bam = "align/{sample}.bam"
output:
"post-alignment-qc/flagstat/{sample}.txt"
log:
err='post-alignment-qc/logs/flagstat/{sample}_stderr.err'
shell:
"samtools flagstat {input} > {output} 2> {log.err}"
rule CollectInsertSizeMetics:
input:
bam = "align/{sample}.bam"
output:
txt="post-alignment-qc/CollectInsertSizeMetics/{sample}.txt",
pdf="post-alignment-qc/CollectInsertSizeMetics/{sample}.pdf"
log:
err='post-alignment-qc/logs/CollectInsertSizeMetics/{sample}_stderr.err',
out='post-alignment-qc/logs/CollectInsertSizeMetics/{sample}_stdout.txt'
shell:
"gatk CollectInsertSizeMetrics -I {input} -O {output.txt} -H {output.pdf} 2> {log.err}"
rule CollectAlignmentSummaryMetrics:
input:
bam = "align/{sample}.bam",
genome= "references/genome/ref_genome.fa"
output:
txt="post-alignment-qc/CollectAlignmentSummaryMetrics/{sample}.txt",
log:
err='post-alignment-qc/logs/CollectAlignmentSummaryMetrics/{sample}_stderr.err',
out='post-alignment-qc/logs/CollectAlignmentSummaryMetrics/{sample}_stdout.txt'
shell:
"gatk CollectAlignmentSummaryMetrics -I {input.bam} -O {output.txt} -R {input.genome} 2> {log.err}"
rule CollectGcBiasMetrics:
input:
bam = "align/{sample}.bam",
genome= "references/genome/ref_genome.fa"
output:
txt="post-alignment-qc/CollectGcBiasMetrics/{sample}_metrics.txt",
CHART="post-alignment-qc/CollectGcBiasMetrics/{sample}_metrics.pdf",
S="post-alignment-qc/CollectGcBiasMetrics/{sample}_summary.txt"
log:
err='post-alignment-qc/logs/CollectGcBiasMetrics/{sample}_stderr.err',
out='post-alignment-qc/logs/CollectGcBiasMetrics/{sample}_stdout.txt'
shell:
"gatk CollectGcBiasMetrics -I {input.bam} -O {output.txt} -R {input.genome} -CHART = {output.CHART} "
"-S {output.S} 2> {log.err}"
The error message says the following:
WildcardError in line 9 of Snakefile:
No values given for wildcard 'sample'.
File "Snakefile", line 9, in <module>
In my code above I have indicated the problem causing line. When I simply remove this line everything runs perfekt. I am really confused, because I pretty much copy and pasted each rule, and this is the only rule that causes any problems.
If someone could point out what I did wrong, I would be very thankful!
Cheers!

Seems like it could be a spelling mistake - in the highlighted line, you write samples=SAMPLES, but the wildcard is called {sample} without the "s".

Merging several vcf files using snakemake

I am trying to merge several vcf files by chromosome using snakemake. My files are like this, and as you can see has various coordinates. What is the best way to merge all chr1A and all chr1B?
chr1A:0-2096.filtered.vcf
chr1A:2096-7896.filtered.vcf
chr1B:0-3456.filtered.vcf
chr1B:3456-8796.filtered.vcf
My pseudocode:
chromosomes=["chr1A","chr1B"]
rule all:
input:
expand("{sample}.vcf", sample=chromosomes)
rule merge:
input:
I1="path/to/file/{sample}.xxx.filtered.vcf",
I2="path/to/file/{sample}.xxx.filtered.vcf",
output:
outf ="{sample}.vcf"
shell:
"""
java -jar picard.jar GatherVcfs I={input.I1} I={input.I2} O={output.outf}
"""
EDIT:
workdir: "/media/prova/Maxtor2/vcf2/merged/"
import subprocess
d = {"chr1A": ["chr1A:0-2096.flanking.view.filtered.vcf", "chr1A:2096-7896.flanking.view.filtered.vcf"],
"chr1B": ["chr1B:0-3456.flanking.view.filtered.vcf", "chr1B:3456-8796.flanking.view.filtered.vcf"]}
rule all:
input:
expand("{sample}.vcf", sample=d)
def f(w):
return d.get(w.chromosome, "")
rule merge:
input:
f
output:
outf ="{chromosome}.vcf"
params:
lambda w: "I=" + " I=".join(d[w.chromosome])
shell:
"java -jar /home/Documents/Tools/picard.jar GatherVcfs {params[0]} O={output.outf}"

I was able to reproduce your bug. When constraining the wildcards, it works:
d = {"chr1A": ["chr1A:0-2096.flanking.view.filtered.vcf", "chr1A:2096-7896.flanking.view.filtered.vcf"],
"chr1B": ["chr1B:0-3456.flanking.view.filtered.vcf", "chr1B:3456-8796.flanking.view.filtered.vcf"]}
chromosomes = list(d)
rule all:
input:
expand("{sample}.vcf", sample=chromosomes)
# these tell Snakemake exactly what values the wildcards may take
# we use "|" to create the regex chr1A|chr1B
wildcard_constraints:
chromosome = "|".join(chromosomes)
rule merge:
input:
# a lambda is an unnamed function
# the first argument is the wildcards
# we merely use it to look up the appropriate files in the dict d
lambda w: d[w.chromosome]
output:
outf = "{chromosome}.vcf"
params:
# here we create the string
# "I=chr1A:0-2096.flanking.view.filtered.vcf I=chr1A:2096-7896.flanking.view.filtered.vcf"
# for use in our command
lambda w: "I=" + " I=".join(d[w.chromosome])
shell:
"java -jar /home/Documents/Tools/picard.jar GatherVcfs {params[0]} O={output.outf}"
It should have worked without the constraints too; this seems like a bug in Snakemake.

Including variable in file directory in matlab ..?

I want to access several sequenced folders
Example :
[ndata, text, alldata] = xlsread(' D: \ folder \ 1 \ file ' ) ;
[ndata, text, alldata] = xlsread(' D: \ folder \ 2 \ file ' ) ;
[ndata, text, alldata] = xlsread(' D: \ folder \ 3 \ file ' ) ;
[ndata, text, alldata] = xlsread(' D: \ folder \ 4 \ file ' ) ;
Could I replace 1,2,3and 4 by variable i .. How could the directory be written here ?!
Please need any recommendation !

The fullfile command is meant for this purpose:
xlsread(fullfile('D:','folder', sprintf('%d',i) , 'file'));
The fullfile function takes care of OS-specific file separator and insuring only one file separator is used per folder division. (i.e. strcmp(fullfile('a','b') equals fullfile('a/','/b'))

[ndata, text, alldata] = xlsread([' D:/folder/' num2str(i) '/file ' ]) ;

Just use forward slashes, that works everywhere.
Don't make things harder than they should be.

Yes, you can. Use the sprintf() command.
i=1;
[ndata, text, alldata] = xlsread(sprintf('D:\\folder\\%i\\file',i))
To make sure this is working right, change the sprintf to a fprintf, and make sure the file exists.
>> i=1;
>> fprintf('D:\\folder\\%i\\file',i)
D:\folder\1\file
>> ls D:\folder\1\file

Is it possible to copy all files from one S3 bucket to another with s3cmd?

I'm pretty happy with s3cmd, but there is one issue: How to copy all files from one S3 bucket to another? Is it even possible?
EDIT: I've found a way to copy files between buckets using Python with boto:
from boto.s3.connection import S3Connection
def copyBucket(srcBucketName, dstBucketName, maxKeys = 100):
conn = S3Connection(awsAccessKey, awsSecretKey)
srcBucket = conn.get_bucket(srcBucketName);
dstBucket = conn.get_bucket(dstBucketName);
resultMarker = ''
while True:
keys = srcBucket.get_all_keys(max_keys = maxKeys, marker = resultMarker)
for k in keys:
print 'Copying ' + k.key + ' from ' + srcBucketName + ' to ' + dstBucketName
t0 = time.clock()
dstBucket.copy_key(k.key, srcBucketName, k.key)
print time.clock() - t0, ' seconds'
if len(keys) < maxKeys:
print 'Done'
break
resultMarker = keys[maxKeys - 1].key
Syncing is almost as straight forward as copying. There are fields for ETag, size, and last-modified available for keys.
Maybe this helps others as well.

s3cmd sync s3://from/this/bucket/ s3://to/this/bucket/
For available options, please use:
$s3cmd --help

AWS CLI seems to do the job perfectly, and has the bonus of being an officially supported tool.
aws s3 sync s3://mybucket s3://backup-mybucket
http://docs.aws.amazon.com/cli/latest/reference/s3/sync.html

The answer with the most upvotes as I write this is this one:
s3cmd sync s3://from/this/bucket s3://to/this/bucket
It's a useful answer. But sometimes sync is not what you need (it deletes files, etc.). It took me a long time to figure out this non-scripting alternative to simply copy multiple files between buckets. (OK, in the case shown below it's not between buckets. It's between not-really-folders, but it works between buckets equally well.)
# Slightly verbose, slightly unintuitive, very useful:
s3cmd cp --recursive --exclude=* --include=file_prefix* s3://semarchy-inc/source1/ s3://semarchy-inc/target/
Explanation of the above command:
–recursiveIn my mind, my requirement is not recursive. I simply want multiple files. But recursive in this context just tells s3cmd cp to handle multiple files. Great.
–excludeIt’s an odd way to think of the problem. Begin by recursively selecting all files. Next, exclude all files. Wait, what?
–includeNow we’re talking. Indicate the file prefix (or suffix or whatever pattern) that you want to include.s3://sourceBucket/ s3://targetBucket/This part is intuitive enough. Though technically it seems to violate the documented example from s3cmd help which indicates that a source object must be specified:s3cmd cp s3://BUCKET1/OBJECT1 s3://BUCKET2[/OBJECT2]

You can also use the web interface to do so:
Go to the source bucket in the web interface.
Mark the files you want to copy (use shift and mouse clicks to mark several).
Press Actions->Copy.
Go to the destination bucket.
Press Actions->Paste.
That's it.

I needed to copy a very large bucket so I adapted the code in the question into a multi threaded version and put it up on GitHub.
https://github.com/paultuckey/s3-bucket-to-bucket-copy-py

It's actually possible. This worked for me:
import boto
AWS_ACCESS_KEY = 'Your access key'
AWS_SECRET_KEY = 'Your secret key'
conn = boto.s3.connection.S3Connection(AWS_ACCESS_KEY, AWS_SECRET_KEY)
bucket = boto.s3.bucket.Bucket(conn, SRC_BUCKET_NAME)
for item in bucket:
# Note: here you can put also a path inside the DEST_BUCKET_NAME,
# if you want your item to be stored inside a folder, like this:
# bucket.copy(DEST_BUCKET_NAME, '%s/%s' % (folder_name, item.key))
bucket.copy(DEST_BUCKET_NAME, item.key)

Thanks - I use a slightly modified version, where I only copy files that don't exist or are a different size, and check on the destination if the key exists in the source. I found this a bit quicker for readying the test environment:
def botoSyncPath(path):
"""
Sync keys in specified path from source bucket to target bucket.
"""
try:
conn = S3Connection(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
srcBucket = conn.get_bucket(AWS_SRC_BUCKET)
destBucket = conn.get_bucket(AWS_DEST_BUCKET)
for key in srcBucket.list(path):
destKey = destBucket.get_key(key.name)
if not destKey or destKey.size != key.size:
key.copy(AWS_DEST_BUCKET, key.name)
for key in destBucket.list(path):
srcKey = srcBucket.get_key(key.name)
if not srcKey:
key.delete()
except:
return False
return True

I wrote a script that backs up an S3 bucket: https://github.com/roseperrone/aws-backup-rake-task
#!/usr/bin/env python
from boto.s3.connection import S3Connection
import re
import datetime
import sys
import time
def main():
s3_ID = sys.argv[1]
s3_key = sys.argv[2]
src_bucket_name = sys.argv[3]
num_backup_buckets = sys.argv[4]
connection = S3Connection(s3_ID, s3_key)
delete_oldest_backup_buckets(connection, num_backup_buckets)
backup(connection, src_bucket_name)
def delete_oldest_backup_buckets(connection, num_backup_buckets):
"""Deletes the oldest backup buckets such that only the newest NUM_BACKUP_BUCKETS - 1 buckets remain."""
buckets = connection.get_all_buckets() # returns a list of bucket objects
num_buckets = len(buckets)
backup_bucket_names = []
for bucket in buckets:
if (re.search('backup-' + r'\d{4}-\d{2}-\d{2}' , bucket.name)):
backup_bucket_names.append(bucket.name)
backup_bucket_names.sort(key=lambda x: datetime.datetime.strptime(x[len('backup-'):17], '%Y-%m-%d').date())
# The buckets are sorted latest to earliest, so we want to keep the last NUM_BACKUP_BUCKETS - 1
delete = len(backup_bucket_names) - (int(num_backup_buckets) - 1)
if delete <= 0:
return
for i in range(0, delete):
print 'Deleting the backup bucket, ' + backup_bucket_names[i]
connection.delete_bucket(backup_bucket_names[i])
def backup(connection, src_bucket_name):
now = datetime.datetime.now()
# the month and day must be zero-filled
new_backup_bucket_name = 'backup-' + str('%02d' % now.year) + '-' + str('%02d' % now.month) + '-' + str(now.day);
print "Creating new bucket " + new_backup_bucket_name
new_backup_bucket = connection.create_bucket(new_backup_bucket_name)
copy_bucket(src_bucket_name, new_backup_bucket_name, connection)
def copy_bucket(src_bucket_name, dst_bucket_name, connection, maximum_keys = 100):
src_bucket = connection.get_bucket(src_bucket_name);
dst_bucket = connection.get_bucket(dst_bucket_name);
result_marker = ''
while True:
keys = src_bucket.get_all_keys(max_keys = maximum_keys, marker = result_marker)
for k in keys:
print 'Copying ' + k.key + ' from ' + src_bucket_name + ' to ' + dst_bucket_name
t0 = time.clock()
dst_bucket.copy_key(k.key, src_bucket_name, k.key)
print time.clock() - t0, ' seconds'
if len(keys) < maximum_keys:
print 'Done backing up.'
break
result_marker = keys[maximum_keys - 1].key
if __name__ =='__main__':main()
I use this in a rake task (for a Rails app):
desc "Back up a file onto S3"
task :backup do
S3ID = "*****"
S3KEY = "*****"
SRCBUCKET = "primary-mzgd"
NUM_BACKUP_BUCKETS = 2
Dir.chdir("#{Rails.root}/lib/tasks")
system "./do_backup.py #{S3ID} #{S3KEY} #{SRCBUCKET} #{NUM_BACKUP_BUCKETS}"
end

mdahlman's code didn't work for me but this command copies all the files in the bucket1 to a new folder (command also creates this new folder) in bucket 2.
cp --recursive --include=file_prefix* s3://bucket1/ s3://bucket2/new_folder_name/

s3cmd won't cp with only prefixes or wildcards but you can script the behavior with 's3cmd ls sourceBucket', and awk to extract the object name. Then use 's3cmd cp sourceBucket/name destBucket' to copy each object name in the list.
I use these batch files in a DOS box on Windows:
s3list.bat
s3cmd ls %1 | gawk "/s3/{ print \"\\"\"\"substr($0,index($0,\"s3://\"))\"\\"\"\"; }"
s3copy.bat
#for /F "delims=" %%s in ('s3list %1') do #s3cmd cp %%s %2

You can also use s3funnel which uses multi-threading:
https://github.com/neelakanta/s3funnel
example (without the access key or secret key parameters shown):
s3funnel source-bucket-name list | s3funnel dest-bucket-name copy --source-bucket source-bucket-name --threads=10

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Generate many files with wildcard, then merge into one - merge

In rule generate I think you don't want to escape the {chr} wildcard, otherwise it doesn't get replaced. I.e.: out = multiext(config["tmp"] + '/plink/chr{{chr}}', '.bed', '.bim', '.fam') should be: out = multiext(config["tmp"] + '/plink/chr{chr}', '.bed', '.bim', '.fam')

Related

Snakemake workflow, ChildIOException or MissingInputException

WildcardError - No values given for wildcard - Snakemake

Merging several vcf files using snakemake

Including variable in file directory in matlab ..?

Is it possible to copy all files from one S3 bucket to another with s3cmd?

Categories

Resources