Snakemake ancient tag with wildcards - workflow

I have a few SRA files that I downloaded from the NCBI website, and I want to add them to my Snakemake workflow. However, I want to keep the ability to download them with prefetch if they are not available. I had the following simple rules:
BASE = "/path/to/working/folder"
rule all:
    input: [f"{BASE}/fastq/SRR000001.sra_1.fastq", f"{BASE}/fastq/SRR000001.sra_2.fastq"]
    shell:
        "echo Finished"
rule get_sra:
    input: ancient("config/config.yaml")
    output: "{BASE_FOLDER}/sra/{SSR_ID}.sra"
    shell:
        "prefetch -p {wildcards.SSR_ID} --output-file {output} "
rule get_fastq:
    input: expand("{folder}/sra/{srr}.sra", folder=BASE, srr="{SRR_ID}")
    output:
        expand("{folder}/fastq/{srr}.sra_{i}.fastq", folder=BASE,
               srr="{SRR_ID}", i=[1, 2])
    shell:
        "fasterq-dump {input} --outdir {BASE}/fastq"
If I use the above rules, my workflow will recreate my SRA files because their timestamps will be older. However, I do not want to download the full SRA files from the server again; I want to use the ones that are already downloaded.
For this purpose I am trying to use the ancient tag, but I am not able to use it with any wildcards.
input: ancient("{BASE_FOLDER}/sra/{SSR_ID}.sra")
The above rule gives the following error:
Wildcards in input files cannot be determined from output files:
Any solution to this problem? This also does not work when I use expand.

The problem is that not everything that you specify in curly braces is actually a wildcard. Curly braces appear in 3 different contexts:
the expand function
f-strings
wildcards
In the first two cases (expand and f-string) the result is a fully specified string without any wildcards at all. If you have something like this:
rule dummy:
    input: "{wildcard}.input"
    output: expand("{wildcard}.output", wildcard=["1", "2"])
the result would be simply:
rule dummy:
    input: "{wildcard}.input"
    output: ["1.output", "2.output"]
As you can see, there are no wildcards in the output section at all, so Snakemake cannot determine the value of the wildcard used in the input.
The typical solution is to separate this rule into two rules:
rule all:
    input: expand("{wildcard}.output", wildcard=["1", "2"])
rule do_some_work:
    input: "{wildcard}.input"
    output: "{wildcard}.output"
Note however that something that I called {wildcard} in the rule all: is not a wildcard per se but just an arbitrarily selected name in the local context of the expand function.
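To make the difference between the three uses of curly braces concrete, here is a minimal plain-Python sketch. It does not import Snakemake: expand_sketch is a hypothetical stand-in that only imitates the basic behaviour of expand (filling every placeholder with every combination of values), so treat it as an illustration rather than the real implementation.

```python
from itertools import product

def expand_sketch(pattern, **values):
    """Hypothetical stand-in for Snakemake's expand(): fill each placeholder
    with every combination of the supplied values."""
    keys = list(values)
    choices = [v if isinstance(v, list) else [v] for v in values.values()]
    return [pattern.format(**dict(zip(keys, combo))) for combo in product(*choices)]

BASE = "/path/to/working/folder"

# 1. f-string: Python substitutes BASE immediately, no braces survive.
as_fstring = f"{BASE}/fastq/SRR000001.sra_1.fastq"

# 2. expand: every placeholder is filled in, the result is a list of plain paths.
as_expand = expand_sketch("{wildcard}.output", wildcard=["1", "2"])

# 3. plain string: the braces are still there, so this is what a rule can
#    treat as a wildcard pattern.
as_wildcard = "{wildcard}.output"

print(as_fstring)
print(as_expand)
print(as_wildcard)
```

Only the third form still contains a placeholder by the time the rule is defined, which is why an output built entirely with expand gives the input nothing to infer its wildcard from.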

match string pattern by certain characters but exclude combinations of those characters

I have the following sample string:
'-Dparam="x" -f hello-world.txt bye1.txt foo_bar.txt -Dparam2="y"'
I am trying to use RegEx (PowerShell, .NET flavor) to extract the filenames hello-world.txt, bye1.txt, and foo_bar.txt.
The real use case could have any number of -D parameters, and the -f <filenames> argument could appear in any position between these other parameters. I can't easily use something like split to extract it as the delimiter positioning could change, so I thought RegEx might be a good proposition here.
My attempt is something like this in PowerShell (can be opened on any Windows system and copy pasted into it):
'-Dparam="x" -f hello-world.txt bye1.txt foo_bar.txt -Dparam2="y"' -replace '^.* -f ([a-zA-Z0-9_.\s-]+).*$','$1'
Desired output:
hello-world.txt bye1.txt foo_bar.txt
My problem is that I either only take hello-world.txt, or I get hello-world.txt all the way to the end of the string or next = symbol (as in the example above).
I am having trouble expressing that \s is allowed, since I need to capture multiple space-delimited filenames, but that the combination of \s-[a-zA-Z] is not allowed, as that indicates the start of the next argument.
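One way to express "whitespace is allowed unless it starts the next option" is a negative lookahead placed in front of every consumed character. The sketch below uses Python rather than PowerShell, purely to illustrate the pattern; the (?!...) lookahead syntax also exists in the .NET flavor, but the pattern has only been reasoned through against the sample string above. extract_filenames is a hypothetical helper name.

```python
import re

# After "-f", consume filename characters one at a time, but stop as soon as
# the upcoming text is whitespace + "-" + letter (the start of the next option).
FILES_AFTER_F = re.compile(r"(?:^|\s)-f\s+((?:(?!\s-[A-Za-z])[\w.\s-])+)")

def extract_filenames(command_line):
    """Return the space-delimited file names that follow -f, or an empty list."""
    match = FILES_AFTER_F.search(command_line)
    return match.group(1).split() if match else []

sample = '-Dparam="x" -f hello-world.txt bye1.txt foo_bar.txt -Dparam2="y"'
print(extract_filenames(sample))
```

A hyphen inside a name such as hello-world.txt is still accepted, because the lookahead only rejects a hyphen that directly follows whitespace and is followed by a letter.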

Select files using include and exclude array of recursive glob patterns

I've been given two file glob parameters in JSON, include and exclude, in the following format:
{
    include: ['**/*.md', '**/swagger/*.json', '**/*.yml', 'somedir/*.yml'],
    exclude: ['**/obj/**', 'otherdir/**', '**/includes/**']
}
I'm tasked with walking a directory tree to select files according to the include and exclude rules in these formats; this has to be written as a PowerShell script.
I've been trying to find a built-in command that supports the double-asterisk, recursive file glob pattern; additionally, since PowerShell converts the JSON to an object, it would be nice if the command parameters could accept an array as input.
I've looked at Get-ChildItem, but I'm not sure that I can mimic the glob resolution behavior using -include, -exclude, and/or -filter. I've also looked at Resolve-Path, but I'm not sure if the wildcards will work correctly (and I might have to manually exclude paths).
How can I select paths using multiple recursive wildcard file globs in PowerShell while excluding other globs? Is there a PowerShell command that supports this?
Thank you!
EDIT:
In these glob patterns, the single asterisk is a regular wildcard. The double asterisk (**), however, is a known standard which denotes a recursive directory search.
For example: the pattern dir1/*/file.txt would match:
dir1/dir2/file.txt
dir1/dir3/file.txt
...but not:
dir1/dir2/dir3/file.txt
The pattern dir1/**/file.txt would match everything that the above selector would, but it would also match:
dir1/dir3/dir4/file.txt
dir1/dir7/dir9/dir23/dir47/file.txt
and so on. So, an exclude glob pattern like **/obj/** basically means "exclude anything found in any obj folder found at any point in the directory hierarchy, no matter how deep".
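The matching semantics described above can be sketched as a small glob-to-regex translation. Python is used here only to pin down the intended behaviour; it is not a PowerShell solution. glob_to_regex and select_paths are hypothetical helper names, and the sketch assumes relative paths with forward slashes and treats "**/" as "zero or more directories".

```python
import re

def glob_to_regex(pattern):
    """Translate a glob where * stays within one path segment and ** may cross segments."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            parts.append("(?:.*/)?")   # zero or more whole directories
            i += 3
        elif pattern.startswith("**", i):
            parts.append(".*")         # anything, including slashes
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")      # anything except a slash
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts) + r"\Z")

def select_paths(paths, include, exclude):
    """Keep paths matching at least one include glob and no exclude glob."""
    inc = [glob_to_regex(p) for p in include]
    exc = [glob_to_regex(p) for p in exclude]
    return [p for p in paths
            if any(r.match(p) for r in inc) and not any(r.match(p) for r in exc)]

rules = {
    "include": ["**/*.md", "**/swagger/*.json", "**/*.yml", "somedir/*.yml"],
    "exclude": ["**/obj/**", "otherdir/**", "**/includes/**"],
}
# Made-up paths, chosen only to exercise each rule.
candidates = ["docs/a.md", "src/obj/b.md", "api/swagger/c.json", "otherdir/d.yml",
              "somedir/e.yml", "x/includes/f.md", "readme.txt"]
print(select_paths(candidates, rules["include"], rules["exclude"]))
```

The same include-then-exclude filter could be applied to the relative paths produced by a recursive directory listing in any language.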

Uppercasing filename in Makefile using sed

I am trying to convert a filename such as foo/bar/baz.proto into something like foo/bar/Baz.java in my Makefile. For this purpose, I thought I could use sed. However, the command does not seem to work as expected:
uppercase_file = $(shell echo "$(1)" | sed 's/\(.*\/\)\(.*\)/\1\u\2/')
# generated Java sources
PROTO_JAVA_TARGETS := ${PROTO_SPECS:$(SRCDIR)/%.proto=$(JAVAGEN)/$(call uppercase_file,%).java}
When I try to run the sed command on the command line it seems to work:
~$ echo "foo/bar/baz" | sed 's/\(.*\/\)\(.*\)/\1\u\2/'
foo/bar/Baz
Any ideas why this does not work inside the Makefile?
UPDATE:
The java files are generated with the following target:
$(JAVAGEN)/%.java: $(SRCDIR)/%.proto
How can I apply the substitution also for targets?
GNU Make does not replace the % character in the replacement part of a substitution reference (which is basically syntactic sugar for patsubst) if it is part of a variable reference. I have not found this behavior described in the documentation, but you can see it implemented in the source code (the relevant function, I believe, is find_char_unquote).
I suggest moving the call out of the substitution reference, since uppercase_file obviously works properly on any file path:
PROTO_JAVA_TARGETS := $(call uppercase_file,${PROTO_SPECS:$(SRCDIR)/%.proto=$(JAVAGEN)/%.java})
If $(PROTO_SPECS) resolves not to a single element but to a list of elements, you can use foreach to call the function on every element of the processed list:
PROTO_JAVA_TARGETS := $(foreach JAVA,${PROTO_SPECS:$(SRCDIR)/%.proto=$(JAVAGEN)/%.java},$(call uppercase_file,$(JAVA)))
Regarding the update ("The java files are generated with the following target: $(JAVAGEN)/%.java: $(SRCDIR)/%.proto. How can I apply the substitution also for targets?"):
Since Make matches targets first, and there is no way to run sed backwards, you need either to define an inverse function or to generate multiple explicit rules. I will show the latter approach.
define java_from_proto
$(call uppercase_file,$(1:$(SRCDIR)/%.proto=$(JAVAGEN)/%.java)): $1
	# Whatever recipe you use.
	# Use `$$@`, `$$<` and so on instead of `$@` or `$<`.
endef
$(foreach PROTO,$(PROTO_SPECS),$(eval $(call java_from_proto,$(PROTO))))
We generate one rule per file in $(PROTO_SPECS) using the multi-line variable syntax (define), and then use eval to install each rule. There is also a very similar example on this documentation page that can be helpful.
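For reference, the path mapping that the substitution reference and uppercase_file perform together can be written out in plain Python. This is only a sketch of the intended transformation, not something Make runs: java_target is a hypothetical name, and "src" and "gen" below are made-up stand-ins for $(SRCDIR) and $(JAVAGEN).

```python
def java_target(proto_path, srcdir, javagen):
    """Map <srcdir>/a/b/name.proto to <javagen>/a/b/Name.java."""
    prefix = srcdir.rstrip("/") + "/"
    if not (proto_path.startswith(prefix) and proto_path.endswith(".proto")):
        # Simplification: paths that do not fit the pattern are returned unchanged.
        return proto_path
    stem = proto_path[len(prefix):-len(".proto")]
    directory, _, name = stem.rpartition("/")
    name = name[:1].upper() + name[1:]   # same effect as sed's \u on the base name
    return "/".join(part for part in (javagen, directory, name) if part) + ".java"

for proto in ["src/foo/bar/baz.proto", "src/top.proto"]:
    print(proto, "->", java_target(proto, "src", "gen"))
```

Having the mapping as one function makes it easy to check by hand which target each generated rule should end up with.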

PCRE Regex - How to return matches with multiline string looking for multiple strings in any order

I need to use Perl-compatible regex to match several strings which appear over multiple lines in a file.
The matches need to appear in any order (server servernameA.company.com followed by servernameZ.company.com followed by servernameD.company.com or any order combination of the three). Note: All matches will appear at the beginning of each line.
In my testing with grep -P, I haven't even been able to produce a match on simple string terms that appear in any order over new lines (even when using the /s and /m modifiers). I am pretty sure from reading I need a look-ahead assertion but the samples I used didn't produce a match for me even after analyzing each bit of the regex to make sure it was relevant to my scenario.
Since I need to support this in Production, I would like an answer that is simple and relatively straight-forward to interpret.
Sample Input
irrelevant_directive = 0
# Comment
server servernameA.company.com iburst
additional_directive = yes
server servernameZ.company.com iburst
server servernameD.company.com iburst
# Additional Comment
final_directive = true
Expectation
The regex should match and return the 3 lines beginning with server (that appear in any order) if and only if there is a perfect match for the strings 'serverA.company.com', 'serverZ.company.com', and 'serverD.company.com' followed by iburst. All 3 strings must be included.
Finally, if the answer (or a very similar form of the answer) can address checking for strings in any order on a single line, that would be very helpful. For example, if I have a single-line string of: preauth param audit=true silent deny=5 severe=false unlock_time=1000 time=20ms and I want to ensure the terms deny=5 and time=20ms appear in any order and if so match.
Thank you in advance for your assistance.
Regarding the main issue (for the secondary question, see Casimir et Hippolyte's answer), using the x modifier: https://regex101.com/r/mkxcap/5
(?:
(?<a>.*serverA\.company\.com\s+iburst.*)
|(?<z>.*serverZ\.company\.com\s+iburst.*)
|(?<d>.*serverD\.company\.com\s+iburst.*)
|[^\n]*(?:\n|$)
)++
(?(a)(?(z)(?(d)(*ACCEPT))))(*SKIP)(*F)
The matches are now all in the a, z and d capturing groups.
It's not the most efficient (it goes three times over each line with backtracking...), but the main takeaway is to register the matches with capturing groups and then check that each of them is defined.
For the single-line case, you don't need PCRE features; you can simply write it in ERE:
grep -E '.*(\bdeny=5\b.*\btime=20ms\b|\btime=20ms\b.*\bdeny=5\b).*' file
The PCRE approach is different (although you can also use the previous pattern):
grep -P '^(?=.*\bdeny=5\b).*\btime=20ms\b.*' file
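If a scripting language is available, the same two checks can be written without a single large pattern, which may be easier to maintain in production. The following Python sketch is an alternative to the regexes above, not a translation of them: matching_server_lines and has_all_terms are hypothetical names, the host names are taken from the sample input, and terms are assumed to be delimited by whitespace.

```python
import re

SERVER_LINE = re.compile(r"server\s+(\S+)\s+iburst\b")

def matching_server_lines(text, hosts):
    """Return the 'server <host> iburst' lines, in file order, only if every host is present."""
    wanted, seen, lines = set(hosts), set(), []
    for line in text.splitlines():
        m = SERVER_LINE.match(line)      # match() anchors at the start of the line
        if m and m.group(1) in wanted:
            seen.add(m.group(1))
            lines.append(line)
    return lines if seen == wanted else []

def has_all_terms(line, terms):
    """True if every term occurs as a whitespace-delimited token, in any order."""
    lookaheads = "".join(rf"(?=.*(?<!\S){re.escape(t)}(?!\S))" for t in terms)
    return re.search("^" + lookaheads, line) is not None

config = """irrelevant_directive = 0
# Comment
server servernameA.company.com iburst
additional_directive = yes
server servernameZ.company.com iburst
server servernameD.company.com iburst
# Additional Comment
final_directive = true"""
hosts = ["servernameD.company.com", "servernameA.company.com", "servernameZ.company.com"]
print(matching_server_lines(config, hosts))

pam_line = "preauth param audit=true silent deny=5 severe=false unlock_time=1000 time=20ms"
print(has_all_terms(pam_line, ["time=20ms", "deny=5"]))
```

has_all_terms builds one lookahead per term, the same idea as the grep -P pattern, so adding a third required term does not multiply the number of alternatives the way the ERE version would.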

oneliner -- multiple file substitution transformation produces out-of-sync results

Context
perl 5.22
multi-file transformation with perl oneliner
Overview
TrevorWattanStewie has a directory full of config files, and he wants to transform them.
The transformation operation is best understood by comparing "BEFORE" to "AFTER".
Files BEFORE
## ./configfile001.config
TrevorWattanStewie@oldmail.com;--blank--
## ./configfile002.config
TrevorWattanStewie@oldmail.com;--blank--
## ./configfile003.config
TrevorWattanStewie@oldmail.com;--blank--
## ./configfile004.config
TrevorWattanStewie@oldmail.com;--blank--
SallyWattanStewie@oldmail.com;--blank--
RickyWattanStewie@oldmail.com;--blank--
Files AFTER (Desired result)
## ./configfile001.config
TrevorWattanStewie@newmail.com;configfile001.config
## ./configfile002.config
TrevorWattanStewie@newmail.com;configfile002.config
## ./configfile003.config
TrevorWattanStewie@newmail.com;configfile003.config
## ./configfile004.config
TrevorWattanStewie@newmail.com;configfile004.config
SallyWattanStewie@newmail.com;configfile004.config
RickyWattanStewie@newmail.com;configfile004.config
Step by Step Explanation
Trevor wants to:
replace all --blank-- tokens with the name of the file currently being processed.
change all substrings from @oldmail into @newmail
Trevor's attempt
Trevor decides the quickest way to get the job done is with a perl oneliner script.
The oneliner Trevor uses is as follows:
perl -pi -e '$curf=$ARGV[0];s/--blank--/$curf/; s/@oldmail.com/@newmail.com/;' *.asc
Problem
When Trevor runs the script, the output does not meet his expectations.
The actual result is as follows:
Files AFTER (Actual result)
## ./configfile001.config
TrevorWattanStewie@oldmail.com;configfile002.config
## ./configfile002.config
TrevorWattanStewie@oldmail.com;configfile003.config
## ./configfile003.config
TrevorWattanStewie@oldmail.com;configfile004.config
## ./configfile004.config
TrevorWattanStewie@oldmail.com;
SallyWattanStewie@oldmail.com;
RickyWattanStewie@oldmail.com;
Questions
Why did Trevor's script fail to transform @oldmail to @newmail?
Why is the file numbering mismatched? The sequence numbering is off by one.
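You want to use the variable $ARGV, which holds the name of the file currently being processed. $ARGV[0] is the first element of the array @ARGV, which by then contains only the files still waiting to be processed; that is why the names are off by one and the last file gets an empty string.
So s/--blank--/$ARGV/;
Also, @oldmail (etc.) will be interpolated inside the regex, as Wumpus Q. Wumbley notes.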
I always run my one-liners with -wE.
Trevor didn't enable warnings, thus missing out on the explanation:
$ perl -wpi -e '$curf=$ARGV[0];s/--blank--/$curf/; s/@oldmail.com/@newmail.com/;' *.asc
Possible unintended interpolation of @oldmail in string at -e line 1.
Possible unintended interpolation of @newmail in string at -e line 1.
@oldmail and @newmail are arrays. The s/// operator interpolates variables, including arrays, so you need to escape the sigil and write \@.
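For comparison, here is the intended per-file transformation written as a short Python sketch. It is not a replacement for the one-liner, just a way to state the expected behaviour explicitly: transform_text and transform_file are hypothetical names, and the mail domains are the ones from the question.

```python
from pathlib import Path

def transform_text(text, filename):
    """Replace --blank-- with the file's own name and move addresses to the new domain."""
    return text.replace("--blank--", filename).replace("@oldmail.com", "@newmail.com")

def transform_file(path):
    """Rewrite one file in place, using the name of that same file."""
    path = Path(path)
    path.write_text(transform_text(path.read_text(), path.name))

before = "TrevorWattanStewie@oldmail.com;--blank--\nSallyWattanStewie@oldmail.com;--blank--\n"
print(transform_text(before, "configfile004.config"))
```

Because the file name is passed in explicitly for each file, there is no shared state that could drift to the next file, which is the failure mode seen with $ARGV[0].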