I am trying to replace "/data/kollman/appion/*/relion/micrographs" with "micrographs":
sed -i 's/\/data\/kollman\/appion\/.*\/relion\/micrographs/micrographs/g' micrographs_all_gctf.star
Each line contains two occurrences that need to be replaced:
/data/kollman/appion/17nov14d/relion/micrographs/00001_nonDW.mrc /data/kollman/appion/17nov14d/relion/micrographs/00001_nonDW.ctf:mrc 18326.289062 19408.296875 74.539665 120.000000 2.120000 0.200000 87500.000000 14.000000 -0.032973 3.656274
/data/kollman/appion/17nov14d/relion/micrographs/00002_nonDW.mrc /data/kollman/appion/17nov14d/relion/micrographs/00002_nonDW.ctf:mrc 19867.357422 20695.939453 48.760956 120.000000 2.120000 0.200000 87500.000000 14.000000 -0.034282 3.727132
needs to be turned into this:
micrographs/00001_nonDW.mrc micrographs/00001_nonDW.ctf:mrc 18326.289062 19408.296875 74.539665 120.000000 2.120000 0.200000 87500.000000 14.000000 -0.032973 3.656274
micrographs/00002_nonDW.mrc micrographs/00002_nonDW.ctf:mrc 19867.357422 20695.939453 48.760956 120.000000 2.120000 0.200000 87500.000000 14.000000 -0.034282 3.727132
but instead, the result I'm getting is this:
micrographs/00001_nonDW.ctf:mrc 18326.289062 19408.296875 74.539665 120.000000 2.120000 0.200000 87500.000000 14.000000 -0.032973 3.656274
micrographs/00002_nonDW.ctf:mrc 19867.357422 20695.939453 48.760956 120.000000 2.120000 0.200000 87500.000000 14.000000 -0.034282 3.727132
The problem seems to be the way I'm using the wildcard here. I need it because that part of the folder structure always differs; the command is meant to generalize across all folder structures like this. The asterisk stands in for the date, which always changes.
Anyway, the wildcard replaces the date as expected, but it extends past the date and crosses over into the second instance that needs replacement. The result is that one occurrence of the path is deleted from each line.
Your wildcard .* is matching all characters up to the second instance of /relion/micrographs. You need a more deterministic pattern.
Also, use a different sed expression delimiter so that you don't need to escape each /:
sed -E 's#/data/kollman/appion/[^ ]+/relion/micrographs#micrographs#g' file
The character class [^ ] makes sure we only match non-space characters, which prevents the greedy match from swallowing everything up to the second instance of /relion/micrographs.
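If the variable part is always exactly one directory (the date), another hedged option (untested) is to restrict the wildcard to a single path segment, which also works without -E:
sed -i 's#/data/kollman/appion/[^/]*/relion/micrographs#micrographs#g' micrographs_all_gctf.star
Because [^/]* cannot cross a /, it can never run on into the second occurrence.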
Related
I have the following sample string:
'-Dparam="x" -f hello-world.txt bye1.txt foo_bar.txt -Dparam2="y"'
I am trying to use RegEx (PowerShell, .NET flavor) to extract the filenames hello-world.txt, bye1.txt, and foo_bar.txt.
The real use case could have any number of -D parameters, and the -f <filenames> argument could appear in any position between these other parameters. I can't easily use something like split to extract it as the delimiter positioning could change, so I thought RegEx might be a good proposition here.
My attempt is something like this in PowerShell (it can be copy-pasted into a PowerShell prompt on any Windows system):
'-Dparam="x" -f hello-world.txt bye1.txt foo_bar.txt -Dparam2="y"' -replace '^.* -f ([a-zA-Z0-9_.\s-]+).*$','$1'
Desired output:
hello-world.txt bye1.txt foo_bar.txt
My problem is that I either capture only hello-world.txt, or I capture from hello-world.txt all the way to the end of the string or to the next = symbol (as in the example above).
I am having trouble expressing that \s is allowed, since I need to capture multiple space-delimited filenames, but that the combination of \s-[a-zA-Z] is not allowed, as that indicates the start of the next argument.
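One hedged, untested sketch of that constraint is a tempered match: every captured character is allowed only if it does not start a \s-[A-Za-z] sequence (this assumes the filenames consist of word characters, dots and dashes):
'-Dparam="x" -f hello-world.txt bye1.txt foo_bar.txt -Dparam2="y"' -replace '^.* -f ((?:(?!\s-[A-Za-z])[\w.\s-])+).*$','$1'
The negative lookahead stops the capture just before the next -D argument, so this should print hello-world.txt bye1.txt foo_bar.txt.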
I have a few SRA files which I downloaded from the NCBI website. Now I want to add them to my Snakemake workflow. However, I want to retain the ability to download them with prefetch if they are not available. I had the following simple rules:
BASE = "/path/to/working/folder"

rule all:
    input: [f"{BASE}/fastq/SRR000001.sra_1.fastq", f"{BASE}/fastq/SRR000001.sra_2.fastq"]
    shell:
        "echo Finished"

rule get_sra:
    input: ancient("config/config.yaml")
    output: "{BASE_FOLDER}/sra/{SSR_ID}.sra"
    shell:
        "prefetch -p {wildcards.SSR_ID} --output-file {output} "

rule get_fastq:
    input: expand("{folder}/sra/{srr}.sra", folder=BASE, srr="{SRR_ID}")
    output:
        expand("{folder}/fastq/{srr}.sra_{i}.fastq", folder=BASE,
               srr="{SRR_ID}", i=[1, 2])
    shell:
        "fasterq-dump {input} --outdir {BASE}/fastq"
If I use the rules above, my workflow will recreate my SRA files because their timestamps are older. However, I do not want to download the full SRA file from the server again; I want to use the one already downloaded.
For this purpose I am trying to use the ancient flag, but I am not able to use it with any of the wildcards.
input: ancient("{BASE_FOLDER}/sra/{SSR_ID}.sra")
The above rule gives this error:
Wildcards in input files cannot be determined from output files:
Any solution to this problem? This also does not work when I use expand.
The problem is that not everything you specify in curly braces is actually a wildcard. There are three different contexts in which you may use curly braces:
the expand function
an f-string
wildcards
In the first two cases (expand and f-string) the result is a fully specified string without any wildcards at all. If you have something like this:
rule dummy:
    input: "{wildcard}.input"
    output: expand("{wildcard}.output", wildcard=["1", "2"])
the result would be simply:
rule dummy:
    input: "{wildcard}.input"
    output: ["1.output", "2.output"]
As you can see, there are no wildcards in the output section at all, so the input cannot determine the value for its wildcard.
The typical solution is to separate this rule into two rules:
rule all:
    input: expand("{wildcard}.output", wildcard=["1", "2"])

rule do_some_work:
    input: "{wildcard}.input"
    output: "{wildcard}.output"
Note however that what I called {wildcard} in rule all: is not a wildcard per se, but just an arbitrarily chosen name in the local context of the expand function.
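Applied to the SRA rules above, a hedged, untested sketch of that separation might look like this (the single accession in rule all, the ancient() placement, and the reuse of the original shell commands are all assumptions):

BASE = "/path/to/working/folder"

rule all:
    input:
        expand(f"{BASE}/fastq/{{srr}}.sra_{{i}}.fastq", srr=["SRR000001"], i=[1, 2])

rule get_sra:
    output:
        f"{BASE}/sra/{{srr}}.sra"
    shell:
        "prefetch -p {wildcards.srr} --output-file {output}"

rule get_fastq:
    input:
        # ancient(): ignore this file's timestamp, so an already downloaded .sra is reused
        ancient(f"{BASE}/sra/{{srr}}.sra")
    output:
        f"{BASE}/fastq/{{srr}}.sra_1.fastq",
        f"{BASE}/fastq/{{srr}}.sra_2.fastq"
    shell:
        "fasterq-dump {input} --outdir {BASE}/fastq"

Here {srr} stays a genuine wildcard in the work rules (the doubled braces in the f-strings keep it from being interpolated), and only rule all names concrete accessions.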
I'm endeavoring to create a system to generalize rules from input text. I'm using reVerb to create my initial set of rules. Using the following command[*], for instance:
$ echo "Bananas are an excellent source of potassium." | ./reverb -q | tr '\t' '\n' | cat -n
To generate output of the form:
1 stdin
2 1
3 Bananas
4 are an excellent source of
5 potassium
6 0
7 1
8 1
9 6
10 6
11 7
12 0.9999999997341693
13 Bananas are an excellent source of potassium .
14 NNS VBP DT JJ NN IN NN .
15 B-NP B-VP B-NP I-NP I-NP I-NP I-NP O
16 bananas
17 be source of
18 potassium
I'm currently piping the output to a file, which includes the preceding white space and numbers as depicted above.
What I'm really after is just the simple rule at the end, i.e. lines 16, 17 & 18. I've been trying to create a script to extract just that component and put it into a new file in the form of a Prolog clause, i.e. be source of(bananas, potassium).
Is that feasible? Can Prolog rules contain white space like that?
I think I'm locked into getting all that output from reVerb, so what would be the best way to extract the desired component? With a Perl script? Or maybe sed?
*Later I plan to replace this with a larger input file as opposed to just single sentences.
This seems wasteful. Why not leave the tabs as they are, and use:
$ echo "Bananas are an excellent source of potassium." \
| ./reverb -q | cut --fields=16,17,18
And yes, you can have rules like this in Prolog. See the answer by mat. You need to know a bit of Prolog before you move on, I guess.
It is easier, however, to just make the string a valid name for a predicate:
be_source_of with underscores instead of spaces
or 'be source of' with spaces, and enclosed in single quotes.
You can probably use awk to do what you want with the three fields; see for example the printf command in awk. Or you can parse it again from Prolog directly. Both are beyond the scope of your current question, I feel.
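For example, a hedged, untested sketch that pipes the tab-separated reverb output straight into awk and prints fields 17, 16 and 18 as a Prolog-style fact:
$ echo "Bananas are an excellent source of potassium." \
| ./reverb -q | awk -F'\t' '{ printf "%s(%s, %s).\n", $17, $16, $18 }'
which should print something like be source of(bananas, potassium).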
sed -n '1{N;N}
$!{N
D
}
s/\(.*\)\n\(.*\)\n\(.*\)/\2 (\1,\3)/p' YourFile
If the line numbers are part of the file (and not just shown for reference), change the last sed action to:
s/^ *[0-9]\{1,\}[ \t]\{1,\}\(.*\)\n *[0-9]\{1,\}[ \t]\{1,\}\(.*\)\n *[0-9]\{1,\}[ \t]\{1,\}\(.*\)/\2 (\1,\3)/p
This assumes the last 3 lines are the source of your "rules".
Regarding the Prolog part of the question:
Yes, Prolog facts can contain whitespace like this, with suitable operator declarations present.
For example:
:- op(700, fx, be).
:- op(650, fx, source).
:- op(600, fx, of).
Example query and its result, to let you see the shape of terms that are created with this syntax:
?- write_canonical(be source of(a, b)).
be(source(of(a,b))).
Therefore, with these operator declarations, a fact like:
be source of(a, b).
is exactly the same as stating:
be(source(of(a,b))).
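With the declarations and that fact loaded, a query in the same syntax should unify as expected (a small sketch, assuming a system such as SWI-Prolog that honours the operator declarations):
?- be source of(X, Y).
X = a,
Y = b.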
Depending on your use cases and other definitions, it may even be preferable not to create this kind of fact (i.e., facts of the form be/1) and to use source_of/2 instead. If this is the only kind of fact you need, you can simply write:
source_of(a, b).
This creates no redundant wrappers and is easier to use.
Or, as Boris suggested, you can use single quotes as in 'be source of'/2.
I am trying to modify some lengthy code. I want to replace all instances of the expressions in list 1 with the corresponding words in list 2 (pairwise).
List 1:
Vsap1*(GF/(Kagf+GF))
kdap1*AP1
vsprb
kpc1*pRB*E2F
.
.
List 2:
v1
v2
v3
v4
.
.
In other words, I'd like it to replace all instances of "Vsap1*(GF/(Kagf+GF))" with "v1" (and so on) in the file "code.txt". I have List 1 in a text file ("search_for.txt").
So far, I've been doing something like this:
set search_for=`cat search_for.txt`
set vv=1
foreach reaction $search_for
sed -i s/$reaction/$vv/g code.txt
set vv=$vv+1
end
There are many problems with this code. First, it seems the code can't handle expressions with parentheses (something about "regular expressions"?). Second, I'm not sure my counter is working properly. Third, I haven't even integrated the replacement list -- I thought it would be easier to just replace with 1,2,3… instead. Ideally, I would like to replace with v1,v2,v3…
Any help would be greatly appreciated!! I work mainly in Matlab (in which it is hard to deal with strings and such) so I'm not that great at csh.
Best,
Mehdi
awk would be better, I think.
set search_for=`cat search_for.txt`
set vindex=1
foreach reaction (${search_for})
    set ReactionEscaped="`printf \"%s\" \"${reaction}\" | sed 's²[\+*./[]²\\\\&²g'`"
    sed -i "s/${ReactionEscaped}/v${vindex}/g" code.txt
    @ vindex += 1
end
I haven't tested this (no system available here), so the ReactionEscaped line will certainly need fine tuning (because of the doubled backslashes between double quotes, and the special meaning of some characters in the first sed pattern). There are lots of posts on this site about escaping special characters in sed patterns.
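Since awk was suggested, here is a hedged, untested sketch of the whole replacement done in awk alone; it uses index() and substr() so the expressions are treated as literal strings and no escaping is needed (code_new.txt is just a placeholder for the output file):
awk 'NR == FNR { pat[++n] = $0; next }            # first file: collect the expressions to search for
{
    for (i = 1; i <= n; i++)                      # replace each literal occurrence with v1, v2, ...
        while ((p = index($0, pat[i])) > 0)
            $0 = substr($0, 1, p - 1) "v" i substr($0, p + length(pat[i]))
    print
}' search_for.txt code.txt > code_new.txt
The while loop assumes no replacement vN contains its own search expression again, which holds for names like v1, v2, and so on.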
How do I use MATLAB's regexprep for multiple expressions and replacements?
file='http:xxx/sys/tags/Rel/total';
I want to replace 'sys' with 'sys1' and 'total' with 'total1'. For a single expression and replacement it works like this:
strrep(file,'sys', 'sys1')
and I want to have something like:
strrep(file,'sys','sys1','total','total1') .
I know this doesn't work with strrep.
Why not just issue the command twice?
file = 'http:xxx/sys/tags/Rel/total';
file = strrep(file,'sys','sys1')
strrep(file,'total','total1')
To solve this you need regex substitution functionality; try to find something in MATLAB's regex functions similar to this PHP:
$string = 'http:xxx/sys/tags/Rel/total';
preg_replace('/http:(.*?)\//', 'http:${1}1/', $string);
${1} means the first match group, i.e. whatever is inside the parentheses, (.*?).
http:(.*?)\/ - match pattern
http:${1}1/ - replace pattern; the second 1 is the literal 1 you want to add (the first 1 is the group number)
http:xxx/sys/tags/Rel/total - input string
The secret is that whatever (.*?) matches (whether xxx, yyyy or 1234) is inserted in place of ${1} in the replace pattern, and the result then replaces the matched text in the input string. See the PHP documentation for more examples of its substitution functionality.
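A rough MATLAB translation of that PHP call (a hedged sketch; the $11 replacement relies on the same token-followed-by-literal-1 trick used in the answer below) would be:
file = 'http:xxx/sys/tags/Rel/total';
regexprep(file, 'http:(.*?)/', 'http:$11/')
which should give something like http:xxx1/sys/tags/Rel/total.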
As documented in the help page for regexprep, you can specify pairs of patterns and replacements like this:
file='http:xxx/sys/tags/Rel/total';
regexprep(file, {'sys' 'total'}, {'sys1' 'total1'})
ans =
http:xxx/sys1/tags/Rel/total1
It is even possible to use tokens, should you be able to define a match pattern for everything you want to replace:
regexprep(file, '/([st][yo][^/$]*)', '/$11')
ans =
http:xxx/sys1/tags/Rel/total1
However, care must be taken with the first approach under certain circumstances, because MATLAB applies the replacement pairs one after another. That is to say, if the first pattern matches a string and replaces it with something that is subsequently matched by a later pattern, then that will also be replaced by the later replacement, even though it did not match the later pattern in the original string.
Example:
regexprep('This\is{not}LaTeX.', {'\\' '([{}])'}, {'\\textbackslash{}' '\\$1'})
ans =
This\textbackslash\{\}is\{not\}LaTeX.
=> This\{}is{not}LaTeX.
and
regexprep('This\is{not}LaTeX.', {'([{}])' '\\'}, {'\\$1' '\\textbackslash{}'})
ans =
This\textbackslash{}is\textbackslash{}{not\textbackslash{}}LaTeX.
=> This\is\not\LaTeX.
Both results are unintended, and there seems to be no way around this as long as the replacements are applied consecutively rather than simultaneously.