Matching a C++ style regex with sed?


Does anyone know how to match this C++ style regex using sed? Specifically, I want to break it into multiple parts using (pattern) groups and \n backreferences.
// ZIP code pattern: XXddddd-dddd and variants
regex pat (R"(\w{2}\s*\d{5}(-\d{4})?)");
For instance, the string AB00000-0000 should match, and \1, \2, \3 should print the corresponding substrings from the pattern space.
SOLUTION:
Here is the sed answer, which also captures the initial two characters.
$ echo AB00000-0000 | sed 's/\([[:alpha:]]\{2\}\)[[:space:]]*\([[:digit:]]\{5\}\)\(-\([[:digit:]]\{4\}\)\)\{0,1\}/\1 \2 \4/'
AB 00000 0000
Thank you

This:
\w{2}\s*\d{5}(-\d{4})?
is a PCRE (as would work in perl or GNU grep -P) and can be written as a POSIX ERE (as would work in awk, grep -E, or GNU/OSX sed -E) as:
[[:alnum:]_]{2}[[:space:]]*[[:digit:]]{5}(-[[:digit:]]{4})?
or as a POSIX BRE (as would work in grep or sed without -E) as:
[[:alnum:]_]\{2\}[[:space:]]*[[:digit:]]\{5\}\(-[[:digit:]]\{4\}\)\{0,1\}
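As a quick sanity check, the ERE and BRE spellings can be compared side by side with grep. This is only a sketch: the variable names and the "bad-line" entry are made up for the demonstration, and -x is added so that the whole line has to match.

```shell
# Sample lines; "bad-line" is a made-up entry that should not match.
sample=$(printf '%s\n' 'NJ 08542-0033' 'PA19103-0200' 'NY10002' 'bad-line')

ere='[[:alnum:]_]{2}[[:space:]]*[[:digit:]]{5}(-[[:digit:]]{4})?'
bre='[[:alnum:]_]\{2\}[[:space:]]*[[:digit:]]\{5\}\(-[[:digit:]]\{4\}\)\{0,1\}'

# -x requires the whole line to match; -E switches grep from BRE to ERE.
ere_hits=$(printf '%s\n' "$sample" | grep -xE "$ere")
bre_hits=$(printf '%s\n' "$sample" | grep -x "$bre")

printf '%s\n' "$ere_hits"
[ "$ere_hits" = "$bre_hits" ] && echo "ERE and BRE select the same lines"
```

Both forms should keep the three ZIP lines and drop the fourth.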

Assuming you have a file which contains:
NJ 08542-0033
PA19103-0200
NY10002
Then the sed command:
sed 's/[[:alpha:]]\{2\}[[:space:]]*\([[:digit:]]\{5\}\)\(-\([[:digit:]]\{4\}\)\)\{0,1\}/\1 \3/' file
will output:
08542 0033
19103 0200
10002
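Note that for NY10002 the command above actually prints the ZIP followed by a trailing space, because \3 is empty. A possible variant, assuming a sed that accepts -E (GNU or BSD), writes the same pattern as an ERE and strips that space with a second expression. The variable name zips is only illustrative.

```shell
# The three sample lines from the file above, fed through a pipe.
zips=$(printf '%s\n' 'NJ 08542-0033' 'PA19103-0200' 'NY10002' |
  sed -E -e 's/[[:alpha:]]{2}[[:space:]]*([[:digit:]]{5})(-([[:digit:]]{4}))?/\1 \3/' \
         -e 's/ $//')   # drop the trailing space left when \3 is empty
printf '%s\n' "$zips"
```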

Related

How do I correctly use variables with sed?

I need to use a string variable in a sed command. My attempt is given in script.sh, it does not do what I want, I assume because my variable contains characters that I need sed to evaluate. I am working in linux bash.
input.txt
delicious.banana
gross.apple
script.sh
adjectives="delicious\|gross\|bearable\|yummy"
sed "s/\($adjectives\)\.//g" input.txt > output.txt
output.txt desired
banana
apple
output.txt current
deliciousbanana
grossdapple

Non-GNU sed implementations do not support \| in BRE (basic regex) mode. I suggest using ERE (extended regex) mode with -E; as a bonus, you can drop all the escaping:
adjectives="delicious|gross|bearable|yummy"
sed -E "s/($adjectives)\.//g" input.txt
banana
apple
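A self-contained sketch of the same idea, with the input supplied by printf instead of input.txt. The 'plain.pear' line and the ^ anchor are my additions, assuming the adjective always sits at the start of the line. Keep in mind that the variable is interpolated into the regex as-is, so a value containing / or regex metacharacters would need escaping first.

```shell
adjectives='delicious|gross|bearable|yummy'

# Double quotes let the shell expand $adjectives before sed sees the script.
result=$(printf '%s\n' 'delicious.banana' 'gross.apple' 'plain.pear' |
  sed -E "s/^($adjectives)\.//")
printf '%s\n' "$result"
```

Lines whose prefix is not in the list pass through unchanged.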

Replacing a part of path

I have a folder with many subfolders containing files with lines like this: version/1.1/... or version/1.1.1/...
I want to replace all version numbers for version 1.2 like this: version/1.2/...
Before replacing, I want to display all version numbers used in all files in this hierarchy. How can I do it using grep and sed?
What I've tried:
grep -Ri "version\/([0-9].[0-9](?:.[0-9])?)\/" .
grep -Ril "version\/([0-9].[0-9](?:.[0-9])?)\/" . | xargs sed sed -i -e 's/(version\/([0-9].[0-9](?:.[0-9])?)\/)/(1.2)/g'

POSIX regex does not support non-capturing groups, and ? is not a quantifier in POSIX BRE, it just matches a literal ? char. If you are using GNU sed, you may escape ? to make it a quantifier, or use the -E option to make the pattern POSIX ERE compliant.
The grep "version\/([0-9].[0-9](?:.[0-9])?)\/" pattern must be written as "version/[0-9]\.[0-9]\(\.[0-9]\)\{0,1\}/" (again, it is a POSIX BRE pattern, capturing groups are defined with \(...\) and range quantifiers are specified using \{min,max\}). Do not forget to escape the literal . char when it is used outside of bracket expressions.
After removing a duplicated sed word (you have sed sed) you can use
sed -i 's/\(version\/\)[0-9.]*/\11.2/g'
It will match and capture version/ string into Group 1 and then will match zero or more digits or dots and will replace with the Group 1 value (\1) and 1.2 string.
Full command:
grep -Ril "version/[0-9]\.[0-9]\(\.[0-9]\)\{0,1\}/" . | \
xargs sed -i 's/\(version\/\)[0-9.]*/\11.2/g'
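To cover the "display all version numbers first" part of the question, something along these lines might help. It is a sketch that assumes a grep with -r, -h and -o (GNU and BSD grep both have them); the temporary directory and its two files are invented for the demonstration, and the sed call runs without -i so it only previews the result.

```shell
# Hypothetical sample tree, created only for this demonstration.
demo=$(mktemp -d)
mkdir -p "$demo/a/b"
printf '%s\n' 'path=version/1.1/lib'   > "$demo/a/one.txt"
printf '%s\n' 'path=version/1.1.1/bin' > "$demo/a/b/two.txt"

# List the distinct version strings before changing anything.
# -o prints only the matched text, -h hides file names.
found=$(grep -rhoE 'version/[0-9]+(\.[0-9]+)*/' "$demo" | LC_ALL=C sort -u)
printf '%s\n' "$found"

# Preview the replacement on one file (no -i, so nothing is modified).
preview=$(sed 's/\(version\/\)[0-9.]*/\11.2/g' "$demo/a/b/two.txt")
printf '%s\n' "$preview"

rm -rf "$demo"
```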

sed equivalent of perl -pe

I'm looking for an equivalent of perl -pe. Ideally, it would be replace with sed if it's possible. Any help is highly appreciated.
The code is:
perl -pe 's/^\[([^\]]+)\].*$/$1/g'

$ echo '[foo] 123' | perl -pe 's/^\[([^\]]+)\].*$/$1/g'
foo
$ echo '[foo] 123' | sed -E 's/^\[([^]]+)\].*$/\1/'
foo
sed accepts its script from the command line by default, so -e isn't needed (though it can be used).
Printing the pattern space is the default, so -p isn't needed; sed -n is similar to perl -n.
-E is used here to stay as close as possible to Perl regex. sed supports BRE and ERE (neither is as feature-rich as Perl), and even those differ from implementation to implementation.
With BRE, the command for this example would be: sed 's/^\[\([^]]*\)\].*$/\1/'
\ isn't special inside a character class unless it starts an escape sequence like \t, \x27, etc.
Backreferences use the \N format (limited to \1 through \9).
Also note that the g flag isn't needed in either case, since the pattern is anchored to the whole line.
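If only the lines where the substitution succeeded should be printed (roughly what perl -ne 'print if s/.../.../' would do), a sketch with sed -n and the p flag could look like this. The second input line is made up to show a non-matching case.

```shell
# -n suppresses automatic printing; the p flag prints only substituted lines.
extracted=$(printf '%s\n' '[foo] 123' 'no brackets here' |
  sed -n -E 's/^\[([^]]+)\].*$/\1/p')
printf '%s\n' "$extracted"
```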

sed to copy part of line to end

I'm trying to copy part of a line to append to the end:
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/169/985/GCA_900169985.1_IonXpress_024_genomic.fna.gz
becomes:
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/169/985/GCA_900169985.1/GCA_900169985_IonXpress_024_genomic.fna.gz
I have tried:
sed 's/\(.*(GCA_\)\(.*\))/\1\2\2)'

$ f1=$'ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/169/985/GCA_900169985.1_IonXpress_024_genomic.fna.gz'
$ echo "$f1"
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/169/985/GCA_900169985.1_IonXpress_024_genomic.fna.gz
$ sed -E 's/(.*)(GCA_.[^.]*)(.[^_]*)(.*)/\1\2\3\/\2\4/' <<<"$f1"
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/169/985/GCA_900169985.1/GCA_900169985_IonXpress_024_genomic.fna.gz
sed -E (or -r on some systems) enables extended regex support, so you don't need to escape the grouping parentheses ( ).
The group (GCA_.[^.]*) means "match from GCA_ up to, but excluding, the first dot":
$ sed -E 's/(.*)(GCA_.[^.]*)(.[^_]*)(.*)/\2/' <<<"$f1"
GCA_900169985
Similarly, (.[^_]*) captures everything up to, but excluding, the first _. This is the usual way to emulate a non-greedy (lazy) match in sed; in Perl regex you could write something like .*?_ instead.
$ sed -E 's/(.*)(GCA_.[^.]*)(.[^_]*)(.*)/\3/' <<<"$f1"
.1
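The difference between a greedy .* and a negated bracket expression can be seen in a small sketch like the one below. It uses only the file name part of the URL, and the variable names are illustrative.

```shell
name='GCA_900169985.1_IonXpress_024_genomic.fna.gz'

# Greedy: (.*) runs up to the LAST dot in the line.
greedy=$(printf '%s\n' "$name" | sed -E 's/^(.*)\..*$/\1/')

# Negated bracket: ([^.]*) cannot cross a dot, so it stops at the FIRST one.
first=$(printf '%s\n' "$name" | sed -E 's/^([^.]*)\..*$/\1/')

printf '%s\n' "$greedy" "$first"
```

The first result keeps everything before ".gz", the second is just GCA_900169985.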

Short sed approach:
s="ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/169/985/GCA_900169985.1_IonXpress_024_genomic.fna.gz"
sed -E 's/(GCA_[^._]+)\.([^_]+)/\1.\2\/\1/' <<< "$s"
The output:
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/169/985/GCA_900169985.1/GCA_900169985_IonXpress_024_genomic.fna.gz

Sed: uppercase lines if they start with an uppercase character

I want the lines starting with one uppercase character to be uppercased, other lines should be not touched.
So this input:
cat myfile
a
b
Cc
should result in this output:
a
b
CC
I tried this command, but it does not match when I use grouping:
cat myfile | sed -r 's/\([A-Z]+.*\)/\U\1/g'
What am I doing wrong?

When you use the -r option, you must not put \ before parentheses used for grouping. So it should be:
sed -r 's/^([A-Z].*)/\U\1/' myfile
Also, notice that you need ^ to match the beginning of the line. The g modifier isn't needed, since you're matching the entire line.

cat myfile | sed 's/^\([A-Z].*\)$/\U\1/'

\U for uppercase conversion is a GNU sed extension.
Alternative for platforms where that is not available (e.g., macOS, with its BSD sed implementation):
awk '/^[A-Z]/ { print toupper($0); next } 1'
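If you would rather stay within sed on such platforms, one option is the standard y (transliterate) command restricted by an address. This is a sketch that only handles ASCII letters; [[:upper:]] is used in the address instead of [A-Z] to avoid locale-dependent range behavior.

```shell
# y/abc/ABC/ maps each character of the first list to its counterpart in the
# second; the /^[[:upper:]]/ address limits it to lines starting in uppercase.
upper=$(printf '%s\n' a b Cc |
  sed '/^[[:upper:]]/ y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/')
printf '%s\n' "$upper"
```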

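sed '/^[A-Z].*[a-z]/ s/.*/\U&/' YourFile
This only touches lines that start with an uppercase letter and still contain a lowercase one, i.e. lines that are not already compliant (\U requires GNU sed).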

This might work for you (GNU sed):
sed 's/^[[:upper:]].*/\U&/' file
