bsd sed and double quotes - sed

Consider the file test.txt:
#include "foo.h"
#include "bar.h"
#include "baz.h"
Using GNU sed version 4.2.1 (on Ubuntu 10.04.4 LTS), I can extract foo.h, bar.h and baz.h with:
SHELL$) sed -n -e 's:^\s*\#include\s*"\(.*\)".*:\1:p' test.txt
foo.h
bar.h
baz.h
Using BSD sed (on Mac OS X Lion) with a modified version of the above command, I can extract foo.h, bar.h and baz.h, but with the double quotes still attached:
SHELL) sed -n -e 's:^\s*\#include\s*\(.*\).*:\1:p' test.txt
"foo.h"
"bar.h"
"baz.h"
How can I extract the names without the quotes using BSD sed? The output of these commands is empty:
SHELL) sed -n -e 's:^\s*\#include\s*"\(.*\)".*:\1:p' test.txt
SHELL) sed -n -e 's:^\s*\#include\s*\"\(.*\)\".*:\1:p' test.txt

BSD sed (unsurprisingly, really) doesn't support the \s Perlism -- it is interpreted as just a literal s. Try this instead:
sed -n -e 's!^[[:space:]]*\#include[[:space:]]*"\(.*\)".*!\1!p' test.txt
The character class [[:space:]] should work in all POSIX regex implementations. (Other seds may or may not want backslashes before grouping parentheses.)
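As a minimal sketch of the point above, the following feeds the three sample lines through a command that uses only [[:space:]] and BRE grouping. The variable name headers is purely illustrative, and [^"]* is swapped in for .* so that the capture stops at the first closing quote; that is an assumption about the desired behavior, not part of the answer above.

```shell
# Sample input from the question, supplied via printf instead of test.txt.
headers=$(printf '%s\n' '#include "foo.h"' '#include "bar.h"' '#include "baz.h"' |
  sed -n 's/^[[:space:]]*#include[[:space:]]*"\([^"]*\)".*/\1/p')

# Print one header name per line, without the surrounding quotes.
printf '%s\n' "$headers"
```

Because the script avoids \s entirely, it should behave the same on GNU and BSD sed.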

Related

How do I correctly use variables with sed?

I need to use a string variable in a sed command. My attempt is given in script.sh. It does not do what I want, I assume because my variable contains characters that I need sed to evaluate. I am working in bash on Linux.
input.txt
delicious.banana
gross.apple
script.sh
adjectives="delicious\|gross\|bearable\|yummy"
sed "s/\($adjectives\)\.//g" input.txt > output.txt
output.txt desired
banana
apple
output.txt current
deliciousbanana
grossapple
Non-GNU sed implementations don't support \| in BRE (basic regex mode). I suggest using ERE (extended regex mode) via -E; as a bonus, you can drop all the escaping:
adjectives="delicious|gross|bearable|yummy"
sed -E "s/($adjectives)\.//g" input.txt
banana
apple
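A self-contained sketch of the same idea, with the input supplied inline rather than read from input.txt. The variable name fruits and the ^ anchor are additions for illustration only. Note that the variable is pasted into the sed script as-is, so any regex metacharacters or / characters in it would be interpreted by sed.

```shell
# Alternation list in ERE syntax: plain | with no backslashes.
adjectives='delicious|gross|bearable|yummy'

# Double quotes let the shell expand $adjectives before sed sees the script.
# The ^ anchor (an assumption) restricts the match to the start of the line.
fruits=$(printf '%s\n' 'delicious.banana' 'gross.apple' |
  sed -E "s/^($adjectives)\.//")

printf '%s\n' "$fruits"
```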

Matching a C++ style regex with sed?

Does anyone know how to match this C++ style regex using sed? Specifically to break it into multiple parts using (pattern) and \n?
// ZIP code pattern: XXddddd-dddd and variants
regex pat (R"(\w{2}\s*\d{5}(-\d{4})?)");
For instance, the string AB00000-0000 would match, and \1, \2, \3 would print the appropriate substrings from the pattern space.
SOLUTION:
Here is the sed answer, which accounts for the initial two characters.
$ echo AB00000-0000 | sed 's/\([[:alpha:]]\{2\}\)[[:space:]]*\([[:digit:]]\{5\}\)\(-\([[:digit:]]\{4\}\)\)\{0,1\}/\1 \2 \4/'
AB 00000 0000
Thank you
This:
\w{2}\s*\d{5}(-\d{4})?
is a PCRE (as would work in perl or GNU grep -P) and can be written as a POSIX ERE (as would work in awk or grep -E or GNU/OSX sed -E) as:
[[:alnum:]_]{2}[[:space:]]*[[:digit:]]{5}(-[[:digit:]]{4})?
or as a POSIX BRE (as would work in grep or sed without -E) as:
[[:alnum:]_]\{2\}[[:space:]]*[[:digit:]]\{5\}\(-[[:digit:]]\{4\}\)\{0,1\}
Assuming you have a file which contains:
NJ 08542-0033
PA19103-0200
NY10002
Then the sed command:
sed 's/[[:alpha:]]\{2\}[[:space:]]*\([[:digit:]]\{5\}\)\(-\([[:digit:]]\{4\}\)\)\{0,1\}/\1 \3/' file
will output:
08542 0033
19103 0200
10002
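For comparison, here is a sketch of the ERE form applied to the same three sample lines with sed -E. The anchors, the variable name parts, and the | separator in the replacement are assumptions added for illustration; the separator just makes the field boundaries visible, including the empty last field when the optional -dddd part is absent.

```shell
# Capture groups: \1 = two letters, \2 = five digits, \4 = optional four digits.
# Group 3 wraps the hyphen plus group 4 so the whole suffix can be optional.
parts=$(printf '%s\n' 'NJ 08542-0033' 'PA19103-0200' 'NY10002' |
  sed -E 's/^([[:alpha:]]{2})[[:space:]]*([[:digit:]]{5})(-([[:digit:]]{4}))?$/\1|\2|\4/')

printf '%s\n' "$parts"
```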

sed equivalent of perl -pe

I'm looking for an equivalent of perl -pe. Ideally, it would be replaced with sed, if that's possible. Any help is highly appreciated.
The code is:
perl -pe 's/^\[([^\]]+)\].*$/$1/g'
$ echo '[foo] 123' | perl -pe 's/^\[([^\]]+)\].*$/$1/g'
foo
$ echo '[foo] 123' | sed -E 's/^\[([^]]+)\].*$/\1/'
foo
sed accepts the script from the command line by default, so -e isn't needed (though it can be used).
Printing the pattern space is the default, so no counterpart to perl's -p is needed; sed -n is similar to perl -n.
-E is used here to stay as close as possible to Perl regex syntax. sed supports BRE and ERE (neither as feature-rich as Perl), and even that differs from implementation to implementation.
With BRE, the command for this example would be: sed 's/^\[\([^]]*\)\].*$/\1/'
\ isn't special inside a character class, unless it is part of an escape sequence like \t or \x27.
Backreferences use the \N format (limited to a maximum of 9).
Also note that the g flag isn't needed in either case, since you are using line anchors.
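To illustrate the remark about sed -n, here is a small sketch that combines -n with the p flag of the s command, roughly what perl -ne with an explicit print would do. It uses the BRE form quoted above; the extra input lines and the variable name names are made up for the example.

```shell
# -n suppresses automatic printing; the p flag prints only the lines
# on which the substitution actually succeeded.
names=$(printf '%s\n' '[foo] 123' 'no brackets here' '[bar] 456' |
  sed -n 's/^\[\([^]]*\)\].*$/\1/p')

printf '%s\n' "$names"
```

The line without brackets is dropped, because the substitution does not match it and nothing else prints it.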

Sed not matching one or more patterns

I have this list of files:
$ more files
one_this_2017_1_abc.txt
two_that_2018_1_abc.txt
three_another_2017_10.abc.txt
four_again_2018_10.abc.txt
five_back_2018_1a.abc.txt
I would like to get this output:
one_this_XXXX_YY_abc.txt
two_that_XXXX_YY_abc.txt
three_another_XXXX_YY.abc.txt
four_again_XXXX_YY.abc.txt
five_back_XXXX_YY.abc.txt
I am trying to remove the year and the bit after the year and replace them with another string--this is to generate test cases.
I can get the year just fine, but it's that one or two character piece after it I can't seem to match.
This should work, right?
~/test_cases
$ cat files | sed -e 's/_[[:digit:]]\{4\}_/_XXXX_/' -e 's/_[[:alnum:]]\{1,2\}_/_YY_/'
one_this_XXXX_YY_abc.txt
two_that_XXXX_YY_abc.txt
three_another_XXXX_10.abc.txt
four_again_XXXX_10.abc.txt
five_back_XXXX_1a.abc.txt
Except it doesn't for the 2 character cases.
$ cat files | sed -e 's/_[[:digit:]]\{4\}_/_XXXX_/' -e 's/_[[:alnum:]]\{2\}_/_YY_/'
one_this_XXXX_1_abc.txt
two_that_XXXX_1_abc.txt
three_another_XXXX_10.abc.txt
four_again_XXXX_10.abc.txt
five_back_XXXX_1a.abc.txt
That doesn't work for the two-character cases either, and this doesn't work at all (though according to the docs it should):
$ cat files | sed -e 's/_[[:digit:]]\{4\}_/_XXXX_/' -e 's/_[[:alnum:]]\+_/_YY_/'
one_YY_XXXX_1_abc.txt
two_YY_XXXX_1_abc.txt
three_YY_XXXX_10.abc.txt
four_YY_XXXX_10.abc.txt
five_YY_XXXX_1a.abc.txt
Other random experiments that don't work:
$ cat files | sed -e 's/_[[:digit:]]\{4\}_/_XXXX_/' -e 's/_[a-zA-Z0-9]\+_/_YY_/'
one_YY_XXXX_1_abc.txt
two_YY_XXXX_1_abc.txt
three_YY_XXXX_10.abc.txt
four_YY_XXXX_10.abc.txt
five_YY_XXXX_1a.abc.txt
$ cat files | sed -e 's/_[[:digit:]]\{4\}_/_XXXX_/' -e 's/_[a-zA-Z0-9]\{1\}_/_YY_/'
one_this_XXXX_YY_abc.txt
two_that_XXXX_YY_abc.txt
three_another_XXXX_10.abc.txt
four_again_XXXX_10.abc.txt
five_back_XXXX_1a.abc.txt
$ cat files | sed -e 's/_[[:digit:]]\{4\}_/_XXXX_/' -e 's/_[a-zA-Z0-9]\{2\}_/_YY_/'
one_this_XXXX_1_abc.txt
two_that_XXXX_1_abc.txt
three_another_XXXX_10.abc.txt
four_again_XXXX_10.abc.txt
five_back_XXXX_1a.abc.txt
Tried with both GNU sed version 4.2.1 under Linux and sed (GNU sed) 4.4 under Cygwin.
And yes, I realize I can pipe this through multiple sed calls to get it to work, but that regex SHOULD work, right?
If your Input_file is the same as the sample shown, the following may help.
sed 's/\([^_]*\)_\([^_]*\)_\(.*_\)\(.*\)/\1_\2_XXXX_YY_\4/g' Input_file
Output will be as follows.
one_this_XXXX_YY_abc.txt
two_that_XXXX_YY_abc.txt
three_another_XXXX_YY_10.abc.txt
four_again_XXXX_YY_10.abc.txt
five_back_XXXX_YY_1a.abc.txt
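In the sample file names, the two-character piece after the year is followed by a dot rather than an underscore, so a pattern that ends in _ cannot match it, and \+ matches an earlier field such as _this_ first. A minimal sketch, assuming the piece is always one or two alphanumerics followed by either _ or ., matches the year and that piece in a single expression and puts the trailing separator back. The variable name renamed is only illustrative.

```shell
# Match _YYYY_ plus a 1-2 character piece, capture the separator that
# follows it (_ or .) and re-emit that separator via \1.
renamed=$(printf '%s\n' \
    one_this_2017_1_abc.txt \
    two_that_2018_1_abc.txt \
    three_another_2017_10.abc.txt \
    four_again_2018_10.abc.txt \
    five_back_2018_1a.abc.txt |
  sed 's/_[[:digit:]]\{4\}_[[:alnum:]]\{1,2\}\([_.]\)/_XXXX_YY\1/')

printf '%s\n' "$renamed"
```

Anchoring the second piece to the year keeps the expression from matching an earlier underscore-delimited field.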

sed to copy part of line to end

I'm trying to copy part of a line to append to the end:
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/169/985/GCA_900169985.1_IonXpress_024_genomic.fna.gz
becomes:
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/169/985/GCA_900169985.1/GCA_900169985_IonXpress_024_genomic.fna.gz
I have tried:
sed 's/\(.*(GCA_\)\(.*\))/\1\2\2)'
$ f1=$'ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/169/985/GCA_900169985.1_IonXpress_024_genomic.fna.gz'
$ echo "$f1"
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/169/985/GCA_900169985.1_IonXpress_024_genomic.fna.gz
$ sed -E 's/(.*)(GCA_.[^.]*)(.[^_]*)(.*)/\1\2\3\/\2\4/' <<<"$f1"
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/169/985/GCA_900169985.1/GCA_900169985_IonXpress_024_genomic.fna.gz
sed -E (or -r on some systems) enables extended regex support in sed, so you don't need to escape the grouping parentheses ( ).
The group (GCA_.[^.]*) means "take everything from GCA_ up to, but excluding, the first dot":
$ sed -E 's/(.*)(GCA_.[^.]*)(.[^_]*)(.*)/\2/' <<<"$f1"
GCA_900169985
Similarly, (.[^_]*) means: take all characters up to the first _ (excluding the _ itself). This is the usual way to emulate a non-greedy (lazy) capture in sed; in Perl regex it could be written with a lazy quantifier, something like .*? instead.
$ sed -E 's/(.*)(GCA_.[^.]*)(.[^_]*)(.*)/\3/' <<<"$f1"
.1
Short sed approach:
s="ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/169/985/GCA_900169985.1_IonXpress_024_genomic.fna.gz"
sed -E 's/(GCA_[^._]+)\.([^_]+)/\1.\2\/\1/' <<< "$s"
The output:
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/169/985/GCA_900169985.1/GCA_900169985_IonXpress_024_genomic.fna.gz
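The same substitution can also be sketched in BRE, without -E, for seds where the flag is unavailable. Using | as the s delimiter is a choice made here so the / in the replacement needs no escaping; the variable names url and fixed are illustrative only.

```shell
url='ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/169/985/GCA_900169985.1_IonXpress_024_genomic.fna.gz'

# \1 = GCA_ plus everything up to the first dot or underscore,
# \2 = the dot and what follows it, up to the next underscore.
fixed=$(printf '%s\n' "$url" | sed 's|\(GCA_[^._]*\)\(\.[^_]*\)|\1\2/\1|')

printf '%s\n' "$fixed"
```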