Replace string with substring in lowercase using sed / awk / tr / perl? - perl

I have a plaintext file containing multiple instances of the pattern $$DATABASE_*$$ and the asterisk could be any string of characters. I'd like to replace the entire instance with whatever is in the asterisk portion, but lowercase.
Here is a test file:
$$DATABASE_GIBSON$$
test me $$DATABASE_GIBSON$$ test me
$$DATABASE_GIBSON$$ test $$DATABASE_GIBSON$$ test
$$DATABASE_GIBSON$$ $$DATABASE_GIBSON$$$$DATABASE_GIBSON$$
Here is the desired output:
gibson
test me gibson test me
gibson test gibson test
gibson gibsongibson
How do I do this with sed/awk/tr/perl?

Here's the perl version I ended up using.
perl -p -i.bak -e 's/\$\$DATABASE_(.*?)\$\$/lc($1)/eg' inputFile

Unfortunately there's no easy, foolproof way with awk, but here's one approach:
$ cat tst.awk
{
gsub(/[$][$]/,"\n")
head = ""
tail = $0
while ( match(tail, "\nDATABASE_[^\n]+\n") ) {
head = head substr(tail,1,RSTART-1)
trgt = substr(tail,RSTART,RLENGTH)
tail = substr(tail,RSTART+RLENGTH)
gsub(/\n(DATABASE_)?/,"",trgt)
head = head tolower(trgt)
}
$0 = head tail
gsub("\n","$$")
print
}
$ cat file
The quick brown $$DATABASE_FOX$$ jumped over the lazy $$DATABASE_DOG$$s back.
The grey $$DATABASE_SQUIRREL$$ ate $$DATABASE_NUT$$s under a $$DATABASE_TREE$$.
Put a dollar $$DATABASE_DOL$LAR$$ in the $$ string.
$ awk -f tst.awk file
The quick brown fox jumped over the lazy dogs back.
The grey squirrel ate nuts under a tree.
Put a dollar dol$lar in the $$ string.
Note the trick of converting $$ to a newline character so that we can negate that character in the match() regexp. Without it (i.e. if we used ".+" instead of "[^\n]+"), greedy matching would mean that when the pattern appears twice on one input line, the match would extend from the start of the first occurrence to the end of the second.
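The difference is easy to see on a line with two tags. The following is a minimal sketch, not part of the answer above; the sample line and the variable names greedy and bounded are made up for illustration:

```shell
line='a $$DATABASE_FOX$$ b $$DATABASE_DOG$$ c'

# Greedy ".+" runs from the first "$$" to the last "$$" on the line.
greedy=$(printf '%s\n' "$line" |
  awk '{ match($0, /[$][$]DATABASE_.+[$][$]/); print substr($0, RSTART, RLENGTH) }')

# With "$$" swapped for a newline, "[^\n]+" cannot run past the closing delimiter.
bounded=$(printf '%s\n' "$line" |
  awk '{ gsub(/[$][$]/, "\n")
         match($0, "\nDATABASE_[^\n]+\n")
         s = substr($0, RSTART, RLENGTH)
         gsub(/\n/, "$$", s)
         print s }')

printf '%s\n' "$greedy" "$bounded"
```

The first line printed spans both tags and the text between them; the second is just the first tag.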

This one works with complicated examples.
perl -ple 's/\$\$DATABASE_(.*?)\$\$/lc($1)/eg' filename.txt
And for simpler examples:
echo '$$DATABASE_GIBSON$$' | sed 's#\$\$DATABASE_\(.*\)\$\$#\L\1#'
In GNU sed, \L in the replacement lowercases what follows (\E stops it if needed).
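As a small illustration of where \E matters, here is a sketch that lowercases only the first of two captured words. It assumes GNU sed (both \L and \E are GNU extensions), and the input string is made up:

```shell
# Assumes GNU sed: \L starts lowercasing in the replacement, \E ends it.
mixed=$(printf '%s\n' 'HELLO WORLD' |
  sed 's/\([A-Z]*\) \([A-Z]*\)/\L\1\E \2/')
printf '%s\n' "$mixed"
```

Without the \E, the second word would be lowercased as well.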

Using awk alone:
> echo '$$DATABASE_AWESOME$$' | awk '{sub(/.*_/,"");sub(/\$\$$/,"");print tolower($0);}'
awesome
Note that I'm in FreeBSD, so this is not GNU awk.
But this can be done using bash alone:
[ghoti#pc ~]$ foo='$$DATABASE_AWESOME$$'
[ghoti#pc ~]$ foo=${foo##*_}
[ghoti#pc ~]$ foo=${foo%\$\$}
[ghoti#pc ~]$ foo=${foo,,}
[ghoti#pc ~]$ echo $foo
awesome
Of the above substitutions, all except the last one (${foo,,}) will work in a standard POSIX shell. If you don't have bash, you can use tr for this step instead:
$ echo $foo
AWESOME
$ foo=$(echo "$foo" | tr '[:upper:]' '[:lower:]')
$ echo $foo
awesome
$
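Put together, the portable steps might look like the sketch below. The function name db_name_lower is hypothetical, and like the steps above it assumes the argument is exactly one tag with nothing around it:

```shell
# Hypothetical helper: strip the $$DATABASE_ wrapper from one tag and lowercase it.
db_name_lower() {
  name=${1##*_}        # drop everything up to the last underscore
  name=${name%\$\$}    # drop the trailing $$
  printf '%s\n' "$name" | tr '[:upper:]' '[:lower:]'
}

db_name_lower '$$DATABASE_AWESOME$$'
```

Because ${1##*_} cuts at the last underscore, a name that itself contains an underscore would be truncated.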
UPDATE:
Per comments, it seems that what the OP really wants is to strip the substring out of any text in which it is included -- that is, our solutions need to account for the possibility of leading or trailing spaces, before or after the string he provided in his question.
> echo 'foo $$DATABASE_KITTENS$$ bar' | sed -nE '/\$\$[^$]+\$\$/{;s/.*\$\$DATABASE_//;s/\$\$.*//;p;}' | tr '[:upper:]' '[:lower:]'
kittens
And if you happen to have pcregrep on your path (from the devel/pcre FreeBSD port), you can use that instead, with lookarounds:
> echo 'foo $$DATABASE_KITTENS$$ bar' | pcregrep -o '(?<=\$\$DATABASE_)[A-Z]+(?=\$\$)' | tr '[:upper:]' '[:lower:]'
kittens
(For Linux users reading this: this is equivalent to using grep -P.)
And in pure bash:
$ shopt -s extglob
$ foo='foo $$DATABASE_KITTENS$$ bar'
$ foo=${foo##*(?)\$\$DATABASE_}
$ foo=${foo%%\$\$*(?)}
$ foo=${foo,,}
$ echo $foo
kittens
Note that NONE of these three updated solutions will handle situations where multiple tagged database names exist in the same line of input. That's not stated as a requirement in the question either, but I'm just sayin'....

You can do this in a pretty foolproof way with the supercool command cut :)
echo '$$DATABASE_AWESOME$$' | cut -d'$' -f3 | cut -d_ -f2 | tr 'A-Z' 'a-z'

This might work for you (GNU sed):
sed 's/$\$/\n/g;s/\nDATABASE_\([^\n]*\)\n/\L\1/g;s/\n/$$/g' file

Here is the shortest (GNU) awk solution I could come up with that does everything requested by the OP:
awk -vRS='[$][$]DATABASE_([^$]+[$])+[$]' '{ORS=tolower(substr(RT,12,length(RT)-13))}1'
Even if the string indicated by the asterisk (*) contains one or more single dollar signs ($) and/or line breaks, this solution should still work.

awk '{gsub(/\$\$DATABASE_GIBSON\$\$/,"gibson")}1' file
gibson
test me gibson test me
gibson test gibson test
gibson gibsongibson

echo '$$DATABASE_WOOLY$$' | awk '{print tolower($0)}'
awk takes whatever input it is given, in this case the output of echo, applies the tolower function, and prints the result. (The single quotes stop the shell from expanding $$ to its process ID.)
For your bash script you can do something like this and then use the variable DBLOWER:
DBLOWER=$(echo '$$DATABASE_WOOLY$$' | awk '{print tolower($0)}')

Related

How to replace consecutive symbols using only one sed command?

I have a simple .csv file with lines that holds 't' values. Here is the example:
2ABC;t;t;t;tortuga;fault;t;t;bored
I want to replace them to '1' using sed.
If I make sed "s/;t;/;1;/g" I get the next result:
2ABC;1;t;1;tortuga;fault;1;t;bored
As you can see, only every other one of the consecutive ';t;' occurrences has been replaced. Yes, I can replace all ';t;' with sed -e "s/;t;/;1;/g" -e "s/;t;/;1;/g" but this is boring.
How can I make the replacement by one sed command?
If there is something to replace, branch to replace again.
sed ': again; /;t;/{ s//;1;/; b again }'
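The same loop can be spelled with the t command, which branches only if a substitution was made. This is a sketch of that variant rather than the command above; the function name replace_t is made up, and the separate -e expressions are there so the label parses the same way in non-GNU seds:

```shell
# Hypothetical wrapper: keep replacing the first ";t;" until none is left.
replace_t() {
  sed -e ':again' -e 's/;t;/;1;/' -e 't again'
}

printf '%s\n' '2ABC;t;t;t;tortuga;fault;t;t;bored' | replace_t
```

Like the original, this only touches a t that has a semicolon on both sides, so a t in the first or last field is left alone.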
Overall, parsing CSV with sed is crude. Consider awk.
awk -F';' -v OFS=';' '{ for(i=1;i<=NF;++i) if ($i=="t") $i=1 } 1'
Lookarounds are helpful in such cases:
$ s='t;2ABC;t;t;t;tortuga;fault;t;t;bored;t'
$ echo "$s" | perl -lpe 's/(?<![^;])t(?![^;])/1/g'
1;2ABC;1;1;1;tortuga;fault;1;1;bored;1
echo '2ABC;t;t;t;tortuga;fault;t;t;bored' |
# gawk-specific solution
gawk -be '(ORS = RT)^!(NF = NF)' FS='^t$' OFS=1 RS=';'
# cross-awk solution
{m,g,n}awk 'gsub(FS, OFS, $!(NF = NF))^_' FS=';t;' OFS=';1;' RS=
2ABC;1;1;1;tortuga;fault;1;1;bored

Using a single sed call to split and grep

This is mostly by curiosity, I am trying to have the same behavior as:
echo -e "test1:test2:test3"| sed 's/:/\n/g' | grep 1
in a single sed command.
I already tried
echo -e "test1:test2:test3"| sed -e "s/:/\n/g" -n "/1/p"
But I get the following error:
sed: can't read /1/p: No such file or directory
Any idea on how to fix this and combine different types of commands into a single sed call?
Of course this is overly simplified compared to the real usecase, and I know I can get around by using multiple calls, again this is just out of curiosity.
EDIT: I am mostly interested in the sed tool, I already know how to do it using other tools, or even combinations of those.
EDIT2: Here is a more realistic script, closer to what I am trying to achieve:
arch=linux64
base=https://chromedriver.storage.googleapis.com
split="<Contents>"
curl $base \
| sed -e 's/<Contents>/<Contents>\n/g' \
| grep $arch \
| sed -e 's/^<Key>\(.*\)\/chromedriver.*/\1/' \
| sort -V > out
What I would like to simplify is the curl line, turning it into something like:
curl $base \
| sed 's/<Contents>/<Contents>\n/g' -n '/1/p' -e 's/^<Key>\(.*\)\/chromedriver.*/\1/' \
| sort -V > out
Here are some alternatives, awk and sed based:
sed -E "s/(.*:)?([^:]*1[^:]*).*/\2/" <<< "test1:test2:test3"
awk -v RS=":" '/1/' <<< "test1:test2:test3"
# or also
awk 'BEGIN{RS=":"} /1/' <<< "test1:test2:test3"
Or, using your logic, you would need to pipe a second sed command:
sed "s/:/\n/g" <<< "test1:test2:test3" | sed -n "/1/p"
See this online demo. The awk solution looks cleanest.
Details
In sed solution, (.*:)?([^:]*1[^:]*).* pattern matches an optional sequence of any 0+ chars and a :, then captures into Group 2 any 0 or more chars other than :, 1, again 0 or more chars other than :, and then just matches the rest of the line. The replacement just keeps Group 2 contents.
In awk solution, the record separator is set to : and then /1/ regex is used to only return the record having 1 in it.
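One consequence of that pattern is worth seeing: the sed version prints exactly one field per line, and because (.*:)? is greedy, it is the last field containing 1 when there are several. A sketch with made-up input (pick_field is a hypothetical name):

```shell
# Hypothetical wrapper around the sed expression discussed above.
pick_field() {
  sed -E 's/(.*:)?([^:]*1[^:]*).*/\2/'
}

single=$(printf '%s\n' 'test1:test2:test3' | pick_field)
multi=$(printf '%s\n' 'a1:b:c1' | pick_field)
printf '%s\n' "$single" "$multi"
```

For the second input only c1 comes out, whereas the awk version would print both a1 and c1.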
This might work for you (GNU sed):
sed 's/:/\n/;/^[^\n]*1/P;D' file
Replace the first : with a newline and, if the first line in the pattern space contains 1, print it.
Delete that first line and repeat on what remains.
An alternative:
sed -Ez 's/:/\n/g;s/^[^1]*$//mg;s/\n+/\n/;s/^\n//' file
This slurps the whole file into memory and replaces all colons by newlines. All lines that do not contain 1 are removed and surplus newlines deleted.
An alternative to the really ugly sed is: grep -o '\w*2\w*'
$ printf "test1:test2:test3\nbob3:bob2:fred2\n" | grep -o '\w*2\w*'
test2
bob2
fred2
grep -o: only matching
Or: grep -o '[^:]*2[^:]*'
echo -e "test1:test2:test3" | sed -En 's/:/\n/g;/^[^\n]*2[^\n]*(\n|$)/P;//!D'
sed -n doesn't print unless told to
sed -E allows using parens to match (\n|$) which is newline or the end of the pattern space
P prints the pattern buffer up to the first newline.
D trims the pattern buffer up to the first newline
[^\n] is a character class that matches anything except a newline
// is sed shorthand for repeating a match
//! is then matching everything that didn't match previously
So, after you split into newlines, you want to make sure the 2 character is between the start of the pattern buffer ^ and the first newline.
And, if there is not the character you are looking for, you want to D delete up to the first newline.
At that point, it works for one line of input, with one string containing the character you're looking for.
To expand to several matches within a line, you have to ta, conditionally branch back to label :a:
$ printf "test1:test2:test3\nbob3:bob2:fred2\n" | \
sed -En ':a s/:/\n/g;/^[^\n]*2[^\n]*(\n|$)/P;D;ta'
test2
bob2
fred2
This is simply NOT a job for sed. With GNU awk for multi-char RS:
$ echo "test1:test2:test3:test4:test5:test6"| awk -v RS='[:\n]' '/1/'
test1
$ echo "test1:test2:test3:test4:test5:test6"| awk -v RS='[:\n]' 'NR%2'
test1
test3
test5
$ echo "test1:test2:test3:test4:test5:test6"| awk -v RS='[:\n]' '!(NR%2)'
test2
test4
test6
$ echo "foo1:bar1:foo2:bar2:foo3:bar3" | awk -v RS='[:\n]' '/foo/ || /2/'
foo1
foo2
bar2
foo3
With any awk you'd just have to strip the \n from the final record before operating on it:
$ echo "test1:test2:test3:test4:test5:test6"| awk -v RS=':' '{sub(/\n$/,"")} /1/'
test1

Parsing HTML on the command line; How to capture text in <strong></strong>?

I'm trying to grab data from HTML output that looks like this:
<strong>Target1NoSpaces</strong><span class="creator"> ....
<strong>Target2 With Spaces</strong><span class="creator"> ....
I'm using a pipe train to whittle down the data to the targets I'm trying to hit. Here's my approach so far:
grep "/strong" output.html | awk '{print $1}'
Grep on "/strong" to get the lines with the targets; that works fine.
Pipe to awk '{print $1}'. That works in case #1, when the target has no spaces, but fails in case #2, when the target has spaces: only the first word is preserved, as below:
<strong>Target1NoSpaces</strong><span
<strong>Target2
Do you have any tips on hitting the target properly, either in my awk or in different command? Anything quick and dirty (grep, awk, sed, perl) would be appreciated.
Try pup, a command line tool for processing HTML. For example:
$ pup 'strong text{}' < file.html
Target1NoSpaces
Target2 With Spaces
To search via XPath, try xpup.
Alternatively, for a well-formed HTML/XML document, try html-xml-utils.
One way using mojolicious and its DOM parser:
perl -Mojo -E '
g("http://your.web")
->dom
->find("strong")
->each( sub { if ( $t = shift->text ) { say $t } } )'
Using Perl regex's look-behind and look-ahead feature in grep. It should be simpler than using awk.
grep -oP "(?<=<strong>).*?(?=</strong>)" file
Output:
Target1NoSpaces
Target2 With Spaces
Add:
This Ruby version uses the same look-behind/look-ahead regex with the multiline flag, so it can also match values that span multiple lines:
ruby -e 'File.read(ARGV.shift).scan(/(?<=<strong>).*?(?=<\/strong>)/m).each{|e| puts "----------"; puts e;}' file
Input:
<strong>Target
A
B
C
</strong><strong>Target D</strong><strong>Target E</strong>
Output:
----------
Target
A
B
C
----------
Target D
----------
Target E
Here's a solution using xmlstarlet
xml sel -t -v //strong input.html
Trying to parse HTML without a real HTML parser is a bad idea. Having said that, here is a very quick and dirty solution to the specific example you provided. It will not work when there is more than one <strong> tag on a line, when the tag runs over more than one line, etc.
awk -F '<strong>|</strong>' '/<strong>/ {print $2}' filename
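If several tags on one line is the only limitation that matters, a match() loop gets around it. This is still a quick-and-dirty sketch and not a parser: the function name extract_strong is made up, and it assumes no attributes on the tag, no nested markup, and no tag split across lines:

```shell
# Hypothetical helper: print the text of every <strong>...</strong> on each line.
extract_strong() {
  awk '{
    line = $0
    while (match(line, /<strong>[^<]*<\/strong>/)) {
      # "<strong>" is 8 characters and "</strong>" is 9, so 17 in total
      print substr(line, RSTART + 8, RLENGTH - 17)
      line = substr(line, RSTART + RLENGTH)
    }
  }'
}

printf '%s\n' '<strong>Target D</strong><strong>Target E</strong>' | extract_strong
```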
You never need grep with awk and the field separator doesn't have to be whitespace:
$ awk -F'<|>' '/strong/{print $3}' file
Target1NoSpaces
Target2 With Spaces
You should really use a proper parser for this however.
Since you tagged perl
perl -ne 'if(/(?:<strong>)(.*)(?:<\/strong>)/){print $1."\n";}' input.html
I am surprised no one mentions W3C HTML-XML-utils
curl -Ss https://stackoverflow.com/questions/18746957/parsing-html-on-the-command-line-how-to-capture-text-in-strong-strong |
hxnormalize -x |
hxselect -s '\n' strong
output:
<strong class="fc-black-750 mb6">Stack Overflow
for Teams</strong>
<strong>Teams</strong>
To capture only content:
curl -Ss https://stackoverflow.com/questions/18746957/parsing-html-on-the-command-line-how-to-capture-text-in-strong-strong |
hxnormalize -x |
hxselect -s '\n' -c strong
Stack Overflow
for Teams
Teams

sed/awk : match a pattern and return everything between the end of the pattern and a semicolon

I have a line:
<random junk>TYPE=snp;<more random junk>
and I need to return everything between the end of TYPE= and the ; (in this case snp, but it could be any of a number of text strings).
I tried various sed / awk solutions but I can't seem to get it working. I have the feeling this is a simple problem so, sorry about that.
This seems to work:
sed 's/.*TYPE=\(.*\);.*/\1/'
EDIT:
Ah, so there can be semicolons in the random junk. Try this:
sed 's/.*TYPE=\([^;]*\);.*/\1/'
requires GNU grep:
grep -Po '(?<=TYPE=)[^;]+'
meaning: preceded by "TYPE=", find some non-semicolon characters
One way using GNU sed:
sed -r 's/.*TYPE=([^;]+).*/\1/' file.txt
Since you also tagged this awk:
$ text='<random junk>TYPE=snp;<more random junk>'
$ echo "$text" | awk -FTYPE= '{sub(/;.*/,"",$2); print $2}'
snp
$ text='foo=bar;baz=fnu;TYPE=snp;XAI=0;XAM=0'
$ echo "$text" | awk -FTYPE= '{sub(/;.*/,"",$2); print $2}'
snp
(Only using the variable to keep the lines from wrapping.)
Or, to parse this as set of variable=value pairs rather than just a string of text:
$ echo "$text" | awk -vRS=";" -F= '$1=="TYPE" {print $2}'
snp
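The pair-based approach generalizes to any key. Below is a sketch that avoids the non-POSIX regex-free RS question by splitting with tr first; get_field is a hypothetical name, and it assumes the line consists purely of key=value pairs separated by semicolons, with no = inside a value:

```shell
# Hypothetical helper: print the value for key $1 from a ';'-separated line on stdin.
get_field() {
  tr ';' '\n' | awk -F= -v key="$1" '$1 == key { print $2 }'
}

printf '%s\n' 'foo=bar;baz=fnu;TYPE=snp;XAI=0;XAM=0' | get_field TYPE
```

Splitting with tr also means the last pair has no stray newline attached, so looking up XAM works the same way as looking up TYPE.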
You can also do this in pure bash, if you want:
$ t="red=blue;TYPE=snp;XAI=0.0037843;XAM=0.0170293;XAS=0.013245;XRI=0;XRM=0"
$ t=${t#*TYPE=}
$ t=${t%%;*}
$ echo $t
snp

How do I push `sed` matches to the shell call in the replacement pattern?

I need to replace several URLs in a text file with some content dependent on the URL itself. Let's say for simplicity it's the first line of the document at the URL.
What I'm trying is this:
sed "s/^URL=\(.*\)/TITLE=$(curl -s \1 | head -n 1)/" file.txt
This doesn't work, since \1 is not set. However, the shell is getting called. Can I somehow push the sed match variables to that subprocess?
The accepted answer is just plain wrong. Proof:
Make an executable script foo.sh:
#! /bin/bash
echo $* 1>&2
Now run it:
$ echo foo | sed -e "s/\\(foo\\)/$(./foo.sh \\1)/"
\1
$
The $(...) is expanded before sed is run.
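Given that ordering, one portable option is to let the shell do the looping, so the command runs once per matching line with the real URL. A minimal sketch: fetch_title is a hypothetical stand-in for curl -s "$url" | head -n 1 so that the example runs offline, and rewrite_urls is likewise a made-up name:

```shell
# Hypothetical stand-in for: curl -s "$1" | head -n 1
fetch_title() {
  printf 'title of %s\n' "$1"
}

# Copy stdin to stdout, replacing each URL=... line with TITLE=...
rewrite_urls() {
  while IFS= read -r line; do
    case $line in
      URL=*) printf 'TITLE=%s\n' "$(fetch_title "${line#URL=}")" ;;
      *)     printf '%s\n' "$line" ;;
    esac
  done
}

printf '%s\n' 'first line' 'URL=http://example.com/' 'last line' | rewrite_urls
```

Here the command substitution is evaluated inside the loop body, after the line has been read, which is exactly what the sed one-liner cannot do.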
So you are trying to call an external command from inside the replacement pattern of a sed substitution. I don't think it can be done; the $... inside a pattern just allows you to use an already existing (constant) shell variable.
I'd go with Perl, see the /e option in the search-replace operator (s/.../.../e).
UPDATE: I was wrong, sed plays nicely with the shell, and it allows you to do that. But then the backslash in \1 should be escaped. Try instead:
sed "s/^URL=\(.*\)/TITLE=$(curl -s \\1 | head -n 1)/" file.txt
Try this:
sed "s/^URL=\(.*\)/\1/" file.txt | while read url; do sed "s#URL=\($url\)#TITLE=$(curl -s $url | head -n 1)#" file.txt; done
If there are duplicate URLs in the original file, then there will be n^2 of them in the output. The # as a delimiter depends on the URLs not including that character.
Late reply, but to make sure people don't get thrown off by the answers here: this can be done in GNU sed using the e flag of the s command, which runs the result of the substitution as a shell command. The following, for example, decrements a number at the beginning of a line:
echo "444 foo" | sed "s/\([0-9]*\)\(.*\)/expr \1 - 1 | tr -d '\n'; echo \"\2\";/e"
will produce:
443 foo