Parsing outlook email strings, need just angle bracket text smtp address - perl

I have a string
Casperfoo DataOps <casperf#domain .com>;Booya Support <iptsupport#domain .com>; Tripped Support <Tsupport#domain capital.com>; Smith, Joe: IB Reference Data (NYK) <joe.smith#domain .com>; Johnson, Walter: IB Reference Data (PUN) <walter.johnson#domain .com>;
and I'd expect this to work
cat /tmp/foo | perl -nle 'print /\<(\w+)\>/'
but it does not.
All I want is the string between the < and the >. I don't need the crappy outlook aliases - just the smtp addresses.
This is the expected output
casperf#domain.com
iptsupport#domain.com
Tsupport#domaincapital.com
joe.smith#domain.com
walter.johnson#domain.com

and I'd expect this to work
Then your expectations are wrong :-) Perhaps you'd get closer if you read the regex documentation (or the tutorial).
Some problems:
You match with \w+. The escape sequence \w matches letters, numbers and the underscore. Email addresses contain other characters (for example, #) so this matches nothing. Let's replace \w+ with \S+ (to match all non-whitespace characters).
That still doesn't work, as your email addresses all seem to contain a space (e.g. <casperf#domain .com>). Ok, so let's use <.+> instead to match anything, but add ? to force a minimal match - <.+?>.
That just gives us the first match, because that's what we're asking for. We need to add /g to the match operator to get all matches.
We now have:
$ cat /tmp/foo | perl -nle 'print /\<(.+?)\>/g
Which gives us:
casperf#domain .comiptsupport#domain .comTsupport#domain capital.comjoe.smith#domain .comwalter.johnson#domain .com
So we get all of the matches, but they're all on the same line. The easiest fix is to put the match in a postfix for loop so print() gets the matches one at a time:
$ cat /tmp/foo | perl -nle 'print for /\<(.+?)\>/g'
casperf#domain .com
iptsupport#domain .com
Tsupport#domain capital.com
joe.smith#domain .com
walter.johnson#domain .com

Try this:
cat /tmp/foo | perl -nle 'while (/<([^>]+)>/g){ print $1; }'
It should work. Tried it with
echo "Casperfoo DataOps <casperf#domain .com>;Booya Support <iptsupport#domain .com>; Tripped Support <Tsupport#domain capital.com>; Smith, Joe: IB Reference Data (NYK) <joe.smith#domain .com>; Johnson, Walter: IB Reference Data (PUN) <walter.johnson#domain .com>;" | perl -nle 'while (/<([^>]+)>/g){ print $1; }'
which gave me:
casperf#domain .com
iptsupport#domain .com
Tsupport#domain capital.com
joe.smith#domain .com
walter.johnson#domain .com

Related

sed not working with a single long line file [duplicate]

I'm trying to use sed to clean up lines of URLs to extract just the domain.
So from:
http://www.suepearson.co.uk/product/174/71/3816/
I want:
http://www.suepearson.co.uk/
(either with or without the trailing slash, it doesn't matter)
I have tried:
sed 's|\(http:\/\/.*?\/\).*|\1|'
and (escaping the non-greedy quantifier)
sed 's|\(http:\/\/.*\?\/\).*|\1|'
but I can not seem to get the non-greedy quantifier (?) to work, so it always ends up matching the whole string.
Neither basic nor extended Posix/GNU regex recognizes the non-greedy quantifier; you need a later regex. Fortunately, Perl regex for this context is pretty easy to get:
perl -pe 's|(http://.*?/).*|\1|'
In this specific case, you can get the job done without using a non-greedy regex.
Try this non-greedy regex [^/]* instead of .*?:
sed 's|\(http://[^/]*/\).*|\1|g'
With sed, I usually implement non-greedy search by searching for anything except the separator until the separator :
echo "http://www.suon.co.uk/product/1/7/3/" | sed -n 's;\(http://[^/]*\)/.*;\1;p'
Output:
http://www.suon.co.uk
this is:
don't output -n
search, match pattern, replace and print s/<pattern>/<replace>/p
use ; search command separator instead of / to make it easier to type so s;<pattern>;<replace>;p
remember match between brackets \( ... \), later accessible with \1,\2...
match http://
followed by anything in brackets [], [ab/] would mean either a or b or /
first ^ in [] means not, so followed by anything but the thing in the []
so [^/] means anything except / character
* is to repeat previous group so [^/]* means characters except /.
so far sed -n 's;\(http://[^/]*\) means search and remember http://followed by any characters except / and remember what you've found
we want to search untill the end of domain so stop on the next / so add another / at the end: sed -n 's;\(http://[^/]*\)/' but we want to match the rest of the line after the domain so add .*
now the match remembered in group 1 (\1) is the domain so replace matched line with stuff saved in group \1 and print: sed -n 's;\(http://[^/]*\)/.*;\1;p'
If you want to include backslash after the domain as well, then add one more backslash in the group to remember:
echo "http://www.suon.co.uk/product/1/7/3/" | sed -n 's;\(http://[^/]*/\).*;\1;p'
output:
http://www.suon.co.uk/
Simulating lazy (un-greedy) quantifier in sed
And all other regex flavors!
Finding first occurrence of an expression:
POSIX ERE (using -r option)
Regex:
(EXPRESSION).*|.
Sed:
sed -r ‍'s/(EXPRESSION).*|./\1/g' # Global `g` modifier should be on
Example (finding first sequence of digits) Live demo:
$ sed -r 's/([0-9]+).*|./\1/g' <<< 'foo 12 bar 34'
12
How does it work?
This regex benefits from an alternation |. At each position engine tries to pick the longest match (this is a POSIX standard which is followed by couple of other engines as well) which means it goes with . until a match is found for ([0-9]+).*. But order is important too.
Since global flag is set, engine tries to continue matching character by character up to the end of input string or our target. As soon as the first and only capturing group of left side of alternation is matched (EXPRESSION) rest of line is consumed immediately as well .*. We now hold our value in the first capturing group.
POSIX BRE
Regex:
\(\(\(EXPRESSION\).*\)*.\)*
Sed:
sed 's/\(\(\(EXPRESSION\).*\)*.\)*/\3/'
Example (finding first sequence of digits):
$ sed 's/\(\(\([0-9]\{1,\}\).*\)*.\)*/\3/' <<< 'foo 12 bar 34'
12
This one is like ERE version but with no alternation involved. That's all. At each single position engine tries to match a digit.
If it is found, other following digits are consumed and captured and the rest of line is matched immediately otherwise since * means
more or zero it skips over second capturing group \(\([0-9]\{1,\}\).*\)* and arrives at a dot . to match a single character and this process continues.
Finding first occurrence of a delimited expression:
This approach will match the very first occurrence of a string that is delimited. We can call it a block of string.
sed 's/\(END-DELIMITER-EXPRESSION\).*/\1/; \
s/\(\(START-DELIMITER-EXPRESSION.*\)*.\)*/\1/g'
Input string:
foobar start block #1 end barfoo start block #2 end
-EDE: end
-SDE: start
$ sed 's/\(end\).*/\1/; s/\(\(start.*\)*.\)*/\1/g'
Output:
start block #1 end
First regex \(end\).* matches and captures first end delimiter end and substitues all match with recent captured characters which
is the end delimiter. At this stage our output is: foobar start block #1 end.
Then the result is passed to second regex \(\(start.*\)*.\)* that is same as POSIX BRE version above. It matches a single character
if start delimiter start is not matched otherwise it matches and captures the start delimiter and matches the rest of characters.
Directly answering your question
Using approach #2 (delimited expression) you should select two appropriate expressions:
EDE: [^:/]\/
SDE: http:
Usage:
$ sed 's/\([^:/]\/\).*/\1/g; s/\(\(http:.*\)*.\)*/\1/' <<< 'http://www.suepearson.co.uk/product/174/71/3816/'
Output:
http://www.suepearson.co.uk/
Note: this will not work with identical delimiters.
sed does not support "non greedy" operator.
You have to use "[]" operator to exclude "/" from match.
sed 's,\(http://[^/]*\)/.*,\1,'
P.S. there is no need to backslash "/".
sed - non greedy matching by Christoph Sieghart
The trick to get non greedy matching in sed is to match all characters excluding the one that terminates the match. I know, a no-brainer, but I wasted precious minutes on it and shell scripts should be, after all, quick and easy. So in case somebody else might need it:
Greedy matching
% echo "<b>foo</b>bar" | sed 's/<.*>//g'
bar
Non greedy matching
% echo "<b>foo</b>bar" | sed 's/<[^>]*>//g'
foobar
Non-greedy solution for more than a single character
This thread is really old but I assume people still needs it.
Lets say you want to kill everything till the very first occurrence of HELLO. You cannot say [^HELLO]...
So a nice solution involves two steps, assuming that you can spare a unique word that you are not expecting in the input, say top_sekrit.
In this case we can:
s/HELLO/top_sekrit/ #will only replace the very first occurrence
s/.*top_sekrit// #kill everything till end of the first HELLO
Of course, with a simpler input you could use a smaller word, or maybe even a single character.
HTH!
This can be done using cut:
echo "http://www.suepearson.co.uk/product/174/71/3816/" | cut -d'/' -f1-3
another way, not using regex, is to use fields/delimiter method eg
string="http://www.suepearson.co.uk/product/174/71/3816/"
echo $string | awk -F"/" '{print $1,$2,$3}' OFS="/"
sed certainly has its place but this not not one of them !
As Dee has pointed out: Just use cut. It is far simpler and much more safe in this case. Here's an example where we extract various components from the URL using Bash syntax:
url="http://www.suepearson.co.uk/product/174/71/3816/"
protocol=$(echo "$url" | cut -d':' -f1)
host=$(echo "$url" | cut -d'/' -f3)
urlhost=$(echo "$url" | cut -d'/' -f1-3)
urlpath=$(echo "$url" | cut -d'/' -f4-)
gives you:
protocol = "http"
host = "www.suepearson.co.uk"
urlhost = "http://www.suepearson.co.uk"
urlpath = "product/174/71/3816/"
As you can see this is a lot more flexible approach.
(all credit to Dee)
sed 's|(http:\/\/[^\/]+\/).*|\1|'
There is still hope to solve this using pure (GNU) sed. Despite this is not a generic solution in some cases you can use "loops" to eliminate all the unnecessary parts of the string like this:
sed -r -e ":loop" -e 's|(http://.+)/.*|\1|' -e "t loop"
-r: Use extended regex (for + and unescaped parenthesis)
":loop": Define a new label named "loop"
-e: add commands to sed
"t loop": Jump back to label "loop" if there was a successful substitution
The only problem here is it will also cut the last separator character ('/'), but if you really need it you can still simply put it back after the "loop" finished, just append this additional command at the end of the previous command line:
-e "s,$,/,"
sed -E interprets regular expressions as extended (modern) regular expressions
Update: -E on MacOS X, -r in GNU sed.
Because you specifically stated you're trying to use sed (instead of perl, cut, etc.), try grouping. This circumvents the non-greedy identifier potentially not being recognized. The first group is the protocol (i.e. 'http://', 'https://', 'tcp://', etc). The second group is the domain:
echo "http://www.suon.co.uk/product/1/7/3/" | sed "s|^\(.*//\)\([^/]*\).*$|\1\2|"
If you're not familiar with grouping, start here.
I realize this is an old entry, but someone may find it useful.
As the full domain name may not exceed a total length of 253 characters replace .* with .\{1, 255\}
This is how to robustly do non-greedy matching of multi-character strings using sed. Lets say you want to change every foo...bar to <foo...bar> so for example this input:
$ cat file
ABC foo DEF bar GHI foo KLM bar NOP foo QRS bar TUV
should become this output:
ABC <foo DEF bar> GHI <foo KLM bar> NOP <foo QRS bar> TUV
To do that you convert foo and bar to individual characters and then use the negation of those characters between them:
$ sed 's/#/#A/g; s/{/#B/g; s/}/#C/g; s/foo/{/g; s/bar/}/g; s/{[^{}]*}/<&>/g; s/}/bar/g; s/{/foo/g; s/#C/}/g; s/#B/{/g; s/#A/#/g' file
ABC <foo DEF bar> GHI <foo KLM bar> NOP <foo QRS bar> TUV
In the above:
s/#/#A/g; s/{/#B/g; s/}/#C/g is converting { and } to placeholder strings that cannot exist in the input so those chars then are available to convert foo and bar to.
s/foo/{/g; s/bar/}/g is converting foo and bar to { and } respectively
s/{[^{}]*}/<&>/g is performing the op we want - converting foo...bar to <foo...bar>
s/}/bar/g; s/{/foo/g is converting { and } back to foo and bar.
s/#C/}/g; s/#B/{/g; s/#A/#/g is converting the placeholder strings back to their original characters.
Note that the above does not rely on any particular string not being present in the input as it manufactures such strings in the first step, nor does it care which occurrence of any particular regexp you want to match since you can use {[^{}]*} as many times as necessary in the expression to isolate the actual match you want and/or with seds numeric match operator, e.g. to only replace the 2nd occurrence:
$ sed 's/#/#A/g; s/{/#B/g; s/}/#C/g; s/foo/{/g; s/bar/}/g; s/{[^{}]*}/<&>/2; s/}/bar/g; s/{/foo/g; s/#C/}/g; s/#B/{/g; s/#A/#/g' file
ABC foo DEF bar GHI <foo KLM bar> NOP foo QRS bar TUV
Have not yet seen this answer, so here's how you can do this with vi or vim:
vi -c '%s/\(http:\/\/.\{-}\/\).*/\1/ge | wq' file &>/dev/null
This runs the vi :%s substitution globally (the trailing g), refrains from raising an error if the pattern is not found (e), then saves the resulting changes to disk and quits. The &>/dev/null prevents the GUI from briefly flashing on screen, which can be annoying.
I like using vi sometimes for super complicated regexes, because (1) perl is dead dying, (2) vim has a very advanced regex engine, and (3) I'm already intimately familiar with vi regexes in my day-to-day usage editing documents.
Since PCRE is also tagged here, we could use GNU grep by using non-lazy match in regex .*? which will match first nearest match opposite of .*(which is really greedy and goes till last occurrence of match).
grep -oP '^http[s]?:\/\/.*?/' Input_file
Explanation: using grep's oP options here where -P is responsible for enabling PCRE regex here. In main program of grep mentioning regex which is matching starting http/https followed by :// till next occurrence of / since we have used .*? it will look for first / after (http/https://). It will print matched part only in line.
echo "/home/one/two/three/myfile.txt" | sed 's|\(.*\)/.*|\1|'
don bother, i got it on another forum :)
sed 's|\(http:\/\/www\.[a-z.0-9]*\/\).*|\1| works too
Here is something you can do with a two step approach and awk:
A=http://www.suepearson.co.uk/product/174/71/3816/
echo $A|awk '
{
var=gensub(///,"||",3,$0) ;
sub(/\|\|.*/,"",var);
print var
}'
Output:
http://www.suepearson.co.uk
Hope that helps!
Another sed version:
sed 's|/[:alnum:].*||' file.txt
It matches / followed by an alphanumeric character (so not another forward slash) as well as the rest of characters till the end of the line. Afterwards it replaces it with nothing (ie. deletes it.)
#Daniel H (concerning your comment on andcoz' answer, although long time ago): deleting trailing zeros works with
s,([[:digit:]]\.[[:digit:]]*[1-9])[0]*$,\1,g
it's about clearly defining the matching conditions ...
You should also think about the case where there is no matching delims. Do you want to output the line or not. My examples here do not output anything if there is no match.
You need prefix up to 3rd /, so select two times string of any length not containing / and following / and then string of any length not containing / and then match / following any string and then print selection. This idea works with any single char delims.
echo http://www.suepearson.co.uk/product/174/71/3816/ | \
sed -nr 's,(([^/]*/){2}[^/]*)/.*,\1,p'
Using sed commands you can do fast prefix dropping or delim selection, like:
echo 'aaa #cee: { "foo":" #cee: " }' | \
sed -r 't x;s/ #cee: /\n/;D;:x'
This is lot faster than eating char at a time.
Jump to label if successful match previously. Add \n at / before 1st delim. Remove up to first \n. If \n was added, jump to end and print.
If there is start and end delims, it is just easy to remove end delims until you reach the nth-2 element you want and then do D trick, remove after end delim, jump to delete if no match, remove before start delim and and print. This only works if start/end delims occur in pairs.
echo 'foobar start block #1 end barfoo start block #2 end bazfoo start block #3 end goo start block #4 end faa' | \
sed -r 't x;s/end//;s/end/\n/;D;:x;s/(end).*/\1/;T y;s/.*(start)/\1/;p;:y;d'
If you have access to gnu grep, then can utilize perl regex:
grep -Po '^https?://([^/]+)(?=)' <<< 'http://www.suepearson.co.uk/product/174/71/3816/'
http://www.suepearson.co.uk
Alternatively, to get everything after the domain use
grep -Po '^https?://([^/]+)\K.*' <<< 'http://www.suepearson.co.uk/product/174/71/3816/'
/product/174/71/3816/
The following solution works for matching / working with multiply present (chained; tandem; compound) HTML or other tags. For example, I wanted to edit HTML code to remove <span> tags, that appeared in tandem.
Issue: regular sed regex expressions greedily matched over all the tags from the first to the last.
Solution: non-greedy pattern matching (per discussions elsewhere in this thread; e.g. https://stackoverflow.com/a/46719361/1904943).
Example:
echo '<span>Will</span>This <span>remove</span>will <span>this.</span>remain.' | \
sed 's/<span>[^>]*>//g' ; echo
This will remain.
Explanation:
s/<span> : find <span>
[^>] : followed by anything that is not >
*> : until you find >
//g : replace any such strings present with nothing.
Addendum
I was trying to clean up URLs, but I was running into difficulty matching / excluding a word - href - using the approach above. I briefly looked at negative lookarounds (Regular expression to match a line that doesn't contain a word) but that approach seemed overly complex and did not provide a satisfactory solution.
I decided to replace href with ` (backtick), do the regex substitutions, then replace ` with href.
Example (formatted here for readability):
printf '\n
<a aaa h href="apple">apple</a>
<a bbb "c=ccc" href="banana">banana</a>
<a class="gtm-content-click"
data-vars-link-text="nope"
data-vars-click-url="https://blablabla"
data-vars-event-category="story"
data-vars-sub-category="story"
data-vars-item="in_content_link"
data-vars-link-text
href="https:example.com">Example.com</a>\n\n' |
sed 's/href/`/g ;
s/<a[^`]*`/\n<a href/g'
apple
banana
Example.com
Explanation: basically as above. Here,
s/href/` : replace href with ` (backtick)
s/<a : find start of URL
[^`] : followed by anything that is not ` (backtick)
*` : until you find a `
/<a href/g : replace each of those found with <a href
Unfortunately, as mentioned, this it is not supported in sed.
To overcome this, I suggest to use the next best thing(actually better even), to use vim sed-like capabilities.
define in .bash-profile
vimdo() { vim $2 --not-a-term -c "$1" -es +"w >> /dev/stdout" -cq! ; }
That will create headless vim to execute a command.
Now you can do for example:
echo $PATH | vimdo "%s_\c:[a-zA-Z0-9\\/]\{-}python[a-zA-Z0-9\\/]\{-}:__g" -
to filter out python in $PATH.
Use - to have input from pipe in vimdo.
While most of the syntax is the same. Vim features more advanced features, and using \{-} is standard for non-greedy match. see help regexp.

replace first occurence after a match on all lines for all matches using sed [duplicate]

I'm trying to use sed to clean up lines of URLs to extract just the domain.
So from:
http://www.suepearson.co.uk/product/174/71/3816/
I want:
http://www.suepearson.co.uk/
(either with or without the trailing slash, it doesn't matter)
I have tried:
sed 's|\(http:\/\/.*?\/\).*|\1|'
and (escaping the non-greedy quantifier)
sed 's|\(http:\/\/.*\?\/\).*|\1|'
but I can not seem to get the non-greedy quantifier (?) to work, so it always ends up matching the whole string.
Neither basic nor extended Posix/GNU regex recognizes the non-greedy quantifier; you need a later regex. Fortunately, Perl regex for this context is pretty easy to get:
perl -pe 's|(http://.*?/).*|\1|'
In this specific case, you can get the job done without using a non-greedy regex.
Try this non-greedy regex [^/]* instead of .*?:
sed 's|\(http://[^/]*/\).*|\1|g'
With sed, I usually implement non-greedy search by searching for anything except the separator until the separator :
echo "http://www.suon.co.uk/product/1/7/3/" | sed -n 's;\(http://[^/]*\)/.*;\1;p'
Output:
http://www.suon.co.uk
this is:
don't output -n
search, match pattern, replace and print s/<pattern>/<replace>/p
use ; search command separator instead of / to make it easier to type so s;<pattern>;<replace>;p
remember match between brackets \( ... \), later accessible with \1,\2...
match http://
followed by anything in brackets [], [ab/] would mean either a or b or /
first ^ in [] means not, so followed by anything but the thing in the []
so [^/] means anything except / character
* is to repeat previous group so [^/]* means characters except /.
so far sed -n 's;\(http://[^/]*\) means search and remember http://followed by any characters except / and remember what you've found
we want to search untill the end of domain so stop on the next / so add another / at the end: sed -n 's;\(http://[^/]*\)/' but we want to match the rest of the line after the domain so add .*
now the match remembered in group 1 (\1) is the domain so replace matched line with stuff saved in group \1 and print: sed -n 's;\(http://[^/]*\)/.*;\1;p'
If you want to include backslash after the domain as well, then add one more backslash in the group to remember:
echo "http://www.suon.co.uk/product/1/7/3/" | sed -n 's;\(http://[^/]*/\).*;\1;p'
output:
http://www.suon.co.uk/
Simulating lazy (un-greedy) quantifier in sed
And all other regex flavors!
Finding first occurrence of an expression:
POSIX ERE (using -r option)
Regex:
(EXPRESSION).*|.
Sed:
sed -r ‍'s/(EXPRESSION).*|./\1/g' # Global `g` modifier should be on
Example (finding first sequence of digits) Live demo:
$ sed -r 's/([0-9]+).*|./\1/g' <<< 'foo 12 bar 34'
12
How does it work?
This regex benefits from an alternation |. At each position engine tries to pick the longest match (this is a POSIX standard which is followed by couple of other engines as well) which means it goes with . until a match is found for ([0-9]+).*. But order is important too.
Since global flag is set, engine tries to continue matching character by character up to the end of input string or our target. As soon as the first and only capturing group of left side of alternation is matched (EXPRESSION) rest of line is consumed immediately as well .*. We now hold our value in the first capturing group.
POSIX BRE
Regex:
\(\(\(EXPRESSION\).*\)*.\)*
Sed:
sed 's/\(\(\(EXPRESSION\).*\)*.\)*/\3/'
Example (finding first sequence of digits):
$ sed 's/\(\(\([0-9]\{1,\}\).*\)*.\)*/\3/' <<< 'foo 12 bar 34'
12
This one is like ERE version but with no alternation involved. That's all. At each single position engine tries to match a digit.
If it is found, other following digits are consumed and captured and the rest of line is matched immediately otherwise since * means
more or zero it skips over second capturing group \(\([0-9]\{1,\}\).*\)* and arrives at a dot . to match a single character and this process continues.
Finding first occurrence of a delimited expression:
This approach will match the very first occurrence of a string that is delimited. We can call it a block of string.
sed 's/\(END-DELIMITER-EXPRESSION\).*/\1/; \
s/\(\(START-DELIMITER-EXPRESSION.*\)*.\)*/\1/g'
Input string:
foobar start block #1 end barfoo start block #2 end
-EDE: end
-SDE: start
$ sed 's/\(end\).*/\1/; s/\(\(start.*\)*.\)*/\1/g'
Output:
start block #1 end
First regex \(end\).* matches and captures first end delimiter end and substitues all match with recent captured characters which
is the end delimiter. At this stage our output is: foobar start block #1 end.
Then the result is passed to second regex \(\(start.*\)*.\)* that is same as POSIX BRE version above. It matches a single character
if start delimiter start is not matched otherwise it matches and captures the start delimiter and matches the rest of characters.
Directly answering your question
Using approach #2 (delimited expression) you should select two appropriate expressions:
EDE: [^:/]\/
SDE: http:
Usage:
$ sed 's/\([^:/]\/\).*/\1/g; s/\(\(http:.*\)*.\)*/\1/' <<< 'http://www.suepearson.co.uk/product/174/71/3816/'
Output:
http://www.suepearson.co.uk/
Note: this will not work with identical delimiters.
sed does not support "non greedy" operator.
You have to use "[]" operator to exclude "/" from match.
sed 's,\(http://[^/]*\)/.*,\1,'
P.S. there is no need to backslash "/".
sed - non greedy matching by Christoph Sieghart
The trick to get non greedy matching in sed is to match all characters excluding the one that terminates the match. I know, a no-brainer, but I wasted precious minutes on it and shell scripts should be, after all, quick and easy. So in case somebody else might need it:
Greedy matching
% echo "<b>foo</b>bar" | sed 's/<.*>//g'
bar
Non greedy matching
% echo "<b>foo</b>bar" | sed 's/<[^>]*>//g'
foobar
Non-greedy solution for more than a single character
This thread is really old but I assume people still needs it.
Lets say you want to kill everything till the very first occurrence of HELLO. You cannot say [^HELLO]...
So a nice solution involves two steps, assuming that you can spare a unique word that you are not expecting in the input, say top_sekrit.
In this case we can:
s/HELLO/top_sekrit/ #will only replace the very first occurrence
s/.*top_sekrit// #kill everything till end of the first HELLO
Of course, with a simpler input you could use a smaller word, or maybe even a single character.
HTH!
This can be done using cut:
echo "http://www.suepearson.co.uk/product/174/71/3816/" | cut -d'/' -f1-3
another way, not using regex, is to use fields/delimiter method eg
string="http://www.suepearson.co.uk/product/174/71/3816/"
echo $string | awk -F"/" '{print $1,$2,$3}' OFS="/"
sed certainly has its place but this not not one of them !
As Dee has pointed out: Just use cut. It is far simpler and much more safe in this case. Here's an example where we extract various components from the URL using Bash syntax:
url="http://www.suepearson.co.uk/product/174/71/3816/"
protocol=$(echo "$url" | cut -d':' -f1)
host=$(echo "$url" | cut -d'/' -f3)
urlhost=$(echo "$url" | cut -d'/' -f1-3)
urlpath=$(echo "$url" | cut -d'/' -f4-)
gives you:
protocol = "http"
host = "www.suepearson.co.uk"
urlhost = "http://www.suepearson.co.uk"
urlpath = "product/174/71/3816/"
As you can see this is a lot more flexible approach.
(all credit to Dee)
sed 's|(http:\/\/[^\/]+\/).*|\1|'
There is still hope to solve this using pure (GNU) sed. Despite this is not a generic solution in some cases you can use "loops" to eliminate all the unnecessary parts of the string like this:
sed -r -e ":loop" -e 's|(http://.+)/.*|\1|' -e "t loop"
-r: Use extended regex (for + and unescaped parenthesis)
":loop": Define a new label named "loop"
-e: add commands to sed
"t loop": Jump back to label "loop" if there was a successful substitution
The only problem here is it will also cut the last separator character ('/'), but if you really need it you can still simply put it back after the "loop" finished, just append this additional command at the end of the previous command line:
-e "s,$,/,"
sed -E interprets regular expressions as extended (modern) regular expressions
Update: -E on MacOS X, -r in GNU sed.
Because you specifically stated you're trying to use sed (instead of perl, cut, etc.), try grouping. This circumvents the non-greedy identifier potentially not being recognized. The first group is the protocol (i.e. 'http://', 'https://', 'tcp://', etc). The second group is the domain:
echo "http://www.suon.co.uk/product/1/7/3/" | sed "s|^\(.*//\)\([^/]*\).*$|\1\2|"
If you're not familiar with grouping, start here.
I realize this is an old entry, but someone may find it useful.
As the full domain name may not exceed a total length of 253 characters replace .* with .\{1, 255\}
This is how to robustly do non-greedy matching of multi-character strings using sed. Lets say you want to change every foo...bar to <foo...bar> so for example this input:
$ cat file
ABC foo DEF bar GHI foo KLM bar NOP foo QRS bar TUV
should become this output:
ABC <foo DEF bar> GHI <foo KLM bar> NOP <foo QRS bar> TUV
To do that you convert foo and bar to individual characters and then use the negation of those characters between them:
$ sed 's/#/#A/g; s/{/#B/g; s/}/#C/g; s/foo/{/g; s/bar/}/g; s/{[^{}]*}/<&>/g; s/}/bar/g; s/{/foo/g; s/#C/}/g; s/#B/{/g; s/#A/#/g' file
ABC <foo DEF bar> GHI <foo KLM bar> NOP <foo QRS bar> TUV
In the above:
s/#/#A/g; s/{/#B/g; s/}/#C/g is converting { and } to placeholder strings that cannot exist in the input so those chars then are available to convert foo and bar to.
s/foo/{/g; s/bar/}/g is converting foo and bar to { and } respectively
s/{[^{}]*}/<&>/g is performing the op we want - converting foo...bar to <foo...bar>
s/}/bar/g; s/{/foo/g is converting { and } back to foo and bar.
s/#C/}/g; s/#B/{/g; s/#A/#/g is converting the placeholder strings back to their original characters.
Note that the above does not rely on any particular string not being present in the input as it manufactures such strings in the first step, nor does it care which occurrence of any particular regexp you want to match since you can use {[^{}]*} as many times as necessary in the expression to isolate the actual match you want and/or with seds numeric match operator, e.g. to only replace the 2nd occurrence:
$ sed 's/#/#A/g; s/{/#B/g; s/}/#C/g; s/foo/{/g; s/bar/}/g; s/{[^{}]*}/<&>/2; s/}/bar/g; s/{/foo/g; s/#C/}/g; s/#B/{/g; s/#A/#/g' file
ABC foo DEF bar GHI <foo KLM bar> NOP foo QRS bar TUV
Have not yet seen this answer, so here's how you can do this with vi or vim:
vi -c '%s/\(http:\/\/.\{-}\/\).*/\1/ge | wq' file &>/dev/null
This runs the vi :%s substitution globally (the trailing g), refrains from raising an error if the pattern is not found (e), then saves the resulting changes to disk and quits. The &>/dev/null prevents the GUI from briefly flashing on screen, which can be annoying.
I like using vi sometimes for super complicated regexes, because (1) perl is dead dying, (2) vim has a very advanced regex engine, and (3) I'm already intimately familiar with vi regexes in my day-to-day usage editing documents.
Since PCRE is also tagged here, we could use GNU grep by using non-lazy match in regex .*? which will match first nearest match opposite of .*(which is really greedy and goes till last occurrence of match).
grep -oP '^http[s]?:\/\/.*?/' Input_file
Explanation: using grep's oP options here where -P is responsible for enabling PCRE regex here. In main program of grep mentioning regex which is matching starting http/https followed by :// till next occurrence of / since we have used .*? it will look for first / after (http/https://). It will print matched part only in line.
echo "/home/one/two/three/myfile.txt" | sed 's|\(.*\)/.*|\1|'
don bother, i got it on another forum :)
sed 's|\(http:\/\/www\.[a-z.0-9]*\/\).*|\1| works too
Here is something you can do with a two step approach and awk:
A=http://www.suepearson.co.uk/product/174/71/3816/
echo $A|awk '
{
var=gensub(///,"||",3,$0) ;
sub(/\|\|.*/,"",var);
print var
}'
Output:
http://www.suepearson.co.uk
Hope that helps!
Another sed version:
sed 's|/[:alnum:].*||' file.txt
It matches / followed by an alphanumeric character (so not another forward slash) as well as the rest of characters till the end of the line. Afterwards it replaces it with nothing (ie. deletes it.)
#Daniel H (concerning your comment on andcoz' answer, although long time ago): deleting trailing zeros works with
s,([[:digit:]]\.[[:digit:]]*[1-9])[0]*$,\1,g
it's about clearly defining the matching conditions ...
You should also think about the case where there is no matching delims. Do you want to output the line or not. My examples here do not output anything if there is no match.
You need prefix up to 3rd /, so select two times string of any length not containing / and following / and then string of any length not containing / and then match / following any string and then print selection. This idea works with any single char delims.
echo http://www.suepearson.co.uk/product/174/71/3816/ | \
sed -nr 's,(([^/]*/){2}[^/]*)/.*,\1,p'
Using sed commands you can do fast prefix dropping or delim selection, like:
echo 'aaa #cee: { "foo":" #cee: " }' | \
sed -r 't x;s/ #cee: /\n/;D;:x'
This is lot faster than eating char at a time.
Jump to label if successful match previously. Add \n at / before 1st delim. Remove up to first \n. If \n was added, jump to end and print.
If there is start and end delims, it is just easy to remove end delims until you reach the nth-2 element you want and then do D trick, remove after end delim, jump to delete if no match, remove before start delim and and print. This only works if start/end delims occur in pairs.
echo 'foobar start block #1 end barfoo start block #2 end bazfoo start block #3 end goo start block #4 end faa' | \
sed -r 't x;s/end//;s/end/\n/;D;:x;s/(end).*/\1/;T y;s/.*(start)/\1/;p;:y;d'
If you have access to gnu grep, then can utilize perl regex:
grep -Po '^https?://([^/]+)(?=)' <<< 'http://www.suepearson.co.uk/product/174/71/3816/'
http://www.suepearson.co.uk
Alternatively, to get everything after the domain use
grep -Po '^https?://([^/]+)\K.*' <<< 'http://www.suepearson.co.uk/product/174/71/3816/'
/product/174/71/3816/
The following solution works for matching / working with multiply present (chained; tandem; compound) HTML or other tags. For example, I wanted to edit HTML code to remove <span> tags, that appeared in tandem.
Issue: regular sed regex expressions greedily matched over all the tags from the first to the last.
Solution: non-greedy pattern matching (per discussions elsewhere in this thread; e.g. https://stackoverflow.com/a/46719361/1904943).
Example:
echo '<span>Will</span>This <span>remove</span>will <span>this.</span>remain.' | \
sed 's/<span>[^>]*>//g' ; echo
This will remain.
Explanation:
s/<span> : find <span>
[^>] : followed by anything that is not >
*> : until you find >
//g : replace any such strings present with nothing.
Addendum
I was trying to clean up URLs, but I was running into difficulty matching / excluding a word - href - using the approach above. I briefly looked at negative lookarounds (Regular expression to match a line that doesn't contain a word) but that approach seemed overly complex and did not provide a satisfactory solution.
I decided to replace href with ` (backtick), do the regex substitutions, then replace ` with href.
Example (formatted here for readability):
printf '\n
<a aaa h href="apple">apple</a>
<a bbb "c=ccc" href="banana">banana</a>
<a class="gtm-content-click"
data-vars-link-text="nope"
data-vars-click-url="https://blablabla"
data-vars-event-category="story"
data-vars-sub-category="story"
data-vars-item="in_content_link"
data-vars-link-text
href="https:example.com">Example.com</a>\n\n' |
sed 's/href/`/g ;
s/<a[^`]*`/\n<a href/g'
apple
banana
Example.com
Explanation: basically as above. Here,
s/href/` : replace href with ` (backtick)
s/<a : find start of URL
[^`] : followed by anything that is not ` (backtick)
*` : until you find a `
/<a href/g : replace each of those found with <a href
Unfortunately, as mentioned, this it is not supported in sed.
To overcome this, I suggest to use the next best thing(actually better even), to use vim sed-like capabilities.
define in .bash-profile
vimdo() { vim $2 --not-a-term -c "$1" -es +"w >> /dev/stdout" -cq! ; }
That will create headless vim to execute a command.
Now you can do for example:
echo $PATH | vimdo "%s_\c:[a-zA-Z0-9\\/]\{-}python[a-zA-Z0-9\\/]\{-}:__g" -
to filter out python in $PATH.
Use - to have input from pipe in vimdo.
While most of the syntax is the same. Vim features more advanced features, and using \{-} is standard for non-greedy match. see help regexp.

Can grep or sed show only words that match multiple search patterns in a line?

I am wondering, if one can print the matched strings as it is in each line... using grep or sed?
TestCase1: File1 contains below text
The Sun
Thunder The Rain They say
They say The dance
If I use this command:
egrep -o 'The|They' File1
The output I get is:
The
The
They
They
The
But, my expected output should be as below:
The
The They
They The
I am aware that, In grep the option -o, --only-matching prints only the matched non-empty) parts of a matching line, with each such part on a separate output line.
Edit: Please also suggest, if one wants to have a filter with exact word match with multiple match strings
i.e. <The> and <They> exact word match? Space separated words simply.
TestCase2: File2 contains below text
The Sun
Thunder The Rain They say
They say The dance
They're dancing with them in the dorm
The sun is shining the east and they scream.
Output is:
The
The They
They the
the
The the they
How to approach this?
With GNU awk for FPAT:
$ awk -v FPAT='\\<[Tt]hey?\\>' '{$1=$1}1' file
The
The They
They The
They the
The the they
Note that that can't NOT identify They when it appears in They're. If that's really an issue and you want to look for space-separated complete strings then this might be what you want:
$ awk '{c=0; for (i=1;i<=NF;i++) if ($i ~ /^[Tt]hey?$/) printf "%s%s", (c++?OFS:""), $i; print ""}' file
The
The They
They The
the
The the they
If not, let us know.
The above was run against this iteration of the OPs posted sample input:
$ cat file
The Sun
Thunder The Rain They say
They say The dance
They're dancing with them in the dorm
The sun is shining the east and they scream.
Best do it with Perl:
~$ perl -nE 'say /They? /g' File1
The
The They
They The
EDIT : Add new conditions. The regex still matches all but the lowercase the. Adding the i flag makes the match case-insensitive and matches all your test strings.
$ perl -nE 'say /They? /ig' File1
The
The They
They The
the
The the they
There is a little bit of a trick here: the match also picks up the space after the ? and prints it in the output. E.g. the first line of output is realy: "The_\n" - where "_" = space character. This may or may not be acceptable. One way to remove the spaces and reassemble the string would be:
$ perl -nE 'say join " ", map {substr $_,0,-1} /They? /ig' File1
As to your question about matching full words <The> and <They>, as you put it, the ? in They? indicates that the 'y' is optional. I.e. matches 0 or 1 times. Therefore the pattern is considering 'The' and 'They' as full words, one or the other, followed by a space. You could rewrite the pattern as:
$ perl -nE 'say /(?:They|The) /ig' File1
And effect the same output.
Now that you are considering lowercase the you may run into more edge case "gotchas" like words that end in "the". "loathe" and "tythe" come to mind.
$ echo "I'm loathe to cringe and tythe socks" >> File1
$ perl -nE 'say /They? /ig' File1
The
The They
They The
the
The the they
the the <--- not wanted!
You can then add the \b test in to match on word boundaries (as in zdim's answer):
$ perl -nE 'say /\bThey? /ig' File1
The
The They
They The
the
The the they
<-- But you get this empty line where no match occurs
So to refine further, you could only print if the line matches. Like this:
$ perl -nE 'say /\bThey? /ig if /\bThey? /i' File1
The
The They
They The
the
The the they
Then, I'm sure, you can find more edge cases that will blow it all up and force further refinement.
Things are not fully specified so here are a couple of possibilities
To catch all words starting with The, and print them with a space in between
perl -wnE'say join " ", /\bThe\w*/g' file
where \b is a word-boundary, a zero-width anchor, and \w is a word character. Using \S (a non-space character) is yet more permissive.
For only The or They can instead use
perl -wnE'say join " ", /\bThey?\b/g' file
where y? makes y optional.
To allow the as well use [tT] instead of T in the pattern, or /i for either case for all chars.
It's been clarified in coments that punctuation after The|They isn't allowed, and that low case t is. Then we need to constrain the match by space, not word boundary, and use [tT] as mentioned
perl -wnE'say join " ", /\b([Tt]hey?)\s/g' file
Now the capturing parenthesis () are needed since \s does consume, unlike \b before.
This prints the desired output with the provided input.
awk to the rescue!
$ awk -v p="They?" '$0~p{for(i=1;i<=NF;i++) if($i~p) printf "%s",$i OFS; print ""}' file
The
The They
They The
try one more awk:
awk '{while(match($0,/The|They/)){string=substr($0,RSTART,RLENGTH);VAL=VAL?VAL OFS string:string;$0=substr($0,RSTART+RLENGTH+1);};print VAL;VAL=""}' Input_file
NON-ONE line form of solution as follows too.
awk '{
while(match($0,/The|They/)){
string=substr($0,RSTART,RLENGTH);
VAL=VAL?VAL OFS string:string;
$0=substr($0,RSTART+RLENGTH+1);
};
print VAL;
VAL=""
}
' Input_file
Will add the explanation shortly for same.

How to use sed-awk-gawk to display a matched string

I've got a file called 'res' that's 29374 characters of http data in a one-line string. Inside it, there are several http links, but I only want to be display those that end in '/idNNNNNNNNN' where N is a digit. In fact I'm only interested in the string 'idNNNNNNNNN'.
I've tried with:
cat res | sed -n '0,/.*\(id[0-9]*\).*/s//\1/p'
but I get the whole file.
Do you know a way to do it?
perl -n -E 'say $1 while m!/id(\d{9})!g' input-file
should work. That assumes exactly 9 digits; that's the {9} in the above. You can match 8 or 9 ({8,9}), 8 or more ({8,}), up to 9 ({0,9}), etc.
Example of this working:
$ echo -n 'junk jumk http://foo/id231313 junk lalala http://bar/id23123 asda' | perl -n -E 'say $1 while m!id(\d{0,9})!g'
231313
23123
That's with the 0 to 9 variant, of course.
If you're stuck with a pre-5.10 perl, use -e instead of -E and print "$1\n" instead of say $1.
How it works
First is the two command-line arguments to Perl. -n tells Perl to read input from standard input or files given on the command line, line by line, setting $_ to each line. $_ is perl's default target for a lot of things, including regular expression matches. -E merely tells Perl that the next argument is a Perl one-liner, using the new language features (vs. -e which does not use the 5.10 extensions).
So, looking at the one liner: say means to print out some value, followed by a newline. $1 is the first regular expression capture (captures are made by parentheses in regular expressions). while is a looping construct, which you're probably familiar with. m is the match operator, the ! after it is the regular expression delimiter (normally, you see / here, but since the pattern contains / it's easier to use something else, so you don't have to escape the / as \/). /id(\d{9}) is the regular expression to match. Keep in mind that the delimiter is !, so the / is not special, it just matches a literal /. The parentheses form a capture group, so $1 will be the number. The ! is the delimiter, followed by g which means to match as many times as possible (as opposed to once). This is what makes it pick up all the URLs in the line, not just the first. As long as there is a match, the m operator will return a true value, so the loop will continue (and run that say $1, printing out the match).
Two-sed solution
I think this is one way to do this with only sed. Much more complicated!
echo 'junk jumk http://foo/id231313 junk lalala http://bar/id23123 asda' | \
sed 's!http://!\nhttp://!g' | \
sed 's!^.*/id\([0-9]*\).*$!\1!'
cat res | perl -ne 'chomp; print "$1\n" if m/\/(id\d*)/'
The trouble is that sed and grep and awk work on lines, and you've only got one line. So, you probably need to split things up so you have more than one line -- then you can make the normal tools work.
tr ':' '\012' < res |
sed -n 's%.*/\(id[0-9][0-9]*\).*%\1%p'
This takes advantage of URLs containing colons and maps colons to newlines with tr, then uses sed to pick up anything up to a slash, followed by id and one or more digits, followed by anything, and prints out the id and digit string (only). Since these only occur in URLs, they will only appear one per line and relatively near the start of the line too.
Here's a solution using only one invocation of sed:
sed -n 's| |\n|g;/^http/{s|http://[^/]*/id\([0-9]*\)|\1|;P};D' inputfile
Explanation:
s| |\n|g; - Divide and conquer
/^http/{ - If pattern space begins with "http"
s|http://[^/]*/id\([0-9]*\)|\1|; - capture the id
P - Print the string preceding the first newline
}; - end if
D - Delete the string preceding the first newline regardless of whether it contains "http"
Edit:
This version uses the same technique but is more selective.
sed -n 's|http://|\n&|g;/^\n*http/{s|\n*http://[^/]*/id\([0-9]*\)|\1\n|;P};D' inputfile

sed rare-delimiter (other than & | / ?...)

I am using the Unix sed command on a string that can contain all types of characters (&, |, !, /, ?, etc).
Is there a complex delimiter (with two characters?) that can fix the error:
sed: -e expression #1, char 22: unknown option to `s'
The characters in the input file are of no concern - sed parses them fine. There may be an issue, however, if you have most of the common characters in your pattern - or if your pattern may not be known beforehand.
At least on GNU sed, you can use a non-printable character that is highly improbable to exist in your pattern as a delimiter. For example, if your shell is Bash:
$ echo '|||' | sed s$'\001''|'$'\001''/'$'\001''g'
In this example, Bash replaces $'\001' with the character that has the octal value 001 - in ASCII it's the SOH character (start of heading).
Since such characters are control/non-printable characters, it's doubtful that they will exist in the pattern. Unless, that is, you are doing something weird like modifying binary files - or Unicode files without the proper locale settings.
Another way to do this is to use Shell Parameter Substitution.
${parameter/pattern/replace} # substitute replace for pattern once
or
${parameter//pattern/replace} # substitute replace for pattern everywhere
Here is a quite complex example that is difficult with sed:
$ parameter="Common sed delimiters: [sed-del]"
$ pattern="\[sed-del\]"
$ replace="[/_%:\\#]"
$ echo "${parameter//$pattern/replace}"
result is:
Common sed delimiters: [/_%:\#]
However: This only work with bash parameters and not files where sed excel.
There is no such option for multi-character expression delimiters in sed, but I doubt
you need that. The delimiter character should not occur in the pattern, but if it appears in the string being processed, it's not a problem. And unless you're doing something extremely weird, there will always be some character that doesn't appear in your search pattern that can serve as a delimiter.
You need the nested delimiter facility that Perl offers. That allows to use stuff like matching, substituting, and transliterating without worrying about the delimiter being included in your contents. Since perl is a superset of sed, you should be able to use it for whatever you’re used sed for.
Consider this:
$ perl -nle 'print if /something/' inputs
Now if your something contains a slash, you have a problem. The way to fix this is to change delimiter, preferably to a bracketing one. So for example, you could having anything you like in the $WHATEVER shell variable (provided the backets are balanced), which gets interpolated by the shell before Perl is even called here:
$ perl -nle "print if m($WHATEVER)" /usr/share/dict/words
That works even if you have correctly nested parens in $WHATEVER. The four bracketing pairs which correctly nest like this in Perl are < >, ( ), [ ], and { }. They allow arbitrary contents that include the delimiter if that delimiter is balanced.
If it is not balanced, then do not use a delimiter at all. If the pattern is in a Perl variable, you don’t need to use the match operator provided you use the =~ operator, so:
$whatever = "some arbitrary string ( / # [ etc";
if ($line =~ $whatever) { ... }
With the help of Jim Lewis, I finally did a test before using sed :
if [ `echo $1 | grep '|'` ]; then
grep ".*$1.*:" $DB_FILE | sed "s#^.*$1*.*\(:\)## "
else
grep ".*$1.*:" $DB_FILE | sed "s|^.*$1*.*\(:\)|| "
fi
Thanks for help
Wow. I totally did not know that you could use any character as a delimiter.
At least half the time I use the sed and BREs its on paths, code snippets, junk characters, things like that. I end up with a bunch of horribly unreadable escapes which I'm not even sure won't die on some combination I didn't think of. But if you can exclude just some character class (or just one character even)
echo '#01Y $#1+!' | sed -e 'sa$#1+ashita' -e 'su#01YuHolyug'
> > > Holy shit!
That's so much easier.
Escaping the delimiter inline for BASH to parse is cumbersome and difficult to read (although the delimiter does need escaping for sed's benefit when it's first used, per-expression).
To pull together thkala's answer and user4401178's comment:
DELIM=$(echo -en "\001");
sed -n "\\${DELIM}${STARTING_SEARCH_TERM}${DELIM},\\${DELIM}${ENDING_SEARCH_TERM}${DELIM}p" "${FILE}"
This example returns all results starting from ${STARTING_SEARCH_TERM} until ${ENDING_SEARCH_TERM} that don't match the SOH (start of heading) character with ASCII code 001.
There's no universal separator, but it can be escaped by a backslash for sed to not treat it like separator (at least unless you choose a backslash character as separator).
Depending on the actual application, it might be handy to just escape those characters in both pattern and replacement.
If you're in a bash environment, you can use bash substitution to escape sed separator, like this:
safe_replace () {
sed "s/${1//\//\\\/}/${2//\//\\\/}/g"
}
It's pretty self-explanatory, except for the bizarre part.
Explanation to that:
${1//\//\\\/}
${ - bash expansion starts
1 - first positional argument - the pattern
// - bash pattern substitution pattern separator "replace-all" variant
\/ - literal slash
/ - bash pattern substitution replacement separator
\\ - literal backslash
\/ - literal slash
} - bash expansion ends
example use:
$ input="ka/pus/ta"
$ pattern="/pus/"
$ replacement="/re/"
$ safe_replace "$pattern" "$replacement" <<< "$input"
ka/re/ta