Why is my sed multiline find-and-replace not working as expected? - sed

I have a simple sed command that I am using to replace everything between (and including) //thistest.com-- and --thistest.com with nothing (remove the block altogether):
sudo sed -i "s#//thistest\.com--.*--thistest\.com##g" my.file
The contents of my.file are:
//thistest.com--
zone "awebsite.com" {
type master;
file "some.stuff.com.hosts";
};
//--thistest.com
As I am using # as my delimiter for the regex, I don't need to escape the / characters. I am also properly (I think) escaping the . in .com. So I don't see exactly what is failing.
Why isn't the entire block being replaced?

You have two problems:
Sed doesn't do multiline pattern matches—at least, not the way you're expecting it to. However, you can use multiline addresses as an alternative.
Depending on your version of sed, you may need to escape alternate delimiters, especially if you aren't using them solely as part of a substitution expression.
So, the following will work with your posted corpus in both GNU and BSD flavors:
sed '\#^//thistest\.com--#, \#^//--thistest\.com# d' /tmp/corpus
Note that in this version, we tell sed to match all lines between (and including) the two patterns. The opening delimiter of each address pattern is properly escaped. The command has also been changed to d for delete instead of s for substitute, and some whitespace was added for readability.
I've also chosen to anchor the address patterns to the start of each line. You may or may not find that helpful with this specific corpus, but it's generally wise to do so when you can, and doesn't seem to hurt your use case.
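To see the address-range delete in action without touching a real file, here is a minimal sketch. The temporary file and the two extra lines around the block are made up for illustration; only the block itself comes from the question.

```shell
# Hypothetical sample: the question's block with one unrelated line on each side.
corpus=$(mktemp)
cat > "$corpus" <<'EOF'
options { directory "/var/named"; };
//thistest.com--
zone "awebsite.com" {
type master;
file "some.stuff.com.hosts";
};
//--thistest.com
zone "other.com" { type master; };
EOF

# \#...# uses # as the address delimiter; the leading backslash is required.
# d deletes every line from the opening marker through the closing marker.
result=$(sed '\#^//thistest\.com--#,\#^//--thistest\.com#d' "$corpus")
printf '%s\n' "$result"
rm -f "$corpus"
```

Only the two lines outside the markers should be printed. Once the output looks right, adding -i (GNU) edits the file in place.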

# separation by line with 1 s//
sed -n -e 'H;${x;s#^\(.\)\(.*\)\1//thistest.com--.*\1//--thistest.com#\2#;p}' YourFile
# separation by line with address pattern
sed -e '\#//thistest.com--#,\#//--thistest.com# d' YourFile
# separation only by char (could be CR, CR/LF, ";" or "oneline") with s//
sed -n -e '1h;1!H;${x;s#//thistest.com--.*//--thistest.com##;p}' YourFile
Note:
The s// variants assume there is only one thistest section per file; if there are several, they remove everything from the first opening marker to the last closing marker.
The s// variants are not suited to huge files, because they load the entire file into memory.
The address-pattern variant cannot select a section that opens and closes on the same line: it looks for the first pattern to start and for the second pattern on a following line to stop. It is, however, very efficient on big files and on files with multiple sections.
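The difference between the two approaches shows up as soon as a file has more than one section. A rough sketch, using a made-up nine-line file with two marked sections (the keep1/keep2/keep3 lines are placeholders, not from the question):

```shell
# Hypothetical sample with two thistest sections and three lines to keep.
sample=$(mktemp)
printf '%s\n' 'keep1' '//thistest.com--' 'a' '//--thistest.com' \
              'keep2' '//thistest.com--' 'b' '//--thistest.com' 'keep3' > "$sample"

# Slurp the whole file into the hold space, then run one greedy s//.
# The .* runs from the first opening marker to the LAST closing marker.
slurped=$(sed -n '1h;1!H;${x;s#//thistest\.com--.*//--thistest\.com##;p;}' "$sample")

# Address range: each section is matched and deleted on its own.
ranged=$(sed '\#^//thistest\.com--#,\#^//--thistest\.com#d' "$sample")

printf 'slurp + s//:\n%s\n\naddress range:\n%s\n' "$slurped" "$ranged"
rm -f "$sample"
```

The slurp version loses keep2 and leaves an empty line behind, while the range version keeps all three lines.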

Related

sed is ignoring some matches

I'm trying to replace all matches of http to https using backreference:
example test3.txt file:
http://stronka.wpblog.internal http://stronka.wpblog.internal
abc
jdfgijdf dfijog http://stronka.wpblog.internal dfgtdgrtg http://stronka.wpblog.internal/ sfdgth http://stronka.wpblog.internal/dupa drgfthj
ghj gjerioghj fhjdf http://stronka.wpblog.internal/
and when I run sed against the test3.txt file:
~# sed -r 's#http(://.*.wpblog.internal)#https\1#g' test3.txt
https://stronka.wpblog.internal http://stronka.wpblog.internal
abc
jdfgijdf dfijog https://stronka.wpblog.internal dfgtdgrtg https://stronka.wpblog.internal/ sfdgth http://stronka.wpblog.internal/dupa drgfthj
ghj gjerioghj fhjdf https://stronka.wpblog.internal/
Line 1 second link remains unchanged, line 2 third link remains unchanged, I'm lost, how could I tell sed to replace everything that is matching?
Because the .* wildcard is greedy, i.e. it will consume as much as possible of the line.
The simplest solution by far is to not use a wildcard at all; then sed does precisely what you expect on the simple input you provided.
sed 's#http://#https://#g' test3.txt
(Nothing in this regex needs anything except bog-standard 1968 regex, so the -r option, or its BSD/POSIX equivalent -E, is not necessary or useful here.)
If for some reason you want a wildcard, use one which doesn't match across URL boundaries. In your example data, spaces seem to separate distinct URLs, so we can match greedily as many non-space characters as possible:
sed -r 's#http(://[^ ]*\.wpblog\.internal)#https\1#g' test3.txt
(Notice also how we use \. to match literal dots.)
Modern regex dialects like Perl's have non-greedy wildcards, but even then, it's better to use a regex which actually means what you want.
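A quick way to compare the greedy and the bounded pattern is to run both on the first line of the sample file. This is only a sketch; the variable names are arbitrary.

```shell
line='http://stronka.wpblog.internal http://stronka.wpblog.internal'

# Greedy: .* swallows everything up to the LAST .wpblog.internal on the line,
# so the whole line is a single match and only the first http is rewritten.
greedy=$(printf '%s\n' "$line" | sed -E 's#http(://.*\.wpblog\.internal)#https\1#g')

# Bounded: [^ ]* cannot cross a space, so each URL is matched separately.
bounded=$(printf '%s\n' "$line" | sed -E 's#http(://[^ ]*\.wpblog\.internal)#https\1#g')

printf 'greedy:  %s\nbounded: %s\n' "$greedy" "$bounded"
```

The greedy run changes only the first URL; the bounded run changes both.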
Try below:
sed -r 's/\bhttp\b/https/g'
\b is used to set boundaries around "http"
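One side effect of the word boundaries is that URLs which are already https are left alone, because the p in https is followed by another word character. A small sketch, assuming GNU sed (\b is a GNU extension) and using made-up example hosts:

```shell
# a.example and b.example are placeholder hosts, not from the question.
mixed='http://a.example https://b.example'

# \bhttp\b matches "http" only when it is a whole word, so "https" is skipped.
fixed=$(printf '%s\n' "$mixed" | sed -E 's/\bhttp\b/https/g')
printf '%s\n' "$fixed"
```

Running it a second time on its own output should change nothing.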
Based on the replies I've replaced the greedy wildcard .*:
sed -E 's#http(://[a-zA-Z0-9.-]*\.wpblog\.internal)#https\1#g'
And it's working as it should now, thank you all!

gnu sed remove portion of line after pattern match with special characters

The goal is to use sed to return only the url from each line of FF extension Mining Blocker which uses this format for its regex lines:
{"baseurl":"*://002.0x1f4b0.com/*", "suburl":"*://*/002.0x1f4b0.com/*"},
{"baseurl":"*://003.0x1f4b0.com/*", "suburl":"*://*/003.0x1f4b0.com/*"},
the result should be:
002.0x1f4b0.com
003.0x1f4b0.com
One way would be to keep everything after suburl":"*://*/ then remove each occurrence of /*"},
I found https://unix.stackexchange.com/questions/24140/return-only-the-portion-of-a-line-after-a-matching-pattern but the special characters are a problem.
this won't work:
sed -n -e s#^.*suburl":"*://*/##g hosts
Would someone please show me how to mark the 2 asterisks in the string so they are seen by regex as literal characters, not wildcards?
edit:
sed -n 's#.*://\*/\([^/]\+\)/.*#\1#p' hosts
doesn't work, unfortunately.
regarding character substitution, thanks for directing me to the references.
I reduced the searched-for string to //*/ and used ASCII character codes like this:
sed -n -e s#^.*\d047\d047\d042\d047##g hosts
Unfortunately, that didn't output any changes to the lines.
My assumptions are:
^.*something specifies everything up to and including the last occurrence of "something" in a line
sed -n -e s#search##g deletes (replace with nothing) "search" within a line
So, this line:
sed -n -e s#^.*\d047\d047\d042\d047##g hosts
Should output everything after //*/ in each line...except it doesn't.
What is incorrect with that line?
Regarding deleting everything including and after the first / AFTER that first operation, yes, that's wanted too.
This might work for you (GNU sed):
sed -n 's#.*://\*/\([^/]\+\)/.*#\1#p' file
Match greedily (the longest string that matches) all characters up to ://*/, followed by a group of characters (which will be referred to as \1) that do not match a /, followed by the rest of the line and replace it by the group \1.
N.B. the sed substitution delimiters are arbitrary, in this case chosen to be # so as to make pattern matching / easier. Also the character * on the left hand side of the substitution command may be interpreted as a meta character that means zero or more of the previous character/group and so is quoted \* so that it does not mistakenly exert this property. Finally, using the option -n turns off the usual printing of everything in the pattern space after all the sed commands have been executed. The p flag on the substitution command prints the pattern space following a successful substitution, therefore only URLs will appear in the output, or nothing.
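Here is that command run against the two lines from the question, as a sketch. The temporary file stands in for hosts. Note that the sed script is single-quoted: left unquoted, as in the attempts in the question, the shell removes the double quotes and backslashes before sed ever sees them, and -n without a p prints nothing at all.

```shell
# Stand-in for the hosts file, holding the two sample lines from the question.
hosts=$(mktemp)
cat > "$hosts" <<'EOF'
{"baseurl":"*://002.0x1f4b0.com/*", "suburl":"*://*/002.0x1f4b0.com/*"},
{"baseurl":"*://003.0x1f4b0.com/*", "suburl":"*://*/003.0x1f4b0.com/*"},
EOF

# \* is a literal asterisk; \([^/]\+\) captures the host up to the next slash.
# \+ is a GNU extension; [^/][^/]* is the portable spelling.
urls=$(sed -n 's#.*://\*/\([^/]\+\)/.*#\1#p' "$hosts")
printf '%s\n' "$urls"
rm -f "$hosts"
```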

Matching strings even if they start with white spaces in SED

I'm having issues matching strings even if they start with any number of white spaces. It's been very little time since I started using regular expressions, so I need some help
Here is an example. I have a file (file.txt) that contains two lines
#String1='Test One'
String1='Test Two'
I'm trying to change the value for the second line, without affecting line 1, so I used this:
sed -i "s|String1=.*$|String1='Test Three'|g"
This changes the values for both lines. How can I make sed change only the value of the second string?
Thank you
With gnu sed, you match spaces using \s, while other sed implementations usually work with the [[:space:]] character class. So, pick one of these:
sed 's/^\s*AWord/AnotherWord/'
sed 's/^[[:space:]]*AWord/AnotherWord/'
Since you're using -i, I assume GNU sed. Either way, you probably shouldn't retype your word, as that introduces the chance of a typo. I'd go with:
sed -i "s/^\(\s*String1=\).*/\1'New Value'/" file
Move the \s* outside of the parens if you don't want to preserve the leading whitespace.
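A dry run of that command (without -i) on a copy of the question's file, as a sketch. It assumes GNU sed for \s, and the second line has been given leading spaces here to show that they are preserved.

```shell
# Hypothetical copy of file.txt; the indentation on line 2 is added for the demo.
conf=$(mktemp)
printf '%s\n' "#String1='Test One'" "   String1='Test Two'" > "$conf"

# The group captures the optional indentation plus "String1=", and \1 puts it back.
# The commented line does not match because it starts with # rather than whitespace.
updated=$(sed "s/^\(\s*String1=\).*/\1'New Value'/" "$conf")
printf '%s\n' "$updated"
rm -f "$conf"
```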
There are a couple of solutions you could use to go about your problem
If you want to ignore lines that begin with a comment character such as '#' you could use something like this:
sed -i "/^\s*#/! s|String1=.*$|String1='Test Three'|g" file.txt
which will only operate on lines that do not match the regular expression (the trailing ! in /.../! negates the address): one that begins (^) with optional whitespace (\s*) followed by an octothorpe (#)
The other option is to include the characters before 'String' as part of the substitution. Doing it this way means you'll need to capture \(...\) the group to include it in the output with \1
sed -i "s|^\(\s*\)String1=.*$|\1String1='Test Four'|g" file.txt
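The negated-address form can be tried out without -i first. A sketch, assuming GNU sed for \s; the indented comment line is an extra made-up case to show that leading whitespace before # is covered too.

```shell
# Hypothetical file: two commented lines (one indented) and one live setting.
conf=$(mktemp)
printf '%s\n' "#String1='Test One'" "  # String1='Test Zero'" "String1='Test Two'" > "$conf"

# /^\s*#/! means: run the s command only on lines that are NOT comments.
updated=$(sed "/^\s*#/! s|String1=.*$|String1='Test Three'|" "$conf")
printf '%s\n' "$updated"
rm -f "$conf"
```

Only the last line should change.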
With GNU sed, try:
sed -i "s|^\s*String1=.*$|String1='Test Three'|" file
or
sed -i "/^\s*String1=/s/=.*/='Test Three'/" file
Using awk you could do:
awk '/String1/ && f++ {$2="Test Three"}1' FS=\' OFS=\' file
#String1='Test One'
String1='Test Three'
It skips the first line that matches String1, because f is still 0 (false) when f++ is evaluated there; every later match is changed.

Extract CentOS mirror domain names using sed

I'm trying to extract a list of CentOS domain names only from http://mirrorlist.centos.org/?release=6.4&arch=x86_64&repo=os
Stripping the "http://" or "ftp://" prefix and everything from the first "/" onward should leave only a list like:
yum.phx.singlehop.com
mirror.nyi.net
bay.uchicago.edu
centos.mirror.constant.com
mirror.teklinks.com
centos.mirror.netriplex.com
centos.someimage.com
mirror.sanctuaryhost.com
mirrors.cat.pdx.edu
mirrors.tummy.com
I searched stackoverflow for the sed method but I'm still having trouble.
I tried doing this with sed
curl "http://mirrorlist.centos.org/?release=6.4&arch=x86_64&repo=os" | sed '/:\/\//,/\//p'
but doesn't look like it is doing anything. Can you give me some advice?
Here you go:
curl "http://mirrorlist.centos.org/?release=6.4&arch=x86_64&repo=os" | sed -e 's?.*://??' -e 's?/.*??'
Your sed was completely wrong:
/x/,/y/ is a range. It selects multiple lines, from a line matching /x/ until a line matching /y/
The p command prints the selected range
Since all lines match both the start and end pattern you used, you effectively selected all lines. And, since sed echoes the input by default, the p command results in duplicated lines (all lines printed twice).
In my fix:
I used s??? instead of s/// because this way I didn't need to escape all the / in the patterns, so it's a bit more readable this way
I used two expressions with the -e flag:
s?.*://?? matches everything up to and including :// and replaces it with nothing
s?/.*?? matches everything from the first remaining / to the end of the line and replaces it with nothing
The two expressions are executed in the given order
In modern versions of sed you can omit -e and separate the two expressions with ;. I stick to using -e because it's more portable.
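Both points can be checked offline. The sketch below uses two made-up mirror URLs (the hosts come from the question's list, the paths are assumptions) instead of calling curl.

```shell
# Hypothetical stand-in for the curl output.
mirrors=$(mktemp)
cat > "$mirrors" <<'EOF'
http://yum.phx.singlehop.com/centos/6.4/os/x86_64/
ftp://mirror.nyi.net/centos/6.4/os/x86_64/
EOF

# The fix: strip through "://", then strip from the first remaining "/" onward.
domains=$(sed -e 's?.*://??' -e 's?/.*??' "$mirrors")

# The original attempt: a range plus p, with default printing still on,
# so every input line comes out twice. grep -c '' counts the output lines.
doubled=$(sed '/:\/\//,/\//p' "$mirrors" | grep -c '')

printf '%s\n' "$domains"
printf 'lines printed by the original attempt: %s\n' "$doubled"
rm -f "$mirrors"
```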

Deleting multiline text from multiple files

I have a bunch of java files from which I want to remove the javadoc lines with the license [am changing it on my code].
The pattern I am looking for is
^\* \* ProjectName .* USA\.$
but matched across lines
Is there a way sed [or a commonly used editor in Windows/Linux] can do a search/replace for a multiline pattern?
Here's the appropriate reference point in my favorite sed tutorial.
Probably someone is still looking for such a solution from time to time. Here is one.
Use awk to find the lines to be removed. Then use diff to remove the lines and let sed clean up.
awk "/^\* \* ProjectName /,/ USA\.$/" input.txt \
| diff - input.txt \
| sed -n -e"s/^> //p" \
>output.txt
A warning note: if the first pattern exists while the second does not, you will lose all text below the first pattern, so check that first.
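A small run of that pipeline, with the check from the warning added, might look like this. The four-line input is invented to match the pattern from the question literally; real license headers will look different.

```shell
# Hypothetical input: a two-line license block between two lines of code.
input=$(mktemp)
cat > "$input" <<'EOF'
package demo;
* * ProjectName is free software
* * distributed from Boston, MA 02110 USA.
class Demo {}
EOF

# Guard: only proceed if the closing pattern is present, otherwise awk's
# range would run to the end of the file and everything below would be lost.
if grep -q ' USA\.$' "$input"; then
  # awk prints the block, diff reports what the file has beyond it,
  # and sed keeps just those lines with the "> " prefix removed.
  cleaned=$(awk '/^\* \* ProjectName /,/ USA\.$/' "$input" \
    | diff - "$input" \
    | sed -n 's/^> //p')
  printf '%s\n' "$cleaned"
fi
rm -f "$input"
```

The script is single-quoted here, rather than double-quoted as above, so the shell leaves the backslashes and the $ alone.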
Yes. Are you using sed, awk, perl, or something else to solve this problem?
Most regular expression tools allow you to specify multi-line patterns. Just be careful with regular expressions that are too greedy, or they'll match the code between comments if it exists.
Here's an example:
/\*(?:.|[\r\n])*?\*/
perl -0777ne 'print m!/\*(?:.|[\r\n])*?\*/!g;' <file>
Prints out all the comments run together. The (?: notation must be used for non-capturing parentheses. / does not have to be escaped because ! delimits the expression. -0777 is used to enable slurp mode and -n enables automatic reading.
(From: http://ostermiller.org/findcomment.html )