Using sed to remove urls with specific anchor text matches

I'm trying to parse some spam injection out of a MySQL export file, and for some reason this is not working:
sed 's|([^<]*Buy[^<]*)||g'
Which, imo, should match and remove:
Buy Generic Drugs Without Prescription
but for some reason it isn't. I can do it in Perl no problem, since that supports non-greedy matches, but it is so slow, and since I will probably have to do 7 or 8 passes to get all the different permutations, it would be much better if I could get sed to work instead.

Do not forget -r to enable extended regular expressions: sed -r 's|([^<]*Buy[^<]*)||g'. Or just remove the useless parentheses (which would have to be \( and \) without -r).
Are you sure that perl -p -e 's|[^<]*Buy[^<]*||g' is really slower?
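To see the parentheses problem in isolation, here is a quick check on a made-up line of input (the real dump would have HTML around the text):
$ echo 'xx Buy Generic Drugs Without Prescription yy' | sed 's|([^<]*Buy[^<]*)||g'
xx Buy Generic Drugs Without Prescription yy
$ echo 'xx Buy Generic Drugs Without Prescription yy' | sed 's|[^<]*Buy[^<]*||g'
The first command leaves the line untouched because ( and ) are literal characters in sed's basic regular expressions; the second, with the group removed, deletes the spam text and prints an empty line.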

Related

sed is ignoring some matches

I'm trying to replace all matches of http to https using backreference:
example test3.txt file:
http://stronka.wpblog.internal http://stronka.wpblog.internal
abc
jdfgijdf dfijog http://stronka.wpblog.internal dfgtdgrtg http://stronka.wpblog.internal/ sfdgth http://stronka.wpblog.internal/dupa drgfthj
ghj gjerioghj fhjdf http://stronka.wpblog.internal/
and when I run sed against the test3.txt file:
~# sed -r 's#http(://.*.wpblog.internal)#https\1#g' test3.txt
https://stronka.wpblog.internal http://stronka.wpblog.internal
abc
jdfgijdf dfijog https://stronka.wpblog.internal dfgtdgrtg https://stronka.wpblog.internal/ sfdgth http://stronka.wpblog.internal/dupa drgfthj
ghj gjerioghj fhjdf https://stronka.wpblog.internal/
On line 1 the second link remains unchanged, and on line 2 the third link remains unchanged. I'm lost: how can I tell sed to replace everything that matches?
Because the .* wildcard is greedy, i.e. it will consume as much as possible of the line.
The simplest solution by far is to not use a wildcard at all; then sed does precisely what you expect on the simple input you provided.
sed 's#http://#https://#g' test3.txt
(Nothing in this regex needs anything beyond bog-standard 1968 regex syntax, so the -r option, or its BSD/POSIX spelling -E, is not necessary or useful here.)
If for some reason you want a wildcard, use one which doesn't match across URL boundaries. In your example data, spaces seem to separate distinct URLs, so we can match greedily as many non-space characters as possible:
sed -r 's#http(://[^ ]*\.wpblog\.internal)#https\1#g' test3.txt
(Notice also how we use \. to match literal dots.)
Modern regex dialects like Perl's have non-greedy wildcards, but even then, it's better to use a regex which actually means what you want.
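For comparison only (not as a recommendation), a hypothetical non-greedy Perl equivalent would be:
perl -pe 's#http(://.*?\.wpblog\.internal)#https$1#g' test3.txt
The .*? stops at the first .wpblog.internal, so each URL is rewritten separately, but the explicit [^ ]* class above states the intent more clearly.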
Try the following:
sed -r 's/\bhttp\b/https/g'
\b is used to set boundaries around "http"
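A quick check on a made-up line (GNU sed; \b is not portable to every sed):
$ echo 'http://stronka.wpblog.internal/dupa and plain httpd here' | sed -r 's/\bhttp\b/https/g'
https://stronka.wpblog.internal/dupa and plain httpd here
The boundary keeps it from touching words like httpd, but note that any standalone http in ordinary prose would also be rewritten.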
Based on the replies I've replaced the greedy wildcard .*:
sed -E 's#http(://[a-zA-Z0-9.-]*\.wpblog\.internal)#https\1#g'
And it's working as it should now, thank you all!

Pipe Grep Results to Sed — Only do sed on results of grep

[Mac OS]
It seems that sed requires an input file, and that I cannot pipe grep to it. Although sed can match like grep does, making sed handle both the find and the replace can make the operation very complex.
For example, if I wanted to remove the 3rd word of every line that started with 'T', it's much more convenient to separate the find/replace commands than to create a complex regex.
Looking through SO answers, there doesn't seem to be an elegant solution where you can pipe grep to sed without new files being involved. I did find this, which almost does what I want:
sed -i "s/$(grep 'old' input.txt)/new/g" input.txt
But it doesn't handle multiple matches well.
I'll generalize:
Is there a better way to find specific lines in a text file and modify those lines in place? Preferably on the CLI, or as low-level as possible.
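For what it's worth, sed reads standard input when no file name is given, and an address restricts a command to matching lines, so the find and the replace do not need separate tools. A sketch under those assumptions (test.txt is a made-up file name; GNU sed syntax for -i is shown, macOS/BSD sed needs -i '' instead):
$ grep 'old' input.txt | sed 's/old/new/g'    # piping grep into sed works fine
$ sed -E -i '/^T/ s/^(([^ ]+ ){2})[^ ]+ ?/\1/' test.txt    # drop the 3rd word of lines starting with T, in place
The address /^T/ plays the role of grep, and only those lines are handed to the s command.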

Why is my sed multiline find-and-replace not working as expected?

I have a simple sed command that I am using to replace everything between (and including) //thistest.com-- and --thistest.com with nothing (remove the block altogether):
sudo sed -i "s#//thistest\.com--.*--thistest\.com##g" my.file
The contents of my.file are:
//thistest.com--
zone "awebsite.com" {
type master;
file "some.stuff.com.hosts";
};
//--thistest.com
As I am using # as my delimiter for the regex, I don't need to escape the / characters. I am also properly (I think) escaping the . in .com. So I don't see exactly what is failing.
Why isn't the entire block being replaced?
You have two problems:
Sed doesn't do multiline pattern matches—at least, not the way you're expecting it to. However, you can use multiline addresses as an alternative.
Depending on your version of sed, you may need to escape alternate delimiters, especially if you aren't using them solely as part of a substitution expression.
So, the following will work with your posted corpus in both GNU and BSD flavors:
sed '\#^//thistest\.com--#, \#^//--thistest\.com# d' /tmp/corpus
Note that in this version, we tell sed to match all lines between (and including) the two patterns. The opening delimiter of each address pattern is properly escaped. The command has also been changed to d for delete instead of s for substitute, and some whitespace was added for readability.
I've also chosen to anchor the address patterns to the start of each line. You may or may not find that helpful with this specific corpus, but it's generally wise to do so when you can, and doesn't seem to hurt your use case.
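If you want to modify the file in place, as in your original command, the same address range works with -i (GNU sed shown; BSD/macOS sed needs -i '' with an explicit empty suffix):
sudo sed -i '\#^//thistest\.com--#,\#^//--thistest\.com#d' my.file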
# separation by line with 1 s//
sed -n -e 'H;${x;s#^\(.\)\(.*\)\1//thistest.com--.*\1//--thistest.com#\2#;p}' YourFile
# separation by line with address pattern
sed -e '\#//thistest.com--#,\#//--thistest.com# d' YourFile
# separation only by char (could be CR, CR/LF, ";" or "oneline") with s//
sed -n -e '1h;1!H;${x;s#//thistest.com--.*--thistest.com##;p}' YourFile
Note:
the s// variants assume there is only one thistest section per file (if not, they remove everything between the first opening marker and the last closing marker)
the s// variants are not suited to huge files (they load the entire file into memory)
the address-range variant cannot select a section that starts and ends on the same line (it searches for a line matching the first pattern to start and a following line matching the second to stop), but it is very efficient on big files and/or multiple sections
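As a further aside, if GNU sed is available (an assumption), its -z option reads the whole file as a single NUL-delimited record, so the original one-line s/// can match across the embedded newlines:
sed -z 's#//thistest\.com--.*--thistest\.com##g' my.file
Like the other s/// variants, this loads the file into memory and, with multiple sections, removes everything from the first opening marker to the last closing one.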

capturing groups in sed

I have many lines of the form
ko04062 ko:CXCR3
ko04062 ko:CX3CR1
ko04062 ko:CCL3
ko04062 ko:CCL5
ko04080 ko:GZMA
and would dearly like to get rid of the ko: bit of the right-hand column. I'm trying to use sed, as follows:
echo "ko05414 ko:ITGA4" | sed 's/\(^ko\d{5}\)\tko:\(.*$\)/\1\2/'
which simply outputs the original string I echo'd. I'm very new to command line scripting, sed, pipes etc, so please don't be too angry if/when I'm doing something extremely dumb.
The main thing that is confusing me is that the same thing happens if I reverse the \1\2 bit to read \2\1 or just use one group. This, I guess, implies that I'm missing something about the mechanics of piping the output of echo into sed, or that my regexp is wrong or that I'm using sed wrong or that sed isn't printing the results of the substitution.
Any help would be greatly appreciated!
sed is outputting its input because the substitution isn't matching. Since you're probably using GNU sed, try this:
echo "ko05414 ko:ITGA4" | sed 's/\(^ko[0-9]\{5\}\)\tko:\(.*$\)/\1\2/'
\d -> [0-9] since GNU sed doesn't recognize \d
{} -> \{\} since GNU sed by default uses basic regular expressions.
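Equivalently, with extended regular expressions the braces and groups need no backslashes (GNU sed assumed, since \t is a GNU extension as well):
echo "ko05414 ko:ITGA4" | sed -E 's/(^ko[0-9]{5})\tko:(.*$)/\1\2/'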
This should do it. You can also skip capturing the last group and simply use \1 instead, but since you're learning sed and regex this is good stuff. I wanted to use a non-capturing group in the middle, (?:...), but I could not get that to work with sed; as it turns out, sed's regular expressions don't support non-capturing groups.
sed --posix 's/\(^ko[0-9]\{5\}\)\( ko:\)\(.*$\)/\1 \3/g' file > result
And of course you can use
sed --posix 's/ko://'
You don't need sed for this
Here is how you can do it with bash:
var="ko05414 ko:ITGA4"
echo ${var//"ko:"}
${var//"ko:"} replaces all "ko:" with ""
See Manipulating Strings for more info
@OP, if you just want to get rid of "ko:", then
$ cat file
ko04062 ko:CXCR3
ko04062 ko:CX3CR1
ko04062 ko:CCL3
ko04062 ko:CCL5
some text with a legit ko: this ko: will be deleted if you use gsub.
ko04080 ko:GZMA
$ awk '{sub("ko:","",$2)}1' file
ko04062 CXCR3
ko04062 CX3CR1
ko04062 CCL3
ko04062 CCL5
some text with a legit ko: this ko: will be deleted if you use gsub.
ko04080 GZMA
Just a note: while you can use pure bash string substitution, it's only more efficient when you are changing a single string. If you have a file, especially a big file, using bash's while-read loop is still slower than using sed or awk.
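For a sense of what is being compared, the per-line bash version of the same substitution would look roughly like this (file names assumed):
while IFS= read -r line; do
    printf '%s\n' "${line//ko:/}"
done < file > result
Each iteration uses only shell builtins, but the loop itself is interpreted line by line, which is why sed and awk still win on large files.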

Deleting multiline text from multiple files

I have a bunch of Java files from which I want to remove the Javadoc lines containing the license [I am changing it in my code].
The pattern I am looking for is
^\* \* ProjectName .* USA\.$
but matched across lines
Is there a way sed [or a commonly used editor in Windows/Linux] can do a search/replace for a multiline pattern?
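One common way to do this with sed is not a true multiline regex but a line-address range that is deleted wholesale. A sketch, assuming the license block starts on a line containing '* ProjectName' and ends on a line ending in 'USA.', and that GNU sed's -i is available:
sed -i '/\* ProjectName /,/ USA\.$/d' *.java
Adjust the two patterns to your real header and footer lines.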
Here's the appropriate reference point in my favorite sed tutorial.
Someone is probably still looking for such a solution from time to time. Here is one.
Use awk to find the lines to be removed. Then use diff to remove the lines and let sed clean up.
awk "/^\* \* ProjectName /,/ USA\.$/" input.txt \
| diff - input.txt \
| sed -n -e"s/^> //p" \
>output.txt
A warning note: if the first pattern exists but the second does not, you will lose all text below the first pattern, so check that first.
Yes. Are you using sed, awk, perl, or something else to solve this problem?
Most regular expression tools allow you to specify multi-line patterns. Just be careful with regular expressions that are too greedy, or they'll match the code between comments if it exists.
Here's an example:
/\*(?:.|[\r\n])*?\*/
perl -0777ne 'print m!/\*(?:.|[\r\n])*?\*/!g;' <file>
Prints out all the comments run together. The (?: notation must be used for non-capturing parentheses. / does not have to be escaped because ! delimits the expression. -0777 is used to enable slurp mode and -n enables automatic reading.
(From: http://ostermiller.org/findcomment.html )