Find and replace string with sed

I need to do a multi-file find and replace with nothing (i.e. delete the match) using sed. I want to replace the line:
<meta name="keywords" content="there could be anything here">
With '' (nothing) in all files in and under the current dir.
I have got this so far:
sed -e 's/<meta name="keywords" content=".*>//g' myfile.html
But I suspect the greedy .* will match too much. How can I match against
<meta name="keywords" content="
and delete everything from that to the next
>
I also need to do it for all files in and under (recursively) the current directory.
Thanks in advance!

sed has a delete directive; try using:
sed -e '/<meta name="keywords"/d' myfile.html
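To apply that delete to all files in and under the current directory, the line delete can be combined with find. A minimal sketch on a throwaway tree (hypothetical file names; assumes GNU sed for in-place -i, on BSD/macOS sed use -i ''):

```shell
# Build a small test tree containing one matching file.
mkdir -p demo/sub
printf '<html>\n<meta name="keywords" content="anything here">\n</html>\n' > demo/sub/page.html

# Delete every line containing the meta tag, in place, in all files under demo/.
find demo -type f -name '*.html' -exec sed -i '/<meta name="keywords"/d' {} +
cat demo/sub/page.html
```

Using `{} +` instead of `{} \;` passes many files to one sed invocation, which is faster on large trees.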

Related

How to use sed command to replace HTML text with text from another file?

I have 2 html files.
index.html in a folder called "main"
other.html in a folder called "other"
index.html has the below code:
<h1>My first story</h1>
<p>The pen was blue.</p>
other.html has the below code:
<p>The pen was black.</p>
I want to replace the paragraph content in "index.html" with the entire content of other.html so that index.html will give out the below result:
<h1>My first story</h1>
<p>The pen was black.</p>
I tried using the terminal command cd main && sed -e "s_<p>The pen was blue.</p>_$(cd ../other && sed 's:/:\\/:g' other.html)_" index.html but I got the error message "bash: cd: main: No such file or directory". Even when I tried putting them in the same directory for the purpose of testing it did not work. But I need them in separate folders.
Update: I fixed the issue of accessing files in different directories by using the command sed -e "s_<p>The pen was blue.</p>_$(sed 's:/:\\/:g' other/other.html)_" main/index.html but now I get the error message "sed: -e expression #1, char 54: unterminated `s' command"
How do I make it copy the contents of the 2nd file into the 1st file?
Using the approach from "Is it possible to escape regex metacharacters reliably with sed" (the part about multi-line replacement strings), try the following:
IFS= read -d '' -r < <(sed -e ':a' -e '$!{N;ba' -e '}' -e 's/[&/\]/\\&/g; s/\n/\\&/g' other/other.html)
replaceEscaped=${REPLY%$'\n'}
sed -e "s_<p>The pen was blue.</p>_$replaceEscaped_" main/index.html

remove everything between two characters with sed

I'd like to remove all characters between two markers, including the markers themselves. For example:
<img src=\"/wp-content/uploads/9e580e68ed249dec8fc0e668da78d170.jpg\" / hspace=\"5\" vspace=\"0\" align=\"left\">
I was trying
sed -i -e 's/<img src.*align=\\"left\\">//g' file
You do not say what version of sed you are using, or what shell.
With GNU sed and bash, your attempt was almost there. Try:
sed -i 's/<img src[^>]*align=\\"left\\">//g' file
Explanation:
s/<img src[^>]*align=\\"left\\">/ search for <img src_STUFF_align=\"left\">, where _STUFF_ cannot contain any >
// and replace it with nothing
/g and continue
-i and modify the file
I believe this should work with most versions of sed (except for the -i).
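For seds without -i, the usual workaround is to redirect to a temporary file and move it back. A sketch using a hypothetical input file:

```shell
# Build a sample file with one matching line (the file literally contains \" sequences).
printf 'keep me\n<img src=\\"/a.jpg\\" / hspace=\\"5\\" align=\\"left\\">\n' > file

# Any POSIX sed can do this: write to a temp file, then replace the original.
tmp=$(mktemp)
sed 's/<img src[^>]*align=\\"left\\">//g' file > "$tmp" && mv "$tmp" file
cat file
```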

How to extract URL from html source with sed/awk or cut?

I am writing a script that will download an html page source as a file and then read the file and extract a specific URL that is located after a specific code. (it only has 1 occurrence)
Here is a sample that I need matched:
<img id="sample-image" class="photo" src="http://xxxx.com/some/ic/pic_1asda963_16x9.jpg"
The code preceding the URL will always be the same so I need to extract the part between:
<img id="sample-image" class="photo" src="
and the " after the URL.
I tried something with sed like this:
sed -n '\<img\ id=\"sample-image\"\ class=\"photo\"\ src=\",\"/p' test.txt
But it does not work. I would appreciate your suggestions, thanks a lot !
You can use grep like this :
grep -oP '<img\s+id="sample-image"\s+class="photo"\s+src="\K[^"]+' test.txt
or with sed :
sed -r 's/<img\s+id="sample-image"\s+class="photo"\s+src="([^"]+)"/\1/' test.txt
or with awk :
awk -F'"' '/<img\s+id="sample-image"/{print $6}' test.txt
If you have GNU grep then you can do something like:
grep -oP "(?<=src=\")[^\"]+(?=\")" test.txt
If you wish to use awk then the following would work:
awk -F\" '{print $(NF-1)}' test.txt
With sed as
echo "$string" | sed 's/.*<img.*src="\(.*\)".*/\1/'
A few things about the sed command you are using:
sed -n '\<img\ id=\"sample-image\"\ class=\"photo\"\ src=\",\"/p' test.txt
You don't need to escape the <, " or space. The single quotes prevents the shell from doing word splitting and other stuff on your sed expression.
You are essentially doing this: sed -n '/pattern/p' test.txt (except you seemed to be missing the opening /), which says "match this pattern, then print the lines which contain the match" — you are not actually extracting the URL.
This is minor, but you don't need to match class="photo", since the id already makes the HTML element unique (no two elements share the same id within the same HTML document).
Here's what I would do
sed -n 's/.*<img id="sample-image".*src="\([^"]*\)".*/\1/p' test.txt
The p flag tells sed to print the line where substitution (s) was performed.
\(pattern\) captures a subexpression which can be accessed via \1, \2, etc. on the right side of s///
The .* at the start of regex is in case there is something else preceding the <img> element on the line (you did mention you are parsing a HTML file)
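A quick way to verify this substitute-and-print approach on the sample line (saved in a hypothetical test.txt; note that [^"]* is used because BRE has no bare + operator):

```shell
cat > test.txt <<'EOF'
<img id="sample-image" class="photo" src="http://xxxx.com/some/ic/pic_1asda963_16x9.jpg"
EOF

# -n suppresses automatic printing; the trailing p prints only lines where
# the substitution fired, so the command emits just the captured URL.
sed -n 's/.*<img id="sample-image".*src="\([^"]*\)".*/\1/p' test.txt
```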

Using curlmirror.pl gives different outputs

Using http://curl.haxx.se/programs/curlmirror.txt [Edit: Current version at https://github.com/cudeso/tools/blob/master/curlmirror.txt ], I'm looking to download a website and check for changes between the newly downloaded website and one that I have downloaded previously. However when I download the same website sometimes the links on the website use relative paths, sometimes they use absolute paths, and that counts as a "change" even though the website did not change.
Usage: curlmirror.pl -l -d 3 -o someOutputFileDirectory/url http://url
Output 1: <td>LINK</td>
Output 2: <td>LINK</td>
Is there a way to convert all relative paths to absolute paths or the other way around? I just need to standardize the download so that these links do not appear as "changes"
UPDATED
I assume that the URL is placed in the $url variable. Then you can try something like below:
perl -pe 'BEGIN {$url="http://mymain.org"}
s!(\b(?:url|href)=")([^/]+)(")!$1$url/$2$3!gi' << XXX
<td>LINK</td>
<td>LINK</td>
<meta http-equiv="Refresh" content="0;URL="home">
XXX
Output:
<td>LINK</td>
<td>LINK</td>
<meta http-equiv="Refresh" content="0;URL="http://mymain.org/home">
It replaces every href="..." or url="..." (case-insensitive) pattern with href="$url/..." or url="$url/..." when the ... part does not contain a / character.
If the input is a file, you can replace these patterns in the file directly:
cat >tfile << XXX
<td>LINK</td>
<td>LINK</td>
<meta http-equiv="Refresh" content="0;URL="home">
XXX
cat tfile
perl -i -pe 'BEGIN {$url="http://mymain.org"}
s!(\b(?:url|href)=")([^/]+)(")!$1$url/$2$3!gi' tfile
echo "---"
cat tfile
Output:
<td>LINK</td>
<td>LINK</td>
<meta http-equiv="Refresh" content="0;URL="home">
---
<td>LINK</td>
<td>LINK</td>
<meta http-equiv="Refresh" content="0;URL="http://mymain.org/home">

Using sed to find and delete across multiple files recursively

How can I do a string match against, for example:
<meta name="keywords" content="
Then delete that whole line every time a match is found?
I'm looking to do this for all files in the current directory and below.
I'm also new to sed.
Try this command:
find . -type f -exec sed -i '/foobar/d' {} \;
Change foobar to what you search for.
In answer to the question: "How do I do x to all files recursively?", the answer is to use find. To use sed to delete a line, you can either use the non-portable -i, or simply write a script to redirect the stream. For example:
find . -type f -exec sh -c 'f=/tmp/t.$$;
sed "/<meta name=\"keywords\" content=\"/d" "$0" > "$f" && mv "$f" "$0"' {} \;