Using curlmirror.pl gives different outputs - perl

Using http://curl.haxx.se/programs/curlmirror.txt [Edit: Current version at https://github.com/cudeso/tools/blob/master/curlmirror.txt ], I'm looking to download a website and check for changes between the newly downloaded website and one that I have downloaded previously. However when I download the same website sometimes the links on the website use relative paths, sometimes they use absolute paths, and that counts as a "change" even though the website did not change.
Usage: curlmirror.pl -l -d 3 -o someOutputFileDirectory/url http://url
Output 1: <td>LINK</td>
Output 2: <td>LINK</td>
Is there a way to convert all relative paths to absolute paths or the other way around? I just need to standardize the download so that these links do not appear as "changes"

UPDATED
I assume that the base URL is stored in the $url variable. Then you can try something like the following:
perl -pe 'BEGIN {$url="http://mymain.org"}
s!(\b(?:url|href)=")([^"/]+)(")!$1$url/$2$3!gi' << XXX
<td>LINK</td>
<td>LINK</td>
<meta http-equiv="Refresh" content="0;URL="home">
XXX
Output:
<td>LINK</td>
<td>LINK</td>
<meta http-equiv="Refresh" content="0;URL="http://mymain.org/home">
It replaces every href="..." or url="..." pattern (case-insensitive) with href="$url/..." or url="$url/...", provided that ... contains no / character.
If the input is a file, you can replace these patterns in the file directly:
cat >tfile << XXX
<td>LINK</td>
<td>LINK</td>
<meta http-equiv="Refresh" content="0;URL="home">
XXX
cat tfile
perl -i -pe 'BEGIN {$url="http://mymain.org"}
s!(\b(?:url|href)=")([^"/]+)(")!$1$url/$2$3!gi' tfile
echo "---"
cat tfile
Output:
<td>LINK</td>
<td>LINK</td>
<meta http-equiv="Refresh" content="0;URL="home">
---
<td>LINK</td>
<td>LINK</td>
<meta http-equiv="Refresh" content="0;URL="http://mymain.org/home">

Related

How to use sed command to replace HTML text with text from another file?

I have 2 html files.
index.html in a folder called "main"
other.html in a folder called "other"
index.html has the below code:
<h1>My first story</h1>
<p>The pen was blue.</p>
other.html has the below code:
<p>The pen was black.</p>
I want to replace the paragraph content in "index.html" with the entire content of other.html so that index.html will give out the below result:
<h1>My first story</h1>
<p>The pen was black.</p>
I tried using the terminal command cd main && sed -e "s_<p>The pen was blue.</p>_$(cd ../other && sed 's:/:\\/:g' other.html)_" index.html but I got the error message "bash: cd: main: No such file or directory". Even when I tried putting them in the same directory for the purpose of testing it did not work. But I need them in separate folders.
Update: I fixed the issue of accessing files in different directories by using the command sed -e "s_<p>The pen was blue.</p>_$(sed 's:/:\\/:g' other/other.html)_" main/index.html but now I get the error message "sed: -e expression #1, char 54: unterminated `s' command"
How do I make it copy the contents of the 2nd file into the 1st file?
Using the multiline replacement string technique from "Is it possible to escape regex metacharacters reliably with sed", try the following:
IFS= read -d '' -r < <(sed -e ':a' -e '$!{N;ba' -e '}' -e 's/[&/\]/\\&/g; s/\n/\\&/g' other/other.html)
replaceEscaped=${REPLY%$'\n'}
sed -e "s/<p>The pen was blue\.<\/p>/${replaceEscaped}/" main/index.html
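If the paragraph sits on a line of its own, as in the sample, a different technique avoids escaping altogether: sed's r command reads the second file in, and d drops the matched line. This is a sketch of that alternative, not the escaped-replacement approach above; the temporary directory is only there to make the example self-contained.

```shell
#!/bin/sh
# Recreate the two sample files from the question in a scratch directory.
workdir=$(mktemp -d)
mkdir "$workdir/main" "$workdir/other"
printf '%s\n' '<h1>My first story</h1>' '<p>The pen was blue.</p>' > "$workdir/main/index.html"
printf '%s\n' '<p>The pen was black.</p>' > "$workdir/other/other.html"

# On the matching line: queue other.html for output (r), then delete the line (d).
# The r command takes the rest of its -e expression as the file name,
# so it gets an expression of its own.
result=$(sed -e '/<p>The pen was blue\.<\/p>/{' \
             -e "r $workdir/other/other.html" \
             -e 'd' \
             -e '}' "$workdir/main/index.html")
printf '%s\n' "$result"
rm -rf "$workdir"
```

Because the file content is never placed inside the sed script, characters such as /, & or newlines in other.html need no special treatment. Add -i (GNU sed) to change index.html in place.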

Extract data using grep/sed from html tag with special class/id

I need to grep info from a website, where it is stored like this:
<div class="name">Mark</div>
<div class="surname">John</div>
<div class="phone">8434</div>
etc.
Tried to grep it and parse it later with sed:
grep -o '<div class="name">.*</div>' | sed -e 's?<div class="name">?|?g'
But when I try to replace with sed -e 's?<\/div><div class="phone">?|?g', I get no result, and I would have to do the same thing for every class. I cannot simply delete all HTML tags (sed 's/<[^>]\+>//g'); I need to do it only for divs with these classes.
The output format should be like
|Mark|John|8434|
I need to do it with grep/sed
Using awk should do the job:
awk -F"[<>]" '{printf "%s|",$3}' file
Mark|John|8434|
If you need a new line at the end:
awk -F"[<>]" '{printf "%s|",$3} END {print ""}' file
It splits each line into fields separated by < or >, then prints the third field followed by |.
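To get the exact format asked for (a leading | and only the listed classes), the same field-splitting idea can be combined with a class filter. This is a sketch: the extra email line is a hypothetical entry added only to show that other divs are skipped.

```shell
#!/bin/sh
# Sample input from the question plus one hypothetical div that must be ignored.
workdir=$(mktemp -d)
printf '%s\n' \
    '<div class="name">Mark</div>' \
    '<div class="surname">John</div>' \
    '<div class="email">x@example.org</div>' \
    '<div class="phone">8434</div>' > "$workdir/file"

# Split on < or >, keep only the wanted classes, and prefix each value with |.
# The END block adds the closing | and the newline.
result=$(awk -F'[<>]' '/class="(name|surname|phone)"/ {printf "|%s", $3} END {print "|"}' "$workdir/file")
echo "$result"
rm -rf "$workdir"
```

For this input the script prints |Mark|John|8434|.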

sed command to find .css file inside link

I am using sed to read the .css file name (after "href=") from an HTML file. The command is as follows:
cssFiles=$(echo "$BODY" | sed -rn 's/<link\s.*href=\W(.*.css).*/\1/p')
But it does not work correctly. Sample input, actual output and expected output are given below. Where am I going wrong?
Sample input:
<link href="/css/default.css" rel="stylesheet" type="text/css" />
<link rel="stylesheet" type="text/css" href="js/flexslider/flexslider.css">
Sample output:
/css/default.css" rel="stylesheet" type="text/css
js/flexslider/flexslider.css
Expected output:
/css/default.css
js/flexslider/flexslider.css
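Try this. The greedy .* inside the capture group runs to the last .css on the line (here, the one in type="text/css"), so restrict the capture to characters other than the closing quote:
cssFiles=$(echo "$BODY" | sed -rn 's/<link\s.*href="([^"]*\.css)".*/\1/p')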
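A sed substitution yields at most one result per line. If several link tags might share a line, one option is to let grep -o isolate each tag first and then strip everything but the href value. This is a sketch that assumes double-quoted href attributes, as in the sample input.

```shell
#!/bin/sh
# Sample input from the question.
BODY='<link href="/css/default.css" rel="stylesheet" type="text/css" />
<link rel="stylesheet" type="text/css" href="js/flexslider/flexslider.css">'

# Step 1: grep -o prints each match separately, from "<link" up to the
#         closing quote of an href value ending in .css.
# Step 2: sed keeps only the text between href=" and that closing quote.
cssFiles=$(printf '%s\n' "$BODY" \
    | grep -oE '<link[^>]*href="[^"]*\.css"' \
    | sed -E 's/.*href="([^"]*)"$/\1/')
printf '%s\n' "$cssFiles"
```

For the sample input this prints /css/default.css and js/flexslider/flexslider.css, one per line.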

How to extract URL from html source with sed/awk or cut?

I am writing a script that will download an html page source as a file and then read the file and extract a specific URL that is located after a specific code. (it only has 1 occurrence)
Here is a sample that I need matched:
<img id="sample-image" class="photo" src="http://xxxx.com/some/ic/pic_1asda963_16x9.jpg"
The code preceding the URL will always be the same so I need to extract the part between:
<img id="sample-image" class="photo" src="
and the " after the URL.
I tried something with sed like this:
sed -n '\<img\ id=\"sample-image\"\ class=\"photo\"\ src=\",\"/p' test.txt
But it does not work. I would appreciate your suggestions, thanks a lot !
You can use grep like this :
grep -oP '<img\s+id="sample-image"\s+class="photo"\s+src="\K[^"]+' test.txt
or with sed :
sed -rn 's/.*<img\s+id="sample-image"\s+class="photo"\s+src="([^"]+)".*/\1/p' test.txt
or with awk :
awk -F'"' '/<img[[:space:]]+id="sample-image"/{print $6}' test.txt
If you have GNU grep then you can do something like:
grep -oP "(?<=src=\")[^\"]+(?=\")" test.txt
If you wish to use awk then the following would work:
awk -F\" '{print $(NF-1)}' test.txt
With sed:
echo "$string" | sed 's/.*<img.*src="\([^"]*\)".*/\1/'
A few things about the sed command you are using:
sed -n '\<img\ id=\"sample-image\"\ class=\"photo\"\ src=\",\"/p' test.txt
You don't need to escape the <, " or space. The single quotes prevent the shell from doing word splitting and other processing on your sed expression.
You are essentially doing sed -n '/pattern/p' test.txt (except that you seem to be missing the opening slash), which says "match this pattern, then print the line that contains the match"; you are not really extracting the URL.
This is minor, but you don't need to match class="photo", since the id already makes the HTML element unique (no two elements share the same id within the same HTML document).
Here's what I would do
sed -n 's/.*<img id="sample-image".*src="\([^"]*\)".*/\1/p' test.txt
The p flag tells sed to print the line where substitution (s) was performed.
\(pattern\) captures a subexpression which can be accessed via \1, \2, etc. on the right side of s///
The .* at the start of regex is in case there is something else preceding the <img> element on the line (you did mention you are parsing a HTML file)
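As a worked check, the substitution can be run against a small test file. This is a sketch: the first and third lines of the file are hypothetical additions, included only to show that lines without the wanted id produce no output.

```shell
#!/bin/sh
# test.txt with the sample line from the question plus two hypothetical lines.
workdir=$(mktemp -d)
printf '%s\n' \
    '<div class="wrapper">' \
    '<img id="sample-image" class="photo" src="http://xxxx.com/some/ic/pic_1asda963_16x9.jpg"' \
    '<img id="other-image" class="photo" src="http://xxxx.com/other.jpg"' > "$workdir/test.txt"

# -n suppresses default output; the p flag prints only lines where the
# substitution succeeded, which by then hold just the captured URL.
result=$(sed -n 's/.*<img id="sample-image".*src="\([^"]*\)".*/\1/p' "$workdir/test.txt")
echo "$result"
rm -rf "$workdir"
```

Only the URL of the element with id sample-image is printed.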

Find and replace string with sed

I need to do a multi-file find and replace with nothing (delete) using sed. I want to replace the line:
<meta name="keywords" content="there could be anything here">
With '' (nothing) in all files in and under the current dir.
I have got this so far:
sed -e 's/<meta name="keywords" content=".*>//g' myfile.html
But I know this is only going to remove the < or > tags. How can I match against
<meta name="keywords" content="
and delete everything from that to the next
>
I also need to do it for all files in and under (recursively) the current directory.
Thanks in advance!
sed has a delete command (d), which removes every line that matches the address. Try using:
sed -e '/<meta name="keywords"/d' myfile.html
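The command above handles a single file and writes to standard output. For the recursive part of the question, one option is to let find pass the files to sed. This is a sketch under assumptions: it relies on GNU sed's -i for in-place editing (BSD sed expects -i ''), it restricts itself to *.html files, and it runs in a scratch directory with hypothetical sample files rather than the current directory.

```shell
#!/bin/sh
# Scratch tree with two hypothetical HTML files, one of them in a subdirectory.
workdir=$(mktemp -d)
mkdir -p "$workdir/sub"
printf '%s\n' \
    '<head>' \
    '<meta name="keywords" content="there could be anything here">' \
    '<title>t</title>' \
    '</head>' > "$workdir/a.html"
cp "$workdir/a.html" "$workdir/sub/b.html"

# Delete the keywords line in every .html file at or below $workdir.
# [^>]* keeps the match inside the tag, up to its closing >.
find "$workdir" -type f -name '*.html' \
    -exec sed -i '/<meta name="keywords" content="[^>]*>/d' {} +

# Count files that still contain the tag, and the lines left in one file.
remaining=$(grep -rl '<meta name="keywords"' "$workdir" | wc -l)
lines=$(wc -l < "$workdir/a.html")
echo "files still containing the tag: $remaining"
rm -rf "$workdir"
```

To apply it to real files, replace "$workdir" with . and consider sed -i.bak first so that a backup of each file is kept.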