Avoid wget appending index.html to links - wget

I am trying to make a static HTML copy of a WordPress site that I can upload somewhere else, like GitHub Pages.
I use this command:
Option 1:
wget -k -r -l 1000 -p -N -F -nH -P ./website http://example.com/website
It downloads the entire site, but my main issue is that it appends "index.html" to every single link. I understand the need for this to view the site locally, but it is not required on a static website host.
So is there a way to tell wget not to modify all the links by adding index.html to them?
For example, on the default WordPress "Hello world!" post, the link it creates ends in index.html.
Option 2:
Use a mirroring command with -k (convert links):
wget -E -m -p -F -nH -P ./website http://example.com/website
This way it does not append index.html to the links, and it retains the domain name.
But then it also crawls up to http://example.com and indexes everything there. I do not want that; I want /website to be the root (because it is a WordPress multisite). How do I fix this?
I also want it to rewrite the hostname instead of stripping it or keeping it, so links should go from http://example.com/website/ (the WordPress multisite) to http://example.org/. Is this possible, or do I need to run sed/awk on all the files after the download?

I faced a similar problem and solved it by post-processing the files with sed.
This replaces all occurrences of /index.html' with /'. Since a comment above indicates that a redirect occurs anyway if the trailing slash is missing, I added the slash =)
find ./ -type f -exec sed -i -e "s/\/index\.html'/\/\'/g" {} \;
And this monster replaces all occurrences of "index.html" or 'index.html' (or "index.html' or 'index.html" ...) with ".":
find ./ -type f -exec sed -i -e "s/['\\\"]index\.html['\\\"]/\\\".\\\"/g" {} \;
You can check what sed is doing with your matches, e.g. on index.html, with this command:
sed -n "s/['\\\"]index\.html['\\\"]/'\/'/p" index.html
Hope you find that useful.
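For the hostname rewrite asked about in the question, a similar post-processing pass should work. A minimal sketch, assuming the source prefix http://example.com/website and the target host http://example.org, and that only HTML/CSS/JS files need rewriting:
# rewrite the multisite prefix to the new root domain in every downloaded text file
find ./website -type f \( -name '*.html' -o -name '*.css' -o -name '*.js' \) \
  -exec sed -i 's|http://example\.com/website|http://example\.org|g' {} +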

Related

grep all Gentoo Stage3 links to the terminal

I want to display all links from https://www.gentoo.org/downloads/mirrors/ to the terminal.
Firstly, the script would wget the web page to a file called index.html, then a grep or sed command would simply display all the https://, http:// and ftp:// links to the terminal.
Can someone help me with this command? I know it's simple, but I'm not really familiar with either of these commands.
What I tried:
grep "<code>" index.html
Output:
<code>ftp://mirrors.tera-byte.com/pub/gentoo</code>
<code>http://gentoo.mirrors.tera-byte.com/</code>
<code>rsync://mirrors.tera-byte.com/gentoo</code>
How can I remove the empty spaces, tags and all unnecessary text after the link?
If you just want the domain link to remain, you can try this grep:
grep -Eo '[h|f]t*ps?://.[^<|>|"]*' index.html
This will display only the http, https and ftp matches.
If you need to match only within <code> blocks, this sed will work:
sed -En '/<code>/ {s|.*([h|f]t*ps?://.[^<|>|"]*).*|\1|p}' index.html
You can use grep with this pattern:
grep -Po "(?<=<code>)(https?|ftp)(.*)(?=<\/code>)" index.html
Top 3 lines of output:
ftp://mirrors.tera-byte.com/pub/gentoo
http://gentoo.mirrors.tera-byte.com/
ftp://mirror.csclub.uwaterloo.ca/gentoo-distfiles/
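If the goal is just to print the links without keeping an index.html file around, the fetch and the extraction can also be combined into one pipeline. A minimal sketch, assuming the mirror page from the question and a simplified version of the grep pattern above:
# fetch the page to stdout and extract every http/https/ftp link from it
wget -qO- https://www.gentoo.org/downloads/mirrors/ | grep -Eo '(https?|ftp)://[^<>"]*'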

wget --accept-regex for url path matching?

Is it possible to make wget download only the files whose URLs match a specific directory pattern?
For example, I want to download all the .dsc files from this site, http://ftp.us.debian.org/debian/pool/, and only from the subdirectory /libg/.
http://ftp.us.debian.org/debian/pool/main/libg/libgdsii/libgdsii_0.1+ds.1-1.dsc
http://ftp.us.debian.org/debian/pool/non-free/libg/libgeotiff-epsg/libgeotiff-epsg_1.4.0-1.dsc
...
I tried this command, but it is not working as expected.
wget -r -nH -np --accept-regex="\/libg\/.*\.dsc" -R index.htm* http://ftp.us.debian.org/debian/pool/
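A minimal sketch of one possible workaround, assuming the /main/libg/ and /non-free/libg/ paths shown above are the only subtrees of interest: start the recursion inside each libg/ directory and use a plain suffix accept rule, so no regex over the whole /pool/ tree is needed.
# recurse only inside each libg/ subtree and keep just the .dsc files
for section in main non-free; do
  wget -r -np -nH -A '*.dsc' "http://ftp.us.debian.org/debian/pool/$section/libg/"
done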

wget - only output redirect url but no download

I have a download link to a large file.
You need to be logged in to the site, so a cookie is used.
The download link redirects to another URL.
I'm able to download the file with wget, but I only want the output of the "real" direct download link.
wget does exactly this before starting the download:
Location: https://foo.com/bar.zip [following]
Is there a way to make wget stop there and not actually download the file?
The solutions I found recommend redirecting the output to /dev/null, but this would still download the file. What I want is wget following the redirects but not actually starting the download.
I couldn't find a way to do it with wget, but I found a way to do it with curl:
curl https://openlibrary.org/data/ol_dump_latest.txt.gz -s -L -I -o /dev/null -w '%{url_effective}'
This only requests the headers of the page with a HEAD request (-I) and sends them to /dev/null, so the file itself is never downloaded.
(src: https://stackoverflow.com/a/5300429/2317712 )
Going off of qqilihq's comment on the curl answer, this will first pick out the line starting with "Location:", then strip the "Location: " prefix from the beginning and the " [following]" from the end using awk. I'm not sure I would use this, as it looks like a small change in the wget output could make it blow up; I would use the curl answer myself.
wget --max-redirect=0 http://example.com/link-to-get-redirec-url-from 2>&1 | awk '/Location: /,// { print }' | awk '{print $2}'
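As another sketch, wget's spider mode may also work here: --spider checks the URL without saving anything, and -S prints the server response headers, so the redirect target can be read from the last Location header (the URL is the one from the curl answer; treat the exact output format as an assumption):
# follow the redirects without saving anything and print the final Location header
wget --spider -S 'https://openlibrary.org/data/ol_dump_latest.txt.gz' 2>&1 | awk '/Location: /{print $2}' | tail -n 1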

wget downloads only one index.html file instead of other some 500 html files

With wget I normally receive only one file, index.html. I enter the following command:
wget -e robots=off -r http://www.korpora.org/kant/aa03
which, alas, gives back only an index.html file.
The directory aa03 refers to volume 3 of Kant's works; there must be some 560 files (pages) or so in it. These pages are readable online, but they will not be downloaded. Any remedy?! THX
Following that link brings us to:
http://korpora.zim.uni-duisburg-essen.de/kant/aa03/
wget won't follow links that point to domains not specified by the user. Since korpora.zim.uni-duisburg-essen.de is not equal to korpora.org, wget will not follow the links on the index page.
To remedy this, use --span-hosts or -H. Note that -r combined with -H is a very dangerous combination: you can accidentally crawl the entire Internet, so you'll want to keep its scope very tightly focused. This command will do what you intended to do:
wget -e robots=off -rH -l inf -np -D korpora.org,korpora.zim.uni-duisburg-essen.de http://korpora.org/kant/aa03/index.html
(-np, or --no-parent, will limit the crawl to aa03/. -D will limit it to only those two domains. -l inf will crawl infinitely deep, constrained by -D and -np).

Remove string/script from all files (recursive)

One of my websites has been hacked; all the index.html and index.php files have been infected with a certain JavaScript snippet. I would like to have a Unix command to remove this script from all of the files.
The script is here: http://pastie.org/private/6osrvd5zhphe372gblrc6w
I am trying to figure this out with sed, but no luck so far.
Thanks!
sed -i 's/<script>.*<\/script>//' fileName
will remove the <script> tag and all of its content.
This works if you only have one <script> tag.
If you have more than one, anchor the pattern on the try keyword in the following way:
sed -i 's/<script>try.*<\/script>//' fileName
Edit
If you want to do it on all files in a recursive way, you can use a find command like this:
find . -name "index.html" -print | xargs sed -i 's/<script>try.*<\/script>//'
where . is the current directory
You can try this:
find src/ -name "index.html" -print | xargs sed -i 's/<script>try{document.body++}catch(dgsgsdg){zxc=12;ww=window;}if(zxc).*<\/script>//'
perl -pi -e 's/<script>.*<\/script>//g' index.html
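A minimal sketch that combines the ideas above to cover both infected file types in one recursive pass (the try... anchor is the one used in the answers; adjust it to match the actual injected script):
# strip the injected <script>try...</script> block from every index.html and index.php
find . -type f \( -name 'index.html' -o -name 'index.php' \) \
  -exec sed -i 's/<script>try.*<\/script>//g' {} +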